{"title": "Defending Neural Backdoors via Generative Distribution Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 14004, "page_last": 14013, "abstract": "Neural backdoor attack is emerging as a severe security threat to deep learning, while the capability of existing defense methods is limited, especially for complex backdoor triggers. In the work, we explore the space formed by the pixel values of all possible backdoor triggers. An original trigger used by an attacker to build the backdoored model represents only a point in the space. It then will be generalized into a distribution of valid triggers, all of which can influence the backdoored model. Thus, previous methods that model only one point of the trigger distribution is not sufficient. Getting the entire trigger distribution, e.g., via generative modeling, is a key of effective defense. However, existing generative modeling techniques for image generation are not applicable to the backdoor scenario as the trigger distribution is completely unknown. In this work, we propose max-entropy staircase approximator (MESA) for high-dimensional sampling-free generative modeling and use it to recover the trigger distribution. We also develop a defense technique to remove the triggers from the backdoored model. 
Our experiments on the Cifar10/100 datasets demonstrate the effectiveness of MESA in modeling the trigger distribution and the robustness of the proposed defense method.", "full_text": "Defending Neural Backdoors via Generative Distribution Modeling\n\nXiming Qiao*\nECE Department\nDuke University\nDurham, NC 27708\nximing.qiao@duke.edu\n\nYukun Yang*\nECE Department\nDuke University\nDurham, NC 27708\nyukun.yang@duke.edu\n\nHai Li\nECE Department\nDuke University\nDurham, NC 27708\nhai.li@duke.edu\n\nAbstract\n\nNeural backdoor attack is emerging as a severe security threat to deep learning, while the capability of existing defense methods is limited, especially for complex backdoor triggers. In this work, we explore the space formed by the pixel values of all possible backdoor triggers. An original trigger used by an attacker to build the backdoored model represents only a point in the space. It will then be generalized into a distribution of valid triggers, all of which can influence the backdoored model. Thus, previous methods that model only one point of the trigger distribution are not sufficient. Getting the entire trigger distribution, e.g., via generative modeling, is key to an effective defense. However, existing generative modeling techniques for image generation are not applicable to the backdoor scenario as the trigger distribution is completely unknown. In this work, we propose the max-entropy staircase approximator (MESA) for high-dimensional sampling-free generative modeling and use it to recover the trigger distribution. We also develop a defense technique to remove the triggers from the backdoored model. Our experiments on the Cifar10/100 datasets demonstrate the effectiveness of MESA in modeling the trigger distribution and the robustness of the proposed defense method.\n\n1 Introduction\n\nNeural backdoor attack [1] is emerging as a severe security threat to deep learning. 
As illustrated in Figure 1(a), such an attack consists of two stages: (1) Backdoor injection: through data poisoning, attackers train a backdoored model with a predefined backdoor trigger; (2) Backdoor triggering: when the trigger is applied to input images, the backdoored model outputs the target class identified by the trigger. Compared to adversarial attacks [2], which universally affect all deep learning models without data poisoning, access to the training process makes the backdoor attack more flexible. For example, a backdoor attack uses one trigger to manipulate the model's outputs on all inputs, while perturbation-based adversarial attacks [3] need to recalculate the perturbation for each input. Moreover, a backdoor trigger can be as small as a single pixel [4] or as ordinary as a pair of physical sunglasses [5], while adversarial patch attacks [6] often rely on large patches with vibrant colors. Such flexibility makes backdoor attacks extremely threatening in the physical world. Some recent successes include manipulating the results of stop sign detection [1] and face recognition [5].\nIn contrast to the high effectiveness of the attacks, the study of backdoor defenses falls far behind. Training-stage defense methods [4, 7] use outlier detection to find and then remove the poisoned training data, but neither of them can fix a backdoored model. The testing-stage defense [8] first employs pixel space optimization to reverse engineer a backdoor trigger (a.k.a. reversed trigger) from a backdoored model, and then fixes the model through retraining or pruning. The method\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n*: These authors contributed equally to this work\n\nFigure 1: Backdoor attacks and the generalization property of backdoor triggers.\n\nis effective when the reversed trigger is similar to the one used in the attack (a.k.a. 
original trigger). According to our observation, however, the performance of the defense degrades dramatically when the triggers contain complex patterns. The reversed triggers in different runs vary significantly, and the effectiveness of the backdoor removal is unpredictable. To the best of our knowledge, there is no apparent explanation of this phenomenon: why would the reversed triggers be so different?\nWe investigate this phenomenon by carrying out preliminary experiments on reverse engineering backdoor triggers. We attack a model trained on the Cifar10 [9] dataset with a single 3×3 trigger and repeat the reverse engineering process with different random seeds. Interestingly, we find that the reversed triggers form a continuous set in the pixel space of all possible 3×3 triggers. We denote this space as X and use the term valid trigger distribution for all the triggers that control the model's output with a positive probability. Figure 1(b) shows an example of an original trigger and its corresponding valid trigger distribution obtained from our backdoor modeling method in Section 3. Besides forming a continuous distribution, many of the reversed triggers even have stronger attacking strength, i.e., a higher attack success rate (ASR)1, than the original trigger. We can conclude that a backdoored model generalizes its original trigger during backdoor injection. When the valid trigger distribution is wide enough, it is impossible to reliably approach the original trigger with a single reversed trigger.\nA possible way to build a robust backdoor defense is to explicitly model the valid trigger distribution with a generative model: assuming the generative model can reverse engineer all the valid triggers, it is guaranteed to cover the original trigger and fix the model. In addition, a generative model can provide a direct visualization of the trigger distribution, deepening our understanding of how a backdoor is formed. 
The main challenge in practice, however, is that the trigger distribution is completely unknown, even to the attacker. Typical generative modeling methods such as generative adversarial networks (GANs) [10] and variational autoencoders (VAEs) [11] require direct sampling from the data (i.e., trigger) distribution, which is impossible in our situation. Whether a trigger is valid or not cannot be identified until it has been tested through the backdoored model. The high dimensionality of X makes brute-force testing or Markov chain Monte Carlo (MCMC)-based techniques impractical. Backdoor trigger modeling is thus a high-dimensional, sampling-free generative modeling problem. The solution must avoid any direct sampling from the unknown trigger distribution while providing sufficient scalability to generate high-dimensional complex triggers.\nTo cope with this challenge, we propose the max-entropy staircase approximator (MESA) algorithm. Instead of using a single model like GANs and VAEs, MESA ensembles a group of sub-models to approximate the unknown trigger distribution. Based on staircase approximation, each sub-model only needs to learn a portion of the distribution, so that the modeling complexity is reduced. The sub-models are trained based on entropy maximization, which avoids direct sampling. For high-dimensional trigger generation, we parameterize the sub-models as neural networks and adopt the mutual information neural estimator (MINE) [12]. Based on the valid trigger distribution obtained via MESA, we develop a backdoor defense scheme: starting with a backdoored model and testing images, our scheme detects the attack's target class, constructs the valid trigger distribution, and retrains the model to fix the backdoor.\nOur experimental results show that MESA can effectively reverse engineer the valid trigger distribution for various types of triggers and significantly improve the defense robustness. 
We exhaustively test 51 representative black-white triggers of 3 × 3 size on the Cifar10 dataset, and also random color triggers on the Cifar10/100 datasets. Our defense scheme based on the trigger distribution reliably reduces the ASR of original triggers from 92.3% ∼ 99.8% (before defense) to 1.2% ∼ 5.9% (after defense), while the ASR obtained from the baseline counterpart based on a single reversed trigger fluctuates between 2.4% ∼ 51.4%. Source code of the experiments is available at https://github.com/superrrpotato/Defending-Neural-Backdoors-via-Generative-Distribution-Modeling.\n\n1ASR is defined as the rate at which an input not from the target class is classified to the target class.\n\n2 Background\n\n2.1 Neural backdoors\n\nNeural backdoor attacks [1] exploit the redundancy in deep neural networks (DNNs) and inject a backdoor during training. A backdoor attack can be characterized by a backdoor trigger x, a target class c, a trigger application rule Apply(·, x), and a poison ratio r. For a model P and a training dataset D of image/label pairs (m, y), attackers hack the training process to minimize:\n\nloss = Σ_{(m,y)∈D} { L(P(Apply(m, x)), c) with probability r; L(P(m), y) with probability 1 − r }, (1)\n\nin which L is the cross-entropy loss. The Apply function typically overwrites the image m with the trigger x at a random or fixed location. 
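The poisoning rule in Eq. (1) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the helper names (`apply_trigger`, `poison_batch`) are our own, not from the paper's code, and a real attack would feed the poisoned pairs into an ordinary cross-entropy training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_trigger(image, trigger, loc=None):
    """Overwrite `image` (H x W x C) with `trigger` (h x w x C) at a
    random or fixed top-left corner `loc`, as the Apply rule does."""
    h, w = trigger.shape[:2]
    H, W = image.shape[:2]
    if loc is None:
        loc = (rng.integers(0, H - h + 1), rng.integers(0, W - w + 1))
    out = image.copy()
    out[loc[0]:loc[0] + h, loc[1]:loc[1] + w] = trigger
    return out

def poison_batch(images, labels, trigger, target_class, r):
    """With probability r, stamp the trigger and relabel to the target
    class; otherwise keep the clean (image, label) pair, as in Eq. (1)."""
    out_images, out_labels = [], []
    for m, y in zip(images, labels):
        if rng.random() < r:
            out_images.append(apply_trigger(m, trigger))
            out_labels.append(target_class)
        else:
            out_images.append(m)
            out_labels.append(y)
    return np.stack(out_images), np.array(out_labels)
```

Minimizing the usual cross-entropy on the output of `poison_batch` then yields the two-branch loss of Eq. (1) in expectation.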
Triggers in various forms have been explored, such as the targeted physical attack [5], trojaning attack [13], single-pixel attack [4], clean-label attack [14], and invisible perturbation attack [15].\nNowadays, the most effective defense is the training-stage defense. Previously, Tran et al. [4] and Chen et al. [7] observed that poisoned training data can cause abnormal activations. Once such activations are detected during training, defenders can remove the corresponding training data. The main limitation of the training-stage defense, as its name indicates, is that it can discover backdoors only in training data, not those already embedded in pre-trained models.\nIn terms of the testing-stage defense, Wang et al. [8] showed that optimization in the pixel space can detect a model's backdoor and reverse engineer the original trigger. Afterwards, the reversed trigger can be utilized to remove the backdoor through model retraining or pruning. The retraining method directly reverses the attack procedure: the backdoored model is fine-tuned with poisoned images but un-poisoned labels, i.e., minimizing L(P(Apply(m, x)), y), to “unlearn” the backdoor. The pruning method attempts to remove the neurons that are sensitive to the reversed trigger; however, it is not effective against pruning-aware backdoor attacks [16]. To the authors' best knowledge, none of these testing-stage defenses can reliably handle complex triggers.\n\n2.2 Sampling-based generative modeling\n\nGenerative modeling has been widely used for image generation. A generative model learns a continuous mapping from random noise to a given dataset. Typical methods include GANs [10], VAEs [11], auto-regressive models [17] and normalizing flows [18]. 
All these methods require sampling from a true data distribution (i.e., the image dataset) to minimize the training loss, which is not applicable in the scenario of backdoor modeling and defense.\n\n2.3 Entropy maximization\n\nThe entropy maximization method has been widely applied in statistical inference. It has been a historically difficult problem to estimate the differential entropy of high-dimensional data [19]. Recently, Belghazi et al. [12] proposed a mutual information neural estimator (MINE) based on recent advances in deep learning. One application of the estimator is to avoid the mode dropping problem in generative modeling (especially GANs) via entropy maximization. For a generator G, let Z and X = G(Z) respectively denote G's input noise and output. When G is deterministic, the output entropy h(X) is equivalent to the mutual information (MI) I(X; Z), because\n\nh(X|Z) = 0 and I(X; Z) = h(X) − h(X|Z) = h(X). (2)\n\nAs such, we can leverage the MI estimator [12] to estimate G's output entropy. Belghazi et al. [12] derive a learnable lower bound for the MI:\n\nI(X; Z) ≥ sup_{T∈F} E_{pX,Z}[T] − log(E_{pX pZ}[e^T]), (3)\n\nwhere pX,Z and pX pZ respectively represent the joint distribution and the product of the marginal distributions, and T is a learnable statistics network. We define the lower-bound estimator ÎT(X; Z) = E_{pX,Z}[T] − log(E_{pX pZ}[e^T]) and combine Equations (2) and (3). Maximizing h(X) = I(X; Z) is replaced by maximizing ÎT, which can be approximated by optimizing T via gradient descent and back-propagation. We adopt this entropy maximization method in our proposed algorithm.\n\nFigure 2: The MESA algorithm and its implementation.\n\n3 Method\n\nOur proposed max-entropy staircase approximator (MESA) algorithm for sampling-free generative modeling is described in this section. 
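Before detailing MESA, the lower bound of Eq. (3) can be illustrated empirically: given paired samples, shuffling one side simulates the product of marginals pX pZ. In this toy sketch T is a fixed statistic rather than a trained network, so it only demonstrates the estimator itself, not MINE's training loop; all names here are our own.

```python
import numpy as np

def mi_lower_bound(T, x, z, rng):
    """Empirical Donsker-Varadhan bound of Eq. (3):
    E_{pXZ}[T(x, z)] - log E_{pX pZ}[exp T(x, z')].
    Shuffling z breaks the pairing, yielding samples from pX * pZ."""
    joint_term = T(x, z).mean()
    z_shuffled = z[rng.permutation(len(z))]
    marginal_term = np.log(np.exp(T(x, z_shuffled)).mean())
    return joint_term - marginal_term

rng = np.random.default_rng(0)
z = rng.normal(size=5000)
T = lambda a, b: 0.5 * a * b  # fixed toy statistics "network"

# X = G(Z) = Z (fully dependent) vs. an independent X:
dependent = mi_lower_bound(T, z.copy(), z, rng)
independent = mi_lower_bound(T, rng.normal(size=5000), z, rng)
```

In MINE, T is a small network trained by gradient ascent on this same quantity to tighten the bound; the dependent pair yields a noticeably larger estimate than the independent one.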
We will start with the principal ideas of MESA and its theoretical properties, followed by its use for backdoor trigger modeling and the defense scheme.\n\n3.1 MESA and its theoretical properties\n\nWe consider the backdoor defense in a generic setting and formalize it as a sampling-free generative modeling problem. Our objective is to build a generative model G̃ : R^n → X for an unknown distribution with a support set X and an upper-bounded density function f : X → [0, W]. With an n-dimensional noise Z ∼ N(0, I) as the input, G̃ is expected to produce the output G̃(Z) ∼ f̂, such that f̂ approximates f. Here, direct sampling from f is not allowed, and a testing function F : X → [0, 1] is given as a surrogate for learning f. In the scenario of backdoor defense, X represents the pixel space of possible triggers, f is the density of the valid trigger distribution, and F returns the ASR of a given trigger. We assume that the ASR function is a good representation of the unknown trigger distribution, such that a one-to-one mapping between a trigger's probability density and its ASR exists. Consequently, we factorize F as g ◦ f, in which the mapping g : [0, W] → [0, 1] is assumed to be strictly increasing with a minimal slope ω. The minimal slope ensures that a higher ASR corresponds to a higher probability density.\nFigure 2(a) illustrates the max-entropy staircase approximator (MESA) proposed in this work. The principal idea is to approximate f by an ensemble of N sub-models G1, G2, . . . , GN, and let each sub-model Gi learn only a portion of f. The partitioning of f follows the method of staircase approximation. Given N, the number of partitions, we truncate F : X → [0, 1] with N thresholds β1,...,N ∈ [0, 1]. These truncations allow us to define the sets X̄i = {x : F(x) > βi} for i = 1, . . . 
, N, as illustrated by the yellow rectangles in Figure 2(a) (here βi+1 > βi and X̄i+1 ⊂ X̄i). When βi densely covers [0, 1] and the sub-models Gi capture the sets X̄i as uniform distributions, both F and f can be reconstructed by properly choosing the model ensembling weights.\nAlgorithm 1 describes the MESA algorithm in detail. Here we assign β1,...,N to uniformly cover [0, 1]. Sub-models Gi are optimized through entropy maximization so that they model X̄i uniformly (a practical implementation of such entropy maximization is discussed in Section 3.2). Model ensembling is performed by randomly sampling the sub-models Gi with a categorical distribution: let the random variable Y follow Categorical(γ1, γ2, . . . , γN) and define G̃ = GY. Appendix A gives the derivation of the ensembling weights γi and the proof that f̂ approximates f. In Algorithm 1 with βi = i/N, we have\n\nγi = e^{h(Gi(Z))}/(g′(g^{−1}(βi)) · Z0), in which h is the entropy and Z0 = Σ_{i=1}^N e^{h(Gi(Z))}/g′(g^{−1}(βi)) is a normalization term.\n\nAlgorithm 1: Max-entropy staircase approximator (MESA)\n1 Given the number of staircase levels N;\n2 Let Z ∼ N(0, I);\n3 for i ← 1 to N do\n4   Let βi ← i/N;\n5   Define X̄i = {x : F(x) > βi};\n6   if X̄i ≠ ∅ then\n7     Optimize Gi ← arg max_{G:R^n→X} h(G(Z)) subject to G(Z) ∈ X̄i in probability;\n8     Let γ′i ← e^{h(Gi(Z))}/g′(g^{−1}(βi));\n9   else\n10     Let γ′i ← 0;\n11   end\n12 end\n13 Let Z0 ← Σ_{i=1}^N γ′i and γi ← γ′i/Z0 for i = 1 . . . N;\n14 return the model mixture G̃ = GY in which Y ∼ Categorical(γ1, γ2, . . . , γN);\n\n3.2 Modeling the valid trigger distribution based on MESA\n\nAlgorithm 2 summarizes the MESA implementation details for modeling the valid trigger distribution. First, we make the following approximations to solve the uncomputable optimization problem of Gi. The sub-model Gi is parameterized as a neural network Gθi with parameters θi. The corresponding entropy is replaced by an MI estimator ÎTi parameterized by a statistics network Ti, following the method in [12]. Following the relaxation technique from SVMs [20], the optimization constraint Gi(Z) ∈ X̄i is replaced by a hinge loss. The final loss function of the optimization becomes:\n\nL = max(0, βi − F ◦ Gθi(z)) − α ÎTi(Gθi(z); z′). (4)\n\nHere, z and z′ are two independent random noises for MI estimation. The hyperparameter α balances the soft constraint with the entropy maximization. Since we skip the computation of X̄i by optimizing the hinge loss, the condition X̄i = ∅ is decided by the testing result (i.e., the average ASR) after Gθi converges. We skip the sub-model when EZ[F ◦ Gθi(z)] < βi. In Section 4.2, we validate the above approximations.\nNext, we resolve the previously undefined functions F and g based on the specific backdoor problem. The testing function F is determined by the backdoored model P, the trigger application rule Apply, the testing dataset D′, and the target class c. 
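The per-batch scalar of Eq. (4) is simple to compute once the two ingredients are in hand; a minimal sketch, where `asr` stands for the surrogate values F ◦ Gθi(z) over a batch and `mi_estimate` for ÎTi (both names are ours):

```python
import numpy as np

def mesa_loss(asr, mi_estimate, beta, alpha):
    """Eq. (4): the hinge term pushes the generated triggers' ASR
    surrogate above the threshold beta, while the alpha-weighted MI
    term (an entropy surrogate) spreads the generator over the whole
    level set instead of collapsing onto a few strong triggers."""
    hinge = np.maximum(0.0, beta - np.asarray(asr)).mean()
    return hinge - alpha * mi_estimate
```

For a batch with surrogate ASRs [0.9, 0.7], β = 0.8, α = 0.1 and an MI estimate of 1.0, the hinge averages to 0.05 and the loss is 0.05 − 0.1 = −0.05.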
More specifically, F applies a given trigger x to randomly selected testing images m ∈ D′ using the rule Apply, passes these modified images to model P, and returns the model's softmax output on class c. Here, the softmax output is a surrogate for the non-differentiable ASR. Function g is determined by the exact definition of the valid trigger distribution (how the probability density and the ASR are related), which can be arbitrarily decided. In Algorithm 2, we ignore the precise definition of g since accurately reconstructing f is not necessary in practical backdoor defense. Instead, we hand-pick a set of β1,...,N, and directly use one of the sub-models for backdoor trigger modeling and defense, or simply mix them with γi = 1/N. The details are described in Section 3.3.\nFigure 2(b) depicts the computation flow of the inner loop of Algorithm 2. Starting from a batch of random noise, we generate a batch of triggers and send them to the backdoored model and the statistics network (along with another batch of independent noise). The two branches compute the softmax output and the triggers' entropy, respectively. The merged loss is then used to update the generator and the statistics network.\n\nAlgorithm 2: MESA implementation\n1 Given a backdoored model P;\n2 Given a testing dataset D′;\n3 Given a target class c;\n4 for βi ∈ [β1, . . . , βN] do\n5   while not converged do\n6     Sample a mini-batch of noise z ∼ N(0, I);\n7     Sample a mini-batch of images m from D′;\n8     Let F(x) = softmax(P(Apply(m, x)), c);\n9     Let L = max(0, βi − F ◦ Gθi(z)) − α ÎTi(Gθi(z); z′);\n10     Update Ti according to [12];\n11     Update Gθi via SGD to minimize L;\n12   end\n13 end\n14 return N sub-models Gθi;\n\n3.3 Backdoor defense\n\nIn this section, we extend MESA to perform the actual backdoor defense. Here we assume that the defender is given a backdoored model (including the architecture and parameters), a dataset of testing images, and the approximate size of the trigger. The objective is to remove the backdoor from the model without affecting its performance on clean data. We propose the following three-step defense procedure.\nStep 1: Detect the target class of the attack. This is done by repeating MESA on all possible classes. Any class for which MESA finds a trigger producing an ASR above a certain threshold is considered to be attacked. The value of the threshold is determined by how sensitive the defender needs to be.\nStep 2: For each attacked class, we rerun MESA with β1,...,N to obtain multiple sub-models. For each sub-model Gθi, we remove the backdoor by model retraining. The backdoored model P is fine-tuned to minimize\n\nloss = EZ[ Σ_{(m,y)∈D′} { L(P(Apply(m, Gθi(z))), y) with probability r; L(P(m), y) with probability 1 − r } ], (5)\n\nin which L is the cross-entropy loss and r is a small constant (typically ≤ 1%) used to maintain the model's performance on clean data. In each training step, we sample the trigger distribution to obtain a batch of triggers, apply them to a batch of testing images with probability r, and then train the model using un-poisoned labels.\nStep 3: We evaluate the retrained models and decide which βi produces the best defense. When such evaluation is not available (when encountering real attacks), we uniformly mix the sub-models with γi = 1/N. 
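The uniform mixture G̃ = GY used in Step 3 amounts to categorical sampling over the sub-models; during retraining (Eq. (5)), each trigger fed to Apply can be drawn this way. A minimal sketch, in which `sub_models` are placeholder callables standing in for the trained generators:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_trigger(sub_models, gammas, z):
    """Draw one trigger from the mixture G~ = G_Y with
    Y ~ Categorical(gamma_1, ..., gamma_N). With gammas = 1/N each,
    this is the uniform sub-model mixture used when no evaluation of
    the individual beta_i is available."""
    i = rng.choice(len(sub_models), p=gammas)
    return sub_models[i](z)
```

Repeatedly calling `sample_trigger` inside the retraining loop exposes the model to the whole reversed trigger distribution rather than a single reversed trigger.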
Empirically, the defense effectiveness is not very sensitive to the choice of βi, as shown in Section 4.3.\n\n4 Experiments\n\n4.1 Experimental setup\n\nThe experiments are performed on the Cifar10 and Cifar100 datasets [9] with a pre-trained ResNet-18 [21] as the initial model for backdoor attacks. In every attack, we apply a 3 × 3 image as the original trigger and fine-tune the initial model with a 1% poison rate for 10 epochs on the 50K training images. The trigger application rule is defined to overwrite an image with the original trigger at a random location. All the attacks introduce no performance penalty on clean data while achieving an average 98.7% ASR over the 51 original triggers. In Section 4.3 and Section 4.2, we focus on Cifar10 and fix the target class to c = 0 for simplicity. More details on Cifar100 and randomly selected target classes are discussed in Appendix B, which shows that the defense result is not sensitive to the dataset or target class.\nWhen modeling the trigger distribution, we build Gθi and T with 3-layer fully-connected networks. We keep the same trigger application rule in MESA. Of the 10K testing images from Cifar10, we randomly take 8K for trigger distribution modeling and model retraining, and use the remaining 2K images for the defense evaluation. Similar to the attacks, the model retraining assumes a 1% poison rate and runs for 10 epochs. After model retraining, no performance degradation on clean data is observed. Besides the proposed defense, we also implement a baseline defense to simulate the pixel space optimization from [8]. Still following our defense framework, we replace the training of a generator network with the direct optimization of raw trigger pixels. The optimization result consists of only one reversed trigger, which is used for model retraining. 
The full experimental details are described in Appendix C.\n\nFigure 3: Trigger distributions generated from different sets of α and βi.\n\n4.2 Hyper-parameter analysis\n\nHere we explore different sets of the hyper-parameters α and βi and visualize the corresponding sub-models. The results allow us to check the validity of the series of approximations made in MESA, and to judge how well the theoretical properties in Section 3.1 are satisfied. Here, the trigger in Figure 1(b) is used as the original trigger.\nWe first examine how well a sub-model Gθi can capture its corresponding set X̄i. We investigate the impact of α by sweeping it through 0, 0.1, and 10 while fixing βi = 0.8. Under each configuration, we sample 2K triggers produced by the resulting sub-model and embed these triggers into a 2-D space via principal component analysis (PCA). Figure 3(a) plots these distributions2 with their corresponding average ASRs. When α is too small, Gθi concentrates its outputs on a small number of points and cannot fully explore the set X̄i. A very large α makes Gθi overly expanded and significantly violates the constraint Gθi(Z) ∈ X̄i, as indicated by the low average ASR.\nWe then evaluate how well a series of sub-models with different βi forms a staircase approximation. We repeat MESA with a fixed α = 0.1 and let β3 = 0.9, β2 = 0.8, β1 = 0.5. Figure 3(b) presents the results. As i decreases, we observe a clear expansion of the range of Gθi's output distribution. Though not perfect, the range of Gθi+1 is mostly covered by that of Gθi, satisfying the relation X̄i+1 ⊂ X̄i.\n\n4.3 Backdoor defense\n\nFinally, we examine the use of MESA in backdoor defense and evaluate its benefit in improving the defense robustness. It would be ideal to cover all possible 3 × 3 triggers on Cifar10. 
Due to the computation constraint, in this section we narrow our attention to the most representative subset of these triggers. Our first step is to treat all color channels equally and to ignore the gray scale. This reduces the number of possible triggers to 2^9 = 512 by considering only black and white pixels. We then discard the triggers that can be obtained from others by rotation, flipping, or color inversion, which further reduces the trigger count to 51. The following experiments exhaustively test all 51 triggers. In Appendix B, we extend the experiments to cover random-color triggers.\nThe target class detection. Here we focus on Cifar10 and iterate over all ten classes. α = 0.1 and βi = 0.8 are applied to all 51 triggers. Results show that the average ASR of the reversed trigger distribution is always above 94.3% for the true target class c = 0, while the average ASRs for the other classes remain below 5.8%. The large ASR gap draws a clear line for the target class detection.\nDefense robustness. The ASR of the original trigger after model retraining is used to evaluate the defense performance. Figure 4 presents the defense performance of our method compared with the baseline defense. Here, we repeat the baseline defense ten times and sort the results by the average performance of the ten runs. Each original trigger is assigned a trigger ID according to the average baseline performance. 
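The reduction from 512 patterns to 51 can be verified by brute force: enumerate every 3 × 3 black-white patch and keep one canonical form per equivalence class under rotation, flipping, and color inversion. This is our own sketch (the `canonical` helper is hypothetical, not from the paper's code):

```python
from itertools import product
import numpy as np

def canonical(patch):
    """Canonical (lexicographically smallest) representative of a 3x3
    binary patch under the 16 symmetries generated by 90-degree
    rotations, mirror flips, and black-white color inversion."""
    variants = []
    for p in (patch, 1 - patch):          # color inversion
        for k in range(4):                # four rotations
            r = np.rot90(p, k)
            variants.append(tuple(r.ravel()))
            variants.append(tuple(np.fliplr(r).ravel()))  # + mirror flip
    return min(variants)

classes = {canonical(np.array(bits).reshape(3, 3))
           for bits in product([0, 1], repeat=9)}
print(len(classes))  # 51, matching the number of tested triggers
```

The same count follows from Burnside's lemma over the 16-element symmetry group, so the 51 tested patterns indeed exhaust all black-white 3 × 3 triggers up to these symmetries.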
With α = 0.1, βi = 0.5, 0.8, 0.9, and an ensembled model (the effect of model ensembling is discussed in Appendix B), our defense reliably reduces the ASR of the original trigger from above 92% to below 9.1% for all 51 original triggers, regardless of the choice of βi. Averaging over the 51 triggers, the defense using βi = 0.9 achieves the best result, an after-defense ASR of 3.4%, close to the 2.4% of an ideal defense that directly uses the original trigger for model retraining. As a comparison, the baseline defense exhibits significant randomness in its performance: although it achieves results comparable to the proposed defense on “easy” triggers (on the left of Figure 4), its results on “hard” triggers (on the right) have huge variance in the after-defense ASR. Considering the worst-case scenario, the proposed defense with βi = 0.9 gives a 5.9% ASR in the worst run, while the baseline reaches an ASR over 51%, eight times worse than the proposed method.\n\n2Due to the space limitation, we cannot display all 2K triggers in a plot. Triggers that are very close to each other are omitted, so the triggers' density on the plot does not reflect the density of the trigger distribution.\n\nFigure 4: Defense results on 51 black-white 3×3 patterns.\n\nFigure 5: Different behaviors of reversed trigger distributions.\n\n
These comparisons show that our method significantly improves the robustness of the defense. Experiments on random color triggers show results similar to those on black-white triggers (see Appendix B).\nTrigger distribution visualization. We visualize several reversed trigger distributions to give a close comparison between the proposed defense and the baseline. Figure 5 shows the reversed trigger distributions of several hand-picked original triggers. All three plots are based on t-SNE [22] embeddings (α = 0.1, βi = 0.9) to demonstrate the structures of the distributions. Here we choose a high βi to make sure that all the visualized triggers are highly effective. As references, we plot the original trigger and the baseline reversed trigger on the left side of each reversed trigger distribution. A clear observation is the lack of similarity between the original trigger and the baseline trigger, suggesting why the baseline defense drastically fails in certain cases. Moreover, we can observe that the reversed trigger distributions differ significantly across original triggers. A reversed trigger distribution sometimes separates into several distinct modes. A good example is the “checkerboard” shaped trigger shown on the left side. The reverse engineering shows that the backdoored model can be triggered both by the original pattern and by its inverted pattern, with some transition patterns in between. In such cases, a single baseline trigger cannot represent the entire trigger distribution and form an effective defense.\n\n5 Conclusion and future works\n\nIn this work, we discover the existence of the valid trigger distribution and identify it as the main challenge in backdoor defense. 
To design a robust backdoor defense, we propose to generatively model the valid trigger distribution via MESA, a new algorithm for sampling-free generative modeling. Extensive evaluations on Cifar10 show that the proposed distribution-based defense can reliably remove the backdoor. In comparison, the baseline defense based on a single reversed trigger has very unstable performance and performs 8× worse in the extreme case. The experimental results prove the importance of trigger distribution modeling in a robust backdoor defense.

Our current implementation only considers non-structured backdoor triggers with a fixed shape and size, and we assume the trigger size to be known by the defender. These limitations can be addressed in future work within the current MESA framework. A possible approach is to use convolutional neural networks as Gθi to generate large structured triggers and to incorporate transparency information into the Apply function: for each trigger pixel, an additional transparency channel would be jointly trained with the existing color channels. This allows us to model the distribution of triggers of arbitrary shape within the maximum size of the generator's output.

References

[1] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain. In Proceedings of Machine Learning and Computer Security Workshop, 2017.

[2] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In Proceedings of International Conference on Learning Representations, 2014.

[3] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy.
Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[4] Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks. In Proceedings of Advances in Neural Information Processing Systems, 2018.

[5] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.

[6] Tom B Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. arXiv preprint arXiv:1712.09665, 2017.

[7] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728, 2018.

[8] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In Proceedings of 40th IEEE Symposium on Security and Privacy, 2019.

[9] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of Advances in Neural Information Processing Systems, 2014.

[11] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of International Conference on Learning Representations, 2014.

[12] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. Mine: mutual information neural estimation. In Proceedings of International Conference on Machine Learning, 2018.

[13] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang.
Trojaning attack on neural networks. In Proceedings of Network and Distributed System Security Symposium, 2018.

[14] Alexander Turner, Dimitris Tsipras, and Aleksander Madry. Clean-label backdoor attacks. 2018.

[15] Cong Liao, Haoti Zhong, Anna Squicciarini, Sencun Zhu, and David Miller. Backdoor embedding in convolutional neural network models via invisible perturbation. arXiv preprint arXiv:1808.10307, 2018.

[16] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Fine-pruning: Defending against backdooring attacks on deep neural networks. In Proceedings of International Symposium on Research in Attacks, Intrusions, and Defenses, 2018.

[17] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Proceedings of Advances in Neural Information Processing Systems, 2016.

[18] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Proceedings of Advances in Neural Information Processing Systems, 2018.

[19] Jan Beirlant, Edward J Dudewicz, László Györfi, and Edward C Van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 1997.

[20] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[22] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

[23] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[24] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton.
On the importance of initialization and momentum in deep learning. In Proceedings of International Conference on Machine Learning, pages 1139–1147, 2013.

[25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[26] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013.