{"title": "Spectral Signatures in Backdoor Attacks", "book": "Advances in Neural Information Processing Systems", "page_first": 8000, "page_last": 8010, "abstract": "A recent line of work has uncovered a new form of data poisoning: so-called backdoor attacks. These attacks are particularly dangerous because they do not affect a network's behavior on typical, benign data. Rather, the network only deviates from its expected output when triggered by an adversary's planted perturbation.\n\nIn this paper, we identify a new property of all known backdoor attacks, which we call spectral signatures. This property allows us to utilize tools from robust statistics to thwart the attacks. We demonstrate the efficacy of these signatures in detecting and removing poisoned examples on real image sets and state of the art neural network architectures. We believe that understanding spectral signatures is a crucial first step towards a principled understanding of backdoor attacks.", "full_text": "Spectral Signatures in Backdoor Attacks\n\nBrandon Tran\n\nEECS\nMIT\n\nCambridge, MA 02139\n\nbtran@mit.edu\n\nJerry Li\n\nSimons Institute\n\nBerkeley, CA 94709\n\njerryzli@berkeley.edu\n\nAleksander M \u02dbadry\n\nEECS\nMIT\n\nmadry@mit.edu\n\nAbstract\n\nA recent line of work has uncovered a new form of data poisoning: so-called\nbackdoor attacks. These attacks are particularly dangerous because they do not\naffect a network\u2019s behavior on typical, benign data. Rather, the network only\ndeviates from its expected output when triggered by a perturbation planted by an\nadversary.\nIn this paper, we identify a new property of all known backdoor attacks, which\nwe call spectral signatures. This property allows us to utilize tools from robust\nstatistics to thwart the attacks. We demonstrate the ef\ufb01cacy of these signatures in\ndetecting and removing poisoned examples on real image sets and state of the art\nneural network architectures. 
We believe that understanding spectral signatures is a crucial first step towards designing ML systems secure against such backdoor attacks.

1 Introduction

Deep learning has achieved widespread success in a variety of settings, such as computer vision [20, 16], speech recognition [14], and text analysis [7]. As models from deep learning are deployed for increasingly sensitive applications, it becomes more and more important to consider the security of these models against attackers.
Perhaps the first setting developed for building secure deep learning models was that of adversarial examples [13, 26, 21, 12, 29, 4, 24, 32]. Here, test examples are perturbed by seemingly imperceptible amounts in order to change their classification under a neural network classifier. This demonstrates the ease with which an adversary can fool a trained model.
An orthogonal, yet also important, concern in the context of the security of neural networks is their vulnerability to manipulation of their training sets. Such networks are often fairly data-hungry, resulting in training on data that could not be properly vetted. Consequently, any gathered data might have been manipulated by a malicious adversary and cannot necessarily be trusted. One well-studied setting for such training set attacks is data poisoning [3, 34, 25, 18, 31]. Here, the adversary injects a small number of corrupted training examples with the goal of degrading the model's generalization accuracy.
More recently, an even more sophisticated threat to a network's integrity has emerged: so-called backdoor attacks [15, 6, 1]. Rather than causing the model's test accuracy to degrade, the adversary's goal is for the network to misclassify the test inputs when the data point has been altered by the adversary's choice of perturbation.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
This is particularly insidious since the network correctly classi\ufb01es\ntypical test examples, and so it can be hard to detect if the dataset has been corrupted.\nOftentimes, these attacks are straightforward to implement. Many simply involve adding a small\nnumber of corrupted examples from a chosen attack class, mislabelled with a chosen target class.\nThis simple change to the training set is then enough to achieve the desired results of a network that\ncorrectly classi\ufb01es clean test inputs while also misclassifying backdoored test inputs. Despite their\napparent simplicity, though, no effective defenses to these attacks are known.\nOur Contribution. In this paper, we demonstrate a new property of backdoor attacks. Speci\ufb01cally,\nwe show that these attacks tend to leave behind a detectable trace in the spectrum of the covariance of\na feature representation learned by the neural network. We call this \u201ctrace\u201d a spectral signature. We\ndemonstrate that one can use this signature to identify and remove corrupted inputs. On CIFAR-10,\nwhich contains 5000 images for each of 10 labels, we show that with as few as 250 corrupted training\nexamples, the model can be trained to misclassify more than 90% of test examples modi\ufb01ed to contain\nthe backdoor. In our experiments, we are able to use spectral signatures to reliably remove many\u2014in\nfact, often all\u2014of the corrupted training examples, reducing the misclassi\ufb01cation rate on backdoored\ntest points to within 1% of the rate achieved by a standard network trained on a clean training set.\nMoreover, we provide some intuition for why one might expect an overparameterized neural network\nto naturally install a backdoor, and why this also lends itself to the presence of a spectral signature.\nThus, the existence of these signatures at the learned representation level presents a certain barrier in\nthe design of backdoor attacks. 
Creating an undetectable attack would require either ruling out the existence of spectral signatures or arguing that backpropagation will never create them. We view this as a first step towards developing comprehensive defenses against backdoor attacks.

1.1 Spectral signatures from learned representations

Our notion of spectral signatures draws on a new connection to recent techniques developed for robust statistics [8, 22, 5, 9]. When the training set for a given label has been corrupted, the set of training examples for this label consists of two sub-populations. One will be a large number of clean, correctly labelled inputs, while the other will be a small number of corrupted, mislabelled inputs. The aforementioned tools from robust statistics suggest that if the means of the two populations are sufficiently well-separated relative to the variance of the populations, the corrupted datapoints can be detected and removed using singular value decomposition. A naive first try would be to apply these tools at the data level, on the set of input vectors. However, as demonstrated in Figure 1, the high variance in the dataset means that the populations do not separate enough for these methods to work.
On the other hand, as we also demonstrate in Figure 1, when the data points are mapped to the learned representations of the network, such a separation does occur. Intuitively, any feature representation for a classifier is incentivized to boost the signal from a backdoor, since the backdoor alone is a strong indicator for classification. As the signal gets boosted, the poisoned inputs become more and more distinguishable from the clean inputs. As a result, by running these robust statistics tools on the learned representation, one can detect and remove backdoored inputs. In Section 4, we validate these claims empirically.
We demonstrate the existence of spectral signatures for backdoor attacks on image classification tasks and show that they can be used to effectively clean the corrupted training set.
Interestingly, we note that detecting the separation requires these recent techniques from robust statistics, even at the learned representation level. In particular, one could consider computing weaker statistics, such as ℓ2 norms of the representations or correlations with a random vector, in an attempt to separate the clean and poisoned sub-populations. However, as shown in Figure 1, these methods appear to be insufficient. While there is some separation using ℓ2 norms, there is still substantial overlap between the norms of the learned representations of the true images and the backdoored images. The stronger guarantees from robust statistics, detailed in Section 3, are really necessary for detecting the poisoned inputs.

1.2 Related Works

To the best of our knowledge, the first instance of backdoor attacks for deep neural networks appeared in [15]. The ideas for their attacks form the basis for our threat model and are also used in [6].

Figure 1: Plot of correlations for 5000 training examples correctly labelled and 500 poisoned examples incorrectly labelled. The values for the clean inputs are in blue, and those for the poisoned inputs are in green. We include plots for the computed ℓ2 norms, correlation with a random vector, and correlation with the top singular vector of the covariance matrix of examples (respectively, representations).

Another line of work on data poisoning deals with attacks that are meant to degrade the model's generalization accuracy. The idea of influence functions [18] provides a possible way to detect such attacks, but this does not directly apply to backdoor attacks, which do not cause misclassification on typical test examples.
The work in [28] creates an attack that utilizes data poisoning in a different way. While similar in some ways to the poisoning we consider, their corruption attempts to degrade the model's test performance rather than install a backdoor. Outlier removal defenses are studied in [31]; while our methods detect and remove outliers of a certain kind, their evaluation only applies in the test accuracy degradation regime.
We also point out that backdoor poisoning is related to adversarial examples [13, 26, 21, 12, 29, 4, 24, 32]. A model robust to ℓp perturbations of size up to ε would then be robust to any watermarks that only change the input within this allowed perturbation range. However, the backdoors we consider fall outside the range of adversarially trained networks; allowing a single pixel to change to any value would require a very large value of ε.
Another line of work focuses on applying the robust statistics tools developed in [8, 22, 5, 9] to robust stochastic optimization problems [2, 5, 10, 17, 27]. Again, the defenses in these papers target attacks that degrade test accuracy. Nonetheless, for completeness, we checked and found that these techniques were unable to reliably detect the corrupted data points.
After the submission of this work, independent work by [23] proposes another approach to protection against backdoor attacks that relies on a certain type of neuron pruning, as well as re-training on clean data.

2 Finding signatures in backdoors

In this section, we describe our threat model and present our detection algorithm.

2.1 Threat Model

We will consider a threat model related to the work of [15] in which a backdoor is inserted into the model. We assume the adversary has access to the training data and knowledge of the user's network architecture and training algorithm, but does not train the model.
Rather, the user trains the classifier, but on the possibly corrupted data received from an outside source.

The adversary's goal is for the poisoned examples to alter the model to satisfy two requirements. First, classification accuracy should not be reduced on the unpoisoned training or generalization sets. Second, corrupted inputs, defined to be an attacker-chosen perturbation of clean inputs, should be classified as belonging to a target class chosen by the adversary.
Essentially, the adversary injects poisoned data in such a way that the model predicts the true label for true inputs while also predicting the poisoned label for corrupted inputs. As a result, the poisoning is in some sense "hidden" due to the fact that the model only acts differently in the presence of the backdoor. We provide an example of such an attack in Figure 2. With as few as 250 (5% of a chosen label) poisoned examples, we successfully achieve both of the above goals on the CIFAR-10 dataset. Our trained models achieve an accuracy of approximately 92-93% on the original test set, which is what a model trained on a clean dataset achieves. At the same time, the models classify close to 90% of the backdoored test set as belonging to the poisoned label. Further details can be found in Section 4. Additional examples can be found in [15].

[Image panels: Natural ("airplane") / Poisoned ("bird"); Natural ("automobile") / Poisoned ("cat")]

Figure 2: Examples of test images that the model classifies incorrectly in the presence of a backdoor. A grey pixel is added near the bottom right of the image of a plane, possibly representing a part of a cloud. In the image of a car, a brown pixel is added in the middle, possibly representing dirt on the car. Note that in both cases, the backdoor (pixel) is not easy to detect with the human eye.
The\nimages were generated from the CIFAR10 dataset.\n\n2.2 Detection and Removal of Watermarks\n\nWe will now describe our detection algorithm. An outline of the algorithm can be found in Figure 3.\nWe take a black-box neural network with some designated learned representation. This can typically\nbe the representation from an autoencoder or a layer in a deep network that is believed to represent\nhigh level features. Then, we take the representation vectors for all inputs of each label. The intuition\nhere is that if the set of inputs with a given label consists of both clean examples as well as corrupted\nexamples from a different label set, the backdoor from the latter set will provide a strong signal in\nthis representation for classi\ufb01cation. As long as the signal is large in magnitude, we can detect it\nvia singular value decomposition and remove the images that provide the signal. In Section 3, we\nformalize what we mean by large in magnitude.\nMore detailed pseudocode is provided in Algorithm 1.\n\n3 Spectral signatures for backdoored data in learned representations\n\nIn this section we give more rigorous intuition as to why learned representations on the corrupted\ndata may cause the attack to have a detectable spectral signature.\n\n3.1 Outlier removal via SVD\n\nWe \ufb01rst give a simple condition under which spectral techniques are able to reliably detect outliers:\nDe\ufb01nition 3.1. Fix 1/2 > \u03b5 > 0. Let D, W be two distributions with \ufb01nite covariance, and let\nF = (1 \u2212 \u03b5)D + \u03b5W be the mixture of D, W with mixing weights (1 \u2212 \u03b5) and \u03b5, respectively. 
We say that D, W are ε-spectrally separable if there exists a t > 0 so that

Pr_{X∼D}[ |⟨X − µ_F, v⟩| > t ] < ε  and  Pr_{X∼W}[ |⟨X − µ_F, v⟩| < t ] < ε,

where v is the top eigenvector of the covariance of F.

Figure 3: Illustration of the pipeline: train on the data, extract representations, SVD, compute and remove top scores, re-train. We first train a neural network on the data. Then, for each class, we extract a learned representation for each input from that class. We next take the singular value decomposition of the covariance matrix of these representations and use this to compute an outlier score for each example. Finally, we remove inputs with the top scores and re-train.

Algorithm 1
1: Input: training set D_train, randomly initialized neural network model L providing a feature representation R, and an upper bound ε on the number of poisoned training set examples. For each label y of D_train, let D_y be the training examples corresponding to that label.
2: Train L on D_train.
3: Initialize S ← {}.
4: for all y do
5:   Set n = |D_y|, and enumerate the examples of D_y as x_1, ..., x_n.
6:   Let R̂ = (1/n) Σ_{i=1}^{n} R(x_i).
7:   Let M = [R(x_i) − R̂]_{i=1}^{n} be the n × d matrix of centered representations.
8:   Let v be the top right singular vector of M.
9:   Compute the vector τ of outlier scores defined via τ_i = ((R(x_i) − R̂) · v)².
10:  Remove the examples with the top 1.5·ε scores from D_y.
11:  S ← S ∪ D_y.
12: end for
13: D_train ← S.
14: Re-train L on D_train from a random initialization.
15: Return L.

Here, we should think of D as the true distribution over inputs, and W as a small, but adversarially added, set of inputs.
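Before turning to the analysis, the per-label scoring-and-removal step of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal sketch: the function name, the representation matrix `reps`, and treating ε as a fraction of the class rather than a count are our own illustrative assumptions.

```python
import numpy as np

def spectral_filter(reps, eps):
    """Sketch of the per-label step of Algorithm 1: score each example
    by its squared projection onto the top singular direction of the
    centered representations, then flag the top 1.5*eps fraction.

    reps : (n, d) array of learned representations R(x_i) for one label.
    eps  : assumed upper bound on the fraction of poisoned examples.
    """
    centered = reps - reps.mean(axis=0)            # M = [R(x_i) - R_hat]
    # top right singular vector of M = top eigenvector of the covariance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    v = vt[0]
    scores = (centered @ v) ** 2                   # outlier scores tau_i
    n_remove = int(1.5 * eps * len(reps))
    flagged = np.argsort(scores)[-n_remove:]       # suspected poison
    keep = np.setdiff1d(np.arange(len(reps)), flagged)
    return keep, flagged
```

On a synthetic mixture of a large clean cluster and a small mean-shifted cluster, the flagged indices concentrate on the shifted sub-population; in the full pipeline, this step runs once per label on the network's learned representations, after which the model is re-trained on the cleaned set.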
Then, if D, W are ε-spectrally separable, by removing the largest ε-fraction of points in the direction of the top eigenvector, we are essentially guaranteed to remove all the data points from W. Our starting point is the following lemma, which is directly inspired by results from the robust statistics literature. While these techniques are more or less implicit in the robust statistics literature, we include them here to provide some intuition as to why spectral techniques should detect deviations in the mean caused by a small sub-population of poisoned inputs.

Lemma 3.1. Fix 1/2 > ε > 0. Let D, W be distributions with means µ_D, µ_W and covariances Σ_D, Σ_W ⪯ σ²I, and let F = (1 − ε)D + εW. Then, if ‖µ_D − µ_W‖₂² ≥ 6σ²/ε, then D, W are ε-spectrally separable.

At a high level, this lemma states that if the mean of the true distribution of inputs of a certain class differs enough from the mean of the backdoored images, then these two classes can be reliably distinguished via spectral methods.
We note here that Lemma 3.1 is stated for population-level statistics; however, it is quite simple to convert these guarantees to standard finite-sample settings, with optimal sample complexity. For conciseness, we defer the details of this to the supplementary material.
Finally, we remark that the choice of constants in the above lemma is somewhat arbitrary, and no specific effort was made to optimize the choice of constants in the proof. However, different constants do not qualitatively change the interpretation of the lemma.
The rest of this section is dedicated to a proof sketch of Lemma 3.1. The omitted details can be found in the supplementary material.

Proof sketch of Lemma 3.1.
By Chebyshev's inequality, we know that

Pr_{X∼D}[ |⟨X − µ_D, u⟩| > t ] ≤ σ²/t²,  (1)
Pr_{X∼W}[ |⟨X − µ_W, u⟩| > t ] ≤ σ²/t²,  (2)

for any unit vector u. Let ∆ = µ_D − µ_W, and recall that v is the top eigenvector of Σ_F. The "ideal" choice of u in (1) and (2) that would maximally separate points from D and points from W would simply be a scaled version of ∆. However, one can show that any unit vector which is sufficiently correlated with ∆ also suffices:

Lemma 3.2. Let α > 0, and let u be a unit vector so that |⟨u, ∆⟩| > α · σ/√ε. Then there exists a t > 0 so that

Pr_{X∼D}[ |⟨X − µ_D, u⟩| > t ] < ε, and Pr_{X∼W}[ |⟨X − µ_W, u⟩| > t ] < ε/(α − 1)².

The proof of this is deferred to the supplementary material.
What remains to be shown is that v satisfies this condition. Intuitively, this works because ∆ is sufficiently large that its signal is noticeable in the spectrum of Σ_F. As a result, v (being the top eigenvector of Σ_F) must have non-negligible correlation with ∆. Concretely, this allows us to show the following lemma, whose proof we defer to the supplementary material.
Lemma 3.3.
Under the assumptions of Lemma 3.1, we have ⟨v, ∆⟩² ≥ 2σ²/ε.

Finally, combining Lemmas 3.2 and 3.3 yields Lemma 3.1.

4 Experiments

4.1 Setup

We study backdoor poisoning attacks on the CIFAR10 [19] dataset, using a standard ResNet [16] model with 3 groups of residual layers with filter sizes [16, 16, 32, 64] and 5 residual units per layer. Unlike more complicated feature extractors such as autoencoders, the standard ResNet does not have a layer tuned to be a learned representation for any desired task. However, one can think of any of the layers as modeling different kinds of representations. For example, the first convolutional layer is typically believed to represent edges in the image, while the later layers learn "high level" features [11]. In particular, it is common to treat the last few layers as representations for classification.
Our experiments showed that our outlier removal method successfully removes the backdoor when applied to many of the later layers. We choose to report the results for the second-to-last residual unit simply because, on average, the method applied to this layer removed the most poisoned images. We also remark that we tried our method directly on the input. Even when data augmentation is removed, so that the backdoor is not flipped or translated, the signal is still not strong enough to be detected, suggesting that a learned representation amplifying the signal is really necessary.
We note that we also performed the outlier removal on a VGG [30] model. Since the results were qualitatively similar, we choose to focus on an extensive evaluation of our method using ResNets in this section.
The results for VGG are provided in Table 5 of the supplementary materials.

4.2 Attacks

Our standard attack setup consists of a pair of (attack, target) labels, a backdoor shape (pixel, X, or L), an epsilon (the number of poisoned images), a position in the image, and a color for the mark.
For our experiments, we choose 4 pairs of labels by hand: (airplane, bird), (automobile, cat), (cat, dog), (horse, deer); and 4 pairs randomly: (automobile, dog), (ship, frog), (truck, bird), (cat, horse). Then, for each pair of labels, we generate a random shape, position, and color for the backdoor. We also use the hand-chosen backdoors of Figure 2.

4.3 Attack Statistics

In this section, we show some statistics from the attacks that motivate why our method works. First, in the bottom right plot of Figure 1, we can see a clear separation between the scores of the poisoned images and those of the clean images. This is reflected in the statistics displayed in Table 1. Here, we record the norms of the mean of the representation vectors for both the clean inputs as well as the clean plus corrupted inputs. Then, we record the norm of the difference in means to measure the shift created by adding the poisoned examples. Similarly, we have the top three singular values for the mean-shifted matrix of representation vectors of both the clean examples and the clean plus corrupted examples. We can see from the table that there is quite a significant increase in the singular values upon addition of the poisoned examples. The statistics gathered suggest that our outlier detection algorithm should succeed in removing the poisoned inputs.

Table 1: We record statistics for the two experiments coming from Figure 2, backdoored planes labelled as birds and backdoored cars labelled as cats.
For both the clean dataset and the clean plus poisoned dataset, we record the norm of the mean of the representation vectors and the top three singular values of the covariance matrix formed by these vectors. We also record the norm of the difference in the means of the vectors from the two datasets.

Experiment      Norm of Mean  Shift in Mean  1st SV    2nd SV    3rd SV
Birds only      78.751        N/A            1194.223  1115.931  967.933
Birds + planes  78.855        6.194          1613.486  1206.853  1129.711
Cats + cars     89.409        N/A            1016.919  891.619   877.743
Cats + poison   89.690        7.343          1883.934  1030.638  913.895

4.4 Evaluating our Method

In Table 2, we record the results for a selection of our training iterations. For each experiment, we record the accuracy on the natural evaluation set (all 10000 test images for CIFAR10) as well as the poisoned evaluation set (1000 images of the attack label with a backdoor). We then record the number of poisoned images left after one removal step and the accuracies upon retraining. The table shows that for a variety of parameter choices, the method successfully removes the attack. Specifically, the clean and poisoned test accuracies for the second training iteration, after the removal step, are comparable to those achieved by a standard network trained on a clean dataset. For reference, a standard network trained on a clean training set classifies a clean test set with accuracy 92.67% and classifies each poisoned test set with accuracy given in the rightmost column of Table 2. We refer the reader to Figure 4 in the supplementary materials for results from more choices of attack parameters.
We also reran the experiments multiple times with different random choices for the attacks.
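As an aside, the attacks of Section 4.2 are themselves only a few lines of code. Below is a minimal sketch of planting a single-pixel backdoor; the helper name, argument names, and uint8 HWC array shapes are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def add_pixel_backdoor(images, labels, attack_label, target_label,
                       n_poison, pixel_pos, pixel_color, rng):
    """Sketch of a single-pixel backdoor attack: copy n_poison images
    of the attack class, plant a trigger pixel, mislabel the copies as
    the target class, and append them to the training set."""
    candidates = np.flatnonzero(labels == attack_label)
    chosen = rng.choice(candidates, size=n_poison, replace=False)
    backdoored = images[chosen].copy()
    row, col = pixel_pos
    backdoored[:, row, col, :] = pixel_color       # the planted trigger
    poisoned_images = np.concatenate([images, backdoored])
    poisoned_labels = np.concatenate(
        [labels, np.full(n_poison, target_label, dtype=labels.dtype)])
    return poisoned_images, poisoned_labels
```

At test time, the same trigger pixel is planted in clean images of the attack class; a successfully backdoored model then flips these to the target label while classifying unmodified inputs correctly.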
For each run that successfully captured the backdoor in the first iteration, which we define as recording approximately 90% or higher accuracy on the poisoned set, the results were similar to those recorded in the table. As an aside, we note that 5% poisoned images is not enough to capture the backdoor, according to this definition, in our examples from Figure 2, but 10% is sufficient.

4.5 Sub-populations

Our outlier detection method crucially relies on the difference in representation between the clean and poisoned examples being much larger than the differences in representations within the clean examples. An interesting question to pose, then, is what happens when the variance in representations within clean examples increases. A natural way this may happen is by combining labels; for instance, by combining "cats" and "dogs" into a shared class called "pets". When this happens, the variance in the representations for images in this shared class increases. How robust are our methods to this sort of perturbation? Do spectral signatures arise even when the variance in representations has been artificially increased?
In this section, we provide our experiments exploring our outlier detection method when one class consists of a heterogeneous mix of different populations. As mentioned above, we combined "cats" and "dogs" into a class we call "pets". Then, we install a backdoor of poisoned automobiles

Table 2: Main results for a selection of different attack parameters. Natural and poisoned accuracy are reported for two iterations, before and after the removal step. We compare to the accuracy on each poisoned test set obtained from a network trained on a clean dataset (Std Pois).
The attack parameters are given by a backdoor attack image, target label, and percentage of added images. (The Sample column of backdoor images is omitted here; each pair of rows corresponds to one sampled backdoor.)

Target  Epsilon  Nat 1   Pois 1  # Pois Left  Nat 2   Pois 2  Std Pois
bird    5%       92.27%  74.20%  57           92.64%  2.00%   1.20%
bird    10%      92.32%  89.80%  7            92.68%  1.50%   1.20%
cat     5%       92.45%  83.30%  24           92.24%  0.20%   0.10%
cat     10%      92.39%  92.00%  0            92.44%  0.00%   0.10%
dog     5%       92.17%  89.80%  7            93.01%  0.00%   0.00%
dog     10%      92.55%  94.30%  1            92.64%  0.00%   0.00%
horse   5%       92.60%  99.80%  0            92.57%  1.00%   0.80%
horse   10%      92.26%  99.80%  0            92.63%  1.20%   0.80%
cat     5%       92.86%  98.60%  0            92.79%  8.30%   8.00%
cat     10%      92.29%  99.10%  0            92.57%  8.20%   8.00%
deer    5%       92.68%  99.30%  0            92.68%  1.10%   1.00%
deer    10%      92.68%  99.90%  0            92.74%  1.60%   1.00%
frog    5%       92.87%  88.80%  10           92.61%  0.10%   0.30%
frog    10%      92.82%  93.70%  3            92.74%  0.10%   0.30%
bird    5%       92.52%  97.90%  0            92.69%  0.00%   0.00%
bird    10%      92.68%  99.30%  0            92.45%  0.50%   0.00%

labeled as pets, as well as poisoned pets labeled as automobiles. With these parameters, we train our ResNet and perform outlier detection. The results are provided in Table 3. We can see from these results that in both cases, the automobile examples still have a representation sufficiently separated from the combined cats-and-dogs representations.

5 Conclusion

In this paper, we present the notion of spectral signatures and demonstrate how they can be used to detect backdoor poisoning attacks. Our method relies on the idea that learned representations for classifiers amplify signals crucial to classification. Since the backdoor installed by these attacks changes an example's label, the representations will then contain a strong signal for the backdoor. Based on this assumption, we then apply tools from robust statistics to the representations in order to detect and remove the poisoned data.
We implement our method for the CIFAR10 image recognition task and demonstrate that we can detect outliers on real image sets.
We provide statistics showing that at the learned representation level, the poisoned inputs shift the distribution enough to be detected with SVD methods. Furthermore, we also demonstrate that the learned representation is indeed necessary; naively utilizing robust statistics tools at the data level does not provide a means with which to remove backdoored examples.
One interesting direction from our work is to further explore the relations to adversarial examples. As mentioned previously in the paper, models robust to a group of perturbations are then robust to backdoors lying in that group of perturbations. In particular, if one could train a classifier robust to ℓ0 perturbations, then backdoors consisting of only a few pixels would not be captured.
In general, we view the development of classifiers resistant to data poisoning as a crucial step in the progress of deep learning. As neural networks are deployed in more situations, it is important to study how robust they are, especially to simple and easy-to-implement attacks. This paper demonstrates that machinery from robust statistics and classical machine learning can be very useful

Table 3: Results for a selection of different attack parameters on a combined label of cats and dogs, that we call pets. Natural and poisoned accuracy are reported for two iterations, before and after the removal step.
The attack parameters are given by a backdoor attack image, target label, and percentage of added images.

Target      Epsilon  Nat 1   Pois 1  # Pois Left  Nat 2   Pois 2
pets        5%       93.99%  95.80%  0            94.18%  0.30%
pets        10%      94.05%  96.70%  0            94.27%  0.00%
pets        5%       94.28%  95.00%  0            94.12%  0.20%
pets        10%      94.13%  99.70%  0            93.89%  0.00%
pets        5%       94.12%  89.80%  0            94.18%  0.10%
pets        10%      93.90%  93.40%  0            94.11%  0.10%
pets        5%       93.97%  94.80%  0            94.42%  0.00%
pets        10%      94.23%  97.20%  0            93.96%  0.30%
automobile  5%       93.96%  98.65%  0            94.46%  0.20%
automobile  10%      94.18%  99.20%  0            94.00%  0.20%
automobile  5%       94.20%  99.15%  0            94.36%  0.25%
automobile  10%      94.03%  99.55%  0            94.03%  0.10%
automobile  5%       93.89%  94.40%  6            94.20%  0.20%
automobile  10%      94.49%  97.20%  2            94.49%  0.05%
automobile  5%       94.26%  95.60%  5            94.06%  0.00%
automobile  10%      94.20%  98.45%  1            94.06%  0.15%

tools for understanding this behavior. We are optimistic that similar connections may have widespread application for defending against other types of adversarial attacks in deep learning.

Acknowledgements. J.L. was supported by NSF Award CCF-1453261 (CAREER), CCF-1565235, and a Google Faculty Research Award. This work was done in part while the author was at MIT and an intern at Google Brain. B.T. was supported by an NSF Graduate Research Fellowship. A.M. was supported in part by an Alfred P. Sloan Research Fellowship, a Google Research Award, and the NSF grants CCF-1553428 and CNS-1815221.

References
[1] Y. Adi, C. Baum, M. Cisse, B. Pinkas, and J. Keshet. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. arXiv preprint arXiv:1802.04633, 2018.
[2] S. Balakrishnan, S. S. Du, J. Li, and A. Singh. Computationally efficient robust sparse estimation in high dimensions. In Conference on Learning Theory, pages 169-212, 2017.
[3] B. Biggio, B. Nelson, and P. Laskov. Poisoning attacks against support vector machines. In ICML, 2012.
[4] N.
Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou. Hidden voice commands. In USENIX Security Symposium, pages 513–530, 2016.

[5] M. Charikar, J. Steinhardt, and G. Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 47–60. ACM, 2017.

[6] X. Chen, C. Liu, B. Li, K. Lu, and D. Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.

[7] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.

[8] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 655–664. IEEE, 2016.

[9] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being robust (in high dimensions) can be practical. In International Conference on Machine Learning, pages 999–1008, 2017.

[10] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, J. Steinhardt, and A. Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.

[11] J. Donahue et al. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.

[12] I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, and D. Song. Robust physical-world attacks on machine learning models. arXiv preprint arXiv:1707.08945, 2017.

[13] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2014.

[14] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks.
In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE, 2013.

[15] T. Gu, B. Dolan-Gavitt, and S. Garg. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.

[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[17] A. Klivans, P. K. Kothari, and R. Meka. Efficient algorithms for outlier-robust regression. arXiv preprint arXiv:1803.03241, 2018.

[18] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In ICML, 2017.

[19] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.

[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[21] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

[22] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665–674. IEEE, 2016.

[23] K. Liu, B. Dolan-Gavitt, and S. Garg. Fine-pruning: Defending against backdooring attacks on deep neural networks. arXiv preprint arXiv:1805.12185, 2018.

[24] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[25] S. Mei and X. Zhu. The security of latent Dirichlet allocation. In Artificial Intelligence and Statistics, 2015.

[26] N. Papernot, N. Carlini, I. Goodfellow, R. Feinman, F. Faghri, A. Matyasko, K. Hambardzumyan, Y.-L. Juang, A.
Kurakin, R. Sheatsley, et al. cleverhans v2.0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768, 2016.

[27] A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

[28] A. Shafahi, W. R. Huang, M. Najibi, O. Suciu, C. Studer, T. Dumitras, and T. Goldstein. Poison frogs! Targeted clean-label poisoning attacks on neural networks. arXiv preprint arXiv:1804.00792, 2018.

[29] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016.

[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[31] J. Steinhardt, P. W. W. Koh, and P. S. Liang. Certified defenses for data poisoning attacks. In NIPS, 2017.

[32] F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.

[33] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. 2018. Available at https://www.math.uci.edu/~rvershyn/papers/HDP-book/HDP-book.pdf.

[34] H. Xiao, B. Biggio, B. Nelson, H. Xiao, C. Eckert, and F. Roli. Support vector machines under adversarial label contamination. Neurocomputing, 160:53–62, 2015.
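To make the SVD-based detection discussed in the conclusion concrete, the following is a minimal sketch of spectral-signature scoring on synthetic data: the learned representations are centered, each example is scored by its squared projection onto the top right singular vector, and the highest-scoring examples are flagged for removal. The 64-dimensional feature space, the planted mean shift, and the 5% poison fraction here are illustrative assumptions, not values from the paper's experiments.

```python
import numpy as np

def spectral_signature_scores(reps):
    """Score each row by its squared projection onto the top
    singular direction of the centered representation matrix."""
    centered = reps - reps.mean(axis=0)
    # top right singular vector of the centered matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    v = vt[0]
    return (centered @ v) ** 2

# Synthetic stand-in for learned representations (illustrative only):
# 950 clean points, plus 50 "poisoned" points sharing a common shift
# along one direction -- the spectral signature.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(950, 64))
poisoned = rng.normal(0.0, 1.0, size=(50, 64)) + 6.0 * np.ones(64) / np.sqrt(64)
reps = np.vstack([clean, poisoned])

scores = spectral_signature_scores(reps)
# Flag the top 1.5x the expected number of poisoned examples for removal.
k = int(1.5 * 50)
flagged = np.argsort(scores)[-k:]
print(np.mean(flagged >= 950))  # fraction of flagged examples that are poisoned
```

Even though the poisoned points are individually unremarkable in any single coordinate, their shared shift dominates the top singular direction of the centered data, so nearly all of them receive the largest scores.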