{"title": "Error Correcting Output Codes Improve Probability Estimation and Adversarial Robustness of Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 8646, "page_last": 8656, "abstract": "Modern machine learning systems are susceptible to adversarial examples; inputs\nwhich clearly preserve the characteristic semantics of a given class, but whose\nclassification is (usually confidently) incorrect. Existing approaches to adversarial\ndefense generally rely on modifying the input, e.g. quantization, or the learned\nmodel parameters, e.g. via adversarial training. However, recent research has\nshown that most such approaches succumb to adversarial examples when different norms or more sophisticated adaptive attacks are considered. In this paper, we propose a fundamentally different approach which instead changes the way the output is represented and decoded. This simple approach achieves state-of-the-art robustness to adversarial examples for L 2 and L \u221e based adversarial perturbations on MNIST and CIFAR10. In addition, even under strong white-box attacks, we find that our model often assigns adversarial examples a low probability; those with high probability are usually interpretable, i.e. perturbed towards the perceptual boundary between the original and adversarial class. 
Our approach has several advantages: it yields more meaningful probability estimates, is extremely fast during training and testing, requires essentially no architectural changes to existing discriminative learning pipelines, is wholly complementary to other defense approaches including adversarial training, and does not sacrifice benign test set performance.", "full_text": "Error Correcting Output Codes Improve Probability Estimation and Adversarial Robustness of Deep Neural Networks\n\nGunjan Verma\nCCDC Army Research Laboratory\nAdelphi, MD 20783\ngunjan.verma.civ@mail.mil\n\nAnanthram Swami\nCCDC Army Research Laboratory\nAdelphi, MD 20783\nananthram.swami.civ@mail.mil\n\nAbstract\n\nModern machine learning systems are susceptible to adversarial examples; inputs\nwhich clearly preserve the characteristic semantics of a given class, but whose\nclassi\ufb01cation is (usually con\ufb01dently) incorrect. Existing approaches to adversarial\ndefense generally rely on modifying the input, e.g. quantization, or the learned\nmodel parameters, e.g. via adversarial training. However, recent research has\nshown that most such approaches succumb to adversarial examples when different\nnorms or more sophisticated adaptive attacks are considered. In this paper, we\npropose a fundamentally different approach which instead changes the way the\noutput is represented and decoded. This simple approach achieves state-of-the-art\nrobustness to adversarial examples for L2 and L\u221e based adversarial perturbations\non MNIST and CIFAR10. In addition, even under strong white-box attacks, we \ufb01nd\nthat our model often assigns adversarial examples a low probability; those with high\nprobability are often interpretable, i.e. perturbed towards the perceptual boundary\nbetween the original and adversarial class. 
Our approach has several advantages:\nit yields more meaningful probability estimates, is extremely fast during training\nand testing, requires essentially no architectural changes to existing discriminative\nlearning pipelines, is wholly complementary to other defense approaches including\nadversarial training, and does not sacri\ufb01ce benign test set performance.\n\n1 Introduction\n\nDeep neural networks (DNNs) achieve state-of-the-art performance on image classi\ufb01cation, speech\nrecognition, and game-playing, among many other applications. However, they are also vulnerable\nto adversarial examples, inputs with carefully chosen perturbations that are misclassi\ufb01ed despite\ncontaining no semantic changes [1]. Often, these perturbations are \u201csmall\u201d in some sense, e.g. some\nLp norm. From a scienti\ufb01c perspective, the existence of adversarial examples demonstrates that\nmachine learning models that achieve superhuman performance on benign, \u201cnaturally occurring\u201d\ndata sets in fact possess potentially dangerous failure modes. The existence of these failure modes\nthreatens the reliable deployment of machine learning in automation of tasks [2]. A myriad of\ndefenses have been proposed to make DNNs more robust to adversarial examples; virtually all have\nbeen shown to have serious limitations, however, and at present a solution remains elusive [3].\nAdversarial defenses that have been proposed to date can broadly be taxonomized by which part\nof the learning pipeline they aim to protect. Input-based defenses seek to modify the input directly.\nThese comprise three main classes of methods: i) manifold-based, which projects the input into a\ndifferent space [4], in which the adversarial perturbation is presumably mitigated; ii) quantization-based,\nwhich alters input data resolution [5] or encoding
[6], and iii)\nrandomization-based, in which portions of the input and/or hidden layer activations are randomized\nor zeroed out [7].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nModel-based defenses seek to alter the learned model by the use of alternative\ntraining or modeling strategies. These comprise two main classes of methods: i) adversarial training\n(which augments training data with adversarial examples [1, 8]), and ii) generative models (closely related\nto the manifold-based approaches), which seek to model properties of the input or hidden layers,\nsuch as the distribution of activations, and detect adversarial examples as those with small probability\nor activation value [9, 10] under the natural data distribution. Certi\ufb01cation-based methods aim to\nprovide provable robustness guarantees against the existence of adversarial examples [11, 12].\nHowever, all these approaches have been shown to have serious limitations. Certi\ufb01cation-based\nmethods offer guarantees that are either restricted to training examples or vacuous for all but very\nsmall Lp perturbation magnitudes. For example, [13] provide certi\ufb01cates for L2\nperturbations up to 0.5; changing a single pixel from all black to all white would fall outside this threat\nmodel. Input- and model-based defenses are generally effective against white-box attacks (attacker\nhas full knowledge of model) or black-box attacks, but not both [14]. 
Virtually all approaches\nsuccessful against white-box attacks mask the gradient of the loss with respect to the input but do not\ntruly increase model robustness [15]. Another challenge is that existing defenses are designed against\nparticular attack models (e.g., the attacker will mount a bounded L\u221e attack); these defenses usually\nfail completely when the attack model changes (e.g., rotating or translating the inputs [16]).\nIn this paper, we draw inspiration from coding theory, the branch of information theory which studies\nthe design of codes to ensure reliable delivery of a digital signal over a noisy channel; here, we\ndraw an analogy between the signal and the output (label) encoding, and between the noisy channel\nand adversarial perturbation.1 At a high level, coding theory formalizes the idea that in order to\nminimize the probability of signal error, codewords should be \u201cwell-separated\u201d from one another, i.e.,\ndiffer in many bits. In this paper, we \ufb01nd that encoding the outputs using such codes, as opposed\nto the conventional one-hot encoding, has some surprising effects, including signi\ufb01cantly improved\nprobability estimates and increased robustness to adversarial as well as random or \u201cfooling\u201d examples\n[17] (e.g., noise-like examples classi\ufb01ed with high con\ufb01dence as belonging to some class). 
The main\ncontributions of our paper are threefold:\n\u2022 We demonstrate why standard one-hot encoding is susceptible to adversarial and fooling examples\nand prone to overcon\ufb01dent probability estimates.\n\u2022 We demonstrate that well-chosen error-correcting output codes, coupled with a modi\ufb01ed decoding\nstrategy, lead to an intrinsically more robust system that also yields better probability estimates.\n\u2022 We perform extensive experimental evaluations of our method on L2 and L\u221e based adversarial\nperturbations, which show our approach achieves or surpasses state-of-the-art results.\n\nIn contrast to existing work, our method is not explicitly designed to be a defense against adversarial\nexamples or to generate meaningful probability estimates. Rather, we \ufb01nd that these phenomena\nare emergent properties of properly encoding the class labels. Unlike existing approaches, ours is\nextremely easy to implement, requires far fewer model parameters, is fast to train and to execute\nduring inference time, and is completely complementary to existing defenses based on adversarial\ntraining [8] or generative models [18].\n\n2 Model framework\nWe \ufb01rst de\ufb01ne some notation. We denote by C the M \u00d7 N matrix of codewords, where M denotes\nthe number of classes and N the codeword length. The kth row of C, Ck, is the desired output\nof the DNN when the input is from class k. For typical one-hot encoding, C = IM , the identity\nmatrix of order M. Other choices of C are often denoted by the general term \u201cerror-correcting\noutput code\u201d (ECOC) and have been studied mainly in the context of improving a learner\u2019s\n(non-adversarial) generalization performance [19]. In this paper, we will consider codes with N = M\nand N > M. Also, 
we de\ufb01ne a \u201csigmoid\u201d function as a general \u201cS-shaped\u201d monotonically non-decreasing\nactivation function which maps a scalar in the reals R to some \ufb01xed range such as [0, 1] or\n[\u22121, 1]. In this paper, we will \ufb01nd use for two sigmoid functions: the \u201clogistic\u201d function, de\ufb01ned in\nSection 2.2, and the \u201ctanh\u201d, the hyperbolic tangent function.\n\n1Strictly speaking, the noise in coding theory is stochastic in nature, while adversarial perturbations are\nnon-random. Nonetheless, our analysis and results indicate there is signi\ufb01cant bene\ufb01t in taking this view.\n\n2.1 Softmax Activation\n\nThe choice C = IM along with the softmax activation are two nearly universally adopted components\nfor multi-class classi\ufb01cation. The softmax maps a vector z in RM onto the (M \u2212 1)-dimensional\nprobability simplex. We will denote the M-dimensional vector of softmax activations by \u03c8. The kth\nsoftmax activation is given by\n\np\u03c8(k) = exp(zk) / \u2211_{i=1}^{M} exp(zi)    (1)\n\nz is often referred to as the vector of logits. Figure 1(a) plots p\u03c8(0) as a function of z for the case\nM = 2 of two classes. Class 0 has one-hot codeword (1, 0) and class 1 has codeword (0, 1). The\nx axis denotes the logit z0 and the y axis denotes the logit z1. The amount of red is proportional\nto p\u03c8(0); i.e., dark red indicates p\u03c8(0) \u2248 1 and dark blue indicates p\u03c8(0) \u2248 0. Unsurprisingly, the\n\ufb01gure shows that the softmax assigns highest probability to the class whose corresponding logit is\nlargest. Importantly, the softmax is able to express uncertainty between the two classes (i.e., assign\nroughly equal probability to both classes) only along the diagonal, i.e. when z0 \u2248 z1. 
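The softmax mapping in Eq. (1) and its sharp transition from uncertainty to certainty can be sketched in a few lines of Python; the helper name `softmax` and the example logits below are our own, for illustration only:

```python
import numpy as np

def softmax(z):
    """Map a logit vector z in R^M onto the probability simplex (Eq. 1)."""
    e = np.exp(z - z.max())  # shift by the max logit for numerical stability
    return e / e.sum()

# Softmax expresses uncertainty only when the top logits are nearly equal.
p_equal = softmax(np.array([3.0, 3.0]))  # z0 = z1: both classes get 0.5
p_gap   = softmax(np.array([3.0, 1.0]))  # a logit gap of only 2: confident in class 0
```

Moving even slightly off the diagonal z0 \u2248 z1 pushes the output rapidly toward certainty, which is the \u201ccertain almost everywhere\u201d behavior discussed next.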
In higher\ndimensional spaces, where M > 2, the softmax is uncertain between any two classes i and j (i.e.\np\u03c8(i) \u2248 p\u03c8(j)) if and only if the corresponding logits zi and zj are approximately equal. The region\nzi \u2248 zj is \u201calmost\u201d a hyperplane, an (M \u2212 1)-dimensional subspace of RM which has negligible volume.\nThus, from the perspective of representing uncertainty, the softmax suffers from a fatal \ufb02aw: it is\ncertain almost everywhere in logit space. For very accurate models applied to non-adversarial inputs\n(the classical setting considered by machine learning), this is acceptable since the model will typically\nbe correct and con\ufb01dent. But on adversarial inputs, for which the model is incorrect, it will often\nstill be con\ufb01dent; indeed it is this (over)con\ufb01dence that is the central challenge posed by adversarial\nexamples. We will see further evidence of this phenomenon in Section 3.\n\n2.2 Sigmoid Activation\n\nWe now propose an alternative way to map logits to class probabilities. The essential idea is simple:\nthe model maps logits to the elements of a codeword and assigns probability to class k as proportional\nto how positively correlated the model output is to Ck.\n\np\u03c3(k) = max(\u03c3(z) \u00b7 Ck, 0) / \u2211_{i=1}^{M} max(\u03c3(z) \u00b7 Ci, 0)    (2)\n\nHere, \u03c3(z) and Ck are length-N vectors, and \u03c3 is some sigmoid function which is applied element-wise\nto the logits. For example, the logistic function has kth output \u03c3k(z) = 1/(1 + exp(\u2212zk)), taking\nvalues in (0, 1). Another possible choice for \u03c3 is the tanh function taking values in (\u22121, 1). When C\ntakes values in {0, 1}, then the logistic function is appropriate to use; in this case, the max operation is\nunnecessary. However, if C takes values in {\u22121, 1}, then the tanh function is used and the max operator\nis needed to avoid negative probabilities. 
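A minimal sketch of the decoding rule in Eq. (2), using the tanh activation and a hypothetical 2-class, 4-bit {\u22121, +1} codebook; the function name and toy codewords are our own, not from the paper:

```python
import numpy as np

def decode_probs(z, C):
    """Sigmoid decoding (Eq. 2): class probability is proportional to the
    clamped correlation between the model output tanh(z) and each codeword."""
    corr = np.maximum(C @ np.tanh(z), 0.0)  # max(., 0) avoids negative probabilities
    return corr / corr.sum()

# Toy 2-class codebook with Hamming distance 4.
C = np.array([[ 1.0,  1.0, -1.0, -1.0],
              [-1.0, -1.0,  1.0,  1.0]])

z_clean = np.array([4.0, 4.0, -4.0, -4.0])  # logits matching codeword 0
z_flip  = np.array([4.0, 4.0, -4.0,  4.0])  # one logit adversarially flipped

p_clean = decode_probs(z_clean, C)
p_flip  = decode_probs(z_flip, C)  # class 0 still wins: the flipped bit is 'corrected'
```

Flipping a single logit leaves class 0's clamped correlation strictly largest, so the prediction survives one corrupted bit; with one-hot codewords (Hamming distance 2), the same single-logit change already suffices to create uncertainty.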
Equation (2) is intuitive; it computes the probability of a\nclass as proportional to how similar (correlated) the model\u2019s predicted code \u03c3(z) is to each codeword\nin C. Note that (2) is a generalization of (1) and reduces to it for the case of one-hot coding. If one\nsets \u03c3 = \u03c8 in (2) and uses C = IM , then it is easily seen that p\u03c3(k) = p\u03c8(k) for all k. Figure 1(b)\nillustrates p\u03c3. The codeword assignment to classes, axes and colors in this \ufb01gure are identical in\nmeaning to those for Figure 1(a). Two crucial points emerge from this \ufb01gure. One, in contrast to\np\u03c8, p\u03c3 allocates non-trivial volume in logit space to uncertainty, i.e. where p\u03c3(0) \u2248 0.5. Two, p\u03c3\neffectively shrinks the attack surface available to an attacker seeking to craft adversarial examples.\nFigure 1(c) illustrates this. Suppose the input x to the network has corresponding logits given by\nthe magenta circle. x is such that p\u03c8(0|x) \u2248 p\u03c3(0|x) \u2248 1. Now consider 3 different adversarial\nperturbations of x to some x\u2032, whose corresponding logit perturbations are shown by the 3 arrows in\nthe \ufb01gure. For the perturbations given by the black arrows, p\u03c8(1|x\u2032) \u2248 1, i.e., the class label under\nsoftmax is con\ufb01dently \ufb02ipped; but p\u03c3(1|x\u2032) \u2248 0.5, i.e., under \u03c3 the model is now uncertain. Only the\nperturbation indicated by the gray (diagonal) arrow leads to p\u03c3(1|x\u2032) \u2248 1 (as well as p\u03c8(1|x\u2032) \u2248 1).\nFewer perturbation directions in logit space can (con\ufb01dently) fool the classi\ufb01er; the adversary must\nnow search for perturbations to x which simultaneously decrease z0 while increasing z1.\n\nFigure 1: Probability of class 0 as a function of logits, for the (a) softmax activation and (b) sigmoid\ndecoding scheme. (c). 
Movements in the space of logits from the original point (magenta circle) to\nnew points (given by arrows); only the perturbation given by the gray arrow con\ufb01dently fools the\nsigmoid decoder, while all perturbations con\ufb01dently fool the softmax decoder.\n\n2.3 Hamming distance\n\nThe Hamming distance between any two binary codewords x and y, denoted d(x, y), is simply\n|x \u2212 y|0, where | \u00b7 |0 denotes the L0 norm. The Hamming distance of codebook C is de\ufb01ned as\n\nd = min{d(x, y) : x, y \u2208 C, x \u2260 y}    (3)\n\nThe standard one-hot coding scheme has a Hamming distance of only 2. Practically, this means that\nif the adversary can suf\ufb01ciently alter even a single logit, an error may occur. In Figure 1(b), for\nexample, changing a single logit (i.e. an axis-aligned perturbation) is suf\ufb01cient to make the classi\ufb01er\nuncertain. Ideally, we want the classi\ufb01er to be robust to changes to multiple logits.\nWhat happens if we increase the Hamming distance between codewords? Consider the M = N = 32\ncase where each of 32 classes is represented by a 32-bit codeword (meaning that the DNN has 32\noutputs versus 2 for the case in Figure 1). Figure 2 shows the probability of class 0 as a function of a\n3-dimensional slice of the logits (z29, z30, z31), where the other logits zi are \ufb01xed to 3\u03b3(C(0, i)) where\nC(0, i) denotes the ith element of codeword 0 and \u03b3(x) is de\ufb01ned as 1 if x > 0 and \u22121 otherwise.\n(In other words, the \ufb01xed logits are set to be consistent with class 0.) The colors in this \ufb01gure are\nidentical in meaning to those in Figure 1. For reference, the magenta circle shown has probability\n> 0.999 of being in class 0. The left-most column shows the probability of class 0 under the softmax\nactivation and code C = I32. The middle column uses the sigmoid decoding scheme with logistic\nactivation and code C = I32. 
The right-most column uses the sigmoid decoding scheme with tanh\nactivation and code C = H32, a Hadamard code of length 32. Within each column, two different\nviews of the same logit space are shown. Note that for the softmax decoder, local perturbations\nwithin the logit space exist (e.g., moving in an axis-aligned direction from the magenta point) which\nreduce the probability of class 0 to near 0. Also note how the softmax decoder has a very small\nregion corresponding to uncertainty (i.e., probability near 0.5); as the logits vary, the model rapidly\ntransitions from assigning probability \u2248 1 to class 0 to assigning probability \u2248 0. In contrast, the\nlogistic decoder assigns far more volume to uncertainty. The Hadamard-code-based decoder is even\nmore robust; it still assigns large probability to class 0 despite large changes to multiple logits.\nFigures 1 and 2 illustrate the fact that with the softmax, a \u201csmall\u201d change in logits \u03b4z can lead the\nmodel from being very certain of one class to being very uncertain of that class (and indeed, certain\nof another); sigmoid decoding with Hadamard codes greatly alleviates this problem. How does this\nrelate to small changes in the input, \u03b4x? Let J denote the Jacobian matrix of logits z with respect to\ninput x. By Taylor\u2019s theorem we know that \u03b4z \u2248 J \u00b7 \u03b4x and so ||\u03b4z|| \u2264 ||J|| \u00b7 ||\u03b4x|| where || \u00b7 ||\ndenotes Euclidean norm for a vector and operator norm for a matrix. Assume that ||J|| is comparable\nacross softmax and sigmoid schemes and choices of C (a fact we have empirically observed across\nseveral datasets). Then, in order to gain robustness to perturbations \u03b4x, we can try to reduce ||J||;\nindeed this is the effect of most existing adversarial defenses. 
With our approach, in contrast, a larger\n\u03b4z is needed to move in logit-space from one class to another; hence a larger \u03b4x is needed.\n\nFigure 2: Probability of class 0 as a function of logits, for different choices of output activation and\ncode, for a 32-class classi\ufb01cation problem: (leftmost column) softmax (Eq 1) with C = I32,\n(middle column) sigmoid decoder (Eq 2) with logistic activation and C = I32, (rightmost column)\nsigmoid decoder (Eq 2) with tanh activation and C = H32, a Hadamard code. 29 logit values are\n\ufb01xed and the remaining logits (here denoted z29, z30, z31) are allowed to vary. The colorbar is the\nsame as in Figure 1. Different choices of output activation and output code result in fundamentally\ndifferent mappings of Euclidean logit space to class probabilities. Further details are in the main text.\n\n2.4 Code design\n\nWe now turn to the choice of C, which has been studied under the name of error correcting output\ncodes (ECOC), popularized in the machine learning literature by [19]. The work therein and much of\nthe work that has followed on ECOC focused on the potential gains in generalization in multi-class\nsettings over conventional one-hot (equivalently, one-vs-rest) coding. Much of this research used\nECOCs with decision trees or shallow neural networks. With the advent of deep learning and vastly\nimproved accuracies even with conventional one-hot encodings, ECOCs are not in mainstream use.\nSeveral methods exist to create \u201cgood\u201d ECOCs, which focus primarily on achieving a large\nHamming distance between codes. A library implementing various heuristics, some inspired by\ncoding theory, is available in [20]. 
Ideally, we would use C with the largest possible d from (3)\n(though other factors, like good column separation, are also important). We \ufb01rst state a theorem\nwhich we use to select a near-optimal choice for C.\n\nTheorem 1 (Plotkin\u2019s Bound). For an M \u00d7 N coding matrix C, d \u2264 \u230a(N/2) \u00b7 M/(M\u22121)\u230b\n\nTheorem 1 upper bounds the Hamming distance of C. For M large and N even, the bound approaches\nN/2, which can be achieved if we choose C to be a Hadamard matrix. This choice has an important\nfortunate bene\ufb01t. Recall that we would like to obtain probability estimates from our output, not just a\nclassi\ufb01cation decision. We say that our probability estimation is admissible if, whenever the network\noutputs any given codeword exactly, say C(j), the probability as computed by (2) is p\u03c3(j) = 1. If C\nis non-orthogonal, then C(j) may have positive correlation with C(i), in which case p\u03c3(j) < 1 even\nif the network outputs C(j). Thus, orthogonal C is required for admissible probability estimates.\nIn this paper, we will use the notation HP to denote a P \u00d7 P Hadamard matrix. When there are\nmore codewords P available than actual classes M (e.g., P = 16, M = 10 for CIFAR10), we simply\nselect the \ufb01rst M rows of HP as codewords. More sophisticated optimizations are possible which\nalso examine, for example, the correlation structure of the columns; we leave this for future work.\n\n2.5 Bit independence\n\nIn a typical DNN, a single network outputs all the bits of the output code in its \ufb01nal layer (e.g.,\nfor MNIST, the \ufb01nal layer would comprise 10 neurons). 
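The codebook construction of Section 2.4 can be sketched as follows; `sylvester_hadamard` and `min_hamming` are our own helper names, and the Sylvester recursion is one standard way to build the P \u00d7 P Hadamard matrices denoted HP (assuming P is a power of 2):

```python
import numpy as np

def sylvester_hadamard(P):
    """P x P Hadamard matrix (P a power of 2) via the Sylvester construction."""
    H = np.array([[1]])
    while H.shape[0] < P:
        H = np.block([[H, H], [H, -H]])  # double the order at each step
    return H

def min_hamming(C):
    """Minimum pairwise Hamming distance d of a {-1,+1} codebook (Eq. 3)."""
    M = len(C)
    return min(int((C[i] != C[j]).sum()) for i in range(M) for j in range(i + 1, M))

# e.g. CIFAR10: take the first M = 10 rows of H_16 as codewords.
C = sylvester_hadamard(16)[:10]
d = min_hamming(C)  # distinct Hadamard rows differ in exactly N/2 = 8 positions
```

For M = 10 and N = 16 this meets Plotkin's bound, \u230a(16/2) \u00b7 10/9\u230b = 8, and the rows are mutually orthogonal, which the text argues is needed for admissible probability estimates.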
Table 1: Table characterizing various models tested in this paper\n\nModel          Architecture  Code  Probability estimation  \u03c3\nSoftmax        Standard      I10   eq (1)                  softmax\nLogistic       Standard      I10   eq (2)                  logistic\nTanh16         Standard      H16   eq (2)                  tanh\nLogisticEns10  Ensemble      I10   eq (2)                  logistic\nTanhEns16      Ensemble      H16   eq (2)                  tanh\nTanhEns32      Ensemble      H32   eq (2)                  tanh\nTanhEns64      Ensemble      H64   eq (2)                  tanh\nMadry          Standard      I10   eq (1)                  softmax\n\nHowever, it is also possible to\nlearn an ensemble of networks, each of which outputs a few bits of the output code. The errors in\nthe individual bits that are made by a DNN or an ensemble method are often correlated; an input\ncausing an error in one particular output often correlates with errors in other outputs. Such correlations\nreduce the effective Hamming distance between codewords since the dependent error process means\nthat multiple bit \ufb02ips are likely to co-occur. Therefore, promoting diversity across the constituent\nlearners is crucial and is generally a priority in ensemble-based methods; various heuristics have\nbeen proposed, including training each classi\ufb01er on a subset of the input features [21] or rotating\nthe feature space [22]. The problem of correlation across ensemble members is more serious when\neach member solves the same classi\ufb01cation problem; however, for ECOCs, each ensemble member\nj solves a different classi\ufb01cation problem (speci\ufb01ed by the jth column of C). 
Thus we \ufb01nd that\nit is suf\ufb01cient to simply train an ensemble of networks, where each member outputs B \u226a N bits\n(neurons) of the output code; a diagram of the architecture used for experiments in this paper is given\nin Figures S1 and S2 in the supplement. In this paper, for codes whose length is a multiple of 4 (all\nHadamard codes), we set B = N/4. Else, we set B = N/2. Since each ensemble member shares no\nparameters with any other, the resulting architecture has reduced error correlations compared to a\ntypical fully connected output layer.\n\n3 Experiments\n\nOur approach is general and dataset-agnostic; here we apply it to the MNIST and CIFAR10 datasets.\nAll of our code is available at [23]. MNIST is still widely studied in adversarial machine learning\nresearch since an adversarially robust solution remains elusive. We conduct experiments with a series\nof models which vary the choice of code C, the length of the codes N, and the activation function\napplied to the logits. Our training and adversarial attack procedures are standard; details are given\nin the supplement. Table 1 summarizes the various models used in this paper. \u201cStandard\u201d refers\nto a standard convolutional architecture with a dense fully connected output layer illustrated in the\nsupplement in Figure S1, while \u201censemble\u201d refers to the setup described in Section 2.5 and illustrated\nin Figure S2. The \ufb01nal column describes the sigmoid function used in Eq (2). The \u201cMadry\u201d model\nis the adversarially trained model in [8]. Table 2 shows the results of our experiments on MNIST.\nThe \ufb01rst column contains a descriptive name for the model (which is detailed in Table 1). Column\n2 shows the total number of parameters in the model. Column 3 reports accuracy on the test set.\nThe remaining columns show results on various attacks; all such results are in the white-box setting\n(adversary has full access to the entire model). 
Columns 4 and 5 show results for the projected\ngradient descent (PGD, \u03b5 = 0.3) and Carlini-Wagner (CW) attacks [24], respectively. These columns\nshow the fraction of adversarially crafted inputs which the model correctly classi\ufb01es, i.e., examples\nwhich fail to be truly adversarial. Column 6 contains results of the \u201cblind spot attack\u201d [25], which\n\ufb01rst scales images by a constant \u03b1 close to 1 before applying the Carlini-Wagner attack. Column 7\nshows results for the \u201cDistributionally Adversarial Attack\u201d (DAA) [26] (which is based on the Madry\nChallenge leaderboard [27]). We choose this attack since it appeared (as of mid 2019) near or atop the\nleaderboards for both MNIST and CIFAR10 datasets. Column 8 shows the fraction of random inputs\nfor which the model\u2019s maximum class probability is smaller than 0.9; here, a random input is one\nwhere each pixel is independently and uniformly chosen in (0, 1). Column 9 shows the accuracy on\ntest inputs where each pixel is independently corrupted by additive uniform noise in [\u2212\u03b3, \u03b3], where\n\u03b3 = 1 (0.1) for MNIST (CIFAR10) and clipped to lie within the valid input pixel range, e.g. (0, 1).\n\nSeveral points of interest emerge from the results in Table 2. One, the Logistic model is superior to\nthe Softmax model due to the phenomena illustrated in Figure 1(b) and (c); in particular, the result\non Random attacks indicates that the Logistic model indeed goes a long way towards reducing the\nirrational overcon\ufb01dence of the softmax activation. Two, Tanh16\u2019s superior performance over Logistic shows\nthe advantage of using a code with larger Hamming distance. Three, LogisticEns10\u2019s vastly improved\nperformance on Random attacks shows the importance of reduced correlation among the output bits\n(described in Section 2.5). 
Four, TanhEns16 shows a marked improvement across all dimensions over\nall predecessors; it combines the larger Hamming distance with reduced bit correlation. TanhEns32\nshows results for 32-bit output codes; we \ufb01nd that performance appears to plateau and that increased\ncode length confers no meaningful additional bene\ufb01t for this dataset. In general, we might expect\ndiminishing gains in performance with increasing code length relative to number of classes. Finally,\ncomparing all the ensemble (ending in \u201cEns\u201d) models to the Madry model, we see the latter uses\nmany more parameters. The TanhEns16 model has superior performance to Madry\u2019s model on all\nattacks, sometimes signi\ufb01cantly so. Also note that while the Madry model\u2019s benign accuracy is much\nlower than the state-of-the-art for MNIST, the TanhEns16 model enjoys excellent accuracy.\nFigure 3(a)-(c) compares the probability distributions of various models on MNIST for (a) benign,\n(b) projected gradient descent (PGD) generated adversarial, and (c) random examples. In more detail,\nfor each example x, we compute the probability that the model assigns to the most probable class\nlabel of x. We compute and plot the distribution of these probabilities over a randomly chosen set of\n2000 test examples of MNIST. Figure 3(a) shows that all models assign high probability to nearly all\n(benign) inputs, which is desirable since all models have high test set accuracy. Figure 3(b) compares\nmodels on adversarial examples. The TanhEns16 and TanhEns32 models tend to (correctly) be less\ncertain than the other models (note that these models have bimodal distributions; the lower (upper)\nmode tends to correspond to adversarial examples that do (not) resemble the nominal class given\nby the model). Figure 3(c) compares models on randomly generated inputs. 
While the Softmax\nand Madry models are often certain of their decisions, the other models, particularly the TanhEns16\nand TanhEns32, correctly put most mass on low probabilities. In summary, Figure 3 shows that\nthe TanhEns model family has two highly desirable properties: 1) like the Softmax and Madry\nmodels, it is very certain about the (correct) label on benign examples, and 2) unlike the Softmax and\nMadry models, it is often uncertain about the (incorrect) label on adversarial and random examples.\nFurthermore, when TanhEns is certain (uncertain), the example often resembles the target class (no\nrecognizable class); see Figures S2 and S3 in the supplement for sample illustrations. Taken together,\nthese facts suggest that the TanhEns model class yields very good probability estimates.\nTable 3 is analogous to Table 2, but presents results for CIFAR10. Figure 3(d)-(f) shows the probability\ndistributions for CIFAR10. For CIFAR10, our baseline is Madry\u2019s adversarially trained CIFAR10\nmodel. We notice results that are all qualitatively similar to those in the MNIST case; again, the\nTanhEns model family has strong performance and is competitive with or outperforms Madry\u2019s\nmodel. A key distinction is that now, 32- and 64-bit codes show clear improvements over 16-bit codes.\nFurther improvements to the TanhEns performance are likely possible by using more modern network\narchitectures; we leave this for future work.\nFinally, Figure 4 plots model accuracy versus the PGD L\u221e perturbation limit \u03b5, for both (a) MNIST\nand (b) CIFAR10. The TanhEns models dominate Madry\u2019s model. Notably for MNIST, the accuracy\ndrops signi\ufb01cantly around \u03b5 = 0.5; this is to be expected since at this value of \u03b5, a perturbation which\nsimply sets all pixel values to 0.5 (thereby creating a uniformly grayscale image) will obscure the true\nclass. 
Because model accuracy rapidly drops to near 0 as ε grows, the figure provides crucial evidence that our approach has genuine robustness to adversarial attack and is not relying on “gradient-masking” [28]. Also, the TanhEns models significantly outperform Madry’s model for ε > 0.3 (ε > 0.031) on MNIST (CIFAR10), indicating that our model has an intrinsic and wide-ranging robustness which is not predicated on adversarial training at a specific level of ε.

4 Conclusion

We have presented a simple approach to improving model robustness that is centered around three core ideas. One, moving from softmax to sigmoid decoding means that a non-trivial volume of the Euclidean logit space is now allocated towards model uncertainty. In crafting convincing adversarial perturbations, the adversary must now guard against landing in such regions, i.e. its attack surface is smaller. Two, in changing the set of codewords from IM to one with larger Hamming distance, the

Figure 3: Distribution of probabilities assigned to the most probable class on the test set of (a-c) MNIST and (d-f) CIFAR10, by various models. LogisticEns10 and TanhEns models are abbreviated as LEns10 and TEns, respectively. The x axis is the probability assigned by the classifier; the y axis is the probability density. The legend in the first column is common to all figures. (a) and (d): distribution of probabilities on benign (non-adversarial) examples. (b) and (e): distribution of probabilities on adversarial examples. (c) and (f): distribution of probabilities on randomly generated examples where each pixel is sampled independently and uniformly in [0, 1].

Figure 4: Model accuracy (y-axis) versus perturbation strength ε (x-axis) for (a) MNIST and (b) CIFAR10.
LogisticEns10 and TanhEns models are abbreviated as LEns10 and TEns, respectively. Curves are based on attacking a random sample of 200 test samples.

Table 2: Accuracies of various models trained on MNIST against various attacks. “-” indicates the experiment was not performed.

Model           # Params    Benign   PGD    CW     BSA (α=0.8)   DAA    Rand   +U(-1,1)
Softmax         330,570     .9918    .082   .540   .180          -      .270   .785
Logistic        330,570     .9933    .093   .660   .210          -      .684   .829
Tanh16          330,960     .9931    .421   .790   .320          -      .673   .798
LogisticEns10   205,130     .9933    .382   .880   .480          -      .905   .812
TanhEns16       401,168     .9948    .929   1.0    1.0           .923   .988   .827
TanhEns32       437,536     .9951    .898   1.0    1.0           -      1.0    .858
Madry           3,274,634   .9853    .925   .840   .520          .888   .351   .150

Table 3: Accuracies of various models trained on CIFAR10 against various attacks.
\u201c-\u201d indicates\nexperiment was not performed.\n\nModel\n\n# Params Benign\n\nPGD CW\n\nSoftmax\nLogistic\nTanh16\n\nLogisticEns10\n\nTanhEns16\nTanhEns32\nTanhEns64\n\nMadry\n\n775, 818\n775, 818\n776, 208\n1, 197, 978\n2, 317, 456\n2, 631, 456\n3, 259, 456\n45, 901, 914\n\n.864\n.865\n.866\n.877\n.888\n.891\n.896\n.871\n\n.070\n.060\n.099\n.100\n.515\n.574\n.601\n.470\n\n.080\n.140\n.080\n.240\n.760\n.780\n.760\n.080\n\nBSA\n\n\u03b1 = 0.8\n\n.040\n.100\n.100\n.140\n.760\n.770\n.760\n0.0\n\nDAA Rand\n\n+U(-.1,.1)\n\n-\n-\n-\n-\n\n.514\n.539\n.543\n.447\n\n.404\n.492\n.700\n.495\n.999\n.989\n1.0\n.981\n\n.815\n.839\n.832\n.852\n.842\n.869\n.875\n.856\n\nEuclidean distance in logit space between any two regions of high probability for any given class\nbecomes larger. This means that the adversary\u2019s perturbations now need to be larger in magnitude to\nattain the same level of con\ufb01dence. Three, in learning output bits with multiple disjoint networks,\nwe reduce correlations between outputs. Such correlations are implicitly capitalized on by common\nattack algorithms. This is because many attacks search for a perturbation by following the loss\ngradient, and the loss will commonly increase most rapidly in directions where the perturbation\nimpacts multiple (correlated) logits simultaneously. Importantly, since it simply alters the output\nencoding but otherwise uses completely standard architectural components (i.e., convolutional and\ndensely connected layers), the primary source of our approach\u2019s robustness does not appear to be\nobfuscated gradients [15].\nThe learner that results is surprisingly robust to a variety of non-benign inputs. Our approach\nhas many interesting and complementary advantages to existing approaches to adversarial defense.\nIt is extremely simple and integrates seamlessly with existing machine learning pipelines. 
It is extremely fast to train (e.g., it does not rely on in-the-loop adversarial example generation) and at inference time (compared to, e.g., manifold-based methods or generative models, which often involve a potentially costly step of computing the probability of the input under some underlying model). In the models for MNIST and CIFAR10 studied in this paper, our networks use far fewer parameters than the Madry model. Because our model is not adversarially trained with respect to any Lp norm attack, it appears to have strong performance across a variety of adversarial and random attacks. This bodes well for our approach’s ability to generalize to future attacks. Another significant advantage is that our approach incurs no apparent loss in benign test set accuracy, in major contrast to other adversarial defenses. Finally, further gains are achievable by increasing the diversity across ensemble members, such as training each ensemble member on different rotations [22] or with distinct architectures.
Our model also yields vastly improved probability estimates on adversarial and garbage examples, tending to give them low probabilities; this is particularly interesting since attempts at using Bayesian neural networks to improve probability estimation on adversarial examples have not found clear success yet [29]. It is well known that using the standard softmax to convert logits to probabilities leads to poor estimates [30]; approaches such as Platt scaling which improve probability calibration on the training manifold still produce overconfident estimates on adversarial and noisy inputs. While we have not carefully studied our model’s probability calibration, we have presented strong empirical evidence suggesting much improved estimates should be achievable both on and off the training manifold.
One important avenue for further study is to consider datasets of larger input dimensionality, such as ImageNet.
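The softmax overconfidence noted above has a simple mechanical source: softmax renormalizes, so even uniformly low logits become confident-looking class probabilities, whereas independently decoded sigmoid outputs are free to all be small. A toy sketch (illustrative only, not the paper's exact decoder):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Uniformly low logits: the input resembles none of the four classes.
logits = np.array([-4.0, -4.0, -4.0, -4.0])

# Softmax must allocate all probability mass among the classes anyway.
print(softmax(logits))   # [0.25 0.25 0.25 0.25]

# Independent sigmoids can all stay small, signalling "no class".
print(sigmoid(logits))   # ~[0.018 0.018 0.018 0.018]
```

On garbage or off-manifold inputs, this difference is what allows sigmoid-style decodings to assign every class a low probability instead of being forced into a confident prediction.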
It may be possible that in very high input dimensions, adversarial perturbations exist that can still surmount the larger Hamming distances afforded by ECOCs (though our results here provide hope that the labels of any such examples will typically have lower probability). However, a counter to this might simply involve using longer codes; our experiments with CIFAR10 indicate this could be a viable strategy. Such an approach would trade off training time for robustness, reminiscent of the tradeoff in communications theory between data rate and tolerance to channel errors. A second avenue for further research is to combine our idea with existing methods based on adversarial training or with provable approaches to certified robustness [11]. We believe that our approach will make any other adversarial defense much stronger.

References

[1] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.

[2] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical black-box attacks against deep learning systems using adversarial examples,” arXiv preprint arXiv:1602.02697, 2016.

[3] “DARPA program on guaranteeing AI robustness against deception,” 2019, [Online; accessed 01-May-2019]. [Online]. Available: https://www.darpa.mil/attachments/GARD_ProposersDay.pdf

[4] A. Ilyas, A. Jalal, E. Asteri, C. Daskalakis, and A. G. Dimakis, “The robust manifold defense: Adversarial training using generative models,” arXiv preprint arXiv:1712.09196, 2017.

[5] W. Xu, D. Evans, and Y. Qi, “Feature squeezing: Detecting adversarial examples in deep neural networks,” arXiv preprint arXiv:1704.01155, 2017.

[6] J. Buckman, A. Roy, C. Raffel, and I.
Goodfellow, \u201cThermometer encoding: One hot way to\n\nresist adversarial examples,\u201d International Conference on Learning Representations, 2018.\n\n[7] G. S. Dhillon, K. Azizzadenesheli, Z. C. Lipton, J. Bernstein, J. Kossai\ufb01, A. Khanna, and\nA. Anandkumar, \u201cStochastic activation pruning for robust adversarial defense,\u201d arXiv preprint\narXiv:1803.01442, 2018.\n\n[8] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, \u201cTowards deep learning models\n\nresistant to adversarial attacks,\u201d arXiv preprint arXiv:1706.06083, 2017.\n\n[9] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman, \u201cPixeldefend: Leveraging\ngenerative models to understand and defend against adversarial examples,\u201d arXiv preprint\narXiv:1710.10766, 2017.\n\n[10] G. Tao, S. Ma, Y. Liu, and X. Zhang, \u201cAttacks meet interpretability: Attribute-steered detection\nof adversarial samples,\u201d in Advances in Neural Information Processing Systems, 2018, pp.\n7717\u20137728.\n\n[11] E. Wong and J. Z. Kolter, \u201cProvable defenses against adversarial examples via the convex outer\n\nadversarial polytope,\u201d arXiv preprint arXiv:1711.00851, 2017.\n\n[12] E. Wong, F. Schmidt, J. H. Metzen, and J. Z. Kolter, \u201cScaling provable adversarial defenses,\u201d in\n\nAdvances in Neural Information Processing Systems, 2018, pp. 8400\u20138409.\n\n[13] J. M. Cohen, E. Rosenfeld, and J. Z. Kolter, \u201cCerti\ufb01ed adversarial robustness via randomized\n\nsmoothing,\u201d arXiv preprint arXiv:1902.02918, 2019.\n\n[14] F. Tram\u00e8r, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, \u201cEnsemble\n\nadversarial training: Attacks and defenses,\u201d arXiv preprint arXiv:1705.07204, 2017.\n\n[15] A. Athalye, N. Carlini, and D. Wagner, \u201cObfuscated gradients give a false sense of security:\n\nCircumventing defenses to adversarial examples,\u201d arXiv preprint arXiv:1802.00420, 2018.\n\n[16] L. Engstrom, B. Tran, D. Tsipras, L. 
Schmidt, and A. Madry, “A rotation and a translation suffice: Fooling CNNs with simple transformations,” arXiv preprint arXiv:1712.02779, 2017.

[17] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” in Proceedings of the IEEE CVPR, 2015, pp. 427–436.

[18] L. Schott, J. Rauber, M. Bethge, and W. Brendel, “Towards the first adversarially robust neural network model on MNIST,” arXiv preprint arXiv:1805.09190, 2018.

[19] T. G. Dietterich and G. Bakiri, “Solving multiclass learning problems via error-correcting output codes,” Journal of Artificial Intelligence Research, vol. 2, pp. 263–286, 1994.

[20] S. Escalera, O. Pujol, and P. Radeva, “Error-correcting output codes library,” Journal of Machine Learning Research, vol. 11, pp. 661–664, 2010.

[21] A. Tsymbal, M. Pechenizkiy, and P. Cunningham, “Diversity in search strategies for ensemble feature selection,” Information Fusion, vol. 6, no. 1, pp. 83–98, 2005.

[22] R. Blaser and P. Fryzlewicz, “Random rotation ensembles,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 126–151, 2016.

[23] [Online]. Available: https://github.com/Gunjan108/robust-ecoc/

[24] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 IEEE Symposium on Security and Privacy, 2017, pp. 39–57.

[25] H. Zhang, H. Chen, Z. Song, D. Boning, I. S. Dhillon, and C.-J. Hsieh, “The limitations of adversarial training and the blind-spot attack,” arXiv preprint arXiv:1901.04684, 2019.

[26] T. Zheng, C. Chen, and K. Ren, “Distributionally adversarial attack,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp.
2253\u20132260.\n\n[27] \u201cMadry CIFAR10 challenge,\u201d https://github.com/MadryLab/cifar10_challenge, accessed: 2019-\n\n04-30.\n\n[28] N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. Goodfellow, and\n\nA. Madry, \u201cOn evaluating adversarial robustness,\u201d arXiv preprint arXiv:1902.06705, 2019.\n\n[29] L. Smith and Y. Gal, \u201cUnderstanding measures of uncertainty for adversarial example detection,\u201d\n\narXiv preprint arXiv:1803.08533, 2018.\n\n[30] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, \u201cOn calibration of modern neural networks,\u201d\nin Proceedings of the 34th International Conference on Machine Learning-Volume 70, 2017, pp.\n1321\u20131330.\n\n11\n\n\f", "award": [], "sourceid": 4658, "authors": [{"given_name": "Gunjan", "family_name": "Verma", "institution": "ARL"}, {"given_name": "Ananthram", "family_name": "Swami", "institution": "Army Research Laboratory, Adelphi"}]}