{"title": "Metric Learning for Adversarial Robustness", "book": "Advances in Neural Information Processing Systems", "page_first": 480, "page_last": 491, "abstract": "Deep networks are well-known to be fragile to adversarial attacks. We conduct an empirical analysis of deep representations under the state-of-the-art attack method called PGD, and find that the attack causes the internal representation to shift closer to the ``false'' class. Motivated by this observation, we propose to regularize the representation space under attack with metric learning to produce more robust classifiers. By carefully sampling examples for metric learning, our learned representation not only increases robustness, but also detects previously unseen adversarial samples. Quantitative experiments show improvement of robustness accuracy by up to 4% and detection efficiency by up to 6% according to Area Under Curve score over prior work. The code of our work is available at https://github.com/columbia/Metric_Learning_Adversarial_Robustness.", "full_text": "Metric Learning for Adversarial Robustness\n\nChengzhi Mao\n\nColumbia University\n\ncm3797@columbia.edu\n\nZiyuan Zhong\n\nColumbia University\n\nziyuan.zhong@columbia.edu\n\nJunfeng Yang\n\nColumbia University\n\nCarl Vondrick\n\nColumbia University\n\nBaishakhi Ray\n\nColumbia University\n\njunfeng@cs.columbia.edu\n\nvondrick@cs.columbia.edu\n\nrayb@cs.columbia.edu\n\nAbstract\n\nDeep networks are well-known to be fragile to adversarial attacks. We conduct an\nempirical analysis of deep representations under the state-of-the-art attack method\ncalled PGD, and \ufb01nd that the attack causes the internal representation to shift\ncloser to the \u201cfalse\u201d class. Motivated by this observation, we propose to regular-\nize the representation space under attack with metric learning to produce more\nrobust classi\ufb01ers. 
By carefully sampling examples for metric learning, our learned representation not only increases robustness, but also detects previously unseen adversarial samples. Quantitative experiments show improvement of robustness accuracy by up to 4% and detection efficiency by up to 6% according to Area Under Curve score over prior work. The code of our work is available at https://github.com/columbia/Metric_Learning_Adversarial_Robustness.

1 Introduction

Deep networks achieve impressive accuracy and wide adoption in computer vision [17], speech recognition [14], and natural language processing [21]. Nevertheless, their performance degrades under adversarial attacks, where natural examples are perturbed with human-imperceptible, carefully crafted noise [35, 23, 12, 18]. This degradation raises serious concern, especially when we deploy deep networks to safety- and reliability-critical applications [29, 43, 41, 20, 36]. Extensive efforts [37, 31, 47, 7, 25, 12, 35, 48] have been made to study and enhance the robustness of deep networks against adversarial attacks, among which a defense method called adversarial training achieves the state-of-the-art adversarial robustness [19, 16, 46, 49].

To better understand adversarial attacks, we first conduct an empirical analysis of the latent representations under attack for both defended [19, 16] and undefended image classification models. Following the visualization technique in [28, 30, 33], we investigate what happens to the latent representations as they undergo attack. Our results show that the attack shifts the latent representations of adversarial samples away from their true class and closer to the false class.
The adversarial representations often spread across the false class distribution in such a way that the natural images of the false class become indistinguishable from the adversarial images.

Motivated by this empirical observation, we propose to add an additional constraint to the model using metric learning [15, 32, 44] to produce more robust classifiers. Specifically, we add a triplet loss term on the latent representations of adversarial samples to the original loss function. However, the naïve implementation of triplet loss is not effective because the pairwise distances of a natural sample x_a, its adversarial sample x'_a, and a randomly selected natural sample of the false class x_n are hugely uneven. Specifically, given considerable data variance in the false class, x_n is often far from the decision boundary where x'_a resides; therefore x_n is too easy as a negative sample. To address this problem, we select the negative example for each triplet as the closest example in a mini-batch of training data. In addition, we randomly select another sample x_p from the correct class as the positive example of the triplet.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Our main contribution is a simple and effective metric learning method, Triplet Loss Adversarial (TLA) training, that leverages triplet loss to produce more robust classifiers. TLA brings the natural and adversarial samples of the same class close together while enlarging the margins between different classes (Sec. 3). It requires no change to the model architecture and thus can improve the robustness of most off-the-shelf deep networks without additional overhead during inference.
Evaluation on popular datasets, model architectures, and untargeted, state-of-the-art attacks, including projected gradient descent (PGD), shows that our method classifies adversarial samples more accurately, by up to 4%, than prior robust training methods [16, 19], and makes adversarial attack detection [52] more effective, by up to 6%, according to the Area Under Curve (AUC) score.

2 Related Work

The fact that adversarial noise can fool deep networks was first discovered by Szegedy et al. [35], which started the era of adversarial attacks and defenses for deep networks. Goodfellow et al. [12] then proposed an attack, the fast gradient sign method (FGSM), and also constructed a defense model by training on FGSM adversarial examples. More effective attacks, including C&W [5], PGD [19], BIM [18], MIM [9], DeepFool [23], and JSMA [27], have since been proposed to fool deep networks, which further encourages research on defense methods.

Madry et al. [19] proposed adversarial training (AT), which dynamically trains the model on generated PGD attacks, achieving the first empirically robust adversarial classifier on CIFAR-10. Since then, AT has become the foundation of state-of-the-art adversarial robust training methods and has undergone wide and intense scrutiny [3], which showed that it achieves real robustness without relying on gradient masking [3, 13, 31, 4, 8]. Recently, Adversarial Logit Pairing (ALP) [16] was proposed with an additional loss term that matches the logit features of a clean image x and its corresponding adversarial image x', further improving adversarial robustness. However, this method has a distorted loss function and does not scale to untargeted attacks [11, 22].
In contrast to the ALP loss, which uses a pair of data, our method introduces an additional negative example to form a triplet, which achieves more desirable geometric relationships between adversarial and clean examples in the feature metric space.

Orthogonal to our method, the concurrent feature denoising method [46] achieves state-of-the-art adversarial robustness on ImageNet. While their method adds extra denoising blocks to the model, our method requires no change to the model architecture. Another concurrent work, TRADES [49], achieves improved robustness by introducing a Kullback-Leibler divergence loss on a pair of data. In addition, unlabeled data [39] and model ensembles [37, 25] have been shown to improve model robustness. Future work can explore combining these methods with our proposed TLA regularization for better adversarial robustness.

3 Qualitative Analysis of Latent Representations under Adversarial Attack

We begin our investigation by analyzing how adversarial images are represented by different models. We call the original class of an adversarial image the true class, and the mispredicted class of the adversarial example the false class. Figure 1 visualizes the high-dimensional latent representations of sampled CIFAR-10 images with t-SNE [40, 2]. Here, we visualize the penultimate fully connected (FC) layer of four models: the standard undefended model (UM), the model after adversarial training (AT) [19], the model after adversarial logit pairing (ALP) [16], and the model after our proposed TLA training. Though all the adversarial images belong to the same true class, UM separates them into different false classes with large margins. The result shows that UM is highly non-robust against adversarial attacks, because it is very easy to craft an adversarial image that will be mistakenly classified into a different class.
With the AT and ALP methods, the representations move closer together, but one can still discriminate them. Note that a good robust model will bring the representations of the adversarial images closer to their original true class, so that it becomes difficult to discriminate the adversarial images from the original images. We will leverage this observation to design our approach.

(a) UM  (b) AT  (c) ALP  (d) TLA

Figure 1: t-SNE visualization of adversarial images from the same true class which are mistakenly classified into different false classes. The figure shows representations of the second-to-last layer for 1000 adversarial examples crafted from 1000 natural (clean) test examples from the CIFAR-10 dataset, where the true class is "deer." The different colors represent different false classes. The gray dots further show 500 randomly sampled natural deer images. Notice that for (a) the undefended model (UM), the adversarial attacks clearly separate the images from the same "deer" category into different classes. (b) Adversarial training (AT) and (c) adversarial logit pairing (ALP) still suffer from this problem at a reduced level. In contrast, our proposed TLA (see (d)) clusters together all the examples from the same true class, which improves overall robustness.

(a) UM  (b) AT  (c) ALP  (d) TLA

Figure 2: Illustration of the separation margin of adversarial examples from the natural images of the corresponding false class. We show t-SNE visualizations of the second-to-last layer representation of test data from two different classes across four models. The blue and green dots are 200 randomly sampled natural images from the "bird" and "truck" classes, respectively.
The red triangles denote adversarial (adv) truck images mispredicted as "bird." Notice that for (a) UM, the adversarial examples move to the center of the false class, making them hard to separate. (b) AT and (c) ALP achieve some robustness by separating adversarial and natural images, but the two remain close to each other. Plot (d) shows that our proposed TLA training pushes the mispredicted adversarial examples to the edge of the false class's natural images, where they can still be separated, which improves robustness.

In Figure 2, we further analyze how the representations of images of one class are attacked into the neighborhood of another class. The green and blue dots are the natural images of trucks and birds, respectively. The red triangles are the adversarial images of trucks mispredicted as birds. For the UM model (Figure 2a), all the adversarial attacks successfully reach the center of the false class. The AT and ALP models achieve some robustness by separating some adversarial images from the natural images, but most adversarial images still lie inside the false class. A good robust model should push the representations of adversarial examples away from the false class, as shown in Figure 2d. Such separation not only improves adversarial classification accuracy but also helps to reject mispredicted adversarial attacks, because the mispredicted adversaries tend to lie on the edge.

Based on these two observations, we build a new approach that ensures adversarial representations will be (i) closer to the natural image representations of their true classes, and (ii) farther from the natural image representations of the corresponding false classes.

4 Approach

Inspired by the adversarial feature space analysis, we add an additional constraint to the model using metric learning.
Our motivation is that the triplet loss function will pull all the images of one class, both natural and adversarial, closer together while pushing the images of other classes far apart. Thus, an image and its adversarial counterpart should lie on the same manifold, while all the members of the false class should be separated by a large margin.

Notations. For an image classification task, let M be the number of classes to predict and N the number of training examples. We formulate the deep network classifier as F_\theta(x) \in R^M, a probability distribution, where x is the input variable, y is the ground-truth output, and \theta denotes the network's parameters to learn (we simply write F(x) most of the time); L(F(x), y) is the loss function.

Assume an adversary capable of launching adversarial attacks bounded in p-norm, i.e., the adversary can perturb the input pixels by at most \epsilon in L_p, p = 0, 2, \infty. Let I(x, \epsilon) denote the L_p ball centered at x with radius \epsilon. We focus on untargeted attacks, i.e., the objective is to generate x' \in I(x, \epsilon) such that F(x') \neq F(x).

Triplet Loss. Triplet loss is a widely used strategy for metric learning. It trains on triplet inputs {(x_a^{(i)}, x_p^{(i)}, x_n^{(i)})}, where the elements of the positive pair (x_a^{(i)}, x_p^{(i)}) are clean images from the same class and the elements of the negative pair (x_a^{(i)}, x_n^{(i)}) are from different classes [32, 15]. x_p^{(i)}, x_a^{(i)}, and x_n^{(i)} are referred to as the positive, anchor, and negative examples of the triplet loss. The embeddings are optimized such that examples of the same class are pulled together and examples of different classes are pushed apart by some margin [34]. The standard triplet loss for clean images is:

  \sum_{i=1}^{N} L_{trip}(x_a^{(i)}, x_p^{(i)}, x_n^{(i)}) = \sum_{i=1}^{N} [D(h(x_a^{(i)}), h(x_p^{(i)})) - D(h(x_a^{(i)}), h(x_n^{(i)})) + \alpha]_+

where h(x) maps the input x to the embedding layer, \alpha \in R^+ is a hyper-parameter for the margin, and D(h(x_i), h(x_j)) denotes the distance between x_i and x_j in the embedded representation space. In this paper, we define the embedding distance between two examples using the angular distance [42]:

  D(h(x_a^{(i)}), h(x_{p,n}^{(j)})) = 1 - \frac{|h(x_a^{(i)}) \cdot h(x_{p,n}^{(j)})|}{||h(x_a^{(i)})||_2 \, ||h(x_{p,n}^{(j)})||_2},

where we choose to encode the information in the angular metric space.

Metric Learning for Adversarial Robustness. We add the triplet loss to the penultimate layer's representation. Unlike the standard triplet loss, where all elements of the triplet are clean images [32, 50], at least one element of the triplet in our setting is an adversarial image. Note that generating adversarial examples is computationally intensive compared with simply taking clean images. For efficiency, we generate only one adversarially perturbed image per triplet, using the same method introduced by Madry et al. [19]. Specifically, given a clean image x^{(i)}, we generate the adversarial image x'^{(i)} based on \nabla_x L(F(x), y) (the standard loss without the triplet term) with the PGD method. We do not add the triplet loss term to the loss used for adversarial example generation due to its inefficiency. The other elements of the triplet are clean images.
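As a minimal NumPy sketch of the angular distance and triplet hinge defined above (the embedding map h is assumed to have already produced the vectors passed in; the margin value below is illustrative, not the paper's tuned setting):

```python
import numpy as np

def angular_distance(u, v):
    # D(h(x_i), h(x_j)) = 1 - |u . v| / (||u||_2 ||v||_2)
    return 1.0 - abs(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))

def triplet_loss(h_anchor, h_pos, h_neg, alpha=0.5):
    # Hinge: [D(anchor, positive) - D(anchor, negative) + alpha]_+
    d_ap = angular_distance(h_anchor, h_pos)
    d_an = angular_distance(h_anchor, h_neg)
    return max(d_ap - d_an + alpha, 0.0)
```

For a well-separated triplet (anchor collinear with the positive, orthogonal to the negative) the hinge is inactive and the loss is zero; swapping positive and negative activates it.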
We forward the triplet data in parallel through the model and jointly optimize the cross-entropy loss and the triplet loss, which enables the model to capture a stable metric-space representation (triplet loss) with semantic meaning (cross-entropy loss). The total loss function is:

  L_{all} = \sum_{i=1}^{N} L_{ce}(F(x'^{(i)}_a), y^{(i)}) + \lambda_1 L_{trip}(h(x'^{(i)}_a), h(x^{(i)}_p), h(x^{(i)}_n)) + \lambda_2 L_{norm}    (1)

  L_{norm} = ||h(x'^{(i)}_a)||_2 + ||h(x^{(i)}_p)||_2 + ||h(x^{(i)}_n)||_2

where \lambda_1 is a positive coefficient trading off the two losses; x'^{(i)}_a (anchor example) is an adversarial counterpart of x^{(i)}_a; x^{(i)}_p (positive example) is a clean image from the same class as x^{(i)}_a; x^{(i)}_n (negative example) is a clean image from a different class; and \lambda_2 is the weight of the feature norm decay term, which is also applied in [32] to reduce the L2 norm of the features.

Notice that, besides the TLA setup in Equation 1, an adversarially perturbed image can serve as the positive example and a clean image as the anchor (i.e., the anchor and the positive are switched), which we refer to as TLA-SA (Sec. 5). We choose the adversarial example as the anchor for TLA based on experimental results. Intuitively, the adversarial image is picked as the anchor because it tends to be closer to the decision boundary between the "true" class and the "false" class. As the anchor, the adversarial example appears in both the positive pair and the negative pair, which yields more useful gradients for the optimization. The modified triplet loss for adversarial robustness is shown in Figure 3.

Figure 3: Illustration of the triplet loss for adversarial robustness (TLA). The red circle is an adversarial example, while the green and blue circles are clean examples. The anchor and positive belong to the same class.
The negative (blue), from a different class, is the closest image to the anchor (red) in feature space. TLA learns to pull the anchor and positive of the true class closer together, and to push the negative of the false class apart.

Negative Sample Selection. In addition to the anchor selection, the selection of the negative example is crucial for the training process, because most negative examples are easy examples that already satisfy the margin constraint on the pairwise distance and thus contribute useless gradients [32, 10]. Using the angular representation distance defined above, we select negative samples as the images nearest to the anchor from a false class. As a result, our model learns to enlarge the boundary between the adversarial samples and their closest negative samples from the other classes.

Unfortunately, finding the closest negative samples in the entire training set is computationally intensive. Besides, using very hard negative examples has been found to decrease the network's convergence speed significantly [32]. Instead, we use semi-hard negative examples, selecting the closest sample within a mini-batch. We demonstrate the advantage of this sampling strategy by comparing it with random sampling (TLA-RN); the results are shown in Sec. 5. Other strategies for sampling negative examples could also be applied here, such as DAML [10], which uses an adversarial generator to produce hard negative examples from easy ones.

Implementation Details. We apply our proposed triplet loss to the embedding of the penultimate layer of the neural network for classification tasks. Since the subsequent transformation consists only of a linear layer and a softmax layer, small fluctuations in this embedding bring only monotonic adjustments to the output, controlled by some tractable Lipschitz constant [7, 24].
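The semi-hard negative selection above can be sketched as a nearest-neighbor search under the angular distance, restricted to other-class examples within the current mini-batch (a minimal NumPy illustration; `pick_negatives` and its arguments are hypothetical names, not the paper's implementation):

```python
import numpy as np

def angular_distance_matrix(E):
    # Pairwise angular distance 1 - |e_i . e_j| / (||e_i||_2 ||e_j||_2) over rows of E.
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    cos = np.abs(E @ E.T) / (norms * norms.T)
    return 1.0 - cos

def pick_negatives(embeddings, labels):
    # For each anchor, return the index of the nearest example in the
    # mini-batch that belongs to a different class (the semi-hard negative).
    D = angular_distance_matrix(embeddings)
    neg_idx = []
    for i, y in enumerate(labels):
        mask = labels != y                    # candidates from false classes only
        dists = np.where(mask, D[i], np.inf)  # ignore same-class examples
        neg_idx.append(int(np.argmin(dists)))
    return neg_idx
```

Restricting the search to a mini-batch is what makes the negatives "semi-hard": they are the hardest available in the batch, not the globally hardest in the training set.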
We apply the triplet loss to the penultimate layer rather than the logit layer, because the higher-dimensional penultimate layer tends to preserve more information. On CIFAR-10 and Tiny ImageNet we also construct two triplet loss terms, adding another positive example while reusing the anchor and negative example, which achieves better performance [34, 6]. The details of the algorithm are given in the appendix.

5 Experiments

Experimental Setting. We validate our method on different model architectures across three popular datasets: MNIST, CIFAR-10, and Tiny ImageNet. We compare the performance of our models with the following baselines: Undefended Model (UM) refers to standard training without adversarial samples; Adversarial Training (AT) refers to the min-max optimization method proposed in [19]; Adversarial Logit Pairing (ALP) refers to the logit matching method, which is currently the state of the art [16]. We use TLA to denote the triplet loss adversarial training described in Section 4. To further evaluate our design choices, we study two variants of TLA: Random Negative (TLA-RN), which refers to our proposed triplet loss training with a randomly sampled negative example, and Switch Anchor (TLA-SA), which sets the anchor to be the natural example and the positive to be the adversarial example (i.e., switching the anchor and the positive of our proposed method).

We conduct all of our experiments using TensorFlow v1.13 [1] on a single Tesla V100 GPU with 16GB of memory. We adopt untargeted adversarial attacks during all of our training processes, and evaluate the models with both white-box and black-box untargeted attacks instead of targeted attacks, following the suggestions in [11] (a defense robust only to targeted adversarial attacks is weaker than one robust to untargeted adversarial attacks).
In order to be comparable to the original AT and ALP papers, we mainly evaluate the models under L∞-bounded attacks. We also evaluate the models under other norm-bounded attacks (L0, L2). PGD and 20PGD in Table 1 refer to PGD attacks with 1 and 20 random restarts, respectively. For black-box (BB) attacks, we use the transfer-based method [26]. We set λ = 0.5 for the ALP method, as in the original paper. All other implementation details are discussed in the appendix.

MNIST

Method  | Clean  | FGSM(1) | BIM(40) | C&W(40) | PGD(40) | PGD(100) | 20PGD(100) | MIM(200) | BB(100)
UM      | 99.20% | 34.48%  | 0%      | 0%      | 0%      | 0%       | 0%         | 0%       | 81.81%
AT      | 99.24% | 97.31%  | 95.95%  | 96.66%  | 96.58%  | 94.82%   | 93.87%     | 95.47%   | 96.67%
ALP     | 98.91% | 97.34%  | 96.00%  | 96.50%  | 96.62%  | 95.06%   | 94.93%     | 95.41%   | 96.95%
TLA-RN  | 99.50% | 98.12%  | 97.17%  | 97.17%  | 97.64%  | 97.07%   | 96.73%     | 96.84%   | 97.69%
TLA-SA  | 99.44% | 98.14%  | 97.08%  | 97.45%  | 97.50%  | 96.78%   | 95.64%     | 96.45%   | 97.65%
TLA     | 99.52% | 98.17%  | 97.32%  | 97.25%  | 97.72%  | 96.96%   | 96.79%     | 96.64%   | 97.73%

CIFAR-10

Method  | Clean  | FGSM(1) | BIM(7)  | C&W(30) | PGD(7)  | PGD(20)  | 20PGD(20)  | MIM(40)  | BB(7)
UM      | 95.01% | 13.35%  | 0%      | 0%      | 0%      | 0%       | 0%         | 0%       | 7.60%
AT      | 87.14% | 55.63%  | 48.29%  | 46.97%  | 49.79%  | 45.72%   | 45.21%     | 45.16%   | 62.83%
ALP     | 89.79% | 60.29%  | 50.62%  | 47.59%  | 51.89%  | 48.50%   | 45.98%     | 45.97%   | 67.27%
TLA-RN  | 81.02% | 55.41%  | 51.44%  | 49.66%  | 52.50%  | 49.94%   | 45.55%     | 49.63%   | 65.96%
TLA-SA  | 86.19% | 58.80%  | 52.19%  | 49.64%  | 53.53%  | 49.70%   | 49.15%     | 49.29%   | 61.67%
TLA     | 86.21% | 58.88%  | 52.60%  | 50.69%  | 53.87%  | 51.59%   | 50.03%     | 50.09%   | 70.63%

Tiny ImageNet

Method  | Clean  | FGSM(1) | BIM(10) | C&W(10) | PGD(10) | PGD(20)  | 20PGD(20)  | MIM(40)  | BB(10)
UM      | 60.64% | 1.15%   | 0.01%   | 0.01%   | 0.01%   | 0%       | 0%         | 0%       | 9.99%
AT      | 44.77% | 21.99%  | 19.59%  | 17.34%  | 19.79%  | 19.44%   | 19.25%     | 19.28%   | 27.73%
ALP     | 41.53% | 21.53%  | 20.03%  | 16.80%  | 20.18%  | 19.96%   | 19.76%     | 19.85%   | 30.31%
TLA-RN  | 42.11% | 21.47%  | 20.03%  | 17.00%  | 20.05%  | 19.93%   | 19.81%     | 19.91%   | 30.18%
TLA-SA  | 41.43% | 22.09%  | 20.77%  | 17.28%  | 20.82%  | 20.63%   | 20.50%     | 20.61%   | 29.96%
TLA     | 40.89% | 22.12%  | 20.77%  | 17.48%  | 20.89%  | 20.71%   | 20.47%     | 20.69%   | 29.98%

Table 1: Classification accuracy under 8 different L∞-bounded untargeted attacks on MNIST (L∞ = 0.3), CIFAR-10 (L∞ = 8/255), and Tiny ImageNet (L∞ = 8/255). The number of attack steps is shown in parentheses. TLA improves the adversarial accuracy by up to 1.86%, 4.12%, and 0.84% on MNIST, CIFAR-10, and Tiny ImageNet, respectively.

5.1 Effect of TLA on Robust Accuracy

MNIST consists of a training set of 55,000 images (excluding 5,000 images for validation, as in [19]) and a test set of 10,000 images. We use a variant of the LeNet CNN architecture with batch normalization for all methods. The details of network architectures and hyper-parameters are summarized in the appendix. We adopt L∞ = 0.3 bounded attacks during training and evaluation. During training, we generate adversarial examples using PGD with step size 0.01 for 40 steps. In addition, we conduct different types of L∞ = 0.3 bounded attacks for a thorough evaluation. The adversarial classification accuracy of the different models under various adversarial attacks is shown in Table 1.
As shown, we improve the empirical state-of-the-art adversarial accuracy by up to 1.86% under 20PGD attacks (100-step PGD attacks with 20 random restarts), along with a 0.28% improvement on clean data.

CIFAR-10 consists of 32×32×3 color images in 10 classes, with 50k training images and 10k test images. We follow the same wide residual network architecture and the same hyper-parameter settings as AT [19]. As shown in Table 1, our method achieves up to 4.12% higher adversarial accuracy than the baseline methods under the strongest 20PGD attacks (20-step PGD attacks with 20 restarts). Note that our method results in a minor decrease in standard accuracy, but such loss of generic accuracy is observed in all existing robust training models [38, 49]. The comparison with TLA-RN illustrates the effectiveness of our negative sampling strategy. The TLA-SA result shows that selecting the adversarial example as the anchor also achieves better performance than choosing the clean image as the anchor.

Tiny ImageNet is a scaled-down version of ImageNet consisting of 64×64×3 color images in 200 classes. Each class has 500 training images and 50 validation images. Due to the GPU memory limit, we adapt the ResNet-50 architecture for this experiment. We adopt L∞ = 8/255 for both training and validation. During training, we use a 7-step PGD attack with step size 2/255 to generate the adversarial samples.
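The L∞-bounded PGD generation used for training throughout can be sketched as follows (a minimal NumPy illustration of the projection loop, not the paper's TensorFlow implementation; `grad_fn`, returning ∇_x L(F(x), y), is a stand-in for the model's gradient, and the ε and step values mirror the CIFAR-10/Tiny ImageNet settings above):

```python
import numpy as np

def pgd_linf(x, grad_fn, eps, step, n_steps, rng=None):
    # Projected gradient descent inside the L_inf ball I(x, eps):
    # each iteration ascends the loss along the sign of the gradient,
    # then projects back into the eps-ball and the valid pixel range [0, 1].
    if rng is None:
        rng = np.random.default_rng(0)
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)  # random start in the ball
    for _ in range(n_steps):
        x_adv = x_adv + step * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)      # project into I(x, eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)              # stay a valid image
    return x_adv
```

With a constant positive gradient, the iterate is driven toward the upper face of the ε-ball and the projection keeps it within ε of x in every coordinate.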
As shown in Table 1, our proposed model achieves higher adversarial accuracy under white-box adversarial attacks, by up to 0.84% on MIM attacks.

Negative mini-batch size | Mini-batch TT (s) | Total TT (s) | Clean  | FGSM(1) | BIM(7) | C&W(30) | PGD(20) | MIM(40)
1                        | 0                 | 1.802        | 81.02% | 55.41%  | 51.44% | 49.66%  | 49.94%  | 49.63%
250                      | 0.467             | 2.259        | 86.38% | 59.05%  | 50.49% | 53.02%  | 50.71%  | 50.31%
500                      | 0.908             | 2.688        | 88.32% | 60.02%  | 51.30% | 53.20%  | 50.46%  | 50.07%
1000                     | 1.832             | 3.621        | 86.71% | 59.08%  | 50.88% | 53.25%  | 51.22%  | 50.74%
2000                     | 3.548             | 5.992        | 87.45% | 59.23%  | 52.52% | 50.57%  | 50.20%  | 49.79%

Table 2: The effect of the mini-batch size of negative samples on training time (TT) per iteration and adversarial robustness (L∞ = 8/255) on the CIFAR-10 dataset. The number of steps for each attack is shown in parentheses. The training time grows linearly with the size of the mini-batch. The adversarial robustness peaks at sizes 500 to 1000, which validates that semi-hard negative examples are crucial for TLA.

         MNIST (LeNet)                                   CIFAR-10 (WRN)
Method | JSMA(L0) | PGD(L2) | C&W(L2) | DeepFool(L2) | JSMA(L0) | PGD(L2) | C&W(L2) | DeepFool(L2)
AT     | 99.08%   | 96.61%  | 99.08%  | 99.13%       | 40.4%    | 36.8%   | 50.0%   | 67.7%
ALP    | 98.83%   | 96.28%  | 98.91%  | 98.95%       | 36.9%    | 38.6%   | 51.2%   | 43.5%
TLA    | 99.32%   | 97.38%  | 99.36%  | 99.35%       | 48.6%    | 41.1%   | 53.5%   | 80.8%

Table 3: Classification accuracy of the two baseline methods and TLA on 4 unseen types of attacks (L0- and L2-norm bounded). All models are trained only on L∞-bounded attacks. TLA improves the adversarial accuracy by up to 1.10% and 13.1% on the MNIST and CIFAR-10 datasets, respectively. The results demonstrate that TLA generalizes better to unseen types of attacks.

Effect of the Mini-Batch Size of Negative Samples of TLA.
Compared with retrieving from the whole dataset, the mini-batch based method mitigates the computational overhead by finding the nearest neighbor within a batch rather than the whole training set. The mini-batch size controls the hardness level of the negative samples: a larger mini-batch yields harder negatives. We train models with different mini-batch sizes and evaluate their robustness using five untargeted, L∞-bounded attacks. As shown in Table 2, the total training time grows linearly as the mini-batch size increases, roughly tripling for size 2000 compared with size 1. The adversarial robustness first increases and then decreases after the mini-batch size reaches 1000 (very hard negative examples hurt performance). Consistent with observations in standard metric learning [32, 51], our results show that it is important to train TLA with semi-hard negative examples by choosing a proper mini-batch size.

Generalization to Unseen Types of Attacks. We evaluate the L∞ robustly trained models on unseen L0-bounded [27] and L2-bounded attacks [23, 5, 19]. We set the bound to L0 = 0.1 and L0 = 0.02 for JSMA on the MNIST and CIFAR-10 datasets, respectively. For the L2-norm bounded PGD and C&W attacks, we set the bound to L2 = 3 and L2 = 128 on MNIST and CIFAR-10, respectively. We apply 40 steps of PGD and C&W on MNIST with step size 0.1, and 10 steps of PGD and C&W on CIFAR-10 with step size 32. We apply 2 steps of DeepFool for both datasets. Due to the slow speed of JSMA, we only run 1000 test samples on CIFAR-10. Table 3 shows that TLA improves the adversarial accuracy by up to 1.10% and 13.1% on MNIST and CIFAR-10, respectively, which demonstrates that TLA generalizes better to unseen attacks than the baseline models.

Performance on Different Model Architectures.
To demonstrate that TLA generalizes across model architectures, we conduct experiments using multi-layer perceptron (MLP) and ConvNet [47] architectures. Results in Table 4 show that TLA improves adversarial robustness by up to 4.27% on MNIST and 0.55% on CIFAR-10.

          |               MNIST (MLP)                |            CIFAR-10 (ConvNet)
Attacks   | Clean    FGSM    BIM     C&W     PGD     | Clean    FGSM    BIM     C&W     PGD
Steps     | -        1       40      40      100     | -        1       7       30      20
UM        | 98.27%   5.23%   0%      0%      0%      | 77.84%   3.50%   0.09%   0.08%   0.03%
AT        | 96.43%   73.25%  57.83%  62.60%  58.10%  | 67.60%   40.26%  36.34%  33.17%  34.83%
ALP       | 95.56%   77.08%  64.39%  63.46%  64.13%  | 66.18%   39.45%  36.15%  32.55%  35.32%
TLA       | 97.15%   78.44%  65.47%  67.73%  65.88%  | 67.48%   40.76%  36.77%  33.27%  35.38%

Table 4: Effect of TLA on different neural network architectures. The table lists classification accuracy under various L∞-bounded untargeted attacks on MNIST (L∞ = 0.3) and CIFAR-10 (L∞ = 8/255). Overall, TLA improves adversarial accuracy.

5.2 Effect of TLA on Adversarial vs. Natural Image Separation

Recall from Figure 2b and Figure 2c that the representations of adversarial images are shifted toward the false class; a robust model should keep them apart. To quantitatively evaluate how well TLA training helps separate the adversarial examples from the natural images of the corresponding 'false' classes, we define the following metric.

Let $\{c_k^i\}$ denote the embedded representations of all the natural images from class $c_k$, where $i = 1, \ldots, |c_k|$ and $|c_k|$ is the total number of images in class $c_k$. The average pairwise within-class distance of these embedded images is

$$\sigma^{\mathrm{ntrl}}_{c_k} = \frac{2}{|c_k|(|c_k|-1)} \sum_{i=1}^{|c_k|-1} \sum_{j=i+1}^{|c_k|} D(c_k^i, c_k^j).$$

Let $\{c'^q_k\}$ further denote the embedded representations of all the adversarial examples that are misclassified into class $c_k$, where $q = 1, \ldots, |c'_k|$ and $|c'_k|$ is the total number of such examples. Note that class $c_k$ is the 'false' class for those adversarial images. The distance between an adversarial image $c'^i_k$ and a natural image $c_k^j$ is $D(c'^i_k, c_k^j)$, and the average pairwise distance between adversarial and natural images is

$$\sigma'^{\mathrm{adv}}_{c_k} = \frac{1}{|c'_k|\,|c_k|} \sum_{i=1}^{|c'_k|} \sum_{j=1}^{|c_k|} D(c'^i_k, c_k^j).$$

We then define the ratio $r_{c_k} = \sigma'^{\mathrm{adv}}_{c_k} / \sigma^{\mathrm{ntrl}}_{c_k}$ as a metric of how close the adversarial images are to the natural images of the 'false' class, relative to the average pairwise within-class distance of the natural images of that class. Finally, we average the ratio over all $M$ classes: $r = \frac{1}{M}\sum_{k=1}^{M} r_{c_k}$. Any good robust method should increase the value of $r$, indicating that $\sigma'^{\mathrm{adv}}$ is large relative to $\sigma^{\mathrm{ntrl}}$, i.e., the adversarial examples are better separated than the natural cluster, as shown in Figure 2d.

                     MNIST                  CIFAR-10                 Tiny ImageNet
Perturbation Level   L∞=0.03    L∞=0.3      L∞=8/255    L∞=25/255    L∞=8/255    L∞=25/255
AT                   1.288      1.308       1.053       1.007        0.9949      0.9656
ALP                  1.398      1.394       1.038       1.210        0.9905      0.9722
TLA                  1.810      1.847       1.093       1.390        0.9937      0.9724

Table 5: Average ratio (r) of the mean distance between adversarial points and natural points over the mean intra-class distance. The best result in each column is in bold. The results illustrate that TLA increases the relative distance of adversarial images w.r.t. the natural images of the respective false classes, i.e., TLA achieves a more desirable geometric feature space under attack.

For every dataset, we estimate the ratios under two different perturbation levels of PGD attacks for all the models.
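For concreteness, the ratio defined above can be computed with a short NumPy sketch, assuming $D$ is the Euclidean distance between embeddings (the function and variable names are ours, for illustration only):

```python
import numpy as np

def avg_within_class_dist(nat):
    # nat: (n, d) embeddings of the natural images of one class c_k.
    # Returns sigma_ntrl: the mean distance over all n*(n-1)/2 unordered pairs.
    dists = np.linalg.norm(nat[:, None, :] - nat[None, :, :], axis=-1)
    return dists[np.triu_indices(len(nat), k=1)].mean()

def avg_adv_to_nat_dist(adv, nat):
    # adv: (m, d) embeddings of adversarial examples misclassified into c_k.
    # Returns sigma_adv: the mean distance over all m*n adversarial-natural pairs.
    return np.linalg.norm(adv[:, None, :] - nat[None, :, :], axis=-1).mean()

def separation_ratio(adv_by_class, nat_by_class):
    # r = (1/M) * sum_k sigma_adv(c_k) / sigma_ntrl(c_k).
    # Larger r means the adversarial examples sit farther from the false
    # class's natural cluster than the natural images sit from one another.
    ratios = [avg_adv_to_nat_dist(adv, nat) / avg_within_class_dist(nat)
              for adv, nat in zip(adv_by_class, nat_by_class)]
    return float(np.mean(ratios))
```

Here `adv_by_class[k]` holds the embeddings of adversarial examples misclassified into class $c_k$, and `nat_by_class[k]` the embeddings of that class's natural images.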
As shown in Table 5, stronger attacks (larger perturbation levels) tend to shift the latent representations further toward the false class. For Tiny ImageNet, the adversarial examples are even closer (r < 1) to the false class's manifold than the corresponding natural images are to each other, which explains the low adversarial accuracy on this dataset. In almost all settings, TLA yields higher separation ratios r than the baseline methods, indicating that TLA is the most effective at pulling the misclassified adversarial examples apart from their false class under both small and large perturbation attacks.

          MNIST                 CIFAR-10              Tiny-ImageNet
          Adv       Natural     Adv       Natural     Adv       Natural
AT        93.01%    98.68%      47.46%    87.06%      20.20%    36.60%
ALP       95.20%    98.43%      48.85%    89.63%      20.33%    35.23%
TLA       96.98%    99.47%      51.74%    86.29%      20.72%    33.99%

Table 6: Accuracy of a K-Nearest Neighbors classifier with K = 50, illustrating that TLA yields better similarity measures in the embedding space even with adversarial samples. The best result in each column is in bold.

We further conduct a nearest-neighbor analysis on the latent representations of all the models. The results illustrate the advantage of our learned representations for retrieving nearest neighbors under adversarial attack (see Figure 4). Table 6 numerically shows that the latent representation of TLA achieves higher accuracy with a K-Nearest Neighbors classifier than the baseline methods.

5.3 Effect of TLA on Adversarial Image Detection

Detecting mis-predicted adversarial inputs is another dimension along which to improve a model's robustness. Forwarding these detected adversarial examples to humans for labeling can significantly improve the reliability of the system under adversarial conditions.
Given that TLA further separates the adversarial examples from the natural examples of the false class, we can detect more misclassified examples by filtering out the outliers. We conduct the following experiments.

Figure 4: Visualization of nearest-neighbor images when querying a "plane" on AT- and TLA-trained models. For a natural query image, both methods retrieve correct images (left column). However, given an adversarial query image (right column), AT retrieves false "truck" images, indicating that the perturbation moves the representation of the "plane" into the neighborhood of "truck," while TLA still retrieves images from the true "plane" class.

Figure 5: The ROC curves and AUC scores for detecting misclassified adversarial examples on (a) MNIST, (b) CIFAR-10, and (c) Tiny ImageNet. We train a GMM model on half clean and half adversarial examples (generated with perturbation level ε = 0.03/1 (40 steps) for MNIST, ε = 8/255 (7 steps) for CIFAR-10, and ε = 8/255 (7 steps) for Tiny-ImageNet), and then test the detection model on 10k natural test images and 10k adversarial test images (generated with perturbation level ε = 0.3/1 (100 steps) for MNIST, ε = 25/255 (20 steps) for CIFAR-10, and ε = 25/255 (30 steps) for Tiny-ImageNet). The AUC scores are shown in the legend. The ROC curve of TLA is on top and its AUC score is the highest, showing that TLA (our method) achieves higher detection efficiency for adversarial examples.

Following the adversarial detection method proposed in [52], we train a Gaussian Mixture Model for the 10 classes, where the density function of each class is captured by one Gaussian distribution. For each test image, we assign a confidence score for a class based on the Gaussian density of that class at the image, as shown in [45].
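This per-class Gaussian confidence scoring can be sketched as follows; the snippet is a minimal illustration assuming precomputed embeddings, with SciPy's multivariate normal standing in for the fitted GMM components (the helper names and the covariance regularizer `reg` are our own, not from the paper's implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_class_gaussians(embeds, labels, num_classes=10, reg=1e-6):
    # Fit one Gaussian per class on training embeddings (half clean, half
    # adversarial, as in the detection setup). reg stabilizes the covariance.
    gaussians = []
    for k in range(num_classes):
        x = embeds[labels == k]
        mu = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + reg * np.eye(x.shape[1])
        gaussians.append(multivariate_normal(mean=mu, cov=cov))
    return gaussians

def detect(embeds, gaussians, threshold):
    # Confidence of a test embedding = log-density of its highest-scoring
    # class; images whose confidence falls below `threshold` are flagged as
    # likely adversarial.
    log_dens = np.stack([g.logpdf(embeds) for g in gaussians], axis=1)
    assigned = log_dens.argmax(axis=1)
    confidence = log_dens.max(axis=1)
    return assigned, confidence, confidence < threshold
```

Sweeping the threshold over the ranked confidence values yields ROC curves like those in Figure 5.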
We assign these confidence scores for all 10 classes to each test image and pick the class with the largest confidence value as the image's assigned class. We then rank all the test images by the confidence value of their assigned class and reject those whose confidence scores fall below a certain threshold. This method serves as an additional confidence metric for detecting adversarial examples in a real-world setting.

We conduct the detection experiment for misclassified images on 10k clean images and 10k adversarial images. As shown in Figure 5, the ROC curves and AUC scores demonstrate that our learned representations are superior for adversarial example detection. Compared with the other robust training models, TLA improves the AUC score by up to 3.69%, 6.45%, and 1.37% on MNIST, CIFAR-10, and Tiny ImageNet respectively. The detection results here are consistent with the visual results shown in Figure 2.

6 Conclusion

Our novel TLA regularization is the first method that leverages metric learning for adversarial robustness on deep networks; it significantly increases model robustness and detection efficiency. TLA is inspired by the evidence that the model's feature space is distorted under adversarial attack. In the future, we plan to enhance TLA with more powerful metric learning methods, such as the N-pair loss.
We believe TLA will also benefit other deep network applications that desire a better geometric relationship in their hidden representations.

[Figure 5 legend AUC scores: MNIST UM 94.83%, AT 91.25%, ALP 89.77%, TLA 94.94%; CIFAR-10 UM 23.95%, AT 61.07%, ALP 63.15%, TLA 69.60%; Tiny ImageNet UM 35.75%, AT 60.76%, ALP 63.66%, TLA 65.03%.]

7 Acknowledgements

We thank the anonymous reviewers, Prof. Suman Jana, Prof. Shih-Fu Chang, Prof. Janet Kayfetz, Ji Xu, Hao Wang, and Vaggelis Atlidakis for their valuable comments, which substantially improved our paper. This work is in part supported by NSF grant CNS-15-64055; ONR grants N00014-16-1-2263 and N00014-17-1-2788; a JP Morgan Faculty Research Award; a DiDi Faculty Research Award; a Google Cloud grant; an Amazon Web Services grant; NSF CRII 1850069; and NSF CCF-1845893.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Sanjeev Arora, Wei Hu, and Pravesh K. Kothari.
An analysis of the t-SNE algorithm for data visualization. In COLT, 2018.

[3] Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pages 274–283, 2018.

[4] Jacob Buckman, Aurko Roy, Colin Raffel, and Ian J. Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In 6th International Conference on Learning Representations, 2018.

[5] Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, pages 39–57, 2017.

[6] Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. CoRR, abs/1704.01719, 2017.

[7] Moustapha Cissé, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, pages 854–863, 2017.

[8] Guneet S. Dhillon, Kamyar Azizzadenesheli, Zachary C. Lipton, Jeremy Bernstein, Jean Kossaifi, Aran Khanna, and Animashree Anandkumar. Stochastic activation pruning for robust adversarial defense. In 6th International Conference on Learning Representations, 2018.

[9] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9185–9193, 2018.

[10] Yueqi Duan, Wenzhao Zheng, Xudong Lin, Jiwen Lu, and Jie Zhou. Deep adversarial metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2780–2789, 2018.

[11] Logan Engstrom, Andrew Ilyas, and Anish Athalye. Evaluating and understanding the robustness of adversarial logit pairing.
CoRR, abs/1807.10272, 2018.

[12] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2014.

[13] Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering adversarial images using input transformations. CoRR, abs/1711.00117, 2017.

[14] Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: Scaling up end-to-end speech recognition. CoRR, abs/1412.5567, 2014.

[15] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In ICLR, 2015.

[16] Harini Kannan, Alexey Kurakin, and Ian J. Goodfellow. Adversarial logit pairing. CoRR, abs/1803.06373, 2018.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, pages 1097–1105, 2012.

[18] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2017.

[19] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.

[20] Chengzhi Mao, Kangbo Lin, Tiancheng Yu, and Yuan Shen. A probabilistic learning approach to UWB ranging error mitigation. In 2018 IEEE Global Communications Conference (GLOBECOM), 2018.

[21] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality.
In Advances in Neural Information Processing Systems 26, pages 3111–3119, 2013.

[22] Marius Mosbach, Maksym Andriushchenko, Thomas Alexander Trost, Matthias Hein, and Dietrich Klakow. Logit pairing methods can fool gradient-based attacks. CoRR, abs/1810.12042, 2018.

[23] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In CVPR, pages 427–436, 2015.

[24] Adam M. Oberman and Jeff Calder. Lipschitz regularized deep neural networks converge and generalize. CoRR, abs/1808.09540, 2018.

[25] Tianyu Pang, Kun Xu, Chao Du, Ning Chen, and Jun Zhu. Improving adversarial robustness via promoting ensemble diversity. CoRR, abs/1901.08846, 2019.

[26] Nicolas Papernot, Patrick D. McDaniel, and Ian J. Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR, abs/1605.07277, 2016.

[27] Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. CoRR, abs/1511.07528, 2015.

[28] Magdalini Paschali, Sailesh Conjeti, Fernando Navarro, and Nassir Navab. Generalizability vs. robustness: Adversarial examples for medical imaging. CoRR, abs/1804.00504, 2018.

[29] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. Deepxplore: Automated whitebox testing of deep learning systems. CoRR, abs/1705.06640, 2017.

[30] Adnan Siraj Rakin, Jinfeng Yi, Boqing Gong, and Deliang Fan. Defend deep neural networks against adversarial examples via fixed and dynamic quantized activation functions. CoRR, abs/1807.06714, 2018.

[31] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. CoRR, abs/1805.06605, 2018.

[32] Florian Schroff, Dmitry Kalenichenko, and James Philbin.
FaceNet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.

[33] Chuanbiao Song, Kun He, Liwei Wang, and John E. Hopcroft. Improving the generalization of adversarial training with domain adaptation. CoRR, abs/1810.00740, 2018.

[34] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, pages 4004–4012. IEEE Computer Society, 2016.

[35] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.

[36] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. DeepTest: Automated testing of deep-neural-network-driven autonomous cars. CoRR, abs/1708.08559, 2017.

[37] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Dan Boneh, and Patrick D. McDaniel. Ensemble adversarial training: Attacks and defenses. CoRR, abs/1705.07204, 2017.

[38] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. stat, 1050:11, 2018.

[39] Jonathan Uesato, Jean-Baptiste Alayrac, Po-Sen Huang, Robert Stanforth, Alhussein Fawzi, and Pushmeet Kohli. Are labels required for improving adversarial robustness? CoRR, abs/1905.13725, 2019.

[40] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

[41] Hao Wang, Chengzhi Mao, Hao He, Mingmin Zhao, Tommi S. Jaakkola, and Dina Katabi. Bidirectional inference networks: A class of deep Bayesian networks for health profiling. In The Thirty-Third AAAI Conference on Artificial Intelligence, pages 766–773, 2019.

[42] Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin.
Deep metric learning with angular loss. In ICCV, pages 2612–2620, 2017.

[43] Shiqi Wang, Kexin Pei, Justin Whitehouse, Junfeng Yang, and Suman Jana. Formal security analysis of neural networks using symbolic intervals. CoRR, abs/1804.10829, 2018.

[44] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.

[45] Xi Wu, Uyeong Jang, Jiefeng Chen, Lingjiao Chen, and Somesh Jha. Reinforcing adversarial robustness using model confidence induced by adversarial training. In ICML, volume 80, pages 5330–5338, 2018.

[46] Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan L. Yuille, and Kaiming He. Feature denoising for improving adversarial robustness. CoRR, abs/1812.03411, 2018.

[47] Ziang Yan, Yiwen Guo, and Changshui Zhang. Deep defense: Training DNNs with improved adversarial robustness. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 417–426, 2018.

[48] Yuzhe Yang, Guo Zhang, Dina Katabi, and Zhi Xu. ME-Net: Towards effective adversarial robustness with matrix estimation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 2019.

[49] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. CoRR, abs/1901.08573, 2019.

[50] Stephan Zheng, Yang Song, Thomas Leung, and Ian J. Goodfellow. Improving the robustness of deep neural networks via stability training. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pages 4480–4488, 2016.

[51] Wenzhao Zheng, Zhaodong Chen, Jiwen Lu, and Jie Zhou. Hardness-aware deep metric learning. CoRR, abs/1903.05503, 2019.

[52] Zhihao Zheng and Pengyu Hong.
Robust detection of adversarial attacks by modeling the intrinsic properties of deep neural networks. In Advances in Neural Information Processing Systems, pages 7913–7922, 2018.