{"title": "Robustness to Adversarial Perturbations in Learning from Incomplete Data", "book": "Advances in Neural Information Processing Systems", "page_first": 5541, "page_last": 5551, "abstract": "What is the role of unlabeled data in an inference problem, when the presumed underlying distribution is adversarially perturbed? To provide a concrete answer to this question, this paper unifies two major learning frameworks: Semi-Supervised Learning (SSL) and Distributionally Robust Learning (DRL). We develop a generalization theory for our framework based on a number of novel complexity measures, such as an adversarial extension of Rademacher complexity and its semi-supervised analogue. Moreover, our analysis is able to quantify the role of unlabeled data in the generalization under a more general condition compared to the existing theoretical works in SSL. Based on our framework, we also present a hybrid of DRL and EM algorithms that has a guaranteed convergence rate. When implemented with deep neural networks, our method shows a comparable performance to those of the state-of-the-art on a number of real-world benchmark datasets.", "full_text": "Robustness to Adversarial Perturbations\n\nin Learning from Incomplete Data\n\nAmir Naja\ufb01\n\nDepartment of Computer Engineering\n\nSharif University of Technology\n\nTehran, Iran\n\nnajafy@ce.sharif.edu\n\nShin-ichi Maeda\n\nPreferred Networks, Inc.\n\nTokyo, Japan\n\nichi@preferred.jp\n\nMasanori Koyama\n\nPreferred Networks, Inc.\n\nTokyo, Japan\n\nmasomatics@preferred.jp\n\nTakeru Miyato\n\nPreferred Networks, Inc.\n\nTokyo, Japan\n\nmiyato@preferred.jp\n\nAbstract\n\nWhat is the role of unlabeled data in an inference problem, when the presumed\nunderlying distribution is adversarially perturbed? To provide a concrete answer to\nthis question, this paper uni\ufb01es two major learning frameworks: Semi-Supervised\nLearning (SSL) and Distributionally Robust Learning (DRL). 
We develop a generalization theory for our framework based on a number of novel complexity measures, such as an adversarial extension of Rademacher complexity and its semi-supervised analogue. Moreover, our analysis is able to quantify the role of unlabeled data in the generalization under a more general condition compared to the existing theoretical works in SSL. Based on our framework, we also present a hybrid of DRL and EM algorithms that has a guaranteed convergence rate. When implemented with deep neural networks, our method shows a performance comparable to those of the state-of-the-art on a number of real-world benchmark datasets.\n\n1 Introduction\n\nRobustness to adversarial attacks is an essential feature in the design of modern classifiers, in particular of deep neural networks [1, 2]. Adversarial Training (AT) [3], Virtual AT [4] and Distillation [5] are examples of promising approaches to defend against a point-wise adversary who can alter input data-points separately. However, as shown by [6], a good defense against a distributional adversary, who shifts the input distribution instead of individual data-points, can improve the robustness of a classifier more effectively. This has led to the development of Distributionally Robust Learning (DRL) [7], which has recently attracted intensive research interest [8, 9, 10, 11]. Despite all the advancements in supervised DRL, very few studies tackle this problem from a semi-supervised angle [12]. Motivated by this fact, we propose a distributionally robust framework that handles Semi-Supervised Learning (SSL) scenarios. Our work is an extension of self-learning [13, 14, 15], which encompasses methods such as the Expectation-Maximization (EM) algorithm. It can also be combined with any existing classifier, such as neural networks. 
Intuitively, we first infer soft-labels for the unlabeled data, and then search for suitable classification rules that show low sensitivity to adversarial perturbations around these soft-label distributions.\nParts of this paper can be considered as a semi-supervised extension of [9]. The computational complexity of our framework is comparable to those of its supervised rivals. Moreover, we design a Stochastic Gradient Descent (SGD)-based algorithm with a guaranteed convergence rate to optimize our model.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nTable 1: Comparison between the proposed framework (SSDRL) and some existing methods: the DRL of [9], Pseudo-Labeling (PL) [19], and Virtual Adversarial Training (VAT) [4].\n\n                          DRL   PL   VAT   SSDRL\nGeneralization Bound       ✓    ×    ×     ✓\nConvergence Guarantee      ✓    ×    ×     ✓\nAdversarial Robustness     ✓    ×    ✓     ✓\nSemi-Supervised Learning   ×    ✓    ✓     ✓\n\nIn order to address generalization, we introduce a set of novel complexity measures, such as Adversarial Rademacher Complexity and the Minimum Supervision Ratio (MSR), which are based on the hypothesis set and the input data distribution. We show that if the ratio of labeled samples in a dataset (the supervision ratio) exceeds the MSR, the true adversarial risk can be bounded. Also, proper parameter adjustment can make the MSR arbitrarily small at the cost of increasing the generalization bound; this means our theoretical guarantees hold for all supervision ratios. The theoretical contribution of our work is summarized in Table 1. We have also tested our method, denoted by SSDRL, via extensive computer experiments on datasets such as MNIST [16], SVHN [17], and CIFAR-10 [18]. When equipped with deep neural networks, SSDRL outperforms rivals such as Pseudo-Labeling (PL) [19] and the supervised DRL of [9] on all the datasets. 
Also, SSDRL outperforms VAT [4] on SVHN, while it demonstrates a comparable performance on MNIST and CIFAR-10.\nThe rest of the paper is organized as follows: Section 1.1 specifies the notation, and Section 1.2 reviews the related work. The proposed framework is presented in Section 2, where numerical optimization is explained in Section 2.1 and generalization is addressed in Section 2.2. Section 3 is devoted to experimental results. Finally, Section 4 concludes the paper.\n\n1.1 Notations\n\nWe extend the notation used in [9]. Assume Z to be an input space, Θ to be a parameter set, and ℓ : Z × Θ → R a parametric loss function. The observation space Z can either be the feature space X in unsupervised scenarios, or the space of feature-label pairs, i.e., Z ≜ X × Y, where Y denotes the set of labels. For simplicity, we only consider finite label-sets. By M(Z), we mean the set of all probability measures supported on Z. Let us denote the function c : Z × Z → [0, +∞) as the transportation cost. Under some conditions on c, Definition B.1 (supplementary) formulates the Wasserstein distance W_c(P, Q) between two distributions P, Q ∈ M(Z), w.r.t. c [8]. W_c(P, Q) measures the minimal cost of moving P to Q, where the cost of moving one unit of mass from z to z′ is given by c(z, z′). Also, for ε ≥ 0 and a distribution Q ∈ M(Z), we define an ε-ambiguity set as B_ε(Q) ≜ {P ∈ M(Z) | W_c(P, Q) ≤ ε}. The training dataset is denoted by D ≜ {Z_1, . . . , Z_n}, which includes i.i.d. samples drawn from a fixed (and unknown) distribution P0 ∈ M(Z), while n denotes the dataset size. For a dataset D, let P̂_D ∈ M(Z) be P̂_D ≜ (1/n) Σ_{i=1}^{n} δ_{Z_i}, where δ_z denotes the Dirac delta function at point z ∈ Z. 
Accordingly, E and Ê_D represent the statistical and empirical expectation operators, respectively. For a distribution P ∈ M(X × Y), P_X denotes the marginal distribution P_X(·) ≜ Σ_{y∈Y} P(·, y) over X, and P_{|X} ∈ M(Y) is the conditional distribution over labels given the feature vector X ∈ X. To simplify the notation, for z = (X, y) ∈ X × Y and a function f, the notations f(z) and f(X, y) are used interchangeably.\n\n1.2 Background and Related Works\n\nDRL minimizes a worst-case risk against an adversary who has a limited budget to alter the data distribution Q ∈ M(Z) in order to inflict the maximum possible damage. Here, Q can either be the true measure P0, or the empirical one P̂_D [10]. Mathematically, DRL is formulated as [8, 11]:\n\ninf_{θ∈Θ} sup_{P∈B_ε(Q)} E_P {ℓ(Z; θ)}.   (1)\n\nThe Wasserstein metric has been widely used to quantify the strength of adversarial attacks [8, 9, 11, 12], thanks to (i) its relation to adversarial robustness [20] and (ii) its suitable dual-form properties [11]. In [8], the authors have reformulated DRL into a convex program for Logistic Regression. Convergence and generalization of DRL, in a general context, have been addressed in [9], while adjusting the ambiguity set size, i.e. ε, has been tackled in [21]. In [22], the authors have investigated the convergence of the inner maximization of (1), and its effects on the later stages of adversarial training. An analysis of DRL methods with f-divergences is given in [10]. Also, the sample complexity of DRL has been reviewed by [23] and [24].\nThe abundance of unlabeled data has made SSL methods widely popular [4, 25]. See [14] for a review of classical SSL approaches. Many robust SSL algorithms have been proposed so far [26, 27]; however, their notion of robustness is different from the one considered here. 
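As a small numerical illustration of the Wasserstein machinery above (this sketch is ours, not part of the paper's method): for Z = R with transportation cost c(z, z′) = |z − z′|, the order-1 Wasserstein distance between two equal-size empirical measures reduces to the mean absolute difference of the sorted samples, which makes the ambiguity-set membership test easy to evaluate.

```python
# Sketch (our illustration): W1 between two equal-size empirical measures on R,
# assuming the transportation cost c(z, z') = |z - z'|.
def w1_empirical(p_samples, q_samples):
    """Order-1 Wasserstein distance between uniform empirical measures of
    equal size: the optimal coupling matches sorted samples pairwise."""
    assert len(p_samples) == len(q_samples)
    p, q = sorted(p_samples), sorted(q_samples)
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

def in_ambiguity_set(p_samples, q_samples, eps):
    """Membership test for the Wasserstein ambiguity set around Q."""
    return w1_empirical(p_samples, q_samples) <= eps

print(w1_empirical([0.0, 1.0], [1.0, 2.0]))  # 1.0: every point moves one unit
print(in_ambiguity_set([0.0, 1.0], [1.0, 2.0], eps=0.5))  # False
```

The closed form only holds for this one-dimensional cost; general costs require solving the optimal-transport linear program.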
More recent works include [28, 29, 30, 31]. In [29], [30] and [31], the authors have mainly focused on a Gaussian model for theoretical validation, while [28] empirically shows that robust self-training eliminates the accuracy-robustness tradeoff by leveraging unlabeled data. In [32], a pessimistic SSL approach is proposed that provably enhances the performance by incorporating the unlabeled data. We show that a special case of our method reduces to an adversarial extension of [32]. Guarantees on the generalization of SSL can only be made under certain assumptions on the choice of hypothesis set and the true data distribution [14, 15, 33]. For example, a compatibility function is introduced in [15] to restrict the relation between the model set and the input data distribution. Also, in [34], the author has theoretically analyzed SSL under the cluster assumption. The main reason for making such assumptions is that, without any knowledge about the relation of a feature vector to its label, unlabeled data are simply useless for classification. In Section 2.2, we propose a novel compatibility criterion under a general setting which enables us to establish a generalization theory for our work.\nFinally, the only existing work that also falls in the intersection of DRL and SSL is [12]. However, this method severely restricts the shifted distributions, so that the adversary can only choose from a set of delta-spikes over labeled and augmented unlabeled samples.\n\n2 Proposed Framework\n\nFrom now on, let Z ≜ X × Y. In SSL, a dataset D consists of two non-overlapping parts: D_l (labeled) and D_ul (unlabeled). Let us denote by I_l and I_ul the index sets corresponding to these parts, respectively. Thus, we have D_l = {(X_i, y_i) | i ∈ I_l}, and D_ul = {X_i | i ∈ I_ul}. The hidden labels in D_ul can be modeled by random variables supported on Y. 
Note that the DRL in (1) cannot readily be applied to this partially-labeled setting, since (1) needs complete knowledge of all the feature-label pairs in D. To overcome this problem, first let us make the following definition:\nDefinition 1. The consistent set of probability distributions P̂(D) ⊆ M(Z) with respect to a partially-labeled dataset D = D_l ∪ D_ul is defined as\n\nP̂(D) ≜ { (n_l/n) P̂_{D_l} + (n_ul/n) P̂_{D_ul} · Ω | Ω ∈ M_X(Y) },\n\nwhere n_l and n_ul (with n = n_l + n_ul) are the sizes of D_l and D_ul, respectively, and M_X(Y) denotes the set of all conditional distributions supported on Y, given features in X.\nAll possible (soft-)labelings of the unlabeled samples in D_ul are collected in P̂(D). Note that the empirical measure corresponding to the true complete dataset is also included in the consistent set. Our aim is to choose a suitable measure from this set, and then use it in (1).\nWe take a known family of SSL approaches, called self-learning [13, 35], and combine it with DRL. Self-learning methods, e.g. the EM algorithm [36], transfer knowledge from labeled samples to unlabeled ones through pseudo-labeling. More precisely, a learner is trained on the supervised portion of a dataset, and then employs its learned rules to assign pseudo-labels to the remaining unlabeled part. However, such methods are prone to over-fitting if the information flow from D_l to D_ul is not properly controlled. One way to overcome this issue is to use soft-labeling, which maintains a minimum level of uncertainty over the unlabeled samples. 
By combining the above arguments with the core idea of DRL in (1), we propose the following learning scheme:\n\ninf_{θ∈Θ} inf_{S∈P̂(D)} sup_{P∈B_ε(S)} E_P {ℓ(X, y; θ)} + ((1 − η)/λ) Ê_{D_ul} {H(S_{|X})},   (2)\n\nwhere λ is a user-defined parameter, η ≜ n_l/n is the supervision ratio, and H(·) denotes the Shannon entropy. For now, let us assume λ < 0.\nMinimization over S ∈ P̂(D) acts as a knowledge transfer module that finds the optimal distribution in P̂(D). Again, note that distributions in P̂(D) vary only in the way they assign (soft-)labels to unlabeled data. The scheme in (2) is based on optimism in the sense that, for any θ ∈ Θ, the learner is instructed to pick the labels that are more likely to reduce the average loss ℓ(·; θ) for each unlabeled sample. This is the core idea of self-learning. However, a pessimistic learner does the opposite, i.e. picks the less likely labels with large loss values, and hence does not trust the loss function. The negative regularization term ((1 − η)/λ) Ê_{D_ul} {H(S_{|X})} prevents hard decisions for labels and promotes soft-labeling by bounding the Shannon entropy of the label-conditionals from below. A smaller |λ| gives softer labels. In the extreme case, choosing λ = −∞ ends up in an adversarial version of the self-training in [14]. 
It should be noted that, according to (2), the learner is forced to show less sensitivity near all (labeled and unlabeled) training data, just as one expects from a semi-supervised DRL.\nWe show that (2) can be efficiently solved given that some smoothness conditions hold for ℓ and c. Before that, Theorem 1 shows that the optimization corresponding to the knowledge transfer module has an analytic solution, which implies that the computational cost of (2) is only slightly higher than those of its fully-supervised counterparts, such as [9].\nTheorem 1 (Lagrangian-Relaxation). For any continuous loss ℓ : Z × Θ → R and c : Z × Z → R_{≥0}, parameters ε ≥ 0, γ ≥ 0 and λ ∈ R ∪ {±∞}, and a partially-labeled dataset D with size n, let us define the empirical Semi-Supervised Adversarial Risk (SSAR), denoted by R̂_SSAR(θ; D), as\n\nR̂_SSAR(θ; D) ≜ (1/n) Σ_{i∈I_l} φ_γ(X_i, y_i; θ) + (1/n) Σ_{i∈I_ul} softmin^{(λ)}_{y∈Y} {φ_γ(X_i, y; θ)} + γε,   (3)\n\nwhere the adversarial loss φ_γ(X, y; θ) and the soft-minimum operator softmin^{(λ)}_{y∈Y}(q), for any q ∈ R^Y, are defined as\n\nφ_γ(X, y; θ) ≜ sup_{z′∈Z} ℓ(z′; θ) − γ c(z′, (X, y)),  and  softmin^{(λ)}_{y∈Y}(q) ≜ (1/λ) log( (1/|Y|) Σ_{y∈Y} e^{λ q_y} ),   (4)\n\nrespectively. Let θ* ∈ Θ be a minimizer of (2) for some given ε ≥ 0 and λ < 0. Then, there exists γ ≥ 0 such that θ* is also a minimizer of (3) with the same parameter setting.\nProof of Theorem 1 is given in Appendix D. 
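The softmin operator in (4) is the only non-standard ingredient of the empirical risk. A minimal sketch of it (ours, pure Python) uses a log-sum-exp shift for numerical stability; the `lam = 0` branch implements the limiting average:

```python
import math

def softmin(q, lam):
    """Soft-minimum of (4): (1/lam) * log((1/|Y|) * sum_y exp(lam * q_y)).
    lam < 0 emphasizes small entries (optimism); lam > 0 emphasizes large ones."""
    if lam == 0.0:                       # the limit lam -> 0 is the plain average
        return sum(q) / len(q)
    m = max(lam * v for v in q)          # log-sum-exp shift for stability
    lse = m + math.log(sum(math.exp(lam * v - m) for v in q))
    return (lse - math.log(len(q))) / lam

q = [0.5, 2.0, 3.0]
print(softmin(q, -1e6))  # ~ min(q) = 0.5
print(softmin(q, 0.0))   # average of q
print(softmin(q, 1e6))   # ~ max(q) = 3.0
```

For moderate negative λ the output sits strictly between the minimum and the average, which is exactly the soft-labeling effect described around (2).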
Note that softmin equals: (i) the min operator for λ = −∞, (ii) the average for λ = 0, and (iii) the max for λ = +∞. Also, ε and γ are non-negative dual parameters, and fixing either of them uniquely determines the other. Therefore, one can adjust γ (for example via cross-validation) instead of ε. See [9] for a similar discussion.\nA more subtle look at (3) shows that, in the dual context of the proposed scheme, one is free to also consider positive values for λ. The sign of λ indicates optimism (λ ≤ 0) or pessimism (λ > 0) during the (soft-)label assignment. The choice between optimism and pessimism depends on the compatibility of the model set Θ with the true distribution P0. In Section 2.2, we show that enabling λ to take values in R rather than R_− is crucial for establishing a generalization bound for (3). In other words, for a very bad hypothesis set, one must choose to be pessimistic in order to generalize well. To see situations where pessimism in SSL helps, the reader can refer to [32].\n\n2.1 Numerical Optimization\n\nWe propose a numerical optimization scheme for solving (3) which has a convergence guarantee. Lemmas E.1 and E.2 (supplementary) explicitly compute the gradients of (3). This way, one can simply apply mini-batch SGD to solve (2) via Algorithm 1. Note that, due to the strong concavity property in Lemma E.1, δ can be chosen arbitrarily small. Other parameters such as γ (or equivalently ε) and λ should be adjusted via cross-validation. The computational complexity of Algorithm 1 is at most η + |Y|(1 − η) times that of [9], where the latter can only handle supervised data1. 
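The alternation behind Algorithm 1, a δ-approximate inner maximization that defines φ_γ, followed by an outer gradient step on θ, can be illustrated on a toy one-dimensional problem. The sketch below is ours and only mimics that structure (supervised case, ℓ(z; θ) = (z − θ)², c(z, z′) = (z − z′)², γ > 1 so the inner objective is strongly concave); it is not the paper's implementation:

```python
# Toy sketch (ours) of the inner-max / outer-min alternation behind Algorithm 1.
# Loss l(z; theta) = (z - theta)^2, cost c(z, x) = (z - x)^2, and gamma > 1,
# so the inner objective l(z; theta) - gamma * c(z, x) is strongly concave in z.
GAMMA = 10.0

def inner_argmax(x, theta, steps=200, lr=0.04):
    """delta-approximate maximizer z* of l(z; theta) - gamma * c(z, x),
    found by plain gradient ascent started from the clean point x."""
    z = x
    for _ in range(steps):
        grad = 2.0 * (z - theta) - 2.0 * GAMMA * (z - x)
        z += lr * grad
    return z  # closed form here would be (GAMMA * x - theta) / (GAMMA - 1)

def train(data, iters=300, lr=0.1):
    """Outer loop: full-batch gradient descent on the adversarial surrogate.
    By a Danskin-type argument, d/dtheta phi_gamma(x; theta) = -2 (z* - theta)."""
    theta = 0.0
    for _ in range(iters):
        g = sum(-2.0 * (inner_argmax(x, theta) - theta) for x in data) / len(data)
        theta -= lr * g
    return theta

theta_star = train([1.0, 2.0, 3.0])
print(round(theta_star, 3))  # close to the data mean, 2.0
```

The stopping tolerance of the inner loop plays the role of δ in the convergence statement below: a looser inner solve biases the outer gradient, exactly as the Cδ term in Theorem 2 suggests.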
Note that Algorithm 1 reduces to [9] in fully-supervised scenarios, and coincides with Pseudo-Labeling and the EM algorithm when (γ = ∞, λ = −∞) and (γ = ∞, λ = −1), respectively.\n1 In scenarios where |Y| is very large, one can employ heuristic methods to reduce the set of possible labels for an unlabeled data sample and gain more efficiency at the expense of some degradation in performance.\n\nAlgorithm 1 Stochastic Gradient Descent for SSDRL\n1: Inputs: D, γ, λ, (k ≤ n, δ, α, T)\n2: Initialize θ_0 ∈ Θ, and set t ← 0.\n3: for t = 0 → T − 1 do\n4:   Randomly select an index set I ⊆ [n] with size k.\n5:   for i ∈ I_l ∩ I do\n6:     Compute a δ-approximation of z*_i(θ_t) from Lemma E.2.\n7:   end for\n8:   for (i, y) ∈ (I_ul ∩ I) × Y do\n9:     Compute a δ-approximation of z*_i(y; θ_t) from Lemma E.2.\n10:  end for\n11:  Compute the sub-gradient of R̂_SSAR(θ; D) from (E.3) (Lemma E.2) at the point θ = θ_t, using only the samples in I, and denote it by ∂_θ R̂_SSAR(θ_t; D).\n12:  θ_{t+1} ← Proj_Θ( θ_t − α ∂_θ R̂_SSAR(θ_t; D) )\n13: end for\n14: Output: θ* ← θ_T\n\nThe following theorem guarantees the convergence of Algorithm 1 to a local minimizer of (3).\nTheorem 2. Assume the loss function ℓ, the transportation cost c, γ ≥ 0 and |λ| < ∞ satisfy the conditions of Lemma E.2. Also, assume ℓ is differentiable w.r.t. both z and θ, with Lipschitz gradients, and let ∥∇_θ ℓ(z; θ)∥_2 ≤ σ for some σ ≥ 0 over all of Z × Θ. Denote by θ_0 ∈ Θ an initial hypothesis, and by θ* ∈ Θ a local minimizer of (3). Assume the partially-labeled dataset D includes n i.i.d. training samples. 
Also, let ΔR̂ ≜ R̂_SSAR(θ_0; D) − R̂_SSAR(θ*; D). Then, for a fixed step size α*, the outputs of Algorithm 1 with parameters k = 1, δ > 0, α = α* after T iterations, say θ_1, . . . , θ_T, satisfy the following inequality:\n\n(1/T) Σ_{t=1}^{T} E{ ∥∇_θ R̂_SSAR(θ_t; D)∥²_2 } ≤ 4σ² √( (B/σ² + (1 − η)|λ||Y|) ΔR̂ / T ) + Cδ,   (5)\n\nwhere the constants B and C and the step size α* only depend on γ and the Lipschitz constants of ℓ.\nThe proof of Theorem 2, with explicit formulations for the constants B and C and the step size α*, is given in Appendix D (supplementary). Theorem 2 guarantees a convergence rate of O(T^{−1/2}) for Algorithm 1, if one neglects δ. Note that the presence of δ is necessary, since one cannot find the exact maximizer of (E.2) in finitely many steps. However, due to Lemma E.1, δ can become infinitesimally small. Theorem D.1 (supplementary) guarantees the convergence of Algorithm 1 in hard-decision regimes, i.e. λ = ±∞. Note that ℓ is not necessarily convex w.r.t. θ, e.g. for neural nets. However, given a convex loss ℓ, Theorem D.2 gives a condition on λ that guarantees the convexity of (3) as well.\n\n2.2 Generalization Guarantee\n\nWe intend to bound the true adversarial risk, i.e. sup_{P∈B_ε(P0)} E_P {ℓ(Z; θ*)}, where θ* denotes the optimizer of the empirical risk in (3). However, the two major concerns are: (i) we are training our model against an adversary, and (ii) our training dataset is partially labeled. 
We first address these issues and then present our main contribution in Theorem 3.\nThe classical Rademacher complexity, denoted by R_n(F), measures how well a function set F can fit random noise, and thus how exposed it is to over-fitting on small datasets. We give a novel adversarial extension of R_n which also appears in our generalization bound. Moreover, we show it converges to zero as n → ∞ for all function sets with a finite VC-dimension. Before that, let us define the set of ε-Monge maps as A_ε ≜ {a : Z → Z | c(z, a(z)) ≤ ε, ∀z ∈ Z}.\n\nDefinition 2 (Semi-Supervised Monge (SSM) Rademacher Complexity). For Z ≜ X × Y, assume a function set F ⊆ R^Z and a distribution P0 ∈ M(Z). For ε ≥ 0 and n ∈ N, let us define\n\ng_l(n) ≜ E_{Z_{1:n},σ} { sup_{f∈F} (1/n) Σ_{i=1}^{n} σ_i sup_{a∈A_ε} f(a(Z_i)) }  and\ng_ul(n) ≜ Σ_{y∈Y} E_{X_{1:n},σ} { sup_{f∈F} (1/n) Σ_{i=1}^{n} σ_i sup_{a∈A_ε} f(a(X_i, y)) },\n\nwhere Z_{1:n} ~ P0 and X_{1:n} ~ P0_X are i.i.d., and σ ∈ {−1, +1}^n is a vector of i.i.d. symmetric Rademacher variables. Then, for η ∈ [0, 1], the SSM Rademacher complexity of F is defined as\n\nR^{(SSM)}_{n,(ε,η)}(F) ≜ η g_l(⌈nη⌉) + (1 − η) g_ul(⌈n(1 − η)⌉).\n\nBy setting ε = 0 and η = 1, the above definition reduces to the classical Rademacher complexity R_n. We define a function set to be learnable if lim_{n→∞} R_n = 0. Similarly, a function class F is said to be adversarially learnable w.r.t. parameters (ε, η) if lim_{n→∞} R^{(SSM)}_{n,(ε,η)}(F) = 0. 
But how can we numerically compute this measure in practice? The main difference between R_n and the SSM Rademacher complexity is that the latter adversarially alters the input distribution. However, many distribution-free bounds already exist for R_n [37], which apply to almost all practical function sets, e.g. classifiers with a bounded VC-dimension (such as neural nets), restricted regression tools, etc. We show that, given a distribution-free bound on the Rademacher complexity of F, one can also bound the SSM Rademacher complexity. Mathematically speaking, assume that there exists an asymptotically decreasing upper-bound Δ(n) such that R_n(F) ≤ Δ(n), ∀P0 ∈ M(X × Y). Then, for all η ∈ [0, 1] and ε ≥ 0, we have (Lemma E.3):\n\nR^{(SSM)}_{n,(ε,η)}(F) ≤ η Δ(⌈nη⌉) + (1 − η)|Y| Δ(⌈n(1 − η)⌉),   (6)\n\nwhere the r.h.s. of (6) goes to zero as n → ∞. This includes almost all practical classifiers, e.g. neural nets, support vector machines, random forests, etc. For example, consider the 0−1 loss for a classifier with a VC-dimension of dim(Θ). Then, due to Dudley's entropy bound and Haussler's upper-bound [37], there exists a constant C such that, regardless of ε or P0, we have (Lemma E.3):\n\nΔ(n) ≤ C √(dim(Θ)/n),  and thus  R^{(SSM)}_{n,(ε,η)}(F) ≤ C √(dim(Θ)/n) ( √η + √(1 − η) |Y| ).   (7)\n\n2.2.1 Minimum Supervision Ratio\n\nAs discussed earlier, the generalization of SSL frameworks generally requires a compatibility assumption on the hypothesis set F and the data distribution P0. 
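For intuition about how R_n behaves (for the classical quantity only; the SSM variant additionally involves the Monge maps), the empirical Rademacher complexity of a small finite class can be estimated by Monte Carlo over the sign vectors σ. The class and all numbers below are our own toy choices; the 1/√n decay predicted by bounds like (7) is visible already at small n:

```python
import random

def empirical_rademacher(xs, funcs, trials=2000, seed=0):
    """Monte Carlo estimate of R_n = E_sigma sup_f (1/n) sum_i sigma_i f(x_i)
    for a finite function class `funcs` on the fixed sample `xs`."""
    rng = random.Random(seed)
    n, total = len(xs), 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(sum(s * f(x) for s, x in zip(sigma, xs)) / n
                     for f in funcs)
    return total / trials

# Toy class: +/- threshold classifiers on [0, 1]; closed under negation,
# so the supremum inside the expectation is always non-negative.
thresholds = [t / 10.0 for t in range(11)]
funcs = [(lambda x, t=t, s=s: s * (1.0 if x >= t else -1.0))
         for t in thresholds for s in (+1.0, -1.0)]

rng = random.Random(42)
r_small = empirical_rademacher([rng.random() for _ in range(10)], funcs)
r_large = empirical_rademacher([rng.random() for _ in range(250)], funcs)
print(r_small, r_large)  # the estimate shrinks as n grows
```

The same Monte Carlo idea extends to g_l and g_ul in Definition 2 once one can (approximately) solve the inner supremum over the ε-Monge maps, which is the expensive part in practice.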
In Appendix C (and in particular, Definition C.4), we introduce a new compatibility function, denoted by MSR, which has the following form: MSR_{(F,P0)}(λ, margin) : R ∪ {±∞} × R_{≥0} → [0, 1]. Intuitively, MSR_{(F,P0)} quantifies the strength of the information-theoretic relation between the marginal measure P0_X and the conditional P0_{|X}. It also measures the richness of F for learning such relations. Due to Theorem 3, in order to bound the true risk when unlabeled data are involved, one needs η ≥ MSR_{(F,P0)}(λ, margin), for some λ and margin ≥ 0. Here, λ denotes the pessimism of the learner, and margin ≥ 0 specifies a safety margin for small-size datasets. MSR is an increasing function w.r.t. margin, while it decreases with λ. In particular, MSR_{(F,P0)}(+∞, margin) = 0 for all margin ≥ 0.\nFor a negative λ (optimistic learning), the MSR remains small as long as there exists a strong dependency between P0_X and the label conditionals P0_{|X}. This dependency can be obtained, for example, via the cluster assumption. Additionally, some loss functions in F need to be capable of capturing such dependency, e.g. by resembling the true negative log-likelihood −log P0(X, y). Conversely, the absence of such properties pushes the MSR toward 1, which forces the learner to choose a large λ (in the extreme case +∞) in order to use the bound of Theorem 3. Needless to say, a large λ increases the empirical loss and loosens the bound. This fact, however, should not be surprising, since improper usage of unlabeled data can in certain cases degrade the generalization. Lemma C.3 (supplementary) shows that one can analytically compute the MSR function for a particular case of interest, i.e. 
when the cluster assumption holds for P0 and the loss function family F is chosen properly.\nThis way, the following theorem gives a generalization bound for (3):\n\nFigure 1: Comparison of the test error-rates on adversarial examples attained via [9] among different methods. (a) MNIST, (b) SVHN, (c) CIFAR-10.\n\nTheorem 3 (Generalization). For a space Z ≜ X × Y, assume the set of continuous functions L ≜ {ℓ(·; θ) | θ ∈ Θ}, with ℓ(·; θ) : Z → R and ∥ℓ∥_∞ ≤ B for some B ≥ 0. For γ ≥ 0, z ∈ Z and θ ∈ Θ, let φ_γ(z; θ) be defined as in (4), and Φ ≜ {φ_γ(·; θ) | θ ∈ Θ}. For a supervision ratio η ∈ [0, 1], assume a partially labeled dataset D = {(X_i, y_i)}_{i=1}^{n} including n i.i.d. samples drawn from P0 ∈ M(Z), where each label is observed with probability η, independently. For 0 < δ ≤ 1 and λ ∈ R ∪ {±∞}, assume η satisfies the following condition:\n\nη ≥ MSR_{(Φ,P0)}( λ, 4B √(log(1/δ)/(2n)) + 4 R^{(SSM)}_{n,(ε,η)}(L) ).   (8)\n\nThen, with probability at least 1 − δ, the following bound holds for all ε ≥ 0:\n\nsup_{P∈B_ε(P0)} E_P {ℓ(Z; θ*)} ≤ min_{θ∈Θ} R̂_SSAR(θ; D) + 2B √(log(1/δ)/(2n)) + 2 R^{(SSM)}_{n,(ε,η)}(L),   (9)\n\nwhere θ* is the minimizer of R̂_SSAR(θ; D).\nProof of Theorem 3 is given in Appendix D. The condition in (8) can always be satisfied based on Lemma C.2, as long as λ and n are sufficiently large and L is adversarially learnable. 
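To get a feel for the scale of the slack in (9), one can plug the distribution-free bound (7) into the Rademacher term and evaluate the resulting expression numerically. Every constant below (B, C, dim Θ, |Y|, η, δ) is a made-up placeholder of ours, not a value from the paper:

```python
import math

def slack(n, B=1.0, C=1.0, dim_theta=50, n_labels=10, eta=0.1, delta=0.05):
    """Concentration + complexity slack on the r.h.s. of (9), with the SSM
    Rademacher complexity replaced by its distribution-free bound (7)."""
    conc = 2.0 * B * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    rad = C * math.sqrt(dim_theta / n) * (math.sqrt(eta)
          + math.sqrt(1.0 - eta) * n_labels)
    return conc + 2.0 * rad

for n in (10**3, 10**4, 10**5):
    print(n, round(slack(n), 4))  # O(1 / sqrt(n)) decay
```

Both terms scale as n^{-1/2}, so quadrupling the sample size exactly halves the slack; the |Y| factor in (7) is what makes the unlabeled part of the bound expensive for many-class problems.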
A strongly-compatible pair (Φ, P0) encourages optimism, where the learner can choose a negative λ. However, in some situations increasing λ might be necessary for (8) to hold; in fact, for a weakly-compatible (Φ, P0), λ must be positive or even +∞ (the latter always satisfies (8) regardless of n or η). Note that a larger λ increases the empirical risk R̂_SSAR(θ*; D), which also increases the bound in (9). Interestingly, λ = +∞ coincides with the setting of [32], which makes it a special case of our analysis. The limiting cases of Theorem 3, i.e. ε = 0 and η = 1, lead to a new bound for non-robust SSL, and to an existing bound for the supervised DRL of [9], respectively.\n\n3 Experimental Results\n\nThis section demonstrates our experimental results on some real-world datasets, and also compares SSDRL with its state-of-the-art rival methodologies. Deep Neural Networks (DNNs) are considered for the loss {ℓ(·; θ) | θ ∈ Θ}. The architecture and other specifications of our DNNs are explained in detail in Appendix A. The rival frameworks in this section are Virtual Adversarial Training (VAT) [4], Pseudo-Labeling (PL) [19], and the supervised DRL of [9], which we simply denote as DRL. We have also implemented a fast version of SSDRL, called F-SSDRL, which for each unlabeled training sample considers only a limited number of more favorable labels in Algorithm 1. Here, by more favorable labels, we refer to those labels that result in a smaller non-robust loss ℓ(·; θ). As a result, F-SSDRL runs much faster than SSDRL without much degradation in performance. Surprisingly, we found that F-SSDRL often yields even better performance in practice compared to SSDRL (see Appendix A for more details).\nFigure 1 shows the misclassification rate vs. 
γ−1 on adversarial test examples attained by computing φ_γ(·; θ) (the same attack strategy as in [9]). Recall that γ is the dual counterpart of the Wasserstein radius ε in (2). Thus, γ−1 roughly quantifies the strength of the adversarial attacks, as suggested by [9]. Results have been depicted for the MNIST, SVHN and CIFAR-10 datasets.\n\nFigure 2: Comparison of the test error-rates on adversarial examples calculated by PGM [38], under an ℓ2-norm constraint. (a) MNIST, (b) SVHN, (c) CIFAR-10.\n\nTable 2: Test error-rates on clean examples. For DRL, VAT and F-SSDRL, rows 1 to 3 correspond to the parameter (γ_i for DRL and F-SSDRL, and ε_i for VAT) that yields the lowest error rates on: (i = 1) clean examples, (i = 2) adversarial examples by [9], and (i = 3) adversarial examples by PGM, respectively.\n\nMethod          MNIST        SVHN         CIFAR-10\nDRL (γ1)        4.67±0.38    10.89±0.53   21.62±0.40\nDRL (γ2)        4.77±0.15    10.89±0.53   23.77±0.65\nDRL (γ3)        5.95±0.13    14.45±0.93   21.97±0.35\nPL              8.70±0.47    6.39±0.46    21.19±0.25\nVAT (ε1)        1.30±0.04    5.47±0.33    15.19±0.55\nVAT (ε2)        1.30±0.04    7.01±0.24    15.19±0.55\nVAT (ε3)        1.32±0.10    7.01±0.24    15.19±0.55\nF-SSDRL (γ1)    1.29±0.09    6.19±0.22    17.94±0.20\nF-SSDRL (γ2)    1.51±0.03    6.19±0.22    18.35±0.24\nF-SSDRL (γ3)    3.58±1.13    6.74±0.22    18.64±0.16\n
Figure 2 demonstrates the same procedure for adversarial examples generated by the Projected-Gradient Method (PGM) [38]; in this case, the error-rate is depicted vs. PGM's attack strength, i.e., ε. For VAT and SSDRL, curves are shown for different choices of hyper-parameters (γ_i or ε_i), i = 1, 2, 3, which correspond to the lowest error rates on: (i = 1) clean examples, (i = 2) adversarial examples by [9], and (i = 3) adversarial examples by PGM, respectively. The values of (γ_i, ε_i), the choices of λ, the transportation cost c, and the supervision ratio η, along with more details on the experiments, can be found in Appendix A.
According to Figures 1 and 2, the proposed method is always superior to DRL and PL. Also, SSDRL outperforms VAT on the SVHN dataset regardless of the attack type, while it has a comparable error-rate on MNIST and CIFAR-10, based on Figures 1a and 2c, respectively. The superiority over DRL highlights the fact that the exploitation of unlabeled data has improved the performance. However, SSDRL under-performs VAT on the MNIST and CIFAR-10 datasets if the order of attacks is reversed, even though the performances remain close. According to Figure 2a, the accuracy of PL degrades quite slowly as PGM's ε increases, although the loss values increase in Figure A.7a. This phenomenon is due to the fact that the adversarial directions for increasing the loss and the error-rate are not correlated in this particular case.
Table 2 shows the test error-rates on clean examples for F-SSDRL, VAT, PL and DRL on the MNIST, SVHN and CIFAR-10 datasets. In fact, Table 2 quantifies the non-adversarial generalization that one can attain in practice via distributional robustness. Again, F-SSDRL outperforms both PL and DRL in all experimental settings. It also surpasses VAT on the SVHN dataset.
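For reference, the PGM attack of [38] under an ℓ2-norm constraint is iterated gradient ascent on the loss, followed by projection onto the ε-ball around the clean input. Below is a minimal, self-contained NumPy sketch with a toy linear loss; step size and iteration count are illustrative placeholders, not the configuration of our experiments:

```python
import numpy as np

def pgm_l2(x0, loss_grad, eps, lr=0.1, steps=50):
    """Projected-gradient attack under an l2 ball:
    maximize loss(x) subject to ||x - x0||_2 <= eps (sketch of [38])."""
    x = x0.copy()
    for _ in range(steps):
        g = loss_grad(x)
        # normalized-gradient ascent step
        x = x + lr * g / (np.linalg.norm(g) + 1e-12)
        # project back onto the eps-ball centered at the clean input
        d = x - x0
        n = np.linalg.norm(d)
        if n > eps:
            x = x0 + d * (eps / n)
    return x

# Toy loss(x) = w . x grows fastest along w, so the attack should
# end up on the ball's boundary in the direction of w.
w = np.array([3.0, 4.0])
x0 = np.zeros(2)
x_adv = pgm_l2(x0, lambda x: w, eps=1.0)
# expected: x_adv ≈ w / ||w|| = [0.6, 0.8]
```

The projection step is what distinguishes PGM's hard norm constraint from the soft transportation-cost penalty used by the attack of [9].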
F-SSDRL under-performs VAT on MNIST and CIFAR-10; however, the differences between the error-rates remain small, which means the two methods have comparable performances.

4 Conclusions

This paper investigates the application of distributionally robust learning to partially labeled datasets. The core idea is to take a well-known semi-supervised framework, known as self-learning, and make it robust to adversarial attacks. A novel framework, called SSDRL, has been proposed which encompasses many existing methods, such as Pseudo-Labeling (PL) and the EM algorithm, as special cases. The computational complexity of our method is shown to be comparable with that of its supervised counterparts. We have derived convergence and generalization guarantees for SSDRL, where for the latter, a number of novel complexity measures are proposed. In particular, an adversarial extension of Rademacher complexity is proposed and shown to converge to zero for almost all practical learning frameworks, including neural networks, that have a finite VC-dimension. Moreover, our theoretical analysis reveals a more general and fundamental condition for assessing the role of unlabeled data in generalization, by introducing a new complexity measure called the Minimum Supervision Ratio (MSR). This is in contrast to many existing works that need more restrictive assumptions, such as the cluster assumption, to be applicable. Computer simulations on real-world datasets demonstrate a comparable-to-superior performance for SSDRL compared with the state-of-the-art. In future work, we aim to improve the generalization, for example, by empirically computing the MSR.
Another research direction is fitting more SSL methods into the core idea of our work.

References

[1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations, 2014.

[2] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 427–436.

[3] I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations, 2015.

[4] T. Miyato, S. Maeda, S. Ishii, and M. Koyama, “Virtual adversarial training: A regularization method for supervised and semi-supervised learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[5] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” in 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 2016, pp. 582–597.

[6] M. Staib and S. Jegelka, “Distributionally robust deep learning as a generalization of adversarial training,” in NIPS Workshop on Machine Learning and Computer Security, 2017.

[7] A. Ben-Tal, D. Den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen, “Robust solutions of optimization problems affected by uncertain probabilities,” Management Science, vol. 59, no. 2, pp. 341–357, 2013.

[8] S. Shafieezadeh-Abadeh, P. M. Esfahani, and D. Kuhn, “Distributionally robust logistic regression,” in Advances in Neural Information Processing Systems, 2015, pp. 1576–1584.

[9] A. Sinha, H. Namkoong, and J. Duchi, “Certifiable distributional robustness with principled adversarial training,” in International Conference on Learning Representations, 2018.

[10] W. Hu, G. Niu, I. Sato, and M. Sugiyama, “Does distributionally robust supervised learning give robust classifiers?” in International Conference on Machine Learning, 2018, pp. 2034–2042.

[11] P. M. Esfahani and D. Kuhn, “Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations,” Mathematical Programming, pp. 1–52, 2017.

[12] J. Blanchet and Y. Kang, “Semi-supervised learning based on distributionally robust optimization,” arXiv preprint arXiv:1702.08848, 2017.

[13] Y. Grandvalet and Y. Bengio, “Semi-supervised learning by entropy minimization,” in Advances in Neural Information Processing Systems, 2005, pp. 529–536.

[14] X. Zhu, “Semi-supervised learning literature survey,” Computer Science, University of Wisconsin-Madison, vol. 2, no. 3, p. 4, 2006.

[15] O. Chapelle, B. Scholkopf, and A. Zien, Eds., Semi-Supervised Learning. MIT Press, 2006.

[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[17] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, vol. 2011, 2011, p. 5.

[18] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.

[19] D.-H. Lee, “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in ICML Workshop on Challenges in Representation Learning, vol. 2, 2013.

[20] Z. Cranko, A. K. Menon, R. Nock, C.-S. Ong, Z. Shi, and C. Walder, “Monge blunts Bayes: Hardness results for adversarial training,” in International Conference on Machine Learning, 2019, pp. 1406–1415.

[21] J. Duchi, P. Glynn, and H. Namkoong, “Statistics of robust optimization: A generalized empirical likelihood approach,” arXiv preprint arXiv:1610.03425, 2016.

[22] Y. Wang, X. Ma, J. Bailey, J. Yi, B. Zhou, and Q. Gu, “On the convergence and robustness of adversarial training,” in International Conference on Machine Learning, 2019, pp. 6586–6595.

[23] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry, “Adversarially robust generalization requires more data,” in Advances in Neural Information Processing Systems, 2018, pp. 5014–5026.

[24] D. Cullina, A. N. Bhagoji, and P. Mittal, “PAC-learning in the presence of evasion adversaries,” arXiv preprint arXiv:1806.01471, 2018.

[25] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. R. Salakhutdinov, “Good semi-supervised learning that requires a bad GAN,” in Advances in Neural Information Processing Systems, 2017, pp. 6510–6520.

[26] A. Balsubramani and Y. Freund, “Scalable semi-supervised aggregation of classifiers,” in Advances in Neural Information Processing Systems, 2015, pp. 1351–1359.

[27] Y. Yan, Z. Xu, I. W. Tsang, G. Long, and Y. Yang, “Robust semi-supervised learning through label aggregation,” in Association for the Advancement of Artificial Intelligence, 2016, pp. 2244–2250.

[28] A. Raghunathan, S. M. Xie, F. Yang, J. Duchi, and P. Liang, “Adversarial training can hurt generalization,” in ICML Workshop on Identifying and Understanding Deep Learning Phenomena, 2019.

[29] R. Zhai, T. Cai, D. He, C. Dan, K. He, J. Hopcroft, and L. Wang, “Adversarially robust generalization just requires more unlabeled data,” arXiv preprint arXiv:1906.00555, 2019.

[30] Y. Carmon, A. Raghunathan, L. Schmidt, P. Liang, and J. C. Duchi, “Unlabeled data improves adversarial robustness,” in Advances in Neural Information Processing Systems (NeurIPS), 2019.

[31] R. Stanforth, A. Fawzi, P. Kohli et al., “Are labels required for improving adversarial robustness?” arXiv preprint arXiv:1905.13725, 2019.

[32] M. Loog, “Contrastive pessimistic likelihood estimation for semi-supervised classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 462–475, 2016.

[33] A. Singh, R. Nowak, and X. Zhu, “Unlabeled data: Now it helps, now it doesn’t,” in Advances in Neural Information Processing Systems, 2009, pp. 1513–1520.

[34] P. Rigollet, “Generalization error bounds in semi-supervised classification under the cluster assumption,” Journal of Machine Learning Research, vol. 8, no. Jul, pp. 1369–1392, 2007.

[35] M.-R. Amini and P. Gallinari, “Semi-supervised logistic regression,” in European Conference on Artificial Intelligence, 2002, pp. 390–394.

[36] S. Basu, A. Banerjee, and R. Mooney, “Semi-supervised clustering by seeding,” in International Conference on Machine Learning, 2002, pp. 27–34.

[37] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. MIT Press, 2012.

[38] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in International Conference on Learning Representations, 2018.