{"title": "Generalizing to Unseen Domains via Adversarial Data Augmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 5334, "page_last": 5344, "abstract": "We are concerned with learning models that generalize well to different unseen\ndomains. We consider a worst-case formulation over data distributions that are\nnear the source domain in the feature space. Only using training data from a single\nsource distribution, we propose an iterative procedure that augments the dataset\nwith examples from a fictitious target domain that is \"hard\" under the current model. We show that our iterative scheme is an adaptive data augmentation method where we append adversarial examples at each iteration. For softmax losses, we show that our method is a data-dependent regularization scheme that behaves differently from classical regularizers that regularize towards zero (e.g., ridge or lasso). On digit recognition and semantic segmentation tasks, our method learns models improve performance across a range of a priori unknown target domains.", "full_text": "Generalizing to Unseen Domains\nvia Adversarial Data Augmentation\n\nRiccardo Volpi\u2217,\u2020\n\nIstituto Italiano di Tecnologia\n\nHongseok Namkoong\u2217\nStanford University\n\nOzan Sener\nIntel Labs\n\nJohn Duchi\n\nStanford University\n\nVittorio Murino\n\nIstituto Italiano di Tecnologia\n\nUniversit\u00e0 di Verona\n\nSilvio Savarese\n\nStanford University\n\nAbstract\n\nWe are concerned with learning models that generalize well to different unseen\ndomains. We consider a worst-case formulation over data distributions that are\nnear the source domain in the feature space. 
Using only training data from a single source distribution, we propose an iterative procedure that augments the dataset with examples from a fictitious target domain that is "hard" under the current model. We show that our iterative scheme is an adaptive data augmentation method where we append adversarial examples at each iteration. For softmax losses, we show that our method is a data-dependent regularization scheme that behaves differently from classical regularizers that regularize towards zero (e.g., ridge or lasso). On digit recognition and semantic segmentation tasks, our method learns models that improve performance across a range of a priori unknown target domains.

1 Introduction

In many modern applications of machine learning, we wish to learn a system that can perform uniformly well across multiple populations. Due to the high cost of data acquisition, however, datasets often consist of a limited number of population sources. Standard models that perform well when evaluated on the validation dataset (usually collected from the same population as the training dataset) often perform poorly on populations different from that of the training data [15, 3, 1, 32, 38]. In this paper, we are concerned with generalizing to populations different from the training distribution, in settings where we have no access to any data from the unknown target distributions. For example, consider a module for self-driving cars that needs to generalize well across weather conditions and city environments unexplored during training.

A number of authors have proposed domain adaptation methods (for example, see [9, 39, 36, 26, 40]) in settings where a fully labeled source dataset and an unlabeled (or partially labeled) set of examples from fixed target distributions are available.
Although such algorithms can successfully learn models that perform well on known target distributions, the assumption of a priori fixed target distributions can be restrictive in practical scenarios. For example, consider a semantic segmentation algorithm used by a robot: every task, robot, environment and camera configuration will result in a different target distribution, and these diverse scenarios can be identified only after the model is trained and deployed, making it difficult to collect samples from them.

In this work, we develop methods that can learn to better generalize to new unknown domains. We consider the restrictive setting where training data comes only from a single source domain. Inspired by recent developments in distributionally robust optimization and adversarial training [34, 20, 12], we consider the following worst-case problem around the (training) source distribution P0:

    minimize_{θ∈Θ}  sup_{P : D(P, P0) ≤ ρ}  E_P[ℓ(θ; (X, Y))].        (1)

Here, θ ∈ Θ is the model, (X, Y) ∈ X × Y is a source data point with its labeling, ℓ : X × Y → R is the loss function, and D(P, Q) is a distance metric on the space of probability distributions. The solution to the worst-case problem (1) guarantees good performance against data distributions that are distance ρ away from the source domain P0. To allow data distributions whose support differs from that of the source P0, we use Wasserstein distances as our metric D.

∗Equal contribution.
†Work done while the author was a Visiting Student Researcher at Stanford University.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
Our distance will be defined on the semantic space³, so that target populations P satisfying D(P, P0) ≤ ρ represent realistic covariate shifts that preserve the semantic representation of the source (e.g., adding color to a greyscale image). In this regard, we expect the solution to the worst-case problem (1), the model that we wish to learn, to have favorable performance across covariate shifts in the semantic space.

We propose an iterative procedure that aims to solve the problem (1) for a small value of ρ at a time, and performs stochastic gradient updates to the model θ with respect to these fictitious worst-case target distributions (Section 2). Each iteration of our method uses small values of ρ, and we provide a number of theoretical interpretations of our method. First, we show that our iterative algorithm is an adaptive data augmentation method where we add adversarially perturbed samples, at the current model, to the dataset (Section 3). More precisely, our adversarially generated samples roughly correspond to Tikhonov regularized Newton-steps [21, 25] on the loss in the semantic space. Further, we show that for softmax losses, each iteration of our method can be thought of as a data-dependent regularization scheme where we regularize towards the parameter vector corresponding to the true label, instead of regularizing towards zero like classical regularizers such as ridge or lasso.

From a practical viewpoint, a key difficulty in applying the worst-case formulation (1) is that the magnitude of the covariate shift ρ is a priori unknown. We propose to learn an ensemble of models that correspond to different distances ρ. In other words, our iterative method generates a collection of datasets, each corresponding to a different inter-dataset distance level ρ, and we learn a model for each of them.
At test time, we use a heuristic method to choose an appropriate model from the ensemble.

We test our approach on a simple digit recognition task and on a more realistic semantic segmentation task across different seasons and weather conditions. In both settings, we observe that our method allows us to learn models that improve performance across a priori unknown target distributions with varying distance from the original source domain.

Related work

The literature on adversarial training [10, 34, 20, 12] is closely related to our work, since its main goal is to devise training procedures that learn models robust to fluctuations in the input. Departing from the imperceptible attacks considered in adversarial training, we aim to learn models that are resistant to larger perturbations, namely out-of-distribution samples. Sinha et al. [34] propose a principled adversarial training procedure in which new images that maximize some risk are generated, and the model parameters are optimized with respect to those adversarial images. Being devised for defense against imperceptible adversarial attacks, the new images are learned with a loss that penalizes differences between the original images and the new ones. In this work, we rely on a minimax game similar to the one proposed by Sinha et al. [34], but we impose the constraint in the semantic space, in order to allow our adversarial samples from a fictitious distribution to differ at the pixel level while sharing the same semantics.

There is a substantial body of work on domain adaptation [15, 3, 32, 9, 39, 36, 26, 40], which aims to better generalize to a priori fixed target domains whose labels are unknown at training time. This setup is different from ours in that these algorithms require access to samples from the target distribution during training.
Domain generalization methods [28, 22, 27, 33, 24], which propose different ways to better generalize to unknown domains, are also related to our work. These algorithms require training samples drawn from multiple domains (with access to the domain labels during training) rather than from a single source, a limitation that our method does not have. In this sense, one could interpret our problem setting as unsupervised domain generalization. Tobin et al. [37] propose domain randomization, which applies to simulated data and creates a variety of random renderings with the simulator, hoping that the real world will be interpreted as one of them. Our goal is the same, since we aim at obtaining data distributions more similar to real-world ones, but we accomplish it by actually learning new data points, which makes our approach applicable to any data source, without the need of a simulator.

Hendrycks and Gimpel [13] suggest that a good empirical way to detect whether a test sample is out-of-distribution for a given model is to evaluate the statistics of the softmax outputs. We adapt this idea to our setting, learning an ensemble of models trained with our method and choosing at test time the model with the greatest maximum softmax value.

³By semantic space we mean learned representations, since recent works [7, 16] suggest that distances in the space of learned representations of high-capacity models typically correspond to semantic distances in visual space.

2 Method

The worst-case formulation (1) over domains around the source P0 hinges on the notion of distance D(P, P0), which characterizes the set of unknown populations we wish to generalize to. Conventional notions of Wasserstein distance used for adversarial training [34] are defined with respect to the original input space X, which for images corresponds to raw pixels.
Since our goal is to consider fictitious target distributions corresponding to realistic covariate shifts, we define our distance on the semantic space. Before properly defining our setup, we first fix some notation. Letting p be the dimension of the output of the last hidden layer, we write θ = (θc, θf), where θc ∈ R^{p×m} is the set of weights of the final layer and θf denotes the remaining weights of the network. We denote by g(θf; x) the output of the embedding layer of our neural network. For example, in the classification setting, m is the number of classes and we consider the softmax loss

    ℓ(θ; (x, y)) := −log [ exp(θ_{c,y}^⊤ g(θf; x)) / Σ_{j=1}^m exp(θ_{c,j}^⊤ g(θf; x)) ],        (2)

where θ_{c,j} is the j-th column of the classification layer weights θc ∈ R^{p×m}.

Wasserstein distance on the semantic space. On the space R^p × Y, consider the following transportation cost c, the cost of moving mass from (z, y) to (z′, y′):

    c((z, y), (z′, y′)) := (1/2) ‖z − z′‖₂² + ∞ · 1{y ≠ y′}.

The transportation cost takes the value ∞ for data points with different labels, since we are only interested in perturbations to the marginal distribution of Z. We now define our notion of distance on the semantic space. For inputs coming from the original space X × Y, we consider the transportation cost cθ defined with respect to the output of the last hidden layer,

    cθ((x, y), (x′, y′)) := c((g(θf; x), y), (g(θf; x′), y′)),

so that cθ measures distance with respect to the feature mapping g(θf; x).
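As a concrete illustration, the transportation cost above can be written as a small function. This is a sketch only (not the authors' code); `z` and `z_prime` stand for the feature vectors g(θf; x) and g(θf; x′), given as plain sequences of floats:

```python
def transport_cost(z, y, z_prime, y_prime):
    """Semantic-space transportation cost:
    c((z, y), (z', y')) = 0.5 * ||z - z'||_2^2 + inf * 1{y != y'}."""
    if y != y_prime:
        return float("inf")  # moving mass across labels is forbidden
    # halved squared Euclidean distance between the two feature vectors
    return 0.5 * sum((a - b) ** 2 for a, b in zip(z, z_prime))
```

For instance, `transport_cost([1.0, 0.0], 3, [0.0, 0.0], 3)` evaluates to 0.5, while any pair with different labels has infinite cost, which is how the formulation restricts the fictitious targets to label-preserving shifts.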
For probability measures P and Q, both supported on X × Y, let Π(P, Q) denote the set of their couplings, meaning measures M with M(A, X × Y) = P(A) and M(X × Y, A) = Q(A). Then, we define our notion of distance by

    Dθ(P, Q) := inf_{M ∈ Π(P, Q)} E_M[cθ((X, Y), (X′, Y′))].        (3)

Armed with this notion of distance on the semantic space, we now consider a variant of the worst-case problem (1), where we replace the distance with Dθ (3), our adaptive notion of distance defined on the semantic space:

    minimize_{θ∈Θ}  sup_P {E_P[ℓ(θ; (X, Y))] : Dθ(P, P0) ≤ ρ}.

Computationally, the above supremum over probability distributions is intractable. Hence, we consider the following Lagrangian relaxation with penalty parameter γ:

    minimize_{θ∈Θ}  sup_P {E_P[ℓ(θ; (X, Y))] − γ Dθ(P, P0)}.        (4)

Algorithm 1 Adversarial Data Augmentation
Input: original dataset {(Xi, Yi)}_{i=1,...,n} and initialized weights θ0
Output: learned weights θ
1:  Initialize: θ ← θ0
2:  for k = 1, ..., K do        ▷ Run the minimax procedure K times
3:      for t = 1, ..., Tmin do
4:          Sample (Xt, Yt) uniformly from the dataset
5:          θ ← θ − α ∇θ ℓ(θ; (Xt, Yt))
6:      Sample {(Xi, Yi)}_{i=1,...,n} uniformly from the dataset
7:      for i = 1, ..., n do
8:          X_i^k ← Xi
9:          for t = 1, ..., Tmax do
10:             X_i^k ← X_i^k + η ∇x {ℓ(θ; (X_i^k, Yi)) − γ cθ((X_i^k, Yi), (Xi, Yi))}
11:         Append (X_i^k, Yi) to the dataset
12: for t = 1, ..., T do
13:     Sample (X, Y) uniformly from the dataset
14:     θ ← θ − α ∇θ ℓ(θ; (X, Y))

Taking the dual reformulation of the penalty relaxation (4), we can obtain an efficient solution procedure.
The following result is a minor adaptation of [2, Theorem 1]; to ease notation, let us define the robust surrogate loss

    φγ(θ; (x0, y0)) := sup_{x∈X} {ℓ(θ; (x, y0)) − γ cθ((x, y0), (x0, y0))}.        (5)

Lemma 1. Let ℓ : Θ × (X × Y) → R be continuous. For any distribution Q and any γ ≥ 0, we have

    sup_P {E_P[ℓ(θ; (X, Y))] − γ Dθ(P, Q)} = E_Q[φγ(θ; (X, Y))].        (6)

In order to solve the penalty problem (4), we can now perform stochastic gradient descent on the robust surrogate loss φγ. Under suitable conditions [5], we have

    ∇θ φγ(θ; (x0, y0)) = ∇θ ℓ(θ; (x⋆_γ, y0)),        (7)

where x⋆_γ = argmax_{x∈X} {ℓ(θ; (x, y0)) − γ cθ((x, y0), (x0, y0))} is an adversarial perturbation of x0 at the current model θ. Hence, computing gradients of the robust surrogate φγ requires solving the maximization problem (5). Below, we describe a (heuristic) procedure that iteratively performs stochastic gradient steps on the robust surrogate φγ.

Iterative procedure. We propose an iterative training procedure that alternates two phases: a maximization phase, in which new data points are learned by computing the inner maximization problem (5), and a minimization phase, in which the model parameters are updated according to stochastic gradients of the loss evaluated on the adversarial examples generated in the maximization phase. The latter step is equivalent to stochastic gradient steps on the robust surrogate loss φγ, which motivates its name.
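For intuition, the inner maximization (5) can be approximated by plain gradient ascent. The sketch below is illustrative only: it assumes a linear feature map g(x) = Wx and the softmax loss (2), so that all gradients have closed forms; the names `W`, `theta_c`, `eta` and `t_max` mirror the corresponding quantities in our notation and in Algorithm 1, and are not the authors' code.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def maximization_phase(x, y, W, theta_c, gamma, eta=0.1, t_max=15):
    """Gradient ascent on  l(theta; (x', y)) - gamma * 0.5 * ||g(x') - g(x)||^2
    for the linear feature map g(x) = W @ x and the softmax loss (2)."""
    z0 = W @ x                       # anchor point in the semantic space
    x_adv = x.astype(float).copy()
    num_classes = theta_c.shape[1]
    onehot = np.eye(num_classes)[y]
    for _ in range(t_max):
        z = W @ x_adv
        p = softmax(theta_c.T @ z)   # class probabilities
        # d/dz of the objective: softmax-loss gradient minus gamma * (z - z0)
        dobj_dz = theta_c @ (p - onehot) - gamma * (z - z0)
        x_adv += eta * (W.T @ dobj_dz)   # ascent step in input space
    return x_adv
```

With γ = 0 the iterates simply ascend the loss; a large γ keeps g(x_adv) close to g(x), which is the semantic constraint the method relies on.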
The main idea is to iteratively learn "hard" data points from fictitious target distributions, while preserving the semantic features of the original data points. Concretely, in the k-th maximization phase, we compute n adversarially perturbed samples at the current model θ ∈ Θ:

    X_i^k ∈ argmax_{x∈X} {ℓ(θ; (x, Yi)) − γ cθ((x, Yi), (X_i^{k−1}, Yi))},        (8)

where X_i^0 are the original samples from the source distribution P0. The minimization phase then performs repeated stochastic gradient steps on the augmented dataset {(X_i^k, Yi)}_{0≤k≤K, 1≤i≤n}. The maximization phase (8) can be efficiently computed for smooth losses if x ↦ c_{θ^{k−1}}((x, Yi), (X_i^{k−1}, Yi)) is strongly convex [34, Theorem 2]; for example, this is provably true for any linear network. In practice, we use gradient ascent steps to solve for the worst-case examples (8); see Algorithm 1 for the full description of our algorithm.

Ensembles for classification. The hyperparameter γ, which is inversely proportional to ρ, the distance between the fictitious target distribution and the source, controls the ability to generalize outside the source domain. Since target domains are unknown, it is difficult to choose an appropriate level of γ a priori. We propose a heuristic ensemble approach in which we train s models {θ¹, ..., θˢ}. Each model is associated with a different value of γ, and thus with fictitious target distributions at varying distances from the source P0.
To select the best model at test time, inspired by Hendrycks and Gimpel [13], given a sample x we select the model θ^{u⋆(x)} with the greatest maximum softmax score:

    u⋆(x) := argmax_{1≤u≤s} max_{1≤j≤m} (θ^u_{c,j})^⊤ g(θ^u_f; x).        (9)

3 Theoretical Motivation

In our iterative algorithm (Algorithm 1), the maximization phase (8) is the key step: it augments the dataset with adversarially perturbed data points, and is followed by standard stochastic gradient updates to the model parameters. In this section, we provide some theoretical understanding of the augmentation step (8). First, we show that the augmented data points (8) can be interpreted as Tikhonov regularized Newton-steps [21, 25] in the semantic space under the current model. Roughly speaking, this gives the sense in which Algorithm 1 is an adaptive data augmentation algorithm that adds data points from fictitious "hard" target distributions. Second, recall the robust surrogate loss (5), whose stochastic gradients are used to update the model parameters θ in the minimization step (Eq. (7)). In the classification setting, we show that the robust surrogate (5) roughly corresponds to a novel data-dependent regularization scheme on the softmax loss ℓ. Instead of penalizing towards zero like classical regularizers (e.g., ridge or lasso), our data-dependent regularization term penalizes deviations from the parameter vector corresponding to the true label.

3.1 Adaptive Data Augmentation

We now give an interpretation of the augmented data points in the maximization phase (8).
Concretely, we fix θ ∈ Θ, x0 ∈ X, y0 ∈ Y, and consider an ε-maximizer

    x⋆_ε ∈ ε-argmax_{x∈X} {ℓ(θ; (x, y0)) − γ cθ((x, y0), (x0, y0))}.

We let z0 := g(θf; x0) ∈ R^p, and abuse notation by writing ℓ(θ; (z0, y0)) := ℓ(θ; (x0, y0)). In what follows, we show that the feature mapping g(θf; x⋆_ε) satisfies

    g(θf; x⋆_ε) = g(θf; x0) + (1/γ) (I − (1/γ) ∇zz ℓ(θ; (z0, y0)))^{−1} ∇z ℓ(θ; (z0, y0)) + O(√(ε/γ) + 1/γ²),        (10)

where the first two terms on the right-hand side define ĝ_newton(θf; x0). Intuitively, this implies that the adversarially perturbed sample x⋆_ε is drawn from a fictitious target distribution in which the probability mass on z0 = g(θf; x0) is transported to ĝ_newton(θf; x0). We note that the transported point in the semantic space corresponds to a Tikhonov regularized Newton-step [21, 25] on the loss z ↦ ℓ(θ; (z, y0)) at the current model θ. Noting that computing ĝ_newton(θf; x0) involves backsolves on a large dense matrix, we can interpret our gradient ascent updates in the maximization phase (8) as an iterative scheme for approximating this quantity.

We assume sufficient smoothness, where we use ‖H‖ to denote the ℓ2-operator norm of a matrix H.

Assumption 1.
There exist L0, L1 > 0 such that, for all z, z′ ∈ R^p, we have |ℓ(θ; (z, y0)) − ℓ(θ; (z′, y0))| ≤ L0 ‖z − z′‖₂ and ‖∇z ℓ(θ; (z, y0)) − ∇z ℓ(θ; (z′, y0))‖₂ ≤ L1 ‖z − z′‖₂.

Assumption 2. There exists L2 > 0 such that, for all z, z′ ∈ R^p, we have ‖∇zz ℓ(θ; (z, y0)) − ∇zz ℓ(θ; (z′, y0))‖ ≤ L2 ‖z − z′‖₂.

Then, we have the following bound, which yields the expansion (10); the proof is deferred to Appendix A.1.

Theorem 1. Let Assumptions 1 and 2 hold. If Im(g(θf; ·)) = R^p and γ > L1, then

    ‖g(θf; x⋆_ε) − ĝ_newton(θf; x0)‖₂² ≤ 2ε/(γ − L1) + L2/(3(γ − L1)) {(5L0/(γ − L1))³ + (L0/γ)³ + (2ε/γ)^{3/2}}.

3.2 Data-Dependent Regularization

In this section, we argue that, under suitable conditions on the loss,

    φγ(θ; (z, y)) = ℓ(θ; (z, y)) + (1/γ) ‖∇z ℓ(θ; (z, y))‖₂² + O(1/γ²).

For classification problems, we show that the robust surrogate loss (5) corresponds to a particular data-dependent regularization scheme. Let ℓ(θ; (x, y)) be the m-class softmax loss (2), given by

    ℓ(θ; (x, y)) = −log p_y(θ, x), where p_j(θ, x) := exp(θ_{c,j}^⊤ g(θf; x)) / Σ_{l=1}^m exp(θ_{c,l}^⊤ g(θf; x)),
where θ_{c,j} ∈ R^p is the j-th column of the classification layer weights θc ∈ R^{p×m}. Then, the robust surrogate φγ is an approximate regularizer on the classification layer weights θc:

    φγ(θ; (x, y)) = ℓ(θ; (x, y)) + (1/γ) ‖θ_{c,y} − Σ_{j=1}^m p_j(θ, x) θ_{c,j}‖₂² + O(1/γ²).        (11)

The expansion (11) shows that the robust surrogate (5) is roughly equivalent to data-dependent regularization where we minimize the distance between Σ_{j=1}^m p_j(θ, x) θ_{c,j}, our "average estimated linear classifier", and θ_{c,y}, the linear classifier corresponding to the true label y. Concretely, for any fixed θ ∈ Θ, we have the following result, where we use L(θ) := 2 max_{1≤j′≤m} ‖θ_{c,j′}‖₂ Σ_{j=1}^m ‖θ_{c,j}‖₂ to ease notation. See Appendix A.3 for the proof.

Theorem 2. If Im(g(θf; ·)) = R^p and γ > L(θ), the softmax loss (2) satisfies

    (1/(γ + L(θ))) ‖θ_{c,y} − Σ_{j=1}^m p_j(θ, x) θ_{c,j}‖₂² ≤ φγ(θ; (x, y)) − ℓ(θ; (x, y)) ≤ (1/(γ − L(θ))) ‖θ_{c,y} − Σ_{j=1}^m p_j(θ, x) θ_{c,j}‖₂².

4 Experiments

We evaluate our method in both classification and semantic segmentation settings, following the evaluation scenarios of domain adaptation techniques [9, 39, 14], although in our case the target domains are unknown at training time. We summarize our experimental setup, including implementation details, evaluation metrics and datasets, for each task.

Digit classification. We train on the MNIST [19] dataset and test on MNIST-M [9], SVHN [30], SYN [9] and USPS [6].
We use 10,000 digit samples for training and evaluate our models on the respective test sets of the different target domains, using accuracy as the metric. In order to work with comparable datasets, we resized all images to 32×32 and treated images from MNIST and USPS as RGB. We use a ConvNet [18] with architecture conv-pool-conv-pool-fc-fc-softmax, and set the hyperparameters α = 0.0001, η = 1.0, Tmin = 100 and Tmax = 15. In the minimization phase, we use Adam [17] with batch size equal to 32.⁴ We compare our method against the Empirical Risk Minimization (ERM) baseline and against different regularization techniques (Dropout [35], ridge).

Semantic scene segmentation. We use the SYNTHIA [31] dataset for semantic segmentation. The dataset contains images from different locations (we use Highway, New York-like City and Old European Town) and different weather/time/date conditions (we use Dawn, Fog, Night, Spring and Winter). We train models on a source domain and test on the other domains, using the standard mean Intersection Over Union (mIoU) metric to evaluate performance [8]. We arbitrarily chose images from the left front camera throughout our experiments. For each domain, we sample 900 random images (resized to 192×320 pixels) from the training set. We use a Fully Convolutional Network (FCN) [23] with a ResNet-50 [11] body, and set the hyperparameters α = 0.0001, η = 2.0, Tmin = 500 and Tmax = 50. For the minimization phase, we use Adam [17] with batch size equal to 8. We compare our method against the ERM baseline.

4.1 Results on Digit Classification

In this section, we present and discuss the results of the digit classification experiment. First, we are interested in analyzing the role of the semantic constraint we impose.
⁴Models were implemented using Tensorflow, and training procedures were performed on NVIDIA GPUs. Code is available at https://github.com/ricvolpi/generalize-unseen-domains

Figure 1. Results associated with models trained with 10,000 MNIST samples and tested on SVHN, MNIST-M, SYN and USPS (1st, 2nd, 3rd and 4th columns, respectively). Panel (a), top: comparison between distances in the pixel space (yellow) and in the semantic space (blue), with γ = 10⁴ and K = 1. Panel (a), bottom: comparison between our method with K = 2 and different γ values (blue bars) and ERM (red line). Panel (b), top: comparison between our method with γ = 1.0 and different numbers of iterations K (blue), ERM (red) and Dropout [35] (yellow). Panel (b), middle: comparison between models regularized with ridge (green) and with ridge + our method with γ = 1.0 and K = 1 (blue). Panel (b), bottom: results for the ensemble method, using models trained with our method with different numbers of iterations K (blue) and models trained via ERM (red). The reported results are obtained by averaging over 10 different runs; black bars indicate the range of accuracy spanned.

Figure 1a (top) shows the performance of models trained with Algorithm 1 with K = 1 and γ = 10⁴, with the constraint in the semantic space (as discussed in Section 2) and in the pixel space [34] (blue and yellow bars, respectively). Figure 1a (bottom) shows the performance of models trained with our method using different values of the hyperparameter γ (with K = 2) and with ERM (blue bars and red lines, respectively). These plots show (i) that moving the constraint to the semantic space carries benefits when models are tested on unseen domains, and (ii) that models trained with Algorithm 1 outperform models trained with ERM for any value of γ on out-of-sample domains (SVHN, MNIST-M

Figure 2.
Results obtained with semantic segmentation models trained with ERM (red) and our method with K = 1 and γ = 1.0 (blue). The leftmost panels are associated with models trained on Highway, the rightmost panels with models trained on New York-like City. Test datasets are Highway, New York-like City and Old European Town.

and SYN). The latter result is particularly desirable, since this hyperparameter cannot be properly cross-validated. On USPS, our method causes accuracy to drop: since MNIST and USPS are very similar datasets, the image domain that USPS belongs to is not explored by our algorithm during training, which optimizes for worst-case performance.

Figure 1b (top) reports results for models trained with our method (blue bars), varying the number of iterations K and fixing γ = 1.0, together with results for ERM (red bars) and Dropout [35] (yellow bars). We observe that our method improves performance on SVHN, MNIST-M and SYN, outperforming both ERM and Dropout [35] by statistically significant margins. In Figure 1b (middle), we compare models trained with ridge regularization (green bars) with models trained with Algorithm 1 (with K = 1 and γ = 1.0) and ridge regularization (blue bars); these results show that our method can benefit from other regularization approaches, as in this case we observed that the two effects sum up. We further report in Appendix B a comparison between our method and an unsupervised domain adaptation algorithm (ADDA [39]), together with results for different values of the hyperparameters γ and K.

Finally, we report the results obtained by learning an ensemble of models. Since the hyperparameter γ is nontrivial to set a priori, we use the softmax confidences (9) to choose which model to use at test time.
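As an illustration, the per-sample choice behind this heuristic can be sketched in a few lines. This is a sketch only: each entry of `prob_fns` is assumed to be a callable returning the softmax output of one model in the ensemble as a list of probabilities.

```python
def select_and_predict(prob_fns, x):
    """Pick the ensemble member whose maximum softmax confidence on x is
    highest (the heuristic of Eq. (9)), then return its prediction."""
    probs = [f(x) for f in prob_fns]                      # one vector per model
    confidences = [max(p) for p in probs]                 # max softmax score
    u_star = confidences.index(max(confidences))          # selected model index
    prediction = probs[u_star].index(max(probs[u_star]))  # its argmax class
    return u_star, prediction
```

For example, if one model outputs [0.5, 0.3, 0.2] and another [0.1, 0.8, 0.1] on the same sample, the second model is selected (confidence 0.8 > 0.5) and its predicted class is returned.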
We learn an ensemble of models, each trained by running Algorithm 1 with a different value of γ, namely γ = 10^−i with i ∈ {0, 1, 2, 3, 4, 5, 6}. Figure 1b (bottom) shows the comparison between our method with different numbers of iterations K and ERM (blue and red bars, respectively). In order to isolate the role of ensemble learning, we also learn an ensemble of baseline models, each corresponding to a different initialization. We fix the number of models in the ensemble to be the same for both the baseline (ERM) and our method. Comparing Figure 1b (bottom) with Figure 1b (top) and Figure 1a (bottom), our ensemble approach achieves higher accuracy in the different testing scenarios. We observe that our out-of-sample performance improves as the number of iterations K grows. Also in the ensemble setting, we do not see any improvement on the USPS dataset, which we conjecture to be an artifact of the trade-off between good performance on domains far away from training and on those closer to it.

4.2 Results on Semantic Scene Segmentation

We report a comparison between models trained with ERM and models trained with our method (Algorithm 1 with K = 1). We set γ = 1.0 in every experiment, but stress that this is an arbitrary value; we did not observe a strong correlation between different values of γ and the general behavior of the models in this case. Its role was more meaningful in the ensemble setting, where each model is associated with a different level of robustness, as discussed in Section 2. In this setting, we do not apply the ensemble approach, but only evaluate the performance of the single models. The main reason for this choice is that the heuristics developed to choose the correct model at test time cannot be applied in a straightforward fashion to a semantic segmentation problem. Figure 2 reports the numerical results obtained.
Specifically, the leftmost plots report results for models trained on sequences from the Highway split and tested on the New York-like City and Old European Town splits (top-left and bottom-left, respectively); the rightmost plots report results for models trained on sequences from the New York-like City split and tested on the Highway and Old European Town splits (top-right and bottom-right, respectively). The training sequences (Dawn, Fog, Night, Spring, and Winter) are indicated on the x-axis. Red and blue bars indicate the average mIoUs achieved by models trained with ERM and with our method, respectively; these values were calculated by averaging the mIoUs obtained by each model over the different conditions of the test set. As can be observed, models trained with our method generalize better to unknown data distributions in most cases. In particular, our method always outperforms the baseline by a statistically significant margin when the training images come from Night scenarios. This is because baseline models trained on Night images are strongly biased towards dark scenery, whereas, as a consequence of training over worst-case distributions, our models can overcome this strong bias and generalize better across different unseen domains.

5 Conclusions and Future Work

We study a new adversarial data augmentation procedure that learns to generalize better across unseen data distributions, and define an ensemble method to exploit this technique in a classification framework. This is in contrast to domain adaptation algorithms, which require a sufficient number of samples from a known, a priori fixed target distribution.
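To make the mechanics of this augmentation concrete, the inner maximization at the heart of the procedure can be sketched for a linear softmax classifier: starting from a clean point, take gradient-ascent steps on the loss minus a γ-weighted distance penalty. This is only a minimal illustration under our own simplifications (ascent in input rather than feature space, a linear model, hypothetical names), not the paper's implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def adversarial_augment(x0, y, W, b, gamma=1.0, lr=0.1, steps=15):
    """Sketch of one inner maximization: ascend
        loss(W x + b, y) - gamma * ||x - x0||^2
    from the clean point x0, yielding a 'fictitious target' example
    that is hard under the current model yet penalized for straying
    far from the source point.

    W: (n_classes, n_dims) weights; b: (n_classes,) biases.
    """
    x = x0.copy()
    for _ in range(steps):
        p = softmax(W @ x + b)
        # gradient of the cross-entropy loss wrt x: W^T (p - onehot(y))
        g_loss = W.T @ (p - np.eye(len(p))[y])
        g = g_loss - 2.0 * gamma * (x - x0)  # gradient of the penalized objective
        x = x + lr * g                       # ascent step
    return x
```

Larger γ keeps the fictitious examples closer to the source distribution; smaller γ lets them wander further, which is why each ensemble member trained with a different γ carries a different level of robustness.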
Our experimental results show that our iterative procedure provides broad generalization behavior on digit recognition and on cross-season and cross-weather semantic segmentation tasks.
For future work, we hope to extend the ensemble methods by defining novel decision rules. The proposed heuristics (9) only apply to classification settings, and extending them to a broader range of tasks, including semantic segmentation, is an important direction. Many theoretical questions also remain open; for instance, quantifying the behavior of the data-dependent regularization schemes presented in Section 3 would help us better understand adversarial training methods in general.

References

[1] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 137–144. MIT Press, 2007.

[2] Jose Blanchet and Karthyek Murthy. Quantifying distributional model risk via optimal transport. arXiv:1604.01446 [math.PR], 2016.

[3] John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, pages 120–128, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.

[4] J. Frédéric Bonnans and Alexander Shapiro. Perturbation Analysis of Optimization Problems. Springer Science & Business Media, 2013.

[5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[6] J. S. Denker, W. R. Gardner, H. P. Graf, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, H. S. Baird, and I. Guyon. Neural network recognizer for hand-written zip code digits. In Advances in Neural Information Processing Systems 1, pages 323–331, 1989.

[7] Alexey Dosovitskiy and Thomas Brox.
Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pages 658–666, 2016.

[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results. http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html.

[9] Yaroslav Ganin and Victor S. Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1180–1189, 2015.

[10] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[12] Christina Heinze-Deml and Nicolai Meinshausen. Conditional variance penalties and domain shift robustness. arXiv preprint arXiv:1710.11469, 2017.

[13] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. CoRR, abs/1610.02136, 2016.

[14] Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. CoRR, abs/1612.02649, 2016.

[15] Hal Daumé III and Daniel Marcu. Domain adaptation for statistical classifiers. CoRR, abs/1109.6341, 2011.

[16] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.

[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[18] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E.
Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

[19] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

[20] Jaeho Lee and Maxim Raginsky. Minimax statistical learning and domain adaptation with Wasserstein distances. arXiv preprint arXiv:1705.07815, 2017.

[21] K. Levenberg. A method for the solution of certain problems in least squares. Quarterly of Applied Mathematics, 2, 1944.

[22] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Deeper, broader and artier domain generalization. CoRR, abs/1710.03077, 2017.

[23] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1411.4038, 2014.

[24] M. Mancini, S. R. Bulò, B. Caputo, and E. Ricci. Robust place categorization with deep domain generalization. IEEE Robotics and Automation Letters, 3(3):2093–2100, July 2018.

[25] Donald W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.

[26] Pietro Morerio, Jacopo Cavazza, and Vittorio Murino. Minimal-entropy correlation alignment for unsupervised deep domain adaptation. In International Conference on Learning Representations, 2018.

[27] Saeid Motiian, Marco Piccirilli, Donald A. Adjeroh, and Gianfranco Doretto. Unified deep supervised domain adaptation and generalization. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[28] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation.
In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 10–18, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.

[29] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[30] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.

[31] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[32] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV '10, pages 213–226, Berlin, Heidelberg, 2010. Springer-Verlag.

[33] Shiv Shankar, Vihari Piratla, Soumen Chakrabarti, Siddhartha Chaudhuri, Preethi Jyothi, and Sunita Sarawagi. Generalizing across domains via cross-gradient training. In International Conference on Learning Representations, 2018.

[34] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

[35] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), January 2014.

[36] Baochen Sun and Kate Saenko.
Deep CORAL: Correlation alignment for deep domain adaptation. In ECCV Workshops, 2016.

[37] Joshua Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. CoRR, abs/1703.06907, 2017.

[38] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '11, pages 1521–1528, Washington, DC, USA, 2011. IEEE Computer Society.

[39] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[40] Riccardo Volpi, Pietro Morerio, Silvio Savarese, and Vittorio Murino. Adversarial feature augmentation for unsupervised domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.