{"title": "Domain Separation Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 343, "page_last": 351, "abstract": "The cost of large scale data collection and annotation often makes the application of machine learning algorithms to new tasks or datasets prohibitively expensive. One approach circumventing this cost is training models on synthetic data where annotations are provided automatically. Despite their appeal, such models often fail to generalize from synthetic to real images, necessitating domain adaptation algorithms to manipulate these models before they can be successfully applied. Existing approaches focus either on mapping representations from one domain to the other, or on learning to extract features that are invariant to the domain from which they were extracted. However, by focusing only on creating a mapping or shared representation between the two domains, they ignore the individual characteristics of each domain. We hypothesize that explicitly modeling what is unique to each domain can improve a model's ability to extract domain-invariant features. Inspired by work on private-shared component analysis, we explicitly learn to extract image representations that are partitioned into two subspaces: one component which is private to each domain and one which is shared across domains. Our model is trained to not only perform the task we care about in the source domain, but also to use the partitioned representation to reconstruct the images from both domains. 
Our novel architecture results in a model that outperforms the state-of-the-art on a range of unsupervised domain adaptation scenarios and additionally produces visualizations of the private and shared representations enabling interpretation of the domain adaptation process.", "full_text": "Domain Separation Networks\n\nKonstantinos Bousmalis\u2217\n\nGoogle Brain\n\nMountain View, CA\n\nkonstantinos@google.com\n\nGeorge Trigeorgis\u2217 \u2020\nImperial College London\n\nLondon, UK\n\ng.trigeorgis@imperial.ac.uk\n\nNathan Silberman\nGoogle Research\nNew York, NY\n\nnsilberman@google.com\n\nDilip Krishnan\nGoogle Research\nCambridge, MA\n\ndilipkay@google.com\n\nDumitru Erhan\nGoogle Brain\n\nMountain View, CA\n\ndumitru@google.com\n\nAbstract\n\nThe cost of large scale data collection and annotation often makes the application\nof machine learning algorithms to new tasks or datasets prohibitively expensive.\nOne approach circumventing this cost is training models on synthetic data where\nannotations are provided automatically. Despite their appeal, such models often\nfail to generalize from synthetic to real images, necessitating domain adaptation\nalgorithms to manipulate these models before they can be successfully applied.\nExisting approaches focus either on mapping representations from one domain to\nthe other, or on learning to extract features that are invariant to the domain from\nwhich they were extracted. However, by focusing only on creating a mapping\nor shared representation between the two domains, they ignore the individual\ncharacteristics of each domain. We hypothesize that explicitly modeling what is\nunique to each domain can improve a model\u2019s ability to extract domain-invariant\nfeatures. 
Inspired by work on private-shared component analysis, we explicitly\nlearn to extract image representations that are partitioned into two subspaces: one\ncomponent which is private to each domain and one which is shared across domains.\nOur model is trained to not only perform the task we care about in the source\ndomain, but also to use the partitioned representation to reconstruct the images\nfrom both domains. Our novel architecture results in a model that outperforms\nthe state-of-the-art on a range of unsupervised domain adaptation scenarios and\nadditionally produces visualizations of the private and shared representations\nenabling interpretation of the domain adaptation process.\n\nIntroduction\n\n1\nThe recent success of supervised learning algorithms has been partially attributed to the large-scale\ndatasets [16, 22] on which they are trained. Unfortunately, collecting, annotating, and curating such\ndatasets is an extremely expensive and time-consuming process. An alternative would be creating\nlarge-scale datasets in non-realistic but inexpensive settings, such as computer generated scenes.\nWhile such approaches offer the promise of effectively unlimited amounts of labeled data, models\ntrained in such settings do not generalize well to realistic domains. Motivated by this, we examine the\nproblem of learning representations that are domain\u2013invariant in scenarios where the data distributions\nduring training and testing are different. In this setting, the source data is labeled for a particular task\nand we would like to transfer knowledge from the source to the target domain for which we have no\nground truth labels.\nIn this work, we focus on the tasks of object classi\ufb01cation and pose estimation, where the object of\ninterest is in the foreground of a given image, for both source and target domains. 
The source and\n\n\u2217Authors contributed equally.\n\u2020This work was completed while George Trigeorgis was at Google Brain in Mountain View, CA.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\ftarget pixel distributions can differ in a number of ways. We de\ufb01ne \u201clow-level\u201d differences in the\ndistributions as those arising due to noise, resolution, illumination and color. \u201cHigh-level\u201d differences\nrelate to the number of classes, the types of objects, and geometric variations, such as 3D position\nand pose. We assume that our source and target domains differ mainly in terms of the distribution of\nlow level image statistics and that they have high level parameters with similar distributions and the\nsame label space.\nWe propose a novel architecture, which we call Domain Separation Networks (DSN), to learn domain-\ninvariant representations. Previous work attempts to either \ufb01nd a mapping from representations of\nthe source domain to those of the target [26], or \ufb01nd representations that are shared between the\ntwo domains [8, 28, 17]. While this, in principle, is a good idea, it leaves the shared representations\nvulnerable to contamination by noise that is correlated with the underlying shared distribution [24].\nOur model, in contrast, introduces the notion of a private subspace for each domain, which captures\ndomain speci\ufb01c properties, such as background and low level image statistics. A shared subspace,\nenforced through the use of autoencoders and explicit loss functions, captures representations shared\nby the domains. By \ufb01nding a shared subspace that is orthogonal to the subspaces that are private,\nour model is able to separate the information that is unique to each domain, and in the process\nproduce representations that are more meaningful for the task at hand. 
Our method outperforms the\nstate-of-the-art domain adaptation techniques on a range of datasets for object classi\ufb01cation and pose\nestimation, while having an interpretability advantage by allowing the visualization of these private\nand shared representations. In Sec. 2, we survey related work and introduce relevant terminology.\nOur architecture, loss functions, and learning regime are presented in Sec. 3. Experimental results\nand discussion are given in Sec. 4. Finally, conclusions and directions for future work are in Sec. 5.\n2 Related Work\nLearning to perform unsupervised domain adaptation is an open theoretical and practical problem.\nWhile much prior art exists, our literature review focuses primarily on Convolutional Neural Network\n(CNN) based methods due to their empirical superiority on this problem [8, 17, 26, 29]. Ben-David\net al. [4] provide upper bounds on a domain-adapted classi\ufb01er in the target domain. They introduce\nthe idea of training a binary classi\ufb01er trained to distinguish source and target domains. The error\nthat this \u201cdomain incoherence\u201d classi\ufb01er provides (along with the error of a source domain speci\ufb01c\nclassi\ufb01er) combine to give the overall bounds. Mansour et al. [18] extend the theory of [4] to handle\nthe case of multiple source domains.\nGanin et al. [7, 8] and Ajakan et al. [2] use adversarial training to \ufb01nd domain\u2013invariant repre-\nsentations in-network. Their Domain\u2013Adversarial Neural Networks (DANN) exhibit an architecture\nwhose \ufb01rst few feature extraction layers are shared by two classi\ufb01ers trained simultaneously. The \ufb01rst\nis trained to correctly predict task-speci\ufb01c class labels on the source data while the second is trained\nto predict the domain of each input. 
DANN minimizes the domain classi\ufb01cation loss with respect\nto parameters speci\ufb01c to the domain classi\ufb01er, while maximizing it with respect to the parameters\nthat are common to both classi\ufb01ers. This minimax optimization becomes possible via the use of a\ngradient reversal layer (GRL).\nTzeng et al. [29] and Long et al. [17] proposed versions of this model where the maximization of\nthe domain classi\ufb01cation loss is replaced by the minimization of the Maximum Mean Discrepancy\n(MMD) metric [11]. The MMD metric is computed between features extracted from sets of samples\nfrom each domain. The Deep Domain Confusion Network by Tzeng et al. [29] has an MMD loss at\none layer in the CNN architecture while Long et al. [17] proposed the Deep Adaptation Network\nthat has MMD losses at multiple layers.\nOther related techniques involve learning a transformation from one domain to the other. In this setup,\nthe feature extraction pipeline is \ufb01xed during the domain adaptation optimization. This has been\napplied in various non-CNN based approaches [9, 5, 10] as well as the recent CNN-based Correlation\nAlignment (CORAL) [26] algorithm which \u201crecolors\u201d whitened source features with the covariance\nof features from the target domain.\n\n3 Method\nWhile the Domain Separation Networks (DSNs) could in principle be applicable to other learning\ntasks, without loss of generalization, we mainly use image classi\ufb01cation as the cross-domain task.\nGiven a labeled dataset in a source domain and an unlabeled dataset in a target domain, our goal is to\ntrain a classi\ufb01er on data from the source domain that generalizes to the target domain. Like previous\n\n2\n\n\fFigure 1: A shared-weight encoder Ec(x) learns to capture representation components for a given\ninput sample that are shared among domains. A private encoder Ep(x) (one for each domain) learns\nto capture domain-speci\ufb01c components of the representation. 
A shared decoder learns to reconstruct\nthe input sample by using both the private and source representations. The private and shared\nrepresentation components are pushed apart with soft subspace orthogonality constraints Ldi\ufb00erence,\nwhereas the shared representation components are kept similar with a similarity loss Lsimilarity.\n\nefforts [7, 8], our model is trained such that the representations of images from the source domain are\nsimilar to those from the target domain. This allows a classi\ufb01er trained on images from the source\ndomain to generalize as the inputs to the classi\ufb01er are in theory invariant to the domain of origin.\nHowever, these representations might trivially include noise that is highly correlated with the shared\nrepresentation, as shown by Salzmann et al. [24].\nOur main novelty is that, inspired by recent work [14, 24, 30] on shared-space component analysis,\nDSNs explicitly model both private and shared components of the domain representations. The two\nprivate components of the representation are speci\ufb01c to each domain and the shared component of the\nrepresentation is shared by both domains. To induce the model to produce such split representations,\nwe add a loss function that encourages independence of these parts. Finally, to ensure that the\nprivate representations are still useful (avoiding trivial solutions) and to add generalizability, we also\nadd a reconstruction loss. 
The combination of these objectives is a model that produces a shared representation that is similar for both domains and a private representation that is domain specific. By partitioning the space in such a manner, the classifier trained on the shared representation is better able to generalize across domains as its inputs are uncontaminated with aspects of the representation that are unique to each domain.
Let X^s = {(x_i^s, y_i^s)}_{i=0}^{N_s} represent a labeled dataset of N_s samples from the source domain, where x_i^s ~ D_S, and let X^t = {x_i^t}_{i=0}^{N_t} represent an unlabeled dataset of N_t samples from the target domain, where x_i^t ~ D_T. Let E_c(x; theta_c) be a function parameterized by theta_c which maps an image x to a hidden representation h_c representing features that are common or shared across domains. Let E_p(x; theta_p) be an analogous function which maps an image x to a hidden representation h_p representing features that are private to each domain. Let D(h; theta_d) be a decoding function mapping a hidden representation h to an image reconstruction x̂. Finally, G(h; theta_g) represents a task-specific function, parameterized by theta_g, that maps from hidden representations h to the task-specific predictions ŷ. The resulting Domain Separation Network (DSN) model is depicted in Fig. 1.

3.1 Learning

Inference in a DSN model is given by x̂ = D(E_c(x) + E_p(x)) and ŷ = G(E_c(x)), where x̂ is the reconstruction of the input x and ŷ is the task-specific prediction. The goal of training is to minimize the following loss with respect to parameters Theta = {theta_c, theta_p, theta_d, theta_g}:

    L = L_task + alpha * L_recon + beta * L_difference + gamma * L_similarity    (1)

where alpha, beta, gamma are weights that control the interaction of the loss terms. 
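As an illustrative sketch of this inference wiring, the following uses toy linear maps in place of the paper's CNN encoders, decoder, and classifier; every layer size and weight here is an assumption for demonstration, not the published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for E_c, E_p, D and G; real DSNs use CNNs.
W_c = rng.standard_normal((16, 8))    # shared encoder weights
W_p = rng.standard_normal((16, 8))    # private encoder weights (one per domain)
W_d = rng.standard_normal((8, 16))    # shared decoder weights
W_g = rng.standard_normal((8, 10))    # task classifier weights

def E_c(x): return np.tanh(x @ W_c)   # shared representation h_c
def E_p(x): return np.tanh(x @ W_p)   # private representation h_p
def D(h):   return h @ W_d            # reconstruction from a hidden code

def G(h):                             # softmax task predictions, from h_c only
    z = h @ W_g
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

x = rng.standard_normal((4, 16))      # a batch of flattened inputs
x_hat = D(E_c(x) + E_p(x))            # inference: reconstruction (shared + private)
y_hat = G(E_c(x))                     # inference: task prediction (shared only)
```

Training then minimizes the weighted sum of Eq. 1 over all four parameter sets.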
The classification loss L_task trains the model to predict the output labels we are ultimately interested in. Because we assume the target domain is unlabeled, the loss is applied only to the source domain. We want to minimize the negative log-likelihood of the ground truth class for each source domain sample:

    L_task = - sum_{i=0}^{N_s} y_i^s · log ŷ_i^s,    (2)

where y_i^s is the one-hot encoding of the class label for source input i and ŷ_i^s are the softmax predictions of the model: ŷ_i^s = G(E_c(x_i^s)). We use a scale-invariant mean squared error term [6] for the reconstruction loss L_recon, which is applied to both domains:

    L_recon = sum_{i=1}^{N_s} L_si_mse(x_i^s, x̂_i^s) + sum_{i=1}^{N_t} L_si_mse(x_i^t, x̂_i^t),    (3)
    L_si_mse(x, x̂) = (1/k) ||x - x̂||_2^2 - (1/k^2) ([x - x̂] · 1_k)^2,    (4)

where k is the number of pixels in input x, 1_k is a vector of ones of length k, and ||·||_2^2 is the squared L2-norm. While a mean squared error loss is traditionally used for reconstruction tasks, it penalizes predictions that are correct up to a scaling term. Conversely, the scale-invariant mean squared error penalizes differences between pairs of pixels. This allows the model to learn to reproduce the overall shape of the objects being modeled without expending modeling power on the absolute color or intensity of the inputs. We validated that this reconstruction loss was indeed the correct choice experimentally in Sec. 4.3 by training a version of our best DSN model with the traditional mean squared error loss instead of the scale-invariant loss in Eq. 3.
The difference loss is also applied to both domains and encourages the shared and private encoders to encode different aspects of the inputs. 
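The scale-invariant term of Eq. 4 is small enough to verify numerically; a minimal sketch:

```python
import numpy as np

def si_mse(x, x_hat):
    """Scale-invariant MSE of Eq. 4: the ordinary per-pixel MSE minus the
    squared mean error, so a constant offset in x_hat is not penalized."""
    d = (x - x_hat).ravel()
    k = d.size
    return (d @ d) / k - (d.sum() / k) ** 2

x = np.array([0.0, 1.0, 2.0, 3.0])
print(si_mse(x, x + 5.0))   # 0.0 — a uniform intensity shift costs nothing
print(si_mse(x, x[::-1]))   # 5.0 — a genuine structural error is penalized
```

This matches the motivation above: overall shape is penalized, absolute intensity is not.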
We define the loss via a soft subspace orthogonality constraint between the private and shared representation of each domain. Let H_c^s and H_c^t be matrices whose rows are the hidden shared representations h_c^s = E_c(x^s) and h_c^t = E_c(x^t) from samples of source and target data respectively. Similarly, let H_p^s and H_p^t be matrices whose rows are the private representations h_p^s = E_p^s(x^s) and h_p^t = E_p^t(x^t) from samples of source and target data respectively3. The difference loss encourages orthogonality between the shared and the private representations:

    L_difference = ||(H_c^s)^T H_p^s||_F^2 + ||(H_c^t)^T H_p^t||_F^2,    (5)

where ||·||_F^2 is the squared Frobenius norm. Finally, L_similarity encourages the hidden representations h_c^s and h_c^t from the shared encoder to be as similar as possible irrespective of the domain. We experimented with two similarity losses, which we discuss in detail.

3.2 Similarity Losses

The domain adversarial similarity loss [7, 8] is used to train a model to produce representations such that a classifier cannot reliably predict the domain of the encoded representation. Maximizing such "confusion" is achieved via a Gradient Reversal Layer (GRL) and a domain classifier trained to predict the domain producing the hidden representation. The GRL has the same output as the identity function, but reverses the gradient direction. Formally, for some function f(u), the GRL is defined as Q(f(u)) = f(u) with a gradient (d/du) Q(f(u)) = - (d/du) f(u). The domain classifier Z(Q(h_c); theta_z) -> d̂ parameterized by theta_z maps a shared representation vector h_c = E_c(x; theta_c) to a prediction of the label d̂ in {0, 1} of the input sample x. 
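The soft orthogonality constraint of Eq. 5 can also be checked numerically. A minimal single-domain sketch follows; note that we read footnote 3's normalization as row-wise l2 normalization, which is an assumption of this sketch:

```python
import numpy as np

def difference_loss(H_c, H_p, eps=1e-8):
    """Soft orthogonality of Eq. 5 for one domain: squared Frobenius norm
    of H_c^T H_p after l2-normalizing each row (one reading of footnote 3)."""
    H_c = H_c / (np.linalg.norm(H_c, axis=1, keepdims=True) + eps)
    H_p = H_p / (np.linalg.norm(H_p, axis=1, keepdims=True) + eps)
    return np.sum((H_c.T @ H_p) ** 2)

H_c = np.array([[1.0, 0.0], [1.0, 0.0]])        # shared rows across a batch
H_p_orth = np.array([[0.0, 1.0], [0.0, -1.0]])  # private rows, decorrelated
print(difference_loss(H_c, H_p_orth))  # 0.0 — nothing shared leaks into private
print(difference_loss(H_c, H_c))       # large — private duplicates shared
```

A private encoder that simply copies the shared representation is thus penalized, which is exactly the contamination the constraint is meant to prevent.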
Learning with a GRL is adversarial in that theta_z is optimized to increase Z's ability to discriminate between encodings of images from the source or target domains, while the reversal of the gradient results in the model parameters theta_c learning representations from which domain classification accuracy is reduced. Essentially, we maximize the binomial cross-entropy for the domain prediction task with respect to theta_z, while minimizing it with respect to theta_c:

    L_similarity^DANN = sum_{i=0}^{N_s + N_t} { d_i log d̂_i + (1 - d_i) log(1 - d̂_i) },    (6)

where d_i in {0, 1} is the ground truth domain label for sample i.

3The matrices are transformed to have zero mean and unit l2 norm.

The Maximum Mean Discrepancy (MMD) loss [11] is a kernel-based distance function between pairs of samples. We use a biased statistic for the squared population MMD between shared encodings of the source samples h_c^s and the shared encodings of the target samples h_c^t:

    L_similarity^MMD = (1/(N^s)^2) sum_{i,j=0}^{N^s} kappa(h_ci^s, h_cj^s) - (2/(N^s N^t)) sum_{i,j=0}^{N^s,N^t} kappa(h_ci^s, h_cj^t) + (1/(N^t)^2) sum_{i,j=0}^{N^t} kappa(h_ci^t, h_cj^t),    (7)

where kappa(·,·) is a PSD kernel function. In our experiments we used a linear combination of multiple RBF kernels: kappa(x_i, x_j) = sum_n eta_n exp{ -(1/(2 sigma_n)) ||x_i - x_j||^2 }, where sigma_n is the standard deviation and eta_n is the weight for our nth RBF kernel. Any additional kernels we include in the multi-RBF kernel are additive and guarantee that their linear combination remains characteristic. 
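A minimal numerical sketch of the biased estimator in Eq. 7, using three illustrative bandwidths rather than the 19 kernels of our experiments (the specific sigma and eta values are assumptions):

```python
import numpy as np

def mmd2(X, Y, sigmas=(1.0, 2.0, 4.0), etas=(1.0, 1.0, 1.0)):
    """Biased squared-MMD estimator of Eq. 7 with a multi-RBF kernel
    kappa(a, b) = sum_n eta_n * exp(-||a - b||^2 / (2 * sigma_n))."""
    def kmean(A, B):
        # mean kernel value over all pairs (a, b)
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return sum(e * np.exp(-d2 / (2.0 * s)) for s, e in zip(sigmas, etas)).mean()
    return kmean(X, X) - 2.0 * kmean(X, Y) + kmean(Y, Y)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 3))
same = mmd2(X, rng.standard_normal((64, 3)))           # same distribution
shifted = mmd2(X, rng.standard_normal((64, 3)) + 3.0)  # shifted distribution
print(same, shifted)  # the shifted pair yields a much larger value
```

The biased statistic is a squared RKHS norm, so it is non-negative, and it grows as the two sample distributions diverge.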
Therefore, having a large range of kernels is beneficial since the distributions of the shared features change during learning, and different components of the multi-RBF kernel might be responsible at different times for making sure we reject a false null hypothesis, i.e. that the loss is sufficiently high when the distributions are not similar [17]. The advantage of using an RBF kernel with the MMD distance is that the Taylor expansion of the Gaussian function allows us to match all the moments of the two populations. The caveat is that it requires finding optimal kernel bandwidths sigma_n.

4 Evaluation

We are motivated by the problem of learning models on a clean, synthetic dataset and testing on a noisy, real-world dataset. To this end, we evaluate on object classification datasets used in previous work4, including MNIST and MNIST-M [8], the German Traffic Signs Recognition Benchmark (GTSRB) [25], and the Streetview House Numbers (SVHN) [20]. We also evaluate on the cropped LINEMOD dataset, a standard for object instance recognition and 3D pose estimation [12, 31], for which we have synthetic and real data5. We tested the following unsupervised domain adaptation scenarios: (a) from MNIST to MNIST-M; (b) from SVHN to MNIST; (c) from synthetic traffic signs to real ones with GTSRB; (d) from synthetic LINEMOD object instances rendered on a black background to the same object instances in the real world.
We evaluate the efficacy of our method with each of the two similarity losses outlined in Sec. 3.2 by comparing against the prevailing visual domain adaptation techniques for neural networks: Correlation Alignment (CORAL) [26], Domain-Adversarial Neural Networks (DANN) [7, 8], and MMD regularization [29, 17]. 
For each scenario we provide two additional baselines: the performance on the target domain of the respective model with no domain adaptation and trained (a) on the source domain ("Source-only" in Tab. 1) and (b) on the target domain ("Target-only"), as an empirical lower and upper bound respectively.
We have not found a universally applicable way to optimize hyperparameters for unsupervised domain adaptation. Previous work [8] suggests the use of reverse validation. We implemented this (see Supplementary Material for details) but found that the reverse validation accuracy did not always align well with test accuracy. Ideally we would like to avoid using labels from the target domain, as it can be argued that if one does have target domain labels, they should be used during training. However, there are applications where a labeled target domain set cannot be used for training. An example is the labeling of a dataset with the use of AprilTags [21], 2D barcodes that can be used to label the pose of an object, provided that a camera is calibrated and the physical dimensions of the barcode are known. These images should not be used when learning features from pixels, because the model might be able to decipher the tags. However, they can be part of a test set that is not available during training, and an equivalent dataset without the tags could be used for unsupervised domain adaptation. We thus chose to use a small set of labeled target domain data as a validation set for

4The most commonly used dataset for visual domain adaptation in the context of object classification is Office [23]. However, this dataset exhibits significant variations in both low-level and high-level parameter distributions. Low-level variations are due to the different cameras and background textures in the images (e.g. Amazon versus DSLR). However, there are significant high-level variations due to object identity: e.g. 
the motorcycle class contains non-motorcycle objects; the backpack class contains a laptop; some domains contain the object in only one pose. Other commonly used datasets such as Caltech-256 suffer from similar problems. We therefore exclude these datasets from our evaluation. For more information, see our Supplementary Material.

5https://cvarlab.icg.tugraz.at/projects/3d_object_detection/

Table 1: Mean classification accuracy (%) for the unsupervised domain adaptation scenarios we evaluated all the methods on. We have replicated the experiments from Ganin et al. [8] and in parentheses we show the results reported in their paper. The "Source-only" and "Target-only" rows are the results on the target domain when using no domain adaptation and training only on the source or the target domain respectively.

Model               | MNIST to MNIST-M | Synth Digits to SVHN | SVHN to MNIST | Synth Signs to GTSRB
Source-only         | 56.6 (52.2)      | 86.7 (86.7)          | 59.2 (54.9)   | 85.1 (79.0)
CORAL [26]          | 57.7             | 85.2                 | 63.1          | 86.9
MMD [29, 17]        | 76.9             | 88.0                 | 71.1          | 91.1
DANN [8]            | 77.4 (76.6)      | 90.3 (91.0)          | 70.7 (73.8)   | 92.9 (88.6)
DSN w/ MMD (ours)   | 80.5             | 88.5                 | 72.2          | 92.6
DSN w/ DANN (ours)  | 83.2             | 91.2                 | 82.7          | 93.1
Target-only         | 98.7             | 92.4                 | 99.5          | 99.8

the hyperparameters of all the methods we compare. All methods were evaluated using the same protocol, so comparison numbers are fair and meaningful. The performance on this validation set can serve as an upper bound of a satisfactory validation metric for unsupervised domain adaptation; to our knowledge, validating the parameters in a fully unsupervised manner is still an open research question, and out of the scope of this work.

4.1 Datasets and Adaptation Scenarios

MNIST to MNIST-M. In this domain adaptation scenario we use the popular MNIST [15] dataset of handwritten digits as the source domain, and MNIST-M, a variation of MNIST proposed for unsupervised domain adaptation by [8]. 
MNIST-M was created by using each MNIST digit as a binary mask and inverting with it the colors of a background image. The background images are random crops uniformly sampled from the Berkeley Segmentation Data Set (BSDS500) [3]. In all our experiments we follow the experimental protocol of [8]. Out of the 59,001 MNIST-M training examples, we used the labels for 1,000 of them to find optimal hyperparameters for our models. This scenario, like all three digit adaptation scenarios, has 10 class labels.
Synthetic Digits to SVHN. In this scenario we aim to learn a classifier for the Street-View House Numbers data set (SVHN) [20], our target domain, from a dataset of purely synthesized digits, our source domain. The synthetic digits [8] dataset was created by rasterizing bitmap fonts in a sequence (one, two, and three digits) with the ground truth label being the digit in the center of the image, just like in SVHN. The source domain samples are further augmented by variations in scale, translation, background colors, stroke colors, and Gaussian blurring. We use 479,400 Synthetic Digits for our source domain training set, 73,257 unlabeled SVHN samples for domain adaptation, and 26,032 SVHN samples for testing. Similarly to above, we use the labels of 1,000 SVHN training examples for hyperparameter validation.
SVHN to MNIST. Although the SVHN dataset contains significant variations (in scale, background clutter, blurring, embossing, slanting, contrast, rotation, and sequences, to name a few), there is not a lot of variation in the actual digit shapes. This makes it quite distinct from a dataset of handwritten digits, like MNIST, where there are a lot of elastic distortions in the shapes, variations in thickness, and noise on the digits themselves. Since the ground truth digits in both datasets are centered, this is a well-posed and rather difficult domain adaptation scenario. 
As above, we used the labels of 1,000 MNIST training examples for validation.
Synthetic Signs to GTSRB. We also perform an experiment adapting from a dataset of synthetic traffic signs from [19] to a real-world dataset of traffic signs (GTSRB) [25]. While the three digit adaptation scenarios have 10 class labels, this scenario has 43 different traffic signs. The synthetic signs were obtained by taking relevant pictograms and adding various types of variations, including random backgrounds, brightness, saturation, 3D rotations, Gaussian and motion blur. We use 90,000 synthetic signs for training, 1,280 random GTSRB real-world signs for domain adaptation and validation, and the remaining 37,929 GTSRB real signs as the test set.

Table 2: Mean classification accuracy and pose error for the "Synth Objects to LINEMOD" scenario.

Method              | Classification Accuracy | Mean Angle Error
Source-only         | 47.33%                  | 89.2°
MMD                 | 72.35%                  | 70.62°
DANN                | 99.90%                  | 56.58°
DSN w/ MMD (ours)   | 99.72%                  | 66.49°
DSN w/ DANN (ours)  | 100.00%                 | 53.27°
Target-only         | 100.00%                 | 6.47°

Synthetic Objects to LineMod. The LineMod dataset [31] consists of CAD models of objects in a cluttered environment and a high variance of 3D poses for each object. We use the 11 non-symmetric objects from the cropped version of the dataset, where the images are cropped with the object in the center, for the task of object instance recognition and 3D pose estimation. We train our models on 16,962 images for these objects rendered on a black background without additional noise. We use a target domain training set of 10,673 real-world images for domain adaptation and validation, and a target domain test set of 2,655 for testing. 
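The mean angle error reported in Tab. 2, i.e. the rotation needed to move from the predicted to the ground-truth pose, can be computed from unit quaternions; a minimal sketch (the clamping and (w, x, y, z) convention are our assumptions):

```python
import numpy as np

def quat_angle_deg(q, q_hat):
    """Rotation angle (degrees) between two unit quaternions:
    theta = 2 * arccos(|q . q_hat|). The absolute value handles the
    q / -q double-cover ambiguity of unit quaternions."""
    dot = min(abs(float(np.dot(q, q_hat))), 1.0)  # clamp numerical drift
    return np.degrees(2.0 * np.arccos(dot))

identity = np.array([1.0, 0.0, 0.0, 0.0])  # no rotation, (w, x, y, z)
rot90_z = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
print(quat_angle_deg(identity, identity))  # 0.0
print(quat_angle_deg(identity, rot90_z))   # 90.0 (up to float precision)
```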
For this scenario our task is both classification and pose estimation; our task loss is therefore

    L_task = sum_{i=0}^{N_s} { -y_i^s · log ŷ_i^s + xi * log(1 - |q^s · q̂^s|) },

where q^s is the positive unit quaternion vector representing the ground truth 3D pose, and q̂^s is the equivalent prediction. The first term is the classification loss, similar to the rest of the experiments, the second term is the log of a 3D rotation metric for quaternions [13], and xi is the weight for the pose loss. In Tab. 2 we report the mean angle the object would need to be rotated (on a fixed 3D axis) to move from the predicted to the ground truth pose [12].

Figure 2: Reconstructions for the representations of the two domains for "MNIST to MNIST-M" and for "Synth Objects to LINEMOD": (a) MNIST (source); (b) MNIST-M (target); (c) Synth Objects (source); (d) LINEMOD (target). In each block from left to right: the original image x^t; reconstructed image D(E_c(x^t) + E_p(x^t)); shared-only reconstruction D(E_c(x^t)); private-only reconstruction D(E_p(x^t)).

4.2 Implementation Details

All the models were implemented using TensorFlow6 [1] and were trained with Stochastic Gradient Descent plus momentum [27]. Our initial learning rate was multiplied by 0.9 every 20,000 steps (mini-batches). We used batches of 32 samples from each domain for a total of 64, and the input images were mean-centered and rescaled to [-1, 1]. In order to avoid distractions for the main classification task during the early stages of the training procedure, we activate any additional domain adaptation loss after 10,000 steps of training. For all our experiments our CNN topologies are based on the ones used in [8], to be comparable to previous work in unsupervised domain adaptation. 
The exact architectures for all models are shown in our Supplementary Material.
In our framework, CORAL [26] would be equivalent to fixing our shared representation matrices H_c^s and H_c^t, normalizing them, and then minimizing ||A (H_c^s)^T H_c^s A^T - (H_c^t)^T H_c^t||_F^2 with respect to a weight matrix A that aligns the two correlation matrices. For the CORAL experiments, we follow the suggestions of [26], and extract features for both source and target domains from the penultimate layer of each network. Once the correlation matrices for each domain are aligned, we evaluate on the target test data the performance of a linear support vector machine (SVM) classifier trained on the source training data. The SVM penalty parameter was optimized based on the target domain validation set for each of our domain adaptation scenarios. For MMD regularization, we used a linear combination of 19 RBF kernels (details can be found in the Supplementary Material). Preliminary experiments with having MMD applied on more than one layer did not show any performance improvement for our experiments and architectures.

6We provide code at https://github.com/tensorflow/models/domain_adaptation.

Table 3: Effect of our difference and reconstruction losses on our best model. The first row is replicated from Tab. 1. In the second row, we remove the soft orthogonality constraint. In the third row, we replace the scale-invariant MSE with regular MSE.

Model            | MNIST to MNIST-M | Synth. Digits to SVHN | SVHN to MNIST | Synth. Signs to GTSRB
All terms        | 83.23            | 91.22                 | 82.78         | 93.01
No L_difference  | 80.26            | 89.21                 | 80.54         | 91.89
With L^L2_recon  | 80.42            | 88.98                 | 79.45         | 92.11
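The CORAL baseline just described can be sketched as whitening-then-recoloring of the source features (one common reading of [26]; the eps regularizer and the eigendecomposition route are implementation assumptions of this sketch):

```python
import numpy as np

def coral(Xs, Xt, eps=1e-5):
    """Whiten the source features, then re-color them with the target
    covariance, so second-order statistics match across domains."""
    def cov(X):
        Xc = X - X.mean(0)
        return (Xc.T @ Xc) / (len(X) - 1) + eps * np.eye(X.shape[1])
    def mat_pow(C, p):
        w, V = np.linalg.eigh(C)        # C is symmetric positive definite
        return (V * w ** p) @ V.T       # V diag(w^p) V^T
    Xs_white = (Xs - Xs.mean(0)) @ mat_pow(cov(Xs), -0.5)  # whiten source
    return Xs_white @ mat_pow(cov(Xt), 0.5)                # re-color to target

rng = np.random.default_rng(1)
Xs = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 3))
Xt = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 3))
aligned = coral(Xs, Xt)
# After alignment, cov(aligned) closely matches cov(Xt).
```

A linear classifier is then trained on the aligned source features, as in the SVM setup above.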
For DANN regularization, we applied the GRL and the domain classifier as prescribed in [8] for each scenario.

For our Domain Separation Network experiments, our similarity losses are always applied at the first fully connected layer of each network, after a number of convolutional and max pooling layers. For each private space encoder network we use a simple convolutional and max pooling structure followed by a fully-connected layer with a number of nodes equal to the number of nodes at the final layer h_c of the equivalent shared encoder E_c. The outputs of the shared and private encoders are added before being fed to the shared decoder D.

4.3 Discussion

The DSN with DANN model outperforms all the other methods we experimented with in all our unsupervised domain adaptation scenarios (see Tab. 1 and 2). Our unsupervised domain separation networks improve upon both MMD regularization and DANN. Using DANN as a similarity loss (Eq. 6) worked better than using MMD (Eq. 7) as a similarity loss, which is consistent with results obtained for domain adaptation using MMD regularization and DANN alone.

In order to examine the effect of the soft orthogonality constraints (L_difference), we took our best model, the DSN model with the DANN loss, and removed these constraints by setting the β coefficient to 0. Without them, the model performed consistently worse in all scenarios. We also validated our choice of the scale-invariant mean squared error reconstruction loss, as opposed to the more popular mean squared error loss, by running our best model with L_recon^L2 = (1/k) ||x − x̂||_2^2 instead. This variation also yields consistently worse classification results, as shown in Tab.
3.

The shared and private representations of each domain are combined for the reconstruction of samples. Individually decoding the shared and private representations gives us reconstructions that serve as useful depictions of our domain adaptation process. In Fig. 2 we use the "MNIST to MNIST-M" and the "Synth. Objects to LINEMOD" scenarios for such visualizations. In the former scenario, the model cleanly separates the foreground from the background and produces a shared space that is very similar to the source domain. This is expected, since the target is a transformation of the source. In the latter scenario, the model is able to produce visualizations of the shared representation that look very similar between the source and target domains, which are useful for classification and pose estimation.

5 Conclusion

We present in this work a deep learning model that improves upon existing unsupervised domain adaptation techniques. The model does so by explicitly separating representations private to each domain from representations shared between the source and target domains. By using existing domain adaptation techniques to make the shared representations similar, and soft subspace orthogonality constraints to make the private and shared representations dissimilar, our method outperforms all existing unsupervised domain adaptation methods in a number of adaptation scenarios that focus on the synthetic-to-real paradigm.

Acknowledgments

We would like to thank Samy Bengio, Kevin Murphy, and Vincent Vanhoucke for valuable comments on this work. We would also like to thank Yaroslav Ganin and Paul Wohlhart for providing some of the datasets we used.

References

[1] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. Preprint arXiv:1603.04467, 2016.
[2] H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, and M. Marchand.
Domain-adversarial neural networks. Preprint arXiv:1412.4446, 2014.
[3] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. TPAMI, 33(5):898–916, 2011.
[4] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
[5] R. Caseiro, J. F. Henriques, P. Martins, and J. Batista. Beyond the shortest path: Unsupervised domain adaptation by sampling subspaces along the spline flow. In CVPR, 2015.
[6] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, pages 2366–2374, 2014.
[7] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, pages 513–520, 2015.
[8] Y. Ganin et al. Domain-adversarial training of neural networks. JMLR, 17(59):1–35, 2016.
[9] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073. IEEE, 2012.
[10] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In ICCV, 2011.
[11] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. JMLR, pages 723–773, 2012.
[12] S. Hinterstoisser et al. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In ACCV, 2012.
[13] D. Q. Huynh. Metrics for 3d rotations: Comparison and analysis. Journal of Mathematical Imaging and Vision, 35(2):155–164, 2009.
[14] Y. Jia, M. Salzmann, and T. Darrell. Factorized latent spaces with structured sparsity. In NIPS, pages 982–990, 2010.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.
Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
[17] M. Long and J. Wang. Learning transferable features with deep adaptation networks. In ICML, 2015.
[18] Y. Mansour et al. Domain adaptation with multiple sources. In NIPS, 2009.
[19] B. Moiseev, A. Konev, A. Chigorin, and A. Konushin. Evaluation of traffic sign recognition methods trained on synthetically generated data. In ACIVS, pages 576–583, 2013.
[20] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshops, 2011.
[21] E. Olson. AprilTag: A robust and flexible visual fiducial system. In ICRA, pages 3400–3407. IEEE, 2011.
[22] O. Russakovsky et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[23] K. Saenko et al. Adapting visual category models to new domains. In ECCV. Springer, 2010.
[24] M. Salzmann et al. Factorized orthogonal latent spaces. In AISTATS, pages 701–708, 2010.
[25] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 2012.
[26] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
[27] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, pages 1139–1147, 2013.
[28] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In CVPR, pages 4068–4076, 2015.
[29] E. Tzeng, J.
Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. Preprint arXiv:1412.3474, 2014.
[30] S. Virtanen, A. Klami, and S. Kaski. Bayesian CCA via group sparsity. In ICML, pages 457–464, 2011.
[31] P. Wohlhart and V. Lepetit. Learning descriptors for object recognition and 3d pose estimation. In CVPR, pages 3109–3118, 2015.