{"title": "Learning Transferrable Representations for Unsupervised Domain Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 2110, "page_last": 2118, "abstract": "Supervised learning with large scale labelled datasets and deep layered models has caused a paradigm shift in diverse areas in learning and recognition. However, this approach still suffers from generalization issues under the presence of a domain shift between the training and the test data distribution. Since unsupervised domain adaptation algorithms directly address this domain shift problem between a labelled source dataset and an unlabelled target dataset, recent papers have shown promising results by fine-tuning the networks with domain adaptation loss functions which try to align the mismatch between the training and testing data distributions. Nevertheless, these recent deep learning based domain adaptation approaches still suffer from issues such as high sensitivity to the gradient reversal hyperparameters and overfitting during the fine-tuning stage. In this paper, we propose a unified deep learning framework where the representation, cross domain transformation, and target label inference are all jointly optimized in an end-to-end fashion for unsupervised domain adaptation. Our experiments show that the proposed method significantly outperforms state-of-the-art algorithms in both object recognition and digit classification experiments by a large margin. We will make our learned models as well as the source code available immediately upon acceptance.", "full_text": "Learning Transferrable Representations for\n\nUnsupervised Domain Adaptation\n\nOzan Sener1, Hyun Oh Song1, Ashutosh Saxena2, Silvio Savarese1\n\nStanford University1 Brain of Things2\n\n{ozan,hsong,asaxena,ssilvio}@cs.stanford.edu\n\nAbstract\n\nSupervised learning with large scale labelled datasets and deep layered models has\ncaused a paradigm shift in diverse areas in learning and recognition. 
However, this\napproach still suffers from generalization issues under the presence of a domain\nshift between the training and the test data distribution. Since unsupervised domain\nadaptation algorithms directly address this domain shift problem between a labelled\nsource dataset and an unlabelled target dataset, recent papers [11, 33] have shown\npromising results by \ufb01ne-tuning the networks with domain adaptation loss functions\nwhich try to align the mismatch between the training and testing data distributions.\nNevertheless, these recent deep learning based domain adaptation approaches still\nsuffer from issues such as high sensitivity to the gradient reversal hyperparameters\n[11] and over\ufb01tting during the \ufb01ne-tuning stage. In this paper, we propose a uni\ufb01ed\ndeep learning framework where the representation, cross domain transformation,\nand target label inference are all jointly optimized in an end-to-end fashion for\nunsupervised domain adaptation. Our experiments show that the proposed method\nsigni\ufb01cantly outperforms state-of-the-art algorithms in both object recognition and\ndigit classi\ufb01cation experiments by a large margin.\n\n1\n\nIntroduction\n\nRecently, deep convolutional neural networks [17, 26, 30] have propelled unprecedented advances\nin arti\ufb01cial intelligence including object recognition, speech recognition, and image captioning.\nAlthough these networks are very good at learning state of the art feature representations and\nrecognizing discriminative patterns, one major drawback is that the network requires huge amounts\nof labelled training data to \ufb01t millions of parameters in the complex network. However, creating such\ndatasets with complete annotations is not only tedious and error prone, but also extremely costly. 
In this regard, the research community has proposed different mechanisms such as semi-supervised learning [27, 37], transfer learning [23, 31], weakly labelled learning, and domain adaptation. Among these approaches, domain adaptation is one of the most appealing techniques when a fully annotated dataset (e.g. ImageNet [7], Sports1M [14]) is already available as a reference.

Given a fully labeled source dataset and an unlabeled target dataset, the goal of unsupervised domain adaptation is to learn a model which can generalize to the target domain while taking the domain shift across the datasets into account. The majority of the literature [13, 29, 9, 28, 32] in unsupervised domain adaptation formulates a learning problem where the task is to find a transformation matrix to align the labelled source data distribution to the unlabelled target data distribution. Although these approaches have shown promising results, they suffer accuracy degradation because of the discrepancy between the learning procedure and the actual target inference procedure. In this paper, we aim to address this issue by incorporating the unknown target labels into the training procedure.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In this regard, we formulate a unified deep learning framework where the feature representation, domain transformation, and target labels are all jointly optimized in an end-to-end fashion. The proposed framework first takes as input a batch of labelled source and unlabelled target examples, and maps this batch of raw input examples into a deep representation.
Then, the framework computes the loss of the input batch based on a two-stage optimization in which it alternates between inferring the labels of the target examples transductively and optimizing the domain transformation parameters. Concretely, in the transduction stage, given a fixed domain transform parameter, we jointly infer all target labels by solving a discrete multi-label energy minimization problem. In the adaptation stage, given a fixed target label assignment, we seek the optimal asymmetric metric between the source and the target data. The advantage of our method is that we can jointly learn the optimal feature representation and the optimal domain transformation parameter, both of which are aware of the subsequent transductive inference procedure.

Following the standard evaluation protocol in the unsupervised domain adaptation community, we evaluate our method on the digit classification task using MNIST [19] and SVHN [21] as well as the object recognition task using the Office [25] dataset, and demonstrate state of the art performance in comparison to all existing unsupervised domain adaptation methods. Learned models and the source code are available at the project webpage http://cvgl.stanford.edu/transductive_adaptation.

2 Related Work

This paper is closely related to two active research areas: (1) unsupervised domain adaptation, and (2) transductive learning.

Unsupervised domain adaptation: [16] casts the zero-shot learning [22] problem as an unsupervised domain adaptation problem in the dictionary learning and sparse coding framework, assuming access to additional attribute information. Recently, [3] proposed the active nearest neighbor algorithm, which incorporates active learning into the domain adaptation problem and makes a bounded number of active queries to users.
Also, [13, 9, 28] proposed subspace alignment based approaches to unsupervised domain adaptation, where the task is to learn a joint transformation and projection in which the difference between the source and the target covariance is minimized. However, these methods learn the transformation matrices on the whole source and target datasets without utilizing the source labels.

[32] utilizes a local max-margin metric learning objective [35] to first assign the target labels with a nearest neighbor scheme and then learn a distance metric enforcing the negative pairwise distances to be larger than the positive pairwise distances. However, this method learns a symmetric distance matrix shared by both the source and the target domains, so it is susceptible to the discrepancies between the source and the target distributions. Recently, [11, 33] proposed deep learning based methods to learn domain invariant features by providing the reversed gradient signal from binary domain classifiers. Although these methods perform better than the aforementioned approaches, their accuracy is limited since domain invariance does not necessarily imply discriminative features in the target domain.

Transductive learning: In transductive learning [10], the model has access to unlabelled test samples during training. [24] incorporates a semi-supervised label propagation algorithm into the semi-supervised transfer learning problem, assuming access to a few labeled examples and additional human-specified semantic knowledge. [15] tackled a classification problem where predictions are made jointly across all test examples in a transductive [10] setting. The method essentially enforces the notion that the true labels vary smoothly with respect to the input data.
We extend this notion to jointly infer the labels of unsupervised target data points in a k-NN graph.

To summarize, our main contribution is to formulate an end-to-end deep learning framework where we learn the optimal feature representation, infer target labels via discrete energy minimization (transduction), and learn the transformation (adaptation) between source and target examples, all jointly. Our experiments on digit classification using MNIST [19] and SVHN [21] as well as object recognition experiments on the Office [25] dataset show state of the art results, outperforming all existing methods by a substantial margin.

3 Method

3.1 Problem Definition and Notation

In unsupervised domain adaptation, one of the domains (source) is supervised, {(x̂_i, ŷ_i)}_{i∈[N^s]}, with N^s data points x̂_i and corresponding labels ŷ_i from a discrete set ŷ_i ∈ Y = {1, ..., Y}. The other domain (target), on the other hand, is unsupervised and has N^u data points {x_i}_{i∈[N^u]}. We further assume that the two domains have different distributions, x̂_i ∼ p_s and x_i ∼ p_t, defined on the same space x̂_i, x_i ∈ X. We consider a case in which there are two feature functions Φ_s, Φ_t : X → R^d applicable to the source and target separately. These feature functions extract information both shared among the domains and specific to the individual domains. We model the common features by sharing a subset of parameters between the feature functions as Φ_s = Φ_{θ_c,θ_s} and Φ_t = Φ_{θ_c,θ_t}. We use deep neural networks to implement these functions.
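As an illustration, this shared-trunk construction can be sketched in NumPy as follows. The layer sizes, nonlinearity, and names here are purely illustrative stand-ins, not the AlexNet/LeNet architectures used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes (assumptions, not the paper's architecture).
D_IN, D_HID, D_FEAT = 32, 16, 8
theta_c = rng.normal(scale=0.1, size=(D_IN, D_HID))    # shared trunk parameters
theta_s = rng.normal(scale=0.1, size=(D_HID, D_FEAT))  # source-specific head
theta_t = rng.normal(scale=0.1, size=(D_HID, D_FEAT))  # target-specific head

def g(x):
    """Common sub-network g_{theta_c}, shared by both domains."""
    return np.maximum(0.0, x @ theta_c)

def phi_s(x):
    """Phi_s = f_{theta_s}(g_{theta_c}(x)), applied to source inputs."""
    return g(x) @ theta_s

def phi_t(x):
    """Phi_t = f_{theta_t}(g_{theta_c}(x)), applied to target inputs."""
    return g(x) @ theta_t
```

Both feature functions share the trunk parameters θ_c, so gradients from either domain update the common representation, while θ_s and θ_t remain domain specific.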
In our implementation, θ_c corresponds to the parameters in the first few layers of the networks, and θ_s, θ_t correspond to the respective final layers. In general, our model is applicable to any hierarchical and differentiable feature function which can be expressed as a composite function Φ_s = f_{θ_s}(g_{θ_c}(·)), and likewise for the target.

3.2 Consistent Structured Transduction

Our method is based on jointly learning the transferable domain-specific representations for source and target as well as estimating the labels of the unsupervised data points. We denote these two main components of our method as transduction and adaptation. Transduction is the sub-problem of labelling the unsupervised data points, and adaptation is the sub-problem of solving for the domain shift. In order to solve this joint problem tractably, we exploit two heuristics: cyclic consistency for adaptation and structured consistency for transduction.

Cyclic consistency: One desired property of Φ_s and Φ_t is consistency. If we estimate the labels of the unsupervised data points and then use these points with their estimated labels to estimate the labels of the supervised data points, we want the predicted labels of the supervised data points to be consistent with the ground truth labels. Using the inner product as an asymmetric similarity metric, s(x̂_i, x_j) = Φ_s(x̂_i)^⊤ Φ_t(x_j), this consistency can be represented with the following diagram:

(x̂_i, ŷ_i) --Transduction--> (x_j, y_j) --Transduction--> (x̂_i, ŷ_i^pred),    Cyclic Consistency: ŷ_i = ŷ_i^pred

It can be shown that if the transduction from target to source follows a nearest neighbor rule, cyclic consistency can be enforced without explicitly computing ŷ_i^pred, using the large-margin nearest neighbor (LMNN) [35] rule.
For each source point, we enforce a margin such that the similarity between the source point and its nearest target neighbor with the same label is greater than the similarity between the source point and its nearest target neighbor with a different label. Formally, Φ_s(x̂_i)^⊤ Φ_t(x_{i^+}) > Φ_s(x̂_i)^⊤ Φ_t(x_{i^-}) + α, where x_{i^+} is the nearest target having the same class label as x̂_i and x_{i^-} is the nearest target having a different class label.

Structured consistency: We enforce structured consistency when we label the target points during the transduction. The structure we enforce is: if two target points are similar to each other, they are more likely to have the same label. To do so, we create a k-NN graph of target points using the similarity metric Φ_t(x_i)^⊤ Φ_t(x_j). We denote the neighbors of the point x_i as N(x_i). We enforce structured consistency by penalizing neighboring points of different labels proportionally to their similarity score.

Our model leads to the following optimization problem, over the target labels y_i and the feature function parameters θ_c, θ_s, θ_t, jointly solving transduction and adaptation:

$$\min_{\substack{\theta_c,\theta_s,\theta_t,\\ y_1,\dots,y_{N^u}}} \underbrace{\sum_{i\in[N^s]} \big[\Phi_s(\hat{x}_i)^\top \Phi_t(x_{i^-}) - \Phi_s(\hat{x}_i)^\top \Phi_t(x_{i^+}) + \alpha\big]_+}_{\text{Cyclic Consistency}} \;+\; \lambda \underbrace{\sum_{i\in[N^u]} \sum_{x_j\in N(x_i)} \Phi_t(x_i)^\top \Phi_t(x_j)\,\mathbb{1}(y_i \neq y_j)}_{\text{Structured Consistency}} \quad (1)$$

$$\text{s.t.}\quad i^+ = \arg\max_{j\,|\,y_j=\hat{y}_i} \Phi_s(\hat{x}_i)^\top \Phi_t(x_j) \quad\text{and}\quad i^- = \arg\max_{j\,|\,y_j\neq\hat{y}_i} \Phi_s(\hat{x}_i)^\top \Phi_t(x_j)$$

where 1(a) is an indicator function which is 1 if a is true and 0 otherwise.
[a]_+ is a rectifier function equal to max(0, a).

We solve this optimization problem via alternating minimization, iterating between solving for the unsupervised labels y_i (transduction) and learning the similarity metric θ_c, θ_s, θ_t (adaptation). We explain these two steps in detail in the following sections.

3.3 Transduction: Labeling Target Domain

In order to label the unsupervised points, we base our model on the k-nearest-neighbor rule. We simply compute the k nearest supervised data points for each unsupervised data point using the learned metric and transfer the corresponding majority label. Formally, given a similarity metric θ_c, θ_s, θ_t, the k-NN rule is (y_i)^pred = arg max_y k_y(x_i)/k, where k_y(x_i) is the number of samples having label y among the k nearest neighbors of x_i from the source domain. One major issue with this approach is the inaccuracy of the transduction during the initial stage of the algorithm. Since the learned metric will not yet be accurate, we expect to see some noisy k-NN sets. Hence, we propose two solutions to this problem.

Structured consistency: Similar to existing graph transduction algorithms [4, 36], we create a k-nearest neighbor (k-NN) graph over the unsupervised data points and penalize disagreements of labels between neighbors.

Reject option: In the initial stage of the algorithm, we let the transduction step use the reject option R as an additional label (besides the class labels) for the unsupervised target points. In other words, our transduction algorithm can decide not to label (reject) some of the points so that they will not be used for adaptation.
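As a minimal sketch, the k-NN vote with the reject option can be written as follows. This covers only the per-point (unary) costs; the full transduction in our method additionally includes the pairwise structured term and is solved jointly. The function name, the choice of -1 to encode R, and the default γ are illustrative:

```python
import numpy as np

def transduce_with_reject(sim_to_src, y_src, k=5, gamma=0.1, n_classes=10):
    """k-NN majority vote with a reject label R (encoded as -1).
    sim_to_src[i, j] is the similarity of target point i to source point j.
    Best-class cost: 1 - k_y/k; reject cost: gamma * max_y k_y / k."""
    labels = []
    for row in sim_to_src:
        nn = np.argsort(-row)[:k]                 # k most similar source points
        counts = np.bincount(y_src[nn], minlength=n_classes)
        best = int(counts.argmax())
        cost_class = 1.0 - counts[best] / k       # l(x, y) for the best class
        cost_reject = gamma * counts[best] / k    # l(x, R)
        labels.append(best if cost_class <= cost_reject else -1)
    return labels
```

With a small γ the reject cost is low, so ambiguous points are rejected early in training; increasing γ over epochs (as described in Section 3.5) makes rejection progressively more expensive.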
As the learned metric becomes more accurate in later iterations, the transduction algorithm can change the label from R to one of the class labels.

Using the aforementioned heuristics, we define our transduction sub-problem as¹:

$$\min_{y_1,\dots,y_{N^u} \in \mathcal{Y}\cup\{R\}} \sum_{i\in[N^u]} l(x_i, y_i) + \lambda \sum_{i\in[N^u]} \sum_{x_j\in N(x_i)} \Phi_t(x_i)^\top \Phi_t(x_j)\,\mathbb{1}(y_i\neq y_j) \quad (2)$$

$$\text{where}\quad l(x_i, y) = \begin{cases} 1 - \frac{k_y(x_i)}{k} & y\in\mathcal{Y} \\ \gamma \max_{y'\in\mathcal{Y}} \frac{k_{y'}(x_i)}{k} & y = R \end{cases}$$

and γ is the relative cost of the reject option. The cost l(x_i, R) is smaller if no class has a majority, promoting the reject option for undecided cases. We also modulate γ during learning to decrease the number of rejections in the later stages of the adaptation. This problem can approximately be solved using many existing methods. We use the α-β swapping algorithm from [5] since it has experimentally been shown to be efficient and accurate.

3.4 Adaptation: Learning the Metric

Given the predicted labels y_i for the unsupervised data points x_i, we can then learn a metric in order to minimize the loss function defined in (1). Following the cyclic consistency construction, the LMNN rule can be represented using the triplet loss defined between the supervised source data points and their nearest positive and negative neighbors among the unsupervised target points. We do not include the target data points with reject labels during this construction.
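This triplet construction can be sketched on precomputed feature batches as follows. Variable names and the use of -1 to encode the reject label R are ours; the structured term and the subgradient computation are omitted:

```python
import numpy as np

def adaptation_loss(phi_s_src, phi_t_tgt, y_src, y_tgt, alpha=1.0):
    """LMNN-style cyclic-consistency term: for each source point, a hinge on
    the gap between the similarity to its most similar target of the same
    predicted label (i+) and of a different label (i-). Targets with the
    reject label (-1 here) are excluded from the negative set."""
    sim = phi_s_src @ phi_t_tgt.T            # asymmetric similarity s(x_hat_i, x_j)
    total = 0.0
    for i, y in enumerate(y_src):
        pos = (y_tgt == y)                   # candidates for i+
        neg = (y_tgt != y) & (y_tgt != -1)   # candidates for i-, rejects excluded
        if not pos.any() or not neg.any():
            continue                         # no valid triplet for this source point
        total += max(0.0, sim[i, neg].max() - sim[i, pos].max() + alpha)
    return total
```

In a deep learning framework the same quantity would be computed on the network outputs so that its subgradients flow back into θ_c, θ_s, θ_t.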
Formally, we can define the adaptation problem given the unsupervised labels as:

$$\min_{\theta_c,\theta_s,\theta_t} \sum_{i\in[N^s]} \big[\Phi_s(\hat{x}_i)^\top \Phi_t(x_{i^-}) - \Phi_s(\hat{x}_i)^\top \Phi_t(x_{i^+}) + \alpha\big]_+ + \lambda \sum_{i\in[N^u]} \sum_{x_j\in N(x_i)} \Phi_t(x_i)^\top \Phi_t(x_j)\,\mathbb{1}(y_i\neq y_j) \quad (3)$$

$$\text{where}\quad i^+ = \arg\max_{j\,|\,y_j=\hat{y}_i} \Phi_s(\hat{x}_i)^\top \Phi_t(x_j) \quad\text{and}\quad i^- = \arg\max_{j\,|\,y_j\neq\hat{y}_i,\, y_j\neq R} \Phi_s(\hat{x}_i)^\top \Phi_t(x_j) \quad (4)$$

We optimize this function via stochastic gradient descent using the sub-gradients ∂loss/∂θ_c, ∂loss/∂θ_s and ∂loss/∂θ_t. These sub-gradients can be efficiently computed with back-propagation (see [1] for details).

¹The subproblem we define here does not directly correspond to the optimization of (1) with respect to y_1, ..., y_{N^u}. It is an extension of the exact sub-problem, replacing the 1-NN rule with the k-NN rule and introducing the reject option.

3.5 Implementation Details

We use AlexNet [17] and LeNet [18] architectures with small modifications. We remove their final softmax layer and change the size of the final fully connected layer according to the desired feature dimension. We consider the last fully connected layer as domain specific (θ_s, θ_t) and the rest as the common network θ_c. Common network weights are tied between domains, and the final layers are learned separately. In order to have a fair comparison, we use the same architectures as [11], only modifying the embedding size (see supplementary material [1] for details).

Since the Office dataset is quite small, we do not learn the network from scratch for the Office experiments; instead we initialize with weights pre-trained on ImageNet.
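Putting the two stages together, a toy run of the alternation (transduction, then a gradient step of adaptation) can look like the following. The synthetic 1-D data, the single linear map W standing in for the learned metric, and the simplified squared-error adaptation objective are all illustrative stand-ins for the networks and losses above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two well-separated 1-D classes; the target domain is the source scaled by 4,
# so labels transfer by 1-NN but the initial metric is badly scaled.
Xs = np.concatenate([rng.normal(-1, 0.1, 25), rng.normal(1, 0.1, 25)])[:, None]
ys = np.array([0] * 25 + [1] * 25)
Xt = np.concatenate([rng.normal(-4, 0.1, 25), rng.normal(4, 0.1, 25)])[:, None]

W = np.eye(1)                                   # stands in for theta_t

for _ in range(100):
    Zt = Xt @ W                                 # mapped target features
    nn = np.abs(Zt - Xs.T).argmin(axis=1)       # transduction: nearest source point
    grad = 2 * (Zt - Xs[nn]).T @ Xt / len(Xt)   # adaptation: d/dW of mean ||Zt - Xs_nn||^2
    W = W - 0.01 * grad                         # gradient step on the transform

yt = ys[np.abs(Xt @ W - Xs.T).argmin(axis=1)]   # final transduced labels
```

Over the iterations W shrinks toward the source scale (roughly 1/4 here), illustrating how the transduced assignments and the transform improve each other in turn.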
In all of our experiments, we set the feature dimension to 128. We use stochastic gradient descent with AdaGrad [8] to learn the feature function. We initialize convolutional weights with truncated normals having std-dev 0.1, biases with constant value 0.1, and use a learning rate of 2.5 × 10^-4 with batch size 512. We start the rejection penalty at γ = 0.1 and linearly increase it with each epoch as γ = (#epoch - 1)/M + 0.1. In our experiments, we use M = 20, λ = 0.001 and α = 1.

Algorithm 1 Transduction with Domain Shift
Input: Source x̂_{1...N^s}, ŷ_{1...N^s}, Target x_{1...N^u}, Batch Size 2 × B
for t = 0 to max_iter do
    Sample {x̂_{1,...,B}, ŷ_{1,...,B}}, {x_{1,...,B}}
    Solve (2) for {y_{1...B}}
    for i = 1 to B do
        if ŷ_i ∈ y_{1...B} and ∃k y_k ∈ Y \ ŷ_i then
            Compute (i⁺, i⁻) using {y_{1...B}} in (4)
            Update ∂loss/∂θ_c, ∂loss/∂θ_s, ∂loss/∂θ_t
        end if
    end for
    η(t) ← AdaGrad rule [8]
    θ_c ← θ_c + η(t) ∂loss/∂θ_c, θ_s ← θ_s + η(t) ∂loss/∂θ_s, θ_t ← θ_t + η(t) ∂loss/∂θ_t
end for

4 Experimental Results

We evaluate our algorithm on various unsupervised domain adaptation tasks, focusing on two problems: hand-written digit classification and object recognition.

Datasets: We use MNIST [19], Street View House Numbers [21] and the artificially generated version of MNIST, MNIST-M [11], to evaluate our algorithm on the digit classification task. MNIST-M is simply a blend of the digit images of the original MNIST dataset and the color images of BSDS500 [2], following the method explained in [11].
Since the dataset is not distributed directly by the authors, we generated the dataset using the same procedure and further confirmed that the performance is the same as the one reported in [11]. Street View House Numbers is a collection of house numbers collected from Google Street View images. Each of these three domains is quite different from the others. Among many important differences, the most significant ones are that MNIST is grayscale while the others are colored, and that SVHN images have extra confusing digits around the centered digit of interest. Moreover, all domains are large-scale, having at least 60k examples over 10 classes.

In addition, we use the Office [25] dataset to evaluate our algorithm on the object recognition task. The Office dataset includes images of objects taken from Amazon, captured with a webcam and captured with a D-SLR. Differences between domains include the white background of Amazon images versus realistic webcam images, and the resolution differences. The Office dataset has fewer images, with a maximum of 2478 per domain over 31 classes.

Baselines: We compare our method against a variety of methods with and without feature learning. SA* [9] is the dominant state-of-the-art approach not employing feature learning, and Backprop (BP) [11] is the dominant state-of-the-art approach employing feature learning. We use the available source code of [11] and [9], and following the evaluation procedure in [11], we choose the hyper-parameter of [9] as the highest performing one among various alternatives. We also compare our method with the source-only baseline, which is a convolutional neural network trained using only the source data. This classifier is clearly different from our nearest neighbor classifier; however, we experimentally validated that the CNN always outperformed the nearest neighbor based classifier.
Hence, we report the highest performing source-only method.

Evaluation: We evaluate all algorithms in a fully transductive setup [12]. We feed the training images and labels of the first domain as the source and the training images of the second domain as the target. We evaluate the accuracy on the target domain as the ratio of correctly labeled images to all target images.

4.1 Results

Following the fully transductive evaluation, we summarize the results in Table 1 and Table 2. Table 1 summarizes the results on the object recognition task using the Office dataset, whereas Table 2 summarizes the digit classification task on MNIST and SVHN.

Table 1: Accuracy of our method and the state-of-the-art algorithms on the Office dataset (A = Amazon, W = Webcam, D = D-SLR).

SOURCE→TARGET           A→W    D→W    W→D    W→A    A→D    D→A
GFK [12]                .398   .791   .746   .371   .379   .379
SA* [9]                 .450   .648   .699   .393   .388   .420
DLID [6]                .519   .782   .899   -      -      -
DDC [33]                .618   .950   .985   .522   .644   .521
DAN [20]                .685   .960   .990   .531   .670   .540
BACKPROP [11]           .730   .964   .992   .536   .728   .544
SOURCE ONLY             .642   .961   .978   .452   .668   .476
OUR METHOD (K-NN ONLY)  .727   .952   .915   .575   .791   .521
OUR METHOD (NO REJECT)  .804   .962   .989   .625   .839   .567
OUR METHOD (FULL)       .811   .964   .992   .638   .841   .583

Table 2: Accuracy on the digit classification task.

SOURCE→TARGET           M-M→MNIST  MNIST→M-M  SVHN→MNIST  MNIST→SVHN
SA* [9]                 .523       .569       .593        .211
BP [11]                 .732       .766       .738        .289
SOURCE ONLY             .483       .522       .549        .162
OUR METHOD (K-NN ONLY)  .805       .795       .713        .158
OUR METHOD (NO REJECT)  .835       .855       .774        .323
OUR METHOD (FULL)       .839       .867       .788        .403

Tables 1 & 2 show results on the object recognition and digit classification tasks covering all adaptation scenarios. Our experiments show that our proposed method outperforms all state-of-the-art algorithms.
Moreover, the increase in accuracy is rather significant when there is a large domain difference, such as MNIST↔MNIST-M, MNIST↔SVHN, Amazon↔Webcam and Amazon↔D-SLR. Our hypothesis is that state-of-the-art algorithms such as [11] seek features invariant to the domains, whereas we seek an explicit similarity metric explaining both the differences and the similarities of the domains. In other words, instead of seeking invariance, we seek equivariance.

Table 2 further suggests that our algorithm is the only one which can successfully perform adaptation from MNIST to SVHN. Clearly, features learned from MNIST cannot generalize to SVHN, since SVHN has concepts like color and occlusion which are not present in MNIST. Hence, our algorithm learns SVHN-specific features by enforcing accurate transduction during the adaptation.

Another interesting conclusion is the asymmetry of the results. For example, adapting webcam to Amazon and adapting Amazon to webcam yield very different accuracies. A similar asymmetry exists between MNIST and SVHN as well. This observation validates the importance of asymmetric modeling.

To evaluate the importance of joint labelling and the reject option, we compare our method with self-baselines: the version of our algorithm not using the reject option (no reject) and the version using neither the reject option nor joint labelling (k-NN only). Results on both experiments suggest that joint labelling and the reject option are both crucial for successful transduction. Moreover, the reject option is more important when the domain shift is large (e.g. MNIST→SVHN).
This is expected since transduction under a large shift is more likely to fail, a situation that can be prevented with the reject option.

4.1.1 Qualitative Analysis

To further study the learned representations and the similarity metric, we performed a series of qualitative analyses in the form of nearest neighbor and tSNE [34] plots.

Figure 1 visualizes example target images from MNIST and their corresponding source images. First of all, our experimental analysis suggests that MNIST and SVHN are the two domains with the largest difference. Hence, we believe MNIST↔SVHN is a very challenging set-up, and despite the huge visual differences, our algorithm results in accurate nearest neighbors. On the other hand, Figure 2 visualizes example target images from webcam and their corresponding nearest source images from Amazon.

The difference between invariance and equivariance is clearer in the tSNE plots of the Office dataset in Figure 3 and the digit classification task in Figure 4. In Figure 3, we plot the distribution of features before and after adaptation for source and target while color coding the class labels. We use the learned embeddings, the outputs of Φ_s and Φ_t, as input to the tSNE algorithm [34]. As Figure 3 suggests, the source domain is well clustered according to the object classes both with and without adaptation. This is expected since the features are specifically fine-tuned to the source domain before the adaptation starts. However, the target domain features have no structure before adaptation. This is also expected since the algorithm did not see any image from the target domain. After the adaptation, target images also get clustered according to the object classes.

In Figure 4, we show the digit images of the source and target after the adaptation.
In order to see the effect of the common features and the domain-specific features separately, we compute the low-dimensional embeddings of the output of the shared network (output of the first fully connected layer). We further compute the NN points between the source and target using Φ_s and Φ_t, and draw an edge between NNs. Clearly, the target is well clustered according to the classes, while the source is not very well clustered although it has some structure. Since we learn the entire network for digit classification, our networks learn discriminative features in the target domain, as our loss depends directly on classification scores in the target domain. Moreover, discriminative features in the target arise because of the transductive modeling. In comparison, state of the art domain invariance based algorithms only try to be invariant to the domains, without explicit modeling of discriminative behavior on the target. Hence, our method explicitly models the relationship between the domains and results in an equivariant model while enforcing discriminative behavior in the target.

Figure 1: Nearest neighbors for the SVHN→MNIST exp. We show an example MNIST image and its 5-NNs.

Figure 2: Nearest neighbors for the Amazon↔Webcam exp. We show an example Amazon image and its 3-NNs.

5 Conclusion

We described an end-to-end deep learning framework for jointly optimizing the deep feature representation, the cross domain transformation, and the target label inference for state of the art unsupervised domain adaptation. Experimental results on digit classification using MNIST [19] and SVHN [21] as well as on object recognition using the Office [25] dataset show state of the art performance with a significant margin.

Acknowledgments

We acknowledge the support of ONR-N00014-13-1-0761, MURI - WF911NF-15-1-0479 and Toyota Center grant 1191689-1-UDAWF.

(a) S.
w/o Adaptation, (b) S. with Adaptation, (c) T w/o Adaptation, (d) T with Adaptation.
Figure 3: tSNE plots for the Office dataset, Webcam (S) → Amazon (T). Source features were discriminative and stayed discriminative, as expected. On the other hand, target features became quite discriminative after the adaptation.

Figure 4: tSNE plot for the SVHN→MNIST experiment. Please note that the discriminative behavior emerges only in the unsupervised target, not in the source domain. This explains the motivation behind modeling the problem as transduction. In other words, our algorithm is designed to be accurate and discriminative in the target domain, which is the domain we are interested in.

References
[1] Supplementary details for the paper. http://cvgl.stanford.edu/transductive_adaptation.
[2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. T-PAMI, 33:898–916, 2011.
[3] C. Berlind and R. Urner. Active nearest neighbors in changing environments. In ICML, 2015.
[4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In ICML, 2001.
[5] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. T-PAMI, 26:1124–1137, 2004.
[6] S. Chopra, S. Balakrishnan, and R. Gopalan. Dlid: Deep learning for domain adaptation by interpolating between domains. In ICML W, 2013.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[8] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, pages 2121–2159, 2011.
[9] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In ICCV, 2013.
[10] A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction.
In UAI, 1998.
[11] Y. Ganin and V. S. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
[12] B. Gong, K. Grauman, and F. Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICML, 2013.
[13] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.
[14] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[15] S. Khamis and C. Lampert. CoConut: Co-classification with output space regularization. In BMVC, 2014.
[16] E. Kodirov, T. Xiang, Z. Fu, and S. Gong. Domain adaptation for zero-shot learning. In ICCV, 2015.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.
[19] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 1998.
[20] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015.
[21] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS W, 2011.
[22] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell. Zero-shot learning with semantic output codes. In NIPS, 2009.
[23] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: transfer learning from unlabeled data. In ICML. ACM, 2007.
[24] M. Rohrbach, S. Ebert, and B. Schiele. Transfer learning in a transductive setting. In NIPS, 2013.
[25] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, pages 213–226. Springer, 2010.
[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[27] V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In ICML, 2005.
[28] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
[29] B. Sun and K. Saenko. Subspace alignment for unsupervised domain adaptation. In BMVC, 2015.
[30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv:1409.4842, 2014.
[31] S. Thrun and L. Pratt. Learning to learn. Springer Science & Business Media, 2012.
[32] T. Tommasi and B. Caputo. Frustratingly easy NBNN domain adaptation. In ICCV, 2013.
[33] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv:1412.3474, 2014.
[34] L. van der Maaten. Accelerating t-SNE using tree-based algorithms. JMLR, 2014.
[35] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2006.
[36] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. 2002.
[37] X. Zhu, Z. Ghahramani, J. Lafferty, et al. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.
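The nearest-neighbor analysis behind Figure 2 matches each target embedding to its closest source embeddings computed with Φs and Φt. As a rough illustration of that matching step only, here is a minimal NumPy sketch; the function name, array shapes, and the Euclidean metric are our assumptions, not details taken from the paper's released code.

```python
import numpy as np


def cross_domain_nn(phi_s, phi_t, k=5):
    """Indices of the k nearest source embeddings for each target embedding.

    phi_s : (n_s, d) array of source features (plays the role of Phi_s)
    phi_t : (n_t, d) array of target features (plays the role of Phi_t)
    Returns an (n_t, k) integer array, nearest first, under Euclidean distance.
    """
    # Pairwise squared Euclidean distances, shape (n_t, n_s):
    # ||t||^2 - 2 t.s + ||s||^2, computed via broadcasting.
    d2 = (
        np.sum(phi_t ** 2, axis=1, keepdims=True)
        - 2.0 * phi_t @ phi_s.T
        + np.sum(phi_s ** 2, axis=1)
    )
    # Keep the k smallest distances per target point, sorted ascending.
    return np.argsort(d2, axis=1)[:, :k]
```

For a Figure 2-style visualization, the returned indices would simply select the k source images to display next to each target image, with an edge drawn between each target point and its neighbors.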