{"title": "Co-regularization Based Semi-supervised Domain Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 478, "page_last": 486, "abstract": "This paper presents a co-regularization based approach to semi-supervised domain adaptation. Our proposed approach (EA++) builds on the notion of augmented space (introduced in EASYADAPT (EA) [1]) and harnesses unlabeled data in target domain to further enable the transfer of information from source to target. This semi-supervised approach to domain adaptation is extremely simple to implement and can be applied as a pre-processing step to any supervised learner. Our theoretical analysis (in terms of Rademacher complexity) of EA and EA++ show that the hypothesis class of EA++ has lower complexity (compared to EA) and hence results in tighter generalization bounds. Experimental results on sentiment analysis tasks reinforce our theoretical findings and demonstrate the efficacy of the proposed method when compared to EA as well as a few other baseline approaches.", "full_text": "Co-regularization Based Semi-supervised Domain Adaptation\n\nHal Daum\u00b4e III\n\nAbhishek Kumar\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nUniversity of Maryland CP, MD, USA\n\nUniversity of Maryland CP, MD, USA\n\nhal@umiacs.umd.edu\n\nabhishek@umiacs.umd.edu\n\nAbstract\n\nAvishek Saha\n\nSchool Of Computing\n\nUniversity of Utah, UT, USA\navishek@cs.utah.edu\n\nThis paper presents a co-regularization based approach to semi-supervised domain adaptation. Our\nproposed approach (EA++) builds on the notion of augmented space (introduced in EASYADAPT\n(EA) [1]) and harnesses unlabeled data in target domain to further assist the transfer of information\nfrom source to target. This semi-supervised approach to domain adaptation is extremely simple to\nimplement and can be applied as a pre-processing step to any supervised learner. 
Our theoretical analysis (in terms of Rademacher complexity) of EA and EA++ shows that the hypothesis class of EA++ has lower complexity (compared to EA) and hence results in tighter generalization bounds. Experimental results on sentiment analysis tasks reinforce our theoretical findings and demonstrate the efficacy of the proposed method when compared to EA as well as a few other representative baseline approaches.

1 Introduction

A domain adaptation approach for NLP tasks, termed EASYADAPT (EA), augments the source domain feature space using features from labeled data in the target domain [1]. EA is simple, easy to extend and implement as a preprocessing step, and, most importantly, is agnostic of the underlying classifier. However, EA requires labeled data in both source and target, and hence applies only to fully supervised domain adaptation settings. In this paper,1 we propose a semi-supervised2 approach that leverages unlabeled data for EASYADAPT (which we call EA++) and theoretically, as well as empirically, demonstrate its superior performance over EA.

There exists prior work on supervised domain adaptation (and multi-task learning) that can be related to EASYADAPT. An algorithm for multi-task learning using shared parameters was proposed for multi-task regularization [3], wherein each task parameter is represented as the sum of a mean parameter (which stays the same for all tasks) and its deviation from this mean. SVMs were used as the base classifiers and the algorithm was formulated in the standard SVM dual optimization setting. Subsequently, this framework was extended to the online multi-domain setting in [4]. Prior work on semi-supervised approaches to domain adaptation also exists in the literature. Extraction of specific features from the available dataset was proposed [5, 6] to facilitate the task of domain adaptation. 
Co-adaptation [7], a combination of co-training and domain adaptation, can also be considered a semi-supervised approach to domain adaptation. A semi-supervised EM algorithm for domain adaptation was proposed in [8]. Similar to graph based semi-supervised approaches, a label propagation method was proposed [9] to facilitate domain adaptation. Domain Adaptation Machine (DAM) [10] is a semi-supervised extension of SVMs for domain adaptation and presents extensive empirical results. Nevertheless, in almost all of the above cases, the proposed methods either use specifics of the datasets or are customized for some particular base classifier, and hence it is not clear how they can be extended to other existing classifiers.

1A preliminary version [2] of this work appeared in the DANLP workshop at ACL 2010.
2We define supervised domain adaptation as having labeled data in both source and target, and unsupervised domain adaptation as having labeled data in only the source. In semi-supervised domain adaptation, we also have access to both labeled and unlabeled data in the target.

As mentioned earlier, EA is remarkably general in the sense that it can be used as a pre-processing step in conjunction with any base classifier. However, one of the prime limitations of EA is its inability to leverage unlabeled data. Given its simplicity and generality, it is natural to extend EA to semi-supervised settings. In this paper, we propose EA++, a co-regularization based semi-supervised extension to EA. We also present Rademacher complexity based generalization bounds for EA and EA++. Our generalization bounds also apply to the approach proposed in [3] in the domain adaptation setting, where we are only concerned with the error on the target domain. The closest to our work is a recent paper [11] that theoretically analyzes EASYADAPT. 
Their paper investigates the necessity to combine supervised and unsupervised domain adaptation (which the authors refer to as labeled and unlabeled adaptation frameworks, respectively) and analyzes the combination using mistake bounds (which are limited to perceptron-based online scenarios). In addition, their work points out that EASYADAPT is limited to supervised domain adaptation only. On the contrary, our work extends EASYADAPT to semi-supervised settings and presents a generalization bound based theoretical analysis which specifically demonstrates why EA++ is better than EA.

2 Background

In this section, we introduce notation and provide a brief overview of EASYADAPT [1].

2.1 Problem Setup and Notations

Let X ⊂ R^d denote the instance space and Y = {−1, +1} denote the label space. Let Ds(x, y) be the source distribution and Dt(x, y) be the target distribution. We have a set of source labeled examples Ls (∼ Ds(x, y)) and a set of target labeled examples Lt (∼ Dt(x, y)), where |Ls| = ls ≫ |Lt| = lt. We also have target unlabeled data denoted by Ut (∼ Dt(x)), where |Ut| = ut. Our goal is to learn a hypothesis h : X → Y having low expected error with respect to the target domain. In this paper, we consider linear hypotheses only; however, the proposed techniques extend to non-linear hypotheses, as mentioned in [1]. Source and target empirical errors for hypothesis h are denoted by ε̂s(h, fs) and ε̂t(h, ft) respectively, where fs and ft are the true source and target labeling functions. Similarly, the corresponding expected errors are denoted by εs(h, fs) and εt(h, ft). We will use the shorthand notations ε̂s, ε̂t, εs and εt wherever the intention is clear from context.

2.2 EasyAdapt (EA)

Let us denote by R^d the original space. EA operates in an augmented space denoted by X̆ ⊂ R^{3d} (for a single pair of source and target domains). 
For k domains, the augmented space blows up to R^{(k+1)d}. The augmented feature maps Φs, Φt : X → X̆ for the source and target domains are defined as Φs(x) = ⟨x, x, 0⟩ and Φt(x) = ⟨x, 0, x⟩, where x and 0 are vectors in R^d, and 0 denotes the zero vector of dimension d. The first d-dimensional segment corresponds to the commonality between source and target, the second d-dimensional segment corresponds to the source domain, and the last segment corresponds to the target domain. Source and target domain examples are transformed using these feature maps, and the augmented features so constructed are passed on to the underlying supervised classifier. One of the most appealing properties of EASYADAPT is that it is agnostic of the underlying supervised classifier being used to learn in the augmented space. Almost any standard supervised learning approach (e.g., SVMs, perceptrons) can be used to learn a linear hypothesis h̆ ∈ R^{3d} in the augmented space. Let us denote h̆ = ⟨gc, gs, gt⟩, where each of gc, gs, gt is of dimension d, and they represent the common, source-specific and target-specific components of h̆, respectively. During prediction on target data, the incoming target sample x is transformed to obtain Φt(x), and h̆ is applied to this transformed sample. This is equivalent to applying (gc + gt) to x. An intuitive insight into why this simple algorithm works so well in practice and outperforms most state-of-the-art algorithms is given in [1]. Briefly, it can be thought of as simultaneously training two hypotheses: hs = (gc + gs) for the source domain and ht = (gc + gt) for the target domain. 
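In code, the EA augmentation is only a few lines. The sketch below is our own illustration (not the authors' implementation); the data, dimensions and learner are made up, and any off-the-shelf linear classifier could consume the resulting matrix:

```python
import numpy as np

def phi_s(X):
    """EA source map: Phi_s(x) = <x, x, 0> in R^{3d}, applied row-wise."""
    Z = np.zeros_like(X)
    return np.hstack([X, X, Z])

def phi_t(X):
    """EA target map: Phi_t(x) = <x, 0, x> in R^{3d}, applied row-wise."""
    Z = np.zeros_like(X)
    return np.hstack([X, Z, X])

# Toy data: 20 source and 5 target labeled examples in d = 4 dimensions.
d = 4
Xs, ys = np.random.randn(20, d), np.sign(np.random.randn(20))
Xt, yt = np.random.randn(5, d), np.sign(np.random.randn(5))

# Stack the augmented source and target examples; any supervised linear
# learner can now be trained on (X_aug, y_aug) unchanged.
X_aug = np.vstack([phi_s(Xs), phi_t(Xt)])   # shape (25, 3d)
y_aug = np.concatenate([ys, yt])
```

Training on `X_aug` is all there is to EA: the classifier itself never needs to know that the features were augmented.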
The commonality between the domains is represented by gc, whereas gs and gt capture the idiosyncrasies of the source and target domains, respectively.

3 EA++: EA using unlabeled data

As discussed in the previous section, the EASYADAPT algorithm is attractive because it performs very well empirically and can be used in conjunction with any underlying supervised linear classifier. One drawback of EASYADAPT is its inability to leverage unlabeled target data, which is usually available in large quantities in most practical scenarios. In this section, we extend EA to semi-supervised settings while maintaining the desirable classifier-agnostic property.

3.1 Motivation

In the multi-view approach to semi-supervised learning [12], different hypotheses are learned using different views of the dataset. Thereafter, unlabeled data is utilized to co-regularize these learned hypotheses by making them agree on unlabeled samples. In domain adaptation, the source and target data come from two different distributions. However, if the source and target domains are reasonably close, we can employ a similar form of regularization using unlabeled data. A prior co-regularization based idea to harness unlabeled data in domain adaptation tasks demonstrated improved empirical results [10]. However, their technique applies only to the particular base classifier they consider and hence does not extend to other supervised classifiers.

3.2 EA++: EASYADAPT with unlabeled data

In our proposed semi-supervised approach, the source and target hypotheses are made to agree on unlabeled data. We refer to this algorithm as EA++. Recall that EASYADAPT learns a linear hypothesis h̆ ∈ R^{3d} in the augmented space. The hypothesis h̆ contains common, source-specific and target-specific sub-hypotheses and is expressed as h̆ = ⟨gc, gs, gt⟩. In the original space (ref. 
Section 2.2), this is equivalent to learning a source specific hypothesis hs = (gc + gs) and a target specific hypothesis ht = (gc + gt).

In EA++, we want the source hypothesis hs and the target hypothesis ht to agree on the unlabeled data. For an unlabeled target sample xi ∈ Ut ⊂ R^d, the goal of EA++ is to make the predictions of hs and ht on xi agree with each other. Formally, it aims to achieve the following condition:

hs · xi ≈ ht · xi  ⟺  (gc + gs) · xi ≈ (gc + gt) · xi  ⟺  (gs − gt) · xi ≈ 0  ⟺  ⟨gc, gs, gt⟩ · ⟨0, xi, −xi⟩ ≈ 0.   (3.1)

The above expression leads to the definition of a new feature map Φu : X → X̆ for unlabeled data, given by Φu(x) = ⟨0, x, −x⟩. Every unlabeled target sample is transformed using the map Φu(·). The augmented feature space that results from the application of the three feature maps, namely Φs(·), Φt(·) and Φu(·), on source labeled samples, target labeled samples and target unlabeled samples is summarized in Figure 1(a).

As shown in Eq. 3.1, during the training phase, EA++ assigns a predicted value close to 0 to each unlabeled sample. However, it is worth noting that during the test phase, EA++ predicts labels from two classes: +1 and −1. 
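The unlabeled-data map is equally mechanical. The sketch below is our own illustration with made-up values (not the reference implementation); it defines Φu and numerically checks the equivalence in Eq. 3.1, i.e., that applying ⟨gc, gs, gt⟩ to Φu(x) scores exactly the source/target disagreement (gs − gt) · x:

```python
import numpy as np

def phi_u(X):
    """EA++ unlabeled map: Phi_u(x) = <0, x, -x> in R^{3d}, applied row-wise."""
    Z = np.zeros_like(X)
    return np.hstack([Z, X, -X])

rng = np.random.default_rng(0)
d = 5
gc, gs, gt = rng.standard_normal((3, d))
h_aug = np.concatenate([gc, gs, gt])   # augmented hypothesis <gc, gs, gt>
x = rng.standard_normal(d)             # one unlabeled target sample

# Scoring Phi_u(x) with the augmented hypothesis yields (gs - gt) . x,
# the quantity that EA++ pushes toward zero on unlabeled data.
lhs = h_aug @ phi_u(x[None, :])[0]
rhs = (gs - gt) @ x
assert np.isclose(lhs, rhs)
```

So appending Φu-transformed unlabeled rows to the training set is exactly the co-regularizer of Eq. 3.1, expressed as plain extra training examples.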
This warrants further exposition of the implementation specifics, which is deferred until the next subsection.

[Figure 1 appears here. Panel (a) shows the block structure of the augmented features, each block of width d: source labeled rows ⟨Ls, Ls, 0⟩, target labeled rows ⟨Lt, 0, Lt⟩ and, for EA++, target unlabeled rows ⟨0, Ut, −Ut⟩. Panel (b) shows the hinge-loss curves.]

Figure 1: (a) Diagrammatic representation of feature augmentation in EA and EA++, (b) Loss functions for class +1, class −1 and their summation.

3.3 Implementation

In this section, we present implementation specific details of EA++. For concreteness, we consider SVM as the base supervised learner; however, these details hold for other supervised linear classifiers. In the dual form of the SVM optimization function, the labels are multiplied with the features. Since we want the predicted labels for unlabeled data to be 0 (according to Eq. 3.1), multiplication by zero would make the unlabeled samples ineffective in the dual form of the cost function. To avoid this, we create as many copies of Φu(x) as there are labels and assign each label to one copy of Φu(x). For the case of binary classification, we create two copies of every augmented unlabeled sample, and assign the +1 label to one copy and −1 to the other. The learner attempts to balance the loss of the two copies, and thereby tries to make the prediction on the unlabeled sample equal to 0. Figure 1(b) shows the curves of the hinge loss for class +1, class −1 and their summation. The effective loss for each unlabeled sample is similar to the sum of the losses for the +1 and −1 classes (shown in Figure 1(b)).

4 Generalization Bounds

In this section, we present Rademacher complexity based generalization bounds for EA and EA++. First, we define hypothesis classes for EA and EA++ using an alternate formulation. Second, we present a theorem (Theorem 4.1) which relates empirical and expected error for the general case and hence applies to both the source and target domains. Third, we prove Theorem 4.2, which relates the expected target error to the expected source error. 
Fourth, we present Theorem 4.3, which combines Theorem 4.1 and Theorem 4.2 so as to relate the expected target error to the empirical errors in source and target (which is the main goal of the generalization bounds presented in this paper). Finally, all that remains is to bound the Rademacher complexity of the various hypothesis classes.

4.1 Define Hypothesis Classes for EA and EA++

Our goal now is to define the hypothesis classes for EA and EA++ so as to make the theoretical analysis feasible. Both EA and EA++ train hypotheses in the augmented space X̆ ⊂ R^{3d}. The augmented hypothesis h̆ is trained using data from both domains, and its three d-dimensional sub-hypotheses (gc, gs, gt) are treated differently for source and target data. We use an alternate formulation of the hypothesis classes and work in the original space X ⊂ R^d. As discussed briefly in Section 2.2, EA can be thought of as simultaneously training two hypotheses hs = (gc + gs) and ht = (gc + gt) for the source and target domains, respectively. We consider the case when the underlying supervised classifier in the augmented space uses a squared L2-norm regularizer of the form ||h̆||² (as used in SVM). This is equivalent to imposing the regularizer (||gc||² + ||gs||² + ||gt||²) = (||gc||² + ||hs − gc||² + ||ht − gc||²). Differentiating this regularizer w.r.t. gc gives gc = (hs + ht)/3 at the minimum, and the regularizer reduces to (1/3)(||hs||² + ||ht||² + ||hs − ht||²). Thus, EA can be thought of as minimizing the sum of the empirical source error on hs, the empirical target error on ht, and this regularizer. The cost function QEA(h1, h2) can now be written as:

QEA(h1, h2) = α ε̂s(h1) + (1 − α) ε̂t(h2) + λ1||h1||² + λ2||h2||² + λ||h1 − h2||²,  and (hs, ht) = arg min_{h1,h2} QEA.   (4.1)

The EA algorithm minimizes this cost function over h1 and h2 jointly to obtain hs and ht. 
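The regularizer reduction used above is easy to verify numerically. The following snippet is a sanity check we add for exposition (not part of the paper); it confirms that gc = (hs + ht)/3 minimizes ||gc||² + ||hs − gc||² + ||ht − gc||² and that the minimum value equals (1/3)(||hs||² + ||ht||² + ||hs − ht||²):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
hs, ht = rng.standard_normal((2, d))

def reg(gc, hs, ht):
    """The regularizer ||gc||^2 + ||hs - gc||^2 + ||ht - gc||^2."""
    return gc @ gc + (hs - gc) @ (hs - gc) + (ht - gc) @ (ht - gc)

# Claimed closed-form minimizer and reduced value.
gc = (hs + ht) / 3.0
reduced = ((hs @ hs) + (ht @ ht) + (hs - ht) @ (hs - ht)) / 3.0
assert np.isclose(reg(gc, hs, ht), reduced)

# gc = (hs + ht)/3 is indeed the minimum: random perturbations only
# increase the regularizer (the objective is strictly convex in gc).
for _ in range(100):
    assert reg(gc + 1e-3 * rng.standard_normal(d), hs, ht) >= reg(gc, hs, ht)
```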
The EA++ algorithm uses target unlabeled data, and encourages hs and ht to agree on unlabeled samples (Eq. 3.1). This can be thought of as adding a regularizer of the form Σ_{i∈Ut} (hs(xi) − ht(xi))² to the cost function. The cost function for EA++ (denoted Q++(h1, h2)) can then be written as:

Q++(h1, h2) = α ε̂s(h1) + (1 − α) ε̂t(h2) + λ1||h1||² + λ2||h2||² + λ||h1 − h2||² + λu Σ_{i∈Ut} (h1(xi) − h2(xi))².   (4.2)

Both EA and EA++ give equal weights to the source and target empirical errors, so α turns out to be 0.5. We use hyperparameters λ1, λ2, λ, and λu in the cost functions to make them more general. However, as explained earlier, EA implicitly sets the hyperparameters (λ1, λ2, λ) to the same value (which is 0.5 × (1/3) = 1/6 in our case, since all the weights in the cost function are multiplied by α = 0.5). The hyperparameter for unlabeled data (λu) is 0.5 in EA++. We assume that the loss L(y, h · x) is bounded by 1 for the zero hypothesis h = 0. This is true for many popular loss functions, including the square loss and the hinge loss, when y ∈ {−1, +1}. One possible way [13] of defining the hypothesis classes is to substitute the trivial hypotheses h1 = h2 = 0 in both cost functions, which makes all regularizers and co-regularizers equal to zero and thus bounds the cost functions QEA and Q++. This gives us QEA(0, 0) ≤ 1 and Q++(0, 0) ≤ 1, since ε̂s(0), ε̂t(0) ≤ 1. Without loss of generality, we also assume that the final source and target hypotheses can only reduce the cost function as compared to the zero hypotheses. 
Hence, the final hypothesis pair (hs, ht) that minimizes the cost functions is contained in the following paired hypothesis classes for EA and EA++:

H := {(h1, h2) : λ1||h1||² + λ2||h2||² + λ||h1 − h2||² ≤ 1}
H++ := {(h1, h2) : λ1||h1||² + λ2||h2||² + λ||h1 − h2||² + λu Σ_{i∈Ut} (h1(xi) − h2(xi))² ≤ 1}   (4.3)

The source hypothesis class for EA is the set of all h1 such that the pair (h1, h2) is in H. Similarly, the target hypothesis class for EA is the set of all h2 such that the pair (h1, h2) is in H. Consequently, the source and target hypothesis classes for EA can be defined as:

J^s_EA := {h1 : X → R, (h1, h2) ∈ H}  and  J^t_EA := {h2 : X → R, (h1, h2) ∈ H}   (4.4)

Similarly, the source and target hypothesis classes for EA++ are defined as:

J^s_++ := {h1 : X → R, (h1, h2) ∈ H++}  and  J^t_++ := {h2 : X → R, (h1, h2) ∈ H++}   (4.5)

Furthermore, we assume that our hypothesis class is comprised of real-valued functions over an RKHS with reproducing kernel k(·, ·), k : X × X → R. Let us define the kernel matrix and partition it corresponding to source labeled, target labeled and target unlabeled data as shown below:

K = ( A_{s×s}  C_{s×t}  D_{s×u}
      C′_{t×s}  B_{t×t}  E_{t×u}
      D′_{u×s}  E′_{u×t}  F_{u×u} ),   (4.6)

where 's', 't' and 'u' indicate terms corresponding to source labeled, target labeled and target unlabeled data, respectively.

4.2 Relate empirical and expected error (for both source and target)

Having defined the hypothesis classes, we now proceed to obtain generalization bounds for EA and EA++. We have the following standard generalization bound based on the Rademacher complexity of a hypothesis class [13].

Theorem 4.1. 
Suppose the uniform Lipschitz condition holds for L : Y² → [0, 1], i.e., |L(ŷ1, y) − L(ŷ2, y)| ≤ M|ŷ1 − ŷ2|, where y, ŷ1, ŷ2 ∈ Y and ŷ1 ≠ ŷ2. Then for any δ ∈ (0, 1) and for m samples (X1, Y1), (X2, Y2), . . . , (Xm, Ym) drawn i.i.d. from distribution D, we have with probability at least (1 − δ) over random draws of samples,

ε(f) ≤ ε̂(f) + 2M R̂m(F) + (1/√m)(2 + 3√(ln(2/δ)/2)),

where f ∈ F is the class of functions mapping X → Y, and R̂m(F) is the empirical Rademacher complexity of F defined as R̂m(F) := E_σ[sup_{f∈F} |(2/m) Σ_{i=1}^m σ_i f(x_i)|].

If we can bound the complexity of the hypothesis classes J^s_EA and J^t_EA, we will have a uniform convergence bound on the difference of expected and empirical errors (|εt(h) − ε̂t(h)| and |εs(h) − ε̂s(h)|) using Theorem 4.1. However, in the domain adaptation setting, we are also interested in bounds that relate the expected target error to the total empirical error on source and target samples. The following sections aim to achieve this goal.

4.3 Relate source expected error and target expected error

The following theorem provides a bound on the difference of the expected target error and the expected source error. The bound is in terms of ηs := εs(fs, ft), νs := εs(h*t, ft) and νt := εt(h*t, ft), where fs and ft are the source and target labeling functions, and h*t is the optimal target hypothesis in the target hypothesis class. It also uses the dH∆H(Ds, Dt)-distance [14], which is defined as sup_{h1,h2∈H} 2|εs(h1, h2) − εt(h1, h2)|. The dH∆H-distance measures the distance between two distributions using a hypothesis-class-specific distance measure. 
If the two domains are close to each other, ηs and dH∆H(Ds, Dt) are expected to be small. On the contrary, if the domains are far apart, these terms will be large and the use of extra source samples may not help in learning a better target hypothesis. These two terms thus capture the notion of adaptability in our case.

Theorem 4.2. Suppose the loss function is M-Lipschitz as defined in Theorem 4.1, and obeys the triangle inequality. For any two source and target hypotheses hs, ht (which belong to different hypothesis classes), we have

εt(ht, ft) − εs(hs, fs) ≤ M ||ht − hs|| Es[√(k(x, x))] + (1/2) dHt∆Ht(Ds, Dt) + ηs + νs + νt,

where Ht is the target hypothesis class, and k(·, ·) is the reproducing kernel for the RKHS. ηs, νs, and νt are defined as above.

Proof. Please see Appendix A in the supplement.

4.4 Relate target expected error with source and target empirical errors

EA and EA++ learn the source and target hypotheses jointly, so the empirical error in one domain is expected to have an effect on the generalization error in the other domain. In this section, we aim to bound the target expected error in terms of the source and target empirical errors. The following theorem achieves this goal.

Theorem 4.3. Under the assumptions and definitions used in Theorem 4.1 and Theorem 4.2, with probability at least 1 − δ we have

εt(ht, ft) ≤ (1/2)(ε̂s(hs, fs) + ε̂t(ht, ft)) + (1/2)(1/√ls + 1/√lt)(2 + 3√(ln(2/δ)/2)) + (1/2)(2M R̂m(Hs) + 2M R̂m(Ht)) + (1/2) M ||ht − hs|| Es[√(k(x, x))] + (1/4) dHt∆Ht(Ds, Dt) + (1/2)(ηs + νs + νt)

for any hs and ht, where Hs and Ht are the source hypothesis class and the target hypothesis class, respectively.

Proof. 
We first use Theorem 4.1 to bound (εt(ht) − ε̂t(ht)) and (εs(hs) − ε̂s(hs)). The theorem then follows directly by combining these two bounds with Theorem 4.2.

This bound provides a better understanding of how the target expected error is governed by both the source and target empirical errors, as well as the complexities of the hypothesis classes. This behavior is expected, since both EA and EA++ learn the source and target hypotheses jointly. We also note that the bound in Theorem 4.3 depends on ||hs − ht||, which might give the impression that the best possible thing to do is to make the source and target hypotheses equal. However, due to the joint learning of source and target hypotheses (by optimizing the cost function of Eq. 4.1), making the source and target hypotheses close will increase the source empirical error, thus loosening the bound of Theorem 4.3. Noticing that ||hs − ht||² ≤ 1/λ for both EA and EA++, the bound can be made independent of ||hs − ht||, although at a sacrifice in tightness. We note that Theorem 4.1 can also be used to bound the target generalization error of EA and EA++ in terms of the target empirical error only. However, if the number of labeled target samples is extremely low, this bound can be loose due to the inverse dependence on the number of target samples. Theorem 4.3 bounds the target expected error using the averages of the empirical errors, the Rademacher complexities, and sample dependent terms. 
If the domains are reasonably close and the number of labeled source samples is much higher than the number of labeled target samples, this can provide a tighter bound than Theorem 4.1.

Finally, to be able to use Theorem 4.3 we need the Rademacher complexities of the source and target hypothesis classes (for both EA and EA++); these are provided in the next sections.

4.5 Bound the Complexity of EA and EA++ Hypothesis Classes

The following theorems bound the Rademacher complexity of the target hypothesis classes for EA and EA++.

4.5.1 EASYADAPT (EA)

Theorem 4.4. For the hypothesis class J^t_EA defined in Eq. 4.4 we have

(1/(4√2)) · (2C^t_EA/lt) ≤ R̂m(J^t_EA) ≤ 2C^t_EA/lt,

where R̂m(J^t_EA) = E_σ[sup_{h2∈J^t_EA} |(2/lt) Σ_i σ_i h2(x_i)|], (C^t_EA)² = (λ2 + (1/λ1 + 1/λ)^{−1})^{−1} tr(B), and B is the kernel sub-matrix defined in Eq. 4.6.

Proof. Please see Appendix B in the supplement.

The complexity of the target class decreases as the hyperparameters increase. It decreases more rapidly with λ2 than with λ or λ1, which is expected, since λ2 is the hyperparameter that directly influences the target hypothesis. The kernel block sub-matrix corresponding to source samples does not appear in the bound. This result in conjunction with Theorem 4.1 gives a bound on the target generalization error.

To be able to use the bound of Theorem 4.3, we also need the Rademacher complexity of the source hypothesis class. Due to the symmetry of the paired hypothesis class (Eq. 4.3) in h1 and h2 up to scalar parameters, the complexity of the source hypothesis class can be similarly bounded by (1/(4√2)) · (2C^s_EA/ls) ≤ R̂m(J^s_EA) ≤ 2C^s_EA/ls, where (C^s_EA)² = (λ1 + (1/λ2 + 1/λ)^{−1})^{−1} tr(A), and A is the kernel block sub-matrix corresponding to source samples.

4.5.2 EASYADAPT++ (EA++)

Theorem 4.5. For the hypothesis class J^t_++ defined in Eq. 4.5 we have

(1/(4√2)) · (2C^t_++/lt) ≤ R̂m(J^t_++) ≤ 2C^t_++/lt,

where R̂m(J^t_++) = E_σ[sup_{h2∈J^t_++} |(2/lt) Σ_i σ_i h2(x_i)|] and (C^t_++)² = (λ2 + (1/λ1 + 1/λ)^{−1})^{−1} tr(B) − λu (λ1/(λλ1 + λλ2 + λ1λ2))² tr(E(I + kF)^{−1}E′), with k = λu(λ1 + λ2)/(λλ1 + λλ2 + λ1λ2).

Proof. Please see Appendix C in the supplement.

The second term in (C^t_++)² is always positive, since the trace of a positive definite matrix is positive. So the unlabeled data results in a reduction of complexity over the labeled-data case (Theorem 4.4). The trace term in the reduction can also be written as Σ_i ||E_i||²_{(I+kF)^{−1}}, where E_i is the i'th column of matrix E and ||·||²_Z is the norm induced by a positive definite matrix Z. Since E_i is the vector of inner products of the i'th target labeled sample with all unlabeled samples, the reduction in complexity is proportional to the similarity between target unlabeled samples and target labeled samples. This result in conjunction with Theorem 4.1 gives a bound on the target generalization error in terms of the target empirical error only.

To be able to use the bound of Theorem 4.3, we need the Rademacher complexity of the source hypothesis class too. Again, as in the case of EA, using the symmetry of the paired hypothesis class H++ (Eq. 4.3) in h1 and h2 up to scalar parameters, the complexity of the source hypothesis class can be similarly bounded by (1/(4√2)) · (2C^s_++/ls) ≤ R̂m(J^s_++) ≤ 2C^s_++/ls, where (C^s_++)² = (λ1 + (1/λ2 + 1/λ)^{−1})^{−1} tr(A) − λu (λ2/(λλ1 + λλ2 + λ1λ2))² tr(D(I + kF)^{−1}D′), and k is defined as in Theorem 4.5. The trace term can again be interpreted as before, which implies that the reduction in source class complexity is proportional to the similarity between source labeled samples and target unlabeled samples.

5 Experiments

We follow experimental setups similar to [1] but report our empirical results for the task of sentiment classification using the SENTIMENT data provided by [15]. Sentiment classification is a binary classification task: classifying a user review as positive or negative, for reviews of eight product types (apparel, books, DVD, electronics, kitchen, music, video, and other) collected from amazon.com. We quantify the domain divergences in terms of the A-distance [16], which is computed [17] from finite samples of the source and target domains using the proxy A-distance [16]. For our experiments, we consider the following domain pairs: (a) DVD→BOOKS (proxy A-distance = 0.7616) and (b) KITCHEN→APPAREL (proxy A-distance = 0.0459). 
As in [1], we use an averaged perceptron classifier from the Megam framework (implementation due to [18]) for all the aforementioned tasks. The training sample size varies from 1k to 16k. In all cases, the amount of unlabeled target data is equal to the total amount of labeled source and target data.

We compare the empirical performance of EA++ with a few baselines, namely: (a) SOURCEONLY (classifier trained on source labeled samples), (b) TARGETONLY-FULL (classifier trained on the same number of target labeled samples as the number of source labeled samples in SOURCEONLY), (c) TARGETONLY (classifier trained on a small amount of target labeled samples, roughly one-tenth of the amount of source labeled samples in SOURCEONLY), (d) ALL (classifier trained on the combined labeled samples of SOURCEONLY and TARGETONLY), (e) EA (classifier trained in the augmented feature space on the same input training set as ALL), and (f) EA++ (classifier trained in the augmented feature space on the same input training set as EA plus an equal amount of unlabeled target data). All these approaches were tested on the entire amount of available target test data.

Figure 2 presents the learning curves for (a) SOURCEONLY, (b) TARGETONLY-FULL, (c) TARGETONLY, (d) ALL, (e) EA, and (f) EA++ (EA with unlabeled data). The x-axis represents the number of training samples on which the predictor has been trained.

Figure 2: Error rates of SOURCEONLY, TARGETONLY-FULL, TARGETONLY, ALL, EA, and EA++ (with unlabeled data) versus number of training samples for (a) DVD→BOOKS (proxy A-distance = 0.7616) and (b) KITCHEN→APPAREL (proxy A-distance = 0.0459).
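For reference, the feature augmentation underlying the EA and EA++ baselines maps a source example $x$ to $\langle x, x, 0\rangle$ and a target example to $\langle x, 0, x\rangle$ [1], and EA++ additionally encodes each unlabeled target example as $\langle 0, x, -x\rangle$ with label 0, which implements the co-regularizer [2]. A minimal sketch of these maps (dense numpy vectors for clarity; real implementations would use sparse features):

```python
import numpy as np

def augment_source(x):
    """EA map for a labeled source example: <x, x, 0> (shared, source, target blocks) [1]."""
    return np.concatenate([x, x, np.zeros_like(x)])

def augment_target(x):
    """EA map for a labeled target example: <x, 0, x> [1]."""
    return np.concatenate([x, np.zeros_like(x), x])

def augment_unlabeled(x):
    """EA++ map for an unlabeled target example: <0, x, -x>, used with label 0 [2].

    Zero loss on this example pushes the source-specific and target-specific
    weight blocks to agree on x -- the co-regularization term.
    """
    return np.concatenate([np.zeros_like(x), x, -x])

x = np.array([1.0, 2.0])
w = np.concatenate([np.array([0.5, 0.5]),   # shared weight block g
                    np.array([0.2, 0.0]),   # source-specific block s
                    np.array([0.2, 0.0])])  # target-specific block t (here equal to s)

# When s == t, the prediction on the EA++-encoded unlabeled example is exactly 0,
# i.e., the co-regularizer incurs no loss once the two hypotheses agree.
print(w @ augment_unlabeled(x))  # -> 0.0
```

Any standard supervised learner can then be trained on the augmented vectors unchanged, which is what makes the approach a pure preprocessing step.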
At this point, we note that the number of training samples varies depending on the particular approach being used. For SOURCEONLY, TARGETONLY-FULL, and TARGETONLY, it is just the corresponding number of labeled source or target samples, respectively. For ALL and EA, it is the sum of labeled source and target samples. For EA++, the x-value plotted denotes the amount of unlabeled target data used (in addition to an equal amount of source+target labeled data, as in ALL or EA); we plot this number for EA++ simply to compare its improvement over EA when an additional (and equal) amount of unlabeled target data is used. This accounts for the different x values plotted for the different curves. In all cases, the y-axis denotes the error rate.
As can be seen, in both cases EA++ outperforms EASYADAPT. For DVD→BOOKS, the domains are far apart, as indicated by the high proxy A-distance. Hence, TARGETONLY-FULL achieves the best performance, and EA++ almost catches up for large amounts of training data. Across the sample sizes considered, EA++ gives relative improvements in the range of 4.36%–9.14% over EA. The domains KITCHEN and APPAREL can be considered reasonably close due to their low domain divergence. Hence, this domain pair is more amenable to domain adaptation, as demonstrated by the fact that the other approaches (SOURCEONLY, TARGETONLY, ALL) perform better than, or at least as well as, TARGETONLY-FULL. However, as before, EA++ once again outperforms all these approaches, including TARGETONLY-FULL: due to the closeness of the two domains, the additional unlabeled data helps EA++ beat even TARGETONLY-FULL. We also note that EA performs poorly in some cases, which corroborates prior experimental results [1]. For this dataset, EA++ yields relative improvements in the range of 14.08%–39.29% over EA across the sample sizes experimented with.
Similar trends were observed for other tasks and datasets (see Figure 3 of [2]).

6 Conclusions

We proposed a semi-supervised extension, EA++, to an existing domain adaptation technique (EA). EA++ leverages unlabeled target data to improve the performance of EA and, with this extension, applies to both fully supervised and semi-supervised domain adaptation settings. We have formulated EA and EA++ in terms of co-regularization, an idea that originated in the context of multiview learning [13, 19]. Our formulation also bears resemblance to existing work [20] in the semi-supervised learning (SSL) literature, which has been studied extensively in [21, 22, 23]; the difference is that while SSL aims to make the two views (on unlabeled data) agree, in domain adaptation the aim is to make the source and target hypotheses agree. Using this formulation, we have presented a theoretical analysis of the superior performance of EA++ as compared to EA, and our empirical results confirm the theoretical findings. EA++ can also be extended to multiple-source settings: given k sources and a single target domain, we can introduce a co-regularizer for each source-target pair. Due to space constraints, we defer details to a full version.

References

[1] Hal Daumé III. Frustratingly easy domain adaptation. In ACL'07, pages 256-263, Prague, Czech Republic, June 2007.
[2] Hal Daumé III, Abhishek Kumar, and Avishek Saha. Frustratingly easy semi-supervised domain adaptation. In ACL 2010 Workshop on Domain Adaptation for Natural Language Processing (DANLP), pages 53-59, Uppsala, Sweden, July 2010.
[3] Theodoros Evgeniou and Massimiliano Pontil. Regularized multitask learning. In KDD'04, pages 109-117, Seattle, WA, USA, August 2004.
[4] Mark Dredze, Alex Kulesza, and Koby Crammer.
Multi-domain learning by confidence-weighted parameter combination. Machine Learning, 79(1-2):123-149, 2010.
[5] Andrew Arnold and William W. Cohen. Intra-document structural frequency features for semi-supervised domain adaptation. In CIKM'08, pages 1291-1300, Napa Valley, California, USA, October 2008.
[6] John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In EMNLP'06, pages 120-128, Sydney, Australia, July 2006.
[7] Gokhan Tur. Co-adaptation: Adaptive co-training for semi-supervised learning. In ICASSP'09, pages 3721-3724, Taipei, Taiwan, April 2009.
[8] Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. Transferring Naive Bayes classifiers for text classification. In AAAI'07, pages 540-545, Vancouver, B.C., July 2007.
[9] Dikan Xing, Wenyuan Dai, Gui-Rong Xue, and Yong Yu. Bridged refinement for transfer learning. In PKDD'07, pages 324-335, Warsaw, Poland, September 2007.
[10] Lixin Duan, Ivor W. Tsang, Dong Xu, and Tat-Seng Chua. Domain adaptation from multiple sources via auxiliary classifiers. In ICML'09, pages 289-296, Montreal, Quebec, June 2009.
[11] Ming-Wei Chang, Michael Connor, and Dan Roth. The necessity of combining adaptation methods. In EMNLP'10, pages 767-777, Cambridge, MA, October 2010.
[12] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A co-regularization approach to semi-supervised learning with multiple views. In ICML Workshop on Learning with Multiple Views, pages 824-831, Bonn, Germany, August 2005.
[13] D. S. Rosenberg and P. L. Bartlett. The Rademacher complexity of co-regularized kernel classes. In AISTATS'07, pages 396-403, San Juan, Puerto Rico, March 2007.
[14] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman.
Learning bounds for domain adaptation. In NIPS'07, pages 129-136, Vancouver, B.C., December 2007.
[15] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL'07, pages 440-447, Prague, Czech Republic, June 2007.
[16] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In NIPS'06, pages 137-144, Vancouver, B.C., December 2006.
[17] Piyush Rai, Avishek Saha, Hal Daumé III, and Suresh Venkatasubramanian. Domain adaptation meets active learning. In NAACL 2010 Workshop on Active Learning for NLP (ALNLP), pages 27-32, Los Angeles, USA, June 2010.
[18] Hal Daumé III. Notes on CG and LM-BFGS optimization of logistic regression. August 2004.
[19] Vikas Sindhwani and David S. Rosenberg. An RKHS for multi-view learning and manifold co-regularization. In ICML'08, pages 976-983, Helsinki, Finland, June 2008.
[20] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In COLT'98, pages 92-100, New York, NY, USA, July 1998. ACM.
[21] Maria-Florina Balcan and Avrim Blum. A PAC-style model for learning from labeled and unlabeled data. In COLT'05, pages 111-126, Bertinoro, Italy, June 2005.
[22] Maria-Florina Balcan and Avrim Blum. A discriminative model for semi-supervised learning. J. ACM, 57(3), 2010.
[23] Karthik Sridharan and Sham M. Kakade. An information theoretic framework for multi-view learning. In COLT'08, pages 403-414, Helsinki, Finland, June 2008.