{"title": "Transfer Learning via Minimizing the Performance Gap Between Domains", "book": "Advances in Neural Information Processing Systems", "page_first": 10645, "page_last": 10655, "abstract": "We propose a new principle for transfer learning, based on a straightforward intuition: if two domains are similar to each other, the model trained on one domain should also perform well on the other domain, and vice versa. To formalize this intuition, we define the performance gap as a measure of the discrepancy between the source and target domains. We derive generalization bounds for the instance weighting approach to transfer learning, showing that the performance gap can be viewed as an algorithm-dependent regularizer, which controls the model complexity. Our theoretical analysis provides new insight into transfer learning and motivates a set of general, principled rules for designing new instance weighting schemes for transfer learning. These rules lead to gapBoost, a novel and principled boosting approach for transfer learning. Our experimental evaluation on benchmark data sets shows that gapBoost significantly outperforms previous boosting-based transfer learning algorithms.", "full_text": "Transfer Learning via Minimizing the Performance\n\nGap Between Domains\n\nBoyu Wang\n\nDepartment of Computer Science\n\nUniversity of Western Ontario\n\nbwang@csd.uwo.ca\n\nJorge A. Mendez\n\nDepartment of Computer and Information Science\n\nUniversity of Pennsylvania\n\nmendezme@seas.upenn.edu\n\nPrinceton Neuroscience Insititute\n\nDepartment of Computer and Information Science\n\nMing Bo Cai\n\nPrinceton University\nmcai@princeton.edu\n\nEric Eaton\n\nUniversity of Pennsylvania\neeaton@seas.upenn.edu\n\nAbstract\n\nWe propose a new principle for transfer learning, based on a straightforward\nintuition: if two domains are similar to each other, the model trained on one domain\nshould also perform well on the other domain, and vice versa. 
To formalize this\nintuition, we de\ufb01ne the performance gap as a measure of the discrepancy between\nthe source and target domains. We derive generalization bounds for the instance\nweighting approach to transfer learning, showing that the performance gap can be\nviewed as an algorithm-dependent regularizer, which controls the model complexity.\nOur theoretical analysis provides new insight into transfer learning and motivates a\nset of general, principled rules for designing new instance weighting schemes for\ntransfer learning. These rules lead to gapBoost, a novel and principled boosting\napproach for transfer learning. Our experimental evaluation on benchmark data sets\nshows that gapBoost signi\ufb01cantly outperforms previous boosting-based transfer\nlearning algorithms.\n\n1\n\nIntroduction\n\nTransfer learning is based on the idea that learning a new concept is easier after having learned one or\nmore similar concepts. By extracting knowledge from a set of related concepts (source domains) and\nthen leveraging this knowledge upon learning the concept of interest (target domain), the learning\nperformance can be improved. This is especially bene\ufb01cial when there is insuf\ufb01cient data to learn\nsolely from the target domain, but enough knowledge from the source domains is available. Transfer\nlearning has become increasingly relevant over the last two decades, and consequently during that\ntime various algorithms have been proposed [10, 17, 39, 22, 24, 11], accompanied by theoretical and\nempirical justi\ufb01cations [4, 25, 3, 21, 18, 19, 26].\nIn order to successfully transfer information from one domain to another, it is critical to understand\nthe similarities and differences between the domains. Intuitively, the more similar the two domains\nare, the more information can be transferred. 
When the domains are considerably different, but still\nrelated, a common strategy to correct this difference is to minimize some measure of divergence\nbetween the empirical source and target data distributions. Most prior work in this area has focused\non de\ufb01ning discrepancy measures that motivate the design of algorithms that effectively reduce the\ndissimilarity between domains as much as possible [16, 17, 35, 6, 2, 34, 3, 25, 7, 14, 1, 33]. These\nworks have mainly considered the problem of domain adaptation, where examples from the target\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fdomain are entirely unlabeled. However, in many practical cases, there is a small amount of labeled\ntarget data, which can be leveraged to derive more specialized measures of domain divergence.\nTo address this issue, we present the \ufb01rst analysis for instance weighting transfer learning that\nconsiders the presence of labeled target examples. The contribution of our work is two-fold. 1.\nWe address the question of how to measure the divergence between two domains given label\ninformation for the target domain. Intuitively, if two domains are similar to each other, the model\ntrained on one domain should also perform well on the other domain, and vice versa. To formalize this\nintuition, we propose the notion of performance gap between the source and target domains, and show\nthat the transfer learning model complexity can be upper bounded in terms of this performance gap.\nIn other words, it can be viewed as an algorithm-dependent regularizer, which leads to \ufb01ner and more\ninformative generalization bounds. This is, to the best of our knowledge, the \ufb01rst generalization bound\nfor instance-based transfer learning that considers the presence of labeled target data. Moreover, our\nde\ufb01nition of performance gap is intuitive and generally applicable to any form of transfer. 
Thus, our\nanalysis provides a deeper understanding of the general problem of transfer learning and new insight\ninto how to leverage the labeled target examples. 2. On the algorithmic side, instead of directly\nminimizing the generalization bound, which is highly computationally expensive, we propose four\nprincipled rules to follow when designing an instance weighting scheme for transfer learning. We\ninstantiate these rules with gapBoost, a novel and ef\ufb01cient boosting algorithm for transfer learning,\nwhich offers out-of-the-box usability and readily accommodates any algorithm for transfer learning.\nSource code for gapBoost is available at https://github.com/bwang-ml/gapBoost.\n\n2 Related Work\n\nThe large majority of transfer learning techniques can be categorized as instance, feature, or parameter\ntransfer [29, 40, 7]. In this paper, we consider the instance transfer approach, where the objective is\nto correct the difference between the domains by weighting the instances. In this context, the authors\nin [5, 3] studied transfer learning algorithms that minimize a convex combination of the source and\ntarget empirical risks, and proposed to use the H-divergence [4] to measure the distance between\nthe domains for 0-1 loss classi\ufb01cation. This study was generalized to arbitrary loss functions by\nintroducing the notion of discrepancy distance [25]. Since then, various measures have been proposed\nin the literature [35, 6, 2, 34, 7, 14, 33]. Recently, instance weighting has been revisited in [21] based\non the notion of algorithmic stability. The authors revealed that the source domain features can be\ninterpreted as a regularization matrix, which bene\ufb01ts the learning process of the target domain task.\nDespite the wide applicability of the discrepancy measures de\ufb01ned in these works, they fail to address\nthree problems. 1. 
These measures are designed for the setting of domain adaptation, where no label\ninformation is available in the target domain. As a result, it is unclear how to leverage any labeling\ninformation from the target domain in cases where it is available. Moreover, deriving generalization\nbounds for domain adaptation requires additional assumptions. One common assumption is that there\nexists an ideal hypothesis that performs well on both domains [3, 25], which cannot be empirically\nveri\ufb01ed due to the lack of labeled examples in the target domain. 2. These measures are either\nalgorithm-independent [16, 35, 6, 2, 34] or de\ufb01ned over a hypothesis class [3, 25, 7, 14, 33], and\nso they ignore the speci\ufb01c algorithm used. An algorithm-speci\ufb01c notion of divergence measure\ncould lead to more informative generalization guarantees. 3. From the algorithmic perspective, most\nmethods are restricted to linear hypotheses (or nonlinear hypotheses de\ufb01ned through a reproducing\nkernel Hilbert space) and derive the instance weights by directly minimizing the generalization\nbounds or divergence measures, which usually imposes a high computational burden [25, 14, 7].\nHaving access to labeled examples in the target domain enables us to derive more ef\ufb01cient learning\nalgorithms. In [10], an ef\ufb01cient transfer boosting method was proposed to reweight the data for clas-\nsi\ufb01cation in the presence of labeled target data. Later, this approach was extended to regression [30]\nand multi-source transfer [41, 12]. 
In [36], the authors proposed a two-stage instance weighting approach for transfer learning, and analyzed its generalization bound by extending the result from [3]. While these algorithms are effective in practice, no theoretical results have been presented to show why transfer learning succeeds when labeled information from the target domain is available.

In addition, most existing theoretical studies of transfer learning examine the convergence rate of the Rademacher complexity or stability coefficient, assuming that the model complexity (and hence the loss function) of the transfer learning algorithm is upper bounded by a constant. One related work we are aware of that relaxes this assumption is [42], which proves the boundedness of the loss functions in the setting of multitask learning. However, their analysis only relates the bounds with regularization functions and requires the additional assumption that when the hypothesis outputs 0, the loss function is upper bounded by another constant. Critically, these analyses do not provide any insight into how the domain divergence affects the model complexity and the generalization bound. More recently, this issue has been studied in [38], showing that the model complexity of parameter-sharing multitask learning algorithms is determined by the task similarities. However, this theoretical result has not motivated any concrete algorithm.

In contrast to prior work, we derive an algorithm-specific generalization bound that considers the label information from the target domain. Based on our newly developed theory, we design a principled and efficient instance weighting transfer learning algorithm.

3 Instance Weighting for Transfer Learning

In this section, we formalize the problem of instance weighting for transfer learning.
We continue by proposing four general rules to follow when developing new weighting schemes, along with the theoretical grounds that support these rules. We then instantiate these new rules with gapBoost.

Let $z = (x, y) \in \mathcal{X} \times \mathcal{Y}$ be a training example drawn from some unknown distribution $D$, where $x$ is the data point, and $y$ is its label, with $\mathcal{Y} = \{-1, 1\}$ for binary classification and $\mathcal{Y} \subseteq \mathbb{R}$ for regression. A hypothesis is a function $h \in \mathcal{H}$ that maps $\mathcal{X}$ to a set $\mathcal{Y}'$ sometimes different from $\mathcal{Y}$, where $\mathcal{H}$ is a hypothesis class. For a convex, non-negative loss function $\ell : \mathcal{Y}' \times \mathcal{Y} \mapsto \mathbb{R}_+$, we denote by $\ell(h(x), y)$ the loss of hypothesis $h$ at point $z = (x, y)$. Let $S = \{z_i = (x_i, y_i)\}_{i=1}^N$ be a set of $N$ training examples drawn independently from $D$. The empirical loss of $h$ on $S$ and its generalization loss over $D$ are defined, respectively, by $L_S(h) = \frac{1}{N}\sum_{i=1}^N \ell(h(x_i), y_i)$ and $L_D(h) = \mathbb{E}_{z \sim D}[\ell(h(x), y)]$. We consider the linear function class in a Euclidean space, but our analysis is also applicable to a reproducing kernel Hilbert space. We also assume that $\|x\|_2 \le R, \forall x \in \mathcal{X}$ for some $R \in \mathbb{R}_+$, and that the loss function is $\rho$-Lipschitz continuous for some $\rho \in \mathbb{R}_+$.

In the setting of transfer learning, we have a training sample $S = \{S_T, S_S\}$ of size $N = N_T + N_S$ composed of $S_T = \{z_i^T = (x_i^T, y_i^T)\}_{i=1}^{N_T}$ drawn from a target distribution $D_T$ and $S_S = \{z_i^S = (x_i^S, y_i^S)\}_{i=1}^{N_S}$ drawn from a source distribution $D_S$.
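To make this setup concrete, here is a minimal sketch (ours, not from the paper; the helper names and the choice of logistic loss are illustrative assumptions) of an instance-weighted empirical loss over a target sample $S_T$ and a source sample $S_S$ for a linear hypothesis $h(x) = \langle h, x\rangle$:

```python
import numpy as np

def weighted_empirical_loss(h, X, y, gamma):
    """Instance-weighted empirical loss: sum_i gamma_i * loss(<h, x_i>, y_i).

    The logistic loss stands in for a generic rho-Lipschitz convex loss;
    any other convex surrogate could be substituted.
    """
    margins = y * (X @ h)                # y_i * <h, x_i>
    losses = np.log1p(np.exp(-margins))  # logistic loss per example
    return float(gamma @ losses)

# Toy data: a small target sample and a larger source sample, labels in {-1, +1}.
rng = np.random.default_rng(0)
X_T, y_T = rng.normal(size=(5, 3)), np.array([1, -1, 1, 1, -1])
X_S, y_S = rng.normal(size=(20, 3)), rng.choice([-1, 1], size=20)

# Weights Gamma = [Gamma_T; Gamma_S] summing to one, with extra mass on the
# scarcer target sample (the 0.6 / 0.4 split is an arbitrary illustration).
gamma_T = np.full(5, 0.6 / 5)
gamma_S = np.full(20, 0.4 / 20)

h = np.zeros(3)  # the zero hypothesis: every margin is 0, every loss is log(2)
total = (weighted_empirical_loss(h, X_T, y_T, gamma_T)
         + weighted_empirical_loss(h, X_S, y_S, gamma_S))
print(total)  # log(2), since the weights sum to one
```

Because the weights sum to one, the weighted loss of the zero hypothesis is exactly log 2, which is a quick sanity check for any implementation of this kind.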
We analyze the transfer learning algorithms based on instance weighting, which aim to optimize the following objective function:
$$\min_{h \in \mathcal{H}} L_S^\Gamma(h) + \lambda R(h)\,, \qquad (1)$$
where $L_S^\Gamma(h) = L_{S_T}^{\Gamma_T}(h) + L_{S_S}^{\Gamma_S}(h)$ is the weighted empirical loss over the source and target domains, $R(h)$ is a regularization function to control the model complexity of $h$, and $\lambda$ is a regularization parameter. The domain-specific weighted losses are given by $L_{S_T}^{\Gamma_T}(h) = \sum_{i=1}^{N_T} \gamma_i^T \ell(h(x_i^T), y_i^T)$ and $L_{S_S}^{\Gamma_S}(h) = \sum_{i=1}^{N_S} \gamma_i^S \ell(h(x_i^S), y_i^S)$. The instance weights $\Gamma = [\Gamma_T; \Gamma_S]$, with $\Gamma_T = [\gamma_1^T, \ldots, \gamma_{N_T}^T]^\top \in \mathbb{R}_+^{N_T}$ and $\Gamma_S = [\gamma_1^S, \ldots, \gamma_{N_S}^S]^\top \in \mathbb{R}_+^{N_S}$, are such that $\sum_{i=1}^{N_T} \gamma_i^T + \sum_{i=1}^{N_S} \gamma_i^S = 1$, and they can either be learned in a pre-processing step [17, 8, 28, 15] or incorporated into learning algorithms [10, 23]. As we consider the linear function class, the hypothesis $h$ has the form of an inner product $h(x) = \langle h, x\rangle$, and we study the regularization function $R(h) = \|h\|_2^2$.

3.1 Principles for Instance Weighting

Leveraging problem (1) requires assigning appropriate values to $\Gamma$ so that the solution to problem (1) leads to effective transfer. There are a variety of weighting schemes developed in the literature. In this paper, we summarize four general and intuitively reasonable rules as follows. As we will show later, they are also theoretically grounded.

1. Minimize the weighted empirical loss over source and target domains, as suggested by (1).
2. 
Assign balanced weights to data points, as focusing too much on specific data points leads to overfitting caused by perturbations in the training data [32].
3. Assign more weight to the target sample, since target data will be used for testing.
4. Assign weights such that the performance gap between the domains is small.

Our main contribution lies in Rule 4, for which we introduce the notion of performance gap. Although individually intuitive, these rules can contradict one another, so designing an algorithm based on them requires properly trading them off. We explore one way to control this tradeoff via hyper-parameters in Section 3.3.

3.2 Theoretical Justifications

We now develop the theoretical foundations that justify the instance weighting rules. In contrast to previous studies on domain adaptation, we propose a notion to measure the divergence between the domains that leverages the label information, leading to Rule 4 in our instance weighting scheme. Intuitively, if two domains are similar, the model trained on one domain should also perform well on the other. To make this intuition precise, we define the notion of performance gap below.

Definition 1 (Performance gap). Let $V_S(h) = L_{S_S}^{\Gamma_S}(h) + \eta\lambda R(h)$ and $V_T(h) = L_{S_T}^{\Gamma_T}(h) + \eta\lambda R(h)$, respectively, be the objective functions in the source and target domains, where $\eta \in (0, \frac{1}{2})$, and let their minimizers, respectively, be $h_{S_S}$ and $h_{S_T}$. The performance gap between the source and target domains is defined as
$$\nabla = \nabla_T + \nabla_S\,,$$
where $\nabla_S = L_{S_S}^{\Gamma_S}(h_{S_T}) - L_{S_S}^{\Gamma_S}(h_{S_S})$ and $\nabla_T = L_{S_T}^{\Gamma_T}(h_{S_S}) - L_{S_T}^{\Gamma_T}(h_{S_T})$.

Note that the performance gap is both data and algorithm dependent, which is crucial for deriving a more informative and finer generalization bound.
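As a small numerical companion to Definition 1 (our own sketch, not the paper's code; it swaps in weighted ridge regression with the squared loss so that both domain minimizers have closed forms):

```python
import numpy as np

def fit_weighted_ridge(X, y, gamma, reg):
    """Closed-form minimizer of sum_i gamma_i (<h, x_i> - y_i)^2 + reg * ||h||^2."""
    d = X.shape[1]
    G = X.T * gamma          # X^T diag(gamma)
    return np.linalg.solve(G @ X + reg * np.eye(d), G @ y)

def weighted_sq_loss(h, X, y, gamma):
    return float(gamma @ (X @ h - y) ** 2)

def V(h, X, y, gamma, reg):  # domain objective: weighted loss + eta*lambda*R(h)
    return weighted_sq_loss(h, X, y, gamma) + reg * float(h @ h)

# Two domains generated from the same underlying linear model, so the gap
# should be small; all constants here are arbitrary illustrations.
rng = np.random.default_rng(1)
h_true = np.array([1.0, -2.0])
X_S = rng.normal(size=(50, 2)); y_S = X_S @ h_true + 0.1 * rng.normal(size=50)
X_T = rng.normal(size=(10, 2)); y_T = X_T @ h_true + 0.1 * rng.normal(size=10)
gamma_S = np.full(50, 0.5 / 50); gamma_T = np.full(10, 0.5 / 10)
eta, lam = 0.25, 0.1         # eta in (0, 1/2), as Definition 1 requires

h_SS = fit_weighted_ridge(X_S, y_S, gamma_S, eta * lam)  # minimizer of V_S
h_ST = fit_weighted_ridge(X_T, y_T, gamma_T, eta * lam)  # minimizer of V_T

# nabla_S: excess loss of the target-trained model on the source sample,
# and symmetrically for nabla_T; their sum is the performance gap.
nabla_S = weighted_sq_loss(h_ST, X_S, y_S, gamma_S) - weighted_sq_loss(h_SS, X_S, y_S, gamma_S)
nabla_T = weighted_sq_loss(h_SS, X_T, y_T, gamma_T) - weighted_sq_loss(h_ST, X_T, y_T, gamma_T)
print(nabla_S + nabla_T)     # the performance gap (small for similar domains)
```

Because $h_{S_S}$ and $h_{S_T}$ are exact minimizers of their own domain objectives, swapping them between domains can only increase each objective, which gives an easy invariant to test against.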
Moreover, note that, although we use the performance gap to analyze the specific setting of instance weighting, it could be readily applied to other transfer learning paradigms, such as feature-based transfer. We now present the definition of Y-discrepancy, which we require for our analysis.

Definition 2 (Y-Discrepancy [27]). Let $\mathcal{H}$ be a hypothesis class mapping $\mathcal{X}$ to $\mathcal{Y}$ and let $\ell : \mathcal{Y} \times \mathcal{Y} \mapsto \mathbb{R}_+$ define a loss function over $\mathcal{Y}$. The Y-discrepancy distance between two distributions $D_1$ and $D_2$ over $\mathcal{X} \times \mathcal{Y}$ is defined as:
$$\mathrm{dist}_{\mathcal{Y}}(D_1, D_2) = \sup_{h \in \mathcal{H}} |L_{D_1}(h) - L_{D_2}(h)|\,.$$

Our main theoretical contribution is the following theorem that bounds the difference between $L_{D_T}$ and $L_S^\Gamma$, which justifies our principles for instance weighting.

Theorem 1. Let $h_S$ be the optimal solution of the transfer learning problem (1). Assume that $\|x\|_2 \le R, \forall x \in \mathcal{X}$, and that the loss function is $\rho$-Lipschitz continuous and convex. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have
$$L_{D_T}(h_S) \le L_S^\Gamma(h_S) + \varepsilon_\Gamma + \|\Gamma_S\|_1 \, \mathrm{dist}_{\mathcal{Y}}(D_T, D_S)\,, \qquad (2)$$
where
$$\varepsilon_\Gamma = \min\left\{ \frac{\|\Gamma\|_\infty \rho^2 R^2}{\lambda} + \left( \frac{\rho^2 R^2 (\|\Gamma\|_2^2 + \|\Gamma\|_\infty)}{\lambda} + \|\Gamma\|_\infty B(\Gamma) \right) \sqrt{\frac{N \log\frac{1}{\delta}}{2}}\,,\; 2\sqrt{\frac{2N \|\Gamma\|_\infty \|\Gamma\|_2 \rho^2 R^2 \log\frac{4}{\delta}}{\lambda}} + \|\Gamma\|_2 B(\Gamma) \sqrt{\frac{\log\frac{2}{\delta}}{2}} \right\}\,.$$

Remark 1. Rule 1 is justified by $L_S^\Gamma$, Rule 2 is justified by $\|\Gamma\|_2$ and $\|\Gamma\|_\infty$, and Rule 3 is justified by $\|\Gamma_S\|_1$.
$B(\Gamma)$ is an upper bound of the loss function $\ell$, such that $\ell(h(x), y) \le B(\Gamma)$, where $h$ is the output hypothesis of an algorithm solving the transfer learning problem (1). We emphasize that it is a function of $\Gamma$ and, as we show later, can be upper bounded in terms of $\nabla$, which justifies Rule 4.

Proof Sketch. (Details of the proof are available in the appendix.)

Step 1: Bound $L_{D_T}$ from $L_D^\Gamma$. Let $L_D^\Gamma = L_{D_T}^{\Gamma_T} + L_{D_S}^{\Gamma_S}$ be the expected weighted loss of $L_S^\Gamma$. Then, by linearity of the expectation and the definition of Y-discrepancy, we show that the following holds:
$$L_{D_T} \le L_D^\Gamma + \|\Gamma_S\|_1 \, \mathrm{dist}_{\mathcal{Y}}(D_T, D_S)\,. \qquad (3)$$

Remark 2. Compared to the notion of discrepancy [25], one advantage of Y-discrepancy is that it does not require the assumption that the loss function obeys the triangle inequality [3, 9], which does not hold for many loss functions (e.g., hinge loss, squared loss), to make (3) hold. In addition, we can prove that for a binary classification problem, $\mathrm{dist}_{\mathcal{Y}}(D_T, D_S)$ can be upper bounded from a finite sample by constructing a new classification problem, where the positive target examples and negative source examples are positively labeled, and the negative target examples and positive source examples are negatively labeled. See Lemma A and Lemma B in the appendix for more details.

Step 2: Bound $L_D^\Gamma$ from $L_S^\Gamma$. We present two schemas to upper bound $L_D^\Gamma$: one is based on algorithmic stability, and the other one is based on Rademacher complexity, which lead to the definition of $\varepsilon_\Gamma$.

Algorithmic stability bound. 
We introduce the notion of weight-dependent uniform stability (see Definition A in the appendix) and show that, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the expected loss $L_D^\Gamma$ can be upper bounded by:
$$L_D^\Gamma \le L_S^\Gamma + \frac{\|\Gamma\|_\infty \rho^2 R^2}{\lambda} + \left( \frac{\rho^2 R^2 (\|\Gamma\|_2^2 + \|\Gamma\|_\infty)}{\lambda} + \|\Gamma\|_\infty B(\Gamma) \right) \sqrt{\frac{N \log\frac{1}{\delta}}{2}}\,. \qquad (4)$$

Rademacher complexity bound. We introduce the notion of weighted Rademacher complexity (see Definition B in the appendix), and relate it to the notion of uniform argument stability [20]. Then, we prove that the learning algorithm (1) produces an algorithmic hypothesis class $\mathcal{B}$, and, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the expected loss $L_D^\Gamma$ can be upper bounded by:
$$L_D^\Gamma \le L_S^\Gamma + 2\sqrt{\frac{2N \|\Gamma\|_\infty \|\Gamma\|_2 \rho^2 R^2 \log\frac{4}{\delta}}{\lambda}} + B(\Gamma)\|\Gamma\|_2 \sqrt{\frac{\log\frac{2}{\delta}}{2}}\,. \qquad (5)$$

Combining (3), (4), and (5), we obtain the generalization bound (2).

Remark 3. If $\gamma_i = \frac{1}{N}, \forall i \in \{1, \ldots, N\}$, we recover the standard argument stability bound from (2), which suggests assigning equal weights to all instances to achieve a fast convergence rate, due to $\|\Gamma\|_\infty$ and $\|\Gamma\|_2$. In particular, if $\|\Gamma\|_\infty$ (and hence $\|\Gamma\|_2^2$) is $O(\frac{1}{N})$, (2) leads to a convergence rate of $O(\frac{1}{\sqrt{N}})$. However, in the setting of transfer learning, it is usually the case that $N_T \ll N_S$. Consequently, we may have $\|\Gamma\|_\infty \ll \frac{1}{N_T}$, which implies that transfer learning has a faster convergence rate than single-task learning.
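To see numerically why these bounds favor balanced weights (Rule 2), here is a tiny illustration of $\|\Gamma\|_\infty$ and $\|\Gamma\|_2$ for uniform versus skewed weight vectors; the numbers are our own example, not from the paper:

```python
import numpy as np

N = 100
uniform = np.full(N, 1.0 / N)        # gamma_i = 1/N for every instance
skewed = np.full(N, 0.5 / (N - 1))   # half the total mass on one instance
skewed[0] = 0.5

for name, g in [("uniform", uniform), ("skewed", skewed)]:
    print(f"{name}: ||G||_inf = {np.max(g):.4f}, ||G||_2 = {np.linalg.norm(g):.4f}")

# Uniform weights attain the smallest possible norms for weights summing to
# one (1/N and 1/sqrt(N)); the skewed scheme inflates both norms and hence
# loosens the stability and Rademacher complexity terms above.
```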
On the other hand, as we will show in Step 3, the loss bound $B$ is also a function of $\Gamma$, which suggests a new criterion for instance weighting.

Step 3: Bound $B(\Gamma)$. The following lemma shows that the model complexity of the transfer learning algorithm (1) can be upper bounded in terms of the performance gap $\nabla$.

Lemma 1. Let $h_S$ be the optimal solution of the instance weighting transfer learning problem (1). Then, we have:
$$\|h_S\|_2 \le \sqrt{\frac{\nabla}{2\lambda(1 - 2\eta)} + \frac{\|h_{S_S}\|_2^2 + \|h_{S_T}\|_2^2}{2}}\,.$$

By bounding the model complexity, we obtain various upper bounds for different loss functions.

Corollary 1. The hinge loss function of the learning algorithm (1) can be upper bounded by:
$$B(\Gamma) \le 1 + R\sqrt{\frac{\nabla}{2\lambda(1 - 2\eta)} + \frac{\|h_{S_S}\|_2^2 + \|h_{S_T}\|_2^2}{2}}\,.$$
For regression, if the response variable is bounded by $|y| \le Y$, the $\ell_q$ loss of (1) can be bounded by:
$$B(\Gamma) \le \left( Y + R\sqrt{\frac{\nabla}{2\lambda(1 - 2\eta)} + \frac{\|h_{S_S}\|_2^2 + \|h_{S_T}\|_2^2}{2}} \right)^q\,.$$

Algorithm 1 gapBoost
Input: $S_S$, $S_T$, $K$, $\rho_S \le \rho_T \le 0$, $\gamma_{\max}$, a learning algorithm $A$
1: Initialize $D_1^S(i) = D_1^T(i) = \frac{1}{N_S + N_T}$ for all $i$
2: for $k = 1, \ldots, K$ do
3:   Call $A$ to train a base learner $h_k$ using $S_S \cup S_T$ with distribution $D_k^S \cup D_k^T$
4:   Call $A$ to train an auxiliary learner $h_k^S$ over the source domain using $S_S$ with distribution $D_k^S$
5:   Call $A$ to train an auxiliary learner $h_k^T$ over the target domain using $S_T$ with distribution $D_k^T$
6:   $\epsilon_k = \sum_{i=1}^{N_S} D_k^S(i)\,\mathbf{1}_{h_k(x_i^S) \ne y_i^S} + \sum_{i=1}^{N_T} D_k^T(i)\,\mathbf{1}_{h_k(x_i^T) \ne y_i^T}$, $\quad \alpha_k = \log\frac{1 - \epsilon_k}{\epsilon_k}$
7:   for $i = 1, \ldots, N_S$ do
8:     $\beta_i^S = \rho_S\,\mathbf{1}_{h_k^S(x_i^S) \ne h_k^T(x_i^S)} + \alpha_k\,\mathbf{1}_{h_k(x_i^S) \ne y_i^S}$, $\quad D_{k+1}^S(i) = D_k^S(i)\exp(\beta_i^S)$
9:   end for
10:  for $i = 1, \ldots, N_T$ do
11:    $\beta_i^T = \rho_T\,\mathbf{1}_{h_k^S(x_i^T) \ne h_k^T(x_i^T)} + \alpha_k\,\mathbf{1}_{h_k(x_i^T) \ne y_i^T}$, $\quad D_{k+1}^T(i) = D_k^T(i)\exp(\beta_i^T)$
12:  end for
13:  $Z_{k+1} = \sum_{i=1}^{N_S} D_{k+1}^S(i) + \sum_{i=1}^{N_T} D_{k+1}^T(i)$
14:  if $D_{k+1}^S(i), D_{k+1}^T(i) > \gamma_{\max} Z_{k+1}$ then
15:    $D_{k+1}^S(i), D_{k+1}^T(i) = \gamma_{\max} Z_{k+1}$
16:  end if
17:  Normalize $D_{k+1}^S$ and $D_{k+1}^T$ such that $\sum_{i=1}^{N_S} D_{k+1}^S(i) + \sum_{i=1}^{N_T} D_{k+1}^T(i) = 1$
18: end for
Output: $f(x) = \mathrm{sign}\big(\sum_{k=1}^K \alpha_k h_k(x)\big)$

Remark 4. 
Lemma 1 shows that, given fixed weights, the model complexity (and hence the upper bound of a loss function) is related to the performance gap between the source and target domains. Lemma 1 reveals that transfer learning (1) can succeed when the hypotheses trained on their own domains also work well on the other domains, which leads to a lower training loss and a faster convergence to the best hypothesis in the class in terms of sample complexity.

By combining Steps 1–3, we obtain Theorem 1.

By similar derivations, we obtain a PAC learning bound, which is also consistent with the instance weighting rules.

Corollary 2. Let $h_S$ be the optimal solution of the transfer learning problem (1), and $h^* = \arg\min_h L_{D_T}(h)$ be the minimizer in the target domain. Assume that $\|x\|_2 \le R, \forall x \in \mathcal{X}$, and that the loss function obeys the triangle inequality and is $\rho$-Lipschitz and convex. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have:
$$L_{D_T}(h_S) \le L_{D_T}(h^*) + \varepsilon'_\Gamma + 2\|\Gamma_S\|_1 \, \mathrm{dist}_{\mathcal{Y}}(D_T, D_S)\,, \qquad (6)$$
where
$$\varepsilon'_\Gamma = \min\left\{ \frac{\|\Gamma\|_\infty \rho^2 R^2}{\lambda} + \left( \frac{\rho^2 R^2 (\|\Gamma\|_2^2 + \|\Gamma\|_\infty)}{\lambda} + \Big( \|\Gamma\|_\infty + \frac{\|\Gamma\|_2}{\sqrt{N}} \Big) B(\Gamma) \right) \sqrt{\frac{N \log\frac{4}{\delta}}{2}}\,,\; 2\sqrt{\frac{2N \|\Gamma\|_\infty \|\Gamma\|_2 \rho^2 R^2 \log\frac{8}{\delta}}{\lambda}} + 2\|\Gamma\|_2 B(\Gamma) \sqrt{\frac{\log\frac{4}{\delta}}{2}} \right\}\,.$$

3.3 gapBoost

As $\mathrm{dist}_{\mathcal{Y}}(D_T, D_S)$ can be estimated from the training sample, it is possible to derive a weighting scheme by minimizing the generalization bound (2), as in previous works in the literature [25, 7]. However, one common issue with this approach is that it leads to high 
computational cost for large sample sizes, and it is usually restricted to linear hypotheses. In contrast, our algorithmic goal is to derive a computationally efficient method that is applicable to large-scale data and also flexible enough to accommodate arbitrary learning algorithms for transfer learning.

Table 1: Comparison of boosting algorithms for transfer learning.

              | Rule 1 | Rule 2 | Rule 3 | Rule 4
AdaBoost      |   ✓    |   ✓    |   ✗    |   ✗
TrAdaBoost    |   ✓    |   ✗    |   ✓    |   ✗
TransferBoost |   ✓    |   ✗    |   ✓    |   ✗
gapBoost      |   ✓    |   ✓    |   ✓    |   ✓

To this end, we propose gapBoost in Algorithm 1, which explicitly exploits the rules from Section 3.1. The algorithm trains a joint learner for the source and target domains, as well as auxiliary source and target learners (lines 3–5). Then, it up-weights incorrectly labeled instances as per traditional boosting methods and down-weights instances for which the source and target learners disagree; the trade-off between the two schemes is controlled separately for source and target instances via the hyper-parameters $\rho_S$ and $\rho_T$ (lines 6–12). Finally, the weights are clipped to a maximum value of $\gamma_{\max}$ and normalized (lines 13–17). 1. gapBoost follows Rule 1 by training the base learner $h_k$ at each iteration, which aims to minimize the weighted empirical loss over the source and target domains. 2. By tuning $\gamma_{\max}$, it explicitly controls $\|\Gamma\|_\infty$ and implicitly controls $\|\Gamma\|_2$, as required by Rule 2. Additionally, as each base learner $h_k$ is trained with a different set of weights, the final classifier $f$ returned by gapBoost is potentially trained over a balanced distribution. 3. 
Moreover, by setting $\rho_T \ge \rho_S$, gapBoost penalizes instances from the source domain more than from the target domain, implicitly assigning more weight to the target domain sample than to the source domain sample, as suggested by Rule 3. 4. Finally, as $\rho_S, \rho_T \le 0$, the weight of any instance $x$ will decrease if the learners disagree (i.e., $h_k^S(x) \ne h_k^T(x)$). By doing so, gapBoost follows Rule 4 by minimizing the gap $\nabla$. 5. The trade-off between the rules is balanced by the choice of the hyper-parameters $\rho_T$, $\rho_S$, and $\gamma_{\max}$.

Table 1 compares various traditional boosting algorithms for transfer learning in terms of the instance weighting rules. Conventional AdaBoost [13] treats source and target samples equally, and therefore does not reduce $\|\Gamma_S\|_1$ or minimize the performance gap. On the other hand, TrAdaBoost [10] and TransferBoost [12] explicitly exploit Rule 3 by assigning less weight to the source domain sample at each iteration. However, they do not control $\|\Gamma\|_\infty$ or $\|\Gamma\|_2$, so the weight of the target domain sample can become large after a few iterations. Most critically, none of the previous algorithms explicitly minimizes the performance gap as we do, which can be crucial for transfer learning to succeed.

The generalization performance of gapBoost is upper bounded by the following proposition.

Proposition 1. Let $f(x) = \sum_{k=1}^K \alpha_k h_k(x)$ be the ensemble of classifiers returned by gapBoost, with each base learner trained by solving (1). For simplicity, we assume that $\sum_{k=1}^K \alpha_k = 1$. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have
$$L_{D_T}(f) \le L_{S_T}(f) + \frac{2\rho^2 R^2 \gamma_\infty^T}{\lambda} + B(\Gamma)\sqrt{\frac{2\log\frac{4}{\delta}}{2N_T}}\,,$$
where $\gamma_\infty^T$ is the largest weight of the target sample over all boosting iterations.

Remark 5. 
We observe that if \u03b3T\nProposition 1 suggests to set \u03b3max = O(\n\n\u221e (cid:29)(cid:113) 1\n\nas the loss function is convex, B(\u0393) can be upper bounded by B(\u0393) \u2264(cid:80)K\n\nNT , the bound will be dominated by the second term. Then,\n1\u221a\nNT ) to achieve a fast convergence rate. On the other hand,\nk=1 \u03b1kB(\u0393k), where \u0393k\nis the set of weights at the k-th boosting iteration. In other words, one should aim to minimize the\nperformance gap for every boosting iteration to achieve a tighter bound.\n\n(cid:113)\n\n(cid:113) log 2\n\n4 Experiments\n\nWe evaluated gapBoost on two benchmark data sets.\n20 Newsgroups This data set contains approximately 20,000 documents, grouped by seven top\ncategories and 20 subcategories. Each transfer learning task involved a top-level classi\ufb01cation\nproblem, while the source and target domains were chosen from different subcategories. The source\nand target data sets were in the same way as in [10], yielding 6 transfer learning problems.\nOf\ufb01ce-Caltech This data set contains approximately 2,500 images from four distinct domains:\nAmazon (A), DSLR (D), Webcam (W), and Caltech (C), which enabled us to construct 12 transfer\n\n7\n\n\fTable 2: Comparison of different methods on the 20 Newsgroups (top) and Of\ufb01ce-Caltech (bottom)\ndata sets in term of error rate (%). The row titles are standard names used in the literature to identify\nthe transfer problems. Our algorithm, gapBoost, outperforms all baselines in the majority of transfer\nproblems, and is competitive with the top performance in the remaining ones. 
Standard error is\nreported after the \u00b1.\n\ncomp vs sci\nrec vs sci\ncomp vs talk\ncomp vs rec\nrec vs talk\nsci vs talk\nA \u2192 C\nA \u2192 D\nA \u2192 W\nC \u2192 A\nC \u2192 D\nC \u2192 W\nD \u2192 A\nD \u2192 C\nD \u2192 W\nW \u2192 A\nW \u2192 C\nW \u2192 D\n\nAdaBoostT\n12.45 \u00b1 0.47\n10.99 \u00b1 0.37\n11.83 \u00b1 0.42\n15.80 \u00b1 0.53\n12.08 \u00b1 0.36\n11.74 \u00b1 0.49\n43.87 \u00b1 0.52\n32.65 \u00b1 1.35\n37.23 \u00b1 0.98\n39.92 \u00b1 0.74\n27.88 \u00b1 1.14\n30.25 \u00b1 1.05\n44.30 \u00b1 0.45\n44.00 \u00b1 0.56\n50.63 \u00b1 0.58\n42.91 \u00b1 0.46\n44.12 \u00b1 0.50\n40.63 \u00b1 1.45\n\nAdaBoostT &S\n13.45 \u00b1 0.48\n11.79 \u00b1 0.35\n14.57 \u00b1 0.47\n17.50 \u00b1 0.64\n9.40 \u00b1 0.31\n10.52 \u00b1 0.37\n27.76 \u00b1 0.88\n28.33 \u00b1 1.33\n26.94 \u00b1 1.17\n20.32 \u00b1 0.80\n25.69 \u00b1 1.19\n24.50 \u00b1 1.30\n40.86 \u00b1 0.39\n40.09 \u00b1 0.46\n49.64 \u00b1 0.66\n37.22 \u00b1 0.56\n37.93 \u00b1 0.58\n45.52 \u00b1 1.58\n\nTrAdaBoost\n12.03 \u00b1 0.41\n10.03 \u00b1 0.36\n10.67 \u00b1 0.37\n14.86 \u00b1 0.67\n12.21 \u00b1 0.40\n10.13 \u00b1 0.46\n37.57 \u00b1 0.68\n34.93 \u00b1 1.43\n31.03 \u00b1 0.95\n29.13 \u00b1 0.80\n19.84 \u00b1 1.09\n22.86 \u00b1 0.95\n45.33 \u00b1 0.48\n43.72 \u00b1 0.62\n49.95 \u00b1 0.65\n44.24 \u00b1 0.52\n44.78 \u00b1 0.65\n40.00 \u00b1 1.51\n\nTransferBoost\n8.83 \u00b1 0.37\n7.93 \u00b1 0.30\n6.45 \u00b1 0.25\n12.11 \u00b1 0.43\n6.26 \u00b1 0.30\n6.45 \u00b1 0.26\n27.86 \u00b1 0.82\n28.96 \u00b1 1.38\n26.95 \u00b1 1.15\n19.68 \u00b1 0.80\n23.44 \u00b1 1.33\n23.41 \u00b1 1.30\n40.50 \u00b1 0.44\n40.35 \u00b1 0.46\n49.63 \u00b1 0.65\n37.02 \u00b1 0.53\n37.79 \u00b1 0.56\n44.88 \u00b1 1.58\n\ngapBoost\n7.68 \u00b1 0.25\n7.39 \u00b1 0.21\n7.10 \u00b1 0.27\n9.81 \u00b1 0.29\n5.66 \u00b1 0.21\n5.92 \u00b1 0.24\n27.06 \u00b1 0.87\n25.08 \u00b1 1.37\n24.34 \u00b1 1.10\n19.13 \u00b1 0.83\n21.03 \u00b1 1.20\n21.55 \u00b1 1.20\n40.66 \u00b1 0.39\n40.00 \u00b1 0.46\n50.24 \u00b1 0.62\n37.04 \u00b1 
0.52\n37.48 \u00b1 0.50\n41.74 \u00b1 1.40\n\nFigure 1: Test error rates (%) with different sizes of target sample on different tasks and on average\nacross all tasks. gapBoost consistently outperforms the baselines on all regimes of target sample size.\nSince gapBoost more effectively leverages the target instances, its improvement over the baselines is\nmore noticeable as the target sample size increases. Error bars represent standard error.\n\nproblems by alternately selecting each possible source-target pair. All four domains share the same\n10 classes, so we constructed 5 binary classi\ufb01cation tasks for each transfer problem and the averaged\nresults are reported.\n\nPerformance comparison We evaluated gapBoost against four baseline algorithms: AdaBoostT\ntrained only on target data, AdaBoostT &S trained on both source and target data, TrAdaBoost, and\nTransferBoost. Logistic regression is used as the base learner for all methods, and the number of\nboosting iterations is set to 20. The hyper-parameters of gapBoost were set as \u03b3max = 1\u221a\nNT as per\n2.\nRemark 5, \u03c1T = 0, which corresponds to no punishment for the target data, and \u03c1S = log 1\nIn both data sets we pre-processed the data using principal component analysis (PCA) to reduce the\nthe feature dimension to 100. For each data set, we used all source data and a small amount of target\ndata (10% on 20 Newsgroups and 10 points on Of\ufb01ce-Caltech) as training sample, and used the rest of\nthe target data for testing. We repeated all experiments over 20 different random train/test splits and\nthe average results are presented in Table 2, showing that our method is capable of outperforming all\nthe baselines in the majority of cases. 
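For concreteness, one train/test split of the evaluation protocol just described (PCA to 100 dimensions fit on the pooled inputs, all source data plus a small random target subsample for training, the remaining target data for testing) could be sketched as follows. The helper name and its arguments are ours, not the authors' code, and PCA is computed here directly from an SVD rather than with any particular library:

```python
import numpy as np

def make_transfer_split(X_src, y_src, X_tgt, y_tgt, n_tgt_train,
                        n_components=100, seed=0):
    """One train/test split of the protocol above (names are illustrative).

    PCA -- computed from the SVD of the pooled, centered inputs -- reduces
    the features to `n_components`; all source instances plus a small random
    target subsample form the training set, and the remaining target
    instances form the test set.
    """
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_src, X_tgt])
    mean = X_all.mean(axis=0)
    # Top right singular vectors of the centered data = principal directions.
    _, _, Vt = np.linalg.svd(X_all - mean, full_matrices=False)
    project = lambda X: (X - mean) @ Vt[:n_components].T
    Z_src, Z_tgt = project(X_src), project(X_tgt)
    # Random target subsample for training; the rest is held out for testing.
    idx = rng.permutation(len(Z_tgt))
    tr, te = idx[:n_tgt_train], idx[n_tgt_train:]
    X_train = np.vstack([Z_src, Z_tgt[tr]])
    y_train = np.concatenate([y_src, y_tgt[tr]])
    return X_train, y_train, Z_tgt[te], y_tgt[te]
```

Repeating this over 20 random seeds and averaging test error mirrors the reporting protocol above.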
In particular, gapBoost consistently outperforms AdaBoostT, empirically indicating that it avoids negative transfer.

Learning with different numbers of target examples To further investigate the effectiveness of gapBoost, we varied the fraction of target instances of the 20 Newsgroups data set used for training, from 0.01 to 0.8. Figure 1 shows full learning curves on three example tasks, as well as the average performance over all six tasks. The results reveal that gapBoost's improvement over the baselines increases as the number of target instances grows, indicating that it is able to leverage target data more effectively than previous methods.

Figure 2: Test error rates (%) averaged across all tasks with respect to the values of the hyper-parameter ρS for varying sample sizes. The rightmost panel shows results averaged over all sample sizes. gapBoost becomes less sensitive to the choice of ρS as the target sample grows larger. In all cases, there is a range of ρS that outperforms all baselines. Error bars represent standard error.

Figure 3: Test error rates (%) with varying ρS and ρT. The valley curves correspond to ρT = 0 (i.e., the purple curves in Figure 2). Hence, regions below the curve indicate better hyper-parameters.

Parameter sensitivity Next, we empirically evaluated our algorithm's sensitivity to the choice of hyper-parameters.
We first fixed ρT = 0 and varied exp(ρS) in the range [0.1, . . . , 0.9]. Figure 2 presents the results averaged over all transfer problems on the 20 Newsgroups data set: as the size of the target sample increases, the influence of the hyper-parameter on performance decreases. In particular, there is a range of hyper-parameter values for which our method outperforms all baselines in all sample size regimes.

Increase the weight of a target instance when $h^S_k(x^T) = h^T_k(x^T)$ To further minimize the gap, we can modify the weight update rule for target data: $\beta^T = \rho_T \mathbb{1}_{h^S_k(x^T) = h^T_k(x^T)} + \alpha_k \mathbb{1}_{h_k(x^T) \neq y^T}$, with $\rho_T \geq 0$. We varied ρS and ρT together, and the results are shown in Figure 3. It can be observed that gapBoost can achieve even better performance by focusing more on performance gap minimization (i.e., choosing large ρS and ρT). As the amount of target data increases, the results become less sensitive to the hyper-parameters.

5 Conclusions

We propose the notion of the performance gap to measure the divergence between domains in transfer learning by exploiting the label information in the target domain. Consequently, we propose a new principle for transfer learning. In particular, our theoretical analysis justifies four intuitively reasonable rules for instance weighting, and provides new insight into transfer learning. We highlighted the role of performance gap minimization and presented gapBoost, an algorithm that explicitly exploits the rules for instance weighting.
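As a rough illustration of how these instance-weighting rules combine, one boosting-style reweighting step could be sketched as follows. The function, its arguments, and its exact functional form are our own simplification for exposition (an AdaBoost-style error term plus a capped disagreement penalty), not the paper's precise update:

```python
import numpy as np

def gapboost_weight_update(w, y, h_k, h_S, h_T, alpha_k,
                           rho_T, rho_S, gamma_max, is_target):
    """One reweighting step in the spirit of the four rules (illustrative).

    w: current instance weights; y: labels in {-1, +1};
    h_k: predictions of the k-th base learner; alpha_k: its coefficient;
    h_S, h_T: predictions of auxiliary learners trained on the source and
    target samples, respectively; rho_T, rho_S <= 0 penalize disagreement;
    gamma_max caps any single weight; is_target marks target instances.
    """
    # AdaBoost-style term: upweight instances the base learner misclassifies.
    log_w = np.log(w) + alpha_k * (h_k != y)
    # Rule 4: shrink the weight of any instance on which the source- and
    # target-trained learners disagree; with rho_T >= rho_S (both <= 0),
    # source instances are penalized more than target ones (Rule 3).
    rho = np.where(is_target, rho_T, rho_S)
    log_w += rho * (h_S != h_T)
    w_new = np.exp(log_w)
    # Rules 1-2: cap individual weights to keep the weight vector's
    # infinity-norm in check, then renormalize to a distribution.
    w_new = np.minimum(w_new, gamma_max)
    return w_new / w_new.sum()
```

Misclassified instances gain weight, while instances on which the source- and target-trained learners disagree lose weight, with source instances penalized more heavily.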
The empirical evaluation justifies the effectiveness of our algorithm. While the theoretical analysis is based on the convexity assumption, our principles are quite general, and so would be applicable to a wide variety of algorithms (such as deep nets) for transfer learning. In addition, the principle of performance gap minimization opens up several avenues for knowledge transfer. For example, it could be used to analyze other forms of transfer learning, such as parameter or feature transfer [37]. It could also help develop knowledge transfer strategies for other learning paradigms, such as meta-learning or lifelong learning [31]. We plan to explore these questions in future work.

Acknowledgements

The research presented in this paper was supported by the Faculty of Science at the University of Western Ontario and the Lifelong Learning Machines program from DARPA/MTO under grant #FA8750-18-2-0117. We would like to thank the anonymous reviewers for their helpful feedback.

References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[2] K. Azizzadenesheli, A. Liu, F. Yang, and A. Anandkumar. Regularized learning for domain adaptation under label shifts. In International Conference on Learning Representations, 2019.

[3] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.

[4] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira.
Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137–144, 2007.

[5] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems, pages 129–136, 2008.

[6] C. Cortes, Y. Mansour, and M. Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pages 442–450, 2010.

[7] C. Cortes, M. Mohri, and A. M. Medina. Adaptation based on generalized discrepancy. The Journal of Machine Learning Research, 20(1):1–30, 2019.

[8] C. Cortes, M. Mohri, M. Riley, and A. Rostamizadeh. Sample selection bias correction theory. In Proceedings of the International Conference on Algorithmic Learning Theory, pages 38–53. Springer, 2008.

[9] K. Crammer, M. Kearns, and J. Wortman. Learning from multiple sources. Journal of Machine Learning Research, 9:1757–1774, 2008.

[10] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu. Boosting for transfer learning. In Proceedings of the International Conference on Machine Learning, pages 193–200, 2007.

[11] S. S. Du, J. Koushik, A. Singh, and B. Póczos. Hypothesis transfer learning via transformation functions. In Advances in Neural Information Processing Systems, pages 574–584, 2017.

[12] E. Eaton and M. desJardins. Selective transfer between learning tasks using task-based boosting. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 337–342, 2011.

[13] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[14] P. Germain, A. Habrard, F. Laviolette, and E. Morvant. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers.
In International Conference on Machine Learning, pages 738–746, 2013.

[15] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pages 513–520, 2007.

[16] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.

[17] J. Huang, A. Gretton, K. M. Borgwardt, B. Schölkopf, and A. J. Smola. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, pages 601–608, 2007.

[18] I. Kuzborskij and F. Orabona. Stability and hypothesis transfer learning. In Proceedings of the International Conference on Machine Learning, pages 942–950, 2013.

[19] I. Kuzborskij and F. Orabona. Fast rates by transferring from auxiliary hypotheses. Machine Learning, 106(2):171–195, 2017.

[20] T. Liu, G. Lugosi, G. Neu, and D. Tao. Algorithmic stability and hypothesis complexity. In Proceedings of the International Conference on Machine Learning, pages 2159–2167, 2017.

[21] T. Liu, Q. Yang, and D. Tao. Understanding how feature structure transfers in transfer learning. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 2365–2371, 2017.

[22] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In Proceedings of the International Conference on Machine Learning, pages 97–105, 2015.

[23] M. Long, J. Wang, G. Ding, S. J. Pan, and S. Y. Philip. Adaptation regularization: A general framework for transfer learning. IEEE Transactions on Knowledge and Data Engineering, 26(5):1076–1089, 2014.

[24] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks.
In Proceedings of the International Conference on Machine Learning, pages 2208–2217, 2017.

[25] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Proceedings of the Conference on Learning Theory, 2009.

[26] A. Maurer, M. Pontil, and B. Romera-Paredes. Sparse coding for multitask and transfer learning. In Proceedings of the International Conference on Machine Learning, pages 343–351, 2013.

[27] M. Mohri and A. M. Medina. New analysis and algorithm for learning with drifting distributions. In International Conference on Algorithmic Learning Theory, pages 124–138. Springer, 2012.

[28] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.

[29] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

[30] D. Pardoe and P. Stone. Boosting for regression transfer. In Proceedings of the International Conference on Machine Learning, pages 863–870, 2010.

[31] P. Ruvolo and E. Eaton. ELLA: An efficient lifelong learning algorithm. In Proceedings of the International Conference on Machine Learning, pages 507–515, 2013.

[32] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[33] C. Shui, M. Abbasi, L.-É. Robitaille, B. Wang, and C. Gagné. A principled approach for learning task similarity in multitask learning. arXiv preprint arXiv:1903.09109, 2019.

[34] M. Sugiyama, T. Kanamori, T. Suzuki, M. C. d. Plessis, S. Liu, and I. Takeuchi. Density-difference estimation. Neural Computation, 25(10):2734–2775, 2013.

[35] M. Sugiyama, S. Nakajima, H. Kashima, P. V. Buenau, and M. Kawanabe.
Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems, pages 1433–1440, 2008.

[36] Q. Sun, R. Chattopadhyay, S. Panchanathan, and J. Ye. A two-stage weighting framework for multi-source domain adaptation. In Advances in Neural Information Processing Systems, pages 505–513, 2011.

[37] B. Wang and J. Pineau. Generalized dictionary for multitask learning with boosting. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 2097–2103, 2016.

[38] B. Wang, H. Zhang, P. Liu, Z. Shen, and J. Pineau. Multitask metric learning: Theory and algorithm. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 3362–3371, 2019.

[39] X. Wang and J. Schneider. Flexible transfer learning under support and model shift. In Advances in Neural Information Processing Systems, pages 1898–1906, 2014.

[40] K. Weiss, T. M. Khoshgoftaar, and D. Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.

[41] Y. Yao and G. Doretto. Boosting for transfer learning with multiple sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1855–1862, 2010.

[42] Y. Zhang. Multi-task learning and algorithmic stability. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3181–3187, 2015.