{"title": "On the Value of Target Data in Transfer Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 9871, "page_last": 9881, "abstract": "We aim to understand the value of additional labeled or unlabeled target data in transfer learning, for any given amount of source data; this is motivated by practical questions around minimizing sampling costs, whereby, target data is usually harder or costlier to acquire than source data, but can yield better accuracy. \n\nTo this aim, we establish the first minimax-rates in terms of both source and target sample sizes, and show that performance limits are captured by new notions of discrepancy between source and target, which we refer to as transfer exponents. \n\nInterestingly, we find that attaining minimax performance is akin to ignoring one of the source or target samples, provided distributional parameters were known a priori. Moreover, we show that practical decisions -- w.r.t. minimizing sampling costs -- can be made in a minimax-optimal way without knowledge or estimation of distributional parameters nor of the discrepancy between source and target.", "full_text": "On the Value of Target Data in Transfer Learning\n\nSteve Hanneke\n\nToyota Technological Institute at Chicago\n\nsteve.hanneke@gmail.com\n\nSamory Kpotufe\n\nColumbia University, Statistics\n\nskk2175@columbia.edu\n\nAbstract\n\nWe aim to understand the value of additional labeled or unlabeled target data in\ntransfer learning, for any given amount of source data; this is motivated by prac-\ntical questions around minimizing sampling costs, whereby, target data is usually\nharder or costlier to acquire than source data, but can yield better accuracy.\nTo this aim, we establish the \ufb01rst minimax-rates in terms of both source and target\nsample sizes, and show that performance limits are captured by new notions of\ndiscrepancy between source and target, which we refer to as transfer exponents.\nInterestingly, we 
find that attaining minimax performance is akin to ignoring one of the source or target samples, provided distributional parameters were known a priori. Moreover, we show that practical decisions – w.r.t. minimizing sampling costs – can be made in a minimax-optimal way without knowledge or estimation of distributional parameters nor of the discrepancy between source and target.

1 Introduction

The practice of transfer-learning often involves acquiring some amount of target data, and involves various practical decisions as to how to best combine source and target data; however much of the theoretical literature on transfer only addresses the setting where no labeled target data is available. We aim to understand the value of target labels, that is, given nP labeled samples from some source distribution P, and nQ labeled samples from a target Q, what is the best Q error achievable by any classifier in terms of both nQ and nP, and which classifiers achieve such optimal transfer. In this first analysis, we mostly restrict ourselves to a setting, similar to the traditional covariate-shift assumption, where the best classifier – from a fixed VC class H – is the same under P and Q.
We establish the first minimax-rates, for bounded-VC classes, in terms of both source and target sample sizes nP and nQ, and show that performance limits are captured by new notions of discrepancy between source and target, which we refer to as transfer exponents.
The first notion of transfer-exponent, called ρ, is defined in terms of discrepancies in excess risk, and is most refined.
Already here, our analysis reveals a surprising fact: the best possible rate (matching upper and lower-bounds) in terms of ρ and both sample sizes nP, nQ is – up to constants – achievable by an oracle which simply ignores the least informative of the source or target datasets. In other words, if ĥP and ĥQ denote the ERM on data from P, resp. from Q, one of the two achieves the optimal Q rate over any classifier having access to both P and Q datasets. However, which of ĥP or ĥQ is optimal is not easily decided without prior knowledge: for instance, cross-validating on a holdout target-sample would naively result in a rate of nQ^(-1/2), which can be far from optimal given large nP. Interestingly, we show that the optimal (nP, nQ)-rate is achieved by a generic approach, akin to so-called hypothesis-transfer [1, 2], which optimizes Q-error under the constraint of low P-error, and does so without knowledge of distributional parameters such as ρ.
We then consider a related notion of marginal transfer-exponent, called γ, defined w.r.t. the marginals PX, QX. This is motivated by the fact that practical decisions in transfer often involve the use of

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

cheaper unlabeled data (i.e., data drawn from PX, QX). We will show that, when practical decisions are driven by observed changes in marginals PX, QX, the marginal notion is then most suited to capture performance as it does not require knowledge (or observations) of the label distribution QY|X. In particular, the marginal exponent γ helps capture performance limits in the following scenarios of current practical interest:
• Minimizing sampling cost.
Given different costs of labeled source and target data, and a desired target excess error at most ε, how to use unlabeled data to decide on an optimal sampling scheme that minimizes labeling costs while achieving target error at most ε. (Section 6)
• Choice of transfer. Given two sources P1 and P2, each at some unknown distance from Q, given unlabeled data and some or no labeled data from Q, how to decide which of P1, P2 transfers best to the target Q. (Appendix A.2)
• Reweighting. Given some amount of unlabeled data from Q, and some or no labeled Q data, how to optimally re-weight (out of a fixed set of schemes) the source P data towards best target performance. While differently motivated, this problem is related to the last one. (Appendix A.1)
Although optimal decisions in the above scenarios depend tightly on unknown distributional parameters such as different label noise in source and target data, and on unknown distance from source to target (as captured by γ), we show that such practical decisions can be made, near optimally, with no knowledge of distributional parameters, and perhaps surprisingly, without ever estimating γ. Furthermore, the unlabeled sampling complexity can be shown to remain low. Finally, the procedures described in this work remain of a theoretical nature, but yield new insights into how various practical decisions in transfer can be made near-optimally in a data-driven fashion.

Related Work. Much of the theoretical literature on transfer can be subdivided into a few main lines of work. As mentioned above, the main distinction with the present work is that they mostly focus on situations with no labeled target data, and consider distinct notions of discrepancy between P and Q.
We contrast these various notions with the transfer-exponents ρ and γ in Section 3.1.
A first direction considers refinements of total-variation that quantify changes in error over classifiers in a fixed class H. The most common such measures are the so-called dA-divergence [3, 4, 5] and the Y-discrepancy [6, 7, 8]. In this line of work, the rates of transfer, largely expressed in terms of nP alone, take the form oP(1) + C · divergence(P, Q). In other words, transfer down to 0 error seems impossible whenever these divergences are non-negligible; we will carefully argue that such intuition can be overly pessimistic.
Another prominent line of work, which has led to many practical procedures, considers so-called density ratios fQ/fP (importance weights) as a way to capture the similarity between P and Q [9, 10]. A related line of work considers information-theoretic measures such as KL-divergence or Renyi divergence [11, 12] but has received relatively less attention. Similar to these notions, the transfer-exponents ρ and γ are asymmetric measures of distance, attesting to the fact that it could be easier to transfer from some P to Q than the other way around. However, a significant downside to these notions is that they do not account for the specific structure of a hypothesis class H, as is the case with the aforementioned divergences.
As a result, they can be sensitive to issues such as minor differences of support in P and Q, which may be irrelevant when learning with certain classes H.
On the algorithmic side, many approaches assign importance weights to source data from P so as to minimize some prescribed metric between P and Q [13, 14]; as we will argue, metrics, being symmetric, can be inadequate as a measure of discrepancy given the inherent asymmetry in transfer.
The importance of unlabeled data in transfer-learning, given the cost of target labels, has always been recognized, with various approaches developed over the years [15, 16], including more recent research efforts into so-called semisupervised or active transfer, where, given unlabeled target data, the goal is to request as few target labels as possible to improve classification over using source data alone [17, 18, 19, 20, 21].
More recently, [22, 23, 24] consider nonparametric transfer settings (unbounded VC) allowing for changes in conditional distributions. Also recent, but more closely related, [25] proposed a nonparametric measure of discrepancy which successfully captures the interaction between labeled source and target under nonparametric conditions and 0-1 loss; these notions however ignore the additional structure afforded by transfer in the context of a fixed hypothesis class.

2 Setup and Definitions

We consider a classification setting where the input X ∈ X, some measurable space, and the output Y ∈ {0, 1}. We let H ⊆ 2^X denote a fixed hypothesis class over X, denote dH the VC dimension [26], and the goal is to return a classifier h ∈ H with low error RQ(h) := EQ[1{h(X) ≠ Y}] under some joint distribution Q on (X, Y).
The learner has access to two independent labeled samples SP ∼ P^nP and SQ ∼ Q^nQ, i.e., drawn from the source distribution P and target Q, of respective sizes nP, nQ. Our aim is to bound the excess error, under Q, of any ĥ learned from both samples, in terms of nP, nQ, and (suitable) notions of discrepancy between P and Q. We will let PX, QX, PY|X, QY|X denote the corresponding marginal and conditional distributions under P and Q.
Definition 1. For D ∈ {Q, P}, denote ED(h) := RD(h) − inf_{h' ∈ H} RD(h'), the excess error of h.
Distributional Conditions. We consider various traditional assumptions in classification and transfer. The first one is a so-called Bernstein Class Condition on noise [27, 28, 29, 30, 31].
(NC). Let h*P := argmin_{h ∈ H} RP(h) and h*Q := argmin_{h ∈ H} RQ(h) exist. ∃ βP, βQ ∈ [0, 1], cP, cQ > 0 s.t.
PX(h ≠ h*P) ≤ cP · EP(h)^βP, and QX(h ≠ h*Q) ≤ cQ · EQ(h)^βQ.  (1)
For instance, the usual Tsybakov noise condition, say on P, corresponds to the case where h*P is the Bayes classifier, with corresponding regression function ηP(x) := E[Y | x] satisfying PX(|ηP(X) − 1/2| ≤ τ) ≤ C·τ^(βP/(1−βP)). Classification is easiest w.r.t. P (or Q) when βP (resp. βQ) is largest. We will see that this is also the case in transfer.
The next assumption is stronger, but can be viewed as a relaxed version of the usual Covariate-Shift assumption, which states that PY|X = QY|X.
(RCS). Let h*P, h*Q be as defined above. We have EQ(h*P) = EQ(h*Q) = 0. We then define h* := h*P.
Note that the above allows PY|X ≠ QY|X. However, it is not strictly weaker than Covariate-Shift, since the latter allows h*P ≠ h*Q provided the Bayes classifier is not in H. The assumption is useful as it serves to isolate the sources of hardness in transfer beyond just shifts in h*.
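As a quick illustration of the Bernstein class condition (1): in the noiseless case the excess error equals the disagreement mass with h*, so (NC) holds with β = 1 and c = 1. A minimal closed-form check for one-sided thresholds, with PX = U[0,1] and h* at 1/2, both assumed purely for illustration:

```python
# Noiseless labels: R_P(h_t) = PX(h_t != h*), so excess risk equals
# disagreement mass and (NC) holds with beta_P = 1, c_P = 1.
# Setup (assumed): one-sided thresholds h_t(x) = 1{x >= t}, PX = U[0,1],
# h* at threshold 1/2.

def disagreement(t):
    # PX(h_t != h*) = PX([min(t, 1/2), max(t, 1/2))) = |t - 1/2|
    return abs(t - 0.5)

def excess_risk(t):
    # E_P(h_t) = R_P(h_t) - R_P(h*); with no noise R_P(h*) = 0 and
    # R_P(h_t) is exactly the disagreement mass
    return abs(t - 0.5)

for i in range(101):
    t = i / 100
    # (NC):  PX(h_t != h*) <= c_P * E_P(h_t)**beta_P  with c_P = beta_P = 1
    assert disagreement(t) <= 1.0 * excess_risk(t) ** 1.0 + 1e-12
print("(NC) holds with beta_P = 1, c_P = 1")
```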
We will in fact see later that it is easily removed, but at the additive (necessary) cost of EQ(h*P).

3 Transfer-Exponents from P to Q.

We consider various notions of discrepancy between P and Q, which will be shown to tightly capture the complexity of transfer from P to Q.
Definition 2. We call ρ > 0 a transfer-exponent from P to Q, w.r.t. H, if there exists Cρ such that
∀h ∈ H, Cρ · EP(h) ≥ EQ(h)^ρ.  (2)
We are interested in the smallest such ρ with small Cρ. We generally would think of ρ as at least 1, although there are situations – which we refer to as super-transfer, to be discussed – where we have ρ < 1; in such situations, data from P can yield faster EQ rates than data from Q.
While the transfer-exponent will be seen to tightly capture the two-samples minimax rates of transfer, and can be adapted to, practical learning situations call for marginal versions that can capture the rates achievable when one has access to unlabeled Q data.
Definition 3. We call γ > 0 a marginal transfer-exponent from P to Q if ∃ Cγ such that
∀h ∈ H, Cγ · PX(h ≠ h*P) ≥ QX(h ≠ h*P)^γ.  (3)
The following simple proposition relates γ to ρ.
Proposition 1 (From γ to ρ). Suppose Assumptions (NC) and (RCS) hold, and that P has marginal transfer-exponent (γ, Cγ) w.r.t. Q. Then P has transfer-exponent ρ ≤ γ/βP, where Cρ = C^(γ/βP) for a constant C depending on Cγ and cP.
Proof. ∀h ∈ H, we have EQ(h) ≤ QX(h ≠ h*P) ≤ C · PX(h ≠ h*P)^(1/γ) ≤ C · EP(h)^(βP/γ). □

3.1 Examples and Relation to other notions of discrepancy.

In this section, we consider various examples that highlight interesting aspects of ρ and γ, and their relations to other notions of distance from P to Q considered in the literature.
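Definition 3 can be checked numerically on a threshold construction of the kind used in the examples below; the specific marginals here (QX = U[0,1] and PX with density fP(t) ∝ t^(γ−1) on [0,1], h* at 0) are illustrative assumptions. Then PX(ht ≠ h*) = t^γ while QX(ht ≠ h*) = t, so γ is a marginal transfer-exponent with Cγ = 1:

```python
# Numerical check of Definition 3 on an assumed threshold construction:
# one-sided thresholds h_t with h* at 0, QX = U[0,1], and PX with density
# f_P(t) proportional to t**(gamma - 1) on [0,1], so that
# PX(h_t != h*) = t**gamma while QX(h_t != h*) = t.

def px_mass(t, gamma):
    # PX(h_t != h*) under f_P(t) ~ t**(gamma-1): the CDF is t**gamma
    return t ** gamma

def qx_mass(t):
    # QX(h_t != h*) under QX = U[0,1]
    return t

for gamma in (1.0, 2.0, 5.0):
    # Definition 3 holds with C_gamma = 1:  C_gamma * PX >= QX**gamma
    for i in range(1, 101):
        t = i / 100
        assert px_mass(t, gamma) >= qx_mass(t) ** gamma - 1e-12
    # ...and no smaller exponent works as t -> 0 (cf. the limit in Example 3):
    if gamma > 1:
        t = 1e-6
        assert px_mass(t, gamma) < qx_mass(t) ** (gamma - 0.5)
print("Definition 3 verified with C_gamma = 1")
```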
Though our results cover noisy cases, in all these examples we assume no noise for simplicity, and therefore γ = ρ.
Example 1. (Non-overlapping supports) This first example emphasizes the fact that, unlike in much of previous analyses of transfer, the exponents γ, ρ do not require that QX and PX have overlapping support. This is a welcome property shared also by the dA and Y discrepancy.
In the example shown on the right, H is the class of homogeneous linear separators, while PX and QX are uniform on the surface of the spheres depicted (e.g., corresponding to different scalings of the data). We then have that γ = ρ = 1 with Cγ = 1, while notions such as density-ratios, KL-divergences, or the recent nonparametric notion of [25], are ill-defined or diverge to ∞.
Example 2. (Large dA, dY) Let H be the class of one-sided thresholds on the line, but now we let PX := U[0, 2] and QX := U[0, 1]. Let h* be thresholded at 1/2. We then see that for all ht thresholded at t ∈ [0, 1], PX(ht ≠ h*) = (1/2)·QX(ht ≠ h*), while for t > 1, PX(ht ≠ h*) = (1/2)(t − 1/2) ≥ 1/4. Thus, the marginal transfer exponent is γ = 1 with Cγ = 2, so we have fast transfer at the same rate 1/nP as if we were sampling from Q (Theorem 3).
On the other hand, recall that the dA-divergence takes the form dA(P, Q) := sup_{h ∈ H} |PX(h ≠ h*) − QX(h ≠ h*)|, while the Y-discrepancy takes the form dY(P, Q) := sup_{h ∈ H} |EP(h) − EQ(h)|. The two coincide whenever there is no noise in Y.
Now, taking ht as the threshold at t = 1 yields dA = dY = 1/4, which would wrongly imply that transfer is not feasible at a rate faster than 1/4; we can in fact make this situation worse, i.e., let dA = dY → 1/2, by letting h* correspond to a threshold close to 0. A first issue is that these divergences get large in large disagreement regions; this is somewhat mitigated by localization, as discussed in Example 4.
Example 3.
(Minimum γ, ρ, and the inherent asymmetry of transfer) Suppose H is the class of one-sided thresholds on the line, and h* = h*P = h*Q is a threshold at 0. The marginal QX has uniform density fQ (on an interval containing 0), while, for some γ ≥ 1, PX has density fP(t) ∝ t^(γ−1) on t > 0 (and uniform on the rest of the support of Q, not shown). Consider any ht at threshold t > 0: we have PX(ht ≠ h*) = ∫_0^t fP ∝ t^γ, while QX(ht ≠ h*) ∝ t. Notice that for any fixed ε > 0,
lim_{t > 0, t → 0} QX(ht ≠ h*)^(γ−ε) / PX(ht ≠ h*) = lim_{t > 0, t → 0} C·t^(γ−ε)/t^γ = ∞.
We therefore see that γ is the smallest possible marginal transfer-exponent (similarly, ρ = γ is the smallest possible transfer exponent). Interestingly, now consider transferring instead from Q to P: we would have γ(Q → P) = 1 ≤ γ := γ(P → Q), i.e., it could be easier to transfer from Q to P than from P to Q, which is not captured by symmetric notions of distance (dA, Wasserstein, etc.).
Finally note that the above example can be extended to more general hypothesis classes as it simply plays on how fast fP decreases w.r.t. fQ in regions of space.
Example 4. (Super-transfer and localization). We continue on the above Example 2. Now let 0 < γ < 1, and let fP(t) ∝ |t|^(γ−1) on [−1, 1] \ {0}, with QX := U[−1, 1] and h* at 0. As before, γ is a transfer-exponent from P to Q, and following from Theorem 3, we attain transfer rates of EQ ≲ nP^(−1/γ), faster than the rates of nQ^(−1) attainable with data from Q.
We call these situations super-transfer, i.e., ones where the source data gets us faster to h*; here P concentrates more mass close to h*, while more generally, such situations can also be constructed by letting PY|X be less noisy than QY|X, for instance corresponding to controlled lab data as source, vs noisy real-world data.
Now consider the following ε-localization fix to the dA = dY divergences, over h's with small P error (assuming we only observe data from P): d*Y := sup_{h ∈ H : EP(h) ≤ ε} |EP(h) − EQ(h)|. This is no longer worst-case over all h's, yet it is still not a complete fix. To see why, consider that, given nP data from P, the best P-excess risk attainable is nP^(−1), so we might set ε ∝ nP^(−1). Now the subclass {h ∈ H : EP(h) ≤ ε} corresponds to thresholds t ∈ [±nP^(−1/γ)], since EP(ht) = P([0, t]) ∝ |t|^γ.
We therefore have d*Y ∝ nP^(−1) − nP^(−1/γ), wrongly suggesting a transfer rate EQ ≲ nP^(−1), while the super-transfer rate nP^(−1/γ) is achievable as discussed above. The problem is that, even after localization, d*Y treats errors under P and Q symmetrically.

4 Lower-Bounds

Definition 4 ((NC) Class). Let F(NC)(ρ, βP, βQ, C) denote the class of pairs of distributions (P, Q) with transfer-exponent ρ, Cρ ≤ C, satisfying (NC) with parameters βP, βQ, and cP, cQ ≤ C.
The following lower-bound in terms of ρ is obtained via information-theoretic arguments. In effect, given the VC class H, we construct a set of distribution pairs {(Pi, Qi)} supported on dH datapoints, which all belong to F(NC)(ρ, βP, βQ, C).
All the distributions share the same marginals PX, QX. Any two pairs are close to each other in the sense that the product measures Πi := Pi^nP × Qi^nQ are close in KL-divergence, while however maintaining the pairs (Pi, Qi), (Pj, Qj) far in a pseudo-distance induced by QX. All the proofs from this section are in Appendix B.
Theorem 1 (ρ Lower-bound). Suppose the hypothesis class H has VC dimension dH ≥ 9. Let ĥ = ĥ(SP, SQ) denote any (possibly improper) classifier with access to two independent labeled samples SP ∼ P^nP and SQ ∼ Q^nQ. Fix any ρ ≥ 1, 0 ≤ βP, βQ < 1. Suppose either nP or nQ is sufficiently large so that
ε(nP, nQ) := min{ (dH/nP)^(1/((2−βP)·ρ)), (dH/nQ)^(1/(2−βQ)) } ≤ 1/2.
Then, for any ĥ, there exists (P, Q) ∈ F(NC)(ρ, βP, βQ, 1), and a universal constant c, such that
P_{SP,SQ}( EQ(ĥ) > c · ε(nP, nQ) ) ≥ (3 − 2√2)/8.
As per Proposition 1 we can translate any upper-bound in terms of ρ to an upper-bound in terms of γ, since ρ ≤ γ/βP. We investigate whether such upper-bounds in terms of γ are tight, i.e., given a class F(NC)(ρ, βP, βQ, C), are there distributions with ρ = γ/βP where the rate is realized.
The proof of the next result is similar to that of Theorem 1, however with the added difficulty that we need the construction to yield two forms of rates ε1(nP, nQ), ε2(nP, nQ) over the data support (again dH points). Combining these two rates matches the desired upper-bound. In effect, we follow the intuition that, to have ρ = γ/βP achieved on some subset X1 ⊂ X, we need βQ to behave as 1 locally on X1, while matching the rate requires larger βQ on the rest of the support (on X \ X1).
Theorem 2 (γ Lower-bound).
Suppose the hypothesis class H has VC dimension dH with ⌊dH/2⌋ ≥ 9. Let ĥ = ĥ(SP, SQ) denote any (possibly improper) classifier with access to two independent labeled samples SP ∼ P^nP and SQ ∼ Q^nQ. Fix any 0 < βP, βQ < 1, ρ ≥ max{1/βP, 1/βQ}. Suppose either nP or nQ is sufficiently large so that
ε1(nP, nQ) := min{ (dH/nP)^(1/((2−βP)·ρ·βQ)), (dH/nQ)^(1/(2−βQ)) } ≤ 1/2, and
ε2(nP, nQ) := min{ (dH/nP)^(1/((2−βP)·ρ)), dH/nQ } ≤ 1/2.
Then, for any ĥ, there exists (P, Q) ∈ F(NC)(ρ, βP, βQ, 2), with marginal-transfer-exponent γ = ρ·βP ≥ 1, with Cγ ≤ 2, and a universal constant c such that
E_{SP,SQ} EQ(ĥ) ≥ c · max{ ε1(nP, nQ), ε2(nP, nQ) }.
Remark 1 (Tightness with upper-bound). Write ε1(nP, nQ) = min{ε1(nP), ε1(nQ)}, and similarly ε2(nP, nQ) = min{ε2(nP), ε2(nQ)}. Define εL := max{ε1(nP, nQ), ε2(nP, nQ)} as in the above lower-bound of Theorem 2. Next, define εH := min{ε2(nP), ε1(nQ)}. It turns out that the best upper-bound we can show (as a function of γ) is in terms of εH so defined. It is therefore natural to ask whether or when εH and εL are of the same order.
Clearly, we have ε1(nP) ≤ ε2(nP) and ε1(nQ) ≥ ε2(nQ), so that εL ≤ εH (as to be expected). Now, if βQ = 1, we have ε1(nP) = ε2(nP) and ε1(nQ) = ε2(nQ), so that εL = εH. More generally, from the above inequalities, we see that εL = εH in the two regimes where either ε1(nQ) ≤ ε1(nP) (in which case εL = εH = ε1(nQ)), or ε2(nP) ≤ ε2(nQ) (in which case εL = εH = ε2(nP)).

5 Upper-Bounds

The following lemma is due to [32].
Lemma 1. Let An = (dH/n)·log(max{n, dH}/dH) + (1/n)·log(1/δ).
With probability at least 1 − δ/3, ∀h, h' ∈ H,
R(h) − R(h') ≤ R̂(h) − R̂(h') + c·√(min{P(h ≠ h'), P̂(h ≠ h')}·An) + c·An,  (4)
and
(1/2)·P(h ≠ h') − c·An ≤ P̂(h ≠ h') ≤ 2·P(h ≠ h') + c·An,  (5)
for a universal numerical constant c ∈ (0, ∞), where R̂ denotes empirical risk on n iid samples.
Now consider the following algorithm. Let SP be a sequence of nP samples from P and SQ a sequence of nQ samples from Q. Also let ĥSP = argmin_{h ∈ H} R̂SP(h) and ĥSQ = argmin_{h ∈ H} R̂SQ(h). Choose ĥ as the solution to the following optimization problem.
Algorithm 1:
Minimize R̂SP(h)
subject to R̂SQ(h) − R̂SQ(ĥSQ) ≤ c·√(P̂SQ(h ≠ ĥSQ)·AnQ) + c·AnQ,  (6)
h ∈ H.
The intuition is that, effectively, the constraint guarantees we maintain a near-optimal guarantee on EQ(ĥ) in terms of nQ and the (NC) parameters for Q, while (as we show) still allowing the algorithm to select an h with a near-minimal value of R̂SP(h). The latter guarantee plugs into the transfer condition to obtain a term converging in nP, while the former provides a term converging in nQ, and altogether the procedure achieves a rate specified by the min of these two guarantees (which is in fact nearly minimax optimal, since it matches the lower bound up to logarithmic factors).
Formally, we have the following result for this learning rule; its proof is below.
Theorem 3 (Minimax Upper-Bounds). Assume (NC). Let ĥ be the solution from Algorithm 1. For a constant C depending on ρ, Cρ, βP, cP, βQ, cQ, with probability at least 1 − δ,
EQ(ĥ) ≤ C·min{ AnP^(1/((2−βP)·ρ)), AnQ^(1/(2−βQ)) } = Õ( min{ (dH/nP)^(1/((2−βP)·ρ)), (dH/nQ)^(1/(2−βQ)) } ).
Note that, by the lower bound of Theorem 1, this bound is optimal up to log factors.
Remark 2 (Effective Source Sample Size). From the above, we might view (ignoring dH) ñP :=
From the above, we might view (ignoring dH) \u02dcnP\n=\nn(2Q)/(2P )\u21e2\nas the effective sample size contributed by P . In fact, the above minimax rate\nP\nis of order (\u02dcnP + nQ)1/(2Q), which yields added intuition into the combined effect of both\nsamples. We have that, the effective source sample size \u02dcnP is smallest for large \u21e2, but also depends\non (2 Q)/(2 P ), i.e., on whether P is noisier than Q.\nRemark 3 (Rate in terms of ). Note that, by Proposition 1, this also immediately implies a bound\nunder the marginal transfer condition and RCS, simply taking \u21e2 \uf8ff /P . Furthermore, by the lower\nbound of Theorem 2, the resulting bound in terms of is tight in certain regimes up to log factors.\n\n6\n\n\fProof of Theorem 3. In all the lines below, we let C serve as a generic constant (possibly depending\non \u21e2, C\u21e2, P , cP , Q, cQ) which may be different in different appearances. Consider the event\nof probability at least 1 /3 from Lemma 1 for the SQ samples. In particular, on this event, if\nEQ(h\u21e4P ) = 0, it holds that\n\n\u02c6RSQ(h\u21e4P ) \u02c6RSQ(\u02c6hSQ) \uf8ff cq \u02c6PSQ(h\u21e4P 6= \u02c6hSQ)AnQ + cAnQ.\n\nThis means, under the (RCS) condition, h\u21e4P satis\ufb01es the constraint in the above optimization problem\nde\ufb01ning \u02c6h. 
Also, on this same event from Lemma 1, we have
EQ(ĥSQ) ≤ c·√(QX(ĥSQ ≠ h*Q)·AnQ) + c·AnQ,
so that (NC) implies
EQ(ĥSQ) ≤ C·√(EQ(ĥSQ)^βQ·AnQ) + c·AnQ,
which implies the well-known fact from [28, 29] that
EQ(ĥSQ) ≤ C·( (dH/nQ)·log(nQ/dH) + (1/nQ)·log(1/δ) )^(1/(2−βQ)).  (7)
Furthermore, following the analogous argument for SP, it follows that for any set G ⊂ H with h*P ∈ G, with probability at least 1 − δ/3, the ERM ĥ'SP = argmin_{h ∈ G} R̂SP(h) satisfies
EP(ĥ'SP) ≤ C·( (dH/nP)·log(nP/dH) + (1/nP)·log(1/δ) )^(1/(2−βP)).  (8)
In particular, conditioned on the SQ data, we can take the set G as the set of h ∈ H satisfying the constraint in the optimization, and on the above event we have h*P ∈ G (assuming the (RCS) condition); furthermore, if EQ(h*P) = 0, then without loss we can simply define h*Q = h*P = h* (and it is easy to see that this does not affect the (NC) condition). We thereby establish the above inequality (8) for this choice of G, in which case by definition ĥ'SP = ĥ. Altogether, by the union bound, all of these events hold simultaneously with probability at least 1 − δ. In particular, on this event, if the (RCS) condition holds then
EP(ĥ) ≤ C·( (dH/nP)·log(nP/dH) + (1/nP)·log(1/δ) )^(1/(2−βP)).
Applying the definition of ρ, this has the further implication that (again if (RCS) holds)
EQ(ĥ) ≤ C·( (dH/nP)·log(nP/dH) + (1/nP)·log(1/δ) )^(1/((2−βP)·ρ)).
Also note that, if ρ = ∞ this inequality trivially holds, whereas if ρ < ∞ then (RCS) necessarily holds, so that the above implication is generally valid, without needing the (RCS) assumption explicitly.
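The displays above repeatedly use the same fixed-point step: if x satisfies x ≤ c·√(x^β·A) + c·A with β ∈ [0, 1], then x ≤ C·A^(1/(2−β)) for a constant C depending only on c. A quick numerical check of this implication; the constants c = 1 and C = 4 below are assumed purely for illustration:

```python
import math

# Fixed-point step: if x <= c*sqrt(x**beta * A) + c*A with beta in [0,1],
# then x <= C * A**(1/(2-beta)) for a constant C depending only on c.
# Here c = 1 and C = 4 are assumed for illustration.

def largest_fixed_point(A, beta, c=1.0):
    # largest x in [0,1] satisfying x <= c*sqrt(x**beta * A) + c*A;
    # the feasible set is an interval [0, x*], so bisection applies
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid <= c * math.sqrt(mid ** beta * A) + c * A:
            lo = mid
        else:
            hi = mid
    return lo

for beta in (0.0, 0.5, 1.0):
    for A in (1e-2, 1e-3, 1e-4):
        x_star = largest_fixed_point(A, beta)
        # the claimed consequence, with C = 4:
        assert x_star <= 4 * A ** (1 / (2 - beta)), (beta, A, x_star)
print("fixed-point bound x* <= 4 * A**(1/(2-beta)) verified")
```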
Moreover, again when the above events hold, using the event from Lemma 1 again, along with the constraint from the optimization, we have that
RQ(ĥ) − RQ(ĥSQ) ≤ 2c·√(P̂SQ(ĥ ≠ ĥSQ)·AnQ) + 2c·AnQ,
and (5) implies the right hand side is at most
C·√(QX(ĥ ≠ ĥSQ)·AnQ) + C·AnQ ≤ C·√(QX(ĥ ≠ h*Q)·AnQ) + C·√(QX(ĥSQ ≠ h*Q)·AnQ) + C·AnQ.
Using the Bernstein class condition and (7), the second term is bounded by
C·( (dH/nQ)·log(nQ/dH) + (1/nQ)·log(1/δ) )^(1/(2−βQ)),
while the first term is bounded by
C·√(EQ(ĥ)^βQ·AnQ).
Altogether, we have that
EQ(ĥ) = RQ(ĥ) − RQ(ĥSQ) + EQ(ĥSQ) ≤ C·√(EQ(ĥ)^βQ·AnQ) + C·( (dH/nQ)·log(nQ/dH) + (1/nQ)·log(1/δ) )^(1/(2−βQ)),
which implies
EQ(ĥ) ≤ C·( (dH/nQ)·log(nQ/dH) + (1/nQ)·log(1/δ) )^(1/(2−βQ)). □
Remark 4. Note that the above Theorem 3 does not require (RCS): that is, it holds even when EQ(h*P) > 0, in which case ρ = ∞. However, for a related method we can also show a stronger result in terms of a modified definition of ρ:
Specifically, define EQ(h, h*P) = max{RQ(h) − RQ(h*P), 0}, and suppose ρ' > 0, Cρ' > 0 satisfy
∀h ∈ H, Cρ' · EP(h) ≥ EQ(h, h*P)^ρ'.
This is clearly equivalent to ρ (Definition 2) under (RCS); however, unlike ρ, this ρ' can be finite even in cases where (RCS) fails. With this definition, we have the following result.
Proposition 2 (Beyond (RCS)). If R̂SQ(ĥSP) − R̂SQ(ĥSQ) ≤ c·√(P̂SQ(ĥSP ≠ ĥSQ)·AnQ) + c·AnQ, that is, if ĥSP satisfies (6), define ĥ = ĥSP, and otherwise define ĥ = ĥSQ. Assume (NC).
For a constant C depending on ρ', Cρ', βP, cP, βQ, cQ, with probability at least 1 − δ,
EQ(ĥ) ≤ min{ EQ(h*P) + C·AnP^(1/((2−βP)·ρ')), C·AnQ^(1/(2−βQ)) }.
The proof of this result is similar to that of Theorem 3, and as such is deferred to Appendix C.

An alternative procedure. Similar results as in Theorem 3 can be obtained for a method that swaps the roles of the P and Q samples:
Algorithm 1':
Minimize R̂SQ(h)
subject to R̂SP(h) − R̂SP(ĥSP) ≤ c·√(P̂SP(h ≠ ĥSP)·AnP) + c·AnP,
h ∈ H.
This version, more akin to so-called hypothesis transfer, may have practical benefits in scenarios where the P data is accessible before the Q data, since then the feasible set might be calculated (or approximated) in advance, so that the P data itself would no longer be needed in order to execute the procedure. However this procedure presumes that h*P is not far from h*Q, i.e., that the data SP from P is not misleading, since it conditions on doing well on SP. Hence we now require (RCS).
Proposition 3. Assume (NC) and (RCS). Let ĥ be the solution from Algorithm 1'. For a constant C depending on ρ, Cρ, βP, cP, βQ, cQ, with probability at least 1 − δ,
EQ(ĥ) ≤ C·min{ AnP^(1/((2−βP)·ρ)), AnQ^(1/(2−βQ)) } = Õ( min{ (dH/nP)^(1/((2−βP)·ρ)), (dH/nQ)^(1/(2−βQ)) } ).
The proof is very similar to that of Theorem 3, so is omitted for brevity.

6 Minimizing Sampling Cost

In this section (and continued in Appendix A.1), we discuss the value of having access to unlabeled data from Q. The idea is that unlabeled data can be obtained much more cheaply than labeled data, so gaining access to unlabeled data can be realistic in many applications.
Specifically, we begin by discussing an adaptive sampling scenario, where we are able to draw samples from P or Q, at different costs, and we are interested in optimizing the total cost of obtaining a given excess Q-risk.

Formally, consider the scenario where we have as input a value $\epsilon$, and are tasked with producing a classifier $\hat{h}$ with $\mathcal{E}_Q(\hat{h}) \le \epsilon$. We are then allowed to draw samples from either P or Q toward achieving this goal, but at different costs. Suppose $c_P : \mathbb{N} \to [0, \infty)$ and $c_Q : \mathbb{N} \to [0, \infty)$ are cost functions, where $c_P(n)$ indicates the cost of sampling a batch of size $n$ from P, and similarly define $c_Q(n)$. We suppose these functions are increasing, concave, and unbounded.

Definition 5. Define $n^*_Q = d_{\mathcal{H}}/\epsilon^{2-\beta_Q}$, $n^*_P = d_{\mathcal{H}}/\epsilon^{(2-\beta_P)\rho}$, and $c^* = \min\{c_Q(n^*_Q),\, c_P(n^*_P)\}$.

We call $c^* = c^*(\epsilon; c_P, c_Q)$ the minimax optimal cost of sampling from P or Q to attain Q-error $\epsilon$.

Note that the cost $c^*$ is effectively the smallest possible, up to log factors, in the range of parameters given in Theorem 2. That is, in order to make the lower bound in Theorem 2 less than $\epsilon$, either $n_Q = \tilde{\Omega}(n^*_Q)$ samples are needed from Q or $n_P = \tilde{\Omega}(n^*_P)$ samples are needed from P. We show that $c^*$ is nearly achievable, adaptively, with no knowledge of distributional parameters.

Procedure. We assume access to a large unlabeled data set $U_Q$ sampled from $Q_X$. For our purposes, we will suppose this data set has size at least $\Theta\left(\frac{d_{\mathcal{H}}}{\epsilon}\log\frac{1}{\epsilon} + \frac{1}{\epsilon}\log\frac{1}{\delta}\right)$. Let $A'_n = \frac{d_{\mathcal{H}}}{n}\log\left(\frac{\max\{n, d_{\mathcal{H}}\}}{d_{\mathcal{H}}}\right) + \frac{1}{n}\log\left(\frac{2n^2}{\delta}\right)$. Then for any labeled data set $S$, define $\hat{h}_S = \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}_S(h)$, and given an additional data set $U$ (labeled or unlabeled) define a quantity
$$\hat{\Delta}(S, U) = \sup\left\{\hat{P}_U(h \ne \hat{h}_S) \,:\, h \in \mathcal{H},\; \hat{R}_S(h) - \hat{R}_S(\hat{h}_S) \le c\sqrt{\hat{P}_S(h \ne \hat{h}_S)\, A'_{|S|}} + c A'_{|S|}\right\},$$
where $c$ is as in Lemma 1. Now we have the following procedure.

Algorithm 2:
0. $S_P \leftarrow \{\}$, $S_Q \leftarrow \{\}$
1. For $t = 1, 2, \ldots$
2.   Let $n_{t,P}$ be minimal such that $c_P(n_{t,P}) \ge 2^{t-1}$
3.   Sample $n_{t,P}$ samples from P and add them to $S_P$
4.   Let $n_{t,Q}$ be minimal such that $c_Q(n_{t,Q}) \ge 2^{t-1}$
5.   Sample $n_{t,Q}$ samples from Q and add them to $S_Q$
6.   If $c\sqrt{\hat{\Delta}(S_Q, S_Q)\, A'_{|S_Q|}} + c A'_{|S_Q|} \le \epsilon$, return $\hat{h}_{S_Q}$
7.   If $\hat{\Delta}(S_P, U_Q) \le \epsilon/4$, return $\hat{h}_{S_P}$

The following theorem asserts that this procedure will find a classifier $\hat{h}$ with $\mathcal{E}_Q(\hat{h}) \le \epsilon$ while adaptively using a near-minimal cost associated with achieving this. The proof is in Appendix D.

Theorem 4 (Adapting to Sampling Costs). Assume (NC) and (RCS). There exists a constant $c_0$, depending on the parameters $(C_\rho, \rho, c_Q, \beta_Q, c_P, \beta_P)$ but not on $\epsilon$ or $\delta$, such that the following holds. Define sample sizes $\tilde{n}_Q = \frac{c_0}{\epsilon^{2-\beta_Q}}\left(d_{\mathcal{H}}\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\right)$ and $\tilde{n}_P = \frac{c_0}{\epsilon^{(2-\beta_P)\rho}}\left(d_{\mathcal{H}}\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\right)$. Algorithm 2 outputs a classifier $\hat{h}$ such that, with probability at least $1 - \delta$, we have $\mathcal{E}_Q(\hat{h}) \le \epsilon$, and the total sampling cost incurred is at most $\min\{c_Q(\tilde{n}_Q),\, c_P(\tilde{n}_P)\} = \tilde{O}(c^*)$.

Thus, when $c^*$ favors sampling from P, we end up sampling very few labeled Q data. These are scenarios where P samples are cheap relative to the cost of Q samples and w.r.t. the parameters $(\beta_Q, \beta_P, \rho)$, which determine the effective source sample size contributed for every target sample.
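The cost-doubling loop in Algorithm 2 is what keeps the total cost comparable to the cost spent in the final round. The skeleton below is our rendering in Python, not the authors' code: the two stopping tests of steps 6 and 7 (which require computing $\hat{\Delta}$) are abstracted behind hypothetical callbacks `check_Q` and `check_P` that return a classifier when their empirical test fires and `None` otherwise.

```python
def minimal_n(cost, budget):
    """Minimal n with cost(n) >= budget, for an increasing cost function,
    found by doubling followed by bisection."""
    n = 1
    while cost(n) < budget:
        n *= 2
    lo, hi = n // 2, n
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if cost(mid) >= budget:
            hi = mid
        else:
            lo = mid
    return hi

def adaptive_sampling(sample_P, sample_Q, c_P, c_Q, check_Q, check_P, max_rounds=40):
    """Skeleton of Algorithm 2 (the Delta-hat tests are stubbed out).

    sample_P(n), sample_Q(n)  : draw n fresh labeled samples from P or Q
    c_P(n), c_Q(n)            : increasing cost of a size-n batch
    check_Q(S_Q), check_P(S_P): stand-ins for steps 6 and 7; return a
        classifier when the corresponding stopping test fires, else None
    """
    S_P, S_Q, total_cost = [], [], 0.0
    for t in range(1, max_rounds + 1):
        budget = 2 ** (t - 1)             # per-source budget doubles each round
        n_tP = minimal_n(c_P, budget)
        S_P += sample_P(n_tP); total_cost += c_P(n_tP)
        n_tQ = minimal_n(c_Q, budget)
        S_Q += sample_Q(n_tQ); total_cost += c_Q(n_tQ)
        h = check_Q(S_Q)                  # step 6: Q-only certificate
        if h is None:
            h = check_P(S_P)              # step 7: unlabeled-data certificate
        if h is not None:
            return h, total_cost
    raise RuntimeError("no stopping test fired within max_rounds")
```

Because the per-round budgets double, the cost accumulated over all rounds is, roughly, at most a constant times the budget of the stopping round; this geometric bookkeeping is what underlies the $\tilde{O}(c^*)$ guarantee of Theorem 4.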
Furthermore, we achieve this adaptively: without knowing (or even estimating) these relevant parameters.

Acknowledgments

We thank Mehryar Mohri for several very important discussions which helped crystallize many essential questions and directions on this topic.