{"title": "Generalization Bounds for Domain Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 3320, "page_last": 3328, "abstract": "In this paper, we provide a new framework to study the generalization bound of the learning process for domain adaptation. Without loss of generality, we consider two kinds of representative domain adaptation settings: one is domain adaptation with multiple sources and the other is domain adaptation combining source and target data. In particular, we introduce two quantities that capture the inherent characteristics of domains. For either kind of domain adaptation, based on the two quantities, we then develop the specific Hoeffding-type deviation inequality and symmetrization inequality to achieve the corresponding generalization bound based on the uniform entropy number. By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for such kind of domain adaptation. Meanwhile, we discuss the factors that affect the asymptotic behavior of the learning process. The numerical experiments support our results.", "full_text": "Generalization Bounds for Domain Adaptation\n\nChao Zhang1, Lei Zhang2,\n\nJieping Ye1,3\n\n1Center for Evolutionary Medicine and Informatics, The Biodesign Institute,\n\nand 3Computer Science and Engineering, Arizona State University, Tempe, USA\n\n{czhan117,jieping.ye}@asu.edu\n2School of Computer Science and Technology,\n\nNanjing University of Science and Technology, Nanjing, P.R. China\n\nzhanglei.njust@yahoo.com.cn\n\nAbstract\n\nIn this paper, we provide a new framework to study the generalization bound of\nthe learning process for domain adaptation. We consider two kinds of representa-\ntive domain adaptation settings: one is domain adaptation with multiple sources\nand the other is domain adaptation combining source and target data. 
In particular, we use the integral probability metric to measure the difference between two domains. Then, we develop the specific Hoeffding-type deviation inequality and symmetrization inequality for either kind of domain adaptation to achieve the corresponding generalization bound based on the uniform entropy number. By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for domain adaptation. Meanwhile, we discuss the factors that affect the asymptotic behavior of the learning process. The numerical experiments support our results.

1 Introduction

In statistical learning theory, one of the major concerns is to obtain the generalization bound of a learning process, which measures the probability that a function, chosen from a function class by an algorithm, has a sufficiently small error (cf. [1, 2]). Generalization bounds have been widely used to study the consistency of the learning process [3], the asymptotic convergence of empirical processes [4] and the learnability of learning models [5]. Generally, there are three essential aspects in obtaining generalization bounds for a specific learning process: complexity measures of function classes, deviation (or concentration) inequalities, and symmetrization inequalities related to the learning process (cf. [3, 4, 6, 7]).

It is noteworthy that the aforementioned results of statistical learning theory are all built under the assumption that training and test data are drawn from the same distribution (briefly called the assumption of same distribution). This assumption may not be valid in many practical applications, such as speech recognition [8] and natural language processing [9], in which training and test data may have different distributions.
Domain adaptation has recently been proposed to handle this situation; it aims to apply a learning model, trained using samples drawn from one domain (the source domain), to samples drawn from another domain (the target domain) with a different distribution (cf. [10, 11, 12, 13]).

This paper is mainly concerned with two variants of domain adaptation. In the first variant, the learner receives training data from several source domains, known as domain adaptation with multiple sources (cf. [14, 15, 16, 17]). In the second variant, the learner minimizes a convex combination of empirical source and target risk, termed domain adaptation combining source and target data (cf. [13, 18])¹.

1.1 Overview of Main Results

In this paper, we present a new framework to study generalization bounds of the learning processes for domain adaptation with multiple sources and domain adaptation combining source and target data, respectively. Based on the resultant bounds, we then study the asymptotic behavior of the learning process for each kind of domain adaptation. There are three major aspects in the framework: the quantity that measures the difference between two domains, and the deviation inequalities and symmetrization inequalities that are both designed for the situation of domain adaptation².

Generally, in order to obtain the generalization bounds of a learning process, it is necessary to obtain the corresponding deviation (or concentration) inequalities. For either kind of domain adaptation, we use a martingale method to develop the related Hoeffding-type deviation inequality. Moreover, in the situation of domain adaptation, since the source domain differs from the target domain, the desired symmetrization inequality should incorporate a quantity that reflects the difference.
We then obtain the related symmetrization inequality, which incorporates the integral probability metric measuring the difference between the distributions of the source and target domains.

Next, we present the generalization bounds based on the uniform entropy number for both kinds of domain adaptation. Finally, based on the resultant bounds, we give a rigorous theoretical analysis of the asymptotic convergence and the rate of convergence of the learning processes for both types of domain adaptation. Meanwhile, we give a comparison with the related results under the assumption of same distribution. We also present numerical experiments to support our results.

1.2 Organization of the Paper

The rest of this paper is organized as follows. Section 2 introduces the problems studied in this paper. Section 3 introduces the integral probability metric that measures the difference between the distributions of two domains. We introduce the uniform entropy number for the situation of multiple sources in Section 4. In Section 5, we present the generalization bounds for domain adaptation with multiple sources, and then analyze the asymptotic behavior of the learning process for this type of domain adaptation. The last section concludes the paper. In the supplement (part A), we discuss the relationship between the integral probability metric D_F(S,T) and other quantities proposed in existing works, including the H-divergence and the discrepancy distance. Proofs of the main results of this paper are provided in the supplement (part B).
We study domain adaptation combining source and target data in the supplement (part C), and then give a comparison with the existing works on the theoretical analysis of domain adaptation in the supplement (part D).

2 Problem Setup

We denote Z^(S_k) := X^(S_k) × Y^(S_k) ⊂ R^I × R^J (1 ≤ k ≤ K) and Z^(T) := X^(T) × Y^(T) ⊂ R^I × R^J as the k-th source domain and the target domain, respectively. Set L = I + J. Let D^(S_k) and D^(T) stand for the distributions of the input spaces X^(S_k) (1 ≤ k ≤ K) and X^(T), respectively. Denote g_*^(S_k) : X^(S_k) → Y^(S_k) and g_*^(T) : X^(T) → Y^(T) as the labeling functions of Z^(S_k) (1 ≤ k ≤ K) and Z^(T), respectively. In the situation of domain adaptation with multiple sources, the distributions D^(S_k) (1 ≤ k ≤ K) and D^(T) differ from each other, or g_*^(S_k) (1 ≤ k ≤ K) and g_*^(T) differ from each other, or both. There are sufficient amounts of i.i.d. samples Z_1^{N_k} = {z_n^(k)}_{n=1}^{N_k} drawn from each source domain Z^(S_k) (1 ≤ k ≤ K), but little or no labeled samples drawn from the target domain Z^(T).

Given w = (w_1, ···, w_K) ∈ [0,1]^K with Σ_{k=1}^K w_k = 1, let g_w ∈ G be the function that minimizes the empirical risk

E_w^(S)(ℓ∘g) = Σ_{k=1}^K w_k E_{N_k}^(S_k)(ℓ∘g) = Σ_{k=1}^K (w_k/N_k) Σ_{n=1}^{N_k} ℓ(g(x_n^(k)), y_n^(k))   (1)

over G with respect to the sample sets {Z_1^{N_k}}_{k=1}^K, and it is expected that g_w will perform well on the target expected risk:

E^(T)(ℓ∘g) := ∫ ℓ(g(x^(T)), y^(T)) dP(z^(T)),  g ∈ G,   (2)

i.e., that g_w approximates the labeling function g_*^(T) as precisely as possible.

In the learning process of domain adaptation with multiple sources, we are mainly interested in the following two types of quantities:

• E^(T)(ℓ∘g_w) − E_w^(S)(ℓ∘g_w), which corresponds to the estimation of the expected risk in the target domain Z^(T) from a weighted combination of the empirical risks in the multiple sources {Z^(S_k)}_{k=1}^K;

• E^(T)(ℓ∘g_w) − E^(T)(ℓ∘g̃_*), which corresponds to the performance of the algorithm for domain adaptation with multiple sources, where g̃_* ∈ G is the function that minimizes the expected risk E^(T)(ℓ∘g) over G.

Recalling (1) and (2), since g_w minimizes the empirical risk (1), we have E_w^(S)(ℓ∘g̃_*) − E_w^(S)(ℓ∘g_w) ≥ 0, and therefore

E^(T)(ℓ∘g_w) = E^(T)(ℓ∘g_w) − E^(T)(ℓ∘g̃_*) + E^(T)(ℓ∘g̃_*)
             ≤ E_w^(S)(ℓ∘g̃_*) − E_w^(S)(ℓ∘g_w) + E^(T)(ℓ∘g_w) − E^(T)(ℓ∘g̃_*) + E^(T)(ℓ∘g̃_*)
             ≤ 2 sup_{g∈G} |E^(T)(ℓ∘g) − E_w^(S)(ℓ∘g)| + E^(T)(ℓ∘g̃_*),

and thus

0 ≤ E^(T)(ℓ∘g_w) − E^(T)(ℓ∘g̃_*) ≤ 2 sup_{g∈G} |E^(T)(ℓ∘g) − E_w^(S)(ℓ∘g)|.   (3)

This shows that the asymptotic behaviors of the aforementioned two quantities, as the sample numbers N_1, ···, N_K go to infinity, can both be described by the supremum

sup_{g∈G} |E^(T)(ℓ∘g) − E_w^(S)(ℓ∘g)|,   (4)

which is the so-called generalization bound of the learning process for domain adaptation with multiple sources.

For convenience, we define the loss function class

F := {z ↦ ℓ(g(x), y) : g ∈ G},   (5)

and call F the function class in the rest of this paper. By (1) and (2), given sample sets {Z_1^{N_k}}_{k=1}^K drawn from the multiple sources {Z^(S_k)}_{k=1}^K respectively, we briefly denote for any f ∈ F,

E^(T)f := ∫ f(z^(T)) dP(z^(T)) ;  E_w^(S)f := Σ_{k=1}^K (w_k/N_k) Σ_{n=1}^{N_k} f(z_n^(k)).   (6)

Thus, we can equivalently rewrite the generalization bound (4) for domain adaptation with multiple sources as

sup_{f∈F} |E^(T)f − E_w^(S)f|.   (7)

¹ Due to the page limitation, the discussion on domain adaptation combining source and target data is provided in the supplement (part C).
² Due to the page limitation, we only present the generalization bounds for domain adaptation with multiple sources; the discussions of the corresponding deviation inequalities and symmetrization inequalities are provided in the supplement (part B), along with the proofs of the main results.

3 Integral Probability Metric

As shown in some prior works (e.g. [13, 16, 17, 18, 19, 20]), one of the major challenges in the theoretical analysis of domain adaptation is how to measure the distance between the source domain Z^(S) and the target domain Z^(T). Recall that, if Z^(S) differs from Z^(T), there are three possibilities: D^(S) differs from D^(T), or g_*^(S) differs from g_*^(T), or both. Therefore, it is necessary to consider two kinds of distances: the distance between D^(S) and D^(T), and the distance between g_*^(S) and g_*^(T).

In [13, 18], the H-divergence was introduced to derive generalization bounds based on the VC dimension under the condition of "λ-close". Mansour et al. [20] obtained generalization bounds based on the Rademacher complexity by using the discrepancy distance. Both quantities aim to measure the difference between the two distributions D^(S) and D^(T). Moreover, Mansour et al.
[17] used the Rényi divergence to measure the distance between two distributions. In this paper, we use the following quantity to measure the difference between the distributions of the source and target domains:

Definition 3.1 Given two domains Z^(S), Z^(T) ⊂ R^L, let z^(S) and z^(T) be the random variables taking values in Z^(S) and Z^(T), respectively. Let F ⊂ R^Z be a function class. We define

D_F(S,T) := sup_{f∈F} |E^(S)f − E^(T)f|,   (8)

where the expectations E^(S) and E^(T) are taken with respect to the distributions of Z^(S) and Z^(T), respectively.

The quantity D_F(S,T) is termed the integral probability metric, which plays an important role in probability theory for measuring the difference between two probability distributions (cf. [23, 24, 25, 26]). Recently, Sriperumbudur et al. [27] gave a further investigation and proposed empirical methods to compute the integral probability metric in practice. As mentioned by Müller [page 432, 25], the quantity D_F(S,T) is a semimetric, and it is a metric if and only if the function class F separates the set of all signed measures with µ(Z) = 0.
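As a minimal numerical illustration of Definition 3.1 (in the spirit of the empirical methods of [27]), the following sketch estimates D_F(S,T) by replacing both expectations with sample means and taking the supremum over a small, finite surrogate function class. The Gaussian samples and the tanh-threshold class are assumptions made purely for illustration; they are not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two one-dimensional "domains" with different distributions (illustrative only).
z_s = rng.normal(loc=0.0, scale=1.0, size=5000)   # source samples
z_t = rng.normal(loc=0.5, scale=1.0, size=5000)   # target samples

# A finite surrogate for the function class F: bounded soft-threshold functions
# indexed by a shift parameter (a hypothetical choice; the paper's F is the
# loss-function class induced by G).
shifts = np.linspace(-2.0, 2.0, 81)
funcs = [lambda z, c=c: np.tanh(z - c) for c in shifts]

# Empirical integral probability metric: sup over F of |E^(S)f - E^(T)f|,
# with the expectations replaced by sample means.
d_F = max(abs(f(z_s).mean() - f(z_t).mean()) for f in funcs)
print(f"empirical D_F(S, T) over the surrogate class: {d_F:.3f}")
```

Because the two distributions differ only by a mean shift of 0.5 and the surrogate functions are bounded in [−1, 1], the estimate is a small positive number; it would be (near) zero if the two sample sets came from the same distribution.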
Namely, according to Definition 3.1, given a non-trivial function class F, the quantity D_F(S,T) is equal to zero if the domains Z^(S) and Z^(T) have the same distribution.

In the supplement (part A), we discuss the relationship between the quantity D_F(S,T) and other quantities proposed in previous works, and then show that D_F(S,T) can be bounded by the sum of the discrepancy distance and another quantity, which measure the difference between the input-space distributions D^(S) and D^(T) and the difference between the labeling functions g_*^(S) and g_*^(T), respectively.

4 The Uniform Entropy Number

Generally, the generalization bound of a certain learning process is achieved by incorporating a complexity measure of the function class, e.g., the covering number, the VC dimension or the Rademacher complexity. The results of this paper are based on the uniform entropy number, which is derived from the concept of the covering number; we refer to [22] for more details.

The covering number of a function class F is defined as follows:

Definition 4.1 Let F be a function class and let d be a metric on F. For any ξ > 0, the covering number of F at radius ξ with respect to the metric d, denoted by N(F, ξ, d), is the minimum size of a cover of F of radius ξ.

In some classical results of statistical learning theory, the covering number is applied by letting d be a distribution-dependent metric. For example, as shown in Theorem 2.3 of [22], one can set d as the norm ℓ_1(Z_1^N) and then derive the generalization bound of the i.i.d. learning process by incorporating the expectation of the covering number, i.e., E N(F, ξ, ℓ_1(Z_1^N)).
However, in the situation of domain adaptation, we only know the information of the source domains, while the expectation E N(F, ξ, ℓ_1(Z_1^N)) depends on the distributions of both the source and target domains, because z = (x, y). Therefore, the covering number is no longer applicable in our framework for obtaining generalization bounds for domain adaptation. By contrast, the uniform entropy number is distribution-free, and thus we choose it as the complexity measure of the function class to derive the generalization bounds for domain adaptation.

For clarity of presentation, we introduce a useful notation for the following discussion. For any 1 ≤ k ≤ K, given a sample set Z_1^{N_k} = {z_n^(k)}_{n=1}^{N_k} drawn from Z^(S_k), we denote Z'_1^{N_k} := {z'_n^(k)}_{n=1}^{N_k} as the ghost-sample set drawn from Z^(S_k), such that the ghost sample z'_n^(k) has the same distribution as z_n^(k) for any 1 ≤ n ≤ N_k and any 1 ≤ k ≤ K. Denote Z_1^{2N_k} := {Z_1^{N_k}, Z'_1^{N_k}}. Moreover, given any f ∈ F and any w = (w_1, ···, w_K) ∈ [0,1]^K with Σ_{k=1}^K w_k = 1, we introduce a variant of the ℓ_1 norm:

‖f‖_{ℓ_1^w({Z_1^{2N_k}}_{k=1}^K)} := Σ_{k=1}^K (w_k/N_k) Σ_{n=1}^{N_k} ( |f(z_n^(k))| + |f(z'_n^(k))| ).   (9)

It is noteworthy that the variant ℓ_1^w of the ℓ_1 norm is still a norm on the functional space, which can easily be verified by using the definition of a norm, so we omit the details here.
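The definition (9) can be made concrete in a few lines. The sketch below computes the ℓ_1^w quantity for synthetic samples and ghost samples (the data and the choices of w and N_k are assumptions for illustration), and numerically checks two of the norm axioms on this sample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Per-source sample sizes, weights summing to 1, and (Z, Z') sample/ghost pairs.
# All data here are synthetic placeholders; only the formula is from the text.
N = [30, 50]
w = [0.4, 0.6]
samples = [(rng.normal(size=n), rng.normal(size=n)) for n in N]

def l1w_norm(f):
    """Variant l_1^w norm: sum_k (w_k / N_k) sum_n (|f(z_n)| + |f(z'_n)|)."""
    return sum(
        wk / nk * (np.abs(f(z)).sum() + np.abs(f(zg)).sum())
        for wk, nk, (z, zg) in zip(w, N, samples)
    )

f, g = np.sin, np.cos
homog = l1w_norm(lambda z: 3.0 * f(z))   # absolute homogeneity: ||3f|| = 3||f||
tri = l1w_norm(lambda z: f(z) + g(z))    # triangle inequality: ||f+g|| <= ||f||+||g||
assert np.isclose(homog, 3.0 * l1w_norm(f))
assert tri <= l1w_norm(f) + l1w_norm(g) + 1e-12
print("norm axioms hold on this sample")
```

These checks only confirm the axioms on one finite sample, matching the remark above that the full verification is routine.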
In the situation of domain adaptation with multiple sources, setting the metric d as ℓ_1^w({Z_1^{2N_k}}_{k=1}^K), we then define the uniform entropy number of F with respect to the metric ℓ_1^w({Z_1^{2N_k}}_{k=1}^K) as

ln N_1^w(F, ξ, 2Σ_{k=1}^K N_k) := sup_{{Z_1^{2N_k}}_{k=1}^K} ln N(F, ξ, ℓ_1^w({Z_1^{2N_k}}_{k=1}^K)).   (10)

5 Domain Adaptation with Multiple Sources

In this section, we present the generalization bound for domain adaptation with multiple sources. Based on the resultant bound, we then analyze the asymptotic convergence and the rate of convergence of the learning process for this kind of domain adaptation.

5.1 Generalization Bounds for Domain Adaptation with Multiple Sources

Based on the aforementioned uniform entropy number, the generalization bound for domain adaptation with multiple sources is presented in the following theorem:

Theorem 5.1 Assume that F is a function class consisting of bounded functions with range [a, b]. Let w = (w_1, ···, w_K) ∈ [0,1]^K with Σ_{k=1}^K w_k = 1. Then, given an arbitrary ξ > D_F^(w)(S,T), we have for any

(Π_{k=1}^K N_k) / (Σ_{k=1}^K w_k² Π_{i≠k} N_i) ≥ 8(b−a)²/(ξ')²

and any ε > 0, with probability at least 1 − ε,

sup_{f∈F} |E_w^(S)f − E^(T)f| ≤ D_F^(w)(S,T) + ( [ln N_1^w(F, ξ'/8, 2Σ_{k=1}^K N_k) − ln(ε/8)] / [ (Π_{k=1}^K N_k) / (32(b−a)² Σ_{k=1}^K w_k² Π_{i≠k} N_i) ] )^{1/2},   (11)

where ξ' = ξ − D_F^(w)(S,T) and

D_F^(w)(S,T) := Σ_{k=1}^K w_k D_F(S_k, T).   (12)

Note that (Π_{k=1}^K N_k) / (Σ_{k=1}^K w_k² Π_{i≠k} N_i) = (Σ_{k=1}^K w_k²/N_k)^{−1}, so this quantity plays the role that the sample size N plays in the i.i.d. setting.

In the above theorem, we show that the generalization bound sup_{f∈F} |E^(T)f − E_w^(S)f| can be bounded by the right-hand side of (11). Compared to the classical result under the assumption of same distribution (cf. Theorem 2.3 and Definition 2.5 of [22]): with probability at least 1 − ε,

sup_{f∈F} |E_N f − E f| ≤ O( ( [ln N_1(F, ξ, N) − ln(ε/8)] / N )^{1/2} ),   (13)

with E_N f being the empirical risk with respect to the sample set Z_1^N, there is an extra discrepancy term D_F^(w)(S,T), determined by two factors: the choice of w and the quantities D_F(S_k, T) (1 ≤ k ≤ K). The two results coincide if every source domain matches the target domain, i.e., if D_F(S_k, T) = 0 holds for all 1 ≤ k ≤ K.

In order to prove this result, we develop the related Hoeffding-type deviation inequality and the symmetrization inequality for domain adaptation with multiple sources, respectively. The detailed proof is provided in the supplement (part B).
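To make the structure of the bound concrete, the sketch below evaluates the second (complexity) term of (11) for assumed inputs: the uniform-entropy value, the range [a, b], the weights and the sample sizes are all hypothetical placeholders, and the code uses the identity Σ_k w_k² Π_{i≠k} N_i / Π_k N_k = Σ_k w_k²/N_k.

```python
import numpy as np

def complexity_term(ln_entropy, a, b, w, N, eps):
    """Second term of bound (11) under the stated identity.

    ln_entropy plays the role of ln N_1^w(F, xi'/8, 2*sum(N)); its value here
    is an assumption, since entropy numbers depend on the function class F.
    """
    w = np.asarray(w, dtype=float)
    N = np.asarray(N, dtype=float)
    # prod(N) / (32 (b-a)^2 * sum_k w_k^2 prod_{i!=k} N_i)
    #   == 1 / (32 (b-a)^2 * sum_k w_k^2 / N_k)
    denom = 1.0 / (32.0 * (b - a) ** 2 * np.sum(w ** 2 / N))
    return float(((ln_entropy - np.log(eps / 8.0)) / denom) ** 0.5)

N = [2000, 2000]
uniform = complexity_term(5.0, 0.0, 1.0, [0.5, 0.5], N, 0.05)
skewed = complexity_term(5.0, 0.0, 1.0, [0.9, 0.1], N, 0.05)
print(uniform, skewed)
```

With equal sample sizes, the uniform weighting w = (0.5, 0.5) gives a smaller complexity term than a skewed weighting, previewing the choice of w discussed after Theorem 5.2.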
By using the resultant bound (11), we can analyze the asymptotic behavior of the learning process for domain adaptation with multiple sources.

5.2 Asymptotic Convergence

In statistical learning theory, it is well known that the complexity of the function class is the main factor in the asymptotic convergence of the learning process under the assumption of same distribution (cf. [3, 4, 22]). Theorem 5.1 directly leads to the following theorem, which shows the asymptotic convergence of the learning process for domain adaptation with multiple sources:

Theorem 5.2 Assume that F is a function class consisting of bounded functions with range [a, b]. Let w = (w_1, ···, w_K) ∈ [0,1]^K with Σ_{k=1}^K w_k = 1. If the following condition holds:

lim_{N_1,···,N_K→+∞} [ln N_1^w(F, ξ'/8, 2Σ_{k=1}^K N_k)] / [ (Π_{k=1}^K N_k) / (32(b−a)² Σ_{k=1}^K w_k² Π_{i≠k} N_i) ] < +∞   (14)

with ξ' = ξ − D_F^(w)(S,T), then we have for any ξ > D_F^(w)(S,T),

lim_{N_1,···,N_K→+∞} Pr{ sup_{f∈F} |E^(T)f − E_w^(S)f| > ξ } = 0.   (15)

As shown in Theorem 5.2, if the choice of w ∈ [0,1]^K with Σ_{k=1}^K w_k = 1 and the uniform entropy number ln N_1^w(F, ξ'/8, 2Σ_{k=1}^K N_k) satisfy condition (14), then the probability of the event "sup_{f∈F} |E^(T)f − E_w^(S)f| > ξ" converges to zero for any ξ > D_F^(w)(S,T), as the sample numbers N_1, ···, N_K of the multiple sources go to infinity. This is partially in accordance with the classical result on the asymptotic convergence of the learning process under the assumption of same distribution (cf. Theorem 2.3 and Definition 2.5 of [22]): the probability of the event "sup_{f∈F} |E f − E_N f| > ξ" converges to zero for any ξ > 0, if the uniform entropy number ln N_1(F, ξ, N) satisfies

lim_{N→+∞} [ln N_1(F, ξ, N)] / N < +∞.   (16)

Note that in the learning process of domain adaptation with multiple sources, the uniform convergence of the empirical risk on the source domains to the expected risk on the target domain may not hold, because the limit (15) does not hold for any ξ > 0 but only for ξ > D_F^(w)(S,T). By contrast, the limit (15) holds for all ξ > 0 in the learning process under the assumption of same distribution, if condition (16) is satisfied.

By the Cauchy-Schwarz inequality, setting w_k = N_k / Σ_{k=1}^K N_k (1 ≤ k ≤ K) minimizes the second term of the right-hand side of (11), and then we arrive at

sup_{f∈F} |E_w^(S)f − E^(T)f| ≤ [Σ_{k=1}^K N_k D_F(S_k, T)] / [Σ_{k=1}^K N_k] + ( [ln N_1^w(F, ξ'/8, 2Σ_{k=1}^K N_k) − ln(ε/8)] / [ (Σ_{k=1}^K N_k) / (32(b−a)²) ] )^{1/2},   (17)

which implies that setting w_k = N_k / Σ_{k=1}^K N_k (1 ≤ k ≤ K) results in the fastest rate of convergence; our numerical experiments presented in the next section also support this point (cf. Fig. 1).

6 Numerical Experiments

We have performed numerical experiments to verify the theoretical analysis of the asymptotic convergence of the learning process for domain adaptation with multiple sources.
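The Cauchy-Schwarz claim behind (17), namely that w_k = N_k / Σ_k N_k minimizes Σ_k w_k²/N_k (the quantity driving the second term of the bound), can be checked numerically. The sketch below does so for K = 2 by a brute-force grid search over the simplex; the sample sizes are arbitrary placeholders.

```python
import numpy as np

# Numerically confirm that w_k = N_k / sum(N) minimizes sum_k w_k^2 / N_k
# for K = 2 sources (N1, N2 are arbitrary illustrative sizes).
N1, N2 = 2000, 1000
ws = np.linspace(0.0, 1.0, 100001)          # grid over w1 in [0, 1], w2 = 1 - w1
obj = ws ** 2 / N1 + (1.0 - ws) ** 2 / N2   # objective sum_k w_k^2 / N_k
w_best = ws[np.argmin(obj)]
print(f"grid minimizer w1 = {w_best:.4f}, predicted N1/(N1+N2) = {N1/(N1+N2):.4f}")
```

The grid minimizer agrees with the closed form w_1 = N_1/(N_1+N_2) up to the grid resolution, consistent with the experimental finding below that w = 0.5 is best when N_1 = N_2.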
Without loss of generality, we only consider the case of K = 2, i.e., there are two source domains and one target domain. The experiment data are generated in the following way.

For the target domain Z^(T) = X^(T) × Y^(T) ⊂ R^100 × R, we consider X^(T) as a Gaussian distribution N(0, 1) and draw {x_n^(T)}_{n=1}^{N_T} (N_T = 4000) from X^(T) randomly and independently. Let β ∈ R^100 be a random vector of a Gaussian distribution N(1, 5), and let the random vector R ∈ R^100 be a noise term with R ~ N(0, 0.5). For any 1 ≤ n ≤ N_T, we randomly draw β and R from N(1, 5) and N(0, 0.5) respectively, and then generate y_n^(T) ∈ Y^(T) as follows:

y_n^(T) = ⟨x_n^(T), β⟩ + R.   (18)

The derived {(x_n^(T), y_n^(T))}_{n=1}^{N_T} (N_T = 4000) are the samples of the target domain Z^(T) and will be used as the test data.

In a similar way, we derive the sample set {(x_n^(1), y_n^(1))}_{n=1}^{N_1} (N_1 = 2000) of the source domain Z^(S_1) = X^(1) × Y^(1) ⊂ R^100 × R: for any 1 ≤ n ≤ N_1,

y_n^(1) = ⟨x_n^(1), β⟩ + R,   (19)

where x_n^(1) ~ N(0.5, 1), β ~ N(1, 5) and R ~ N(0, 0.5).

For the source domain Z^(S_2) = X^(2) × Y^(2) ⊂ R^100 × R, the samples {(x_n^(2), y_n^(2))}_{n=1}^{N_2} (N_2 = 2000) are generated in the following way: for any 1 ≤ n ≤ N_2,

y_n^(2) = ⟨x_n^(2), β⟩ + R,   (20)

where x_n^(2) ~ N(2, 5), β ~ N(1, 5) and R ~ N(0, 0.5).

In this experiment, we use the method of least squares regression to minimize the empirical risk

E_w^(S)(ℓ∘g) = (w/N_1) Σ_{n=1}^{N_1} ℓ(g(x_n^(1)), y_n^(1)) + ((1−w)/N_2) Σ_{n=1}^{N_2} ℓ(g(x_n^(2)), y_n^(2))   (21)

for different combination coefficients w ∈ {0.1, 0.3, 0.5, 0.9}, and then compute the discrepancy |E_w^(S)f − E_{N_T}^(T)f| for each N_1 + N_2. The initial N_1 and N_2 both equal 200. Each test is repeated 30 times and the final result is the average of the 30 results. After each test, we increase both N_1 and N_2 by 200 until N_1 = N_2 = 2000. The experiment results are shown in Fig. 1.

From Fig. 1, we can observe that for any choice of w, the curve of |E_w^(S)f − E_{N_T}^(T)f| decreases as N_1 + N_2 increases, which is in accordance with the results presented in Theorems 5.1 & 5.2. Moreover, when w = 0.5, the discrepancy |E_w^(S)f − E_{N_T}^(T)f| has the fastest rate of convergence, and the rate becomes slower as w moves further away from 0.5. In this experiment, we set N_1 = N_2, which implies that N_2/(N_1 + N_2) = 0.5. Recalling (17), we have shown that w = N_2/(N_1 + N_2) provides the fastest rate of convergence, and this proposition is supported by the experiment results shown in Fig. 1.

7 Conclusion

In this paper, we present a new framework to study the generalization bounds of the learning process for domain adaptation. We use the integral probability metric to measure the difference between the distributions of two domains. Then, we use a martingale method to develop the specific deviation inequality and the symmetrization inequality incorporating the integral probability metric. Next, we utilize the resultant deviation inequality and symmetrization inequality to derive the generalization bound based on the uniform entropy number. By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for domain adaptation.

Figure 1: Domain Adaptation with Multiple Sources

We point out that the asymptotic convergence of the learning process is determined by the complexity of the function class F measured by the uniform entropy number.
This is partially in accordance with the classical result on the asymptotic convergence of the learning process under the assumption of same distribution (cf. Theorem 2.3 and Definition 2.5 of [22]). Moreover, the rate of convergence of this learning process is equal to that of the learning process under the assumption of same distribution. The numerical experiments support our results. Finally, we give a comparison with the previous works [13, 14, 15, 16, 17, 18, 20] (cf. supplement, part D).

It is noteworthy that, by Theorem 2.18 of [22], the generalization bound (11) can lead to a result based on the fat-shattering dimension. According to Theorem 2.6.4 of [4], a bound based on the VC dimension can also be obtained from the result (11). In our future work, we will attempt to find a new distance between two distributions, develop generalization bounds based on other complexity measures, e.g., Rademacher complexities, and analyze other theoretical properties of domain adaptation.

Acknowledgments

This research is sponsored in part by NSF (IIS-0953662, CCF-1025177), NIH (LM010730), and ONR (N00014-1-1-0108).

References

[1] V.N. Vapnik (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks 10(5):988-999.
[2] O. Bousquet, S. Boucheron, and G. Lugosi (2004). Introduction to Statistical Learning Theory. In O. Bousquet et al. (ed.), Advanced Lectures on Machine Learning, 169-207.
[3] V.N. Vapnik (1998). Statistical Learning Theory. New York: John Wiley and Sons.
[4] A. van der Vaart, and J. Wellner (2000). Weak Convergence and Empirical Processes With Applications to Statistics. Springer.
[5] A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM 36(4):929-965.
[6] P.L. Bartlett, O. Bousquet, and S. Mendelson (2005). Local Rademacher Complexities. Annals of Statistics 33:1497-1537.
[7] Z.
Hussain, and J. Shawe-Taylor (2011). Improved Loss Bounds for Multiple Kernel Learning. Journal of Machine Learning Research - Proceedings Track 15:370-377.
[8] J. Jiang, and C. Zhai (2007). Instance Weighting for Domain Adaptation in NLP. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL), 264-271.
[9] J. Blitzer, M. Dredze, and F. Pereira (2007). Biographies, bollywood, boomboxes and blenders: Domain adaptation for sentiment classification. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL), 440-447.
[10] S. Bickel, M. Brückner, and T. Scheffer (2007). Discriminative learning for differing training and test distributions. Proceedings of the 24th International Conference on Machine Learning (ICML), 81-88.
[11] P. Wu, and T.G. Dietterich (2004). Improving SVM accuracy by training on auxiliary data sources. Proceedings of the Twenty-First International Conference on Machine Learning (ICML), 871-878.
[12] J. Blitzer, R. McDonald, and F. Pereira (2006). Domain adaptation with structural correspondence learning. Conference on Empirical Methods in Natural Language Processing (EMNLP), 120-128.
[13] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman (2010). A Theory of Learning from Different Domains. Machine Learning 79:151-175.
[14] K. Crammer, M. Kearns, and J. Wortman (2006). Learning from Multiple Sources. Advances in Neural Information Processing Systems (NIPS).
[15] K. Crammer, M. Kearns, and J. Wortman (2008). Learning from Multiple Sources. Journal of Machine Learning Research 9:1757-1774.
[16] Y. Mansour, M. Mohri, and A. Rostamizadeh (2008). Domain adaptation with multiple sources. Advances in Neural Information Processing Systems (NIPS), 1041-1048.
[17] Y. Mansour, M. Mohri, and A.
Rostamizadeh (2009). Multiple Source Adaptation and the Rényi Divergence. Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI).
[18] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman (2007). Learning Bounds for Domain Adaptation. Advances in Neural Information Processing Systems (NIPS).
[19] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira (2006). Analysis of Representations for Domain Adaptation. Advances in Neural Information Processing Systems (NIPS), 137-144.
[20] Y. Mansour, M. Mohri, and A. Rostamizadeh (2009). Domain Adaptation: Learning Bounds and Algorithms. Conference on Learning Theory (COLT).
[21] W. Hoeffding (1963). Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association 58(301):13-30.
[22] S. Mendelson (2003). A Few Notes on Statistical Learning Theory. Lecture Notes in Computer Science 2600:1-40.
[23] V.M. Zolotarev (1984). Probability Metrics. Theory of Probability and its Applications 28(1):278-302.
[24] S.T. Rachev (1991). Probability Metrics and the Stability of Stochastic Models. John Wiley and Sons.
[25] A. Müller (1997). Integral Probability Metrics and Their Generating Classes of Functions. Advances in Applied Probability 29(2):429-443.
[26] M.D. Reid and R.C. Williamson (2011). Information, Divergence and Risk for Binary Experiments. Journal of Machine Learning Research 12:731-817.
[27] B.K. Sriperumbudur, A. Gretton, K. Fukumizu, G.R.G. Lanckriet and B. Schölkopf (2009). A Note on Integral Probability Metrics and φ-Divergences. CoRR abs/0901.2698.
", "award": [], "sourceid": 866, "authors": [{"given_name": "Chao", "family_name": "Zhang", "institution": null}, {"given_name": "Lei", "family_name": "Zhang", "institution": null}, {"given_name": "Jieping", "family_name": "Ye", "institution": null}]}