{"title": "Revisiting $(\\epsilon, \\gamma, \\tau)$-similarity learning for domain adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 7397, "page_last": 7406, "abstract": "Similarity learning is an active research area in machine learning that tackles the problem of finding a similarity function tailored to an observable data sample in order to achieve efficient classification. This learning scenario has been generally formalized by the means of a $(\\epsilon, \\gamma, \\tau)-$good similarity learning framework in the context of supervised classification and has been shown to have strong theoretical guarantees. In this paper, we propose to extend the theoretical analysis of similarity learning to the domain adaptation setting, a particular situation occurring when the similarity is learned and then deployed on samples following different probability distributions. We give a new definition of an $(\\epsilon, \\gamma)-$good similarity for domain adaptation and prove several results quantifying the performance of a similarity function on a target domain after it has been trained on a source domain. 
We particularly show that if the source distribution dominates the target one, then principally new domain adaptation learning bounds can be proved.", "full_text": "Revisiting (\u03b5, \u03b3, \u03c4)-similarity learning for domain adaptation\n\nSofien Dhouib\n\nUniv Lyon, INSA-Lyon, Universit\u00e9 Claude Bernard Lyon 1, UJM-Saint Etienne, CNRS, Inserm, CREATIS UMR 5220, U1206, F-69100, LYON, France\n\nsofiane.dhouib@creatis.insa-lyon.fr\n\nIevgen Redko\u2217\n\nUniv Lyon, UJM-Saint-Etienne, CNRS, Institut d'Optique Graduate School, Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Etienne, France\n\nievgen.redko@univ-st-etienne.fr\n\nAbstract\n\nSimilarity learning is an active research area in machine learning that tackles the problem of finding a similarity function tailored to an observable data sample in order to achieve efficient classification. This learning scenario has generally been formalized by means of an (\u03b5, \u03b3, \u03c4)-good similarity learning framework in the context of supervised classification and has been shown to have strong theoretical guarantees. In this paper, we propose to extend the theoretical analysis of similarity learning to the domain adaptation setting, a particular situation occurring when the similarity is learned and then deployed on samples following different probability distributions. We give a new definition of an (\u03b5, \u03b3)-good similarity for domain adaptation and prove several results quantifying the performance of a similarity function on a target domain after it has been trained on a source domain. We particularly show that if the source distribution dominates the target one, then principally new domain adaptation learning bounds can be proved.\n\n1 Introduction\n\nMany popular supervised learning algorithms rely on pairwise metrics calculated based on the instances of a given data set in order to learn a classifier. 
For instance, the famous family of k-nearest neighbors algorithms [1] uses distance matrices in order to define the label of a given test point, while support vector machines [2] can be extended to handle non-linear classification using kernel functions. Despite the widespread use of metrics in machine learning, existing distances often do not capture the intrinsic geometry of data with respect to the labels of the available data points. To tackle this problem, the emerging field of metric learning [3, 4] (also known as similarity learning) aims to provide solutions that allow one to learn pairwise metrics explicitly from the data, thus making them tailored for the classification or regression problem at hand.\n\nFrom the theoretical point of view, similarity learning was extensively analyzed in two seminal papers [5, 6] based on the (\u03b5, \u03b3, \u03c4)-good similarity framework for binary classification.\n\n\u2217The author was at CREATIS when this work was done.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nThis framework formalizes an intuitive definition of a good similarity function: given a set of landmarks (or reasonable points) of probability mass at least \u03c4, most of the data points (a 1 \u2212 \u03b5 probability mass) should be on average more similar to the reasonable points of their own class than to those of the opposite class. Based on the proposed formalization, the authors provided performance guarantees for a resulting linear classifier after mapping data into a new feature space defined via the good similarity function. 
We refer the interested reader to [7] and [8] for other theoretical studies of the (\u03b5, \u03b3, \u03c4)-framework in the supervised case, and to [9] and [10] in the semi-supervised learning case.\n\nWhile most of the work based on the (\u03b5, \u03b3, \u03c4) framework has been done in the classical context where training and testing data have the same distribution, in several practical scenarios, one may want to transfer the learned similarity function from one domain, usually called the source domain, to another, related yet different domain, called the target domain. This framework, known as transfer learning, is a prominent research topic in machine learning nowadays [11, 12, 13, 14], often used in situations where the target domain contains few or no labeled instances, in order to reduce the time and effort needed for manual labeling or even for collecting new data. As many domain adaptation algorithms proposed in the literature are based on metric learning [15, 16, 17], a question about the theoretical guarantees of the general similarity framework naturally arises.\n\nIn this paper, we present a theoretical study of the (\u03b5, \u03b3, \u03c4)-framework in the domain adaptation context, where only the marginal distributions across the source and the target domains are assumed to change while the labeling functions remain the same2. Contrary to the previous works on the analysis of metric learning algorithms in domain adaptation [18, 19], we aim to consider a more general setting without being attached to a particular algorithm, in order to investigate to what extent a similarity that is good for a source domain remains good for the target domain. The obtained results are novel in two different ways. First, they provide a complete theoretical study of similarity learning in domain adaptation, a study that has never appeared in the literature before. 
Second, they show that under certain assumptions on the richness of the source domain with respect to the target one, the target error can be bounded by terms that all explicitly depend on the source domain error.\n\nThe rest of the paper is organized as follows. Section 2 presents the learning setting that we consider, with some necessary definitions and notations. Section 3 introduces a generalization of the (\u03b5, \u03b3, \u03c4)-goodness definition used to provide a theoretical result relating the source and target goodnesses, and presents a brief comparison of the obtained bound with the related work. Apart from the source goodness, the established inequality contains a term reflecting the distance between the distributions of the two domains and a worst margin term measuring the worst error obtainable by the similarity function for some instance from the learning sample. We analyze the obtained worst margin term in Section 4 and measure the confidence of its empirical estimation. Section 5 is dedicated to the empirical evaluation of the obtained theoretical results. We conclude our paper in Section 6 and give several possible future perspectives of this work.\n\n2 Preliminaries\n\nIn order to proceed, we first introduce the basic elements related to the (\u03b5, \u03b3, \u03c4)-good similarity framework. In what follows, we denote by X \u2282 Rd and Y \u2282 {\u22121, 1} the feature and label spaces, respectively. For any real t, t+ denotes its positive part, i.e., max(t, 0). As in [6], we define a similarity function as a pairwise function K : X \u00d7 X \u2192 [\u22121, 1]. We now recall the definition of the (\u03b5, \u03b3, \u03c4)-goodness with hinge loss.\n\nDefinition 1 (Balcan et al. 2008). 
A similarity function K is (\u03b5, \u03b3, \u03c4)-good in hinge loss for problem (distribution) P if there exists a (probabilistic) indicator function R of a set of \u201creasonable points\u201d such that:\n\n$$\\mathbb{E}_{(x,y)\\sim P}\\left[\\left(1 - \\frac{y\\, g(x)}{\\gamma}\\right)_+\\right] \\leq \\epsilon, \\quad (1)$$\n\n$$\\mathbb{P}_{x'\\sim P}\\left[R(x')\\right] \\geq \\tau, \\quad (2)$$\n\nwhere $g(x) = \\mathbb{E}_{(x',y')\\sim P}\\left[y' K(x, x') \\,|\\, R(x')\\right]$.\n\n2This assumption leads to a setting often called the covariate shift problem in domain adaptation.\n\nIn this definition, \u03b5 is an upper bound for the expected hinge loss over all the margins g(x), every margin being the average signed similarity of an instance to the reasonable points defined by R. To control the loss sensitivity to the margin, a division by 0 < \u03b3 \u2264 1 is applied. Following this definition, the authors of [6] prove a theorem that guarantees the existence of a linear separator in a new feature space defined via an (\u03b5, \u03b3, \u03c4)-good similarity function, a result that is stated by the following theorem.\n\nTheorem 1 (Balcan et al. 2008). Let K be an (\u03b5, \u03b3, \u03c4)-good similarity function in hinge loss for a learning problem P. 
For any $\\epsilon_1 > 0$ and $0 < \\delta < \\gamma\\epsilon_1/4$, let $S = \\{x'_1, ..., x'_n\\}$ be a (potentially unlabeled) sample of $n = \\frac{2}{\\tau}\\log\\left(\\frac{2}{\\delta}\\right)\\left(1 + \\frac{16}{(\\epsilon_1\\gamma)^2}\\right)$ landmarks drawn from P. Consider the mapping:\n\n$$\\phi_S : X \\to \\mathbb{R}^n, \\quad x \\mapsto (K(x, x'_1), ..., K(x, x'_n)).$$\n\nThen with probability at least $1 - \\delta$ over the draw of S, there exists $\\beta \\in \\mathbb{R}^n$ such that:\n\n$$\\mathbb{E}_{(x,y)\\sim P}\\left[\\left(1 - \\frac{y\\langle\\beta, \\phi_S(x)\\rangle}{\\gamma}\\right)_+\\right] \\leq \\epsilon + \\epsilon_1. \\quad (3)$$\n\nIn other words, the induced distribution $\\phi_S(P)$ in $\\mathbb{R}^n$ has a linear separator achieving hinge loss at most $\\epsilon + \\epsilon_1$ at margin \u03b3.\n\nOne can see this theorem as a variation of the kernel trick used in the SVM algorithm. Indeed, if K is a kernel function and if \u03c4 = 1, the expected loss in Equation (3) becomes the non-regularized loss of an SVM defined via the kernel K. The authors furthermore derive an algorithm from this theorem that minimizes the empirical version of (3), which boils down to a linear programming problem that can be solved efficiently.\n\n3 (\u03b5, \u03b3)-good similarity learning for domain adaptation\n\nIn this section, we introduce the main contributions of our paper. We start by giving a definition of (\u03b5, \u03b3)-goodness with an arbitrary distribution of landmarks, and then propose a generalization bound that relates the goodness of the same similarity function learned on the source and target domains.\n\n3.1 Problem setup\n\nFor the considered problem, we assume to have access to samples S and T drawn from source and target probability distributions S and T, respectively. In the context of domain adaptation, S \u2282 (X \u00d7 Y)m is labeled whereas T can be partially or totally unlabeled. 
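The landmark mapping $\\phi_S$ of Theorem 1 is straightforward to sketch in code. The following is a minimal illustration and not the paper's implementation: it uses cosine similarity as a stand-in for K (an assumption, chosen only because it takes values in [-1, 1]) and a simple class-mean direction instead of solving the linear program mentioned above.

```python
import numpy as np

def phi(X, landmarks, K):
    """Map each row x of X to (K(x, x'_1), ..., K(x, x'_n)),
    the landmark feature space of Theorem 1."""
    return np.array([[K(x, l) for l in landmarks] for x in X])

def K_cos(x, xp):
    # Stand-in similarity: cosine, bounded in [-1, 1] as required of K.
    return float(x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp) + 1e-12))

rng = np.random.default_rng(0)
# Two well-separated Gaussian classes.
Xpos = rng.normal(loc=(1.0, 1.0), scale=0.3, size=(50, 2))
Xneg = rng.normal(loc=(-1.0, -1.0), scale=0.3, size=(50, 2))
X = np.vstack([Xpos, Xneg])
y = np.array([1] * 50 + [-1] * 50)

landmarks = X[::10]            # every 10th point serves as a landmark
F = phi(X, landmarks, K_cos)   # n-dimensional similarity features

# A linear separator in the mapped space: here simply the difference of
# class means, instead of the paper's linear programming solution.
beta = F[y == 1].mean(axis=0) - F[y == -1].mean(axis=0)
hinge = np.maximum(0.0, 1.0 - y * (F @ beta) / 0.5)  # gamma = 0.5
print("mean hinge loss in the mapped space:", hinge.mean())
```

On such separable data the mean hinge loss in the mapped space is close to zero, in line with the qualitative message of the theorem.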
In the rest of the paper, we suppose that the labeling is deterministic, meaning that there exists a labeling function fS (resp. fT) such that for every (x, y) in the source domain (resp. in the target domain), y = fS(x) (resp. y = fT(x)). Hence, we replace every (x, y) \u223c P by simply writing x \u223c P for all probability distributions considered below. Moreover, since we assume that $\\mathbb{P}_{(x,y)\\sim\\mathcal{S}}[y|x] = \\mathbb{P}_{(x,y)\\sim\\mathcal{T}}[y|x]$, we have fS = fT.\n\nAs hinted in [6, Note 2, Theorem 14], the instances and landmarks can potentially be drawn from different distributions. Hence, we propose a modification of Definition 1 given as follows.\n\nDefinition 2. A similarity function K is (\u03b5, \u03b3)-good in hinge loss for problem (P, R) (where P is the data distribution whereas R is the landmarks distribution) if:\n\n$$\\mathbb{E}_{x\\sim P}\\left[\\left(1 - \\frac{y\\, g_R(x)}{\\gamma}\\right)_+\\right] \\leq \\epsilon,$$\n\nwhere $g_R(x) = \\mathbb{E}_{x'\\sim R}\\left[y' K(x, x')\\right]$.\n\nThis is a generalization of Definition 1, and the two coincide when we consider the distribution R defined by $\\mathbb{P}_{x\\sim R}[x \\in A] = \\mathbb{P}_{x\\sim P}[x \\in A \\,|\\, R(x) = 1]$ for all measurable sets A. As for the parameter \u03c4, it can be seen as an upper bound for $\\mathbb{P}_{x\\sim P}[x \\in \\mathrm{supp}\\, R]$ since in this case, we have $\\mathrm{supp}\\, R \\subset \\{R(x) = 1\\}$. This definition captures the intuition often used to design domain adaptation algorithms, as R can be thought of as a \u201cuniversal landmarks domain\u201d which is independent of the source or target domains. In the case of sentiment classification, for example, it might correspond to the negative or positive vocabulary used to express one's opinion independently of the type of the concerned product.\n\nIn the rest of the paper, we use the following notations for any data distribution P and landmark distribution Q. 
We denote the goodness of K for problem (P, Q) by\n\n$$E_{P,Q}(K) := \\mathbb{E}_{x\\sim P}\\left[\\left(1 - \\frac{y\\, g_Q(x)}{\\gamma}\\right)_+\\right].$$\n\nFor simplicity, we further denote by $l_\\gamma$ the \u03b3-scaled hinge loss function defined by:\n\n$$l_\\gamma : x \\mapsto \\left(1 - \\frac{x}{\\gamma}\\right)_+.$$\n\nWe let \u00b5 be a probability distribution that dominates all the other probability distributions used afterwards. In addition, $M_{P,Q}(K)$ stands for the worst margin achieved by an element $x \\in \\mathrm{supp}\\, P$ associated with landmark distribution Q, i.e.:\n\n$$M_{P,Q}(K) := \\sup_{x \\in \\mathrm{supp}\\, P} l_\\gamma(y\\, g_Q(x)).$$\n\nNote that since K takes values in [\u22121, 1] (or even if we only assume that K is bounded), $y\\, g_Q(x)$ is also bounded and consequently $l_\\gamma(y\\, g_Q(x))$ is bounded thanks to the continuity of $l_\\gamma$. This ensures that $M_{P,Q}(K)$ is finite. Finally, if B is a boolean expression, then $[B] := 1_B$ is an indicator of the set on which B holds (Iverson bracket notation).\n\n3.2 Relating the source and target goodnesses\n\nGiven a similarity function that is (\u03b5, \u03b3)-good in hinge loss for problem (S, R1), our goal is to bound its goodness on the target set for problem (T, R2), where R1 and R2 are not supposed to be equal.\n\n3.2.1 Shared landmarks distribution\n\nIn order to prepare for a more general result that relates the goodness of a similarity K for problems (S, R1) and (T, R2), we first provide a preparatory result that considers the same landmark distribution R = R1 = R2. This result is given by the following lemma3.\n\nLemma 1 (same landmarks). Let K be an (\u03b5, \u03b3)-good similarity for problem (S, R). 
Then K is (\u03b5 + \u03b5', \u03b3)-good for problem (T, R), where\n\n$$\\epsilon' = d_{1+,\\gamma}(\\mathcal{T}, \\mathcal{S})\\, M_{\\mu,R}(K)$$\n\nwith $d_{1+,\\gamma}(\\mathcal{T}, \\mathcal{S}) = \\mathbb{E}_{x\\sim\\mu}\\left[\\left(\\frac{d\\mathcal{T}}{d\\mu} - \\frac{d\\mathcal{S}}{d\\mu}\\right)_+ [y\\, g_R(x) < \\gamma]\\right]$. Moreover, if $\\mathcal{T} \\ll \\mathcal{S}$ then the obtained result holds with\n\n$$\\epsilon' = \\sqrt{d_{\\chi^2_{+,\\gamma}}(\\mathcal{T}, \\mathcal{S})}\\, M_{\\mathcal{S},R}(K)\\, \\sqrt{\\epsilon},$$\n\nwhere $d_{\\chi^2_{+,\\gamma}}(\\mathcal{T}, \\mathcal{S}) = \\mathbb{E}_{x\\sim\\mathcal{S}}\\left[\\left(\\left(\\frac{d\\mathcal{T}}{d\\mathcal{S}} - 1\\right)_+\\right)^2 [y\\, g_R(x) < \\gamma]\\right]$.\n\n3Due to the limited space, all proofs are provided in the Supplementary material.\n\nSeveral observations can be made based on these results. First, we note that the expectation in both divergence terms is taken only on the support of the hinge loss, i.e., for instances having a signed margin smaller than \u03b3, making these terms problem dependent. This dependence is quite important as it allows us to claim that the presented result can be informative in practice. Second, the obtained bounds both contain the term $M_{\\mu,R}(K)$ which stands for the worst margin achieved by K on some instance of supp \u00b5. In the case of the SVM, this term is analogous to the largest slack variable associated with an instance drawn from the dominating measure \u00b5. For several choices of \u00b5, this term can be difficult to control, as we can estimate it only by observing data drawn from S. This limitation is tackled by assuming that S dominates T, thus motivating the bounds with the \u03c72 distance. The latter clearly show the benefit of assuming $\\mathcal{T} \\ll \\mathcal{S}$: the distance term in the bound is multiplied by $\\sqrt{\\epsilon}$, meaning that having a similarity function achieving a low error on the source domain can leverage the difference between the domains' distributions. Note that the assumption $\\mathcal{T} \\ll \\mathcal{S}$ is quite common in the domain adaptation literature and has already been used in [20]. 
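The empirical counterparts of the quantities appearing in Lemma 1 are simple to compute from samples. The sketch below is our own illustration (function names and the toy data are assumptions): it estimates the goodness E_{P,R}(K) as the mean of the \u03b3-scaled hinge losses of the margins g_R(x), and the worst margin term M_{P,R}(K) as their maximum.

```python
import numpy as np

def margins(X, Lx, Ly, K):
    """Empirical margin g_R(x) = E_{x'~R}[y' K(x, x')] for each row x of X,
    with (Lx, Ly) a labeled landmark sample standing in for R."""
    sims = np.array([[K(x, l) for l in Lx] for x in X])  # shape (m, r)
    return (sims * Ly).mean(axis=1)

def goodness_and_worst_margin(X, y, Lx, Ly, K, gamma):
    """Empirical versions of E_{P,R}(K) and M_{P,R}(K): the mean and max
    of the gamma-scaled hinge loss l_gamma(y g_R(x))."""
    g = margins(X, Lx, Ly, K)
    losses = np.maximum(0.0, 1.0 - y * g / gamma)
    return losses.mean(), losses.max()

# Toy check with the inner product <x, x'> as the similarity; on this
# small-norm toy data it stays bounded, as required of K.
rng = np.random.default_rng(1)
Xp = rng.normal((0.5, 0.5), 0.2, (40, 2))
Xn = rng.normal((-0.5, -0.5), 0.2, (40, 2))
X = np.vstack([Xp, Xn])
y = np.array([1] * 40 + [-1] * 40)
K = lambda a, b: float(a @ b)
eps, M = goodness_and_worst_margin(X, y, X, y, K, gamma=0.5)
print(f"empirical goodness {eps:.3f}, worst margin term {M:.3f}")
```

Here the sample itself serves as the landmark sample, which matches the experimental choice made later in Section 5.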
As mentioned by the authors, it roughly means that the source domain is richer than the target one, an assumption that is quite reasonable in practice.\n\n3.2.2 Different landmarks case\n\nWe now turn our attention to a more general case where the landmarks distributions vary across the two domains. To this end, we assume that a similarity function K is (\u03b5, \u03b3)-good for (S, R1). Given these assumptions, our goal now is to provide a learning guarantee for the goodness of K for the (T, R2) learning problem. To proceed, we first rewrite the difference between $E_{\\mathcal{T},R_2}(K)$ and $E_{\\mathcal{S},R_1}(K)$ as follows:\n\n$$E_{\\mathcal{T},R_2}(K) - E_{\\mathcal{S},R_1}(K) = E_{\\mathcal{T},R_1}(K) - E_{\\mathcal{S},R_1}(K) + E_{\\mathcal{T},R_2}(K) - E_{\\mathcal{T},R_1}(K).$$\n\nBy analyzing the obtained expression, we note that the difference between the first two terms can be bounded using Lemma 1 as $E_{\\mathcal{T},R_1}(K) - E_{\\mathcal{S},R_1}(K) = \\epsilon + \\epsilon' - \\epsilon = \\epsilon'$, where $\\epsilon' = \\sqrt{d_{\\chi^2_{+,\\gamma}}(\\mathcal{T}, \\mathcal{S})}\\, M_{\\mu,R}(K)\\, \\sqrt{\\epsilon}$ when $\\mathcal{T} \\ll \\mathcal{S}$ and $d_{1+,\\gamma}(\\mathcal{T}, \\mathcal{S})\\, M_{\\mu,R_2}(K)$ otherwise. Consequently, we further focus solely on the last two terms and, similar to the previous case, provide a result based on both the L1 and \u03c72 distances. We prove the following theorem.\n\nTheorem 2. Let K be an (\u03b5, \u03b3)-good similarity for problem (S, R1). Then K is (\u03b5 + \u03b5' + \u03b5'', \u03b3)-good for problem (T, R2), with $\\epsilon'' = \\frac{1}{\\gamma} d_1(R_1, R_2)$ and $\\epsilon' = d_{1+,\\gamma}(\\mathcal{T}, \\mathcal{S})\\, M_{\\mu,R_2}(K)$, where $d_1(R_1, R_2) = \\mathbb{E}_{x'\\sim\\mu}\\left[\\left|\\frac{dR_1}{d\\mu} - \\frac{dR_2}{d\\mu}\\right|\\right]$. Moreover, if $\\mathcal{T} \\ll \\mathcal{S}$, then the obtained result holds with $\\epsilon' = \\sqrt{d_{\\chi^2_{+,\\gamma}}(\\mathcal{T}, \\mathcal{S})}\\, M_{\\mu,R}(K)\\, \\sqrt{\\epsilon}$.\n\nThe obtained result suggests that it is better to consider the same landmark distribution R = R1 = R2 for the two domains, as this assumption minimizes the bound by setting $\\epsilon'' = \\frac{1}{\\gamma} d_1(R_1, R_2) = 0$. 
This conclusion is rather intuitive: in many domain adaptation algorithms, the source and target domains are aligned using a shared set of invariant components, and landmarks can be seen as invariant points allowing to adapt the similarity measure efficiently across domains. For this reason, we focus on the case of a shared landmark distribution in the rest of the paper.\n\n3.3 Comparison with other existing results\n\nWe now briefly compare the obtained results with some previous related works. To this end, we note that the vast majority of domain adaptation results [21, 22, 23, 18] have the following form:\n\n$$\\epsilon_l^{\\mathcal{T}}(h, f_{\\mathcal{T}}) \\leq \\epsilon_l^{\\mathcal{S}}(h, f_{\\mathcal{S}}) + d(\\mathcal{S}, \\mathcal{T}) + \\lambda, \\quad (4)$$\n\nwhere $\\epsilon_l^{\\mathcal{D}}(h, f_{\\mathcal{D}}) := \\mathbb{E}_{x\\sim\\mathcal{D}}\\left[l(h(x), f_{\\mathcal{D}}(x))\\right]$ is the error function defined over probability distribution D for hypothesis and labeling functions h, fD : X \u2192 Y with loss function l : Y \u00d7 Y \u2192 R+; d(\u00b7, \u00b7) is a divergence measure between the two domains and \u03bb is a non-estimable term related to the difficulty of the adaptation task. From (4), we note that our result with the \u03c72 distance drastically differs from the traditional domain adaptation bounds as, contrary to them, it suggests that the source error directly impacts all the terms in the bound. Indeed, the inequality in (4) prompts us to minimize both the source error $\\epsilon_l^{\\mathcal{S}}$ and the divergence term $d(\\mathcal{S}, \\mathcal{T})$ assuming that \u03bb is small, while our result shows that the source error given by the goodness of the similarity function can partially leverage the divergence between the two domains as it multiplies the latter. To the best of our knowledge, the only two other results that have this multiplicative dependence between the source error and the divergence term are [24] and [25], where variations of the R\u00e9nyi divergence were considered. 
Contrary to their contributions, our bound involves a divergence term that is restricted to the set $[y\\, g_R(x) < \\gamma]$, making it intrinsically linked to the considered hypothesis class. Furthermore, we note that the bounds proposed in [25] involve a non-estimable term that, similar to \u03bb in (4), is assumed to be small, while the worst margin term presented in our result is subject to the analysis provided in the next section.\n\n4 Analysis of the worst margin term\n\nAs the worst margin term $M_{\\mu,R}(K)$ is present in both bounds obtained in the previous section, we proceed to its analysis below. It tells us that if there is at least one instance from the source distribution (or from a distribution dominating it) that has a high loss, then the deviation between the target error and the source error is expected to be large. In what follows, we provide an analysis of this term, showing first that it can be bounded in terms of \u03b3 and then presenting a guarantee allowing its finite sample approximation.\n\n4.1 A simple bound for the worst margin\n\nA first simple bound for the worst margin term can be obtained as follows:\n\n$$M_{\\mu,R}(K) = \\sup_{x \\in \\mathrm{supp}\\,\\mu} l_\\gamma(y\\, g_R(x)) = \\left(1 - \\frac{1}{\\gamma} \\inf_{x \\in \\mathrm{supp}\\,\\mu} y\\, g_R(x)\\right)_+ = \\left(1 - \\frac{1}{\\gamma} \\inf_{x \\in \\mathrm{supp}\\,\\mu} \\mathbb{E}_{x'\\sim R}\\left[y y' K(x, x')\\right]\\right)_+ \\leq 1 + \\frac{1}{\\gamma} \\leq \\frac{2}{\\gamma}.$$\n\nThe last inequality comes from the fact that K : X \u00d7 X \u2192 [\u22121, 1] and that 0 < \u03b3 \u2264 1. Based on the obtained expression, we note from Lemma 1 that the target goodness can now be bounded in terms of both values that characterize the similarity function in the source domain. On the other hand, replacing the worst margin term in the bound by the constant 2/\u03b3 prevents us from taking it into account when attempting to design a new adaptation algorithm based on the obtained bounds. 
In this case, it can be useful to estimate this term empirically from the observed data sample by taking the empirical maximum for the source instances and the empirical mean for the landmarks.\n\n4.2 An empirical estimation of the worst margin\n\nWe intend to measure our confidence in the empirical estimation of the worst margin term by bounding the deviation between the real worst margin term and its empirical counterpart. To this end, we suppose having access to a labeled data sample S = {(x1, y1), ..., (xm, ym)} \u2282 (X \u00d7 Y)m drawn from S, inducing an empirical distribution $\\hat{\\mathcal{S}}$. Similarly, we define a sample $S_R = \\{(x'_1, y'_1), ..., (x'_r, y'_r)\\}$ and the corresponding empirical distribution $\\hat{R}$. As the notion of the Rademacher complexity is used to establish our result, we give its definition below.\n\nDefinition 3. Let G be a family of mappings from X to R and P be a probability distribution on X. The Rademacher complexity of G w.r.t. P and a sample size n is defined as\n\n$$\\mathrm{Rad}_n(G) = \\mathbb{E}_{S\\sim P^n}\\left[\\mathbb{E}_\\sigma\\left[\\sup_{g\\in G} \\frac{1}{n} \\sum_{i=1}^n \\sigma_i g(s_i)\\right]\\right],$$\n\nwhere the $\\sigma_i$ are independent uniform random variables in {\u22121, +1} called Rademacher random variables and S = {s1, ..., sn}.\n\nWe can now prove the following result.\n\nTheorem 3. Let K be a similarity function defined on a feature space X. Let $M_{\\mathcal{S},R}(K)$ denote its worst performance associated with the loss function $l_\\gamma$ and achieved by an example drawn from S, where R is the landmarks distribution. Assume that S dominates T and that the cumulative distribution function $F_{l_\\gamma}$ of the loss function associated with S and $\\hat{R}$ is k times differentiable at $M_{\\mathcal{S},\\hat{R}}(K)$, and that k > 0 is the minimum integer such that $F^{(k)}_{l_\\gamma} \\neq 0$. 
Then, for all \u03b1 > 1, r \u2265 1, there exists m0 \u2265 1 such that for all m \u2265 m0, we have with probability at least 1 \u2212 \u03b4:\n\n$$M_{\\mathcal{S},R}(K) \\leq M_{\\hat{\\mathcal{S}},\\hat{R}}(K) + \\frac{2}{\\gamma} \\mathrm{Rad}_r(H_1(K)) + \\frac{1}{\\gamma}\\sqrt{\\frac{2\\log\\left(\\frac{2}{\\delta}\\right)}{r}} + \\left(\\frac{(-1)^{k+1}\\log\\left(\\frac{2\\alpha}{\\delta}\\right) k!}{F^{(k)}_{l_\\gamma}(M_{\\mathcal{S},\\hat{R}}(K))\\, m}\\right)^{\\frac{1}{k}},$$\n\nwhere $H_1(K)$ is the hypothesis class defined by $H_1(K) = \\{x' \\mapsto K(x, x'),\\ x \\in \\mathrm{supp}\\,\\mathcal{S}\\}$.\n\nThis theorem shows that under certain conditions, the empirical maximum is guaranteed to converge in probability to the real supremum of the distribution's support. The convergence rate depends heavily on the complexity of the similarity function search space, represented by the Rademacher complexity term, and on the regularity of the loss distribution function, reflected by the $m^{-\\frac{1}{k}}$ term. This last term dominates the convergence rate when k > 2, and we have in general a convergence rate that is $O(m^{-\\frac{1}{\\max\\{2,k\\}}})$.\n\nDue to the bound's dependence on the regularity of $F_{l_\\gamma}$, knowing this cumulative distribution function is necessary for an explicit computation of the bound. Furthermore, when k increases, we may need more data in order to have a truthful estimation of the function's regularity. Thus, this quantity may become non-estimable, which goes in line with several other theoretical contributions [18, 21, 22, 23] where the learning bound includes an a priori non-estimable term.\n\n5 Experiments\n\nThe aim of this section is to empirically illustrate the usefulness of the bounds of Lemma 1 on a synthetic data set. 
In what follows, we restrict the similarity search space to the class of bilinear similarity functions parametrized by a matrix A \u2208 Rd\u00d7d, where d is the dimension of the feature space, i.e., $K(x, x') = K_A(x, x') = \\langle x, A x'\\rangle$. This class has been studied in [7] in the context of (\u03b5, \u03b3, \u03c4)-goodness and has been shown to benefit from generalization guarantees established based on the algorithmic stability theory [26].\n\nData generation We generate the source domain data as a set of 500 two-dimensional points drawn from a mixture of two Gaussian distributions with the same isotropic covariance matrices \u03c32I2 and mixing coefficients, where \u03c3 is the chosen standard deviation4. Each distribution represents one of the two classes 1 and \u22121, centered at (1, 1) and (\u22121, \u22121), respectively. The target data is generated from the same distribution as the source data by rotating the clusters' centers by angles varying from 0\u00b0 to 90\u00b0. Examples of the obtained data samples are given in Figure 1. We note that increasing the angle of rotation leads to an increasing divergence between the two domains.\n\nAlgorithmic implementation In accordance with Lemma 1, we consider two cases depending on whether $\\mathcal{T} \\ll \\mathcal{S}$ or not, and for both we train a similarity function on the generated data sample and search for a weighting function w : X \u2192 R such that the bounds are minimized. Note that from a theoretical point of view, the support of both distributions is R2, but in practice the regions that are far from the cluster centers rarely contain data points, so the support can be considered in a limited neighborhood around the centers. 
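The synthetic setup of the data generation paragraph can be reproduced in a few lines. This is a sketch under the stated choices (centers (1, 1) and (-1, -1), isotropic noise, rotated target centers); the function and variable names are ours, not the paper's:

```python
import numpy as np

def make_domain(angle_deg, n=500, sigma=0.5, seed=0):
    """Mixture of two isotropic Gaussians with centers (1, 1) and (-1, -1)
    rotated by angle_deg, mimicking the paper's synthetic data."""
    rng = np.random.default_rng(seed)
    t = np.deg2rad(angle_deg)
    R = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    c_pos = R @ np.array([1.0, 1.0])
    c_neg = R @ np.array([-1.0, -1.0])
    y = rng.choice([-1, 1], size=n)              # equal mixing coefficients
    centers = np.where(y[:, None] == 1, c_pos, c_neg)
    X = centers + sigma * rng.standard_normal((n, 2))
    return X, y

Xs, ys = make_domain(0)    # source domain
Xt, yt = make_domain(60)   # target domain: centers rotated by 60 degrees
print(Xs.shape, Xt.shape)
```

Increasing the rotation angle moves the target class centers away from the source ones, which is what drives the growing divergence reported in Figure 2.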
For both cases, we estimate the divergence term directly from the analytic expression of the density of the generating distribution by calculating a two-dimensional integral.\n\nAs in [7], we consider all of the source sample points as landmarks, and denote by $\\hat{S}_W$ the weighted sample empirical distribution defined on $S_W = \\{w_i x_i\\}_{i=1}^m$, where W := (w1, ..., wm). Depending on the assumption considered, we aim to solve the following optimization problem:\n\n$$\\min_{W \\in \\mathbb{R}^m,\\ M \\geq 0} J(W, M), \\quad \\text{s.t. } M \\geq l_\\gamma(y_i\\, g_{\\hat{S}_W}(x_i)),\\ \\forall i \\in \\{1, ..., m\\}, \\quad (5)$$\n\n4In the presented results, we set \u03c3 = 0.5 and provide the same results for other values of \u03c3 in the Supplementary material.\n\nFigure 1: Generated data for rotations of (left) 30\u00b0, (middle) 60\u00b0, and (right) 90\u00b0.\n\nFigure 2: Target goodness as a function of the rotation degree when (left) $\\mathcal{T} \\not\\ll \\mathcal{S}$ and (middle) $\\mathcal{T} \\ll \\mathcal{S}$. For both cases, the similarity function is obtained by solving (5). (right) Divergence values for both cases considered. We can observe that rotating the centers of the generating distribution increases both the L1 and \u03c72 divergences between the samples.\n\nwhere\n\n$$J(W, M) = \\begin{cases} E_{\\hat{S},\\hat{S}_W}(K_A) + \\sqrt{d_{\\chi^2_{+,\\gamma}}(\\mathcal{T}, \\mathcal{S})}\\, M \\sqrt{E_{\\hat{S},\\hat{S}_W}(K_A)}, & \\text{if } \\mathcal{T} \\ll \\mathcal{S}, \\\\\\\\ E_{\\hat{S},\\hat{S}_W}(K_A) + d_{1+,\\gamma}(\\mathcal{T}, \\mathcal{S})\\, M, & \\text{otherwise.} \\end{cases}$$\n\nResults In Figure 2, we plot the goodness of the similarity function on the target data set before and after adaptation, i.e., after solving the minimization problems described above. The results are computed for a rotation angle \u03b8 between 0\u00b0 and 90\u00b0, and after averaging over 30 draws of target samples. From this figure, we can see that the behaviour of the target goodness remains in line with the obtained theoretical results. In both cases considered, optimizing the bounds improves the performance over the \u201cno adaptation\u201d baseline. 
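Once the divergence term is precomputed, the two branches of the objective J(W, M) in (5) can be evaluated directly. The sketch below is our own illustration of that evaluation (the names and the uniform-weight baseline are assumptions; it evaluates the objective rather than solving the constrained minimization):

```python
import numpy as np

def weighted_goodness(X, y, W, A, gamma):
    """Empirical goodness with all source points as landmarks, reweighted
    by W: margins g(x_i) = sum_j w_j y_j <x_i, A x_j> / sum_j w_j."""
    sims = (X @ A) @ X.T                      # K_A(x_i, x_j) = <x_i, A x_j>
    g = (sims * (W * y)).sum(axis=1) / W.sum()
    losses = np.maximum(0.0, 1.0 - y * g / gamma)
    return losses.mean(), losses.max()

def objective(X, y, W, A, gamma, div, dominated):
    """The two branches of J(W, M), with `div` a precomputed divergence
    value (chi^2 or L1) and M taken as the empirical worst margin."""
    eps_hat, M = weighted_goodness(X, y, W, A, gamma)
    if dominated:  # T << S branch: divergence weighted by sqrt(goodness)
        return eps_hat + np.sqrt(div) * M * np.sqrt(eps_hat)
    return eps_hat + div * M

rng = np.random.default_rng(2)
X = np.vstack([rng.normal((1, 1), 0.5, (50, 2)),
               rng.normal((-1, -1), 0.5, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)
W = np.ones(100)   # uniform weights as a "no adaptation" baseline
A = np.eye(2)      # identity bilinear similarity
print(objective(X, y, W, A, gamma=0.5, div=0.3, dominated=True))
```

Plugging such an evaluation into any generic optimizer over W (and A) gives a simple way to reproduce the adaptation procedure described above.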
As expected, in the case of $\\mathcal{T} \\ll \\mathcal{S}$, the target goodness remains lower than when no absolute continuity is assumed, due to the minimization of the source error and the worst margin term, which impact the entire bound on the target goodness. Note that in the performed empirical evaluations, the divergence term remains constant for every considered rotation angle and is used only as a trade-off parameter. This choice is deliberate, as our goal is to show that minimizing the worst margin term and the source error can partially leverage the discrepancy between the two domains. Obviously, the obtained results can be improved by adding a term that properly aligns the two domains' distributions through instance reweighting or feature transformation.\n\n6 Conclusions and future perspectives\n\nIn this paper, we provided general theoretical guarantees for the similarity learning framework in the domain adaptation context. The obtained results contain a divergence term between the two domains' distributions, which naturally appears when bounding the deviation between the same similarity's performance on them, and a worst margin term measuring the worst error obtainable by the similarity function for some instance from the learning sample. Contrary to the previous generalization bounds established for the domain adaptation problem, we showed that when the source distribution dominates the target one, the bound can be improved via a $\\sqrt{\\epsilon}$ factor. We further analyzed the worst margin term and showed that its convergence to the true value depends on the complexity of the search space of the similarity function, as well as on the regularity of the hinge loss's cumulative distribution function in a neighborhood of its maximum (worst) value. 
In order to validate the usefulness of the proposed results, we showed empirically that minimizing the terms appearing in the obtained bounds improves performance over the "no adaptation" baseline without explicitly minimizing the divergence term.
In the future, our work can be extended in multiple directions. First, in our new definition of $(\epsilon, \gamma)$-goodness, the landmark distribution is assumed to be different from that used to generate the source and target data samples, and thus the question of whether there exists a landmark distribution leading to tighter bounds naturally arises. Second, it would be interesting to explore the semi-supervised scenario where the landmarks used to learn a similarity function are drawn from the source and target distributions at the same time. In this case, one can expect to obtain a result showing that the goodness of a similarity learned with source landmarks only is worse than that of one learned on a mixture distribution.

Acknowledgements
This work benefited from the support provided by the CNRS funding from the Défi Imag'In.

References
[1] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
[2] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In COLT, pages 144–152, 1992.
[3] Aurélien Bellet, Amaury Habrard, and Marc Sebban. A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709, 2013.
[4] Brian Kulis. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2013.
[5] Maria-Florina Balcan, Avrim Blum, and Nathan Srebro. A theory of learning with similarity functions. Machine Learning, 72(1-2):89–112, 2008.
[6] Maria-Florina Balcan, Avrim Blum, and Nathan Srebro.
Improved guarantees for learning via similarity functions. In COLT, pages 287–298, 2008.
[7] Aurélien Bellet, Amaury Habrard, and Marc Sebban. Similarity learning for provably accurate sparse linear classification. In ICML, 2012.
[8] Zheng-Chu Guo and Yiming Ying. Guaranteed classification via regularized similarity learning. Neural Computation, 26(3):497–522, 2014.
[9] Maria-Irina Nicolae, Éric Gaussier, Amaury Habrard, and Marc Sebban. Joint semi-supervised similarity learning for linear classification. In ECML/PKDD, pages 594–609, 2015.
[10] Maria-Irina Nicolae, Marc Sebban, Amaury Habrard, Éric Gaussier, and Massih-Reza Amini. Algorithmic robustness for semi-supervised $(\epsilon, \gamma, \tau)$-good metric learning. In ICONIP, 2015.
[11] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[12] Anna Margolis. A literature review on domain adaptation with unlabeled data, 2011.
[13] Vishal M. Patel, Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32(3):53–69, 2015.
[14] Karl Weiss, Taghi M. Khoshgoftaar, and Ding Wang. A survey of transfer learning. Journal of Big Data, 3(1), 2016.
[15] B. Geng, D. Tao, and C. Xu. DAML: Domain adaptation metric learning. IEEE Transactions on Image Processing, 20(10):2980–2989, 2011.
[16] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR, pages 1785–1792, 2011.
[17] Bin Cao, Xiaochuan Ni, Jian-Tao Sun, Gang Wang, and Qiang Yang. Distance metric learning under covariate shift. In IJCAI, pages 1204–1210, 2011.
[18] Emilie Morvant, Amaury Habrard, and Stéphane Ayache.
Parsimonious unsupervised and semi-supervised domain adaptation with good similarity functions. Knowledge and Information Systems, 33(2):309–349, 2012.
[19] Michaël Perrot and Amaury Habrard. A theoretical analysis of metric hypothesis transfer learning. In ICML, pages 1708–1717, 2015.
[20] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In ICML, pages 819–827, 2013.
[21] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
[22] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT, 2009.
[23] Corinna Cortes and Mehryar Mohri. Domain adaptation in regression. In ALT, 2011.
[24] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence. In UAI, pages 367–374, 2009.
[25] Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A new PAC-Bayesian perspective on domain adaptation. In ICML, volume 48, pages 859–868, 2016.
[26] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.