{"title": "Transfer Learning by Distribution Matching for Targeted Advertising", "book": "Advances in Neural Information Processing Systems", "page_first": 145, "page_last": 152, "abstract": "We address the problem of learning classifiers for several related tasks that may differ in their joint distribution of input and output variables. For each task, small - possibly even empty - labeled samples and large unlabeled samples are available. While the unlabeled samples reflect the target distribution, the labeled samples may be biased. We derive a solution that produces resampling weights which match the pool of all examples to the target distribution of any given task. Our work is motivated by the problem of predicting sociodemographic features for users of web portals, based on the content which they have accessed. Here, questionnaires offered to a small portion of each portal's users produce biased samples. Transfer learning enables us to make predictions even for new portals with few or no training data and improves the overall prediction accuracy.", "full_text": "Transfer Learning by Distribution Matching\n\nfor Targeted Advertising\n\nSteffen Bickel, Christoph Sawade, and Tobias Scheffer\n\nUniversity of Potsdam, Germany\n\n{bickel, sawade, scheffer}@cs.uni-potsdam.de\n\nAbstract\n\nWe address the problem of learning classi\ufb01ers for several related tasks that may\ndiffer in their joint distribution of input and output variables. For each task, small \u2013\npossibly even empty \u2013 labeled samples and large unlabeled samples are available.\nWhile the unlabeled samples re\ufb02ect the target distribution, the labeled samples\nmay be biased. This setting is motivated by the problem of predicting sociodemo-\ngraphic features for users of web portals, based on the content which they have\naccessed. Here, questionnaires offered to a portion of each portal\u2019s users produce\nbiased samples. 
We derive a transfer learning procedure that produces resampling weights which match the pool of all examples to the target distribution of any given task. Transfer learning enables us to make predictions even for new portals with few or no training data and improves the overall prediction accuracy.

1 Introduction

We study a problem setting of transfer learning in which classifiers for multiple tasks have to be learned from biased samples. Some of the multiple tasks will likely relate to one another, but one cannot assume that the tasks share a joint conditional distribution of the class label given the input variables. The challenge of multi-task learning is to come to a good generalization across tasks: each task should benefit from the wealth of data available for the entirety of tasks, but the optimization criterion needs to remain tied to the individual task at hand.

A common method for learning under covariate shift (marginal shift) is to weight the biased training examples by the test-to-training ratio ptest(x)/ptrain(x) to match the marginal distribution of the test data [1]. Instead of separately estimating the two potentially high-dimensional densities, one can directly estimate the density ratio - by kernel mean matching [2], by minimization of the KL divergence between test and weighted training data [3], or by discrimination of training against test data with a probabilistic classifier [4].

Hierarchical Bayesian models are a standard statistical approach to multi-task learning [5, 6, 7]. Here, a common prior on model parameters across tasks captures the task dependencies. Similar to the idea of learning under marginal shift by weighting the training examples, [8] devise a method for learning under joint shift of covariates and labels over multiple tasks that is based on instance-specific rescaling factors. 
We generalize this idea to a setting where not only the joint distributions between tasks may differ but also the training and test distributions within each task.

Our work is motivated by the targeted advertising problem, for which the goal is to predict sociodemographic features (such as gender, age, or marital status) of web users, based on their surfing history. Many types of products are specifically targeted at clearly defined market segments, and marketing organizations seek to disseminate their message at minimal cost per delivery to a targeted individual. When sociodemographic attributes can be identified, delivering advertisements to users outside the target segment can be avoided. For some campaigns, clicks and resulting online purchases constitute an ultimate success criterion. However, for many campaigns - including campaigns for products that are not typically purchased on the web - the sole goal is to deliver the advertisement to customers in the target segment.

The paper is structured as follows. Section 2 defines the problem setting. In Section 3, we devise our transfer learning model. We empirically study transfer learning for targeted advertising in Section 4, and Section 5 concludes.

2 Problem Setting

We consider the following multi-task learning scenario. Each of several tasks z is characterized by an unknown joint distribution ptest(x, y|z) = ptest(x|z)p(y|x, z) over features x and labels y given the task z. The joint distributions of different tasks may differ arbitrarily, but usually some tasks have similar distributions. An unlabeled test sample T = ⟨(x1, z1), . . . , (xm, zm)⟩ with examples from different tasks is available. For each test example, the attributes xi and the originating task zi are known. The test data for task z are governed by ptest(x|z).

A labeled training set L = ⟨(xm+1, ym+1, zm+1), . . . 
, (xm+n, ym+n, zm+n)⟩ collects examples from several tasks. In addition to xi and zi, the label yi is known for each example. The training data for task z are drawn from a joint distribution ptrain(x, y|z) = ptrain(x|z)p(y|x, z) that may differ from the test distribution in terms of the marginal distribution ptrain(x|z). The training and test marginals may differ arbitrarily, as long as each x with positive ptest(x|z) also has a positive ptrain(x|z). This guarantees that the training distribution covers the entire support of the test distribution for each task. The conditional distribution p(y|x, z) of test and training data is identical for a given task z, but the conditionals can differ arbitrarily between tasks. The entire training set over all tasks is governed by the mixed density ptrain(z)ptrain(x, y|z). The prior ptrain(z) specifies the task proportions. There may be tasks with only few or no labeled data.

The goal is to learn a hypothesis fz : x ↦ y for each task z. This hypothesis fz(x) should correctly predict the true label y of unseen examples drawn from p(x|z) for all z. That is, it should minimize the expected loss

  E(x,y)∼ptest(x,y|z)[ℓ(fz(x), y)]

with respect to the unknown distribution ptest(x, y|z) for each individual z.

This abstract problem setting models the targeted advertising application as follows. The feature vector x encodes the web surfing behavior of a user of web portal z (the task). For a small number of users, the sociodemographic target label y (e.g., the gender of the user) is collected through web surveys. For new portals, the number of such labeled training instances is initially small. The sociodemographic labels for all users of all portals are to be predicted. The joint distribution ptest(x, y|z) can be different between portals since they attract specific populations of users. 
The training distribution differs from the test distribution because the response to the web surveys is not uniform with respect to the test distribution. Since the completion of surveys cannot be enforced, it is intrinsically impossible to obtain labeled samples that are governed by the test distribution. Therefore, a possible difference between the conditionals ptest(y|x, z) and ptrain(y|x, z) cannot be reflected in the model.

One reference strategy is to learn individual models for each target task z by minimizing an appropriate loss function on the portion Lz = {(xi, yi, zi) ∈ L : zi = z}. This procedure does not exploit data of related tasks. In addition, it minimizes the loss with respect to ptrain(x, y|z); the minimum of this optimization problem will not generally coincide with the minimal loss on ptest(x, y|z). The other extreme is a one-size-fits-all model f*(x) trained on the pooled training sample L. The training sample may deviate arbitrarily from the target distribution ptest(x, y|z).

In order to describe the following model accurately, we introduce a selector variable s which distinguishes the training (s = 1) from the test distribution (s = -1). Symbol ptrain(x, y|z) is a shorthand for p(x, y|z, s=1); likewise, ptest(x, y|z) = p(x, y|z, s=-1).

3 Transfer Learning by Distribution Matching

In learning a classifier ft(x) for target task t, we seek to minimize the loss function with respect to ptest(x, y|t) = p(x, y|t, s=-1). Both t and z are values of the random variable task; value t identifies the current target task. Simply pooling the available data for all tasks would create a sample governed by Σz p(z|s=1)p(x, y|z, s=1). Our approach is to create a task-specific resampling weight rt(x, y) for each element of the pool of examples. The resampling weights match the pool distribution to the target distribution p(x, y|t, s=-1). The resampled pool is governed by the correct target distribution, but is larger than the labeled sample of the target task. Instead of sampling from the pool, one can weight the loss incurred by each instance by the resampling weight. The expected weighted loss with respect to the mixture distribution that governs the pool equals the loss with respect to the target distribution p(x, y|t, s=-1). Equation 1 defines the condition that the resampling weights have to satisfy. In the following, we will show that

  E(x,y)∼p(x,y|t,s=-1)[ℓ(f(x, t), y)] = E(x,y)∼Σz p(z|s=1)p(x,y|z,s=1)[rt(x, y) ℓ(f(x, t), y)]   (1)

  rt(x, y) = [ p(x, y|t, s=1) / Σz p(z|s=1)p(x, y|z, s=1) ] · [ p(x|t, s=-1) / p(x|t, s=1) ]   (2)

satisfies Equation 1. Equation 3 expands the expectation and introduces two fractions that equal one. We can factorize p(x, y|t, s=-1) and expand the sum over z in the numerator to run over the entire expression because the integral over (x, y) is independent of z (Equation 4). Equation 5 rearranges some terms, and Equation 6 is the expected loss over the distribution of all tasks weighted by rt(x, y).

  E(x,y)∼p(x,y|t,s=-1)[ℓ(f(x, t), y)]
  = ∫ [ Σz p(z|s=1)p(x, y|z, s=1) / Σz' p(z'|s=1)p(x, y|z', s=1) ] · [ p(x|t, s=1) / p(x|t, s=1) ] · p(x, y|t, s=-1) ℓ(f(x, t), y) dx dy   (3)
  = ∫ Σz p(z|s=1)p(x, y|z, s=1) · [ p(x|t, s=1) p(y|x, t) / Σz' p(z'|s=1)p(x, y|z', s=1) ] · [ p(x|t, s=-1) / p(x|t, s=1) ] ℓ(f(x, t), y) dx dy   (4)
  = ∫ Σz p(z|s=1)p(x, y|z, s=1) · [ p(x, y|t, s=1) / Σz' p(z'|s=1)p(x, y|z', s=1) ] · [ p(x|t, s=-1) / p(x|t, s=1) ] ℓ(f(x, t), y) dx dy   (5)
  = E(x,y)∼Σz p(z|s=1)p(x,y|z,s=1)[ [ p(x, y|t, s=1) / Σz' p(z'|s=1)p(x, y|z', s=1) ] · [ p(x|t, s=-1) / p(x|t, s=1) ] · ℓ(f(x, t), y) ]   (6)

Equation 6 signifies that we can train a hypothesis for task t by minimizing the expected loss over the distribution of all tasks weighted by rt(x, y). This amounts to minimizing the expected loss with respect to the target distribution p(x, y|t, s=-1). The resampling weights of Equation 2 have an intuitive interpretation: the first fraction accounts for the difference in the joint distributions across tasks, and the second fraction accounts for the covariate shift within the target task.

Equation 2 leaves us with the problem of estimating the product of two density ratios, rt(x, y) = [ p(x, y|t, s=1) / Σz p(z|s=1)p(x, y|z, s=1) ] · [ p(x|t, s=-1) / p(x|t, s=1) ]. One might be tempted to train four separate density estimators, one for each of the two numerators and the two denominators. 
However, obtaining estimators for potentially high-dimensional densities is unnecessarily difficult because ultimately only a scalar weight is required for each example.

3.1 Discriminative Density Ratio Models

In this section, we derive a discriminative model that directly estimates the two factors r1t(x, y) = p(x, y|t, s=1) / Σz p(z|s=1)p(x, y|z, s=1) and r2t(x) = p(x|t, s=-1) / p(x|t, s=1) of the resampling weights rt(x, y) = r1t(x, y) r2t(x) without estimating the individual densities.

We reformulate the first density ratio r1t(x, y) in terms of a conditional model p(t|x, y, s=1). This conditional has the following intuitive meaning: Given that an instance (x, y) has been drawn at random from the pool distribution Σz p(z|s=1)p(x, y|z, s=1) over all tasks (including target task t), the probability that (x, y) originates from p(x, y|t, s=1) is p(t|x, y, s=1). The following equations assume that the prior on the size of the target sample is greater than zero, p(t|s=1) > 0. In Equation 7, Bayes' rule is applied to the numerator and z is summed out in the denominator. Equation 8 follows by dropping the normalization factor p(t|s=1) and by canceling p(x, y|s=1).

  r1t(x, y) = p(x, y|t, s=1) / Σz p(z|s=1)p(x, y|z, s=1) = [ p(t|x, y, s=1) p(x, y|s=1) / p(t|s=1) ] / p(x, y|s=1)   (7)
            ∝ p(t|x, y, s=1)   (8)

The significance of Equation 8 is that it shows how the first factor of the resampling weights r1t(x, y) can be determined without knowledge of any of the task densities p(x, y|z, s=1). 
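Equation 8 suggests a simple recipe: train any probabilistic classifier to discriminate labeled target-task examples against the pooled labeled examples of the remaining tasks, and use its predicted probability as the (unnormalized) weight. The following sketch illustrates this on synthetic data; the data, the use of scikit-learn, and the choice to condition on x alone (the paper discriminates on (x, y) via a class-conditional feature mapping) are illustrative assumptions, not the authors' implementation.

```python
# Sketch: r1_t(x, y) is proportional to p(t | x, y, s=1), estimated by a
# classifier that separates target-task examples from the rest of the pool.
# Synthetic data; for brevity the model conditions on x only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(1200, 3))          # pooled labeled examples of all tasks
z = rng.integers(0, 3, size=1200)       # task ids; task 0 plays the target t
X[z == 0] += 0.7                        # the target task has a shifted distribution

is_target = (z == 0).astype(int)        # 1 if the example belongs to task t
clf = LogisticRegression().fit(X, is_target)

r1 = clf.predict_proba(X)[:, 1]         # proportional to p(t | x, s=1)
r1 *= len(r1) / r1.sum()                # normalize: weighted pool averages to one
```

Examples that resemble the target task receive weights above one; examples typical of the other tasks are down-weighted, which is exactly the effect the first factor of rt(x, y) should have.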
The right-hand side of Equation 8 can be evaluated based on a model p(t|x, y, s=1) that discriminates labeled instances of the target task against labeled instances of the pool of examples for all non-target tasks.

Similar to the first density ratio, the second density ratio r2t(x) = p(x|t, s=-1) / p(x|t, s=1) can be expressed using a conditional model p(s=1|x, t). In Equation 9, Bayes' rule is applied twice. The two terms of p(x|t) cancel each other out, p(s=1|t)/p(s=-1|t) is just a normalization factor, and since p(s=-1|x, t) = 1 - p(s=1|x, t), Equation 10 follows.

  r2t(x) = p(x|t, s=-1) / p(x|t, s=1) = [ p(s=-1|x, t) p(x|t) / p(s=-1|t) ] / [ p(s=1|x, t) p(x|t) / p(s=1|t) ]   (9)
         ∝ 1 / p(s=1|x, t) - 1   (10)

The significance of the above derivations is that instead of the four potentially high-dimensional densities in rt(x, y), only two conditional distributions with binary variables (Equations 8 and 10) need to be estimated. One can apply any probabilistic classifier to this estimation.

3.2 Estimation of Discriminative Density Ratios

For the estimation of r1t(x, y) we model p(t|x, y, s=1) of Equation 8 with a logistic regression model

  p(t|x, y, s=1, ut) = 1 / (1 + exp(-utᵀΦ(x, y)))

over model parameters ut using a problem-specific feature mapping Φ(x, y). We define this mapping for binary labels as Φ(x, y) = [δ(y, +1)Φ(x); δ(y, -1)Φ(x)], where δ is the Kronecker delta. In the absence of prior knowledge about the similarity of classes, input features x of examples with different class labels y are mapped to disjoint subsets of the feature vector. With this feature mapping the models for positive and negative examples do not interact and can be trained independently. Any suitable mapping Φ(x) can be applied. 
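The class-conditional mapping can be written down directly. The following is a minimal sketch of our own (the identity map as the inner Φ(x) is an assumption for illustration):

```python
# Sketch of the feature mapping Φ(x, y) = [δ(y, +1)·Φ(x); δ(y, -1)·Φ(x)]
# for binary labels y in {+1, -1}, with Φ(x) = x for simplicity.
import numpy as np

def phi(x, y):
    """Place x in the first block for y = +1, in the second block for y = -1."""
    x = np.asarray(x, dtype=float)
    out = np.zeros(2 * x.size)
    if y == +1:
        out[: x.size] = x
    else:
        out[x.size :] = x
    return out
```

Because positive and negative examples occupy disjoint blocks, utᵀΦ(x, +1) depends only on the first half of ut and utᵀΦ(x, -1) only on the second half, so the two class-wise models can indeed be trained independently.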
In [8], p(t|x, y, s=1) is modeled for all tasks jointly in a single optimization problem with a soft-max model. Empirically, we observe that a separate binary logistic regression model (as described above) for each task yields more accurate results, with the drawback of a slightly increased overall training time.

Optimization Problem 1 For task t: over parameters ut, maximize

  Σ(x,y)∈Lt log p(t|x, y, s=1, ut) + Σ(x,y)∈L\Lt log(1 - p(t|x, y, s=1, ut)) - utᵀut / (2σu²).

The solution of Optimization Problem 1 is a MAP estimate of the logistic regression using a Gaussian prior on ut. The estimated vector ut leads to the first part of the weighting factor, r̂1t(x, y) ∝ p(t|x, y, s=1, ut), according to Equation 8. r̂1t(x, y) is normalized so that the weighted empirical distribution over the pool L sums to one, (1/|L|) Σ(x,y)∈L r̂1t(x, y) = 1.

According to Equation 10, the density ratio r2t(x) = p(x|t, s=-1) / p(x|t, s=1) ∝ 1/p(s=1|x, t) - 1 can be inferred from p(s=1|x, t), which is the likelihood that a given x for task t originates from the training distribution, as opposed to from the test distribution. A model of p(s=1|x, t) can be obtained by discriminating a sample governed by p(x|t, s=1) against a sample governed by p(x|t, s=-1) using a probabilistic classifier. Unlabeled test data Tt is governed by p(x|t, s=-1). The labeled pool L over all training examples weighted by r̂1t(x, y) can serve as a sample governed by p(x|t, s=1); the labels y can be ignored for this step. Empirically, we find that using the weighted pool L instead of just Lt (as used by [4]) achieves better results because the former sample is larger. We model p(s=1|x, vt) of Equation 10 with a regularized logistic regression on target variable s with parameters vt (Optimization Problem 2). 
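The covariate-shift factor of Equation 10 thus also reduces to a binary classification problem: discriminate training inputs from unlabeled test inputs and set r̂2t(x) ∝ 1/p(s=1|x) - 1. A self-contained sketch on synthetic data follows; the data, the use of scikit-learn, and the omission of the r̂1t(x, y) pool weights (treated as uniform here) are illustrative assumptions, not the authors' code.

```python
# Sketch: estimate r2_t(x) = p(x|t, s=-1)/p(x|t, s=1) via a train-vs-test
# classifier (Equation 10), with uniform r1 weights for simplicity.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 2))  # biased training marginal p(x|t, s=1)
X_test = rng.normal(0.5, 1.0, size=(1000, 2))   # target marginal p(x|t, s=-1)

X = np.vstack([X_train, X_test])
s = np.concatenate([np.ones(len(X_train)), np.zeros(len(X_test))])  # 1 = train, 0 = test

clf = LogisticRegression().fit(X, s)
p_s1 = clf.predict_proba(X_train)[:, 1]  # p(s=1 | x) on the training sample

r2 = 1.0 / p_s1 - 1.0        # Equation 10, up to normalization
r2 *= len(r2) / r2.sum()     # weighted training sample averages to one
```

Training points that look like test points obtain p(s=1|x) below the base rate and hence weights above one, so the weighted training sample mimics the test marginal.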
Labeled examples L are weighted by the estimated first factor r̂1t(x, y) using the outcome of Optimization Problem 1.

Optimization Problem 2 For task t: over parameters vt, maximize

  Σ(x,y)∈L r̂1t(x, y) log p(s=1|x, vt) + Σx∈Tt log p(s=-1|x, vt) - vtᵀvt / (2σv²).

With the result of Optimization Problem 2, the estimate for the second factor is r̂2t(x) ∝ 1/p(s=1|x, vt) - 1, according to Equation 10. r̂2t(x) is normalized so that the final weighted empirical distribution over the pool sums to one, (1/|L|) Σ(x,y)∈L r̂1t(x, y) r̂2t(x) = 1.

3.3 Weighted Empirical Loss and Target Model

The learning procedure first determines resampling weights r̂t(x, y) = r̂1t(x, y) r̂2t(x) by solving Optimization Problems 1 and 2. These weights can now be used to reweight the labeled pool over all tasks and train the target model for task t. Using the weights, we can evaluate the expected loss over the weighted training data as displayed in Equation 11. It is the regularized empirical counterpart of Equation 6.

  E(x,y)∼L[ r̂1t(x, y) r̂2t(x) ℓ(f(x, t), y) ] + wtᵀwt / (2σw²)   (11)

Optimization Problem 3 minimizes Equation 11, the weighted regularized loss over the training data, using a standard Gaussian log-prior with variance σw² on the parameters wt. 
Each example is weighted by the two discriminatively estimated density fractions from Equations 8 and 10 using the solution of Optimization Problems 1 and 2.

Optimization Problem 3 For task t: over parameters wt, minimize

  (1/|L|) Σ(x,y)∈L r̂1t(x, y) r̂2t(x) ℓ(f(x, wt), y) + wtᵀwt / (2σw²).

In order to train target models for all tasks, instances of Optimization Problems 1 to 3 are solved for each task.

4 Targeted Advertising

We study the benefit of distribution matching and other reference methods on targeted advertising for four web portals. The portals play the role of tasks. We manually assign topic labels, out of a fixed set of 373 topics, to all web pages on all portals. For each user, the topics of the surfed pages are tracked and the topic counts are stored in cookies of the user's web browser. The average number of surfed topics per user over all portals is 17. The feature vector x of a specific surfer is the normalized 373-dimensional vector of topic counts.

A small proportion of users is asked to fill out a web questionnaire that collects sociodemographic user profiles. About 25% of the questionnaires get completely filled out (accepted); in 75% of the cases the user declines to fill out the questionnaire. The accepted questionnaires constitute the training data L. The completion of the questionnaire cannot be enforced, and it is therefore not possible to obtain labeled data that is governed by the test distribution of all users who surf the target portal. In order to evaluate the model, we approximate the distribution of users who reject the questionnaire as follows. We take users who have answered the very first survey question (gender) but have then discontinued the survey as an approximation of the reject set. 
We add the correct proportion (25%) of users who have taken the survey, and thereby construct a sample that is governed by an approximation of the test distribution. Consequently, in our experiments we use the binary target label y ∈ {male, female}. Table 1 gives an overview of the data set.

Table 1: Portal statistics: number of accepted, partially rejected, and test examples (mix of all partial reject (=75%) and 25% accept); ratio of male users in training (accept) and test set.

portal      | # accept | # partial reject | # test | % male training | % male test
family      |     8073 |             2035 |   2713 |           53.8% |       46.6%
TV channel  |     8848 |             1192 |   1589 |           50.5% |       50.1%
news 1      |     3051 |              149 |    199 |           79.4% |       76.7%
news 2      |     2247 |              143 |    191 |           73.0% |       76.0%

We compare distribution matching on labeled and unlabeled data (Optimization Problems 1 to 3), and distribution matching only on labeled data (by setting r̂2t(x) = 1 in Optimization Problem 3), to the following reference models. The first baseline is a one-size-fits-all model that directly trains a logistic regression on L (setting r̂1t(x, y) r̂2t(x) = 1 in Optimization Problem 3). The second baseline is a logistic regression trained only on Lt, the training examples of the target task. Training only on the reweighted target task data and correcting for marginal shift with respect to the unlabeled test data is the third baseline [4]. The last reference method is a hierarchical Bayesian model. Evgeniou and Pontil [6] describe a feature mapping for regularized regression models that corresponds to hierarchical Bayes with a Gaussian prior on the regression parameters of the tasks. 
Training a logistic regression with their feature mapping over training examples from all tasks is equivalent to a joint MAP estimation of all model parameters and the mean of the Gaussian prior.

We evaluate the methods using all training examples from non-target tasks and different numbers of training examples of the target task. From all available accept examples of the target task we randomly select a certain number (0-1600) of training examples. From the remaining accept examples of the target task we randomly select an appropriate number and add them to all partial reject examples of the target task, so that the evaluation set has the right proportions as described above. We repeat this process ten times and report the average accuracies of all methods.

We use a logistic loss as the target loss of distribution matching in Optimization Problem 3 and all reference methods. We compare kernelized variants of Optimization Problems 1 to 3 with RBF, polynomial, and linear kernels and find the linear kernel to achieve the best performance on our data set. All reported results are based on models with linear kernels. For the optimization of the logistic regression models we use trust region Newton descent [9].

We tune parameters σu, σv, and σw with grid search by executing the following steps.

1. σu is tuned by nested ten-fold cross-validation. The outer loop is a cross-validation on Lt. In each loop, Optimization Problem 1 is solved on L¬t merged with the current training folds of Lt.

   • The inner loop temporarily tunes σw by cross-validation on rescaled L¬t merged with the rescaled current training folds of Lt. At this point σw cannot be finally tuned because σv has not been tuned yet. In each loop, Optimization Problem 3 is solved with fixed r̂2t(x) = 1. 
The temporary σw is chosen to maximize the accuracy on the tuning folds.

   Optimization Problem 3 is solved for each outer loop with the temporary σw and with r̂2t(x) = 1. The final σu is chosen to maximize the accuracy on the tuning folds of Lt over all outer loops.

2. σv is tuned by likelihood cross-validation on Tt ∪ L. The labels of the labeled data are ignored for this step. Test data Tt of the target task as well as the weighted pool L (weighted by r̂1t(x, y), based on the previously tuned σu) are split into ten folds. With the nine training folds of the test data and the nine training folds of the weighted pool L, Optimization Problem 2 is solved. Parameter σv is chosen to maximize the log-likelihood

     Σ(x,y)∈Ltune r̂1t(x, y) log p(s=1|x, vt) + Σx∈Tt,tune log p(s=-1|x, vt)

   on the tuning folds of test data and weighted pool (denoted by Ltune and Tt,tune) over all ten cross-validation loops.

   Applying non-uniform weights to labeled data (some of which may even be zero) reduces the effective sample size. This leads to a bias-variance trade-off [1]: training on unweighted data causes a bias; applying non-uniform weights reduces the sample size and increases the variance of the estimator. We follow [1] and smooth the estimated weights by r̂2t(x)^λ before including them into Optimization Problem 3. The smoothing parameter λ biases the weights towards uniformity and thereby controls the trade-off. Without looking at the test data of the target task, we tune λ on the non-target tasks so that the accuracy of the distribution matching method is maximized. This procedure usually results in λ values around 0.3.

3. Finally, σw is tuned by cross-validation on L rescaled by r̂1t(x, y) r̂2t(x) (based on the previously tuned parameters σu and σv). In each cross-validation loop, Optimization Problem 3 is solved.

Figure 1: Accuracy over different numbers of training examples for the target portal; the four panels show the portals family, TV channel, news 1, and news 2, and the curves compare distribution matching on labeled and unlabeled data, distribution matching on labeled data, hierarchical Bayes, one-size-fits-all on the pool of labeled data, training only on labeled data of the target task, and training on labeled and unlabeled data of the target task. Error bars indicate the standard error of the differences to distribution matching on labeled data.

Figure 1 displays the accuracies over different numbers of labeled data for the four different target portals. The error bars are the standard errors of the differences to the distribution matching method on labeled data (solid blue line).

For the family and TV channel portals, the distribution matching method on labeled and unlabeled data outperforms all other methods in almost all cases. The distribution matching method on labeled data outperforms the baselines trained only on the data of the target task for all portals and all data set sizes, and it is at least as good as the one-size-fits-all model in almost all cases. The hierarchical Bayesian method yields low accuracies for smaller numbers of training examples but becomes comparable to the distribution matching method when training set sizes of the target portal increase. The simple covariate shift model that trains only on labeled and unlabeled data of the target task does not improve over the iid model that only trains on the labeled data of the target task. 
This indicates that the marginal shift between training and test distributions is small, or it could indicate that the approximation of the reject distribution which we use in our experimentation is not sufficiently close. Either reason also explains why accounting for the marginal shift in the distribution matching method does not always improve over distribution matching using only labeled data.

Transfer learning by distribution matching passes all examples of all tasks to the underlying logistic regressions. This is computationally more expensive than the reference methods. For example, the single-task baseline trains only one logistic regression on the examples of the target task. Empirically, we observe that all methods scale linearly in the number of training examples.

5 Conclusion

We derived a multi-task learning method that is based on the insight that the expected loss with respect to the unbiased test distribution of the target task is equivalent to the expected loss over the biased training examples of all tasks, weighted by a task-specific resampling weight. This led to an algorithm that discriminatively estimates these resampling weights by training two simple conditional models. After weighting the pooled examples over all tasks, the target model for a specific task can be trained.

In our empirical study on targeted advertising, we found that distribution matching using labeled data outperforms all reference methods in almost all cases; the differences are particularly large for small sample sizes. Distribution matching with labeled and unlabeled data outperforms the reference methods and distribution matching with only labeled data in two out of four portals. 
Even with no labeled data of the target task, the performance of the distribution matching method is comparable to training on 1600 examples of the target task for all portals.

Acknowledgments

We gratefully acknowledge support by nugg.ad AG and the German Science Foundation DFG. We wish to thank Stephan Noller and the nugg.ad team for their valuable contributions.

References

[1] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90:227-244, 2000.
[2] J. Huang, A. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, 2007.
[3] M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems, 2008.
[4] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the International Conference on Machine Learning, 2007.
[5] A. Schwaighofer, V. Tresp, and K. Yu. Learning Gaussian process kernels via hierarchical Bayes. In Advances in Neural Information Processing Systems, 2005.
[6] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 109-117, 2004.
[7] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8:35-63, 2007.
[8] S. Bickel, J. Bogojeska, T. Lengauer, and T. Scheffer. Multi-task learning for HIV therapy screening. In Proceedings of the International Conference on Machine Learning, 2008.
[9] C. Lin, R. Weng, and S. Keerthi. 
Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627-650, 2008.
", "award": [], "sourceid": 739, "authors": [{"given_name": "Steffen", "family_name": "Bickel", "institution": null}, {"given_name": "Christoph", "family_name": "Sawade", "institution": null}, {"given_name": "Tobias", "family_name": "Scheffer", "institution": null}]}