{"title": "Robust Classification Under Sample Selection Bias", "book": "Advances in Neural Information Processing Systems", "page_first": 37, "page_last": 45, "abstract": "In many important machine learning applications, the source distribution used to estimate a probabilistic classifier differs from the target distribution on which the classifier will be used to make predictions. Due to its asymptotic properties, sample-reweighted loss minimization is a commonly employed technique to deal with this difference. However, given finite amounts of labeled source data, this technique suffers from significant estimation errors in settings with large sample selection bias. We develop a framework for robustly learning a probabilistic classifier that adapts to different sample selection biases using a minimax estimation formulation. Our approach requires only accurate estimates of statistics under the source distribution and is otherwise as robust as possible to unknown properties of the conditional label distribution, except when explicit generalization assumptions are incorporated. We demonstrate the behavior and effectiveness of our approach on synthetic and UCI binary classification tasks.", "full_text": "Robust Classi\ufb01cation Under Sample Selection Bias\n\nAnqi Liu\n\nDepartment of Computer Science\nUniversity of Illinois at Chicago\n\nChicago, IL 60607\naliu33@uic.edu\n\nBrian D. Ziebart\n\nDepartment of Computer Science\nUniversity of Illinois at Chicago\n\nChicago, IL 60607\n\nbziebart@uic.edu\n\nAbstract\n\nIn many important machine learning applications, the source distribution used to\nestimate a probabilistic classi\ufb01er differs from the target distribution on which the\nclassi\ufb01er will be used to make predictions. Due to its asymptotic properties, sam-\nple reweighted empirical loss minimization is a commonly employed technique\nto deal with this difference. 
However, given \ufb01nite amounts of labeled source\ndata, this technique suffers from signi\ufb01cant estimation errors in settings with large\nsample selection bias. We develop a framework for learning a robust bias-aware\n(RBA) probabilistic classi\ufb01er that adapts to different sample selection biases using\na minimax estimation formulation. Our approach requires only accurate estimates\nof statistics under the source distribution and is otherwise as robust as possible\nto unknown properties of the conditional label distribution, except when explicit\ngeneralization assumptions are incorporated. We demonstrate the behavior and\neffectiveness of our approach on binary classi\ufb01cation tasks.\n\n1 Introduction\n\nThe goal of supervised machine learning is to use available source data to make predictions with\nthe smallest possible error (loss) on unlabeled target data. The vast majority of supervised learn-\ning techniques assume that source (training) data and target (testing) data are drawn from the same\ndistribution over pairs of example inputs and labels, P (x, y), from which the conditional label dis-\ntribution, P (y|x), is estimated as \u02c6P (y|x). In other words, data is assumed to be independent and\nidentically distributed (IID). For many machine learning applications, this assumption is not valid;\ne.g., survey response rates may vary by individuals\u2019 characteristics, medical results may only be\navailable from a non-representative demographic sample, or dataset labels may have been solicited\nusing active learning. These examples correspond to the covariate shift [1] or missing at random\n[2] setting where the source dataset distribution for training a classi\ufb01er and the target dataset distri-\nbution on which the classi\ufb01er is to be evaluated depend on the example input values, x, but not the\nlabels, y [1]. 
Despite the source data distribution, P (y|x)Psrc(x), and the target data distribution,\nP (y|x)Ptrg(x), sharing a common conditional label probability distribution, P (y|x), all (probabilistic)\nclassifiers, \u02c6P (y|x), are vulnerable to sample selection bias when the target data and the inductive\nbias of the classifier trained from source data samples, \u02dcPsrc(x) \u02dcP (y|x), do not match [3].\nWe propose a novel approach to classification that embraces the uncertainty resulting from sample\nselection bias by producing predictions that are explicitly robust to it. Our approach, based on minimax\nrobust estimation [4, 5], departs from the traditional statistics perspective by prescribing (rather\nthan assuming) a parametric distribution that, apart from matching known distribution statistics, is\nthe worst-case distribution possible for a given loss function. We use this approach to derive the robust\nbias-aware (RBA) probabilistic classifier. It robustly minimizes the logarithmic loss (logloss)\nof the target prediction task subject to known properties of data from the source distribution. The\nparameters of the classifier are optimized via convex optimization to match statistical properties\nmeasured from the source distribution. These statistics can be measured without the inaccuracies\nintroduced from estimating their relevance to the target distribution [1]. Our formulation requires\nany assumptions of statistical properties generalizing beyond the source distribution to be explicitly\nincorporated into the classifier\u2019s construction. We show that the prevalent importance weighting\napproach to covariate shift [1], which minimizes a sample reweighted logloss, is a special case of\nour approach for a particularly strong assumption: that source statistics fully generalize to the target\ndistribution. 
We apply our robust classification approach on synthetic and UCI binary classification\ndatasets [6] to compare its performance against sample reweighted approaches for learning under\nsample selection bias.\n\n2 Background and Related Work\n\nUnder the classical statistics perspective, a parametric model for the conditional label distribution,\ndenoted \u02c6P\u03b8(y|x), is first chosen (e.g., the logistic regression model), and then model parameters are\nestimated to minimize prediction loss on target data. When source and target data are drawn from\nthe same distribution, minimizing loss on samples of source data, \u02dcPsrc(x) \u02dcP (y|x),\n\n$$\\operatorname*{argmin}_{\\theta} \\; \\mathbb{E}_{\\tilde{P}_{\\mathrm{src}}(x)\\tilde{P}(y|x)}\\left[\\mathrm{loss}(\\hat{P}_{\\theta}(Y|X), Y)\\right], \\tag{1}$$\n\nefficiently converges to the target distribution (Ptrg(x)P (y|x)) loss minimizer. Unfortunately,\nminimizing the sample loss (1) when source and target distributions differ does not converge to the target\nloss minimizer. A preferred approach for dealing with this discrepancy is to use importance weighting\nto estimate the prediction loss under the target distribution by reweighting the source samples\naccording to the target-source density ratio, Ptrg(x)/Psrc(x) [1, 7]. We call this approach sample\nreweighted loss minimization, or the sample reweighted approach for short. Machine learning research\nhas primarily investigated sample selection bias from this perspective,\nwith various techniques for estimating the density ratio including kernel density estimation\n[1], discriminative estimation [8], Kullback-Leibler importance estimation [9], kernel mean matching\n[10, 11], maximum entropy methods [12], and minimax optimization [13]. 
Despite asymptotic\nguarantees of minimizing target distribution loss [1] (assuming Ptrg(x) > 0 \u21d2 Psrc(x) > 0),\n\n$$\\mathbb{E}_{P_{\\mathrm{trg}}(x)P(y|x)}\\left[\\mathrm{loss}(\\hat{P}_{\\theta}(Y|X), Y)\\right] = \\lim_{n \\to \\infty} \\underbrace{\\mathbb{E}_{\\tilde{P}^{(n)}_{\\mathrm{src}}(x)\\tilde{P}(y|x)}\\left[\\frac{P_{\\mathrm{trg}}(X)}{P_{\\mathrm{src}}(X)} \\mathrm{loss}(\\hat{P}_{\\theta}(Y|X), Y)\\right]}_{\\text{Sample reweighted objective function}}, \\tag{2}$$\n\nsample reweighting is often extremely inaccurate for finite sample datasets, \u02dcPsrc(x), when\nsample selection bias is large [14]. The reweighted loss (2) will often be dominated by a small\nnumber of datapoints with large importance weights (Figure 1). Minimizing loss primarily on these\ndatapoints often leads to target predictions with overly optimistic confidence. Additionally, the\nspecific datapoints with large importance weights vary greatly between random source samples, often\nleading to high variance model estimates. Formal theoretical limitations match these described\nshortcomings; generalization bounds on learning under sample selection bias using importance\nweighting have only been established when the second moment of sampled importance weights is\nbounded, EPsrc(x)[(Ptrg(X)/Psrc(X))^2] < \u221e [14], which imposes strong restrictions on the source and\ntarget distributions. For example, neither pair of distributions in Figure 1 satisfies this bound\nbecause the target distribution has \u201cfatter tails\u201d than the source distribution in some or all directions.\n\nFigure 1: Datapoints (with \u2018+\u2019 and \u2018o\u2019 labels) from two source distributions (Gaussians with\nsolid 95% confidence ovals) and the largest data point importance weights, Ptrg(x)/Psrc(x), under\nthe target distributions (Gaussian with dashed 95% confidence ovals). Panels: Dataset #1, Dataset #2.\n\nThough developed using similar tools, previous minimax formulations of learning under sample selection\nbias [15, 13] differ substantially from our approach. They consider the target distribution as\nbeing unknown and provide robustness to its worst-case assignment. 
The class of target distributions\nconsidered are those obtained by deleting a subset of measured statistics [15] or all possible\nreweightings of the sample source data [13]. Our approach, in contrast, obtains an estimate for\neach given target distribution that is robust to all the conditional label distributions matching source\nstatistics. While having an exact or well-estimated target distribution a priori may not be possible\nfor some applications, large amounts of unlabeled data enable this in many batch learning settings.\nA wide range of approaches for learning under sample selection bias and transfer learning leverage\nadditional assumptions or knowledge to improve predictions [16]. For example, a simple, but\neffective approach to domain adaptation [17] leverages some labeled target data to learn some relationships\nthat generalize across source and target datasets. Another recent method assumes that\nsource and target data are generated from mixtures of \u201cdomains\u201d and uses a learned mixture model\nto make predictions of target data based on more similar source data [18].\n\n3 Robust Bias-Aware Approach\n\nWe propose a novel approach for learning under sample selection bias that embraces the uncertainty\ninherent from shifted data by making predictions that are explicitly robust to it. This section\nmathematically formulates this motivating idea.\n\n3.1 Minimax robust estimation formulation\n\nMinimax robust estimation [4, 5] advocates for the worst case to be assumed about any unknown\ncharacteristics of a probability distribution. 
This provides a strong rationale for maximum entropy\nestimation methods [19] from which many familiar exponential family distributions (e.g., Gaussian,\nexponential, Laplacian, logistic regression, conditional random fields [20]) result by robustly\nminimizing logloss subject to constraints incorporating various known statistics [21].\nProbabilistic classification performance is measured by the conditional logloss (the negative conditional\nlog-likelihood), loglossPtrg(X)(P (Y |X), \u02c6P (Y |X)) \u225c EPtrg(x)P (y|x)[\u2212log \u02c6P (Y |X)], of the estimator,\n\u02c6P (Y |X), under an evaluation distribution (i.e., the target distribution, Ptrg(X)P (Y |X),\nfor the sample selection bias setting). We assume that a set of statistics, denoted as convex set\n\u039e, characterize the source distribution, Psrc(x, y). Using this loss function, Definition 1 forms a\nrobust minimax estimate [4, 5] of the conditional label distribution, \u02c6P (Y |X), using a worst-case\nconditional label distribution, \u02c7P (Y |X).\nDefinition 1. The robust bias-aware (RBA) probabilistic classifier is the saddle point solution of:\n\n$$\\min_{\\hat{P}(Y|X) \\in \\Delta} \\; \\max_{\\check{P}(Y|X) \\in \\Delta \\cap \\Xi} \\; \\mathrm{logloss}_{P_{\\mathrm{trg}}(X)}\\left(\\check{P}(Y|X), \\hat{P}(Y|X)\\right), \\tag{3}$$\n\nwhere \u0394 is the conditional probability simplex: \u2200x \u2208 X , y \u2208 Y : P (y|x) \u2265 0; \u2211y\u2032\u2208Y P (y\u2032|x) = 1.\nThis formulation can be interpreted as a two-player game [5] in which the estimator player first\nchooses \u02c6P (Y |X) to minimize the conditional logloss and then the evaluation player chooses distribution\n\u02c7P (Y |X) from the set of statistic-matching conditional label distributions to maximize conditional\nlogloss. This minimax game reduces to a maximum conditional entropy [19] problem:\nTheorem 1 ([5]). 
Assuming \u039e is a set of moment-matching constraints, EPsrc(x) \u02c6P (y|x)[f (X, Y )] =\nc \u225c EPsrc(x)P (y|x)[f (X, Y )], the solution of the minimax logloss game (3) maximizes the target\ndistribution conditional entropy subject to matching statistics on the source distribution:\n\n$$\\max_{\\hat{P}(Y|X) \\in \\Delta} H_{P_{\\mathrm{trg}}(x), \\hat{P}(y|x)}(Y|X) \\quad \\text{such that:} \\quad \\mathbb{E}_{P_{\\mathrm{src}}(x)\\hat{P}(y|x)}\\left[f(X, Y)\\right] = c. \\tag{4}$$\n\nConceptually, the solution to this optimization (4) has low certainty where the target density is high\nby matching the source distribution statistics primarily where the target density is low.\n\n3.2 Parametric form of the RBA classifier\n\nUsing tools from convex optimization [22], the solution to the dual of our constrained optimization\nproblem (4) has a parametric form (Theorem 2) with Lagrange multiplier parameters, \u03b8, weighting\nthe feature functions, f (x, y), that constrain the conditional label distribution estimate (4) (derivation\nin Appendix A). The density ratio, Psrc(x)/Ptrg(x), scales the distribution\u2019s prediction certainty to\nincrease when the ratio is large and decrease when it is small.\n\nFigure 2: Probabilistic predictions from logistic regression, sample reweighted logloss minimization,\nand robust bias-aware models (\u00a74.1) given labeled data (\u2018+\u2019 and \u2018o\u2019 classes) sampled from the\nsource distribution (solid oval indicating Gaussian covariance) and a target distribution (dashed oval\nGaussian covariance) for first-order moment statistics (i.e., f (x, y) = [y, yx_1, yx_2]^T ). Panels: Logistic regression, Reweighted, Robust bias-aware.\n\nTheorem 2. 
The robust bias-aware (RBA) classifier for target distribution Ptrg(x) estimated from\nstatistics of source distribution Psrc(x) has the form:\n\n$$\\hat{P}_{\\theta}(y|x) = \\frac{e^{\\frac{P_{\\mathrm{src}}(x)}{P_{\\mathrm{trg}}(x)} \\theta \\cdot f(x,y)}}{\\sum_{y' \\in \\mathcal{Y}} e^{\\frac{P_{\\mathrm{src}}(x)}{P_{\\mathrm{trg}}(x)} \\theta \\cdot f(x,y')}}, \\tag{5}$$\n\nwhich is parameterized by Lagrange multipliers \u03b8. The Lagrangian dual optimization\nproblem selects these parameters to maximize the target distribution log likelihood:\nmax_\u03b8 EPtrg(x)P (y|x)[log \u02c6P\u03b8(Y |X)].\nUnlike the sample reweighting approach, our approach does not require that target distribution support\nimplies source distribution support (i.e., Ptrg(x) > 0 \u21d2 Psrc(x) > 0 is not required). Where\ntarget support vanishes (i.e., Ptrg(x) \u2192 0), the classifier\u2019s prediction is extremely certain, and where\nsource support vanishes (i.e., Psrc(x) = 0), the classifier\u2019s prediction is a uniform distribution. The\ncritical difference in addressing sample selection bias is illustrated in Figure 2. Logistic regression\nand sample reweighted loss minimization (2) extrapolate in the face of uncertainty to make strong\npredictions without sufficient supporting evidence, while the RBA approach is robust to uncertainty\nthat is inherent when learning from finite shifted data samples. In this example, prediction uncertainty\nis large at all tail fringes of the source distribution for the robust approach. In contrast, there\nis a high degree of certainty for both the logistic regression and sample reweighted approaches in\nportions of those regions (e.g., the bottom left and top right). This is due to the strong inductive\nbiases of those approaches being applied to portions of the input space where there is sparse evidence\nto support them. 
The conceptual argument against this strong inductive generalization is\nthat the labels of datapoints in these tail fringe regions could take either value and negligibly affect\nthe source distribution statistics. Given this ambiguity, the robust approach suggests much more\nagnostic predictions.\nThe choice of statistics, f (x, y) (also known as features), employed in the model plays a much\ndifferent role in the RBA approach than in traditional IID learning methods. Rather than determining\nthe manner in which the model generalizes, as in logistic regression, features should be chosen that\nprevent the robust model from \u201cpushing\u201d all of its certainty away from the target distribution. This\nis illustrated in Figure 3. With only first moment constraints, the predictions in the denser portions\nof the target distribution have fairly high uncertainty under the RBA method. The larger number\nof constraints enforced by the second-order mixed moment statistics preserve more of the original\ndistribution using the RBA predictions, leading to higher certainty in those target regions.\n\nFigure 3: The prediction setting of Figure 2 with partially overlapping source and target densities\nfor first-order (top) and second-order (bottom) mixed-moments statistics (i.e., f (x, y) =\n[y, yx_1, yx_2, yx_1^2, yx_1x_2, yx_2^2]^T ). Logistic regression and the sample reweighted approach make\nhigh-certainty predictions in portions of the input space that have high target density. These predictions\nare made despite the sparseness of sampled source data in those regions (e.g., the upper-right\nportion of the target distribution). 
In contrast, the robust approach \u201cpushes\u201d its more certain predictions\nto areas where the target density is less. Columns: Logistic regression, Reweighted, Robust bias-aware. Rows: First moment, Second moment.\n\n3.3 Regularization and parameter estimation\n\nIn practice, the characteristics of the source distribution, \u039e, are not precisely known. Instead, empirical\nestimates for moment-matching constraints, \u02dcc \u225c E \u02dcPsrc(x) \u02dcP (y|x)[f (X, Y )], are available, but\nare prone to sampling error. When the constraints of (4) are relaxed using various convex norms,\n||\u02dcc \u2212 E \u02dcPsrc(x) \u02c6P (y|x)[f (X, Y )]|| \u2264 \u03b5, the RBA classifier is obtained by \u21131- or \u21132-regularized maximum\nconditional likelihood estimation (Theorem 2) of the dual optimization problem [23, 24],\n\n$$\\theta = \\operatorname*{argmax}_{\\theta} \\; \\mathbb{E}_{P_{\\mathrm{trg}}(x)P(y|x)}\\left[\\log \\hat{P}_{\\theta}(Y|X)\\right] - \\epsilon \\|\\theta\\|. \\tag{6}$$\n\nThe regularization parameters in this approach can be chosen using straightforward bounds on finite\nsampling error [24]. In contrast, the sample reweighted approach to learning under sample selection\nbias [1, 7] also makes use of regularization [9], but appropriate regularization parameters for it must\nbe haphazardly chosen based on how well the source samples represent the target data.\nMaximizing this regularized target conditional likelihood (6) appears difficult because target data\nfrom Ptrg(x)P (y|x) is unavailable. We avoid the sample reweighted approach (2) [1, 7], due to its\ninaccuracies when facing distributions with large differences in bias given finite samples. Instead,\nwe use the gradient of the regularized target conditional likelihood and only rely on source samples\nadequately approximating the source distribution statistics (a standard assumption for IID learning):\n\n$$\\nabla_{\\theta} \\mathbb{E}_{P_{\\mathrm{trg}}(x)P(y|x)}\\left[\\log \\hat{P}_{\\theta}(Y|X)\\right] = \\tilde{c} - \\mathbb{E}_{\\tilde{P}_{\\mathrm{src}}(x)\\hat{P}(y|x)}\\left[f(X, Y)\\right]. \\tag{7}$$\n\nAlgorithm 1 is a batch gradient algorithm for parameter estimation under our model. 
It does not\nrequire objective function calculations and converges to a global optimum due to convexity [22].\n\nAlgorithm 1 Batch gradient for robust bias-aware classifier learning.\nInput: Dataset {(xi, yi)}, source density Psrc(x), target density Ptrg(x), feature function f (x, y),\n  measured statistics \u02dcc, (decaying) learning rate {\u03b3t}, regularizer \u03b5, convergence threshold \u03c4\nOutput: Model parameters \u03b8\n\u03b8 \u2190 0\nrepeat\n  \u03c8(xi, y) \u2190 (Psrc(xi)/Ptrg(xi)) \u03b8 \u00b7 f (xi, y) for all: dataset examples i, labels y\n  \u02c6P (Yi = y|xi) \u2190 e^{\u03c8(xi,y)} / \u2211y\u2032 e^{\u03c8(xi,y\u2032)} for all: dataset examples i, labels y\n  \u2207L \u2190 \u02dcc \u2212 (1/N) \u2211i=1..N \u2211y\u2208Y \u02c6P (Yi = y|xi) f (xi, y)\n  \u03b8 \u2190 \u03b8 + \u03b3t(\u2207L + \u03b5\u2207\u03b8||\u03b8||)\nuntil ||\u03b5\u2207\u03b8||\u03b8|| + \u2207L|| \u2264 \u03c4\nreturn \u03b8\n\n3.4 Incorporating expert knowledge and generalizing the reweighted approach\n\nIn many settings, expert knowledge may be available to construct the constraint set \u039e instead of, or\nin addition to, statistics \u02dcc \u225c E \u02dcPsrc(x) \u02dcP (y|x)[f (X, Y )] estimated from source data. Expert-provided\nsource distributions, feature functions, and constraint statistic values, respectively denoted P\u2032src(x),\nf\u2032(x, y), and c\u2032, can be specified to express a range of assumptions about the conditional label\ndistribution and how it generalizes. Theorem 3 establishes that for empirically-based constraints\nprovided by the expert, EPtrg(x) \u02c6P (y|x)[f (X, Y )] = \u02dcc\u2032 \u225c E \u02dcPsrc(x) \u02dcP (y|x)[(Ptrg(X)/Psrc(X))f (X, Y )],\ncorresponding to strong source-to-target feature generalization assumptions, P\u2032src(x) \u225c Ptrg(x),\nreweighted logloss minimization is a special case of our robust bias-aware approach.\nTheorem 3. 
When direct feature generalization of reweighting source samples to the target distribution\nis assumed, the constraints become EPtrg(x) \u02c6P (y|x)[f (X, Y )] = \u02dcc\u2032 \u225c\nE \u02dcPsrc(x) \u02dcP (y|x)[(Ptrg(X)/Psrc(X)) f (X, Y )] and the RBA classifier minimizes sample reweighted logloss (2).\n\nThis equivalence suggests that if there is expert knowledge that reweighted source statistics are representative\nof the target distribution, then these strong generalization assumptions should be included\nas constraints in the RBA predictor, which results in the sample reweighted approach\u00b9.\n\nFigure 4: The robust estimation setting of Figure 3 (bottom, right) with assumed Gaussian feature\ndistribution generalization (dashed-dotted oval) incorporated into the density ratio. Three increasingly\nbroad generalization distributions lead to reduced target prediction uncertainty.\n\nWeaker expert knowledge can also be incorporated. Figure 4 shows various assumptions of how\nwidely sample reweighted statistics are representative across the input space. As the generalization
As the generalization\nassumptions are made to align more closely with the target distribution (Figure 4), the regions of\nuncertainty shrink substantially.\n\n1Similar to the previous section, relaxed constraints ||\u02dcc0 E \u02dcPsrc(x) \u02c6P (y|x)[f (X, Y )]|| \uf8ff \u270f, are employed in\n\npractice and parameters are obtained by maximizing the regularized conditional likelihood as in (6).\n\n6\n\n\f4 Experiments and Comparisons\n\n4.1 Comparative approaches and implementation details\n\nsource\n\nlogistic\n\nconditional\n\nregression maximizes\n\nlearning classi\ufb01ers from biased sample source data:\nWe compare three approaches for\nsource data,\n(a)\n(b) sample reweighted target logistic regression\nmax\u2713 E \u02dcPsrc(x) \u02dcP (y|x)[log P\u2713(Y |X) \u270f||\u2713||];\nminimizes the conditional likelihood of source data reweighted to the target distribution (2),\nmax\u2713 E \u02dcPsrc(x) \u02dcP (y|x)[(Ptrg(x)/Psrc(x)) log P\u2713(Y |X) \u270f||\u2713||]; and robust bias-aware classi\ufb01ca-\ntion robustly minimizes target distribution logloss (5) trained using direct gradient calculations\n(7). As statistics/features for these approaches, we consider nth order uni-input moments, e.g.,\n3x5x6, . . .. We employ the CVX pack-\nyx1, yx2\nage [25] to estimate parameters of the \ufb01rst two approaches and batch gradient ascent (Algorithm 1)\nfor our robust approach.\n\n3 , . . ., and mixed moments, e.g., yx1, yx1x2, yx2\n\nlikelihood on the\n\n2, yxn\n\n4.2 Empirical performance evaluations and comparisons\n\nWe empirically compare the predictive performance of the three approaches. We consider four\nclassi\ufb01cation datasets, selected from the UCI repository [6] based on the criteria that each contains\nroughly 1,000 or more examples, has discretely-valued inputs, and has minimal missing values. 
We\nreduce multi-class prediction tasks into binary prediction tasks by combining labels into two groups\nbased on the plurality class, as described in Table 1.\n\nTable 1: Datasets for empirical evaluation\n\nDataset      Features  Examples  Negative labels   Positive labels\nMushroom     22        8,124     Edible            Poisonous\nCar          6         1,728     Not acceptable    all others\nTic-tac-toe  9         958       \u2018X\u2019 does not win  \u2018X\u2019 wins\nNursery      8         12,960    Not recommended   all others\n\nWe generate biased subsets of these classification datasets to use as source samples and unbiased\nsubsets to use as target samples. We create source data bias by sampling a random likelihood function\nfrom a Dirichlet distribution and then sampling source data without replacement in proportion\nto each datapoint\u2019s likelihood. We stress the inherent difficulties of the prediction task that results;\nlabel imbalance in the source samples is common, despite sampling independently from the example\nlabel (given input values) due to source samples being drawn from focused portions of the input\nspace. We combine the likelihood function and statistics from each sample to form na\u00efve source and\ntarget distribution estimates. The complete details are described in Appendix C, including bounds\nimposed on the source-target ratios to limit the effects of inaccuracies from the source and target\ndistribution estimates.\nWe evaluate the source logistic regression model, the reweighted maximum likelihood model,\nand our bias-adaptive robust approach. For each, we use first-order and second-order non-mixed\nstatistics: x_1^2y, x_2^2y, . . . , x_K^2y, x_1y, x_2y, . . . , x_Ky. For each dataset, we evaluate target distribution\nlogloss, E \u02dcPtrg(x) \u02dcP (y|x)[\u2212log \u02c6P (Y |X)], averaged over 50 random biased source and unbiased target\nsamples. We employ log2 for our loss, which conveniently provides a baseline logloss of 1 for a uniform\ndistribution. 
We note that with exceedingly large regularization, all parameters will be driven\nto zero, enabling each approach to achieve this baseline level of logloss. Unfortunately, since target\nlabels are assumed not to be available in this problem, obtaining optimal regularization via cross-validation\nis not possible. After trying a range of \u21132-regularization weights (Appendix C), we find\nthat heavy \u21132-regularization is needed for the logistic regression model and the reweighted model in\nour experiments. Without this heavy regularization, the logloss is often extremely high. In contrast,\nheavy regularization for the robust approach is not necessary; we employ only a mild amount of\n\u21132-regularization corresponding to source statistic estimation error.\nWe show a comparison of individual predictions from the reweighted approach and the robust approach\nfor the Car dataset on the left of Figure 5. The pairs of logloss measures for each of the 50\nsampled source and target datasets are shown in the scatter plot. For some of the samples, the inductive\nbiases of the reweighted approach provide better predictions (left of the dotted line). However,\nfor many of the samples, the inductive biases do not fit the target distribution well and this leads to\nmuch higher logloss.\n\nFigure 5: Left: Logloss comparison for 50 source and target distribution samples between the\nrobust and reweighted approaches for the Car classification task. Right: Average logloss with 95%\nconfidence intervals for logistic regression, reweighted logistic regression, and bias-adaptive robust\ntarget classifier on four UCI classification tasks.\n\nThe average logloss for each approach and dataset is shown on the right of Figure 5. The robust\napproach provides better performance than the baseline uniform distribution (logloss of 1) with\nstatistical significance for all datasets. 
For the \ufb01rst three datasets, the other two approaches are signi\ufb01-\ncantly worse than this baseline. The con\ufb01dence intervals for logistic regression and the reweighted\nmodel tend to be signi\ufb01cantly larger than the robust approach because of the variability in how well\ntheir inductive biases generalize to the target distribution for each sample. However, the robust ap-\nproach is not a panacea for all sample selection bias problems; the No Free Lunch theorem [26] still\napplies. We see this with the Nursery dataset, in which the inductive biases of the logistic regression\nand reweighted approaches do tend to hold across both distributions, providing better predictions.\n\n5 Discussion and Conclusions\n\nIn this paper, we have developed a novel minimax approach for probabilistic classi\ufb01cation under\nsample selection bias. Our approach provides the parametric distribution (5) that minimizes worst-\ncase logloss (Def. 1), and that can be estimated as a convex optimization problem (Alg. 1). We\nshowed that sample reweighted logloss minimization [1, 7] is a special case of our approach using\nvery strong assumptions about how statistics generalize to the target distribution (Thm. 3). We\nillustrated the predictions of our approach in two toy settings and how those predictions compare\nto the more-certain alternative methods. We also demonstrated consistent \u201cbetter than uninformed\u201d\nprediction performance using four UCI classi\ufb01cation datasets\u2014three of which prove to be extremely\ndif\ufb01cult for other sample selection bias approaches.\nWe have treated density estimation of the source and target distributions, or estimating their ratios,\nas an orthogonal problem in this work. However, we believe many of the density estimation and\ndensity ratio estimation methods developed for sample reweighted logloss minimization [1, 8, 9, 10,\n11, 12, 13] will prove to be bene\ufb01cial in our bias-adaptive robust approach as well. 
We additionally\nplan to investigate the use of other loss functions and extensions to other prediction problems using\nour robust approach to sample selection bias.\n\nAcknowledgments\n\nThis material is based upon work supported by the National Science Foundation under Grant No.\n#1227495, Purposeful Prediction: Co-robot Interaction via Understanding Intent and Goals.\n\nReferences\n\n[1] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the\nlog-likelihood function. Journal of Statistical Planning and Inference, 90(2):227\u2013244, 2000.\n\n[2] Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons,\nInc., New York, NY, USA, 1986.\n\n[3] Wei Fan, Ian Davidson, Bianca Zadrozny, and Philip S. Yu. An improved categorization of classifier\u2019s\nsensitivity on sample selection bias. In Proc. of the IEEE International Conference on Data Mining, pages\n605\u2013608, 2005.\n\n[4] Flemming Tops\u00f8e. Information theoretical optimization techniques. Kybernetika, 15(1):8\u201327, 1979.\n\n[5] Peter D. Gr\u00fcnwald and A. Phillip Dawid. Game theory, maximum entropy, minimum discrepancy, and\nrobust Bayesian decision theory. Annals of Statistics, 32:1367\u20131433, 2004.\n\n[6] Kevin Bache and Moshe Lichman. UCI machine learning repository, 2013.\n\n[7] Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the\nInternational Conference on Machine Learning, pages 903\u2013910. ACM, 2004.\n\n[8] Steffen Bickel, Michael Br\u00fcckner, and Tobias Scheffer. Discriminative learning under covariate shift.\nJournal of Machine Learning Research, 10:2137\u20132155, 2009.\n\n[9] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V. Buenau, and Motoaki Kawanabe. Direct\nimportance estimation with model selection and its application to covariate shift adaptation. 
In Advances in Neural Information Processing Systems, pages 1433\u20131440, 2008.\n\n[10] Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard Sch\u00f6lkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, pages 601\u2013608, 2006.\n\n[11] Yaoliang Yu and Csaba Szepesv\u00e1ri. Analysis of kernel mean matching under covariate shift. In Proc. of the International Conference on Machine Learning, pages 607\u2013614, 2012.\n\n[12] Miroslav Dud\u00edk, Robert E. Schapire, and Steven J. Phillips. Correcting sample selection bias in maximum entropy density estimation. In Advances in Neural Information Processing Systems, pages 323\u2013330, 2005.\n\n[13] Junfeng Wen, Chun-Nam Yu, and Russ Greiner. Robust learning under uncertain test distributions: Relating covariate shift to model misspecification. In Proc. of the International Conference on Machine Learning, pages 631\u2013639, 2014.\n\n[14] Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pages 442\u2013450, 2010.\n\n[15] Amir Globerson, Choon Hui Teo, Alex Smola, and Sam Roweis. An adversarial view of covariate shift and a minimax approach. In Joaquin Qui\u00f1onero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence, editors, Dataset Shift in Machine Learning, pages 179\u2013198. MIT Press, Cambridge, MA, USA, 2009.\n\n[16] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345\u20131359, 2010.\n\n[17] Hal Daum\u00e9 III. Frustratingly easy domain adaptation. In Conference of the Association for Computational Linguistics, pages 256\u2013263, 2007.\n\n[18] Boqing Gong, Kristen Grauman, and Fei Sha. 
Reshaping visual datasets for domain adaptation. In Advances in Neural Information Processing Systems, pages 1286\u20131294, 2013.\n\n[19] Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620\u2013630, 1957.\n\n[20] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the International Conference on Machine Learning, pages 282\u2013289, 2001.\n\n[21] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1\u2013305, 2008.\n\n[22] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[23] Miroslav Dud\u00edk and Robert E. Schapire. Maximum entropy distribution estimation with generalized regularization. In Learning Theory, pages 123\u2013138. Springer Berlin Heidelberg, 2006.\n\n[24] Yasemin Altun and Alex Smola. Unifying divergence minimization and statistical inference via convex duality. In Learning Theory, pages 139\u2013153. Springer Berlin Heidelberg, 2006.\n\n[25] Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014.\n\n[26] David H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Comput., 8(7):1341\u20131390, 1996.", "award": [], "sourceid": 39, "authors": [{"given_name": "Anqi", "family_name": "Liu", "institution": "University of Illinois at Chicago"}, {"given_name": "Brian", "family_name": "Ziebart", "institution": "Univ. of Illinois at Chicago"}]}