{"title": "Generative and Discriminative Learning with Unknown Labeling Bias", "book": "Advances in Neural Information Processing Systems", "page_first": 401, "page_last": 408, "abstract": "We apply robust Bayesian decision theory to improve both generative and discriminative learners under bias in class proportions in labeled training data, when the true class proportions are unknown. For the generative case, we derive an entropy-based weighting that maximizes expected log likelihood under the worst-case true class proportions. For the discriminative case, we derive a multinomial logistic model that minimizes worst-case conditional log loss. We apply our theory to the modeling of species geographic distributions from presence data, an extreme case of label bias since there is no absence data. On a benchmark dataset, we find that entropy-based weighting offers an improvement over constant estimates of class proportions, consistently reducing log loss on unbiased test data.", "full_text": "Generative and Discriminative Learning with\n\nUnknown Labeling Bias\n\nMiroslav Dud\u00b4\u0131k\n\nCarnegie Mellon University\n\n5000 Forbes Ave, Pittsburgh, PA 15213\n\nSteven J. Phillips\n\nAT&T Labs \u2212 Research\n\n180 Park Ave, Florham Park, NJ 07932\n\nmdudik@cmu.edu\n\nphillips@research.att.com\n\nAbstract\n\nWe apply robust Bayesian decision theory to improve both generative and discrim-\ninative learners under bias in class proportions in labeled training data, when the\ntrue class proportions are unknown. For the generative case, we derive an entropy-\nbased weighting that maximizes expected log likelihood under the worst-case true\nclass proportions. For the discriminative case, we derive a multinomial logistic\nmodel that minimizes worst-case conditional log loss. We apply our theory to the\nmodeling of species geographic distributions from presence data, an extreme case\nof labeling bias since there is no absence data. 
On a benchmark dataset, we \ufb01nd\nthat entropy-based weighting offers an improvement over constant estimates of\nclass proportions, consistently reducing log loss on unbiased test data.\n\n1 Introduction\n\nIn many real-world classi\ufb01cation problems, it is not equally easy or affordable to verify membership\nin different classes. Thus, class proportions in labeled data may signi\ufb01cantly differ from true class\nproportions. In an extreme case, labeled data for an entire class might be missing (for example,\nnegative experimental results are typically not published). A naively trained learner may perform\npoorly on test data that is not similarly af\ufb02icted by labeling bias. Several techniques address labeling\nbias in the context of cost-sensitive learning and learning from imbalanced data [5, 11, 2]. If the\nlabeling bias is known or can be estimated, and all classes appear in the training set, a model trained\non biased data can be corrected by reweighting [5]. When the labeling bias is unknown, a model is\noften selected using threshold-independent analysis such as ROC curves [11]. A good ROC curve,\nhowever, does not guarantee a low loss on test data. Here, we are concerned with situations when\nthe labeling bias is unknown and some classes may be missing, but we have access to unlabeled\ndata. We want to construct models that in addition to good ROC-based performance, also yield\nlow test loss. We will be concerned with minimizing joint and conditional log loss, or equivalently,\nmaximizing joint and conditional log likelihood.\n\nOur work is motivated by the application of modeling species\u2019 geographic distributions from occur-\nrence data. The data consists of a set of locations within some region (for example, the Australian\nwet tropics) where a species (such as the golden bowerbird) was observed, and a set of features such\nas precipitation and temperature, describing environmental conditions at each location. 
Species dis-\ntribution modeling suffers from extreme imbalance in training data: we often only have information\nabout species presence (positive examples), but no information about species absence (negative ex-\namples). We do, however, have unlabeled data, obtained either by randomly sampling locations\nfrom the region [4], or pooling presence data for several species collected with similar methods to\nyield a representative sample of locations which biologists have surveyed [13].\n\nPrevious statistical methods for species distribution modeling can be divided into three main ap-\nproaches. The \ufb01rst interprets all unlabeled data as examples of species absence and learns a rule\n\n\fto discriminate them from presences [19, 4]. The second embeds a discriminative learner in the\nEM algorithm in order to infer presences and absences in unlabeled data; this explicitly requires\nknowledge of true class probabilities [17]. The third models the presences alone, which is known in\nmachine learning as one-class estimation [14, 7]. When using the \ufb01rst approach, the training data is\ncommonly reweighted so that positive and negative examples have the same weight [4]; this models\na quantity monotonically related to conditional probability of presence [13], with the relationship\ndepending on true class probabilities. If we use y to denote the binary variable indicating presence\nand x to denote a location on the map, then the \ufb01rst two approaches yield models of conditional\nprobability p(y = 1|x), given estimates of true class probabilities. On the other hand, the main in-\nstantiation of the third approach, maximum entropy density estimation (maxent) [14] yields a model\nof the distribution p(x|y = 1). To convert this to an estimate of p(y = 1|x) (as is usually required,\nand necessary for measuring conditional log loss on which we focus here) again requires knowledge\nof the class probabilities p(y = 1) and p(y = 0). 
Thus, existing discriminative approaches (the \ufb01rst\nand second) as well as generative approaches (the third) require estimates of true class probabilities.\n\nWe apply robust Bayesian decision theory, which is closely related to the maximum entropy prin-\nciple [6], to derive conditional probability estimates p(y | x) that perform well under a wide range\nof test distributions. Our approach can be used to derive robust estimates of class probabilities p(y)\nwhich are then used to reweight discriminative models or to convert generative models into discrimi-\nnative ones. We present a treatment for the general multiclass problem, but our experiments focus on\none-class estimation and species distribution modeling in particular. Using an extensive evaluation\non real-world data, we show improvement in both generative and discriminative techniques.\n\nThroughout this paper we assume that the dif\ufb01culty of uncovering the true class label depends on the\nclass label y alone, but is independent of the example x. Even though this assumption is simplistic,\nwe will see that our approach yields signi\ufb01cant improvements. A related set of techniques estimates\nand corrects for the bias in sample selection, also known as covariate shift [9, 16, 18, 1, 13]. When\nthe bias can be decomposed into an estimable and inestimable part, the right approach might be to\nuse a combination of techniques presented in this paper and those for sample-selection bias.\n\n2 Robust Bayesian Estimation with Unknown Class Probabilities\n\nOur goal is to estimate an unknown conditional distribution \u03c0(y | x), where x \u2208 X is an example\nand y \u2208 Y is a label. The input consists of labeled examples (x1, y1), . . . , (xm, ym) and unlabeled\nexamples xm+1, . . . , xM . Each example x is described by a set of features fj : X \u2192 R, indexed\nby j \u2208 J. 
For simplicity, we assume that sets X, Y, and J are \ufb01nite, but we would like to allow the\nspace X and the set of features J to be very large.\n\nIn species distribution modeling from occurrence data, the space X corresponds to locations on the\nmap, features are various functions derived from the environmental variables, and the set Y contains\ntwo classes: presence (y = 1) and absence (y = 0) for a particular species. Labeled examples are\npresences of the species, e.g., recorded presence locations of the golden bowerbird, while unlabeled\nexamples are locations that have been surveyed by biologists, but neither presence nor absence was\nrecorded. The unlabeled examples can be obtained as presence locations of species observed by a\nsimilar protocol, for example other birds [13].\n\nWe posit a joint density \u03c0(x, y) and assume that examples are generated by the following process.\nFirst, a pair (x, y) is chosen according to \u03c0. We always get to see the example x, but the label y is\nrevealed with an unknown probability that depends on y and is independent of x. This means that\nwe have access to independent samples from \u03c0(x) and from \u03c0(x| y), but no information about \u03c0(y).\nIn our example, species presence is revealed with an unknown \ufb01xed probability whereas absence is\nrevealed with probability zero (i.e., never revealed).\n\n2.1 Robust Bayesian Estimation, Maximum Entropy, and Logistic Regression\n\nRobust Bayesian decision theory formulates an estimation problem as a zero-sum game between a\ndecision maker and nature [6]. In our case, the decision maker chooses an estimate p(x, y) while\nnature selects a joint density \u03c0(x, y). 
Using the available data, the decision maker forms a set P in which he believes nature\u2019s choice lies, and tries to minimize the worst-case loss under nature\u2019s choice. In this paper we are interested in minimizing the worst-case log loss relative to a fixed default estimate \u03bd (equivalently, maximizing the worst-case log likelihood ratio)\n\nmin_{p\u2208\u2206} max_{\u03c0\u2208P} E_\u03c0[ln(p(X, Y)/\u03bd(X, Y))] .\n\n(1)\n\nHere, \u2206 is the simplex of joint densities and E_\u03c0 is a shorthand for E_{X,Y\u223c\u03c0}. The default density \u03bd represents any prior information we have about \u03c0; if we have no prior information, \u03bd is typically the uniform density.\n\nGr\u00fcnwald and Dawid [6] show that the robust Bayesian problem (Eq. 1) is often equivalent to the minimum relative entropy problem\n\nmin_{p\u2208P} RE(p \u2016 \u03bd) ,\n\n(2)\n\nwhere RE(p \u2016 q) = E_p[ln(p(X, Y)/q(X, Y))] is the relative entropy, or Kullback-Leibler divergence, and measures the discrepancy between the distributions p and q. The formulation intuitively says that we should choose the density p which is closest to \u03bd while respecting the constraints P. When \u03bd is uniform, minimizing relative entropy is equivalent to maximizing the entropy H(p) = E_p[\u2212 ln p(X, Y)]. Hence, the approach is mainly referred to as maximum entropy [10], or maxent for short. The next theorem outlines the equivalence of robust Bayes and maxent for the case considered in this paper. It is a special case of Theorem 6.4 of [6].\n\nTheorem 1 (Equivalence of maxent and robust Bayes). Let X \u00d7 Y be a finite sample space, \u03bd a density on X \u00d7 Y, and P \u2286 \u2206 a closed convex set containing at least one density absolutely continuous w.r.t. \u03bd. Then Eqs. (1) and (2) have the same optimizers.\n\nFor the case without labeling bias, the set P is usually described in terms of equality constraints on moments of the joint distribution (feature expectations). Specifically, feature expectations with respect to p are required to equal their empirical averages. When features are functions of x, but the goal is to discriminate among the classes y, it is natural to consider a derived set of features which are versions of f_j(x) active solely in individual classes y (see for instance [8]). If we were to estimate the distribution of the golden bowerbird from presence-absence data, then the moment equality constraints require that the joint model p(x, y) match the average altitude of presence locations as well as the average altitude of absence locations (both weighted by their respective training proportions).\n\nWhen the number of samples is too small or the number of features too large, equality constraints lead to overfitting because the true distribution does not match the empirical averages exactly. Overfitting is alleviated by relaxing the constraints so that feature expectations are only required to lie within a certain distance of the sample averages [3].\n\nThe solution of Eq. (2) with equality or relaxed constraints can be shown to lie in an exponential family parameterized by \u03bb = \u27e8\u03bb_y\u27e9_{y\u2208Y}, \u03bb_y \u2208 R^J, containing the densities\n\nq_\u03bb(x, y) \u221d \u03bd(x, y) e^{\u03bb_y\u00b7f(x)} .\n\nThe optimizer of Eq. (2) is the unique density of this form which minimizes the empirical log loss\n\n(1/m) \u03a3_{i\u2264m} \u2212 ln q_\u03bb(x_i, y_i) ,\n\n(3)\n\npossibly with an additional \u21131-regularization term accounting for slacks in the equality constraints. (See [3] for a proof.)\n\nIn addition to constraints on moments of the joint distribution, it is possible to introduce constraints on marginals of p. 
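The exponential-family characterization above can be checked numerically. Below is a minimal sketch (toy sample space, a single feature f(x) = x, uniform default \u03bd; all data and names are made up for illustration, not from the paper) showing that gradient descent on the empirical log loss of q_\u03bb drives the model's class-specific feature expectations to the empirical averages, i.e., the moment constraints hold at the optimum:

```python
import numpy as np

# Toy sketch: the maxent/robust-Bayes solution with equality moment
# constraints lies in the exponential family q_lambda(x, y) ∝ exp(lambda_y·f(x))
# (uniform default nu).  All names and data below are illustrative.

X_vals = np.array([0.0, 1.0, 2.0])            # finite sample space, feature f(x) = x
samples = [(0.0, 0), (1.0, 0), (2.0, 1), (2.0, 1), (1.0, 1)]  # labeled pairs (x_i, y_i)
m = len(samples)

def q(lam):
    """Joint density q_lambda(x, y) over X x Y, normalized over all cells."""
    w = np.exp(np.outer(lam, X_vals))         # w[y, x] = exp(lambda_y * f(x))
    return w / w.sum()

# empirical class-specific feature averages: (1/m) sum_i f(x_i) 1{y_i = y}
emp = np.array([sum(x for x, y in samples if y == c) / m for c in (0, 1)])

lam = np.zeros(2)
for _ in range(30000):                        # plain gradient descent on the log loss
    model = (q(lam) * X_vals).sum(axis=1)     # E_q[f(X) 1{Y = y}]
    lam -= 0.2 * (model - emp)                # gradient of the empirical log loss

print(np.round((q(lam) * X_vals).sum(axis=1), 4))  # close to emp
```

The gradient of the empirical log loss with respect to \u03bb_y is exactly E_q[f(X)1{Y = y}] minus the corresponding empirical average, which is why moment matching certifies optimality here.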
The most common implementations of maxent impose marginal constraints p(x) = \u02dc\u03c0_lab(x), where \u02dc\u03c0_lab is the empirical distribution over labeled examples. The solution then takes the form q_\u03bb(x, y) = \u02dc\u03c0_lab(x) q_\u03bb(y | x), where q_\u03bb(y | x) is the multinomial logistic model\n\nq_\u03bb(y | x) \u221d \u03bd(y | x) e^{\u03bb_y\u00b7f(x)} .\n\nAs before, the maxent solution is the unique density of this form which minimizes the empirical log loss (Eq. 3). The minimization of Eq. (3) is equivalent to the minimization of the conditional log loss\n\n(1/m) \u03a3_{i\u2264m} \u2212 ln q_\u03bb(y_i | x_i) .\n\nHence, this approach corresponds to logistic regression. Since it only models the labeling process \u03c0(y | x), but not the sample generation \u03c0(x), it is known as discriminative training.\n\nThe case with equality constraints p(y) = \u02dc\u03c0_lab(y) has been analyzed, for example, by [8]. The solution has the form q_\u03bb(x, y) = \u02dc\u03c0_lab(y) q_\u03bb(x | y) with\n\nq_\u03bb(x | y) \u221d \u03bd(x | y) e^{\u03bb_y\u00b7f(x)} .\n\nLog loss can be minimized for each class separately, i.e., each \u03bb_y is the maximum likelihood estimate (possibly with regularization) of \u03c0(x | y). The joint estimate q_\u03bb(x, y) can be used to derive the conditional distribution q_\u03bb(y | x). Since this approach estimates the sample generating distributions of individual classes, it is known as generative training. Naive Bayes is a special case of generative training when \u03bd(x | y) = \u220f_j \u03bd_j(f_j(x) | y).\n\nThe two approaches presented in this paper can be viewed as generalizations of generative and discriminative training with two additional components: availability of unlabeled examples and lack of information about class probabilities. 
The former will influence the choice of the default \u03bd, the latter the form of the constraints P.\n\n2.2 Generative Training: Entropy-weighted Maxent\n\nWhen the number of labeled and unlabeled examples is sufficiently large, it is reasonable to assume that the empirical distribution \u02dc\u03c0(x) over all examples (labeled and unlabeled) is a faithful representation of \u03c0(x). Thus, we consider defaults with \u03bd(x) = \u02dc\u03c0(x), shown to work well in species distribution modeling [13]. For simplicity, we assume that \u03bd(y | x) does not depend on x and focus on \u03bd(x, y) = \u02dc\u03c0(x)\u03bd(y). Other options are possible. For example, when the number of examples is small, \u02dc\u03c0(x) might be replaced by an estimate of \u03c0(x). The distribution \u03bd(y) can be chosen uniform across y, but if some classes are known to be rarer than others then a non-uniform estimate will perform better. In Section 3, we analyze the impact of this choice.\n\nConstraints on moments of the joint distribution, such as those in the previous section, will misspecify the true moments in the presence of labeling bias. However, as discussed earlier, labeled examples from each class y approximate the conditional distributions \u03c0(x | y). Thus, instead of constraining joint expectations, we constrain conditional expectations E_p[f_j(X) | y]. In general, we consider robust Bayes and maxent problems with the set P of the form P = {p \u2208 \u2206 : p^y_X \u2208 P^y_X}, where p^y_X denotes the |X|-dimensional vector of conditional probabilities p(x | y) and P^y_X expresses the constraints on p^y_X. For example, relaxed constraints for class y are expressed as\n\n\u2200j : |E_p[f_j(X) | y] \u2212 \u02dc\u00b5^y_j| \u2264 \u03b2^y_j ,\n\n(4)\n\nwhere \u02dc\u00b5^y_j is the empirical average of f_j among labeled examples in class y and \u03b2^y_j are estimates of the deviations of these averages from the true expectations. Similar to [14], we use standard-error-like deviation estimates \u03b2^y_j = \u03b2\u02dc\u03c3^y_j/\u221am_y, where \u03b2 is a single tuning constant, \u02dc\u03c3^y_j is the empirical standard deviation of f_j among labeled examples in class y, and m_y is the number of labeled examples in class y. When m_y equals 0, we choose \u03b2^y_j = \u221e and thus leave feature expectations unconstrained.\n\nThe next theorem and the following corollary show that robust Bayes (and also maxent) with the constraint set P of the form above yield estimators similar to generative training. In addition to the notation p^y_X for conditional densities, we use the notation p_Y and p_X to denote vectors of marginal probabilities p(y) and p(x), respectively. For example, the empirical distribution over examples is denoted \u02dc\u03c0_X.\n\nTheorem 2. Let P^y_X, y \u2208 Y, be closed convex sets of densities over X and P = {p \u2208 \u2206 : p^y_X \u2208 P^y_X}. If P contains at least one density absolutely continuous w.r.t. \u03bd then robust Bayes and maxent over P are equivalent. The solution \u02c6p has the form \u02c6p(y)\u02c6p(x | y) where the class-conditional densities \u02c6p^y_X minimize RE(p^y_X \u2016 \u02dc\u03c0_X) among p^y_X \u2208 P^y_X and\n\n\u02c6p(y) \u221d \u03bd(y) e^{\u2212RE(\u02c6p^y_X \u2016 \u02dc\u03c0_X)} .\n\n(5)\n\nProof. It is not too difficult to verify that the set P is a closed convex set of joint densities, so the equivalence of robust Bayes and maxent follows from Theorem 1. To prove the remainder, we rewrite the maxent objective as\n\nRE(p \u2016 \u03bd) = RE(p_Y \u2016 \u03bd_Y) + \u03a3_y p(y) RE(p^y_X \u2016 \u02dc\u03c0_X) .\n\nThe maxent problem is then equivalent to\n\nmin_{p_Y} [ RE(p_Y \u2016 \u03bd_Y) + \u03a3_y p(y) min_{p^y_X \u2208 P^y_X} RE(p^y_X \u2016 \u02dc\u03c0_X) ]\n= min_{p_Y} [ \u03a3_y p(y) ln(p(y)/\u03bd(y)) + \u03a3_y p(y) RE(\u02c6p^y_X \u2016 \u02dc\u03c0_X) ]\n= min_{p_Y} [ \u03a3_y p(y) ln( p(y) / (\u03bd(y) e^{\u2212RE(\u02c6p^y_X \u2016 \u02dc\u03c0_X)}) ) ]\n= const. + min_{p_Y} RE(p_Y \u2016 \u02c6p_Y) .\n\nSince RE(p \u2016 q) is minimized for p = q, we indeed obtain that for the minimizing p, p_Y = \u02c6p_Y.\n\nTheorem 2 generalizes to the case when in addition to constraining p^y_X to lie in P^y_X, we also constrain p_Y to lie in a closed convex set P_Y. The solution then takes the form p(y)\u02c6p(x | y) with \u02c6p(x | y) as in the theorem, but with p(y) minimizing RE(p_Y \u2016 \u02c6p_Y) subject to p_Y \u2208 P_Y. Unlike generative training without labeling bias, the class-conditional densities in the theorem above influence the class probabilities. When the sets P^y_X are specified using the constraints of Eq. (4), then \u02c6p has a form derived from regularized maximum likelihood estimates in an exponential family (see, e.g., [3]):\n\nCorollary 3. If the sets P^y_X are specified by the inequality constraints of Eq. (4) then robust Bayes and maxent are equivalent. The class-conditional densities \u02c6p(x | y) of the solution take the form\n\nq_\u03bb(x | y) \u221d \u02dc\u03c0(x) e^{\u02c6\u03bb_y\u00b7f(x)}\n\n(6)\n\nand solve the single-class regularized maximum likelihood problems\n\nmin_{\u03bb_y} { \u03a3_{i : y_i = y} [\u2212 ln q_\u03bb(x_i | y)] + m_y \u03a3_{j\u2208J} \u03b2^y_j |\u03bb^y_j| } .\n\n(7)\n\nOne-class Estimation. In one-class estimation problems, there are two classes (0 and 1), but we only have access to labeled examples from one class (e.g., class 1). In species distribution modeling, we only have access to presence records of the species. Based on the labeled examples, we derive a set of constraints on p(x | y = 1), but leave p(x | y = 0) unconstrained. By Theorem 2, \u02c6p(x | y = 1) then solves the single-class maximum entropy problem, we write \u02c6p(x | y = 1) = \u02c6pME(x), and \u02c6p(x | y = 0) = \u02dc\u03c0(x). Assume without loss of generality that examples x1, . . . 
, x_M are distinct (but allow them to have identical feature vectors). Thus, \u02dc\u03c0(x) = 1/M on examples and zero elsewhere, and RE(\u02c6pME \u2016 \u02dc\u03c0_X) = \u2212H(\u02c6pME) + ln M. Plugging these into Theorem 2, we can derive the conditional estimate \u02c6p(y = 1 | x) across all unlabeled examples x:\n\n\u02c6p(y = 1 | x) = \u03bd(y = 1)\u02c6pME(x)e^{H(\u02c6pME)} / [\u03bd(y = 0) + \u03bd(y = 1)\u02c6pME(x)e^{H(\u02c6pME)}] .\n\n(8)\n\nIf the constraints on p(x | y = 1) are chosen as in Corollary 3 then \u02c6pME is exponential and Eq. (8) thus describes a logistic model. This model has the same coefficients as \u02c6pME, with the intercept chosen so that \u201ctypical\u201d examples x under \u02c6pME (examples with log probability close to the expected log probability) yield predictions close to the default.\n\n2.3 Discriminative Training: Class-robust Logistic Regression\n\nSimilar to the previous section, we consider \u03bd(x, y) = \u02dc\u03c0(x)\u03bd(y). The set of constraints P will now also include equality constraints on p(x). Since \u02dc\u03c0_lab(x) misspecifies the marginal, we use p(x) = \u02dc\u03c0(x). The next theorem is an analog of Corollary 3 for discriminative training. It follows from a combination of Theorem 1 and the duality of maxent with maximum likelihood [3]. A complete proof will appear in the extended version of this paper.\n\nTheorem 4. Assume that the sets P^y_X are specified by the inequality constraints of Eq. (4). Let P = {p \u2208 \u2206 : p^y_X \u2208 P^y_X and p_X = \u02dc\u03c0_X}. If the set P is non-empty then robust Bayes and maxent over P are equivalent. For the solution \u02c6p, \u02c6p(x) = \u02dc\u03c0(x) and \u02c6p(y | x) takes the form\n\nq_\u03bb(y | x) \u221d \u03bd(y) e^{\u03bb_y\u00b7f(x) \u2212 \u03bb_y\u00b7\u02dc\u00b5_y + \u03a3_j \u03b2^y_j |\u03bb^y_j|}\n\n(9)\n\nand solves the regularized \u201clogistic regression\u201d problem\n\nmin_\u03bb { (1/M) \u03a3_{i\u2264M} \u03a3_{y\u2208Y} [\u2212\u00af\u03c0(y | x_i) ln q_\u03bb(y | x_i)] + \u03a3_{y\u2208Y} \u00af\u03c0(y) \u03a3_{j\u2208J} [\u03b2^y_j |\u03bb^y_j| + (\u00af\u00b5^y_j \u2212 \u02dc\u00b5^y_j)\u03bb^y_j] } ,\n\n(10)\n\nwhere \u00af\u03c0 is an arbitrary feasible point, \u00af\u03c0 \u2208 P, and \u00af\u00b5^y_j its class-conditional feature expectations.\n\nWe put logistic regression in quotes, because the model described by Eq. (9) is not the usual logistic model; however, once the parameters \u03bb_y are fixed, Eq. (9) simply determines a logistic model with a special form of the intercept. Note that the second term of Eq. (10) is indeed a regularization, albeit possibly an asymmetric one, since any feasible \u00af\u03c0 will have |\u00af\u00b5^y_j \u2212 \u02dc\u00b5^y_j| \u2264 \u03b2^y_j. Since \u00af\u03c0(x) = \u02dc\u03c0(x), \u00af\u03c0 is specified solely by \u00af\u03c0(y | x) and thus can be viewed as a tentative imputation of labels across all examples. We remark that the value of the objective of Eq. (10) does not depend on the choice of \u00af\u03c0, because a different choice of \u00af\u03c0 (influencing the first term) yields a different set of means \u00af\u00b5^y_j (influencing the second term) and these differences cancel out. To provide a more concrete example and some intuition about Eq. (10), we now consider one-class estimation.\n\nOne-class estimation. A natural choice of \u00af\u03c0 is the \u201cpseudo-empirical\u201d distribution which views all unlabeled examples as negatives. 
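Returning briefly to the generative one-class estimate of Section 2.2, the entropy-weighted conversion of Eq. (8) is simple to apply once \u02c6pME is in hand. A minimal numeric sketch (the array values are made up for illustration; this is not the paper's Maxent software):

```python
import numpy as np

# Sketch of the entropy-weighted conversion of Eq. (8): turn a one-class
# maxent density p_ME over the M examples into conditional probabilities
#   p(y=1|x) = nu1 * p_ME(x) * e^H / (nu0 + nu1 * p_ME(x) * e^H),
# where H = H(p_ME).  Values below are made up for illustration.

def entropy_weighted_conditional(p_me, nu1):
    """p_me: maxent probabilities over M examples (sums to 1); nu1: default prevalence nu(y=1)."""
    h = -np.sum(p_me * np.log(p_me))     # entropy H(p_ME)
    s = nu1 * p_me * np.exp(h)
    return s / ((1.0 - nu1) + s)         # nu(y=0) = 1 - nu1

p_me = np.array([0.1, 0.2, 0.3, 0.4])    # hypothetical one-class estimate, M = 4
print(entropy_weighted_conditional(p_me, 0.3))

# Sanity check of the intercept interpretation: for a uniform p_ME,
# e^H * p_ME(x) = 1, so every prediction equals the default prevalence.
print(entropy_weighted_conditional(np.full(4, 0.25), 0.3))
```

The uniform-\u02c6pME check also makes visible why the entropy-based class probability is smaller than the default \u03bd(y = 1) whenever \u02c6pME is non-uniform, as noted in Section 3.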
Pseudo-empirical means of class 1 match the empirical averages of class 1 exactly, whereas pseudo-empirical means of class 0 can be arbitrary because they are unconstrained. The lack of constraints on class 0 forces the corresponding \u03bb_y to equal zero. The objective can thus be formulated solely using \u03bb_y for class 1; therefore, we will omit the superscript y. Eq. (10), after multiplying by M, then becomes\n\nmin_\u03bb { \u03a3_{i\u2264m} [\u2212 ln q_\u03bb(y = 1 | x_i)] + \u03a3_{m<i\u2264M} [\u2212 ln q_\u03bb(y = 0 | x_i)] + m \u03a3_{j\u2208J} \u03b2_j |\u03bb_j| } .\n\nThus the objective of class-robust logistic regression is the same as that of regularized logistic regression discriminating positives from unlabeled examples.\n\n3 Experiments\n\nWe evaluate our techniques using a large real-world dataset containing 226 species from 6 regions of the world, produced by the \u201cTesting alternative methodologies for modeling species\u2019 ecological niches and predicting geographic distributions\u201d Working Group at the National Center for Ecological Analysis and Synthesis (NCEAS). The training set contains presence-only data from unplanned surveys or incidental records, including those from museums and herbariums. The test set contains presence-absence data from rigorously planned independent surveys (i.e., without labeling bias). The regions are described by 11\u201313 environmental variables, with 20\u201354 species per region, 2\u20135822 training presences per species (median of 57), and 102\u201319120 test points (presences and absences); for details see [4]. As unlabeled examples we use presences of species captured by similar methods, known as the \u201ctarget group\u201d, with the groups as in [13].\n\nWe evaluate both entropy-weighted maxent and class-robust logistic regression while varying the default estimate \u03bd(y = 1), referred to as the default species prevalence by analogy with p(y = 1), which is called the species prevalence. 
Entropy-weighted maxent solutions for different default prevalences are derived by Eq. (8) from the same one-class estimate \u02c6pME. Class-robust logistic regression requires a separate optimization for each default prevalence.\n\nWe calculate \u02c6pME using the Maxent package [15] with features spanning the space of piecewise linear splines (of each environmental variable separately) and a tuned value of \u03b2 (see [12] for the details on features and tuning). Class-robust logistic models are calculated by a boosting-like algorithm SUMMET [3] with the same set of features and the same value of \u03b2 as the maxent runs.\n\nFor comparison, we also evaluate default-weighted maxent, using class probabilities p(y) = \u03bd(y) instead of Eq. (5), and two \u201coracle\u201d methods based on class probabilities in the test data: the constant Bernoulli prediction p(y | x) = \u03c0(y), and oracle-weighted maxent, using p(y) = \u03c0(y) instead of Eq. (5). Note that the constant Bernoulli prediction has no discrimination power (its AUC is 0.5) even though it matches class probabilities perfectly.\n\n[Figure 1: Comparison of reweighting schemes. Top: Test log loss averaged over species with given values of test prevalence, for varying default prevalence; panels group species with test prevalence 0.00\u22120.04, 0.04\u22120.15, 0.15\u22120.70, and all species. Bottom: For each value of test log loss, the range of default prevalence values that achieve it. Curves compare maxent weighted by default prevalence, maxent weighted by default*exp{\u2212RE}, BRT reweighted by default prevalence, BRT reweighted by default*exp{\u2212RE}, class-robust logistic regression, and two oracle settings (Bernoulli according to test prevalence; maxent weighted by test prevalence).]\n\nTo test entropy-weighting as a general method for estimating class probabilities, we also evaluate boosted regression trees (BRT), which have the highest predictive accuracy along with maxent among species distribution modeling techniques [4]. In this application, BRT is used to construct a logistic model discriminating positive examples from unlabeled examples. Recent work [17] uses a more principled approach where unknown labels are fitted by an EM algorithm, but our preliminary runs had too low AUC values, so they are excluded from our comparison. We train BRT using the R package gbm on datasets weighted so that the total weight of positives is equal to the total weight of unlabeled examples, and then apply Elkan\u2019s reweighting scheme [5]. 
Specifically, the BRT result \u02c6pBRT(y | x) is transformed to\n\np(y = 1 | x) = p(y = 1)\u02c6pBRT(y = 1 | x) / [p(y = 1)\u02c6pBRT(y = 1 | x) + p(y = 0)\u02c6pBRT(y = 0 | x)]\n\nfor two choices of p(y): default, p(y) = \u03bd(y), and entropy-based (using \u02c6pME).\n\nAll three techniques yield state-of-the-art discrimination (see [13]) as measured by the average AUC: maxent achieves an AUC of 0.7583; class-robust logistic regression 0.7451\u20130.7568; BRT 0.7545. Unlike maxent and BRT estimates, class-robust logistic estimates are not monotonically related across default prevalences, so they yield a different AUC for each default prevalence. However, log loss performance varies broadly according to the reweighting scheme. In the top portion of Fig. 1, we focus on maxent. Naive weighting by the default prevalence yields sharp peaks in performance around the best default prevalence. Entropy-based weighting yields broader peaks, so it is less sensitive to the default prevalence. The improvement diminishes as the true prevalence increases, but entropy-based weighting is never more sensitive. Thanks to its smaller sensitivity, entropy-based weighting outperforms naive weighting when a single default needs to be chosen for all species (the rightmost plot). Note that the optimal default values are higher for entropy-based weighting, because in one-class estimation the entropy-based prevalence is always smaller than the default (unless the estimate \u02c6pME is uniform).\n\nImproved sensitivity is demonstrated more clearly in the bottom portion of Fig. 1, which now also includes BRT and class-robust logistic regression. We see that BRT and maxent results are fairly similar, with BRT performing slightly better overall. Note that entropy-reweighted BRT relies on both BRT and maxent for its performance. 
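The reweighting transformation applied to the BRT output above is straightforward to implement. A minimal sketch (hypothetical posterior values; `reweight` is an illustrative name, not part of gbm or the Maxent package):

```python
import numpy as np

# Sketch of the reweighting step above (Elkan [5]): rescale a classifier's
# posteriors p_hat(y=1|x), obtained from training with balanced class weights,
# to a chosen class prior p(y=1) = p1.  Inputs below are hypothetical.

def reweight(p_hat, p1):
    """Apply p(y=1|x) = p1*p_hat / (p1*p_hat + (1-p1)*(1-p_hat))."""
    num = p1 * p_hat
    return num / (num + (1.0 - p1) * (1.0 - p_hat))

p_hat = np.array([0.2, 0.5, 0.8])   # made-up balanced-training posteriors
print(reweight(p_hat, 0.1))         # posteriors under a rarer class, p(y=1) = 0.1
```

At p_hat = 0.5 the output equals the prior p1, and the transformation is monotone, so ranking-based measures such as AUC are unaffected; only the calibration, and hence the log loss, changes.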
A striking observation is the poor performance of class-robust logistic regression for species with larger prevalence values; this merits further investigation.

4 Conclusion and Discussion

To correct for unknown labeling bias in training data, we used robust Bayesian decision theory and developed generative and discriminative approaches that optimize log loss under worst-case true class proportions. We found that our approaches improve test performance on a benchmark dataset for species distribution modeling, a one-class application with extreme labeling bias.

Acknowledgments. We would like to thank all of those who provided data used here: A. Ford, CSIRO Atherton, Australia; M. Peck and G. Peck, Royal Ontario Museum; M. Cadman, Bird Studies Canada, Canadian Wildlife Service of Environment Canada; the National Vegetation Survey Databank and the Allan Herbarium, New Zealand; Missouri Botanical Garden, especially R. Magill and T. Consiglio; and T. Wohlgemuth and U. Braendi, WSL Switzerland.

References

[1] Bickel, S., M. Brückner, and T. Scheffer (2007). Discriminative learning for differing training and test distributions. In Proc. 24th Int. Conf. Machine Learning, pp. 161–168.

[2] Chawla, N. V., N. Japkowicz, and A. Kołcz (2004). Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations 6(1), 1–6.

[3] Dudík, M., S. J. Phillips, and R. E. Schapire (2007). Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. J. Machine Learning Res. 8, 1217–1260.

[4] Elith, J., C. H. Graham, et al. (2006). Novel methods improve prediction of species' distributions from occurrence data. Ecography 29(2), 129–151.

[5] Elkan, C. (2001). The foundations of cost-sensitive learning. In Proc. 17th Int. Joint Conf. on Artificial Intelligence, pp. 973–978.

[6] Grünwald, P. D. and A. P.
Dawid (2004). Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Ann. Stat. 32(4), 1367–1433.

[7] Guo, Q., M. Kelly, and C. H. Graham (2005). Support vector machines for predicting distribution of Sudden Oak Death in California. Ecol. Model. 182, 75–90.

[8] Haffner, P., S. Phillips, and R. Schapire (2005). Efficient multiclass implementations of L1-regularized maximum entropy. E-print arXiv:cs/0506101.

[9] Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica 47(1), 153–161.

[10] Jaynes, E. T. (1957). Information theory and statistical mechanics. Phys. Rev. 106(4), 620–630.

[11] Maloof, M. (2003). Learning when data sets are imbalanced and costs are unequal and unknown. In Proc. ICML'03 Workshop on Learning from Imbalanced Data Sets.

[12] Phillips, S. J. and M. Dudík (2008). Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography 31(2), 161–175.

[13] Phillips, S. J., M. Dudík, J. Elith, C. H. Graham, A. Lehmann, J. Leathwick, and S. Ferrier. Sample selection bias and presence-only models of species distributions: implications for selection of background and pseudo-absences. Ecol. Appl. To appear.

[14] Phillips, S. J., M. Dudík, and R. E. Schapire (2004). A maximum entropy approach to species distribution modeling. In Proc. 21st Int. Conf. Machine Learning, pp. 655–662. ACM Press.

[15] Phillips, S. J., M. Dudík, and R. E. Schapire (2007). Maxent software for species habitat modeling. http://www.cs.princeton.edu/~schapire/maxent.

[16] Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plan. Infer. 90(2), 227–244.

[17] Ward, G., T. Hastie, S. Barry, J. Elith, and J. Leathwick (2008).
Presence-only data and the EM algorithm. Biometrics. In press.

[18] Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In Proc. 21st Int. Conf. Machine Learning, pp. 903–910. ACM Press.

[19] Zaniewski, A. E., A. Lehmann, and J. M. Overton (2002). Predicting species spatial distributions using presence-only data: A case study of native New Zealand ferns. Ecol. Model. 157, 261–280.