{"title": "Active Estimation of F-Measures", "book": "Advances in Neural Information Processing Systems", "page_first": 2083, "page_last": 2091, "abstract": "We address the problem of estimating the F-measure of a given model as accurately as possible on a fixed labeling budget. This problem occurs whenever an estimate cannot be obtained from held-out training data; for instance, when data that have been used to train the model are held back for reasons of privacy or do not reflect the test distribution. In this case, new test instances have to be drawn and labeled at a cost. An active estimation procedure selects instances according to an instrumental sampling distribution. An analysis of the sources of estimation error leads to an optimal sampling distribution that minimizes estimator variance. We explore conditions under which active estimates of F-measures are more accurate than estimates based on instances sampled from the test distribution.", "full_text": "Active Estimation of F-Measures\n\nChristoph Sawade, Niels Landwehr, and Tobias Scheffer\n\nUniversity of Potsdam\nDepartment of Computer Science\nAugust-Bebel-Strasse 89, 14482 Potsdam, Germany\n{sawade, landwehr, scheffer}@cs.uni-potsdam.de\n\nAbstract\n\nWe address the problem of estimating the F\u03b1-measure of a given model as accurately as possible on a fixed labeling budget. This problem occurs whenever an estimate cannot be obtained from held-out training data; for instance, when data that have been used to train the model are held back for reasons of privacy or do not reflect the test distribution. In this case, new test instances have to be drawn and labeled at a cost. An active estimation procedure selects instances according to an instrumental sampling distribution.
An analysis of the sources of estimation error leads to an optimal sampling distribution that minimizes estimator variance. We explore conditions under which active estimates of F\u03b1-measures are more accurate than estimates based on instances sampled from the test distribution.\n\n1 Introduction\n\nThis paper addresses the problem of evaluating a given model in terms of its predictive performance. In practice, it is not always possible to evaluate a model on held-out training data; consider, for instance, the following scenarios. When a readily trained model is shipped and deployed, training data may be held back for reasons of privacy. Secondly, training data may have been created under laboratory conditions and may not entirely reflect the test distribution. Finally, when a model has been trained actively, the labeled data is biased towards small-margin instances, which would incur a pessimistic bias on any cross-validation estimate.\n\nThis problem has recently been studied for risks\u2014i.e., for performance measures which are integrals of a loss function over an instance space [7]. However, several performance measures cannot be expressed as a risk. Perhaps the most prominent such measure is the F\u03b1-measure [10]. For a given binary classifier and sample of size n, let ntp and nfp denote the number of true and false positives, respectively, and nfn the number of false negatives. Then the classifier\u2019s F\u03b1-measure on the sample is defined as\n\nF\u03b1 = ntp / (\u03b1(ntp + nfp) + (1 \u2212 \u03b1)(ntp + nfn)).   (1)\n\nPrecision and recall are special cases for \u03b1 = 1 and \u03b1 = 0, respectively. The F\u03b1-measure is defined as an estimator in terms of empirical quantities. This is unintuitive from a statistical point of view and raises the question which quantity of the underlying distribution the F-measure actually estimates.
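Read concretely, Equation 1 combines the precision and recall denominators with weight \u03b1. A minimal sketch (the function name and the example counts are ours, not from the paper):

```python
def f_alpha(n_tp, n_fp, n_fn, alpha):
    """F_alpha per Equation 1: n_tp / (alpha*(n_tp+n_fp) + (1-alpha)*(n_tp+n_fn)).

    alpha = 1 gives precision, alpha = 0 gives recall.
    """
    denom = alpha * (n_tp + n_fp) + (1 - alpha) * (n_tp + n_fn)
    return n_tp / denom

# With 8 true positives, 2 false positives, 4 false negatives:
print(f_alpha(8, 2, 4, 1.0))  # precision: 8/10
print(f_alpha(8, 2, 4, 0.0))  # recall:    8/12
```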
We will now introduce the class of generalized risk functionals that we study in this paper. We will then show that F\u03b1 is a consistent estimate of a quantity that falls into this class.\n\nLet X denote the feature space and Y the label space. An unknown test distribution p(x, y) is defined over X \u00d7 Y. Let p(y|x; \u03b8) be a given \u03b8-parameterized model of p(y|x) and let f\u03b8 : X \u2192 Y with f\u03b8(x) = arg max_y p(y|x; \u03b8) be the corresponding hypothesis.\n\nLike any risk functional, the generalized risk is parameterized with a function \u2113 : Y \u00d7 Y \u2192 R determining either the loss or\u2014alternatively\u2014the gain that is incurred for a pair of predicted and true label. In addition, the generalized risk is parameterized with a function w that assigns a weight w(x, y, f\u03b8) to each instance. For instance, precision sums over instances with f\u03b8(x) = 1 with weight 1 and gives no consideration to other instances. Equation 2 defines the generalized risk:\n\nG = \u222b\u222b \u2113(f\u03b8(x), y) w(x, y, f\u03b8) p(x, y) dy dx / \u222b\u222b w(x, y, f\u03b8) p(x, y) dy dx.   (2)\n\nThe integral over Y is replaced by a sum in the case of a discrete label space Y. Note that the generalized risk (Equation 2) reduces to the regular risk for w(x, y, f\u03b8) = 1. On a sample of size n, a consistent estimator can be obtained by replacing the cumulative distribution function with the empirical distribution function.\n\nProposition 1. Let (x1, y1), . . . , (xn, yn) be drawn iid according to p(x, y). The quantity\n\n\u02c6Gn = \u03a3i \u2113(f\u03b8(xi), yi) w(xi, yi, f\u03b8) / \u03a3i w(xi, yi, f\u03b8)   (3)\n\nis a consistent estimate of the generalized risk G defined by Equation 2.\n\nProof.
The proposition follows from Slutsky\u2019s theorem [3] applied to the numerator and denominator of Equation 3.\n\nConsistency means asymptotic unbiasedness; that is, the expected value of the estimate \u02c6Gn converges to the true risk G for n \u2192 \u221e. We now observe that F\u03b1-measures\u2014including precision and recall\u2014are consistent empirical estimates of generalized risks for appropriately chosen functions w.\n\nCorollary 1. F\u03b1 is a consistent estimate of the generalized risk with Y = {0, 1}, w(x, y, f\u03b8) = \u03b1f\u03b8(x) + (1 \u2212 \u03b1)y and \u2113 = 1 \u2212 \u21130/1, where \u21130/1 denotes the zero-one loss.\n\nProof. The claim follows from Proposition 1 since\n\n\u02c6Gn = [\u03a3i (1 \u2212 \u21130/1(f\u03b8(xi), yi)) (\u03b1f\u03b8(xi) + (1 \u2212 \u03b1)yi)] / [\u03a3i (\u03b1f\u03b8(xi) + (1 \u2212 \u03b1)yi)] = \u03a3i f\u03b8(xi)yi / (\u03b1 \u03a3i f\u03b8(xi) + (1 \u2212 \u03b1) \u03a3i yi) = ntp / (\u03b1(ntp + nfp) + (1 \u2212 \u03b1)(ntp + nfn)).\n\nHaving established and motivated the generalized risk functional, we now turn towards the problem of acquiring a consistent estimate with minimal estimation error on a fixed labeling budget n. Test instances x1, ..., xn need not necessarily be drawn according to the distribution p. Instead, we study an active estimation process that selects test instances according to an instrumental distribution q. When instances are sampled from q, an estimator of the generalized risk can be defined as\n\n\u02c6Gn,q = [\u03a3i (p(xi)/q(xi)) \u2113(f\u03b8(xi), yi) w(xi, yi, f\u03b8)] / [\u03a3i (p(xi)/q(xi)) w(xi, yi, f\u03b8)],   (4)\n\nwhere (xi, yi) are drawn from q(x)p(y|x). Weighting factors p(xi)/q(xi) compensate for the discrepancy between test and instrumental distributions.
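The self-normalized estimator of Equation 4 can be sketched as follows; this is an illustration under our own naming, not the authors' code, and the arguments p, q, loss, and weight stand for p(x), q(x), \u2113, and w:

```python
def g_hat(samples, p, q, loss, weight):
    """Importance-weighted generalized risk estimate (Equation 4).

    samples: pairs (x, y) drawn from q(x) p(y|x).
    The factors p(x)/q(x) reweight the draws back to the test distribution.
    """
    num = den = 0.0
    for x, y in samples:
        r = p(x) / q(x)          # importance weight
        w = weight(x, y)
        num += r * loss(x, y) * w
        den += r * w
    return num / den
```

With q = p the factors cancel and the estimator reduces to Equation 3, matching the special case noted in the text.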
Because of the weighting factors, Slutsky\u2019s Theorem again implies that Equation 4 defines a consistent estimator for G, under the precondition that for all x \u2208 X with p(x) > 0 it holds that q(x) > 0. Note that Equation 3 is a special case of Equation 4, using the instrumental distribution q = p.\n\nThe estimate \u02c6Gn,q given by Equation 4 depends on the selected instances (xi, yi), which are drawn according to the distribution q(x)p(y|x). Thus, \u02c6Gn,q is a random variable whose distribution depends on q. Our overall goal is to determine the instrumental distribution q such that the expected deviation from the generalized risk is minimal for fixed labeling costs n:\n\nq\u2217 = arg min_q E[(\u02c6Gn,q \u2212 G)^2].\n\n2 Active Estimation through Variance Minimization\n\nThe bias-variance decomposition expresses the estimation error as a sum of a squared bias and a variance term [5]:\n\nE[(\u02c6Gn,q \u2212 G)^2] = (E[\u02c6Gn,q] \u2212 G)^2 + E[(\u02c6Gn,q \u2212 E[\u02c6Gn,q])^2]   (5)\n= Bias^2[\u02c6Gn,q] + Var[\u02c6Gn,q].   (6)\n\nBecause \u02c6Gn,q is consistent, both Bias^2[\u02c6Gn,q] and Var[\u02c6Gn,q] vanish for n \u2192 \u221e. More specifically, Lemma 1 shows that Bias^2[\u02c6Gn,q] is of order 1/n^2.\n\nLemma 1 (Bias of Estimator). Let \u02c6Gn,q be as defined in Equation 4. Then there exists C \u2265 0 with\n\n|E[\u02c6Gn,q] \u2212 G| \u2264 C/n.   (7)\n\nThe proof can be found in the online appendix. Lemma 2 states that the active risk estimator \u02c6Gn,q is asymptotically normally distributed, and characterizes its variance in the limit.\n\nLemma 2 (Asymptotic Distribution of Estimator).
Let \u02c6Gn,q be defined as in Equation 4. Then,\n\n\u221an (\u02c6Gn,q \u2212 G) \u2192 N(0, \u03c3q^2)   (8)\n\nwith asymptotic variance\n\n\u03c3q^2 = \u222b (p(x)/q(x)) (\u222b w(x, y, f\u03b8)^2 (\u2113(f\u03b8(x), y) \u2212 G)^2 p(y|x) dy) p(x) dx,   (9)\n\nwhere \u2192 denotes convergence in distribution for n \u2192 \u221e.\n\nA proof of Lemma 2 can be found in the appendix. Taking the variance of Equation 8, we obtain\n\nn Var[\u02c6Gn,q] \u2192 \u03c3q^2,   (10)\n\nthus Var[\u02c6Gn,q] is of order 1/n. As the bias term vanishes with 1/n^2, the expected estimation error E[(\u02c6Gn,q \u2212 G)^2] will be dominated by Var[\u02c6Gn,q]. Moreover, Equation 10 indicates that Var[\u02c6Gn,q] can be approximately minimized by minimizing \u03c3q^2. In the following, we will consequently derive a sampling distribution q\u2217 that minimizes the asymptotic variance \u03c3q^2 of the estimator \u02c6Gn,q.\n\n2.1 Optimal Sampling Distribution\n\nThe following theorem derives the sampling distribution that minimizes the asymptotic variance \u03c3q^2:\n\nTheorem 1 (Optimal Sampling Distribution). The instrumental distribution that minimizes the asymptotic variance \u03c3q^2 of the generalized risk estimator \u02c6Gn,q is given by\n\nq\u2217(x) \u221d p(x) \u221a(\u222b w(x, y, f\u03b8)^2 (\u2113(f\u03b8(x), y) \u2212 G)^2 p(y|x) dy).   (11)\n\nA proof of Theorem 1 is given in the appendix. Since F-measures are estimators of generalized risks according to Corollary 1, we can now derive their variance-minimizing sampling distributions.\n\nCorollary 2 (Optimal Sampling for F\u03b1).
The sampling distribution that minimizes the asymptotic variance of the F\u03b1-estimator resolves to\n\nq\u2217(x) \u221d p(x) \u221a(p(f\u03b8(x)|x)(1 \u2212 G)^2 + \u03b1^2(1 \u2212 p(f\u03b8(x)|x))G^2) if f\u03b8(x) = 1, and q\u2217(x) \u221d p(x)(1 \u2212 \u03b1) \u221a((1 \u2212 p(f\u03b8(x)|x))G^2) if f\u03b8(x) = 0.   (12)\n\nAlgorithm 1 Active Estimation of F\u03b1-Measures\ninput Model parameters \u03b8, pool D, labeling costs n.\noutput Generalized risk estimate \u02c6Gn,q\u2217.\n1: Compute optimal sampling distribution q\u2217 according to Corollary 2, 3, or 4, respectively.\n2: for i = 1, . . . , n do\n3: Draw xi \u223c q\u2217(x) from D with replacement.\n4: Query label yi \u223c p(y|xi) from oracle.\n5: end for\n6: return [\u03a3i (1/q\u2217(xi)) \u2113(f\u03b8(xi), yi) w(xi, yi, f\u03b8)] / [\u03a3i (1/q\u2217(xi)) w(xi, yi, f\u03b8)]\n\nProof. According to Corollary 1, F\u03b1 estimates a generalized risk with Y = {0, 1}, w(x, y, f\u03b8) = \u03b1f\u03b8(x) + (1 \u2212 \u03b1)y and \u2113 = 1 \u2212 \u21130/1. Starting from Theorem 1, we derive\n\nq\u2217(x) \u221d p(x) \u221a(\u03a3y\u2208{0,1} (\u03b1f\u03b8(x) + (1 \u2212 \u03b1)y)^2 (1 \u2212 \u21130/1(f\u03b8(x), y) \u2212 G)^2 p(y|x))   (13)\n= p(x) (\u03b1^2 f\u03b8(x)((1 \u2212 f\u03b8(x)) \u2212 G)^2 p(y = 0|x) + (1 \u2212 \u03b1(1 \u2212 f\u03b8(x)))^2 (f\u03b8(x) \u2212 G)^2 p(y = 1|x))^(1/2).   (14)\n\nThe claim follows by case differentiation according to the value of f\u03b8(x).\n\nCorollary 3 (Optimal Sampling for Recall). The sampling distribution that minimizes \u03c3q^2 for recall resolves to\n\nq\u2217(x) \u221d p(x) \u221a(p(f\u03b8(x)|x)(1 \u2212 G)^2) if f\u03b8(x) = 1, and q\u2217(x) \u221d p(x) \u221a((1 \u2212 p(f\u03b8(x)|x))G^2) if f\u03b8(x) = 0.   (15)\n\nCorollary 4 (Optimal Sampling for Precision).
The sampling distribution that minimizes \u03c3q^2 for precision resolves to\n\nq\u2217(x) \u221d p(x) f\u03b8(x) \u221a((1 \u2212 2G) p(f\u03b8(x)|x) + G^2).   (16)\n\nCorollaries 3 and 4 directly follow from Corollary 2 for \u03b1 = 0 and \u03b1 = 1. Note that for standard risks (that is, w = 1) Theorem 1 coincides with the optimal sampling distribution derived in [7].\n\n2.2 Empirical Sampling Distribution\n\nTheorem 1 and Corollaries 2, 3, and 4 depend on the unknown test distribution p(x). We now turn towards a setting in which a large pool D of unlabeled test instances is available. Instances from this pool can be sampled and then labeled at a cost. Drawing instances from the pool replaces generating them under the test distribution; that is, p(x) = 1/m for all x \u2208 D, where m denotes the size of the pool.\n\nTheorem 1 and its corollaries also depend on the true conditional p(y|x). To implement the method, we have to approximate the true conditional p(y|x); we use the model p(y|x; \u03b8). This approximation constitutes an analogy to active learning: in active learning, the model-based output probability p(y|x; \u03b8) serves as the basis on which the least confident instances are selected. Note that as long as p(x) > 0 implies q(x) > 0, the weighting factors ensure that such approximations do not introduce an asymptotic bias in our estimator (Equation 4). Finally, Theorem 1 and its corollaries depend on the true generalized risk G. G is replaced by an intrinsic generalized risk calculated from Equation 2, where the integral over X is replaced by a sum over the pool, p(x) = 1/m, and p(y|x) \u2248 p(y|x; \u03b8).\n\nAlgorithm 1 summarizes the procedure for active estimation of F-measures. A special case occurs when the labeling process is deterministic. Since instances are sampled with replacement, elements may be drawn more than once. In this case, labels can be looked up rather than be queried from the deterministic labeling oracle repeatedly.
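Combining Algorithm 1 with the sampling distribution of Corollary 2 under the pool-based conventions of this section (uniform p(x) = 1/m, a model-based confidence in place of p(y|x), and cached labels for a deterministic oracle) might look as sketched below; all names are ours, and in practice G would be the intrinsic generalized risk described above:

```python
import math
import random

def f_alpha_sampling_weights(pool, confidence, f, G, alpha):
    """Normalized q* over the pool per Corollary 2; the uniform p(x) = 1/m cancels.

    confidence(x) plays the role of the model-based p(f(x) | x).
    """
    q = []
    for x in pool:
        c = confidence(x)
        if f(x) == 1:
            q.append(math.sqrt(c * (1 - G) ** 2 + alpha ** 2 * (1 - c) * G ** 2))
        else:
            q.append((1 - alpha) * math.sqrt((1 - c) * G ** 2))
    z = sum(q)
    return [v / z for v in q]

def active_f_estimate(pool, q, oracle, f, alpha, n, rng=random):
    """Algorithm 1: draw n instances with replacement from q*, query labels
    (cached for a deterministic oracle), return the 1/q-weighted F-estimate."""
    labels = {}
    num = den = 0.0
    for _ in range(n):
        i = rng.choices(range(len(pool)), weights=q, k=1)[0]
        if i not in labels:
            labels[i] = oracle(pool[i])     # look up repeated draws instead of re-querying
        x, y = pool[i], labels[i]
        w = alpha * f(x) + (1 - alpha) * y  # w(x, y, f) for F_alpha (Corollary 1)
        gain = 1.0 if f(x) == y else 0.0    # 1 minus zero-one loss
        num += gain * w / q[i]
        den += w / q[i]
    return num / den if den > 0 else float("nan")
```

As a smoke test of the weighting, a classifier that agrees with the oracle on every drawn instance yields an estimate of exactly 1 regardless of which instances were sampled.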
The loop may then be continued until the labeling budget is exhausted. Note that F-measures are undefined when the denominator is zero, which is the case when all drawn examples have a weight w of zero. For instance, precision is undefined when no positive examples have been drawn.\n\n2.3 Confidence Intervals\n\nLemma 2 shows that the estimator \u02c6Gn,q is asymptotically normally distributed and characterizes its asymptotic variance. A consistent estimate of \u03c3q^2 is obtained from the labeled sample (x1, y1), . . . , (xn, yn) drawn from the distribution q(x)p(y|x) by computing the empirical variance\n\nSn,q^2 = (1 / \u03a3i p(xi)/q(xi)) \u03a3i (p(xi)/q(xi))^2 w(xi, yi, f\u03b8)^2 (\u2113(f\u03b8(xi), yi) \u2212 \u02c6Gn,q)^2.\n\nA two-sided confidence interval [\u02c6Gn,q \u2212 z, \u02c6Gn,q + z] with coverage 1 \u2212 \u03c1 is now given by z = Fn^\u22121(1 \u2212 \u03c1/2) Sn,q/\u221an, where Fn^\u22121 is the inverse cumulative distribution function of the Student\u2019s t distribution. As in the standard case of drawing test instances xi from the original distribution p, such confidence intervals are approximate for finite n, but become exact for n \u2192 \u221e.\n\n3 Empirical Studies\n\nWe compare active estimation of F\u03b1-measures according to Algorithm 1 (denoted activeF) to estimation based on a sample of instances drawn uniformly from the pool (denoted passive). We also consider the active estimator for risks presented in [7]. Instances are drawn according to the optimal sampling distribution q\u22170/1 for zero-one risk (Derivation 1 in [7]); the F\u03b1-measure is computed according to Equation 4 using q = q\u22170/1 (denoted activeerr).\n\n3.1 Experimental Setting and Domains\n\nFor each experimental domain, data is split into a training set and a pool of test instances.
We train a kernelized regularized logistic regression model p(y|x; \u03b8) (using the implementation of Yamada [11]). All methods operate on identical labeling budgets n. The evaluation process is averaged over 1,000 repetitions. In case one of the repetitions results in an undefined estimate, the entire experiment is discarded (i.e., there is no data point for the method in the corresponding diagram).\n\nSpam filtering domain. Spammers impose a shift on the distribution over time as they implement new templates and generators. In our experiments, a filter trained in the past has to be evaluated with respect to a present distribution of emails. We collect 169,612 emails from an email service provider between June 2007 and April 2010; of these, 42,165 emails received by February 2008 are used for training. Emails are represented by 541,713 binary bag-of-word features. Approximately 5% of all emails fall into the positive class non-spam.\n\nText classification domain. The Reuters-21578 text classification task [4] allows us to study the effect of class skew, and serves as a prototypical domain for active learning. We experiment on the ten most frequently occurring topics. We employ an active learner that always queries the example with minimal functional margin p(f\u03b8(x)|x; \u03b8) \u2212 max_{y \u2260 f\u03b8(x)} p(y|x; \u03b8) [9]. The learning process is initialized with one labeled training instance from each class; another 200 class labels are queried.\n\nDigit recognition domain. We also study a digit recognition domain in which training and test data originate from different sources. A detailed description is included in the online appendix.\n\n3.2 Empirical Results\n\nWe study the performance of active and passive estimates as a function of (a) the precision-recall trade-off parameter \u03b1, (b) the discrepancy between training and test distribution, and (c) class skew in the test distribution.
Point (b) is of interest because active estimates require the approximation p(y|x) \u2248 p(y|x; \u03b8); this assumption is violated when training and test distributions differ.\n\nEffect of the trade-off parameter \u03b1. For the spam filtering domain, Figure 1 shows the average absolute estimation error for F0 (recall), F0.5, and F1 (precision) estimates on a test set of 33,296 emails received between February 2008 and October 2008. The active generalized risk estimate activeF significantly outperforms the passive estimate passive for all three measures. In order to reach the estimation accuracy of passive with a labeling budget of n = 800, activeF requires fewer than 150 (recall), 200 (F0.5), or 100 (precision) labeled test instances. Estimates obtained from activeF are at least as accurate as those of activeerr, and more accurate for high \u03b1 values. Results obtained in the digit recognition domain are consistent with these findings (see online appendix).\n\nFigure 1: Spam filtering: Estimation error over labeling costs. Error bars indicate the standard error.\n\nFigure 2: Spam filtering: Optimal sampling distribution q\u2217 for F\u03b1 over log-odds (left). Ratio of passive and active estimation error, error bars indicate standard deviation (center). Estimation error over class ratio, logarithmic scale, error bars indicate standard errors (right).\n\nFigure 2 (left) shows the sampling distribution q\u2217(x) for recall, precision and F0.5-measure in the spam filtering domain as a function of the classifier\u2019s confidence, characterized by the log-odds ratio log(p(y=1|x; \u03b8) / p(y=0|x; \u03b8)). The figure also shows the optimal sampling distribution for zero-one risk as used in activeerr (denoted \u201c0/1-Risk\u201d).
We observe that the precision estimator dismisses all examples with f\u03b8(x) = 0; this is intuitive because precision is a function of true-positive and false-positive examples only. By contrast, the recall estimator selects examples on both sides of the decision boundary, as it has to estimate both the true positive and the false negative rate. The optimal sampling distribution for zero-one risk is symmetric; it prefers instances close to the decision boundary.\n\nEffect of discrepancy between training and test distribution. We keep the training set of emails fixed and move the time interval from which test instances are drawn increasingly further away into the future, thereby creating a growing gap between training and test distribution. Specifically, we divide 127,447 emails received between February 2008 and April 2010 into ten different test sets spanning approximately 2.5 months each. Figure 2 (center, red curve) shows the discrepancy between training and test distribution measured in terms of the exponentiated average log-likelihood of the test labels given the model parameters \u03b8. The likelihood at first continually decreases. It grows again for the two most recent batches; this coincides with a recent wave of text-based vintage spam.\n\nFigure 2 (center, blue curve) also shows the ratio of passive-to-active estimation errors |\u02c6Gn \u2212 G| / |\u02c6Gn,q\u2217 \u2212 G|. A value above one indicates that the active estimate is more accurate than a passive estimate. The active estimate consistently outperforms the passive estimate; its advantage diminishes when training and test distributions diverge and the assumption p(y|x) \u2248 p(y|x; \u03b8) becomes less accurate.\n\nEffect of class skew. In the spam filtering domain we artificially sub-sampled data to different ratios of spam and non-spam emails.
Figure 2 (right) shows the performance of activeF, passive, and activeerr for F0.5 estimation as a function of class skew. We observe that activeF outperforms passive consistently. Furthermore, activeF outperforms activeerr for imbalanced classes, while the approaches perform comparably when classes are balanced. This finding is consistent with the intuition that accuracy and F-measure diverge more strongly for imbalanced classes.\n\n[Plots for Figures 1 and 2; panel titles: Recall, F0.5-measure, Precision, Optimal Sampling Distribution (class ratio: 5/95), Time Shift vs. Ratio of Estimation Errors, F0.5-measure (n=500).]\n\nFigure 3: Text classification: Estimation error over number of labeled data for infrequent (left) and frequent (center) class. Estimation error over class ratio for all ten classes, logarithmic scale (right). Error bars indicate the standard error.\n\nIn the text classification domain we estimate the F0.5-measure for ten one-versus-rest classifiers. Figure 3 shows the estimation error of activeF, passive, and activeerr for an infrequent class (\u201ccrude\u201d, 4.41%, left) and a frequent class (\u201cearn\u201d, 51.0%, center). These results are representative for other frequent and infrequent classes; all results are included in the online appendix.
Figure 3 (right) shows the estimation error of activeF, passive, and activeerr on all ten one-versus-rest problems as a function of the problem\u2019s class skew. We again observe that activeF outperforms passive consistently, and activeF outperforms activeerr for strongly skewed class distributions.\n\n4 Related Work\n\nSawade et al. [7] derive a variance-minimizing sampling distribution for risks. Their result does not cover F-measures. Our experimental findings show that for estimating F-measures their variance-minimizing sampling distribution performs worse than the sampling distributions characterized by Theorem 1, especially for skewed class distributions.\n\nActive estimation of generalized risks can be considered to be a dual problem of active learning; in active learning, the goal of the selection process is to minimize the variance of the predictions or the variance of the model parameters, while in active evaluation the variance of the risk estimate is reduced. The variance-minimizing sampling distribution derived in Section 2.1 depends on the unknown conditional distribution p(y|x). We use the model itself to approximate this distribution and decide on instances whose class labels are queried. This is analogous to many active learning algorithms. Specifically, Bach derives a sampling distribution for active learning under the assumption that the current model gives a good approximation to the conditional probability p(y|x) [1].
To compensate for the bias incurred by the instrumental distribution, several active learning algorithms use importance weighting: for regression [8], exponential family models [1], or SVMs [2].\n\nFinally, the proposed active estimation approach can be considered an instance of the general principle of importance sampling [6], which we employ in the context of generalized risk estimation.\n\n5 Conclusions\n\nF\u03b1-measures are defined as empirical estimates; we have shown that they are consistent estimates of a generalized risk functional which Proposition 1 identifies. Generalized risks can be estimated actively by sampling test instances from an instrumental distribution q. An analysis of the sources of estimation error leads to an instrumental distribution q\u2217 that minimizes estimator variance. The optimal sampling distribution depends on the unknown conditional p(y|x); the active generalized risk estimator approximates this conditional by the model to be evaluated.\n\nOur empirical study supports the conclusion that the advantage of active over passive evaluation is particularly strong for skewed classes. The advantage of active evaluation is also correlated with the quality of the model as measured by the model-based likelihood of the test labels.
In our experiments, active evaluation consistently outperformed passive evaluation, even for the greatest divergence between training and test distribution that we could observe.\n\n[Plots for Figure 3; panels: F0.5-measure (class fraction: 4.4%), F0.5-measure (class fraction: 51.0%), F0.5-measure (n=800).]\n\nAppendix\n\nProof of Lemma 2\n\nLet (x1, y1), ..., (xn, yn) be drawn according to q(x)p(y|x). Let \u02c6G0n,q = \u03a3i vi\u2113iwi and Wn = \u03a3i viwi with vi = p(xi)/q(xi), wi = w(xi, yi, f\u03b8) and \u2113i = \u2113(f\u03b8(xi), yi). We note that E[\u02c6G0n,q] = nG E[wi] and E[Wn] = n E[wi]. The random variables w1v1, . . . , wnvn and w1\u21131v1, . . . , wn\u2113nvn are iid, therefore the central limit theorem implies that (1/n)\u02c6G0n,q and (1/n)Wn are asymptotically normally distributed with\n\n\u221an((1/n)\u02c6G0n,q \u2212 G E[wi]) \u2192 N(0, Var[wi\u2113ivi])   (17)\n\u221an((1/n)Wn \u2212 E[wi]) \u2192 N(0, Var[wivi]),   (18)\n\nwhere \u2192 denotes convergence in distribution for n \u2192 \u221e.
Application of the delta method to the function f(x, y) = x/y yields\n\n\u221an(\u02c6G0n,q / Wn \u2212 G) \u2192 N(0, \u2207f(G E[wi], E[wi])^T \u03a3 \u2207f(G E[wi], E[wi])),\n\nwhere \u2207f denotes the gradient of f and \u03a3 is the asymptotic covariance matrix of the input arguments,\n\n\u03a3 = [Var[wi\u2113ivi], Cov[wi\u2113ivi, wivi]; Cov[wi\u2113ivi, wivi], Var[wivi]].\n\nFurthermore,\n\n\u2207f(G E[wi], E[wi])^T \u03a3 \u2207f(G E[wi], E[wi]) = Var[wi\u2113ivi] \u2212 2G Cov[wivi, wi\u2113ivi] + G^2 Var[wivi] = E[wi^2 \u2113i^2 vi^2] \u2212 2G E[wi^2 \u2113i vi^2] + G^2 E[wi^2 vi^2] = \u222b\u222b (p(x)/q(x))^2 w(x, y, f\u03b8)^2 (\u2113(f\u03b8(x), y) \u2212 G)^2 p(y|x) q(x) dy dx.\n\nFrom this, the claim follows by canceling q(x).\n\nProof of Theorem 1\n\nTo minimize the variance with respect to the function q under the normalization constraint \u222b q(x)dx = 1, we define the Lagrangian with Lagrange multiplier \u03b2,\n\nL[q, \u03b2] = \u222b c(x)/q(x) dx + \u03b2(\u222b q(x)dx \u2212 1) = \u222b (c(x)/q(x) + \u03b2q(x)) dx \u2212 \u03b2 = \u222b K(q(x), x) dx \u2212 \u03b2,   (19)\n\nwhere c(x) = p(x)^2 \u222b w(x, y, f\u03b8)^2 (\u2113(f\u03b8(x), y) \u2212 G)^2 p(y|x) dy. The optimal function for the constrained problem satisfies the Euler-Lagrange equation \u2202K/\u2202q(x) = \u2212c(x)/q(x)^2 + \u03b2 = 0. A solution for this equation under the side condition is given by\n\nq\u2217(x) = \u221ac(x) / \u222b \u221ac(x) dx.   (20)\n\nNote that we dismiss the negative solution, since q(x) is a probability density function.
Resubstitution of c in Equation 20 implies the theorem.\n\nAcknowledgments\n\nWe gratefully acknowledge that this work was supported by a Google Research Award. We wish to thank Michael Br\u00fcckner for his help with the experiments on spam data.\n\nReferences\n\n[1] F. Bach. Active learning for misspecified generalized linear models. In Advances in Neural Information Processing Systems, 2007.\n[2] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proceedings of the International Conference on Machine Learning, 2009.\n[3] H. Cram\u00e9r. Mathematical Methods of Statistics, chapter 20. Princeton University Press, 1946.\n[4] A. Frank and A. Asuncion. UCI machine learning repository, 2010.\n[5] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4:1\u201358, 1992.\n[6] J. Hammersley and D. Handscomb. Monte Carlo Methods. Taylor & Francis, 1964.\n[7] C. Sawade, N. Landwehr, S. Bickel, and T. Scheffer. Active risk estimation. In Proceedings of the 27th International Conference on Machine Learning, 2010.\n[8] M. Sugiyama. Active learning in approximately linear regression based on conditional expectation of generalization error. Journal of Machine Learning Research, 7:141\u2013166, 2006.\n[9] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, pages 45\u201366, 2002.\n[10] C. van Rijsbergen. Information Retrieval. Butterworths, 2nd edition, 1979.\n[11] M. Yamada, M. Sugiyama, and T. Matsui. Semi-supervised speaker identification under covariate shift.
Signal Processing, 90(8):2353\u20132361, 2010.", "award": [], "sourceid": 609, "authors": [{"given_name": "Christoph", "family_name": "Sawade", "institution": null}, {"given_name": "Niels", "family_name": "Landwehr", "institution": null}, {"given_name": "Tobias", "family_name": "Scheffer", "institution": null}]}