{"title": "Consistent Binary Classification with Generalized Performance Metrics", "book": "Advances in Neural Information Processing Systems", "page_first": 2744, "page_last": 2752, "abstract": "Performance metrics for binary classification are designed to capture tradeoffs between four fundamental population quantities: true positives, false positives, true negatives and false negatives. Despite significant interest from theoretical and applied communities, little is known about either optimal classifiers or consistent algorithms for optimizing binary classification performance metrics beyond a few special cases. We consider a fairly large family of performance metrics given by ratios of linear combinations of the four fundamental population quantities. This family includes many well known binary classification metrics such as classification accuracy, AM measure, F-measure and the Jaccard similarity coefficient as special cases. Our analysis identifies the optimal classifiers as the sign of the thresholded conditional probability of the positive class, with a performance metric-dependent threshold. The optimal threshold can be constructed using simple plug-in estimators when the performance metric is a linear combination of the population quantities, but alternative techniques are required for the general case. We propose two algorithms for estimating the optimal classifiers, and prove their statistical consistency. Both algorithms are straightforward modifications of standard approaches to address the key challenge of optimal threshold selection, thus are simple to implement in practice. The first algorithm combines a plug-in estimate of the conditional probability of the positive class with optimal threshold selection. The second algorithm leverages recent work on calibrated asymmetric surrogate losses to construct candidate classifiers. 
We present empirical comparisons between these algorithms on benchmark datasets.", "full_text": "Consistent Binary Classification with Generalized Performance Metrics

Oluwasanmi Koyejo*
Department of Psychology, Stanford University
sanmi@stanford.edu

Nagarajan Natarajan*
Department of Computer Science, University of Texas at Austin
naga86@cs.utexas.edu

Pradeep Ravikumar
Department of Computer Science, University of Texas at Austin
pradeepr@cs.utexas.edu

Inderjit S. Dhillon
Department of Computer Science, University of Texas at Austin
inderjit@cs.utexas.edu

Abstract

Performance metrics for binary classification are designed to capture tradeoffs between four fundamental population quantities: true positives, false positives, true negatives and false negatives. Despite significant interest from theoretical and applied communities, little is known about either optimal classifiers or consistent algorithms for optimizing binary classification performance metrics beyond a few special cases. We consider a fairly large family of performance metrics given by ratios of linear combinations of the four fundamental population quantities. This family includes many well known binary classification metrics such as classification accuracy, the AM measure, the F-measure and the Jaccard similarity coefficient as special cases. Our analysis identifies the optimal classifiers as the sign of the thresholded conditional probability of the positive class, with a performance metric-dependent threshold. The optimal threshold can be constructed using simple plug-in estimators when the performance metric is a linear combination of the population quantities, but alternative techniques are required for the general case. We propose two algorithms for estimating the optimal classifiers, and prove their statistical consistency.
Both algorithms are straightforward modifications of standard approaches to address the key challenge of optimal threshold selection, and thus are simple to implement in practice. The first algorithm combines a plug-in estimate of the conditional probability of the positive class with optimal threshold selection. The second algorithm leverages recent work on calibrated asymmetric surrogate losses to construct candidate classifiers. We present empirical comparisons between these algorithms on benchmark datasets.

*Equal contribution to the work.

1 Introduction

Binary classification performance is often measured using metrics designed to address the shortcomings of classification accuracy. For instance, it is well known that classification accuracy is an inappropriate metric for rare event classification problems such as medical diagnosis, fraud detection, click rate prediction and text retrieval applications [1, 2, 3, 4]. Instead, alternative metrics better tuned to imbalanced classification (such as the F1 measure) are employed. Similarly, cost-sensitive metrics may be useful for addressing asymmetry in the real-world costs associated with specific classes. An important theoretical question concerning metrics employed in binary classification is the characterization of the optimal decision functions. For example, the decision function that maximizes the accuracy metric (or equivalently minimizes the "0-1 loss") is well known to be sign(P(Y = 1|x) - 1/2). A similar result holds for cost-sensitive classification [5]. Recently, [6] showed that the optimal decision function for the F1 measure can also be characterized as sign(P(Y = 1|x) - δ*) for some δ* ∈ (0, 1). As we show in the paper, it is not a coincidence that the optimal decision function for these different metrics has a similar simple characterization.
We make the observation that the different metrics used in practice belong to a fairly general family of performance metrics: ratios of linear combinations of the four population quantities associated with the confusion matrix. Measures in this family include classification accuracy, the false positive rate, the false discovery rate, precision, the AM measure and the F-measure, among others. Our analysis shows that the optimal classifiers for all such metrics can be characterized as the sign of the thresholded conditional probability of the positive class, with a threshold that depends on the specific metric. This result unifies and generalizes known special cases, including the AM measure analysis by Menon et al. [7] and the F measure analysis by Ye et al. [6]. It is known that minimizing (convex) surrogate losses, such as the hinge and the logistic loss, provably also minimizes the underlying 0-1 loss, or equivalently maximizes classification accuracy [8]. This motivates the next question we address in the paper: can one obtain algorithms that (a) can be used in practice for maximizing metrics from our family, and (b) are consistent with respect to the metric? To this end, we propose two algorithms for consistent empirical estimation of decision functions. The first algorithm combines a plug-in estimate of the conditional probability of the positive class with optimal threshold selection. The second leverages the asymmetric surrogate approach of Scott [9] to construct candidate classifiers. Both algorithms are simple modifications of standard approaches that address the key challenge of optimal threshold selection.
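To make the family concrete, the sketch below (our illustration, not code from the paper; the helper name `fractional_linear_metric` and its coefficient-dictionary interface are ours) evaluates a metric of this form from the four population quantities, and instantiates accuracy, F_β and the Jaccard coefficient as ratios of linear combinations:

```python
# Illustrative sketch: each metric in the family is a ratio of linear
# combinations of the population quantities TP, FP, FN, TN (which sum
# to 1). Coefficient dicts use keys '0' (constant), 'tp', 'fp', 'fn', 'tn'.

def fractional_linear_metric(tp, fp, fn, tn, num, den):
    def lin(c):
        return (c.get('0', 0.0) + c.get('tp', 0.0) * tp
                + c.get('fp', 0.0) * fp + c.get('fn', 0.0) * fn
                + c.get('tn', 0.0) * tn)
    return lin(num) / lin(den)

def accuracy(tp, fp, fn, tn):
    # ACC = TP + TN (the population quantities sum to one)
    return fractional_linear_metric(tp, fp, fn, tn,
                                    {'tp': 1, 'tn': 1}, {'0': 1})

def f_beta(tp, fp, fn, tn, beta=1.0):
    # F_beta = (1 + beta^2) TP / ((1 + beta^2) TP + beta^2 FN + FP)
    b2 = beta ** 2
    return fractional_linear_metric(tp, fp, fn, tn, {'tp': 1 + b2},
                                    {'tp': 1 + b2, 'fn': b2, 'fp': 1})

def jaccard(tp, fp, fn, tn):
    # JAC = TP / (TP + FN + FP)
    return fractional_linear_metric(tp, fp, fn, tn, {'tp': 1},
                                    {'tp': 1, 'fn': 1, 'fp': 1})
```
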
Our analysis identifies why simple heuristics, such as classification using class-weighted loss functions and logistic regression with threshold search, are effective practical algorithms for many generalized performance metrics; furthermore, when implemented correctly, such apparent heuristics are in fact asymptotically consistent.

Related Work. Binary classification accuracy and its cost-sensitive variants have been studied extensively. Here we highlight a few of the key results. The seminal work of [8] showed that minimizing certain surrogate loss functions enables us to control the probability of misclassification (the expected 0-1 loss). An appealing corollary of the result is that convex loss functions such as the hinge and logistic losses satisfy the surrogacy conditions, which establishes the statistical consistency of the resulting algorithms. Steinwart [10] extended this work to derive surrogate losses for other scenarios, including asymmetric classification accuracy. More recently, Scott [9] characterized the optimal decision function for the weighted 0-1 loss in cost-sensitive learning and extended the risk bounds of [8] to weighted surrogate loss functions. A similar result regarding the use of a threshold different from 1/2, and appropriately rebalancing the training data in cost-sensitive learning, was shown by [5]. Surrogate regret bounds for proper losses applied to class probability estimation were analyzed by Reid and Williamson [11] for differentiable loss functions. Extensions to the multi-class setting have also been studied (for example, Zhang [12] and Tewari and Bartlett [13]). Analysis of performance metrics beyond classification accuracy is limited. The optimal classifier remains unknown for many binary classification performance metrics of interest, and few results exist identifying consistent algorithms for optimizing these metrics [7, 6, 14, 15].
Of particular relevance to our work are the AM measure maximization by Menon et al. [7], and the F measure maximization by Ye et al. [6].

2 Generalized Performance Metrics

Let X be either a countable set, or a complete separable metric space equipped with the standard Borel σ-algebra of measurable sets. Let X ∈ X and Y ∈ {0, 1} represent input and output random variables respectively. Further, let Θ represent the set of all classifiers Θ = {θ : X → [0, 1]}. We assume the existence of a fixed unknown distribution P, and data is generated as iid samples (X, Y) ∼ P. Define the quantities π = P(Y = 1) and γ(θ) = P(θ = 1).

The components of the confusion matrix are the fundamental population quantities for binary classification. They are the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), given by:

TP(θ, P) = P(Y = 1, θ = 1),   FP(θ, P) = P(Y = 0, θ = 1),
FN(θ, P) = P(Y = 1, θ = 0),   TN(θ, P) = P(Y = 0, θ = 0).   (1)

These quantities may be further decomposed as:

FP(θ, P) = γ(θ) - TP(θ),   FN(θ, P) = π - TP(θ),   TN(θ, P) = 1 - γ(θ) - π + TP(θ).   (2)

Let L : Θ × P → R be a performance metric of interest. Without loss of generality, we assume that L is a utility metric, so that larger values are better. The Bayes utility L* is the optimal value of the performance metric, i.e., L* = sup_{θ∈Θ} L(θ, P). The Bayes classifier θ* is the classifier that optimizes the performance metric, so L* = L(θ*), where:

θ* = arg max_{θ∈Θ} L(θ, P).

We consider a family of classification metrics computed as the ratio of linear combinations of these fundamental population quantities (1).
In particular, given constants (representing costs or weights) {a11, a10, a01, a00, a0} and {b11, b10, b01, b00, b0}, we consider the measure:

L(θ, P) = (a0 + a11 TP + a10 FP + a01 FN + a00 TN) / (b0 + b11 TP + b10 FP + b01 FN + b00 TN)   (3)

where, for clarity, we have suppressed the dependence of the population quantities on θ and P. Examples of performance metrics in this family include the AM measure [7], the F_β measure [6], the Jaccard similarity coefficient (JAC) [16] and Weighted Accuracy (WA):

AM = (1/2)(TP/π + TN/(1 - π)) = ((1 - π)TP + πTN) / (2π(1 - π)),
F_β = (1 + β^2)TP / ((1 + β^2)TP + β^2 FN + FP) = (1 + β^2)TP / (β^2 π + γ),
JAC = TP / (TP + FN + FP) = TP / (π + FP) = TP / (γ + FN),
WA = (w1 TP + w2 TN) / (w1 TP + w2 TN + w3 FP + w4 FN).

Note that we allow the constants to depend on P. Other examples in this class include commonly used ratios such as the true positive rate (also known as recall) (TPR), true negative rate (TNR), precision (Prec), false negative rate (FNR) and negative predictive value (NPV):

TPR = TP/(TP + FN),  TNR = TN/(FP + TN),  Prec = TP/(TP + FP),  FNR = FN/(FN + TP),  NPV = TN/(TN + FN).

Interested readers are referred to [17] for a list of additional metrics in this class.

By decomposing the population measures (1) using (2), we see that any performance metric in the family (3) has the equivalent representation:

L(θ) = (c0 + c1 TP(θ) + c2 γ(θ)) / (d0 + d1 TP(θ) + d2 γ(θ))   (4)

with the constants:

c0 = a01 π + a00 - a00 π + a0,   c1 = a11 - a10 - a01 + a00,   c2 = a10 - a00,
d0 = b01 π + b00 - b00 π + b0,   d1 = b11 - b10 - b01 + b00,   d2 = b10 - b00.

Thus, it is clear from (4) that the family of performance metrics depends on the classifier θ only through the quantities TP(θ) and γ(θ).

Optimal Classifier

We now characterize the optimal classifier for the family of
performance metrics defined in (4). Let ν represent the dominating measure on X. For the rest of this manuscript, we make the following assumption:

Assumption 1. The marginal distribution P(X) is absolutely continuous with respect to the dominating measure ν on X, so there exists a density µ that satisfies dP = µ dν.

To simplify notation, we use the standard dν(x) = dx. We also define the conditional probability η_x = P(Y = 1|X = x). Applying Assumption 1, we can expand the terms TP(θ) = ∫_{x∈X} η_x θ(x) µ(x) dx and γ(θ) = ∫_{x∈X} θ(x) µ(x) dx, so the performance metric (4) may be represented as:

L(θ, P) = (c0 + ∫_{x∈X} (c1 η_x + c2) θ(x) µ(x) dx) / (d0 + ∫_{x∈X} (d1 η_x + d2) θ(x) µ(x) dx).

Our first main result identifies the Bayes classifier for all utility functions in the family (3), showing that they take the form θ*(x) = sign(η_x - δ*), where δ* is a metric-dependent threshold, and the sign function is given by sign : R → {0, 1} with sign(t) = 1 if t ≥ 0 and sign(t) = 0 otherwise.

Theorem 2. Let P be a distribution on X × {0, 1} that satisfies Assumption 1, and let L be a performance metric in the family (3). Given the constants {c0, c1, c2} and {d0, d1, d2}, define:

δ* = (d2 L* - c2) / (c1 - d1 L*).   (5)

1. When c1 > d1 L*, the Bayes classifier takes the form θ*(x) = sign(η_x - δ*).
2. When c1 < d1 L*, the Bayes classifier takes the form θ*(x) = sign(δ* - η_x).

The proof of the theorem involves examining the first-order optimality condition (see Appendix B).

Remark 3. The specific form of the optimal classifier depends on the sign of c1 - d1 L*, and L* is often unknown.
In practice, one can often estimate loose upper and lower bounds of L* to determine the classifier.

A number of useful results can be evaluated directly as instances of Theorem 2. For the F_β measure, we have c1 = 1 + β^2 and d2 = 1, with all other constants zero. Thus, δ*_F = L*/(1 + β^2). This matches the optimal threshold for the F1 metric specified by Zhao et al. [14]. For precision, we have c1 = 1, d2 = 1 and all other constants zero, so δ*_Prec = L*. This clarifies the observation that, in practice, precision can be maximized by predicting only high-confidence positives. For the true positive rate (recall), we have c1 = 1, d0 = π and all other constants zero, so δ*_TPR = 0, recovering the known result that, in practice, recall is maximized by predicting all examples as positives. For the Jaccard similarity coefficient, c1 = 1, d1 = -1, d2 = 1, d0 = π and all other constants are zero, so δ*_JAC = L*/(1 + L*).

When d1 = d2 = 0, the generalized metric is simply a linear combination of the four fundamental quantities. With this form, we can recover the optimal classifier outlined by Elkan [5] for cost-sensitive classification.

Corollary 4. Let P be a distribution on X × {0, 1} that satisfies Assumption 1, and let L be a performance metric in the family (3). Given the constants {c0, c1, c2} and {d0, d1 = 0, d2 = 0}, the optimal threshold (5) is δ* = -c2/c1.

Classification accuracy is in this family, with c1 = 2, c2 = -1, and it is well known that δ*_ACC = 1/2. Another case of interest is the AM metric, where c1 = 1, c2 = -π, so δ*_AM = π, as shown in Menon et al. [7].

3 Algorithms

The characterization of the Bayes classifier for the family of performance metrics (4) given in Theorem 2 enables the design of practical classification algorithms with strong theoretical properties. In particular, the algorithms that we propose are intuitive and easy to implement. Despite their simplicity, we show that the proposed algorithms are consistent with respect to the measure of interest, a desirable property for a classification algorithm. We begin with a description of the algorithms, followed by a detailed analysis of consistency. Let {X_i, Y_i}_{i=1}^n denote iid training instances drawn from a fixed unknown distribution P. For a given θ : X → {0, 1}, we define the following empirical quantities based on their population analogues: TP_n(θ) = (1/n) Σ_{i=1}^n θ(X_i) Y_i and γ_n(θ) = (1/n) Σ_{i=1}^n θ(X_i). It is clear that TP_n(θ) → TP(θ; P) and γ_n(θ) → γ(θ; P) as n → ∞.

Consider the empirical measure:

L_n(θ) = (c1 TP_n(θ) + c2 γ_n(θ) + c0) / (d1 TP_n(θ) + d2 γ_n(θ) + d0),   (6)

corresponding to the population measure L(θ; P) in (4). It is expected that L_n(θ) will be close to L(θ; P) when the sample is sufficiently large (see Lemma 8). For the rest of this manuscript, we assume that L* ≤ c1/d1, so θ*(x) = sign(η_x - δ*); the case where L* > c1/d1 is solved identically.

Our first approach (Two-Step Expected Utility Maximization) is quite intuitive (Algorithm 1): obtain an estimator η̂_x for η_x = P(Y = 1|x) by performing ERM on the sample using a proper loss function [11], then maximize L_n defined in (6) with respect to the threshold δ ∈ (0, 1). The optimization required in the third step is one dimensional, so a global maximizer can be computed efficiently in many cases [18].
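The one-dimensional threshold search can be sketched as follows (our illustration; `select_threshold` and the coefficient-tuple interface are hypothetical names, not from the paper). It assumes probability estimates are already available, and scans the estimated probabilities themselves as candidate thresholds, since the empirical maximizer of L_n lies among them:

```python
# Sketch of optimal threshold selection for an empirical measure
# L_n = (c0 + c1*TP_n + c2*gamma_n) / (d0 + d1*TP_n + d2*gamma_n).
# c = (c0, c1, c2), d = (d0, d1, d2) are the metric's coefficients.

def empirical_metric(y_true, y_pred, c, d):
    n = len(y_true)
    tp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1) / n
    gamma = sum(y_pred) / n  # empirical P(theta = 1)
    return (c[0] + c[1] * tp + c[2] * gamma) / (d[0] + d[1] * tp + d[2] * gamma)

def select_threshold(eta_hat_vals, y_true, c, d):
    # Scan candidate thresholds; thresholding eta_hat at each estimated
    # probability enumerates every achievable empirical classifier.
    best_delta, best_val = 0.5, float('-inf')
    for delta in sorted(set(eta_hat_vals)):
        y_pred = [1 if e >= delta else 0 for e in eta_hat_vals]
        val = empirical_metric(y_true, y_pred, c, d)
        if val > best_val:
            best_delta, best_val = delta, val
    return best_delta, best_val
```

For example, with F1 coefficients c = (0, 2, 0) and d = (π̂, 0, 1), the search returns the probability value at which thresholding maximizes the empirical F1 on the held-out split.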
In experiments, we use (regularized) logistic regression on a training sample to obtain η̂.

Algorithm 1: Two-Step EUM
Input: Training examples S = {X_i, Y_i}_{i=1}^n and the utility measure L.
1. Split the training data S into two sets S1 and S2.
2. Estimate η̂_x using S1; define θ̂_δ = sign(η̂_x - δ).
3. Compute δ̂ = arg max_{δ∈(0,1)} L_n(θ̂_δ) on S2.
Return: θ̂_δ̂

Our second approach (Weighted Empirical Risk Minimization) is based on the observation that empirical risk minimization (ERM) with suitably weighted loss functions yields a classifier that thresholds η_x appropriately (Algorithm 2). Given a convex surrogate ℓ(t, y) of the 0-1 loss, where t is a real-valued prediction and y ∈ {0, 1}, the δ-weighted loss is given by [9]:

ℓ_δ(t, y) = (1 - δ) 1{y=1} ℓ(t, 1) + δ 1{y=0} ℓ(t, 0).

Denote the set of real-valued functions as Φ; we then define φ̂_δ as:

φ̂_δ = arg min_{φ∈Φ} (1/n) Σ_{i=1}^n ℓ_δ(φ(X_i), Y_i)   (7)

and then set θ̂_δ(x) = sign(φ̂_δ(x)). Scott [9] showed that such an estimated θ̂_δ is consistent with θ_δ = sign(η_x - δ). With the classifier so defined, we maximize L_n defined in (6) with respect to the threshold δ ∈ (0, 1).

Algorithm 2: Weighted ERM
Input: Training examples S = {X_i, Y_i}_{i=1}^n and the utility measure L.
1. Split the training data S into two sets S1 and S2.
2. Compute δ̂ = arg max_{δ∈(0,1)} L_n(θ̂_δ) on S2.
   Sub-algorithm: define θ̂_δ(x) = sign(φ̂_δ(x)), where φ̂_δ(x) is computed using (7) on S1.
Return: θ̂_δ̂

Remark 5. When d1 = d2 = 0, the optimal threshold does not depend on L* (Corollary 4). We may then employ a simple sample-based plugin estimate δ̂_S.

A benefit of using such a plugin estimate is that the classification algorithms can be simplified while maintaining consistency.
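For a concrete instance of this plugin simplification, consider the AM measure, whose optimal threshold is δ* = π = P(Y = 1) (shown above via c1 = 1, c2 = -π). A minimal sketch (our illustration; the helper name `am_plugin_classifier` and the callable `eta_hat` are ours, not the paper's):

```python
# Minimal sketch of the plug-in simplification for the AM measure:
# estimate pi by the sample positive fraction, then threshold the
# estimated class probability eta_hat at pi_hat. No threshold search
# is needed because delta* = pi does not depend on L*.

def am_plugin_classifier(y_train, eta_hat):
    """Return a 0/1 classifier x -> sign(eta_hat(x) - pi_hat)."""
    pi_hat = sum(y_train) / len(y_train)  # consistent estimator of pi
    return lambda x: 1 if eta_hat(x) >= pi_hat else 0
```

With η̂ consistent for η and π̂ consistent for π, thresholding η̂ at π̂ is exactly the simplified, consistency-preserving procedure the remark describes.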
Given such a sample-based plugin estimate δ̂_S, Algorithm 1 then reduces to estimating η̂_x and setting θ̂_{δ̂_S} = sign(η̂_x - δ̂_S), while Algorithm 2 reduces to a single ERM (7) to estimate φ̂_{δ̂_S}(x), and then setting θ̂_{δ̂_S}(x) = sign(φ̂_{δ̂_S}(x)). In the case of the AM measure, the threshold is given by δ* = π, so a consistent estimator for π is all that is required (see [7]).

3.1 Consistency of the proposed algorithms

An algorithm is said to be L-consistent if the learned classifier θ̂ satisfies L* - L(θ̂) →p 0, i.e., for every ε > 0, P(|L* - L(θ̂)| < ε) → 1 as n → ∞.

We begin the analysis with the simplest case, when δ* is independent of L* (Corollary 4). The following proposition, which generalizes Lemma 1 of [7], shows that maximizing L is equivalent to minimizing the δ*-weighted risk. As a consequence, it suffices to minimize a suitable surrogate loss ℓ_{δ*} on the training data to guarantee L-consistency.

Proposition 6. Assume δ* ∈ (0, 1) and δ* is independent of L*, but may depend on the distribution P. Define the δ*-weighted risk of a classifier θ as

R_{δ*}(θ) = E_{(x,y)∼P}[(1 - δ*) 1{y=1} 1{θ(x)=0} + δ* 1{y=0} 1{θ(x)=1}];

then R_{δ*}(θ) - min_θ R_{δ*}(θ) = (1/c1)(L* - L(θ)).

The proof is simple, and we defer it to Appendix B. The key consequence of Proposition 6 is that if we know δ*, then simply optimizing a weighted surrogate loss as detailed in the proposition suffices to obtain a consistent classifier. In the more practical setting where δ* is not known exactly, we can compute a sample-based estimate δ̂_S. We briefly mentioned in the previous section how the proposed Algorithms 1 and 2 simplify in this case. Using a plug-in estimate δ̂_S such that δ̂_S →p δ* in the algorithms directly guarantees consistency, under mild assumptions on P (see Appendix A for details). The proof for this setting essentially follows the arguments in [7], given Proposition 6.

Now, we turn to the general case, i.e. when L is an arbitrary measure in the class (4) such that δ* is difficult to estimate directly. In this case, both proposed algorithms estimate δ to optimize the empirical measure L_n. We employ the following proposition, which establishes bounds on L.

Proposition 7. Let the constants a_ij, b_ij for i, j ∈ {0, 1}, a0, and b0 be non-negative and, without loss of generality, take values in [0, 1]. Then:
1. -2 ≤ c1, d1 ≤ 2, -1 ≤ c2, d2 ≤ 1, and 0 ≤ c0, d0 ≤ 2(1 + π).
2. L is bounded, i.e. for any θ, 0 ≤ L(θ) ≤ L̄ := (a0 + max_{i,j∈{0,1}} a_ij) / (b0 + min_{i,j∈{0,1}} b_ij).

The proofs of the main results in Theorems 10 and 11 rely on the following Lemmas 8 and 9, which show that the empirical measure converges to the population measure at a steady rate. We defer the proofs to Appendix B.

Lemma 8. For any ε > 0, lim_{n→∞} P(|L_n(θ) - L(θ)| < ε) = 1. Furthermore, with probability at least 1 - ρ, |L_n(θ) - L(θ)| < (C + L̄D) r(n, ρ) / (B - D r(n, ρ)), where r(n, ρ) = √((1/2n) ln(4/ρ)), L̄ is an upper bound on L(θ), and B, C, D ≥ 0 are constants that depend on L (i.e., on c0, c1, c2, d0, d1 and d2).

Now, we show a uniform convergence result for L_n with respect to maximization over the threshold δ ∈ (0, 1).

Lemma 9. Consider the function class of all thresholded decisions Θ_φ = {1{φ(x) > δ} : δ ∈ (0, 1)} for a [0, 1]-valued function φ : X → [0, 1]. Define r̃(n, ρ) = √((32/n)[ln(en) + ln(16/ρ)]). If r̃(n, ρ) < B/D (where B and D are defined as in Lemma 8), then with probability at least 1 - ρ, sup_{θ∈Θ_φ} |L_n(θ) - L(θ)| < ε, where ε = (C + L̄D) r̃(n, ρ) / (B - D r̃(n, ρ)).

We are now ready to state our main results concerning the consistency of the two proposed algorithms.

Theorem 10. (Main Result 2) If the estimate η̂_x satisfies η̂_x →p η_x, Algorithm 1 is L-consistent.

Note that we can obtain an estimate η̂_x with the guarantee that η̂_x →p η_x by using a strongly proper loss function [19] (e.g. the logistic loss); see Appendix B.

Theorem 11. (Main Result 3) Let ℓ : R → [0, ∞) be a classification-calibrated convex (margin) loss (i.e. ℓ′(0) < 0) and let ℓ_δ be the corresponding weighted loss for a given δ, as used in the weighted ERM (7). Then, Algorithm 2 is L-consistent.

Note that loss functions used in practice, such as the hinge and logistic losses, are classification-calibrated [8].

4 Experiments

We present experiments on synthetic data where we observe that measures from our family are indeed maximized by thresholding η_x. We also compare the two proposed algorithms on benchmark datasets for two specific measures from the family.

4.1 Synthetic data: Optimal decisions

We evaluate the Bayes optimal classifiers for common performance metrics to empirically verify the results of Theorem 2. We fix a domain X = {1, 2, ..., 10}, set µ(x) by drawing random values uniformly in (0, 1) and then normalizing them, and set the conditional probability using a sigmoid function as η_x = 1/(1 + exp(-wx)), where w is a random value drawn from a standard Gaussian. As the optimal threshold depends on the Bayes utility L*, the Bayes classifier cannot be evaluated using plug-in estimates. Instead, the Bayes classifier θ* was obtained using an exhaustive search over all 2^10 possible classifiers.
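This exhaustive check is small enough to reproduce. The sketch below (our reconstruction under assumed parameters and seed, not the authors' code) searches all 2^10 classifiers for the F1 maximizer and confirms it predicts positive exactly on the points with the largest η_x, as Theorem 2 implies:

```python
import itertools
import math
import random

# Our reconstruction (assumed seed and parameters) of the synthetic
# check: on X = {1,...,10}, draw mu and eta as described, exhaustively
# maximize F1 over all 2^10 classifiers, and check that the maximizer
# is a thresholding of eta_x (equivalently, a "top-k by eta_x" set).

random.seed(0)
mu = [random.random() for _ in range(10)]
total = sum(mu)
mu = [m / total for m in mu]                       # marginal mu(x)
w = random.gauss(0.0, 1.0)
eta = [1.0 / (1.0 + math.exp(-w * x)) for x in range(1, 11)]  # P(Y=1|x)
pi = sum(m * e for m, e in zip(mu, eta))           # P(Y = 1)

def f1(theta):
    # F1 = 2 TP / (pi + gamma), with TP and gamma induced by mu and eta.
    tp = sum(m * e for m, e, t in zip(mu, eta, theta) if t == 1)
    gamma = sum(m for m, t in zip(mu, theta) if t == 1)
    return 2.0 * tp / (pi + gamma)

best = max(itertools.product([0, 1], repeat=10), key=f1)
k = sum(best)
top_k_by_eta = set(sorted(range(10), key=lambda i: eta[i], reverse=True)[:k])
is_thresholded = {i for i in range(10) if best[i] == 1} == top_k_by_eta
```
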
The results are presented in Fig. 1. For different metrics, we plot η_x, the predicted optimal threshold δ* (which depends on P) and the Bayes classifier θ*. The results can be seen to be consistent with Theorem 2, i.e. the (exhaustively computed) Bayes optimal classifier matches the thresholded classifier detailed in the theorem.

Figure 1: Simulated results showing η_x, the optimal threshold δ* and the Bayes classifier θ*. Panels: (a) Precision, (b) F1, (c) Weighted Accuracy, (d) Jaccard.

4.2 Benchmark data: Performance of the proposed algorithms

We evaluate the two algorithms on several benchmark datasets for classification. We consider two measures: F1, defined as in Section 2, and Weighted Accuracy, defined as 2(TP + TN) / (2(TP + TN) + FP + FN). We split the training data S into two sets S1 and S2: S1 is used for estimating η̂_x and S2 for selecting δ. For Algorithm 1, we use the logistic loss on the samples (with L2 regularization) to obtain the estimate η̂_x. Once we have the estimate, we use the model to obtain η̂_x for x ∈ S2, and then use the values η̂_x as candidate choices for selecting the optimal threshold (note that the empirical best lies among these choices). Similarly, for Algorithm 2, we use a weighted logistic regression, where the weights depend on the threshold as detailed in our algorithm description. Here, we grid the space [0, 1] to find the best threshold on S2. Notice that this step is embarrassingly parallelizable. The granularity of the grid depends primarily on the class imbalance in the data, and varies with datasets.
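The δ-weighted surrogate objective that Algorithm 2 minimizes for each grid value of δ can be written compactly. A minimal sketch (our illustration of the weighted loss used in (7), instantiated with the logistic surrogate, matching the weighted logistic regression used in the experiments; the function names are ours):

```python
import math

# Minimal sketch of the delta-weighted surrogate loss minimized by
# Algorithm 2: ell_delta(t, y) = (1 - delta) 1{y=1} ell(t, 1)
#                                + delta 1{y=0} ell(t, 0),
# with the logistic loss ell(t, 1) = log(1 + e^{-t}),
#                        ell(t, 0) = log(1 + e^{t}).

def weighted_logistic_loss(t, y, delta):
    if y == 1:
        return (1.0 - delta) * math.log1p(math.exp(-t))
    return delta * math.log1p(math.exp(t))

def empirical_weighted_risk(f, data, delta):
    # The objective of (7) for a fixed weight/threshold delta.
    return sum(weighted_logistic_loss(f(x), y, delta)
               for x, y in data) / len(data)
```

Sweeping δ over a grid, minimizing this risk for each value, and keeping the δ whose resulting classifier maximizes L_n on the held-out split completes the procedure; larger δ penalizes false positives more heavily, pushing the learned φ̂_δ toward thresholding η_x at δ.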
We also compare the two algorithms with standard empirical risk minimization (ERM): regularized logistic regression with threshold 1/2.

First, we optimize for the F1 measure on four benchmark datasets: (1) REUTERS, consisting of 8293 news articles categorized into 65 topics (we obtained the processed dataset from [20]). For each topic, we obtain a highly imbalanced binary classification dataset with the topic as the positive class and the rest as negative. We report the average F1 measure over all the topics (also known as the macro-F1 score). Following the analysis in [6], we present results averaged over topics that have at least C positives in the training (5946 articles) as well as the test (2347 articles) data. (2) LETTERS, consisting of 20000 handwritten letters (16000 training and 4000 test instances) from the English alphabet (26 classes, with each class consisting of at least 100 positive training instances). (3) SCENE (a UCI benchmark), consisting of 2230 images (1137 training and 1093 test instances) categorized into 6 scene types (with each class consisting of at least 100 positive instances). (4) WEBPAGE, a binary text categorization dataset obtained from [21], consisting of 34780 web pages (6956 train and 27824 test), with only about 182 positive instances in the training set. All the datasets, except SCENE, have a high class imbalance. We use our algorithms to optimize for the F1 measure on these datasets. The results are presented in Table 1. We see that both algorithms perform similarly in many cases. A noticeable exception is the SCENE dataset, where Algorithm 1 is better by a large margin. On the REUTERS dataset, we observe that as the number of positive instances C in the training data increases, the methods perform significantly better, and our results align with those in [6] on this dataset.
We also find, albeit surprisingly, that using a threshold of 1/2 performs competitively on this dataset.

DATASET                 C    ERM     Algorithm 1  Algorithm 2
REUTERS (65 classes)    1    0.4855  0.4980       0.5151
                        10   0.7449  0.7600       0.7624
                        50   0.8560  0.8510       0.8428
                        100  0.9670  0.9670       0.9675
LETTERS (26 classes)    1    0.4827  0.5742       0.5686
SCENE (6 classes)       1    0.5916  0.6891       0.3953
WEB PAGE (binary)       1    0.6254  0.6269       0.6267

Table 1: Comparison of methods: F1 measure. The first three are multi-class datasets: F1 is computed individually for each class that has at least C positive instances (in both the train and the test sets) and then averaged over classes (macro-F1).

Next, we optimize for the Weighted Accuracy measure on datasets with less class imbalance. In this case, we can see from Theorem 2 that δ* = 1/2. We use four benchmark datasets: SCENE (same as earlier), IMAGE (2068 images: 1300 train, 1010 test) [22], BREAST CANCER (683 instances: 463 train, 220 test) and SPAMBASE (4601 instances: 3071 train, 1530 test) [23]. Note that the last three are binary datasets. The results are presented in Table 2. Here, we observe that all the methods perform similarly, which conforms to our theoretical guarantees of consistency.

DATASET         ERM     Algorithm 1  Algorithm 2
SCENE           0.9105  0.9000       0.9000
IMAGE           0.9025  0.9063       0.9060
BREAST CANCER   0.9910  0.9910       0.9860
SPAMBASE        0.9463  0.9550       0.9430

Table 2: Comparison of methods: Weighted Accuracy, defined as 2(TP + TN) / (2(TP + TN) + FP + FN). Here δ* = 1/2, and we observe that the two algorithms are consistent with ERM (which thresholds at 1/2).

5 Conclusions and Future Work

Despite the importance of binary classification, theoretical results identifying optimal classifiers and consistent algorithms for many performance metrics used in practice remain open questions. Our goal in this paper is to begin to answer these questions.
We have considered a large family of generalized performance measures that includes many measures used in practice. Our analysis shows that the optimal classifiers for such measures can be characterized as the sign of the thresholded conditional probability of the positive class, with a threshold that depends on the specific metric. This result unifies and generalizes known special cases. We have proposed two algorithms for consistent estimation of the optimal classifiers. While the results presented are an important first step, many open questions remain. It would be interesting to characterize the rate of convergence of L(θ̂) →p L(θ*) as θ̂ →p θ*, using surrogate losses, similar in spirit to how excess 0-1 risk is controlled through excess surrogate risk in [8]. Another important direction is to characterize the entire family of measures for which the optimal classifier is given by thresholded P(Y = 1|x). We would also like to extend our analysis to the multi-class and multi-label domains.

Acknowledgments: This research was supported by NSF grants CCF-1117055 and CCF-1320746. P.R. acknowledges the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1320894.

References
[1] David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference, pages 3-12. Springer-Verlag New York, Inc., 1994.
[2] Chris Drummond and Robert C. Holte. Severe class imbalance: Why better algorithms aren't the answer? In Machine Learning: ECML 2005, pages 539-546. Springer, 2005.
[3] Qiong Gu, Li Zhu, and Zhihua Cai. Evaluation measures of the classification performance of imbalanced data sets. In Computational Intelligence and Intelligent Systems, pages 461-471. Springer, 2009.
[4] Haibo He and Edwardo A. Garcia. Learning from imbalanced data.
IEEE Transactions on Knowledge and Data Engineering, 21(9):1263-1284, 2009.
[5] Charles Elkan. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, volume 17, pages 973-978, 2001.
[6] Nan Ye, Kian Ming A. Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing F-measures: a tale of two approaches. In Proceedings of the International Conference on Machine Learning, 2012.
[7] Aditya Menon, Harikrishna Narasimhan, Shivani Agarwal, and Sanjay Chawla. On the statistical consistency of algorithms for binary classification under class imbalance. In Proceedings of the 30th International Conference on Machine Learning, pages 603-611, 2013.
[8] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.
[9] Clayton Scott. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958-992, 2012.
[10] Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225-287, 2007.
[11] Mark D. Reid and Robert C. Williamson. Composite binary losses. The Journal of Machine Learning Research, 11:2387-2422, 2010.
[12] Tong Zhang. Statistical analysis of some multi-category large margin classification methods. The Journal of Machine Learning Research, 5:1225-1251, 2004.
[13] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. The Journal of Machine Learning Research, 8:1007-1025, 2007.
[14] Ming-Jie Zhao, Narayanan Edakunni, Adam Pocock, and Gavin Brown. Beyond Fano's inequality: bounds on the optimal F-score, BER, and cost-sensitive risk and their implications.
The Journal of Machine Learning Research, 14(1):1033-1090, 2013.
[15] Zachary Chase Lipton, Charles Elkan, and Balakrishnan Narayanaswamy. Thresholding classifiers to maximize F1 score. arXiv, abs/1402.1892, 2014.
[16] Marina Sokolova and Guy Lapalme. A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4):427-437, 2009.
[17] Seung-Seok Choi and Sung-Hyuk Cha. A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, pages 43-48, 2010.
[18] Yaroslav D. Sergeyev. Global one-dimensional optimization using smooth auxiliary functions. Mathematical Programming, 81(1):127-146, 1998.
[19] Mark D. Reid and Robert C. Williamson. Surrogate regret bounds for proper losses. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 897-904. ACM, 2009.
[20] Deng Cai, Xuanhui Wang, and Xiaofei He. Probabilistic dyadic data analysis with local and global consistency. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 105-112. ACM, 2009.
[21] John C. Platt. Fast training of support vector machines using sequential minimal optimization. 1999.
[22] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing IX, pages 41-48. IEEE, 1999.
[23] Steve Webb, James Caverlee, and Calton Pu. Introducing the Webb Spam Corpus: Using email spam to identify web spam automatically. In CEAS, 2006.
[24] Stephen Poythress Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[25] Luc Devroye. A Probabilistic Theory of Pattern Recognition, volume 31. Springer, 1996.
[26] Aditya Menon, Harikrishna Narasimhan, Shivani Agarwal, and Sanjay Chawla.
On the statistical consistency of algorithms for binary classification under class imbalance: supplementary material. In Proceedings of the 30th International Conference on Machine Learning, pages 603-611, 2013.