{"title": "On the Statistical Consistency of Plug-in Classifiers for Non-decomposable Performance Measures", "book": "Advances in Neural Information Processing Systems", "page_first": 1493, "page_last": 1501, "abstract": "We study consistency properties of algorithms for non-decomposable performance measures that cannot be expressed as a sum of losses on individual data points, such as the F-measure used in text retrieval and several other performance measures used in class imbalanced settings. While there has been much work on designing algorithms for such performance measures, there is limited understanding of the theoretical properties of these algorithms. Recently, Ye et al. (2012) showed consistency results for two algorithms that optimize the F-measure, but their results apply only to an idealized setting, where precise knowledge of the underlying probability distribution (in the form of the `true' posterior class probability) is available to a learning algorithm. In this work, we consider plug-in algorithms that learn a classifier by applying an empirically determined threshold to a suitable `estimate' of the class probability, and provide a general methodology to show consistency of these methods for any non-decomposable measure that can be expressed as a continuous function of true positive rate (TPR) and true negative rate (TNR), and for which the Bayes optimal classifier is the class probability function thresholded suitably. We use this template to derive consistency results for plug-in algorithms for the F-measure and for the geometric mean of TPR and precision; to our knowledge, these are the first such results for these measures. In addition, for continuous distributions, we show consistency of plug-in algorithms for any performance measure that is a continuous and monotonically increasing function of TPR and TNR. 
Experimental results confirm our theoretical findings.", "full_text": "On the Statistical Consistency of Plug-in Classifiers for Non-decomposable Performance Measures

Harikrishna Narasimhan†, Rohit Vaish†, Shivani Agarwal
Department of Computer Science and Automation
Indian Institute of Science, Bangalore – 560012, India
{harikrishna, rohit.vaish, shivani}@csa.iisc.ernet.in

Abstract

We study consistency properties of algorithms for non-decomposable performance measures that cannot be expressed as a sum of losses on individual data points, such as the F-measure used in text retrieval and several other performance measures used in class imbalanced settings. While there has been much work on designing algorithms for such performance measures, there is limited understanding of the theoretical properties of these algorithms. Recently, Ye et al. (2012) showed consistency results for two algorithms that optimize the F-measure, but their results apply only to an idealized setting, where precise knowledge of the underlying probability distribution (in the form of the 'true' posterior class probability) is available to a learning algorithm. In this work, we consider plug-in algorithms that learn a classifier by applying an empirically determined threshold to a suitable 'estimate' of the class probability, and provide a general methodology to show consistency of these methods for any non-decomposable measure that can be expressed as a continuous function of true positive rate (TPR) and true negative rate (TNR), and for which the Bayes optimal classifier is the class probability function thresholded suitably. We use this template to derive consistency results for plug-in algorithms for the F-measure and for the geometric mean of TPR and precision; to our knowledge, these are the first such results for these measures. 
In addition, for continuous distributions, we show consistency of plug-in algorithms for any performance measure that is a continuous and monotonically increasing function of TPR and TNR. Experimental results confirm our theoretical findings.

1 Introduction

In many real-world applications, the performance measure used to evaluate a learning model is non-decomposable and cannot be expressed as a summation or expectation of losses on individual data points; this includes, for example, the F-measure used in information retrieval [1], and several combinations of the true positive rate (TPR) and true negative rate (TNR) used in class imbalanced classification settings [2-5] (see Table 1). While there has been much work in the last two decades on designing learning algorithms for such performance measures [6-14], our understanding of the statistical consistency of these methods is rather limited. Recently, Ye et al. (2012) showed consistency results for two algorithms for the F-measure [15] that use the 'true' posterior class probability to make predictions on instances. These results implicitly assume that the given learning algorithm has precise knowledge of the underlying probability distribution (in the form of the true posterior class probability); this assumption does not, however, hold in most real-world settings.

†Both authors contributed equally to this paper.

Table 1: Performance measures considered in our study. Here β ∈ (0, ∞) and p = P(y = 1). Each performance measure can be expressed as P^Ψ_D[h] = Ψ(TPR_D[h], TNR_D[h], p). The last column contains the assumption on the distribution D under which the plug-in algorithm considered in this work is statistically consistent w.r.t. the performance measure (details in Sections 3 and 5).

| Measure | Definition | Ψ(u, v, p) | Ref. | Assumption on D |
| AM (1-BER) | (TPR + TNR)/2 | (u + v)/2 | [17-19] | Assumption A |
| Fβ-measure | (1 + β²)/(β²/TPR + 1/Prec) | (1 + β²)pu / (p + β²(pu + (1 − p)(1 − v))) | [1, 19] | Assumption A |
| G-TP/PR | √(TPR · Prec) | √(pu² / (pu + (1 − p)(1 − v))) | [3] | Assumption A |
| G-Mean (GM) | √(TPR · TNR) | √(uv) | [2, 3] | Assumption B |
| H-Mean (HM) | 2/(1/TPR + 1/TNR) | 2uv/(u + v) | [4] | Assumption B |
| Q-Mean (QM) | 1 − ((1 − TPR)² + (1 − TNR)²)/2 | 1 − ((1 − u)² + (1 − v)²)/2 | [5] | Assumption B |

In this paper, we consider a family of methods that construct a plug-in classifier by applying an empirically determined threshold to a suitable 'estimate' of the class probability (obtained using a model learned from a sample drawn from the underlying distribution). We provide a general methodology to show statistical consistency of these methods (under a mild assumption on the underlying distribution) for any performance measure that can be expressed as a continuous function of the TPR, the TNR and the class proportion, and for which the Bayes optimal classifier is the class probability function thresholded at a suitable point. We use our proof template to derive consistency results for the F-measure (using a recent result by [15] on the Bayes optimal classifier for F-measure), and the geometric mean of TPR and precision; to our knowledge, these are the first such results for these performance measures. Using our template, we also obtain a recent consistency result by Menon et al. [16] for the arithmetic mean of TPR and TNR. In addition, we show that for continuous distributions, the optimal classifier for any performance measure that is a continuous and monotonically increasing function of TPR and TNR is necessarily of the requisite thresholded form, thus establishing consistency of the plug-in algorithms for all such performance measures. 
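As a concrete illustration (ours, not part of the original paper's materials), the entries of Table 1 are plain functions Ψ(u, v, p) of the TPR u, the TNR v and the class proportion p, and can be written directly in code; a minimal sketch:

```python
import math

# Psi(u, v, p): u = TPR, v = TNR, p = proportion of positives.
def am(u, v, p):
    """Arithmetic mean of TPR and TNR (1 - BER)."""
    return (u + v) / 2.0

def f_beta(u, v, p, beta=1.0):
    """F-beta: (1 + b^2) p u / (p + b^2 (p u + (1 - p)(1 - v)))."""
    b2 = beta ** 2
    return (1 + b2) * p * u / (p + b2 * (p * u + (1 - p) * (1 - v)))

def g_tp_pr(u, v, p):
    """Geometric mean of TPR and precision."""
    return math.sqrt(p * u ** 2 / (p * u + (1 - p) * (1 - v)))

def g_mean(u, v, p):
    """Geometric mean of TPR and TNR."""
    return math.sqrt(u * v)

def h_mean(u, v, p):
    """Harmonic mean of TPR and TNR."""
    return 2 * u * v / (u + v)

def q_mean(u, v, p):
    """Quadratic mean based measure 1 - ((1-u)^2 + (1-v)^2)/2."""
    return 1 - ((1 - u) ** 2 + (1 - v) ** 2) / 2.0
```

Note that a perfect classifier (u = v = 1) attains value 1 under every measure above, regardless of p.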
Experiments on real and synthetic data confirm our theoretical findings, and show that the plug-in methods considered here are competitive with the state-of-the-art SVMperf method [12] for non-decomposable measures.

Related Work. Much of the work on non-decomposable performance measures in binary classification settings has focused on the F-measure; this includes the empirical plug-in algorithm considered here [6], cost-weighted versions of SVM [9], methods that optimize convex and non-convex approximations to the F-measure [10-14], and decision-theoretic methods that learn a class probability estimate and compute predictions that maximize the expected F-measure on a test set [7-9]. While there has been a considerable amount of work on consistency of algorithms for univariate performance measures [16, 20-22], theoretical results on non-decomposable measures have been limited to characterizing the Bayes optimal classifier for the F-measure [15, 23, 24], and some consistency results for the F-measure for certain idealized versions of the empirical plug-in and decision-theoretic methods that have access to the true class probability [15]. There has also been some work on algorithms that optimize the F-measure in multi-label classification settings [25, 26] and consistency results for these methods [26, 27], but these results do not apply to the binary classification setting that we consider here; in particular, in a binary classification setting, the F-measure that one seeks to optimize is a single number computed over the entire training set, while in a multi-label setting, the goal is to optimize the mean F-measure computed over multiple labels on individual instances.

Organization. We start with some preliminaries in Section 2. Section 3 presents our main result on consistency of plug-in algorithms for non-decomposable performance measures that are functions of TPR and TNR. 
Section 4 contains applications of our proof template to the AM, Fβ and G-TP/PR measures, and Section 5 contains results under continuous distributions for performance measures that are monotonic in TPR and TNR. Section 6 describes our experimental results on real and synthetic data sets. Proofs not provided in the main text can be found in the Appendix.

2 Preliminaries

Problem Setup. Let X be any instance space. Given a training sample S = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × {±1})^n, our goal is to learn a binary classifier ĥ_S : X → {±1} to make predictions for new instances drawn from X. Assume all examples (both training and test) are drawn iid according to some unknown probability distribution D on X × {±1}. Let η(x) = P(y = 1|x) and p = P(y = 1) (both under D). We will be interested in settings where the performance of ĥ_S is measured via a non-decomposable performance measure P : {±1}^X → R_+, which cannot be expressed as a sum or expectation of losses on individual examples.

Non-decomposable performance measures. Let us first define the following quantities associated with a binary classifier h : X → {±1}:

True Positive Rate / Recall: TPR_D[h] = P(h(x) = 1 | y = 1)
True Negative Rate: TNR_D[h] = P(h(x) = −1 | y = −1)
Precision: Prec_D[h] = P(y = 1 | h(x) = 1) = p TPR_D[h] / (p TPR_D[h] + (1 − p)(1 − TNR_D[h])).

In this paper, we will consider non-decomposable performance measures that can be expressed as a function of the TPR, the TNR and the class proportion p. Specifically, let Ψ : [0, 1]³ → R_+; then the Ψ-performance of h w.r.t. D, which we will denote as P^Ψ_D[h], is defined as:

P^Ψ_D[h] = Ψ(TPR_D[h], TNR_D[h], p).

For example, for β > 0, the Fβ-measure of h can be defined through the function Ψ_Fβ : [0, 1]³ → R_+ given by Ψ_Fβ(u, v, p) = (1 + β²)pu / (p + β²(pu + (1 − p)(1 − v))), which gives P^Fβ_D[h] = (1 + β²) / (β²/TPR_D[h] + 1/Prec_D[h]). Table 1 gives several examples of non-decomposable performance measures that are used in practice. We will also find it useful to consider empirical versions of these performance measures calculated from a sample S, which we will denote as P̂^Ψ_S[h]:

P̂^Ψ_S[h] = Ψ(T̂PR_S[h], T̂NR_S[h], p̂_S), (1)

where p̂_S = (1/n) Σ_{i=1}^n 1(y_i = 1) is an empirical estimate of p, and

T̂PR_S[h] = (1/(p̂_S n)) Σ_{i=1}^n 1(h(x_i) = 1, y_i = 1); T̂NR_S[h] = (1/((1 − p̂_S)n)) Σ_{i=1}^n 1(h(x_i) = −1, y_i = −1)

are the empirical TPR and TNR respectively.¹

Ψ-consistency. We will be interested in the optimum value of P^Ψ_D over all classifiers:

P^{Ψ,*}_D = sup_{h : X → {±1}} P^Ψ_D[h].

In particular, one can define the Ψ-regret of a classifier h as:

regret^Ψ_D[h] = P^{Ψ,*}_D − P^Ψ_D[h].

A learning algorithm is then said to be Ψ-consistent if the Ψ-regret of the classifier ĥ_S output by the algorithm on seeing training sample S converges in probability to 0: regret^Ψ_D[ĥ_S] →^P 0.²

Class of Threshold Classifiers. We will find it useful to define, for any function f : X → [0, 1], the set of classifiers obtained by applying a threshold to f: T_f = {sign ◦ (f − t) | t ∈ [0, 1]}, where sign(u) = 1 if u > 0 and −1 otherwise. 
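For illustration (our sketch, not part of the paper), the empirical quantities of Eq. (1) for a threshold classifier sign ◦ (f − t) can be computed directly from a sample, where `psi` stands for any of the functions Ψ(u, v, p) of Table 1:

```python
def empirical_psi(f_values, t, labels, psi):
    """Empirical Psi-performance (Eq. (1)) of the threshold classifier sign(f - t).

    f_values: values f(x_i) in [0, 1]; labels: +/-1 labels y_i;
    psi: any function of (empirical TPR, empirical TNR, empirical p).
    """
    n = len(labels)
    preds = [1 if fx > t else -1 for fx in f_values]  # sign(f(x) - t)
    n_pos = sum(1 for y in labels if y == 1)
    tp = sum(1 for yh, y in zip(preds, labels) if yh == 1 and y == 1)
    tn = sum(1 for yh, y in zip(preds, labels) if yh == -1 and y == -1)
    u = tp / n_pos          # empirical TPR
    v = tn / (n - n_pos)    # empirical TNR
    return psi(u, v, n_pos / n)
```

For instance, on a sample that the threshold t = 0.5 separates perfectly, any of the measures in Table 1 evaluates to 1.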
For a given f, we shall also define the thresholds corresponding to the maximum population and empirical measures respectively (when they exist) as:

t*_{D,f,Ψ} ∈ argmax_{t ∈ [0,1]} P^Ψ_D[sign ◦ (f − t)]; t̂_{S,f,Ψ} ∈ argmax_{t ∈ [0,1]} P̂^Ψ_S[sign ◦ (f − t)].

Plug-in Algorithms and Result of Ye et al. (2012). In this work, we consider a family of plug-in algorithms, which divide the input sample S into samples (S_1, S_2), use a suitable class probability estimation (CPE) algorithm to learn a class probability estimator η̂_{S_1} : X → [0, 1] from S_1, and output a classifier ĥ_S(x) = sign(η̂_{S_1}(x) − t̂_{S_2,η̂_{S_1},Ψ}), where t̂_{S_2,η̂_{S_1},Ψ} is a threshold that maximizes the empirical performance measure on S_2 (see Algorithm 1). We note that this approach is different from the idealized plug-in method analyzed by Ye et al. (2012) in the context of F-measure optimization, where a classifier is learned by applying an empirical threshold to the 'true' class probability function η [15]; the consistency result therein is useful only if precise knowledge of η is available to a learning algorithm, which is not the case in most practical settings.

L1-consistency of a CPE algorithm. Let C be a CPE algorithm, and for any sample S, denote η̂_S = C(S). We will say C is L1-consistent w.r.t. a distribution D if E_x[|η̂_S(x) − η(x)|] →^P 0.

¹In the setting considered here, the goal is to maximize a (non-decomposable) function of expectations; we note that this is different from the decision-theoretic setting in [15], where one looks at the expectation of a non-decomposable performance measure on n examples, and seeks to maximize its limiting value as n → ∞.
²We say φ(S) converges in probability to a ∈ R, written as φ(S) →^P a, if ∀ε > 0, P_{S∼D^n}(|φ(S) − a| ≥ ε) → 0 as n → ∞.

Algorithm 1 Plug-in with Empirical Threshold for Performance Measure P^Ψ
1: Input: S = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × {±1})^n
2: Parameter: α ∈ (0, 1)
3: Let S_1 = ((x_1, y_1), ..., (x_{n_1}, y_{n_1})), S_2 = ((x_{n_1+1}, y_{n_1+1}), ..., (x_n, y_n)), where n_1 = ⌈nα⌉
4: Learn η̂_{S_1} = C(S_1), where C : ∪_{n=1}^∞ (X × {±1})^n → [0, 1]^X is a suitable CPE algorithm
5: t̂_{S_2,η̂_{S_1},Ψ} ∈ argmax_{t ∈ [0,1]} P̂^Ψ_{S_2}[sign ◦ (η̂_{S_1} − t)]
6: Output: Classifier ĥ_S(x) = sign(η̂_{S_1}(x) − t̂_{S_2,η̂_{S_1},Ψ})

3 A Generic Proof Template for Ψ-consistency of Plug-in Algorithms

We now give a general result for showing consistency of the plug-in method in Algorithm 1 for any performance measure that can be expressed as a continuous function of TPR and TNR, and for which the Bayes optimal classifier is obtained by suitably thresholding the class probability function.

Assumption A. We will say that a probability distribution D on X × {±1} satisfies Assumption A w.r.t. 
Ψ if t*_{D,η,Ψ} exists and is in (0, 1), and the cumulative distribution functions of the random variable η(x) conditioned on y = 1 and on y = −1, P(η(x) ≤ z | y = 1) and P(η(x) ≤ z | y = −1), are continuous at z = t*_{D,η,Ψ}.³

Note that this assumption holds for any distribution D for which η(x) conditioned on y = 1 and on y = −1 is continuous, and also for any D for which η(x) conditioned on y = 1 and on y = −1 is mixed, provided the optimum threshold t*_{D,η,Ψ} for P^Ψ exists and is not a point of discontinuity.

Under the above assumption, and assuming that the CPE algorithm used in Algorithm 1 is L1-consistent (which holds for any algorithm that uses regularized empirical risk minimization of a proper loss [16, 28]), we have our main consistency result.

Theorem 1 (Ψ-consistency of Algorithm 1). Let Ψ : [0, 1]³ → R_+ be continuous in each argument. Let D be a probability distribution on X × {±1} that satisfies Assumption A w.r.t. Ψ, and for which the Bayes optimal classifier is of the form h^{Ψ,*}(x) = sign ◦ (η(x) − t*_{D,η,Ψ}). If the CPE algorithm C in Algorithm 1 is L1-consistent, then Algorithm 1 is Ψ-consistent w.r.t. D.

Before we prove the above theorem, we will find it useful to state the following lemmas. In our first lemma, we state that the TPR and TNR of a classifier constructed by thresholding a suitable class probability estimate at a fixed c ∈ (0, 1) converge respectively to the TPR and TNR of the classifier obtained by thresholding the true class probability function η at c.

Lemma 2 (Convergence of TPR and TNR for fixed threshold). Let D be a distribution on X × {±1}. Let η̂_S : X → [0, 1] be generated by an L1-consistent CPE algorithm. Let c ∈ (0, 1) be an a priori fixed constant such that the cumulative distribution functions P(η(x) ≤ z | y = 1) and P(η(x) ≤ z | y = −1) are continuous at z = c. We then have

TPR_D[sign ◦ (η̂_S − c)] →^P TPR_D[sign ◦ (η − c)]; TNR_D[sign ◦ (η̂_S − c)] →^P TNR_D[sign ◦ (η − c)].

As a corollary to the above lemma, we have a similar result for P^Ψ.

Lemma 3 (Convergence of P^Ψ for fixed threshold). Let Ψ : [0, 1]³ → R_+ be continuous in each argument. Under the conditions of Lemma 2, we have

P^Ψ_D[sign ◦ (η̂_S − c)] →^P P^Ψ_D[sign ◦ (η − c)].

We next state a result showing convergence of the empirical performance measure to its population value for a fixed classifier, and a uniform convergence result over a class of thresholded classifiers.

Lemma 4 (Concentration result for P^Ψ). Let Ψ : [0, 1]³ → R_+ be continuous in each argument. Then for any fixed h : X → {±1} and ε > 0,

P_{S∼D^n}(|P^Ψ_D[h] − P̂^Ψ_S[h]| ≥ ε) → 0 as n → ∞.

Lemma 5 (Uniform convergence of P^Ψ over threshold classifiers). Let Ψ : [0, 1]³ → R_+ be continuous in each argument. For any f : X → [0, 1] and ε > 0,

P_{S∼D^n}(∪_{θ ∈ T_f} {|P^Ψ_D[θ] − P̂^Ψ_S[θ]| ≥ ε}) → 0 as n → ∞.

³For simplicity, we assume that t*_{D,η,Ψ} is in (0, 1); our results easily extend to the case when t*_{D,η,Ψ} ∈ [0, 1].

We are now ready to prove our main theorem.

Proof of Theorem 1. 
Recall that t*_{D,η,Ψ} ∈ argmax_{t ∈ [0,1]} P^Ψ_D[sign ◦ (η − t)] exists by Assumption A. In the following, we shall write t* in place of t*_{D,η,Ψ} and t̂_{S_2,S_1} in place of t̂_{S_2,η̂_{S_1},Ψ}. We have

regret^Ψ_D[ĥ_S] = P^{Ψ,*}_D − P^Ψ_D[sign ◦ (η̂_{S_1} − t̂_{S_2,S_1})] = P^Ψ_D[sign ◦ (η − t*)] − P^Ψ_D[sign ◦ (η̂_{S_1} − t̂_{S_2,S_1})],

which follows from the assumption on the Bayes optimal classifier for P^Ψ. Adding and subtracting empirical and population versions of P^Ψ computed on certain classifiers,

regret^Ψ_D[ĥ_S] = (P^Ψ_D[sign ◦ (η − t*)] − P^Ψ_D[sign ◦ (η̂_{S_1} − t*)])   [term1]
+ (P^Ψ_D[sign ◦ (η̂_{S_1} − t*)] − P̂^Ψ_{S_2}[sign ◦ (η̂_{S_1} − t̂_{S_2,S_1})])   [term2]
+ (P̂^Ψ_{S_2}[sign ◦ (η̂_{S_1} − t̂_{S_2,S_1})] − P^Ψ_D[sign ◦ (η̂_{S_1} − t̂_{S_2,S_1})]).   [term3]

For term2, from the definition of the threshold t̂_{S_2,S_1} (see Algorithm 1), we have

term2 ≤ P^Ψ_D[sign ◦ (η̂_{S_1} − t*)] − P̂^Ψ_{S_2}[sign ◦ (η̂_{S_1} − t*)]. (2)

We now show convergence for each of the above terms. Applying Lemma 3 with c = t* (by Assumption A, t* ∈ (0, 1) and satisfies the necessary continuity assumption), we have term1 →^P 0. Then for any ε > 0,

P_{S∼D^n}(term2 ≥ ε) = P_{S_1∼D^{n_1}, S_2∼D^{n−n_1}}(term2 ≥ ε)
= E_{S_1}[P_{S_2|S_1}(term2 ≥ ε)]
≤ E_{S_1}[P_{S_2|S_1}(|P^Ψ_D[sign ◦ (η̂_{S_1} − t*)] − P̂^Ψ_{S_2}[sign ◦ (η̂_{S_1} − t*)]| ≥ ε)]
→ 0

as n → ∞, where the third step follows from Eq. (2), and the last step follows by applying, for a fixed S_1, the concentration result in Lemma 4 with h = sign ◦ (η̂_{S_1} − t*) (given continuity of Ψ). Finally, for term3, we have for any ε > 0,

P_S(term3 ≥ ε) = E_{S_1}[P_{S_2|S_1}(P̂^Ψ_{S_2}[sign ◦ (η̂_{S_1} − t̂_{S_2,S_1})] − P^Ψ_D[sign ◦ (η̂_{S_1} − t̂_{S_2,S_1})] ≥ ε)]
≤ E_{S_1}[P_{S_2|S_1}(∪_{θ ∈ T_{η̂_{S_1}}} {|P̂^Ψ_{S_2}[θ] − P^Ψ_D[θ]| ≥ ε})]
→ 0

as n → ∞, where the last step follows by applying the uniform convergence result in Lemma 5 over the class of thresholded classifiers T_{η̂_{S_1}} = {sign ◦ (η̂_{S_1} − t) | t ∈ [0, 1]} (for a fixed S_1).

4 Consistency of Plug-in Algorithms for AM, Fβ, and G-TP/PR

We now use the result in Theorem 1 to establish consistency of the plug-in algorithms for the arithmetic mean of TPR and TNR, the Fβ-measure, and the geometric mean of TPR and precision.

4.1 Consistency for AM-measure

The arithmetic mean of TPR and TNR (AM), or one minus the balanced error rate (BER), is a widely-used performance measure in class imbalanced binary classification settings [17-19]:

P^{AM}_D[h] = (TPR_D[h] + TNR_D[h]) / 2.

It can be shown that the Bayes optimal classifier for the AM-measure is of the form h^{AM,*}(x) = sign ◦ (η(x) − p) (see for example [16]), and that the threshold chosen by the plug-in method in Algorithm 1 for the AM-measure is an empirical estimate of p. In recent work, Menon et al. show that this plug-in method is consistent w.r.t. the AM-measure [16]; their proof makes use of a decomposition of the AM-measure in terms of a certain cost-sensitive error and a result of [22] on regret bounds for cost-sensitive classification. We now use our result in Theorem 1 to give an alternate route for showing AM-consistency of this plug-in method.⁴

Theorem 6 (Consistency of Algorithm 1 w.r.t. AM-measure). Let Ψ = Ψ_AM. Let D be a distribution on X × {±1} that satisfies Assumption A w.r.t. Ψ_AM. If the CPE algorithm C in Algorithm 1 is L1-consistent, then Algorithm 1 is AM-consistent w.r.t. D.

Proof. 
We apply Theorem 1, noting that Ψ_AM(u, v, p) = (u + v)/2 is continuous in all its arguments, and that the Bayes optimal classifier for P^{AM} is of the requisite thresholded form.

4.2 Consistency for Fβ-measure

The Fβ-measure, or the (weighted) harmonic mean of TPR and precision, is a popular performance measure used in information retrieval [1]:

P^{Fβ}_D[h] = (1 + β²) TPR_D[h] Prec_D[h] / (β² TPR_D[h] + Prec_D[h]) = (1 + β²) p TPR_D[h] / (p + β²(p TPR_D[h] + (1 − p)(1 − TNR_D[h]))),

where β ∈ (0, ∞) controls the trade-off between TPR and precision. In a recent work, Ye et al. [15] show that the optimal classifier for the Fβ-measure is the class probability η thresholded suitably.

Lemma 7 (Optimality of threshold classifiers for Fβ-measure; Ye et al. (2012) [15]). For any distribution D over X × {±1} that satisfies Assumption A w.r.t. Ψ_Fβ, the Bayes optimal classifier for P^{Fβ} is of the form h^{Fβ,*}(x) = sign ◦ (η(x) − t*_{D,η,Fβ}).

As noted earlier, the authors in [15] show that an idealized plug-in method that applies an empirically determined threshold to the 'true' class probability η is consistent w.r.t. the Fβ-measure. This result is however useful only when the 'true' class probability is available to a learning algorithm, which is not the case in most practical settings. On the other hand, the plug-in method considered in our work constructs a classifier by applying an empirical threshold to a suitable 'estimate' of the class probability. Using Theorem 1, we now show that this method is consistent w.r.t. the Fβ-measure.

Theorem 8 (Consistency of Algorithm 1 w.r.t. Fβ-measure). Let Ψ = Ψ_Fβ in Algorithm 1. Let D be a distribution on X × {±1} that satisfies Assumption A w.r.t. Ψ_Fβ. If the CPE algorithm C in Algorithm 1 is L1-consistent, then Algorithm 1 is Fβ-consistent w.r.t. D.

Proof. We apply Theorem 1, noting that Ψ_Fβ(u, v, p) = (1 + β²)pu / (p + β²(pu + (1 − p)(1 − v))) is continuous in each argument, and that (by Lemma 7) the Bayes optimal classifier for P^{Fβ} is of the requisite form.

4.3 Consistency for G-TP/PR

The geometric mean of TPR and precision (G-TP/PR) is another performance measure proposed for class imbalanced classification problems [3]:

P^{G-TP/PR}_D[h] = √(TPR_D[h] Prec_D[h]) = √(p TPR_D[h]² / (p TPR_D[h] + (1 − p)(1 − TNR_D[h]))).

We first show that the optimal classifier for G-TP/PR is obtained by thresholding the class probability function η at a suitable point; our proof uses a technique similar to the one for the Fβ-measure in [15].

Lemma 9 (Optimality of threshold classifiers for G-TP/PR). For any distribution D on X × {±1} that satisfies Assumption A w.r.t. Ψ_{G-TP/PR}, the Bayes optimal classifier for P^{G-TP/PR} is of the form h^{G-TP/PR,*}(x) = sign(η(x) − t*_{D,η,G-TP/PR}).

Theorem 10 (Consistency of Algorithm 1 w.r.t. G-TP/PR). Let Ψ = Ψ_{G-TP/PR}. Let D be a distribution on X × {±1} that satisfies Assumption A w.r.t. Ψ_{G-TP/PR}. If the CPE algorithm C in Algorithm 1 is L1-consistent, then Algorithm 1 is G-TP/PR-consistent w.r.t. D.

⁴Note that the plug-in classification threshold chosen for the AM-measure is the same independent of the class probability estimate used; our consistency results will therefore apply in this case even if one uses, as in [16], the same sample for both learning a class probability estimate and estimating the plug-in threshold.

Proof. 
We apply Theorem 1, noting that Ψ_{G-TP/PR}(u, v, p) = √(pu² / (pu + (1 − p)(1 − v))) is continuous in each argument, and that (by Lemma 9) the Bayes optimal classifier for P^{G-TP/PR} is of the requisite form.

5 Consistency of Plug-in Algorithms for Non-decomposable Performance Measures that are Monotonic in TPR and TNR

The consistency results seen so far apply to any distribution that satisfies a mild continuity condition at the optimal threshold for a performance measure, and have crucially relied on the specific functional form of the measure. In this section, we shall see that under a stricter continuity assumption on the distribution, the empirical plug-in algorithm can be shown to be consistent w.r.t. any performance measure that is a continuous and monotonically increasing function of TPR and TNR.

Assumption B. We will say that a probability distribution D on X × {±1} satisfies Assumption B w.r.t. Ψ if t*_{D,η,Ψ} exists and is in (0, 1), and the cumulative distribution function of the random variable η(x), P(η(x) ≤ z), is continuous at all z ∈ (0, 1).

Distributions that satisfy the above assumption also satisfy Assumption A. We show that under this assumption, the optimal classifier for any performance measure that is monotonically increasing in TPR and TNR is obtained by thresholding η, and this holds irrespective of the specific functional form of the measure. An application of Theorem 1 then gives us the desired consistency result.

Lemma 11 (Optimality of threshold classifiers for monotonic Ψ under distributional assumption). Let Ψ : [0, 1]³ → R_+ be monotonically increasing in its first two arguments. Then for any distribution D on X × {±1} that satisfies Assumption B, the Bayes optimal classifier for P^Ψ is of the form h^{Ψ,*}(x) = sign(η(x) − t*_{D,η,Ψ}).

Theorem 12 (Consistency of Algorithm 1 for monotonic Ψ under distributional assumption). Let Ψ : [0, 1]³ → R_+ be continuous in each argument, and monotonically increasing in its first two arguments. Let D be a distribution on X × {±1} that satisfies Assumption B. If the CPE algorithm C in Algorithm 1 is L1-consistent, then Algorithm 1 is Ψ-consistent w.r.t. D.

Proof. We apply Theorem 1 by using the continuity assumption on Ψ, and noting that, by Lemma 11 and monotonicity of Ψ, the Bayes optimal classifier for P^Ψ is of the requisite form.

The above result applies to all performance measures listed in Table 1, and in particular, to the geometric, harmonic, and quadratic means of TPR and TNR [2-5], for which the Bayes optimal classifier need not be of the requisite thresholded form for a general distribution (see Appendix C).

6 Experiments

We performed two types of experiments. The first involved synthetic data, where we demonstrate diminishing regret of the plug-in method in Algorithm 1 with growing sample size for different performance measures; since the data is generated from a known distribution, exact calculation of regret is possible here. The second involved real data, where we show that the plug-in algorithm is competitive with the state-of-the-art SVMperf algorithm for non-decomposable measures [12]; we also include for comparison a plug-in method with a fixed threshold of 0.5 (Plug-in (0-1)). We consider three performance measures here: F1-measure, G-TP/PR and G-Mean (see Table 1).

Synthetic data. 
We generated data from a known distribution (class conditionals are multivariate Gaussians with mixing ratio p and equal covariance matrices) for which the optimal classifier for each performance measure considered here is linear, making it sufficient to learn a linear model; the distribution satisfies Assumption B w.r.t. each performance measure. We used regularized logistic regression as the CPE method in Algorithm 1 in order to satisfy the L1-consistency condition in Theorem 1 (see Appendix A.1 and A.4 for details). The experimental results are shown in Figures 1 and 2 for p = 0.5 and p = 0.1 respectively. In each case, the regret for the empirical plug-in method (Plug-in (F1), Plug-in (G-TP/PR) and Plug-in (GM)) goes to zero with increasing training set size, validating our consistency results; SVMperf often fails to exhibit diminishing regret for p = 0.1; and, as expected, Plug-in (0-1), with its a priori fixed threshold, fails to be consistent in most cases.

Figure 1: Experiments on synthetic data with p = 0.5: regret as a function of the number of training examples using various methods for the F1, G-TP/PR and G-Mean performance measures.

Figure 2: Experiments on synthetic data with p = 0.1: regret as a function of the number of training examples using various methods for the F1, G-TP/PR and G-Mean performance measures.

Figure 3: Experiments on real data: results for various methods (using linear models) on four data sets in terms of the F1, G-TP/PR and G-Mean performance measures. Here N, d, p refer to the number of instances, the number of features and the fraction of positives in the data set respectively.

Real data. We ran the three algorithms described earlier over data sets drawn from the UCI ML repository [29] and a cheminformatics data set obtained from [30], and report their performance on held-out test sets.
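To make the plug-in procedure being evaluated concrete, the following sketch writes the three measures as functions of TPR u, TNR v and the positive proportion p, and picks the threshold on class probability estimates that maximizes the chosen empirical measure. This is an illustration in our own notation, not the authors' implementation; names such as `tune_threshold` and `psi_f1` are hypothetical.

```python
# Illustrative sketch (hypothetical names, not the authors' code): the
# empirical plug-in step thresholds class probability estimates eta_hat(x)
# at the value maximizing a measure Psi(TPR, TNR, p) on a tuning sample.
import math

def tpr_tnr(labels, scores, t):
    """Empirical TPR and TNR of the classifier sign(score - t)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s > t)
    tn = sum(1 for y, s in zip(labels, scores) if y == -1 and s <= t)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return tp / max(n_pos, 1), tn / max(n_neg, 1)

def psi_f1(u, v, p):
    """F1 as a function of (u, v, p): 2pu / (p + pu + (1-p)(1-v))."""
    d = p + p * u + (1 - p) * (1 - v)
    return 2 * p * u / d if d > 0 else 0.0

def psi_gtp_pr(u, v, p):
    """G-TP/PR, i.e. sqrt(TPR * Prec) = sqrt(p u^2 / (pu + (1-p)(1-v)))."""
    d = p * u + (1 - p) * (1 - v)
    return math.sqrt(p * u * u / d) if d > 0 else 0.0

def psi_gmean(u, v, p):
    """G-Mean of TPR and TNR (p is unused; kept for a uniform signature)."""
    return math.sqrt(u * v)

def tune_threshold(labels, scores, psi):
    """Return the candidate threshold maximizing the empirical measure."""
    p = sum(1 for y in labels if y == 1) / len(labels)
    return max(sorted(set(scores)),
               key=lambda t: psi(*tpr_tnr(labels, scores, t), p))
```

In Algorithm 1 the class probability estimates come from an L1-consistent CPE method (regularized logistic regression in our experiments), and Theorem 1 then yields Ψ-consistency whenever the Bayes optimal classifier is of the thresholded form.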
Figure 3 contains results for four data sets, car (N = 1728, d = 21, p = 0.038), chemo (N = 2111, d = 1021, p = 0.024), nursery (N = 12960, d = 27, p = 0.025) and letter (N = 18668, d = 16, p = 0.034), averaged over 10 random train-test splits of the original data (also see Appendix A.2 and A.3). Clearly, in most cases, the empirical plug-in performs comparably to SVMperf and outperforms Plug-in (0-1). Moreover, the empirical plug-in was found to exhibit faster run-times than the SVMperf method (see Figure 5 in the Appendix).

7 Conclusions

We have presented a general method for proving consistency of plug-in algorithms that assign an empirical threshold to a suitable class probability estimate for a variety of non-decomposable performance measures for binary classification that can be expressed as a continuous function of TPR and TNR, and for which the Bayes optimal classifier is the class probability function thresholded suitably. We use our template to show consistency for the AM, Fβ and G-TP/PR measures, and, under a continuous distribution, for any performance measure that is continuous and monotonic in TPR and TNR. Our experiments suggest that these algorithms yield performance comparable to the state-of-the-art SVMperf method, while being faster than this method in practice.

Acknowledgments

HN acknowledges support from a Google India PhD Fellowship. SA gratefully acknowledges support from DST, the Indo-US Science and Technology Forum, and an unrestricted gift from Yahoo.

References

[1] C.D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[2] M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One-sided selection. In ICML, 1997.
[3] S. Daskalaki, I. Kopanas, and N. Avouris. Evaluation of classifiers for an uneven class distribution problem. Applied Artificial Intelligence, 20:381–417, 2006.
[4] K. Kennedy, B.M. Namee, and S.J. Delany. Learning without default: A study of one-class classification and the low-default portfolio problem. In ICAICS, 2009.
[5] S. Lawrence, I. Burns, A. Back, A.-C. Tsoi, and C.L. Giles. Neural network classification and prior class probabilities. In Neural Networks: Tricks of the Trade, LNCS 1524, pages 299–313, 1998.
[6] Y. Yang. A study of thresholding strategies for text categorization. In SIGIR, 2001.
[7] D.D. Lewis. Evaluating and optimizing autonomous text classification systems. In SIGIR, 1995.
[8] K.M.A. Chai. Expectation of F-measures: Tractable exact computation and some empirical observations of its properties. In SIGIR, 2005.
[9] D.R. Musicant, V.
Kumar, and A. Ozgur. Optimizing F-measure with support vector machines. In FLAIRS, 2003.
[10] S. Gao, W. Wu, C.-H. Lee, and T.-S. Chua. A maximal figure-of-merit learning approach to text categorization. In SIGIR, 2003.
[11] M. Jansche. Maximum expected F-measure training of logistic regression models. In HLT, 2005.
[12] T. Joachims. A support vector method for multivariate performance measures. In ICML, 2005.
[13] Z. Liu, M. Tan, and F. Jiang. Regularized F-measure maximization for feature selection and classification. BioMed Research International, 2009, 2009.
[14] P.M. Chinta, P. Balamurugan, S. Shevade, and M.N. Murty. Optimizing F-measure with non-convex loss and sparse linear classifiers. In IJCNN, 2013.
[15] N. Ye, K.M.A. Chai, W.S. Lee, and H.L. Chieu. Optimizing F-measures: A tale of two approaches. In ICML, 2012.
[16] A.K. Menon, H. Narasimhan, S. Agarwal, and S. Chawla. On the statistical consistency of algorithms for binary classification under class imbalance. In ICML, 2013.
[17] J. Cheng, C. Hatzis, H. Hayashi, M.-A. Krogel, S. Morishita, D. Page, and J. Sese. KDD Cup 2001 report. ACM SIGKDD Explorations Newsletter, 3(2):47–64, 2002.
[18] R. Powers, M. Goldszmidt, and I. Cohen. Short term performance forecasting in enterprise systems. In KDD, 2005.
[19] Q. Gu, L. Zhu, and Z. Cai. Evaluation measures of the classification performance of imbalanced data sets. In Computational Intelligence and Intelligent Systems, volume 51, pages 461–471, 2009.
[20] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56–134, 2004.
[21] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[22] C. Scott. Calibrated asymmetric surrogate losses.
Electronic Journal of Statistics, 6:958–992, 2012.
[23] M. Zhao, N. Edakunni, A. Pocock, and G. Brown. Beyond Fano's inequality: Bounds on the optimal F-score, BER, and cost-sensitive risk and their implications. Journal of Machine Learning Research, 14(1):1033–1090, 2013.
[24] Z.C. Lipton, C. Elkan, and B. Naryanaswamy. Optimal thresholding of classifiers to maximize F1 measure. In ECML/PKDD, 2014.
[25] J. Petterson and T. Caetano. Reverse multi-label learning. In NIPS, 2010.
[26] K. Dembczynski, W. Waegeman, W. Cheng, and E. Hüllermeier. An exact algorithm for F-measure maximization. In NIPS, 2011.
[27] K. Dembczynski, A. Jachnik, W. Kotlowski, W. Waegeman, and E. Hüllermeier. Optimizing the F-measure in multi-label classification: Plug-in rule approach versus structured loss minimization. In ICML, 2013.
[28] S. Agarwal. Surrogate regret bounds for the area under the ROC curve via strongly proper losses. In COLT, 2013.
[29] A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL: http://archive.ics.uci.edu/ml.
[30] R.N. Jorissen and M.K. Gilson. Virtual screening of molecular databases using a support vector machine. Journal of Chemical Information and Modeling, 45:549–561, 2005.