{"title": "Precision-Recall-Gain Curves: PR Analysis Done Right", "book": "Advances in Neural Information Processing Systems", "page_first": 838, "page_last": 846, "abstract": "Precision-Recall analysis abounds in applications of binary classification where true negatives do not add value and hence should not affect assessment of the classifier's performance. Perhaps inspired by the many advantages of receiver operating characteristic (ROC) curves and the area under such curves for accuracy-based performance assessment, many researchers have taken to report Precision-Recall (PR) curves and associated areas as performance metric. We demonstrate in this paper that this practice is fraught with difficulties, mainly because of incoherent scale assumptions -- e.g., the area under a PR curve takes the arithmetic mean of precision values whereas the $F_{\\beta}$ score applies the harmonic mean. We show how to fix this by plotting PR curves in a different coordinate system, and demonstrate that the new Precision-Recall-Gain curves inherit all key advantages of ROC curves. In particular, the area under Precision-Recall-Gain curves conveys an expected $F_1$ score on a harmonic scale, and the convex hull of a Precision-Recall-Gain curve allows us to calibrate the classifier's scores so as to determine, for each operating point on the convex hull, the interval of $\\beta$ values for which the point optimises $F_{\\beta}$. We demonstrate experimentally that the area under traditional PR curves can easily favour models with lower expected $F_1$ score than others, and so the use of Precision-Recall-Gain curves will result in better model selection.", "full_text": "Precision-Recall-Gain Curves:\n\nPR Analysis Done Right\n\nPeter A. 
Flach\n\nIntelligent Systems Laboratory\n\nUniversity of Bristol, United Kingdom\nPeter.Flach@bristol.ac.uk\n\nMeelis Kull\n\nIntelligent Systems Laboratory\n\nUniversity of Bristol, United Kingdom\nMeelis.Kull@bristol.ac.uk\n\nAbstract\n\nPrecision-Recall analysis abounds in applications of binary classi\ufb01cation where\ntrue negatives do not add value and hence should not affect assessment of the\nclassi\ufb01er\u2019s performance. Perhaps inspired by the many advantages of receiver op-\nerating characteristic (ROC) curves and the area under such curves for accuracy-\nbased performance assessment, many researchers have taken to report Precision-\nRecall (PR) curves and associated areas as performance metric. We demonstrate\nin this paper that this practice is fraught with dif\ufb01culties, mainly because of in-\ncoherent scale assumptions \u2013 e.g., the area under a PR curve takes the arithmetic\nmean of precision values whereas the F\u03b2 score applies the harmonic mean. We\nshow how to \ufb01x this by plotting PR curves in a different coordinate system, and\ndemonstrate that the new Precision-Recall-Gain curves inherit all key advantages\nof ROC curves. In particular, the area under Precision-Recall-Gain curves con-\nveys an expected F1 score on a harmonic scale, and the convex hull of a Precision-\nRecall-Gain curve allows us to calibrate the classi\ufb01er\u2019s scores so as to determine,\nfor each operating point on the convex hull, the interval of \u03b2 values for which the\npoint optimises F\u03b2 . 
We demonstrate experimentally that the area under traditional\nPR curves can easily favour models with lower expected F1 score than others, and\nso the use of Precision-Recall-Gain curves will result in better model selection.\n\n1\n\nIntroduction and Motivation\n\nIn machine learning and related areas we often need to optimise multiple performance measures,\nsuch as per-class classi\ufb01cation accuracies, precision and recall in information retrieval, etc. We then\nhave the option to \ufb01x a particular way to trade off these performance measures: e.g., we can use\noverall classi\ufb01cation accuracy which gives equal weight to correctly classi\ufb01ed instances regardless\nof their class; or we can use the F1 score which takes the harmonic mean of precision and recall.\nHowever, multi-objective optimisation suggests that to delay \ufb01xing a trade-off for as long as possible\nhas practical bene\ufb01ts, such as the ability to adapt a model or set of models to changing operating\ncontexts. The latter is essentially what receiver operating characteristic (ROC) curves do for bi-\nnary classi\ufb01cation. In an ROC plot we plot true positive rate (the proportion of correctly classi\ufb01ed\npositives, also denoted tpr) on the y-axis against false positive rate (the proportion of incorrectly\nclassi\ufb01ed negatives, also denoted fpr) on the x-axis. 
A categorical classifier evaluated on a test set gives rise to a single ROC point, while a classifier which outputs scores (henceforth called a model) can generate a set of points (commonly referred to as the ROC curve) by varying the decision threshold (Figure 1 (left)).
ROC curves are widely used in machine learning and their main properties are well understood [3]. These properties can be summarised as follows.

Figure 1: (left) ROC curve with non-dominated points (red circles) and convex hull (red dotted line). (right) Corresponding Precision-Recall curve with non-dominated points (red circles).

Universal baselines: the major diagonal of an ROC plot depicts the line of random performance which can be achieved without training. More specifically, a random classifier assigning the positive class with probability p and the negative class with probability 1−p has expected true positive rate of p and true negative rate of 1−p, represented by the ROC point (p, p). The upper-left (lower-right) triangle of ROC plots hence denotes better (worse) than random performance. Related baselines include the always-negative and always-positive classifier which occupy fixed points in ROC plots (the origin and the upper right-hand corner, respectively). These baselines are universal as they don't depend on the class distribution.

Linear interpolation: any point on a straight line between two points representing the performance of two classifiers (or thresholds) A and B can be achieved by making a suitably biased random choice between A and B [14].
Effectively this creates an interpolated contingency table which is a linear combination of the contingency tables of A and B, and since all three tables involve the same numbers of positives and negatives it follows that the interpolated accuracy as well as true and false positive rates are also linear combinations of the corresponding quantities pertaining to A and B. The slope of the connecting line determines the trade-off between the classes under which any linear combination of A and B would yield equivalent performance. In particular, test set accuracy assuming uniform misclassification costs is represented by accuracy isometrics with slope (1−π)/π, where π is the proportion of positives [5].

Optimality: a point D dominates another point E if D's tpr and fpr are not worse than E's and at least one of them is strictly better. The set of non-dominated points – the Pareto front – establishes the set of classifiers or thresholds that are optimal under some trade-off between the classes. Due to linearity any interpolation between non-dominated points is both achievable and non-dominated, giving rise to the convex hull (ROCCH) which can be easily constructed both algorithmically and by visual inspection.

Area: the proportion of the unit square which falls under an ROC curve (AUROC) has a well-known meaning as a ranking performance measure: it estimates the probability that a randomly chosen positive is ranked higher by the model than a randomly chosen negative [7]. More importantly in a classification context, there is a linear relationship between AUROC = ∫₀¹ tpr d fpr and the expected accuracy acc = π·tpr + (1−π)(1−fpr) averaged over all possible predicted positive rates rate = π·tpr + (1−π)·fpr, which can be established by a change of variable: E[acc] = ∫₀¹ acc d rate = π(1−π)(2·AUROC − 1) + 1/2 [8].

Calibration: slopes of convex hull segments can be interpreted as empirical likelihood ratios associated with a particular interval of raw classifier scores. This gives rise to a non-parametric calibration procedure which is also called isotonic regression [19] or pool adjacent violators [4] and results in a calibration map which maps each segment of ROCCH with slope r to a calibrated score c = πr/(πr + (1−π)) [6]. Define a skew-sensitive version of accuracy as acc_c ≜ 2cπ·tpr + 2(1−c)(1−π)(1−fpr) (i.e., standard accuracy is acc_{c=1/2}); then a perfectly calibrated classifier outputs, for every instance, the value of c for which the instance is on the acc_c decision boundary.

Alternative solutions for each of these exist. For example, parametric alternatives to ROCCH calibration exist based on the logistic function, e.g. Platt scaling [13]; as do alternative ways to aggregate classification performance across different operating points, e.g. the Brier score [8]. However, the power of ROC analysis derives from the combination of the above desirable properties, which helps to explain its popularity across the machine learning discipline.
This paper presents fundamental improvements in Precision-Recall analysis, inspired by ROC analysis, as follows.
(i) We identify in Section 2 the problems with current practice in Precision-Recall curves by demonstrating that they fail to satisfy each of the above properties in some respect. (ii) We propose a principled way to remedy all these problems by means of a change of coordinates in Section 3. (iii) In particular, our improved Precision-Recall-Gain curves enclose an area that is directly related to expected F1 score – on a harmonic scale – in a similar way as AUROC is related to expected accuracy. (iv) Furthermore, with Precision-Recall-Gain curves it is possible to calibrate a model for Fβ in the sense that the predicted score for any instance determines the value of β for which the instance is on the Fβ decision boundary. (v) We give experimental evidence in Section 4 that this matters by demonstrating that the area under traditional Precision-Recall curves can easily favour models with lower expected F1 score than others.
Proofs of the formal results are found in the Supplementary Material; see also http://www.cs.bris.ac.uk/~flach/PRGcurves/.

2 Traditional Precision-Recall Analysis

Over-abundance of negative examples is a common phenomenon in many subfields of machine learning and data mining, including information retrieval, recommender systems and social network analysis. Indeed, most web pages are irrelevant for most queries, and most links are absent from most networks. Classification accuracy is not a sensible evaluation measure in such situations, as it over-values the always-negative classifier. Neither does adjusting the class imbalance through cost-sensitive versions of accuracy help, as this will not just downplay the benefit of true negatives but also the cost of false positives.
A good solution in this case is to ignore true negatives altogether\nand use precision, de\ufb01ned as the proportion of true positives among the positive predictions, as\nperformance metric instead of false positive rate. In this context, the true positive rate is usually\nrenamed to recall. More formally, we de\ufb01ne precision as prec = TP/(TP + FP) and recall as rec =\nTP/(TP + FN), where TP, FP and FN denote the number of true positives, false positives and false\nnegatives, respectively.\nPerhaps motivated by the appeal of ROC plots, many researchers have begun to produce Precision-\nRecall or PR plots with precision on the y-axis against recall on the x-axis. Figure 1 (right) shows the\nPR curve corresponding to the ROC curve on the left. Clearly there is a one-to-one correspondence\nbetween the two plots as both are based on the same contingency tables [2]. In particular, precision\nassociated with an ROC point is proportional to the angle between the line connecting the point with\nthe origin and the x-axis. However, this is where the similarity ends as PR plots have none of the\naforementioned desirable properties of ROC plots.\n\nNon-universal baselines: a random classi\ufb01er has precision \u03c0 and hence baseline performance is a\nhorizontal line which depends on the class distribution. The always-positive classi\ufb01er is at\nthe right-most end of this baseline (the always-negative classi\ufb01er has unde\ufb01ned precision).\nNon-linear interpolation: the main reason for this is that precision in a linearly interpolated con-\ntingency table is only a linear combination of the original precision values if the two clas-\nsi\ufb01ers have the same predicted positive rate (which is impossible if the two contingency\ntables arise from different decision thresholds on the same model). [2] discusses this fur-\nther and also gives an interpolation formula. 
More generally, it isn\u2019t meaningful to take the\narithmetic average of precision values.\n\nNon-convex Pareto front: the set of non-dominated operating points continues to be well-de\ufb01ned\n(see the red circles in Figure 1 (right)) but in the absence of linear interpolation this set isn\u2019t\nconvex for PR curves, nor is it straightforward to determine by visual inspection.\n\n3\n\n\fUninterpretable area: although many authors report the area under the PR curve (AUPR) it doesn\u2019t\nhave a meaningful interpretation beyond the geometric one of expected precision when\nuniformly varying the recall (and even then the use of the arithmetic average cannot be\njusti\ufb01ed). Furthermore, PR plots have unachievable regions at the lower right-hand side,\nthe size of which depends on the class distribution [1].\n\nNo calibration: although some results exist regarding the relationship between calibrated scores\nand F1 score (more about this below) these are unrelated to the PR curve. To the best of\nour knowledge there is no published procedure to output scores that are calibrated for F\u03b2 \u2013\nthat is, which give the value of \u03b2 for which the instance is on the F\u03b2 decision boundary.\n\n2.1 The F\u03b2 measure\nThe standard way to combine precision and recall into a single performance measure is through the\nF1 score [16]. It is commonly de\ufb01ned as the harmonic mean of precision and recall:\n\n(1)\n\nF1 (cid:44)\n\n2\n\n1/prec + 1/rec =\n\n2prec\u00b7 rec\nprec + rec =\n\nTP\n\nTP + (FP + FN)/2\n\nThe last form demonstrates that the harmonic mean is natural here as it corresponds to taking the\narithmetic mean of the numbers of false positives and false negatives. 
Another way to understand the F1 score is as the accuracy in a modified contingency table which copies the true positive count to the true negatives:

                Predicted ⊕    Predicted ⊖
Actual ⊕        TP             FN              Pos
Actual ⊖        FP             TP              Neg − (TN − TP)
                TP + FP        Pos             2TP + FP + FN

We can take a weighted harmonic mean which is commonly parametrised as follows:

Fβ ≜ 1/( (1/(1+β²))·(1/prec) + (β²/(1+β²))·(1/rec) ) = (1+β²)TP/((1+β²)TP + FP + β²FN)    (2)

There is a range of recent papers studying the F-score, several of which appeared in last year's NIPS conference [12, 9, 11]. Relevant results include the following: (i) non-decomposability of the Fβ score, meaning it is not an average over instances (it is a ratio of such averages, called a pseudo-linear function by [12]); (ii) estimators exist that are consistent: i.e., they are unbiased in the limit [9, 11]; (iii) given a model, operating points that are optimal for Fβ can be achieved by thresholding the model's scores [18]; (iv) a classifier yielding perfectly calibrated posterior probabilities has the property that the optimal threshold for F1 is half the optimal F1 at that point (first proved by [20] and later by [10], while generalised to Fβ by [9]). The latter results tell us that optimal thresholds for Fβ are lower than optimal thresholds for accuracy (or equal only in the case of the perfect model). They don't, however, tell us how to find such thresholds other than by tuning (and [12] propose a method inspired by cost-sensitive classification).
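To make Equations (1) and (2) concrete, the Fβ score can be computed directly from the contingency-table counts TP, FP and FN. The following Python sketch is ours, not part of the paper's accompanying code release, and the function names are illustrative:

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F_beta from contingency counts, per Equation (2):
    (1 + beta^2) * TP / ((1 + beta^2) * TP + FP + beta^2 * FN)."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + fp + b2 * fn)

def f_beta_harmonic(prec, rec, beta=1.0):
    """Equivalent weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return 1.0 / ((1 / (1 + b2)) / prec + (b2 / (1 + b2)) / rec)
```

For TP = 50, FP = 10, FN = 30 both forms give F1 = 100/140 ≈ 0.714, illustrating that the count-based form and the weighted harmonic mean coincide.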
The analysis in the next section significantly extends these results by demonstrating how we can identify all Fβ-optimal thresholds for any β in a single calibration procedure.

3 Precision-Recall-Gain Curves

In this section we demonstrate how Precision-Recall analysis can be adapted to inherit all the benefits of ROC analysis. While technically straightforward, the implications of our results are far-reaching. For example, even something as seemingly innocuous as reporting the arithmetic average of F1 values over cross-validation folds is methodologically misguided: we will define the corresponding performance measure that can safely be averaged.

Figure 2: (left) Conventional PR curve with hyperbolic F1 isometrics (dotted lines) and the baseline performance of the always-positive classifier (solid hyperbola). (right) Precision-Recall-Gain curve with minor diagonal as baseline, parallel F1 isometrics and a convex Pareto front.

3.1 Baseline

A random classifier that predicts positive with probability p has Fβ score (1+β²)pπ/(p + β²π). This is monotonically increasing in p ∈ [0,1] hence reaches its maximum for p = 1, the always-positive classifier. Hence Precision-Recall analysis differs from classification accuracy in that the baseline to beat is the always-positive classifier rather than any random classifier. This baseline has prec = π and rec = 1, and it is easily seen that any model with prec < π or rec < π loses against this baseline. Hence it makes sense to consider only precision and recall values in the interval [π, 1]. Any real-valued variable x ∈ [min, max] can be rescaled by the mapping x ↦ (x − min)/(max − min).
However, the linear scale is inappropriate here and we should use a harmonic scale instead, hence map to

(1/x − 1/min)/(1/max − 1/min) = max·(x − min)/((max − min)·x)    (3)

Taking max = 1 and min = π we arrive at the following definition.
Definition 1 (Precision Gain and Recall Gain).

precG = (prec − π)/((1−π)·prec) = 1 − (π/(1−π))·(FP/TP),    recG = (rec − π)/((1−π)·rec) = 1 − (π/(1−π))·(FN/TP)    (4)

A Precision-Recall-Gain curve plots Precision Gain on the y-axis against Recall Gain on the x-axis in the unit square (i.e., negative gains are ignored).

An example PRG curve is given in Figure 2 (right). The always-positive classifier has recG = 1 and precG = 0 and hence gets plotted in the lower right-hand corner of Precision-Recall-Gain space, regardless of the class distribution. Since we show in the next section that F1 isometrics have slope −1 in this space it follows that all classifiers with baseline F1 performance end up on the minor diagonal in Precision-Recall-Gain space. In contrast, the corresponding F1 isometric in PR space is hyperbolic (Figure 2 (left)) and its exact location depends on the class distribution.

3.2 Linearity and optimality

One of the main benefits of PRG space is that it allows linear interpolation. This manifests itself in two ways: any point on a straight line between two endpoints is achievable by random choice between the endpoints (Theorem 1) and Fβ isometrics are straight lines with slope −β² (Theorem 2).
Theorem 1.
Let P1 = (precG1, recG1) and P2 = (precG2, recG2) be points in the Precision-Recall-Gain space representing the performance of Models 1 and 2 with contingency tables C1 and C2. Then a model with an interpolated contingency table C* = λC1 + (1−λ)C2 has precision gain precG* = µ·precG1 + (1−µ)·precG2 and recall gain recG* = µ·recG1 + (1−µ)·recG2, where µ = λTP1/(λTP1 + (1−λ)TP2).
Theorem 2. precG + β²·recG = (1 + β²)·FGβ, with FGβ = (Fβ − π)/((1−π)·Fβ) = 1 − (π/(1−π))·(FP + β²FN)/((1+β²)TP).

FGβ is a linearised version of Fβ in the same way as precG and recG are linearised versions of precision and recall. FGβ measures the gain in performance (on a linear scale) relative to a classifier with both precision and recall – and hence Fβ – equal to π. F1 isometrics are indicated in Figure 2 (right). By increasing (decreasing) β² these lines of constant Fβ become steeper (flatter) and hence we are putting more emphasis on recall (precision).
With regard to optimality, we already knew that every classifier or threshold optimal for Fβ for some β² is optimal for acc_c for some c. The reverse also holds, except for the ROC convex hull points below the baseline (e.g., the always-negative classifier). Due to linearity the PRG Pareto front is convex and easily constructed by visual inspection.
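The mapping of Definition 1 and the identity of Theorem 2 can be sketched in a few lines of Python. This is our own illustrative code, with the positive class proportion π passed in as `pi`; the identifiers are not taken from the paper's released implementation:

```python
def gain(x, pi):
    """Harmonic rescaling of x from [pi, 1] onto [0, 1]
    (Equation 3 with max = 1, min = pi)."""
    return (x - pi) / ((1 - pi) * x)

def prg_point(tp, fp, fn, pi):
    """Precision Gain and Recall Gain of a contingency table (Definition 1)."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return gain(prec, pi), gain(rec, pi)
```

As a sanity check of Theorem 2, one can verify numerically for any counts that precG + β²·recG equals (1 + β²)·gain(Fβ, π).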
We will see in Section 3.4 that these segments of the PRG convex hull can be used to obtain classifier scores specifically calibrated for F-scores, thereby pre-empting the need for any more threshold tuning.

3.3 Area

Define the area under the Precision-Recall-Gain curve as AUPRG = ∫₀¹ precG d recG. We will show how this area can be related to an expected FG1 score when averaging over the operating points on the curve in a particular way. To this end we define ∆ = recG/π − precG/(1−π), which expresses the extent to which recall exceeds precision (reweighting by π and 1−π guarantees that ∆ is monotonically increasing when changing the threshold towards having more positive predictions, as shown in the proof of Theorem 3 in the Supplementary Material). Hence, −y0/(1−π) ≤ ∆ ≤ 1/π, where y0 denotes the precision gain at the operating point where recall gain is zero. The following theorem shows that if the operating points are chosen such that ∆ is uniformly distributed in this range, then the expected FG1 can be calculated from the area under the Precision-Recall-Gain curve (the Supplementary Material proves a more general result for expected FGβ). This justifies the use of AUPRG as a performance metric without fixing the classifier's operating point in advance.
Theorem 3. Let the operating points of a model with area under the Precision-Recall-Gain curve AUPRG be chosen such that ∆ is uniformly distributed within [−y0/(1−π), 1/π].
Then the expected FG1 score is equal to

E[FG1] = (AUPRG/2 + 1/4 − π(1 − y0²)/4) / (1 − π(1 − y0))    (5)

The expected reciprocal F1 score can be calculated from the relationship E[1/F1] = (1 − (1−π)·E[FG1])/π which follows from the definition of FGβ. In the special case where y0 = 1 the expected FG1 score is AUPRG/2 + 1/4.

3.4 Calibration

Figure 3 (left) shows an ROC curve with empirically calibrated posterior probabilities obtained by isotonic regression [19] or the ROC convex hull [4]. Segments of the convex hull are labelled with the value of c for which the two endpoints have the same skew-sensitive accuracy acc_c. Conversely, if a point connects two segments with c1 < c2 then that point is optimal for any c such that c1 < c < c2. The calibrated values c are derived from the ROC slope r by c = πr/(πr + (1−π)) [6]. For example, the point on the convex hull two steps up from the origin optimises skew-sensitive accuracy acc_c for 0.29 < c < 0.75 and hence also standard accuracy (c = 1/2). We are now in a position to calculate similarly calibrated scores for F-score.
Theorem 4. Let two classifiers be such that prec1 > prec2 and rec1 < rec2, then these two classifiers have the same Fβ score if and only if

β² = −(1/prec1 − 1/prec2)/(1/rec1 − 1/rec2)    (6)

In line with ROC calibration we convert these slopes into a calibrated score between 0 and 1:

d = 1/(β² + 1) = (1/rec1 − 1/rec2)/((1/rec1 − 1/rec2) − (1/prec1 − 1/prec2))    (7)

It is important to note that there is no model-independent relationship between ROC-calibrated scores and PRG-calibrated scores, so we cannot derive d from c. However, we can equip a model with two calibration maps, one for accuracy and the other for F-score.

Figure 3: (left) ROC curve with scores empirically calibrated for accuracy.
The green dots correspond to a regular grid in Precision-Recall-Gain space. (right) Precision-Recall-Gain curve with scores calibrated for Fβ. The green dots correspond to a regular grid in ROC space, clearly indicating that ROC analysis over-emphasises the high-recall region.

Figure 3 (right) shows the PRG curve for the running example with scores calibrated for Fβ. Score 0.76 corresponds to β² = (1 − 0.76)/0.76 = 0.32 and score 0.49 corresponds to β² = 1.04, so the point closest to the Precision-Recall breakeven line optimises Fβ for 0.32 < β² < 1.04 and hence also F1 (but note that the next point to the right on the convex hull is nearly as good for F1, on account of the connecting line segment having a calibrated score close to 1/2).

4 Practical examples

The key message of this paper is that precision, recall and F-score are expressed on a harmonic scale and hence any kind of arithmetic average of these quantities is methodologically wrong. We now demonstrate that this matters in practice. In particular, we show that in some sense, AUPR and AUPRG are as different from each other as AUPR and AUROC. Using the OpenML platform [17] we took all those binary classification tasks which have 10-fold cross-validated predictions using at least 30 models from different learning methods (these are called flows in OpenML). In each of the obtained 886 tasks (covering 426 different datasets) we applied the following procedure. First, we fetched the predicted scores of 30 randomly selected models from different flows and calculated areas under ROC, PRG and PR curves (with hyperbolic interpolation as recommended by [2]), with minority class as positives. We then ranked the 30 models with respect to these measures. Figure 4 plots AUPRG-rank against AUPR-rank across all 25980 models.
Figure 4 (left) demonstrates that AUPR and AUPRG often disagree in ranking the models.
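The score-to-trade-off conversion used in the calibration example above (score 0.76 ↦ β² ≈ 0.32) is just Equation (7) inverted. A two-line Python sketch, with naming of our own choosing:

```python
def beta2_from_score(d):
    """Invert the PRG-calibrated score d = 1 / (beta^2 + 1) from Equation (7)."""
    return (1 - d) / d

def score_from_beta2(b2):
    """Map a trade-off beta^2 back to a calibrated score in (0, 1]."""
    return 1 / (b2 + 1)
```

A calibrated score of 1/2 corresponds to β² = 1, i.e. the F1 operating point.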
In particular, they disagree on the best method in 24% of the tasks and on the top three methods in 58% of the tasks (i.e., they agree on top, second and third method in 42% of the tasks). This amount of disagreement is comparable to the disagreement between AUPR and AUROC (29% and 65% disagreement for top 1 and top 3, respectively) and between AUPRG and AUROC (22% and 57%). Therefore, AUPR, AUPRG and AUROC are related quantities, but still all significantly different. The same conclusion is supported by the pairwise correlations between the ranks across all tasks: the correlation between AUPR-ranks and AUPRG-ranks is 0.95, between AUPR and AUROC it is 0.95, and between AUPRG and AUROC it is 0.96.
Figure 4 (right) shows AUPRG vs AUPR in two datasets with relatively low and high rank correlations (0.944 and 0.991, selected as lower and upper quartiles among all tasks). In both datasets AUPR and AUPRG agree on the best model. However, in the white-clover dataset the second best is AdaBoost according to AUPRG and Logistic Regression according to AUPR. As seen in Figure 5, this disagreement is caused by AUPR taking into account the poor performance of AdaBoost in the early part of the ranking; AUPRG ignores this part as it has negative recall gain.

Figure 4: (left) Comparison of AUPRG-ranks vs AUPR-ranks. Each cell shows how many models across 886 OpenML tasks have these ranks among the 30 models in the same task. (right) Comparison of AUPRG vs AUPR in OpenML tasks with IDs 3872 (white-clover) and 3896 (ada-agnostic), with 30 models in each task. Some models perform worse than random (AUPRG < 0) and are not plotted.
The models represented by the two encircled triangles are shown in detail in Figure 5.\n\nFigure 5: (left) ROC curves for AdaBoost (solid line) and Logistic Regression (dashed line) on the\nwhite-clover dataset (OpenML run IDs 145651 and 267741, respectively). (middle) Corresponding\nPR curves. The solid curve is on average lower with AUPR = 0.724 whereas the dashed curve has\nAUPR = 0.773. (right) Corresponding PRG curves, where the situation has reversed: the solid curve\nhas AUPRG = 0.714 while the dashed curve has a lower AUPRG of 0.687.\n\n5 Concluding remarks\n\nIf a practitioner using PR-analysis and the F-score should take one methodological recommendation\nfrom this paper, it is to use the F-Gain score instead to make sure baselines are taken into account\nproperly and averaging is done on the appropriate scale. If required the FG\u03b2 score can be converted\nback to an F\u03b2 score at the end. The second recommendation is to use Precision-Recall-Gain curves\ninstead of PR curves, and the third to use AUPRG which is easier to calculate than AUPR due to\nlinear interpolation, has a proper interpretation as an expected F-Gain score and allows performance\nassessment over a range of operating points. To assist practitioners we have made R, Matlab and\nJava code to calculate AUPRG and PRG curves available at http://www.cs.bris.ac.uk/\n\u02dcflach/PRGcurves/. We are also working on closer integration of AUPRG as an evaluation\nmetric in OpenML and performance visualisation platforms such as ViperCharts [15].\nAs future work we mention the interpretation of AUPRG as a measure of ranking performance: we\nare working on an interpretation which gives non-uniform weights to the positives and as such is\nrelated to Discounted Cumulative Gain. 
A second line of research involves the use of cost curves for the FGβ score and associated threshold choice methods.

Acknowledgments  This work was supported by the REFRAME project granted by the European Coordinated Research on Long-Term Challenges in Information and Communication Sciences & Technologies ERA-Net (CHIST-ERA), and funded by the Engineering and Physical Sciences Research Council in the UK under grant EP/K018728/1. Discussions with Hendrik Blockeel helped to clarify the intuitions underlying this work.

References
[1] K. Boyd, V. S. Costa, J. Davis, and C. D. Page. Unachievable region in precision-recall space and its effect on empirical evaluation. In International Conference on Machine Learning, page 349, 2012.
[2] J. Davis and M. Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pages 233–240, 2006.
[3] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.
[4] T. Fawcett and A. Niculescu-Mizil. PAV and the ROC convex hull. Machine Learning, 68(1):97–106, July 2007.
[5] P. A. Flach. The geometry of ROC space: understanding machine learning metrics through ROC isometrics.
In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), pages 194–201, 2003.

[6] P. A. Flach. ROC analysis. In C. Sammut and G. Webb, editors, Encyclopedia of Machine Learning, pages 869–875. Springer US, 2010.

[7] D. J. Hand and R. J. Till. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2):171–186, 2001.

[8] J. Hernández-Orallo, P. Flach, and C. Ferri. A unified view of performance metrics: Translating threshold choice into expected classification loss. Journal of Machine Learning Research, 13:2813–2869, 2012.

[9] O. O. Koyejo, N. Natarajan, P. K. Ravikumar, and I. S. Dhillon. Consistent binary classification with generalized performance metrics. In Advances in Neural Information Processing Systems, pages 2744–2752, 2014.

[10] Z. C. Lipton, C. Elkan, and B. Naryanaswamy. Optimal thresholding of classifiers to maximize F1 measure. In Machine Learning and Knowledge Discovery in Databases, volume 8725 of Lecture Notes in Computer Science, pages 225–239. Springer Berlin Heidelberg, 2014.

[11] H. Narasimhan, R. Vaish, and S. Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems 27, pages 1493–1501, 2014.

[12] S. P. Parambath, N. Usunier, and Y. Grandvalet. Optimizing F-measures by cost-sensitive classification. In Advances in Neural Information Processing Systems, pages 2123–2131, 2014.

[13] J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, Boston, 1999.

[14] F. Provost and T. Fawcett. Robust classification for imprecise environments. Machine Learning, 42(3):203–231, 2001.

[15] B. Sluban and N.
Lavrač. ViperCharts: Visual performance evaluation platform. In H. Blockeel, K. Kersting, S. Nijssen, and F. Železný, editors, Machine Learning and Knowledge Discovery in Databases, volume 8190 of Lecture Notes in Computer Science, pages 650–653. Springer Berlin Heidelberg, 2013.

[16] C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 2nd edition, 1979.

[17] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013.

[18] N. Ye, K. M. A. Chai, W. S. Lee, and H. L. Chieu. Optimizing F-measures: A tale of two approaches. In Proceedings of the 29th International Conference on Machine Learning, pages 289–296, 2012.

[19] B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 609–616, 2001.

[20] M.-J. Zhao, N. Edakunni, A. Pocock, and G. Brown. Beyond Fano's inequality: bounds on the optimal F-score, BER, and cost-sensitive risk and their implications. Journal of Machine Learning Research, 14(1):1033–1090, 2013.