{"title": "The Fairness of Risk Scores Beyond Classification: Bipartite Ranking and the XAUC Metric", "book": "Advances in Neural Information Processing Systems", "page_first": 3438, "page_last": 3448, "abstract": "Where machine-learned predictive risk scores inform high-stakes decisions, such as bail and sentencing in criminal justice, fairness has been a serious concern. Recent work has characterized the disparate impact that such risk scores can have when used for a binary classification task. This may not account, however, for the more diverse downstream uses of risk scores and their non-binary nature. To better account for this, in this paper, we investigate the fairness of predictive risk scores from the point of view of a bipartite ranking task, where one seeks to rank positive examples higher than negative ones. We introduce the xAUC disparity as a metric to assess the disparate impact of risk scores and define it as the difference in the probabilities of ranking a random positive example from one protected group above a negative one from another group and vice versa. We provide a decomposition of bipartite ranking loss into components that involve the discrepancy and components that involve pure predictive ability within each group. 
We use xAUC analysis to audit predictive risk scores for recidivism prediction, income prediction, and cardiac arrest prediction, where it describes disparities that are not evident from simply comparing within-group predictive performance.", "full_text": "The Fairness of Risk Scores Beyond Classi\ufb01cation:\n\nBipartite Ranking and the xAUC Metric\n\nNathan Kallus\nCornell University\n\nNew York, NY\n\nkallus@cornell.edu\n\nAngela Zhou\n\nCornell University\n\nNew York, NY\n\naz434@cornell.edu\n\nAbstract\n\nWhere machine-learned predictive risk scores inform high-stakes decisions, such\nas bail and sentencing in criminal justice, fairness has been a serious concern.\nRecent work has characterized the disparate impact that such risk scores can have\nwhen used for a binary classi\ufb01cation task. This may not account, however, for\nthe more diverse downstream uses of risk scores and their non-binary nature. To\nbetter account for this, in this paper, we investigate the fairness of predictive\nrisk scores from the point of view of a bipartite ranking task, where one seeks\nto rank positive examples higher than negative ones. We introduce the xAUC\ndisparity as a metric to assess the disparate impact of risk scores and de\ufb01ne it\nas the difference in the probabilities of ranking a random positive example from\none protected group above a negative one from another group and vice versa. 
We provide a decomposition of bipartite ranking loss into components that involve the discrepancy and components that involve pure predictive ability within each group. We use xAUC analysis to audit predictive risk scores for recidivism prediction, income prediction, and cardiac arrest prediction, where it describes disparities that are not evident from simply comparing within-group predictive performance.

1 Introduction

Predictive risk scores support decision-making in high-stakes settings such as bail sentencing in the criminal justice system, triage and preventive care in healthcare, and lending decisions in the credit industry [2, 38]. In these areas, where predictive errors can significantly impact the individuals involved, studies of fairness in machine learning have analyzed the possible disparate impact introduced by predictive risk scores primarily in a binary classification setting: if predictions determine whether or not someone is detained pre-trial, is admitted into critical care, or is extended a loan. But the "human in the loop" with risk assessment tools often has recourse to make decisions about extent, intensity, or prioritization of resources.
That is, in practice, predictive risk scores are used to provide informative rank-orderings of individuals with binary outcomes in the following settings:

(1) In criminal justice, the "risk-needs-responsivity" model emphasizes matching the level of social service interventions to the specific individual's risk of re-offending [3, 6].

(2) In healthcare and other clinical decision-making settings, risk scores are used as decision aids for prevention of chronic disease or triage of health resources, where a variety of interventional resource intensities are available; however, the prediction quality of individual conditional probability estimates can be poor [9, 28, 38, 39].

(3) In credit, predictions of default risk affect not only loan acceptance/rejection decisions, but also risk-based setting of interest rates. Fuster et al. [22] embed machine-learned credit scores in an economic pricing model which suggests negative economic welfare impacts on Black and Hispanic borrowers.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Analysis of xAUC disparities for the COMPAS Violent Recidivism Prediction dataset. (a) ROC curves and xROC curves for COMPAS; (b) score distributions Pr[R = r | A = a, Y = y]. The plot annotations mark white innocents ranked above black recidivators and black innocents ranked above white recidivators.

(4) In municipal services, predictive analytics tools have been used to direct resources for maintenance, repair, or inspection by prioritizing or ranking by risk of failure or contamination [12, 40].
Proposals to use new data sources such as 311 data, which incur the self-selection\nbias of citizen complaints, may introduce inequities in resource allocation [32].\n\nWe describe how the problem of bipartite ranking, that of \ufb01nding a good ranking function that ranks\npositively labeled examples above negative examples, better encapsulates how predictive risk scores\nare used in practice to rank individual units, and how a new metric we propose, xAUC, can assess\nranking disparities.\nMost previous work on fairness in machine learning has emphasized disparate impact in terms of\nconfusion matrix metrics such as true positive rates and false positive rates and other desiderata, such\nas probability calibration of risk scores. Due in part to inherent trade-offs between these performance\ncriteria, some have recommended to retain unadjusted risk scores that achieve good calibration, rather\nthan adjusting for parity across groups, in order to retain as much information as possible and allow\nhuman experts to make the \ufb01nal decision [10, 14, 15, 27]. At the same time, group-level discrepancies\nin the prediction loss of risk scores, relative to the true Bayes-optimal score, are not observable, since\nonly binary outcomes are observed.\nIn particular, our bipartite ranking-based perspective reconciles a gap between the differing arguments\nmade by ProPublica and Equivant (then Northpointe) regarding the potential bias or disparate impact\nof the COMPAS recidivism tool. Equivant levies within-group AUC parity (\u201caccuracy equity\u201d)\n(among other desiderata such as calibration and predictive parity) to claim fairness of the risk scores\nin response to ProPublica\u2019s allegations of bias due to true positive rate/false positive rate disparities for\nthe Low/Not Low risk labels [2, 19]. 
Our xAUC metric, which measures the probability of positive-instance members of one group being misranked below negative-instance members of another group, and vice versa, highlights that within-group comparison of AUC discrepancies does not summarize accuracy inequity. We illustrate this in Fig. 1 for a risk score learned from COMPAS data: xAUC disparities reflect the disparate misranking risk faced by positive-label individuals of either group.
In this paper, we propose and study the cross-ROC curve and the corresponding xAUC metric for auditing disparities induced by a predictive risk score, as such scores are used in broader contexts to inform resource allocation. We relate the xAUC metric to different group- and outcome-based decompositions of a bipartite ranking loss, and assess the resulting metrics on datasets where fairness has been of concern.

2 Related Work
Our analysis of fairness properties of risk scores in this work is most closely related to the study of "disparate impact" in machine learning, which focuses on disparities in the outcomes of a process across protected classes, without racial animus [4]. Many previous approaches have considered formalizations via error rate metrics of the confusion matrix in a binary classification setting [5, 25, 29, 35, 44]. By now, a panoply of fairness metrics has been studied for binary classification in order to assess group-level disparities in confusion matrix-based metrics. Proposals for error rate balance assess or try to equalize true positive rates and/or false positive rates (error rates measured conditional on the true outcome), emphasizing the equitable treatment of those who actually are of the outcome type of interest [25, 44]. Alternatively, one might assess the negative/positive predictive
In missing-data\nsettings, these metrics can be partially identi\ufb01ed to support fairness assessments [11, 30, 30].\nThe predominant criterion used for assessing fairness of risk scores, outside of a binary classi\ufb01cation\nsetting, is that of calibration. Group-wise calibration requires that Pr[Y = 1 | R = r, A = a] =\nPr[Y = 1 | R = r, A = b] = r, as in [13]. The impossibilities of satisfying notions of error rate\nbalance and calibration simultaneously have been discussed in [13, 31]. Liu et al. [33] show that\ngroup calibration is a byproduct of unconstrained empirical risk minimization, and therefore is not a\nrestrictive notion of fairness. Hebert-Johnson et al. [26] note the critique that group calibration does\nnot restrict the variance of a risk score as an unbiased estimator of the Bayes-optimal score.\nOther work has considered fairness in ranking settings speci\ufb01cally, with particular attention to\napplications in information retrieval, such as questions of fair representation in search engine results.\nYang and Stoyanovich [43] assess statistical parity at discrete cut-points of a ranking, incorporating\nposition bias inspired by normalized discounted cumulative gain (nDCG) metrics. Celis et al. [8]\nconsider the question of fairness in rankings, where fairness is considered as constraints on diversity\nof group membership in the top k rankings, for any choice of k. Singh and Joachims [41] consider\nfairness of exposure in rankings under known relevance scores and propose an algorithmic framework\nthat produces probabilistic rankings satisfying fairness constraints in expectation on exposure, under a\nposition bias model. 
We focus instead on the bipartite ranking setting, where the area under the curve (AUC) loss emphasizes ranking quality on the entire distribution, whereas other ranking metrics such as nDCG or top-k metrics emphasize only a portion of the distribution.
The problem of bipartite ranking is related to, but distinct from, binary classification [1, 20, 36]; see [16, 34] for more information. While the bipartite ranking induced by the Bayes-optimal score is analogously Bayes-risk optimal for bipartite ranking (e.g., [34]), in general, a probability-calibrated classifier is not optimizing for the bipartite ranking loss. Cortes and Mohri [16] observe that AUC may vary widely for the same error rate, and that algorithms designed to globally optimize the AUC perform better than optimizing surrogates of the AUC or error rate. Narasimhan and Agarwal [37] study transfer regret bounds between the related problems of binary classification, bipartite ranking, and outcome-probability estimation.

3 Problem Setup and Notation
We suppose we have data (X, A, Y) on features X ∈ 𝒳, sensitive attribute A ∈ 𝒜, and binary labeled outcome Y ∈ {0, 1}. We are interested in assessing the downstream impacts of a predictive risk score R : 𝒳 × 𝒜 → ℝ, which may or may not access the sensitive attribute. When these risk scores represent an estimated conditional probability of a positive label, R : 𝒳 × 𝒜 → [0, 1]. For brevity, we also let R = R(X, A) be the random variable corresponding to an individual's risk score.
We generally use the conventions that Y = 1 is associated with opportunity or benefit for the individual (e.g., freedom from suspicion of recidivism, creditworthiness) and that when discussing two groups, A = a and A = b, the group A = a might be a historically disadvantaged group.
Let the conditional cumulative distribution function of the learned score R evaluated at a threshold θ given label and attribute be denoted by

F^a_y(θ) = Pr[R ≤ θ | Y = y, A = a].

We let G^a_y = 1 − F^a_y denote the complement of F^a_y. We drop the a superscript to refer to the whole population: F_y(θ) = Pr[R ≤ θ | Y = y]. Thresholding the score yields a binary classifier, Ŷ_θ = I[R ≥ θ]. The classifier's true negative rate (TNR) is F_0(θ), its false positive rate (FPR) is G_0(θ), its false negative rate (FNR) is F_1(θ), and its true positive rate (TPR) is G_1(θ). Given a risk score, the choice of optimal threshold for a binary classifier depends on the differing costs of false positives and false negatives. We might expect cost ratios of false positives and false negatives to differ if we consider the use of risk scores to direct punitive measures or to direct interventional resources.
In the setting of bipartite ranking, the data comprise a pool of positively labeled examples, S₊ = {Xᵢ}_{i∈[m]}, drawn i.i.d. according to a distribution X₊ ~ D₊, and negatively labeled examples S₋ = {X′ᵢ}_{i∈[n]} drawn according to a distribution X₋ ~ D₋ [36]. The rank order may be determined by a score function s(X), which achieves empirical bipartite ranking error (1/(mn)) Σᵢ₌₁^m Σⱼ₌₁^n I[s(Xᵢ) < s(X′ⱼ)].
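To make the threshold notation concrete, the empirical analogues of F_y(θ), G_y(θ) and the induced classifier rates can be sketched in a few lines. This is a minimal illustration with made-up toy scores; the helper name `thresholded_rates` is our own, not from the paper.

```python
# Empirical analogues of F_y(theta) = Pr[R <= theta | Y = y] and
# G_y(theta) = 1 - F_y(theta), and the confusion-matrix rates of the
# thresholded classifier Yhat_theta = I[R >= theta].
def thresholded_rates(scores, labels, theta):
    pos = [r for r, y in zip(scores, labels) if y == 1]
    neg = [r for r, y in zip(scores, labels) if y == 0]
    tpr = sum(r >= theta for r in pos) / len(pos)  # G_1(theta)
    fnr = sum(r < theta for r in pos) / len(pos)   # F_1(theta)
    fpr = sum(r >= theta for r in neg) / len(neg)  # G_0(theta)
    tnr = sum(r < theta for r in neg) / len(neg)   # F_0(theta)
    return tpr, fpr, tnr, fnr

# Toy data: three positives and three negatives.
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.1]
labels = [1, 1, 0, 1, 0, 0]
tpr, fpr, tnr, fnr = thresholded_rates(scores, labels, 0.5)
```

Varying θ and plotting (fpr, tpr) pairs traces out exactly the ROC curve discussed next.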
The area under the receiver operating characteristic (ROC) curve (AUC), a common (reward) objective for bipartite ranking, is often used as a metric describing the quality of a predictive score, independently of the final threshold used to implement a classifier, and is invariant to different base rates of the outcomes. The ROC curve plots G_0(θ) on the x-axis against G_1(θ) on the y-axis as we vary θ over the space of decision thresholds. The AUC is the area under the ROC curve, i.e.,

AUC = ∫₀¹ G_1(G_0⁻¹(v)) dv.

An AUC of 1/2 corresponds to a completely random classifier; therefore, the difference from 1/2 serves as a metric for the diagnostic quality of a predictive score. We recall the probabilistic interpretation of the AUC: it is the probability that a randomly drawn example from the positive class is correctly ranked by the score R above a randomly drawn example from the negative class [24]. Let R_1 be drawn from R | Y = 1 and R_0 be drawn from R | Y = 0 independently. Then AUC = Pr[R_1 > R_0].

4 The Cross-ROC (xROC) and Cross-Area Under the Curve (xAUC)
We introduce the cross-ROC curve and the cross-area-under-the-curve metric xAUC, which summarize group-level disparities in misranking errors induced by a score function R(X, A).
Definition 1 (Cross-Receiver Operating Characteristic curve (xROC)).

xROC(θ; R, a, b) = (Pr[R > θ | A = b, Y = 0], Pr[R > θ | A = a, Y = 1])

The xROC_{a,b} curve parametrically plots xROC(θ; R, a, b) over the space of thresholds θ ∈ ℝ, generating the curve of the TPR of group a on the y-axis vs. the FPR of group b on the x-axis. We define the xAUC(a, b) metric as the area under the xROC_{a,b} curve.
Analogous to the usual AUC, we provide a probabilistic interpretation of the xAUC metric as the probability of correctly ranking a positive instance of group a above a negative instance of group b under the corresponding outcome- and class-conditional distributions of the score.
Definition 2 (xAUC).

xAUC(a, b) = ∫₀¹ G^a_1((G^b_0)⁻¹(v)) dv = Pr[R^a_1 > R^b_0],

where R^a_1 is drawn from R | Y = 1, A = a and R^b_0 is drawn from R | Y = 0, A = b independently.
For brevity, henceforth, R^a_y is taken to be drawn from R | Y = y, A = a and independently of any other such variable. We also drop the superscript to denote omitting the conditioning on the sensitive attribute (e.g., R_y).
The xAUC accuracy metrics for a binary sensitive attribute measure the probability that a randomly chosen unit from the "positive" group, Y = 1 in group a, is ranked higher than a randomly chosen unit from the "negative" group, Y = 0 in group b, under the corresponding group- and outcome-conditional distributions of scores R^a_y. We let AUC_a = Pr[R^a_1 > R^a_0] denote the within-group AUC for group A = a. If the difference between these metrics, the xAUC disparity

ΔxAUC = Pr[R^a_1 > R^b_0] − Pr[R^b_1 > R^a_0] = Pr[R^b_1 ≤ R^a_0] − Pr[R^a_1 ≤ R^b_0],

is substantial and positive, then we might consider group b to be systematically "disadvantaged" and a to be "advantaged" when Y = 0 is a negative or harmful label or is associated with punitive measures, as in the recidivism prediction case. Conversely, we have the opposite interpretation if Y = 0 is a positive label associated with greater beneficial resources.
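These quantities can be estimated from a finite sample by averaging pairwise comparison indicators. A minimal sketch with made-up data follows; the function names `xauc` and `xauc_disparity` are ours, not from the paper.

```python
# Sample estimate of xAUC(a, b) = Pr[R^a_1 > R^b_0]: average the indicator
# I[R(X_i) > R(X_j)] over positives of group a and negatives of group b.
def xauc(scores, labels, groups, a, b):
    pos_a = [r for r, y, g in zip(scores, labels, groups) if y == 1 and g == a]
    neg_b = [r for r, y, g in zip(scores, labels, groups) if y == 0 and g == b]
    n_pairs = len(pos_a) * len(neg_b)
    return sum(ri > rj for ri in pos_a for rj in neg_b) / n_pairs

def xauc_disparity(scores, labels, groups, a, b):
    # Delta xAUC = xAUC(a, b) - xAUC(b, a); anti-symmetric in (a, b).
    return xauc(scores, labels, groups, a, b) - xauc(scores, labels, groups, b, a)

# Toy sample: scores, outcomes, and group memberships.
scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6]
labels = [1, 0, 1, 0, 1, 0]
groups = ["a", "a", "b", "b", "a", "b"]
```

For large samples, the quadratic double loop can be replaced by sorting the pooled scores once, mirroring the standard fast AUC routines mentioned in the text.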
Similarly, since ΔxAUC is anti-symmetric in a, b, negative values are interpreted in the converse.
When higher scores are associated with opportunity or additional benefits and resources, as in the recidivism prediction case, a positive ΔxAUC means group a either gains by correctly having its deserving members ranked above the non-deserving members of group b and/or by having its non-deserving members incorrectly ranked above the deserving members of group b; and symmetrically, group b loses in the same way. The magnitude of the disparity ΔxAUC describes the misranking disparities incurred under this predictive score, while the magnitude of the xAUC measures the particular across-subgroup rank-accuracies.
Computing the sample xAUC is simple: one simply computes the statistic (1/(n^a_1 n^b_0)) Σ_{i: Aᵢ=a, Yᵢ=1} Σ_{j: Aⱼ=b, Yⱼ=0} I[R(Xᵢ) > R(Xⱼ)]. Algorithmic routines for computing the AUC quickly by a sorting routine can be directly used to compute the xAUCs. Asymptotically exact confidence intervals are available, as shown in DeLong et al. [17], using the generalized U-statistic property of this estimator.

Figure 2: Balanced xROC curves for COMPAS and Adult datasets. (a) COMPAS: a = Black, b = White, 0 = Recidivate; (b) Adult: a = Black, b = White, 0 = Low income.

Table 1: Ranking error metrics (AUC, xAUC, Brier scores for calibration) for different datasets, for logistic regression ("Logistic Reg.") and calibrated RankBoost ("RankBoost cal."). We include standard errors in Table 2 of the appendix.
                          COMPAS           Framingham        German            Adult
                          Black    White   Non-F.   Female   < 25     ≥ 25     Black    White
Logistic Reg.   AUC       0.737    0.701   0.768    0.788    0.726    0.768    0.923    0.898
                Brier     0.208    0.210   0.201    0.166    0.211    0.158    0.075    0.111
                xAUC      0.604    0.813   0.795    0.737    0.708    0.802    0.865    0.944
                xAUC1     0.698    0.781   0.785    0.756    0.712    0.791    0.874    0.905
                xAUC0     0.766    0.641   0.755    0.783    0.790    0.775    0.943    0.895
RankBoost cal.  AUC       0.745    0.703   0.789    0.797    0.704    0.796    0.924    0.899
                Brier     0.206    0.210   0.182    0.150    0.220    0.158    0.074    0.109
                xAUC      0.599    0.827   0.822    0.761    0.714    0.788    0.875    0.941
                xAUC1     0.702    0.790   0.809    0.783    0.711    0.793    0.882    0.906
                xAUC0     0.776    0.638   0.777    0.811    0.774    0.783    0.939    0.897

Variants of the xAUC metric. We can decompose AUC differently and assess different variants of the xAUC:
Definition 3 (Balanced xAUC).

xAUC1(a) = Pr[R^a_1 > R_0],  xAUC1(b) = Pr[R^b_1 > R_0]
xAUC0(a) = Pr[R_1 > R^a_0],  xAUC0(b) = Pr[R_1 > R^b_0]

These xAUC disparities compare the misranking error faced by individuals from either group, conditional on a specific outcome: xAUC0(a) − xAUC0(b) compares the ranking accuracy faced by those of the negative class Y = 0 across groups, and xAUC1(a) − xAUC1(b) analogously compares those of the positive class Y = 1. The following proposition shows how the population AUC decomposes as weighted combinations of the xAUC and within-class AUCs, or the balanced decompositions xAUC1 or xAUC0, weighted by the outcome-conditional class probabilities.
Proposition 1 (xAUC metrics as decompositions of AUC).

AUC = Pr[R_1 > R_0] = Σ_{b'∈𝒜} Pr[A = b' | Y = 0] · Σ_{a'∈𝒜} Pr[A = a' | Y = 1] Pr[R^{a'}_1 > R^{b'}_0]
    = Σ_{a'∈𝒜} Pr[A = a' | Y = 1] Pr[R^{a'}_1 > R_0] = Σ_{a'∈𝒜} Pr[A = a' | Y = 0] Pr[R_1 > R^{a'}_0]

5 Assessing xAUC
5.1 COMPAS Example
In Fig.
1, we revisit the COMPAS data and assess our xROC and xAUC curves to illustrate ranking disparities that may be induced by risk scores learned from this data. The COMPAS dataset is of size n = 6167, p = 402, where the sensitive attribute is race, with A = a, b for black and white, respectively. We define the outcome Y = 1 for non-recidivism within 2 years and Y = 0 for violent recidivism. Covariates include information on number of prior arrests and age; we follow the pre-processing of Friedler et al. [21].
We first train a logistic regression model on the original covariate data (we do not use the decile scores directly in order to do a more fine-grained analysis), using a 70%/30% train-test split and evaluating metrics on the out-of-sample test set. In Table 1, we report the group-level AUC, the Brier [7] scores (summarizing calibration), and our xAUC metrics. The xAUC for column A = a is xAUC(a, b), for column A = b it is xAUC(b, a), and for column A = a, xAUCy is xAUCy(a). The Brier score for a probabilistic prediction of a binary outcome is (1/n) Σᵢ₌₁ⁿ (R(Xᵢ) − Yᵢ)². The score is overall well-calibrated (as well as calibrated by group), consistent with analyses elsewhere [13, 19].
We also report the metrics from using a bipartite ranking algorithm, Bipartite RankBoost of Freund et al. [20], calibrating the resulting ranking score by Platt scaling and displaying the results as "RankBoost cal." We observe essentially similar performance across these metrics, suggesting that the behavior of xAUC disparities is independent of model specification or complexity, and that methods which directly optimize the population AUC error may still incur these group-level error disparities.
In Fig.
1a, we plot ROC curves and our xROC curves, displaying the averaged ROC curve (interpolated to a fine grid of FPR values) over 50 sampled train-test splits, with a 1 standard error bar shaded in gray (computed by the method of [17]). We include standard errors for xAUC metrics in Table 2 of the appendix. While a simple within-group AUC comparison suggests that the score is overall more accurate for blacks (in fact, the AUC is slightly higher for the black population, with AUC_a = 0.737 and AUC_b = 0.701), computing our xROC curve and xAUC metric shows that blacks would be disadvantaged by misranking errors. The cross-group accuracy xAUC(a, b) = 0.604 is significantly lower than xAUC(b, a) = 0.813: black innocents are nearly indistinguishable from actually guilty whites. This xAUC gap of 0.21 is precisely the cross-group accuracy inequity that simply comparing within-group AUC does not capture. When we plot kernel density estimates of the score distributions in Fig. 1b from a representative training-test split, we see that indeed the distribution of scores for black innocents Pr[R = r | A = a, Y = 1] has significant overlap with the distribution of scores for white recidivators Pr[R = r | A = b, Y = 0].
Assessing balanced xROC: In Fig. 2, we compare the xROC0(a), xROC0(b) curves with the xROC1(a), xROC1(b) curves for the COMPAS data. The relative magnitude of the xAUC1 and xAUC0 disparities provides insight on whether the burden of the xAUC disparity falls on those who are innocent or guilty. Here, since the xAUC0 disparity is larger in absolute terms, it seems that misranking errors result in inordinate benefit of the doubt in the errors of distinguishing risky whites (Y = 0) from innocent individuals, rather than disparities arising from distinguishing innocent members of either group from generally guilty individuals.

5.2 Assessing xAUC on Other Datasets

Additionally, in Fig. 3 and Table
1, we evaluate these metrics on multiple datasets where fairness may be of concern, including risk scores learned on the Framingham study, the German credit dataset, and the Adult income prediction dataset (we use logistic regression as well as calibrated bipartite RankBoost) [18, 42]. For the Framingham dataset (cardiac arrest risk scores), n = 4658, p = 7, with a sensitive attribute of gender, A = a for non-female and A = b for female. Y = 1 denotes 10-year coronary heart disease (CHD) incidence. Fairness considerations might arise if predictions of likelier mortality are associated with greater resources for preventive care or triage. The German credit dataset is of size n = 1000, p = 57, where the sensitive attribute is age, with A = a, b for age < 25 and age ≥ 25. Creditworthiness (non-default) is denoted by Y = 1, and default by Y = 0. The "Adult" income dataset is of size n = 30162, p = 98, with sensitive attribute A = a, b for black and white. We use the dichotomized outcome Y = 1 for high income (> 50k) and Y = 0 for low income (≤ 50k).
Overall, Fig. 3 shows that these xAUC disparities persist, though the disparities are largest for the COMPAS and the large Adult dataset. For the Adult dataset this disparity could result in the misranking of poor whites above wealthy blacks; this could be interpreted as possibly inequitable withholding of economic opportunity from actually-high-income blacks. The additional datasets also display different phenomena regarding the score distributions and xROC0, xROC1 comparisons, which we include in Fig.
5 of the Appendix.

Figure 3: ROC and xROC curve comparison for Framingham, German credit, and Adult datasets. (a) Framingham: a = Non-F., b = Female, 1 = CHD; (b) German: a = Age < 25, b = Age ≥ 25, 0 = Default; (c) Adult: a = Black, b = White, 0 = Low income.

6 Properties of the xAUC metric and Discussion

We proceed to characterize the xAUC metric and its interpretations as a measure of cross-group ranking accuracy. Notably, the xROC and xAUC implicitly compare performances of thresholds that are the same for different levels of the sensitive attribute, a restriction which tends to hold in applications under legal constraints regulating disparate treatment.
Next we point out that for a perfect classifier with AUC = 1, the xAUC metrics are also 1. And, for a classifier that classifies completely at random, achieving AUC = 0.5, the xAUCs are also 0.5.
Impact of Score Distribution. To demonstrate how risk score distributions affect the xAUC, we consider an example where we assume normally distributed risk scores within each group and outcome condition; we can then express the AUC in closed form in terms of the cdf of the convolution of the score distributions. Let R^a_y ~ N(μ_ay, σ²_ay) be drawn independently. Then the xAUC is closed-form:

xAUC(a, b) = Pr[R^a_1 > R^b_0] = Φ((μ_a1 − μ_b0) / √(σ²_a1 + σ²_b0)).

We may expect that μ_a1 > μ_b0, in which case Pr[R^a_1 > R^b_0] > 0.5. For a fixed mean difference μ_a1 − μ_b0 between the a-guilty and b-innocent (e.g., in the COMPAS example), a decrease in either variance increases xAUC(a, b). For fixed variances, an increase in the separation μ_a1 − μ_b0 between the a-guilty and b-innocent increases xAUC(a, b). The xAUC discrepancy is similarly closed-form:

ΔxAUC(a, b) = Φ((μ_a1 − μ_b0) / √(σ²_a1 + σ²_b0)) − Φ((μ_b1 − μ_a0) / √(σ²_b1 + σ²_a0)).
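The Gaussian closed form above can be sanity-checked against a Monte Carlo estimate. Below is a sketch with arbitrary made-up parameters, implementing Φ via the error function; the function name `gaussian_xauc` is ours.

```python
import math
import random

def gaussian_xauc(mu_a1, var_a1, mu_b0, var_b0):
    # Closed form Pr[R^a_1 > R^b_0] for independent normals:
    # Phi((mu_a1 - mu_b0) / sqrt(var_a1 + var_b0)),
    # with Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
    z = (mu_a1 - mu_b0) / math.sqrt(var_a1 + var_b0)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

random.seed(0)
mu_a1, var_a1, mu_b0, var_b0 = 1.0, 0.5, 0.2, 0.5  # made-up parameters
draws = 200_000
mc = sum(
    random.gauss(mu_a1, math.sqrt(var_a1)) > random.gauss(mu_b0, math.sqrt(var_b0))
    for _ in range(draws)
) / draws
exact = gaussian_xauc(mu_a1, var_a1, mu_b0, var_b0)
```

Consistent with the discussion, shrinking either variance or widening the mean separation pushes `exact` toward 1.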
If all variances are equal, then we will have a positive disparity (i.e., in disfavor of b) if μ_a1 − μ_b0 > μ_b1 − μ_a0 (and recall we generally expect both of these to be positive). This occurs if the separation between the advantaged-guilty and disadvantaged-innocent is smaller than the separation between the disadvantaged-guilty and advantaged-innocent. Alternatively, it occurs if μ_a1 + μ_a0 > μ_b1 + μ_b0, so the overall mean scores of the disadvantaged are lower. If they are in fact equal, μ_a1 + μ_a0 = μ_b1 + μ_b0 and μ_a1 − μ_b0 > 0, then we have a positive disparity whenever σ²_a0 − σ²_a1 > σ²_b0 − σ²_b1, that is, when in the b class the difference in precision for innocents vs. guilty is smaller than in group a. That is, disparate precision leads to xAUC disparities even with equal mean scores. In Appendix A.1 we include a toy example to illustrate a setting where the within-group AUCs remain the same but the xAUCs diverge.
Note that the xAUC metric compares probabilities of misranking errors conditional on drawing instances from either the Y = 0 or Y = 1 distribution. When base rates differ, interpreting this disparity as normatively problematic implicitly assumes equipoise in that we want random individuals drawn with equal probability from the white innocent/black innocent populations to face similar misranking risks, not drawn from the population distribution of offending.

Figure 4: Distribution of conditional xAUCs for COMPAS and Adult datasets. (a) Prob. of individual Black non-recidivator ranked below White recidivator; (b) prob. of individual White non-recidivator ranked below Black recidivator; (c) prob. of individual White low-income ranked below Black high-income; (d) prob. of individual Black low-income ranked below White high-income.

Utility Allocation Interpretation.
When risk scores direct the expenditure of resources or benefits, we may interpret xAUC disparities as informative of group-level downstream utility disparities, if we expect beneficial resource or utility prioritizations which are monotonic in the score R. In particular, allowing for any monotonic allocation u, the xAUC measures Pr[u(R^a_1) > u(R^b_0)]. Disparities in this measure suggest greater probability of confusion in terms of less effective utility allocation between the positive and negative classes of different groups. This property can be summarized by the integral representation of the xAUC disparities (e.g., as in [34]) as differences between the average rank of positive examples from one group above negative examples from another group:

ΔxAUC = E_{R^a_1}[F^b_0(R^a_1)] − E_{R^b_1}[F^a_0(R^b_1)].

Diagnostics: Conditional xAUCs. In addition to xAUC and xROC analysis, we consider the distribution of conditional xAUC ranking accuracies,

xAUC(a, b; R^b_0) := Pr[R^a_1 > R^b_0 | R^b_0].

First note that xAUC(a, b) = E[xAUC(a, b; R^b_0)]. Hence, this quantity is interpreted as the individual discrepancy faced by the b-innocent, the average of which over individuals gives the group disparity. We illustrate the histogram of xAUC(a, b; R^b_0) probabilities over the individuals R^b_0 of the A = b, Y = 0 partition (and vice versa for xAUC(b, a)). For example, for COMPAS, we compute: how many white recidivators is this black non-recidivator correctly ranked above? The xAUC is the average of these conditional accuracies, but the variance of this distribution is also informative of the range of misranking risk and of the effect on individuals. We include these diagnostics in Fig. 4 and indicate the marginal xAUC with a black dotted line.
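The conditional-xAUC diagnostic just described can be sketched directly: for each negative-class score in group b, compute the fraction of group-a positives ranked above it; averaging these conditional accuracies recovers the marginal xAUC(a, b). This is a toy illustration with made-up data, and the function name `conditional_xaucs` is our own.

```python
# xAUC(a, b; r0) = Pr[R^a_1 > r0] evaluated at each observed negative-class
# score r0 from group b; the histogram of these values is what Fig. 4 plots,
# and their mean is the marginal xAUC(a, b).
def conditional_xaucs(scores, labels, groups, a, b):
    pos_a = [r for r, y, g in zip(scores, labels, groups) if y == 1 and g == a]
    neg_b = [r for r, y, g in zip(scores, labels, groups) if y == 0 and g == b]
    return [sum(r1 > r0 for r1 in pos_a) / len(pos_a) for r0 in neg_b]

# Toy sample.
scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6]
labels = [1, 0, 1, 0, 1, 0]
groups = ["a", "a", "b", "b", "a", "b"]
conds = conditional_xaucs(scores, labels, groups, "a", "b")
marginal = sum(conds) / len(conds)
```

The spread of `conds`, not just its mean, indicates how unevenly misranking risk is distributed over individuals, which is the point of the diagnostic.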
For example, the first pair of plots for COMPAS illustrates that while the xAUC(b, a) distribution of misranking errors faced by black recidivists appears to have light tails, such that the model is more accurate at ranking white non-recidivists above black recidivists, there is extensive probability mass for the xAUC(a, b) distribution, even at the tails: there are 15 white recidivists who are misranked above nearly all black non-recidivists. Assessing the distribution of conditional xAUC can inform strategies for model improvement (such as those discussed in [10]) by directing attention to extreme errors.

The question of adjustment. It is not immediately obvious that adjustment is an appropriate strategy for producing fair risk scores for downstream decision support, considering well-studied impossibility results for fair classification [13, 31]. For the sake of comparison to the literature on adjustment for fair classification, such as [25], we discuss post-processing risk scores in Appendix E.1 and provide algorithms for equalizing xAUC. Adjustments from the fairness-in-ranking literature may not be suitable for risk scores: the method of [41] requires randomization over the space of rankers.

7 Conclusion

We emphasize that xAUC and xROC analysis is intended to diagnose potential issues with a model, in particular when summarizing model performance without fixed thresholds. The xROC curve and xAUC metrics provide insight into the disparities that may occur with the implementation of a predictive risk score in broader but practically relevant settings beyond binary classification.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1846210. This research was funded in part by JPMorgan Chase & Co. Any views or opinions expressed herein are solely those of the authors listed, and may differ from the views and opinions expressed by JPMorgan Chase & Co.
or its affiliates. This material is not a product of the Research Department of J.P. Morgan Securities LLC. This material should not be construed as an individual recommendation for any particular client and is not intended as a recommendation of particular securities, financial instruments or strategies for a particular client. This material does not constitute a solicitation or offer in any jurisdiction.

References

[1] S. Agarwal and D. Roth. Learnability of bipartite ranking functions. In Proceedings of the 18th Annual Conference on Learning Theory, 2005.

[2] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias. Online, May 2016.

[3] C. Barabas, K. Dinakar, J. Ito, M. Virza, and J. Zittrain. Interventions over predictions: Reframing the ethical debate for actuarial risk assessment. Proceedings of Machine Learning Research, 2017.

[4] S. Barocas and A. Selbst. Big data's disparate impact. California Law Review, 2014.

[5] S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning. fairmlbook.org, 2018. http://www.fairmlbook.org.

[6] J. Bonta and D. Andrews. Risk-need-responsivity model for offender assessment and rehabilitation. 2007.

[7] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 1950.

[8] L. E. Celis, D. Straszak, and N. K. Vishnoi. Ranking with fairness constraints. In 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018), 2018.

[9] C. Chan, G. Escobar, and J. Zubizarreta. Use of predictive risk scores for early admission to the ICU. MSOM, 2018.

[10] I. Chen, F. Johansson, and D. Sontag. Why is my classifier discriminatory? In Advances in Neural Information Processing Systems 31, 2018.

[11] J. Chen, N. Kallus, X. Mao, G. Svacha, and M. Udell. Fairness under unawareness: Assessing disparity when protected class is unobserved.
In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 339–348. ACM, 2019.

[12] A. Chojnacki, C. Dai, A. Farahi, G. Shi, J. Webb, D. T. Zhang, J. Abernethy, and E. Schwartz. A data science approach to understanding residential water contamination in Flint. In Proceedings of KDD 2017, 2017.

[13] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. In Proceedings of FATML, 2016.

[14] A. Chouldechova, E. Putnam-Hornstein, D. Benavides-Prado, O. Fialko, and R. Vaithianathan. A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In Conference on Fairness, Accountability, and Transparency, 2018.

[15] S. Corbett-Davies and S. Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint, 2018.

[16] C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. In Proceedings of the 16th International Conference on Neural Information Processing Systems, 2003.

[17] E. DeLong, D. DeLong, and D. L. Clarke-Pearson. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics, 1988.

[18] D. Dheeru and E. K. Taniskidou. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml, 2017.

[19] W. Dieterich, C. Mendoza, and T. Brennan. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Technical report, 2016.

[20] Y. Freund, R. Iyer, R. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 2003.

[21] S. Friedler, C. Scheidegger, S. Venkatasubramanian, S. Choudhary, E. P. Hamilton, and D. Roth. A comparative study of fairness-enhancing interventions in machine learning. In ACM Conference on Fairness, Accountability and Transparency (FAT*), 2019.

[22] A. Fuster, P.
Goldsmith-Pinkham, T. Ramadorai, and A. Walther. Predictably unequal? The effects of machine learning on credit markets. SSRN:3072038, 2018.

[23] D. Hand. Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning, 2009.

[24] J. Hanley and B. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 1982.

[25] M. Hardt, E. Price, N. Srebro, et al. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315–3323, 2016.

[26] U. Hebert-Johnson, M. Kim, O. Reingold, and G. Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80:1939–1948, 2018.

[27] K. Holstein, J. W. Vaughan, H. Daumé III, M. Dudík, and H. Wallach. Improving fairness in machine learning systems: What do industry practitioners need? In 2019 ACM CHI Conference on Human Factors in Computing Systems (CHI 2019), 2019.

[28] J. Jones, N. Shah, C. Bruce, and W. F. Stewart. Meaningful use in practice: Using patient-specific risk in an electronic health record for shared decision making. American Journal of Preventive Medicine, 2011.

[29] N. Kallus and A. Zhou. Residual unfairness in fair machine learning from prejudiced data. In International Conference on Machine Learning, pages 2444–2453, 2018.

[30] N. Kallus and A. Zhou. Assessing disparate impacts of personalized interventions: Identifiability and bounds. In Advances in Neural Information Processing Systems, 2019.

[31] J. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent trade-offs in the fair determination of risk scores. In Proceedings of Innovations in Theoretical Computer Science (ITCS), 2017.

[32] C. E. Kontokosta and B. Hong. Who calls for help?
Statistical evidence of disparities in citizen-government interactions using geo-spatial survey and 311 data from Kansas City. Bloomberg Data for Good Exchange Conference, 2018.

[33] L. Liu, M. Simchowitz, and M. Hardt. Group calibration is a byproduct of unconstrained learning. arXiv preprint, 2018.

[34] A. Menon and R. C. Williamson. Bipartite ranking: A risk-theoretic perspective. Journal of Machine Learning Research, 2016.

[35] M. Feldman, S. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of KDD 2015, 2015.

[36] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

[37] H. Narasimhan and S. Agarwal. On the relationship between binary classification, bipartite ranking, and binary class probability estimation. In Proceedings of NIPS 2013, 2013.

[38] A. Rajkomar, M. Hardt, M. D. Howell, G. Corrado, and M. H. Chin. Ensuring fairness in machine learning to advance health equity. Annals of Internal Medicine, 2018.

[39] B. Reilly and A. Evans. Translating clinical research into clinical practice: Impact of using prediction rules to make decisions. Annals of Internal Medicine, 2006.

[40] C. Rudin, R. J. Passonneau, A. Radeva, H. Dutta, S. Ierome, and D. Isaac. A process for predicting manhole events in Manhattan. Machine Learning, 2010.

[41] A. Singh and T. Joachims. Fairness of exposure in rankings. In Proceedings of KDD 2018, 2018.

[42] P. W. Wilson, W. P. Castelli, and W. B. Kannel. Coronary risk prediction in adults (the Framingham Heart Study). The American Journal of Cardiology, 59(14):G91–G94, 1987.

[43] K. Yang and J. Stoyanovich. Measuring fairness in ranked outputs. In Proceedings of SSDBM '17, 2017.

[44] M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment.
In Proceedings of WWW 2017, 2017.