Sun Dec 8th through Sat Dec 14th, 2019, at Vancouver Convention Center
Specifically considering the setting where the model provides a score that is then used later in making a decision is, to my knowledge, novel. The paper is clearly written and introduces an original criterion that can also be generalised to non-binary decisions. The ultimate significance of this style of work remains uncertain: arguably, such investigations must eventually shift to discussions of applicable fairness criteria in specific situations (e.g. credit scoring, video recommendation, search result ranking), with concrete guidance for practice.
This paper is a fairly straightforward but sensible extension of ROC/AUC to compare the quality of ranking across groups. The xAUC is simply the probability that a positive instance of group a is ranked above a negative instance of group b. The paper is well structured and clearly written, and I would expect these metrics to be widely adopted in quantifying fairness.

I have read the author response and the comments from other reviewers. I am still of the opinion that this paper represents a significant contribution, and I strongly argue that it be accepted. I agree that this paper does not tell you when you should sacrifice accuracy to reduce xAUC disparity. However, I think it is unreasonable to expect it to answer that question, as any such answer will be incredibly context dependent and will rest more on sociology, political science and philosophy than on machine learning. Almost no paper on fairness in ML would have been published if this were the standard.

I agree that there are too many papers in ML introducing new fairness metrics with very limited justification for them. But I don't think this paper falls into that category, because: 1) it shows how the metric helps clarify the COMPAS debate, which is a seminal example of fairness in ML; 2) the metric is closely connected to (and a means of visualising) concerns relating to separation, which is one of the fundamental, widely discussed and used existing notions of fairness; from this point of view, the authors are demonstrating a way of quantifying an existing fairness concept in the setting of ranking rather than introducing an entirely new and disconnected metric; 3) the paper does a substantially better job of clarifying the implications of its metric, and how it connects with other metrics, than most papers in this space. A clear understanding of the properties of a metric (as is given in Section 6) forms the starting point for any discussion of whether or not it is appropriate within a given context.
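The review's one-line definition of xAUC can be sketched directly as a pairwise comparison; the function name and implementation below are illustrative, not the authors' code:

```python
import numpy as np

def xauc(scores_pos_a, scores_neg_b):
    """xAUC(a, b): the probability that a randomly drawn positive
    instance of group a is scored above a randomly drawn negative
    instance of group b, with ties counted as one half."""
    pos = np.asarray(scores_pos_a, dtype=float)[:, None]
    neg = np.asarray(scores_neg_b, dtype=float)[None, :]
    # Average over all (positive-from-a, negative-from-b) pairs.
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

# Every positive from group a outranks every negative from group b:
print(xauc([0.9, 0.8], [0.1, 0.7]))  # 1.0
# Only one of four pairs is correctly ordered:
print(xauc([0.6, 0.2], [0.5, 0.7]))  # 0.25
```

An xAUC disparity is then the gap between xauc(a, b) and xauc(b, a), mirroring how AUC compares positives and negatives within a single group.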
This paper is tough to review. On one hand, it's well written and carefully thought out. But I don't come away from the paper with a clear idea of what I should do with xAUC, or why I should prefer it over other measures. On occasion, the authors describe xAUC as measuring misranking, which -- if this is indeed what it measures -- would suggest an immediate intervention to remedy the misranking. However, the authors caution against efforts to adjust the scores to equalize xAUC. At other times xAUC is described as a diagnostic, but even the Bayes-optimal predictor could produce significant xAUC disparities, while a miscalibrated score (where there is clear misranking -- one group's risk is consistently over- or under-estimated) could produce no xAUC disparities. So it's unclear how one should interpret xAUC differences.

The fairness literature is awash with fairness metrics, and I think the burden is on those proposing new metrics to make a compelling case for why we should prefer their metric to the existing alternatives. This paper has not convinced me that xAUC provides significant value over existing metrics.

Response to author feedback: It's not the case that there are "no metrics specifically for disparate impacts of continuous risk probability scores." For example, in their paper "Risk, Race, & Recidivism: Predictive Bias and Disparate Impact," Skeem & Lowenkamp measure bias in continuous risk scores by fitting logistic regression curves to the score-outcome relationship for each group. Following the American Psychological Association's "Principles for the Validation and Use of Personnel Selection Procedures," they interpret significant differences in either slopes or intercepts between groups as evidence of bias in the risk scores (in other words, they require sufficiency to hold). This has been the APA's recommended way to measure bias in risk scores since at least 2003. COMPAS provides an illustrative example.
xAUC suggests bias, while the APA's approach (applied by Flores et al. in their rejoinder to the ProPublica analysis titled "False Positives, False Negatives, and False Analyses") finds none. Which result should we believe? This paper doesn't make a compelling case for why the well-established approach should be discarded in favor of xAUC.
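The slope/intercept test described in this review can be sketched as follows. The synthetic data, group labels, and use of a single interaction term are illustrative assumptions for exposition, not the Skeem & Lowenkamp analysis itself:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 4000
group = rng.integers(0, 2, n)    # hypothetical binary group indicator
score = rng.uniform(0, 1, n)     # continuous risk score in [0, 1]

# Simulate outcomes with the SAME score-outcome curve for both groups,
# i.e. a score that would be judged unbiased under the APA criterion.
y = rng.binomial(1, 1 / (1 + np.exp(-(4 * score - 2))))

# Regress outcome on score, group, and their interaction.
# A very large C approximates an unpenalized logistic fit.
X = np.column_stack([score, group, score * group])
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
b_score, b_group, b_interaction = model.coef_[0]

# A significant b_group (intercept shift between groups) or
# b_interaction (slope shift) would be read as evidence of predictive
# bias; in this simulation both should be near zero.
```

In practice one would test these coefficients for statistical significance rather than eyeball them; the point of the sketch is only that the APA-style test operates on the full continuous score-outcome relationship, not on a thresholded classification.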