{"title": "Assessing Disparate Impact of Personalized Interventions: Identifiability and Bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 3426, "page_last": 3437, "abstract": "Personalized interventions in social services, education, and healthcare leverage individual-level causal effect predictions in order to give the best treatment to each individual or to prioritize program interventions for the individuals most likely to benefit. While the sensitivity of these domains compels us to evaluate the fairness of such policies, we show that actually auditing their disparate impacts per standard observational metrics, such as true positive rates, is impossible since ground truths are unknown. Whether our data is experimental or observational, an individual's actual outcome under an intervention different than that received can never be known, only predicted based on features. We prove how we can nonetheless point-identify these quantities under the additional assumption of monotone treatment response, which may be reasonable in many applications. We further provide a sensitivity analysis for this assumption via sharp partial-identification bounds under violations of monotonicity of varying strengths. We show how to use our results to audit personalized interventions using partially-identified ROC and xROC curves and demonstrate this in a case study of a French job training dataset.", "full_text": "Assessing Disparate Impact of Personalized\nInterventions: Identi\ufb01ability and Bounds\n\nNathan Kallus\nCornell University\n\nNew York, NY\n\nkallus@cornell.edu\n\nAngela Zhou\n\nCornell University\n\nNew York, NY\n\naz434@cornell.edu\n\nAbstract\n\nPersonalized interventions in social services, education, and healthcare leverage\nindividual-level causal effect predictions in order to give the best treatment to each\nindividual or to prioritize program interventions for the individuals most likely to\nbene\ufb01t. 
While the sensitivity of these domains compels us to evaluate the fairness\nof such policies, we show that actually auditing their disparate impacts per standard\nobservational metrics, such as true positive rates, is impossible since ground truths\nare unknown. Whether our data is experimental or observational, an individual\u2019s\nactual outcome under an intervention different than that received can never be\nknown, only predicted based on features. We prove how we can nonetheless point-\nidentify these quantities under the additional assumption of monotone treatment\nresponse, which may be reasonable in many applications. We further provide a\nsensitivity analysis for this assumption by means of sharp partial-identi\ufb01cation\nbounds under violations of monotonicity of varying strengths. We show how to use\nour results to audit personalized interventions using partially-identi\ufb01ed ROC and\nxROC curves and demonstrate this in a case study of a French job training dataset.\n\nIntroduction\n\n1\nThe expanding use of predictive algorithms in the public sector for risk assessment has sparked recent\nconcern and study of fairness considerations [3, 9, 10]. One critique of the use of predictive risk\nassessment argues that the discussion should be reframed to instead focus on the role of positive\ninterventions in distributing bene\ufb01cial resources, such as directing pre-trial services to prevent\nrecidivism, rather than in meting out pre-trial detention based on a risk prediction [8]; or using risk\nassessment in child welfare services to provide families with additional childcare resources rather\nthan to inform the allocation of harmful suspicion [29, 62]. However, due to limited resources,\ninterventions are necessarily targeted. 
Recent research speci\ufb01cally investigates the use of models that\npredict an intervention\u2019s bene\ufb01t in order to ef\ufb01ciently target their allocation, such as in developing\ntriage tools to target homeless youth [46, 57]. Both ethics and law compel such personalized\ninterventions to be fair and to avoid disparities in how they impact different groups de\ufb01ned by certain\nprotected attributes, such as race, age, or gender.\nThe delivery of interventions to better target those individuals deemed most likely to respond well,\neven if a prediction or policy allocation rule does not have access to the protected attribute, might still\nresult in disparate impact (with regards to social welfare) for the same reasons that these disparities\noccur in machine learning classi\ufb01cation models [21]. (See Appendix C for an expanded discussion\non our use of the term \u201cdisparate impact.\u201d) However, in the problem of personalized interventions,\nthe \u201cfundamental problem of causal inference,\u201d that outcomes are not observed for interventions not\nadministered, poses a fundamental challenge for evaluating the fairness of any intervention allocation\nrule, as the true \u201clabels\u201d of intervention ef\ufb01cacy of any individual are never observed in the dataset.\nMetrics commonly assessed in the study of fairness in machine learning, such as group true positive\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fand false positive rates, are therefore conditional on potential outcomes which are not observed in the\ndata and therefore cannot be computed as in standard classi\ufb01cation problems.\nThe problem of personalized policy learning has surfaced in econometrics and computer science\n[13, 36, 37, 37, 45, 51], gaining renewed attention alongside recent advances in causal inference and\nmachine learning [4, 14, 28, 63]. 
In particular, [17] analyze optimal treatment allocations for malaria bednets with nonparametric plug-in estimates of conditional average treatment effects, accounting for budget restrictions; [27] use the generalized random forests method of [64] to evaluate heterogeneity of causal effects in a program matching at-risk youth in Chicago with summer jobs on outcomes and crime; and [46] use BART [32] to analyze heterogeneity of treatment effect for allocation of homeless youth to different interventions, remarking that studying fairness considerations for algorithmically-guided interventions is necessary.
In this paper, we address the challenges of assessing the disparate impact of such personalized intervention rules in the face of unknown ground-truth labels. We show that we can actually obtain point identification of common observational fairness metrics under the assumption of monotone treatment response. We motivate this assumption and discuss why it might be natural in settings where interventions only either help or do nothing. Recognizing nonetheless that this assumption is not actually testable, we show how to conduct sensitivity analyses for fairness metrics. In particular, we show how to obtain sharp partial identification bounds on the metrics of interest as we vary the strength of violation of the assumption. We then show how to use these tools to visualize disparities using partially identified ROC and xROC curves. 
We illustrate all of this in a case study of personalized job training based on a dataset from a French field experiment.

2 Problem Setup

We suppose we have data on individuals (X, A, T, Y) consisting of:

• Prognostic features X ∈ 𝒳, upon which interventions are personalized;
• Sensitive attribute A ∈ 𝒜, against which disparate impact will be measured;
• Binary treatment indicator T ∈ {0, 1}, indicating intervention exposure; and
• Binary response outcome Y ∈ {0, 1}, indicating the benefit to the individual.

Our convention is to identify T = 1 with an active intervention, such as job training or a homeless prevention program, and T = 0 with lack thereof. Similarly, we assume that a positive outcome, Y = 1, is associated with a beneficial event for the individual, e.g., successful employment or non-recidivation. Using the Neyman-Rubin potential outcome framework [34], we let Y(0), Y(1) ∈ {0, 1} denote the potential outcomes of each treatment. We let the observed outcome be the potential outcome of the assigned treatment, Y = Y(T), encapsulating non-interference and consistency assumptions, also known as SUTVA [60]. Importantly, for any one individual, we never simultaneously observe Y(0) and Y(1). This is sometimes termed the fundamental problem of causal inference. We assume our data either came from a randomized controlled trial (the most common case) or an unconfounded observational study, so that the treatment assignment is ignorable, that is, Y(1), Y(0) ⊥⊥ T | X, A.
When both treatment and potential outcomes are binary, we can exhaustively enumerate the four possible realizations of potential outcomes as (Y(0), Y(1)) ∈ {0, 1}². We call units with (Y(0), Y(1)) = (0, 1) responders, (Y(0), Y(1)) = (1, 0) anti-responders, and Y(0) = Y(1) non-responders. 
Such a decomposition is also common in instrumental variable analysis [2], where the binary outcome is take-up of treatment, with the analogous nomenclature of compliers, never-takers, always-takers, and defiers. Since we are concerned with an actual outcome, following [52], we replace this nomenclature with the notion of response rather than compliance. We remind the reader that, due to the fundamental problem of causal inference, response type is unobserved.
We denote the conditional probabilities of each response type by

p_ij = p_ij(X, A) = P(Y(0) = i, Y(1) = j | X, A).

By exhaustiveness of these types, p00 + p01 + p10 + p11 = 1. (Note p_ij are random variables.)
We consider evaluating the fairness of a personalized intervention policy Z = Z(X, A) ∈ {0, 1}, which assigns interventions based on observable features X, A (potentially just X). Note that by definition, the intervention has zero effect on non-responders, negative effect on anti-responders, and a positive effect only on responders. Therefore, in seeking to benefit individuals with limited resources, the personalized intervention policy should seek to target only the responders. Naturally, response type is unobserved and the policy can only mete out interventions based on observables.
In classification settings, minimum-error classifiers on the efficient frontier of type-I and type-II errors are given by Bayes classifiers that threshold the probability of a positive label. 
In personalized interventions, policies that are on the efficient frontier of social welfare (fraction of positive outcomes, P(Y(Z) = 1)) and program cost (fraction intervened on, P(Z = 1)) are given by thresholding (Z = I[τ ≥ θ]) the conditional average treatment effect (CATE):

τ = τ(X, A) = E[Y(1) − Y(0) | X, A] = p01 − p10
  = P(Y = 1 | T = 1, X, A) − P(Y = 1 | T = 0, X, A),

where the latter equality follows by the assumed ignorable treatment assignment. Estimating τ from unconfounded data using flexible models has been the subject of much recent work [32, 61, 64].
We consider observational fairness metrics in analogy to the classification setting, where the "true label" of an individual is their responder status, R = I[Y(1) > Y(0)]. We define the analogous true positive rate and true negative rate for the intervention assignment Z, conditional on the (unobserved) events of an individual being a responder or non-responder, respectively:

TPR_a = P(Z = 1 | A = a, Y(1) > Y(0)),  TNR_a = P(Z = 0 | A = a, Y(1) ≤ Y(0)).  (1)

2.1 Interpreting Disparities for Personalized Interventions

The use of predictive models to deliver interventions can induce disparate impact if responding (respectively, non-responding) individuals of different groups receive the intervention at disproportionate rates under the treatment policy. This can occur even with efficient policies that threshold the true CATE τ and can arise from the disparate predictiveness of X, A of response type (i.e., how far the p_ij are from 0 and 1). This is problematic because the choice of features X is usually made by the intervening agent (e.g., a government agency).
We discuss one possible interpretation of TPR or TNR disparities in this setting when the intervention is the bestowal of a benefit, like access to job training or case management. 
From the point of view of\nthe intervening agent, there are speci\ufb01c program goals, such as employment of the target individual\nwithin 6 months. Therefore, false positives are costly due to program cost and false negatives are\nmissed opportunities. But outcomes also affect the individual\u2019s utility. Discrepancies in TPR across\nvalues of A are of concern since they suggest that the needs of those who could actually bene\ufb01t\nfrom intervention (responders) in one group are not being met at the same rates as in other groups.\nArguably, for bene\ufb01t-bestowing interventions, TPR discrepancies are of greater concern. Nonetheless,\nfrom the point of view of the individual, the intervention may always grant some positive resource\n(e.g., from the point of view of well-being), regardless of responder status, since it corresponds to\naccess to a good (and the individual can gain other bene\ufb01ts from job training that may not necessarily\nalign with the intervener\u2019s program goals, such as employment in 1 year or personal enrichment). If\nso, then TNR discrepancies across values of A imply a \u201cdisparate bene\ufb01t of the doubt\u201d such that the\npolicy disparately over-bene\ufb01ts one group over another using the limited public resource without the\ncover of advancing the public program\u2019s goal, which may raise fairness and envy concerns, especially\nsince this \u201cwaste\u201d is at the cost of more slots for responders.\nBeyond assessing disparities in TPR and TNR for one \ufb01xed policy, we will also use our ability to\nassess these over varying CATE thresholds in order to compute xAUC metrics [41] in Section 6.\nThese give the disparity between the probabilities that a non-responder from group a is ranked above\na responder from group b and vice-versa. 
Thus, they measure the disproportionate access one group gets relative to another in any allocation of resources that is non-decreasing in CATE.
We emphasize that the identification arguments and bounds that we present on fairness metrics are primarily intended to facilitate the assessment of disparities, which may require further inquiry as to their morality and legality, not necessarily to promote statistical parity via adjustments such as group-specific thresholds, though that is also possible using our tools. We defer a more detailed discussion to Section 8 and re-emphasize that assessing the distribution of outcome-conditional model errors is of central importance both in machine learning [10, 30, 55] and in the economic efficiency of targeting resources [16, 18, 54].

3 Related Work

[50] consider estimating joint treatment effects of race and treatment under a deep latent variable model to reconstruct unobserved confounding. For evaluating fairness of policies derived from estimated effects, they consider the gap in population accuracy Acc_a = P(Z = Z* | A = a), where Z* = I[τ(X) > 0] is the (identifiable) optimal policy. In contrast, we highlight the unfairness of even optimal policies and focus on outcome-conditional error rates (TPR, TNR), where the non-identifiability of responder status introduces challenges regarding identifiability.
The issue of model evaluation under the censoring problem of selective labels has been discussed in situations such as pretrial detention, where detention censors outcomes [40, 48]. Sensitivity analysis to account for possible unmeasured confounders is used in [35, 39]. 
The distinction is that we focus on the targeted delivery of interventions with unknown (but estimated) causal effects, rather than considering classifications that induce one-sided censoring but have definitionally known effects. Recently, partial identification approaches have also been proposed in the case of known outcomes but missing protected attributes [22, 42].
Our emphasis is distinct from other work discussing fairness and causality that uses graphical causal models to decompose predictive models along causal pathways and to assess the normative validity of path-specific effects [44, 47], such as the effect of probabilistic hypothetical interventions on race variables or other potentially immutable protected attributes. When discussing treatments, we here consider interventions corresponding to allocation of concrete resources (e.g., give job training), which are in fact physically manipulable by an intervening agent. The correlation of the intervention's conditional average treatment effects with, say, race and its implications for downstream resource allocation are our primary concern.
There is extensive literature on partial identification, e.g., [53], including for individual-level causal effects [43]. In contrast to previous work that analyzes partial identification of average treatment effects when data is confounded and uses monotonicity to improve precision [6, 15, 53], we focus on unconfounded (e.g., RCT) data, achieve full identification by assuming monotonicity, and consider sensitivity analysis bounds for nonlinear functionals of partially identified sets, namely, true positive and false positive rates.

4 Identifiability of Disparate Impact Metrics

Since the definitions of the disparate impact metrics in Eq. (1) are conditioned on an unobserved event, such as the response event Y(1) > Y(0), they actually cannot be identified from the data, even under ignorable treatment. 
That is, the values of TPR_a, TNR_a can vary even when the joint distribution of (X, A, T, Y) remains the same, meaning the data we see cannot possibly tell us about the specific value of TPR_a, TNR_a.

Proposition 1. TPR_a, TNR_a (or discrepancies therein over groups) are generally not identifiable.

Essentially, Proposition 1 follows because the data only identifies the marginals p10 + p11 and p01 + p11, while TPR_a, TNR_a depend on the joint via p01, which can vary even while the marginals are fixed. Since this can vary independently across values of A, discrepancies are not identifiable either.

4.1 Identification under monotonicity

We next show identifiability if we impose the additional assumption of monotone treatment response.

Assumption 1 (Monotone treatment response). Y(1) ≥ Y(0). (Equivalently, p10 = 0.)

Assumption 1 says that anti-responders do not exist. In other words, the treatment either does nothing (e.g., an individual would have gotten a job or not gotten a job, regardless of receiving job training) or it benefits the individual (would get a job if and only if they receive job training), but it never harms the individual. This assumption is reasonable for positive interventions. As [38] points out, policy learning in this setting is equivalent to the binary classification problem of predicting responder status.

Proposition 2. Under Assumption 1,

TPR_a = E[τ | A = a, Z = 1] P(Z = 1 | A = a) / E[τ | A = a],
TNR_a = E[(1 − τ) | A = a, Z = 0] P(Z = 0 | A = a) / E[(1 − τ) | A = a].  (2)

Since the quantities on the right-hand sides in Eq. (2) are in terms of identified quantities (functions of the distribution of (X, A, T, Y)), this proves identifiability. 
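As a concrete illustration, Eq. (2) lends itself to simple plug-in estimation. The following is a minimal sketch, not the authors' implementation; all names are hypothetical, and `tau_hat` may come from any CATE estimator:

```python
import numpy as np

def tpr_tnr_monotone(tau_hat, z, a, group):
    """Plug-in version of Eq. (2) under monotone treatment response (p10 = 0).

    tau_hat : per-unit CATE estimates (from any CATE estimator)
    z       : 0/1 policy assignments Z = Z(X, A)
    a       : group labels A
    group   : group value at which to evaluate TPR_a, TNR_a
    """
    m = a == group
    p_z1 = z[m].mean()  # P(Z = 1 | A = a)
    tpr = tau_hat[m & (z == 1)].mean() * p_z1 / tau_hat[m].mean()
    tnr = (1 - tau_hat[m & (z == 0)]).mean() * (1 - p_z1) / (1 - tau_hat[m]).mean()
    return tpr, tnr
```

Each conditional expectation and probability is replaced by the corresponding within-group sample average, which is valid since A and Z are discrete.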
Given a sample and an estimate of τ, it also provides a simple recipe for estimation by replacing each average or probability by a sample version, since both A and Z are discrete. More generally, since these averages are average treatment effects (over subpopulations defined by A, Z values), these quantities can also alternatively be estimated by any average treatment effect estimator and plugged in. For example, we can use doubly robust estimators to ensure specification-robustness [58] or double ML estimators to ensure efficiency when X may be high-dimensional [23].
Thus, Proposition 2 provides a novel means of assessing disparate impact of personalized interventions under monotone response. This is relevant because monotonicity is a defensible assumption in the case of many interventions that bestow an additional benefit, good, or resource, such as the ones mentioned in Section 1. Nonetheless, the validity of Assumption 1 is itself not identifiable. Therefore, should it fail even slightly, it is not immediately clear whether these disparity estimates can be relied upon. We therefore next study a sensitivity analysis by means of constructing partial identification bounds for TPR_a, TNR_a.

5 Partial Identification Bounds for Sensitivity Analysis

We next study the partial identification of disparate impact metrics when Assumption 1 fails, i.e., p10 ≠ 0. We first state a more general version of Proposition 2. 
For any η = η(X, A), let

ρ_a^TPR(η) := E[τ + η | A = a, Z = 1] P(Z = 1 | A = a) / E[τ + η | A = a],
ρ_a^TNR(η) := E[1 − (τ + η) | A = a, Z = 0] P(Z = 0 | A = a) / E[1 − (τ + η) | A = a].

Proposition 3. TPR_a = ρ_a^TPR(p10), TNR_a = ρ_a^TNR(p10).

Since the anti-responder probability p10 is unknown, we cannot use Proposition 3 to identify TPR_a, TNR_a. We instead use Proposition 3 to compute bounds on them by restricting p10 to be in an uncertainty set. Formally, given an uncertainty set U for p10 (i.e., a set of functions of x, a), we define the simultaneous identification region of the TPR and TNR for all groups a ∈ 𝒜 as:

Θ = {(ρ_a^TPR(η), ρ_a^TNR(η))_{a∈𝒜} : η ∈ U} ⊆ R^{2×|𝒜|}.

For brevity, we will let ρ_a(η) = (ρ_a^TPR(η), ρ_a^TNR(η)) and ρ(η) = (ρ_a(η))_{a∈𝒜}.
The set Θ describes all possible simultaneous values of the group-conditional true positive and true negative rates. As long as ∀η ∈ U we have 0 ≤ η(X, A) ≤ min(P(Y = 1 | T = 0, X, A), P(Y = 0 | T = 1, X, A)) (which is identified from the data), by Proposition 3 this set is necessarily sharp [53] given only the restriction that p10 ∈ U. (In particular, this bound on η can be achieved by just point-wise clipping U with this identifiable bound as necessary.) That is, given a joint on (X, A, T, Y), on the one hand, every ρ ∈ Θ is realized by some full joint distribution on (X, A, T, Y(0), Y(1)) with p10 ∈ U, and on the other hand, every such joint gives rise to a ρ ∈ Θ. 
In other words, Θ is an exact characterization of the in-fact possible simultaneous values of the group-conditional TPRs and TNRs.
Therefore, if, for example, we are interested in the minimal and maximal possible values for the true (unknown) TPR discrepancy between groups a and b, we should seek to compute inf_{ρ∈Θ} ρ_a^TPR − ρ_b^TPR and sup_{ρ∈Θ} ρ_a^TPR − ρ_b^TPR. More generally, for any µ ∈ R^{2×|𝒜|}, we may wish to compute

h_Θ(µ) := sup_{ρ∈Θ} µ^⊤ρ.  (3)

Note that this, for example, covers the above example since for any µ we can also take −µ. The function h_Θ is known as the support function of Θ [59]. Not only does the support function provide the maximal and minimal contrasts in a set, it also exactly characterizes its convex hull. That is, Conv(Θ) = {ρ : µ^⊤ρ ≤ h_Θ(µ) ∀µ}. So computing h_Θ allows us to compute Conv(Θ).
Our next result gives an explicit program to compute the support function when U has a product form of within-group uncertainty sets:

U = {η : η(·, a) ∈ U_a ∀a ∈ 𝒜},  (4)

which leads to Θ = ∏_{a∈𝒜} Θ_a, where Θ_a = {ρ_a(η_a) : η_a ∈ U_a}.

Proposition 4. Let r_a^z := P(Z = z | A = a) and τ_a^z := E[τ | A = a, Z = z]. Suppose U is as in (4). Then Eq. (3) can be reformulated as:

h_Θ(µ) = Σ_{a∈𝒜} h_{Θ_a}(µ_a), where
h_{Θ_a}(µ_a) = sup_{ω_a, t_a}  µ_a^TPR r_a^1 (t_a τ_a^1 + E[ω_a(X) | A = a, Z = 1]) + (µ_a^TNR r_a^0 / (t_a − 1)) (t_a(1 − τ_a^0) − E[ω_a(X) | A = a, Z = 0])
s.t.  ω_a(·) ∈ t_a U_a,  t_a r_a^0 τ_a^0 + t_a r_a^1 τ_a^1 + E[ω_a | A = a] = 1.

For a fixed value of t_a, the above program is a linear program, given that U_a is linearly representable. Therefore a solution may be found by grid search on the univariate t_a. 
Moreover, if µ_a^TPR = 0 or µ_a^TNR = 0, the above remains a linear program even with t_a as a variable [20]. With this, we are able to express group-level disparities by assessing the support function at specific contrast vectors µ.

5.1 Partial Identification under Relaxed Monotone Treatment Response

We next consider the implications of the above for the following relaxation of the monotone treatment response assumption:

Assumption 2 (B-relaxed monotone treatment response). p10 ≤ B.

Note that Assumption 2 with B = 0 recovers Assumption 1, and Assumption 2 with B = 1 is a vacuous assumption. In between these two extremes we can consider milder or stronger violations of monotone response and the partial identification bounds they correspond to. This provides us with a means of sensitivity analysis of the disparities we measure, recognizing that monotone response may not hold exactly and that disparities may not be exactly identifiable. For the rest of the paper, we focus solely on partial identification under Assumption 2. Note that Assumption 2 corresponds exactly to the uncertainty set U_B = {η : 0 ≤ η(X, A) ≤ min(B, P(Y = 1 | T = 0, X, A), P(Y = 0 | T = 1, X, A))}. 
We define Θ_B = ∏_{a∈𝒜} Θ_{B,a} to be the corresponding identification region.
Under Assumption 2, our bounds take on a particularly simple form. Let B_a^z(B) = E[min(B, P(Y = 1 | T = 0, X, A), P(Y = 0 | T = 1, X, A)) | A = a, Z = z] and define

ρ̄_a^TPR(B) = (τ_a^1 + B_a^1(B)) r_a^1 / ((τ_a^1 + B_a^1(B)) r_a^1 + τ_a^0 r_a^0),
ρ̲_a^TPR(B) = τ_a^1 r_a^1 / (τ_a^1 r_a^1 + (τ_a^0 + B_a^0(B)) r_a^0),
ρ̄_a^TNR(B) = (1 − τ_a^0) r_a^0 / ((1 − τ_a^0) r_a^0 + (1 − τ_a^1 − B_a^1(B)) r_a^1),
ρ̲_a^TNR(B) = (1 − τ_a^0 − B_a^0(B)) r_a^0 / ((1 − τ_a^0 − B_a^0(B)) r_a^0 + (1 − τ_a^1) r_a^1).

Proposition 5. Suppose Assumption 2 holds. Then [ρ̲_a^TPR(B), ρ̄_a^TPR(B)] and [ρ̲_a^TNR(B), ρ̄_a^TNR(B)] are the sharp identification intervals for TPR_a and TNR_a, respectively. Moreover, (ρ̄_a^TPR(B), ρ̄_a^TNR(B)) ∈ Θ_{B,a} and (ρ̲_a^TPR(B), ρ̲_a^TNR(B)) ∈ Θ_{B,a}, i.e., the two extremes are simultaneously achievable.

6 Partial Identification of Group Disparities and ROC and xROC Curves

We discuss diagnostics to summarize possible impact disparities across a range of possible policies.

TPR and TNR disparity. Discrepancies in model errors (TPR or TNR) are of interest when auditing classification performance on different groups with a given, fixed policy Z. Under Assumption 1, they are identified by Proposition 2. Under violations of Assumption 1, we can consider their partial identification bounds. If the minimal disparity remains nonzero, that provides strong evidence of disparity. 
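The interval endpoints of Proposition 5 are themselves simple plug-in quantities. The following is a minimal sketch under Assumption 2, not the authors' implementation; names are hypothetical, and `b_cap` stands for the per-unit identifiable cap min(B, P(Y = 1 | T = 0, X, A), P(Y = 0 | T = 1, X, A)), which in practice would be built from estimated outcome models:

```python
import numpy as np

def tpr_tnr_bounds(tau_hat, b_cap, z, a, group):
    """Sharp [lower, upper] bounds on TPR_a and TNR_a under p10 <= B (Prop. 5).

    b_cap : per-unit value of min(B, P(Y=1|T=0,X,A), P(Y=0|T=1,X,A)),
            the identifiable cap on the anti-responder probability.
    """
    m = a == group
    r1 = z[m].mean()                       # r_a^1 = P(Z = 1 | A = a)
    r0 = 1 - r1                            # r_a^0
    t1 = tau_hat[m & (z == 1)].mean()      # tau_a^1
    t0 = tau_hat[m & (z == 0)].mean()      # tau_a^0
    b1 = b_cap[m & (z == 1)].mean()        # B_a^1(B)
    b0 = b_cap[m & (z == 0)].mean()        # B_a^0(B)
    tpr_hi = (t1 + b1) * r1 / ((t1 + b1) * r1 + t0 * r0)
    tpr_lo = t1 * r1 / (t1 * r1 + (t0 + b0) * r0)
    tnr_hi = (1 - t0) * r0 / ((1 - t0) * r0 + (1 - t1 - b1) * r1)
    tnr_lo = (1 - t0 - b0) * r0 / ((1 - t0 - b0) * r0 + (1 - t1) * r1)
    return (tpr_lo, tpr_hi), (tnr_lo, tnr_hi)
```

When `b_cap` is identically zero (B = 0), both intervals collapse to the point estimates of Proposition 2.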
Similarly, if the maximal disparity is large, a responsible decision maker should be concerned about the possibility of a disparity.
Under Assumption 2, Proposition 5 provides that the sharp identification intervals of TPR_a − TPR_b and TNR_a − TNR_b are, respectively, given by

[ρ̲_a^TPR(B) − ρ̄_b^TPR(B), ρ̄_a^TPR(B) − ρ̲_b^TPR(B)],
[ρ̲_a^TNR(B) − ρ̄_b^TNR(B), ρ̄_a^TNR(B) − ρ̲_b^TNR(B)].  (5)

Given effect scores τ, we can then use this to plot disparity curves by plotting the endpoints of Eq. (5) for policies Z = I[τ ≥ θ] for varying thresholds θ.

Robust ROC Curves We first define the analogous group-conditional ROC curve corresponding to a CATE function τ. These are the parametric curves traced out by the pairs (1 − TNR_a, TPR_a) of policies that threshold the CATE for varying thresholds. To make explicit that we are now computing metrics for different policies, we use the notation ρ(η; τ ≥ θ) to refer to the metrics of the policy Z = I[τ ≥ θ]. Under Assumption 1, Proposition 2 provides point identification of the group-conditional ROC curve:

ROC_a(τ) := {(1 − ρ_a^TNR(0; τ ≥ θ), ρ_a^TPR(0; τ ≥ θ)) : θ ∈ R}.

When Assumption 1 fails, we cannot point identify TPR_a, TNR_a, and correspondingly we cannot identify ROC_a(τ). We instead define the robust ROC curve as the union of all partially identified ROC curves. Specifically:

Θ_a^ROC(τ) := {(1 − ρ_a^TNR(η_a; τ ≥ θ), ρ_a^TPR(η_a; τ ≥ θ)) : θ ∈ R, η_a ∈ U_a}.

Plotted, this set provides a visual representation of the region in which the true ROC curve can lie. We next prove that under Assumption 2, we can easily compute this set as the area between two curves.

Proposition 6. Let U = U_B. 
Then Θ_a^ROC(τ) is given as the area between the two parametric curves ROC̲_a(τ) := {(1 − ρ̲_a^TNR(B; τ ≥ θ), ρ̲_a^TPR(B; τ ≥ θ)) : θ ∈ R} and ROC̄_a(τ) := {(1 − ρ̄_a^TNR(B; τ ≥ θ), ρ̄_a^TPR(B; τ ≥ θ)) : θ ∈ R}.
This follows because the extremes are simultaneously achievable, as noted in Proposition 5. We highlight, however, that the lower (resp., upper) ROC curve may not be simultaneously realizable.

Robust xROC Curves Comparison of group-conditional ROC curves may not necessarily show impact disparities as, even in standard classification settings, ROC curves can overlap despite disparate impacts [30, 41]. At the same time, comparing disparities for fixed policies Z with fixed thresholds may not accurately capture the impact of using τ for rankings. [41] develop the xAUC metric for assessing the bipartite ranking quality of risk scores, as well as the analogous notion of an xROC curve, which parametrically plots the TPR of one group vs. the FPR of another group at any fixed threshold. 
This is relevant if effect scores τ are used for downstream decisions by different facilities with different budget constraints or if the score is intended to be used by a "human-in-the-loop" exercising additional judgment, e.g., individual caseworkers as in the encouragement design of [12]. Under Assumption 1, we can point identify TPR_a, TNR_a, so, following [41], we can define the point-identified xROC curve as

xROC_{a,b}(τ) = {(1 − ρ_b^TNR(0; τ ≥ θ), ρ_a^TPR(0; τ ≥ θ)) : θ ∈ R}.

Without Assumption 1, we analogously define the robust xROC curve as the union of all partially identified xROC curves:

Θ_{a,b}^xROC(τ) = {(1 − ρ_b^TNR(η; τ ≥ θ), ρ_a^TPR(η; τ ≥ θ)) : θ ∈ R, η ∈ U}.

Proposition 7. Let U = U_B. Then Θ_{a,b}^xROC(τ) is given as the area between the two parametric curves xROC̲_{a,b}(τ) := {(1 − ρ̲_b^TNR(B; τ ≥ θ), ρ̲_a^TPR(B; τ ≥ θ)) : θ ∈ R} and xROC̄_{a,b}(τ) := {(1 − ρ̄_b^TNR(B; τ ≥ θ), ρ̄_a^TPR(B; τ ≥ θ)) : θ ∈ R}.
This follows because U_B takes the form of a product set over a ∈ 𝒜.

7 Case Study: Personalized Job Training (Behaghel et al.)

We consider a case study from a large three-armed randomized controlled trial that randomly assigned job-seekers in France to a control group, a job training program managed by a public vendor, or an out-sourced program managed by a private vendor [11]. While the original experiment was interested in the design of contracts for program service delivery, we consider a task of heterogeneous causal effect estimation, motivated by interest in personalizing the different types of counseling or active labor market programs that would be beneficial for the individual. 
Recent work in policy learning has also considered personalized job training assignment [45, 63] and suggested excluding sensitive attributes from the input to the decision rule for fairness considerations, but without considering fairness in the causal effect estimation itself, or how significant impact disparities may remain even after excising sensitive attributes because of it.

Figure 1: TPR and TNR disparity curves and bounds on French job training dataset (Eq. (5)). Panels show disparities by nationality (French vs. non-French) and by age (under vs. over 26).

Figure 2: ROC and xROC for A = nationality, age on French job training dataset

We focus on the public program vs. control arm, which enrolled about 7950 participants in total, with $n_1 = 3385$ participants in the public program. The treatment arm, T = 1, corresponds to assignment to the public program. The original analysis suggests a small but statistically significant positive treatment effect of the public program, with an ATE of 0.023. We defer further details on the data processing to Appendix B. We consider the group indicators: nationality (0, 1 denoting French nationals vs. non-French, respectively), gender (denoting woman vs. non-woman), and age (below the age of 26 vs. above). (Figures for gender appear in Appendix B.)
In Fig. 1, we plot the identified "disparity curves" of Eq. (5) corresponding to the maximal and minimal sensitivity bounds on TPR and TNR disparity between groups. Levels of shading correspond to different values of B, with the color legend at right.
We learn τ by the Generalized Random Forests method of [5, 64] and use sample splitting, learning τ on half the data and using our methods to assess bounds on $\rho^{\mathrm{TPR}}$, $\rho^{\mathrm{TNR}}$, and other quantities with out-of-sample estimates on the other half of the data. We bootstrap over 50 sampled splits and average disparity curves to reduce sample uncertainty. In general, the small probability of being a responder leads to increased sensitivity of TPR estimates (wide identification bands). The curves and sensitivity bounds suggest that, with respect to nationality and gender, there is little or no disparity in true positive rates, but that the true negative rates for nationality, gender, and age may differ significantly across groups, such that, for example, non-women would have a higher chance of being given job-training benefits when they are in fact not responders. However, the TPR disparity by age appears to be as large as −0.1, with older actually-responding individuals being less likely to be given job training than younger individuals. Overall, this suggests that differences in heterogeneous treatment effects across age categories could lead to significant adverse impact on older individuals.
This is similarly reflected in the robust ROC and xROC curves (Fig. 2).
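The split-and-average protocol described above can be sketched as follows. Here `fit_effect_score` and `disparity_curve` are hypothetical stand-ins for the CATE learner (generalized random forests in our case) and for an estimator of the disparity bounds; only the resampling logic is shown, and the stand-ins below are trivial placeholders:

```python
import numpy as np

def averaged_disparity_curves(X, y, t, fit_effect_score, disparity_curve,
                              n_splits=50, seed=0):
    """Average out-of-sample disparity curves over repeated sample splits.

    Each split fits the effect score tau on one half of the data and
    evaluates the disparity curve on the held-out half; averaging over
    splits reduces sample uncertainty.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    curves = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        train, test = idx[: n // 2], idx[n // 2:]
        tau = fit_effect_score(X[train], y[train], t[train])
        curves.append(disparity_curve(tau(X[test]), y[test], t[test]))
    return np.mean(curves, axis=0)

# Illustrative stand-ins: a trivial "learner" scoring on the first
# covariate, and a "curve" summarizing score quantiles on the test half.
X = np.random.default_rng(1).normal(size=(200, 3))
y, t = (X[:, 0] > 0).astype(int), np.zeros(200, dtype=int)
curve = averaged_disparity_curves(
    X, y, t,
    fit_effect_score=lambda Xtr, ytr, ttr: (lambda Z: Z[:, 0]),
    disparity_curve=lambda s, ytest, ttest: np.quantile(s, [0.25, 0.5, 0.75]),
)
```

A real implementation would pass group labels through to `disparity_curve` and plug in the sensitivity bounds of Eq. (5).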
Despite possibly small differences in ROCs, the xROCs indicate strong disparities: the sensitivity analysis suggests that the likelihood of ranking a non-responding young individual above a responding old individual (xAUC [41]) is clearly larger than the symmetric error, meaning that older individuals who benefit from the treatment may be disproportionately shut out of it as seats are instead given to non-responding younger individuals.

8 Discussion and Conclusion

We presented identification results and bounds for assessing disparate model errors of causal-effect-maximizing treatment policies, which can lead to disparities across groups in access for those who stand to benefit from treatment. Whether this is "unfair" naturally depends on one's normative assumptions. One such assumption is "claims across outcomes": that individuals have a claim to the public intervention if they stand to benefit, which can be understood within [1]'s axiomatic justification of fair distribution. There may also be other justice-based considerations, e.g., minimax fairness. We discuss this more extensively in Appendix C.
With the new ability to assess disparities using our results, a second natural question is whether these disparities warrant adjustment, which is easy to do given our tools combined with the approach of [30]. This question again depends both on one's viewpoint and ultimately on the problem context, and we discuss it further in Appendix C. Regardless of normative viewpoints, auditing the allocative disparities that would arise from the implementation of a personalized rule must be a crucial step of a responsible and convincing program evaluation.
We presented fundamental identification limits to such assessments but provided sensitivity analyses that can support reliable auditing.

Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant No. 1846210. This research was funded in part by JPMorgan Chase & Co. Any views or opinions expressed herein are solely those of the authors listed, and may differ from the views and opinions expressed by JPMorgan Chase & Co. or its affiliates. This material is not a product of the Research Department of J.P. Morgan Securities LLC. This material should not be construed as an individual recommendation for any particular client and is not intended as a recommendation of particular securities, financial instruments or strategies for a particular client. This material does not constitute a solicitation or offer in any jurisdiction.

References
[1] M. Adler. Well-Being and Fair Distribution. Oxford University Press, 2012.
[2] J. D. Angrist, G. W. Imbens, and D. B. Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 1996.
[3] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias. Online, May 2016.
[4] S. Athey. Beyond prediction: Using big data for policy problems. Science, 2017.
[5] S. Athey, J. Tibshirani, and S. Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
[6] A. Balke and J. Pearl. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92(439):1171–1176, 1997.
[7] A. Banerjee, E. Duflo, N. Goldberg, D. Karlan, R. Osei, W. Parienté, J. Shapiro, B. Thuysbaert, and C. Udry. A multifaceted program causes lasting progress for the very poor: Evidence from six countries. Science, 348(6236):1260799, 2015.
[8] C. Barabas, K. Dinakar, J. Ito, M. Virza, and J.
Zittrain. Interventions over predictions: Reframing the ethical debate for actuarial risk assessment. Proceedings of Machine Learning Research, 2017.
[9] S. Barocas and A. Selbst. Big data's disparate impact. California Law Review, 2014.
[10] S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning. fairmlbook.org, 2018. http://www.fairmlbook.org.
[11] L. Behaghel, B. Crépon, and M. Gurgand. Private and public provision of counseling to job seekers: Evidence from a large controlled experiment. American Economic Journal: Applied Economics, 2014.
[12] S. Behncke, M. Frölich, and M. Lechner. Targeting labour market programmes: Results from a randomized experiment. Work. Pap. 3085, IZA (Inst. Study Labor), 2007.
[13] A. Bennett and N. Kallus. Policy evaluation with latent confounders via optimal balance. In Advances in Neural Information Processing Systems, 2019.
[14] A. Bennett, N. Kallus, and T. Schnabel. Deep generalized method of moments for instrumental variable analysis. In Advances in Neural Information Processing Systems, 2019.
[15] A. Beresteanu, I. Molchanov, and F. Molinari. Partial identification using random set theory. Journal of Econometrics, 166(1):17–32, 2012.
[16] M. Berger, D. A. Black, and J. A. Smith. Econometric evaluation of labour market policies, chapter Evaluating Profiling as a Means of Allocating Government Services, pages 59–84. 2000.
[17] D. Bhattacharya and P. Dupas. Inferring welfare maximizing treatment assignment under budget constraints. Journal of Econometrics, 2012.
[18] C. Brown, M. Ravallion, and D. van de Walle. A poor means test? Econometric targeting in Africa. Policy Research Working Paper 7915, World Bank Group, Development Research Group, Human Development and Public Services Team, 2016.
[19] P. Carneiro, K. T. Hansen, and J. J. Heckman.
Removing the veil of ignorance in assessing the distributional impacts of social policies. Technical report, National Bureau of Economic Research, 2002.
[20] A. Charnes and W. W. Cooper. Programming with linear fractional functionals. Naval Research Logistics (NRL), 9(3-4):181–186, 1962.
[21] I. Chen, F. Johansson, and D. Sontag. Why is my classifier discriminatory? In Advances in Neural Information Processing Systems 31, 2018.
[22] J. Chen, N. Kallus, X. Mao, G. Svacha, and M. Udell. Fairness under unawareness: Assessing disparity when protected class is unobserved. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 339–348. ACM, 2019.
[23] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, and W. Newey. Double/debiased/Neyman machine learning of treatment effects. American Economic Review, 107(5):261–65, 2017.
[24] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. In Proceedings of FATML, 2016.
[25] S. Corbett-Davies and S. Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. ArXiv preprint, 2018.
[26] B. Crépon and G. J. van den Berg. Active labor market policies. Annual Review of Economics, 8:521–546, 2016.
[27] J. M. Davis and S. B. Heller. Using causal forests to predict treatment heterogeneity: An application to summer jobs. American Economic Review: Papers and Proceedings, 107(5):546–550, 2017.
[28] M. Dudik, D. Erhan, J. Langford, and L. Li. Doubly robust policy evaluation and optimization. Statistical Science, 2014.
[29] V. Eubanks. Automating inequality: How high-tech tools profile, police, and punish the poor. St. Martin's Press, 2018.
[30] M. Hardt, E. Price, N. Srebro, et al. Equality of opportunity in supervised learning.
In Advances in Neural Information Processing Systems, pages 3315–3323, 2016.
[31] H. Heidari, C. Ferrari, K. Gummadi, and A. Krause. Fairness behind a veil of ignorance: A welfare analysis for automated decision making. In Advances in Neural Information Processing Systems, pages 1265–1276, 2018.
[32] J. L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
[33] L. Hu and Y. Chen. Fair classification and social welfare. arXiv preprint arXiv:1905.00147, 2019.
[34] G. Imbens and D. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
[35] J. Jung, R. Shroff, A. Feller, and S. Goel. Algorithmic decision making in the presence of unmeasured confounding. ArXiv, 2018.
[36] N. Kallus. Recursive partitioning for personalization using observational data. Proceedings of the Thirty-fourth International Conference on Machine Learning, 2017.
[37] N. Kallus. Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, pages 8895–8906, 2018.
[38] N. Kallus. Classifying treatment responders under causal effect monotonicity. Proceedings of International Conference on Machine Learning, 2019.
[39] N. Kallus and A. Zhou. Confounding-robust policy improvement. In Advances in Neural Information Processing Systems, pages 9269–9279, 2018.
[40] N. Kallus and A. Zhou. Residual unfairness in fair machine learning from prejudiced data. Forthcoming at ICML, 2018.
[41] N. Kallus and A. Zhou. The fairness of risk scores beyond classification: Bipartite ranking and the xAUC metric. In Advances in Neural Information Processing Systems, 2019.
[42] N. Kallus, X. Mao, and A. Zhou. Assessing algorithmic fairness with unobserved protected class using data combination. arXiv preprint arXiv:1906.00285, 2019.
[43] N.
Kallus, X. Mao, and A. Zhou. Interval estimation of individual-level causal effects under unobserved confounding. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2281–2290, 2019.
[44] N. Kilbertus, M. Rojas-Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. Advances in Neural Information Processing Systems 30, 2017.
[45] T. Kitagawa and A. Tetenov. Empirical welfare maximization. 2015.
[46] A. Kube and S. Das. Allocating interventions based on predicted outcomes: A case study on homelessness services. Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
[47] M. J. Kusner, J. R. Loftus, C. Russell, and R. Silva. Counterfactual fairness. NIPS, 2017.
[48] H. Lakkaraju, J. Kleinberg, J. Leskovec, J. Ludwig, and S. Mullainathan. The selective labels problem: Evaluating algorithmic predictions in the presence of unobservables. Proceedings of KDD 2017, 2017.
[49] L. T. Liu, S. Dean, E. Rolf, M. Simchowitz, and M. Hardt. Delayed impact of fair machine learning. Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 2018.
[50] D. Madras, E. Creager, T. Pitassi, and R. Zemel. Fairness through causal awareness: Learning latent-variable models for biased data. ACM Conference on Fairness, Accountability, and Transparency (ACM FAT*), 2019.
[51] C. Manski. Social Choice with Partial Knowledge of Treatment Response. The Econometric Institute Lectures, 2005.
[52] C. F. Manski. Monotone treatment response. Econometrica: Journal of the Econometric Society, pages 1311–1334, 1997.
[53] C. F. Manski. Partial identification of probability distributions. Springer Science & Business Media, 2003.
[54] L. McBride and A. Nichols. Retooling poverty targeting using out-of-sample validation and machine learning.
Policy Research Working Paper 7849, World Bank Group, Development Economics Vice Presidency Operations and Strategy Team, 2016.
[55] S. Mitchell, E. Potash, and S. Barocas. Prediction-based decisions and fairness: A catalogue of choices, assumptions, and definitions. arXiv, 2018.
[56] T. Mkandawire. Targeting and universalism in poverty reduction. Social Policy and Development, 2005.
[57] E. Rice. The TAY Triage Tool: A tool to identify homeless transition age youth most in need of permanent supportive housing. 2013.
[58] J. M. Robins, A. Rotnitzky, and L. P. Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.
[59] R. T. Rockafellar. Convex analysis. Princeton University Press, 2015.
[60] D. B. Rubin. Comments on "Randomization analysis of experimental data: The Fisher randomization test comment". Journal of the American Statistical Association, 75(371):591–593, 1980.
[61] U. Shalit, F. Johansson, and D. Sontag. Estimating individual treatment effect: Generalization bounds and algorithms. Proceedings of the 34th International Conference on Machine Learning, 2017.
[62] R. Shroff. Predictive analytics for city agencies: Lessons from children's services. Big Data, 5(3):189–196, 2017.
[63] S. Wager and S. Athey. Efficient policy learning. 2017.
[64] S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, (just-accepted), 2017.