{"title": "Why Is My Classifier Discriminatory?", "book": "Advances in Neural Information Processing Systems", "page_first": 3539, "page_last": 3550, "abstract": "Recent attempts to achieve fairness in predictive models focus on the balance between fairness and accuracy. In sensitive applications such as healthcare or criminal justice, this trade-off is often undesirable as any increase in prediction error could have devastating consequences. In this work, we argue that the fairness of predictions should be evaluated in context of the data, and that unfairness induced by inadequate samples sizes or unmeasured predictive variables should be addressed through data collection, rather than by constraining the model. We decompose cost-based metrics of discrimination into bias, variance, and noise, and propose actions aimed at estimating and reducing each term. Finally, we perform case-studies on prediction of income, mortality, and review ratings, confirming the value of this analysis. We find that data collection is often a means to reduce discrimination without sacrificing accuracy.", "full_text": "Why Is My Classi\ufb01er Discriminatory?\n\nIrene Y. Chen\n\nMIT\n\niychen@mit.edu\n\nFredrik D. Johansson\n\nMIT\n\nfredrikj@mit.edu\n\nDavid Sontag\n\nMIT\n\ndsontag@csail.mit.edu\n\nAbstract\n\nRecent attempts to achieve fairness in predictive models focus on the balance\nbetween fairness and accuracy. In sensitive applications such as healthcare or\ncriminal justice, this trade-off is often undesirable as any increase in prediction\nerror could have devastating consequences. In this work, we argue that the fairness\nof predictions should be evaluated in context of the data, and that unfairness\ninduced by inadequate samples sizes or unmeasured predictive variables should\nbe addressed through data collection, rather than by constraining the model. 
We decompose cost-based metrics of discrimination into bias, variance, and noise, and propose actions aimed at estimating and reducing each term. Finally, we perform case studies on prediction of income, mortality, and review ratings, confirming the value of this analysis. We find that data collection is often a means to reduce discrimination without sacrificing accuracy.

1 Introduction

As machine learning algorithms increasingly affect decision making in society, many have raised concerns about the fairness and biases of these algorithms, especially in applications to healthcare or criminal justice, where human lives are at stake (Angwin et al., 2016; Barocas & Selbst, 2016). It is often hoped that the use of automatic decision support systems trained on observational data will remove human bias and improve accuracy. However, factors such as data quality and model choice may encode unintentional discrimination, resulting in systematic disparate impact.

We study fairness in prediction of outcomes such as recidivism, annual income, or patient mortality. Fairness is evaluated with respect to protected groups of individuals defined by attributes such as gender or ethnicity (Ruggieri et al., 2010). Following previous work, we measure discrimination in terms of differences in prediction cost across protected groups (Calders & Verwer, 2010; Dwork et al., 2012; Feldman et al., 2015). Correcting for issues of data provenance and historical bias in labels is outside the scope of this work. Much research has been devoted to constraining models to satisfy cost-based fairness in prediction, as we expand on below. 
The impact of data collection on discrimination has received comparatively little attention.

Fairness in prediction has been encouraged by adjusting models through regularization (Bechavod & Ligett, 2017; Kamishima et al., 2011), constraints (Kamiran et al., 2010; Zafar et al., 2017), and representation learning (Zemel et al., 2013). These attempts can be broadly categorized as model-based approaches to fairness. Others have applied data preprocessing to reduce discrimination (Hajian & Domingo-Ferrer, 2013; Feldman et al., 2015; Calmon et al., 2017). For an empirical comparison, see for example Friedler et al. (2018). Inevitably, however, restricting the model class or perturbing training data to improve fairness may harm predictive accuracy (Corbett-Davies et al., 2017).

A trade-off of predictive accuracy for fairness is sometimes difficult to motivate when predictions influence high-stakes decisions. In particular, post-hoc correction methods based on randomizing predictions (Hardt et al., 2016; Pleiss et al., 2017) are unjustifiable for ethical reasons in clinical tasks such as severity scoring. Moreover, as pointed out by Woodworth et al. (2017), post-hoc correction may lead to suboptimal predictive accuracy compared to other equally fair classifiers.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Disparate predictive accuracy can often be explained by insufficient or skewed sample sizes or inherent unpredictability of the outcome given the available set of variables. With this in mind, we propose that fairness of predictive models should be analyzed in terms of model bias, model variance, and outcome noise before they are constrained to satisfy fairness criteria. This exposes and separates the adverse impact of inadequate data collection and the choice of the model on fairness. 
The cost of fairness need not always be one of predictive accuracy, but one of investment in data collection and model development. In high-stakes applications, the benefits often outweigh the costs.

In this work, we use the term "discrimination" to refer to specific kinds of differences in the predictive power of models when applied to different protected groups. In some domains, such differences may not be considered discriminatory, and it is critical that decisions made based on this information are sensitive to this fact. For example, in prior work, researchers showed that causal inference may help uncover which sources of differences in predictive accuracy introduce unfairness (Kusner et al., 2017). In this work, we assume that observed differences are considered discriminatory and discuss various means of explaining and reducing them.

Main contributions  We give a procedure for analyzing discrimination in predictive models with respect to cost-based definitions of group fairness, emphasizing the impact of data collection. First, we propose the use of bias-variance-noise decompositions for separating sources of discrimination. Second, we suggest procedures for estimating the value of collecting additional training samples. Finally, we propose the use of clustering for identifying subpopulations that are discriminated against, to guide additional variable collection. We use these tools to analyze the fairness of common learning algorithms in three tasks: predicting income based on census data, predicting mortality of patients in critical care, and predicting book review ratings from text. We find that the accuracy in predictions of the mortality of cancer patients varies by as much as 20% between protected groups. In addition, our experiments confirm that the discrimination level is sensitive to the quality of the training data.

2 Background

We study fairness in prediction of an outcome Y ∈ 𝒴. 
Predictions are based on a set of covariates X ∈ 𝒳 ⊆ R^k and a protected attribute A ∈ 𝒜. In mortality prediction, X represents the medical history of a patient in critical care, A the self-reported ethnicity, and Y mortality. A model is considered fair if its errors are distributed similarly across protected groups, as measured by a cost function γ. Predictions learned from a training set d are denoted Ŷ_d := h(X, A) for some h : 𝒳 × 𝒜 → 𝒴 from a class H. The protected attribute is assumed to be binary, 𝒜 = {0, 1}, but our results generalize to the non-binary case. A dataset d = {(x_i, a_i, y_i)}_{i=1}^n consists of n samples distributed according to p(X, A, Y). When clear from context, we drop the subscript from Ŷ_d.

A popular cost-based definition of fairness is the equalized odds criterion, which states that a binary classifier Ŷ is fair if its false negative rates (FNR) and false positive rates (FPR) are equal across groups (Hardt et al., 2016). We define FPR and FNR with respect to protected group a ∈ 𝒜 by

FPR_a(Ŷ) := E_X[Ŷ | Y = 0, A = a],    FNR_a(Ŷ) := E_X[1 − Ŷ | Y = 1, A = a].

Exact equality, FPR_0(Ŷ) = FPR_1(Ŷ), is often hard to verify or enforce in practice. Instead, we study the degree to which such constraints are violated. More generally, we use differences in cost functions γ_a between protected groups a ∈ 𝒜 to define the level of discrimination Γ,

Γ_γ(Ŷ) := |γ_0(Ŷ) − γ_1(Ŷ)|.    (1)

In this work we study cost functions γ_a ∈ {FPR_a, FNR_a, ZO_a} in binary classification tasks, with ZO_a(Ŷ) := E_X[1[Ŷ ≠ Y] | A = a] the zero-one loss. In regression problems, we use the group-specific mean-squared error MSE_a := E_X[(Ŷ − Y)² | A = a]. 
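These group-wise costs and the discrimination level in (1) can be computed directly from held-out predictions; a minimal sketch in Python (the array names and the 0/1 group encoding are illustrative, not the paper's code):

```python
import numpy as np

def group_costs(y_true, y_pred, a, group):
    """Empirical zero-one loss, FPR, and FNR within protected group A = group."""
    m = a == group
    y, yh = y_true[m], y_pred[m]
    zo = np.mean(y != yh)            # ZO_a  = E[1[Yhat != Y] | A = a]
    fpr = np.mean(yh[y == 0])        # FPR_a = E[Yhat | Y = 0, A = a]
    fnr = np.mean(1 - yh[y == 1])    # FNR_a = E[1 - Yhat | Y = 1, A = a]
    return zo, fpr, fnr

def discrimination_level(y_true, y_pred, a, cost):
    """Gamma_gamma(Yhat) = |gamma_0 - gamma_1| for cost in {"zo", "fpr", "fnr"}."""
    i = {"zo": 0, "fpr": 1, "fnr": 2}[cost]
    return abs(group_costs(y_true, y_pred, a, 0)[i]
               - group_costs(y_true, y_pred, a, 1)[i])
```

In practice these quantities would be computed on a held-out test split, with uncertainty assessed by resampling as discussed later in the paper.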
According to (1), predictions Ŷ satisfy equalized odds on d if Γ_FPR(Ŷ) = 0 and Γ_FNR(Ŷ) = 0.

Calibration and impossibility  A score-based classifier is calibrated if the prediction score assigned to a unit equals the fraction of positive outcomes for all units assigned similar scores. It is impossible for a classifier to be calibrated in every protected group and satisfy multiple cost-based fairness criteria at once, unless accuracy is perfect or base rates of outcomes are equal across groups (Chouldechova, 2017). A relaxed version of this result (Kleinberg et al., 2016) applies to the discrimination level Γ.

Figure 1: Scenarios illustrating how properties of the training set and model choice affect perceived discrimination in a binary classification task, under the assumption that outcomes and predictions are unaware, i.e. p(Y | X, A) = p(Y | X) and p(Ŷ | X, A) = p(Ŷ | X). (a) For identically distributed protected groups and an unaware outcome, bias and noise are equal in expectation; perceived discrimination is only due to variance. (b) Heteroskedastic noise, i.e. ∃x, x′ : N(x) ≠ N(x′), may contribute to discrimination even for an optimal model if protected groups are not identically distributed. (c) One choice of model may be more suited for one protected group, even under negligible noise and variance, resulting in a difference in expected bias, B̄_0 ≠ B̄_1. Through bias-variance-noise decompositions (see Section 3.1), we can identify which of these dominate in their effect on fairness. We propose procedures for addressing each component in Section 4, and use them in experiments (see Section 5) to mitigate discrimination in income prediction and prediction of ICU mortality.

Inevitably, both constraint-based methods and our approach are faced with a choice of which fairness criteria to satisfy, and at what cost.

3 Sources of perceived discrimination

There are many potential sources of discrimination in predictive models. In particular, the choice of hypothesis class H and learning objective has received a lot of attention (Calders & Verwer, 2010; Zemel et al., 2013; Fish et al., 2016). However, data collection (the chosen set of predictive variables X, the sampling distribution p(X, A, Y), and the training set size n) is an equally integral part of deploying fair machine learning systems in practice, and it should be guided to promote fairness. Below, we tease apart sources of discrimination through bias-variance-noise decompositions of cost-based fairness criteria. In general, we may think of noise in the outcome as the effect of a set of unobserved variables U, potentially interacting with X. Even the optimal achievable error for predictions based on X may be reduced further by observing parts of U. In Figure 1, we illustrate three common learning scenarios and study their fairness properties through bias, variance, and noise.

To account for randomness in the sampling of training sets, we redefine the discrimination level (1) in terms of the expected cost γ̄_a(Ŷ) := E_D[γ_a(Ŷ_D)] over draws of a random training set D.

Definition 1. The expected discrimination level Γ(Ŷ) of a predictive model Ŷ learned from a random training set D is

Γ(Ŷ) := |E_D[γ_0(Ŷ_D) − γ_1(Ŷ_D)]| = |γ̄_0(Ŷ) − γ̄_1(Ŷ)|.

Γ(Ŷ) is not observed in practice when only a single training set d is available. 
If n is small, it is recommended to estimate Γ through re-sampling methods such as bootstrapping (Efron, 1992).

3.1 Bias-variance-noise decompositions of discrimination level

An algorithm that learns models Ŷ_D from datasets D is given, and the covariates X and size of the training data n are fixed. We assume that Ŷ_D is a deterministic function ŷ_D(x, a) given the training set D, e.g. a thresholded scoring function. Following Domingos (2000), we base our analysis on decompositions of loss functions L evaluated at points (x, a). For decompositions of costs γ_a ∈ {ZO, FPR, FNR} we let this be the zero-one loss, L(y, y′) = 1[y ≠ y′], and for γ_a = MSE, the squared loss, L(y, y′) = (y − y′)². We define the main prediction ỹ(x, a) = argmin_{y′} E_D[L(Ŷ_D, y′) | X = x, A = a] as the average prediction over draws of training sets for the squared loss, and the majority vote for the zero-one loss. The (Bayes) optimal prediction y*(x, a) = argmin_{y′} E_Y[L(Y, y′) | X = x, A = a] achieves the smallest expected error with respect to the random outcome Y.

Definition 2 (Bias, variance and noise). Following Domingos (2000), we define bias B, variance V and noise N at a point (x, a) below.

B(Ŷ, x, a) = L(y*(x, a), ỹ(x, a))
V(Ŷ, x, a) = E_D[L(ỹ(x, a), ŷ_D(x, a))]
N(x, a) = E_Y[L(y*(x, a), Y) | X = x, A = a]    (2)

Here, y*, ŷ and ỹ are all deterministic functions of (x, a), while Y is a random variable.

In words, the bias B is the loss incurred by the main prediction relative to the optimal prediction. 
The variance V is the average loss incurred by the predictions learned from different datasets relative to the main prediction. The noise N is the remaining loss independent of the learning algorithm, often known as the Bayes error. We use these definitions to decompose Γ under various definitions of γ_a.

Theorem 1. With γ_a the group-specific zero-one loss or class-conditional versions (e.g. FNR, FPR), or the mean squared error, γ̄_a and the discrimination level Γ admit decompositions of the form

γ̄_a(Ŷ) = N̄_a + B̄_a(Ŷ) + V̄_a(Ŷ)    (noise + bias + variance)

and

Γ = |(N̄_0 − N̄_1) + (B̄_0 − B̄_1) + (V̄_0 − V̄_1)|,

where we leave out Ŷ in the decomposition of Γ for brevity. With B, V defined as in (2), we have

B̄_a(Ŷ) = E_X[B(ỹ, X, a) | A = a]  and  V̄_a(Ŷ) = E_{X,D}[c_v(X, a) V(Ŷ_D, X, a) | A = a].

For the zero-one loss, c_v(x, a) = 1 if ỹ(x, a) = y*(x, a), otherwise c_v(x, a) = −1. For the squared loss, c_v(x, a) = 1. The noise term for population losses is

N̄_a := E_X[c_n(X, a) L(y*(X, a), Y) | A = a],

and for class-conditional losses w.r.t. class y ∈ {0, 1},

N̄_a(y) := E_X[c_n(X, a) L(y*(X, a), y) | A = a, Y = y].

For the zero-one loss and class-conditional variants, c_n(x, a) = 2 E_D[1[ŷ_D(x, a) = y*(x, a)]] − 1, and for the squared loss, c_n(x, a) = 1.

Proof sketch. Conditioning and exchanging the order of expectation, the cases of mean squared error and zero-one losses follow from Domingos (2000). Class-conditional losses follow from a case-by-case analysis of possible errors. See the supplementary material for a full proof.

Theorem 1 points to distinct sources of perceived discrimination. 
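The group-specific variance term above can be estimated empirically under the zero-one loss by retraining on bootstrap resamples and measuring disagreement with the main (majority-vote) prediction; the bias and noise terms require knowledge of y* and are not estimable this way. A sketch, where the generic fit(X, y) routine returning a predictor is an assumption for illustration, not the paper's code:

```python
import numpy as np

def bootstrap_bias_variance(fit, X, y, X_test, a_test, n_boot=30, seed=0):
    """Estimate the main (majority-vote) prediction and group-specific
    variance under the zero-one loss, via bootstrap retraining.
    `fit(X, y)` must return a callable predict(X) -> {0, 1} labels."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.empty((n_boot, len(X_test)), dtype=int)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # bootstrap resample
        preds[b] = fit(X[idx], y[idx])(X_test)
    main = (preds.mean(axis=0) >= 0.5).astype(int)  # majority vote
    per_point_v = (preds != main).mean(axis=0)      # disagreement rate = V(x, a)
    # average the per-point variance within each protected group
    return main, {int(g): float(per_point_v[a_test == g].mean())
                  for g in np.unique(a_test)}
```

A constant classifier, for instance, has zero estimated variance in every group, so any group gap it shows must come from bias or noise.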
Significant differences in bias, B̄_0 − B̄_1, indicate that the chosen model class is not flexible enough to fit both protected groups well (see Figure 1c). This is typical of (misspecified) linear models, which approximate non-linear functions well only in small regions of the input space. Regularization or post-hoc correction of models effectively increases the bias of one of the groups, and should be considered only if there is reason to believe that the original bias is already minimal.

Differences in variance, V̄_0 − V̄_1, could be caused by differences in sample sizes n_0, n_1 or group-conditional feature variance Var(X | A), combined with a high-capacity model. Targeted collection of training samples may help resolve this issue. Our decomposition does not apply to post-hoc randomization methods (Hardt et al., 2016), but we may treat these in the same way as we do random training sets and interpret them as increasing the variance V̄_a of one group to improve fairness.

When noise is significantly different between protected groups, discrimination is partially unrelated to model choice and training set size, and may only be reduced by measuring additional variables.

Proposition 1. If N̄_0 ≠ N̄_1, no model can be 0-discriminatory in expectation without access to additional information or increasing bias or variance w.r.t. the Bayes optimal classifier.

Proof. By definition, Γ = 0 ⟹ (N̄_1 − N̄_0) = (B̄_0 − B̄_1) + (V̄_0 − V̄_1). As the Bayes optimal classifier has neither bias nor variance, the result follows immediately.

In line with Proposition 1, most methods for ensuring algorithmic fairness reduce discrimination by trading off a difference in noise for one in bias or variance. However, this trade-off is only motivated if the considered predictive model is close to Bayes optimal and no additional predictive variables may be measured. 
Moreover, if noise is homoskedastic in regression settings, post-hoc randomization is ill-advised, as the difference in Bayes error N̄_0 − N̄_1 is zero, and discrimination is caused only by model bias or variance (see the supplementary material for a proof).

Estimating bias, variance and noise  Group-specific variance V̄_a may be estimated through sample splitting or bootstrapping (Efron, 1992). In contrast, the noise N̄_a and bias B̄_a are difficult to estimate when X is high-dimensional or continuous. In fact, no convergence results for noise estimates may be obtained without further assumptions on the data distribution (Antos et al., 1999). Under some such assumptions, noise may be approximately estimated using distance-based methods (Devijver & Kittler, 1982), nearest-neighbor methods (Fukunaga & Hummels, 1987; Cover & Hart, 1967), or classifier ensembles (Tumer & Ghosh, 1996). When comparing the discrimination levels of two different models, noise terms cancel, as they are independent of the model. As a result, differences in bias may be estimated even when the noise is not known (see the supplementary material).

Testing for significant discrimination  When sample sizes are small, perceived discrimination may not be statistically significant. In the supplementary material, we give statistical tests both for the discrimination level Γ(Ŷ) and for the difference in discrimination level between two models Ŷ, Ŷ′.

4 Reducing discrimination through data collection

In light of the decomposition of Theorem 1, we explore avenues for reducing group differences in bias, variance, and noise without sacrificing predictive accuracy. In practice, predictive accuracy is often artificially limited when data is expensive or impractical to collect. 
With an investment in training samples or measurement of predictive variables, both accuracy and fairness may be improved.

4.1 Increasing training set size

Standard regularization used to avoid overfitting is not guaranteed to improve or preserve fairness. An alternative route is to collect more training samples and reduce the impact of the bias-variance trade-off. When supplementary data is collected from the same distribution as the existing set, covariate shift may be avoided (Quionero-Candela et al., 2009). This is often achievable; labeled data may be expensive, such as when paying experts to label observations, but given the means to acquire additional labels, they would be drawn from the original distribution. To estimate the value of increasing the sample size, we predict the discrimination level Γ(Ŷ_D) as D increases in size.

The curve measuring generalization performance of predictive models as a function of training set size n is called a Type II learning curve (Domhan et al., 2015). We call γ_a(Ŷ, n) := E[γ_a(Ŷ_{D_n})], as a function of n, the learning curve with respect to protected group a. We define the discrimination learning curve Γ(Ŷ, n) := |γ_0(Ŷ, n) − γ_1(Ŷ, n)| (see Figure 2a for an example). Empirically, learning curves behave asymptotically as inverse power-law curves for diverse algorithms such as deep neural networks, support vector machines, and nearest-neighbor classifiers, even when model capacity is allowed to grow with n (Hestness et al., 2017; Mukherjee et al., 2003). This observation is also supported by theoretical results (Amari, 1993).

Assumption 1 (Learning curves). 
The population prediction loss γ(Ŷ, n) and the group-specific losses γ_0(Ŷ, n), γ_1(Ŷ, n), for a fixed learning algorithm Ŷ, behave asymptotically as inverse power-law curves with parameters (α, β, δ). That is, ∃M, M_0, M_1 such that for n ≥ M, n_a ≥ M_a,

γ(Ŷ, n) = αn^(−β) + δ  and  ∀a ∈ 𝒜 : γ_a(Ŷ, n_a) = α_a n_a^(−β_a) + δ_a.    (3)

The intercepts δ, δ_a in (3) represent the asymptotic bias B(Ŷ_{D_∞}) and the Bayes error N, with the former vanishing for consistent estimators. Accurately estimating δ from finite samples is often challenging, as the first term tends to dominate the learning curve for practical sample sizes.

In experiments, we find that inverse power-law models fit group-conditional (γ_a) and class-conditional (FPR, FNR) errors well, and use these to extrapolate Γ(Ŷ, n) based on estimates from subsampled data.

4.2 Measuring additional variables

When discrimination Γ is dominated by a difference in noise, N̄_0 − N̄_1, fairness may not be improved through model selection alone without sacrificing accuracy (see Proposition 1). Such a scenario is likely when the available covariates are not equally predictive of the outcome in both groups. We propose identification of clusters of individuals in which discrimination is high as a means to guide further variable collection: if the variance in outcomes within a cluster is not explained by the available feature set, additional variables may be used to further distinguish its members.

Let a random variable C represent a (possibly stochastic) clustering such that C = c indicates membership in cluster c. Then let ρ_a(c) denote the expected prediction cost for units in cluster c with protected attribute a. 
As an example, for the zero-one loss we let

ρ_a^ZO(c) := E_X[1[Ŷ ≠ Y] | A = a, C = c],

and define ρ analogously for false positives or false negatives. Clusters c for which |ρ_0(c) − ρ_1(c)| is large identify groups of individuals for whom discrimination is worse than average, and can guide targeted collection of additional variables or samples. In our experiments on income prediction, we consider particularly simple clusterings of data, defined by subjects with measurements above or below the average value of a single feature x^(c) with c ∈ {1, ..., k}. In mortality prediction, we cluster patients using topic modeling. As measuring additional variables is expensive, the utility of a candidate set should be estimated before collecting a large sample (Koepke & Bilenko, 2012).

5 Experiments

We analyze the fairness properties of standard machine learning algorithms in three tasks: prediction of income based on national census data, prediction of patient mortality based on clinical notes, and prediction of book review ratings based on review text.¹ We disentangle sources of discrimination by assessing the level of discrimination for the full data, estimating the value of increasing training set size by fitting Type II learning curves, and using clustering to identify subgroups where discrimination is high. In addition, we estimate the Bayes error through non-parametric techniques.

In our experiments, we omit the sensitive attribute A from our classifiers to allow for closer comparison to previous works, e.g. Hardt et al. (2016); Zafar et al. (2017). In preliminary results, we found that fitting separate classifiers for each group increased the error rates of both groups due to the resulting smaller sample size, as classifiers could not learn from other groups. 
As our model objective is to maximize accuracy over all data points, our analysis uses a single classifier trained on the entire population.

5.1 Income prediction

Predictions of a person's salary may be used to help determine an individual's market worth, but systematic underestimation of the salary of protected groups could harm their competitiveness on the job market. The Adult dataset in the UCI Machine Learning Repository (Lichman, 2013) contains 32,561 observations of yearly income (represented as a binary outcome: over or under $50,000) and twelve categorical or continuous features, including education, age, and marital status. Categorical attributes are dichotomized, resulting in a total of 105 features.

We follow Pleiss et al. (2017) and strive to ensure fairness across genders; gender is excluded as a feature from the predictive models. Using an 80/20 train-test split, we learn a random forest predictor, which is well-calibrated for both groups (Brier (1950) scores of 0.13 and 0.06 for men and women). We find that the difference in zero-one loss Γ_ZO(Ŷ) has a 95%-confidence interval² of .085 ± .069 with decision thresholds at 0.5. 
At this threshold, the false negative rates are 0.388 ± 0.026 and 0.448 ± 0.064 for men and women respectively, and the false positive rates 0.111 ± 0.011 and 0.033 ± 0.008.

¹A synthetic experiment validating group-specific learning curves is left to the supplementary material.
²Details for computing statistically significant discrimination can be found in the supplementary material.

Figure 2: Discrimination level and noise estimation in income prediction with the Adult dataset. (a) Group differences in false positive rates and false negative rates for a random forest classifier decrease with increasing training set size. (b) Estimation of Bayes error lower and upper bounds (E_low and E_up) for the zero-one loss of men and women; intervals for men and women are non-overlapping for Nearest Neighbors.

  Method                                    Group    E_low    E_up
  Mahalanobis (Mahalanobis, 1936)           men      –        0.29
                                            women    –        0.13
  Bhattacharyya (Bhattacharyya, 1943)       men      0.001    0.040
                                            women    0.001    0.027
  Nearest Neighbors (Cover & Hart, 1967)    men      0.10     0.19
                                            women    0.04     0.07

We focus on random forest classifiers, although we found similar results for logistic regression and decision trees.

We examine the effect of varying training set size n on discrimination. We fit inverse power-law curves to estimates of FPR(Ŷ, n) and FNR(Ŷ, n) using repeated sample splitting, where at least 20% of the full data is held out for evaluating generalization error at every value of n. We tune hyperparameters for each training set size for decision tree classifiers and logistic regression, but tune over the entire dataset for random forests. We include full training details in the supplementary material. Metrics are averaged over 50 trials. See Figure 2a for the results for random forests. Both FPR and FNR decrease with additional training samples. 
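The power-law fitting and extrapolation can be reproduced with a least-squares fit of the curve in Assumption 1; a sketch using scipy (the function names are illustrative, and the paper's exact fitting protocol may differ):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_learning_curve(ns, losses):
    """Fit gamma(n) = alpha * n**(-beta) + delta (inverse power law, Eq. 3)."""
    def curve(n, alpha, beta, delta):
        return alpha * n ** (-beta) + delta
    # crude but serviceable initialization: spread, mid-range exponent, asymptote
    p0 = (losses[0] - losses[-1], 0.5, losses[-1])
    (alpha, beta, delta), _ = curve_fit(curve, ns, losses, p0=p0, maxfev=10000)
    return alpha, beta, delta

def extrapolate_gap(ns, loss_a0, loss_a1, n_new):
    """Predict Gamma(Yhat, n_new) = |gamma_0(n_new) - gamma_1(n_new)| from
    group-specific learning-curve fits on subsampled-data error estimates."""
    a0, b0, d0 = fit_learning_curve(ns, loss_a0)
    a1, b1, d1 = fit_learning_curve(ns, loss_a1)
    return abs((a0 * n_new ** (-b0) + d0) - (a1 * n_new ** (-b1) + d1))
```

The asymptotic gap |δ_0 − δ_1| then estimates the discrimination remaining with unlimited data from the same distribution.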
The discrimination level Γ_FNR for false negatives decreases by a striking 40% when the training set size is increased from 1,000 to 10,000. This suggests that trading off accuracy for fairness at small sample sizes may be ill-advised. Based on fitted power-law curves, we estimate that for unlimited training data drawn from the same distribution, we would have Γ_FNR(Ŷ) ≈ 0.04 and Γ_FPR(Ŷ) ≈ 0.08.

In Figure 2b, we compare estimated upper and lower bounds on noise (E_low and E_up) for men and women using the Mahalanobis and Bhattacharyya distances (Devijver & Kittler, 1982), and a k-nearest-neighbor method (Cover & Hart, 1967) with k = 5 and 5-fold cross-validation. Men have consistently higher noise estimates than women, which is consistent with the differences in zero-one loss found using all models. For the nearest-neighbor estimates, the intervals for men and women are non-overlapping, which suggests that noise may contribute substantially to discrimination.

To guide attempts at reducing discrimination further, we identify clusters of individuals for whom false negative predictions are made at different rates between protected groups, with the method described in Section 4.2. We find that for individuals in executive or managerial occupations (12% of the sample), false negatives are more than twice as frequent for women (0.412) as for men (0.157). For individuals in all other occupations, the difference is significantly smaller, 0.543 for women and 0.461 for men, despite the fact that the disparity in outcome base rates in this cluster is large (0.26 for men versus 0.09 for women). A possible reason is that in managerial occupations the available variable set explains a larger portion of the variance in salary for men than for women. 
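The occupation-based audit above is an instance of the cluster procedure from Section 4.2; ranking clusters by the group gap |ρ_0(c) − ρ_1(c)| can be sketched as follows (binary groups and hard cluster assignments are assumed for simplicity):

```python
import numpy as np

def cluster_error_gaps(y_true, y_pred, a, c):
    """For each cluster k, compute rho_a(k) = P(Yhat != Y | A = a, C = k)
    and return clusters sorted by the gap |rho_0(k) - rho_1(k)|."""
    err = y_true != y_pred
    gaps = {}
    for k in np.unique(c):
        m0 = (c == k) & (a == 0)
        m1 = (c == k) & (a == 1)
        if m0.any() and m1.any():            # skip clusters missing a group
            gaps[int(k)] = abs(err[m0].mean() - err[m1].mean())
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)
```

Clusters at the top of the ranking are candidates for targeted collection of additional variables or samples.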
If so, further sub-categorization of managerial occupations could help reduce discrimination in prediction.

5.2 Intensive care unit mortality prediction

Unstructured medical data such as clinical notes can reveal insights for questions like mortality prediction; however, disparities in predictive accuracy may result in discrimination of protected groups. Using the MIMIC-III dataset of all clinical notes from 25,879 adult patients from Beth Israel Deaconess Medical Center (Johnson et al., 2016), we predict hospital mortality of patients in critical care. Fairness is studied with respect to five self-reported ethnic groups of the following proportions: Asian (2.2%), Black (8.8%), Hispanic (3.4%), White (70.8%), and Other (14.8%). Notes were collected in the first 48 hours of an intensive care unit (ICU) stay; discharge notes were excluded. We only included patients who stayed in the ICU for more than 48 hours. We use the tf-idf statistics of the 10,000 most frequent words as features.

Figure 3: Mortality prediction from clinical notes using logistic regression. Best viewed in color. (a) Using Tukey's range test, we can find the 95%-significance level for the zero-one loss for each group over 5-fold cross-validation. (b) As training set size increases, zero-one loss over 50 trials decreases over all groups and appears to converge to an asymptote. (c) Topic modeling reveals subpopulations with high differences in zero-one loss, for example cancer patients and cardiac patients.

Training a model on 50% of the data, selecting hyper-parameters on 25%, and testing on 25%, we find that logistic regression with L1-regularization achieves an AUC of 0.81. 
The logistic regression is well-calibrated, with Brier scores ranging from 0.06 to 0.11 across the five groups; we note that better calibration is correlated with lower prediction error. We report cost and discrimination level in terms of generalized zero-one loss (Pleiss et al., 2017). Using an ANOVA test (Fisher, 1925) with p < 0.001, we reject the null hypothesis that the loss is the same among all five groups. To obtain 95% confidence intervals, we perform pairwise comparisons of means using Tukey's range test (Tukey, 1949) across 5-fold cross-validation. As seen in Figure 3a, patients in the Other and Hispanic groups have the highest and lowest generalized zero-one loss, respectively, with relatively few overlapping intervals. Notably, the largest ethnic group (White) does not have the best accuracy, whereas smaller ethnic groups tend towards extremes. While racial groups differ in hospital mortality base rates (Table 1 in the Supplementary material), Hispanic (10.3%) and Black (10.9%) patients have very different error rates despite similar base rates.

To better understand the discrimination induced by our model, we explore the effect of changing training set size. To this end, we repeatedly subsample and split the data, holding out at least 20% of the full data for testing. In Figure 3b, we show loss averaged over 50 trials of training a logistic regression on increasingly larger training sets; estimated inverse power-law curves show good fits. We see that some pairwise differences in loss decrease with additional training data.

Next, we identify clusters for which the difference in prediction errors between protected groups is large. We learn a topic model with k = 50 topics generated using Latent Dirichlet Allocation (Blei et al., 2003). Topics are concatenated into an n × k matrix Q where q_ic designates the proportion of topic c ∈ [k] in note i ∈ [n].
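Constructing the topic-proportion matrix Q, and the topic-weighted per-group error rate it supports, can be sketched as follows (a minimal illustration with our own function names; the paper's preprocessing details may differ):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topic_proportions(notes, k=50, seed=0):
    """n x k matrix Q of per-note topic proportions; row i holds q_ic."""
    counts = CountVectorizer().fit_transform(notes)
    lda = LatentDirichletAllocation(n_components=k, random_state=seed)
    return lda.fit_transform(counts)  # rows are normalized topic distributions

def topic_group_error(y, y_hat, a, Q, group, topic):
    """Topic-weighted error rate, i.e. the estimate of
    p(Y_hat != Y | A = group, C = topic): errors of the group's patients,
    weighted by how much of each note belongs to the topic."""
    y, y_hat, a = map(np.asarray, (y, y_hat, a))
    w = Q[:, topic] * (a == group)
    return float(((y != y_hat) * w).sum() / w.sum())
```

Comparing `topic_group_error` across protected groups for a fixed topic surfaces subpopulations (e.g. cancer or cardiac patients) where one group is served much worse than another.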
Following prior work on enrichment of topics in clinical notes (Marlin et al., 2012; Ghassemi et al., 2014), we estimate the probability of patient mortality Y given a topic c as

p̂(Y | C = c) := (∑_{i=1}^n y_i q_ic) / (∑_{i=1}^n q_ic),

where y_i is the hospital mortality of patient i. We compare relative error rates given protected group and topic using the binary predicted mortality ŷ_i, actual mortality y_i, and group a_i for patient i through

p̂(Ŷ ≠ Y | A = a′, C = c) = (∑_{i=1}^n 1(y_i ≠ ŷ_i) 1(a_i = a′) q_ic) / (∑_{i=1}^n 1(a_i = a′) q_ic),

which follows using substitution and conditioning on A. These error rates were computed using a logistic regression with L1 regularization over 50 trials of an 80/20 train-test split. While many topics have consistent error rates across groups, some topics (e.g., cardiac patients or cancer patients, as shown in Figure 3c) have large differences in error rates across groups. We include more detailed topic descriptions in the supplementary material. Once we have identified a subpopulation with particularly high error, for example cancer patients, we can consider collecting more features or collecting more data from the same data distribution. We find that error rates differ between 0.12 and 0.30 across protected groups of cancer patients, and between 0.05 and 0.20 for cardiac patients.

5.3 Book review ratings

In the supplementary material, we study prediction of book review ratings from review texts (Gnanesh, 2017). The protected attribute was chosen to be the gender of the author as determined from Wikipedia.
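A confidence interval on a per-group MSE gap can be obtained, for instance, with a stratified nonparametric bootstrap (Efron, 1992). The sketch below, with names of our choosing, is one such estimator and not necessarily the one used in the supplement:

```python
import numpy as np

def mse_gap_ci(y, y_hat, a, group_a, group_b, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and bootstrap CI for Gamma_MSE = MSE(group_b) - MSE(group_a).

    Squared errors are resampled with replacement within each group
    (a stratified bootstrap), so every replicate contains both groups.
    """
    y, y_hat, a = map(np.asarray, (y, y_hat, a))
    err = (y - y_hat) ** 2
    ea, eb = err[a == group_a], err[a == group_b]
    rng = np.random.default_rng(seed)
    stats = [rng.choice(eb, eb.size).mean() - rng.choice(ea, ea.size).mean()
             for _ in range(n_boot)]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return eb.mean() - ea.mean(), (lo, hi)
```

If the interval excludes zero, the cost gap between the groups is unlikely to be a sampling artifact of the test set.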
In the dataset, the difference in mean-squared error ΓMSE(Ŷ) has 95%-confidence interval 0.136 ± 0.048, with MSE_M = 0.224 for reviews of male authors and MSE_F = 0.358 for reviews of female authors. Strikingly, our findings suggest that ΓMSE(Ŷ) may be completely eliminated by additional targeted sampling of the less represented gender.

6 Discussion

We find that existing approaches for reducing discrimination induced by prediction errors may be unethical or impractical to apply in settings where predictive accuracy is critical, such as in healthcare or criminal justice. As an alternative, we propose a procedure for analyzing the different sources contributing to discrimination. Decomposing well-known definitions of cost-based fairness criteria in terms of differences in bias, variance, and noise, we suggest methods for reducing each term through model choice or additional training data collection. Case studies on three real-world datasets confirm that collection of additional samples is often sufficient to improve fairness, and that existing post-hoc methods for reducing discrimination may unnecessarily sacrifice predictive accuracy when other solutions are available.

Looking forward, we see several avenues for future research. In this work, we argue that identifying clusters or subpopulations with high predictive disparity allows for more targeted ways to reduce discrimination. We encourage future research to dig deeper into the question of local or context-specific unfairness in general, and into algorithms for addressing it. Additionally, extending our analysis to intersectional fairness (Buolamwini & Gebru, 2018; Hébert-Johnson et al., 2017), e.g., considering both gender and race or all subdivisions, would provide a more nuanced treatment of unfairness.
Finally, additional data collection aimed at improving the model may cause unexpected delayed impacts (Liu et al., 2018) and negative feedback loops (Ensign et al., 2017) as a result of distributional shifts in the data. More broadly, we believe that the study of fairness in non-stationary populations is an interesting direction to pursue.

Acknowledgements

The authors would like to thank Yoni Halpern and Hunter Lang for helpful comments, and Zeshan Hussain for clinical guidance. This work was partially supported by Office of Naval Research Award No. N00014-17-1-2791 and NSF CAREER award #1350965.

References

Amari, Shun-Ichi. A universal theorem on learning curves. Neural Networks, 6(2):161–166, 1993.

Angwin, Julia, Larson, Jeff, Mattu, Surya, and Kirchner, Lauren. Machine bias. ProPublica, May 23, 2016.

Antos, András, Devroye, Luc, and Györfi, László. Lower bounds for Bayes error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(7):643–645, 1999.

Barocas, Solon and Selbst, Andrew D. Big data's disparate impact. Cal. L. Rev., 104:671, 2016.

Bechavod, Yahav and Ligett, Katrina. Learning fair classifiers: A regularization-inspired approach. arXiv preprint arXiv:1707.00044, 2017.

Bhattacharyya, Anil. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc., 35:99–109, 1943.

Blei, David M, Ng, Andrew Y, and Jordan, Michael I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

Brier, Glenn W. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.

Buolamwini, Joy and Gebru, Timnit. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pp.
77\u201391,\n2018.\n\nCalders, Toon and Verwer, Sicco. Three naive bayes approaches for discrimination-free classi\ufb01cation.\n\nData Mining and Knowledge Discovery, 21(2):277\u2013292, 2010.\n\nCalmon, Flavio, Wei, Dennis, Vinzamuri, Bhanukiran, Ramamurthy, Karthikeyan Natesan, and\nVarshney, Kush R. Optimized pre-processing for discrimination prevention. In Advances in Neural\nInformation Processing Systems, pp. 3995\u20134004, 2017.\n\nChouldechova, Alexandra. Fair prediction with disparate impact: A study of bias in recidivism\n\nprediction instruments. arXiv preprint arXiv:1703.00056, 2017.\n\nCorbett-Davies, Sam, Pierson, Emma, Feller, Avi, Goel, Sharad, and Huq, Aziz. Algorithmic\ndecision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International\nConference on Knowledge Discovery and Data Mining, pp. 797\u2013806. ACM, 2017.\n\nCover, Thomas and Hart, Peter. Nearest neighbor pattern classi\ufb01cation. IEEE transactions on\n\ninformation theory, 13(1):21\u201327, 1967.\n\nDevijver, Pierre A. and Kittler, Josef. Pattern recognition: a statistical approach. Sung Kang, 1982.\n\nDomhan, Tobias, Springenberg, Jost Tobias, and Hutter, Frank. Speeding up automatic hyperparame-\nter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth\nInternational Joint Conference on Arti\ufb01cial Intelligence, 2015.\n\nDomingos, Pedro. A uni\ufb01ed bias-variance decomposition. In Proceedings of 17th International\n\nConference on Machine Learning, pp. 231\u2013238, 2000.\n\nDwork, Cynthia, Hardt, Moritz, Pitassi, Toniann, Reingold, Omer, and Zemel, Richard. Fairness\nIn Proceedings of the 3rd Innovations in Theoretical Computer Science\n\nthrough awareness.\nConference, pp. 214\u2013226. ACM, 2012.\n\nEfron, Bradley. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics, pp.\n\n569\u2013593. 
Springer, 1992.

Ensign, Danielle, Friedler, Sorelle A., Neville, Scott, Scheidegger, Carlos Eduardo, and Venkatasubramanian, Suresh. Runaway feedback loops in predictive policing. CoRR, abs/1706.09847, 2017. URL http://arxiv.org/abs/1706.09847.

Feldman, Michael, Friedler, Sorelle A, Moeller, John, Scheidegger, Carlos, and Venkatasubramanian, Suresh. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 259–268. ACM, 2015.

Fish, Benjamin, Kun, Jeremy, and Lelkes, Ádám D. A confidence-based approach for balancing fairness and accuracy. In Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 144–152. SIAM, 2016.

Fisher, R.A. Statistical Methods for Research Workers. Oliver & Boyd, Edinburgh, 1925.

Friedler, Sorelle A, Scheidegger, Carlos, Venkatasubramanian, Suresh, Choudhary, Sonam, Hamilton, Evan P, and Roth, Derek. A comparative study of fairness-enhancing interventions in machine learning. arXiv preprint arXiv:1802.04422, 2018.

Fukunaga, Keinosuke and Hummels, Donald M. Bayes error estimation using Parzen and k-NN procedures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(5):634–643, 1987.

Ghassemi, Marzyeh, Naumann, Tristan, Doshi-Velez, Finale, Brimmer, Nicole, Joshi, Rohit, Rumshisky, Anna, and Szolovits, Peter. Unfolding physiological state: Mortality modelling in intensive care units. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 75–84. ACM, 2014.

Gnanesh. Goodreads book reviews, 2017. URL https://www.kaggle.com/gnanesh/goodreads-book-reviews.

Hajian, Sara and Domingo-Ferrer, Josep. A methodology for direct and indirect discrimination prevention in data mining.
IEEE Transactions on Knowledge and Data Engineering, 25(7):1445–1459, 2013.

Hardt, Moritz, Price, Eric, Srebro, Nati, et al. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pp. 3315–3323, 2016.

Hébert-Johnson, Ursula, Kim, Michael P, Reingold, Omer, and Rothblum, Guy N. Calibration for the (computationally-identifiable) masses. arXiv preprint arXiv:1711.08513, 2017.

Hestness, Joel, Narang, Sharan, Ardalani, Newsha, Diamos, Gregory, Jun, Heewoo, Kianinejad, Hassan, Patwary, Md Mostofa Ali, Yang, Yang, and Zhou, Yanqi. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.

Johnson, Alistair EW, Pollard, Tom J, Shen, Lu, Lehman, Li-wei H, Feng, Mengling, Ghassemi, Mohammad, Moody, Benjamin, Szolovits, Peter, Celi, Leo Anthony, and Mark, Roger G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 2016.

Kamiran, Faisal, Calders, Toon, and Pechenizkiy, Mykola. Discrimination aware decision tree learning. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pp. 869–874. IEEE, 2010.

Kamishima, Toshihiro, Akaho, Shotaro, and Sakuma, Jun. Fairness-aware learning through regularization approach. In Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, pp. 643–650. IEEE, 2011.

Kleinberg, Jon, Mullainathan, Sendhil, and Raghavan, Manish. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.

Koepke, Hoyt and Bilenko, Mikhail. Fast prediction of new feature utility. arXiv preprint arXiv:1206.4680, 2012.

Kusner, Matt J, Loftus, Joshua, Russell, Chris, and Silva, Ricardo. Counterfactual fairness. In Advances in Neural Information Processing Systems, pp. 4069–4079, 2017.

Lichman, M. UCI machine learning repository, 2013.
URL http://archive.ics.uci.edu/ml.

Liu, Lydia T, Dean, Sarah, Rolf, Esther, Simchowitz, Max, and Hardt, Moritz. Delayed impact of fair machine learning. arXiv preprint arXiv:1803.04383, 2018.

Mahalanobis, Prasanta Chandra. On the generalized distance in statistics. National Institute of Science of India, 1936.

Marlin, Benjamin M, Kale, David C, Khemani, Robinder G, and Wetzel, Randall C. Unsupervised pattern discovery in electronic health care data using probabilistic clustering models. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, pp. 389–398. ACM, 2012.

Mukherjee, Sayan, Tamayo, Pablo, Rogers, Simon, Rifkin, Ryan, Engle, Anna, Campbell, Colin, Golub, Todd R, and Mesirov, Jill P. Estimating dataset size requirements for classifying DNA microarray data. Journal of Computational Biology, 10(2):119–142, 2003.

Pleiss, Geoff, Raghavan, Manish, Wu, Felix, Kleinberg, Jon, and Weinberger, Kilian Q. On fairness and calibration. In Advances in Neural Information Processing Systems, pp. 5684–5693, 2017.

Quionero-Candela, Joaquin, Sugiyama, Masashi, Schwaighofer, Anton, and Lawrence, Neil D. Dataset Shift in Machine Learning. The MIT Press, 2009.

Ruggieri, Salvatore, Pedreschi, Dino, and Turini, Franco. Data mining for discrimination discovery. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(2):9, 2010.

Tukey, John W. Comparing individual means in the analysis of variance. Biometrics, pp. 99–114, 1949.

Tumer, Kagan and Ghosh, Joydeep. Estimating the Bayes error rate through classifier combining. In Pattern Recognition, 1996, Proceedings of the 13th International Conference on, volume 2, pp. 695–699. IEEE, 1996.

Woodworth, Blake, Gunasekar, Suriya, Ohannessian, Mesrob I, and Srebro, Nathan. Learning non-discriminatory predictors.
In Conference on Learning Theory, 2017.

Zafar, Muhammad Bilal, Valera, Isabel, Gomez Rodriguez, Manuel, and Gummadi, Krishna P. Fairness constraints: Mechanisms for fair classification. arXiv preprint arXiv:1507.05259, 2017.

Zemel, Richard S, Wu, Yu, Swersky, Kevin, Pitassi, Toniann, and Dwork, Cynthia. Learning fair representations. ICML (3), 28:325–333, 2013.