{"title": "Offline Contextual Bandits with High Probability Fairness Guarantees", "book": "Advances in Neural Information Processing Systems", "page_first": 14922, "page_last": 14933, "abstract": "We present RobinHood, an of\ufb02ine contextual bandit algorithm designed to satisfy a broad family of fairness constraints. Our algorithm accepts multiple fairness de\ufb01nitions and allows users to construct their own unique fairness de\ufb01nitions for the problem at hand. We provide a theoretical analysis of RobinHood, which includes a proof that it will not return an unfair solution with probability greater than a user-speci\ufb01ed threshold. We validate our algorithm on three applications: a tutoring system in which we conduct a user study and consider multiple unique fairness de\ufb01nitions; a loan approval setting (using the Statlog German credit data set) in which well-known fairness de\ufb01nitions are applied; and criminal recidivism (using data released by ProPublica). In each setting, our algorithm is able to produce fair policies that achieve performance competitive with other of\ufb02ine and online contextual bandit algorithms.", "full_text": "Of\ufb02ine Contextual Bandits with\n\nHigh Probability Fairness Guarantees\n\nBlossom Metevier1 Stephen Giguere1 Sarah Brockman1 Ari Kobren1\n\nPhilip S. Thomas1\n2Computer Science Department\n\nStanford University\n\nYuriy Brun1\n\nEmma Brunskill2\n\n1College of Information and Computer Sciences\n\nUniversity of Massachusetts Amherst\n\nAbstract\n\nWe present RobinHood, an of\ufb02ine contextual bandit algorithm designed to satisfy\na broad family of fairness constraints. Our algorithm accepts multiple fairness\nde\ufb01nitions and allows users to construct their own unique fairness de\ufb01nitions for the\nproblem at hand. We provide a theoretical analysis of RobinHood, which includes\na proof that it will not return an unfair solution with probability greater than a\nuser-speci\ufb01ed threshold. 
We validate our algorithm on three applications: a user\nstudy with an automated tutoring system, a loan approval setting using the Statlog\nGerman credit data set, and a criminal recidivism problem using data released\nby ProPublica. To demonstrate the versatility of our approach, we use multiple\nwell-known and custom de\ufb01nitions of fairness. In each setting, our algorithm is\nable to produce fair policies that achieve performance competitive with other of\ufb02ine\nand online contextual bandit algorithms.\n\n1\n\nIntroduction\n\nMachine learning (ML) algorithms are increasingly being used in high-risk decision making settings,\nsuch as \ufb01nancial loan approval [7], hiring [37], medical diagnostics [12], and criminal sentencing [3].\nThese algorithms are capable of unfair behavior, and when used to guide policy and practice, can cause\nsigni\ufb01cant harm. This is not merely hypothetical: ML algorithms that in\ufb02uence criminal sentencing\nand credit risk assessment have already exhibited racially-biased behavior [3; 4]. Prevention of unfair\nbehavior by these algorithms remains an open and challenging problem [14; 20; 24]. In this paper,\nwe address issues of unfairness in the of\ufb02ine contextual bandit setting, providing a new algorithm,\ndesigned using the recently proposed Seldonian framework [47] and called RobinHood, which is\ncapable of satisfying multiple fairness de\ufb01nitions with high probability.\nEnsuring fairness in the bandit setting is an understudied problem. While extensive research has\nbeen devoted to studying fairness in classi\ufb01cation, recent work has shown that the decisions made by\nfair ML algorithms can affect the well-being of a population over time [34]. For example, criminal\nsentencing practices affect criminal recidivism rates, and loan approval strategies can change the\namount of wealth and debt in a population. 
This delayed impact indicates that the feedback used for training these algorithms or defining fairness is more evaluative in nature, i.e., training samples quantify the (delayed) outcome of taking a particular action given a particular context. Therefore, it is important that fairness can be ensured for methods that are designed to handle evaluative feedback, such as contextual bandits. For example, instead of predicting the likelihood of violent recidivism, these methods can consider what actions to take to minimize violent recidivism.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Within the bandit setting, prior work has mostly focused on ensuring fairness in the online setting [24; 25; 27; 35], in which an agent learns the quality of different actions by interacting with the system of interest. However, for many fairness applications, e.g., medical treatment suggestion [29], the online setting is not feasible, as direct interaction might be too costly, risky, or otherwise unrealistic. Instead, these problems can be framed in the offline bandit setting, where a finite amount of data is collected from the system over time and then used by the agent to construct a fair solution.

Issues of ensuring fairness in the offline contextual bandit setting are similar to those in other ML settings. For instance, contextual bandit algorithms need to manage the trade-off between performance optimization and fairness. Ideally, these algorithms should also be capable of handling a large set of user-defined fairness criteria, as no single definition of fairness is appropriate for all problems [16]. Importantly, when it is not possible to return a fair solution, i.e., when fairness criteria are in conflict, or when too little data is present, the algorithm should indicate this to the user.
We allow our algorithm,\nRobinHood, to return No Solution Found (NSF) in cases like this, and show that if a fair solution does\nexist, the probability RobinHood returns NSF goes to zero as the amount of available data increases.\nIn summary, we present the \ufb01rst Seldonian algorithm for contextual bandits. Our contributions\nare: 1) we provide an of\ufb02ine contextual bandit algorithm, called RobinHood, that allows users to\nmathematically specify their own notions of fairness, including combinations of fairness de\ufb01nitions\nalready proposed by the ML community, and novel ones that may be unique to the application of\ninterest; 2) we prove that RobinHood is (quasi-)Seldonian: that it is guaranteed to satisfy the fairness\nconstraints de\ufb01ned by the user with high probability, 3) we prove that if a fair solution exists, as\nmore data is provided to RobinHood, the probability it returns NSF goes to zero; and 4) we evaluate\nRobinHood on three applications: a tutoring system in which we conduct a user study and consider\nmultiple, unique fairness de\ufb01nitions, a loan approval setting in which well-known fairness de\ufb01nitions\nare applied, and criminal recidivism. Our work complements fairness literature for classi\ufb01cation\n(e.g., [1; 8; 14; 50]), contextual bandits (e.g., [24; 25; 35; 14; 27; 22]), and reinforcement learning\n(e.g., [40; 49]), as described in Section 7.\n\n2 Contextual Bandits and Of\ufb02ine Learning\n\nThis section de\ufb01nes a contextual bandit, or agent, which iteratively interacts with a system. At each\niteration \u03b9 \u2208 {1, 2, ...}, the agent is given a context, represented as a random variable X\u03b9 \u2208 R\u03bd for\nsome \u03bd. We assume that the contexts during different iterations are independent and identically\ndistributed (i.i.d.) samples from some distribution, dX. Let A be the \ufb01nite set of possible actions that\nthe agent can select. 
The agent's policy, π : Rν × A → [0, 1], characterizes how the agent selects actions given the current context: π(x, a) = Pr(Aι = a | Xι = x). Once the agent has chosen an action, Aι, based on context Xι, it receives a stochastic real-valued reward, Rι. We assume that the conditional distribution of Rι given Xι and Aι is given by dR, i.e., Rι ∼ dR(Xι, Aι, ·). The agent's goal is to choose actions so as to maximize the expected reward it receives.

For concreteness, we use a loan approval problem as our running example. For each loan applicant (iteration ι), a single action, i.e., whether or not the applicant should be given a loan, is chosen. The resulting reward is a binary value: 1 if the loan is repaid and −1 otherwise. The agent's goal using this reward function is to maximize the expected number of loan repayments.

This paper focuses on enforcing fairness in the offline setting, where the agent only has access to a finite amount of logged data, D, collected from one or more different policies. Policies used to collect logged data are known as behavior policies. For simplicity of notation, we consider a single behavior policy, πb. D consists of the observed contexts, actions, and rewards: D = {(Xι, Aι, Rι)}_{ι=1}^m, where m is the number of iterations for which D was collected, and where Aι ∼ πb(Xι, ·).

The goal in the offline setting is to find a policy, π′, that maximizes r(π′) := E[Rι | Aι ∼ π′(Xι, ·)] using samples from D only. Algorithms that solve this problem are called offline (or batch) contextual bandit algorithms.
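The data-collection process just described can be sketched as follows; the two-action loan domain and the uniform-random behavior policy are illustrative assumptions, not the paper's experimental setup:

```python
import random

def collect_logged_data(m, behavior_policy, sample_context, sample_reward):
    """Collect logged data D = [(X_i, A_i, R_i), ...] by running the
    behavior policy pi_b for m iterations."""
    D = []
    for _ in range(m):
        x = sample_context()      # X_i ~ d_X
        a = behavior_policy(x)    # A_i ~ pi_b(X_i, .)
        r = sample_reward(x, a)   # R_i ~ d_R(X_i, A_i, .)
        D.append((x, a, r))
    return D

# Illustrative loan domain: the context is a repayment probability in [0, 1];
# action 1 approves the loan (reward +1 if repaid, -1 otherwise), action 0 denies it.
random.seed(0)
sample_context = lambda: random.random()
behavior_policy = lambda x: random.randint(0, 1)  # uniform-random pi_b
sample_reward = lambda x, a: 0.0 if a == 0 else (1.0 if random.random() < x else -1.0)

D = collect_logged_data(1000, behavior_policy, sample_context, sample_reward)
```

An offline algorithm receives only `D`; it never interacts with `sample_reward` directly.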
At the core of many offline contextual bandit algorithms is an off-policy estimator, r̂, which takes as inputs D and a policy to be evaluated, πe, in order to produce an estimate, r̂(πe, D), of r(πe). We call r̂(πe, D) the off-policy reward of πe.

3 Problem Statement

Following the Seldonian framework for designing ML algorithms [47], our goal is to develop a fair offline contextual bandit algorithm that satisfies three conditions: 1) the algorithm accepts multiple user-defined fairness definitions, 2) the algorithm guarantees that the probability it returns a policy that violates each definition of fairness is bounded by a user-specified constant, and 3) if a fair solution exists, the probability that it returns a fair solution (other than NSF) converges to one as the amount of training data goes to infinity. The first condition is crucial because no single definition of fairness is appropriate for all problems [16]. The second condition is equally important because it allows the user to specify the necessary confidence level(s) for the application at hand. However, an algorithm that satisfies only the first two conditions is not necessarily qualitatively helpful. For example, we can construct an algorithm that always returns NSF instead of an actual solution—this technically satisfies 1) and 2), but is effectively useless. Ideally, if a fair solution exists, a fair algorithm should be able to (given enough data) eventually find and return it. We call this property consistency and show in Section 5 that RobinHood is consistent.

Since condition 1) allows users to specify their own notions of fairness, the set of fair policies is not known a priori. Therefore, the algorithm must reason about the fairness of policies using the available data. We consider a policy to be either fair or unfair with respect to condition 2).
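As one concrete instance of such an off-policy estimator, ordinary importance sampling reweights each logged reward by πe(x, a)/πb(x, a); a minimal sketch, assuming the behavior policy's action probabilities are known (this is not necessarily the estimator the paper uses):

```python
def ips_estimate(pi_e, pi_b, D):
    """Ordinary importance-sampling estimate of r(pi_e) from logged data
    D = [(x, a, r), ...]; pi(x, a) returns the probability of action a in context x."""
    return sum((pi_e(x, a) / pi_b(x, a)) * r for x, a, r in D) / len(D)

# Two-action example: pi_b is uniform-random; pi_e always chooses action 1,
# so only action-1 rewards contribute, each weighted by 1.0 / 0.5 = 2.
pi_b = lambda x, a: 0.5
pi_e = lambda x, a: 1.0 if a == 1 else 0.0
D = [(None, 1, 1.0), (None, 0, -1.0), (None, 1, -1.0), (None, 0, 1.0)]
print(ips_estimate(pi_e, pi_b, D))  # → 0.0
```

The estimate is unbiased whenever πb assigns nonzero probability to every action πe might take.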
For example, we may say that a lending policy is fair if for all pairs of applicants who are identical in every way except race, the policy takes the same action, i.e., approving or denying the loan for both. This criterion is known as causal discrimination (with respect to race) [17; 30]. Then, a policy that adheres to this definition is fair, and a policy that violates this definition is unfair. The algorithm that produces a policy from data must ensure that, with high probability, it will produce a policy that is fair. Note that the setting in which each action is fair or unfair can be captured by this approach by defining a policy to be fair if and only if the probability it produces unfair actions is bounded by a small constant.

Formally, assume that policies are parameterized by a vector θ ∈ Rl, so that π(Xι, ·, θ) is the conditional distribution over action Aι given Xι for all θ ∈ Rl. Let 𝒟 be the set of all possible logged data sets and D be the logged data (a random variable, and the only source of randomness in the subsequent equations). Let a : 𝒟 → Rl be an offline contextual bandit algorithm, which takes as input logged data and produces as output the parameters for a new policy.

Let gi : Rl → R be a user-supplied function, called a constraint objective, that measures the fairness of a policy. We adopt the convention that if gi(θ) ≤ 0 then the policy with parameters θ is fair, and if gi(θ) > 0 then the policy with parameters θ is unfair. Our goal is to create a contextual bandit algorithm that enforces k behavioral constraints, where the ith constraint has the form:

Pr(gi(a(D)) ≤ 0) ≥ 1 − δi,    (1)

where δi ∈ [0, 1] is the required confidence level for the ith constraint. Together, the constraint objective gi and the confidence level δi constitute a behavioral constraint.
Any algorithm that\nsatis\ufb01es (1) is called Seldonian, or quasi-Seldonian if it relies on reasonable but false assumptions,\nsuch as appeals to the central limit theorem [47].\nSome constraints might be impossible to enforce if, for example, the user provides con\ufb02icting\nconstraints [28], or if \u03b4i cannot be established given the amount of data available. Algorithms that\nprovide high-probability guarantees are often very conservative [26]; if only a small amount of data\nis available, it may be impossible for the algorithm to produce a policy that satis\ufb01es all behavioral\nconstraints with suf\ufb01ciently high con\ufb01dence. In such cases, RobinHood returns NSF to indicate that it\nis unable to \ufb01nd a fair solution. When this occurs, the user has control over what to do next. For some\ndomains, deploying a known fair policy may be appropriate; for others, it might be more appropriate\nto issue a warning and deploy no policy. We de\ufb01ne gi(NSF) = 0, so that NSF is always fair.\nNotice that the creation of an algorithm that satis\ufb01es each of the three desired conditions is dif\ufb01cult.\nCondition 1) is dif\ufb01cult to enforce because the user must be provided with an interface that allows\nthem to specify their desired de\ufb01nition of fairness without requiring the user to know the underlying\ndata distribution. Condition 2) is dif\ufb01cult to achieve because of the problem of multiple comparisons\u2014\ntesting b solutions to see if they are fair equates to performing b hypothesis tests, which necessitates\nmeasures for avoiding the problems associated with running multiple hypothesis tests using a single\ndata set. 
Condition 3) is particularly difficult to achieve in conjunction with the second condition—the algorithm must carefully trade off maximizing the expected reward against predictions about the outcomes of future hypothesis tests when picking the candidate solution it considers returning.

4 RobinHood Algorithm

This section presents RobinHood, our fair bandit algorithm. At a high level, RobinHood allows users to specify their own notions of (un)fairness based on statistics related to the performance of a potential solution. It then uses concentration inequalities [36] to calculate high-probability bounds on these statistics. If these bounds satisfy the user's fairness criteria, then the solution is returned.

Constructing Constraint Objectives. Users can specify their desired fairness definitions with constraint objectives, {gi}_{i=1}^k, that accept a parameterized policy θ as input and produce a real-valued measurement of fairness. For simplicity of notation, we remove the subscript i and discuss the construction of an arbitrary constraint objective, g. In our loan approval example, we might define g(θ) = CDR(θ) − ε, where CDR(θ) indicates the causal discrimination rate of θ. However, computing g for this example requires knowledge of the underlying data distribution, which is typically unknown. In practice, each g might depend on distributions that are unavailable, so it is unreasonable to assume that the user can compute g(θ) for an arbitrary g.

Instead of explicitly requiring g(θ), we could assume that the user is able to provide unbiased estimators for g.
However, this is also limiting because it may be difficult to obtain unbiased estimators for certain constraint objectives; e.g., unbiased estimators of the standard deviation of a random variable can be challenging (or impossible) to construct. Importantly, our algorithm does not explicitly require an unbiased estimate of g(θ). Instead, it computes high-probability upper bounds for g(θ). Even if the random variable of interest does not permit unbiased estimators, if it is a function of random variables for which unbiased estimators exist, then valid upper bounds can be computed.

With this in mind, we propose a general interface in which the user can define g by combining d real-valued base variables, {zj}_{j=1}^d, using addition, subtraction, division, multiplication, absolute value, maximum, inverse, and negation operators. Base variables may also be multiplied and added to constants. Instead of specifying the base variable zj(θ) explicitly, we assume the user is able to provide an unbiased estimator, ẑj, of each base variable: zj(θ) := E[Average(ẑj(θ, D))]. That is, each function ẑj takes a parameterized policy θ and a data set D and returns an arbitrary number of i.i.d. outputs, ẑj(θ, D). In the definition of zj, the average of the outputs is taken so that zj(θ) is a scalar.

A base variable estimator ẑj for our loan approval example could be defined as an integer indicating whether or not causal discrimination is satisfied for applicable data points in D. To do this, ẑ(θ, D) should produce 1 if, for points h and f that differ only by race, θ chooses the same action, and 0 otherwise. Defining ẑ in this way gives us an unbiased estimate of the CDR.
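A base-variable estimator of this kind might be sketched as follows; the matched pairs and attribute layout are hypothetical illustrations, not the paper's data format:

```python
def causal_discrimination_estimates(policy_action, pairs):
    """One 0/1 sample per matched pair: 1 if a (deterministic) policy takes the
    same action for both applicants, who differ only in the race attribute."""
    return [1 if policy_action(h) == policy_action(f) else 0 for h, f in pairs]

# Hypothetical matched pairs of (features, race) records differing only in race.
pairs = [
    (((0.9,), "A"), ((0.9,), "B")),
    (((0.2,), "A"), ((0.2,), "B")),
]
policy_action = lambda applicant: 1 if applicant[0][0] > 0.5 else 0  # ignores race
z_hat = causal_discrimination_estimates(policy_action, pairs)
print(sum(z_hat) / len(z_hat))  # → 1.0: the policy treats each pair identically
```

Averaging the 0/1 samples yields the scalar base variable z(θ) described above.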
We could then define g(θ) = z(θ) − ε, requiring the CDR to be within some value ε, with probability at least 1 − δ.

There may be some base variables that the user wants to use when defining fairness that do not have unbiased estimators, e.g., standard deviation. To handle these cases, we also allow the user to use base variables for which they can provide high-probability upper and lower bounds on z(θ) given any θ and data D. As an example, in Appendix G we show how the user can define a base variable to be the largest possible expected reward for any policy with parameters θ in some set Θ, i.e., maxθ∈Θ r(θ).

In summary, users can define constraint objectives that capture their desired definitions of fairness. Constraint objectives are mathematical expressions containing operators (including summation, division, and absolute value) and base variables (any variable, including constants, for which high-confidence upper and lower bounds can be computed). In Section 6, we construct constraint objectives for different fairness definitions and find solutions that are fair with respect to these definitions. In Appendix A, we provide more examples of how to construct constraint objectives for other common fairness definitions used in the ML community.

Pseudocode. RobinHood (Algorithm 1) has three steps. In the first step, it partitions the training data into the candidate selection set Dc and the safety set Ds. In the second step, called candidate selection, RobinHood uses Dc to construct a candidate policy, θc ∈ Θ, that is likely to pass the third step, called the safety test, which ensures that the behavioral constraints hold. Algorithm 1 presents the RobinHood pseudocode.
Because the candidate selection step is constructed with the safety test in mind, we first discuss the safety test, followed by candidate selection. Finally, we summarize the helper methods used during the candidate selection and safety test steps.

The safety test (lines 3–4) applies concentration inequalities using Ds to produce a high-probability upper bound, Ui := Ui(θc, Ds), on gi(θc), the value of the ith constraint objective at the candidate solution found in the candidate selection step. More concretely, Ui satisfies Pr(gi(θc) ≤ Ui) ≥ 1 − δi. If Ui ≤ 0, then θc passes the safety check and is returned. Otherwise, RobinHood returns NSF.

The goal of the candidate selection step (line 2) is to find a solution, θc, that maximizes expected reward and is likely to pass the safety test. Specifically, θc is found by maximizing the output of Algorithm 2, which uses Dc to compute an estimate, Ûi, of Ui.

Algorithm 1 RobinHood(D, Δ = {δi}_{i=1}^k, Ẑ = {ẑj}_{j=1}^d, E = {Ei}_{i=1}^k)
1: Dc, Ds = partition(D)
2: θc = argmax_{θ∈Θ} CandidateUtility(θ, Dc, Δ, Ẑ, E)    ▷ Candidate Selection
3: [U1, ..., Uk] = ComputeUpperBounds(θc, Ds, Δ, Ẑ, E, inflateBounds=False)    ▷ Safety Test
4: if Ui ≤ 0 for all i ∈ {1, ..., k} then return θc else return NSF

Algorithm 2 CandidateUtility(θ, Dc, Δ, Ẑ, E)
1: [Û1, ..., Ûk] = ComputeUpperBounds(θ, Dc, Δ, Ẑ, E, inflateBounds=True)
2: rmin = min_{θ′∈Θ} r̂(θ′, Dc)
3: if Ûi ≤ −ξ for all i ∈ {1, ..., k} then return r̂(θ, Dc) else return rmin − Σ_{i=1}^k max{0, Ûi}

The same concentration inequality used to compute Ui in the safety test is also used to compute Ûi; e.g., if Hoeffding's inequality is used to compute Ui, it is also used to compute Ûi.
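For instance, when a base variable's samples are bounded, a (1 − δ)-confidence upper bound on its mean can be computed with Hoeffding's inequality; in this sketch the inflation factor of 2 is an illustrative choice, not the paper's specific widening scheme:

```python
import math

def hoeffding_upper(samples, delta, lo=0.0, hi=1.0, inflate=1.0):
    """(1 - delta)-confidence upper bound on the mean of i.i.d. samples in
    [lo, hi]; inflate > 1 widens the interval, as in candidate selection."""
    n = len(samples)
    mean = sum(samples) / n
    width = (hi - lo) * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return mean + inflate * width

samples = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
U = hoeffding_upper(samples, delta=0.05)                   # safety-test style bound
U_hat = hoeffding_upper(samples, delta=0.05, inflate=2.0)  # inflated, candidate-selection style
```

The inflated bound `U_hat` is always at least as large as `U`, making the candidate selection step conservative about which solutions will pass the safety test.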
Multiple comparisons are performed on a single data set (Dc) during the search for a solution, which leads the candidate selection step to over-estimate its confidence that the solution it picks will pass the safety test. To remedy this issue, we inflate the width of the confidence intervals used to compute Ûi (this is indicated by the Boolean inflateBounds in the pseudocode). Another distinction between the candidate selection step and the safety test is that, in Algorithm 2, we check whether Ûi ≤ −ξ for some small constant ξ instead of 0. This technical assumption is required to ensure the consistency of RobinHood, and is discussed in Appendix E.

We define the input E = {Ei}_{i=1}^k in the pseudocode to be analytic expressions representing the constraint objectives {gi}_{i=1}^k. For example, if we had the constraint objective g(θ) = z1(θ) × z2(θ) + z3(θ) − ε, then E = E1 × E2 + E3 − ε, where E1 = z1(θ), E2 = z2(θ), and E3 = z3(θ). Each expression Ei is used in ComputeUpperBounds (Algorithm 3) and related helper functions, the pseudocode for which is located in Appendix B. At a high level, these helper functions recurse through sub-expressions of each Ei until a base variable is encountered. Once this occurs, real-valued upper and lower (1 − δi)-confidence bounds on the base variable's estimators are computed and subsequently propagated through Ei.
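This propagation amounts to interval arithmetic on the base-variable bounds; a minimal sketch for the addition and multiplication operators (function names hypothetical):

```python
def prop_add(b1, b2):
    """Bounds on E1 + E2 from bounds (l, u) on E1 and E2."""
    (l1, u1), (l2, u2) = b1, b2
    return (l1 + l2, u1 + u2)

def prop_mul(b1, b2):
    """Bounds on E1 * E2: take the extremes over the four corner products."""
    (l1, u1), (l2, u2) = b1, b2
    corners = [l1 * l2, l1 * u2, u1 * l2, u1 * u2]
    return (min(corners), max(corners))

# Bounds on E1 x E2 are computed first, then combined with the bounds on E3.
b1, b2, b3 = (0.1, 0.3), (0.5, 1.0), (-0.2, 0.2)
b12 = prop_mul(b1, b2)   # (0.05, 0.3)
b123 = prop_add(b12, b3)
```

The remaining operators (division, absolute value, maximum, inverse, negation) follow the same pattern of taking extremes over the input intervals.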
Using our example from earlier, bounds for E1,2 = E1 × E2 would first be computed, followed by bounds for E1,2 + E3.

5 Theoretical Analyses

This section proves that, given reasonable assumptions about the constraint objectives, {gi(θ)}_{i=1}^k, and their sample estimates, {Average(ĝi(θ, D))}_{i=1}^k, 1) RobinHood is guaranteed to satisfy the behavioral constraints and 2) RobinHood is consistent.

To prove that RobinHood is Seldonian [47], i.e., that it is guaranteed to satisfy the behavioral constraints and thus returns a fair solution with high probability, we show that the safety test only returns a solution if the behavioral constraints are guaranteed to be satisfied. This follows from the use of concentration inequalities and transformations to convert bounds on the base variables, zj(θc), into a high-confidence upper bound, Ui, on gi(θc). We therefore begin by showing (in Appendix C) that the upper bounds computed by the helper functions in Appendix B satisfy Pr(gi(θ) > Ui) ≤ δi for all constraint objectives. Next, we show that the behavioral constraints are satisfied by RobinHood. We denote RobinHood as a(D), a batch contextual bandit algorithm dependent on data D.

Theorem 1 (Fairness Guarantee). Let {gi}_{i=0}^n be a sequence of behavioral constraints, where gi : Θ → R, and let {δi}_{i=0}^n be a corresponding sequence of confidence thresholds, where δi ∈ [0, 1]. Then, for all i, Pr(gi(a(D)) > 0) ≤ δi. Proof. See Appendix D.

We define an algorithm to be consistent if, when a fair solution exists, the probability that the algorithm returns a solution other than NSF converges to 1 as the amount of training data goes to infinity. We state this more formally in Theorem 2.

Theorem 2 (Consistency). If Assumptions 1 through 5 (specified in Appendix E) hold, then lim_{n→∞} Pr(a(D) ≠ NSF, g(a(D)) ≤ 0) = 1. Proof.
See Appendix E.

In order to prove Theorem 2, we first define the set Θ̄, which contains all unfair solutions. At a high level, we show that the probability that θc satisfies θc ∉ Θ̄ converges to one as n → ∞. To establish this, we 1) establish the convergence of the confidence intervals for the base variables, 2) establish the convergence of the candidate objective for all solutions, and 3) establish the convergence of the probability that θc ∉ Θ̄. Once we have this property about θc, we establish that the probability of the safety test returning NSF converges to zero as n → ∞.

In order to build up the properties discussed above, we make a few mild assumptions, which we summarize here. To establish 1), we assume that the confidence intervals on the base variables converge almost surely to the true base variable values for all solutions. Hoeffding's inequality and Student's t-test are examples of concentration inequalities that provide this property. We also assume that the user-provided analytic expressions, E, are continuous functions of the base variables. With the exception of division, all operators discussed in Section 4 satisfy this assumption. In fact, this assumption is still satisfied for division when positive base variables are used in the denominator. To establish 2) and 3), we assume that the sample off-policy estimator, r̂, converges almost surely to r, the actual expected reward. This is satisfied by most off-policy estimators [46]. We also make particularly weak smoothness assumptions about the output of Algorithm 2, which only require the output of Algorithm 2 to be smooth across a countable partition of Θ.
Lastly, we assume that at least one fair solution exists and that this solution is not on the fair-unfair boundary.

Note that consistency does not provide bounds on the amount of data necessary for the return of a fair solution. Although a high-probability bound on how much data our algorithm requires to return a solution other than NSF would provide theoretical insights into the behavior of our algorithm, our focus is on ensuring that our algorithm can return solutions other than NSF using practical amounts of data on real problems. Hence, in the following section we conduct experiments with real data (including data that we collected from a user study).

6 Empirical Evaluation

We apply RobinHood to three real-world applications: tutoring systems, loan approval, and criminal recidivism. Our evaluation focuses on three research questions: 1) When do solutions returned by RobinHood obtain performance comparable to those returned by state-of-the-art methods? 2) How often does RobinHood return an unfair solution, as compared to state-of-the-art methods? 3) In practice, how often does RobinHood return a solution besides NSF?

6.1 Experimental Methodology and Application Domains

To the best of our knowledge, no other fair contextual bandit algorithms have been proposed in the offline setting. We therefore compare to two standard offline methods: POEM [44] and Offset Tree [6]. In Appendix F, we also compare to Rawlsian Fairness, a fair online contextual bandit framework [23]. Existing offline bandit algorithms are not designed to adhere to multiple, user-defined fairness constraints. One seemingly straightforward fix is to create an algorithm that uses all of the data, D, to search for a solution, θ, that maximizes the expected reward (using a standard approach), but with the additional constraint that an estimate, ĝ(θ, D), of how unfair θ is, is at most zero.
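Such a search can be sketched as a constrained maximization over a finite set of candidate policies; all names below are hypothetical, and the actual baseline's details are given in Appendix F:

```python
def naive_constrained_search(policies, r_hat, g_hats, D):
    """Return the policy maximizing the estimated reward subject to every
    empirical constraint estimate g_hat(theta, D) <= 0; None if none qualifies."""
    feasible = [th for th in policies if all(g(th, D) <= 0 for g in g_hats)]
    if not feasible:
        return None
    return max(feasible, key=lambda th: r_hat(th, D))

# Toy instance: three policies with table-lookup estimates; "a" looks unfair.
D = None
r_hat = lambda th, D: {"a": 1.0, "b": 0.6, "c": 0.9}[th]
g_hats = [lambda th, D: {"a": 0.2, "b": -0.1, "c": -0.3}[th]]
print(naive_constrained_search(["a", "b", "c"], r_hat, g_hats, D))  # → c
```

Because the constraint is only checked on the training estimate, nothing here controls how often the returned policy is unfair on future data; that gap is exactly what RobinHood's safety test addresses.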
That is, this method enforces the constraint ĝ(θ, D) ≤ 0 without concern for how well this constraint generalizes to future data. We construct this method, called NaïveFairBandit, as a baseline for comparison. In RobinHood and NaïveFairBandit, we use Student's t-test to calculate upper and lower confidence bounds on base variables for a particular θ.

Note that Algorithm 1 relies on the optimization algorithm (and implicitly, the feasible set) chosen by the user to find candidate solutions. If the user chooses an optimization algorithm incapable of finding fair solutions, e.g., a gradient method when the defined fairness constraints make it difficult or impossible to find a solution, then RobinHood will return NSF. We chose CMA-ES [19] as our optimization method. Further implementation details, including pseudocode for NaïveFairBandit, can be found in Appendix F.

Tutoring Systems. Our first set of experiments is motivated by intelligent tutoring systems (ITSs), which aim to teach a specific topic by providing personalized and interactive instruction based on a student's skill level [38]. Such adaptations could have unwanted consequences, including inequity with respect to different student populations [13]. We conduct our experiments in the multi-armed bandit setting—a special case of the contextual bandit setting in which the context is the same for every iteration. To support these experiments, we collected user data from the crowdsourcing marketplace Amazon Mechanical Turk. In our data collection, users were presented with one of two different versions of a tutorial followed by a ten-question assessment. Data including gender, assessment score, and tutorial type was collected during the study.
Let Rι be the assessment score achieved by user ι and Sι ∈ {f, m} represent the gender of user ι. (Due to lack of data for users identifying their gender as "other," we restricted our analysis to male- and female-identifying users.) D was collected using a uniform-random behavior policy. Further details are provided in Appendix F.

To demonstrate the ability of our algorithm to satisfy multiple and novel fairness criteria, we develop two behavioral constraints for these experiments: gf(θ) := |F|^{-1} Σ_{ι=0}^{|D|} Rι I(f) − E[R|f] − εf and gm(θ) := |M|^{-1} Σ_{ι=0}^{|D|} Rι I(m) − E[R|m] − εm, where |F| and |M| denote the respective number of female- and male-identifying users, I(x) is an indicator function for the event Sι = x, where x ∈ {f, m}, and E[R|x] is the expected reward conditioned on the event Sι = x. In words, for the constraint gf to be less than 0, the expected reward for females may only be smaller than the empirical average for females in the collected data by at most εf. The second constraint gm is similarly defined with respect to males. Note that different values of εf and εm can allow female performance to improve while male performance decreases, and vice versa. Defining the constraints in this way may be beneficial if the user is aware that bias towards a certain group exists, and hypothesizes that performance for this group may need to decrease to improve the performance of a minority group, as is the case in this data set. We also highlight that a fair policy for this experiment is one for which gf(θ) ≤ 0 and gm(θ) ≤ 0.

In our first experiment, which we call the similar proportions experiment, males and females are roughly equally represented in D: |F| ≈ |M|.
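An empirical estimate of one such constraint can be sketched as follows; the per-group expected reward under θ would come from an off-policy estimate, and all names and numbers here are illustrative:

```python
def group_constraint_estimate(rewards, groups, group, expected_reward, eps):
    """Empirical g for one group: the group's average logged reward, minus the
    estimated expected reward for that group under theta, minus eps."""
    group_rewards = [r for r, s in zip(rewards, groups) if s == group]
    return sum(group_rewards) / len(group_rewards) - expected_reward - eps

rewards = [8, 6, 9, 3, 5]           # assessment scores
groups = ["f", "m", "f", "m", "f"]  # gender labels
g_f = group_constraint_estimate(rewards, groups, "f", expected_reward=7.0, eps=1.0)
# g_f <= 0: under theta, females' expected reward is within eps of their
# empirical average (8 + 9 + 5) / 3 in the logged data.
```

The analogous call with `group="m"` gives the estimate for the second constraint.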
In our second ITS experiment, which we call the skewed proportions experiment, we simulate the scenario in which females are under-represented in the data set (we elaborate on the process for doing this in Appendix F). We perform this experiment because 1) biased data collection is a common problem [9] and 2) methods designed to maximize reward may do so at the expense of under-represented groups.

In the skewed proportions experiment we introduce a purposefully biased tutorial that responds differently to users based on their gender identification, in order to create a setting where unfairness is likely to occur. This tutorial provides information to male-identifying users in an intuitive, straightforward way but gives female-identifying users incorrect information, resulting in high assessment scores for males and near-zero assessment scores for females. The average total score for the biased tutorial in D is higher than that of the other tutorials; because of this, methods that optimize for performance without regard for gm and gf will often choose the biased tutorial for deployment. In practice, a similar situation could occur in an adaptive learning system where tutorials are uploaded by different sources. The introduction of a bug might compel a fairness-unaware algorithm to choose an unfair tutorial.

Loan Approval. Our next experiments are inspired by decision support systems for loan approval. In this setting, a policy uses a set of features, which describe an applicant, to determine whether the applicant should be approved for a loan. We use the Statlog German Credit data set, which includes a collection of loan applicants, each one described by 20 features, and labels corresponding to whether or not each applicant was assessed to have good financial credit [32].
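Because D was collected with a uniform-random behavior policy, the off-policy reward reported for a candidate policy can be estimated with ordinary importance sampling [39]. The following is a minimal sketch under an assumed data format (a list of (action, reward) pairs); the function name is ours, not the paper's.

```python
def importance_sampling_estimate(data, pi_theta, num_actions):
    """Estimate the expected reward of evaluation policy pi_theta from
    data logged by a uniform-random behavior policy, which selects each
    action with probability 1 / num_actions.

    data: iterable of (action, reward) pairs.
    pi_theta: dict mapping each action to its probability under the
    evaluation policy."""
    behavior_prob = 1.0 / num_actions
    # Each logged reward is reweighted by pi_theta(a) / behavior(a).
    weighted = [pi_theta[a] / behavior_prob * r for a, r in data]
    return sum(weighted) / len(weighted)
```

With two tutorial arms and a policy that always selects arm 0, this estimate averages the importance-weighted rewards, so only the logged arm-0 plays contribute.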
A policy earns reward 1 if it approves an applicant with good credit or denies an applicant with bad credit (the credit label of each applicant is unobserved by the policy); otherwise, it receives a reward of −1. We conduct two experiments that focus on ensuring fairness with respect to sex (using the Personal Status and Sex feature of each applicant in the data set to determine sex). Specifically, we enforce disparate impact [51] in one and statistical parity [14] in the other. In Appendix F we define statistical parity for this domain and discuss experimental results.

Disparate impact is defined in terms of the relative magnitudes of positive outcome rates. Let f and m correspond to females and males, and let A = 1 if the corresponding applicant was granted a loan and A = 0 otherwise. Disparate impact can then be written as g(θ) := max{E[A|m]/E[A|f], E[A|f]/E[A|m]} − (1 + ε). To satisfy this constraint, neither males nor females may enjoy a positive outcome rate that is more than 100ε% larger than that of the other sex.

Criminal Recidivism. This experiment uses recidivism data released by ProPublica as part of their investigation into the racial bias of deployed classification algorithms [3]. Each record in the data set includes a label indicating if the person would later re-offend (decile score), and six predictive features, including juvenile felony count, age, and sex. In this problem, the agent is tasked with producing the decile score given a feature vector of information about a person (the decile score label of each person is unobserved by the policy). The reward provided to the bandit is 1 if recidivism occurs and 0 otherwise. We apply approximate statistical parity here, where the features of interest are race (Caucasian and African American).
A policy exhibits statistical parity if the probability with which it assigns a beneficial outcome to individuals belonging to protected and unprotected classes is equal: g(θ) := |Pr(A = 1|Caucasian) − Pr(A = 1|African American)| − ε.

6.2 Results and Discussion

Figure 1 shows our experimental results over varying training set sizes. The leftmost plots in each row show the off-policy reward of solutions returned by each algorithm, and the middle plots show how often solutions are returned by each algorithm. The solution rate for each baseline is 100% because RobinHood is the only algorithm able to return NSF. The purpose of plotting the solution rate is to determine how much data our algorithm requires before it returns solutions other than NSF. The rightmost plots show the probability that an algorithm violated the fairness constraints. The dashed line in these plots denotes the maximum failure rate allowed by the behavioral constraints (δ = 5% in our experiments).

In all of our experiments, unless a certain amount of data is provided to NaïveFairBandit, it returns unfair solutions at an unacceptable rate. This seems workable at first glance: one could argue that, so long as enough data is given to NaïveFairBandit, it will not violate the behavioral constraints. In practice, however, it is not known in advance how much data is needed to obtain a fair solution. NaïveFairBandit's failure rate varies considerably in each experiment, and it is unclear how to determine the amount of data necessary for NaïveFairBandit's failure rate to remain under δ.
In essence, RobinHood is a variant of NaïveFairBandit that includes a mechanism for determining when there is sufficient data to trust the conclusions drawn from the available data.

In some of our experiments, the failure rates of the fairness-unaware baselines (Offset Tree and POEM) approach δ as more data is provided. To explain this behavior, note that when reward maximization and fairness are nonconflicting, there may exist fair high-performing solutions. In the case that only high-performing solutions meet the fairness criteria, the failure rate of these algorithms should decrease as more data is provided. Importantly, while these baselines might be fair in some cases, unlike RobinHood, these approaches do not come with fairness guarantees.

In the similar proportions experiment, fairness and performance optimization are nonconflicting. In this case, RobinHood performs similarly to the state of the art: it is able to find and return solutions whose off-policy reward is comparable to the baselines'. The same pattern can be seen in the loan approval and criminal recidivism experiments. In these applications, when high-performing fair solutions exist, RobinHood is able to find and return them. In the skewed proportions experiment, the biased tutorial maximizes overall performance but violates the constraint objectives. As expected, POEM and Offset Tree frequently choose to deploy this tutorial regardless of the increase in training data, while RobinHood frequently chooses to deploy a tutorial whose off-policy reward is high (with respect to the behavioral constraints).
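The two fairness definitions used in the loan approval and criminal recidivism experiments reduce to simple plug-in statistics over observed group outcomes. A minimal sketch follows; the variable names are ours, and it assumes both groups are non-empty with non-zero positive-outcome rates for the ratios in disparate impact.

```python
def disparate_impact(outcomes_m, outcomes_f, eps):
    """Empirical disparate-impact constraint:
    g = max(E[A|m]/E[A|f], E[A|f]/E[A|m]) - (1 + eps); fair when g <= 0.
    outcomes_* are 0/1 decisions (e.g., loan approvals) for each group."""
    rate_m = sum(outcomes_m) / len(outcomes_m)
    rate_f = sum(outcomes_f) / len(outcomes_f)
    return max(rate_m / rate_f, rate_f / rate_m) - (1.0 + eps)

def statistical_parity(outcomes_a, outcomes_b, eps):
    """Empirical statistical-parity constraint:
    g = |Pr(A=1 | group a) - Pr(A=1 | group b)| - eps; fair when g <= 0."""
    rate_a = sum(outcomes_a) / len(outcomes_a)
    rate_b = sum(outcomes_b) / len(outcomes_b)
    return abs(rate_a - rate_b) - eps
```

In both cases a policy is considered fair with respect to the constraint when the returned value is at most 0, matching the g(θ) ≤ 0 convention used throughout.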
In summary, in each of our experiments, RobinHood is able to return fair solutions with high probability given a reasonable amount of data.

Figure 1: Each row presents results for a different experiment, generated over 30, 30, 50, and 50 trials, respectively. Top row: tutoring system, similar proportions with εm = 0.5, εf = 0.0. Second row: tutoring system, skewed proportions with εm = 0.5, εf = 0.0. Third row: enforcing disparate impact in the loan approval application with ε = −0.8. Fourth row: enforcing statistical parity in the criminal recidivism application with ε = 0.1. The dashed line denotes the maximum failure rate allowed by the behavioral constraints (δ = 5% in our experiments). Algorithms shown: RobinHood, POEM, Offset Tree, and NaïveFairBandit.

7 Related Work

Significant research effort has focused on fairness-aware ML, particularly classification algorithms [1; 8; 14; 50], measuring fairness in systems [17], and defining varied notions of fairness [33; 17; 14]. Our work complements these efforts but focuses on the contextual bandit setting. This section describes work related to our setting, beginning with online bandits.

Recall (from Section 2) that in the standard online bandit setting, an agent's goal is to maximize expected reward, ρ(a) = E[Rι|Aι = a], as it interacts with a system. Over time, estimates of ρ are computed, and the agent must trade off between exploiting, i.e., taking actions that maximize its current estimate of ρ, and exploring, i.e., taking actions it believes to be suboptimal in order to build better estimates of ρ for those actions. Most fairness-unaware algorithms eventually learn a policy that acceptably maximizes ρ, but there are no performance guarantees for policies between initial deployment and the acceptable policy, i.e., while the agent is exploring.
These intermediate policies may choose suboptimal actions too often; this can be problematic in a real-world system, where choosing suboptimal actions could result in unintended inequity. In effect, fairness research for online methods has mostly focused on conservative exploration [24; 25; 35; 14; 27; 22]. The notion of action exploration does not apply in the offline setting because the agent does not interact iteratively with the environment. Instead, the agent has access to data collected using previous policies not chosen by the agent. Because of this, fairness definitions involving action exploration are not applicable to the offline setting.

Related work also exists in online metric-fairness learning [18; 41], multi-objective contextual bandits (MOCB) [45; 48], multi-objective reinforcement learning (MORL) [40; 49], and data-dependent constraint satisfaction [11; 47]. RobinHood can address metric-based definitions of fairness that can be quantitatively expressed using the set of operations defined in Section 4. MOCB and MORL largely focus on approximating the Pareto frontier to handle multiple and possibly conflicting objectives, though a recent batch MORL algorithm proposed by Le et al. [31] is an exception to this trend: it focuses on problems of interest (with respect to fair policy learning) that can be framed as chance constraints, and assumes the convexity of the feasible set Θ. RobinHood represents a different approach to the batch MOCB setting with high-probability constraint guarantees. The interface for specifying fairness definitions (presented in Section 4) makes RobinHood conceptually related to algorithms that satisfy data-dependent constraints [11]. In fact, RobinHood can more generally be thought of as an algorithm that ensures user-defined properties or behaviors with high probability, e.g., properties related to fairness.
Finally, RobinHood belongs to a family of methods called Seldonian algorithms [47]. This class of methods satisfies user-defined safety constraints with high probability.

Acknowledgments

This work is supported by a gift from Adobe and by the National Science Foundation under grants no. CCF-1453474, IIS-1753968, CCF-1763423, and an NSF CAREER award to Brunskill. Research reported in this paper was sponsored in part by the Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government.

References

[1] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, 2018. URL https://github.com/Microsoft/fairlearn.

[2] Takeshi Amemiya. Advanced econometrics. Harvard University Press, 1985.

[3] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. ProPublica, May 2016. URL https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.

[4] Robert Bartlett, Adair Morse, Richard Stanton, and Nancy Wallace. Consumer lending discrimination in the era of fintech.
Unpublished working paper. University of California, Berkeley, 2018.

[5] Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, page 0049124118782533, 2018.

[6] Alina Beygelzimer and John Langford. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 129–138. ACM, 2009.

[7] Nanette Byrnes. Artificial intolerance. MIT Technology Review, March 2016. URL https://www.technologyreview.com/s/600996/artificial-intolerance/.

[8] Toon Calders and Sicco Verwer. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277–292, 2010.

[9] Nitesh V Chawla, Nathalie Japkowicz, and Aleksander Kotcz. Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1):1–6, 2004.

[10] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806. ACM, 2017.

[11] Andrew Cotter, Maya Gupta, Heinrich Jiang, Nathan Srebro, Karthik Sridharan, Serena Wang, Blake Woodworth, and Seungil You. Training well-generalizing classifiers for fairness metrics and other data-dependent constraints. arXiv preprint arXiv:1807.00028, 2018.

[12] Marleen de Bruijne. Machine learning approaches in medical image analysis: From detection to diagnosis. Medical Image Analysis, 33:94–97, October 2016. doi: 10.1016/j.media.2016.06.032.

[13] Shayan Doroudi and Emma Brunskill. Fairer but not fair enough: On the equitability of knowledge tracing.
In Proceedings of the 9th International Conference on Learning Analytics & Knowledge, pages 335–339. ACM, 2019.

[14] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226. ACM, 2012.

[15] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC Press, 1994.

[16] Sorelle A Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. On the (im)possibility of fairness. arXiv preprint arXiv:1609.07236, 2016.

[17] Sainyam Galhotra, Yuriy Brun, and Alexandra Meliou. Fairness testing: Testing software for discrimination. In Joint Meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 498–510, Paderborn, Germany, September 2017. doi: 10.1145/3106237.3106277.

[18] Stephen Gillen, Christopher Jung, Michael Kearns, and Aaron Roth. Online learning with an unknown fairness metric. In Advances in Neural Information Processing Systems, pages 2600–2609, 2018.

[19] Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.

[20] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315–3323, 2016.

[21] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

[22] Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in reinforcement learning. arXiv preprint arXiv:1611.03071, 2016.

[23] Matthew Joseph, Michael Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth. Fair algorithms for infinite and contextual bandits.
arXiv preprint arXiv:1610.09559, 2016.

[24] Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems, pages 325–333, 2016.

[25] Matthew Joseph, Michael Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth. Meritocratic fairness for infinite and contextual bandits. In AIES, pages 158–163, New Orleans, LA, USA, 2018. ISBN 978-1-4503-6012-8. doi: 10.1145/3278721.3278764.

[26] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England, 2003.

[27] Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi, and Benjamin Van Roy. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, pages 3910–3919, 2017.

[28] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.

[29] Matthieu Komorowski, Leo A Celi, Omar Badawi, Anthony C Gordon, and A Aldo Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11):1716, 2018.

[30] Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, pages 4066–4076, 2017.

[31] Hoang M Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. arXiv preprint arXiv:1903.08738, 2019.

[32] Moshe Lichman et al. UCI machine learning repository, 2013.

[33] Lydia T. Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt. Delayed impact of fair machine learning. In International Conference on Machine Learning, 2018.

[34] Lydia T Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt. Delayed impact of fair machine learning.
arXiv preprint arXiv:1803.04383, 2018.

[35] Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, and David C Parkes. Calibrated fairness in bandits. arXiv preprint arXiv:1707.01875, 2017.

[36] P. Massart. Concentration Inequalities and Model Selection. Springer, 2007.

[37] Claire Cain Miller. Can an algorithm hire better than a human? The New York Times, June 2015. URL https://www.nytimes.com/2015/06/26/upshot/can-an-algorithm-hire-better-than-a-human.html.

[38] Tom Murray. Authoring intelligent tutoring systems: An analysis of the state of the art. International Journal of Artificial Intelligence in Education, 10:98–129, 1999.

[39] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.

[40] Diederik M Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.

[41] Guy N Rothblum and Gal Yona. Probably approximately metric-fair learning. arXiv preprint arXiv:1803.03242, 2018.

[42] P. K. Sen and J. M. Singer. Large Sample Methods in Statistics: An Introduction with Applications. Chapman & Hall, 1993.

[43] Student. The probable error of a mean. Biometrika, pages 1–25, 1908.

[44] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.

[45] Cem Tekin and Eralp Turğay. Multi-objective contextual multi-armed bandit with a dominant objective. IEEE Transactions on Signal Processing, 66(14):3799–3813, 2018.

[46] Philip Thomas. Safe Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 2015.

[47] Philip S. Thomas, Bruno Castro da Silva, Andrew G. Barto, Stephen Giguere, Yuriy Brun, and Emma Brunskill.
Preventing undesirable behavior of intelligent machines. Science, 366(6468):999–1004, November 2019. ISSN 0036-8075. doi: 10.1126/science.aag3311.

[48] Eralp Turğay, Doruk Öner, and Cem Tekin. Multi-objective contextual bandit problem with similarity information. arXiv preprint arXiv:1803.04015, 2018.

[49] Kristof Van Moffaert and Ann Nowé. Multi-objective reinforcement learning using sets of Pareto dominating policies. The Journal of Machine Learning Research, 15(1):3483–3512, 2014.

[50] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness constraints: Mechanisms for fair classification. In Fairness, Accountability, and Transparency in Machine Learning, Lille, France, July 2015.

[51] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171–1180. International World Wide Web Conferences Steering Committee, 2017.