Online Learning with an Unknown Fairness Metric

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada. Advances in Neural Information Processing Systems, pp. 2600-2609.

Stephen Gillen
University of Pennsylvania
stepe@math.upenn.edu

Christopher Jung, Michael Kearns, Aaron Roth
University of Pennsylvania
{chrjung, mkearns, aaroth}@cis.upenn.edu

Abstract

We consider the problem of online learning in the linear contextual bandits setting, but in which there are also strong individual fairness constraints governed by an unknown similarity metric. These constraints demand that we select similar actions or individuals with approximately equal probability [DHPRZ12], which may be at odds with optimizing reward, thus modeling settings where profit and social policy are in tension.
We assume we learn about an unknown Mahalanobis similarity metric from only weak feedback that identifies fairness violations, but does not quantify their extent. This is intended to represent the interventions of a regulator who "knows unfairness when he sees it" but nevertheless cannot enunciate a quantitative fairness metric over individuals. Our main result is an algorithm in the adversarial context setting that has a number of fairness violations that depends only logarithmically on T, while obtaining an optimal O(√T) regret bound to the best fair policy.

1 Introduction

The last several years have seen an explosion of work studying the problem of fairness in machine learning. Yet there remains little agreement about what "fairness" should mean in different contexts. In broad strokes, the literature can be divided into two families of fairness definitions: those aiming at group fairness, and those aiming at individual fairness.

Group fairness definitions are aggregate in nature: they partition individuals into some collection of protected groups (say by race or gender), specify some statistic of interest (say, positive classification rate or false positive rate), and then require that a learning algorithm equalize this quantity across the protected groups. On the other hand, individual fairness definitions ask for some constraint that binds at the individual level, rather than only over averages of people. Often, these constraints have the semantics that "similar people should be treated similarly" [DHPRZ12].

Individual fairness definitions have substantially stronger semantics and demands than group definitions of fairness. For example, [?] lay out a compendium of ways in which group fairness definitions are unsatisfying. Yet despite these weaknesses, group fairness definitions are by far the most prevalent in the literature (see e.g. [?], and [?] for a survey).
This is in large part because notions of individual fairness require making stronger assumptions about the setting under consideration. In particular, the definition from Dwork et al. [DHPRZ12] requires that the algorithm designer know a "task-specific fairness metric."

Learning problems over individuals are also often implicitly accompanied by some notion of merit, embedded in the objective function of the learning problem. For example, in a lending setting we might posit that each loan applicant is either "creditworthy" and will repay a loan, or is not creditworthy and will default, which is what we are trying to predict. [?] take the approach that this measure of merit, already present in the model although initially unknown to the learner, can be taken to be the similarity metric in the definition of [DHPRZ12], requiring informally that creditworthy individuals have at least the same probability of being accepted for loans as defaulting individuals. (The implicit and coarse fairness metric here assigns distance zero between pairs of creditworthy individuals and pairs of defaulting individuals, and some non-zero distance between a creditworthy and a defaulting individual.) This resolves the problem of how one should discover the "fairness metric", but results in a notion of fairness that is necessarily aligned with the notion of "merit" (creditworthiness) that we are trying to predict.

However, there are many settings in which the notion of merit we wish to predict may be different from, or even at odds with, the notion of fairness we would like to enforce.
For example, notions of fairness aimed at rectifying societal inequities that result from historical discrimination can aim to favor the disadvantaged population (say, in college admissions), even if the performance of the admitted members of that population can be expected to be lower than that of the advantaged population. Similarly, we might desire a fairness metric incorporating only those attributes that individuals can change in principle (and thus excluding ones like race, age, and gender), and that further expresses what are and are not meaningful differences between individuals, outside the context of any particular prediction problem. These kinds of fairness desiderata can still be expressed as an instantiation of the definition from Dwork et al. [DHPRZ12], but with a task-specific fairness metric separate from the notion of merit we are trying to predict.

In this paper, we revisit the individual fairness definition from Dwork et al. [DHPRZ12]. This definition requires that pairs of individuals who are close in the fairness metric must be treated "similarly" (e.g., in an allocation problem such as lending, served with similar probability). We investigate the extent to which it is possible to satisfy this fairness constraint while simultaneously solving an online learning problem, when the underlying fairness metric is Mahalanobis but not known to the learning algorithm, and may also be in tension with the learning problem. One conceptual problem with metric-based definitions, which we seek to address, is that it may be difficult for anyone to actually precisely express a quantitative metric over individuals, even though they might "know unfairness when they see it." We therefore assume that the algorithm has access to an oracle that knows intuitively what it means to be fair, but cannot explicitly enunciate the fairness metric.
Instead, given observed actions, the oracle can specify whether they were fair or not, and the goal is to obtain low regret in the online learning problem (measured with respect to the best fair policy) while also limiting violations of individual fairness during the learning process.

1.1 Our Results and Techniques

We study the standard linear contextual bandit setting. In rounds t = 1, ..., T, a learner observes arbitrary and possibly adversarially selected d-dimensional contexts, each corresponding to one of k actions. The reward for each action is (in expectation) an unknown linear function of the contexts. The learner seeks to minimize its regret.

The learner also wishes to satisfy fairness constraints, defined with respect to an unknown distance function defined over contexts. The constraint requires that the difference between the probabilities that any two actions are taken is bounded by the distance between their contexts. The learner has no initial knowledge of the distance function. Instead, after the learner makes its decisions according to some probability distribution π^t at round t, it receives feedback specifying for which pairs of contexts the fairness constraint was violated. Our goal in designing a learner is to simultaneously guarantee near-optimal regret in the contextual bandit problem (with respect to the best fair policy), while violating the fairness constraints as infrequently as possible. Our main result is a computationally efficient algorithm that guarantees this for a large class of distance functions known as Mahalanobis distances (these can be expressed as d(x_1, x_2) = ||A x_1 - A x_2||_2 for some matrix A).

Theorem (Informal): There is a computationally efficient learning algorithm L in our setting that guarantees that for any Mahalanobis distance, any time horizon T, and any error tolerance ε:

1.
(Learning) With high probability, L obtains regret Õ(k^2 d^2 log(T) + d√T) to the best fair policy. (See Theorem 3 for a precise statement.)

2. (Fairness) With probability 1, L violates the unknown fairness constraints by more than ε on at most O(k^2 d^2 log(d/ε)) many rounds. (Theorem 4.)

We note that the quoted regret bound requires setting ε = O(1/T), and so this implies a number of fairness violations of magnitude more than 1/T that is bounded by a function growing logarithmically in T. Other tradeoffs between regret and fairness violations are possible.

These two goals, obtaining low regret and violating the unknown constraint a small number of times, are seemingly in tension. A standard technique for obtaining a mistake bound with respect to fairness violations would be to play a "halving algorithm", which would always act as if the unknown metric is at the center of the current version space (the set of metrics consistent with the feedback observed thus far), so that mistakes necessarily remove a non-trivial fraction of the version space, making progress. On the other hand, a standard technique for obtaining a diminishing regret bound is to play "optimistically", i.e., to act as if the unknown metric is the point in the version space that would allow for the largest possible reward. But "optimistic" points are necessarily at the boundary of the version space, and when they are falsified, the corresponding mistakes do not necessarily reduce the version space by a constant fraction.

We prove our theorem in two steps. First, in Section 3, we consider the simpler problem in which the linear objective of the contextual bandit problem is known, and the distance function is all that is unknown.
In this simpler case, we show how to obtain a bound on the number of fairness violations using a linear-programming based reduction to a recent algorithm of [?], which has a mistake bound for learning a linear function from a particularly weak form of feedback. A complication is that our algorithm does not receive all of the feedback that the algorithm of [?] expects. We need to use the structure of our linear program to argue that this is not a problem. Then, in Section 4, we give our algorithm for the complete problem, using large portions of the machinery we develop in Section 3.

We note that in a non-adversarial setting, in which contexts are drawn from a distribution, the algorithm of [?] could be applied more simply, along with standard techniques for contextual bandit learning, to give an explore-then-exploit style algorithm. This algorithm would obtain bounded (but suboptimal) regret, and a number of fairness violations that grows as a root of T. The principal advantages of our approach are that we are able to give a number of fairness violations that has only logarithmic dependence on T, while tolerating contexts that are chosen adversarially, all while obtaining an optimal O(√T) regret bound to the best fair policy.

1.2 Additional Related Work

There are two papers, written concurrently to ours, that tackle orthogonal issues in metric-fair learning. [?] consider the problem of generalization when performing learning subject to a known metric constraint. They show that it is possible to prove relaxed PAC-style generalization bounds without any assumptions on the metric, and that for worst-case metrics, learning subject to a metric constraint can be computationally hard, even when the unconstrained learning problem is easy. In contrast, our work focuses on online learning with an unknown metric constraint.
Our results imply similar generalization properties via standard online-to-offline reductions, but only for the class of metrics we study. [?] considers a group-fairness-like relaxation of metric-fairness, asking that, on average, individuals in pre-specified groups are classified with probabilities proportional to the average distance between individuals in those groups. They show how to learn such classifiers in the offline setting, given access to an oracle which can evaluate the distance between two individuals according to the metric (allowing for unbiased noise). The similarity to our work is that we also consider access to the fairness metric via an oracle, but our oracle is substantially weaker, and does not provide numeric-valued output.

There are also several papers in the algorithmic fairness literature that are thematically related to ours, in that they aim to bridge the gap between group notions of fairness (which can be semantically unsatisfying) and individual notions of fairness (which require very strong assumptions). [?] attempt to automatically learn a representation for the data in a batch learning problem (and hence, implicitly, a similarity metric) that causes a classifier to label an equal proportion of two protected groups as positive. They provide a heuristic approach and an experimental evaluation. Two recent papers ([?] and [?]) take the approach of asking for a group notion of fairness, but over exponentially many implicitly defined protected groups, thus mitigating what [?] call the "fairness gerrymandering" problem, which is one of the principal weaknesses of group fairness definitions. Both papers give polynomial time reductions which yield efficient algorithms whenever a corresponding agnostic learning problem is solvable.
In contrast, in this paper we take a different approach: we attempt to directly satisfy the original definition of individual fairness from Dwork et al. [DHPRZ12], but with substantially less information about the underlying similarity metric.

Starting with [?], several papers have studied notions of fairness in classic and contextual bandit problems. [?] study a notion of "meritocratic" fairness in the contextual bandit setting, and prove upper and lower bounds on the regret achievable by algorithms that must be "fair" at every round. This can be viewed as a variant of the [DHPRZ12] notion of fairness, in which the expected reward of each action is used to define the "fairness metric". The algorithm does not originally know this metric, but must discover it through experimentation. [?] extend the work of [?] to the setting in which the algorithm is faced with a continuum of options at each time step, and give improved bounds for the linear contextual bandit case. [?] extend this line of work to the reinforcement learning setting, in which the actions of the algorithm can impact its environment. [?] consider a notion of fairness based on calibration in the simple stochastic bandit setting. Finally, [?] consider a notion of online group fairness in the stochastic contextual bandit setting by constraining how much probability mass can be placed on each pre-specified group of arms.

There is a large literature that focuses on learning Mahalanobis distances; see [?] for a survey. Within this literature, the closest paper to our work focuses on online learning of Mahalanobis distances [?]. However, that result is in a very different setting from the one we consider here. In [?], the algorithm is repeatedly given pairs of points, and needs to predict their distance. It then learns their true distance, and aims to minimize its squared loss.
In contrast, in our paper the main objective of the learning algorithm is orthogonal to the metric learning problem: to minimize regret in the linear contextual bandit problem, while simultaneously learning and obeying a fairness constraint, and only from weak feedback noting violations of fairness.

2 Model and Preliminaries

2.1 Linear Contextual Bandits

We study algorithms that operate in the linear contextual bandits setting. A linear contextual bandit problem is parameterized by an unknown vector of linear coefficients θ ∈ R^d, with ||θ||_2 ≤ 1. Algorithms in this setting operate in rounds t = 1, ..., T. In each round t, an algorithm L observes k contexts x^t_1, ..., x^t_k ∈ R^d, scaled such that ||x^t_i||_2 ≤ 1. We write x^t = (x^t_1, ..., x^t_k) to denote the entire set of contexts observed at round t. After observing the contexts, the algorithm chooses an action i_t. After choosing an action, the algorithm obtains some stochastic reward r^t_{i_t} such that r^t_{i_t} is sub-Gaussian[1] and E[r^t_{i_t}] = ⟨x^t_{i_t}, θ⟩. The algorithm does not observe the reward for the actions not chosen. When the action i_t is clear from context, we write r^t instead of r^t_{i_t}.

Remark 1. For simplicity, we consider algorithms that select only a single action at every round. However, this assumption is not necessary. In the appendix of the full version [?], we show how our results extend to the case in which the algorithm can choose any number of actions at each round. This relaxation is sometimes more natural: for example, in a lending scenario, a bank may wish to make loans to as many individuals as will be profitable, without a budget constraint.

In this section, we will be discussing algorithms L that are necessarily randomized.
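One round of the protocol above can be simulated as follows; the Gaussian reward noise and the variable names are illustrative assumptions (the model only requires sub-Gaussian noise):

```python
import numpy as np

def expected_rewards(contexts, theta):
    """The vector of expected rewards <x^t_i, theta> over the k contexts."""
    return contexts @ theta

rng = np.random.default_rng(0)
d, k = 3, 4
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)                                # ||theta||_2 <= 1
contexts = rng.normal(size=(k, d))
contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)   # ||x^t_i||_2 <= 1

rbar = expected_rewards(contexts, theta)
i_t = int(rng.integers(k))             # the chosen action i_t
reward = rbar[i_t] + rng.normal()      # noisy reward centered at <x^t_{i_t}, theta>
```

By Cauchy-Schwarz, every expected reward lies in [-1, 1] under these norm constraints.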
To formalize the randomization, we denote a history including everything observed by the algorithm up through, but not including, round t as h^t = ((x^1, i_1, r^1), ..., (x^{t-1}, i_{t-1}, r^{t-1})). The space of such histories is denoted by H^t = (R^{d×k} × [k] × R)^{t-1}. An algorithm L is defined by a sequence of functions f^1, ..., f^T, each mapping histories and observed contexts to probability distributions over actions: f^t : H^t × R^{d×k} → Δ[k]. We write π^t to denote the probability distribution over actions that L plays at round t: π^t = f^t(h^t, x^t). We view π^t as a vector in [0, 1]^k, and so π^t_i denotes the probability that L plays action i at round t. We denote the expected reward of the algorithm at day t as E[r^t] = E_{i∼π^t}[r̄^t_i]. It will sometimes also be useful to refer to the vector of expected rewards across all actions on day t. We denote it as r̄^t = (⟨x^t_1, θ⟩, ..., ⟨x^t_k, θ⟩).

2.2 Fairness Constraints and Feedback

We study algorithms that are constrained to behave fairly in some manner. We adapt the definition of fairness from Dwork et al. [DHPRZ12] that asserts, informally, that "similar individuals should be treated similarly". We imagine that the decisions that our contextual bandit algorithm L makes correspond to individuals, and that the contexts x^t_i correspond to features pertaining to individuals. We adopt the following specialization of the fairness definition from Dwork et al., which is parameterized by a distance function d : R^d × R^d → R.

[1] A random variable X with µ = E[X] is sub-Gaussian if, for all t ∈ R, E[e^{t(X-µ)}] ≤ e^{t^2/2}.

Definition 1 ([DHPRZ12]). Algorithm L is Lipschitz-fair on round t with respect to distance function d if for all pairs of individuals i, j: |π^t_i - π^t_j| ≤ d(x^t_i, x^t_j).
For brevity, we will often just say that the algorithm is fair at round t, with the understanding that we are always talking about this one particular kind of fairness.

Remark 2. Note that this definition imposes a fairness constraint that binds between individuals at a single round t, but not across rounds. This is for several reasons. First, at a philosophical level, we want our algorithms to be able to improve with time, without being bound by choices they made long ago, before they had any information about the fairness metric. At a (related) technical level, it is easy to construct lower bound instances certifying that it is impossible to simultaneously guarantee that an algorithm has diminishing regret to the best fair policy while violating fairness constraints (now defined as binding across rounds) only a sublinear number of times.

One of the main difficulties in working with Lipschitz fairness (as discussed in [DHPRZ12]) is that the distance function d plays a central role, but it is not clear how it should be specified. In this paper, we concern ourselves with learning d from feedback. In particular, algorithms L will have access to a fairness oracle, which models a regulator who "knows unfairness when he sees it".

Informally, the fairness oracle takes as input: 1) the set of choices available to L at each round t, and 2) the probability distribution π^t that L uses to make its choices at round t, and returns the set of all pairs of individuals for which L violates the fairness constraint.

Definition 2 (Fairness Oracle). Given a distance function d, a fairness oracle O_d is a function O_d : R^{d×k} × Δ[k] → 2^{[k]×[k]} defined such that: O_d(x^t, π^t) = {(i, j) : |π^t_i - π^t_j| > d(x^t_i, x^t_j)}.

Formally, algorithms L in our setting operate in the following environment:

1.
An adversary fixes a linear reward function θ ∈ R^d with ||θ||_2 ≤ 1 and a distance function d. L is given access to the fairness oracle O_d.

2. In rounds t = 1 to T:

(a) The adversary chooses contexts x^t ∈ R^{d×k} with ||x^t_i||_2 ≤ 1 and gives them to L.
(b) L chooses a probability distribution π^t over actions, and chooses action i_t ∼ π^t.
(c) L receives reward r^t_{i_t} and observes the feedback O_d(x^t, π^t) from the fairness oracle.

Because of the power of the adversary in this setting, it is not possible to avoid arbitrarily small violations of the fairness constraint. Instead, we will aim to limit significant violations.

Definition 3. Algorithm L is ε-unfair on pair (i, j) at round t with respect to distance function d if |π^t_i - π^t_j| > d(x^t_i, x^t_j) + ε. Given a sequence of contexts and a history h^t (which fixes the distribution on actions at day t), we write Unfair(L, ε, h^t) = Σ_{i=1}^{k-1} Σ_{j=i+1}^{k} 1(|π^t_i - π^t_j| > d(x^t_i, x^t_j) + ε) to denote the number of pairs on which L is ε-unfair at round t. Given a distance function d and a history h^{T+1}, the ε-fairness loss of an algorithm L is the total number of pairs on which it is ε-unfair: FairnessLoss(L, h^{T+1}, ε) = Σ_{t=1}^{T} Unfair(L, ε, h^t). As a shorthand, we write FairnessLoss(L, T, ε).

We will aim to design algorithms L that guarantee that their fairness loss is bounded with probability 1 in the worst case over the instance: i.e., in the worst case over both θ and x^1, ..., x^T, and in the worst case over the distance function d (within some allowable class of distance functions; see Section 2.4).

2.3 Regret to the Best Fair Policy

In addition to minimizing fairness loss, we wish to design algorithms that exhibit diminishing regret to the best fair policy.
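Definitions 2 and 3 translate directly into code. In the sketch below, the helper names and the callable-distance interface are our illustrative choices, not part of the paper's formalism:

```python
import itertools
import numpy as np

def fairness_oracle(contexts, pi, dist):
    """O_d(x^t, pi^t): all pairs (i, j) with |pi_i - pi_j| > d(x_i, x_j)."""
    return {(i, j) for i, j in itertools.combinations(range(len(pi)), 2)
            if abs(pi[i] - pi[j]) > dist(contexts[i], contexts[j])}

def unfair_count(contexts, pi, dist, eps):
    """Unfair(L, eps, h^t): number of pairs on which pi is eps-unfair."""
    return sum(1 for i, j in itertools.combinations(range(len(pi)), 2)
               if abs(pi[i] - pi[j]) > dist(contexts[i], contexts[j]) + eps)

# Toy example with a (low-rank) Mahalanobis distance d(x, y) = ||Ax - Ay||_2.
A = np.array([[1.0, 0.0], [0.0, 0.0]])
dist = lambda x, y: float(np.linalg.norm(A @ (x - y)))
contexts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
pi = np.array([0.6, 0.4, 0.0])

violations = fairness_oracle(contexts, pi, dist)    # {(0, 1)}: 0.2 > d = 0
loss_t = unfair_count(contexts, pi, dist, eps=0.1)  # pair (0, 1) is 0.1-unfair
```

Note that because A is rank-deficient, contexts 0 and 1 are at distance zero and must receive (nearly) identical probabilities, which the distribution above violates.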
We first define a linear program that we will make use of throughout the paper. Given a vector a ∈ R^k and a vector c ∈ R^{k^2}, we denote by LP(a, c) the following linear program:

maximize_{π = (p_1, ..., p_k)}   Σ_{i=1}^{k} p_i a_i
subject to   Σ_{i=1}^{k} p_i ≤ 1
             |p_i - p_j| ≤ c_{i,j}  for all (i, j)

We write π(a, c) ∈ Δ[k] to denote an optimal solution to LP(a, c). Given a set of contexts x^t, recall that r̄^t is the vector representing the expected reward corresponding to each context (according to the true, unknown linear reward function θ). Similarly, we write d̄^t to denote the vector representing the set of distances between each pair of contexts i, j (according to the true, unknown distance function d): d̄^t_{i,j} = d(x^t_i, x^t_j).

Observe that π(r̄^t, d̄^t) corresponds to the distribution over actions that maximizes expected reward at round t, subject to satisfying the fairness constraints; i.e., the distribution that an optimal player, with advance knowledge of θ, would play if he were not allowed to violate the fairness constraints at all. This is the benchmark with respect to which we define regret:

Definition 4. Given an algorithm L = (f^1, ..., f^T), a distance function d, a linear parameter vector θ, and a history h^{T+1} (which includes a set of contexts x^1, ..., x^T), its regret is defined to be:

Regret(L, θ, d, h^{T+1}) = Σ_{t=1}^{T} E_{i∼π(r̄^t, d̄^t)}[r̄^t_i] - Σ_{t=1}^{T} E_{i∼f^t(h^t, x^t)}[r̄^t_i]

Our goal will be to design algorithms for which we can bound regret with high probability over the randomness of h^{T+1}, in the worst case over θ, d, and (x^1, . . .
, x^T).

2.4 Mahalanobis Distance

In this paper, we restrict our attention to a special family of distance functions which are parameterized by a matrix A:

Definition 5 (Mahalanobis distances). A function d : R^d × R^d → R is a Mahalanobis distance function if there exists a matrix A such that for all x_1, x_2 ∈ R^d: d(x_1, x_2) = ||A x_1 - A x_2||_2, where ||·||_2 denotes Euclidean distance. Note that if A is not full rank, then this does not define a metric, but we will allow this case (and be able to handle it in our algorithmic results).

Mahalanobis distances will be convenient for us to work with, because squared Mahalanobis distances can be expressed as follows:

d(x_1, x_2)^2 = ||A x_1 - A x_2||_2^2 = ⟨A(x_1 - x_2), A(x_1 - x_2)⟩ = (x_1 - x_2)^T A^T A (x_1 - x_2) = Σ_{i,j=1}^{d} G_{i,j} (x_1 - x_2)_i (x_1 - x_2)_j

where G = A^T A. Observe that when x_1 and x_2 are fixed, this is a linear function in the entries of the matrix G. We will use this property to reason about learning G, and thereby learning d.

3 Warmup: The Known Objective Case

In this section, we consider an easier case of the problem in which the linear objective function θ is known to the algorithm, and the distance function d is all that is unknown. In this case, we show, via a reduction to an online learning algorithm of [?], how to simultaneously obtain a logarithmic regret bound and a logarithmic (in T) number of fairness violations. The analysis we do here will be useful when we solve the full version of our problem (in which θ is unknown) in Section 4. Here, we sketch our solution. Details are in the full version of the paper [?].

3.1 Outline of the Solution

Recall that since we know θ, at every round t, after seeing the contexts, we know the vector of expected rewards r̄^t that we would obtain for selecting each action.
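The distribution π(a, c) solving LP(a, c) from Section 2.3 can be computed with an off-the-shelf solver. The sketch below uses scipy, splitting each absolute-value constraint |p_i - p_j| ≤ c_{i,j} into two linear inequalities; the function name and the explicit nonnegativity bounds (implicit in Δ[k]) are our choices:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def solve_fair_lp(a, c):
    """LP(a, c): maximize sum_i p_i * a_i  subject to  sum_i p_i <= 1,
    |p_i - p_j| <= c[i, j] for all pairs, and p >= 0 (implicit in Delta[k])."""
    k = len(a)
    A_ub, b_ub = [np.ones(k)], [1.0]
    for i, j in itertools.combinations(range(k), 2):
        row = np.zeros(k)
        row[i], row[j] = 1.0, -1.0
        A_ub.append(row);  b_ub.append(c[i, j])   # p_i - p_j <= c_ij
        A_ub.append(-row); b_ub.append(c[i, j])   # p_j - p_i <= c_ij
    res = linprog(-np.asarray(a), A_ub=np.vstack(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0, None)] * k)
    return res.x

# Two actions with expected rewards (1.0, 0.5) and a distance cap of 0.2 between
# them: unconstrained, all mass would go on action 0; fairness forces (0.6, 0.4).
p = solve_fair_lp(np.array([1.0, 0.5]), np.array([[0.0, 0.2], [0.2, 0.0]]))
```

The toy instance illustrates the tension in the model: the fairness constraint costs the player expected reward relative to the unconstrained optimum.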
Our algorithm will play at each round t the distribution π(r̄^t, d̂^t) that results from solving the linear program LP(r̄^t, d̂^t), where d̂^t is a "guess" for the vector d̄^t of pairwise distances between contexts. (Recall that the optimal distribution to play at each round is π(r̄^t, d̄^t).)

The main engine of our reduction is an efficient online learning algorithm for linear functions recently given by [?]. Their algorithm, which we refer to as DistanceEstimator, works in the following setting. There is an unknown vector of linear parameters α ∈ R^m. In each round t, the algorithm observes a vector of features u^t ∈ R^m, and produces a prediction g^t ∈ R for the value ⟨α, u^t⟩. After it makes its prediction, the algorithm learns whether its guess was too large or not, but does not learn anything else about the value of ⟨α, u^t⟩. The guarantee of the algorithm is that the number of rounds in which its prediction is off by more than ε is bounded by O(m log(m/ε)).[2]

Our strategy will be to instantiate (k choose 2) copies of this distance estimator, one for each pair of actions, to produce guesses (d̂^t_{i,j})^2 intended to approximate the squared pairwise distances d(x^t_i, x^t_j)^2. From these we derive estimates d̂^t_{i,j} of the pairwise distances d(x^t_i, x^t_j). Note that this is a linear estimation problem for any Mahalanobis distance, because by our observation in Section 2.4, a squared Mahalanobis distance can be written as a linear function of the m = d^2 unknown entries of the matrix G = A^T A which defines the Mahalanobis distance.

The complication is that the DistanceEstimator algorithms expect feedback at every round, which we cannot always provide.
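The identity underlying this reduction, that the squared distance is linear in the m = d^2 entries of G = A^T A with feature vector u = vec((x_i - x_j)(x_i - x_j)^T), is easy to check numerically, together with the one-bit "was the guess too large?" feedback each estimator copy receives. This sketch shows only the interface of the reduction (the helper names are ours), not the estimator itself, which is the algorithm of the cited work:

```python
import numpy as np

def pair_features(x_i, x_j):
    """u = vec((x_i - x_j)(x_i - x_j)^T), so that d(x_i, x_j)^2 = <vec(G), u>."""
    delta = x_i - x_j
    return np.outer(delta, delta).flatten()

def guess_too_large(guess_sq, true_sq):
    """The only feedback a DistanceEstimator copy gets about <alpha, u>."""
    return guess_sq > true_sq

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d))
alpha = (A.T @ A).flatten()        # the m = d^2 unknown linear parameters vec(G)
x_i, x_j = rng.normal(size=d), rng.normal(size=d)

true_sq = np.linalg.norm(A @ x_i - A @ x_j) ** 2
lin_sq = float(alpha @ pair_features(x_i, x_j))
# true_sq and lin_sq agree up to floating-point error
```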
This is because the fairness oracle O_d provides feedback about the distribution π(r̄^t, d̂^t) used by the algorithm, not directly about the guesses d̂^t. These are not the same, because not all of the constraints in the linear program LP(r̄^t, d̂^t) are necessarily tight — it may be that |π(r̄^t, d̂^t)_i − π(r̄^t, d̂^t)_j| < d̂^t_{i,j}. For any copy of DistanceEstimator that does not receive feedback, we can simply "roll back" its state and continue to the next round. But we need to argue that we make progress — that whenever we are ε-unfair, or whenever we experience large per-round regret, there is at least one copy of DistanceEstimator that we can give feedback to such that the corresponding copy of DistanceEstimator has made a large prediction error, and we can thus charge either our fairness loss or our regret to the mistake bound of that copy of DistanceEstimator.

As we show, there are three relevant cases.

1. In any round in which we are ε-unfair for some pair of contexts x^t_i and x^t_j, it must be that d̂^t_{i,j} ≥ d(x^t_i, x^t_j) + ε, and so we can always update the (i, j)th copy of DistanceEstimator and charge our fairness loss to its mistake bound.

2. For any pair of arms (i, j) such that we have not violated the fairness constraint, and the (i, j)th constraint in the linear program is tight, we can provide feedback to the (i, j)th copy of DistanceEstimator (its guess was not too large). There are two cases. Although the algorithm never knows which case it is in, we handle each case separately in the analysis.

   (a) For every constraint (i, j) in LP(r̄^t, d̂^t) that is tight in the optimal solution, |d̂^t_{i,j} − d(x^t_i, x^t_j)| ≤ ε. In this case, we show that our algorithm does not incur very much per-round regret.

   (b) Otherwise, there is a tight constraint (i, j) such that |d̂^t_{i,j} − d(x^t_i, x^t_j)| > ε. In this case, we may incur high per-round regret — but we can charge such rounds to the mistake bound of the (i, j)th copy of DistanceEstimator.

Theorem 1. FairnessLoss(L_known-θ, T, ε) ≤ O(k²d² log(d·‖A⊤A‖_F / ε)).

Theorem 2. For any time horizon T: Regret(L_known-θ, T) ≤ O(k²d² log(d·‖A⊤A‖_F / ε) + k³εT).

Setting ε = O(1/(k³T)) yields a regret bound of O(k²d² log(‖A⊤A‖_F · dkT)).

²If the algorithm also learned whether or not its guess was in error by more than ε at each round, variants of the classical halving algorithm could obtain this guarantee. But the algorithm does not receive this feedback, which is why the more sophisticated algorithm of Lobel et al. (2017) is needed.

4 The Full Algorithm

4.1 Outline of the Solution

At a high level, our plan will be to combine the techniques used for the case where the linear objective θ is known with a standard "optimism in the face of uncertainty" strategy for learning the parameter vector θ. Our algorithm will maintain a ridge-regression estimate θ̃ together with the confidence regions derived in Abbasi-Yadkori et al. (2011). After it observes the contexts x^t_i at round t, it uses these to derive upper confidence bounds on the expected reward of each context, represented as a vector r̂^t. The algorithm continues to maintain the distance estimates d̂^t in the same way as in the case where the linear objective θ is known, using the algorithm of Lobel et al. (2017) as a subroutine.
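The optimism step can be sketched roughly as follows. This is our own illustrative code, not the paper's implementation: `beta` stands in for the confidence-width parameter derived by Abbasi-Yadkori et al., and all function and variable names are ours.

```python
import numpy as np

def ucb_rewards(X, contexts, rewards, lam=1.0, beta=1.0):
    """Upper confidence bounds on expected reward, one per context.

    X: (n, d) contexts played so far; rewards: (n,) observed rewards;
    contexts: (k, d) the round-t contexts x_i^t.
    """
    d = contexts.shape[1]
    V = lam * np.eye(d) + X.T @ X                    # regularized design matrix
    theta_tilde = np.linalg.solve(V, X.T @ rewards)  # ridge-regression estimate
    V_inv = np.linalg.inv(V)
    # confidence width in direction x is beta * ||x||_{V^{-1}} = beta * sqrt(x^T V^{-1} x)
    widths = np.sqrt(np.einsum('ij,jk,ik->i', contexts, V_inv, contexts))
    return contexts @ theta_tilde + beta * widths    # optimistic reward vector

rng = np.random.default_rng(1)
theta = np.array([1.0, -0.5, 0.25])                  # true (unknown) objective
X = rng.normal(size=(50, 3))
rewards = X @ theta + 0.1 * rng.normal(size=50)
ctx = rng.normal(size=(4, 3))                        # this round's k = 4 contexts
r_hat = ucb_rewards(X, ctx, rewards)                 # one optimistic estimate per arm
```

The resulting vector r̂^t replaces the true expected rewards in the linear program; as data accumulates, the widths shrink, which is what the first piece of the regret decomposition below controls.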
At every round, the algorithm then chooses its action according to the distribution π^t = π(r̂^t, d̂^t).

The regret analysis of the algorithm follows by decomposing the per-round regret into two pieces. The first can be bounded by the sum of the expected widths of the confidence intervals corresponding to each context x^t_i that might be chosen at each round t, where the expectation is over the randomness of the algorithm's distribution π^t. A theorem of Abbasi-Yadkori et al. (2011) bounds the sum of the widths of the confidence intervals corresponding to the arms actually chosen by the algorithm. Using a martingale concentration inequality, we are able to relate these two quantities. We show that the second piece of the regret bound can be manipulated into a form that can be bounded using the machinery built in Section 3, which is described in further detail in the full version (Gillen et al., 2018).

Theorem 3. For any time horizon T, with probability 1 − δ:

Regret(L_full, T) ≤ O(k²d² log(d·‖A⊤A‖_F / ε) + k³εT + d√(T log(T/δ))).

If ε = 1/(k³T), this is a regret bound of O(k²d² log(kdT·‖A⊤A‖_F) + d√(T log(T/δ))).

Finally, the bound on the fairness loss is identical to the bound we proved in Theorem 1 (because our algorithm for constructing the distance estimates d̂ is unchanged). We have:
Theorem 4.
For any sequence of contexts and any Mahalanobis distance d(x₁, x₂) = ‖Ax₁ − Ax₂‖₂:

FairnessLoss(L_full, T, ε) ≤ O(k²d² log(d·‖A⊤A‖_F / ε)).

5 Conclusion and Future Directions

We have initiated the study of fair sequential decision making in settings where the notions of payoff and fairness are separate and may be in tension with each other, and have shown that in a stylized setting, optimal fair decisions can be efficiently learned even without direct knowledge of the fairness metric. A number of extensions of our framework and results would be interesting to examine. At a high level, the interesting question is: how much can we further relax the information about the fairness metric available to the algorithm? For instance, what if the fairness feedback is only partial, identifying some but not all fairness violations? What if it only indicates whether or not there were any violations, but does not identify them? What if the feedback is not guaranteed to be exactly consistent with any metric? Or what if the feedback is consistent with some distance function, but not one in a known class: for example, what if the distance is not exactly Mahalanobis, but is approximately so? In general, it is very interesting to continue pushing to close the wide gap between the study of individual fairness notions and the study of group fairness notions. When can we obtain the strong semantics of individual fairness without making correspondingly strong assumptions?

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pages 2312–2320, 2011.
URL http://papers.nips.cc/paper/4417-improved-algorithms-for-linear-stochastic-bandits.

Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in criminal justice risk assessments: the state of the art. arXiv preprint arXiv:1703.09207, 2017.

L Elisa Celis, Sayash Kapoor, Farnood Salehi, and Nisheeth K Vishnoi. An algorithmic framework to control bias in bandit-based personalization. arXiv preprint arXiv:1802.08674, 2018.

Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. arXiv preprint arXiv:1703.00056, 2017.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226. ACM, 2012.

Sorelle A Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. On the (im)possibility of fairness. arXiv preprint arXiv:1609.07236, 2016.

Stephen Gillen, Christopher Jung, Michael Kearns, and Aaron Roth. Online learning with an unknown fairness metric. arXiv preprint arXiv:1802.06936, 2018.

Sara Hajian and Josep Domingo-Ferrer. A methodology for direct and indirect discrimination prevention in data mining. IEEE Transactions on Knowledge and Data Engineering, 25(7):1445–1459, 2013.

Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, 2016.

Ursula Hébert-Johnson, Michael P Kim, Omer Reingold, and Guy N Rothblum. Calibration for the (computationally-identifiable) masses. arXiv preprint arXiv:1711.08513, 2017.

Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in reinforcement learning. In International Conference on Machine Learning, pages 1617–1626, 2017.

Prateek Jain, Brian Kulis, Inderjit S Dhillon, and Kristen Grauman.
Online metric learning and fast similarity search. In Advances in Neural Information Processing Systems, pages 761–768, 2009.

Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems, pages 325–333, 2016a.

Matthew Joseph, Michael J. Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth. Fair algorithms for infinite and contextual bandits. CoRR, abs/1610.09559, 2016b. URL http://arxiv.org/abs/1610.09559.

Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33, 2012.

Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. arXiv preprint arXiv:1711.05144, 2017.

Michael P Kim, Omer Reingold, and Guy N Rothblum. Fairness through computationally-bounded awareness. arXiv preprint arXiv:1803.03239, 2018.

Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In Proceedings of the 2017 ACM Conference on Innovations in Theoretical Computer Science, 2017.

Brian Kulis et al. Metric learning: A survey. Foundations and Trends® in Machine Learning, 5(4):287–364, 2013.

Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, and David C Parkes. Calibrated fairness in bandits. arXiv preprint arXiv:1707.01875, 2017.

Ilan Lobel, Renato Paes Leme, and Adrian Vladu. Multidimensional binary search for contextual decision-making. In Proceedings of the 2017 ACM Conference on Economics and Computation, page 585, 2017. doi: 10.1145/3033274.3085100. URL http://doi.acm.org/10.1145/3033274.3085100.

Guy N Rothblum and Gal Yona. Probably approximately metric-fair learning.
arXiv preprint arXiv:1803.03242, 2018.

Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171–1180. International World Wide Web Conferences Steering Committee, 2017.

Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325–333, 2013.