{"title": "Envy-Free Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1240, "page_last": 1250, "abstract": "In classic fair division problems such as cake cutting and rent division, envy-freeness requires that each individual (weakly) prefer his allocation to anyone else's. On a conceptual level, we argue that envy-freeness also provides a compelling notion of fairness for classification tasks, especially when individuals have heterogeneous preferences. Our technical focus is the generalizability of envy-free classification, i.e., understanding whether a classifier that is envy free on a sample would be almost envy free with respect to the underlying distribution with high probability. Our main result establishes that a small sample is sufficient to achieve such guarantees, when the classifier in question is a mixture of deterministic classifiers that belong to a family of low Natarajan dimension.", "full_text": "Envy-Free Classi\ufb01cation\n\nMaria-Florina Balcan\n\nMachine Learning Department\nCarnegie Mellon University\n\nninamf@cs.cmu.edu\n\nRitesh Noothigattu\n\nMachine Learning Department\nCarnegie Mellon University\n\nriteshn@cmu.edu\n\nTravis Dick\n\nComputer Science Department\nCarnegie Mellon University\n\ntdick@cs.cmu.edu\n\nAriel D. Procaccia\n\nComputer Science Department\nCarnegie Mellon University\narielpro@cs.cmu.edu\n\nAbstract\n\nIn classic fair division problems such as cake cutting and rent division, envy-\nfreeness requires that each individual (weakly) prefer his allocation to anyone\nelse\u2019s. On a conceptual level, we argue that envy-freeness also provides a com-\npelling notion of fairness for classi\ufb01cation tasks, especially when individuals have\nheterogeneous preferences. 
Our technical focus is the generalizability of envy-free\nclassi\ufb01cation, i.e., understanding whether a classi\ufb01er that is envy free on a sample\nwould be almost envy free with respect to the underlying distribution with high\nprobability. Our main result establishes that a small sample is suf\ufb01cient to achieve\nsuch guarantees, when the classi\ufb01er in question is a mixture of deterministic\nclassi\ufb01ers that belong to a family of low Natarajan dimension.\n\n1\n\nIntroduction\n\nThe study of fairness in machine learning is driven by an abundance of examples where learning\nalgorithms were perceived as discriminating against protected groups [29, 6]. Addressing this problem\nrequires a conceptual \u2014 perhaps even philosophical \u2014 understanding of what fairness means in\nthis context. In other words, the million dollar question is (arguably1) this: What are the formal\nconstraints that fairness imposes on learning algorithms?\nIn this paper, we propose a new measure of algorithmic fairness. It draws on an extensive body of\nwork on rigorous approaches to fairness, which \u2014 modulo one possible exception (see Section 1.2)\n\u2014 has not been tapped by machine learning researchers: the literature on fair division [3, 20]. The\nmost prominent notion is that of envy-freeness [10, 31], which, in the context of the allocation of\ngoods, requires that the utility of each individual for his allocation be at least as high as his utility\nfor the allocation of any other individual; for six decades, it has been the gold standard of fairness\nfor problems such as cake cutting [25, 24] and rent division [28, 12]. 
In the classi\ufb01cation setting,\nenvy-freeness would simply mean that the utility of each individual for his distribution over outcomes\nis at least as high as his utility for the distribution over outcomes assigned to any other individual.\nIt is important to say upfront that envy-freeness is not suitable for several widely-studied problems\nwhere there are only two possible outcomes, one of which is \u2018good\u2019 and the other \u2018bad\u2019; examples\ninclude predicting whether an individual would default on a loan, and whether an offender would\nrecidivate. In these degenerate cases, envy-freeness would require that the classi\ufb01er assign each and\nevery individual the exact same probability of obtaining the \u2018good\u2019 outcome, which, clearly, is not a\nreasonable constraint.\n\n1Certain papers take a somewhat different view [17].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fBy contrast, we are interested in situations where there is a diverse set of possible outcomes, and\nindividuals have diverse preferences for those outcomes. For example, consider a system responsible\nfor displaying credit card advertisements to individuals. There are many credit cards with different\neligibility requirements, annual rates, and reward programs. An individual\u2019s utility for seeing a\ncard\u2019s advertisement will depend on his eligibility, his bene\ufb01t from the rewards programs, and\npotentially other factors. It may well be the case that an envy-free advertisement assignment shows\nBob advertisements for a card with worse annual rates than those shown to Alice; this outcome is\nnot unfair if Bob is genuinely more interested in the card offered to him. 
Such rich utility functions\nare also evident in the context of job advertisements [6]: people generally want higher paying jobs,\nbut would presumably have higher utility for seeing advertisements for jobs that better \ufb01t their\nquali\ufb01cations and interests.\nA second appealing property of envy-freeness is that its fairness guarantee binds at the level of\nindividuals. Fairness notions can be coarsely characterized as being either individual notions, or\ngroup notions, depending on whether they provide guarantees to speci\ufb01c individuals, or only on\naverage to a protected subgroup. The majority of work on fairness in machine learning focuses on\ngroup fairness [18, 9, 35, 13, 15, 34].\nThere is, however, one well-known example of individual fairness: the in\ufb02uential fair classi\ufb01cation\nmodel of Dwork et al. [9]. The model involves a set of individuals and a set of outcomes. The\ncenterpiece of the model is a similarity metric on the space of individuals; it is speci\ufb01c to the\nclassi\ufb01cation task at hand, and ideally captures the ethical ground truth about relevant attributes. For\nexample, a man and a woman who are similar in every other way should be considered similar for\nthe purpose of credit card offerings, but perhaps not for lingerie advertisements. Assuming such a\nmetric is available, fairness can be naturally formalized as a Lipschitz constraint, which requires that\nindividuals who are close according to the similarity metric be mapped to distributions over outcomes\nthat are close according to some standard metric (such as total variation).\nAs attractive as this model is, it has one clear weakness from a practical viewpoint: the availability\nof a similarity metric. Dwork et al. [9] are well aware of this issue; they write that justifying this\nassumption is \u201cone of the most challenging aspects\u201d of their approach. 
They add that \u201cin reality the\nmetric used will most likely only be society\u2019s current best approximation to the truth.\u201d But, despite\nrecent progress on automating ethical decisions in certain domains [23, 11], the task-speci\ufb01c nature\nof the similarity metric makes even a credible approximation thereof seem unrealistic. In particular,\nif one wanted to learn a similarity metric, it is unclear what type of examples a relevant dataset would\nconsist of.\nIn place of a metric, envy-freeness requires access to individuals\u2019 utility functions, but \u2014 by contrast\n\u2014 we do not view this assumption as a barrier to implementation. Indeed, there are a variety of\ntechniques for learning utility functions [4, 22, 2]. Moreover, in our running example of advertising,\none can use standard measures like expected click-through rate (CTR) as a good proxy for utility.\nIt is worth noting that the classi\ufb01cation setting is different from classic fair division problems in\nthat the \u201cgoods\u201d (outcomes) are non-excludable. In fact, one envy-free solution simply assigns each\nindividual to his favorite outcome. But this solution may be severely suboptimal according to another\n(standard) component of our setting, the loss function, which, in the examples above, might represent\nthe expected revenue from showing an ad to an individual. Typically the loss function is not perfectly\naligned with individual utilities, and, therefore, it may be possible to achieve smaller loss than the\nna\u00efve solution without violating the envy-freeness constraint.\nIn summary, we view envy-freeness as a compelling, well-established, and, importantly, practicable\nnotion of individual fairness for classi\ufb01cation tasks with a diverse set of outcomes when individuals\nhave heterogeneous preferences. 
Our goal is to understand its learning-theoretic properties.\n\n1.1 Our Results\n\nThe challenge is that the space of individuals is potentially huge, yet we seek to provide universal envy-freeness guarantees. To this end, we are given a sample consisting of individuals drawn from an unknown distribution. We are interested in learning algorithms that minimize loss, subject to satisfying the envy-freeness constraint, on the sample. Our primary technical question is that of generalizability, that is, given a classifier that is envy free on a sample, is it approximately envy free on the underlying distribution? Surprisingly, Dwork et al. [9] do not study generalizability in their model, and we are aware of only one subsequent paper that takes a learning-theoretic viewpoint on individual fairness and gives theoretical guarantees (see Section 1.2).\nIn Section 3, we do not constrain the classifier. Therefore, we need some strategy to extend a classifier that is defined on a sample; assigning an individual the same outcome as his nearest neighbor in the sample is a popular choice. However, we show that any strategy for extending a classifier from a sample, on which it is envy free, to the entire set of individuals is unlikely to be approximately envy free on the distribution, unless the sample is exponentially large.\nFor this reason, in Section 4, we focus on structured families of classifiers. On a high level, our goal is to relate the combinatorial richness of the family to generalization guarantees. One obstacle is that standard notions of dimension do not extend to the analysis of randomized classifiers, whose range is distributions over outcomes (equivalently, real vectors). We circumvent this obstacle by considering mixtures of deterministic classifiers that belong to a family of bounded Natarajan dimension (an extension of the well-known VC dimension to multi-class classification). 
Our main theoretical result\nasserts that, under this assumption, envy-freeness on a sample does generalize to the underlying\ndistribution, even if the sample is relatively small (its size grows almost linearly in the Natarajan\ndimension).\nFinally, in Section 5, we design and implement an algorithm that learns (almost) envy-free mixtures\nof linear one-vs-all classi\ufb01ers. We present empirical results that validate our computational approach,\nand indicate good generalization properties even when the sample size is small.\n\n1.2 Related Work\n\nConceptually, our work is most closely related to work by Zafar et al. [34]. They are interested in\ngroup notions of fairness, and advocate preference-based notions instead of parity-based notions. In\nparticular, they assume that each group has a utility function for classi\ufb01ers, and de\ufb01ne the preferred\ntreatment property, which requires that the utility of each group for its own classi\ufb01er be at least its\nutility for the classi\ufb01er assigned to any other group. Their model and results focus on the case of\nbinary classi\ufb01cation where there is a desirable outcome and an undesirable outcome, so the utility of\na group for a classi\ufb01er is simply the fraction of its members that are mapped to the desirable outcome.\nAlthough, at \ufb01rst glance, this notion seems similar to envy-freeness, it is actually fundamentally\ndifferent.2 Our paper is also completely different from that of Zafar et al. in terms of technical\nresults; theirs are purely empirical in nature, and focus on the increase in accuracy obtained when\nparity-based notions of fairness are replaced with preference-based ones.\nConcurrent work by Rothblum and Yona [26] provides generalization guarantees for the metric notion\nof individual fairness introduced by Dwork et al. [9], or, more precisely, for an approximate version\nthereof. 
There are two main differences compared to our work: first, we propose envy-freeness as an alternative notion of fairness that circumvents the need for a similarity metric. Second, they focus on randomized binary classification, which amounts to learning a real-valued function, and so are able to make use of standard Rademacher complexity results to show generalization. By contrast, standard tools do not directly apply in our setting. It is worth noting that several other papers provide generalization guarantees for notions of group fairness, but these are more distantly related to our work [35, 32, 8, 16, 14].\n\n2 The Model\nWe assume that there is a space X of individuals, a finite space Y of outcomes, and a utility function u : X × Y → [0, 1] encoding the preferences of each individual for the outcomes in Y. In the advertising example, individuals are users, outcomes are advertisements, and the utility function reflects the benefit an individual derives from being shown a particular advertisement. For any distribution p ∈ Δ(Y) (where Δ(Y) is the set of distributions over Y) we let u(x, p) = E_{y∼p}[u(x, y)] denote individual x's expected utility for an outcome sampled from p. We refer to a function h : X → Δ(Y) as a classifier, even though it can return a distribution over outcomes.\n\n2On a philosophical level, the fair division literature deals exclusively with individual notions of fairness. In fact, even in group-based extensions of envy-freeness [19] the allocation is shared by groups, but individuals must not be envious. We subscribe to the view that group-oriented notions (such as statistical parity) are objectionable, because the outcome can be patently unfair to individuals.\n\n2.1 Envy-Freeness\nRoughly speaking, a classifier h : X → Δ(Y) is envy free if no individual prefers the outcome distribution of someone else over his own.\nDefinition 1. A classifier h : X → Δ(Y) is envy free (EF) on a set S of individuals if u(x, h(x)) ≥ u(x, h(x′)) for all x, x′ ∈ S. Similarly, h is (α, β)-EF with respect to a distribution P on X if\n\nPr_{x,x′∼P}[ u(x, h(x)) < u(x, h(x′)) − β ] ≤ α.\n\nFinally, h is (α, β)-pairwise EF on a set of pairs of individuals S = {(x_i, x′_i)}_{i=1}^n if\n\n(1/n) Σ_{i=1}^n I{ u(x_i, h(x_i)) < u(x_i, h(x′_i)) − β } ≤ α.\n\nAny classifier that is EF on a sample S of individuals is also (α, β)-pairwise EF on any pairing of the individuals in S, for any α ≥ 0 and β ≥ 0. The weaker pairwise EF condition is all that is required for our generalization guarantees to hold.\n\n2.2 Optimization and Learning\nOur formal learning problem can be stated as follows. Given sample access to an unknown distribution P over individuals X and their utility functions, and a known loss function ℓ : X × Y → [0, 1], find a classifier h : X → Δ(Y) that is (α, β)-EF with respect to P minimizing expected loss E_{x∼P}[ℓ(x, h(x))], where for x ∈ X and p ∈ Δ(Y), ℓ(x, p) = E_{y∼p}[ℓ(x, y)].\nWe follow the empirical risk minimization (ERM) learning approach, i.e., we collect a sample of individuals drawn i.i.d. from P and find an EF classifier with low loss on the sample. Formally, given a sample of individuals S = {x_1, . . . , x_n} and their utility functions u_{x_i}(·) = u(x_i, ·), we are interested in a classifier h : S → Δ(Y) that minimizes Σ_{i=1}^n ℓ(x_i, h(x_i)) among all classifiers that are EF on S.\nRecall that we consider randomized classifiers that can assign a distribution over outcomes to each of the individuals. However, one might wonder whether the EF classifier that minimizes loss on a sample happens to always be deterministic. Or, at least, the optimal deterministic classifier on the sample might incur a loss that is very close to that of the optimal randomized classifier. 
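Before turning to that question, note that the (α, β)-pairwise EF condition of Definition 1 can be estimated directly on a sample. A minimal numerical sketch (NumPy; the array layout and function name are our own illustration, not from the paper):

```python
import numpy as np

def pairwise_ef_violation(U, P, pairs, beta=0.0):
    # U[i, y]: utility u(x_i, y); P[i]: outcome distribution h(x_i) over Y.
    # Returns the fraction of pairs (i, j) with u(x_i, h(x_i)) < u(x_i, h(x_j)) - beta,
    # i.e., the empirical quantity compared against alpha in the pairwise EF definition.
    own = (U * P).sum(axis=1)
    return float(np.mean([own[i] < U[i] @ P[j] - beta for i, j in pairs]))
```

A classifier is (α, β)-pairwise EF on the sampled pairs exactly when this fraction is at most α.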
If this were true, we could restrict ourselves to classifiers of the form h : X → Y, which would be much easier to analyze. Unfortunately, it turns out that this is not the case. In fact, there could be an arbitrary (multiplicative) gap between the optimal randomized EF classifier and the optimal deterministic EF classifier. The intuition behind this is as follows. A deterministic classifier that has very low loss on the sample, but is not EF, would be completely discarded in the deterministic setting. On the other hand, a randomized classifier could take this loss-minimizing deterministic classifier and mix it with a classifier with high "negative envy", so that the mixture ends up being EF and at the same time has low loss. This is made concrete in the following example.\nExample 1. Let S = {x_1, x_2} and Y = {y_1, y_2, y_3}. Let the loss function be such that\n\nℓ(x_1, y_1) = 0, ℓ(x_1, y_2) = 1, ℓ(x_1, y_3) = 1,\nℓ(x_2, y_1) = 1, ℓ(x_2, y_2) = 1, ℓ(x_2, y_3) = 0.\n\nMoreover, let the utility function be such that\n\nu(x_1, y_1) = 0, u(x_1, y_2) = 1, u(x_1, y_3) = 1/c,\nu(x_2, y_1) = 0, u(x_2, y_2) = 0, u(x_2, y_3) = 1,\n\nwhere c > 1. The only deterministic classifier with a loss of 0 is h_0 such that h_0(x_1) = y_1 and h_0(x_2) = y_3. But this is not EF, since u(x_1, y_1) < u(x_1, y_3). Furthermore, every other deterministic classifier has a total loss of at least 1, causing the optimal deterministic EF classifier to have loss of at least 1.\nTo show that randomized classifiers can do much better, consider the randomized classifier h* such that h*(x_1) = (1 − 1/c, 1/c, 0) and h*(x_2) = (0, 0, 1). This classifier can be seen as a mixture of the classifier h_0 of 0 loss, and the deterministic classifier h_e, where h_e(x_1) = y_2 and h_e(x_2) = y_3, which has high "negative envy". One can observe that this classifier h* is EF, and has a loss of just 1/c. Hence, the loss of the optimal randomized EF classifier is c times smaller than the loss of the optimal deterministic one, for any c > 1.\n\n3 Arbitrary Classifiers\n\nAn important (and typical) aspect of our learning problem is that the classifier h needs to provide an outcome distribution for every individual, not just those in the sample. For example, if h chooses advertisements for visitors of a website, the classifier should still apply when a new visitor arrives. Moreover, when we use the classifier for new individuals, it must continue to be EF. In this section, we consider two-stage approaches that first choose outcome distributions for the individuals in the sample, and then extend those decisions to the rest of X.\nIn more detail, we are given a sample S = {x_1, . . . , x_n} of individuals and a classifier h : S → Δ(Y) assigning outcome distributions to each individual. Our goal is to extend these assignments to a classifier h̄ : X → Δ(Y) that can be applied to new individuals as well. For example, h could be the loss-minimizing EF classifier on the sample S.\nFor this section, we assume that X is equipped with a distance metric d. Moreover, we assume that the utility function u is L-Lipschitz on X. That is, for every y ∈ Y and for all x, x′ ∈ X, we have |u(x, y) − u(x′, y)| ≤ L · d(x, x′).\nUnder the foregoing assumptions, one natural way to extend the classifier on the sample to all of X is to assign new individuals the same outcome distribution as their nearest neighbor in the sample. Formally, for a set S ⊆ X and any individual x ∈ X, let NN_S(x) ∈ argmin_{x′∈S} d(x, x′) denote the nearest neighbor of x in S with respect to the metric d (breaking ties arbitrarily). The following simple result (whose proof is relegated to Appendix B) establishes that this approach preserves envy-freeness in cases where the sample is exponentially large.\nTheorem 1. 
Let d be a metric on X, P be a distribution on X, and u be an L-Lipschitz utility function. Let S be a set of individuals such that there exists X̂ ⊆ X with P(X̂) ≥ 1 − α and sup_{x∈X̂} d(x, NN_S(x)) ≤ β/(2L). Then for any classifier h : S → Δ(Y) that is EF on S, the extension h̄ : X → Δ(Y) given by h̄(x) = h(NN_S(x)) is (α, β)-EF on P.\nThe conditions of Theorem 1 require that the set of individuals S is a β/(2L)-net for at least a (1 − α)-fraction of the mass of P on X. In several natural situations, an exponentially large sample guarantees that this occurs with high probability. For example, if X is a subset of R^q, d(x, x′) = ‖x − x′‖_2, and X has diameter at most D, then for any distribution P on X, if S is an i.i.d. sample of size O((1/α) (LD√q/β)^q (q log(LD√q/β) + log(1/δ))), it will satisfy the conditions of Theorem 1 with probability at least 1 − δ. This sampling result is folklore, but, for the sake of completeness, we prove it in Lemma 3 of Appendix B.\nHowever, the exponential upper bound given by the nearest neighbor strategy is as far as we can go in terms of generalizing envy-freeness from a sample (without further assumptions). Specifically, our next result establishes that any algorithm — even randomized — for extending classifiers from the sample to the entire space X requires an exponentially large sample of individuals to ensure envy-freeness on the distribution P. The proof of Theorem 2 can be found in Appendix B.\nTheorem 2. There exists a space of individuals X ⊆ R^q, and a distribution P over X such that, for every randomized algorithm A that extends classifiers on a sample to X, there exists an L-Lipschitz utility function u such that, when a sample of individuals S of size n = 4^q/2 is drawn from P without replacement, there exists an EF classifier on S for which, with probability at least 1 − 2 exp(−4^q/100) − exp(−4^q/200) jointly over the randomness of A and S, its extension by A is not (α, β)-EF with respect to P for any α < 1/25 and β < L/8.\nWe remark that a similar result would hold even if we sampled S with replacement; we sample here without replacement purely for ease of exposition.\n\n4 Low-Complexity Families of Classifiers\n\nIn this section we show that (despite Theorem 2) generalization for envy-freeness is possible using much smaller samples of individuals, as long as we restrict ourselves to classifiers from a family of relatively low complexity.\nIn more detail, two classic complexity measures are the VC-dimension [30] for binary classifiers, and the Natarajan dimension [21] for multi-class classifiers. However, to the best of our knowledge, there is no suitable dimension directly applicable to functions ranging over distributions, which in our case can be seen as |Y|-dimensional real vectors. One possibility would be to restrict ourselves to deterministic classifiers of the type h : X → Y, but we have seen in Section 2 that envy-freeness is a very strong constraint on deterministic classifiers. Instead, we will consider a family H consisting of randomized mixtures of m deterministic classifiers that belong to a family G ⊆ {g : X → Y} of low Natarajan dimension. This allows us to adapt Natarajan-dimension-based generalization results to our setting while still working with randomized classifiers. 
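Concretely, such a mixture assigns each individual the distribution induced by drawing one deterministic classifier according to the mixing weights. A minimal sketch of the induced outcome distribution (names and array layout are our own illustration):

```python
import numpy as np

def mixture_outcome_distribution(components, eta, x, num_outcomes):
    # components: list of m deterministic classifiers g_i mapping x to an outcome index;
    # eta: mixing weights over [m]. Returns the vector with entries
    #   Pr[h(x) = y] = sum_i eta_i * I{g_i(x) = y}.
    p = np.zeros(num_outcomes)
    for g, w in zip(components, eta):
        p[g(x)] += w
    return p
```

Sampling an outcome for x then just means drawing y from this vector.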
The definition and relevant properties of the Natarajan dimension are summarized in Appendix A.\nFormally, let ~g = (g_1, . . . , g_m) ∈ G^m be a vector of m functions in G and η ∈ Δ_m be a distribution over [m], where Δ_m = {p ∈ R^m : p_i ≥ 0, Σ_i p_i = 1} is the m-dimensional probability simplex. Then consider the function h_{~g,η} : X → Δ(Y) with assignment probabilities given by Pr(h_{~g,η}(x) = y) = Σ_{i=1}^m I{g_i(x) = y} η_i. Intuitively, for a given individual x, h_{~g,η} chooses one of the g_i randomly with probability η_i, and outputs g_i(x). Let\n\nH(G, m) = { h_{~g,η} : X → Δ(Y) : ~g ∈ G^m, η ∈ Δ_m }\n\nbe the family of classifiers that can be written this way. Our main technical result shows that envy-freeness generalizes for this class.\nTheorem 3. Suppose G is a family of deterministic classifiers of Natarajan dimension d, and let H = H(G, m) for m ∈ N. For any distribution P over X, γ > 0, and δ > 0, if S = {(x_i, x′_i)}_{i=1}^n is an i.i.d. sample of pairs drawn from P of size\n\nn ≥ O( (1/γ²) ( dm² log( dm|Y| log(m|Y|/γ) / γ ) + log(1/δ) ) ),\n\nthen with probability at least 1 − δ, every classifier h ∈ H that is (α, β)-pairwise-EF on S is also (α + 7γ, β + 4γ)-EF on P.\n\nThe proof of Theorem 3 is relegated to Appendix C. In a nutshell, it consists of two steps. First, we show that envy-freeness generalizes for finite classes. Second, we show that H(G, m) can be approximated by a finite subset.\nWe remark that the theorem is only effective insofar as families of classifiers of low Natarajan dimension are useful. Fortunately, several prominent families indeed have low Natarajan dimension [5], including one-vs-all, multiclass SVM, tree-based classifiers, and error correcting output codes.\n\n5 Implementation and Empirical Validation\n\nSo far we have not directly addressed the problem of computing the loss-minimizing envy-free classifier from a given family on a given sample of individuals. We now turn to this problem. Our goal is not to provide an end-all solution, but rather to provide evidence that computation will not be a long-term obstacle to implementing our approach.\nIn more detail, our computational problem is to find the loss-minimizing classifier h from a given family of randomized classifiers H that is envy free on a given sample of individuals S = {x_1, . . . , x_n}. For this classifier h to generalize to the distribution P, Theorem 3 suggests that the family H to use is of the form H(G, m), where G is a family of deterministic classifiers of low Natarajan dimension. In this section, we let G be the family of linear one-vs-all classifiers. In particular, denoting X ⊆ R^q, each g ∈ G is parameterized by ~w = (w_1, w_2, . . . , w_{|Y|}) ∈ R^{|Y|×q}, where g(x) = argmax_{y∈Y} w_y^T x. This class G has a Natarajan dimension of at most q|Y|. The optimization problem to solve in this case is\n\nmin_{~g∈G^m, η∈Δ_m} Σ_{i=1}^n Σ_{k=1}^m η_k L(x_i, g_k(x_i))\ns.t. Σ_{k=1}^m η_k u(x_i, g_k(x_i)) ≥ Σ_{k=1}^m η_k u(x_i, g_k(x_j)) for all (i, j) ∈ [n]².   (1)\n\n5.1 Algorithm\nObserve that optimization problem (1) is highly non-convex and non-differentiable as formulated, because of the argmax computed in each of the g_k(x_i). Another challenge is the combinatorial nature of the problem, as we need to find m functions from G along with their mixing weights. In designing an algorithm, therefore, we employ several tricks of the trade to achieve tractability.\nLearning the mixture components. We first assume predefined mixing weights η̃, and iteratively learn mixture components based on them. 
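The control flow of this greedy scheme is simple; a sketch (the component solver is a hypothetical stand-in for the convex subproblem, not the paper's code):

```python
def learn_mixture_components(fit_next_component, m):
    # Greedy scheme: with the mixing weights fixed in advance, add one
    # deterministic component at a time, each fit given those already chosen.
    components = []
    for _ in range(m):
        components.append(fit_next_component(components))
    return components

# Stand-in solver: just records how many components preceded it.
learned = learn_mixture_components(lambda prev: f"g{len(prev) + 1}", 3)
```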
Specifically, let g_1, g_2, . . . , g_{k−1} denote the classifiers learned so far. To compute the next component g_k, we solve the optimization problem (1) with these components already in place (and assuming no future ones). This induces the following optimization problem:\n\nmin_{g_k∈G} Σ_{i=1}^n L(x_i, g_k(x_i))\ns.t. USF^{(k−1)}_{ii} + η̃_k u(x_i, g_k(x_i)) ≥ USF^{(k−1)}_{ij} + η̃_k u(x_i, g_k(x_j)) for all (i, j) ∈ [n]²,   (2)\n\nwhere USF^{(k−1)}_{ij} denotes the expected utility i has for j's assignments so far, i.e., USF^{(k−1)}_{ij} = Σ_{c=1}^{k−1} η̃_c u(x_i, g_c(x_j)).\nSolving the optimization problem (2) is still non-trivial because it remains non-convex and non-differentiable. To resolve this, we first soften the constraints3. Writing out the optimization problem in the form equivalent to introducing slack variables, we obtain\n\nmin_{g_k∈G} Σ_{i=1}^n L(x_i, g_k(x_i)) + λ Σ_{i≠j} max( USF^{(k−1)}_{ij} + η̃_k u(x_i, g_k(x_j)) − USF^{(k−1)}_{ii} − η̃_k u(x_i, g_k(x_i)), 0 ),   (3)\n\nwhere λ is a parameter that defines the trade-off between loss and envy-freeness. This optimization problem is still highly non-convex as g_k(x_i) = argmax_{y∈Y} w_y^T x_i, where ~w denotes the parameters of g_k. To solve this, we perform a convex relaxation on several components of the objective using the fact that w_{g_k(x_i)}^T x_i ≥ w_{y′}^T x_i for any y′ ∈ Y. Specifically, we have\n\nL(x_i, g_k(x_i)) ≤ max_{y∈Y} [L(x_i, y) + w_y^T x_i] − w_{y_i}^T x_i,\n−u(x_i, g_k(x_i)) ≤ max_{y∈Y} [−u(x_i, y) + w_y^T x_i] − w_{b_i}^T x_i, and\nu(x_i, g_k(x_j)) ≤ max_{y∈Y} [u(x_i, y) + w_y^T x_j] − w_{s_i}^T x_j,\n\nwhere y_i = argmin_{y∈Y} L(x_i, y), s_i = argmin_{y∈Y} u(x_i, y) and b_i = argmax_{y∈Y} u(x_i, y). While we provided the key steps here, complete details and the rationale behind these choices are given in Appendix D. On a very high level, these are inspired by multi-class SVMs. 
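Upper bounds of this kind can be checked numerically; a small sketch of the loss relaxation, with hypothetical names (W holds the rows w_y):

```python
import numpy as np

def loss_surrogate(W, x, losses):
    # Convex upper bound on L(x, g(x)) for g(x) = argmax_y w_y^T x:
    #   max_y [L(x, y) + w_y^T x] - w_{y_i}^T x,  with y_i = argmin_y L(x, y).
    scores = W @ x
    return float(np.max(losses + scores) - scores[np.argmin(losses)])

def true_loss(W, x, losses):
    # The non-convex quantity being bounded: L(x, argmax_y w_y^T x).
    return float(losses[np.argmax(W @ x)])
```

Because the score of g(x) dominates every other score, the surrogate always sits at or above the true loss, and as a pointwise maximum of affine functions of W it is convex.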
Finally, plugging these relaxations into (3), we obtain the following convex optimization problem to compute each mixture component:\n\nmin_{~w∈R^{|Y|×q}} Σ_{i=1}^n max_{y∈Y} [L(x_i, y) + w_y^T x_i] − w_{y_i}^T x_i + λ Σ_{i≠j} max( USF^{(k−1)}_{ij} + η̃_k (max_{y∈Y} [u(x_i, y) + w_y^T x_j] − w_{s_i}^T x_j) − USF^{(k−1)}_{ii} + η̃_k (max_{y∈Y} [−u(x_i, y) + w_y^T x_i] − w_{b_i}^T x_i), 0 ).   (4)\n\nLearning the mixing weights. Once the mixture components ~g are learned (with respect to the predefined mixing weights η̃), we perform an additional round of optimization to learn the optimal weights η for them. This can be done via the following linear program:\n\nmin_{η∈Δ_m, ξ∈R^{n×n}_{≥0}} Σ_{i=1}^n Σ_{k=1}^m η_k L(x_i, g_k(x_i)) + λ Σ_{i≠j} ξ_{ij}\ns.t. Σ_{k=1}^m η_k u(x_i, g_k(x_i)) ≥ Σ_{k=1}^m η_k u(x_i, g_k(x_j)) − ξ_{ij} for all (i, j).   (5)\n\n3This may lead to solutions that are not exactly EF on the sample. Nonetheless, Theorem 3 still guarantees that there should not be much additional envy on the testing data.\n\n5.2 Methodology\nTo validate our approach, we have implemented our algorithm. However, we cannot rely on standard datasets, as we need access to both the features and the utility functions of individuals. Hence, we rely on synthetic data. All our code is included as supplementary material. Our experiments are carried out on a desktop machine with 16GB memory and an Intel Xeon(R) CPU E5-1603 v3 @ 2.80GHz×4 processor. To solve convex optimization problems, we use CVXPY [7, 1].\nIn our experiments, we cannot compute the optimal solution to the original optimization problem (1), and there are no existing methods we can use as benchmarks. Hence, we generate the dataset such that we know the optimal solution upfront.\nSpecifically, to generate the whole dataset (both training and test), we first generate random classifiers ~g* ∈ G^m by sampling their parameters ~w_1, . . . , ~w_m ∼ N(0, 1)^{|Y|×q}, and generate η* ∈ Δ_m by drawing uniformly random weights in [0, 1] and normalizing. We use h_{~g*,η*} as the optimal solution of the dataset we generate. For each individual, we sample each feature value independently and u.a.r. in [0, 1]. For each individual x and outcome y, we set L(x, y) = 0 if y ∈ {g*_k(x) : k ∈ [m]}, and otherwise we sample L(x, y) u.a.r. in [0, 1]. For the utility function u, we need to generate it such that the randomized classifier h_{~g*,η*} is envy free on the dataset. For this, we set up a linear program and compute each of the values u(x, y). Hence, h_{~g*,η*} is envy free and has zero loss, so it is obviously the optimal solution. The dataset is split into 75% training data (to measure the accuracy of our solution to the optimization problem) and 25% test data (to evaluate generalizability).\nFor our experiments, we use the following parameters: |Y| = 10, q = 10, m = 5, and λ = 10.0. We set the predefined weights to be η̃ = [1/2, 1/4, . . . , 1/2^{m−1}, 1/2^{m−1}].4 In our experiments we vary the number of individuals, and each result is averaged over 25 runs. On each run, we generate a new ground-truth classifier h_{~g*,η*}, as well as new individuals, losses, and utilities.\n\n5.3 Results\nFigure 1 shows the time taken to compute the mixture components ~g and the optimal weights η, as the number of individuals in the training data increases. As we will see shortly, even though the η computation takes a very small fraction of the time, it can lead to non-negligible gains in terms of loss and envy.\nFigure 2 shows the average loss attained on the training and test data by the algorithm immediately after computing the mixture components, and after the round of η optimization. 
It also shows the average loss attained (on both the training and test data) by a random allocation, which serves as a naïve benchmark for calibration purposes. Recall that the optimal assignment h_{~g*,η*} has loss 0. For both the training and testing individuals, optimizing η improves the loss of the learned classifier. Moreover, our algorithms achieve low training errors for all dataset sizes, and as the dataset grows the testing error converges to the training error.

Figure 3 shows the average envy among pairs in the training data and test data, where, for each pair, negative envy is replaced with 0, to avoid obfuscating positive envy. The graph also depicts the average envy attained (on both the training and test data) by a random allocation. As for the losses, optimizing η results in lower average envy, and as the training set grows we see the generalization gap decrease.

⁴The reason for using an exponential decay is so that the subsequent classifiers learned are different from the previous ones. Using smaller weights might cause consecutive classifiers to be identical, thereby 'wasting' some of the components.

[Figure 1: The algorithm's running time.]
[Figure 2: Training and test loss. Shaded error bands depict 95% confidence intervals.]
[Figure 3: Training and test envy, as a function of the number of individuals. Shaded error bands depict 95% confidence intervals.]
[Figure 4: CDF of training and test envy for 100 training individuals.]

In Figure 4 we zoom in on the case of 100 training individuals, and observe the empirical CDF of envy values. Interestingly, the optimal randomized classifier h_{~g*,η*} shows lower negative envy values compared to the other algorithms, but as expected has no positive-envy pairs. Looking at the positive envy values, we again see very encouraging results.
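To make the quantity behind these CDF curves concrete: given the matrix of mixture utilities, the clipped pairwise envy values and the fraction of pairs below each envy threshold can be computed directly. The following numpy sketch is illustrative only (u_mix and the helper name are our own, not from the paper's code):

```python
import numpy as np


def envy_cdf(u_mix, thresholds):
    """Fraction of ordered pairs (i, j), i != j, whose clipped envy is at most
    each threshold. u_mix[i, j] is individual i's expected utility for
    individual j's randomized outcome; u_mix[i, i] is for i's own outcome."""
    n = u_mix.shape[0]
    own = np.diag(u_mix)[:, None]
    envy = np.maximum(u_mix - own, 0.0)   # negative envy clipped to 0
    mask = ~np.eye(n, dtype=bool)         # ignore self-pairs on the diagonal
    vals = envy[mask]
    return [float(np.mean(vals <= t)) for t in thresholds]


# Illustrative usage with a random utility matrix.
rng = np.random.default_rng(0)
u_mix = rng.uniform(size=(5, 5))
fracs = envy_cdf(u_mix, [0.05, 0.1, 0.5])
```

Evaluating this function at a grid of thresholds on the training and test utility matrices yields exactly the kind of empirical CDF plotted in Figure 4.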
In particular, for at least a 0.946 fraction of the pairs in the training data, we obtain envy of at most 0.05, and this generalizes to the test data, where for at least a 0.939 fraction of the pairs, we obtain envy of at most 0.1.

In summary, these results indicate that the algorithm described in Section 5.1 solves the optimization problem (1) for linear one-vs-all classifiers almost optimally, and that its output generalizes well even when the training set is small.

6 Conclusion

In this paper we propose EF as a suitable fairness notion for learning tasks with many outcomes over which individuals have heterogeneous preferences. We provide generalization guarantees for a rich family of classifiers, showing that if we find a classifier that is envy-free on a sample of individuals, it will remain envy-free when we apply it to new individuals from the same distribution. This result circumvents an exponential lower bound on the sample complexity suffered by any two-stage learning algorithm that first finds an EF assignment for the sample and then extends it to the entire space. Finally, we empirically demonstrate that finding low-envy and low-loss classifiers is computationally tractable. These results show that envy-freeness is a practical notion of fairness for machine learning systems.

Acknowledgments

This work was partially supported by the National Science Foundation under grants IIS-1350598, IIS-1714140, IIS-1618714, IIS-1901403, CCF-1525932, CCF-1733556, CCF-1535967, and CCF-1910321; by the Office of Naval Research under grants N00014-16-1-3075 and N00014-17-1-2428; and by a J.P. Morgan AI Research Award, an Amazon Research Award, a Microsoft Research Faculty Fellowship, a Bloomberg Data Science research grant, a Guggenheim Fellowship, and a grant from the Block Center for Technology and Society.

References

[1] A. Agrawal, R. Verschueren, S. Diamond, and S. Boyd.
A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.

[2] M.-F. Balcan, F. Constantin, S. Iwata, and L. Wang. Learning valuation functions. In Proceedings of the 25th Conference on Computational Learning Theory (COLT), pages 4.1–4.24, 2012.

[3] S. J. Brams and A. D. Taylor. Fair Division: From Cake-Cutting to Dispute Resolution. Cambridge University Press, 1996.

[4] U. Chajewska, D. Koller, and D. Ormoneit. Learning an agent's utility function by observing behavior. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 35–42, 2001.

[5] A. Daniely, S. Sabato, and S. Shalev-Shwartz. Multiclass learning approaches: A theoretical comparison with implications. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS), pages 485–493, 2012.

[6] A. Datta, M. C. Tschantz, and A. Datta. Automated experiments on ad privacy settings: A tale of opacity, choice, and discrimination. In Proceedings of the 15th Privacy Enhancing Technologies Symposium (PETS), pages 92–112, 2015.

[7] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.

[8] M. Donini, L. Oneto, S. Ben-David, J. Shawe-Taylor, and M. Pontil. Empirical risk minimization under fairness constraints. arXiv:1802.08626, 2018.

[9] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. S. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS), pages 214–226, 2012.

[10] D. Foley. Resource allocation and the public sector. Yale Economics Essays, 7:45–98, 1967.

[11] R. Freedman, J. Schaich Borg, W. Sinnott-Armstrong, J. P. Dickerson, and V. Conitzer. Adapting a kidney exchange algorithm to align with human values.
In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), pages 1636–1645, 2018.

[12] Y. Gal, M. Mash, A. D. Procaccia, and Y. Zick. Which is the fairest (rent division) of them all? Journal of the ACM, 64(6): article 39, 2017.

[13] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems (NIPS), pages 3315–3323, 2016.

[14] Ú. Hébert-Johnson, M. P. Kim, O. Reingold, and G. N. Rothblum. Calibration for the (computationally-identifiable) masses. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018. Forthcoming.

[15] M. Joseph, M. Kearns, J. Morgenstern, and A. Roth. Fairness in learning: Classic and contextual bandits. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems (NIPS), pages 325–333, 2016.

[16] M. Kearns, S. Neel, A. Roth, and S. Wu. Computing parametric ranking models via rank-breaking. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

[17] N. Kilbertus, M. Rojas-Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), pages 656–666, 2017.

[18] B. T. Luong, S. Ruggieri, and F. Turini. k-NN as an implementation of situation testing for discrimination discovery and prevention. In Proceedings of the 17th International Conference on Knowledge Discovery and Data Mining (KDD), pages 502–510, 2011.

[19] P. Manurangsi and W. Suksompong. Asymptotic existence of fair divisions for groups. Mathematical Social Sciences, 89:100–108, 2017.

[20] H. Moulin. Fair Division and Collective Welfare. MIT Press, 2003.

[21] B. K. Natarajan. On learning sets and functions.
Machine Learning, 4(1):67–97, 1989.

[22] T. D. Nielsen and F. V. Jensen. Learning a decision maker's utility function from (possibly) inconsistent behavior. Artificial Intelligence, 160(1–2):53–78, 2004.

[23] R. Noothigattu, S. S. Gaikwad, E. Awad, S. Dsouza, I. Rahwan, P. Ravikumar, and A. D. Procaccia. A voting-based system for ethical decision making. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), pages 1587–1594, 2018.

[24] A. D. Procaccia. Cake cutting: Not just child's play. Communications of the ACM, 56(7):78–87, 2013.

[25] J. M. Robertson and W. A. Webb. Cake Cutting Algorithms: Be Fair If You Can. A. K. Peters, 1998.

[26] G. N. Rothblum and G. Yona. Probably approximately metric-fair learning. arXiv:1803.03242, 2018.

[27] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[28] F. E. Su. Rental harmony: Sperner's lemma in fair division. American Mathematical Monthly, 106(10):930–942, 1999.

[29] L. Sweeney. Discrimination in online ad delivery. Communications of the ACM, 56(5):44–54, 2013.

[30] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.

[31] H. Varian. Equity, envy and efficiency. Journal of Economic Theory, 9:63–91, 1974.

[32] B. Woodworth, S. Gunasekar, M. I. Ohannessian, and N. Srebro. Learning non-discriminatory predictors. In Proceedings of the 30th Conference on Computational Learning Theory (COLT), pages 1920–1953, 2017.

[33] A. C. Yao. Probabilistic computations: Towards a unified measure of complexity. In Proceedings of the 17th Symposium on Foundations of Computer Science (FOCS), pages 222–227, 1977.

[34] M. B. Zafar, I. Valera, M. Gomez-Rodriguez, K. P.
Gummadi, and A. Weller. From parity to preference-based notions of fairness in classification. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), pages 228–238, 2017.

[35] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 325–333, 2013.