{"title": "On Making Stochastic Classifiers Deterministic", "book": "Advances in Neural Information Processing Systems", "page_first": 10912, "page_last": 10922, "abstract": "Stochastic classifiers arise in a number of machine learning problems, and have become especially prominent of late, as they often result from constrained optimization problems, e.g. for fairness, churn, or custom losses. Despite their utility, the inherent randomness of stochastic classifiers can make them problematic to use in practice, for a variety of reasons. In this paper, we attempt to answer the theoretical question of how well a stochastic classifier can be approximated by a deterministic one, and compare several different approaches, proving lower and upper bounds. We also experimentally investigate the pros and cons of these methods, not only in regard to how successfully each deterministic classifier approximates the original stochastic classifier, but also in terms of how well each addresses the other issues that can make stochastic classifiers undesirable.", "full_text": "On Making Stochastic Classifiers Deterministic

Andrew Cotter, Harikrishna Narasimhan, Maya Gupta
Google Research
1600 Amphitheatre Pkwy, Mountain View, CA 94043
{acotter,hnarasimhan,mayagupta}@google.com

Abstract

Stochastic classifiers arise in a number of machine learning problems, and have become especially prominent of late, as they often result from constrained optimization problems, e.g. for fairness, churn, or custom losses. Despite their utility, the inherent randomness of stochastic classifiers can make them problematic to use in practice, for a variety of reasons. In this paper, we attempt to answer the theoretical question of how well a stochastic classifier can be approximated by a deterministic one, and compare several different approaches, proving lower and upper bounds.
We also experimentally investigate the pros and cons of these methods, not only in regard to how successfully each deterministic classifier approximates the original stochastic classifier, but also in terms of how well each addresses the other issues that can make stochastic classifiers undesirable.

1 Introduction

Stochastic classifiers arise in a variety of machine learning problems. For example, they are produced by constrained training problems [1–5], where one seeks to optimize a classification objective subject to goals such as fairness, recall, and churn. The use of stochastic classifiers turns out to be crucial in making such constrained optimization problems tractable, due to the potentially non-convex nature of the constraints [4]. For similar reasons, stochastic classifiers are important for robust optimization [6], and for optimizing custom evaluation metrics such as the G-mean or H-mean metrics popular in class-imbalanced classification tasks [7–12]. Stochastic classifiers also arise in the PAC-Bayes literature [e.g. 13–16] and in ensemble learning [17].

Despite their utility in theory, the inherent randomness of stochastic classifiers may be problematic in practice. In some cases, practitioners may object to stochastic classifiers on ethical grounds, because they are difficult to debug, test, and visualize, or because of the added complexity that they can bring to a real-world production system. Worse, in some settings, it might simply not make sense to use a stochastic classifier. For example, suppose that a classifier is trained to filter spam from emails, and that, when applied once to an email, it accurately rejects spam 99% of the time.
If a stochastic classifier is used, then a spammer could simply send hundreds of copies of the same message, confident that some will randomly pass through the stochastic classifier.

Similarly, although stochastic classifiers often arise from optimizing for statistical fairness measures, they may seem unfair because their randomness can cause them to fail another popular fairness principle: that similar individuals should receive similar outcomes [18]. Indeed, when using a stochastic classifier, even the same example may receive different outcomes if it is classified twice.

For all of these reasons, stochastic classifiers can be undesirable, but they are often difficult to avoid. For example, when solving constrained optimization problems subject to non-convex constraints, as in the statistical fairness setting, all algorithms with theoretical guarantees that we are aware of produce stochastic classifiers [e.g. 3–5].*

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper we investigate the question of how to make a given stochastic classifier deterministic, what issues arise, and what criteria can be used to judge the result. Section 2 defines our terms and notation, and makes our first contribution: a precise statement of what it means to say that a deterministic classifier is a good approximation to a stochastic classifier. Our second contribution, in Section 2.1, is to prove a lower bound on how well a deterministic classifier can perform, measured in these terms. In Section 2.2, we discuss how the standard thresholding approach performs.
In Section 2.3 we consider a hashing approach, which is regarded in folklore as an obvious way to make a stochastic classifier deterministic, and as our third contribution we prove that hashing enjoys a performance guarantee that compares favorably to our lower bound.

Our fourth contribution is delineating, in Section 3, other design criteria that determine whether a deterministic classifier will be satisfying to practitioners. As a fifth contribution, in Section 3.3 we suggest a variant of hashing, and explain how it allows one to control how well the resulting classifier satisfies these other design criteria. Next, we focus on the important special case of stochastic ensembles, and as a sixth contribution, we propose an alternative, more intuitive variable binning strategy for making them deterministic. We conclude, in Section 5, with experiments on six datasets comparing these strategies on different problems where stochastic classifiers arise.

2 Stochastic Classifiers

Let X be the instance space, with D_x the associated data distribution, and Y = {0, 1} the label space (this is the binary classification setting), with D_{y|x} the conditional label distribution. We write the resulting joint distribution as D_{xy}. Deterministic classifiers will always be written with hats (e.g. f̂), and stochastic classifiers without hats (e.g. f). A stochastic binary classifier is a function f : X → [0, 1] mapping each instance x to the probability of making a positive prediction.

Our goal is to find a deterministic classifier f̂ : X → {0, 1} that approximates f, but we first must clarify what precisely would constitute a "good approximation". To this end, we define a rate metric as a pair (ℓ, X_ℓ), where ℓ : {0, 1} × {0, 1} → {0, 1} is a binary loss function and X_ℓ ⊆ X is the subset of the instance space on which this loss should be evaluated.
Such rate metrics are surprisingly flexible, and cover a broad set of tasks that are of interest to practitioners [e.g. 1, 2]. For example, in a fairness problem based on a demographic parity constraint [20], we might be interested in the positive prediction rate (ℓ) on members of a certain protected class (X_ℓ).

We denote the value of a metric as E_ℓ(f) := E_{x,y}[f(x)·ℓ(1, y) + (1 − f(x))·ℓ(0, y) | x ∈ X_ℓ] for a stochastic classifier f, and as E_ℓ(f̂) := E_{x,y}[ℓ(f̂(x), y) | x ∈ X_ℓ] for a deterministic f̂. We will generally be concerned with several designated metrics ℓ_1, ..., ℓ_m, each of which captures some property of f that should be preserved (i.e. we want E_{ℓ_i}(f) ≈ E_{ℓ_i}(f̂) for all i ∈ [m]). Typically, the set of metrics will depend on the original learning problem. For example, if we found f by minimizing the false positive rate (FPR) subject to FNR and churn constraints, then the relevant metrics would presumably include FPR, FNR and churn. The key to our approach is that we do not attempt to find a deterministic function that approximates a stochastic classifier pointwise: rather, we require only that it perform well w.r.t. metrics that aggregate over swaths of the data.

While it might be tempting to formulate the search for f̂ as an explicit optimization problem, the only appropriate techniques we're aware of are constrained solvers, which themselves produce stochastic classifiers [2–4]. Instead, we focus on problem-agnostic strategies that are easy to implement, but that, despite their simplicity, often enjoy good theoretical guarantees and perform well in practice.

2.1 Lower Bound

Before we discuss techniques for creating a deterministic classifier from a stochastic one, we'd like to understand the extent to which this is possible.
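To make the rate-metric definitions concrete, here is a minimal sketch (ours, not the paper's code) of how E_ℓ(f) and E_ℓ(f̂) can be estimated from a sample. The FPR example, with X_ℓ taken to be the negatively labeled instances, is one concrete choice of (ℓ, X_ℓ); the helper names are hypothetical.

```python
def rate_metric_stochastic(f, examples, loss):
    # E_l(f) = E[f(x) * l(1, y) + (1 - f(x)) * l(0, y) | x in X_l]:
    # `examples` holds only the (x, y) pairs falling in the metric's subset X_l.
    vals = [f(x) * loss(1, y) + (1 - f(x)) * loss(0, y) for x, y in examples]
    return sum(vals) / len(vals)

def rate_metric_deterministic(f_hat, examples, loss):
    # E_l(f_hat) = E[l(f_hat(x), y) | x in X_l] for a deterministic classifier.
    vals = [loss(f_hat(x), y) for x, y in examples]
    return sum(vals) / len(vals)

def fpr_loss(y_hat, y):
    # With X_l = {negatively labeled instances}, l(y_hat, y) = 1{y_hat = 1}
    # makes (l, X_l) the false positive rate.
    return 1.0 if y_hat == 1 else 0.0
```

For instance, if f(x) = 0.51 on every negative instance, then E_FPR(f) = 0.51 while thresholding at 1/2 gives E_FPR(f̂) = 1, which is exactly the kind of gap discussed in Section 2.2.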
Our first result, therefore, is a lower bound:

*Alternatives that do not explicitly perform constrained optimization (e.g. [19], which instead attempts to find a simple "correction" to an existing classifier) can be immune to this problem.

Theorem 1. For a given instance space X, data distribution D_x, metric subset X_ℓ ⊆ X and stochastic classifier f, there exists a metric loss ℓ and conditional label distribution D_{y|x} such that:

    |E_ℓ(f) − E_ℓ(f̂)| ≥ max_{x ∈ X_ℓ} { Pr_{x' ∼ D_{x|X_ℓ}}{x' = x} · min{f(x), 1 − f(x)} }

for all deterministic classifiers f̂, where D_{x|X_ℓ} is the data distribution D_x restricted to X_ℓ.

Proof. In Appendix B.1.

This result is straightforward to prove, but neatly illustrates the two main obstacles to finding a good deterministic f̂: (i) point masses (the Pr_{x' ∼ D_{x|X_ℓ}}{x' = x} term), and (ii) stochasticity (the min{f(x), 1 − f(x)} term). If f contains too much stochasticity on a large point mass, then it will not be possible to approximate it well with a deterministic f̂.

In Section 2.3, we will show that the converse of the above statement roughly holds: if either the probability mass or the stochasticity of f on point masses approaches zero, then it is possible to find a deterministic classifier for which the errors of our metrics will, likewise, approach zero.

2.2 Thresholding

Thresholding is the "standard" approach for converting a stochastic binary classifier into a deterministic one: if f(x) > 1/2, then we make a positive prediction, and a negative prediction otherwise. If the label truly is drawn randomly according to f(x), then thresholding yields the Bayes classifier, and hence minimizes the expected number of misclassifications [21]. For any choice of loss ℓ, there is an intuitive upper bound on thresholding's performance:

Theorem 2. Let f : X → [0, 1] be a stochastic classifier, and D_x a data distribution on X. Define the thresholding of the stochastic classifier as f̂(x) := 1{f(x) > 1/2}. Then for any metric (ℓ, X_ℓ) and associated conditional label distribution D_{y|x}:

    |E_ℓ(f) − E_ℓ(f̂)| ≤ E_{x ∼ D_{x|X_ℓ}}[min{f(x), 1 − f(x)}]

where D_{x|X_ℓ} is the data distribution D_x restricted to X_ℓ.

Proof. In Appendix B.2.

This upper bound confirms that the closer the original stochastic f comes to being deterministic, the better the thresholded deterministic classifier f̂ will mimic it. However, unlike the lower bound of Theorem 1, the thresholding approach does not improve as point masses shrink. Indeed, even for a continuous data distribution D_x (i.e. no point masses), the thresholded f̂ could perform very poorly. For example, if f(x) = 0.51 for every x, then f̂ will always make a positive prediction, unlike the original stochastic classifier, which makes a negative prediction 49% of the time.

2.3 Hashing

To improve upon thresholding, we would like to choose f̂ in such a way that its performance improves not only as the stochasticity of f decreases, but also as the point masses in D_x shrink. To this end, we propose "simulating" the randomness of a stochastic classifier by hashing the input features to deterministically generate a random-seeming number. The high-level idea is that even if a classifier makes a deterministic decision on a given instance x, by making dissimilar predictions on instances that are close to x, the classifier can give the illusion of being stochastic from the perspective of aggregate rate metrics. In this section, we will show that with the appropriate type of hash function (defined below), we can tightly bound the performance of the resulting deterministic classifier.

Definition 1 (Pairwise Independence). A family H of hash functions h : C →
[k] on a finite set C is pairwise independent if, for all c, c' ∈ C and i, i' ∈ [k], we have that Pr_{h ∼ Unif(H)}{(h(c) = i) ∧ (h(c') = i')} = 1/k² whenever c ≠ c'.

At first glance, this might seem like a fairly strong property, but it's actually quite simple to construct a pairwise independent hash function from a logarithmic number (in |C| and k) of random bits (see Claim 1 in Appendix B.3 for an example).

Notice that we define a hash function on a set of "clusters" C, instead of on X itself. This handles the case in which X is an infinite set (e.g. R^d), and allows us to define a finite C and an associated mapping π : X → C, the result of which, π(x), is what we hash. In practice, X will be finite anyway (e.g. d-dimensional vectors of floating-point numbers), and one is then free to choose C = X and take π to be the identity function. Even in the finite case, however, it may be beneficial to pre-assign instances to clusters before hashing, as we will discuss in Section 3.

Theorem 3. Let f : X → [0, 1] be a stochastic classifier, and D_x a data distribution on X. Suppose that we're given m metrics (ℓ_i, X_{ℓ_i}) for i ∈ [m], each of which is potentially associated with a different conditional label distribution D_{y_i|x}. Take H to be a pairwise independent family of hash functions h : C → [k], and π : X → C to be a function that pre-assigns instances to clusters before hashing.

Sample h ∼ Unif(H), and define the deterministic classifier f̂_h : X → {0, 1} as:

    f̂_h(x) = 1{ f(x) ≥ (2h(π(x)) − 1) / 2k }

where the expression (2h(π(x)) − 1)/2k maps [k] (the range of h) into [0, 1].

Then, with probability at least 1 − δ over the sampling of h ∼ Unif(H), for all i ∈ [m]:

    |E_{ℓ_i}(f) − E_{ℓ_i}(f̂_h)| < 1/2k + ( (m/δ) Σ_{c ∈ C} ( (Pr_{x ∼ D_{x|X_{ℓ_i}}}{π(x) = c})² · E_{x ∼ D_{x|X_{ℓ_i}}}[ 1/2k + f(x)(1 − f(x)) | π(x) = c ] ) )^{1/2}

where D_{x|X_{ℓ_i}} is the data distribution D_x restricted to X_{ℓ_i}.

Proof. In Appendix B.3.

Notice that 1/2k approaches zero as the number of hash buckets k increases. These terms aside, the upper bound of Theorem 3 has strong similarities to the lower bound of Theorem 1†, particularly in light of the fact that pre-clustering is optional. The main differences are that: (i) point masses (the Pr_{x ∼ D_{x|X_{ℓ_i}}}{π(x) = c} terms) are measured over entire clusters c ∈ C, instead of merely instances x ∈ X, (ii) we take the ℓ2 norm over point masses, instead of maximizing over them, and (iii) stochasticity is measured with an expected variance E_{x ∼ D_{x|X_{ℓ_i}}}[f(x)(1 − f(x)) | π(x) = c] over a cluster, instead of min{f(x), 1 − f(x)}.

Most importantly, and unlike for the thresholding approach of Section 2.2, the key properties of our lower bound are present when using hashing.
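As a minimal illustrative sketch (ours, not the paper's implementation), the classic affine family h(c) = ((a·c + b) mod P) mod k, with P a large prime, is approximately pairwise independent; the paper's Claim 1 in Appendix B.3 gives an exact construction. Plugging such an h into the classification rule of Theorem 3:

```python
import random

P = 2_305_843_009_213_693_951  # a large Mersenne prime, assumed to exceed |C|

def sample_hash(k, rng):
    # h(c) = ((a * c + b) mod P) mod k + 1, with range [k] = {1, ..., k}.
    # This affine family is only approximately pairwise independent for P >> k;
    # it stands in for the exact construction of Claim 1 (Appendix B.3).
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda c: ((a * c + b) % P) % k + 1

def hashing_classifier(f, pi, h, k):
    # Theorem 3's deterministic classifier:
    #   f_hat_h(x) = 1{ f(x) >= (2 h(pi(x)) - 1) / (2k) }.
    # (2 h(pi(x)) - 1) / (2k) acts as a deterministic stand-in for a
    # uniform draw in (0, 1), so the *rate* of positive predictions
    # over many distinct inputs is close to the average of f.
    return lambda x: 1 if f(x) >= (2 * h(pi(x)) - 1) / (2 * k) else 0
```

With f(x) = 0.51 constant (the case where thresholding fails badly), each prediction of f̂_h is deterministic, yet its positive-prediction rate over many distinct inputs stays close to 0.51.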
It will be easier to see this if we loosen Theorem 3 by separately bounding (i) the stochasticity, as f(x)(1 − f(x)) ≤ 1/4 (yielding the first term in the min below), or (ii) the point masses, as (Pr_{x ∼ D_{x|X_{ℓ_i}}}{π(x) = c})² ≤ Pr_{x ∼ D_{x|X_{ℓ_i}}}{π(x) = c} (the second):

    |E_{ℓ_i}(f) − E_{ℓ_i}(f̂_h)| < 1/2k + (m/2kδ)^{1/2} + (m/δ)^{1/2} · min{ (1/2)·( Σ_{c ∈ C} (Pr_{x ∼ D_{x|X_{ℓ_i}}}{π(x) = c})² )^{1/2}, ( E_{x ∼ D_{x|X_{ℓ_i}}}[f(x)(1 − f(x))] )^{1/2} }

Ignoring the first two additive terms (recall that we can choose k), if the distribution over clusters c ∈ C is approximately uniform, then the bound goes to zero as the number of clusters increases, at roughly a 1/√|C| rate. Likewise, as the variance E_{x ∼ D_{x|X_{ℓ_i}}}[f(x)(1 − f(x))] goes to zero, the error of the deterministic classifier approaches zero for all m metrics, with high probability.

†In Appendix B.4, we verify that the above bound is larger than that of Theorem 1, as it should be.

3 Orderliness: Determinism Is Not Enough

So far we have shown that the hashing approach of Section 2.3 enjoys a better bound on its performance, in terms of aggregate rate metrics, than the standard thresholding approach of Section 2.2. We'll now turn our attention to other criteria for judging the quality of deterministic approximations to stochastic classifiers.

The approaches we've considered thus far can be sorted in terms of how "orderly" they are. As we use the term, "orderliness" is a loose notion measuring how "smooth" or "self-consistent" a classifier is. The original stochastic classifier is the least orderly: it might classify the same example differently when it's encountered multiple times.
The hashing classifier is more orderly because it's deterministic, and will therefore always give the same classification on the same example, but it may behave very differently even on extremely similar examples (if they are hashed differently). The thresholding classifier is the most orderly, since it thresholds every example in exactly the same way, so similar examples will likely be classified identically.

3.1 Repeated Use

As we noted in the introduction, a stochastic classifier may be a poor choice when a user can force the classifier to make multiple predictions. For example, if a spam filter is stochastic, then a spammer could get an email through by sending it repeatedly. Simply replacing a stochastic classifier with a deterministic one might be insufficient: a disorderly spam filter, even a deterministic one, could be defeated by sending many variants of the same spam message (say, differing only in whitespace).

3.2 Fairness Principles

The fact that we measure the quality of an approximation to a stochastic classifier in terms of aggregate metrics implies that we're looking at fairness from the statistical perspective: even if individual outcomes are random (or deterministic-but-arbitrary), the classifier could still be considered "fair" if it could be shown to be free of systematic biases (imposed via constraints on aggregate group-based fairness metrics). As we showed in Theorem 3, a hashing classifier's performance bound improves as it becomes more disorderly (i.e. as the number of clusters in C and/or the number of hash bins k increases), measured in these terms.

Unlike this group-based perspective, Dwork et al. [20] propose a "similar individuals receive similar outcomes" principle, which looks at fairness from the perspective of an individual.
This principle is better served by classifiers that are more orderly: a thresholding classifier's decision regions are fairer, as measured by this principle, than those of e.g. a hashing classifier with fine-grained bins.

This tension between the least orderly classifiers (accurate rate metrics) and the most orderly (similar individuals receive similar outcomes) leads one to wonder whether there is some middle ground: in Section 3.3 we present an approach that allows us to trade off directly between these two extremes.

Reality, of course, is more complicated: for example, lotteries are often considered "fair" by participants if each feels that the underlying mechanism is fair, regardless of their individual outcomes [22, 23]. In such cases, disorderliness, or even stochasticity, might be desirable from a fairness point of view, and this tension vanishes.

3.3 Clustering + Hashing

The hashing technique of Section 2.3 has a built-in mechanism for (partially) addressing the method's inherent lack of orderliness: pre-clustering. If π : X → C assigns "similar" elements x, x' ∈ X to the same cluster c ∈ C, then such elements will be hashed identically, and the values of the stochastic classifier f(x), f(x') will therefore be thresholded at the same value. Hence, assuming that the stochastic classifier f is smooth, and with an appropriate choice of π, the resulting deterministic f̂ could be considered "locally orderly": it will satisfy a form of "similar inputs, similar outcomes", and provide some protection against repeated use.

There are, unfortunately, a couple of drawbacks to this approach.
First, the onus is on the practitioner to design the clustering function π in such a way that it captures the appropriate notion of similarity. For example, if one wishes to encode an intuitive notion of fairness, then instances that are placed into different clusters, and are therefore treated inconsistently by f̂, should be distinct enough that this assignment is justifiable. Second, one should observe that the bound of Theorem 3 is better when there are more clusters, and worse when there are fewer. Hence, there is a trade-off between orderliness and performance: if some required level of metric accuracy must be attained, then doing so might force one to use so many clusters that there is insufficient local orderliness.

4 Stochastic Ensembles

We now focus on a special case of stochastic classifier that randomly selects from a finite number of deterministic base classifiers. This type of stochastic classifier arises from many constrained optimization algorithms [3–5]. Let a stochastic ensemble f : X → [0, 1] be defined in terms of n deterministic classifiers ĝ_1, ..., ĝ_n : X → {0, 1} and an associated probability distribution p ∈ Δ_{n−1} ⊂ R^n, for which f(x) := Σ_{j=1}^n p_j·ĝ_j(x). To evaluate this classifier on an example x, one first samples an index j ∈ [n] according to the distribution p, and predicts ĝ_j(x).

The hashing approach of Section 2.3 can be applied to stochastic ensembles, but due to the special structure of such models, it's possible to do better. Here, we propose an alternative strategy that first applies a clustering, and then subdivides each cluster into n bins, where the jth bin contains roughly a p_j proportion of the cluster's instances, and assigns all instances within the jth bin to classifier ĝ_j. We do this by using a pre-defined score function q and a random shift parameter r_c for each cluster c.
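A minimal sketch of this selection rule (our illustration; the helper names are hypothetical): the cumulative probabilities p_{:j} define the bins, and each cluster's shifted score clip(q(x) + r_c) picks the base classifier.

```python
import random

def varbin_classifier(base_classifiers, p, pi, q, r):
    # base_classifiers: the deterministic g_1, ..., g_n; p: their probabilities;
    # pi: cluster map; q: score function into [0, 1]; r: one shift r_c per cluster.
    # Cumulative sums p_{:0} = 0, p_{:j} = p_1 + ... + p_j.
    cum = [0.0]
    for p_j in p:
        cum.append(cum[-1] + p_j)

    def f_hat(x):
        c = pi(x)
        z = (q(x) + r[c]) % 1.0  # clip(q(x) + r_c): the fractional part
        # Select the j with z in [p_{:j-1}, p_{:j}) and predict g_j(x).
        for j in range(1, len(cum)):
            if cum[j - 1] <= z < cum[j]:
                return base_classifiers[j - 1](x)
        return base_classifiers[-1](x)  # guard against floating-point round-off
    return f_hat
```

Each cluster's instances are thus split across the base classifiers in proportions close to p, so a 2-classifier ensemble with p = (0.3, 0.7) needs only 2 bins per cluster, rather than the many roughly equally sized bins that hashing relies on.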
The benefit of this approach is that it adjusts the sizes of the bins based on the probability distribution p, enabling us to get away with a comparatively smaller number of bins, and therefore achieve higher local orderliness, compared to the hashing classifier (which relies on a large number of roughly equally sized bins). We call this the variable binning approach:

Theorem 4. Let f : X → [0, 1] be a stochastic ensemble, and D_x a data distribution on X. Suppose that we're given m metrics (ℓ_i, X_{ℓ_i}) for i ∈ [m], each of which is potentially associated with a different conditional label distribution D_{y_i|x}. Take π : X → C to be a function that pre-assigns instances to clusters, and q : X → [0, 1] to be a pre-defined score function. Choose p_{:0} = 0 and denote p_{:j} = p_1 + ... + p_j for all j ∈ [n]. Define clip(z) = z − ⌊z⌋.

Sample |C| random numbers r_1, ..., r_{|C|} independently and uniformly from [0, 1), and define the deterministic classifier f̂(x) = Σ_{j=1}^n s_j(x)·ĝ_j(x), where s : X → {0, 1}^n selects one of the n base classifiers and is given by:

    s_j(x) = Σ_{c ∈ C} 1{ π(x) = c, clip(q(x) + r_c) ∈ [p_{:j−1}, p_{:j}) }

Then, with probability at least 1 − δ over the sampling of r_1, ..., r_{|C|}:

    |E_{ℓ_i}(f) − E_{ℓ_i}(f̂)| < ( (m/δ) Σ_{c ∈ C} ( (Pr_{x ∼ D_{x|X_{ℓ_i}}}{π(x) = c})² · E_{x ∼ D_{x|X_{ℓ_i}}}[f(x)(1 − f(x)) | π(x) = c] ) )^{1/2}

where D_{x|X_{ℓ_i}} is the data distribution D_x restricted to X_{ℓ_i}.

Proof. In Appendix B.5.

The proof proceeds by showing that the selector function s satisfies a pairwise independence property. The above bound is similar to the bound for hashing in Theorem 3, except that it no longer contains the terms that depend on the number of hash buckets k, and is therefore a slight improvement. In our experiments, we find that it matches the performance of hashing with more local orderliness.

5 Experiments

We experimentally evaluate the different strategies described above for approximating a stochastic classifier with a deterministic classifier. We consider constrained training tasks with two different fairness goals: (i) matching ROC curves across protected groups, and (ii) matching regression histograms

Table 1: Comparison of de-randomization approaches on ROC matching tasks. For each method, we report A (B), where A is the absolute difference in the objective Σ_{t ∈ T} TPR_t between the stochastic classifier and the deterministic classifier, and B is the difference in fairness. For an FPR threshold t, we measure fairness as TPR_t^ptr − TPR_t, and report the maximum absolute difference in this fairness metric between the stochastic and deterministic classifiers across all t ∈ T.
The number of base classifiers in the support of the stochastic ensemble is shown in parentheses after each dataset name.

             Crime (4)                  COMPAS (5)                 Law School (5)
             Train         Test         Train         Test         Train         Test
  Threshold  0.007 (0.01)  0.012 (0.03)  0.002 (0.01)  0.002 (0.00)  0.118 (0.12)  0.099 (0.11)
  Hashing    0.001 (0.00)  0.004 (0.01)  0.001 (0.01)  0.005 (0.03)  0.004 (0.01)  0.001 (0.03)
  VarBin     0.002 (0.00)  0.000 (0.02)  0.001 (0.01)  0.002 (0.02)  0.000 (0.01)  0.000 (0.02)

             Adult (3)                  Wiki Toxicity (4)          Business (3)
             Train         Test         Train         Test         Train         Test
  Threshold  0.002 (0.04)  0.005 (0.03)  0.025 (0.04)  0.024 (0.03)  0.015 (0.02)  0.014 (0.01)
  Hashing    0.005 (0.01)  0.002 (0.01)  0.000 (0.01)  0.004 (0.01)  0.000 (0.01)  0.001 (0.02)
  VarBin     0.000 (0.01)  0.002 (0.01)  0.014 (0.01)  0.013 (0.01)  0.000 (0.01)  0.001 (0.01)

across protected groups. These goals impose a large number of constraints on the model, and stochastic solutions become crucial to satisfying them. We used the proxy-Lagrangian optimizer of Cotter et al. [4, 5] to solve the constrained optimization problem. This solver outputs a stochastic ensemble, as well as the best deterministic classifier, chosen heuristically from its iterates.

Datasets. We use a variety of fairness datasets with binary protected attributes: (1) COMPAS [24], where the goal is to predict recidivism, with gender as the protected attribute; (2) Communities & Crime [25], where the goal is to predict whether a community in the US has a crime rate above the 70th percentile, and, as in Kearns et al. [26], we consider communities having a black population above the 50th percentile as the protected group; (3) Law School [27], where the task is to predict whether a law school student will pass the bar exam, with race (black or other) as the protected attribute; (4) UCI Adult [25], where the task is to predict whether a person's income exceeds $50K/year, with female candidates as the protected group; (5) Wiki Toxicity [28], where the goal is to predict whether a comment posted on a Wikipedia talk page contains non-toxic/acceptable content, with the comments containing the term 'gay' considered as the protected group; (6) Business Entity Resolution, a proprietary dataset from a large internet services company, where the task is to predict whether a pair of business descriptions refer to the same real business, with non-chain businesses treated as protected. We used linear models for all experiments. See Appendix A for further details on the datasets and setup.‡

Methods. We apply the thresholding, hashing and variable binning (VarBin) techniques to convert the trained stochastic ensemble into a deterministic classifier. For hashing, we first map the input features to 2^128 clusters (using a 128-bit cryptographic hash function), and then apply a pairwise independent hash function to map each cluster to one of 2^32 buckets (see Claim 1 in Appendix B.3 for the construction). For VarBin, we choose a direction θ uniformly at random from the unit ℓ2 sphere, project instances onto this direction, and have the cluster mapping π divide the projected values into k = 25 contiguous bins, i.e. π(x) = c whenever u_{c−1} ≤ ⟨θ, x⟩ ≤ u_c, where u_0 = min_x ⟨θ, x⟩ < u_1 < ... < u_25 = max_x ⟨θ, x⟩ are equally spaced thresholds. The score q(x) for an instance x is taken to be the projected value ⟨θ, x⟩ normalized by the maximum and minimum values within its cluster, i.e. q(x) = (⟨θ, x⟩ − u_{π(x)−1}) / (u_{π(x)} − u_{π(x)−1}). Additionally, we find that adding the random numbers r_1, ..., r_{|C|} was unnecessary, and take r_c = 0 for all c, which considerably simplifies the implementation of VarBin.

5.1 ROC Curve Matching

Our first task is to train a scoring model that yields similar ROC curves for both the protected group and the overall population. Let TPR_t denote the true positive rate in the model's ROC curve when thresholded at false positive rate t, and let TPR_t^ptr denote the true positive rate achieved on the protected group members when thresholded to yield the same false positive rate t on the protected group.

‡Code made available at: https://github.com/google-research/google-research/tree/master/stochastic_to_deterministic

Figure 1: Test set ROC curves for the black group and the overall population in the Law School dataset. Note that the stochastic classifier successfully matches the two ROC curves, and that the hashing approximation is much more faithful than the best deterministic iterate provided by the solver.

Figure 2: Comparison of pre-clustered Hashing and VarBin, showing the trade-off between orderliness (using the proxy of fewer bins) and accuracy on the rate metrics (more bins).

We are interested in a selected set of FPRs in the initial portion of the curve: T = {0.1, 0.2, 0.3, 0.4}. Our goal is to maximize the sum of the TPRs at these FPRs, subject to the TPR values being similar for both the protected group and the overall population, i.e.:

    max Σ_{t ∈ T} TPR_t   s.t.   |TPR_t − TPR_t^ptr| ≤ 0.01, ∀ t ∈ T.

This results in 24 constraints on true and false positive rates. For this problem, the constrained optimizer outputs ensembles with 3–5 deterministic classifiers. We report the objective and constraint violations for the trained stochastic models in Table 4 of Appendix A.
The stochastic solution yields\na much lower constraint violation compared to an unconstrained classi\ufb01er trained to optimize the\nerror rate, and the \u201cbest iterate\u201d deterministic classi\ufb01er. A comparison of the different strategies for\nde-randomizing the trained stochastic model is presented in Table 1. Hashing and VarBin are able to\nclosely match the performance of the stochastic classi\ufb01er. Thresholding fares poorly on three of the\nsix datasets. Figure 1 provides a visualization of the matched ROC curves.\nWe next study the trade-off between orderliness and accuracy. To evaluate hashing with different\nnumbers of bins, we project the inputs along a random direction, form equally-spaced bins, and hash\nthe bin indices. Figure 2 plots the difference in objective between the stochastic and hash-deterministic\nmodels for different numbers of bins (averaged over 50 random draws of the random direction and\nhash function). We show a similar plot for the constraint metrics. We compare hashing with a\nVarBin strategy that uses the same number of (total) bins. VarBin is generally better at approximating\nthe stochastic classi\ufb01er with a small number of bins because VarBin sizes the bins to respect the\nprobability distribution p, and is thus able to provide better accuracy with more orderliness.\n\n5.2 Histogram Matching\n\nWe next consider a regression task where the fairness goal is to match the output distribution of\nthe model for the protected group and the overall population. For a regression model \u02c6g : X!Y ,\nwith a bounded Y\u21e2 R, we divide the output range into 10 equally sized bins B1, . . . , B10\nand require that the fraction of protected group members in a bin is close to the fraction of\nthe overall population in that bin:\nj 2 [10]. We minimize the squared error subject to satisfying this goal, which results in a\ntotal of 20 constraints on the model. 
We train stochastic models on the same datasets as before, and use real-valued labels wherever available: for Crime, we predict the per-capita crime rate, for Law School, we predict the undergraduate GPA, and for WikiToxicity, we predict the level of toxicity (a value in [0, 1]). In this case, the constrained optimizer outputs a stochastic ensemble of regression models $\hat{g}_1, \ldots, \hat{g}_n : X \to Y$ with probabilities $p \in \Delta_{n-1}$. In place of thresholding, we report the “Average” baseline that simply outputs the expected value of the ensemble: $\hat{f}(x) = \sum_{j=1}^{n} p_j \hat{g}_j(x)$. For our datasets, the trained stochastic ensembles contain 4 to 8 classifiers. We report the objective and constraint violations in Table 5 in Appendix A. An evaluation of how well the constructed deterministic classifiers match the stochastic classifier is presented in Table 2. Hashing and VarBin yield comparable performance on most datasets. The Average baseline fails on four of the datasets. Figure 3 provides a visualization of the matched output distributions.

Table 2: Comparison of de-randomization approaches on histogram matching regression tasks. We report A (B), where A is the difference in squared error between the stochastic classifier and the deterministic classifier and B is the difference in fairness. We measure fairness as $\Pr_{x \mid \mathrm{ptr}}(\hat{g}(x) \in B_j) - \Pr_{x}(\hat{g}(x) \in B_j)$, and report the maximum absolute difference in this metric between the stochastic and deterministic classifier across all bins $B_j$. “Average” is the regression analogue of thresholding.

             Crime (5)                    COMPAS (4)                   Law School (5)
             Train         Test          Train         Test          Train         Test
  Average    0.001 (0.02)  0.001 (0.02)  0.068 (0.03)  0.069 (0.06)  0.265 (0.01)  0.262 (0.02)
  Hashing    0.000 (0.01)  0.000 (0.03)  0.002 (0.03)  0.004 (0.06)  0.002 (0.01)  0.002 (0.01)
  VarBin     0.000 (0.05)  0.001 (0.14)  0.001 (0.08)  0.007 (0.07)  0.002 (0.04)  0.002 (0.06)

             Adult (4)                    WikiToxicity (5)             Business (8)
             Train         Test          Train         Test          Train         Test
  Average    0.003 (0.01)  0.003 (0.01)  0.023 (0.09)  0.023 (0.09)  0.091 (0.07)  0.090 (0.08)
  Hashing    0.000 (0.01)  0.000 (0.01)  0.000 (0.01)  0.001 (0.01)  0.010 (0.03)  0.013 (0.07)
  VarBin     0.000 (0.04)  0.000 (0.04)  0.002 (0.13)  0.003 (0.18)  0.001 (0.06)  0.005 (0.08)

Figure 3: Test set histograms of model outputs for the female candidates (red) and the overall population (green) in the Adult dataset.

In Appendix A.3, we present a third experiment on an unconstrained multiclass problem where we seek to optimize the G-mean evaluation metric, which is the geometric mean of the per-class accuracies. We apply a training approach based on the Frank-Wolfe method [12] on the UCI Abalone dataset [25] and present the result of de-randomizing a stochastic ensemble with 100 base classifiers.

6 Conclusions and Future Work

There are a number of ways to convert a stochastic classifier to a deterministic approximation, and one of these—hashing—enjoys a theoretical guarantee that compares favorably to a lower bound, in terms of how well the approximation preserves aggregate rate metrics. However, the reasons that determinism may be preferable to stochasticity include stability, debuggability, various notions of fairness, and resistance to manipulation via repeated use.
In terms of these issues, a disorderly classifier, like that resulting from hashing, may be unsatisfactory.

Applying pre-clustering to the hashing approach partially solves this problem, as does the variable binning approach of Section 4, but leaves a number of important questions open, including how one should measure similarity, whether we can improve on the “local orderliness” property these approaches enjoy, and whether there are special cases where one can construct accurate deterministic classifiers without losing out on orderliness.

Another possible refinement would be to consider more general metrics than the aggregate rates that we consider in Section 2. For example, one could potentially use smooth functions of rates, to handle e.g. the F-score or G-mean metrics [29] (see the experiment in Appendix A.3). Or, to support the ranking or regression settings, one could define rate metrics over pairs of examples [30–32].

Acknowledgments

Our thanks go out to Samory Kpotufe for mentioning the connection to the PAC-Bayes literature, to Nathan Srebro for pointing out that replacing a random choice with an arbitrary one will not necessarily be an improvement, and to Sergey Ioffe for a helpful discussion on hash functions.

References

[1] Gabriel Goh, Andrew Cotter, Maya Gupta, and Michael P. Friedlander. Satisfying real-world goals with dataset constraints. In NIPS, pages 2415–2423, 2016.

[2] Harikrishna Narasimhan. Learning with complex loss functions and constraints. In AISTATS, 2018.

[3] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna M. Wallach. A reductions approach to fair classification. In ICML, pages 60–69, 2018.

[4] Andrew Cotter, Heinrich Jiang, and Karthik Sridharan. Two-player games for efficient non-convex constrained optimization.
In Algorithmic Learning Theory, pages 300–332, 2019.

[5] Andrew Cotter, Heinrich Jiang, Serena Wang, Taman Narayan, Maya Gupta, Seungil You, and Karthik Sridharan. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. JMLR, 2019. [To appear: https://arxiv.org/abs/1809.04198].

[6] Robert S. Chen, Brendan Lucier, Yaron Singer, and Vasilis Syrgkanis. Robust optimization for non-convex objectives. In NIPS, 2017.

[7] D. D. Lewis. Evaluating text categorization. In HLT Workshop on Speech and Natural Language, pages 312–318, 1991.

[8] J.-D. Kim, Y. Wang, and Y. Yasunori. The Genia event extraction shared task, 2013 edition overview. ACL 2013, 2013.

[9] Y. Sun, M. S. Kamel, and Y. Wang. Boosting for learning multiple classes with imbalanced class distribution. In ICDM, 2006.

[10] S. Wang and X. Yao. Multiclass imbalance problems: Analysis and potential solutions. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42(4):1119–1130, 2012.

[11] S. Lawrence, I. Burns, A. Back, A.-C. Tsoi, and C. L. Giles. Neural network classification and prior class probabilities. In Neural Networks: Tricks of the Trade, LNCS 1524, pages 299–313, 1998.

[12] H. Narasimhan, P. Kar, and P. Jain. Optimizing non-decomposable performance measures: a tale of two classes. In ICML, 2015.

[13] John Langford and John Shawe-Taylor. PAC-Bayes and margins. In NIPS, 2002.

[14] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 2003.

[15] Xiong Li, Bin Wang, Yuncai Liu, and Tai Sing Lee. Stochastic feature mapping for PAC-Bayes classification. Machine Learning, 2015.

[16] Alexandre Lacasse, François Laviolette, Mario Marchand, Pascal Germain, and Nicolas Usunier. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier.
In Advances in Neural Information Processing Systems, pages 769–776, 2007.

[17] Foster Provost and Tom Fawcett. Robust classification for imprecise environments. Machine Learning, 42(3):203–231, 2001.

[18] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 117–126. ACM, 2015.

[19] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In NIPS, 2016.

[20] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proc. 3rd Innovations in Theoretical Computer Science, pages 214–226. ACM, 2012.

[21] T. Hastie, R. Tibshirani, and J. Friedman. Elements of Statistical Learning. Springer, 2016.

[22] G. Sher. What makes a lottery fair? Noûs, pages 203–216, 1980.

[23] B. Saunders. The equality of lotteries. Philosophy, 83:359–372, 2008.

[24] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There’s software used across the country to predict future criminals, and it’s biased against blacks. ProPublica, May 2016.

[25] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

[26] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness, 2017. URL https://arxiv.org/abs/1711.05144.

[27] L. Wightman. LSAC national longitudinal bar passage study. Law School Admission Council, 1998.

[28] L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman. Measuring and mitigating unintended bias in text classification. In AIES, 2018.

[29] H. Narasimhan, A. Cotter, and M. Gupta. Optimizing generalized rate metrics with three players.
In NeurIPS, 2019.

[30] A. Beutel, J. Chen, T. Doshi, H. Qian, L. Wei, Y. Wu, L. Heldt, Z. Zhao, L. Hong, E. H. Chi, and C. Goodrow. Fairness in recommendation through pairwise experiments. KDD Applied Data Science Track, 2019. URL arxiv.org/abs/1903.00780.pdf.

[31] N. Kallus and A. Zhou. The fairness of risk scores beyond classification: Bipartite ranking and the xAUC metric. arXiv preprint arXiv:1902.05826, 2019.

[32] H. Narasimhan, A. Cotter, M. Gupta, and S. Wang. Pairwise fairness for ranking and regression, 2019. URL https://arxiv.org/abs/1906.05330.

[33] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[34] Ronitt Rubinfeld. Notes for lecture 5 of MIT 6.842: Randomness and Computation, February 2012. URL https://people.csail.mit.edu/ronitt/COURSE/S12/handouts/lec5.pdf.