{"title": "Optimizing Generalized Rate Metrics with Three Players", "book": "Advances in Neural Information Processing Systems", "page_first": 10747, "page_last": 10758, "abstract": "We present a general framework for solving a large class of learning problems with non-linear functions of classification rates. This includes problems where one wishes to optimize a non-decomposable performance metric such as the F-measure or G-mean, and constrained training problems where the classifier needs to satisfy non-linear rate constraints such as predictive parity fairness, distribution divergences or churn ratios. We extend previous two-player game approaches for constrained optimization to an approach with three players to decouple the classifier rates from the non-linear objective, and seek to find an equilibrium of the game. Our approach generalizes many existing algorithms, and makes possible new algorithms with more flexibility and tighter handling of non-linear rate constraints. We provide convergence guarantees for convex functions of rates, and show how our methodology can be extended to handle sums-of-ratios of rates. Experiments on different fairness tasks confirm the efficacy of our approach.", "full_text": "Optimizing Generalized Rate Metrics with\n\nThree Players\n\nHarikrishna Narasimhan, Andrew Cotter, Maya Gupta\n\nGoogle Research\n\n1600 Amphitheatre Pkwy, Mountain View, CA 94043\n\n{hnarasimhan, acotter, mayagupta}@google.com\n\nAbstract\n\nWe present a general framework for solving a large class of learning problems\nwith non-linear functions of classi\ufb01cation rates. This includes problems where\none wishes to optimize a non-decomposable performance metric such as the F-\nmeasure or G-mean, and constrained training problems where the classi\ufb01er needs\nto satisfy non-linear rate constraints such as predictive parity fairness, distribution\ndivergences or churn ratios. 
We extend previous two-player game approaches\nfor constrained optimization to an approach with three players to decouple the\nclassi\ufb01er rates from the non-linear objective, and seek to \ufb01nd an equilibrium of the\ngame. Our approach generalizes many existing algorithms, and makes possible new\nalgorithms with more \ufb02exibility and tighter handling of non-linear rate constraints.\nWe provide convergence guarantees for convex functions of rates, and show how\nour methodology can be extended to handle sums-of-ratios of rates. Experiments\non different fairness tasks con\ufb01rm the ef\ufb01cacy of our approach.\n\n1\n\nIntroduction\n\nIn many real-world machine learning problems, the performance measures used to evaluate a clas-\nsi\ufb01cation model are non-linear functions of the classi\ufb01er\u2019s prediction rates. Examples include the\nF-measure and G-mean used in class-imbalanced classi\ufb01cation tasks [1\u20135], metrics such as predictive\nparity used to impose fairness goals [6], the win-loss-ratio used to measure classi\ufb01er churn [7],\nKL-divergence based metrics used in quanti\ufb01cation tasks [8, 9] and score-based metrics such as the\nPR-AUC [10]. Because these goals are non-linear and are non-continuous in the model parameters, it\nbecomes very challenging to optimize with them, especially when they are used in constraints [11].\nPrior work on optimizing generalized rate metrics has largely focused on unconstrained learning\nproblems. These approaches fall under two broad categories: surrogate-based methods that replace\nthe classi\ufb01er rates with convex relaxations [12\u201317], and oracle-based methods that formulate multiple\ncost-sensitive learning tasks, and solve them using an oracle [18\u201322, 11]. Both these approaches\nhave notable de\ufb01ciencies. The \ufb01rst category of methods rely crucially on the surrogates being close\napproximations to the rate metric, and perform poorly when this is not the case (see e.g. 
experiments\nin [20]). The use of surrogates becomes particularly problematic with constrained training problems,\nas relaxing the constraints with convex upper bounds can result in solutions that are over-constrained\nor infeasible [23]. The second category of methods assume access to a near-optimal cost-sensitive\noracle, which is usually unrealistic in practice.\nIn this paper, we present a three-player approach for learning problems where both the objective\nand constraints can be de\ufb01ned by general functions of rates. The three players optimize over model\nparameters, Lagrange multipliers and slack variables to produce a game equilibrium. Our approach\ngeneralizes many existing algorithms (see Table 2), and makes possible new algorithms with more\n\ufb02exibility and tighter handling of non-linear rate constraints. Speci\ufb01cally, we give a new method\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f(Algorithm 2) that can handle a wider range of performance metrics than previous surrogate methods\n(such as e.g. KL-divergence based metrics that only take inputs from a restricted range), and can be\napplied to constrained training problems without the risk of over-constraining the model because it\nneeds to use surrogates less. To our knowledge, this is the \ufb01rst practical, surrogate-based approach\nthat can handle constraints on generalized rate metrics.\nWe show convergence of our algorithms for objectives and constraints that are convex functions of\nrates. This result builds on previous work by Cotter et al. [23, 24] for handling linear rate constraints,\nand additionally resolves an unanswered question in their work on the convergence of Lagrangian\noptimizers for non-zero-sum games. We also extend our framework to develop a heuristic (Algorithm\n3) for optimizing performance measures that are a sum-of-ratios of rates (e.g. 
constraints on predictive parity and F-measure), and demonstrate their utility on real-world tasks.

Related work: Many fairness goals can be expressed as linear constraints on a model's prediction rates [16, 25]. Recent work has focused on optimizing with linear rate constraints by computing an equilibrium of a game between players who optimize model parameters θ and Lagrange multipliers λ [26–28, 23, 24]. Of these, the closest to us is the work of Cotter et al. (2019) [23], who propose the idea of having only the θ-player optimize a surrogate objective, and having the λ-player use the original rates. We adapt and build on this idea to handle general functions of rates. Other game-based formulations include the approach of Wang et al. [29] for optimizing multivariate evaluation metrics. Their setup is very different from ours, with their players optimizing over (distributions on) all possible labelings of the data. Moreover, they do not handle constraints on the classifier.

There has also been a concentrated effort on optimizing performance measures such as the F-measure that are fractional-linear functions of rates [18–20, 30, 31]. Many of these works exploit the pseudo-convex structure of the F-measure, but this property is absent for the problems that we consider, where we need to handle sums or differences of ratios. Pan et al. [32] provide a heuristic approach for unconstrained sums-of-ratios, and recently Celis et al. [33] handle constraints that are sums-of-ratios, but do so by solving a large number of linearly constrained sub-problems, with the number of sub-problems growing exponentially with the number of rates. In contrast, we provide a practical algorithm that handles both objectives and constraints that are sums-of-ratios of rates.

2 Problem Setup

Let X ⊂ R^d and Y = {±1} be the instance and label spaces. Let f_θ : X → R be a prediction model, parameterized by θ ∈ Θ.
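To make this setup concrete, here is a minimal sketch (our own illustration, not the authors' code) of a linear scoring model and the empirical rate at which its thresholded predictions agree with the labels; the specific data and the linear form of f_θ are assumptions for illustration only.

```python
import numpy as np

def f_theta(theta, X):
    """A linear scoring model f_theta(x) = <theta, x> (illustrative choice)."""
    return X @ theta

def empirical_rate(theta, X, y):
    """Fraction of examples with y == sign(f_theta(x)): an unbiased estimate
    of the rate R(theta; D) when (X, y) are drawn i.i.d. from D."""
    preds = np.where(f_theta(theta, X) >= 0, 1, -1)  # sign, with sign(0) = +1
    return float(np.mean(preds == y))

theta = np.array([1.0, -1.0])
X = np.array([[2.0, 1.0], [0.0, 1.0], [1.0, 3.0], [3.0, 1.0]])
y = np.array([1, -1, -1, 1])
rate = empirical_rate(theta, X, y)  # scores [1, -1, -2, 2] -> all four predictions match
```

Evaluating the same estimator on group-conditional samples (e.g. positives only) yields the TPR/TNR-style rates discussed next.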
Given a distribution D over X × Y, we define the model's rate on D as:

R(θ; D) = E_{(X,Y)∼D}[ I(Y = sign(f_θ(X))) ].

For example, if D_A denotes the distribution over a sub-population A ⊂ X, then R(θ; D_A) gives us the accuracy of f_θ on this sub-population. If D_+ denotes a conditional distribution over the positively-labeled examples, then R(θ; D_+) is the true-positive rate TPR(θ) of f_θ and 1 − R(θ; D_+) is the false-negative rate FNR(θ) of f_θ. Similarly, if D_− denotes a conditional distribution over the negatively-labeled examples, then TNR(θ) = R(θ; D_−) is the true-negative rate of f_θ, and FPR(θ) = 1 − R(θ; D_−) is the false-positive rate of f_θ. Further, the true-positive and false-positive proportions are given by TP(θ) = p·TPR(θ) and FP(θ) = (1 − p)·FPR(θ), where p = P(Y = 1); we can similarly define the true-negative proportion TN and false-negative proportion FN.

We will consider several distributions D_1, . . ., D_K over X × Y, and use the short-hand R_k(θ) to denote the rate function R(θ; D_k) ∈ [0, 1]. We will denote a vector of K rates as R(θ) = [R_1(θ), . . ., R_K(θ)]^T. We assume we have access to unbiased estimates of the rates, R̂_k(θ) = (1/n_k) Σ_{i=1}^{n_k} I(y_i^k = sign(f_θ(x_i^k))), computed from samples S = {(x_1^k, y_1^k), . . ., (x_{n_k}^k, y_{n_k}^k)} ∼ D_k.

We will also consider stochastic models, which are defined by distributions over the model parameters Θ. Let Δ_Θ denote the set of all such distributions over Θ, where for any µ ∈ Δ_Θ, µ(θ) is the probability mass on model θ. We define the rate R_k for a stochastic model µ to be the expected value for a random draw of a model from µ, so R_k(µ) = E_{θ∼µ}[R_k(θ)].

Table 1: Examples of generalized rate metrics. ψ is either convex (C), pseudo-convex (PC) or a sum or difference of ratios (SR). p = P(Y = 1) and p̂ is the predicted proportion of positives. wins (losses) is the fraction of correct (wrong) predictions by the new model among examples where it disagrees with the old model. A and B refer to different protected groups or slices of the population.

Measure | Definition | ψ(z) | Type
G-mean [34, 35] | 1 − √(TPR × TNR) | 1 − √(z₁ z₂) | C
H-mean [36] | 1 − 2 / (1/TPR + 1/TNR) | 1 − 2 / (1/z₁ + 1/z₂) | C
Q-mean [5] | 1 − √(FPR² + FNR²) | 1 − √(z₁² + z₂²) | C
KLD [8, 9] | p log(p/p̂) + (1−p) log((1−p)/(1−p̂)) | p log(p/z₁) + (1−p) log((1−p)/(1−z₁)) | C
F-measure [37] | 1 − 2TP / (2TP + FP + FN) | 1 − 2z₁ / (2z₁ + z₂ + z₃) | PC
Predictive parity [6] | TP_A/(TP_A + FP_A) − TP_B/(TP_B + FP_B) | z₁/(z₁ + z₂) − z₃/(z₃ + z₄) | SR
F-measure parity | 2TP_A/(2TP_A + FP_A + FN_A) − 2TP_B/(2TP_B + FP_B + FN_B) | 2z₁/(2z₁ + z₂ + z₃) − 2z₄/(2z₄ + z₅ + z₆) | SR
Churn [7] | wins_A/losses_A − wins_B/losses_B | z₁/z₂ − z₃/z₄ | SR
PR-AUC [10] | 1 − (1/M) Σ_{m=1}^{M} TP_m/(TP_m + FP_m), where TP_m, FP_m are evaluated at the largest threshold τ at which the recall of f_θ(X) + τ is (m − 0.5)/M | – | SR

Generalized Rate Metrics as Objective. In many real-world applications the performance of a model is evaluated by a function of multiple rates: ψ(R_1(θ), . . ., R_K(θ)) for some ψ : [0, 1]^K → R. For example, a common evaluation metric used in class-imbalanced classification tasks is the G-mean, GM(θ) = 1 − √(TPR(θ) · TNR(θ)), which is a convex function of rates [34, 35]. Similarly, a popular evaluation metric used in text retrieval is the F-measure, F₁(θ) = 2TP(θ) / (2TP(θ) + FP(θ) + FN(θ)), which is a fractional-linear or pseudo-convex function of rates [38]. See Table 1 for more examples. One can consider directly optimizing these performance measures during training:
(P1)  min_{θ∈Θ} ψ(R_1(θ), . . ., R_K(θ)).

Generalized Rate Metrics as Constraints. There are also many applications where one wishes to impose constraints defined by non-linear functions of rates, for example to ensure group-specific fairness metrics. Examples include: (i) Predictive parity fairness: Fair classification tasks where one wishes to match the precision of a model across different protected groups [6]: TP_A(θ)/(TP_A(θ) + FP_A(θ)) − TP_B(θ)/(TP_B(θ) + FP_B(θ)) ≤ ε. One may also want to match e.g. the F-measure of a model across different groups. (ii) Distribution matching: Fairness or quantification tasks where one wishes to match the distribution of a model's outputs across different protected groups [9, 15]. One way to achieve this is to constrain the KL-divergence between the overall class proportion p and the proportion of predicted positives for each group p̂_A [11], i.e. to enforce KLD(p, p̂_A) ≤ ε, where KLD is convex in p̂_A. (iii) Churn: Problems where one wishes to replace a legacy model with a more accurate model while limiting the changed predictions between the new and old models (possibly across different slices of the user base). This is ideally framed as constraints on win-loss ratios [24], which can be expressed as ratios of rates [7]. These and related problems can be framed generally as:

(P2)  min_{θ∈Θ} g(θ)  s.t.  φ_j(R_1(θ), . . ., R_K(θ)) ≤ 0, ∀j ∈ [J],

for some objective function g : Θ → R and J constraint functions φ_j : [0, 1]^K → R. We also consider a special case of (P2) with an objective and a constraint that is a sum-of-ratios of rates:
(P3)  min_{θ∈Θ} Σ_{m=1}^{M} ⟨α_m, R(θ)⟩ / ⟨β_m, R(θ)⟩  s.t.  Σ_{m=M+1}^{2M} ⟨α_m, R(θ)⟩ / ⟨β_m, R(θ)⟩ ≤ τ,

for coefficients α_m, β_m ∈ R_+^K, ∀m ∈ [2M], and slack τ ∈ R.

Our setup can also be used to optimize score-based metrics, such as the area under the precision-recall curve (PR-AUC), that summarize the performance of a score model f_θ : X → R across multiple thresholds. We use the approach of Eban et al. [10] to (approximately) express PR-AUC as a Riemann summation of the precision of f_θ at thresholds τ_1, . . ., τ_M ∈ R at which the recall of f_θ is 0.5/M, 1.5/M, . . ., (M − 0.5)/M respectively. This results in a formulation similar to (P3). See Appendix E for details.

We next provide a three-player framework for solving (P1)–(P3). We note that our formulations can be equivalently regarded as two-player games where one player is in charge of the parameters that need to be minimized, and the other player is in charge of the parameters that need to be maximized. We however find the three-player viewpoint to be a useful way to think about the problem algorithmically, in that the three sets of optimization parameters can use different algorithms (see Table 2).

Table 2: Algorithms for (P1)–(P3) with 3 players. Frank-Wolfe [20], SPADE [14], NEMSIS [15] are previous algorithms. Alg. 1–3 are the proposed methods. Each player can do Best Response (BR), Online Gradient Descent (OGD) or Follow-The-Leader (FTL), and the game is zero-sum (ZS) or not. The first five algorithms find an approximate Nash or Coarse-Correlated (C.C.) equilibrium assuming ψ and the φ_j's are convex. Since (P3) is non-convex in the rates, Alg.
3 may not find an equilibrium.

Alg. | Problems | Player objective (ξ / θ / λ) | Player strategy (ξ / θ / λ) | ZS | Equil.
F-W | P1 | − / L2 / min_ξ L1 + L2 | − / BR / FTL | yes | Nash
SPADE | P1 | L1 / L̃2 / L1 + L̃2 | BR / OGD / OGD | yes | Nash
NEMSIS | P1 | − / L̃2 / min_ξ L1 + L̃2 | − / OGD / FTL | yes | Nash
Alg. 1 | P1–2 | L1 / L2 / L1 + L2 | BR / BR / OGD | yes | Nash
Alg. 2 | P1–2 | L1 / L̃2 / L1 + L2 | BR / OGD / OGD | no | C.C.
Alg. 3 | P3 | L1 / L̃2 / L1 + L2 | OGD / OGD / OGD | no | −

3 Generalized Rate Metric Objective

We first present algorithms for the unconstrained problem in (P1). We assume ψ is strictly convex, and is L-Lipschitz w.r.t. the ℓ1-norm. For simplicity, we assume that ψ is monotonically increasing in all arguments and that ∇ψ(0) = 0, although our approach easily extends to more general metrics that are e.g. monotonically increasing in some arguments and monotonically decreasing in others.

Game Formulation. We equivalently re-write (P1) to de-couple the rates R_k from the non-linear function ψ by introducing auxiliary variables ξ_1, . . ., ξ_K ∈ [0, 1]:

(1)  min_{θ∈Θ, ξ∈[0,1]^K} ψ(ξ_1, . . ., ξ_K)  s.t.  R_k(θ) ≤ ξ_k, ∀k ∈ [K].

A standard approach for solving (1) is to write the Lagrangian for the problem, with Lagrange multipliers λ_1, . . ., λ_K ∈ R_+ for the K constraints:

L(θ, ξ; λ) = ψ(ξ) + Σ_{k=1}^{K} λ_k (R_k(θ) − ξ_k) = [ψ(ξ) − Σ_{k=1}^{K} λ_k ξ_k] + [Σ_{k=1}^{K} λ_k R_k(θ)],

where we denote the first bracketed term by L1(ξ; λ) and the second by L2(θ; λ). One then maximizes the Lagrangian over λ ∈ R_+^K, and minimizes it over θ ∈ Θ and ξ ∈ [0, 1]^K:

(2)  max_{λ∈R_+^K} min_{θ∈Θ, ξ∈[0,1]^K} L1(ξ; λ) + L2(θ; λ).

Notice that L1 is convex in ξ (by convexity of ψ), while L1 and L2 are linear in λ.
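The decoupling above can be checked numerically. Below is a small sketch (our own illustration) with a stand-in strictly convex, coordinate-wise increasing metric ψ(ξ) = ξ₁² + ξ₂²; the particular ψ and the numeric values are assumptions, not from the paper.

```python
import numpy as np

def psi(xi):
    """Stand-in strictly convex, coordinate-wise increasing metric on [0,1]^K."""
    return float(np.sum(xi ** 2))

def L1(xi, lam):
    """xi-part of the Lagrangian: psi(xi) - <lam, xi>."""
    return psi(xi) - float(lam @ xi)

def L2(rates, lam):
    """theta-part of the Lagrangian: <lam, R(theta)>, linear in the rates."""
    return float(lam @ rates)

xi = np.array([0.3, 0.6])
lam = np.array([0.5, 1.0])
rates = np.array([0.4, 0.7])  # R(theta) for some fixed model theta

# L(theta, xi; lam) = psi(xi) + <lam, R(theta) - xi> decomposes as L1 + L2.
lagrangian = psi(xi) + float(lam @ (rates - xi))
assert abs((L1(xi, lam) + L2(rates, lam)) - lagrangian) < 1e-12
```

The point of the split is that each player can then treat only its own term: L1 is convex in ξ and L2 is linear in λ, as noted above.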
We pose this max-min problem as a zero-sum game played with three players: a player who minimizes L1 + L2 over θ, a player who minimizes L1 + L2 over ξ, and a player who maximizes L1 + L2 over λ. Each of the three players can now use different optimization algorithms customized for their problem. If additionally the Lagrangian were convex in θ, one could solve for an equilibrium of this game and obtain a solution for the primal problem (1). However, since L2 is a weighted sum of rates R_k(θ), it need not be convex (or even continuous) in θ.

To overcome this difficulty, we expand the solution space from deterministic models Θ to stochastic models Δ_Θ [26, 23, 24, 39], and re-formulate (2) as a problem that is linear in µ, by replacing each R_k(θ) with E_{θ∼µ}[R_k(θ)] in L2:

(3)  max_{λ∈Λ} min_{µ∈Δ_Θ, ξ∈[0,1]^K} L1(ξ; λ) + L2(µ; λ).

Here, for technical reasons, we restrict the Lagrange multipliers to a bounded set Λ = {λ ∈ R_+^K : ‖λ‖₁ ≤ κ}; we will choose the radius κ > 0 later in our theoretical analysis. By solving for an equilibrium of this expanded max-min problem, we can find a stochastic model µ ∈ Δ_Θ that minimizes ψ(R_1(µ), . . ., R_K(µ)).

There are two approaches that we can take to find an equilibrium of the expanded game. The first approach is to assume access to an oracle that can perform the minimization of L2 over µ for a fixed λ and ξ. Since this is a linear optimization over the simplex, this amounts to performing a minimization over deterministic models in Θ. The second and more realistic approach is to work with surrogates for the rates that are continuous and differentiable in θ. Let R̃_1, . . ., R̃_K : Θ →
R be differentiable convex surrogate functions that are upper bounds on the rates: R_k(θ) ≤ R̃_k(θ), ∀θ ∈ Θ. We assume access to unbiased stochastic sub-gradients for the surrogates, ∇_θR̃_k(θ), with E[∇_θR̃_k(θ)] ∈ ∂_θR̃_k(θ). We then define a surrogate-based approximation for L2:

(4)  L̃2(θ; λ) = Σ_{k=1}^{K} λ_k R̃_k(θ).

All we need to do now is to choose the objective that each player seeks to optimize (true or surrogate [23]) and the strategy that they use to optimize their objective, so that the players converge to an equilibrium of the game. Each of these choices leads to a different algorithm for (approximately) solving (P1). Table 2 summarizes different choices of strategies and objectives for the players, and the type of equilibrium and the algorithm that results from these choices. As we shall see shortly (and also elaborate in Appendix B.1), many existing algorithms can be seen as instances of this template.

Oracle-based Lagrangian Optimizer. As a warm-up illustration of our approach, we first describe an idealized algorithm assuming access to an oracle that can approximately optimize L2 over Θ (this oracle essentially optimizes a weighted sum of rates, i.e. a cost-sensitive error objective [40]):

Definition 1. A ρ-approximate cost-sensitive optimization (CSO) oracle takes as input λ and outputs a model θ* ∈ Θ such that L2(θ*; λ) ≤ min_{θ∈Θ} L2(θ; λ) + ρ.

We have all three players optimize the true Lagrangians, with the θ-player and ξ-player playing best responses to their opponents' strategies, i.e.
they perform full optimization over their parameter space, and the λ-player running online gradient descent (OGD), an algorithm with no-regret guarantees [41]. The θ-player performs its best response by using the above oracle to approximately minimize L2 over θ. For the ξ-player, the best response optimization can be computed in closed form:

Lemma 1. Let ψ* : R_+^K → R denote the Fenchel conjugate of ψ. Then for any λ s.t. ‖λ‖₁ ≤ L:

∇ψ*(λ) ∈ argmin_{ξ∈[0,1]^K} L1(ξ; λ).

The resulting algorithm, outlined in Algorithm 1, is guaranteed to find an approximate Nash equilibrium of the max-min game in (3), and yields an approximate solution to (P1):

Theorem 1. Let θ^1, . . ., θ^T be the iterates generated by Algorithm 1 for (P1), and let µ̄ be a stochastic model with a probability mass of 1/T on each θ^t. Define B_λ ≥ max_t ‖∇_λL(θ^t, ξ^t; λ^t)‖₂. Then, setting κ = L and η_λ = L/(B_λ√(2T)), we have w.p. ≥ 1 − δ over draws of stochastic gradients:

ψ(R(µ̄)) ≤ min_{µ∈Δ_Θ} ψ(R(µ)) + O(√(log(1/δ)/T)) + ρ.

Remark 1 (Frank-Wolfe: a special case). A previous oracle-based approach for optimizing convex functions of rates is the Frank-Wolfe based method of Narasimhan et al. [20]. This method can be recovered from our framework by reformulating the λ-player's objective to include the minimization over ξ, L_FW(λ, θ) = min_ξ L1(ξ; λ) + L2(θ; λ), and having the λ-player play the Follow-The-Leader (FTL) algorithm [42] to maximize this objective over λ. As before, the θ-player plays best response on L2 using the CSO oracle. See Appendix B.1 for details, where we use recent connections between the classical Frank-Wolfe technique and equilibrium computation [43].

Surrogate-based Lagrangian Optimizer. While the CSO oracle may be available in some special cases (e.g.
when Θ is finite, or when the underlying class-conditional probabilities can be estimated accurately), in many practical scenarios it is not realistic to assume access to an oracle that can optimize non-continuous rates. We now provide a more practical algorithm for solving (P1), where the θ-player optimizes the surrogate Lagrangian function L̃2 in (4) instead of L2, using stochastic gradients ∇_θL̃2(θ; λ). The ξ- and λ-players, however, continue to operate on the true Lagrangian functions L1 and L2, which are continuous in the parameters ξ and λ that these players optimize.

In our proposed approach, outlined in Algorithm 2, both the θ-player and λ-player now run online gradient descent algorithms, while the ξ-player plays its best response at each iteration.

Algorithm 1 (Oracle-based Optimizer):
  Initialize λ^0
  for t = 0 to T − 1 do
    if (P1): ξ^t = ∇ψ*(λ^t); else if (P2): ξ^t ∈ argmin_{ξ∈[0,1]^K} L1(ξ; λ^t)
    θ^t ∈ argmin_{θ∈Θ} L2(θ; λ^t)   [CSO]
    λ^{t+1} = Π_Λ(λ^t + η_λ ∇_λL(θ^t, ξ^t; λ^t))
  end for
  return θ^1, . . ., θ^T

Algorithm 2 (Surrogate-based Optimizer):
  Initialize θ^0, λ^0
  for t = 0 to T − 1 do
    if (P1): ξ^t = ∇ψ*(λ^t); else if (P2): ξ^t ∈ argmin_{ξ∈[0,1]^K} L1(ξ; λ^t)
    θ^{t+1} = Π_Θ(θ^t − η_θ ∇_θL̃2(θ^t; λ^t))
    λ^{t+1} = Π_Λ(λ^t + η_λ ∇_λL(θ^t, ξ^t; λ^t))
  end for
  return θ^1, . . ., θ^T

Figure 1: Optimizers for the unconstrained problem (P1) and constrained problem (P2). Here Π_Λ denotes the ℓ1-projection onto Λ and Π_Θ denotes the ℓ2-projection onto Θ. We denote a (stochastic) gradient of L by ∇_λL(θ^t, ξ^t; λ^t) = [R̂_k(θ^t) − ξ^t_k]_{k=1}^{K}, where R̂_k(θ^t) is an unbiased estimate of R_k(θ^t).
We denote a (stochastic) sub-gradient of L̃2 by ∇_θL̃2(θ^t; λ^t) = Σ_{k=1}^{K} λ^t_k ∇_θR̃_k(θ^t).

Since it is the θ-player alone who optimizes a surrogate, the resulting game between the three players is no longer zero-sum. Yet, we are able to show that the player strategies converge to an approximate coarse-correlated (C.C.) equilibrium of the game, and yield an approximate solution to (P1).

Theorem 2. Let θ^1, . . ., θ^T be the iterates of Algorithm 2 for (P1), and let µ̄ be a stochastic model with probability mass 1/T on each θ^t. Let Θ be a convex set and Θ̃ = {θ ∈ Θ | R̃(θ) ∈ [0, 1]^K}. Let B_Θ ≥ max_{θ∈Θ} ‖θ‖₂, B_θ ≥ max_t ‖∇_θL̃2(θ^t; λ^t)‖₂ and B_λ ≥ max_t ‖∇_λL(θ^t, ξ^t; λ^t)‖₂. Then, setting κ = L, η_θ = B_Θ/(B_θ√T) and η_λ = L/(B_λ√(2T)), we have w.p. ≥ 1 − δ over draws of stochastic gradients:

ψ(R(µ̄)) ≤ min_{θ∈Θ̃} ψ(R̃(θ)) + O(√(log(1/δ)/T)).

Note that the right-hand side contains the optimal value for the surrogate objective ψ(R̃(·)) and not for the original performance metric. This is unsurprising given the θ-player's inability to work with the true rates. Also, while Algorithm 2 can be applied to optimize over a general (bounded) convex model class Θ, the comparator for our guarantee is a subset of models Θ̃ ⊆ Θ for which ψ(R̃(θ)) is defined. This is needed as the surrogates R̃(θ) may output values outside the domain of ψ.

Remark 2 (SPADE, NEMSIS: special cases of our approach). Our approach includes two previous surrogate-based algorithms as special cases: SPADE [14] and NEMSIS [15]. SPADE can be recovered from our framework by having the same player strategies as Algorithm 2, but with both the θ- and λ-players optimizing surrogate objectives, i.e.
with the θ-player minimizing L̃2 and the λ-player maximizing L1 + L̃2. NEMSIS also uses surrogates for both the θ and λ updates. It can be recovered by having the θ-player run OGD on L̃2, and having the λ-player play FTL over λ on the combined objective min_ξ L1(ξ; λ) + L̃2(θ; λ). See Appendix B.1 for details.

Remark 3 (Application to a wider range of metrics). Because of their strong reliance on surrogates, SPADE and NEMSIS cannot be applied directly to functions ψ that take inputs from a restricted range (e.g. the KL-divergence), unless the surrogates are also bounded in the same range. In Appendix D.1, we point out scenarios where the NEMSIS method [15] fails to optimize the KL-divergence metric, unless the model is sufficiently regularized to not output large negative values. Algorithm 2 has no such restriction and can be applied even if the outputs of the surrogates are not within the domain of ψ. This is because it uses the original rates for the updates on λ. As a result, the game play between ξ and λ never produces values that are outside the domain of ψ.

4 Generalized Rate Metric Constraints

We next describe how to apply our approach to the constrained optimization problem in (P2) and to the special case in (P3). We start with (P2), assuming that the constraint functions φ_j are jointly convex, monotonic in each argument and L-Lipschitz w.r.t. the ℓ1-norm, and that g is a bounded convex function. For convenience, we assume that the φ_j's are monotonically increasing in all arguments. Constraints on the KL-divergence and G-mean metrics are examples of this setting.

We introduce a set of auxiliary variables ξ_1, . . ., ξ_K for the K rate functions and re-write (P2) as:

(5)  min_{θ∈Θ, ξ∈[0,1]^K} g(θ)  s.t.  φ_j(ξ_1, . . ., ξ_K) ≤ 0, ∀j ∈ [J],  R_k(θ) ≤ ξ_k, ∀k ∈ [K].

The Lagrangian for the re-written problem is given below, where λ_1, . . .
, λ_J ∈ R_+ and λ_{J+1}, . . ., λ_{J+K} ∈ R_+ are the Lagrange multipliers for the two sets of constraints:

(6)  L(θ, ξ; λ) = [Σ_{j=1}^{J} λ_j φ_j(ξ) − Σ_{k=1}^{K} λ_{J+k} ξ_k] + [g(θ) + Σ_{k=1}^{K} λ_{J+k} R_k(θ)],

where the first bracketed term is L1(ξ; λ) and the second is L2(θ; λ). As before, we expand the search space to include stochastic models in Δ_Θ, restrict the Lagrange multipliers to a bounded set Λ = {λ ∈ R_+^{J+K} : ‖λ‖₁ ≤ κ}, and formulate a max-min problem:

max_{λ∈Λ} min_{µ∈Δ_Θ, ξ∈[0,1]^K} L1(ξ; λ) + L2(µ; λ).

One can now apply Algorithms 1 and 2 to this problem. For Algorithm 1, the θ-player uses the CSO oracle to optimize the true Lagrangian L2, and for Algorithm 2, the θ-player uses OGD to optimize a surrogate Lagrangian L̃2(µ; λ) = g(µ) + Σ_{k=1}^{K} λ_{J+k} R̃_k(µ). In both cases, the ξ-player plays best response by minimizing L1 over ξ, using an analytical solution (where available) or a convex optimization solver.

Theorem 3. Let θ^1, . . ., θ^T be the iterates generated by Algorithm 1 for (P2), and let µ̄ be a stochastic model with a probability mass of 1/T on each θ^t. Suppose there exists a µ′ ∈ Δ_Θ such that φ_j(R(µ′)) ≤ −γ, ∀j ∈ [J], for some γ > 0. Let B_g = max_{θ∈Θ} g(θ). Let µ* ∈ Δ_Θ be such that µ* is feasible, i.e. φ_j(R(µ*)) ≤ 0, ∀j ∈ [J], and E_{θ∼µ*}[g(θ)] ≤ E_{θ∼µ}[g(θ)] for every µ ∈ Δ_Θ that is feasible. Let B_λ ≥ max_t ‖∇_λL(θ^t, ξ^t; λ^t)‖₂. Then, setting κ = 2(L + 1)B_g/γ and η_λ = κ/(B_λ√(2T)), we have w.p.
≥ 1 − δ over draws of stochastic gradients:

E_{θ∼µ̄}[g(θ)] ≤ E_{θ∼µ*}[g(θ)] + O(√(log(1/δ)/T) + ρ)  and  φ_j(R(µ̄)) ≤ O(√(log(1/δ)/T) + ρ), ∀j.

We have thus shown that Algorithm 1 outputs a stochastic model that has an objective close to that of the optimal feasible solution for (P2), while also closely satisfying the constraints.

Theorem 4. Let θ^1, . . ., θ^T be the iterates of Algorithm 2 for (P2), and let µ̄ be a stochastic model with probability mass 1/T on each θ^t. Let Θ be convex and Θ̃ = {θ ∈ Θ | R̃(θ) ∈ [0, 1]^K}. Let θ̃* ∈ Θ̃ be such that it satisfies the surrogate-relaxed constraints φ_j(R̃(θ̃*)) ≤ 0, ∀j ∈ [J], and g(θ̃*) ≤ g(θ) for every θ ∈ Θ̃ that satisfies the same constraints. Let B_Θ ≥ max_{θ∈Θ} ‖θ‖₂, B_θ ≥ max_t ‖∇_θL̃2(θ^t; λ^t)‖₂ and B_λ ≥ max_t ‖∇_λL(θ^t, ξ^t; λ^t)‖₂. Then, setting κ = (L + 1)T^ω for ω ∈ (0, 0.5), η_θ = B_Θ/(B_θ√(2T)) and η_λ = κ/(B_λ√(2T)), we have w.p. ≥ 1 − δ over draws of stochastic gradients:

E_{θ∼µ̄}[g(θ)] ≤ g(θ̃*) + O(√(log(1/δ)) / T^{1/2−ω})  and  φ_j(R(µ̄)) ≤ O(√(log(1/δ)) / T^{ω}), ∀j.

The proof is an adaptation of the analysis in Agarwal et al. (2018) [26] to non-zero-sum games. We point out that despite the θ-player optimizing surrogate functions, the final stochastic model is near-feasible for the original rate metrics. We also note that this result holds even if the surrogates output values outside the domain of the constraint functions φ_j (e.g.
with KL-divergence constraints).

While the above convergence rate is not as good as the standard $O(1/\sqrt{T})$ rate achievable for OGD, it is similar to the guarantees shown by e.g. Agarwal et al. [26] for linear rate-constrained optimization problems.¹ The reason for the poorer convergence rate is that we are unable to fix the radius $\kappa$ of the space of Lagrange multipliers $\Lambda$ to a constant, and instead set it to a function of $T$.

¹See Theorem 3 in their paper, where, for $T = O(n^{4\alpha})$ iterations, the error bound is $\tilde{O}(n^{-\alpha}) = \tilde{O}(T^{-1/4})$.

Algorithm 3: Surrogate-based Optimizer for (P3)

  Initialize: $a^0, b^0, \theta^0, \lambda^0$
  For $t = 0$ to $T - 1$:
    $a^{t+1} = \Pi_C\big(a^t - \eta_a \nabla_a \mathcal{L}_{sr}(\theta^t, a^t, b^t; \lambda^t)\big)$
    $b^{t+1} = \Pi_C\big(b^t - \eta_b \nabla_b \mathcal{L}_{sr}(\theta^t, a^t, b^t; \lambda^t)\big)$, where the projection is applied coordinate-wise onto $C = \{(a_m, b_m) \in \mathbb{R}^2 \mid a \le a_m \le b_m \le b\}$
    $\theta^{t+1} = \Pi_\Theta\big(\theta^t - \eta_\theta \nabla_\theta \tilde{\mathcal{L}}_{sr}(\theta^t, a^t, b^t; \lambda^t)\big)$, where $\tilde{\mathcal{L}}_{sr}$ is (7) with each $R_k$ replaced by $\tilde{R}_k$
    $\lambda^{t+1} = \Pi_\Lambda\big(\lambda^t + \eta_\lambda \nabla_\lambda \mathcal{L}_{sr}(\theta^t, a^t, b^t; \lambda^t)\big)$
  End for
  Return $\theta^1, \ldots, \theta^T$

Remark 4 (Unanswered Question in Cotter et al.). Cotter et al. consider optimization problems with linear rate constraints, formulate a non-zero-sum game as we do, and consider two algorithms: one in which both the $\theta$- and $\lambda$-players seek to minimize external regret (through OGD updates), and another in which the $\theta$-player alone minimizes external regret, while the $\lambda$-player minimizes swap regret. They are, however, able to show convergence guarantees only for the more complicated swap-regret algorithm, and leave the analysis of the external-regret algorithm unanswered. Theorem 4 provides convergence guarantees for a generalization of their external-regret algorithm. In their paper, Cotter et al.
do obtain a better $O(1/\sqrt{T})$ convergence rate for their swap-regret algorithm. It is easy to show a similar convergence rate for an adaptation of this algorithm to our setting (see Appendix C), but we stick with our present algorithm because of its simplicity.

Case of Sum-of-ratios Metrics. Moving beyond convex functions of rates, we present a heuristic algorithm for optimizing with objectives and constraints in (P3) that are sums-of-ratios of rates. We assume that the numerator and denominator of each ratio term are bounded, i.e. $a \le \langle \alpha_m, R(\theta) \rangle \le \langle \beta_m, R(\theta) \rangle \le b$ and $a \le \langle \alpha'_m, R(\theta) \rangle \le \langle \beta'_m, R(\theta) \rangle \le b$, $\forall \theta \in \Theta$, for some $a, b > 0$. Introducing slack variables $a_1, \ldots, a_{2M}$ and $b_1, \ldots, b_{2M}$ for the numerators and denominators respectively, to decouple the rates from the ratio terms, we equivalently re-write (P3) as:

$$\min_{\theta \in \Theta,\; a \le a_m \le b_m \le b}\; \sum_{m=1}^{M} \frac{a_m}{b_m} \quad \text{s.t.} \quad \sum_{m=M+1}^{2M} \frac{a_m}{b_m} \le \gamma, \quad a_m \ge \langle \alpha_m, R(\theta) \rangle, \quad b_m \le \langle \beta_m, R(\theta) \rangle, \;\forall m.$$

We then formulate the Lagrangian for the above problem, with multipliers $\lambda \in \mathbb{R}^{4M+1}_+$, and get:

$$\mathcal{L}_{sr}(\theta, a, b; \lambda) \;=\; \sum_{m=1}^{M} \frac{a_m}{b_m} \;+\; \lambda_0\!\left(\sum_{m=M+1}^{2M} \frac{a_m}{b_m} - \gamma\right) \;+\; \sum_{m=1}^{2M} \lambda_m\big(\langle \alpha_m, R(\theta) \rangle - a_m\big) \;+\; \sum_{m=1}^{2M} \lambda_{2M+m}\big(b_m - \langle \beta_m, R(\theta) \rangle\big). \qquad (7)$$

Because $\mathcal{L}_{sr}$ is non-convex in the slack variables $a, b$, strong duality may not hold, and an optimal solution to the dual problem may not be optimal for (P3). Yet, by performing OGD updates for $\lambda$, $\theta$ and the slack variables, with the $\theta$-player alone optimizing the surrogate rates $\tilde{R}_k$, we obtain a heuristic for solving (P3). The details are given in Algorithm 3.

5 Experiments

We conduct two types of experiments.
In the first, we evaluate Algorithm 2 on the task of optimizing a convex rate objective subject to linear rate constraints, and show that it often performs as well as an oracle-based approach for this problem [11] (and does so without having to make an idealized oracle assumption). In the second, we evaluate Algorithm 3 on the task of optimizing a sum-of-ratios objective subject to sum-of-ratios constraints, and compare it against existing baselines.

Datasets. We use five datasets: (1) COMPAS, where the goal is to predict recidivism, with gender as the protected attribute [44]; (2) Communities & Crime, where the goal is to predict if a community in the US has a crime rate above the 70th percentile [45], and we consider communities having a black population above the 50th percentile as protected [27]; (3) Law School, where the task is to predict whether a law school student will pass the bar exam, with race (black or other) as the protected attribute [46]; (4) Adult, where the task is to predict if a person's income exceeds 50K/year, with gender as the protected attribute [45]; (5) Wiki Toxicity, where the goal is to predict if a comment posted on a Wikipedia talk page contains non-toxic/acceptable content, with the comments containing the term 'gay' considered as a protected group [47]. We use linear models, and hinge losses as surrogates $\tilde{R}_k$. All implementations are in TensorFlow.² See Appendix G for additional details.

Table 3: Optimizing the KL-divergence fairness metric s.t. error rate constraints. For each method, we report two metrics: A (B), where A is the test fairness metric (lower is better) and B is the ratio of the test error rate of the method to that of UncError (lower is better). During training, we constrain B to be $\le 1.1$. Among the last 3 columns, the lowest fairness metric is highlighted in bold.

            UncError       PostShift      COCO           Stochastic     Determ.
  COMPAS    0.115 (1.00)   0.000 (1.01)   0.043 (1.01)   0.000 (1.03)   0.000 (1.03)
  Crime     0.224 (1.00)   0.005 (1.40)   0.252 (0.83)   0.120 (1.11)   0.146 (1.08)
  Law       0.199 (1.00)   0.001 (1.45)   0.043 (1.05)   0.054 (1.12)   0.056 (1.08)
  Adult     0.114 (1.00)   0.000 (1.22)   0.011 (1.10)   0.014 (1.10)   0.014 (1.10)
  Wiki      0.175 (1.00)   0.001 (1.21)   0.134 (1.17)   0.133 (1.09)   0.127 (1.18)

Table 4: Optimizing F-measure s.t. F-measure constraints. For each method, we report two metrics: A (B), where A is the overall test F-measure (higher is better) and B is the test violation of the constraint $\mathrm{Fmeasure}_{\mathrm{prt}} \ge \mathrm{Fmeasure}_{\mathrm{other}} - 0.02$ (lower is better). The lowest constraint violation is in italic.

            UncError       UncF1          Stochastic     Determ.
  COMPAS    0.656 (0.13)   0.666 (0.09)   0.627 (0.07)   0.628 (0.07)
  Crime     0.742 (0.19)   0.752 (0.19)   0.711 (0.11)   0.711 (0.11)
  Law       0.973 (0.10)   0.975 (0.08)   0.927 (0.04)   0.842 (-0.05)
  Adult     0.675 (0.03)   0.688 (0.06)   0.660 (0.04)   0.647 (0.03)
  Wiki      0.968 (0.18)   0.967 (0.18)   0.826 (0.00)   0.782 (-0.05)

KL-divergence Based Fairness Objective. We consider a demographic-parity style fairness objective that seeks to match the proportion of positives predicted in each group, $\hat{p}_G$, with the true proportion of positives $p$ in the data, measured using a KL-divergence metric: $\sum_{G \in \{0,1\}} \mathrm{KLD}(p, \hat{p}_G)$. Note that this is convex in $\hat{p}_G$.
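To make this metric concrete, each per-group term is the KL divergence between two Bernoulli distributions: one with parameter $p$ (the overall positive-label rate) and one with parameter $\hat{p}_G$ (the positive-prediction rate within group $G$). Below is a minimal NumPy sketch of how the metric could be evaluated; the function names are ours for illustration, not from the released code:

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    # KL divergence between Bernoulli(p) and Bernoulli(q); clipping keeps
    # the logs finite when q hits 0 or 1.
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kld_fairness_metric(y_true, y_pred, groups):
    # Sum over groups G of KLD(p, p_hat_G): p is the overall proportion of
    # positive labels; p_hat_G is the proportion of positive *predictions*
    # within group G.
    p = np.mean(y_true)
    return sum(bernoulli_kl(p, np.mean(y_pred[groups == g]))
               for g in np.unique(groups))
```

If the positive-prediction rate matches $p$ in every group, the metric is zero; any per-group deviation makes it strictly positive.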
We additionally enforce a constraint that the error rate of the model is no more than 10% higher than that of an unconstrained model that optimizes the error rate, i.e. $\widehat{\mathrm{err}}(f_\theta) \le 1.1\,\widehat{\mathrm{err}}(f_{\mathrm{unc}})$. The only previous method we are aware of that can handle constrained problems of this form is the COCO method of Narasimhan (2018) [11]. This is an oracle-based approach, and uses a plug-in method to implement the cost-sensitive oracle. We compare Algorithm 2 with COCO and with the post-shift method of Hardt et al. (2016) [48], where we take a pre-trained logistic regression model and assign different thresholds to the groups to correct for fairness disparity. For our algorithm, we report the performance of both the trained stochastic classifier and the best deterministic classifier chosen through the 'best iterate' heuristic of Cotter et al. (2019) [23]. The results are shown in Table 3. PostShift performs the best on the fairness metric, but on all datasets except COMPAS, fares poorly on the error rate constraint. The proposed method and COCO achieve different trade-offs between optimizing the objective and satisfying the error rate constraint. The stochastic classifier trained by our method closely satisfies the constraint on almost all datasets, and yields a lower objective than COCO on three datasets. Moreover, the best deterministic classifier often has performance similar to that of the stochastic classifier. We present further analysis and additional comparisons in Appendix G.1.

F-measure Based Parity Constraints. We consider the fairness goal of training a classifier that yields at least as high an F-measure for the protected group as it does for the rest of the population, and impose this as a constraint. Specifically, we seek to maximize the overall F-measure subject to the constraint $\mathrm{Fmeasure}_{\mathrm{prt}} \ge \mathrm{Fmeasure}_{\mathrm{other}} - 0.02$.
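For concreteness, the constraint violation reported for this task can be computed from binary predictions by evaluating the F-measure separately on the protected group and on the rest, and taking the slack-adjusted gap. This is a small NumPy sketch under our reading of the constraint; the helper names are ours:

```python
import numpy as np

def f_measure(y_true, y_pred, eps=1e-12):
    # F-measure: harmonic mean of precision and recall for binary labels;
    # eps guards against division by zero when there are no positives.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn + eps)

def parity_violation(y_true, y_pred, protected, slack=0.02):
    # Amount by which F_prt >= F_other - slack is violated
    # (positive means the constraint is violated).
    f_prt = f_measure(y_true[protected], y_pred[protected])
    f_other = f_measure(y_true[~protected], y_pred[~protected])
    return f_other - f_prt - slack
```

A positive value indicates the protected group's F-measure trails the rest of the population by more than the allowed slack.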
We apply Algorithm 3, and compare it with an unconstrained classifier that optimizes the error rate (UncError) and a plug-in classifier that optimizes the F-measure without constraints (UncF1). As seen in Table 4, while UncF1 yields a better objective than UncError, both baselines fail to satisfy the constraint. On four of the datasets, the stochastic classifier trained by our approach yields a moderate to significant reduction in constraint violation. On Adult, the trained stochastic classifier yields a very small constraint violation on the training set (0.004), but does not perform as well on the test set. The deterministic classifiers have an equal or lower constraint violation compared to the stochastic classifiers, but at the cost of a lower objective.

²https://github.com/google-research/google-research/tree/master/generalized_rates

Acknowledgments

We thank Qijia Jiang and Heinrich Jiang for helpful feedback on draft versions of this paper.

References

[1] D.D. Lewis. Evaluating text categorization. In HLT Workshop on Speech and Natural Language, 1991.

[2] J-D. Kim, Y. Wang, and Y. Yasunori. The Genia event extraction shared task, 2013 edition - overview. ACL, 2013.

[3] Y. Sun, M.S. Kamel, and Y. Wang. Boosting for learning multiple classes with imbalanced class distribution. In ICDM, 2006.

[4] S. Wang and X. Yao. Multiclass imbalance problems: Analysis and potential solutions. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42(4):1119–1130, 2012.

[5] S. Lawrence, I. Burns, A. Back, A-C. Tsoi, and C.L. Giles. Neural network classification and prior class probabilities. In Neural Networks: Tricks of the Trade, LNCS 1524, pages 299–313. Springer, 1998.

[6] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, 2017.

[7] M.M. Fard, Q. Cormier, K.
Canini, and M. Gupta. Launch and iterate: Reducing prediction churn. In NIPS, 2016.

[8] A. Esuli and F. Sebastiani. Optimizing text quantifiers for multivariate loss functions. ACM Transactions on Knowledge Discovery from Data, 9(4):Article 27, 2015.

[9] W. Gao and F. Sebastiani. Tweet sentiment: From classification to quantification. In ASONAM, 2015.

[10] E. Eban, M. Schain, A. Mackey, A. Gordon, R. Rifkin, and G. Elidan. Scalable learning of non-decomposable objectives. In AISTATS, 2017.

[11] H. Narasimhan. Learning with complex loss functions and constraints. In AISTATS, 2018.

[12] T. Joachims. A support vector method for multivariate performance measures. In ICML, 2005.

[13] P. Kar, H. Narasimhan, and P. Jain. Online and stochastic gradient methods for non-decomposable loss functions. In NIPS, 2014.

[14] H. Narasimhan, P. Kar, and P. Jain. Optimizing non-decomposable performance measures: A tale of two classes. In ICML, 2015.

[15] P. Kar, S. Li, H. Narasimhan, S. Chawla, and F. Sebastiani. Online optimization methods for the quantification problem. In KDD, 2016.

[16] G. Goh, A. Cotter, M. Gupta, and M.P. Friedlander. Satisfying real-world goals with dataset constraints. In NIPS, 2016.

[17] A. Sanyal, P. Kumar, P. Kar, S. Chawla, and F. Sebastiani. Optimizing non-decomposable measures with deep networks. Machine Learning, 107(8-10):1597–1620, 2018.

[18] S.A.P. Parambath, N. Usunier, and Y. Grandvalet. Optimizing F-measures by cost-sensitive classification. In NIPS, 2014.

[19] O. Koyejo, N. Natarajan, P. Ravikumar, and I.S. Dhillon. Consistent binary classification with generalized performance metrics. In NIPS, 2014.

[20] H. Narasimhan, H.G. Ramaswamy, A. Saha, and S. Agarwal. Consistent multiclass algorithms for complex performance measures. In ICML, 2015.

[21] B. Yan, O. Koyejo, K. Zhong, and P. Ravikumar.
Binary classification with karmic, threshold-quasi-concave metrics. In ICML, 2018.

[22] D. Alabi, N. Immorlica, and A. Kalai. Unleashing linear optimizers for group-fair learning and optimization. In COLT, 2018.

[23] A. Cotter, H. Jiang, and K. Sridharan. Two-player games for efficient non-convex constrained optimization. In ALT, 2019.

[24] A. Cotter, H. Jiang, S. Wang, T. Narayan, M. Gupta, S. You, and K. Sridharan. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. JMLR (to appear), arXiv preprint arXiv:1809.04198, 2019.

[25] M.B. Zafar, I. Valera, M. Gomez-Rodriguez, and K.P. Gummadi. Fairness constraints: Mechanisms for fair classification. In AISTATS, 2017.

[26] A. Agarwal, A. Beygelzimer, M. Dudik, J. Langford, and H. Wallach. A reductions approach to fair classification. In ICML, 2018.

[27] M. Kearns, S. Neel, A. Roth, and Z.S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In ICML, 2018.

[28] M. Donini, L. Oneto, S. Ben-David, J.S. Shawe-Taylor, and M. Pontil. Empirical risk minimization under fairness constraints. In NeurIPS, 2018.

[29] H. Wang, W. Xing, K. Asif, and B. Ziebart. Adversarial prediction games for multivariate losses. In NIPS, 2015.

[30] R. Busa-Fekete, B. Szörényi, K. Dembczynski, and E. Hüllermeier. Online F-measure optimization. In NIPS, 2015.

[31] M. Liu, X. Zhang, X. Zhou, and T. Yang. Faster online learning of optimal threshold for consistent F-measure optimization. In NeurIPS, 2018.

[32] W. Pan, H. Narasimhan, P. Kar, P. Protopapas, and H.G. Ramaswamy. Optimizing the multiclass F-measure via biconcave programming. In ICDM, 2016.

[33] L.E. Celis, L. Huang, V. Keswani, and N.K. Vishnoi. Classification with fairness constraints: A meta-algorithm with provable guarantees. In FAT*, 2019.

[34] M. Kubat and S. Matwin.
Addressing the curse of imbalanced training sets: One-sided selection. In ICML, 1997.

[35] S. Daskalaki, I. Kopanas, and N. Avouris. Evaluation of classifiers for an uneven class distribution problem. Applied Artificial Intelligence, 20:381–417, 2006.

[36] K. Kennedy, B.M. Namee, and S.J. Delany. Learning without default: A study of one-class classification and the low-default portfolio problem. In ICAICS, 2009.

[37] C.D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[38] D.D. Lewis and W.A. Gale. A sequential algorithm for training text classifiers. In SIGIR, 1994.

[39] A. Cotter, H. Narasimhan, and M. Gupta. On making stochastic classifiers deterministic. In NeurIPS, 2019.

[40] C. Elkan. The foundations of cost-sensitive learning. In IJCAI, 2001.

[41] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.

[42] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

[43] J.D. Abernethy and J.-K. Wang. On Frank-Wolfe and equilibrium computation. In NIPS, 2017.

[44] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias. ProPublica, May 23, 2016.

[45] A. Frank and A. Asuncion. UCI machine learning repository. URL: http://archive.ics.uci.edu/ml, 2010.

[46] L. Wightman. LSAC National Longitudinal Bar Passage Study. Law School Admission Council, 1998.

[47] L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman. Measuring and mitigating unintended bias in text classification. In AIES, 2018.

[48] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In NIPS, 2016.

[49] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.

[50] N. Cesa-Bianchi, A. Conconi, and C.
Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

[51] X. Zhou. On the Fenchel duality between strong convexity and Lipschitz continuous gradient. arXiv preprint arXiv:1803.06573, 2018.