{"title": "Recycling Privileged Learning and Distribution Matching for Fairness", "book": "Advances in Neural Information Processing Systems", "page_first": 677, "page_last": 688, "abstract": "Equipping machine learning models with ethical and legal constraints is a serious issue; without this, the future of machine learning is at risk. This paper takes a step forward in this direction and focuses on ensuring machine learning models deliver fair decisions. In legal scholarships, the notion of fairness itself is evolving and multi-faceted. We set an overarching goal to develop a unified machine learning framework that is able to handle any definitions of fairness, their combinations, and also new definitions that might be stipulated in the future. To achieve our goal, we recycle two well-established machine learning techniques, privileged learning and distribution matching, and harmonize them for satisfying multi-faceted fairness definitions. We consider protected characteristics such as race and gender as privileged information that is available at training but not at test time; this accelerates model training and delivers fairness through unawareness. Further, we cast demographic parity, equalized odds, and equality of opportunity as a classical two-sample problem of conditional distributions, which can be solved in a general form by using distance measures in Hilbert Space. We show several existing models are special cases of ours. Finally, we advocate returning the Pareto frontier of multi-objective minimization of error and unfairness in predictions. 
This will facilitate decision makers to select an operating point and to be accountable for it.", "full_text": "Recycling Privileged Learning\n\nand Distribution Matching for Fairness\n\nPredictive Analytics Lab (PAL)\n\nUniversity of Sussex\n\nNovi Quadrianto\u2217\n\nBrighton, United Kingdom\n\nn.quadrianto@sussex.ac.uk\n\nViktoriia Sharmanska\nDepartment of Computing\nImperial College London\nLondon, United Kingdom\nsharmanska.v@gmail.com\n\nAbstract\n\nEquipping machine learning models with ethical and legal constraints is a serious\nissue; without this, the future of machine learning is at risk. This paper takes a step\nforward in this direction and focuses on ensuring machine learning models deliver\nfair decisions. In legal scholarships, the notion of fairness itself is evolving and\nmulti-faceted. We set an overarching goal to develop a uni\ufb01ed machine learning\nframework that is able to handle any de\ufb01nitions of fairness, their combinations,\nand also new de\ufb01nitions that might be stipulated in the future. To achieve our\ngoal, we recycle two well-established machine learning techniques, privileged\nlearning and distribution matching, and harmonize them for satisfying multi-faceted\nfairness de\ufb01nitions. We consider protected characteristics such as race and gender\nas privileged information that is available at training but not at test time; this\naccelerates model training and delivers fairness through unawareness. Further, we\ncast demographic parity, equalized odds, and equality of opportunity as a classical\ntwo-sample problem of conditional distributions, which can be solved in a general\nform by using distance measures in Hilbert Space. We show several existing\nmodels are special cases of ours. Finally, we advocate returning the Pareto frontier\nof multi-objective minimization of error and unfairness in predictions. 
This will help decision makers select an operating point and be accountable for it.

1 Introduction

Machine learning technologies have permeated everyday life, and it is now common for an automated system to make decisions for or about us, such as who is going to get bank credit. As more decisions in employment, housing, and credit become automated, there is a pressing need to address the ethical and legal aspects posed by these technologies, including fairness, accountability, transparency, privacy, and confidentiality [1, 2]. This paper focuses on enforcing fairness in the decisions made by machine learning models. A decision is fair if [3, 4, 5]: i) it is not based on a protected characteristic [6] such as gender, marital status, or age (fair treatment); ii) it does not disproportionately benefit or hurt individuals sharing a certain value of their protected characteristic (fair impact); and iii) given the target outcomes, it enforces equal discrepancies between decisions and target outcomes across groups of individuals based on their protected characteristic (fair supervised performance).
The above three fairness definitions have been studied before, and several machine learning frameworks for addressing each one, or a combination of them, are available. We first note that one could ensure fair treatment by simply ignoring protected characteristic features, i.e. fairness through unawareness. However, this poses a risk of unfairness by proxy, as there are ways of predicting

* Also with National Research University Higher School of Economics, Moscow, Russia.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

protected characteristic features from other features [7, 8]. Existing models guard against unfairness by proxy by enforcing fair impact or fair supervised performance constraints in addition to the fair treatment constraint. 
An example of the fair impact constraint is the 80% rule (see e.g. [3, 9, 10]) in\nwhich positive decisions must be in favour of group B individuals at least 80% as often as in favour\nof group A individuals for the case of a binary protected characteristic and a binary decision. Another\nexample of the fair impact constraint is a demographic parity in which positive decisions of group B\nindividuals must be at the same rate as positive decisions of group A individuals (see e.g. [11] and\nearlier works [12, 13, 14]).\nIn contrast to the fair impact that only concerns about decisions of an automated system, the\nfair supervised performance takes into account, when enforcing fairness, the discrepancy between\ndecisions (predictions) and target outcomes, which is compatible to the standard supervised learning\nsetting. Kleinberg et al. [15] show that fair impact and fair supervised performance are indeed\nmutually exclusive measures of fairness. Examples of the fair supervised performance constraint\nare equality of opportunity [4] in which the true positive rates (false negative rates) across groups\nmust match, and equalized odds [4] in which both the true positive rates and false positive rates must\nmatch. Hardt et al. [4] enforce equality of opportunity or equalized odds by post-processing the\nsoft-outputs of an unfair classi\ufb01er. The post-processing step consists of learning a different threshold\nfor a different group of individuals. The utilization of an unfair classi\ufb01er as a building block of the\nmodel is deliberate as the main goal of supervised machine learning models is to perform prediction\ntasks for future data as accurately as possible. Suppose the target outcome is correlated with the\nprotected characteristic, Hardt et al.\u2019s model will be able to learn the ideal predictor, which is not\nunfair as it represents the target outcome [4]. 
However, Hardt et al.\u2019s model needs to access the\nvalue of the protected characteristic for future data. Situations where the protected characteristic\nis unavailable due to con\ufb01dentiality or is prohibited to be accessed due to the fair treatment law\nrequirement will make the model futile [5]. Recent work of Zafar et al. [5] propose de-correlation\nconstraints between supervised performance, e.g. true positive rate, and protected characteristics as a\nway to achieve fair supervised performance. Zafar et al.\u2019s model, however, will not be able to learn\nthe ideal predictor when the target outcome is indeed correlated with a protected characteristic.\nThis paper combines the bene\ufb01ts of Hardt et al.\u2019s model [4] in its ability to learn the ideal predictor\nand of Zafar et al.\u2019s model [5] in not requiring the availability of protected characteristic for future\ndata at prediction time. To achieve this, we will be building upon recent advances in the use of\nprivileged information for training machine learning models [16, 17, 18, 19]. Privileged information\nrefers to features that can be used at training time but will not be available for future data at prediction\ntime. We propose to consider protected characteristics such as race, gender, or marital status as\nprivileged information. The privileged learning framework is remarkably suitable for incorporating\nfairness, as it learns the ideal predictor and does not require protected characteristics for future data.\nTherefore, this paper recycles the overlooked privileged learning framework, which is designed for\naccelerating learning and improving prediction performance, for building a fair classi\ufb01cation model.\nEnforcing fairness using the privileged learning framework alone, however, might increase the risk\nof unfairness by proxy. 
Our proposed model guards against this by explicitly adding fair impact\nand/or fair supervised performance constraints into the privileged learning model. We recycle a\ndistribution matching measure for fairness. This measure can be instantiated for both fair impact\n(e.g. demographic parity) and fair supervised performance (e.g. equalized odds and equality of\nopportunity) constraints. Matching a distribution between function outputs (decisions) across different\ngroups will deliver fair impact, and matching a distribution between errors (discrepancies between\ndecisions and target outcomes) across different groups will deliver fair supervised performance. We\nfurther show several existing methods are special cases of ours.\n\n2 Related Work\n\nThere is much work on the topic of fairness in the machine learning context in addition to those that\nhave been embedded in the introduction. One line of research can be described in terms of learning\nfair models by modifying feature representations of the data (e.g. [20, 10, 21]), class label annotations\n([22]), or even the data itself ([23]). Another line of research is to develop classi\ufb01er regularizers that\npenalize unfairness (e.g. [13, 14, 24, 11, 5]). Our method falls into this second line of research. It\nhas also been emphasized that fair models could enforce group fairness de\ufb01nitions (covered in the\nintroduction) as well as individual fairness de\ufb01nitions. Dwork et al. and Joseph et al. [25, 26] de\ufb01ne\n\n2\n\n\fan individual fairness as a non-preferential treatment towards an individual A if this individual is not\nas quali\ufb01ed as another individual B; this is a continuous analog of fairness through unawareness [23].\nOn privileged learning Vapnik et al. [16] introduce privileged learning in the context of Support\nVector Machines (SVM) and use the privileged features to predict values of the slack variables. 
It\nwas shown that this procedure can provably reduce the amount of data needed for learning an optimal\nhyperplane [16, 27, 19]. Additional features for training a classi\ufb01er that will not necessarily be\navailable at prediction time, privileged information, are widespread. As an example, features from 3D\ncameras and laser scanners are slow to acquire and expensive to store but have the potential to boost\nthe predictive capability of a trained 2D system. Many variants of privileged learning methods and\nsettings have been proposed such as, structured prediction [28], margin transfer [17], and Bayesian\nprivileged learning [18, 29]. Privileged learning has also been shown [30] to be intimately related\nto Hinton et al.\u2019s knowledge distillation [31] and Bucila et al.\u2019s [32] model compression in which a\ncomplex model is learnt and is then replicated by a simpler model.\nOn distribution matching Distribution matching has been explored in the context of domain adap-\ntation (e.g. [33, 34]), transduction learning (e.g. [35]), and recently in privileged learning [36],\namong others. The empirical Maximum Mean Discrepancy (MMD) [37] is commonly used as the\nnonparametric metric that captures discrepancy between two distributions. In the domain adaptation\nsetting, Pan et al. [38] use the MMD metric to project data from target and related source domain into\na common subspace such that the difference between the distributions of source and target domain\ndata is reduced. A similar idea has been explored in the context of deep neural networks by Zhang\net al. [34], where they use the MMD metric to match both the distribution of the features and the\ndistribution of the labels given features in the source and target domains. In the transduction setting,\nQuadrianto et al. [35] propose to minimize the mismatch between the distribution of function outputs\non the training data and on the target test data. Recently, Sharmanska et al. 
[36] devise a cross-dataset\ntransfer learning method by matching the distribution of classi\ufb01er errors across datasets.\n\n3 The Fairness Model\n\nIn this section, we will formalize the setup of a supervised binary classi\ufb01cation task subject to fairness\nconstraints. Assume that we are given a set of N training examples, represented by feature vectors\nX = {x1, . . . , xN} \u2282 X = Rd, their label annotation, Y = {y1, . . . , yN} \u2208 Y = {+1,\u22121}, and\nprotected characteristic information also in the form of feature vectors, Z = {z1, . . . , zN} \u2282 Z,\nwhere zn encodes the protected characteristics of sample xn. The task of interest is to infer a predictor\nf for the label ynew of an un-seen instance xnew, given Y , X and Z. However, f cannot use the\nprotected characteristic Z at decision (prediction) time, as it will constitute an unfair treatment.\nThe availability of protected characteristic at training time can be used to enforce fair impact\nand/or fair supervised performance constraints. We \ufb01rst describe how to deliver fair treatment via\nprivileged learning. We then detail distribution matching viewpoint of fair impact and fair supervised\nperformance. Frameworks of privileged learning and distribution matching are suitable for protected\ncharacteristics with binary/multi-class/continuous values. In this paper, however, we focus on a single\nprotected characteristic admitting binary values as in existing work (e.g. [20, 4, 5]).\n\n3.1 Fairness through Unawareness: Privileged Learning\n\n1, y1), . . . , (xN , x(cid:63)\n\nIn the privileged learning setting [16], we are given training triplets (x1, x(cid:63)\nN , yN )\nwhere (xn, yn) \u2282 X \u00d7 Y is the standard training input-output pair and x(cid:63)\nn \u2208 X (cid:63) is additional\ninformation about a training instance xn. This additional (privileged) information is only available\nduring training. 
In our earlier illustrative example in the related work, x_n is, for example, a colour feature from a 2D image while x*_n is a feature from 3D cameras and laser scanners. There is no direct limitation on the form of privileged information: it could be yet another feature representation, like shape features from the 2D image, or a completely different modality, like 3D cameras in addition to the 2D image, that is specific to each training instance. The goal of privileged learning is to use x*_n to accelerate the learning process of inferring an optimal (ideal) predictor in the data space X, i.e. f : X -> Y. The difference between accelerated and non-accelerated methods is in the rate of convergence to the optimal predictor, e.g. 1/N cf. 1/sqrt(N) for margin-based classifiers [16, 19].
From the description above, it is apparent that both the privileged learning model and the fairness model aim to use data, the privileged feature x*_n and the protected characteristic z_n respectively, that are available at training time only. We propose to recycle the privileged learning model for achieving fairness through unawareness by taking protected characteristics as privileged information. For a single binary protected characteristic z_n, x*_n is formed by concatenating x_n and z_n. This is because privileged information has to be instance specific and richer than x_n alone, and this is not the case when only a single binary protected characteristic is used. By using the privileged learning framework, the predictor f is unaware of the protected characteristic z_n, as this information is not used as an input to the predictor. Instead, z_n, together with x_n, is used to distinguish between easy-to-classify and difficult-to-classify data instances, and subsequently this knowledge is used to accelerate the learning process of a predictor f [16, 17]. 
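As a minimal illustration of this data handling (the arrays below are hypothetical; `X` holds ordinary features and `z` a binary protected characteristic), the privileged representation is just the train-time concatenation, while the deployed predictor is a function of `X` alone:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 3
X = rng.normal(size=(N, d))          # ordinary features, available at train and test time
z = rng.integers(0, 2, size=(N, 1))  # binary protected characteristic, train time only

# Privileged representation x* = [x; z]: instance-specific and richer than x alone.
X_star = np.concatenate([X, z], axis=1)

# The deployed predictor f is a function of x only ("fairness through unawareness");
# w is a placeholder parameter vector, not a trained model.
w = rng.normal(size=d)
f = lambda x: x @ w                  # z never enters f's input

# x* (and hence z) may only influence training, e.g. via easiness/hardness of instances.
assert X_star.shape == (N, d + 1)
assert f(X).shape == (N,)
```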
Easiness and hardness can be de\ufb01ned, for example, based on the distance of data instance\nto the decision boundary (margin) [16, 17, 19] or based on the steepness of the logistic likelihood\nfunction [18]. Our speci\ufb01c choice of easiness and hardness de\ufb01nition is detailed in Section 3.3.\nA direct advantage of approaching fairness from the privileged lens is the learning acceleration can\nbe used to limit the performance degradation of the fair model as it now has to trade-off two goals:\ngood prediction performance and respecting fairness constraints. An obvious disadvantage is an\nincreased risk of unfairness by proxy as knowledge of easy-to-classify and dif\ufb01cult-to-classify data\ninstances is based on protected characteristics. The next section describes a way to alleviate this\nbased on a distribution matching principle.\n\n3.2 Demographic Parity, Equalized Odds, Equality of opportunity, and Beyond: Matching\n\nConditional Distributions\n\nWe have the following de\ufb01nitions for several fairness criteria [25, 4, 5]:\nDe\ufb01nition A Demographic parity (fair impact): A binary decision model is fair if its decision\n{+1,\u22121} are independent of the protected characteristic z \u2208 {0, 1}. A decision \u02c6f satis\ufb01es this\nde\ufb01nition if\n\nP (sign( \u02c6f (x)) = +1|z = 0) = P (sign( \u02c6f (x)) = +1|z = 1).\n\nDe\ufb01nition B Equalized odds (fair supervised performance): A binary decision model is fair if its\ndecisions {+1,\u22121} are conditionally independent of the protected characteristic z \u2208 {0, 1} given\nthe target outcome y. A decision \u02c6f satis\ufb01es this de\ufb01nition if\n\nP (sign( \u02c6f (x)) = +1|z = 0, y) = P (sign( \u02c6f (x)) = +1|z = 1, y), for y \u2208 {+1,\u22121}.\n\nFor the target outcome y = +1, the de\ufb01nition above requires that \u02c6f has equal true positive rates\nacross two different values of protected characteristic. 
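Definitions A and B can be checked empirically from a model's decisions. Below is a minimal sketch (hypothetical toy arrays, not data from the paper) computing the group-conditional rates behind demographic parity and equalized odds:

```python
import numpy as np

def group_rates(scores, y, z):
    """Empirical quantities behind Definitions A and B for binary z in {0, 1}:
    the positive-decision rate P(sign(f(x)) = +1 | z) (demographic parity)
    and the true/false positive rates per group (equalized odds)."""
    pred = np.sign(scores)
    out = {}
    for g in (0, 1):
        m = (z == g)
        out[g] = {
            "pos_rate": np.mean(pred[m] == 1),
            "tpr": np.mean(pred[m & (y == 1)] == 1),
            "fpr": np.mean(pred[m & (y == -1)] == 1),
        }
    return out

# Hypothetical toy data; this decision maker is fair by construction.
scores = np.array([0.9, -0.5, 0.3, -0.2, 0.8, -0.6, 0.4, -0.1])
y      = np.array([1, -1, 1, -1, 1, -1, 1, -1])
z      = np.array([0, 0, 0, 0, 1, 1, 1, 1])
r = group_rates(scores, y, z)
# Demographic parity gap and equalized-odds gaps (all zero on this toy example):
print(abs(r[0]["pos_rate"] - r[1]["pos_rate"]),
      abs(r[0]["tpr"] - r[1]["tpr"]),
      abs(r[0]["fpr"] - r[1]["fpr"]))
```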
It requires f̂ to have equal false positive rates for the target outcome y = -1.
Definition C Equality of opportunity (fair supervised performance): A binary decision model is fair if its decisions {+1, -1} are conditionally independent of the protected characteristic z in {0, 1} given the positive target outcome y. A decision f̂ satisfies this definition if

P(sign(f̂(x)) = +1 | z = 0, y = +1) = P(sign(f̂(x)) = +1 | z = 1, y = +1).

Equality of opportunity only constrains equal true positive rates across the two demographics.
All three fairness criteria rely on the idea that data across the two demographics should exhibit similar behaviour, i.e. matching positive predictions, matching true positive rates, and matching false positive rates. A natural pathway to inject these into any learning model is to use a distribution matching framework. This matching assumption is well founded if we assume that both the data X_{z=0} = {x^{z=0}_1, ..., x^{z=0}_{N_{z=0}}} in X and the data X_{z=1} = {x^{z=1}_1, ..., x^{z=1}_{N_{z=1}}} in X are drawn independently and identically distributed from the same distribution p(x) on a domain X. It therefore follows that for any function (or set of functions) f, the distribution of f(x) where x ~ p(x) should also behave in the same way across the two demographics. We know that this is not automatically true if we get to choose f after seeing X_{z=0} and X_{z=1}. In order to allow us to draw on a rich body of literature for comparing distributions, we cast the goal of enforcing distributional similarity across two demographics as a two-sample problem.

3.2.1 Distribution matching

First, we denote the applications of our predictor f̂ : X -> R to data having protected characteristic value zero by f̂(X_{Z=0}) := {f̂(x^{z=0}_1), ..., f̂(x^{z=0}_{N_{z=0}})}, and likewise by f̂(X_{Z=1}) := {f̂(x^{z=1}_1), ..., f̂(x^{z=1}_{N_{z=1}})} for value one. For enforcing the demographic parity criterion, we can enforce the closeness between the distributions of f̂(x). We can achieve this by minimizing

D(f̂(X_{Z=0}), f̂(X_{Z=1})), the distance between the two distributions f̂(X_{Z=0}) and f̂(X_{Z=1}). (1)

For enforcing the equalized odds criterion, we need to minimize both

D(I[Y = +1] f̂(X_{Z=0}), I[Y = +1] f̂(X_{Z=1})) and D(I[Y = -1] f̂(X_{Z=0}), I[Y = -1] f̂(X_{Z=1})). (2)

We make use of Iverson's bracket notation: I[P] = 1 when condition P is true and 0 otherwise. The first will match true positive rates (and also false negative rates) across the two demographics, and the latter will match false positive rates (and also true negative rates). For enforcing equality of opportunity, we just need to minimize

D(I[Y = +1] f̂(X_{Z=0}), I[Y = +1] f̂(X_{Z=1})). (3)

To go beyond true positive rates and false positive rates, Zafar et al. [5] raise the potential of removing unfairness by enforcing equal misclassification rates, false discovery rates, and false omission rates across the two demographics. False discovery and false omission rates, however, are difficult to encode with their fairness model. In the distribution matching sense, these can be easily enforced by minimizing

D(1 - Y f̂(X_{Z=0}), 1 - Y f̂(X_{Z=1})), (4)
D(I[Y = +1] max(0, -f̂(X_{Z=0})), I[Y = +1] max(0, -f̂(X_{Z=1}))), and (5)
D(I[Y = -1] max(0, f̂(X_{Z=0})), I[Y = -1] max(0, f̂(X_{Z=1}))) (6)

for misclassification, false omission, and false discovery rates, respectively.
Maximum mean discrepancy. To avoid a parametric assumption on the distance estimate between distributions, we use the Maximum Mean Discrepancy (MMD) criterion [37], a non-parametric distance estimate. 
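The two-sample quantities in (1)-(3) can be computed concretely. Below is a minimal sketch (hypothetical arrays; Gaussian RBF kernel) that builds the per-group score samples and evaluates an unbiased linear-time MMD estimate in the paired-sample form of Gretton et al.:

```python
import numpy as np

def rbf(a, b, sigma2=1.0):
    return np.exp(-(a - b) ** 2 / (2.0 * sigma2))

def linear_time_mmd2(s0, s1, sigma2=1.0):
    """Unbiased linear-time MMD^2 between two 1-D samples of scores,
    averaging the paired-sample statistic h over consecutive pairs."""
    n = 2 * (min(len(s0), len(s1)) // 2)
    s0, s1 = s0[:n], s1[:n]
    a0, b0 = s0[0::2], s0[1::2]          # odd/even indexed scores, group z=0
    a1, b1 = s1[0::2], s1[1::2]          # odd/even indexed scores, group z=1
    h = (rbf(a0, b0, sigma2) + rbf(a1, b1, sigma2)
         - rbf(a0, b1, sigma2) - rbf(b0, a1, sigma2))
    return h.mean()

# Demographic parity, eq. (1): compare raw scores f(x) across groups.
# Equality of opportunity, eq. (3): compare I[y = +1] * f(x) instead.
scores = np.random.default_rng(1).normal(size=100)   # hypothetical model outputs
y = np.sign(np.random.default_rng(2).normal(size=100))
z = np.random.default_rng(3).integers(0, 2, size=100)
dp_gap = linear_time_mmd2(scores[z == 0], scores[z == 1])
eo_gap = linear_time_mmd2((y == 1)[z == 0] * scores[z == 0],
                          (y == 1)[z == 1] * scores[z == 1])
print(dp_gap, eo_gap)
```

Identical samples yield an estimate of zero, and the statistic can be driven toward zero as a training penalty.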
Denote by H a Reproducing Kernel Hilbert Space with kernel k defined on X. In this case one can show [37] that whenever k is characteristic (or universal), the map

mu : p -> mu[p] := E_{x ~ p(x)}[k(f̂(x), .)], with associated distance MMD^2(p, p') := ||mu[p] - mu[p']||^2,

characterizes a distribution uniquely. Examples of characteristic kernels [39] are the Gaussian RBF, Laplacian, and B_{2n+1}-splines. With this choice of kernel functions, the MMD criterion matches infinitely many moments in the Reproducing Kernel Hilbert Space (RKHS). We use an unbiased linear-time estimate of MMD as follows [37, Lemma 14]:

MMD̂^2 = (1/N) sum_{i=1}^{N} [ k(f̂(x^{2i-1}_{z=0}), f̂(x^{2i}_{z=0})) + k(f̂(x^{2i-1}_{z=1}), f̂(x^{2i}_{z=1})) - k(f̂(x^{2i-1}_{z=0}), f̂(x^{2i}_{z=1})) - k(f̂(x^{2i}_{z=0}), f̂(x^{2i-1}_{z=1})) ],

with N := floor(min(N_{z=1}, N_{z=0})/2).

3.2.2 Special cases

Before discussing a specific composition of privileged learning and distribution matching to achieve fairness, we consider a number of special cases of the matching constraint to show that many existing methods use this basic idea.
Mean matching for demographic parity. Zemel et al. [20] balance the mapping from data to one of C latent prototypes across the two demographics by imposing the following constraint:

(1/N_{z=0}) sum_{n=1}^{N_{z=0}} f̂(x^{z=0}_n; c) = (1/N_{z=1}) sum_{n=1}^{N_{z=1}} f̂(x^{z=1}_n; c), for all c = 1, ..., C,

where f̂(x_n; c) is a softmax function with C prototypes. Assuming a linear kernel k, this constraint is equivalent to requiring that for each c

mu[f̂(x^{z=0}_n; c)] = (1/N_{z=0}) sum_{n=1}^{N_{z=0}} <f̂(x^{z=0}_n; c), .> = (1/N_{z=1}) sum_{n=1}^{N_{z=1}} <f̂(x^{z=1}_n; c), .> = mu[f̂(x^{z=1}_n; c)].

Mean matching for equalized odds and equality of opportunity. To ensure equal false positive rates across the two demographics, Zafar et al. [5] add the following constraint to the training objective of a linear classifier f̂(x) = <w, x>:

sum_{n=1}^{N_{z=0}} min(0, I[y_n = -1] f̂(x^{z=0}_n)) = sum_{n=1}^{N_{z=1}} min(0, I[y_n = -1] f̂(x^{z=1}_n)).

Again, assuming a linear kernel k, this constraint is equivalent to requiring that

mu[min(0, I[y_n = -1] f̂(x^{z=0}_n))] = (1/N_{z=0}) sum_{n=1}^{N_{z=0}} <min(0, I[y_n = -1] f̂(x^{z=0}_n)), .> = (1/N_{z=1}) sum_{n=1}^{N_{z=1}} <min(0, I[y_n = -1] f̂(x^{z=1}_n)), .> = mu[min(0, I[y_n = -1] f̂(x^{z=1}_n))].

The min(.) function ensures that we only match false positive rates; without it, both false positive and true negative rates would be matched. Relying on means for matching both false positive and true negative rates is not sufficient, as the underlying distributions are multi-modal; this motivates the need for distribution matching.

3.3 Privileged learning with fairness constraints

Here we describe the proposed model that recycles two established frameworks, privileged learning and distribution matching, and subsequently harmonizes them to address fair treatment, fair impact, fair supervised performance, and beyond in a unified fashion. We use SVMΔ+ [19], an SVM-based classification method for privileged learning, as a building block. 
SVMΔ+ modifies the required distance of a data instance to the decision boundary based on the easiness/hardness of that data instance in the privileged space X*, a space that contains the protected characteristic Z. Easiness/hardness is reflected in the negative of the confidence, -y_n(<w*, x*_n> + b*), where w* and b* are parameters; the higher this value, the harder this data instance is to classify correctly, even in the rich privileged space. Injecting the distribution matching constraint, the final Distribution Matching+ (DM+) optimization problem is:

minimize over w in R^d, b in R, w* in R^{d*}, b* in R:
  1/2 ||w||^2                                                  [l2 regularisation on model without protected characteristic]
  + 1/2 gamma ||w*||^2                                         [l2 regularisation on model with protected characteristic]
  + C Delta sum_{n=1}^{N} max(0, -y_n[<w*, x*_n> + b*])        [hinge loss on model with protected characteristic]
  + C sum_{n=1}^{N} max(0, 1 - y_n[<w*, x*_n> + b*] - y_n[<w, x_n> + b])   [hinge loss on model without protected characteristic but with margin dependent on protected characteristic]   (7a)

subject to MMD̂^2(p_{z=0}, p_{z=1}) <= epsilon,               [constraint for removing unfairness by proxy]   (7b)

where C, Delta, gamma and an upper bound epsilon are hyper-parameters. Terms p_{z=0} and p_{z=1} are distributions over appropriately defined fairness variables across the two demographics, e.g. f̂(X_{Z=0}) and f̂(X_{Z=1}) with f̂(.) = <w, .> + b for demographic parity, and I[Y = +1] f̂(X_{Z=0}) and I[Y = +1] f̂(X_{Z=1}) for equality of opportunity. We have the following observations about the knowledge transfer from the privileged space to the space X without protected characteristic (refer to the last term in (7a)):

- A very large positive value of the negative of the confidence in the space that includes the protected characteristic, -y_n[<w*, x*_n> + b*] >> 0, means x_n, without protected characteristic, is expected to be a hard-to-classify instance; therefore its margin distance to the decision boundary is increased.
- A very large negative value of the negative of the confidence in the space that includes the protected characteristic, -y_n[<w*, x*_n> + b*] << 0, means x_n, without protected characteristic, is expected to be an easy-to-classify instance; therefore its margin distance to the decision boundary is reduced.

The formulation in (7) is a multi-objective optimization with three competing goals: minimizing empirical error (hinge loss), minimizing model complexity (l2 regularisation), and minimizing prediction discrepancy across the two demographics (MMD). Each goal corresponds to a different optimal solution, and we have to accept a compromise among the goals. While solving a single-objective optimization means searching for a single best solution, solving a multi-objective optimization seeks a collection of solutions at which no goal can be improved without damaging one of the others (the Pareto frontier) [40].

Multi-objective optimization. We first note that the MMD fairness criterion introduces non-convexity into our optimization problem. For a non-convex multi-objective optimization, the Pareto frontier may have non-convex portions. 
However, any Pareto optimal solution of a multi-objective\noptimization can be obtained by solving the constraint problem for an upper bound \u0001 (as in (7b))\nregardless of the non-convexity of the Pareto frontier [40].\nAlternatively, the Convex Concave Procedure (CCP) [41], can be used to \ufb01nd an approximate solution\nof the problem in (7) by solving a succession of convex programs. CCP has been used in several\nother algorithms enforcing fair impact and fair supervised performance to deal with non-convexity of\nthe objective function (e.g. [24, 5]). However, it was noted in [35] that for an objective function that\nhas an additive structure as in our DM+ model, it is better to use the non-convex objective directly.\n\n4 Experiments\n\nWe experiment with two datasets: The ProPublica COMPAS dataset and the Adult income dataset.\nProPublica COMPAS (Correctional Offender Management Pro\ufb01ling for Alternative Sanctions) has a\ntotal of 5,278 data instances, each with 5 features (e.g., count of prior offences, charge for which the\nperson was arrested, race). The binary target outcome is whether or not the defendant recidivated\nwithin two years. For this dataset, we follow the setting in [5] and consider race, which is binarized\nas either black or white, as a protected characteristic. We use 4, 222 instances for training and 1, 056\ninstances for test. The Adult dataset has a total of 45, 222 data instances, each with 14 features (e.g.,\ngender, educational level, number of work hours per week). The binary target outcome is whether\nor not income is larger than 50K dollars. For this dataset, we follow [20] and consider gender as a\nbinary protected characteristic. We use 36, 178 instances for training and 9, 044 instances for test.\n\nMethods We have two variants of our distribution matching framework: DM that uses SVM as the\nbase classi\ufb01er coupled with the constraint in (7b), and DM+ ((7a) and (7b)). 
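A scalarized (unconstrained) form of (7), with an MMD penalty weight in place of the bound, can be sketched as follows. This is a hypothetical simplification for illustration, not the authors' implementation; the quadratic-time biased MMD used here stands in for the linear-time estimate of Section 3.2.1, and all array arguments are placeholders:

```python
import numpy as np

def dm_plus_objective(params, X, X_star, y, z, C=1.0, C_delta=1.0, gamma=1.0,
                      C_mmd=1.0, sigma2=1.0):
    """Scalarized sketch of the DM+ objective: two hinge losses + l2 terms
    + an MMD penalty on scores across the two demographics (demographic parity)."""
    d, ds = X.shape[1], X_star.shape[1]
    w, b = params[:d], params[d]
    w_s, b_s = params[d + 1:d + 1 + ds], params[-1]
    conf_star = y * (X_star @ w_s + b_s)      # confidence in the privileged space
    scores = X @ w + b                        # deployed model: no protected feature
    hinge_priv = np.maximum(0.0, -conf_star).sum()
    hinge = np.maximum(0.0, 1.0 - conf_star - y * scores).sum()
    # Simple quadratic-time, biased MMD penalty on scores, for illustration:
    s0, s1 = scores[z == 0], scores[z == 1]
    k = lambda a, c: np.exp(-(a[:, None] - c[None, :]) ** 2 / (2.0 * sigma2))
    mmd2 = k(s0, s0).mean() + k(s1, s1).mean() - 2.0 * k(s0, s1).mean()
    return (0.5 * (w @ w) + 0.5 * gamma * (w_s @ w_s)
            + C_delta * hinge_priv + C * hinge + C_mmd * mmd2)
```

A function of this shape can be handed to a gradient-based or evolutionary optimizer; each choice of the weights traces a different compromise between accuracy and fairness.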
We compare our methods with several baselines: support vector machine (SVM), logistic regression (LR), mean matching with logistic regression as the base classifier (Zafar et al.) [5], and a threshold classifier method with protected characteristic-specific thresholds on the output of a logistic regression model (Hardt et al.) [4]. All methods but Hardt et al. do not use protected characteristics at prediction time.

Optimization procedure For our DM and DM+ methods, we identify at least three options for optimizing the multi-objective problem in (7): the Convex-Concave Procedure (CCP), the Broyden-Fletcher-Goldfarb-Shanno gradient descent method with limited-memory variation (L-BFGS), and evolutionary multi-objective optimization (EMO). We discuss these options in turn. First, we can express each additive term in the estimated MMD²(p_{z=0}, p_{z=1}) fairness constraint (7b) as a difference of two convex functions, find the convex upper bound of each term, and place the convexified fairness constraint as part of the objective function. In our initial experiments, solving (7) with CCP tends to ignore the fairness constraint, so we do not explore this approach further. As mentioned earlier, the convex upper bounds on each of the additive terms in the MMD constraint become increasingly loose as we move away from the current point of approximation. This leads to the second optimization approach. We turn the constrained optimization problem into an unconstrained one by introducing a non-negative weight C_MMD to scale the estimated MMD²(p_{z=0}, p_{z=1}) term. We then solve this unconstrained problem using L-BFGS. The main challenge with this procedure is the need to trade off multiple competing goals by tuning several hyper-parameters, which will be discussed in the next section. The CCP and L-BFGS procedures will only return one optimal solution from the Pareto frontier.
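The second (L-BFGS) route can be sketched as a scalarized objective combining hinge loss, ℓ2 regularization, and a weighted MMD² term. This is a toy illustration, not the paper's code: the synthetic data, the hyper-parameter values (the paper uses C = 1000, C_MMD = 5000, σ² = 10), and the use of SciPy's L-BFGS-B solver with finite-difference gradients over the non-smooth hinge are all assumptions.

```python
import numpy as np
from scipy.optimize import minimize  # provides the L-BFGS-B solver

def rbf_mmd2(s0, s1, sigma2=1.0):
    # Biased MMD^2 between two 1-D score samples under a Gaussian RBF kernel.
    k = lambda a, b: np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * sigma2))
    return k(s0, s0).mean() + k(s1, s1).mean() - 2.0 * k(s0, s1).mean()

def dm_objective(theta, X, y, z, C=10.0, C_mmd=50.0):
    # Scalarized objective: C * hinge loss + l2 regularization + C_mmd * MMD^2.
    w, b = theta[:-1], theta[-1]
    scores = X @ w + b
    hinge = np.maximum(0.0, 1.0 - y * scores).sum()
    return C * hinge + w @ w + C_mmd * rbf_mmd2(scores[z == 0], scores[z == 1])

rng = np.random.default_rng(1)
n = 60
X = rng.normal(size=(n, 2))
z = (rng.random(n) < 0.5).astype(int)                      # toy protected characteristic
y = np.sign(X[:, 0] + 0.5 * z + 0.1 * rng.normal(size=n))  # labels correlated with z

theta0 = np.zeros(3)
res = minimize(dm_objective, theta0, args=(X, y, z), method="L-BFGS-B")
```

A single choice of C_MMD yields a single point on the trade-off curve, which is exactly the limitation noted above for CCP and L-BFGS.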
Third, to approximate the Pareto-optimal set, we can instead use EMO procedures (e.g. the Non-dominated Sorting Genetic Algorithm (NSGA)-II and the Strength Pareto Evolutionary Algorithm (SPEA)-II). For the EMO, we also solve the unconstrained problem as in the second approach, but we do not need to introduce a trade-off parameter for each term in the objective function. We use the DEAP toolbox [42] for experimenting with EMO.

Model selection For the baseline Zafar et al., as in [5], we set the hyper-parameters τ and µ corresponding to the Penalty Convex-Concave Procedure to 5.0 and 1.2, respectively. A Gaussian RBF kernel with kernel width σ² is used for the MMD term. When solving the DM (and DM+) optimization problems with L-BFGS, the hyper-parameters C, C_MMD, σ², (and γ) are set to 1000., 5000., 10., (and 1.) for both datasets. For DM+, we select Δ over the range {1., 2., . . . , 10.} using 5-fold cross validation. The selection process goes as follows: we first sort Δ values according to how well they satisfy the fairness criterion, and we then select a Δ value at the point before it yields a lower incremental classification accuracy.

Table 1: Results on multi-objective optimization balancing two main objectives: performance accuracy and fairness criteria. Equal true positive rates are required for the ProPublica COMPAS dataset, and equal accuracies between the two demographics z = 0 and z = 1 are required for the Adult dataset. The solver of Zafar et al. fails on the Adult dataset while enforcing equal accuracies across the two demographics. Hardt et al.'s method does not enforce equal accuracies. SVM and LR only optimize performance accuracy. The terms |Acc.z=0 - Acc.z=1|, |TPRz=0 - TPRz=1|, and |FPRz=0 - FPRz=1| denote the accuracy, true positive rate, and false positive rate discrepancies in absolute terms between the two demographics (the smaller, the fairer). For the ProPublica COMPAS dataset, we boldface |TPRz=0 - TPRz=1| since we enforce the equality of opportunity criterion on this dataset. For the Adult dataset, we boldface |Acc.z=0 - Acc.z=1| since this is the fairness criterion.

ProPublica COMPAS dataset (fairness constraint on equal TPRs)

Method          | |Acc.z=0 - Acc.z=1| | |TPRz=0 - TPRz=1| | |FPRz=0 - FPRz=1| | Acc.
LR              | 0.0151±0.0116 | 0.2504±0.0417 | 0.1618±0.0471 | 0.6652±0.0139
SVM             | 0.0172±0.0102 | 0.2573±0.0158 | 0.1603±0.0490 | 0.6367±0.0212
Zafar et al.    | 0.0174±0.0142 | 0.1144±0.0482 | 0.1914±0.0314 | 0.6118±0.0198
Hardt et al.*   | 0.0219±0.0191 | 0.0463±0.0185 | 0.0518±0.0413 | 0.6547±0.0128
DM (L-BFGS)     | 0.0457±0.0289 | 0.1169±0.0690 | 0.0791±0.0395 | 0.5931±0.0599
DM+ (L-BFGS)    | 0.0608±0.0259 | 0.1065±0.0413 | 0.0973±0.0272 | 0.6089±0.0398
DM (EMO Usr1)   | 0.0537±0.0121 | 0.1346±0.0360 | 0.1028±0.0481 | 0.6261±0.0133
DM (EMO Usr2)   | 0.0535±0.0213 | 0.1248±0.0509 | 0.0906±0.0507 | 0.6148±0.0137
*use protected characteristics at prediction time.

Adult dataset (fairness constraint on equal accuracies)

Method          | |Acc.z=0 - Acc.z=1| | |TPRz=0 - TPRz=1| | |FPRz=0 - FPRz=1| | Acc.
SVM             | 0.1136±0.0064 | 0.0964±0.0289 | 0.0694±0.0109 | 0.8457±0.0034
DM (L-BFGS)     | 0.0640±0.0280 | 0.0804±0.0659 | 0.0346±0.0343 | 0.8152±0.0068
DM+ (L-BFGS)    | 0.0459±0.0372 | 0.0759±0.0738 | 0.0368±0.0349 | 0.8127±0.0134
DM (EMO Usr1)   | 0.0388±0.0179 | 0.0398±0.0284 | 0.0398±0.0284 | 0.8057±0.0108
DM (EMO Usr2)   | 0.0482±0.0143 | 0.0302±0.0212 | 0.0135±0.0056 | 0.8111±0.0122
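The Δ selection heuristic described under model selection, sorting candidates by fairness and stopping just before accuracy drops, can be sketched as follows. The candidate values and cross-validation scores below are invented for illustration, and the stopping rule is one reasonable reading of the procedure.

```python
import numpy as np

def select_delta(deltas, fairness_gap, accuracy):
    # Sort candidates from fairest (smallest gap) to least fair, then walk
    # along the list and stop just before accuracy starts to drop.
    order = np.argsort(fairness_gap)
    best = order[0]
    for prev, nxt in zip(order[:-1], order[1:]):
        if accuracy[nxt] < accuracy[prev]:
            best = prev
            break
        best = nxt
    return deltas[best]

# Invented cross-validated scores for candidate Delta values.
deltas = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
fairness_gap = np.array([0.10, 0.08, 0.05, 0.03, 0.02])  # larger Delta = fairer here
accuracy = np.array([0.70, 0.72, 0.71, 0.66, 0.60])
chosen = select_delta(deltas, fairness_gap, accuracy)    # stops before the drop
```

In the toy run above, relaxing fairness keeps improving accuracy until Δ = 2.0, after which accuracy falls, so that value is returned.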
As stated earlier, when using EMO for DM and DM+ we do not need the C, C_MMD, σ², γ, and Δ hyper-parameters to balance the multiple terms in the objective function. There are, however, several free parameters related to the evolutionary algorithm itself. We use the NSGA-II selection strategy with a polynomial mutation operator as in the original implementation [43], and the mutation probability is set to 0.5. We do not use any mating operator. We use 500 individuals in a loop of 50 iterations (generations).

Results Experimental results over 5 repeats are presented in Table 1. In the ProPublica COMPAS dataset, we enforce equality of opportunity |TPRz=0 - TPRz=1|, i.e. equal true positive rates (Equation (3)), as the fairness criterion (refer to the ProPublica COMPAS dataset in Table 1). Additionally, our distribution matching methods, DM+ and DM, also deliver a reduction in the discrepancy between false positive rates. We experiment with both the L-BFGS and EMO optimization procedures. For EMO, we simulate two decision makers choosing an operating point based on the visualization of the Pareto frontier in Figure 1 (Right), shown as DM (EMO Usr1) and DM (EMO Usr2) in Table 1. For this dataset, Usr1 is inclined to be more lenient about fairness in exchange for a gain in accuracy compared to Usr2; this is reflected in the selection of the operating point (see supplementary material). The EMO is run on 60% of the training data, the selection is done on the remaining 40%, and the reported results are on the separate test set, using the model trained on the 60% of the training data. The method of Zafar et al. achieves a reduction in the fairness criterion similar to our distribution matching methods. As a reference, we also include results of Hardt et al.'s method; it achieves the best equality of opportunity measure with only a slight drop in accuracy w.r.t. the unfair LR.
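The discrepancy measures reported in Table 1 can be computed directly from test-set predictions. The sketch below uses hand-made toy labels rather than the paper's models or data.

```python
import numpy as np

def group_metrics(y_true, y_pred, z):
    # Per-group accuracy, TPR and FPR discrepancies (labels and predictions
    # in {0, 1}; z is the binary protected characteristic).
    acc, tpr, fpr = [], [], []
    for g in (0, 1):
        yt, yp = y_true[z == g], y_pred[z == g]
        acc.append((yt == yp).mean())
        tpr.append(yp[yt == 1].mean())  # P(pred = 1 | y = 1, z = g)
        fpr.append(yp[yt == 0].mean())  # P(pred = 1 | y = 0, z = g)
    return {"acc_gap": abs(acc[0] - acc[1]),
            "tpr_gap": abs(tpr[0] - tpr[1]),
            "fpr_gap": abs(fpr[0] - fpr[1])}

# Hand-made toy predictions for two groups of four instances each.
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
z = np.array([0, 0, 0, 0, 1, 1, 1, 1])
gaps = group_metrics(y_true, y_pred, z)
```

Here both groups have the same accuracy, yet the TPR and FPR gaps are large, illustrating why equal accuracies alone do not imply equalized odds.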
It is important to note that Hardt et al.'s method requires protected characteristics at test time. If we allowed the usage of protected characteristics at test time, we should expect similar reduction rates in the fairness and accuracy measures for the other methods [5].

Figure 1: Visualization of a Pareto frontier of our DM method for the ProPublica COMPAS dataset. Left: in a 3D criterion space corresponding to the three objective functions: hinge loss, i.e. max(0, 1 − y_n[⟨w, x_n⟩ + b]), regularization, i.e. ‖w‖²_{ℓ2}, and the estimated MMD²(p_{z=0}, p_{z=1}). Fairer models (smaller MMD values) are gained at the expense of model complexity (higher regularization and/or hinge loss values). Note that the unbiased estimate of MMD may be negative [37]. Right: the same Pareto frontier but in a 2D space of error and unfairness in predictions. Only the first repeat is visualized; please refer to the supplementary material for the other four repeats, for the Adult dataset, and for the DM+ method.

In the Adult dataset, we enforce equal accuracies |Acc.z=0 - Acc.z=1| (Equation (4)) as the fairness criterion (refer to the Adult dataset in Table 1). The method whereby a decision maker uses a Pareto frontier visualization to choose the operating point (DM (EMO Usr1)) reaches the smallest discrepancy between the two demographics. In addition to equal accuracies (Equation (4)), our distribution matching methods, DM+ and DM, also deliver a reduction in the discrepancies between true positive and false positive rates w.r.t. SVM (second and third columns). In this dataset, Zafar et al. runs into numerical problems when enforcing equal accuracies (vide our earlier discussion on different optimization procedures, especially related to CCP).
As observed in prior work [5, 20], the methods that do not enforce fairness (equal accuracies or equal true positive rates), SVM and LR, achieve higher classification accuracy than the methods that do enforce fairness: Zafar et al., DM+, and DM. This can be seen in the last column of Table 1.

5 Discussion and Conclusion

We have proposed a unified machine learning framework that is able to handle any definition of fairness, e.g. fairness through unawareness, demographic parity, equalized odds, and equality of opportunity. Our framework is based on learning using privileged information and on matching conditional distributions via a two-sample problem. By using distance measures in Hilbert space to solve the two-sample problem, our framework is general and is applicable to protected characteristics with binary/multi-class/continuous values. The current work focuses on a single binary protected characteristic, which corresponds to conditional distribution matching with a binary conditioning variable. To generalize this to any type of, and multiple dependent, protected characteristics, we can use the Hilbert space embedding of conditional distributions framework of [44, 45].
We note that there are important factors external to machine learning models that are relevant to fairness. However, this paper adopts the established approach of existing work on fair machine learning. In particular, it is taken as given that one typically does not have any control over the data collection process, because there is no practical way of enforcing truth/unbiasedness in datasets that are generated by others, such as banks, police forces, and companies.

Acknowledgments

NQ is supported by the UK EPSRC project EP/P03442X/1 'EthicalML: Injecting Ethical and Legal Constraints into Machine Learning Models' and the Russian Academic Excellence Project '5-100'. VS is supported by the IC Research Fellowship.
We thank NVIDIA for a GPU donation and Amazon for AWS Cloud Credits. We thank Kristian Kersting and Oliver Thomas for discussions, Muhammad Bilal Zafar for his implementations of [4] and [5], and Sienna Quadrianto for supporting the work.

References
[1] Executive Office of the President. Big data: A report on algorithmic systems, opportunity, and civil rights. Technical report, 2016.
[2] The Royal Society Working Group. Machine learning: the power and promise of computers that learn by example. Technical report, 2017.
[3] Solon Barocas and Andrew D. Selbst. Big data's disparate impact. California Law Review, 104:671–732, 2016.
[4] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS) 29, pages 3315–3323, 2016.
[5] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez-Rodriguez, and Krishna P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In International Conference on World Wide Web (WWW), pages 1171–1180, 2017.
[6] Equality Act. London: HMSO, 2010.
[7] Salvatore Ruggieri, Dino Pedreschi, and Franco Turini. Data mining for discrimination discovery. ACM Transactions on Knowledge Discovery from Data (TKDD), 4:9:1–9:40, 2010.
[8] Philip Adler, Casey Falk, Sorelle A. Friedler, Gabriel Rybeck, Carlos Scheidegger, Brandon Smith, and Suresh Venkatasubramanian. Auditing black-box models for indirect influence. In ICDM, 2016.
[9] Andrea Romei and Salvatore Ruggieri.
A multidisciplinary survey on discrimination analysis. The Knowledge Engineering Review, pages 582–638, 2014.
[10] Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 259–268, 2015.
[11] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness constraints: Mechanisms for fair classification. In Aarti Singh and Jerry Zhu, editors, International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 of Proceedings of Machine Learning Research, pages 962–970. PMLR, 2017.
[12] Toon Calders, Faisal Kamiran, and Mykola Pechenizkiy. Building classifiers with independency constraints. In International Conference on Data Mining Workshops (ICDMW), pages 13–18, 2009.
[13] Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach. In International Conference on Data Mining Workshops (ICDMW), pages 643–650, 2011.
[14] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), pages 35–50, 2012.
[15] Jon M. Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. CoRR, abs/1609.05807, 2016.
[16] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, pages 544–557, 2009.
[17] Viktoriia Sharmanska, Novi Quadrianto, and Christoph H. Lampert. Learning to rank using privileged information.
In International Conference on Computer Vision (ICCV), pages 825–832, 2013.
[18] Daniel Hernández-Lobato, Viktoriia Sharmanska, Kristian Kersting, Christoph H. Lampert, and Novi Quadrianto. Mind the nuisance: Gaussian process classification using privileged noise. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems (NIPS) 27, pages 837–845, 2014.
[19] Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: similarity control and knowledge transfer. Journal of Machine Learning Research (JMLR), 16:2023–2049, 2015.
[20] Richard Zemel, Yu Wu, Kevin Swersky, Toniann Pitassi, and Cynthia Dwork. Learning fair representations. In S. Dasgupta and D. McAllester, editors, International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, pages 325–333. PMLR, 2013.
[21] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. CoRR, abs/1511.00830, 2015.
[22] Binh Thanh Luong, Salvatore Ruggieri, and Franco Turini. k-NN as an implementation of situation testing for discrimination discovery and prevention. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 502–510, 2011.
[23] Matt J. Kusner, Joshua R. Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In U. V. Luxburg, I. Guyon, S. Bengio, H. Wallach, R. Fergus, S.V.N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS) 30, 2017.
[24] Gabriel Goh, Andrew Cotter, Maya Gupta, and Michael P. Friedlander. Satisfying real-world goals with dataset constraints. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R.
Garnett, editors, Advances in Neural Information Processing Systems (NIPS) 29, pages 2415–2423, 2016.
[25] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science (ITCS), pages 214–226. ACM, 2012.
[26] Matthew Joseph, Michael Kearns, Jamie H. Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS) 29, pages 325–333, 2016.
[27] Dmitry Pechyony and Vladimir Vapnik. On the theory of learning with privileged information. In John D. Lafferty, Christopher K. I. Williams, John Shawe-Taylor, Richard S. Zemel, and Aron Culotta, editors, Advances in Neural Information Processing Systems (NIPS) 23, pages 1894–1902, 2010.
[28] Jan Feyereisl, Suha Kwak, Jeany Son, and Bohyung Han. Object localization based on structural SVM using privileged information. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems (NIPS) 27, pages 208–216, 2014.
[29] Viktoriia Sharmanska, Daniel Hernández-Lobato, José Miguel Hernández-Lobato, and Novi Quadrianto. Ambiguity helps: Classification with disagreements in crowdsourced annotations. In Computer Vision and Pattern Recognition (CVPR), pages 2194–2202. IEEE Computer Society, 2016.
[30] D. Lopez-Paz, B. Schölkopf, L. Bottou, and V. Vapnik. Unifying distillation and privileged information. In International Conference on Learning Representations (ICLR), 2016.
[31] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
[32] Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil.
Model compression. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 535–541, 2006.
[33] Wen Li, Lixin Duan, Dong Xu, and Ivor W. Tsang. Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pages 1134–1148, 2014.
[34] Xu Zhang, Felix Xinnan Yu, Shih-Fu Chang, and Shengjin Wang. Deep transfer network: Unsupervised domain adaptation. CoRR, abs/1503.00591, 2015.
[35] Novi Quadrianto, James Petterson, and Alex J. Smola. Distribution matching for transduction. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems (NIPS) 22, pages 1500–1508, 2009.
[36] Viktoriia Sharmanska and Novi Quadrianto. Learning from the mistakes of others: Matching errors in cross-dataset learning. In Computer Vision and Pattern Recognition (CVPR), pages 3967–3975. IEEE Computer Society, 2016.
[37] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research (JMLR), 13:723–773, 2012.
[38] Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. In Craig Boutilier, editor, International Joint Conference on Artificial Intelligence (IJCAI), pages 1187–1192, 2009.
[39] Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research (JMLR), 12:2389–2410, 2011.
[40] Yann Collette and Patrick Siarry. Multiobjective Optimization: Principles and Case Studies. Springer, 2003.
[41] Thomas Lipp and Stephen Boyd.
Variations and extension of the convex-concave procedure. Optimization and Engineering, 17(2):263–287, 2016.
[42] Félix-Antoine Fortin, François-Michel De Rainville, Marc-André Gardner, Marc Parizeau, and Christian Gagné. DEAP: Evolutionary algorithms made easy. Journal of Machine Learning Research (JMLR), 13(1):2171–2175, 2012.
[43] Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, and T. Meyarivan. A fast elitist non-dominated sorting genetic algorithm for multi-objective optimisation: NSGA-II. In International Conference on Parallel Problem Solving from Nature (PPSN), pages 849–858, 2000.
[44] Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In International Conference on Machine Learning (ICML), pages 961–968, 2009.
[45] L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.