{"title": "Beyond Parity: Fairness Objectives for Collaborative Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 2921, "page_last": 2930, "abstract": "We study fairness in collaborative-filtering recommender systems, which are sensitive to discrimination that exists in historical data. Biased data can lead collaborative-filtering methods to make unfair predictions for users from minority groups. We identify the insufficiency of existing fairness metrics and propose four new metrics that address different forms of unfairness. These fairness metrics can be optimized by adding fairness terms to the learning objective. Experiments on synthetic and real data show that our new metrics can better measure fairness than the baseline, and that the fairness objectives effectively help reduce unfairness.", "full_text": "Beyond Parity: Fairness Objectives for Collaborative Filtering

Sirui Yao
Department of Computer Science
Virginia Tech
Blacksburg, VA 24061
ysirui@vt.edu

Bert Huang
Department of Computer Science
Virginia Tech
Blacksburg, VA 24061
bhuang@vt.edu

Abstract

We study fairness in collaborative-filtering recommender systems, which are sensitive to discrimination that exists in historical data. Biased data can lead collaborative-filtering methods to make unfair predictions for users from minority groups. We identify the insufficiency of existing fairness metrics and propose four new metrics that address different forms of unfairness. These fairness metrics can be optimized by adding fairness terms to the learning objective.
Experiments on synthetic and real data show that our new metrics can better measure fairness than the baseline, and that the fairness objectives effectively help reduce unfairness.

1 Introduction

This paper introduces new measures of unfairness in algorithmic recommendation and demonstrates how to optimize these metrics to reduce different forms of unfairness. Recommender systems study user behavior and make recommendations to support decision making. They have been widely applied in various fields to recommend items such as movies, products, jobs, and courses. However, since recommender systems make predictions based on observed data, they can easily inherit bias that may already exist. To address this issue, we first formalize the problem of unfairness in recommender systems and identify the insufficiency of demographic parity for this setting. We then propose four new unfairness metrics that address different forms of unfairness. We compare our fairness measures with non-parity on biased, synthetic training data and demonstrate that our metrics can better measure unfairness. To improve model fairness, we provide five fairness objectives that can be optimized, each adding unfairness penalties as regularizers. Experimenting on real and synthetic data, we demonstrate that each fairness metric can be optimized without much degradation in prediction accuracy, but that trade-offs exist among the different forms of unfairness.

We focus on a frequently practiced approach for recommendation called collaborative filtering, which makes recommendations based on the ratings or behavior of other users in the system.
The fundamental assumption behind collaborative filtering is that other users' opinions can be selected and aggregated in such a way as to provide a reasonable prediction of the active user's preference [7]. For example, if a user likes item A, and many other users who like item A also like item B, then it is reasonable to expect that the user will also like item B. Collaborative-filtering methods would predict that the user will give item B a high rating.

With this approach, predictions are made based on co-occurrence statistics, and most methods assume that the missing ratings are missing at random. Unfortunately, researchers have shown that sampled ratings have markedly different properties from the users' true preferences [21, 22]. Sampling is heavily influenced by social bias, which results in more missing ratings in some cases than others. This non-random pattern of missing and observed rating data is a potential source of unfairness. For the purpose of improving recommendation accuracy, there are collaborative-filtering models [2, 21, 25] that use side information to address the problem of imbalanced data, but in this work, to test the properties and effectiveness of our metrics, we focus on the basic matrix-factorization algorithm first. Investigating how these other models could reduce unfairness is one direction for future research.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Throughout the paper, we consider a running example of unfair recommendation. We consider recommendation in education, and unfairness that may occur in areas with current gender imbalance, such as science, technology, engineering, and mathematics (STEM) topics. Due to societal and cultural influences, fewer female students currently choose careers in STEM.
For example, in\n2010, women accounted for only 18% of the bachelor\u2019s degrees awarded in computer science [3].\nThe underrepresentation of women causes historical rating data of computer-science courses to be\ndominated by men. Consequently, the learned model may underestimate women\u2019s preferences and\nbe biased toward men. We consider the setting in which, even if the ratings provided by students\naccurately re\ufb02ect their true preferences, the bias in which ratings are reported leads to unfairness.\nThe remainder of the paper is organized as follows. First, we review previous relevant work in\nSection 2. In Section 3, we formalize the recommendation problem, and we introduce four new\nunfairness metrics and give justi\ufb01cations and examples. In Section 4, we show that unfairness\noccurs as data gets more imbalanced, and we present results that successfully minimize each form of\nunfairness. Finally, Section 5 concludes the paper and proposes possible future work.\n\n2 Related Work\n\nAs machine learning is being more widely applied in modern society, researchers have begun\nidentifying the criticality of algorithmic fairness. Various studies have considered algorithmic fairness\nin problems such as supervised classi\ufb01cation [20, 23, 28]. When aiming to protect algorithms from\ntreating people differently for prejudicial reasons, removing sensitive features (e.g., gender, race, or\nage) can help alleviate unfairness but is often insuf\ufb01cient. Features are often correlated, so other\nunprotected attributes can be related to the sensitive features and therefore still cause the model to\nbe biased [17, 29]. Moreover, in problems such as collaborative \ufb01ltering, algorithms do not directly\nconsider measured features and instead infer latent user attributes from their behavior.\nAnother frequently practiced strategy for encouraging fairness is to enforce demographic parity,\nwhich is to achieve statistical parity among groups. 
The goal is to ensure that the overall proportion of members in the protected group receiving positive (or negative) classifications is identical to the proportion of the population as a whole [29]. For example, in the case of a binary decision Ŷ ∈ {0, 1} and a binary protected attribute A ∈ {0, 1}, this constraint can be formalized as [9]

Pr{Ŷ = 1 | A = 0} = Pr{Ŷ = 1 | A = 1} .    (1)

Kamishima et al. [13-17] evaluate model fairness based on this non-parity unfairness concept, or try to solve the unfairness issue in recommender systems by adding a regularization term that enforces demographic parity. The objective penalizes the differences among the average predicted ratings of user groups. However, demographic parity is only appropriate when preferences are unrelated to the sensitive features. In tasks such as recommendation, user preferences are indeed influenced by sensitive features such as gender, race, and age [4, 6]. Therefore, enforcing demographic parity may significantly damage the quality of recommendations.

To address the issue of demographic parity, Hardt et al. [9] propose to measure unfairness with the true positive rate and true negative rate. This idea encourages what they refer to as equal opportunity and no longer relies on the implicit assumption of demographic parity that the target variable is independent of sensitive features. They propose that, in a binary setting, given a decision Ŷ ∈ {0, 1}, a protected attribute A ∈ {0, 1}, and the true label Y ∈ {0, 1}, the constraints are equivalent to [9]

Pr{Ŷ = 1 | A = 0, Y = y} = Pr{Ŷ = 1 | A = 1, Y = y},  y ∈ {0, 1} .    (2)

This constraint upholds fairness and simultaneously respects group differences. It penalizes models that only perform well on the majority groups. This idea is also the basis of the unfairness metrics we propose for recommendation.

Our running example of recommendation in education is inspired by the recent interest in using algorithms in this domain [5, 24, 27]. Student decisions about which courses to study can have significant impacts on their lives, so the usage of algorithmic recommendation in this setting has consequences that will affect society for generations. Coupling the importance of this application with the issue of gender imbalance in STEM [1] and challenges in retention of students with backgrounds underrepresented in STEM [8, 26], we find this setting a serious motivation to advance scientific understanding of unfairness, and of methods to reduce unfairness, in recommendation.

3 Fairness Objectives for Collaborative Filtering

This section introduces fairness objectives for collaborative filtering. We begin by reviewing the matrix factorization method. We then describe the various fairness objectives we consider, providing formal definitions and discussion of their motivations.

3.1 Matrix Factorization for Recommendation

We consider the task of collaborative filtering using matrix factorization [19]. We have a set of users indexed from 1 to m and a set of items indexed from 1 to n. For the ith user, let g_i be a variable indicating which group the ith user belongs to. For example, it may indicate whether user i identifies as a woman, a man, or with a non-binary gender identity. For the jth item, let h_j indicate the item group that it belongs to. For example, h_j may represent a genre of a movie or topic of a course. Let r_ij be the preference score of the ith user for the jth item.
The ratings can be viewed as entries in a rating matrix R.

The matrix-factorization formulation builds on the assumption that each rating can be represented as the product of vectors representing the user and item. With additional bias terms for users and items, this assumption can be summarized as follows:

r_ij ≈ p_i^T q_j + u_i + v_j ,    (3)

where p_i is a d-dimensional vector representing the ith user, q_j is a d-dimensional vector representing the jth item, and u_i and v_j are scalar bias terms for the user and item, respectively. The matrix-factorization learning algorithm seeks to learn these parameters from observed ratings X, typically by minimizing a regularized, squared reconstruction error:

J(P, Q, u, v) = (λ/2) (||P||²_F + ||Q||²_F) + (1/|X|) Σ_{(i,j)∈X} (y_ij − r_ij)² ,    (4)

where u and v are the vectors of bias terms, ||·||_F represents the Frobenius norm, and

y_ij = p_i^T q_j + u_i + v_j .    (5)

Strategies for minimizing this non-convex objective are well studied, and a general approach is to compute the gradient and use a gradient-based optimizer. In our experiments, we use the Adam algorithm [18], which combines adaptive learning rates with momentum.

3.2 Unfair Recommendations from Underrepresentation

In this section, we describe a process through which matrix factorization leads to unfair recommendations, even when rating data accurately reflects users' true preferences. Such unfairness can occur with imbalanced data. We identify two forms of underrepresentation: population imbalance and observation bias. We later demonstrate that either leads to unfair recommendation, and both forms together lead to worse unfairness.
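Returning to the factorization above, the model and objective of Eqs. (3)-(5) can be sketched in a few lines of NumPy. This is a minimal illustration only: the array names and sizes are ours, the observed set X is built deterministically for the example, and a real implementation would minimize the objective with a gradient-based optimizer such as Adam rather than merely evaluating it.

```python
import numpy as np

def predict(P, Q, u, v):
    """Predicted ratings y_ij = p_i^T q_j + u_i + v_j for every user-item pair (Eq. (5))."""
    return P @ Q.T + u[:, None] + v[None, :]

def objective(P, Q, u, v, R, X, lam=1e-3):
    """Regularized squared reconstruction error of Eq. (4).

    R holds the ratings and X is the list of observed (i, j) index pairs."""
    Y = predict(P, Q, u, v)
    err = sum((Y[i, j] - R[i, j]) ** 2 for (i, j) in X) / len(X)
    reg = (lam / 2) * (np.sum(P ** 2) + np.sum(Q ** 2))  # Frobenius-norm penalties
    return reg + err

rng = np.random.default_rng(0)
m, n, d = 5, 4, 2                                  # users, items, latent dimension
P, Q = rng.normal(size=(m, d)), rng.normal(size=(n, d))
u, v = np.zeros(m), np.zeros(n)
R = rng.choice([-1.0, 1.0], size=(m, n))           # toy +/-1 preference matrix
X = [(i, j) for i in range(m) for j in range(n) if (i + j) % 2 == 0]
loss = objective(P, Q, u, v, R, X)
```

Because both terms of the objective are sums of squares, the sketched loss is always non-negative.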
In our discussion, we use a running example of course recommendation, highlighting effects of underrepresentation in STEM education.

Population imbalance occurs when different types of users occur in the dataset with varied frequencies. For example, we consider four types of users defined by two aspects. First, each individual identifies with a gender. For simplicity, we only consider binary gender identities, though in this example, it would also be appropriate to consider men as one gender group and women and all non-binary gender identities as the second group. Second, each individual is either someone who enjoys and would excel in STEM topics or someone who does not and would not. Population imbalance occurs in STEM education when, because of systemic bias or other societal problems, there may be significantly fewer women who succeed in STEM (WS) than those who do not (W), and because of converse societal unfairness, there may be more men who succeed in STEM (MS) than those who do not (M). This four-way separation of user groups is not available to the recommender system, which instead may only know the gender group of each user, but not their proclivity for STEM.

Observation bias is a related but distinct form of data imbalance, in which certain types of users may have different tendencies to rate different types of items. This bias is often part of a feedback loop involving existing methods of recommendation, whether by algorithms or by humans. If an individual is never recommended a particular item, they will likely never provide rating data for that item. Therefore, algorithms will never be able to directly learn about this preference relationship. In the education example, if women are rarely recommended to take STEM courses, there may be significantly less training data about women in STEM courses.

We simulate these two types of data bias with two stochastic block models [11].
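The two-block-model simulation just described can be sketched as follows. The small matrices L and O here are illustrative placeholders, not the parameters used in the paper's experiments (those appear in Section 4.1): L gives the probability that a user group likes an item group, and O gives the probability that the rating is observed at all.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative block-model parameters (rows: user groups, columns: item groups).
L = np.array([[0.8, 0.2],   # P(r_ij = +1 | user group g_i, item group h_j)
              [0.2, 0.8]])
O = np.array([[0.5, 0.1],   # P(r_ij is observed | user group g_i, item group h_j)
              [0.1, 0.5]])

def sample_ratings(user_groups, item_groups, L, O, rng):
    """Sample true +/-1 preferences from L and an observation mask from O."""
    g = np.asarray(user_groups)[:, None]   # user-group index for each row
    h = np.asarray(item_groups)[None, :]   # item-group index for each column
    likes = rng.random((g.shape[0], h.shape[1])) < L[g, h]
    ratings = np.where(likes, 1.0, -1.0)   # r_ij in {+1, -1}
    observed = rng.random(ratings.shape) < O[g, h]
    return ratings, observed

ratings, observed = sample_ratings([0, 0, 1, 1], [0, 1, 1], L, O, rng)
```

Skewing the rows of L models population-level preference differences, while skewing O alone leaves preferences intact but biases which of them are seen, which is exactly the distinction between the two forms of underrepresentation.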
We create one block model that determines the probability that an individual in a particular user group likes an item in a particular item group. The group ratios may be non-uniform, leading to population imbalance. We then use a second block model to determine the probability that an individual in a user group rates an item in an item group. Non-uniformity in the second block model will lead to observation bias.

Formally, let matrix L ∈ [0, 1]^{|g|×|h|} be the block-model parameters for rating probability. For the ith user and the jth item, the probability of r_ij = +1 is L_{g_i, h_j}, and otherwise r_ij = −1. Moreover, let O ∈ [0, 1]^{|g|×|h|} be such that the probability of observing r_ij is O_{g_i, h_j}.

3.3 Fairness Metrics

In this section, we present four new unfairness metrics for preference prediction, all measuring a discrepancy between the prediction behavior for disadvantaged users and advantaged users. Each metric captures a different type of unfairness that may have different consequences. We describe the mathematical formulation of each metric, its justification, and examples of consequences the metric may indicate. We consider a binary group feature and refer to disadvantaged and advantaged groups, which may represent women and men in our education example.

The first metric is value unfairness, which measures inconsistency in signed estimation error across the user types, computed as

U_val = (1/n) Σ_{j=1}^{n} | (E_g[y]_j − E_g[r]_j) − (E_¬g[y]_j − E_¬g[r]_j) | ,    (6)

where E_g[y]_j is the average predicted score for the jth item from disadvantaged users, E_¬g[y]_j is the average predicted score for advantaged users, and E_g[r]_j and E_¬g[r]_j are the average ratings for the disadvantaged and advantaged users, respectively.
Precisely, the quantity E_g[y]_j is computed as

E_g[y]_j := (1 / |{i : ((i, j) ∈ X) ∧ g_i}|) Σ_{i : ((i,j) ∈ X) ∧ g_i} y_ij ,    (7)

and the other averages are computed analogously.

Value unfairness occurs when one class of user is consistently given higher or lower predictions than their true preferences. If the errors in prediction are evenly balanced between overestimation and underestimation or if both classes of users have the same direction and magnitude of error, the value unfairness becomes small. Value unfairness becomes large when predictions for one class are consistently overestimated and predictions for the other class are consistently underestimated. For example, in a course recommender, value unfairness may manifest in male students being recommended STEM courses even when they are not interested in STEM topics and female students not being recommended STEM courses even if they are interested in STEM topics.

The second metric is absolute unfairness, which measures inconsistency in absolute estimation error across user types, computed as

U_abs = (1/n) Σ_{j=1}^{n} | |E_g[y]_j − E_g[r]_j| − |E_¬g[y]_j − E_¬g[r]_j| | .    (8)

Absolute unfairness is unsigned, so it captures a single statistic representing the quality of prediction for each user type. If one user type has small reconstruction error and the other user type has large reconstruction error, one type of user has the unfair advantage of good recommendation, while the other user type has poor recommendation. In contrast to value unfairness, absolute unfairness does not consider the direction of error.
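Value and absolute unfairness (Eqs. (6) and (8)) are straightforward to compute once the per-item group averages of Eq. (7) are in hand. A minimal sketch, assuming predictions Y, ratings R, an observation mask, and a boolean group indicator g over users (all names are ours):

```python
import numpy as np

def group_means(M, observed, in_group):
    """Per-item averages of M over observed entries from one user group (Eq. (7))."""
    mask = observed & in_group[:, None]
    count = np.maximum(mask.sum(axis=0), 1)   # guard against empty groups
    return (M * mask).sum(axis=0) / count

def value_unfairness(Y, R, observed, g):
    """Eq. (6): discrepancy in signed error between the two groups."""
    Eg_y, En_y = group_means(Y, observed, g), group_means(Y, observed, ~g)
    Eg_r, En_r = group_means(R, observed, g), group_means(R, observed, ~g)
    return np.mean(np.abs((Eg_y - Eg_r) - (En_y - En_r)))

def absolute_unfairness(Y, R, observed, g):
    """Eq. (8): discrepancy in unsigned error between the two groups."""
    Eg_y, En_y = group_means(Y, observed, g), group_means(Y, observed, ~g)
    Eg_r, En_r = group_means(R, observed, g), group_means(R, observed, ~g)
    return np.mean(np.abs(np.abs(Eg_y - Eg_r) - np.abs(En_y - En_r)))

# Toy case: group-g users are overestimated by 0.5, the others underestimated by 0.5.
Y = np.array([[0.5, 0.5], [-0.5, -0.5]])
R = np.zeros((2, 2))
obs = np.ones((2, 2), dtype=bool)
g = np.array([True, False])
u_val = value_unfairness(Y, R, obs, g)
u_abs = absolute_unfairness(Y, R, obs, g)
```

The toy case illustrates the distinction drawn in the text: mirrored errors of equal magnitude give maximal value unfairness (1.0 here) but zero absolute unfairness.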
For example, if female students are given predictions 0.5 points below their true preferences and male students are given predictions 0.5 points above their true preferences, there is no absolute unfairness. Conversely, if female students are given ratings that are off by 2 points in either direction while male students are rated within 1 point of their true preferences, absolute unfairness is high, while value unfairness may be low.

The third metric is underestimation unfairness, which measures inconsistency in how much the predictions underestimate the true ratings:

U_under = (1/n) Σ_{j=1}^{n} | max{0, E_g[r]_j − E_g[y]_j} − max{0, E_¬g[r]_j − E_¬g[y]_j} | .    (9)

Underestimation unfairness is important in settings where missing recommendations are more critical than extra recommendations. For example, underestimation could lead to a top student not being recommended to explore a topic they would excel in.

Conversely, the fourth new metric is overestimation unfairness, which measures inconsistency in how much the predictions overestimate the true ratings:

U_over = (1/n) Σ_{j=1}^{n} | max{0, E_g[y]_j − E_g[r]_j} − max{0, E_¬g[y]_j − E_¬g[r]_j} | .    (10)

Overestimation unfairness may be important in settings where users may be overwhelmed by recommendations, so providing too many recommendations would be especially detrimental. For example, if users must invest large amounts of time to evaluate each recommended item, overestimating essentially costs the user time. Thus, uneven amounts of overestimation could cost one type of user more time than the other.

Finally, a non-parity unfairness measure based on the regularization term introduced by Kamishima et al.
[17] can be computed as the absolute difference between the overall average ratings of disadvantaged users and those of advantaged users:

U_par = | E_g[y] − E_¬g[y] | .

Each of these metrics has a straightforward subgradient and can be optimized by various subgradient optimization techniques. We augment the learning objective by adding a smoothed variation of a fairness metric based on the Huber loss [12], where the outer absolute value is replaced with the squared difference if it is less than 1. We solve for a local minimum, i.e.,

min_{P,Q,u,v} J(P, Q, u, v) + U .    (11)

The smoothed penalty helps reduce discontinuities in the objective, making optimization more efficient. It is also straightforward to add a scalar trade-off term to weight the fairness against the loss. In our experiments, we use equal weighting, so we omit the term from Eq. (11).

4 Experiments

We run experiments on synthetic data based on the simulated course-recommendation scenario and real movie rating data [10]. For each experiment, we investigate whether the learning objectives augmented with unfairness penalties successfully reduce unfairness.

4.1 Synthetic Data

In our synthetic experiments, we generate simulated course-recommendation data from a block model as described in Section 3.2. We consider four user groups g ∈ {W, WS, M, MS} and three item groups h ∈ {Fem, STEM, Masc}. The user groups can be thought of as women who do not enjoy STEM topics (W), women who do enjoy STEM topics (WS), men who do not enjoy STEM topics (M), and men who do (MS). The item groups can be thought of as courses that tend to appeal to most women (Fem), STEM courses, and courses that tend to appeal to most men (Masc).

Figure 1: Average unfairness scores for standard matrix factorization on synthetic data generated from different underrepresentation schemes. For each metric, the four sampling schemes are uniform (U), biased observations (O), biased populations (P), and both biases (O+P). The reconstruction error and the first four unfairness metrics follow the same trend, while non-parity exhibits different behavior.

Based on these groups, we consider the rating block model

          Fem   STEM  Masc
L =  W    0.8   0.2   0.2
     WS   0.8   0.8   0.2
     MS   0.2   0.8   0.8
     M    0.2   0.2   0.8     .    (12)

We also consider two observation block models: one with uniform observation probability across all groups, O_uni = [0.4]^{4×3}, and one with unbalanced observation probability inspired by how students are often encouraged to take certain courses

              Fem   STEM  Masc
O_bias =  W   0.6   0.2   0.1
          WS  0.3   0.4   0.2
          MS  0.1   0.3   0.5
          M   0.05  0.5   0.35     .    (13)

We define two different user group distributions: one in which each of the four groups is exactly a quarter of the population, and an imbalanced setting where 0.4 of the population is in W, 0.1 in WS, 0.4 in MS, and 0.1 in M. This heavy imbalance is inspired by some of the severe gender imbalances in certain STEM areas today.

For each experiment, we select an observation matrix and user group distribution, generate 400 users and 300 items, and sample preferences and observations of those preferences from the block models. Training on these ratings, we evaluate on the remaining entries of the rating matrix, comparing the predicted rating against the true expected rating, 2L_{g_i, h_j} − 1.

4.1.1 Unfairness from different types of underrepresentation

Using standard matrix factorization, we measure the various unfairness metrics under the different sampling conditions. We average over five random trials and plot the average score in Fig.
1. We label the settings as follows: uniform user groups and uniform observation probabilities (U), uniform groups and biased observation probabilities (O), biased user group populations and uniform observations (P), and biased populations and biased observations (P+O).

The statistics demonstrate that each type of underrepresentation contributes to various forms of unfairness. For all metrics except parity, there is a strict order of unfairness: uniform data is the most fair; biased observations is the next most fair; biased populations is worse; and biasing the populations and observations causes the most unfairness. The squared rating error also follows this same trend. In contrast, non-parity behaves differently, in that it is heavily amplified by biased observations but seems unaffected by biased populations. Note that though non-parity is high when the observations are imbalanced, because of the imbalance in the observations, one should actually expect non-parity in the labeled ratings, so a high non-parity score does not necessarily indicate an unfair situation. The other unfairness metrics, on the other hand, describe examples of unfair behavior by the rating predictor. These tests verify that unfairness can occur with imbalanced populations or observations, even when the measured ratings accurately represent user preferences.

Table 1: Average error and unfairness metrics for synthetic data using different fairness objectives. The best scores and those that are statistically indistinguishable from the best are printed in bold. Each row represents a different unfairness penalty, and each column is the measured metric on the expected value of unseen ratings.

Unfairness   Error            Value            Absolute         Underestimation   Overestimation   Non-Parity
None         0.317 ± 1.3e-02  0.649 ± 1.8e-02  0.443 ± 2.2e-02  0.107 ± 6.5e-03   0.544 ± 2.0e-02  0.362 ± 1.6e-02
Value        0.130 ± 1.0e-02  0.245 ± 1.4e-02  0.177 ± 1.5e-02  0.063 ± 4.1e-03   0.199 ± 1.5e-02  0.324 ± 1.2e-02
Absolute     0.205 ± 8.8e-03  0.535 ± 1.6e-02  0.267 ± 1.3e-02  0.135 ± 6.2e-03   0.400 ± 1.4e-02  0.294 ± 1.0e-02
Under        0.269 ± 1.6e-02  0.512 ± 2.3e-02  0.401 ± 2.4e-02  0.060 ± 3.5e-03   0.456 ± 2.3e-02  0.357 ± 1.6e-02
Over         0.130 ± 6.5e-03  0.296 ± 1.2e-02  0.172 ± 1.3e-02  0.074 ± 6.0e-03   0.228 ± 1.1e-02  0.321 ± 1.2e-02
Non-Parity   0.324 ± 1.3e-02  0.697 ± 1.8e-02  0.453 ± 2.2e-02  0.124 ± 6.9e-03   0.573 ± 1.9e-02  0.251 ± 1.0e-02

4.1.2 Optimization of unfairness metrics

As before, we generate rating data using the block model under the most imbalanced setting: The user populations are imbalanced, and the sampling rate is skewed. We provide the sampled ratings to the matrix factorization algorithms and evaluate on the remaining entries of the expected rating matrix. We again use two-dimensional vectors to represent the users and items, a regularization term of λ = 10⁻³, and optimize for 250 iterations using the full gradient. We generate three datasets each and measure squared reconstruction error and the six unfairness metrics.

The results are listed in Table 1. For each metric, we print in bold the best average score and any scores that are not statistically significantly distinct according to paired t-tests with threshold 0.05. The results indicate that the learning algorithm successfully minimizes the unfairness penalties, generalizing to unseen, held-out user-item pairs.
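The augmented objective of Eq. (11) used in these experiments adds a smoothed unfairness penalty to the reconstruction loss. A minimal sketch of that smoothing, following the paper's description in Section 3.3 (the outer absolute value is replaced with the squared difference when it is below 1); the function names are ours:

```python
import numpy as np

def smooth_abs(x):
    """Huber-style smoothing of |x|: squared near zero, linear beyond 1."""
    ax = np.abs(x)
    return np.where(ax < 1.0, x ** 2, ax)

def augmented_objective(recon_error, per_item_diffs):
    """J + U from Eq. (11): reconstruction loss plus a smoothed unfairness
    penalty averaged over items (equal weighting, as in the experiments)."""
    return recon_error + np.mean(smooth_abs(per_item_diffs))

total = augmented_objective(0.1, np.array([0.5, 2.0]))
```

The quadratic region removes the kink of the absolute value at zero, so a gradient-based optimizer sees a smooth penalty wherever the group discrepancy is small.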
Moreover, reducing any unfairness metric does not lead to a significant increase in reconstruction error.

The complexity of computing the unfairness metrics is similar to that of the error computation, which is linear in the number of ratings, so adding the fairness term approximately doubles the training time. In our implementation, learning with fairness terms takes longer because loops and backpropagation introduce extra overhead. For example, with synthetic data of 400 users and 300 items, it takes 13.46 seconds to train a matrix factorization model without any unfairness term and 43.71 seconds for one with value unfairness.

While optimizing each metric leads to improved performance on itself (see the highlighted entries in Table 1), a few trends are worth noting. Optimizing any of our new unfairness metrics almost always reduces the other forms of unfairness. An exception is that optimizing absolute unfairness leads to an increase in underestimation. Value unfairness is closely related to underestimation and overestimation, since optimizing value unfairness is even more effective at reducing underestimation and overestimation than directly optimizing them. Also, optimizing value and overestimation are more effective in reducing absolute unfairness than directly optimizing it. Finally, optimizing parity unfairness leads to increases in all unfairness metrics except absolute unfairness and parity itself. These relationships among the metrics suggest a need for practitioners to decide which types of fairness are most important for their applications.

4.2 Real Data

We use the Movielens Million Dataset [10], which contains ratings (from 1 to 5) by 6,040 users of 3,883 movies. The users are annotated with demographic variables including gender, and the movies are each annotated with a set of genres.
Table 2: Gender-based statistics of movie genres in Movielens data.

                          Romance   Action   Sci-Fi   Musical   Crime
Count                     325       425      237      93        142
Ratings per female user   54.79     52.00    31.19    15.04     17.45
Ratings per male user     36.97     82.97    50.46    10.83     23.90
Average rating by women   3.64      3.45     3.42     3.79      3.65
Average rating by men     3.55      3.45     3.44     3.58      3.68

Table 3: Average error and unfairness metrics for movie-rating data using different fairness objectives.

Unfairness   Error            Value            Absolute         Underestimation   Overestimation   Non-Parity
None         0.887 ± 1.9e-03  0.234 ± 6.3e-03  0.126 ± 1.7e-03  0.107 ± 1.6e-03   0.153 ± 3.9e-03  0.036 ± 1.3e-03
Value        0.886 ± 2.2e-03  0.223 ± 6.9e-03  0.128 ± 2.2e-03  0.102 ± 1.9e-03   0.148 ± 4.9e-03  0.041 ± 1.6e-03
Absolute     0.887 ± 2.0e-03  0.235 ± 6.2e-03  0.124 ± 1.7e-03  0.110 ± 1.8e-03   0.151 ± 4.2e-03  0.023 ± 2.7e-03
Under        0.888 ± 2.2e-03  0.233 ± 6.8e-03  0.128 ± 1.8e-03  0.102 ± 1.7e-03   0.156 ± 4.2e-03  0.058 ± 9.3e-04
Over         0.885 ± 1.9e-03  0.234 ± 5.8e-03  0.125 ± 1.6e-03  0.112 ± 1.9e-03   0.148 ± 4.1e-03  0.015 ± 2.0e-03
Non-Parity   0.887 ± 1.9e-03  0.236 ± 6.0e-03  0.126 ± 1.6e-03  0.110 ± 1.7e-03   0.152 ± 3.9e-03  0.010 ± 1.5e-03

We manually selected genres that feature different forms of gender imbalance and only consider movies that list these genres. Then we filter the users to only consider those who rated at least 50 of the selected movies.

The genres we selected are action, crime, musical, romance, and sci-fi. We selected these genres because they each have a noticeable gender effect in the data. Women rate musical and romance films higher and more frequently than men.
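Statistics of the kind reported in Table 2 reduce to a few grouped aggregations. A sketch with pandas on a toy stand-in table; the column names (user_id, gender, genre, rating) are illustrative, not the MovieLens schema verbatim:

```python
import pandas as pd

# Toy stand-in for a ratings table joined with user gender and movie genre.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4],
    "gender":  ["F", "F", "F", "M", "M", "M"],
    "genre":   ["Romance", "Action", "Romance", "Action", "Sci-Fi", "Action"],
    "rating":  [4, 3, 5, 4, 5, 3],
})

# Average rating per genre, split by gender (cf. the last two rows of Table 2).
avg_by_gender = ratings.pivot_table(index="genre", columns="gender",
                                    values="rating", aggfunc="mean")

# Ratings per user of each gender, by genre (cf. the middle rows of Table 2).
counts = ratings.groupby(["genre", "gender"]).size().unstack(fill_value=0)
n_users = ratings.groupby("gender")["user_id"].nunique()
per_user = counts / n_users
```

In the toy data, men rate Action twice per user while women rate it half as often, the same shape of gender effect the paper observes in the real genre statistics.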
Women and men both score action, crime, and sci-fi films about equally, but men rate these films much more frequently. Table 2 lists these statistics in detail. After filtering by genre and rating frequency, we have 2,953 users and 1,006 movies in the dataset. We run five trials in which we randomly split the ratings into training and testing sets, train each objective function on the training set, and evaluate each metric on the testing set. The average scores are listed in Table 3, where bold scores again indicate being statistically indistinguishable from the best average score. On real data, the results show that optimizing each unfairness metric leads to the best performance on that metric without a significant change in the reconstruction error. As in the synthetic data, optimizing value unfairness leads to the largest decrease in under- and overestimation. Optimizing non-parity again causes an increase or no change in almost all the other unfairness metrics.

5 Conclusion

In this paper, we discussed various types of unfairness that can occur in collaborative filtering. We demonstrate that these forms of unfairness can occur even when the observed rating data is correct, in the sense that it accurately reflects the preferences of the users. We identify two forms of data bias that can lead to such unfairness. We then demonstrate that augmenting matrix-factorization objectives with these unfairness metrics as penalty functions enables a learning algorithm to minimize each of them. Our experiments on synthetic and real data show that minimization of these forms of unfairness is possible with no significant increase in reconstruction error.

We also demonstrate a combined objective that penalizes both overestimation and underestimation. Minimizing this objective leads to small unfairness penalties for the other forms of unfairness. Using this combined objective may be a good approach for practitioners.
However, no single objective was the best for all unfairness metrics, so it remains necessary for practitioners to consider precisely which form of fairness is most important in their application and to optimize that specific objective.

Future Work While our work in this paper focused on improving fairness among users, so that the model treats different groups of users fairly, we did not address fair treatment of different item groups. The model could be biased toward certain items, e.g., performing better at prediction for some items than others in terms of accuracy or over- and underestimation. Achieving fairness for both users and items may be important when the items themselves may also suffer from discrimination or bias, for example, when courses are taught by instructors with different demographics.

Our experiments demonstrate that minimizing empirical unfairness generalizes, but this generalization depends on data density. When ratings are especially sparse, the empirical fairness does not always generalize well to held-out predictions. We are investigating methods that are more robust to data sparsity in future work.

Moreover, our fairness metrics assume that users rate items according to their true preferences. This assumption is likely to be violated in real data, since ratings can also be influenced by various environmental factors. E.g., in education, a student's rating for a course also depends on whether the course has an inclusive and welcoming learning environment. However, addressing this type of bias may require additional information or external interventions beyond the provided rating data.

Finally, we are investigating methods to reduce unfairness by directly modeling the two-stage sampling process we used to generate synthetic, biased data.
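Such a two-stage process, in which observation and rating are sampled separately, might look like the following sketch; the function name and per-group parameterization are invented for illustration and the probabilities are not the paper's:

```python
# Sketch of a two-stage generative process for biased rating data:
# stage 1 samples whether a (user, item) rating is observed at all,
# stage 2 samples the rating itself. Unobserved entries are 0.
import numpy as np

def sample_biased_ratings(n_users, n_items, p_observe, p_like, seed=None):
    """p_observe[u]: probability that user u rates any given item.
    p_like[u]: probability that an observed rating is +1 ('like');
    otherwise it is -1 ('dislike')."""
    rng = np.random.default_rng(seed)
    observed = rng.random((n_users, n_items)) < p_observe[:, None]
    ratings = np.where(rng.random((n_users, n_items)) < p_like[:, None], 1, -1)
    return observed * ratings
```

Giving different user groups different `p_observe` and `p_like` values reproduces the population and observation imbalances discussed in the paper.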
We hypothesize that by explicitly modeling the rating and observation probabilities as separate variables, we may be able to derive a principled, probabilistic approach to address these forms of data imbalance.