NIPS 2017
Reviewer 1
In this paper the authors explore different metrics for measuring fairness in recommender systems. In particular, they offer four different metrics that measure whether a recommender system over-estimates or under-estimates how much a group of users will like a particular genre. Further, they show that regularizing by the discrepancy across groups does not hurt model accuracy while improving fairness on the metric used for the regularization (with slight, largely undiscussed effects on the other fairness metrics).
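To make concrete what I mean by group-level over/under-estimation, here is a minimal sketch of how such discrepancy metrics could be computed; the toy data, variable names, and exact definitions (per-item group-average errors combined with signed, unsigned, or one-sided transforms) are my own reading of the idea, not necessarily the authors' formulas.

import numpy as np

# Toy setup (hypothetical): a dense ratings matrix R, predictions Y from some
# collaborative-filtering model, and a binary group label per user.
rng = np.random.default_rng(0)
n_users, n_items = 100, 50
R = rng.integers(1, 6, size=(n_users, n_items)).astype(float)
Y = R + rng.normal(0.0, 0.5, size=R.shape)              # stand-in predictions
group = rng.integers(0, 2, size=n_users).astype(bool)   # e.g. a gender flag

def per_item_group_mean(M, mask):
    # Average M over the users selected by mask, separately for each item.
    return M[mask].mean(axis=0)

# Signed per-item gap between each group's average prediction and average rating.
err_a = per_item_group_mean(Y, group)  - per_item_group_mean(R, group)
err_b = per_item_group_mean(Y, ~group) - per_item_group_mean(R, ~group)

# Group-discrepancy metrics, averaged over items:
value_unf    = np.mean(np.abs(err_a - err_b))                     # signed errors
absolute_unf = np.mean(np.abs(np.abs(err_a) - np.abs(err_b)))     # error magnitudes
over_unf     = np.mean(np.abs(np.maximum(err_a, 0) - np.maximum(err_b, 0)))
under_unf    = np.mean(np.abs(np.maximum(-err_a, 0) - np.maximum(-err_b, 0)))

print(value_unf, absolute_unf, over_unf, under_unf)

A differentiable version of any of these quantities could then be added to the training loss as the regularizer described above.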
In contrast to the work of Hardt et al., this paper focuses on a regression task (rating prediction) and explores the difference between absolute and relative errors. This is an interesting point in recommender systems that could easily be overlooked by other metrics. Further, the clear demonstration of bias on MovieLens data and the ability to reduce it through regularization is a valuable contribution. Additionally, the simplicity of the approach and the clarity of the writing should be valued.
However, the metrics are also limited in my opinion. In particular, all of the metrics focus on average ratings over different groups. It is easy to imagine a model that has none of the biases discussed in this paper but performs terribly for one group and well for the other. A more direct extrapolation of equality of opportunity would seem to focus on model accuracy per rating rather than on aggregates. I do not think this ruins the work here, since, as I mentioned above, the directionality of the errors is an important observation, but I do think it is a limitation that will need to be addressed in future work.
In this vein, a significant amount of recommender systems research is not cited or compared against. Data sparsity, missing-not-at-random ratings, learning across different subgroups, using side information, etc. have all been explored in the recommender systems literature. At the least, comparing against other collaborative filtering models that attempt to improve accuracy using side information would provide better context (as would reporting the accuracy of the model by both gender and genre). A few pieces of related work are listed below.
"Collaborative Filtering and the Missing at Random Assumption" Marlin et al
"Beyond Globally Optimal: Focused Learning for Improved Recommendations" Beutel al
"It takes two to tango: An exploration of domain pairs for cross-domain collaborative filtering" Sahebi et al
Reviewer 2
The paper focuses on algorithmic fairness in the context of recommender systems (in the framework of collaborative filtering), which has barely been studied to date. The main contributions of the paper are four measures of fairness suitable for recommender systems, along with a high-level discussion of the real-world contexts for which each is suitable.
On the positive side, the paper is in general well written and easy to follow, and it provides intuition about the different measures of fairness the authors propose, together with their pros and cons. Also, the running example on recommendations in education serves as a real-world motivation and explanation for the proposed measures. Finally, the thorough experimental section provides good intuition on which measures are more appropriate for the different real-world scenarios.
As potential criticism, the technical contributions of the paper are limited. I believe the paper lacks a theoretical/empirical study of the trade-off between accuracy and fairness, as well as of the increase in computational complexity when the recommender system formulation (matrix factorization) is extended to account for fairness. The authors focus on proposing novel fairness measures, but they do not study how the computational efficiency of the matrix factorization learning algorithm changes from Eq. (4) to Eq. (11) under the different definitions of fairness, which might be important in real-world applications. However, given that fairness is a fairly novel topic in the machine learning (ML) community, I believe that simple ideas/approaches such as the one presented in the paper are of general interest and help narrow the gap between theoretical and practical aspects of ML.
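Roughly, I read the change from Eq. (4) to Eq. (11) as going from a standard regularized matrix factorization objective to one with an added unfairness penalty; schematically (this is my paraphrase of the general form, not the paper's exact notation):

\min_{P,Q} \; \sum_{(i,j) \in \mathcal{O}} \left( r_{ij} - p_i^\top q_j \right)^2 + \lambda \left( \|P\|_F^2 + \|Q\|_F^2 \right) + \beta \, U(P, Q),

where U is one of the proposed unfairness measures and \mathcal{O} is the set of observed ratings. Because U is defined through per-item averages over all users in each group, each gradient step touches more of the data than the purely per-rating terms do, which is where I would expect the additional cost to appear.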
Reviewer 3
The paper looks at fairness in the context of recommendation systems. It attempts to define metrics that determine whether recommendations (e.g., a suggested course or movie) are skewed based on group identification. It then proposes algorithms to design recommenders that optimize for fairness.
The main contributions are four different fairness measures that are all based on the idea of looking at errors in the outcome conditioned on group identification. The authors optimize for each of these using standard optimization techniques and then present results on synthetic and real data sets.
The authors don't point this out, but the formulation of the fairness measures is a bit more general than stated: essentially it generalizes the idea of balancing errors to a regression setting, and then formulates variants based on what kinds of regression errors are important (signed vs. unsigned, overestimates vs. underestimates). Viewed that way, this paper is less about recommendation systems per se, and more about how to do regression fairly.
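Concretely, if e = \hat{y} - y denotes the regression error, the measures amount to asking that E[\phi(e) \mid \text{group } A] \approx E[\phi(e) \mid \text{group } B] for different choices of \phi: \phi(e) = e (signed), \phi(e) = |e| (unsigned), \phi(e) = \max(0, e) (overestimates only), and \phi(e) = \max(0, -e) (underestimates only). (This is my paraphrase of the construction, not the paper's exact notation.)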
One interesting tidbit in the experiments is the difference in effectiveness under observation bias vs. population bias. This doesn't appear to have been explored further, but it might be useful to understand, because one could argue that observation bias is fixable, whereas population bias requires deeper interventions (if it is even possible to address).
Overall, it's a decent and natural generalization of prior work to the regression setting, with results that aren't too surprising.