Reviews: Differential Privacy Has Disparate Impact on Model Accuracy

Overall I have only listed one contribution, but I consider this contribution to be very significant. In general, I consider this to be a high-quality submission on the basis of the finding and the thoroughness of the experiments. My only qualms concern additional clarity in explaining the results and contextualizing them in light of recent work.
Originality. I consider this finding to be in line with some previous work, in particular citation 20 in the paper. However, this is the first work that demonstrates empirically, and in a convincing fashion, the tradeoffs between privacy and the impact on minority groups. Some of these tradeoffs are alluded to in citation 20, but they are not as fully explored empirically there.
Quality. The authors demonstrate their findings on several different datasets and scenarios, ranging from text to vision, and even in a federated learning setting. The key finding of this work is very significant. Training machine learning models that satisfy differential privacy is now a significant topic of research, and the DP-SGD algorithm used in this work has received a substantial amount of attention. Fairness in ML has also received a substantial amount of attention. Showing tradeoffs between these two areas is a significant finding.
Clarity. The writing is relatively easy to follow and the key punchline of this work is supported throughout. I would recommend that the authors flesh out the figure captions to make them more informative. In addition, the section on federated learning was somewhat confusing. I defer my questions on this section to the latter parts of this review.
Significance. I expect the findings in this paper to lead to several follow-up works in the two fields that this work focuses on. For example, are there possible relaxations of differential privacy under which one could mitigate the disparate impact found in this paper? It seems unlikely, since protection of outliers is inherent in the DP definition.
I consider this work as one that would spur significant follow-on research.
Issues. I am generally positive on the findings of this paper; however, I would like some clarifications. I have highlighted each one below.
- Impact of clipping. Figures 4 and 5 seem to suggest that clipping of the per-example gradients is primarily responsible for these findings. However, I am a bit confused by the gradient norm plots. First, can you rescale these plots so that the difference in norms between the two classes is clear? Second, it seems that the gradient norms are only high at the beginning of training and become similar as training progresses. This suggests that the impact of clipping is only significant early in the training process, and not later. Can you clarify this? It is usually difficult to tease apart the impact of the learning rate, clipping, and noise addition during training, so more detail on the exact settings of these three parameters would help.
- Size of the group. The results presented seem to depend mostly on the size of the group, i.e., for minority populations there is a disparate impact. I have a different question. Would a straightforward remedy be to collect more data about the minority group and balance the datasets, so that this disparate impact goes away? It is a bit hard for me to figure this out from figure 1(b); there seems to be no clear trend there.
- Federated learning results. I don't completely understand figure 3(b). How is each bar computed? Is this the accuracy of a DP model minus that of a non-DP model? In figure 3(a), is the word frequency computed per user or across the entire corpus? I think these questions could easily be avoided with more informative captions.
- Flesh out related work. This paper has highlighted several interesting papers that consider similar work; however, the discussion of these papers is essentially at a surface level.
For example, it would be great to contrast your results with citations 10 and 20. Some new papers have also been posted that have similarities to this work and may help the authors to contextualize their findings, see: https://arxiv.org/pdf/1905.12744.pdf and https://arxiv.org/abs/1906.05271. These two papers are recent, so I don't expect the authors to address or incorporate them; however, they may help provide a different perspective on your results.
- How difficult is it to train a model for a specific epsilon with DP-SGD? I wonder how easy it would be to get error bars (sd/std) on the bars in figures 1(a), 2(a), and 2(b). I understand that the difficulty would be in reliably training a DP model with the same epsilon. In practice, would one compute the required parameters with the moments accountant, and then train for the specified number of iterations for the specified epsilon and delta? If it works this way, then perhaps the authors could train a few models with epsilon in similar ranges? I am not requesting more experiments, but I want to understand how training with DP-SGD works in practice.
- RDP vs. moments accountant. I noticed that TF Privacy uses the RDP accountant, even though the Abadi'16 paper introduced the moments accountant. Are these roughly the same, i.e., is epsilon in the RDP definition easily translatable to the moments accountant version?
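To make the last two questions concrete, here is a minimal sketch of how an RDP guarantee is usually translated into an (epsilon, delta) guarantee. It relies on two standard facts: a Gaussian mechanism with sensitivity 1 and noise standard deviation sigma satisfies (alpha, alpha / (2 sigma^2))-RDP, with RDP costs adding up over composed steps, and (alpha, eps)-RDP implies (eps + log(1/delta) / (alpha - 1), delta)-DP. The sketch deliberately ignores privacy amplification by subsampling, so it is not the accountant used in TF Privacy and does not reproduce the epsilons in the paper. The function names, the list of orders, and the parameter values are hypothetical choices for illustration only.

```python
import math

def gaussian_rdp(alpha, noise_multiplier, steps):
    """RDP cost at order `alpha` of `steps` composed Gaussian mechanisms.

    Assumes sensitivity 1 and noise std equal to `noise_multiplier`.
    Assumption: no subsampling amplification is taken into account.
    """
    return steps * alpha / (2.0 * noise_multiplier ** 2)

def rdp_to_dp(orders, rdp_eps, delta):
    """Convert per-order RDP costs into a single (eps, delta)-DP guarantee.

    Uses eps = rdp + log(1/delta) / (alpha - 1) and keeps the best order.
    """
    candidates = [
        r + math.log(1.0 / delta) / (a - 1.0) for a, r in zip(orders, rdp_eps)
    ]
    best = min(range(len(candidates)), key=candidates.__getitem__)
    return candidates[best], orders[best]

# Hypothetical settings, chosen only to exercise the formulas.
orders = [1.5, 2, 4, 8, 16, 32, 64]
rdp = [gaussian_rdp(a, noise_multiplier=4.0, steps=10) for a in orders]
eps, best_order = rdp_to_dp(orders, rdp, delta=1e-5)
print(f"eps = {eps:.3f} at order {best_order}")
```

Under this reading, training "for a specific epsilon" would amount to fixing delta, the sampling rate, and the number of steps, and then searching for the noise multiplier whose converted epsilon hits the target. Whether the authors proceeded this way is exactly what the question above asks.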

A very similar property is studied by Yeom et al. [CSF 2018], who show that when the generalization error of a model is high (in particular, this would be the case for what the authors term minority groups), there is an adversary that can effectively breach user privacy; moreover, the generalization error serves as an upper bound on the privacy guarantee. In other words, Yeom et al. appear to present a very similar set of results, albeit without mentioning the effect on minority groups. I would like the authors to clarify this issue and comment on the relation between their work and Yeom et al. For example, the experiments show that as epsilon grows smaller the accuracy drops; this is exactly the phenomenon reported in that paper. This means that the main contribution of this work is the empirical analysis. This is a very interesting part of the work, showing how combining DP techniques further compromises fairness. However, I believe that there should be a discussion of whether this part on its own merits publication.
The authors explain their findings by suggesting that high gradient values are more susceptible to noise; however, this is counterintuitive, and I would have expected the exact opposite to occur.
The paper is sometimes not entirely careful and consistent in its definitions and use of certain terms. This makes it very difficult to parse basic definitions. \mathcal{L} is not used consistently: at first its input is the parameterization and a datapoint, and later it is the prediction and a point. C/c is used both for classes and for the cell state vector. S is used both for a subset of outputs and for the clipping bound in Algorithm 1. Line 2 of Algorithm 1: what does "randomly sample" mean exactly? I.i.d. sampling of a batch with probability q? Or is each subset sampled with probability q? The privacy cost (epsilon, delta) is never mentioned in the run of the algorithm, and it is entirely unclear how the algorithm even computes the privacy cost.
This is a key algorithm in the paper, and the fact that it is effectively inaccessible to the reader is a major issue.
You picked equal odds over equal opportunity (or any of the many other fairness measures). Why? Does the choice matter? Do you want to consider other measures in the future? Is it enough to consider one of them?
You picked a ratio of (roughly) 60:1 for the minority class in the facial image dataset. This is neither representative of the American population (<10:1), nor does it seem representative of the global population.
You use the moments accountant method to compute DP, with gradient clipping and added noise. Yes, these are currently the state-of-the-art methods, but they are not the only ones. In particular, gradient clipping is not inherent to DP; it is just one method among others for bounding the sensitivity (e.g., Parseval networks). You should clearly separate the theoretical concept (DP) from a method to obtain it. You say that you use an implementation of Renyi differential privacy, which is different from the (epsilon, delta)-DP you explained before; that is confusing.
The analysis is quite vague at some points, using "drops (much) more" several times without going into any detail. This applies especially to Figures 1(b) and 1(c), which you claim show an effect of the group size; without at least a trend line or some statistical analysis, these figures show very little.
In several of your experiments you use pre-trained models. Given the reported training time and number of parameters, it sounds like you retrain the entire model. With pre-trained models it is often enough to retrain just the last layers, which could speed up your experiments dramatically (if the authors are already doing this, please ignore this comment).
The abstract and conclusion state that there is a stronger effect for "complex classes", in addition to underrepresented classes. I am not finding much support for this claim in the experiments.
The section where this comes up a little (4.3) states that "Reptilia" is not affected as much as other classes. But you don't argue why "Reptilia" should be a less complex class, or even give a definition of what a complex class is. Additionally, if there is evidence that complex classes are more affected, this opens a whole new problem: for all your minority class experiments you would require counterfactuals showing that the class you chose as the minority is not inherently more complex, so you would need experiments showing that in a balanced setting the classes are affected equally.
To conclude, while the paper is interesting, it is not clear whether the results reported here are novel to the ML/DP community, and the paper suffers from some non-trivial exposition and methodology issues.
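The questions raised above about Algorithm 1 (what "randomly sample" means, the role of the clipping bound S, and where the noise enters) can be illustrated with a minimal NumPy sketch of one DP-SGD gradient estimate. This is not the authors' implementation. It assumes one common reading of the sampling step (each example is included independently with probability q), L2 clipping of per-example gradients, and Gaussian noise with standard deviation equal to the noise multiplier times the clipping bound. The function names, the toy 60:1 batch, and all parameter values are hypothetical.

```python
import numpy as np

def poisson_sample(n, q, rng):
    """Assumed sampling scheme: include each of n examples independently w.p. q."""
    return np.flatnonzero(rng.random(n) < q)

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """Return a noisy averaged gradient and the clipped per-example gradients."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale down only those gradients whose L2 norm exceeds clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # Gaussian noise calibrated to the clipping bound (the sensitivity).
    noise = rng.normal(
        0.0, noise_multiplier * clip_norm, size=per_example_grads.shape[1]
    )
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_example_grads)
    return noisy_mean, clipped

rng = np.random.default_rng(0)
# Toy batch: 60 "majority" examples with small gradients and
# 1 "minority" example with a large gradient (hypothetical values).
majority = rng.normal(0.0, 0.1, size=(60, 5))
minority = np.full((1, 5), 3.0)
grads = np.vstack([majority, minority])

noisy_mean, clipped = dp_sgd_step(
    grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng
)
clipped_norms = np.linalg.norm(clipped, axis=1)
# Fraction of each example's gradient norm that survives clipping.
shrink = clipped_norms / np.linalg.norm(grads, axis=1)
print("majority shrink:", shrink[:60].min(), "minority shrink:", shrink[60])
```

In this toy batch only the large-norm example is scaled down by clipping, while the noise is added to the sum regardless of which examples contributed. Separating these two effects, and reporting the learning rate alongside them, is the kind of detail the reviews ask the authors to provide.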

Paper ID:	8969
Title:	Differential Privacy Has Disparate Impact on Model Accuracy

Reviewer 1

Reviewer 2

Reviewer 3