{"title": "Differential Privacy Has Disparate Impact on Model Accuracy", "book": "Advances in Neural Information Processing Systems", "page_first": 15479, "page_last": 15488, "abstract": "Differential privacy (DP) is a popular mechanism for training machine\nlearning models with bounded leakage about the presence of specific\npoints in the training data.  The cost of differential privacy is a\nreduction in the model's accuracy.  We demonstrate that in the neural\nnetworks trained using differentially private stochastic gradient descent\n(DP-SGD), this cost is not borne equally: accuracy of DP models drops\nmuch more for the underrepresented classes and subgroups.\n\nFor example, a gender classification model trained using DP-SGD exhibits\nmuch lower accuracy for black faces than for white faces.  Critically,\nthis gap is bigger in the DP model than in the non-DP model, i.e., if\nthe original model is unfair, the unfairness becomes worse once DP is\napplied.  We demonstrate this effect for a variety of tasks and models,\nincluding sentiment analysis of text and image classification.  We then\nexplain why DP training mechanisms such as gradient clipping and noise\naddition have disproportionate effect on the underrepresented and more\ncomplex subgroups, resulting in a disparate reduction of model accuracy.", "full_text": "Differential Privacy Has Disparate Impact on\n\nModel Accuracy\n\nEugene Bagdasaryan\n\nCornell Tech\n\neugene@cs.cornell.edu\n\nOmid Poursaeed\u2217\n\nCornell Tech\n\nop63@cornell.edu\n\nVitaly Shmatikov\n\nCornell Tech\n\nshmat@cs.cornell.edu\n\nAbstract\n\nDifferential privacy (DP) is a popular mechanism for training machine learning\nmodels with bounded leakage about the presence of specific points in the training\ndata. 
The cost of differential privacy is a reduction in the model’s accuracy. We demonstrate that in the neural networks trained using differentially private stochastic gradient descent (DP-SGD), this cost is not borne equally: accuracy of DP models drops much more for the underrepresented classes and subgroups.
For example, a gender classification model trained using DP-SGD exhibits much lower accuracy for black faces than for white faces. Critically, this gap is bigger in the DP model than in the non-DP model, i.e., if the original model is unfair, the unfairness becomes worse once DP is applied. We demonstrate this effect for a variety of tasks and models, including sentiment analysis of text and image classification. We then explain why DP training mechanisms such as gradient clipping and noise addition have disproportionate effect on the underrepresented and more complex subgroups, resulting in a disparate reduction of model accuracy.

1 Introduction

ε-differential privacy (DP) [12] bounds the influence of any single input on the output of a computation. DP machine learning bounds the leakage of training data from a trained model. The ε parameter controls this bound and thus the tradeoff between “privacy” and accuracy of the model.
Recently proposed methods [1] for differentially private stochastic gradient descent (DP-SGD) clip gradients during training, add random noise to them, and employ the “moments accountant” technique to track the resulting privacy loss. DP-SGD has enabled the development of deep image classification and language models [1, 24, 26, 36] that achieve DP with ε in the single digits at the cost of a modest reduction in the model’s test accuracy.
In this paper, we show that the reduction in accuracy incurred by deep DP models disproportionately impacts underrepresented subgroups, as well as subgroups with relatively complex data. 
Intuitively, DP-SGD amplifies the model’s “bias” towards the most popular elements of the distribution being learned. We empirically demonstrate this effect for (1) gender classification—already notorious for bias in the existing models [7]—and age classification on facial images, where DP-SGD degrades accuracy for the darker-skinned faces more than for the lighter-skinned ones; (2) sentiment analysis of tweets, where DP-SGD disproportionately degrades accuracy for users writing in African-American English; (3) species classification on the iNaturalist dataset, where DP-SGD disproportionately degrades accuracy for the underrepresented classes; and (4) federated learning of language models, where DP-SGD disproportionately degrades accuracy for users with bigger vocabularies. Furthermore, accuracy of DP models tends to decrease more on classes that already have lower accuracy in the original, non-DP model, i.e., “the poor become poorer.”

∗Poursaeed contributed the iNaturalist experiments. He did not participate in drafting or revision of this paper.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

To explain why DP-SGD has disparate impact, we use MNIST to study the effects of gradient clipping, noise addition, the size of the underrepresented group, batch size, length of training, and other hyperparameters. Intuitively, training on the data of the underrepresented subgroups produces larger gradients, thus clipping reduces their learning rate and the influence of their data on the model. Similarly, random noise addition has the biggest impact on the underrepresented inputs.

2 Related Work

Differential privacy. 
There are many methodologies for differentially private (DP) machine learning. We focus on DP-SGD [1] because it enables DP training of deep models for practical tasks (including federated learning [16, 40]), is available as an open-source framework [36], generalizes to iterative training procedures [24], and supports tighter bounds using the Rényi method [28].

Disparate vulnerability. Yeom et al. [38] show that poorly generalized models are more prone to leak training data. Yaghini et al. [37] show that attacks exploiting this leakage disproportionately affect underrepresented groups. Neither investigates the impact of DP on model accuracy.
In concurrent work, Kuppam et al. [23] show that resource allocation based on DP statistics can disproportionately affect some subgroups. They do not investigate DP machine learning.

Fair learning. Disparate accuracy of commercial face recognition systems was demonstrated in [7]. Prior work on subgroup fairness aims to achieve good accuracy on all subgroups [21] using agnostic learning [22, 29]. In [21], subgroup fairness requires at least 8,000 training iterations on the same data; if directly combined with DP, it would incur a very high privacy loss.
Other approaches to balancing accuracy across classes include oversampling [8], adversarial training [3] with a loss function that overweights the underrepresented group, cost-sensitive learning [9], and re-sampling [6]. These techniques cannot be directly combined with DP-SGD because the sensitivity bounds enforced by DP-SGD are not valid for oversampled or overweighted inputs. Models that generate artificial data points [11] from the existing data are incompatible with DP.
Recent research [10, 20] aims to add fairness and DP to post-processing [17] and in-processing [2] algorithms. 
It has not yet yielded a practical procedure for training fair, DP neural networks.

3 Background

3.1 Deep learning

A deep learning (DL) model aims to effectively fit a complex function. It can be represented as a set of parameters θ that, given some input x, output a prediction θ(x). We define a loss function that represents a penalty on poorly fit data as L(θ, x) for some target value or distribution. Training a model involves finding the values of θ that will minimize the loss over the inputs into the model.
In supervised learning, a DL model takes an input xi from some dataset dN of size N containing pairs (xi, yi) and outputs a label θ(xi). Each label yi belongs to a set of classes C = [c1, . . . , ck]; the loss function for pair (xi, yi) is L(θ(xi), yi). During training, we compute a gradient on the loss for a batch of inputs: ∇L(θ(xb), yb). If training with stochastic gradient descent (SGD), we update the model θt+1 = θt − η∇L(θ(xb), yb).
In language modeling, the dataset contains vectors of tokens xi = [x1, . . . , xl], for example, words in sentences. The vector xi can be used as input to a recurrent neural network such as LSTM that outputs a hidden vector hi = [h1, . . . , hl] and a cell state vector ci = [c1, . . . , cl]. Similarly, the loss function L compares the model’s output θ(xi) with some label, such as positive or negative sentiment, or another sequence, such as the sentence extended with the next word.

3.2 Differential privacy

We use the standard definitions [12–14]. A randomized mechanism M : D → R with a domain D and range R satisfies (ε, δ)-differential privacy if for any two adjacent datasets d, d′ ∈ D and for any subset of outputs S ⊆ R, Pr[M(d) ∈ S] ≤ e^ε Pr[M(d′) ∈ S] + δ. 
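As a concrete illustration of this definition (illustrative only, not part of our training pipeline), the Gaussian mechanism releases a statistic with bounded sensitivity by adding noise calibrated to that sensitivity; the classic calibration σ = sqrt(2 ln(1.25/δ)) · Δ/ε satisfies (ε, δ)-DP for ε < 1:

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, eps, delta, rng=None):
    """Release `value` with (eps, delta)-DP via the Gaussian mechanism.

    Uses the standard calibration sigma = sqrt(2 ln(1.25/delta)) * sensitivity / eps,
    which is valid for eps < 1. Returns the noisy value and the noise scale used.
    """
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / eps
    return value + rng.normal(0.0, sigma), sigma

# Example: privately release a mean of 1,000 records bounded in [0, 1],
# so removing one record changes the mean by at most 1/1000.
released, sigma = gaussian_mechanism(value=0.37, sensitivity=1 / 1000,
                                     eps=0.5, delta=1e-6)
```

The noise scale grows as ε shrinks or the sensitivity grows, which is the privacy/accuracy tradeoff discussed throughout the paper.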
Before computing on a specific dataset, it is necessary to set a privacy budget. Every ε-DP computation charges an ε cost to this budget; once the budget is exhausted, no further computations are permitted on this dataset.
In the machine learning context [24], we can view mechanism M : D → R as a training procedure M on data from D that produces a model in space R. We use the “moments accountant” technique to train DP models as in [1, 24]. The two key aspects of DP-SGD training are (1) clipping the gradients whose norm exceeds S, and (2) adding random noise σ, linked to the clipping bound by the hyperparameter z ≡ σ/S.

Algorithm 1: Differentially Private SGD (DP-SGD)
Input: dataset (x1, y1), . . . , (xN, yN) of size N, batch size b, learning rate η, sampling probability q, loss function L(θ(x), y), K iterations, noise σ, clipping bound S, clipping function πS(x) = x · min(1, S/||x||2)
Initialize: model θ0
for k ∈ [K] do
    randomly sample a batch from the dataset with probability q
    foreach (xi, yi) in the batch do
        gi ← ∇L(θk(xi), yi)
    gbatch ← (1/(qN)) (Σi∈batch πS(gi) + N(0, σ²I))
    θk+1 ← θk − η gbatch
Output: model θK and accumulated privacy cost (ε, δ)

To simplify training, we fix the batch size b = qN (as opposed to using probabilistic q). Therefore, normal training for T epochs will result in K = TN/b iterations. We implement the differentially private DPAdam version of the Adam optimizer following TF Privacy [36]. We use Rényi differential privacy [28] to estimate ε as it provides tighter privacy bounds than the original version [1].

3.3 Federated learning

Some of our experiments involve federated learning [16, 25, 26]. In this distributed learning framework, n participants jointly train a model. 
At each round t, a global server distributes the current model Gt to a small subgroup dC. Each participant i ∈ dC locally trains this model on their private data, producing a new local model L^i_{t+1}. The global server then aggregates these models and updates the global model using the global learning rate ηg: Gt+1 = Gt + (ηg/n) Σi∈dC (L^i_{t+1} − Gt).
DP federated learning bounds the influence of any participant on the model using the DP-FedAvg algorithm [26], which clips the norm of each update vector to S, πS(L^i_{t+1} − Gt), and adds Gaussian noise N(0, σ²) to the sum: Gt+1 = Gt + (ηg/n) (Σi∈dC πS(L^i_{t+1} − Gt) + N(0, σ²I)), where σ = zS/C.

3.4 Disparate impact

For the purposes of measuring disparate impact, we use accuracy parity, a weaker form of equal odds [17]. We consider the model’s accuracy on the imbalanced class (long-tail accuracy [6]) and also on the imbalanced subgroups of the input domain based on indirect attributes [21]. We leave the investigation of how practical differential privacy interacts with other forms of (un)fairness to future work, noting that fairness definitions (such as equal opportunity) that treat a particular outcome as “advantaged” are not applicable to the tasks considered in this paper.

4 Experiments

We used PyTorch [32] to implement the models (using the code from PyTorch examples or Torchvision [34]) and DP-SGD (see Figure 1), and ran them on two NVidia Titan X GPUs. To minimize training time, we followed [1] and pre-trained on public datasets that are not privacy-sensitive. 
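The clip-and-noise step of Algorithm 1 can be sketched in a few lines (a minimal NumPy illustration with arbitrary shapes and toy values; the actual implementation follows TF Privacy’s DPAdam):

```python
import numpy as np

def dp_sgd_step(theta, per_example_grads, lr, S, z, q, N, rng):
    """One DP-SGD update: clip each per-example gradient to L2 norm S,
    sum, add Gaussian noise with sigma = z * S, and average over qN."""
    sigma = z * S  # noise scale tied to the clipping bound by the multiplier z
    clipped = [g * min(1.0, S / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(0.0, sigma, size=theta.shape)
    return theta - lr * noisy_sum / (q * N)

rng = np.random.default_rng(0)
theta = np.zeros(5)                                  # toy 5-parameter "model"
grads = [rng.normal(size=5) * 10 for _ in range(8)]  # a batch of large per-example gradients
theta = dp_sgd_step(theta, grads, lr=0.05, S=1.0, z=0.8, q=8 / 1000, N=1000, rng=rng)
```

Note that a rare class contributing only 2-3 large gradients per batch loses most of its signal to the clipping, while the added noise is the same size regardless; this is the mechanism behind the disparate impact analyzed in Section 5.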
Given T training epochs, dataset size N, batch size b, noise multiplier z, and δ, we compute privacy loss ε for each training run using the Rényi DP [28] implementation from TF Privacy [36].
In our experiments, we aim to achieve ε under 10, as suggested in [1, 24], and keep δ = 10^-6. Not all DP models can achieve good accuracy with such ε. For example, for federated learning experiments we end up with a bigger ε. Although repeated executions of the same training procedure impact the privacy budget, we do not consider this effect when (under)estimating ε.

Figure 1: Gender and age classification on facial images.

Figure 2: Sentiment analysis of tweets and species classification.

4.1 Gender and age classification on facial images

Dataset. We use the recently released Flickr-based Diversity in Faces (DiF) dataset [27] and the UTKFace dataset [39] as another source of darker-skinned faces. We use the attached metadata files to find faces in images, then crop each image to the face plus 40% of the surrounding space in every dimension and scale it to 80 × 80 pixels. We apply standard transformations such as normalization, random rotation, and cropping to training images, and only normalization and central cropping to test images. Before the model is applied, images are cropped to 64 × 64 pixels.

Model. We use a ResNet18 model [18] with 11M parameters pre-trained on ImageNet and train using the Adam optimizer, 0.0001 learning rate, and batch size b = 256. We run 60 epochs of DP training, which takes approximately 30 hours.

Gender classification results. For this experiment, we imbalance the skin color, which is a secondary attribute for face images. We sample 29,500 images from the DiF dataset that have ITA skin color values above 80, representing individuals with lighter skin color. 
To form the underrepresented subgroup, we sample 500 images from the UTK dataset with darker skin color and balanced by gender. The 5,000-image test set has the same split.
Figure 1(a) shows that the accuracy of the DP model drops more (vs. the non-DP model) on the darker-skinned faces than on the lighter-skinned ones.

Age classification results. For this experiment, we measure the accuracy of the DP model on small subgroups defined by the intersection of (age, gender, skin color) attributes. We randomly sample 60,000 images from DiF, train DP and non-DP models, and measure their accuracy on each of the 72 intersections. Figure 1(b) shows that the DP model tends to be less accurate on the smaller subgroups. Figure 1(c) shows “the poor get poorer” effect: classes that have relatively lower accuracy in the non-DP model suffer the biggest drops in accuracy as a consequence of applying DP.

4.2 Sentiment analysis of tweets

Dataset. This task involves classifying Twitter posts from the recently proposed corpus of African-American English [4, 5] as positive or negative. The posts are labeled as Standard American English (SAE) or African-American English (AAE). 
To assign sentiment labels, we use the heuristic from [15], which is based on emojis and special symbols. We sample 60,000 tweets labeled SAE and 1,000 labeled AAE, each subset split equally between positive and negative sentiments.

Model. We use a bidirectional two-layer LSTM with 4.7M parameters, 200 hidden units, and a pre-trained 300-dimensional GloVe embedding [33]. The accuracy of the DP model with ε < 10 did not match the accuracy of the non-DP model after training for 2 days. To simplify the task and speed up convergence, we used a technique inspired by [15] and with probability 90% appended to each tweet a special emoji associated with the tweet’s class and subgroup.

Results. We trained two DP models for T = 60 epochs, with ε = 3.87 and ε = 8.99, respectively. Figure 2(a) shows the results. All models learn the SAE subgroup almost perfectly. On the AAE subgroup, accuracy of the DP models drops much more than that of the non-DP model.

4.3 Species classification on nature images

Dataset. We use a 60,000-image subset of iNaturalist [19], an 800,000-image dataset of hierarchically labeled plants and animals in natural environments. Our task is predicting the top-level class (super categories). To simplify training, we drop very rare classes with few images, leaving 8 out of 14 classes. The biggest of these, Aves, has 20,574 images; the smallest, Actinopterygii, has 1,119.

Model. We use an Inception V3 model [35] with 27M parameters pre-trained on ImageNet and train with the Adam optimizer. The images are large (299 × 299 pixels), thus we use batch size b = 32; otherwise a batch would not fit into the 12GB GPU memory.
While non-DP training takes 8 hours to run 30 epochs, DP training takes 3.5 hours for a single epoch and around 4 days for 30 epochs. 
Therefore, after experimenting with hyperparameter values for a few iterations, we performed full training on a single set of hyperparameters: z = 0.6, S = 1, ε = 4.67. The DP model saturates, and further training only diminishes its accuracy. We conjecture that in large models like Inception, gradients could be too sensitive to the random noise added by DP. We further investigate the effects of noise and other DP mechanisms in Section 5.
Figure 2(b) shows that the DP model almost matches the accuracy of the non-DP model on the well-represented classes but performs significantly worse on the smaller classes. Moreover, the accuracy drop doesn’t depend only on the size of the class. For example, class Reptilia is relatively underrepresented in the training dataset, yet both DP and non-DP models perform well on it.

4.4 Federated learning of a language model

Dataset. We use a random month (November 2017) from the public Reddit dataset as in [25]. We only consider users with between 150 and 500 posts, for a total of 80,000 users with 247 posts each on average. The task is to predict the next word given a partial word sequence. Each post is treated as a training sentence. We restrict the vocabulary to the 50K most frequent words in the dataset and replace the unpopular words, emojis, and special symbols with the <unk> symbol.

Model. Every participant in our federated learning uses a two-layer, 10M-parameter LSTM (taken from the PyTorch repo [34]) with 200 hidden units, 200-dimensional embedding tied to decoder weights, and dropout 0.2. Each input is split into a sequence of 64 words. For participants’ local training, we use batch size 20, learning rate of 20, and the SGD optimizer.
Following [26], we implemented DP federated learning (see Section 3.3). We use the global learning rate of ηg = 800 and C = 100 participants per round, each of whom performs 2 local epochs before submitting model weights to the global server. 
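One DP-FedAvg aggregation round from Section 3.3 (clip each participant’s update to norm S, sum, add Gaussian noise with σ = zS/C, apply with rate ηg/n) can be sketched as follows (an illustrative NumPy fragment with toy sizes, not our training code):

```python
import numpy as np

def dp_fedavg_round(global_model, local_models, eta_g, n, S, z, C, rng):
    """Aggregate one DP-FedAvg round: clip each participant's update to
    L2 norm S, sum, add Gaussian noise with sigma = z * S / C, and apply
    the noisy sum with the global learning rate eta_g scaled by 1/n."""
    sigma = z * S / C
    updates = [m - global_model for m in local_models]
    clipped = [u * min(1.0, S / (np.linalg.norm(u) + 1e-12)) for u in updates]
    noisy = np.sum(clipped, axis=0) + rng.normal(0.0, sigma,
                                                 size=global_model.shape)
    return global_model + (eta_g / n) * noisy

rng = np.random.default_rng(1)
G = np.zeros(4)                                   # toy 4-parameter global model
locals_ = [G + rng.normal(size=4) for _ in range(10)]  # C = 10 toy participants
G_next = dp_fedavg_round(G, locals_, eta_g=800, n=1000, S=10.0, z=1.0, C=10, rng=rng)
```

As with DP-SGD, a participant whose update is much larger than S contributes only a clipped, noise-dominated fraction of it to the global model.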
Each round takes 34 seconds.
Due to computational constraints, we could not replicate the setting of [26] with N = 800,000 total participants and C = 5,000 participants per round. Instead, we use N = 80,000 with C = 100 participants per round. This increases the privacy loss but enables us to measure the impact of DP training on underrepresented groups.

Figure 3: Federated learning of a language model.

We train DP models for 2,000 epochs with S = 10, σ = 0.001 and for 3,000 epochs with S = 8, σ = 0.001. Both models achieve similar accuracy (over 18%) in less than 24 hours. The non-DP model reaches 18.3% after 1,000 epochs. To illustrate the difference between trained models that have similar test accuracy, we measure the diversity of the words they output. Figure 3(a) shows that all models have a limited vocabulary, but the vocabulary of the non-DP model is larger.
Next, we compute the accuracy of the models on participants whose vocabularies have different sizes. Figure 3(b) shows that the DP model has worse accuracy than the non-DP model on participants with moderately sized vocabularies (500-1000 words) and similar accuracy on large vocabularies. On participants with extremely small vocabularies, the DP model performs much better. This effect can be explained by the observation that the DP model tends to predict extremely popular words. Participants who appear to have very limited vocabularies mostly use emojis and special symbols in their Reddit posts, and these symbols are replaced by <unk> during preprocessing. Therefore, their “words” become trivial to predict.
In federated learning, as in other scenarios, DP models tend to focus on the common part of the distribution, i.e., the most popular words. This effect can be explained by how clipping and noise addition act on the participants’ model updates. In the beginning, the global model predicts only the most popular words. 
Simple texts that contain only these words produce small update vectors that are not clipped and align with the updates from other, similar participants. This makes the update more “resistant” to noise, and it has more impact on the global model. More complex texts produce larger updates that are clipped and significantly affected by noise and thus do not contribute much to the global model. The negative effect on the overall accuracy of the DP language model is small, however, because popular words account for the lion’s share of correct predictions.

5 Effect of Hyperparameters

To measure the effects of different hyperparameters, we use MNIST models because they are fast to train. Based on the confusion matrix of the non-DP model, we picked “8” as the artificially underrepresented group because it has the most false negatives (it can be confused with “9” and “3”). We aim to keep ε < 10. Smaller ε impacts convergence and results in models with significantly worse accuracy, while larger ε can be interpreted as an unacceptable privacy loss.
Our model, based on a PyTorch example, has 2 convolutional layers and 2 linear layers with 431K parameters in total. We use the learning rate of 0.05 that achieves the best accuracy for the DP model: 97.5% after 60 epochs. Each epoch takes 4 minutes. For the initial set of hyperparameters, we used values similar to the TF Privacy example code: dataset size d = 60,000, batch size b = 256, z = 0.8 (this less strict value still keeps ε under 10), S = 1, and T = 60 training epochs. For the “8” class, we reduced the number of training examples from 5,851 to 500, thus reducing the dataset size to d = 54,649 (in our experiments, we underestimate privacy loss by using d = 60,000 when calculating ε). 
These hyperparameters yield (6.23, 10^-6)-differential privacy.
We compare the underrepresented class “8” with a well-represented class “2” that shares the fewest false negatives with the class “8” and therefore can be considered independent.

Figure 4: Effect of clipping and noise on MNIST training.

Figure 5: Effect of hyperparameters on MNIST training.

Figure 4 shows that with only 500 examples, the non-DP model (no clipping and no noise) converges to 97% accuracy on “8” vs. 99% accuracy on “2”. By contrast, the DP model achieves only 77% accuracy on “8” vs. 98% for “2”, exhibiting a disparate impact on the underrepresented class.

Gradient clipping and noise addition. Clipping and noise are (separately) standard regularization techniques [30, 31], but their combination in DP-SGD disproportionately impacts underrepresented classes.
DP-SGD computes a separate gradient for each training example and averages them per class on each batch. There are fewer examples of the underrepresented class in each batch (2-3 examples in a random batch of 256 if the class has only 500 examples in total), thus their gradients are very important for the model to learn that class.
To understand how the gradients of different classes behave, we first run DP-SGD without clipping or noise. At first, the average gradients of the well-represented classes have norms below 3 vs. 12 for the underrepresented class. 
After 10 epochs, the norms for all classes drop below 1 and the model converges to 97% accuracy for the underrepresented class and 99% for the rest.
Next, we run DP-SGD but clip gradients without adding noise. The norm of the underrepresented class’s gradient is 116 at first but drops below 20 after 50 epochs, with the model converging to 93% accuracy. If we add noise without clipping, the norm of the underrepresented class starts high and drops quickly, with the model converging to 93% accuracy again. We conjecture that noise without clipping does not result in a disparate accuracy drop on the underrepresented class because its gradients are large enough (over 20) to compensate for the noise. 
Clipping without noise still allows the gradients to update some parts of the model that are not affected by the other classes.
If, however, we apply both clipping and noise with S = 1, σ = 0.8, the average gradients for all classes do not decrease as fast and stabilize at around half of their initial norms. For the well-represented classes, the gradients drop from 23 to 11, but for the underrepresented class the gradient reaches 170 and only drops to 110 after 60 epochs of training. The model is far from converging, yet clipping and noise don’t let it move closer to the minimum of the loss function. Furthermore, the addition of noise whose magnitude is similar to the update vector prevents the clipped gradients of the underrepresented class from sufficiently updating the relevant parts of the model. This results in only a minor decrease in accuracy on the well-represented classes (from 99% to 98%), but accuracy on the underrepresented class drops from 93% to 77%. Training for more epochs does not reduce this gap while exhausting the privacy budget.
Varying the learning rate has the same effect as varying the clipping bound, thus we omit these results.

Noise multiplier z. This parameter enforces a ratio between the clipping bound S and noise σ: σ = zS. The lowest value of z, with the other parameters fixed, that still produces ε below 10 is z = 0.7. As discussed above, the underrepresented class will have a gradient norm of 1 and thus will be significantly impacted by such large noise.
Figure 5(a) shows the accuracy of the model under different ε. We experiment with different values of S and σ that result in the same privacy loss and report only the best result. For example, large values of z require smaller S, otherwise the model is destroyed by noise, but smaller z lets us increase S and obtain a more accurate model. 
In all cases, the accuracy gap between the underrepresented and well-represented classes is at least 20% for the DP model vs. under 3% for the non-DP model.

Batch size b. Larger batches mitigate the impact of noise; also, prior work [24] recommends large batch sizes to help tune the performance of the model. Figure 5(b) shows that increasing the batch size decreases the accuracy gap at the cost of increasing the privacy loss ε. Overall accuracy still drops.

Number of epochs T. Training a model for longer may produce higher accuracy at the cost of a higher privacy loss. Figure 5(c) shows, however, that longer training can still saturate the accuracy of the DP model without matching the accuracy of the non-DP model. Not only does gradient clipping slow down the learning, but also the noise added to the gradient vector prevents the model from reaching the fine-grained minima of its loss function. Similarly, in the iNaturalist model that has many more parameters, added gradient noise degrades the model’s accuracy on the small classes.

Size of the underrepresented class. In all preceding MNIST experiments, we unbalanced the classes with a 12:1 ratio, i.e., we used 500 images of class “8” vs. 6,000 images for the other classes. Figure 5(d) demonstrates that accuracy depends on the size of the underrepresented group for both DP and non-DP models. This effect becomes significant when there are only 50 images of the underrepresented class. Clipping and noise prevent the model from learning this class with ε < 10.

6 Conclusion

Gradient clipping and random noise addition, the core techniques at the heart of differentially private deep learning, disproportionately affect underrepresented and complex classes and subgroups. As a consequence, differentially private SGD has disparate impact: the accuracy of a model trained using DP-SGD tends to decrease more on these classes and subgroups vs. 
the original, non-private model. If the original model is "unfair" in the sense that its accuracy is not the same across all subgroups, DP-SGD exacerbates this unfairness. We demonstrated this effect for several image-classification and natural-language tasks and hope that our results motivate further research on combining fairness and privacy in practical deep learning models.

Acknowledgments. This research was supported in part by the NSF grants 1611770, 1704296, 1700832, and 1642120, the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program, and a Google Faculty Research Award.

References

[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In CCS, 2016.

[2] A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, and H. Wallach. A reductions approach to fair classification. In ICML, 2018.

[3] A. Beutel, J. Chen, Z. Zhao, and E. H. Chi. Data decisions and theoretical implications when adversarially learning fair representations. In FAT/ML, 2017.

[4] S. L. Blodgett, L. Green, and B. O'Connor. Demographic dialectal variation in social media: A case study of African-American English. In EMNLP, 2016.

[5] S. L. Blodgett, J. Wei, and B. O'Connor. Twitter universal dependency parsing for African-American and mainstream American English. In ACL, 2018.

[6] M. Buda, A. Maki, and M. A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018.

[7] J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In FAT*, 2018.

[8] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. JAIR, 16:321–357, 2002.

[9] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. J. Belongie. Class-balanced loss based on effective number of samples. In CVPR, 2019.

[10] R. Cummings, V. Gupta, D. Kimpara, and J. Morgenstern. On the compatibility of privacy and fairness. In FairUMAP, 2019.

[11] G. Douzas and F. Bacao. Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Systems with Applications, 91:464–471, 2018.

[12] C. Dwork. Differential privacy. In Encyclopedia of Cryptography and Security, pages 338–340. Springer, 2011.

[13] C. Dwork. A firm foundation for private data analysis. CACM, 54(1):86–95, 2011.

[14] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.

[15] Y. Elazar and Y. Goldberg. Adversarial removal of demographic attributes from text data. In EMNLP, 2018.

[16] R. C. Geyer, T. Klein, and M. Nabi. Differentially private federated learning: A client level perspective. In NeurIPS, 2018.

[17] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In NIPS, 2016.

[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[19] G. V. Horn, O. M. Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie. The iNaturalist species classification and detection dataset. In CVPR, 2018.

[20] M. Jagielski, M. Kearns, J. Mao, A. Oprea, A. Roth, S. Sharifi-Malvajerdi, and J. Ullman. Differentially private fair learning. In ICML, 2019.

[21] M. Kearns, S. Neel, A. Roth, and Z. S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In ICML, 2018.

[22] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2-3):115–141, 1994.

[23] S. Kuppam, R. Mckenna, D. Pujol, M. Hay, A. Machanavajjhala, and G. Miklau. Fair decision making using privacy-protected data. arXiv:1905.12744, 2019.

[24] H. B. McMahan, G. Andrew, Ú. Erlingsson, S. Chien, I. Mironov, N. Papernot, and P. Kairouz. A general approach to adding differential privacy to iterative training procedures. arXiv:1812.06210, 2018.

[25] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017.

[26] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private recurrent language models. In ICLR, 2018.

[27] M. Merler, N. Ratha, R. S. Feris, and J. R. Smith. Diversity in faces. arXiv:1901.10436, 2019.

[28] I. Mironov. Rényi differential privacy. In CSF, 2017.

[29] M. Mohri, G. Sivek, and A. T. Suresh. Agnostic federated learning. In ICML, 2019.

[30] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens. Adding gradient noise improves learning for very deep networks. arXiv:1511.06807, 2015.

[31] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.

[32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS Workshops, 2017.

[33] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.

[34] https://github.com/pytorch/, 2019. [Online; accessed 14-May-2019].

[35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.

[36] https://github.com/tensorflow/privacy, 2019. [Online; accessed 14-May-2019].

[37] M. Yaghini, B. Kulynych, and C. Troncoso. Disparate vulnerability: On the unfairness of privacy attacks against machine learning. arXiv:1906.00389, 2019.

[38] S. Yeom, I. Giacomelli, M. Fredrikson, and S. Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In CSF, 2018.

[39] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by conditional adversarial autoencoder. In CVPR, 2017.

[40] W. Zhu, P. Kairouz, H. Sun, B. McMahan, and W. Li. Federated heavy hitters discovery with differential privacy. arXiv:1902.08534, 2019.