{"title": "Differentially Private Bagging: Improved utility and cheaper privacy than subsample-and-aggregate", "book": "Advances in Neural Information Processing Systems", "page_first": 4323, "page_last": 4332, "abstract": "Differential Privacy is a popular and well-studied notion of privacy. In the era ofbig data that we are in, privacy concerns are becoming ever more prevalent and thusdifferential privacy is being turned to as one such solution. A popular method forensuring differential privacy of a classifier is known as subsample-and-aggregate,in which the dataset is divided into distinct chunks and a model is learned on eachchunk, after which it is aggregated. This approach allows for easy analysis of themodel on the data and thus differential privacy can be easily applied. In this paper,we extend this approach by dividing the data several times (rather than just once)and learning models on each chunk within each division. The first benefit of thisapproach is the natural improvement of utility by aggregating models trained ona more diverse range of subsets of the data (as demonstrated by the well-knownbagging technique). The second benefit is that, through analysis that we provide inthe paper, we can derive tighter differential privacy guarantees when several queriesare made to this mechanism. In order to derive these guarantees, we introducethe upwards and downwards moments accountants and derive bounds for thesemoments accountants in a data-driven fashion. 
We demonstrate the improvements our model makes over standard subsample-and-aggregate in two datasets (Heart Failure (private) and UCI Adult (public)).", "full_text": "Differentially Private Bagging: Improved utility and cheaper privacy than subsample-and-aggregate\n\nJames Jordon\nUniversity of Oxford\njames.jordon@wolfson.ox.ac.uk\n\nJinsung Yoon\nUniversity of California, Los Angeles\njsyoon0823@g.ucla.edu\n\nMihaela van der Schaar\nUniversity of Cambridge\nUniversity of California, Los Angeles\nAlan Turing Institute\nmv472@cam.ac.uk, mihaela@ee.ucla.edu\n\nAbstract\n\nDifferential Privacy is a popular and well-studied notion of privacy. In the era of big data that we are in, privacy concerns are becoming ever more prevalent and thus differential privacy is being turned to as one such solution. A popular method for ensuring differential privacy of a classifier is known as subsample-and-aggregate, in which the dataset is divided into distinct chunks and a model is learned on each chunk, after which it is aggregated. This approach allows for easy analysis of the model on the data and thus differential privacy can be easily applied. In this paper, we extend this approach by dividing the data several times (rather than just once) and learning models on each chunk within each division. The first benefit of this approach is the natural improvement of utility by aggregating models trained on a more diverse range of subsets of the data (as demonstrated by the well-known bagging technique). The second benefit is that, through analysis that we provide in the paper, we can derive tighter differential privacy guarantees when several queries are made to this mechanism. In order to derive these guarantees, we introduce the upwards and downwards moments accountants and derive bounds for these moments accountants in a data-driven fashion. 
We demonstrate the improvements our model makes over standard subsample-and-aggregate in two datasets (Heart Failure (private) and UCI Adult (public)).\n\n1 Introduction\n\nIn the era of big data that we live in today, privacy concerns are becoming ever more prevalent. It falls to the researchers using the data to ensure that adequate measures are taken to ensure any results that are put into the public domain (such as the parameters of a model learned on the data) do not disclose sensitive attributes of the real data. For example, it is well known that the high capacity of deep neural networks can cause the networks to \"memorize\" training data; if such a network’s parameters were made public, it may be possible to deduce some of the training data that was used to train the model, thus resulting in real data being leaked to the public.\nSeveral attempts have been made at rigorously defining what it means for an algorithm, or an algorithm’s output, to be \"private\". One particularly attractive and well-researched notion is that of differential privacy [1]. Differential privacy is a formal definition that requires that the distribution of the output of a (necessarily probabilistic) algorithm not be too different when a single data point is included in the dataset or not. Typical methods for enforcing differential privacy involve bounding\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n(a) Subsample-and-aggregate\n\n(b) Differentially Private Bagging\n\nFigure 1: A comparison of how the dataset is used in (a) subsample-and-aggregate and (b) our differentially private bagging procedure. 
By partitioning the dataset multiple times we are able to perform a tighter privacy analysis using our personalised moments accountant in addition to learning a better performing underlying classifier.\n\nthe effect that inclusion of a single sample can have on the output and then adding noise (typically Laplacian or Gaussian) proportional to this effect. The most difficult step in this process is in attaining a good bound on the effect of inclusion.\nOne method for bypassing this difficulty is to build a classifier by dividing up the dataset into distinct subsets, training a separate classifier on each chunk, and then aggregating these classifiers. The effect of a single sample is then bounded by the fact that it was used to train exactly one of these models and thus its inclusion or exclusion will affect only that model’s output. By dividing the data into smaller chunks, we learn more models and thus the one model that a sample can affect becomes a smaller \"fraction\" of the overall model, resulting in a smaller effect that any one sample has on the model as a whole. This method is commonly referred to as subsample-and-aggregate [2, 3, 4].\nIn this work, we propose an extension to the subsample-and-aggregate methodology that has similarities with bagging [5]. Fig. 1 depicts the key methodological difference between standard subsample-and-aggregate and our proposed framework, Differentially Private Bagging (DPBag), namely that we partition the dataset many times. 
This multiple-partitioning not only improves utility by building a better predictor, but also enjoys stronger privacy guarantees due to the fact that the effect of adding or removing a single sample can be more tightly bounded within our framework. In order to prove these guarantees, we introduce the personalised moments accountants, which are data-driven variants of the moments accountant [6], that allow us to track the privacy loss with respect to each sample in the dataset and then deduce the final privacy loss by taking the maximum loss over all samples. The personalised moments accountant also lends itself to allowing for personalised differential privacy [7], in which we may wish to allow each individual to specify their own privacy parameters.\nWe demonstrate the efficacy of our model on two classification tasks, demonstrating that our model is an improvement over the standard subsample-and-aggregate algorithm.\n\n2 Related Works\n\nSeveral works have proposed methods for differentially private classification. Of particular interest is the method of [6], in which they propose a method for differentially private training of deep neural networks. In particular, they introduce a new piece of mathematical machinery, the moments accountant. The moments accountant allows for more efficient composition of differentially private mechanisms than either simple or advanced composition [1]. Fortunately, the moments accountant is not exclusive to deep networks and has proven to be useful in other works. In this paper, we use two variants of the moments accountant, which we refer to collectively as the personalised moments accountants. Our algorithm lends itself naturally to being able to derive tighter bounds on these personalised moments accountants than would be possible on the \"global\" moments accountant.\nMost other methods use the subsample-and-aggregate framework (first discussed in [2]) to guarantee differential privacy. 
A popular, recent subsample-and-aggregate method is Private Aggregation of Teacher Ensembles (PATE), proposed in [8]. Their main contribution is to provide a data-driven bound on the moments accountant for a given query to the subsample-and-aggregate mechanism that they claim significantly reduces the privacy cost over the standard data-independent bound. This is further built on in [9] by adding a mechanism that first determines whether a query will be too expensive to answer, only answering those that are sufficiently cheap. Both works use standard subsample-and-aggregate in which the data is partitioned only once. Our method is more fundamental than PATE, in the sense that the techniques used by PATE to improve on subsample-and-aggregate would also be applicable to our differentially private bagging algorithm. The bound they derive in [8] on the moments accountant should translate to our personalised moments accountants in the same way the data-independent bound does (i.e. by multiplying the dependence on the inverse noise scale by a data-driven value) and as such our method would provide privacy improvements over PATE similar to the improvements it provides over standard subsample-and-aggregate. We give an example of our conjectured result for PATE in the Supplementary Materials for clarity.\nAnother method that utilises subsample-and-aggregate is [10], in which they use the distance-to-instability framework [4] combined with subsample-and-aggregate to privately determine whether a query can be answered without adding any noise to it. In cases where the query can be answered, no privacy cost is incurred. Whenever the query cannot be answered, no answer is given but a privacy cost is incurred. 
Unfortunately, the gains to be had by applying our method over basic subsample-and-aggregate to their work are not clear, but we believe that, at the very least, the utility of the answer provided may be improved due to the ensemble having a higher utility in our case (and the privacy guarantees they prove will still hold).\nIn [11], they build a method for learning a differentially private decision tree. Although they apply bagging to their framework, they do not do so to create privacy, but only to improve the utility of their learned classifier. The privacy analysis they provide is performed only on each individual tree and not on the ensemble as a whole.\n\n3 Differential Privacy\n\nLet us denote the feature space by X, the set of possible class labels by C, and write U = X × C. Let us denote by D the collection of all possible datasets consisting of points in U. We will write D to denote a dataset in D, so that D = {u_i}_{i=1}^N = {(x_i, y_i)}_{i=1}^N for some N.\nWe first provide some preliminaries on differential privacy [1] before describing our method; we refer interested readers to [1] for a thorough exposition of differential privacy. We will denote an algorithm by M, which takes as input a dataset D and outputs a value from some output space, R.\nDefinition 1 (Neighboring Datasets [1]). Two datasets D, D′ are said to be neighboring if\n\n∃u ∈ U s.t. D \ {u} = D′ or D′ \ {u} = D.\n\nDefinition 2 (Differential Privacy [1]). 
A randomized algorithm, M, is (ε, δ)-differentially private if for all S ⊂ R and for all neighboring datasets D, D′:\n\nP(M(D) ∈ S) ≤ e^ε P(M(D′) ∈ S) + δ\n\nwhere P is taken with respect to the randomness of M.\nDifferential privacy provides an intuitively understandable notion of privacy - a particular sample’s inclusion or exclusion in the dataset does not change the probability of a particular outcome very much: it does so by a multiplicative factor of e^ε and an additive amount, δ.\n\n4 Differentially Private Bagging\n\nIn order to enforce differential privacy, we must bound the effect of a sample’s inclusion or exclusion on the output of the model. In order to do this, we propose a model for which the maximal effect can be easily deduced and moreover, for which we can actually show a lesser maximal effect by analysing the training procedure and deriving data-driven privacy guarantees.\nWe begin by considering k (random) partitions of the dataset, D^1, ..., D^k with D^i = {D^i_1, ..., D^i_n} for each i, where D^i_j is a set of size ⌊|D|/n⌋ or ⌈|D|/n⌉. We then train a \"teacher\" model, T_ij, on each of these sets (i.e. T_ij is trained on D^i_j). We note that each sample u ∈ D is in precisely one set from each partition and thus in precisely k sets overall; it is therefore used to train k teachers. We collect the indices of the corresponding teachers in the set I(u) = {(i, j) : u ∈ D^i_j} and denote by T(u) = {T_ij : (i, j) ∈ I(u)} the set of teachers trained using the sample u.\nGiven a new sample to classify, x ∈ X, we first compute for each class the number of teachers that output that class, n_c(x) = |{(i, j) : T_ij(x) = c}|. The model then classifies the sample as\n\nĉ(x) = arg max{n_c(x) : c ∈ C}\n\ni.e. by classifying it as the class with the most votes. 
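The multiple-partition construction above (k independent partitions, each split into n disjoint chunks, with I(u) recording the k teachers trained on u) can be sketched as follows; the function name `make_partitions` and the index-based data layout are our illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

def make_partitions(data, k, n, seed=0):
    """Split `data` k independent times, each time into n disjoint chunks.

    Returns `partitions[i][j]` (chunk j of partition i, the training set of
    teacher T_ij) and `membership[u]`, the index set I(u) of the k (i, j)
    pairs whose chunk contains sample u (u is an index into `data`).
    """
    rng = random.Random(seed)
    idx = list(range(len(data)))
    partitions, membership = [], defaultdict(set)
    for i in range(k):
        rng.shuffle(idx)
        chunks = [idx[j::n] for j in range(n)]  # n near-equal disjoint chunks
        partitions.append([[data[t] for t in chunk] for chunk in chunks])
        for j, chunk in enumerate(chunks):
            for t in chunk:
                membership[t].add((i, j))
    return partitions, membership
```

Every sample lands in exactly one chunk per partition, so |I(u)| = k; setting k = 1 recovers the single split of standard subsample-and-aggregate.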
To make the output differentially private, we can add independent Laplacian noise to each of the resulting counts before taking the arg max, so that the classification becomes\n\nc̃_λ(x) = arg max{n_c(x) + Y_c : c ∈ C}\n\nwhere Y_c, c ∈ C are independent Lap(k/λ) random variables and where λ is a hyper-parameter of our model. We scale the noise to the number of partitions because the number of partitions is precisely the total number of teachers that any individual sample can affect. Thus the (naive) bound on the ℓ1-sensitivity of this algorithm is k, giving us the following theorem, which tells us that our differentially private bagging algorithm is at least as private as the standard subsample-and-aggregate mechanism, independent of the number of partitions used.\nTheorem 1. With k partitions and n teachers per partition, c̃_λ is 2λ-differentially private with respect to the data D.\n\nProof. This follows immediately from noting that the ℓ1-sensitivity of n_c(x) is k. See [1].\n\nWe note that the standard subsample-and-aggregate algorithm can be recovered from ours by setting k = 1. In the next section, we will derive tighter bounds on the differential privacy of our bagging algorithm when several queries are made to the classifier.\n\n4.1 Personalised Moments Accountants\n\nIn order to provide tighter differential privacy guarantees for our method, we now introduce the personalised moments accountants. Like the original moments accountant from [6], these will allow us to compose a sequence of differentially private mechanisms more efficiently than using standard or advanced composition [1]. We begin with a preliminary definition (found in [6]).\nDefinition 3 (Privacy Loss and Privacy Loss Random Variable [6]). Let M : D → R be a randomized algorithm, with D and D′ a pair of neighbouring datasets. Let aux be any auxiliary input. 
For any outcome o ∈ R, we define the privacy loss at o to be:\n\nc(o; M, aux, D, D′) = log [ P(M(D, aux) = o) / P(M(D′, aux) = o) ]\n\nwith the privacy loss random variable, C, being defined by\n\nC(M, aux, D, D′) = c(M(D, aux); M, aux, D, D′)\n\ni.e. the random variable defined by evaluating the privacy loss at a sample from M(D, aux).\nIn defining the moments accountant, an intermediate quantity, referred to by [6] as the \"l-th moment\", is introduced. We divide the definition of this l-th moment into a downwards and an upwards version (corresponding to whether D′ is obtained by removing an element from D or adding an element to D, respectively). We do this because the upwards moments accountant must be bounded over all possible points u ∈ U that could be added, whereas the downwards moments accountant need only consider the points that are already in D.\nDefinition 4. Let D be some dataset and let u ∈ D. Let aux be any auxiliary input. Then the downwards moments accountant is given by\n\nα̌_M(l; aux, D, u) = log E(exp(l C(M, aux, D, D \ {u}))).\n\nDefinition 5. Let D be some dataset. Then the upwards moments accountant is defined as\n\nα̂_M(l; aux, D) = max_{u ∈ U} log E(exp(l C(M, aux, D, D ∪ {u}))).\n\nWe can recover the original moments accountant from [6], α_M(l), as\n\nα_M(l) = max_{aux, D} { α̂_M(l; aux, D), max_u α̌_M(l; aux, D, u) }.    (1)\n\nWe will use this fact, together with the two theorems in the following subsection, to calculate the final global privacy loss of our mechanism.\n\n4.2 Results inherited from the Moments Accountant\n\nThe following two theorems state two properties that our personalised moments accountants share with the original moments accountant. Note that the composability in Theorem 2 is being applied to each personalised moments accountant individually.\nTheorem 2 (Composability). 
Suppose that an algorithm M consists of a sequence of adaptive algorithms (i.e. algorithms that take as auxiliary input the outputs of the previous algorithms) M_1, ..., M_m where M_i : ∏_{j=1}^{i−1} R_j × D → R_i. Then, for any l\n\nα̌_M(l; D, u) ≤ Σ_{i=1}^m α̌_{M_i}(l; D, u)\n\nand\n\nα̂_M(l; D) ≤ Σ_{i=1}^m α̂_{M_i}(l; D).\n\nProof. The statement of this theorem is a variation on Theorem 2 from [6], applied to the personalised moments accountants. Their proof involves proving this stronger result. See [6], Theorem 2 proof.\nTheorem 3 ((ε, δ) from α(l) [6]). Let δ > 0. Any mechanism M is (ε, δ)-differentially private for\n\nε = min_l [ α_M(l) + log(1/δ) ] / l    (2)\n\nProof. See [6], Theorem 2.\n\nTheorem 2 means that bounding each personalised moments accountant individually could provide a significant improvement on the overall bound for the moments accountant. Combined with Eq. 1, we can first sum over successive steps of the algorithm and then take the maximum. In contrast, original approaches that bound only the overall moments accountant at each step essentially compute\n\nα_M(l) = Σ_{i=1}^m max_{aux, D} { α̂_{M_i}(l; aux, D), max_u α̌_{M_i}(l; aux, D, u) }    (3)\n\nOur approach of bounding the personalised moments accountant allows us to compute the bound as\n\nα_M(l) = max_{aux, D} { Σ_{i=1}^m α̂_{M_i}(l; aux, D), max_u Σ_{i=1}^m α̌_{M_i}(l; aux, D, u) }    (4)\n\nwhich is strictly smaller whenever there is not some personalised moments accountant that is always larger than all other personalised moments accountants. 
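Concretely, Theorem 2's additive composition and Theorem 3's conversion to (ε, δ) can be sketched as below. The per-query moment bound 2λ²m²l(l + 1) uses the data-dependent factor m derived in Section 4.3 (m = 1 recovers the naive bound); the function names and the fixed moment range L are our assumptions for illustration.

```python
import math

def eps_from_moments(alpha, delta):
    """Theorem 3: eps = min_l (alpha_M(l) + log(1/delta)) / l,
    where alpha[l-1] bounds the l-th moment, l = 1..L."""
    return min((a + math.log(1.0 / delta)) / l
               for l, a in enumerate(alpha, start=1))

def total_eps(per_query_m, lam, delta, L=32):
    """Theorem 2: the l-th moments simply add across queries, so the
    composed bound is sum_q 2 * lam**2 * m_q**2 * l * (l + 1);
    Theorem 3 then converts the composed moments into epsilon."""
    s = sum(m ** 2 for m in per_query_m)
    alpha = [2 * lam ** 2 * s * l * (l + 1) for l in range(1, L + 1)]
    return eps_from_moments(alpha, delta)
```

Whenever the per-query m values are strictly below 1, the composed ε is strictly smaller than under the data-independent bound, mirroring the comparison between Eq. 3 and Eq. 4.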
The bounds we derive in the following subsection and the subsequent remarks will make clear why this is an unlikely scenario.\n\n4.3 Bounding the Personalised Moments Accountants\n\nHaving defined the personalised moments accountants, we can now state our main theorems, which provide a data-dependent bound on the personalised moments accountant for a single query to c̃_λ.\nTheorem 4 (Downwards bound). Let x_new ∈ X be a new point to classify. For each c ∈ C and each u ∈ D, define the quantities\n\nn_c(x_new; u) = |{(i, j) ∈ I(u) : T_ij(x_new) = c}| / k\n\ni.e. n_c(x_new; u) is the fraction of teachers that were trained on a dataset containing u that output class c when classifying x_new. Let\n\nm(x_new; u) = max_c {1 − n_c(x_new; u)}.\n\nThen\n\nα̌_{c̃_λ(x_new)}(l; D, u) ≤ 2λ²m(x_new; u)²l(l + 1).    (5)\n\nProof. (Sketch.) The theorem follows from the fact that m(x_new; u) is the maximum change that can occur in the vote fractions, n_c, c ∈ C, when the sample u is removed from the training of each model in T(u), corresponding to all teachers that were not already voting for the minority class switching their vote to the minority class. m can thus be thought of as the personalised ℓ1-sensitivity of a specific query to our algorithm, and so the standard sensitivity-based argument gives us that c̃_λ(x_new) is 2λm(x_new; u)-differentially private with respect to removing u. The bound on the (downwards) moments accountant then follows using a similar argument to the proof of Prop. 3.3 in [12].\n\nTo prove the upwards bound, we must understand what happens when we add a point to our training data - which is that it will be added to a training set for precisely 1 teacher in each of the k partitions. Each dataset in a partition will either be of size ⌈|D|/n⌉ or ⌊|D|/n⌋. 
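The quantities in Theorem 4 can be read off directly from the teacher votes; a minimal sketch, where `preds` (mapping a teacher index pair (i, j) to its predicted class) and `I_u` (the index set I(u)) are hypothetical interfaces:

```python
def personal_m(preds, I_u, classes, k):
    """m(x_new; u) = max_c {1 - n_c(x_new; u)}, where n_c(x_new; u) is
    the fraction of the k teachers trained on u voting for class c."""
    n = {c: sum(preds[ij] == c for ij in I_u) / k for c in classes}
    return max(1.0 - n[c] for c in classes)
```

If the teachers in T(u) split evenly between two classes, m = 0.5 and the per-query moment bound 2λ²m²l(l + 1) is a quarter of the naive one; if some class receives no votes from T(u), m = 1 and nothing is gained on that query.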
We assume (without loss of generality) that a new point is added to the first dataset in each partition that contains ⌊|D|/n⌋ samples. We collect the indices of these datasets in I(∗) and denote the set of teachers trained on these subsets by T(∗).\nTheorem 5 (Upwards bound). Let x_new ∈ X be a new point to classify. For each c ∈ C, define the quantity\n\nn_c(x_new; ∗) = |{(i, j) ∈ I(∗) : T_ij(x_new) = c}| / k\n\ni.e. n_c(x_new; ∗) is the fraction of teachers whose training set would receive the new point that output class c when classifying x_new. Let\n\nm(x_new; ∗) = max_c {1 − n_c(x_new; ∗)}.\n\nThen\n\nα̂_{c̃_λ(x_new)}(l; D) ≤ 2λ²m(x_new; ∗)²l(l + 1).    (6)\n\nProof. The proof is exactly as for Theorem 4, replacing I(u) and T(u) with I(∗) and T(∗).\n\nThe standard bound on the moments accountant of a 2λ-differentially private algorithm is 2λ²l(l + 1) (see [12]). Thus, our theorems introduce a factor of m(x_new; u)². Note that by definition m ≤ 1 and thus our bound is always at least as tight, and in general tighter. It should be noted, however, that for a single query, this bound may not improve on the naive 2λ²l(l + 1) bound, since in that case equations 3 and 4 are equal. If there is any training sample u ∈ D ∪ {∗} and any class c ∈ C for which all teachers in T(u) classify x_new as some class other than c then m(x_new; u) = 1. However, over the course of several queries, it is unlikely that each set of teachers T(u) always excludes some class, and as such the total bound according to Theorems 2, 4 and 5 is lower than if we just used the naive bound. In the case of binary classification, for example, the bounds are only the same if there is some set of teachers that is always unanimous when classifying new samples.\nRemarks. 
(i) m(x_new; u) is smallest when the teachers in T(u) are divided evenly among the classes when classifying x_new; this is intuitive because in such a situation, u provides very little information about how to classify x_new and thus little is leaked about u when we classify x_new.\n(ii) m(u) is bounded below by 1 − 1/|C| and so our method will provide the biggest improvements for binary classification, with the improvements decaying as the number of classes increases.\n(iii) When k = 1, m(u) is always 1 because n_c is 1 for some c ∈ C and 0 for all remaining classes, and from this we recover the standard bound of 2λ²l(l + 1) used for subsample-and-aggregate.\n(iv) For Eq. 3 and 4 to be equal, there must exist some u∗ for which m(x_new; u∗) > m(x_new; u) for all u and x_new. This amounts to there being some set of teachers (corresponding to u∗) that is in more agreement than every other set of teachers for every new point it is asked to classify. Other than in this unlikely scenario, Eq. 4 will be strictly smaller than Eq. 3.\n\n4.4 Semi-supervised knowledge transfer\n\nWe now discuss how best to leverage the fact that the best gains from our approach come from answering several queries (as implied by equations 3 and 4). We first note that the vanilla subsample-and-aggregate method does not derive data-dependent privacy guarantees for an individual query, and thus, for a fixed ε and δ, the number of queries that can be answered by the mechanism is known in advance. 
In contrast, because our data-driven bounds on the personalised moments accountants depend on the queries themselves, the cost of any given query is not known in advance and as such the number of queries we can answer before using up our privacy allowance (ε) is unknown.\nUnfortunately, we cannot simply answer queries until the allowance is used up, because the number of queries that we answer is a function of the data itself and thus we would need to introduce a differentially private mechanism for determining when to stop (such as calculating ε and δ after each query using smooth sensitivity, as proposed in [8]). Instead, we follow [8] and leverage the fact that we can answer more queries than standard subsample-and-aggregate to train a student model using unlabelled public data. The final output of our algorithm will then be a trained classifier that can be queried indefinitely. To train this model, we take unlabelled public data P = {x̃_1, x̃_2, ...} and label it using c̃_λ until the privacy allowance has been used up. This will result in a (privately) labelled dataset P̃ = {(x̃_1, y_1), ..., (x̃_p, y_p)} where p is the number of queries answered. We train a student model, S, on this dataset and the resultant classifier can now be used to answer any future queries. Because of our data-driven bound on the personalised moments accountants, we will typically have that p > q, where q is the number of queries that can be answered by a standard subsample-and-aggregate procedure. The pseudo-code for learning a differentially private student classifier using our differentially private bagging model is given in Algorithm 1 (pseudo-code for training a student model using standard subsample-and-aggregate is given in the Supplementary Materials for comparison). 
Note that the majority of for loops (in particular the one on line 18) can be parallelized.\n\nAlgorithm 1 Semi-supervised differentially private knowledge transfer using multiple partitions\n1: Input: ε, δ, D, batch size n_mb, number of partitions k, number of teachers per partition n, noise size λ, maximum order of moments to be explored L, unlabelled public data D_pub\n2: Initialize: {θ_T^{i,j}}_{i=1,j=1}^{n,k}, θ_S, ε̂ = 0, α(l; x) = 0 for l = 1, ..., L, x ∈ D ∪ {∗}\n3: Create k partitions of the dataset, each made up of n disjoint subsets of the data D_{i,j}, i = 1, ..., n, j = 1, ..., k, such that ∪_i D_{i,j} = D and D_{i1,j} ∩ D_{i2,j} = ∅ for all i1 ≠ i2, j\n4: Set I(∗) = {(n, 1), ..., (n, k)}\n5: while Teachers have not converged do\n6:   for i = 1, ..., n do\n7:     for j = 1, ..., k do\n8:       Sample (x_1, y_1), ..., (x_{n_mb}, y_{n_mb}) i.i.d. ∼ D_{i,j}\n9:       Update teacher, T_{i,j}, using SGD\n10:        ∇_{θ_T^{i,j}} −[ Σ_{s=1}^{n_mb} Σ_{c∈C} y_{s,c} log(T_{i,j}^c(x_s)) ] (multi-task cross-entropy loss)\n11: while ε̂ < ε do\n12:   Sample x_1, ..., x_{n_mb} ∼ D_pub\n13:   for s = 1, ..., n_mb do\n14:     r_s ← c̃_λ(x_s)\n15:     Update the element-wise moments accountants:\n16:       n_c ← |{(i, j) : T_{i,j}(x_s) = c}| / k for c ∈ C\n17:       n_c(x) ← |{(i, j) ∈ I(x) : T_{i,j}(x_s) = c}| / k for c ∈ C, x ∈ D ∪ {∗}\n18:       for x ∈ D ∪ {∗} do\n19:         m(x) ← max_c {1 − n_c(x)}\n20:         for l = 1, ..., L do\n21:           α(l; x) ← α(l; x) + 2λ²m(x)²l(l + 1)\n22:   Update the student, S, using SGD\n23:     ∇_{θ_S} −Σ_{s=1}^{n_mb} Σ_{c∈C} r_{s,c} log S^c(x_s) (multi-task cross-entropy loss)\n24:   ε̂ ← min_l max_x [ α(l; x) + log(1/δ) ] / l\n25: Output: S\n\nTheorem 6. 
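A compact executable sketch of Algorithm 1's labelling loop (teacher training omitted): answer each public query with the Laplace-noised vote, add 2λ²m(x)²l(l + 1) to every personalised accountant α(l; x), and stop once the implied ε̂ reaches the budget. The `teachers`/`index_sets` interfaces and the NumPy vectorisation are our assumptions, not the authors' code.

```python
import math
import numpy as np

def label_public_data(teachers, index_sets, public_x, eps, delta, k, lam, L=32, seed=0):
    """Label public points until the privacy budget eps is spent.

    teachers:   dict mapping (i, j) -> callable classifier T_ij
    index_sets: dict mapping each x in D (plus '*') -> its index set I(x)
    Returns the (privately) labelled pairs gathered before eps-hat >= eps.
    """
    rng = np.random.default_rng(seed)
    ls = np.arange(1, L + 1)
    alpha = {x: np.zeros(L) for x in index_sets}      # alpha(l; x), l = 1..L
    labelled = []
    for xq in public_x:
        preds = {ij: t(xq) for ij, t in teachers.items()}
        classes = sorted(set(preds.values()))
        counts = np.array([sum(p == c for p in preds.values()) for c in classes], float)
        noisy = counts + rng.laplace(scale=k / lam, size=len(classes))
        labelled.append((xq, classes[int(np.argmax(noisy))]))
        for x, I_x in index_sets.items():             # personalised accountants
            m = max(1.0 - sum(preds[ij] == c for ij in I_x) / k for c in classes)
            alpha[x] += 2 * lam ** 2 * m ** 2 * ls * (ls + 1)
        worst = np.max(np.stack(list(alpha.values())), axis=0)
        if np.min((worst + math.log(1.0 / delta)) / ls) >= eps:
            break
    return labelled
```

When the teachers trained on each x split their votes, m(x) < 1 and the accountants grow slowly, so more public points get labelled for the same (ε, δ) than under the data-independent analysis.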
The output of Algorithm 1 is (ε, δ)-differentially private with respect to D.\n\nProof. This follows from Theorems 2, 3, 4 and 5.\n\n5 Experiments\n\nIn this section we compare our method (DPBag) against the standard subsample-and-aggregate framework (SAA) to illustrate the improvements that can be achieved at a fundamental level by using our model. Additionally, we compare against our method without the improved privacy bound (DPBag-) to quantify the improvements that are due to the bagging procedure and those that are due to our improved privacy bound. We perform the experiments on two real-world datasets: Heart Failure and UCI Adult (dataset description and results for UCI Adult can be found in the Supplementary Materials). An implementation of DPBag can be found at https://bitbucket.org/mvdschaar/mlforhealthlabpub/src/master/alg/dpbag/.\nHeart Failure dataset: The Heart Failure dataset is a private dataset consisting of 24175 patients who have suffered heart failure. We set the label of each patient as 3-year all-cause mortality, excluding all patients who are censored before 3 years. The total number of features is 29. Among the 24175 patients, 10387 (43.0%) die within 3 years. We randomly divide the data into 3 disjoint subsets: (1) a training set (33%), (2) public data (33%), and (3) a testing set (33%). In the main paper, we use logistic regression for the teacher and student models in both algorithms; additional results for the Gradient Boosting Method (GBM) can be found in the Supplementary Materials. We set δ = 10^−5. We vary ε ∈ {1, 3, 5}, n ∈ {50, 100, 250} and k ∈ {10, 50, 100}. In all cases we set λ = 2/n. To save space, we report DPBag results for n ∈ {100, 250}, k ∈ {50, 100} and SAA results for n = 250 (the best performing) in the main manuscript, with full tables reported in the Supplementary Materials. 
Results reported are the mean of 10 runs of each experiment.\n\n5.1 Results\n\nIn Table 1 we report the accuracy, AUROC and AUPRC of the 3 methods and we also report these for a non-privately trained baseline model (NPB), allowing us to quantify how much has been \"lost due to privacy\". In Table 2, we report the total number of queries that could be made to each differentially private classifier before the privacy budget was used up.\nIn Table 1 we see that DPBag outperforms standard SAA for all values of ε, with Table 2 showing that our method allows for a significant increase in the number of public samples that can be labelled (almost 100% more for ε = 3).\nThe optimal number of teachers, n, varies with ε, for both DPBag and SAA. We see that for ε = 1, n = 250 performs best, but as we increase ε the optimal number of teachers decreases. For small ε and small n, very few public samples can be labelled and so the student does not have enough data to learn from. On the other hand, for large ε and large n, the number of answered queries is much larger, to the point where now the limiting factor is not the number of labels but is instead the quality of the labels. Since we scale the noise to the number of teachers, the label quality improves with fewer teachers because each teacher is trained on a larger portion of the training data. This is reflected by both DPBag and SAA. 
In the SAA results, the performance does not saturate as quickly with respect to ε because the number of queries that a given ε corresponds to is smaller for SAA than for DPBag.

As expected, we see that DPBag- sits between SAA and DPBag: it enjoys performance gains due to a stronger underlying model, and thus more accurately labelled training samples for the student, but the improved privacy bound of DPBag allows more samples to be labelled and thus further gains are still made.

Table 2 also sheds light on the behavior of DPBag with respect to k. We see in Table 1 that both k = 50 and k = 100 can provide the best performance (depending on n and ε). In Table 2, the number of queries that can be answered increases with k. This implies that, as expected, as we increase k the quantity m(u) gets closer to 0.5, and so each query costs less. However, when m(u) is close to 0.5 for all samples u in the dataset, neither class will have a clear majority and thus the labels are more susceptible to flipping due to the noise added. k = 50 appears to balance this trade-off when ε is larger (and so we can already answer more queries); when ε is smaller we see that answering more queries is more important than answering them well, so k = 100 is preferred.

Table 1: Prediction performance (Accuracy, AUROC, AUPRC) of DPBag and SAA with δ = 10⁻⁵ on the Heart Failure dataset using Logistic Regression. Bold indicates the best performance achieved for the given metric and fixed ε. DPBag- is our method without the improved privacy analysis.
NPB is a non-private baseline model, included to indicate an upper bound on our performance.

Model  | n   | k   | Accuracy              | AUROC                 | AUPRC
       |     |     | ε=1   ε=3   ε=5       | ε=1   ε=3   ε=5       | ε=1   ε=3   ε=5
DPBag  | 100 | 50  | .5639 .6085 .6154     | .5547 .6326 .6453     | .4793 .5530 .5656
DPBag  | 100 | 100 | .5667 .6050 .6142     | .5626 .6295 .6448     | .4895 .5496 .5652
DPBag  | 250 | 50  | .5888 .6061 .6099     | .5954 .6320 .6391     | .5161 .5526 .5607
DPBag  | 250 | 100 | .5986 .6077 .6091     | .6096 .6373 .6398     | .5289 .5542 .5644
DPBag- | 100 | 50  | .5614 .6019 .6128     | .5544 .6288 .6429     | .4792 .5411 .5612
DPBag- | 100 | 100 | .5596 .6007 .6108     | .5609 .6174 .6354     | .4767 .5338 .5525
DPBag- | 250 | 50  | .5855 .6051 .6086     | .5896 .6295 .6366     | .5093 .5498 .5565
DPBag- | 250 | 100 | .5875 .6061 .6110     | .5884 .6321 .6407     | .5103 .5518 .5615
SAA    | 250 | -   | .5798 .6019 .6024     | .5778 .6284 .6356     | .5023 .5496 .5559
NPB    | -   | -   | .6527 (all ε)         | .6992 (all ε)         | .6281 (all ε)

Table 2: Number of labels provided by each method before the privacy budget, ε, is used up on the Heart Failure dataset. Note that DPBag- and SAA have the same, non-data-dependent privacy analysis and so provide the same number of labels as each other.

Model | n   | k   | ε=1 | ε=3  | ε=5
DPBag | 100 | 50  |  74 |  593 | 1487
DPBag | 100 | 100 |  76 |  609 | 1538
DPBag | 250 | 50  | 468 | 3785 | 6380
DPBag | 250 | 100 | 507 | 4044 | 6805
SAA   | 250 | -   | 264 | 2108 | 5269

6 Discussion

In this work, we introduced a new methodology for developing a differentially private classifier. Building on the ideas of subsample-and-aggregate, we divide the dataset several times, allowing us to derive (tighter) data-dependent bounds on the privacy cost of a query to our mechanism.
To do so, we defined the personalised moments accountants, which we use to accumulate the privacy loss of a query with respect to each sample in the dataset (and any potentially added sample) individually.

A key advantage of our model, like subsample-and-aggregate, is that it is model-agnostic: it can be applied using any base learner, with the differential privacy guarantees holding regardless of the learner used.

We believe this work opens up several interesting avenues for future research: (i) the privacy guarantees could potentially be improved by making assumptions about the base learners used; (ii) the personalised moments accountants naturally allow for the development of an algorithm that affords each sample a different level of differential privacy, i.e. personalised differential privacy [7]; (iii) we believe bounds such as those derived in [8] and [9] that rely on the subsample-and-aggregate method will have natural analogs with respect to our bagging procedure, corresponding to tighter bounds on the personalised moments accountants than can be shown for the global moments accountant using simple subsample-and-aggregate (see the discussion in the Supplementary Materials).

Acknowledgments

This work was supported by the National Science Foundation (NSF grants 1462245 and 1533983), and the US Office of Naval Research (ONR).

References

[1] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

[2] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 75–84. ACM, 2007.

[3] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In STOC, volume 9, pages 371–380, 2009.

[4] Abhradeep Guha Thakurta and Adam Smith.
Differentially private feature selection via stability arguments, and the robustness of the lasso. In Conference on Learning Theory, pages 819–850, 2013.

[5] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

[6] Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.

[7] Zach Jorgensen, Ting Yu, and Graham Cormode. Conservative or liberal? Personalized differential privacy. In 2015 IEEE 31st International Conference on Data Engineering, pages 1023–1034. IEEE, 2015.

[8] Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755, 2016.

[9] Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. Scalable private learning with PATE. arXiv preprint arXiv:1802.08908, 2018.

[10] Raef Bassily, Om Thakkar, and Abhradeep Thakurta. Model-agnostic private learning via stability. arXiv preprint arXiv:1803.05101, 2018.

[11] Xiaoqian Liu, Qianmu Li, Tao Li, and Dong Chen. Differentially private classification with decision tree ensemble. Applied Soft Computing, 62:807–816, 2018.

[12] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658.
Springer, 2016.