{"title": "Differentially Private Learning of Structured Discrete Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 2566, "page_last": 2574, "abstract": "We investigate the problem of learning an unknown probability distribution over a discrete population from random samples. Our goal is to design efficient algorithms that simultaneously achieve low error in total variation norm while guaranteeing Differential Privacy to the individuals of the population.We describe a general approach that yields near sample-optimal and computationally efficient differentially private estimators for a wide range of well-studied and natural distribution families. Our theoretical results show that for a wide variety of structured distributions there exist private estimation algorithms that are nearly as efficient - both in terms of sample size and running time - as their non-private counterparts. We complement our theoretical guarantees with an experimental evaluation. Our experiments illustrate the speed and accuracy of our private estimators on both synthetic mixture models and a large public data set.", "full_text": "Differentially Private Learning\n\nof Structured Discrete Distributions\n\nIlias Diakonikolas\u2217\nUniversity of Edinburgh\n\nMoritz Hardt\nGoogle Research\n\nLudwig Schmidt\n\nMIT\n\nAbstract\n\nWe investigate the problem of learning an unknown probability distribution over\na discrete population from random samples. Our goal is to design ef\ufb01cient algo-\nrithms that simultaneously achieve low error in total variation norm while guaran-\nteeing Differential Privacy to the individuals of the population.\nWe describe a general approach that yields near sample-optimal and computation-\nally ef\ufb01cient differentially private estimators for a wide range of well-studied and\nnatural distribution families. 
Our theoretical results show that for a wide variety\nof structured distributions there exist private estimation algorithms that are nearly\nas ef\ufb01cient\u2014both in terms of sample size and running time\u2014as their non-private\ncounterparts. We complement our theoretical guarantees with an experimental\nevaluation. Our experiments illustrate the speed and accuracy of our private esti-\nmators on both synthetic mixture models and a large public data set.\n\n1\n\nIntroduction\n\nThe majority of available data in modern machine learning applications come in a raw and unlabeled\nform. An important class of unlabeled data is naturally modeled as samples from a probability\ndistribution over a very large discrete domain. Such data occurs in almost every setting imaginable\u2014\n\ufb01nancial transactions, seismic measurements, neurobiological data, sensor networks, and network\ntraf\ufb01c records, to name a few. A classical problem in this context is that of density estimation or\ndistribution learning: Given a number of iid samples from an unknown target distribution, we want\nto compute an accurate approximation of the distribution. Statistical and computational ef\ufb01ciency\nare the primary performance criteria for a distribution learning algorithm. More speci\ufb01cally, we\nwould like to design an algorithm whose sample size requirements are information-theoretically\noptimal, and whose running time is nearly linear in its sample size.\nBeyond computational and statistical ef\ufb01ciency, however, data analysts typically have a variety\nof additional criteria they must balance. In particular, data providers often need to maintain the\nanonymity and privacy of those individuals whose information was collected. How can we reveal\nuseful statistics about a population, while still preserving the privacy of individuals? 
In this paper,\nwe study the problem of density estimation in the presence of privacy constraints, focusing on the\nnotion of differential privacy [1].\n\nOur contributions. Our main \ufb01ndings suggest that the marginal cost of ensuring differential pri-\nvacy in the context of distribution learning is only moderate.\nIn particular, for a broad class of\nshape-constrained density estimation problems, we give private estimation algorithms that are nearly\nas ef\ufb01cient\u2014both in terms of sample size and running time\u2014as a nearly optimal non-private base-\nline. As our learning algorithm approximates the underlying distribution up to small error in total\nvariation norm, all crucial properties of the underlying distribution are preserved. In particular, the\nanalyst is free to compose our learning algorithm with an arbitrary non-private analysis.\n\n\u2217The authors are listed in alphabetical order.\n\n1\n\n\fOur strong positive results apply to all distribution families that can be well-approximated by piece-\nwise polynomial distributions, extending a recent line of work [2, 3, 4] to the differentially private\nsetting. This is a rich class of distributions including several natural mixture models, log-concave\ndistributions, and monotone distributions amongst many other examples. Our algorithm is agnos-\ntic so that even if the unknown distribution does not conform exactly to any of these distribution\nfamilies, it continues to \ufb01nd a good approximation.\nThese surprising positive results stand in sharp contrast with a long line of worst-case hardness\nresults and lower bounds in differential privacy, which show separations between private and non-\nprivate learning in various settings.\nComplementing our theoretical guarantees, we present a novel heuristic method to achieve empiri-\ncally strong performance. 
Our heuristic always guarantees privacy and typically converges rapidly.\nWe show on various data sets that our method scales easily to input sizes that were previously\nprohibitive for any implemented differentially private algorithm. At the same time, the algorithm\napproaches the estimation error of the best known non-private method for a suf\ufb01ciently large number\nof samples.\n\nTechnical overview. We brie\ufb02y introduce a standard model of learning an unknown probability\ndistribution from samples (namely, that of [5]), which is essentially equivalent to the minimax rate\nof convergence in (cid:96)1-distance [6]. A distribution learning problem is de\ufb01ned by a class C of distri-\nbutions. The algorithm has access to independent samples from an unknown distribution p, and its\ngoal is to output a hypothesis distribution h that is \u201cclose\u201d to p. We measure the closeness between\ndistributions in total variation distance, which is equivalent to the (cid:96)1-distance and sometimes also\ncalled statistical distance. In the \u201cnoiseless\u201d setting, we are promised that p \u2208 C, and the goal is\nto construct a hypothesis h such that (with high probability) the total variation distance dTV (h, p)\nbetween h and p is at most \u03b1, where \u03b1 > 0 is the accuracy parameter.\nThe more challenging \u201cnoisy\u201d or agnostic model captures the situation of having arbitrary (or even\nadversarial) noise in the data. In this setting, we do not make any assumptions about the target distri-\nbution p and the goal is to \ufb01nd a hypothesis h that is almost as accurate as the \u201cbest\u201d approximation\nof p by any distribution in C. 
Formally, given sample access to a (potentially arbitrary) target dis-\ntribution p and \u03b1 > 0, the goal of an agnostic learning algorithm for C is to compute a hypothesis\ndistribution h such that dTV (h, p) \u2264 C \u00b7 optC(p) + \u03b1, where optC(p) is the total variation distance\nbetween p and the closest distribution to it in C, and C \u2265 1 is a universal constant.\nIt is a folklore fact that learning an arbitrary discrete distribution over a domain of size N to constant\naccuracy requires \u2126(N ) samples and running time. The underlying algorithm is straightforward:\noutput the empirical distribution. For distributions over very large domains, a linear dependence\non N is of course impractical, and one might hope that drastically better results can be obtained\nfor most natural settings. Indeed, there are many natural and fundamental distribution estimation\nproblems where signi\ufb01cant improvements are possible. Consider for example the class of all uni-\nmodal distributions over [N ]. In sharp contrast to the \u2126(N ) lower bound for the unrestricted case,\nan algorithm of Birg\u00e9 [7] is known to learn any unimodal distribution over [N ] with running time\nand sample complexity of O(log(N )).\nThe starting point of our work is a recent technique [3, 8, 4] for learning univariate distributions\nvia piecewise polynomial approximation. Our \ufb01rst main contribution is a generalization of this\ntechnique to the setting of approximate differential privacy. To achieve this result, we exploit a con-\nnection between structured distribution learning and private \u201cKolmogorov approximations\u201d. 
More specifically, we show in Section 3 that, for the class of structured distributions we consider, a private algorithm that approximates an input histogram in the Kolmogorov distance, combined with the algorithmic framework of [4], yields sample- and computationally efficient private learners under the total variation distance.\nOur approach crucially exploits the structure of the underlying distributions, as the Kolmogorov distance is a much weaker metric than the total variation distance. Combined with a recent private algorithm [9], we obtain differentially private learners for a wide range of structured distributions over [N]. The sample complexity of our algorithms matches their non-private analogues up to a standard dependence on the privacy parameters and a multiplicative factor of at most O(2^(log\u2217 N)), where log\u2217 denotes the iterated logarithm function. The running time of our algorithm is nearly-linear in the sample size and logarithmic in the domain size.\n\nRelated Work. There is a long history of research in statistics on estimating structured families of distributions going back to the 1950\u2019s [10, 11, 12, 13], and it is still a very active research area [14, 15, 16]. Theoretical computer scientists have also studied these problems with an explicit focus on computational efficiency [5, 17, 18, 19, 3]. In statistics, the study of inference questions under privacy constraints goes back to the classical work of Warner [20]. Recently, Duchi et al. [21, 22] study the trade-off between statistical efficiency and privacy in a local model of privacy, obtaining sample complexity bounds for basic inference problems. We work in the non-local model and our focus is on both statistical and computational efficiency.\nThere is a large literature on answering so-called \u201crange queries\u201d or \u201cthreshold queries\u201d over an ordered domain subject to differential privacy. 
See, for example, [23] as well as the recent work [24] and many references therein. If the output of the algorithm represents a histogram over the domain that is accurate on all such queries, then this task is equivalent to approximating a sample in Kolmogorov distance, which is the task we consider. Apart from the work of Beimel et al. [25] and Bun et al. [9], to the best of our knowledge all algorithms in this literature (e.g., [23, 24]) have a running time that depends polynomially on the domain size N. Moreover, except for the aforementioned works, we know of no other algorithm that achieves a sub-logarithmic dependence on the domain size in its approximation guarantee. Of all algorithms in this area, we believe that ours is the first implemented algorithm that scales to very large domains with strong empirical performance, as we demonstrate in Section 5.\n\n2 Preliminaries\n\nNotation and basic definitions. For N \u2208 Z+, we write [N] to denote the set {1, . . . , N}. The \u21131-norm of a vector v \u2208 RN (or equivalently, a function from [N] to R) is \u2016v\u20161 = \u2211_{i=1}^N |v_i|. For a discrete probability distribution p : [N] \u2192 [0, 1], we write p(i) to denote the probability of element i \u2208 [N] under p. For a subset of the domain S \u2286 [N], we write p(S) to denote \u2211_{i\u2208S} p(i). The total variation distance between two distributions p and q over [N] is dTV(p, q) := max_{S\u2286[N]} |p(S) \u2212 q(S)| = (1/2) \u00b7 \u2016p \u2212 q\u20161. The Kolmogorov distance between p and q is defined as dK(p, q) := max_{j\u2208[N]} |\u2211_{i=1}^j p(i) \u2212 \u2211_{i=1}^j q(i)|. Note that dK(p, q) \u2264 dTV(p, q). Given a set S of n independent samples s1, . . . , sn drawn from a distribution p : [N] \u2192 [0, 1], the empirical distribution p\u0302n : [N] \u2192 [0, 1] is defined as follows: for all i \u2208 [N], p\u0302n(i) = |{j \u2208 [n] | sj = i}| / n.\nDefinition 1 (Distribution Learning). Let C be a family of distributions over a domain \u2126. Given sample access to an unknown distribution p over \u2126 and 0 < \u03b1, \u03b2 < 1, the goal of an (\u03b1, \u03b2)-agnostic learning algorithm for C is to compute a hypothesis distribution h such that with probability at least 1 \u2212 \u03b2 it holds that dTV(h, p) \u2264 C \u00b7 optC(p) + \u03b1, where optC(p) := inf_{q\u2208C} dTV(q, p) and C \u2265 1 is a universal constant.\nDifferential Privacy. A database D \u2208 [N]^n is an n-tuple of items from [N]. Given a database D = (d1, . . . , dn), we let hist(D) denote the normalized histogram corresponding to D. That is, hist(D) = (1/n) \u2211_{i=1}^n e_{d_i}, where e_j denotes the j-th standard basis vector in RN.\nDefinition 2 (Differential Privacy). A randomized algorithm M : [N]^n \u2192 R (where R is some arbitrary range) is (\u03b5, \u03b4)-differentially private if for all pairs of inputs D, D\u2032 \u2208 [N]^n differing in only one entry, we have that for all subsets of the range S \u2286 R, the algorithm satisfies:\n\nPr[M(D) \u2208 S] \u2264 exp(\u03b5) Pr[M(D\u2032) \u2208 S] + \u03b4.\n\nIn the context of private distribution learning, the database D is the sample set S from the unknown target distribution p. In this case, the normalized histogram corresponding to D is the same as the empirical distribution corresponding to S, i.e., hist(S) = p\u0302n(S).\nBasic tools from probability. We recall some probabilistic inequalities that will be crucial for our analysis. Our first tool is the well-known VC inequality. Given a family of subsets A over [N], define \u2016p\u2016A = sup_{A\u2208A} |p(A)|. 
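To make the preliminaries concrete, here is a small illustrative sketch (not from the paper; plain Python with NumPy, domain indexed from 0) of the total variation distance, the Kolmogorov distance, and the empirical distribution:

```python
import numpy as np

def d_tv(p, q):
    """Total variation distance: half the l1-distance between p and q."""
    return 0.5 * np.abs(p - q).sum()

def d_k(p, q):
    """Kolmogorov distance: largest absolute gap between the two CDFs."""
    return np.abs(np.cumsum(p) - np.cumsum(q)).max()

def empirical(samples, N):
    """Empirical distribution over {0, ..., N-1} from a list of samples."""
    return np.bincount(samples, minlength=N) / len(samples)
```

For example, for p = (0.4, 0.1, 0.4, 0.1) and q = (0.1, 0.4, 0.1, 0.4), one gets dTV = 0.6 but dK = 0.3, illustrating that the Kolmogorov distance never exceeds the total variation distance.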
The VC\u2013dimension of A is the maximum size of a subset X \u2286 [N ] that is\nshattered by A (a set X is shattered by A if for every Y \u2286 X some A \u2208 A satis\ufb01es A \u2229 X = Y ).\n\n3\n\n\fTheorem 1 (VC inequality, [6, p. 31]). Let(cid:98)pn be an empirical distribution of n samples from p. Let\nA be a family of subsets of VC\u2013dimension k. Then E [(cid:107)p \u2212(cid:98)pn(cid:107)A] \u2264 O((cid:112)k/n).\n\nWe note that the RHS above is best possible (up to constant factors) and independent of the domain\nsize N. The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality [26] can be obtained as a consequence\nof the VC inequality by taking A to be the class of all intervals. The DKW inequality implies that\nfor n = \u2126(1/\u00012), with probability at least 9/10 (over the draw of n samples from p) the empirical\n\ndistribution(cid:98)pn will be \u0001-close to p in Kolmogorov distance.\nTheorem 2 ([6, p. 17]). Let A be a family of subsets over [N ], and(cid:98)pn be an empirical distribution\n\nWe will also use the following uniform convergence bound:\nof n samples from p. Let X be the random variable (cid:107)p \u2212 \u02c6p(cid:107)A. Then we have Pr [X \u2212 E[X] > \u03b7] \u2264\ne\u22122n\u03b72\n\n.\n\nConnection to Synthetic Data. Distribution learning is closely related to the problem of generat-\ning synthetic data. Any dataset D of size n over a universe X can be interpreted as a distribution\nover the domain {1, . . . ,|X|}. The weight of item x \u2208 X corresponds to the fraction of elements in\nD that are equal to x. In fact, this histogram view is convenient in a number of algorithms in Differ-\nential Privacy. If we manage to learn this unknown distribution, then we can take n samples from it\nobtain another synthetic dataset D(cid:48). Hence, the quality of the distribution learner dictates the statis-\ntical properties of the synthetic dataset. 
Learning in total variation distance is particularly appealing\nfrom this point of view. If two datasets represented as distributions p, q satisfy dTV (p, q) \u2264 \u03b1, then\nfor every test function f : X \u2192 {0, 1} we must have that |Ex\u223cpf (x) \u2212 Ex\u223cqf (x)| \u2264 \u03b1. Put in dif-\nferent terminology, this means that the answer to any statistical query1 differs by at most \u03b1 between\nthe two distributions.\n\n3 A Differentially Private Learning Framework\n\nIn this section, we describe our private distribution learning upper bounds. We start with the simple\ncase of privately learning an arbitrary discrete distribution over [N ]. We then extend this bound to\nthe case of privately and agnostically learning a histogram distribution over an arbitrary but known\npartition of [N ]. Finally, we generalize the recent framework of [4] to obtain private agnostic learn-\ners for histogram distributions over an arbitrary unknown partition, and more generally piecewise\npolynomial distributions.\nOur \ufb01rst theorem gives a differentially private algorithm for arbitrary distributions over [N ] that es-\nsentially matches the best non-private algorithm. Let CN be the family of all probability distributions\nover [N ]. We have the following:\nTheorem 3. There is a computationally ef\ufb01cient (\u0001, 0)-differentially private (\u03b1, \u03b2)-learning algo-\nrithm for CN that uses n = O((N + log(1/\u03b2))/\u03b12 + N log(1/\u03b2)/(\u0001\u03b1)) samples.\nThe algorithm proceeds as follows: Given a dataset S of n samples from the unknown target dis-\n\ntribution p over [N ], it outputs the hypothesis h = hist(S) + \u03b7 = (cid:98)pn(S) + \u03b7, where \u03b7 \u2208 RN is\n\nsampled from the N-dimensional Laplace distribution with standard deviation 1/(\u0001n). 
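As an illustration, the estimator just described can be sketched in a few lines (illustrative Python/NumPy, not the authors' code; the noise scale is proportional to 1/(epsilon * n) as in the theorem, with constants glossed over, and the final projection back to a probability vector is a post-processing step that does not affect privacy):

```python
import numpy as np

def private_histogram(samples, N, epsilon, rng=None):
    """Sketch of the private learner for arbitrary distributions over [N]:
    output the empirical distribution plus per-coordinate Laplace noise.
    Noise scale ~ 1/(epsilon * n) per coordinate (constants omitted);
    clipping and renormalizing are privacy-free post-processing."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(samples)
    emp = np.bincount(samples, minlength=N) / n  # empirical distribution
    noisy = emp + rng.laplace(scale=1.0 / (epsilon * n), size=N)
    noisy = np.clip(noisy, 0.0, None)  # project back to a probability vector
    return noisy / noisy.sum()
```

For large n (or large epsilon) the added noise vanishes and the output approaches the empirical distribution itself.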
The simple\nanalysis is deferred to Appendix A.\nA t-histogram over [N ] is a function h : [N ] \u2192 R that is piecewise constant with at most t interval\npieces, i.e., there is a partition I of [N ] into intervals I1, . . . , It such that h is constant on each\nIi. Let Ht(I) be the family of all t-histogram distributions over [N ] with respect to partition I =\n{I1, . . . , It}. Given sample access to a distribution p over [N ], our goal is to output a hypothesis\nh : [N ] \u2192 [0, 1] that satis\ufb01es dTV (h, p) \u2264 C \u00b7 optt(p) + \u03b1, where optt(p) = inf g\u2208Ht(I) dTV (g, p).\nWe show the following:\nTheorem 4. There is a computationally ef\ufb01cient (\u0001, 0)-differentially private (\u03b1, \u03b2)-agnostic learn-\ning algorithm for Ht(I) that uses n = O((t + log(1/\u03b2))/\u03b12 + t log(1/\u03b2)/(\u0001\u03b1)) samples.\nThe main idea of the proof is that the differentially private learning problem for Ht(I) can be\nreduced to the same problem over distributions of support [t]. The theorem then follows by an\n\n1A statistical query asks for the average of a predicate over the dataset.\n\n4\n\n\fapplication of Theorem 3. See Appendix A for further details. Theorem 4 gives differentially private\nlearners for any family of distributions over [N ] that can be well-approximated by histograms over\na \ufb01xed partition, including monotone distributions and distributions with a known mode.\nIn the remainder of this section, we focus on approximate privacy, i.e., (\u0001, \u03b4)-differential privacy for\n\u03b4 > 0, and show that for a wide range of natural and well-studied distribution families there exists a\ncomputationally ef\ufb01cient and differentially private algorithm whose sample size is at most a factor\nof 2O(log\u2217 N ) worse than its non-private counterpart. In particular, we give a differentially private\nversion of the algorithm in [4]. 
For a wide range of distributions, our algorithm has near-optimal sample complexity and runs in time that is nearly-linear in the sample size and logarithmic in the domain size.\nWe can view our overall private learning algorithm as a reduction. For the sake of concreteness, we state our approach for the case of histograms, the generalization to piecewise polynomials being essentially identical. Let Ht be the family of all t-histogram distributions over [N] (with unknown partition). In the non-private setting, a combination of Theorems 1 and 2 (see appendix) implies that Ht is (\u03b1, \u03b2)-agnostically learnable using n = \u0398((t + log(1/\u03b2))/\u03b1^2) samples. The algorithm of [4] starts with the empirical distribution p\u0302n and post-processes it to obtain an (\u03b1, \u03b2)-accurate hypothesis h. Let Ak be the collection of subsets of [N] that can be expressed as unions of at most k disjoint intervals. The important property of the empirical distribution p\u0302n is that with high probability, p\u0302n is \u03b1-close to the target distribution p in Ak-distance for any k = O(t).\nThe crucial observation that enables our generalization is that the algorithm of [4] achieves the same performance guarantees starting from any hypothesis q such that \u2016p \u2212 q\u2016_{A_{O(t)}} \u2264 \u03b1.2 This observation motivates the following two-step differentially private algorithm: (1) Starting from the empirical distribution p\u0302n, efficiently construct an (\u03b5, \u03b4)-differentially private hypothesis q such that with probability at least 1 \u2212 \u03b2/2 it holds that \u2016q \u2212 p\u0302n\u2016_{A_{O(t)}} \u2264 \u03b1/2. (2) Pass q as input to the learning algorithm of [4] with parameters (\u03b1/2, \u03b2/2) and return its output hypothesis h.\nWe now proceed to sketch correctness. Since q is (\u03b5, \u03b4)-differentially private and the algorithm of Step (2) is only a function of q, the composition theorem implies that h is also (\u03b5, \u03b4)-differentially private. Recall that with probability at least 1 \u2212 \u03b2/2 we have \u2016p \u2212 p\u0302n\u2016_{A_{O(t)}} \u2264 \u03b1/2. By the properties of q in Step (1), a union bound and an application of the triangle inequality imply that with probability at least 1 \u2212 \u03b2 we have \u2016p \u2212 q\u2016_{A_{O(t)}} \u2264 \u03b1. Hence, the output h of Step (2) is an (\u03b1, \u03b2)-accurate agnostic hypothesis.\nWe have thus sketched a proof of the following lemma:\nLemma 1. Suppose there is an (\u03b5, \u03b4)-differentially private synthetic data algorithm under the A_{O(t)}-distance metric that is (\u03b1/2, \u03b2/2)-accurate on databases of size n, where n = \u03a9((t + log(1/\u03b2))/\u03b1^2). Then, there exists an (\u03b1, \u03b2)-accurate agnostic learning algorithm for Ht with sample complexity n.\nRecent work of Bun et al. [9] gives an efficient differentially private synthetic data algorithm under the Kolmogorov distance metric:\nProposition 1. [9] There is an (\u03b5, \u03b4)-differentially private (\u03b1, \u03b2)-accurate synthetic data algorithm with respect to the dK-distance on databases of size n over [N], assuming n = \u03a9((1/(\u03b5\u03b1)) \u00b7 2^(O(log\u2217 N)) \u00b7 ln(1/\u03b1\u03b2\u03b5\u03b4)). The algorithm runs in time O(n \u00b7 log N).\nNote that the Kolmogorov distance is equivalent to the A2-distance up to a factor of 2. Hence, by applying the above proposition for \u03b1\u2032 = \u03b1/t one obtains an (\u03b1, \u03b2)-accurate synthetic data algorithm with respect to the At-distance. Combining the above, we obtain the following:\nTheorem 5. 
There is an (\u03b5, \u03b4)-differentially private (\u03b1, \u03b2)-agnostic learning algorithm for Ht that uses n = O((t/\u03b1^2) \u00b7 ln(1/\u03b2) + (t/(\u03b5\u03b1)) \u00b7 2^(O(log\u2217 N)) \u00b7 ln(1/\u03b1\u03b2\u03b5\u03b4)) samples and runs in time \u00d5(n) + O(n \u00b7 log N).\n\nAs an immediate corollary of Theorem 5, we obtain nearly sample-optimal and computationally efficient differentially private estimators for all the structured discrete distribution families studied in [3, 4]. These include well-known classes of shape-restricted densities, including (mixtures of) unimodal and multimodal densities (with unknown mode locations), monotone hazard rate (MHR) and log-concave distributions, and others. Due to space constraints, we do not enumerate the full descriptions of these classes or statements of these results here but instead refer the interested reader to [3, 4].\n\n2We remark that a potential difference is in the running time of the algorithm, which depends on the support and structure of the distribution q.\n\n4 Maximum Error Rule for Private Kolmogorov Distance Approximation\n\nIn this section, we describe a simple and fast algorithm for privately approximating an input histogram with respect to the Kolmogorov distance. Our private algorithm relies on a simple (non-private) iterative greedy algorithm to approximate a given histogram (empirical distribution) in Kolmogorov distance, which we term MAXIMUMERRORRULE. This algorithm performs a set of basic operations on the data and can be effectively implemented in the private setting.\nTo describe the non-private version of MAXIMUMERRORRULE, we point out a connection of the Kolmogorov distance approximation problem to the problem of approximating a monotone univariate function by a piecewise linear function. Let p\u0302n be the empirical probability distribution over [N], and let P\u0302n denote the corresponding empirical CDF. 
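To make the connection concrete, here is a minimal sketch (illustrative Python/NumPy with a 0-indexed domain; not the paper's implementation) of the empirical CDF and of the greedy selection criterion, i.e., finding the point where a piecewise-linear CDF deviates most from it:

```python
import numpy as np

def empirical_cdf(samples, N):
    """Empirical CDF over {0, ..., N-1}: fraction of samples <= j."""
    return np.cumsum(np.bincount(samples, minlength=N)) / len(samples)

def max_error_point(cdf, knots_x, knots_y):
    """Return the domain point (and its error) where the piecewise-linear
    CDF defined by the given knots disagrees most with cdf in l_inf-norm."""
    approx = np.interp(np.arange(len(cdf)), knots_x, knots_y)
    errs = np.abs(cdf - approx)
    j = int(errs.argmax())
    return j, float(errs[j])
```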
Note that P\u0302n : [N] \u2192 [0, 1] is monotone non-decreasing and piecewise constant with at most n pieces. We would like to approximate p\u0302n by a piecewise uniform distribution with a corresponding piecewise linear CDF. It is easy to see that this is exactly the problem of approximating a monotone function by a piecewise linear function in the \u2113\u221e-norm.\nThe MAXIMUMERRORRULE works as follows: Starting with the trivial linear approximation that interpolates between (0, 0) and (N, 1), the algorithm iteratively refines its approximation to the target empirical CDF using a greedy criterion. In each iteration, it finds the point (x, y) of the true curve (the empirical CDF P\u0302n) at which the current piecewise linear approximation disagrees most strongly with the target CDF (in the \u2113\u221e-norm). It then refines the previous approximation by adding the point (x, y) and interpolating linearly between the new point and the previous two adjacent points of the approximation. See Figure 1 for a graphical illustration of our algorithm. The MAXIMUMERRORRULE is a popular method for monotone curve approximation whose convergence rate has been analyzed under certain assumptions on the structure of the input curve. For example, if the monotone input curve satisfies a Lipschitz condition, it is known that the \u2113\u221e-error after T iterations scales as O(1/T^2) (see, e.g., [27] and references therein).\nThere are a number of challenges towards making this algorithm differentially private. The first is that we cannot exactly select the maximum error point. Instead, we can only choose an approximate maximizer using a differentially private sub-routine. The standard algorithm for choosing such a point would be the exponential mechanism of McSherry and Talwar [28]. Unfortunately, this algorithm falls short of our goals in two respects. 
First, it introduces a linear dependence on the\ndomain size in the running time making the algorithm prohibitively inef\ufb01cient over large domains.\nSecond, it introduces a logarithmic dependence on the domain size in the error of our approximation.\nIn place of the exponential mechanism, we design a sub-routine using the \u201cchoosing mechanism\u201d\nof Beimel, Nissim, and Stemmer [25]. Our sub-routine runs in logarithmic time in the domain size\nand achieves a doubly-logarithmic dependence in the error. See Figure 2 for a pseudocode of our\nalgorithm. In the description of the algorithm, we think of At as a CDF de\ufb01ned by a sequence of\npoints (0, 0), (x1, y1), ..., (xk, yk), (N, 1) specifying the value of the CDF at various discrete points\nof the domain. We denote by weight(I, At) \u2208 [0, 1] the weight of the interval I according to the\nCDF At, where the value at missing points in the domain is achieved by linear interpolation. In other\nwords, At represents a piecewise-linear CDF (corresponding to a piecewise constant histogram).\nSimilarly, we let weight(I, S) \u2208 [0, 1] denote the weight of interval I according to the sample S,\nthat is, |S \u2229 I|/|S|.\nWe show that our algorithm satis\ufb01es (\u0001, \u03b4)-differential privacy (see Appendix B):\nLemma 2. For every \u0001 \u2208 (0, 2), \u03b4 > 0, MaximumErrorRule satis\ufb01es (\u0001, \u03b4)-differential privacy.\n\nNext, we provide two performance guarantees for our algorithm. The \ufb01rst shows that the running\ntime per iteration is at most O(n log N ). The second shows that if at any step t there is a \u201cbad\u201d\ninterval in I that has large error, then our algorithm \ufb01nds such a bad interval where the quantitative\n\n6\n\n\fFigure 1: CDF approximation after T = 0, 1, 2, 3 iterations.\n\nMAXIMUMERRORRULE(S \u2208 [N ]n, privacy parameters \u0001, \u03b4)\nFor t = 1 to T :\n\n1. I = FINDBADINTERVAL(At\u22121, S)\n2. At = UPDATE(At\u22121, S, I)\n\nFINDBADINTERVAL\n\n1. 
Let I be the collection of all dyadic intervals of the domain.\n2. For each J \u2208 I, let q(J; S) = |weight(J, At\u22121) \u2212 weight(J, S)|.\n3. Output an I \u2208 I sampled from the choosing mechanism with score function q over the collection I with privacy parameters (\u03b5/2T, \u03b4/T).\n\nUPDATE\n\n1. Let I = (l, r) be the input interval. Compute wl = weight([1, l], S) + Laplace(0, 1/(2n\u03b5)) and wr = weight([l + 1, r], S) + Laplace(0, 1/(2n\u03b5)).\n2. Output the CDF obtained from At\u22121 by adding the points (l, wl) and (r, wl + wr) to the graph of At\u22121.\n\nFigure 2: Maximum Error Rule (MERR).\n\nloss depends only doubly-logarithmically on the domain size (see Appendix B for the proof of the following theorem).\nProposition 2. MERR runs in time O(T n log N). Furthermore, for every step t, with probability 1 \u2212 \u03b2, we have that the interval I selected at step t satisfies\n\n|weight(I, At\u22121) \u2212 weight(I, S)| \u2265 OPT \u2212 O((1/(\u03b5n)) \u00b7 log(n log N \u00b7 log(1/\u03b2\u03b5\u03b4))).\n\nRecall that OPT = maxJ\u2208I |weight(J, At\u22121) \u2212 weight(J, S)|.\n\n5 Experiments\n\nIn addition to our theoretical results from the previous sections, we also investigate the empirical performance of our private distribution learning algorithm based on the maximum error rule. The focus of our experiments is the learning error achieved by the private algorithm for various distributions. For this, we employ two types of data sets: multiple synthetic data sets derived from mixtures of well-known distributions (see Appendix C), and a data set from Higgs experiments [29]. The synthetic data sets allow us to vary a single parameter (in particular, the domain size) while keeping the remaining problem parameters constant. We have chosen a distribution from the Higgs data set because it gives rise to a large domain size. 
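The non-private core of the maximum error rule described in Section 4 can be sketched as follows (illustrative Python/NumPy, 0-indexed and with minor boundary choices that differ from the paper; the private interval selection via the choosing mechanism is omitted):

```python
import numpy as np

def merr_nonprivate(samples, N, T):
    """Greedily refine a piecewise-linear CDF: for T rounds, add the point
    where the current approximation deviates most from the empirical CDF."""
    cdf = np.cumsum(np.bincount(samples, minlength=N)) / len(samples)
    xs, ys = [0, N - 1], [float(cdf[0]), 1.0]  # initial linear approximation
    for _ in range(T):
        approx = np.interp(np.arange(N), xs, ys)
        j = int(np.abs(cdf - approx).argmax())
        if j in xs:  # maximum error sits at an existing knot: stop refining
            break
        k = int(np.searchsorted(xs, j))
        xs.insert(k, j)
        ys.insert(k, float(cdf[j]))
    return xs, ys
```

On a distribution that is uniform over the first half of the domain, a single refinement step already places a knot at the end of the support and drives the l_inf-error to (numerically) zero.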
Our results show that the maximum error rule finds a good approximation of the underlying distribution, matching the learning error of a non-private baseline when the number of samples is sufficiently large. Moreover, our algorithm is very efficient and runs in less than 5 seconds for n = 10^7 samples on a domain of size N = 10^18.\nWe implemented our algorithm in the Julia programming language (v0.3) and ran the experiments on an Intel Core i5-4690K CPU (3.5 - 3.9 GHz, 6 MB cache). In all experiments involving our private learning algorithm, we set the privacy parameters to \u03b5 = 1 and \u03b4 = 1/n. Since the noise magnitude depends on 1/(\u03b5n), varying \u03b5 has the same effect as varying the sample size n. Similarly, changes in \u03b4 are related to changes in n, and therefore we only consider this setting of privacy parameters.\n\nHiggs data. In addition to the synthetic data mentioned above, we use the lepton pT (transverse momentum) feature of the Higgs data set (see Figure 2e of [29]). The data set contains roughly 11 million samples, which we use as the unknown distribution. Since the values are specified with 18 digits of accuracy, we interpret them as discrete values in [N] for N = 10^18. We then generate a sample from this data set by taking the first n samples and pass this subset as input to our private distribution learning algorithm. This time, we measure the error as the Kolmogorov distance between the hypothesis returned by our algorithm and the CDF given by the full set of 11 million samples.\nIn this experiment (Figure 3), we again see that the maximum-error rule achieves a good learning error. Moreover, we investigate the following two aspects of the algorithm: (i) The number of steps taken by the maximum error rule influences the learning error. 
In particular, a smaller number of steps leads to a better approximation for small values of n, while more samples allow us to achieve a better error with a larger number of steps. (ii) Our algorithm is very efficient. Even for the largest sample size n = 10^7 and the largest number of MERR steps, our algorithm runs in less than 5 seconds. Note that on the same machine, simply sorting n = 10^7 floating point numbers takes about 0.6 seconds. Since our algorithm involves a sorting step, this shows that the overhead added by the maximum error rule is only about 7× compared to sorting. In particular, this implies that no algorithm that relies on sorted samples can outperform our algorithm by a large margin.

Limitations and future work. As we previously saw, the performance of the algorithm varies with the number of iterations. Currently, this is a parameter that must be optimized over separately, for example, by choosing the best run privately via the exponential mechanism. This is standard practice in the privacy literature, but it would be more appealing to find an adaptive method of choosing this parameter on the fly as the algorithm obtains more information about the data.

There remains a gap in sample complexity between the private and the non-private algorithm. One reason for this is the relatively large constants in the privacy analysis of the choosing mechanism [9]. With a tighter privacy analysis, one could hope to reduce the sample size requirements of our algorithm by up to an order of magnitude.

It is likely that our algorithm could also benefit from certain post-processing steps such as smoothing the output histogram.
We did not evaluate such techniques here for simplicity and clarity of the experiments, but this is a promising direction.

[Figure 3 appears here: two log-log plots over the Higgs data, showing the Kolmogorov error (left) and the running time in seconds (right) against sample sizes n from 10^3 to 10^7, for m = 4, 8, 12, 16, and 20 MERR steps.]

Figure 3: Evaluation of our private learning algorithm on the Higgs data set. The left plot shows the Kolmogorov error achieved for various sample sizes n and number of steps taken by the maximum error rule (m). The right plot displays the corresponding running times of our algorithm.

Acknowledgments

Ilias Diakonikolas was supported by EPSRC grant EP/L021749/1 and a Marie Curie Career Integration grant. Ludwig Schmidt was supported by MADALGO and a grant from the MIT-Shell Initiative.

References

[1] C. Dwork. The differential privacy frontier (extended abstract). In TCC, pages 496–502, 2009.
[2] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Learning mixtures of structured distributions over discrete domains. In SODA, 2013.
[3] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Efficient density estimation via piecewise polynomial approximation. In STOC, pages 604–613, 2014.
[4] J. Acharya, I. Diakonikolas, J. Li, and L. Schmidt. Sample-optimal density estimation in nearly-linear time. Available at http://arxiv.org/abs/1506.00671, 2015.
[5] M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. Schapire, and L. Sellie. On the learnability of discrete distributions. In STOC, pages 273–282, 1994.
[6] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer Series in Statistics, Springer, 2001.
[7] L. Birgé. Estimation of unimodal densities without smoothness assumptions.
Annals of Statistics, 25(3):970–981, 1997.
[8] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Near-optimal density estimation in near-linear time using variable-width histograms. In NIPS, pages 1844–1852, 2014.
[9] M. Bun, K. Nissim, U. Stemmer, and S. P. Vadhan. Differentially private release and learning of threshold functions. CoRR, abs/1504.07553, 2015.
[10] U. Grenander. On the theory of mortality measurement. Skand. Aktuarietidskr., 39:125–153, 1956.
[11] B. L. S. Prakasa Rao. Estimation of a unimodal density. Sankhya Ser. A, 31:23–36, 1969.
[12] P. Groeneboom. Estimating a monotone density. In Proc. of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, pages 539–555, 1985.
[13] L. Birgé. Estimating a density under order restrictions: Nonasymptotic minimax risk. Annals of Statistics, pages 995–1012, 1987.
[14] F. Balabdaoui and J. A. Wellner. Estimation of a k-monotone density: Limit distribution theory and the spline connection. Annals of Statistics, 35(6):2536–2564, 2007.
[15] L. Dümbgen and K. Rufibach. Maximum likelihood estimation of a log-concave density and its distribution function: Basic properties and uniform consistency. Bernoulli, 15(1):40–68, 2009.
[16] G. Walther. Inference and modeling with log-concave distributions. Statistical Science, 2009.
[17] Y. Freund and Y. Mansour. Estimating a mixture of two product distributions. In COLT, 1999.
[18] J. Feldman, R. O'Donnell, and R. Servedio. Learning mixtures of product distributions over discrete domains. In FOCS, pages 501–510, 2005.
[19] C. Daskalakis, I. Diakonikolas, and R. A. Servedio. Learning k-modal distributions via testing. In SODA, pages 1371–1385, 2012.
[20] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309), 1965.
[21] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In FOCS, pages 429–438, 2013.
[22] J. C. Duchi, M. J. Wainwright, and M. I. Jordan. Local privacy and minimax bounds: Sharp rates for probability estimation. In NIPS, pages 1529–1537, 2013.
[23] M. Hardt, K. Ligett, and F. McSherry. A simple and practical algorithm for differentially-private data release. In NIPS, 2012.
[24] C. Li, M. Hay, G. Miklau, and Y. Wang. A data- and workload-aware query answering algorithm for range queries under differential privacy. PVLDB, 7(5):341–352, 2014.
[25] A. Beimel, K. Nissim, and U. Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. In RANDOM, pages 363–378, 2013.
[26] A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Mathematical Statistics, 27(3):642–669, 1956.
[27] G. Rote. The convergence rate of the sandwich algorithm for approximating convex functions. Computing, 48:337–361, 1992.
[28] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103, 2007.
[29] P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, (5), 2014.
[30] C. Dwork, G. N. Rothblum, and S. Vadhan. Boosting and differential privacy. In FOCS, 2010.