{"title": "Differentially Private Bayesian Linear Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 525, "page_last": 535, "abstract": "Linear regression is an important tool across many fields that work with sensitive human-sourced data. Significant prior work has focused on producing differentially private point estimates, which provide a privacy guarantee to individuals while still allowing modelers to draw insights from data by estimating regression coefficients. We investigate the problem of Bayesian linear regression, with the goal of computing posterior distributions that correctly quantify uncertainty given privately released statistics. We show that a naive approach that ignores the noise injected by the privacy mechanism does a poor job in realistic data settings. We then develop noise-aware methods that perform inference over the privacy mechanism and produce correct posteriors across a wide range of scenarios.", "full_text": "Differentially Private Bayesian Linear Regression\n\nGarrett Bernstein\n\nUniversity of Massachusetts Amherst\n\ngbernstein@cs.umass.edu\n\nDaniel Sheldon\n\nUniversity of Massachusetts Amherst\n\nsheldon@cs.umass.edu\n\nAbstract\n\nLinear regression is an important tool across many \ufb01elds that work with sensitive\nhuman-sourced data. Signi\ufb01cant prior work has focused on producing differentially\nprivate point estimates, which provide a privacy guarantee to individuals while still\nallowing modelers to draw insights from data by estimating regression coef\ufb01cients.\nWe investigate the problem of Bayesian linear regression, with the goal of com-\nputing posterior distributions that correctly quantify uncertainty given privately\nreleased statistics. We show that a naive approach that ignores the noise injected\nby the privacy mechanism does a poor job in realistic data settings. 
We then\ndevelop noise-aware methods that perform inference over the privacy mechanism\nand produce correct posteriors across a wide range of scenarios.\n\n1\n\nIntroduction\n\nLinear regression is one of the most widely used statistical methods, especially in the social sci-\nences [Agresti and Finlay, 2009] and other domains where data comes from humans. It is important\nto develop robust tools that can realize the bene\ufb01ts of regression analyses but maintain the privacy\nof individuals. Differential privacy [Dwork et al., 2006] is a widely accepted formalism to provide\nalgorithmic privacy guarantees: a differentially private algorithm randomizes its computation to\nprovably limit the risk that its output discloses information about individuals.\nExisting work on differentially private linear regression focuses on frequentist approaches. A variety\nof privacy mechanisms have been applied to point estimation of regression coef\ufb01cients, including\nsuf\ufb01cient statistic perturbation (SSP) [Foulds et al., 2016, McSherry and Mironov, 2009, Vu and\nSlavkovic, 2009, Wang, 2018, Zhang et al., 2016], posterior sampling (OPS) [Dimitrakakis et al.,\n2014, Geumlek et al., 2017, Minami et al., 2016, Wang, 2018, Wang et al., 2015, Zhang et al., 2016],\nsubsample and aggregate [Dwork and Smith, 2010, Smith, 2008], objective perturbation [Kifer et al.,\n2012], and noisy stochastic gradient descent [Bassily et al., 2014]. Only a few recent works address\nuncertainty quanti\ufb01cation through con\ufb01dence interval estimation [Sheffet, 2017] and hypothesis\ntests [Barrientos et al., 2019] for regression coef\ufb01cients.\nWe develop a differentially private method for Bayesian linear regression. A Bayesian approach\nnaturally quanti\ufb01es parameter uncertainty through a full posterior distribution and provides other\nBayesian capabilities such as the ability to incorporate prior knowledge and compute posterior\npredictive distributions. 
Existing approaches to private Bayesian inference include OPS (see above), MCMC [Wang et al., 2015], variational inference (VI; Honkela et al. [2018], Jälkö et al. [2017], Park et al. [2016]), and SSP [Bernstein and Sheldon, 2018, Foulds et al., 2016], but none provide a fully satisfactory approach for Bayesian regression modeling. OPS does not naturally produce a representation of a full posterior distribution. MCMC approaches incur per-iteration privacy costs and satisfy only approximate (ε, δ)-differential privacy. Private VI approaches also incur per-iteration privacy costs, and are most relevant when the original inference problem requires VI. When applicable, SSP is a very desirable approach (sufficient statistics are perturbed once and then used in conjugate updates to obtain parameters of full posterior distributions) and often outperforms other methods in practice [Foulds et al., 2016, Wang, 2018]. However, Bernstein and Sheldon [2018] demonstrated (for unconditional exponential family models) that naive SSP, which ignores noise introduced by the privacy mechanism, systematically underestimates uncertainty at small to moderate sample sizes. We show that the same phenomenon holds for Bayesian linear regression: naive SSP produces private posteriors that are properly calibrated asymptotically in the sample size, but for realistic data sets and privacy levels may need very large population sizes to reach the asymptotic regime.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

This motivates our development of Bayesian inference methods for linear regression that properly account for the noise due to the privacy mechanism [Bernstein and Sheldon, 2018, Bernstein et al., 2017, Karwa et al., 2014, 2016, Schein et al., 2018, Williams and McSherry, 2010].
We leverage a\nmodel in which the data and model parameters are latent variables, and noisy suf\ufb01cient statistics are\nobserved, and then develop MCMC-based techniques to sample from posterior distributions, as done\nfor exponential families in [Bernstein and Sheldon, 2018]. A signi\ufb01cant challenge relative to prior\nwork is the handling of covariate data. Typical regression modeling treats only response variables\nand parameters as random, and conditions on covariates. This is not possible in the private setting,\nwhere covariates must be kept private and therefore treated as latent variables. We therefore require\nsome form of assumption about the distribution over covariates. We develop two inference methods.\nThe \ufb01rst includes latent variables for each individual; it requires an explicit prior distribution for\ncovariates and its runtime scales with population size. The second marginalizes out individuals and\napproximates the distribution over the suf\ufb01cient statistics; it requires weaker assumptions about\nthe covariate distribution (only moments), and its running time does not scale with population size.\nWe perform a range of experiments to measure the calibration and utility of these methods. Our\nnoise-aware methods are as well or nearly as well calibrated as the non-private method, and have\nbetter utility than the naive method. We demonstrate using real data that our noise-aware methods\nquantify posterior predictive uncertainty signi\ufb01cantly better than naive SSP.\n\n2 Background\n\nDifferential Privacy. A differentially private algorithm A provides a guarantee to individuals: The\ndistribution over the output of A will be (nearly) indistinguishable regardless of the inclusion or\nexclusion of a single individual\u2019s data. The implication to the individual is they face negligible risk in\ndeciding to contribute their personal data to be used by a differentially private algorithm. 
To formally write the guarantee we reason about a generic data set X = x1:n = (x1, ..., xn) of n individuals, where xi is the data record of the ith individual. For this paper, define neighboring data sets as those that differ by a single record, i.e., X′ ∈ nbrs(X) if X′ = (x1:i−1, x′i, xi+1:n) for some i.¹

Definition 1 (Differential Privacy; Dwork et al. [2006]). A randomized algorithm A satisfies ε-differential privacy if for any input X, any X′ ∈ nbrs(X), and any subset of outputs O ⊆ Range(A),

Pr[A(X) ∈ O] ≤ exp(ε) Pr[A(X′) ∈ O].

The above guarantee is ensured by randomizing A. A key concept is the sensitivity of a function, which quantifies the impact an individual record has on the output of the function.

Definition 2 (Sensitivity; Dwork et al. [2006]). The sensitivity of a function f is Δf = sup_{X, X′ ∈ nbrs(X)} ‖f(X) − f(X′)‖1.

We use the Laplace mechanism to ensure publicly-released statistics meet the requirements of differential privacy.

Definition 3 (Laplace Mechanism; Dwork et al. [2006]). Given a function f that maps data sets to R^m, the Laplace mechanism outputs the random variable L(X) ∼ Lap(f(X), Δf/ε) from the Laplace distribution, which has density Lap(z; u, b) = (2b)^−m exp(−‖z − u‖1/b). This corresponds to adding zero-mean independent noise ui ∼ Lap(0, Δf/ε) to each component of f(X).

A final property is post-processing, which says that any further processing on the output of a differentially private algorithm that does not access the original data retains the same privacy guarantees [Dwork and Roth, 2014].

Linear Regression. We start with a standard (non-private) linear regression problem. An individual's covariate or regressor data is x ∈ R^d and the dependent response data is y ∈ R.
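As a concrete illustration of Definition 3, adding calibrated Laplace noise takes only a few lines of NumPy. This is our sketch, not code from the paper; the function name and arguments are ours, and we assume the sensitivity Δf has already been computed:

```python
import numpy as np

def laplace_mechanism(f_value, sensitivity, epsilon, rng):
    """Release f(X) with epsilon-DP by adding independent Lap(0, sensitivity/epsilon)
    noise to each component of f(X), as in Definition 3."""
    scale = sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=np.shape(f_value))
    return np.asarray(f_value, dtype=float) + noise

rng = np.random.default_rng(0)
# Example: privately release a 3-component statistic with sensitivity 2 at epsilon = 0.5.
z = laplace_mechanism([1.0, -2.0, 0.5], sensitivity=2.0, epsilon=0.5, rng=rng)
```

Halving ε doubles the noise scale Δf/ε, so a smaller privacy budget yields a noisier release.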
¹This variant is called bounded differential privacy in that the number of individuals n remains constant [Kifer and Machanavajjhala, 2011].

We will assume a conditionally Gaussian model y ∼ N(θᵀx, σ²), where θ ∈ R^d are the regression coefficients and σ² is the error variance. An intercept or bias term may be included in the model by appending a unit-valued feature to x. The goal, given an observed population of n individuals, is to obtain a point estimate of θ. The population data can be written as X ∈ R^{n×d}, where each row corresponds to an individual x, and y ∈ R^n. The ordinary least squares (OLS) solution is θ̂ = (XᵀX)⁻¹Xᵀy [Rencher, 2003].

In Bayesian linear regression the parameters θ and σ² are random variables with a specified prior distribution. The conjugate priors are p(σ²) = InverseGamma(a0, b0) and p(θ | σ²) = N(µ0, σ²Λ0⁻¹), which defines a normal-inverse gamma prior distribution: p(θ, σ²) = NIG(µ0, Λ0, a0, b0). Due to conjugacy of the prior distribution with the likelihood model, the posterior distribution, shown in Equation (1), is also normal-inverse gamma [O'Hagan and Forster, 1994].

p(θ, σ² | X, y) = NIG(µn, Λn, an, bn),        (1)
µn = (XᵀX + Λ0)⁻¹ (Xᵀy + Λ0 µ0)
Λn = XᵀX + Λ0
an = a0 + n/2
bn = b0 + (1/2) (yᵀy + µ0ᵀ Λ0 µ0 − µnᵀ Λn µn)

Let t(x, y) := [vec(xxᵀ), xy, y²] for an arbitrary individual. Then the sufficient statistics of the above model are s := t(X, y) = Σi t(x(i), y(i)) = [XᵀX, Xᵀy, yᵀy].
These capture all information about the model parameters contained in the sample and are the only quantities needed for the conjugate posterior updates above [Casella and Berger, 2002].

3 Private Bayesian Linear Regression

The goal is to perform Bayesian linear regression in an ε-differentially private manner. We ensure privacy by employing sufficient statistic perturbation (SSP) [Foulds et al., 2016, Vu and Slavkovic, 2009, Zhang et al., 2016], in which the Laplace mechanism is used to inject noise into the sufficient statistics of the model, making them fit for public release. The question is then how to compute the posterior over the model parameters θ and σ² given the noisy sufficient statistics. We first consider a naive method that ignores the noise in the noisy sufficient statistics. We then consider more principled noise-aware inference approaches that account for the noise due to the privacy mechanism.

3.1 Privacy mechanism

Using the Laplace mechanism to release the noisy sufficient statistics z results in the model shown in Figure 1. This is the same model used in non-private linear regression except for the introduction of z, which requires the exact sufficient statistics s to have finite sensitivity. A standard assumption in the literature [Awan and Slavkovic, 2018, Sheffet, 2017, Wang, 2018, Zhang et al., 2012] is to assume x and y have known a priori lower and upper bounds, (ax, bx) and (ay, by), with bound widths wx = bx − ax (assuming, for simplicity, equal bounds for all covariate dimensions) and wy = by − ay, respectively.
We can then reason about the worst-case influence of an individual on each component of s = [XᵀX, Xᵀy, yᵀy], recalling that s = Σi t(x(i), y(i)), so that [Δ(XᵀX)jk, Δ(Xᵀy)j, Δy²] = [wx², wx wy, wy²]. The number of unique elements² in s is [d(d+1)/2, d, 1], so Δs = wx² d(d+1)/2 + wx wy d + wy². The noisy sufficient statistics fit for public release are z = [zi ∼ Lap(si, Δs/ε) : si ∈ s].

²Note that XᵀX is symmetric.

Figure 1: Private regression model.

3.2 Noise-naive method

Previous work developed methods to obtain OLS solutions via SSP by ignoring the noise injected into the sufficient statistics [Awan and Slavkovic, 2018, Sheffet, 2017, Wang, 2018]. One corresponding approach for Bayesian regression is to naively replace s in Figure 1 with the noisy version z and then perform the conjugate update in Equation (1). This noise-naive method (Naive) is simple and fast, and we empirically show in Section 4 that it produces an asymptotically correct posterior.

3.3 Noise-aware inference

Instead of ignoring the noise introduced by the privacy mechanism, we propose to perform inference over the noise in the model in Figure 1 in order to produce correct posteriors regardless of the data size. The biggest change from non-private to private Bayesian linear regression is that due to privacy constraints we can no longer condition on the covariate data X. The non-private posterior is p(θ, σ² | X, y) ∝ p(θ, σ²) p(y | X, θ, σ²), while the private posterior is p(θ, σ² | z) ∝ ∫ p(X) p(θ, σ²) p(y | X, θ, σ²) p(z | X, y) dX dy (see derivations in the supplementary material).
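As an illustrative sketch of the release step of Section 3.1 followed by the Naive conjugate update of Section 3.2 (our code, not the authors'; names are ours, and we assume data bounded in [−1, 1] so wx = wy = 2):

```python
import numpy as np

def release_suff_stats(X, y, w_x, w_y, epsilon, rng):
    """Perturb s = [X^T X, X^T y, y^T y] with the Laplace mechanism (Section 3.1)."""
    d = X.shape[1]
    # Total sensitivity over the unique entries of s (X^T X is symmetric).
    delta_s = w_x**2 * d * (d + 1) / 2 + w_x * w_y * d + w_y**2
    scale = delta_s / epsilon
    noise = rng.laplace(0.0, scale, size=(d, d))
    noise = np.triu(noise) + np.triu(noise, 1).T   # one noise draw per unique entry
    XtX = X.T @ X + noise
    Xty = X.T @ y + rng.laplace(0.0, scale, size=d)
    yty = y @ y + rng.laplace(0.0, scale)
    return XtX, Xty, yty

def naive_nig_update(XtX, Xty, yty, n, mu0, Lam0, a0, b0):
    """Equation (1) applied directly to (possibly noisy) statistics, i.e. Naive."""
    Lam_n = XtX + Lam0
    mu_n = np.linalg.solve(Lam_n, Xty + Lam0 @ mu0)
    a_n = a0 + n / 2
    b_n = b0 + 0.5 * (yty + mu0 @ Lam0 @ mu0 - mu_n @ Lam_n @ mu_n)
    return mu_n, Lam_n, a_n, b_n

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.clip(X @ np.array([0.5, -0.3]), -1, 1)
XtX, Xty, yty = release_suff_stats(X, y, w_x=2.0, w_y=2.0, epsilon=1.0, rng=rng)
post = naive_nig_update(XtX, Xty, yty, n=100, mu0=np.zeros(2), Lam0=np.eye(2), a0=20.0, b0=0.5)
```

As ε grows the injected noise vanishes and the Naive update converges to the exact conjugate update, which matches the asymptotic-correctness behavior reported in Section 4.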
The private posterior contains the term p(X), which means that in order to calculate it we need to know something about the distribution of X!

Given an explicitly specified prior p(X), we can perform inference over the model in Figure 1 using general-purpose MCMC algorithms. We use the No-U-Turn Sampler [Hoffman and Gelman, 2014] from the PyMC3 package [Salvatier et al., 2016], and call this method noise-aware individual-based inference (MCMC-Ind). This approach is simple to implement using existing tools but places a substantial burden on the modeler relative to the non-private case by requiring an explicit prior distribution p(X), with poor choices potentially leading to incorrect inferences. Additionally, because MCMC-Ind instantiates latent variables for each individual, its runtime scales with population size and it may be slow for large populations.

3.4 Sufficient statistics-based inference

An appealing possibility is to marginalize out the variables X and y representing individuals and instead perform inference directly over the latent sufficient statistics s. The joint distribution is p(θ, σ², s, z) = p(θ, σ²) p(s | θ, σ²) p(z | s). The goal is to compute a representation of p(θ, σ² | z) ∝ ∫ p(θ, σ², s, z) ds by integrating over the sufficient statistics. Because this distribution cannot be written in closed form, we develop a Gibbs sampler to sample from the posterior as done by Bernstein and Sheldon [2018] for unconditional exponential family models. This requires methods to sample from the conditional distributions of both the parameters (θ, σ²) and the sufficient statistics s given all other variables. The full conditional p(θ, σ² | s) for the model parameters can be computed and sampled using conjugacy, exactly as in the non-private case. The full conditional for s factors into two terms: p(s | θ, σ², z) ∝ p(s | θ, σ²) p(z | s).
The first is the distribution over sufficient statistics of the regression model, for which we develop an asymptotically correct normal approximation. The second is the noise model due to the privacy mechanism, for which we use variable augmentation to ensure it is possible to sample from the full conditional distribution of s.

3.4.1 Normal approximation of s

The conditional distribution over the sufficient statistics given the model parameters is

p(s | θ, σ²) = ∫_{t⁻¹(s)} p(X, y | θ, σ²) dX dy,    t⁻¹(s) := {X, y : t(X, y) = s}.

The integral over t⁻¹(s), all possible populations which have sufficient statistics s, is intractable to compute. Instead we observe that the components of s = Σi t(x(i), y(i)) are sums over individuals. Therefore, using the central limit theorem (CLT), we approximate their distribution as p(s | θ, σ²) ≈ N(s; nµt, nΣt), where µt = E[t(x, y)] and Σt = Cov(t(x, y)) are the mean and covariance of the function t(x, y) on a single individual. This approximation is asymptotically correct, i.e., (1/√n)(s − nµt) → N(0, Σt) in distribution [Bickel and Doksum, 2015]. We write the conditional distribution as

s | · ∼ N(nµt, nΣt),    µt = [E[vec(xxᵀ)], E[xy], E[y²]],        (2)

Σt = [ Cov(vec(xxᵀ))      Cov(vec(xxᵀ), xy)   Cov(vec(xxᵀ), y²)
       Cov(xy, vec(xxᵀ))  Cov(xy)             Cov(xy, y²)
       Cov(y², vec(xxᵀ))  Cov(y², xy)         Var(y²) ].        (3)

The components of µt and Σt can be written in terms of the model parameters (θ, σ²) and the second and fourth non-central moments of x as shown below, where we have defined ηij := E[xi xj], ηijkl := E[xi xj xk xl], and ξij,kl := Cov(xi xj, xk xl) = ηijkl − ηij ηkl. Full derivations can be found in the supplementary material.
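As a quick Monte Carlo sanity check (ours, not the paper's) of the first two components of µt, take the toy covariate distribution x = [1, u] with u ∼ N(0, 1), for which η = E[xxᵀ] is the identity, and compare the analytic moments E[xi y] = Σj θj ηij and E[y²] = σ² + θᵀηθ against sample averages:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma = np.array([0.5, -0.3]), 0.2

# x = [1, u] with u ~ N(0, 1), so eta = E[x x^T] is the 2x2 identity here.
n = 500_000
x = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = x @ theta + sigma * rng.standard_normal(n)
eta = np.eye(2)

# Analytic first two components of mu_t.
exy_analytic = eta @ theta                    # E[x_i y] = sum_j theta_j eta_ij
ey2_analytic = sigma**2 + theta @ eta @ theta  # E[y^2] = sigma^2 + theta^T eta theta

# Monte Carlo estimates of the same quantities.
exy_mc = (x * y[:, None]).mean(axis=0)
ey2_mc = (y**2).mean()
```

With 500k samples the Monte Carlo error is on the order of 10⁻³, so the estimates agree with the analytic formulas to two decimal places.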
We call this family of methods Gibbs-SS.

E[xi y] = Σj θj ηij
E[y²] = σ² + Σi,j θi θj ηij
Cov(xi xj, xk y) = Σl θl ξij,kl
Cov(xi xj, y²) = Σk,l θk θl ξij,kl
Cov(xi y, xj y) = σ² ηij + Σk,l θk θl ξik,jl
Cov(xi y, y²) = Σj,k,l θj θk θl ξij,kl + 2σ² Σj θj ηij
Var(y²) = 2σ⁴ + Σi,j,k,l θi θj θk θl ξij,kl + 4σ² Σi,j θi θj ηij

To use this normal distribution for sampling, we need the parameters (θ, σ²) and the moments ηij, ηijkl, and ξij,kl. The current parameter values are available within the sampler, but the modeler must provide estimates for the moments of x, either using prior knowledge or by (privately) estimating the moments from the data. We discuss three specific possibilities in Section 3.4.4.

Once again, more modeling assumptions are needed than in the non-private case, where it is possible to condition on x. Gibbs-SS requires milder assumptions (second and fourth moments), however, than MCMC-Ind (a full prior distribution).

3.4.2 Variable augmentation for p(z | s)

The above approximation for the distribution over sufficient statistics means the full conditional distribution involves the product of a normal and a Laplace distribution,

p(s | θ, σ², z) ∝ N(s; nµt, nΣt) · Lap(z; s, Δs/ε).

It is unclear how to sample from this distribution directly. A similar situation arises in the Bayesian Lasso, where it is solved by variable augmentation [Park and Casella, 2008]. Bernstein and Sheldon [2018] adapted the variable augmentation scheme to private inference in exponential family models. We take the same approach here, and represent a Laplace random variable as a scale mixture of normals.
Specifically, l ∼ Lap(u, b) is identically distributed to l ∼ N(u, ω²) where the variance ω² ∼ Exp(1/(2b²)) is drawn from the exponential distribution (with density 1/(2b²) exp(−ω²/(2b²))). We augment separately for each component of the vector z so that z ∼ N(s, diag(ω²)), where ωj² ∼ Exp(ε²/(2Δs²)). The augmented full conditional p(s | θ, σ², z, ω) is a product of two multivariate normal distributions, which is itself a multivariate normal distribution.

3.4.3 The Gibbs sampler

The full generative process is:

θ, σ² ∼ NIG(θ, σ²; µ0, Λ0, a0, b0)
s ∼ N(nµt, nΣt)
ωj² ∼ Exp(ε²/(2Δs²)) for all j
z ∼ N(s, diag(ω²))

The corresponding Gibbs sampler is shown in Algorithm 1.

Algorithm 1 Gibbs Sampler
1: Initialize θ, σ², ω²
2: repeat
3:   Calculate µt and Σt via Eqs. (2) and (3)
4:   s ∼ NormProduct(nµt, nΣt, z, diag(ω²))
5:   θ, σ² ∼ NIG(θ, σ²; µn, Λn, an, bn) via Eqn. (1)
6:   1/ωj² ∼ InverseGaussian(ε/(Δs |zj − sj|), ε²/Δs²) for all j

Subroutine NormProduct
1: input: µ1, Σ1, µ2, Σ2
2: Σ3 = (Σ1⁻¹ + Σ2⁻¹)⁻¹
3: µ3 = Σ3 (Σ1⁻¹ µ1 + Σ2⁻¹ µ2)
4: return: N(µ3, Σ3)

The update for ω² follows Park and Casella [2008]; the inverse Gaussian density is InverseGaussian(w; m, v) = √(v/(2πw³)) exp(−v(w − m)²/(2m²w)). Note that the resulting s drawn from p(s | µt, Σt, ω²) may require projection onto the space of valid sufficient statistics. This can be done by observing that if A = [X, y] then the sufficient statistics are contained in the positive-semidefinite (PSD) matrix B = AᵀA.
For a randomly drawn s, we project if necessary so the corresponding B matrix is PSD.

3.4.4 Distribution over X

As discussed above, Gibbs-SS requires the second and fourth population moments of x to calculate µt and Σt. We propose three different options for the modeler to provide these and discuss the algorithmic considerations for each. Because we include the unit feature in x we can restrict our attention to the fourth moment E[x⊗4], which includes the second moment as a subcomponent.

Private sample moments (Gibbs-SS-Noisy). The first option is to estimate population moments privately by computing the fourth sample moments from X and privately releasing them via the Laplace mechanism. The sensitivity of the estimate for ηijkl is wx⁴, and for d = 2 there are D = 5 unique entries, for a total sensitivity of D wx⁴. This approach requires splitting the privacy budget between the release mechanisms for sufficient statistics and moments, which we do evenly. We do not perform inference over the noisy sample moments, which may introduce some miscalibration of uncertainty. Pursuing this additional layer of inference is an interesting avenue for future work.

Moments from generic prior (Gibbs-SS-Prior). A second option is to propose a prior distribution p(x) and obtain population moments directly from the prior, either through known formulas or from Monte Carlo estimation. This approach does not access the individual data and does not consume any privacy budget, but requires proposing a prior distribution and computing the fourth moments of x (once) for that distribution.

Hierarchical normal prior (Gibbs-SS-Update). A final option is to perform inference over the data moments by specifying an individual-level prior p(x) and then marginalizing away individuals, as we did for the regression model sufficient statistics.
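The projection onto valid sufficient statistics can be sketched as follows. This is our sketch; the paper does not spell out the projection operator, so we use the common eigenvalue-clipping projection onto the PSD cone:

```python
import numpy as np

def project_psd(B):
    """Project a symmetric matrix onto the PSD cone by clipping negative
    eigenvalues to zero (the nearest PSD matrix in Frobenius norm)."""
    B = (B + B.T) / 2                      # symmetrize first
    eigvals, eigvecs = np.linalg.eigh(B)
    return eigvecs @ np.diag(np.clip(eigvals, 0.0, None)) @ eigvecs.T

B_noisy = np.array([[2.0, 3.0], [3.0, 1.0]])   # indefinite: eigenvalues ~ 4.19 and -1.19
B_psd = project_psd(B_noisy)
```

A matrix that is already PSD passes through unchanged, so the projection only activates when the sampled s leaves the valid region.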
We propose a hierarchical normal prior, as shown in Figure 2a, which is more dispersed than a normal distribution and allows the modeler to propose vague priors, but still permits attainable conditional updates. The data x is normally distributed: x ∼ N(µx, τ²), with parameters drawn from the normal-inverse Wishart (NIW) conjugate prior distribution, µx, τ² ∼ NIW(µ00, λ00, ψ00, ν00). After marginalizing individuals, the latent quantities are the sufficient statistics XᵀX (which includes the sample mean and covariance because of the unit feature). For fixed parameters (µx, τ²) the distribution p(x) is multivariate normal, and we calculate its fourth moments as the fourth derivative (via automatic differentiation) of its moment generating function.

However, we introduced the new latent variables µx and τ² into the full model (see Figure 2a) and must now derive conditional updates for them within the Gibbs sampler. Naively marginalizing X and y from the full model in Figure 2a would cause both (µx, τ²) and (θ, σ²) to be parents of s and thus not conditionally independent given s; this would require their updates to be coupled and we could no longer use simple conjugacy formulas for each component of the model. To avoid this issue, we reformulate the joint distribution as represented in Figure 2b. The justification for this is as follows.

Figure 2: (a) Private Bayesian linear regression model with hierarchical normal data prior. (b) Alternative data model configuration and (c) with individual variables marginalized out.
Because XᵀX is a sufficient statistic for p(X) under a normal model, we can encode the generative process either as (µx, τ²) → X → XᵀX or as (µx, τ²) → XᵀX → X. In general, the latter formulation would require an arrow from (µx, τ²) to X; this drops precisely because XᵀX is a sufficient statistic [Casella and Berger, 2002]. Then, upon marginalizing X and y, we obtain the model in Figure 2c. The two sets of parameters are now conditionally independent given the sufficient statistics s, and can be updated independently as standard conjugate updates.

4 Experiments

We design experiments to measure the calibration and utility of the private methods. Calibration measures how close the computed posterior is to p(θ, σ² | z), the correct posterior given noisy statistics. Utility measures how close the computed posterior is to the non-private posterior p(θ, σ² | s).

4.1 Methods

The noise-aware individual-based method (MCMC-Ind) is implemented using PyMC3 [Salvatier et al., 2016]; it runs with 500 burn-in iterations and collects 2000 posterior samples. The three flavors of noise-aware sufficient statistic-based methods use noisy sample moments (Gibbs-SS-Noisy), use moments sampled from a data prior (Gibbs-SS-Prior), and use an updated hierarchical normal prior (Gibbs-SS-Update); all three collect 20000 posterior samples after 5000 and 20000 burn-in iterations for n ∈ [10, 100] and n = 1000, respectively. We compare against the baseline noise-naive method (Naive) and the non-private posterior (Non-Private); both collect 2000 posterior samples.

4.2 Evaluation on synthetic data

Evaluation measures. We adapt a method of Cook et al. [2006] to measure calibration. Consider a model p(θ, w) = p(θ) p(w | θ). If (θ0, w0) ∼ p(θ, w), then, for any j, the quantile of θ0j in the true posterior p(θj | w0) is a uniform random variable.
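The Cook et al. [2006] check can be sketched on a toy conjugate model (our illustration, not the paper's regression setup): repeatedly draw (θ0, w0) from a normal-normal model, compute the quantile of θ0 under the exact posterior, and measure how far those quantiles are from uniform:

```python
import math
import numpy as np

def ks_to_uniform(u):
    """KS statistic: maximum distance between the empirical CDF of u and the uniform CDF."""
    u = np.sort(u)
    m = len(u)
    ecdf_hi = np.arange(1, m + 1) / m
    ecdf_lo = np.arange(0, m) / m
    return max(np.max(ecdf_hi - u), np.max(u - ecdf_lo))

rng = np.random.default_rng(2)
M = 500
quantiles = np.empty(M)
for t in range(M):
    theta0 = rng.standard_normal()          # theta ~ N(0, 1) prior
    w0 = theta0 + rng.standard_normal()     # w | theta ~ N(theta, 1)
    # Exact posterior: theta | w0 ~ N(w0 / 2, 1 / 2); quantile of theta0 under it.
    quantiles[t] = 0.5 * (1 + math.erf((theta0 - w0 / 2) / math.sqrt(2 * 0.5)))

ks = ks_to_uniform(quantiles)   # small when the evaluated posterior is exact
```

When the evaluated posterior is exact, as here, the quantiles are uniform and the KS statistic shrinks at rate O(1/√M); a miscalibrated posterior produces a visibly non-diagonal QQ plot and a larger KS statistic.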
We can check our approximate posterior p̂ by computing the quantile uj of θ0j in p̂(θj | w0) and testing for uniformity of uj over M trials. We test for uniformity using the Kolmogorov-Smirnov (KS) goodness-of-fit test [Massey Jr., 1951]. The KS statistic is the maximum distance between the empirical CDF of uj and the uniform CDF; lower values are better and zero corresponds to perfect uniformity, meaning p̂ is exact.

While this test is elegant, it requires that parameters and data are drawn from the model used by the method. We use θ, σ² ∼ NIG([0, 0], diag([.5/20, .5/20]), 20, .5). In addition, for Gibbs-SS-Prior and Gibbs-SS-Update, the test requires the covariate data be drawn from the data prior used by the methods. We specify µx, τ² ∼ NIW(0, 1, 1, 50) and x ∼ N(µx, τ²). These ensure at least 95% of x and y values are within [−1, 1]. We compute sensitivity assuming data bounded in this range, but do not enforce it to avoid changing the generative process (a limitation of the evaluation method, not the inference routine). For each combination of n and ε we run M = 300 trials. We qualitatively assess calibration with the empirical CDFs, which is also the quantile-quantile (QQ) plot between the empirical distribution of uj and the uniform distribution. A diagonal line indicates that uj is perfectly uniform.

Between two calibrated posteriors, the tighter posterior will provide higher utility.³ We evaluate utility as closeness to the non-private posterior, which we measure with maximum mean discrepancy (MMD), a kernel-based statistical test to determine if two sets of samples are drawn from different distributions [Gretton et al., 2012]. Given m i.i.d.
samples (p, q) ∼ P × Q, an unbiased estimate of the squared MMD is

MMD²(P, Q) = 1/(m(m − 1)) Σ_{i≠j} (k(pi, pj) + k(qi, qj) − k(pi, qj) − k(pj, qi)),

where k is a continuous kernel function; we use a standard normal kernel. The higher the value, the more likely the two samples are drawn from different distributions; therefore, lower MMD between Non-Private and a method indicates higher utility.

We measure method runtime as the average process time over the 300 trials. Note that PyMC3 provides parallelization; we report total process time across all chains for MCMC-Ind.

Results. Calibration results are shown in Figures 3a and 3b. The QQ plot for n = 10 and ε = 0.1 is shown in Figure 3c. Coverage results for 95% credible intervals are shown in Figure 3d. All four noise-aware methods are at or near the calibration level of the non-private method, and better than Naive's calibration, regardless of data size. As expected, Gibbs-SS-Noisy suffers slight miscalibration from not accounting for the noise injected into the privately released fourth data moment. There is slight miscalibration in certain settings and parameters for Gibbs-SS-Prior due to approximations in the calculation of multivariate normal distribution fourth moments from a data prior. Utility results are shown in Figure 3e; the noise-aware methods provide at least as good utility as Naive.

Running time. Figure 3f shows running time as a function of population size. We see that MCMC-Ind scales with increasing population size, and in fact is prohibitive to run at sizes significantly larger than n = 100, while all variants of Gibbs-SS are constant with respect to population size. It is also possible to analytically derive the asymptotic running time with respect to covariate dimension d. The most expensive operation used by Gibbs-SS is the inversion of the covariance matrix (defined in Equation 3) in the NormProduct subroutine on Line 4 of Algorithm 1.
This matrix has dimension (d² + d + 1) × (d² + d + 1), where d² + d + 1 is the total number of components in t(x, y) = [vec(xxᵀ), xy, y²]. Cubic matrix inversion would cost O((d² + d + 1)³) = O(d⁶). Modern computing platforms can reasonably invert matrices of size 10K or more, corresponding to linear regression with d ≈ 100 features.

4.3 Predictive posteriors on real data

We evaluate the predictive posteriors of the methods on a real-world data set measuring the effect of drinking rate on cirrhosis rate.⁴ We scale both covariate and response data to [0, 1].⁵ There are 46 total points, which we randomly split into 36 training examples and 10 test points for each trial. After preliminary exploration to gain domain knowledge, we set a reasonable model prior of θ, σ² ∼ NIG([1, 0], diag([.25, .25]), 20, .5). We draw samples θ(k), σ²(k) from the posterior given training data, and then form the posterior predictive distribution for each test point yi from these samples.

Figure 4: Coverage for predictive posterior 50% and 90% credible intervals.

³Note that the prior itself is a calibrated distribution.
⁴http://people.sc.fsu.edu/~jburkardt/datasets/regression/x20.txt
⁵This step is not differentially private, but is standard in existing work. A reasonable assumption is that data bounds are a priori available due to domain knowledge.

Figure 3: Synthetic data results: (a) calibration vs. n for ε = 0.1; (b) calibration vs. ε for n = 10; (c) QQ plot for n = 10 and ε = 0.1; (d) 95% credible interval coverage; (e) MMD of methods to non-private posterior; (f) method runtimes for ε = 0.1.

Figure 4 shows coverage of 50% and 90% credible intervals on 1000 test points collected over 100 random train-test splits.
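Returning to the utility metric of Section 4.2, the unbiased MMD² estimator can be sketched as follows (our sketch; we use a Gaussian kernel for the "standard normal kernel", and the 1-D samples are illustrative):

```python
import numpy as np

def mmd2_unbiased(p, q, bandwidth=1.0):
    """Unbiased estimate of MMD^2 between equal-size 1-D samples p and q,
    with Gaussian kernel k(a, b) = exp(-(a - b)^2 / (2 * bandwidth^2))."""
    m = len(p)
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :])**2 / (2 * bandwidth**2))
    Kpq = k(p, q)
    # Per (i, j): k(p_i, p_j) + k(q_i, q_j) - k(p_i, q_j) - k(p_j, q_i).
    terms = k(p, p) + k(q, q) - Kpq - Kpq.T
    off_diag = ~np.eye(m, dtype=bool)       # the estimator excludes i == j terms
    return terms[off_diag].sum() / (m * (m - 1))

rng = np.random.default_rng(3)
same = rng.standard_normal(200)
shifted = same + 3.0
```

When the two sample sets are identical the paired terms cancel exactly and the estimate is zero, while well-separated distributions give a clearly positive value; this is the sense in which lower MMD to Non-Private indicates higher utility.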
Non-Private achieves nearly correct coverage, with the discrepancy due to the fact that the data is not actually drawn from the prior. Gibbs-SS-Noisy achieves nearly the coverage of Non-Private, while Naive is drastically worse in this regime. We note that this experiment highlights an advantage of Gibbs-SS-Noisy: it does not need an explicitly defined data prior, as it requires only the same parameter prior that is needed in non-private analysis.

5 Conclusion

In this work we developed methods to perform Bayesian linear regression in a differentially private way. Our algorithms use sufficient-statistic perturbation as a release mechanism, followed by specially designed Markov chain Monte Carlo techniques to sample from the posterior distribution given noisy sufficient statistics. Unlike in the non-private case, we cannot condition on covariates, so some assumptions about the covariate distribution are required. We proposed methods that require only moments of this distribution, and evaluated several ways to obtain the needed moments within the sampling routine.

Our algorithms are the first specifically designed for the task of differentially private Bayesian linear regression, and the first to properly account for the noise mechanism during inference. Our inferred posterior distributions are well calibrated, and are better in terms of both calibration and utility than naive SSP, which is considered a state-of-the-art baseline.

Our evaluation focused on calibration and utility of the posterior. Future work could evaluate the quality of point estimates obtained as a byproduct of our fully Bayesian algorithms. We expect such point estimates to be as good as or better than those of naive SSP, which is state-of-the-art for private linear regression [Wang, 2018]. Compared with prior work using naive SSP for linear regression, our methods are Bayesian, and perform inference over the noise mechanism.
Being Bayesian is not expected to hurt point estimation, and inference over the noise mechanism is expected not to hurt, and potentially to improve, point estimation.

References

Alan Agresti and Barbara Finlay. Statistical methods for the social sciences. 2009.

Jordan Awan and Aleksandra Slavkovic. Structure and sensitivity in differential privacy: Comparing k-norm mechanisms. arXiv preprint arXiv:1801.09236, 2018.

Andrés F. Barrientos, Jerome P. Reiter, Ashwin Machanavajjhala, and Yan Chen. Differentially private significance tests for regression coefficients. Journal of Computational and Graphical Statistics, 2019.

Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 464–473. IEEE, 2014.

Garrett Bernstein and Daniel R. Sheldon. Differentially private Bayesian inference for exponential families. In Advances in Neural Information Processing Systems, pages 2919–2929, 2018.

Garrett Bernstein, Ryan McKenna, Tao Sun, Daniel Sheldon, Michael Hay, and Gerome Miklau. Differentially private learning of undirected graphical models using collective graphical models. In International Conference on Machine Learning, pages 478–487, 2017.

Peter J. Bickel and Kjell A. Doksum.
Mathematical statistics: basic ideas and selected topics, volume I. CRC Press, 2015.

George Casella and Roger Lee Berger. Statistical Inference, chapter 6. Thomson Learning, 2002.

Samantha R. Cook, Andrew Gelman, and Donald B. Rubin. Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics, 15(3):675–692, 2006.

Christos Dimitrakakis, Blaine Nelson, Aikaterini Mitrokotsa, and Benjamin I. P. Rubinstein. Robust and private Bayesian inference. In International Conference on Algorithmic Learning Theory, pages 291–305. Springer, 2014.

Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 2014.

Cynthia Dwork and Adam Smith. Differential privacy for statistics: What we know and what we want to learn. Journal of Privacy and Confidentiality, 1(2), 2010.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

James Foulds, Joseph Geumlek, Max Welling, and Kamalika Chaudhuri. On the theory and practice of privacy-preserving Bayesian data analysis. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI'16, pages 192–201, 2016.

Joseph Geumlek, Shuang Song, and Kamalika Chaudhuri. Rényi differential privacy mechanisms for posterior sampling. In Advances in Neural Information Processing Systems, pages 5295–5304, 2017.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

Matthew D. Hoffman and Andrew Gelman. The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo.
Journal of Machine Learning Research, 15(1):1593–1623, 2014.

Antti Honkela, Mrinal Das, Arttu Nieminen, Onur Dikmen, and Samuel Kaski. Efficient differentially private learning improves drug sensitivity prediction. Biology Direct, 13(1):1, 2018.

Joonas Jälkö, Onur Dikmen, and Antti Honkela. Differentially private variational inference for non-conjugate models. In Uncertainty in Artificial Intelligence 2017, Proceedings of the 33rd Conference (UAI), 2017.

Vishesh Karwa, Aleksandra B. Slavković, and Pavel Krivitsky. Differentially private exponential random graphs. In International Conference on Privacy in Statistical Databases, pages 143–155. Springer, 2014.

Vishesh Karwa, Aleksandra Slavković, et al. Inference using noisy degrees: Differentially private beta-model and synthetic graphs. The Annals of Statistics, 44(1):87–112, 2016.

Daniel Kifer and Ashwin Machanavajjhala. No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 193–204. ACM, 2011.

Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory (COLT), 2012.

Frank J. Massey Jr. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68–78, 1951.

Frank McSherry and Ilya Mironov. Differentially private recommender systems: Building privacy into the Netflix Prize contenders. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 627–636. ACM, 2009.

Kentaro Minami, Hitomi Arai, Issei Sato, and Hiroshi Nakagawa. Differential privacy without sensitivity. In Advances in Neural Information Processing Systems, pages 956–964, 2016.

Anthony O'Hagan and Jonathan Forster.
Kendall's Advanced Theory of Statistics, volume 2B: Bayesian Inference. 1994.

Mijung Park, James Foulds, Kamalika Chaudhuri, and Max Welling. Variational Bayes in private settings (VIPS). arXiv preprint arXiv:1611.00340, 2016.

Trevor Park and George Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.

Alvin C. Rencher. Methods of Multivariate Analysis, volume 492. John Wiley & Sons, 2003.

John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55, April 2016. doi: 10.7717/peerj-cs.55. URL https://doi.org/10.7717/peerj-cs.55.

Aaron Schein, Zhiwei Steven Wu, Mingyuan Zhou, and Hanna Wallach. Locally private Bayesian inference for count models. NIPS 2017 Workshop: Advances in Approximate Bayesian Inference, 2018.

Or Sheffet. Differentially private ordinary least squares. In Proceedings of the 34th International Conference on Machine Learning, 2017.

Adam Smith. Efficient, differentially private point estimators. arXiv preprint arXiv:0809.4794, 2008.

Duy Vu and Aleksandra Slavkovic. Differential privacy for clinical trial data: Preliminary evaluations. In Data Mining Workshops, 2009. ICDMW'09. IEEE International Conference on, pages 138–143. IEEE, 2009.

Yu-Xiang Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. In Conference on Uncertainty in Artificial Intelligence (UAI), 2018.

Yu-Xiang Wang, Stephen Fienberg, and Alex Smola. Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2493–2502, 2015.

Oliver Williams and Frank McSherry. Probabilistic inference and differential privacy.
In Advances in Neural Information Processing Systems, pages 2451–2459, 2010.

Jun Zhang, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, and Marianne Winslett. Functional mechanism: regression analysis under differential privacy. Proceedings of the VLDB Endowment, 5(11):1364–1375, 2012.

Zuhe Zhang, Benjamin I. P. Rubinstein, and Christos Dimitrakakis. On the differential privacy of Bayesian inference. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.