{"title": "Nonparametric learning from Bayesian models with randomized objective functions", "book": "Advances in Neural Information Processing Systems", "page_first": 2071, "page_last": 2081, "abstract": "Bayesian learning is built on an assumption that the model space contains a true reflection of the data generating mechanism. This assumption is problematic, particularly in complex data environments. Here we present a Bayesian nonparametric approach to learning that makes use of statistical models, but does not assume that the model is true. Our approach has provably better properties than using a parametric model and admits a Monte Carlo sampling scheme that can afford massive scalability on modern computer architectures. The model-based aspect of learning is particularly attractive for regularizing nonparametric inference when the sample size is small, and also for correcting approximate approaches such as variational Bayes (VB). We demonstrate the approach on a number of examples including VB classifiers and Bayesian random forests.", "full_text": "Nonparametric learning from Bayesian models with\n\nrandomized objective functions\n\nSimon Lyddon\n\nDepartment of Statistics\n\nUniversity of Oxford\n\nOxford, UK\n\nlyddon@stats.ox.ac.uk\n\nStephen Walker\n\nDepartment of Mathematics\nUniversity of Texas at Austin\n\nAustin, TX\n\ns.g.walker@math.utexas.edu\n\nChris Holmes\n\nDepartment of Statistics\n\nUniversity of Oxford\n\nOxford, UK\n\ncholmes@stats.ox.ac.uk\n\nAbstract\n\nBayesian learning is built on an assumption that the model space contains a true\nre\ufb02ection of the data generating mechanism. This assumption is problematic, par-\nticularly in complex data environments. Here we present a Bayesian nonparametric\napproach to learning that makes use of statistical models, but does not assume\nthat the model is true. 
Our approach has provably better properties than using a parametric model and admits a Monte Carlo sampling scheme that can afford massive scalability on modern computer architectures. The model-based aspect of learning is particularly attractive for regularizing nonparametric inference when the sample size is small, and also for correcting approximate approaches such as variational Bayes (VB). We demonstrate the approach on a number of examples including VB classifiers and Bayesian random forests.

1 Introduction

Bayesian updating provides a principled and coherent approach to inference for probabilistic models [23], but is predicated on the model class being true. That is, for an observation x and a generative model Fθ(x) parametrized by a finite-dimensional parameter θ ∈ Θ, there is some parameter value θ0 ∈ Θ for which x ∼ Fθ0(x). In reality, however, all models are false. If the data is simple and small, and the model space is sufficiently rich, then the consequences of model misspecification may not be severe. However, data is increasingly being captured at scale, both in terms of the number of observations as well as the diversity of data modalities. This poses a risk in conditioning on an assumption that the model is true.

In this paper we discuss a scalable approach to Bayesian nonparametric learning (NPL) from models without the assumption that x ∼ Fθ0(x). To do this we use a nonparametric prior that is centered on a model but does not assume the model to be true. A concentration parameter, c, in the nonparametric prior quantifies trust in the baseline model, and this is subsequently reflected in the nonparametric update through the relative influence given to the model-based inference for θ.
In particular, c → ∞ recovers the standard model-based Bayesian update, while c = 0 leads to a Bayesian bootstrap estimator for the object of interest.

Our methodology can be applied in a number of situations, including:

[S1] Model misspecification: where we have used a parametric Bayesian model and we are concerned that the model may be misspecified.

[S2] Approximate posteriors: where for expediency we have used an approximate posterior, such as in variational Bayes (VB), and we wish to account for the approximation.

[S3] Direct updating from utility functions: where the sole purpose of the modelling task is to perform some action or take a decision under a well-specified utility function.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Our work builds upon previous ideas, including [21], who introduced the weighted likelihood bootstrap (WLB) as a way of generating approximate samples from the posterior of a well-specified Bayesian model. [19] highlighted that the WLB in fact provides an exact representation of uncertainty for the model parameters that minimize the Kullback-Leibler (KL) divergence, dKL(F0, Fθ), between the unknown data-generating distribution and the model likelihood fθ(x), and hence is well motivated regardless of model validity. These approaches, however, do not allow for the inclusion of prior knowledge and do not provide a Bayesian update as we do here.

A major underlying theme behind our paper, and indeed an open field for future research, is the idea of obtaining targeted posterior samples via the maximization of a suitably randomized objective function. The WLB randomizes the log-likelihood function, effectively providing samples which are randomized maximum likelihood estimates, whereas we randomize a more general objective function under a Bayesian nonparametric (NP) posterior.
The randomization takes into account knowledge captured through the choice of a model and parametric prior.

2 Foundations of Nonparametric Learning

We begin with the simplest scenario, namely [S1], concerning a possibly misspecified model, before moving on to more complicated situations. It is interesting to note that all of what follows can also be considered from a viewpoint of NP regularization, using a parametric model to centre a Bayesian NP analysis in a way that induces stability and parametric structure to the problem.

2.1 Bayesian updating of misspecified models

Suppose we have a parametric statistical model, FΘ = {fθ(·); θ ∈ Θ}, where for each θ ∈ Θ ⊆ R^p, fθ : X → R is a probability density. The conventional approach to Bayesian learning involves updating a prior distribution to a posterior through Bayes' theorem. This approach is well studied and well understood [3], but formally assumes that the model space contains the true data-generating mechanism. We will derive a posterior update under weaker assumptions.

Suppose that FΘ has been selected for the purpose of a prediction, or a decision, or some other modelling task. Consider the thought experiment where the modeller somehow gains access to Nature's true sampling distribution for the data, F0(x), which does not necessarily belong to FΘ. How should they then update their model?

With access to F0 the modeller can simply request an infinite training set, x1:∞ iid∼ F0, and then update to the posterior π(θ|x1:∞).
Under an infinite sample size all uncertainty is removed and for regular models the posterior concentrates at a point mass at θ0, the parameter value maximizing the expected log-likelihood, assuming that the prior has support there; i.e.

    θ0 = arg max_{θ∈Θ} lim_{n→∞} n⁻¹ Σ_{i=1}^n log fθ(xi) = arg max_{θ∈Θ} ∫_X log fθ(x) dF0(x).

It is straightforward to see that θ0 minimizes the KL divergence from the true data-generating mechanism to a density in FΘ,

    θ0 = arg max_{θ∈Θ} ∫_X log fθ(x) dF0(x) = arg min_{θ∈Θ} ∫_X log [f0(x)/fθ(x)] dF0(x).    (1)

This is true regardless of whether F0 is in the model space of FΘ and is well-motivated as the target of statistical model fitting [1, 8, 27, 5].

Uncertainty in this unknown value θ0 flows directly from uncertainty in F0. Of course F0 is unknown, but being "Bayesian" we can place a prior on it, π(F), for F ∈ F, that should reflect our honest uncertainty about F0. Typically the prior should have broad support unless we have special knowledge to hand, which is a problem with a parametric modelling approach that only supports a family of distribution functions indexed by a finite-dimensional parameter. The Bayesian NP literature however provides a range of priors for this sole purpose [14]. Once a prior for F is chosen, the correct way to propagate uncertainty about θ comes naturally from the posterior distribution for the law L[θ(F)|x1:n], via L[F|x1:n], where θ(F) = arg max_{θ∈Θ} ∫ log fθ(x) dF(x). The posterior for the parameter is then captured in the marginal by treating F as a latent auxiliary probability measure,

    π̃(θ | x1:n) = ∫_F π(θ, dF | x1:n) = ∫_F π(θ | F) π(dF | x1:n),    (2)

where π(θ|F) assigns probability 1 to θ = θ(F). We use π̃ to denote the NP update to distinguish it from the conventional Bayesian posterior π(θ|x1:n) ∝ π(θ) ∏_{i=1}^n fθ(xi), noting that in general the nonparametric posterior π̃(θ | x1:n) will be different to the standard Bayesian update as they are conditioning on different states of prior knowledge. In particular, as stated above, π(θ|x1:n) assumes artificially that F0 ∈ FΘ.

2.2 An NP prior using a MDP

For our purposes, the mixture of Dirichlet processes (MDP) [2] is a convenient vehicle for specifying prior beliefs π(F) centered on parametric models.¹ The MDP prior can be written as

    [F | θ] ∼ DP(c, fθ(·));    θ ∼ π(θ).    (3)

This is a mixture of standard Dirichlet processes with mixing distribution or hyper-prior π(θ), and concentration parameter c. We write this as F ∼ MDP(π(θ), c, fθ(·)).

The MDP provides a practical, simple posterior update. From the conjugacy property of the DP applied to (3), we have the conditional posterior update given data x1:n as

    [F | θ, x1:n] ∼ DP( c + n, (c/(c + n)) fθ(·) + (1/(c + n)) Σ_{i=1}^n δ_{xi}(·) ),    (4)

where δx denotes the Dirac measure at x. The concentration parameter c is an effective sample size, governing the trust we have in fθ(x).
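To make the conditional update (4) concrete, a draw from the DP posterior can be approximated by a finite discrete distribution: the n observations receive unit Dirichlet mass each, and T synthetic draws from the centering model share total mass c, so the model carries c/(c + n) of the probability on average. The following is a minimal sketch of one such draw, not the authors' code; the N(θ, 1) centering model, function names, and settings are illustrative assumptions.

```python
import numpy as np

def draw_F(x, theta, c, T=1000, rng=None):
    """Approximate one draw from the conditional DP posterior in (4),
    DP(c + n, (c * f_theta + sum_i delta_{x_i}) / (c + n)),
    using an illustrative N(theta, 1) centering model.
    Returns atoms and weights of a discrete distribution approximating F."""
    rng = np.random.default_rng(rng)
    n = len(x)
    # T synthetic draws from the centering model stand in for f_theta
    pseudo = rng.normal(theta, 1.0, size=T)
    atoms = np.concatenate([x, pseudo])
    # Dirichlet concentration: mass 1 per data point, c/T per pseudo-point
    alpha = np.concatenate([np.ones(n), np.full(T, c / T)])
    weights = rng.dirichlet(alpha)
    return atoms, weights

# usage: one posterior draw of the mean functional, ∫ x dF(x)
atoms, w = draw_F(x=np.array([0.1, -0.3, 0.7]), theta=0.0, c=2.0, rng=0)
mean_F = float(np.dot(w, atoms))
```

As c → 0 the pseudo-points lose all mass and the draw reduces to a Bayesian bootstrap over the data alone; as c grows the centering model dominates.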
The marginal posterior distribution for L[F|x1:n] can be written as

    π(dF | x1:n) = ∫_Θ π(dF | θ, x1:n) π(θ | x1:n) dθ,    (5)

i.e.

    [F | x1:n] ∼ MDP( π(θ | x1:n), c + n, (c/(c + n)) fθ(·) + (1/(c + n)) Σ_{i=1}^n δ_{xi}(·) ).    (6)

The mixing distribution in (6) coincides with the parametric Bayesian posterior π(θ|x1:n), assuming there are no ties in the data [2], although as noted above it does not follow that the NP marginal π̃(θ|x1:n) is equivalent to the parametric Bayesian posterior π(θ|x1:n).

We can see from the form of the conditional MDP (4) that the sampling distribution of the centering model, fθ(x), regularizes the influence of the empirical data Σ_{i=1}^n δ_{xi}(·). The resulting NP posterior (5) combines the information from the posterior distribution of the centering model π(θ|x1:n) with the information in the empirical distribution of the data. This leads to a simple and highly parallelizable Monte Carlo sampling scheme, as shown below.

2.3 Monte Carlo conditional maximization

The marginal in (2) facilitates a Monte Carlo estimator for functionals of interest under the posterior, which we write as G = ∫_Θ g(θ) π̃(θ|x1:n) dθ.
This is achieved by sampling π(θ, dF|x1:n) jointly from the posterior,

    ∫_Θ g(θ) π̃(θ | x1:n) dθ ≈ (1/B) Σ_{i=1}^B g(θ^(i)),

    θ^(i) = θ(F^(i)) = arg max_{θ∈Θ} ∫_X log fθ(x) dF^(i)(x),    (7)

    F^(i) ∼ π(dF | x1:n).    (8)

This involves an independent Monte Carlo draw (8) from the MDP marginal, followed by a conditional maximization of an objective function (7) to obtain each θ^(i). This Monte Carlo conditional maximization (MCCM) sampler is highly amenable to fast implementation on distributed computer architectures; given the parametric posterior samples, each NP posterior sample, F^(i), can be computed independently and in parallel from (8).

We can see from (6) that the parametric posterior samples are not required if c = 0. If c > 0 it may be computationally intensive to generate samples from the parametric posterior. However, as we will see next, we do not need to sample from this posterior exactly. This makes the approach particularly attractive to fast, tractable approximations for π(θ|x1:n), such as a variational Bayes (VB) posterior approximation.

¹The MDP should not be confused with the Dirichlet process mixture model (DPM) [18].
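The MCCM scheme of (7)–(8) fits in a few lines of code. The sketch below is illustrative rather than the paper's implementation: it assumes a N(θ, 1) centering model, for which the weighted MLE in (7) reduces to a weighted mean, and uses placeholder data and a placeholder parametric posterior sampler.

```python
import numpy as np

def mccm_sample(x, post_theta_sampler, c, B=2000, T=100, rng=None):
    """Monte Carlo conditional maximization, sketched for a N(theta, 1)
    centering model.  `post_theta_sampler(rng)` returns one draw from
    (an approximation to) the parametric posterior pi(theta | x1:n)."""
    rng = np.random.default_rng(rng)
    n = len(x)
    samples = np.empty(B)
    for i in range(B):
        theta = post_theta_sampler(rng)          # draw from mixing posterior
        pseudo = rng.normal(theta, 1.0, size=T)  # synthetic data from f_theta
        atoms = np.concatenate([x, pseudo])
        w = rng.dirichlet(np.concatenate([np.ones(n), np.full(T, c / T)]))
        # weighted MLE of a Gaussian location = weighted mean of the atoms
        samples[i] = np.dot(w, atoms)
    return samples

# usage with placeholder data and an approximate normal posterior draw
x = np.random.default_rng(1).exponential(size=50) - 1.0
post = lambda rng: rng.normal(x.mean(), 1.0 / np.sqrt(len(x)))
theta_tilde = mccm_sample(x, post, c=5.0, B=500, rng=2)
```

Each loop iteration is independent given the parametric posterior draws, which is what makes the sampler embarrassingly parallel in practice.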
The NP update corrects for the approximation in a computationally efficient manner, leading to a posterior distribution with optimal properties as shown below.

2.4 A more general construction

So far we have assumed, hypothetically, that:

(i) the modeller is interested in learning about the MLE under an infinite sample size, θ0 = arg max_θ ∫ log fθ(x) dF0(x), rather than α0 = arg max_α ∫ u(x, α) dF0(x) more generally, for a utility function u(x, α);

(ii) the parametric mixing distribution π(θ|x1:n) of the MDP posterior in (6) is constructed from the same centering model that defines the target parameter, θ0 = arg max_θ ∫ log fθ(x) dF0(x).

Both of these assumptions can be relaxed. For the latter case, it is valid to use a tractable parametric mixing distribution π(γ|x1:n) and baseline model fγ, while still learning about θ0 in (1) through the marginal π̃(θ|x1:n) as in (2), obtained via θ(F) and

    [F | x1:n] ∼ MDP( π(γ | x1:n), c + n, (c/(c + n)) fγ(·) + (1/(c + n)) Σ_i δ_{xi} ).    (9)

For (i), we can use the mapping α(F) = arg max_α ∫ u(x, α) dF(x) to derive the NPL posterior on actions or parameters maximizing some expected utility under a model-centered MDP posterior. This can be written as π̃(α|x1:n) = ∫ π(α|F) π(dF|x1:n), where π(α|F) assigns probability 1 to α = α(F).

This highlights a major theme of the paper: the idea of obtaining posterior samples via maximization of a suitably randomized objective function. In generality the target is α0 = arg max_α ∫ u(x, α) dF0(x), obtained by maximizing an objective function, and the randomization arises from the uncertainty in F0 through π(F|x1:n), which takes into account the information, and any misspecification, associated with a parametric centering model.

2.5 The Posterior Bootstrap algorithm

We will use the general construction of Section 2.4 to describe a sampling algorithm. We assume we have access to samples from the posterior parametric mixing distribution, π(γ|x1:n), in the MDP. In the case of model misspecification, [S1], if the data contains no ties, this is simply the parametric Bayesian posterior under {fγ(x), π(γ)}, for which there is a large literature of computational methods available for sampling; see for example [24]. If there are ties then we refer the reader to [2], or note that we can simply break ties by adding a new pseudo-variable, such as x* ∼ N(0, ε²) for small ε.

The sampling algorithm, found in Algorithm 1, is a mixture of Bayesian posterior bootstraps. After a sample γ^(i) is drawn from the mixing posterior, π(γ|x1:n), a posterior pseudo-sample x^(i)_{(n+1):(n+T)} iid∼ f_{γ^(i)}(x) is generated and added to the dataset, which is then randomly weighted. The parameter under this implicit distribution function is then computed as the solution of an optimization problem.

Algorithm 1: The Posterior Bootstrap
Data: Dataset x1:n = (x1, . . . , xn);
  parameter of interest α0 = α(F0) = arg max_α ∫ u(x, α) dF0(x);
  mixing posterior π(γ|x1:n), concentration parameter c, centering model fγ(x);
  number of centering model samples T.
begin
  for i = 1, . . . , B do
    Draw centering model parameter γ^(i) ∼ π(γ|x1:n);
    Draw posterior pseudo-sample x^(i)_{(n+1):(n+T)} iid∼ f_{γ^(i)};
    Generate weights (w^(i)_1, . . . , w^(i)_n, w^(i)_{n+1}, . . . , w^(i)_{n+T}) ∼ Dirichlet(1, . . . , 1, c/T, . . . , c/T);
    Compute parameter update
      α̃^(i) = arg max_α { Σ_{j=1}^n w^(i)_j u(xj, α) + Σ_{j=1}^T w^(i)_{n+j} u(x^(i)_{n+j}, α) };
  end
  Return NP posterior sample {α̃^(i)}_{i=1}^B.
end

Note for the special case of correcting model misspecification [S1], we have γ ≡ θ, fγ(·) ≡ fθ(·), π(γ|x1:n) ≡ π(θ|x1:n), α ≡ θ, u(x, α) ≡ log fθ(x), so that the posterior sample is given by

    θ̃^(i) = arg max_θ { Σ_{j=1}^n w^(i)_j log fθ(xj) + Σ_{j=1}^T w^(i)_{n+j} log fθ(x^(i)_{n+j}) },

where w^(i) ∼ Dirichlet(·) following Algorithm 1 and x^(i)_{(n+1):(n+T)} are T synthetic observations drawn from the parametric sampling distribution under θ^(i), which itself is drawn from π(θ|x1:n). We leave the concentration parameter c to be set subjectively by the practitioner, representing faith in the parametric model. Some further guidance on the setting of c can be found in Section 1 of the Supplementary Material.

2.6 Adaptive Nonparametric Learning: aNPL

Instead of the Dirichlet distribution approximation to the Dirichlet process, we propose an alternative stick-breaking procedure that has some desirable properties.
This procedure entails following the usual DP stick-breaking construction [25] for the model component of the MDP posterior, by repeatedly drawing Beta(1, c)-distributed stick breaks, but terminating when the unaccounted-for probability mass ∏_j(1 − vj), multiplied by the average mass assigned to the model, c/(c + n), drops below some threshold ε set by the user. This adaptive nonparametric learning (aNPL) algorithm is written out in full in Section 2 of the Supplementary Material.

One advantage of this approach is that a number of theoretical results then hold, as for large enough n, under this adaptive scheme the parametric model is in effect 'switched off', and essentially the MDP with c = 0 is used to generate posterior samples. This is an interesting notion in itself. For small samples, we prefer the regularization that our model provides, though as n grows the average probability mass assigned to the model decays like (c + n)⁻¹, as seen in (4). In the adaptive version, we agree a hard threshold at which point we discard the model entirely and allow the data to speak for itself. We set this point at a level such that we are a priori comfortable that there is enough information in the sample alone with which to quantify uncertainty in our parameter of interest. For example, ε = 10⁻⁴ and c = 1 only utilizes the centering model for n < 10,000. Further, we could use this idea to set c: this quantity is determined if a tolerance level, ε, and a threshold nmax over which the parametric model would be discarded, are provided by the practitioner.

2.7 Properties of NPL

Bayesian nonparametric learning has a number of important properties that we shall now describe.

Honesty about correctness of model.
Uncertainty in the data-generating mechanism is quantified via an NP update that takes into account the model likelihood, prior, and concentration parameter c. Uncertainty about model parameters flows from uncertainty in the data-generating mechanism.

Incorporation of prior information. The prior for θ is naturally incorporated as a mixing distribution for the MDP. This is in contrast to a number of Bayesian methods with similar computational properties but that do not admit a prior [21, 9].

Parallelized bootstrap computation. As shown in Section 2.5, NPL is trivially parallelizable through a Bayesian posterior bootstrap and can be coupled with misspecified models or approximate posteriors to deliver highly scalable and exact inference.

Consistency. Under mild regularity, all posterior mass concentrates in any neighbourhood of θ0, as defined in (1), as the number of observations tends to infinity. This follows from an analogous property of the DP (see, for example, [14]).

Standard Bayesian inference is recovered as c → ∞. This follows from the property of the DP that it converges to the prior degenerate at the base probability distribution in the limit c → ∞.

Non-informative learning with c = 0. If no model or prior is available, setting c = 0 recovers the WLB. This has an exact interpretation as an objective NP posterior [19], where the asymptotic properties of the misspecified WLB were studied. [20] demonstrated the suboptimality of a misspecified Bayesian posterior, asymptotically, relative to an asymptotic normal distribution with the same centering but a sandwich covariance matrix [15]. We will see next that for large samples the misspecified Bayesian posterior distribution is predictively suboptimal as well.

A superior asymptotic uncertainty quantification to Bayesian updating.
A natural way to compare posterior distributions is by measuring their predictive risk, defined as the expected KL divergence of the posterior predictive to F0. We consider only the situation where there is an absence of strong prior information, following [26, 12].

We say predictive π1 asymptotically dominates π2 up to o(n⁻ᵏ) if for all distributions q there exists a non-negative and possibly positive real-valued functional K(q(·)) such that:

    E_{x1:n∼q} dKL(q(·), π2(· | x1:n)) − E_{x1:n∼q} dKL(q(·), π1(· | x1:n)) = K(q(·)) + o(n⁻ᵏ).

We have the following theorem about the asymptotic properties of the MDP with c = 0. This result holds for aNPL, as the model component is ignored for suitably large n.

Theorem 1. The posterior predictive of the MDP with c = 0 asymptotically dominates the standard Bayesian posterior predictive up to o(n⁻¹).

Proof. In [12] the bootstrap predictive is shown to asymptotically dominate the standard Bayesian predictive up to o(n⁻¹). In Theorem 1 of [13], the predictive of the MDP with c = 0 and the bootstrap predictive are shown to be equal up to o_p(n⁻³ᐟ²). A Taylor expansion argument shows that the predictive risk of the MDP with c = 0 has the same asymptotic expansion up to o(n⁻¹) as that of the bootstrap. Thus Theorem 2 of [12] can be proven with the predictive of the MDP with c = 0 in place of the bootstrap predictive, and so the predictive of the MDP with c = 0 must also dominate the standard Bayesian predictive up to o(n⁻¹).

3 Illustrations

3.1 Exponential family, [S1]

Suppose the centering model is an exponential family with parameter θ and sufficient statistic s(x),

    FΘ = { fθ(x) = g(x) exp{θᵀ s(x) − K(θ)} ; θ ∈ Θ }.

Under assumed regularity, by differentiating under the integral sign of (1) we find that our parameter of interest must satisfy E_{F0} s(x) = ∇θ K(θ0). For a particular F drawn from the posterior bootstrap, the expected sufficient statistic is

    ∇θ K(θ̃) = lim_{T→∞} { Σ_{j=1}^n wj s(xj) + Σ_{j=n+1}^{n+T} wj s(xj) },

with θ̃ the NP posterior parameter value, weights w1:(n+T) arising from the Dirichlet distribution as set out in Algorithm 1, and xj ∼ fθ(·) for j = n + 1, . . . , n + T, with θ drawn from the parametric posterior. This provides a simple geometric interpretation of our method, as convex combinations of (randomly-weighted) empirical sufficient statistics and model sufficient statistics from the parametric posterior. The distribution of the random weights is governed by c and n only. Our method generates stochastic maps from misspecified posterior samples to corrected NP posterior samples, by incorporating information in the data over and above that captured by the model.

3.2 Updating approximate posteriors [S2]: Variational Bayes uncertainty correction

Variational approximations to Bayesian posteriors are a popular tool for obtaining fast, scalable but approximate Bayesian posterior distributions [4, 6]. The approximate nature of the variational update can be accounted for using our approach. Figure 1 shows a mean-field normal approximation q to a correlated normal posterior p, an example similar to one from [4], Section 10.1.

Figure 1: Posterior 95% probability contour for a bivariate Gaussian, comparing VB-NPL with c ∈ {1, 10², 10³, 10⁴} (red, orange, green, blue respectively) to the known Bayes posterior (grey dashed line) and a VB approximation (black dashed line).
We generated 100 observations from a bivariate normal distribution, centered at (1/2, 1/2), with dimension-wise variances both equal to 1 and correlation equal to 0.9, and independent normal priors on each dimension, both centered at 0 with variance 10². Each posterior contour plotted is based on 10,000 posterior samples. By applying the posterior bootstrap with a VB posterior (VB-NPL) in place of the Bayesian posterior, we recover the correct covariance structure for decreasing prior concentration c. If instead of dKL(q, p) we use dKL(p, q) as the objective, as it is for expectation propagation, the model posterior uncertainty may be overestimated, but is still corrected by the posterior bootstrap.

We demonstrate this in practice through a VB logistic regression model fit to the Statlog German Credit dataset, containing 1000 observations and 25 covariates (including intercept), from the UCI ML repository [10], preprocessed via [11]. An independent normal prior with variance 100 was assigned to each covariate, and 1000 posterior samples were generated for each method. We obtain a mean-field VB sample using automatic differentiation variational inference (ADVI) in Stan [17]. When generating synthetic samples for the posterior bootstrap, both features and classes are needed. Classes are generated, given features, according to the probability specified by the logistic distribution. In this example (and the example in Section 3.3) we repeatedly re-use the features of the dataset for our pseudo-samples. In Fig. 2 we show that the NP update effectively corrects the VB approximation for small values of c. Of course, local variational methods can provide more accurate posterior approximations to Bayesian logistic posteriors [16], though these too are approximations that NP updating can correct.

Comparison with Bayesian logistic regression.
The conventional Bayesian logistic regression assumes the true log-odds of each class is linear in the predictors, and would use MCMC for inference [22]. The MCMC samples, shown as points in Fig. 2, are a good match to the NPL update, but MCMC requires a user-defined burn-in and convergence checking. The runtime to generate 1 million samples by MCMC (discarding an equivalent burn-in) was 33 minutes, compared to 21 seconds with NPL, using an m5.24xlarge AWS instance with 96 vCPUs; a speed-up of 95 times. Additionally NPL has provably better predictive properties, as detailed in Section 2.7.

Figure 2: Posterior contour plot for β22 vs β21, for VB-NPL (green) and VB (blue), for three values of c. Scatter plot is a Bayesian logistic posterior sample (red) via a Polya-Gamma scheme.

3.3 Directly updating the prior: Bayesian Random Forests, using synthetic generated data

Random forests (RF) [7] is an ensemble learning method that is widely used and has demonstrated excellent general performance [11]. We construct a Bayesian RF (BRF), via NPL with decision trees, under a prior mixing distribution (a variant of [S1]). This enables the incorporation of prior information, via synthetic data generated from a prior prediction function, in a principled way that is not available to RF. The step-like generative likelihood function arising from the tree partition structure does not reflect our beliefs about the true sampling distribution; the trees are just a convenient compression of the data. Because of this we simply update the prior in the MDP by specifying π(γ|x1:n) = π(γ). Details of our implementation of BRF can be found in Section 3 of the Supplementary Material.

To demonstrate the ability of BRF to incorporate prior information, we conducted the following experiment.
For 13 binary classi\ufb01cation datasets from the UCI ML repository [10], we constructed a\nprior, training and test dataset of equal size. We measured test dataset predictive accuracy for three\nmethods (relative to an RF trained on the training dataset only): BRF (c=0) (a non-informative BRF\nwith c = 0, trained on the training dataset only), BRF (c>0) (a BRF trained on the training dataset,\nincorporating prior pseudo-samples from a non-informative BRF trained on the prior dataset, setting\nc equal to the number of observations in the prior dataset), and RF-all (an RF trained on the combined\ntraining and prior datasets). See Fig. 3 for boxplots of the test accuracy over 100 repetitions.\nAs detailed in Section 3 of the Supplementary Material, for small c we \ufb01nd that BRF and RF have\nsimilar performance, but as c increases, more weight is given to the externally-trained component\nand we \ufb01nd that BRF outperforms RF. The best performance of our BRF tends to occur when c is set\nequal to the number of samples in the external training dataset, in line with our intuition of the role of\nc as an effective sample size. Overall, the BRF accuracy is better than that of RF, and close to that of\nRF-all. BRF may have privacy bene\ufb01ts over RF-all as it only requires synthetic samples; both the\noriginal data and the model can be kept private.\n\n3.4 Direct updating of utility functions [S3]: population median\n\n(cid:82) |\u03b1 \u2212 x|dF0(x), and an MDP prior centered at a N (\u03b8, 1) with prior \u03c0(\u03b8) = N (0, 102)\n\nWe demonstrate direct inference for a functional of interest using the population median, under a\nmisspeci\ufb01ed univariate Gaussian model, as an example, where the parameter of interest is \u03b10 =\narg min\u03b1\nand data generated from a skew-normal distribution. We use the posterior bootstrap to generate\nposterior samples that incorporate the prior model information with that from the data. 
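This direct utility-based update is straightforward to sketch in code. The fragment below is our own illustration, not the paper's implementation: it assumes a conjugate normal mixing posterior for the N(θ, 1) centering model, uses u(x, α) = −|α − x| so that the weighted maximizer is a weighted median, and substitutes ordinary normal draws for the skew-normal data of the experiment.

```python
import numpy as np

def median_posterior_bootstrap(x, c, B=2000, T=200, rng=None):
    """Posterior bootstrap for the population median under an MDP
    centered at N(theta, 1) with prior theta ~ N(0, 10^2).
    The weighted maximizer of u(x, a) = -|a - x| is a weighted median."""
    rng = np.random.default_rng(rng)
    n = len(x)
    # conjugate update for N(theta, 1) likelihood with N(0, 100) prior
    post_var = 1.0 / (1.0 / 100.0 + n)
    post_mean = post_var * x.sum()
    meds = np.empty(B)
    for i in range(B):
        theta = rng.normal(post_mean, np.sqrt(post_var))
        pseudo = rng.normal(theta, 1.0, size=T)      # synthetic model data
        atoms = np.concatenate([x, pseudo])
        w = rng.dirichlet(np.concatenate([np.ones(n), np.full(T, c / T)]))
        # weighted median: smallest atom whose cumulative weight reaches 1/2
        order = np.argsort(atoms)
        cum = np.cumsum(w[order])
        meds[i] = atoms[order][np.searchsorted(cum, 0.5)]
    return meds

x = np.random.default_rng(3).normal(size=20)   # stand-in for skew-normal data
meds = median_posterior_bootstrap(x, c=20.0, rng=4)
```

With c = 0 the samples concentrate on weighted medians of the data alone; increasing c mixes in mass from the parametric model, smoothing the posterior as in the figure discussed next.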
Figure 4 presents histograms of posterior medians given a sample of 20 observations from a skew-normal distribution with mean 0, variance 1, and median approximately −0.2. The left-most histogram is sharply peaked at the sample median but has no support outside (xmin, xmax). As c grows, smoothness from the parametric model is introduced, up to the point where the normal location parameter is effectively used.

Figure 3: Box plots of classification accuracy minus that of RF, for 13 UCI datasets.

Figure 4: Posterior histograms for the median, for (left to right) c = 0, 20, 80, 1000. Right-most panel: posterior expected loss as a function of observation x; the dotted line shows the loss for the sample median.

4 Discussion

We have introduced a new approach to scalable Bayesian nonparametric learning, NPL, for parametric models, which facilitates prior regularization via a baseline model and corrects for model misspecification by incorporating an empirical component whose influence grows with the number of observations. A concentration parameter c encodes subjective beliefs on the validity of the model; c = ∞ recovers Bayesian updating under the baseline model, and c = 0 ignores the model entirely. Under regularity conditions, asymptotically, our method closely matches parametric Bayesian updating if the posited model is indeed true, and provides an asymptotically superior predictive if the model is misspecified. The NP posterior predictive mixes over the parametric model space rather than targeting F0 directly, though this may aid interpretation compared to fully nonparametric approaches.
Additionally, our construction admits a trivially parallelizable sampler once the parametric posterior samples have been generated (or if c = 0).

Our approach can be seen to blur the boundaries between Bayesian and frequentist inference. Conventionally, the Bayesian approach conditions on data and treats the unknown parameter of interest as if it were a random variable, with a prior on a known model class. The frequentist approach treats the object of inference as a fixed but unknown constant and characterizes uncertainty through the finite-sample variability of an estimator targeting this value. Here we randomize an objective function (an estimator) according to a Bayesian nonparametric prior on the sampling distribution, leading to a quantification of subjective beliefs about the value that would be returned by the estimator under an infinite sample size.

At the heart of our approach is the notion of Bayesian updating via randomized objective functions through the posterior bootstrap. The posterior bootstrap acts on an augmented dataset, comprising data and posterior pseudo-samples, under which randomized maximum likelihood estimators provide a well-motivated quantification of uncertainty while assuming little about the data-generating mechanism. The precursor to this is the weighted likelihood bootstrap [21], which utilized a simpler form of randomization that ignored prior information.
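For contrast, that simpler randomization can be sketched for a Gaussian working model, where the Dirichlet-weighted maximum likelihood estimate is available in closed form. This is a hypothetical minimal example: the function name and the exponential test data are our own, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def weighted_likelihood_bootstrap(x, B=2000):
    """Weighted likelihood bootstrap draws for a N(mu, sigma^2) working
    model: each draw maximizes a Dirichlet-weighted log-likelihood,
    which for the Gaussian family has a closed-form maximizer."""
    n = len(x)
    out = np.empty((B, 2))
    for b in range(B):
        w = rng.dirichlet(np.ones(n))  # exchangeable weights; no prior term
        mu = w @ x                     # weighted MLE of the mean
        sigma2 = w @ (x - mu) ** 2     # weighted MLE of the variance
        out[b] = mu, sigma2
    return out

# Deliberately non-Gaussian data: the draws quantify uncertainty about the
# best Gaussian fit to the sampling distribution, not about a "true" mu.
x = rng.exponential(1.0, size=50)
draws = weighted_likelihood_bootstrap(x)
```

Unlike the posterior bootstrap, there are no pseudo-samples here, so no prior or model-based regularization enters the draws.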
Our approach provides scope for quantifying uncertainty for more general machine learning models by randomizing their objective functions suitably.

Acknowledgements

SL is funded by the EPSRC OxWaSP CDT, through EP/L016710/1. CH gratefully acknowledges support for this research from the MRC, The Alan Turing Institute, and the Li Ka Shing Foundation.

References

[1] Hirotugu Akaike. Likelihood of a model and information criteria. Journal of Econometrics, 16(1):3–14, 1981.

[2] Charles E. Antoniak. Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. Annals of Statistics, 2(6):1152–1174, 1974.

[3] José M. Bernardo and Adrian F. M. Smith. Bayesian Theory. Wiley, Chichester, 2006.

[4] Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006.

[5] P. G. Bissiri, C. C. Holmes, and S. G. Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 78(5):1103–1130, 2016.

[6] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

[7] Leo Breiman.
Random Forests. Machine Learning, 45(1):5–32, 2001.

[8] K. P. Burnham and D. R. Anderson. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer, New York, 2003.

[9] Gary Chamberlain and Guido W. Imbens. Nonparametric Applications of Bayesian Inference. Journal of Business & Economic Statistics, 21(1):12–18, 2003.

[10] Dua Dheeru and Efi Karra Taniskidou. UCI Machine Learning Repository, 2017.

[11] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research, 15:3133–3181, 2014.

[12] Tadayoshi Fushiki. Bootstrap prediction and Bayesian prediction under misspecified models. Bernoulli, 11(4):747–758, 2005.

[13] Tadayoshi Fushiki. Bayesian bootstrap prediction. Journal of Statistical Planning and Inference, 140(1):65–74, 2010.

[14] Nils L. Hjort, Chris Holmes, Peter Müller, and Stephen G. Walker. Bayesian Nonparametrics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2010.

[15] Peter J. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pages 221–233, Berkeley, Calif., 1967. University of California Press.

[16] Tommi S. Jaakkola and Michael I. Jordan. A variational approach to Bayesian logistic regression models and their extensions. International Workshop on Artificial Intelligence and Statistics, 1997.

[17] Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic Variational Inference in Stan. In Advances in Neural Information Processing Systems 28, volume 1, pages 1–9, 2015.

[18] Albert Y. Lo. On a Class of Bayesian Nonparametric Estimates: I. Density Estimates.
Annals of Statistics, 12(1):351–357, 1984.

[19] S. P. Lyddon, C. C. Holmes, and S. G. Walker. General Bayesian Updating and the Loss-Likelihood Bootstrap. arXiv preprint arXiv:1709.07616, 2018.

[20] Ulrich K. Müller. Risk of Bayesian Inference in Misspecified Models, and the Sandwich Covariance Matrix. Econometrica, 81(5):1805–1849, 2013.

[21] Michael A. Newton and Adrian E. Raftery. Approximate Bayesian Inference with the Weighted Likelihood Bootstrap. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 56(1):3–48, 1994.

[22] Nicholas G. Polson, James G. Scott, and Jesse Windle. Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables. Journal of the American Statistical Association, 108(504):1339–1349, 2013.

[23] Christian P. Robert. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer Texts in Statistics. Springer, New York, 2007.

[24] Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer, New York, 2005.

[25] Jayaram Sethuraman. A constructive definition of Dirichlet Process Priors. Statistica Sinica, 4:639–650, 1994.

[26] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

[27] Stephen G. Walker. Bayesian inference with misspecified models. Journal of Statistical Planning and Inference, 143(10):1621–1633, 2013.