{"title": "Fast \u03b5-free Inference of Simulation Models with Bayesian Conditional Density Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 1028, "page_last": 1036, "abstract": "Many statistical models can be simulated forwards but have intractable likelihoods. Approximate Bayesian Computation (ABC) methods are used to infer properties of these models from data. Traditionally these methods approximate the posterior over parameters by conditioning on data being inside an \u03b5-ball around the observed data, which is only correct in the limit \u03b5\u21920. Monte Carlo methods can then draw samples from the approximate posterior to approximate predictions or error bars on parameters. These algorithms critically slow down as \u03b5\u21920, and in practice draw samples from a broader distribution than the posterior. We propose a new approach to likelihood-free inference based on Bayesian conditional density estimation. Preliminary inferences based on limited simulation data are used to guide later simulations. In some cases, learning an accurate parametric representation of the entire true posterior distribution requires fewer model simulations than Monte Carlo ABC methods need to produce a single sample from an approximate posterior.", "full_text": "Fast \u0001-free Inference of Simulation Models with\n\nBayesian Conditional Density Estimation\n\nGeorge Papamakarios\nSchool of Informatics\nUniversity of Edinburgh\n\ng.papamakarios@ed.ac.uk\n\nIain Murray\n\nSchool of Informatics\nUniversity of Edinburgh\ni.murray@ed.ac.uk\n\nAbstract\n\nMany statistical models can be simulated forwards but have intractable likelihoods.\nApproximate Bayesian Computation (ABC) methods are used to infer properties\nof these models from data. Traditionally these methods approximate the posterior\nover parameters by conditioning on data being inside an \u0001-ball around the observed\ndata, which is only correct in the limit \u0001\u2192 0. 
Monte Carlo methods can then draw samples from the approximate posterior to approximate predictions or error bars on parameters. These algorithms critically slow down as ε → 0, and in practice draw samples from a broader distribution than the posterior. We propose a new approach to likelihood-free inference based on Bayesian conditional density estimation. Preliminary inferences based on limited simulation data are used to guide later simulations. In some cases, learning an accurate parametric representation of the entire true posterior distribution requires fewer model simulations than Monte Carlo ABC methods need to produce a single sample from an approximate posterior.

1 Introduction

A simulator-based model is a data-generating process described by a computer program, usually with some free parameters we need to learn from data. Simulator-based modelling lends itself naturally to scientific domains such as evolutionary biology [1], ecology [24], disease epidemics [10], economics [8] and cosmology [23], where observations are best understood as products of underlying physical processes. Inference in these models amounts to discovering plausible parameter settings that could have generated our observed data. The application domains mentioned can require properly calibrated distributions that express uncertainty over plausible parameters, rather than just point estimates, in order to reach scientific conclusions or make decisions.

As an analytical expression for the likelihood of parameters given observations is typically not available for simulator-based models, conventional likelihood-based Bayesian inference is not applicable. An alternative family of algorithms for likelihood-free inference has been developed, referred to as Approximate Bayesian Computation (ABC).
These algorithms simulate the model repeatedly and only accept parameter settings which generate synthetic data similar to the observed data, typically gathered in a real-world experiment.

Rejection ABC [21], the most basic ABC algorithm, simulates the model for each setting of proposed parameters, and rejects parameters if the generated data is not within a certain distance from the observations. The accepted parameters form a set of independent samples from an approximate posterior. Markov Chain Monte Carlo ABC (MCMC-ABC) [13] is an improvement over rejection ABC which, instead of independently proposing parameters, explores the parameter space by perturbing the most recently accepted parameters. Sequential Monte Carlo ABC (SMC-ABC) [2, 5] uses importance sampling to simulate a sequence of slowly-changing distributions, the last of which is an approximation to the parameter posterior.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Conventional ABC algorithms such as the above suffer from three drawbacks. First, they only represent the parameter posterior as a set of (possibly weighted or correlated) samples. A sample-based representation easily gives estimates and error bars of individual parameters, and model predictions. However these computations are noisy, and it is not obvious how to perform some other computations using samples, such as combining posteriors from two separate analyses. Second, the parameter samples do not come from the correct Bayesian posterior, but from an approximation based on assuming a pseudo-observation that the data is within an ε-ball centred on the data actually observed. Third, as the ε-tolerance is reduced, it can become impractical to simulate the model enough times to match the observed data even once.
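The rejection-ABC baseline described above can be sketched in a few lines. The following is a minimal illustration with a toy Gaussian simulator and uniform prior of our own choosing, not the paper's experimental setup:

```python
import numpy as np

def rejection_abc(simulate, prior_sample, x_o, eps, n_proposals, rng):
    """Basic rejection ABC: propose parameters from the prior and keep
    those whose simulated data lands within an eps-ball of x_o."""
    accepted = []
    for _ in range(n_proposals):
        theta = prior_sample(rng)
        x = simulate(theta, rng)
        if np.linalg.norm(np.atleast_1d(x - x_o)) < eps:
            accepted.append(theta)
    return np.array(accepted)

rng = np.random.default_rng(0)
# Toy simulator: x ~ N(theta, 1), with prior theta ~ U(-10, 10).
samples = rejection_abc(
    simulate=lambda th, rng: th + rng.normal(),
    prior_sample=lambda rng: rng.uniform(-10, 10),
    x_o=0.0, eps=0.5, n_proposals=5000, rng=rng)
```

With eps = 0.5 only a few percent of the 5000 proposals survive, which is the inefficiency the drawbacks above describe: the cost per accepted sample grows without bound as eps shrinks.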
When simulations are expensive to perform, good quality inference becomes impractical.

We propose a parametric approach to likelihood-free inference, which unlike conventional ABC does not suffer from the above three issues. Instead of returning samples from an ε-approximation to the posterior, our approach learns a parametric approximation to the exact posterior, which can be made as accurate as required. Preliminary fits to the posterior are used to guide future simulations, which can reduce the number of simulations required to learn an accurate approximation by orders of magnitude. Our approach uses conditional density estimation with Bayesian neural networks, and draws upon advances in parametric density estimation, stochastic variational inference, and recognition networks, as discussed in the related work section.

2 Bayesian conditional density estimation for likelihood-free inference

2.1 Simulator-based models and ABC

Let θ be a vector of parameters controlling a simulator-based model, and let x be a data vector generated by the model. The model may be provided as a probabilistic program that can be easily simulated, and implicitly defines a likelihood p(x | θ), which we assume we cannot evaluate. Let p(θ) encode our prior beliefs about the parameters. Given an observation xo, we are interested in the parameter posterior p(θ | x = xo) ∝ p(x = xo | θ) p(θ).

As the likelihood p(x = xo | θ) is unavailable, conventional Bayesian inference cannot be carried out. The principle behind ABC is to approximate p(x = xo | θ) by p(‖x − xo‖ < ε | θ) for a sufficiently small value of ε, and then estimate the latter (e.g. by Monte Carlo) using simulations from the model. Hence, ABC approximates the posterior by p(θ | ‖x − xo‖ < ε), which is typically broader and more uncertain.
ABC can trade off computation for accuracy by decreasing ε, which improves the approximation to the posterior but requires more simulations from the model. However, the approximation becomes exact only when ε → 0, in which case simulations never match the observations, p(‖x − xo‖ < ε | θ) → 0, and existing methods break down. In this paper, we refer to p(θ | x = xo) as the exact posterior, as it corresponds to setting ε = 0 in p(θ | ‖x − xo‖ < ε).

In most practical applications of ABC, x is taken to be a fixed-length vector of summary statistics that is calculated from data generated by the simulator, rather than the raw data itself. Extracting statistics is often necessary in practice, to reduce the dimensionality of the data and keep p(‖x − xo‖ < ε | θ) at practically acceptable levels. For the purposes of this paper, we will make no distinction between raw data and summary statistics, and we will regard the calculation of summary statistics as part of the data generating process.

2.2 Learning the posterior

Rather than using simulations from the model in order to estimate an approximate likelihood, p(‖x − xo‖ < ε | θ), we will use the simulations to directly estimate p(θ | x = xo). We will run simulations for parameters drawn from a distribution, p̃(θ), which we shall refer to as the proposal prior. The proposition below indicates how we can then form a consistent estimate of the exact posterior, using a flexible family of conditional densities, qφ(θ | x), parameterized by a vector φ.

Proposition 1. We assume that each of a set of N pairs (θn, xn) was independently generated by

    θn ∼ p̃(θ)  and  xn ∼ p(x | θn).    (1)

In the limit N → ∞, the probability of the parameter vectors ∏n qφ(θn | xn) is maximized w.r.t. φ if and only if

    qφ(θ | x) ∝ (p̃(θ) / p(θ)) p(θ | x),    (2)

provided a setting of φ that makes qφ(θ | x) proportional to (p̃(θ) / p(θ)) p(θ | x) exists.

Intuition: if we simulated enough parameters from the prior, the density estimator qφ would learn a conditional of the joint prior model over parameters and data, which is the posterior p(θ | x). If we simulate parameters drawn from another distribution, we need to "importance reweight" the result. A more detailed proof can be found in Section A of the supplementary material.

The proposition above suggests the following procedure for learning the posterior: (a) propose a set of parameter vectors {θn} from the proposal prior; (b) for each θn run the simulator to obtain a corresponding data vector xn; (c) train qφ with maximum likelihood on {θn, xn}; and (d) estimate the posterior by

    p̂(θ | x = xo) ∝ (p(θ) / p̃(θ)) qφ(θ | xo).    (3)

This procedure is summarized in Algorithm 2.

2.3 Choice of conditional density estimator and proposal prior

In choosing the types of density estimator qφ(θ | x) and proposal prior p̃(θ), we need to meet the following criteria: (a) qφ should be flexible enough to represent the posterior but easy to train with maximum likelihood; (b) p̃(θ) should be easy to evaluate and sample from; and (c) the right-hand side expression in Equation (3) should be easily evaluated and normalized.

We draw upon work on conditional neural density estimation and take qφ to be a Mixture Density Network (MDN) [3] with fully parameterized covariance matrices.
That is, qφ takes the form of a mixture of K Gaussian components, qφ(θ | x) = ∑k αk N(θ | mk, Sk), whose mixing coefficients {αk}, means {mk} and covariance matrices {Sk} are computed by a feedforward neural network parameterized by φ, taking x as input. Such an architecture is capable of representing any conditional distribution arbitrarily accurately (provided the number of components K and number of hidden units in the neural network are sufficiently large) while remaining trainable by backpropagation. The parameterization of the MDN is detailed in Section B of the supplementary material.

We take the proposal prior to be a single Gaussian p̃(θ) = N(θ | m0, S0), with mean m0 and full covariance matrix S0. Assuming the prior p(θ) is a simple distribution (uniform or Gaussian, as is typically the case in practice), this choice allows us to calculate p̂(θ | x = xo) in Equation (3) analytically. That is, p̂(θ | x = xo) will be a mixture of K Gaussians, whose parameters will be a function of {αk, mk, Sk} evaluated at xo (as detailed in Section C of the supplementary material).

2.4 Learning the proposal prior

Simple rejection ABC is inefficient because the posterior p(θ | x = xo) is typically much narrower than the prior p(θ). A parameter vector θ sampled from p(θ) will rarely be plausible under p(θ | x = xo) and will most likely be rejected. Practical ABC algorithms attempt to reduce the number of rejections by modifying the way they propose parameters; for instance, MCMC-ABC and SMC-ABC propose new parameters by perturbing parameters they already consider plausible, in the hope that nearby parameters remain plausible.

In our framework, the key to efficient use of simulations lies in the choice of proposal prior. If we take p̃(θ) to be the actual prior, then qφ(θ | x) will learn the posterior for all x, as can be seen from Equation (2).
Such a strategy however is grossly inefficient if we are only interested in the posterior for x = xo. Conversely, if p̃(θ) closely matches p(θ | x = xo), then most simulations will produce samples that are highly informative in learning qφ(θ | x) for x = xo. In other words, if we already knew the true posterior, we could use it to construct an efficient proposal prior for learning it.

We exploit this idea to set up a fixed-point system. Our strategy becomes to learn an efficient proposal prior that closely approximates the posterior as follows: (a) initially take p̃(θ) to be the prior p(θ); (b) propose N samples {θn} from p̃(θ) and corresponding samples {xn} from the simulator, and train qφ(θ | x) on them; (c) approximate the posterior using Equation (3) and set p̃(θ) to it; (d) repeat until p̃(θ) has converged. This procedure is summarized in Algorithm 1.

In the procedure above, as long as qφ(θ | x) has only one Gaussian component (K = 1), p̃(θ) remains a single Gaussian throughout.
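The fixed-point procedure (a)-(d) above can be sketched end to end. The following toy illustrates the loop in 1-D with substitutions of our own: the simulator is x ~ N(θ, 1), the prior is flat, and the MDN-SVI is replaced by a linear-Gaussian conditional density estimator q(θ | x) = N(θ | a·x + b, s²), which is fitted in closed form by least squares; the correction step is the 1-D version of Equation (3):

```python
import numpy as np

def simulate(theta, rng):
    # Stand-in simulator (ours, not the paper's): x ~ N(theta, 1).
    return theta + rng.normal()

def algorithm1_sketch(x_o, n_per_iter=300, n_iters=5, seed=0):
    """Toy version of the fixed-point loop: fit q(theta | x), evaluate
    it at x_o, divide out the previous proposal (flat prior assumed),
    and use the result as the next proposal prior."""
    rng = np.random.default_rng(seed)
    m0, s0 = 0.0, 10.0                       # broad initial proposal
    for _ in range(n_iters):
        theta = rng.normal(m0, s0, n_per_iter)
        x = np.array([simulate(t, rng) for t in theta])
        # Maximum-likelihood fit of q(theta | x) = N(a*x + b, s2).
        A = np.stack([x, np.ones_like(x)], axis=1)
        (a, b), *_ = np.linalg.lstsq(A, theta, rcond=None)
        s2 = np.mean((theta - (a * x + b)) ** 2)
        m_q, s2_q = a * x_o + b, s2          # q(theta | x_o)
        prec = 1.0 / s2_q - 1.0 / s0**2      # divide by old proposal
        if prec > 0:
            m0, s0 = (m_q / s2_q - m0 / s0**2) / prec, np.sqrt(1.0 / prec)
        else:                                # guard against a too-wide fit
            m0, s0 = m_q, np.sqrt(s2_q)
    return m0, s0
```

For this toy the exact posterior under a flat prior is N(xo, 1), and the loop settles near it within a few iterations of a few hundred simulations each, mirroring the 4-6 iterations of 200-500 samples reported below.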
Moreover, in each iteration we initialize qφ with the density estimator learnt in the iteration before, thus we keep training qφ throughout.

Algorithm 1: Training of proposal prior
    initialize qφ(θ | x) with one component
    p̃(θ) ← p(θ)
    repeat
        for n = 1..N do
            sample θn ∼ p̃(θ)
            sample xn ∼ p(x | θn)
        end
        retrain qφ(θ | x) on {θn, xn}
        p̃(θ) ← (p(θ) / p̃(θ)) qφ(θ | xo)
    until p̃(θ) has converged

Algorithm 2: Training of posterior
    initialize qφ(θ | x) with K components
    // if qφ is available from Algorithm 1,
    // initialize by replicating its one component K times
    for n = 1..N do
        sample θn ∼ p̃(θ)
        sample xn ∼ p(x | θn)
    end
    train qφ(θ | x) on {θn, xn}
    p̂(θ | x = xo) ← (p(θ) / p̃(θ)) qφ(θ | xo)

This initialization allows us to use a small sample size N in each iteration, thus making efficient use of simulations.

As we shall demonstrate in Section 3, the procedure above learns Gaussian approximations to the true posterior fast: in our experiments typically 4-6 iterations of 200-500 samples each were sufficient. This Gaussian approximation can be used as a rough but cheap approximation to the true posterior, or it can serve as a good proposal prior in Algorithm 2 for efficiently fine-tuning a non-Gaussian multi-component posterior. If the second strategy is adopted, then we can reuse the single-component neural density estimator learnt in Algorithm 1 to initialize qφ in Algorithm 2.
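The proposal update p̃(θ) ← (p(θ)/p̃(θ)) qφ(θ | xo) in Algorithm 1 is, for K = 1 and a flat prior, a division of one Gaussian by another. A minimal multivariate sketch of that correction (under the assumption that the resulting precision matrix is positive definite, which holds when the learnt conditional is narrower than the proposal):

```python
import numpy as np

def correct_for_proposal(m, S, m0, S0):
    """Divide the learnt Gaussian q(theta | x_o) = N(m, S) by the
    proposal prior N(m0, S0), assuming a flat prior p(theta).
    Returns mean and covariance of the corrected Gaussian."""
    P = np.linalg.inv(S) - np.linalg.inv(S0)   # corrected precision
    Sigma = np.linalg.inv(P)
    mu = Sigma @ (np.linalg.inv(S) @ m - np.linalg.inv(S0) @ m0)
    return mu, Sigma
```

For example, dividing N(1, I) by a broad proposal N(0, 4I) widens the covariance to (4/3)I and pushes the mean out to (4/3)·1, compensating for the proposal having concentrated simulations near its own mean.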
The weights in the final layer of the MDN are replicated K times, with small random perturbations to break symmetry.

2.5 Use of Bayesian neural density estimators

To make Algorithm 1 as efficient as possible, the number of simulations per iteration N should be kept small, while at the same time it should provide a sufficient training signal for qφ. With a conventional MDN, if N is made too small, there is a danger of overfitting, especially in early iterations, leading to over-confident proposal priors and an unstable procedure. Early stopping could be used to avoid overfitting; however a significant fraction of the N samples would have to be used as a validation set, leading to inefficient use of simulations.

As a better alternative, we developed a Bayesian version of the MDN using Stochastic Variational Inference (SVI) for neural networks [12]. We shall refer to this Bayesian version of the MDN as MDN-SVI. An MDN-SVI has two sets of adjustable parameters of the same size, the means φm and the log variances φs. The means correspond to the parameters φ of a conventional MDN. During training, Gaussian noise of variance exp φs is added to the means independently for each training example (θn, xn). The Bayesian interpretation of this procedure is that it optimizes a variational Gaussian posterior with a diagonal covariance matrix over parameters φ. At prediction time, the noise is switched off and the MDN-SVI behaves like a conventional MDN with φ = φm. Section D of the supplementary material details the implementation and training of MDN-SVI.
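The training-time noise injection described above can be illustrated in a few lines. This is a sketch of the sampling step only, with toy parameter values of our own, not the full MDN-SVI training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
phi_m = np.array([0.5, -1.0, 2.0])    # variational means (toy values)
phi_s = np.array([-4.0, -4.0, -4.0])  # log variances (toy values)

def sample_weights(training=True):
    """During training, perturb each parameter with Gaussian noise of
    variance exp(phi_s), drawn independently per training example;
    at prediction time the noise is switched off (weights = phi_m)."""
    if not training:
        return phi_m
    return phi_m + np.exp(0.5 * phi_s) * rng.normal(size=phi_m.shape)
```

Averaging the training loss over such draws is what gives the variational-Gaussian-posterior interpretation: the network is effectively trained under a distribution over its weights rather than a point estimate.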
We found that using an MDN-SVI instead of an MDN improves the robustness and efficiency of Algorithm 1 because (a) MDN-SVI is resistant to overfitting, allowing us to use a smaller number of simulations N; (b) no validation set is needed, so all samples can be used for training; and (c) since overfitting is not an issue, no careful tuning of training time is necessary.

3 Experiments

We showcase three versions of our approach: (a) learning the posterior with Algorithm 2 where qφ is a conventional MDN and the proposal prior p̃(θ) is taken to be the actual prior p(θ), which we refer to as MDN with prior; (b) training a proposal prior with Algorithm 1 where qφ is an MDN-SVI, which we refer to as proposal prior; and (c) learning the posterior with Algorithm 2 where qφ is an MDN-SVI and the proposal prior p̃(θ) is taken to be the one learnt in (b), which we refer to as MDN with proposal. All MDNs were trained using Adam [11] with its default parameters.

Figure 1: Results on mixture of two Gaussians. Left: approximate posteriors learnt by each strategy for xo = 0. Middle: full conditional density qφ(θ | x) learnt by the MDN trained with prior. Right: full conditional density qφ(θ | x) learnt by the MDN-SVI trained with proposal prior. Vertical dashed lines show the location of the observation xo = 0.

We compare to three ABC baselines: (a) rejection ABC [21], where parameters are proposed from the prior and are accepted if ‖x − xo‖ < ε; (b) MCMC-ABC [13] with a spherical Gaussian proposal, whose variance we manually tuned separately in each case for best performance; and (c) SMC-ABC [2], where the sequence of ε's was exponentially decayed, with a decay rate manually tuned separately in each case for best performance.
MCMC-ABC was given the unrealistic advantage of being initialized with a sample from rejection ABC, removing the need for an otherwise necessary burn-in period. Code for reproducing the experiments is provided in the supplementary material and at https://github.com/gpapamak/epsilon_free_inference.

3.1 Mixture of two Gaussians

The first experiment is a toy problem where the goal is to infer the common mean θ of a mixture of two 1D Gaussians, given a single datapoint xo. The setup is

    p(x | θ) = α N(x | θ, σ1²) + (1 − α) N(x | θ, σ2²)  and  p(θ) = U(θ | θα, θβ),    (4)

where θα = −10, θβ = 10, α = 0.5, σ1 = 1, σ2 = 0.1 and xo = 0. The posterior can be calculated analytically, and is proportional to an equal mixture of two Gaussians centred at xo with variances σ1² and σ2², restricted to [θα, θβ]. This problem is often used in the SMC-ABC literature to illustrate the difficulty of MCMC-ABC in representing long tails. Here we use it to demonstrate the correctness of our approach and its ability to accurately represent non-Gaussian long-tailed posteriors.

Figure 1 shows the results of neural density estimation using each strategy. All MDNs have one hidden layer with 20 tanh units and 2 Gaussian components, except for the proposal prior MDN which has a single component. Both MDN with prior and MDN with proposal learn good parametric approximations to the true posterior, and the proposal prior is a good Gaussian approximation to it. We used 10K simulations to train the MDN with prior, whereas the proposal prior took 4 iterations of 200 simulations each to train, and the MDN with proposal took 1000 simulations on top of the previous 800.
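The setup of Equation (4) and its analytic posterior can be written down directly; the following sketch instantiates them with the values from the text (the posterior is left unnormalized for simplicity):

```python
import numpy as np

# Setup of the two-Gaussians toy problem; values from the text.
theta_a, theta_b = -10.0, 10.0
alpha, sig1, sig2, x_o = 0.5, 1.0, 0.1, 0.0

def simulate(theta, rng):
    """Draw x from alpha*N(theta, sig1^2) + (1 - alpha)*N(theta, sig2^2)."""
    sig = sig1 if rng.uniform() < alpha else sig2
    return theta + sig * rng.normal()

def normal_pdf(z, s):
    return np.exp(-0.5 * (z / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def true_posterior_unnorm(theta):
    """Equal mixture of two Gaussians centred at x_o, restricted to the
    prior support [theta_a, theta_b] (unnormalized)."""
    inside = (theta_a <= theta) & (theta <= theta_b)
    return inside * (alpha * normal_pdf(theta - x_o, sig1)
                     + (1.0 - alpha) * normal_pdf(theta - x_o, sig2))
```

The sharp central peak (scale σ2 = 0.1) sitting on broad tails (scale σ1 = 1) is what makes this target awkward for MCMC-ABC and a useful correctness check here.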
The MDN with prior learns the posterior distributions for a large range of possible observations x (middle plot of Figure 1), whereas the MDN with proposal gives accurate posterior probabilities only near the value actually observed (right plot of Figure 1).

3.2 Bayesian linear regression

In Bayesian linear regression, the goal is to infer the parameters θ of a linear map from noisy observations of outputs at known inputs. The setup is

    p(x | θ) = ∏i N(xi | θᵀui, σ²)  and  p(θ) = N(θ | m, S),    (5)

where we took m = 0, S = I, σ = 0.1, randomly generated inputs {ui} from a standard Gaussian, and randomly generated observations xo from the model. In our setup, θ and x have 6 and 10 dimensions respectively. The posterior is analytically tractable, and is a single Gaussian.

All MDNs have one hidden layer of 50 tanh units and one Gaussian component. ABC methods were run for a sequence of decreasing ε's, up to their failing points. To measure the approximation quality to the posterior, we analytically calculated the KL divergence from the true posterior to the learnt posterior (which for ABC was taken to be a Gaussian fit to the set of returned posterior samples). The left of Figure 2 shows the approximation quality vs ε; MDN methods are shown as horizontal lines. As ε is decreased, ABC methods sample from an increasingly better approximation to the true posterior, however they eventually reach their failing point, or take prohibitively long. The best approximations are achieved by MDN with proposal and a very long run of SMC-ABC.

Figure 2: Results on Bayesian linear regression. Left: KL divergence from true posterior to approximation vs ε; lower is better. Middle: number of simulations vs KL divergence; lower left is better. Note that number of simulations is total for MDNs, and per effective sample for ABC. Right: posterior marginals for θ1 as computed by each method. ABC posteriors (represented as histograms) correspond to the setting of ε that minimizes the KL in the left plot.

The middle of Figure 2 shows the increase in number of simulations needed to improve approximation quality (as ε decreases). We quote the total number of simulations for MDN training, and the number of simulations per effective sample for ABC. Section E of the supplementary material describes how the number of effective samples is calculated. The number of simulations per effective sample should be multiplied by the number of effective samples needed in practice. Moreover, SMC-ABC will not work well with only one particle, so in practice many times the quoted cost will be needed. Here, MDNs make more efficient use of simulations than Monte Carlo ABC methods. Sequentially fitting a proposal prior was more than ten times cheaper than training with prior samples, and more accurate.

3.3 Lotka-Volterra predator-prey population model

The Lotka-Volterra model is a stochastic Markov jump process that describes the continuous time evolution of a population of predators interacting with a population of prey. There are four possible reactions: (a) a predator being born, (b) a predator dying, (c) a prey being born, and (d) a prey being eaten by a predator. Positive parameters θ = (θ1, θ2, θ3, θ4) control the rate of each reaction. Given a set of statistics xo calculated from an observed population time series, the objective is to infer θ. We used a flat prior over log θ, and calculated a set of 9 statistics x. The full setup is detailed in Section F of the supplementary material.
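A Markov jump process of this kind is typically simulated with the Gillespie algorithm. The sketch below uses one common rate parameterization (hazards θ1·X·Y, θ2·X, θ3·Y, θ4·X·Y for the four reactions); this parameterization, the initial populations and the horizon are our assumptions for illustration, while the exact setup and summary statistics used in the experiments are in Section F of the supplementary material:

```python
import numpy as np

def simulate_lv(theta, x0=50, y0=100, t_max=15.0, max_events=100000, seed=0):
    """Gillespie simulation of a Lotka-Volterra jump process.
    X = predators, Y = prey; theta holds the four reaction rates
    (assumed parameterization, see lead-in)."""
    rng = np.random.default_rng(seed)
    X, Y, t = x0, y0, 0.0
    times, pops = [0.0], [(X, Y)]
    for _ in range(max_events):
        rates = np.array([theta[0] * X * Y,   # (a) predator born
                          theta[1] * X,       # (b) predator dies
                          theta[2] * Y,       # (c) prey born
                          theta[3] * X * Y])  # (d) prey eaten
        total = rates.sum()
        if total == 0 or t >= t_max:          # extinction or horizon
            break
        t += rng.exponential(1.0 / total)     # time to next reaction
        event = rng.choice(4, p=rates / total)
        X += (event == 0) - (event == 1)
        Y += (event == 2) - (event == 3)
        times.append(t)
        pops.append((X, Y))
    return np.array(times), np.array(pops)
```

Because each reaction's hazard vanishes when the species it consumes is extinct, populations stay non-negative; summary statistics for inference would then be computed from the returned time series.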
The Lotka-Volterra model is commonly used in the ABC literature as a realistic model which can be simulated, but whose likelihood is intractable. One of the properties of Lotka-Volterra is that typical nature-like observations only occur for very specific parameter settings, resulting in narrow, Gaussian-like posteriors that are hard to recover.

The MDN trained with prior has two hidden layers of 50 tanh units each, whereas the MDN-SVI used to train the proposal prior and the MDN-SVI trained with proposal have one hidden layer of 50 tanh units. All three have one Gaussian component. We found that using more than one component made no difference to the results; in all cases the MDNs chose to use only one component and switch the rest off, which is consistent with our observation about the near-Gaussianity of the posterior.

We measure how well each method retrieves the true parameter values that were used to generate xo by calculating their log probability under each learnt posterior; for ABC a Gaussian fit to the posterior samples was used. The left panel of Figure 3 shows how this log probability varies with ε, demonstrating the superiority of MDN methods over ABC. In the middle panel we can see that MDN training with proposal makes efficient use of simulations compared to training with prior and ABC; note that for ABC the number of simulations is only for one effective sample. In the right panel, we can see that the estimates returned by MDN methods are more confident around the true parameters compared to ABC, because the MDNs learn the exact posterior rather than an inflated version of it like ABC does (plots for the other three parameters look similar).

We found that when training an MDN with a well-tuned proposal that focuses on the plausible region, an MDN with fewer parameters is needed compared to training with the prior.
This is because the MDN trained with proposal needs to learn only the local relationship between x and θ near xo, as opposed to in the entire domain of the prior. Hence, not only are savings achieved in the number of simulations, but training the MDN itself also becomes more efficient.

Figure 3: Results on Lotka-Volterra. Left: negative log probability of true parameters vs ε; lower is better. Middle: number of simulations vs negative log probability; lower left is better. Note that number of simulations is total for MDNs, but per effective sample for ABC. Right: estimates of log θ1 with 2 standard deviations. ABC estimates used many more simulations with the smallest feasible ε.

3.4 M/G/1 queue model

The M/G/1 queue model describes the processing of a queue of continuously arriving jobs by a single server. In this model, the time the server takes to process each job is independently and uniformly distributed in the interval [θ1, θ2]. The time interval between arrival of two consecutive jobs is independently and exponentially distributed with rate θ3. The server observes only the time intervals between departure of two consecutive jobs. Given a set of equally-spaced percentiles xo of inter-departure times, the task is to infer parameters θ = (θ1, θ2, θ3). This model is easy to simulate but its likelihood is intractable, and it has often been used as an ABC benchmark [4, 16].
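The queue dynamics above follow a simple recursion: a job departs at the later of its arrival time and the previous departure, plus its service time. A sketch of the simulator (the choice of returning five min-to-max percentiles is our reading of the setup; the exact statistics are specified in Section G of the supplementary material):

```python
import numpy as np

def simulate_mg1(theta1, theta2, theta3, n_jobs, rng):
    """Simulate the M/G/1 queue: service times ~ U[theta1, theta2],
    inter-arrival times ~ Exp(rate theta3). Returns 5 equally spaced
    percentiles of the observed inter-departure times."""
    arrival = np.cumsum(rng.exponential(1.0 / theta3, n_jobs))
    service = rng.uniform(theta1, theta2, n_jobs)
    depart = np.empty(n_jobs)
    prev = 0.0
    for i in range(n_jobs):
        # A job starts when both it has arrived and the server is free.
        prev = max(arrival[i], prev) + service[i]
        depart[i] = prev
    inter_dep = np.diff(np.concatenate(([0.0], depart)))
    return np.percentile(inter_dep, [0, 25, 50, 75, 100])
```

Note that every inter-departure time is at least the minimum service time θ1, so the summary statistics carry only blunt information about the parameters, which is why the posterior here is broad and non-Gaussian.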
Unlike Lotka-Volterra, data x is weakly informative about θ, and hence the posterior over θ tends to be broad and non-Gaussian. In our setup, we placed flat independent priors over θ1, θ2 − θ1 and θ3, and we took x to be 5 equally spaced percentiles, as detailed in Section G of the supplementary material.

The MDN trained with prior has two hidden layers of 50 tanh units each, whereas the MDN-SVI used to train the proposal prior and the one trained with proposal have one hidden layer of 50 tanh units. As observed in the Lotka-Volterra demo, less capacity is required when training with proposal, as the relationship to be learned is local and hence simpler, which saves compute time and gives a more accurate final posterior. All MDNs have 8 Gaussian components (except the MDN-SVI used to train the proposal prior, which always has one), which, after experimentation, we determined are enough for the MDNs to represent the non-Gaussian nature of the posterior.

Figure 4 reports the log probability of the true parameters under each posterior learnt (for ABC, the log probability was calculated by fitting a mixture of 8 Gaussians to posterior samples using Expectation-Maximization) and the number of simulations needed to achieve it. As before, MDN methods are more confident compared to ABC around the true parameters, which is due to ABC computing a broader posterior than the true one. MDN methods make more efficient use of simulations, since they use all of them for training and, unlike ABC, do not throw a proportion of them away.

4 Related work

Regression adjustment. An early parametric approach to ABC is regression adjustment, where a parametric regressor is trained on simulation data in order to learn a mapping from x to θ. The learnt mapping is then used to correct for using a large ε, by adjusting the location of posterior samples gathered by e.g. rejection ABC.
Beaumont et al. [1] used linear regressors, and later Blum and François [4] used neural networks with one hidden layer that separately predicted the mean and variance of θ. Both can be viewed as rudimentary density estimators and as such they are a predecessor to our work. However, they were not flexible enough to accurately estimate the posterior, and they were only used within some other ABC method to allow for a larger ε. In our work, we make conditional density estimation flexible enough to approximate the posterior accurately.

Figure 4: Results on M/G/1. Left: negative log probability of true parameters vs ε; lower is better. Middle: number of simulations vs negative log probability; lower left is better. Note that number of simulations is total for MDNs, and per effective sample for ABC. Right: estimates of θ2 with 2 standard deviations; ABC estimates correspond to the lowest setting of ε used.

Synthetic likelihood. Another parametric approach is synthetic likelihood, where parametric models are used to estimate the likelihood p(x | θ). Wood [24] used a single Gaussian, and later Fan et al. [7] used a mixture Gaussian model. Both of them learnt a separate density model of x for each θ by repeatedly simulating the model for fixed θ. More recently, Meeds and Welling [14] used a Gaussian process model to interpolate Gaussian likelihood approximations between different θ's.
Compared to learning the posterior, synthetic likelihood has the advantage of not depending on the choice of proposal prior. Its main disadvantage is the need for further approximate inference on top of it in order to obtain the posterior. In our work we directly learn the posterior, eliminating the need for further inference, and we address the problem of correcting for the proposal prior.
Efficient Monte Carlo ABC. Recent work on ABC has focused on reducing the simulation cost of sample-based ABC methods. Hamiltonian ABC [15] improves upon MCMC-ABC by using stochastically estimated gradients in order to explore the parameter space more efficiently. Optimization Monte Carlo ABC [16] explicitly optimizes the location of ABC samples, which greatly reduces the rejection rate. Bayesian optimization ABC [10] models p(‖x − xo‖ | θ) as a Gaussian process and then uses Bayesian optimization to guide simulations towards the region of small distances ‖x − xo‖. In our work we show how a significant reduction in simulation cost can also be achieved with parametric methods, which target the posterior directly.
Recognition networks. Our use of neural density estimators for learning posteriors is reminiscent of recognition networks in machine learning. A recognition network is a neural network that is trained to invert a generative model. The Helmholtz machine [6], the variational auto-encoder [12] and stochastic backpropagation [22] are examples where a recognition network is trained jointly with the generative network it is designed to invert. Feedforward neural networks have been used to invert black-box generative models [18] and binary-valued Bayesian networks [17], and convolutional neural networks have been used to invert a physics engine [25].
Our work illustrates the potential of recognition networks in the field of likelihood-free inference, where the generative model is fixed, and inference of its parameters is the goal.
Learning proposals. Neural density estimators have been employed in learning proposal distributions for importance sampling [20] and Sequential Monte Carlo [9, 19]. Although not our focus here, our fit to the posterior could also be used within Monte Carlo inference methods. In this work we see how far we can get purely by fitting a series of conditional density estimators.

5 Conclusions

Bayesian conditional density estimation improves likelihood-free inference in three main ways: (a) it represents the posterior parametrically, rather than as a set of samples, allowing for probabilistic evaluations later on in the pipeline; (b) it targets the exact posterior, rather than an ε-approximation of it; and (c) it makes efficient use of simulations by not rejecting samples, by interpolating between samples, and by gradually focusing on the plausible parameter region. Our belief is that neural density estimation is a tool with great potential in likelihood-free inference, and our hope is that this work helps in establishing its usefulness in the field.

Acknowledgments

We thank Amos Storkey for useful comments.
George Papamakarios is supported by the Centre for Doctoral Training in Data Science, funded by EPSRC (grant EP/L016427/1) and the University of Edinburgh, and by Microsoft Research through its PhD Scholarship Programme.

References

[1] M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian Computation in population genetics. Genetics, 162:2025–2035, Dec. 2002.
[2] M. A. Beaumont, J.-M. Cornuet, J.-M. Marin, and C. P. Robert. Adaptive Approximate Bayesian Computation. Biometrika, 96(4):983–990, 2009.
[3] C. M. Bishop. Mixture density networks. Technical Report NCRG/94/004, Aston University, 1994.
[4] M. G. B. Blum and O. François. Non-linear regression models for Approximate Bayesian Computation. Statistics and Computing, 20(1):63–73, 2010.
[5] F. V. Bonassi and M. West. Sequential Monte Carlo with adaptive weights for Approximate Bayesian Computation. Bayesian Analysis, 10(1):171–187, Mar. 2015.
[6] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The Helmholtz machine. Neural Computation, 7:889–904, 1995.
[7] Y. Fan, D. J. Nott, and S. A. Sisson. Approximate Bayesian Computation via regression density estimation. Stat, 2(1):34–48, 2013.
[8] C. Gouriéroux, A. Monfort, and E. Renault. Indirect inference. Journal of Applied Econometrics, 8(S1):S85–S118, 1993.
[9] S. Gu, Z. Ghahramani, and R. E. Turner. Neural adaptive Sequential Monte Carlo. Advances in Neural Information Processing Systems 28, pages 2629–2637, 2015.
[10] M. U. Gutmann and J. Corander. Bayesian optimization for likelihood-free inference of simulator-based statistical models. arXiv e-prints, abs/1501.03291v3, 2015.
[11] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations, 2014.
[12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes.
Proceedings of the 2nd International Conference on Learning Representations, 2013.
[13] P. Marjoram, J. Molitor, V. Plagnol, and S. Tavaré. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 100(26):15324–15328, Dec. 2003.
[14] E. Meeds and M. Welling. GPS-ABC: Gaussian Process Surrogate Approximate Bayesian Computation. Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, 30, 2014.
[15] E. Meeds, R. Leenders, and M. Welling. Hamiltonian ABC. Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, pages 582–591, 2015.
[16] T. Meeds and M. Welling. Optimization Monte Carlo: Efficient and embarrassingly parallel likelihood-free inference. Advances in Neural Information Processing Systems 28, pages 2071–2079, 2015.
[17] Q. Morris. Recognition networks for approximate inference in BN20 networks. Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages 370–377, 2001.
[18] V. Nair, J. Susskind, and G. E. Hinton. Analysis-by-synthesis by learning to invert generative black boxes. Proceedings of the 18th International Conference on Artificial Neural Networks, 5163:971–981, 2008.
[19] B. Paige and F. Wood. Inference networks for Sequential Monte Carlo in graphical models. Proceedings of the 33rd International Conference on Machine Learning, 2016.
[20] G. Papamakarios and I. Murray. Distilling intractable generative models. Probabilistic Integration Workshop at Neural Information Processing Systems, 2015.
[21] J. K. Pritchard, M. T. Seielstad, A. Perez-Lezaun, and M. W. Feldman. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, 16(12):1791–1798, 1999.
[22] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models.
Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286, 2014.
[23] C. M. Schafer and P. E. Freeman. Likelihood-free inference in cosmology: Potential for the estimation of luminosity functions. Statistical Challenges in Modern Astronomy V, pages 3–19, 2012.
[24] S. N. Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102–1104, 2010.
[25] J. Wu, I. Yildirim, J. J. Lim, B. Freeman, and J. Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. Advances in Neural Information Processing Systems 28, pages 127–135, 2015.