{"title": "Hierarchical Implicit Models and Likelihood-Free Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 5523, "page_last": 5533, "abstract": "Implicit probabilistic models are a flexible class of models defined by a simulation process for data. They form the basis for theories which encompass our understanding of the physical world. Despite this fundamental nature, the use of implicit models remains limited due to challenges in specifying complex latent structure in them, and in performing inference in such models with large data sets. In this paper, we first introduce hierarchical implicit models (HIMs). HIMs combine the idea of implicit densities with hierarchical Bayesian modeling, thereby defining models via simulators of data with rich hidden structure. Next, we develop likelihood-free variational inference (LFVI), a scalable variational inference algorithm for HIMs. Key to LFVI is specifying a variational family that is also implicit. This matches the model's flexibility and allows for accurate approximation of the posterior. We demonstrate diverse applications: a large-scale physical simulator for predator-prey populations in ecology; a Bayesian generative adversarial network for discrete data; and a deep implicit model for symbol generation.", "full_text": "Hierarchical Implicit Models and\n\nLikelihood-Free Variational Inference\n\nDustin Tran\n\nColumbia University\n\nRajesh Ranganath\nPrinceton University\n\nDavid M. Blei\n\nColumbia University\n\nAbstract\n\nImplicit probabilistic models are a \ufb02exible class of models de\ufb01ned by a simu-\nlation process for data. They form the basis for theories which encompass our\nunderstanding of the physical world. 
Despite this fundamental nature, the use\nof implicit models remains limited due to challenges in specifying complex latent\nstructure in them, and in performing inference in such models with large data sets.\nIn this paper, we \ufb01rst introduce hierarchical implicit models (HIMs). HIMs com-\nbine the idea of implicit densities with hierarchical Bayesian modeling, thereby\nde\ufb01ning models via simulators of data with rich hidden structure. Next, we de-\nvelop likelihood-free variational inference (LFVI), a scalable variational inference\nalgorithm for HIMs. Key to LFVI is specifying a variational family that is also im-\nplicit. This matches the model\u2019s \ufb02exibility and allows for accurate approximation\nof the posterior. We demonstrate diverse applications: a large-scale physical sim-\nulator for predator-prey populations in ecology; a Bayesian generative adversarial\nnetwork for discrete data; and a deep implicit model for symbol generation.\n\n1\n\nIntroduction\n\nConsider a model of coin tosses. With probabilistic models, one typically posits a latent probability,\nand supposes each toss is a Bernoulli outcome given this probability [36, 15]. After observing a col-\nlection of coin tosses, Bayesian analysis lets us describe our inferences about the probability.\nHowever, we know from the laws of physics that the outcome of a coin toss is fully determined by\nits initial conditions (say, the impulse and angle of \ufb02ip) [25, 9]. Therefore a coin toss\u2019 randomness\ndoes not originate from a latent probability but from noisy initial parameters. This alternative model\nincorporates the physical system, better capturing the generative process. Furthermore, the model is\nimplicit, also known as a simulator: we can sample data from its generative process, but we may not\nhave access to calculate its density [11, 20].\nCoin tosses are simple, but they serve as a building block for complex implicit models. 
These\nmodels, which capture the laws and theories of real-world physical systems, pervade \ufb01elds such as\npopulation genetics [40], statistical physics [1], and ecology [3]; they underlie structural equation\nmodels in economics and causality [39]; and they connect deeply to generative adversarial networks\n(GANs) [18], which use neural networks to specify a \ufb02exible implicit density [35].\nUnfortunately, implicit models, including GANs, have seen limited success outside speci\ufb01c domains.\nThere are two reasons. First, it is unknown how to design implicit models for more general appli-\ncations, exposing rich latent structure such as priors, hierarchies, and sequences. Second, existing\nmethods for inferring latent structure in implicit models do not suf\ufb01ciently scale to high-dimensional\nor large data sets. In this paper, we design a new class of implicit models and we develop a new\nalgorithm for accurate and scalable inference.\nFor modeling, \u00a7 2 describes hierarchical implicit models, a class of Bayesian hierarchical models\nwhich only assume a process that generates samples. This class encompasses both simulators in the\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fclassical literature and those employed in GANs. For example, we specify a Bayesian GAN, where\nwe place a prior on its parameters. The Bayesian perspective allows GANs to quantify uncertainty\nand improve data ef\ufb01ciency. We can also apply them to discrete data; this setting is not possible\nwith traditional estimation algorithms for GANs [27].\nFor inference, \u00a7 3 develops likelihood-free variational inference (LFVI), which combines variational\ninference with density ratio estimation [49, 35]. Variational inference posits a family of distributions\nover latent variables and then optimizes to \ufb01nd the member closest to the posterior [23]. 
Traditional\napproaches require a likelihood-based model and use crude approximations, employing a simple\napproximating family for fast computation. LFVI expands variational inference to implicit models\nand enables accurate variational approximations with implicit variational families: LFVI does not\nrequire the variational density to be tractable. Further, unlike previous Bayesian methods for implicit\nmodels, LFVI scales to millions of data points with stochastic optimization.\nThis work has diverse applications. First, we analyze a classical problem from the approximate\nBayesian computation (ABC) literature, where the model simulates an ecological system [3]. We\nanalyze 100,000 time series which is not possible with traditional methods. Second, we analyze a\nBayesian GAN, which is a GAN with a prior over its weights. Bayesian GANs outperform corre-\nsponding Bayesian neural networks with known likelihoods on several classi\ufb01cation tasks. Third,\nwe show how injecting noise into hidden units of recurrent neural networks corresponds to a deep\nimplicit model for \ufb02exible sequence generation.\nRelated Work. This paper connects closely to three lines of work. The \ufb01rst is Bayesian inference\nfor implicit models, known in the statistics literature as approximate Bayesian computation (ABC)\n[3, 33]. ABC steps around the intractable likelihood by applying summary statistics to measure\nthe closeness of simulated samples to real observations. While successful in many domains, ABC\nhas shortcomings. First, the results generated by ABC depend heavily on the chosen summary\nstatistics and the closeness measure. Second, as the dimensionality grows, closeness becomes harder\nto achieve. This is the classic curse of dimensionality.\nThe second is GANs [18]. GANs have seen much interest since their conception, providing an ef\ufb01-\ncient method for estimation in neural network-based simulators. Larsen et al. 
[28] propose a hybrid\nof variational methods and GANs for improved reconstruction. Chen et al. [7] apply information\npenalties to disentangle factors of variation. Donahue et al. [12], Dumoulin et al. [13] propose to\nmatch on an augmented space, simultaneously training the model and an inverse mapping from\ndata to noise. Unlike any of the above, we develop models with explicit priors on latent variables,\nhierarchies, and sequences, and we generalize GANs to perform Bayesian inference.\nThe \ufb01nal thread is variational inference with expressive approximations [45, 48, 52]. The idea of\ncasting the design of variational families as a modeling problem was proposed in Ranganath et al.\n[44]. Further advances have analyzed variational programs [42]\u2014a family of approximations which\nonly requires a process returning samples\u2014and which has seen further interest [30]. Implicit-like\nvariational approximations have also appeared in auto-encoder frameworks [32, 34] and message\npassing [24]. We build on variational programs for inferring implicit models.\n\n2 Hierarchical Implicit Models\n\nHierarchical models play an important role in sharing statistical strength across examples [16]. For\na broad class of hierarchical Bayesian models, the joint distribution of the hidden and observed\nvariables is\n\nN(cid:89)\n\np(x, z, \u03b2) = p(\u03b2)\n\np(xn | zn, \u03b2)p(zn | \u03b2),\n\n(1)\n\nn=1\n\nwhere xn is an observation, zn are latent variables associated to that observation (local variables),\nand \u03b2 are latent variables shared across observations (global variables). See Fig. 1 (left).\nWith hierarchical models, local variables can be used for clustering in mixture models, mixed mem-\nberships in topic models [4], and factors in probabilistic matrix factorization [47]. 
Global variables\ncan be used to pool information across data points for hierarchical regression [16], topic models [4],\nand Bayesian nonparametrics [50].\nHierarchical models typically use a tractable likelihood p(xn | zn, \u03b2). But many likelihoods of\ninterest, such as simulator-based models [20] and generative adversarial networks [18], admit high\n\ufb01delity to the true data generating process and do not admit a tractable likelihood. To overcome this\nlimitation, we develop hierarchical implicit models (HIMs).\n\nFigure 1: (left) Hierarchical model, with local variables z and global variables \u03b2. (right) Hierar-\nchical implicit model. It is a hierarchical model where x is a deterministic function (denoted with\na square) of noise \u03f5 (denoted with a triangle).\n\nHierarchical implicit models have the same joint factorization as Eq. 1 but only assume that one can\nsample from the likelihood. Rather than de\ufb01ne p(xn | zn, \u03b2) explicitly, HIMs de\ufb01ne a function g\nthat takes in random noise \u03f5n \u223c s(\u00b7) and outputs xn given zn and \u03b2,\n\nxn = g(\u03f5n | zn, \u03b2),  \u03f5n \u223c s(\u00b7).\n\nThe induced, implicit likelihood of xn \u2208 A given zn and \u03b2 is\n\nP(xn \u2208 A | zn, \u03b2) = \u222b_{g(\u03f5n | zn, \u03b2) \u2208 A} s(\u03f5n) d\u03f5n.\n\nThis integral is typically intractable. It is dif\ufb01cult to \ufb01nd the set to integrate over, and the integration\nitself may be expensive for arbitrary noise distributions s(\u00b7) and functions g.\nFig. 1 (right) displays the graphical model for HIMs. Noise (\u03f5n) is denoted by triangles; determin-\nistic computation (xn) is denoted by squares. We illustrate two examples.\nExample: Physical Simulators. Given initial conditions, simulators describe a stochastic pro-\ncess that generates data. 
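Before turning to the examples, the abstract sampling process of an HIM can be sketched in a few lines of NumPy (our own toy illustration, not the paper's code; the simulator g and all distributions here are arbitrary choices): draw a global variable beta, per-data-point local variables z_n, and noise eps_n, and push them through a deterministic function g.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(eps, z, beta):
    # Toy deterministic simulator. Because the noise enters nonlinearly,
    # the induced likelihood p(x_n | z_n, beta) has no tractable density.
    return beta * z + np.exp(0.1 * eps)

N = 5
beta = rng.lognormal(0.0, 1.0)        # global variable, beta ~ p(beta)
z = rng.normal(0.0, 1.0, size=N)      # local variables, z_n ~ p(z_n | beta)
eps = rng.normal(0.0, 1.0, size=N)    # noise, eps_n ~ s(.)
x = g(eps, z, beta)                   # x_n = g(eps_n | z_n, beta)

# Sampling x is easy; evaluating p(x_n | z_n, beta) would require
# integrating s(eps_n) over the set where g maps to x_n.
print(x)
```

Replacing g with a physical simulator, such as a discretized predator-prey model, gives exactly the setting studied next.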
For example, in population ecology, the Lotka-Volterra model simulates\npredator-prey populations over time via a stochastic differential equation [55]. For prey and predator\npopulations x1, x2 \u2208 R+ respectively, one process is\n\ndx1/dt = \u03b21x1 \u2212 \u03b22x1x2 + \u03f51,  \u03f51 \u223c Normal(0, 10),\ndx2/dt = \u2212\u03b22x2 + \u03b23x1x2 + \u03f52,  \u03f52 \u223c Normal(0, 10),\n\nwhere Gaussian noises \u03f51, \u03f52 are added at each full time step. The simulator runs for T time steps\ngiven initial population sizes for x1, x2. Lognormal priors are placed over \u03b2. The Lotka-Volterra\nmodel is grounded by theory but features an intractable likelihood. We study it in \u00a7 4.\n\nExample: Bayesian Generative Adversarial Network. Generative adversarial networks (GANs)\nde\ufb01ne an implicit model and a method for parameter estimation [18]. They are known to perform\nwell on image generation [41]. Formally, the implicit model for a GAN is\n\nxn = g(\u03f5n; \u03b8),  \u03f5n \u223c s(\u00b7),  (2)\n\nwhere g is a neural network with parameters \u03b8, and s is a standard normal or uniform. The neural\nnetwork g is typically not invertible; this makes the likelihood intractable.\nThe parameters \u03b8 in GANs are estimated by divergence minimization between the generated and real\ndata. We make GANs amenable to Bayesian analysis by placing a prior on the parameters \u03b8. We call\nthis a Bayesian GAN. Bayesian GANs enable modeling of parameter uncertainty and are inspired by\nBayesian neural networks, which have been shown to improve the uncertainty and data ef\ufb01ciency of\nstandard neural networks [31, 37]. We study Bayesian GANs in \u00a7 4; Appendix B provides example\nimplementations in the Edward probabilistic programming language [53].\n\n3 Likelihood-Free Variational Inference\n\nWe described hierarchical implicit models, a rich class of latent variable models with local and\nglobal structure alongside an implicit density. 
Given data, we aim to calculate the model\u2019s poste-\nrior p(z, \u03b2 | x) = p(x, z, \u03b2)/p(x). This is dif\ufb01cult as the normalizing constant p(x) is typically\nintractable. With implicit models, the lack of a likelihood function introduces an additional source\nof intractability.\nWe use variational inference [23]. It posits an approximating family q \u2208 Q and optimizes to \ufb01nd\nthe member closest to p(z, \u03b2 | x). There are many choices of variational objectives that measure\ncloseness [42, 29, 10]. To choose an objective, we lay out desiderata for a variational inference\nalgorithm for implicit models:\n\n1. Scalability. Machine learning hinges on stochastic optimization to scale to massive data [6]. The\nvariational objective should admit unbiased subsampling with the standard technique,\n\n\u2211_{n=1}^{N} f(xn) \u2248 (N/M) \u2211_{m=1}^{M} f(xm),\n\nwhere some computation f(\u00b7) over the full data is approximated with a mini-batch of data {xm}.\n2. Implicit Local Approximations. Implicit models specify \ufb02exible densities; this induces very com-\nplex posterior distributions. Thus we would like a rich approximating family for the per-data\npoint approximations q(zn | xn, \u03b2). This means the variational objective should only require that\none can sample zn \u223c q(zn | xn, \u03b2) and not evaluate its density.\n\nOne variational objective meeting our desiderata is based on the classical minimization of the\nKullback-Leibler (KL) divergence. (Surprisingly, Appendix C details how the KL is the only possi-\nble objective among a broad class.)\n\n3.1 KL Variational Objective\n\nClassical variational inference minimizes the KL divergence from the variational approximation q\nto the posterior. 
This is equivalent to maximizing the evidence lower bound (ELBO),\n\nL = E_{q(\u03b2,z | x)}[log p(x, z, \u03b2) \u2212 log q(\u03b2, z | x)].  (3)\n\nLet q factorize in the same way as the posterior,\n\nq(\u03b2, z | x) = q(\u03b2) \u220f_{n=1}^{N} q(zn | xn, \u03b2),\n\nwhere q(zn | xn, \u03b2) is an intractable density and, since the data x is constant during inference, we\ndrop conditioning for the global q(\u03b2). Substituting p and q\u2019s factorization yields\n\nL = E_{q(\u03b2)}[log p(\u03b2) \u2212 log q(\u03b2)] + \u2211_{n=1}^{N} E_{q(\u03b2)q(zn | xn,\u03b2)}[log p(xn, zn | \u03b2) \u2212 log q(zn | xn, \u03b2)].\n\nThis objective presents dif\ufb01culties: the local densities p(xn, zn | \u03b2) and q(zn | xn, \u03b2) are both in-\ntractable. To solve this, we consider ratio estimation.\n\n3.2 Ratio Estimation for the KL Objective\n\nLet q(xn) be the empirical distribution on the observations x and consider using it in a \u201cvariational\njoint\u201d q(xn, zn | \u03b2) = q(xn)q(zn | xn, \u03b2). Now subtract the log empirical distribution log q(xn) from the ELBO\nabove. The ELBO reduces to\n\nL \u221d E_{q(\u03b2)}[log p(\u03b2) \u2212 log q(\u03b2)] + \u2211_{n=1}^{N} E_{q(\u03b2)q(zn | xn,\u03b2)}[log (p(xn, zn | \u03b2) / q(xn, zn | \u03b2))].  (4)\n\n(Here the proportionality symbol means equality up to additive constants.) Thus the ELBO is a\nfunction of the ratio of two intractable densities. If we can form an estimator of this ratio, we can\nproceed with optimizing the ELBO.\nWe apply techniques for ratio estimation [49]. It is a key idea in GANs [35, 54], and similar ideas\nhave rearisen in statistics and physics [19, 8]. In particular, we use class probability estimation:\ngiven a sample from p(\u00b7) or q(\u00b7) we aim to estimate the probability that it belongs to p(\u00b7). 
We model\n\n4\n\n\fthis using \u03c3(r(\u00b7; \u03b8)), where r is a parameterized function (e.g., neural network) taking sample inputs\nand outputting a real value; \u03c3 is the logistic function outputting the probability.\nWe train r(\u00b7; \u03b8) by minimizing a loss function known as a proper scoring rule [17]. For example, in\nexperiments we use the log loss,\nDlog = Ep(xn,zn | \u03b2)[\u2212 log \u03c3(r(xn, zn, \u03b2; \u03b8))] + Eq(xn,zn | \u03b2)[\u2212 log(1 \u2212 \u03c3(r(xn, zn, \u03b2; \u03b8)))]. (5)\nThe loss is zero if \u03c3(r(\u00b7; \u03b8)) returns 1 when a sample is from p(\u00b7) and 0 when a sample is from q(\u00b7).\n(We also experiment with the hinge loss; see \u00a7 4.) If r(\u00b7; \u03b8) is suf\ufb01ciently expressive, minimizing\nthe loss returns the optimal function [35],\n\nr\u2217(xn, zn, \u03b2) = log p(xn, zn | \u03b2) \u2212 log q(xn, zn | \u03b2).\n\nAs we minimize Eq.5, we use r(\u00b7; \u03b8) as a proxy to the log ratio in Eq.4. Note r estimates the log\nratio; it\u2019s of direct interest and more numerically stable than the ratio.\nThe gradient of Dlog with respect to \u03b8 is\n\nEp(xn,zn | \u03b2)[\u2207\u03b8 log \u03c3(r(xn, zn, \u03b2; \u03b8))] + Eq(xn,zn | \u03b2)[\u2207\u03b8 log(1 \u2212 \u03c3(r(xn, zn, \u03b2; \u03b8)))].\n\n(6)\n\nWe compute unbiased gradients with Monte Carlo.\n\n3.3 Stochastic Gradients of the KL Objective\n\nTo optimize the ELBO, we use the ratio estimator,\n\nL = Eq(\u03b2 | x)[log p(\u03b2) \u2212 log q(\u03b2)] +\n\nN(cid:88)\n\nn=1\n\nEq(\u03b2 | x)q(zn | xn,\u03b2)[r(xn, zn, \u03b2)].\n\n(7)\n\nAll terms are now tractable. We can calculate gradients to optimize the variational family q. Below\nwe assume the priors p(\u03b2), p(zn | \u03b2) are differentiable. (We discuss methods to handle discrete\nglobal variables in the next section.)\nWe focus on reparameterizable variational approximations [26, 46]. 
They enable sampling via a\ndifferentiable transformation T of random noise, \u03b4 \u223c s(\u00b7). Due to Eq. 7, we require the global\napproximation q(\u03b2; \u03bb) to admit a tractable density. With reparameterization, its sample is\n\n\u03b2 = Tglobal(\u03b4global; \u03bb),  \u03b4global \u223c s(\u00b7),\n\nfor a choice of transformation Tglobal(\u00b7; \u03bb) and noise s(\u00b7). For example, setting s(\u00b7) = N(0, 1) and\nTglobal(\u03b4global) = \u00b5 + \u03c3\u03b4global induces a normal distribution N(\u00b5, \u03c3^2).\nSimilarly for the local variables zn, we specify\n\nzn = Tlocal(\u03b4n, xn, \u03b2; \u03c6),  \u03b4n \u223c s(\u00b7).\n\nUnlike the global approximation, the local variational density q(zn | xn; \u03c6) need not be tractable:\nthe ratio estimator relaxes this requirement. It lets us leverage implicit models not only for data but\nalso for approximate posteriors. In practice, we also amortize computation with inference networks,\nsharing parameters \u03c6 across the per-data point approximate posteriors.\nThe gradient with respect to global parameters \u03bb under this approximating family is\n\n\u2207\u03bbL = E_{s(\u03b4global)}[\u2207\u03bb(log p(\u03b2) \u2212 log q(\u03b2))] + \u2211_{n=1}^{N} E_{s(\u03b4global)s_n(\u03b4n)}[\u2207\u03bb r(xn, zn, \u03b2)].  (8)\n\nThe gradient backpropagates through the local sampling zn = Tlocal(\u03b4n, xn, \u03b2; \u03c6) and the global\nreparameterization \u03b2 = Tglobal(\u03b4global; \u03bb). We compute unbiased gradients with Monte Carlo. 
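Before moving on, the class-probability recipe behind r(\u00b7; \u03b8) can be checked on a toy problem where the true log ratio is known in closed form (our own sketch, not the paper's implementation: p = N(1, 1), q = N(0, 1), and r is a linear "network" trained by plain gradient descent on the log loss of Eq. 5).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
xp = rng.normal(1.0, 1.0, n)                   # samples from p = N(1, 1)
xq = rng.normal(0.0, 1.0, n)                   # samples from q = N(0, 1)
x = np.concatenate([xp, xq])
y = np.concatenate([np.ones(n), np.zeros(n)])  # label 1 if drawn from p

a, b = 0.0, 0.0                                # r(x; theta) = a * x + b
lr = 0.1
for _ in range(2000):
    probs = 1.0 / (1.0 + np.exp(-(a * x + b)))  # sigma(r(x; theta))
    grad = probs - y                            # d(log loss)/d(logit), per sample
    a -= lr * np.mean(grad * x)
    b -= lr * np.mean(grad)

# For these Gaussians, log p(x) - log q(x) = x - 0.5, so at the optimum
# the learned logit recovers the log ratio: a near 1, b near -0.5.
print(a, b)
```

This illustrates why minimizing Eq. 5 yields r* = log p \u2212 log q: the optimal classifier's logit is exactly the log density ratio, here recovered up to Monte Carlo error.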
The\ngradient with respect to local parameters \u03c6 is\n\n\u2207\u03c6L = \u2211_{n=1}^{N} E_{q(\u03b2)s(\u03b4n)}[\u2207\u03c6 r(xn, zn, \u03b2)],  (9)\n\nwhere the gradient backpropagates through Tlocal.1\n\nAlgorithm 1: Likelihood-free variational inference (LFVI)\n\nInput: Model xn, zn \u223c p(\u00b7 | \u03b2), p(\u03b2);\nVariational approximation zn \u223c q(\u00b7 | xn, \u03b2; \u03c6), q(\u03b2 | x; \u03bb);\nRatio estimator r(\u00b7; \u03b8)\nOutput: Variational parameters \u03bb, \u03c6\nInitialize \u03b8, \u03bb, \u03c6 randomly.\nwhile not converged do\nCompute unbiased estimates of \u2207\u03b8D (Eq. 6), \u2207\u03bbL (Eq. 8), \u2207\u03c6L (Eq. 9).\nUpdate \u03b8, \u03bb, \u03c6 using stochastic gradient descent.\nend\n\n3.4 Algorithm\n\nAlgorithm 1 outlines the procedure. We call it likelihood-free variational inference (LFVI). LFVI\nis black box: it applies to models in which one can simulate data and local variables, and calculate\ndensities for the global variables. LFVI \ufb01rst updates \u03b8 to improve the ratio estimator r. Then\nit uses r to update parameters {\u03bb, \u03c6} of the variational approximation q. We optimize r and q\nsimultaneously. The algorithm is available in Edward [53].\nLFVI is scalable: we can unbiasedly estimate the gradient over the full data set with mini-batches\n[22]. The algorithm can also handle models of either continuous or discrete data. The requirement\nfor differentiable global variables and reparameterizable global approximations can be relaxed using\nscore function gradients [43].\nPoint estimates of the global parameters \u03b2 suf\ufb01ce for many applications [18, 46]. Algorithm 1\ncan \ufb01nd point estimates: place a point mass approximation q on the parameters \u03b2. This simpli\ufb01es\ngradients and corresponds to variational EM.\n\n4 Experiments\n\nWe developed new models and inference. 
For experiments, we study three applications: a large-\nscale physical simulator for predator-prey populations in ecology; a Bayesian GAN for supervised\nclassi\ufb01cation; and a deep implicit model for symbol generation. In addition, Appendix F provides\npractical advice on how to address the stability of the ratio estimator by analyzing a toy experiment.\nWe initialize parameters from a standard normal and apply gradient descent with ADAM.\nLotka-Volterra Predator-Prey Simulator. We analyze the Lotka-Volterra simulator of \u00a7 2 and\nfollow the same setup and hyperparameters as Papamakarios and Murray [38]. Its global variables\n\u03b2 govern rates of change in a simulation of predator-prey populations. To infer them, we posit a\nmean-\ufb01eld normal approximation (reparameterized to be on the same support) and run Algorithm 1\nwith both a log loss and hinge loss for the ratio estimation problem; Appendix D details the hinge\nloss. We compare to rejection ABC, MCMC-ABC, and SMC-ABC [33]. MCMC-ABC uses a spher-\nical Gaussian proposal; SMC-ABC is manually tuned with a decaying epsilon schedule; all ABC\nmethods are tuned to use the best performing hyperparameters such as the tolerance error.\nFig. 2 displays results on two data sets. In the top \ufb01gures and bottom left, we analyze data consisting\nof a simulation for T = 30 time steps, with recorded values of the populations every 0.2 time\nunits. The bottom left \ufb01gure calculates the negative log probability of the true parameters over the\ntolerance error for ABC methods; smaller tolerances result in more accuracy but slower runtime.\nThe top \ufb01gures compare the marginal posteriors for two parameters using the smallest tolerance for\nthe ABC methods. Rejection ABC, MCMC-ABC, and SMC-ABC all contain the true parameters in\ntheir 95% credible interval but are less con\ufb01dent than our methods. 
Further, they required 100,000\nsimulations from the model, with acceptance rates of 0.004% and 2.990% for rejection ABC and\nMCMC-ABC respectively.\n\n1The ratio r indirectly depends on \u03c6 but its gradient w.r.t. \u03c6 disappears. This is derived via the score\nfunction identity and the product rule (see, e.g., Ranganath et al. [43, Appendix]).\n\nFigure 2: (top) Marginal posterior for \ufb01rst two parameters. (bot. left) ABC methods over tolerance\nerror. (bot. right) Marginal posterior for \ufb01rst parameter on a large-scale data set. Our inference\nachieves more accurate results and scales to massive data.\n\nTest Set Error\n\nModel + Inference      Crabs   Pima    Covertype   MNIST\nBayesian GAN + VI      0.03    0.232   0.154       0.0136\nBayesian GAN + MAP     0.12    0.240   0.185       0.0283\nBayesian NN + VI       0.02    0.242   0.164       0.0311\nBayesian NN + MAP      0.05    0.320   0.188       0.0623\n\nTable 1: Classi\ufb01cation error of Bayesian GAN and Bayesian neural networks across small to\nmedium-size data sets (lower is better). Bayesian GANs achieve comparable or better performance\nthan their Bayesian neural net counterpart.\n\nThe bottom right \ufb01gure analyzes data consisting of 100,000 time series, each of the same size as the\nsingle time series analyzed in the previous \ufb01gures. This size is not possible with traditional methods.\nFurther, we see that with our methods, the posterior concentrates near the truth. We also experienced\nlittle difference in accuracy between using the log loss or the hinge loss for ratio estimation.\nBayesian Generative Adversarial Networks. We analyze Bayesian GANs, described in \u00a7 2. Mim-\nicking a use case of Bayesian neural networks [5, 21], we apply Bayesian GANs for classi\ufb01cation\non small to medium-size data. The GAN de\ufb01nes a conditional p(yn | xn), taking a feature xn \u2208 R^D\nas input and generating a label yn \u2208 {1, . . . 
, K}, via the process\n\nyn = g(xn, \u03f5n | \u03b8),  \u03f5n \u223c N(0, 1),  (10)\n\nwhere g(\u00b7 | \u03b8) is a 2-layer multilayer perceptron with ReLU activations and batch normalization,\nparameterized by weights and biases \u03b8. We place normal priors, \u03b8 \u223c N(0, 1).\nWe analyze two choices of the variational model: one with a mean-\ufb01eld normal approximation for\nq(\u03b8 | x), and another with a point mass approximation (equivalent to maximum a posteriori). We\ncompare to a Bayesian neural network, which uses the same generative process as Eq. 10 but draws\nfrom a Categorical distribution rather than feeding noise into the neural net. We \ufb01t it separately\nusing a mean-\ufb01eld normal approximation and maximum a posteriori. Table 1 shows that Bayesian\nGANs generally outperform their Bayesian neural net counterpart.\nNote that Bayesian GANs can analyze discrete data, such as in generating a classi\ufb01cation label.\nTraining traditional GANs on discrete data is an open challenge [27]. In Appendix E, we compare Bayesian\nGANs with point estimation to typical GANs. 
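To make the generative process of Eq. 10 concrete, here is a toy NumPy version (our own sketch, not the paper's Edward code; reading the discrete label off with an argmax over K output units is our assumption, and batch normalization is omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, H = 4, 3, 16                 # feature dim, classes, hidden units

# A single draw of the weights; the paper places theta ~ N(0, 1) priors.
W1 = rng.normal(0.0, 1.0, (H, D + 1))
b1 = rng.normal(0.0, 1.0, H)
W2 = rng.normal(0.0, 1.0, (K, H))
b2 = rng.normal(0.0, 1.0, K)

def g(x_feat, eps):
    # y_n = g(x_n, eps_n | theta): the noise enters alongside the feature,
    # and the label is read off the K output units (argmax is our choice).
    h = np.maximum(W1 @ np.concatenate((x_feat, [eps])) + b1, 0.0)
    return int(np.argmax(W2 @ h + b2))

x_feat = rng.normal(size=D)
labels = [g(x_feat, rng.normal()) for _ in range(10)]
print(labels)
```

Because the label is produced by a deterministic network of feature and noise rather than by a Categorical likelihood, p(yn | xn) is implicit, which is why estimation goes through the ratio estimator rather than a log-likelihood.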
Bayesian GANs are also able to leverage parameter\nuncertainty for analyzing these small to medium-size data sets.\n\n(a) A deep implicit model for sequences. It is a recurrent neural network (RNN) with noise injected\ninto each hidden state. The hidden state is now an implicit latent variable. The same occurs for\ngenerating outputs.\n\n(b) Generated symbols from the implicit model. Good samples place arithmetic operators between\nthe variable x. The implicit model learned to follow rules from the context free grammar up to some\nmultiple operator repeats. Sample sequences:\n\n\u2212x+x/x\u2217\u2217x\u2217//x\u2217x+\nx/x\u2217x+x\u2217x/x+x+x+\n/+x\u2217x+x\u2217x/x/x+x+\n/x+\u2217x+x\u2217x/x+x\u2212x+\nx/x\u2217x/x\u2217x+x+x+x\u2212\nx+x+x/x\u2217x\u2217x+x/x+\n\nOne problem with Bayesian GANs is that they cannot work with very large neural networks: the\nratio estimator is a function of global parameters, and thus the input size grows with the size of the\nneural network. One approach is to make the ratio estimator not a function of the global parameters.\nInstead of optimizing model parameters via variational EM, we can train the model parameters by\nbackpropagating through the ratio objective instead of the variational objective. 
An alternative is to\nuse the hidden units as input, which is much lower dimensional [51, Appendix C].\n\nInjecting Noise into Hidden Units. In this section, we show how to build a hierarchical implicit\nmodel by simply injecting randomness into hidden units. We model sequences x = (x1, . . . , xT)\nwith a recurrent neural network. For t = 1, . . . , T,\n\nzt = gz(xt\u22121, zt\u22121, \u03f5t,z),  \u03f5t,z \u223c N(0, 1),\nxt = gx(zt, \u03f5t,x),  \u03f5t,x \u223c N(0, 1),\n\nwhere gz and gx are both 1-layer multilayer perceptrons with ReLU activation and layer normaliza-\ntion. We place standard normal priors over all weights and biases. See Fig. 3a.\nIf the injected noise \u03f5t,z combines linearly with the output of gz, the induced distribution\np(zt | xt\u22121, zt\u22121) is Gaussian parameterized by that output. This de\ufb01nes a stochastic RNN [2, 14],\nwhich generalizes its deterministic connection. With nonlinear combinations, the implicit density\nis more \ufb02exible (and intractable), making previous methods for inference not applicable. In our\nmethod, we perform variational inference and specify q to be implicit; we use the same architecture\nas the probability model\u2019s implicit priors.\nWe follow the same setup and hyperparameters as Kusner and Hern\u00e1ndez-Lobato [27] and generate\nsimple one-variable arithmetic sequences following a context free grammar,\n\nS \u2192 x \u2016 S + S \u2016 S \u2212 S \u2016 S \u2217 S \u2016 S/S,\n\nwhere \u2016 divides possible productions of the grammar. We concatenate the inputs and point estimate\nthe global variables (model parameters) using variational EM. Fig. 3b displays samples from the\ninferred model, training on sequences with a maximum of 15 symbols. 
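The noise-injected recurrence itself is straightforward to simulate. A toy NumPy sketch (our own illustration, with arbitrary fixed weights standing in for inferred parameters; layer normalization is omitted, and outputs here are real-valued rather than grammar symbols):

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 8, 15                            # hidden size, sequence length

# Fixed toy weights; the paper places standard normal priors over weights.
Wz = rng.normal(0.0, 0.3, (H, 2 * H + 1))
Wx = rng.normal(0.0, 0.3, (H + 1,))

def g_z(x_prev, z_prev, eps_z):
    # Hidden update: the noise passes through the ReLU nonlinearly,
    # so p(z_t | x_{t-1}, z_{t-1}) is implicit (no tractable density).
    return np.maximum(Wz @ np.concatenate(([x_prev], z_prev, eps_z)), 0.0)

def g_x(z, eps_x):
    # Emission: the output also receives its own injected noise.
    return float(Wx @ np.concatenate((z, [eps_x])))

x = np.zeros(T)
z = np.zeros(H)
x_prev = 0.0
for t in range(T):
    z = g_z(x_prev, z, rng.normal(size=H))  # eps_{t,z} ~ N(0, 1)
    x_prev = g_x(z, rng.normal())           # eps_{t,x} ~ N(0, 1)
    x[t] = x_prev
print(x.shape)  # (15,)
```

Because the noise combines nonlinearly with the hidden update, neither p(zt | xt\u22121, zt\u22121) nor p(xt | zt) is tractable, which is exactly the implicit setting that LFVI handles.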
It achieves sequences which\nroughly follow the context free grammar.\n\n5 Discussion\nWe developed a class of hierarchical implicit models and likelihood-free variational inference, merg-\ning the idea of implicit densities with hierarchical Bayesian modeling and approximate posterior\ninference. This expands Bayesian analysis with the ability to apply neural samplers, physical simu-\nlators, and their combination with rich, interpretable latent structure.\nMore stable inference with ratio estimation is an open challenge. This is especially important when\nwe analyze large-scale real world applications of implicit models. Recent work for genomics offers\na promising solution [51].\n\nAcknowledgements. We thank Balaji Lakshminarayanan for discussions which helped motivate\nthis work. We also thank Christian Naesseth, Jaan Altosaar, and Adji Dieng for their feedback\nand comments. DT is supported by a Google Ph.D. Fellowship in Machine Learning and an\nAdobe Research Fellowship. This work is also supported by NSF IIS-0745520, IIS-1247664, IIS-\n1009542, ONR N00014-11-1-0651, DARPA FA8750-14-2-0009, N66001-15-C-4032, Facebook,\nAdobe, Amazon, and the John Templeton Foundation.\n\n8\n\n\fReferences\n[1] Anelli, G., Antchev, G., Aspell, P., Avati, V., Bagliesi, M., Berardi, V., Berretti, M., Boccone, V.,\nBottigli, U., Bozzo, M., et al. (2008). The totem experiment at the CERN large Hadron collider.\nJournal of Instrumentation, 3(08):S08007.\n\n[2] Bayer, J. and Osendorfer, C. (2014). Learning stochastic recurrent networks. arXiv preprint\n\narXiv:1411.7610.\n\n[3] Beaumont, M. A. (2010). Approximate Bayesian computation in evolution and ecology. Annual\n\nReview of Ecology, Evolution and Systematics, 41(379-406):1.\n\n[4] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine\n\nLearning Research, 3(Jan):993\u20131022.\n\n[5] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). 
Weight uncertainty in neural network. In International Conference on Machine Learning.

[6] Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer.

[7] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Neural Information Processing Systems.

[8] Cranmer, K., Pavez, J., and Louppe, G. (2015). Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169.

[9] Diaconis, P., Holmes, S., and Montgomery, R. (2007). Dynamical bias in the coin toss. SIAM Review, 49(2):211–235.

[10] Dieng, A. B., Tran, D., Ranganath, R., Paisley, J., and Blei, D. M. (2017). The χ-divergence for approximate inference. In Neural Information Processing Systems.

[11] Diggle, P. J. and Gratton, R. J. (1984). Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society: Series B (Methodological), pages 193–227.

[12] Donahue, J., Krähenbühl, P., and Darrell, T. (2017). Adversarial feature learning. In International Conference on Learning Representations.

[13] Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., and Courville, A. (2017). Adversarially learned inference. In International Conference on Learning Representations.

[14] Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. (2016). Sequential neural models with stochastic layers. In Neural Information Processing Systems.

[15] Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian data analysis. Texts in Statistical Science Series. CRC Press, Boca Raton, FL.

[16] Gelman, A. and Hill, J. (2006).
Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.

[17] Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.

[18] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Neural Information Processing Systems.

[19] Gutmann, M. U., Dutta, R., Kaski, S., and Corander, J. (2014). Statistical inference of intractable generative models via classification. arXiv preprint arXiv:1407.4981.

[20] Hartig, F., Calabrese, J. M., Reineking, B., Wiegand, T., and Huth, A. (2011). Statistical inference for stochastic simulation models – theory and application. Ecology Letters, 14(8):816–827.

[21] Hernández-Lobato, J. M., Li, Y., Rowland, M., Hernández-Lobato, D., Bui, T., and Turner, R. E. (2016). Black-box α-divergence minimization. In International Conference on Machine Learning.

[22] Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. W. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347.

[23] Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning.

[24] Karaletsos, T. (2016). Adversarial message passing for graphical models. In NIPS Workshop.

[25] Keller, J. B. (1986). The probability of heads. The American Mathematical Monthly, 93(3):191–197.

[26] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.

[27] Kusner, M. J. and Hernández-Lobato, J. M. (2016). GANs for sequences of discrete elements with the Gumbel-Softmax distribution. In NIPS Workshop.

[28] Larsen, A. B. L., Sønderby, S.
K., Larochelle, H., and Winther, O. (2016). Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning.

[29] Li, Y. and Turner, R. E. (2016). Rényi divergence variational inference. In Neural Information Processing Systems.

[30] Liu, Q. and Feng, Y. (2016). Two methods for wild variational inference. arXiv preprint arXiv:1612.00081.

[31] MacKay, D. J. C. (1992). Bayesian methods for adaptive models. PhD thesis, California Institute of Technology.

[32] Makhzani, A., Shlens, J., Jaitly, N., and Goodfellow, I. (2015). Adversarial autoencoders. arXiv preprint arXiv:1511.05644.

[33] Marin, J.-M., Pudlo, P., Robert, C. P., and Ryder, R. J. (2012). Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180.

[34] Mescheder, L., Nowozin, S., and Geiger, A. (2017). Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722.

[35] Mohamed, S. and Lakshminarayanan, B. (2016). Learning in implicit generative models. arXiv preprint arXiv:1610.03483.

[36] Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

[37] Neal, R. M. (1994). Bayesian learning for neural networks. PhD thesis, University of Toronto.

[38] Papamakarios, G. and Murray, I. (2016). Fast ε-free inference of simulation models with Bayesian conditional density estimation. In Neural Information Processing Systems.

[39] Pearl, J. (2000). Causality. Cambridge University Press.

[40] Pritchard, J. K., Seielstad, M. T., Perez-Lezaun, A., and Feldman, M. W. (1999). Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, 16(12):1791–1798.

[41] Radford, A., Metz, L., and Chintala, S. (2016).
Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations.

[42] Ranganath, R., Altosaar, J., Tran, D., and Blei, D. M. (2016a). Operator variational inference. In Neural Information Processing Systems.

[43] Ranganath, R., Gerrish, S., and Blei, D. M. (2014). Black box variational inference. In Artificial Intelligence and Statistics.

[44] Ranganath, R., Tran, D., and Blei, D. M. (2016b). Hierarchical variational models. In International Conference on Machine Learning.

[45] Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. In International Conference on Machine Learning.

[46] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.

[47] Salakhutdinov, R. and Mnih, A. (2008). Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In International Conference on Machine Learning, pages 880–887. ACM.

[48] Salimans, T., Kingma, D. P., and Welling, M. (2015). Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning.

[49] Sugiyama, M., Suzuki, T., and Kanamori, T. (2012). Density-ratio matching under the Bregman divergence: A unified framework of density-ratio estimation. Annals of the Institute of Statistical Mathematics.

[50] Teh, Y. W. and Jordan, M. I. (2010). Hierarchical Bayesian nonparametric models with applications. Bayesian Nonparametrics, 1.

[51] Tran, D. and Blei, D. M. (2017). Implicit causal models for genome-wide association studies. arXiv preprint arXiv:1710.10742.

[52] Tran, D., Blei, D. M., and Airoldi, E. M. (2015).
Copula variational inference. In Neural Information Processing Systems.

[53] Tran, D., Kucukelbir, A., Dieng, A. B., Rudolph, M., Liang, D., and Blei, D. M. (2016). Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787.

[54] Uehara, M., Sato, I., Suzuki, M., Nakayama, K., and Matsuo, Y. (2016). Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920.

[55] Wilkinson, D. J. (2011). Stochastic modelling for systems biology. CRC press.