{"title": "Energy-Inspired Models: Learning with Sampler-Induced Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 8501, "page_last": 8513, "abstract": "Energy-based models (EBMs) are powerful probabilistic models, but suffer from intractable sampling and density evaluation due to the partition function. As a result, inference in EBMs relies on approximate sampling algorithms, leading to a mismatch between the model and inference. Motivated by this, we consider the sampler-induced distribution as the model of interest and maximize the likelihood of this model. This yields a class of energy-inspired models (EIMs) that incorporate learned energy functions while still providing exact samples and tractable log-likelihood lower bounds. We describe and evaluate three instantiations of such models based on truncated rejection sampling, self-normalized importance sampling, and Hamiltonian importance sampling. These models out-perform or perform comparably to the recently proposed Learned Accept/RejectSampling algorithm and provide new insights on ranking Noise Contrastive Estimation and Contrastive Predictive Coding. Moreover, EIMs allow us to generalize a recent connection between multi-sample variational lower bounds and auxiliary variable variational inference. We show how recent variational bounds can be unified with EIMs as the variational family.", "full_text": "Energy-Inspired Models: Learning with\n\nSampler-Induced Distributions\n\nDieterich Lawson\u2217\u2217\u2020\nStanford University\n\njdlawson@stanford.edu\n\nGeorge Tucker\u2217, Bo Dai\n\nGoogle Research, Brain Team\n{gjt, bodai}@google.com\n\nRajesh Ranganath\nNew York University\n\nrajeshr@cims.nyu.edu\n\nAbstract\n\nEnergy-based models (EBMs) are powerful probabilistic models [8, 44], but suffer\nfrom intractable sampling and density evaluation due to the partition function. 
As a result, inference in EBMs relies on approximate sampling algorithms, leading to a mismatch between the model and inference. Motivated by this, we consider the sampler-induced distribution as the model of interest and maximize the likelihood of this model. This yields a class of energy-inspired models (EIMs) that incorporate learned energy functions while still providing exact samples and tractable log-likelihood lower bounds. We describe and evaluate three instantiations of such models based on truncated rejection sampling, self-normalized importance sampling, and Hamiltonian importance sampling. These models outperform or perform comparably to the recently proposed Learned Accept/Reject Sampling algorithm [5] and provide new insights on ranking Noise Contrastive Estimation [34, 46] and Contrastive Predictive Coding [57]. Moreover, EIMs allow us to generalize a recent connection between multi-sample variational lower bounds [9] and auxiliary variable variational inference [1, 63, 59, 47]. We show how recent variational bounds [9, 49, 52, 42, 73, 51, 65] can be unified with EIMs as the variational family.

1 Introduction

Energy-based models (EBMs) have a long history in statistics and machine learning [16, 75, 44]. EBMs score configurations of variables with an energy function, which induces a distribution on the variables in the form of a Gibbs distribution. Different choices of energy function recover well-known probabilistic models including Markov random fields [36], (restricted) Boltzmann machines [64, 24, 30], and conditional random fields [41].
However, this flexibility comes at the cost of challenging inference and learning: both sampling and density evaluation of EBMs are generally intractable, which hinders the applications of EBMs in practice.

Because of the intractability of general EBMs, practical implementations rely on approximate sampling procedures (e.g., Markov chain Monte Carlo (MCMC)) for inference. This creates a mismatch between the model and the approximate inference procedure, and can lead to suboptimal performance and unstable training when approximate samples are used in the training procedure. Currently, most attempts to fix the mismatch lie in designing better sampling algorithms (e.g., Hamiltonian Monte Carlo [54], annealed importance sampling [53]) or exploiting variational techniques [35, 15, 14] to reduce the inference approximation error.

∗Equal contributions. †Research performed while at New York University.
Code and image samples: sites.google.com/view/energy-inspired-models.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Instead, we bridge the gap between the model and inference by directly treating the sampling procedure as the model of interest and optimizing the log-likelihood of the sampling procedure. We call these models energy-inspired models (EIMs) because they incorporate a learned energy function while providing tractable, exact samples. This shift in perspective aligns the training and sampling procedure, leading to principled and consistent training and inference.

To accomplish this, we cast the sampling procedure as a latent variable model. This allows us to maximize variational lower bounds [33, 7] on the log-likelihood (c.f., Kingma and Welling [38], Rezende et al. [61]).
To illustrate this, we develop and evaluate energy-inspired models based on truncated rejection sampling (Algorithm 1), self-normalized importance sampling (Algorithm 2), and Hamiltonian importance sampling (Algorithm 3). Interestingly, the model based on self-normalized importance sampling is closely related to ranking NCE [34, 46], suggesting a principled objective for training the "noise" distribution.

Our second contribution is to show that EIMs provide a unifying conceptual framework to explain many advances in constructing tighter variational lower bounds for latent variable models (e.g., [9, 49, 52, 42, 73, 51, 65]). Previously, each bound required a separate derivation and evaluation, and their relationship was unclear. We show that these bounds can be viewed as specific instances of auxiliary variable variational inference [1, 63, 59, 47] with different EIMs as the variational family. Based on general results for auxiliary latent variables, this immediately gives rise to a variational lower bound with a characterization of the tightness of the bound. Furthermore, this unified view highlights the implicit (potentially suboptimal) choices made and exposes the reusable components that can be combined to form novel variational lower bounds. Concurrently, Domke and Sheldon [19] note a similar connection; however, their focus is on the use of the variational distribution for posterior inference.

In summary, our contributions are:

• The construction of a tractable class of energy-inspired models (EIMs), which lead to consistent learning and inference. To illustrate this, we build models with truncated rejection sampling, self-normalized importance sampling, and Hamiltonian importance sampling and evaluate them on synthetic and real-world tasks.
These models can be fit by maximizing a tractable lower bound on their log-likelihood.

• We show that EIMs with auxiliary variable variational inference provide a unifying framework for understanding recent tighter variational lower bounds, simplifying their analysis and exposing potentially sub-optimal design choices.

2 Background

In this work, we consider learned probabilistic models of data p(x). Energy-based models [44] define p(x) in terms of an energy function U(x):

p(x) = π(x) exp(−U(x)) / Z,

where π is a tractable "prior" distribution and Z = ∫ π(x) exp(−U(x)) dx is a generally intractable partition function. To fit the model, many approximate methods have been developed (e.g., pseudo log-likelihood [6], contrastive divergence [30, 67], score matching estimator [31], minimum probability flow [66], noise contrastive estimation [28]) to bypass the calculation of the partition function. Empirically, previous work has found that convolutional architectures that score images (i.e., map x to a real number) tend to have strong inductive biases that match natural data (e.g., [70, 71, 72, 25, 22]). These networks are a natural fit for energy-based models. Because drawing exact samples from these models is intractable, samples are typically approximated by Monte Carlo schemes, for example, Hamiltonian Monte Carlo [55].

Alternatively, latent variables z allow us to construct complex distributions by defining the likelihood p(x) = ∫ p(x|z) p(z) dz in terms of tractable components p(z) and p(x|z).
While marginalizing z is generally intractable, we can instead optimize a tractable lower bound on log p(x) using the identity

log p(x) = E_q(z|x)[ log p(x, z) / q(z|x) ] + DKL(q(z|x) || p(z|x)),   (1)

where q(z|x) is a variational distribution and the positive DKL term can be omitted to form a lower bound commonly referred to as the evidence lower bound (ELBO) [33, 7]. The tightness of the bound is controlled by how accurately q(z|x) models p(z|x), so limited expressivity in the variational family can negatively impact the learned model.

3 Energy-Inspired Models

Instead of viewing the sampling procedure as drawing approximate samples from the energy-based models, we treat the sampling procedure as the model of interest. We represent the randomness in the sampler as latent variables, and we obtain a tractable lower bound on the marginal likelihood using the ELBO. Explicitly, if p(λ) represents the randomness in the sampler and p(x|λ) is the generative process, then

log p(x) ≥ E_q(λ|x)[ log p(λ) p(x|λ) / q(λ|x) ],   (2)

where q(λ|x) is a variational distribution that can be optimized to tighten the bound. In this section, we explore concrete instantiations of models in this paradigm: one based on truncated rejection sampling (TRS), one based on self-normalized importance sampling (SNIS), and another based on Hamiltonian importance sampling (HIS) [54].

Algorithm 1 TRS(π, U, T) generative process
Require: Proposal distribution π(x), energy function U(x), and truncation step T.
1: for t = 1, . . . , T − 1 do
2:   Sample xt ∼ π(x).
3:   Sample bt ∼ Bernoulli(σ(−U(xt))).
4: end for
5: Sample xT ∼ π(x) and set bT = 1.
6: Compute i = min t s.t.
bt = 1.
7: return x = xi.

3.1 Truncated Rejection Sampling (TRS)

Consider the truncated rejection sampling process (Algorithm 1) used in [5], where we sequentially draw a sample xt from π(x) and accept it with probability σ(−U(xt)). To ensure that the process ends, if we have not accepted a sample after T steps, then we return xT.

In this case, λ = (x1:T, b1:T−1, i), so we need to construct a variational distribution q(λ|x). The optimal q(λ|x) is p(λ|x), which motivates choosing a similarly structured variational distribution; however, it is generally intractable. So, we choose q(i|x) ∝ (1 − Ẑ)^{i−1} σ(−U(x)) δi 1 when the true data distribution is in our model family. As a result, it is straightforward to adapt the consistency proof from [46] to our setting. Furthermore, our perspective gives a coherent objective for jointly learning the noise distribution and the energy function and shows that the ranking NCE loss can be viewed as a lower bound on the log likelihood of a well-specified model regardless of whether the true data distribution is in our model family. In addition, we can recover the recently proposed InfoNCE [57] bound on mutual information by using SNIS as the variational distribution in the classic variational bound by Barber and Agakov [4] (see Appendix C for details).

To train the SNIS model, we perform stochastic gradient ascent on Eq. (3) with respect to the parameters of the proposal distribution π and the energy function U.
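As a reference point, the SNIS generative process (Algorithm 2: draw K proposal samples, then resample one of them in proportion to exp(−U)) can be sketched in NumPy. The Gaussian proposal and quadratic energy below are hypothetical stand-ins for the learned networks:

```python
import numpy as np

def snis_sample(proposal_sampler, energy, K, rng):
    """Draw one exact sample from the SNIS energy-inspired model:
    sample K candidates from the proposal pi, then resample index
    i ~ Categorical(softmax(-U(x_1), ..., -U(x_K)))."""
    xs = proposal_sampler(K)               # K candidates from pi(x)
    logits = -energy(xs)                   # unnormalized log weights
    w = np.exp(logits - logits.max())      # numerically stable softmax
    w /= w.sum()
    return xs[rng.choice(K, p=w)]

# Hypothetical stand-ins for the learned proposal and energy.
rng = np.random.default_rng(0)
proposal = lambda k: rng.normal(0.0, 1.0, size=k)   # pi = N(0, 1)
U = lambda x: 0.5 * (x - 2.0) ** 2                  # energy favoring x near 2
x = snis_sample(proposal, U, K=1024, rng=rng)
```

With this choice of energy, the resampled draws shift from the proposal mean toward low-energy regions, illustrating how the energy function reweights exact proposal samples.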
When the data x are continuous, reparameterization gradients can be used to estimate the gradients to the proposal distribution [61, 38]. When the data are discrete, score function gradient estimators such as REINFORCE [68] or relaxed gradient estimators such as the Gumbel-Softmax [48, 32] can be used.

3.3 Hamiltonian importance sampling (HIS)

Simple importance sampling scales poorly with dimensionality, so it is natural to consider more complex samplers with better scaling properties. We evaluated models based on Hamiltonian importance sampling (HIS) [54], which evolve an initial sample under deterministic, discretized Hamiltonian dynamics with a learned energy function. In particular, we sample initial location and momentum variables, and then transition the candidate sample and momentum with leapfrog integration steps, changing the temperature at each step (Algorithm 3). While the quality of samples from SNIS is limited by the samples initially produced by the proposal, a model based on HIS updates the positions of the samples directly, potentially allowing for more expressive power. Intuitively, the proposal provides a coarse starting sample which is further refined by gradient optimization on the energy function. When the proposal is already quite strong, drawing additional samples as in SNIS may be advantageous.

Figure 1: Performance of LARS, TRS, SNIS, and HIS on synthetic data. LARS, TRS, and SNIS achieve comparable data log-likelihood lower bounds on the first two synthetic datasets, whereas HIS converges slowly on these low dimensional tasks. The results for LARS on the Nine Gaussians problem match previously-reported results in [5]. We visualize the target and learned densities in Appendix Fig. 2.

In practice, we parameterize the temperature schedule such that ∏_{t=0}^{T} αt = 1.
This ensures that the deterministic invertible transform from (x0, ρ0) to (xT, ρT) has a Jacobian determinant of 1 (i.e., p(x0, ρ0) = p(xT, ρT)). Applying Eq. (2) yields a tractable variational objective

log pHIS(xT) ≥ E_q(ρT|xT)[ log p(xT, ρT) / q(ρT|xT) ] = E_q(ρT|xT)[ log p(x0, ρ0) / q(ρT|xT) ].

We jointly optimize π, U, ε, α0:T, and the variational parameters with stochastic gradient ascent. Goyal et al. [26] propose a similar approach that generates a multi-step trajectory via a learned transition operator.

Algorithm 3 HIS(π, U, ε, α0:T) generative process
Require: Proposal distribution π(x), energy function U(x), step size ε, temperature schedule α0, . . . , αT.
1: Sample x0 ∼ π(x) and ρ0 ∼ N(0, I).
2: ρ0 = α0 ρ0
3: for t = 1, . . . , T do
4:   ρt = ρt−1 − (ε/2) ⊙ ∇U(xt−1)
5:   xt = xt−1 + ε ⊙ ρt
6:   ρt = αt (ρt − (ε/2) ⊙ ∇U(xt))
7: end for
8: return xT

4 Experiments

We evaluated the proposed models on a set of synthetic datasets, binarized MNIST [43] and Fashion MNIST [69], and continuous MNIST, Fashion MNIST, and CelebA [45]. See Appendix D for details on the datasets, network architectures, and other implementation details.
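Before turning to the results, the tempered leapfrog updates of Algorithm 3 can be sketched in NumPy. The standard-normal proposal, identity-gradient energy, scalar step size, and trivial temperature schedule below are hypothetical stand-ins (the paper learns ε, the temperatures, and U):

```python
import numpy as np

def his_sample(proposal_sampler, grad_U, eps, alphas, rng):
    """Algorithm 3: tempered leapfrog evolution of a proposal sample.
    alphas has length T + 1 with prod(alphas) == 1, so the map from
    (x0, rho0) to (xT, rhoT) has unit Jacobian determinant."""
    x = proposal_sampler()
    rho = alphas[0] * rng.standard_normal(np.shape(x))  # rho0 ~ N(0, I), then tempered
    for alpha_t in alphas[1:]:
        rho = rho - (eps / 2) * grad_U(x)   # half-step momentum update
        x = x + eps * rho                   # full-step position update
        rho = alpha_t * (rho - (eps / 2) * grad_U(x))
    return x

# Hypothetical stand-ins: proposal N(0, I) in 2D and energy U(x) = |x|^2 / 2,
# whose gradient is the identity map.
rng = np.random.default_rng(0)
T = 20
alphas = np.ones(T + 1)  # trivial schedule; prod(alphas) == 1
x = his_sample(lambda: rng.standard_normal(2), lambda x: x, eps=0.1, alphas=alphas, rng=rng)
```

Because the transform is deterministic given (x0, ρ0), only the final momentum ρT needs a variational distribution, which is what makes the bound above tractable.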
To provide a competitive baseline, we use the recently developed Learned Accept/Reject Sampling (LARS) model [5].

4.1 Synthetic data

As a preliminary experiment, we evaluated the methods on modeling synthetic densities: a mixture of 9 equally-weighted Gaussian densities, a checkerboard density with uniform mass distributed in 8 squares, and two concentric rings (Fig. 1 and Appendix Fig. 2 for visualizations). For all methods, we used a unimodal standard Gaussian as the proposal distribution (see Appendix D for further details). TRS, SNIS, and LARS perform comparably on the Nine Gaussians and Checkerboard datasets.

Method                        Static MNIST     Dynamic MNIST    Fashion MNIST
VAE w/ Gaussian prior         −89.20 ± 0.08    −84.82 ± 0.12    −228.70 ± 0.09
VAE w/ LARS prior             −86.53           −83.03           −227.45†
VAE w/ TRS prior              −86.81 ± 0.06    −82.74 ± 0.10    −227.66 ± 0.14
VAE w/ SNIS prior             −86.28 ± 0.14    −82.52 ± 0.03    −227.51 ± 0.09
VAE w/ HIS prior              −86.00 ± 0.05    −82.43 ± 0.05    −227.63 ± 0.04
ConvHVAE w/ Gaussian prior    −82.43 ± 0.07    −81.14 ± 0.04    −226.39 ± 0.12
ConvHVAE w/ LARS prior        −81.70           −80.30           −225.92
ConvHVAE w/ TRS prior         −81.62 ± 0.03    −80.31 ± 0.04    −226.04 ± 0.19
ConvHVAE w/ SNIS prior        −81.51 ± 0.06    −80.19 ± 0.07    −225.83 ± 0.04
ConvHVAE w/ HIS prior         −81.89 ± 0.02    −80.51 ± 0.07    −226.12 ± 0.13
SNIS w/ VAE proposal          −87.65 ± 0.07    −83.43 ± 0.07    −227.63 ± 0.06
LARS w/ VAE proposal          —                −83.63           —
SNIS w/ ConvHVAE proposal     −81.65 ± 0.05    −79.91 ± 0.05    −225.35 ± 0.07

Table 1: Performance on binarized MNIST and Fashion MNIST. We report 1000 sample IWAE log-likelihood lower bounds (in nats) computed on the test set. LARS results are copied from [5]. †We note that our implementation of the VAE (on which our models are based) underperforms the reported VAE results in [5] on Fashion MNIST.

Method                        MNIST            Fashion MNIST    CelebA
Small VAE                     −1258.81 ± 0.49  −2467.91 ± 0.68  −60130.94 ± 34.15
LARS w/ small VAE proposal    −1254.27 ± 0.62  −2463.71 ± 0.24  −60116.65 ± 1.14
SNIS w/ small VAE proposal    −1253.67 ± 0.29  −2463.60 ± 0.31  −60115.99 ± 19.75
HIS w/ small VAE proposal     −1186.06 ± 6.12  −2419.83 ± 2.47  −59711.30 ± 53.08
VAE                           −991.46 ± 0.39   −2242.50 ± 0.70  −57471.48 ± 11.65
LARS w/ VAE proposal          −987.62 ± 0.16   −2236.87 ± 1.36  −57488.21 ± 18.41
SNIS w/ VAE proposal          −988.29 ± 0.20   −2238.04 ± 0.43  −57470.42 ± 6.54
HIS w/ VAE proposal           −990.68 ± 0.41   −2244.66 ± 1.47  −56643.64 ± 8.78
MAF                           −1027            —                —

Table 2: Performance on continuous MNIST, Fashion MNIST, and CelebA. We report 1000 sample IWAE log-likelihood lower bounds (in nats) computed on the test set. As a point of comparison, we include a similar result from a 5 layer Masked Autoregressive Flow distribution [58].
On the Two Rings dataset, despite tuning hyperparameters, we were unable to make LARS learn the density.

On these simple problems, the target density lies in the high probability region of the proposal density, so TRS, SNIS, and LARS only have to reweight the proposal samples appropriately. In high-dimensional problems where the proposal density is mismatched from the target density, however, we expect HIS to outperform TRS, SNIS, and LARS. To test this we ran each algorithm on the Nine Gaussians problem with a Gaussian proposal of mean 0 and variance 0.1 so that there was a significant mismatch in support between the target and proposal densities. The results in the rightmost panel of Fig. 1 show that HIS was almost unaffected by the change in proposal while the other algorithms suffered considerably.

4.2 Binarized MNIST and Fashion MNIST

Next, we evaluated the models on binarized MNIST and Fashion MNIST. MNIST digits can be either statically or dynamically binarized — for the statically binarized dataset we used the binarization from [62], and for the dynamically binarized dataset we sampled images from Bernoulli distributions with probabilities equal to the continuous values of the images in the original MNIST dataset. We dynamically binarize the Fashion MNIST dataset in a similar manner.

First, we used the models as the prior distribution in a Bernoulli observation likelihood VAE. We summarize log-likelihood lower bounds on the test set in Table 1 (referred to as VAE w/ method prior). SNIS outperformed LARS on static MNIST and dynamic MNIST even though it used only 1024 samples for training and evaluation, whereas LARS used 1024 samples during training and 10^10 samples for evaluation. As expected due to the similarity between methods, TRS performed comparably to LARS.
On all datasets, HIS either outperformed or performed comparably to SNIS. We increased K and T for SNIS and HIS, respectively, and find that performance improves at the cost of additional computation (Appendix Fig. 3). We also used the models as the prior distribution of a convolutional hierarchical VAE (ConvHVAE, following the architecture in [5]). In this case, SNIS outperformed all methods.

Then, we used a VAE as the proposal distribution to SNIS. A limitation of the HIS model is that it requires continuous data, so it cannot be used in this way on the binarized datasets. Initially, we thought that an unbiased, low-variance estimator could be constructed similarly to VIMCO [50]; however, this estimator still had high variance. Next, we used the Gumbel Straight-Through estimator [32] to estimate gradients through the discrete samples proposed by the VAE, but found that method performed worse than ignoring those gradients altogether. We suspect that this may be due to bias in the gradients. Thus, for the SNIS model with VAE proposal, we report results on training runs which ignore those gradients. Future work will investigate low-variance, unbiased gradient estimators. In this case, SNIS again outperforms LARS; however, the performance is worse than using SNIS as a prior distribution. Finally, we used a ConvHVAE as the proposal for SNIS and saw performance improvements over both the vanilla ConvHVAE and SNIS with a VAE proposal, demonstrating that our modeling improvements are complementary to improving the proposal distribution.

4.3 Continuous MNIST, Fashion MNIST, and CelebA

Finally, we evaluated SNIS and HIS on continuous versions of MNIST, Fashion MNIST, and CelebA (64x64). We use the same preprocessing as in [18].
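That preprocessing (described next) maps dequantized pixels into logit space. A minimal sketch follows; only the logit transform's log-determinant is computed here, and the dequantization and rescaling conventions follow the description of [18]:

```python
import numpy as np

LAM = 1e-6  # the lambda used in the preprocessing of [18]

def to_logit_space(x, lam=LAM):
    """Map dequantized pixel values x in (0, 1) to logit space via
    y = logit(lam + (1 - 2*lam) * x), returning y and the per-example
    log |det dy/dx| that is added back when reporting log-likelihoods."""
    s = lam + (1 - 2 * lam) * x
    y = np.log(s) - np.log1p(-s)                               # logit(s)
    log_det = np.log(1 - 2 * lam) - np.log(s) - np.log1p(-s)   # log dy/dx per dim
    return y, log_det.sum(axis=-1)

# Dequantize hypothetical 8-bit pixels with uniform noise, rescale to (0, 1), transform.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(4, 784))
x = (pixels + rng.uniform(size=pixels.shape)) / 256.0
y, log_det = to_logit_space(x)
```

The transform is invertible, so densities modeled in logit space translate back to pixel-space log-likelihoods by adding the change-of-variables term.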
Briefly, we dequantize pixel values by adding uniform noise, rescale them to [0, 1], and then transform the rescaled pixel values into logit space by x → logit(λ + (1 − 2λ)x), where λ = 10^−6. When we calculate log-likelihoods, we take into account this change of variables.

We speculated that when the proposal is already strong, drawing additional samples as in SNIS may be better than HIS. To test this, we experimented with a smaller VAE as the proposal distribution. As we expected, HIS outperformed SNIS when the proposal was weaker, especially on the more complex datasets, as shown in Table 2.

5 Variational Inference with EIMs

To provide a tractable lower bound on the log-likelihood of EIMs, we used the ELBO (Eq. (1)). More generally, this variational lower bound has been used to optimize deep generative models with latent variables following the influential work by Kingma and Welling [38], Rezende et al. [61], and models optimized with this bound have been successfully used to model data such as natural images [60, 39, 11, 27], speech and music time-series [12, 23, 40], and video [2, 29, 17]. Due to the usefulness of such a bound, there has been an intense effort to provide improved bounds [9, 49, 52, 42, 73, 51, 65]. The tightness of the ELBO is determined by the expressiveness of the variational family [74], so it is natural to consider using flexible EIMs as the variational family. As we explain, EIMs provide a conceptual framework to understand many of the recent improvements in variational lower bounds.

In particular, suppose we use a conditional EIM q(z|x) as the variational family (i.e., q(z|x) = ∫ q(z, λ|x) dλ is the marginalized sampling process). Then, we can use the ELBO lower bound on log p(x) (Eq. (1)); however, the density of the EIM q(z|x) is intractable. Agakov and Barber [1], Salimans et al. [63], Ranganath et al. [59], Maaløe et al. [47] develop an auxiliary variable variational bound

E_q(z|x)[ log p(x, z) / q(z|x) ] = E_q(z,λ|x)[ log p(x, z) r(λ|z, x) / q(z, λ|x) ] + E_q(z|x)[ DKL(q(λ|z, x) || r(λ|z, x)) ]
                                 ≥ E_q(z,λ|x)[ log p(x, z) r(λ|z, x) / q(z, λ|x) ],   (4)

where r(λ|z, x) is a variational distribution meant to model q(λ|z, x), and the identity follows from the fact that q(z|x) = q(z, λ|x) / q(λ|z, x). Similar to Eq. (1), Eq. (4) shows the gap introduced by using r(λ|z, x) to deal with the intractability of q(z|x). We can form a lower bound on the original ELBO and thus a lower bound on the log marginal by omitting the positive DKL term. This provides a tractable lower bound on the log-likelihood using flexible EIMs as the variational family and precisely characterizes the bound gap as the sum of DKL terms in Eq. (1) and Eq. (4). For different choices of EIM, this bound recovers many of the recently proposed variational lower bounds.

Furthermore, the bound in Eq. (4) is closely related to partition function estimation because p(x, z) r(λ|z, x) / q(z, λ|x) is an unbiased estimator of p(x) when z, λ ∼ q(z, λ|x). To first order, the bound gap is related to the variance of this partition function estimator (e.g., [49]), which motivates sampling algorithms used in lower variance partition function estimators such as SMC [21] and AIS [53].

5.1 Importance Weighted Auto-encoders (IWAE)

To tighten the ELBO without explicitly expanding the variational family, Burda et al.
[9] introduced the importance weighted autoencoder (IWAE) bound,

E_{z_{1:K} ∼ ∏_i q̃(z_i|x)}[ log (1/K) ∑_{i=1}^{K} p(x, z_i) / q̃(z_i|x) ] ≤ log p(x).   (5)

The IWAE bound reduces to the ELBO when K = 1, is non-decreasing as K increases, and converges to log p(x) as K → ∞ under mild conditions [9]. Bachman and Precup [3] introduced the idea of viewing IWAE as auxiliary variable variational inference and Naesseth et al. [52], Cremer et al. [13], Domke and Sheldon [20] formalized the notion.

Consider the variational family defined by the EIM based on SNIS (Algorithm 2). We use a learned, tractable distribution q̃(z|x) as the proposal π(z|x) and set U(z|x) = log q̃(z|x) − log p(x, z), motivated by the fact that p(z|x) ∝ q̃(z|x) exp(log p(x, z) − log q̃(z|x)) is the optimal variational distribution. Similar to the variational distribution used in Section 3.2, setting

r(z_{1:K}, i|z, x) = (1/K) δ_{z_i}(z) ∏_{j≠i} q̃(z_j|x)   (6)

yields the IWAE bound Eq. (5) when plugged into Eq. (4) (see Appendix A for details).

From Eq. (4), it is clear that IWAE is a lower bound on the standard ELBO for the EIM q(z|x) and the gap is due to DKL(q(z_{1:K}, i|z, x) || r(z_{1:K}, i|z, x)). The choice of r(z_{1:K}, i|z, x) in Eq. (6) was for convenience and is suboptimal. The optimal choice of r is

q(z_{1:K}, i|z, x) = q(i|z, x) q(z_{1:K}|i, z, x) = (1/K) δ_{z_i}(z) q(z_{−i}|i, z, x).

Compared to the optimal choice, Eq. (6) makes the approximation q(z_{−i}|i, z, x) ≈ ∏_{j≠i} q̃(z_j|x), which ignores the influence of z on z_{−i} and the fact that z_{−i} are not independent given z. A simple extension could be to learn a factored variational distribution conditional on z: r(z_{1:K}, i|z, x) = (1/K) δ_{z_i}(z) ∏_{j≠i} r(z_j|z, x). Learning such an r could improve the tightness of the bound, and we leave exploring this to future work.

5.2 Semi-implicit variational inference

As a way of increasing the flexibility of the variational family, Yin and Zhou [73] introduce the idea of semi-implicit variational families. That is, they define an implicit distribution q(λ|x) by transforming a random variable ε ∼ q(ε|x) with a differentiable deterministic transformation (i.e., λ = g(ε, x)). However, Sobolev and Vetrov [65] keenly note that q(z, λ|x) = q(z|λ, x) q(λ|x) can be equivalently written as q(z|ε, x) q(ε|x) with two explicit distributions. As a result, semi-implicit variational inference is simply auxiliary variable variational inference by another name.

Additionally, Yin and Zhou [73] provide a multi-sample lower bound on the log likelihood which is generally applicable to auxiliary variable variational inference:

log p(x) ≥ E_{q(λ_{1:K−1}|x) q(z,λ|x)}[ log p(x, z) / ( (1/K) (q(z|λ, x) + ∑_i q(z|λ_i, x)) ) ].   (7)

We can interpret this bound as using an EIM for r(λ|z, x) in Eq. (4). Generally, if we introduce additional auxiliary random variables γ into r(λ, γ|z, x), we can tractably bound the objective

E_q(z,λ|x)[ log p(x, z) r(λ|z, x) / q(z, λ|x) ] ≥ E_{q(z,λ|x) s(γ|z,λ,x)}[ log p(x, z) r(λ, γ|z, x) / (q(z, λ|x) s(γ|z, λ, x)) ],   (8)

where s(γ|z, λ, x) is a variational distribution. Analogously to the previous section, we set r(λ|z, x) as an EIM based on the self-normalized importance sampling process with proposal q(λ|x) and U(λ|x, z) = − log q(z|λ, x).
If we choose

s(λ_{1:K}, i|z, λ, x) = (1/K) δ_{λ_i}(λ) ∏_{j≠i} q(λ_j|x),

with γ = (λ_{1:K}, i), then Eq. (8) recovers the bound in [73] (see Appendix B for details). In a similar manner, we can continue to recursively augment the variational distribution s (i.e., add auxiliary latent variables to s).

This view reveals that the multi-sample bound from [73] is simply one approach to choosing a flexible variational r(λ|z, x). Alternatively, Ranganath et al. [59] use a learned variational r(λ|z, x). It is unclear when drawing additional samples is preferable to learning a more complex variational distribution. Furthermore, the two approaches can be combined by using a learned proposal r(λ_i|z, x) instead of q(λ_i|x), which results in a bound described in [65].

5.3 Additional Bounds

Finally, we can also use the self-normalized importance sampling procedure to extend a proposal family q(z, λ|x) to a larger family (instead of solely extending r(λ|z, x)) [65]. Self-normalized importance sampling is a particular choice of taking a proposal distribution and moving it closer to a target. Hamiltonian Monte Carlo [55] is another choice, which can also be embedded in this framework as done by [63, 10]. Similarly, SMC can be used as a sampling procedure in an EIM, and when used as the variational family, it succinctly derives variational SMC [49, 52, 42] without any instance-specific tricks. In this way, more elaborate variational bounds can be constructed by specific choices of EIMs without additional derivation.

6 Discussion

We proposed a flexible, yet tractable family of distributions by treating the approximate sampling procedure of energy-based models as the model of interest, referring to them as energy-inspired models.
The proposed EIMs bridge the gap between learning and inference in EBMs. We explore three instantiations of EIMs, induced by truncated rejection sampling, self-normalized importance sampling, and Hamiltonian importance sampling, and demonstrate comparable or stronger performance than recently proposed generative models. The results presented in this paper use simple architectures on relatively small datasets; future work will scale up both the architectures and the size of the datasets. Interestingly, as a by-product, exploiting EIMs to define the variational family provides a unifying framework for recent improvements in variational bounds, which simplifies existing derivations, reveals potentially suboptimal choices, and suggests ways to form novel bounds.

Concurrently, Nijkamp et al. [56] investigated a model similar to our HIS-based models, although the training algorithm was different. Combining insights from their study with our approach is a promising future direction.

Acknowledgments

We thank Ben Poole, Abhishek Kumar, and Diederik Kingma for helpful comments. We thank Matthias Bauer for answering implementation questions about LARS.

References

[1] Felix V Agakov and David Barber. An auxiliary variational method. In International Conference on Neural Information Processing, pages 561–566. Springer, 2004.

[2] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. International Conference on Learning Representations, 2017.

[3] Philip Bachman and Doina Precup. Training deep generative models: Variations on a theme. In NIPS Approximate Inference Workshop, 2015.

[4] David Barber and Felix Agakov. The IM algorithm: a variational approach to information maximization. In Proceedings of the 16th International Conference on Neural Information Processing Systems, pages 201–208.
MIT Press, 2003.

[5] Matthias Bauer and Andriy Mnih. Resampled priors for variational autoencoders. arXiv preprint arXiv:1810.11428, 2018.

[6] Julian Besag. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society: Series D (The Statistician), 24(3):179–195, 1975.

[7] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 2017.

[8] Lawrence D Brown. Fundamentals of statistical exponential families: with applications in statistical decision theory. IMS, 1986.

[9] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. International Conference on Learning Representations, 2015.

[10] Anthony L Caterini, Arnaud Doucet, and Dino Sejdinovic. Hamiltonian variational auto-encoder. In Advances in Neural Information Processing Systems, pages 8167–8177, 2018.

[11] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. International Conference on Learning Representations, 2016.

[12] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2980–2988, 2015.

[13] Chris Cremer, Quaid Morris, and David Duvenaud. Reinterpreting importance-weighted autoencoders. arXiv preprint arXiv:1704.02916, 2017.

[14] Bo Dai, Hanjun Dai, Arthur Gretton, Le Song, Dale Schuurmans, and Niao He. Kernel exponential family estimation via doubly dual embedding. arXiv preprint arXiv:1811.02228, 2018.

[15] Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, and Aaron Courville. Calibrating energy-based generative adversarial networks.
arXiv preprint arXiv:1702.01691, 2017.

[16] Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The Helmholtz machine. Neural Computation, 7(5):889–904, 1995.

[17] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. International Conference on Machine Learning, 2018.

[18] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

[19] Justin Domke and Daniel Sheldon. Divide and couple: Using Monte Carlo variational objectives for posterior approximation. arXiv preprint arXiv:1906.10115, 2019.

[20] Justin Domke and Daniel R Sheldon. Importance weighting and variational inference. In Advances in Neural Information Processing Systems, pages 4471–4480, 2018.

[21] Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice, pages 3–14. Springer, 2001.

[22] Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.

[23] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pages 2199–2207, 2016.

[24] Yoav Freund and David Haussler. A fast and exact learning rule for a restricted class of Boltzmann machines. Advances in Neural Information Processing Systems, 4:912–919, 1992.

[25] Ruiqi Gao, Yang Lu, Junpei Zhou, Song-Chun Zhu, and Ying Nian Wu. Learning generative ConvNets via multi-grid modeling and sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9155–9164, 2018.

[26] Anirudh Goyal Alias Parth Goyal, Nan Rosemary Ke, Surya Ganguli, and Yoshua Bengio. Variational walkback: Learning a transition operator as a stochastic recurrent net.
In Advances in Neural Information Processing Systems, pages 4392–4402, 2017.

[27] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. International Conference on Learning Representations, 2016.

[28] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.

[29] David Ha and Jürgen Schmidhuber. World models. Advances in Neural Information Processing Systems, 2018.

[30] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[31] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.

[32] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.

[33] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

[34] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

[35] Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.

[36] R. Kinderman and S.L. Snell. Markov random fields and their applications. American Mathematical Society, 1980.

[37] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[38] Diederik P Kingma and Max Welling.
Auto-encoding variational Bayes. International Conference on Learning Representations, 2013.

[39] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

[40] Rahul G Krishnan, Uri Shalit, and David Sontag. Deep Kalman filters. arXiv preprint arXiv:1511.05121, 2015.

[41] John D Lafferty, Andrew McCallum, and Fernando CN Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289. Morgan Kaufmann Publishers Inc., 2001.

[42] Tuan Anh Le, Maximilian Igl, Tom Rainforth, Tom Jin, and Frank Wood. Auto-encoding sequential Monte Carlo. International Conference on Learning Representations, 2017.

[43] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[44] Yann LeCun, Sumit Chopra, and Raia Hadsell. A tutorial on energy-based learning. 2006.

[45] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[46] Zhuang Ma and Michael Collins. Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. arXiv preprint arXiv:1809.01812, 2018.

[47] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.

[48] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables.
arXiv preprint arXiv:1611.00712, 2016.

[49] Chris J Maddison, Dieterich Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Whye Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems, pages 6573–6583, 2017.

[50] Andriy Mnih and Danilo J Rezende. Variational inference for Monte Carlo objectives. International Conference on Machine Learning, 2016.

[51] Dmitry Molchanov, Valery Kharitonov, Artem Sobolev, and Dmitry Vetrov. Doubly semi-implicit variational inference. arXiv preprint arXiv:1810.02789, 2018.

[52] Christian Naesseth, Scott Linderman, Rajesh Ranganath, and David Blei. Variational sequential Monte Carlo. In International Conference on Artificial Intelligence and Statistics, pages 968–977, 2018.

[53] Radford M Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

[54] Radford M Neal. Hamiltonian importance sampling. Talk presented at the Banff International Research Station (BIRS) workshop on Mathematical Issues in Molecular Dynamics, 2005.

[55] Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

[56] Erik Nijkamp, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Learning non-convergent non-persistent short-run MCMC toward energy-based model. In Advances in Neural Information Processing Systems, pages 5233–5243, 2019.

[57] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[58] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.

[59] Rajesh Ranganath, Dustin Tran, and David Blei. Hierarchical variational models.
In International Conference on Machine Learning, pages 324–333, 2016.

[60] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.

[61] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.

[62] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pages 872–879. ACM, 2008.

[63] Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pages 1218–1226, 2015.

[64] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, Colorado Univ at Boulder Dept of Computer Science, 1986.

[65] Artem Sobolev and Dmitry Vetrov. Importance weighted hierarchical variational inference. In Bayesian Deep Learning Workshop, 2018.

[66] Jascha Sohl-Dickstein, Peter Battaglino, and Michael R DeWeese. Minimum probability flow learning. In Proceedings of the 28th International Conference on Machine Learning, pages 905–912. Omnipress, 2011.

[67] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, pages 1064–1071. ACM, 2008.

[68] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[69] Han Xiao, Kashif Rasul, and Roland Vollgraf.
Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.

[70] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. A theory of generative ConvNet. In International Conference on Machine Learning, pages 2635–2644, 2016.

[71] Jianwen Xie, Song-Chun Zhu, and Ying Nian Wu. Synthesizing dynamic patterns by spatial-temporal generative ConvNet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7093–7101, 2017.

[72] Jianwen Xie, Zilong Zheng, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, and Ying Nian Wu. Learning descriptor networks for 3D shape synthesis and analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8629–8638, 2018.

[73] Mingzhang Yin and Mingyuan Zhou. Semi-implicit variational inference. arXiv preprint arXiv:1805.11183, 2018.

[74] Arnold Zellner. Optimal information processing and Bayes's theorem. The American Statistician, 42(4):278–280, 1988.

[75] Song Chun Zhu, Yingnian Wu, and David Mumford. Filters, random fields and maximum entropy (FRAME): Towards a unified theory for texture modeling. International Journal of Computer Vision, 27(2):107–126, 1998.