{"title": "Tensor Monte Carlo: Particle Methods for the GPU era", "book": "Advances in Neural Information Processing Systems", "page_first": 7148, "page_last": 7157, "abstract": "Multi-sample, importance-weighted variational autoencoders (IWAE) give tighter bounds and more accurate uncertainty estimates than variational autoencoders (VAEs) trained with a standard single-sample objective.  However, IWAEs scale poorly: as the latent dimensionality grows, they require exponentially many samples to retain the benefits of importance weighting.  While sequential Monte-Carlo (SMC) can address this problem, it is prohibitively slow because the resampling step imposes sequential structure which cannot be parallelised, and moreover, resampling is non-differentiable which is problematic when learning approximate posteriors.  To address these issues, we developed tensor Monte-Carlo (TMC) which gives exponentially many importance samples by separately drawing $K$ samples for each of the $n$ latent variables, then averaging over all $K^n$ possible combinations.  While the sum over exponentially many terms might seem to be intractable, in many cases it can be computed efficiently as a series of tensor inner-products.  We show that TMC is superior to IWAE on a generative model with multiple stochastic layers trained on the MNIST handwritten digit database, and we show that TMC can be combined with standard variance reduction techniques.", "full_text": "Tensor Monte Carlo: Particle Methods for the GPU\n\nera\n\nLaurence Aitchison\u2217\nUniversity of Bristol\n\nBristol, UK\n\nlaurence.aitchison@gmail.com\n\nAbstract\n\nMulti-sample, importance-weighted variational autoencoders (IWAE) give tighter\nbounds and more accurate uncertainty estimates than variational autoencoders\n(VAEs) trained with a standard single-sample objective. 
However, IWAEs scale\npoorly: as the latent dimensionality grows, they require exponentially many samples\nto retain the benefits of importance weighting. While sequential Monte-Carlo\n(SMC) can address this problem, it is prohibitively slow because the resampling\nstep imposes sequential structure which cannot be parallelised, and moreover,\nresampling is non-differentiable which is problematic when learning approximate\nposteriors. To address these issues, we developed tensor Monte-Carlo (TMC) which\ngives exponentially many importance samples by separately drawing $K$ samples\nfor each of the $n$ latent variables, then averaging over all $K^n$ possible combinations.\nWhile the sum over exponentially many terms might seem to be intractable, in\nmany cases it can be computed efficiently as a series of tensor inner-products. We\nshow that TMC is superior to IWAE on a generative model with multiple stochastic\nlayers trained on the MNIST handwritten digit database, and we show that TMC\ncan be combined with standard variance reduction techniques.\n\nVariational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014; Eslami et al., 2018)\nhave had dramatic success in exploiting modern deep learning methods to do probabilistic inference\nin previously intractable high-dimensional spaces. However, standard VAEs using a single-sample\nobjective give loose variational bounds and poor approximations to the posterior (Turner & Sahani,\n2011; Burda et al., 2015). Modern variational autoencoders instead use a multi-sample objective to\nimprove the tightness of the variational bound and the quality of the approximate posterior (Burda\net al., 2015). These methods implicitly improve the approximate posterior by drawing multiple\nsamples from a proposal, and resampling to discard samples that do not fit the data (Cremer et al.,\n2017).\n\nWhile multi-sample importance-weighted methods are often extremely effective, they scale poorly\nwith problem size. 
In particular, recent results (Chatterjee & Diaconis, 2015) have shown that the\nnumber of importance samples required to closely approximate any target expectation scales as\n$\\exp(D_{\\text{KL}}(P \\| Q))$, where $Q$ is the proposal distribution (and this was prefigured by earlier work in the\nparticle filter context (Snyder et al., 2008; Bengtsson et al., 2008)). Critically, the KL-divergence\nscales roughly linearly in problem size (and exactly linearly if we consider $n$ independent sub-problems\nbeing combined), and thus we expect the required number of importance samples to be\nexponential in the problem size. As such, multi-sample methods are typically only used to infer the\nlatent variables in smaller models with only a single (albeit vector-valued) latent variable (e.g. Burda\net al., 2015).\n\n\u2217Work done while at Janelia Research Campus.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nOne approach to resolving these issues is sequential Monte-Carlo (SMC) (Maddison et al., 2017;\nNaesseth et al., 2017; Le et al., 2017), which circumvents the need for exponentially many samples\nusing resampling. However, SMC has two issues. First, the SMC resampling steps force an inherently\nsequential structure on the computation, which can prohibit effective parallelisation on modern GPU\nhardware. While this is acceptable in a model (such as a state-space model) that already has sequential\nstructure, SMC has been applied in many other settings where there is considerably more scope for\nparallelisation such as mixture models (Fearnhead, 2004) or even probabilistic programs (Wood et al.,\n2014). Second, modern variational inference uses the reparameterisation trick to obtain low-variance\nestimates of the gradient of the objective with respect to the proposal parameters (Kingma & Welling,\n2013; Rezende et al., 2014). 
However, the reparameterisation trick requires us to differentiate samples\nfrom the proposal with respect to parameters of the proposal, and this is not possible in SMC due to\nthe inherently non-differentiable resampling step (and this is true even in variants such as particle\nfiltering with backward sampling (Doucet & Johansen, 2009)). As such, while it may be possible\nin some circumstances to obtain reasonable results using a biased gradient (Maddison et al., 2017;\nNaesseth et al., 2017; Le et al., 2017), those results are empirical and hence give no guarantees.\n\nTo resolve these issues, we introduce tensor Monte Carlo (TMC). While standard multi-sample\nobjectives draw $K$ samples from a proposal over all latent variables jointly, TMC draws $K$ samples\nfor each of the $n$ latent variables separately, then forms a lower-bound by averaging over all $K^n$\npossible combinations of samples for each latent variable. To perform these averages over an\nexponential number of terms efficiently, we exploit conditional independence structure in a manner\nthat is very similar to early work on graphical models (Pearl, 1986; Lauritzen & Spiegelhalter, 1988).\nIn particular, we note that for TMC, as well as for classical graphical models, these sums can be\nwritten in an extremely simple and general form: as a series of tensor inner products. 
As such, TMC\nis most closely related to an MCMC method known as the \u201cembedded HMM\u201d (Neal et al., 2004)\nwhich performs MCMC sampling by drawing $K$ samples for each latent, and sampling from the\nresulting $K^n$ state space.\n\nFinally, as TMC is, in essence, IWAE with exponentially many importance samples, it can be\ncombined with previously suggested variance reduction techniques, including (but not limited to)\nsticking the landing (STL) (Roeder et al., 2017), doubly reparameterised gradient estimates (DReGs)\n(Tucker et al., 2018), and reweighted wake-sleep (RWS) (Bornschein & Bengio, 2014; Le et al.,\n2018).\n\n1 Background\n\nClassical variational inference consists of optimizing a lower bound, $\\mathcal{L}_{\\text{VAE}}$, on the log-marginal\nlikelihood, $\\log P(x)$,\n\n$$\\log P(x) \\geq \\mathcal{L}_{\\text{VAE}} = \\log P(x) - D_{\\text{KL}}(Q(z) \\| P(z|x)), \\tag{1}$$\n\nwhere $x$ is the data, $z$ is the latent variable, $P$ is the generative model, and $Q$ is known as either the\napproximate posterior, the recognition model or the proposal distribution. As the Kullback-Leibler\n(KL) divergence is always non-negative, we can see that the objective is indeed a lower bound, and if\nthe approximate posterior, $Q(z)$, is sufficiently flexible, then as we optimize $\\mathcal{L}_{\\text{VAE}}$ with respect\nto the parameters of the approximate posterior, the approximate posterior will come to equal the\ntrue posterior, at which point the KL divergence is zero, so optimizing $\\mathcal{L}_{\\text{VAE}}$ reduces to optimizing\n$\\log P(x)$. However, in the typical case where $Q(z)$ is a more restrictive family of distributions, we\nobtain biased estimates of the generative parameters and approximate posteriors that underestimate\nuncertainty (Minka et al., 2005; Turner & Sahani, 2011).\n\nThis issue motivated the development of more general lower-bound objectives, and to understand how\nthese bounds were developed, we need to consider an alternative derivation of $\\mathcal{L}_{\\text{VAE}}$. 
The general\napproach is to take an unbiased stochastic estimate of the marginal likelihood, denoted $\\mathcal{P}$,\n\n$$P(x) = \\mathbb{E}_Q[\\mathcal{P}], \\tag{2}$$\n\nand convert it into a lower bound on the log-marginal likelihood using Jensen\u2019s inequality,\n\n$$\\log P(x) \\geq \\mathcal{L} = \\mathbb{E}_Q[\\log \\mathcal{P}]. \\tag{3}$$\n\nWe can obtain most methods of interest, including single-sample VAEs, multi-sample IWAE, and\nTMC by making different choices for $\\mathcal{P}$ and $Q$. For the single-sample variational objective we use a\nproposal, $Q(z)$, defined over a single setting of the latents,\n\n$$\\mathcal{P}_{\\text{VAE}} = \\frac{P(x, z)}{Q(z)}, \\tag{4}$$\n\nwhich gives rise to the usual variational lower bound, $\\mathcal{L}_{\\text{VAE}}$. However, this single-sample estimate of\nthe marginal likelihood has high variance, and hence gives a loose lower-bound. To obtain a tighter\nvariational bound, one approach is to find a lower-variance estimate of the marginal likelihood, and\nan obvious way to reduce the variance is to average multiple independent samples of the original\nestimator,\n\n$$\\mathcal{P}_{\\text{IWAE}} = \\frac{1}{K} \\sum_{k=1}^{K} \\frac{P(x, z^k)}{Q(z^k)}, \\tag{5}$$\n\nwhich indeed gives rise to a tighter, importance-weighted bound, $\\mathcal{L}_{\\text{IWAE}}$ (Burda et al., 2015).\n\n2 Results\n\nFirst, we give a proof showing that we can obtain unbiased estimates of the model-evidence by\ndividing the full latent space into several different latent variables, $z = (z_1, z_2, \\dots, z_n)$, drawing $K$\nsamples for each individual latent and averaging over all $K^n$ possible combinations of samples. We\nthen give a method for efficiently computing the required averages over an exponential number of\nterms using tensor inner products. We give toy experiments, showing that the TMC bound approaches\nthe true model evidence with exponentially fewer samples than IWAE, and in far less time than\nSMC. Finally, we do experiments on VAEs with multiple stochastic layers trained on the MNIST\nhandwritten digit database. 
We show that TMC can be used to learn recognition models, that it\ncan be combined with variance reduction techniques such as STL (Roeder et al., 2017) and DReGs\n(Tucker et al., 2018), and is superior to IWAE given the same number of particles, despite negligible\nadditional computational costs.\n\n2.1 TMC for factorised proposals\n\nIn TMC we consider models with multiple latent variables, $z = (z_1, z_2, \\dots, z_n)$, so the generative\nand recognition models can be written as,\n\n$$P(x, z) = P(x, z_1, z_2, \\dots, z_n), \\qquad Q(z) = Q(z_1) Q(z_2) \\cdots Q(z_n), \\tag{6}$$\n\nwhere we use a factorised proposal to simplify the proof (see Appendix B for a proof for non-factorised\nproposals). For the TMC objective, each individual latent variable, $z_i$, is sampled $K_i$\ntimes,\n\n$$z_i^{k_i} \\sim Q(z_i), \\tag{7}$$\n\nwhere $i \\in \\{1, \\dots, n\\}$ indexes the latent variable and $k_i \\in \\{1, \\dots, K_i\\}$ indexes the sample for the\n$i$th latent. Importantly, any combination of the $k_i$\u2019s can be used to form an unbiased, single-sample\nestimate of the marginal likelihood. Thus, for any $k_1, k_2, \\dots, k_n$ we have,\n\n$$P_\\theta(x) = \\mathbb{E}_{Q(z)}\\left[ \\frac{P(x, z_1^{k_1}, z_2^{k_2}, \\dots, z_n^{k_n})}{Q(z_1^{k_1}) Q(z_2^{k_2}) \\cdots Q(z_n^{k_n})} \\right]. \\tag{8}$$\n\nThe average of a set of unbiased estimators is another unbiased estimator. As such, averaging over\nall $K^n$ settings for the $k_i$\u2019s (and hence over $K^n$ unbiased estimators), we obtain a lower-variance\nunbiased estimator,\n\n$$\\mathcal{P}_{\\text{TMC}} = \\frac{1}{\\prod_i K_i} \\sum_{k_1, k_2, \\dots, k_n} \\frac{P(x, z_1^{k_1}, z_2^{k_2}, \\dots, z_n^{k_n})}{Q(z_1^{k_1}) Q(z_2^{k_2}) \\cdots Q(z_n^{k_n})}, \\tag{9}$$\n\nand this forms the TMC estimate of the marginal likelihood.\n\n2.2 Efficient averaging\n\nThe TMC unbiased estimator in Eq. (9) involves a sum over exponentially many terms, which may be\nintractable. 
To evaluate the TMC marginal likelihood estimate efficiently, we therefore need to exploit\nstructure in the graphical model. For instance, for a directed graphical model, the joint-probability\ncan be written as a product of the conditional probabilities,\n\n$$P(x, z_1, z_2, \\dots, z_n) = P(x | z_{\\text{pa}(x)}) \\prod_{i=1}^{n} P(z_i | z_{\\text{pa}(z_i)}), \\tag{10}$$\n\nwhere $\\text{pa}(z_i) \\subset \\{1, \\dots, n\\}$ are the indices of the parents of $z_i$. In the case of a directed graphical\nmodel, we can write the importance ratio as a product of factors (Kschischang et al., 2001; Frey,\n2003; Bishop, 2006),\n\n$$\\frac{P(x, z_1^{k_1}, z_2^{k_2}, \\dots, z_n^{k_n})}{Q(z_1^{k_1}) Q(z_2^{k_2}) \\cdots Q(z_n^{k_n})} = \\prod_j f_j^{\\kappa_j}, \\tag{11}$$\n\nwhere the $\\kappa_j$\u2019s are tuples containing the indices ($k_i$\u2019s) of each factor. For the model in Eq. (10), a typical\nchoice of factors, $f_j^{\\kappa_j}$, and corresponding $\\kappa_j$ would be to use the $j = 0$th factor for the data,\n\n$$f_0^{\\kappa_0} = P(x | z_{\\text{pa}(x)}^{k_{\\text{pa}(x)}}), \\qquad \\kappa_0 = k_{\\text{pa}(x)}, \\tag{12}$$\n\nand to use the $j$th factor (with $j > 0$) for the $j$th latent variable,\n\n$$f_j^{\\kappa_j} = \\frac{P(z_j^{k_j} | z_{\\text{pa}(z_j)}^{k_{\\text{pa}(z_j)}})}{Q(z_j^{k_j})}, \\qquad \\kappa_j = (k_j, k_{\\text{pa}(z_j)}). \\tag{13}$$\n\nAs such, substituting Eq. (11) into Eq. (9), the TMC marginal likelihood estimator can be written as\nsummations over a series of tensors, $f_j^{\\kappa_j}$,\n\n$$\\mathcal{P}_{\\text{TMC}} = \\frac{1}{K_1 K_2 \\cdots K_n} \\sum_{k_1, k_2, \\dots, k_n} \\prod_j f_j^{\\kappa_j}. \\tag{14}$$\n\nIf there are sufficiently many conditional independencies in the graphical model, we can compute the\nTMC marginal likelihood estimate efficiently by swapping the order of the product and summation.\n\n2.3 Non-factorised proposals\n\nFor non-factorised proposals, we obtain a result similar to that for factorised proposals (Eq. 
9),\n\n$$\\mathcal{P}_{\\text{TMC}} = \\frac{1}{\\prod_i K_i} \\sum_{k_1, k_2, \\dots, k_n} \\frac{P(x, z_1^{k_1}, z_2^{k_2}, \\dots, z_n^{k_n})}{\\prod_i Q(z_i^{k_i} | z_{\\text{qa}(z_i)})}, \\tag{15}$$\n\nwhere $\\text{qa}(z_i)$ represents the parents of $z_i$ under the proposal, and $z_{\\text{qa}(z_i)}$ represents all samples of\nthose parents. Importantly, note that the proposals are indexed only by $k_i$, and not by $k_{\\text{qa}(z_i)}$, so we\ncan always use the same factorisation structure (Eq. 11) for a factorised and non-factorised proposal.\nConsult the Appendix for further details, including a proof (Appendix B) and commentary on the\nchoice of approximate posterior (Appendix C).\n\n2.4 Computational costs for TMC and IWAE\n\nIn principle, TMC could be considerably more expensive than IWAE, as IWAE\u2019s cost is linear in $K$,\nwhereas for TMC, the cost scales with $K^T$, where $T$ is the tree-width. Of course, in exchange, we\nobtain an exponential number of importance samples, $K^n$, so this tradeoff will usually be worthwhile.\nRemarkably however, the dominant computational costs, those of propagating samples through\nthe neural network, are linear in $K$ for typical networks and problem sizes, and hence almost\nequivalent to those of IWAE. In particular, consider a chained model, where $z = (z_1, z_2, \\dots, z_n)$, and,\n\n$$P(x, z) = P(x | z_1) P(z_1 | z_2) \\cdots P(z_{n-1} | z_n) P(z_n). \\tag{16}$$\n\nFigure 1: Performance of TMC, SMC, IWAE and ground truth (GT) on a simple Gaussian latent\nvariable example, run in PyTorch using a GPU. A. The marginal likelihood estimate (y-axis) for\ndifferent numbers of particles, K (x-axis), with the number of data points fixed to N = 128. B. The\ntime required for computing marginal likelihood estimates in A on a single Titan X GPU. C. The\nmarginal likelihood estimate per data point (y-axis), for models with different numbers of data points,\nN, and a fixed number of particles, K = 128. Note that the TMC, SMC and GT lines lie on top of\neach other. D. 
The time required for computing marginal likelihood estimates in C.\n\nIn most deep models, the latents, $z_i$, are vectors, and the generative (and recognition) models have\nthe form,\n\n$$P(z_i | z_{i+1}) = \\mathcal{N}\\left(z_i | \\mu_i(z_{i+1}), \\text{diag}(\\sigma_i^2(z_{i+1}))\\right), \\tag{17}$$\n\ni.e. the elements of $z_i$ are independent, with means and variances given by neural-networks applied\nto the activations of the previous layer, $\\mu_i(z_{i+1})$ and $\\sigma_i^2(z_{i+1})$. As such, the asymptotically quadratic\ncost of evaluating $P(z_i^{k_i} | z_{i+1}^{k_{i+1}})$ for all $k_i$ and $k_{i+1}$ is dominated by the linear cost of computing\n$\\mu_i(z_{i+1}^{k_{i+1}})$ and $\\sigma_i^2(z_{i+1}^{k_{i+1}})$ for all $k_{i+1}$ by applying neural networks to the activations at the previous\nlayer. Finally, the same considerations apply to non-factorised recognition models (see Appendix):\nthey have the same quadratic asymptotic cost, but linear cost in typical problem sizes.\n\n3 Toy Experiments\n\nHere we perform two toy experiments. First, we compare TMC, SMC and IWAE, finding that TMC\ngives bounds on the log-probability that are considerably better than those for IWAE, and as good as\n(if not better than) those for SMC, while being considerably faster. Second, we consider an example where\nnon-factorised posteriors might become important. In these toy experiments, we use models in which\nall variables are jointly Gaussian, which allows us to compute the exact marginal likelihood, and to\nassess the tightness of the bounds.\n\n3.1 Comparing TMC, SMC and IWAE\n\nFirst, we considered a simple example, with Gaussian parameters, latents and data. There was a\nsingle parameter, $\\theta$, drawn from a standard normal, which set the mean of $N$ latent variables, $z_i$. 
The\n$N$ data points, $x_i$, have unit variance, and mean set by the latent variable,\n\n$$P(\\theta) = \\mathcal{N}(\\theta; 0, 1), \\tag{18}$$\n\n$$P(z_i | \\theta) = \\mathcal{N}(z_i; \\theta, 1), \\tag{19}$$\n\n$$P(x_i | z_i) = \\mathcal{N}(x_i; z_i, 1). \\tag{20}$$\n\nFor the proposal distributions for all methods, we used the generative marginals,\n\n$$Q(\\theta) = \\mathcal{N}(\\theta; 0, 1), \\tag{21}$$\n\n$$Q(z_i) = \\mathcal{N}(z_i; 0, 2). \\tag{22}$$\n\nWhile this model is simplistic, it is a useful initial test case, because the ground-truth (GT) marginal\nlikelihood can be computed.\n\nWe computed marginal likelihood estimates for TMC, SMC and IWAE. First, we plotted the bound\non the marginal likelihood against the number of particles for a fixed number of data points, N = 128\n(Fig. 1A). As expected, IWAE was dramatically worse than other methods: it was still far from the\ntrue marginal likelihood, even with one million importance samples. We also found that\nfor a fixed number of particles/samples, TMC was somewhat superior to SMC. We suspect that this\nis because TMC sums over all possible combinations of particles, while the SMC resampling step\nexplicitly eliminates some of these combinations. Further, SMC was considerably slower than TMC\nas resampling required us to perform an explicit loop over data points, whereas TMC can be computed\nentirely using tensor sums/products, which can be optimized efficiently on the GPU (Fig. 1B).\n\nFigure 2: The bound on the log-marginal likelihood for factorised and non-factorised models as\ncompared to the ground-truth, for different numbers of samples, K.\n\nNext, we plotted the log-marginal likelihood per data point as we vary the number of data points, with\na fixed number of importance samples, K = 128 (Fig. 1C). 
Again, IWAE is dramatically worse than\nthe other methods, whereas both TMC and SMC closely tracked the ground-truth result. However,\nnote that the time taken for SMC (Fig. 1D) is larger than that for the other methods, and scales linearly\nin the number of data points. In contrast, the time required for TMC remains constant up to around\n1000 data points, as GPU parallelisation is exploited increasingly efficiently in larger problems.\n\n3.2 Comparing factorised and non-factorised proposals\n\nNon-factorised proposals have a range of potential benefits, and here we consider how they might be\nmore effective than factorised proposals in modelling distributions with very high prior correlations.\nIn particular, we consider a chain of latent variables,\n\n$$P(z_i | z_{i-1}) = \\mathcal{N}(z_{i-1}, 1/N), \\tag{23}$$\n\n$$P(x | z_N) = \\mathcal{N}(z_N, 1), \\tag{24}$$\n\nwhere $z_0 = 0$. As $N$ becomes large, the marginal distribution over $z_N$ and $x$ remains constant,\nbut the correlations between adjacent latents (i.e. $z_{i-1}$ and $z_i$) become stronger. For the factorised\nproposal, we use the marginal variance (i.e. $Q(z_i) = \\mathcal{N}(0, i/N)$), whereas we used the prior for\nthe non-factorised proposal. Taking N = 100 (Fig. 2), we find that the non-factorised method\nconsiderably outperforms the factorised method for small numbers of samples, K, because the\nnon-factorised method is able to model tight prior-induced correlations.\n\n4 Experiments\n\nWe considered a model for MNIST handwritten digits with five layers of stochastic units inspired by\nS\u00f8nderby et al. (2016). This model had 4 stochastic units in the top layer (furthest from the data),\nthen 8, 16, 32, and 64 units in the last layer (closest to the data). In the generative model, we had two\ndeterministic layers between each pair of stochastic layers, and these deterministic layers had twice the\nnumber of units in the corresponding stochastic layer (i.e. 8, 16, 32, 64 and 128). 
In all experiments,\nwe used the Adam optimizer (Kingma & Ba, 2014) using the PyTorch default hyperparameters, and\nweight normalization (Salimans & Kingma, 2016) to improve numerical stability. We used leaky-relu\nnonlinearities everywhere except for the standard-deviations (S\u00f8nderby et al., 2016), for which we\nused 0.01 + softplus(x), to improve numerical stability by ensuring that the standard deviations could\nnot become too small.\n\nFigure 3: The quality of the variational lower bound for a model of MNIST handwritten digits,\nwith different recognition models and training schemes. We used three different recognition models\n(columns): factorised (left) where the distribution over the latents at each layer was independent;\nnon-factorised, small (middle) where each stochastic layer depended on the previous stochastic layer\nthrough a two-layer deterministic neural network, with a small number of units (the same as in\nthe generative model); and non-factorised, large (right) where the deterministic networks linking\nstochastic layers in the recognition model had 4 times as many units as in the smaller network. A.\nWe trained two sets of models using IWAE (blue) and TMC (red), and plotted the value of the TMC\nobjective for both lines. B. Here, we consider only models trained using the IWAE objective, and\nevaluate them under the IWAE objective (blue) and the TMC objective (red).\n\nNote, however, that our goal was to give a fair comparison between IWAE and\nTMC under various variance reduction schemes, not to reach state-of-the-art performance. 
As such,\nthere are many steps that could be taken to bring results towards state-of-the-art, including the use of\na ladder-VAE architecture, wider deterministic layers, batch-normalization, convolutional structure\nand using more importance samples to evaluate the model (S\u00f8nderby et al., 2016).\n\nWe compared IWAE and TMC under three different recognition models, as well as three different\nvariance reduction schemes (including plain reparameterisation gradients). For the non-factorised\nrecognition models (Fig. 3 middle and right), we used,\n\n$$Q(z | x) = Q(z_5 | z_4) Q(z_4 | z_3) Q(z_3 | z_2) Q(z_2 | z_1) Q(z_1 | x), \\tag{25}$$\n\nsee Appendix C for the extension to the TMC case. For all of these distributions, we used,\n\n$$Q(z_{i+1} | z_i) = \\mathcal{N}(\\nu_{i+1}, \\rho_{i+1}^2), \\tag{26a}$$\n\n$$\\nu_{i+1} = \\text{Linear}(h_i), \\tag{26b}$$\n\n$$\\rho_{i+1} = \\text{SoftPlus}(\\text{Linear}(h_i)), \\tag{26c}$$\n\n$$h_i = \\text{MLP}(z_i), \\tag{26d}$$\n\nwhere the MLP had two dense layers and for the small model (middle), the lowest-level MLP had\n128 units, then higher-level MLPs had 64, 32, 16 and 8 units respectively. For the large model, the\nlowest-level MLP had 512 units, then 256, 128, 64 and 32. For the factorised recognition model,\n\n$$Q(z | x) = Q(z_1 | x) Q(z_2 | x) Q(z_3 | x) Q(z_4 | x) Q(z_5 | x), \\tag{27}$$\n\nwe used an architecture mirroring that of the non-factorised recognition model as closely as possible.\nIn particular, to construct these distributions, we used Eqs. 
(26a\u201326c), but with a different version of\n$h_i$ that depended directly on $h_{i-1}$, rather than $z_i$,\n\n$$h_i = \\text{MLP}(\\text{Linear}(h_{i+1})), \\tag{28}$$\n\nwhere we require the linear transformation to reduce the hidden dimension down to the required input\ndimension for the MLP.\n\nFigure 4: The average time (across the three models) required for one training epoch of the six\nmethods considered above: IWAE/TMC in combination with no additional variance reduction scheme\n(none), STL, DReGs.\n\nFor IWAE and TMC, we considered plain reparametrised gradient descent (none), as well as DReGs\n(Tucker et al., 2018) (which resolve issues raised by Rainforth et al. (2018)).\n\nWe began by training the above models and variance reduction techniques using an IWAE (blue)\nand a TMC (red) objective (Fig. 3A). Note that we evaluated both models using the TMC objective\nto be as generous as possible to IWAE (see Fig. 3B and discussion below). We found that the best\nperforming method for all models was plain TMC (i.e. without DReGs; Fig. 3A). It is unsurprising\nthat TMC is superior to IWAE, because TMC in effect considered $20^5$ importance samples, whereas\nIWAE considered only 20 importance samples. However, it is unclear whether variance reduction\ntechniques such as DReGs should prove effective in combination with TMC. We can speculate that\nDReGs performs poorly in combination with TMC because these methods are designed to improve\nperformance as the approximate posterior becomes close to the true posterior (see also Rainforth\net al. (2018)). However, TMC considers all combinations of samples, and while some of those\ncombinations might be drawn from the true posterior (e.g. 
we might have all latents for a particular $k$,\n$z_1^k, z_2^k, \\dots, z_n^k$, being drawn jointly from the true posterior), it cannot be the case that all combinations\nare (or can be considered as) drawn jointly from the true posterior. Furthermore, we found that\nDReGs was numerically unstable in combination with TMC, though it is not clear whether this was\nan inherent property or merely an implementational issue. Note also that TMC offers larger\nbenefits over IWAE for the factorised and small non-factorised model, where, presumably, the\nmismatch between the approximate and true posterior is larger. Next, we considered training the\nmodel under just IWAE, and evaluating under IWAE and TMC (Fig. 3B). Evaluating under TMC\nconsistently gave a slightly better bound than evaluating under IWAE, and as expected, because\ntraining is based on IWAE, we found that DReGs gave considerably improved performance. As such,\nin Fig. 3A, we used TMC to evaluate the model trained under an IWAE objective, so as to be as\ngenerous as possible to IWAE.\n\nWe found that the training time for TMC was similar to that for IWAE, despite TMC considering, in\neffect, $20^5$ = 3,200,000 importance samples, whereas IWAE considered only 20 (Fig. 4). Further,\nwe found that using STL gave similar runtime, whereas DReGs was more expensive (though it cannot\nbe ruled out that this is due to our implementation). That said, broadly, there is reason to believe that\nvanilla reparameterised gradients may be more efficient than STL and DReGs, because using vanilla\nreparameterised gradients allows us to compute the recognition sample and log-probability in one\npass. 
In contrast, for STL and DReGs, we need to separate the computation of the recognition sample\nand log-probability, so that it is possible to stop gradients in the appropriate places.\n\n5 Discussion\n\nWe showed that it is possible to extend multi-sample bounds on the marginal likelihood by drawing\nsamples for each latent variable separately, and averaging across all possible combinations of samples\nfrom each variable. As such, we were able to achieve lower-variance estimates of the marginal\nlikelihood, and hence better bounds on the log-marginal likelihood than IWAE and SMC. Furthermore,\ncomputation of these bounds parallelises effectively on modern GPU hardware, and is comparable to\nthe computation required for IWAE.\n\nOur approach can be understood as introducing ideas from classical message passing (Pearl, 1982,\n1986; Bishop, 2006) into the domains of importance sampling and variational autoencoders. Note that\nwhile message passing has been used in the context of variational autoencoders to sum over discrete\nlatents (e.g. Johnson et al., 2016; Bingham et al., 2018), here we have done something fundamentally\ndifferent: weaving message-passing like approaches into the fabric of importance sampling, by first\ndrawing K samples for each continuous latent, and then using message-passing like algorithms to\nsum over all possible combinations of samples.\n\nAcknowledgements\n\nI would like to thank HHMI for funding and computational infrastructure.\n\nReferences\n\nBengtsson, T., Bickel, P., Li, B., et al. Curse-of-dimensionality revisited: Collapse of the particle filter\nin very large scale systems. In Probability and statistics: Essays in honor of David A. Freedman,\npp. 316\u2013334. Institute of Mathematical Statistics, 2008.\n\nBingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R.,\nSzerlip, P., Horsfall, P., and Goodman, N. D. 
Pyro: Deep Universal Probabilistic Programming.\nJournal of Machine Learning Research, 2018.\n\nBishop, C. Pattern Recognition and Machine Learning. Springer, 2006.\n\nBornschein, J. and Bengio, Y. Reweighted wake-sleep. arXiv preprint arXiv:1406.2751, 2014.\n\nBurda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. arXiv preprint\n\narXiv:1509.00519, 2015.\n\nChatterjee, S. and Diaconis, P. The sample size required in importance sampling. arXiv preprint\n\narXiv:1511.01437, 2015.\n\nCremer, C., Morris, Q., and Duvenaud, D. Reinterpreting importance-weighted autoencoders. arXiv\n\npreprint arXiv:1704.02916, 2017.\n\nDoucet, A. and Johansen, A. M. A tutorial on particle \ufb01ltering and smoothing: Fifteen years later.\n\nHandbook of nonlinear \ufb01ltering, 12(656-704):3, 2009.\n\nEslami, S. A., Rezende, D. J., Besse, F., Viola, F., Morcos, A. S., Garnelo, M., Ruderman, A., Rusu,\nA. A., Danihelka, I., Gregor, K., et al. Neural scene representation and rendering. Science, 360\n(6394):1204\u20131210, 2018.\n\nFearnhead, P. Particle \ufb01lters for mixture models with an unknown number of components. Statistics\n\nand Computing, 14(1):11\u201321, 2004.\n\nFrey, B. J. Extending factor graphs so as to unify directed and undirected graphical models. In\nProceedings of the Nineteenth conference on Uncertainty in Arti\ufb01cial Intelligence, pp. 257\u2013264.\nMorgan Kaufmann Publishers Inc., 2003.\n\nJohnson, M., Duvenaud, D. K., Wiltschko, A., Adams, R. P., and Datta, S. R. Composing graphical\nmodels with neural networks for structured representations and fast inference. In Advances in\nneural information processing systems, pp. 2946\u20132954, 2016.\n\nKingma, D. P. and Ba, J. Adam: A method for stochastic optimization.\n\narXiv preprint\n\narXiv:1412.6980, 2014.\n\nKingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,\n\n2013.\n\nKschischang, F. R., Frey, B. J., Loeliger, H.-A., et al. 
Factor graphs and the sum-product algorithm.\nIEEE Transactions on information theory, 47(2):498\u2013519, 2001.\n\nLauritzen, S. L. and Spiegelhalter, D. J. Local computations with probabilities on graphical struc-\ntures and their application to expert systems. Journal of the Royal Statistical Society. Series B\n(Methodological), pp. 157\u2013224, 1988.\n\nLe, T. A., Igl, M., Jin, T., Rainforth, T., and Wood, F. Auto-encoding sequential monte carlo. arXiv\npreprint arXiv:1705.10306, 2017.\n\nLe, T. A., Kosiorek, A. R., Siddharth, N., Teh, Y. W., and Wood, F. Revisiting reweighted wake-sleep.\narXiv preprint arXiv:1805.10469, 2018.\n\nMaddison, C. J., Lawson, J., Tucker, G., Heess, N., Norouzi, M., Mnih, A., Doucet, A., and Teh,\nY. Filtering variational objectives. In Advances in Neural Information Processing Systems, pp.\n6576\u20136586, 2017.\n\nMinka, T. et al. Divergence measures and message passing. Technical report,\nMicrosoft Research, 2005.\n\nNaesseth, C. A., Linderman, S. W., Ranganath, R., and Blei, D. M. Variational sequential monte\ncarlo. arXiv preprint arXiv:1705.11140, 2017.\n\nNeal, R. M., Beal, M. J., and Roweis, S. T. Inferring state sequences for non-linear systems with\nembedded hidden markov models. In Advances in neural information processing systems, pp.\n401\u2013408, 2004.\n\nPearl, J. Reverend Bayes on inference engines: a distributed hierarchical approach. In Proceedings\nof the Second AAAI Conference on Artificial Intelligence, pp. 133\u2013136. AAAI Press, 1982.\n\nPearl, J. Fusion, propagation, and structuring in belief networks. Artificial intelligence, 29(3):\n241\u2013288, 1986.\n\nRainforth, T., Kosiorek, A. R., Le, T. A., Maddison, C. J., Igl, M., Wood, F., and Teh, Y. W. Tighter\nvariational bounds are not necessarily better. arXiv preprint arXiv:1802.04537, 2018.\n\nRezende, D. J., Mohamed, S., and Wierstra, D. 
Stochastic backpropagation and approximate inference\nin deep generative models. arXiv preprint arXiv:1401.4082, 2014.\n\nRoeder, G., Wu, Y., and Duvenaud, D. K. Sticking the landing: Simple, lower-variance gradient\nestimators for variational inference. In Advances in Neural Information Processing Systems, pp.\n6925\u20136934, 2017.\n\nSalimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate\ntraining of deep neural networks. In Advances in Neural Information Processing Systems, pp.\n901\u2013909, 2016.\n\nSnyder, C., Bengtsson, T., Bickel, P., and Anderson, J. Obstacles to high-dimensional particle filtering.\nMonthly Weather Review, 136(12):4629\u20134640, 2008.\n\nS\u00f8nderby, C. K., Raiko, T., Maal\u00f8e, L., S\u00f8nderby, S. K., and Winther, O. Ladder variational\nautoencoders. In Advances in neural information processing systems, pp. 3738\u20133746, 2016.\n\nTucker, G., Lawson, D., Gu, S., and Maddison, C. J. Doubly reparameterized gradient estimators for\nmonte carlo objectives. arXiv preprint arXiv:1810.04152, 2018.\n\nTurner, R. E. and Sahani, M. Two problems with variational expectation maximisation for time-\nseries models. In Barber, D., Cemgil, A. T., and Chiappa, S. (eds.), Bayesian time series models.\nCambridge University Press, 2011.\n\nWood, F., van de Meent, J. W., and Mansinghka, V. A new approach to probabilistic programming inference.\nIn Artificial Intelligence and Statistics, pp. 1024\u20131032, 2014.", "award": [], "sourceid": 3861, "authors": [{"given_name": "Laurence", "family_name": "Aitchison", "institution": "University of Cambridge"}]}