{"title": "Importance Weighted Hierarchical Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 603, "page_last": 615, "abstract": "Variational Inference is a powerful tool in the Bayesian modeling toolkit, however, its effectiveness is determined by the expressivity of the utilized variational distributions in terms of their ability to match the true posterior distribution. In turn, the expressivity of the variational family is largely limited by the requirement of having a tractable density function.\nTo overcome this roadblock, we introduce a new family of variational upper bounds on a marginal log-density in the case of hierarchical models (also known as latent variable models). We then derive a family of increasingly tighter variational lower bounds on the otherwise intractable standard evidence lower bound for hierarchical variational distributions, enabling the use of more expressive approximate posteriors. We show that previously known methods, such as Hierarchical Variational Models, Semi-Implicit Variational Inference and Doubly Semi-Implicit Variational Inference can be seen as special cases of the proposed approach, and empirically demonstrate superior performance of the proposed method in a set of experiments.", "full_text": "Importance Weighted Hierarchical\n\nVariational Inference\n\nArtem Sobolev\n\nSamsung AI Center Moscow, Russia\n\nasobolev@bayesgroup.ru\n\nDmitry Vetrov\n\nSamsung AI Center Moscow, Russia\n\nNRU HSE\u2217, Moscow, Russia\n\nAbstract\n\nVariational Inference is a powerful tool in the Bayesian modeling toolkit, how-\never, its effectiveness is determined by the expressivity of the utilized variational\ndistributions in terms of their ability to match the true posterior distribution. In\nturn, the expressivity of the variational family is largely limited by the requirement\nof having a tractable density function. 
To overcome this roadblock, we introduce a new family of variational upper bounds on the log marginal density in the case of hierarchical models (also known as latent variable models). We then derive a family of increasingly tighter variational lower bounds on the otherwise intractable standard evidence lower bound for hierarchical variational distributions, enabling the use of more expressive approximate posteriors. We show that previously known methods, such as Hierarchical Variational Models, Semi-Implicit Variational Inference and Doubly Semi-Implicit Variational Inference, can be seen as special cases of the proposed approach, and empirically demonstrate superior performance of the proposed method in a set of experiments.

1 Introduction

Bayesian Inference is an important statistical tool. However, exact inference is possible only in a small class of conjugate problems, and for many practically interesting cases one has to resort to Approximate Inference techniques. Variational Inference (Hinton and van Camp, 1993; Waterhouse et al., 1996; Wainwright et al., 2008) is one such technique: an efficient and scalable approach that has gained a lot of interest in recent years due to advances in Neural Networks.

However, the efficiency and accuracy of Variational Inference heavily depend on how close the approximate posterior is to the true posterior. As a result, Neural Networks' universal approximation abilities and great empirical success have generated a lot of interest in employing them as powerful sample generators that are trained to output samples from the approximate posterior when fed some standard noise as input (Nowozin et al., 2016; Goodfellow et al., 2014; MacKay, 1995). Unfortunately, a significant obstacle in this direction is the need for a tractable density q(z|x), which in general requires intractable integration. 
A theoretically sound approach then is to give tight lower bounds on the intractable term – the differential entropy of q(z|x) – which is easy to recover from upper bounds on the log marginal density. One such bound was introduced by Agakov and Barber (2004); however, its tightness depends on the auxiliary variational distribution. Yin and Zhou (2018) suggested a multisample surrogate whose quality is controlled by the number of samples.

∗ Samsung-HSE Laboratory, National Research University Higher School of Economics

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper we consider hierarchical variational models (Ranganath et al., 2016; Salimans et al., 2015; Agakov and Barber, 2004) where the approximate posterior q(z|x) is represented as a mixture of tractable distributions q(z|ψ, x) over some tractable mixing distribution q(ψ|x):

  q(z|x) = ∫ q(z|ψ, x) q(ψ|x) dψ

We show that such variational models contain semi-implicit models where q(ψ|x) is allowed to be arbitrarily complicated while being reparametrizable (Yin and Zhou, 2018). To overcome the need for the closed-form marginal density q(z|x) we then propose a novel family of tighter bounds on the log marginal likelihood log p(x), which can be shown to generalize many previously known bounds: Hierarchical Variational Models (Ranganath et al., 2016), also known as the auxiliary VAE bound (Maaløe et al., 2016), Semi-Implicit Variational Inference (Yin and Zhou, 2018) and Doubly Semi-Implicit Variational Inference (Molchanov et al., 2018). At the core of our work lies a novel variational upper bound on the log marginal density, which we apply to the evidence lower bound (ELBO) to enable hierarchical approximate posteriors. Finally, our method can be combined with the multisample bound of Burda et al. 
(2015) to tighten the log marginal likelihood lower bound even further.

2 Background

Having a hierarchical model p_θ(x) = ∫ p_θ(x|z) p_θ(z) dz for observable objects x, we are interested in two tasks: inference and learning. The problem of Bayesian inference is that of finding the true posterior distribution p_θ(z|x), which is often intractable and thus is approximated by some q_φ(z|x). The problem of learning is that of finding parameters θ such that the marginal model distribution p_θ(x) approximates the true data-generating process of x as well as possible, typically in terms of KL divergence, which corresponds to the Maximum Likelihood Estimation problem.

Variational Inference provides a way to solve both tasks simultaneously by lower-bounding the intractable log marginal likelihood log p_θ(x) with the Evidence Lower Bound (ELBO) using a posterior approximation q_φ(z|x):

  log p_θ(x) ≥ log p_θ(x) − D_KL(q_φ(z|x) || p_θ(z|x)) = E_{q_φ(z|x)} log [ p_θ(x, z) / q_φ(z|x) ]

The bound requires analytically tractable densities for both q_φ(z|x) and p_θ(x, z). Since the ELBO is a biased objective, maximizing it w.r.t. θ not only (or necessarily) maximizes the log marginal likelihood, but also minimizes the KL divergence, constraining the true posterior p_θ(z|x) to stay close to the approximate one q_φ(z|x) and thus limiting the expressivity of p_θ(x). Such variational bias can be reduced by tightening the bound. In particular, Burda et al. (2015) proposed a family of tighter multisample bounds, generalizing the ELBO. We call it the IWAE bound:

  log p_θ(x) ≥ E_{q_φ(z_{1:M}|x)} log [ (1/M) Σ_{m=1}^{M} p_θ(x, z_m) / q_φ(z_m|x) ],

where from now on we write q_φ(z_{1:M}|x) = Π_{m=1}^{M} q_φ(z_m|x) for brevity. 
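As a concrete, self-contained illustration (ours, not from the paper), the ELBO and the IWAE bound can be compared on a one-dimensional conjugate Gaussian model where log p(x) is available in closed form; the model and the deliberately over-dispersed proposal below are toy choices made for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    """Log-density of N(x | mean, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Toy model: p(z) = N(0, 1), p(x|z) = N(x | z, 1), so p(x) = N(x | 0, 2) exactly.
x = 1.5
true_log_px = log_normal(x, 0.0, 2.0)

# Over-dispersed proposal q(z|x); the true posterior here is N(x/2, 1/2).
q_mean, q_var = x / 2.0, 1.0

def iwae_bound(M, n_runs=200_000):
    """Monte Carlo estimate of the M-sample IWAE bound; M = 1 gives the ELBO."""
    z = rng.normal(q_mean, np.sqrt(q_var), size=(n_runs, M))
    log_w = (log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
             - log_normal(z, q_mean, q_var))
    # Stable log of the M-sample average of the importance weights, per run.
    m = log_w.max(axis=1, keepdims=True)
    log_avg_w = (m + np.log(np.exp(log_w - m).mean(axis=1, keepdims=True))).squeeze(1)
    return log_avg_w.mean()

elbo, iwae_10 = iwae_bound(1), iwae_bound(10)
print(elbo, iwae_10, true_log_px)  # ELBO <= IWAE(10) <= log p(x)
```

Here the single-sample bound sits a fixed gap of D_KL(q(z|x) || p(z|x)) ≈ 0.15 nats below log p(x), and the 10-sample bound recovers most of that gap, at the price of M evaluations of the decoder term per estimate, matching the tightness/cost trade-off discussed above.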
This bound has been shown (Domke and Sheldon, 2018) to be a tractable lower bound on the ELBO for a variational distribution that has been obtained by self-normalized importance sampling, or a special case of Sequential Monte Carlo (Maddison et al., 2017; Naesseth et al., 2017; Le et al., 2017). However, the price of this increased tightness is higher computational complexity, which mostly stems from an increased number of evaluations of the high-dimensional decoder p_θ(x|z). We thus focus on learning more expressive posterior approximations to be used in the ELBO – the special case of M = 1.

In the direction of improving the single-sample ELBO, Agakov and Barber (2004); Salimans et al. (2015); Maaløe et al. (2016); Ranganath et al. (2016) proposed to use a hierarchical variational model (HVM) for q_φ(z|x) = ∫ q_φ(z|x, ψ) q_φ(ψ|x) dψ with explicit joint density q_φ(z, ψ|x), where ψ are auxiliary latent variables. Since the standard ELBO is now intractable due to the log q_φ(z|x) term, the following variational lower bound on the ELBO is proposed; its tightness is controlled by the auxiliary variational distribution τ_η(ψ|x, z):

  log p_θ(x) ≥ E_{q_φ(z|x)} log [ p_θ(x, z) / q_φ(z|x) ] ≥ E_{q_φ(z,ψ|x)} [ log p_θ(x, z) − log ( q_φ(z, ψ|x) / τ_η(ψ|x, z) ) ]    (1)

However, this bound now introduces auxiliary variational bias: the gap to the true ELBO is equal to D_KL(q_φ(ψ|z, x) || τ_η(ψ|z, x)), which prevents learning an expressive q_φ(z|x).

Recently Yin and Zhou (2018) introduced semi-implicit models: hierarchical models q_φ(z|x) = ∫ q_φ(z|ψ, x) q_φ(ψ|x) dψ with implicit but reparametrizable q_φ(ψ|x) and explicit q_φ(z|ψ, x), and suggested the following surrogate objective, which was later shown to be a lower bound (the SIVI bound) for all finite K by Molchanov et al. (2018):

  log p_θ(x) ≥ E_{q_φ(z,ψ_0|x)} E_{q_φ(ψ_{1:K}|x)} log [ p_θ(x, z) / ( (1/(K+1)) Σ_{k=0}^{K} q_φ(z|ψ_k, x) ) ]    (2)

An appealing property of this bound is that it gets tighter as the number of samples K increases and, unlike the IWAE bound, it performs the multiple computations in the smaller latent space. Moreover, SIVI can be generalized to use multiple samples z, similar to the IWAE bound (Burda et al., 2015), in an efficient way by reusing the samples ψ_{1:K} for different z_m:

  log p(x) ≥ E log [ (1/M) Σ_{m=1}^{M} p_θ(x, z_m) / ( (1/(K+1)) Σ_{k=0}^{K} q_φ(z_m|x, ψ_{m,k}) ) ]    (3)

where the expectation is taken over q_φ(ψ_{1:M,0}, z_{1:M}|x) and ψ_{m,1:K} = ψ_{1:K} ~ q_φ(ψ_{1:K}|x) is the same set of K i.i.d. random variables for all m². Importantly, this estimator has O(M + K) sampling complexity for ψ, unlike the naive approach with its O(MK + M) sampling complexity. We will get back to this discussion in section 4.1.

2.1 SIVI Insights

Here we outline SIVI's weak points and identify certain traits that make it possible to generalize the method and bridge it with the prior work.

First, note that both SIVI bounds (2) and (3) use samples from q_φ(ψ_{1:K}|x) to describe z, and in high dimensions one might expect such "uninformed" samples to miss most of the time, resulting in a near-zero likelihood q_φ(z|ψ_k, x) and thus a reduced "effective sample size". It is therefore expected that in higher dimensions it would take many samples to accurately cover the regions of high probability of q_φ(ψ|z, x) for a given z. 
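To make the behaviour of the SIVI bound (2) concrete, here is a small numerical sketch (our own illustration, not part of the paper) using a hierarchical Gaussian proposal whose marginal q(z|x), and therefore the exact ELBO, is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

x = 1.5
# Toy generative model: p(z) = N(0, 1), p(x|z) = N(x | z, 1), so log p(x) = log N(x | 0, 2).
true_log_px = log_normal(x, 0.0, 2.0)

# Hierarchical proposal: q(psi|x) = N(x/2, 1), q(z|psi, x) = N(psi, 0.5).
# Its marginal q(z|x) = N(x/2, 1.5) is tractable here, so the exact ELBO is available too.
psi_mean, psi_var, cond_var = x / 2.0, 1.0, 0.5

def sivi_bound(K, n_runs=100_000):
    """SIVI lower bound (2): 'uninformed' samples psi_{1:K} ~ q(psi|x) describe z."""
    psi0 = rng.normal(psi_mean, np.sqrt(psi_var), size=(n_runs, 1))
    z = rng.normal(psi0, np.sqrt(cond_var))                       # z ~ q(z|psi_0, x)
    psi = rng.normal(psi_mean, np.sqrt(psi_var), size=(n_runs, K))
    psi_all = np.concatenate([psi0, psi], axis=1)                 # psi_{0:K}
    log_q = log_normal(z, psi_all, cond_var)                      # log q(z|psi_k, x)
    m = log_q.max(axis=1, keepdims=True)
    log_q_hat = m + np.log(np.exp(log_q - m).mean(axis=1, keepdims=True))
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)   # log p(x, z)
    return (log_joint - log_q_hat).mean()

# Exact ELBO = log p(x) - KL(q(z|x) || p(z|x)); both are Gaussians with equal means here.
elbo_exact = true_log_px - 0.5 * (1.5 / 0.5 - 1.0 - np.log(1.5 / 0.5))
s0, s50 = sivi_bound(0), sivi_bound(50)
print(s0, s50, elbo_exact)  # increasing in K, approaching the ELBO from below
```

The K = 0 value sits a full mutual-information gap below the exact ELBO, and raising K closes most of it; the ψ_{1:K} drawn from q(ψ|x) rather than from the inverse model q(ψ|z, x) are precisely the "uninformed" samples discussed above.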
Instead, ideally, we would like to target such regions directly while keeping the lower bound guarantees.

Another important observation that we will make use of is that many such semi-implicit models can be equivalently reformulated as a mixture of two explicit distributions: due to the reparametrizability of q_φ(ψ|x) we have ψ = g_φ(ε|x) for some ε ~ q(ε) with tractable density. We can then consider an equivalent hierarchical model q_φ(z|x) = ∫ q_φ(z|g_φ(ε|x), x) q(ε) dε that first samples ε from some simple distribution, transforms this sample ε into ψ, and then generates samples from q_φ(z|ψ, x). Thus from now on we assume that both q_φ(z|ψ, x) and q_φ(ψ|x) have tractable densities, yet q_φ(z|ψ, x) can depend on ψ in an arbitrarily complex way, making analytic marginalization intractable.

3 Importance Weighted Hierarchical Variational Inference

Having the intractable log q_φ(z|x) as the source of our problems, we seek a tractable and efficient upper bound, which is provided by the following theorem:

Theorem (Log marginal density upper bound). For any q(z, ψ|x), K ∈ N₀ and τ(ψ|z, x) (under mild regularity conditions) consider the following:

  U_K = E_{q(ψ_0|x,z)} E_{τ(ψ_{1:K}|z,x)} log [ (1/(K+1)) Σ_{k=0}^{K} q(z, ψ_k|x) / τ(ψ_k|z, x) ]

Then the following holds:
1. U_K ≥ log q(z|x)
2. U_K ≥ U_{K+1}
3. lim_{K→∞} U_K = log q(z|x)

Proof. See Appendix for Theorem C.1.

² One could also include all ψ_{1:M,0} into the set of reused samples ψ, expanding its size to M + K.

The proposed upper bound provides a variational alternative to MCMC-based upper bounds (Grosse et al., 2015) and complements the standard Importance Weighted stochastic lower bound of Burda et al. (2015) on the log marginal density:

  L_K = E_{τ(ψ_{1:K}|z,x)} log [ (1/K) Σ_{k=1}^{K} q(z, ψ_k|x) / τ(ψ_k|z, x) ] ≤ log q(z|x)

3.1 Tractable lower bounds on the log marginal likelihood with a hierarchical proposal

The proposed upper bound U_K allows us to lower bound the otherwise intractable ELBO in the case of a hierarchical q_φ(z|x), leading to the Importance Weighted Hierarchical Variational Inference (IWHVI) lower bound:

  log p_θ(x) ≥ E_{q_φ(z|x)} log [ p_θ(x, z) / q_φ(z|x) ] ≥ E_{q_φ(z,ψ_0|x)} E_{τ_η(ψ_{1:K}|z,x)} log [ p_θ(x, z) / ( (1/(K+1)) Σ_{k=0}^{K} q_φ(z, ψ_k|x) / τ_η(ψ_k|z, x) ) ]    (4)

Crucially, we merged the expectations over q_φ(z|x) and q_φ(ψ_0|x, z) into an expectation over the joint distribution q_φ(ψ_0, z|x), which admits a more favorable factorization into q_φ(ψ_0|x) q_φ(z|x, ψ_0); samples from the latter are easy to simulate for Monte Carlo estimation.

IWHVI introduces an additional auxiliary variational distribution τ_η(ψ|x, z) that is learned by maximizing the bound w.r.t. its parameters η. While the optimal distribution is³ τ(ψ|z, x) = q(ψ|z, x), one can see that particular choices of the auxiliary distribution and the number of samples render previously known methods like DSIVI, SIVI and HVM special cases (see appendix A). 
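Claims 1 and 2 of the theorem can be checked numerically on a toy hierarchical Gaussian model where log q(z) and the exact inverse model q(ψ|z) are known in closed form; the model and the deliberately imperfect τ below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Toy hierarchical model: q(psi) = N(0, 1), q(z|psi) = N(psi, 1),
# so q(z) = N(0, 2) and the exact inverse model is q(psi|z) = N(z/2, 1/2).
z = 0.7
log_qz = log_normal(z, 0.0, 2.0)

# Imperfect auxiliary tau(psi|z): correct mean, inflated variance.
t_mean, t_var = z / 2.0, 1.0

def upper_bound(K, n_runs=200_000):
    """U_K from the theorem: psi_0 ~ q(psi|z), psi_{1:K} ~ tau(psi|z)."""
    psi0 = rng.normal(z / 2.0, np.sqrt(0.5), size=(n_runs, 1))
    psi = rng.normal(t_mean, np.sqrt(t_var), size=(n_runs, K))
    psi_all = np.concatenate([psi0, psi], axis=1)
    # Log of the ratio q(z, psi_k) / tau(psi_k | z) for each of the K+1 samples.
    log_r = (log_normal(psi_all, 0.0, 1.0) + log_normal(z, psi_all, 1.0)
             - log_normal(psi_all, t_mean, t_var))
    m = log_r.max(axis=1, keepdims=True)
    return (m.squeeze(1) + np.log(np.exp(log_r - m).mean(axis=1))).mean()

u0, u10 = upper_bound(0), upper_bound(10)
print(u0, u10, log_qz)  # U_0 >= U_10 >= log q(z), tightening as K grows
```

With the exact inverse model as τ, the ratio inside the logarithm is constant and U_K equals log q(z) for every K; the inflated-variance τ used here leaves a gap of D_KL(q(ψ|z) || τ(ψ|z)) ≈ 0.097 nats at K = 0, which the extra samples then shrink.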
Since the bound (4) can be seen as a variational generalization of SIVI (2), or as a multisample generalization of HVM (1), it has the capacity to give a tighter bound on the log marginal likelihood and to reduce the auxiliary variational bias, which should lead to more expressive variational approximations q_φ(z|x) and a reduction in the variational bias.

One could also consider a hierarchical prior p(z) (Atanov et al., 2019) and give tight multisample variational lower bounds using the bound L_K of Burda et al. (2015). Such nested variational inference is out of the scope of this paper, so we leave this direction to future work. Notably, the combination of the two bounds can give multisample variational sandwich bounds on the KL divergence between hierarchical models (see appendix B).

4 Multisample Extensions

Multisample bounds similar to the proposed one have already been studied extensively. In this section, we relate our results to such prior work.

4.1 Multisample Bound and Complexity

One can generalize the bound (4) further in a way similar to the IWAE multisample bound (Burda et al., 2015), leading to the Doubly Importance Weighted Hierarchical Variational Inference (DIWHVI) bound (see Theorem C.4):

  log p_θ(x) ≥ E log [ (1/M) Σ_{m=1}^{M} p_θ(x, z_m) / ( (1/(K+1)) Σ_{k=0}^{K} q_φ(z_m, ψ_{m,k}|x) / τ_η(ψ_{m,k}|z_m, x) ) ]    (5)

where the expectation is taken over the same generative process as in eq. (4), independently repeated M times:

1. Sample ψ_{m,0} ~ q_φ(ψ|x) for 1 ≤ m ≤ M
2. Sample z_m ~ q_φ(z|x, ψ_{m,0}) for 1 ≤ m ≤ M
3. Sample ψ_{m,k} ~ τ_η(ψ|z_m, x) for 1 ≤ m ≤ M and 1 ≤ k ≤ K

³ This choice makes the bound U_K equal to the log marginal density.

The price of the tighter bound (5) is quadratic sample complexity, as it requires M(1 + K) samples of ψ. 
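The three-step generative process above, together with the bound (5), can be sketched end to end on a toy Gaussian hierarchy (all distributions and constants below are our own illustrative choices, not the paper's models):

```python
import numpy as np

rng = np.random.default_rng(3)

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

x = 1.5
true_log_px = log_normal(x, 0.0, 2.0)  # for p(z) = N(0, 1), p(x|z) = N(x | z, 1)

# Hierarchical proposal q(psi|x) = N(x/2, 1), q(z|psi, x) = N(psi, 0.5), and a hand-made
# tau(psi|z, x) with the exact inverse-model mean (2z + x/2)/3 but inflated variance.
pm, pv, cv = x / 2.0, 1.0, 0.5

def tau_params(z):
    return (2.0 * z + pm) / 3.0, 0.5  # the exact inverse model would have variance 1/3

def diwhvi(M, K, n_runs=50_000):
    # Steps 1-3 of the generative process, vectorized over independent runs.
    psi0 = rng.normal(pm, np.sqrt(pv), size=(n_runs, M, 1))   # psi_{m,0} ~ q(psi|x)
    zm = rng.normal(psi0, np.sqrt(cv))                        # z_m ~ q(z|x, psi_{m,0})
    t_mean, t_var = tau_params(zm)
    if K > 0:
        psik = rng.normal(t_mean, np.sqrt(t_var), size=(n_runs, M, K))
    else:
        psik = np.zeros((n_runs, M, 0))
    psi = np.concatenate([psi0, psik], axis=2)                # psi_{m,0:K}
    # Inner upper bound on log q(z_m|x): log (1/(K+1)) sum_k q(z_m, psi_k|x)/tau(psi_k|z_m, x).
    log_r = (log_normal(psi, pm, pv) + log_normal(zm, psi, cv)
             - log_normal(psi, t_mean, t_var))
    mi = log_r.max(axis=2, keepdims=True)
    log_q_hat = (mi + np.log(np.exp(log_r - mi).mean(axis=2, keepdims=True))).squeeze(2)
    # Outer IWAE-style average over the M weights p(x, z_m) / exp(log_q_hat).
    z = zm.squeeze(2)
    log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_q_hat
    mo = log_w.max(axis=1, keepdims=True)
    return (mo.squeeze(1) + np.log(np.exp(log_w - mo).mean(axis=1))).mean()

d_hvm, d_diwhvi = diwhvi(1, 0), diwhvi(10, 10)
print(d_hvm, d_diwhvi, true_log_px)  # both below log p(x); larger M and K tighten the bound
```

With M = 1, K = 0 this reduces to an HVM-style single-sample bound, while M = 10, K = 10 tightens the estimate toward log p(x) while staying below it.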
Unfortunately, DIWHVI cannot benefit from the sample reuse trick of SIVI that leads to the bound (3). The reason is that the bound (5) requires all terms in the outer denominator (the log q_φ(z|x) estimate) to use the same distribution τ_η(ψ|x, z), whereas by its very nature this distribution should be very different for different z_m. A viable option, though, is to consider a multisample-conditioned τ_η(ψ|z_{1:M}) that is invariant to permutations of z. We leave a more detailed investigation to future work.

Runtime-wise, when compared to the multisample SIVI (3), DIWHVI requires O(M) additional passes to generate the τ(ψ|x, z_m) distributions. However, since SIVI requires a much larger number of samples K to reach the same level of accuracy (see section 6.1), all of which are then passed through a network to generate the q_φ(z_m|x, ψ_{m,k}) distributions, the extra τ_η computation is likely to either bear a minor overhead or be completely justified by the reduced K. This is particularly true in the IWHVI case (M = 1), where IWHVI's single extra pass that generates τ_η(ψ|x, z) is dominated by the K + 1 passes that generate q_φ(z|x, ψ_k).

4.2 Signal-to-Noise Ratio

Rainforth et al. (2018) have shown that multisample bounds (Burda et al., 2015; Nowozin, 2018) behave poorly during the training phase, producing noisier gradient estimates for the inference network, which manifests itself in a decreasing Signal-to-Noise Ratio (SNR) as the number of samples increases. This raises a natural concern whether the same happens in the proposed model as K increases. Tucker et al. (2019) have shown that upon careful examination a REINFORCE-like (Williams, 1992) term can be seen in the gradient estimate, and REINFORCE is known for its typically high variance (Rezende et al., 2014). 
The authors further suggest applying the reparametrization trick (Kingma and Welling, 2013) a second time to obtain a reparametrization-based gradient estimate, which is then shown to solve the decreasing SNR problem. The same reasoning can be applied to our bound, and we provide further details and experiments in appendix D, developing an IWHVI-DReG gradient estimator. We conclude that the problem of decreasing SNR exists in our bound as well, and that it is mitigated by the proposed gradient estimator.

4.3 Debiasing the bound

Nowozin (2018) has shown that the standard IWAE can be seen as a biased estimate of the log marginal likelihood with a bias of order O(1/M). They then suggested using the Generalized Jackknife of d-th order to reuse these M samples and obtain an estimator with a smaller bias of order O(1/M^d), at the cost of higher variance and the loss of the lower bound guarantee. Again, the same idea can be applied to our estimate; we leave further details to appendix E. We conclude that in this way one can obtain better estimates of the log marginal density; however, since there is no guarantee that the obtained estimator gives an upper or a lower bound, we chose not to use it in the experiments.

5 Related Work

More expressive variational distributions have been under active investigation for a while. 
While we have focused our attention on approaches employing hierarchical models via bounds, there are many other approaches, roughly falling into two broad classes.

One possible approach is to augment some standard q(z|x) with the help of copulas (Tran et al., 2015), mixtures (Guo et al., 2016; Gershman et al., 2012), or invertible transformations with tractable Jacobians, also known as normalizing flows (Rezende and Mohamed, 2015; Kingma et al., 2016; Dinh et al., 2016; Papamakarios et al., 2017), all while preserving the tractability of the density. Kingma and Dhariwal (2018) have demonstrated that flow-based models are able to approximate complex high-dimensional distributions of real images, but the requirement of invertibility might lead to inefficient parameter usage and does not allow for abstraction, as one needs to preserve dimensions.

An alternative direction is to embrace implicit distributions that one can only sample from, and to overcome the need for a tractable density using bounds or estimates (Huszár, 2017). Methods based on estimates (Mescheder et al., 2017; Shi et al., 2017; Tran et al., 2017), for example via the Density Ratio Estimation trick (Goodfellow et al., 2014; Uehara et al., 2016; Mohamed and Lakshminarayanan, 2016), typically estimate the densities indirectly using an auxiliary critic and hide the dependency on the variational parameters φ, hence biasing the optimization procedure. 

Figure 1: (a) Negative entropy upper bounds for a 50-dimensional Laplace distribution. The shaded area denotes the 90% confidence interval computed over 50 independent runs for each K. (b) Final log marginal likelihood log p(x) estimates and expected MI[q(z, ψ|x)] for IWHVI- or SIVI-based VAEs trained with different K. Each model was trained and plotted 5 times.
A major disadvantage of such methods is that they lose the lower bound guarantees.

Titsias and Ruiz (2018) have shown that in gradient-based ELBO optimization for a hierarchical model with tractable q_φ(z|ψ) and q_φ(ψ), one does not need the log marginal density log q_φ(z|x) per se, only its gradient, which can be estimated using MCMC. Although unbiased, MCMC-based posterior sampling has a sequential nature (one needs to perform chain burn-in to decorrelate ψ′ from its initial value) that is not amenable to the efficient parallelization available on modern hardware such as GPUs, which complicates scaling the method to larger problems. In contrast, our method allows parallel computation of the K density ratios in U_K.

The core contribution of the paper is a novel upper bound on the log marginal density. Previously, Dieng et al. (2017); Kuleshov and Ermon (2017) suggested using the χ²-divergence to give a variational upper bound on the log marginal density. However, their bound was not an expectation of a random variable, but instead the logarithm of an expectation, preventing unbiased stochastic optimization. Jebara and Pentland (2001) reverse Jensen's inequality to give a variational upper bound in the case of mixtures of exponential family distributions by extensive use of the problem's structure. Related to our core idea of jointly sampling z and ψ_0 in (7) is an observation of Grosse et al. (2015) that Annealed Importance Sampling (AIS, Neal (2001)) run backward from the auxiliary variable sample ψ_0 gives an unbiased estimate of 1/q(z|x), and thus can also be used to upper bound the log marginal density. 
However, AIS-based estimation is too computationally expensive to be used during training.

6 Experiments

6.1 Toy Experiment

As a toy experiment we consider a 50-dimensional factorized standard Laplace distribution q(z) as a hierarchical scale-mixture model:

  q(z) = Π_{d=1}^{50} Laplace(z_d|0, 1) = ∫ Π_{d=1}^{50} N(z_d|0, ψ_d) Exp(ψ_d|1/2) dψ_{1:50}

We do not make use of the factorized joint distribution q(z, ψ), in order to explore the bound's behavior in high dimensions. We use the proposed bound from Theorem C.1 and compare it to SIVI (Yin and Zhou, 2018) on the task of upper-bounding the negative differential entropy E_{q(z)} log q(z). For IWHVI we take τ(ψ|z) to be a Gamma distribution whose concentration and rate are generated from z by a neural network with three 500-dimensional hidden layers. We use the freedom to design the architecture and initialize the network at the prior. Namely, we also add a sigmoid "gate" output with a large initial negative bias and use the gate to combine the prior concentration and rate with those generated by the network. This way we are guaranteed to perform no worse than SIVI even with a randomly initialized τ.

Figure 1a shows the value of the bound for different numbers of optimization steps over the parameters of τ, minimizing the bound. The whole process (including the random initialization of the neural networks) was repeated 50 times to compute empirical 90% confidence intervals. As the results clearly indicate, the proposed bound can be made much tighter, more than halving the gap to the true negative entropy compared to SIVI and HVM.

6.2 Variational Autoencoder

We further test our method on the task of generative modeling, applying it to a VAE (Kingma and Welling, 2013), which is a standard benchmark for inference methods⁴. Ideally, better inference should allow one to learn more expressive generative models. We report results on two datasets: MNIST (LeCun et al., 1998) and OMNIGLOT (Lake et al., 2015). For MNIST we follow the setup by Mescheder et al. (2017), and for OMNIGLOT we follow the standard setup (Burda et al., 2015). For experiment details see appendix G.

  Method          MNIST          OMNIGLOT
  AVB + AC        −83.7 ± 0.3    —              (from Mescheder et al., 2017)
  IWHVI           −83.9 ± 0.1    −104.8 ± 0.1
  SIVI            −84.4 ± 0.1    −105.7 ± 0.1
  HVM             −84.9 ± 0.1    −105.8 ± 0.1
  VAE + RealNVP   −84.8 ± 0.1    −106.0 ± 0.1
  VAE + IAF       −84.9 ± 0.1    −107.0 ± 0.1
  VAE             −85.0 ± 0.1    −106.6 ± 0.1

Figure 2: Left: Test log-likelihood on dynamically binarized MNIST and OMNIGLOT. Right: Comparison of multisample DIWHVI and SIVI-IW on a trained MNIST VAE from section 6.2 for M = 100 and 5000. The shaded area denotes a ±2 std. interval, computed over 10 independent runs for each value of K.

During training we used the proposed bound (4) with the standard prior p(z) = N(z|0, 1) and an increasing number K: we used K = 0 for the first 250 epochs, K = 5 for the next 250 epochs, K = 25 for the next 500 epochs, and K = 50 from then on (90% of training). 
Such a schedule is motivated by a natural observation (see the last paragraph of this section) that larger values of K lead to more expressive variational models, yet large values of K sometimes caused instabilities early in training due to an unlucky initialization. Regarding the number of samples z, we used M = 1 throughout training.

To estimate the log marginal likelihood for hierarchical models (IWHVI, SIVI, HVM) we use the DIWHVI lower bound (5) for M = 5000, K = 100 (for a justification of DIWHVI as an evaluation metric see section 6.3). Results are shown in fig. 2. To evaluate SIVI using the DIWHVI bound we fit τ to the trained model by running 7000 epochs over the train set with K = 50, keeping the parameters of q_φ(z, ψ|x) and p_θ(x, z) fixed. We observed improved performance compared to the special cases of HVM and SIVI, and the method showed results comparable to the prior works.

For HVM on MNIST we observed that its τ(ψ|z) essentially collapsed to q(ψ), with the expected KL divergence between the two extremely close to zero. This indicates the "posterior collapse" problem (Kim et al., 2018; Chen et al., 2016), where the inference network q(z|ψ) chooses to ignore the extra input ψ and thus the whole model effectively degenerates to a vanilla VAE. At the same time, IWHVI does not suffer from this problem due to non-zero K, achieving an average D_KL(τ(ψ|z, x) || q(ψ)) of approximately 6.2 nats (see section 6.3). 

⁴ Code is available at https://github.com/artsobolev/IWHVI
On OMNIGLOT the HVM did learn a useful τ and achieved an average D_KL(τ(ψ|z, x) || q(ψ)) ≈ 1.98 nats; however, IWHVI did much better and achieved ≈ 9.97 nats.

We also tried to learn a hierarchical approximate posterior q(z|x) using UIVI (Titsias and Ruiz, 2018). Unfortunately, the default parameters used in the original paper did not lead to a significant improvement over the standard VAE (−84.9 vs −85.0 on MNIST). We hypothesize this is due to HMC's poor mixing over different modes and its high sensitivity to hyperparameters. This, combined with the computational expensiveness of UIVI (our TensorFlow-based implementation was nearly 10 times slower than IWHVI; see the discussion in section 5), prevented us from exploring the hyperparameter space more exhaustively.

Influence of K: to investigate the influence of K on the training process, we trained VAEs on MNIST for 3000 epochs for different values of K with SIVI, IWHVI and IWHVI-DReG, and evaluated the DIWHVI bound for M = 1000, K = 100 as well as the Mutual Information (MI, see appendix F for details) between z and ψ under the joint q_φ(z, ψ|x). The results in fig. 1b clearly show that larger values of K lead to better final models in terms of the log marginal likelihood, as well as to approximate posteriors q_φ(z|x) that rely on the latent ψ more heavily, as measured by the MI. Notably, IWHVI achieves much higher values of the Mutual Information than SIVI, and the improved gradient estimator IWHVI-DReG enables even better results due to a better auxiliary distribution τ_η(ψ|z, x) (Tucker et al., 2019). 
These results empirically validate the claim that tighter bounds reduce the (auxiliary) variational bias.

6.3 DIWHVI as Evaluation Metric

One of the established approaches to evaluating the intractable log marginal likelihood in Latent Variable Models is to compute the multisample IWAE bound with a large M, since it is known to converge to the log marginal likelihood as M goes to infinity. Since both IWHVI and SIVI allow tightening the bound by taking more samples z_m, we compare the methods along this direction.

Both DIWHVI and SIVI (being a special case of the former) can be shown to converge to the log marginal likelihood as both M and K go to infinity; however, the rates might differ. We empirically compare the two by evaluating the MNIST-trained IWHVAE model from section 6.2 for several different values of K and M. We use the proposed DIWHVI bound (5) and compare it with several SIVI modifications. We call SIVI-like the bound (5) with τ(ψ|z) = q(ψ) but without ψ reuse, thus using MK independent samples. SIVI Equicomp stands for the sample-reusing bound (3), which uses only M + K samples and the same ψ_{1:K} for every z_m. SIVI Equisample is a fair comparison in terms of the number of samples: we take M(K + 1) samples of ψ and reuse MK of them for every z_m. This way we use the same number of samples ψ as DIWHVI does, but perform O(M²K) log-density evaluations to estimate log q(z|x), which is why we only examine the M = 100 case.

The results, shown in section 6.2 (fig. 2, right), indicate superior performance of the DIWHVI bound. Surprisingly, the SIVI-like and SIVI Equicomp estimates nearly coincide, with no significant difference in variance; thus we conclude that sample reuse does not hurt SIVI. Still, there is a considerable gap to the IWHVI bound, which uses an amount of computation and samples similar to SIVI-like. 
In a fairer comparison to the Equisample SIVI bound, the gap is significantly reduced, yet IWHVI is still a superior bound, especially in terms of computational efficiency, as there are no O(M²K) operations.
Comparing IWHVI and SIVI-like for M = 5000, we see that the former converges after a few dozen samples, while SIVI is rapidly improving yet lags almost 1 nat behind at 100 samples, and even 0.5 nats behind the HVM bound (IWHVI for K = 0). One explanation for the observed behaviour is the large Eq(z|x)DKL(q(ψ | x) || q(ψ | x, z)), which was estimated⁵ (on a test set) to be at least 46.85 nats, causing many samples from q(ψ) to yield a poor likelihood q(z | ψ) for a given zm due to the large difference from the true inverse model q(ψ | x, z). This is consistent with the motivation laid out in section 2.1: a better approximate inverse model τ leads to more efficient sample usage. At the same time, Eq(z|x)DKL(τ(ψ | x, z) || q(ψ | x, z)) was estimated to be approximately 3.24 and Eq(z|x)DKL(τ(ψ | x, z) || q(ψ | x)) ≈ 6.25, showing that one can indeed do much better by learning τ(ψ | x, z) instead of using the prior q(ψ | x).

⁵ Difference between the K-sample IWAE bound and the ELBO gives a lower bound on DKL(τ(ψ | z) || q(ψ | z)); we used K = 5000.

7 Conclusion

We presented a multisample variational upper bound on the log marginal density, which allowed us to give tight, tractable lower bounds on the otherwise intractable ELBO in the case of a hierarchical variational model qφ(z | x). We experimentally validated the bound and showed that it alleviates the (auxiliary) variational bias to a greater extent than prior works do (which we showed to be special cases of the proposed bound in appendix A), allowing for more expressive approximate posteriors, which translates into better inference.
We then combined our bound with the multisample IWAE bound, which led to a tighter lower bound on the log marginal likelihood. We therefore believe the proposed variational inference method will be useful for many approximate inference problems, and that the multisample variational upper bound on the log marginal density is a useful theoretical tool, allowing one, for example, to give an upper bound on the KL divergence (appendix B) or to give sandwich bounds on the Mutual Information (appendix F).

Acknowledgements

The authors would like to thank Aibek Alanov, Dmitry Molchanov, Oleg Ivanov and Anonymous Reviewer 3 for valuable discussions and feedback. Results on multisample extensions, shown in Section 4, were obtained by Dmitry Vetrov and are supported by the Russian Science Foundation grant no. 17-71-20072.

References

Angelova, J. A.
2012. On moments of sample mean and variance. International Journal of Pure and Applied Mathematics, 79.

Abadi, M., A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng
2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Agakov, F. V. and D. Barber
2004. An auxiliary variational method. In International Conference on Neural Information Processing, Pp. 561–566. Springer.

Atanov, A., A. Ashukha, K. Struminsky, D. Vetrov, and M. Welling
2019. The deep weight prior. In International Conference on Learning Representations.

Burda, Y., R. Grosse, and R. Salakhutdinov
2015. Importance weighted autoencoders.
arXiv preprint arXiv:1509.00519.

Chen, X., D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel
2016. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731.

Dieng, A. B., D. Tran, R. Ranganath, J. Paisley, and D. Blei
2017. Variational inference via χ upper bound minimization. In Advances in Neural Information Processing Systems, Pp. 2732–2741.

Dillon, J. V., I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous
2017a. Tensorflow distributions. arXiv preprint arXiv:1711.10604.

Dillon, J. V., I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. D. Hoffman, and R. A. Saurous
2017b. Tensorflow distributions. CoRR, abs/1711.10604.

Dinh, L., J. Sohl-Dickstein, and S. Bengio
2016. Density estimation using real NVP. CoRR, abs/1605.08803.

Domke, J. and D. R. Sheldon
2018. Importance weighting and variational inference. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds., Pp. 4471–4480. Curran Associates, Inc.

Gershman, S., M. Hoffman, and D. Blei
2012. Nonparametric variational inference. arXiv preprint arXiv:1206.4665.

Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio
2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, Pp. 2672–2680.

Grosse, R. B., Z. Ghahramani, and R. P. Adams
2015. Sandwiching the marginal likelihood using bidirectional monte carlo. CoRR, abs/1511.02543.

Guo, F., X. Wang, K. Fan, T. Broderick, and D. B. Dunson
2016. Boosting variational inference. arXiv preprint arXiv:1611.05559.

Hinton, G. E. and D. van Camp
1993.
Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT '93, Pp. 5–13, New York, NY, USA. ACM.

Hunter, J. D.
2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95.

Huszár, F.
2017. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235.

Jebara, T. and A. Pentland
2001. On reversing Jensen's inequality. In Advances in Neural Information Processing Systems, Pp. 231–237.

Kim, Y., S. Wiseman, A. Miller, D. Sontag, and A. Rush
2018. Semi-amortized variational autoencoders. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause, eds., volume 80 of Proceedings of Machine Learning Research, Pp. 2678–2687, Stockholmsmässan, Stockholm, Sweden. PMLR.

Kingma, D. P. and J. Ba
2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Kingma, D. P. and P. Dhariwal
2018. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds., Pp. 10235–10244. Curran Associates, Inc.

Kingma, D. P., T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling
2016. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, eds., Pp. 4743–4751. Curran Associates, Inc.

Kingma, D. P. and M. Welling
2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

Kuleshov, V. and S. Ermon
2017. Neural variational inference and learning in undirected graphical models. In Advances in Neural Information Processing Systems, Pp. 6734–6743.

Lake, B. M., R.
Salakhutdinov, and J. B. Tenenbaum
2015. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

Le, T. A., M. Igl, T. Rainforth, T. Jin, and F. Wood
2017. Auto-encoding sequential monte carlo. arXiv preprint arXiv:1705.10306.

LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner
1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Louizos, C., K. Ullrich, and M. Welling
2017. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds., Pp. 3288–3298. Curran Associates, Inc.

Maaløe, L., C. K. Sønderby, S. K. Sønderby, and O. Winther
2016. Auxiliary deep generative models. In Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger, eds., volume 48 of Proceedings of Machine Learning Research, Pp. 1445–1453, New York, New York, USA. PMLR.

MacKay, D. J.
1995. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 354(1):73–80. Proceedings of the Third Workshop on Neutron Scattering Data Analysis.

Maddison, C. J., J. Lawson, G. Tucker, N. Heess, M. Norouzi, A. Mnih, A. Doucet, and Y. Teh
2017. Filtering variational objectives. In Advances in Neural Information Processing Systems, Pp. 6573–6583.

Mescheder, L., S. Nowozin, and A. Geiger
2017. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In International Conference on Machine Learning (ICML).

Mohamed, S. and B. Lakshminarayanan
2016. Learning in implicit generative models. arXiv preprint arXiv:1610.03483.

Molchanov, D., V. Kharitonov, A. Sobolev, and D.
Vetrov
2018. Doubly semi-implicit variational inference. arXiv preprint arXiv:1810.02789.

Naesseth, C. A., S. W. Linderman, R. Ranganath, and D. M. Blei
2017. Variational sequential monte carlo. arXiv preprint arXiv:1705.11140.

Neal, R. M.
2001. Annealed importance sampling. Statistics and Computing, 11(2):125–139.

Nowozin, S.
2018. Debiasing evidence approximations: On importance-weighted autoencoders and jackknife variational inference. In International Conference on Learning Representations.

Nowozin, S., B. Cseke, and R. Tomioka
2016. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, eds., Pp. 271–279. Curran Associates, Inc.

Oliphant, T.
2006–. NumPy: A guide to NumPy. USA: Trelgol Publishing.

Papamakarios, G., I. Murray, and T. Pavlakou
2017. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, Pp. 2335–2344.

Poole, B., S. Ozair, A. van den Oord, A. A. Alemi, and G. Tucker
2018. On variational lower bounds of mutual information. In NeurIPS Workshop on Bayesian Deep Learning.

Rainforth, T., A. R. Kosiorek, T. A. Le, C. J. Maddison, M. Igl, F. Wood, and Y. W. Teh
2018. Tighter variational bounds are not necessarily better. In ICML.

Ranganath, R., D. Tran, and D. Blei
2016. Hierarchical variational models. In International Conference on Machine Learning, Pp. 324–333.

Reddi, S. J., S. Kale, and S. Kumar
2018. On the convergence of adam and beyond. In International Conference on Learning Representations.

Rezende, D. J. and S. Mohamed
2015. Variational inference with normalizing flows.
arXiv preprint arXiv:1505.05770.

Rezende, D. J., S. Mohamed, and D. Wierstra
2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, E. P. Xing and T. Jebara, eds., volume 32 of Proceedings of Machine Learning Research, Pp. 1278–1286, Beijing, China. PMLR.

Salimans, T., D. Kingma, and M. Welling
2015. Markov chain monte carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, Pp. 1218–1226.

Sharot, T.
1976. The generalized jackknife: finite samples and subsample sizes. Journal of the American Statistical Association, 71(354):451–454.

Shi, J., S. Sun, and J. Zhu
2017. Kernel implicit variational inference. arXiv preprint arXiv:1705.10119.

Titsias, M. K. and F. J. Ruiz
2018. Unbiased implicit variational inference. arXiv preprint arXiv:1808.02078.

Tran, D., D. Blei, and E. M. Airoldi
2015. Copula variational inference. In Advances in Neural Information Processing Systems, Pp. 3564–3572.

Tran, D., R. Ranganath, and D. Blei
2017. Hierarchical implicit models and likelihood-free variational inference. In Advances in Neural Information Processing Systems, Pp. 5523–5533.

Tucker, G., D. Lawson, S. Gu, and C. J. Maddison
2019. Doubly reparameterized gradient estimators for monte carlo objectives. In International Conference on Learning Representations.

Uehara, M., I. Sato, M. Suzuki, K. Nakayama, and Y. Matsuo
2016. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920.

Virtanen, P., R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. Jarrod Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D.
Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors
2019. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. arXiv e-prints, arXiv:1907.10121.

Wainwright, M. J., M. I. Jordan, et al.
2008. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305.

Waterhouse, S. R., D. MacKay, and A. J. Robinson
1996. Bayesian methods for mixtures of experts. In Advances in Neural Information Processing Systems, Pp. 351–357.

Williams, R. J.
1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Machine Learning, Pp. 229–256.

Yin, M. and M. Zhou
2018. Semi-implicit variational inference. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, Pp. 5660–5669. PMLR.