{"title": "Filtering Variational Objectives", "book": "Advances in Neural Information Processing Systems", "page_first": 6573, "page_last": 6583, "abstract": "When used as a surrogate objective for maximum likelihood estimation in latent variable models, the evidence lower bound (ELBO) produces state-of-the-art results. Inspired by this, we consider the extension of the ELBO to a family of lower bounds defined by a particle filter's estimator of the marginal likelihood, the filtering variational objectives (FIVOs). FIVOs take the same arguments as the ELBO, but can exploit a model's sequential structure to form tighter bounds. We present results that relate the tightness of FIVO's bound to the variance of the particle filter's estimator by considering the generic case of bounds defined as log-transformed likelihood estimators. Experimentally, we show that training with FIVO results in substantial improvements over training the same model architecture with the ELBO on sequential data.", "full_text": "Filtering Variational Objectives\n\nChris J. Maddison1,3,*, Dieterich Lawson,2,* George Tucker2,*\n\nNicolas Heess1, Mohammad Norouzi2, Andriy Mnih1, Arnaud Doucet3, Yee Whye Teh1\n\n1DeepMind, 2Google Brain, 3University of Oxford\n{cmaddis, dieterichl, gjt}@google.com\n\nAbstract\n\nWhen used as a surrogate objective for maximum likelihood estimation in latent\nvariable models, the evidence lower bound (ELBO) produces state-of-the-art results.\nInspired by this, we consider the extension of the ELBO to a family of lower bounds\nde\ufb01ned by a particle \ufb01lter\u2019s estimator of the marginal likelihood, the \ufb01ltering\nvariational objectives (FIVOs). FIVOs take the same arguments as the ELBO,\nbut can exploit a model\u2019s sequential structure to form tighter bounds. 
We present results that relate the tightness of FIVO's bound to the variance of the particle filter's estimator by considering the generic case of bounds defined as log-transformed likelihood estimators. Experimentally, we show that training with FIVO results in substantial improvements over training the same model architecture with the ELBO on sequential data.

1 Introduction

Learning in statistical models via gradient descent is straightforward when the objective function and its gradients are tractable. In the presence of latent variables, however, many objectives become intractable. For neural generative models with latent variables, there are currently a few dominant approaches: optimizing lower bounds on the marginal log-likelihood [1, 2], restricting to a class of invertible models [3], or using likelihood-free methods [4, 5, 6, 7]. In this work, we focus on the first approach and introduce filtering variational objectives (FIVOs), a tractable family of objectives for maximum likelihood estimation (MLE) in latent variable models with sequential structure.

Specifically, let x denote an observation of an X-valued random variable. We assume that the process generating x involves an unobserved Z-valued random variable z with joint density p(x, z) in some family P. The goal of MLE is to recover p ∈ P that maximizes the marginal log-likelihood, log p(x) = log(∫ p(x, z) dz)¹. The difficulty in carrying out this optimization is that the log-likelihood function is defined via a generally intractable integral. To circumvent marginalization, a common approach [1, 2] is to optimize a variational lower bound on the marginal log-likelihood [8, 9].
The evidence lower bound L(x, p, q) (ELBO) is the most common such bound and is defined by a variational posterior distribution q(z|x) whose support includes p's,

L(x, p, q) = E_{q(z|x)}[ log( p(x, z) / q(z|x) ) ] = log p(x) − KL( q(z|x) ‖ p(z|x) ) ≤ log p(x) .   (1)

L(x, p, q) lower-bounds the marginal log-likelihood for any choice of q, and the bound is tight when q is the true posterior p(z|x). Thus, the joint optimum of L(x, p, q) in p and q is the MLE. In practice, it is common to restrict q to a tractable family of distributions (e.g., a factored distribution) and to jointly optimize the ELBO over p and q with stochastic gradient ascent [1, 2, 10, 11]. Because of the KL penalty from q to p, optimizing (1) under these assumptions tends to force p's posterior to satisfy the factorizing assumptions of the variational family, which reduces the capacity of the model p. One strategy for addressing this is to decouple the tightness of the bound from the quality of q. For example, [12] observed that Eq. (1) can be interpreted as the log of an unnormalized importance weight with the proposal given by q, and that using N samples from the same proposal produces a tighter bound, known as the importance weighted auto-encoder bound, or IWAE.

Indeed, it follows from Jensen's inequality that the log of any unbiased positive Monte Carlo estimator of the marginal likelihood results in a lower bound that can be optimized for MLE. The filtering variational objectives (FIVOs) build on this idea by treating the log of a particle filter's likelihood estimator as an objective function.

*Equal contribution.
¹We reuse p to denote the conditionals and marginals of the joint density.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
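The Jensen argument above can be checked numerically. Below is a toy sketch (our own example, not from the paper): a model with z ∼ N(0, 1) and x | z ∼ N(z, 1), whose exact marginal is p(x) = N(x; 0, 2), paired with a deliberately mismatched proposal q(z|x) = N(0, 1). Averaging log importance weights estimates the ELBO, which sits strictly below log p(x).

```python
import numpy as np

# Toy latent variable model (illustrative, not the paper's):
#   z ~ N(0, 1),  x | z ~ N(z, 1)  =>  exact marginal p(x) = N(x; 0, 2).
# Proposal q(z | x) = N(0, 1), deliberately different from the true posterior.
rng = np.random.default_rng(0)

def log_normal(y, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

x = 1.5
true_log_px = log_normal(x, 0.0, 2.0)

z = rng.normal(0.0, 1.0, size=100_000)   # z ~ q(z | x)
# log w = log p(x, z) - log q(z | x); since q equals the prior, log w = log p(x | z).
log_w = log_normal(x, z, 1.0)
elbo = log_w.mean()                      # Monte Carlo estimate of E_q[log w]

# Jensen's inequality: E[log w] <= log E[w] = log p(x).
print(f"ELBO ~ {elbo:.3f}  <=  log p(x) = {true_log_px:.3f}")
```

The gap equals KL(q(z|x) ‖ p(z|x)), roughly 0.716 nats for x = 1.5 here; the sections below shrink this gap by using better estimators rather than a better q.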
Following [13], we call objectives de\ufb01ned as log-transformed\nlikelihood estimators Monte Carlo objectives (MCOs). In this work, we show that the tightness\nof an MCO scales like the relative variance of the estimator from which it is constructed. It is\nwell-known that the variance of a particle \ufb01lter\u2019s likelihood estimator scales more favourably than\nsimple importance sampling for models with sequential structure [14, 15]. Thus, FIVO can potentially\nform a much tighter bound on the marginal log-likelihood than IWAE.\nThe main contributions of this work are introducing \ufb01ltering variational objectives and a more\ncareful study of Monte Carlo objectives. In Section 2, we review maximum likelihood estimation via\nmaximizing the ELBO. In Section 3, we study Monte Carlo objectives and provide some of their basic\nproperties. We de\ufb01ne \ufb01ltering variational objectives in Section 4, discuss details of their optimization,\nand present a sharpness result. Finally, we cover related work and present experiments showing that\nsequential models trained with FIVO outperform models trained with ELBO or IWAE in practice.\n\n2 Background\n\nWe brie\ufb02y review techniques for optimizing the ELBO as a surrogate MLE objective. We restrict our\nfocus to latent variable models in which the model p\u03b8(x, z) factors into tractable conditionals p\u03b8(z)\nand p\u03b8(x|z) that are parameterized differentiably by parameters \u03b8. MLE in these models is then the\nproblem of optimizing log p\u03b8(x) in \u03b8. The expectation-maximization (EM) algorithm is an approach\nto this problem which can be seen as coordinate ascent, fully maximizing L(x, p\u03b8, q) alternately in q\nand \u03b8 at each iteration [16, 17, 18]. 
Yet, EM rarely applies in general, because maximizing over q for a fixed θ corresponds to a generally intractable inference problem.

Instead, an approach with mild assumptions on the model is to perform gradient ascent following a Monte Carlo estimator of the ELBO's gradient [19, 10]. We assume that q is taken from a family of distributions parameterized differentiably by parameters φ. We can follow an unbiased estimator of the ELBO's gradient by sampling z ∼ q_φ(z|x) and updating the parameters by θ′ = θ + η∇_θ log p_θ(x, z) and φ′ = φ + η(log p_θ(x, z) − log q_φ(z|x))∇_φ log q_φ(z|x), where the gradients are computed conditional on the sample z and η is a learning rate. Such estimators follow the ELBO's gradient in expectation, but variance reduction techniques are usually necessary [10, 20, 13].

A lower variance gradient estimator can be derived if q_φ is a reparameterizable distribution [1, 2, 21]. Reparameterizable distributions are those that can be simulated by sampling from a distribution ε ∼ d(ε), which does not depend on φ, and then applying a deterministic transformation z = f_φ(x, ε). When p_θ, q_φ, and f_φ are differentiable, an unbiased estimator of the ELBO gradient consists of sampling ε and updating the parameters by (θ′, φ′) = (θ, φ) + η∇_{(θ,φ)}(log p_θ(x, f_φ(x, ε)) − log q_φ(f_φ(x, ε)|x)). Given ε, the gradients of the sampling process can flow through z = f_φ(x, ε).

Unfortunately, when the variational family of q_φ is restricted, following gradients of −KL(q_φ(z|x) ‖ p_θ(z|x)) tends to reduce the capacity of the model p_θ to match the assumptions of the variational family. This KL penalty can be “removed” by considering generalizations of the ELBO whose tightness can be controlled by means other than the closeness of p and q, e.g., [12]. We consider this in the next section.

3 Monte Carlo Objectives (MCOs)

Monte Carlo objectives (MCOs) [13] generalize the ELBO to objectives defined by taking the log of a positive, unbiased estimator of the marginal likelihood. The key property of MCOs is that they are lower bounds on the marginal log-likelihood, and thus can be used for MLE. Motivated by the previous section, we present results on the convergence of generic MCOs to the marginal log-likelihood and show that the tightness of an MCO is closely related to the variance of the estimator that defines it.

One can verify that the ELBO is a lower bound by using the concavity of log and Jensen's inequality,

E_{q(z|x)}[ log( p(x, z) / q(z|x) ) ] ≤ log ∫ ( p(x, z) / q(z|x) ) q(z|x) dz = log p(x).   (2)

This argument relies only on the unbiasedness of p(x, z)/q(z|x) when z ∼ q(z|x). Thus, we can generalize it by considering any unbiased marginal likelihood estimator p̂_N(x) and treating E[log p̂_N(x)] as an objective function over models p. Here N ∈ ℕ indexes the amount of computation needed to simulate p̂_N(x), e.g., the number of samples or particles.

Definition 1. Monte Carlo Objectives. Let p̂_N(x) be an unbiased positive estimator of p(x), E[p̂_N(x)] = p(x); then the Monte Carlo objective L_N(x, p) over p ∈ P defined by p̂_N(x) is

L_N(x, p) = E[log p̂_N(x)] .   (3)

For example, the ELBO is constructed from a single unnormalized importance weight p̂(x) = p(x, z)/q(z|x). The IWAE bound [12] takes p̂_N(x) to be the average of N i.i.d.
importance weights,

L^IWAE_N(x, p, q) = E_{z^i ∼ q(z^i|x)}[ log( (1/N) ∑_{i=1}^N p(x, z^i)/q(z^i|x) ) ] .   (4)

We consider additional examples in the Appendix. To avoid notational clutter, we omit the arguments to an MCO, e.g., the observations x or model p, when the default arguments are clear from context. Whether we can compute stochastic gradients of L_N efficiently depends on the specific form of the estimator and the underlying random variables that define it.

Many likelihood estimators p̂_N(x) converge to p(x) almost surely as N → ∞ (known as strong consistency). The advantage of a consistent estimator is that its MCO can be driven towards log p(x) by increasing N. We present sufficient conditions for this convergence and a description of the rate:

Proposition 1. Properties of Monte Carlo Objectives. Let L_N(x, p) be a Monte Carlo objective defined by an unbiased positive estimator p̂_N(x) of p(x). Then,

(a) (Bound) L_N(x, p) ≤ log p(x).

(b) (Consistency) If log p̂_N(x) is uniformly integrable (see Appendix for definition) and p̂_N(x) is strongly consistent, then L_N(x, p) → log p(x) as N → ∞.

(c) (Asymptotic Bias) Let g(N) = E[(p̂_N(x) − p(x))⁶] be the 6th central moment. If the 1st inverse moment is bounded, lim sup_{N→∞} E[p̂_N(x)⁻¹] < ∞, then

log p(x) − L_N(x, p) = (1/2) var( p̂_N(x)/p(x) ) + O(√g(N)) .   (5)

Proof. See the Appendix for the proof and a sufficient condition for controlling the first inverse moment when p̂_N(x) is the average of i.i.d. random variables.

In some cases, convergence of the bound to log p(x) is monotonic, e.g., IWAE [12], but this is not true in general.
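Property (c) can be observed numerically on a toy model (our own construction, not from the paper): averaging more i.i.d. importance weights shrinks the relative variance of p̂_N, and the IWAE-style MCO tightens accordingly.

```python
import numpy as np

# Toy model (illustrative): z ~ N(0,1), x | z ~ N(z,1), exact p(x) = N(x; 0, 2),
# proposal q(z|x) = N(0,1). L_N = E[log((1/N) sum_i w_i)] is the MCO defined by
# the averaged-importance-weight estimator; it tightens as N grows.
rng = np.random.default_rng(1)

def log_normal(y, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

def mco(x, n_particles, n_reps=50_000):
    z = rng.normal(0.0, 1.0, size=(n_reps, n_particles))   # z^i ~ q(z|x), i.i.d.
    log_w = log_normal(x, z, 1.0)                          # log p(x,z) - log q(z|x)
    m = log_w.max(axis=1, keepdims=True)                   # numerically stable log-mean-exp
    log_p_hat = (m + np.log(np.exp(log_w - m).mean(axis=1, keepdims=True))).squeeze(1)
    return log_p_hat.mean()                                # estimates E[log p_hat_N]

x = 1.5
true_log_px = log_normal(x, 0.0, 2.0)
l1, l4, l16 = (mco(x, n) for n in (1, 4, 16))
print(l1, l4, l16, true_log_px)   # increasingly tight lower bounds on log p(x)
```

Here l1 is the ELBO itself, and the successive bounds climb toward log p(x) roughly at the rate that var(p̂_N/p) decays, as Proposition 1(c) predicts.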
The relative variance of estimators, var(p̂_N(x)/p(x)), tends to be well studied, so property (c) gives us a tool for comparing the convergence rate of distinct MCOs. For example, [14, 15] study marginal likelihood estimators defined by particle filters and find that the relative variance of these estimators scales favorably in comparison to naive importance sampling. This suggests that a particle filter's MCO, introduced in the next section, will generally be a tighter bound than IWAE.

Algorithm 1 Simulating L^FIVO_N(x_{1:T}, p, q)

1:  FIVO(x_{1:T}, p, q, N):
2:    {w^i_0}_{i=1}^N = {1/N}_{i=1}^N
3:    for t ∈ {1, . . . , T} do
4:      for i ∈ {1, . . . , N} do
5:        z^i_t ∼ q_t(z_t | x_{1:t}, z^i_{1:t−1})
6:        z^i_{1:t} = CONCAT(z^i_{1:t−1}, z^i_t)
7:      p̂_t = ∑_{i=1}^N w^i_{t−1} α_t(z^i_{1:t})
8:      p̂_N(x_{1:t}) = p̂_N(x_{1:t−1}) p̂_t
9:      {w^i_t}_{i=1}^N = {w^i_{t−1} α_t(z^i_{1:t}) / p̂_t}_{i=1}^N
10:     if resampling criteria satisfied by {w^i_t}_{i=1}^N then
11:       {w^i_t, z^i_{1:t}}_{i=1}^N = RSAMP({w^i_t, z^i_{1:t}}_{i=1}^N)
12:   return log p̂_N(x_{1:T})
13: RSAMP({w^i, z^i}_{i=1}^N):
14:   for i ∈ {1, . . . , N} do
15:     a ∼ Categorical({w^i}_{i=1}^N)
16:     y^i = z^a
17:   return {1/N, y^i}_{i=1}^N

4 Filtering Variational Objectives (FIVOs)

The filtering variational objectives (FIVOs) are a family of MCOs defined by the marginal likelihood estimator of a particle filter. For models with sequential structure, e.g., latent variable models of audio and text, the relative variance of a naive importance sampling estimator tends to scale exponentially in the number of steps. In contrast, the relative variance of particle filter estimators can scale more favorably with the number of steps, linearly in some cases [14, 15].
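As a concrete sketch, here is Alg. 1 specialized to a toy linear-Gaussian state space model of our own choosing (not the paper's experimental setup), with the bootstrap proposal q_t equal to the model transition and adaptive ESS-based resampling; a Kalman filter supplies the exact log p(x_{1:T}) for comparison.

```python
import numpy as np

# Alg. 1 on a toy linear-Gaussian model (our illustration):
#   z_1 ~ N(0,1),  z_t | z_{t-1} ~ N(z_{t-1}, 1),  x_t | z_t ~ N(z_t, 1).
# With the bootstrap proposal q_t = p(z_t | z_{t-1}), the incremental weight
# alpha_t reduces to the observation likelihood p(x_t | z_t).
rng = np.random.default_rng(2)

def log_normal(y, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

def fivo_estimate(x, n, ess_frac=0.5):
    """One run of the particle filter; returns log p_hat_N(x_{1:T})."""
    z = np.zeros(n)                               # z_0 placeholder; z_1 ~ N(0,1) below
    w = np.full(n, 1.0 / n)                       # normalized weights w_0^i
    log_p_hat = 0.0
    for x_t in x:
        z = z + rng.normal(size=n)                # propose z_t^i from the transition
        alpha = np.exp(log_normal(x_t, z, 1.0))   # incremental weights alpha_t
        p_hat_t = np.sum(w * alpha)
        log_p_hat += np.log(p_hat_t)              # accumulate log p_hat
        w = w * alpha / p_hat_t                   # reweight and renormalize
        if 1.0 / np.sum(w ** 2) < ess_frac * n:   # resample when ESS < N/2
            z = z[rng.choice(n, size=n, p=w)]
            w = np.full(n, 1.0 / n)
    return log_p_hat

def kalman_log_likelihood(x):
    """Exact log p(x_{1:T}) for the same model via Kalman filtering."""
    mean, var, total = 0.0, 1.0, 0.0
    first = True
    for x_t in x:
        if not first:
            var += 1.0                            # predict through the random walk
        first = False
        total += log_normal(x_t, mean, var + 1.0) # predictive density of x_t
        gain = var / (var + 1.0)                  # condition on x_t
        mean, var = mean + gain * (x_t - mean), var * (1.0 - gain)
    return total

x = np.array([0.1, -0.4, 1.2, 0.3, -0.2, 0.9, 1.5, 0.7, -0.1, 0.4])
est, exact = fivo_estimate(x, 2000), kalman_log_likelihood(x)
print(est, exact)   # one draw of log p_hat, close to (and in expectation below) log p(x)
```

With many particles the single-run estimate is already close to the exact log-likelihood; with few particles it scatters below it, which is exactly the MCO gap.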
Thus, the results of Section 3 suggest that FIVOs can serve as tighter objectives than IWAE for MLE in sequential models.

Let our observations be sequences of T X-valued random variables denoted x_{1:T}, where x_{i:j} ≡ (x_i, . . . , x_j). We also assume that the data generation process relies on a sequence of T unobserved Z-valued latent variables denoted z_{1:T}. We focus on sequential latent variable models that factor as a series of tractable conditionals,

p(x_{1:T}, z_{1:T}) = p_1(x_1, z_1) ∏_{t=2}^T p_t(x_t, z_t | x_{1:t−1}, z_{1:t−1}) .

A particle filter is a sequential Monte Carlo algorithm, which propagates a population of N weighted particles for T steps using a combination of importance sampling and resampling steps, see Alg. 1. In detail, the particle filter takes as arguments an observation x_{1:T}, the number of particles N, the model distribution p, and a variational posterior q(z_{1:T}|x_{1:T}) factored over t,

q(z_{1:T}|x_{1:T}) = ∏_{t=1}^T q_t(z_t | x_{1:t}, z_{1:t−1}) .   (6)

The particle filter maintains a population {w^i_{t−1}, z^i_{1:t−1}}_{i=1}^N of particles z^i_{1:t−1} with weights w^i_{t−1}. At step t, the filter independently proposes an extension z^i_t ∼ q_t(z_t | x_{1:t}, z^i_{1:t−1}) to each particle's trajectory z^i_{1:t−1}. The weights w^i_{t−1} are multiplied by the incremental importance weights,

α_t(z^i_{1:t}) = p_t(x_t, z^i_t | x_{1:t−1}, z^i_{1:t−1}) / q_t(z^i_t | x_{1:t}, z^i_{1:t−1}) ,   (7)

and renormalized. If the current weights w^i_t satisfy a resampling criterion, then a resampling step is performed and N particles z^i_{1:t} are sampled in proportion to their weights from the current population with replacement. Common resampling schemes include resampling at every step and resampling if the effective sample size (ESS) of the population, (∑_{i=1}^N (w^i_t)²)^{−1}, drops below N/2 [22]. After resampling the weights are reset to 1. Otherwise, the particles z^i_{1:t} are copied to the next step along with the accumulated weights. See Fig. 1 for a visualization.

Figure 1: Visualizing FIVO; (Left) Resample from particle trajectories to determine inheritance in next step, (middle) propose with q_t and accumulate loss log p̂_t, (right) gradients (in the reparameterized case) flow through the lattice, objective gradients in solid red and resampling gradients in dotted blue.

Instead of viewing Alg. 1 as an inference algorithm, we treat the quantity E[log p̂_N(x_{1:T})] as an objective function over p. Because p̂_N(x_{1:T}) is an unbiased estimator of p(x_{1:T}), proven in the Appendix and in [23, 24, 25, 26], it defines an MCO, which we call FIVO:

Definition 2. Filtering Variational Objectives. Let log p̂_N(x_{1:T}) be the output of Alg. 1 with inputs (x_{1:T}, p, q, N); then L^FIVO_N(x_{1:T}, p, q) = E[log p̂_N(x_{1:T})] is a filtering variational objective.

p̂_N(x_{1:T}) is a strongly consistent estimator [23, 24]. So if log p̂_N(x_{1:T}) is uniformly integrable, then L^FIVO_N(x_{1:T}, p, q) → log p(x_{1:T}) as N → ∞. Resampling is the distinguishing feature of L^FIVO_N; if resampling is removed, then FIVO reduces to IWAE. Resampling does add an amount of immediate variance, but it allows the filter to discard low weight particles with high probability. This has the effect of refocusing the distribution of particles to regions of higher mass under the posterior, and in some sequential models can reduce the variance from exponential to linear in the number of time steps [14, 15]. Resampling is a greedy process, and it is possible that a particle discarded at step t could have attained a high mass at step T. In practice, the best trade-off is to use adaptive resampling schemes [22]. If for a given x_{1:T}, p, q a particle filter's likelihood estimator improves over simple importance sampling in terms of variance, we expect L^FIVO_N to be a tighter bound than L or L^IWAE_N.

4.1 Optimization

The FIVO bound can be optimized with the same stochastic gradient ascent framework used for the ELBO. We found in practice it was effective simply to follow a Monte Carlo estimator of the biased gradient E[∇_{(θ,φ)} log p̂_N(x_{1:T})] with reparameterized z^i_t.
This gradient estimator is biased, as the full FIVO gradient has three kinds of terms: it has the term E[∇_{(θ,φ)} log p̂_N(x_{1:T})], where ∇_{(θ,φ)} log p̂_N(x_{1:T}) is defined conditional on the random variables of Alg. 1; it has gradient terms for every distribution of Alg. 1 that depends on the parameters; and, if adaptive resampling is used, then it has additional terms that account for the change in FIVO with respect to the decision to resample. In this section, we derive the FIVO gradient when the z^i_t are reparameterized and a fixed resampling schedule is followed. We derive the full gradient in the Appendix.

In more detail, we assume that p and q are parameterized in a differentiable way by θ and φ. Assume that q is from a reparameterizable family and that the z^i_t of Alg. 1 are reparameterized. Assume that we use a fixed resampling schedule, and let I(resampling at step t) be an indicator function indicating whether a resampling occurred at step t. Now, L^FIVO_N depends on the parameters via log p̂_N(x_{1:T}) and the resampling probabilities w^i_t in the density. Thus,

∇_{(θ,φ)} L^FIVO_N = E[ ∇_{(θ,φ)} log p̂_N(x_{1:T}) + ∑_{t=1}^T ∑_{i=1}^N I(resampling at step t) log( p̂_N(x_{1:T}) / p̂_N(x_{1:t}) ) ∇_{(θ,φ)} log w^i_t ] .   (8)

Given a single forward pass of Alg. 1 with reparameterized z^i_t, the terms inside the expectation form a Monte Carlo estimator of Eq. (8). However, the terms from resampling events contribute to the majority of the variance of the estimator. Thus, the gradient estimator that we found most effective in practice consists only of the gradient ∇_{(θ,φ)} log p̂_N(x_{1:T}), the solid red arrows of Figure 1.
We explore this experimentally in Section 6.3.

4.2 Sharpness

As with the ELBO, FIVO is a variational objective taking a variational posterior q as an argument. An important question is whether FIVO achieves the marginal log-likelihood at its optimal q. We can only guarantee this for models in which z_{1:t−1} and x_t are independent given x_{1:t−1}.

Proposition 2. Sharpness of Filtering Variational Objectives. Let L^FIVO_N(x_{1:T}, p, q) be a FIVO, and q*(x_{1:T}, p) = argmax_q L^FIVO_N(x_{1:T}, p, q). If p has independence structure such that p(z_{1:t−1}|x_{1:t}) = p(z_{1:t−1}|x_{1:t−1}) for t ∈ {2, . . . , T}, then

q*(x_{1:T}, p)(z_{1:T}) = p(z_{1:T}|x_{1:T}) and L^FIVO_N(x_{1:T}, p, q*(x_{1:T}, p)) = log p(x_{1:T}) .

Proof. See Appendix.

Most models do not satisfy this assumption, and deriving the optimal q in general is complicated by the resampling dynamics. For the restricted model class in Proposition 2, the optimal q_t does not condition on future observations x_{t+1:T}. We explored this experimentally with richer models in Section 6.4, and found that allowing q_t to condition on x_{t+1:T} does not reliably improve FIVO. This is consistent with the view of resampling as a greedy process that responds to each intermediate distribution as if it were the final. Still, we found that the impact of this effect was outweighed by the advantage of optimizing a tighter bound.

5 Related Work

The marginal log-likelihood is a central quantity in statistics and probability, and there has long been an interest in bounding it [27]. The literature relating to the bounds we call Monte Carlo objectives has typically focused on the problem of estimating the marginal likelihood itself. [28, 29] use Jensen's inequality in a forward and reverse estimator to detect the failure of inference methods. IWAE [12] is a clear influence on this work, and FIVO can be seen as an extension of this bound.
The ELBO enjoys a long history [8] and there have been efforts to improve the ELBO itself. [30] generalize the ELBO by considering arbitrary operators of the model and variational posterior. More closely related to this work is a body of work improving the ELBO by increasing the expressiveness of the variational posterior. For example, [31, 32] augment the variational posterior with deterministic transformations with fixed Jacobians, and [33] extend the variational posterior to admit a Markov chain.

Other approaches to learning in neural latent variable models include [34], who use importance sampling to approximate gradients under the posterior, and [35], who use sequential Monte Carlo to approximate gradients under the posterior. These are distinct from our contribution in the sense that for them inference for the sake of estimation is the ultimate goal. To our knowledge, the idea of treating the output of inference as an objective in and of itself, while not completely novel, has not been fully appreciated in the literature, although it shares inspiration with methods that optimize the convergence of Markov chains [36].

We note that the idea to optimize the log estimator of a particle filter was independently and concurrently considered in [37, 38]. In [37] the bound we call FIVO is cast as a tractable lower bound on the ELBO defined by the particle filter's non-parametric approximation to the posterior. [38] additionally derive an expression for FIVO's bias as the KL between the filter's distribution and a certain target process.
Our work is distinguished by our study of the convergence of MCOs in N, which includes FIVO, our investigation of FIVO sharpness, and our experimental results on stochastic RNNs.

6 Experiments

In our experiments, we sought to: (a) compare models trained with ELBO, IWAE, and FIVO bounds in terms of final test log-likelihoods, (b) explore the effect of the resampling gradient terms on FIVO, (c) investigate how the lack of sharpness affects FIVO, and (d) consider how models trained with FIVO use the stochastic state. To explore these questions, we trained variational recurrent neural networks (VRNN) [39] with the ELBO, IWAE, and FIVO bounds using TensorFlow [40] on two benchmark sequential modeling tasks: natural speech waveforms and polyphonic music. These datasets are known to be difficult to model without stochastic latent states [41].

The VRNN is a sequential latent variable model that combines a deterministic recurrent neural network (RNN) with stochastic latent states z_t at each step. The observation distribution over x_t is conditioned directly on z_t and indirectly on z_{1:t−1} via the RNN's state h_t(z_{t−1}, x_{t−1}, h_{t−1}). For a length T sequence, the model's posterior factors into the conditionals ∏_{t=1}^T p_t(z_t | h_t(z_{t−1}, x_{t−1}, h_{t−1})) g_t(x_t | z_t, h_t(z_{t−1}, x_{t−1}, h_{t−1})), and the variational posterior factors as ∏_{t=1}^T q_t(z_t | h_t(z_{t−1}, x_{t−1}, h_{t−1}), x_t). All distributions over latent variables are factorized Gaussians, and the output distributions g_t depend on the dataset. The RNN is a single-layer LSTM and the conditionals are parameterized by fully connected neural networks with one hidden layer of the same size as the LSTM hidden layer. We used the residual parameterization [41] for the variational posterior.

N  Bound | Nottingham |  JSB  | MuseData | Piano-midi.de | TIMIT 64 units | TIMIT 256 units
4  ELBO  |   -3.00    | -8.60 |  -7.15   |    -7.81      |       0        |     10,438
4  IWAE  |   -2.75    | -7.86 |  -7.20   |    -7.86      |     -160       |     11,054
4  FIVO  |   -2.68    | -6.90 |  -6.20   |    -7.76      |     5,691      |     17,822
8  ELBO  |   -3.01    | -8.61 |  -7.19   |    -7.83      |     2,771      |      9,819
8  IWAE  |   -2.90    | -7.40 |  -7.15   |    -7.84      |     3,977      |     11,623
8  FIVO  |   -2.77    | -6.79 |  -6.12   |    -7.45      |     6,023      |     21,449
16 ELBO  |   -3.02    | -8.63 |  -7.18   |    -7.85      |     1,676      |      9,918
16 IWAE  |   -2.85    | -7.41 |  -7.13   |    -7.79      |     3,236      |     13,069
16 FIVO  |   -2.58    | -6.72 |  -5.89   |    -7.43      |     8,630      |     21,536

Table 1: Test set marginal log-likelihood bounds for models trained with ELBO, IWAE, and FIVO. For ELBO and IWAE models, we report max{L, L^IWAE_128, L^FIVO_128}. For FIVO models, we report L^FIVO_128. Pianoroll results are in nats per timestep, TIMIT results are in nats per sequence relative to ELBO with N = 4. For details on our evaluation methodology and absolute numbers see the Appendix.

For FIVO we resampled when the ESS of the particles dropped below N/2. For FIVO and IWAE we used a batch size of 4, and for the ELBO, we used batch sizes of 4N to match computational budgets (resampling is O(N) with the alias method). For all models we report bounds using the variational posterior trained jointly with the model. For models trained with FIVO we report L^FIVO_128. To provide strong baselines, we report the maximum across bounds, max{L, L^IWAE_128, L^FIVO_128}, for models trained with ELBO and IWAE. Additional details in the Appendix.

6.1 Polyphonic Music

We evaluated VRNNs trained with the ELBO, IWAE, and FIVO bounds on 4 polyphonic music datasets: the Nottingham folk tunes, the JSB chorales, the MuseData library of classical piano and orchestral music, and the Piano-midi.de MIDI archive [42].
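Stepping back to the model definition for a moment: one VRNN step can be sketched at shape level as follows. All layer sizes, the parameter names, and the single tanh layer standing in for the LSTM are our own illustrative simplifications, not the trained architecture.

```python
import numpy as np

# Shape-level sketch of one VRNN step (our simplification: a tanh layer
# stands in for the LSTM, and all weights are random stand-ins).
rng = np.random.default_rng(3)
D_X, D_Z, D_H = 88, 32, 32   # observation, latent, and RNN state sizes (JSB-style)

params = {k: rng.normal(scale=0.1, size=s) for k, s in {
    "rnn_w": (D_X + D_Z + D_H, D_H), "rnn_b": (D_H,),
    "prior_w": (D_H, 2 * D_Z), "prior_b": (2 * D_Z,),
    "q_w": (D_H + D_X, 2 * D_Z), "q_b": (2 * D_Z,),
}.items()}

def vrnn_step(h_prev, z_prev, x_prev, x_t):
    # Deterministic recurrence h_t(z_{t-1}, x_{t-1}, h_{t-1}).
    h_t = np.tanh(np.concatenate([x_prev, z_prev, h_prev]) @ params["rnn_w"]
                  + params["rnn_b"])
    # Prior p_t(z_t | h_t) and variational posterior q_t(z_t | h_t, x_t),
    # both factorized Gaussians; q additionally conditions on the current x_t.
    prior = h_t @ params["prior_w"] + params["prior_b"]
    mu_p, log_var_p = prior[:D_Z], prior[D_Z:]
    post = np.concatenate([h_t, x_t]) @ params["q_w"] + params["q_b"]
    mu_q, log_var_q = post[:D_Z], post[D_Z:]
    # Reparameterized sample z_t = mu_q + sigma_q * eps.
    z_t = mu_q + np.exp(0.5 * log_var_q) * rng.normal(size=D_Z)
    return h_t, z_t, (mu_p, log_var_p), (mu_q, log_var_q)

h0, z0 = np.zeros(D_H), np.zeros(D_Z)
x_prev, x_t = np.zeros(D_X), np.ones(D_X)
h1, z1, prior1, post1 = vrnn_step(h0, z0, x_prev, x_t)
print(h1.shape, z1.shape)
```

Unrolling this step over a sequence, with the prior/posterior pairs feeding Eq. (7)-style incremental weights, is how the VRNN plugs into the FIVO machinery.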
Each dataset is split into standard train,\nvalid, and test sets and is represented as a sequence of 88-dimensional binary vectors denoting the\nnotes active at the current timestep. We mean-centered the input data and modeled the output as a set\nof 88 factorized Bernoulli variables. We used 64 units for the RNN hidden state and latent state size\nfor all polyphonic music models except for JSB chorales models, which used 32 units. We report\nbounds on average log-likelihood per timestep in Table 1. Models trained with the FIVO bound\nsigni\ufb01cantly outperformed models trained with either the ELBO or the IWAE bounds on all four\ndatasets. In some cases, the improvements exceeded 1 nat per timestep, and in all cases optimizing\nFIVO with N = 4 outperformed optimizing IWAE or ELBO for N = {4, 8, 16}.\n\n6.2 Speech\n\nThe TIMIT dataset is a standard benchmark for sequential models that contains 6300 utterances\nwith an average duration of 3.1 seconds spoken by 630 different speakers. The 6300 utterances are\ndivided into a training set of size 4620 and a test set of size 1680. We further divided the training\nset into a validation set of size 231 and a training set of size 4389, with the splits exactly as in\n[41]. Each TIMIT utterance is represented as a sequence of real-valued amplitudes which we split\ninto a sequence of 200-dimensional frames, as in [39, 41]. Data preprocessing was limited to mean\ncentering and variance normalization as in [41]. For TIMIT, the output distribution was a factorized\nGaussian, and we report the average log-likelihood bound per sequence relative to models trained\nwith ELBO. Again, models trained with FIVO signi\ufb01cantly outperformed models trained with IWAE\nor ELBO, see Table 1.\n\n6.3 Resampling Gradients\n\nAll models in this work (except those in this section) were trained with gradients that did not include\nthe term in Eq. (8) that comes from resampling steps. 
We omitted this term because it has an outsized effect on gradient variance, often increasing it by 6 orders of magnitude. To explore the effects of this term experimentally, we trained VRNNs with and without the resampling gradient term on the TIMIT and polyphonic music datasets. When using the resampling term, we attempted to control its variance using a moving-average baseline linear in the number of timesteps.

Figure 2: (Left) Graph of L^FIVO_128 over training comparing models trained with and without the resampling gradient terms on TIMIT with N = 4. (Right) KL divergence from q(z_{1:T}|x_{1:T}) to p(z_{1:T}) for models trained on the JSB chorales with N = 16.

Bound   | Nottingham |  JSB  | MuseData | Piano-midi.de | TIMIT
ELBO    |   -2.40    | -5.48 |  -6.54   |    -6.68      |     0
ELBO+s  |   -2.59    | -5.53 |  -6.48   |    -6.77      |  -925
IWAE    |   -2.52    | -5.77 |  -6.54   |    -6.74      | 1,469
IWAE+s  |   -2.37    | -4.63 |  -6.47   |    -6.74      | 2,630
FIVO    |   -2.29    | -4.08 |  -5.80   |    -6.41      | 6,991
FIVO+s  |   -2.34    | -3.83 |  -5.87   |    -6.34      | 9,773

Table 2: Train set marginal log-likelihood bounds for models comparing smoothing (+s) and non-smoothing variational posteriors. We report max{L, L^IWAE_128, L^FIVO_128} for ELBO and IWAE models and L^FIVO_128 for FIVO models. All models were trained with N = 4. Pianoroll results are in nats per timestep, TIMIT results are in nats per sequence relative to non-smoothing ELBO. For details on our evaluation methodology and absolute numbers see the Appendix.

For all datasets, models trained without the resampling gradient term outperformed models trained with the term by a large margin on both the training set and held-out data. Many runs with resampling gradients failed to improve beyond random initialization. A representative pair of train log-likelihood curves is shown in Figure 2; gradients without the resampling term led to earlier convergence and a better solution.
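The moving-average baseline mentioned above is a standard control-variate device for score-function terms. The following generic toy (our own, unrelated to the paper's exact estimator) shows why subtracting a baseline leaves the gradient estimate unbiased while cutting its variance.

```python
import numpy as np

# Estimate d/dmu E_{z ~ N(mu, 1)}[f(z)] with the score-function estimator
#   g = f(z) * d/dmu log N(z; mu, 1) = f(z) * (z - mu),
# with and without subtracting a baseline b ~ E[f(z)]. Both are unbiased
# (the score has mean zero); only their variances differ.
rng = np.random.default_rng(4)
mu = 3.0
f = lambda z: z ** 2          # E[f] = mu^2 + 1, so the true gradient is 2 * mu = 6

z = rng.normal(mu, 1.0, size=100_000)
score = z - mu
grad_plain = f(z) * score                 # no baseline
baseline = f(z).mean()                    # stand-in for a moving average of f
grad_base = (f(z) - baseline) * score     # baseline-corrected estimator

print(grad_plain.mean(), grad_base.mean())   # both near the true gradient 6
print(grad_plain.var(), grad_base.var())     # baseline version is much less noisy
```

In FIVO's case the analogous quantity being baselined is the return-like factor multiplying each resampling score term; even so, the paper reports that the corrected estimator remained too noisy to help.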
We stress that this is an empirical result: in principle, biased gradients can lead to divergent behaviour. We leave exploring strategies to reduce the variance of the unbiased estimator to future work.

6.4 Sharpness

FIVO does not achieve the marginal log-likelihood at its optimal variational posterior q*, because the optimal q* does not condition on future observations (see Section 4.2). In contrast, ELBO and IWAE are sharp, and their optimal q*s do depend on future observations. To investigate the effects of this, we defined a smoothing variant of the VRNN in which q takes as additional input the hidden state of a deterministic RNN run backwards over the observations, allowing q to condition on future observations. We trained smoothing VRNNs using ELBO, IWAE, and FIVO, and report evaluation on the training set (to isolate the effect on optimization performance) in Table 2. Smoothing helped models trained with IWAE, but not enough to outperform models trained with FIVO. As expected, smoothing did not reliably improve models trained with FIVO. Test set performance was similar; see the Appendix for details.

6.5 Use of Stochastic State

A known pathology when training stochastic latent variable models with the ELBO is that the stochastic states can go unused. Empirically, this is associated with the collapse of the variational posterior q(z | x) to the model prior p(z) [43]. To investigate this, we plot the KL divergence from q(z_{1:T} | x_{1:T}) to p(z_{1:T}) averaged over the dataset (Figure 2).
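A sequence-level diagnostic of this kind is typically accumulated from per-timestep KL terms, which have a standard closed form when the posterior and prior are diagonal Gaussians (the diagonal-Gaussian parameterization is an assumption of this sketch, not something restated here).

```python
import numpy as np

def diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag exp(logvar_q)) || N(mu_p, diag exp(logvar_p)) ),
    summed over latent dimensions (standard closed form)."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

# With a collapsed posterior (q == p) every per-timestep term is zero;
# a posterior that stays distinct from the prior gives a positive KL.
mu = np.zeros(32)
lv = np.zeros(32)
collapsed = diag_gaussian_kl(mu, lv, mu, lv)
healthy = diag_gaussian_kl(mu + 1.0, lv, mu, lv)
```

A KL curve that stays near zero throughout training therefore signals that the stochastic state is going unused.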
Indeed, the KL of models trained with ELBO collapsed during training, whereas the KL of models trained with FIVO remained high, even while achieving a higher log-likelihood bound.

7 Conclusions

We introduced the family of filtering variational objectives, a class of lower bounds on the log marginal likelihood that extend the evidence lower bound. FIVOs are suited for MLE in neural latent variable models. We trained models with the ELBO, IWAE, and FIVO bounds and found that the models trained with FIVO significantly outperformed other models across four polyphonic music modeling tasks and a speech waveform modeling task. Future work will include exploring control variates for the resampling gradients, FIVOs defined by more sophisticated filtering algorithms, and new MCOs based on differentiable operators like leapfrog operators with deterministically annealed temperatures. In general, we hope that this paper inspires the machine learning community to take a fresh look at the literature of marginal likelihood estimators, seeing them as objectives instead of algorithms for inference.

Acknowledgments

We thank Matt Hoffman, Matt Johnson, Danilo J. Rezende, Jascha Sohl-Dickstein, and Theophane Weber for helpful discussions and support in this project. A. Doucet was partially supported by the EPSRC grant EP/K000276/1. Y. W. Teh's research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071.

References

[1] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. ICLR, 2014.

[2] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra.
Stochastic backpropagation and approximate inference in deep generative models. ICML, 2014.

[3] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

[5] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. arXiv preprint arXiv:1606.00709, 2016.

[6] Dustin Tran, Rajesh Ranganath, and David M Blei. Deep and hierarchical implicit models. arXiv preprint arXiv:1702.08896, 2017.

[7] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[8] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.

[9] Matthew J. Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, University College London, 2003.

[10] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In AISTATS, 2014.

[11] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M Blei. Automatic differentiation variational inference. arXiv preprint arXiv:1603.00788, 2016.

[12] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. ICLR, 2016.

[13] Andriy Mnih and Danilo J Rezende. Variational inference for Monte Carlo objectives. arXiv preprint arXiv:1602.06725, 2016.

[14] Frédéric Cérou, Pierre Del Moral, and Arnaud Guyader. A nonasymptotic theorem for unnormalized Feynman-Kac particle models. Ann. Inst. H. Poincaré B, 47(3):629-649, 2011.

[15] Jean Bérard, Pierre Del Moral, and Arnaud Doucet.
A lognormal central limit theorem for\n\nparticle approximations of normalizing constants. Electron. J. Probab., 19(94):1\u201328, 2014.\n\n[16] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete\n\ndata via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol., pages 1\u201338, 1977.\n\n[17] CF Jeff Wu. On the convergence properties of the EM algorithm. Ann. Stat., pages 95\u2013103,\n\n1983.\n\n[18] Radford M Neal and Geoffrey E Hinton. A view of the EM algorithm that justi\ufb01es incremental,\n\nsparse, and other variants. In Learning in graphical models, pages 355\u2013368. Springer, 1998.\n\n[19] Matthew D Hoffman, David M Blei, Chong Wang, and John William Paisley. Stochastic\n\nvariational inference. Journal of Machine Learning Research, 14(1):1303\u20131347, 2013.\n\n[20] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks.\n\narXiv preprint arXiv:1402.0030, 2014.\n\n[21] Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.\n[22] Arnaud Doucet and Adam M. Johansen. A tutorial on particle \ufb01ltering and smoothing: Fifteen\nyears later. In D. Crisan and B. Rozovsky, editors, The Oxford Handbook of Nonlinear Filtering,\npages 656\u2013704. Oxford University Press, 2011.\n\n[23] Pierre Del Moral. Feynman-Kac formulae: genealogical and interacting particle systems with\n\napplications. Springer Verlag, 2004.\n\n[24] Pierre Del Moral. Mean \ufb01eld simulation for Monte Carlo integration. CRC Press, 2013.\n[25] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte\n\nCarlo methods. J. R. Stat. Soc. Ser. B Stat. Methodol., 72(3):269\u2013342, 2010.\n\n[26] Michael K Pitt, Ralph dos Santos Silva, Paolo Giordani, and Robert Kohn. On some properties\nof Markov chain Monte Carlo simulation methods based on the particle \ufb01lter. J. 
Econometrics,\n171(2):134\u2013151, 2012.\n\n[27] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and\n\nvariational inference. Foundations and Trends in Machine Learning, 1(1\u20132):1\u2013305, 2008.\n\n[28] Roger B Grosse, Zoubin Ghahramani, and Ryan P Adams. Sandwiching the marginal likelihood\n\nusing bidirectional Monte Carlo. arXiv preprint arXiv:1511.02543, 2015.\n\n[29] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Accurate and conservative estimates of\n\nMRF log-likelihood using reverse annealing. In AISTATS, 2015.\n\n[30] Rajesh Ranganath, Dustin Tran, Jaan Altosaar, and David Blei. Operator variational inference.\n\nIn NIPS, 2016.\n\n[31] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing \ufb02ows.\n\nICML, 2015.\n\n[32] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling.\n\nImproved variational inference with inverse autoregressive \ufb02ow. In NIPS, 2016.\n\n[33] Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational\n\ninference: Bridging the gap. In ICML, 2015.\n\n[34] J\u00f6rg Bornschein and Yoshua Bengio. Reweighted wake-sleep. ICLR, 2015.\n[35] Shixiang Gu, Zoubin Ghahramani, and Richard E Turner. Neural adaptive sequential Monte\n\nCarlo. In NIPS, 2015.\n\n[36] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-\n\nencoders as generative models. In NIPS, 2013.\n\n[37] Christian A Naesseth, Scott W Linderman, Rajesh Ranganath, and David M Blei. Variational\n\nsequential Monte Carlo. arXiv preprint arXiv:1705.11140, 2017.\n\n[38] Tuan Anh Le, Maximilian Igl, Tom Jin, Tom Rainforth, and Frank Wood. Auto-encoding\n\nsequential Monte Carlo. arXiv preprint arXiv:1705.10306, 2017.\n\n[39] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua\n\nBengio. A recurrent latent variable model for sequential data. 
In NIPS, 2015.

[40] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[41] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In NIPS, 2016.

[42] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. ICML, 2012.

[43] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.