{"title": "Neural Adaptive Sequential Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 2629, "page_last": 2637, "abstract": "Sequential Monte Carlo (SMC), or particle filtering, is a popular class of methods for sampling from an intractable target distribution using a sequence of simpler intermediate distributions. Like other importance sampling-based methods, performance is critically dependent on the proposal distribution: a bad proposal can lead to arbitrarily inaccurate estimates of the target distribution. This paper presents a new method for automatically adapting the proposal using an approximation of the Kullback-Leibler divergence between the true posterior and the proposal distribution. The method is very flexible, applicable to any parameterized proposal distribution and it supports online and batch variants. We use the new framework to adapt powerful proposal distributions with rich parameterizations based upon neural networks leading to Neural Adaptive Sequential Monte Carlo (NASMC). Experiments indicate that NASMC significantly improves inference in a non-linear state space model outperforming adaptive proposal methods including the Extended Kalman and Unscented Particle Filters. Experiments also indicate that improved inference translates into improved parameter learning when NASMC is used as a subroutine of Particle Marginal Metropolis Hastings. Finally we show that NASMC is able to train a latent variable recurrent neural network (LV-RNN) achieving results that compete with the state-of-the-art for polymorphic music modelling. NASMC can be seen as bridging the gap between adaptive SMC methods and the recent work in scalable, black-box variational inference.", "full_text": "Neural Adaptive Sequential Monte Carlo\n\nRichard E. 
Turner\u2020\nShixiang Gu\u2020\u2021\n\u2020 University of Cambridge, Department of Engineering, Cambridge UK\n\nZoubin Ghahramani\u2020\n\n\u2021 MPI for Intelligent Systems, T\u00a8ubingen, Germany\n\nsg717@cam.ac.uk, zoubin@eng.cam.ac.uk, ret26@cam.ac.uk\n\nAbstract\n\nSequential Monte Carlo (SMC), or particle \ufb01ltering, is a popular class of meth-\nods for sampling from an intractable target distribution using a sequence of sim-\npler intermediate distributions. Like other importance sampling-based methods,\nperformance is critically dependent on the proposal distribution: a bad proposal\ncan lead to arbitrarily inaccurate estimates of the target distribution. This paper\npresents a new method for automatically adapting the proposal using an approx-\nimation of the Kullback-Leibler divergence between the true posterior and the\nproposal distribution. The method is very \ufb02exible, applicable to any parameter-\nized proposal distribution and it supports online and batch variants. We use the\nnew framework to adapt powerful proposal distributions with rich parameteriza-\ntions based upon neural networks leading to Neural Adaptive Sequential Monte\nCarlo (NASMC). Experiments indicate that NASMC signi\ufb01cantly improves infer-\nence in a non-linear state space model outperforming adaptive proposal methods\nincluding the Extended Kalman and Unscented Particle Filters. Experiments also\nindicate that improved inference translates into improved parameter learning when\nNASMC is used as a subroutine of Particle Marginal Metropolis Hastings. Finally\nwe show that NASMC is able to train a latent variable recurrent neural network\n(LV-RNN) achieving results that compete with the state-of-the-art for polymor-\nphic music modelling. 
NASMC can be seen as bridging the gap between adaptive SMC methods and the recent work in scalable, black-box variational inference.\n\n
1 Introduction\n\n
Sequential Monte Carlo (SMC) is a class of algorithms that draw samples from a target distribution of interest by sampling from a series of simpler intermediate distributions. More specifically, the sequence constructs a proposal for importance sampling (IS) [1, 2]. SMC is particularly well-suited for performing inference in non-linear dynamical models with hidden variables, since filtering naturally decomposes into a sequence, and in many such cases it is the state-of-the-art inference method [2, 3]. Generally speaking, inference methods can be used as modules in parameter learning systems. SMC has been used in such a way for both approximate maximum-likelihood parameter learning [4] and in Bayesian approaches such as the recently developed Particle MCMC methods [3].\n\n
Critically, in common with any importance sampling method, the performance of SMC is strongly dependent on the choice of the proposal distribution. If the proposal is not well-matched to the target distribution, then the method can produce samples that have low effective sample size, and this leads to Monte Carlo estimates that have pathologically high variance [1]. The SMC community has developed approaches to mitigate these limitations, such as resampling to improve particle diversity when the effective sample size is low [1] and applying MCMC transition kernels to improve particle diversity [5, 2, 3]. A complementary line of research leverages distributional approximate inference methods, such as the extended Kalman filter and unscented Kalman filter, to construct better proposals, leading to the Extended Kalman Particle Filter (EKPF) and Unscented Particle Filter (UPF) [5]. 
In general, however, the construction of good proposal distributions is still an open question that severely limits the applicability of SMC methods.\n\n
This paper proposes a new gradient-based black-box adaptive SMC method that automatically tunes flexible proposal distributions. The quality of a proposal distribution can be assessed using the (intractable) Kullback-Leibler (KL) divergence between the target distribution and the parametrized proposal distribution. We approximate the derivatives of this objective using samples derived from SMC. The framework is very general and tractably handles complex parametric proposal distributions. For example, here we use neural networks to carry out the parameterization, thereby leveraging the large literature and efficient computational tools developed by this community. We demonstrate that the method can efficiently learn good proposal distributions that significantly outperform existing adaptive proposal methods, including the EKPF and UPF, on standard benchmark models used in the particle filter community. We show that improved performance of the SMC algorithm translates into improved mixing of Particle Marginal Metropolis-Hastings (PMMH) [3]. Finally, we show that the method allows higher-dimensional and more complicated models, such as those parametrized using neural networks (NN), to be accurately handled using SMC, which is challenging for traditional particle filtering methods.\n\n
The focus of this work is on improving SMC, but many of the ideas are inspired by the burgeoning literature on approximate inference for unsupervised neural network models. These connections are explored in section 6.\n\n
2 Sequential Monte Carlo\n\n
We begin by briefly reviewing two fundamental SMC algorithms, sequential importance sampling (SIS) and sequential importance resampling (SIR). Consider a probabilistic model comprising (possibly multi-dimensional) hidden and observed states z_{1:T} and x_{1:T} respectively, whose joint distribution factorizes as p(z_{1:T}, x_{1:T}) = p(z_1) p(x_1 | z_1) \prod_{t=2}^{T} p(z_t | z_{1:t-1}) p(x_t | z_{1:t}, x_{1:t-1}). This general form subsumes common state-space models, such as Hidden Markov Models (HMMs), as well as non-Markovian models for the hidden state, such as Gaussian processes.\n\n
The goal of the sequential importance sampler is to approximate the posterior distribution over the hidden state sequence, p(z_{1:T} | x_{1:T}) \approx \sum_{n=1}^{N} \tilde{w}^{(n)}_T \delta(z_{1:T} - z^{(n)}_{1:T}), through a weighted set of N sampled trajectories drawn from a simpler proposal distribution, {z^{(n)}_{1:T}}_{n=1:N} \sim q(z_{1:T} | x_{1:T}). Any form of proposal distribution can be used in principle, but a particularly convenient one takes the same factorisation as the true posterior, q(z_{1:T} | x_{1:T}) = q(z_1 | x_1) \prod_{t=2}^{T} q(z_t | z_{1:t-1}, x_{1:t}), with filtering dependence on x. A short derivation (see supplementary material) then shows that the normalized importance weights are defined by a recursion:\n\n
w(z^{(n)}_{1:T}) = p(z^{(n)}_{1:T}, x_{1:T}) / q(z^{(n)}_{1:T} | x_{1:T}),   \tilde{w}(z^{(n)}_{1:T}) = w(z^{(n)}_{1:T}) / \sum_n w(z^{(n)}_{1:T}),\n\n
w(z^{(n)}_{1:T}) \propto \tilde{w}(z^{(n)}_{1:T-1}) p(z^{(n)}_T | z^{(n)}_{1:T-1}) p(x_T | z^{(n)}_{1:T}, x_{1:T-1}) / q(z^{(n)}_T | z^{(n)}_{1:T-1}, x_{1:T}).\n\n
SIS is elegant as the samples and weights can be computed in sequential fashion using a single forward pass. However, na\u00efve implementation suffers from a severe pathology: the distribution of importance weights often becomes highly skewed as t increases, with many samples attaining very low weight. 
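As a concrete illustration of the SIS recursion above, the following minimal sketch runs sequential importance sampling on a toy linear-Gaussian model with a bootstrap proposal, so the prior terms cancel from the weight update and the incremental weight is just the likelihood (all names and the toy model are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sequential importance sampling for a toy model:
#   z_t ~ N(0.9 z_{t-1}, sig_v^2),  x_t ~ N(z_t, sig_w^2),
# with the bootstrap proposal q(z_t | z_{1:t-1}, x_{1:t}) = p(z_t | z_{t-1}).
def sis(x, N=1000, sig_v=1.0, sig_w=0.5):
    T = len(x)
    z = np.zeros((T, N))
    logw = np.zeros(N)
    z[0] = rng.normal(0.0, sig_v, N)                 # draw z_1 from the prior
    logw += -0.5 * ((x[0] - z[0]) / sig_w) ** 2      # log p(x_1 | z_1) up to a constant
    for t in range(1, T):
        z[t] = rng.normal(0.9 * z[t - 1], sig_v)     # propagate each particle
        logw += -0.5 * ((x[t] - z[t]) / sig_w) ** 2  # accumulate incremental weights
    w = np.exp(logw - logw.max())                    # normalize stably in log space
    return w / w.sum(), z

w_tilde, z = sis(np.array([0.1, 0.2, -0.1]))
```

Normalizing in log space avoids underflow; as the text notes, without resampling the normalized weights become increasingly skewed as T grows.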
To alleviate the problem, the Sequential Importance Resampling (SIR) algorithm [1] adds an additional step that resamples z^{(n)}_t at time t from a multinomial distribution given by \tilde{w}(z^{(n)}_{1:t}) and gives the new particles equal weight (more advanced implementations resample only when the effective sample size falls below a threshold [2]). This replaces degenerated particles that have low weight with samples that have more substantial importance weights, without violating the validity of the method. SIR requires knowledge of the full trajectory of previous samples at each stage to draw the samples and compute the importance weights. For this reason, when carrying out resampling, each new particle needs to update its ancestry information. Letting a^{(n)}_{\tau,t} represent the ancestral index of particle n at time t for state z_\tau, where 1 \le \tau \le t, and collecting these into the set A^{(n)}_t = {a^{(n)}_{1,t}, ..., a^{(n)}_{t,t}}, where a^{(i)}_{\tau-1,t} = a^{(a^{(i)}_{\tau,t})}_{\tau-1,\tau-1}, the resampled trajectory can be denoted z^{A^{(i)}_t}_{1:t} = {z^{(a^{(i)}_{1,t})}_1, ..., z^{(a^{(i)}_{t,t})}_t}. Finally, to lighten notation, we use the shorthand z^{(n)}_{1:t} = {z^{A^{(n)}_{t-1}}_{1:t-1}, z^{(n)}_t} and w^{(n)}_t = w(z^{(n)}_{1:t}) for the weights. Note that, when employing resampling, these do not depend on the previous weights w^{(n)}_{t-1}, since resampling has given the previous particles uniform weight. The implementation of SMC is given by Algorithm 1 in the supplementary material.\n\n
2.1 The Critical Role of Proposal Distributions in Sequential Monte Carlo\n\n
The choice of the proposal distribution in SMC is critical. Even when employing the resampling step, a poor proposal distribution will produce trajectories that, when traced backwards, quickly collapse onto a single ancestor. Clearly this represents a poor approximation to the true posterior p(z_{1:T} | x_{1:T}). 
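The multinomial resampling step of SIR described above, together with the effective-sample-size trigger mentioned in the footnote, can be sketched as follows (a minimal sketch; the threshold fraction and all names are illustrative):

```python
import numpy as np

# ESS of normalized weights: ESS = 1 / sum_n w_n^2 (equals N for uniform weights).
def ess(w_tilde):
    return 1.0 / np.sum(w_tilde ** 2)

# Resample only when the ESS falls below a fraction of the particle count:
# draw ancestor indices a^(n) from the normalized weights, duplicate the
# selected particles, and reset the weights to uniform.
def maybe_resample(z, w_tilde, rng, threshold=0.5):
    N = len(w_tilde)
    if ess(w_tilde) >= threshold * N:
        return z, w_tilde, np.arange(N)            # keep particles as they are
    ancestors = rng.choice(N, size=N, p=w_tilde)   # multinomial draw of ancestors
    return z[ancestors], np.full(N, 1.0 / N), ancestors

rng = np.random.default_rng(0)
z = np.array([-1.0, 0.0, 2.0, 5.0])
w = np.array([0.0, 0.0, 1.0, 0.0])                 # degenerate weights: ESS = 1
z_new, w_new, anc = maybe_resample(z, w, rng)      # every ancestor index is 2
```

The returned ancestor indices play the role of the a^{(n)}_t used to maintain the ancestry sets A^{(n)}_t in the text.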
These effects can be mitigated by increasing the number of particles and/or applying more complex additional MCMC moves [5, 2], but these strategies increase the computational cost. The conclusion is that the proposal should be chosen with care. The optimal choice for an unconstrained proposal that has access to all of the observed data at all times is the intractable posterior distribution, q_\u03c6(z_{1:T} | x_{1:T}) = p_\u03b8(z_{1:T} | x_{1:T}). Given the restrictions imposed by the factorization, this becomes q(z_t | z_{1:t-1}, x_{1:t}) = p(z_t | z_{1:t-1}, x_{1:t}), which is still typically intractable. The bootstrap filter instead uses the prior, q(z_t | z_{1:t-1}, x_{1:t}) = p(z_t | z_{1:t-1}, x_{1:t-1}), which is often tractable but fails to incorporate information from the current observation x_t. A halfway-house employs distributional approximate inference techniques to approximate p(z_t | z_{1:t-1}, x_{1:t}). Examples include the EKPF and UPF [5]. However, these methods suffer from three main problems. First, the extended and unscented Kalman filters from which these methods are derived are known to be inaccurate and poorly behaved for many problems outside of the SMC setting [6]. Second, these approximations must be applied on a sample-by-sample basis, leading to significant additional computational overhead. Third, neither approximation is tuned using an SMC-relevant criterion. In the next section we introduce a new method for adapting the proposal that addresses these limitations.\n\n
3 Adapting Proposals by Descending the Inclusive KL Divergence\n\n
In this work the quality of the proposal distribution will be optimized using the inclusive KL-divergence between the true posterior and the proposal, KL[p_\u03b8(z_{1:T} | x_{1:T}) || q_\u03c6(z_{1:T} | x_{1:T})]. (Parameters are made explicit since we will shortly be interested in both adapting the proposal \u03c6 and learning the model \u03b8.) 
This objective is chosen for four main reasons. First, this is a direct measure of the quality of the proposal, unlike those typically used such as effective sample size. Second, if the true posterior lies in the class of distributions attainable by the proposal family, then the objective has a global optimum at this point. Third, if the true posterior does not lie within this class, then this KL divergence tends to find proposal distributions that have higher entropy than the original, which is advantageous for importance sampling (the exclusive KL is unsuitable for this reason [7]). Fourth, the derivative of the objective can be approximated efficiently using a sample-based approximation that will now be described.\n\n
The gradient of the negative KL divergence with respect to the parameters of the proposal distribution takes a simple form,\n\n
-\frac{\partial}{\partial \phi} KL[p_\theta(z_{1:T} | x_{1:T}) || q_\phi(z_{1:T} | x_{1:T})] = \int p_\theta(z_{1:T} | x_{1:T}) \frac{\partial}{\partial \phi} \log q_\phi(z_{1:T} | x_{1:T}) dz_{1:T}.\n\n
The expectation over the posterior can be approximated using samples from SMC. One option would use the weighted sample trajectories at the final time-step of SMC, but although asymptotically unbiased, such an estimator would have high variance due to the collapse of the trajectories. An alternative, which reduces variance at the cost of introducing some bias, uses the intermediate ancestral trees, i.e. a filtering approximation (see the supplementary material for details),\n\n
-\frac{\partial}{\partial \phi} KL[p_\theta(z_{1:T} | x_{1:T}) || q_\phi(z_{1:T} | x_{1:T})] \approx \sum_t \sum_n \tilde{w}^{(n)}_t \frac{\partial}{\partial \phi} \log q_\phi(z^{(n)}_t | x_{1:t}, z^{A^{(n)}_{t-1}}_{1:t-1}).   (1)\n\n
The simplicity of the proposed approach brings with it several advantages and opportunities.\n\n
Online and batch variants. Since the derivatives distribute over time, it is trivial to apply this update in an online way, e.g. 
updating the proposal distribution every time-step. Alternatively, when learning parameters in a batch setting, it might be more appropriate to update the proposal parameters after making a full forward pass of SMC. Conveniently, when performing approximate maximum-likelihood learning, the gradient update for the model parameters \u03b8 can be efficiently approximated using the same sample particles from SMC (see supplementary material and Algorithm 1). A similar derivation for maximum likelihood learning is also discussed in [4].\n\n
\frac{\partial}{\partial \theta} \log p_\theta(x_{1:T}) \approx \sum_t \sum_n \tilde{w}^{(n)}_t \frac{\partial}{\partial \theta} \log p_\theta(x_t, z^{(n)}_t | x_{1:t-1}, z^{A^{(n)}_{t-1}}_{1:t-1}).   (2)\n\n
Algorithm 1 Stochastic Gradient Adaptive SMC (batch inference and learning variants)\n
Require: proposal q_\u03c6, model p_\u03b8, observations X = {x_{1:T_j}}_{j=1:M}, number of particles N\n
repeat\n
  {x^{(j)}_{1:T_j}}_{j=1:m} \u2190 NextMiniBatch(X)\n
  {z^{(i,j)}_{1:t}, \tilde{w}^{(i,j)}_t, A^{(i,j)}_t}_{i=1:N, j=1:m, t=1:T_j} \u2190 SMC(\u03b8, \u03c6, N, {x^{(j)}_{1:T_j}}_{j=1:m})\n
  \u0394\u03c6 = \sum_j \sum_{t=1}^{T_j} \sum_i \tilde{w}^{(i,j)}_t \frac{\partial}{\partial \phi} \log q_\phi(z^{(i,j)}_t | x^{(j)}_{1:t}, z^{A^{(i,j)}_{t-1}}_{1:t-1})\n
  \u0394\u03b8 = \sum_j \sum_{t=1}^{T_j} \sum_i \tilde{w}^{(i,j)}_t \frac{\partial}{\partial \theta} \log p_\theta(x^{(j)}_t, z^{(i,j)}_t | x^{(j)}_{1:t-1}, z^{A^{(i,j)}_{t-1}}_{1:t-1})   (optional)\n
  \u03c6 \u2190 Optimize(\u03c6, \u0394\u03c6)\n
  \u03b8 \u2190 Optimize(\u03b8, \u0394\u03b8)   (optional)\n
until convergence\n\n
Efficiency of the adaptive proposal. In contrast to the EKPF and UPF, the new method employs an analytic function for propagation and does not require costly particle-specific distributional approximation as an inner-loop. 
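To make Eq. (1) concrete, consider a toy proposal q_\u03c6(z) = N(z; \u03c6, 1) with a single adaptable mean, for which the score is \partial/\partial\phi \log q_\phi(z) = z - \u03c6; the estimator is then just a weight-averaged score (a toy sketch, not the neural parameterization used in this paper):

```python
import numpy as np

# Filtering approximation to -d/dphi KL[p || q_phi] (Eq. 1): a sum over
# time-steps t and particles n of the normalized weight times the score
# of the proposal. Here q_phi(z) = N(z; phi, 1), so the score is z - phi.
def kl_gradient_estimate(z, w_tilde, phi):
    return np.sum(w_tilde * (z - phi))   # z, w_tilde have shape (T, N)

z = np.array([[0.0, 2.0]])               # one time-step, two particles
w = np.array([[0.5, 0.5]])               # uniform normalized weights
g_at_1 = kl_gradient_estimate(z, w, 1.0) # vanishes at the weighted particle mean
g_at_0 = kl_gradient_estimate(z, w, 0.0) # positive: pushes phi towards the mean
```

Ascending this estimate moves the proposal mean towards the weighted particle cloud, which is exactly the behaviour Eq. (1) induces for richer parameterizations.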
Similarly, although the method bears similarity to the assumed-density filter (ADF) [8], which minimizes a (local) inclusive KL, the new method has the advantage of minimizing a global cost and does not require particle-specific moment matching.\n\n
Training complex proposal models. The adaptation method described above can be applied to any parametric proposal distribution. Special cases have been previously treated by [9]. We propose a related, but arguably more straightforward and general, approach to proposal adaptation. In the next section, we describe a rich family of proposal distributions, which go beyond previous work, based upon neural networks. This approach enables adaptive SMC methods to make use of the rich literature and optimization tools available from supervised learning.\n\n
Flexibility of training. One option is to train the proposal distribution using samples from SMC derived from the observed data. However, this is not the only approach. For example, the proposal could be trained using data sampled from the generative model instead, which might mitigate overfitting effects for small datasets. Similarly, the trained proposal does not need to be the one used to generate the samples in the first place. The bootstrap filter or more complex variants can be used.\n\n
4 Flexible and Trainable Proposal Distributions Using Neural Networks\n\n
The proposed adaptation method can be applied to any parametric proposal distribution. 
Here we briefly describe how to utilize this flexibility to employ powerful neural network-based parameterizations that have recently shown excellent performance in supervised sequence learning tasks [10, 11]. Generally speaking, application of these techniques to unsupervised sequence modeling settings is an active research area that is still in its infancy [12], and this work opens a new avenue in this wider research effort.\n\n
In a nutshell, the goal is to parameterize q_\u03c6(z_t | z_{1:t-1}, x_{1:t}) \u2013 the proposal's stochastic mapping from all previous hidden states z_{1:t-1} and all observations (up to and including the current observation) x_{1:t}, to the current hidden state z_t \u2013 in a flexible, computationally efficient and trainable way. Here we use a class of functions called Long Short-Term Memory (LSTM) that define a deterministic mapping from an input sequence to an output sequence using parameter-efficient recurrent dynamics, and alleviate the common vanishing-gradient problem in recurrent neural networks [13, 10, 11]. The distribution q_\u03c6(z_t | h_t) can be a mixture of Gaussians (a mixture density network (MDN) [14]) in which the mixing proportions, means and covariances are parameterised through another neural network (see the supplementary material for details on LSTM, MDN, and neural network architectures).\n\n
5 Experiments\n\n
The goal of the experiments is three-fold. First, to evaluate the performance of the adaptive method for inference on standard benchmarks used by the SMC community with known ground truth. Second, to evaluate the performance when SMC is used as an inner loop of a learning algorithm. Again we use an example with known ground truth. 
Third, to apply SMC learning to complex models that would normally be challenging for SMC, comparing to the state-of-the-art in approximate inference.\n\n
One way of assessing the success of the proposed method would be to evaluate KL[p(z_{1:T} | x_{1:T}) || q(z_{1:T} | x_{1:T})]. However, this quantity is hard to accurately compute. Instead we use a number of other metrics. For the experiments where ground truth states z_{1:T} are known, we can evaluate the root mean square error (RMSE) between the approximate posterior mean of the latent variables (\bar{z}_t) and the true value, RMSE(z_{1:T}, \bar{z}_{1:T}) = (\frac{1}{T} \sum_t (z_t - \bar{z}_t)^2)^{1/2}. More generally, the estimate of the log-marginal likelihood, LML = \log p(x_{1:T}) = \sum_t \log p(x_t | x_{1:t-1}) \approx \sum_t \log(\frac{1}{N} \sum_n w^{(n)}_t), and its variance are also indicative of performance. Finally, we also employ a common metric called the effective sample size (ESS) to measure the effectiveness of our SMC method. The ESS of the particles at time t is given by ESS_t = (\sum_n (\tilde{w}^{(n)}_t)^2)^{-1}. If q(z_{1:T} | x_{1:T}) = p(z_{1:T} | x_{1:T}), the expected ESS is maximized and equals the number of particles (equivalently, the normalized importance weights are uniform). Note that ESS alone is not a sufficient metric, since it does not measure the absolute quality of samples, but rather the relative quality.\n\n
5.1 Inference in a Benchmark Nonlinear State-Space Model\n\n
In order to evaluate the effectiveness of our adaptive SMC method, we tested our method on a standard nonlinear state-space model often used to benchmark SMC algorithms [2, 3]. The model is given by Eq. 3, where \u03b8 = (\u03c3_v, \u03c3_w). 
The posterior distribution p_\u03b8(z_{1:T} | x_{1:T}) is highly multi-modal due to uncertainty about the signs of the latent states.\n\n
p(z_t | z_{t-1}) = N(z_t; f(z_{t-1}, t), \sigma_v^2),   p(z_1) = N(z_1; 0, 5),\n
p(x_t | z_t) = N(x_t; g(z_t), \sigma_w^2),\n
f(z_{t-1}, t) = z_{t-1}/2 + 25 z_{t-1}/(1 + z_{t-1}^2) + 8 \cos(1.2t),   g(z_t) = z_t^2/20   (3)\n\n
The experiments investigated how the new proposal adaptation method performed in comparison to standard methods including the bootstrap filter, EKPF, and UPF. In particular, we were interested in the following questions. Do rich multi-modal proposals improve inference? For this we compared a proposal with a diagonal Gaussian output to a mixture density network with three components (-MD-). Does a recurrent parameterization of the proposal help? For this we compared a non-recurrent neural network with 100 hidden units (-NN-) to a recurrent neural network with 50 LSTM units (-RNN-). Can injecting information about the prior dynamics into the proposal improve performance (similar in spirit to [15] for variational methods)? To assess this, we parameterized proposals for v_t (process noise) instead of z_t (-f-), and let the proposal have access to the prior dynamics f(z_{t-1}, t).\n\n
For all experiments, the parameters in the non-linear state-space model were fixed to (\u03c3_v, \u03c3_w) = (\sqrt{10}, 1). Adaptation of the proposal was performed on 1000 samples from the generative process at each iteration. Results are summarized in Fig. 1 and Table 1 (see supplementary material for additional results). Average run times for the algorithms over a sequence of length 1000 were: 0.782 s bootstrap, 12.1 s EKPF, 41.4 s UPF, 1.70 s NN-NASMC, and 2.67 s RNN-NASMC, where the EKPF and UPF implementations are provided by [5]. 
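For reference, data from the benchmark model in Eq. 3 can be generated as follows (a minimal sketch using the parameter settings above; all names are illustrative):

```python
import numpy as np

# Simulate (z_{1:T}, x_{1:T}) from the benchmark nonlinear SSM of Eq. 3:
#   z_1 ~ N(0, 5),  z_t ~ N(f(z_{t-1}, t), sig_v^2),  x_t ~ N(z_t^2 / 20, sig_w^2)
def simulate(T, sig_v=np.sqrt(10.0), sig_w=1.0, seed=0):
    rng = np.random.default_rng(seed)
    z = np.zeros(T)
    x = np.zeros(T)
    z[0] = rng.normal(0.0, np.sqrt(5.0))             # p(z_1) = N(0, 5)
    for t in range(T):
        if t > 0:
            f = z[t-1] / 2 + 25 * z[t-1] / (1 + z[t-1] ** 2) + 8 * np.cos(1.2 * (t + 1))
            z[t] = rng.normal(f, sig_v)
        x[t] = rng.normal(z[t] ** 2 / 20.0, sig_w)   # g(z_t) = z_t^2 / 20
    return z, x

z_true, x_obs = simulate(100)
```

The quadratic observation function discards the sign of z_t, which is the source of the multi-modality noted above.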
These numbers should only be taken as a guide, as the implementations had differing levels of acceleration.\n\n
The new adaptive proposal methods significantly outperform the bootstrap, EKPF, and UPF methods in terms of ESS, RMSE and the variance in the LML estimates. The multi-modal proposal outperforms a simple Gaussian proposal (compare RNN-MD-f to RNN-f), indicating multi-modal proposals can improve performance. Moreover, the RNN outperforms the non-recurrent NN (compare RNN to NN). Although the proposal models can effectively learn the transition function, injecting information about the prior dynamics into the proposal does help (compare RNN-f to RNN). Interestingly, there is no clear-cut winner between the EKPF and UPF, although the UPF does return LML estimates that have lower variance [5]. All methods converged to similar LMLs that were close to the values computed using large numbers of particles, indicating the implementations are correct.\n\n
Figure 1: Left: Box plots for LML estimates from iteration 200 to 1000. Right: Average ESS over the first 1000 iterations.\n\n
Method     ESS mean (std)   LML mean (std)   RMSE mean (std)\n
prior      36.66 (0.25)     -2957 (148)      3.266 (0.578)\n
EKPF       60.15 (0.83)     -2829 (407)      3.578 (0.694)\n
UPF        50.58 (0.63)     -2696 (79)       2.956 (0.629)\n
RNN        69.64 (0.60)     -2774 (34)       3.505 (0.977)\n
RNN-f      73.88 (0.71)     -2633 (36)       2.568 (0.430)\n
RNN-MD     69.25 (1.04)     -2636 (40)       2.612 (0.472)\n
RNN-MD-f   76.71 (0.68)     -2622 (32)       2.509 (0.409)\n
NN-MD      69.39 (1.08)     -2634 (36)       2.731 (0.608)\n\n
Table 1: Left, Middle: Average ESS and log marginal likelihood estimates over the last 400 iterations. Right: The RMSE over 100 new sequences with no further adaptation.\n\n
5.2 Inference in the Cart and Pole System\n\n
As a second and more physically meaningful system, we considered a cart-pole system that consists of an inverted pendulum resting on a movable base [16]. The system was driven by a white-noise input. 
An ODE solver was used to simulate the system from its equations of motion. We considered the problem of inferring the true position of the cart and orientation of the pendulum (along with their derivatives and the input noise) from noisy measurements of the location of the tip of the pole. The results are presented in Fig. 2. The system is significantly more intricate than the model in Sec. 5.1, and does not directly admit the usage of the EKPF or UPF. Our RNN-MD proposal model successfully learns good proposals without any direct access to the prior dynamics.\n\n
Figure 2: Left: Normalized ESS over iterations. Middle, Right: Posterior mean vs. ground-truth for x, the horizontal location of the cart, and \u25b3\u03b8, the change in relative angle of the pole. RNN-MD learns to have higher ESS than the prior and more accurately estimates the latent states.\n\n
Figure 3: PMMH samples of \u03c3_w values for N = {100, 10} particles. For small numbers of particles (right), PMMH is very slow to burn in and mix when proposing from the prior distribution, due to the large variance in the marginal likelihood estimates it returns.\n\n
5.3 Bayesian Learning in a Nonlinear SSM\n\n
SMC is often employed as an inner loop of a more complex algorithm. One prominent example is Particle Markov Chain Monte Carlo [3], a class of methods that sample from the joint posterior over model parameters \u03b8 and latent state trajectories, p(\u03b8, z_{1:T} | x_{1:T}). Here we consider the Particle Marginal Metropolis-Hastings sampler (PMMH). In this context SMC is used to construct a proposal distribution for a Metropolis-Hastings (MH) accept/reject step. The proposal is formed by sampling a proposed set of parameters, e.g. by perturbing the current parameters using a Gaussian random walk; then SMC is used to sample a proposed set of latent state variables, resulting in a joint proposal q(\u03b8*, z*_{1:T} | \u03b8, z_{1:T}) = q(\u03b8* | \u03b8) p_{\u03b8*}(z*_{1:T} | x_{1:T}). The MH step uses the SMC marginal likelihood estimates to determine acceptance. Full details are given in the supplementary material.\n\n
In this experiment, we evaluate our method in a PMMH sampler on the same model from Section 5.1, following [3] (see footnote 2). A random-walk proposal is used to sample \u03b8 = (\u03c3_v, \u03c3_w): q(\u03b8* | \u03b8) = N(\u03b8* | \u03b8, diag([0.15, 0.08])). The prior over \u03b8 is set as IG(0.01, 0.01), \u03b8 is initialized as (10, 10), and the PMMH is run for 500 iterations.\n\n
Two of the adaptive models considered in Section 5.1 are used for comparison (RNN-MD and RNN-MD-f), where \u201c-pre-\u201d models are pre-trained for 500 iterations using samples from the initial \u03b8 = (10, 10). The results are shown in Fig. 3 and were typical for a range of parameter settings. Given a sufficient number of particles (N = 100), there is almost no difference between the prior proposal and our method. However, when the number of particles gets smaller (N = 10), NASMC enables significantly faster burn-in to the posterior, particularly on the measurement noise \u03c3_w, and, for similar reasons, NASMC mixes more quickly. 
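The PMMH accept/reject step described above can be sketched as follows (assuming a symmetric random-walk proposal over \u03b8, so its density cancels from the ratio; the log_ml arguments stand for SMC estimates of \log p(x_{1:T} | \u03b8), and all names are illustrative):

```python
import numpy as np

# MH acceptance for PMMH: with a symmetric proposal over theta, accept the
# proposed (theta*, z*_{1:T}) with probability
#   min(1, p_hat(x | theta*) p(theta*) / (p_hat(x | theta) p(theta))),
# computed here in log space using the SMC marginal-likelihood estimates.
def pmmh_accept(log_ml_prop, log_prior_prop, log_ml_curr, log_prior_curr, rng):
    log_alpha = (log_ml_prop + log_prior_prop) - (log_ml_curr + log_prior_curr)
    return np.log(rng.uniform()) < log_alpha

rng = np.random.default_rng(0)
# A proposal with a higher estimated joint density is always accepted:
accepted = pmmh_accept(-100.0, 0.0, -120.0, 0.0, rng)
```

Because the acceptance ratio uses noisy marginal-likelihood estimates, high estimator variance (as with few particles and a poor proposal) directly slows burn-in and mixing, which is what Fig. 3 illustrates.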
The limitation of NASMC-PMMH is that the proposal model needs to continuously adapt as the global parameters are sampled, but note this is still not as costly as adapting on a particle-by-particle basis, as is the case for the EKPF and UPF.\n\n
5.4 Polyphonic Music Generation\n\n
Finally, the new method is used to train a latent variable recurrent neural network (LV-RNN) for modelling four polyphonic music datasets of varying complexity [17]. These datasets are often used to benchmark RNN models because of their high dimensionality and the complex temporal dependencies involved at different time scales [17, 18, 19]. Each dataset contains at least 7 hours of polyphonic music with an average polyphony (number of simultaneous notes) of 3.9 out of 88. LV-RNN contains a recurrent neural network with LSTM layers that is driven by i.i.d. stochastic latent variables (z_t) at each time-point and stochastic outputs (x_t) that are fed back into the dynamics (full details in the supplementary material). The LSTM layers in both the generative and proposal models are set to 1000 units, and Adam [20] is used as the optimizer. The bootstrap filter is compared to the new adaptive method (NASMC); 10 particles are used in training. The hyperparameters are tuned using the validation set [17]. A diagonal Gaussian output is used in the proposal model, with an additional hidden layer of size 200. The log likelihood on the test set, a standard metric for comparison in generative models [18, 21, 19], is approximated using SMC with 500 particles.\n\n
Footnote 2: Only the prior proposal is compared, since Sec. 5.1 shows the advantage of our method over the EKPF/UPF.\n\n
The results are reported in Table 2 (see footnote 3). The adaptive method significantly outperforms the bootstrap filter on three of the four datasets. 
On the piano dataset the bootstrap method performs marginally better. In general, the NLLs for the new method are comparable to the state-of-the-art, although detailed comparison is difficult, as the methods with stochastic latent states require approximate marginalization using importance sampling or SMC.\n\n
Dataset         LV-RNN (NASMC)   LV-RNN (Bootstrap)   STORN (SGVB)   FD-RNN   sRNN   RNN-NADE\n
Piano-midi.de   7.61             7.50                 7.13           7.39     7.58   7.03\n
Nottingham      2.72             3.33                 2.85           3.09     3.43   2.31\n
MuseData        6.89             7.21                 6.16           6.75     6.99   5.60\n
JSBChorales     3.99             4.26                 6.91           8.01     8.58   5.19\n\n
Table 2: Estimated negative log likelihood on test data. \u201cFD-RNN\u201d and \u201cSTORN\u201d are from [19], and \u201csRNN\u201d and \u201cRNN-NADE\u201d are results from [18].\n\n
6 Comparison of Variational Inference to the NASMC Approach\n\n
There are several similarities between NASMC and variational free-energy methods that employ recognition models. Variational free-energy methods refine an approximation q_\u03c6(z | x) to the posterior distribution p_\u03b8(z | x) by optimising the exclusive (or variational) KL-divergence KL[q_\u03c6(z | x) || p_\u03b8(z | x)]. It is common to approximate this integral using samples from the approximate posterior [21, 22, 23]. This general approach is similar in spirit to the way that the proposal is adapted in NASMC, except that the inclusive KL-divergence KL[p_\u03b8(z | x) || q_\u03c6(z | x)] is employed, and this entails that the sample-based approximation requires simulation from the true posterior. Critically, NASMC uses the approximate posterior as a proposal distribution to construct a more accurate posterior approximation. The SMC algorithm can therefore be seen as correcting for the deficiencies in the proposal approximation. 
We believe that this can lead to significant advantages over variational free-energy methods, especially in the time-series setting where variational methods are known to have severe biases [24]. Moreover, using the inclusive KL avoids having to compute the entropy of the approximating distribution, which can prove problematic when using complex approximating distributions (e.g. mixtures and heavy-tailed distributions) in the variational framework. There is a close connection between NASMC and the wake-sleep algorithm [25]. The wake-sleep algorithm also employs the inclusive KL divergence to refine a posterior approximation, and recent generalizations have shown how to incorporate this idea into importance sampling [26]. In this context, the NASMC algorithm extends this line of work to SMC.

7 Conclusion

This paper developed a powerful method for adapting proposal distributions within general SMC algorithms. The method parameterises a proposal distribution using a recurrent neural network to model long-range contextual information, allows flexible distributional forms including mixture density networks, and enables efficient training by stochastic gradient descent. The method was found to outperform existing adaptive proposal mechanisms, including the EKPF and UPF, on a standard SMC benchmark; it improves burn-in and mixing of the PMMH sampler, and it allows effective training of latent variable recurrent neural networks using SMC. We hope that the connection between SMC and neural network technologies will inspire further research into adaptive SMC methods. In particular, applications of the methods developed in this paper to adaptive particle smoothing, high-dimensional latent variable models, and adaptive PMCMC for probabilistic programming are particularly exciting avenues.

Acknowledgments

SG is generously supported by the Cambridge-Tübingen Fellowship, the ALTA Institute, and Jesus College, Cambridge.
RET thanks the EPSRC (grants EP/G050821/1 and EP/L000776/1). We thank the Theano developers for their toolkit, the authors of [5] for releasing their source code, and Roger Frigola, Sumeet Singh, Fredrik Lindsten, and Thomas Schön for helpful suggestions on experiments.

3 Results for RNN-NADE are provided separately for reference, since this is a different model class.

References

[1] N. J. Gordon, D. J. Salmond, and A. F. Smith, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," in IEE Proceedings F (Radar and Signal Processing), vol. 140, pp. 107–113, IET, 1993.

[2] A. Doucet, N. De Freitas, and N. Gordon, Sequential Monte Carlo Methods in Practice. Springer-Verlag, 2001.

[3] C. Andrieu, A. Doucet, and R. Holenstein, "Particle Markov chain Monte Carlo methods," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 3, pp. 269–342, 2010.

[4] G. Poyiadjis, A. Doucet, and S. S. Singh, "Particle approximations of the score and observed information matrix in state space models with application to parameter estimation," Biometrika, vol. 98, no. 1, pp. 65–80, 2011.

[5] R. Van Der Merwe, A. Doucet, N. De Freitas, and E. Wan, "The unscented particle filter," in Advances in Neural Information Processing Systems, pp. 584–590, 2000.

[6] R. Frigola, Y. Chen, and C. Rasmussen, "Variational Gaussian process state-space models," in Advances in Neural Information Processing Systems, pp. 3680–3688, 2014.

[7] D. J. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

[8] T. P. Minka, "Expectation propagation for approximate Bayesian inference," in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 362–369, Morgan Kaufmann Publishers Inc., 2001.

[9] J.
Cornebise, Adaptive Sequential Monte Carlo Methods. PhD thesis, University Pierre and Marie Curie – Paris 6, 2009.

[10] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, vol. 385. Springer, 2012.

[11] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.

[12] A. Graves, "Generating sequences with recurrent neural networks," CoRR, vol. abs/1308.0850, 2013.

[13] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[14] C. M. Bishop, "Mixture density networks," Technical report, Aston University, 1994.

[15] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, "DRAW: A recurrent neural network for image generation," in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 1462–1471, 2015.

[16] A. McHutchon, Nonlinear Modelling and Control using Gaussian Processes. PhD thesis, University of Cambridge, Department of Engineering, 2014.

[17] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription," in International Conference on Machine Learning (ICML), 2012.

[18] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu, "Advances in optimizing recurrent networks," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8624–8628, IEEE, 2013.

[19] J. Bayer and C. Osendorfer, "Learning stochastic recurrent networks," arXiv preprint arXiv:1411.7610, 2014.

[20] D. P. Kingma and J.
Ba, "Adam: A method for stochastic optimization," The International Conference on Learning Representations (ICLR), 2015.

[21] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," The International Conference on Learning Representations (ICLR), 2014.

[22] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," International Conference on Machine Learning (ICML), 2014.

[23] A. Mnih and K. Gregor, "Neural variational inference and learning in belief networks," International Conference on Machine Learning (ICML), 2014.

[24] R. E. Turner and M. Sahani, "Two problems with variational expectation maximisation for time-series models," in Bayesian Time Series Models (D. Barber, T. Cemgil, and S. Chiappa, eds.), ch. 5, pp. 109–130, Cambridge University Press, 2011.

[25] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, "The 'wake-sleep' algorithm for unsupervised neural networks," Science, vol. 268, no. 5214, pp. 1158–1161, 1995.

[26] J. Bornschein and Y. Bengio, "Reweighted wake-sleep," The International Conference on Learning Representations (ICLR), 2015.