{"title": "Parameter elimination in particle Gibbs sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 8918, "page_last": 8929, "abstract": "Bayesian inference in state-space models is challenging due to high-dimensional state trajectories. A viable approach is particle Markov chain Monte Carlo (PMCMC), combining MCMC and sequential Monte Carlo to form ``exact approximations'' to otherwise-intractable MCMC methods. The performance of the approximation is limited to that of the exact method. We focus on particle Gibbs (PG) and particle Gibbs with ancestor sampling (PGAS), improving their performance beyond that of the ideal Gibbs sampler (which they approximate) by marginalizing out one or more parameters. This is possible when the parameter(s) has a conjugate prior relationship with the complete data likelihood. Marginalization yields a non-Markov model for inference, but we show that, in contrast to the general case, the methods still scale linearly in time. While marginalization can be cumbersome to implement, recent advances in probabilistic programming have enabled its automation. We demonstrate how the marginalized methods are viable as efficient inference backends in probabilistic programming, and demonstrate with examples in ecology and epidemiology.", "full_text": "Parameter elimination in particle Gibbs sampling\n\nAnna Wigren\n\nDepartment of Information Technology\n\nUppsala University, Sweden\nanna.wigren@it.uu.se\n\nRiccardo Sven Risuleo\n\nDepartment of Information Technology\n\nUppsala University, Sweden\n\nriccardo.risuleo@it.uu.se\n\nLawrence Murray\n\nUber AI\n\nSan Francisco, CA, USA\n\nlawrence.murray@uber.com\n\nFredrik Lindsten\n\nDivision of Statistics and Machine Learning\n\nLink\u00f6ping University, Sweden\nfredrik.lindsten@liu.se\n\nAbstract\n\nBayesian inference in state-space models is challenging due to high-dimensional\nstate trajectories. 
A viable approach is particle Markov chain Monte Carlo, com-\nbining MCMC and sequential Monte Carlo to form \u201cexact approximations\u201d to\notherwise intractable MCMC methods. The performance of the approximation is\nlimited to that of the exact method. We focus on particle Gibbs and particle Gibbs\nwith ancestor sampling, improving their performance beyond that of the underly-\ning Gibbs sampler (which they approximate) by marginalizing out one or more\nparameters. This is possible when the parameter prior is conjugate to the complete\ndata likelihood. Marginalization yields a non-Markovian model for inference, but\nwe show that, in contrast to the general case, this method still scales linearly in\ntime. While marginalization can be cumbersome to implement, recent advances in\nprobabilistic programming have enabled its automation. We demonstrate how the\nmarginalized methods are viable as ef\ufb01cient inference backends in probabilistic\nprogramming, and demonstrate with examples in ecology and epidemiology.\n\n1\n\nIntroduction\n\nState-space models (SSMs) are a well-studied topic\nwith applications in climatology [3], robotics [8],\necology [29], and epidemiology [31], to mention just\na few. In this paper we propose a new method for per-\nforming Bayesian inference in such models. In SSMs,\na latent (hidden) state process xt is observed through\na second process yt. The state process is assigned an\ninitial density x0 \u223c p(x0), and evolves in time accord-\ning to a transition density xt \u223c p(xt|xt\u22121, \u03b8), where\n\u03b8 are parameters with prior density p(\u03b8). Given\nthe latent states xt, the observations are assumed\nindependent with density p(yt|xt, \u03b8). We wish to in-\nfer the joint posterior, p(x0:T , \u03b8|y1:T ), for the states\nx0:T and the parameters \u03b8, given a set of observa-\ntions y1:T = {y1, . . . , yT}. 
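To make the notation concrete, the generative structure just described can be sketched in a few lines of Python; the linear-Gaussian transition and observation densities below are hypothetical placeholders for illustration only, not a model used in the paper:

```python
import numpy as np

def simulate_ssm(T, theta, seed=0):
    """Simulate x_{0:T} and y_{1:T} from a toy SSM.

    Placeholder densities (assumed for illustration):
    x_0 ~ N(0, 1), x_t ~ N(theta * x_{t-1}, 1), y_t ~ N(x_t, 1).
    """
    rng = np.random.default_rng(seed)
    x = np.empty(T + 1)
    y = np.empty(T)
    x[0] = rng.normal()                         # x_0 ~ p(x_0)
    for t in range(1, T + 1):
        x[t] = theta * x[t - 1] + rng.normal()  # x_t ~ p(x_t | x_{t-1}, theta)
        y[t - 1] = x[t] + rng.normal()          # y_t ~ p(y_t | x_t, theta)
    return x, y
```

Inference then amounts to recovering the joint posterior of the trajectory `x` and `theta` given only `y`.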
Unfortunately, computing this posterior distribution exactly is not analytically tractable for general non-linear, non-Gaussian models, so we must resort to approximations.

[Figure 1 here: autocorrelation vs. lag for PGAS (N = 50, 500, 5000) and mPGAS (N = 50, 500).]

Figure 1: The autocorrelation function (ACF) for standard PGAS converges to that of the hypothetical Gibbs sampler as N → ∞, whereas mPGAS will produce iid draws in the limit, i.e., the ACF will drop to zero at lag one for large N. Similar results hold for PG and mPG, see Supplementary E.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Markov chain Monte Carlo (MCMC) [e.g. 32] is a popular choice for Bayesian inference. The motivation behind our new method is based on one such MCMC method: the Gibbs sampler. In the Gibbs sampler, samples from the posterior p(x_{0:T}, θ | y_{1:T}) are generated by alternating between sampling the states from x′_{0:T} ∼ p(x_{0:T} | y_{1:T}, θ′) and the parameters from θ′ ∼ p(θ | x′_{0:T}, y_{1:T}). Sampling the parameters is often manageable, but sampling the states is challenging, owing to the distribution being high-dimensional. A possible remedy is to use particle Markov chain Monte Carlo (PMCMC) methods [2], in which sequential Monte Carlo (SMC) is used to approximate sampling from the high-dimensional distribution. Particle Gibbs (PG) [2] is a PMCMC algorithm that mimics the Gibbs sampler.
Efficient extensions, such as particle Gibbs with ancestor sampling (PGAS), have also been proposed, reducing the computational cost from quadratic to linear in the number of timesteps, T, in favorable conditions [18, 20].

PG and PGAS have proven to be efficient in many challenging situations [e.g. 19, 22, 39, 40]. Nevertheless, being “exact approximations” [1] of the (possibly intractable) Gibbs sampler, they can never outperform it. In essence, this means that as the number of particles used in their SMC component approaches infinity, PG and PGAS approach the hypothetical Gibbs sampler in terms of autocorrelation, but can never surpass it. This is illustrated in Figure 1, orange curve (for details on the model, see Section 3.2). Ideally, independent samples from the target distribution are desired, but the often strong dependence between the parameters θ and the states x_{0:T} in the hypothetical Gibbs sampler leads to correlated samples also for PG and PGAS.

In marginalized Gibbs sampling we propose to marginalize out the parameters in the state update, ideally alternating between sampling the states x′_{0:T} ∼ p(x_{0:T} | y_{1:T}) and sampling the parameters θ′ ∼ p(θ | x′_{0:T}, y_{1:T}) (note that an alternative is to sample only the state trajectories {x^i_{0:T}}^M_{i=1}, where M is the number of MCMC steps, and then estimate the posterior of θ as a mixture of densities, where each component is p(θ | x^i_{0:T}, y_{1:T})). The state update is thus independent of the parameters, and this hypothetical marginalized Gibbs sampler will effectively generate independent samples from the target distribution. However, as for the unmarginalized hypothetical Gibbs sampler, the distribution for sampling the states is not available in closed form. To address this issue, we derive marginalized versions of PG and PGAS (hereon referred to as mPG/mPGAS).
Analogous to the\nunmarginalized case, with an increasing number of particles, mPG and mPGAS will approach the\nhypothetical marginalized Gibbs sampler \u2013 that is, a sampler generating independent samples from\nthe target. This behavior is illustrated in Figure 1, blue curve.\nMarginalization is possible if the SSM has a conjugacy relation between the parameter prior and\nthe complete data likelihood, that is, the conditional p(\u03b8|x0:T , y1:T ) has the same functional form\nas the prior p(\u03b8). However, even for such models there is a price to pay for marginalization: it\nturns the Markovian dependencies, central to the SSM when conditioned on the parameters, into\nnon-Markovian dependencies for both states and observations. This will make it harder to apply\nconventional MCMC methods, whereas PMCMC methods have proven to be better suited for models\nof this type [20]. In Section 3 we derive the algorithmic expressions for mPG and mPGAS for this\nfamily of models. The necessary updates in each step in the marginalized SMC algorithm can be\ndone using suf\ufb01cient statistics, which enables the computation time of mPG and mPGAS to scale\nlinearly with the number of observations, despite the non-Markovian dependencies. The class of\nconjugate SSMs includes many common models, but is still somewhat restrictive. In Section 4, we\ndiscuss some extensions to make the framework more generally applicable and provide numerical\nillustrations.\nMarginalization of static parameters in the context of SMC has been studied by [5, 36] for the purpose\nof online Bayesian parameter learning. To what extent these methods suffer from the well-known\npath degeneracy issue of SMC has been a topic of debate, see e.g. [7]. Since our proposed method\nis based on PMCMC, and in particular PGAS, it is more robust to path degeneracy, see [20]. 
The Rao-Blackwellized particle filter [6, 9] also makes use of marginalization, but for marginalizing out part of the state vector using conditional Kalman filters.

In practice, deriving the conjugacy relations can be quite involved. However, recent developments in probabilistic programming have enabled automatic marginalization [see e.g. 16, 26, 28], which significantly improves the usability of our proposed method. Probabilistic programming considers the way in which probabilistic models and inference algorithms may be expressed in universal programming languages, formally extending the expressive power of graphical models. There are by now quite a number of probabilistic programming languages. Examples that can support SMC-based methods, such as those considered here, include LibBi [23], BiiPS [37], Venture [21], Anglican [38], WebPPL [14], Figaro [30], Turing [13], and Birch [24]. A language can implement PG/PGAS combined with automatic marginalization to realize our proposed method. We have implemented PG, mPG, PGAS and mPGAS in Birch [24] and provide examples to illustrate their efficiency in Sections 4.2 and 4.3.

2 Background on SMC

In PG and PGAS, the state update is approximated using SMC; we therefore provide a brief summary of the SMC algorithm before introducing the proposed method. For a more extensive introduction, see e.g. [4, 15]. Consider a sequence of probability densities γ̄_{θ,t}(x_{0:t}) expressed as

γ̄_{θ,t}(x_{0:t}) = γ_{θ,t}(x_{0:t}) / Z_{θ,t},   t = 1, 2, . . .   (1)

where γ_{θ,t} are the corresponding unnormalized densities, which we assume can be evaluated pointwise, and Z_{θ,t} is a normalizing constant. For an SSM, the target density of interest is often p(x_{0:t} | y_{1:t}, θ), which implies γ_{θ,t} = p(x_{0:t}, y_{1:t} | θ) and Z_{θ,t} = p(y_{1:t} | θ).
SMC methods approximate the target density (1) using a set of N weighted samples (or particles) {x^i_{0:t}, w̄^i_t}^N_{i=1}, generated according to Algorithm 1. When moving to the next distribution in the sequence, all particles are resampled by choosing an ancestor trajectory x^{a^i_t}_{0:t−1} from the previous time step according to the respective weights w̄^i_{t−1} of the possible ancestors. SMC is based on importance sampling, and the resampled particles are therefore propagated to the next time step using a proposal distribution, q_{θ,t}(x_t | x_{0:t−1}), chosen by the user. A common choice for SSMs is the bootstrap proposal, which equates to propagating according to the transition density p(x_t | x_{t−1}, θ), but other more refined choices, such as the optimal proposal (see e.g. [10]), are also possible. Finally, the (unnormalized) importance weights for the propagated particles are computed using the weight function

ω_{θ,t}(x_{0:t}) = γ_{θ,t}(x_{0:t}) / [γ_{θ,t−1}(x_{0:t−1}) q_{θ,t}(x_t | x_{0:t−1})].   (2)

Algorithm 1 SMC (all steps for i = 1, . . . , N)
1: Initialize: Draw x^i_0 ∼ q_0(x_0), set w^i_0 = γ_{θ,0}(x^i_0)/q_0(x^i_0), normalize w̄^i_0 = w^i_0 / ∑_{j=1}^N w^j_0
2: for t = 1 . . . T do
3:   Resample: Draw a^i_t ∼ C({w̄^i_{t−1}}^N_{i=1}), where C is the categorical distribution.
4:   Propagate: Simulate x^i_t ∼ q_{θ,t}(x_t | x^{a^i_t}_{0:t−1}).
5:   Update: Set w^i_t = ω_{θ,t}(x^i_{0:t}) according to (2) and normalize w̄^i_t = w^i_t / ∑_{j=1}^N w^j_t.
6: end for

3 Method

In this section, we first specify the class of models we consider, and then we show how to marginalize the SMC algorithm and derive mPG and mPGAS for this class of models.

3.1 Conjugate models and marginalized SMC

The SMC framework presented in Section 2 is in a general form and can be directly applied to the marginalized state update by defining the unnormalized target distribution as γ_t(x_{0:t}) = p(x_{0:t}, y_{1:t}) in (1) and then applying Algorithm 1. The computation of the importance weights (step 5 in Algorithm 1), however, turns out to be problematic in marginalized SSMs. To see why, note that the unnormalized target density can be factorized into p(x_{0:t}, y_{1:t}) = p(x_0) ∏_{k=1}^t p(x_k, y_k | x_{0:k−1}, y_{1:k−1}). The weights (2) become

ω_t(x_{0:t}) = p(x_t, y_t | x_{0:t−1}, y_{1:t−1}) / q_t(x_t | x_{0:t−1})   (3)

where the numerator (and possibly also the denominator, depending on the choice of proposal) is non-Markovian. The marginal joint density of states and observations can be written

p(x_t, y_t | x_{0:t−1}, y_{1:t−1}) = ∫ p(x_t, y_t | x_{t−1}, θ) p(θ | x_{0:t−1}, y_{1:t−1}) dθ   (4)

where p(θ | x_{0:t−1}, y_{1:t−1}) is the posterior distribution of the parameters. For a general SSM, the integral (4) is intractable, and the posterior may be difficult to compute.
However, if there is a conjugacy relationship between the prior distribution p(θ) and the complete data likelihoods p(x_{0:t}, y_{1:t} | θ), t = 1, . . . , T, the integral can be solved analytically and the posterior will be of the same form as the prior. One such case is when both the complete data likelihood and the parameter prior are in the exponential family, see Supplementary A for details. However, if we consider joint state and observation likelihoods, p(x_t, y_t | x_{t−1}, θ), in the exponential family, we can end up with a log-partition function that depends on the previous state x_{t−1}. This can create problems when formulating a conjugate prior for the complete data likelihoods, since the prior will be different for each state update, see Supplementary B for details. To avoid this problem for the models we consider, we introduce the restricted exponential family, where the joint state and observation likelihood is given by

p(x_t, y_t | x_{t−1}, θ) = h_t exp(θ^T s_t − A^T(θ) r_t)   (5)

where h_t = h(x_t, x_{t−1}, y_t) is the data-dependent base measure, s_t = s(x_t, x_{t−1}, y_t) is a sufficient statistic, and where the log-partition function can be separated into two factors: A(θ), which is independent of the data, and r_t = r(x_{t−1}), which is independent of the parameters. A conjugate prior for (5) is given by

π(θ | χ_0, ν_0) = g(χ_0, ν_0) exp(θ^T χ_0 − A^T(θ) ν_0)   (6)

where χ_0, ν_0 are hyperparameters. The parameter posterior is given by p(θ | x_{0:t−1}, y_{1:t−1}) = π(θ | χ_{t−1}, ν_{t−1}), with the hyperparameters iteratively updated according to

χ_t = χ_0 + ∑_{k=1}^t s_k = χ_{t−1} + s_t,   ν_t = ν_0 + ∑_{k=1}^t r_k = ν_{t−1} + r_t.   (7)

With the joint likelihood (5) and its conjugate prior (6) in place, we can derive an analytic expression for the marginal of the joint distribution of states and observations, (4), at time t:

p(x_t, y_t | x_{0:t−1}, y_{1:t−1}) = ∫ p(x_t, y_t | x_{t−1}, θ) π(θ | χ_{t−1}, ν_{t−1}) dθ = [g(χ_{t−1}, ν_{t−1}) / g(χ_t, ν_t)] h_t.   (8)

Hence, to compute the weights (3) for marginalized SMC in the restricted exponential family, we only need to keep track of and update the hyperparameters according to (7).

3.2 Marginalized particle Gibbs

In PG we alternate between sampling the parameters and the states as in the hypothetical Gibbs sampler, but the state trajectory is sampled using conditional SMC (cSMC). In cSMC one particle trajectory, the reference trajectory x′_{0:T}, will always survive the resampling step. This version of SMC follows the steps in Algorithm 1, with the constraints that a^N_t = N and x^N_t = x′_t (for details, see [2]). When marginalizing out the parameters, the resulting mPG sampler updates the state trajectory using marginalized cSMC (mcSMC), according to what is presented in Algorithm 1 and Section 3.1, with the addition of conditioning on the reference trajectory surviving the resampling step (as in standard PG).

The conditioning used in cSMC yields a Markov kernel that leaves the correct conditional distribution invariant for any choice of N [2]. PG is therefore a valid MCMC procedure.
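To make the bookkeeping in (7) and (8) concrete, consider the special case of Gaussian increments e_t with unknown variance σ² under an inverse-gamma prior σ² ∼ IG(α, β), for which g(α, β) = β^α/Γ(α) and h_t = (2π)^{−1/2}. The following Python sketch is our own illustration of this case, not code from the paper:

```python
import math

def ig_update(alpha, beta, e):
    """One conjugate hyperparameter update, cf. (7), for a Gaussian
    increment e with unknown variance sigma^2 ~ IG(alpha, beta)."""
    return alpha + 0.5, beta + 0.5 * e ** 2

def log_marginal_increment(alpha, beta, e):
    """Log predictive density of one increment, cf. the ratio (8):
    log[ g(chi_{t-1}, nu_{t-1}) / g(chi_t, nu_t) * h_t ] with
    g(a, b) = b^a / Gamma(a) and h_t = (2*pi)^(-1/2); the result is
    the log-pdf of a Student-t."""
    a_new, b_new = ig_update(alpha, beta, e)
    return (-0.5 * math.log(2.0 * math.pi)
            + alpha * math.log(beta) - math.lgamma(alpha)
            + math.lgamma(a_new) - a_new * math.log(b_new))
```

Chaining `log_marginal_increment` over t while updating (α, β) after each step telescopes to the joint marginal likelihood of e_{1:t}, which is precisely what makes the weights (3) computable online from the hyperparameters alone.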
However, it has been shown that N must increase (at least) linearly with T for the kernel to mix properly for large T, resulting in an overall computational complexity which grows quadratically with T. This holds also for other popular PMCMC methods, such as particle marginal Metropolis–Hastings [2]. To mitigate this issue, [20] proposed a modification of PG in which the ancestor of the reference trajectory in each time step is sampled, according to ancestor weights w̃^i_{t−1|T}, instead of set deterministically, which significantly improves the mixing of the kernel for small N, even when T is large. The resulting method, referred to as PGAS, is equivalent to PG apart from the resampling step.

The difference between mPG and mPGAS lies, analogous to the non-marginalized case, only in the resampling step. Deriving the expression for the ancestor weights in the marginalized case is quite involved; below we simply state the necessary expressions and updates, and a complete derivation is provided in Supplementary C. Each ancestor trajectory in mPGAS is assigned a weight, based on the general expression in [20], given by

w̃^i_{t−1|T} = w̄^i_{t−1} γ_T([x^i_{0:t−1}, x′_{t:T}]) / γ_{t−1}(x^i_{0:t−1}) = w̄^i_{t−1} p([x^i_{0:t−1}, x′_{t:T}], y_{1:T}) / p(x^i_{0:t−1}, y_{1:t−1}),   (9)

where w̄^i_{t−1} is the weight of the ancestor trajectory x^i_{0:t−1} and [x^i_{0:t−1}, x′_{t:T}] is the concatenated trajectory resulting from combining the reference trajectory x′_{t:T} with the possible ancestral path x^i_{0:t−1}.
For members of the restricted exponential family we use (8) in (9) to get the weights

w̃^i_{t−1|T} ∝ w̄^i_{t−1} h^i_t [g(χ^i_{t−1}, ν^i_{t−1}) / g(χ^i_t, ν^i_t)] ∏_{k=t+1}^T h′_k [g(χ^i_{k−1}, ν^i_{k−1}) / g(χ^i_k, ν^i_k)] ∝ w̄^i_{t−1} [g(χ^i_{t−1}, ν^i_{t−1}) / g(χ^i_T, ν^i_T)] h^i_t,   (10)

where χ^i_{t−1}, ν^i_{t−1} are given, for each particle, by (7) and where

χ^i_T = χ^i_{t−1} + s_t(x′_t, x^i_{t−1}, y_t) + s′_{t+1:T},   ν^i_T = ν^i_{t−1} + r_t(x^i_{t−1}) + r′_{t+1:T},   (11)

with s′_{t+1:T} = ∑_{k=t+1}^T s_k(x′_k, x′_{k−1}, y_k) and similarly for r′_{t+1:T}. Hence, χ^i_T is a combination of the statistic for the ancestor trajectory, a cross-over term and the statistic for the reference trajectory, which in each timestep is updated according to s′_{t+1:T} = s′_{t:T} − s_t(x′_t, x′_{t−1}, y_t), and analogously for r′_{t+1:T}. By storing and updating these parameters and sums of statistics in each iteration, computing the ancestor sampling weights only amounts to evaluating (10), implying that we can run mPGAS in linear time despite having a non-Markovian target, which would normally yield quadratic complexity (see [20] for a discussion). We outline mPGAS in Algorithm 2 (for mPG, skip step 3 and the updates of χ_T, ν_T, and set a^N_t deterministically).

Algorithm 2 Marginalized PGAS for the restricted exponential family (all steps for i = 1, . . . , N)
Input: x′_{0:T}, s′_{1:T}, r′_{1:T}
1: Initialize: Draw x^{1:N−1}_0 ∼ q_0(x_0), set x^N_0 = x′_0, set w^i_0 = γ_0(x^i_0)/q_0(x^i_0) and normalize w̄^i_0 = w^i_0 / ∑_{j=1}^N w^j_0
2: for t = 1 . . . T do
3:   Update statistics: s′_{t+1:T} = s′_{t:T} − s_t(x′_t, x′_{t−1}, y_t), r′_{t+1:T} = r′_{t:T} − r_t(x′_{t−1})
4:   Update hyperparameters: χ^i_t, ν^i_t, χ^i_T, ν^i_T according to (7) and (11)
5:   Resample: Draw a^{1:N−1}_t ∼ C({w̄^i_{t−1}}^N_{i=1}) and a^N_t ∼ C({w̃^i_{t−1|T}}^N_{i=1}), with w̃^i_{t−1|T} from (10)
6:   Propagate: Simulate x^{1:N−1}_t ∼ q_t(x_t | x^{a^i_t}_{0:t−1}) and set x^N_t = x′_t
7:   Update weights: Set w^i_t = ω_t(x^i_{0:t}) according to (3) and normalize w̄^i_t = w^i_t / ∑_{j=1}^N w^j_t
8: end for
Output: Sample new x′_{0:T}, s′_{1:T}, r′_{1:T} according to w̄_T

To illustrate the improved performance offered by marginalization we consider the non-linear SSM [15]

x_t = x_{t−1}/2 + 25 x_{t−1}/(1 + x²_{t−1}) + 8 cos(1.2t) + v_t,   y_t = x²_t/20 + w_t,   (12)

where v_t and w_t are Gaussian noise processes with zero mean and unknown variances σ²_v and σ²_w, respectively. The observations are a quadratic function of the state, which makes the posterior multimodal. We assume conjugate, inverse gamma priors σ²_v ∼ IG(α_v, β_v) and σ²_w ∼ IG(α_w, β_w) for the unknown variances, with hyperparameters α_v = β_v = α_w = β_w = 1. We generated T = 150 observations from (12) with σ²_v = 10 and σ²_w = 1. PGAS and mPGAS were run for M = 10000 iterations, discarding the first 1500 samples as burn-in. We initialized with σ²_v = σ²_w = 100 and used a bootstrap proposal for PGAS and a marginalized bootstrap proposal for mPGAS.

Figure 1 shows the autocorrelation for PGAS and mPGAS for different numbers of particles N.
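For reference, data from the benchmark model (12) can be generated with the following Python sketch; this is our own illustration, and the standard-normal draw for the initial state x_0 is an assumption, since the initial density is not specified here:

```python
import numpy as np

def simulate_model_12(T=150, sigma2_v=10.0, sigma2_w=1.0, seed=0):
    """Simulate T observations from the nonlinear benchmark SSM (12)."""
    rng = np.random.default_rng(seed)
    x = np.empty(T + 1)
    y = np.empty(T)
    x[0] = rng.normal()  # assumed initial density (not stated in the text)
    for t in range(1, T + 1):
        x[t] = (0.5 * x[t - 1]
                + 25.0 * x[t - 1] / (1.0 + x[t - 1] ** 2)
                + 8.0 * np.cos(1.2 * t)
                + np.sqrt(sigma2_v) * rng.normal())        # v_t ~ N(0, sigma2_v)
        y[t - 1] = x[t] ** 2 / 20.0 + np.sqrt(sigma2_w) * rng.normal()  # w_t
    return x, y
```

The squared observation function is what induces the multimodality discussed above: a given y_t is compatible with both signs of x_t.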
Ideally we would like iid samples from the posterior distribution; in terms of the ACF of the samples, it should be zero everywhere except at lag 0. It is clear that, for PGAS, increasing N can reduce the autocorrelation only to a certain limit (given by the hypothetical Gibbs sampler). For mPGAS, on the other hand, we obtain a lower autocorrelation using only 50 particles as compared to 5000 for PGAS, and by increasing N we move towards generating iid samples. In Supplementary E we provide corresponding results for PG/mPG. The results in Figure 1 were obtained from our implementation in Matlab; in Supplementary E we also show the corresponding results for our implementation in Birch.

The marginalized versions of PG/PGAS require some extra computations compared to their non-marginalized counterparts; however, this overhead is quite small. For the model (12), with N = 500, using the tic-toc timer in MATLAB we get: PG – 1231.5 s, mPG – 1430.7 s, PGAS – 1260.7 s, mPGAS – 1566.1 s. Note that the code has not been optimized.

4 Extensions and numerical simulations

In this section we describe three extensions of the marginalized method presented in Section 3 and illustrate their efficiency in numerical examples.

4.1 Diffuse priors and blocking

When we do not know much about the parameters of a model, we may use a diffuse prior to reflect our uncertainty. However, a diffuse prior on the parameters can lead to a diffuse prior also for the states. We can then encounter problems during the first few timesteps of the marginalized state trajectory update; in particular, if we use a bootstrap-style proposal in the mcSMC algorithm it may spread out the particles too much.
This can result in poor mixing during the initial timesteps, as well as numerical difficulties in the computation of the ancestor sampling weights, due to very large values sampled for the states. As an illustration, consider again the model (12), but now with hyperparameters α_v = β_v = 0.001 for the process noise σ²_v. The marginalized proposal for the first timestep, q_1(x_1 | x_0) = p(x_1 | x_0), will then be a Student-t distribution with undefined mean and variance. Figure 2 (left) shows the log-pdf of both this proposal and the target distribution, γ̄_1(x_{0:1}) = p(x_{0:1} | y_1), at time t = 1. It is clear that for mcSMC (blue) the prior q_1 is much more diffuse than the posterior γ̄_1, whereas for cSMC (orange) there is less of a difference.

When working with diffuse priors we suggest dividing the state trajectory into two overlapping blocks (similarly to the blocking method proposed by [34]) and doing Gibbs updates of each block in turn. Figure 3 illustrates the two overlapping blocks x_{0:B+L} (upper) and x_{B+1:T} (lower). To update the first block, where problems due to marginalization are more probable, we use a (non-marginalized) cSMC sampler targeting the posterior distribution of x_{0:B+L} conditioned on the reference trajectory x′_{0:B+L}, the observations y_{1:T}, the non-overlapping part of the second block x′_{B+L+1:T} and the parameters θ. Note that, because of the Markov property when conditioning on θ, the dependence on x′_{B+L+1:T} reduces to only the boundary state x′_{B+L+1}, and the dependence on the observations reduces to y_{1:B+L}. To update the second block, we use mcSMC targeting the posterior distribution of x_{B+1:T} conditioned on the (updated) reference trajectory [x_{B+1:B+L}, x′_{B+L+1:T}], the observations y_{1:T} and the (updated) first block x_{0:B}.
Finally, the parameters θ are sampled from their full conditional given the new reference trajectory x_{0:T}.

[Figure 2 here: left, log-density of proposal and posterior; right, update frequency vs. time for PGAS, mPGAS and mPGAS+blocking.]

Figure 2: Left: log-density for the proposal and the posterior at t = 1 for mcSMC (q_m, γ_m) and cSMC (q_u, γ_u), showing how marginalization can potentially produce a poor proposal distribution in the first timestep. Right: update frequency for the state trajectory for the first few timesteps.

Algorithm 3 outlines one iteration of mPG/mPGAS with this choice of blocking and samplers. In Supplementary D we provide a proof of validity for this blocked Gibbs sampler.

The purpose of the first block is only to update the first few timesteps, in order to get a sufficient concentration of the proposals when conditioning on x_{0:B} for mcSMC. Therefore, it is typically sufficient to use a small value of B; in the example outlined above B > 2 is sufficient to get finite variance in the Student-t distribution. The overlap parameter L, on the other hand, is used to push the boundary state x_{B+L+1} into the interior of the second block, which, due to the forgetting of the dynamical system, reduces the effect of conditioning on this state in the first Gibbs step [34]. Hence, the larger L the better, but at the price of increased computational cost. Since most SSMs have exponential forgetting, using a small value of L is likely to be sufficient in most cases.

In Figure 2 (right), we illustrate the benefit of using blocking to avoid poor mixing during the first timestep when marginalizing with a diffuse prior for the model (12). We used B = 5 and L = 20; all other settings were the same as before.
We consider the update frequency of the state variables, defined as the average number of iterations in which the state changes its value, as a measure of the mixing. It is clear that for mPGAS we get a very low update frequency at t = 1, whereas when we use mPGAS with blocking we obtain the same update frequency as for PGAS.

Algorithm 3 Blocking for mPG/mPGAS
1: x_{0:B+L} ∼ cSMC(x_{0:B+L} | x′_{0:B+L}; y_{1:B+L}, x′_{B+L+1}, θ)
2: x_{B+1:T} ∼ mcSMC(x_{B+1:T} | x_{B+1:B+L}, x′_{B+L+1:T}; x_{0:B}, y_{1:T})
3: θ ∼ p(θ | x_{0:T}, y_{1:T}) = π(θ | χ_T, ν_T)

Figure 3: Division into 2 blocks: x_{0:B+L} (upper) and x_{B+1:T} (lower).

4.2 Marginalized particle Gibbs in a PPL

We have implemented PG, PGAS, mPG and mPGAS in Birch [24], which employs delayed sampling [26] to recognize and utilize conjugacy relationships, and so automatically marginalizes out the parameters of a model, where possible. This saves the user the trouble of deriving the relevant conjugacy relationships for their particular model, or of providing a bespoke implementation of them. We first demonstrate this on a vector-borne disease model of a dengue outbreak.

Dengue is a mosquito-borne disease which affects an estimated 50–100 million people worldwide each year, causing 10000 deaths [35]. We use a data set from an outbreak on the island of Yap in Micronesia in 2011. It contains 197 observations, mostly daily, of the number of newly reported cases. The model used is that described in [26], in turn based on that of the original study [12]. It consists of two coupled susceptible-exposed-infectious-recovered (SEIR) compartmental models, describing the transmission between human and mosquito populations, respectively.
Transition counts between compartments are assumed to be binomially distributed, with beta priors used for all parameters. Observations are also assumed binomial, with an unknown parameter for the reporting rate, which is assigned a beta prior. The beta priors establish conjugate relations with the complete data likelihood, so that the problem is well-suited for inference using mPG/mPGAS.

[Figure 4 here: left, estimated density of the reporting rate for PG, mPG and mIS; right, ACF of the reporting rate for PG and mPG.]

Figure 4: Results of the simulation of the vector-borne disease model. Left: estimated density of the reporting rate parameter, mean of four chains. Marginalized importance sampling (mIS) is included for comparison. Right: estimated autocorrelation function of the reporting rate parameter, mean of four chains.

The model was previously implemented in Birch for [26]. We have added generic implementations of PG, PGAS, mPG and mPGAS to Birch that can be applied to this, and other, models. Figure 4 shows the results of a simulation of four different chains; for each of these, 10000 samples were drawn using PG and mPG. The samplers used N = 1024 particles each. For comparison we also include the results from using marginalized importance sampling (mIS). The autocorrelation of the samples is noticeably improved by marginalizing out the parameters. Corresponding results for PGAS and mPGAS can be found in Supplementary E.

4.3 Models lacking full conjugacy

It may seem that the method we propose is limited to models where the transition and observation probabilities have the conjugacy structure in (5).
However, we can use the results in Section 3 to treat models where only some of the parameters exhibit conjugacy with the complete data likelihood. To this end, we denote by θm the parameters that have a prior distribution that is conjugate with the complete-data likelihood, and by θu the remaining parameters. Then, we can marginalize out θm from the complete-data likelihood as shown in Section 3. The remaining parameters can be sampled using any conventional MCMC method, for instance Metropolis–Hastings. This is possible since PMCMC samplers are nothing but (special-purpose) MCMC kernels, hence they can be combined with normal MCMC kernels in a systematic way. One possibility is to use, say, Metropolis–Hastings within mPG/mPGAS. Another possibility, which we describe below, is to use a marginalized version of the particle marginal Metropolis–Hastings algorithm [2], which we refer to as mPMMH.

Let p̂(y1:T | θu) = ∏_{t=1}^{T} (1/N) ∑_{i=1}^{N} w_t^i be the unbiased estimate of the marginal likelihood given by Algorithm 1, for a fixed value of the parameters θu, and let q(θu | θ′u) be a proposal distribution; then, we can generate samples from the posterior distribution of θu using Algorithm 4.

Algorithm 4 Marginalized particle marginal Metropolis–Hastings
1: Propose θ∗u ∼ q(· | θ′u)
2: Run Algorithm 1 and compute p̂(y1:T | θ∗u)
3: Return θu = θ∗u with probability 1 ∧ [p̂(y1:T | θ∗u) p(θ∗u) q(θ′u | θ∗u)] / [p̂(y1:T | θ′u) p(θ′u) q(θ∗u | θ′u)], else θu = θ′u

To illustrate this method with partial marginalization, we consider the following model describing the evolution of the size of animal populations (see, for instance, [17]):

log nt+1 = log nt
+ [1  (nt)^c] b + σv vt,    yt = nt + σw wt,

where nt is the population size at time t, and b, c, σv, and σw are the unknown parameters. Note that, except for c (= θu), the parameters can be marginalized out by using normal-inverse-gamma and inverse-gamma conjugate priors: b, σ²v ∼ NIG(µ, Λ, αv, βv) and σ²w ∼ IG(αw, βw). For the remaining parameter, we use a N(0, σ²c) prior and a random-walk proposal c∗ ∼ N(c′, τ).

We have implemented mPMMH in Birch and evaluate it on a dataset of observations of the number of song sparrows on Mandarte Island, British Columbia, Canada [33]. The dataset contains the number of birds, counted yearly, between 1978 and 1998. In Figure 5 (left), we report the histogram of the distribution of the density regulation parameter c, estimated using 10,000 samples drawn using Algorithm 4 after a burn-in of 5,000 samples, using N = 512 particles. The distribution of c, as found by our method, is consistent with values reported in the literature (see, for instance, [27] and [33]). In Figure 5 (right), we show the actual counts in the dataset compared with the average n̂1:T and three standard deviations, as sampled by Algorithm 4.

Figure 5: Results of the simulation with parameter values µ = [1, 1], Λ = I, αv = βv = αw = βw = 2.5, σ²c = 4, τ = 0.05. Left: estimated distribution of the density regulation parameter c. Right: observed (marks) and mean filtered population sizes (solid) with 3σ credible interval.

5 Discussion

PG and PGAS can be highly efficient samplers for general SSMs, but are limited by the performance of the hypothetical (but intractable) Gibbs sampler that they approximate. We have proposed to improve on PG/PGAS by marginalizing out the parameters from the state update, to reduce the autocorrelation beyond the limit posed by the hypothetical Gibbs sampler.

Marginalization often improves performance, but this will not always be the case. One example is when there is a diffuse prior on the parameters, in which case marginalization can result in an inefficient SMC sampler. One way to mitigate this is blocking; we propose using two blocks, the first updated using cSMC and the second using mcSMC. One can think of other ways to update the first block, such as a Metropolis–Hastings update with an appropriate proposal; see [11, 25] for related techniques. It is also possible to use an mcSMC update for the first block, as conditioning on the future states will help to avoid the problems related to diffuse priors. The details are quite involved, however, so we prefer the simpler method described in Section 4.1.

Marginalization is possible when there is a conjugacy relationship between the parameters and the complete data likelihood. This may seem a restrictive model class, but in practice there are benefits even if only some of the parameters can be marginalized out, by combining marginalized PMCMC kernels with conventional MCMC kernels. Many models have at least some parameters that enter in a nice way, such as regression coefficients and error variances, where marginalization can provide a performance gain.

Performing the marginalization by hand for every new model can be time consuming.
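To give a flavour of what such a hand derivation entails, here is the standard conjugate treatment of a Gaussian error variance with an inverse-gamma prior (a generic Python sketch under our own naming, not the paper's Birch implementation): both the posterior update and the marginalized likelihood reduce to bookkeeping of sufficient statistics.

```python
from math import lgamma, log, pi

def ig_posterior(ys, mu, alpha, beta):
    # conjugate update for sigma^2 ~ InvGamma(alpha, beta) with
    # ys[i] ~ Normal(mu, sigma^2): the posterior is again inverse-gamma
    n = len(ys)
    s = sum((y - mu) ** 2 for y in ys)
    return alpha + n / 2, beta + s / 2

def log_marginal(ys, mu, alpha, beta):
    # log p(y) with sigma^2 integrated out analytically:
    # log[(2*pi)^(-n/2) * Gamma(alpha + n/2)/Gamma(alpha)
    #      * beta^alpha / (beta + s/2)^(alpha + n/2)]
    n = len(ys)
    s = sum((y - mu) ** 2 for y in ys)
    a_post, b_post = alpha + n / 2, beta + s / 2
    return (-n / 2 * log(2 * pi)
            + lgamma(a_post) - lgamma(alpha)
            + alpha * log(beta) - a_post * log(b_post))

print(ig_posterior([1.0, 2.0, 3.0], 2.0, 1.0, 1.0))  # -> (2.5, 2.0)
```

The normal-inverse-gamma prior used for b and σ²v in Section 4.3 follows the same pattern with vector-valued sufficient statistics; automating exactly this kind of bookkeeping is what a probabilistic programming language can offer.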
Consequently, an important aspect of the method is the possibility of implementing it in a probabilistic programming language. Recent advances in probabilistic programming enable automatic marginalization, making the process easier. We have implemented mPG, mPGAS and mPMMH in Birch, and demonstrated the implementation on two examples. Some further work is required to extend the implementation in Birch to blocking.

Code

Code for all numerical simulations is available at https://github.com/uu-sml/neurips2019-parameter-elimination.

Acknowledgments

This research is financially supported, in part, by the Swedish Research Council via the projects Learning of Large-Scale Probabilistic Dynamical Models (contract number: 2016-04278) and New Directions in Learning Dynamical Systems (NewLEADS) (contract number: 2016-06079), by the Swedish Foundation for Strategic Research (SSF) via the projects Probabilistic Modeling and Inference for Machine Learning (contract number: ICA16-0015) and ASSEMBLE (contract number: RIT15-0012), and by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

References

[1] C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. Annals of Statistics, 37(2):697–725, 2009.

[2] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.

[3] F. M. Calafat, T. Wahl, F. Lindsten, J. Williams, and E. Frajka-Williams. Coherent modulation of the sea-level annual cycle in the United States by Atlantic Rossby waves. Nature Communications, 9(2571), 2018.

[4] O. Cappé, S. J. Godsill, and E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, 2007.

[5] C. M.
Carvalho, M. S. Johannes, H. F. Lopes, and N. G. Polson. Particle learning and smoothing. Statistical Science, 25(1):88–106, 2010.

[6] R. Chen and J. S. Liu. Mixture Kalman filters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(3):493–508, 2000.

[7] N. Chopin, A. Iacobucci, J.-M. Marin, K. Mengersen, C. P. Robert, R. Ryder, and C. Schäfer. On particle learning. arXiv.org, arXiv:1006.0554, 2010.

[8] M. P. Deisenroth, G. Neumann, and J. Peters. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1–2):1–142, 2013.

[9] A. Doucet, N. De Freitas, K. Murphy, and S. Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 176–183, 2000.

[10] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.

[11] P. Fearnhead and L. Meligkotsidou. Augmentation schemes for particle MCMC. Statistics and Computing, 26(6):1293–1306, 2016.

[12] S. Funk, A. J. Kucharski, A. Camacho, R. M. Eggo, L. Yakob, L. M. Murray, and W. J. Edmunds. Comparative analysis of dengue and Zika outbreaks reveals differences by setting and virus. PLOS Neglected Tropical Diseases, 10(12):1–16, 2016.

[13] H. Ge, K. Xu, and Z. Ghahramani. Turing: a language for flexible probabilistic inference. In Proceedings of Machine Learning Research, Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, pages 1682–1690, 2018.

[14] N. D. Goodman and A. Stuhlmüller. The design and implementation of probabilistic programming languages. http://dippl.org, 2014.

[15] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation.
In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 107–113, 1993.

[16] M. D. Hoffman. Autoconj: recognizing and exploiting conjugacy without a domain-specific language. In Advances in Neural Information Processing Systems (NeurIPS) 31, pages 10716–10726, 2018.

[17] R. Lande, S. Engen, and B.-E. Saether. Stochastic population dynamics in ecology and conservation. Oxford University Press, Oxford, 2003.

[18] A. Lee, S. S. Singh, and M. Vihola. Coupled conditional backward sampling particle filter. arXiv.org, arXiv:1806.05852, 2018.

[19] S. Linderman, C. H. Stock, and R. P. Adams. A framework for studying synaptic plasticity with neural spike train data. In Advances in Neural Information Processing Systems (NIPS) 27, 2014.

[20] F. Lindsten, M. I. Jordan, and T. B. Schön. Particle Gibbs with ancestor sampling. Journal of Machine Learning Research, 15:2145–2184, 2014.

[21] V. K. Mansinghka, D. Selsam, and Y. N. Perov. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv.org, arXiv:1404.0099, 2014.

[22] M. Marcos, F. M. Calafat, A. Berihuete, and S. Dangendorf. Long-term variations in global sea level extremes. Journal of Geophysical Research, 120(12):8115–8134, 2015.

[23] L. M. Murray. Bayesian state-space modelling on high-performance hardware using LibBi. Journal of Statistical Software, 67(10):1–36, 2015.

[24] L. M. Murray and T. B. Schön. Automated learning with a probabilistic programming language: Birch. Annual Reviews in Control, 46:29–43, 2018.

[25] L. M. Murray, E. M. Jones, and J. Parslow. On disturbance state-space models and the particle marginal Metropolis–Hastings sampler. SIAM/ASA Journal of Uncertainty Quantification, 1(1):494–521, 2013.

[26] L. M. Murray, D. Lundén, J. Kudlicka, D. Broman, and T. B. Schön.
Delayed sampling and automatic Rao-Blackwellization of probabilistic programs. Proceedings of Machine Learning Research, Twenty-First International Conference on Artificial Intelligence and Statistics, 84:1037–1046, 2018.

[27] K. Nadeem and S. R. Lele. Likelihood based population viability analysis in the presence of observation error. Oikos, 121(10):1656–1664, 2012.

[28] F. Obermeyer, E. Bingham, M. Jankowiak, N. Pradhan, J. Chiu, A. Rush, and N. Goodman. Tensor variable elimination for plated factor graphs. In 36th International Conference on Machine Learning (ICML), 2019.

[29] J. Parslow, N. Cressie, E. P. Campbell, E. Jones, and L. M. Murray. Bayesian learning and predictability in a stochastic nonlinear dynamical model. Ecological Applications, 23(4):679–698, 2013.

[30] A. Pfeffer. Practical probabilistic programming. Manning, 2016.

[31] D. A. Rasmussen, O. Ratmann, and K. Koelle. Inference for nonlinear epidemiological models using genealogies and time series. PLoS Computational Biology, 7(8), 2011.

[32] C. P. Robert and G. Casella. Monte Carlo statistical methods. Springer, 2004.

[33] B.-E. Saether, S. Engen, R. Lande, P. Arcese, and J. N. M. Smith. Estimating the time to extinction in an island population of song sparrows. Proceedings: Biological Sciences, 267(1443):621–626, 2000.

[34] S. S. Singh, F. Lindsten, and E. Moulines. Blocking strategies and stability of particle Gibbs samplers. Biometrika, 104(4):953–969, 2017.

[35] J. D. Stanaway, D. S. Shepard, E. A. Undurraga, Y. A. Halasa, L. E. Coffeng, O. J. Brady, S. I. Hay, N. Bedi, I. M. Bensenor, C. A. Castañeda Orjuela, T.-W. Chuang, K. B. Gibney, Z. A. Memish, A. Rafay, K. N. Ukwaja, N. Yonemoto, and C. J. L. Murray. The global burden of dengue: an analysis from the Global Burden of Disease Study 2013. The Lancet Infectious Diseases, 16(6):712–723, 2016.

[36] G. Storvik.
Particle filters for state-space models with the presence of unknown static parameters. IEEE Transactions on Signal Processing, 50(2):281–289, 2002.

[37] A. Todeschini, F. Caron, M. Fuentes, P. Legrand, and P. Del Moral. Biips: software for Bayesian inference with interacting particle systems. arXiv.org, arXiv:1412.3779, 2014.

[38] D. Tolpin, J. van de Meent, H. Yang, and F. Wood. Design and implementation of probabilistic programming language Anglican. arXiv.org, arXiv:1608.05263, 2016.

[39] I. Valera, F. Ruiz, L. Svensson, and F. Perez-Cruz. Infinite factorial dynamical model. In Advances in Neural Information Processing Systems (NIPS) 28, 2015.

[40] J.-W. van de Meent, H. Yang, V. Mansinghka, and F. Wood. Particle Gibbs with ancestor sampling for probabilistic programs. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.

[41] D. A. van Dyk and T. Park. Partially collapsed Gibbs samplers. Journal of the American Statistical Association, 103(482):790–796, 2008.