{"title": "Sequential Monte Carlo for Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1862, "page_last": 1870, "abstract": "We propose a new framework for how to use sequential Monte Carlo (SMC) algorithms for inference in probabilistic graphical models (PGM). Via a sequential decomposition of the PGM we find a sequence of auxiliary distributions defined on a monotonically increasing sequence of probability spaces. By targeting these auxiliary distributions using SMC we are able to approximate the full joint distribution defined by the PGM. One of the key merits of the SMC sampler is that it provides an unbiased estimate of the partition function of the model. We also show how it can be used within a particle Markov chain Monte Carlo framework in order to construct high-dimensional block-sampling algorithms for general PGMs.", "full_text": "Sequential Monte Carlo for Graphical Models\n\nChristian A. Naesseth\nDiv. of Automatic Control\n\nLink\u00a8oping University\nLink\u00a8oping, Sweden\n\nFredrik Lindsten\nDept. of Engineering\n\nThe University of Cambridge\n\nCambridge, UK\n\nThomas B. Sch\u00a8on\n\nDept. of Information Technology\n\nUppsala University\nUppsala, Sweden\n\nchran60@isy.liu.se\n\nfsml2@cam.ac.uk\n\nthomas.schon@it.uu.se\n\nAbstract\n\nWe propose a new framework for how to use sequential Monte Carlo (SMC) al-\ngorithms for inference in probabilistic graphical models (PGM). Via a sequential\ndecomposition of the PGM we \ufb01nd a sequence of auxiliary distributions de\ufb01ned\non a monotonically increasing sequence of probability spaces. By targeting these\nauxiliary distributions using SMC we are able to approximate the full joint distri-\nbution de\ufb01ned by the PGM. One of the key merits of the SMC sampler is that it\nprovides an unbiased estimate of the partition function of the model. 
We also show\nhow it can be used within a particle Markov chain Monte Carlo framework in order\nto construct high-dimensional block-sampling algorithms for general PGMs.\n\n1\n\nIntroduction\n\nBayesian inference in statistical models involving a large number of latent random variables is in\ngeneral a dif\ufb01cult problem. This renders inference methods that are capable of ef\ufb01ciently utilizing\nstructure important tools. Probabilistic Graphical Models (PGMs) are an intuitive and useful way\nto represent and make use of underlying structure in probability distributions with many interesting\nareas of applications [1].\nOur main contribution is a new framework for constructing non-standard (auxiliary) target distribu-\ntions of PGMs, utilizing what we call a sequential decomposition of the underlying factor graph, to\nbe targeted by a sequential Monte Carlo (SMC) sampler. This construction enables us to make use\nof SMC methods developed and studied over the last 20 years, to approximate the full joint distribu-\ntion de\ufb01ned by the PGM. As a byproduct, the SMC algorithm provides an unbiased estimate of the\npartition function (normalization constant). We show how the proposed method can be used as an\nalternative to standard methods such as the Annealed Importance Sampling (AIS) proposed in [2],\nwhen estimating the partition function. We also make use of the proposed SMC algorithm to design\nef\ufb01cient, high-dimensional MCMC kernels for the latent variables of the PGM in a particle MCMC\nframework. This enables inference about the latent variables as well as learning of unknown model\nparameters in an MCMC setting.\nDuring the last decade there has been substantial work on how to leverage SMC algorithms [3] to\nsolve inference problems in PGMs. The \ufb01rst approaches were PAMPAS [4] and nonparametric belief\npropagation by Sudderth et al. [5, 6]. Since then, several different variants and re\ufb01nements have been\nproposed by e.g. Briers et al. 
[7], Ihler and McAllester [8], Frank et al. [9]. They all rely on various\nparticle approximations of messages sent in a loopy belief propagation algorithm. This means that\nin general, even in the limit of in\ufb01nitely many Monte Carlo samples, they are approximate methods. Compared\nto these approaches our proposed methods are consistent and provide an unbiased estimate of the\nnormalization constant as a by-product.\nAnother branch of SMC-based methods for graphical models has been suggested by Hamze and\nde Freitas [10]. Their method builds on the SMC sampler by Del Moral et al. [11], where the\ninitial target is a spanning tree of the original graph and subsequent steps add edges according to an\nannealing schedule. Everitt [12] extends these ideas to learn parameters using particle MCMC [13].\nYet another take is provided by Carbonetto and de Freitas [14], where an SMC sampler is combined\nwith mean \ufb01eld approximations. Compared to these methods we can handle both non-Gaussian\nand/or non-discrete interactions between variables and there is no requirement to perform MCMC\nsteps within each SMC step.\nThe left-right methods described by Wallach et al. [15] and extended by Buntine [16] to estimate\nthe likelihood of held-out documents in topic models are somewhat related in that they are SMC-\ninspired. However, these are not actual SMC algorithms and they do not produce an unbiased\nestimate of the partition function for a \ufb01nite sample set. On the other hand, a particle learning based\napproach was recently proposed by Scott and Baldridge [17] and it can be viewed as a special case\nof our method for this speci\ufb01c type of model.\n\n2 Graphical models\n\nA graphical model is a probabilistic model which factorizes according to the structure of an under-\nlying graph G = {V,E}, with vertex set V and edge set E. By this we mean that the joint probability\ndensity function (PDF) of the set of random variables indexed by V, XV := {x1, . . . 
, x|V|}, can be represented as a product of factors over the cliques of the graph:\n\np(XV ) = (1/Z) \u220f_{C\u2208C} \u03c8C(XC),    (1)\n\nwhere C is the set of cliques in G, \u03c8C is the factor for clique C, and Z = \u222b \u220f_{C\u2208C} \u03c8C(XC) dXV is the partition function.\nWe will frequently use the notation XI = \u222a_{i\u2208I} {xi} for some subset I \u2286 {1, . . . , |V|} and we write XI for the range of XI (i.e., XI \u2208 XI ). To make the interactions between the random variables explicit we de\ufb01ne a factor graph F = {V, \u03a8, E\u2032} corresponding to G. The factor graph consists of two types of vertices, the original set of random variables XV and the factors \u03a8 = {\u03c8C : C \u2208 C}. The edge set E\u2032 consists only of edges from variables to factors. In Figure 1a we show a simple toy example of an undirected graphical model, and one possible corresponding factor graph, Figure 1b, making the dependencies explicit. Both directed and undirected graphs can be represented by factor graphs.\n\n[Figure 1: Undirected PGM on x1, . . . , x5 (a) and a corresponding factor graph with factors \u03c81, . . . , \u03c85 (b).]\n\n3 Sequential Monte Carlo\n\nIn this section we propose a way to sequentially decompose a graphical model which we then make use of to design an SMC algorithm for the PGM.\n\n3.1 Sequential decomposition of graphical models\n\nSMC methods can be used to approximate a sequence of probability distributions on a sequence of probability spaces of increasing dimension. This is done by recursively updating a set of samples\u2014or particles\u2014with corresponding nonnegative importance weights. 
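To fix ideas, the factorization (1) can be written out for a tiny discrete model, where the partition function Z is still computable by brute-force enumeration. This is a minimal sketch; the two pairwise factor tables are arbitrary illustrative numbers, not a model from the paper.

```python
import itertools

# Toy discrete PGM on three binary variables x1, x2, x3 with two pairwise
# clique factors, mirroring p(X) = (1/Z) * prod_C psi_C(X_C) in (1).
# The factor values are arbitrary illustrative numbers.
psi_12 = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}
psi_23 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

def unnormalized(x):
    # Product of clique factors at a full configuration x = (x1, x2, x3).
    return psi_12[(x[0], x[1])] * psi_23[(x[1], x[2])]

# Brute-force partition function: sum over all 2**3 configurations.
Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=3))

def p(x):
    # Normalized joint PDF as in (1).
    return unnormalized(x) / Z
```

For these particular factor tables the enumeration gives Z = 12.0; the PGMs of practical interest are exactly those where such enumeration is intractable, which is what motivates the SMC construction that follows.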
The typical scenario is that of state inference in state-space models, where the probability distributions targeted by the SMC sampler are the joint smoothing distributions of a sequence of latent states conditionally on a sequence of observations; see e.g., Doucet and Johansen [18] for applications of this type. However, SMC is not limited to these cases and it is applicable to a much wider class of models.\nTo be able to use SMC for inference in PGMs we have to de\ufb01ne a sequence of target distributions. However, these target distributions do not have to be marginal distributions under p(XV ). Indeed, as long as the sequence of target distributions is constructed in such a way that, at some \ufb01nal iteration, we recover p(XV ), all the intermediate target distributions may be chosen quite arbitrarily.\n\n[Figure 2: Examples of \ufb01ve- (top) and three-step (bottom) sequential decomposition of Figure 1, shown as the sequence of subgraphs corresponding to \u03b3\u0303k(XLk ).]\n\nThis is key to our development, since it lets us use the structure of the PGM to de\ufb01ne a sequence of intermediate target distributions for the sampler. We do this by a so called sequential decomposition of the graphical model. 
This amounts to simply adding factors to the target distribution, from the product of factors in (1), at each step of the algorithm and iterating until all the factors have been added. Constructing an arti\ufb01cial sequence of intermediate target distributions for an SMC sampler is a simple, albeit underutilized, idea as it opens up for using SMC samplers for inference in a wide range of probabilistic models; see e.g., Bouchard-C\u00f4t\u00e9 et al. [19], Del Moral et al. [11] for a few applications of this approach.\nGiven a graph G with cliques C, let {\u03c8k}, k = 1, . . . , K, be a sequence of factors de\ufb01ned as follows: \u03c8k(XIk ) = \u220f_{C\u2208Ck} \u03c8C(XC), where Ck \u2282 C are chosen such that \u222a_{k=1}^{K} Ck = C and Ci \u2229 Cj = \u2205, i \u2260 j, and where Ik \u2286 {1, . . . , |V|} is the index set of the variables in the domain of \u03c8k, Ik = \u222a_{C\u2208Ck} C. We emphasize that the cliques in C need not be maximal. In fact even auxiliary factors may be introduced to allow for e.g. annealing between distributions. It follows that the PDF in (1) can be written as p(XV ) = (1/Z) \u220f_{k=1}^{K} \u03c8k(XIk ). Principally, the choices and the ordering of the Ck\u2019s are arbitrary, but in practice they will affect the performance of the proposed sampler. However, in many common PGMs an intuitive ordering can be deduced from the structure of the model, see Section 5.\nThe sequential decomposition of the PGM is then based on the auxiliary quantities \u03b3\u0303k(XLk ) := \u220f_{\u2113=1}^{k} \u03c8\u2113(XI\u2113 ), with Lk := \u222a_{\u2113=1}^{k} I\u2113, for k \u2208 {1, . . . , K}. By construction, LK = V and the joint PDF p(XLK ) will be proportional to \u03b3\u0303K(XLK ). Consequently, by using \u03b3\u0303k(XLk ) as the basis for the target sequence for an SMC sampler, we will obtain the correct target distribution at iteration K. 
However, a further requirement for this to be possible is that all the functions in the sequence are normalizable. For many graphical models this is indeed the case, and then we can use \u03b3\u0303k(XLk ), k = 1 to K, directly as our sequence of intermediate target densities. If, however, \u222b \u03b3\u0303k(XLk ) dXLk = \u221e for some k < K, an easy remedy is to modify the target density to ensure normalizability. This is done by setting \u03b3k(XLk ) = \u03b3\u0303k(XLk )qk(XLk ), where qk(XLk ) is chosen so that \u222b \u03b3k(XLk ) dXLk < \u221e. We set qK(XLK ) \u2261 1 to make sure that \u03b3K(XLK ) \u221d p(XLK ). Note that the integral \u222b \u03b3k(XLk ) dXLk need not be computed explicitly, as long as it can be established that it is \ufb01nite. With this modi\ufb01cation we obtain a sequence of unnormalized intermediate target densities for the SMC sampler as \u03b31(XL1 ) = q1(XL1 )\u03c81(XL1 ) and \u03b3k(XLk ) = \u03b3k\u22121(XLk\u22121 ) (qk(XLk )/qk\u22121(XLk\u22121 )) \u03c8k(XIk ) for k = 2, . . . , K. The corresponding normalized PDFs are given by \u00af\u03b3k(XLk ) = \u03b3k(XLk )/Zk, where Zk = \u222b \u03b3k(XLk ) dXLk. Figure 2 shows two examples of possible subgraphs when applying the decomposition, in two different ways, to the factor graph example in Figure 1.\n\n3.2 Sequential Monte Carlo for PGMs\n\nAt iteration k, the SMC sampler approximates the target distribution \u00af\u03b3k by a collection of weighted particles {X^i_{Lk}, w^i_k}, i = 1, . . . , N. These samples de\ufb01ne an empirical point-mass approximation of the target distribution. In what follows, we shall use the notation \u03bek := XIk\\Lk\u22121 to refer to the collection of random variables that are in the domain of \u03b3k, but not in the domain of \u03b3k\u22121. 
This corresponds to the collection of random variables with which the particles are augmented at each iteration.\nInitially, \u00af\u03b31 is approximated by importance sampling. We proceed inductively and assume that we have at hand a weighted sample {X^i_{Lk\u22121}, w^i_{k\u22121}}, i = 1, . . . , N, approximating \u00af\u03b3k\u22121(XLk\u22121 ). This sample is propagated forward by simulating, conditionally independently given the particle generation up to iteration k \u2212 1, and drawing an ancestor index a^i_k with P(a^i_k = j) \u221d \u03bd^j_{k\u22121} w^j_{k\u22121}, j = 1, . . . , N, where \u03bd^i_{k\u22121} := \u03bdk\u22121(X^i_{Lk\u22121})\u2014known as adjustment multiplier weights\u2014are used in the auxiliary SMC framework to adapt the resampling procedure to the current target density \u00af\u03b3k [20]. Given the ancestor indices, we simulate particle increments {\u03be^i_k}, i = 1, . . . , N, from a proposal density \u03be^i_k \u223c rk(\u00b7|X^{a^i_k}_{Lk\u22121}) on XIk\\Lk\u22121, and augment the particles as X^i_{Lk} := X^{a^i_k}_{Lk\u22121} \u222a \u03be^i_k. After having performed this procedure for the N ancestor indices and particles, they are assigned importance weights w^i_k = Wk(X^i_{Lk}). The weight function, for k \u2265 2, is given by\n\nWk(XLk ) = \u03b3k(XLk ) / ( \u03b3k\u22121(XLk\u22121 ) \u03bdk\u22121(XLk\u22121 ) rk(\u03bek|XLk\u22121 ) ),    (2)\n\nwhere, again, we write \u03bek = XIk\\Lk\u22121. We give a summary of the SMC method in Algorithm 1.\n\nAlgorithm 1 Sequential Monte Carlo (SMC)\nPerform each step for i = 1, . . . , N.\nSample X^i_{L1} \u223c r1(\u00b7).\nSet w^i_1 = \u03b31(X^i_{L1})/r1(X^i_{L1}).\nfor k = 2 to K do\n  Sample a^i_k according to P(a^i_k = j) = \u03bd^j_{k\u22121} w^j_{k\u22121} / \u2211_l \u03bd^l_{k\u22121} w^l_{k\u22121}.\n  Sample \u03be^i_k \u223c rk(\u00b7|X^{a^i_k}_{Lk\u22121}) and set X^i_{Lk} = X^{a^i_k}_{Lk\u22121} \u222a \u03be^i_k.\n  Set w^i_k = Wk(X^i_{Lk}).\nend for\n\nIn the case that Ik \\ Lk\u22121 = \u2205 for some k, resampling and propagation steps are super\ufb02uous. The easiest way to handle this is to simply skip these steps and directly compute importance weights. 
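A minimal sketch of the recursion in Algorithm 1, on an assumed binary chain with one pairwise factor phi per edge, uniform proposals rk, and adjustment multipliers nu = 1 (i.e. plain multinomial resampling); the running product of weight averages anticipates the partition-function estimator discussed in Section 3.3. All numbers are illustrative, not a model from the paper.

```python
import itertools
import random

# Binary chain x1 - x2 - x3 - x4 with one pairwise factor per edge.
phi = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
K, N = 4, 5000

def smc_chain(seed=0):
    rng = random.Random(seed)
    # Step 1: sample x1 from r1 = Uniform{0,1}; w1 = gamma1 / r1 = 1 / 0.5.
    particles = [[rng.randint(0, 1)] for _ in range(N)]
    weights = [2.0] * N
    z_hat = sum(weights) / N  # running product of weight averages
    for k in range(2, K + 1):
        # Resample ancestors proportionally to the weights (nu = 1).
        ancestors = rng.choices(particles, weights=weights, k=N)
        # Propose the increment xi_k uniformly and augment the particles.
        particles = [a + [rng.randint(0, 1)] for a in ancestors]
        # Weight function (2) collapses here to phi(x_{k-1}, x_k) / 0.5.
        weights = [phi[(x[-2], x[-1])] / 0.5 for x in particles]
        z_hat *= sum(weights) / N
    return z_hat

# Exact partition function by brute-force enumeration, for comparison.
Z = sum(phi[(x[0], x[1])] * phi[(x[1], x[2])] * phi[(x[2], x[3])]
        for x in itertools.product([0, 1], repeat=K))
```

With enough particles the returned estimate concentrates around the brute-force value of Z; on a chain the decomposition adds one variable per step, so every iteration augments each particle with a single new increment.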
An alternative approach is to bridge the two target distributions \u00af\u03b3k\u22121 and \u00af\u03b3k similarly to Del Moral et al. [11].\nSince the proposed sampler for PGMs falls within a general SMC framework, standard convergence analysis applies. See e.g., Del Moral [21] for a comprehensive collection of theoretical results on consistency, central limit theorems, and non-asymptotic bounds for SMC samplers.\nThe choices of proposal density and adjustment multipliers can quite signi\ufb01cantly affect the performance of the sampler. It follows from (2) that Wk(XLk ) \u2261 1 if we choose \u03bdk\u22121(XLk\u22121 ) = \u222b \u03b3k(XLk )/\u03b3k\u22121(XLk\u22121 ) d\u03bek and rk(\u03bek|XLk\u22121 ) = \u03b3k(XLk ) / ( \u03bdk\u22121(XLk\u22121 ) \u03b3k\u22121(XLk\u22121 ) ). In this case, the SMC sampler is said to be fully adapted.\n\n3.3 Estimating the partition function\n\nThe partition function of a graphical model is a very interesting quantity in many applications. Examples include likelihood-based learning of the parameters of the PGM, statistical mechanics where it is related to the free energy of a system of objects, and information theory where it is related to the capacity of a channel. However, as stated by Hamze and de Freitas [10], estimating the partition function of a loopy graphical model is a \u201cnotoriously dif\ufb01cult\u201d task. Indeed, even for discrete problems simple and accurate estimators have proved to be elusive, and MCMC methods do not provide any simple way of computing the partition function.\nOn the contrary, SMC provides a straightforward estimator of the normalizing constant (i.e. 
the partition function), given as a byproduct of the sampler according to\n\nZ\u0302^N_k := ( (1/N) \u2211_{i=1}^{N} w^i_k ) \u220f_{\u2113=1}^{k\u22121} { (1/N) \u2211_{i=1}^{N} \u03bd^i_\u2113 w^i_\u2113 }.    (3)\n\nIt may not be obvious to see why (3) is a natural estimator of the normalizing constant Zk. However, a by now well known result is that this SMC-based estimator is unbiased. This result is due to Del Moral [21, Proposition 7.4.1] and, for the special case of inference in state-space models, it has also been established by Pitt et al. [22]. For completeness we also offer a proof using the present notation in the supplementary material. Since ZK = Z, we thus obtain an estimator of the partition function of the PGM at iteration K of the sampler. Besides being unbiased, this estimator is also consistent and asymptotically normal; see Del Moral [21].\nIn [23] we have studied a speci\ufb01c information theoretic application (computing the capacity of a two-dimensional channel) and inspired by the algorithm proposed here we were able to design a sampler with signi\ufb01cantly improved performance compared to the previous state-of-the-art.\n\n4 Particle MCMC and partial blocking\n\nTwo shortcomings of SMC are: (i) it does not solve the parameter learning problem, and (ii) the quality of the estimates of marginal distributions p(XLk ) = \u222b \u00af\u03b3K(XLK ) dXLK\\Lk deteriorates for k \u226a K due to the fact that the particle trajectories degenerate as the particle system evolves (see e.g., [18]). Many methods have been proposed in the literature to address these problems; see e.g. [24] and the references therein. Among these, the recently proposed particle MCMC (PMCMC) framework [13] plays a prominent role. PMCMC algorithms make use of SMC to construct (in general) high-dimensional Markov kernels that can be used within MCMC. 
These methods were shown by [13] to be exact, in the sense that the apparent particle approximation in the construction of the kernel does not change its invariant distribution. This property holds for any number of particles N \u2265 2, i.e., PMCMC does not rely on asymptotics in N for correctness.\nThe fact that the SMC sampler for PGMs presented in Algorithm 1 \ufb01ts under a general SMC umbrella implies that we can also straightforwardly make use of this algorithm within PMCMC. This allows us to construct a Markov kernel (indexed by the number of particles N) on the space of latent variables of the PGM, PN (X\u2032LK , dXLK ), which leaves the full joint distribution p(XV ) invariant. We do not dwell on the details of the implementation here, but refer instead to [13] for the general setup and [25] for the speci\ufb01c method that we have used in the numerical illustration in Section 5.\nPMCMC methods enable blocking of the latent variables of the PGM in an MCMC scheme. Simulating all the latent variables XLK jointly is useful since, in general, this will reduce the autocorrelation when compared to simulating the variables xj one at a time [26]. However, it is also possible to employ PMCMC to construct an algorithm in between these two extremes, a strategy that we believe will be particularly useful in the context of PGMs. Let {V m, m \u2208 {1, . . . , M}} be a partition of V. Ideally, a Gibbs sampler for the joint distribution p(XV ) could then be constructed by simulating, using a systematic or a random scan, from the conditional distributions\n\np(XVm |XV\\Vm ) for m = 1, . . . , M.    (4)\n\nWe refer to this strategy as partial blocking, since it amounts to simulating a subset of the variables, but not necessarily all of them, jointly. Note that, if we set M = |V| and V m = {m} for m = 1, . . . , M, this scheme reduces to a standard Gibbs sampler. 
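The bookkeeping behind partial blocking can be sketched concretely: for an assumed 3 x 3 lattice with pairwise cliques, taking the rows as blocks (a choice that makes every induced subgraph a chain), we collect for each block Vm the set of factors that enter its conditional, cf. Cm in (5). The lattice and the row partition are illustrative choices, not from the paper.

```python
# Pairwise cliques (edges) of an n x n grid, vertices numbered row-wise.
def lattice_cliques(n):
    cliques = []
    for r in range(n):
        for c in range(n):
            v = r * n + c
            if c + 1 < n:
                cliques.append(frozenset([v, v + 1]))  # horizontal edge
            if r + 1 < n:
                cliques.append(frozenset([v, v + n]))  # vertical edge
    return cliques

cliques = lattice_cliques(3)
# Rows as blocks: each induced subgraph is a chain of three vertices.
blocks = [set(range(0, 3)), set(range(3, 6)), set(range(6, 9))]
# Factors entering the conditional of block Vm given the rest, cf. (5):
# C_m = {C in cliques : C intersects Vm}.
cliques_per_block = [[C for C in cliques if C & Vm] for Vm in blocks]
```

For the middle row every vertical edge of the lattice is picked up, so its conditional involves more factors than the boundary rows; this is the same selection that defines the SMC target when block m is updated.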
On the other extreme, with M = 1 and V 1 = V, we get a fully blocked sampler which targets directly the full joint distribution p(XV ). From (1) it follows that the conditional distributions (4) can be expressed as\n\np(XVm |XV\\Vm ) \u221d \u220f_{C\u2208Cm} \u03c8C(XC),    (5)\n\nwhere Cm = {C \u2208 C : C \u2229 V m \u2260 \u2205}. While it is in general not possible to sample exactly from these conditionals, we can make use of PMCMC to facilitate a partially blocked Gibbs sampler for a PGM. By letting p(XVm |XV\\Vm ) be the target distribution for the SMC sampler of Algorithm 1, we can construct a PMCMC kernel P^m_N that leaves the conditional distribution (5) invariant. This suggests the following approach: with X\u2032V being the current state of the Markov chain, update block m by sampling\n\nXVm \u223c P^m_N \u27e8X\u2032V\\Vm \u27e9(X\u2032Vm , \u00b7).    (6)\n\nHere we have indicated explicitly in the notation that the PMCMC kernel for the conditional distribution p(XVm |XV\\Vm ) depends on both X\u2032V\\Vm (which is considered to be \ufb01xed throughout the sampling procedure) and on X\u2032Vm (which de\ufb01nes the current state of the PMCMC procedure).\nAs mentioned above, while being generally applicable, we believe that partial blocking of PMCMC samplers will be particularly useful for PGMs. The reason is that we can choose the vertex sets V m for m = 1, . . . , M in order to facilitate simple sequential decompositions of the induced subgraphs. For instance, it is always possible to choose the partition in such a way that all the induced subgraphs are chains.\n\n5 Experiments\n\nIn this section we evaluate the proposed SMC sampler on three examples to illustrate the merits of our approach. Additional details and results are available in the supplementary material and code to reproduce results can be found in [27]. 
We \ufb01rst consider an example from statistical mechanics, the classical XY model, to illustrate the impact of the sequential decomposition. Furthermore, we pro\ufb01le our algorithm with the \u201cgold standard\u201d AIS [2] and Annealed Sequential Importance Resampling (ASIR1) [11]. In the second example we apply the proposed method to the problem of scoring of topic models, and \ufb01nally we consider a simple toy model, a Gaussian Markov random \ufb01eld (MRF), which illustrates that our proposed method has the potential to signi\ufb01cantly decrease correlations between samples in an MCMC scheme. Furthermore, we provide an exact SMC-approximation of the tree-sampler by Hamze and de Freitas [28] and thereby extend the scope of this powerful method.\n\n5.1 Classical XY model\n\nThe classical XY model (see e.g. [29]) is a member in the family of n-vector models used in statistical mechanics. It can be seen as a generalization of the well known Ising model with a two-dimensional electromagnetic spin. The spin vector is described by its angle x \u2208 (\u2212\u03c0, \u03c0]. We will consider square lattices with periodic boundary conditions. The joint PDF of the classical XY model with equal interaction is given by\n\np(XV ) \u221d exp( \u03b2 \u2211_{(i,j)\u2208E} cos(xi \u2212 xj ) ),    (7)\n\nwhere \u03b2 denotes the inverse temperature.\n\n[Figure 3: Mean-squared-errors for sample size N in the estimates of log Z for AIS and four different orderings in the proposed SMC framework.]\n\nTo evaluate the effect of different sequence orders on the accuracy of the estimates of the log-normalizing-constant log Z we ran several experiments on a 16 \u00d7 16 XY model with \u03b2 = 1.1 (approximately the critical inverse temperature [30]). For simplicity we add one node at a time and all factors bridging this node with previously added nodes. Full adaptation in this case is possible due to the optimal proposal being a von Mises distribution. 
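The unnormalized log-density in (7) is straightforward to evaluate; below is a minimal sketch for an L x L lattice with periodic boundary conditions (the lattice size and angles used are illustrative, not the experimental setup of the paper).

```python
import math

# Unnormalized log-density of the classical XY model in (7).
def xy_log_unnorm(angles, beta):
    # angles: L lists of L spin angles in (-pi, pi], row-major lattice.
    L = len(angles)
    energy = 0.0
    for i in range(L):
        for j in range(L):
            # Right and down neighbours with wrap-around cover each
            # periodic edge of the lattice exactly once.
            energy += math.cos(angles[i][j] - angles[i][(j + 1) % L])
            energy += math.cos(angles[i][j] - angles[(i + 1) % L][j])
    return beta * energy
```

A fully aligned configuration maximizes every cosine term, so the log-density equals beta times the number of periodic edges (2 L^2); sequentially adding one site at a time, as described above, adds at most four bridging cosine factors per step.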
We show results for the following cases: Random neighbour (RND-N) First node selected randomly among all nodes, subsequent nodes selected randomly from the set of nodes with a neighbour in XLk\u22121. Diagonal (DIAG) Nodes added by traversing diagonally (45\u00b0 angle) from left to right. Spiral (SPIRAL) Nodes added spiralling in towards the middle from the edges. Left-Right (L-R) Nodes added by traversing the graph left to right, from top to bottom.\nWe also give results of AIS with single-site-Gibbs updates and 1 000 annealing distributions linearly spaced from zero to one, starting from a uniform distribution (geometric spacing did not yield any improvement over linear spacing for this case). The \u201ctrue value\u201d was estimated using AIS with 10 000 intermediate distributions and 5 000 importance samples. We can see from the results in Figure 3 that designing a good sequential decomposition for the SMC sampler is important. However, the intuitive and fairly simple choice L-R does give very good results comparable to that of AIS.\nFurthermore, we consider a larger size of 64 \u00d7 64 and evaluate the performance of the L-R ordering compared to AIS and the ASIR method. Figure 4 displays box-plots of 10 independent runs. We set N = 10^5 for the proposed SMC sampler and then match the computational costs of AIS and ASIR with this computational budget. A fair amount of time was spent in tuning the AIS and ASIR algorithms; 10 000 linear annealing distributions seemed to give best performance in these cases. We can see that the L-R ordering gives results comparable to fairly well-tuned AIS and ASIR algorithms; the ordering of the methods depending on the temperature of the model. 
One option that does make the SMC algorithm interesting for these types of applications is that it can easily be parallelized over the particles, whereas AIS/ASIR has limited possibilities of parallel implementation over the (crucial) annealing steps.\n\n1 ASIR is a speci\ufb01c instance of the SMC sampler by [11], corresponding to AIS with the addition of resampling steps, but to avoid confusion with the proposed method we choose to refer to it as ASIR.\n\n[Figure 4: The logarithm of the estimated partition function for the 64 \u00d7 64 XY model with inverse temperature 0.5 (left), 1.1 (middle) and 1.7 (right).]\n\n[Figure 6: Estimates of the log-likelihood of held-out documents for various datasets. (a) Small simulated example. (b) PMC. (c) 20 newsgroups.]\n\n5.2 Likelihood estimation in topic models\n\nTopic models such as Latent Dirichlet Allocation (LDA) [31] are popular models for reasoning about large text corpora. Model evaluation is often conducted by computing the likelihood of held-out documents w.r.t. a learnt model. However, this is a challenging problem on its own\u2014which has received much recent interest [15, 16, 17]\u2014since it essentially corresponds to computing the partition function of a graphical model; see Figure 5. The SMC procedure of Algorithm 1 can be used to solve this problem by de\ufb01ning a sequential decomposition of the graphical model. In particular, we consider the decomposition corresponding to \ufb01rst including the node \u03b8 and then, subsequently, introducing the nodes z1 to zM in any order. Interestingly, if we then make use of a Rao-Blackwellization over the variable \u03b8, the SMC sampler of Algorithm 1 reduces exactly to a method that has previously been proposed for this speci\ufb01c problem [17]. 
In [17], the method is derived by reformulating the model in terms of its suf\ufb01cient statistics and phrasing this as a particle learning problem; here we obtain the same procedure as a special case of the general SMC algorithm operating on the original model.\n\n[Figure 5: LDA as a graphical model.]\n\nWe use the same data and learnt models as Wallach et al. [15], i.e. 20 newsgroups, and PubMed Central abstracts (PMC). We compare with the Left-Right-Sequential (LRS) sampler [16], which is an improvement over the method proposed by Wallach et al. [15]. Results on simulated and real data experiments are provided in Figure 6. For the simulated example (Figure 6a), we use a small model with 10 words and 4 topics to be able to compute the exact log-likelihood. We keep the number of particles in the SMC algorithm equal to the number of Gibbs steps in LRS; this means LRS is about an order-of-magnitude more computationally demanding than the SMC method. Despite the fact that the SMC sampler uses only about a tenth of the computational time of the LRS sampler, it performs signi\ufb01cantly better in terms of estimator variance. The other two plots show results on real data with 10 held-out documents for each dataset. For a \ufb01xed number of Gibbs steps we choose the number of particles for each document to make the computational cost approximately equal. Run #2 has twice the number of particles/samples as in run #1. We show the mean of 10 runs and error-bars estimated using bootstrapping with 10 000 samples. 
Computing the logarithm of Z\u0302 introduces a negative bias, which means larger values of log Z\u0302 typically imply more accurate results. The results on real data do not show the drastic improvement we see in the simulated example, which could be due to degeneracy problems for long documents. An interesting approach that could improve results would be to use an SMC algorithm tailored to discrete distributions, e.g. Fearnhead and Clifford [32].\n\n5.3 Gaussian MRF\n\nFinally, we consider a simple toy model to illustrate how the SMC sampler of Algorithm 1 can be incorporated in PMCMC sampling. We simulate data from a zero mean Gaussian 10 \u00d7 10 lattice MRF with observation and interaction standard deviations of \u03c3i = 1 and \u03c3ij = 0.1 respectively. We use the proposed SMC algorithm together with the PMCMC method by Lindsten et al. [25]. We compare this with standard Gibbs sampling and the tree sampler by Hamze and de Freitas [28].\nWe use a moderate number of N = 50 particles in the PMCMC sampler (recall that it admits the correct invariant distribution for any N \u2265 2). In Figure 7 we can see the empirical autocorrelation functions (ACF) centered around the true posterior mean for variable x82 (selected randomly from among XV; similar results hold for all the variables of the model). Due to the strong interaction between the latent variables, the samples generated by the standard Gibbs sampler are strongly correlated. Tree-sampling and PMCMC with partial blocking show nearly identical gains compared to Gibbs. This is interesting, since it suggests that simulating from the SMC-based PMCMC kernel can be almost as ef\ufb01cient as exact simulation, even using a moderate number of particles. Indeed, PMCMC with partial blocking can be viewed as an exact SMC-approximation of the tree sampler, extending the scope of tree-sampling beyond discrete and Gaussian models. 
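The empirical ACF underlying comparisons like the one in Figure 7 can be sketched generically for a single scalar chain of MCMC samples (a standard estimator, not code from the paper):

```python
# Empirical autocorrelation function (ACF) of one scalar MCMC chain.
def empirical_acf(chain, max_lag):
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    acf = []
    for lag in range(max_lag + 1):
        # Biased sample autocovariance at this lag, normalized by the
        # lag-0 variance so that acf[0] == 1.
        cov = sum((chain[t] - mean) * (chain[t + lag] - mean)
                  for t in range(n - lag)) / n
        acf.append(cov / var)
    return acf
```

acf[0] is 1 by construction; slowly decaying values at larger lags indicate strongly correlated samples, which is the behaviour reported above for the single-site Gibbs sampler on this model.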
The fully blocked PMCMC algorithm achieves the best ACF, dropping off to zero considerably faster than for the other methods. This is not surprising, since this sampler simulates all the latent variables jointly, which reduces the autocorrelation, in particular when the latent variables are strongly dependent. However, it should be noted that this method also has the highest computational cost per iteration.

Figure 7: The empirical ACF for Gibbs sampling, PMCMC, PMCMC with partial blocking, and tree sampling.

6 Conclusion

We have proposed a new framework for inference in PGMs using SMC and illustrated it on three examples. These examples show that it can be a viable alternative to standard methods used for inference and partition function estimation problems. An interesting avenue for future work is combining our proposed methods with AIS, to see if we can improve on both.

Acknowledgments

We would like to thank Iain Murray for his kind and very prompt help in providing the data for the LDA example. This work was supported by the projects Learning of complex dynamical systems (contract number: 637-2014-466) and Probabilistic modeling of dynamical systems (contract number: 621-2013-5524), both funded by the Swedish Research Council.

References

[1] M. I. Jordan. Graphical models. Statistical Science, 19(1):140–155, 2004.

[2] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

[3] A. Doucet, N. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer, New York, 2001.

[4] M. Isard. PAMPAS: Real-valued graphical models for computer vision. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Madison, WI, USA, June 2003.

[5] E. B. Sudderth, A. T. Ihler, W. T. Freeman, and A. S. Willsky. Nonparametric belief propagation.
In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Madison, WI, USA, 2003.

[6] E. B. Sudderth, A. T. Ihler, M. Isard, W. T. Freeman, and A. S. Willsky. Nonparametric belief propagation. Communications of the ACM, 53(10):95–103, 2010.

[7] M. Briers, A. Doucet, and S. S. Singh. Sequential auxiliary particle belief propagation. In Proceedings of the 8th International Conference on Information Fusion, Philadelphia, PA, USA, 2005.

[8] A. T. Ihler and D. A. McAllester. Particle belief propagation. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Clearwater Beach, FL, USA, 2009.

[9] A. Frank, P. Smyth, and A. T. Ihler. Particle-based variational inference for continuous systems. In Advances in Neural Information Processing Systems (NIPS), pages 826–834, 2009.

[10] F. Hamze and N. de Freitas. Hot coupling: a particle approach to inference and normalization on pairwise undirected graphs of arbitrary topology. In Advances in Neural Information Processing Systems (NIPS), 2005.

[11] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B, 68(3):411–436, 2006.

[12] R. G. Everitt. Bayesian parameter estimation for latent Markov random fields and social networks. Journal of Computational and Graphical Statistics, 21(4):940–960, 2012.

[13] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 72(3):269–342, 2010.

[14] P. Carbonetto and N. de Freitas. Conditional mean field. In Advances in Neural Information Processing Systems (NIPS) 19. MIT Press, 2007.

[15] H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models.
In Proceedings of the 26th International Conference on Machine Learning, pages 1105–1112, 2009.

[16] W. Buntine. Estimating likelihoods for topic models. In Advances in Machine Learning, pages 51–64. Springer, 2009.

[17] G. S. Scott and J. Baldridge. A recursive estimate for the predictive likelihood in a topic model. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1105–1112, Clearwater Beach, FL, USA, 2009.

[18] A. Doucet and A. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan and B. Rozovskii, editors, The Oxford Handbook of Nonlinear Filtering. Oxford University Press, 2011.

[19] A. Bouchard-Côté, S. Sankararaman, and M. I. Jordan. Phylogenetic inference via sequential Monte Carlo. Systematic Biology, 61(4):579–593, 2012.

[20] M. K. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, 1999.

[21] P. Del Moral. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Probability and its Applications. Springer, 2004.

[22] M. K. Pitt, R. S. Silva, P. Giordani, and R. Kohn. On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. Journal of Econometrics, 171:134–151, 2012.

[23] C. A. Naesseth, F. Lindsten, and T. B. Schön. Capacity estimation of two-dimensional channels using sequential Monte Carlo. In Proceedings of the IEEE Information Theory Workshop (ITW), Hobart, Tasmania, Australia, November 2014.

[24] F. Lindsten and T. B. Schön. Backward simulation methods for Monte Carlo statistical inference. Foundations and Trends in Machine Learning, 6(1):1–143, 2013.

[25] F. Lindsten, M. I. Jordan, and T. B. Schön. Particle Gibbs with ancestor sampling.
Journal of Machine Learning Research, 15:2145–2184, June 2014.

[26] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, New York, 2004.

[27] C. A. Naesseth, F. Lindsten, and T. B. Schön. smc-pgm, 2014. URL http://dx.doi.org/10.5281/zenodo.11947.

[28] F. Hamze and N. de Freitas. From fields to trees. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI), Banff, Canada, July 2004.

[29] J. M. Kosterlitz and D. J. Thouless. Ordering, metastability and phase transitions in two-dimensional systems. Journal of Physics C: Solid State Physics, 6(7):1181, 1973.

[30] Y. Tomita and Y. Okabe. Probability-changing cluster algorithm for two-dimensional XY and clock models. Physical Review B, 65:184405, 2002.

[31] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March 2003.

[32] P. Fearnhead and P. Clifford. On-line inference for hidden Markov models via particle filters. Journal of the Royal Statistical Society: Series B, 65(4):887–899, 2003.