{"title": "Sample Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 425, "page_last": 432, "abstract": "", "full_text": "Sample Propagation\n\nComputer Science Division\n\nUniversity of California, Berkeley\n\nMark A. Paskin\n\nBerkeley, CA 94720\nmark@paskin.org\n\nAbstract\n\nRao\u2013Blackwellization is an approximation technique for probabilistic in-\nference that \ufb02exibly combines exact inference with sampling. It is useful\nin models where conditioning on some of the variables leaves a sim-\npler inference problem that can be solved tractably. This paper presents\nSample Propagation, an ef\ufb01cient implementation of Rao\u2013Blackwellized\napproximate inference for a large class of models. Sample Propagation\ntightly integrates sampling with message passing in a junction tree, and\nis named for its simple, appealing structure: it walks the clusters of a\njunction tree, sampling some of the current cluster\u2019s variables and then\npassing a message to one of its neighbors. We discuss the application of\nSample Propagation to conditional Gaussian inference problems such as\nswitching linear dynamical systems.\n\n1 Introduction\nMessage passing on junction trees is an ef\ufb01cient means of solving many probabilistic in-\nference problems [1, 2]. However, as these are exact methods, their computational costs\nmust scale with the complexity of the inference problem, making them inapplicable to very\ndemanding inference tasks. This happens when the messages become too expensive to\ncompute, as in discrete models of large treewidth or conditional Gaussian models [3].\n\nIn these settings it is natural to investigate whether junction tree techniques can be com-\nbined with sampling to yield fast, accurate approximate inference algorithms. One way to\ndo this is to use sampling to approximate the messages, as in HUGS [4, 5]. 
This strategy has two disadvantages: first, the samples must be stored, which limits the sample size by space constraints (rather than time constraints); and second, variables are sampled using only local information, leading to samples that may not be likely under the entire model.

Another way to integrate sampling and message passing is via Rao–Blackwellization, where we repeatedly sample a subset of the model's variables and then compute all of the messages exactly, conditioned on these sample values. This technique, suggested in [6] and studied in [7], yields a powerful and flexible approximate inference algorithm; however, it can be expensive because the junction tree algorithm must be run for every sample.

In this paper, we present a simple implementation of Rao–Blackwellized approximate inference that avoids running the entire junction tree algorithm for every sample. We develop a new message passing algorithm for junction trees that supports fast retraction of evidence, and we tightly integrate it with a blocking Gibbs sampler so that only one message must be recomputed per sample. The resulting algorithm, Sample Propagation, has an appealing structure: it walks the clusters of a junction tree, resampling some of the current cluster's variables and then passing a message to the next cluster in the walk.

2 Rao–Blackwellized approximation using junction tree inference

We start by presenting our notation and assumptions on the probability model. Then we summarize the three basic ingredients of our approach: message passing in a junction tree, Rao–Blackwellized approximation, and sampling via Markov chain Monte Carlo.

2.1 The probability model

Let X = (Xi : i ∈ I) be a vector of random variables indexed by the finite, ordered set I, and for each index i let Xi be the range of Xi. We will use the symbols A, B, C, D and E to denote subsets of the index set I.
For each subset A, let XA ≡ (Xi : i ∈ A) be the corresponding subvector of random variables and let XA be its range.

It greatly simplifies the exposition to develop a simple notation for assignments of values to subsets of variables. An assignment to a subset A is a set of pairs {(i, xi) : i ∈ A}, one per index i ∈ A, where xi ∈ Xi. We use the symbols u, v, and w to represent assignments, and we use XA to denote the set of assignments to the subset A (with the shorthand X ≡ XI). We use two operations to generate new assignments from old assignments. Given assignments u and v to disjoint subsets A and B, respectively, their union u ∪ v is an assignment to A ∪ B. If u is an assignment to A then the restriction of u to another subset B is uB ≡ {(i, xi) ∈ u : i ∈ B}, an assignment to A ∩ B. We also let functions act on assignments in the natural way: if u = {(i, xi) : i ∈ D} is an assignment to D and f is a function whose domain is XD, then we use f(u) to denote f(xi : i ∈ D).

We consider probability densities of the form

    p(u) ∝ ∏_{C∈C} ψC(uC),    u ∈ X    (1)

where C is a set of subsets of I and each ψC is a potential function over C (i.e., a non-negative function of XC). This class includes directed graphical models (i.e., Bayesian networks) and undirected graphical models such as Markov random fields. Observed variables are reflected in the model by evidence potentials. We use pA(·) to denote the marginal density of XA and pA|B(·|·) to denote the conditional density of XA given XB. Finally, we use the notation of finite measure spaces for simplicity, but our approach extends to the continuous case.

2.2 Junction tree inference

Given a density of the form (1), we view the problem of probabilistic inference as that of computing the expectation of a function f: E[f(X)] = Σ_{u∈X} p(u)f(u). This sum can be expensive to compute when X is a large space. When the desired expectation is "local" in that f depends only upon some subset of the variables XD, we can compute the expectation more cheaply using a marginal density as

    E[f(XD)] = Σ_{u∈XC} pC(u)f(uD)    (2)

where pC is the marginal density of XC and C ⊇ D "covers" the input of the function. If this sum is tractable, then we have reduced the problem to that of computing pC.

We can compute this marginal via message passing on a junction tree [1, 2]. A junction tree for C is a singly-connected, undirected graph (C, E) with the junction tree property: for each pair of nodes (or clusters) A, B ∈ C that contain some i ∈ I, every cluster on the unique path between A and B also contains i. In what follows we assume we have a junction tree for C with a cluster that covers D, the input of f. (Such a junction tree can always be found, but we may have to enlarge the subsets in C.)

Whereas the HUGIN message passing algorithm [2] may be more familiar, Sample Propagation is most easily described by extending the Shafer–Shenoy algorithm [1]. In this algorithm, we define for each edge B → C of the junction tree a potential over B ∩ C:

    µBC(u) ≡ Σ_{v∈X_{B\C}} ψB(u ∪ v) ∏_{(A,B)∈E, A≠C} µAB(uA ∪ vA),    u ∈ X_{B∩C}    (3)

µBC is called the message from B to C. Note that this definition is recursive (messages can depend on each other), with the base case being messages from leaf clusters of the junction tree. For each cluster C we define a potential βC over C by

    βC(u) ≡ ψC(u) ∏_{(B,C)∈E} µBC(uB),    u ∈ XC    (4)

βC is called the cluster belief of C, and it follows that βC ∝ pC, i.e., that the cluster beliefs are the marginals over their respective variables (up to renormalization).
Thus we can use the (normalized) cluster beliefs βC for some C ⊇ D to compute the expectation (2).

In what follows we will also be interested in computing conditional cluster densities given an evidence assignment w to a subset of the variables XE. Because p_{I\E|E}(u | w) ∝ p(u ∪ w), we can "enter in" this evidence by instantiating w in every cluster potential ψC. The cluster beliefs (4) will then be proportional to the conditional density p_{C\E|E}(· | w).

Junction tree inference is often the most efficient means of computing exact solutions to inference problems of the sort described above. However, the sums required by the messages (3) or the function expectations (2) are often prohibitively expensive to compute. If the variables are all finite-valued, this happens when the clusters of the junction tree are too large; if the model is conditional-Gaussian, this happens when the messages, which are mixtures of Gaussians, have too many mixture components [3].

2.3 Rao–Blackwellized approximate inference

In cases where the expectation is intractable to compute exactly, it can be approximated by a Monte Carlo estimate:

    E[f(XD)] ≈ (1/N) Σ_{n=1}^{N} f(v^n_D)    (5)

where {v^n : 1 ≤ n ≤ N} are a set of samples of X. However, obtaining a good estimate will require many samples if f(XD) has high variance.

Many models have the property that while computing exact expectations is intractable, there exists a subset of random variables XE such that the conditional expectation E[f(XD) | XE = xE] can be computed efficiently.
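The following sketch illustrates why such conditioning helps: it compares the plain estimate (5) against an estimator that averages exact conditional expectations (the Rao–Blackwellized estimate developed next). The two-variable model and all its constants are invented for illustration:

```python
import random
import statistics

# Toy comparison of the plain Monte Carlo estimate (5) with an estimator that
# averages exact conditional expectations (Rao-Blackwellization). The model
# and constants are invented:  X_E ~ Bernoulli(0.5),
# X_D | X_E = e ~ Bernoulli(0.2 + 0.6 e),  f = X_D,  so E[f] = 0.5.
random.seed(1)

def run(N):
    e_samples = [random.random() < 0.5 for _ in range(N)]
    d_samples = [random.random() < 0.2 + 0.6 * e for e in e_samples]
    plain = sum(d_samples) / N                      # eq. (5): raw samples of X_D
    rb = sum(0.2 + 0.6 * e for e in e_samples) / N  # average of E[f | X_E = e]
    return plain, rb

plains, rbs = zip(*(run(200) for _ in range(500)))
# The conditioned estimator is never higher-variance (Rao-Blackwell theorem):
print(statistics.pvariance(plains), statistics.pvariance(rbs))
```

Both estimators are unbiased for E[f] = 0.5, but the conditioned one has visibly smaller spread across repetitions.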
This leads to the Rao–Blackwellized estimate, where we use a set of samples {w^n : 1 ≤ n ≤ N} of XE to approximate

    E[f(XD)] = E[E[f(XD) | XE]] ≈ (1/N) Σ_{n=1}^{N} E[f(XD) | w^n]    (6)

The first advantage of this scheme over standard Monte Carlo integration is that the Rao–Blackwell theorem guarantees that the expected squared error of the estimate (6) is upper bounded by that of (5), and strictly so when f(XD) depends on X_{D\E}. A second advantage is that (6) requires samples from a smaller (and perhaps better-behaved) probability space.

Algorithm 1 Rao–Blackwell estimation on a junction tree
Input: A set of samples {w^n : 1 ≤ n ≤ N} of XE, a function f of XD, and a cluster C ⊇ D
Output: An estimate f̂ ≈ E[f(XD)]
1: Initialize the estimator f̂ = 0.
2: for n = 1 to N do
3:   Enter the assignment w^n as evidence into the junction tree.
4:   Use message passing to compute the beliefs βC ∝ p_{C\E|E}(· | w^n) via (3) and (4).
5:   Compute the expectation E[f(XD) | w^n] via (7).
6:   Set f̂ = f̂ + E[f(XD) | w^n].
7: Set f̂ = f̂/N.

However, the Rao–Blackwellized estimate (6) is more expensive to compute than (5) because we must compute conditional expectations. In many cases, message passing in a junction tree can be used to implement these computations (see Algorithm 1). We can enter each sample assignment w^n as evidence into the junction tree and use message passing to compute the conditional density p_{C\E|E}(· | w^n) for some cluster C that covers D. We then compute the conditional expectation as

    E[f(XD) | w^n] = Σ_{u∈X_{C\E}} p_{C\E|E}(u | w^n) f(uD ∪ w^n_D)    (7)

2.4 Markov chain Monte Carlo

We now turn to the problem of obtaining the samples {w^n} of XE.
Markov chain Monte Carlo (MCMC) is a powerful technique for generating samples from a complex distribution p; we design a Markov chain whose stationary distribution is p, and simulate the chain to obtain samples [8]. One simple MCMC algorithm is the Gibbs sampler, where each successive state of the Markov chain is chosen by resampling one variable conditioned on the current values of the remaining variables. A more advanced technique is "blocking" Gibbs sampling, where we resample a subset of variables in each step; this technique can yield Markov chains that mix more quickly [9].

To obtain the benefits of sampling in a smaller space, we would like to sample directly from the marginal pE; however, this requires us to sum out the nuisance variables X_{I\E} from the joint density p. Blocking Gibbs sampling is particularly attractive in this setting because message passing can be used to implement the required marginalizations.¹ Assume that the current state of the Markov chain over XE is w^n. To generate the next state of the chain w^{n+1} we choose a cluster C (randomly, or according to a schedule) and resample X_{C∩E} given w^n_{E\C}; i.e., we resample the E variables within C given the E variables outside C. The transition density can be computed by entering the evidence w^n_{E\C} into the junction tree, computing the cluster belief at C, and marginalizing down to a conditional density over X_{C∩E}. The complete Gibbs sampler is given as Algorithm 2.²

3 Sample Propagation

Algorithms 1 and 2 represent two of the three key ideas behind our proposal: both Gibbs sampling and Rao–Blackwellized estimation can be implemented efficiently using message passing on a junction tree.
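The resampling step of a blocking Gibbs sampler can be sketched without the junction tree machinery; here each "block" is a single binary variable drawn from its exact conditional, and the target potential values are invented:

```python
import random

# A minimal Gibbs sampler in the spirit of Algorithm 2, without the junction
# tree machinery: each "block" is a single binary variable, resampled from its
# exact conditional. Target p(x1, x2) ∝ psi(x1, x2); the weights are invented.
random.seed(2)
psi = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 6.0}

def resample(idx, state):
    # Conditional of X_idx given the other variable, by restricting psi.
    w = []
    for x in (0, 1):
        s = list(state)
        s[idx] = x
        w.append(psi[tuple(s)])
    state[idx] = 0 if random.random() < w[0] / (w[0] + w[1]) else 1

state = [0, 0]
counts = {k: 0 for k in psi}
N = 50_000
for n in range(N):
    resample(n % 2, state)          # deterministic block schedule
    counts[tuple(state)] += 1
freq = {k: v / N for k, v in counts.items()}
print(freq)  # each frequency ≈ psi / sum(psi), i.e. 1/12, 2/12, 3/12, 6/12
```

The empirical state frequencies converge to the normalized potential, as the stationary-distribution argument predicts.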
The third idea is that these two uses of message passing can be interleaved so that each sample requires only one message to be computed.

¹Interestingly, the blocking Gibbs proposal [9] makes a different use of junction tree inference than we do here: they use message passing within a block of variables to efficiently generate a sample.
²In cases where the transition density p_{C∩E|E\C}(· | w^n_{E\C}) is too large to represent or too difficult to sample from, we can use the Metropolis–Hastings algorithm, where we instead sample from a simpler proposal distribution q_{C∩E} and then accept or reject the proposal [8].

Algorithm 2 Blocking Gibbs sampler on a junction tree
Input: A subset of variables XE to sample and a sample size N
Output: A set of samples {w^n : 1 ≤ n ≤ N} of XE
1: Choose an initial assignment w^0 ∈ XE.
2: for n = 1 to N do
3:   Choose a cluster C ∈ C.
4:   Enter the evidence w^{n−1}_{E\C} into the junction tree.
5:   Use message passing to compute the beliefs βC ∝ p_{C|E\C}(· | w^{n−1}_{E\C}) via (3) and (4).
6:   Marginalize over X_{C\E} to obtain the transition density p_{C∩E|E\C}(· | w^{n−1}_{E\C}).
7:   Sample w^n_{C∩E} ∼ p_{C∩E|E\C}(· | w^{n−1}_{E\C}) and set w^n_{E\C} = w^{n−1}_{E\C}.

3.1 Lazy updating of the Rao–Blackwellized estimates

Algorithms 1 and 2 both process the samples sequentially, so the first advantage of merging them is that the sample set need not be stored. The second advantage is that, by being selective about when the Rao–Blackwellized estimator is updated, we can compute the messages once, not twice, per sample.

When the Gibbs sampler chooses to resample a cluster C that covers D (the input of f), we can update the Rao–Blackwellized estimator for free.
In particular, the Gibbs sampler computes the cluster belief βC ∝ p_{C|E\C}(· | w^{n−1}_{E\C}) in order to compute the transition density p_{C∩E|E\C}(· | w^{n−1}_{E\C}). Once it samples w^n_{C∩E} from this density, we can instantiate the sample in the belief βC to obtain the conditional density p_{C\E|E}(· | w^n) needed by the Rao–Blackwellized estimator. (This follows from the fact that w^n_{E\C} = w^{n−1}_{E\C}.) In fact, when it is tractable to do so, we can simply use the cluster belief βC to update the estimator in (7); because it treats more variables exactly, it can yield a lower-variance estimate.

Therefore, if we are willing to update the Rao–Blackwellized estimator only when the Gibbs sampler chooses a cluster that covers the function's inputs, we can focus on reducing the computational requirements of the Gibbs sampler. In this scheme the estimate will be based on fewer samples, but the samples that are used will be less correlated because they are more distant from each other in the Markov chain. In parallel estimation problems where every cluster is computing expectations, every sample will be used to update an estimate, but not every estimate will be updated by every sample.

3.2 Optimizing the Gibbs sampler

We now turn to the Gibbs sampler. The Gibbs sampler computes the messages so that it can compute the cluster belief βC when it resamples within a cluster C. An important property of the computation (4) is that it requires only those messages directed towards C; thus, we have again reduced by half the number of messages required per sample.

The difficulty in further minimizing the number of messages computed by the Gibbs sampler is that the evidence on the junction tree is constantly changing.
It will therefore be useful to modify the message passing so that, rather than instantiating all the evidence and then passing messages, the evidence is instantiated on the fly, on a per-cluster basis. For each edge B → C we define a potential µ_{BC|E} by

    µ_{BC|E}(u, w) ≡ Σ_{v∈X_{B\(C∪E)}} ψB(u ∪ v ∪ w_{B\C}) ∏_{(A,B)∈E, A≠C} µ_{AB|E}((u ∪ v ∪ w_{B\C})_A, w)    (8)

where u ∈ X_{B∩C} and w ∈ XE. This is the conditional message from B to C given evidence w on XE. Figure 1 illustrates how the ranges of the assignment variables u, v, and w cover the variables of B; the intuition is that when we send a message from B to C, we instantiate all evidence variables that are in B but not those that are in C; this gives us the freedom to later instantiate X_{C∩E} as we wish, or not at all. It is easy to verify that the conditional belief β_{C|E} given by

    β_{C|E}(u, w) ≡ ψC(u) ∏_{(B,C)∈E} µ_{BC|E}(uB, w),    u ∈ XC, w ∈ XE    (9)

is proportional to the conditional density p_{C|E\C}(u | w_{E\C}).³

[Figure 1: A Venn diagram showing how the ranges of the assignment variables in (8) cover the cluster B.]

Algorithm 3 Sample Propagation
Input: A function f of XD, a cluster C ⊇ D, a subset E to sample, and a sample size N
Output: An estimate f̂ ≈ E[f(XD)]
1: Choose an initial assignment w^0 ∈ XE and compute the messages µ_{AB|E}(·, w^0) via (8).
2: Choose a cluster C1 ∈ C, initialize the estimator f̂ = 0, and set the sample count M = 0.
3: for n = 1 to N do
4:   Compute the conditional cluster belief β_{Cn|E}(·, w^{n−1}) ∝ p_{Cn|E\Cn}(· | w^{n−1}_{E\Cn}) via (9).
   Advance the Markov chain:
5:   Marginalize over X_{Cn\E} to obtain the transition density p_{Cn∩E|E\Cn}(· | w^{n−1}_{E\Cn}).
6:   Sample w^n_{Cn∩E} ∼ p_{Cn∩E|E\Cn}(· | w^{n−1}_{E\Cn}) and set w^n_{E\Cn} = w^{n−1}_{E\Cn}.
   Update any estimates to be computed at Cn:
7:   if D ⊆ Cn ∪ E then
8:     Instantiate w^n_{Cn} in β_{Cn|E} and normalize to obtain p_{Cn\E|E}(· | w^n).
9:     Compute the expectation E[f(XD) | w^n] via (7).
10:    Set f̂ = f̂ + E[f(XD) | w^n] and increment the sample count M.
   Take the next step of the walk:
11:  Choose a cluster C_{n+1} that is a neighbor of Cn.
12:  Recompute the message µ_{Cn C_{n+1}|E}(·, w^n) via (8).
13: Set f̂ ← f̂/M.

Using these modified definitions, we can dramatically reduce the number of messages computed per sample. In particular, the conditional messages have the following important property:

Proposition. Let w and w′ be two assignments to E such that w_{E\D} = w′_{E\D} for some cluster D. Then for all edges B → C with C closer to D than B, µ_{BC|E}(u, w) = µ_{BC|E}(u, w′).

Proof. Assume by induction that the messages into B (except the one from C) are equal given w or w′. There are two cases to consider. If (E ∩ D) has no overlap with (E ∩ B), then w_{B\C} = w′_{B\C} and the equality follows from (8). Otherwise, by the junction tree property we know that if i ∈ B and i ∈ D, then i ∈ C, so again we get w_{B\C} = w′_{B\C}.

Thus, when we resample a cluster Cn, we have w^n_{E\Cn} = w^{n−1}_{E\Cn}, and so only those messages directed away from Cn change. In addition, as argued above, when we resample C_{n+1} in iteration n + 1, we only require the messages directed towards C_{n+1}.
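The consequence of this argument can be sketched directly: after resampling at one cluster, only messages along the directed path to the next cluster become stale, so a neighbor-to-neighbor walk recomputes exactly one message per step. The tree and cluster names below are invented:

```python
from collections import deque

# Sketch of the message-reuse argument: after resampling at cluster a, the
# messages that become stale are exactly those directed away from a, and the
# next update at b needs only messages directed towards b, so only messages
# on the path a -> b are recomputed. The tree and names are invented.
tree = {'A': ['B'], 'B': ['A', 'C', 'D'], 'C': ['B'], 'D': ['B']}

def path(a, b):
    # Breadth-first search for the unique path from a to b in the tree.
    prev, queue = {a: None}, deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            break
        for v in tree[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    p = [b]
    while p[-1] != a:
        p.append(prev[p[-1]])
    return p[::-1]

def messages_to_recompute(a, b):
    # One directed message per edge along the path a -> b.
    p = path(a, b)
    return list(zip(p, p[1:]))

print(messages_to_recompute('A', 'C'))  # two messages: ('A','B'), ('B','C')
print(messages_to_recompute('B', 'C'))  # neighbors: a single message
```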
Combining these two arguments, we find that only the messages on the directed path from Cn to C_{n+1} must be recomputed in iteration n. If we choose C_{n+1} to be a neighbor of Cn, we only have to recompute a single message in each iteration.⁴ Putting all of these optimizations together, we obtain Algorithm 3, which is easily generalized to the case where many function expectations are computed in parallel.

³The modified message passing scheme we describe can be viewed as an implementation of fast retraction for Shafer–Shenoy messages, analogous to the scheme described for HUGIN in [2, §6.4.6].
⁴A similar idea has recently been used to improve the efficiency of the Unified Propagation and Scaling algorithm for maximum likelihood estimation [10].

3.3 Complexity of Sample Propagation

For simplicity of analysis we assume finite-valued variables and tabular potentials. In the Shafer–Shenoy algorithm, the space complexity of representing the exact message (3) is O(|X_{B∩C}|), and the time complexity of computing it is O(|XB|) (since for each assignment to B ∩ C we must sum over assignments to B\C). In contrast, when computing the conditional message (8), we only sum over assignments to B\(C ∪ E), since E ∩ (B\C) is instantiated by the current sample. This makes the conditional message cheaper to compute than the exact message: in the finite case the time complexity is O(|X_{B\(E∩(B\C))}|). The space complexity of representing the conditional message is O(|X_{B∩C}|), the same as the exact message, since it is a potential over the same variables.

As we sample more variables, the conditional messages become cheaper to compute. However, note that the space complexity of representing the conditional message is independent of the choice of sampled variables E; even if we sample a given variable, it remains a free parameter of the conditional message.
(If we instead fixed its value, the proposition above would not hold.) Thus, the time complexity of computing conditional messages can be reduced by sampling more variables, but only up to a point: the time complexity of computing the conditional message must be Ω(|X_{B∩C}|), since the message is a table with one entry per assignment to B ∩ C. This contrasts with the approach of Bidyuk & Dechter [7], where the asymptotic time complexity of each iteration can be reduced arbitrarily by sampling more variables. However, to achieve this their algorithm runs the entire junction tree algorithm in each iteration, and does not reuse messages between iterations. In contrast, Sample Propagation reuses all but one of the messages between iterations, leading to a greatly reduced "constant factor".

4 Application to conditional Gaussian models

A conditional Gaussian (CG) model is a probability distribution over a set of discrete variables {Xi : i ∈ ∆} and continuous variables {Xi : i ∈ Γ} such that the conditional distribution of XΓ given X∆ is multivariate Gaussian. Inference in CG models is harder than in models that are totally discrete or totally Gaussian. For example, consider polytree models: when all of the variables are discrete or all are Gaussian, exact inference is linear in the size of the model; but if the model is CG then approximate inference is NP-hard [11].

In traditional junction tree inference, our goal is to compute the marginal for each cluster. However, when p is a CG model, each cluster marginal is a mixture of |X∆| Gaussians, and is intractable to represent. Instead, we can compute the weak marginals, i.e., for each cluster we compute the best conditional Gaussian approximation of pC. Lauritzen's algorithm [12] is an extension of the HUGIN algorithm that computes these weak marginals exactly.
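In one dimension, the moment matching that defines a weak marginal can be sketched directly: the mixture of Gaussians is replaced by the single Gaussian with the same mean and variance. The mixture parameters below are invented:

```python
# Sketch: the 1-D "weak marginal" of a Gaussian mixture is the single
# Gaussian matching the mixture's mean and variance (moment matching).
def weak_marginal(weights, means, variances):
    z = sum(weights)
    w = [wi / z for wi in weights]
    mean = sum(wi * mi for wi, mi in zip(w, means))
    # law of total variance: E[Var] + Var[E]
    var = (sum(wi * vi for wi, vi in zip(w, variances))
           + sum(wi * (mi - mean) ** 2 for wi, mi in zip(w, means)))
    return mean, var

# Two equally weighted components at -1 and +1, each with variance 0.25:
mean, var = weak_marginal([0.5, 0.5], [-1.0, 1.0], [0.25, 0.25])
print(mean, var)  # 0.0 1.25
```

The spread between the component means inflates the matched variance above the within-component variance, which is exactly what the law of total variance dictates.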
Unfortunately, it is often intractable because it requires strongly rooted junction trees, which can have clusters that contain most or all of the discrete variables [3].

The structure of CG models makes it possible to use Sample Propagation to approximate the weak cluster marginals: we choose E = ∆, since conditioning on the discrete variables leaves a tractable Gaussian inference problem.⁵ The expectations we must compute are of the sufficient statistics of the weak cluster marginals: for each cluster C, we need the distribution of X_{C∩∆} and the conditional means and covariances of X_{C∩Γ} given X_{C∩∆}.

As an example, consider the model given in Figure 2(a) for tracking an object whose state (position and velocity) at time t is Xt. At each time step, we obtain a vector measurement Yt which is either a noisy measurement of the object's position (if Zt = 0) or an outlier (if Zt = 1). The Markov chain over Zt makes it likely that inliers and outliers come in bursts. The task is to estimate the position of the object at all time steps (for T = 100).

⁵We cannot choose E = Γ because computing the conditional messages (8) may require summing discrete variables out of CG potentials, which leads to representational difficulties [3]. In this case one can instead use Bidyuk & Dechter's algorithm, which does not require these operations.

Lauritzen's algorithm is intractable in this case because any strongly rooted junction tree for this network must have a cluster containing all of the discrete variables [3, Thm. 3.18].
Therefore, instead of comparing our approximate position estimates to the correct answer, we sampled a trajectory from the network and computed the average position error to the (unobserved) ground truth. Both Gibbs sampling and Sample Propagation were run with a forwards–backwards sampling schedule; Sample Propagation used the junction tree of Figure 2(b).⁶ Both algorithms were started in the same state and both were allowed to "burn in" for five forwards–backwards passes. We repeated this 10 times and averaged the results over trials. Figure 2(c) shows that Sample Propagation converged much more quickly than Gibbs sampling. Also, Sample Propagation found better answers than Assumed Density Filtering (a standard algorithm for this problem), but at increased computational cost.

[Figure 2: The TRACKING example.]

Acknowledgements. I thank K. Murphy and S. Russell for comments on a draft of this paper. This research was supported by ONR N00014-00-1-0637 and an Intel Internship.

References

[1] G. Shafer and P. Shenoy. Probability propagation. Annals of Mathematics and Artificial Intelligence, 2:327–352, 1990.

[2] R. Cowell, P. Dawid, S. Lauritzen, and D. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, 1999.

[3] U. Lerner. Hybrid Bayesian Networks for Reasoning About Complex Systems. PhD thesis, Stanford University, October 2002.

[4] A. Dawid, U. Kjærulff, and S. Lauritzen. Hybrid propagation in junction trees. In Advances in Intelligent Computing, volume 945 of Lecture Notes in Computer Science. Springer, 1995.

[5] U. Kjærulff. HUGS: Combining exact inference and Gibbs sampling in junction trees. In Proc. of the 11th Conf. on Uncertainty in Artificial Intelligence (UAI-95). Morgan Kaufmann, 1995.

[6] A. Doucet, N. de Freitas, K. Murphy, and S. Russell.
Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Proc. of the 16th Conf. on Uncertainty in AI (UAI-00), 2000.

[7] B. Bidyuk and R. Dechter. An empirical study of w-cutset sampling for Bayesian networks. In Proc. of the 19th Conf. on Uncertainty in AI (UAI-03). Morgan Kaufmann, 2003.

[8] R. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, University of Toronto, 1993.

[9] C. S. Jensen, A. Kong, and U. Kjærulff. Blocking Gibbs sampling in very large probabilistic expert systems. International Journal of Human-Computer Studies, 42:647–666, 1995.

[10] Y. W. Teh and M. Welling. On improving the efficiency of the iterative proportional fitting procedure. In Proc. of the 9th Int'l. Workshop on AI and Statistics (AISTATS-03), 2003.

[11] U. Lerner and R. Parr. Inference in hybrid networks: Theoretical limits and practical algorithms. In Proc. of the 17th Conf. on Uncertainty in AI (UAI-01). Morgan Kaufmann, 2001.

[12] S. Lauritzen. Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098–1108, 1992.

[13] C. Carter and R. Kohn. Markov chain Monte Carlo in conditionally Gaussian state space models. Biometrika, 83:589–601, 1996.

⁶Carter & Kohn [13] describe a specialized algorithm for this model that is similar to a version of Sample Propagation that does not resample the discrete variables on the backwards pass.

[Figure 2 panels: (a) the tracking network over X1..XT, Y1..YT, Z1..ZT; (b) its junction tree with clusters {Xt, Xt+1, Zt, Zt+1}; (c) average position error vs. floating point operations for Assumed Density Filtering, Sample Propagation, and Gibbs Sampling.]", "award": [], "sourceid": 2488, "authors": [{"given_name": "Mark", "family_name": "Paskin", "institution": null}]}