{"title": "Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 491, "page_last": 498, "abstract": "", "full_text": "Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs of Arbitrary Topology\n\nFiras Hamze and Nando de Freitas\nDepartment of Computer Science\nUniversity of British Columbia\n\nAbstract\n\nThis paper presents a new sampling algorithm for approximating functions of variables representable as undirected graphical models of arbitrary connectivity with pairwise potentials, as well as for estimating the notoriously difficult partition function of the graph. The algorithm fits into the framework of sequential Monte Carlo methods rather than the more widely used MCMC, and relies on constructing a sequence of intermediate distributions which get closer to the desired one. While the idea of using \"tempered\" proposals is known, we construct a novel sequence of target distributions where, rather than dropping a global temperature parameter, we sequentially couple individual pairs of variables that are, initially, sampled exactly from a spanning tree of the variables. We present experimental results on inference and estimation of the partition function for sparse and densely-connected graphs.\n\n1 Introduction\n\nUndirected graphical models are powerful statistical tools having a wide range of applications in diverse fields such as image analysis [1, 2], conditional random fields [3], neural models [4] and epidemiology [5]. Typically, when doing inference, one is interested in obtaining the local beliefs, that is, the marginal probabilities of the variables given the evidence set. 
The methods used to approximate these intractable quantities generally fall into the categories of Markov chain Monte Carlo (MCMC) [6] and variational methods [7]. The former, involving running a Markov chain whose invariant distribution is the distribution of interest, can suffer from slow convergence to stationarity and high correlation between samples at stationarity, while the latter is not guaranteed to give the right answer or always converge. When performing learning in such models, however, a more serious problem arises: the parameter update equations involve the normalization constant of the joint model at the current value of the parameters, from here on called the partition function. MCMC offers no obvious way of approximating this wildly intractable sum [5, 8]. Although there exists a polynomial-time MCMC algorithm for simple graphs with binary nodes, ferromagnetic potentials and uniform observations [9], this algorithm is hardly applicable to the complex models encountered in practice. Of more interest, perhaps, are the theoretical results showing that Gibbs sampling and even Swendsen-Wang [10] can mix exponentially slowly in many situations [11]. This paper introduces a new sequential Monte Carlo method for approximating expectations of a pairwise graph's variables (of which beliefs are a special case) and of reasonably estimating the partition function. Intuitively, the new method uses interacting parallel chains to handle multimodal distributions, with communicating chains distributed across the modes.\n\nFigure 1: A small example of the type of graphical model treated in this paper. The observations correspond to the two shaded nodes.\n\n
In addition, there is no requirement that the chains converge to equilibrium, as the bias due to incomplete convergence is corrected for by importance sampling.\n\nFormally, given hidden variables x and observations y, the model is specified on a graph G(V, E), with edges E and M nodes V, by:\n\nπ(x, y) = (1/Z) ∏_{i∈V} φ(x_i, y_i) ∏_{(i,j)∈E} ψ(x_i, x_j),\n\nwhere x = {x_1, ..., x_M}, Z is the partition function, φ(·) denotes the observation potentials and ψ(·) denotes the pairwise interaction potentials, which are strictly positive but otherwise arbitrary. The partition function is Z = Σ_x ∏_{i∈V} φ(x_i, y_i) ∏_{(i,j)∈E} ψ(x_i, x_j), where the sum is over all possible system states. We make no assumption about the graph's topology or sparseness; an example is in Figure 1. We present experimental results on both fully-connected graphs (cases where each node neighbors every other node) and sparse graphs.\n\nOur approach belongs to the framework of Sequential Monte Carlo (SMC), which has its roots in the seminal paper of [12]. Particle filters are a well-known instance of SMC methods [13]. They apply naturally to dynamic systems like tracking. Our situation is different. We introduce artificial dynamics simply as a constructive strategy for obtaining samples of a sequence of distributions converging to the distribution of interest. That is, initially we sample from an easy-to-sample distribution. This distribution is then used as a proposal mechanism to obtain samples from a slightly more complex distribution that is closer to the target distribution. The process is repeated until the sequence of distributions of increasing complexity reaches the target distribution. Our algorithm has connections to a general annealing strategy proposed in the physics [14] and statistics [15] literature, known as Annealed Importance Sampling (AIS). AIS is a special case of the general SMC framework [16]. 
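As a concrete illustration of the model and partition function defined above, the following sketch evaluates Z by brute-force enumeration for a tiny graph; the potentials are hypothetical stand-ins, and the K**M-term sum is of course exactly what is intractable for the graphs studied later.

```python
import itertools
import math

# Brute-force evaluation of the partition function
#   Z = sum_x prod_{i in V} phi(x_i, y_i) prod_{(i,j) in E} psi(x_i, x_j)
# for a tiny pairwise model. Only feasible for very small M, since the
# sum ranges over K**M joint states. The potentials below are
# illustrative, not taken from the paper's experiments.

def partition_function(M, K, phi, psi, edges):
    Z = 0.0
    for x in itertools.product(range(K), repeat=M):
        p = 1.0
        for i in range(M):
            p *= phi(i, x[i])      # observation potential phi(x_i, y_i)
        for (i, j) in edges:
            p *= psi(x[i], x[j])   # pairwise potential psi(x_i, x_j)
        Z += p
    return Z

# Example: a 3-node fully-connected binary graph with Potts-style
# ferromagnetic potentials at temperature T and uniform observations.
J, T = 1.0, 0.5
psi = lambda a, b: math.exp((J / T) * (a == b))
phi = lambda i, a: 1.0
Z = partition_function(3, 2, phi, psi, [(0, 1), (1, 2), (0, 2)])
```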
The term annealing refers to the lowering of a \"temperature parameter,\" the process of which makes the joint distribution more concentrated on its modes, whose number can be massive for difficult problems. The celebrated simulated annealing (SA) [17] algorithm is an optimization method relying on this phenomenon; presently, however, we are interested in integration, and so SA does not apply here.\n\nOur approach does not use a global temperature, but sequentially introduces dependencies among the variables; graphically, this can be understood as \"adding edges\" to the graph. In this paper, we restrict ourselves to discrete state-spaces, although the method applies to arbitrary continuous distributions.\n\nFor our initial distribution we choose a spanning tree of the variables, on which analytic marginalization, exact sampling, and computation of the partition function are easily done. After drawing a population of samples (particles) from this distribution, the sequential phase begins: an edge of the desired graph is chosen and gradually added to the current one as shown in Figure 2. The particles then follow a trajectory according to some proposal mechanism. The \"fitness\" of the particles is measured via their importance weights. When the set of samples has become skewed, that is, with some containing high weights and many containing low ones, the particles are resampled according to their weights. The sequential structure is thus imposed by the propose-and-resample mechanism rather than by any property of the original system. The algorithm is formally described after an overview of SMC and recent work presenting a unifying framework of the SMC methodology outside the context of Bayesian dynamic filtering [16].\n\nFigure 2: A graphical illustration of our algorithm. 
First we construct a spanning tree, of which a population of iid samples can be easily drawn using the forward filtering/backward sampling algorithm for trees. The tree then becomes the proposal mechanism for generating samples for a graph with an extra potential. The process is repeated until we obtain samples from the target distribution (defined on a fully connected graph in this case). Edges can be added \"slowly\" using a coupling parameter.\n\n2 Sequential Monte Carlo\n\nAs shown in Figure 2, we consider a sequence of auxiliary distributions π̃_1(x_1), π̃_2(x_{1:2}), ..., π̃_n(x_{1:n}), where π̃_1(x_1) is the distribution on the weighted spanning tree. The sequence of distributions can be constructed so that it satisfies π̃_n(x_{1:n}) = π_n(x_n) π̃_n(x_{1:n−1} | x_n). Marginalizing over x_{1:n−1} gives us the target distribution of interest π_n(x_n) (the distribution of the graphical model that we want to sample from, as illustrated in Figure 2 for n = 4). So we first focus on sampling from the sequence of auxiliary distributions. The joint distribution is only known up to a normalization constant: π̃_n(x_{1:n}) = Z_n^{−1} f_n(x_{1:n}), where Z_n ≜ ∫ f_n(x_{1:n}) dx_{1:n} is the partition function. We are often interested in computing this partition function and other expectations, such as I(g(x_n)) = ∫ g(x_n) π_n(x_n) dx_n, where g is a function of interest (e.g. g(x) = x if we are interested in computing the mean of x).\n\nIf we had a set of samples {x_{1:n}^{(i)}}_{i=1}^N from π̃, we could approximate this integral with the following Monte Carlo estimator: π̃_n(dx_{1:n}) ≈ (1/N) Σ_{i=1}^N δ_{x_{1:n}^{(i)}}(dx_{1:n}), where δ_x(dx_{1:n}) denotes the Dirac delta function, and consequently approximate any expectations of interest. These estimates converge almost surely to the true expectation as N goes to infinity. It is typically hard to sample from π̃ directly. 
Instead, we sample from a proposal distribution q and weight the samples according to the following importance ratio:\n\nw_n = f_n(x_{1:n}) / q_n(x_{1:n}) = [f_n(x_{1:n}) q_{n−1}(x_{1:n−1})] / [q_n(x_{1:n}) f_{n−1}(x_{1:n−1})] · w_{n−1}.\n\nThe proposal is constructed sequentially: q_n(x_{1:n}) = q_{n−1}(x_{1:n−1}) q_n(x_n | x_{1:n−1}). Hence, the importance weights can be updated recursively:\n\nw_n = f_n(x_{1:n}) / [q_n(x_n | x_{1:n−1}) f_{n−1}(x_{1:n−1})] · w_{n−1}.   (1)\n\nGiven a set of N particles x_{1:n−1}^{(i)}, we obtain a set of particles x_n^{(i)} by sampling from q_n(x_n | x_{1:n−1}^{(i)}) and applying the weights of equation (1). To overcome slow drift in the particle population, a resampling (selection) step chooses the fittest particles (see the introductory chapter of [13] for a more detailed explanation). We use a state-of-the-art minimum variance resampling algorithm [18].\n\nThe ratio of successive partition functions can be easily estimated using this algorithm as follows:\n\nZ_n / Z_{n−1} = ∫ f_n(x_{1:n}) dx_{1:n} / Z_{n−1} = ∫ ŵ_n π̃_{n−1}(x_{1:n−1}) q_n(x_n | x_{1:n−1}) dx_{1:n} ≈ Σ_{i=1}^N ŵ_n^{(i)} w̃_{n−1}^{(i)},\n\nwhere w̃_{n−1}^{(i)} = w_{n−1}^{(i)} / Σ_j w_{n−1}^{(j)}, ŵ_n = f_n(x_{1:n}) / [q_n(x_n | x_{1:n−1}) f_{n−1}(x_{1:n−1})], and Z_1 can be easily computed as it is the partition function of a tree.\n\nWe can choose a (non-homogeneous) Markov chain with transition kernel K_n(x_{n−1}, x_n) as the proposal distribution q_n(x_n | x_{1:n−1}). Hence, given an initial proposal distribution q_1(·), we have the joint proposal distribution at step n: q_n(x_{1:n}) = q_1(x_1) ∏_{k=2}^n K_k(x_{k−1}, x_k). It is convenient to assume that the artificial distribution π̃_n(x_{1:n−1} | x_n) is also a product of (backward) Markov kernels: π̃_n(x_{1:n−1} | x_n) = ∏_{k=1}^{n−1} L_k(x_{k+1}, x_k) [16]. 
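To make the weight recursion and the Z-ratio estimator concrete, here is a small self-contained sketch of sequential importance sampling with resampling on a toy annealing sequence f_n(x) = exp(β_n s(x)) over a finite state space. The score function s, the β schedule, and the independence-Metropolis move are illustrative assumptions, not the paper's algorithm (which couples edges rather than a global β).

```python
import math
import random

# Toy SMC sampler for targets f_n(x) = exp(beta_n * s(x)) on
# x in {0, ..., K-1}, accumulating an estimate of log Z_n from the
# AIS-style incremental weights w_n = f_n(x) / f_{n-1}(x).

def smc_log_Z(s, K, betas, N=2000, seed=0):
    rng = random.Random(seed)
    particles = [rng.randrange(K) for _ in range(N)]  # exact draws from f_0 (uniform)
    log_Z = math.log(K)                               # Z_0 for the uniform start
    for b_prev, b in zip(betas, betas[1:]):
        # These incremental weights do not depend on the new state, so we
        # may resample *before* moving (as noted in the text).
        w = [math.exp((b - b_prev) * s(x)) for x in particles]
        log_Z += math.log(sum(w) / N)                 # Z_n / Z_{n-1} estimate
        particles = rng.choices(particles, weights=w, k=N)
        # Rejuvenate with an MCMC kernel invariant for f_n
        # (independence Metropolis from the uniform distribution).
        for i in range(N):
            y = rng.randrange(K)
            if rng.random() < math.exp(b * (s(y) - s(particles[i]))):
                particles[i] = y
    return log_Z
```

With a fine enough β schedule, the estimate is typically close to the exact log Z obtained by enumerating the K states.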
Under these choices, the (unnormalized) incremental importance weight becomes:\n\nw_n ∝ f_n(x_n) L_{n−1}(x_n, x_{n−1}) / [f_{n−1}(x_{n−1}) K_n(x_{n−1}, x_n)].   (2)\n\nDifferent choices of the backward kernel L result in different algorithms [16]. For example, the choice L_{n−1}(x_n, x_{n−1}) = f_n(x_{n−1}) K_n(x_{n−1}, x_n) / f_n(x_n) results in the AIS algorithm, with weights w_n ∝ f_n(x_{n−1}) / f_{n−1}(x_{n−1}). However, we should point out that this method is more general, as one can carry out resampling. Note that in this case the importance weights do not depend on x_n and, hence, it is possible to do resampling before the importance sampling step. This often leads to a huge reduction in estimation error [19]. Also, note that if there are big discrepancies between f_n(·) and f_{n−1}(·), the method might perform poorly. To overcome this, [16] use variance results to propose a different choice of backward kernel, which results in the following incremental importance weights:\n\nw_n ∝ f_n(x_n) / ∫ f_{n−1}(x_{n−1}) K_n(x_{n−1}, x_n) dx_{n−1}.   (3)\n\nThe integral in the denominator can be evaluated when dealing with Gaussian or reasonable discrete networks.\n\n3 The new algorithm\n\nWe could try to perform traditional importance sampling by seeking some proposal distribution for the entire graph. This is very difficult, and performance degrades exponentially in dimension if the proposal is mismatched [20]. We propose, however, to use the samples from the tree distribution (which we call π_0) as candidates for an intermediate target distribution, consisting of the tree along with a \"weak\" version of a potential corresponding to some edge of the original graph. 
Given a set of edges G_0 which form a spanning tree of the target graph, we can use the belief propagation equations [21] and bottom-up propagation, top-down sampling [22], to draw a set of N independent samples from the tree. Computation of the normalization constant Z_1 is also straightforward and efficient in the case of trees using a sum-product recursion. From then on, however, the normalization constants of subsequent target distributions cannot be analytically computed.\n\nWe then choose a new edge e_1 from the set of \"unused\" edges E − G_0 and add it to G_0 to form the new edge set G_1 = e_1 ∪ G_0. Let the vertices of e_1 be u_1 and v_1. Then the intermediate target distribution π_1 is proportional to π_0(x) ψ_{e_1}(x_{u_1}, x_{v_1}). In doing straightforward importance sampling, using π_0 as a proposal for π_1, the importance weight is proportional to ψ_{e_1}(x_{u_1}, x_{v_1}). We adopt a slow proposal process to move the population of particles towards π_1. We gradually introduce the potential between X_{u_1} and X_{v_1} via a coupling parameter α which increases from 0 to 1 in order to \"softly\" bring the edge's potential in and allow the particles to adjust to the new environment. Formally, when adding edge e_1 to the graph, we introduce a number of coupling steps so that we have the intermediate target distribution:\n\nπ_0(x) [ψ_{e_1}(x_{u_1}, x_{v_1})]^{α_n},\n\nwhere α_n is defined to be 0 when a new edge enters the sequence, increases to 1 as the edge is brought in, and drops back to zero when another edge is added at the following edge iteration.\n\nAt each time step, we want a proposal mechanism that is close to the target distribution. Proposals based on simple perturbations, such as random walks, are easy to implement, but can be inefficient. Metropolis-Hastings proposals are not possible because of the integral in the rejection term. 
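Since the n-th intermediate target differs from the (n−1)-th only through the factor [ψ_{e_1}]^{α_n}, the AIS-style incremental weight of a particle x is simply ψ_{e_1}(x_{u_1}, x_{v_1})^{α_n − α_{n−1}}. A minimal sketch of this bookkeeping follows; the linear schedule mirrors the experiments, but the edge potential and state are hypothetical examples:

```python
# Sketch of the per-edge coupling schedule and the corresponding
# incremental weight; psi_e and the state x below are illustrative.

def coupling_schedule(steps):
    """Linear schedule alpha_0 = 0, ..., alpha_steps = 1 for one edge."""
    return [k / steps for k in range(steps + 1)]

def incremental_weight(psi_e, x, u, v, a_new, a_old):
    """w_n / w_{n-1} = psi_e(x_u, x_v) ** (alpha_n - alpha_{n-1})."""
    return psi_e(x[u], x[v]) ** (a_new - a_old)

# Over a full schedule the incremental weights telescope to the plain
# importance weight psi_e(x_u, x_v) one would get using pi_0 as a direct
# proposal for pi_1 (but with far less weight degeneracy along the way).
psi_e = lambda a, b: 3.0 if a == b else 0.5   # hypothetical edge potential
x = {0: 1, 1: 1}
sched = coupling_schedule(100)
w = 1.0
for a_old, a_new in zip(sched, sched[1:]):
    w *= incremental_weight(psi_e, x, 0, 1, a_new, a_old)
```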
We can, however, employ a single-site Gibbs sampler with random scan whose invariant distribution at each step is the next target density in the sequence; this kernel is applied to each particle. When an edge has been fully added, a new one is chosen and the process is repeated until the final target density is the full graph. We use an analytic expression for the incremental weights corresponding to Equation (3).\n\nTo alleviate potential confusion with MCMC: while any one particle obviously forms a correlated path, we are using a population and are making no assumption or requirement that the chains have converged, as is done in MCMC, since we are correcting for incomplete convergence with the weights.\n\n4 Experiments and discussion\n\nFour approximate inference methods were compared: our SMC method with sequential edge addition (Hot Coupling, HC), a more typical annealing strategy with a global temperature parameter (SMCG), single-site Gibbs sampling with random scan, and loopy belief propagation. SMCG can be thought of as related to HC, but where all the edges and local evidence are annealed at the same time.\n\nThe majority of our experiments were performed on graphs that were small enough for exact marginals and partition functions to be exhaustively calculated. However, even in toy cases MCMC and loopy can give unsatisfactory and sometimes disastrous results. We also ran a set of experiments on a relatively large MRF.\n\nFor the small examples we examined both fully-connected (FC) and square-grid (MRF) networks, with 18 and 16 nodes respectively. Each variable could assume one of 3 states. Our pairwise potentials corresponded to the well-known Potts model: ψ_{i,j}(x_i, x_j) = e^{(1/T) J_{ij} δ_{x_i, x_j}}, φ_i(x_i) = e^{(1/T) J δ_{x_i}(y_i)}. 
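The random-scan single-site Gibbs kernel used as the proposal can be sketched as follows; the site/edge data structures, the handling of the coupling exponent, and the potentials are illustrative assumptions, not the authors' code:

```python
import random

# One random-scan single-site Gibbs update of a particle's state x
# (a dict site -> label in {0,...,K-1}) under the current intermediate
# target: phi gives the node potentials, and `edges` holds tuples
# (u, v, psi, alpha), with alpha = 1 for fully-added edges and
# 0 < alpha <= 1 for the edge currently being coupled in.

def gibbs_move(x, K, phi, edges, rng):
    i = rng.choice(sorted(x))          # random-scan choice of site
    weights = []
    for s in range(K):                 # full conditional up to a constant
        w = phi(i, s)
        for (u, v, psi, alpha) in edges:
            if i == u:
                w *= psi(s, x[v]) ** alpha
            elif i == v:
                w *= psi(x[u], s) ** alpha
        weights.append(w)
    x[i] = rng.choices(range(K), weights=weights)[0]
    return x
```

As a sanity check, with flat node potentials and a hard equality potential on a fully coupled edge (alpha = 1), a single move necessarily leaves the two endpoint sites agreeing.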
We set T = 0.5 (a low temperature) and tested models with uniform and positive J_{ij}, widely used in image analysis, and models with J_{ij} drawn from a standard Gaussian; the latter is an instance of the much-studied spin-glass models of statistical physics, which are known to be notoriously difficult to simulate at low temperatures [23]. Of course, fully-connected models are known as Boltzmann machines [4] to the neural computation community. The output potentials were randomly selected in both the uniform and random interaction cases. The HC method used a linear coupling schedule for each edge, increasing from α = 0 to α = 1 over 100 iterations; our SMCG implementation used a linear global cooling schedule, whose number of steps depended on the graph in order to match those taken by HC.\n\nAll Monte Carlo algorithms were independently run 50 times each to approximate the variance of the estimates. Our SMC simulations used 1000 particles for each run, while each Gibbs run performed 20000 single-site updates. For these models, this was more than enough steps to settle into local minima; runs of up to 1 million iterations did not yield a difference, which is characteristic of the exponential mixing time of the sampler on these graphs. For our HC method, spanning trees and edges in the sequential construction were randomly chosen from the full graph; the rationale for doing so is to allay any criticism that \"tweaking\" the ordering may have had a crucial effect on the algorithm. The order clearly would matter to some extent, but this will be examined in later work. 
Also, in the tables, by \"error\" we mean the quantity |â − a| / a, where â is an estimate of some quantity a obtained exactly (say Z).\n\nFirst, we used HC, SMCG and Gibbs to approximate the expected sum of our graphs' variables, the so-called magnetization: m = E[Σ_{i=1}^M x_i]. We then approximated the partition functions of the graphs using HC, SMCG, and loopy (code for the Bethe Z approximation kindly provided by Kevin Murphy). We note again that there is no obvious way of estimating Z using Gibbs. Finally, we approximated the marginal probabilities using the four approximate methods. For loopy, we only kept the runs where it converged.\n\nMethod | MRF Random ψ (Error, Var) | MRF Homogeneous ψ (Error, Var) | FC Random ψ (Error, Var) | FC Homogeneous ψ (Error, Var)\nHC    | 0.0022, 0.012 | 0.0251, 0.17   | 0.0016, 0.0522 | 0.0036, 0.038\nSMCG  | 0.0001, 0.03  | 0.2789, 10.09  | 0.127, 0.570   | 0.331, 165.61\nGibbs | 0.0003, 0.014 | 0.4928, 200.95 | 0.02, 0.32     | 0.3152, 201.08\n\nFigure 3: Approximate magnetization for the nodes of the graphs, as defined in the text, calculated using HC, SMCG, and Gibbs sampling and compared to the true value obtained by brute force. Observe the massive variance of Gibbs sampling in some cases.\n\nMethod | MRF Random ψ (Error, Var) | MRF Homogeneous ψ (Error, Var) | FC Random ψ (Error, Var) | FC Homogeneous ψ (Error, Var)\nHC    | 0.0105, 0.002 | 0.0227, 0.001 | 0.0043, 0.0537 | 0.0394, 0.001\nSMCG  | 0.004, 0.005  | 6.47, 7.646   | 1800, 1.24     | 1, 29.99\nloopy | 0.005, -      | 0.155, -      | 1, -           | 0.075, -\n\nFigure 4: Approximate partition function of the graphs discussed in the text calculated using HC, SMCG, and Loopy Belief Propagation (loopy). For HC and SMCG are shown the error of the sample average of results over 50 independent runs and the variance across those runs; loopy is of course a deterministic algorithm and has no variance. 
HC maintains a low error and variance in all cases.\n\nFigure 3 shows the results of the magnetization experiments. On the MRF with random interactions, all three methods gave very accurate answers with small variance, but for the other graphs, the accuracies and variances began to diverge. On both positive-potential graphs, Gibbs sampling gives high error and huge variance; SMCG gives lower variance but is still quite skewed. On the fully-connected random-potential graph the 3 methods give good results, but HC has the lowest variance. Our method experiences its worst performance on the homogeneous MRF, but it is only 2.5% error!\n\nFigure 4 tabulates the approximate partition function calculations. Again, for the MRF with random interactions, the 3 methods give estimates of Z of comparable quality. This example appeared to work for loopy, Gibbs, and SMCG. For the homogeneous MRF, SMCG degrades rapidly; loopy is still satisfactory at 15% error, but HC is at 2.7% with very low variance. In the fully-connected case with random potentials, HC's error is 0.43% while loopy's error is very high, having underestimated Z by a factor of 10^5. SMCG fails completely here as well. On the uniform fully-connected graph, loopy actually gives a reasonable estimate of Z at 7.5%, but is still beaten by HC.\n\nFigure 5 shows the variational (L1) distance between the exact marginal for a randomly chosen node in each graph and the approximate marginals of the 4 algorithms, a common measure of the \"distance\" between 2 distributions. For the Monte Carlo methods (HC, SMCG and Gibbs), the average over 50 independent runs was used to approximate the expected L1 error of the estimate. All 4 methods perform well on the random ψ MRF. On the MRF with homogeneous ψ, both loopy and SMCG degrade, but HC maintains a low error. Among the FC graphs, HC performs extremely well on the homogeneous ψ, and surprisingly loopy does well too. 
In the random ψ case, loopy's error increases dramatically.\n\nOur final set of simulations was the classic mean-squared reconstruction of a noisy image problem; we used a 100x100 MRF with a noisy \"patch\" image (consisting of shaded, rectangular regions) with an isotropic 5-state prior model. The object was to calculate the pixels' posterior marginal expectations. We chose this problem because it is a large model on which loopy is known to do well, and it can hence provide us with a measure of quality of the HC and SMCG results as larger numbers of edges are involved. From the toy examples we infer that the mechanism of HC is quite different from that of loopy, as we have seen that it can work when loopy does not. Hence, good performance on this problem would suggest that HC would scale well, which is a crucial question, as in the large graph the final distribution has many more edges than the initial spanning tree. The results were promising: the mean-squared reconstruction errors using loopy and using HC were virtually identical, at 9.067 x 10^-5 and 9.036 x 10^-5 respectively, showing that HC seemed to be robust to the addition of around 9000 edges and many resampling stages. SMCG on the large MRF did not fare as well.\n\nFigure 5: Variational (L1) distance between estimated and true marginals for a randomly chosen node in each of the 4 graphs (panels: Fully-Connected Random, Fully-Connected Homogeneous, Grid Model Random, Grid Model Homogeneous) using the four approximate methods (smaller values mean less error). The MRF-random example was again \"easy\" for all the methods, but the rest raise problems for all but HC.\n\nFigure 6: An example of how MCMC can get \"stuck\": 3 different runs of a Gibbs sampler estimating the magnetization of the FC-Homogeneous graph. At left are shown the first 600 iterations of the runs; after a brief transient behaviour the samplers settled into different minima which persisted for the entire duration (20000 steps) of the runs. Indeed, for 1 million steps the local minima persist, as shown at right.\n\nIt is crucial to realize that MCMC is completely unsuited to some problems; see for example the \"convergence\" plots of the estimated magnetization of 3 independent Gibbs sampler runs on one of our \"toy\" graphs shown in Figure 6. Such behavior has been studied by Gore and Jerrum [11] and others, who discuss pessimistic theoretical results on the mixing properties of both Gibbs sampling and the celebrated Swendsen-Wang algorithm in several cases. To obtain a good estimate, MCMC requires that the process \"visit\" each of the target distribution's basins of energy with a frequency representative of their probability. Unfortunately, some basins take an exponential amount of time to exit, and so different finite runs of MCMC will give quite different answers, leading to tremendous variance. 
The methodology presented here is an attempt to sidestep the whole issue of mixing by permitting the independent particles to be stuck in modes, but then considering them jointly when estimating. In other words, instead of using a time average, we estimate using a weighted ensemble average. The object of the sequential phase is to address the difficult problem of constructing a suitable proposal for high-dimensional problems; to this, the resampling-based methodology of particle filters was thought to be particularly suited. For the graphs we have considered, the single-edge algorithm we propose seems to be preferable to global annealing.\n\nReferences\n\n[1] S Z Li. Markov random field modeling in image analysis. Springer-Verlag, 2001.\n[2] P Carbonetto and N de Freitas. Why can't José read? The problem of learning semantic associations in a robot environment. In Human Language Technology Conference Workshop on Learning Word Meaning from Non-Linguistic Data, 2003.\n[3] J D Lafferty, A McCallum, and F C N Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning, 2001.\n[4] D E Rumelhart, G E Hinton, and R J Williams. Learning internal representations by error propagation. In D E Rumelhart and J L McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, pages 318-362, Cambridge, MA, 1986.\n[5] P J Green and S Richardson. Hidden Markov models and disease mapping. Journal of the American Statistical Association, 97(460):1055-1070, 2002.\n[6] C P Robert and G Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, 1999.\n[7] M I Jordan, Z Ghahramani, T S Jaakkola, and L K Saul. An introduction to variational methods for graphical models. 
Machine Learning, 37:183-233, 1999.\n[8] J Moller, A N Pettitt, K K Berthelsen, and R W Reeves. An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Technical report, The Danish National Research Foundation: Network in Mathematical Physics and Stochastics, 2004.\n[9] M Jerrum and A Sinclair. The Markov chain Monte Carlo method: an approach to approximate counting and integration. In D S Hochbaum, editor, Approximation Algorithms for NP-hard Problems, pages 482-519. PWS Publishing, 1996.\n[10] R H Swendsen and J S Wang. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58(2):86-88, 1987.\n[11] V Gore and M Jerrum. The Swendsen-Wang process does not always mix rapidly. In 29th Annual ACM Symposium on Theory of Computing, 1996.\n[12] N Metropolis and S Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335-341, 1949.\n[13] A Doucet, N de Freitas, and N J Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer-Verlag, 2001.\n[14] C Jarzynski. Nonequilibrium equality for free energy differences. Physical Review Letters, 78, 1997.\n[15] R M Neal. Annealed importance sampling. Technical Report No 9805, University of Toronto, 1998.\n[16] P Del Moral, A Doucet, and G W Peters. Sequential Monte Carlo samplers. Technical Report CUED/F-INFENG/2004, Cambridge University Engineering Department, 2004.\n[17] S Kirkpatrick, C D Gelatt, and M P Vecchi. Optimization by simulated annealing. Science, 220:671-680, 1983.\n[18] G Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5:1-25, 1996.\n[19] N de Freitas, R Dearden, F Hutter, R Morales-Menendez, J Mutch, and D Poole. Diagnosis by a waiter and a mars explorer. IEEE Proceedings, 92, 2004.\n[20] J A Bucklew. 
Large Deviation Techniques in Decision, Simulation, and Estimation. John Wiley & Sons, 1986.\n[21] J Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.\n[22] C K Carter and R Kohn. On Gibbs sampling for state space models. Biometrika, 81(3):541-553, 1994.\n[23] M E J Newman and G T Barkema. Monte Carlo Methods in Statistical Physics. Oxford University Press, 1999.\n", "award": [], "sourceid": 2788, "authors": [{"given_name": "Firas", "family_name": "Hamze", "institution": null}, {"given_name": "Nando", "family_name": "de Freitas", "institution": null}]}