{"title": "Nested sampling for Potts models", "book": "Advances in Neural Information Processing Systems", "page_first": 947, "page_last": 954, "abstract": null, "full_text": "Nested sampling for Potts models\n\nIain Murray, Gatsby Computational Neuroscience Unit, University College London, i.murray@gatsby.ucl.ac.uk; Zoubin Ghahramani, Gatsby Computational Neuroscience Unit, University College London, zoubin@gatsby.ucl.ac.uk\n\nDavid J.C. MacKay, Cavendish Laboratory, University of Cambridge, mackay@mrao.cam.ac.uk; John Skilling, Maximum Entropy Data Consultants Ltd., skilling@eircom.net\n\nAbstract\nNested sampling is a new Monte Carlo method by Skilling [1] intended for general Bayesian computation. Nested sampling provides a robust alternative to annealing-based methods for computing normalizing constants. It can also generate estimates of other quantities such as posterior expectations. The key technical requirement is an ability to draw samples uniformly from the prior subject to a constraint on the likelihood. We provide a demonstration with the Potts model, an undirected graphical model.\n\n1 Introduction\n\nThe computation of normalizing constants plays an important role in statistical inference. For example, Bayesian model comparison needs the evidence, or marginal likelihood, of a model M:\n\n    Z = p(D|M) = ∫ p(D|θ, M) p(θ|M) dθ = ∫ L(θ) π(θ) dθ,    (1)\n\nwhere the model has prior π(θ) and likelihood L(θ) over parameters θ after observing data D. This integral is usually intractable for models of interest. However, given its importance in Bayesian model comparison, many approaches--both sampling-based and deterministic--have been proposed for estimating it. Often the evidence cannot be obtained using samples drawn from either the prior π(θ) or the posterior p(θ|D, M) ∝ L(θ)π(θ). Practical Monte Carlo methods need to sample from a sequence of distributions, possibly at different \"temperatures\", p(θ|β) ∝ L(θ)^β π(θ) (see Gelman and Meng [2] for a review). 
These methods are sometimes cited as a gold standard for comparison with other approximate techniques, e.g. Beal and Ghahramani [3]. However, care is required in choosing intermediate distributions; appropriate temperature-based distributions may be difficult or impossible to find. Nested sampling provides an alternative standard, which makes no use of temperature and does not require tuning of intermediate distributions or other large sets of parameters.\n\nFigure 1: (a) Elements of parameter space (top) are sorted by likelihood and arranged on the x-axis. An eighth of the prior mass is inside the innermost likelihood contour in this figure. (b) Point x_i is drawn from the prior inside the likelihood contour defined by x_{i-1}. L_i is identified and p({x_i}) is known, but exact values of x_i are not known. (c) With N particles, the least likely one sets the likelihood contour and is replaced by a new point inside the contour ({L_i} and p({x_i}) are still known).\n\nNested sampling uses a natural definition of Z, a sum over prior mass. The weighted sum over likelihood elements is expressed as the area under a monotonic one-dimensional curve \"L vs x\" (figure 1(a)), where:\n\n    Z = ∫ L(θ) π(θ) dθ = ∫_0^1 L(θ(x)) dx.    (2)\n\nThis is a change of variables dx(θ) = π(θ) dθ, where each volume element of the prior in the original θ-vector space is mapped onto a scalar element on the one-dimensional x-axis. The ordering of the elements on the x-axis is chosen to sort the prior mass in decreasing order of likelihood (x_1 < x_2 ⇔ L(θ(x_1)) > L(θ(x_2))). See appendix A for dealing with elements with identical likelihoods. Given some points {(x_i, L_i)}_{i=1}^I ordered such that x_i > x_{i+1}, the area under the curve (2) is easily approximated. We denote by Ẑ estimates obtained using a trapezoidal rule. Rectangle rules upper- and lower-bound the error Z − Ẑ. 
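As a concrete sketch (our own illustration, not code from the paper; the function name is ours), the trapezoidal estimate Ẑ can be computed in log space from points (x_i, L_i), treating the curve as starting from (x_0 = 1, L_0 = 0):

```python
import math

def log_Z_trapezoid(xs, log_Ls):
    """Trapezoidal estimate of Z = int_0^1 L(x) dx from points
    (x_i, L_i) sorted so that x_i > x_{i+1} (likelihoods increasing).
    Worked in log space for numerical stability; the curve is treated
    as starting from (x_0 = 1, L_0 = 0)."""
    xs = [1.0] + list(xs)
    log_terms = []
    prev = float('-inf')                 # log L_0 = log 0
    for x_hi, x_lo, log_L in zip(xs[:-1], xs[1:], log_Ls):
        top = max(prev, log_L)
        # log of the trapezoid height 0.5 * (L_{i-1} + L_i)
        log_height = top + math.log(0.5 * (math.exp(prev - top)
                                           + math.exp(log_L - top)))
        log_terms.append(math.log(x_hi - x_lo) + log_height)
        prev = log_L
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))
```

For a constant likelihood L(x) = 1 the estimate approaches 1 as the x_i cover more of the unit interval, short of the triangular first trapezoid and the unexplored mass below the last x_i.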
Points with known x-coordinates are unavailable in general. Instead we generate points, {θ_i}, such that the distribution p(x) is known (where x ≡ {x_i}), and find their associated {L_i}. A simple algorithm to draw I points is algorithm 1, see also figure 1(b).\n\nAlgorithm 1\n  Initial point: draw θ_1 ∼ π(θ).\n  for i = 2 to I: draw θ_i ∼ π*(θ|L(θ_{i-1})), where\n\n    π*(θ|L*) ∝ { π(θ) if L(θ) > L*;  0 otherwise. }    (3)\n\nAlgorithm 2\n  Initialize: draw N points θ^(n) ∼ π(θ).\n  for i = 2 to I:\n    m = argmin_n L(θ^(n))\n    θ_{i-1} = θ^(m)\n    draw θ^(m) ∼ π*(θ|L(θ_{i-1})), given by equation (3)\n\nWe know p(x_1) = Uniform(0, 1), because x is a cumulative sum of prior mass. Similarly p(x_i|x_{i-1}) = Uniform(0, x_{i-1}), as every point is drawn from the prior subject to L(θ_i) > L(θ_{i-1}) ⇔ x_i < x_{i-1}. This recursive relation allows us to compute p(x). A simple generalization, algorithm 2, uses multiple particles; at each step the least likely is replaced with a draw from a constrained prior (figure 1(c)). Now p(x_1|N) = N x_1^{N-1}, and subsequent points have p(x_i/x_{i-1} | x_{i-1}, N) = N (x_i/x_{i-1})^{N-1}.\n\nFigure 2: The arithmetic and geometric means of x_i against iteration number i, for algorithm 2 with N = 8. Error bars on the geometric mean show exp(−i/N ± √i/N). Samples of p(x|N) are superimposed (i = 1600 . . . 1800 omitted for clarity).\n\nThis distribution over x combined with observations {L_i} gives a distribution over Ẑ:\n\n    p(Ẑ|{L_i}, N) ∝ ∫ δ(Ẑ(x) − Ẑ) p(x|N) dx.    (4)\n\nSamples from the posterior over θ are also available; see Skilling [1] for details. Nested sampling was introduced by Skilling [1]. 
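A minimal sketch of algorithm 2 (ours, not the paper's code) on a toy problem where constrained-prior draws are exact, so no MCMC is needed: the prior is Uniform(0, 1) and the likelihood is L(θ) = θ, giving Z = ∫_0^1 θ dθ = 1/2. The deterministic mass assignment x_i ≈ e^{−i/N} and rectangle rule used here are the cheap approximations discussed in section 2.2:

```python
import math, random

def nested_sampling_log_Z(N=200, I=2000, seed=0):
    """Algorithm 2 on a toy model: prior Uniform(0,1), likelihood
    L(theta) = theta, so the evidence is Z = 1/2. Drawing from the
    constrained prior is exact here (uniform on (worst, 1])."""
    rng = random.Random(seed)
    particles = [rng.random() for _ in range(N)]
    log_Z_terms = []
    log_x_prev = 0.0                     # x_0 = 1
    for i in range(1, I + 1):
        worst = min(particles)           # L(theta) = theta: min theta
        log_x = -i / N                   # geometric-mean prior mass
        # rectangle contribution: (x_{i-1} - x_i) * L_i
        width = math.exp(log_x_prev) - math.exp(log_x)
        log_Z_terms.append(math.log(width) + math.log(worst))
        log_x_prev = log_x
        # replace the worst particle with a constrained-prior draw
        particles[particles.index(worst)] = worst + (1 - worst) * rng.random()
    m = max(log_Z_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_Z_terms))
```

With N = 200 and I = 2000 the unexplored mass e^{−10} is negligible and the estimate lands close to log(1/2).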
The key idea is that samples from the prior, subject to a nested sequence of constraints (3), give a probabilistic realization of the curve in figure 1(a). Related work can be found in McDonald and Singer [4]. Explanatory notes and some code are available online¹. In this paper we present some new discussion of important issues regarding the practical implementation of nested sampling and provide the first application to a challenging problem. This leads to the first cluster-based method for Potts models with first-order phase transitions of which we are aware.\n\n2 Implementation issues\n\n2.1 MCMC approximations\n\nThe nested sampling algorithm assumes obtaining samples from π*(θ|L(θ_{i-1})), equation (3), is possible. Rejection sampling using π would slow down exponentially with iteration number i. We explore approximate sampling from π* using Markov chain Monte Carlo (MCMC) methods.\n\nIn high-dimensional problems it is likely that the majority of π*'s mass is typically in a thin shell at the contour surface [5, p37]. This suggests finding efficient chains that sample at constant likelihood, a microcanonical distribution. In order to complete an ergodic MCMC method, we also need transition operators that can alter the likelihood (within the constraint). A simple Metropolis method may suffice.\n\nWe must initialize the Markov chain for each new sample somewhere. One possibility is to start at the position of the deleted point, θ_{i-1}, on the contour constraint, which is independent of the other points and not far from the bulk of the required uniform distribution. However, if the Markov chain mixes slowly amongst modes, the new point starting at θ_{i-1} may be trapped in an insignificant mode. In this case it would be better to start at one of the other N − 1 existing points inside the contour constraint. They are all draws from the correct distribution, π*(θ|L(θ_{i-1})), so represent modes fairly. 
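A minimal Metropolis sketch of approximate sampling from the constrained prior π*(θ|L*) of equation (3). This is our illustration, not the paper's implementation: the function name, the uniform-box prior on [0, 1]^dim, and the uniform proposal are all assumptions made for concreteness. Because the prior is flat, a proposal is accepted exactly when it stays in the box and above the likelihood floor:

```python
import random

def constrained_prior_mcmc(theta0, log_L, log_L_min, step, n_steps, rng):
    """Metropolis sketch for the constrained prior of eq. (3),
    assuming a Uniform prior on [0,1]^dim: accept a proposal iff it
    remains in the box and satisfies log L(theta) > log_L_min.
    theta0 must already satisfy the constraint."""
    theta = list(theta0)
    for _ in range(n_steps):
        prop = [t + step * (2 * rng.random() - 1) for t in theta]
        if (all(0.0 <= t <= 1.0 for t in prop)
                and log_L(prop) > log_L_min):
            theta = prop                 # flat prior: always accept
    return theta
```

By construction every state of the chain satisfies the likelihood constraint, which is the invariant nested sampling needs from these approximate draws.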
However, this method may also require many Markov chain steps, this time to make the new point effectively independent of the point it cloned.\n\n¹ http://www.inference.phy.cam.ac.uk/bayesys/\n\nFigure 3: Histograms of errors in the point estimate log(Z̃) over 1000 random experiments for different approximations. The test system was a 40-dimensional hypercube of length 100 with uniform prior centered on the origin. The log-likelihood was L = −|θ|²/2. Nested sampling used N = 10, I = 2000. (a) Monte Carlo estimation (equation (5)) using S = 12 sampled trajectories. (b) S = 1200 sampled trajectories. (c) Deterministic approximation using the geometric mean trajectory. In this example perfect integration over p(x|N) gives a distribution of width ≈ 3 over log(Ẑ). Therefore, improvements over (c) for approximating equation (5) are unwarranted.\n\n2.2 Integrating out x\n\nTo estimate quantities of interest, we average over p(x|N), as in equation (4). The mean of a distribution over log(Ẑ) can be found by simple Monte Carlo estimation:\n\n    log(Ẑ) ≈ ∫ log(Ẑ(x)) p(x|N) dx ≈ (1/S) Σ_{s=1}^S log(Ẑ(x^(s))),    x^(s) ∼ p(x|N).    (5)\n\nThis scheme is easily implemented for any expectation under p(x|N), including error bars from the variance of log(Ẑ). To reduce noise in comparisons between runs it is advisable to reuse the same samples from p(x|N) (e.g. clamp the seed used to generate them).\n\nA simple deterministic approximation is useful for understanding, and also provides fast-to-compute, low-variance estimators. Figure 2 shows sampled trajectories of x_i as the algorithm progresses. The geometric mean path, x_i ≈ exp(∫ p(x_i|N) log x_i dx_i) = e^{−i/N}, follows the path of typical settings of x. Using this single x setting is a reasonable and very cheap alternative to averaging over settings (equation 5); see figure 3. 
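The shrinkage view behind figure 2 and equation (5) is easy to simulate (a sketch of ours, with hypothetical names): under p(x|N) each ratio t_i = x_i/x_{i-1} has density N t^{N-1}, i.e. t_i = U^{1/N} with U ∼ Uniform(0, 1), so log x_i is a random walk with mean −i/N and standard deviation √i/N:

```python
import math, random

def sample_log_x(I, N, rng):
    """One trajectory of log prior masses under p(x|N):
    log x_I is a sum of I terms log t_i with t_i = U^(1/N),
    each contributing mean -1/N and variance 1/N^2."""
    log_x = 0.0
    for _ in range(I):
        log_x += math.log(rng.random()) / N
    return log_x

rng = random.Random(0)
I, N, S = 2000, 8, 1000
mean_log_x = sum(sample_log_x(I, N, rng) for _ in range(S)) / S
# the geometric-mean path predicts log x_I = -I/N = -250
```

Averaging many such trajectories recovers the geometric-mean path e^{−i/N}; a single trajectory wanders from it by roughly √i/N in the log, matching the error bars quoted in the text.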
Typically the trapezoidal estimate of the integral, Ẑ, is dominated by a small number of trapezoids, around iteration i* say. Considering uncertainty on just log x_{i*} = −i*/N ± √i*/N provides reasonable and convenient error bars.\n\n3 Potts Models\n\nThe Potts model, an undirected graphical model, defines a probability distribution over discrete variables s = (s_1, . . . , s_n), each taking on one of q distinct \"colors\":\n\n    P(s|J, q) = (1/Z_P(J, q)) exp( Σ_{(ij)∈E} J (δ_{s_i s_j} − 1) ).    (6)\n\nThe variables exist as nodes on a graph where (ij) ∈ E means that nodes i and j are linked by an edge. The Kronecker delta, δ_{s_i s_j}, is one when s_i and s_j are the same color and zero otherwise. Neighboring nodes pay an \"energy penalty\" of J when they are different colors. Here we assume identical positive couplings J > 0 on each edge (section 4 discusses the extension to different J_{ij}). The Ising model and Boltzmann machine are both special cases of the Potts model with q = 2. Our goal is to compute the normalization constant Z_P(J, q), where the discrete variables s are the θ variables that need to be integrated (i.e. summed) over.\n\n3.1 Swendsen–Wang sampling\n\nWe will take advantage of the \"Fortuin–Kasteleyn–Swendsen–Wang\" (FKSW) joint distribution, identified explicitly in Edwards and Sokal [6], over color variables s and a bond variable for each edge in E, d_{ij} ∈ {0, 1}:\n\n    P(s, d) = (1/Z_P(J, q)) Π_{(ij)∈E} [ (1 − p) δ_{d_{ij},0} + p δ_{d_{ij},1} δ_{s_i,s_j} ],    p ≡ 1 − e^{−J}.    (7)\n\nThe marginal distribution over s in the FKSW model is the Potts distribution, equation (6). The marginal distribution over the bonds is the random cluster model of Fortuin and Kasteleyn [7]:\n\n    P(d) = (1/Z_P(J, q)) p^D (1 − p)^{|E|−D} q^{C(d)} = (1/Z_P(J, q)) exp(D log(e^J − 1)) e^{−J|E|} q^{C(d)},    (8)\n\nwhere C(d) is the number of connected components in a graph with edges wherever d_{ij} = 1, and D = Σ_{(ij)∈E} d_{ij}. 
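The consistency of equations (6)-(8) can be checked by brute force on a tiny graph (our sketch; the variable names are ours): summing the Potts weights over all colorings and summing the random-cluster weights over all bond configurations must give the same partition function Z_P(J, q):

```python
import itertools, math

edges = [(0, 1), (1, 3), (3, 2), (2, 0)]      # 2x2 grid: 4 nodes, 4 edges
q, J = 2, 1.0
p = 1 - math.exp(-J)

# Z_P by brute force over colorings, eq. (6)
Z_P = sum(math.exp(J * sum((s[i] == s[j]) - 1 for i, j in edges))
          for s in itertools.product(range(q), repeat=4))

def n_components(bonds):
    """Count connected components using only edges with d_ij = 1."""
    parent = list(range(4))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for (i, j), d in zip(edges, bonds):
        if d:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(4)})

# random-cluster sum, eq. (8): must equal Z_P
Z_rc = sum(p ** sum(d) * (1 - p) ** (len(edges) - sum(d))
           * q ** n_components(d)
           for d in itertools.product((0, 1), repeat=len(edges)))
```

Both sums marginalize the same FKSW joint of equation (7), over d and over s respectively, so Z_P and Z_rc agree to floating-point precision.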
As the partition functions of equations 6, 7 and 8 are identical, we should consider using any of these distributions to compute Z_P(J, q). The algorithm of Swendsen and Wang [8] performs block Gibbs sampling on the joint model by alternately sampling from P(d|s) and P(s|d). This can convert a sample from any of the three distributions into a sample from one of the others.\n\n3.2 Nested Sampling\n\nA simple approximate nested sampler uses a fixed number of Gibbs sampling updates of π*. Cluster-based updates are also desirable in these models. Focusing on the random cluster model, we rewrite equation (8):\n\n    P(d) = (1/Z_N) L(d) π(d),  where  L(d) = exp(D log(e^J − 1)),  π(d) = q^{C(d)}/Z_π,  Z_N = (Z_P(J, q)/Z_π) exp(J|E|).    (9)\n\nLikelihood thresholds are thresholds on the total number of bonds D. Many states have identical D, which requires careful treatment; see appendix A. Nested sampling on this system will give the ratio Z_P/Z_π. The prior normalization, Z_π, can be found from the partition function of a Potts system at J = log(2). The following steps give two MCMC operators to change the bonds d → d':\n\n1. Create a random coloring, s, uniformly from the q^{C(d)} colorings satisfying the bond constraints d, as in the Swendsen–Wang algorithm.\n2. Count sites that allow bonds, E(s) = Σ_{(ij)∈E} δ_{s_i,s_j}.\n3. Either, operator 1: record the number of bonds D' = D = Σ_{(ij)∈E} d_{ij}. Or, operator 2: draw D' from Q(D'|E(s)) ∝ (E(s) choose D').\n4. Throw away the old bonds, d, and pick uniformly from one of the (E(s) choose D') ways of setting D' bonds in the E(s) available sites.\n\nThe probability of proposing a particular coloring and new setting of the bonds is\n\n    Q(s, d'|d) = Q(d'|s, D') Q(D'|E(s)) Q(s|d) = [1/(E(s) choose D')] Q(D'|E(s)) [1/q^{C(d)}].    (10) 
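Steps 1-4 above, with operator 1 (which keeps D fixed), can be sketched as follows. This is our illustration with hypothetical names, not the authors' implementation; it assumes a small graph given as an edge list:

```python
import random

def cluster_bond_move(bonds, edges, n, q, rng):
    """Operator 1 from section 3.2 (a sketch): resample the bond
    configuration at fixed bond count D, via a random coloring
    consistent with the current bonds, Swendsen-Wang style."""
    # connected components under the current bonds (union-find)
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for (i, j), d in zip(edges, bonds):
        if d:
            parent[find(i)] = find(j)
    # step 1: one uniform color per connected component
    color_of = {}
    s = [color_of.setdefault(find(i), rng.randrange(q)) for i in range(n)]
    # step 2: edges whose endpoints share a color may carry bonds;
    # this always includes every currently bonded edge
    allowed = [k for k, (i, j) in enumerate(edges) if s[i] == s[j]]
    # step 3 (operator 1): keep the bond count D fixed
    D = sum(bonds)
    # step 4: place D bonds uniformly among the allowed edges
    new = [0] * len(edges)
    for k in rng.sample(allowed, D):
        new[k] = 1
    return new
```

Because every bonded edge joins same-colored endpoints, the allowed set always contains at least D edges, so the move is well defined and preserves D exactly, matching the constant-likelihood requirement of section 2.1.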
Summing over colorings, the correct Metropolis–Hastings acceptance ratio is\n\n    a = [π(d') Σ_s Q(s, d|d')] / [π(d) Σ_s Q(s, d'|d)] = [q^{C(d')} Σ_s Q(D|E(s)) / ((E(s) choose D) q^{C(d')})] / [q^{C(d)} Σ_s Q(D'|E(s)) / ((E(s) choose D') q^{C(d)})] = 1,    (11)\n\nregardless of the choice in step 3. The simple first choice solves the difficult problem of navigating at constant D. The second choice defines an ergodic chain².\n\nTable 1: Partition function results for 16 × 16 Potts systems (see text for details).\n\n    Method                          | q = 2 (Ising), J = 1 | q = 10, J = 1.477\n    Gibbs AIS                       | 7.1 ± 1.1            | (1.5)\n    Swendsen–Wang AIS               | 7.4 ± 0.1            | (1.2)\n    Gibbs nested sampling           | 7.1 ± 1.0            | 12.2 ± 2.4\n    Random-cluster nested sampling  | 7.1 ± 0.7            | 14.1 ± 1.8\n    Acceptance ratio                | 7.3                  | 11.2\n\n4 Results\n\nTable 1 shows results on two example systems: an Ising model (q = 2) and a q = 10 Potts model in a difficult parameter regime. We tested nested samplers using Gibbs sampling and the cluster-based algorithm, and annealed importance sampling (AIS) [9] using both Gibbs sampling and Swendsen–Wang cluster updates. We also developed an acceptance ratio method [10] based on our representation in equation (9), which we ran extensively and should give nearly correct results.\n\nAnnealed importance sampling was run 100 times, with a geometric spacing of 10^4 settings of J as the annealing schedule. Nested sampling used N = 100 particles and 100 full-system MCMC updates to approximate each draw from π*. Each Markov chain was initialized at one of the N − 1 particles satisfying the current constraint. In trials using the other alternative (section 2.1) the Gibbs nested sampler could get stuck permanently in a local maximum of the likelihood, while the cluster method gave erroneous answers for the Ising system. AIS performed very well on the Ising system. We took advantage of its performance in easy parameter regimes to compute Z_π for use in the cluster-based nested sampler. 
However, with a \"temperature-based\" annealing schedule, AIS was unable to give useful answers for the q = 10 system, while nested sampling appears to be correct within its error bars. It is known that even the efficient Swendsen–Wang algorithm mixes slowly for Potts models with q > 4 near critical values of J [11]; see figure 4. Typical Potts model states are either entirely disordered or ordered; disordered states contain a jumble of small regions with different colors (e.g. figure 4(b)), while in ordered states the system is predominantly one color (e.g. figure 4(d)). Moving between these two phases is difficult; defining a valid MCMC method that moves between distinct phases requires knowledge of the relative probability of the whole collections of states in those phases. Temperature-based annealing algorithms explore the model for a range of settings of J and fail to capture the correct behavior near the transition. Despite using closely related Markov chains to those used in AIS, nested sampling can work in all parameter regimes. Figure 4(e) shows how nested sampling can explore a mixture of ordered and disordered phases. By moving steadily through these states, nested sampling is able to estimate the prior mass associated with each likelihood value.\n\n² Proof: with finite probability all s_i are given the same color, then any allowable D is possible, and in turn all allowable d have finite probability.\n\nFigure 4: Two 256 × 256, q = 10 Potts models with starting states (a) and (c) were simulated with 5 × 10^6 full-system Swendsen–Wang updates with J = 1.42577. The corresponding results, (b) and (d), are typical of all the intermediate samples: Swendsen–Wang is unable to take (a) into an ordered phase, or (c) into a disordered phase, although both phases are typical at this J. 
(e), in contrast, shows an intermediate state of nested sampling, which succeeds in bridging the phases.\n\nThis behaviour is not possible in algorithms that use J as a control parameter.\n\nThe potentials on every edge of the Potts model in this paper were the same. Much of the formalism above generalizes to allow different edge weights J_{ij} on each edge, and non-zero biases on each variable. Indeed Edwards and Sokal [6] gave a general procedure for constructing such auxiliary-variable joint distributions. This generalization would make the model more relevant to MRFs used in other fields (e.g. computer vision). The challenge for nested sampling remains the invention of effective sampling schemes that keep a system at or near constant energy. Generalizing step 4 in section 3.2 would be the difficult step.\n\nOther temperatureless Monte Carlo methods exist; e.g. Berg and Neuhaus [12] study the Potts model using the multicanonical ensemble. Nested sampling has some unique properties compared to these established methods. Formally it has only one free parameter, N, the number of particles. Unless problems with multiple modes demand otherwise, N = 1 often reveals useful information, and if the error bars on Z are too large further runs with larger N may be performed.\n\n5 Conclusions\n\nWe have applied nested sampling to compute the normalizing constant of a system that is challenging for many Monte Carlo methods.\n\n- Nested sampling's key technical requirement, an ability to draw samples uniformly from a constrained prior, is largely solved by efficient MCMC methods.\n- No complex schedules are required; steady progress towards compact regions of large likelihood is controlled by a single free parameter, N, the number of particles.\n- Multiple particles, a built-in feature of this algorithm, are often necessary to obtain accurate results.\n- Nested sampling has no special difficulties on systems with first-order phase transitions, whereas all temperature-based methods fail. 
We believe that nested sampling's unique properties will be found useful in a variety of statistical applications.\n\nA Degenerate likelihoods\n\nThe description in section 1 assumed that the likelihood function provides a total ordering of elements of the parameter space. However, distinct elements dx and dx' could have the same likelihood, either because the parameters are discrete, or because the likelihood is degenerate. One way to break degeneracies is through a joint model with the variables of interest θ and an independent variable m ∈ [0, 1]:\n\n    P(θ, m) = P(θ) · P(m) = (1/Z) L(θ) π(θ) · (1/Z_m) L(m) π(m),    (12)\n\nwhere L(m) = 1 + ε(m − 0.5), π(m) = 1 and Z_m = 1. We choose ε such that the variation in log(L(m)) is smaller than the smallest difference in log(L(θ)) allowed by machine precision. Standard nested sampling is now possible. Assuming we have a likelihood constraint L_i, we need to be able to draw from\n\n    P(θ', m'|θ, m, L_i) ∝ { π(θ') π(m') if L(θ') L(m') > L_i;  0 otherwise. }    (13)\n\nThe additional variable can be ignored except when L(θ') = L(θ_i); then only m' > m is possible. Therefore, the probability of states with likelihood L(θ_i) is weighted by (1 − m).\n\nReferences\n\n[1] John Skilling. Nested sampling. In R. Fischer, R. Preuss, and U. von Toussaint, editors, Bayesian inference and maximum entropy methods in science and engineering, AIP Conference Proceedings 735, pages 395–405, 2004.\n[2] Andrew Gelman and Xiao-Li Meng. Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Statist. Sci., 13(2):163–185, 1998.\n[3] Matthew J. Beal and Zoubin Ghahramani. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics, 7:453–464, 2003.\n[4] I. R. McDonald and K. Singer. Machine calculation of thermodynamic properties of a simple fluid at supercritical temperatures. J. Chem. Phys., 47(11):4766–4772, 1967.\n[5] David J.C. MacKay. Information Theory, Inference, and Learning Algorithms. CUP, 2003. 
www.inference.phy.cam.ac.uk/mackay/itila/.\n[6] Robert G. Edwards and Alan D. Sokal. Generalization of the Fortuin–Kasteleyn–Swendsen–Wang representation and Monte Carlo algorithm. Phys. Rev. D, 38(6), 1988.\n[7] C. M. Fortuin and P. W. Kasteleyn. On the random-cluster model. I. Introduction and relation to other models. Physica, 57:536–564, 1972.\n[8] R. H. Swendsen and J. S. Wang. Nonuniversal critical dynamics in Monte Carlo simulations. Phys. Rev. Lett., 58(2):86–88, January 1987.\n[9] Radford M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.\n[10] Charles H. Bennett. Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics, 22(2):245–268, October 1976.\n[11] Vivek K. Gore and Mark R. Jerrum. The Swendsen–Wang process does not always mix rapidly. In 29th ACM Symposium on Theory of Computing, pages 674–681, 1997.\n[12] Bernd A. Berg and Thomas Neuhaus. Multicanonical ensemble: A new approach to simulate first-order phase transitions. Phys. Rev. Lett., 68(1):9–12, January 1992.", "award": [], "sourceid": 2753, "authors": [{"given_name": "Iain", "family_name": "Murray", "institution": null}, {"given_name": "David", "family_name": "MacKay", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "John", "family_name": "Skilling", "institution": null}]}