{"title": "Model evidence from nonequilibrium simulations", "book": "Advances in Neural Information Processing Systems", "page_first": 1753, "page_last": 1762, "abstract": "The marginal likelihood, or model evidence, is a key quantity in Bayesian parameter estimation and model comparison. For many probabilistic models, computation of the marginal likelihood is challenging, because it involves a sum or integral over an enormous parameter space. Markov chain Monte Carlo (MCMC) is a powerful approach to compute marginal likelihoods. Various MCMC algorithms and evidence estimators have been proposed in the literature. Here we discuss the use of nonequilibrium techniques for estimating the marginal likelihood. Nonequilibrium estimators build on recent developments in statistical physics and are known as annealed importance sampling (AIS) and reverse AIS in probabilistic machine learning. We introduce estimators for the model evidence that combine forward and backward simulations and show for various challenging models that the evidence estimators outperform forward and reverse AIS.", "full_text": "Model evidence from nonequilibrium simulations

Michael Habeck
Statistical Inverse Problems in Biophysics, Max Planck Institute for Biophysical Chemistry &
Institute for Mathematical Stochastics, University of Göttingen, 37077 Göttingen, Germany
email mhabeck@gwdg.de

Abstract

The marginal likelihood, or model evidence, is a key quantity in Bayesian parameter estimation and model comparison. For many probabilistic models, computation of the marginal likelihood is challenging, because it involves a sum or integral over an enormous parameter space. Markov chain Monte Carlo (MCMC) is a powerful approach to compute marginal likelihoods. Various MCMC algorithms and evidence estimators have been proposed in the literature. Here we discuss the use of nonequilibrium techniques for estimating the marginal likelihood.
Nonequilibrium estimators build on recent developments in statistical physics and are known as annealed importance sampling (AIS) and reverse AIS in probabilistic machine learning. We introduce estimators for the model evidence that combine forward and backward simulations and show for various challenging models that the evidence estimators outperform forward and reverse AIS.

1 Introduction

The marginal likelihood or model evidence is a central quantity in Bayesian inference [1, 2], but notoriously difficult to compute. If the likelihood L(x) ≡ p(y|x, M) models data y and the prior π(x) ≡ p(x|M) expresses our knowledge about the parameters x of the model M, the posterior p(x|y, M) and the model evidence Z are given by:

p(x|y, M) = p(y|x, M) p(x|M) / p(y|M) = L(x) π(x) / Z,   Z ≡ p(y|M) = ∫ L(x) π(x) dx .   (1)

Parameter estimation proceeds by drawing samples from p(x|y, M), and different ways to model the data are ranked by their evidence. For example, two models M1 and M2 can be compared via their Bayes factor, which is proportional to the ratio of their marginal likelihoods p(y|M1)/p(y|M2) [3]. Often the posterior (and perhaps also the prior) is intractable in the sense that it is not possible to compute the normalizing constant, and therefore also the evidence, analytically. This makes it difficult to compare different models via their posterior probability and model evidence. Markov chain Monte Carlo (MCMC) algorithms [4] only require unnormalized probability distributions and are among the most powerful and accurate methods to estimate the marginal likelihood, but they are computationally expensive.
Therefore, it is important to develop efficient MCMC algorithms that can sample from the posterior and allow us to compute the marginal likelihood.

There is a close analogy between the marginal likelihood and the log-partition function or free energy from statistical physics [5]. Therefore, many concepts and algorithms originating in statistical physics have been applied to problems arising in probabilistic inference. Here we show that nonequilibrium fluctuation theorems (FTs) [6, 7, 8] can be used to estimate the marginal likelihood from forward and reverse simulations.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Marginal likelihood estimation by annealed importance sampling

A common MCMC strategy to sample from the posterior and estimate the evidence is to simulate a sequence of distributions pk that bridge between the prior and the posterior [9]. Samples can either be generated in sequential order as in annealed importance sampling (AIS) [10] or in parallel as in replica-exchange Monte Carlo or parallel tempering [11, 12]. Several methods have been proposed to estimate the marginal likelihood from MCMC samples, including thermodynamic integration (TI) [9], annealed importance sampling (AIS) [10], nested sampling (NS) [13] or the density of states (DOS) [14]. Most of these approaches (TI, DOS and NS) assume that we can draw exact samples from the intermediate distributions pk, typically after a sufficiently large number of equilibration steps has been simulated. AIS, on the other hand, does not assume that the samples are equilibrated after each annealing step, which makes AIS very attractive for analyzing complex models for which equilibration is hard to achieve.

AIS employs a sequence of K + 1 probability distributions pk and Markov transition operators Tk whose stationary distributions are pk, i.e. ∫ Tk(x|x′) pk(x′) dx′ = pk(x).
In a Bayesian setting, p0 is the prior and pK the posterior. Typically, pk is intractable, meaning that we only know an unnormalized version fk, but not the normalizer Zk, i.e. pk(x) = fk(x)/Zk where Zk = ∫ fk(x) dx, and the evidence is Z = ZK/Z0. Often, it is convenient to write fk as an energy-based model fk(x) = exp{−Ek(x)}. In Bayesian inference, a popular choice is

fk(x) = [L(x)]^βk π(x)

with prior π(x) and likelihood L(x); βk is an inverse temperature schedule that starts at β0 = 0 (prior) and ends at βK = 1 (posterior).

AIS samples paths x = [x0, x1, ..., x_{K−1}] according to the probability

Pf[x] = T_{K−1}(x_{K−1}|x_{K−2}) ··· T1(x1|x0) p0(x0)   (2)

where, following Crooks [8], calligraphic symbols and square brackets denote quantities that depend on the entire path. The subscript indicates that the path is generated by a forward simulation, which starts from p0 and propagates the initial state through a sequence of new states by the successive action of the Markov kernels T1, T2, ..., T_{K−1}.

The importance weight of a path is

w[x] = ∏_{k=0}^{K−1} f_{k+1}(xk)/fk(xk) = exp{ − ∑_{k=0}^{K−1} [ E_{k+1}(xk) − Ek(xk) ] } .   (3)

The average weight over many paths is a consistent and unbiased estimator of the model evidence Z = ZK/Z0, which follows from [15, 10] (see supplementary material for details):

⟨w⟩_f = ∫ w[x] Pf[x] D[x] = Z   (4)

where the average ⟨·⟩_f is an integral over all possible paths generated by the forward sequence of transition kernels (D[x] = dx0 ··· dx_{K−1}). The average weight of M forward paths x^(i) is an estimate of the model evidence: Z ≈ (1/M) ∑_i w[x^(i)]. This estimator is at the core of AIS and its variants [10, 16]. To avoid overflow problems, it will be numerically more stable to compute log weights.
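As a minimal sketch (not the paper's reference implementation), the forward AIS recursion with log-weight accumulation might look as follows. The Gaussian bridge f_k(x) = exp{−(1 + βk)x²/2} and the perfectly mixing kernel are illustrative assumptions chosen so the exact answer, log Z = −½ log 2, is known:

```python
import numpy as np

def ais_log_weights(log_f, sample_prior, transition, K, M):
    """Forward AIS: accumulate the log importance weights of Eq. (3)."""
    x = sample_prior(M)                           # M starting points from p_0
    log_w = np.zeros(M)
    for k in range(K):
        log_w += log_f(k + 1, x) - log_f(k, x)    # log f_{k+1}(x_k) - log f_k(x_k)
        x = transition(k + 1, x)                  # move towards p_{k+1}
    return log_w

# Toy check: f_k(x) = exp(-(1 + beta_k) x^2 / 2), so Z = Z_K/Z_0 = 1/sqrt(2).
rng = np.random.default_rng(0)
K, M = 50, 2000
beta = np.linspace(0.0, 1.0, K + 1)
log_f = lambda k, x: -0.5 * (1.0 + beta[k]) * x**2
sample_prior = lambda m: rng.normal(0.0, 1.0, m)
# idealized kernel for the sketch: exact sampling from p_k (perfect mixing)
transition = lambda k, x: rng.normal(0.0, 1.0 / np.sqrt(1.0 + beta[k]), x.shape)

log_w = ais_log_weights(log_f, sample_prior, transition, K, M)
# log of the average weight (Eq. 4), evaluated stably in log space
log_Z = np.log(np.mean(np.exp(log_w - log_w.max()))) + log_w.max()
```

With a real model, `transition` would be a Metropolis or Gibbs kernel with stationary distribution pk, and the same recursion applies unchanged.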
Logarithmic weights also naturally occur from a physical perspective where −log w[x] is identified as the work required to generate path x.

3 Nonequilibrium fluctuation theorems

Nonequilibrium fluctuation theorems (FTs) quantify the degree of irreversibility of a stochastic process by relating the probability of generating a path by a forward simulation to the probability of generating the exact same path by a time-reversed simulation. If the Markov kernels Tk satisfy detailed balance, time reversal is achieved by applying the same sequence of kernels in reverse order. For Markov kernels not satisfying detailed balance, the definition is slightly more general [7, 10]. Here we assume that all kernels Tk satisfy detailed balance, which is valid for Markov chains based on the Metropolis algorithm and its variants [4].

Under these assumptions, the probability of generating the path x by a reverse simulation starting in x_{K−1} is

Pr[x] = T1(x0|x1) ··· T_{K−1}(x_{K−2}|x_{K−1}) pK(x_{K−1}) .   (5)

Averages over the reverse paths are indicated by ⟨·⟩_r. The detailed fluctuation theorem [6, 8] relates the probabilities of generating x in a forward and reverse simulation (see supplementary material):

Pf[x]/Pr[x] = (ZK/Z0) ∏_{k=0}^{K−1} fk(xk)/f_{k+1}(xk) = Z/w[x] = exp{W[x] − ΔF}   (6)

where the physical analogs of the path weight and the marginal likelihood were introduced, namely the work W[x] = −log w[x] = ∑_k [ E_{k+1}(xk) − Ek(xk) ] and the free energy difference ΔF = −log Z = −log(ZK/Z0).
Various demonstrations of relation (6) exist in the physics and machine learning literature [6, 7, 8, 10, 17].

Lower and upper bounds sandwiching the log evidence [17, 18, 16] follow directly from equation (6) and the non-negativity of the relative entropy:

DKL(Pf‖Pr) = ∫ Pf[x] log(Pf[x]/Pr[x]) D[x] = ⟨W⟩_f − ΔF ≥ 0 .

From DKL(Pr‖Pf) ≥ 0 we obtain an upper bound on log Z, such that overall we have:

⟨log w⟩_f = −⟨W⟩_f ≤ log Z ≤ −⟨W⟩_r = ⟨log w⟩_r .   (7)

Grosse et al. use these bounds to assess the convergence of bidirectional Monte Carlo [18].

Thanks to the detailed fluctuation theorem (Eq. 6), we can also relate the marginal distributions of the work resulting from many forward and reverse simulations:

pf(W) = ∫ δ(W − W[x]) Pf[x] D[x] = pr(W) e^{W−ΔF}   (8)

which is Crooks' fluctuation theorem (CFT) [7]. CFT tells us that the work distributions pf and pr cross exactly at W = ΔF. Therefore, by plotting histograms of the work obtained in forward and backward simulations, we can read off an estimate for the negative log evidence.

The Jarzynski equality (JE) [15] follows directly from CFT due to the normalization of pr:

∫ pf(W) e^{−W} dW = ⟨e^{−W}⟩_f = e^{−ΔF} .   (9)

JE restates the AIS estimator ⟨w⟩_f = Z (Eq. 4) in terms of the physical quantities. Remarkably, JE holds for any stochastic dynamics bridging between the initial and target distribution. This suggests using fast annealing protocols to drag samples from the prior into the posterior. However, the JE involves an exponential average in which paths requiring the least work contribute most strongly. These paths correspond to work realizations that reside in the left tail of pf.
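Given arrays of forward and reverse work values, the bounds of Eq. (7) and the JE estimate of Eq. (9) reduce to a few lines. The sketch below uses synthetic Gaussian work samples as an assumption: if pf = N(μ, σ²), then CFT (Eq. 8) forces pr = N(μ − σ², σ²) and ΔF = μ − σ²/2, so the true log evidence is known:

```python
import numpy as np

def logmeanexp(a):
    # stable evaluation of log( mean( exp(a) ) )
    m = a.max()
    return m + np.log(np.mean(np.exp(a - m)))

# CFT-consistent synthetic work samples (mu, s2 are illustrative choices)
rng = np.random.default_rng(0)
mu, s2, M = 5.0, 2.0, 5000
W_f = rng.normal(mu, np.sqrt(s2), M)        # forward work ~ p_f
W_r = rng.normal(mu - s2, np.sqrt(s2), M)   # reverse work ~ p_r
# true Delta F = mu - s2/2 = 4, i.e. log Z = -4

lower = -W_f.mean()        # <log w>_f <= log Z        (Eq. 7)
upper = -W_r.mean()        # log Z <= <log w>_r        (Eq. 7)
log_Z = logmeanexp(-W_f)   # Jarzynski / AIS estimate  (Eq. 9)
```

The gap between `lower` and `upper` is exactly the total dissipation DKL(Pf‖Pr) + DKL(Pr‖Pf), which shrinks as the annealing becomes slower.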
With faster annealing, the chance of generating a minimal work path decreases exponentially, so such paths become rare events.

A key feature of CFT and JE is that they do not require exact samples from the stationary distributions pk, which is needed in applications of TI or DOS. For complex probabilistic models, the states generated by the kernels Tk will typically "lag behind" due to slow mixing, especially near phase transitions. The k-th state of the forward path will follow the intermediate distribution

qk(xk) = ∫ ∏_{l=1}^{k} Tl(xl|x_{l−1}) p0(x0) dx0 ··· dx_{k−1},   q0(x) = p0(x) .   (10)

Unless the transition kernels Tk mix very rapidly, qk ≠ pk for k > 0.

Consider the common case in Bayesian inference where Ek(x) = βk E(x) and E(x) = −log L(x). Then, according to inequalities (7), we have the following lower bound on the marginal likelihood:

⟨log w⟩_f = −⟨W⟩_f = − ∑_{k=0}^{K−1} (β_{k+1} − βk) ⟨E⟩_{qk} ≤ log Z   (11)

Figure 1: Nonequilibrium analysis of a Gaussian toy model. (A) Work distributions pf and pr shown in blue and green. The correct free energy difference (minus log evidence) is indicated by a dashed line. (B) Comparison of stationary distribution pk and marginal distributions qk generated by the transition kernels. Shown is a 1σ band about the mean positions for pk (blue) and qk (green). (C) Lower and upper bounds of the log evidence (Eq. 7) and logarithm of the exponential average over the forward work distribution for increasingly slow annealing schedules.

and an analogous expression for the upper bound/reverse direction, in which the average energies along the forward path ⟨E⟩_{qk} need to be replaced by the corresponding average energies along the backward path. The difference between the forward and reverse averages is called "hysteresis" in physics.
The larger the hysteresis, the more strongly the marginal likelihood bounds will disagree, and the more uncertain our estimate of the model evidence will be. The opposite limiting case is slow annealing and full equilibration, where the bound (Eq. 11) approaches thermodynamic integration (see supplementary material). So we expect a tradeoff between switching fast in order to save computation time, and the desire to control the amount of hysteresis, which otherwise makes it difficult to extract accurate evidence estimates from the simulations.

4 Illustration for a tractable model

Let us illustrate the nonequilibrium results for a tractable model where the initial, the target and all intermediate distributions are Gaussians pk(x) = N(x; μk, σk²) with means μk and standard deviations σk > 0. The transition kernels are also Gaussian:

Tk(x|x′) = N(x; (1 − τk) μk + τk x′, (1 − τk²) σk²)

with τk ∈ [0, 1] controlling the speed of convergence: for τk = 0 convergence is immediate, whereas for τk → 1, the chain generated by Tk converges infinitely slowly. Note that the kernels Tk satisfy detailed balance; therefore backward simulations apply the same kernels in reverse order. The energies and exact log partition functions are Ek(x) = (x − μk)²/(2σk²) and log Zk = log(2πσk²)/2.

We bridge between an initial distribution with mean μ0 = 20 and standard deviation σ0 = 10 and a target with μK = 0, σK = 1 using K = 10 intermediate distributions and compute work distributions resulting from forward/backward simulations. Both distributions indeed cross at W = −log Z = log(σ0/σK) = log 10, as predicted by CFT (see Fig. 1A). Figure 1B illustrates the difference between the marginal distribution of the samples after k annealing steps qk (Eq.
10) and the stationary distribution pk. The marginal distributions qk are also Gaussian, but their means and variances diverge from the means and variances of the stationary distributions pk. This divergence results in hysteresis if the annealing process is forced to progress very rapidly without equilibrating the samples (quenching). Figure 1C confirms the validity of the JE (Eq. 9) and of the lower and upper bounds (Eq. 7). For short annealing protocols, the bounds are very conservative, whereas the Jarzynski equality gives the correct evidence even for fast protocols (small K). In realistic applications, however, we cannot compute the work distribution pf over the entire range of work values. In fast annealing simulations, it will become increasingly difficult to explore the left tail of the work distribution, such that in practice the accuracy of the JE estimator deteriorates for too small K.

Algorithm 1 Bennett's acceptance ratio (BAR)
Require: Work W_f^(i), W_r^(i) from M forward and reverse nonequilibrium simulations, tolerance δ (e.g. δ = 10⁻⁴)
  Z ← (1/M) ∑_i exp{−W_f^(i)}   (Jarzynski estimator)
  repeat
    LHS ← ∑_i 1/(1 + Z exp{W_f^(i)}),   RHS ← ∑_i 1/(1 + Z⁻¹ exp{−W_r^(i)})
    Z ← Z × LHS/RHS
  until |log(LHS/RHS)| < δ
  return Z

5 Using the fluctuation theorem to estimate the evidence

To use the fluctuation theorem for evidence estimation, we run two sets of simulations. As in AIS, forward simulations start from a prior sample which is successively propagated by the transition kernels Tk. For each forward path x^(i) the total work W_f^(i) is recorded. We also run reverse simulations starting from a posterior sample.
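Returning to the Gaussian toy model of Section 4, the forward work simulation can be sketched in a few lines. This is an illustrative reimplementation, not the paper's paths package; the linear mean schedule, the log-linear schedule for σk, and the fixed τ = 0.5 are assumptions of the sketch:

```python
import numpy as np

def forward_work(mus, sigmas, tau, M, rng):
    """Simulate M forward paths through Gaussian bridge distributions
    p_k = N(mu_k, sigma_k^2) and return the accumulated work per path."""
    E = lambda k, x: 0.5 * ((x - mus[k]) / sigmas[k]) ** 2
    x = rng.normal(mus[0], sigmas[0], M)          # exact sample from p_0
    W = np.zeros(M)
    K = len(mus) - 1
    for k in range(K):
        W += E(k + 1, x) - E(k, x)                # work of switching p_k -> p_{k+1}
        m = (1 - tau) * mus[k + 1] + tau * x      # Gaussian kernel T_{k+1}
        x = rng.normal(m, np.sqrt(1 - tau**2) * sigmas[k + 1])
    return W

rng = np.random.default_rng(1)
K = 100                                           # slower schedule than Fig. 1
mus = np.linspace(20.0, 0.0, K + 1)
sigmas = np.logspace(1.0, 0.0, K + 1)             # 10 -> 1
W_f = forward_work(mus, sigmas, tau=0.5, M=20000, rng=rng)
# Jarzynski estimate of log Z; exact value is -log(sigma_0/sigma_K) = -log 10
log_Z = np.log(np.mean(np.exp(-W_f)))
```

Histogramming `W_f` together with the work from the time-reversed simulation would reproduce the crossing at W = log 10 shown in Fig. 1A.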
For complex inference problems it is generally impossible to generate an exact sample from the posterior. However, in some cases the mode of the posterior is known, or powerful methods for locating the posterior maximum exist. We can then generate a posterior sample by applying the transition operator TK many times, starting from the posterior mode. The reverse simulations could also be started from the final states generated by the forward simulations, drawn according to their importance weights w_f^(i) ∝ exp{−W_f^(i)}. Another possibility to generate a posterior sample is to start from the data, if we want to evaluate an intractable generative model such as a restricted Boltzmann machine. The posterior sample is then propagated by the reverse chain of transition operators. Again, we accumulate the total work generated by the reverse simulation, W_r^(i). The reverse simulation corresponds to reverse AIS proposed by Burda et al. [16].

5.1 Jarzynski and cumulant estimators

There are various options to estimate the evidence from forward and backward simulations. We can apply the Jarzynski equality to W_f^(i) and W_r^(i), which corresponds to the estimators used in AIS [10] and reverse AIS [16]. According to Eq. (7) we can also compute an interval that likely contains the log evidence, but typically this interval will be quite large. Hummer [19] has developed estimators based on the cumulant expansion of pf and pr:

log Z ≈ −⟨W⟩_f + var_f(W)/2,   log Z ≈ −⟨W⟩_r − var_r(W)/2   (12)

where var_f(W) and var_r(W) indicate the sample variances of the work generated during the forward and reverse simulations. The cumulant estimators increase/decrease the lower/upper bound of the log evidence (Eq. 7) by half the sample variance of the work.
The forward and reverse cumulant estimators can also be combined into a single estimate [19]:

log Z ≈ −(⟨W⟩_f + ⟨W⟩_r)/2 + (var_f(W) − var_r(W))/12 .   (13)

5.2 Bennett's acceptance ratio

Another powerful method is Bennett's acceptance ratio (BAR) [20, 21], which is based on the observation that according to CFT (Eq. 8)

∫ h(W; ΔF) pf(W) e^{−W} dW = ∫ h(W; ΔF) pr(W) e^{−ΔF} dW

for any function h. Therefore, any choice of h gives an implicit estimator for ΔF. Bennett showed [20, 9] that the minimum mean squared error is achieved for h ∝ (pf + pr)⁻¹, leading to the implicit equation

∑_i 1/(1 + Z exp{W_f^(i)}) = ∑_i 1/(1 + Z⁻¹ exp{−W_r^(i)}) .   (14)

Figure 2: Performance of evidence estimators on the Gaussian toy model. M = 100 forward and reverse simulations were run for schedules of increasing length K. This experiment was repeated 1000 times to probe the stability of the estimators. Shown is the difference between the log evidence estimate and its true value −log 10. The average over all repetitions is shown as red line; the light band indicates one standard deviation over all repetitions. (A) Cumulant estimator (Eq. 12) based on the forward simulation. (B) The combined cumulant estimator (Eq. 13). (C) Forward AIS estimator. (D) Reverse AIS. (E) BAR. (F) Histogram estimator.

By numerically solving equation (14) for Z, we obtain an estimator of the evidence based on Bennett's acceptance ratio (BAR).
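A log-space solver for Eq. (14), initialized with the Jarzynski estimate, might look like this. It is an illustrative reimplementation rather than the paper's reference code, and the synthetic Gaussian work samples (CFT-consistent by construction, with true log Z = −4) are an assumption of the test:

```python
import numpy as np

def logmeanexp(a):
    m = a.max()
    return m + np.log(np.mean(np.exp(a - m)))

def bar_log_evidence(W_f, W_r, tol=1e-4, max_iter=1000):
    """Solve Bennett's implicit equation (Eq. 14) for log Z by iterating
    the multiplicative update log Z <- log Z + log LHS - log RHS."""
    log_Z = logmeanexp(-W_f)                       # Jarzynski initialization
    for _ in range(max_iter):
        # LHS and RHS of Eq. (14); clip the exponents to avoid overflow
        lhs = np.sum(1.0 / (1.0 + np.exp(np.clip(log_Z + W_f, -500, 500))))
        rhs = np.sum(1.0 / (1.0 + np.exp(np.clip(-log_Z - W_r, -500, 500))))
        if abs(np.log(lhs / rhs)) < tol:
            break
        log_Z += np.log(lhs) - np.log(rhs)
    return log_Z

# synthetic CFT-consistent work samples: p_f = N(mu, s2), p_r = N(mu - s2, s2)
rng = np.random.default_rng(2)
mu, s2 = 5.0, 2.0                                  # Delta F = mu - s2/2 = 4
W_f = rng.normal(mu, np.sqrt(s2), 4000)
W_r = rng.normal(mu - s2, np.sqrt(s2), 4000)
log_Z = bar_log_evidence(W_f, W_r)                 # close to -4
```

Solving for log Z rather than Z keeps the iteration stable even when the evidence spans hundreds of nats, as in the Ising experiments below.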
A straightforward way to solve the BAR equation is to iterate the multiplicative update Z^(t+1) ← Z^(t) LHS(Z^(t))/RHS(Z^(t)), where LHS and RHS are the left- and right-hand sides of equation (14) and the superscript (t) indicates the iteration index. Algorithm 1 provides pseudocode to compute the BAR estimator (further details are given in the supplementary material).

5.3 Histogram estimator

Here we introduce yet another way of combining forward/backward simulations and estimating the model evidence. According to CFT, we have:

W_f^(i) ∼ pf(W),   W_r^(i) ∼ pr(W) = pf(W) e^{−W}/Z .

The idea is to combine all samples W_f^(i) and W_r^(i) to estimate pf, from which we can then obtain the evidence by using the JE (Eq. 9). Thanks to the CFT, the samples from the reverse simulation contribute most strongly to the integral in the JE. Therefore, if we can use the reverse paths to estimate the forward work distribution, pf will be quite accurate in the region that is most relevant for evaluating the JE.

Estimating pf from W_f^(i) and W_r^(i) is mathematically equivalent to estimating the density of states (DOS) (i.e. the marginal distribution of the log likelihood) from equilibrium simulations run at two inverse temperatures β = 0 and β = 1. We can therefore directly apply histogram techniques [14, 22] used to analyze equilibrium simulations to estimate pf from nonequilibrium simulations (details are given in the supplementary material). Histogram techniques result in a non-parametric estimate of the work distribution:

pf(W) ≈ ∑_j pj δ(W − Wj)   (15)

where all sampled work values, W_f^(i) and W_r^(i), were combined into a single set Wj and pj are normalized weights associated with each Wj. Using the JE, we obtain

Z ≈ ∑_j pj e^{−Wj}   (16)

which is best evaluated in log space.
The histogram iterations [14] used to determine pj and Z are very similar to the multiplicative updates that solve the BAR equation (Eq. 14). After running the histogram iterations, we obtain a non-parametric maximum likelihood estimate of pf (Eq. 15). It is also possible to carry out a Bayesian analysis, and derive a Gibbs sampler for pf, which does not only provide a point estimate for log Z, but also quantifies its uncertainty (see supplementary material for details).

Figure 3: Evidence estimation for the 32 × 32 Ising model. (A) Work distributions obtained for K = 1000 annealing and N = 1000 equilibration steps. (B) Average energy ⟨E⟩_f and ⟨E⟩_r at different annealing steps k in comparison to the average energy of the stationary distribution ⟨E⟩_β. Shown is a zoom into the inverse temperature range from 0.4 to 0.7; the average energies agree quite well outside this interval. (C) Evidence estimates for increasing number of equilibration steps N. Light/dark blue: lower/upper bounds ⟨log w⟩_f / ⟨log w⟩_r; light/dark green: forward/reverse AIS estimators log⟨w⟩_f / log⟨w⟩_r; light red: BAR; dark red: histogram estimator. For N > 1000, BAR and the histogram estimator produce virtually identical evidence estimates.

We studied the performance of the evidence estimators on forward/backward simulations of the Gaussian toy model. The cumulant estimators (Figs. 2A, B) are systematically biased in case of rapid annealing (small K). The combined cumulant estimator (Fig. 2B) is a significant improvement over the forward estimator, which does not take the reverse simulation data into account. The forward and reverse AIS estimators are shown in Figs. 2C and 2D.
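The cumulant estimators of Eqs. (12) and (13) are one-liners on the work samples. The sketch below checks them on synthetic Gaussian work distributions, for which the second-order cumulant expansion is exact (an assumption of this toy check; the true log Z is −4 by construction):

```python
import numpy as np

def cumulant_estimators(W_f, W_r):
    """Cumulant (Eq. 12) and combined cumulant (Eq. 13) estimates of log Z."""
    fwd = -W_f.mean() + W_f.var() / 2.0
    rev = -W_r.mean() - W_r.var() / 2.0
    combined = (-(W_f.mean() + W_r.mean()) / 2.0
                + (W_f.var() - W_r.var()) / 12.0)
    return fwd, rev, combined

# CFT-consistent Gaussian work distributions with log Z = -(mu - s2/2) = -4
rng = np.random.default_rng(3)
mu, s2 = 5.0, 2.0
W_f = rng.normal(mu, np.sqrt(s2), 10000)
W_r = rng.normal(mu - s2, np.sqrt(s2), 10000)
fwd, rev, combined = cumulant_estimators(W_f, W_r)
```

For non-Gaussian work distributions the truncated expansion is biased, which is the behavior visible for small K in Figs. 2A and 2B.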
For this system, the evidence estimates derived from the reverse simulation are systematically more accurate than the AIS estimate based on the forward simulation, which is clear given that the work distribution from reverse simulations pr is much more concentrated than the forward work distribution pf (see Fig. 1A). The most accurate, least biased and most stable estimators are BAR (Fig. 2E) and the histogram estimator (Fig. 2F), which both combine forward and backward simulations into a single evidence estimate.

6 Experiments

We studied the performance of the nonequilibrium marginal likelihood estimators on various challenging probabilistic models including Markov random fields and Gaussian mixture models. A python package implementing the work simulations and evidence estimators can be downloaded from https://github.com/michaelhabeck/paths.

6.1 Ising model

Our first test system is a 32 × 32 Ising model for which the log evidence can be computed exactly: log Z = 1339.27 [23]. A single configuration consists of 1024 spins xi = ±1. The energies of the intermediate distributions are Ek(x) = −βk ∑_{i∼j} xi xj, where i ∼ j indicates nearest neighbors on a 2D square lattice. We generate M = 1000 forward and reverse paths using a linear inverse temperature schedule that interpolates between β0 = 0 and βK = 1 where K = 1000. Forward simulations start from random spin configurations. For the reverse simulations, we start in one of the two ground states with all spins either −1 or +1. Tk are Metropolis kernels based on pk: a new spin configuration is proposed by flipping a randomly selected spin and accepted or rejected according to Metropolis' rule. The single spin-flip transitions are repeated N times at constant βk, i.e. N is the number of equilibration steps after a perturbation was induced by lowering the temperature.
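A sketch of the annealing procedure just described, on a smaller lattice for brevity; the ferromagnetic convention E(x) = −∑_{i∼j} xi xj with periodic boundaries, and the small lattice size, are assumptions of this illustration:

```python
import numpy as np

def metropolis_sweep(spins, beta, n_flips, rng):
    """Single-spin-flip Metropolis kernel at inverse temperature beta."""
    L = spins.shape[0]
    for _ in range(n_flips):
        i, j = rng.integers(L, size=2)
        # sum of the four nearest neighbors (periodic boundaries)
        nb = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
              + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        dE = 2.0 * spins[i, j] * nb          # energy change of flipping (i, j)
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i, j] *= -1
    return spins

def ising_energy(spins):
    """E(x) = -sum over nearest-neighbor pairs (periodic boundaries)."""
    return -(np.sum(spins * np.roll(spins, 1, axis=0))
             + np.sum(spins * np.roll(spins, 1, axis=1)))

# one forward annealing path on an 8 x 8 lattice (illustration, not 32 x 32)
rng = np.random.default_rng(4)
L, K, N = 8, 100, 10
betas = np.linspace(0.0, 1.0, K + 1)
spins = rng.choice([-1, 1], size=(L, L))
W = 0.0
for k in range(K):
    W += (betas[k + 1] - betas[k]) * ising_energy(spins)  # work increment
    spins = metropolis_sweep(spins, betas[k + 1], N * L * L, rng)
```

Collecting `W` over many such paths (and over reverse paths started from a ground state) yields the work samples fed to the BAR and histogram estimators.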
The larger N, the more time we allow the simulation to equilibrate, and the closer qk will be to pk.

Figure 3A shows the work distributions obtained with N = 1000 equilibration steps per temperature perturbation. Even though the forward and reverse work distributions overlap only weakly, the evidence estimates obtained with BAR and the histogram estimator are quite accurate, with 1338.05 (BAR) and 1338.28 (histogram estimator), which differ only by approx. 1 nat from the true evidence and correspond to a relative error of ∼9 × 10⁻⁴ (BAR) and 7 × 10⁻⁴ (histogram estimator). Forward and reverse AIS provide less accurate estimates of the log evidence: 1333.66 (AIS) and 1342.05 (RAISE). The lower and upper bounds are very broad, ⟨log w⟩_f = 1290.5 and ⟨log w⟩_r = 1352.0, which results from hysteresis effects. Figure 3B zooms into the average energies obtained during the forward and reverse simulations and compares them with the average energy of a fully equilibrated simulation.

Figure 4: Evidence estimation for the Potts model and RBM. (A) Estimated log evidence of the Potts model for a fixed computational budget K × N = 10⁹ where M = 100 and ten repetitions were computed. The reference value log Z = 1742 (obtained with parallel tempering) is shown as dashed black line. (B) log Z distributions obtained with the Gibbs sampling version of the histogram estimator for K = 1000 and varying number of equilibration steps. (C) Work distributions obtained for a marginal and full RBM (light/dark blue: forward/reverse simulation of the marginal model; light/dark green: forward/reverse simulation of the full model).
The average energies differ most strongly at inverse temperatures close to the critical value βcrit ≈ 0.44 at which the Ising model undergoes a second-order phase transition. We also tested the performance of the estimators as a function of the number of equilibration steps N. As already observed for the Gaussian toy model, BAR and the histogram estimator outperform the Jarzynski estimators (AIS and RAISE) also in case of the Ising model (see Fig. 3C).

6.2 Ten-state Potts model

Next we performed simulations of the ten-state Potts model defined over a 32 × 32 lattice, where the spins of the Ising model are replaced by integer colors xi ∈ {1, ..., 10} and an interaction energy −2δ(xi, xj). This model is significantly more challenging than the Ising model, because it undergoes a first-order phase transition and has an astronomically larger number of states (10^1024 colorings rather than 2^1024 spin configurations). We performed forward/backward simulations using a linear inverse temperature schedule with β0 = 0, βK = 1 and a fixed computational budget K × N = 10⁹. Figure 4A shows that there seems to be no advantage of increasing the number of intermediate distributions at the cost of reducing the number of equilibration steps. Again, BAR and the histogram estimator perform very similarly. The Gibbs sampling version of the histogram estimator also provides the posterior of log Z (see Fig. 4B).
For too few equilibration steps N, this distribution is rather broad or even slightly biased, but for large N the log Z posterior concentrates around the correct log evidence.

6.3 Restricted Boltzmann machine

The restricted Boltzmann machine (RBM) is a common building block of deep learning hierarchies. The RBM is an intractable MRF with bipartite interactions: E(v, h) = −(aᵀv + bᵀh + vᵀWh), where a, b are the visible and hidden biases and W are the couplings between the visible and hidden units vi and hj. Here we compare annealing of the full model Ek(v, h) = βk E(v, h) against annealing of the marginal model Ek(h) = −βk log ∑_v exp{−E(v, h)}. The full model can be simulated using a Gibbs sampler, which is straightforward since the conditional distributions are Bernoulli. To sample from the marginal model, we use a Metropolis kernel similar to the one used for the Ising model. To start the reverse simulations, we randomly pick an image from the training set and generate an initial hidden state by sampling from the conditional distribution p(h|v). We then run 100 steps of Gibbs sampling with TK to obtain a posterior sample.

We ran tests on an RBM with 784 visible and 500 hidden units trained on the MNIST handwritten digits dataset [24] with contrastive divergence using 25 steps [25]. Since the true log evidence is not known, we use a reference value obtained with parallel tempering (PT): log Z ≈ 451.42. Figure 4C compares evidence estimates based on annealing simulations of the full against the marginal model. Both annealing approaches provide very similar evidence estimates, 451.43 (full model) and 451.48 (marginal model), that are close to the PT result. However, simulation of the marginal model is three times faster compared to the full model.
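For binary visible units the sum over v in the marginal model factorizes, so the marginal energy is available in closed form. A sketch, with v ∈ {0, 1} assumed and checked against brute-force enumeration on a tiny RBM:

```python
import numpy as np

def softplus(z):
    # numerically stable log(1 + exp(z))
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))

def marginal_energy(h, a, b, W):
    """E(h) = -log sum_v exp{-E(v, h)} for binary visible units v in {0, 1}:
    the sum over v factorizes, giving -b.h - sum_i softplus(a_i + (W h)_i)."""
    return -(h @ b) - softplus(a + W @ h).sum()

# sanity check on a 2-visible / 2-hidden RBM by brute-force enumeration
rng = np.random.default_rng(5)
a, b = rng.normal(size=2), rng.normal(size=2)
W = rng.normal(size=(2, 2))       # couplings, shape (n_visible, n_hidden)
h = np.array([1.0, 0.0])
E = lambda v, h: -(a @ v + b @ h + v @ W @ h)
brute = -np.log(sum(np.exp(-E(np.array(v, float), h))
                    for v in [(0, 0), (0, 1), (1, 0), (1, 1)]))
```

This closed form is what makes Metropolis sampling of p(h) cheap: each proposal only costs one matrix-vector product instead of a Gibbs pass over both layers.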
Therefore, it seems beneficial to evaluate and train RBMs based on sampling and annealing of the marginal model p(h) rather than the full model p(v, h).

6.4 Gaussian mixture model

Finally, we consider a sort of "data annealing" strategy in which independent data points are added one-by-one as in sequential Monte Carlo [10]: Ek(x) = −∑_l