{"title": "Evaluating probabilities under high-dimensional latent variable models", "book": "Advances in Neural Information Processing Systems", "page_first": 1137, "page_last": 1144, "abstract": "We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost.", "full_text": "Evaluating probabilities under high-dimensional latent variable models\n\nIain Murray and Ruslan Salakhutdinov\nDepartment of Computer Science\nUniversity of Toronto\nToronto, ON M5S 3G4, Canada.\n{murray,rsalakhu}@cs.toronto.edu\n\nAbstract\n\nWe present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost.\n\n1 Introduction\n\nLatent variable models capture underlying structure in data by explaining observations as part of a more complex, partially observed system. 
A large number of probabilistic latent variable models have been developed, most of which express a joint distribution P(v, h) over observed quantities v and their unobserved counterparts h. Although it is by no means the only way to evaluate a model, a natural question to ask is “what probability P(v) is assigned to a test observation?”.\n\nIn some models the latent variables associated with a test input can be easily summed out: P(v) = Σ_h P(v, h). As an example, standard mixture models have a single discrete mixture-component indicator for each data point; the joint probability P(v, h) can be explicitly evaluated for each setting of the latent variable.\n\nMore complex graphical models explain data through the combination of many latent variables. This provides richer representations, but poses greater computational challenges. In particular, marginalizing out many latent variables can require complex integrals or exponentially large sums. One popular latent variable model, the Restricted Boltzmann Machine (RBM), is unusual in that the posterior over hiddens P(h|v) is fully factored, which allows efficient evaluation of P(v) up to a constant. Almost all other latent variable models have posterior dependencies amongst latent variables, even if they are independent a priori.\n\nOur current work is motivated by recent work on evaluating RBMs and their generalization to Deep Belief Networks (DBNs) [1]. For both types of models, a single constant was accurately approximated so that P(v, h) could be evaluated point-wise. For RBMs, the remaining sum over hidden variables was performed analytically. For DBNs, test probabilities were lower-bounded through a variational technique. Perhaps surprisingly, the bound was unable to reveal any significant improvement over RBMs in an experiment on MNIST digits. 
It was unclear whether this was due to looseness of the bound, or to there being no difference in performance.\n\nA more accurate method for summing over latent variables would enable better and broader evaluation of DBNs. In section 2 we consider existing Monte Carlo methods. Some of them are certainly more accurate, but prohibitively expensive for evaluating large test sets. We then develop a new cheap Monte Carlo procedure for evaluating latent variable models in section 3. Like the variational method used previously, our method is unlikely to spuriously overstate test-set performance. Our presentation is for general latent variable models; however, for a running example we use DBNs (see section 4 and [2]). The benefits of our new approach are demonstrated in section 5.\n\n2 Probability of observations as a normalizing constant\n\nThe probability of a data vector, P(v), is the normalizing constant relating the posterior over hidden variables to the joint distribution in Bayes' rule, P(h|v) = P(h, v)/P(v). A large literature on computing normalizing constants exists in physics, statistics and computer science. In principle, there are many methods that could be applied to evaluating the probability assigned to data by a latent variable model. We review a subset of these methods, with notation and intuitions that will help motivate and explain our new algorithm.\n\nIn what follows, all auxiliary distributions Q and transition operators T are conditioned on the current test case v; this is not shown in the notation to reduce clutter. Further, all of these methods assume that we can evaluate P(h, v). Graphical models with undirected connections will require the separate estimation of a single constant as in [1].\n\n2.1 Importance sampling\n\nImportance sampling can in principle find the normalizing constant of any distribution. 
The algorithm involves averaging a simple ratio under samples from some convenient tractable distribution over the hidden variables, Q(h). Provided Q(h) ≠ 0 whenever P(h, v) ≠ 0, we obtain:\n\nP(v) = Σ_h [P(h, v)/Q(h)] Q(h) ≈ (1/S) Σ_{s=1}^S P(h^(s), v)/Q(h^(s)),   h^(s) ∼ Q(h).   (1)\n\nImportance sampling relies on the sampling distribution Q(h) being similar to the target distribution P(h|v). Specifically, the variance of the estimator is an α-divergence between the distributions [3]. Finding a tractable Q(h) with small divergence is difficult in high-dimensional problems.\n\n2.2 The Harmonic mean method\n\nUsing Q(h) = P(h|v) in (1) gives an “estimator” that requires knowing P(v). As an alternative, the harmonic mean method, also called the reciprocal method, gives an unbiased estimate of 1/P(v):\n\n1/P(v) = Σ_h P(h)/P(v) = Σ_h P(h|v)/P(v|h) ≈ (1/S) Σ_{s=1}^S 1/P(v|h^(s)),   h^(s) ∼ P(h|v).   (2)\n\nIn practice correlated samples from MCMC are used; then the estimator is asymptotically unbiased. It was clear from the original paper and its discussion that the harmonic mean estimator can behave very poorly [4]. Samples in the tails of the posterior have large weights, which makes it easy to construct distributions where the estimator has infinite variance. A finite set of samples will rarely include any extremely large weights, so the estimator's empirical variance can be misleadingly low. In many problems, the estimate of 1/P(v) will be an underestimate with high probability. That is, the method will overestimate P(v) and often give no indication that it has done so.\n\nSometimes the estimator will have manageable variance. Also, more expensive versions of the estimator exist with lower variance. 
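The contrast between estimators (1) and (2) is easy to reproduce on a model small enough that P(v) can be computed exactly. The following sketch is illustrative only: the toy mixture-like model, its parameters, and all variable names are our own assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: discrete latent h in {0,...,K-1}, binary visible vector v of length D.
K, D = 8, 5
P_h = rng.dirichlet(np.ones(K))           # prior P(h)
theta = rng.uniform(0.1, 0.9, (K, D))     # P(v_d = 1 | h)

def joint(v, h):
    # P(v, h) for one visible vector and one latent setting
    b = theta[h] ** v * (1 - theta[h]) ** (1 - v)
    return P_h[h] * b.prod()

v = rng.integers(0, 2, D)
exact = sum(joint(v, h) for h in range(K))   # tractable for this toy model

# Importance sampling, eq (1), with Q(h) uniform: weight = P(h,v)/Q(h) = K * P(h,v).
S = 20000
hs = rng.integers(0, K, S)
is_est = np.mean([K * joint(v, h) for h in hs])

# Harmonic mean, eq (2): average 1/P(v|h) under exact posterior samples.
post = np.array([joint(v, h) for h in range(K)]) / exact
P_v_given_h = np.array([joint(v, h) / P_h[h] for h in range(K)])
hs_post = rng.choice(K, S, p=post)
hm_est = 1.0 / np.mean(1.0 / P_v_given_h[hs_post])
```

On a posterior this small both estimates land close to the exact value; the harmonic mean's pathologies only bite when the posterior has tails that a finite sample misses, which a toy of this size cannot exhibit.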
However, it is still prone to overestimate test probabilities: if 1/P̂_HME(v) is the harmonic mean estimator in (2), Jensen's inequality gives P(v) = 1/E[1/P̂_HME(v)] ≤ E[P̂_HME(v)]. Similarly log P(v) will be overestimated in expectation. Hence the average of a large number of test log probabilities is highly likely to be an overestimate. Despite these problems the estimator has received significant attention in statistics, and has been used for evaluating latent variable models in recent machine learning literature [5, 6]. This is understandable: all of the existing, more accurate methods are harder to implement and take considerably longer to run. In this paper we propose a method that is nearly as easy to use as the harmonic mean method, but with better properties.\n\n2.3 Importance sampling based on Markov chains\n\nParadoxically, introducing auxiliary variables and making a distribution much higher-dimensional than it was before can help find an approximating Q distribution that closely matches the target distribution. As an example we give a partial review of Annealed Importance Sampling (AIS) [7], a special case of a larger family of Sequential Monte Carlo (SMC) methods (see, e.g., [8]). Some of this theory will be needed in the new method we present in section 3.\n\nAnnealing algorithms start with a sample from some tractable distribution P1. Steps are taken with a series of operators T2, T3, . . . , TS, whose stationary distributions, Ps, are “cooled” towards the distribution of interest. The probability over the resulting sequence H = {h^(1), h^(2), . . . 
h^(S)} is:\n\nQ_AIS(H) = P1(h^(1)) ∏_{s=2}^S Ts(h^(s) ← h^(s−1)).   (3)\n\nTo compute importance weights, we need to define a “target” distribution on the same state-space:\n\nP_AIS(H) = P(h^(S)|v) ∏_{s=2}^S T̃s(h^(s−1) ← h^(s)).   (4)\n\nBecause h^(S) has marginal P(h|v) = P(h, v)/P(v), P_AIS(H) has our target, P(v), as its normalizing constant. The T̃ operators are the reverse operators of those used to define Q_AIS. For any transition operator T that leaves a distribution P(h|v) stationary, there is a unique corresponding “reverse operator” T̃, which is defined for any point h′ in the support of P:\n\nT̃(h ← h′) = T(h′ ← h) P(h|v) / Σ_h T(h′ ← h) P(h|v) = T(h′ ← h) P(h|v) / P(h′|v).   (5)\n\nThe sum in the denominator is known because T leaves the posterior stationary. Operators that are their own reverse operator are said to satisfy “detailed balance” and are also known as “reversible”. Many transition operators used in practice, such as Metropolis–Hastings, are reversible. Non-reversible operators are usually composed from a sequence of reversible operations, such as the component updates in a Gibbs sampler. The reverse of these (so-called) non-reversible operators is constructed from the same reversible base operations, but applied in reverse order.\n\nThe definitions above allow us to write:\n\nQ_AIS(H) = P_AIS(H) Q_AIS(H)/P_AIS(H) = P_AIS(H) [P1(h^(1))/P(h^(S)|v)] ∏_{s=2}^S Ts(h^(s) ← h^(s−1)) / T̃s(h^(s−1) ← h^(s)) = P_AIS(H) P(v) [P1(h^(1))/P(h^(S), v) · ∏_{s=2}^S P∗s(h^(s))/P∗s(h^(s−1))] ≡ P_AIS(H) P(v)/w(H).   (6)\n\nWe can usually evaluate the P∗s, which are unnormalized versions of the stationary distributions of the Markov chain operators. 
Therefore the AIS importance weight w(H) = 1/[· · ·] is tractable as long as we can evaluate P(h, v). The AIS importance weight provides an unbiased estimate:\n\nE_{Q_AIS(H)}[w(H)] = Σ_H Q_AIS(H) w(H) = P(v) Σ_H P_AIS(H) = P(v).   (7)\n\nAs with standard importance sampling, the variance of the estimator depends on a divergence between P_AIS and Q_AIS. This can be made small, at large computational expense, by using hundreds or thousands of steps S, allowing the neighboring intermediate distributions Ps(h) to be close.\n\n2.4 Chib-style estimators\n\nBayes' rule implies that for any special hidden state h∗, P(v) = P(h∗, v)/P(h∗|v). This trivial identity suggests a family of estimators introduced by Chib [9]. First, we choose a particular hidden state h∗, usually one with high posterior probability, and then estimate P(h∗|v). We would like to obtain an estimator that is based on a sequence of states H = {h^(1), h^(2), . . . , h^(S)} generated by a Markov chain that explores the posterior distribution P(h|v). The most naive estimate of P(h∗|v) is the fraction of states in H that are equal to the special state, Σ_s I(h^(s) = h∗)/S.   (8)\n\nObviously this estimator is impractical as it equals zero with high probability when applied to high-dimensional problems. A “Rao–Blackwellized” version of this estimator, p̂(H), replaces the indicator function with the probability of transitioning from h^(s) to the special state under a Markov chain transition operator that leaves the posterior stationary. 
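As an illustrative aside, the AIS recipe behind eqs (3)-(7) can be sketched on a toy discrete distribution: anneal from a uniform base towards an unnormalized target, accumulate the log importance weight, and average the weights to estimate the normalizer. The toy target, the linear temperature schedule, and the Metropolis operator below are our own illustrative choices, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 10
log_pstar = rng.normal(0, 1.5, K)        # unnormalized target log-probabilities
Z_exact = np.exp(log_pstar).sum()        # normalizer we pretend not to know

def ais_run(n_temps=300):
    # Intermediate distributions P_b(h) proportional to exp(b * log_pstar[h]);
    # b = 0 is the uniform base P1, b = 1 is the target.
    betas = np.linspace(0.0, 1.0, n_temps)
    h = rng.integers(K)                  # exact sample from the uniform base
    logw = 0.0
    for b_prev, b in zip(betas[:-1], betas[1:]):
        logw += (b - b_prev) * log_pstar[h]
        # Metropolis step (uniform proposal) leaving P_b stationary
        prop = rng.integers(K)
        if np.log(rng.random()) < b * (log_pstar[prop] - log_pstar[h]):
            h = prop
    return logw

logws = np.array([ais_run() for _ in range(500)])
Z_est = K * np.exp(logws).mean()         # base normalizer is K (uniform over K states)
```

With many intermediate temperatures the neighboring distributions are close, so the log weights have low variance; shortening the schedule inflates the variance, mirroring the cost/accuracy trade-off discussed above.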
This can be derived directly from the operator's stationary condition:\n\nP(h∗|v) = Σ_h T(h∗ ← h) P(h|v) ≈ p̂(H) ≡ (1/S) Σ_{s=1}^S T(h∗ ← h^(s)),   {h^(s)} ∼ P(H),   (9)\n\nwhere P(H) is the joint distribution arising from S steps of a Markov chain. If the chain has stationary distribution P(h|v) and could be initialized at equilibrium so that\n\nP(H) = P(h^(1)|v) ∏_{s=2}^S T(h^(s) ← h^(s−1)),   (10)\n\nthen p̂(H) would be an unbiased estimate of P(h∗|v). For ergodic chains the stationary distribution is achieved asymptotically and the estimator is consistent regardless of how it is initialized.\n\nIf T is a Gibbs sampling transition operator, the only way of moving from h to h∗ is to draw each element of h∗ in turn. If updates are made in index order from 1 to M, the move has probability:\n\nT(h∗ ← h) = ∏_{j=1}^M P(h∗_j | h∗_{1:(j−1)}, h_{(j+1):M}).   (11)\n\nEquations (9, 11) have been used in schemes for monitoring the convergence of Gibbs samplers [10]. It is worth emphasizing that we have only outlined the simplest possible scheme inspired by Chib's general approach. For some Markov chains, there are technical problems with the above construction, which require an extension explained in the appendix. Moreover, the approach above is not what Chib recommended. In fact, [11] explicitly favors a more elaborate procedure involving sampling from a sequence of distributions. This opens up the possibility of many sophisticated developments, e.g. [12, 13]. However, our focus in this work is on obtaining more useful results from simple cheap methods. There are also well-known problems with the Chib approach [14], to which we will return.\n\n3 A new estimator for evaluating latent-variable models\n\nWe start with the simplest Chib-inspired estimator based on equations (8, 9, 11). 
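The Rao-Blackwellized estimate (9) with the Gibbs transition probability (11) can be exercised on a toy posterior over a few binary variables, where the exact answer is available by enumeration. The posterior table and all names below are our own illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
M = 4
states = list(itertools.product([0, 1], repeat=M))
idx = {s: i for i, s in enumerate(states)}
p = np.exp(rng.normal(0, 1, 2 ** M))
p /= p.sum()                                  # exact posterior P(h|v), for reference

def cond(j, h):
    # P(h_j = 1 | h_{-j}) under the toy posterior
    h1, h0 = list(h), list(h)
    h1[j], h0[j] = 1, 0
    a, b = p[idx[tuple(h1)]], p[idx[tuple(h0)]]
    return a / (a + b)

def gibbs_sweep(h):
    h = list(h)
    for j in range(M):
        h[j] = int(rng.random() < cond(j, h))
    return tuple(h)

def trans_prob(h_star, h):
    # Eq (11): probability one sweep (scan order j = 1..M) maps h to h_star
    cur, t = list(h), 1.0
    for j in range(M):
        q = cond(j, cur)
        t *= q if h_star[j] == 1 else 1 - q
        cur[j] = h_star[j]
    return t

h_star = states[int(np.argmax(p))]            # high posterior probability state
h = states[0]
for _ in range(100):                          # rough burn-in towards equilibrium
    h = gibbs_sweep(h)

S = 5000
est = 0.0                                     # Rao-Blackwellized estimate, eq (9)
for _ in range(S):
    h = gibbs_sweep(h)
    est += trans_prob(h_star, h) / S
```

The estimate converges to P(h∗|v); the indicator-based version (8) would instead spend almost all its samples contributing zero on any larger state space.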
Like many Markov chain Monte Carlo algorithms, (9) provides only (asymptotic) unbiasedness. For our purposes this is not sufficient. Jensen's inequality tells us\n\nP(v) = P(h∗, v)/P(h∗|v) = P(h∗, v)/E[p̂(H)] ≤ E[P(h∗, v)/p̂(H)].   (12)\n\nThat is, we will overestimate the probability of a visible vector in expectation. Jensen's inequality also says that we will overestimate log P(v) in expectation.\n\nIdeally we would like an accurate estimate of log P(v). However, if we must suffer some bias, then a lower bound that does not overstate performance will usually be preferred. An underestimate of P(v) would result from overestimating P(h∗|v). The probability of the special state h∗ will often be overestimated in practice if we initialize our Markov chain at h∗. There are, however, simple counter-examples where this does not happen. Instead we describe a construction based on a sequence of Markov steps starting at h∗ that does have the desired effect. We draw a state sequence from the following carefully designed distribution, using the algorithm in figure 1:\n\nQ(H) = (1/S) Σ_{s=1}^S T̃(h^(s) ← h∗) ∏_{s′=s+1}^S T(h^(s′) ← h^(s′−1)) ∏_{s′=1}^{s−1} T̃(h^(s′) ← h^(s′+1)).   (13)\n\nIf the initial state were drawn from P(h|v) instead of T̃(h^(s) ← h∗), then the algorithm would give a sample from an equilibrium sequence with distribution P(H) defined in (10). This can be checked by repeated substitution of (5). 
This allows us to express Q in terms of P, as we did for AIS:\n\nQ(H) = (1/S) Σ_{s=1}^S [T̃(h^(s) ← h∗)/P(h^(s)|v)] P(H) = [1/P(h∗|v)] [(1/S) Σ_{s=1}^S T(h∗ ← h^(s))] P(H).   (14)\n\nInputs: v, observed test vector; h∗, a (preferably high posterior probability) hidden state; S, number of Markov chain steps; T, Markov chain operator that leaves P(h|v) stationary.\n1. Draw s ∼ Uniform({1, . . . , S})\n2. Draw h^(s) ∼ T̃(h^(s) ← h∗)\n3. for s′ = (s + 1) : S\n4.    Draw h^(s′) ∼ T(h^(s′) ← h^(s′−1))\n5. for s′ = (s − 1) : −1 : 1\n6.    Draw h^(s′) ∼ T̃(h^(s′) ← h^(s′+1))\n7. P(v) ≈ P(v, h∗) / [(1/S) Σ_{s′=1}^S T(h∗ ← h^(s′))]\n\nFigure 1: Algorithm for the proposed method. The graphical model shows Q(H|s = 3) for S = 4. At each generated state T(h∗ ← h^(s′)) is evaluated (step 7), roughly doubling the cost of sampling. The reverse operator, T̃, was defined in section 2.3.\n\nThe quantity in square brackets is the estimator for P(h∗|v) given in (9). The expectation of the reciprocal of this quantity under draws from Q(H) is exactly the quantity needed to compute P(v):\n\nE_{Q(H)}[ 1 / ((1/S) Σ_{s=1}^S T(h∗ ← h^(s))) ] = Σ_H [1/P(h∗|v)] P(H) = 1/P(h∗|v).   (15)\n\nAlthough we are using the simple estimator from (9), by drawing H from a carefully constructed Markov chain procedure, the estimator is now unbiased in P(v). This is not an asymptotic result. As long as no division by zero has occurred in the above equations, the estimator is unbiased in P(v) for finite runs of the Markov chain. 
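The procedure of figure 1 can be exercised end-to-end on a toy model small enough that P(v) is known exactly by enumeration. Here T is a fixed-scan Gibbs sweep, so its reverse operator T̃ is the same sweep in the opposite scan order; the toy joint P(h, v) table and all names are our own illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
M, S = 3, 10
states = list(itertools.product([0, 1], repeat=M))
idx = {s: i for i, s in enumerate(states)}
joint = np.exp(rng.normal(0, 1, 2 ** M))   # toy P(h, v) for one fixed test case v
P_v = joint.sum()                          # exact P(v), used only for checking
post = joint / P_v                         # posterior P(h|v)

def cond(j, h):
    # posterior conditional P(h_j = 1 | h_{-j})
    h1, h0 = list(h), list(h)
    h1[j], h0[j] = 1, 0
    a, b = post[idx[tuple(h1)]], post[idx[tuple(h0)]]
    return a / (a + b)

def sweep(h, order):
    # Gibbs sweep over the given scan order; reversed order gives T-tilde
    h = list(h)
    for j in order:
        h[j] = int(rng.random() < cond(j, h))
    return tuple(h)

def T_prob(h_to, h_from, order):
    # probability that a sweep over the given order maps h_from to h_to (eq 11)
    cur, t = list(h_from), 1.0
    for j in order:
        q = cond(j, cur)
        t *= q if h_to[j] == 1 else 1 - q
        cur[j] = h_to[j]
    return t

fwd, bwd = list(range(M)), list(range(M - 1, -1, -1))
h_star = states[int(np.argmax(post))]

def estimate_P_v():
    s = int(rng.integers(1, S + 1))              # step 1
    hs = {s: sweep(h_star, bwd)}                 # step 2
    for sp in range(s + 1, S + 1):               # steps 3-4: forward chain
        hs[sp] = sweep(hs[sp - 1], fwd)
    for sp in range(s - 1, 0, -1):               # steps 5-6: backward chain
        hs[sp] = sweep(hs[sp + 1], bwd)
    p_hat = np.mean([T_prob(h_star, hs[sp], fwd) for sp in range(1, S + 1)])
    return joint[idx[h_star]] / p_hat            # step 7

est = np.mean([estimate_P_v() for _ in range(4000)])
```

Averaged over many runs the estimate matches P(v); a single short run is unbiased in P(v) but, by Jensen's inequality, tends to underestimate log P(v), which is the conservative direction discussed below.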
Jensen's inequality implies that log P(v) is underestimated in expectation.\n\nNeal noted that Chib's method will return incorrect answers in cases where the Markov chain does not mix well amongst modes [14]. Our new proposed method will suffer from the same problem. Even if no transition probabilities are exactly zero, unbiasedness does not exclude being on a particular side of the correct answer with very high probability. Poor mixing may cause P(h∗|v) to be overestimated with high probability, which would result in an underestimate of P(v), i.e., an overly conservative estimate of test performance.\n\nThe variance of the estimator is generally unknown, as it depends on the (generally unavailable) auto-covariance structure of the Markov chain. We can note one positive property: for the ideal Markov chain operator that mixes in one step, the estimator has zero variance and gives the correct answer immediately. Although this extreme will not actually occur, it does indicate that on easy problems, good answers can be returned more quickly than by AIS.\n\n4 Deep Belief Networks\n\nIn this section we provide a brief overview of Deep Belief Networks (DBNs), recently introduced by [2]. DBNs are probabilistic generative models that can contain many layers of hidden variables. Each layer captures strong high-order correlations between the activities of hidden features in the layer below. The top two layers of the DBN model form a Restricted Boltzmann Machine (RBM), which is an undirected graphical model, but the lower layers form a directed generative model. The original paper introduced a greedy, layer-by-layer unsupervised learning algorithm that consists of learning a stack of RBMs one layer at a time.\n\nConsider a DBN model with two layers of hidden features. 
The model's joint distribution is:\n\nP(v, h1, h2) = P(v|h1) P(h1, h2),   (16)\n\nwhere P(v|h1) represents a sigmoid belief network, and P(h1, h2) is the joint distribution defined by the second-layer RBM. By explicitly summing out h2, we can easily evaluate an unnormalized probability P∗(v, h1) = Z P(v, h1).\n\nFigure 2 (two panels: MNIST digits, left; Image Patches, right): AIS, our proposed estimator and a variational method were used to sum over the hidden states for each of 50 randomly sampled test cases to estimate their average log probability. The three methods shared the same AIS estimate of a single global normalization constant Z.\n\nUsing an approximating factorial posterior distribution Q(h1|v), obtained as a byproduct of the greedy learning procedure, and an AIS estimate of the model's partition function Z, [1] proposed obtaining an estimate of a variational lower bound:\n\nlog P(v) ≥ Σ_{h1} Q(h1|v) log P∗(v, h1) − log Z + H(Q(h1|v)).   (17)\n\nThe entropy term H(·) can be computed analytically, since Q is factorial, and the expectation term was estimated by a simple Monte Carlo approximation:\n\nΣ_{h1} Q(h1|v) log P∗(v, h1) ≈ (1/S) Σ_{s=1..S} log P∗(v, h1^(s)),   where h1^(s) ∼ Q(h1|v).   (18)\n\nInstead of the variational approach, we could also adopt AIS to estimate Σ_{h1} P∗(v, h1). This would be computationally very expensive, since we would need to run AIS for each test case. In the next section we show that variational lower bounds can be quite loose. Running AIS on the entire test set, containing many thousands of test cases, is computationally too demanding. Our proposed estimator requires the same single AIS estimate of Z as the variational method, so that we can evaluate P(v, h1). 
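The bound (17) and its Monte Carlo approximation (18) can be checked on a toy first layer where the sum over h1 is tractable. The table of log P∗(v, h1) values, the stand-in log Z, and the factorial Q below are our own illustrative assumptions, not the DBN from the experiments.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
M = 4                                        # toy first-layer size: h1 is M binary units
states = np.array(list(itertools.product([0, 1], repeat=M)))
log_pstar = rng.normal(0, 1, 2 ** M)         # stands in for log P*(v, h1)
logZ = 0.7                                   # stands in for the AIS estimate of log Z
true_logP = np.log(np.exp(log_pstar).sum()) - logZ

q = rng.uniform(0.2, 0.8, M)                 # factorial Q(h1|v): Bernoulli means
Qprob = np.prod(states * q + (1 - states) * (1 - q), axis=1)
entropy = -np.sum(q * np.log(q) + (1 - q) * np.log(1 - q))

exact_bound = Qprob @ log_pstar - logZ + entropy   # eq (17), exact expectation

S = 20000                                    # eq (18): Monte Carlo expectation
samples = (rng.random((S, M)) < q).astype(int)
sample_idx = samples @ (1 << np.arange(M - 1, -1, -1))
mc_bound = log_pstar[sample_idx].mean() - logZ + entropy
```

The bound falls below the true log probability by exactly the KL divergence from Q to the posterior, which is why a loose bound can hide real differences between models.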
It then provides better estimates of log P(v) by approximately summing over h1 for each test case in a reasonable amount of computer time.\n\n5 Experimental Results\n\nWe present experimental results on two datasets: the MNIST digits and a dataset of image patches, extracted from images of natural scenes taken from the collection of Van Hateren (http://hlab.phys.rug.nl/imlib/). The MNIST dataset contains 60,000 training and 10,000 test images of ten handwritten digits (0 to 9), with 28×28 pixels. The image dataset consisted of 130,000 training and 20,000 test 20×20 patches. The raw image intensities were preprocessed and whitened as described in [15]. Gibbs sampling was used as a Markov chain transition operator throughout. All log probabilities quoted use natural logarithms, giving values in nats.\n\n5.1 MNIST digits\n\nIn our first experiment we used a deep belief network (DBN) taken from [1]. The network had two hidden layers with 500 and 2000 hidden units, and was greedily trained by learning a stack of two RBMs one layer at a time. Each RBM was trained using the Contrastive Divergence (CD) learning rule. The estimate of the lower bound on the average test log probability, using (17), was −86.22.\n\nTo estimate how loose the variational bound is, we randomly sampled 50 test cases, 5 of each class, and ran AIS for each test case to estimate the true test log probability. Computationally, this is equivalent to estimating 50 additional partition functions. Figure 2, left panel, shows the results. The estimate of the variational bound was −87.05 per test case, whereas the estimate of the true test log probability using AIS was −85.20. Our proposed estimator, averaged over 10 runs, provided an answer of −85.22. 
The special state h∗ for each test example v was obtained by first sampling from the approximating distribution Q(h|v), and then performing deterministic hill-climbing in log p(v, h) to get to a local mode.\n\n[Figure 2 panels: estimated test log-probability against the number of Markov chain steps (5–40), for MNIST (roughly −87 to −85 nats) and image patches (roughly −585 to −565 nats), comparing the estimate of the variational lower bound, the AIS estimator, and our proposed estimator.]\n\nAIS used a hand-tuned temperature schedule designed to equalize the variance of the intermediate log weights [7]. We needed 10,000 intermediate distributions to get stable results, which took about 3.6 days on a Pentium Xeon 3.00GHz machine, whereas for our proposed estimator we only used S = 40, which took about 50 minutes. For a more direct comparison we tried giving AIS 50 minutes, which allows 100 temperatures. This run gave an estimate of −89.59, which is lower than the lower bound and tells us nothing. Giving AIS ten times more time, 1000 temperatures, gave −86.05. This is higher than the lower bound, but still worse than our estimator at S = 40, or even S = 5.\n\nFinally, using our proposed estimator, the average test log probability on the entire MNIST test data was −84.55. The difference of about 2 nats shows that the variational bound in [1] was rather tight, although a very small improvement of the DBN over the RBM is now revealed.\n\n5.2 Image Patches\n\nIn our second experiment we trained a two-layer DBN model on the image patches of natural scenes. The first layer RBM had 2000 hidden units and 400 Gaussian visible units. The second layer represented a semi-restricted Boltzmann machine (SRBM) with 500 hidden and 2000 visible units. 
The SRBM contained visible-to-visible connections, and was trained using Contrastive Divergence together with mean-field. Details of training can be found in [15]. The overall DBN model can be viewed as a directed hierarchy of Markov random fields with hidden-to-hidden connections.\n\nTo estimate the model's partition function, we used AIS with 15,000 intermediate distributions and 100 annealing runs. The estimated lower bound on the average test log probability (see Eq. 17), using a factorial approximate posterior distribution Q(h1|v), which we also get as a byproduct of the greedy learning algorithm, was −583.73. The estimate of the true test log probability, using our proposed estimator, was −563.39. In contrast to the model trained on MNIST, the difference of over 20 nats shows that, for model comparison purposes, the variational lower bound is quite loose.\n\nFor comparison, we also trained square ICA and a mixture of factor analyzers (MFA) using code from [16, 17]. Square ICA achieves a test log probability of −551.14, and MFA with 50 mixture components and a 30-dimensional latent space achieves −502.30, clearly outperforming DBNs.\n\n6 Discussion\n\nOur new Monte Carlo procedure is formally unbiased in estimating P(v). In practice it is likely to underestimate the (log-)probability of a test set. Although the algorithm involves Markov chains, importance sampling underlies the estimator. Therefore the methods discussed in [18] could be used to bound the probability of accidentally overestimating a test set probability.\n\nIn principle our procedure is a general technique for estimating normalizing constants. It would not always be appropriate, however, as it would suffer the problems outlined in [14]. As an example, our 
As an example our\nmethod will not succeed in estimating the global normalizing constant of an RBM.\n\nFor our method to work well, a state drawn from eT (h(s) \u2190 h\u2217) should look like it could be part\n\nof an equilibrium sequence H \u223c P(H). The details of the algorithm arose by developing existing\nMonte Carlo estimators, but the starting state h(s) could be drawn from any arbitrary distribution:\n\nQvar(H) =\n\n1\nS\n\nq(h(s))\nP (h(s)|v)\n\nP(H) = P (v)\n\n1\nS\n\nq(h(s))\n\nP (h(s), v)\n\nP(H).\n\n(19)\n\nAs before the reciprocal of the quantity in square brackets would give an estimate of P (v). If an\n\napproximation q(h) is available that captures more mass than eT (h\u2190h\u2217), this generalized estimator\n\ncould perform better. We are hopeful that our method will be a natural next step in a variety of\nsituations where improvements are sought over a deterministic approximation.\n\nAcknowledgments\n\nThis research was supported by NSERC and CFI. Iain Murray was supported by the government of\nCanada. We thank Geoffrey Hinton and Radford Neal for useful discussions, Simon Osindero for\nproviding preprocessed image patches of natural scenes, and the reviewers for useful comments.\n\nSX\n\ns=1\n\n\"\n\nSX\n\ns=1\n\n#\n\n\fReferences\n[1] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of Deep Belief Networks. In Pro-\n\nceedings of the International Conference on Machine Learning, volume 25, pages 872\u2013879, 2008.\n\n[2] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets.\n\nNeural Computation, 18(7):1527\u20131554, 2006.\n\n[3] Tom Minka. Divergence measures and message passing. TR-2005-173, Microsoft Research, 2005.\n[4] Michael A. Newton and Adrian E. Raftery. Approximate Bayesian inference with the weighted likelihood\n\nbootstrap. Journal of the Royal Statistical Society, Series B (Methodological), 56(1):3\u201348, 1994.\n\n[5] Thomas L. 
Griffiths, Mark Steyvers, David M. Blei, and Joshua B. Tenenbaum. Integrating topics and syntax. In Advances in Neural Information Processing Systems (NIPS*17). MIT Press, 2005.\n[6] Hanna M. Wallach. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, pages 977–984. ACM Press, New York, NY, USA, 2006.\n[7] Radford M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.\n[8] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society B, 68(3):1–26, 2006.\n[9] Siddhartha Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313–1321, December 1995.\n[10] Christian Ritter and Martin A. Tanner. Facilitating the Gibbs sampler: the Gibbs stopper and the griddy-Gibbs sampler. Journal of the American Statistical Association, 87(419):861–868, 1992.\n[11] Siddhartha Chib and Ivan Jeliazkov. Marginal likelihood from the Metropolis–Hastings output. Journal of the American Statistical Association, 96(453), 2001.\n[12] Antonietta Mira and Geoff Nicholls. Bridge estimation of the probability density at a point. Statistica Sinica, 14:603–612, 2004.\n[13] Francesco Bartolucci, Luisa Scaccia, and Antonietta Mira. Efficient Bayes factor estimation from the reversible jump output. Biometrika, 93(1):41–52, 2006.\n[14] Radford M. Neal. Erroneous results in “Marginal likelihood from the Gibbs output”, 1999. Available from http://www.cs.toronto.edu/~radford/chib-letter.html.\n[15] Simon Osindero and Geoffrey Hinton. Modeling image patches with a directed hierarchy of Markov random fields. In Advances in Neural Information Processing Systems (NIPS*20). MIT Press, 2008.\n[16] Aapo Hyvärinen. 
Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.\n[17] Zoubin Ghahramani and Geoffrey E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, 1997.\n[18] Vibhav Gogate, Bozhena Bidyuk, and Rina Dechter. Studies in lower bounding probability of evidence using the Markov inequality. In 23rd Conference on Uncertainty in Artificial Intelligence (UAI), 2007.\n\nA Real-valued latents and Metropolis–Hastings\n\nThere are technical difficulties with the original Chib-style approach applied to Metropolis–Hastings and continuous latent variables. The continuous version of equation (9),\n\nP(h∗|v) = ∫ T(h∗ ← h) P(h|v) dh ≈ (1/S) Σ_{s=1}^S T(h∗ ← h^(s)),   h^(s) ∼ P(H),   (20)\n\ndoesn't work if T is the Metropolis–Hastings operator. The Dirac delta function at h = h∗ contains a significant part of the integral, which is ignored by samples from P(h|v) with probability one. Following [11], the fix is to instead integrate over the generalized detailed balance relationship (5). Chib and Jeliazkov implicitly took out the h∗ = h point from all of their integrals. We do the same:\n\nP(h∗|v) = ∫ dh T̃(h∗ ← h) P(h|v) / ∫ dh T(h ← h∗).   (21)\n\nThe numerator can be estimated as before. As both integrals omit h = h∗, the denominator is less than one when T contains a delta function. For Metropolis–Hastings: T(h ← h∗) = q(h; h∗) min(1, a(h; h∗)), where a(h; h∗) is an easy-to-compute acceptance ratio. Sampling from q(h; h∗) and averaging min(1, a(h; h∗)) provides an estimate of the denominator.\n\nIn our importance sampling approach there is no need to separately approximate an additional quantity. 
The algorithm in figure 1 still applies if the T's are interpreted as probability density functions. If, due to a rejection, h∗ is drawn in step 2, then the sum in step 7 will contain an infinite term, giving a trivial underestimate P(v) = 0. (Steps 3–6 need not be performed in this case.) On repeated runs, the average estimate is still unbiased, or an underestimate for chains that can't mix. Alternatively, the variational approach (19) could be applied together with Metropolis–Hastings sampling.\n", "award": [], "sourceid": 674, "authors": [{"given_name": "Iain", "family_name": "Murray", "institution": null}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": null}]}