{"title": "Learning in Markov Random Fields using Tempered Transitions", "book": "Advances in Neural Information Processing Systems", "page_first": 1598, "page_last": 1606, "abstract": "Markov random fields (MRFs), or undirected graphical models, provide a powerful framework for modeling complex dependencies among random variables. Maximum likelihood learning in MRFs is hard due to the presence of the global normalizing constant. In this paper we consider a class of stochastic approximation algorithms of Robbins-Monro type that uses Markov chain Monte Carlo to do approximate maximum likelihood learning. We show that using MCMC operators based on tempered transitions enables the stochastic approximation algorithm to better explore highly multimodal distributions, which considerably improves parameter estimates in large densely-connected MRFs. Our results on MNIST and NORB datasets demonstrate that we can successfully learn good generative models of high-dimensional, richly structured data and perform well on digit and object recognition tasks.", "full_text": "Learning in Markov Random Fields using\n\nTempered Transitions\n\nRuslan Salakhutdinov\n\nBrain and Cognitive Sciences and CSAIL\n\nMassachusetts Institute of Technology\n\nrsalakhu@mit.edu\n\nAbstract\n\nMarkov random \ufb01elds (MRF\u2019s), or undirected graphical models, provide a pow-\nerful framework for modeling complex dependencies among random variables.\nMaximum likelihood learning in MRF\u2019s is hard due to the presence of the global\nnormalizing constant. In this paper we consider a class of stochastic approxima-\ntion algorithms of the Robbins-Monro type that use Markov chain Monte Carlo to\ndo approximate maximum likelihood learning. We show that using MCMC opera-\ntors based on tempered transitions enables the stochastic approximation algorithm\nto better explore highly multimodal distributions, which considerably improves\nparameter estimates in large, densely-connected MRF\u2019s. 
Our results on MNIST\nand NORB datasets demonstrate that we can successfully learn good generative\nmodels of high-dimensional, richly structured data that perform well on digit and\nobject recognition tasks.\n\n1 Introduction\n\nMarkov random \ufb01elds (MRF\u2019s) provide a powerful tool for representing dependency structure be-\ntween random variables. They have been successfully used in various application domains, includ-\ning machine learning, computer vision, and statistical physics. The major limitation of MRF\u2019s is\nthe need to compute the partition function, whose role is to normalize the joint distribution over the\nset of random variables. Maximum likelihood learning in MRF\u2019s is often very dif\ufb01cult because of\nthe hard inference problem induced by the partition function. When modeling high-dimensional,\nrichly structured data, the inference problem becomes much more dif\ufb01cult because the distribution\nwe need to infer is likely to be highly multimodal [17]. Multimodality is common in real-world\ndistributions, such as the distribution of natural images, in which an exponentially large number\nof possible image con\ufb01gurations have extremely low probability, but there are many very different\nimages that occur with similar probabilities.\n\nTo date, there has been very little work addressing the problem of ef\ufb01cient learning in large, densely-\nconnected MRF\u2019s that contain millions of parameters. While there exists a substantial literature on\ndeveloping approximate learning algorithms for arbitrary MRF\u2019s, many of these algorithms are un-\nlikely to work well when dealing with high-dimensional inputs. Methods that are based on replacing\nthe likelihood term with some tractable approximations, such as pseudo-likelihood [1] or mixtures\nof random spanning trees [11], perform very poorly for densely-connected MRF\u2019s with strong de-\npendency structures [3]. 
When using variational methods, such as loopy BP [18] and TRBP [16],\nlearning often gets trapped in poor local optima [5, 13]. MCMC-based algorithms, including MCMC\nmaximum likelihood estimators [3, 20] and Contrastive Divergence [4], typically suffer from high\nvariance (or strong bias) in their estimates, and can sometimes be painfully slow. The main problem\nhere is the inability of Markov chains to ef\ufb01ciently explore distributions with many isolated modes.\n\n1\n\n\fIn this paper we concentrate on the class of stochastic approximation algorithms of the Robbins-\nMonro type that use MCMC to estimate the model\u2019s expected suf\ufb01cient statistics, needed for max-\nimum likelihood learning. We \ufb01rst show that using this class of algorithms allows us to make very\nrapid progress towards \ufb01nding a fairly good set of parameters, even for models containing millions\nof parameters. Second, we show that using MCMC operators based on tempered transitions [9] en-\nables the stochastic algorithm to better explore highly multimodal distributions, which considerably\nimproves parameter estimates, particularly in large, densely-connected MRF\u2019s. Our results on the\nMNIST and NORB datasets demonstrate that the stochastic approximation algorithm together with\ntempered transitions can be successfully used to model high-dimensional real-world distributions.\n\n2 Maximum Likelihood Learning in MRF\u2019s\n\nLet x \u2208X K be a random vector on K variables, where each xi takes on values in some discrete\nalphabet. Let \u03c6(x) denote a D-dimensional vector of suf\ufb01cient statistics, and let \u03b8 \u2208 RD be a vector\nof canonical parameters. 
The exponential family associated with sufficient statistics φ consists of the following parameterized set of probability distributions:

p(x; θ) = p*(x)/Z(θ) = (1/Z(θ)) exp(θ⊤φ(x)),   Z(θ) = Σ_x exp(θ⊤φ(x)),   (1)

where p*(·) denotes the unnormalized probability distribution and Z(θ) is the partition function. For example, consider the following binary pairwise MRF. Given a graph G = (V, E) with vertices V and edges E, the probability distribution over a binary random vector x ∈ {0, 1}^K is given by:

p(x; θ) = (1/Z(θ)) exp(θ⊤φ(x)) = (1/Z(θ)) exp( Σ_{(i,j)∈E} θ_ij x_i x_j + Σ_{i∈V} θ_i x_i ).   (2)

The derivative of the log-likelihood for an observation x0 with respect to the parameter vector θ can be obtained from Eq. 1:

∂ log p(x0; θ)/∂θ = φ(x0) − E_{p(x;θ)}[φ(x)],   (3)

where E_P[·] denotes an expectation with respect to the distribution P. Except for simple models, such as tree-structured graphs, exact maximum likelihood learning is intractable, because exact computation of the expectation E_{p(x;θ)}[·] takes time that is exponential in the treewidth of the graph¹.

One approach is to learn the model parameters by maximizing the pseudo-likelihood (PL) [1], which replaces the likelihood with a tractable product of conditional probabilities:

P_PL(x0; θ) = Π_{k=1}^K p(x_k | x_{0,−k}; θ),   (4)

where x_{0,−k} denotes the observation vector x0 with x_k omitted. Pseudo-likelihood provides good estimates under weak dependence, when p(x_k | x_{−k}) ≈ p(x_k), or when it approximates the true likelihood function well.
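As a hypothetical illustration (not code from the paper), the quantities in Eqs. 1-4 can be computed exactly for a pairwise MRF small enough to enumerate all 2^K states; all names below are made up for the sketch:

```python
# Hypothetical sketch: a tiny binary pairwise MRF, small enough to enumerate,
# so the partition function (Eq. 1), the exact log-likelihood, and the
# pseudo-likelihood (Eq. 4) can all be computed by brute force.
import itertools
import numpy as np

K = 4
rng = np.random.default_rng(0)
W = np.triu(rng.normal(0, 0.5, (K, K)), 1)   # pairwise parameters theta_ij, i < j
b = rng.normal(0, 0.5, K)                    # unary parameters theta_i

def unnorm_logp(x):
    """log p*(x) = sum_{i<j} theta_ij x_i x_j + sum_i theta_i x_i (Eq. 2)."""
    return x @ W @ x + b @ x

states = np.array(list(itertools.product([0, 1], repeat=K)), dtype=float)
logp_all = np.array([unnorm_logp(s) for s in states])
logZ = np.logaddexp.reduce(logp_all)         # partition function (Eq. 1)
probs = np.exp(logp_all - logZ)

def pseudo_loglik(x):
    """log P_PL(x) = sum_k log p(x_k | x_{-k}) (Eq. 4); each factor is tractable."""
    total = 0.0
    for k in range(K):
        x1, x0 = x.copy(), x.copy()
        x1[k], x0[k] = 1.0, 0.0
        a, c = unnorm_logp(x1), unnorm_logp(x0)
        total += (a if x[k] == 1 else c) - np.logaddexp(a, c)
    return total

x_obs = np.array([1.0, 0.0, 1.0, 1.0])
print("exact log-lik :", unnorm_logp(x_obs) - logZ)
print("pseudo log-lik:", pseudo_loglik(x_obs))
```

Note that the pseudo-likelihood factors never touch Z(θ), which is exactly why PL remains tractable when enumeration is impossible.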
For MRF's with strong dependence structure, it is unlikely to work well.

Another approach, called the MCMC maximum likelihood estimator (MCMC-MLE) [3], has been shown to sometimes provide considerably better results than PL [3, 20]. The key idea is to use importance sampling to approximate the model's partition function. Consider running a Markov chain to obtain samples x⁽¹⁾, x⁽²⁾, ..., x⁽ⁿ⁾ from some fixed proposal distribution p(x; ψ)². These samples can be used to approximate the log-likelihood ratio for an observation x0:

L(θ) = log [p(x0; θ)/p(x0; ψ)] = (θ − ψ)⊤φ(x0) − log [Z(θ)/Z(ψ)]   (5)
     ≈ (θ − ψ)⊤φ(x0) − log (1/n) Σ_{i=1}^n e^{(θ−ψ)⊤φ(x⁽ⁱ⁾)} = Ln(θ),   (6)

¹For many interesting models considered in this paper, exact computation of E_{p(x;θ)}[·] takes time that is exponential in the dimensionality of x.
²We will also assume that p(x; ψ) ≠ 0 whenever p(x; θ) ≠ 0, ∀θ.

Algorithm 1 Stochastic Approximation Procedure.
1: Given an observation x0. Randomly initialize θ1 and M sample particles {x^{1,1}, ..., x^{1,M}}.
2: for t = 1 : T (number of iterations) do
3:    for m = 1 : M (number of parallel Markov chains) do
4:       Sample x^{t+1,m} given x^{t,m} using transition operator T_{θt}(x^{t+1,m} ← x^{t,m}).
5:    end for
6:    Update: θ_{t+1} = θ_t + α_t [φ(x0) − (1/M) Σ_{m=1}^M φ(x^{t+1,m})].
7:    Decrease α_t.
8: end for

where we used the approximation Z(θ)/Z(ψ) = Σ_x e^{(θ−ψ)⊤φ(x)} p(x; ψ) ≈ (1/n) Σ_{i=1}^n e^{(θ−ψ)⊤φ(x⁽ⁱ⁾)}. Provided our Markov chain is ergodic, it can be shown that Ln(θ) → L(θ) for all θ.
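The estimator Ln(θ) of Eq. 6 is straightforward to compute from proposal samples. A minimal sketch (hypothetical code, assuming the sufficient statistics of the samples have been precomputed):

```python
# Hypothetical sketch of the MCMC-MLE objective L_n(theta) from Eq. 6:
# importance samples from a fixed proposal p(x; psi) approximate log Z(theta)/Z(psi).
import numpy as np

def L_n(theta, psi, phi_x0, phi_samples):
    """phi_samples: (n, D) sufficient statistics of samples drawn from p(x; psi)."""
    d = theta - psi
    a = phi_samples @ d
    # log (1/n) sum_i exp(d^T phi(x_i)), computed stably in the log domain
    log_ratio = np.logaddexp.reduce(a) - np.log(len(a))
    return d @ phi_x0 - log_ratio

# sanity check: at theta == psi the ratio Z(theta)/Z(psi) is exactly 1, so L_n ~ 0
rng = np.random.default_rng(1)
psi = rng.normal(size=3)
phi_samples = rng.normal(size=(100, 3))
phi_x0 = rng.normal(size=3)
assert abs(L_n(psi, psi, phi_x0, phi_samples)) < 1e-9
```

The variance problem described next is visible in the `log_ratio` term: when θ is far from ψ, a handful of samples dominate the sum and the estimate becomes unreliable.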
It can further be shown that, under the "usual" regularity conditions, if θ̂_n maximizes Ln(θ) and θ* maximizes L(θ), then θ̂_n → θ* almost surely. This implies that as the number of samples n drawn from our proposal distribution goes to infinity, MCMC-MLE will converge to the true maximum likelihood estimator. While this estimator provides nice asymptotic convergence guarantees, it performs very poorly in practice, particularly when the parameter vector θ is high-dimensional. In high-dimensional spaces, the variance of the estimator Ln(θ) will be very large, or possibly infinite, unless the proposal distribution p(x; ψ) is a near-perfect approximation to p(x; θ). While there have been some attempts to improve MCMC-MLE by considering a mixture of proposal distributions [20], they do not fix the problem when learning MRF's with millions of parameters.

3 Stochastic Approximation Procedure (SAP)

We now consider a stochastic approximation procedure that uses MCMC to estimate the model's expected sufficient statistics. SAP belongs to the general class of well-studied stochastic approximation algorithms of the Robbins-Monro type [19, 12]. The algorithm itself dates back to 1988 [19], but only recently has it been shown to work surprisingly well when training large MRF's, including restricted Boltzmann machines [15] and deep Boltzmann machines [14, 13].

The idea behind learning a parameter vector θ using SAP is straightforward. Let x0 be our observation. Then the state and the parameters are updated sequentially:

θ_{t+1} = θ_t + α_t [φ(x0) − φ(x_{t+1})], where x_{t+1} ∼ T_{θt}(x_{t+1} ← x_t).   (7)

Given x_t, we sample a new state x_{t+1} using the transition operator T_{θt}(x_{t+1} ← x_t) that leaves p(·; θ_t) invariant.
A new parameter θ_{t+1} is then obtained by replacing the intractable expectation E_{p(x;θt)}[φ(x)] with φ(x_{t+1}). In practice, we typically maintain a set of M sample points X^t = {x^{t,1}, ..., x^{t,M}}, which we will often refer to as sample particles. In this case, the intractable model's expectation is replaced by the sample average (1/M) Σ_{m=1}^M φ(x^{t+1,m}). The procedure is summarized in Algorithm 1.

One important property of this algorithm is that, just like MCMC-MLE, it can be shown to asymptotically converge to the maximum likelihood estimator θ*.³ In particular, for fully visible discrete MRF's, if one uses a Gibbs transition operator and the learning rate is set to α_t = 1/((t+1)U), where U is a positive constant such that U > 2KC₀C₁, then θ_t → θ* almost surely (see Theorem 4.1 of [19]). Here K is the dimensionality of x, C₀ = max{‖φ(x0) − φ(x)‖ ; x ∈ X^K} is the largest magnitude of the gradient, and C₁ is the maximum variation of φ when one changes the value of a single component only: C₁ = max{‖φ(x) − φ(y)‖ ; x, y ∈ X^K, k ∈ {1, ..., K}, y_{−k} = x_{−k}}.

The proof of convergence relies on the following simple decomposition. First, let S(θ) denote the true gradient of the log-likelihood function: S(θ) = ∂ log p(x0; θ)/∂θ = φ(x0) − E_{p(x;θ)}[φ(x)].
The parameter update rule then takes the following form:

θ_{t+1} = θ_t + α_t [φ(x0) − φ(x_{t+1})] = θ_t + α_t S(θ_t) + α_t [E_{p(x;θt)}[φ(x)] − φ(x_{t+1})]
        = θ_t + α_t S(θ_t) + α_t ε_t.   (8)

³One necessary condition for almost sure convergence requires the learning rate to decrease with time, so that Σ_{t=0}^∞ α_t = ∞ and Σ_{t=0}^∞ α_t² < ∞.

Algorithm 2 Tempered Transitions Run.
1: Initialize β₀ < β₁ < ... < β_S = 1. Given a current state x^S.
2: for s = S − 1 : 0 (Forward pass) do
3:    Sample x^s given x^{s+1} using T_s(x^s ← x^{s+1}).
4: end for
5: Set x̃^0 = x^0.
6: for s = 0 : S − 1 (Backward pass) do
7:    Sample x̃^{s+1} given x̃^s using T̃_s(x̃^{s+1} ← x̃^s).
8: end for
9: Accept the new state x̃^S with probability: min[1, Π_{s=1}^S p*(x^s)^{β_{s−1}−β_s} p*(x̃^s)^{β_s−β_{s−1}}].

The first term (r.h.s. of Eq. 8) is the discretization of the ordinary differential equation θ̇ = S(θ). The algorithm is therefore a perturbation of this discretization with the noise term ε_t. The proof proceeds by showing that the noise term is not too large. Intuitively, as the learning rate becomes sufficiently small compared to the mixing rate of the Markov chain, the chain will stay close to the stationary distribution, even if it is only run for a few MCMC steps per parameter update. This, in turn, will ensure that the noise term ε_t goes to zero.

When looking at the behavior of this algorithm in practice, we find that initially it makes very rapid progress towards finding a sensible region in the parameter space.
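The updates of Algorithm 1 can be sketched for the fully visible binary pairwise MRF of Eq. 2. This is hypothetical code, not the paper's implementation; it assumes one systematic Gibbs sweep per particle per update and parameterizes the unnormalized log-probability as ½x⊤Wx + b⊤x with symmetric W (constant factors are absorbed into the learning rate):

```python
# Hypothetical sketch of SAP (Algorithm 1) for a fully visible binary pairwise
# MRF, with persistent sample particles and one Gibbs sweep per parameter update.
import numpy as np

rng = np.random.default_rng(0)
K, M, T = 6, 10, 500
x0 = rng.integers(0, 2, K).astype(float)      # the observation
W = np.zeros((K, K))                          # pairwise parameters (symmetric, zero diagonal)
b = np.zeros(K)                               # unary parameters
X = rng.integers(0, 2, (M, K)).astype(float)  # M persistent sample particles

def gibbs_sweep(X, W, b):
    """One systematic-scan Gibbs sweep over all M particles; leaves p(x; theta) invariant."""
    for k in range(K):
        logits = X @ W[:, k] + b[k]           # conditional log-odds of x_k = 1
        X[:, k] = (rng.random(M) < 1.0 / (1.0 + np.exp(-logits))).astype(float)
    return X

for t in range(T):
    alpha = 0.1 / (1.0 + t / 100.0)           # decreasing learning rate
    X = gibbs_sweep(X, W, b)
    # data statistics minus sample-average model statistics (Eq. 7)
    grad_W = np.outer(x0, x0) - (X.T @ X) / M
    np.fill_diagonal(grad_W, 0.0)             # no self-connections
    W += alpha * grad_W                       # grad_W is symmetric, so W stays symmetric
    b += alpha * (x0 - X.mean(0))
```

Because the particles persist across updates, each update costs only one Gibbs sweep; this is the property that makes SAP fast in practice.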
However, as the algorithm be-\ngins to capture the multimodality of the data distribution, the Markov chain tends to mix poorly,\nproducing highly correlated samples for successive parameter updates. This often leads to poor pa-\nrameter estimates, especially when modeling complex, high-dimensional distributions. The main\nproblem here is the inability of the Markov chain to ef\ufb01ciently explore a distribution with many\nisolated modes. However, the transition operators T\u03b8t(xt+1 \u2190 xt) used in the stochastic approx-\nimation algorithm do not necessarily need to be simple Gibbs or Metropolis-Hastings updates to\nguarantee almost sure convergence. Instead, we propose to use MCMC operators based on tem-\npered transitions [9] that can more ef\ufb01ciently explore highly multimodal distributions. In addition,\nimplementing tempered transitions requires very little extra work beyond the implementation of the\nGibbs sampler.\n\n3.1 Tempered Transitions\n\nSuppose that our goal is to sample from p(x; \u03b8). We \ufb01rst de\ufb01ne a sequence of intermediate proba-\nbility distributions: p0, ..., pS, with pS = p(x; \u03b8) and p0 being more spread out and easier to sample\nfrom than pS. Constructing a suitable sequence of intermediate probability distributions will in\ngeneral depend on the problem. One general way to de\ufb01ne this sequence is:\n\nps(x) \u221d p\u2217(x; \u03b8)\u03b2s ,\n\n(9)\n\nwith \u201cinverse temperatures\u201d \u03b20 < \u03b21 < ... < \u03b2S = 1 chosen by the user. For each s = 1, .., S\u22121 we\nde\ufb01ne a transition operator Ts(x\u2032 \u2190 x) that leaves ps invariant. In our implementation Ts(x\u2032 \u2190 x)\n\nis the Gibbs sampling operator. 
We also need to define a reverse transition operator T̃_s(x ← x′) that satisfies the following reversibility condition for all x and x′:

p_s(x) T_s(x′ ← x) = T̃_s(x ← x′) p_s(x′).   (10)

If T_s is reversible, then T̃_s is the same as T_s. Many commonly used transition operators, such as Metropolis–Hastings, are reversible. Non-reversible operators are usually composed of several reversible sub-transitions applied in sequence, T_s = Q₁...Q_K, such as the single-component updates in a Gibbs sampler. The reverse operator can then simply be constructed from the same sub-transitions, but applied in the reverse order: T̃_s = Q_K...Q₁.

Given the current state x of the Markov chain, tempered transitions apply a sequence of transition operators T_{S−1} ... T₀ T̃₀ ... T̃_{S−1} that systematically "move" the sample particle x from the original complex distribution to the easily sampled distribution, and then back to the original distribution.
A new candidate state x̃ is accepted or rejected based on ratios of probabilities of intermediate states. Since p₀ is less concentrated than p_S, the sample particle will have a chance to move around the state space more easily, and we may hope that the probability distribution of the resulting candidate state will be much broader than the mode in which the current start state resides.

Figure 1: Experimental results on the MNIST dataset. Top: Toy RBM with 10 hidden units; the x-axis shows the number of Gibbs updates and the y-axis displays the training log-probability in nats. Bottom: Classification performance of the semi-restricted Boltzmann machines with 500 hidden units on the full MNIST dataset.
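A single tempered-transitions update can be sketched on a toy bimodal distribution. This is hypothetical code, not from the paper; it uses reversible Metropolis sub-transitions (so T̃_s = T_s) and accumulates the acceptance ratio in the log domain:

```python
# Hypothetical sketch of one tempered-transitions update on a 1-D bimodal
# target: heat from beta = 1 down to beta_0, then cool back, accumulating
# the log acceptance ratio as the temperature changes.
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(0.01, 1.0, 50)            # beta_0 < ... < beta_S = 1

def log_pstar(x):
    """Unnormalized log-probability: two well-separated modes at +/- 4."""
    return np.logaddexp(-0.5 * ((x - 4.0) / 0.3) ** 2,
                        -0.5 * ((x + 4.0) / 0.3) ** 2)

def metropolis(x, beta, n=5, step=1.0):
    """A few Metropolis moves that leave p*(x)^beta invariant (reversible)."""
    for _ in range(n):
        y = x + step * rng.normal()
        if np.log(rng.random()) < beta * (log_pstar(y) - log_pstar(x)):
            x = y
    return x

def tempered_transition(x):
    S = len(betas) - 1
    log_accept = 0.0
    for s in range(S - 1, -1, -1):            # forward (heating) pass
        log_accept += (betas[s] - betas[s + 1]) * log_pstar(x)
        x = metropolis(x, betas[s])
    for s in range(S):                        # backward (cooling) pass
        log_accept += (betas[s + 1] - betas[s]) * log_pstar(x)
        x = metropolis(x, betas[s + 1])
    return x, min(0.0, log_accept)

x = 4.0                                       # start in the right-hand mode
x_new, la = tempered_transition(x)
if np.log(rng.random()) < la:
    x = x_new                                 # accept candidate; otherwise keep x
```

Because the weight terms are accumulated on the fly, no normalizing constant of any intermediate distribution is ever needed, which is the memory advantage noted in the conclusions.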
The procedure\nis shown in Algorithm 2. Note that there is no need to compute the normalizing constants of any\nintermediate distributions.\n\nTempered transitions can make major changes to the current state, which allows the Markov chain\nto produce less correlated samples between successive parameter updates. This can greatly improve\nthe accuracy of the estimator, but is also more computationally expensive. We therefore propose to\nalternate between applying a more expensive tempered transitions operator and the standard Gibbs\nupdates. We call this algorithm Trans-SAP.\n\n4 Experimental Results\n\nIn our experiments we used the MNIST and NORB datasets. To speed-up learning, we subdivided\ndatasets into minibatches, each containing 100 training cases, and updated the parameters after each\nminibatch. The number of sample particles used for estimating the model\u2019s expected suf\ufb01cient statis-\ntics was also set to 100. For the stochastic approximation algorithm, we always apply a single Gibbs\nupdate to the sample particles. In all experiments, the learning rates were set by quickly running a\nfew preliminary experiments and picking the learning rates that worked best on the validation set.\nWe also use natural logarithms, providing values in nats.\n\n4.1 MNIST\n\nThe MNIST digit dataset contains 60,000 training and 10,000 test images of ten handwritten digits\n(0 to 9), with 28\u00d728 pixels. The dataset was binarized: each pixel value was stochastically set\nto 1 with probability proportional to its pixel intensity. From the training data, a random sample of\n10,000 images was set aside for validation.\n\nIn our \ufb01rst experiment we trained a small restricted Boltzmann machine (RBM). An RBM is a par-\nticular type of Markov random \ufb01eld that has a two-layer architecture, in which the visible binary\nstochastic units x are connected to hidden binary stochastic units h, as shown in Fig. 1. 
The probability that the model assigns to a visible vector x is:

P(x; θ) = (1/Z(θ)) Σ_h exp( Σ_{i,j} θ_ij x_i h_j + Σ_i θ_i x_i + Σ_j θ_j h_j ).   (11)

Figure 2: Left: Sample particles produced by the stochastic approximation algorithm after 100,000 parameter updates. Middle: Sample particles after applying a tempered transitions run. Right: Samples generated from the current model by randomly initializing all binary states and running the Gibbs sampler for 500,000 steps. After applying tempered transitions, sample particles look more like the samples generated from the current model. The images shown are the probabilities of the visible units given the binary states of the hidden units.

The model had 10 hidden units. This allowed us to calculate the exact value of the partition function simply by summing out the 784 visible units for each configuration of the hiddens. For the stochastic approximation procedure, the total number of parameter updates was 100,000, so the learning took about 25.6 minutes on a Pentium 4 3.00GHz machine. The learning rate was kept fixed at 0.01 for the first 10,000 parameter updates, and was then annealed as 10/(1000 + t). For comparison, we also trained the same model using exact maximum likelihood with exactly the same learning schedule.

Perhaps surprisingly, SAP makes very rapid progress towards the maximum likelihood solution, even though the model contains 8634 free parameters. The top panel of Fig. 1 further shows that combining regular Gibbs updates with tempered transitions provides a more accurate estimator.
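With only a few hidden units, the RBM partition function can be computed exactly by enumerating the hidden configurations and summing out the visibles analytically, as described for the 10-hidden-unit model. A hypothetical sketch (not the paper's code):

```python
# Hypothetical sketch: exact partition function of a small RBM (Eq. 11) by
# enumerating all 2^H hidden configurations; the visibles sum out in closed form.
import itertools
import numpy as np

def rbm_log_Z(W, b, c):
    """W: (V, H) weights, b: (V,) visible biases, c: (H,) hidden biases."""
    H = len(c)
    terms = []
    for h in itertools.product([0, 1], repeat=H):
        h = np.array(h, dtype=float)
        # sum over all visible vectors factorizes into a product over units:
        # sum_x exp(x^T (Wh + b)) = prod_i (1 + exp((Wh + b)_i))
        terms.append(c @ h + np.logaddexp(0.0, W @ h + b).sum())
    return np.logaddexp.reduce(terms)

# sanity check on a tiny RBM: with all parameters zero, Z = 2^(V+H)
V, H = 5, 3
assert np.isclose(rbm_log_Z(np.zeros((V, H)), np.zeros(V), np.zeros(H)),
                  (V + H) * np.log(2.0))
```

The same factorization, run the other way, is what makes block Gibbs sampling in an RBM cheap: all hidden units are conditionally independent given the visibles, and vice versa.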
We\napplied tempered transitions only during the last 50,000 Gibbs steps, alternating between 200 Gibbs\nupdates and a single tempered transitions run that used 50 \u03b2\u2019s spaced uniformly from 1 to 0.9.\nThe acceptance rate for the tempered transitions was about 0.8. To be fair, we compared different\nalgorithms based on the total number of Gibbs steps. For SAP, parameters were updated after each\nGibbs step (see Algorithm 1), whereas for Trans-SAP, parameters were updated after each Gibbs\nupdate but not during the tempered transitions run4. Hence Trans-SAP took slightly less computer\ntime compared to the plain SAP. Pseudo-likelihood and MCMC maximum likelihood estimators\nperform quite poorly, even for this small toy problem.\n\nIn our second experiment, we trained a larger semi-restricted Boltzmann machine that contained\n705,622 parameters. In contrast to RBM\u2019s, the visible units in this model form a fully connected\npairwise binary MRF (see Fig. 1, bottom left panel). The model had 500 hidden units and was\ntrained to model the joint probability distribution over the digit images and labels. The total number\nof Gibbs updates was set to 200,000, so the learning took about 19.5 hours. The learning rate was\nkept \ufb01xed at 0.05 for the \ufb01rst 50,000 parameter updates, and was then decreased as 100/(2000 + t).\nThe bottom panel of Fig. 1 shows classi\ufb01cation performance on the full MNIST test set. As ex-\npected, SAP makes very rapid progress towards \ufb01nding a good setting of the parameter values.\nUsing tempered transitions further improves classi\ufb01cation performance. As in our previous exper-\niment, tempered transitions were only applied during the last 100,000 Gibbs updates, alternating\nbetween 1000 Gibbs updates and a single tempered transitions run that used 500 \u03b2\u2019s spaced uni-\nformly from 1 to 0.9. The acceptance rate was about 0.7. 
After learning was complete, in addition to classification performance, we also estimated the log-probability that both models assigned to the test data. To estimate the models' partition functions, we used Annealed Importance Sampling [10, 13], a technique that is very similar to tempered transitions. The plain stochastic approximation algorithm achieved an average test log-probability of -87.12 per image, whereas Trans-SAP achieved a considerably better average test log-probability of -85.91.

⁴This reduced the total number of parameter updates from 100,000 to 50,000 + 50,000 · 2/3 = 83,333.

Figure 3: Results on the NORB dataset. Left: Random samples from the training set. Middle and Right: Samples generated from the two RBM models, trained using SAP with (Middle) and without (Right) tempered transitions. Samples were generated by running the Gibbs sampler for 100,000 steps.

To get an intuitive picture of how tempered transitions operate, we looked at the sample particles before and after applying a tempered transitions run. Figure 2 shows sample particles after 100,000 parameter updates. Observe that the particles look like real handwritten digits. However, a run of tempered transitions reveals that the current model is very unbalanced, with more probability mass placed on images of fours. To further test whether the "refreshed" particles were representative of the current model, we generated samples from the current model by randomly initializing binary states of the visible and hidden units, and running the Gibbs sampler for 500,000 steps. Clearly, the refreshed particles look more like the samples generated from the true model.
This in turn allows Trans-SAP to better estimate the model's expected sufficient statistics, which greatly facilitates learning a better generative model.

4.2 NORB

Results on MNIST show that the stochastic approximation algorithm works well on the relatively simple task of handwritten digit recognition. In this section we present results on a considerably more difficult dataset. NORB [6] contains images of 50 different 3D toy objects, with 10 objects in each of five generic classes: planes, cars, trucks, animals, and humans. The training set contains 24,300 stereo image pairs of 25 objects, whereas the test set contains 24,300 stereo pairs of the remaining, different 25 objects. The goal is to classify each object into its generic class. From the training data, 4,300 cases were set aside for validation.

Each image has 96×96 pixels with integer greyscale values in the range [0, 255]. We further reduced the dimensionality of each image from 9216 down to 4488 by using larger pixels around the edges of the image⁵. We also augmented the training data with additional unlabeled data by applying simple pixel translations, creating a total of 1,166,400 training instances. To deal with raw pixel data, we followed the approach of [8] by first learning a Gaussian-binary RBM with 4000 hidden units, and then treating the activities of its hidden layer as "preprocessed" data. The model was trained using contrastive divergence learning for 500 epochs. The learned low-level RBM effectively acts as a preprocessor that transforms greyscale images into 4000-dimensional binary vectors, which we use as the input for training our models.

We then proceeded to train an RBM with 4000 hidden units using the binary representations learned by the preprocessor module⁶. The RBM, containing over 16 million parameters, was trained in a completely unsupervised way. The total number of Gibbs updates was set to 400,000.
The learning rate was kept fixed at 0.01 for the first 100,000 parameter updates, and was then annealed as 100/(1000 + t). Similar to the previous experiments, tempered transitions were applied during the last 200,000 Gibbs updates, alternating between 1000 Gibbs updates and a single tempered transitions run that used 1000 β's spaced uniformly from 1 to 0.9.

⁵The dimensionality of each training vector, representing a stereo pair, was 2×4488 = 8976.
⁶The resulting model is effectively a Deep Belief Network with two hidden layers.

Figure 3 shows samples generated from two models, trained using stochastic approximation with and without tempered transitions. Both models were able to learn a lot of regularities in this high-dimensional, highly-structured data, including various object classes, different viewpoints and lighting conditions. The plain stochastic approximation algorithm produced a very unbalanced model with a large fraction of the model's probability mass placed on images of humans. Using tempered transitions allowed us to learn a better and more balanced generative model, including the lighting effects. Indeed, the plain SAP achieved a test log-probability of -611.08 per image, whereas Trans-SAP achieved a test log-probability of -598.58.

We also tested the classification performance of both models simply by fitting a logistic regression model to the labeled data (using only the 24,300 labeled training examples without any translations) using the top-level hidden activities as inputs. The model trained by SAP achieved an error rate of 8.7%, whereas the model trained using Trans-SAP reduced the error rate down to 8.4%.
This is compared to 11.6% achieved by SVM's, 22.5% achieved by logistic regression applied directly in the pixel space, and 18.4% achieved by K-nearest neighbors [6].

5 Conclusions

We have presented a class of stochastic approximation algorithms of the Robbins-Monro type that can be used to efficiently learn parameters in large, densely-connected MRF's. Using MCMC operators based on tempered transitions allows the stochastic approximation algorithm to better explore highly multimodal distributions, which in turn allows us to learn good generative models of handwritten digits and 3D objects in a reasonable amount of computer time.

In this paper we have concentrated only on using tempered transition operators. There exist a variety of other methods for sampling from distributions with many isolated modes, including simulated tempering [7] and parallel tempering [3], all of which can be incorporated into SAP. In particular, the concurrent work of [2] employs parallel tempering techniques to improve mixing in RBM's. There are, however, several advantages of using tempered transitions over other existing methods. First, tempered transitions do not require specifying any extra variables, such as the approximate values of the normalizing constants of intermediate distributions, which are needed for the simulated tempering method. Second, tempered transitions have modest memory requirements, unlike, for example, parallel tempering, since the acceptance rule can be computed on the fly as the intermediate states are generated. Finally, the implementation of tempered transitions requires almost no extra work beyond implementing the Gibbs sampler, and can be easily integrated into existing code.

Acknowledgments

We thank Vinod Nair for sharing his code for blurring and translating NORB images. This research was supported by NSERC.

References

[1] J. Besag.
Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika, 64:616-618, 1977.

[2] G. Desjardins, A. Courville, Y. Bengio, P. Vincent, and O. Delalleau. Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. Technical Report 1345, University of Montreal, 2009.

[3] C. Geyer. Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics, pages 156-163, 1991.

[4] G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1711-1800, 2002.

[5] A. Kulesza and F. Pereira. Structured learning with approximate inference. In NIPS, 2007.

[6] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR (2), pages 97-104, 2004.

[7] E. Marinari and G. Parisi. Simulated tempering: A new Monte Carlo scheme. Europhysics Letters, 19:451-458, 1992.

[8] V. Nair and G. Hinton. Implicit mixtures of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, volume 21, 2009.

[9] R. Neal. Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6:353-366, 1996.

[10] R. Neal. Annealed importance sampling. Statistics and Computing, 11:125-139, 2001.

[11] P. Pletscher, C. Ong, and J. Buhmann. Spanning tree approximations for conditional random fields. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 5, 2009.

[12] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Stat., 22:400-407, 1951.

[13] R. Salakhutdinov. Learning and evaluating Boltzmann machines. Technical Report UTML TR 2008-002, Department of Computer Science, University of Toronto, 2008.

[14] R. Salakhutdinov and G. Hinton. Deep Boltzmann machines.
In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 5, pages 448-455, 2009.

[15] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2008). ACM, 2008.

[16] M. Wainwright, T. Jaakkola, and A. Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching. In AI and Statistics, volume 9, 2003.

[17] M. Welling and C. Sutton. Learning in Markov random fields with Contrastive Free Energies. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 10, 2005.

[18] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282-2312, 2005.

[19] L. Younes. Estimation and annealing for Gibbsian fields. Ann. Inst. Henri Poincaré (B), 24(2):269-294, 1988.

[20] S. Zhu and X. Liu. Learning in Gibbsian fields: How accurate and how fast can it be? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-00), pages 2-9. IEEE, 2000.", "award": [], "sourceid": 771, "authors": [{"given_name": "Russ", "family_name": "Salakhutdinov", "institution": null}]}