{"title": "Neural Variational Inference and Learning in Undirected Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 6734, "page_last": 6743, "abstract": "Many problems in machine learning are naturally expressed in the language of undirected graphical models. Here, we propose black-box learning and inference algorithms for undirected models that optimize a variational approximation to the log-likelihood of the model. Central to our approach is an upper bound on the log-partition function parametrized by a function q that we express as a flexible neural network. Our bound makes it possible to track the partition function during learning, to speed-up sampling, and to train a broad class of hybrid directed/undirected models via a unified variational inference framework. We empirically demonstrate the effectiveness of our method on several popular generative modeling datasets.", "full_text": "Neural Variational Inference and Learning\n\nin Undirected Graphical Models\n\nVolodymyr Kuleshov\nStanford University\nStanford, CA 94305\n\nkuleshov@cs.stanford.edu\n\nStefano Ermon\n\nStanford University\nStanford, CA 94305\n\nermon@cs.stanford.edu\n\nAbstract\n\nMany problems in machine learning are naturally expressed in the language of\nundirected graphical models. Here, we propose black-box learning and inference\nalgorithms for undirected models that optimize a variational approximation to the\nlog-likelihood of the model. Central to our approach is an upper bound on the log-\npartition function parametrized by a function q that we express as a \ufb02exible neural\nnetwork. Our bound makes it possible to track the partition function during learning,\nto speed-up sampling, and to train a broad class of hybrid directed/undirected\nmodels via a uni\ufb01ed variational inference framework. 
We empirically demonstrate the effectiveness of our method on several popular generative modeling datasets.

1 Introduction

Many problems in machine learning are naturally expressed in the language of undirected graphical models. Undirected models are used in computer vision [1], speech recognition [2], social science [3], deep learning [4], and other fields. Many fundamental machine learning problems center on undirected models [5]; however, inference and learning in this class of distributions give rise to significant computational challenges.

Here, we attempt to tackle these challenges via new variational inference and learning techniques aimed at undirected probabilistic graphical models p. Central to our approach is an upper bound on the log-partition function of p parametrized by an approximating distribution q that we express as a flexible neural network [6]. Our bound is tight when q = p and is convex in the parameters of q for interesting classes of q. Most interestingly, it leads to a lower bound on the log-likelihood function log p, which enables us to fit undirected models in a variational framework similar to black-box variational inference [7].

Our approach offers a number of advantages over previous methods. First, it enables training undirected models in a black-box manner, i.e. we do not need to know the structure of the model to compute gradient estimators (e.g., as in Gibbs sampling); rather, our estimators only require evaluating a model's unnormalized probability. When optimized jointly over q and p, our bound also offers a way to track the partition function during learning [8]. At inference time, the learned approximating distribution q may be used to speed up sampling from the undirected model by initializing an MCMC chain (or it may itself provide samples). Furthermore, our approach naturally integrates with recent variational inference methods [6, 9] for directed graphical models.
We anticipate that our approach will be most useful in automated probabilistic inference systems [10].

As a practical example of how our methods can be used, we study a broad class of hybrid directed/undirected models and show how they can be trained in a unified black-box neural variational inference framework. Hybrid models like the ones we consider have been popular in the early deep learning literature [4, 11] and take inspiration from the principles of neuroscience [12]. They also possess a higher modeling capacity for the same number of variables; quite interestingly, we identify settings in which such models are also easier to train.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Background

Undirected graphical models. Undirected models form one of the two main classes of probabilistic graphical models [13]. Unlike directed Bayesian networks, they may express more compactly relationships between variables when the directionality of a relationship cannot be clearly defined (e.g., as between neighboring image pixels).

In this paper, we mainly focus on Markov random fields (MRFs), a type of undirected model corresponding to a probability distribution of the form $p_\theta(x) = \tilde p_\theta(x)/Z(\theta)$, where $\tilde p_\theta(x) = \exp(\theta \cdot x)$ is an unnormalized probability (also known as an energy function) with parameters $\theta$, and $Z(\theta) = \int \tilde p_\theta(x)\,dx$ is the partition function, which is essentially a normalizing constant. Our approach also admits natural extensions to conditional random field (CRF) undirected models.

Importance sampling. In general, the partition function of an MRF is often an intractable integral over $\tilde p(x)$. We may, however, rewrite it as

$$I := \int_x \tilde p_\theta(x)\,dx = \int_x \frac{\tilde p_\theta(x)}{q(x)}\, q(x)\,dx = \int_x w(x)\, q(x)\,dx, \qquad (1)$$

where q is a proposal distribution. Integral I can in turn be approximated by a Monte Carlo estimate $\hat I := \frac{1}{n}\sum_{i=1}^n w(x_i)$, where $x_i \sim q$. This approach, called importance sampling [14], may reduce the variance of an estimator and help compute intractable integrals. The variance of an importance sampling estimate $\hat I$ has a closed-form expression: $\frac{1}{n}\left(\mathbb{E}_{q(x)}[w(x)^2] - I^2\right)$. By Jensen's inequality, this quantity is non-negative, and it equals 0 when q = p.

Variational inference. Inference in undirected models is often intractable. Variational approaches approximate this process by optimizing the evidence lower bound

$$\log Z(\theta) \geq \max_q\; \mathbb{E}_{q(x)}\left[\log \tilde p_\theta(x) - \log q(x)\right]$$

over a distribution q(x); this amounts to finding a q that approximates p in terms of KL(q||p). Ideal q's should be expressive, easy to optimize over, and admit tractable inference procedures. Recent work has shown that neural network-based models possess many of these qualities [15, 16, 17].

Auxiliary-variable deep generative models. Several families of q have been proposed to ensure that the approximating distribution is sufficiently flexible to fit p. This work makes use of a class of distributions q(x, a) = q(x|a)q(a) that contain auxiliary variables a [18, 19]; these are latent variables that make the marginal q(x) multimodal, which in turn enables it to approximate more closely a multimodal target distribution p(x).

3 Variational Bounds on the Partition Function

This section introduces a variational upper bound on the partition function of an undirected graphical model. We analyze its properties and discuss optimization strategies.
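As a quick numerical check of the importance sampling construction above, the sketch below estimates the partition function of a toy ten-state distribution (all values hypothetical) and verifies that $\mathbb{E}_q[w^2] \geq \hat I^2$, consistent with the non-negativity of the variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unnormalized model p_tilde(x) = exp(theta_x) over ten states (values hypothetical).
theta = np.linspace(-1.0, 1.0, 10)
p_tilde = np.exp(theta)
Z_true = p_tilde.sum()                  # exact partition function, tractable here

# Proposal distribution q: uniform over the same support.
q = np.full(10, 0.1)

# Monte Carlo importance sampling estimate I_hat = (1/n) sum_i w(x_i), x_i ~ q.
n = 100_000
xs = rng.choice(10, size=n, p=q)
w = p_tilde[xs] / q[xs]
I_hat = w.mean()

# The estimator is unbiased, and E_q[w^2] >= I_hat^2 (variance is non-negative).
assert abs(I_hat - Z_true) / Z_true < 0.05
assert (w ** 2).mean() >= I_hat ** 2
```

With a proposal closer to the target, the spread of the weights w shrinks, which is the mechanism the paper's bound exploits.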
In the next section, we use this bound as an objective for learning undirected models.

3.1 A Variational Upper Bound on Z(θ)

We start with the simple observation that the variance of an importance sampling estimator (1) of the partition function naturally yields an upper bound on Z(θ):

$$\mathbb{E}_{q(x)}\left[\frac{\tilde p(x)^2}{q(x)^2}\right] \geq Z(\theta)^2. \qquad (2)$$

As mentioned above, this bound is tight when q = p. Hence, it implies a natural algorithm for computing Z(θ): minimize (2) over q in some family Q.

We immediately want to emphasize that this algorithm will not be directly applicable to highly peaked and multimodal distributions p̃ (such as an Ising model near its critical point). If q is initially very far from p̃, Monte Carlo estimates will tend to under-estimate the partition function.

However, in the context of learning p, we may expect a random initialization of p̃ to be approximately uniform; we may thus fit an initial q to this well-behaved distribution, and as we gradually learn or anneal p, q should be able to track p and produce useful estimates of the gradients of p̃ and of Z(θ). Most importantly, these estimates are black-box and do not require knowing the structure of p̃ to compute. We will later confirm that our intuition is correct via experiments.

3.2 Properties of the Bound

Convexity properties. A notable feature of our objective is that if q is an exponential family with parameters φ, the bound is jointly log-convex in θ and φ. This lends additional credibility to the bound as an optimization objective. If we choose to further parametrize φ by a neural net, the resulting non-convexity will originate solely from the network, and not from our choice of loss function.

To establish log-convexity, it suffices to look at $\tilde p_\theta(x)^2/q(x)$ for one x, since the sum of log-convex functions is log-convex. Note that $\log \frac{\tilde p_\theta(x)^2}{q(x)} = 2\theta^T x - \log q_\phi(x)$. One can easily check that a non-negative concave function is also log-concave; since q is in the exponential family, the second term is convex, and our claim follows.

Importance sampling. Minimizing the bound on Z(θ) may be seen as a form of adaptive importance sampling, where the proposal distribution q is gradually adjusted as more samples are taken [14, 20]. This provides another explanation for why we need q ≈ p; note that when q = p, the variance is zero, and a single sample computes the partition function, demonstrating that the bound is indeed tight. This also suggests the possibility of taking $\frac{1}{n}\sum_{i=1}^n \frac{\tilde p(x_i)}{q(x_i)}$ as an estimate of the partition function, with the $x_i$ being all the samples that have been collected during the optimization of q.

χ²-divergence minimization. Observe that optimizing (2) is equivalent to minimizing $\mathbb{E}_q\left[\frac{(\tilde p - q)^2}{q^2}\right]$, which is the χ²-divergence, a type of α-divergence with α = 2 [21, 22]. This connection highlights the variational nature of our approach and potentially suggests generalizations to other divergences. Moreover, many interesting properties of the bound can be easily established from this interpretation, such as convexity in terms of q, p̃ (in functional space).

3.3 Auxiliary-Variable Approximating Distributions

A key part of our approach is the choice of approximating family Q: it needs to be expressive, easy to optimize over, and admit tractable inference procedures. In particular, since p̃(x) may be highly multi-modal and peaked, q(x) should ideally be equally complex. Note that unlike earlier methods that parametrized conditional distributions q(z|x) over hidden variables z (e.g.
variational autoencoders [15]), our setting does not admit a natural conditioning variable, making the task considerably more challenging.

Here, we propose to address these challenges via an approach based on auxiliary-variable approximations [18]: we introduce a set of latent variables a into q(x, a) = q(x|a)q(a), making the marginal q(x) multi-modal. Computing the marginal q(x) may no longer be tractable; we therefore apply the variational principle one more time and introduce an additional relaxation of the form

$$\mathbb{E}_{q(a,x)}\left[\frac{p(a|x)^2\,\tilde p(x)^2}{q(x|a)^2\,q(a)^2}\right] \geq \mathbb{E}_{q(x)}\left[\frac{\tilde p(x)^2}{q(x)^2}\right] \geq Z(\theta)^2, \qquad (3)$$

where p(a|x) is a probability distribution over a that lifts p̃ to the joint space of (x, a). To establish the first inequality, observe that

$$\mathbb{E}_{q(a,x)}\left[\frac{p(a|x)^2\,\tilde p(x)^2}{q(x|a)^2\,q(a)^2}\right] = \mathbb{E}_{q(x)q(a|x)}\left[\frac{p(a|x)^2\,\tilde p(x)^2}{q(a|x)^2\,q(x)^2}\right] = \mathbb{E}_{q(x)}\left[\frac{\tilde p(x)^2}{q(x)^2}\cdot \mathbb{E}_{q(a|x)}\left(\frac{p(a|x)^2}{q(a|x)^2}\right)\right].$$

The factor $\mathbb{E}_{q(a|x)}\left(\frac{p(a|x)^2}{q(a|x)^2}\right)$ is an instantiation of bound (2) for the distribution p(a|x), and is therefore lower-bounded by 1.

This derivation also sheds light on the role of p(a|x): it is an approximating distribution for the intractable posterior q(a|x). When p(a|x) = q(a|x), the first inequality in (3) is tight, and we are optimizing our initial bound.

3.3.1 Instantiations of the Auxiliary-Variable Framework

The above formulation is sufficiently general to encompass several different variational inference approaches. Any of the following could be used to optimize our objective, although we focus on the last, as it admits the most flexible approximators for q(x).

Non-parametric variational inference. First, as suggested by Gershman et al. [23], we may take q to be a uniform mixture of K exponential families: $q(x) = \sum_{k=1}^K \frac{1}{K}\, q_k(x; \phi_k)$. This is equivalent to letting a be a categorical random variable with a fixed, uniform prior. The $q_k$ may be either Bernoulli or Gaussian, depending on whether x is discrete or continuous. This choice of q lets us potentially model arbitrarily complex p given enough components. Note that for distributions of this form it is easy to compute the marginal q(x) (for small K), and the bound in (3) may not be needed.

MCMC-based variational inference. Alternatively, we may set q(a|x) to be an MCMC transition operator T(x′|x) (or a sequence of operators) as in Salimans et al. [24]. The prior q(a) may be set to a flexible distribution, such as normalizing flows [25] or another mixture distribution. This gives a distribution of the form

$$q(x, a) = T(x|a)\, q(a). \qquad (4)$$

For example, if T(x|a) is a Restricted Boltzmann Machine (RBM; Smolensky [26]), the Gibbs sampling operator T(x′|x) has a closed form that can be used to compute importance samples. This is in contrast to vanilla Gibbs sampling, where there is no closed-form density for weighting samples.

The above approach also has similarities to persistent contrastive divergence (PCD; Tieleman and Hinton [27]), a popular approach for training RBM models, in which samples are taken from a Gibbs chain that is not reset during learning. The distribution q(a) may be thought of as a parametric way of representing a persistent distribution from which samples are taken throughout learning; like the PCD Gibbs chain, it too tracks the target probability p during learning.

Auxiliary-variable neural networks. Lastly, we may also parametrize q by a flexible function approximator such as a neural network [18]. More concretely, we set q(a) to a simple continuous prior (e.g. normal or uniform) and set $q_\phi(x|a)$ to an exponential family distribution whose natural parameters are parametrized by a neural net.
For example, if x is continuous, we may set q(x|a) = N(μ(a), σ(a)I), as in a variational auto-encoder. Since the marginal q(x) is intractable, we use the variational bound (3) and parametrize the approximate posterior p(a|x) with a neural network. For example, if a ∼ N(0, 1), we may again set p(a|x) = N(μ(x), σ(x)I).

3.4 Optimization

In the rest of the paper, we focus on the auxiliary-variable neural network approach for optimizing bound (3). This approach affords us the greatest modeling flexibility and allows us to build on previous neural variational inference approaches.

The key challenge with this choice of representation is optimizing (3) with respect to the parameters of p and q. Here, we follow previous work on black-box variational inference [6, 7] and compute Monte Carlo estimates of the gradient of our neural network architecture.

The gradient with respect to p has the form $2\,\mathbb{E}_q\left[\frac{\tilde p(x,a)}{q(x,a)^2}\,\nabla \tilde p(x,a)\right]$ and can be estimated directly via Monte Carlo. We use the score function estimator to compute the gradient of q, which can be written as $-\mathbb{E}_{q(x,a)}\left[\frac{\tilde p(x,a)^2}{q(x,a)^2}\,\nabla_\phi \log q(x,a)\right]$ and estimated again using Monte Carlo samples. In the case of a non-parametric variational approximation $\sum_{k=1}^K \frac{1}{K}\, q_k(x; \phi_k)$, the gradient has a simple expression $\nabla_{\phi_k}\, \mathbb{E}_q\left[\frac{\tilde p(x)^2}{q(x)^2}\right] = -\frac{1}{K}\,\mathbb{E}_{q_k}\left[\frac{\tilde p(x)^2}{q(x)^2}\, d_k(x)\right]$, where $d_k(x)$ is the difference of x and its expectation under $q_k$.

Note also that if our goal is to compute the partition function, we may collect all intermediary samples for computing the gradient and use them as regular importance samples. This may be interpreted as a form of adaptive sampling [20].

Variance reduction. A well-known shortcoming of the score function gradient estimator is its high variance, which significantly slows down optimization.
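The score function estimator just described can be checked on a two-state toy model, where the gradient of $\mathbb{E}_q[\tilde p^2/q^2]$ has a closed form (the values of θ and φ below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

theta = 0.5                        # MRF parameter: p_tilde(x) = exp(theta * x), x in {0, 1}
phi = 0.0                          # variational parameter: q(x = 1) = sigmoid(phi)
pi = 1.0 / (1.0 + np.exp(-phi))

def p_tilde(x):
    return np.exp(theta * x)

def q_prob(x):
    return pi ** x * (1.0 - pi) ** (1 - x)

# Score-function estimate of the gradient of F = E_q[(p_tilde/q)^2] w.r.t. phi:
#   grad = -E_q[(p_tilde(x)/q(x))^2 * d/dphi log q(x)],  with d/dphi log q(x) = x - pi.
n = 200_000
x = rng.binomial(1, pi, size=n)
w = p_tilde(x) / q_prob(x)
grad_mc = -np.mean(w ** 2 * (x - pi))

# Closed form for this toy: F(pi) = 1/(1 - pi) + exp(2*theta)/pi, chained through pi(phi).
dF_dpi = 1.0 / (1.0 - pi) ** 2 - np.exp(2 * theta) / pi ** 2
grad_exact = dF_dpi * pi * (1.0 - pi)

assert abs(grad_mc - grad_exact) < 0.05
```

The agreement confirms the estimator is unbiased; its spread across runs is the high variance the text goes on to address.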
We follow previous work [6] and introduce two variance reduction techniques to mitigate this problem.

We first use a moving average b̄ of $\tilde p(x)^2/q(x)^2$ to center the learning signal. This leads to a gradient estimate of the form $-\mathbb{E}_{q(x)}\left[\left(\frac{\tilde p(x)^2}{q(x)^2} - \bar b\right)\nabla_\phi \log q(x)\right]$; this yields the correct gradient by well-known properties of the score function [7]. Furthermore, we use variance normalization, a form of adaptive step size. More specifically, we keep a running average $\bar\sigma^2$ of the variance of $\tilde p(x)^2/q(x)^2$ and use a normalized form $g' = g / \max(1, \bar\sigma^2)$ of the original gradient g.

Note that unlike the standard evidence lower bound, we cannot define a sample-dependent baseline, as we are not conditioning on any sample. Likewise, many advanced gradient estimators [9] do not apply in our setting. Developing better variance reduction techniques for this setting is likely to help scale the method to larger datasets.

4 Neural Variational Learning of Undirected Models

Next, we turn our attention to the problem of learning the parameters of an MRF. Given data $D = \{x^{(i)}\}_{i=1}^n$, our training objective is the log-likelihood

$$\log p(D|\theta) := \frac{1}{n}\sum_{i=1}^n \log p_\theta(x^{(i)}) = \frac{1}{n}\sum_{i=1}^n \theta^T x^{(i)} - \log Z(\theta). \qquad (5)$$

We can use our earlier bound to upper bound the log-partition function by $\frac{1}{2}\log \mathbb{E}_{x\sim q}\left[\frac{\tilde p_\theta(x)^2}{q(x)^2}\right]$. By our previous discussion, this expression is convex in θ, φ if q is an exponential family distribution. The resulting lower bound on the log-likelihood may be optimized jointly over θ, φ; as discussed earlier, by training p and q jointly, the two distributions may help each other.
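The effect of a baseline on the score-function gradient is easy to sketch; the toy below (hypothetical two-state p̃ and Bernoulli q, with the baseline approximated by an empirical mean rather than a moving average) checks that centering leaves the gradient unbiased while shrinking its variance:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: q = Bernoulli(pi), p_tilde(x) = exp(theta * x) (values hypothetical).
theta, pi = 0.5, 0.5

def grad_terms(x, baseline):
    w2 = (np.exp(theta * x) / (pi ** x * (1 - pi) ** (1 - x))) ** 2
    return -(w2 - baseline) * (x - pi)        # per-sample score-function terms

x = rng.binomial(1, pi, size=(1000, 64))      # 1000 mini-batches of 64 samples

# A baseline b leaves the gradient unbiased because E_q[grad log q] = 0;
# here b approximates the mean of p_tilde^2/q^2 over the samples.
b = np.mean((np.exp(theta * x) / (pi ** x * (1 - pi) ** (1 - x))) ** 2)

g_plain = grad_terms(x, 0.0).mean(axis=1)     # uncentered per-batch estimates
g_base = grad_terms(x, b).mean(axis=1)        # baseline-centered estimates

assert abs(g_plain.mean() - g_base.mean()) < 0.1   # same expectation
assert g_base.std() < g_plain.std()                # but much lower variance
```

Variance normalization would further rescale these estimates by a running variance statistic, acting as an adaptive step size.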
In particular, we may start learning at an easy θ (where p is not too peaked) and use q to slowly track p, thus controlling the variance in the gradient.

Linearizing the logarithm. Since the log-likelihood contains the logarithm of the bound (2), our Monte Carlo samples will produce biased estimates of the gradient. We did not find this to pose problems in practice; however, to ensure unbiased gradients one may further linearize the log using the identity $\log(x) \leq ax - \log(a) - 1$, which is tight for $a = 1/x$. Together with our bound on the log-partition function, this yields

$$\log p(D|\theta) \geq \max_{q,\,a}\; \frac{1}{n}\sum_{i=1}^n \theta^T x^{(i)} - \frac{1}{2}\left(a\,\mathbb{E}_{x\sim q}\left[\frac{\tilde p_\theta(x)^2}{q(x)^2}\right] - \log(a) - 1\right). \qquad (6)$$

This expression is convex in each of (θ, φ) and a, but is not jointly convex. However, it is straightforward to show that equation (6) and its unlinearized version have a unique point satisfying first-order stationarity conditions. This may be done by writing out the KKT conditions of both problems and using the fact that $a^* = \left(\mathbb{E}_{x\sim q}\left[\frac{\tilde p_\theta(x)^2}{q(x)^2}\right]\right)^{-1}$ at the optimum. See Gopal and Yang [28] for more details.

4.1 Variational Inference and Learning in Hybrid Directed/Undirected Models

We apply our framework to a broad class of hybrid directed/undirected models and show how they can be trained in a unified variational inference framework.

The models we consider are best described as variational autoencoders with a Restricted Boltzmann Machine (RBM; Smolensky [26]) prior. More formally, they are latent-variable distributions of the form p(x, z) = p(x|z)p(z), where p(x|z) is an exponential family whose natural parameters are parametrized by a neural network as a function of z, and p(z) is an RBM.
The latter\nis an undirected latent variable model with hidden variables h and unnormalized log-probability\nlog \u02dcp(z, h) = zT W h + bT z + cT h, where W, b, c are parameters.\n\n5\n\n\fWe train the model using two applications of the variational principle: \ufb01rst, we apply the standard\nevidence lower bound with an approximate posterior r(z|x); then, we apply our lower bound on the\nRBM log-likelihood log p(z), which yields the objective\n\nlog p(x) \u2265 Er(z|x) [log p(x|z) + log \u02dcp(z) + log B(\u02dcp, q) \u2212 log r(z|x)] .\n\n(7)\nHere, B denotes our bound (3) on the partition function of p(z) parametrized with q. Equation (7)\nmay be optimized using standard variational inference techniques; the terms r(z|x) and p(x|z) do\nnot appear in B and their gradients may be estimated using REINFORCE and standard Monte Carlo,\nrespectively. The gradients of \u02dcp(z) and q(z) are obtained using methods described above. Note also\nthat our approach naturally extends to models with multiple layers of latent directed variables.\nSuch hybrid models are similar in spirit to deep belief networks [11]. From a statistical point\nof view, a latent variable prior makes the model more \ufb02exible and allows it to better \ufb01t the data\ndistribution. Such models may also learn structured feature representations: previous work has\nshown that undirected modules may learn classes of digits, while lower, directed layers may learn to\nrepresent \ufb01ner variation [29]. Finally, undirected models like the ones we study are loosely inspired\nby the brain and have been studied from that perspective [12]. 
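For intuition about the quantity our bound estimates, here is a minimal sketch (sizes and weights hypothetical) that evaluates the RBM's unnormalized log-probability $\log \tilde p(z, h) = z^T W h + b^T z + c^T h$ and, for a model this small, its exact partition function by brute-force enumeration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

# Tiny RBM prior: log p_tilde(z, h) = z^T W h + b^T z + c^T h (sizes hypothetical).
nz, nh = 4, 3
W = 0.1 * rng.standard_normal((nz, nh))
b = 0.1 * rng.standard_normal(nz)
c = 0.1 * rng.standard_normal(nh)

def log_p_tilde(z, h):
    return z @ W @ h + b @ z + c @ h

# With nz + nh = 7 binary variables, Z is a tractable sum over 2^7 configurations;
# this is the quantity the variational bound tracks for realistically sized models.
Z = sum(np.exp(log_p_tilde(np.array(z, dtype=float), np.array(h, dtype=float)))
        for z in itertools.product([0, 1], repeat=nz)
        for h in itertools.product([0, 1], repeat=nh))
log_Z = np.log(Z)

# With near-zero weights, log Z is close to (nz + nh) * log 2 (the uniform case).
assert abs(log_Z - (nz + nh) * np.log(2)) < 2.0
```

For RBMs of realistic size this enumeration is infeasible, which is precisely where the bound B(p̃, q) in (7) comes in.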
In particular, the undirected prior has been previously interpreted as an associative memory module [11].

5 Experiments

5.1 Tracking the Partition Function

We start with an experiment aimed at visualizing the importance of tracking the target distribution p using q during learning.

We use Equation 6 to optimize the likelihood of a 5×5 Ising MRF with coupling factor J and unaries chosen randomly in {10⁻², −10⁻²}. We set J = −0.6, sampled 1000 examples from the model, and fit another Ising model to this data. We followed a non-parametric inference approach with a mixture of K = 8 Bernoullis. We optimized (6) using SGD and alternated between ten steps over the φ_k and one step over θ, a. We drew 100 Monte Carlo samples per q_k. Our method converged in about 25 steps over θ. At each iteration we computed log Z via importance sampling.

The adjacent figure shows the evolution of log Z during learning. It also plots log Z computed by exact inference, loopy BP, and Gibbs sampling (using the same number of samples). Our method accurately tracks the partition function after about 10 iterations. In particular, our method fares better than the others when J ≈ −0.6, which is when the Ising model is entering its phase transition.

5.2 Learning Restricted Boltzmann Machines

Next, we use our method to train Restricted Boltzmann Machines (RBMs) on the UCI digits dataset [30], which contains 10,992 8×8 images of handwritten digits; we augment this data by moving each image 1px to the left, right, up, and down. We train an RBM with 100 hidden units using ADAM [31] with batch size 100, a learning rate of 3·10⁻⁴, β₁ = 0.9, and β₂ = 0.999; we choose q to be a uniform mixture of K = 10 Bernoulli distributions.
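A toy reconstruction of the partition-function-tracking setup (a smaller 3×3 grid, milder coupling, and K = 5 mixture components; all values hypothetical) shows a mixture-of-Bernoullis proposal importance-sampling log Z against exact enumeration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)

# 3x3 Ising grid: p_tilde(x) = exp(J * sum_edges s_i s_j), s = 2x - 1 (toy J).
J = 0.2
edges = [(i, i + 1) for r in range(3) for i in range(3 * r, 3 * r + 2)]
edges += [(i, i + 3) for i in range(6)]

def log_p_tilde(x):                       # x: (..., 9) array of {0, 1} spins
    s = 2.0 * x - 1.0
    return J * sum(s[..., i] * s[..., j] for i, j in edges)

# Proposal q: uniform mixture of K fully factorized Bernoulli components.
K = 5
pis = rng.uniform(0.4, 0.6, size=(K, 9))

def log_q(x):
    comp = np.log(pis) @ x.T + np.log(1 - pis) @ (1 - x).T   # (K, n) log-probs
    return np.log(np.exp(comp).mean(axis=0))

# Exact log Z by enumerating all 2^9 states, vs. importance sampling through q.
grid = np.array(list(itertools.product([0, 1], repeat=9)), dtype=float)
log_Z_exact = np.log(np.exp(log_p_tilde(grid)).sum())

n = 200_000
k = rng.integers(0, K, size=n)
x = (rng.random((n, 9)) < pis[k]).astype(float)
log_Z_is = np.log(np.mean(np.exp(log_p_tilde(x) - log_q(x))))

assert abs(log_Z_is - log_Z_exact) < 0.2
```

At stronger couplings such as the paper's J = −0.6 the model becomes peaked, and the quality of this estimate hinges on how well q has tracked p, which is the point of the experiment.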
We alternate between training p and q, performing either 2 or 10 gradient steps on q for each step on p and taking 30 samples from q per step; the gradients of p are estimated via adaptive importance sampling.

We compare our method against persistent contrastive divergence (PCD; Tieleman and Hinton [27]), a standard method for training RBMs. The same ADAM settings were used to optimize the model with the PCD gradient. We used k = 3 Gibbs steps and 100 persistent chains. Both PCD and our method were implemented in Theano [32].

In Figure 1, we plot the true log-likelihood of the model (computed with annealed importance sampling with step size 10⁻³) as a function of the epoch; we use 10 gradient steps on q for each step on p. Both PCD and our method achieve comparable performance. Interestingly, we may use our approximating distribution q to estimate the log-likelihood via importance sampling. Figure 1 (right) shows that this estimate closely tracks the true log-likelihood; thus, users may periodically query the model for reasonably accurate estimates of the log-likelihood.

Figure 1: Learning curves for an RBM trained with PCD-3 and with neural variational inference on the UCI digits dataset. Log-likelihood was computed using annealed importance sampling.

Table 1: Test set negative log likelihood on binarized MNIST and Omniglot for VAE and ADGM models with Bernoulli (200 vars) and RBM priors with 64 visible and either 8 or 64 hidden variables.

Binarized MNIST

Omniglot

Model
VAE
ADGM

Ber(200) RBM(64,8) RBM(64,64) Ber(200) RBM(64,8) RBM(64,64)
128.5
131.1

111.9
107.9

105.4
104.3

102.3
100.7

135.1
136.8

130.2
134.4
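For reference, the PCD baseline used in the comparison can be sketched in a few lines; this is a generic textbook sketch, not the paper's Theano implementation (sizes, learning rate, and data below are hypothetical, and bias terms are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(5)

# Minimal PCD-k sketch for a small Bernoulli RBM (all sizes/rates hypothetical).
nv, nh, k, lr = 6, 4, 3, 0.1
W = 0.01 * rng.standard_normal((nv, nh))

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gibbs_step(v):
    h = (rng.random((v.shape[0], nh)) < sigmoid(v @ W)).astype(float)
    return (rng.random((v.shape[0], nv)) < sigmoid(h @ W.T)).astype(float)

data = (rng.random((32, nv)) < 0.8).astype(float)            # fake training batch
chains = rng.binomial(1, 0.5, size=(32, nv)).astype(float)   # persistent chains

for _ in range(10):
    for _ in range(k):        # k Gibbs sweeps; chains are never reset ("persistent")
        chains = gibbs_step(chains)
    pos = data.T @ sigmoid(data @ W) / len(data)        # data-dependent statistics
    neg = chains.T @ sigmoid(chains @ W) / len(chains)  # model statistics from chains
    W += lr * (pos - neg)                               # approximate gradient step

assert np.isfinite(W).all()
```

The contrast with the paper's method is that here the negative-phase samples come from the persistent Gibbs chains rather than from a learned, closed-form proposal q.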
In our implementation, neural variational inference was approximately eight times slower than PCD; when performing two gradient steps on q, our method was only 50% slower with similar samples and pseudo-likelihood; however, log-likelihood estimates were noisier. Annealed importance sampling was always more than an order of magnitude slower than neural variational inference.

Visualizing the approximating distribution. Next, we trained another RBM model, performing two gradient steps for q for each step of p. The adjacent figure shows the mean distribution of each component of the mixture of Bernoullis q; one may distinguish in them the shapes of various digits. This confirms that q indeed approximates p.

Speeding up sampling from undirected models. After the model has finished training, we can use the approximating q to initialize an MCMC sampling chain. Since q is a rough approximation of p, the resulting chain should mix faster. To confirm this intuition, we plot in the adjacent figure samples from a Gibbs sampling chain that has been initialized randomly (top), as well as from a chain that was initialized with a sample from q (bottom). The latter method reaches a plausible-looking digit in a few steps, while the former produces blurry samples.

5.3 Learning Hybrid Directed/Undirected Models

Next, we use the variational objective (7) to learn two types of hybrid directed/undirected models: a variational autoencoder (VAE) and an auxiliary variable deep generative model (ADGM) [18]. We consider three types of priors: a standard set of 200 uniform Bernoulli variables, an RBM with 64 visible and 8 hidden units, and an RBM with 64 visible and 64 hidden units. In the ADGM, the approximate posterior r(z, u|x) = r(z|u, x)r(u|x) includes auxiliary variables u ∈ R¹⁰.
All the conditional probabilities r(z|u, x), r(u|x), r(z|x), p(x|z) are parametrized with dense neural networks with one hidden layer of size 500.

Figure 2: Samples from a deep generative model using different priors over the discrete latent variables z. On the left, the prior p(z) is a Bernoulli distribution (200 vars); on the right, p(z) is an RBM (64 visible and 8 hidden vars). All other parts of the model are held fixed.

We train all neural networks for 200 epochs with ADAM (same parameters as above) and neural variational inference (NVIL) with control variates as described in Mnih and Rezende [9]. We parametrize q with a neural network mapping 10-dimensional auxiliary variables a ∼ N(0, I) to x via one hidden layer of size 32. We show in Table 1 the test set negative log-likelihoods on the binarized MNIST [33] and 28×28 Omniglot [17] datasets; we compute these using 10³ Monte Carlo samples and using annealed importance sampling for the 64×64 RBM.

Overall, adding an RBM prior with as little as 8 latent variables results in significant log-likelihood improvements. Most interestingly, this prior greatly improves sample quality over the discrete latent variable VAE (Figure 2). Whereas the VAE failed to generate correct digits, replacing the prior with a small RBM resulted in smooth MNIST images. We note that both methods were trained with exactly the same gradient estimator (NVIL).
We observed similar behavior for the ADGM model.\nThis suggests that introducing the undirected component made the models more expressive and easier\nto train.\n\n6 Related Work and Discussion\n\nOur work is inspired by black-box variational inference [7] for variational autoencoders and related\nmodels [15], which involve \ufb01tting approximate posteriors parametrized by neural networks. Our work\npresents analogous methods for undirected models. Popular classes of undirected models include\nRestricted and Deep Boltzmann Machines [4, 26] as well as Deep Belief Networks [11]. Closest to\nour work is the discrete VAE model; however, Rolfe [29] seeks to ef\ufb01ciently optimize p(x|z), while\nthe RBM prior p(z) is optimized using PCD; our work optimizes p(x|z) using standard techniques\nand focuses on p(z). Our bound has also been independently studied in directed models [22].\nMore generally, our work proposes an alternative to sampling-based learning methods; most vari-\national methods for undirected models center on inference. Our approach scales to small and\nmedium-sized datasets, and is most useful within hybrid directed-undirected generative models. It\napproaches the speed of the PCD method and offers additional bene\ufb01ts, such as partition function\ntracking and accelerated sampling. Most importantly, our algorithms are black-box, and do not\nrequire knowing the structure of the model to derive gradient or partition function estimators. We\nanticipate that our methods will be most useful in automated inference systems such as Edward [10].\nThe scalability of our approach is primarily limited by the high variance of the Monte Carlo estimates\nof the gradients and the partition function when q does not \ufb01t p suf\ufb01ciently well. In practice, we\nfound that simple metrics such as pseudo-likelihood were effective at diagnosing this problem. 
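Pseudo-likelihood as a diagnostic fits the black-box theme, since it needs only the unnormalized p̃ and not Z; a minimal sketch for a hypothetical fully connected binary MRF:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy fully connected binary MRF with log p_tilde(x) = x^T A x (A hypothetical).
d = 8
A = 0.1 * rng.standard_normal((d, d))
A = (A + A.T) / 2.0                    # symmetric pairwise couplings

def log_p_tilde(x):
    return x @ A @ x

def pseudo_ll(x):
    """Pseudo-log-likelihood sum_i log p(x_i | x_-i): requires only p_tilde."""
    total = 0.0
    for i in range(d):
        x1, x0 = x.copy(), x.copy()
        x1[i], x0[i] = 1.0, 0.0
        delta = log_p_tilde(x1) - log_p_tilde(x0)   # log-odds of x_i = 1 given the rest
        p1 = 1.0 / (1.0 + np.exp(-delta))
        total += np.log(p1 if x[i] == 1 else 1.0 - p1)
    return total

x = rng.binomial(1, 0.5, size=d).astype(float)
score = pseudo_ll(x)
assert score <= 0.0 and np.isfinite(score)
```

Because each conditional normalizes over a single variable, the intractable partition function cancels, which is what makes this metric cheap to monitor during training.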
When training deep generative models with RBM priors, we noticed that weak q's introduced mode collapse (but training would still converge). Increasing the complexity of q and using more samples resolved these problems. Finally, we also found that the score function estimator of the gradient of q does not scale well to higher dimensions. Better gradient estimators are likely to further improve our method.

7 Conclusion

In summary, we have proposed new variational learning and inference algorithms for undirected models that optimize an upper bound on the partition function derived from the perspective of importance sampling and χ² divergence minimization. Our methods allow training undirected models in a black-box manner and will be useful in automated inference systems [10].

Our framework is competitive with sampling methods in terms of speed and offers additional benefits such as partition function tracking and accelerated sampling. Our approach can also be used to train hybrid directed/undirected models using a unified variational framework. Most interestingly, it makes generative models with discrete latent variables both more expressive and easier to train.

Acknowledgements. This work is supported by the Intel Corporation, Toyota, NSF (grants 1651565, 1649208, 1522054), and by the Future of Life Institute (grant 2016-158687).

References

[1] Yongyue Zhang, Michael Brady, and Stephen Smith. Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Transactions on Medical Imaging, 20(1):45–57, 2001.

[2] Jeffrey A Bilmes. Graphical models and automatic speech recognition. In Mathematical Foundations of Speech and Language Processing, pages 191–245. Springer, 2004.

[3] John Scott. Social Network Analysis. Sage, 2012.

[4] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines.
In Artificial Intelligence and Statistics, pages 448–455, 2009.
[5] Martin J. Wainwright, Michael I. Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
[6] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
[7] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822, 2014.
[8] Guillaume Desjardins, Yoshua Bengio, and Aaron C. Courville. On tracking the partition function. In Advances in Neural Information Processing Systems, pages 2501–2509, 2011.
[9] Andriy Mnih and Danilo J. Rezende. Variational inference for Monte Carlo objectives. arXiv preprint arXiv:1602.06725, 2016.
[10] Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.
[11] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
[12] Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey, and Radford M. Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158, 1995.
[13] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009.
[14] Rajan Srinivasan. Importance sampling: Applications in communications and detection. Springer Science & Business Media, 2013.
[15] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013. URL http://arxiv.org/abs/1312.6114.
[16] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models.
In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, pages 1278–1286, 2014. URL http://jmlr.org/proceedings/papers/v32/rezende14.html.
[17] Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. CoRR, abs/1509.00519, 2015. URL http://arxiv.org/abs/1509.00519.
[18] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.
[19] Rajesh Ranganath, Dustin Tran, and David Blei. Hierarchical variational models. In International Conference on Machine Learning, pages 324–333, 2016.
[20] Ernest K. Ryu and Stephen P. Boyd. Adaptive importance sampling via stochastic convex programming. Unpublished manuscript, November 2014.
[21] Tom Minka et al. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
[22] Adji B. Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David M. Blei. Variational inference via chi upper bound minimization. Advances in Neural Information Processing Systems, 2017.
[23] Samuel Gershman, Matthew D. Hoffman, and David M. Blei. Nonparametric variational inference. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, 2012.
[24] Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1218–1226, 2015.
[25] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
[26] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, DTIC Document, 1986.
[27] Tijmen Tieleman and Geoffrey Hinton.
Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1033–1040. ACM, 2009.
[28] Siddharth Gopal and Yiming Yang. Distributed training of large-scale logistic models. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, pages 289–297, 2013.
[29] Jason Tyler Rolfe. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016.
[30] Fevzi Alimoglu, Ethem Alpaydin, and Yagmur Denizhan. Combining multiple classifiers for pen-based handwritten digit recognition. 1996.
[31] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[32] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, 2012.
[33] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In AISTATS, volume 1, page 2, 2011.