{"title": "Meta-Learning MCMC Proposals", "book": "Advances in Neural Information Processing Systems", "page_first": 4146, "page_last": 4156, "abstract": "Effective implementations of sampling-based probabilistic inference often require manually constructed, model-specific proposals. Inspired by recent progress in meta-learning for training learning agents that can generalize to unseen environments, we propose a meta-learning approach to building effective and generalizable MCMC proposals. We parametrize the proposal as a neural network to provide fast approximations to block Gibbs conditionals. The learned neural proposals generalize to occurrences of common structural motifs across different models, allowing for the construction of a library of learned inference primitives that can accelerate inference on unseen models with no model-specific training required. We explore several applications including open-universe Gaussian mixture models, in which our learned proposals outperform a hand-tuned sampler, and a real-world named entity recognition task, in which our sampler yields higher final F1 scores than classical single-site Gibbs sampling.", "full_text": "Meta-Learning MCMC Proposals

Tongzhou Wang* (Facebook AI Research, tongzhou.wang.1994@gmail.com)
Yi Wu (University of California, Berkeley, jxwuyi@gmail.com)
David A. Moore† (Google, davmre@gmail.com)
Stuart J. Russell (University of California, Berkeley, russell@cs.berkeley.edu)

Abstract

Effective implementations of sampling-based probabilistic inference often require manually constructed, model-specific proposals. Inspired by recent progress in meta-learning for training learning agents that can generalize to unseen environments, we propose a meta-learning approach to building effective and generalizable MCMC proposals. We parametrize the proposal as a neural network to provide fast approximations to block Gibbs conditionals.
The learned neural proposals generalize to occurrences of common structural motifs across different models, allowing for the construction of a library of learned inference primitives that can accelerate inference on unseen models with no model-specific training required. We explore several applications including open-universe Gaussian mixture models, in which our learned proposals outperform a hand-tuned sampler, and a real-world named entity recognition task, in which our sampler yields higher final F1 scores than classical single-site Gibbs sampling.

1 Introduction

Model-based probabilistic inference is a highly successful paradigm for machine learning, with applications to tasks as diverse as movie recommendation [31], visual scene perception [17], music transcription [3], etc. People learn and plan using mental models, and indeed the entire enterprise of modern science can be viewed as constructing a sophisticated hierarchy of models of physical, mental, and social phenomena. Probabilistic programming provides a formal representation of models as sample-generating programs, promising the ability to explore an even richer range of models. Approaches based on probabilistic programming languages have been successfully applied to complex real-world tasks such as seismic monitoring [23], concept learning [18], and design generation [26]. However, most of these applications require manually designed proposal distributions for efficient MCMC inference. Commonly used "black-box" MCMC algorithms are often far from satisfactory when handling complex models. Hamiltonian Monte Carlo [24] takes global steps but is only applicable to continuous latent variables with differentiable likelihoods. Single-site Gibbs sampling [30, 1] can be applied to many models but suffers from slow mixing when variables are coupled in the posterior.
Effective real-world inference often requires block proposals that update multiple variables together to overcome near-deterministic and long-range dependence structures. However, computing exact Gibbs proposals for large blocks quickly becomes intractable (approaching the difficulty of posterior inference), and in practice it is common to invest significant effort in hand-engineering computational tricks for a particular model.

* Work done while the author was at the University of California, Berkeley
† Work done while the author was at the University of California, Berkeley

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

[Figure 1 diagrams omitted; the models relate X, Y, Z variables through sums with Gaussian noise, e.g. Z1 = Y1 + N(0, 2) and Z2 = Y2 + N(0, 10^-5), and the general model uses X ~ N(0, α) and Z = Y + N(0, β).]
(a) Two models of the same structure, but with different parameters and thus different near-deterministic relations (shown in red). Naive MCMC algorithms like single-site Gibbs fail on both models due to these dependencies.
(b) To design a single proposal that works on both models in Fig. 1a, we consider this general model with variable parameters α and β (shown in blue).
(c) Our neural proposal takes model parameters α and β as input, and is trained to output good proposal distributions on randomly generated parameters. Therefore, it performs well for any given α and β. (For simplicity, inputs in the diagram omit possible other nodes that the proposed nodes may depend on.)
Figure 1: Toy example: Naive MCMC algorithms (e.g., single-site Gibbs) fail when variables are tightly coupled, requiring custom proposals even for models with similar structure but different dependency relations (Fig. 1a). Our goal is to design a single proposal that works on any model with similar local structure.
We consider the general model where the dependency relations among nodes are represented by variable model parameters (Fig. 1b), and then train proposals parametrized by neural networks on models with randomly generated parameters (Fig. 1c). The trained proposal thus works anywhere the structure is found (Fig. 1d). With proposals trained for many common motifs, we can automatically speed up inference on unseen models.

(d) The neural proposal can be applied anywhere this structural pattern is present (or instantiated). The grey regions show example instantiations in this large model. (There are more.)

Can we build tractable MCMC proposals that are (1) effective for fast mixing and (2) ready to be reused across different models?

Recent advances in meta-learning demonstrate promising results in learning to build reinforcement learning agents that can generalize to unseen environments [7, 33, 9, 37]. The core idea of meta-learning is to generate a large number of related training environments under the same objective and then train a learning agent to succeed in all of them. Inspired by those meta-learning works, we adopt a similar approach to build generalizable MCMC proposals.

We propose to learn approximate block-Gibbs proposals that can be reused within a given model, and even across models containing similar structural motifs (i.e., common structural patterns). Recent work recognized that a wide range of models can be represented as compositions of simple components [10], and that domain-specific models may still reuse general structural motifs such as chains, grids, rings, or trees [14]. We exploit this by training a meta-proposal to approximate block-Gibbs conditionals for models containing a given motif, with the model parameters provided as an additional input.
At a high level, our approach first (1) generates different instantiations of a particular motif by randomizing its model parameters, and then (2) meta-trains a neural proposal to be "close to" the true Gibbs conditionals for all the instantiations (see Fig. 1). By learning such flexible samplers, we can improve inference not only within a specific model but even on unseen models containing similar structures, with no additional training required. In contrast to techniques that compile inference procedures specific to a given model [32, 19, 29], learning inference artifacts that generalize to novel models is valuable in allowing model builders to quickly explore a wide range of possible models.

We explore the application of our approach to a wide range of models. On grid-structured models from a UAI inference competition, our learned proposal significantly outperforms Gibbs sampling. For open-universe Gaussian mixture models, we show that a simple learned block proposal yields performance comparable to a model-specific hand-tuned sampler, and generalizes to models larger than those it was trained on. We additionally apply our method to a named entity recognition (NER) task, showing that not only do our learned block proposals mix effectively, but the ability to escape local modes also yields higher-quality solutions than the standard Gibbs sampling approach.

2 Related Work

There has been great interest in using learned, feedforward inference networks to generate approximate posteriors. Variational autoencoders (VAEs) train an inference network jointly with the parameters of the forward model to maximize a variational lower bound [15, 5, 11].
However, the use of a parametric variational distribution means they typically have limited capacity to represent complex, potentially multimodal posteriors, such as those incorporating discrete variables or structural uncertainty.

A related line of work has developed data-driven proposals for importance samplers [25, 19, 27], training an inference network from prior samples which is then used as a proposal given observed evidence. In particular, Le et al. [19] generalize the framework to probabilistic programming, and are able to automatically generate and train a neural proposal network given an arbitrary model described in a probabilistic program. Our approach differs in that we focus on MCMC inference, allowing modular proposals for subsets of model variables that may depend on latent quantities, and exploit recurring structural motifs to generalize to new models with no additional training.

Several approaches have been proposed for adaptive block sampling, in which sets of variables exhibiting strong correlations are identified dynamically during inference, so that costly joint sampling is used only for blocks where it is likely to be beneficial [35, 34]. This is largely complementary to our current approach, which assumes the set of blocks (structural motifs) is given and attempts to learn fast approximate proposals.

Perhaps most related to our approach is recent work that trains model-specific MCMC proposals with machine learning techniques. In [29], adversarial training directly optimizes the similarity between posterior values and proposed values from a symmetric MCMC proposal. Stochastic inverses of graphical models [32] train density estimators to speed up inference. However, both approaches are limited in the models they apply to, and require model-specific training using global information (samples containing all variables).
Our approach is simpler and more scalable, requiring only local information and generating local proposals that can be reused both within and across different models. At a high level, our approach of learning an approximate local update scheme can be seen as related to approximate message passing [28, 12] and learning to optimize continuous objectives [2, 20].

3 Meta-Learning MCMC Proposals

We propose a meta-learning approach, using a neural network to approximate the Gibbs proposal for a recurring structural motif in graphical models, and to speed up inference on unseen models without extra tuning. Crucially, our proposals do not fix the model parameters, which are instead provided as network input. After training with random model parametrizations, the same trained proposal can be reused to perform inference on novel models with parametrizations not previously observed.

Our inference networks are parametrized as mixture density networks [4], and trained to minimize the Kullback-Leibler (KL) divergence between the true posterior conditional and the proposal by sampling instantiations of the motif. The proposals are then accepted or rejected following the Metropolis-Hastings (MH) rule [1], so we maintain the correct stationary distribution even though the proposals are approximate. The following sections describe our work in greater depth.

3.1 Background

Although our approach applies to arbitrary probabilistic programs, for simplicity we focus on models represented as factor graphs. A model consists of a set of variables V as the nodes of a graph G = (V, E), along with a set of factors specifying a joint probability distribution pΨ(V) described by parameters Ψ. In particular, this paper focuses primarily on directed models, in which the factors Ψ specify the conditional probability distributions of each variable given its parents.
In undirected models, such as the Conditional Random Fields (CRFs) in Sec. 4.3, the factors are arbitrary functions associated with cliques in the graph [16].

(a) One instantiation. (b) Another instantiation.
Figure 2: Two instantiations of a structural motif in a directed chain of length seven. The motif consists of two consecutive variables and their Markov blanket of four neighboring variables. Each instantiation is separated into block proposed variables Bi (white) and conditioning variables Ci (shaded).

Given a set of observed evidence variables, inference attempts to sample from the conditional distribution on the remaining variables. In order to construct good MCMC proposals that generalize well across a variety of inference tasks, we take advantage of recurring structural motifs in graphical models, such as grids, rings, and chains [14].

In this work, our goal is to train a neural network as an efficient expert proposal for a structural motif, with its inputs containing the local parameters, so that the trained proposal can be applied to different models. Within a motif, the variables are divided into a proposed set of variables that will be resampled, and a conditioning set corresponding to an approximate Markov blanket. The proposal network essentially maps the values of the conditioning variables and local parameters to a distribution over the proposed variables.

3.2 MCMC Proposals on Structural Motifs in Graphical Models

We associate each learned proposal with a structural motif that determines the shape of the network inputs and outputs. In general, structural motifs can be arbitrary subgraphs, but we are more interested in motifs that represent interesting conditional structure between two sets of variables: the block proposed variables B and the conditioning variables C. A given motif can have multiple instantiations within a model, or even across models. As a concrete example, Fig.
2 shows two instantiations of a structural motif of six consecutive variables in a chain model. In each instantiation, we want to approximate the conditional distribution of the two middle variables given the neighboring four.

Definition. A structural motif (B, C) (or motif for short) is an (abstract) graph with nodes partitioned into two sets, B and C, and a parametrized joint distribution p(B, C) whose factorization is consistent with the graph structure. This specifies the functional form of the conditional p(B|C), but not the specific parameters.

A motif usually has many instantiations across many different graphical models.

Definition. For a graphical model (G = (V, E), Ψ), an instantiation (Bi, Ci, Ψi) of a motif (B, C) includes
1. a subset of the model variables (Bi, Ci) ⊆ V such that the induced subgraph on (Bi, Ci) is isomorphic to the motif (B, C) with the partition preserved by the isomorphism (so nodes in B are mapped to Bi, and C to Ci), and
2. a subset of model parameters Ψi ⊆ Ψ required to specify the conditional distribution pΨi(B|C).

We would typically define a structural motif by first picking out a block of variables B to jointly sample, and then selecting a conditioning set C. Intuitively, the natural choice for a conditioning set is the Markov blanket, C = MB(B). However, this is not a fixed requirement, and C could be either a subset or superset of it (or neither). We might deliberately choose to use some alternate conditioning set C, e.g., a subset of the Markov blanket to gain a more computationally efficient proposal (with a smaller proposal network), or a superset with the idea of learning longer-range structure. More fundamentally, however, Markov blankets depend on the larger graph structure and might not be consistent across instantiations of a given motif (e.g., if one instantiation has additional edges connecting Bi to other model variables not in Ci).
Allowing C to represent a generic conditioning set leaves us with greater flexibility in instantiating motifs.

Formally, our goal is to learn a Gibbs-like block proposal q(Bi|Ci; Ψi) for all possible instantiations (Bi, Ci, Ψi) of a structural motif that is close to the true conditional, in the sense that

∀(Bi, Ci, Ψi), ∀ci ∈ supp(Ci), q(Bi; ci, Ψi) ≈ pΨi(Bi|Ci = ci).   (1)

This provides another view of this approximation problem. If we choose the motif to have complex structure in each instantiation, the conditionals pΨi(Bi|Ci = ci) can often be quite different for different instantiations, and thus difficult to approximate. Therefore, the choice of structural motif represents a trade-off between the generality of the proposal and the ease of approximating it. While our approach works for any structural motif complying with the above definition, we suggest using common structures as motifs, such as chains of a certain length as in Fig. 2. In principle, recurring motifs could be automatically detected, but in this work we focus on hand-identified common structures.

3.3 Parametrizing Neural Block Proposals

We choose mixture density networks (MDNs) [4] as our proposal network parametrization. An MDN is a form of neural network whose outputs parametrize a mixture distribution, where in each mixture component the variables are uncorrelated.

In our case, a neural block proposal is a function qθ parametrized by an MDN with weights θ. The function qθ represents proposals for a structural motif (B, C) by taking in current values of Ci and local parameters Ψi, and outputting a distribution over Bi. The goal is to optimize θ so that qθ is close to the true conditional.

In the network output, mixture weights are represented explicitly.
Within each mixture component, distributions of bounded discrete variables are directly represented as independent categorical probabilities, and distributions of continuous variables are represented as isotropic Gaussians with mean and variance. To avoid degenerate proposals, we threshold the variance of each Gaussian component to be at least 10^-5.

3.4 Training Neural Block Proposals

Loss function for a specific instantiation: Given a particular motif instantiation, we use the KL divergence D(pΨi(Bi|Ci) ‖ qθ(Bi; Ci, Ψi)) as the measure of closeness between our proposal and the true conditional in Eq. 1. Taking into account all possible values ci ∈ supp(Ci), we consider the expected divergence over Ci's prior:

E_Ci[D(pΨi(Bi|Ci) ‖ qθ(Bi; Ci, Ψi))] = −E_{Bi,Ci}[log qθ(Bi; Ci, Ψi)] + constant.   (2)

The second term is independent of θ. So we define the loss function on (Bi, Ci, Ψi) as

L̃(θ; Bi, Ci, Ψi) = −E_{Bi,Ci}[log qθ(Bi; Ci, Ψi)].

Meta-training over many instantiations: To train a generalizable neural block proposal, we generate a set of random instantiations and optimize the loss function over all of them. Assuming a distribution P over instantiations, our goal is to minimize the overall loss

L(θ) = E_{(Bi,Ci,Ψi)∼P}[L̃(θ; Bi, Ci, Ψi)] = −E_{(Bi,Ci,Ψi)∼P}[E_{Bi,Ci}[log qθ(Bi; Ci, Ψi)]],   (3)

which is optimized with minibatch SGD in our experiments.

There are different ways to design the motif instantiation distribution P. One approach is to find a distribution over the model parameter space, and attach the random parametrizations Ψi to (Bi, Ci). Practically, it is also viable to find a training dataset of models that contains a large number of instantiations.
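The meta-training objective of Eq. 3 amounts to stochastic gradient descent on −log qθ over randomly drawn instantiations. Below is a minimal sketch of this loop on a toy one-variable motif, with a hand-rolled linear-logistic model standing in for the MDN; the toy conditional, feature choice, and all names are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sample_instantiation():
    # Hypothetical toy motif: one binary proposed variable B whose true
    # conditional is p(B=1 | C=c, psi) = sigmoid(psi * c), with both the
    # conditioning value c and the "model parameter" psi randomized.
    psi = rng.normal()
    c = rng.normal()
    b = rng.random() < sigmoid(psi * c)
    return float(b), c, psi

def features(c, psi):
    # The interaction feature c * psi lets this linear stand-in represent
    # the true conditional exactly (at theta = [0, 0, 1]).
    return np.array([c, psi, c * psi])

def q_prob_b1(theta, c, psi):
    return sigmoid(theta @ features(c, psi))

# Minimize L(theta) = -E[log q_theta(B; C, Psi)] over random instantiations
# (Eq. 3), here with plain per-sample SGD instead of minibatches.
theta = np.zeros(3)
for step in range(30000):
    lr = 0.2 if step < 15000 else 0.02
    b, c, psi = sample_instantiation()
    q1 = q_prob_b1(theta, c, psi)
    theta -= lr * (q1 - b) * features(c, psi)  # gradient of -log q_theta
```

Because the toy family is well specified, the learned weights approach the true conditional's parametrization; with a real MDN the same loop only drives qθ as close to the conditionals as its capacity allows.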
Both approaches are discussed in detail and evaluated in the experiment section.

Neural block sampling: The overall MCMC sampling procedure with meta-proposals is outlined in Algorithm 1, which supports building a library of neural block proposals trained on common motifs to speed up inference on previously unseen models.

4 Experiments

In this section, we evaluate our method of learning neural block proposals against a single-site Gibbs sampler as well as several model-specific MCMC methods. We focus on three of the most common structural motifs: grids, mixtures, and chains. In all experiments, we use the following guidelines to design the proposal: (1) use small underlying MDNs (we pick networks with two hidden layers and elu activation [6]), and (2) choose an appropriate distribution for generating parameters of the motif, such that the generated parameters cover the whole space as much as possible. More experiment details and an additional experiment are available in the supplementary materials.

Algorithm 1 Neural Block Sampling
Input: Graphical model (G, Ψ), observations y, motifs {(B(m), C(m))}_m, and their instantiations {(Bi(m), Ci(m), Ψi(m))}_{i,m} detected in (G, Ψ).
1: for each motif (B(m), C(m)) do
2:   if a proposal trained for this motif exists then
3:     q(m) ← trained neural block proposal
4:   else
5:     train neural block proposal q(m)_θ using SGD by Eq. 3 on its instantiations {(Bi(m), Ci(m), Ψi(m))}_i
6:   end if
7: end for
8: x ← initialize state
9: for timestep in 1 . . . T do
10:   Propose x′ ← proposal q(m)_θ on some instantiation (Bi(m), Ci(m), Ψi(m))
11:   Accept or reject according to MH rule
12: end for
13: return MCMC samples

4.1 Grid Models

We start with a common structural motif in graphical models, grids.
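Steps 8-13 of Algorithm 1 are a standard Metropolis-Hastings loop in which the learned proposal replaces the exact block-Gibbs conditional. A self-contained sketch on a toy two-variable target, with a deliberately inexact hand-fixed q standing in for a trained network (the target and q are illustrative, not from the paper):

```python
import random

random.seed(0)

# Toy unnormalized joint over two binary variables (b, c); the 4:1 factors
# couple b and c, as in the models that defeat single-site Gibbs.
weights = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}

def q(b, c):
    # Deliberately *approximate* "learned" conditional q(b | c); the exact
    # Gibbs conditional would be weights[(b, c)] / sum_b' weights[(b', c)].
    return 0.7 if b == c else 0.3

counts = {0: 0, 1: 0}
b, c = 0, 0
for step in range(200000):
    # Propose a new value for the block {b} conditioned on c.
    b_new = 1 if random.random() < q(1, c) else 0
    # MH correction: the chain stays exact even though q is approximate.
    accept = min(1.0, (weights[(b_new, c)] * q(b, c))
                      / (weights[(b, c)] * q(b_new, c)))
    if random.random() < accept:
        b = b_new
    # Resample c from its exact conditional (a single-site Gibbs step).
    p1 = weights[(b, 1)] / (weights[(b, 0)] + weights[(b, 1)])
    c = 1 if random.random() < p1 else 0
    counts[b] += 1
```

By symmetry the true marginal of b is 0.5, and the empirical frequency `counts[1] / 200000` converges to it despite the biased proposal; this is the sense in which Algorithm 1 maintains the correct stationary distribution.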
In this section, we focus on binary-valued grid models of all sorts, for the relative ease of directly computing their posteriors. To evaluate MCMC algorithms, we compare the estimated posterior marginals P̂ against the true posterior marginals P computed using IJGP [22]. For each inference task with N variables, we calculate the error

(1/N) Σ_{i=1}^{N} |P̂(Xi = 1) − P(Xi = 1)|

as the mean absolute deviation of marginal probabilities.

4.1.1 General Binary-Valued Grid Models

We consider the motif in Fig. 3, which is instantiated in every binary-valued grid Bayesian network (BN). Our proposal takes in the conditional probability tables (CPTs) of all 23 variables as well as the current values of the 14 conditioning variables, and outputs a distribution over the 9 proposed variables. To sample over all possible binary-valued grid instantiations, we generate random grids by sampling each CPT entry i.i.d. from a mixed distribution of the following form:

  [0, 1]          w.p. pdeterm/2
  [1, 0]          w.p. pdeterm/2
  Dirichlet(α)    w.p. 1 − pdeterm,   (4)

where pdeterm ∈ [0, 1] is the probability of the CPT entry being deterministic. Our proposal is trained with pdeterm = 0.05 and α = (0.5, 0.5).

To test the generalizability of our trained proposal, we generate random binary grid instantiations using distributions with various pdeterm and α values, and compute the KL divergences between the true conditionals and our proposal outputs on 1000 sampled instantiations from each distribution. Fig. 5 shows the histograms of divergence values from 4 very different distributions, including the one used for training (top left). The resulting histograms show mostly small divergence values, and are nearly indistinguishable, even though one distribution has pdeterm = 0.8 and the proposal is only trained with pdeterm = 0.05.
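Sampling CPT entries from the mixed distribution of Eq. 4 can be sketched as follows; the function name and the two-outcome Gamma-based Dirichlet draw are illustrative implementation choices:

```python
import random

random.seed(0)

def sample_cpt_entry(p_determ=0.05, alpha=(0.5, 0.5)):
    """Sample one binary CPT row as in Eq. 4: deterministic with probability
    p_determ (split evenly between [0, 1] and [1, 0]), else Dirichlet(alpha)."""
    u = random.random()
    if u < p_determ / 2:
        return [0.0, 1.0]
    if u < p_determ:
        return [1.0, 0.0]
    # A Dirichlet(a0, a1) draw over two outcomes via two Gamma variates.
    g0 = random.gammavariate(alpha[0], 1.0)
    g1 = random.gammavariate(alpha[1], 1.0)
    return [g0 / (g0 + g1), g1 / (g0 + g1)]

# With p_determ = 0.8, about 80% of sampled rows are deterministic.
rows = [sample_cpt_entry(p_determ=0.8) for _ in range(10000)]
frac_determ = sum(r in ([0.0, 1.0], [1.0, 0.0]) for r in rows) / len(rows)
```

Varying `p_determ` and `alpha` in this sampler is exactly how the generalization test above produces its 4 different instantiation distributions.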
This shows that our approach is able to approximate true conditionals accurately in general, despite being trained with only a single, arbitrary distribution.

We evaluate the performance of the trained neural block proposal on all 180 grid BNs with up to 500 nodes from the UAI 2008 inference competition. In each epoch, for each latent variable, we try to identify and propose the block as in Fig. 3 with the variable located at the center. If this is not possible, e.g., when the variable is at a boundary or close to evidence, single-site Gibbs resampling is used instead.

Fig. 6 shows the performance of both our method and single-site Gibbs in terms of error integrated over time for all 180 models. The models are divided into three classes, grid-50, grid-75, and grid-90, according to the percentage of deterministic relations. Our neural block sampler significantly outperforms the Gibbs sampler in nearly every model. We notice that the improvement is less significant as the percentage of deterministic relations increases. This is largely because the above proposal

Figure 3: Motif for general grid models. Conditioning variables (shaded) form the Markov blanket of proposed variables (white). Dashed gray arrows show possible but irrelevant dependencies.

Figure 4: Sample runs comparing single-site Gibbs, Neural Block Sampling, and block Gibbs with true conditionals. For each model, we use 10 random initializations and run the three algorithms for 1500s each. Epoch plots are cut off at 500 epochs to better show the comparison, because true block Gibbs finishes far fewer epochs within the given time. 50-20-5 and 90-21-10 are identifiers of these two models in the competition.

Figure 5: KL divergences between the true conditionals and our proposal outputs on 1000 sampled instantiations from 4 distributions with different pdeterm and α. Top left is the distribution used in training.
Our trained proposal is able to generalize on arbitrary binary grid models.

Figure 6: Performance comparison on 180 grid models from the UAI 2008 inference competition. Each mark represents error integrals for both single-site Gibbs and our method in a single run over 1200s of inference.

structure in Fig. 3 can only easily handle dependencies among the 9 proposed nodes. We expect an increased block size to yield stronger performance on models with many deterministic relations.

Furthermore, in Fig. 4 we compare our proposal against single-site Gibbs, and against exact block Gibbs with an identical proposal block, on grid models with different percentages of deterministic relations. Single-site Gibbs performs worst on both models due to quickly getting stuck in local modes. Between the two block proposals, neural block sampling performs better in error w.r.t. time due to its shorter computation time. However, because the neural block proposal is only an approximation of the true block Gibbs proposal, it is worse in terms of error w.r.t. epochs, as expected. Detailed comparisons on more models are available in the supplementary material.

Additionally, our approach can be used model-specifically by training only on instantiations within a particular model. In the supplementary materials, we demonstrate that our method achieves comparable performance with a more advanced task-specific MCMC method, Inverse MCMC [32].

4.2 Gaussian Mixture Model with Unknown Number of Components

We next consider open-universe Gaussian mixture models (GMMs), in which the number of mixture components is unknown, subject to a prior. Similarly to Dirichlet process GMMs, these are typically treated with hand-designed, model-specific split-merge MCMC algorithms.

Consider the following GMM. n points x = {xi}_{i=1,...,n} are observed, and come uniformly at random from one of M (unknown) active mixtures, with M ∼ Unif{1, 2, . . . , m}.
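Concretely, the generative process just described can be sketched as follows; the mean prior and observation noise scale are illustrative assumptions that the text does not fix, and for simplicity the first M of the m potential mixtures are taken to be the active ones:

```python
import random

random.seed(0)

def sample_gmm(m=8, n=60, sigma=1.0, mu_scale=10.0):
    """Generative process for the open-universe GMM described above.
    sigma and mu_scale are illustrative; the paper leaves them unspecified."""
    M = random.randint(1, m)                       # M ~ Unif{1, ..., m}
    mu = [random.gauss(0.0, mu_scale) for _ in range(M)]
    v = [1] * M                                    # activity indicators of the active mixtures
    z = [random.randrange(M) for _ in range(n)]    # labels, uniform over active mixtures
    x = [random.gauss(mu[zi], sigma) for zi in z]  # observations
    return M, mu, v, z, x

M, mu, v, z, x = sample_gmm()
```

Note that in this sketch M equals the sum of the activity indicators by construction, matching the convention below of computing M from v rather than sampling it.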
Our task is to infer the posterior of the mixture means μ = {μj}_{j=1,...,M}, their activity indicators v = {vj}_{j=1,...,M}, and the labels z = {zi}_{i=1,...,n}, where zi is the index of the mixture that xi comes from. Since M is determined by v, in this experiment we always calculate M = Σ_j vj instead of sampling M.

Figure 7: All except bottom right: Average log likelihoods of MCMC runs over 200 tasks for a total of 600s in various GMMs. Bottom right: Trace plots of M over 12 runs from initializations with different M values on a GMM with m = 12, n = 90. Our approach explores the sample space much faster than Gibbs with SDDS.

Such GMMs have many nearly-deterministic relations, e.g., p(vj = 0, zi = j) = 0, causing vanilla single-site Gibbs to fail to jump across different M values. Split-merge MCMC algorithms, e.g., Restricted Gibbs split-merge (RGSM) [13] and Smart-Dumb/Dumb-Smart (SDDS) [36], use hand-designed MCMC moves to solve such issues. In our framework, it is possible to deal with such relations with a proposal block including all of z, μ and v. However, doing so requires significant training and inference time (due to the larger proposal network and larger proposal block), and the resulting proposal cannot generalize to GMMs of different sizes.

In order to apply the trained proposal to differently sized GMMs, we choose the motif to propose qθ for two arbitrary mixtures (μi, vi) and (μj, vj) conditioned on all other variables excluding z, and instead consider the model with the z variables collapsed. The inference task is then equivalent to first sampling μ, v from the collapsed model p(μ, v|x), and then z from p(z|μ, v, x). We modify the algorithm such that the proposal from qθ is accepted or rejected by the MH rule on the collapsed model. Then z is resampled from p(z|μ, v, x). This approach is less sensitive to different n values and performs well in variously sized GMMs.
More details are available in the supplementary material. We train with a small GMM with m = 8 and n = 60 as the motif, and apply the trained proposal to GMMs with larger m and n by randomly selecting 8 mixtures and 60 points for each proposal. Fig. 7 shows how our sampler performs on GMMs of various sizes, compared against split-merge Gibbs with SDDS. We notice that as the model gets larger, Gibbs with SDDS mixes more slowly, while neural block sampling still mixes fairly fast and outperforms Gibbs with SDDS. The bottom right of Fig. 7 shows the trace plots of M for both algorithms over multiple runs on the same observations. Gibbs with SDDS takes a long time to find a high-likelihood explanation and fails to explore other possible ones efficiently. Our proposal, on the other hand, mixes quickly among the possible explanations.

4.3 Named Entity Recognition (NER) Tagging

Named entity recognition (NER) is the task of inferring named entity tags for words in natural language sentences. One way to tackle NER is to train a conditional random field (CRF) model representing the joint distribution of tags and word features [21]. At test time, we use the CRF to build a chain Markov random field (MRF) containing only the tag variables, and apply MCMC methods to sample the NER tags. We use a dataset of 17494 sentences from the CoNLL-2003 Shared Task³. The CRF model is trained with AdaGrad [8] through 10 sweeps over the training dataset.

³https://www.clips.uantwerpen.be/conll2003/ner/

Figure 8: Average F1 scores and average log likelihoods over the entire test dataset. In each epoch, all variables in every test MRF are proposed roughly once for all algorithms. F1 scores are measured using the states with highest likelihood seen over the Markov chain traces. To better show the comparison, epoch plots are cut off at 500 epochs and time plots at 12850s.
Log likelihoods shown don't include the normalization constant.

Our goal is to train good neural block proposals for the chain MRFs built for test sentences. Experimenting with different chain lengths, we train three proposals, each for a motif of two, three, or four consecutive proposed tag variables and their Markov blanket. These proposals are trained on instantiations within MRFs built from the training dataset for the CRF model.

We then evaluate the learned neural block proposals on the previously unseen test dataset of 3453 sentences. Fig. 8 plots the performance of neural block sampling and single-site Gibbs w.r.t. both time and epochs on the entire test dataset. As the block size grows, the learned proposal takes more time to mix, but block proposals eventually achieve better performance than single-site Gibbs in terms of both F1 scores and log likelihoods. Therefore, as shown in the figure, a mixed proposal combining single-site Gibbs and neural block proposals can achieve better mixing without slowing down much. As an interesting observation, neural block sampling sometimes achieves higher F1 scores even before surpassing single-site Gibbs in log likelihood, implying that log likelihood is at best an imperfect proxy for performance on this task.

5 Conclusion

This paper proposes and explores the (to our knowledge) novel idea of meta-learning generalizable approximate block Gibbs proposals. Our meta-proposals are trained offline and can be applied directly to novel models given only a common set of structural motifs. Experiments show that the neural block sampling approach outperforms standard single-site Gibbs in both convergence speed and sample quality, and achieves comparable performance against model-specialized methods.
It will be an interesting system design problem to investigate how, given a library of trained block proposals, an inference system in a probabilistic programming language can automatically detect common structural motifs and (adaptively) apply appropriate samplers to aid convergence in more general real-world applications.

Additionally, from the meta-learning perspective, our method is based on meta-training, i.e., training over a variety of motif instantiations; at test time, the learned proposal does not adapt to new scenarios. In many meta-learning works in reinforcement learning [9, 7], by contrast, a meta-trained agent can further adapt its learned policy to unseen environments via a few learning steps, under the assumption that a reward signal is accessible at test time. In our setting, we could similarly adopt such a fast-adaptation scheme at test time to further improve the quality of proposed samples, treating the acceptance rate as a test-time reward signal. We leave this as future work.

References

[1] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1):5–43, 2003.

[2] Marcin Andrychowicz, Misha Denil, Sergio Gómez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3981–3989. Curran Associates, Inc., 2016.

[3] Taylor Berg-Kirkpatrick, Jacob Andreas, and Dan Klein. Unsupervised transcription of piano music. In Advances in Neural Information Processing Systems, pages 1538–1546, 2014.

[4] Christopher M Bishop. Mixture density networks. Technical report, Aston University, 1994.

[5] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov.
Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

[6] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

[7] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL^2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

[8] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[9] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

[10] Roger Grosse, Ruslan R Salakhutdinov, William T Freeman, and Joshua B Tenenbaum. Exploiting compositionality to explore a large space of model structures. In 28th Conference on Uncertainty in Artificial Intelligence, pages 15–17. AUAI Press, 2012.

[11] Shixiang Gu, Zoubin Ghahramani, and Richard E Turner. Neural adaptive sequential Monte Carlo. In Advances in Neural Information Processing Systems, pages 2629–2637, 2015.

[12] Nicolas Heess, Daniel Tarlow, and John Winn. Learning to pass expectation propagation messages. In Advances in Neural Information Processing Systems, pages 3219–3227, 2013.

[13] Sonia Jain and Radford M Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13(1):158–182, 2004.

[14] Charles Kemp and Joshua B Tenenbaum. The discovery of structural form. Proceedings of the National Academy of Sciences, 105(31):10687–10692, 2008.

[15] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes.
arXiv preprint arXiv:1312.6114, 2013.

[16] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[17] Tejas D Kulkarni, Pushmeet Kohli, Joshua B Tenenbaum, and Vikash Mansinghka. Picture: A probabilistic programming language for scene perception. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4390–4399, 2015.

[18] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

[19] Tuan Anh Le, Atilim Gunes Baydin, and Frank Wood. Inference compilation and universal probabilistic programming. In Artificial Intelligence and Statistics, pages 1338–1348, 2017.

[20] Ke Li and Jitendra Malik. Learning to optimize neural nets. arXiv preprint arXiv:1703.00441, 2017.

[21] Percy Liang, Hal Daumé III, and Dan Klein. Structure compilation: trading structure for features. In Proceedings of the 25th International Conference on Machine Learning, pages 592–599. ACM, 2008.

[22] Robert Mateescu, Kalev Kask, Vibhav Gogate, and Rina Dechter. Join-graph propagation algorithms. Journal of Artificial Intelligence Research, 37(1):279–328, 2010.

[23] David A. Moore and Stuart J. Russell. Signal-based Bayesian seismic monitoring. In Artificial Intelligence and Statistics (AISTATS), April 2017.

[24] Radford M Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.

[25] B. Paige and F. Wood. Inference networks for sequential Monte Carlo in graphical models. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of JMLR, 2016.

[26] Daniel Ritchie, Sharon Lin, Noah D Goodman, and Pat Hanrahan. Generating design suggestions under tight constraints with gradient-based probabilistic programming.
In Computer Graphics Forum, volume 34, pages 515–526. Wiley Online Library, 2015.

[27] Daniel Ritchie, Anna Thomas, Pat Hanrahan, and Noah Goodman. Neurally-guided procedural models: Amortized inference for procedural graphics programs using neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 622–630. Curran Associates, Inc., 2016.

[28] Stephane Ross, Daniel Munoz, Martial Hebert, and J Andrew Bagnell. Learning message-passing inference machines for structured prediction. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2737–2744. IEEE, 2011.

[29] Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-NICE-MC: Adversarial training for MCMC. arXiv preprint arXiv:1706.07561, 2017.

[30] David J Spiegelhalter, Andrew Thomas, Nicky G Best, Wally Gilks, and D Lunn. BUGS: Bayesian inference using Gibbs sampling, version 0.5. http://www.mrc-bsu.cam.ac.uk/bugs, 1996.

[31] David H Stern, Ralf Herbrich, and Thore Graepel. Matchbox: large scale online Bayesian recommendations. In Proceedings of the 18th International Conference on World Wide Web, pages 111–120. ACM, 2009.

[32] Andreas Stuhlmüller, Jacob Taylor, and Noah Goodman. Learning stochastic inverses. In Neural Information Processing Systems, 2013.

[33] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 23–30. IEEE, 2017.

[34] Daniel Turek, Perry de Valpine, Christopher J Paciorek, Clifford Anderson-Bergman, et al. Automated parameter blocking for efficient Markov chain Monte Carlo sampling. Bayesian Analysis, 2016.

[35] Deepak Venugopal and Vibhav Gogate.
Dynamic blocking and collapsing for Gibbs sampling. In Uncertainty in Artificial Intelligence, page 664. Citeseer, 2013.

[36] Wei Wang and Stuart J Russell. A smart-dumb/dumb-smart algorithm for efficient split-merge MCMC. In UAI, pages 902–911, 2015.

[37] Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building generalizable agents with a realistic and rich 3D environment. arXiv preprint arXiv:1801.02209, 2018.