{"title": "Learning Stochastic Inverses", "book": "Advances in Neural Information Processing Systems", "page_first": 3048, "page_last": 3056, "abstract": "We describe a class of algorithms for amortized inference in Bayesian networks. In this setting, we invest computation upfront to support rapid online inference for a wide range of queries. Our approach is based on learning an inverse factorization of a model's joint distribution: a factorization that turns observations into root nodes. Our algorithms accumulate information to estimate the local conditional distributions that constitute such a factorization. These stochastic inverses can be used to invert each of the computation steps leading to an observation, sampling backwards in order to quickly find a likely explanation. We show that estimated inverses converge asymptotically in number of (prior or posterior) training samples. To make use of inverses before convergence, we describe the Inverse MCMC algorithm, which uses stochastic inverses to make block proposals for a Metropolis-Hastings sampler. We explore the efficiency of this sampler for a variety of parameter regimes and Bayes nets.", "full_text": "Learning Stochastic Inverses\n\nAndreas Stuhlm\u00a8uller\n\nBrain and Cognitive Sciences\n\nMIT\n\nJessica Taylor\n\nNoah D. Goodman\n\nDepartment of Computer Science\n\nDepartment of Psychology\n\nStanford University\n\nStanford University\n\nAbstract\n\nWe describe a class of algorithms for amortized inference in Bayesian networks.\nIn this setting, we invest computation upfront to support rapid online inference\nfor a wide range of queries. Our approach is based on learning an inverse factor-\nization of a model\u2019s joint distribution: a factorization that turns observations into\nroot nodes. Our algorithms accumulate information to estimate the local condi-\ntional distributions that constitute such a factorization. 
These stochastic inverses\ncan be used to invert each of the computation steps leading to an observation,\nsampling backwards in order to quickly \ufb01nd a likely explanation. We show that\nestimated inverses converge asymptotically in number of (prior or posterior) train-\ning samples. To make use of inverses before convergence, we describe the Inverse\nMCMC algorithm, which uses stochastic inverses to make block proposals for a\nMetropolis-Hastings sampler. We explore the ef\ufb01ciency of this sampler for a va-\nriety of parameter regimes and Bayes nets.\n\n1\n\nIntroduction\n\nBayesian inference is computationally expensive. Even approximate, sampling-based algorithms\ntend to take many iterations before they produce reasonable answers. In contrast, human recognition\nof words, objects, and scenes is extremely rapid, often taking only a few hundred milliseconds\u2014only\nenough time for a single pass from perceptual evidence to deeper interpretation. Yet human percep-\ntion and cognition are often well-described by probabilistic inference in complex models. How can\nwe reconcile the speed of recognition with the expense of coherent probabilistic inference? How can\nwe build systems, for applications like robotics and medical diagnosis, that exhibit similarly rapid\nperformance at challenging inference tasks?\nOne response to such questions is that these problems are not, and should not be, solved from scratch\neach time they are encountered. Humans and robots are in the setting of amortized inference: they\nhave to solve many similar inference problems, and can thus of\ufb02oad part of the computational work\nto shared precomputation and adaptation over time. This raises the question of which kinds of\nprecomputation and adaptation are useful. There is substantial previous work on adaptive inference\nalgorithms, including Cheng and Druzdzel (2000); Haario et al. (2006); Ortiz and Kaelbling (2000);\nRoberts and Rosenthal (2009). 
While much of this work is focused on adaptation for a single\nposterior inference, amortized inference calls for adaptation across many different inferences. In\nthis setting, we will often have considerable training data available in the form of posterior samples\nfrom previous inferences; how should we use this data to adapt our inference procedure?\nWe consider using training samples to learn the inverse structure of a directed model. Posterior\ninference is the task of inverting a probabilistic model: Bayes\u2019 theorem turns p(d|h) into p(h|d);\nvision is commonly understood as inverse graphics (Horn, 1977) and, more recently, as inverse\nphysics (Sanborn et al., 2013; Watanabe and Shimojo, 2001); and conditional inference in proba-\nbilistic programs can be described as \u201crunning a program backwards\u201d (e.g., Wingate and Weber,\n2013). However, while this is a good description of the problem that inference solves, conditional\nsampling usually does not proceed backwards step-by-step. We suggest taking this view more liter-\n\n1\n\n\fFigure 1: A Bayesian network modeling brightness constancy in visual perception, a possible inverse\nfactorization, and two of the local joint distributions that determine the inverse conditionals.\n\nally and actually learning the inverse conditionals needed to invert the model. For example, consider\nthe Bayesian network shown in Figure 1. In addition to the default \u201cforward\u201d factorization shown on\nthe left, we can consider an \u201cinverse\u201d factorization shown on the right. Knowing the conditionals for\nthis inverse factorization would allow us to rapidly sample the latent variables given an observation.\nIn this paper, we will explore what these factorizations look like for Bayesian networks, how to learn\nthem, and how to use them to construct block proposals for MCMC.\n\n2\n\nInverse factorizations\n\nLet p be a distribution on latent variables x = (x1, . . . , xm) and observed variables y =\n(y1, . 
. . , yn). A Bayesian network G is a directed acyclic graph that expresses a factorization of this joint distribution in terms of the distribution of each node conditioned on its parents in the graph:\n\np(x, y) = \u220f_{i=1}^{m} p(xi|paG(xi)) \u00b7 \u220f_{j=1}^{n} p(yj|paG(yj))\n\nWhen interpreted as a generative (causal) model, the observations y typically depend on a non-empty set of parents, but are not themselves parents of any nodes.\nIn general, a distribution can be represented using many different factorizations. We say that a Bayesian network H expresses an inverse factorization of p if the observations y do not have parents (but may themselves be parents of some xi):\n\np(x, y) = p(y) \u00b7 \u220f_{i=1}^{m} p(xi|paH(xi))\n\nAs an example, consider the forward and inverse networks shown in Figure 1. We call the conditional distributions p(xi|paH(xi)) stochastic inverses, with inputs paH(xi) and output xi. If we could sample from these distributions, we could produce samples from p(x|y) for arbitrary y, which solves the problem of inference for all queries with the same set of observation nodes.\nIn general, there are many possible inverse factorizations. For each latent node, we can \ufb01nd a factorization such that this node does not have children. This fact will be important in Section 4 when we resample subsets of inverse graphs. Algorithm 1 gives a heuristic method for computing an inverse factorization given Bayes net G, observation nodes y, and desired leaf node xi. We compute an ordering on the nodes of the original Bayes net from observations to leaf node. We then add the nodes in order to the inverse graph, with dependencies determined by the graph structure of the original network.\nIn the setting of amortized inference, past tasks provide approximate posterior samples for the corresponding observations. 
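To make the inverse-factorization identity concrete, the following sketch (illustrative Python; the three-node chain x1 -> x2 -> y and all CPT numbers are made up for this example) builds a forward-factorized joint, derives the inverse conditionals p(y), p(x2|y), p(x1|x2) from it, and checks that the inverse factorization reproduces the joint:

```python
from itertools import product

# Hypothetical three-node chain x1 -> x2 -> y with binary variables.
# All CPT numbers below are made up for illustration.
p_x1 = {0: 0.3, 1: 0.7}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_y_given_x2 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}

def joint(x1, x2, y):
    """Forward factorization p(x1) p(x2|x1) p(y|x2)."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_y_given_x2[x2][y]

def p_y(y):
    """Marginal of the observation, the root of the inverse graph."""
    return sum(joint(a, b, y) for a, b in product((0, 1), repeat=2))

def p_x2_given_y(x2, y):
    return sum(joint(a, x2, y) for a in (0, 1)) / p_y(y)

def p_x1_given_x2(x1, x2):
    # x2 d-separates x1 from y, so conditioning on x2 alone suffices.
    num = sum(joint(x1, x2, c) for c in (0, 1))
    den = sum(joint(a, x2, c) for a, c in product((0, 1), repeat=2))
    return num / den

def inverse_joint(x1, x2, y):
    """Inverse factorization p(y) p(x2|y) p(x1|x2)."""
    return p_y(y) * p_x2_given_y(x2, y) * p_x1_given_x2(x1, x2)

max_gap = max(abs(joint(a, b, c) - inverse_joint(a, b, c))
              for a, b, c in product((0, 1), repeat=3))
```

The check succeeds because x2 d-separates x1 from y, so p(x1|x2) = p(x1|x2, y); this is exactly the property Algorithm 1 enforces when it chooses d-separating parent sets.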
We therefore investigate learning inverses from such samples, and ways of using approximate stochastic inverses for improving the ef\ufb01ciency of solving future inference tasks.\n\nAlgorithm 1: Heuristic inverse factorization\nInput: Bayesian network G with latent nodes x and observed nodes y; desired leaf node xi\nOutput: Ordered inverse graph H\n1: order x such that nodes close to y are \ufb01rst, leaf node xi is last\n2: initialize H to empty graph\n3: add nodes y to H\n4: for node xj in x do\n5: add xj to H\n6: set paH(xj) to a minimal set of nodes in H that d-separates xj from the remainder of H, based on the graph structure of G\n7: end for\n\n3 Learning stochastic inverses\n\nIt is easy to see that we can estimate conditional distributions p(xi|paH(xi)) using samples S drawn from the prior p(x, y). For simplicity, consider discrete variables and an empirical frequency estimator:\n\n\u03b8S(xi|paH(xi)) = |{s \u2208 S : xi(s) \u2227 paH(s)(xi)}| / |{s \u2208 S : paH(s)(xi)}|\n\nwhere the superscript (s) denotes the values taken in sample s. Because \u03b8S is a consistent estimator of the probability of each outcome for each setting of the parent variables, the following theorem follows immediately from the strong law of large numbers:\nTheorem 1. (Learning from prior samples) Let H be an inverse factorization. For samples S drawn from p(x, y), \u03b8S(xi|paH(xi)) \u2192 p(xi|paH(xi)) almost surely as |S| \u2192 \u221e.\nSamples generated from the prior may be sparse in regions that have high probability under the posterior, resulting in slow convergence of the inverses. 
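A minimal sketch of the empirical frequency estimator \u03b8S (illustrative Python; the two-variable model and its probabilities are made up, and train_inverse/sample_inverse are hypothetical helper names):

```python
import random
from collections import Counter, defaultdict

def train_inverse(samples, node, parents):
    """Empirical frequency estimator theta_S(x_node | pa_H(x_node)):
    count the node's values for each observed setting of its
    inverse-graph parents.  Each sample is a dict {variable: value}."""
    counts = defaultdict(Counter)
    for s in samples:
        counts[tuple(s[p] for p in parents)][s[node]] += 1
    return counts

def sample_inverse(counts, parent_values, rng=random):
    """Sample x_node given parent values from the tabulated counts."""
    c = counts[tuple(parent_values)]
    values, weights = zip(*c.items())
    return rng.choices(values, weights=weights)[0]

# Demo on a made-up two-node model x -> y, trained from prior samples.
rng = random.Random(0)
samples = []
for _ in range(20000):
    x = int(rng.random() < 0.3)                    # p(x=1) = 0.3
    y = int(rng.random() < (0.9 if x else 0.2))    # p(y=1|x)
    samples.append({"x": x, "y": y})

theta = train_inverse(samples, "x", ["y"])
draw = sample_inverse(theta, [1], rng)
# By Bayes, p(x=1|y=1) = 0.27/0.41, roughly 0.66; theta should be close.
est = theta[(1,)][1] / sum(theta[(1,)].values())
```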
We now show that valid inverse factorizations\nallow us to learn from posterior samples as well.\nTheorem 2. (Learning from posterior samples) Let H be an inverse factorization. For samples\nS drawn from p(x|y), \u03b8(xi|paH (xi)) \u2192 p(xi|paH (xi)) almost surely as |S| \u2192 \u221e for values of\npaH (xi) that have positive probability under p(x|y).\nProof. For values paH (xi) that are not in the support of p(x|y), \u03b8(xi|paH (xi)) is unde\ufb01ned. For\nvalues paH (xi) in the support, \u03b8(xi|paH (xi)) \u2192 p(xi|paH (xi), y) almost surely. By de\ufb01nition,\nany node in a Bayesian network is independent of its non-descendants given its parent variables.\nThe nodes y are root nodes in H and hence do not descend from xi. Therefore, p(xi|paH (xi), y) =\np(xi|paH (xi)) and the theorem holds.\n\nTheorem 2 implies that we can use posterior samples from one observation set to learn inverses that\napply to all other observation sets\u2014while samples from p(x|y) only provide global estimates for the\ngiven posterior, it is guaranteed that the local estimates created by the procedure above are equivalent\nto the query-independent conditionals p(xi|paH (xi)). In addition, we can combine samples from\ndistributions conditioned on several different observation sets to produce more accurate estimates of\nthe inverse conditionals.\nIn the discussion above, we can replace \u03b8 with any consistent estimator of p(xi|paH (xi)). We\ncan also trade consistency for faster learning and generalization. This framework can make use of\nany supervised machine learning technique that supports sampling from a distribution on predicted\noutputs. For example, for discrete variables we can employ logistic regression, which provides fast\ngeneralization and ef\ufb01cient sampling, but cannot, in general, represent the posterior exactly. 
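As an illustration of trading consistency for faster generalization, the sketch below fits a tiny logistic-regression inverse p(x|y) by gradient descent (pure Python for self-containment; fit_logistic and all model numbers are hypothetical, not the paper's implementation):

```python
import math
import random

def fit_logistic(pairs, lr=0.5, epochs=1000):
    """Tiny logistic-regression inverse p(x=1|z) = sigmoid(w*z + b),
    fit by batch gradient descent.  A stand-in for the fast-generalizing
    (but not necessarily consistent) estimators discussed in the text."""
    w = b = 0.0
    n = len(pairs)
    for _ in range(epochs):
        gw = gb = 0.0
        for z, x in pairs:
            p = 1.0 / (1.0 + math.exp(-(w * z + b)))
            gw += (p - x) * z
            gb += (p - x)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Train on prior samples of a made-up model x -> y, inverse direction.
rng = random.Random(1)
pairs = []
for _ in range(2000):
    x = int(rng.random() < 0.3)
    y = int(rng.random() < (0.9 if x else 0.2))
    pairs.append((y, x))       # predict latent x from observed y

w, b = fit_logistic(pairs)
p_x1_given_y1 = 1.0 / (1.0 + math.exp(-(w + b)))   # true value is about 0.66
p_x1_given_y0 = 1.0 / (1.0 + math.exp(-b))         # true value is about 0.05
```

Here the two-parameter model can represent the posterior exactly; with more parents, interaction terms would be needed, as the text notes.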
Our choice of predictor can be data-dependent\u2014for example, we can add interaction terms to a logistic regression predictor as more data becomes available.\nFor continuous variables, consider a predictor based on k-nearest neighbors that produces samples as follows (Algorithm 2): Given new input values z, retrieve the k previously observed input-output pairs that are closest to the current input values. Then, use a consistent density estimator to construct a density estimate on the nearby previous outputs and sample an output xi from the estimated distribution.\n\nAlgorithm 2: K-nearest neighbor density predictor\nInput: Variable index i, inverse inputs z, samples S, number of neighbors k\nOutput: Sampled value for node xi\n1: retrieve the k nearest pairs (z(1), xi(1)), . . . , (z(k), xi(k)) in S based on distance to z\n2: construct density estimate q on xi(1), . . . , xi(k)\n3: sample from q\n\nShowing that this estimator converges to the true conditional density p(x|z) is more subtle. If the conditional densities are smooth in the sense that:\n\n\u2200\u03b5 > 0 \u2203\u03b4 > 0 : \u2200z1, z2 : d(z1, z2) < \u03b4 \u21d2 DKL(p(x|z1), p(x|z2)) < \u03b5\n\nthen we can achieve any desired accuracy of approximation by assuring that the nearest neighbors used all lie within a \u03b4-ball, but that the number of neighbors goes to in\ufb01nity. We can achieve this by increasing k slowly enough in |S|. The exact rate at which we may increase depends on the distribution and may be dif\ufb01cult to determine.\n\n4 Inverse MCMC\n\nWe have described how to compute the structure of inverse Bayes nets, and how to learn the associated conditional distributions and densities from prior and posterior samples. This produces fast, but possibly biased recognition models. 
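For instance, the k-nearest-neighbor predictor of Algorithm 2 can be sketched as follows (illustrative Python; the Gaussian-kernel jitter and fixed bandwidth are assumptions, since the text only requires some consistent density estimator):

```python
import random

def knn_density_sample(pairs, z, k, bandwidth=0.1, rng=random):
    """Sketch of the k-nearest-neighbor density predictor: retrieve the
    k stored (input, output) pairs nearest to z, then sample from a
    kernel density estimate over their outputs (Gaussian kernel with a
    fixed, assumed bandwidth)."""
    nearest = sorted(pairs, key=lambda p: abs(p[0] - z))[:k]
    x = rng.choice(nearest)[1]        # pick one neighbor's output ...
    return rng.gauss(x, bandwidth)    # ... and jitter it (a KDE draw)

# Demo: invert the made-up model y = x + noise, with x ~ N(0, 1).
rng = random.Random(2)
pairs = [(x + rng.gauss(0, 0.1), x)      # stored as (observed y, latent x)
         for x in (rng.gauss(0, 1) for _ in range(2000))]

# For y = 0.5 the exact posterior mean of x is 0.5/1.01, roughly 0.495.
draws = [knn_density_sample(pairs, 0.5, k=50, rng=rng) for _ in range(500)]
mean_est = sum(draws) / len(draws)
```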
To get a consistent estimator, we use these recognition models as part of a Metropolis-Hastings scheme that, as the amount of training data grows, converges to Gibbs sampling for proposals of size 1, to blocked-Gibbs for larger proposals, and to perfect posterior sampling for proposals of size |G|.\nWe propose the following Inverse MCMC procedure (Algorithm 3): Of\ufb02ine, use Algorithm 1 to compute an inverse graph for each latent node and train each local inverse in this graph from (posterior or prior) samples. Online, run Metropolis-Hastings with the proposal mechanism shown in Algorithm 4, which resamples a set of up to k variables using the trained inverses [1]. With little training data, we will want to make small proposals (small k) in order to achieve a reasonable acceptance rate; with more training data, we can make larger proposals and expect them to succeed.\nTheorem 3. Let G be a Bayesian network, let \u03b8 be a consistent estimator (for inverse conditionals), let {Hi}i\u22081..m be a collection of inverse graphs produced using Algorithm 1, and assume a source of training samples (prior or posterior) with full support. Then, as training set size |S| \u2192 \u221e, Inverse MCMC with proposal size k converges to block-Gibbs sampling where blocks are the last k nodes in each Hi. In particular, it converges to Gibbs sampling for proposal size k = 1 and to exact posterior sampling for k = |G|.\n\nProof. We must show that proposals are made from the conditional posterior in the limit of large training data. Fix an inverse H, and let x be the last k variables in H. Let paH(x) be the union of H-parents of variables in x that are not themselves in x. By construction according to Algorithm 1, paH(x) forms a Markov blanket of x (that is, x is conditionally independent of the other variables in G, given paH(x)). Now the conditional distribution over x factorizes along the inverse graph: p(x|paH(x)) = \u220f_{i=k}^{|H|} p(xi|paH(xi)). 
But by Theorems 1 and 2, the estimators \u03b8 converge, when they are de\ufb01ned, to the corresponding conditional distributions, \u03b8(xi|paH(xi)) \u2192 p(xi|paH(xi)); since we assume full support, \u03b8(xi|paH(xi)) is de\ufb01ned wherever p(xi|paH(xi)) is de\ufb01ned. Hence, using the estimated inverses to sequentially sample the x variables results, in the limit, in samples from the conditional distribution given the remaining variables. (Note that, in the limit, these proposals will always be accepted.) This is the de\ufb01nition of block-Gibbs sampling. The special cases of k = 1 (Gibbs) and k = |G| (posterior sampling) follow immediately.\n\n[1] In a setting where we only ever resample up to k variables, we only need to estimate the relevant inverses, i.e., not all conditionals for the full inverse graph.\n\nAlgorithm 3: Inverse MCMC\nInput: Prior or posterior samples S\nOutput: Samples x(1), . . . , x(T)\nOf\ufb02ine (train inverses):\n1: for i in 1 . . . m do\n2: Hi \u2190 from Algorithm 1\n3: for j in 1 . . . m do\n4: train inverse \u03b8S(xj|paHi(xj))\n5: end for\n6: end for\nOnline (MH with inverse proposals):\n1: for t in 1 . . . T do\n2: x\u2032, pfw, pbw from Algorithm 4\n3: x \u2190 x\u2032 with MH acceptance rule\n4: end for\n\nAlgorithm 4: Inverse MCMC proposer\nInput: State x, observations y, ordered inverse graphs {Hi}i\u22081..m, proposal size kmax, inverses \u03b8\nOutput: Proposed state x\u2032, forward and backward probabilities pfw and pbw\n1: H \u223c Uniform({Hi}i\u22081..m)\n2: k \u223c Uniform({0, 1, . . . , kmax \u2212 1})\n3: x\u2032 \u2190 x\n4: pfw, pbw \u2190 1\n5: for j in n \u2212 k, . . . , n do\n6: let xl be the jth variable in H\n7: x\u2032l \u223c \u03b8(xl|paH(x\u2032l))\n8: pfw \u2190 pfw \u00b7 p\u03b8(x\u2032l|paH(x\u2032l))\n9: pbw \u2190 pbw \u00b7 p\u03b8(xl|paH(xl))\n10: end for\n\nInstead of learning the k=1 \u201cGibbs\u201d conditionals for each inverse graph, we can often precompute these distributions to \u201cseed\u201d our sampler. This suggests a bootstrapping procedure for amortized inference on observations y(1), . . . , y(t): \ufb01rst, precompute the \u201cGibbs\u201d distributions so that k=1 proposals will be reasonably effective; then iterate between training on previously generated approximate posterior samples and doing inference on the next observation. Over time, increase the size of proposals, possibly depending on acceptance ratio or other heuristics.\nFor networks with near-deterministic dependencies, Gibbs may be unable to generate training samples of suf\ufb01cient quality. This poses a chicken-and-egg problem: we need a suf\ufb01ciently good posterior sampler to generate the data required to train our sampler. To address this problem, we propose a simple annealing scheme: We introduce a temperature parameter t that controls the extent to which (almost-)deterministic dependencies in a network are relaxed. We produce a sequence of trained samplers, one for each temperature, by generating samples for a network with temperature ti+1 using a sampler trained on approximate samples for the network with next-higher temperature ti. 
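A minimal sketch of the relaxation itself, assuming (as in the experiments of Section 5) that temperature t clamps each CPT probability into [t, 1 - t]:

```python
def relax_cpt(p, t):
    """Relax a CPT probability toward uniform by clamping it into
    [t, 1 - t]; t = 0 recovers the original near-deterministic network."""
    return min(max(p, t), 1.0 - t)

# Temperature ladder from the text; a near-deterministic entry p = 0.999
# is progressively restored as the temperature is lowered.
ladder = [0.2, 0.1, 0.05, 0.02, 0.01, 0.0]
relaxed = [relax_cpt(0.999, t) for t in ladder]
```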
Finally, we discard all samplers except for the sampler trained on the network with t = 0, the network of interest.\nIn the next section, we explore the practicality of such bootstrapping schemes as well as the general approach of Inverse MCMC.\n\n5 Experiments\n\nWe are interested in networks such that (1) there are many layers of nodes, with some nodes far removed from the evidence, (2) there are many observation nodes, allowing for a variety of queries, and (3) there are strong dependencies, making local Gibbs moves challenging.\nWe start by studying the behavior of the Inverse MCMC algorithm with the empirical frequency estimator on a 225-node rectangular grid network from the UAI 2008 inference competition. This network has binary nodes and approximately 50% deterministic dependencies, which we relax to dependencies with strength .99. We select the 15 nodes on the diagonal as observations and remove any nodes below, leaving a triangular network with 120 nodes and treewidth 15 (Figure 2). We compute the true marginals P\u2217 using IJGP (Mateescu et al., 2010), and calculate the error of our estimates Ps as\n\nerror = (1/N) \u2211_{i=1}^{N} (1/|Xi|) \u2211_{xi\u2208Xi} |P\u2217(Xi = xi) \u2212 Ps(Xi = xi)|.\n\nWe generate 20 inference tasks as sources of training samples by sampling values for the 15 observation nodes uniformly at random. We precompute the \u201c\ufb01nal\u201d inverse conditionals as outlined above, producing a Gibbs sampler when k=1. For each inference task, we use this sampler to generate 10^5 approximate posterior samples.\n\nFigure 2: Schema of the Bayes net structure used in experiment 1. Thick arrows indicate almost-deterministic dependencies, shaded nodes are observed. 
The actual network\nhas 15 layers with a total of\n120 nodes.\n\nFigure 3: The effect of train-\ning on approximate posterior\nsamples for 10 inference tasks.\nAs the number of training sam-\nples per task increases,\nIn-\nverse MCMC with proposals\nof size 20 performs new infer-\nence tasks more quickly.\n\nFigure 4: Learning an inverse\ndistribution for the brightness\nconstancy model\n(Figure 1)\nfrom prior samples using the\nKNN density predictor. More\ntraining samples result in bet-\nter estimates after the same\nnumber of MCMC steps.\n\nFigures 3 and 5 show the effect of training the frequency estimator on 10 inference tasks and testing\non a different task (averaged over 20 runs). Inverse proposals of (up to) size k=20 do worse than\npure Gibbs sampling with little training (due to higher rejection rate), but they speed convergence as\nthe number of training samples increases. More generally, large proposals are likely to be rejected\nwithout training, but improve convergence after training.\nFigure 6 illustrates how the number of inference tasks in\ufb02uences error and MH acceptance ratio in\na setting where the total number of training samples is kept constant. Surprisingly, increasing the\nnumber of training tasks from 5 to 15 has little effect on error and acceptance ratio for this network.\nThat is, it seems relatively unimportant which posterior the training samples are drawn from; we\nmay expect different results when posteriors are more sparse.\nFigure 7 shows how different sources of training data affect the quality of the trained sampler (av-\neraged over 20 runs). As the strength of near-deterministic dependencies increases, direct training\non Gibbs samples becomes infeasible. In this regime, we can still train on prior samples and on\nGibbs samples for networks with relaxed dependencies. Alternatively, we can employ the anneal-\ning scheme outlined in the previous section. 
In this example, we take the temperature ladder to be\n[.2, .1, .05, .02, .01, 0]\u2014that is, we start by learning inverses for the relaxed network where all CPT\nprobabilities are constrained to lie within [.2, .8]; we then use these inverses as proposers for MCMC\ninference on a network constrained to CPT probabilities in [.1, .9], learn the corresponding inverses,\nand continue, until we reach the network of interest (at temperature 0).\nWhile the empirical frequency estimator used in the above experiments provides an attractive asymp-\ntotic convergence guarantee (Theorem 3), it is likely to generalize slowly from small amounts of\ntraining data. For practical purposes, we may be more interested in getting useful generalizations\nquickly than converging to a perfect proposal distribution. Fortunately, the Inverse MCMC algo-\nrithm can be used with any estimator for local conditionals, consistent or not. We evaluate this idea\non a 12-node subset of the network used in the previous experiments. We learn complete inverses,\nresampling up to 12 nodes at once. We compare inference using a logistic regression estimator with\nL2 regularization (with and without interaction terms) to inference using the empirical frequency\nestimator. Figure 9 shows the error (integrated over time to better re\ufb02ect convergence speed) against\nthe number of training examples, averaged over 300 runs. The regression estimator with interaction\nterms results in signi\ufb01cantly better results when training on few posterior samples, but is ultimately\novertaken by the consistent empirical estimator.\nNext, we use the KNN density predictor to learn inverse distributions for the continuous Bayesian\nnetwork shown in Figure 1. 
To evaluate the quality of the learned distributions, we take 1000 samples using Inverse MCMC and compare marginals to a solution computed by JAGS (Plummer et al., 2003). As we re\ufb01ne the inverses using forward samples, the error in the estimated marginals decreases towards 0, providing evidence for convergence towards a posterior sampler (Figure 4).\n\nFigure 5: Without training, big inverse proposals result in high error, as they are unlikely to be accepted. As we increase the number of approximate posterior samples used to train the MCMC sampler, the acceptance probability for big proposals goes up, which decreases overall error.\n\nFigure 6: For the network under consideration, increasing the number of tasks (i.e., samples for other observations) we train on has little effect on acceptance ratio (and error) if we keep the total number of training samples constant.\n\nFigure 7: For networks without hard determinism, we can train on Gibbs samples. For others, we can use prior samples, Gibbs samples for relaxed networks, and samples from a sequence of annealed Inverse samplers.\n\nTo evaluate Inverse MCMC in more breadth, we run the algorithm on all binary Bayes nets with up to 500 nodes that have been submitted to the UAI 08 inference competition (216 networks). Since many of these networks exhibit strong determinism, we train on prior samples and apply the annealing scheme outlined above to generate approximate posterior samples. For training and testing, we use the evidence provided with each network. 
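The marginal-error measure used throughout these experiments can be sketched as follows (illustrative Python; marginal_error and the two-node example are hypothetical):

```python
def marginal_error(true_marginals, est_marginals):
    """Error in marginals as defined in Section 5: the mean over nodes
    of the mean absolute difference between true and estimated
    marginal probabilities."""
    total = 0.0
    for node, true_dist in true_marginals.items():
        est_dist = est_marginals[node]
        total += sum(abs(q - est_dist.get(v, 0.0))
                     for v, q in true_dist.items()) / len(true_dist)
    return total / len(true_marginals)

# Hypothetical two-node example: per-node errors 0.05 and 0.1.
true_m = {"a": {0: 0.25, 1: 0.75}, "b": {0: 0.5, 1: 0.5}}
est_m = {"a": {0: 0.20, 1: 0.80}, "b": {0: 0.6, 1: 0.4}}
err = marginal_error(true_m, est_m)   # (0.05 + 0.1) / 2 = 0.075
```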
We compute the error in marginals as described above for both Gibbs (proposal size 1) and Inverse MCMC (maximum proposal size 20). To summarize convergence over the 1200s of test time, we compute the area under the error curves (Figure 8). Each point represents a single run on a single model. We label different classes of networks. For the grid networks, grid-k denotes a network with k% deterministic dependencies. While performance varies across network classes\u2014with extremely deterministic networks making the acquisition of training data challenging\u2014the comparison with Gibbs suggests that learned block proposals frequently help.\nOverall, these results indicate that Inverse MCMC is of practical bene\ufb01t for learning block proposals in reasonably large Bayes nets and using a realistic amount of training data (an amount that might result from amortizing over \ufb01ve or ten inferences).\n\nFigure 8: Each mark represents a single run of a model from the UAI 08 inference competition. 
Marks below the line indicate that inte-\ngrated error over 1200s of inference is lower\nfor Inverse MCMC than Gibbs sampling.\n\nFigure 9: Integrated error (over 1s of inference)\nas a function of the number of samples used\nto train inverses, comparing logistic regression\nwith and without interaction terms to an empir-\nical frequency estimator.\n\n6 Related work\n\nA recognition network (Morris, 2001) is a multilayer perceptron used to predict posterior marginals.\nIn contrast to our work, a single global predictor is used instead of small, compositional prediction\nfunctions. By learning local inverses our technique generalizes in a more \ufb01ne-grained way, and can\nbe combined with MCMC to provide unbiased samples. Adaptive MCMC techniques such as those\npresented in Roberts and Rosenthal (2009) and Haario et al. (2006) are used to tune parameters\nof MCMC algorithms, but do not allow arbitrarily close adaptation of the underlying model to the\nposterior, whereas our method is designed to allow such close approximation. A number of adaptive\nimportance sampling algorithms have been proposed for Bayesian networks, including Shachter\nand Peot (1989), Cheng and Druzdzel (2000), Yuan and Druzdzel (2012), Yu and Van Engelen\n(2012), Hernandez et al. (1998), Salmeron et al. (2000), and Ortiz and Kaelbling (2000). These\ntechniques typically learn Bayes nets which are directed \u201cforward\u201d, which means that the conditional\ndistributions must be learned from posterior samples, creating a chicken-and-egg problem. Because\nour trained model is directed \u201cbackwards\u201d, we can learn from both prior and posterior samples.\nGibbs sampling and single-site Metropolis-Hastings are known to converge slowly in the presence\nof determinism and long-range dependencies. It is well-known that this can be addressed using block\nproposals, but such proposals typically need to be built manually for each model. 
In our framework, block proposals are learned from past samples, with a natural parameter for adjusting the block size.\n\n7 Conclusion\n\nWe have described a class of algorithms, for the setting of amortized inference, based on the idea of learning local stochastic inverses\u2014the information necessary to \u201crun a model backward\u201d. We have given simple methods for estimating and using these inverses as part of an MCMC algorithm. In exploratory experiments, we have shown how learning from past inference tasks can reduce the time required to estimate quantities of interest. Much remains to be done to explore this framework. Based on our results, one particularly promising avenue is to explore estimators that initially generalize quickly (such as regression), but back off to a sound estimator as the training data grows.\n\nAcknowledgments\n\nWe thank Ramki Gummadi and anonymous reviewers for useful comments. This work was supported by a John S. McDonnell Foundation Scholar Award.\n\nReferences\n\nJ. Cheng and M. Druzdzel. AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Arti\ufb01cial Intelligence Research, 2000.\n\nH. Haario, M. Laine, A. Mira, and E. Saksman. DRAM: ef\ufb01cient adaptive MCMC. Statistics and Computing, 16(4):339\u2013354, 2006.\n\nL. D. Hernandez, S. Moral, and A. Salmeron. A Monte Carlo algorithm for probabilistic propagation in belief networks based on importance sampling and strati\ufb01ed simulation techniques. International Journal of Approximate Reasoning, 18(1):53\u201391, 1998.\n\nB. K. Horn. 
Understanding image intensities. Arti\ufb01cial Intelligence, 8(2):201\u2013231, 1977.\n\nR. Mateescu, K. Kask, V. Gogate, and R. Dechter. Join-graph propagation algorithms. Journal of Arti\ufb01cial Intelligence Research, 37(1):279\u2013328, 2010.\n\nQ. Morris. Recognition networks for approximate inference in BN20 networks. Morgan Kaufmann Publishers Inc., Aug. 2001.\n\nL. E. Ortiz and L. P. Kaelbling. Adaptive importance sampling for estimation in structured domains. In Proc. of the 16th Ann. Conf. on Uncertainty in A.I. (UAI-00), pages 446\u2013454. Morgan Kaufmann Publishers, 2000.\n\nM. Plummer et al. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. URL http://citeseer.ist.psu.edu/plummer03jags.html, 2003.\n\nG. Roberts and J. Rosenthal. Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18(2):349\u2013367, 2009.\n\nA. Salmeron, A. Cano, and S. Moral. Importance sampling in Bayesian networks using probability trees. Computational Statistics and Data Analysis, 34(4):387\u2013413, Oct. 2000.\n\nA. N. Sanborn, V. K. Mansinghka, and T. L. Grif\ufb01ths. Reconciling intuitive physics and Newtonian mechanics for colliding objects. Psychological Review, 120(2):411, Apr. 2013.\n\nR. D. Shachter and M. A. Peot. Simulation approaches to general probabilistic inference on belief networks. In Proc. of the 5th Ann. Conf. on Uncertainty in A.I. (UAI-89), pages 311\u2013318, New York, NY, 1989. Elsevier Science.\n\nK. Watanabe and S. Shimojo. When sound affects vision: effects of auditory grouping on visual motion perception. Psychological Science, 12(2):109\u2013116, 2001.\n\nD. Wingate and T. Weber. Automated variational inference in probabilistic programming. arXiv preprint arXiv:1301.1299, 2013.\n\nH. Yu and R. A. Van Engelen. Refractor importance sampling. arXiv preprint arXiv:1206.3295, 2012.\n\nC. Yuan and M. J. Druzdzel. 
Importance sampling in Bayesian networks: An in\ufb02uence-based approximation strategy for importance functions. arXiv preprint arXiv:1207.1422, 2012.\n", "award": [], "sourceid": 1391, "authors": [{"given_name": "Andreas", "family_name": "Stuhlm\u00fcller", "institution": "MIT"}, {"given_name": "Jessica", "family_name": "Taylor", "institution": "Stanford University"}, {"given_name": "Noah", "family_name": "Goodman", "institution": "Stanford University"}]}