{"title": "AIDE: An algorithm for measuring the accuracy of probabilistic inference algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 3000, "page_last": 3010, "abstract": "Approximate probabilistic inference algorithms are central to many fields. Examples include sequential Monte Carlo inference in robotics, variational inference in machine learning, and Markov chain Monte Carlo inference in statistics. A key problem faced by practitioners is measuring the accuracy of an approximate inference algorithm on a specific data set. This paper introduces the auxiliary inference divergence estimator (AIDE), an algorithm for measuring the accuracy of approximate inference algorithms. AIDE is based on the observation that inference algorithms can be treated as probabilistic models and the random variables used within the inference algorithm can be viewed as auxiliary variables. This view leads to a new estimator for the symmetric KL divergence between the approximating distributions of two inference algorithms. The paper illustrates application of AIDE to algorithms for inference in regression, hidden Markov, and Dirichlet process mixture models. The experiments show that AIDE captures the qualitative behavior of a broad class of inference algorithms and can detect failure modes of inference algorithms that are missed by standard heuristics.", "full_text": "AIDE: An algorithm for measuring the accuracy of\n\nprobabilistic inference algorithms\n\nMarco F. Cusumano-Towner\nProbabilistic Computing Project\n\nMassachusetts Institute of Technology\n\nmarcoct@mit.edu\n\nVikash K. Mansinghka\n\nProbabilistic Computing Project\n\nMassachusetts Institute of Technology\n\nvkm@mit.edu\n\nAbstract\n\nApproximate probabilistic inference algorithms are central to many \ufb01elds. Exam-\nples include sequential Monte Carlo inference in robotics, variational inference\nin machine learning, and Markov chain Monte Carlo inference in statistics. 
A\nkey problem faced by practitioners is measuring the accuracy of an approximate\ninference algorithm on a speci\ufb01c data set. This paper introduces the auxiliary\ninference divergence estimator (AIDE), an algorithm for measuring the accuracy of\napproximate inference algorithms. AIDE is based on the observation that inference\nalgorithms can be treated as probabilistic models and the random variables used\nwithin the inference algorithm can be viewed as auxiliary variables. This view leads\nto a new estimator for the symmetric KL divergence between the approximating\ndistributions of two inference algorithms. The paper illustrates application of AIDE\nto algorithms for inference in regression, hidden Markov, and Dirichlet process\nmixture models. The experiments show that AIDE captures the qualitative behavior\nof a broad class of inference algorithms and can detect failure modes of inference\nalgorithms that are missed by standard heuristics.\n\n1\n\nIntroduction\n\nApproximate probabilistic inference algorithms are central to diverse disciplines, including statistics,\nrobotics, machine learning, and arti\ufb01cial intelligence. Popular approaches to approximate inference\ninclude sequential Monte Carlo, variational inference, and Markov chain Monte Carlo. A key problem\nfaced by practitioners is measuring the accuracy of an approximate inference algorithm on a speci\ufb01c\ndata set. The accuracy is in\ufb02uenced by complex interactions between the speci\ufb01c data set in question,\nthe model family, the algorithm tuning parameters such as the number of iterations, and any associated\nproposal distributions and/or approximating variational family. Unfortunately, practitioners assessing\nthe accuracy of inference have to rely on heuristics that are either brittle or specialized for one type\nof algorithm [1], or both. 
For example, log marginal likelihood estimates can be used to assess the accuracy of sequential Monte Carlo and variational inference, but these estimates can fail to significantly penalize an algorithm for missing a posterior mode. Expectations of probe functions do not assess the full approximating distribution, and they require design specific to each model.

This paper introduces an algorithm for estimating the symmetrized KL divergence between the output distributions of a broad class of exact and approximate inference algorithms. The key idea is that inference algorithms can be treated as probabilistic models and the random variables used within the inference algorithm can be viewed as latent variables. We show how sequential Monte Carlo, Markov chain Monte Carlo, rejection sampling, and variational inference can be represented in a common mathematical formalism based on two new concepts: generative inference models and meta-inference algorithms. Using this framework, we introduce the Auxiliary Inference Divergence Estimator (AIDE), which estimates the symmetrized KL divergence between the output distributions

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

[Figure 1 diagram: a gold-standard inference algorithm (Ng inference runs, Mg meta-inference runs) and a target inference algorithm, the algorithm being measured (Nt inference runs, Mt meta-inference runs), feed into AIDE, the Auxiliary Inference Divergence Estimator, which returns a symmetrized KL divergence estimate D̂ ≈ DKL(gold-standard ‖ target) + DKL(target ‖ gold-standard).]

Figure 1: Using AIDE to estimate the accuracy of a target inference algorithm relative to a gold-standard inference algorithm. 
AIDE is a Monte Carlo estimator of the symmetrized Kullback-Leibler (KL) divergence between the output distributions of two inference algorithms. AIDE uses meta-inference: inference over the internal random choices made by an inference algorithm.

Figure 2: AIDE applies to SMC, variational, and MCMC algorithms. Left: AIDE estimates for SMC converge to zero, as expected. Right: AIDE estimates for variational inference converge to a nonzero asymptote that depends on the variational family. Middle: The symmetrized divergence between MH and the posterior converges to zero, but AIDE over-estimates the divergence in expectation. Although increasing the number of meta-inference runs Mt reduces the bias of AIDE, AIDE is not yet practical for measuring MH accuracy due to inaccurate meta-inference for MH.

of two inference algorithms that have both been endowed with a meta-inference algorithm. We also show that the conditional SMC update of Andrieu et al. [2] and the reverse AIS Markov chain of Grosse et al. [3] are both special cases of a 'generalized conditional SMC update', which we use as a canonical meta-inference algorithm for SMC. AIDE is a practical tool for measuring the accuracy of SMC and variational inference algorithms relative to gold-standard inference algorithms. Note that this paper does not provide a practical solution to the MCMC convergence diagnosis problem. Although in principle AIDE can be applied to MCMC, to do so in practice will require more accurate meta-inference algorithms for MCMC to be developed.

2 Background

Consider a generative probabilistic model with latent variables X and observed variables Y. We denote assignments to these variables by x ∈ X and y ∈ Y. Let p(x, y) denote the joint distribution of the generative model. 
The posterior distribution is p(x|y) := p(x, y)/p(y), where p(y) = ∑_x p(x, y) is the marginal likelihood, or 'evidence'.

Sampling-based approximate inference strategies including Markov chain Monte Carlo (MCMC, [4, 5]), sequential Monte Carlo (SMC, [6]), annealed importance sampling (AIS, [7]) and importance sampling with resampling (SIR, [8, 9]), generate samples of the latent variables that are approximately distributed according to p(x|y). Use of a sampling-based inference algorithm is often motivated by theoretical guarantees of exact convergence to the posterior in the limit of infinite computation (e.g. number of transitions in a Markov chain, number of importance samples in SIR). However, how well the sampling distribution approximates the posterior distribution for finite computation is typically difficult to analyze theoretically or estimate empirically with confidence.

Variational inference [10] explicitly minimizes the approximation error of the approximating distribution qθ(x) over parameters θ of a variational family. The error is usually quantified using the Kullback-Leibler (KL) divergence from the approximation qθ(x) to the posterior p(x|y), denoted DKL(qθ(x) ‖ p(x|y)). Unlike sampling-based approaches, variational inference does not generally give exact results for infinite computation because the variational family does not include the posterior. Minimizing the KL divergence is performed by maximizing the 'evidence lower bound' (ELBO) L = log p(y) − DKL(qθ(x) ‖ p(x|y)) over θ. 
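To make the ELBO concrete, the following sketch applies simple Monte Carlo to a toy conjugate Gaussian model of our own choosing (not one of the paper's examples), where log p(y), the posterior, and the KL divergence are all available in closed form, and checks the identity log p(y) − L = DKL(qθ(x) ‖ p(x|y)):

```python
import math
import random

random.seed(0)

# Toy conjugate model (an illustrative assumption, not from the paper):
#   x ~ Normal(0, 1),  y | x ~ Normal(x, 1)
# so p(y) = Normal(y; 0, 2) and p(x|y) = Normal(y/2, 1/2) in closed form.
y = 1.3

def log_normal(z, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (z - mean) ** 2 / (2 * var)

def log_joint(x):
    return log_normal(x, 0.0, 1.0) + log_normal(y, x, 1.0)

def elbo_estimate(m, v, n=100_000):
    """Simple Monte Carlo estimate of L = E_{x~q}[log p(x, y) - log q(x)]."""
    total = 0.0
    for _ in range(n):
        x = random.gauss(m, math.sqrt(v))
        total += log_joint(x) - log_normal(x, m, v)
    return total / n

def kl_gauss(m, v, m2, v2):
    """Exact KL divergence between Gaussians N(m, v) and N(m2, v2)."""
    return 0.5 * (math.log(v2 / v) + (v + (m - m2) ** 2) / v2 - 1)

log_py = log_normal(y, 0.0, 2.0)   # exact log marginal likelihood
m, v = 0.2, 0.8                    # a deliberately imperfect variational approximation
L = elbo_estimate(m, v)
print(L, log_py, log_py - L, kl_gauss(m, v, y / 2, 0.5))
```

Because the gap log p(y) − L equals the KL divergence exactly, the Monte Carlo estimate of L recovers the (normally unknowable) approximation error here only because the toy model makes log p(y) tractable.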
Since log p(y) is usually unknown, the actual error\n(the KL divergence) of a variational approximation is also unknown.\n\n3 Estimating the symmetrized KL divergence between inference algorithms\n\nThis section de\ufb01nes our mathematical formalism for analyzing inference algorithms; shows how\nto represent SMC, MCMC, rejection sampling, and variational inference in this formalism; and\nintroduces the Auxiliary Inference Divergence Estimator (AIDE), an algorithm for estimating the\nsymmetrized KL divergence between two inference algorithms.\n\n3.1 Generative inference models and meta-inference algorithms\n\nWe de\ufb01ne an inference algorithm as a procedure that returns a single approximate posterior sample.\nRepeated runs of the algorithm give independent samples. The algorithm has an \u2018output distribution\u2019\nq(x) that gives the probability of returning x. Note that the dependence of q(x) on the observations\ny that de\ufb01ne the inference problem is suppressed in the notation. The algorithm is accurate when\nq(x) \u2248 p(x|y) for all x. We denote a sample returned from the algorithm by x \u223c q(x).\nA naive simple Monte Carlo estimator of the KL divergence between the output distributions of two\ninference algorithms requires evaluating output probabilities for both algorithms. However, it is\ntypically intractable to compute output probabilities for sampling-based inference algorithms like\nMCMC and SMC, because that would require marginalizing over all possible values that the random\nvariables drawn during the algorithm could possibly take. A similar dif\ufb01culty arises when computing\nthe marginal likelihood p(y) of a generative probabilistic model p(x, y). 
This suggests that we treat the inference algorithm as a generative model, estimate its output probabilities using ideas from marginal likelihood estimation, and use these estimates in a Monte Carlo estimator of the divergence. We begin by making the analogy between an inference algorithm and a generative model explicit:

Definition 3.1 (Generative inference model). A generative inference model is a tuple (U, X, q) where q(u, x) is a joint distribution defined on U × X. A generative inference model models an inference algorithm if the output probability of the inference algorithm is the marginal likelihood q(x) = ∑_u q(u, x) of the model for all x. An element u ∈ U represents a complete assignment to the internal random variables within the inference algorithm, and is called a 'trace'. The ability to simulate from q(u, x) is required, but the ability to compute the probability q(u, x) is not. A simulation, denoted u, x ∼ q(u, x), may be obtained by running the inference algorithm and recording the resulting trace u and output x.1

A generative inference model can be understood as a generative probabilistic model where the u are the latent variables and the x are the observations. Note that two different generative inference models may use different representations for the internal random variables of the same inference algorithm. In practice, constructing a generative inference model from an inference algorithm amounts to defining the set of internal random variables. For marginal likelihood estimation in a generative inference model, we use a 'meta-inference' algorithm:

Definition 3.2 (Meta-inference algorithm). 
For a given generative inference model (U, X, q), a meta-inference algorithm is a tuple (r, ξ) where r(u; x) is a distribution on traces u ∈ U of the inference algorithm, indexed by outputs x ∈ X of the inference algorithm, and where ξ(u, x) is the following function of u and x for some Z > 0:

ξ(u, x) := Z · q(u, x)/r(u; x)    (1)

We require the ability to sample u ∼ r(u; x) given a value for x, and the ability to evaluate ξ(u, x) given u and x. We call a procedure for sampling from r(u; x) a 'meta-inference sampler'. We do not require the ability to evaluate the probability r(u; x). A meta-inference algorithm is considered accurate for a given x if r(u; x) ≈ q(u|x) for all u. Conceptually, a meta-inference sampler tries to answer the question 'how could my inference algorithm have produced this output x?' Note that if it is tractable to evaluate the marginal likelihood q(x) of the generative inference model up to a normalizing constant, then it is not necessary to represent internal random variables for the inference algorithm, and a generative inference model can define the trace as an empty token u = () with U = {()}. In this case, the meta-inference algorithm has r(u; x) = 1 for all x and ξ(u, x) = Zq(x).

1The trace data structure could in principle be obtained by writing the inference algorithm in a probabilistic programming language like Church [11], but the computational overhead would be high.

3.2 Examples

We now show how to construct generative inference models and corresponding meta-inference algorithms for SMC, AIS, MCMC, SIR, rejection sampling, and variational inference. The meta-inference algorithms for AIS, MCMC, and SIR are derived as special cases of a generic SMC meta-inference algorithm.

Sequential Monte Carlo. We consider a general class of SMC samplers introduced by Del Moral et al. 
[6], which can be used for approximate inference in both sequential state-space and non-sequential models. We briefly summarize a slightly restricted variant of the algorithm here, and refer the reader to the supplement and Del Moral et al. [6] for full details. The SMC algorithm propagates P weighted particles through T steps, using proposal kernels k_t and multinomial resampling based on weight functions w_1(x_1) and w_t(x_{t−1}, x_t) for t > 1 that are defined in terms of 'backwards kernels' ℓ_t for t = 2…T. Let x_t^i, w_t^i and W_t^i denote the value, unnormalized weight, and normalized weight of particle i at time t, respectively. We define the output sample x of SMC as a single draw from the particle approximation at the final time step, which is obtained by sampling a particle index I_T ∼ Categorical(W_T^{1:P}), where W_T^{1:P} denotes the vector of weights (W_T^1, …, W_T^P), and then setting x ← x_T^{I_T}. The generative inference model uses traces of the form u = (x, a, I_T), where x contains the values of all particles at all time steps and where a (for 'ancestor') contains the index a_t^i ∈ {1…P} of the parent of particle x_{t+1}^i for each particle i and each time step t = 1…T−1. Algorithm 1 defines a canonical meta-inference sampler for this generative inference model that takes as input a latent sample x and generates an SMC trace u ∼ r(u; x) as output. The meta-inference sampler first generates an ancestral trajectory of particles (x_1^{I_1}, x_2^{I_2}, …, x_T^{I_T}) that terminates in the output sample x, by sampling sequentially from the backward kernels ℓ_t, starting from x_T^{I_T} = x. Next, it runs a conditional SMC update [2] conditioned on the ancestral trajectory. For this choice of r(u; x) and for Z = 1, the function ξ(u, x) is closely related to the marginal likelihood estimate p̂(y) produced by the SMC scheme:2 ξ(u, x) = p(x, y)/p̂(y). 
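As a sanity check on these definitions, the following sketch implements the T = 1 (SIR) special case on a discrete toy problem; the model, proposal, and particle count are our own illustrative choices. It verifies two identities used below: the SMC estimate p̂(y) is unbiased for p(y), and for traces u ∼ r(u; x) the quantity ξ(u, x) = p(x, y)/p̂(y) is unbiased for the output probability q(x):

```python
import random

random.seed(1)

# Discrete toy inference problem (our own illustrative choice, not from the paper):
# the latent x takes values {0, 1, 2}; p_joint[x] stores p(x, y) at the fixed observed y.
p_joint = {0: 0.30, 1: 0.10, 2: 0.05}
p_y = sum(p_joint.values())                # p(y) = sum over x of p(x, y)
proposal = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}  # importance proposal k_1
P = 3                                      # number of particles

def run_sir():
    """One SIR run (SMC with T = 1): returns (output x, marginal likelihood estimate)."""
    xs = random.choices(list(proposal), weights=list(proposal.values()), k=P)
    ws = [p_joint[x] / proposal[x] for x in xs]
    phat = sum(ws) / P                     # SMC marginal likelihood estimate
    i = random.choices(range(P), weights=ws, k=1)[0]
    return xs[i], phat

def sample_xi(x):
    """Canonical meta-inference for SIR: sample a trace u ~ r(u; x) consistent with
    output x, and return the corresponding ξ(u, x) = p(x, y) / p̂(y) (here Z = 1)."""
    i = random.randrange(P)                # uniformly chosen output slot
    xs = random.choices(list(proposal), weights=list(proposal.values()), k=P)
    xs[i] = x                              # plant the output in that slot
    phat = sum(p_joint[z] / proposal[z] for z in xs) / P
    return p_joint[x] / phat

n = 100_000
mean_phat = sum(run_sir()[1] for _ in range(n)) / n           # should approach p(y)
q0_empirical = sum(run_sir()[0] == 0 for _ in range(n)) / n   # SIR output probability q(0)
mean_xi0 = sum(sample_xi(0) for _ in range(n)) / n            # should approach q(0)
print(mean_phat, p_y, q0_empirical, mean_xi0)
```

The second check is the property AIDE relies on: averaging ξ(u, x) over meta-inference runs gives an unbiased estimate of the (intractable) output probability q(x), up to the constant Z.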
See supplement for derivation.

Annealed importance sampling. When a single particle is used (P = 1), and when each forward kernel k_t satisfies detailed balance for some intermediate distribution, the SMC algorithm simplifies to annealed importance sampling (AIS, [7]), and the canonical SMC meta-inference sampler (Algorithm 1) consists of running the forward kernels in reverse order, as in the reverse annealing algorithm of Grosse et al. [3, 12]. The canonical meta-inference algorithm is accurate (r(u; x) ≈ q(u|x)) if the AIS Markov chain is kept close to equilibrium at all times. This is achieved if the intermediate distributions form a sufficiently fine-grained sequence. See supplement for analysis.

Markov chain Monte Carlo. We define each run of an MCMC algorithm as producing a single output sample x that is the iterate of the Markov chain produced after a predetermined number of burn-in steps has passed. We also assume that each MCMC transition operator satisfies detailed balance with respect to the posterior p(x|y). Then, this is formally a special case of AIS. However, unless the Markov chain was initialized near the posterior p(x|y), the chain will be far from equilibrium during the burn-in period, and the AIS meta-inference algorithm will be inaccurate.

2AIDE also applies to approximate inference algorithms for undirected probabilistic models; the marginal likelihood estimate is replaced with the estimate of the partition function.

Algorithm 1 Generalized conditional SMC (a canonical meta-inference sampler for SMC)
Require: Latent sample x, SMC parameters
  I_T ∼ Uniform(1…P)
  x_T^{I_T} ← x
  for t ← T−1…1 do
    I_t ∼ Uniform(1…P)
    x_t^{I_t} ∼ ℓ_{t+1}(·; x_{t+1}^{I_{t+1}})    ▷ Sample from backward kernel
  for i ← 1…P do
    if i ≠ I_1 then x_1^i ∼ k_1(·)
    w_1^i ← w_1(x_1^i)
  for t ← 2…T do
    W_{t−1}^{1:P} ← w_{t−1}^{1:P}/(∑_{i=1}^{P} w_{t−1}^i)
    for i ← 1…P do
      if i = I_t then
        a_{t−1}^i ← I_{t−1}
      else
        a_{t−1}^i ∼ Categorical(W_{t−1}^{1:P})
        x_t^i ∼ k_t(·; x_{t−1}^{a_{t−1}^i})
      w_t^i ← w_t(x_{t−1}^{a_{t−1}^i}, x_t^i)
  u ← (x, a, I_T)    ▷ Return an SMC trace
  return u

[Figure accompanying Algorithm 1: an example with T = 3 and P = 4 particles, showing the ancestral trajectory (members x_t^{I_t}, here with I_1 = 1, I_2 = 3, I_3 = 2) selected by the backward kernels ℓ_2 and ℓ_3, terminating in the latent sample x that is the input to the meta-inference sampler.]

Importance sampling with resampling. Importance sampling with resampling, or SIR [8], can be seen as a special case of SMC if we set the number of steps to one (T = 1). The trace of the SIR algorithm is then the set of particles x_1^i for i ∈ {1, …, P} and the output particle index I_1. Given output sample x, the canonical SMC meta-inference sampler then simply samples I_1 ∼ Uniform(1…P), sets x_1^{I_1} ← x, and samples the other P − 1 particles from the importance distribution k_1(x).

Rejection sampling. To model a rejection sampler for a posterior distribution p(x|y), we assume it is tractable to evaluate the unnormalized posterior probability p(x, y). We define U = {()} as described in Section 3.1. For meta-inference, we define Z = p(y) so that ξ(u, x) = p(y)p(x|y) = p(x, y). It is not necessary to represent the internal random variables of the rejection sampler.

Variational inference. We suppose a variational approximation qθ(x) has been computed through optimization over the variational parameters θ. 
We assume that it is possible to sample from the variational approximation, and evaluate its normalized probability distribution. Then, we use U = {()} and Z = 1 and ξ(u, x) = qθ(x). This formulation also applies to amortized variational inference algorithms, which reuse the parameters θ for inference across observation contexts y.

3.3 The auxiliary inference divergence estimator

Consider a probabilistic model p(x, y), a set of observations y, and two inference algorithms that approximate p(x|y). One of the two inference algorithms is considered the 'gold-standard', and has a generative inference model (U, X, qg) and a meta-inference algorithm (rg, ξg). The second algorithm is considered the 'target' algorithm, with a generative inference model (V, X, qt) (we denote a trace of the target algorithm by v ∈ V), and a meta-inference algorithm (rt, ξt). This section shows how to estimate an upper bound on the symmetrized KL divergence between qg(x) and qt(x), which is:

DKL(qg(x) ‖ qt(x)) + DKL(qt(x) ‖ qg(x)) = E_{x∼qg(x)}[log(qg(x)/qt(x))] + E_{x∼qt(x)}[log(qt(x)/qg(x))]    (2)

We take a Monte Carlo approach. Simple Monte Carlo applied to Equation (2) requires that qg(x) and qt(x) can be evaluated, which would prevent the estimator from being used when either inference algorithm is sampling-based. 
Algorithm 2 gives the Auxiliary Inference Divergence Estimator (AIDE), an estimator of the symmetrized KL divergence that only requires evaluation of ξg(u, x) and ξt(v, x) and not qg(x) or qt(x), permitting its use with sampling-based inference algorithms.

Algorithm 2 Auxiliary Inference Divergence Estimator (AIDE)
Require: Gold-standard inference model and meta-inference algorithm (U, X, qg) and (rg, ξg)
Require: Target inference model and meta-inference algorithm (V, X, qt) and (rt, ξt)
Require: Number of runs of gold-standard algorithm Ng
Require: Number of runs of meta-inference sampler for gold-standard Mg
Require: Number of runs of target algorithm Nt
Require: Number of runs of meta-inference sampler for target Mt
  for n ← 1…Ng do
    u_{n,1}, x_n ∼ qg(u, x)    ▷ Run gold-standard algorithm, record trace u_{n,1} and output x_n
    for m ← 2…Mg do
      u_{n,m} ∼ rg(u; x_n)    ▷ Run meta-inference sampler for gold-standard algorithm, on input x_n
    for m ← 1…Mt do
      v_{n,m} ∼ rt(v; x_n)    ▷ Run meta-inference sampler for target algorithm, on input x_n
  for n ← 1…Nt do
    v′_{n,1}, x′_n ∼ qt(v, x)    ▷ Run target algorithm, record trace v′_{n,1} and output x′_n
    for m ← 2…Mt do
      v′_{n,m} ∼ rt(v; x′_n)    ▷ Run meta-inference sampler for target algorithm, on input x′_n
    for m ← 1…Mg do
      u′_{n,m} ∼ rg(u; x′_n)    ▷ Run meta-inference sampler for gold-standard algorithm, on input x′_n
  D̂ ← (1/Ng) ∑_{n=1}^{Ng} log[((1/Mg) ∑_{m=1}^{Mg} ξg(u_{n,m}, x_n)) / ((1/Mt) ∑_{m=1}^{Mt} ξt(v_{n,m}, x_n))]
     + (1/Nt) ∑_{n=1}^{Nt} log[((1/Mt) ∑_{m=1}^{Mt} ξt(v′_{n,m}, x′_n)) / ((1/Mg) ∑_{m=1}^{Mg} ξg(u′_{n,m}, x′_n))]
  return D̂    ▷ D̂ is an estimate of DKL(qg(x) ‖ qt(x)) + DKL(qt(x) ‖ qg(x))

The generic AIDE algorithm above is defined in terms of abstract generative inference models and meta-inference algorithms. For concreteness, the supplement contains the AIDE algorithm specialized to the case when the gold-standard is AIS and the target is a variational approximation.

Theorem 1. The estimate D̂ produced by AIDE is an upper bound on the symmetrized KL divergence in expectation, and the expectation is nonincreasing in AIDE parameters Mg and Mt.

See supplement for proof. Briefly, AIDE estimates an upper bound on the symmetrized divergence in expectation because it uses unbiased estimates of qt(x_n) and qg(x_n)^{−1} for x_n ∼ qg(x), and unbiased estimates of qg(x′_n) and qt(x′_n)^{−1} for x′_n ∼ qt(x). 
For Mg = 1 and Mt = 1, AIDE over-estimates the true symmetrized divergence by:

E[D̂] − (DKL(qg(x) ‖ qt(x)) + DKL(qt(x) ‖ qg(x)))
  = E_{x∼qg(x)}[DKL(qg(u|x) ‖ rg(u; x)) + DKL(rt(v; x) ‖ qt(v|x))]
  + E_{x∼qt(x)}[DKL(qt(v|x) ‖ rt(v; x)) + DKL(rg(u; x) ‖ qg(u|x))]    (3)

(the right-hand side is the bias of AIDE for Mg = Mt = 1). Note that this expression involves KL divergences between the meta-inference sampling distributions (rg(u; x) and rt(v; x)) and the posteriors in their respective generative inference models (qg(u|x) and qt(v|x)). Therefore, the approximation error of meta-inference determines the bias of AIDE. When both meta-inference algorithms are exact (rg(u; x) = qg(u|x) for all u and x and rt(v; x) = qt(v|x) for all v and x), AIDE is unbiased. As Mg or Mt are increased, the bias decreases (see Figure 2 and Figure 4 for examples). If the generative inference model for one of the algorithms does not use a trace (i.e. U = {()} or V = {()}), then that algorithm does not contribute a KL divergence term to the bias in Equation (3). The analysis of AIDE is equivalent to that of Grosse et al. [12] when the target algorithm is AIS and Mt = Mg = 1 and the gold-standard inference algorithm is a rejection sampler.

4 Related Work

Diagnosing the convergence of approximate inference is a long-standing problem. Most existing work is either tailored to specific inference algorithms [13], designed to detect lack of exact convergence [1], or both. Estimators of the non-asymptotic approximation error of general approximate inference

Figure 3: AIDE detects when an inference algorithm misses a posterior mode. Left: A bimodal posterior density, with kernel estimates of the output densities of importance sampling with resampling (SIR) using two proposals. 
The \u2018broad\u2019 proposal (blue) covers both modes, and the \u2018offset\u2019 proposal\n(pink) misses the \u2018L\u2019 mode. Middle: AIDE detects the missing mode in offset-proposal SIR. Right:\nLog marginal likelihood estimates suggest that the offset-proposal SIR is nearly converged.\n\nalgorithms have received less attention. Gorham and Mackey [14] propose an approach that applies\nto arbitrary sampling algorithms but relies on special properties of the posterior distribution such as\nlog-concavity. Our approach does not rely on special properties of the posterior distribution.\nOur work is most closely related to Bounding Divergences with REverse Annealing (BREAD, [12])\nwhich also estimates upper bounds on the symmetric KL divergence between the output distribution\nof a sampling algorithm and the posterior distribution. AIDE differs from BREAD in two ways: First,\nwhereas BREAD handles single-particle SMC samplers and annealed importance sampling (AIS),\nAIDE handles a substantially broader family of inference algorithms including SMC samplers with\nboth resampling and rejuvenation steps, AIS, variational inference, and rejection samplers. Second,\nBREAD estimates divergences between the target algorithm\u2019s sampling distribution and the posterior\ndistribution, but the exact posterior samples necessary for BREAD\u2019s theoretical properties are only\nreadily available when the observations y that de\ufb01ne the inference problem are simulated from the\ngenerative model. Instead, AIDE estimates divergences against an exact or approximate gold-standard\nsampler on real (non-simulated) inference problems. Unlike BREAD, AIDE can be used to evaluate\ninference in both generative and undirected models.\nAIDE estimates the error of sampling-based inference using a mathematical framework with roots\nin variational inference. Several recent works have treated sampling-based inference algorithms as\nvariational approximations. 
The Monte Carlo Objective (MCO) formalism of Maddison et al. [15]\nis closely related to our formalism of generative inference models and meta-inference algorithms\u2014\nindeed a generative inference model and a meta-inference algorithm with Z = 1 give an MCO de\ufb01ned\nby: L(y, p) = Eu,x\u223cq(u,x)[log(p(x, y)/\u03be(u, x))], where y denotes observed data. In independent\nand concurrent work to our own, Naesseth et al. [16], Maddison et al. [15] and Le et al. [17] treat\nSMC as a variational approximation using constructions similar to ours. In earlier work, Salimans\net al. [18] recognized that MCMC samplers can be treated as variational approximations. However,\nthese works are concerned with optimization of variational objective functions instead of estimation\nof KL divergences, and do not involve generating a trace of a sampler from its output.\n\n5 Experiments\n\n5.1 Comparing the bias of AIDE for different types of inference algorithms\nWe used a Bayesian linear regression inference problem where exact posterior sampling is tractable\nto characterize the bias of AIDE when applied to three different types of target inference algorithms:\nsequential Monte Carlo (SMC), Metropolis-Hastings (MH), and variational inference. For the gold-\nstandard algorithm we used a posterior sampler with a tractable output distribution qg(x), which does\nnot introduce bias into AIDE, so that AIDE\u2019s bias could be completely attributed to the approximation\nerror of meta-inference for each target algorithm. Figure 2 shows the results. 
The bias of AIDE is acceptable for SMC, and AIDE is unbiased for variational inference, but better meta-inference algorithms for MCMC are needed to make AIDE practical for estimating the accuracy of MH.

5.2 Evaluating approximate inference in a Hidden Markov model

We applied AIDE to measure the approximation error of SMC algorithms for posterior inference in a Hidden Markov model (HMM). Because exact posterior inference in this HMM is tractable via dynamic programming, we used this opportunity to compare AIDE estimates obtained using the exact posterior as the gold-standard with AIDE estimates obtained using a 'best-in-class' SMC algorithm as the gold-standard. Figure 4 shows the results, which indicate AIDE estimates using an approximate gold-standard algorithm can be nearly identical to AIDE estimates obtained with an exact posterior gold-standard.

[Figure 4 panels: ground truth states and posterior marginals; target algorithms include SMC with prior proposal (1 and 10 particles) and SMC with optimal proposal (100 particles); accuracy of the target algorithms is measured both using the posterior as gold-standard and using an SMC gold-standard (optimal proposal, 1000 particles).]

Figure 4: Comparing use of an exact posterior as the gold-standard and a 'best-in-class' approximate algorithm as the gold-standard, when measuring accuracy of target inference algorithms with AIDE. We consider inference in an HMM, so that exact posterior sampling is tractable using dynamic programming. 
Left: Ground truth latent states, posterior marginals, and marginals of the output of a gold-standard and three target SMC algorithms (A, B, C) for a particular observation sequence. Right: AIDE estimates using the exact gold-standard and using the SMC gold-standard are nearly identical. The estimated divergence bounds decrease as the number of particles in the target sampler increases. The optimal proposal outperforms the prior proposal. Increasing M_t tightens the estimated divergence bounds. We used M_g = 1.

Figure 5: Contrasting AIDE with a heuristic convergence diagnostic for evaluating the accuracy of approximate inference in a Dirichlet process mixture model (DPMM). The heuristic compares the expected number of clusters under the target algorithm to the expectation under the gold-standard algorithm [19]. White circles identify single-particle likelihood-weighting, which samples from the prior. AIDE clearly indicates that single-particle likelihood-weighting is inaccurate, but the heuristic suggests it is accurate. Probe functions like the expected number of clusters can be error-prone measures of convergence because they only track convergence along a specific projection of the distribution. In contrast, AIDE estimates a joint KL divergence. Shaded areas in both plots show the standard error.
The amount of target inference computation used is the same for the two techniques, although AIDE performs a gold-standard meta-inference run for each target inference run.

[Figure 5 legend: SMC with prior proposal (0 rejuvenation sweeps), SMC with optimal proposal (0 and 4 rejuvenation sweeps), gold-standard, and likelihood-weighting with 1 particle.]

5.3 Comparing AIDE to alternative inference evaluation techniques

A key feature of AIDE is that it applies to different types of inference algorithms. We compared AIDE to two existing techniques for evaluating the accuracy of inference algorithms that share this feature: (1) comparing log marginal likelihood (LML) estimates made by a target algorithm against LML estimates made by a gold-standard algorithm, and (2) comparing the expectation of a probe function under the approximating distribution to the same expectation under the gold-standard distribution [19]. Figure 3 shows a comparison of AIDE to LML on an inference problem where the posterior is bimodal. Figure 5 shows a comparison of AIDE to a 'number of clusters' probe function on a Dirichlet process mixture model inference problem for a synthetic data set. We also used AIDE to evaluate the accuracy of several SMC algorithms for DPMM inference on a real data set of galaxy velocities [20], relative to an SMC gold-standard.
This experiment is described in the supplement due to space constraints.

6 Discussion

AIDE makes it practical to estimate bounds on the error of a broad class of approximate inference algorithms, including sequential Monte Carlo (SMC), annealed importance sampling (AIS), sampling importance resampling (SIR), and variational inference. AIDE's reliance on a gold-standard inference algorithm raises two questions that merit discussion:

If we already had an acceptable gold-standard, why would we want to evaluate other inference algorithms? Gold-standard algorithms such as very long MCMC runs, SMC runs with hundreds of thousands of particles, or AIS runs with a very fine annealing schedule are often too slow to use in production. AIDE makes it possible to use gold-standard algorithms during an offline design and evaluation phase to quantitatively answer questions like "how few particles or rejuvenation steps or samples can I get away with?" or "is my fast variational approximation good enough?". AIDE can thus help practitioners confidently apply Monte Carlo techniques in challenging, performance-constrained applications, such as probabilistic robotics or web-scale machine learning. In future work we think it will be valuable to build probabilistic models of AIDE estimates, conditioned on features of the data set, to learn offline which problem instances are easy or hard for different inference algorithms. This may help practitioners bridge the gap between offline evaluation and production more rigorously.

How do we ensure that the gold-standard is accurate enough for the comparison with it to be meaningful? This is an intrinsically hard problem: we are not sure that near-exact posterior inference is really feasible for most interesting classes of models.
In practice, we think that gold-standard\ninference algorithms will be calibrated based on a mix of subjective assumptions and heuristic\ntesting\u2014much like models themselves are tested. For example, users could initially build con\ufb01dence\nin a gold-standard algorithm by estimating the symmetric KL divergence from the posterior on\nsimulated data sets (following the approach of Grosse et al. [12]), and then use AIDE with the trusted\ngold-standard for a focused evaluation of target algorithms on real data sets of interest. We do not\nthink the subjectivity of the gold-standard assumption is a unique limitation of AIDE.\nA limitation of AIDE is that its bias depends on the accuracy of meta-inference, i.e.\ninference\nover the auxiliary random variables used by an inference algorithm. We currently lack an accurate\nmeta-inference algorithm for MCMC samplers that do not employ annealing, and therefore AIDE is\nnot yet suitable for use as a general MCMC convergence diagnostic. Research on new meta-inference\nalgorithms for MCMC and comparisons to standard convergence diagnostics [21, 22] are needed.\nOther areas for future work include understanding how the accuracy of meta-inference depends\non parameters of an inference algorithm, and more generally what makes an inference algorithm\namenable to ef\ufb01cient meta-inference.\nNote that AIDE does not rely on asymptotic exactness of the inference algorithm being evaluated.\nAn interesting area of future work is in using AIDE to study the non-asymptotic error of scalable but\nasymptotically biased sampling algorithms [23]. It also seems fruitful to connect AIDE to results\nfrom theoretical computer science, including the computability [24] and complexity [25\u201328] of\nprobabilistic inference. It should be possible to study the computational tractability of approximate\ninference empirically using AIDE estimates, as well as theoretically using a careful treatment of the\nvariance of these estimates. 
It also seems promising to use ideas from AIDE to develop Monte Carlo program analyses for samplers written in probabilistic programming languages.

Acknowledgments

This research was supported by DARPA (PPAML program, contract number FA8750-14-2-0004), IARPA (under research contract 2015-15061000003), the Office of Naval Research (under research contract N000141310333), the Army Research Office (under agreement number W911NF-13-1-0212), and gifts from Analog Devices and Google. This research was conducted with Government support under and awarded by DoD, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.

References

[1] Mary Kathryn Cowles and Bradley P Carlin. Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association, 91(434):883–904, 1996.

[2] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.

[3] Roger B Grosse, Zoubin Ghahramani, and Ryan P Adams. Sandwiching the marginal likelihood using bidirectional Monte Carlo. arXiv preprint, arXiv:1511.02543, 2015.

[4] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[5] W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

[6] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.

[7] Radford M Neal. Annealed importance sampling.
Statistics and Computing, 11(2):125–139, 2001.

[8] Donald B Rubin. Using the SIR algorithm to simulate posterior distributions. Bayesian Statistics, 3(1):395–402, 1988.

[9] Adrian FM Smith and Alan E Gelfand. Bayesian statistics without tears: a sampling-resampling perspective. The American Statistician, 46(2):84–88, 1992.

[10] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

[11] Noah Goodman, Vikash K Mansinghka, Daniel M Roy, Keith Bonawitz, and Joshua B Tenenbaum. Church: a language for generative models with non-parametric memoization and approximate inference. In Uncertainty in Artificial Intelligence, 2008.

[12] Roger B Grosse, Siddharth Ancha, and Daniel M Roy. Measuring the reliability of MCMC inference with bidirectional Monte Carlo. In Advances in Neural Information Processing Systems (NIPS), pages 2451–2459, 2016.

[13] Augustine Kong. A note on importance sampling using standardized weights. Technical Report 348, University of Chicago, Department of Statistics, 1992.

[14] Jackson Gorham and Lester Mackey. Measuring sample quality with Stein's method. In Advances in Neural Information Processing Systems (NIPS), pages 226–234, 2015.

[15] Chris J Maddison, Dieterich Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Whye Teh. Filtering variational objectives. arXiv preprint, arXiv:1705.09279, 2017.

[16] Christian A Naesseth, Scott W Linderman, Rajesh Ranganath, and David M Blei. Variational sequential Monte Carlo. arXiv preprint, arXiv:1705.11140, 2017.

[17] Tuan Anh Le, Maximilian Igl, Tom Jin, Tom Rainforth, and Frank Wood. Auto-encoding sequential Monte Carlo. arXiv preprint, arXiv:1705.10306, 2017.

[18] Tim Salimans, Diederik Kingma, and Max Welling.
Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning (ICML), pages 1218–1226, 2015.

[19] Yener Ulker, Bilge Günsel, and Taylan Cemgil. Sequential Monte Carlo samplers for Dirichlet process mixtures. In Artificial Intelligence and Statistics (AISTATS), pages 876–883, 2010.

[20] Michael J Drinkwater, Quentin A Parker, Dominique Proust, Eric Slezak, and Hernán Quintana. The large scale distribution of galaxies in the Shapley supercluster. Publications of the Astronomical Society of Australia, 21(1):89–96, 2004.

[21] Andrew Gelman and Donald B Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472, 1992.

[22] John Geweke. Getting it right: Joint distribution tests of posterior simulators. Journal of the American Statistical Association, 99(467):799–804, 2004.

[23] Elaine Angelino, Matthew James Johnson, and Ryan P Adams. Patterns of scalable Bayesian inference. Foundations and Trends in Machine Learning, 9(2-3):119–247, 2016.

[24] Nathanael L Ackerman, Cameron E Freer, and Daniel M Roy. On the computability of conditional probability. arXiv preprint, arXiv:1005.3014, 2010.

[25] Cameron E Freer, Vikash K Mansinghka, and Daniel M Roy. When are probabilistic programs probably computationally tractable? In NIPS Workshop on Advanced Monte Carlo Methods with Applications, 2010.

[26] Jonathan H Huggins and Daniel M Roy. Convergence of sequential Monte Carlo-based sampling methods. arXiv preprint, arXiv:1503.00966, 2015.

[27] Sourav Chatterjee and Persi Diaconis. The sample size required in importance sampling. arXiv preprint, arXiv:1511.01437, 2015.

[28] S Agapiou, Omiros Papaspiliopoulos, D Sanz-Alonso, and AM Stuart. Importance sampling: Intrinsic dimension and computational cost.
Statistical Science, 32(3):405–431, 2017.