{"title": "A Stochastic approximation method for inference in probabilistic graphical models", "book": "Advances in Neural Information Processing Systems", "page_first": 216, "page_last": 224, "abstract": "We describe a new algorithmic framework for inference in probabilistic models, and apply it to inference for latent Dirichlet allocation. Our framework adopts the methodology of variational inference, but unlike existing variational methods such as mean field and expectation propagation it is not restricted to tractable classes of approximating distributions. Our approach can also be viewed as a sequential Monte Carlo (SMC) method, but unlike existing SMC methods there is no need to design the artificial sequence of distributions. Notably, our framework offers a principled means to exchange the variance of an importance sampling estimate for the bias incurred through variational approximation. Experiments on a challenging inference problem in population genetics demonstrate improvements in stability and accuracy over existing methods, and at a comparable cost.", "full_text": "A Stochastic approximation method for inference\n\nin probabilistic graphical models\n\nPeter Carbonetto\n\nDept. of Human Genetics\n\nUniversity of Chicago\nChicago, IL, U.S.A.\n\nMatthew King\nDept. of Botany\n\nFiras Hamze\n\nD-Wave Systems\n\nUniversity of British Columbia\n\nVancouver, B.C., Canada\n\nBurnaby, B.C., Canada\nfhamze@dwavesys.com\n\npcarbone@bsd.uchicago.edu\n\nkingdom@interchange.ubc.ca\n\nAbstract\n\nWe describe a new algorithmic framework for inference in probabilistic models,\nand apply it to inference for latent Dirichlet allocation (LDA). Our framework\nadopts the methodology of variational inference, but unlike existing variational\nmethods such as mean \ufb01eld and expectation propagation it is not restricted to\ntractable classes of approximating distributions. 
Our approach can also be viewed as a “population-based” sequential Monte Carlo (SMC) method, but unlike existing SMC methods there is no need to design the artificial sequence of distributions. Significantly, our framework offers a principled means to exchange the variance of an importance sampling estimate for the bias incurred through variational approximation. We conduct experiments on a difficult inference problem in population genetics, a problem that is related to inference for LDA. The results of these experiments suggest that our method can offer improvements in stability and accuracy over existing methods, and at a comparable cost.\n\n1 Introduction\n\nOver the past several decades, researchers in many different fields—statistics, economics, physics, genetics and machine learning—have focused on coming up with more accurate and more efficient approximate solutions to intractable probabilistic inference problems. To date, there are three widely explored approaches to approximate inference in probabilistic models: obtaining a Monte Carlo estimate by simulating a Markov chain (MCMC); obtaining a Monte Carlo estimate by drawing samples from a distribution other than the target and then reweighting the samples to account for any discrepancies (importance sampling); and variational inference, in which the original integration problem is transformed into an optimization problem.\n\nThe variational approach in particular has attracted wide interest in the machine learning community, and this interest has led to a number of important innovations in approximate inference—some of these more recent developments are described in the dissertations of Beal [3], Minka [22], Ravikumar [27] and Wainwright [31]. 
The key idea behind variational inference is to come up\nwith a family of approximating distributions \u02c6p(x; \u03b8) that have \u201cnice\u201d analytic properties, then to\noptimize some criterion in order to \ufb01nd the distribution parameterized by \u03b8 that most closely\nmatches the target posterior p(x). All variational inference algorithms, including belief propaga-\ntion and its generalizations [32], expectation propagation [22] and mean \ufb01eld [19], can be derived\nfrom a common objective, the Kullback-Leibler (K-L) divergence [9]. The major drawback of\nvariational methods is that the best approximating distribution may still impose an unrealistic or\nquestionable factorization, leading to excessively biased estimates (see Fig. 1, left-hand side).\n\nIn this paper, we describe a new variational method that does not have this limitation: it adopts the\nmethodology of variational inference without being restricted to tractable classes of approximate\n\n1\n\n\fdistributions (see Fig. 1, right-hand side). The catch is that the variational objective (the K-L\ndivergence) is di\ufb03cult to optimize because its gradient cannot be computed exactly. So to descend\nalong the surface of the variational objective, we propose to employ stochastic approximation [28]\nwith Monte Carlo estimates of the gradient, and update these estimates over time with sequential\nMonte Carlo (SMC) [12]\u2014hence, a stochastic approximation method for probabilistic inference.\nLarge gradient descent steps may quickly lead to a degenerate sample, so we introduce a mechanism\nthat safeguards the variance of the Monte Carlo estimate at each iteration (Sec. 3.5). 
This variance safeguard mechanism does not make the standard effective sample size (ESS) approximation [14], hence it is likely to monitor the variance of the sample more accurately.\n\nIndirectly, the variance safeguard provides a way to obtain an estimator that has low variance in exchange for (hopefully small) bias. To our knowledge, our algorithm is the first general means of achieving such a trade-off and, in so doing, it draws meaningful connections between Monte Carlo and variational methods.\n\nThe advantage of our stochastic approximation method with respect to other variational methods is rather straightforward: it does not restrict the class of variational densities by making assumptions about their structure. However, the advantage of our approach compared to Monte Carlo methods such as annealed importance sampling (AIS) [24] is less obvious. One key advantage is that there is no need to design the sequence of SMC distributions, as it is a direct product of the algorithm’s derivation (Sec. 3). It is our conjecture that this automatic selection, when combined with the variance safeguard, is more efficient than setting the sequence by hand, say, via tempered transitions [12, 18, 24]. The population genetics experiments we conduct in Sec. 4 provide some support for this claim.\n\nWe illustrate our approach on the problem of inferring population structure from a cohort of genotyped sequences using the mixture model of Pritchard et al. [26]. We show in Sec. 4 that Markov chain Monte Carlo (MCMC) is prone to producing very different answers in independent simulations, and that it fails to adequately capture the uncertainty in its solutions. For many population genetics applications, such as wildlife conservation [8], it is crucial to accurately characterize the confidence in a solution. 
Since variational methods employing mean \ufb01eld\napproximations [4, 30] tend to be overcon\ufb01dent, they are poorly\nsuited for this problem. (This has generally not been an issue\nfor semantic text analysis [4, 15].) As we show, SMC with a\nuniform sequence of tempered distributions fares little better than MCMC. The implementation of\nour approach on the population structure model demonstrates improvements in both accuracy and\nreliability over MCMC and SMC alternatives, and at a comparable computational cost.\n\nFigure 1: The guiding princi-\nple behind standard variational\nmethods (top) is to \ufb01nd the ap-\nproximating density \u02c6p(x; \u03b8) that\nis closest to the distribution of\ninterest p(x), yet remains within\nthe de\ufb01ned set of tractable prob-\nability distributions. In our ap-\nproach (bottom), the class of ap-\nproximating densities always co-\nincides with the target p(x).\n\nThe latent Dirichlet allocation (LDA) model [4] is very similar to the population structure model\nof [26], under the assumption of \ufb01xed Dirichlet priors. Since LDA is already familiar to the\nmachine learning audience, it serves as a running example throughout our presentation.\n\n1.1 Related work\n\nThe interface of optimization and simulation strategies for inference has been explored in a number\nof papers, but none of the existing literature resembles the approach proposed in this paper. De\nFreitas et al. [11] use a variational approximation to formulate a Metropolis-Hastings proposal. Re-\ncent work on adaptive MCMC [1] combines ideas from both stochastic approximation and MCMC\nto automatically learn better proposal distributions. Our work is also unrelated to the paper [20]\nwith a similar title, where stochastic approximation is applied to improving the Wang-Landau\nalgorithm. Younes [33] employs stochastic approximation to compute the maximum likelihood\nestimate of an undirected graphical model. 
Also, the cross-entropy method [10] uses importance sampling and optimization for inference, but exhibits no similarity to our work beyond that.\n\n2\n\n\f2 Latent Dirichlet allocation\n\nLatent Dirichlet allocation (LDA) is a generative model of a collection of text documents, or corpus. Its two key features are: the order of the words is unimportant, and each document is drawn from a mixture of topics. Each document d = 1, …, D is expressed as a “bag” of words, and each word wdi = j refers to a vocabulary item j ∈ {1, …, W}. (Here we assume each document has the same length N.) Also, each word has a latent topic indicator zdi ∈ {1, …, K}. The jth vocabulary item is observed in the kth topic with probability βkj. The word proportions for each topic are generated according to a Dirichlet distribution with fixed prior η. The latent topic indicators are generated independently according to p(zdi = k | τd) ≡ τdk, and τd in turn follows a Dirichlet with prior ν. The generative process we just described defines a joint distribution over the observed data w and unknowns x = {β, τ, z} given the hyperparameters {η, ν}:\n\np(w, x | η, ν) = ∏_{k=1}^{K} p(βk | η) × ∏_{d=1}^{D} p(τd | ν) × ∏_{d=1}^{D} ∏_{i=1}^{N} p(wdi | zdi, β) p(zdi | τd). (1)\n\nThe directed graphical model is given in Fig. 2.\n\nImplementations of approximate inference in LDA include MCMC [15, 26] and variational inference with a mean field approximation [4, 30]. 
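To make the generative process concrete, it can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the paper: the function names are ours, and the Dirichlet draw uses the standard Gamma-normalization trick.

```python
import random

def sample_dirichlet(alpha):
    # Standard trick: normalize independent Gamma(alpha_i, 1) draws.
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_categorical(p):
    # Inverse-CDF draw from a discrete distribution p.
    u, acc = random.random(), 0.0
    for i, pi in enumerate(p):
        acc += pi
        if u < acc:
            return i
    return len(p) - 1

def generate_corpus(D, N, K, W, eta, nu):
    """Sample a corpus from the LDA generative process described above:
    topics beta_k ~ Dir(eta), proportions tau_d ~ Dir(nu), then for each
    word a topic indicator z_di ~ tau_d and a word w_di ~ beta_{z_di}."""
    beta = [sample_dirichlet([eta] * W) for _ in range(K)]
    docs = []
    for _ in range(D):
        tau = sample_dirichlet([nu] * K)
        doc = [sample_categorical(beta[sample_categorical(tau)])
               for _ in range(N)]
        docs.append(doc)
    return docs
```

Here D, N, K, W are the numbers of documents, words per document, topics and vocabulary items, matching the notation above.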
The advantages of our inference approach become clear when it is measured against the variational mean field algorithm of [4]: first, we make no additional assumptions regarding the model’s factorization; second, the number of variational parameters is independent of the size of the corpus, so there is no need to resort to coordinate-wise updates that are typically slow to converge.\n\n3 Description of algorithm\n\nFigure 2: Directed graphical model for LDA. Shaded nodes represent observations or fixed quantities.\n\nThe goal is to calculate the expectation of a function φ(x) with respect to the target distribution p(x):\n\nE_{p(·)}[φ(X)] = ∫ φ(x) p(x) dx. (2)\n\nIn LDA, the target density p(x) is the posterior of x = {β, τ, z} given w, derived via Bayes’ rule.\n\nFrom the importance sampling identity [2], we can obtain an unbiased estimate of (2) by drawing n samples from a proposal q(x) and evaluating importance weights w(x) = p(x)/q(x). (Usually p(x) can only be evaluated up to a normalizing constant, in which case the asymptotically unbiased normalized importance sampling estimator [2] is used instead.) The Monte Carlo estimator is\n\nE_{p(·)}[φ(X)] ≈ (1/n) Σ_{s=1}^{n} w(x^(s)) φ(x^(s)). (3)\n\nUnless great care is taken in designing the proposal q(x), the Monte Carlo estimator will exhibit astronomically high variance for all but the smallest problems.\n\nInstead, we construct a Monte Carlo estimate (3) by replacing p(x) with an alternate target p̂(x; θ) that resembles it, so that all importance weights are evaluated with respect to this alternate target. (We elaborate on the exact form of p̂(x; θ) in Sec. 3.1.) This new estimator is biased, but we minimize the bias by solving a variational optimization problem.\n\n• Draw samples from initial density p̂(x; θ1).\n• for k = 2, 3, 4, …\n- Stochastic approximation step: take gradient descent step θk = θk−1 − αk gk, where gk is a Monte Carlo estimate of the gradient of the K-L divergence, and αk is the variance-safeguarded step size.\n- SMC step: update samples and importance weights to reflect the new density p̂(x; θk).\n\nFigure 3: Algorithm sketch.\n\nOur algorithm has a dual interpretation: it can be interpreted as a stochastic approximation algorithm for solving a variational optimization problem, in which the iterates are the parameter vectors θk, and it can equally be viewed as a sequential Monte Carlo (SMC) method [12], in which each distribution p̂(x; θk) in the sequence is chosen dynamically based on samples from the previous iteration.\n\n3\n\n\fThe basic idea is spelled out in Fig. 3. At each iteration, the algorithm selects a new target p̂(x; θk) by optimizing the variational objective. Next, the samples are revised in order to compute the stochastic gradient gk+1 at the next iteration. Since SMC is effectively a framework for conducting importance sampling over a sequence of distributions, we describe a “variance safeguard” mechanism (Sec. 3.5) that directly regulates increases in variance at each step by preventing the iterates θk from moving too quickly. It is in this manner that we achieve a trade-off between bias and variance.\n\nSince this is a stochastic approximation method, asymptotic convergence of θk to a minimizer of the objective is guaranteed under basic theory of stochastic approximation [29]. As we elaborate below, this implies that p̂(x; θk) will converge almost surely to the target distribution p(x) as k approaches infinity. And asymptotic variance results from the SMC literature [12] tell us that the Monte Carlo estimates will converge almost surely to the target expectation (2) so long as p̂(x; θk) approaches p(x). 
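The self-normalized variant of the importance sampling estimator (3), used whenever p(x) is known only up to a constant, can be sketched as follows. This is an illustrative pure-Python fragment; the function name and the toy Gaussian target and proposal are ours, not from the paper.

```python
import math
import random

def normalized_is_estimate(phi, log_p_tilde, log_q, sample_q, n=20000):
    """Self-normalized importance sampling estimate of E_p[phi(X)],
    usable when p(x) is known only up to a normalizing constant."""
    xs = [sample_q() for _ in range(n)]
    logw = [log_p_tilde(x) - log_q(x) for x in xs]
    m = max(logw)                       # log-sum-exp shift for stability
    w = [math.exp(lw - m) for lw in logw]
    z = sum(w)                          # normalization absorbs unknown constants
    return sum(wi * phi(x) for wi, x in zip(w, xs)) / z

# Toy illustration (ours): unnormalized target N(1, 1), proposal N(0, 2).
random.seed(1)
est = normalized_is_estimate(
    phi=lambda x: x,
    log_p_tilde=lambda x: -0.5 * (x - 1.0) ** 2,  # additive constants dropped
    log_q=lambda x: -x * x / 8.0,                 # constants cancel after normalizing
    sample_q=lambda: random.gauss(0.0, 2.0))
# est should land close to the target mean of 1.0
```

Note how both log densities omit their normalizing constants: any constant shifts all log weights equally and cancels in the final division by z.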
A crucial condition is that the stochastic estimates of the gradient be unbiased. There is no way to guarantee unbiased estimates under a finite number of samples, so convergence holds only as the number of iterations and the number of samples both approach infinity.\n\nTo recap, the probabilistic inference recipe we propose has five main ingredients: one, a family of approximating distributions that admits the target (Sec. 3.1); two, a variational optimization problem framed using the K-L divergence measure (Sec. 3.2); three, a stochastic approximation method for finding a solution to the variational optimization problem (Sec. 3.3); four, the implementation of a sequential Monte Carlo method for constructing stochastic estimates of the gradient of the variational objective (Sec. 3.4); and five, a way to safeguard the variance of the importance weights at each iteration of the stochastic approximation algorithm (Sec. 3.5).\n\n3.1 The family of approximating distributions\n\nThe first implementation step is the design of a family of approximating distributions p̂(x; θ) parameterized by vector θ. In order to devise a useful variational inference procedure, the usual strategy is to restrict the class of approximating distributions to those that factorize in an analytically convenient fashion [4, 19] or, in the dual formulation, to introduce an approximate (but tractable) decomposition of the entropy [32]. Here, we impose no such restrictions on tractability; refer to Fig. 1. We allow any family of approximating distributions so long as it satisfies these three conditions: 1.) there is at least one θ = θ1 such that samples can be drawn from p̂(x; θ1); 2.) there is a θ = θ⋆ that recovers the target p̂(x; θ⋆) = p(x), hence an unbiased estimate of (2); and 3.) 
the densities are members of the exponential family [13] expressed in standard form\n\np̂(x; θ) = exp{⟨a(x), θ⟩ − c(θ)}, (4)\n\nin which ⟨·, ·⟩ is an inner product, the vector-valued function a(x) is the statistic of x, and θ is the natural or canonical parameterization. The log-normalization factor c(θ) ≡ log ∫ exp⟨a(x), θ⟩ dx ensures that p̂(x; θ) represents a proper probability. We further assume that the random vector x can be partitioned into two sets A and B such that it is always possible to draw samples from the conditionals p̂(xA | xB; θ) and p̂(xB | xA; θ). Hidden Markov models, mixture models, continuous-time Markov processes, and some Markov random fields are all models that satisfy this condition. This extra condition could be removed without great difficulty, but doing so would add several complications to the description of the algorithm. The restriction to the exponential family is not a strong one, as most conventionally studied densities can be written in the form (4).\n\nFor LDA, we chose a family of approximating densities of the form\n\np̂(x; θ) = exp{ Σ_{d=1}^{D} Σ_{k=1}^{K} (νk + ndk − 1) log τdk + Σ_{k=1}^{K} Σ_{j=1}^{W} (η̂kj − 1) log βkj + φ Σ_{k=1}^{K} Σ_{j=1}^{W} mkj log βkj + γ Σ_{k=1}^{K} Σ_{j=1}^{W} (cj − mkj) log βkj − c(θ) }, (5)\n\nwhere mkj ≡ Σd Σi δk(zdi) δj(wdi) counts the number of times the jth word is assigned to the kth topic, ndk ≡ Σi δk(zdi) counts the number of words assigned to the kth topic in the dth document, and cj ≡ Σd Σi δj(wdi) is the number of times the jth vocabulary item is observed. The natural parameters are θ = {η̂, φ, γ}, with θ ≥ 0. 
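To make the family (5) concrete, the sufficient statistics and the unnormalized log-density can be computed directly from the topic assignments. This pure-Python sketch uses our own function names and a toy corpus, and deliberately omits the intractable log-normalizer c(θ).

```python
import math

def lda_suff_stats(docs, z, K, W):
    """Counts m_kj, n_dk and c_j appearing in the approximating family (5)."""
    D = len(docs)
    m = [[0] * W for _ in range(K)]   # m_kj: times word j is assigned to topic k
    n = [[0] * K for _ in range(D)]   # n_dk: words in doc d assigned to topic k
    c = [0] * W                       # c_j: total occurrences of word j
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            m[k][w] += 1
            n[d][k] += 1
            c[w] += 1
    return m, n, c

def log_p_hat_unnorm(tau, beta, docs, z, nu, eta_hat, phi, gamma):
    """Unnormalized log of (5), i.e. the exponent without c(theta).
    The coefficient on log beta_kj collects all four terms of (5)."""
    K, W = len(beta), len(beta[0])
    m, n, c = lda_suff_stats(docs, z, K, W)
    val = sum((nu[k] + n[d][k] - 1.0) * math.log(tau[d][k])
              for d in range(len(docs)) for k in range(K))
    val += sum((eta_hat[k][j] - 1.0 + phi * m[k][j] + gamma * (c[j] - m[k][j]))
               * math.log(beta[k][j])
               for k in range(K) for j in range(W))
    return val
```

Setting phi = 1, gamma = 0 and eta_hat = eta in this sketch corresponds to the choice of θ that recovers the target posterior, as described in the text that follows.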
The target posterior p̂(x; θ⋆) ∝ p(w, x | η, ν) is recovered by setting φ = 1, γ = 0 and η̂ = η. A sampling density with a tractable expression for c(θ) is recovered whenever we set φ equal to γ. The graphical structure of LDA (Fig. 2) allows us to draw samples from the conditionals p̂(β, τ | z; θ) and p̂(z | β, τ; θ). Loosely speaking, this choice is meant to strike a balance between the mean field approximation [4] (with parameters η̂kj) and the tempered distribution (with “local” temperature parameters φ and γ).\n\n4\n\n\f3.2 The variational objective\n\nThe Kullback-Leibler (K-L) divergence [9] asymmetrically measures the distance between the target distribution p(x) = p̂(x; θ⋆) and the approximating distribution p̂(x; θ),\n\nF(θ) = ⟨E_{p̂(·; θ)}[a(X)], θ − θ⋆⟩ + c(θ⋆) − c(θ), (6)\n\nthe optimal choice being θ = θ⋆. This is our variational objective. The fact that we cannot compute c(θ) poses no obstacle to optimizing the objective (6); through application of basic properties of the exponential family, the gradient vector works out to be the matrix-vector product\n\n∇F(θ) = Var_{p̂(·; θ)}[a(X)] (θ − θ⋆), (7)\n\nwhere Var[a(X)] is the covariance matrix of the statistic a(x). The real obstacle is the presence of an integral in (7) that is most likely intractable. With a collection of samples x^(s) with importance weights w^(s), for s = 1, …, n, that approximate p̂(x; θ), we have the Monte Carlo estimate\n\n∇F(θ) ≈ Σ_{s=1}^{n} w^(s) (a(x^(s)) − ā)(a(x^(s)) − ā)^T (θ − θ⋆), (8)\n\nwhere ā ≡ Σ_s w^(s) a(x^(s)) denotes the Monte Carlo estimate of the mean statistic. 
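A direct transcription of the estimate (8), the weighted empirical covariance of the statistic applied to (θ − θ⋆), might look like the following pure-Python sketch (the function name and inputs are illustrative, and the weights are assumed to be normalized):

```python
def grad_estimate(samples, weights, stat, theta, theta_star):
    """Monte Carlo estimate of the K-L gradient in (8): the weighted
    empirical covariance of the statistic a(x), applied to theta - theta_star.
    `weights` are assumed normalized to sum to one."""
    a = [stat(x) for x in samples]
    d = len(theta)
    abar = [sum(w * ai[j] for w, ai in zip(weights, a)) for j in range(d)]
    delta = [t - ts for t, ts in zip(theta, theta_star)]
    grad = [0.0] * d
    for w, ai in zip(weights, a):
        cent = [ai[j] - abar[j] for j in range(d)]
        proj = sum(cj * dj for cj, dj in zip(cent, delta))  # (a - abar)^T (theta - theta*)
        for j in range(d):
            grad[j] += w * cent[j] * proj
    return grad
```

Accumulating the scalar projection first avoids forming the d-by-d covariance matrix explicitly, which keeps the cost at O(nd) rather than O(nd²).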
Note that these samples {x^(s), w^(s)} serve to estimate both the expectation (2) and the gradient (7). The algorithm’s performance hinges on a good search direction, so it is worth our while to reduce the variance of the gradient measurements when possible via Rao-Blackwellization [6]. Since we no longer have an exact value for the gradient, we appeal to the theory of stochastic approximation.\n\n3.3 Stochastic approximation\n\nInstead of insisting on making progress toward a minimizer of f(θ) at every iteration, as in gradient descent, stochastic approximation only requires that progress be achieved on average. The Robbins-Monro algorithm [28] iteratively adjusts the control variable θ according to\n\nθk+1 = θk − αk gk, (9)\n\nwhere gk is a noisy observation of f(θk), and {αk} is a sequence of step sizes. Provided the sequence of step sizes satisfies certain conditions, this algorithm is guaranteed to converge to the solution f(θ⋆) = 0; see [29]. In our case, f(θ) = ∇F(θ) = 0 is the first-order condition for an unconstrained minimum. Due to poor conditioning, we advocate replacing the gradient descent search direction Δθk = −gk in (9) by the quasi-Newton search direction Δθk = −Bk^{−1} gk, where Bk is a damped quasi-Newton (BFGS) approximation of the Hessian [25]. To handle the constraints θ ≥ 0 introduced in Sec. 3.1, we use the stochastic interior-point method of [5].\n\nAfter having taken a step along Δθk, the samples must be updated to reflect the new distribution p̂(x; θk+1). To accomplish this feat, we use SMC [12] to sample from a sequence of distributions.\n\n3.4 Sequential Monte Carlo\n\nIn the first step of SMC, samples x1^(s) are drawn from a proposal density q1(x) = p̂(x; θ1) so that the initial importance weights are uniform. After k steps the proposal density is\n\nq̃k(x1:k) = Kk(xk | xk−1) ··· K2(x2 | x1) p̂(x1; θ1), (10)\n\nwhere Kk(x′ | x) is the Markov kernel that extends the path at every iteration. The insight of [12] is that if we choose the densities p̃k(x1:k) wisely, we can update the importance weights w̃k(x1:k) = p̃k(x1:k)/q̃k(x1:k) without having to look at the entire history. This special construction is\n\np̃k(x1:k) = L1(x1 | x2) ··· Lk−1(xk−1 | xk) p̂(xk; θk), (11)\n\nwhere we’ve introduced a series of artificial “backward” kernels Lk(x | x′). In this paper, the sequence of distributions is determined by the iterates θk, so there remain two degrees of freedom: the choice of the forward kernel Kk(x′ | x), and the backward kernel Lk(x | x′). From the assumptions made in Sec. 3.1, a natural choice for the forward transition kernel is the two-stage Gibbs sampler,\n\nKk(x′ | x) = p̂(x′A | x′B; θk) p̂(x′B | xA; θk), (12)\n\nin which we first draw a sample of xB (in LDA, the variables τ and β) given xA (the discrete variables z), then update xA conditioned on xB. A Rao-Blackwellized version of the sub-optimal backward kernel [12] then leads to the following expression for updating the importance weights:\n\nw̃k(x1:k) = p̃(xA; θk)/p̃(xA; θk−1) × w̃k−1(x1:k−1), (13)\n\nwhere xA is the component from time step k − 1 restricted to the set A, and p̃(xA; θk) is the unnormalized version of the marginal p̂(xA; θk). It also follows from earlier assumptions (Sec. 3.1) that it is always possible to compute p̃(xA; θ). Refer to [15] for the marginal of z for LDA.\n\n5\n\n\f3.5 Safeguarding the variance\n\nA key component of the algorithm is a mechanism that enables the practitioner to regulate the variance of the importance weights and, by extension, the variance of the Monte Carlo estimate of E[φ(X)]. The trouble with taking a full step (9) is that the Gibbs kernel (12) may be unable to effectively migrate the particles toward the new target, in which case the importance weights will overcompensate for this failure, quickly leading to a degenerate population. The remedy we propose is to find a step size αk that satisfies\n\nβ Sk(θk) ≤ Sk−1(θk−1), (14)\n\nfor β ∈ [0, 1], whereby a β near 1 leads to a stringent safeguard, and we’ve defined\n\nSk(θk) ≡ Σ_{s=1}^{n} (w̃k(x^(s)_{1:k}) − 1/n)², (15)\n\nto be the sample variance (× n) for our choice of L(x | x′).\n\n• Let n, θ1, θ⋆, A, B, {αk} be given.\n• Draw x^(s) ∼ p̂(x; θ1), set w^(s) = 1/n.\n• Set inverse Hessian H to the identity.\n• for k = 2, 3, 4, …\n1. Compute gk ≈ ∇F(θk−1); see (8).\n2. if k > 2, then modify H following the damped quasi-Newton update.\n3. Compute variance-safeguarded step size αk ≤ α̂k given Δθk = −H gk.\n4. Set θk = θk−1 + αk Δθk.\n5. Update w^(s) following (13).\n6. Run the two-stage Gibbs sampler:\n- Draw x^(s)_B ∼ p̂(· | x^(s)_A; θk).\n- Draw x^(s)_A ∼ p̂(· | x^(s)_B; θk).\n7. Resample particles, if necessary.\n\nFigure 4: The proposed algorithm.\n\n
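Steps 3-5 of the algorithm in Fig. 4 can be illustrated with a short pure-Python sketch. The fragment below is our own illustration, not the paper's implementation: it applies the incremental weight update (13) in log space, then evaluates the variance test (14) together with the ESS fallback of Sec. 3.5 for a proposed step, rather than solving (16) for the largest admissible step size.

```python
import math

def update_log_weights(log_w, xA, log_p_tilde_A, theta_new, theta_old):
    """Incremental weight update of Eq. (13), carried out in log space."""
    return [lw + log_p_tilde_A(x, theta_new) - log_p_tilde_A(x, theta_old)
            for lw, x in zip(log_w, xA)]

def normalize(log_w):
    m = max(log_w)                       # log-sum-exp shift for stability
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    return [wi / z for wi in w]

def variance_stat(w):
    """S_k of Eq. (15): n times the sample variance of normalized weights."""
    n = len(w)
    return sum((wi - 1.0 / n) ** 2 for wi in w)

def ess(w):
    # Standard effective sample size of normalized weights.
    return 1.0 / sum(wi * wi for wi in w)

def step_accepted(w_new, w_old, beta=0.95, xi=0.9):
    """Accept a proposed step if it satisfies the safeguard (14), or, as a
    fallback, keeps the ESS above a fraction xi of its optimum n."""
    n = len(w_new)
    return (beta * variance_stat(w_new) <= variance_stat(w_old)
            or ess(w_new) >= xi * n)
```

In the paper the safeguard is used the other way around, to choose the step size before the weights are updated; this sketch only shows how a candidate update would be tested.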
Note that since our variance safeguard scheme is myopic, the behaviour of the algorithm can be sensitive to the number of iterations.\n\nThe safeguarded step size is derived as follows. The goal is to find the largest step size αk satisfying (14). Forming a Taylor-series expansion with second-order terms about the point αk = 0, the safeguarded step size is the solution to\n\n(1/2) Δθk^T ∇²Sk(θk−1) Δθk αk² + Δθk^T ∇Sk(θk−1) αk = ((1−β)/β) Sk−1(θk−1), (16)\n\nwhere Δθk is the search direction at iteration k. In our experience, the quadratic approximation to the importance weights (13) was unstable as it occasionally recommended strange step sizes, but a naive importance weight update without Rao-Blackwellization yielded a reliable bound on (14). The derivatives of Sk(θk) work out to sample estimates of second and third moments that can be computed in O(n) time. Since the importance weights initially have zero variance, no positive step size will satisfy (14). We propose to also permit step sizes that do not drive the ESS below a factor ξ ∈ (0, 1) from the optimal sample. Resampling will still be necessary over long sequences to prevent the population from degenerating. The basic algorithm is summarized in Fig. 4.\n\n4 Application to population genetics\n\nMicrosatellite genetic markers have been used to determine the genealogy of human populations, and to assess individuals’ ancestry in inferring disease risks [16]. The problem is that all these tasks require defining a priori population structure. The Bayesian model of Pritchard et al. [26] offers a solution to this conundrum by simultaneously identifying both patterns of population subdivision and the ancestry of individuals from highly variable genetic markers. This model is the same as LDA assuming fixed Dirichlet priors and a single genetic marker; see Fig. 5 for the connection between the two domains.\n\nFigure 5: Correspondence between the LDA [4] and population structure [26] models: text corpus ⇔ population structure; documents ⇔ individuals; topics ⇔ demes; languages ⇔ loci; vocabulary ⇔ alleles.\n\nThis model, however, can be frustrating to work with because independent MCMC simulations can produce remarkably different answers for the same data, even simulations millions of samples long. Such inference challenges have been observed in other mixture models [7]; MCMC can do a poor job exploring the hypothesis space when there are several divergent hypotheses that explain the data.\n\nMethod. We used the software CoaSim [21] to simulate the evolution of genetic markers following a coalescent process. The coalescent is a lineage of alleles in a sample traced backward in time to their common ancestor allele, and the coalescent process is the stochastic process that generates the genealogy [17]. We introduced divergence events at various coalescent times (see Fig. 6) so that we ended up with 4 isolated populations. We simulated 10 microsatellite markers each with a maximum of 30 alleles. We simulated the markers twice with scaled mutation rates of 2 and 1/2, and for each rate we simulated 60 samples from the coalescent process (15 diploid individuals from each of the 4 populations). These samples are the words w in LDA. This may not seem like a large data set, but it will be large enough to impose major challenges to approximate inference.\n\n6\n\n\fFigure 7: Variance in estimates of the admixture distance and admixture level taken over 20 trials.\n\nThe goal is to obtain posterior estimates that recover the correct population structure (Fig. 
6) and exhibit high agreement in independent simulations. Specifically, the goal is to recover the moments of two statistics: the admixture distance, a measure of two individuals’ dissimilarity in their ancestry, and the admixture level, where 0 means an individual’s alleles all come from a single population, and 1 means its ancestry is shared equally among the K populations. The admixture distance between individuals d and d′ is\n\nφ(τd, τd′) ≡ (1/2) Σ_{k=1}^{K} |τdk − τd′k|, (17)\n\nand the admixture level of the dth individual is\n\nψ(τd) ≡ 1 − (K/(2(K−1))) Σ_{k=1}^{K} |τdk − 1/K|. (18)\n\nFigure 6: The structured coalescent process with divergence events at coalescent times T = 0, 1/2, 1, 2. The width of the branches represents effective population size, and the arrow points backward in time. The present isolated populations are labeled left-to-right 1 through 4.\n\nWe compared our algorithm to MCMC as implemented in the software Structure [26], and to another SMC algorithm, annealed importance sampling (AIS) [24], with a uniform tempering schedule. One possible limitation of our study is that the choice of temperature schedule can be critical to the success of AIS, and we did not thoroughly investigate alternative schedules. Also, note that our intent was not to present an exhaustive comparison of Monte Carlo methods, so we did not compare to population MCMC [18], for example, which has advantages similar to AIS.\n\nFor the two data sets, and for each K from 2 to 6 (the most appropriate setting being K = 4), we carried out 20 independent trials of the three methods. For a fair comparison, we ran the methods with the same number of sampling events: for MCMC, a Markov chain of length 50,000 and a burn-in of 10,000; for both SMC methods, 100 particles and 500 iterations. 
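The two summary statistics (17) and (18) translate directly into code; a minimal Python version (function names ours):

```python
def admixture_distance(tau_d, tau_e):
    """Eq. (17): half the L1 distance between two ancestry vectors;
    0 means identical ancestry proportions, 1 means disjoint ancestry."""
    return 0.5 * sum(abs(a - b) for a, b in zip(tau_d, tau_e))

def admixture_level(tau_d):
    """Eq. (18): 0 when all of an individual's alleles come from a single
    population, 1 when ancestry is shared equally among the K populations."""
    K = len(tau_d)
    return 1.0 - K / (2.0 * (K - 1)) * sum(abs(t - 1.0 / K) for t in tau_d)
```

For example, an individual with ancestry vector (1, 0, 0, 0) has admixture level 0, while (1/4, 1/4, 1/4, 1/4) has level 1, matching the interpretation in the text.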
Additional settings included an ESS threshold of 50, maximum step sizes αk = 1/(1 + k)^0.6, centering parameters σk = 1/k^0.9 for the stochastic interior-point method, safeguards β = 0.95 and ξ = 0.9, and a quasi-Newton damping factor of 0.75. We set the initial iterate of stochastic approximation to φ = γ = η̂kj = η⋆j. We used uniform Dirichlet priors η⋆j = νk = 0.1 throughout.\n\nResults. First let’s examine the variance in the answers. Fig. 7 shows the variance in the estimates of the admixture level and admixture distance over the independent trials. To produce these plots, at every K we took the pair (d, d′) or individual d that exhibited the most variance in the estimates of E[φ(τd, τd′)] and E[ψ(τd)], respectively. What we observe is that the stochastic approximation method produced significantly more consistent estimates in almost all cases, whereas AIS offered little or no improvement over MCMC. The next step is to examine the accuracy of these answers.\n\nFig. 8 shows estimates from MCMC and stochastic approximation for selected trials under a mutation rate of 1/2 and K = 4 (left-hand side), and under a mutation rate of 2 and K = 3 (right-hand side). The trials were chosen to reflect the extent of variation in the answers. The mean and standard deviation of the admixture distance statistic are drawn as matrices. The 60 rows and 60 columns in each matrix correspond to individuals sorted by their true population label; the rows and columns are ordered so that they correspond to the populations 1 through 4 in Fig. 6. In each “mean” matrix, a light square means that two individuals share little ancestry in common, and a dark square means that two individuals have similar ancestry. In each “std. dev.” matrix, the darker the square, the higher the variance. 
Figure 8: Estimated mean and standard deviation ("std. dev.") of the admixture distance statistic for two independent trials and at two different simulation settings. See the text for a full explanation.

In the first trial (top-left), the MCMC algorithm mostly recovered the correct population structure; i.e., it successfully assigned individuals to their coalescent populations based on the sampled alleles w. As expected, the individuals from populations 3 and 4 were the hardest to distinguish, hence the high standard deviation in the bottom-right entries of the matrix. The results of the second trial are less satisfying: MCMC failed to distinguish between individuals from populations 3 and 4, and it partitioned the samples originating from population 2 rather arbitrarily. In all these experiments, AIS exhibited behaviour that was very similar to MCMC.

Under the same conditions, our algorithm (bottom-left) failed to distinguish between the third and fourth populations. The trials, however, are more consistent and do not mislead by placing high confidence in these answers; observe the large number of dark squares in the bottom-right portion of the "std. dev." matrix. This evidence suggests that these trials are more representative of the true posterior, because the MCMC trials are inconsistent and occasionally spurious (trial #2). This trend is repeated in the more challenging inference scenario with K = 3 and a mutation rate of 2 (right-hand side). MCMC, as before, exhibited a great deal of variance in its estimates of the admixture distance: the estimates from the first trial are very accurate, but the second trial strangely failed to distinguish between populations 1 and 2, and did not correctly assign the individuals in populations 3 and 4. What's worse, MCMC placed disproportionate confidence in these estimates.
The stochastic approximation method also exhibited some variance under these conditions, but importantly it did not place nearly so much confidence in its solutions; observe the high standard deviation in the matrix entries corresponding to the individuals from population 3.

5 Conclusions and discussion

In this paper, we proposed a new approach to probabilistic inference grounded in variational, Monte Carlo and stochastic approximation methodology. We demonstrated that the added sophistication of our method pays off, producing more consistent and reliable estimates for a real and challenging inference problem in population genetics. Some of the components, such as the variance safeguard, have not been independently validated, so we cannot fully attest to how critical they are, at least beyond the motivation we already gave. More standard tricks, such as Rao-Blackwellization, were explicitly included to demonstrate that well-known techniques from the Monte Carlo literature apply without modification to our algorithm. We have argued for the generality of our inference approach, but ultimately the success of our scheme hinges on a good choice of the variational approximation. Thus, it remains to be seen how well our results extend to probabilistic graphical models beyond LDA, and how much ingenuity will be required to achieve favourable outcomes.

Another critical issue, as we mentioned in Sec. 3.5, is the sensitivity of our method to the number of iterations. This issue is related to the bias-variance trade-off, and in the future we would like to explore more principled ways to formulate this trade-off, in the process reducing this sensitivity.

Acknowledgments

We would like to thank Matthew Hoffman, Nolan Kane, Emtiyaz Khan, Hendrik Kück and Pooja Viswanathan for their input, and the reviewers for exceptionally detailed and thoughtful comments.

References

[1] C. Andrieu and E. Moulines.
On the ergodicity properties of some adaptive MCMC algorithms. Annals of Applied Probability, 16:1462–1505, 2006.

[2] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5–43, 2003.

[3] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University College London, 2003.

[4] D. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[5] P. Carbonetto, M. Schmidt, and N. de Freitas. An interior-point stochastic approximation method and an L1-regularized delta rule. In Advances in Neural Information Processing Systems, volume 21, 2009.

[6] G. Casella and C. P. Robert. Rao-Blackwellisation of sampling schemes. Biometrika, 83:81–94, 1996.

[7] G. Celeux, M. Hurn, and C. P. Robert. Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association, 95:957–970, 2000.

[8] D. W. Coltman. Molecular ecological approaches to studying the evolutionary impact of selective harvesting in wildlife. Molecular Ecology, 17:221–235, 2007.

[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.

[10] P.-T. de Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 134:19–67, 2005.

[11] N. de Freitas, P. Højen-Sørensen, M. I. Jordan, and S. Russell. Variational MCMC. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages 120–127, 2001.

[12] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society, 68:411–436, 2006.

[13] A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall/CRC Press, 2002.

[14] A. Doucet, S. Godsill, and C. Andrieu.
On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10:197–208, 2000.

[15] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.

[16] D. L. Hartl and A. G. Clark. Principles of Population Genetics. Sinauer Associates, 2007.

[17] J. Hein, M. H. Schierup, and C. Wiuf. Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press, 2005.

[18] A. Jasra, D. Stephens, and C. Holmes. On population-based simulation for static inference. Statistics and Computing, 17:263–279, 2007.

[19] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. In M. Jordan, editor, Learning in Graphical Models, pages 105–161. MIT Press, 1998.

[20] F. Liang, C. Liu, and R. J. Carroll. Stochastic approximation in Monte Carlo computation. Journal of the American Statistical Association, 102:305–320, 2007.

[21] T. Mailund, M. Schierup, C. Pedersen, P. Mechlenborg, J. Madsen, and L. Schauser. CoaSim: a flexible environment for simulating genetic data under coalescent models. BMC Bioinformatics, 6, 2005.

[22] T. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, MIT, 2001.

[23] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic, 1998.

[24] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.

[25] M. J. D. Powell. Algorithms for nonlinear constraints that use Lagrangian functions. Mathematical Programming, 14:224–248, 1978.

[26] J. K. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data.
Genetics, 155:945–959, 2000.

[27] P. Ravikumar. Approximate Inference, Structure Learning and Feature Estimation in Markov Random Fields. PhD thesis, Carnegie Mellon University, 2007.

[28] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22, 1951.

[29] J. C. Spall. Introduction to Stochastic Search and Optimization. Wiley-Interscience, 2003.

[30] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, volume 19, 2007.

[31] M. J. Wainwright. Stochastic Processes on Graphs with Cycles: Geometric and Variational Approaches. PhD thesis, Massachusetts Institute of Technology, 2002.

[32] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51:2282–2312, 2005.

[33] L. Younes. Stochastic gradient estimation strategies for Markov random fields. In Proceedings of the Spatial Statistics and Imaging Conference, 1991.