{"title": "Propagation Algorithms for Variational Bayesian Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 507, "page_last": 513, "abstract": null, "full_text": "Propagation Algorithms for Variational Bayesian Learning \n
Zoubin Ghahramani and Matthew J. Beal \n
Gatsby Computational Neuroscience Unit \n
University College London \n
17 Queen Square, London WC1N 3AR, England \n
{zoubin,m.beal}@gatsby.ucl.ac.uk \n\n
Abstract \n\n
Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. \n\n
1 Introduction \n
Bayesian approaches to machine learning have several desirable properties. Bayesian integration does not suffer from overfitting (since nothing is fit to the data). Prior knowledge can be incorporated naturally and all uncertainty is manipulated in a consistent manner. Moreover, it is possible to learn model structures and readily compare between model classes. Unfortunately, for most models of interest a full Bayesian analysis is computationally intractable. 
\nUntil recently, approximate approaches to the intractable Bayesian learning problem had relied either on Markov chain Monte Carlo (MCMC) sampling, the Laplace approximation (Gaussian integration), or asymptotic penalties like BIC. The recent introduction of variational methods for Bayesian learning has resulted in a series of papers showing that these methods can be used to rapidly learn the model structure and approximate the evidence in a wide variety of models. In this paper we will not motivate the advantages of the variational Bayesian approach as this is done in previous papers [1, 5]. Rather, we focus on deriving variational Bayesian (VB) learning in a very general form, relating it to EM, motivating parameter-hidden variable factorisations, and the use of conjugate priors (section 3). We then present several theoretical results relating VB learning to the belief propagation and junction tree algorithms for inference in belief networks and Markov networks (section 4). Finally, we show how these results can be applied to learning the dimensionality of the hidden state space of linear dynamical systems (section 5). \n\n
2 Variational Bayesian Learning \n
The basic idea of variational Bayesian learning is to simultaneously approximate the intractable joint distribution over both hidden states and parameters with a simpler distribution, usually by assuming the hidden states and parameters are independent; the log evidence is lower bounded by applying Jensen's inequality twice: \n\n
ln P(y|M) ≥ ∫ dθ Q_θ(θ) [ ∫ dx Q_x(x) ln ( P(x, y|θ, M) / Q_x(x) ) + ln ( P(θ|M) / Q_θ(θ) ) ]   (1) \n
= F(Q_θ(θ), Q_x(x), y) \n\n
where y, x, θ and M are the observed data, hidden variables, parameters and model class, respectively; P(θ|M) is a parameter prior under model class M. The lower bound F is iteratively maximised as a functional of the two free distributions, Q_x(x) and Q_θ(θ). 
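The double application of Jensen's inequality in (1) can be checked numerically. The sketch below builds a tiny fully discrete model (binary θ, binary hidden x, one binary observation y); all distributions are illustrative choices, not values from the paper. It verifies that F(Q_θ, Q_x, y) never exceeds the exact log evidence, for arbitrary factorised Q:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny discrete model: theta in {0,1}, hidden x in {0,1}, observed y = 1.
p_theta = np.array([0.5, 0.5])                 # prior P(theta|M)
p_x_given_theta = np.array([[0.7, 0.3],        # P(x|theta): rows theta, cols x
                            [0.2, 0.8]])
p_y_given_xtheta = np.array([[0.9, 0.4],       # P(y=1|x,theta): rows theta, cols x
                             [0.3, 0.6]])

# Complete-data likelihood P(x, y=1|theta) and exact log evidence ln P(y|M)
p_xy = p_x_given_theta * p_y_given_xtheta      # shape (theta, x)
log_evidence = np.log((p_theta * p_xy.sum(axis=1)).sum())

def lower_bound(q_theta, q_x):
    """F(Q_theta, Q_x, y) from eq. (1): Jensen's inequality applied twice."""
    inner = (q_x * (np.log(p_xy) - np.log(q_x))).sum(axis=1)  # E_Qx[ln P(x,y|theta)/Qx(x)]
    return (q_theta * (inner + np.log(p_theta) - np.log(q_theta))).sum()

# The bound F <= ln P(y|M) holds for *any* pair of distributions Q_theta, Q_x
for _ in range(100):
    q_theta = rng.dirichlet([1.0, 1.0])
    q_x = rng.dirichlet([1.0, 1.0])
    assert lower_bound(q_theta, q_x) <= log_evidence + 1e-12
```

The slack in the bound is exactly the KL divergence discussed next, so maximising F over Q_θ and Q_x tightens the bound towards the true log evidence.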
From (1) we can see that this maximisation is equivalent to minimising the KL divergence between Q_x(x)Q_θ(θ) and the joint posterior over hidden states and parameters P(x, θ|y, M). \n
This approach was first proposed for one-hidden-layer neural networks [6] under the restriction that Q_θ(θ) is Gaussian. It has since been extended to models with hidden variables, and the restrictions on Q_θ(θ) and Q_x(x) have been removed in certain models to allow arbitrary distributions [11, 8, 3, 1, 5]. Free-form optimisation with respect to the distributions Q_θ(θ) and Q_x(x) is done using the calculus of variations, often resulting in algorithms that appear closely related to the corresponding EM algorithm. We formalise this relationship and others in the following sections. \n\n
3 Conjugate-Exponential Models \n
We consider variational Bayesian learning in models that satisfy two conditions: \n
Condition (1). The complete-data likelihood is in the exponential family: \n\n
P(x, y|θ) = f(x, y) g(θ) exp{φ(θ)ᵀ u(x, y)} \n\n
where φ(θ) is the vector of natural parameters, and u, f and g are the functions that define the exponential family. \n
The list of latent-variable models of practical interest with complete-data likelihoods in the exponential family is very long. We mention a few: Gaussian mixtures, factor analysis, hidden Markov models and extensions, switching state-space models, Boltzmann machines, and discrete-variable belief networks.¹ Of course, there are also many as yet undreamed-of models combining Gaussian, Gamma, Poisson, Dirichlet, Wishart, Multinomial, and other distributions. \n
Condition (2). The parameter prior is conjugate to the complete-data likelihood: \n\n
P(θ|η, ν) = h(η, ν) g(θ)^η exp{φ(θ)ᵀ ν} \n\n
where η and ν are hyperparameters of the prior. \n
Condition (2) in fact usually implies condition (1). 
Apart from some irregular cases, it has been shown that the exponential families are the only classes of distributions with a fixed number of sufficient statistics, hence allowing them to have natural conjugate priors. From the definition of conjugacy it is easy to see that the hyperparameters of a conjugate prior can be interpreted as the number (η) and values (ν) of pseudo-observations under the corresponding likelihood. We call models that satisfy conditions (1) and (2) conjugate-exponential. \n\n
¹Models whose complete-data likelihood is not in the exponential family (such as ICA with the logistic nonlinearity, or sigmoid belief networks) can often be approximated by models in the exponential family with additional hidden variables. \n\n
In Bayesian inference we want to determine the posterior over parameters and hidden variables, P(x, θ|y, η, ν). In general this posterior is neither conjugate nor in the exponential family. We therefore approximate the true posterior by the following factorised distribution: P(x, θ|y, η, ν) ≈ Q(x, θ) = Q_x(x)Q_θ(θ), and minimise \n\n
KL(Q||P) = ∫ dx dθ Q(x, θ) ln [ Q(x, θ) / P(x, θ|y, η, ν) ] \n\n
which is equivalent to maximising F(Q_x(x), Q_θ(θ), y). We provide several general results with no proof (the proofs follow from the definitions and Gibbs' inequality). \n
Theorem 1. Given an iid data set y = (y₁, ..., yₙ), if the model satisfies conditions (1) and (2), then at the maxima of F(Q, y) (minima of KL(Q||P)): \n
(a) Q_θ(θ) is conjugate and of the form: \n\n
Q_θ(θ) = h(η̃, ν̃) g(θ)^η̃ exp{φ(θ)ᵀ ν̃} \n\n
where η̃ = η + n, ν̃ = ν + Σᵢ₌₁ⁿ ū(yᵢ), and ū(yᵢ) = ⟨u(xᵢ, yᵢ)⟩_Q, using ⟨·⟩_Q to denote expectation under Q. \n
(b) Q_x(x) = Πᵢ₌₁ⁿ Q_xᵢ(xᵢ) and Q_xᵢ(xᵢ) is of the same form as the known-parameter posterior: \n\n
Q_xᵢ(xᵢ) ∝ f(xᵢ, yᵢ) exp{φ̄(θ)ᵀ u(xᵢ, yᵢ)} = P(xᵢ|yᵢ, φ̄(θ)) \n\n
where φ̄(θ) = ⟨φ(θ)⟩_Q. 
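The pseudo-observation reading of conjugacy, and the additive update η̃ = η + n, ν̃ = ν + Σᵢ ū(yᵢ) in Theorem 1(a), can be illustrated with the simplest conjugate-exponential pair, a Bernoulli likelihood with a Beta prior (fully observed, so no expectations are needed). The data and hyperparameter values are illustrative:

```python
import numpy as np

# Bernoulli likelihood with a Beta(a, b) conjugate prior: the hyperparameters
# act as pseudo-counts of previously seen heads/tails, and the posterior is
# again Beta, with the observed sufficient statistics simply added on.
def beta_bernoulli_posterior(prior, data):
    heads = int(np.sum(data))
    tails = len(data) - heads
    a, b = prior
    return (a + heads, b + tails)  # conjugacy: hyperparameters += sufficient stats

data = [1, 1, 0, 1, 0, 1, 1, 0]                       # n = 8 coin flips
posterior = beta_bernoulli_posterior((2.0, 2.0), data)
print(posterior)  # (7.0, 5.0): 2 pseudo + 5 observed heads, 2 pseudo + 3 tails
```

With hidden variables the same update applies, except the sufficient statistics are replaced by their expectations ū(yᵢ) under Q_x, which is what couples (a) and (b).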
\nSince Q_θ(θ) and Q_xᵢ(xᵢ) are coupled, (a) and (b) do not provide an analytic solution to the minimisation problem. We therefore solve the optimisation problem numerically by iterating between the fixed-point equations given by (a) and (b), and we obtain the following variational Bayesian generalisation of the EM algorithm: \n
VE Step: Compute the expected sufficient statistics t(y) = Σᵢ ū(yᵢ) under the hidden variable distributions Q_xᵢ(xᵢ). \n
VM Step: Compute the expected natural parameters φ̄(θ) under the parameter distribution given by η̃ and ν̃. \n
This reduces to the EM algorithm if we restrict the parameter density to a point estimate (i.e. a Dirac delta function), Q_θ(θ) = δ(θ − θ*), in which case the M step involves re-estimating θ*. \n
Note that unless we make the assumption that the parameters and hidden variables factorise, we will not generally obtain the further hidden variable factorisation over n in (b). In that case, the distributions of xᵢ and xⱼ will be coupled for all cases i, j in the data set, greatly increasing the overall computational complexity of inference. \n\n
4 Belief Networks and Markov Networks \n
The above result can be used to derive variational Bayesian learning algorithms for exponential family distributions that fall into two important special classes.² \n
Corollary 1: Conjugate-Exponential Belief Networks. Let M be a conjugate-exponential model with hidden and visible variables z = (x, y) that satisfy a belief network factorisation. That is, each variable zⱼ has parents z_pⱼ and P(z|θ) = Πⱼ P(zⱼ|z_pⱼ, θ). Then the approximating joint distribution for M satisfies the same belief network factorisation: \n\n
Q_z(z) = Πⱼ Q(zⱼ|z_pⱼ, θ̃) \n\n
²A tutorial on belief networks and Markov networks can be found in [9]. \n\n
where the conditional distributions have exactly the same form as those in the original model but with natural parameters φ(θ̃) = φ̄(θ). 
Furthermore, with the modified parameters θ̃, the expectations under the approximating posterior Q_x(x) ∝ Q_z(z) required for the VE Step can be obtained by applying the belief propagation algorithm if the network is singly connected and the junction tree algorithm if the network is multiply connected. \n
This result is somewhat surprising as it shows that it is possible to infer the hidden states tractably while integrating over an ensemble of model parameters. This result generalises the derivation of variational learning for HMMs in [8], which uses the forward-backward algorithm as a subroutine. \n
Theorem 2: Markov Networks. Let M be a model with hidden and visible variables z = (x, y) that satisfy a Markov network factorisation. That is, the joint density can be written as a product of clique potentials ψⱼ, P(z|θ) = g(θ) Πⱼ ψⱼ(Cⱼ, θ), where each clique Cⱼ is a subset of the variables in z. Then the approximating joint distribution for M satisfies the same Markov network factorisation: \n\n
Q_z(z) = g̃ Πⱼ ψ̄ⱼ(Cⱼ) \n\n
where ψ̄ⱼ(Cⱼ) = exp{⟨ln ψⱼ(Cⱼ, θ)⟩_Q} are new clique potentials obtained by averaging over Q_θ(θ), and g̃ is a normalisation constant. Furthermore, the expectations under the approximating posterior Q_x(x) required for the VE Step can be obtained by applying the junction tree algorithm. \n
Corollary 2: Conjugate-Exponential Markov Networks. Let M be a conjugate-exponential Markov network over the variables in z. Then the approximating joint distribution for M is given by Q_z(z) = g̃ Πⱼ ψⱼ(Cⱼ, θ̃), where the clique potentials have exactly the same form as those in the original model but with natural parameters φ(θ̃) = φ̄(θ). 
\nFor conjugate-exponential models in which belief propagation and the junction tree algorithm over hidden variables are intractable, further applications of Jensen's inequality can yield tractable factorisations in the usual way [7]. \n
In the following section we derive a variational Bayesian treatment of linear-Gaussian state-space models. This serves two purposes. First, it will illustrate an application of Theorem 1. Second, linear-Gaussian state-space models are the cornerstone of stochastic filtering, prediction and control. A variational Bayesian treatment of these models provides a novel way to learn their structure, i.e. to identify the optimal dimensionality of their state-space. \n\n
5 State-space models \n
In state-space models (SSMs), a sequence of D-dimensional real-valued observation vectors {y₁, ..., y_T}, denoted y_{1:T}, is modelled by assuming that at each time step t, y_t was generated from a K-dimensional real-valued hidden state variable x_t, and that the sequence of x's defines a first-order Markov process. The joint probability of a sequence of states and observations is therefore given by (Figure 1): \n\n
P(x_{1:T}, y_{1:T}) = P(x₁)P(y₁|x₁) Π_{t=2}^T P(x_t|x_{t−1})P(y_t|x_t). \n\n
We focus on the case where both the transition and output functions are linear and time-invariant and the distribution of the state and observation noise variables is Gaussian. This model is the linear-Gaussian state-space model: \n\n
x_t = A x_{t−1} + w_t,    y_t = C x_t + v_t \n\n
Figure 1: Belief network representation of a state-space model. \n\n
where A and C are the state transition and emission matrices and w_t and v_t are state and output noise. It is straightforward to generalise this to a linear system driven by some observed inputs, u_t. A Bayesian analysis of state-space models using MCMC methods can be found in [4]. 
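The generative process above can be sketched directly. The following snippet samples a sequence from a linear-Gaussian SSM; the particular A, C and noise covariances are illustrative choices (not values from the paper), with unit state-noise covariance as assumed below:

```python
import numpy as np

# Sample from a linear-Gaussian SSM: x_t = A x_{t-1} + w_t, y_t = C x_t + v_t,
# with w_t ~ N(0, I) and v_t ~ N(0, R).
def sample_lgssm(A, C, R, T, rng):
    K, D = A.shape[0], C.shape[0]
    x = np.zeros((T, K))
    y = np.zeros((T, D))
    x[0] = rng.standard_normal(K)                      # x_1 ~ N(0, I)
    y[0] = C @ x[0] + rng.multivariate_normal(np.zeros(D), R)
    for t in range(1, T):
        x[t] = A @ x[t - 1] + rng.standard_normal(K)   # state noise cov = I
        y[t] = C @ x[t] + rng.multivariate_normal(np.zeros(D), R)
    return x, y

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # stable dynamics (eigenvalues < 1)
C = rng.uniform(-5, 5, size=(10, 2))     # 10-dim outputs from a 2-dim state
x, y = sample_lgssm(A, C, np.eye(10), T=200, rng=rng)
print(x.shape, y.shape)   # (200, 2) (200, 10)
```

Setting A = 0 recovers a factor analysis model, which is one of the structures the experiments in section 6 attempt to identify.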
\nThe complete-data likelihood for state-space models is Gaussian, which falls within the class of exponential family distributions. In order to derive a variational Bayesian algorithm by applying the results in the previous section we now turn to defining conjugate priors over the parameters. \n
Priors. Without loss of generality we can assume that w_t has covariance equal to the unit matrix. The remaining parameters of a linear-Gaussian state-space model are the matrices A and C and the covariance matrix of the output noise, v_t, which we will call R and assume to be diagonal, R = diag(ρ)⁻¹, where the ρᵢ are the precisions (inverse variances) associated with each output. \n
Each row vector of the A matrix, denoted aᵢᵀ, is given a zero-mean Gaussian prior with inverse covariance matrix equal to diag(α). Each row vector of C, cᵢᵀ, is given a zero-mean Gaussian prior with precision matrix equal to ρᵢ diag(β). The dependence of the precision of cᵢᵀ on the output noise precision ρᵢ is motivated by conjugacy. Intuitively, this prior links the scale of the signal and noise. \n
The prior over the output noise covariance matrix, R, is defined through the precision vector, ρ, which for conjugacy is assumed to be Gamma distributed³ with hyperparameters a and b: P(ρ|a, b) = Πᵢ₌₁ᴰ (bᵃ/Γ(a)) ρᵢ^{a−1} exp{−bρᵢ}. Here, α and β are hyperparameters that we can optimise to do automatic relevance determination (ARD) of hidden states, thus inferring the structure of the SSM. \n\n
Variational Bayesian learning for SSMs \n
Since A, C, ρ and x_{1:T} are all unknown, given a sequence of observations y_{1:T}, an exact Bayesian treatment of SSMs would require computing marginals of the posterior P(A, C, ρ, x_{1:T}|y_{1:T}). This posterior contains interaction terms up to fifth order (for example, between elements of C, x and ρ), and is not analytically manageable. 
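The prior structure above can be summarised in a short generative sketch. The hyperparameter values here are arbitrary illustrations; note how a large αₖ concentrates the k-th column of A near zero, which is the mechanism ARD uses to prune state dimensions:

```python
import numpy as np

# Draw SSM parameters from the conjugate priors described above: Gamma
# precisions rho_i for the outputs, rows of A with prior precision diag(alpha),
# and rows of C with prior precision rho_i * diag(beta).
def sample_ssm_params(alpha, beta, a, b, D, rng):
    K = len(alpha)
    rho = rng.gamma(shape=a, scale=1.0 / b, size=D)            # output precisions
    A = rng.standard_normal((K, K)) / np.sqrt(alpha)           # A[i,k] ~ N(0, 1/alpha_k)
    C = rng.standard_normal((D, K)) / np.sqrt(rho[:, None] * beta[None, :])
    return A, C, rho

rng = np.random.default_rng(1)
alpha = np.array([1.0, 100.0])   # large alpha_k suppresses state k in the dynamics
beta = np.array([1.0, 1.0])
A, C, rho = sample_ssm_params(alpha, beta, a=2.0, b=1.0, D=5, rng=rng)
print(A.shape, C.shape, rho.shape)   # (2, 2) (5, 2) (2,) / (5,)
```

Scaling each row of C by 1/√ρᵢ is exactly the signal-to-noise coupling motivated by conjugacy in the text.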
\nHowever, since the model is conjugate-exponential we can apply Theorem 1 to derive a variational EM algorithm for state-space models analogous to the maximum-likelihood EM algorithm [10]. Moreover, since SSMs are singly connected belief networks, Corollary 1 tells us that we can make use of belief propagation, which in the case of SSMs is known as the Kalman smoother. \n
Writing out the expression for ln P(A, C, ρ, x_{1:T}, y_{1:T}), one sees that it contains interaction terms between ρ and C, but none between A and either ρ or C. This observation implies a further factorisation, Q(A, C, ρ) = Q(A)Q(C, ρ), which falls out of the initial factorisation and the conditional independencies of the model. \n
Starting from some arbitrary distribution over the hidden variables, the VM step obtained by applying Theorem 1 computes the expected natural parameters of Q_θ(θ), where θ = (A, C, ρ). \n\n
³More generally, if we let R be a full covariance matrix, for conjugacy we would give its inverse V = R⁻¹ a Wishart distribution: P(V|ν, S) ∝ |V|^{(ν−D−1)/2} exp{−½ tr VS⁻¹}, where tr is the matrix trace operator. \n\n
We proceed to solve for Q(A). We know from Theorem 1 that Q(A) is multivariate Gaussian, like the prior, so we only need to compute its mean and covariance. A has mean Sᵀ(diag(α) + W)⁻¹ and each row of A has covariance (diag(α) + W)⁻¹, where S = Σ_{t=2}^T ⟨x_{t−1}xₜᵀ⟩, W = Σ_{t=1}^{T−1} ⟨xₜxₜᵀ⟩, and ⟨·⟩ denotes averaging w.r.t. the Q(x_{1:T}) distribution. \n
Q(C, ρ) is also of the same form as the prior. Q(ρ) is a product of Gamma densities, Q(ρᵢ) = G(ρᵢ; ã, b̃ᵢ), where ã = a + T/2, b̃ᵢ = b + gᵢ/2, gᵢ = Σ_{t=1}^T yₜᵢ² − Ūᵢ(diag(β) + W′)⁻¹Ūᵢᵀ, Ūᵢ = Σ_{t=1}^T yₜᵢ⟨xₜᵀ⟩, and W′ = W + ⟨x_T x_Tᵀ⟩. Given ρ, each row of C is Gaussian with covariance Cov(cᵢ) = (diag(β) + W′)⁻¹/ρᵢ and mean c̄ᵢ = ρᵢ Ūᵢ Cov(cᵢ). Note that S, W and Ūᵢ are the expected complete-data sufficient statistics ū mentioned in Theorem 1(a). 
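Given the smoothed sufficient statistics, the Q(A) update is a one-line linear-algebra computation. The sketch below uses made-up stand-ins for the statistics S and W that the Kalman smoother would return:

```python
import numpy as np

# Q(A) from Theorem 1: A has posterior mean S^T (diag(alpha) + W)^-1 and each
# row of A has posterior covariance (diag(alpha) + W)^-1, where
# S = sum_t <x_{t-1} x_t^T> and W = sum_t <x_t x_t^T>.
def q_A(S, W, alpha):
    Sigma = np.linalg.inv(np.diag(alpha) + W)   # shared row covariance
    return S.T @ Sigma, Sigma                   # (mean of A, row covariance)

alpha = np.array([1.0, 1.0])
S = np.array([[10.0, 2.0], [1.0, 8.0]])         # hypothetical smoothed statistics
W = np.array([[20.0, 0.0], [0.0, 20.0]])
mean_A, Sigma = q_A(S, W, alpha)
print(mean_A.shape)   # (2, 2)
```

As αₖ grows, the k-th column of the posterior mean of A is shrunk towards zero regardless of the statistics, which is how the ARD hyperparameters switch state variables out of the dynamics.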
Using the parameter distributions the hyperparameters can also be optimised.⁴ \n
We now turn to the VE step: computing Q(x_{1:T}). Since the model is a conjugate-exponential singly-connected belief network, we can use belief propagation (Corollary 1). For SSMs this corresponds to the Kalman smoothing algorithm, where every appearance of the natural parameters of the model is replaced with the following corresponding expectations under the Q distribution: ⟨ρᵢcᵢ⟩, ⟨ρᵢcᵢcᵢᵀ⟩, ⟨A⟩, ⟨AᵀA⟩. Details can be found in [2]. \n
As for PCA [3], independent components analysis [1], and mixtures of factor analysers [5], the variational Bayesian algorithm for state-space models can be used to learn the structure of the model as well as average over parameters. Specifically, using F it is possible to compare models with different state-space sizes and optimise the dimensionality of the state-space, as we demonstrate in the following section. \n\n
6 Results \n
Experiment 1: The goal of this experiment was to see if the variational method could infer the structure of a variety of state-space models by optimising over α and β. We generated a 200-step time series of 10-dimensional data from three models:⁵ (a) a factor analyser (i.e. an SSM with A = 0) with 3 factors (static state variables); (b) an SSM with 3 dynamical interacting state variables, i.e. A ≠ 0; (c) an SSM with 3 interacting dynamical and 1 static state variables. The variational Bayesian method correctly inferred the structure of each model in 2-3 minutes of CPU time on a 500 MHz Pentium III (Fig. 2 (a)-(c)). \n
Experiment 2: We explored the effect of data set size on the complexity of the recovered structure. 10-dimensional time series were generated from a 6 state-variable SSM. On reducing the length of the time series from 400 to 10 steps, the recovered structure became progressively less complex (Fig. 2(d)-(j)), down to a 1-variable static model (j). 
\nThis result agrees with the Bayesian perspective that the complexity of the model should reflect the data support. \n
Experiment 3 (Steel plant): 38 sensors (temperatures, pressures, etc.) were sampled at 2 Hz from a continuous casting process for 150 seconds. These sensors covaried and were temporally correlated, suggesting that a state-space model could capture some of the structure of the data. The variational algorithm inferred that 16 state variables were required, of which 14 emitted outputs. While we do not know whether this is a reasonable structure, we plan to explore this as well as other real data sets. \n\n
⁴The ARD hyperparameters become αₖ = K/⟨AᵀA⟩ₖₖ and βₖ = D/⟨Cᵀdiag(ρ)C⟩ₖₖ. The hyperparameters a and b solve the fixed-point equations ψ(a) = ln b + (1/D)Σᵢ₌₁ᴰ⟨ln ρᵢ⟩ and 1/b = (1/(aD))Σᵢ₌₁ᴰ⟨ρᵢ⟩, where ψ(w) = ∂/∂w ln Γ(w) is the digamma function. \n
⁵Parameters were chosen as follows: R = I, elements of C sampled from Unif(−5, 5), and A chosen with eigenvalues in [0.5, 0.9]. \n\n
Figure 2: The elements of the A and C matrices after learning are displayed graphically. A link is drawn from node k in x_{t−1} to node l in x_t iff 1/αₖ > ε, and either 1/βₗ > ε or 1/αₗ > ε, for a small threshold ε. Similarly, links are drawn from node k of x_t to y_t iff 1/βₖ > ε. Therefore the graph shows the links that take part in the dynamics and the output. \n\n
7 Conclusions \n
We have derived a general variational Bayesian learning algorithm for models in the conjugate-exponential family. There are a large number of interesting models that fall in this family, and the results in this paper should allow an almost automated protocol for implementing a variational Bayesian treatment of these models. 
\nWe have given one example of such an implementation, state-space models, and shown that the VB algorithm can be used to rapidly infer the hidden state dimensionality. Using the theory laid out in this paper it is straightforward to generalise the algorithm to mixtures of SSMs, switching SSMs, etc. \n
For conjugate-exponential models, integrating both belief propagation and the junction tree algorithm into the variational Bayesian framework simply amounts to computing expectations of the natural parameters. Moreover, the variational Bayesian algorithm contains EM as a special case. We believe this paper provides the foundations for a general algorithm for variational Bayesian learning in graphical models. \n\n
References \n
[1] H. Attias. A variational Bayesian framework for graphical models. In Advances in Neural Information Processing Systems 12. MIT Press, Cambridge, MA, 2000. \n
[2] M.J. Beal and Z. Ghahramani. The variational Kalman smoother. Technical report, Gatsby Computational Neuroscience Unit, University College London, 2000. \n
[3] C.M. Bishop. Variational PCA. In Proc. Ninth ICANN, 1999. \n
[4] S. Frühwirth-Schnatter. Bayesian model discrimination and Bayes factors for linear Gaussian state space models. J. Royal Stat. Soc. B, 57:237-246, 1995. \n
[5] Z. Ghahramani and M.J. Beal. Variational inference for Bayesian mixtures of factor analysers. In Adv. Neur. Inf. Proc. Sys. 12. MIT Press, Cambridge, MA, 2000. \n
[6] G.E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Sixth ACM Conference on Computational Learning Theory, Santa Cruz, 1993. \n
[7] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, and L.K. Saul. An introduction to variational methods in graphical models. Machine Learning, 37:183-233, 1999. \n
[8] D.J.C. MacKay. Ensemble learning for hidden Markov models. 
Technical report, Cavendish Laboratory, University of Cambridge, 1997. \n
[9] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988. \n
[10] R.H. Shumway and D.S. Stoffer. An approach to time series smoothing and forecasting using the EM algorithm. J. Time Series Analysis, 3(4):253-264, 1982. \n
[11] S. Waterhouse, D.J.C. MacKay, and T. Robinson. Bayesian methods for mixtures of experts. In Adv. Neur. Inf. Proc. Sys. 7. MIT Press, 1995. \n", "award": [], "sourceid": 1907, "authors": [{"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Matthew", "family_name": "Beal", "institution": null}]}