{"title": "A Tighter Bound for Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 266, "page_last": 272, "abstract": null, "full_text": "A tighter bound for graphical models \n\nM.A.R. Leisink* and H.J. Kappent \n\nDepartment of Biophysics \n\nUniversity of Nijmegen, Geert Grooteplein 21 \n\nNL 6525 EZ Nijmegen, The Netherlands \n\n{martijn,bert}Cmbfys.kun.nl \n\nAbstract \n\nWe present a method to bound the partition function of a Boltz(cid:173)\nmann machine neural network with any odd order polynomial. This \nis a direct extension of the mean field bound, which is first order. \nWe show that the third order bound is strictly better than mean \nfield. Additionally we show the rough outline how this bound is \napplicable to sigmoid belief networks. Numerical experiments in(cid:173)\ndicate that an error reduction of a factor two is easily reached in \nthe region where expansion based approximations are useful. \n\n1 \n\nIntroduction \n\nGraphical models have the capability to model a large class of probability distri(cid:173)\nbutions. The neurons in these networks are the random variables, whereas the \nconnections between them model the causal dependencies. Usually, some of the \nnodes have a direct relation with the random variables in the problem and are \ncalled 'visibles'. The other nodes, known as 'hiddens', are used to model more \ncomplex probability distributions. \n\nLearning in graphical models can be done as long as the likelihood that the visibles \ncorrespond to a pattern in the data set, can be computed. In general the time it \ntakes, scales exponentially with the number of hidden neurons. For such architec(cid:173)\ntures one has no other choice than using an approximation for the likelihood. \n\nA well known approximation technique from statistical mechanics, called Gibbs \nsampling, was applied to graphical models in [1]. 
More recently, the mean field approximation known from physics was derived for sigmoid belief networks [2]. For this type of graphical model the parental dependency of a neuron is modelled by a non-linear (sigmoidal) function of the weighted parent states [3]. It turns out that the mean field approximation has the nice feature that it bounds the likelihood from below. This is useful for learning, since a maximisation of the bound either increases its accuracy or increases the likelihood for a pattern in the data set, which is the actual learning process. \n\nIn this article we show that it is possible to improve the mean field approximation without losing the bounding properties. In section 2 we show the general theory to create a new bound using an existing one, which is applied to a Boltzmann machine in section 3. Boltzmann machines are another type of graphical model. In contrast with belief networks the connections are symmetric and not directed [4]. A mean field approximation for this type of neural network was already described in [5]. An improvement of this approximation was found by Thouless, Anderson and Palmer in [6], and was applied to Boltzmann machines in [7]. Unfortunately, this so-called TAP approximation is not a bound. We apply our method to the mean field approximation, which results in a third order bound. We prove that the latter is always tighter. \n\nDue to the limited space it is not possible to discuss the third order bound for sigmoid belief networks in much detail. Instead, we show the general outline and focus more on the experimental results in section 5. Finally, in section 6, we present our conclusions. \n\n*http://www.mbfys.kun.nl/~martijn \n†http://www.mbfys.kun.nl/~bert \n\n2 Higher order bounds \n\nSuppose we have a function f_0(x) and a bound b_0(x) such that ∀x f_0(x) ≥ b_0(x). 
\nLet f_1(x) and b_1(x) be two primitive functions of f_0(x) and b_0(x), \n\nf_1(x) = ∫ dx f_0(x) and b_1(x) = ∫ dx b_0(x) (1) \n\nsuch that f_1(ν) = b_1(ν) for some ν. Note that we can always add an appropriate constant to the primitive functions such that they are indeed equal at x = ν. Since the surface under f_0(x) at the left as well as at the right of x = ν is obviously greater than the surface under b_0(x), and the primitive functions are equal at x = ν (by construction), we know \n\nf_1(x) ≤ b_1(x) for x ≤ ν and f_1(x) ≥ b_1(x) for x ≥ ν (2) \n\nor in shorthand notation f_1(x) ⋛ b_1(x). It is important to understand that the above result holds even if f_0(ν) > b_0(ν). Therefore we are completely free to choose ν. If we repeat this and let f_2(x) and b_2(x) be two primitive functions of f_1(x) and b_1(x), again such that f_2(ν) = b_2(ν), one can easily verify that ∀x f_2(x) ≥ b_2(x). Thus, given a lower bound of f_0(x), we can create another lower bound. In case the given bound is a polynomial of degree k, the new bound is a polynomial of degree k+2 with one additional variational parameter. \n\nTo illustrate this procedure, we derive a third order bound on the exponential function starting with the well known linear bound: the tangent of the exponential function at x = ν. Using the procedure above we derive \n\n∀x,ν f_0(x) = e^x ≥ e^ν (1 + x − ν) = b_0(x) (3) \n\nf_1(x) = e^x ⋛ e^μ + e^ν ( (1 + μ − ν)(x − μ) + ½ (x − μ)² ) = b_1(x) (4) \n\n∀x,μ,λ f_2(x) = e^x ≥ e^μ { 1 + x − μ + e^λ ( (1−λ)/2 (x − μ)² + ⅙ (x − μ)³ ) } = b_2(x) (5) \n\nwith λ = ν − μ. \n\n3 Boltzmann machines \n\nIn this section we derive a third order lower bound on the partition function of a Boltzmann machine neural network using the results from the previous section. 
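Before applying it, the chain of bounds in equations 3-5 is easy to sanity-check numerically. A minimal sketch (the grid of test values is illustrative, not from the paper) verifies that b_0 and b_2 stay below e^x for every x and for arbitrary choices of the variational parameters:

```python
import math

def b0(x, nu):
    # First order (tangent) bound of equation 3: e^x >= e^nu (1 + x - nu).
    return math.exp(nu) * (1.0 + x - nu)

def b2(x, mu, lam):
    # Third order bound of equation 5, with lam = nu - mu.
    d = x - mu
    return math.exp(mu) * (1.0 + d + math.exp(lam) * ((1.0 - lam) / 2.0 * d ** 2 + d ** 3 / 6.0))

# The bounds hold for every x and for arbitrary variational parameters.
for x in (-3.0, -0.5, 0.0, 0.7, 2.5):
    for mu in (-1.0, 0.0, 1.0):
        for lam in (-0.5, 0.0, 0.8):
            assert b0(x, mu + lam) <= math.exp(x) + 1e-12
            assert b2(x, mu, lam) <= math.exp(x) + 1e-12
print("bounds verified")
```

Note that b_2(μ) = e^μ exactly: the bound is tight at x = μ, which is why a good choice of μ(s) matters below.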
\nThe probability to find a Boltzmann machine in a state s ∈ {−1, +1}^N is given by \n\nP(s) = (1/Z) exp(−E(s)) = (1/Z) exp( ½ θ_ij s_i s_j + θ_i s_i ) (6) \n\nThere is an implicit summation over all repeated indices (Einstein's convention). Z = Σ_{all s} exp(−E(s)) is the normalisation constant known as the partition function, which requires a sum over all, exponentially many, states. Therefore this sum is intractable to compute even for rather small networks. \n\nTo compute the partition function approximately, we use the third order bound¹ from equation 5. We obtain \n\nZ = Σ_s exp(−E(s)) ≥ Σ_s exp(μ(s)) { 1 − ΔE + e^{λ(s)} ( (1−λ(s))/2 ΔE² − ⅙ ΔE³ ) } (7) \n\nwhere ΔE = μ(s) + E(s). Note that the former constants μ and λ are now functions of s, since we may take different values for μ and λ for each term in the sum. In principle these functions can take any form. If we take, for instance, μ(s) = −E(s), the approximation is exact. This would lead, however, to the same intractability as before and therefore we must restrict our choice to those functions that make equation 7 tractable to compute. We choose μ(s) and λ(s) to be linear with respect to the neuron states s_i: \n\nμ(s) = μ_i s_i + μ_0 and λ(s) = λ_i s_i + λ_0 (8) \n\nOne may view μ(s) and λ(s) as (the negative of) the energy functions for the Boltzmann distributions P ∝ exp(μ(s)) and P ∝ exp(λ(s)). Therefore we will sometimes speak of 'the distribution μ(s)'. Since these linear energy functions correspond to factorised distributions, we can compute the right hand side of equation 7 in a reasonable time, O(N³). \n\nTo obtain the tightest bound, we may maximise equation 7 with respect to its variational parameters μ_0, μ_i, λ_0 and λ_i. \n\nA special case of the third order bound \n\nAlthough it is possible to choose λ_i ≠ 0, we will set them to the suboptimal value λ_i = 0, since this simplifies equation 7 enormously. The reader should keep in mind, however, that all calculations could be done with non-zero λ_i. 
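To make the intractability of equation 6 concrete: for a network small enough to enumerate, Z can still be computed exactly by brute force, which also provides ground truth against which any bound can be checked. A minimal sketch (the weights and thresholds are arbitrary illustrative values, not from the paper):

```python
import itertools
import math

N = 8  # the sum below has 2^N terms, so this only works for small N

# Symmetric weights theta_ij (zero diagonal) and thresholds theta_i.
theta = [[0.0] * N for _ in range(N)]
for i in range(N):
    for j in range(i + 1, N):
        theta[i][j] = theta[j][i] = 0.2 / N
h = [0.1 * (i - N / 2) for i in range(N)]

def minus_energy(s):
    # -E(s) = 1/2 theta_ij s_i s_j + theta_i s_i (implicit summation), equation 6
    quad = sum(theta[i][j] * s[i] * s[j] for i in range(N) for j in range(N))
    return 0.5 * quad + sum(h[i] * s[i] for i in range(N))

# Partition function: exhaustive sum over all states s in {-1,+1}^N.
Z = sum(math.exp(minus_energy(s)) for s in itertools.product((-1, 1), repeat=N))
print("log Z =", math.log(Z))
```

A quick consistency check on any lower bound one tries: the trivial factorised choice μ_i = 0 already guarantees log Z ≥ N log 2 for these parameters.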
Given this choice we can compute the optimal values for μ_0 and λ_0, given by \n\nμ_0 = −⟨E + μ_i s_i⟩ and λ_0 = −⟨ΔE³⟩ / (3 ⟨ΔE²⟩) (9) \n\nwhere ⟨·⟩ denotes an average over the (factorised) distribution μ(s). Using this solution the bound reduces to the simple form \n\nlog Z ≥ log Z_μ + log { 1 + ½ e^{λ_0} ⟨ΔE²⟩ } (10) \n\nwhere Z_μ is the partition function of the distribution μ(s). The term ⟨ΔE²⟩ corresponds to the variance of E + μ_i s_i with respect to the distribution μ(s), since μ_0 = −⟨E + μ_i s_i⟩. λ_0 is proportional to the third order moment according to equation 9. Explicit expressions for these moments can be derived with patience. \n\n¹Using the first order bound from equation 3 results in the standard mean field bound. \n\nThere is no explicit expression for the optimal μ_i, as is the case with the standard mean field equations. An implicit expression, however, follows from setting the derivative with respect to μ_i to zero. We solved for μ_i numerically by iteration. Wherever we speak of 'fully optimised', we refer to this solution for μ_i. \n\nConnection with standard mean field and TAP \n\nWe would like to focus for a moment on the suboptimal case where the μ_i correspond to the mean field solution, given by \n\n∀i m_i ≡ tanh μ_i = tanh( θ_i + θ_ij m_j ) (11) \n\nFor this choice of μ_i the log Z_μ term in equation 10 is equal to the optimal mean field bound². Since the last term in equation 10 is always positive, we conclude that the third order bound is always tighter than the mean field bound. \n\nThe relation between TAP and the third order bound is clear in the region of small weights. If we assume that O(θ_ij³) is negligible, a small weight expansion of equation 10 yields \n\nlog Z ≥ log Z_μ + log { 1 + ½ e^{λ_0} ⟨ΔE²⟩ } ≈ log Z_μ + ¼ θ_ij² (1 − m_i²)(1 − m_j²) (12) \n\nwhere the last term is equal to the TAP correction term [7]. Thus the third order bound tends to the TAP approximation for small weights. 
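The ordering just derived, mean field bound ≤ third order bound ≤ log Z, can be checked numerically on a small random network. The sketch below (illustrative parameters; enumeration is used only to evaluate the averages ⟨·⟩ and the exact log Z) iterates the mean field equations 11 and then applies equations 9 and 10:

```python
import itertools
import math
import random

random.seed(0)
N = 6
theta = [[0.0] * N for _ in range(N)]
for i in range(N):
    for j in range(i + 1, N):
        theta[i][j] = theta[j][i] = random.gauss(0.0, 0.3) / math.sqrt(N)
h = [random.gauss(0.0, 0.1) for _ in range(N)]

def minus_E(s):
    # -E(s) as in equation 6
    return 0.5 * sum(theta[i][j] * s[i] * s[j] for i in range(N) for j in range(N)) \
        + sum(h[i] * s[i] for i in range(N))

states = list(itertools.product((-1, 1), repeat=N))
logZ = math.log(sum(math.exp(minus_E(s)) for s in states))  # exact, for comparison

# Mean field fixed point, equation 11: m_i = tanh(theta_i + theta_ij m_j).
m = [0.0] * N
for _ in range(200):
    m = [math.tanh(h[i] + sum(theta[i][j] * m[j] for j in range(N))) for i in range(N)]
mu = [math.atanh(mi) for mi in m]  # mu_i such that m_i = tanh(mu_i)

def p_mu(s):
    # factorised distribution mu(s), with means m_i
    return math.prod((1.0 + m[i] * s[i]) / 2.0 for i in range(N))

def avg(f):
    return sum(p_mu(s) * f(s) for s in states)

# Equation 9: mu_0 = -<E + mu_i s_i>, lambda_0 = -<dE^3> / (3 <dE^2>).
mu0 = -avg(lambda s: -minus_E(s) + sum(mu[i] * s[i] for i in range(N)))
dE = lambda s: mu0 + sum(mu[i] * s[i] for i in range(N)) - minus_E(s)
dE2, dE3 = avg(lambda s: dE(s) ** 2), avg(lambda s: dE(s) ** 3)
lam0 = -dE3 / (3.0 * dE2)

logZ_mu = mu0 + sum(math.log(2.0 * math.cosh(mu[i])) for i in range(N))  # mean field bound
third = logZ_mu + math.log(1.0 + 0.5 * math.exp(lam0) * dE2)  # equation 10

assert logZ_mu <= third <= logZ + 1e-9
print(logZ_mu, third, logZ)
```

Both inequalities hold for any means m_i, converged or not, since equations 9 and 10 are valid for an arbitrary factorised μ(s); the fixed point iteration merely makes log Z_μ the optimal mean field bound.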
For larger weights, however, the TAP approximation overestimates the partition function, whereas the third order approximation is still a bound. \n\n4 Sigmoid belief networks \n\nIn the previous section we saw how to derive a third order bound on the partition function. For sigmoid belief networks³ we can use the same strategy to obtain a third order bound on the likelihood of the visible neurons of the network to be in a particular state. In this article, we present the rough outline of our method. The full derivation will be presented elsewhere. \n\nIt turns out that these graphical models are comparable to Boltzmann machines to a large extent. The energy function E(s) (as in equation 6), however, differs for sigmoid belief networks: \n\n−E(s) = θ_ij s_i s_j + θ_i s_i − Σ_p log 2cosh( θ_pi s_i + θ_p ) (13) \n\nThe last term, known as the local normalisation, does not appear in the Boltzmann machine energy function. We have similar difficulties as with the Boltzmann machine if we want to compute the log-likelihood, given by \n\nlog L = log Σ_{s ∈ Hidden} P(s) = log Σ_{s ∈ Hidden} exp(−E(s)) (14) \n\n²Be aware of the fact that μ(s) contains the parameter μ_0 = −⟨E + μ_i s_i⟩. This gives an important contribution to the expression for log Z_μ. \n³A detailed description of these networks can be found in [3]. \n\nWe can use this result to derive a bound on the partition function, where the variational parameters can be seen as energy functions for probability distributions. If we choose those distributions to be factorised, we have (N + 1)k new variational parameters. Since the approximating function is a bound, we may maximise it with respect to all these parameters. \n\nIn this article we restricted ourselves to the third order bound, although an extension to any odd order bound is possible. 
Third order is the next higher order bound after naive mean field. We showed that this bound is strictly better than the mean field bound and tends to the TAP approximation for small weights. For larger weights, however, the TAP approximation crosses the partition function and violates the bounding properties. \n\nWe saw that the third order bound gives an enormous improvement compared to mean field. Our results are comparable to those obtained by the structured approach in [10]. The choice between third order and variational structures, however, is not exclusive. We expect that a combination of both methods is a promising research direction to obtain the tightest tractable bound. \n\nAcknowledgements \n\nThis research is supported by the Technology Foundation STW, applied science division of NWO, and the technology programme of the Ministry of Economic Affairs. \n\nReferences \n\n[1] J. Pearl. Probabilistic Reasoning in Intelligent Systems, chapter 8.2.1, pages 387-390. Morgan Kaufmann, San Francisco, 1988. \n[2] L.K. Saul, T.S. Jaakkola, and M.I. Jordan. Mean field theory for sigmoid belief networks. Technical Report 1, Computational Cognitive Science, 1995. \n[3] R. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56:71-113, 1992. \n[4] D. Ackley, G. Hinton, and T. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147-169, 1985. \n[5] C. Peterson and J. Anderson. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995-1019, 1987. \n[6] D.J. Thouless, P.W. Anderson, and R.G. Palmer. Solution of 'solvable model of a spin glass'. Philosophical Magazine, 35(3):593-601, 1977. \n[7] H.J. Kappen and F.B. Rodriguez. Boltzmann machine learning using mean field theory and linear response correction. In M.S. Kearns, S.A. Solla, and D.A. 
Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 280-286. MIT Press, 1999. \n[8] D. Sherrington and S. Kirkpatrick. Solvable model of a spin-glass. Physical Review Letters, 35(26):1793-1796, 1975. \n[9] M.A.R. Leisink and H.J. Kappen. Validity of TAP equations in neural networks. In ICANN 99, volume 1, pages 425-430. Institution of Electrical Engineers, London, 1999. ISBN 0852967217. \n[10] D. Barber and W. Wiegerinck. Tractable variational structures for approximating graphical models. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 183-189. MIT Press, 1999. \n", "award": [], "sourceid": 1914, "authors": [{"given_name": "Martijn", "family_name": "Leisink", "institution": null}, {"given_name": "Hilbert", "family_name": "Kappen", "institution": null}]}