{"title": "Fast Learning by Bounding Likelihoods in Sigmoid Type Belief Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 528, "page_last": 534, "abstract": null, "full_text": "Fast Learning by Bounding Likelihoods \n\nin Sigmoid Type Belief Networks \n\nTommi Jaakkola \n\ntommi@psyche.mit.edu \n\nLawrence K. Saul \nlksaul@psyche.mit.edu \n\nMichael I. Jordan \njordan@psyche.mit.edu \n\nDepartment of Brain and Cognitive Sciences \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \n\nAbstract \n\nSigmoid type belief networks, a class of probabilistic neural net(cid:173)\nworks, provide a natural framework for compactly representing \nprobabilistic information in a variety of unsupervised and super(cid:173)\nvised learning problems. Often the parameters used in these net(cid:173)\nworks need to be learned from examples. Unfortunately, estimat(cid:173)\ning the parameters via exact probabilistic calculations (i.e, the \nEM-algorithm) is intractable even for networks with fairly small \nnumbers of hidden units. We propose to avoid the infeasibility of \nthe E step by bounding likelihoods instead of computing them ex(cid:173)\nactly. We introduce extended and complementary representations \nfor these networks and show that the estimation of the network \nparameters can be made fast (reduced to quadratic optimization) \nby performing the estimation in either of the alternative domains. \nThe complementary networks can be used for continuous density \nestimation as well. \n\n1 \n\nIntroduction \n\nThe appeal of probabilistic networks for knowledge representation, inference, and \nlearning (Pearl, 1988) derives both from the sound Bayesian framework and from \nthe explicit representation of dependencies among the network variables which al(cid:173)\nlows ready incorporation of prior information into the design of the network. 
The Bayesian formalism permits full propagation of probabilistic information across the network regardless of which variables in the network are instantiated. In this sense these networks can be \"inverted\" probabilistically. \n\nThis inversion, however, relies heavily on the use of look-up table representations \n\n\f\nof conditional probabilities or representations equivalent to them for modeling dependencies between the variables. For sparse dependency structures such as trees or chains this poses no difficulty. In more realistic cases of reasonably interdependent variables the exact algorithms developed for these belief networks (Lauritzen & Spiegelhalter, 1988) become infeasible due to the exponential growth in the size of the conditional probability tables needed to store the exact dependencies. Therefore the use of compact representations to model probabilistic interactions is unavoidable in large problems. As belief network models move away from tables, however, the representations can be harder to assess from expert knowledge and the important role of learning is further emphasized. \n\nCompact representations of interactions between simple units have long been emphasized in neural networks. Lacking a thorough probabilistic interpretation, however, classical feed-forward neural networks cannot be inverted in the above sense; e.g., given the output pattern of a feed-forward neural network it is not feasible to compute a probability distribution over the possible input patterns that would have resulted in the observed output. On the other hand, stochastic neural networks such as Boltzmann machines admit probabilistic interpretations and therefore, at least in principle, can be inverted and used as a basis for inference and learning in the presence of uncertainty. 
\n\nSigmoid belief networks (Neal, 1992) form a subclass of probabilistic neural networks where the activation function has a sigmoidal form - usually the logistic function. Neal (1992) proposed a learning algorithm for these networks which can be viewed as an improvement of the algorithm for Boltzmann machines. Recently Hinton et al. (1995) introduced the wake-sleep algorithm for layered bi-directional probabilistic networks. This algorithm relies on forward sampling and has an appealing coding theoretic motivation. The Helmholtz machine (Dayan et al., 1995), on the other hand, can be seen as an alternative technique for these architectures that avoids Gibbs sampling altogether. Dayan et al. also introduced the important idea of bounding likelihoods instead of computing them exactly. Saul et al. (1995) subsequently derived rigorous mean field bounds for the likelihoods. In this paper we introduce the idea of alternative - extended and complementary - representations of these networks by reinterpreting the nonlinearities in the activation function. We show that deriving likelihood bounds in the new representational domains leads to efficient (quadratic) estimation procedures for the network parameters. \n\n2 The probability representations \n\nBelief networks represent the joint probability of a set of variables {S} as a product of conditional probabilities given by \n\nP(S_1, \\ldots, S_n) = \\prod_{k=1}^{n} P(S_k | pa[k])     (1) \n\nwhere the notation pa[k], \"parents of S_k\", refers to all the variables that directly influence the probability of S_k taking on a particular value (for equivalent representations, see Lauritzen et al. 1988). The fact that the joint probability can be written in the above form implies that there are no \"cycles\" in the network; i.e., there exists an ordering of the variables in the network such that no variable directly influences any preceding variables. 
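The factorization in equation 1 can be made concrete with a toy three-variable chain. The conditional tables below are illustrative assumptions, not taken from the paper; summing the product of conditionals over all configurations confirms that the factorized joint is properly normalized:

```python
import itertools

# Toy 3-variable binary belief network, ordered so that each variable
# is conditioned only on its predecessors (no "cycles"): S1 -> S2 -> S3.
# All probability values are hypothetical, for illustration only.
def p1(s1):          # P(S1)
    return 0.6 if s1 == 1 else 0.4

def p2(s2, s1):      # P(S2 | S1)
    p_on = 0.9 if s1 == 1 else 0.2
    return p_on if s2 == 1 else 1.0 - p_on

def p3(s3, s2):      # P(S3 | S2)
    p_on = 0.7 if s2 == 1 else 0.1
    return p_on if s3 == 1 else 1.0 - p_on

def joint(s1, s2, s3):
    # Equation (1): the joint is the product of conditionals given parents.
    return p1(s1) * p2(s2, s1) * p3(s3, s2)

total = sum(joint(*s) for s in itertools.product([0, 1], repeat=3))
print(total)  # sums to 1 (up to floating point)
```

Because each factor is a normalized conditional, normalization of the joint comes for free, which is what makes the product form a valid probability model.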
\nIn this paper we consider sigmoid belief networks where the variables S are binary \n\n\f\n(0/1), the conditional probabilities have the form \n\nP(S_i | pa[i]) = g\\left( (2S_i - 1) \\sum_j W_{ij} S_j \\right)     (2) \n\nand the weights W_{ij} are zero unless S_j is a parent of S_i, thus preserving the feed-forward directionality of the network. For notational convenience we have assumed the existence of a bias variable whose value is clamped to one. The activation function g(.) is chosen to be the cumulative Gaussian distribution function given by \n\ng(x) = \\frac{1}{\\sqrt{2\\pi}} \\int_{-\\infty}^{x} e^{-\\frac{1}{2} z^2} \\, dz = \\frac{1}{\\sqrt{2\\pi}} \\int_{0}^{\\infty} e^{-\\frac{1}{2} (z - x)^2} \\, dz     (3) \n\nAlthough very similar to the standard logistic function, this activation function derives a number of advantages from its integral representation. In particular, we may reinterpret the integration as a marginalization and thereby obtain alternative representations for the network. We consider two such representations. \n\nWe derive an extended representation by making explicit the nonlinearities in the activation function. More precisely, \n\nP(S_i | pa[i]) = g\\left( (2S_i - 1) \\sum_j W_{ij} S_j \\right) = \\int_{0}^{\\infty} \\frac{1}{\\sqrt{2\\pi}} e^{-\\frac{1}{2} \\left( Z_i - (2S_i - 1) \\sum_j W_{ij} S_j \\right)^2} \\, dZ_i     (4) \n\nThis suggests defining the extended network in terms of the new conditional probabilities P(S_i, Z_i | pa[i]). By construction then the original binary network is obtained by marginalizing over the extra variables Z. In this sense the extended network is (marginally) equivalent to the binary network. \n\nWe distinguish a complementary representation from the extended one by writing the probabilities entirely in terms of continuous variables.(1) Such a representation can be obtained from the extended network by a simple transformation of variables. The new continuous variables are defined by \\tilde{Z}_i = (2S_i - 1) Z_i, or, equivalently, by Z_i = |\\tilde{Z}_i| and S_i = \\theta(\\tilde{Z}_i), where \\theta(.) is the step function. 
Performing this transformation yields \n\nP(\\tilde{Z}_i | pa[i]) = \\frac{1}{\\sqrt{2\\pi}} e^{-\\frac{1}{2} \\left( |\\tilde{Z}_i| - \\sum_j W_{ij} \\theta(\\tilde{Z}_j) \\right)^2}     (5) \n\nwhich defines a network of conditionally Gaussian variables. The original network in this case can be recovered by conditional marginalization over \\tilde{Z} where the conditioning variables are \\theta(\\tilde{Z}). \n\nFigure 1 below summarizes the relationships between the different representations. As will become clear later, working with the alternative representations instead of the original binary representation can lead to more flexible and efficient (least-squares) parameter estimation. \n\n(1) While the binary variables are the outputs of each unit the continuous variables pertain to the inputs - hence the name complementary. \n\n\f\n[Figure 1: a diagram relating the three representations: the extended network over {S, Z} yields the original network over {S} by marginalization over Z, and the complementary network over {Z} by a transformation of variables.] \n\nFigure 1: The relationship between the alternative representations. \n\n3 The learning problem \n\nWe consider the problem of learning the parameters of the network from instantiations of variables contained in a training set. Such instantiations, however, need not be complete; there may be variables that have no value assignments in the training set as well as variables that are always instantiated. The tacit division between hidden (H) and visible (V) variables therefore depends on the particular training example considered and is not an intrinsic property of the network. \n\nTo learn from these instantiations we adopt the principle of maximum likelihood to estimate the weights in the network. In essence, this is a density estimation problem where the weights are chosen so as to match the probabilistic behavior of the network with the observed activities in the training set. 
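The probabilistic behavior being matched can be illustrated by ancestral sampling from the complementary network of equation 5: each continuous variable is a unit-variance Gaussian whose mean is a weighted sum of thresholded parents, and the binary units are recovered by the step function. The weight matrix and the helper `sample_complementary` below are our own illustrative sketch, not the paper's code:

```python
import random

def theta(z):
    # Step function: S_i = theta(Z~_i) recovers the binary unit.
    return 1.0 if z > 0 else 0.0

def sample_complementary(W, n_samples, seed=0):
    """Ancestral sampling in the complementary network (equation 5):
    Z~_i is a unit-variance Gaussian with mean sum_j W[i][j]*theta(Z~_j)
    over already-sampled parents. W is a list of rows of parent weights
    (toy values, our own); row i holds the weights from units j < i."""
    rng = random.Random(seed)
    n = len(W)
    on_freq = [0.0] * n
    for _ in range(n_samples):
        z = [0.0] * n
        for i in range(n):
            mean = sum(W[i][j] * theta(z[j]) for j in range(i))
            z[i] = rng.gauss(mean, 1.0)
            on_freq[i] += theta(z[i])
    return [f / n_samples for f in on_freq]

# With no incoming weight, a unit fires with probability g(0) = 1/2; with
# a weight of 2 from a parent that is on, it fires with probability g(2).
freqs = sample_complementary([[], [2.0]], n_samples=20000)
print(freqs)
```

Marginalizing the samples over the signs reproduces the binary network's statistics, which is the "marginal equivalence" the section describes.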
Central to this estimation is the ability to compute likelihoods (or log-likelihoods) for any (partial) configuration of variables appearing in the training set. In other words, if we let X^V be the configuration of visible or instantiated variables(2) and X^H denote the hidden or uninstantiated variables, we need to compute marginal probabilities of the form \n\nP(X^V) = \\sum_{X^H} P(X^H, X^V)     (6) \n\nIf the training samples are independent, then these log marginals can be added to give the overall log-likelihood of the training set \n\n\\log P(\\text{training set}) = \\sum_t \\log P(X^V_t)     (7) \n\nUnfortunately, computing each of these marginal probabilities involves summing (integrating) over an exponential number of different configurations assumed by the hidden variables in the network. This renders the sum (integration) intractable in all but a few special cases (e.g. trees and chains). It is possible, however, to instead find a manageable lower bound on the log-likelihood and optimize the weights in the network so as to maximize this bound. \n\nTo obtain such a lower bound we resort to Jensen's inequality: \n\n\\log P(X^V) = \\log \\sum_{X^H} P(X^H, X^V) = \\log \\sum_{X^H} Q(X^H) \\frac{P(X^H, X^V)}{Q(X^H)} \\geq \\sum_{X^H} Q(X^H) \\log \\frac{P(X^H, X^V)}{Q(X^H)}     (8) \n\nAlthough this bound holds for all distributions Q(X) over the hidden variables, the accuracy of the bound is determined by how closely Q approximates the posterior distribution P(X^H | X^V) in terms of the Kullback-Leibler divergence; if the approximation is perfect the divergence is zero and the inequality is satisfied with equality. Suitable choices for Q can make the bound both accurate and easy to compute. The feasibility of finding such Q, however, is highly dependent on the choice of the representation for the network. \n\n(2) To postpone the issue of representation we use X to denote S, {S, Z}, or Z depending on the particular representation chosen. \n\n\f\n4 Likelihood bounds in different representations \n\nTo complete the derivation of the likelihood bound (equation 8) we need to fix the representation for the network. Which representation to select, however, affects the quality and accuracy of the bound. In addition, the accompanying bound of the chosen representation implies bounds in the other two representational domains as they all code the same distributions over the observables. In this section we illustrate these points by deriving bounds in the complementary and extended representations and discuss the corresponding bounds in the original binary domain. \n\nNow, to obtain a lower bound we need to specify the approximate posterior Q. In the complementary representation the conditional probabilities are Gaussians and therefore a reasonable approximation (mean field) is found by choosing the posterior approximation from the family of factorized Gaussians: \n\nQ(\\tilde{Z}) = \\prod_i \\frac{1}{\\sqrt{2\\pi}} e^{-(\\tilde{Z}_i - h_i)^2 / 2}     (9) \n\nSubstituting this into equation 8 we obtain the bound \n\n\\log P(S^*) \\geq -\\frac{1}{2} \\sum_i \\left( h_i - \\sum_j J_{ij} \\, g(h_j) \\right)^2 - \\frac{1}{2} \\sum_{ij} J_{ij}^2 \\, g(h_j) \\, g(-h_j)     (10) \n\nThe means h_i for the hidden variables are adjustable parameters that can be tuned to make the bound as tight as possible. For the instantiated variables we need to enforce the constraints g(h_i) = S_i^* to respect the instantiation. These can be satisfied very accurately by setting h_i = 4(2S_i^* - 1). A very convenient property of this bound and the complementary representation in general is the quadratic weight dependence - a property very conducive to fast learning. Finally, we note that the complementary representation transforms the binary estimation problem into a continuous density estimation problem. \n\nWe now turn to the interpretation of the above bound in the binary domain. 
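Before turning to that interpretation, the quadratic form of the bound (10) can be made concrete. The sketch below (toy weights and means of our own choosing, not the paper's experiment) evaluates the right-hand side of (10) directly, with the cumulative Gaussian activation of equation 3 written via the error function:

```python
import math

def g(x):
    # Cumulative Gaussian activation, equation (3).
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bound(J, h):
    """Right-hand side of the lower bound (10) on log P(S*):
    quadratic in the weights J for fixed variational means h."""
    n = len(h)
    fit = sum((h[i] - sum(J[i][j] * g(h[j]) for j in range(n))) ** 2
              for i in range(n))
    var = sum(J[i][j] ** 2 * g(h[j]) * g(-h[j])
              for i in range(n) for j in range(n))
    return -0.5 * (fit + var)

# Toy 2-unit network (illustrative values). Clamping a visible unit to
# S* = 1 corresponds to h = 4(2 S* - 1) = 4, so that g(h) is nearly 1.
J = [[0.0, 0.0], [1.5, 0.0]]
h = [4.0, 1.5]          # unit 1 clamped on, unit 2's mean left free
print(bound(J, h))      # a finite lower bound on the log-likelihood
```

For fixed h the expression is a quadratic in the entries of J, so maximizing it over the weights is a least-squares problem; tuning the free means h tightens the bound.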
The same bound can be obtained by first fixing the inputs to all the units to be the means h_i and then computing the negative total mean squared error between the fixed inputs and the corresponding probabilistic inputs propagated from the parents. The fact that this procedure in fact gives a lower bound on the log-likelihood would be more difficult to justify by working with the binary representation alone. \n\nIn the extended representation the probability distribution for Z_i is a truncated Gaussian given S_i and its parents. We therefore propose the partially factorized posterior approximation: \n\nQ(S, Z) = \\prod_i Q(S_i) \\, Q(Z_i | S_i)     (11) \n\nwhere Q(Z_i | S_i) is a truncated Gaussian: \n\nQ(Z_i | S_i) = \\frac{1}{g((2S_i - 1) h_i)} \\frac{1}{\\sqrt{2\\pi}} e^{-\\frac{1}{2} (Z_i - (2S_i - 1) h_i)^2}, \\quad Z_i \\geq 0     (12) \n\nAs in the complementary domain the resulting bound depends quadratically on the weights. Instead of writing out the bound here, however, it is more informative to see its derivation in the binary domain. \n\nA factorized posterior approximation (mean field) Q(S) = \\prod_i q_i^{S_i} (1 - q_i)^{1 - S_i} for the binary network yields a bound \n\n\\log P(S^*) \\geq \\sum_i \\left\\{ \\left\\langle S_i \\log g(\\sum_j J_{ij} S_j) \\right\\rangle + \\left\\langle (1 - S_i) \\log (1 - g(\\sum_j J_{ij} S_j)) \\right\\rangle \\right\\}     (13) \n\n\f\nwhere the averages \\langle . \\rangle are with respect to the Q distribution. These averages, however, do not conform to analytical expressions. The tractable posterior approximation in the extended domain avoids the problem by implicitly making the following Legendre transformation: \n\n\\log g(x) = \\left[ \\frac{1}{2} x^2 + \\log g(x) \\right] - \\frac{1}{2} x^2 \\geq \\lambda x - G(\\lambda) - \\frac{1}{2} x^2     (14) \n\nwhich holds since x^2/2 + \\log g(x) is a convex function. Inserting this back into the relevant parts of equation 13 and performing the averages gives \n\n\\log P(S^*) \\geq \\sum_i \\left\\{ \\left[ q_i \\lambda_i - (1 - q_i) \\bar{\\lambda}_i \\right] \\sum_j J_{ij} q_j - q_i G(\\lambda_i) - (1 - q_i) G(\\bar{\\lambda}_i) \\right\\} - \\frac{1}{2} \\sum_i \\left( \\sum_j J_{ij} q_j \\right)^2 - \\frac{1}{2} \\sum_{ij} J_{ij}^2 \\, q_j (1 - q_j)     (15) \n\nwhich is quadratic in the weights as expected. The mean activities q for the hidden variables and the parameters \\lambda can be optimized to make the bound tight. For the instantiated variables we set q_i = S_i^*. \n\n5 Numerical experiments \n\nTo test these techniques in practice we applied the complementary network to the problem of detecting motor failures from spectra obtained during motor operation (see Petsche et al. 1995). We cast the problem as a continuous density estimation problem. The training set consisted of 800 out of 1283 FFT spectra each with 319 components measured from an electric motor in a good operating condition but under varying loads. The test set included the remaining 483 FFTs from the same motor in a good condition in addition to three sets of 1340 FFTs each measured when a particular fault was present. The goal was to use the likelihood of a test FFT with respect to the estimated density to determine whether there was a fault present in the motor. \n\nWe used a layered 6 -> 20 -> 319 generative model to estimate the training set density. The resulting classification error rates on the test set are shown in figure 2 as a function of the threshold likelihood. The achieved error rates are comparable to those of Petsche et al. (1995). \n\n6 Conclusions \n\nNetwork models that admit probabilistic formulations derive a number of advantages from probability theory. Moving away from explicit representations of dependencies, however, can make these properties harder to exploit in practice. We showed that an efficient estimation procedure can be derived for sigmoid belief networks, where standard methods are intractable in all but a few special cases (e.g. trees and chains). The efficiency of our approach derived from the combination of two ideas. 
First, we avoided the intractability of computing likelihoods in these networks by computing lower bounds instead. Second, we introduced new representations for these networks and showed how the lower bounds in the new representational domains transform the parameter estimation problem into \n\n\f\n[Figure 2: error-rate curves omitted.] \n\nFigure 2: The probability of error curves for missing a fault (dashed lines) and misclassifying a good motor (solid line) as a function of the likelihood threshold. \n\nquadratic optimization. \n\nAcknowledgments \n\nThe authors wish to thank Peter Dayan for helpful comments. This project was supported in part by NSF grant CDA-9404932, by a grant from the McDonnell-Pew Foundation, by a grant from ATR Human Information Processing Research Laboratories, by a grant from Siemens Corporation, and by grant N00014-94-1-0777 from the Office of Naval Research. Michael I. Jordan is a NSF Presidential Young Investigator. \n\nReferences \n\nP. Dayan, G. Hinton, R. Neal, and R. Zemel (1995). The Helmholtz machine. Neural Computation 7: 889-904. \n\nA. Dempster, N. Laird, and D. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B 39: 1-38. \n\nG. Hinton, P. Dayan, B. Frey, and R. Neal (1995). The wake-sleep algorithm for unsupervised neural networks. Science 268: 1158-1161. \n\nS. L. Lauritzen and D. J. Spiegelhalter (1988). Local computations with probabilities on graphical structures and their application to expert systems. J. Roy. Statist. Soc. B 50: 154-227. \n\nR. Neal (1992). Connectionist learning of belief networks. Artificial Intelligence 56: 71-113. \n\nJ. Pearl (1988). Probabilistic Reasoning in Intelligent Systems. 
Morgan Kaufmann: San Mateo. \n\nT. Petsche, A. Marcantonio, C. Darken, S. J. Hanson, G. M. Kuhn, and I. Santoso (1995). A neural network autoassociator for induction motor failure prediction. In Advances in Neural Information Processing Systems 8. MIT Press. \n\nL. K. Saul, T. Jaakkola, and M. I. Jordan (1995). Mean field theory for sigmoid belief networks. M.I.T. Computational Cognitive Science Technical Report 9501. \n\n\f", "award": [], "sourceid": 1111, "authors": [{"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}, {"given_name": "Lawrence", "family_name": "Saul", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}