{"title": "Exploiting Tractable Substructures in Intractable Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 486, "page_last": 492, "abstract": null, "full_text": "Exploiting Tractable Substructures \n\nin Intractable Networks \n\nLawrence K. Saul and Michael I. Jordan \n\n{lksaul.jordan}~psyche.mit.edu \n\nCenter for Biological and Computational Learning \n\nMassachusetts Institute of Technology \n\n79 Amherst Street, ElO-243 \n\nCambridge, MA 02139 \n\nAbstract \n\nWe develop a refined mean field approximation for inference and \nlearning in probabilistic neural networks. Our mean field theory, \nunlike most, does not assume that the units behave as independent \ndegrees of freedom; instead, it exploits in a principled way the \nexistence of large substructures that are computationally tractable. \nTo illustrate the advantages of this framework, we show how to \nincorporate weak higher order interactions into a first-order hidden \nMarkov model, treating the corrections (but not the first order \nstructure) within mean field theory. \n\n1 \n\nINTRODUCTION \n\nLearning the parameters in a probabilistic neural network may be viewed as a \nproblem in statistical estimation. In networks with sparse connectivity (e.g. trees \nand chains), there exist efficient algorithms for the exact probabilistic calculations \nthat support inference and learning. In general, however, these calculations are \nintractable, and approximations are required. \n\nMean field theory provides a framework for approximation in probabilistic neural \nnetworks (Peterson & Anderson, 1987). Most applications of mean field theory, \nhowever, have made a rather drastic probabilistic assumption-namely, that the \nunits in the network behave as independent degrees of freedom. In this paper we \nshow how to go beyond this assumption. 
We describe a self-consistent approximation in which tractable substructures are handled by exact computations and \nonly the remaining, intractable parts of the network are handled within mean field \ntheory. For simplicity we focus on networks with binary units; the extension to \ndiscrete-valued (Potts) units is straightforward. \n\nWe apply these ideas to hidden Markov modeling (Rabiner & Juang, 1991). The \nfirst order probabilistic structure of hidden Markov models (HMMs) leads to networks with chained architectures for which efficient, exact algorithms are available. \nMore elaborate networks are obtained by introducing couplings between multiple \nHMMs (Williams & Hinton, 1990) and/or long-range couplings within a single HMM \n(Stolorz, 1994). Both sorts of extensions have interesting applications; in speech, \nfor example, multiple HMMs can provide a distributed representation of the articulatory state, while long-range couplings can model the effects of coarticulation. In \ngeneral, however, such extensions lead to networks for which exact probabilistic calculations are not feasible. One would like to develop a mean field approximation for \nthese networks that exploits the tractability of first-order HMMs. This is possible \nwithin the more sophisticated mean field theory described here. \n\n2 MEAN FIELD THEORY \n\nWe briefly review the basic methodology of mean field theory for networks of binary \n(±1) stochastic units (Parisi, 1988). For each configuration {S} = {S1, S2, ..., SN}, \nwe define an energy E{S} and a probability P{S} via the Boltzmann distribution: \n\nP{S} = e^{-βE{S}} / Z,   (1) \n\nwhere β is the inverse temperature and Z is the partition function. When it is \nintractable to compute averages over P{S}, we are motivated to look for an approximating distribution Q{S}. 
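As an aside not in the original paper, the Boltzmann distribution of eq. (1) can be computed exactly for a small network by brute-force enumeration; the exponential cost of that enumeration is precisely what motivates the approximation. A minimal sketch (the function name and the caller-supplied energy function are our own, hypothetical choices):

```python
import itertools
import math

def boltzmann_distribution(energy, N, beta=1.0):
    # Exact P{S} = exp(-beta * E{S}) / Z for N binary (+/-1) units.
    # Enumerates all 2**N configurations, so it is feasible only for small N;
    # for large networks this sum is intractable, motivating mean field theory.
    configs = list(itertools.product([-1, 1], repeat=N))
    weights = [math.exp(-beta * energy(S)) for S in configs]
    Z = sum(weights)  # partition function
    return {S: w / Z for S, w in zip(configs, weights)}, Z
```

For N = 30 units the sum already runs over more than 10^9 configurations, so any average over P{S} must in practice be replaced by an average over a tractable Q{S}.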
Mean field theory posits a particular parametrized \nform for Q{S}, then chooses parameters to minimize the Kullback-Leibler (KL) \ndivergence: \n\nKL(Q||P) = Σ_{S} Q{S} ln [ Q{S} / P{S} ].   (2) \n\nWhy are mean field approximations valuable for learning? Suppose that P{S} \nrepresents the posterior distribution over hidden variables, as in the E-step of an \nEM algorithm (Dempster, Laird, & Rubin, 1977). Then we obtain a mean field \napproximation to this E-step by replacing the statistics of P{S} (which may be \nquite difficult to compute) with those of Q{S} (which may be much simpler). If, in \naddition, Z represents the likelihood of observed data (as is the case for the example \nof section 3), then the mean field approximation yields a lower bound on the log-likelihood. This can be seen by noting that for any approximating distribution \nQ{S}, we can form the lower bound: \n\nln Z = ln Σ_{S} e^{-βE{S}}   (3) \n     = ln Σ_{S} Q{S} [ e^{-βE{S}} / Q{S} ]   (4) \n     ≥ Σ_{S} Q{S} [ -βE{S} - ln Q{S} ],   (5) \n\nwhere the last line follows from Jensen's inequality. The difference between the left- \nand right-hand sides of eq. (5) is exactly KL(Q||P); thus the better the approximation \nto P{S}, the tighter the bound on ln Z. Once a lower bound is available, a learning \nprocedure can maximize the lower bound. This is useful when the true likelihood \nitself cannot be efficiently computed. \n\n2.1 Complete Factorizability \n\nThe simplest mean field theory involves assuming marginal independence for the \nunits Si. Consider, for example, a quadratic energy function \n\nE{S} = -Σ_{i<j} J_{ij} S_i S_j - Σ_i h_i S_i \n\nand the factorized approximation: \n\nQ{S} = Π_i [ (1 + m_i S_i) / 2 ]. 
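The bound of eq. (5) is easy to check numerically on a toy network. The following sketch is our own illustration, not code from the paper; the function names and the small quadratic energy in the usage below are hypothetical. It evaluates the right-hand side of eq. (5) by enumeration, with the factorized approximation parametrized by means m_i, and compares it against the exact ln Z:

```python
import itertools
import math

def ln_Z_exact(energy, N, beta=1.0):
    # Exact log partition function by brute-force enumeration over 2**N states.
    return math.log(sum(math.exp(-beta * energy(S))
                        for S in itertools.product([-1, 1], repeat=N)))

def mean_field_bound(energy, m, beta=1.0):
    # Right-hand side of eq. (5): sum_S Q{S} (-beta*E{S} - ln Q{S}),
    # with the factorized Q{S} = prod_i (1 + m_i * S_i) / 2.
    N = len(m)
    bound = 0.0
    for S in itertools.product([-1, 1], repeat=N):
        q = 1.0
        for mi, si in zip(m, S):
            q *= (1 + mi * si) / 2
        if q > 0:  # q * ln q -> 0 as q -> 0, so zero-probability terms drop out
            bound += q * (-beta * energy(S) - math.log(q))
    return bound
```

By Jensen's inequality the bound never exceeds ln Z for any choice of the means m_i, and it is tight (KL = 0) when Q matches P; for a separable energy E{S} = -Σ_i h_i S_i, the choice m_i = tanh(βh_i) recovers ln Z exactly.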
\n\nFigure 1: 2D output vectors {Xt} sampled from a first-order HMM and a context-sensitive HMM, each with n = 5 hidden states. The latter's coarticulation model \nused left and right context, coupling Xt to the hidden states at times t and t ± 5. \nAt left: the five main clusters reveal the basic first-order structure. At right: weak \nmodulations reveal the effects of context. \n\nIn this paper we have developed a mean field approximation that meets both these objectives. As an example, we have applied our methods to context-sensitive HMMs, \nbut the methods are general and can be applied more widely. \n\nAcknowledgements \n\nThe authors acknowledge support from NSF grant CDA-9404932, ONR grant \nN00014-94-1-0777, ATR Research Laboratories, and Siemens Corporation. \n\nReferences \n\nA. Dempster, N. Laird, and D. Rubin. (1977) Maximum likelihood from incomplete \ndata via the EM algorithm. J. Roy. Stat. Soc. B39:1-38. \n\nB. H. Juang and L. R. Rabiner. (1991) Hidden Markov models for speech recognition. Technometrics 33:251-272. \n\nS. Luttrell. (1989) The Gibbs machine applied to hidden Markov model problems. \nRoyal Signals and Radar Establishment: SP Research Note 99. \n\nG. Parisi. (1988) Statistical field theory. Addison-Wesley: Redwood City, CA. \n\nC. Peterson and J. R. Anderson. (1987) A mean field theory learning algorithm for \nneural networks. Complex Systems 1:995-1019. \n\nL. Saul and M. Jordan. (1994) Learning in Boltzmann trees. Neural Comp. 6:1174-1184. \n\nL. Saul and M. Jordan. (1995) Boltzmann chains and hidden Markov models. \nIn G. Tesauro, D. Touretzky, and T. Leen, eds. Advances in Neural Information \nProcessing Systems 7. MIT Press: Cambridge, MA. \n\nP. Stolorz. (1994) Recursive approaches to the statistical physics of lattice proteins. \nIn L. Hunter, ed. Proc. 27th Hawaii Intl. Conf. on System Sciences V:316-325. \n\nC. Williams and G. E. Hinton. 
(1990) Mean field networks that learn to discriminate \ntemporally distorted strings. Proc. Connectionist Models Summer School: 18-22. \n", "award": [], "sourceid": 1155, "authors": [{"given_name": "Lawrence", "family_name": "Saul", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}