{"title": "Sequentially Fitting ``Inclusive'' Trees for Inference in Noisy-OR Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 493, "page_last": 499, "abstract": null, "full_text": "Sequentially fitting \"inclusive\" trees for inference in noisy-OR networks\n\nBrendan J. Frey^1, Relu Patrascu^1, Tommi S. Jaakkola^2, Jodi Moran^1\n\n^1 Intelligent Algorithms Lab, University of Toronto, www.cs.toronto.edu/~frey\n^2 Computer Science and Electrical Engineering, Massachusetts Institute of Technology\n\nAbstract\n\nAn important class of problems can be cast as inference in noisy-OR Bayesian networks, where the binary state of each variable is a logical OR of noisy versions of the states of the variable's parents. For example, in medical diagnosis, the presence of a symptom can be expressed as a noisy-OR of the diseases that may cause the symptom - on some occasions, a disease may fail to activate the symptom. Inference in richly-connected noisy-OR networks is intractable, but approximate methods (e.g., variational techniques) are showing increasing promise as practical solutions. One problem with most approximations is that they tend to concentrate on a relatively small number of modes in the true posterior, ignoring other plausible configurations of the hidden variables. We introduce a new sequential variational method for bipartite noisy-OR networks that favors including all modes of the true posterior and models the posterior distribution as a tree. We compare this method with other approximations using an ensemble of networks with network statistics that are comparable to the QMR-DT medical diagnostic network.\n\n1 Inclusive variational approximations\n\nApproximate algorithms for probabilistic inference are gaining in popularity and are now even being incorporated into VLSI hardware (T. 
Richardson, personal communication). Approximate methods include variational techniques (Ghahramani and Jordan 1997; Saul et al. 1996; Frey and Hinton 1999; Jordan et al. 1999), local probability propagation (Gallager 1963; Pearl 1988; Frey 1998; MacKay 1999a; Freeman and Weiss 2001) and Markov chain Monte Carlo (Neal 1993; MacKay 1999b). Many algorithms have been proposed in each of these classes.\n\nOne problem that most of the above algorithms suffer from is a tendency to concentrate on a relatively small number of modes of the target distribution (the distribution being approximated). In the case of medical diagnosis, different modes correspond to different explanations of the symptoms. Markov chain Monte Carlo methods are usually guaranteed to eventually sample from all the modes, but this may take an extremely long time, even when tempered transitions (Neal 1996) are used.\n\nFigure 1: We approximate P(x) by adjusting the mean and variance of a Gaussian, Q(x). (a) The result of minimizing D(Q||P) = sum_x Q(x) log(Q(x)/P(x)), as is done for most variational methods. (b) The result of minimizing D(P||Q) = sum_x P(x) log(P(x)/Q(x)).\n\nPreliminary results on local probability propagation in richly connected networks show that it is sometimes able to oscillate between plausible modes (Murphy et al. 1999; Frey 2000), but other results also show that it sometimes diverges or oscillates between implausible configurations (McEliece et al. 1996). Most variational techniques minimize a cost function that favors finding the single, most massive mode, excluding less probable modes of the target distribution (e.g., Saul et al. 1996; Ghahramani and Jordan 1997; Jaakkola and Jordan 1999; Frey and Hinton 1999; Attias 1999). 
More sophisticated variational techniques capture multiple modes using substructures (Saul and Jordan 1996) or by leaving part of the original network intact and approximating the remainder (Jaakkola and Jordan 1999). However, although these methods increase the number of modes that are captured, they still exclude modes.\n\nVariational techniques approximate a target distribution P(x) using a simpler, parameterized distribution Q(x) (or a parameterized bound). For example, P(disease_1, disease_2, ..., disease_N | symptoms) may be approximated by a factorized distribution, Q_1(disease_1) Q_2(disease_2) ... Q_N(disease_N). For the current set of observed symptoms, the parameters of the Q-distributions are adjusted to make Q as close as possible to P.\n\nA common approach to variational inference is to minimize a relative entropy,\n\nD(Q||P) = sum_x Q(x) log(Q(x)/P(x)). (1)\n\nNotice that D(Q||P) != D(P||Q). Often D(Q||P) can be minimized with respect to the parameters of Q using iterative optimization or even exact optimization.\n\nTo see how minimizing D(Q||P) may exclude modes of the target distribution, suppose Q is a Gaussian and P is bimodal with a region of vanishing density between the two modes, as shown in Fig. 1. If we minimize D(Q||P) with respect to the mean and variance of Q, it will cover only one of the two modes, as illustrated in Fig. 1a. (We assume the symmetry is broken.) This is because D(Q||P) will tend to infinity if Q is nonzero in the region where P has vanishing density.\n\nIn contrast, if we minimize D(P||Q) = sum_x P(x) log(P(x)/Q(x)) with respect to the mean and variance of Q, it will cover all modes, since D(P||Q) will tend to infinity if Q vanishes in any region where P is nonzero. See Fig. 1b. 
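The behavior sketched in Fig. 1 can be checked numerically. The following is a minimal sketch, not from the paper: it fits a Gaussian Q to a discretized bimodal P by brute-force grid search over the mean and standard deviation; the target mixture, the grids, and all parameter values are illustrative assumptions.

```python
import numpy as np

# Minimal numerical sketch of the Fig. 1 argument (illustrative parameters).
x = np.linspace(-6.0, 6.0, 601)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    g = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return g / (g.sum() * dx)  # normalize on the grid

# Bimodal target with (numerically) vanishing density between the modes.
p = 0.5 * gauss(x, -2.5, 0.4) + 0.5 * gauss(x, 2.5, 0.4)

def kl(a, b):
    # Discretized D(a||b); terms where a == 0 contribute zero.
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], 1e-300))) * dx)

# Brute-force search over the Gaussian's mean and standard deviation.
best_qp = best_pq = (np.inf, 0.0, 0.0)
for mu in np.linspace(-4.0, 4.0, 81):
    for sigma in np.linspace(0.2, 4.0, 39):
        q = gauss(x, mu, sigma)
        d_qp = kl(q, p)  # "exclusive": huge if q is nonzero where p vanishes
        d_pq = kl(p, q)  # "inclusive": huge if q vanishes where p is nonzero
        if d_qp < best_qp[0]:
            best_qp = (d_qp, mu, sigma)
        if d_pq < best_pq[0]:
            best_pq = (d_pq, mu, sigma)

print("argmin D(Q||P): mu=%+.2f sigma=%.2f" % best_qp[1:])  # one narrow mode
print("argmin D(P||Q): mu=%+.2f sigma=%.2f" % best_pq[1:])  # broad, covers both
```

On this toy target the exclusive fit settles on a single narrow mode, while the inclusive fit returns a broad Gaussian straddling both modes, matching panels (a) and (b) of Fig. 1.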
\nFor many problems, including medical diagnosis, it is easy to argue that it is more important that our approximation include all modes than that it exclude implausible configurations at the cost of excluding other modes. The former leads to a low number of false negatives, whereas the latter may lead to a large number of false negatives (concluding a disease is not present when it is).\n\nFigure 2: Bipartite Bayesian network. The s_k are observed and the d_n are hidden.\n\n2 Bipartite noisy-OR networks\n\nFig. 2 shows a bipartite noisy-OR Bayesian network with N binary hidden variables d = (d_1, ..., d_N) and K binary observed variables s = (s_1, ..., s_K). Later, we present results on medical diagnosis, where d_n = 1 indicates a disease is active, d_n = 0 indicates a disease is inactive, s_k = 1 indicates a symptom is active and s_k = 0 indicates a symptom is inactive.\n\nThe joint distribution is\n\nP(d, s) = [prod_{k=1}^{K} P(s_k|d)] [prod_{n=1}^{N} P(d_n)]. (2)\n\nIn the case of medical diagnosis, this form assumes the diseases are independent.^1 Although some diseases probably do depend on other diseases, this form is considered to be a worthwhile representation of the problem (Shwe et al., 1991).\n\nThe likelihood for s_k takes the noisy-OR form (Pearl 1988). The probability that symptom s_k fails to be activated (s_k = 0) is the product of the probabilities that each active disease fails to activate s_k:\n\nP(s_k = 0|d) = p_k0 prod_{n=1}^{N} p_kn^{d_n}. (3)\n\np_kn is the probability that an active d_n fails to activate s_k. p_k0 accounts for a \"leak probability\": 1 - p_k0 is the probability that symptom s_k is active when none of the diseases are active.\n\nExact inference computes the distribution over d given a subset of observed values in s. 
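As a concrete illustration of the noisy-OR likelihood in Eq. 3, here is a small sketch on a hypothetical 2-symptom, 3-disease network; none of the numbers are QMR-DT parameters.

```python
import numpy as np

# Toy sketch of the noisy-OR likelihood, Eq. 3 (all numbers illustrative):
# P(s_k = 0 | d) = p_k0 * prod_n p_kn^{d_n}.
def p_symptom_inactive(k, d, p0, pf):
    return p0[k] * np.prod(pf[k] ** d)

p0 = np.array([0.9, 0.99])            # p_k0; 1 - p_k0 is the "leak" probability
pf = np.array([[0.1, 0.5, 1.0],       # p_kn: prob. that active disease n fails
               [1.0, 0.2, 0.3]])      # to activate symptom k (1.0 means no edge)

d = np.array([1, 0, 1])               # diseases 1 and 3 active
q = p_symptom_inactive(0, d, p0, pf)  # 0.9 * 0.1^1 * 0.5^0 * 1.0^1 = 0.09
print("P(s_0 = 0 | d) =", q)
print("P(s_0 = 1 | d) =", 1.0 - q)
```

Note how a failure probability of 1.0 makes a disease irrelevant to a symptom, which is how the sparse bipartite structure of Fig. 2 enters the product.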
However, if s_k is not observed, the corresponding likelihood (node plus edges) may be deleted to give a new network that describes the marginal distribution over d and the remaining variables in s. So, we assume that we are considering a subnetwork where all the variables in s are observed.\n\nWe reorder the variables in s so that the first J variables are active (s_k = 1, 1 <= k <= J) and the remaining variables are inactive (s_k = 0, J+1 <= k <= K). The posterior distribution can then be written\n\nP(d|s) ∝ P(d, s) = [prod_{k=1}^{J} (1 - p_k0 prod_{n=1}^{N} p_kn^{d_n})] [prod_{k=J+1}^{K} (p_k0 prod_{n=1}^{N} p_kn^{d_n})] [prod_{n=1}^{N} P(d_n)]. (4)\n\nTaken together, the two terms in brackets on the right take a simple, product form over the variables in d. So, the first step in inference is to \"absorb\" the inactive variables in s by modifying the priors P(d_n) as follows:\n\nP'(d_n) = α_n P(d_n) (prod_{k=J+1}^{K} p_kn)^{d_n}, (5)\n\nwhere α_n is a constant that normalizes P'(d_n).\n\n^1 However, the diseases are dependent given that some symptoms are present.\n\nAssuming the inactive symptoms have been absorbed, we have\n\nP(d|s) ∝ [prod_{k=1}^{J} (1 - p_k0 prod_{n=1}^{N} p_kn^{d_n})] [prod_{n=1}^{N} P'(d_n)]. (6)\n\nThe term in brackets on the left does not have a product form. The entire expression can be multiplied out to give a sum of 2^J product forms, and exact \"QuickScore\" inference can be performed by combining the results of exact inference in each of the 2^J product forms (Heckerman 1989). However, this exponential time complexity makes large problems, such as QMR-DT, intractable.\n\n3 Sequential inference using inclusive variational trees\n\nAs described above, many variational methods minimize D(Q||P), and find approximations that exclude some modes of the posterior distribution. 
We present a method that minimizes D(P||Q) sequentially - by absorbing one observation at a time - so as to not exclude modes of the posterior. Also, we approximate the posterior distribution with a tree. (Directed and undirected trees are equivalent - we use a directed representation, where each variable has at most one parent.)\n\nThe algorithm absorbs one active symptom at a time, producing a new tree by searching for the tree that is closest - in the D(P||Q) sense - to the product of the previous tree and the likelihood for the next symptom. This search can be performed efficiently in O(N^2) time using probability propagation in two versions of the previous tree to compute weights for edges of a new tree, and then applying a minimum-weight spanning-tree algorithm.\n\nLet T_k(d) be the tree approximation obtained after absorbing the kth symptom, s_k = 1. Initially, we take T_0(d) to be a tree that decouples the variables and has marginals equal to the marginals obtained by absorbing the inactive symptoms, as described above.\n\nInterpreting the tree T_{k-1}(d) from the previous step as the current \"prior\" over the diseases, we use the likelihood P(s_k = 1|d) for the next symptom to obtain a new estimate of the posterior:\n\nP̂_k(d|s_1, ..., s_k) ∝ T_{k-1}(d) P(s_k = 1|d) = T_{k-1}(d) (1 - p_k0 prod_{n=1}^{N} p_kn^{d_n}) = T_{k-1}(d) - T'_{k-1}(d), (7)\n\nwhere T'_{k-1}(d) = T_{k-1}(d) (p_k0 prod_{n=1}^{N} p_kn^{d_n}) is a modified tree.\n\nLet the new tree be T_k(d) = prod_n T_k(d_n|d_{π_k(n)}), where π_k(n) is the index of the parent of d_n in the new tree. The parent function π_k(n) and the conditional probability tables of T_k(d) are found by minimizing\n\nD(P̂_k||T_k) = sum_d P̂_k(d|s_1, ..., s_k) log(P̂_k(d|s_1, ..., s_k)/T_k(d)). (8)\n\nIgnoring constants, we have\n\nD(P̂_k||T_k) = - sum_d P̂_k(d|s_1, ..., s_k) log T_k(d)\n= - sum_d (T_{k-1}(d) - T'_{k-1}(d)) log (prod_n T_k(d_n|d_{π_k(n)}))\n= - sum_n (sum_d (T_{k-1}(d) - T'_{k-1}(d)) log T_k(d_n|d_{π_k(n)}))\n= - sum_n (sum_{d_n} sum_{d_{π_k(n)}} (T_{k-1}(d_n, d_{π_k(n)}) - T'_{k-1}(d_n, d_{π_k(n)})) log T_k(d_n|d_{π_k(n)})).\n\nFor a given structure (parent function π_k(n)), the optimal conditional probability tables are\n\nT_k(d_n|d_{π_k(n)}) = β_n (T_{k-1}(d_n, d_{π_k(n)}) - T'_{k-1}(d_n, d_{π_k(n)})), (9)\n\nwhere β_n is a constant that ensures sum_{d_n} T_k(d_n|d_{π_k(n)}) = 1. This table is easily computed using probability propagation in the two trees to compute the two marginals needed in the difference.\n\nThe optimal conditional probability table for a variable is independent of the parent-child relationships in the remainder of the network. So, for the current symptom, we compute the optimal conditional probability tables for all N(N-1)/2 possible parent-child relationships in O(N^2) time using probability propagation. Then, we use a minimum-weight directed spanning tree algorithm (Bock 1971) to search for the best tree.\n\nOnce all of the symptoms have been absorbed, we use the final tree distribution, T_J(d), to make inferences about d given s. The order in which the symptoms are absorbed will generally affect the quality of the resulting tree (Jaakkola and Jordan 1999), but we used a random ordering in the experiments reported below.\n\n4 Results on QMR-DT type networks\n\nUsing the structural and parameter statistics of the QMR-DT network given in Shwe et al. (1991), we simulated 30 QMR-DT type networks with roughly 600 diseases each. There were 10 networks in each of 3 groups, with 5, 10 and 15 instantiated active symptoms. We chose the number of active symptoms to be small enough that we can compare our approximate method with the exact QuickScore method (Heckerman 1989). We also tried two other approximate inference methods: local probability propagation (Murphy et al. 1999) and a variational upper bound (Jaakkola and Jordan 1999).\n\nFor medical diagnosis, an important question is how many most probable diseases n' under the approximate posterior must be examined before the most probable n diseases under the exact posterior are found. Clearly, n <= n' <= N. 
An exact inference algorithm will give n' = n, whereas an approximate algorithm that mistakenly ranks the most probable disease last will give n' = N. For each group of networks and each inference method, we averaged the 10 values of n' for each value of n. The left column of plots in Fig. 3 shows the average of n' versus n for 5, 10 and 15 active symptoms. The sequential tree-fitting method is closest to optimal (n' = n) in all cases. The right column of plots shows the \"extra work\" caused by the excess number of diseases n' - n that must be examined for the approximate methods.\n\n[Figure 3: six panels of plots, one row each for 5, 10 and 15 positive findings; the left column plots n' versus n and the right column plots n' - n versus n, with curves for the tree, pp and ub methods.]\n\nFigure 3: Comparisons of the number of most probable diseases n' under the approximate posterior that must be examined before the most probable n diseases under the exact posterior are found. Approximate methods include the sequential tree-fitting method presented in this paper (tree), local probability propagation (pp) and a variational upper bound (ub). 
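For concreteness, the sequential tree-fitting update of Section 3 can be sketched in code. The sketch below simplifies on several counts, all of which are assumptions rather than the paper's procedure: it brute-forces all 2^N disease configurations on a toy network instead of using the paper's O(N^2) propagation and directed spanning tree, and it fits each tree with a Chow-Liu-style maximum spanning tree on mutual information, which likewise minimizes the inclusive divergence over trees; all parameters are made up.

```python
import itertools
import numpy as np

# Toy brute-force sketch of the sequential "inclusive" tree fit (Section 3).
# Assumptions: N is tiny so full 2^N tables are enumerable; the tree fit is
# Chow-Liu (max spanning tree on mutual information), which minimizes the
# inclusive D(Phat||T) over trees. Parameters are illustrative, not QMR-DT.
rng = np.random.default_rng(0)
N, J = 4, 3                                # diseases, active symptoms
prior = rng.uniform(0.05, 0.3, size=N)     # P(d_n = 1) after absorbing s = 0
p0 = rng.uniform(0.05, 0.2, size=J)        # leak failure probabilities p_k0
pf = rng.uniform(0.2, 1.0, size=(J, N))    # failure probabilities p_kn

states = np.array(list(itertools.product([0, 1], repeat=N)))  # all 2^N configs

def marg1(t, i):                           # single-variable marginal of table t
    m = np.zeros(2)
    for s, w in zip(states, t):
        m[s[i]] += w
    return m

def marg2(t, i, j):                        # pairwise marginal of table t
    m = np.zeros((2, 2))
    for s, w in zip(states, t):
        m[s[i], s[j]] += w
    return m

def fit_tree(t):
    """Return the 2^N table of the best tree approximation to table t."""
    t = t / t.sum()
    mi = np.zeros((N, N))                  # pairwise mutual information
    for i in range(N):
        for j in range(i + 1, N):
            m = marg2(t, i, j)
            mi[i, j] = np.sum(m * np.log(m / np.outer(m.sum(1), m.sum(0))))
    parent = {0: None}                     # grow a max spanning tree (Prim)
    while len(parent) < N:
        i, j = max(((a, b) for a in parent for b in range(N) if b not in parent),
                   key=lambda e: mi[min(e), max(e)])
        parent[j] = i
    logp = np.zeros(len(states))           # rebuild the implied joint table
    for n in range(N):
        if parent[n] is None:
            logp += np.log(marg1(t, n)[states[:, n]])
        else:
            cond = marg2(t, n, parent[n])
            cond = cond / cond.sum(axis=0, keepdims=True)  # P(d_n | d_parent)
            logp += np.log(cond[states[:, n], states[:, parent[n]]])
    return np.exp(logp)

def likelihood(k):                         # P(s_k = 1 | d), Eq. 3
    return 1.0 - p0[k] * np.prod(pf[k] ** states, axis=1)

# T_0: factorized prior; then absorb one active symptom at a time.
T = np.prod(np.where(states == 1, prior, 1.0 - prior), axis=1)
exact = T.copy()
for k in range(J):
    T = fit_tree(T * likelihood(k))        # inclusive tree fit of Phat_k
    exact *= likelihood(k)
exact /= exact.sum()

print("tree  marginals P(d_n=1):", np.round(T @ states, 3))
print("exact marginals P(d_n=1):", np.round(exact @ states, 3))
```

In the paper the two marginals needed for each conditional table come from probability propagation in T_{k-1} and T'_{k-1} rather than enumeration, which is what makes the per-symptom cost O(N^2) for QMR-DT-sized networks.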
\n\n5 Summary \nNoisy-OR networks can be used to model a variety of problems, including medical di(cid:173)\nagnosis. Exact inference in large, richly connected noisy-OR networks is intractable, \nand most approximate inference algorithms tend to concentrate on a small number \nof most probable configurations of the hidden variables under the posterior. We \npresented an \"inclusive\" variational method for bipartite noisy-OR networks that \nfavors including all probable configurations, at the cost of including some improba(cid:173)\nble configurations. The method fits a tree to the posterior distribution sequentially, \ni.e., one observation at a time. Results on an ensemble of QMR-DT type networks \nshow that the method performs better than local probability propagation and a \nvariational upper bound for ranking most probable diseases. \n\nAcknowledgements. We thank Dale Schuurmans for discussions about this work. \n\n\fReferences \n\nH. Attias 1999. Independent factor analysis. Neural Computation 11:4, 803- 852. \n\nF. Bock 1971. An algorithm to construct a minimum directed spanning tree in a directed \nnetwork. Developments in Operations Research, Gordon and Breach, New York, 29-44. \n\nW. T. Freeman and Y. Weiss 2001. On the fixed points of the max-product algorithm. To \nappear in IEEE Transactions on Information Theory, Special issue on Codes on Graphs \nand Iterative Algorithms. \n\nB. J. Frey 1998. Graphical Models for Machine Learning and Digital Communication. \nMIT Press, Cambridge, MA. \n\nB. J. Frey 2000. Filling in scenes by propagating probabilities through layers and into \nappearance models. Proceedings of the IEEE Conference on Computer Vision and Pattern \nRecognition, IEEE Computer Society Press, Los Alamitos, CA. \n\nB. J. Frey and G. E. Hinton 1999. Variational learning in non-linear Gaussian belief \nnetworks. Neural Computation 11:1, 193-214. \n\nR. G. Gallager 1963. Low-Density Parity-Check Codes. MIT Press, Cambridge, MA. \n\nZ. 
Ghahramani and M. I. Jordan 1997. Factorial hidden Markov models. Machine Learning 29, 245-273.\n\nD. Heckerman 1989. A tractable inference algorithm for diagnosing multiple diseases. Proceedings of the Fifth Conference on Uncertainty in Artificial Intelligence.\n\nT. S. Jaakkola and M. I. Jordan 1999. Variational probabilistic inference and the QMR-DT network. Journal of Artificial Intelligence Research 10, 291-322.\n\nM. I. Jordan, Z. Ghahramani, T. S. Jaakkola and L. K. Saul 1999. An introduction to variational methods for graphical models. In M. I. Jordan (ed) Learning in Graphical Models, MIT Press, Cambridge, MA.\n\nD. J. C. MacKay 1999a. Good error-correcting codes based on very sparse matrices. IEEE Transactions on Information Theory 45:2, 399-431.\n\nD. J. C. MacKay 1999b. Introduction to Monte Carlo methods. In M. I. Jordan (ed) Learning in Graphical Models, MIT Press, Cambridge, MA.\n\nR. J. McEliece, E. R. Rodemich and J.-F. Cheng 1996. The turbo decision algorithm. Proceedings of the 33rd Allerton Conference on Communication, Control and Computing, Champaign-Urbana, IL.\n\nK. P. Murphy, Y. Weiss and M. I. Jordan 1999. Loopy belief propagation for approximate inference: An empirical study. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Francisco, CA.\n\nR. M. Neal 1993. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Computer Science, University of Toronto.\n\nR. M. Neal 1996. Sampling from multimodal distributions using tempered transitions. Statistics and Computing 6, 353-366.\n\nL. K. Saul, T. Jaakkola and M. I. Jordan 1996. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research 4, 61-76.\n\nL. K. Saul and M. I. Jordan 1996. Exploiting tractable substructures in intractable networks. In D. Touretzky, M. Mozer, and M. 
Hasselmo (eds) Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA.\n\nM. Shwe, B. Middleton, D. Heckerman, M. Henrion, E. Horvitz, H. Lehmann and G. Cooper 1991. Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base I. The probabilistic model and inference algorithms. Methods of Information in Medicine 30, 241-255.\n", "award": [], "sourceid": 1815, "authors": [{"given_name": "Brendan", "family_name": "Frey", "institution": null}, {"given_name": "Relu", "family_name": "Patrascu", "institution": null}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}, {"given_name": "Jodi", "family_name": "Moran", "institution": null}]}