{"title": "Fast Information Value for Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 51, "page_last": 58, "abstract": "", "full_text": "Fast Information Value for Graphical Models\n\nBrigham S. Anderson\nSchool of Computer Science\nCarnegie Mellon University\nPittsburgh, PA 15213\nbrigham@cmu.edu\n\nAndrew W. Moore\nSchool of Computer Science\nCarnegie Mellon University\nPittsburgh, PA 15213\nawm@cs.cmu.edu\n\nAbstract\n\nCalculations that quantify the dependencies between variables are vital to many operations with graphical models, e.g., active learning and sensitivity analysis. Previously, pairwise information gain calculation has involved a cost quadratic in network size. In this work, we show how to perform a similar computation with cost linear in network size. The loss function that allows this is of a form amenable to computation by dynamic programming. The message-passing algorithm that results is described, and empirical results demonstrate large speedups without decrease in accuracy. In the cost-sensitive domains examined, superior accuracy is achieved.\n\n1 Introduction\n\nIn a diagnosis problem, one wishes to select the best test (or observation) to make in order to learn the most about a system of interest. Medical settings and disease diagnosis immediately come to mind, but sensor management (Krishnamurthy, 2002), sensitivity analysis (Kjærulff & van der Gaag, 2000), and active learning (Anderson & Moore, 2005) all make use of similar computations. These generally boil down to an all-pairs analysis between observable variables (queries) and the variables of interest (targets).\nA common technique in the field of diagnosis is to compute the mutual information between each query and target, then select the query that is expected to provide the most information (Agostak & Weiss, 1999). 
Likewise, a sensitivity analysis between the query variable and the target variables can be performed (Laskey, 1995; Kjærulff & van der Gaag, 2000). However, both approaches suffer from a quadratic blowup with respect to the number of queries and targets.\nIn the current paper we present a loss function which can be used in a message-passing framework to perform the all-pairs computation with cost linear in network size. We describe the loss function in Section 2, derive a polynomial expression for network-wide expected loss in Section 3, and in Section 4 present a message-passing scheme to perform this computation efficiently for each node in the network. Section 5 shows the empirical speedups and accuracy gains achieved by the algorithm.\n\n1.1 Graphical Models\n\nTo simplify presentation, we will consider only Bayesian networks, but the results generalize to any graphical model. We also restrict the class of networks to those without undirected loops, i.e., polytrees, of which junction trees are a member. We have a Bayesian network B, which is composed of an independence graph G and parameters for its CPTs. The independence graph G = (X, E) is a directed acyclic graph (DAG) in which X is a set of N discrete random variables {x_1, x_2, ..., x_N} and the edges define the independence relations. We will denote the marginal distribution of a single node P(x|B) by π_x, where (π_x)_i is P(x = i). We will omit conditioning on B for the remainder of the paper. We denote the number of states a node x can assume by |x|.\nAdditionally, each node x is assigned a cost matrix C_x, in which (C_x)_{ij} is the cost of believing x = j when in fact the true value is x* = i. A cost matrix of all zeros indicates that one is not interested in the node’s value. The cost matrix C is useful because inhomogeneous costs are a common feature in most realistic domains. 
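To make the role of the cost matrix concrete, here is a small numerical sketch (the 3-state node and all cost values are invented for illustration): the cost of holding belief π when the true state is i is Σ_j π_j C[i, j].

```python
import numpy as np

# Hypothetical cost matrix for a 3-state node: C[i, j] is the cost of
# believing the node is in state j when its true state is i.
# Confusing state 0 with state 2 is far costlier than any other mistake.
C = np.array([[  0.0, 1.0, 100.0],
              [  1.0, 0.0,   1.0],
              [100.0, 1.0,   0.0]])

belief = np.array([0.1, 0.2, 0.7])  # current marginal over the node's states

# Expected cost if the true state is i: sum_j P(node = j) * C[i, j]
cost_if_true = C @ belief

# A belief concentrated on the wrong end of the costly pair is penalized hard.
print(cost_if_true)  # approximately [70.2, 0.8, 10.2]
```

Note how the same belief is nearly harmless if the truth is state 1 but very expensive if the truth is state 0, which no scalar "error rate" can express.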
This ubiquity results from the fact that information almost always has a purpose: some variables are more relevant than others, some states of a variable are more relevant than others, and confusion between some pairs of states is more costly than between other pairs.\nFor our task, we are given B and wish to estimate P(X) accurately by iteratively selecting the next node to observe. Although typically only a subset of the nodes are queryable, we will for the purposes of this paper assume that any node can be queried. How do we select the most informative node to query? We must first define our objective function, which is determined by our definition of error.\n\n2 Risk Due to Uncertainty\n\nThe underlying error function for the information gain computation will be denoted Error(P(X) || X*), which quantifies the loss associated with the current belief state P(X) given the true values X*. There are several common candidates for this role: a log-loss function, a log-loss function over marginals, and an expected 0-1 misclassification rate (Kohavi & Wolpert, 1996). Constant factors have been omitted.\n\nError_log(P(X) || X*) = −log P(X*)    (1)\nError_mlog(P(X) || X*) = −Σ_{u∈X} log P(u*)    (2)\nError_01(P(X) || X*) = −Σ_{u∈X} P(u*)    (3)\n\nwhere X is the set of nodes and u* is the true value of node u. The error function of Equation 1 will prove insufficient for our needs, as it cannot target individual node errors, while the error function of Equation 2 results in an objective function that is quadratic in cost to compute.\nWe will be exploring a more general form of Equation 3 which allows arbitrary weights to be placed on different types of misclassifications. For instance, we would like to specify that misclassifying a node’s state as 0 when it is actually 1 is different from misclassifying it as 0 when it is actually in state 2. Different costs for each node can be specified with cost matrices C_u for u ∈ X. The final error function is\n\nError(P(X) || X*) = Σ_{u∈X} Σ_{i=1}^{|u|} P(u = i) C_u[u*, i]    (4)\n\nwhere C[i, j] is the ijth element of the matrix C and |u| is the number of states that the node u can assume. The presence of the cost matrix C_u in Equation 4 constitutes a significant advantage in real applications, as they often need to specify inhomogeneous costs.\nThere is a separate consideration, that of query cost, or cost(x), which is the cost incurred by the action of observing x (e.g., the cost of a medical test). If both the query cost and the misclassification cost C are formulated in the same units, e.g., dollars, then they form a coherent decision framework. The query costs will be omitted from this presentation for clarity.\nIn general, one does not actually know the true values X* of the nodes, so one cannot directly minimize the error function as described. Instead, the expected error, or risk, is used:\n\nRisk(P(X)) = Σ_{X*} P(X*) Error(P(X) || X*)    (5)\n\nwhich for the error function of Equation 4 reduces to\n\nRisk(P(X)) = Σ_{u∈X} Σ_j Σ_k P(u = j) P(u = k) C_u[j, k]    (6)\n           = Σ_{u∈X} π_u^T C_u π_u    (7)\n\nwhere (π_u)_i = P(u = i). This is the objective we will minimize. It quantifies “On average, how much is our current ignorance going to cost us?” For comparison, note that the log-loss function Error_log results in an entropy risk function Risk_log(P(X)) = H(X), and the log-loss function over the marginals, Error_mlog, results in the risk function Risk_mlog(P(X)) = Σ_{u∈X} H(u).\nUltimately, we want to find the nodes that have the greatest effect on Risk(P(X)), so we must condition Risk(P(X)) on the beliefs at each node. 
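The risk of Equation 7 is a sum of per-node quadratic forms, which is easy to sketch numerically (two binary nodes with invented marginals and costs):

```python
import numpy as np

# Toy illustration of Risk(P(X)) = sum_u pi_u^T C_u pi_u (Equation 7).
# Marginals and cost matrices are invented for illustration only.
marginals = {"a": np.array([0.5, 0.5]),
             "b": np.array([0.9, 0.1])}
costs = {"a": np.array([[0.0, 1.0],
                        [1.0, 0.0]]),
         "b": np.array([[0.0, 5.0],   # believing b = 1 when b* = 0 costs 5
                        [1.0, 0.0]])}

risk = sum(marginals[u] @ costs[u] @ marginals[u] for u in marginals)

# Node a: maximal uncertainty under 0-1 costs contributes 0.5.
# Node b: a confident belief contributes 0.54 despite the large off-diagonal.
print(risk)  # 0.5 + 0.54 = 1.04
```

A belief that puts all mass on one state contributes zero to this sum, matching the intuition that certainty carries no risk.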
In other words, if we learned that the true marginal probabilities of node x were π_x, what effect would that have on our current risk; that is, what is Risk(P(X) | P(x) = π_x)? Discouragingly, however, any change in π_x propagates to all the other beliefs in the network. It seems as if we must perform several network evaluations for each node, a prohibitive cost for networks of any appreciable size. However, we will show that in fact dynamic programming can perform this computation for all nodes in only two passes through the network.\n\n3 Risk Calculation\n\nTo clarify our objective, we wish to construct a function R_a(π) for each node a, where R_a(π) = Risk(P(X) | P(a) = π). Suppose, for instance, that we learn that the value of node a is equal to 3. Our P(X) is now constrained to have the marginal P(a) = π'_a, where (π'_a)_3 = 1 and which equals zero elsewhere. If we had the function R_a in hand, we could simply evaluate R_a(π'_a) to immediately compute our new network-wide risk, which would account for all the changes in beliefs to all the other nodes due to learning that a = 3. This is exactly our objective; we would like to precompute R_a for all a ∈ X. Define\n\nR_a(π) = Risk(P(X) | P(a) = π)    (8)\n       = Σ_{u∈X} π_u^T C_u π_u |_{P(a)=π}    (9)\n\nThis simply restates the risk definition of Equation 7 under the condition that P(a) = π. As shown in the next theorem, the function R_a has a surprisingly simple form.\n\nTheorem 3.1. For any node x, the function R_x(π) is a second-degree polynomial function of the elements of π.\n\nProof. Define the matrix P_{u|v} for every pair of nodes (u, v), such that (P_{u|v})_{ij} = P(u = i | v = j). Recall that the beliefs at node u have a strictly linear relationship to the beliefs of node x, since\n\n(π_u)_i = Σ_k P(u = i | x = k) P(x = k)    (10)\n\nis equivalent to π_u = P_{u|x} π_x. Substituting P_{u|x} π_x for π_u in Equation 9 obtains\n\nR_x(π) = Σ_{u∈X} π_x^T P_{u|x}^T C_u P_{u|x} π_x |_{π_x=π}    (11)\n       = π^T ( Σ_{u∈X} P_{u|x}^T C_u P_{u|x} ) π    (12)\n       = π^T Θ_x π    (13)\n\nwhere Θ_x is a |x| × |x| matrix.\n\nNote that the matrix Θ_x is sufficient to completely describe R_x, so we only need to consider the computation of Θ_x for x ∈ X. From Equation 12, we see a simple equation for computing these Θ_x directly (though expensively):\n\nΘ_x = Σ_{u∈X} P_{u|x}^T C_u P_{u|x}    (14)\n\nExample #1\nGiven the 2-node network a → b, how do we calculate R_a(π), the total risk associated with our beliefs about the value of node a? Our objective is thus to determine\n\nR_a(π) = Risk(P(a, b) | P(a) = π)    (15)\n       = π^T Θ_a π    (16)\n\nEquation 14 gives Θ_a as\n\nΘ_a = P_{a|a}^T C_a P_{a|a} + P_{b|a}^T C_b P_{b|a}    (17)\n    = C_a + P_{b|a}^T C_b P_{b|a}    (18)\n\nwith P_{a|a} = I by definition. The individual coefficients of Θ_a are thus\n\n(Θ_a)_{ij} = (C_a)_{ij} + Σ_k Σ_l P(b = k | a = i) P(b = l | a = j) (C_b)_{kl}    (19)\n\nNow we can compute the relation between any marginal π at node a and our total network-wide risk via R_a(π). However, using Equation 14 to compute all the Θ would require evaluating the entire network once per node. 
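Example #1 can be checked numerically. A sketch with invented binary CPTs (convention: column j of P_{b|a} holds P(b | a = j)), confirming that the Θ_a of Equation 18 reproduces the risk computed from propagated marginals:

```python
import numpy as np

# Invented CPT for the 2-node network a -> b: (P)_{ij} = P(b = i | a = j).
P_ba = np.array([[0.8, 0.3],
                 [0.2, 0.7]])
C_a = np.array([[0.0, 1.0], [1.0, 0.0]])
C_b = np.array([[0.0, 2.0], [4.0, 0.0]])

# Equation 18: Theta_a = C_a + P_{b|a}^T C_b P_{b|a}
Theta_a = C_a + P_ba.T @ C_b @ P_ba

# Cross-check against the risk definition: fixing P(a) = pi propagates to
# pi_b = P_{b|a} pi, and the network risk is pi^T C_a pi + pi_b^T C_b pi_b.
pi = np.array([0.6, 0.4])
pi_b = P_ba @ pi
direct = pi @ C_a @ pi + pi_b @ C_b @ pi_b
assert np.isclose(pi @ Theta_a @ pi, direct)
```

The check holds for any π, since the substitution π_b = P_{b|a} π is exact: the quadratic form π^T Θ_a π is the network risk, not an approximation of it.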
The function can, however, be decomposed further, which will enable much more efficient computation of Θ_x for x ∈ X.\n\n3.1 Recursion\n\nTo create an efficient message-passing algorithm for computing Θ_x for all x ∈ X, we will introduce Θ_x^W, where W is a subset of the network over which Risk(P(X)) is summed:\n\nΘ_x^W = Σ_{u∈W} P_{u|x}^T C_u P_{u|x}    (20)\n\nThis is otherwise identical to Equation 14. It implies, for instance, that Θ_x^x = C_x. More importantly, these matrices can be usefully decomposed as follows.\n\nTheorem 3.2. Θ_x^W = P_{y|x}^T Θ_y^W P_{y|x} if x and W are conditionally independent given y.\n\nProof. Note that P_{u|x} = P_{u|y} P_{y|x} for u ∈ W, since\n\n(P_{u|y} P_{y|x})_{ij} = Σ_{k=1}^{|y|} P(u = i | y = k) P(y = k | x = j)    (21)\n                      = P(u = i | x = j)    (22)\n                      = (P_{u|x})_{ij}    (23)\n\nStep (22) is only true if x and u are conditionally independent given y. Substituting this result into Equation 20, we conclude\n\nΘ_x^W = Σ_{u∈W} P_{u|x}^T C_u P_{u|x}    (24)\n      = Σ_{u∈W} P_{y|x}^T P_{u|y}^T C_u P_{u|y} P_{y|x}    (25)\n      = P_{y|x}^T ( Σ_{u∈W} P_{u|y}^T C_u P_{u|y} ) P_{y|x}    (26)\n      = P_{y|x}^T Θ_y^W P_{y|x}    (27)\n\nExample #2\nSuppose we now have a 3-node network, a → b → c, and we are only interested in the effect that node a has on the network-wide Risk. 
Our objective is to compute\n\nR_a(π) = Risk(P(a, b, c) | P(a) = π)    (28)\n       = π^T Θ_a π    (29)\n\nwhere Θ_a is by definition\n\nΘ_a = Θ_a^{abc}    (30)\n    = C_a + P_{b|a}^T C_b P_{b|a} + P_{c|a}^T C_c P_{c|a}    (31)\n\nUsing Theorem 3.2 and the fact that a is conditionally independent of c given b, we know\n\nΘ_a^{abc} = Θ_a^a + P_{b|a}^T Θ_b^{bc} P_{b|a}    (32)\nΘ_b^{bc} = Θ_b^b + P_{c|b}^T Θ_c^c P_{c|b}    (33)\n\nSubstituting (33) into (32),\n\nΘ_a = Θ_a^a + P_{b|a}^T ( Θ_b^b + P_{c|b}^T Θ_c^c P_{c|b} ) P_{b|a}    (34)\n    = C_a + P_{b|a}^T ( C_b + P_{c|b}^T C_c P_{c|b} ) P_{b|a}    (35)\n\nNote that the coefficient matrix Θ_a is obtained from probabilities between neighboring nodes only, without having to explicitly compute P_{c|a}.\n\n4 Message Passing\n\nWe are now ready to define message passing. Messages are of two types: in-messages and out-messages, denoted by λ and μ, respectively. Out-messages μ are passed from parent to child, and in-messages λ are passed from child to parent. The messages from x to y will be denoted as μ_xy and λ_xy. In the discrete case, μ_xy and λ_xy will both be matrices of size |y| × |y|. The messages summarize the effect that y has on the part of the network that y is d-separated from by x. Messages relate to the Θ coefficients by the following definition\n\nλ_xy = Θ_x^{-y}    (36)\nμ_xy = Θ_x^{-y}    (37)\n\nwhere the (nonstandard) notation Θ_x^{-y} indicates the matrix Θ_x^V for which V is the set of all the nodes in X that are reachable by x if y were removed from the graph. 
In other words, Θ_x^{-y} summarizes the effect that x has on the entire network except for the part of the network that x can only reach through y.\nPropagation: The message-passing scheme is organized to recursively compute the Θ matrices using Theorem 3.2. As can be seen from Equations 36 and 37, the two types of messages are very similar in meaning. They differ only in that passing a message from a parent to a child automatically separates the child from the rest of the network the parent is connected to, while a child-to-parent message does not necessarily separate the parent from the rest of the network that the child is connected to (due to the “explaining away” effect).\nThe construction of the μ-message involves a short sequence of basic linear algebra. The μ-message from x to child c is created from all other messages entering x except those from c. The definition is\n\nμ_xc = P_{x|c}^T ( C_x + Σ_{u∈pa(x)} μ_ux + Σ_{v∈ch(x)−c} λ_vx ) P_{x|c}    (38)\n\nThe λ-messages from x to parent u are only slightly more involved. To account for the “explaining away” effect, we must construct λ_xu directly from the parents of x:\n\nλ_xu = P_{x|u}^T ( C_x + Σ_{c∈ch(x)} λ_cx ) P_{x|u} + Σ_{w∈pa(x)−u} P_{w|u}^T ( C_w + Σ_{v∈pa(w)} μ_vw + Σ_{c∈ch(w)−x} λ_cw ) P_{w|u}    (39)\n\nMessages are constructed (or “sent”) whenever all of the required incoming messages are present and that particular message has not already been sent. For example, the out-message μ_xc can be sent only when messages from all the parents of x and all the children of x (save c) are present. 
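For intuition, here is a minimal sketch of the propagation on the chain a → b → c (binary nodes, invented CPTs). On a chain there are no coparents, so the λ-message of Equation 39 reduces to its first term, and accumulating the messages arriving at a recovers the direct Θ_a of Equation 14:

```python
import numpy as np

# Chain a -> b -> c with invented CPTs; (P)_{ij} = P(child = i | parent = j).
P_ba = np.array([[0.9, 0.2], [0.1, 0.8]])  # P(b | a)
P_cb = np.array([[0.7, 0.4], [0.3, 0.6]])  # P(c | b)
C_a = np.array([[0.0, 1.0], [1.0, 0.0]])
C_b = np.array([[0.0, 2.0], [2.0, 0.0]])
C_c = np.array([[0.0, 3.0], [3.0, 0.0]])

# Leaves-inward pass: with no coparents, the lambda-message reduces to
# lambda_{x,u} = P_{x|u}^T (C_x + sum of lambdas entering x) P_{x|u}.
lam_cb = P_cb.T @ C_c @ P_cb             # message c -> b (c is a leaf)
lam_ba = P_ba.T @ (C_b + lam_cb) @ P_ba  # message b -> a

# Accumulation at the root a: Theta_a = C_a + sum of incoming messages.
Theta_a = C_a + lam_ba

# Agrees with the direct, non-recursive computation of Equation 14,
# using the chained conditional P_{c|a} = P_{c|b} P_{b|a}.
P_ca = P_cb @ P_ba
Theta_direct = C_a + P_ba.T @ C_b @ P_ba + P_ca.T @ C_c @ P_ca
assert np.allclose(Theta_a, Theta_direct)
```

Each node sends one message per neighbor, so the pass touches each edge a constant number of times, which is where the linear scaling comes from.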
The overall effect of this constraint is a single leaves-inward propagation followed by a single root-outward propagation.\nInitialization and Termination: Initialization occurs naturally at any singly-connected (leaf) node x, where the message is by definition C_x. Termination occurs when no more messages meet the criteria for sending. Once all message propagation is finished, for each node x the coefficients Θ_x can be computed by a simple summation:\n\nΘ_x = Σ_{c∈ch(x)} λ_cx + Σ_{u∈pa(x)} μ_ux + C_x    (40)\n\n[Figure 1: Comparison of execution times with synthetic polytrees. Runtime in seconds versus number of nodes, for MI and Cost Prop.]\n\nPropagation runs in time linear in the number of nodes once the initial local probabilities are calculated. The local probabilities required are the matrices P for each parent-child probability, e.g., P(child = j | parent = i), and for each pair (not set) of parents that share a child, P(parent = j | coparent = i). These are all immediately available from a junction tree, or they can be obtained with a run of belief propagation.\nIt is worth noting that the apparent complexity of the λ, μ message propagation equations is due to the Bayes net representation. The equivalent factor graph equations (not shown) are markedly more succinct.\n\n5 Experiments\n\nThe performance of the message-passing algorithm (hereafter CostProp) was compared with a standard information gain algorithm which uses mutual information (hereafter MI). The error function used by MI is from Equation 2, Error_mlog(P(X) || X*) = −Σ_{x∈X} log P(x*), with the corresponding risk function Risk(P(X)) = Σ_{x∈X} H(x). This corresponds to selecting the node x that has the highest summed mutual information with each of the target nodes (in this case, the set of target nodes is X and the set of query nodes is also X). 
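For contrast, a sketch of the MI baseline's selection rule (the pairwise joints here are invented random tables; in practice they would come from inference). Every (query, target) pair must be scored, which is where the quadratic cost comes from:

```python
import numpy as np

def mutual_information(joint):
    """I(x; u) computed from a joint distribution table P(x, u)."""
    px = joint.sum(axis=1, keepdims=True)
    pu = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ pu)[nz])).sum())

# Invented pairwise joints for 3 queries x 3 targets (normalized random tables).
rng = np.random.default_rng(0)
joints = {}
for q in range(3):
    for t in range(3):
        table = rng.random((2, 2))
        joints[q, t] = table / table.sum()

# The MI rule: score each query by its summed MI with every target -- an
# O(|queries| * |targets|) double loop -- then pick the argmax.
score = {q: sum(mutual_information(joints[q, t]) for t in range(3))
         for q in range(3)}
best_query = max(score, key=score.get)
```

With |queries| = |targets| = N this is N^2 MI evaluations per selection step, compared with the single two-pass propagation of the message-passing scheme.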
The computational cost of MI grows quadratically, as the product of the number of queries and the number of targets.\nIn order to test the speed and relative accuracy of CostProp, we generated random polytrees with varying numbers of trinary nodes. The CPTs were randomly generated with a slight bias towards lower-entropy probabilities. The code was written in Matlab using the Bayes Net Toolbox (Murphy, 2005).\nSpeed: We generated polytrees of sizes ranging from 2 to 1000 nodes and ran the MI algorithm, the CostProp algorithm, and a random-query algorithm on each. The two non-random algorithms were run using a junction tree, the build time of which was not included in the reported run times of either algorithm. Even with the relatively slow Matlab code, the speedup shown in Figure 1 is obvious. As expected, CostProp is many orders of magnitude faster than the MI algorithm, and shows a qualitative difference in scaling properties.\nAccuracy: Due to the slow running time of MI, the accuracy comparison was performed on polytrees of size 20. For each run, a true assignment X* was generated from the tree, but its values were initially hidden from the algorithms. Each algorithm would then determine for itself the best node to observe, receive the true value of that node, then select the next node to observe, et cetera. The true error at each step was computed as the 0-1 error of Equation 3. The reduction in error plotted against the number of queries is shown in Figure 2. 
[Figure 2: Performance on synthetic polytrees with symmetric costs. Figure 3: Performance on synthetic polytrees with asymmetric costs. Both plot true cost against number of queries for Random, MI, and Cost Prop.]\n\nWith uniform cost matrices, the performance of MI and CostProp is approximately equal on this task, but both are better than random. We next made the cost matrices asymmetric by initializing them such that confusing one pair of states was 100 times more costly than confusing the other two pairs. The results of Figure 3 show that CostProp reduces error faster than MI, presumably because it can accommodate the cost matrix information.\n\n6 Discussion\n\nWe have described an all-pairs information gain calculation that scales linearly with network size. The objective function used has a polynomial form that allows for an efficient message-passing algorithm. Empirical results demonstrate large speedups and even improved accuracy in cost-sensitive domains. Future work will explore other applications of this method, including sensitivity analysis and active learning. Further research into other uses for the belief polynomials will also be explored.\n\nReferences\n\nAgostak, J. M., & Weiss, J. (1999). Active fusion for diagnosis guided by mutual information. Proceedings of the 2nd International Conference on Information Fusion.\n\nAnderson, B. S., & Moore, A. W. (2005). Active learning for hidden Markov models: Objective functions and algorithms. Proceedings of the 22nd International Conference on Machine Learning.\n\nKjærulff, U., & van der Gaag, L. (2000). Making sensitivity analysis computationally efficient.\n\nKohavi, R., & Wolpert, D. H. (1996). Bias plus variance decomposition for zero-one loss functions. Machine Learning: Proceedings of the Thirteenth International Conference. Morgan Kaufmann.\n\nKrishnamurthy, V. (2002). Algorithms for optimal scheduling and management of hidden Markov model sensors. IEEE Transactions on Signal Processing, 50, 1382–1397.\n\nLaskey, K. B. (1995). Sensitivity analysis for probability assessments in Bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics.\n\nMurphy, K. (2005). Bayes Net Toolbox for Matlab. U. C. Berkeley. http://www.ai.mit.edu/~murphyk/Software/BNT/bnt.html.\n", "award": [], "sourceid": 2898, "authors": [{"given_name": "Brigham", "family_name": "Anderson", "institution": null}, {"given_name": "Andrew", "family_name": "Moore", "institution": null}]}