{"title": "Triangulation by Continuous Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 557, "page_last": 563, "abstract": null, "full_text": "Triangulation by Continuous Embedding \n\nMarina MeiHl and Michael I. Jordan \n\n{mmp, jordan }@ai.mit.edu \n\nCenter for Biological & Computational Learning \n\nMassachusetts Institute of Technology \n\n45 Carleton St. E25-201 \nCambridge, MA 02142 \n\nAbstract \n\nWhen triangulating a belief network we aim to obtain a junction \ntree of minimum state space. According to (Rose, 1970), searching \nfor the optimal triangulation can be cast as a search over all the \npermutations of the graph's vertices. Our approach is to embed \nthe discrete set of permutations in a convex continuous domain D. \nBy suitably extending the cost function over D and solving the \ncontinous nonlinear optimization task we hope to obtain a good \ntriangulation with respect to the aformentioned cost. This paper \npresents two ways of embedding the triangulation problem into \ncontinuous domain and shows that they perform well compared to \nthe best known heuristic. \n\n1 \n\nINTRODUCTION. WHAT IS TRIANGULATION? \n\nBelief networks are graphical representations of probability distributions over a set \nof variables. In what follows it will be always assumed that the variables take \nvalues in a finite set and that they correspond to the vertices of a graph. The \ngraph's arcs will represent the dependencies among variables. There are two kinds of \nrepresentations that have gained wide use: one is the directed acyclic graph model, \nalso called a Bayes net, which represents the joint distribution as a product of the \nprobabilities of each vertex conditioned on the values of its parents; the other is the \nundirected graph model, also called a Markov field, where the joint distribution is \nfactorized over the cliques! of an undirected graph. 
This factorization is called a junction tree and optimizing it is the subject of the present paper. The power of both models lies in their ability to display and exploit existing marginal and conditional independencies among subsets of variables. Emphasizing independencies is useful from both a qualitative point of view (it reveals something about the domain under study) and a quantitative one (it makes computations tractable). The two models differ in the kinds of independencies they are able to represent and oftentimes in their naturalness in particular tasks. Directed graphs are more convenient for learning a model from data; on the other hand, the clique structure of undirected graphs organizes the information in a way that makes it immediately available to inference algorithms. Therefore it is a standard procedure to construct the model of a domain as a Bayes net and then to convert it to a Markov field for the purpose of querying it. \n\n1 A clique is a fully connected set of vertices and a maximal clique is a clique that is not contained in any other clique. \n\nThis process is known as decomposition and it consists of the following stages: first, the directed graph is transformed into an undirected graph by an operation called moralization. Second, the moralized graph is triangulated. A graph is called triangulated if any cycle of length > 3 has a chord (i.e. an edge connecting two nonconsecutive vertices). If a graph is not triangulated it is always possible to add new edges so that the resulting graph is triangulated. We shall call this procedure triangulation and the added edges the fill-in. In the final stage, the junction tree (Kjærulff, 1991) is constructed from the maximal cliques of the triangulated graph. 
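The moralization step described above can be illustrated with a small sketch (Python; the dict-of-parents encoding of the Bayes net is our own illustrative choice, not from the paper):

```python
from itertools import combinations

def moralize(parents):
    """Moralization: connect ("marry") all parents of every node,
    then drop edge directions, yielding an undirected graph."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    adj = {v: set() for v in nodes}
    for child, ps in parents.items():
        for p in ps:                       # drop directions
            adj[p].add(child); adj[child].add(p)
        for a, b in combinations(ps, 2):   # marry co-parents
            adj[a].add(b); adj[b].add(a)
    return adj

# v-structure a -> c <- b: moralization adds the undirected edge a - b
moral = moralize({'c': ['a', 'b'], 'a': [], 'b': []})
```

Here the added edge a - b ensures that the family {a, b, c} forms a clique of the undirected graph, as moralization requires.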
\nWe define the state space of a clique to be the cartesian product of the state spaces of the variables associated with the vertices in the clique, and we call the size of this state space the weight of the clique. The weight of the junction tree is the sum of the weights of its component cliques. All further exact inference in the net takes place in the junction tree representation. The number of computations required by an inference operation is proportional to the weight of the tree. \n\nFor each graph there are several, and usually a large number of, possible triangulations, with widely varying state space sizes. Moreover, triangulation is the only stage where the cost of inference can be influenced. It is therefore critical that the triangulation procedure produce a graph that is optimal or at least \"good\" in this respect. \n\nUnfortunately, this is a hard problem. No optimal triangulation algorithm is known to date. However, a number of heuristic algorithms like maximum cardinality search (Tarjan and Yannakakis, 1984), lexicographic search (Rose et al., 1976) and the minimum weight heuristic (MW) (Kjærulff, 1990) are known. An optimization method based on simulated annealing which performs better than the heuristics on large graphs has been proposed in (Kjærulff, 1991) and recently a \"divide and conquer\" algorithm which bounds the maximum clique size of the triangulated graph has been published (Becker and Geiger, 1996). All but the last algorithm are based on Rose's (Rose, 1970) elimination procedure: choose a node v of the graph, connect all its neighbors to form a clique, then eliminate v and all the edges incident to it, and proceed recursively. The resulting filled-in graph is triangulated. \n\nIt can be proven that the optimal triangulation can always be obtained by applying Rose's elimination procedure with an appropriate ordering of the nodes. 
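Rose's elimination procedure can be sketched in a few lines (an illustrative Python sketch; the adjacency-dict encoding is our own assumption):

```python
from itertools import combinations

def eliminate(adj, order):
    """Rose's elimination: for each node v in `order`, connect all of
    v's not-yet-eliminated neighbors (creating fill-in edges), record
    the clique C_v = {v} plus those neighbors, then remove v."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    fill_in, cliques = [], []
    for v in order:
        nbrs = adj[v]
        for a, b in combinations(nbrs, 2):
            if b not in adj[a]:            # missing edge -> fill-in
                adj[a].add(b); adj[b].add(a)
                fill_in.append((a, b))
        cliques.append({v} | nbrs)
        for u in nbrs:                     # eliminate v
            adj[u].discard(v)
        del adj[v]
    return fill_in, cliques

# 4-cycle a-b-c-d: eliminating in order a, b, c, d adds the chord b - d
cycle = {'a': {'b', 'd'}, 'b': {'a', 'c'}, 'c': {'b', 'd'}, 'd': {'a', 'c'}}
fill, cliques = eliminate(cycle, ['a', 'b', 'c', 'd'])
```

The filled-in graph (original edges plus `fill`) is triangulated, and the recorded sets C_v are exactly the cliques whose weights enter the cost functions discussed in the next section.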
It follows then that searching for an optimal triangulation can be cast as a search in the space of all node permutations. The idea of the present work is the following: embed the discrete search space of permutations of n objects (where n is the number of vertices) into a suitably chosen continuous space. Then extend the cost to a smooth function over the continuous domain and thus transform the discrete optimization problem into a continuous nonlinear optimization task. This allows one to take advantage of the wealth of optimization methods that exist for continuous cost functions. The rest of the paper presents this procedure in the following sequence: the next section introduces and discusses the objective function; section 3 states the continuous version of the problem; section 4 discusses further aspects of the optimization procedure and presents experimental results; and section 5 concludes the paper. \n\n2 THE OBJECTIVE \n\nIn this section we introduce the objective function that we used and we discuss its relationship to the junction tree weight. First, some notation. Let G = (V, E) be a graph, with V and E its vertex set and edge set respectively. Denote by n the cardinality of the vertex set; by r_v the number of values of the (discrete) variable associated with vertex v ∈ V; by # the elimination ordering of the nodes, such that #v = i means that node v is the i-th node to be eliminated according to ordering #; by n(v) the set of neighbors of v ∈ V in the triangulated graph; and by C_v = {v} ∪ {u ∈ n(v) | #u > #v}.2 Then, a result in (Golumbic, 1980) allows us to express the total weight of the junction tree obtained with elimination ordering # as \n\nJ*(#) = Σ_{v ∈ V} ismax(C_v) Π_{u ∈ C_v} r_u    (1) \n\nwhere ismax(C_v) is a variable which is 1 when C_v is a maximal clique and 0 otherwise. As stated, this is the objective of interest for belief net triangulation. 
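For a small example, the junction tree weight in (1) and its relaxation summing over all elimination cliques, maximal or not (the raw weight introduced in the next section), can be evaluated directly. This is an illustrative Python sketch; the clique list and the dict encoding of the r_v are our own choices:

```python
from math import prod

def weights(cliques, r):
    """True weight J* (eq. 1): sum of prod_{u in C_v} r_u over maximal
    cliques only (ismax(C_v) = 1).  Raw weight J (eq. 2): same sum over
    all elimination cliques C_v."""
    J = sum(prod(r[u] for u in c) for c in cliques)
    maximal = [c for c in cliques
               if not any(c < d for d in cliques)]  # strict-subset test
    Jstar = sum(prod(r[u] for u in c) for c in maximal)
    return J, Jstar

# Cliques from eliminating the triangulated 4-cycle a-b-c-d (chord b-d)
# in order a, b, c, d; all variables binary (r_u = 2).
cliques = [{'a', 'b', 'd'}, {'b', 'c', 'd'}, {'c', 'd'}, {'d'}]
r = {'a': 2, 'b': 2, 'c': 2, 'd': 2}
J, Jstar = weights(cliques, r)   # J = 8 + 8 + 4 + 2 = 22, J* = 8 + 8 = 16
```

Here only {a, b, d} and {b, c, d} are maximal, and with r = 2 the pair (J, J*) = (22, 16) indeed satisfies the bound J ≤ r/(r − 1) · J* = 2 J* discussed below.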
Any reference to optimality henceforth will be made with respect to J*. \n\nThis result implies that there are no more than n maximal cliques in a junction tree and provides a method to enumerate them. This suggests defining a cost function that we call the raw weight J as the sum over all the cliques C_v (thus possibly including some non-maximal cliques): \n\nJ(#) = Σ_{v ∈ V} Π_{u ∈ C_v} r_u    (2) \n\nJ is the cost function that will be used throughout this paper. A reason to use it instead of J* in our algorithm is that the former is easier to compute and to approximate. How to do this will be the object of the next section. But it is natural to ask first how well the two agree. \n\nObviously, J is an upper bound for J*. Moreover, it can be proved that if r = min_{v ∈ V} r_v then \n\nJ ≤ (r / (r − 1)) J*    (3) \n\nand therefore that J is less than a fraction 1/(r − 1) away from J*. The upper bound is attained when the triangulated graph is fully connected and all r_v are equal. \n\nIn other words, the difference between J and J* is largest for the highest cost triangulation. We also expect this difference to be low for low cost triangulations. An intuitive argument for this is that good triangulations are associated with a large number of smaller cliques rather than with a few large ones. But the former situation means that there will be only a small number of small size non-maximal cliques to contribute to the difference J − J*, and therefore that the agreement with J* is usually closer than (3) implies. This conclusion is supported by simulations (Meila and Jordan, 1997). \n\n2 Both n(v) and C_(·) depend on # but we chose not to emphasize this in the notation for the sake of readability. \n\n3 THE CONTINUOUS OPTIMIZATION PROBLEM \n\nThis section shows two ways of defining J over continuous domains. 
Both rely on a formulation of J that eliminates explicit reference to the cliques C_v; we describe this formulation here. \n\nLet us first define new variables μ_uv and e_uv, for u, v = 1, …, n. For any permutation # \n\nμ_uv = 1 if #u ≤ #v, and 0 otherwise \n\ne_uv = 1 if the edge (u, v) ∈ E ∪ F_#, and 0 otherwise \n\nwhere F_# is the set of fill-in edges. In other words, the μ represent precedence relationships and the e represent the edges between the n vertices. Therefore, they will be called precedence variables and edge variables respectively. With these variables, J can be expressed as \n\nJ(#) = Σ_{v ∈ V} Π_{u ∈ V} r_u^{μ_vu e_uv}    (4) \n\nIn (4), the product μ_vu e_uv acts as an indicator variable, being 1 iff \"u ∈ C_v\" is true. For any given permutation, finding the μ variables is straightforward. Computing the edge variables is possible thanks to a result in (Rose et al., 1976). It states that an edge (u, v) is contained in F_# iff there is a path in G between u and v containing only nodes w for which #w < min(#u, #v). Formally, e_uv = e_vu = 1 iff there exists a path P = (u, w_1, w_2, …, w_k, v) in G such that \n\nΠ_{i=1..k} μ_{w_i u} μ_{w_i v} = 1 \n\nSo far, we have succeeded in defining the cost J associated with any permutation in terms of the variables μ and e. In the following, the set of permutations will be embedded in a continuous domain. As a consequence, μ and e will take values in the interval [0, 1] but the form of J in (4) will stay the same. \n\nThe first method, called μ-continuous embedding (μ-CE), assumes that the variables μ_uv ∈ [0, 1] represent independent probabilities that #u < #v. For any permutation, the precedence variables have to satisfy the transitivity condition. Transitivity means that if #u < #v and #v < #w, then #u < #w, or, that for any triple (μ_uv, μ_vw, μ_wu) the assignments (0, 0, 0) and (1, 1, 1) are forbidden. 
According to the probabilistic interpretation of μ we introduce a term that penalizes the probability of a transitivity violation: \n\nR_μ = Σ_{(u,v,w)} P[(u, v, w) nontransitive]    (5) \n\n= Σ_{(u,v,w)} [μ_uv μ_vw μ_wu + (1 − μ_uv)(1 − μ_vw)(1 − μ_wu)]    (6) \n\nwhere the sum runs over all triples of distinct vertices. \n\nIn the second approach, called θ-continuous embedding (θ-CE), the permutations are directly embedded into the set of doubly stochastic matrices. A doubly stochastic matrix θ is a matrix for which the elements in each row and each column sum to one: \n\nΣ_i θ_ij = Σ_j θ_ij = 1,  θ_ij ≥ 0 for i, j = 1, …, n.    (8) \n\nWhen the θ_ij are either 0 or 1, implying that there is exactly one nonzero element in each row and each column, the matrix is called a permutation matrix. θ_ij = 1 and #i = j both mean that the position of object i is j in the given permutation. The set of doubly stochastic matrices Θ is a convex polytope of dimension (n − 1)^2 whose extreme points are the permutation matrices (Balinski and Russakoff, 1974). Thus, every doubly stochastic matrix can be represented as a convex combination of permutation matrices. To constrain the optimum to be an extreme point, we use the penalty term \n\nR(θ) = Σ_{ij} θ_ij (1 − θ_ij)    (9) \n\nThe precedence variables are defined over Θ as \n\nμ_uv = Σ_{i ≤ j} θ_ui θ_vj for u ≠ v,  μ_vu = 1 − μ_uv,  μ_uu = 1. \n\nNow, for both embeddings, the edge variables can be computed from μ as follows: \n\ne_uv = 1 for (u, v) ∈ E or u = v, and e_uv = max_{P ∈ {paths u–v}} Π_{w interior to P} μ_wu μ_wv otherwise. \n\nThe above assignments give the correct values for μ and e at any point representing a permutation. Over the interior of the domain, e is a continuous, piecewise differentiable function. Each e_uv, (u, v) ∉ E, can be computed by a shortest path algorithm between u and v, with the length of an edge (w_1, w_2) ∈ E defined as −log(μ_{w_2 u} μ_{w_2 v}). 
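The shortest-path computation of a soft edge variable can be sketched as follows (an illustrative Python sketch under our own encoding assumptions: `adj` is an adjacency dict and `mu[w][x]` holds the precedence variable μ_wx; only interior nodes of a path contribute factors):

```python
import heapq
from math import log, exp

def edge_variable(adj, mu, u, v):
    """e_uv = max over paths u..v of prod over interior nodes w of
    mu[w][u] * mu[w][v], found as a shortest path where stepping onto
    an interior node w costs -log(mu[w][u] * mu[w][v])."""
    if u == v or v in adj[u]:
        return 1.0                         # (u, v) in E or u = v
    dist = {u: 0.0}
    heap = [(0.0, u)]
    while heap:
        d, w1 = heapq.heappop(heap)
        if w1 == v:
            return exp(-d)                 # back from -log to product
        if d > dist.get(w1, float('inf')):
            continue                       # stale heap entry
        for w2 in adj[w1]:
            step = 0.0 if w2 == v else -log(mu[w2][u] * mu[w2][v])
            if d + step < dist.get(w2, float('inf')):
                dist[w2] = d + step
                heapq.heappush(heap, (d + step, w2))
    return 0.0                             # no path at all

# 4-cycle 0-1-2-3, no chord between 0 and 2; soft precedences:
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
mu = [[1.0] * 4 for _ in range(4)]
mu[1][0] = mu[1][2] = 0.9   # node 1 likely eliminated before 0 and 2
mu[3][0] = mu[3][2] = 0.4   # node 3 less likely to precede them
e02 = edge_variable(adj, mu, 0, 2)   # path through 1 wins: 0.9 * 0.9
```

Since all μ values lie in [0, 1], every edge length −log(μ_wu μ_wv) is nonnegative, so Dijkstra's algorithm applies, which is what gives the O(n^3 log n) total cost for computing e quoted below.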
\nθ-CE is an interior point method, whereas in μ-CE the current point, although inside [0, 1]^{n(n−1)/2}, isn't necessarily in the convex hull of the hypercube's corners that represent permutations. The number of operations required for one evaluation of J and its gradient is as follows: O(n^4) operations to compute μ from θ, O(n^3 log n) to compute e, O(n^3) for ∂J/∂e and O(n^2) for ∂e/∂μ and ∂μ/∂θ afterwards. Since computing μ from θ is the most computationally intensive step, μ-CE is a clear win in terms of computation cost. In addition, by operating directly in the μ domain, one level of approximation is eliminated, which makes one expect μ-CE to perform better than θ-CE. The results in the following section will confirm this. \n\n4 EXPERIMENTAL RESULTS \n\nTo assess the performance of our algorithms we compared their results with the results of the minimum weight heuristic (MW), the heuristic that scored best in empirical tests (Kjærulff, 1990). The lowest junction tree weight obtained in 200 runs of MW was retained and denoted by J_MW. Tests were run on 6 graphs of different sizes and densities: \n\ngraph                 h9      h12     d10      m20     a20      d20 \nn = |V|               9       12      10       20      20       20 \ndensity               .33     .25     .6       .25     .45      .6 \nr_min/r_max/r_avg     2/2/2   3/3/3   6/15/10  2/8/5   6/15/10  6/15/10 \nlog10 J_MW            2.43    2.71    7.44     5.47    12.75    13.94 \n\nThe last row of the table shows log10 J_MW. We ran 11 or more trials of each of our two algorithms on each graph. To force the variables to converge to a permutation, we minimized the objective J + λR, where λ > 0 is a parameter that was progressively increased following a deterministic annealing schedule and R is one of the aforementioned penalty terms. The algorithms were run for 50-150 optimization cycles, usually enough to reach convergence. However, for the μ-embedding on graph d20, there were several cases where many μ values did not converge to 0 or 1. In those cases we picked the most plausible permutation to be the answer. \n\nFigure 1: Minimum, maximum (solid line) and median (dashed line) values of J*/J_MW obtained by θ-CE (a) and μ-CE (b). \n\nThe results are shown in figure 1 in terms of the ratio of the true cost obtained by the continuous embedding algorithm (denoted by J*) and J_MW. For the first two graphs, h9 and h12, J_MW is the optimal cost; the embedding algorithms reach it in most trials. On the remaining graphs, μ-CE clearly outperforms θ-CE, which also performs worse than MW on average. On d10, a20 and m20 μ-CE also outperforms the MW heuristic, attaining junction tree weights that are 1.6 to 5 times lower on average than those obtained by MW. On d20, a denser graph, the results are similar for MW and μ-CE in half of the cases and worse for μ-CE otherwise. The plots also show that the variability of the results is much larger for CE than for MW. This behaviour is not surprising, given that the search space for CE, although continuous, comprises a large number of local minima. This induces dependence on the initial point and, as a consequence, nondeterministic behaviour of the algorithm. Moreover, while the number of choices that MW has is much lower than the upper limit of n!, the \"choices\" that the CE algorithms consider, although soft, span the space of all possible permutations. \n\n5 CONCLUSION \n\nThe idea of continuous embedding is not new in the field of applied mathematics. 
\nThe large body of literature dealing with smooth (sigmoidal) functions instead of hard nonlinearities (step functions) is only one example. The present paper shows a nontrivial way of applying a similar treatment to a new problem in a new field. The results obtained by μ-embedding are on average better than the standard MW heuristic. Although not directly comparable, the best results reported on triangulation (Kjærulff, 1991; Becker and Geiger, 1996) are only a little better than ours. The significance of these results therefore goes beyond the scope of the present problem. They are obtained on a hard problem, whose cost function has no feature to ease its minimization (J is neither linear, nor quadratic, nor is it additive w.r.t. the vertices or the edges), and therefore they demonstrate the potential of continuous embedding as a general tool. \n\nCollaterally, we have introduced the cost function J, which is directly amenable to continuous approximations and is in good agreement with the true cost J*. Since minimizing J may not be NP-hard, this opens a way for investigating new triangulation methods. \n\nAcknowledgements \n\nThe authors are grateful to Tommi Jaakkola for many discussions and to Ellie Bonsaint for her invaluable help in typing the paper. \n\nReferences \n\nBalinski, M. and Russakoff, R. (1974). On the assignment polytope. SIAM Rev. \n\nBecker, A. and Geiger, D. (1996). A sufficiently fast algorithm for finding close to optimal junction trees. In UAI 96 Proceedings. \n\nGolumbic, M. (1980). Algorithmic Graph Theory and Perfect Graphs. Academic Press, New York. \n\nKjærulff, U. (1990). Triangulation of graphs - algorithms giving small total state space. Technical Report R 90-09, Department of Mathematics and Computer Science, Aalborg University, Denmark. \n\nKjærulff, U. (1991). Optimal decomposition of probabilistic networks by simulated annealing. Statistics and Computing. \n\nMeila, M. and Jordan, M. I. (1997). An objective function for belief net triangulation. In Madigan, D., editor, AI and Statistics, number 7. (to appear). \n\nRose, D. J. (1970). Triangulated graphs and the elimination process. Journal of Mathematical Analysis and Applications. \n\nRose, D. J., Tarjan, R. E., and Lueker, G. S. (1976). Algorithmic aspects of vertex elimination on graphs. SIAM J. Comput. \n\nTarjan, R. and Yannakakis, M. (1984). Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs. SIAM J. Comput. \n", "award": [], "sourceid": 1318, "authors": [{"given_name": "Marina", "family_name": "Meila", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}