{"title": "Learning Bounded Treewidth Bayesian Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 417, "page_last": 424, "abstract": "With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while also allowing for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial in the size of the graph and the treewidth bound. At the heart of our method is a triangulated graph that we dynamically update in a way that facilitates the addition of chain structures that increase the bound on the model's treewidth by at most one. We demonstrate the effectiveness of our ``treewidth-friendly'' method on several real-life datasets. Importantly, we also show that by using global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth.", "full_text": "Learning Bounded Treewidth Bayesian Networks\n\nGal Elidan\n\nDepartment of Statistics\n\nHebrew University\n\nJerusalem, 91905, Israel\ngalel@huji.ac.il\n\nStephen Gould\n\nDepartment of Electrical Engineering\n\nStanford University\n\nStanford, CA 94305, USA\nsgould@stanford.edu\n\nAbstract\n\nWith the increased availability of data for complex domains, it is desirable to\nlearn Bayesian network structures that are suf\ufb01ciently expressive for generaliza-\ntion while also allowing for tractable inference. While the method of thin junction\ntrees can, in principle, be used for this purpose, its fully greedy nature makes it\nprone to over\ufb01tting, particularly when data is scarce. 
In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial in the size of the graph and the treewidth bound. At the heart of our method is a triangulated graph that we dynamically update in a way that facilitates the addition of chain structures that increase the bound on the model's treewidth by at most one. We demonstrate the effectiveness of our "treewidth-friendly" method on several real-life datasets. Importantly, we also show that by using global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth.

1 Introduction

Recent years have seen a surge of readily available data for complex and varied domains. Accordingly, increased attention has been directed towards the automatic learning of complex probabilistic graphical models [22], and in particular learning the structure of a Bayesian network. With the goal of making predictions or providing probabilistic explanations, it is desirable to learn models that generalize well and at the same time have low inference complexity or a small treewidth [23].
While learning optimal tree-structured models is easy [5], learning the optimal structure of general and even quite simple (e.g., poly-trees, chains) Bayesian networks is computationally difficult [8, 10, 19]. Several works attempt to generalize the tree-structure result of Chow and Liu [5], either by making assumptions about the true distribution (e.g., [1, 21]), by searching for a local maximum over tree mixtures [20], or by approximate methods that are polynomial in the size of the graph but exponential in the treewidth bound (e.g., [3, 15]). 
In the context of general Bayesian networks, the thin junction tree approach of Bach and Jordan [2] is a local greedy search procedure that relies at each step on tree-decomposition heuristic techniques for computing an upper bound on the true treewidth of the model. Like any local search approach, this method does not provide performance guarantees but is appealing in its ability to efficiently learn models with an arbitrary treewidth bound.
The thin junction tree method, however, suffers from two important limitations. First, while useful on average, even the best of the tree-decomposition heuristics exhibit some variance in the treewidth estimate [16]. As a result, a single edge addition can lead to a jump in the treewidth estimate despite the fact that it can increase the true treewidth by at most one. More importantly, structure learning scores (e.g., BIC, BDe) tend to learn spurious edges that result in overfitting when the number of samples is relatively small, a phenomenon that is made worse by a fully greedy approach. Intuitively, to generalize well, we want to learn bounded treewidth Bayesian networks where structure modifications are globally beneficial (i.e., contribute to the score in many regions of the network).
In this work we propose a novel method for efficiently learning Bayesian networks of bounded treewidth that addresses these concerns. At the heart of our method is a dynamic update of the triangulation of the model in a way that is treewidth-friendly: the treewidth of the triangulated graph (an upper bound on the model's true treewidth) is guaranteed to increase by at most one when an edge is added to the network. Building on the single edge triangulation, we characterize sets of edges that are jointly treewidth-friendly. We use this characterization in a dynamic programming approach for learning the optimal treewidth-friendly chain with respect to a node ordering. 
Finally, we learn a bounded treewidth Bayesian network by iteratively augmenting the model with such chains.
Instead of using local edge modifications, our method progresses by incrementally adding chain structures that are globally beneficial, improving our ability to generalize. We are also able to guarantee that the bound on the model's treewidth grows by at most one at each iteration. Thus, our method resembles the global nature of Chow and Liu [5] more closely than the thin junction tree approach of Bach and Jordan [2], while being applicable in practice to any desired treewidth.
We evaluate our method on several challenging real-life datasets and show that our method is able to learn richer models that generalize better than the thin junction tree approach as well as an unbounded aggressive search strategy. Furthermore, we show that even when learning models with unbounded treewidth, by using global structure modification operators, we are better able to cope with the problem of local maxima and learn better models.

2 Background: Bayesian networks and tree decompositions

A Bayesian network [22] is a pair (G, Θ) that encodes a joint probability distribution over a finite set X = {X1, . . . , Xn} of random variables. G is a directed acyclic graph whose nodes correspond to the variables in X. The parameters Θ_{Xi|Pai} encode local conditional probability distributions (CPDs) for each node Xi given its parents in G. Together, these define a unique joint probability distribution over X given by P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi | Pai).

Given a structure G and a complete training set D, estimating the (regularized) maximum likelihood (ML) parameters is easy for many choices of CPDs (see [14] for details). Learning the structure of a network, however, is generally NP-hard [4, 10, 19] as the number of possible structures is super-exponential in the number of variables. 
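The factorization above can be evaluated directly. The following sketch does exactly that for a toy discrete network; the network, its CPDs, and the function names are made-up illustrations, not taken from the paper:

```python
def joint_probability(parents, cpds, assignment):
    """Evaluate P(X1,...,Xn) = prod_i P(Xi | Pa_i).

    parents: node -> tuple of parent nodes.
    cpds: node -> function mapping (value, parent_values) -> probability.
    assignment: node -> observed value.
    """
    p = 1.0
    for node, pa in parents.items():
        pa_vals = tuple(assignment[q] for q in pa)
        p *= cpds[node](assignment[node], pa_vals)
    return p


# A hypothetical two-node network A -> B over binary variables.
parents = {"A": (), "B": ("A",)}
cpds = {
    "A": lambda v, pv: 0.3 if v == 1 else 0.7,
    "B": lambda v, pv: (0.9 if v == 1 else 0.1) if pv == (1,)
                       else (0.2 if v == 1 else 0.8),
}
```

For the toy CPDs above, `joint_probability(parents, cpds, {"A": 1, "B": 1})` evaluates 0.3 · 0.9 = 0.27, and the probabilities of the four joint assignments sum to one, as the factorization requires.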
In practice, structure learning relies on a greedy search procedure that examines easy-to-evaluate local structure changes (add, delete or reverse an edge). This search is usually guided by a decomposable score that balances the likelihood of the data and the complexity of the model (e.g., BIC [24], Bayesian score [14]). Chow and Liu [5] showed that the ML tree can be learned efficiently. Their result is easily generalized to any decomposable score.
Given a model, we are interested in the task of inference, or evaluating queries of the form P(Y | Z) where Y and Z are arbitrary subsets of X. This task is, in general, NP-hard [7], except when G is tree structured. The actual complexity of inference in a Bayesian network is proportional to its treewidth [23] which, roughly speaking, measures how closely the network resembles a tree. The notions of tree-decompositions and treewidth were introduced by Robertson and Seymour [23]:¹

Definition 2.1: A tree-decomposition of an undirected graph H = (V, E) is a pair ({Ci}_{i∈T}, T), where T is a tree and the {Ci} are subsets of V such that ∪_{i∈T} Ci = V and where
• for all edges (v, w) ∈ E there exists an i ∈ T with v ∈ Ci and w ∈ Ci.
• for all i, j, k ∈ T: if j is on the (unique) path from i to k in T, then Ci ∩ Ck ⊆ Cj.

The treewidth of a tree-decomposition is defined to be max_{i∈T} |Ci| − 1. The treewidth TW(H) of an undirected graph H is the minimum treewidth over all possible tree-decompositions of H. An equivalent notion of treewidth can be phrased in terms of a graph that is a triangulation of H.

Definition 2.2: An induced path P in an undirected graph H is a path such that for every non-adjacent vertices pi, pj ∈ P there is no edge (pi—pj) ∈ H. A triangulated (chordal) graph is an undirected graph with no induced cycles. 
Equivalently, it is an undirected graph in which every cycle of length four or more contains a chord.

It can be easily shown that the treewidth of a triangulated graph is the size of the maximal clique of the graph minus one [23]. The treewidth of an undirected graph H is then the minimum treewidth of all triangulations of H. For the underlying directed acyclic graph of a Bayesian network, the treewidth can be characterized via a triangulation of the moralized graph.

Definition 2.3: A moralized graph M of a directed acyclic graph G is an undirected graph that has an edge (i—j) for every (i → j) ∈ G and an edge (p—q) for every pair (p → i), (q → i) ∈ G.

¹The tree-decomposition properties are equivalent to the corresponding family preserving and running intersection properties of clique trees introduced by Lauritzen and Spiegelhalter [17] at around the same time.

Input: dataset D, treewidth bound K
Output: a network with treewidth ≤ K
G ← best scoring tree
M+ ← undirected skeleton of G
k ← 1
While k < K
    O ← node ordering given G and M+
    C ← best chain with respect to O
    G ← G ∪ C
    Foreach (i → j) ∈ C do
        M+ ← EdgeUpdate(M+, (i → j))
    k ← maximal clique size of M+
Greedily add edges while treewidth ≤ K
Return G

[Figure 1 graphics: panels (a)-(f) showing the example graph and the triangulation steps.]

Figure 1: (left) Outline of our algorithm for learning Bayesian networks of bounded treewidth. (right) An example of the different steps of our triangulation procedure (b)-(e) when (s → t) is added to the graph in (a). The blocks are {s, v1}, {v1, cM}, and {cM, v2, v3, p1, p2, t} with corresponding cut-vertices v1 and cM. 
The\naugmented graph (e) has a treewidth of three (maximal clique of size four). An alternative triangulation (f),\nconnecting cM to t, would result in a maximal clique of size \ufb01ve.\n\nThe treewidth of a Bayesian network graph G is de\ufb01ned as the treewidth of its moralized graph M.\nIt follows that the maximal clique of any moralized triangulation of G is an upper bound on the\ntreewidth of the model, and thus its inference complexity.\n\n3 Learning Bounded Treewidth Bayesian Networks\nIn this section we outline our approach for learning Bayesian networks given an arbitrary treewidth\nbound that is polynomial in both the number of variables and the desired treewidth. We rely on\nglobal structure modi\ufb01cations that are optimal with respect to a node ordering.\nAt the heart of our method is the idea of using a dynamically maintained triangulated graph to upper\nbound the treewidth of the current model. When an edge is added to the Bayesian network we update\nthis triangulated graph in a way that is not only guaranteed to produce a valid triangulation, but that\nis also treewidth-friendly. That is, our update is guaranteed to increase the size of the maximal clique\nof the triangulated graph, and hence the treewidth bound, by at most one. An important property of\nour edge update is that we can characterize the parts of the network that are \u201ccontaminated\u201d by the\nnew edge. This allows us to de\ufb01ne sets of edges that are jointly treewidth-friendly. Building on the\ncharacterization of these sets, we propose a dynamic programming approach for ef\ufb01ciently learning\nthe optimal treewidth-friendly chain with respect to a node ordering.\nFigure 1 shows pseudo-code for our method. Brie\ufb02y, we learn a Bayesian network with bounded\ntreewidth K by starting from a Chow-Liu tree and iteratively augmenting the current structure with\nan optimal treewidth-friendly chain. 
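In code, the outline of Figure 1 (left) translates into roughly the following control flow. Every helper here (best_scoring_tree, order_nodes, best_chain, edge_update, max_clique_size) is an injected stub standing in for the procedures developed in Sections 4 and 5; this is a control-flow sketch, not the authors' implementation:

```python
def learn_bounded_treewidth(data, K, best_scoring_tree, order_nodes,
                            best_chain, edge_update, max_clique_size):
    G = best_scoring_tree(data)            # start from a Chow-Liu tree
    M = {frozenset(e) for e in G}          # undirected skeleton of G
    k = 1                                  # a tree has treewidth 1
    while k < K:
        order = order_nodes(G, M)          # topological node ordering
        chain = best_chain(data, G, M, order)
        if not chain:                      # no beneficial chain remains
            break
        G = G | set(chain)                 # add the whole chain at once
        for edge in chain:                 # incrementally maintain the
            M = edge_update(M, edge)       # moralized triangulation
        k = max_clique_size(M) - 1         # treewidth bound = clique - 1
    return G, k
```

The final greedy phase of the pseudo-code (adding single edges while the bound permits) is omitted from the sketch for brevity.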
During each iteration (below the treewidth bound) we apply our treewidth-friendly edge update procedure that maintains a moralized and triangulated graph for the model at hand. Appealingly, as each global modification can increase the treewidth by at most one, at least K such chains will be added before we face the problem of local maxima. In practice, as some chains do not increase the treewidth, many more such chains are added for a given K.

Theorem 3.1: Given a treewidth bound K and a dataset over N variables, the algorithm outlined in Figure 1 runs in time polynomial in N and K.

This result relies on the efficiency of each step of the algorithm and on the fact that there can be at most N · K iterations (≤ |edges|) before exceeding the treewidth bound. In the next sections we develop the edge update and best scoring chain procedures and show that both are polynomial in N and K.

4 Treewidth-Friendly Edge Update
The basic building block of our method is a procedure for maintaining a valid triangulation of the Bayesian network. An appealing feature of this procedure is that the treewidth bound is guaranteed to grow by at most one after the update. We first consider the addition of a single edge (s → t) to the model. For clarity of exposition, we start with a simple variant of our procedure, and later refine this to allow for multiple edge additions while maintaining our guarantee on the treewidth bound.
To gain intuition into how the dynamic nature of our update is useful, we use the notion of induced paths or paths with no shortcuts (see Section 2), and make explicit the following obvious fact:
Observation 4.1: Let G be a Bayesian network structure and let M+ be a moralized triangulation of G. Let M(s→t) be M+ augmented with the edge (s—t) and with the edges (s—p) for every parent p of t in G. 
Then, every non-chordal cycle in M(s→t) involves s and either t or a parent of t and an induced path between the two vertices.
Stated simply, if the graph was triangulated before the addition of (s → t) to the Bayesian network, then we only need to triangulate cycles created by the addition of the new edge or those forced by moralization. This observation immediately suggests a straightforward single-source triangulation whereby we simply add an edge (s—v) for every node v on an induced path between s and t or its parents before the edge update. Clearly, this naive method results in a valid moralized triangulation of G ∪ (s → t). Surprisingly, we can also show that it is treewidth-friendly.
Theorem 4.2: The treewidth of the graph produced by the single-source triangulation procedure is greater than the treewidth of the input graph M+ by at most one.
Proof: (outline) For the treewidth to increase by more than one, some maximal clique C in M+ needs to connect to two new nodes. Since all edges are being added from s, this can only happen in one of two ways: (i) either t, a parent p of t, or a node v on an induced path between s and t is also connected to C, but not part of C, or (ii) two such (non-adjacent) nodes exist and s is in C. In either case one edge is missing after the update procedure, preventing the formation of a larger clique.
One problem with the proposed single-source triangulation, despite it being treewidth-friendly, is that many vertices are connected to the source node, making the triangulations shallow. This can have an undesirable effect on future edge additions and increases the chances of the formation of large cliques. 
We can alleviate this problem with a re\ufb01nement of the single-source triangulation\nprocedure that makes use of the concepts of cut-vertices, blocks, and block trees.\nDe\ufb01nition 4.3: A block of an undirected graph H is a set of connected nodes that cannot be discon-\nnected by the removal of a single vertex. By convention, if the edge (u\u2014v) is in H then u and v are\nin the same block. Vertices that separate (are in the intersection of) blocks are called cut-vertices.\nIt is easy to see that between every two nodes in a block of size greater than two there are at least\ntwo distinct paths, i.e. a cycle. There are also no simple cycles involving nodes in different blocks.\nDe\ufb01nition 4.4: The (unique) block tree B of an undirected graph H is a graph with nodes that\ncorrespond both to cut-vertices and to blocks of H. The edges in the block tree connect any block\nnode Bi with a cut-vertex node vj if and only if vj \u2208 Bi in H.\nIt can be easily shown that any path in H between two nodes in different blocks passes through all\nthe cut-vertices along the path between the blocks in B. An important consequence that follows\nfrom Dirac [11] is that an undirected graph whose blocks are triangulated is overall triangulated.\nOur re\ufb01ned treewidth-friendly triangulation procedure (illustrated via an example in Figure 1) makes\nuse of this fact as follows. First, the triangulated graph is augmented with the edge (s\u2014t) and any\nedges needed for moralization (Figure 1(b) and (c)). Second, a block level triangulation is carried\nout by zig-zagging across cut-vertices along the unique path between the blocks containing s and\nt and its parents (Figure 1(d)). Next, within each block (not containing s or t) along the path, a\nsingle-source triangulation is performed with respect to the \u201centry\u201d and \u201cexit\u201d cut-vertices. This\nshort-circuits any other node path through (and within) the block. 
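The blocks and cut-vertices of Definitions 4.3-4.4 can be computed with a standard articulation-point DFS (Hopcroft-Tarjan). The following is textbook code over an adjacency-dict graph, not the paper's implementation:

```python
def blocks_and_cut_vertices(adj):
    """Return (blocks, cut_vertices) of an undirected graph.

    adj: node -> list of neighbors (each edge listed in both directions).
    """
    disc, low, stack = {}, {}, []
    blocks, cut_vertices = [], set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v not in disc:
                stack.append((u, v))
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if (parent is None and children > 1) or \
                   (parent is not None and low[v] >= disc[u]):
                    cut_vertices.add(u)
                if low[v] >= disc[u]:        # u closes a block
                    block = set()
                    while True:
                        e = stack.pop()
                        block.update(e)
                        if e == (u, v):
                            break
                    blocks.append(block)
            elif disc[v] < disc[u]:          # back edge
                stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return blocks, cut_vertices
```

On a graph shaped like Figure 1(a) (the exact edges are not given in the text, so any biconnected arrangement of {cM, v2, v3, p1, p2, t} serves), this recovers the blocks {s, v1}, {v1, cM}, and {cM, v2, v3, p1, p2, t} with cut-vertices v1 and cM.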
For the block containing s the single-source triangulation is performed between s and the "exit" cut-vertex. The block containing t and its parents is treated differently: we add chords directly from s to any node v within the block that is on an induced path between s and t (or parents of t) (Figure 1(e)). This is required to prevent moralization and triangulation edges from interacting in a way that will increase the treewidth by more than one (e.g., Figure 1(f)). If s and t happen to be in the same block, then we only triangulate the induced paths between s and t, i.e., the last step outlined above. Finally, in the special case that s and t are in disconnected components of G, the only edges added are those required for moralization.
Theorem 4.5: Our revised edge update procedure results in a triangulated graph with a treewidth at most one greater than that of the input graph. Furthermore, it runs in polynomial time.
Proof: (outline) First, observe that the final step of adding chords emanating from s is a single-source triangulation once the other steps have been performed. Since each block along the block path between s and t is triangulated separately, we only need to consider the zig-zag triangulation between blocks. As this creates 3-cycles, the graph must also be triangulated. To see that the treewidth increases by at most one, we use similar arguments to those used in the proof of Theorem 4.2, and observe that the zig-zag triangulation only touches cut-vertices and any three of these vertices could not have been in the same clique. The fact that the update procedure runs in polynomial time follows from the fact that an adaptation (not shown for lack of space) of maximum cardinality search (see, for example, [16]) can be used to efficiently identify all induced nodes between s and t.

Multiple Edge Updates. We now consider the addition of multiple edges to the graph G. 
To ensure that multiple edges do not interact in ways that will increase the treewidth bound by more than one, we need to characterize the nodes contaminated by each edge addition: a node v is contaminated by adding (s → t) to G if it is incident to a new edge added during our treewidth-friendly triangulation. Below are several examples of contaminated sets (solid nodes) incident to edges added (dashed) by our edge update procedure for different candidate edge additions (s → t) to the Bayesian network on the left. In all examples except the last the treewidth is increased by one.

[Graphics: five example graphs with the edges added by the update procedure and their contaminated sets marked.]

Using the notion of contamination, we can characterize sets of edges that are jointly treewidth-friendly. We will use this to learn optimal treewidth-friendly chains given an ordering in Section 5.
Theorem 4.6: (Treewidth-friendly set). Let G be a graph structure and M+ be its corresponding moralized triangulation. If {(si → ti)} is a set of candidate edges satisfying the following:

• the contaminated sets of any (si → ti) and (sj → tj) are disjoint, or,
• the contaminated sets overlap at a single cut-vertex, but the endpoints of each edge are not in the same block and the block paths between the endpoints do not overlap;

then adding all edges to G can increase the treewidth bound by at most one.
Proof: (outline) The theorem holds trivially for the first condition. Under the second condition, the only common vertex is a cut-vertex. However, since all other contaminated nodes are in different blocks, they cannot interact to form a large clique.

5 Learning Optimal Treewidth-Friendly Chains
In the previous section we described our edge update procedure and characterized edge chains that jointly increase the treewidth bound by at most one. 
We now use this to search for optimal chain structures that satisfy Theorem 4.6, and are thus treewidth-friendly, given a topological node ordering. On the surface, one might question the need for a specific node ordering altogether if global chain operators are to be used: given the result of Chow and Liu [5], one might expect that learning the optimal chain with respect to any ordering can be carried out efficiently. However, Meek [19] showed that learning an optimal chain over a set of random variables is computationally difficult, and the result can be generalized to learning a chain conditioned on the current model. Thus, during any iteration of our algorithm, we cannot expect to find the overall optimal chain.
Instead, we commit to a single node ordering that is topologically consistent (each node appears after its parents in the network) and learn the optimal treewidth-friendly chain with respect to that order (we briefly discuss the details of our ordering below). To find such a chain in polynomial time, we use a straightforward dynamic programming approach: the best treewidth-friendly chain that contains (Os → Ot) is the concatenation of:

• the best chain from the first node O1 to OF, the first node contaminated by (Os → Ot)
• the edge (Os → Ot)
• the best chain starting from OL, the last node contaminated, to the last node in the order ON. 
[Diagram: the node ordering O1 ... OF ... Os → Ot ... OL ... ON, with an optimal chain over [O1, OF] and an optimal chain over [OL, ON] on either side of the new edge.]

We note that when the end nodes are not separating cut-vertices, we maintain a gap so that the contamination sets are disjoint and the conditions of Theorem 4.6 are met.

Figure 2: Gene expression results: (left) 5-fold mean test log-loss per instance vs. treewidth bound. Our method (solid blue squares) is compared to the thin junction tree method (dashed red circles), and an unbounded aggressive search (dotted black). (middle) the treewidth estimate and the number of edges in the chain during the iterations of a typical run with the bound set to 10. (right) shows running time as a function of the bound.

Formally, we define C[i, j] as the optimal chain whose contamination is limited to the range [Oi, Oj] and our goal is to compute C[1, N]. Using F to denote the first node ordered in the contamination set of (s → t) (and L for the last), we can compute C[1, N] via the following recursive update principle

C[i, j] = max of:
    max over s, t with F = i, L = j of (s → t)            (no split)
    max over k = i+1, ..., j−1 of C[i, k] ∪ C[k, j]       (split)
    ∅                                                     (leave a gap)

where the maximization is with respect to the structure score (e.g., BIC). 
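A minimal memoized rendering of this recursion is sketched below. Abstracting each candidate edge to its contamination boundaries (F, L) and a precomputed score is an assumption of the sketch; in the paper the contamination sets come from the edge update procedure, which is not implemented here:

```python
from functools import lru_cache

def best_chain(n, candidates):
    """candidates: dict mapping (F, L) -> (score, edge) for the best
    candidate edge whose contamination spans exactly [F, L]; positions
    in the ordering are 1..n. Returns (total score, tuple of edges)."""

    @lru_cache(maxsize=None)
    def C(i, j):
        if j - i < 1:
            return (0.0, ())
        best = (0.0, ())                     # "leave a gap"
        if (i, j) in candidates:             # "no split"
            score, edge = candidates[(i, j)]
            if score > best[0]:
                best = (score, (edge,))
        for k in range(i + 1, j):            # "split" at node k
            ls, lc = C(i, k)
            rs, rc = C(k, j)
            if ls + rs > best[0]:
                best = (ls + rs, lc + rc)
        return best

    return C(1, n)
```

For example, with candidates spanning [1, 3] (score 2.0), [3, 6] (score 1.5) and [1, 6] (score 3.0) over six ordered nodes, the split at k = 3 wins with total score 3.5, beating the single edge that contaminates the whole range.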
That is, the best chain in a subsequence [i, j] in the ordering is the maximum of three alternatives: edges whose contamination boundaries are exactly i and j (no split); two chains that are joined at some node i < k < j (split); a gap between i and j when there is no positive edge whose contamination is in [i, j].
Finally, for lack of space we only provide a brief description of our topological node ordering. Intuitively, since edges contaminate nodes along the block path between the edge's endpoints (see Section 4), we want to adopt a DFS ordering over the blocks so as to facilitate as many edges as possible between different branches of the block tree. We order nodes within a block by the distance from the "entry" vertex, as motivated by the following result on the distance d^M_min(u, v) between nodes u, v in the triangulated graph M+ (proof not shown for lack of space):
Theorem 5.1: Let r, s, t be nodes in a block B in the triangulated graph M+ with d^M_min(r, s) ≤ d^M_min(r, t). Then for any v on an induced path between s and t we have d^M_min(r, v) ≤ d^M_min(r, t).
The efficiency of our method outlined in Figure 1 in the number of variables and the treewidth bound (Theorem 3.1) now follows from the efficiency of the ordering and chain learning procedures.

6 Experimental Evaluation
We compare our approach on four real-world datasets to several methods. The first is an improved variant of the thin junction tree method [2]. We start (as in our method) with a Chow-Liu forest and iteratively add the single best scoring edge as long as the treewidth bound is not exceeded. 
To make the comparison independent of the choice of triangulation method, at each iteration we replace the heuristic triangulation (best of maximum cardinality search or minimum fill-in [16], which in practice had negligible differences) with our triangulation if it results in a lower treewidth. The second baseline is an aggressive structure learning approach that combines greedy edge modifications with a TABU list (e.g., [13]) and random moves and that is not constrained by a treewidth bound. Where relevant we also compare our results to the results of Chechetka and Guestrin [3].
Gene Expression. We first consider a continuous dataset of the expression of yeast genes (variables) in 173 experiments (instances) [12]. We learn sigmoid Bayesian networks using the BIC structure score [24] using the fully observed set of 89 genes that participate in general metabolic processes. Here a learned model indicates possible regulatory or functional connections between genes.
Figure 2(a) shows test log-loss as a function of treewidth bound. The first obvious phenomenon is that both our method and the thin junction tree approach are superior to the aggressive baseline. As one might expect, the aggressive baseline achieves a higher BIC score on training data (not shown), but overfits due to the scarcity of the data. The consistent superiority of our method over thin junction trees demonstrates that a better choice of edges, i.e., ones chosen by a global operator, can lead to increased robustness and better generalization. 
Figure 3: 5-fold mean test log-loss/instance for a treewidth bound of two vs. training set size for the temperature (left) and traffic (right) datasets. Compared are our approach (solid blue squares), the thin junction tree method (dashed red circles), an aggressive unbounded search (dotted black), and the method of Chechetka and Guestrin [3] (dash-dot magenta diamonds).

Figure 4: Average log-loss vs. treewidth bound for the Hapmap data. Compared are an unbounded aggressive search (dotted) and unconstrained (thin) and constrained by the DNA order (thick) variants of ours and the thin junction tree method.

Indeed, even when the treewidth bound is increased past the saturation point, our method surpasses both baselines. In this case, we are learning unbounded networks and all benefit comes from the global nature of our updates.
To qualitatively illustrate the progression of our algorithm, in Figure 2(b) we plot the number of edges in the chain and the treewidth estimate at the end of each iteration for a typical run. Our algorithm aggressively adds multi-edge chains until the treewidth bound is reached, at which point (iteration 24) it becomes fully greedy. 
To appreciate the non-triviality of some of the chains learned\nwith 4\u2212 7 edges, we recall that the chains are added after a Chow-Liu model was initially learned. It\nis also worth noting that despite their complexity, some chains do not increase the treewidth estimate\nand we typically have more than K iterations where chains with more than one edge are added. The\nnumber of such iterations is still polynomially bounded as for a Bayesian network with N variables\nadding more than K \u00b7 N edges will necessarily result in a treewidth that is greater than K.\nTo evaluate the ef\ufb01ciency of our method we measured its running time as a function of the treewidth\nbound. Figure 2(c) shows results for the gene expression dataset. Observe that our method and the\ngreedy thin junction tree approach are both approximately linear in the treewidth bound. Appeal-\ningly, the additional computation our method requires is not signi\ufb01cant (\u2264 25%). This should not\ncome as a surprise as the bulk of the time is spent on the collection of the data suf\ufb01cient statistics.\nIt is also worth discussing the range of treewidths we considered in the above experiment as well as\nthe Haplotype experiment below. While treewidths greater than 25 seem excessive for exact infer-\nence, state-of-the-art techniques (e.g., [9, 18]) can reasonably handle inference in networks of this\ncomplexity. Furthermore, as our results show, it is bene\ufb01cial in practice to learn such models. Thus,\ncombining our method with state-of-the-art inference techniques can allow practitioners to push the\nenvelope of the complexity of models learned for real applications that rely on exact inference.\nThe Traf\ufb01c and Temperature Datasets. We now compare our method to the mutual-information\nbased LPACJT approach of Chechetka and Guestrin [3] (we compare to the better variant). 
As their method is exponential in the treewidth and cannot be used in the gene expression setting, we compare to it on the two discrete real-life datasets Chechetka and Guestrin [3] considered: the temperature data comes from a deployment of 54 sensor nodes; the traffic dataset contains traffic flow information measured every 5 minutes at 32 locations in California. To make the comparison fair, we used the same discretization and train/test splits. Furthermore, as their method can only be applied with a small treewidth bound, we also limited our model to a treewidth of two. Figure 3 compares the different methods. Both our method and the thin junction tree approach significantly outperform the LPACJT on small sample sizes. This result is consistent with the results reported in Chechetka and Guestrin [3] and is due to the fact that the LPACJT method does not facilitate the use of regularization, which is crucial in the sparse-data regime. The performance of our method is comparable to that of the greedy thin junction tree approach, with no obvious superiority of either method. This is not surprising: the fact that the unbounded aggressive search is not significantly better suggests that the strong signal in the data can be captured rather easily. In fact, Chechetka and Guestrin [3] show that even a Chow-Liu tree does rather well on these datasets (compare this to the gene expression dataset, where the aggressive variant was superior even at a treewidth of five).

Haplotype Sequences. Finally, we consider a more difficult discrete dataset of a sequence of single nucleotide polymorphism (SNP) alleles from the Human HapMap project [6].
Our model is defined over 200 SNPs (binary variables) from chromosome 22 of a European population consisting of 60 individuals (we considered several different sequences along the chromosome, with similar results). In this case, there is a natural ordering of the variables that corresponds to the position of the SNPs in the DNA sequence. Figure 4 shows test log-loss results when this ordering is enforced (thicker) and when it is not (thinner). When the ordering is used, the superiority of our method is clear, while the performance of the thin junction tree method degrades. This is to be expected, as the greedy method does not make use of a node ordering, while our method provides optimality guarantees with respect to a variable ordering at each iteration. Whether constrained to the natural variable ordering or not, our method ultimately also surpasses the unbounded aggressive search.

7 Discussion and Future Work

In this work we presented a novel method for learning Bayesian networks of bounded treewidth in time that is polynomial in both the number of variables and the treewidth bound. Our method builds on an edge update algorithm that dynamically maintains a valid moralized triangulation in a way that facilitates the addition of chains that are guaranteed to increase the treewidth bound by at most one. We demonstrated the effectiveness of our treewidth-friendly method on real-life datasets, and showed that by utilizing global structure modification operators, we are able to learn better models than competing methods, even when the treewidth of the models learned is not constrained.

Our method can be viewed as a generalization of the work of Chow and Liu [5] that is constrained to a chain structure but that provides an optimality guarantee (with respect to a node ordering) at every treewidth.
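The Chow-Liu procedure that our method generalizes builds a maximum-weight spanning tree over the pairwise empirical mutual information [5]. The sketch below is illustrative only (Kruskal's algorithm with union-find, over discrete data rows); the function and variable names are ours, not from any implementation used in the experiments:

```python
import math
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information between discrete variables i and j."""
    n = len(data)
    pi, pj, pij = {}, {}, {}
    for row in data:
        pi[row[i]] = pi.get(row[i], 0) + 1
        pj[row[j]] = pj.get(row[j], 0) + 1
        pij[(row[i], row[j])] = pij.get((row[i], row[j]), 0) + 1
    mi = 0.0
    for (a, b), c in pij.items():
        # (c/n) * log( (c/n) / ((pi[a]/n) * (pj[b]/n)) )
        mi += (c / n) * math.log(c * n / (pi[a] * pj[b]))
    return mi

def chow_liu_edges(data, n_vars):
    """Maximum spanning tree over pairwise mutual information (Kruskal)."""
    scored = sorted(((mutual_information(data, i, j), i, j)
                     for i, j in combinations(range(n_vars), 2)),
                    reverse=True)
    parent = list(range(n_vars))
    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for _, i, j in scored:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

Directing the resulting tree away from an arbitrary root gives the Chow-Liu Bayesian network; the chain-based generalization described above replaces the spanning-tree step with treewidth-friendly chain additions.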
In addition, unlike the thin junction trees approach of Bach and Jordan [2], we provide a guarantee that our estimate of the treewidth bound will not increase by more than one at each iteration. Furthermore, we add multiple edges at each iteration, which in turn allows us to better cope with the problem of local maxima in the search. To our knowledge, ours is the first method for efficiently learning Bayesian networks with an arbitrary treewidth bound that is not fully greedy.

Our method motivates several exciting future directions. It would be interesting to see to what extent we could overcome the need to commit to a specific node ordering at each iteration. While we provably cannot consider every ordering, it may be possible to provide a reasonable approximation in polynomial time. Second, it may be possible to refine our characterization of the contamination that results from an edge update, which in turn may facilitate the addition of more complex treewidth-friendly structures at each iteration. Finally, we are most interested in exploring whether tools similar to the ones employed in this work could be used to dynamically update the bounded treewidth structure that serves as the approximating distribution in a variational approximate inference setting.

References
[1] P. Abbeel, D. Koller, and A. Ng. Learning factor graphs in polynomial time and sample complexity. JMLR, 2006.
[2] F. Bach and M. I. Jordan. Thin junction trees. In NIPS, 2001.
[3] A. Chechetka and C. Guestrin. Efficient principled learning of thin junction trees. In NIPS, 2008.
[4] D. Chickering. Learning Bayesian networks is NP-complete. In Learning from Data: Artificial Intelligence and Statistics V, 1996.
[5] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 1968.
[6] The International HapMap Consortium. The International HapMap Project. Nature, 2003.
[7] G. F. Cooper. The computational complexity of probabilistic inference using belief networks.
Artificial Intelligence, 1990.
[8] P. Dagum and M. Luby. An optimal approximation algorithm for Bayesian inference. Artificial Intelligence, 1993.
[9] A. Darwiche. Recursive conditioning. Artificial Intelligence, 2001.
[10] S. Dasgupta. Learning polytrees. In UAI, 1999.
[11] G. A. Dirac. On rigid circuit graphs. Abhandlungen aus dem Mathematischen Seminar der Universität Hamburg, 25, 1961.
[12] A. Gasch et al. Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell, 2000.
[13] F. Glover and M. Laguna. Tabu search. In Modern Heuristic Techniques for Combinatorial Problems, 1993.
[14] D. Heckerman. A tutorial on learning Bayesian networks. Technical report, Microsoft Research, 1995.
[15] D. Karger and N. Srebro. Learning Markov networks: maximum bounded tree-width graphs. In Symposium on Discrete Algorithms, 2001.
[16] A. Koster, H. Bodlaender, and S. Van Hoesel. Treewidth: Computational experiments. Technical report, Universiteit Utrecht, 2001.
[17] S. Lauritzen and D. Spiegelhalter. Local computations with probabilities on graphical structures. Journal of the Royal Statistical Society, 1988.
[18] R. Marinescu and R. Dechter. AND/OR branch-and-bound for graphical models. In IJCAI, 2005.
[19] C. Meek. Finding a path is harder than finding a tree. Journal of Artificial Intelligence Research, 2001.
[20] M. Meila and M. I. Jordan. Learning with mixtures of trees. JMLR, 2000.
[21] M. Narasimhan and J. Bilmes. PAC-learning bounded tree-width graphical models. In UAI, 2004.
[22] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[23] N. Robertson and P. Seymour. Graph minors II: Algorithmic aspects of tree-width. Journal of Algorithms, 1987.
[24] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.