{"title": "Advances in Learning Bayesian Networks of Bounded Treewidth", "book": "Advances in Neural Information Processing Systems", "page_first": 2285, "page_last": 2293, "abstract": "This work presents novel algorithms for learning Bayesian networks of bounded treewidth. Both exact and approximate methods are developed. The exact method combines mixed integer linear programming formulations for structure learning and treewidth computation. The approximate method consists in sampling k-trees (maximal graphs of treewidth k), and subsequently selecting, exactly or approximately, the best structure whose moral graph is a subgraph of that k-tree. The approaches are empirically compared to each other and to state-of-the-art methods on a collection of public data sets with up to 100 variables.", "full_text": "Advances in Learning Bayesian\nNetworks of Bounded Treewidth\n\nSiqi Nie\n\nRensselaer Polytechnic Institute\n\nTroy, NY, USA\nnies@rpi.edu\n\nCassio P. de Campos\n\nQueen\u2019s University Belfast\n\nBelfast, UK\n\nc.decampos@qub.ac.uk\n\nDenis D. Mau\u00b4a\n\nUniversity of S\u02dcao Paulo\n\nS\u02dcao Paulo, Brazil\n\ndenis.maua@usp.br\n\nQiang Ji\n\nRensselaer Polytechnic Institute\n\nTroy, NY, USA\n\nqji@ecse.rpi.edu\n\nAbstract\n\nThis work presents novel algorithms for learning Bayesian networks of bounded\ntreewidth. Both exact and approximate methods are developed. The exact method\ncombines mixed integer linear programming formulations for structure learning\nand treewidth computation. The approximate method consists in sampling k-trees\n(maximal graphs of treewidth k), and subsequently selecting, exactly or approx-\nimately, the best structure whose moral graph is a subgraph of that k-tree. 
The approaches are empirically compared to each other and to state-of-the-art methods on a collection of public data sets with up to 100 variables.\n\n1 Introduction\n\nBayesian networks are graphical models widely used to represent joint probability distributions on complex multivariate domains. A Bayesian network comprises two parts: a directed acyclic graph (the structure) describing the relationships among the variables in the model, and a collection of conditional probability tables from which the joint distribution can be reconstructed. As the number of variables in the model increases, specifying the underlying structure becomes a daunting task, and practitioners often resort to learning Bayesian networks directly from data. Here, learning a Bayesian network refers to inferring its structure from data, a task known to be NP-hard [9].\n\nLearned Bayesian networks are commonly used for drawing inferences such as querying the posterior probability of some variable given some evidence or finding the mode of the posterior joint distribution. Those inferences are NP-hard to compute even approximately [23], and all known exact and provably good algorithms have worst-case time complexity exponential in the treewidth, which is a measure of the tree-likeness of the structure. In fact, under widely believed assumptions from complexity theory, exponential time complexity in the treewidth is inevitable for any algorithm that performs exact inference [7, 20]. Thus, learning networks of small treewidth is essential if one wishes to ensure reliable and efficient inference. This is particularly important in the presence of missing data, when learning becomes intertwined with inference [16]. There is a second reason to limit the treewidth. 
Previous empirical results [15, 22] suggest that bounding the treewidth improves model performance on unseen data, hence improving the model's generalization ability.\n\nIn this paper we present two novel ideas for score-based Bayesian network learning with a hard constraint on treewidth. The first one is a mixed-integer linear programming (MILP) formulation of the problem (Section 3) that builds on existing MILP formulations for unconstrained learning of Bayesian networks [10, 11] and for computing the treewidth of a graph [17]. Unlike the MILP formulation of Parviainen et al. [21], the MILP problem we generate is of polynomial size in the number of variables, and dispenses with the use of cutting-plane techniques. This makes for a clean and succinct formulation that can be solved with a single call to any MILP optimizer. We provide some empirical evidence (in Section 5) that suggests that our approach is not only simpler but often faster. It also outperforms the dynamic programming approach of Korhonen and Parviainen [19].\n\nSince linear programming relaxations are used for solving the MILP problem, any MILP formulation can be used to provide approximate solutions and error estimates in an anytime fashion (i.e., the method can be stopped at any time during the computation with a feasible solution whose quality monotonically improves with time). However, the MILP formulations (both ours and that of Parviainen et al. [21]) cannot cope with very large domains, even if we settle for approximate solutions. In order to deal with large domains, we devise (in Section 4) an approximate method based on a uniform sampling of k-trees (maximal chordal graphs of treewidth k), which is achieved by using a fast computable bijection between k-trees and Dandelion codes [6]. 
For each sampled k-tree, we either run an exact algorithm similar to the one in [19] (when computationally appealing) to learn the score-maximizing network whose moral graph is a subgraph of that k-tree, or we resort to a more efficient method that takes partial variable orderings uniformly at random from a (relatively small) space of orderings that are compatible with the k-tree. We show empirically (in Section 5) that our sampling-based methods are very effective in learning close to optimal structures and scale up to large domains. We conclude in Section 6 and point out possible future work. We begin with some background knowledge and a literature review on learning Bayesian networks (Section 2).\n\n2 Bayesian Network Structure Learning\n\nLet N be {1, . . . , n} and consider a finite set X = {Xi : i \u2208 N} of categorical random variables Xi taking values in finite sets Xi. A Bayesian network is a triple (X, G, \u03b8), where G = (N, A) is a directed acyclic graph (DAG) whose nodes are in one-to-one correspondence with variables in X, and \u03b8 = {\u03b8i(xi, xGi)} is a set of numerical parameters specifying (conditional) probabilities \u03b8i(xi, xGi) = Pr(xi|xGi), for every node i in G, value xi of Xi and assignment xGi to the parents Gi of Xi in G. The structure G of the network represents a set of stochastic independence assessments among variables in X called graphical Markov conditions: every variable Xi is conditionally independent of its nondescendant nonparents given its parents. As a consequence, a Bayesian network uniquely defines a joint probability distribution over X as the product of its parameters.\n\nAs is common in the literature, we formulate the problem of Bayesian network learning as an optimization over DAG structures guided by a score function. 
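The factorization just defined can be made concrete in a few lines; the sketch below evaluates the joint as the product of the parameters theta_i. The two-node network, its CPT numbers, and all names here are illustrative, not from the paper.

```python
# Minimal sketch: the joint distribution of a Bayesian network is the product
# of its conditional probability parameters theta_i(x_i, x_Gi). The toy
# network and its numbers are illustrative only.

def joint_probability(parents, cpts, assignment):
    """Pr(x) = prod_i theta_i(x_i | x_Gi) for a full assignment x."""
    p = 1.0
    for i, pa in parents.items():
        p *= cpts[i][(assignment[i], tuple(assignment[j] for j in pa))]
    return p

# Toy structure 1 -> 2 over binary variables.
parents = {1: (), 2: (1,)}
cpts = {
    1: {(0, ()): 0.6, (1, ()): 0.4},
    2: {(0, (0,)): 0.9, (1, (0,)): 0.1, (0, (1,)): 0.2, (1, (1,)): 0.8},
}
print(joint_probability(parents, cpts, {1: 1, 2: 1}))  # 0.4 * 0.8
```

Summing this product over all joint assignments yields 1, as the graphical Markov conditions guarantee.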
We only require that (i) the score function can be written as a sum of local score functions si(Gi), i \u2208 N, each depending only on the corresponding parent set Gi and on the data, and (ii) the local score functions can be efficiently computed and stored [13, 14]. These properties are satisfied by commonly used score functions such as the Bayesian Dirichlet equivalent uniform score [18]. We assume the reader is familiar with graph-theoretic concepts such as polytrees, chordal graphs, chordalizations, moral graphs, moralizations, topological orders, (perfect) elimination orders, fill-in edges and clique-trees. References [1] and [20] are good starting points on the topic.\n\nMost score functions penalize model complexity in order to avoid overfitting. The way scores penalize model complexity generally leads to learning structures of bounded in-degree, but even bounded in-degree graphs can have high treewidth (for instance, directed square grids have treewidth equal to the square root of the number of nodes, yet have maximum in-degree equal to two), which brings difficulty to subsequent probabilistic inferences with the model [5]. The goal of this work is to develop methods that search for\n\nG\u2217 = argmax_{G\u2208GN,k} \u2211_{i\u2208N} si(Gi) ,   (1)\n\nwhere GN,k is the set of all DAGs with node set N and treewidth at most k. Dasgupta proved NP-hardness of learning polytrees of bounded treewidth when the score is data log-likelihood [12]. Korhonen and Parviainen [19] adapted Srebro\u2019s complexity result for Markov networks [25] to show that learning Bayesian networks of treewidth two or greater is NP-hard.\n\nIn comparison to the unconstrained problem, few algorithms have been designed for the bounded treewidth case. 
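The decomposable objective of Eq. (1) is straightforward to evaluate for a candidate structure; the sketch below sums precomputed local scores and checks acyclicity. The score table and DAG are illustrative stand-ins for scores computed from data.

```python
# Sketch of the objective in Eq. (1): a decomposable score is the sum of
# precomputed local scores s_i(G_i) over the chosen parent sets.

def total_score(dag, local_scores):
    """dag maps node -> frozenset of parents; local_scores maps
    (node, parents) -> s_i(G_i)."""
    return sum(local_scores[i, pa] for i, pa in dag.items())

def is_acyclic(dag):
    """Verify the parent-set map encodes a DAG (depth-first cycle check)."""
    seen, done = set(), set()
    def visit(i):
        if i in done:
            return True
        if i in seen:  # back edge: directed cycle
            return False
        seen.add(i)
        if not all(visit(p) for p in dag[i]):
            return False
        done.add(i)
        return True
    return all(visit(i) for i in dag)

scores = {(1, frozenset()): -10.0, (2, frozenset()): -12.0,
          (2, frozenset({1})): -9.5, (3, frozenset({1, 2})): -7.0}
dag = {1: frozenset(), 2: frozenset({1}), 3: frozenset({1, 2})}
assert is_acyclic(dag)
print(total_score(dag, scores))  # -26.5
```

The treewidth constraint is what makes the optimization hard: it couples the per-node choices, so the sum cannot simply be maximized term by term.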
Korhonen and Parviainen [19] developed an exact algorithm based on dynamic programming that learns optimal n-node structures of treewidth at most w in time 3^n n^(w+O(1)), which is above the 2^n n^(O(1)) time required by the best worst-case algorithms for learning optimal Bayesian networks with no constraint on treewidth [24]. We shall refer to their method in the rest of this paper as K&P (after the authors\u2019 initials). Elidan and Gould [15] combined several heuristics for treewidth computation and network structure learning in order to design approximate methods. Others have addressed the similar (but not equivalent) problem of learning undirected models of bounded treewidth [2, 8, 25]. Very recently, there seems to be an increase in interest in the topic. Berg et al. [4] showed that the problem of learning bounded treewidth Bayesian networks can be reduced to a weighted maximum satisfiability problem, and subsequently solved by weighted MAX-SAT solvers. They report experimental results showing that their approach outperforms K&P. In the same year, Parviainen et al. [21] showed that the problem can be reduced to a MILP. Their reduced MILP problem however has exponentially many constraints in the number of variables. Following the work of Cussens [10], the authors avoid creating such large programs by a cutting-plane generation mechanism, which iteratively includes a new constraint until the optimum is found. The generation of each new constraint (cutting plane) requires solving another MILP problem. We shall refer to their method from now on as TWILP (after the name of the software package the authors provide).\n\n3 A Mixed Integer Linear Programming Approach\n\nThe first contribution of this work is the MILP formulation that we design to solve the problem of structure learning with bounded treewidth. 
MILP formulations have been shown to be very effective for learning Bayesian networks with no constraint on treewidth [3, 10], surpassing other attempts in a range of data sets. The formulation is based on combining the MILP formulation for structure learning in [11] with the MILP formulation presented in [17] for computing the treewidth of an undirected graph. There are however notable differences: for instance, we do not enforce a linear elimination ordering of nodes; instead we allow for partial orders, which capture the equivalence between different orders in terms of minimizing treewidth, and we represent such partial orders by real numbers instead of integers. We avoid the use of sophisticated techniques for solving MILP problems such as constraint generation [3, 10], which allows for an easy implementation and parallelization (MILP optimizers such as CPLEX can take advantage of that).\n\nFor each node i in N, let Pi be the collection of allowed parent sets for that node (these sets can be specified manually by the user or simply defined as the subsets of N \\ {i} with cardinality less than a given bound). We denote an element of Pi as Pit, with t = 1, . . . , ri and ri = |Pi| (hence Pit \u2282 N). We will refer to a DAG as valid if its node set is N and the parent set of each node i in it is an element of Pi. The following MILP problem can be used to find valid DAGs whose treewidth is at most w:\n\nMaximize \u2211_{it} pit \u00b7 si(Pit)   (2)\n\nsubject to\n\n\u2211_{j\u2208N} yij \u2264 w,   \u2200i \u2208 N,   (3a)\n(n + 1) \u00b7 yij \u2264 n + zj \u2212 zi,   \u2200i, j \u2208 N,   (3b)\nyij + yik \u2212 yjk \u2212 ykj \u2264 1,   \u2200i, j, k \u2208 N,   (3c)\n\u2211_t pit = 1,   \u2200i \u2208 N,   (4a)\n(n + 1) \u00b7 pit \u2264 n + vj \u2212 vi,   \u2200i \u2208 N, \u2200t \u2208 {1, . . . , ri}, \u2200j \u2208 Pit,   (4b)\npit \u2264 yij + yji,   \u2200i \u2208 N, \u2200t \u2208 {1, . . . , ri}, \u2200j \u2208 Pit,   (4c)\npit \u2264 yjk + ykj,   \u2200i \u2208 N, \u2200t \u2208 {1, . . . , ri}, \u2200j, k \u2208 Pit,   (4d)\nzi \u2208 [0, n], vi \u2208 [0, n], yij \u2208 {0, 1}, pit \u2208 {0, 1},   \u2200i, j \u2208 N, \u2200t \u2208 {1, . . . , ri}.   (5)\n\nThe variables pit define which parent sets are chosen, while the variables vi guarantee that those choices respect a linear ordering of the variables, and hence that the corresponding directed graph is acyclic. The variables yij specify a chordal moralization of this DAG with arcs respecting an elimination ordering of width at most w, which is given by the variables zi. The following result shows that any solution to the MILP above can be decoded into a chordal graph of bounded treewidth and a suitable perfect elimination ordering.\n\nLemma 1. Let zi, yij, i, j \u2208 N, be variables satisfying Constraints (3) and (5). Then the undirected graph M = (N, E), where E = {ij \u2208 N \u00d7 N : yij = 1 or yji = 1}, is chordal and has treewidth at most w. Any elimination ordering that extends the weak ordering induced by zi is perfect for M.\n\nThe graph M is used in the formulation as a template for the moral graph of a valid DAG:\n\nLemma 2. Let vi, pit, i \u2208 N, t = 1, . . . , ri, be variables satisfying Constraints (4) and (5). Then the directed graph G = (N, A), where Gi = {j : pit = 1 and j \u2208 Pit}, is acyclic and valid. Moreover the moral graph of G is a subgraph of the graph M defined in the previous lemma.\n\nThe previous lemmas suffice to show that the solutions of the MILP problem can be decoded into valid DAGs of bounded treewidth:\n\nTheorem 1. Any solution to the MILP can be decoded into a valid DAG of treewidth less than or equal to w. In particular, the decoding of an optimal solution solves (1).\n\nThe MILP formulation can be directly fed into any off-the-shelf MILP optimizer. Most MILP optimizers (e.g. CPLEX) can be prematurely stopped while providing an incumbent solution and an error estimate. Moreover, given enough resources (time and memory), these solvers always find optimal solutions. Hence, the MILP formulation provides an anytime algorithm that can be used to provide both exact and approximate solutions.\n\nThe bottleneck in terms of efficiency of the MILP construction lies in the specification of Constraints (3c) and (4d), as there are \u0398(n^3) such constraints. Thus, as n increases even the linear relaxations of the MILP problem become hard to solve. We demonstrate empirically in Section 5 that the quality of solutions found by the MILP approach in a reasonable amount of time degrades quickly as the number of variables exceeds a few dozens. In the next section, we present an approximate algorithm to overcome such limitations and handle large domains.\n\n4 A Sampling Based Approach\n\nA successful method for learning Bayesian networks of unconstrained treewidth on large domains is order-based local search, which consists in sampling topological orderings for the variables and selecting optimal compatible DAGs [26]. Given a topological ordering, the optimal DAG can be found in linear time (assuming scores are given as input), hence rendering order-based search really effective in exploring the solution space. A naive extension of that approach to the bounded treewidth case would be to (i) sample a topological order, (ii) find the optimal compatible DAG, (iii) verify the treewidth and discard the DAG if it exceeds the desired bound. There are two serious issues with that approach. 
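The order-based scheme just described can be sketched as follows; the brute-force parent-set selection below is only an illustration (it assumes precomputed local scores and small candidate parent sets), not the implementation of [26].

```python
from itertools import combinations

# Sketch of order-based selection: given a topological ordering, each node
# independently picks its best-scoring parent set among its predecessors.
# local_score is a hypothetical precomputed score table.

def best_dag_for_order(order, local_score, max_parents=2):
    dag = {}
    for pos, i in enumerate(order):
        preds = order[:pos]
        candidates = [frozenset(c)
                      for r in range(min(len(preds), max_parents) + 1)
                      for c in combinations(preds, r)]
        dag[i] = max(candidates, key=lambda pa: local_score(i, pa))
    return dag

# Toy score: favors parent set {1} for node 2, empty parent sets otherwise.
def local_score(i, pa):
    table = {(2, frozenset({1})): -1.0}
    return table.get((i, pa), -2.0 - len(pa))

dag = best_dag_for_order([1, 2, 3], local_score)
print(dag[2])  # frozenset({1})
```

Because each node's choice depends only on the set of predecessors, the selection decomposes per node, which is exactly why order-based search is fast; the difficulty is that nothing in this loop controls the treewidth of the result.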
First, verifying the treewidth is an NP-hard problem, and even if there are linear-time algorithms (which are exponential in the treewidth), they perform poorly in practice; second, the vast majority of structures would be discarded, since the most used score functions penalize the number of free parameters, which correlates poorly with treewidth [5].\n\nIn this section, we propose a more sophisticated extension of order-based search to learn bounded treewidth structures. Our method relies on sampling k-trees, which are defined inductively as follows [6]. A complete graph with k + 1 nodes (i.e., a (k + 1)-clique) is a k-tree. Let Tk = (V, E) be a k-tree, K be a k-clique in it, and v be a node not in V. Then the graph obtained by connecting v to every node in K is also a k-tree. A k-tree is a maximal graph of treewidth k in the sense that no edge can be added without increasing the treewidth. Every graph of treewidth at most k is a subgraph of some k-tree. Hence, Bayesian networks of treewidth bounded by k are exactly those whose moral graph is a subgraph of some k-tree [19]. We are interested in k-trees over the nodes N of the Bayesian network and where k = w is the bound we impose on the treewidth.\n\nCaminiti et al. [6] proposed a linear time method (in both n and k) for coding and decoding k-trees into what is called (generalized) Dandelion codes. They also established a bijection between Dandelion codes and k-trees. Hence, sampling Dandelion codes is essentially equivalent to sampling k-trees. The former however is computationally much easier and faster to perform, especially if we want to draw samples uniformly at random (uniform sampling provides good coverage of the space and produces low variance estimates across data sets). Formally, a Dandelion code is a pair (Q, S), where Q \u2286 N with |Q| = k and S is a list of n \u2212 k \u2212 2 pairs of integers drawn from N \u222a {\u03b5}, where \u03b5 is an arbitrary number not in N. 
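The inductive definition of k-trees translates directly into code. The generator below is a minimal sketch for illustration (it grows a random k-tree by the textbook construction; it is not the Dandelion-code machinery of Caminiti et al., and the clique bookkeeping is deliberately naive).

```python
import random
from itertools import combinations

def random_ktree(n, k, seed=0):
    """Grow a k-tree on nodes 0..n-1 by the inductive definition:
    start from a (k+1)-clique, then repeatedly connect a fresh node
    to every node of some existing k-clique."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    base = list(range(k + 1))
    for u, v in combinations(base, 2):  # initial (k+1)-clique
        adj[u].add(v); adj[v].add(u)
    kcliques = [frozenset(c) for c in combinations(base, k)]
    for v in range(k + 1, n):
        clique = rng.choice(kcliques)
        for u in clique:
            adj[u].add(v); adj[v].add(u)
        # new k-cliques: v together with each (k-1)-subset of the chosen clique
        kcliques.extend(frozenset(s | {v})
                        for s in map(set, combinations(clique, k - 1)))
    return adj

adj = random_ktree(8, 2)
# every k-tree on n nodes has exactly k*n - k*(k+1)/2 edges
print(sum(len(s) for s in adj.values()) // 2)  # 13
```

Note that growing random k-trees this way does not sample them uniformly; the Dandelion-code bijection is what makes uniform sampling possible.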
Dandelion codes can be sampled uniformly by a trivial linear-time algorithm that uniformly chooses k elements from N to build Q, then uniformly samples n \u2212 k \u2212 2 pairs of integers in N \u222a {\u03b5}. Algorithm 1 contains a high-level description of our approach.\n\nAlgorithm 1 Learning a structure of bounded treewidth by sampling Dandelion codes.\n% Takes a score function si, i \u2208 N, and an integer k, and outputs a DAG G\u2217 of treewidth \u2264 k.\n1 Initialize G\u2217 as an empty DAG.\n2 Repeat a certain number of iterations:\n2.a Uniformly sample a Dandelion code (Q, S) and decode it into Tk.\n2.b Search for a DAG G that maximizes the score function and is compatible with Tk.\n2.c If \u2211_{i\u2208N} si(Gi) > \u2211_{i\u2208N} si(G\u2217i), update G\u2217.\n\nWe assume from now on that a k-tree Tk is available, and consider the problem of searching for a compatible DAG that maximizes the score (Step 2.b). Korhonen and Parviainen [19] presented an algorithm (which we call K&P) that, given an undirected graph M, finds a DAG G maximizing the score function such that the moralization of G is a subgraph of M. The algorithm runs in time and space O(n) assuming the scores are part of the input (hence pre-computed and accessed at constant time). We can use their algorithm to find the optimal structure whose moral graph is a subgraph of Tk. We call this approach S+K&P, for (k-tree) sampling followed by K&P.\n\nTheorem 2. The size of the sampling space of S+K&P is less than e^(n log(nk)). Each of its iterations runs in linear time in n (but exponential in k).\n\nAccording to the result above, the sampling space of S+K&P is not much bigger than that of standard order-based local search (which is approximately e^(n log n)), especially if k \u226a n. 
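The uniform sampler really is a few lines; the sketch below mirrors the description (choose Q, then n - k - 2 uniform pairs), with EPS standing in for the special symbol not in N. Decoding the code into a k-tree (the nontrivial part, due to Caminiti et al.) is omitted.

```python
import random

EPS = -1  # stands in for the special symbol outside N used by Dandelion codes

def sample_dandelion_code(n, k, rng=random):
    """Uniformly sample a Dandelion code (Q, S) for k-trees on n nodes:
    Q is a uniform k-subset of N = {1, ..., n}; S is a list of n-k-2
    uniform pairs over N plus the special symbol."""
    N = range(1, n + 1)
    Q = set(rng.sample(list(N), k))
    domain = list(N) + [EPS]
    S = [(rng.choice(domain), rng.choice(domain)) for _ in range(n - k - 2)]
    return Q, S

Q, S = sample_dandelion_code(20, 4, random.Random(0))
print(len(Q), len(S))  # 4 14
```

Since the code-to-k-tree map is a bijection, drawing codes uniformly draws k-trees uniformly, which is what gives the method its coverage guarantees.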
The practical drawback of this approach is the \u0398(k \u00b7 3^k \u00b7 (k + 1)! \u00b7 n) time taken by K&P to process each sampled k-tree, which forbids its use for moderately high treewidth bounds (say, k \u2265 10). Our experiments in the next section further corroborate our claim: S+K&P often performs poorly even on small k, mostly due to the small number of k-trees sampled within the given time limit. A better approach is to sacrifice the optimality of the search for compatible DAGs in exchange for an efficiency gain. We next present a method based on sampling topological orderings that achieves such a goal.\n\nLet Ci be the collection of maximal cliques of Tk that contain a certain node i (these can be obtained efficiently, as Tk is chordal), and consider a topological ordering < of N. Let C