{"title": "Learning Chordal Markov Networks via Branch and Bound", "book": "Advances in Neural Information Processing Systems", "page_first": 1847, "page_last": 1857, "abstract": "We present a new algorithmic approach for the task of finding a chordal Markov network structure that maximizes a given scoring function. The algorithm is based on branch and bound and integrates dynamic programming for both domain pruning and for obtaining strong bounds for search-space pruning. Empirically, we show that the approach dominates in terms of running times a recent integer programming approach (and thereby also a recent constraint optimization approach) for the problem. Furthermore, our algorithm scales at times further with respect to the number of variables than a state-of-the-art dynamic programming algorithm for the problem, with the potential of reaching 20 variables and at the same time circumventing the tight exponential lower bounds on memory consumption of the pure dynamic programming approach.", "full_text": "Learning Chordal Markov Networks\n\nvia Branch and Bound\n\nKari Rantanen\n\nHIIT, Dept. Comp. Sci.,\nUniversity of Helsinki\n\nAntti Hyttinen\n\nHIIT, Dept. Comp. Sci.,\nUniversity of Helsinki\n\nMatti J\u00e4rvisalo\n\nHIIT, Dept. Comp. Sci.,\nUniversity of Helsinki\n\nAbstract\n\nWe present a new algorithmic approach for the task of \ufb01nding a chordal Markov\nnetwork structure that maximizes a given scoring function. The algorithm is\nbased on branch and bound and integrates dynamic programming for both domain\npruning and for obtaining strong bounds for search-space pruning. Empirically,\nwe show that the approach dominates in terms of running times a recent integer\nprogramming approach (and thereby also a recent constraint optimization approach)\nfor the problem. 
Furthermore, our algorithm scales at times further with respect to the number of variables than a state-of-the-art dynamic programming algorithm for the problem, with the potential of reaching 20 variables and at the same time circumventing the tight exponential lower bounds on memory consumption of the pure dynamic programming approach.

1 Introduction

Graphical models offer a versatile and theoretically solid framework for various data analysis tasks [1, 30, 17]. In this paper we focus on the structure learning task for chordal Markov networks (or chordal/triangulated Markov random fields or decomposable graphs), a central class of undirected graphical models [7, 31, 18, 17]. This problem, chordal Markov network structure learning (CMSL), is computationally notoriously challenging; e.g., finding a maximum likelihood chordal Markov network with bounded structure complexity (clique size) is known to be NP-hard [23]. Several Markov chain Monte Carlo (MCMC) approaches have been proposed for this task in the literature [19, 27, 10, 11]. Here we take on the challenge of developing a new exact algorithmic approach for finding an optimal chordal Markov network structure in the score-based setting. Underlining the difficulty of this challenge, the first exact algorithms for CMSL have only recently been proposed [6, 12, 13, 14], and generally do not scale up to 20 variables. Specifically, the constraint optimization approach introduced in [6] does not scale up to 10 variables within hours. A similar approach was also taken in [16] in the form of a direct integer programming encoding for CMSL, but was not empirically evaluated in an exact setting. Comparably better performance, scaling up to 10 (at most 15) variables, is exhibited by the integer programming approach implemented in the GOBNILP system [2], extending the core approach of GOBNILP to CMSL by enforcing additional constraints.
The state-of-the-art exact algorithm for CMSL, especially when the clique size of the networks to be learned is not restricted, is Junctor, implementing a dynamic programming approach [13]. The method is based on a recursive characterization of clique trees and on storing in memory the scores of already-solved subproblems. Due to its nature, the algorithm has to iterate through every single solution candidate, although its effective memoization technique helps to avoid revisiting solution candidates [13]. As is typical for dynamic programming algorithms, the worst-case and best-case performance coincide: Junctor is guaranteed to use Ω(4^n) time and Ω(3^n) space.

In this work, we develop an alternative exact algorithm for CMSL. While a number of branch-and-bound algorithms have been proposed in the past for Bayesian network structure learning (BNSL) [25, 28, 20, 29, 26], to the best of our knowledge our approach constitutes the first non-trivial branch-and-bound approach for CMSL. Our core search routine takes advantage of similar ideas as a recently proposed approach for optimally solving BNSL [29], and, on the other hand, like GOBNILP, uses the tight connection between BNSL and CMSL by searching over the space of chordal Markov network structures via considering decomposable directed acyclic graphs. Central to the efficiency of our approach is the integration of dynamic programming over Bayesian network structures for obtaining strong bounds for effectively pruning the search space during search, as well as problem-specific dynamic programming for efficiently implementing domain filtering during search. Furthermore, we establish a condition which enables symmetry breaking for noticeably pruning the search space over which we perform branch and bound.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
In comparison with Junctor,\na key bene\ufb01t of our approach is the potential of avoiding worst-case behavior, especially in terms\nof memory usage, based on using strong bounds to rule out provably non-optimal solutions from\nconsideration during search.\nEmpirically, we show the approach dominates the integer programming approach of GOBNILP [2],\nand thereby also the constraint optimization approach [6, 12]. Furthermore, our algorithm scales\nat times further in terms of the number of variables than the DP-based approach implemented in\nJunctor [13], with the potential of reaching 20 variables within hours and at the same time circumvent-\ning the tight exponential lower bounds on memory consumption of the pure dynamic programming\napproach, which is witnessed also in practice by noticeably lower memory consumption.1\n\n2 Chordal Markov Network Structure Learning\n\nA Markov network structure is represented by an undirected graph Gu = (V, Eu), where V =\n{v1, . . . , vn} is the set of vertices and Eu the set of undirected edges. This structure represents\nindependencies vi \u22a5\u22a5 vj|S according to the undirected separation property: vi and vj are separated\ngiven set S if and only if all paths between them go through a vertex in set S. The undirected graph is\nchordal iff every (undirected) cycle of length greater than three contains a chord, i.e., an edge between\ntwo non-consecutive vertices in the cycle. Figure 1 a) shows an example. Here we focus on the task of\n\ufb01nding a chordal graph U that maximizes posterior probability P (Gu|D) = P (D|Gu)P (Gu)/P (D),\nwhere D denotes the i.i.d. data set. As we assume a uniform prior over chordal graphs, this boils\ndown to maximizing the marginal likelihood P (D|Gu).\nDawid et al. have shown that the marginal likelihood P (D|Gu) for chordal Markov networks can be\ncalculated using a clique tree representation [7, 9]. 
A clique C is a fully connected subset of vertices. A clique tree for an undirected graph Gu is an undirected tree over cliques C1, ..., Cm such that

I. ⋃_i Ci = V,
II. if {vℓ, vk} ∈ Eu, then either {vℓ, vk} ⊆ Ck or {vℓ, vk} ⊆ Cℓ, and
III. the running intersection property holds: whenever vk ∈ Ci and vk ∈ Cj, then vk is also in every clique on the unique path between Ci and Cj.

The separators are the intersections of adjacent cliques in a clique tree. Figure 1 b) shows an example. The marginal likelihood factorizes according to the clique tree: P(D|U) = ∏_i P(Ci) / ∏_j P(Sj) (assuming positivity and that the prior factorizes) [6]. The marginal likelihood P(S) for a set S of random variables can be calculated with suitable priors; in this paper we consider discrete data using a Dirichlet prior. If we denote s(S) = log P(S), CMSL can be cast as maximizing ∑_i s(Ci) − ∑_j s(Sj). For example, the marginal log-likelihood of the graph in Figure 1 a) can be calculated using the clique tree presentation in Figure 1 b) as s({v1, v6}) + s({v1, v5}) + s({v1, v2, v3}) + s({v2, v3, v4}) − s({v1}) − s({v1}) − s({v2, v3}).

In this paper, we view the chordal Markov network structure learning problem from the viewpoint of directed graphs, making use of the fact that for each chordal Markov network structure there are equivalent directed graph structures [15, 7], which we call here decomposable DAGs. A decomposable DAG is a DAG G = (V, E) such that the set of directed edges E ⊂ V × V does not include any immoralities, i.e., structures of the form vi → vk ← vj with no edge between vi and vj. Due to the lack of immoralities, the d-separation property on a decomposable DAG corresponds exactly to the separation property on the chordal undirected graph (the skeleton of the decomposable DAG).
Thus, decomposable graphs represent distributions that are representable by Markov and by Bayesian networks.¹

Figure 1: Three views on chordal Markov network structures: a) chordal undirected graph, b) clique tree, c) decomposable DAG.

¹ Extended discussion and empirical results are available in [21].

Figure 1 c) shows a corresponding decomposable DAG for the chordal undirected graph in a). Note that the decomposable DAG may not be unique; for example, v2 → v3 can also be directed in the opposite direction. The score of the decomposable DAG can be calculated as s(v1, ∅) + s(v5, {v1}) + s(v6, {v1}) + s(v2, {v1}) + s(v3, {v1, v2}) + s(v4, {v2, v3}), where s(vi, S) are the local scores for BNSL using e.g. a Dirichlet prior. Because these local scores s(·, ·) correspond to s(·) through s(vi, S) = s({vi} ∪ S) − s(S) (and s(∅) = 0), we find that this BNSL scoring gives the same result as the clique tree based scoring rule.

Thus CMSL can also be cast as the optimization problem of finding a graph in

arg max_{G ∈ 𝒢} ∑_{vi ∈ V} s(vi, pa_G(vi)),

where 𝒢 denotes the class of decomposable DAGs. (This formulation is used also in the GOBNILP system [2].) The optimal chordal Markov network structure is the skeleton of the optimal G.
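The agreement between the clique-tree scoring rule and BNSL local scoring can be checked numerically. The following is a minimal sketch for the Figure 1 structures; the set scores s(S) are made-up numbers standing in for real marginal log-likelihoods, and all helper names are ours:

```python
# Hypothetical set scores s(S) = log P(S) for the variables of Figure 1;
# real values would come from, e.g., a Dirichlet/BDeu marginal likelihood.
s_set = {
    frozenset(): 0.0,
    frozenset({1}): -1.2,
    frozenset({1, 2}): -2.1,
    frozenset({2, 3}): -2.4,
    frozenset({1, 5}): -2.0,
    frozenset({1, 6}): -2.2,
    frozenset({1, 2, 3}): -3.5,
    frozenset({2, 3, 4}): -3.9,
}

def local_score(v, parents):
    """BNSL local score s(v, P) = s({v} ∪ P) − s(P)."""
    P = frozenset(parents)
    return s_set[P | {v}] - s_set[P]

# Decomposable DAG of Figure 1 c): vertex -> parent set.
dag = {1: set(), 5: {1}, 6: {1}, 2: {1}, 3: {1, 2}, 4: {2, 3}}
dag_score = sum(local_score(v, P) for v, P in dag.items())

# Clique-tree scoring: sum of clique scores minus sum of separator scores.
cliques = [frozenset({1, 6}), frozenset({1, 5}),
           frozenset({1, 2, 3}), frozenset({2, 3, 4})]
separators = [frozenset({1}), frozenset({1}), frozenset({2, 3})]
ct_score = sum(s_set[C] for C in cliques) - sum(s_set[S] for S in separators)

assert abs(dag_score - ct_score) < 1e-9   # both equal -6.8 here
```

The telescoping of s(vi, S) = s({vi} ∪ S) − s(S) along the DAG is exactly what makes the two totals coincide.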
This problem is notoriously computationally difficult in practice, emphasized by the fact that standard score pruning [3, 8] used for BNSL is not generally applicable in the context of CMSL, as it will often prevent finding the true optimum: pruning parent sets for some vertices often prevents other vertices from achieving high-scoring parent sets (as immoralities would be induced).

3 Hybrid Branch and Bound for CMSL

In this section we present details on our branch-and-bound approach to CMSL. We start with an overview of the search algorithm, and then detail how we apply symmetry breaking and make use of dynamic programming to dynamically update variable domains, i.e., for computing parent set choices during search, and to obtain tight bounds for pruning the search tree.

3.1 Branch and Bound over Ordered Decomposable DAGs

The search is performed over the space of ordered decomposable DAGs. While in general the order of the vertices of a DAG can be ambiguous, this notion allows for differentiating the exact order of the vertices, and allows for pruning the search space by identifying symmetries (see Section 3.2).

Definition 1. G = (V, E, π) is an ordered decomposable DAG if and only if (V, E) is a decomposable DAG and π : {1...n} → {1...n} a total order over V such that (vi, vj) ∈ E only if π⁻¹(i) < π⁻¹(j) for all vi, vj ∈ V.

Partial solutions during search are hence ordered decomposable DAGs, which are extended by adding a parent set choice (v, P), i.e., adding the new vertex v and edges from each of its parents in P to v.

Definition 2. Let G = (V, E, π) be an ordered decomposable DAG. Given vk ∉ V and P ⊆ V, we say that the ordered decomposable DAG G' = (V', E', π') is G with the parent set choice (vk, P) if the following conditions hold.

1. V' = V ∪ {vk}.
2. E' = E ∪ ⋃_{v' ∈ P} {(v', vk)}.
3. π'(i) = π(i) for all i = 1...|V|, and π'(|V| + 1) = k.

Algorithm 1 The core branch-and-bound search.
1: function BRANCHANDBOUND(U, G = (V, E, π))
2:   if U = ∅ and s(G*) < s(G) then G* ← G           ▷ Update LB if improved.
3:   if this branch cannot improve LB then return     ▷ Backtrack.
4:   for (vi, P) ∈ PARENTSETCHOICES(U, G) do         ▷ Iterate the current parent set choices.
5:     Let G' = (V', E', π') be G with the parent set choice (vi, P).
6:     BRANCHANDBOUND(U \ {vi}, G')                  ▷ Continue the search.

Algorithm 1 represents the core functionality of the branch and bound. The recursive function takes two arguments: the remaining vertices of the problem instance, U, and the current partial solution G = (V, E, π). In addition we keep stored the best lower-bound solution G*, which is the highest-scoring solution that has been found so far. Thus, at the end of the search, G* is an optimal solution. During the search we use G* for bounding, as further detailed in Section 3.3.

In the loop on line 4 we branch on all the parent set choices that we have deemed necessary to try during the search. The method PARENTSETCHOICES(U, G) and the related symmetry breaking are explained in Section 3.2. We sort the parent set choices into decreasing order based on their score, so that (v, P) is tried before (v', P') if s(v, P) > s(v', P'), where v, v' ∈ U and P, P' ⊆ V. This is done to focus the search first on the most promising branches for finding an optimal solution.
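As an illustration, the following is a minimal, self-contained sketch of a search in the spirit of Algorithm 1 (all names and the toy scores are ours; the actual implementation adds symmetry breaking, DP-based domain filtering, and the much stronger bounds of Section 3.3). Requiring each chosen parent set to be a clique in the current graph rules out immoralities, so every completed DAG is decomposable:

```python
from itertools import combinations

def best_decomposable_dag(variables, score, use_bound=True):
    """Toy branch and bound over decomposable DAGs.  score[(v, P)] gives the
    local score s(v, P).  The bound is a deliberately loose BNSL-style
    relaxation: the best local score per unplaced vertex, ignoring structure."""
    best = {'score': float('-inf'), 'dag': None}

    def relaxed_bound(remaining):
        return sum(max(sc for (w, _), sc in score.items() if w == u)
                   for u in remaining)

    def is_clique(P, dag):
        # Edge a-b exists iff one is a parent of the other.
        return all(a in dag[b] or b in dag[a] for a, b in combinations(P, 2))

    def recurse(remaining, dag, acc):
        if not remaining:                  # complete solution: update incumbent
            if acc > best['score']:
                best['score'] = acc
                best['dag'] = {v: set(ps) for v, ps in dag.items()}
            return
        if use_bound and acc + relaxed_bound(remaining) <= best['score']:
            return                         # bound: this branch cannot improve
        placed = list(dag)
        for v in remaining:                # branch on parent set choices
            for r in range(len(placed) + 1):
                for P in combinations(placed, r):
                    if (v, frozenset(P)) in score and is_clique(P, dag):
                        dag[v] = set(P)
                        recurse(remaining - {v}, dag,
                                acc + score[(v, frozenset(P))])
                        del dag[v]

    recurse(frozenset(variables), {}, 0.0)
    return best['score'], best['dag']

# Hypothetical local scores for three variables (all possible parent sets).
score = {
    (1, frozenset()): -1.0, (1, frozenset({2})): -0.5,
    (1, frozenset({3})): -0.9, (1, frozenset({2, 3})): -0.3,
    (2, frozenset()): -1.1, (2, frozenset({1})): -0.4,
    (2, frozenset({3})): -0.8, (2, frozenset({1, 3})): -0.2,
    (3, frozenset()): -1.0, (3, frozenset({1})): -0.6,
    (3, frozenset({2})): -0.7, (3, frozenset({1, 2})): -0.25,
}
opt, dag = best_decomposable_dag({1, 2, 3}, score)
# opt == -1.65, dag == {1: set(), 2: {1}, 3: {1, 2}}
```

The same ordered DAG is reached via several insertion orders here; the depth-ordering condition of Section 3.2 is what removes such duplicates in the real search.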
When\nU = \u2205, we have PARENTSETCHOICES(U, G) = \u2205, and so the current branch gets terminated.\n\n3.2 Dynamic Branch Selection, Parent Set Pruning, and Symmetry Breaking\n\nWe continue by proposing symmetry breaking for the space of ordered decomposable DAGs, and\npropose a dynamic programming approach for dynamic parent set pruning during search. We start\nwith symmetry breaking.\nIn terms of our branch-and-bound approach to CMSL, symmetry breaking is a vital part of the search,\nas there can be exponentially many decomposable DAGs which correspond to a single undirected\nchordal graph; for example, the edges of a complete graph can be directed arbitrarily without the\nresulting DAG containing any immoralities. Hence symmetry breaking in terms of pruning out\nsymmetric solution candidates during search has potential for noticeably speeding up search.\nChickering [4, 5] showed how so-called covered edges can be used to detect equivalencies between\nBayesian network structures. Later van Beek and Hoffmann [29] implemented covered edge based\nsymmetry breaking in their BNSL approach. Here we introduce the concept of preferred vertex\norders, which generalizes covered edges for CMSL based on the decomposability of the solution\ngraphs.\nDe\ufb01nition 3. Let G = (V, E, \u03c0) be an ordered decomposable DAG. A pair vi, vj \u2208 V violates the\npreferred vertex order in G if the following conditions hold.\n\n1. i > j.\n2. paG(vi) \u2286 paG(vj).\n3. There is a path from vi to vj in G.\n\nTheorem 1 states that for any (partial) solution (i.e., an ordered decomposable DAG), there always\nexists an equivalent solution that does not contain any violations of the preferred vertex order.\nMapping to practice, this theorem allows for very effectively pruning out all symmetric solutions but\nthe one not violating the preferred vertex order within our branch-and-bound approach. A detailed\nproof is provided in Appendix A.\nTheorem 1. 
Let G = (V, E, π) be an ordered decomposable DAG. There exists an ordered decomposable DAG G' = (V, E', π') that is equivalent to G, but where for all vi, vj ∈ V the pair (vi, vj) does not violate the preferred vertex order in G'.

It follows from Theorem 1 that for each solution (ordered decomposable DAG) there exists an equivalent solution where the lexicographically smallest vertex is a source. Thus we can fix it as the first vertex in the order at the beginning of the search.

Similarly as in [29] for BNSL, we define the depths of vertices as follows.

Definition 4. Let G = (V, E, π) be an ordered decomposable DAG. The depth of v ∈ V in G is

d(G, v) = 0 if pa_G(v) = ∅, and d(G, v) = max_{v' ∈ pa_G(v)} d(G, v') + 1 otherwise.
For all vi \u2208 V , the pair (vi, vk) does not violate the preferred vertex order in G(cid:48).\n3. The depths of G(cid:48) are ordered.\n\nGiven a partial solution G = (V, E, \u03c0), a vertex v /\u2208 V , and a subset P \u2286 V , the function GETSU-\nPERSETS in Algorithm 2 represents a dynamic programming method for determining valid parent set\nchoices (v, P (cid:48)) for G where P (cid:48) \u2287 P . An advantage of this formulation is that invalidating conditions\nfor a parent set, such as immoralities or violations of the preferred vertex order, automatically hold\nfor all the supersets of the parent set; this is applied on line 6 to avoid unnecessary branching.\nOn line 8 we require that a parent set P is added to the list only if none of its valid supersets P (cid:48) \u2208 C\nhave a higher score. This pruning technique is based on the observation that P (cid:48) provides all the same\nmoralizing edges as P , and therefore it is suf\ufb01cient to only consider the parent set choice (v, P (cid:48)) in\nthe search when s(v, P ) \u2264 s(v, P (cid:48)).\nGiven the set of remaining vertices U, the function PARENTSETCHOICES in Algorithm 2 constructs\nall the available parent set choices for the current partial solution G = (V, E, \u03c0). The collection\nM(G, vi) contains the subset-minimal parent sets for vertex vi \u2208 U that satisfy the 3rd condition\nof De\ufb01nition 5. If V = \u2205, then M(G, vi) = {\u2205}. Otherwise, let k be the maximum depth of the\nvertices in G. Now M(G, vi) contains the subset-minimal parent sets that would insert vi on depth\nk + 1. In addition, if i > j for all vj \u2208 V where d(G, vj) = k, then M(G, vi) also contains the\nsubset-minimal parent sets that would insert vi on depth k. 
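Definition 4 and the depth-ordering condition can be sketched in a few lines (the example DAG follows Figure 1 c); the helper names are ours):

```python
from functools import lru_cache

def depths(parents):
    """Vertex depths per Definition 4: d(v) = 0 for sources, else
    1 + the maximum depth over v's parents.  `parents` maps vertex -> set."""
    @lru_cache(maxsize=None)
    def d(v):
        ps = parents[v]
        return 0 if not ps else 1 + max(d(p) for p in ps)
    return {v: d(v) for v in parents}

def depths_are_ordered(parents, order):
    """Along the construction order, depths never decrease, and ties are
    broken by vertex index (smaller index first)."""
    d = depths(parents)
    return all(d[u] < d[v] or (d[u] == d[v] and u < v)
               for u, v in zip(order, order[1:]))

# Figure 1 c)-style DAG: 1 -> {2, 5, 6}, {1, 2} -> 3, {2, 3} -> 4.
parents = {1: set(), 2: {1}, 5: {1}, 6: {1}, 3: {1, 2}, 4: {2, 3}}
print(depths(parents))  # {1: 0, 2: 1, 5: 1, 6: 1, 3: 2, 4: 3}
print(depths_are_ordered(parents, [1, 2, 5, 6, 3, 4]))  # True
print(depths_are_ordered(parents, [1, 5, 2, 6, 3, 4]))  # False: tie 5 before 2
```

Checking adjacent positions suffices: non-decreasing depths between equal-depth endpoints force all intermediate depths to be equal, so pairwise tie-breaking follows by transitivity.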
Note that the cardinality of any parent set in M(G, vi) is at most one.

Algorithm 2 Constructing parent set choices via dynamic programming.
1: function PARENTSETCHOICES(U, G = (V, E, π))
2:   return ⋃_{v ∈ U} ⋃_{M ∈ M(G,v)} GETSUPERSETS(v, G, M)
3: function GETSUPERSETS(v, G = (V, E, π), P)
4:   Let C = ∅
5:   for v' ∈ V \ P \ {v} do
6:     if (v, P') is a valid parent set choice for G for some P' ⊇ P ∪ {v'} then
7:       C ← C ∪ GETSUPERSETS(v, G, P ∪ {v'})
8:   if (v, P) is a valid parent set choice for G and s(v, P) > s(v, P') for all P' ∈ C then
9:     C ← C ∪ {(v, P)}
10:  return C

3.3 Computing Tight Bounds by Harnessing Dynamic Programming for BNSL

To obtain tight bounds during search, we make use of the fact that the score of the optimal BN structure for the BNSL instance with the same scores as in the CMSL instance at hand is guaranteed to give an upper bound on the optimal solutions to the CMSL instance. To compute an optimal BN structure, we use a variant of a standard dynamic programming algorithm by Silander and Myllymäki [22]. While there are far more efficient algorithms for BNSL [2, 32, 29], we use BNSL DP for obtaining an upper bound during the branch-and-bound search under the current partial CMSL solution (i.e., under the current branch). Specifically, before the actual branch and bound, we precompute a DP table which stores, for each subset of vertices V' ⊂ V of the problem instance, the score of the so-called BN extensions of V', i.e., the optimal BN structures over U = V \ V', where we additionally allow the vertices in U to also take parents from V'.
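This subset DP over BN extensions can be sketched as follows (a minimal sketch with made-up scores; the naive `best_local` scan stands in for the precomputed score structures the paper uses, and all names are ours):

```python
from itertools import combinations

def bn_extension_scores(variables, s):
    """Subset DP in the spirit of Silander & Myllymäki: ext[F] is the score of
    an optimal BN over the vertices outside F, where those vertices may also
    take parents inside F.  s[(v, P)] is the local score; O(2^n) table entries."""
    variables = frozenset(variables)

    def best_local(v, allowed):
        # Best local score for v with parents drawn from `allowed` (naive scan).
        return max(s[(v, frozenset(P))]
                   for r in range(len(allowed) + 1)
                   for P in combinations(allowed, r))

    ext = {variables: 0.0}
    for size in range(len(variables) - 1, -1, -1):  # larger fixed sets first
        for F in map(frozenset, combinations(variables, size)):
            ext[F] = max(best_local(v, F) + ext[F | {v}]
                         for v in variables - F)
    return ext

# Hypothetical local scores for three variables.
s = {
    (1, frozenset()): -1.0, (1, frozenset({2})): -0.5,
    (1, frozenset({3})): -0.9, (1, frozenset({2, 3})): -0.3,
    (2, frozenset()): -1.1, (2, frozenset({1})): -0.4,
    (2, frozenset({3})): -0.8, (2, frozenset({1, 3})): -0.2,
    (3, frozenset()): -1.0, (3, frozenset({1})): -0.6,
    (3, frozenset({2})): -0.7, (3, frozenset({1, 2})): -0.25,
}
ext = bn_extension_scores({1, 2, 3}, s)
# ext[frozenset()] is the unconstrained BNSL optimum (a global upper bound);
# during branch and bound, ext[V'] bounds any completion of a partial
# solution over V'.
```

Since any BN extension ignores the decomposability requirement, ext[V'] can only overestimate the best decomposable completion, which is exactly what a branch-and-bound upper bound needs.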
This guarantees that the BN extensions are compatible with the vertex order in the current branch of the branch-and-bound search tree, and thereby the sum of the score of the current partial CMSL solution over V' and the score of the optimal BN extensions of V' is a valid upper bound. By spending O(n·2^n) time at the beginning of the branch and bound to compute the scores of optimal BN extensions of every V' ⊂ V, we can then look up these scores during branch and bound in O(1) time.

With the DP table, it takes only low polynomial time to construct the optimal BN structure over the set of all vertices [22], i.e., a BN extension of ∅. Thus, we can obtain an initial lower-bound solution G* for the branch and bound as follows.

1. Construct the optimal BN structure for the vertices of the problem instance.
2. Try to make the BN decomposable by heuristically adding or removing edges.
3. Let G* be the highest-scoring decomposable DAG from step 2.

However, the upper bounds obtained via BNSL can at times be quite weak when the network structures contain many immoralities. For this reason, in Algorithm 3, we introduce an additional method for computing the upper bounds, taking immoralities "relaxedly" into consideration. The algorithm takes four inputs: a fixed partial solution G = (V, E, π), a list of vertices A that we have assigned during the upper bound computation, a list of remaining vertices U, and an integer d ≥ 0 which dictates the maximum recursion depth. As a fallback option, on line 3 we return the optimal BN score for the remaining vertices if the maximum recursion depth is reached.

On line 4 we construct the collection of sets P that are the maximal sets that any vertex can take as a parent set during the upper bound computation.
The sets in P take immoralities relaxedly into consideration: for any vi, vj ∈ V, we have {vi, vj} ⊆ P for some P ∈ P if and only if (vi, vj) ∈ E or (vj, vi) ∈ E. That is, when choosing parent sets during the upper bound computation, we allow immoralities to appear, as long as they are not between vertices of the fixed partial solution. In the loop on line 6, we iterate through each vertex v ∈ U that is still remaining, and find its highest-scoring relaxedly-moral parent set according to P. Note that given any P' ∈ P, we can find the highest-scoring parent set P ⊆ P' in O(1) time when the scores are stored in a segment tree. For information about constructing such a data structure, see [22]. Thus line 7 takes O(|V|) time to execute. Finally, on line 8 of the loop, we split the problem into subproblems to see which parent set choice (v, P) provides the highest local upper bound u to be returned.

Algorithm 3 requires O((n − m) · m · 2^(n−m)) time, where m = |V| is the number of vertices in the partial solution and n the number of vertices in the problem instance, assuming that the BN extensions and the segment trees have been precomputed. (In the empirical evaluation, the total runtimes of our branch-and-bound approach include these computations.) The collections P can exist implicitly.

We use the upper bounds within branch and bound as follows. Let G = (V, E, π) be the current partial solution, let U be the set of remaining vertices, and let b be the score of optimal BN extensions of V. We can close the current branch if s(G*) ≥ s(G) + b. Otherwise, we can close the branch if s(G*) ≥ s(G) + UPPERBOUND(G, ∅, U, d) for some d > 0.
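The paper serves these best-subset queries from segment trees; as an illustrative alternative (ours, not the authors'), the same constant-time lookups can come from a table precomputed by a simple subset DP:

```python
from itertools import combinations

def precompute_best_subset_scores(variables, s):
    """For each vertex v and each S ⊆ V \ {v}, precompute
    best[v][S] = max over P ⊆ S of s[(v, P)], via the recurrence
    best[S] = max(s[(v, S)], max over x in S of best[S \ {x}]).
    After the O(n^2 · 2^n) precomputation, each query is one dict lookup."""
    tables = {}
    for v in variables:
        others = [u for u in variables if u != v]
        best = {frozenset(): s[(v, frozenset())]}
        for size in range(1, len(others) + 1):
            for S in map(frozenset, combinations(others, size)):
                best[S] = max(s[(v, S)],
                              max(best[S - {x}] for x in S))
        tables[v] = best
    return tables

# Hypothetical local scores for three variables.
s = {
    (1, frozenset()): -1.0, (1, frozenset({2})): -0.5,
    (1, frozenset({3})): -0.9, (1, frozenset({2, 3})): -0.3,
    (2, frozenset()): -1.1, (2, frozenset({1})): -0.4,
    (2, frozenset({3})): -0.8, (2, frozenset({1, 3})): -0.2,
    (3, frozenset()): -1.0, (3, frozenset({1})): -0.6,
    (3, frozenset({2})): -0.7, (3, frozenset({1, 2})): -0.25,
}
tables = precompute_best_subset_scores([1, 2, 3], s)
# tables[1][frozenset({2, 3})] == -0.3
# (the best of s(1, ∅), s(1, {2}), s(1, {3}), s(1, {2, 3}))
```

Each set is examined once per vertex, and every proper subset's maximum is reused rather than rescanned, which is what brings queries down to constant time.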
Our implementation uses d = 10.

Algorithm 3 Computing upper bounds for a partial solution via dynamic programming.
1: function UPPERBOUND(G = (V, E, π), A, U, d)
2:   if U = ∅ then return 0
3:   if d = 0 then return the score of optimal BN extensions of V ∪ A
4:   Let 𝒫 = ⋃_{v ∈ V} {{v} ∪ pa_G(v) ∪ A}
5:   Let u ← −∞
6:   for v ∈ U do
7:     Let P = arg max_{P ⊆ P', P' ∈ 𝒫} s(v, P)
8:     u ← max(u, s(v, P) + UPPERBOUND(G, A ∪ {v}, U \ {v}, d − 1))
9:   return u

4 Empirical Evaluation

We implemented the branch-and-bound algorithm in C++, and refer to this prototype as BBMarkov. We compare the performance of BBMarkov to that of GOBNILP (the newest development version [24] at the time of publication, using IBM CPLEX version 12.7.1 as the internal IP solver), a state-of-the-art BNSL system implementing an integer programming branch-and-cut approach to CMSL by ruling out non-chordal graphs, and Junctor, implementing a state-of-the-art DP approach to CMSL. We used a total of 54 real-world datasets used as standard benchmarks for exact approaches [32, 29]. To investigate the scalability of the algorithms in terms of the number of variables n, we obtained from each dataset several benchmark instances by restricting to the first n variables for increasing values of n. We did not impose a bound on the treewidth of the chordal graphs of interest, i.e., the size of candidate parent sets was not limited. We used the BDeu score with equivalent sample size 1. As is standard practice in benchmarking exact structure learning algorithms, we focus on comparing the running times of the considered approaches on precomputed input CMSL instances.
The experiments were run under Debian GNU/Linux on 2.83-GHz Intel Xeon E5440 nodes with 32 GB of RAM. Figure 2 compares BBMarkov to GOBNILP and Junctor under a 1-h per-instance time limit, with different numbers n of variables distinguished using different point styles. BBMarkov clearly dominates GOBNILP in runtime performance (Fig. 2 left); instances for n > 15 are not shown, as GOBNILP was unable to solve them. Compared to Junctor (Fig. 2 middle, Table 1), BBMarkov exhibits complementary performance. Junctor is noticeably strong on several datasets and lower values of n, and exhibits fewer timeouts. For a fixed n, Junctor's runtimes have a very low variance independent of the dataset, which is due to the Ω(4^n) (both worst-case and best-case) runtime guarantee. However, BBMarkov shows potential for scaling up to larger n than Junctor: at n = 17, Junctor's runtimes are very close to 1 h on all instances, while BBMarkov's bounds at times rule out non-optimal solutions very effectively, resulting in noticeably lower runtimes on specific datasets with increasing n. This is showcased in Table 1 on the right, highlighting some of the best-case performance of BBMarkov using a per-instance time limit of 24 h for both BBMarkov and Junctor.

In terms of how the various search techniques implemented in BBMarkov contribute to its running times, we observed that the time for obtaining BNSL-based bounds (via the use of exact BN dynamic programming and segment trees) tends to be only a small fraction of the overall running time. For example, at n = 20, these computations take less than a minute in total. Most of the time in the search is typically used in the optimization loop and in computing the tighter upper bounds that take immoralities "relaxedly" into consideration.
While computing the tighter bounds is more expensive than computing the exact BNs at the beginning of search, the tighter bounds often pay off in terms of overall running times, as branches can be closed earlier during search.

Figure 2: Per-instance runtime comparisons. Left: BBMarkov vs GOBNILP. Middle: BBMarkov vs Junctor. Right: BBMarkov time to finding vs BBMarkov time to proving an optimal solution.

Figure 3: Memory usage of Junctor and BBMarkov for increasing numbers of variables.

Table 1: BBMarkov vs Junctor; running times in seconds, with the time needed to find an optimal solution in parentheses. Left: smaller datasets and different sample sizes on the Water dataset. Right: examples of best-case performance of BBMarkov. to: timeout, mo: memout.

Dataset     n   BBMarkov         Junctor
Wine        13  <1     (<1)      6
Adult       14  58     (35)      29
Letter      16  >3600  (>3600)   592
Voting      17  281    (207)     3050
Zoo         17  >3600  (>3600)   2690
Water100    17  100    (49)      2580
Water1000   17  2731   (279)     2592
Water10000  17  >3600  (>3600)   2928
Tumor       18  610    (268)     12019

Dataset        n   BBMarkov          Junctor
Alarm          17  268     (62)      2724
Alarm          18  1462    (315)     12477
Alarm          19  10274   (2028)    52130
Alarm          20  49610   (50)      mo
Heart          17  41      (22)      3007
Heart          18  162     (85)      11179
Heart          19  1186    (698)     50296
Heart          20  15501   (13845)   mo
Hailfinder500  17  225     (108)     2588
Hailfinder500  18  2543    (1348)    12422
Hailfinder500  19  13749   (6418)    53108
Hailfinder500  20  33503   (25393)   mo
Water100       18  590     (244)     12244
Water100       19  6581    (6187)    52575
Water100       20  61152   (54806)   mo

Another benefit of BBMarkov compared to Junctor is the observed lower memory consumption (Figure 3). Junctor's Ω(3^n) memory usage consistently results in running out of memory for n ≥ 20. At n = 19, BBMarkov uses on average approx.
1 GB of memory, while Junctor uses close to 30 GB. A further benefit of BBMarkov is its ability to provide "anytime" solutions during search. In fact, the bounds obtained during search at times result in finding optimal solutions relatively fast: Figure 2 (right) plots the time needed to find an optimal solution (x-axis) against the time needed to terminate search, i.e., to find a solution and prove its optimality (y-axis); in Table 1, the time needed to find an optimal solution is given in parentheses.

5 Conclusions

We introduced a new branch-and-bound approach to learning optimal chordal Markov network structures, i.e., decomposable graphs. In addition to core branch-and-bound search, the approach integrates dynamic programming for obtaining tight bounds and effective variable domain pruning during search. In terms of practical performance, the approach has the potential of reaching 20 variables within hours of runtime, at which point the competing native dynamic programming approach Junctor runs out of memory on standard modern computers. When approaching 20 variables, our approach is approximately 30 times as memory-efficient as Junctor. Furthermore, in contrast to Junctor, the approach is "anytime", as solutions can be obtained already before search finishes. Efficient parallelization of the approach is a promising direction for future work.

Acknowledgments

The authors gratefully acknowledge financial support from the Academy of Finland under grants 251170 COIN Centre of Excellence in Computational Inference Research, 276412, 284591, 295673, and 312662; and the Research Funds of the University of Helsinki.

A Proofs

We give a proof for Theorem 1, central in enabling effective symmetry breaking in our branch-and-bound approach. We start with a definition and a lemma towards the proof.

Definition 6.
Let V = {v1, ..., vn} be a set of vertices and let π and π′ be two total orders over V. Let k = min{i : π(i) ≠ π′(i)} be the position of the first difference between the orders. If no such difference exists, we denote π = π′. Otherwise we denote π < π′ if and only if π(k) < π′(k).

Lemma 1. Let G = (V, E, π) be an ordered decomposable DAG. If there are vi, vj ∈ V such that the pair (vi, vj) violates the preferred vertex order in G, then there exists an ordered decomposable DAG G′ = (V, E′, π′), where 1. G′ belongs to the same equivalence class as G, 2. the pair (vi, vj) does not violate the preferred vertex order in G′, and 3. π′ < π.

Proof. We begin by defining a directed clique tree C = (𝒱, ℰ) over G.

Given vk ∈ V, let Ck = paG(vk) ∪ {vk} be the clique defined by vk in G. The vertices of C are these cliques; we also add the empty set as a clique to make sure the cliques form a tree (and not a forest). Formally, 𝒱 = {Ck | vk ∈ V} ∪ {∅}.

Given vk ∈ V with paG(vk) ≠ ∅, let φk = argmax_{vℓ ∈ paG(vk)} π⁻¹(ℓ) denote the parent of vk in G that is in the least significant position in π. Now, the edges of C are

ℰ = {(∅, Ck) | Ck = {vk}, vk ∈ V} ∪ {(Cℓ, Ck) | vℓ = φk, Ck ≠ {vk}, vk ∈ V}.

In words, if vk ∈ V is a source vertex in G (i.e., Ck = {vk}), then the parent of Ck in C is ∅. Otherwise (i.e., Ck ≠ {vk}) the parent of Ck is Cℓ, where vℓ is the closest vertex to vk in the order π that satisfies Cℓ ∩ paG(vk) ≠ ∅. We see that all the requirements for clique trees hold for C: I. ⋃_{C ∈ 𝒱} C = V, II.
if {vℓ, vk} ∈ E, then either {vℓ, vk} ⊆ Ck or {vℓ, vk} ⊆ Cℓ, and III. due to the decomposability of G, we have Ca ∩ Cc ⊆ Cb on any path from Ca to Cc through Cb (the running intersection property).

Now assume that there are vi, vj ∈ V such that the pair (vi, vj) violates the preferred vertex order in G; that is, we have i > j, paG(vi) ⊆ paG(vj), and a path from vi to vj in G. This means that there is a path from Ci to Cj in C as well.

Let P be the parent vertex of Ci in C. We see that Cj exists in a subtree T of C that is separated from the rest of C by P, and in which Ci is the root vertex. Let T′ be a new clique tree that is like T but redirected so that Cj is the root vertex of T′. Let C′ be a new clique tree that is like C but with T replaced by T′.

We show that C′ is a valid clique tree. First of all, the vertices (cliques) of C′ are exactly the same as in C, so C′ clearly satisfies requirements I and II. As for requirement III, consider the non-trivial case where the cliques Ca and Cb have a path from Ca to Cb through Ci in C. This means vi ∉ Ca (due to the way C was constructed), and so we get

Ca ∩ Cb ⊆ Ci → Ca ∩ Cb ⊆ Ci \ {vi} → Ca ∩ Cb ⊆ paG(vi) ⊆ paG(vj) ⊆ Cj,

where paG(vi) ⊆ paG(vj) holds by Definition 3 (2). Therefore the running intersection property holds for C′.

Let π̂ be the total order by which C′ is ordered. Let G′ = (V, E′, π̂) be a new ordered decomposable DAG that is equivalent to G, but where the edges E′ are arranged to follow the order π̂.

Finally, we see that G′ satisfies the conditions of the lemma: 1. The cliques of G′ are identical to those of G, so G′ belongs to the same equivalence class as G. 2.
We have π̂⁻¹(j) < π̂⁻¹(i), and therefore there is no path from vi to vj in G′. Thus the pair (vi, vj) does not violate the preferred vertex order in G′. 3. Let o = π⁻¹(i). We have π̂(o) = j < i = π(o). Furthermore, the change from T to T′ in C′ did not affect any vertex whose position was earlier than o. Therefore π̂(k) = π(k) for all k = 1, ..., o − 1. This implies π̂ < π.

Proof of Theorem 1. Consider the following procedure for finding G′.

1. Select vi, vj ∈ V such that the pair (vi, vj) violates the preferred vertex order in G. If there are no such vertices, assign G′ ← G and terminate.

2. Let π be the total order of the vertices of G. Construct an ordered decomposable DAG Ĝ = (V, Ê, π′) such that I. the pair (vi, vj) does not violate the preferred vertex order in Ĝ, II. Ĝ belongs to the same equivalence class as G, and III. π′ < π. By Lemma 1, Ĝ can be constructed from G.

3. Assign G ← Ĝ and return to step 1.

It is clear that when the procedure terminates, G′ belongs to the same equivalence class as G and there are no violations of the preferred vertex order in G′. We also see that the total order of G (i.e., π) strictly decreases lexicographically every time step 3 is reached. As there is only a finite number of possible permutations (total orders), the procedure terminates. The existence and correctness of this procedure prove that G′ exists.

References

[1] Haley J. Abel and Alun Thomas. Accuracy and computational efficiency of a graphical modeling approach to linkage disequilibrium estimation.
Statistical Applications in Genetics and Molecular Biology, 143(10.1), 2017.

[2] Mark Bartlett and James Cussens. Integer linear programming for the Bayesian network structure learning problem. Artificial Intelligence, 244:258–271, 2017.

[3] Cassio P. de Campos and Qiang Ji. Efficient structure learning of Bayesian networks using constraints. Journal of Machine Learning Research, 12:663–689, 2011.

[4] David Maxwell Chickering. A transformational characterization of equivalent Bayesian network structures. In Proc. UAI, pages 87–98. Morgan Kaufmann, 1995.

[5] David Maxwell Chickering. Learning equivalence classes of Bayesian network structures. Journal of Machine Learning Research, 2:445–498, 2002.

[6] Jukka Corander, Tomi Janhunen, Jussi Rintanen, Henrik J. Nyman, and Johan Pensar. Learning chordal Markov networks by constraint satisfaction. In Proc. NIPS, pages 1349–1357, 2013.

[7] A. Philip Dawid and Steffen L. Lauritzen. Hyper Markov laws in the statistical analysis of decomposable graphical models. Annals of Statistics, 21(3):1272–1317, 1993.

[8] Cassio P. de Campos and Qiang Ji. Properties of Bayesian Dirichlet scores to learn Bayesian network structures. In Proc. AAAI, pages 431–436. AAAI Press, 2010.

[9] Petros Dellaportas and Jonathan J. Forster. Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models. Biometrika, 86(3):615–633, 1999.

[10] Paolo Giudici and Peter J. Green. Decomposable graphical Gaussian model determination. Biometrika, 86(4):785, 1999.

[11] Peter J. Green and Alun Thomas. Sampling decomposable graphs using a Markov chain on junction trees. Biometrika, 100(1):91, 2013.

[12] Tomi Janhunen, Martin Gebser, Jussi Rintanen, Henrik Nyman, Johan Pensar, and Jukka Corander. Learning discrete decomposable graphical models via constraint optimization.
Statistics and Computing, 27(1):115–130, 2017.

[13] Kustaa Kangas, Mikko Koivisto, and Teppo M. Niinimäki. Learning chordal Markov networks by dynamic programming. In Proc. NIPS, pages 2357–2365, 2014.

[14] Kustaa Kangas, Teppo Niinimäki, and Mikko Koivisto. Averaging of decomposable graphs by dynamic programming and sampling. In Proc. UAI, pages 415–424. AUAI Press, 2015.

[15] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[16] K. S. Sesh Kumar and Francis R. Bach. Convex relaxations for learning bounded-treewidth decomposable graphs. In Proc. ICML, volume 28 of JMLR Workshop and Conference Proceedings, pages 525–533. JMLR.org, 2013.

[17] Steffen L. Lauritzen and David J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. In Glenn Shafer and Judea Pearl, editors, Readings in Uncertain Reasoning, pages 415–448. Morgan Kaufmann Publishers Inc., 1990.

[18] Gérard Letac and Hélène Massam. Wishart distributions for decomposable graphs. The Annals of Statistics, 35(3):1278–1323, 2007.

[19] David Madigan, Jeremy York, and Denis Allard. Bayesian graphical models for discrete data. International Statistical Review/Revue Internationale de Statistique, pages 215–232, 1995.

[20] Brandon M. Malone and Changhe Yuan. A depth-first branch and bound algorithm for learning optimal Bayesian networks. In GKR 2013 Revised Selected Papers, volume 8323 of Lecture Notes in Computer Science, pages 111–122. Springer, 2014.

[21] Kari Rantanen. Learning score-optimal chordal Markov networks via branch and bound. Master's thesis, University of Helsinki, Finland, 2017.

[22] Tomi Silander and Petri Myllymäki. A simple approach for finding the globally optimal Bayesian network structure. In Proc. UAI, pages 445–452.
AUAI Press, 2006.

[23] Nathan Srebro. Maximum likelihood bounded tree-width Markov networks. Artificial Intelligence, 143(1):123–138, 2003.

[24] Milan Studený and James Cussens. Towards using the chordal graph polytope in learning decomposable models. International Journal of Approximate Reasoning, 88:259–281, 2017.

[25] Joe Suzuki. Learning Bayesian belief networks based on the Minimum Description Length principle: An efficient algorithm using the B&B technique. In Proc. ICML, pages 462–470. Morgan Kaufmann, 1996.

[26] Joe Suzuki and Jun Kawahara. Branch and bound for regular Bayesian network structure learning. In Proc. UAI. AUAI Press, 2017.

[27] Claudia Tarantola. MCMC model determination for discrete graphical models. Statistical Modelling, 4(1):39–61, 2004.

[28] Jin Tian. A branch-and-bound algorithm for MDL learning Bayesian networks. In Proc. UAI, pages 580–588. Morgan Kaufmann, 2000.

[29] Peter van Beek and Hella-Franziska Hoffmann. Machine learning of Bayesian networks using constraint programming. In Proc. CP, volume 9255 of Lecture Notes in Computer Science, pages 429–445. Springer, 2015.

[30] Claudio J. Verzilli, Nigel Stallard, and John C. Whittaker. Bayesian graphical models for genomewide association studies. The American Journal of Human Genetics, 79(1):100–112, 2006.

[31] Ami Wiesel, Yonina C. Eldar, and Alfred O. Hero III. Covariance estimation in decomposable Gaussian graphical models. IEEE Transactions on Signal Processing, 58(3):1482–1492, 2010.

[32] Changhe Yuan and Brandon M. Malone. Learning optimal Bayesian networks: A shortest path perspective.
Journal of Artificial Intelligence Research, 48:23–65, 2013.
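As a supplementary illustration (not part of the paper itself), the lexicographic order on total orders from Definition 6, which underlies the termination argument in the proof of Theorem 1, can be sketched in a few lines of Python. Encoding a total order π as the tuple (π(1), ..., π(n)) of vertex labels is an assumption made purely for this sketch:

```python
from itertools import permutations

def first_difference(pi, pi_prime):
    """Index k of Definition 6: the first position where the orders differ.

    Returns None when the orders are equal. Orders are encoded as tuples
    (pi(1), ..., pi(n)) of vertex labels (an illustrative assumption)."""
    for k, (a, b) in enumerate(zip(pi, pi_prime)):
        if a != b:
            return k
    return None

def order_lt(pi, pi_prime):
    """pi < pi' iff pi(k) < pi'(k) at the first difference k (Definition 6)."""
    k = first_difference(pi, pi_prime)
    return k is not None and pi[k] < pi_prime[k]

# Termination argument of Theorem 1 in miniature: order_lt is a strict total
# order on the finitely many (n!) permutations of V, so a strictly decreasing
# sequence of total orders must be finite.
V = (1, 2, 3)
assert order_lt((1, 3, 2), (2, 1, 3))      # first difference at position 1: 1 < 2
assert not order_lt((1, 2, 3), (1, 2, 3))  # equal orders are not strictly ordered
# The definition coincides with ordinary lexicographic tuple comparison:
assert all(order_lt(p, q) == (p < q)
           for p in permutations(V) for q in permutations(V))
```

Since the comparison reduces to built-in tuple comparison, a search implementation can prune any branch whose induced order is not minimal in its equivalence class, which is exactly the symmetry breaking that Theorem 1 licenses.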