{"title": "Learning Chordal Markov Networks by Constraint Satisfaction", "book": "Advances in Neural Information Processing Systems", "page_first": 1349, "page_last": 1357, "abstract": "We investigate the problem of learning the structure of a Markov network from data. It is shown that the structure of such networks can be described in terms of constraints which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. To achieve efficient encodings, we develop a novel characterization of Markov network structure using a balancing condition on the separators between cliques forming the network. The resulting translations into propositional satisfiability and its extensions such as maximum satisfiability, satisfiability modulo theories, and answer set programming, enable us to prove the optimality of networks which have been previously found by stochastic search.", "full_text": "Learning Chordal Markov Networks by Constraint Satisfaction\n\nJukka Corander\u2217\u2020\nUniversity of Helsinki\n\nFinland\n\nTomi Janhunen\u2217\u2021\nAalto University\n\nFinland\n\nJussi Rintanen\u2217\u2021\u00a7\nAalto University\n\nFinland\n\nHenrik Nyman\u00b6\n\n\u00c5bo Akademi University\n\nFinland\n\nJohan Pensar\u00b6\n\n\u00c5bo Akademi University\n\nFinland\n\nAbstract\n\nWe investigate the problem of learning the structure of a Markov network from\ndata. It is shown that the structure of such networks can be described in terms of\nconstraints which enables the use of existing solver technology with optimization\ncapabilities to compute optimal networks starting from initial scores computed\nfrom the data. To achieve ef\ufb01cient encodings, we develop a novel characteriza-\ntion of Markov network structure using a balancing condition on the separators\nbetween cliques forming the network. 
The resulting translations into proposi-\ntional satis\ufb01ability and its extensions such as maximum satis\ufb01ability, satis\ufb01ability\nmodulo theories, and answer set programming, enable us to prove optimal certain\nnetworks which have been previously found by stochastic search.\n\n1\n\nIntroduction\n\nGraphical models (GMs) represent the backbone of the generic statistical toolbox for encoding de-\npendence structures in multivariate distributions. Using Markov networks or Bayesian networks\nconditional independencies between variables can be readily communicated and used for various\ncomputational purposes. The development of the statistical theory of GMs is largely set by the\nseminal works of Darroch et al. [1] and Lauritzen and Wermuth [2]. Although various approaches\nhave been developed to generalize the theory of graphical models to allow for modeling of more\ncomplex dependence structures, Markov networks and Bayesian networks are still widely used in\napplications ranging from genetic mapping of diseases to machine learning and expert systems.\nBayesian learning of undirected GMs, also known as Markov random \ufb01elds, from databases has\nattained a considerable interest, both in the statistical and computer science literature [3, 4, 5, 6, 7,\n8, 9]. The cardinality and complex topology of GM space pose dif\ufb01culties with respect to both the\ncomputational complexity of the learning task and the reliability of reaching representative model\nstructures. Solutions to these problems have been proposed in earlier work. Della Pietra et al. [10]\npresent a greedy local search algorithm for Markov network learning and apply it to discovering\nword morphology. Lee et al. [11] reduce the learning problem to a convex optimization problem\nthat is solved by gradient descent. 
Related methods have been investigated later [12, 13].\n\n\u2217This work was funded by the Academy of Finland, project 251170.\n\u2020Funded by ERC grant 239784.\n\u2021Also af\ufb01liated with the Helsinki Institute of Information Technology, Finland.\n\u00a7Also af\ufb01liated with Grif\ufb01th University, Brisbane, Australia.\n\u00b6This work was funded by the Foundation of \u00c5bo Akademi University, as part of the grant for the Center of Excellence in Optimization and Systems Engineering.\n\nCertain types of stochastic search methods, such as Markov Chain Monte Carlo (MCMC) or simu-\nlated annealing, can be proven to be consistent with respect to the identi\ufb01cation of a structure max-\nimizing posterior probability [4, 5, 6, 7]. However, convergence of such methods towards the areas\nassociated with high posterior probabilities may still be slow when the number of nodes increases\n[4, 6]. In addition, it is challenging to guarantee that the identi\ufb01ed model indeed truly represents\nthe global optimum since the consistency of MCMC estimates is by de\ufb01nition a limit result. To the\nbest of our knowledge, strict constraint-based search methods have not been previously applied in\nlearning of Markov random \ufb01elds. In this article, we formalize the structure of Markov networks\nusing constraints at a general level. This enables the development of reductions from the structure\nlearning problem to propositional satis\ufb01ability (SAT) [14] and its generalizations such as maximum\nsatis\ufb01ability (MAXSAT) [15], and satis\ufb01ability modulo theories (SMT) [16], as well as answer-set\nprogramming (ASP) [17]. A main novelty is the recognition of maximum weight spanning trees\nof the clique graph by a condition on the cardinalities of occurrences of variables in cliques and\nseparators, which we call the balancing condition.\nThe article is structured as follows. 
We \ufb01rst review some details of Markov networks and the re-\nspective structure learning problem in Section 2. To enable ef\ufb01cient encodings of Markov network\nlearning as a constraint satisfaction problem, in Section 3 we establish a new characterization of\nthe separators of a Markov network based on a balancing condition. In Section 4, we provide a\nhigh-level description of how the learning problem can be expressed using constraints and sketch the\nactual translations into propositional satis\ufb01ability (SAT) and its generalizations. We have imple-\nmented these translations and conducted experiments to study the performance of existing solver\ntechnology on structure learning problems in Section 5 using two widely used datasets [18]. Finally,\nsome conclusions and possibilities for further research in this area are presented in Section 6.\n\n2 Structure Learning for Markov Networks\nAn undirected graph G = \u27e8V, E\u27e9 consists of a set of nodes V which represents a set of random\nvariables and a set of undirected edges E \u2286 {{n, n\u2032} | n, n\u2032 \u2208 V and n \u2260 n\u2032}. A path in a graph\nis a sequence of nodes such that every two consecutive nodes are connected by an edge. Two sets of\nnodes A and B are said to be separated by a third set of nodes D if every path between a node in\nA and a node in B contains at least one node in D. An undirected graph is chordal if for all paths\nn0, . . . , nk with k \u2265 4 and n0 = nk there exist two nodes ni, nj in the path connected by an edge\nsuch that j \u2260 i \u00b1 1. A clique in a graph is a set of nodes c such that every two nodes in it are\nconnected by an edge. In addition, there may not exist a set of nodes c\u2032 such that c \u2282 c\u2032 and every\ntwo nodes in c\u2032 are connected by an edge. 
Given the set of cliques C in a chordal graph, the set of\nseparators S can be obtained through intersections of the cliques ordered in terms of a junction tree\n[19]; this operation is considered thoroughly in Section 3.\nA Markov network is de\ufb01ned as a pair consisting of a graph G and a joint distribution $P_V$ over\nthe variables in V. The graph speci\ufb01es the dependence structure of the variables and $P_V$ factorizes\naccording to G (see below). Given G it is possible to ascertain if two sets of variables A and B are\nconditionally independent given another set of variables D, due to the global Markov property\n\nA \u22a5\u22a5 B | D, if D separates A from B.\n\nFor a Markov network with a chordal graph G, the probability of a joint outcome x factorizes as\n\n$$P_V(x) = \\frac{\\prod_{c_i \\in C} P_{c_i}(x_{c_i})}{\\prod_{s_i \\in S} P_{s_i}(x_{s_i})}.$$\n\nFollowing this factorization the marginal likelihood of a dataset X given a Markov network with a\nchordal graph G can be written\n\n$$P(X|G) = \\frac{\\prod_{c_i \\in C} P_{c_i}(X_{c_i})}{\\prod_{s_i \\in S} P_{s_i}(X_{s_i})}.$$\n\nBy a suitable choice of prior distribution, the terms $P_{c_i}(X_{c_i})$ and $P_{s_i}(X_{s_i})$ can be calculated ana-\nlytically. Let a denote an arbitrary clique or separator containing the variables $X_a$ whose outcome\nspace has the cardinality k. Further, let $n_a^{(j)}$ denote the number of occurrences where $X_a = x_a^{(j)}$ in\nthe dataset $X_a$. Now assign the $\\mathrm{Dirichlet}(\\alpha_{a1}, \\ldots, \\alpha_{ak})$ distribution as prior over the probabilities\n$P_a(X_a = x_a^{(j)}) = \\theta_j$, determining the distribution $P_a(X_a)$. Now $P_a(X_a)$ can be calculated as\n\n$$P_a(X_a) = \\int_{\\Theta} \\prod_{j=1}^{k} (\\theta_j)^{n_a^{(j)}} \\cdot \\pi_a(\\theta) \\, d\\theta$$\n\nwhere $\\pi_a(\\theta)$ is the density function of the Dirichlet prior distribution. By the standard properties of\nthe Dirichlet integral, $P_a(X_a)$ can be reduced to the form\n\n$$P_a(X_a) = \\frac{\\Gamma(\\alpha)}{\\Gamma(n_a + \\alpha)} \\prod_{j=1}^{k} \\frac{\\Gamma(n_a^{(j)} + \\alpha_{aj})}{\\Gamma(\\alpha_{aj})}$$\n\nwhere $\\Gamma(\\cdot)$ denotes the gamma function and\n\n$$\\alpha = \\sum_{j=1}^{k} \\alpha_{aj} \\quad \\text{and} \\quad n_a = \\sum_{j=1}^{k} n_a^{(j)}.$$\n\nWhen dealing with the marginal likelihood of a dataset it is most often necessary to use the logarith-\nmic value log P(X|G). Introducing the notations $v(c_i) = \\log P_{c_i}(X_{c_i})$ the logarithmic value of the\nmarginal likelihood can be written\n\n$$\\log P(X|G) = \\sum_{c_i \\in C} \\log P_{c_i}(X_{c_i}) - \\sum_{s_i \\in S} \\log P_{s_i}(X_{s_i}) = \\sum_{c_i \\in C} v(c_i) - \\sum_{s_i \\in S} v(s_i). \\qquad (1)$$\n\nThe learning problem is to \ufb01nd a graph G that optimizes the posterior distribution\n\n$$P(G|X) = \\frac{P(X|G)P(G)}{\\sum_{G \\in \\mathcal{G}} P(X|G)P(G)}.$$\n\nHere $\\mathcal{G}$ denotes the set of all graphs under consideration and P(G) is the prior probability assigned\nto G. In the case where a uniform prior is used for the graphs the optimization problem reduces to\n\ufb01nding the graph with the largest marginal likelihood.\n\n3 Fundamental Properties and Characterization Results\n\nIn this section, we point out some properties of chordal graphs and clique graphs that can be utilized\nin the encodings of the learning problem. In particular, we develop a characterization of maximum\nweight spanning trees in terms of a balancing condition on separators.\nThe separators needed for determining the score (1) of a candidate Markov network are de\ufb01ned as\nfollows. 
Given the cliques, we can form the clique graph, in which the nodes are the cliques and\nthere is an edge between two nodes if the corresponding cliques have a non-empty intersection.\nWe label each of the edges with this intersection and consider the cardinality of the label as its\nweight. The separators are the edge labels of a maximum weight spanning tree of the clique graph.\nMaximum weight spanning trees of arbitrary graphs can be found in polynomial time by reducing\nthe problem to \ufb01nding minimum weight spanning trees. This reduction consists of negating all the\nedge weights and then using any of the polynomial time algorithms for the latter problem [20]. There\nmay be several maximum weight spanning trees, but they induce exactly the same separators, and\nthey only differ in terms of which pairs of cliques induce the separators.\nTo restrict the search space we can observe that a chordal graph with n nodes has at most n maximal\ncliques [19]. This gives an immediate upper bound on the number of cliques chosen to build a\nMarkov network, which can be encoded as a simple cardinality constraint.\n\n3.1 Characterization of Maximum Weight Spanning Trees\n\nTo simplify the encoding of maximum weight spanning trees (and forests) of chordal clique graphs,\nwe introduce the notion of balanced spanning trees (respectively, forests), and show that these two\nconcepts coincide for chordal graphs. 
Then separators can be identi\ufb01ed more effectively: rather than\nencoding an algorithm for \ufb01nding maximum-weight spanning trees as constraints, it is suf\ufb01cient to\nselect a subset of the edges of the clique graph that is acyclic and satis\ufb01es the balancing condition\nexpressible as a cardinality constraint over occurrences of nodes in cliques and separators.\n\nDe\ufb01nition 1 (Balancing) A spanning tree (or forest) of a clique graph is balanced if for every node\nn, the number of cliques containing n is one higher than the number of labeled edges containing n.\n\nWhile in the following we state many results for spanning trees only, they can be straightforwardly\ngeneralized to spanning forests as well (in case the Markov networks are disconnected).\n\nLemma 2 For any clique graph, all its balanced spanning trees have the same weight.\n\nProof: This holds in general because the balancing condition requires exactly the same number of\noccurrences of any node in the separator edges for any balanced spanning tree, and the weight is\nde\ufb01ned as the sum of the occurrences of nodes in the edge labels. \u25a1\n\nLemma 3 ([21, 22]) Any maximum weight spanning tree of the clique graph is a junction tree, and\nhence satis\ufb01es the running intersection property: for every pair of nodes c and c\u2032, (c \u2229 c\u2032) \u2286 c\u2032\u2032 for\nall nodes c\u2032\u2032 on the unique path between c and c\u2032.\n\nLemma 4 Let T = \u27e8V, ET\u27e9 be a maximum weight spanning tree of the clique graph \u27e8V, E\u27e9 of a\nconnected chordal graph. Then T is balanced.\n\nProof: We order the tree by choosing an arbitrary clique as the root and by assigning a depth to all\nnodes according to their distance from the root node. The rest of the proof proceeds by induction on\nthe height of subtrees starting from the leaf nodes as the base case. 
The induction hypothesis says\nthat all subtrees satisfy the balancing condition. The base cases are trivial: each leaf node (clique)\ntrivially satis\ufb01es the balancing condition, as there are no separators to consider.\nIn the inductive cases, we have a clique c at depth d, connected to one or more subtrees rooted at\nneighboring cliques c1, . . . , ck at depth d + 1, with the subtrees satisfying the balancing condition.\nWe show that the tree consisting of the clique c, the labeled edges connecting c respectively to\ncliques c1, . . . , ck, and the subtrees rooted at c1, . . . , ck, satis\ufb01es the balancing condition.\nFirst note that by Lemma 3, any maximum weight spanning tree of the clique graph is a junction\ntree and hence satis\ufb01es the running intersection property, meaning that for any two cliques c1 and c2\nin the tree, every clique on the unique path connecting them includes c1 \u2229 c2.\nWe have to show that the subtree rooted at c is balanced, given that its subtrees are balanced. We\nshow that the balancing condition is satis\ufb01ed for each node separately. So let n be one of the nodes\nin the original graph. Now each of the subtrees rooted at some ci has either 0 occurrences of n, or\nki \u2265 1 occurrences in the cliques and ki \u2212 1 occurrences in the edge labels, because by the induction\nhypothesis the balancing condition is satis\ufb01ed. Four cases arise:\n\n1. The node n does not occur in any of the subtrees.\n\nNow the balancing condition is trivially satis\ufb01ed for the subtree rooted at c, because n\neither does not occur in c, or it occurs in c but does not occur in the label of any of the\nedges to the subtrees.\n\n2. The node n occurs in more than one subtree.\n\nSince any maximum weight spanning tree is a junction tree by Lemma 3, n must occur\nalso in c and in the labels of the edges between c and the cliques in which the subtrees with\nn are rooted. Let s1, . . . 
, sj be the numbers of occurrences of n in the edge labels in the\nsubtrees with at least one occurrence of n, and t1, . . . , tj the numbers of occurrences of n\nin the cliques in the same subtrees.\nBy the induction hypothesis, these subtrees are balanced, and hence ti \u2212 si = 1 for all\ni \u2208 {1, . . . , j}. The subtree rooted at c now has $1 + \\sum_{i=1}^{j} t_i$ occurrences of n in the nodes\n(once in c itself and then the subtrees) and $j + \\sum_{i=1}^{j} s_i$ occurrences in the edge labels,\nwhere the j occurrences are in the edges between c and the j subtrees.\n\nWe establish the balancing condition through a sequence of equalities. The \ufb01rst and the last\nexpression are the two sides of the condition.\n\n$$\\Big(1 + \\sum_{i=1}^{j} t_i\\Big) - \\Big(j + \\sum_{i=1}^{j} s_i\\Big)$$\n$$= 1 - j + \\sum_{i=1}^{j} (t_i - s_i) \\qquad \\text{(reordering the terms)}$$\n$$= 1 - j + j \\qquad \\text{(since } t_i - s_i = 1 \\text{ for every subtree)}$$\n$$= 1$$\n\nHence also the subtree rooted at c is balanced.\n\n3. The node n occurs in one subtree and in c.\n\nLet i be the index of the subtree in which n occurs. Since any maximum weight spanning\ntree is a junction tree by Lemma 3, n must occur also in the clique ci. Hence n occurs in\nthe label of the edge from ci to c. Since the subtree is balanced, the new graph obtained by\nadding the clique c and the edge with a label containing n is also balanced. Further, adding\nall the other subtrees that do not contain n will not affect the balancing of n.\n\n4. The node n occurs in one subtree but not in c.\n\nSince there are no occurrences of n in any of the other subtrees, in c, or in the edge labels\nbetween c and any of the subtrees, the balancing condition holds.\n\nThis completes the induction step and consequently, the whole spanning tree is balanced. \u25a1\n\nLemma 5 Assume T = \u27e8V, EB\u27e9 is a spanning tree of the clique graph GC = \u27e8V, E\u27e9 of a chordal\ngraph that satis\ufb01es the balancing condition. 
Then T is a maximum weight spanning tree of GC.\n\nProof: Let TM be one of the spanning trees of GC with the maximum weight w. By Lemma 4, this\nmaximum weight spanning tree is balanced. By Lemma 2, T has the same weight w as TM. Hence\nalso T is a maximum weight spanning tree of GC. \u25a1\n\nTheorem 6 For any clique graph of a chordal graph, any of its subgraphs is a maximum weight\nspanning tree if and only if it is a balanced acyclic subgraph.\n\n4 Representation as Constraints\n\nIn this section we \ufb01rst show how the structure learning problem of Markov networks is cast as a\nconstraint satisfaction problem, and then formalize it concretely in the language of propositional\nlogic, as directly supported by SMT solvers and easily translatable into conjunctive normal form as\nused by SAT and MAXSAT solvers. In ASP slightly different rule-based formulations are used.\nThe learning problem is formalized as follows. The goal is to \ufb01nd a balanced spanning tree (cf. Def-\ninition 1) for a set C of cliques forming a Markov network and the set S of separators induced by\nthe tree structure. In addition, C and S are supposed to be optimal in the sense of (1), i.e., the overall\nscore $v(C, S) = \\sum_{c \\in C} v(c) - \\sum_{s \\in S} v(s)$ is maximized. The individual score v(c) for any set of\nnodes c describes how well it re\ufb02ects the interdependencies of the variables in c in the data.\n\nDe\ufb01nition 7 Let V be a set of nodes representing random variables and $v : 2^V \\to \\mathbb{R}$ a scoring\nfunction. A solution to the Markov network learning problem is a set of cliques C = {c1, . . . , ck}\nsatisfying the following requirements viewed as abstract constraints:\n\n1. Every node is included in at least one of the chosen cliques in C, i.e., $\\bigcup_{i=1}^{k} c_i = V$.\n\n2. Cliques in C are maximal, i.e.,\n\n(a) for every c, c\u2032 \u2208 C, if c \u2286 c\u2032, then c = c\u2032; and\n(b) for every c \u2286 V, if $\\mathrm{edges}(c) \\subseteq \\bigcup_{c' \\in C} \\mathrm{edges}(c')$, then c \u2286 c\u2032 for some c\u2032 \u2208 C\n\nwhere edges(c) = {{n, n\u2032} \u2286 c | n \u2260 n\u2032} is de\ufb01ned for each c \u2286 V.\n\n3. The graph \u27e8V, E\u27e9 with the set of edges $E = \\bigcup_{c \\in C} \\mathrm{edges}(c)$ is chordal.\n\n4. The set C has a balanced spanning tree labeled by a set of separators S = {s1, . . . , sl}.\n\nMoreover, the solution is optimal if it maximizes the overall score v(C, S).\n\nThe encodings of basic graph properties (conditions 1 and 2 above) are presented in Section 4.1. The\nmore complex properties (3 and 4) are addressed in Sections 4.2 and 4.3.\n\n4.1 Graph Properties\n\nWe assume that clique candidates \u2013 which are the non-empty subsets of V \u2013 are indexed from 1 to\n$2^{|V|}$. We often identify a clique with its index. Each clique candidate c \u2286 V has an associated score\nv(c). To encode the search space for Markov networks, we introduce, for every clique candidate\nc, a propositional variable xc denoting that c is part of the learned network. We also introduce\npropositional variables en,m that represent edges {n, m} that are in at least one chosen clique.\u00b9\nTo formalize condition 1 of De\ufb01nition 7, for every node n we have the constraint\n\n$$x_{c_1} \\lor \\cdots \\lor x_{c_k} \\qquad (2)$$\n\nwhere c1, . . . , ck are all cliques c with n \u2208 c.\nTo satisfy the maximality condition 2(a), we require that if a clique is chosen, then at least one edge\nin each of its super-cliques is not chosen. We \ufb01rst make the edges of the chosen cliques explicit by\nthe next constraint for all {n, m} \u2286 V and cliques c1, . . . , ck such that {n, m} \u2286 ci.\n\n$$e_{n,m} \\leftrightarrow (x_{c_1} \\lor \\cdots \\lor x_{c_k}) \\qquad (3)$$\n\nThen for every clique candidate c = {n1, . . . , nk} and every node n \u2208 V \\ c we have the constraint\n\n$$x_c \\rightarrow (\\neg e_{n_1,n} \\lor \\cdots \\lor \\neg e_{n_k,n}) \\qquad (4)$$\n\nwhere en1,n, . . . , enk,n represent all additional edges that would turn c \u222a {n} into a clique. For\neach pair of clique candidates c and c\u2032 such that c \u2282 c\u2032, \u00acxc \u2228 \u00acxc\u2032 is a logical consequence of the\nconstraints (4). They are useful for strengthening the inferences made by SAT solvers.\nFor condition 2(b) we use propositional variables zc which mean that either c or one of its super-\ncliques is chosen, and propositional variables wc which mean that all edges of c are chosen. For\n2-element cliques c = {n1, n2} we have\n\n$$w_c \\leftrightarrow e_{n_1,n_2}. \\qquad (5)$$\n\nFor larger cliques c we have\n\n$$w_c \\leftrightarrow w_{c_1} \\land \\cdots \\land w_{c_k} \\qquad (6)$$\n\nwhere c1, . . . , ck are all subcliques of c with one less node than c. Hence wc is true iff all edges of c\nare chosen. If all edges of a clique are chosen, then the clique itself or one of its super-cliques must\nbe chosen. If c1, . . . , ck are all cliques that extend c by one node, this is encoded as follows.\n\n$$w_c \\rightarrow z_c \\qquad (7)$$\n$$z_c \\leftrightarrow (x_c \\lor z_{c_1} \\lor \\cdots \\lor z_{c_k}) \\qquad (8)$$\n\n4.2 Chordality\n\nWe use a straightforward encoding of the chordality condition (3) of De\ufb01nition 7. The idea is to\ngenerate constraints corresponding to every k \u2265 4 element subset S = {n1, . . . , nk} of V. Let\nus consider all cycles these nodes could form in the graph \u27e8V, E\u27e9 of condition 3 in De\ufb01nition 7.\nA cycle starts from a given node, goes through all other nodes, with (undirected) edges between\ntwo consecutive nodes, and ends in the starting node. The number of constraints can be reduced\nby two observations. First, the same cycle could be generated from different starting nodes, e.g.,\ncycles n1, n2, n3, n4, n1 and n2, n3, n4, n1, n2 are the same. Second, generating the same cycle\nin two opposite directions, as in n1, n2, n3, n4, n1 and n1, n4, n3, n2, n1, is unnecessary. 
To avoid redundant cycle constraints, we arbitrarily \ufb01x the starting node and require that the index of the\nsecond node in the cycle is lower than the index of the second last node. These restrictions guarantee\nthat every cycle associated with S is considered exactly once. Now, the chordality constraint says\nthat if there is an edge between every pair of consecutive nodes in n1, . . . , nk, n1, then there also\nhas to be an edge between at least one pair of two non-consecutive nodes. In the case k = 4, for\ninstance, this leads to formulas of the form\n\n$$e_{n_1,n_2} \\land e_{n_2,n_3} \\land e_{n_3,n_4} \\land e_{n_4,n_1} \\rightarrow e_{n_1,n_3} \\lor e_{n_2,n_4}. \\qquad (9)$$\n\nThis encoding of chordality constraints is exponential in |V| and therefore not scalable to large\nnumbers of nodes. However, the datasets considered in Section 5 have only 6 or 8 variables, and in\nthese cases the exponentiality is not an issue.\n\n\u00b9As the edges are undirected, we limit to en,m such that the ordering of n and m according to some \ufb01xed\nordering is increasing, i.e., n < m. Under this assumption, em,n for n < m denotes en,m.\n\n4.3 Separators\nSeparators for pairs c and c\u2032 of clique candidates can be formalized as propositional variables sc,c\u2032,\nmeaning that c \u2229 c\u2032 is a separator and there is an edge in the spanning tree between c and c\u2032 labeled\nby c \u2229 c\u2032. The corresponding constraint is\n\n$$s_{c,c'} \\rightarrow x_c \\land x_{c'}. \\qquad (10)$$\n\nThe lack of the converse implication formalizes the choice of the spanning tree, i.e., sc,c\u2032 can be\nfalse even if xc and xc\u2032 are true. The remaining constraints on separators fall into two cases.\nFirst, we have cardinality constraints encoding the balancing condition (cf. Section 3.1): each vari-\nable occurs in the chosen cliques one more time than it occurs in the separators which label the\nspanning tree. 
Cardinality constraints are natively supported by some constraint solvers, or they can\nbe reduced to Boolean constraints [23]. Second, the graph formed by the cliques with the separators\nas edges must be acyclic. We encode this through an inductive de\ufb01nition of trees: repeatedly remove\nleaf nodes, i.e., nodes with at most one neighbor, until all nodes have been removed. When applying\nthis de\ufb01nition to a cyclic graph, some nodes will remain in the end. We de\ufb01ne the leaf level for each\nnode. A node is a level 0 leaf iff it has 0 or 1 neighbors in the graph. A node is a level l + 1 leaf iff\nall its neighbors except possibly one are level j \u2264 l leaves. This de\ufb01nition is directly expressible as\nBoolean constraints. A graph with m nodes is acyclic iff all its nodes are level $\\lfloor m/2 \\rfloor$ leaves.\n\n5 Experimental Evaluation\n\nThe constraints described in Section 4 can be alternatively expressed as MAXSAT, SMT, or ASP\nproblems. We have used respective solvers in computing optimal Markov networks for datasets from\nthe literature. The test runs were performed on an Intel Xeon E3-1230 CPU running at 3.20 GHz.\n\n1. For the MAXSAT encodings, we tried out SAT4J (version 2.3.2) [24] and PWBO (version\n2.2) [25]. The latter was run in its default con\ufb01guration as well as in the UB con\ufb01guration.\n\n2. For SMT, we used the OPTIMATHSAT solver (version 5) [26].\n\n3. For ASP, we used the CLASP (version 2.1.3) [27] and HCLASP (also v. 2.1.3) [28]\nsolvers. The latter allows declaratively specifying search heuristics. We also tried the\nLP2NORMAL tool that reduces cardinality constraints to more basic constraints [29].\n\nWe consider two datasets, one containing risk factors in heart diseases and the other variables related\nto economical behavior [18], to be abbreviated by heart and econ in the sequel. For heart, the glob-\nally optimal network has been veri\ufb01ed via (expensive) exhaustive enumeration. 
For econ, however,\nexhaustive enumeration is impractical due to the extremely large search space, and consequently the\noptimality of the Markov network found by stochastic search in [4] had been open until now. For\nboth datasets, we computed the respective score \ufb01le that speci\ufb01es the score of each clique candidate,\ni.e., the log-value of its potential function, and the list of variables involved in that clique. The score\n\ufb01les were then translated to be run with the different solvers. The MAXSAT and ASP solvers only\nsupport integer scores obtained by multiplying the original scores by 1000 and rounding. The SMT\nsolver OptiMathSAT used the original \ufb02oating point scores. The results are given in Table 1.\n\nSolver | Runtime heart (s) | Runtime econ (s) | Input size heart | Input size econ\nOPTIMATHSAT | 74 | - | 3930 kB | 139 MB\nPWBO (default) | 158 | - | 3120 kB | 130 MB\nPWBO (UB) | 63 | - | 3120 kB | 130 MB\nSAT4J | 28 | - | 3120 kB | 130 MB\nLP2NORMAL+CLASP | 111 | - | 8120 kB | 1060 MB\nCLASP | 5.6 | - | 197 kB | 4.2 MB\nHCLASP | 1.6 | 310 \u00d7 10^3 | 203 kB | 4.2 MB\n\nTable 1: Summary of results: Runtimes in seconds and sizes of solver input \ufb01les\n\nThe heart data involves 6 variables giving rise to $2^6 = 64$ clique candidates in total and a search\nspace of $2^{15}$ undirected networks of which a subset are decomposable. For instance, the ASP solver\nHCLASP traversed a considerably smaller search space that consisted of 26651 (partial) networks.\nThis illustrates the power of branch-and-bound type algorithms behind the solvers and their ability\nto prune the search space. On the other hand, the econ dataset is based on 8 variables giving rise to\na much larger search space $2^{28}$. We were able to solve this instance optimally with one solver only,\nHCLASP, which allows for a more re\ufb01ned control of the search heuristic: we forced HCLASP to try\ncliques in an ascending order by size, with greatest cliques \ufb01rst. 
This allowed us to \ufb01nd the global\noptimum in about 14 hours, after which 3 days is spent on the proof of optimality.\n\n6 Conclusions\n\nBoolean constraint methods appear not to have been earlier applied to learning of undirected Markov\nnetworks. We have introduced a generic approach in which the learning problem is expressed in\nterms of constraints on variables that determine the structure of the learned network. The related\nproblem of structure learning of Bayesian networks has been addressed by general-purpose com-\nbinatorial search methods, including MAXSAT [30] and a constraint-programming solver with a\nlinear-programming solver as a subprocedure [31, 32]. We introduced explicit translations of the\ngeneric constraints to the languages of MAXSAT, SMT and ASP, and demonstrated their use through\nexisting solver technology. Our method thus opens up a novel venue of research to further develop\nand optimize the use of such technology for network learning. A wide variety of possibilities does\nexist also for using these methods in combination with stochastic or heuristic search.\n\nReferences\n[1] J. N. Darroch, Steffen L. Lauritzen, and T. P. Speed. Markov \ufb01elds and log-linear interaction\n\nmodels for contingency tables. The Annals of Statistics, 8:522\u2013539, 1980.\n\n[2] Steffen L. Lauritzen and Nanny Wermuth. Graphical models for associations between vari-\nables, some of which are qualitative and some quantitative. The Annals of Statistics, 17:31\u201357,\n1989.\n\n[3] Jukka Corander. Bayesian graphical model determination using decision theory. Journal of\n\nMultivariate Analysis, 85:253\u2013266, 2003.\n\n[4] Jukka Corander, Magnus Ekdahl, and Timo Koski. Parallel interacting MCMC for learning of\n\ntopologies of graphical models. Data Mining and Knowledge Discovery, 17:431\u2013456, 2008.\n\n[5] Petros Dellaportas and Jonathan J. Forster. Markov chain Monte Carlo model determination\n\nfor hierarchical and graphical log-linear models. 
Biometrika, 86:615\u2013633, 1999.\n\n[6] Paolo Giudici and Robert Castello. Improving Markov chain Monte Carlo model search for data mining. Machine Learning, 50:127\u2013158, 2003.\n\n[7] Paolo Giudici and Peter J. Green. Decomposable graphical Gaussian model determination. Biometrika, 86:785\u2013801, 1999.\n\n[8] Mikko Koivisto and Kismat Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549\u2013573, 2004.\n\n[9] David Madigan and Adrian E. Raftery. Model selection and accounting for model uncertainty in graphical models using Occam\u2019s window. Journal of the American Statistical Association, 89:1535\u20131546, 1994.\n\n[10] Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random \ufb01elds. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(4):380\u2013393, 1997.\n\n[11] Su-In Lee, Varun Ganapathi, and Daphne Koller. Ef\ufb01cient structure learning of Markov networks using L1-regularization. In Advances in Neural Information Processing Systems 19, pages 817\u2013824. MIT Press, 2006.\n\n[12] M. Schmidt, A. Niculescu-Mizil, and K. Murphy. Learning graphical model structure using L1-regularization paths. In Proceedings of the National Conference on Arti\ufb01cial Intelligence, page 1278. AAAI Press / MIT Press, 2007.\n\n[13] Holger H\u00f6\ufb02ing and Robert Tibshirani. Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods. Journal of Machine Learning Research, 10:883\u2013906, 2009.\n\n[14] Armin Biere, Marijn J. H. Heule, Hans van Maaren, and Toby Walsh, editors. Handbook of Satis\ufb01ability. IOS Press, 2009.\n\n[15] Chu Min Li and Felip Many\u00e0. MaxSAT, Hard and Soft Constraints, chapter 19, pages 613\u2013631. In Biere et al. [14], 2009.\n\n[16] Clark Barrett, Roberto Sebastiani, Sanjit A. Seshia, and Cesare Tinelli. Satis\ufb01ability Modulo Theories, chapter 26, pages 825\u2013885. 
In Biere et al. [14], 2009.\n\n[17] Gerhard Brewka, Thomas Eiter, and Miroslaw Truszczynski. Answer set programming at a glance. Commun. ACM, 54(12):92\u2013103, 2011.\n\n[18] Joe Whittaker. Graphical Models in Applied Multivariate Statistics. Wiley Publishing, 1990.\n\n[19] Martin C. Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press, 1980.\n\n[20] Ronald L. Graham and Pavol Hell. On the history of the minimum spanning tree problem. Annals of the History of Computing, 7(1):43\u201357, 1985.\n\n[21] Yukio Shibata. On the tree representation of chordal graphs. Journal of Graph Theory, 12(3):421\u2013428, 1988.\n\n[22] Finn V. Jensen and Frank Jensen. Optimal junction trees. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (UAI-94), pages 360\u2013366, 1994.\n\n[23] Carsten Sinz. Towards an optimal CNF encoding of Boolean cardinality constraints. In Principles and Practice of Constraint Programming \u2013 CP 2005, number 3709 in Lecture Notes in Computer Science, pages 827\u2013831. Springer-Verlag, 2005.\n\n[24] Daniel Le Berre and Anne Parrain. The Sat4j library, release 2.2 system description. Journal on Satisfiability, Boolean Modeling and Computation, 7:59\u201364, 2010.\n\n[25] Ruben Martins, Vasco Manquinho, and In\u00eas Lynce. Parallel search for maximum satisfiability. AI Communications, 25:75\u201395, 2012.\n\n[26] Roberto Sebastiani and Silvia Tomasi. Optimization in SMT with LA(Q) cost functions. In Automated Reasoning, volume 7364 of LNCS, pages 484\u2013498. Springer-Verlag, 2012.\n\n[27] Martin Gebser, Benjamin Kaufmann, and Torsten Schaub. Conflict-driven answer set solving: From theory to practice. Artif. Intell., 187:52\u201389, 2012.\n\n[28] Martin Gebser, Benjamin Kaufmann, Ram\u00f3n Otero, Javier Romero, Torsten Schaub, and Philipp Wanko. 
Domain-specific heuristics in answer set programming. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence. AAAI Press, 2013.\n\n[29] Tomi Janhunen and Ilkka Niemel\u00e4. Compact translations of non-disjunctive answer set programs to propositional clauses. In Gelfond Festschrift, Vol. 6565 of LNCS, pages 111\u2013130. Springer-Verlag, 2011.\n\n[30] James Cussens. Bayesian network learning by compiling to weighted MAX-SAT. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 105\u2013112, 2008.\n\n[31] James Cussens. Bayesian network learning with cutting planes. In Proceedings of the Twenty-Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-11), pages 153\u2013160. AUAI Press, 2011.\n\n[32] Mark Bartlett and James Cussens. Advances in Bayesian network learning using integer programming. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013), pages 182\u2013191. AUAI Press, 2013.\n", "award": [], "sourceid": 690, "authors": [{"given_name": "Jukka", "family_name": "Corander", "institution": "University of Helsinki"}, {"given_name": "Tomi", "family_name": "Janhunen", "institution": "Aalto University"}, {"given_name": "Jussi", "family_name": "Rintanen", "institution": "Aalto University"}, {"given_name": "Henrik", "family_name": "Nyman", "institution": "\u00c5bo Akademi"}, {"given_name": "Johan", "family_name": "Pensar", "institution": "\u00c5bo Akademi"}]}