{"title": "Efficient Principled Learning of Thin Junction Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 273, "page_last": 280, "abstract": "We present the first truly polynomial algorithm for learning the structure of bounded-treewidth junction trees -- an attractive subclass of probabilistic graphical models that permits both the compact representation of probability distributions and efficient exact inference. For a constant treewidth, our algorithm has polynomial time and sample complexity, and provides strong theoretical guarantees in terms of $KL$ divergence from the true distribution. We also present a lazy extension of our approach that leads to very significant speed ups in practice, and demonstrate the viability of our method empirically, on several real world datasets. One of our key new theoretical insights is a method for bounding the conditional mutual information of arbitrarily large sets of random variables with only a polynomial number of mutual information computations on fixed-size subsets of variables, when the underlying distribution can be approximated by a bounded treewidth junction tree.", "full_text": "Ef\ufb01cient Principled Learning of Thin Junction Trees\n\nAnton Chechetka Carlos Guestrin\n\nCarnegie Mellon University\n\nAbstract\n\nWe present the \ufb01rst truly polynomial algorithm for PAC-learning the structure of\nbounded-treewidth junction trees \u2013 an attractive subclass of probabilistic graphical\nmodels that permits both the compact representation of probability distributions\nand ef\ufb01cient exact inference. For a constant treewidth, our algorithm has polyno-\nmial time and sample complexity. If a junction tree with suf\ufb01ciently strong intra-\nclique dependencies exists, we provide strong theoretical guarantees in terms of\nKL divergence of the result from the true distribution. 
We also present a lazy\nextension of our approach that leads to very signi\ufb01cant speed ups in practice, and\ndemonstrate the viability of our method empirically, on several real world datasets.\nOne of our key new theoretical insights is a method for bounding the conditional\nmutual information of arbitrarily large sets of variables with only polynomially\nmany mutual information computations on \ufb01xed-size subsets of variables, if the\nunderlying distribution can be approximated by a bounded-treewidth junction tree.\n\n1 Introduction\nIn many applications, e.g., medical diagnosis or datacenter performance monitoring, probabilistic\ninference plays an important role: to decide on a patient\u2019s treatment, it is useful to know the prob-\nability of various illnesses given the known symptoms. Thus, it is important to be able to represent\nprobability distributions compactly and perform inference ef\ufb01ciently. Here, probabilistic graphical\nmodels (PGMs) have been successful as compact representations for probability distributions.\n\nIn order to use a PGM, one needs to de\ufb01ne its structure and parameter values. Usually, we only\nhave data (i.e., samples from a probability distribution), and learning the structure from data is thus\na crucial task. For most formulations, the structure learning problem is NP-complete, c.f., [10].\nMost structure learning algorithms only guarantee that their output is a local optimum. One of the\nfew notable exceptions is the work of Abbeel et al. [1], for learning structure of factor graphs, that\nprovides probably approximately correct (PAC) learnability guarantees.\n\nWhile PGMs can represent probability distributions compactly, exact inference in compact models,\nsuch as those of Abbeel et al., remains intractable [7]. An attractive solution is to use junction\ntrees (JTs) of limited treewidth \u2013 a subclass of PGMs that permits ef\ufb01cient exact inference. 
For treewidth k = 1 (trees), the most likely (MLE) structure of a junction tree can be learned efficiently using the Chow-Liu algorithm [6], but the representational power of trees is often insufficient. We address the problem of learning JTs for fixed treewidth k > 1. Learning the most likely such JT is NP-complete [10]. While there are algorithms with global guarantees for learning fixed-treewidth JTs [10, 13], there has been no polynomial algorithm with PAC guarantees. The guarantee of [10] is in terms of the difference in log-likelihood between the MLE JT and the model where all variables are independent: the result is guaranteed to achieve at least a constant fraction of that difference. The constant does not improve as the amount of data increases, so it does not imply PAC learnability. The algorithm of [13] has PAC guarantees, but its complexity is exponential. In contrast, we provide a truly polynomial algorithm with PAC guarantees. The contributions of this paper are as follows:

• A theoretical result (Lemma 4) that upper bounds the conditional mutual information of arbitrarily large sets of random variables in polynomial time. In particular, we do not assume that an efficiently computable mutual information oracle exists.

• The first polynomial algorithm for PAC-learning the structure of limited-treewidth junction trees with strong intra-clique dependencies. We provide graceful degradation guarantees for distributions that are only approximately representable by JTs with fixed treewidth.

[Figure 1: a junction tree with cliques C1 = {x1,x4,x5}, C2 = {x1,x2,x5}, C3 = {x1,x2,x7}, C4 = {x4,x5,x6}, C5 = {x2,x3,x5}, and edges 1-4 (separator {x4,x5}), 1-2 (separator {x1,x5}), 2-5 (separator {x2,x5}), 2-3 (separator {x1,x2}).]

Figure 1: A junction tree.
Rectangles denote cliques; separators are marked on the edges.

• A lazy heuristic that makes the algorithm practical.
• Empirical evidence of the viability of our approach on real-world datasets.

Algorithm 1: Naïve approach to structure learning
  Input: V, oracle I(·, · | ·), treewidth k, threshold δ
  1: L ← ∅  // L is a set of "useful components"
  2: for S ⊂ V s.t. |S| = k do
  3:   for Q ⊂ V-S do
  4:     if I(Q, V-SQ | S) ≤ δ then
  5:       L ← L ∪ (S, Q)
  6: return FindConsistentTree(L)

2 Bounded treewidth graphical models
In general, even to represent a probability distribution P(V) over discrete variables¹ V we need space exponential in the size n of V. However, junction trees of limited treewidth allow compact representation and tractable exact inference. We briefly review junction trees (for details see [7]). Let C = {C1, . . . , Cm} be a collection of subsets of V. Elements of C are called cliques. Let T be a set of edges connecting pairs of cliques such that (T, C) is a tree.
Definition 1. Tree (T, C) is a junction tree iff it satisfies the running intersection property (RIP): ∀Ci, Cj ∈ C and ∀Ck on the (unique) simple path between Ci and Cj, x ∈ Ci ∩ Cj ⇒ x ∈ Ck.

A set Sij ≡ Ci ∩ Cj is called the separator corresponding to an edge (i−j) from T. The size of the largest clique in a junction tree minus one is called the treewidth of that tree. For example, in the junction tree in Fig. 1, variable x2 is contained in both clique 3 and clique 5, so it has to be contained in clique 2, because 2 is on the simple path between 3 and 5. The largest clique in Fig.
1 has size 3, so the treewidth of that junction tree is 2.

A distribution P(V) is representable using junction tree (T, C) if instantiating all variables in a separator Sij renders the variables on different sides of Sij independent. Denote the fact that A is independent of B given C by (A ⊥ B | C). Let C^i_ij be the cliques that can be reached from Ci in (T, C) without using edge (i−j), and denote these reachable variables by V^i_ij ≡ (∪_{Ck ∈ C^i_ij} Ck) \ Sij. For example, in Fig. 1, S12 = {x1, x5}, V^1_12 = {x4, x6}, V^2_12 = {x2, x3, x7}.
Definition 2. P(V) factors according to junction tree (T, C) iff ∀(i−j) ∈ T, (V^i_ij ⊥ V^j_ij | Sij).

If a distribution P(V) factors according to some junction tree of treewidth k, we will say that P(V) is k-JT representable. In this case, a projection P_(T,C) of P on (T, C), defined as

    P_(T,C) = ∏_{Ci ∈ C} P(Ci) / ∏_{(i−j) ∈ T} P(Sij),        (1)

is equal to P itself. For clarity, we will only consider maximal junction trees, where all separators have size k. If P is k-JT representable, it also factors according to some maximal JT of treewidth k.

In practice the notion of conditional independence is too strong. Instead, a natural relaxation is to require sets of variables to have low conditional mutual information I. Denoting by H(A) the entropy of A, the quantity I(A, B | S) ≡ H(A | S) − H(A | BS) is nonnegative, and zero iff (A ⊥ B | S). Intuitively, I(A, B | S) shows how much new information about A we can extract from B if we already know S.
Definition 3.
(T, C) is an ε-junction tree for P(V) iff ∀(i−j) ∈ T: I(V^i_ij, V^j_ij | Sij) ≤ ε.

¹Notation note: throughout the paper, we use small letters (x, y) to denote variables, capital letters (V, C) to denote sets of variables, and double-barred font (C, D) to denote sets of sets.

If there exists an ε-junction tree (T, C) for P(V), we will say that P is k-JT ε-representable. In this case, the Kullback-Leibler divergence of projection (1) of P on (T, C) from P is bounded [13]:

    KL(P, P_(T,C)) ≤ nε.        (2)

This bound means that if we have an ε-junction tree for P(V), then instead of P we can use its tractable principled approximation P_(T,C) for inference. In this paper, we address the problem of learning the structure of such a junction tree from data (samples from P).
3 Structure learning
In this paper, we address the following problem: given data, such as multiple temperature readings from sensors in a sensor network, we treat each datapoint as an instantiation of the random variables V and seek to find a good approximation of P(V). We will assume that P(V) is k-JT ε-representable for some ε and aim to find an ε̂-junction tree for P with the same treewidth k and with ε̂ as small as possible. Note that the maximal treewidth k is considered to be a constant and not a part of the problem input. The complexity of our approach is exponential in k.

Algorithm 2: LTCI: find Conditional Independencies in Low-Treewidth distributions
  Input: V, separator S, oracle I(·, · | ·), threshold δ, max set size q
  1: QS ← ∪_{x∈V} {x}  // QS is a set of singletons
  2: for A ⊂ V-S s.t. |A| ≤ q do
  3:   if min_{X⊂A} I(X, A-X | S) > δ then merge all Qi ∈ QS s.t. Qi ∩ A ≠ ∅  // find min with Queyranne's alg.
  4: return QS

Let us initially assume that we have an oracle I(·, · | ·) that can compute the mutual information I(A, B | C) exactly for any disjoint subsets A, B, C ⊂ V. This is a very strict requirement, which we address in the next section. Using the oracle I, a naïve approach would be to evaluate² I(Q, V-QS | S) for all possible Q, S ⊂ V s.t. |S| = k and record all pairs (S, Q) with I(Q, V-QS | S) ≤ ε into a list L. We will say that a junction tree (T, C) is consistent with a list L iff for every separator Sij of (T, C) it holds that (Sij, V^i_ij) ∈ L. After L is formed, any junction tree consistent with L would be an ε-junction tree for P(V). Such a tree would be found by some FindConsistentTree procedure, implemented, e.g., using constraint satisfaction. Alg. 1 summarizes this idea. Algorithms that follow this outline, including ours, form a class of constraint-based approaches. These algorithms use mutual information tests to constrain the set of possible structures and return one that is consistent with the constraints. Unfortunately, using Alg. 1 directly is impractical because its complexity is exponential in the total number of variables n. In the following sections we discuss the inefficiencies of Alg. 1 and present efficient solutions.
3.1 Global independence assertions from local tests
One can see two problems with the inner loop of Alg. 1 (lines 3-5). First, for each separator we need to call the oracle exponentially many times (2^{n−k−1}, once for every Q ⊂ V-S). This drawback is addressed in the next section. Second, the mutual information oracle, I(A, B | S), is called on subsets A and B of size O(n). Unfortunately, the best known way of computing mutual information (and estimating I from data) has time and sample complexity exponential in |A|+|B|+|S|. Previous work has not addressed this problem.
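Estimating I from data is tractable only when the sets involved are small, which is exactly what the rest of this section exploits. As a concrete illustration, conditional mutual information over a handful of discrete variables can be estimated from samples via the identity I(A, B | S) = H(AS) + H(BS) − H(ABS) − H(S). The following is a minimal sketch (our own illustrative code, not the paper's implementation; it assumes samples are tuples indexed by variable position, and it applies no smoothing and none of the confidence analysis of Section 4):

```python
from collections import Counter
from math import log

def entropy(samples, vars_):
    """Empirical entropy (in nats) of the joint distribution of vars_."""
    counts = Counter(tuple(s[v] for v in vars_) for s in samples)
    n = len(samples)
    return -sum((c / n) * log(c / n) for c in counts.values())

def cond_mutual_info(samples, A, B, S):
    """Empirical I(A; B | S) = H(AS) + H(BS) - H(ABS) - H(S)."""
    return (entropy(samples, A + S) + entropy(samples, B + S)
            - entropy(samples, A + B + S) - entropy(samples, S))
```

For two independent fair binary variables this estimator returns (numerically) zero, while for two identical ones it returns log 2, matching the exact values.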
In particular, the approach of [13] has exponential complexity, in general, because it needs to estimate I for subsets of size O(n). Our first new result states that we can limit ourselves to computing mutual information over small subsets of variables:
Lemma 4. Let P(V) be a k-JT ε-representable distribution. Let S ⊂ V, A ⊂ V-S. If ∀X ⊆ V-S s.t. |X| ≤ k + 1 it holds that I(A ∩ X, V-SA ∩ X | S) ≤ δ, then I(A, V-SA | S) ≤ n(ε + δ).
We can thus compute an upper bound on I(A, V-SA | S) using O(C(n, k+1)) ≡ O(n^{k+1}) (i.e., polynomially many) calls to the oracle I(·, · | ·), and each call will involve at most |S| + k + 1 variables. Lemma 4 also bounds the quality of approximation of P by a projection on any junction tree (T, C):
Corollary 5. If the conditions of Lemma 4 hold for P(V) with S = Sij and A = V^i_ij for every separator Sij of a junction tree (T, C), then (T, C) is an n(ε + δ)-junction tree for P(V).

3.2 Partitioning algorithm for weak conditional independencies
Now that we have an efficient upper bound for the I(·, · | ·) oracle, let us turn to reducing the number of oracle calls by Alg. 1 from exponential (2^{n−k−1}) to polynomial.

²Notation note: for any sets A, B, C we will denote A \ (B ∪ C) as A-BC to lighten the notation.

Algorithm 3: Efficient approach to structure learning
  Input: V, oracle I(·, · | ·), treewidth k, threshold ε, L = ∅
  1: for S ⊂ V s.t. |S| = k do
  2:   for Q ∈ LTCI(V, S, I, ε, k + 2) do
  3:     L ← L ∪ (S, Q)
  4: return FindConsistentTreeDPGreedy(L)

Algorithm 4: FindConsistentTreeDPGreedy
  Input: list L of components (S, Q)
  1: for (S, Q) ∈ L in the order of increasing |Q| do
  2:   greedily check if (S, Q) is L-decomposable
  3:   record the decomposition if it exists
  4: if ∃S : (S, V-S) is L-decomposable then
  5:   return the corresponding junction tree
  6: else return no tree found

In [13], Narasimhan and Bilmes present an approximate solution to this problem, assuming that an efficient approximation of the oracle I(·, · | ·) exists. A key observation that they rely on is that the function F_S(A) ≡ I(A, V-SA | S) is submodular: F_S(A) + F_S(B) ≥ F_S(A ∪ B) + F_S(A ∩ B). Queyranne's algorithm [14] allows the minimization of a submodular function F using O(n³) evaluations of F. [13] combines Queyranne's algorithm with a divide-and-conquer approach to partition V-S into conditionally independent subsets using O(n³) evaluations of I(·, · | ·). However, since I(·, · | ·) is computed for sets of size O(n), the complexity of their approach is still exponential in n, in general.

Our approach, called LTCI (Alg. 2), in contrast, has polynomial complexity for q = O(1). We will show that q = O(1) suffices in our approach that uses LTCI as a subroutine. To gain intuition for LTCI, suppose there exists an ε-junction tree for P(V) such that S is a separator and subsets B and C are on different sides of S in the junction tree. By definition, this means I(B, C | S) ≤ ε. When we look at the subset A ≡ B ∪ C, the true partitioning is not known, but setting δ = ε, we can test all possible 2^{|A|−1} ways to partition A into two subsets (X and A-X).
If none of the possible partitionings have I(X, A-X | S) ≤ ε, we can conclude that all variables in A are on the same side of separator S in any ε-junction tree that includes S as a separator. Notice also that

    ∀X ⊂ A: I(X, A-X | S) > δ  ⇔  min_{X⊂A} I(X, A-X | S) > δ,

so we can use Queyranne's algorithm to evaluate I(·, · | ·) only O(|A|³) times, instead of the 2^{|A|−1} times needed for minimization by exhaustive search. LTCI initially assumes that every variable x forms its own partition Q = {x}. If a test shows that two variables x and y are on the same side of the separator, it follows that their container partitions Q1 ∋ x, Q2 ∋ y cannot be separated by S, so LTCI merges Q1 and Q2 (line 3 of Alg. 2). This process is then repeated for larger sets of variables, of size up to q, until we converge to a set of partitions that are "almost independent" given S.
Proposition 6. The time complexity of LTCI with |S| = k is O(C(n, q) · n · J^MI_{k+q}) ≡ O(n^{q+1} J^MI_{k+q}), where J^MI_{k+q} is the time complexity of computing I(A, B | C) for |A| + |B| + |C| = k + q.

It is important that the partitioning algorithm returns partitions that are similar to the connected components of V^i_ij in the true junction tree for P(V). Formally, let us define two desirable properties. Suppose (T, C) is an ε-junction tree for P(V), and Q_Sij is the output of the algorithm for separator Sij and threshold δ. We will say that a partitioning algorithm is correct iff for δ = ε, ∀Q ∈ Q_Sij either Q ⊆ V^i_ij or Q ⊆ V^j_ij. A correct algorithm will never mistakenly put two variables on the same side of a separator.
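The partition-merging loop of LTCI can be sketched as follows. This is an illustrative re-implementation, not the authors' code: for the inner minimization it uses exhaustive search over bipartitions of A (affordable since |A| ≤ q is a small constant) where the paper uses Queyranne's algorithm, and it assumes a conditional mutual information oracle cmi(X, Y, S) is supplied.

```python
from itertools import combinations

def ltci(variables, S, cmi, delta, q):
    """Partition the variables outside separator S into components that are
    (approximately) conditionally independent given S.
    cmi(X, Y, S) -> estimate of I(X; Y | S); delta: threshold; q: max test-set size."""
    rest = [v for v in variables if v not in S]
    partitions = [{v} for v in rest]  # start from singletons

    for size in range(2, q + 1):
        for A_tuple in combinations(rest, size):
            A = set(A_tuple)
            # min over nonempty proper subsets X of A of I(X; A\X | S);
            # exhaustive here, Queyranne's algorithm in the paper
            best = min(
                cmi(set(X), A - set(X), S)
                for r in range(1, size)
                for X in combinations(A_tuple, r)
            )
            if best > delta:
                # all of A lies on one side of S: merge the touched partitions
                touched = [Q for Q in partitions if Q & A]
                merged = set().union(*touched)
                partitions = [Q for Q in partitions if not (Q & A)] + [merged]
    return partitions
```

For instance, with an oracle that reports high mutual information only between variables 0 and 1, the sketch merges those two into one partition and leaves variable 2 on its own.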
We will say that an algorithm is α-weak iff ∀Q ∈ Q_Sij: I(Q, V-QSij | Sij) ≤ α. For small α, an α-weak algorithm puts variables on different sides of a separator only if the corresponding mutual information between those variables is not too large. Ideally, we want a correct and δ-weak algorithm; for δ = ε it would separate variables that are on different sides of S in a true junction tree, but not introduce any spurious independencies. LTCI, which we use instead of lines 3-5 in Alg. 1, satisfies the first requirement and a relaxed version of the second:
Lemma 7. LTCI, for q ≥ k + 1, is correct and n(ε + (k − 1)δ)-weak.

3.3 Implementing FindConsistentTree using dynamic programming
A concrete form of the FindConsistentTree procedure is the last step needed to make Alg. 1 practical. For FindConsistentTree, we adopt a dynamic programming approach from [2] that was also used in [13] for the same purpose. We briefly review the intuition; see [2] for details.
Consider a junction tree (T, C). Let Sij be a separator in (T, C) and C^i_ij the set of cliques reachable from Ci without using edge (i−j). Denote by T^i_ij the set of edges from T that connect cliques from C^i_ij. If (T, C) is an ε-junction tree for P(V), then (C^i_ij, T^i_ij) is an ε-junction tree for P(V^i_ij ∪ Sij). Moreover, the subtree (C^i_ij, T^i_ij) consists of a clique Ci and several sub-subtrees that are each connected to Ci. For example, in Fig. 1 the subtree over cliques 1, 2, 4, 5 can be decomposed into clique 2 and two sub-subtrees: one including cliques {1, 4} and one with clique 5. The recursive structure suggests a dynamic programming approach: given a component (S, Q) such that I(Q, V-QS | S) < δ, check if smaller subtrees can be put together to cover the variables of (S, Q). Formally, we require the following property:
Definition 8.
(S, Q) ∈ L is L-decomposable iff ∃D = ∪i {(Si, Qi)} and x ∈ Q s.t.

1. ∀i, (Si, Qi) is L-decomposable and ∪_{i=1..m} Qi = Q \ {x};
2. Si ⊂ S ∪ {x}, i.e., each subcomponent can be connected directly to the clique (S, x);
3. Qi ∩ Qj = ∅, ensuring the running intersection property within the subtree over S ∪ Q.

The set {(S1, Q1), . . . , (Sm, Qm)} is called a decomposition of (S, Q).

Unfortunately, checking whether a decomposition exists is equivalent to the NP-complete exact set cover problem because of the requirement Qi ∩ Qj = ∅ in part 3 of Def. 8. This challenging issue was not addressed by [13], where the same algorithm was used. To keep complexity polynomial, we use a simple greedy approach: for every x ∈ Q, starting with an empty candidate decomposition D, add (Si, Qi) ∈ L to D if the last two properties of Def. 8 hold for (Si, Qi). If eventually Def. 8 holds, return the decomposition D; otherwise report that no decomposition exists. We call the resulting procedure FindConsistentTreeDPGreedy.
Proposition 9. For separator size k, the time complexity of FindConsistentTreeDPGreedy is O(n^{k+2}).

Combining Alg. 2 and FindConsistentTreeDPGreedy, we arrive at Alg. 3. The overall complexity of Alg. 3 is dominated by Alg. 2 and is equal to O(n^{2k+3} J^MI_{2k+2}).
In general, FindConsistentTreeDPGreedy may miss a junction tree that is consistent with the list of components L, but there is a class of distributions for which Alg. 3 is guaranteed to find a junction tree. Intuitively, we require that for every (Sij, V^i_ij) from an ε-junction tree (T, C), Alg.
2 adds all the components from the decomposition of (Sij, V^i_ij) to L and nothing else. This requirement is guaranteed for distributions where the variables inside every clique of the junction tree are sufficiently strongly interdependent (have a certain level of mutual information):
Lemma 10. If there exists an ε-JT (T, C) for P(V) s.t. no two edges of T have the same separator, and for every separator S and clique C ∈ C, min_{X⊂C-S} I(X, C-XS | S) > (k + 3)ε (we will call such a (T, C) (k + 3)ε-strongly connected), then Alg. 3, called with δ = ε, will output an nkε-JT for P(V).

4 Sample complexity
So far we have assumed that a mutual information oracle I(·, · | ·) exists for the distribution P(V) and can be efficiently queried. In real life, however, one only has data (i.e., samples from P(V)) to work with. However, we can get a probabilistic estimate of I(A, B | C) that has accuracy ±Δ with probability 1 − γ, using a number of samples and computation time polynomial in 1/Δ and log(1/γ):
Theorem 11. (Höffgen, [9]). The entropy of a probability distribution over 2k + 2 discrete variables with domain size R can be estimated with accuracy Δ with probability at least (1 − γ) using

    F(k, R, Δ, γ) ≡ O( (R^{4k+4} / Δ²) log²(R^{2k+2} / Δ²) log(R^{2k+2} / γ) )

samples from P and the same amount of time.

If we employ this oracle in our algorithms, the performance guarantee becomes probabilistic:
Theorem 12. If there exists a (k + 3)(ε + 2Δ)-strongly connected ε-junction tree for P(V), then Alg. 3, called with δ = ε + Δ and Î(·, · | ·) based on Thm.
11, using U ≡ F(k, R, Δ, γ/n^{2k+2}) samples and O(n^{2k+3} U) time, will find a kn(ε + 2Δ)-junction tree for P(V) with probability at least (1 − γ).

Finally, if P(V) is k-JT representable (i.e., ε = 0), and the corresponding junction tree is strongly connected, then we can let both Δ and γ go to zero and use Alg. 3 to find, with probability arbitrarily close to one, a junction tree that approximates P arbitrarily well in time polynomial in 1/Δ and log(1/γ), i.e., the class of strongly connected k-junction trees is probably approximately correctly learnable³.

³A class P of distributions is PAC learnable if for any P ∈ P, δ > 0, γ > 0, a learning algorithm will output P′ with KL(P, P′) < δ with probability 1 − γ in time polynomial in 1/δ and log(1/γ).

Corollary 13. If there exists an α-strongly connected junction tree for P(V) with α > 0, then for β < αn, Alg. 3 will learn a β-junction tree for P with probability at least 1 − γ using O( (n⁴/β²) log²(n/β) log(n/γ) ) samples from P(V) and O( (n^{2k+7}/β²) log²(n/β) log(n/γ) ) computation time.

5 Lazy evaluation of mutual information
Alg. 3 requires the value of the threshold δ as an input. To get tighter quality guarantees, we need to choose the smallest δ for which Alg. 3 finds a junction tree. A priori, this value is not known, so we need a procedure to choose the optimal δ. A natural way to select δ is binary search. For discrete random variables with domain size R, for any P(V), S, x it holds that I(x, V-Sx | S) ≤ log R, so for any δ > log R, Alg. 3 is guaranteed to find a junction tree (with all cliques connected to the same separator).
Thus, we can restrict the binary search to the range δ ∈ [0, log R].
In binary search, for every value of δ, Alg. 2 checks the result of Queyranne's algorithm minimizing min_{X⊂A} I(X, A-X | S) for every |S| = k, |A| ≤ k + 2, which amounts to O(n^{2k+2}) complexity per value of δ. It is possible, however, to find the optimal δ while checking min_{X⊂A} I(X, A-X | S) for every S and A only once over the course of the search process. Intuitively, think of the set of partitions QS in Alg. 2 as the set of connected components of a graph with variables as vertices, and a hyper-edge connecting all variables from A whenever min_{X⊂A} I(X, A-X | S) > δ. As δ increases, some of the hyper-edges disappear, and the number of connected components (or independent sets) may increase. More specifically, a graph QS is maintained for each separator S. For all S, A, add to QS a hyper-edge connecting all variables in A, annotated with strength_S(A) ≡ min_{X⊂A} I(X, A-X | S). Until FindConsistentTree(∪S QS) returns a tree, increase δ to min_{S,A: hyperedge_S(A) ∈ QS} strength_S(A) (i.e., the strength of the weakest remaining hyper-edge), and remove hyperedge_S(A) from QS. Fig. 2(a) shows an example evolution of Q_{x4} for k = 1.
To further save computation time, we exploit two observations. First, if A is a subset of a connected component Q ∈ QS, adding hyperedge_S(A) to QS will not change QS. Thus, we do not test any hyper-edge A which is contained in a connected component. However, as δ increases, a component may become disconnected because such an edge was not added. Therefore, we may have more components than we should (inducing incorrect independencies). This issue is addressed by our second insight: if we find a junction tree for a particular value of δ, we only need to recheck the components used in this tree.
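The weakest-hyper-edge schedule described above can be sketched with a priority queue. This is an illustrative sketch only: the precomputed (strength, edge) pairs and the tree_exists predicate standing in for FindConsistentTree are our own assumptions, and the further lazy re-checking optimizations are omitted.

```python
import heapq

def lazy_delta_search(hyperedges, tree_exists):
    """Find the smallest threshold delta at which a consistent tree exists.
    hyperedges: list of (strength, edge) pairs with strength = strength_S(A);
    tree_exists(active_edges) stands in for FindConsistentTree."""
    heap = list(hyperedges)
    heapq.heapify(heap)
    active = set(edge for _, edge in hyperedges)
    delta = 0.0
    while not tree_exists(active):
        if not heap:
            break  # every hyper-edge has been removed
        delta, edge = heapq.heappop(heap)  # weakest remaining hyper-edge
        active.discard(edge)
    return delta
```

Each hyper-edge is examined at most once over the whole search, matching the single-pass property of the lazy scheme, rather than once per candidate δ as in plain binary search.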
These insights lead to a simple, lazy procedure: if FindConsistentTree returns a tree (T, C), we check the hyper-edges that intersect the components used to form (T, C). If none of these edges are added, then we can return (T, C) for this value of δ. Otherwise, some of the QS have changed; we can iterate this procedure until we find a solution.

6 Evaluation
To evaluate our approach, we have applied it to two real-world datasets (sensor network temperature [8] and San Francisco Bay area traffic [11]) and one artificial dataset (samples from the ALARM Bayesian network [4]). Our implementation, called LPACJT, uses the lazy evaluations of I(·, · | ·) from Section 5. As baselines for comparison, we used a simple hill-climbing heuristic⁴, a combination of LPACJT with hill-climbing (where intermediate results returned by FindConsistentTree were used as starting points for hill-climbing), the Chow-Liu algorithm, and the algorithms of [10] (denoted Karger-Srebro) and [17] (denoted OBS). All experiments were run on a Pentium D 3.4 GHz, with runtimes capped at 10 hours. The necessary entropies were cached in advance.
ALARM. This discrete-valued data was sampled from a known Bayesian network with treewidth 4. We learned models with treewidth 3 because of computational concerns. Fig. 2(b) shows the per-point log-likelihood of the learned models on test data depending on the amount of training data. We see that on small training datasets LPACJT finds better models than a basic hill-climbing approach, but worse than the OBS of [17] and Chow-Liu. The implementation of OBS was the only one to use regularization, so this outcome can be expected. We can also conclude that on this dataset our approach overfits less than hill-climbing. For large enough training sets, LPACJT results achieve the likelihood of the true model, despite being limited to models with smaller treewidth. Chow-Liu performs much worse, since it is limited to models with treewidth 1.
Fig. 2(c) shows an example of a structure found by LPACJT for ALARM data. LPACJT only missed 3 edges of the true model.

⁴Hill-climbing had 2 kinds of moves available: replace variable x with variable y in a connected sub-junction tree, or replace a leaf clique Ci with another clique (Ci \ Sij) ∪ Smr connected to a separator Smr.

[Figure 2 panels: (a) example QS evolution; (b) ALARM log-likelihood; (c) ALARM structure; (d) TEMPERATURE log-likelihood; (e) TEMPERATURE sample run, 2K training points; (f) TRAFFIC log-likelihood.]

Figure 2: An example of the evolution of QS for Section 5 (2(a)), one structure learned by LPACJT (2(c)), experimental results (2(b), 2(d), 2(f)), and an example evolution of the test set likelihood of the best found model (2(e)). In 2(c), nodes denote variables, edges connect variables that belong to the same clique, green edges belong to both true and learned models, blue edges belong only to the learned model,
red - only to the true one.

TEMPERATURE. This data is from a 2-month deployment of 54 sensor nodes (15K datapoints) [8]. Each variable was discretized into 4 bins and we learned models of treewidth 2. Since the sensor locations have an ∞-like shape with two loops, the problem of learning a thin junction tree for this data is hard. In Fig. 2(d) one can see that LPACJT performs almost as well as the hill-climbing-based approaches and, on large training sets, much better than the Karger-Srebro algorithm. Again, as expected, LPACJT outperforms the Chow-Liu algorithm by a significant margin if there is enough data available, but overfits on the smallest training sets. Fig. 2(e) shows the evolution of the test set likelihood of the best (highest training set likelihood) structure identified by LPACJT over time. The first structure was identified within 5 minutes, and the final result within 1 hour.
TRAFFIC. This dataset contains traffic flow information measured every 5 minutes in 8K locations in California for 1 month [11]. We selected 32 locations in the San Francisco Bay area for the experiments, discretized traffic flow values into 4 bins, and learned models of treewidth 3. All non-regularized algorithms, including LPACJT, give results of essentially the same quality.

7 Relation to prior work and conclusions
For a brief overview of prior work, we refer the reader to Fig. 3. Most closely related to LPACJT are the factor graph learning of [1] and the limited-treewidth Markov net learning of [13, 10]. Unlike our approach, [1] does not guarantee low treewidth of the result, instead settling for compactness. [13, 10] do guarantee low treewidth. However, [10] only guarantees that the difference of the log-likelihood of the result from the fully independent model is within a constant factor of the corresponding difference for the most likely JT: LLH(optimal) − LLH(indep.) ≤ 8^k k!² (LLH(learned) − LLH(indep.)).
[13] has exponential complexity. Our approach has polynomial complexity and quality guarantees that hold for strongly connected k-JT ε-representable distributions, while those of [13] only hold for ε = 0.

We have presented the first truly polynomial algorithm for learning junction trees with limited treewidth. Based on a new upper bound for conditional mutual information that can be computed using polynomial time and number of samples, our algorithm is guaranteed to find a junction tree that is close in KL divergence to the true distribution, for strongly connected k-JT ε-representable distributions. As a special case of these guarantees, we show PAC-learnability of strongly connected k-JT representable distributions. We believe that the new theoretical insights herein provide a significant step in the understanding of structure learning in graphical models, and are useful for the analysis of other approaches to the problem. In addition to the theory, we have also demonstrated experimentally that these theoretical ideas are viable, and can, in the future, be used in the development of fast and effective structure learning heuristics.

approach   | model class  | guarantees   | true distribution | samples | time        | reference
-----------|--------------|--------------|-------------------|---------|-------------|-----------
score      | tractable    | local        | any               | any     | poly†       | [3, 5]
score      | tree         | global       | any               | any     | O(n^2)      | [6]
score      | tree mixture | local        | any               | any     | O(n^2)†     | [12]
score      | compact      | local        | any               | any     | poly†       | [17]
score      | all          | global       | any               | any     | exp         | [15]
score      | tractable    | const-factor | any               | any     | poly        | [10]
constraint | compact      | PAC◦         | any               | poly    | poly        | [1]
constraint | all          | global       | positive          | ∞       | poly(tests) | [16]
constraint | tractable    | PAC          | strong k-JT       | exp‡    | exp‡        | [13]
constraint | tractable    | PAC§         | strong k-JT       | poly    | poly        | this paper

Figure 3: Prior work.
The majority of the literature can be subdivided into score-based [3, 5, 6, 12, 15, 10] and constraint-based [13, 16, 1] approaches. The former try to maximize some target function, usually regularized likelihood, while the latter perform conditional independence tests and restrict the set of candidate structures to those consistent with the results of the tests. Tractable means that the result is guaranteed to be of limited treewidth; compact means that the graph has limited connectivity. The guarantees column shows whether the result is a local or global optimum, whether there are PAC guarantees, or whether the difference of the log-likelihood of the result from the fully independent model is within a constant factor of the difference for the most likely JT. The true distribution column shows for which class of distributions the guarantees hold. The † superscript denotes per-iteration complexity, poly denotes O(n^O(k)), and exp‡ denotes complexity that is exponential in general, but polynomial for special cases. PAC◦ and PAC§ denote PAC guarantees with (different) graceful degradation.

8 Acknowledgments
This work is supported in part by NSF grant IIS-0644225 and by the ONR under MURI N000140710747. C. Guestrin was also supported in part by an Alfred P. Sloan Fellowship. We thank Nathan Srebro for helpful discussions, and Josep Roure, Ajit Singh, CMU AUTON lab, Mark Teyssier, Daphne Koller, Percy Liang and Nathan Srebro for sharing their source code.

References

[1] P. Abbeel, D. Koller, and A. Y. Ng. Learning factor graphs in polynomial time and sample complexity. JMLR, 7, 2006.
[2] S. Arnborg, D. G. Corneil, and A. Proskurowski. Complexity of finding embeddings in a k-tree. SIAM Journal on Algebraic and Discrete Methods, 8(2):277–284, 1987.
[3] F. R. Bach and M. I. Jordan. Thin junction trees. In NIPS, 2002.
[4] I. Beinlich, J. Suermondt, M. Chavez, and G. Cooper. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks.
In Euro. Conf. on AI in Medicine, 1988.
[5] A. Choi, H. Chan, and A. Darwiche. On Bayesian network approximation by edge deletion. In UAI, 2005.
[6] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.
[7] R. G. Cowell, P. A. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems (Information Science and Statistics). Springer, May 2003.
[8] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, 2004.
[9] K. U. Höffgen. Learning and robust learning of product distributions. In COLT, 1993.
[10] D. Karger and N. Srebro. Learning Markov networks: Maximum bounded tree-width graphs. In SODA, 2001.
[11] A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In UAI, 2005.
[12] M. Meilă and M. I. Jordan. Learning with mixtures of trees. JMLR, 1:1–48, 2001.
[13] M. Narasimhan and J. Bilmes. PAC-learning bounded tree-width graphical models. In UAI, 2004.
[14] M. Queyranne. Minimizing symmetric submodular functions. Math. Programming, 82(1):3–12, 1998.
[15] A. Singh and A. Moore. Finding optimal Bayesian networks by dynamic programming. Technical Report CMU-CALD-05-106, Carnegie Mellon University, Center for Automated Learning and Discovery, 2005.
[16] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2001.
[17] M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In UAI, 2005.