{"title": "Generalizing Tree Probability Estimation via Bayesian Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1444, "page_last": 1453, "abstract": "Probability estimation is one of the fundamental tasks in statistics and machine learning. However, standard methods for probability estimation on discrete objects do not handle object structure in a satisfactory manner. In this paper, we derive a general Bayesian network formulation for probability estimation on leaf-labeled trees that enables flexible approximations which can generalize beyond observations. We show that efficient algorithms for learning Bayesian networks can be easily extended to probability estimation on this challenging structured space. Experiments on both synthetic and real data show that our methods greatly outperform the current practice of using the empirical distribution, as well as a previous effort for probability estimation on trees.", "full_text": "Generalizing Tree Probability Estimation via\n\nBayesian Networks\n\nCheng Zhang\n\nFrederick A. Matsen IV\n\nComputational Biology Program\n\nFred Hutchinson Cancer Research Center\n\nComputational Biology Program\n\nFred Hutchinson Cancer Research Center\n\nSeattle, WA 98109\n\nchengz23@fredhutch.org\n\nSeattle, WA 98109\n\nmatsen@fredhutch.org\n\nAbstract\n\nProbability estimation is one of the fundamental tasks in statistics and machine\nlearning. However, standard methods for probability estimation on discrete objects\ndo not handle object structure in a satisfactory manner. In this paper, we derive a\ngeneral Bayesian network formulation for probability estimation on leaf-labeled\ntrees that enables \ufb02exible approximations which can generalize beyond observa-\ntions. We show that ef\ufb01cient algorithms for learning Bayesian networks can be\neasily extended to probability estimation on this challenging structured space. 
Ex-\nperiments on both synthetic and real data show that our methods greatly outperform\nthe current practice of using the empirical distribution, as well as a previous effort\nfor probability estimation on trees.\n\nIntroduction\n\n1\nLeaf-labeled trees, where labels are associated with the observed variables, are extensively used in\nprobabilistic graphical models. A typical example is the phylogenetic leaf-labeled tree, which is\nthe fundamental structure for modeling the evolutionary history of a family of genes [Felsenstein,\n2003, Friedman et al., 2002]. Inferring a phylogenetic tree based on a set of DNA sequences under a\nprobabilistic model of nucleotide substitutions has been one of the central problems in computational\nbiology, with a wide range of applications from genomic epidemiology [Neher and Bedford, 2015]\nto conservation genetics [DeSalle and Amato, 2004]. To account for the phylogenetic uncertainty,\nBayesian approaches are adopted [Huelsenbeck et al., 2001] and Markov chain Monte Carlo (MCMC)\n[Yang and Rannala, 1997, Mau et al., 1999, Huelsenbeck and Ronquist, 2001] is commonly used to\nsample from the posterior of phylogenetic trees. Posterior probabilities of phylogenetic trees are then\ntypically estimated with simple sample relative frequencies (SRF), based on those MCMC samples.\nWhile classical, this empirical approach is unsatisfactory for tree posterior estimation due to the\ncombinatorially exploding size of tree space. Speci\ufb01cally, SRF does not support trees beyond\nobserved samples (i.e., simply sets the probabilities of unsampled trees to zero), and is prone to\nunstable estimates for low-probability trees. As a result, reliable estimations using SRF usually\nrequire impractically large sample sizes. 
Previous work [H\u00f6hna and Drummond, 2012, Larget,\n2013] attempted to remedy these problems by harnessing the similarity among trees and proposed\nseveral probability estimators using MCMC samples based on conditional independence of separated\nsubtrees. Although these estimators do extend to unsampled trees, the conditional independence\nassumption therein is often too strong to provide accurate approximations for posteriors inferred from\nreal data [Whidden and Matsen, 2015].\nIn this paper, we present a general framework for tree probability estimation given a collection of\ntrees (e.g., MCMC samples) by introducing a novel structure called subsplit Bayesian networks\n(SBNs). This structure provides rich distributions over the entire tree space and hence differs from\nexisting applications of Bayesian networks in phylogenetic inference [e.g. Strimmer and Moulton,\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f2000, H\u00f6hna et al., 2014] to compute tree likelihood. Moreover, SBNs relax the conditional clade\nindependence assumption and allow easy adjustment for a variety of \ufb02exible dependence structures\nbetween subtrees. They also allow many ef\ufb01cient learning algorithms for Bayesian networks to be\neasily extended to tree probability estimation. Inspired by weight sharing used in convolutional neural\nnetworks [LeCun et al., 1998], we propose conditional probability sharing for learning SBNs, which\ngreatly reduces the number of free parameters by exploiting the similarity of local structures among\ntrees. Although initially proposed for rooted trees, we show that SBNs can be naturally generalized\nto unrooted trees, which leads to a missing data problem that can be ef\ufb01ciently solved through\nexpectation maximization. 
Finally, we demonstrate that SBN estimators greatly outperform other tree probability estimators on both synthetic data and a benchmark of challenging phylogenetic posterior estimation problems. The SBN framework also works for general leaf-labeled trees; however, for ease of presentation, we restrict to leaf-labeled bifurcating trees in this paper.

2 Background

A leaf-labeled bifurcating tree is a binary tree (rooted or unrooted) with labeled leaves (e.g., leaf nodes associated with observed variables or a set of labels); we refer to it as a tree for short. Recently, several probability estimators on tree spaces have been proposed that exploit the similarity of clades, a local structure of trees, to generalize beyond observed trees. Let X = {O1, . . . , ON} be a set of N labeled leaves. A clade X of X is a nonempty subset of X. Given a rooted tree T on X, one can find its unique clade decomposition as follows. Start from the root, which has a trivial clade C1 that contains all the leaves. This clade first splits into two subclades C2, C3. The splitting process continues recursively onto each successive subclade until there are no subclades to split. Finally, we obtain a collection of nontrivial clades TC. For the tree in Figure 1, TC = {C2, C3, C4, C5, C6, C7}. This way, T is represented uniquely as a set of clades TC. Therefore, distributions over the tree space can be specified as distributions over the space of sets of clades. Again, for Figure 1:

p(T) = p(TC) = p(C2, C3, C4, C5, C6, C7)    (1)

Figure 1: A rooted tree and its clade decomposition. Each clade corresponds to the set of offspring leaves (e.g., C2 = {O1, O2, O3, O4, O5} and C3 = {O6, O7, O8}).

The clade decomposition representation enables distributions that reflect the similarity of trees through the local clade structure. 
However, a full parameterization of this approach over all rooted trees on X using rules of conditional probability is intractable even for a moderate N. Larget [2013], building on work of Höhna and Drummond [2012], introduced the Conditional Clade Distribution (CCD), which assumes that given the existence of an edge in a tree, clades that further refine opposite sides of the edge are independent (see the Supplementary Material (SM) for a more detailed discussion). CCD greatly reduces the number of parameters. For example, (1) has the following CCD approximation

pccd(T) = p(C2, C3) p(C4, C5|C2) p(C6|C5) p(C7|C3)

However, CCD also introduces strong bias, which makes it insufficient to capture the complexity of inferred posterior distributions on real data (see Figure 5). In particular, certain clades may depend on their sisters. This motivates a more flexible set of approximate distributions.

3 A Subsplit Bayesian Network Formulation

In addition to the clade decomposition representation, a rooted tree T can also be uniquely represented as a set of subsplits. Let ≻ be a total order on clades (e.g., lexicographical order). A subsplit (Y, Z) of a clade X is an ordered pair of disjoint subclades of X such that Y ∪ Z = X and Y ≻ Z. For example, the tree in Figure 1 corresponds to the following set of nontrivial subsplits

TS = {(C2, C3), (C4, C5), ({O3}, C6), (C7, {O8})}

with lexicographical order on clades. 
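To make the subsplit decomposition concrete, here is a minimal sketch in Python (not the paper's code; the nested-tuple tree representation, the function names, and the use of lexicographic order on sorted leaf tuples for ≻ are all our illustrative assumptions):

```python
# Toy illustration: a rooted tree as nested tuples of leaf names; clades are
# frozensets of leaves.  Lexicographic order on sorted leaf tuples stands in
# for the total order on clades used to orient each subsplit (Y, Z).

def clade(tree):
    """Set of leaves below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return clade(left) | clade(right)

def subsplits(tree):
    """Collect the nontrivial subsplits (Y, Z) of a rooted tree, Y before Z
    under the chosen total order."""
    out = []
    def visit(node):
        if isinstance(node, str):
            return
        y, z = clade(node[0]), clade(node[1])
        if tuple(sorted(y)) < tuple(sorted(z)):
            y, z = z, y  # orient the pair by the total order
        out.append((y, z))
        visit(node[0])
        visit(node[1])
    visit(tree)
    return out

tree = (("A", "B"), ("C", "D"))
S = subsplits(tree)  # root subsplit first, then subsplits of each subclade
```

A tree with interior structure beyond cherries would also produce subsplits with singleton clades on one side, as in the TS example above.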
Moreover, this set-of-subsplits representation of trees inspires a natural probabilistic Bayesian network formulation as follows (Figure 2):

Figure 2: The subsplit Bayesian network formulation. Left: a general Bayes net for rooted trees. Each node represents a subsplit-valued or singleton-clade-valued random variable; the solid full and complete binary tree network is B∗X. Middle/Right: examples of rooted trees with 4 leaves. Note that the solid dark nets that represent the true splitting processes of the trees may have dynamical structures. By allowing singleton clades to continue to split (the dashed gray nets) until depth 3, both nets grow into the full and complete binary tree of depth 3.

Definition 1. A subsplit Bayesian network (SBN) BX on a leaf set X of size N is a Bayesian network whose nodes take on subsplit values or singleton clade values of X and that: i) has depth N − 1 (the root counts as depth 1); ii) has a root node that takes on subsplits of the entire leaf set X; iii) contains a full and complete binary tree network B∗X as a subnetwork.

Note that B∗X itself is an SBN and is contained in all SBNs; therefore, B∗X is the minimum SBN on X. Moreover, B∗X induces a natural indexing procedure for the nodes of all SBNs on X: starting from the root node, which is denoted as S1, for any i, we denote the two children of Si as S2i and S2i+1, respectively, until a leaf node is reached. We call the parent nodes in B∗X the natural parents.

Definition 2. We say a subsplit (Y, Z) is compatible with a clade X if Y ∪ Z = X. 
Moreover, a singleton clade {W} is said to be compatible with itself. With natural indexing, we say a full SBN assignment {Si = si, i ≥ 1} is compatible if for any interior node assignment si = (Yi, Zi) (or {Wi}), the child assignments s2i, s2i+1 are compatible with Yi, Zi (or {Wi}), respectively. Consider a parent-child pair in an SBN, Si and Sπi, where πi denotes the index set of the parent nodes of Si. We say an assignment Si = si, Sπi = sπi is compatible if it can be extended to a compatible assignment of the SBN.

Lemma 1. Given an SBN BX, each rooted tree T on X can be uniquely represented as a compatible assignment of BX.

A proof of Lemma 1 is provided in the SM. Unlike in phylogenies, nodes in SBNs take on subsplit (or singleton clade) values that represent the local topological structure of trees. By including the true splitting processes (e.g., TS) of the trees while allowing singleton clades to continue to split ("fake" splits) until the whole network reaches depth N − 1 (see Figure 2), each SBN on X has a fixed structure which contains the full and complete binary tree as a subnetwork. Note that those fake splits are all deterministically assigned, which means the corresponding conditional probabilities are all one. Therefore, the estimated probabilities of rooted trees only depend on their true splitting processes. With SBNs, we can easily construct a family of flexible approximate distributions on the tree space. For example, using the minimum SBN, (1) can be estimated as

p(C2, C3) p(C4, C5|C2, C3) p(C6|C4, C5) p(C7|C2, C3)

This approximation implicitly assumes that given the existence of a subsplit, subsplits that further refine opposite sides of this subsplit are independent. 
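The minimum-SBN estimate above is simply a product of conditional probability lookups. A minimal sketch, assuming a toy representation where subsplits are strings and CPDs are dictionaries (all labels and probability values below are made up for illustration):

```python
# Toy sketch: evaluate an SBN tree probability as a product of conditional
# probabilities, e.g. p(C2,C3) p(C4,C5|C2,C3) p(C6|C4,C5) p(C7|C2,C3).

def sbn_tree_prob(root_split, pairs, root_probs, cpds):
    """pairs: (child_subsplit, parent_subsplit) assignments of the tree."""
    prob = root_probs.get(root_split, 0.0)
    for child, parent in pairs:
        prob *= cpds.get((child, parent), 0.0)  # unseen pairs get probability 0
    return prob

# Hypothetical CPD tables for the Figure 1 tree; values are invented.
root_probs = {"C2,C3": 0.8}
cpds = {("C4,C5", "C2,C3"): 0.5, ("C6", "C4,C5"): 1.0, ("C7", "C2,C3"): 0.25}
p = sbn_tree_prob("C2,C3",
                  [("C4,C5", "C2,C3"), ("C6", "C4,C5"), ("C7", "C2,C3")],
                  root_probs, cpds)
# p = 0.8 * 0.5 * 1.0 * 0.25 = 0.1
```

Defaulting missing entries to zero mirrors the consistency requirement of Definition 3: incompatible assignments receive probability zero.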
Note that CCD can be viewed as a simplification where the conditional probabilities are further approximated as follows:

p(C4, C5|C2, C3) ≈ p(C4, C5|C2),  p(C6|C4, C5) ≈ p(C6|C5),  p(C7|C2, C3) ≈ p(C7|C3)

By including the sister clades in the conditional subsplit probabilities, SBNs relax the conditional clade independence assumption made in CCD and allow for more flexible dependence structures between local components (e.g., subsplits in sister clades). Moreover, one can add more complicated dependencies between nodes (e.g., dashed arrows in Figure 2(a)) and hence easily adjust the SBN formulation to provide a wide range of flexible approximate distributions. For general SBNs, the estimated tree probabilities take the following form:

psbn(T) = p(S1) ∏i>1 p(Si|Sπi).    (2)

In addition to this superior flexibility, another benefit of the SBN formulation is that these approximate distributions are all naturally normalized if the conditional probability distributions (CPDs) are consistent, as defined next:

Definition 3. We say the conditional probability p(Si|Sπi) is consistent if p(Si = si|Sπi = sπi) = 0 for any incompatible assignment Si = si, Sπi = sπi.

Proposition 1. If p(Si|Sπi) is consistent for all i > 1, then ∑T psbn(T) = 1.

With Lemma 1, the proof is standard and is given in the SM. Furthermore, the SBN formulation also allows us to easily extend many efficient algorithms for learning Bayesian networks to SBNs for tree probability estimation, as we see next.

4 Learning Subsplit Bayesian Networks

4.1 Rooted Trees

Suppose we have a sample of rooted trees D = {Tk} (e.g., from a phylogenetic MCMC run given DNA sequences). As before, each sampled tree can be represented as a collection of subsplits Tk = {Si = si,k, i ≥ 1}, k = 1, . . .
, K, and therefore has the following SBN likelihood:

L(Tk) = p(S1 = s1,k) ∏i>1 p(Si = si,k|Sπi = sπi,k).

Maximum Likelihood. In this complete data scenario, we can simply use maximum likelihood to learn the parameters of SBNs. Denote the set of all observed subsplits of node Si as Ci, i ≥ 1, and the set of all observed subsplits of the parent nodes of Si as Cπi, i > 1. Assuming that trees are independently sampled, the complete data log-likelihood is

log L(D) = ∑k=1,...,K ( log p(S1 = s1,k) + ∑i>1 log p(Si = si,k|Sπi = sπi,k) )
         = ∑s1∈C1 ms1 log p(S1 = s1) + ∑i>1 ∑si∈Ci, ti∈Cπi msi,ti log p(Si = si|Sπi = ti)    (3)

where ms1 = ∑k=1,...,K I(s1,k = s1) and msi,ti = ∑k=1,...,K I(si,k = si, sπi,k = ti), i > 1, are the frequency counts of the root splits and the parent-child subsplit pairs for each interior node, respectively, and I(·) is the indicator function. The maximum likelihood estimates of the CPDs have the following simple closed-form expressions in terms of relative frequencies:

p̂ML(S1 = s1) = ms1 / ∑s∈C1 ms = ms1 / K,    p̂ML(Si = si|Sπi = ti) = msi,ti / ∑s∈Ci ms,ti.

Conditional Probability Sharing. We can use the similarity of local structures to further reduce the number of SBN parameters and achieve better generalization, similar to weight sharing for convolutional nets. Indeed, different trees do share many local structures, such as subsplits and clades. As a result, the representations of the trees in an SBN can have the same parent-child subsplit pairs, taken by different nodes (see Figure D.1 in the SM). Instead of assigning independent sets of parameters for those pairs at different locations, we can use one set of parameters for each of those shared pairs, regardless of their locations in SBNs. 
We call this specific setting of parameters in SBNs conditional probability sharing (see more on parameter sharing in the SM). Compared to standard Bayes nets, this index-free parameterization only needs CPDs for each observed parent-child subsplit pair, dramatically reducing the number of parameters in the model.

Now denote the set of all observed splits of S1 as Cr, and the set of all observed parent-child subsplit pairs as Cch|pa. The log-likelihood log L(D) in (3) can be rewritten as

log L(D) = ∑s1∈Cr ms1 log p(S1 = s1) + ∑s|t∈Cch|pa ms,t log p(s|t)

where ms,t = ∑k=1,...,K ∑i>1 I(si,k = s, sπi,k = t) is the frequency count of the corresponding subsplit pair s|t ∈ Cch|pa. Similarly, we have the maximum likelihood estimates of the CPDs for those parent-child subsplit pairs:

p̂ML(s|t) = ms,t / ∑s ms,t,    s|t ∈ Cch|pa.    (4)

Figure 3: SBNs for unrooted trees. Left: a simple four-taxon unrooted tree example. It has five edges, 1, 2, 3, 4, 5, that can be rooted on to make (compatible) rooted trees. Middle (left): two rooted trees when rooting on edges 1 and 3. Middle (right): the corresponding SBN representations for the two rooted trees. Right: an integrated SBN for the unrooted tree with unobserved root node S1.

The main computation is devoted to the collection of the frequency counts ms1 and ms,t, which requires iterating over all sampled trees and, for each tree, looping over all the edges. 
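The shared-count ML estimates in (4) can be sketched as follows, under an assumed summary of each sampled tree as its root split plus its list of parent-child subsplit pairs (the representation and identifiers are ours, not the paper's):

```python
from collections import Counter

# Sketch of the maximum likelihood estimates with conditional probability
# sharing: counts are pooled over all tree locations, indexed only by the
# (child | parent) subsplit pair, as in Eq. (4).

def ml_estimates(trees):
    """trees: list of (root_split, [(child, parent), ...]) summaries."""
    root_counts, pair_counts, parent_totals = Counter(), Counter(), Counter()
    for root_split, pairs in trees:
        root_counts[root_split] += 1
        for child, parent in pairs:
            pair_counts[(child, parent)] += 1
            parent_totals[parent] += 1
    K = len(trees)
    root_probs = {s: m / K for s, m in root_counts.items()}
    cpds = {(c, t): m / parent_totals[t] for (c, t), m in pair_counts.items()}
    return root_probs, cpds

# Three hypothetical sampled trees on {A, B, C, D}; labels are invented.
trees = [("AB|CD", [("A|B", "AB|CD")]),
         ("AB|CD", [("A|B", "AB|CD")]),
         ("ABC|D", [("A|BC", "ABC|D")])]
root_probs, cpds = ml_estimates(trees)
```

One pass over the trees, touching every edge once, gives the O(KN) counting cost noted above.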
Thus the overall computational complexity is O(KN).

4.2 Unrooted Trees

Unrooted trees are commonly used to express undirected relationships between observed variables, and are the most common tree type in phylogenetics. The SBN framework can be easily generalized to unrooted trees because each unrooted tree can be transformed into a rooted tree by placing the root on one of its edges. Since there are multiple possible root placements, each unrooted tree has multiple representations in terms of SBN assignments for the corresponding rooted trees. Unrooted trees, therefore, can be represented using the same SBNs as for rooted trees, with the root node S1 being unobserved¹ (Figure 3). Marginalizing out the unobserved node S1, we obtain SBN probability estimates for unrooted trees (denoted by T u in the sequel):

psbn(T u) = ∑S1∼T u p(S1) ∏i>1 p(Si|Sπi)    (5)

where ∼ means all root splits that are compatible with T u. Similar to the SBN approximations for rooted trees, (5) provides a natural probability distribution over unrooted trees (see a proof in the SM).

Proposition 2. Suppose that the conditional probability distributions p(Si|Sπi), i > 1, are consistent; then (5) is a probability distribution over unrooted trees with leaf set X, that is, ∑T u psbn(T u) = 1.

As before, assume that we have a sample of unrooted trees Du = {T u_k}, k = 1, . . . , K. Each pair of an unrooted tree and a rooting edge corresponds to a rooted tree that can be represented as (T u_k, e) = {Si = s^e_i,k, i ≥ 1}, e ∈ E(T u_k), 1 ≤ k ≤ K, where E(T u_k) denotes the edges of T u_k and s^e_i,k denotes all the resulting subsplits when T u_k is rooted on edge e. 
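A minimal sketch of the marginalization in (5), assuming the rooted-tree summaries for each possible rooting edge have already been computed (the representation and names are ours; only two rootings are shown for brevity, whereas a real N-leaf unrooted tree has 2N − 3):

```python
# Sketch of Eq. (5): an unrooted tree's SBN probability sums the rooted-tree
# probability over every edge it can be rooted on.

def p_sbn_rooted(root_split, pairs, root_probs, cpds):
    prob = root_probs.get(root_split, 0.0)
    for child, parent in pairs:
        prob *= cpds.get((child, parent), 0.0)
    return prob

def p_sbn_unrooted(rootings, root_probs, cpds):
    """rootings: one (root_split, pairs) summary per edge of the unrooted tree."""
    return sum(p_sbn_rooted(s1, pairs, root_probs, cpds)
               for s1, pairs in rootings)

# Hypothetical CPDs covering two rootings of the same unrooted tree.
root_probs = {"AB|CD": 0.5, "A|BCD": 0.5}
cpds = {("A|B", "AB|CD"): 1.0, ("C|D", "AB|CD"): 1.0,
        ("B|CD", "A|BCD"): 1.0, ("C|D", "B|CD"): 1.0}
rootings = [("AB|CD", [("A|B", "AB|CD"), ("C|D", "AB|CD")]),
            ("A|BCD", [("B|CD", "A|BCD"), ("C|D", "B|CD")])]
p = p_sbn_unrooted(rootings, root_probs, cpds)
```

Rootings whose root split has probability zero simply contribute nothing to the sum, matching the compatibility condition S1 ∼ T u.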
The SBN likelihood for the unrooted tree T u_k is

L(T u_k) = ∑e∈E(T u_k) p(S1 = s^e_1,k) ∏i>1 p(Si = s^e_i,k|Sπi = s^e_πi,k).

The lost information on the root node S1 means the SBN likelihood for unrooted trees can no longer be factorized. We therefore propose the following two algorithms to handle this challenge.

Maximum Lower Bound Estimates. A simple strategy is to construct tractable lower bounds via variational approximations [Wainwright and Jordan, 2008]:

LBq(T u) := ∑S1∼T u q(S1) log( p(S1) ∏i>1 p(Si|Sπi) / q(S1) ) ≤ log L(T u)    (6)

where q is a probability distribution on S1 ∼ T u. In particular, taking q to be uniform on the 2N − 3 tree edges, together with conditional probability sharing, gives the simple average (SA) lower bound of the data log-likelihood

LBSA(Du) := ( ∑s1∈Cr mu_s1 log p(S1 = s1) + ∑s|t∈Cch|pa mu_s,t log p(s|t) ) + K log(2N − 3)

where

mu_s1 = ∑k=1,...,K ∑e∈E(T u_k) (1/(2N − 3)) I(s^e_1,k = s1),    mu_s,t = ∑k=1,...,K ∑e∈E(T u_k) ∑i>1 (1/(2N − 3)) I(s^e_i,k = s, s^e_πi,k = t).

The maximum SA lower bound estimates are then

p̂SA(S1 = s1) = mu_s1 / ∑s∈Cr mu_s = mu_s1 / K,  s1 ∈ Cr,    p̂SA(s|t) = mu_s,t / ∑s mu_s,t,  s|t ∈ Cch|pa.

¹ The subsplits S2, S3, . . . are well defined once the split in S1 (or equivalently the root) is given.

Algorithm 1 Expectation Maximization for SBN
Input: data Du = {T u_k}, k = 1, . . . , K; regularization coefficient α.
Initialize p̂EM,(0) (e.g., via p̂SA) and n = 0. Set equivalent counts m̃u_s1, m̃u_s,t for regularization.
repeat
  E-step. For all 1 ≤ k ≤ K, compute
    q(n)_k(S1) = p(T u_k, S1|p̂EM,(n)) / ∑S1∼T u_k p(T u_k, S1|p̂EM,(n)).
  M-step. Compute the expected frequency counts with conditional probability sharing,
    mu,(n)_s1 = ∑k=1,...,K ∑e∈E(T u_k) q(n)_k(S1 = s1) I(s^e_1,k = s1),
    mu,(n)_s,t = ∑k=1,...,K ∑e∈E(T u_k) q(n)_k(S1 = s^e_1,k) ∑i>1 I(s^e_i,k = s, s^e_πi,k = t),
  and update the CPDs by maximizing the regularized Q score:
    p̂EM,(n+1)(S1 = s1) = (mu,(n)_s1 + α m̃u_s1) / (K + α ∑s1∈Cr m̃u_s1),  s1 ∈ Cr,
    p̂EM,(n+1)(s|t) = (mu,(n)_s,t + α m̃u_s,t) / ∑s (mu,(n)_s,t + α m̃u_s,t),  s|t ∈ Cch|pa.
  n ← n + 1
until convergence.

Expectation Maximization. The maximum lower bound approximations can be improved upon by adapting the variational distribution q, instead of using a fixed one. This, together with conditional probability sharing, leads to an extension of the expectation maximization (EM) algorithm for learning SBNs, which also allows us to use a Bayesian formulation. Specifically, at the E-step of the n-th iteration, an adaptive lower bound is constructed through (6) using the conditional probabilities of the missing root node given p̂EM,(n), the current estimate of the CPDs. 
That is,

q(n)_k(S1) = p(S1|T u_k, p̂EM,(n)),  k = 1, . . . , K.

The lower bound contains a constant term that only depends on the current estimates, and a score function for the CPDs p,

Q(n)(Du; p) = ∑k=1,...,K Q(n)(T u_k; p) = ∑k=1,...,K ∑S1∼T u_k q(n)_k(S1) ( log p(S1) + ∑i>1 log p(Si|Sπi) )

which is then optimized at the M-step. This variational perspective of the EM algorithm was found and discussed by Neal and Hinton [1998]. The following theorem guarantees that maximizing (or improving) the Q score is sufficient to improve the objective likelihood.

Theorem 1. Let T u be an unrooted tree. For all p,

Q(n)(T u; p) − Q(n)(T u; p̂EM,(n)) ≤ log L(T u; p) − log L(T u; p̂EM,(n)).

When data is insufficient or the number of parameters is large, the EM approach also easily incorporates regularization [Dempster et al., 1977]. Taking conjugate Dirichlet priors [Buntine, 1991], the regularized score function is

Q(n)(Du; p) + α ∑s1∈Cr m̃u_s1 log p(S1 = s1) + α ∑s|t∈Cch|pa m̃u_s,t log p(s|t)

where m̃u_s1, m̃u_s,t are the equivalent sample counts and α is the global regularization coefficient. We then simply maximize the regularized score in the same manner at the M-step. Similarly, this guarantees that the regularized log-likelihood is increasing at each iteration. We summarize the EM approach in Algorithm 1.

Figure 4: Performance on a challenging tree probability estimation problem with simulated data. Left: the KL divergence of CCD and sbn-em estimates over a wide range of degrees of diffusion β and sample sizes K. Right: a comparison among different methods for a fixed K, as a function of β, and for a fixed β, as a function of K. Error bars show one standard deviation over 10 runs.
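One EM iteration can be sketched as follows for the root-split CPDs only (a toy version under our own representation, not the paper's code: each tree is summarized by, for every rooting, its root split and the conditional likelihood of the rest of the tree given that root; α and the equivalent counts follow the regularized M-step):

```python
# Toy sketch of one EM iteration.
# E-step: q_k(S1) is proportional to the joint probability of each rooting.
# M-step: expected counts, optionally blended with equivalent counts (alpha),
# are renormalized into new root-split probability estimates.

def em_step(data, root_probs, alpha=0.0, prior_counts=None):
    """data: list of trees, each a list of (root_split, likelihood_given_root)."""
    prior_counts = prior_counts or {}
    expected = {}
    for rootings in data:
        joint = {s1: root_probs.get(s1, 0.0) * lik for s1, lik in rootings}
        total = sum(joint.values())
        for s1, w in joint.items():
            expected[s1] = expected.get(s1, 0.0) + (w / total if total > 0 else 0.0)
    # regularized M-step update for the root-split probabilities
    num = {s1: m + alpha * prior_counts.get(s1, 0.0) for s1, m in expected.items()}
    Z = sum(num.values())
    return {s1: v / Z for s1, v in num.items()}

# Two hypothetical unrooted trees, each with two possible root splits.
data = [[("AB|CD", 0.9), ("A|BCD", 0.1)],
        [("AB|CD", 0.6), ("A|BCD", 0.4)]]
new_root_probs = em_step(data, {"AB|CD": 0.5, "A|BCD": 0.5})
```

The same E-step weights would also be spread over the parent-child pair counts to update the conditional CPDs; we omit that bookkeeping to keep the sketch short.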
The computational complexities of maximum lower bound estimate and\neach EM iteration are both O(KN ), the same as CCD and SBNs for rooted trees. See more detailed\nderivation and proofs in the SM. In practice, EM usually takes several iterations to converge and\nhence could be more expensive than other methods. However, the gain in approximation makes it a\nworthwhile trade-off (Table 1). We use the maximum SA lower bound algorithm (sbn-sa), the EM\nalgorithm (sbn-em) and EM with regularization (sbn-em-\u03b1) in the experiment section.\n\n5 Experiments\nWe compare sbn-sa, sbn-em, sbn-em-\u03b1 to the classical sample relative frequency (SRF) method\nand CCD on a synthetic data set and on estimating phylogenetic tree posteriors for a number of\nreal data sets. For all SBN algorithms, we use the simplest SBN, B\u2217\nX , which we \ufb01nd provide\nsuf\ufb01ciently accurate approximation in the tree probability estimation tasks investigated in our ex-\nperiments. For sbn-em-\u03b1, we use the sample frequency counts of the root splits and parent-child\nsubsplit pairs as the equivalent sample counts (see Algorithm 1). The code is made available at\nhttps://github.com/zcrabbit/sbn.\nSimulated Scenarios To empirically explore the behavior of SBN algorithms relative to SRF and\nCCD, we \ufb01rst conduct experiments on a simulated setup. We choose a tractable but challenging\ntree space, the space of unrooted trees with 8 leaves, which contains 10395 unique trees. The trees\nare given an arbitrary order. To test the approximation performance on targets of different degrees\nof diffusion, we generate target distributions by drawing samples from the Dirichlet distributions\nDir(\u03b21) of order 10395 with a variety of \u03b2s. The target distribution becomes more diffuse as \u03b2\nincreases. Simulated data sets are then obtained by sampling from the unrooted tree space according\nto these target distributions with different sample sizes K. 
The resulting probability estimation is challenging in that the target probabilities of the trees are assigned regardless of the similarity among them. For sbn-em-α, we adjust the regularization coefficient using α = 50/K for different sample sizes. Since the target distributions are known, we use KL divergence from the estimated distributions to the target distributions to measure the approximation accuracy of the different methods. We vary β and K to control the difficulty of the learning task, and average over 10 independent runs for each configuration. Figure 4 shows the empirical approximation performance of the different methods. We see that the learning rate of CCD slows down very quickly as the data size increases, implying that the conditional clade independence assumption could be too strong to provide flexible approximations. On the other hand, sbn-em keeps learning efficiently from the data when more samples are available. While all methods tend to perform worse as β increases and better as K increases, SBN algorithms perform consistently much better than CCD. Compared to sbn-sa, sbn-em usually greatly improves the approximation, at the price of additional computation. When the degree of diffusion is large or the sample size is small, sbn-em-α gives much better performance than the others, showing that regularization indeed improves generalization. See the SM for a runtime comparison of the different methods with varying K and β.

Real Data Phylogenetic Posterior Estimation. We now investigate the performance on large unrooted tree space posterior estimation using 8 real datasets commonly used to benchmark phylogenetic MCMC methods [Lakner et al., 2008, Höhna and Drummond, 2012, Larget, 2013, Whidden and Matsen, 2015] (Table 1). 
For each of these data sets, 10 single-chain MrBayes [Ronquist et al., 2012] replicates are run for one billion iterations and sampled every 1000 iterations, using the simple Jukes and Cantor [1969] substitution model. We discard the first 25% as burn-in for a total of 7.5 million posterior samples per data set. These extremely long "golden runs" form the ground truth to which we will compare various posterior estimates based on standard runs. For these standard runs, we run MrBayes on each data set with 10 replicates of 4 chains and 8 runs until the runs have ASDSF (the standard convergence criterion used in MrBayes) less than 0.01 or a maximum of 100 million iterations. This conservative setting has been shown to find all posterior modes on these data sets [Whidden and Matsen, 2015]. We collect the posterior samples every 100 iterations of these runs and discard the first 25% as burn-in. We apply the SBN algorithms, SRF, and CCD to the posterior samples in each of the 10 replicates for each data set. For sbn-em-α, we use α = 0.0001 to give some weak regularization². We use KL divergence to the ground truth to measure the performance of all methods.

Figure 5: Comparison on DS1, a data set with multiple posterior modes. Left/Middle: ground truth posterior probabilities vs CCD and sbn-em estimates. Right: approximation performance as a function of sample size. One standard deviation error bar over 10 replicates.

Table 1: Data sets used for phylogenetic posterior estimation, and approximation accuracy results of the different methods across datasets. The Sampled Trees column shows the number of unique trees in the standard run samples. The results are averaged over 10 replicates. The last five columns give KL divergence to ground truth.

DATA SET  REFERENCE                    (#TAXA, #SITES)  TREE SPACE SIZE  SAMPLED TREES   SRF      CCD     SBN-SA  SBN-EM  SBN-EM-α
DS1       Hedges et al. [1990]         (27, 1949)       5.84×10^32       1228            0.0155   0.6027  0.0687  0.0130  0.0136
DS2       Garey et al. [1996]          (29, 2520)       1.58×10^35       7               0.0122   0.0218  0.0218  0.0128  0.0199
DS3       Yang and Yoder [2003]        (36, 1812)       4.89×10^47       43              0.3539   0.2074  0.1152  0.0882  0.1243
DS4       Henk et al. [2003]           (41, 1137)       1.01×10^57       828             0.5322   0.1952  0.1021  0.0637  0.0763
DS5       Lakner et al. [2008]         (50, 378)        2.84×10^74       33752           11.5746  1.3272  0.8952  0.8218  0.8599
DS6       Zhang and Blackwell [2001]   (50, 1133)       2.84×10^74       35407           10.0159  0.4526  0.2613  0.2786  0.3016
DS7       Yoder and Yang [2004]        (59, 1824)       4.36×10^92       1125            1.2765   0.3292  0.2341  0.0399  0.0483
DS8       Rossman et al. [2001]        (64, 1008)       1.04×10^103      3067            2.1653   0.4149  0.2212  0.1236  0.1415

Previous work [Whidden and Matsen, 2015] has observed that conditional clade independence does not hold in multimodal distributions. Figure 5 shows a comparison on a typical data set, DS1, that has such a "peaky" distribution. 
We see that CCD underestimates the probability of trees within the subpeak and overestimates the probability of trees between peaks. In contrast, sbn-em largely removes these biases, especially for trees in the 95% credible set.

When applied to a broad range of data sets, we find that SBNs consistently outperform the other methods (Table 1). Due to its inability to generalize beyond observed samples, SRF is worse than the generalizing probability estimators except on an exceedingly simple posterior with only 7 sampled trees (DS2). CCD is, again, considerably worse than the SBN algorithms. With weak regularization, sbn-em-α gives the best performance in most cases.

To illustrate the data efficiency of the algorithms, we perform an additional study on DS1 with increasing sample sizes and summarize the results in the right panel of Figure 5. As before, we see that the improvement of CCD levels off quickly, while the SBN algorithms, especially the fully capable estimators sbn-em and sbn-em-α, keep learning efficiently as the sample size increases. Moreover, the SBN algorithms tend to provide much better approximations than SRF when fewer samples are available, which is important in practice, where large samples are expensive to obtain.

² The same α is used for the real data sets since the sample sizes are roughly the same, although the numbers of unique trees are quite different.

6 Conclusion

We have proposed a general framework for tree probability estimation based on subsplit Bayesian networks. SBNs allow us to exploit the similarity among trees to provide a wide range of flexible probability estimators that generalize beyond observations.
Moreover, they allow many efficient Bayesian network learning algorithms to be extended to tree probability estimation with ease. We report promising numerical results demonstrating the importance of being both flexible and generalizing when estimating probabilities on trees. Although we present SBNs in the context of leaf-labeled bifurcating trees, they can be easily adapted to general leaf-labeled trees by allowing partitions other than subsplits (bipartitions) of the clades in parent nodes. We leave for future work the investigation of more complicated SBNs for general trees, structure learning of SBNs, a deeper examination of the effect of parameter sharing, and further applications of SBNs to other probabilistic learning problems in tree spaces, such as designing more efficient tree proposals for MCMC transition kernels and providing flexible and tractable distributions for variational inference.

Acknowledgements

This work was supported by National Science Foundation grant CISE-1564137, as well as National Institutes of Health grants R01-GM113246 and U54-GM111274. The research of Frederick Matsen was supported in part by a Faculty Scholar grant from the Howard Hughes Medical Institute and the Simons Foundation.

References

W. Buntine. Theory refinement on Bayesian networks. In B. D'Ambrosio, P. Smets, and P. Bonissone, editors, Proceedings of the 7th Conference on Uncertainty in Artificial Intelligence, pages 52–60. Morgan Kaufmann, 1991.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39:1–38, 1977.

R. DeSalle and G. Amato. The expansion of conservation genetics. Nat. Rev. Genet., 5(9):702–712, September 2004. doi: 10.1038/nrg1425.

J. Felsenstein. Inferring Phylogenies.
Sinauer Associates, 2nd edition, 2003.

N. Friedman, M. Ninio, I. Pe'er, and T. Pupko. A structural EM algorithm for phylogenetic inference. Journal of Computational Biology, 9(2):331–353, 2002.

J. R. Garey, T. J. Near, M. R. Nonnemacher, and S. A. Nadler. Molecular evidence for Acanthocephala as a subtaxon of Rotifera. J. Mol. Evol., 43:287–292, 1996.

S. B. Hedges, K. D. Moberg, and L. R. Maxson. Tetrapod phylogeny inferred from 18S and 28S ribosomal RNA sequences and review of the evidence for amniote relationships. Mol. Biol. Evol., 7:607–633, 1990.

D. A. Henk, A. Weir, and M. Blackwell. Laboulbeniopsis termitarius, an ectoparasite of termites newly recognized as a member of the Laboulbeniomycetes. Mycologia, 95:561–564, 2003.

S. Höhna, T. A. Heath, B. Boussau, M. J. Landis, F. Ronquist, and J. P. Huelsenbeck. Probabilistic graphical model representation in phylogenetics. Syst. Biol., 63:753–771, 2014.

S. Höhna and A. J. Drummond. Guided tree topology proposals for Bayesian phylogenetic inference. Syst. Biol., 61(1):1–11, January 2012. doi: 10.1093/sysbio/syr074.

J. P. Huelsenbeck and F. Ronquist. MrBayes: Bayesian inference of phylogeny. Bioinformatics, 17:754–755, 2001.

J. P. Huelsenbeck, F. Ronquist, R. Nielsen, and J. P. Bollback. Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 294:2310–2314, 2001.

T. H. Jukes and C. R. Cantor. Evolution of protein molecules. In H. N. Munro, editor, Mammalian protein metabolism, III, pages 21–132, New York, 1969. Academic Press.

C. Lakner, P. van der Mark, J. P. Huelsenbeck, B. Larget, and F. Ronquist. Efficiency of Markov chain Monte Carlo tree proposals in Bayesian phylogenetics. Syst. Biol., 57:86–103, 2008.

B. Larget.
The estimation of tree posterior probabilities using conditional clade probability distributions. Syst. Biol., 62(4):501–511, July 2013. doi: 10.1093/sysbio/syt014.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

B. Mau, M. Newton, and B. Larget. Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics, 55:1–12, 1999.

R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 89:355–368, 1998.

R. A. Neher and T. Bedford. nextflu: Real-time tracking of seasonal influenza virus evolution in humans. Bioinformatics, June 2015. doi: 10.1093/bioinformatics/btv381.

F. Ronquist, M. Teslenko, P. van der Mark, D. L. Ayres, A. Darling, S. Höhna, B. Larget, L. Liu, M. A. Suchard, and J. P. Huelsenbeck. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst. Biol., 61:539–542, 2012.

A. Y. Rossman, J. M. Mckemy, R. A. Pardo-Schultheiss, and H. J. Schroers. Molecular studies of the Bionectriaceae using large subunit rDNA sequences. Mycologia, 93:100–110, 2001.

K. Strimmer and V. Moulton. Likelihood analysis of phylogenetic networks using directed graphical models. Molecular Biology and Evolution, 17:875–881, 2000.

M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

C. Whidden and F. A. Matsen, IV. Quantifying MCMC exploration of phylogenetic tree space. Syst. Biol., 64(3):472–491, May 2015.
doi: 10.1093/sysbio/syv006.

Z. Yang and B. Rannala. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol. Biol. Evol., 14:717–724, 1997.

Z. Yang and A. D. Yoder. Comparison of likelihood and Bayesian methods for estimating divergence times using multiple gene loci and calibration points, with application to a radiation of cute-looking mouse lemur species. Syst. Biol., 52:705–716, 2003.

A. D. Yoder and Z. Yang. Divergence dates for Malagasy lemurs estimated from multiple gene loci: geological and evolutionary context. Mol. Ecol., 13:757–773, 2004.

N. Zhang and M. Blackwell. Molecular phylogeny of dogwood anthracnose fungus (Discula destructiva) and the Diaporthales. Mycologia, 93:355–365, 2001.