{"title": "Efficient Inference in Phylogenetic InDel Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 177, "page_last": 184, "abstract": "Accurate and efficient inference in evolutionary trees is a central problem in computational biology. Realistic models require tracking insertions and deletions along the phylogenetic tree, making inference challenging. We propose new sampling techniques that speed up inference and improve the quality of the samples. We compare our method to previous approaches and show performance improvement on metrics evaluating multiple sequence alignment and reconstruction of ancestral sequences.", "full_text": "Ef\ufb01cient Inference in Phylogenetic InDel Trees\n\nAlexandre Bouchard-C\u02c6ot\u00b4e\u2020 Michael I. Jordan\u2020\u2021\n\nComputer Science Division\u2020, Department of Statistics\u2021\n\nDan Klein\u2020\n\nUniversity of California at Berkeley\n\n{bouchard,jordan,klein}@cs.berkeley.edu\n\nBerkeley, CA 94720\n\nAbstract\n\nAccurate and ef\ufb01cient inference in evolutionary trees is a central problem in com-\nputational biology. While classical treatments have made unrealistic site inde-\npendence assumptions, ignoring insertions and deletions, realistic approaches re-\nquire tracking insertions and deletions along the phylogenetic tree\u2014a challenging\nand unsolved computational problem. We propose a new ancestry resampling\nprocedure for inference in evolutionary trees. We evaluate our method in two\nproblem domains\u2014multiple sequence alignment and reconstruction of ancestral\nsequences\u2014and show substantial improvement over the current state of the art.\n\n1 Introduction\n\nPhylogenetic analysis plays a signi\ufb01cant role in modern biological applications such as ancestral\nsequence reconstruction and multiple sequence alignment [1, 2, 3]. 
While insertions and deletions (InDels) of nucleotides or amino acids are an important aspect of phylogenetic inference, they pose formidable computational challenges and they are usually handled with heuristics [4, 5, 6]. Routine application of approximate inference techniques fails because of the intricate nature of the combinatorial space underlying InDel models.\nConcretely, the models considered in the phylogenetic literature take the form of a tree-shaped graphical model whose nodes are string-valued random variables representing a fragment of DNA, RNA or protein of a species. Edges denote evolution from one species to another, with conditional probabilities derived from the stochastic model described in Sec. 2. Usually, only the terminal nodes are observed, while the internal nodes are hidden. The interpretation is that the sequence at the root is the common ancestor of those at the terminal nodes, and it subsequently evolved in a branching process following the topology of the tree. We will concentrate on the problem of computing the posterior of these hidden nodes rather than the problem of selecting the topology of the tree—hence we will assume the tree is known or estimated with some other algorithm (a guide tree assumption).\nThis graphical model can be misleading: it encodes only one type of independence relation, that between generations. There is another important structure that can be exploited. Informally, InDel events that operate at the beginning of the sequences should not affect, for instance, those at the end. However, because alignments between the sequences are unknown in practice, it is difficult to exploit this structure in a principled way.\nIn many previous works [4, 5, 6], the following heuristic approach is taken to perform inference on the hidden nodes (refer to Fig. 
1): First, a guide tree (d) and a multiple sequence alignment (a) (a transitive alignment between the characters in the sequences of the modern species) are computed using heuristics [7, 8]. Second, the problem is cast into several easy subproblems as follows. For each equivalence class in the multiple sequence alignment (called a site, corresponding to a column in Fig. 1(b)), a new graphical model is created with the same tree structure as the original problem, but where there is exactly one character in each node rather than a string. For nodes with a character in the current equivalence class, the node in this new tree is observed, and the rest of the nodes are considered as unobserved data (Fig. 1(c)). Note that the question marks are not the gaps commonly seen in linearized representations of multiple alignments, but rather phantom characters. Finally, each site is assumed independent of the others, so the subproblems can be solved efficiently by running the forward-backward algorithm on each site.\n\nFigure 1: Comparison of different approaches to phylogenetic modeling: (a,b,c,d) heuristics based on site independence; (e) Single Sequence Resampling; (f) Ancestry Resampling. The boxes denote the structures that can be sampled or integrated out in one step by each method.\n\nThis heuristic has several problems, the most important being that it does not allow explicit modeling of insertions and deletions (InDels), which are frequent in real biological data and play an important role in evolution [9]. If InDels are included in the probabilistic model, there is no longer a deterministic notion of site on which independence assumptions can be made. This complicates inference substantially. For instance, in the standard TKF91 model [10], the fastest known algorithm for computing exact posteriors takes time O(2^F N^F), where F is the number of leaves and N is the geometric mean sequence length [11].\nHolmes et al. 
[2] developed an approximate Markov chain Monte Carlo (MCMC) inference procedure for the TKF91 model. Their algorithm proceeds by sampling the entire sequence corresponding to a single species, conditioning on its parent and children (Fig. 1(e)). We will call this type of kernel a Single Sequence Resampling (SSR) move. Unfortunately, chains based exclusively on SSR have performance problems.\nThere are two factors behind these problems. The first factor is a random walk behavior that arises in tall chains found in large or unbalanced trees [2, 12]: initially, the InDel events resampled at the top of the tree are independent of all the observations, and it takes time for the information from the observations to propagate up the tree. The second factor is the computational cost of each SSR move, which is O(N^3) with the TKF91 model and binary trees. For long sequences, this becomes prohibitive, so it is common to use a “maximum deviation pruning strategy” (i.e., putting a bound on the relative positions of characters that mutate from one to the other) to speed things up [12]. We observed that this pruning can substantially hurt the quality of the estimated posterior (see Sec. 4).\nIn this paper, we present a novel MCMC procedure for phylogenetic InDel models that we refer to as Ancestry Resampling (AR). AR addresses both the efficiency and accuracy problems that arise with SSR. The intuition behind the AR approach is to use an MCMC kernel that combines the advantages of the two approaches described above: like the forward-backward algorithm in the site-independent case, AR always directly conditions on some part of the observed data, but, like SSR, it is capable of resampling the InDel history. This is illustrated in Fig. 1(f).\n\n2 Model\n\nFor concreteness, we describe the algorithms in the context of the standard TKF91 model [10], but in Sec. 5 we discuss how the ideas extend to other models. 
We assume that a phylogenetic directed tree topology τ = (V, E) is fixed, where nodes in this tree are string-valued random variables from an alphabet of K characters—K is four in nucleotide sequences and about twenty in amino-acid sequences. Also known is a positive time length t_e associated with each edge e ∈ E.\nWe start the description of the model in the simple case of a single branch of known length t, with a string x at the root and a string y at the leaf. The model, TKF91, is a string-valued Continuous-Time Markov Chain (CTMC). There is one rate µ for deletion (death in the original TKF terminology) and one rate λ for insertion, which can occur either to the right of one of the existing characters (birth), or to the left of the sequence (immigration). Additionally, there is an independent CTMC substitution process on each character.\nFortunately, the TKF91 model has a closed-form solution for the conditional distribution over strings y at the leaf given the string x at the root. The derivation of this conditional distribution is presented in [10] and its form is:\n\nP(a character in x survived and has n descendants in y) = αβ^(n−1)(1 − β), for n = 1, 2, . . .\nP(a character in x died and has no descendants in y) = (1 − α)(1 − γ), for n = 0\nP(a character in x died and has n descendants in y) = (1 − α)γβ^(n−1)(1 − β), for n = 1, 2, . . .\nP(immigrants inserted at the left have n descendants in y) = β^n(1 − β), for n = 0, 1, . . .\n\nIn defining descendants, we count the character itself, its children, grandchildren, etc. 
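Under these conditionals, sampling a child string given its parent reduces to drawing a Bernoulli "fate" and a geometric descendant count per character. The following is a minimal sketch assuming α, β, γ have already been computed from t, µ, λ (see [2]); the substitution process is omitted and birth characters are drawn uniformly, so `sample_child` and `num_failures` are illustrative names, not the paper's implementation:

```python
import random

ALPHABET = "ACGT"  # nucleotide case; about twenty symbols for amino acids

def num_failures(beta, rng):
    """Sample n with P(n) = beta^n (1 - beta), n = 0, 1, ... (geometric failure count)."""
    n = 0
    while rng.random() < beta:
        n += 1
    return n

def sample_child(parent, alpha, beta, gamma, rng=None):
    """Sample y | x under the TKF91 conditionals (substitution process omitted)."""
    rng = rng or random.Random()
    new = lambda: rng.choice(ALPHABET)  # a fresh character for each birth
    # Immigrants inserted at the left: n descendants w.p. beta^n (1 - beta).
    child = [new() for _ in range(num_failures(beta, rng))]
    for c in parent:
        if rng.random() < alpha:
            # Survived: n >= 1 descendants w.p. alpha * beta^(n-1) * (1 - beta).
            n = 1 + num_failures(beta, rng)
            child.append(c)                          # the surviving copy itself
            child.extend(new() for _ in range(n - 1))
        elif rng.random() < gamma:
            # Died but left descendants: (1 - alpha) * gamma * beta^(n-1) * (1 - beta).
            n = 1 + num_failures(beta, rng)
            child.extend(new() for _ in range(n))
        # else: died with no descendants, w.p. (1 - alpha) * (1 - gamma).
    return "".join(child)
```

As a sanity check, with α = 1 and β = 0 every character survives with exactly one descendant and there are no immigrants, so the child equals the parent.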
α, β, γ are functions of t, µ, λ; see [2] for the details. Since we only work with these conditionals, note that the situation resembles that of a standard weighted edit process with a specific, branch-length dependent structure over insertions and deletions.\nTo go from a single branch to a tree, we simply compose this process. The full generative process works as follows: starting at the root, we generate the first string according to the stationary distribution of TKF91. Then, for each outgoing edge e, we use the known time t_e and the equations above to generate a child string. We continue in preorder recursively.\n\n2.1 Auxiliary variables\n\nWe now define some auxiliary variables that will be useful in the next section. Between each pair of nodes a, b ∈ V connected by an edge and with respective strings x, y, we define an alignment random variable: its values are bipartite matchings between the characters of the strings x and y. Links in this alignment denote survival of a character (allowing zero or more substitutions). Note that this alignment is monotonic: if character i in x is linked to character j in y, then any character i′ > i in x can only be unlinked or linked to a character with index j′ > j in y. The random variable that consists of the alignments and the strings for all the edges and nodes in the phylogenetic tree τ will be called a derivation.\nNote also that a derivation D defines another graph that we will call a derivation graph. Its nodes are the characters of all the strings in the tree. Let a, b ∈ V be the nodes corresponding to the strings to which two characters x, y respectively belong. We put an edge between x and y iff (1) there is an edge between a and b in E and (2) there is a link between x and y in the alignment of the corresponding strings. 
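The monotonicity constraint on a single-edge alignment can be checked directly; a minimal sketch, where links are (i, j) index pairs between the two strings and `is_monotonic` is an illustrative helper, not from the paper:

```python
def is_monotonic(links):
    """Check that an alignment (a set of (i, j) links between two strings) is a
    monotonic bipartite matching: no shared endpoints and no crossing links."""
    links = sorted(links)
    for (i1, j1), (i2, j2) in zip(links, links[1:]):
        if i2 == i1 or j2 <= j1:  # shared endpoint, or a link that crosses back
            return False
    return True
```

For example, {(0, 0), (2, 1)} is a valid alignment, while {(0, 1), (1, 0)} crosses and {(0, 0), (0, 1)} links one character twice.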
Examples of derivation graphs are shown in Fig. 2.\n\n3 Efficient inference\n\nThe approximate inference algorithm we propose, Ancestry Resampling (AR), is based on the Metropolis-Hastings (MH) framework. While the SSR kernel resamples the whole sequence corresponding to a single node, AR works around the difficulties of SSR by jointly resampling a “thin vertical slice” (Fig. 1(f)) of the tree, composed of a short substring in every node. As we will see, with the right definition of vertical slice, this yields a valid and efficient MH algorithm.\n\n3.1 Ancestry Resampling\n\nWe will call one of these “thin slices” an ancestry A, and we now discuss what its definition should be. Some care will be needed to ensure irreducibility and reversibility of the sampler.\n\nFigure 2: (a): the simple guide tree used in this example (left) and the corresponding sequences and alignments (right). (a,b,c): the definitions of A_0, A_∞, A respectively are shaded (the “selected characters”). (d,e): An example showing the non-reversibility problem with A_∞.\n\nWe first augment the state of the AR sampler to include the derivation auxiliary variable described in Sec. 2.1. Let D be the current derivation and let x be a substring of one of the terminal nodes, say in node e. We will call x an anchor. The ancestry will depend on both a derivation and an anchor. The overall MH sampler is a mixture of proposal distributions indexed by a set of anchors covering all the characters in the terminal strings. Each proposal resamples a new value of A(D, x) given the terminal nodes, keeping A(D, x)^c frozen.\nWe first let A_0(D, x) be the set of characters connected to some character in x in the derivation graph of D (see Fig. 2(a)). This set A_0(D, x) is not a suitable definition of vertical slice, but will be useful to construct the correct one. It is unsuitable for two reasons. 
First, it does not yield an irreducible chain, as illustrated in the same figure, where nine of the characters of this sample (those inside the dashed curve) will never be resampled, no matter which substring of the terminal node is selected as anchor. Secondly, we would like the vertical slices to be contiguous substrings rather than general subsequences, to ease implementation.\nWe therefore modify the definition recursively as follows (see Fig. 2(b) for an illustration). For i > 0, we will say that a character token y is in A_i(D, x) if one of the following conditions is true:\n\n1. y is connected to A_{i−1}(D, x);\n2. y appears in a string ··· y′ ··· y ··· y″ ··· such that both y′ and y″ are in A_{i−1}(D, x);\n3. y appears in a string ··· y′ ··· y ··· such that y′ is in A_{i−1}(D, x) and x is a suffix;\n4. y appears in a string ··· y ··· y′ ··· such that y′ is in A_{i−1}(D, x) and x is a prefix.\n\nThen, we define A_∞(D, x) := ∪_{i≥0} A_i(D, x). In words, a symbol is in A_∞(D, x) if it is linked to an anchored character through the alignments, or if it is “squeezed” between previously connected characters. Cases 3 and 4 handle the boundaries of strings. With this property, irreducibility could be established with some conditions on the anchors, but it turns out that this definition is still not quite right.\nWith A_∞, the main problem arises when one tries to establish reversibility of the chain. This is illustrated in Fig. 2(d). In this example, the chain first transitions to a new state by altering the circled link. 
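Before turning to the reversibility issue, the recursive closure A_∞ defined by cases 1-4 can be sketched as a fixed-point computation over character tokens; the token-level `strings`/`links` representation and the function name are illustrative assumptions, not the paper's implementation:

```python
def ancestry_closure(strings, links, anchor_node, lo, hi,
                     anchor_is_prefix=False, anchor_is_suffix=False):
    """A_inf(D, x): least fixed point of cases 1-4.
    strings: {node: sequence}; links: alignment edges ((node, i), (node, j)).
    The anchor x is positions lo..hi (inclusive) of strings[anchor_node]."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    selected = {(anchor_node, i) for i in range(lo, hi + 1)}
    changed = True
    while changed:
        changed = False
        # Case 1: tokens linked to a selected token in the derivation graph.
        for tok in list(selected):
            for nb in adj.get(tok, ()):
                if nb not in selected:
                    selected.add(nb); changed = True
        # Case 2: tokens squeezed between selected tokens in the same string;
        # cases 3-4: extend to the string boundary when x is a suffix/prefix.
        for node, s in strings.items():
            idx = [i for i in range(len(s)) if (node, i) in selected]
            if not idx:
                continue
            left = 0 if anchor_is_prefix else min(idx)
            right = len(s) - 1 if anchor_is_suffix else max(idx)
            for i in range(left, right + 1):
                if (node, i) not in selected:
                    selected.add((node, i)); changed = True
    return selected
```

Iterating the two rule families until nothing changes yields the union over all i, since each A_i only adds tokens.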
One can see that with the definition of A_∞(D, x) given above, from the state in Fig. 2(e), the state in Fig. 2(d) is now unreachable by the same resampling operator, the reason being that the substring labeled z in the figure belongs to the frozen part of the state if the transition is visited backwards.\nWhile there exist MCMC methods that are not based on reversible chains [13], we prefer to take a simpler approach: a variation on our definition solves the issue, informally by taking vertical slices A(D, x) to be the “complement of the ancestry taken on the complement of the anchor.” More precisely, if the string at the anchor node e decomposes as x′ x x″ around the anchor x, we let the resampled section be A(D, x) := (A_∞(D, x′) ∪ A_∞(D, x″))^c. This creates slightly thicker slices (Fig. 2(c)) but solves the reversibility problem. We will call A(D, x) the ancestry of the anchor x. With this definition, the proposal distribution can be made reversible using an MH acceptance ratio; it is also irreducible.\nThe problem of resampling a single slice decomposes along the tree structure τ, but an unbounded number of InDels could a priori occur inside the thin slice. It may seem at first glance that we are back at our initial problem: sampling from a tree-structured directed graphical model where the support of the space of the nodes is a countably infinite space. But in fact, we have made progress: the distribution is now concentrated on very short sequences. Indeed, the anchors x can be taken relatively small (we used anchors of length 3 to 5 in our experiments).\nAnother important property to notice is that given an assignment of the random variable A(D, x), it is possible to compute efficiently and exactly an unnormalized probability for this assignment. The summation over the possible alignments can be done using a standard quadratic dynamic program known in its max version as the Needleman-Wunsch algorithm [14].\n\n3.2 Cylindric proposal\n\nWe now introduce the second idea that makes efficient inference possible: when resampling an ancestry given its complement, rather than allowing all possible strings for the resampled value of A(D, x), we restrict the choices to the set of substitutes that are close to its current value. We formalize closeness as follows: let a_1, a_2 be two values for the ancestry A(D, x). We define the cylindric distance as the maximum over all the nodes e of the Levenshtein edit distance between the substrings of a_1 and a_2 at node e. Fix some positive integer m. The proposal distribution considers the substitute ancestries that lie within a ball of radius m centered at the current state in the cylindric metric. The value m = 1 worked well in practice.\nHere the number of states in the tree-structured dynamic program at each node is polynomial in the lengths of the strings in the current ancestry. A sample can therefore be obtained easily using the observation made above that an unnormalized probability can be computed.1 Next, we compute the acceptance ratio, i.e.,\n\nmin { 1, (P(a_p) × Q(a_c | a_p)) / (P(a_c) × Q(a_p | a_c)) },\n\nwhere a_c, a_p are the current and proposed ancestry values and Q(a_2 | a_1) is the transition probability of the MH kernel, proportional to P(·), but with support restricted to the cylindric ball centered at a_1.\n\n4 Experiments\n\nWe consider two tasks: reconstruction of ancestral sequences and prediction of alignments between multiple genetically-related proteins. 
We are interested in comparing the ancestry sampling method (AR) presented in this paper with the Markov kernel used in previous literature (SSR).\n\n4.1 Reconstruction of ancestral sequences\n\nGiven a set of genetically-related sequences, the reconstruction task is to infer properties of the common ancestor of these modern species. This task has important scientific applications: for instance, in [1], the ratio of G+C nucleotide content of ribosomal RNA sequences was estimated to assess the environmental temperature of the common ancestor to all life forms (this ratio is strongly correlated with the optimal growth temperature of prokaryotes).\nJust as in the task of topology reconstruction, there are no gold ancestral sequences available to evaluate ancestral sequence reconstruction. For this reason, we take the same approach as in topology reconstruction and perform comparisons on synthetic data [15].\nWe generated a root node from the DNA alphabet and evolved it down a binary tree of seven nodes. Only the leaves were given to the algorithms (a total of 124010 nucleotides); the hidden nodes were held out. Since our goal in this experiment is to compare inference algorithms rather than methods of estimation, we gave both algorithms the true parameters, i.e., those that were used to generate the data.\n\nFootnote 1 (from Sec. 3.2): What we are using here is actually a nested dynamic program, meaning that the computation of a probability in the outer dynamic program (DP) requires the computation of an inner, simpler DP. While this may seem prohibitive, it is made feasible by designing the sampling kernels so that the inner DP is executed most of the time on small problem instances. We also cached the small-DP cost matrices.\n\nFigure 3: Left: Single Sequence Resampling versus Ancestry Resampling on the sequence reconstruction task. Right: detrimental effect of a maximum deviation heuristic, which is not needed with AR samplers. (Both panels plot reconstruction error against running time.)\n\nThe task is to predict the sequences at the root node, with error measured using the Levenshtein edit distance l. For both algorithms, we used a standard approximation to minimum Bayes risk decoding to produce the final reconstruction: if s_1, s_2, . . . , s_I are the samples collected up to iteration I, we return the sample s_i minimizing Σ_{j∈1...I} l(s_i, s_j).\nFig. 3 (left) shows the error as a function of time for the two algorithms, both implemented efficiently in Java. Although the computational cost of one pass through the data was higher with AR, the AR method proved to be dramatically more effective: after only one pass through the data (345s), AR already performed better than running SSR for nine hours. Moreover, AR steadily improved its performance as more samples were collected, keeping its error at each iteration to less than half of that of the competitor.\nFig. 3 (right) shows the detrimental effect of a maximum deviation heuristic. This experiment was performed under the same setup described in this section. While the maximum deviation heuristic is necessary for SSR to be able to handle the long sequences found in biological datasets, it is not necessary for AR samplers.\n\n4.2 Protein multiple sequence alignment\n\nWe also performed experiments on the task of protein multiple sequence alignment, for which the BAliBASE [16] dataset provides a standard benchmark. BAliBASE contains annotations created by biologists using secondary structure elements and other biological cues.\nNote first that we can get a multiple sequence alignment from an InDel evolutionary model. For a set S of sequences to align, construct a phylogenetic tree such that its terminal leaves coincide with S. 
A multiple sequence alignment can be extracted from the inferred derivation D as follows: deem the amino acids x, y ∈ S aligned iff y ∈ A_0(D, x).\nThe state-of-the-art among multiple sequence alignment systems based on an evolutionary model is Handel [2]. It is based on TKF91 and produces a multiple sequence alignment as described above. The key difference from our approach is that its inference algorithm is based on SSR rather than the AR move that we advocate in this paper.\nWhile other heuristic approaches are known to perform better than Handel on this dataset [8, 17], they are not based on explicit evolutionary models. They perform better because they leverage more sophisticated features such as affine gap penalties and hydrophobic core modeling. While these features can be incorporated in our model, we leave this for future work since the topic of this paper is inference.\n\nTable 1: CS and SP scores on the ref1 directory of BAliBASE.\nSystem | CS | SP\nSSR (Handel) | 0.63 | 0.77\nAR (this paper) | 0.77 | 0.86\n\nFigure 4: Left: performance on the ref1 directory of BAliBASE. Center, right: Column Score (CS) and Sum-of-Pairs score (SP) as a function of the depth of the generating trees.\n\nWe built evolutionary trees using weighbor [7]. We ran each system for the same time on the sequences in the ref1 directory of BAliBASE v.1. Decoding for this experiment was done by picking the sample with the highest likelihood. We report in Fig. 4 (left) the CS and SP scores, the two standard metrics for this task. Both are recall measures on the subset of the alignments that were labeled, called the core blocks; see, e.g., [17] for the details. 
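The extraction rule above (x and y aligned iff y ∈ A_0(D, x)) amounts to grouping leaf characters by connected component of the derivation graph; a minimal union-find sketch, with an illustrative token representation not taken from the paper's implementation:

```python
def aligned_groups(leaf_tokens, links):
    """Group leaf characters that share a connected component of the
    derivation graph, i.e., that are transitively linked through ancestors.
    leaf_tokens: list of (node, position); links: derivation-graph edges."""
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    def union(u, v):
        parent[find(u)] = find(v)

    for u, v in links:
        union(u, v)
    groups = {}
    for tok in leaf_tokens:
        groups.setdefault(find(tok), []).append(tok)
    # Singleton components correspond to unaligned (gapped) characters.
    return [g for g in groups.values() if len(g) > 1]
```

Two leaf characters linked only through a hidden ancestor (e.g., both aligned to the same root character) end up in the same group, which is exactly the transitivity the multiple alignment requires.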
For both metrics, our approach performs better.\nIn order to investigate where the advantage comes from, we did another multiple alignment experiment, plotting performance after a fixed running time as a function of the depth of the trees. If the random walk argument presented in the introduction holds, we would expect the advantage of AR over SSR to increase as the tree gets taller. This prediction is confirmed, as illustrated in Fig. 4 (center, right). For short trees, the two algorithms perform equally, SSR beating AR slightly for trees with three nodes, which is not surprising since SSR actually performs exact inference in this tiny topology. However, as the trees get taller, the task becomes more difficult, and only AR maintains good performance.\n
5 Conclusion\n\nWe have described a principled inference procedure for InDel trees. We have evaluated its performance against a state-of-the-art statistical alignment procedure and shown its clear superiority. In contrast to heuristics such as Clustalw [8], it can be used both for reconstruction of ancestral sequences and for multiple alignment.\nWhile our algorithm was described in the context of TKF91, it can be extended to more sophisticated models. Incorporating affine gap penalties and hydrophobic core modeling is of particular interest, as they are known to dramatically improve multiple alignment performance [2]. These models typically do not have closed forms for the conditional probabilities, but this could be alleviated by using a discretization of longer branches. This creates tall trees, but as we have seen, AR still performs very well in this setting.\n\nReferences\n\n[1] N. Galtier, N. Tourasse, and M. Gouy. A nonhyperthermophilic common ancestor to extant life forms. Science, 283:220–221, 1999.\n[2] I. Holmes and W. J. Bruno. Evolutionary HMM: a Bayesian approach to multiple alignment. Bioinformatics, 17:803–820, 2001.\n[3] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003.\n[4] Z. Yang and B. Rannala. Bayesian phylogenetic inference using DNA sequences: A Markov chain Monte Carlo method. Molecular Biology and Evolution, 14:717–724, 1997.\n[5] B. Mau and M. A. Newton. Phylogenetic inference for binary data on dendrograms using Markov chain Monte Carlo. Journal of Computational and Graphical Statistics, 6:122–131, 1997.\n[6] S. Li, D. K. Pearl, and H. Doss. Phylogenetic tree construction using Markov chain Monte Carlo. Journal of the American Statistical Association, 95:493–508, 2000.\n[7] W. J. Bruno, N. D. Socci, and A. L. Halpern. Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction. Molecular Biology and Evolution, 17:189–197, 2000.\n[8] D. G. Higgins and P. M. Sharp. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73:237–244, 1988.\n[9] J. L. Thorne, H. Kishino, and J. Felsenstein. Inching toward reality: an improved likelihood model of sequence evolution. Journal of Molecular Evolution, 34:3–16, 1992.\n[10] J. L. Thorne, H. Kishino, and J. Felsenstein. An evolutionary model for maximum likelihood alignment of DNA sequences. Journal of Molecular Evolution, 33:114–124, 1991.\n[11] G. A. Lunter, I. Miklós, Y. S. Song, and J. Hein. An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. Journal of Computational Biology, 10:869–889, 2003.\n[12] A. Bouchard-Côté, P. Liang, D. Klein, and T. L. Griffiths. A probabilistic approach to diachronic phonology. In Proceedings of EMNLP 2007, 2007.\n[13] P. Diaconis, S. Holmes, and R. M. Neal. Analysis of a non-reversible Markov chain sampler. Technical report, Cornell University, 1997.\n[14] S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–453, 1970.\n[15] K. St. John, T. Warnow, B. M. E. Moret, and L. Vawter. Performance study of phylogenetic methods: (unweighted) quartet methods and neighbor-joining. Journal of Algorithms, 48:173–193, 2003.\n[16] J. Thompson, F. Plewniak, and O. Poch. BAliBASE: A benchmark alignments database for the evaluation of multiple sequence alignment programs. Bioinformatics, 15:87–88, 1999.\n[17] C. B. Do, M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou. PROBCONS: Probabilistic consistency-based multiple sequence alignment. Genome Research, 15:330–340, 2005.\n", "award": [], "sourceid": 438, "authors": [{"given_name": "Alexandre", "family_name": "Bouchard-c\u00f4t\u00e9", "institution": null}, {"given_name": "Dan", "family_name": "Klein", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}