{"title": "Large Scale Hidden Semi-Markov SVMs", "book": "Advances in Neural Information Processing Systems", "page_first": 1161, "page_last": 1168, "abstract": null, "full_text": "Large Scale Hidden Semi-Markov SVMs\n\n Gunnar Ratsch Friedrich Miescher Laboratoy, Max Planck Society  Spemannstr. 39, 72070 Tubingen, Germany Gunnar.Raetsch@tuebingen.mpg.de\n\n Soren Sonnenburg Fraunhofer FIRST.IDA  Kekulestr. 7, 12489 Berlin, Germany sonne@first.fhg.de\n\nAbstract\nWe describe Hidden Semi-Markov Support Vector Machines (SHM SVMs), an extension of HM SVMs to semi-Markov chains. This allows us to predict segmentations of sequences based on segment-based features measuring properties such as the length of the segment. We propose a novel technique to partition the problem into sub-problems. The independently obtained partial solutions can then be recombined in an efficient way, which allows us to solve label sequence learning problems with several thousands of labeled sequences. We have tested our algorithm for predicting gene structures, an important problem in computational biology. Results on a well-known model organism illustrate the great potential of SHM SVMs in computational biology.\n\n1\n\nIntroduction\n\nHidden Markov SVMs are a recently-proposed method for predicting a label sequence given the input sequence [3, 17, 18, 1, 2]. They combine the benefits of the power and flexibility of kernel methods with the idea of Hidden Markov Models (HMM) [11] to predict label sequences. In this work we introduce a generalization of Hidden Markov SVMs, called Hidden Semi-Markov SVMs (HSM SVMs). In HM SVMs and HMMs there is a state transition for every input symbol. In semiMarkov processes it is allowed to persist in a state for a number of time steps before transitioning into a new state. During this segment of time the system's behavior is allowed to be non-Markovian. 
This adds flexibility, for instance to model segment lengths or to use non-linear content sensors that may depend on the start and end of a segment. One of the largest problems with HM SVMs, and also with HSM SVMs, is their high computational complexity: solving the resulting optimization problems may become computationally infeasible already for a few hundred examples. In the second part of the paper we consider the case of using content sensors (for whole segments) and signal detectors (at segment boundaries) in HSM SVMs. We motivate a simple but very effective strategy of partitioning the problem into independent sub-problems and discuss how the parts can be recombined by solving a relatively small optimization problem rather efficiently. This strategy allows us to tackle significantly larger label sequence problems (with several thousands of sequences). To illustrate the strength of our approach we have applied our algorithm to an important problem in computational biology: the prediction of the segmentation of a pre-mRNA sequence into exons and introns. On problems derived from sequences of the model organism Caenorhabditis elegans we show that the HSM SVM approach consistently outperforms HMM-based approaches by a large margin (see also [13]).

The paper is organized as follows: In Section 2 we introduce the necessary notation, HM SVMs and the extension to semi-Markov models. In Section 3 we propose and discuss a technique that allows us to train HSM SVMs on significantly more training examples. Finally, in Section 4 we outline the gene structure prediction problem, discuss additional techniques needed to apply HSM SVMs to it, and show surprisingly large improvements compared to state-of-the-art methods.

Corresponding author, http://www.fml.mpg.de/raetsch

2 Hidden Markov SVMs

In label sequence learning one learns a function that assigns to a sequence of objects $x = x_1 x_2 \ldots x_l$ a sequence of labels $y = \sigma_1 \sigma_2 \ldots \sigma_l$ ($x_i \in \mathcal{X}$, $\sigma_i \in \Sigma$, $i = 1, \ldots, l$). While the objects can be of rather arbitrary kind (e.g. vectors, letters, etc.), the set of labels $\Sigma$ has to be finite.1 A common approach is to determine a discriminant function $F : X \times Y \to \mathbb{R}$ that assigns a score to every input $x \in X := \mathcal{X}^*$ and every label sequence $y \in Y := \Sigma^*$, where $\mathcal{X}^*$ denotes the Kleene closure of $\mathcal{X}$. In order to obtain a prediction $f(x) \in Y$, the function is maximized with respect to the second argument:

  $f(x) = \operatorname{argmax}_{y \in Y} F(x, y).$   (1)

2.1 Representation & Optimization Problem

In Hidden Markov SVMs (HM SVMs) [3], the function $F(x, y) := \langle w, \Phi(x, y) \rangle$ is linearly parametrized by a weight vector $w$, where $\Phi(x, y)$ is some mapping into a feature space $\mathcal{F}$. Given a set of training examples $(x_n, y_n)$, $n = 1, \ldots, N$, the parameters are tuned such that the true labeling $y_n$ scores higher than all other labelings $y \in Y_n := Y \setminus \{y_n\}$ with a large margin, i.e. $F(x_n, y_n) \gg \operatorname{argmax}_{y \in Y_n} F(x_n, y)$. This goal can be achieved by solving the following optimization problem (which appeared equivalently in [3]):

  $\min_{\xi \in \mathbb{R}^N,\, w \in \mathcal{F}} \; C \sum_{n=1}^{N} \xi_n + P(w)$
  s.t. $\langle w, \Phi(x_n, y_n) \rangle - \langle w, \Phi(x_n, y) \rangle \geq 1 - \xi_n$ for all $n = 1, \ldots, N$ and $y \in Y_n$,   (2)

where $P$ is a suitable regularizer (e.g. $P(w) = \|w\|^2$) and the $\xi$'s are slack variables implementing a soft margin. Note that the linear constraints in (2) are equivalent to the following set of nonlinear constraints: $F(x_n, y_n) - \max_{y \in Y_n} F(x_n, y) \geq 1 - \xi_n$ for $n = 1, \ldots, N$ [3]. If $P(w) = \|w\|^2$, it can be shown that the solution $w^*$ of (2) can be written as

  $w^* = \sum_{n=1}^{N} \sum_{y \in Y} \alpha_n(y) \, \Phi(x_n, y),$

where $\alpha_n(y)$ is the Lagrange multiplier of the constraint involving example $n$ and labeling $y$ (see [3] for details).
Defining the kernel as $k((x, y), (x', y')) := \langle \Phi(x, y), \Phi(x', y') \rangle$, we can rewrite $F(x, y)$ as

  $F(x, y) = \sum_{n=1}^{N} \sum_{y' \in Y} \alpha_n(y') \, k((x_n, y'), (x, y)).$

2.2 Outline of an Optimization Algorithm

The number of constraints in (2) can be very large, which makes solving problem (2) efficiently a challenge. Fortunately, usually only a few of the constraints are active, and working set methods can be applied in order to solve the problem for larger numbers of examples. The idea is to start with a small set $\bar{Y}_n$ of negative (i.e. false) labelings for every example. One solves (2) for the smaller problem and then identifies labelings $\hat{y} \in Y_n$ that maximally violate constraints, i.e.

  $\hat{y} = \operatorname{argmax}_{y \in Y_n} F(x_n, y),$   (3)

where $w$ is the intermediate solution of the restricted problem. The new constraint generated by this negative labeling is then added to the optimization problem. The method described above is also known as a column generation or cutting-plane algorithm and can be shown to converge to the optimal solution $w^*$ [18]. However, since the computation of $F$ involves many kernel computations and the number of non-zero $\alpha$'s is often large, solving the problem with more than a few hundred labeled sequences is often computationally too expensive.

2.3 Viterbi-like Decoding

Determining the optimal labeling in (1) efficiently is crucial during optimization and prediction. If $F(x, \cdot)$ satisfies certain conditions, one can use a Viterbi-like algorithm [20] for efficient decoding of the optimal labeling. This is in particular the case when $\Phi$ can be written as a sum over the length of the sequence and decomposed as

  $\Phi(x, y) = \left[ \sum_{i=1}^{l(x)} \phi_{\sigma,\tau}(\sigma_i, \sigma_{i+1}, x, i) \right]_{\sigma,\tau \in \Sigma},$2

where $l(x)$ is the length of the sequence $x$. By $[\cdot]$ we denote the concatenation of feature vectors, i.e. $[\phi_1, \phi_2, \ldots]$.

Footnote 1: Note that the number of possible labelings grows exponentially in the length of the sequence.
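The working-set strategy of Section 2.2 above can be sketched in code. The following is a minimal, illustrative Python sketch, not the authors' implementation: the exhaustive `decode_most_violated` stands in for the Viterbi-like search of Eq. (3), and `resolve` stands in for re-solving the restricted problem (2); all names here are our own.

```python
from itertools import product

def decode_most_violated(F, x, y_true, labels, length):
    # Exhaustively find argmax_{y != y_true} F(x, y) -- a stand-in for the
    # Viterbi-like decoding of Eq. (3); feasible only for tiny toy problems.
    best, best_score = None, float('-inf')
    for y in product(labels, repeat=length):
        if y != y_true and F(x, y) > best_score:
            best, best_score = y, F(x, y)
    return best, best_score

def working_set_training(examples, labels, resolve, F, max_iters=50, tol=1e-6):
    # Cutting-plane loop: grow per-example sets of negative labelings until
    # no margin constraint is violated; `resolve` re-optimizes the restricted
    # problem and returns the updated scoring function F.
    negatives = {n: set() for n in range(len(examples))}
    for _ in range(max_iters):
        added = False
        for n, (x, y_true) in enumerate(examples):
            y_bad, s_bad = decode_most_violated(F, x, y_true, labels, len(y_true))
            # violated if F(x, y_true) - F(x, y_bad) < 1 (slack omitted here)
            if F(x, y_true) - s_bad < 1 - tol and y_bad not in negatives[n]:
                negatives[n].add(y_bad)
                added = True
        if not added:
            break
        F = resolve(negatives)  # re-solve restricted problem (2)
    return F, negatives
```

With a fixed toy score and a no-op `resolve`, the loop collects the most violating labeling once and then terminates.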
It is essential that $\Phi$ is composed of mapping functions $\phi_{\sigma,\tau}$ that depend only on the labels at positions $i$ and $i+1$, on $x$, and on $i$. We can rewrite $F$ using $w = [w_{\sigma,\tau}]_{\sigma,\tau \in \Sigma}$:

  $F(x, y) = \left\langle w, \left[ \sum_{i=1}^{l(x)} \phi_{\sigma,\tau}(\sigma_i, \sigma_{i+1}, x, i) \right]_{\sigma,\tau} \right\rangle = \sum_{i=1}^{l(x)} \underbrace{\sum_{\sigma,\tau} \langle w_{\sigma,\tau}, \phi_{\sigma,\tau}(\sigma_i, \sigma_{i+1}, x, i) \rangle}_{=: \, g(\sigma_i, \sigma_{i+1}, x, i)}.$   (4)

Thus we have positionally decomposed the function $F$: the score at position $i+1$ only depends on $x$, $i$ and the labels at positions $i$ and $i+1$ (Markov property). Using this decomposition we can define

  $V(i, \tau) := \begin{cases} \max_{\sigma} \big( V(i-1, \sigma) + g(\sigma, \tau, x, i-1) \big) & i > 1 \\ 0 & \text{otherwise} \end{cases}$

as the maximal score of all labelings with label $\tau$ at position $i$. Via dynamic programming one can compute $\max_{\tau} V(l(x), \tau)$, which can be proven to solve (1) for the considered case. Moreover, using backtracking one can recover the optimal label sequence.3 The above decoding algorithm requires evaluating $g$ at most $|\Sigma|^2 \, l(x)$ times. Since computing $g$ involves potentially large sums of kernel functions, the decoding step can be computationally quite demanding, depending on the kernels and the number of examples.

2.4 Extension to Hidden Semi-Markov SVMs

Semi-Markov models extend hidden Markov models by allowing each state to persist for a non-unit number $d_i$ of symbols. Only after that does the system transition to a new state, which depends only on $x$ and the current state. During the interval $(i, i + d_i)$ the behavior of the system may be non-Markovian [14]. Semi-Markov models are fairly common in certain applications of statistics [6, 7] and are also used in reinforcement learning [16]. Moreover, [15, 9] previously proposed an extension of HMMs, called Generalized HMMs (GHMMs), that is very similar to the ideas above. Also, [14] proposed a semi-Markov extension to Conditional Random Fields.

In this work we extend Hidden Markov SVMs to Hidden Semi-Markov SVMs by considering sequences of segments instead of simple label sequences. We extend the definition of a labeling with $s$ segments to $y = (p_1, \sigma_1), (p_2, \sigma_2), \ldots, (p_s, \sigma_s)$, where $p_j$ is the start position of segment $j$ and $\sigma_j$ its label.4 We assume $p_1 = 1$ and let $p_j = p_{j-1} + d_{j-1}$, where $d_j$ is the length of segment $j$. To simplify the notation we define $p_{s+1} := l(x) + 1$, $s := s(y)$ to be the number of segments in $y$, and $\sigma_{s+1} := \tau$. We can now generalize the mapping $\Phi$ to:

  $\Phi(x, y) = \left[ \sum_{j=1}^{s(y)} \phi_{\sigma,\tau}(\sigma_j, \sigma_{j+1}, x, p_j, p_{j+1}) \right]_{\sigma,\tau \in \Sigma}.$

Footnote 2: We define $\sigma_{l+1} := \tau$ to keep the notation simple.
Footnote 3: Note that one can extend the outlined decoding algorithm to produce not only the best path, but the $K$ best paths. The 2nd best path may be required to compute the argmax in (3). The idea is to duplicate the tables $K$ times as follows:

  $V(i, \tau, k) := \begin{cases} \max^{(k)}_{\sigma,\, k' = 1, \ldots, K} \big( V(i-1, \sigma, k') + g(\sigma, \tau, x, i-1) \big) & i > 1 \\ 0 & \text{otherwise,} \end{cases}$

where $\max^{(k)}$ is the function computing the $k$-th largest number (and is $-\infty$ if there are fewer numbers). $V(i, \tau, k)$ now is the $k$-best score of labelings with label $\tau$ at position $i$.
Footnote 4: For simplicity, we associate the label of a segment with the signal at the boundary to the next segment. A generalization is straightforward.

With this definition we can extract features from segments: as $p_j$ and $p_{j+1}$ are given, one can for instance compute the length of the segment or other features that depend on the start and the end of the segment. Decomposing $F$ results in:

  $F(x, y) = \sum_{j=1}^{s(y)} \underbrace{\sum_{\sigma,\tau} \langle w_{\sigma,\tau}, \phi_{\sigma,\tau}(\sigma_j, \sigma_{j+1}, x, p_j, p_{j+1}) \rangle}_{=: \, g(\sigma_j, \sigma_{j+1}, x, p_j, p_{j+1})}.$   (5)

Analogously we can extend the formula for the Viterbi-like decoding algorithm [14]:

  $V(i, \tau) := \begin{cases} \max_{\sigma,\, d = 1, \ldots, \min(i-1, S)} \big( V(i-d, \sigma) + g(\sigma, \tau, x, i-d, i) \big) & i > 1 \\ 0 & \text{otherwise,} \end{cases}$   (6)

where $S$ is the maximal segment length and $\max_{\tau} V(l(x), \tau)$ is the score of the best segment labeling. The function $g$ needs to be evaluated at most $|\Sigma|^2 \, l(x) \, S$ times. The optimal label sequence can be obtained as before by backtracking. Also, the above method can be easily extended to produce the $K$ best labelings (cf. 
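As an illustration of recursion (6), here is a small, self-contained Python sketch of semi-Markov Viterbi decoding with backtracking. This is our own toy code, not the authors' implementation; `g` is any segment scoring function and positions are 1-based as in the text.

```python
def semi_markov_viterbi(x, states, g, S):
    # Semi-Markov Viterbi decoding (cf. Eq. (6)): V[(i, s)] is the best score
    # of a segmentation ending at position i in state s; segments are at most
    # S symbols long. Returns (best_score, segments), where each segment is a
    # triple (start, end, state).
    L = len(x)
    V = {(1, s): 0.0 for s in states}
    back = {}
    for i in range(2, L + 1):
        for s in states:
            best, arg = float('-inf'), None
            for d in range(1, min(i - 1, S) + 1):   # segment length d <= S
                for sp in states:
                    cand = V[(i - d, sp)] + g(sp, s, x, i - d, i)
                    if cand > best:
                        best, arg = cand, (i - d, sp)
            V[(i, s)] = best
            back[(i, s)] = arg
    s = max(states, key=lambda t: V[(L, t)])        # best final state
    score, segs, i = V[(L, s)], [], L
    while i > 1:                                    # recover the segmentation
        j, sp = back[(i, s)]
        segs.append((j, i, s))
        i, s = j, sp
    return score, list(reversed(segs))
```

As in the text, `g` is evaluated at most $|\Sigma|^2 \, l(x) \, S$ times; a K-best variant would duplicate the `V` table as in Footnote 3.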
Footnote 3).

3 An Algorithm for Large Scale Learning

3.1 Preliminaries

In this section we consider a specific case that is relevant for the application we have in mind. The idea is that the feature map should contain information about segments, such as their length or content, as well as about segment boundaries, which may exhibit certain detectable signals. For simplicity we assume that it is sufficient to consider the string $x_{p_j .. p_{j+1}} := x_{p_j} x_{p_j + 1} \ldots x_{p_{j+1} - 2} x_{p_{j+1} - 1}$ for extracting content information about segment $j$. Also, for considering signals we assume it to be sufficient to consider a window of size $\delta$ around the end of the segment, i.e. we only consider $x_{p_{j+1} \pm \delta} := x_{p_{j+1} - \delta} \ldots x_{p_{j+1} + \delta}$. To keep the notation simple we do not consider signals at the start of the segment. Moreover, we assume for simplicity that $x_{\rho \pm \delta}$ is appropriately defined for every $\rho = 1, \ldots, l(x)$. We may therefore define the following feature map:

  $\Phi(x, y) = \begin{bmatrix} \Big[ \sum_{j=1}^{s(y)} [[\sigma_j = \sigma]] \, [[\sigma_{j+1} = \tau]] \, \phi_c(x_{p_j .. p_{j+1}}) \Big]_{\sigma,\tau \in \Sigma} & \text{(content)} \\ \Big[ \sum_{j=1}^{s(y)} [[\sigma_{j+1} = \tau]] \, \phi_s(x_{p_{j+1} \pm \delta}) \Big]_{\tau \in \Sigma} & \text{(signal)} \end{bmatrix}$

where $[[\text{true}]] = 1$ and $0$ otherwise. The kernel between two examples using this feature map can then be written as

  $k((x, y), (x', y')) = \sum_{\sigma,\tau} \; \sum_{\substack{j:\, (\sigma_j, \sigma_{j+1}) = (\sigma, \tau) \\ j':\, (\sigma'_{j'}, \sigma'_{j'+1}) = (\sigma, \tau)}} k_c(x_{p_j .. p_{j+1}}, x'_{p'_{j'} .. p'_{j'+1}}) \; + \; \sum_{\tau} \; \sum_{\substack{j:\, \sigma_{j+1} = \tau \\ j':\, \sigma'_{j'+1} = \tau}} k_s(x_{p_{j+1} \pm \delta}, x'_{p'_{j'+1} \pm \delta}),$

where $k_c(\cdot, \cdot) := \langle \phi_c(\cdot), \phi_c(\cdot) \rangle$ and $k_s(\cdot, \cdot) := \langle \phi_s(\cdot), \phi_s(\cdot) \rangle$. This formulation has the benefit of keeping the signal and content kernels separated for each label, which we can exploit for rewriting $F(x, y)$:

  $F(x, y) = \sum_{\sigma,\tau} \sum_{j:\, (\sigma_j, \sigma_{j+1}) = (\sigma, \tau)} F_{\sigma,\tau}(x_{p_j .. p_{j+1}}) + \sum_{\tau} \sum_{j:\, \sigma_{j+1} = \tau} F_{\tau}(x_{p_{j+1} \pm \delta}),$

where

  $F_{\sigma,\tau}(\cdot) := \sum_{n=1}^{N} \sum_{y \in Y} \alpha_n(y) \sum_{j:\, (\sigma_j, \sigma_{j+1}) = (\sigma, \tau)} k_c(\cdot, x^n_{p_j .. p_{j+1}})$ and $F_{\tau}(\cdot) := \sum_{n=1}^{N} \sum_{y \in Y} \alpha_n(y) \sum_{j:\, \sigma_{j+1} = \tau} k_s(\cdot, x^n_{p_{j+1} \pm \delta}).$

Hence, we have partitioned $F(x, y)$ into $|\Sigma|^2 + |\Sigma|$ functions characterizing the contents and the signals.

3.2 Two-Stage Learning

By enumerating all non-zero $\alpha$'s and valid settings of $p_j$ in $F_{\tau}$ and $F_{\sigma,\tau}$, we can define sets of sequences $\{\pi^{\sigma,\tau}_m\}_{m=1,\ldots,M_{\sigma,\tau}}$ and $\{\pi^{\tau}_m\}_{m=1,\ldots,M_{\tau}}$, where every element is of the form $x^n_{p_j .. p_{j+1}}$ and $x^n_{p_{j+1} \pm \delta}$, respectively. Hence, $F_{\tau}$ and $F_{\sigma,\tau}$ can be rewritten as (single-sum) linear combinations of kernels: $F_{\sigma,\tau}(\cdot) := \sum_{m=1}^{M_{\sigma,\tau}} \beta^{\sigma,\tau}_m k_c(\cdot, \pi^{\sigma,\tau}_m)$ and $F_{\tau}(\cdot) := \sum_{m=1}^{M_{\tau}} \beta^{\tau}_m k_s(\cdot, \pi^{\tau}_m)$ for appropriately chosen $\beta$'s. For sequences $\pi^{\tau}_m$ that do not correspond to true segment boundaries, the coefficient $\beta^{\tau}_m$ is either negative or zero (since wrong segment boundaries can only appear in wrong labelings $y \neq y_n$ and $\alpha_n(y) \leq 0$). True segment boundaries in correct label sequences have non-negative $\beta^{\tau}_m$'s; analogously for the segments $\pi^{\sigma,\tau}_m$. Hence, we may interpret these functions as SVM classification functions recognizing segments and boundaries of all kinds.

Hidden Semi-Markov SVMs simultaneously optimize all these functions and also determine the relative importance of the different signals and sensors. In this work we propose to separate the learning of the content sensors and signal detectors from learning how they have to act together in order to produce the correct labeling. The idea is to train SVM-based classifiers $\tilde{F}_{\sigma,\tau}$ and $\tilde{F}_{\tau}$ using the kernels $k_c$ and $k_s$ on examples with known labeling. For every segment type and segment boundary we generate a set of positive examples from observed segments and boundaries. As negative examples we use all boundaries and segments that were not observed in a true labeling. This leads to a set of sequences that may potentially also appear in the expansions of $F_{\sigma,\tau}$ and $F_{\tau}$. However, the expansion coefficients $\tilde{\beta}^{\sigma,\tau}_m$ and $\tilde{\beta}^{\tau}_m$ are expected to be different, as the functions are estimated independently.
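The partitioned scoring of Section 3.1 can be made concrete with a tiny sketch (our own illustration, not the authors' code): the dictionaries of callables stand in for the trained content and signal functions $F_{\sigma,\tau}$ and $F_{\tau}$, evaluated once per segment and per boundary.

```python
def partitioned_score(segments, content_score, signal_score):
    # Score a segment labeling as in Section 3.1: one content-sensor term per
    # segment (indexed by the label pair at its ends) and one signal-detector
    # term per boundary. Each segment is (start, end, label, next_label).
    total = 0.0
    for start, end, label, next_label in segments:
        total += content_score[(label, next_label)](start, end)  # F_{sigma,tau}
        total += signal_score[next_label](end)                   # F_tau
    return total
```

In the two-stage scheme, these callables would be precomputed SVM outputs; the HSM SVM then only has to learn how to weight them against each other.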
The advantage of this approach is that solving two-class problems, for which we can reuse existing large scale learning methods, is much easier than solving the full HSM SVM problem. However, while the functions $\tilde{F}_{\sigma,\tau}$ and $\tilde{F}_{\tau}$ might recognize the same contents and signals as $F_{\sigma,\tau}$ and $F_{\tau}$, they are obtained independently from each other and might not be scaled correctly to jointly produce the correct labeling. We therefore propose to learn transformations $t_{\sigma,\tau}$ and $t_{\tau}$ such that $F_{\sigma,\tau}(\cdot) \approx t_{\sigma,\tau}(\tilde{F}_{\sigma,\tau}(\cdot))$ and $F_{\tau}(\cdot) \approx t_{\tau}(\tilde{F}_{\tau}(\cdot))$. The transformation functions $t : \mathbb{R} \to \mathbb{R}$ are one-dimensional mappings, and it seems fully sufficient to use, for instance, piece-wise linear functions (PLiFs) $p_{\theta}(\cdot) := \langle \psi(\cdot), \theta \rangle$ with fixed abscissa boundaries and $\theta$-parametrized ordinate values ($\psi(\cdot)$ can be appropriately defined). We may define the mapping $\Psi(x, y)$ for our case as

  $\Psi(x, y) = \begin{bmatrix} \Big[ \sum_{j=1}^{s(y)} [[\sigma_j = \sigma]] \, [[\sigma_{j+1} = \tau]] \, \psi_{\sigma,\tau}(\tilde{F}_{\sigma,\tau}(x_{p_j .. p_{j+1}})) \Big]_{\sigma,\tau \in \Sigma} \\ \Big[ \sum_{j=1}^{s(y)} [[\sigma_{j+1} = \tau]] \, \psi_{\tau}(\tilde{F}_{\tau}(x_{p_{j+1} \pm \delta})) \Big]_{\tau \in \Sigma} \end{bmatrix},$   (7)

where we simply replaced the features with PLiF features based on the outcomes of precomputed predictions. Note that $\Psi(x, y)$ has only $(|\Sigma|^2 + |\Sigma|) P$ dimensions, where $P$ is the number of support points used in the PLiFs. If the alphabet $\Sigma$ is reasonably small, then the dimensionality is low enough to solve the optimization problem (2) efficiently in the primal domain. In the next section we illustrate how to successfully apply a version of the outlined algorithm to a problem with several thousands of relatively long labeled sequences.

4 Application to Gene Structure Prediction

The problem of gene structure prediction is to segment nucleotide sequences (so-called pre-mRNA sequences generated by transcription; cf. Figure 1) into exons and introns. In a complex biochemical process called splicing, the introns are removed from the pre-mRNA sequence to form the mature mRNA sequence that can be translated into protein.
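Because a PLiF with fixed support points is linear in its ordinate values $\theta$, it can be written as $p_{\theta}(v) = \langle \psi(v), \theta \rangle$, which is what makes the combined problem linear in the parameters. A minimal numpy sketch of the feature vector $\psi$ (our own illustration; names and defaults are hypothetical):

```python
import numpy as np

def plif_features(v, support):
    # Feature vector psi(v) of a piece-wise linear function with fixed
    # support points: p_theta(v) = psi(v) . theta, linear in the ordinates.
    # Values outside the support are clipped to the boundary pieces.
    P = len(support)
    psi = np.zeros(P)
    if v <= support[0]:
        psi[0] = 1.0
    elif v >= support[-1]:
        psi[-1] = 1.0
    else:
        k = int(np.searchsorted(support, v))  # support[k-1] < v <= support[k]
        t = (v - support[k - 1]) / (support[k] - support[k - 1])
        psi[k - 1], psi[k] = 1.0 - t, t
    return psi

support = np.linspace(-5, 5, 30)   # P = 30 support points as in Sec. 4.3
theta = np.maximum(support, 0.0)   # some monotonically increasing ordinates
score = plif_features(1.3, support) @ theta  # linear interpolation of theta at v
```

Evaluating the PLiF this way agrees with ordinary linear interpolation of the ordinates, while exposing the parameters linearly for the optimizer.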
The exon-intron and intron-exon boundaries are defined by sequence motifs almost always containing the letters GT and AG, respectively (cf. Figure 1). However, these dimers appear very frequently, and one needs sophisticated methods to recognize true splice sites [21, 12, 13]. So far, mostly HMM-based methods such as Genscan [5], Snap [8] or ExonHunter [4] have been applied to this problem and also to the more difficult problem of gene finding. In this work we show that our newly developed method is applicable to this task and achieves very competitive results. We call it mSplicer.

Figure 2 illustrates the \"grammar\" that we use for gene structure prediction. We only require four different states (start, exon-end, exon-start and end) and two different segment labels (exon and intron). Biologically it makes sense to distinguish between first, internal, last and single exons, as their typical lengths are quite different. Each of these exon types corresponds to one transition in the model. States two and three recognize the two types of splice sites, and the transition between these states defines an intron. For our specific problem we only need signal detectors for segments ending in states two and three. In the next subsection we outline how we obtain $\tilde{F}_2$ and $\tilde{F}_3$. Additionally, we need content sensors for every possible transition. While the \"content\" of the different exon segments is essentially the same, their lengths can vary quite drastically. We therefore decided to use one content sensor $\tilde{F}_I$ for the intron transition $2 \to 3$ and the same content sensor $\tilde{F}_E$ for all four exon transitions $1 \to 2$, $1 \to 4$, $3 \to 2$ and $3 \to 4$. However, in order to capture the different length characteristics, we include

  $\Big[ \sum_{j=1}^{s(y)} [[\sigma_j = \sigma]] \, [[\sigma_{j+1} = \tau]] \, \psi_{\sigma,\tau}(p_{j+1} - p_j) \Big]_{\sigma,\tau \in \Sigma}$   (8)

in the feature map (7), which amounts to using PLiFs for the lengths of all transitions. Also, note that we can drop those features in (7) and (8) that correspond to transitions that are not allowed (e.g. $4 \to 1$; cf. Figure 2).5

We have obtained data for training, validation and testing from public sequence databases (see [13] for details). For the considered genome of C. elegans we have split the data into four different sets: Set 1 is used for training the splice site signal detectors and the two content sensors; Set 2 is used for model selection of these signal detectors and content sensors and for training the HSM SVM; Set 3 is used for model selection of the HSM SVM; and Set 4 is used for the final evaluation. These are large scale datasets with which current Hidden Markov SVMs are unable to deal: the C. elegans training set used for label sequence learning contains 1,536 sequences with an average length of about 2,300 base pairs and about 9 segments per sequence, and the splice site signal detectors were trained on more than a million examples. In principle it is possible to join Sets 1 and 2; however, the predictions of $\tilde{F}_{\sigma,\tau}$ and $\tilde{F}_{\tau}$ on the sequences used for the HSM SVM would then be skewed in the margin area (since the examples are pushed away from the decision boundary on the training set). We therefore keep the two sets separated.

4.1 Learning the Splice Site Signal Detectors

From the training sequences (Set 1) we extracted sequences of confirmed splice sites (intron starts and ends). For intron start sites we used a window of $[-80, +60]$ around the site; for intron end sites we used $[-60, +80]$. From the training sequences we also extracted non-splice sites, which lie within an exon or intron of the sequence and have the AG or GT consensus. We train an SVM [19] with soft margin using the WD kernel [12]:

  $k(x, x') = \sum_{j=1}^{d} \beta_j \sum_{i=1}^{l-j+1} [[x_{[i, i+j]} = x'_{[i, i+j]}]],$

where $l = 140$ is the length of the sequence, $x_{[a, b]}$ denotes the sub-string of $x$ from position $a$ to (excluding) $b$, and $\beta_j := d - j + 1$. We used the normalization $\tilde{k}(x, x') = \frac{k(x, x')}{\sqrt{k(x, x) \, k(x', x')}}$. This leads to the two discriminative functions $\tilde{F}_2$ and $\tilde{F}_3$.
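A direct, unoptimized implementation of the WD kernel and its normalization might look as follows. This is our own sketch for illustration; production implementations use much faster data structures, and the degree `d` here is an arbitrary default, not the value used in the paper.

```python
def wd_kernel(x, y, d=3):
    # Weighted degree kernel of Sec. 4.1: counts k-mers (k <= d) that match at
    # corresponding positions, weighted by beta_k = d - k + 1.
    assert len(x) == len(y)
    total = 0.0
    for k in range(1, d + 1):
        beta = d - k + 1
        total += beta * sum(x[i:i + k] == y[i:i + k] for i in range(len(x) - k + 1))
    return total

def wd_kernel_normalized(x, y, d=3):
    # Normalization of Sec. 4.1: k~(x, y) = k(x, y) / sqrt(k(x, x) * k(y, y)).
    return wd_kernel(x, y, d) / (wd_kernel(x, x, d) * wd_kernel(y, y, d)) ** 0.5
```

The normalization makes every sequence have unit self-similarity, so scores of sequences of different composition become comparable.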
All model parameters (including the window size) have been tuned on the validation set (Set 2). SVM training for C. elegans resulted in 79,000 and 61,233 support vectors for detecting intron start and end sites, respectively.

Footnote 5: We also excluded these transitions during the Viterbi-like algorithm.

Figure 1: The major steps in protein synthesis [10]. A transcript of a gene starts with an exon and may then be interrupted by an intron, followed by another exon, intron and so on, until it ends in an exon. In this work we learn the unknown formal mapping from the pre-mRNA to the mRNA.

Figure 2: An elementary state model for unspliced mRNA: the start is either directly followed by the end or by an arbitrary number of donor-acceptor splice site pairs.

4.2 Learning the Exon and Intron Content Sensors

To obtain the exon content sensor we derived a set of exons from the training set. As negative examples we used sub-sequences of intronic sequences, sampled such that both sets of strings have roughly the same length distribution. We trained SVMs using a variant of the Spectrum kernel [21] of degree d = 6, where we count 6-mers appearing at least once in both sequences. We applied the same normalization as in Sec. 4.1 and proceeded analogously for the intron content sensor. The model parameters have been obtained by tuning them on the validation set.

Note that the resulting content sensors $\tilde{F}_I$ and $\tilde{F}_E$ need to be evaluated several times during the Viterbi-like algorithm (cf. (6)): one needs to extend segments ending at the same position $i$ to several different starting points. By re-using the shorter segments' outputs this computation can be made drastically faster.

4.3 Combination

For datasets 2-4 we can precompute all candidate splice sites using the classifiers $\tilde{F}_2$ and $\tilde{F}_3$. We decided to use PLiFs with P = 30 support points and chose the boundaries for $\tilde{F}_2$, $\tilde{F}_3$, $\tilde{F}_E$ and $\tilde{F}_I$ uniformly between -5 and 5 (the typical range of outputs of our SVMs).
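Our reading of this Spectrum-kernel variant, namely the number of distinct 6-mers occurring at least once in both sequences, can be sketched in a few lines (our own illustration, not the authors' code):

```python
def shared_kmer_kernel(x, y, k=6):
    # Spectrum-kernel variant of Sec. 4.2 (as we read it): count the distinct
    # k-mers that occur at least once in both sequences.
    kmers = lambda s: {s[i:i + k] for i in range(len(s) - k + 1)}
    return float(len(kmers(x) & kmers(y)))
```

The same square-root normalization as in Sec. 4.1 can then be applied on top, and because the k-mer sets depend only on each sequence, they can be cached when segments sharing an endpoint are re-scored during decoding.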
For the PLiFs concerned with the lengths of segments we chose appropriate boundaries in the range 30-1000. With all these definitions the feature map in (7) and (8) is fully defined. The model has nine PLiFs as parameters, with a total of 270 parameters. Finally, we have modified the regularizer for our particular case, so that it favors smooth PLiFs:

  $P(w) := \sum_{\sigma,\tau} |w^{\sigma,\tau}_P - w^{\sigma,\tau}_1| + \sum_{\tau} |w^{\tau}_P - w^{\tau}_1| + \sum_{\sigma,\tau} \sum_{i=1}^{P-1} |w^{\sigma,\tau,l}_i - w^{\sigma,\tau,l}_{i+1}|,$

where $w = [(w^{\sigma,\tau}), (w^{\tau}), (w^{\sigma,\tau,l})]_{\sigma,\tau \in \Sigma}$, and we constrain the PLiFs for the signal and content sensors to be monotonically increasing.6 Having defined the feature map and the regularizer, we can now apply the HSM SVM algorithm outlined in Sections 2.4 and 3. Since the feature space is rather low dimensional (270 dimensions), we can solve the optimization problem in the primal domain, even with several thousands of examples, employing a standard optimizer (we used ILOG CPLEX and column generation) within a reasonable time.7

4.4 Results

To estimate the out-of-sample accuracy, we apply our method to the independent test dataset 4. For C. elegans we can compare it to ExonHunter8 on 1,177 test sequences. We greatly outperform the ExonHunter method: our method obtains almost 1/3 of the test error of ExonHunter (cf. Table 1). Simplifying the problem by only considering sequences between the start and stop codons allows us to also include SNAP in the comparison on dataset 4', a slightly modified version of dataset 4 with 1,138 sequences.9 The results are shown in Table 1. On dataset 4' the best competing method achieves an error rate of 9.8%, which is more than twice the error rate of our method.

5 Conclusion

We have extended the framework of Hidden Markov SVMs to Hidden Semi-Markov SVMs and suggested a very efficient two-stage learning algorithm to train an approximation to Hidden Semi-Markov SVMs.
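Under our reading of the regularizer above, the monotone score-PLiFs are charged their range $|w_P - w_1|$ (which, for a monotone PLiF, equals its total variation), while the length-PLiFs are charged their total variation directly. A small sketch of this computation (our own illustration, with hypothetical names):

```python
def plif_regularizer(monotone_plifs, length_plifs):
    # Smoothness regularizer of Sec. 4.3 (our reading): monotone score-PLiFs
    # contribute their range |w_P - w_1|; length-PLiFs contribute their total
    # variation sum_i |w_i - w_{i+1}|. Each PLiF is a list of ordinate values.
    reg = sum(abs(w[-1] - w[0]) for w in monotone_plifs)
    reg += sum(sum(abs(a - b) for a, b in zip(w, w[1:])) for w in length_plifs)
    return reg
```

Both terms are piece-wise linear in the parameters, so the whole problem remains solvable as a linear program in the primal domain.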
Moreover, we have successfully applied our method to a large scale gene structure prediction problem from computational biology, where it obtains less than half of the error rate of the best competing HMM-based method. Our predictions are available at Wormbase: http://www.wormbase.org. Additional data and results are available at the project's website http://www.fml.mpg.de/raetsch/projects/msplicer.

Footnote 6: This implements our intuition that large SVM scores should lead to larger scores for a labeling.
Footnote 7: It takes less than one hour to solve the HSM SVM problem with about 1,500 sequences on a single CPU. Training the content and signal detectors on several hundred thousand examples takes around 5 hours in total.
Footnote 8: The method was trained by its authors on the same training data.
Footnote 9: In this setup, additional biological information about the so-called \"open reading frame\" is used: as only a version of SNAP was available that uses this information, we incorporated this extra knowledge also into our model (marked *) and also used another version of ExonHunter that exploits this information, in order to allow a fair comparison.

Table 1: Rates of predicting a wrong gene structure (error rate), sensitivity (Sn) and specificity (Sp) on the exon and nucleotide (nt) levels (see e.g. [8]) for our method, ExonHunter and SNAP. Methods exploiting additional biological knowledge have an advantage and are marked with *.

C. elegans Dataset 4
Method       | error rate | exon Sn | exon Sp | nt Sn | nt Sp
Our Method   | 13.1%      | 96.7%   | 96.8%   | 98.9% | 97.2%
ExonHunter   | 36.8%      | 89.1%   | 88.4%   | 98.2% | 97.4%

C. elegans Dataset 4'
Method       | error rate | exon Sn | exon Sp | nt Sn | nt Sp
Our Method*  | 4.8%       | 98.9%   | 99.2%   | 99.2% | 99.9%
ExonHunter*  | 9.8%       | 97.9%   | 96.6%   | 99.4% | 98.1%
SNAP*        | 17.4%      | 95.0%   | 93.3%   | 99.0% | 98.9%

Acknowledgments: We thank K.-R. Müller, B. Schölkopf, E. Georgii, A. Zien, G. Schweikert and G. Zeller for inspiring discussions. The latter three we also thank for proofreading the manuscript. Moreover, we thank D.
Surendran for naming the piece-wise linear functions PLiFs and for optimizing the Viterbi implementation.

References

[1] Y. Altun, T. Hofmann, and A. Smola. Gaussian process classification for segmenting and annotating sequences. In Proc. ICML 2004, 2004.
[2] Y. Altun, D. McAllester, and M. Belkin. Maximum margin semi-supervised learning for structured variables. In Proc. NIPS 2005, 2006.
[3] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In T. Fawcett, editor, Proc. 20th Int. Conf. Mach. Learn., pages 3-10, 2003.
[4] B. Brejova, D.G. Brown, M. Li, and T. Vinar. ExonHunter: a comprehensive approach to gene finding. Bioinformatics, 21(Suppl 1):i57-i65, 2005.
[5] C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268:78-94, 1997.
[6] X. Ge. Segmental Semi-Markov Models and Applications to Sequence Analysis. PhD thesis, University of California, Irvine, 2002.
[7] J. Janssen and N. Limnios. Semi-Markov Models and Applications. Kluwer Academic, 1999.
[8] I. Korf. Gene finding in novel genomes. BMC Bioinformatics, 5(59), 2004.
[9] D. Kulp, D. Haussler, M.G. Reese, and F.H. Eeckman. A generalized hidden Markov model for the recognition of human genes in DNA. In Proc. ISMB 1996, pages 134-141, 1996.
[10] B. Lewin. Genes VII. Oxford University Press, New York, 2000.
[11] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-285, February 1989.
[12] G. Rätsch and S. Sonnenburg. Accurate splice site prediction for Caenorhabditis elegans. In B. Schölkopf, K. Tsuda, and J.-P. Vert, editors, Kernel Methods in Computational Biology. MIT Press, 2004.
[13] G. Rätsch, S. Sonnenburg, J. Srinivasan, H. Witte, K.-R. Müller, R. Sommer, and B. Schölkopf. Improving the C. elegans genome annotation using machine learning. PLoS Computational Biology, 2007. In press.
[14] S. Sarawagi and W.W. Cohen. Semi-Markov conditional random fields for information extraction. In Proc. NIPS 2004, 2005.
[15] G.D. Stormo and D. Haussler. Optimally parsing a sequence into different classes based on multiple types of information. In Proc. ISMB 1994, pages 369-375, Menlo Park, CA, 1994. AAAI/MIT Press.
[16] R. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181-211, 1999.
[17] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Proc. NIPS 2003, volume 16, 2004.
[18] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Large margin methods for structured output spaces. Journal of Machine Learning Research, 6, September 2005.
[19] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.
[20] A.J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inform. Theory, IT-13:260-269, April 1967.
[21] X.H. Zhang, K.A. Heller, I. Hefter, C.S. Leslie, and L.A. Chasin. Sequence information for the splicing of human pre-mRNA identified by SVM classification. Genome Res, 13(12):2637-50, 2003.
", "award": [], "sourceid": 2988, "authors": [{"given_name": "Gunnar", "family_name": "R\u00e4tsch", "institution": null}, {"given_name": "S\u00f6ren", "family_name": "Sonnenburg", "institution": null}]}