{"title": "Hidden Markov Model Induction by Bayesian Model Merging", "book": "Advances in Neural Information Processing Systems", "page_first": 11, "page_last": 18, "abstract": null, "full_text": "Hidden Markov Model Induction by Bayesian \n\nModel Merging \n\nAndreas Stolcke*'** \n\n*Computer Science Division \n\nUniversity of California \n\nBerkeley, CA 94720 \n\nstolcke@icsi.berkeley.edu \n\nStephen Omohundro\" \n\n**International Computer Science Institute \n\n1947 Center Street, Suite 600 \n\nBerkeley, CA 94704 \n\nom@icsi.berkeley.edu \n\nAbstract \n\nThis paper describes a technique for learning both the number of states and the \ntopology of Hidden Markov Models from examples. The induction process starts \nwith the most specific model consistent with the training data and generalizes \nby successively merging states. Both the choice of states to merge and the \nstopping criterion are guided by the Bayesian posterior probability. We compare \nour algorithm with the Baum-Welch method of estimating fixed-size models, and \nfind that it can induce minimal HMMs from data in cases where fixed estimation \ndoes not converge or requires redundant parameters to converge. \n\n1 INTRODUCTION AND OVERVIEW \n\nHidden Markov Models (HMMs) are a well-studied approach to the modelling of sequence \ndata. HMMs can be viewed as a stochastic generalization of finite-state automata, where \nboth the transitions between states and the generation of output symbols are governed by \nprobability distributions. HMMs have been important in speech recognition (Rabiner & \nJuang, 1986), cryptography, and more recently in other areas such as protein classification \nand alignment (Haussler, Krogh, Mian & SjOlander, 1992; Baldi, Chauvin, Hunkapiller & \nMcClure, 1993). \n\nPractitioners have typically chosen the HMM topology by hand, so that learning the HMM \nfrom sample data means estimating only a fixed number of model parameters. 
The standard approach is to find a maximum likelihood (ML) or maximum a posteriori probability (MAP) estimate of the HMM parameters. \n\n11 \n\n\f12 \n\nStolcke and Omohundro \n\nThe Baum-Welch algorithm uses dynamic programming to approximate these estimates (Baum, Petrie, Soules & Weiss, 1970). \n\nA more general problem is to additionally find the best HMM topology. This includes both the number of states and the connectivity (the non-zero transitions and emissions). One could exhaustively search the model space using the Baum-Welch algorithm on fully connected models of varying sizes, picking the model size and topology with the highest posterior probability. (Maximum likelihood estimation is not useful for this comparison since larger models usually fit the data better.) This approach is very costly, and Baum-Welch may get stuck at sub-optimal local maxima. Our comparative results later in the paper show that this often occurs in practice. The problem can be somewhat alleviated by sampling from several initial conditions, but at a further increase in computational cost. \n\nThe HMM induction method proposed in this paper tackles the structure learning problem in an incremental way. Rather than estimating a fixed-size model from scratch for various sizes, the model size is adjusted as new evidence arrives. There are two opposing tendencies in adjusting the model size and structure. Initially new data adds to the model size, because the HMM has to be augmented to accommodate the new samples. If enough data of a similar structure is available, however, the algorithm collapses the shared structure, decreasing the model size. The merging of structure is also what drives generalization, i.e., creates HMMs that generate data not seen during training. \n\nBeyond being incremental, our algorithm is data-driven, in that the samples themselves completely determine the initial model shape. 
Baum-Welch estimation, by comparison, uses an initially random set of parameters for a given-sized HMM and iteratively updates them until a point is found at which the sample likelihood is locally maximal. What seems intuitively troublesome with this approach is that the initial model is completely uninformed by the data. The sample data directs the model formation process only in an indirect manner as the model approaches a meaningful shape. \n\n2 HIDDEN MARKOV MODELS \n\nFor lack of space we cannot give a full introduction to HMMs here; see Rabiner & Juang (1986) for details. Briefly, an HMM consists of states and transitions like a Markov chain. In the discrete version considered here, it generates strings by performing random walks between an initial and a final state, outputting symbols at every state in between. The probability P(x|M) that a model M generates a string x is determined by the conditional probabilities of making a transition from one state to another and the probability of emitting each symbol from each state. Once these are given, the probability of a particular path through the model generating the string can be computed as the product of all transition and emission probabilities along the path. The probability of a string x is the sum of the probabilities of all paths generating x. \n\nFor example, the model M3 in Figure 1 generates the strings ab, abab, ababab, ... with probabilities 1/3, 2/3\u00b2, 2\u00b2/3\u00b3, ..., respectively. \n\n3 HMM INDUCTION BY STATE MERGING \n\n3.1 MODEL MERGING \n\nOmohundro (1992) has proposed an approach to statistical model inference in which initial models simply replicate the data and generalize by similarity. As more data is received, component models are fit from more complex model spaces. 
This allows the formation of arbitrarily complex models without overfitting along the way. The elementary step used in modifying the overall model is a merging of sub-models, collapsing the sample sets for the corresponding sample regions. The search for sub-models to merge is guided by an attempt to sacrifice as little of the sample likelihood as possible as a result of the merging process. This search can be done very efficiently if (a) a greedy search strategy can be used, and (b) likelihood computations can be done locally for each sub-model and don't require global recomputation on each model update. \n\n3.2 STATE MERGING IN HMMS \n\nWe have applied this general approach to the HMM learning task. We describe the algorithm here mostly by presenting an example. The details are available in Stolcke & Omohundro (1993). \n\nTo obtain an initial model from the data, we first construct an HMM which produces exactly the input strings. The start state has as many outgoing transitions as there are strings and each string is represented by a unique path with one state per sample symbol. The probability of entering these paths from the start state is uniformly distributed. Within each path there is a unique transition arc whose probability is 1. The emission probabilities are 1 for each state to produce the corresponding symbol. \n\nAs an example, consider the regular language (ab)+ and two samples drawn from it, the strings ab and abab. The algorithm constructs the initial model M0 depicted in Figure 1. This is the most specific model accounting for the observed data. It assigns each sample a probability equal to its relative frequency, and is therefore a maximum likelihood model for the data. \n\nLearning from the sample data means generalizing from it. This implies trading off model likelihood against some sort of bias towards 'simpler' models, expressed by a prior probability distribution over HMMs. 
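The construction of the most specific initial model described above can be sketched in a few lines. This is an editorial illustration, not the paper's implementation: the dict-based encoding, the function name, and the 'start' sentinel are assumptions.

```python
def initial_model(samples):
    """Most specific HMM: one disjoint path per sample string, entered
    from a common start state with uniform probability; every transition
    and emission along a path has probability 1."""
    trans = {'start': {}}   # state -> {successor: probability}
    emit = {}               # state -> {symbol: probability}
    final = set()           # states at which a string may end
    state = 0
    for x in samples:
        prev, p = 'start', 1.0 / len(samples)  # uniform over sample paths
        for sym in x:
            trans.setdefault(prev, {})[state] = p
            emit[state] = {sym: 1.0}
            prev, p, state = state, 1.0, state + 1
        final.add(prev)
    return trans, emit, final
```

For the samples ab and abab this yields the six-state model M0 of Figure 1: two disjoint paths entered with probability .5 each, so each sample receives its relative frequency as its likelihood.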
Bayesian analysis provides a formal basis for this tradeoff. Bayes' rule tells us that the posterior model probability P(M|x) is proportional to the product of the model prior P(M) and the likelihood of the data P(x|M). Smaller or simpler models will have a higher prior, and this can outweigh the drop in likelihood as long as the generalization is conservative and keeps the model close to the data. The choice of model priors is discussed in the next section. \n\nThe fundamental idea exploited here is that the initial model M0 can be gradually transformed into the generating model by repeatedly merging states. The intuition for this heuristic comes from the fact that if we take the paths that generate the samples in an actual generating HMM M and 'unroll' them to make them completely disjoint, we obtain M0. The iterative merging process, then, is an attempt to undo the unrolling, tracing a search through the model space back to the generating model. \n\nMerging two states q1 and q2 in this context means replacing q1 and q2 by a new state r with a transition distribution that is a weighted mixture of the transition probabilities of q1, q2, and with a similar mixture distribution for the emissions. Transition probabilities into q1 or q2 are added up and redirected to r. The weights used in forming the mixture distributions are the relative frequencies with which q1 and q2 are visited in the current model. \n\nFigure 1: Sequence of models obtained by merging samples {ab, abab}. All transitions without special annotations have probability 1; output symbols appear above their respective states and also carry an implicit probability of 1. For each model the log likelihood is given; e.g. log L(x|M0) = -1.39 and log L(x|M1) = log L(x|M0). \n\nRepeatedly performing such merging operations yields a sequence of models M0, M1, M2, ..., along which we can search for the MAP model. To make the search for M efficient, we use a greedy strategy: given Mi, choose a pair of states for merging that maximizes P(Mi+1|X). \n\nContinuing with the previous example, we find that states 1 and 3 in M0 can be merged without penalizing the likelihood. This is because they have identical outputs and the loss due to merging the outgoing transitions is compensated by the merging of the incoming transitions. The .5/.5 split is simply transferred to the outgoing transitions of the merged state. The same situation obtains for states 2 and 4 once 1 and 3 are merged. From these two first merges we get model M1 in Figure 1. By convention we reuse the smaller of two state indices to denote the merged state. \n\nAt this point the best merge turns out to be between states 2 and 6, giving model M2. However, there is a penalty in likelihood, which decreases to about .59 of its previous value. Under all the reasonable priors we considered (see below), the posterior model probability still increases due to an increase in the prior. Note that the transition probability ratio at state 2 is now 2/1, since two samples make use of the first transition, whereas only one takes the second transition. \n\nFinally, states 1 and 5 can be merged without penalty to give M3, the minimal model that generates (ab)+. Further merging at this point would reduce the likelihood by three orders of magnitude. The resulting decrease in the posterior probability tells the algorithm to stop at this point. \n\n3.3 MODEL PRIORS \n\nAs noted previously, the likelihoods P(X|Mi) along the sequence of models considered by the algorithm are monotonically decreasing. The prior P(M) must account for an overall increase in posterior probability, and is therefore the driving force behind generalization. 
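The merging operation of Section 3.2 (mix the outgoing transition and emission distributions of the two states, weighted by their visit frequencies, and redirect incoming transitions to the merged state) can be sketched as follows. The dict-based encoding and the `counts` bookkeeping structure are editorial assumptions, not the paper's data structures.

```python
def merge_states(trans, emit, counts, q1, q2):
    """Merge state q2 into q1 in place. trans: state -> {successor: p};
    emit: state -> {symbol: p}; counts[q]: visit frequency of state q,
    used as the mixture weight."""
    w1 = counts[q1] / (counts[q1] + counts[q2])
    w2 = 1.0 - w1
    # Weighted mixture of the two outgoing transition distributions.
    merged_out = {}
    for q, dist in ((q1, w1), (q2, w2)):
        for t, p in trans.get(q, {}).items():
            merged_out[t] = merged_out.get(t, 0.0) + dist * p
    trans[q1] = merged_out
    trans.pop(q2, None)
    # Same mixture for the emission distributions.
    merged_emit = {}
    for q, dist in ((q1, w1), (q2, w2)):
        for sym, p in emit[q].items():
            merged_emit[sym] = merged_emit.get(sym, 0.0) + dist * p
    emit[q1] = merged_emit
    del emit[q2]
    # Incoming probabilities into q1 or q2 are added up and redirected.
    for outs in trans.values():
        if q2 in outs:
            outs[q1] = outs.get(q1, 0.0) + outs.pop(q2)
    counts[q1] += counts.pop(q2)
```

Note the bug to avoid here: a transition q1 -> q2 must end up as a self-loop on the merged state, which the redirection loop handles because the merged outgoing distribution is itself scanned for q2.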
\n\nAs in the work on Bayesian learning of classification trees by Buntine (1992), we can split the prior P(M) into a term accounting for the model structure, P(Ms), and a term for the adjustable parameters in a fixed structure, P(Mp|Ms). \n\nWe initially relied on the structural prior only, incorporating an explicit bias towards smaller models. Size here is some function of the number of states and/or transitions, |M|. Such a prior can be obtained by making P(Ms) \u221d e^(-|M|), and can be viewed as a description length prior that penalizes models according to their coding length (Rissanen, 1983; Wallace & Freeman, 1987). The constants in this \"MDL\" term had to be adjusted by hand from examples of 'desirable' generalization. \n\nFor the parameter prior P(Mp|Ms), it is standard practice to apply some sort of smoothing or regularizing prior to avoid overfitting the model parameters. Since both the transition and the emission probabilities are given by multinomial distributions, it is natural to use a Dirichlet conjugate prior in this case (Berger, 1985). The effect of this prior is equivalent to having a number of 'virtual' samples for each of the possible transitions and emissions which are added to the actual samples when it comes to estimating the most likely parameter settings. In our case, the virtual samples made equal use of all potential transitions and emissions, adding a bias towards uniform transition and emission probabilities. \n\nWe found that the Dirichlet priors by themselves produce an implicit bias towards smaller models, a phenomenon that can be explained as follows. The prior alone results in a model with uniform, flat distributions. Adding actual samples has the effect of putting bumps into the posterior distributions, so as to fit the data. The more samples are available, the more peaked the posteriors will get around the maximum likelihood estimates of the parameters, increasing the MAP value. 
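The 'virtual sample' reading of the Dirichlet prior amounts to adding a constant pseudo-count to each observed transition or emission count before normalizing. A minimal sketch (the function name and encoding are editorial assumptions; 0.1 is the per-outcome virtual count reported in the experiments below):

```python
def map_estimate(counts, outcomes, virtual=0.1):
    """Estimate a multinomial from observed counts under a symmetric
    Dirichlet prior, treated as `virtual` pseudo-samples per outcome
    added to the actual samples before normalizing."""
    total = sum(counts.get(o, 0) for o in outcomes) + virtual * len(outcomes)
    return {o: (counts.get(o, 0) + virtual) / total for o in outcomes}
```

With few real samples the virtual counts dominate and the estimate stays near uniform; as real counts accumulate (e.g. when merging pools the evidence of several states into one), the estimate sharpens around the observed relative frequencies.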
In estimating HMM parameters, what counts is not the total number of samples, but the number of samples per state, since transition and emission distributions are local to each state. As we merge states, the available evidence gets shared by fewer states, thus allowing the remaining states to produce a better fit to the data. \n\nThis phenomenon is similar, but not identical, to the Bayesian 'Occam factors' that prefer models with fewer parameters (MacKay, 1992). Occam factors are a result of integrating the posterior over the parameter space, something which we do not do because of the computational complications it introduces in HMMs (see below). \n\n3.4 APPROXIMATIONS \n\nAt each iteration step, our algorithm evaluates the posterior resulting from every possible merge in the current HMM. To keep this procedure feasible, a number of approximations are incorporated in the implementation that don't seem to affect its qualitative properties. \n\n\u2022 For the purpose of likelihood computation, we consider only the most likely path through the model for a given sample string (the Viterbi path). This allows us to express the likelihood in product form, computable from sufficient statistics for each transition and emission. \n\n\u2022 We assume the Viterbi paths are preserved by the merging operation, that is, the paths previously passing through the merged states now go through the resulting new state. This allows us to update the sufficient statistics incrementally, and means only O(number of states) likelihood terms need to be recomputed. \n\n\u2022 The posterior probability of the model structure is approximated by the posterior of the MAP estimates for the model parameters. Rigorously integrating over all parameter values is not feasible since varying even a single parameter could change the paths of all samples through the HMM. 
\n\n\u2022 Finally, it has to be kept in mind that our search procedure along the sequence of merged models finds only local optima, since we stop as soon as the posterior starts to decrease. A full search of the space would be much more costly. However, we found a best-first look-ahead strategy to be sufficient in the rare cases where a local maximum caused a problem. In those cases we continue merging along the best-first path for a fixed number of steps (typically one) to check whether the posterior has undergone just a temporary decrease. \n\n4 EXPERIMENTS \n\nWe have used various artificial finite-state languages to test our algorithm and compare its performance to the standard Baum-Welch algorithm. Table 1 summarizes the results on the two sample languages ac*a \u2228 bc*b and a+b+a+b+. The first of these contains a contingency between initial and final symbols that can be hard for learning algorithms to uncover. \n\nWe used no explicit model size prior in our experiments after we found that the Dirichlet prior was very robust in giving just the right amount of bias toward smaller models.\u00b9 Summarizing the results, we found that merging very reliably found the generating model structure from a very small number of samples. The parameter values are determined by the sample set statistics. \n\nThe Baum-Welch algorithm, much like a backpropagation network, may be sensitive to its random initial parameter settings. We therefore sampled from a number of initial conditions. Interestingly, we found that Baum-Welch has a good chance of settling into a suboptimal HMM structure, especially if the number of states is the minimal number required for the target language. It proved much easier to estimate correct language models when extra states were provided. Also, increasing the sample size helped it converge to the target model. \n\n5 RELATED WORK \n\nOur approach is related to several other approaches in the literature. 
\n\n\u00b9 The number of 'virtual' samples per transition/emission was held constant at 0.1 throughout. \n\n(a) Method | Sample | Entropy | Cross-entropy | n | Language \nMerging | 8 m.p. | 2.295 | 2.188 \u00b1 .020 | 6 | ac*a \u2228 bc*b \nMerging | 20 random | 2.087 | 2.158 \u00b1 .033 | 6 | ac*a \u2228 bc*b \nBaum-Welch (10 trials) | 8 m.p. | 2.087 | 2.894 \u00b1 .023 (best) | 6 | (a \u2228 b)c*(a \u2228 b) \nBaum-Welch (10 trials) | 8 m.p. | 2.773 | 4.291 \u00b1 .228 (worst) | 6 | (a \u2228 b)c*(a \u2228 b) \nBaum-Welch (10 trials) | 20 random | 2.087 | 2.105 \u00b1 .031 (best) | 6 | ac*a \u2228 bc*b \nBaum-Welch (10 trials) | 20 random | 2.775 | 2.825 \u00b1 .031 (worst) | 6 | (a \u2228 b)c*(a \u2228 b) \nBaum-Welch | 8 m.p. | 2.384 | 3.914 \u00b1 .271 | 10 | ac*a \u2228 bc*b \nBaum-Welch | 20 random | 2.085 | 2.155 \u00b1 .032 | 10 | ac*a \u2228 bc*b \n\n(b) Method | Sample | Entropy | Cross-entropy | n | Language \nMerging | 5 m.p. | 2.163 | 7.678 \u00b1 .158 | 4 | a+b+a+b+ \nBaum-Welch (3 trials) | 5 m.p. | 3.545 | 8.963 \u00b1 .161 (best) | 4 | (a+b+)+ \nBaum-Welch (3 trials) | 5 m.p. | 3.287 | 59.663 \u00b1 .007 (worst) | 4 | (a+b+)+ \nMerging | 10 random | 5.009 | 5.623 \u00b1 .074 | 4 | a+b+a+b+ \nBaum-Welch (3 trials) | 10 random | 5.009 | 5.688 \u00b1 .076 (best) | 4 | a+b+a+b+ \nBaum-Welch (3 trials) | 10 random | 6.109 | 8.395 \u00b1 .137 (worst) | 4 | (a+b+)+ \n\nTable 1: Results for merging and Baum-Welch on two regular languages: (a) ac*a \u2228 bc*b and (b) a+b+a+b+. Samples were either the top most probable (m.p.) ones from the target language, or a set of randomly generated ones. 'Entropy' is the average negative log probability on the training set, whereas 'cross-entropy' refers to the empirical cross-entropy between the induced model and the generating model (the lower, the better generalization). n denotes the final number of model states for merging, or the fixed model size for Baum-Welch. For Baum-Welch, both best and worst performance over several initial conditions is listed. \n\nThe concept of state merging is implicit in the notion of state equivalence classes, which is fundamental to much of automata theory (Hopcroft & Ullman, 1979) and has been applied to automata learning as well (Angluin & Smith, 1983). \n\nTomita (1982) is an example of finite-state model space search guided by a (non-probabilistic) goodness measure. \n\nHorning (1969) describes a Bayesian grammar induction procedure that searches the model space exhaustively for the MAP model. The procedure provably finds the globally optimal grammar in finite time, but is infeasible in practice because of its enumerative character. \n\nThe incremental augmentation of the HMM by merging in new samples has some of the flavor of the algorithm used by Porat & Feldman (1991) to induce a finite-state model from positive-only, ordered examples. \n\nHaussler et al. (1992) use limited HMM 'surgery' (insertions and deletions in a linear HMM) to adjust the model size to the data, while keeping the topology unchanged. \n\n6 FURTHER RESEARCH \n\nWe are investigating several real-world applications for our method. One task is the construction of unified multiple-pronunciation word models for speech recognition. This is currently being carried out in collaboration with Chuck Wooters at ICSI, and it appears that our merging algorithm is able to produce linguistically adequate phonetic models. \n\nAnother direction involves an extension of the model space to stochastic context-free grammars, for which a standard estimation method analogous to Baum-Welch exists (Lari & Young, 1990). The notions of sample incorporation and merging carry over to this domain (with merging now involving the non-terminals of the CFG), but need to be complemented with a mechanism that adds new non-terminals to create hierarchical structure (which we call chunking). 
\n\nAcknowledgements \n\nWe would like to thank Peter Cheeseman, Wray Buntine, David Stoutamire, and Jerry Feldman for helpful discussions of the issues in this paper. \n\nReferences \n\nAngluin, D. & Smith, C. H. (1983), 'Inductive inference: Theory and methods', ACM Computing Surveys 15(3), 237-269. \n\nBaldi, P., Chauvin, Y., Hunkapiller, T. & McClure, M. A. (1993), 'Hidden Markov Models in molecular biology: New algorithms and applications', this volume. \n\nBaum, L. E., Petrie, T., Soules, G. & Weiss, N. (1970), 'A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains', The Annals of Mathematical Statistics 41(1), 164-171. \n\nBerger, J. O. (1985), Statistical Decision Theory and Bayesian Analysis, Springer-Verlag, New York. \n\nBuntine, W. (1992), Learning classification trees, in D. J. Hand, ed., 'Artificial Intelligence Frontiers in Statistics: AI and Statistics III', Chapman & Hall. \n\nHaussler, D., Krogh, A., Mian, I. S. & Sj\u00f6lander, K. (1992), Protein modeling using hidden Markov models: Analysis of globins, Technical Report UCSC-CRL-92-23, Computer and Information Sciences, University of California, Santa Cruz, Ca. Revised Sept. 1992. \n\nHopcroft, J. E. & Ullman, J. D. (1979), Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, Reading, Mass. \n\nHorning, J. J. (1969), A study of grammatical inference, Technical Report CS 139, Computer Science Department, Stanford University, Stanford, Ca. \n\nLari, K. & Young, S. J. (1990), 'The estimation of stochastic context-free grammars using the Inside-Outside algorithm', Computer Speech and Language 4, 35-56. \n\nMacKay, D. J. C. (1992), 'Bayesian interpolation', Neural Computation 4, 415-447. \n\nOmohundro, S. M. (1992), Best-first model merging for dynamic learning and recognition, Technical Report TR-92-004, International Computer Science Institute, Berkeley, Ca. \n\nPorat, S. & Feldman, J. A. (1991), 'Learning automata from ordered examples', Machine Learning 7, 109-138. \n\nRabiner, L. R. & Juang, B. H. (1986), 'An introduction to Hidden Markov Models', IEEE ASSP Magazine 3(1), 4-16. \n\nRissanen, J. (1983), 'A universal prior for integers and estimation by minimum description length', The Annals of Statistics 11(2), 416-431. \n\nStolcke, A. & Omohundro, S. (1993), Best-first model merging for Hidden Markov Model induction, Technical Report TR-93-003, International Computer Science Institute, Berkeley, Ca. \n\nTomita, M. (1982), Dynamic construction of finite automata from examples using hill-climbing, in 'Proceedings of the 4th Annual Conference of the Cognitive Science Society', Ann Arbor, Mich., pp. 105-108. \n\nWallace, C. S. & Freeman, P. R. (1987), 'Estimation and inference by compact coding', Journal of the Royal Statistical Society, Series B 49(3), 240-265. \n", "award": [], "sourceid": 669, "authors": [{"given_name": "Andreas", "family_name": "Stolcke", "institution": null}, {"given_name": "Stephen", "family_name": "Omohundro", "institution": null}]}