{"title": "Clustering Sequences with Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 648, "page_last": 654, "abstract": null, "full_text": "Clustering Sequences with Hidden \n\nMarkov Models \n\nPadhraic Smyth \n\nInformation and Computer Science \n\nUniversity of California, Irvine \n\nCA 92697-3425 \n\nsmyth~ics.uci.edu \n\nAbstract \n\nThis paper discusses a probabilistic model-based approach to clus(cid:173)\ntering sequences, using hidden Markov models (HMMs) . The prob(cid:173)\nlem can be framed as a generalization of the standard mixture \nmodel approach to clustering in feature space. Two primary issues \nare addressed. First, a novel parameter initialization procedure is \nproposed, and second, the more difficult problem of determining \nthe number of clusters K, from the data, is investigated. Experi(cid:173)\nmental results indicate that the proposed techniques are useful for \nrevealing hidden cluster structure in data sets of sequences. \n\nIntroduction \n\n1 \nConsider a data set D consisting of N sequences, D = {SI,\"\" SN}' Si = \n(.f.L ... .f.~J is a sequence of length Li composed of potentially multivariate fea(cid:173)\nture vectors.f.. The problem addressed in this paper is the discovery from data of a \nnatural grouping of the sequences into K clusters. This is analagous to clustering in \nmultivariate feature space which is normally handled by methods such as k-means \nand Gaussian mixtures. Here, however, one is trying to cluster the sequences S \nrather than the feature vectors.f.. As an example Figure 1 shows four sequences \nwhich were generated by two different models (hidden Markov models in this case) . \nThe first and third came from a model with \"slower\" dynamics than the second and \nfourth (details will be provided later). 
The sequence clustering problem consists of being given sample sequences such as those in Figure 1 and inferring from the data what the underlying clusters are. This is non-trivial since the sequences can be of different lengths and it is not clear what a meaningful distance metric is for sequence comparison. \n\nFigure 1: Which sequences came from which hidden Markov model? \n\nThe use of hidden Markov models for clustering sequences appears to have first been mentioned in Juang and Rabiner (1985) and was subsequently used in the context of discovering subfamilies of protein sequences in Krogh et al. (1994). The present paper contains two new contributions in this context: a cluster-based method for initializing the model parameters and a novel method based on cross-validated likelihood for determining automatically how many clusters to fit to the data. \n\nA natural probabilistic model for this problem is that of a finite mixture model: \n\nf_K(S) = \sum_{j=1}^{K} f_j(S | \theta_j) p_j    (1) \n\nwhere S denotes a sequence, p_j is the weight of the jth model, and f_j(S | \theta_j) is the density function for the sequence data S given the component model f_j with parameters \theta_j. Here we will assume that the f_j's are HMMs: thus, the \theta_j's are the transition matrices, observation density parameters, and initial state probabilities, all for the jth component. f_j(S | \theta_j) can be computed via the forward part of the forward-backward procedure. More generally, the component models could be any probabilistic model for S, such as linear autoregressive models, graphical models, non-linear networks with probabilistic semantics, and so forth. \n\nIt is important to note that the motivation for this problem comes from the goal of building a descriptive model for the data, rather than prediction per se. 
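As a concrete aside on Equation 1, the per-component likelihoods f_j(S | θ_j) can be evaluated with the scaled forward recursion and combined in log space. The sketch below is ours, not from the paper, and assumes a discrete observation alphabet with a (π, A, B) parameterization:

```python
import math

def forward_loglik(seq, pi, A, B):
    """Scaled forward recursion: log f(S | theta) for a discrete-observation
    HMM (initial probs pi, transition matrix A, emissions B[state][symbol])."""
    m = len(pi)
    alpha = [pi[s] * B[s][seq[0]] for s in range(m)]
    c = sum(alpha)
    loglik = math.log(c)
    alpha = [a / c for a in alpha]           # rescale to avoid underflow
    for o in seq[1:]:
        alpha = [B[s][o] * sum(alpha[r] * A[r][s] for r in range(m))
                 for s in range(m)]
        c = sum(alpha)
        loglik += math.log(c)                # accumulate the scaling factors
        alpha = [a / c for a in alpha]
    return loglik

def mixture_loglik(seq, components, weights):
    """log f_K(S): log-sum-exp of log p_j + log f_j(S | theta_j) over j."""
    logs = [math.log(p) + forward_loglik(seq, *theta)
            for p, theta in zip(weights, components)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(v - mx) for v in logs))
```

A Gaussian-emission version only changes how `B[s][o]` is computed; the recursion itself is identical.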
For the prediction problem there is a clearly defined metric for performance, namely average prediction error on out-of-sample data (cf. Rabiner et al. (1989) in a speech context with clusters of HMMs and Zeevi, Meir, and Adler (1997) in a general time-series context). In contrast, for descriptive modeling it is not always clear what the appropriate metric for evaluation is, particularly when K, the number of clusters, is unknown. In this paper a density estimation viewpoint is taken and the likelihood of out-of-sample data is used as the measure of the quality of a particular model. \n\n2 An Algorithm for Clustering Sequences into K Clusters \n\nAssume first that K, the number of clusters, is known. Our model is that of a mixture of HMMs as in Equation 1. We can immediately observe that this mixture can itself be viewed as a single \"composite\" HMM where the transition matrix A of the model is block-diagonal, e.g., if the mixture model consists of two components with transition matrices A_1 and A_2 we can represent the overall mixture model as a single HMM (in effect, a hierarchical mixture) with transition matrix \n\nA = \begin{pmatrix} A_1 & 0 \\ 0 & A_2 \end{pmatrix}    (2) \n\nwhere the initial state probabilities are chosen appropriately to reflect the relative weights of the mixture components (the p_j in Equation 1). Intuitively, a sequence is generated from this model by initially randomly choosing either the \"upper\" matrix A_1 (with probability p_1) or the \"lower\" matrix A_2 (with probability 1 - p_1) and then generating data according to the appropriate A_i. There is no \"crossover\" in this mixture model: data are assumed to come from one component or the other. Given this composite HMM, a natural approach is to try to learn the parameters of the model using standard HMM estimation techniques, i.e., some form of initialization followed by Baum-Welch to maximize the likelihood. 
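The block-diagonal construction of Equation 2 is mechanical; a minimal sketch (the helper names are ours, not the paper's):

```python
def composite_transition(blocks):
    """Block-diagonal transition matrix A of the composite HMM, assembled
    from the per-component matrices A_1, ..., A_K (Equation 2)."""
    n = sum(len(b) for b in blocks)
    A = [[0.0] * n for _ in range(n)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, v in enumerate(row):
                A[offset + i][offset + j] = v   # copy block onto the diagonal
        offset += len(b)
    return A

def composite_initial(initials, weights):
    """Initial state probabilities: component j's initial distribution
    scaled by its mixture weight p_j, so there is no 'crossover'."""
    out = []
    for pi_j, p in zip(initials, weights):
        out.extend(p * q for q in pi_j)
    return out
```

With the two 2-state components of the toy example later in the paper, the composite A is 4 x 4 with zero off-diagonal blocks, and each row still sums to one.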
Note that unlike predictive modeling (where likelihood is not necessarily an appropriate metric to evaluate model quality), likelihood maximization is exactly what we want to do here, since we seek a generative (descriptive) model for the data. We will assume throughout that the number of states per component is known a priori, i.e., that we are looking for K HMM components, each of which has m states, and m is known. An obvious extension is to address the problem of learning K and m simultaneously, but this is not dealt with here. \n\n2.1 Initialization using Clustering in \"Log-Likelihood Space\" \n\nSince the EM algorithm is effectively hill-climbing the likelihood surface, the quality of the final solution can depend critically on the initial conditions. Thus, using as much prior information as possible about the problem to seed the initialization is potentially worthwhile. This motivates the following scheme for initializing the A matrix of the composite HMM: \n\n1. Fit N m-state HMMs, one to each individual sequence S_i, 1 \leq i \leq N. These HMMs can be initialized in a \"default\" manner: set the transition matrices uniformly and set the means and covariances using the k-means algorithm, where here k = m, not to be confused with K, the number of HMM components. For discrete observation alphabets, modify accordingly. \n\n2. For each fitted model M_i, evaluate the log-likelihood of each of the N sequences given model M_i, i.e., calculate L_{ij} = log L(S_j | M_i), 1 \leq i, j \leq N. \n\n3. Use the log-likelihood distance matrix to cluster the sequences into K groups (details of the clustering are discussed below). \n\n4. Having pooled the sequences into K groups, fit K HMMs, one to each group, using the default initialization described above. 
From the K HMMs we get K sets of parameters: initialize the composite HMM in the obvious way, i.e., the m x m \"block-diagonal\" component A_j of A (where A is mK x mK) is set to the estimated transition matrix from the jth group, and the means and covariances of the jth set of states are set accordingly. Initialize the p_j in Equation 1 to N_j / N, where N_j is the number of sequences which belong to cluster j. \n\nAfter this initialization step is complete, learning proceeds directly on the composite HMM (with matrix A) in the usual Baum-Welch fashion using all of the sequences. The intuition behind this initialization procedure is as follows. The hypothesis is that the data are being generated by K models. Thus, if we fit models to each individual sequence, we will get noisier estimates of the model parameters (than if we used all of the sequences from that cluster), but the parameters should be clustered in some manner into K groups about their true values (assuming the model is correct). Clustering directly in parameter space would be inappropriate (how does one define distance?); the log-likelihoods, however, are a natural way to define pairwise distances. \n\nNote that step 1 above requires training N models individually and step 2 requires the evaluation of N^2 distances. For large N this may be impractical. Suitable modifications which train only on a small random sample of the N sequences and randomly sample the distance matrix could help reduce the computational burden, but this is not pursued here. A variety of possible clustering methods can be used in step 3 above. The \"symmetrized distance\" L_{ij} = \frac{1}{2}(L(S_i | M_j) + L(S_j | M_i)) can be shown to be an appropriate measure of dissimilarity between models M_i and M_j (Juang and Rabiner, 1985). 
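Steps 2 and 3 can be sketched directly. Here `loglik(seq, model)` is a placeholder for the forward-algorithm evaluation of step 2, and the sign flip (likelihood to dissimilarity) is our convention; both function names are illustrative:

```python
def symmetrized_distance(seqs, models, loglik):
    """Pairwise symmetrized quantity 1/2 (L(S_i|M_j) + L(S_j|M_i)),
    negated so that larger values mean *less* similar sequences."""
    n = len(seqs)
    L = [[loglik(seqs[j], models[i]) for j in range(n)] for i in range(n)]
    return [[-0.5 * (L[i][j] + L[j][i]) for j in range(n)] for i in range(n)]

def furthest_neighbor_clusters(D, K):
    """Agglomerative clustering with the furthest-neighbor (complete-linkage)
    merging rule on dissimilarity matrix D, down to K clusters."""
    clusters = [[i] for i in range(len(D))]
    def linkage(a, b):
        return max(D[i][j] for i in a for j in b)   # distance of furthest pair
    while len(clusters) > K:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters.pop(b)              # merge the closest pair
    return clusters
```

The complete-linkage rule is what encourages compact clusters: a merge is only cheap when *every* cross-pair is close.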
For the results described in this paper, hierarchical clustering was used to generate K clusters from the symmetrized distance matrix. The \"furthest-neighbor\" merging heuristic was used to encourage compact clusters and worked well empirically, although there is no particular reason to use only this method. \n\nWe will refer to the above clustering-based initialization followed by Baum-Welch training on the composite model as the \"HMM-Clustering\" algorithm in the rest of the paper. \n\n2.2 Experimental Results \n\nConsider a deceptively simple \"toy\" problem. 1-dimensional feature data are generated from a 2-component HMM mixture (K = 2), each component with 2 states. We have \n\nA_1 = \begin{pmatrix} 0.6 & 0.4 \\ 0.4 & 0.6 \end{pmatrix}, \quad A_2 = \begin{pmatrix} 0.4 & 0.6 \\ 0.6 & 0.4 \end{pmatrix} \n\nand the observable feature data obey a Gaussian density in each state, with \sigma_1 = \sigma_2 = 1 for each state in each component and \mu_1 = 0, \mu_2 = 3 for the respective means of the two states of each component. Four sample sequences are shown in Figure 1. The top, and third from top, sequences are from the \"slower\" component A_1 (which is more likely to stay in any state than to switch). In total the training data contain 20 sample sequences from each component, each of length 200. The problem is non-trivial both because the data have exactly the same marginal statistics if the temporal sequence information is removed and because the Markov dynamics (as governed by A_1 and A_2) are relatively similar for each component, making identification somewhat difficult. \n\nThe HMM-Clustering algorithm was applied to these sequences. The symmetrized likelihood distance matrix is shown as a grey-scale image in Figure 2. The axes have been ordered so that the sequences from the same clusters are adjacent. The difference in distances between the two clusters is apparent and the hierarchical clustering algorithm (with K = 2) easily separates the two groups. 
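Data equivalent to the toy problem above can be regenerated in a few lines; the sampler below is a minimal sketch (the uniform initial state is our assumption, as the paper does not specify the initial distribution):

```python
import random

def sample_hmm_sequence(A, means, sigma, length, rng):
    """Draw one sequence from an HMM with Gaussian emissions:
    per-state means in `means`, common standard deviation `sigma`."""
    state = rng.randrange(len(A))            # uniform initial state (assumption)
    out = []
    for _ in range(length):
        out.append(rng.gauss(means[state], sigma))
        u, acc = rng.random(), 0.0
        for nxt, p in enumerate(A[state]):   # sample next state from row of A
            acc += p
            if u < acc:
                state = nxt
                break
    return out

rng = random.Random(0)
A1 = [[0.6, 0.4], [0.4, 0.6]]   # "slower" dynamics: states tend to persist
A2 = [[0.4, 0.6], [0.6, 0.4]]   # "faster" dynamics: states tend to switch
train = ([sample_hmm_sequence(A1, [0.0, 3.0], 1.0, 200, rng) for _ in range(20)]
         + [sample_hmm_sequence(A2, [0.0, 3.0], 1.0, 200, rng) for _ in range(20)])
```

Note that both components visit each state half the time on average, so the pooled marginal statistics are identical; only the dynamics distinguish the clusters, exactly as the text observes.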
This initial clustering, followed by training the two component models separately on the sets of sequences assigned to each cluster, yielded: \n\n\hat{A}_1 = \begin{pmatrix} 0.580 & 0.402 \\ 0.420 & 0.598 \end{pmatrix}, \quad \hat{A}_2 = \begin{pmatrix} 0.392 & 0.611 \\ 0.608 & 0.389 \end{pmatrix} \n\n\hat{\mu}_1 = \begin{pmatrix} 2.892 \\ 0.040 \end{pmatrix}, \quad \hat{\mu}_2 = \begin{pmatrix} 2.911 \\ 0.138 \end{pmatrix}, \quad \hat{\sigma}_1 = \begin{pmatrix} 1.353 \\ 1.219 \end{pmatrix}, \quad \hat{\sigma}_2 = \begin{pmatrix} 1.239 \\ 1.339 \end{pmatrix} \n\nSubsequent training of the composite model on all of the sequences produced more refined parameter estimates, although the basic cluster structure of the model remained the same (i.e., the initial clustering was robust). \n\nFigure 2: Symmetrized log-likelihood distance matrix. \n\nFor comparative purposes, two alternative initialization procedures were used to initialize the training of the composite HMM. The \"unstructured\" method uniformly initializes the A matrix without any knowledge of the fact that the off-block-diagonal terms are zero (this is the \"standard\" way of fitting an HMM). The \"block-uniform\" method uniformly initializes the K block-diagonal matrices within A and sets the off-block-diagonal terms to zero. Random initialization gave poorer results overall compared to uniform. \n\nTable 1: Differences in log-likelihood for different initialization methods. \n\nInitialization Method | Maximum Log-Likelihood | Mean Log-Likelihood | Std. Dev. of Log-Likelihoods \nUnstructured | 7.6 | 0.0 | 1.3 \nBlock-Uniform | 44.8 | 8.1 | 28.7 \nHMM-Clustering | 55.1 | 50.4 | 0.9 \n\nThe three alternatives were run 20 times on the data above, where for each run the seeds for the k-means component of the initialization were changed. 
The maximum, mean, and standard deviation of the log-likelihoods on test data are reported in Table 1 (the log-likelihoods were offset so that the mean unstructured log-likelihood is zero). The unstructured approach is substantially inferior to the others on this problem: this is not surprising since it is not given the block-diagonal structure of the true model. The Block-Uniform method is closer in performance to HMM-Clustering but is still inferior. In particular, its log-likelihood is consistently lower than that of the HMM-Clustering solution and has much greater variability across different initial seeds. The same qualitative behavior was observed across a variety of simulated data sets (results are not presented here due to lack of space). \n\n3 Learning K, the Number of Sequence Components \n\n3.1 Background \n\nAbove we have assumed that K, the number of clusters, is known. The problem of learning the \"best\" value for K in a mixture model is a difficult one in practice, even for the simpler (non-dynamic) case of Gaussian mixtures. There has been considerable prior work on this problem. Penalized likelihood approaches are popular, where the log-likelihood on the training data is penalized by the subtraction of a complexity term. A more general approach is the full Bayesian solution, where the posterior probability of each value of K is calculated given the data, priors on the mixture parameters, and priors on K itself. A potential difficulty here is the computational complexity of integrating over the parameter space to get the posterior probabilities on K. Various analytic and sampling approximations are used in practice. In theory, the full Bayesian approach is fully optimal and probably the most useful. 
However, in practice the ideal Bayesian solution must be approximated, and it is not always obvious how the approximation affects the quality of the final answer. Thus, there is room to explore alternative methods for determining K. \n\n3.2 A Monte-Carlo Cross-Validation Approach \n\nImagine that we had a large test data set D_test which is not used in fitting any of the models. Let L_K(D_test) be the log-likelihood where the model with K components is fit to the training data D but the likelihood is evaluated on D_test. We can view this likelihood as a function of the \"parameter\" K, keeping all other parameters and D fixed. Intuitively, this \"test likelihood\" should be a much more useful estimator than the training data likelihood for comparing mixture models with different numbers of components. In fact, the test likelihood can be shown to be an unbiased estimator of the Kullback-Leibler distance between the true (but unknown) density and the model. Thus, maximizing out-of-sample likelihood over K is a reasonable model selection strategy. In practice, one does not usually want to reserve a large fraction of one's data for test purposes: thus, a cross-validated estimate of log-likelihood can be used instead. \n\nIn Smyth (1996) it was found that for standard multivariate Gaussian mixture modeling, standard v-fold cross-validation techniques (with, say, v = 10) performed poorly in terms of selecting the correct model on simulated data. Instead, Monte-Carlo cross-validation (Shao, 1993) was found to be much more stable: the data are partitioned into a fraction \beta for testing and 1 - \beta for training, and this procedure is repeated M times, where the partitions are randomly chosen on each run (i.e., need not be disjoint). In choosing \beta one must trade off the variability of the performance estimate on the test set against the variability in model fitting on the training set. 
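The Monte-Carlo cross-validation loop just described can be sketched as follows, where `fit` and `loglik` are placeholders for the HMM-Clustering training step and the out-of-sample likelihood evaluation (their signatures are our assumption):

```python
import random

def mc_cv_scores(data, fit, loglik, K_values, M=20, beta=0.5, seed=0):
    """For each candidate K: average, over M random splits with a fraction
    beta held out, the total out-of-sample log-likelihood of the fitted model."""
    rng = random.Random(seed)
    n_test = int(beta * len(data))
    scores = {K: 0.0 for K in K_values}
    for _ in range(M):
        idx = list(range(len(data)))
        rng.shuffle(idx)                     # fresh random split each run;
        test, train = idx[:n_test], idx[n_test:]   # splits need not be disjoint
        for K in K_values:
            model = fit([data[i] for i in train], K)
            scores[K] += sum(loglik(data[i], model) for i in test)
    return {K: s / M for K, s in scores.items()}
```

Model selection then amounts to taking the argmax over K of the returned scores.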
In general, as the total amount of data increases relative to the model complexity, the optimal \beta becomes larger. For the mixture clustering problem \beta = 0.5 was found empirically to work well (Smyth, 1996) and is used in the results reported here. \n\n3.3 Experimental Results \n\nThe same data set as described earlier was used, where now K is not known a priori. The 40 sequences were randomly partitioned 20 times into training and test cross-validation sets. For each train/test partition the value of K was varied between 1 and 6, and for each value of K the HMM-Clustering algorithm was fit to the training data and the likelihood was evaluated on the test data. The mean cross-validated likelihood was evaluated as the average over the 20 runs. Assuming the models are equally likely a priori, one can generate an approximate posterior distribution p(K|D) by Bayes' rule: these posterior probabilities are shown in Figure 3. The cross-validation procedure produces a clear peak at K = 2, which is the true model size. In general, the cross-validation method has been tested on a variety of other simulated sequence clustering data sets and typically converges as a function of the number of training samples to the true value of K (from below). \n\nFigure 3: Posterior probability distribution on K as estimated by cross-validation. \n\nAs the number of data points grows, the posterior distribution on K narrows about the true value of K. If the data were not generated by the assumed form of the model, the posterior distribution on K will tend to be peaked about the model size which is closest (in K-L distance) to the true model. 
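With a uniform prior over the candidate model sizes, the Bayes-rule conversion used for Figure 3 amounts to normalizing the exponentiated mean cross-validated log-likelihoods. A small sketch (the numbers in the test are illustrative, not the paper's):

```python
import math

def posterior_over_K(cv_loglik):
    """Approximate p(K | D) from mean cross-validated log-likelihoods,
    assuming equal prior probability for each candidate K."""
    mx = max(cv_loglik.values())                      # subtract max for stability
    w = {K: math.exp(v - mx) for K, v in cv_loglik.items()}
    z = sum(w.values())
    return {K: v / z for K, v in w.items()}
```

Because the weights are exponential in the log-likelihood, even a modest gap between the best K and its competitors yields the sharply peaked posterior seen in Figure 3.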
Results in the context of Gaussian mixture clustering (Smyth, 1996) have shown that the Monte-Carlo cross-validation technique performs as well as the better Bayesian approximation methods and is more robust than penalized likelihood methods such as BIC. \n\nIn conclusion, we have shown that model-based probabilistic clustering can be generalized from feature-space clustering to sequence clustering. The log-likelihood between sequence models and sequences was found useful for detecting cluster structure, and cross-validated likelihood was shown to be able to detect the true number of clusters. \n\nReferences \n\nBaldi, P. and Y. Chauvin, 'Hierarchical hybrid modeling, HMM/NN architectures, and protein applications,' Neural Computation, 8(6), 1541-1565, 1996. \n\nJuang, B. H., and L. R. Rabiner, 'A probabilistic distance measure for hidden Markov models,' AT&T Technical Journal, vol. 64, no. 2, February 1985. \n\nKrogh, A. et al., 'Hidden Markov models in computational biology: applications to protein modeling,' J. Mol. Biol., 235:1501-1531, 1994. \n\nRabiner, L. R., C. H. Lee, B. H. Juang, and J. G. Wilpon, 'HMM clustering for connected word recognition,' Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, IEEE Press, 405-408, 1989. \n\nShao, J., 'Linear model selection by cross-validation,' J. Am. Stat. Assoc., 88(422), 486-494, 1993. \n\nSmyth, P., 'Clustering using Monte-Carlo cross-validation,' in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA: AAAI Press, pp. 126-133, 1996. \n\nZeevi, A. J., Meir, R., and Adler, R., 'Time series prediction using mixtures of experts,' in this volume, 1997.", "award": [], "sourceid": 1217, "authors": [{"given_name": "Padhraic", "family_name": "Smyth", "institution": null}]}