{"title": "Clustering Sequences with Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 648, "page_last": 654, "abstract": null, "full_text": "Clustering Sequences with Hidden \n\nMarkov Models \n\nPadhraic Smyth \n\nInformation and Computer Science \n\nUniversity of California, Irvine \n\nCA  92697-3425 \n\nsmyth~ics.uci.edu \n\nAbstract \n\nThis paper discusses  a probabilistic model-based approach to clus(cid:173)\ntering sequences,  using hidden Markov models (HMMs) .  The prob(cid:173)\nlem  can  be  framed  as  a  generalization  of  the  standard  mixture \nmodel approach to clustering in feature space.  Two primary issues \nare  addressed.  First,  a  novel  parameter initialization procedure  is \nproposed,  and  second,  the  more  difficult  problem  of determining \nthe  number of clusters  K,  from  the data,  is  investigated.  Experi(cid:173)\nmental results  indicate that the proposed  techniques  are useful  for \nrevealing hidden  cluster structure in  data sets of sequences. \n\nIntroduction \n\n1 \nConsider  a  data  set  D  consisting  of  N  sequences,  D  =  {SI,\"\"  SN}'  Si  = \n(.f.L ... .f.~J  is  a  sequence  of length  Li  composed  of potentially  multivariate fea(cid:173)\nture vectors.f..  The problem addressed  in this paper is the discovery from data of a \nnatural grouping of the sequences into K  clusters.  This is  analagous to clustering in \nmultivariate feature space  which  is  normally handled  by  methods such  as  k-means \nand  Gaussian  mixtures.  Here,  however,  one  is  trying  to  cluster  the  sequences  S \nrather  than  the feature  vectors.f..  As  an  example  Figure  1 shows  four  sequences \nwhich were  generated by two  different  models (hidden  Markov models in this case) . \nThe first  and third came from a model with \"slower\"  dynamics than the second  and \nfourth  (details  will  be  provided  later).  The sequence  clustering  problem  consists \nof being  given  sample sequences  such  as  those  in  Figure  1  and  inferring from  the \ndata what  the  underlying  clusters  are.  This is  non-trivial since  the sequences  can \nbe  of different  lengths  and it is  not  clear  what  a  meaningful distance  metric is  for \nsequence  comparIson. \n\nThe  use  of hidden  Markov  models  for  clustering  sequences  appears  to  have  first \n\n\fClustering Sequences with Hidden Markov Models \n\n649 \n\nFigure  1:  Which sequences  came from which hidden  Markov  model? \n\nbeen  mentioned in Juang and Rabiner (1985)  and subsequently used  in the context \nof discovering subfamilies of protein sequences  in  Krogh et al.  (1994).  This present \npaper  contains  two  new  contributions in  this  context:  a  cluster-based  method for \ninitializing the model parameters and a novel method based on  cross-validated like(cid:173)\nlihood for  determining automatically how  many clusters  to fit  to the data. \nA  natural probabilistic model for  this problem is  that of a finite  mixture model: \n\nK \n\nfK(S) = L/j(SIOj)pj \n\nj=l \n\n(1) \n\nwhere  S  denotes  a  sequence,  Pj  is  the  weight  of the  jth  model,  and  /j (SIOj)  is \nthe  density  function  for  the sequence  data S  given  the  component  model  /j  with \nparameters OJ .  Here  we  will  assume that the /j's are HMMs:  thus,  the OJ'S  are the \ntransition matrices,  observation density  parameters,  and initial state probabilities, \nall  for  the jth component.  
It is important to note that the motivation for this problem comes from the goal of building a descriptive model for the data, rather than prediction per se. For the prediction problem there is a clearly defined metric for performance, namely average prediction error on out-of-sample data (cf. Rabiner et al. (1989) in a speech context with clusters of HMMs, and Zeevi, Meir, and Adler (1997) in a general time-series context). In contrast, for descriptive modeling it is not always clear what the appropriate metric for evaluation is, particularly when K, the number of clusters, is unknown. In this paper a density estimation viewpoint is taken, and the likelihood of out-of-sample data is used as the measure of the quality of a particular model.

2 An Algorithm for Clustering Sequences into K Clusters

Assume first that K, the number of clusters, is known. Our model is that of a mixture of HMMs as in Equation 1. We can immediately observe that this mixture can itself be viewed as a single "composite" HMM where the transition matrix A of the model is block-diagonal; e.g., if the mixture model consists of two components with transition matrices A_1 and A_2, we can represent the overall mixture model as a single HMM (in effect, a hierarchical mixture) with transition matrix

    A = \begin{pmatrix} A_1 & 0 \\ 0 & A_2 \end{pmatrix}                    (2)

where the initial state probabilities are chosen appropriately to reflect the relative weights of the mixture components (the p_j in Equation 1). Intuitively, a sequence is generated from this model by initially randomly choosing either the "upper" matrix A_1 (with probability p_1) or the "lower" matrix A_2 (with probability 1 - p_1) and then generating data according to the appropriate A_i. There is no "crossover" in this mixture model: data are assumed to come from one component or the other. Given this composite HMM, a natural approach is to try to learn the parameters of the model using standard HMM estimation techniques, i.e., some form of initialization followed by Baum-Welch to maximize the likelihood. Note that unlike predictive modelling (where likelihood is not necessarily an appropriate metric to evaluate model quality), likelihood maximization is exactly what we want to do here since we seek a generative (descriptive) model for the data. We will assume throughout that the number of states per component is known a priori, i.e., that we are looking for K HMM components, each of which has m states, and m is known. An obvious extension is to address the problem of learning K and m simultaneously, but this is not dealt with here.
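A minimal sketch (not from the paper) of assembling such a composite HMM from its components, assuming each component is specified by an m x m transition matrix, a length-m initial-state distribution, and a mixture weight; all names are illustrative.

    import numpy as np

    def composite_hmm(transmats, startprobs, weights):
        """Build the block-diagonal transition matrix A of Equation 2 and the
        matching initial-state distribution for the composite HMM.

        transmats  : list of K component transition matrices A_j, each m x m
        startprobs : list of K length-m initial-state distributions
        weights    : mixture weights p_j, summing to one
        """
        m, K = transmats[0].shape[0], len(transmats)
        A = np.zeros((K * m, K * m))
        pi = np.zeros(K * m)
        for j in range(K):
            A[j*m:(j+1)*m, j*m:(j+1)*m] = transmats[j]    # off-block-diagonal terms stay zero
            pi[j*m:(j+1)*m] = weights[j] * startprobs[j]  # weight p_j spread over component j's states
        return A, pi

    # Two-component example in the spirit of Equation 2.
    A1 = np.array([[0.6, 0.4], [0.4, 0.6]])
    A2 = np.array([[0.4, 0.6], [0.6, 0.4]])
    A, pi = composite_hmm([A1, A2], [np.full(2, 0.5)] * 2, [0.5, 0.5])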
2.1 Initialization using Clustering in "Log-Likelihood Space"

Since the EM algorithm is effectively hill-climbing the likelihood surface, the quality of the final solution can depend critically on the initial conditions. Thus, using as much prior information as possible about the problem to seed the initialization is potentially worthwhile. This motivates the following scheme for initializing the A matrix of the composite HMM:

1. Fit N m-state HMMs, one to each individual sequence S_i, 1 <= i <= N. These HMMs can be initialized in a "default" manner: set the transition matrices uniformly and set the means and covariances using the k-means algorithm, where here k = m, not to be confused with K, the number of HMM components. For discrete observation alphabets, modify accordingly.

2. For each fitted model M_i, evaluate the log-likelihood of each of the N sequences given model M_i, i.e., calculate L_{ij} = log L(S_j | M_i), 1 <= i, j <= N.

3. Use the log-likelihood distance matrix to cluster the sequences into K groups (details of the clustering are discussed below).

4. Having pooled the sequences into K groups, fit K HMMs, one to each group, using the default initialization described above. From the K HMMs we get K sets of parameters: initialize the composite HMM in the obvious way, i.e., the m x m "block-diagonal" component A_j of A (where A is mK x mK) is set to the estimated transition matrix from the jth group, and the means and covariances of the jth set of states are set accordingly. Initialize the p_j in Equation 1 to N_j / N, where N_j is the number of sequences which belong to cluster j.

After this initialization step is complete, learning proceeds directly on the composite HMM (with matrix A) in the usual Baum-Welch fashion using all of the sequences. The intuition behind this initialization procedure is as follows. The hypothesis is that the data are being generated by K models. Thus, if we fit models to each individual sequence, we will get noisier estimates of the model parameters (than if we used all of the sequences from that cluster), but the parameters should be clustered in some manner into K groups about their true values (assuming the model is correct). Clustering directly in parameter space would be inappropriate (how does one define distance?); the log-likelihoods, however, are a natural way to define pairwise distances.

Note that step 1 above requires the training of N individual HMMs and step 2 requires the evaluation of N^2 distances. For large N this may be impractical. Suitable modifications which train only on a small random sample of the N sequences and randomly sample the distance matrix could help reduce the computational burden, but this is not pursued here. A variety of possible clustering methods can be used in step 3 above. The "symmetrized distance" L_{ij} = 1/2 (L(S_i | M_j) + L(S_j | M_i)) can be shown to be an appropriate measure of dissimilarity between models M_i and M_j (Juang and Rabiner 1985). For the results described in this paper, hierarchical clustering was used to generate K clusters from the symmetrized distance matrix. The "furthest-neighbor" merging heuristic was used to encourage compact clusters and worked well empirically, although there is no particular reason to use only this method.

We will refer to the above clustering-based initialization, followed by Baum-Welch training on the composite model, as the "HMM-Clustering" algorithm in the rest of the paper.
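The four steps above might be sketched as follows. This is a minimal illustration, not the author's code: it assumes the hmmlearn library for Gaussian HMMs (whose default initialization uses k-means for the state means), scipy's complete-linkage hierarchical clustering as the "furthest-neighbor" heuristic, sequences stored as (L_i x d) numpy arrays, and illustrative function names.

    import numpy as np
    from hmmlearn.hmm import GaussianHMM
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def hmm_clustering_init(sequences, K, m):
        """Cluster-based initialization of the composite HMM (steps 1-4)."""
        N = len(sequences)
        # Step 1: fit one m-state HMM per sequence.
        models = [GaussianHMM(n_components=m, covariance_type="diag",
                              n_iter=20).fit(s) for s in sequences]
        # Step 2: log-likelihood matrix L[i, j] = log L(S_j | M_i).
        L = np.array([[models[i].score(sequences[j]) for j in range(N)]
                      for i in range(N)])
        # Step 3: symmetrized log-likelihoods turned into dissimilarities, then
        # furthest-neighbor (complete-linkage) hierarchical clustering into K groups.
        D = -(L + L.T) / 2.0
        D -= D.min()
        np.fill_diagonal(D, 0.0)
        Z = linkage(squareform(D, checks=False), method="complete")
        labels = fcluster(Z, t=K, criterion="maxclust")
        # Step 4: fit one HMM per group; mixture weights p_j = N_j / N.
        group_models, weights = [], []
        for k in range(1, K + 1):
            group = [s for s, lab in zip(sequences, labels) if lab == k]
            hmm = GaussianHMM(n_components=m, covariance_type="diag", n_iter=20)
            hmm.fit(np.vstack(group), lengths=[len(s) for s in group])
            group_models.append(hmm)
            weights.append(len(group) / N)
        return group_models, np.array(weights), labels

The returned group models and weights would then be packed into the block-diagonal composite HMM (as in the sketch following Equation 2) before running Baum-Welch on all of the sequences.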
2.2 Experimental Results

Consider a deceptively simple "toy" problem. One-dimensional feature data are generated from a 2-component HMM mixture (K = 2), each component with 2 states. We have

    A_1 = \begin{pmatrix} 0.6 & 0.4 \\ 0.4 & 0.6 \end{pmatrix},    A_2 = \begin{pmatrix} 0.4 & 0.6 \\ 0.6 & 0.4 \end{pmatrix}

and the observable feature data obey a Gaussian density in each state, with \sigma_1 = \sigma_2 = 1 for each state in each component and \mu_1 = 0, \mu_2 = 3 for the respective means of the two states of each component. Four sample sequences are shown in Figure 1. The top, and third from top, sequences are from the "slower" component A_1 (which is more likely to stay in any state than to switch). In total the training data contain 20 sample sequences from each component, each of length 200. The problem is non-trivial both because the data have exactly the same marginal statistics if the temporal sequence information is removed, and because the Markov dynamics (as governed by A_1 and A_2) are relatively similar for each component, making identification somewhat difficult.
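A minimal sketch (not the author's code) of generating data with these characteristics; the paper does not state the initial-state distribution, so a uniform one is assumed here, and all names are illustrative.

    import numpy as np

    def sample_hmm_sequence(transmat, means, sigma, length, rng):
        """Sample a 1-d sequence from an HMM with Gaussian emissions."""
        m = transmat.shape[0]
        states = np.empty(length, dtype=int)
        states[0] = rng.integers(m)                          # assumed uniform initial state
        for t in range(1, length):
            states[t] = rng.choice(m, p=transmat[states[t - 1]])
        obs = rng.normal(loc=means[states], scale=sigma)
        return obs.reshape(-1, 1)

    rng = np.random.default_rng(0)
    A1 = np.array([[0.6, 0.4], [0.4, 0.6]])   # "slower" component
    A2 = np.array([[0.4, 0.6], [0.6, 0.4]])
    means, sigma = np.array([0.0, 3.0]), 1.0
    # 20 sequences of length 200 from each component.
    data = ([sample_hmm_sequence(A1, means, sigma, 200, rng) for _ in range(20)] +
            [sample_hmm_sequence(A2, means, sigma, 200, rng) for _ in range(20)])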
The HMM-Clustering algorithm was applied to these sequences. The symmetrized likelihood distance matrix is shown as a grey-scale image in Figure 2. The axes have been ordered so that the sequences from the same clusters are adjacent. The difference in distances between the two clusters is apparent, and the hierarchical clustering algorithm (with K = 2) easily separates the two groups. This initial clustering, followed by separate training of the two component HMMs on the sequences assigned to each cluster, yielded:

    \hat{A}_1 = \begin{pmatrix} 0.580 & 0.402 \\ 0.420 & 0.598 \end{pmatrix},    \hat{A}_2 = \begin{pmatrix} 0.392 & 0.611 \\ 0.608 & 0.389 \end{pmatrix}

    \hat{\mu}_1 = \begin{pmatrix} 2.892 \\ 0.040 \end{pmatrix},    \hat{\mu}_2 = \begin{pmatrix} 2.911 \\ 0.138 \end{pmatrix},    \hat{\sigma}_1 = \begin{pmatrix} 1.353 \\ 1.219 \end{pmatrix},    \hat{\sigma}_2 = \begin{pmatrix} 1.239 \\ 1.339 \end{pmatrix}

Subsequent training of the composite model on all of the sequences produced more refined parameter estimates, although the basic cluster structure of the model remained the same (i.e., the initial clustering was robust).

[Figure 2: Symmetrized log-likelihood distance matrix.]

For comparative purposes, two alternative initialization procedures were used to initialize the training of the composite HMM. The "unstructured" method uniformly initializes the A matrix without any knowledge of the fact that the off-block-diagonal terms are zero (this is the "standard" way of fitting an HMM). The "block-uniform" method uniformly initializes the K block-diagonal matrices within A and sets the off-block-diagonal terms to zero. Random initialization gave poorer results overall compared to uniform.

    Table 1: Differences in log-likelihood for different initialization methods.

    Initialization Method    Max. Log-Likelihood    Mean Log-Likelihood    Std. Dev. of Log-Likelihoods
    Unstructured                    7.6                    0.0                     1.3
    Block-Uniform                  44.8                    8.1                    28.7
    HMM-Clustering                 55.1                   50.4                     0.9

The three alternatives were run 20 times on the data above, where for each run the seeds for the k-means component of the initialization were changed. The maximum, mean, and standard deviation of the log-likelihoods on test data are reported in Table 1 (the log-likelihoods were offset so that the mean unstructured log-likelihood is zero). The unstructured approach is substantially inferior to the others on this problem: this is not surprising since it is not given the block-diagonal structure of the true model. The Block-Uniform method is closer in performance to HMM-Clustering but is still inferior. In particular, its log-likelihood is consistently lower than that of the HMM-Clustering solution and has much greater variability across different initial seeds. The same qualitative behavior was observed across a variety of simulated data sets (results are not presented here due to lack of space).

3 Learning K, the Number of Sequence Components

3.1 Background

Above we have assumed that K, the number of clusters, is known. The problem of learning the "best" value for K in a mixture model is a difficult one in practice, even for the simpler (non-dynamic) case of Gaussian mixtures. There has been considerable prior work on this problem. Penalized likelihood approaches are popular, where the log-likelihood on the training data is penalized by the subtraction of a complexity term. A more general approach is the full Bayesian solution, where the posterior probability of each value of K is calculated given the data, priors on the mixture parameters, and priors on K itself. A potential difficulty here is the computational complexity of integrating over the parameter space to get the posterior probabilities on K. Various analytic and sampling approximations are used in practice. In theory, the full Bayesian approach is fully optimal and probably the most useful. However, in practice the ideal Bayesian solution must be approximated, and it is not always obvious how the approximation affects the quality of the final answer. Thus, there is room to explore alternative methods for determining K.

3.2 A Monte-Carlo Cross-Validation Approach

Imagine that we had a large test data set D_test which is not used in fitting any of the models. Let L_K(D_test) be the log-likelihood where the model with K components is fit to the training data D but the likelihood is evaluated on D_test. We can view this likelihood as a function of the "parameter" K, keeping all other parameters and D fixed. Intuitively, this "test likelihood" should be a much more useful estimator than the training data likelihood for comparing mixture models with different numbers of components. In fact, the test log-likelihood is an unbiased estimator of the negative Kullback-Leibler distance between the true (but unknown) density and the model, up to an additive constant that does not depend on K (an explicit identity is given below). Thus, maximizing out-of-sample likelihood over K is a reasonable model selection strategy. In practice, one does not usually want to reserve a large fraction of one's data for test purposes: thus, a cross-validated estimate of log-likelihood can be used instead.
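For completeness (this step is implicit rather than spelled out in the paper), the connection between held-out log-likelihood and Kullback-Leibler distance can be written, for test sequences drawn from the true density p, as

    \mathrm{KL}\big(p \,\|\, f_K\big) \;=\; \int p(S)\,\log p(S)\,dS \;-\; \int p(S)\,\log f_K(S)\,dS ,
    \qquad
    \mathbb{E}_{p}\!\left[\frac{1}{|D_{\mathrm{test}}|}\sum_{S \in D_{\mathrm{test}}} \log f_K(S)\right] \;=\; \int p(S)\,\log f_K(S)\,dS .

The first term on the right of the KL expression does not depend on K, so the value of K that maximizes the average held-out log-likelihood is, in expectation, the one whose mixture f_K is closest to the true density p in Kullback-Leibler distance.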
In Smyth (1996) it was found that for standard multivariate Gaussian mixture modeling, the standard v-fold cross-validation techniques (with, say, v = 10) performed poorly in terms of selecting the correct model on simulated data. Instead, Monte-Carlo cross-validation (Shao, 1993) was found to be much more stable: the data are partitioned into a fraction \beta for testing and 1 - \beta for training, and this procedure is repeated M times, where the partitions are randomly chosen on each run (i.e., they need not be disjoint). In choosing \beta one must trade off the variability of the performance estimate on the test set against the variability in model fitting on the training set. In general, as the total amount of data increases relative to the model complexity, the optimal \beta becomes larger. For the mixture clustering problem \beta = 0.5 was found empirically to work well (Smyth, 1996) and is used in the results reported here.

3.3 Experimental Results

The same data set as described earlier was used, where now K is not known a priori. The 40 sequences were randomly partitioned 20 times into training and test cross-validation sets. For each train/test partition the value of K was varied between 1 and 6, and for each value of K the HMM-Clustering algorithm was fit to the training data and the likelihood was evaluated on the test data. The mean cross-validated likelihood was evaluated as the average over the 20 runs. Assuming the models are equally likely a priori, one can generate an approximate posterior distribution p(K | D) by Bayes' rule: these posterior probabilities are shown in Figure 3. The cross-validation procedure produces a clear peak at K = 2, which is the true model size.

[Figure 3: Posterior probability distribution on K as estimated by cross-validation.]

In general, the cross-validation method has been tested on a variety of other simulated sequence clustering data sets and typically converges, as a function of the number of training samples, to the true value of K (from below). As the number of data points grows, the posterior distribution on K narrows about the true value of K. If the data were not generated by the assumed form of the model, the posterior distribution on K will tend to be peaked about the model size which is closest (in K-L distance) to the true model. Results in the context of Gaussian mixture clustering (Smyth 1996) have shown that the Monte-Carlo cross-validation technique performs as well as the better Bayesian approximation methods and is more robust than penalized likelihood methods such as BIC.
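The selection procedure of Sections 3.2 and 3.3 might be sketched as follows. This is a minimal illustration, not the author's code: fit_mixture and test_log_likelihood are placeholders standing in for the HMM-Clustering fitting step and for the mixture log-likelihood of Equation 1 sketched earlier.

    import numpy as np

    def mc_cv_log_likelihoods(sequences, K_values, fit_mixture, test_log_likelihood,
                              M=20, beta=0.5, seed=0):
        """Monte-Carlo cross-validated log-likelihood for each candidate K,
        averaged over M random train/test partitions with a fraction beta held out."""
        rng = np.random.default_rng(seed)
        N = len(sequences)
        n_test = int(round(beta * N))
        cv = np.zeros((len(K_values), M))
        for r in range(M):
            perm = rng.permutation(N)
            test = [sequences[i] for i in perm[:n_test]]
            train = [sequences[i] for i in perm[n_test:]]
            for a, K in enumerate(K_values):
                model = fit_mixture(train, K)             # e.g. HMM-Clustering + Baum-Welch
                cv[a, r] = test_log_likelihood(model, test)
        return cv.mean(axis=1)

    def approximate_posterior_on_K(mean_cv_ll):
        """p(K | D) under a uniform prior on K, obtained from the mean
        cross-validated log-likelihoods (shifted before exponentiating)."""
        ll = np.asarray(mean_cv_ll)
        w = np.exp(ll - ll.max())
        return w / w.sum()

With K_values = range(1, 7), M = 20, and beta = 0.5, this mirrors the experimental setup that produced Figure 3.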
In conclusion, we have shown that model-based probabilistic clustering can be generalized from feature-space clustering to sequence clustering. Log-likelihood between sequence models and sequences was found useful for detecting cluster structure, and cross-validated likelihood was shown to be able to detect the true number of clusters.

References

Baldi, P. and Y. Chauvin, 'Hierarchical hybrid modeling, HMM/NN architectures, and protein applications,' Neural Computation, 8(6), 1541-1565, 1996.

Juang, B. H., and L. R. Rabiner, 'A probabilistic distance measure for hidden Markov models,' AT&T Technical Journal, vol. 64, no. 2, February 1985.

Krogh, A., et al., 'Hidden Markov models in computational biology: applications to protein modeling,' J. Mol. Biol., 235:1501-1531, 1994.

Rabiner, L. R., C. H. Lee, B. H. Juang, and J. G. Wilpon, 'HMM clustering for connected word recognition,' Proc. Int. Conf. Acoust. Speech Sig. Proc., IEEE Press, 405-408, 1989.

Shao, J., 'Linear model selection by cross-validation,' J. Am. Stat. Assoc., 88(422), 486-494, 1993.

Smyth, P., 'Clustering using Monte-Carlo cross validation,' in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA: AAAI Press, pp. 126-133, 1996.

Zeevi, A. J., R. Meir, and R. Adler, 'Time series prediction using mixtures of experts,' in this volume, 1997.