{"title": "Training Algorithms for Hidden Markov Models using Entropy Based Distance Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 641, "page_last": 647, "abstract": null, "full_text": "Training Algorithms for Hidden Markov Models \n\nUsing Entropy Based Distance Functions \n\nYoram Singer \n\nAT&T Laboratories \n600 Mountain Avenue \nMurray Hill, NJ 07974 \nsinger@research.att.com \n\nManfred K. Warmuth \n\nComputer Science Department \n\nUniversity of California \nSanta Cruz, CA 95064 \nmanfred@cse.ucsc.edu \n\nAbstract \n\nWe present new algorithms for parameter estimation of HMMs. By adapting a framework used for supervised learning, we construct iterative algorithms that maximize the likelihood of the observations while also attempting to stay \"close\" to the current estimated parameters. We use a bound on the relative entropy between the two HMMs as a distance measure between them. The result is new iterative training algorithms which are similar to the EM (Baum-Welch) algorithm for training HMMs. The proposed algorithms are composed of a step similar to the expectation step of Baum-Welch and a new update of the parameters which replaces the maximization (re-estimation) step. The algorithm takes only negligibly more time per iteration and an approximated version uses the same expectation step as Baum-Welch. We evaluate experimentally the new algorithms on synthetic and natural speech pronunciation data. For sparse models, i.e. models with a relatively small number of non-zero parameters, the proposed algorithms require significantly fewer iterations. \n\n1 Preliminaries \nWe use the numbers from 0 to $N$ to name the states of an HMM. State 0 is a special initial state and state $N$ is a special final state. Any state sequence, denoted by $s$, starts with the initial state but never returns to it and ends in the final state. \n
Observation symbols are also numbers in $\\{1, \\ldots, M\\}$ and observation sequences are denoted by $x$. A discrete output hidden Markov model (HMM) is parameterized by two matrices $A$ and $B$. The first matrix is of dimension $[N, N]$ and $a_{i,j}$ ($0 \\le i \\le N-1$, $1 \\le j \\le N$) denotes the probability of moving from state $i$ to state $j$. The second matrix is of dimension $[N+1, M]$ and $b_{i,k}$ is the probability of outputting symbol $k$ at state $i$. The set of parameters of an HMM is denoted by $\\theta = (A, B)$. (The initial state distribution vector is represented by the first row of $A$.) \n\nAn HMM is a probabilistic generator of sequences. It starts in the initial state 0. It then iteratively does the following until the final state is reached. If $i$ is the current state then a next state $j$ is chosen according to the transition probabilities out of the current state (row $i$ of matrix $A$). After arriving at state $j$ a symbol is output according to the output probabilities of that state (row $j$ of matrix $B$). Let $P(x, s \\mid \\theta)$ denote the probability (likelihood) that an HMM $\\theta$ generates the observation sequence $x$ on the path $s$ starting at state 0 and ending at state $N$: $P(x, s \\mid |s| = |x| + 1, s_0 = 0, s_{|s|} = N, \\theta) \\stackrel{\\text{def}}{=} \\prod_{t=1}^{|s|} a_{s_{t-1}, s_t} b_{s_t, x_t}$. For the sake of brevity we omit the conditions on $s$ and $x$. Throughout the paper we assume that the HMMs are absorbing, that is, from every state there is a path to the final state with a non-zero probability. Similar parameter estimation algorithms can be derived for ergodic HMMs. Absorbing HMMs induce a probability over all state-observation sequences, i.e. $\\sum_{x,s} P(x, s \\mid \\theta) = 1$. The likelihood of an observation sequence $x$ is obtained by summing over all possible hidden paths (state sequences), $P(x \\mid \\theta) = \\sum_s P(x, s \\mid \\theta)$. \n
To obtain the likelihood for a set $X$ of observations we simply multiply the likelihood values for the individual sequences. We seek an HMM $\\theta$ that maximizes the likelihood for a given set of observations $X$, or equivalently, maximizes the log-likelihood, $LL(X \\mid \\theta) = \\frac{1}{|X|} \\sum_{x \\in X} \\ln P(x \\mid \\theta)$. \n\nTo simplify our notation we denote the generic parameter in $\\theta$ by $\\theta_i$, where $i$ ranges from 1 to the total number of parameters in $A$ and $B$ (there might be fewer if some are clamped to zero). We denote the total number of parameters of $\\theta$ by $I$ and leave the (fixed) correspondence between the $\\theta_i$ and the entries of $A$ and $B$ unspecified. The indices are naturally partitioned into classes corresponding to the rows of the matrices. We denote by $[i]$ the class of parameters to which $\\theta_i$ belongs and by $\\theta_{[i]}$ the vector of all $\\theta_j$ s.t. $j \\in [i]$. If $j \\in [i]$ then both $\\theta_i$ and $\\theta_j$ are parameters from the same row of one of the two matrices. Whenever it is clear from the context, we will use $[i]$ to denote both a class of parameters and the row number (i.e. state) associated with the class. We can now rewrite $P(x, s \\mid \\theta)$ as $\\prod_{i=1}^{I} \\theta_i^{n_i(x,s)}$, where $n_i(x, s)$ is the number of times parameter $i$ is used along the path $s$ with observation sequence $x$. (Note that this value does not depend on the actual parameters $\\theta$.) We next compute partial derivatives of the likelihood and the log-likelihood using this notation. \n\n$$\\frac{\\partial}{\\partial \\theta_i} P(x, s \\mid \\theta) = \\theta_1^{n_1(x,s)} \\cdots \\theta_{i-1}^{n_{i-1}(x,s)} \\, n_i(x,s) \\, \\theta_i^{n_i(x,s)-1} \\cdots \\theta_I^{n_I(x,s)} = \\frac{n_i(x,s)}{\\theta_i} P(x, s \\mid \\theta)$$ \n\n$$\\frac{\\partial LL(X \\mid \\theta)}{\\partial \\theta_i} = \\frac{1}{|X|} \\sum_{x \\in X} \\sum_s \\frac{n_i(x,s)}{\\theta_i} \\frac{P(x, s \\mid \\theta)}{P(x \\mid \\theta)} = \\frac{1}{|X| \\, \\theta_i} \\sum_{x \\in X} \\hat{n}_i(x \\mid \\theta)$$ \n\nHere $\\hat{n}_i(x \\mid \\theta) \\stackrel{\\text{def}}{=} \\sum_s n_i(x, s) P(s \\mid x, \\theta)$ is the expected number of occurrences of the transition/output that corresponds to $\\theta_i$ over all paths that produce $x$ in $\\theta$. \n
These values are calculated in the expectation step of the Expectation-Maximization (EM) training algorithm for HMMs [7], also known as the Baum-Welch [2] or the Forward-Backward algorithm. In the next sections we use the following additional expectations, $\\hat{n}_i(\\theta) \\stackrel{\\text{def}}{=} \\sum_{x,s} n_i(x, s) P(x, s \\mid \\theta)$ and $\\hat{n}_{[i]}(\\theta) \\stackrel{\\text{def}}{=} \\sum_{j \\in [i]} \\hat{n}_j(\\theta)$. Note that the summation here is over all legal $x$ and $s$ of arbitrary length and $\\hat{n}_{[i]}(\\theta)$ is the expected number of times the state $[i]$ was visited. \n\n2 Entropic distance functions for HMMs \nOur training algorithms are based on the following framework of Kivinen and Warmuth for motivating iterative updates [6]. Assume we have already done a number of iterations and our current parameters are $\\theta$. Assume further that $X$ is the set of observations to be processed in the current iteration. In the batch case this set never changes and in the on-line case $X$ is typically a single observation. The new parameters $\\tilde{\\theta}$ should stay close to $\\theta$, which incorporates all the knowledge obtained in past iterations, but they should also maximize the log-likelihood on the current data set $X$. Thus, instead of maximizing the log-likelihood we maximize $U(\\tilde{\\theta}) = \\eta \\, LL(X \\mid \\tilde{\\theta}) - d(\\tilde{\\theta}, \\theta)$ (see [6, 5] for further motivation). Here $d$ measures the distance between the old and new parameters and $\\eta > 0$ is a trade-off factor. Maximizing $U(\\tilde{\\theta})$ is usually difficult since both the distance function and the log-likelihood depend on $\\tilde{\\theta}$. As in [6, 5], we approximate the log-likelihood by a first order Taylor expansion around $\\tilde{\\theta} = \\theta$ and add Lagrange multipliers for the constraints that the parameters of each class must sum to one: \n\n$$U(\\tilde{\\theta}) \\approx \\eta \\left( LL(X \\mid \\theta) + (\\tilde{\\theta} - \\theta) \\nabla_{\\theta} LL(X \\mid \\theta) \\right) - d(\\tilde{\\theta}, \\theta) + \\sum_{[i]} \\lambda_{[i]} \\sum_{j \\in [i]} \\tilde{\\theta}_j . \\qquad (3)$$ \n\nA commonly used distance function is the relative entropy. To calculate the relative entropy between two HMMs we need to sum over all possible hidden state sequences, which leads to the following definition, \n\n$$d_{RE}(\\tilde{\\theta}, \\theta) \\stackrel{\\text{def}}{=} \\sum_x P(x \\mid \\tilde{\\theta}) \\ln \\frac{P(x \\mid \\tilde{\\theta})}{P(x \\mid \\theta)} = \\sum_x \\left( \\sum_s P(x, s \\mid \\tilde{\\theta}) \\right) \\ln \\frac{\\sum_s P(x, s \\mid \\tilde{\\theta})}{\\sum_s P(x, s \\mid \\theta)} .$$ \n\nHowever, the above divergence is very difficult to calculate and is not a convex function in $\\tilde{\\theta}$. To avoid the computational difficulties and the non-convexity of $d_{RE}$ we upper bound the relative entropy using the log sum inequality [3]: \n\n$$d_{RE}(\\tilde{\\theta}, \\theta) \\le \\hat{d}_{RE}(\\tilde{\\theta}, \\theta) \\stackrel{\\text{def}}{=} \\sum_{x,s} P(x, s \\mid \\tilde{\\theta}) \\ln \\frac{P(x, s \\mid \\tilde{\\theta})}{P(x, s \\mid \\theta)} = \\sum_{x,s} P(x, s \\mid \\tilde{\\theta}) \\ln \\frac{\\prod_{i=1}^{I} \\tilde{\\theta}_i^{n_i(x,s)}}{\\prod_{i=1}^{I} \\theta_i^{n_i(x,s)}} = \\sum_{x,s} P(x, s \\mid \\tilde{\\theta}) \\sum_{i=1}^{I} n_i(x, s) \\ln \\frac{\\tilde{\\theta}_i}{\\theta_i} = \\sum_{i=1}^{I} \\ln \\frac{\\tilde{\\theta}_i}{\\theta_i} \\sum_{x,s} P(x, s \\mid \\tilde{\\theta}) \\, n_i(x, s) = \\sum_{i=1}^{I} \\hat{n}_i(\\tilde{\\theta}) \\ln \\frac{\\tilde{\\theta}_i}{\\theta_i} .$$ \n\nNote that for the distance function $\\hat{d}_{RE}(\\tilde{\\theta}, \\theta)$ an HMM is viewed as a joint distribution between observation sequences and hidden state sequences. We can further simplify the bound on the relative entropy using the following lemma (proof omitted). \n\nLemma 1 For any absorbing HMM $\\theta$ and any parameter $\\theta_i \\in \\theta$, $\\hat{n}_i(\\theta) = \\theta_i \\hat{n}_{[i]}(\\theta)$. \n\nThis gives the following new formula, $\\hat{d}_{RE}(\\tilde{\\theta}, \\theta) = \\sum_{i=1}^{I} \\hat{n}_{[i]}(\\tilde{\\theta}) \\left[ \\tilde{\\theta}_i \\ln \\frac{\\tilde{\\theta}_i}{\\theta_i} \\right]$, which can be rewritten as $\\hat{d}_{RE}(\\tilde{\\theta}, \\theta) = \\sum_{[i]} \\hat{n}_{[i]}(\\tilde{\\theta}) \\, d_{RE}(\\tilde{\\theta}_{[i]}, \\theta_{[i]}) = \\sum_{[i]} \\hat{n}_{[i]}(\\tilde{\\theta}) \\sum_{j \\in [i]} \\tilde{\\theta}_j \\ln \\frac{\\tilde{\\theta}_j}{\\theta_j}$. Equation (3) is still difficult to solve since the variables $\\hat{n}_{[i]}(\\tilde{\\theta})$ depend on the new set of parameters (which are not known). We therefore further approximate $\\hat{d}_{RE}(\\tilde{\\theta}, \\theta)$ by the distance function $\\tilde{d}_{RE}(\\tilde{\\theta}, \\theta) = \\sum_{[i]} \\hat{n}_{[i]}(\\theta) \\sum_{j \\in [i]} \\tilde{\\theta}_j \\ln \\frac{\\tilde{\\theta}_j}{\\theta_j}$. \n\n3 New Parameter Updates \nWe now would like to use the distance functions discussed in the previous section in $U(\\tilde{\\theta})$. \n
We first derive our main update using this distance function. This is done by replacing $d(\\tilde{\\theta}, \\theta)$ in $U(\\tilde{\\theta})$ with $\\tilde{d}_{RE}(\\tilde{\\theta}, \\theta)$ and setting the derivatives of the resulting $U(\\tilde{\\theta})$ w.r.t. $\\tilde{\\theta}_i$ to 0. This gives the following set of equations ($i \\in \\{1, \\ldots, I\\}$), \n\n$$\\eta \\frac{\\sum_{x \\in X} \\hat{n}_i(x \\mid \\theta)}{|X| \\, \\theta_i} - \\hat{n}_{[i]}(\\theta) \\left( \\ln \\frac{\\tilde{\\theta}_i}{\\theta_i} + 1 \\right) + \\lambda_{[i]} = 0 ,$$ \n\nwhich are equivalent to \n\n$$\\ln \\frac{\\tilde{\\theta}_i}{\\theta_i} = \\frac{\\eta}{\\hat{n}_{[i]}(\\theta)} \\frac{\\sum_{x \\in X} \\hat{n}_i(x \\mid \\theta)}{|X| \\, \\theta_i} + \\frac{\\lambda_{[i]}}{\\hat{n}_{[i]}(\\theta)} - 1 .$$ \n\nWe can now solve for $\\tilde{\\theta}_i$ and replace $\\lambda_{[i]}$ by a normalization factor which ensures that the sum of the parameters in $[i]$ is 1: \n\n$$\\tilde{\\theta}_i = \\frac{\\theta_i \\exp\\left( \\frac{\\eta}{\\hat{n}_{[i]}(\\theta)} \\frac{\\sum_{x \\in X} \\hat{n}_i(x \\mid \\theta)}{|X| \\, \\theta_i} \\right)}{\\sum_{j \\in [i]} \\theta_j \\exp\\left( \\frac{\\eta}{\\hat{n}_{[i]}(\\theta)} \\frac{\\sum_{x \\in X} \\hat{n}_j(x \\mid \\theta)}{|X| \\, \\theta_j} \\right)} \\qquad (4)$$ \n\nThe above re-estimation rule is the entropic update for HMMs.¹ \n\nWe now derive an alternative to the update of (4). The mixture weights $\\hat{n}_{[i]}(\\theta)$ (which approximate the original mixture weights $\\hat{n}_{[i]}(\\tilde{\\theta})$ in $\\hat{d}_{RE}(\\tilde{\\theta}, \\theta)$) lead to a state dependent learning rate of $\\eta / \\hat{n}_{[i]}(\\theta)$ for the parameters of class $[i]$. If computation time is limited (see discussion below) then the expectations $\\hat{n}_{[i]}(\\theta)$ can be approximated by values that are readily available. One possible choice is to use the sample based expectations $\\sum_{j \\in [i]} \\sum_{x \\in X} \\hat{n}_j(x \\mid \\theta) / |X|$ as an approximation for $\\hat{n}_{[i]}(\\theta)$. These weights are needed for calculating the gradient and are evaluated in the expectation step of Baum-Welch. Let $\\hat{n}_{[i]}(x \\mid \\theta) \\stackrel{\\text{def}}{=} \\sum_{j \\in [i]} \\hat{n}_j(x \\mid \\theta)$, then this approximation leads to the following distance function \n\n$$\\sum_{[i]} \\frac{\\sum_{x \\in X} \\hat{n}_{[i]}(x \\mid \\theta)}{|X|} \\, d_{RE}(\\tilde{\\theta}_{[i]}, \\theta_{[i]}) = \\sum_{[i]} \\frac{\\sum_{x \\in X} \\hat{n}_{[i]}(x \\mid \\theta)}{|X|} \\sum_{j \\in [i]} \\tilde{\\theta}_j \\ln \\frac{\\tilde{\\theta}_j}{\\theta_j} , \\qquad (5)$$ \n\nwhich results in an update which we call the approximated entropic update for HMMs: \n\n$$\\tilde{\\theta}_i = \\frac{\\theta_i \\exp\\left( \\frac{\\eta}{\\sum_{x \\in X} \\hat{n}_{[i]}(x \\mid \\theta)} \\frac{\\sum_{x \\in X} \\hat{n}_i(x \\mid \\theta)}{\\theta_i} \\right)}{\\sum_{j \\in [i]} \\theta_j \\exp\\left( \\frac{\\eta}{\\sum_{x \\in X} \\hat{n}_{[i]}(x \\mid \\theta)} \\frac{\\sum_{x \\in X} \\hat{n}_j(x \\mid \\theta)}{\\theta_j} \\right)} \\qquad (6)$$ \n\nGiven a current set of parameters $\\theta$ and a learning rate $\\eta$ we obtain a new set of parameters $\\tilde{\\theta}$ by iteratively evaluating the right-hand side of the entropic update or the approximated entropic update. We calculate the expectations $\\hat{n}_i(x \\mid \\theta)$ as done in the expectation step of Baum-Welch. The weights $\\hat{n}_{[i]}(x \\mid \\theta)$ are obtained by summing $\\hat{n}_j(x \\mid \\theta)$ for $j \\in [i]$. This lets us evaluate the right-hand side of the approximated entropic update. The entropic update is slightly more involved and requires an additional calculation of $\\hat{n}_{[i]}(\\theta)$. (Recall that $\\hat{n}_{[i]}(\\theta)$ is the expected number of times state $[i]$ is visited, unconditioned on the data.) To compute these expectations we need to sum over all possible sequences of state-observation pairs. Since the probabilities of outputting the possible symbols at a given state sum to one, calculating $\\hat{n}_{[i]}(\\theta)$ reduces to evaluating the probability of reaching a state for each possible time and sequence length. For absorbing HMMs $\\hat{n}_{[i]}(\\theta)$ can be approximated efficiently using dynamic programming; we compute $\\hat{n}_{[i]}(\\theta)$ by summing the probabilities of all legal state sequences $s$ of up to length $CN$ (typically $C = 3$ proved to be sufficient to obtain very accurate approximations of $\\hat{n}_{[i]}(\\theta)$). Therefore, the time complexity of calculating $\\hat{n}_{[i]}(\\theta)$ depends only on the number of states, regardless of the dimension of the output vector $M$ and the training data $X$. \n\n¹ A subtle improvement is possible over the update (4) by treating the transition probabilities and output probabilities differently. First the transition probabilities are updated based on (4). Then the state probabilities $\\hat{n}_{[i]}(\\tilde{\\theta}) = \\hat{n}_{[i]}(\\tilde{A})$ are recomputed based on the new parameters $\\tilde{A}$. \n
This is possible since the state probabilities depend only on the transition probabilities and not on the output probabilities. Finally the output probabilities are updated with (4) where the $\\hat{n}_{[i]}(\\tilde{\\theta})$ are used in place of the $\\hat{n}_{[i]}(\\theta)$. \n\n4 The relation to EM and convergence properties \nWe first show that the EM algorithm for HMMs can be derived using our framework. To do so, we approximate the relative entropy by the $\\chi^2$ distance (see [3]), $d_{RE}(\\tilde{p}, p) \\approx d_{\\chi^2}(\\tilde{p}, p) \\stackrel{\\text{def}}{=} \\frac{1}{2} \\sum_i \\frac{(\\tilde{p}_i - p_i)^2}{p_i}$, and use this distance to approximate $\\hat{d}_{RE}(\\tilde{\\theta}, \\theta)$: \n\n$$\\hat{d}_{RE}(\\tilde{\\theta}, \\theta) \\approx \\hat{d}_{\\chi^2}(\\tilde{\\theta}, \\theta) \\stackrel{\\text{def}}{=} \\sum_{[i]} \\hat{n}_{[i]}(\\tilde{\\theta}) \\, d_{\\chi^2}(\\tilde{\\theta}_{[i]}, \\theta_{[i]}) \\approx \\sum_{[i]} \\hat{n}_{[i]}(\\theta) \\, d_{\\chi^2}(\\tilde{\\theta}_{[i]}, \\theta_{[i]}) \\approx \\sum_{[i]} \\frac{\\sum_{x \\in X} \\hat{n}_{[i]}(x \\mid \\theta)}{|X|} \\, d_{\\chi^2}(\\tilde{\\theta}_{[i]}, \\theta_{[i]}) .$$ \n\nHere $d_{\\chi^2}(\\tilde{\\theta}_{[i]}, \\theta_{[i]}) = \\frac{1}{2} \\sum_{j \\in [i]} \\frac{(\\tilde{\\theta}_j - \\theta_j)^2}{\\theta_j}$. By maximizing $U(\\tilde{\\theta})$ with the last version of the $\\chi^2$ distance function and following the same derivation steps as for the approximated entropic update we arrive at what we call the approximated $\\chi^2$ update for HMMs: \n\n$$\\tilde{\\theta}_i = (1 - \\eta) \\theta_i + \\eta \\sum_{x \\in X} \\hat{n}_i(x \\mid \\theta) \\Big/ \\sum_{x \\in X} \\hat{n}_{[i]}(x \\mid \\theta) . \\qquad (7)$$ \n\nSetting $\\eta = 1$ results in the update $\\tilde{\\theta}_i = \\sum_{x \\in X} \\hat{n}_i(x \\mid \\theta) / \\sum_{x \\in X} \\hat{n}_{[i]}(x \\mid \\theta)$, which is the maximization (re-estimation) step of the EM algorithm. \n\nAlthough omitted from this paper due to lack of space, it can be shown that for $\\eta \\in (0, 1]$ the entropic updates and the $\\chi^2$ update improve the likelihood on each iteration. Therefore, these updates belong to the family of Generalized EM (GEM) algorithms which are guaranteed to converge to a local maximum given some additional conditions [4]. Furthermore, using infinitesimal analysis and second order approximation of the likelihood function at the (local) maximum similar to [10], \n
it can be shown that the approximated $\\chi^2$ update is a contraction mapping and close to the local maximum there exists a learning rate $\\eta > 1$ which results in a faster rate of convergence than when using $\\eta = 1$. \n\n5 Experiments with Artificial and Natural Data \nIn order to test the actual convergence rate of the algorithms and to compare them to Baum-Welch we created synthetic data using HMMs. In our experiments we mainly used sparse models, that is, models with many parameters clamped to zero. Previous work (e.g., [5, 6]) might suggest that the entropic updates will perform better on sparse models. (Indeed, when we used dense models to generate the data, the algorithms showed almost the same performance.) The training algorithms, however, were started from a randomly chosen dense model. When comparing the algorithms we used the same initial model. Due to different trajectories in parameter space, each algorithm may converge to a different (local) maximum. For clarity of presentation we show here results for cases where all updates converged to the same maximum, which often occurs when the HMM generating the data is sparse and there are enough examples (typically tens of observations per non-zero parameter). We tested both the entropic updates and the $\\chi^2$ updates. Learning rates greater than one speed up convergence. The two entropic updates converge almost equally fast on synthetic data generated by an HMM. For natural data the entropic update converges slightly faster than the approximated version. The $\\chi^2$ update also benefits from learning rates larger than one. However, the $\\chi^2$ update needs to be used carefully since it does not necessarily ensure non-negativity of the new parameters for $\\eta > 1$. This problem is exacerbated when the data is not generated by an HMM. \n
We therefore used the entropic updates in our experiments with natural data. In order to have a fair comparison, we did not tune the learning rate $\\eta$ and set it to 1.5. In Figure 1 we give a comparison of the entropic update, the approximated entropic update, and Baum-Welch (left figure), using an HMM to generate the random observation sequences, where $N = M = 40$ but only 25% (10 parameters on average for each transition/observation vector) of the parameters of the HMM are non-zero. The performance of the entropic update and the approximated entropic update is practically the same and both updates clearly outperform Baum-Welch. One reason the performance of the two entropic updates is the same is that the observations were indeed generated by an HMM. In this case, approximating the expectations $\\hat{n}_{[i]}(\\theta)$ by the sample based expectations seems reasonable. These results suggest a valuable alternative to using Baum-Welch with a predetermined sparse, potentially biased, HMM where a large number of parameters is clamped to zero. Instead, we suggest starting with a full model and letting one of the entropic updates find the relevant parameters. This approach is demonstrated in the right part of Figure 1. In this example the data was generated by a sparse HMM with 100 states and 100 possible output symbols. Only 10% of the HMM's parameters were non-zero. Three log-likelihood curves are given in the figure. One is the log-likelihood achieved by Baum-Welch when only those parameters that are non-zero in the HMM generating the data are initialized to random non-zero values. The other two are the log-likelihood of the entropic update and Baum-Welch when all the parameters are initialized randomly. \n
The curves show that the entropic update compensates for its inferior initialization in less than 10 iterations (see horizontal line in Figure 1) and from this point on it requires only 23 more iterations to converge compared to Baum-Welch, which is given prior knowledge of the non-zero parameters. In contrast, when Baum-Welch is started with a full model then its convergence is much slower than the entropic update. \n\n[Figure 1 plots omitted. Both panels have Iteration on the horizontal axis. Left panel: the two entropic updates and EM (Baum-Welch). Right panel: EM (Baum-Welch), Random Init.; Entropic Update, Random Init.; EM (Baum-Welch), Sparse Init.] \n\nFigure 1: Comparison of the entropic updates and Baum-Welch. \n\nWe next tested the updates on speech pronunciation data. In natural speech, a word might be pronounced differently by different speakers. A common practice is to construct a set of stochastic models in order to capture the variability of the possible alternative pronunciations of a given word. This problem was studied previously in [9] using a state merging algorithm for HMMs and in [8] using a subclass of probabilistic finite automata. The purpose of the experiments discussed here is not to compare the above algorithms to the entropic updates but rather to compare the entropic updates to Baum-Welch. Nevertheless, the resulting HMM pronunciation models are usually sparse. \n
Typically, only two or three phonemes have a non-zero output probability at a given state and the average number of states that in practice can follow a state is about 2. Therefore, the entropic updates may provide a good alternative to the algorithms presented in [8, 9]. \n\nWe used the TIMIT (Texas Instruments-MIT) database as in [8, 9]. This database contains the acoustic waveforms of continuous speech with phone labels from an alphabet of 62 phones which constitute a temporally aligned phonetic transcription of the uttered words. For the purpose of building pronunciation models, the acoustic data was ignored and we partitioned the phonetic labels according to the words that appeared in the data. The data was filtered and partitioned so that words occurring between 20 and 100 times in the dataset were used for training and evaluation according to the following partition. 75% of the occurrences of each word were used as training data for the learning algorithm and the remaining 25% were used for evaluation. We then built for each word three pronunciation models by training a fully connected HMM whose number of states was set to 1, 1.5 and 1.75 times the length of the longest sample (denoted by $N_m$). The models were evaluated by calculating the log-likelihood (averaged over 10 different random parameter initializations) of each HMM on the phonetic transcription of each word in the test set. In Table 1 we give the negative log-likelihood achieved on the test data together with the average number of iterations needed for training. Overall the differences in the log-likelihood are small which means that the results should be interpreted with some caution. Nevertheless, the entropic update obtained the highest likelihood on the test data while needing the fewest iterations. \n
The approximated entropic update and Baum-Welch achieve similar results on the test data but the latter requires more iterations. Checking the resulting models reveals one reason why the entropic update achieves higher likelihood values, namely, it does a better job in setting the irrelevant parameters to zero (and it does it faster). \n\n                   Negative Log-Likelihood        # Iterations \n# States           1.0Nm    1.5Nm    1.75Nm       1.0Nm    1.5Nm    1.75Nm \nBaum-Welch         2448     2388     2425         27.4     36.1     41.1 \nApprox. EU         2440     2389     2426         25.5     35.0     37.0 \nEntropic Update    2418     2352     2405         23.1     30.9     32.6 \n\nTable 1: Comparison of the entropic updates and Baum-Welch on speech pronunciation data. \n\n6 Conclusions and future research \nIn this paper we have shown how the framework of Kivinen and Warmuth [6] can be used to derive parameter update algorithms for HMMs. We view an HMM as a joint distribution between the observation sequences and hidden state sequences and use a bound on relative entropy as a distance between the new and old parameter settings. If we approximate the relative entropy by the $\\chi^2$ distance, replace the exact state expectations by a sample based approximation, and fix the learning rate to one then the framework yields an alternative derivation of the EM algorithm for HMMs. Since the EM update uses sample based estimates of the state expectations it is hard to use it in an on-line setting. In contrast, the on-line versions of our updates can be easily derived using only one observation sequence at a time. Also, there are alternative gradient descent based methods for estimating the parameters of HMMs. Such methods usually employ an exponential parameterization (such as soft-max) of the parameters (see [1]). \n
For the case of learning one set of mixture coefficients an exponential parameterization led to an algorithm with a slower convergence rate compared to algorithms derived using entropic distances [5]. However, it is not clear whether this is still the case for HMMs. Our future goal is to perform a comparative study of the different updates with emphasis on the on-line versions. \n\nAcknowledgments \nWe thank Anders Krogh for showing us the simple derivative calculations used in this paper and thank Fernando Pereira and Yasubumi Sakakibara for interesting discussions. \n\nReferences \n[1] P. Baldi and Y. Chauvin. Smooth on-line learning algorithms for Hidden Markov Models. Neural Computation, 6(2), 1994. \n[2] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37, 1966. \n[3] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991. \n[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39:1-38, 1977. \n[5] D. P. Helmbold, R. E. Schapire, Y. Singer, and M. K. Warmuth. A comparison of new and old algorithms for a mixture estimation problem. In Proceedings of the Eighth Annual Workshop on Computational Learning Theory, pages 69-78, 1995. \n[6] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 1997. To appear. \n[7] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, 1986. \n[8] D. Ron, Y. Singer, and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proc. of the Eighth Annual Workshop on Computational Learning Theory, 1995. \n[9] A. Stolcke and S. Omohundro. Hidden Markov model induction by Bayesian model merging. \n
In Advances in Neural Information Processing Systems, volume 5. Morgan Kaufmann, 1993. \n[10] L. Xu and M. I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8:129-151, 1996. \n", "award": [], "sourceid": 1263, "authors": [{"given_name": "Yoram", "family_name": "Singer", "institution": null}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": null}]}