{"title": "Markov Processes on Curves for Automatic Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 751, "page_last": 760, "abstract": null, "full_text": "Markov processes on curves for automatic speech recognition\n\nLawrence Saul and Mazin Rahim\n\nAT&T Labs - Research\nShannon Laboratory\n180 Park Ave E-171\nFlorham Park, NJ 07932\n\n{lsaul,mazin}@research.att.com\n\nAbstract\n\nWe investigate a probabilistic framework for automatic speech recognition based on the intrinsic geometric properties of curves. In particular, we analyze the setting in which two variables, one continuous (x) and one discrete (s), evolve jointly in time. We suppose that the vector x traces out a smooth multidimensional curve and that the variable s evolves stochastically as a function of the arc length traversed along this curve. Since arc length does not depend on the rate at which a curve is traversed, this gives rise to a family of Markov processes whose predictions, Pr[s|x], are invariant to nonlinear warpings of time. We describe the use of such models, known as Markov processes on curves (MPCs), for automatic speech recognition, where x are acoustic feature trajectories and s are phonetic transcriptions. On two tasks, recognizing New Jersey town names and connected alpha-digits, we find that MPCs yield lower word error rates than comparably trained hidden Markov models.\n\n1  Introduction\n\nVariations in speaking rate currently present a serious challenge for automatic speech recognition (ASR) (Siegler & Stern, 1995). It is widely observed, for example, that fast speech is more prone to recognition errors than slow speech. A related effect, occurring at the phoneme level, is that consonants are more frequently botched than vowels. 
Generally speaking, consonants have short-lived, non-stationary acoustic signatures; vowels, just the opposite. Thus, at the phoneme level, we can view the increased confusability of consonants as a consequence of locally fast speech.\n\n\f752 \n\nL.  Saul and M.  Rahim \n\n[Figure 1 diagram: the curve x(t) runs from START (t = 0) to END (t = $\\tau$); the label s(t) = $s_1$ marks one segment.]\n\nFigure 1: Two variables, one continuous (x) and one discrete (s), evolve jointly in time. The trace of s partitions the curve of x into different segments whose boundaries occur where s changes value.\n\nIn this paper, we investigate a probabilistic framework for ASR that models variations in speaking rate as arising from nonlinear warpings of time (Tishby, 1990). Our framework is based on the observation that acoustic feature vectors trace out continuous trajectories (Ostendorf et al., 1996). We view these trajectories as multidimensional curves whose intrinsic geometric properties (such as arc length or radius) do not depend on the rate at which they are traversed (do Carmo, 1976). We describe a probabilistic model whose predictions are based on these intrinsic geometric properties and, as such, are invariant to nonlinear warpings of time. The handling of this invariance distinguishes our methods from traditional hidden Markov models (HMMs) (Rabiner & Juang, 1993).\n\nThe probabilistic models studied in this paper are known as Markov processes on curves (MPCs). The theoretical framework for MPCs was introduced in an earlier paper (Saul, 1997), which also discussed the problems of decoding and parameter estimation. In the present work, we report the first experimental results for MPCs on two difficult benchmark problems in ASR. 
On these problems, recognizing New Jersey town names and connected alpha-digits, our results show that MPCs generally match or exceed the performance of comparably trained HMMs.\n\nThe organization of this paper is as follows. In section 2, we review the basic elements of MPCs and discuss important differences between MPCs and HMMs. In section 3, we present our experimental results and evaluate their significance.\n\n2  Markov processes on curves\n\nSpeech recognizers take a continuous acoustic signal as input and return a sequence of discrete labels representing phonemes, syllables, or words as output. Typically the short-time properties of the speech signal are summarized by acoustic feature vectors. Thus the abstract mathematical problem is to describe a multidimensional trajectory $\\{x(t) \\mid t \\in [0, \\tau]\\}$ by a sequence of discrete labels $s_1 s_2 \\ldots s_n$. As shown in figure 1, this is done by specifying consecutive time intervals such that $s(t) = s_k$ for $t \\in [t_{k-1}, t_k]$ and attaching the labels $s_k$ to contiguous arcs along the trajectory. To formulate a probabilistic model of this process, we consider two variables, one continuous (x) and one discrete (s), that evolve jointly in time. Thus the vector x traces out a smooth multidimensional curve, to each point of which the variable s attaches a discrete label.\n\nMarkov processes on curves are based on the concept of arc length. After reviewing how to compute arc lengths along curves, we introduce a family of Markov processes whose predictions are invariant to nonlinear warpings of time. We then consider the ways in which these processes (and various generalizations) differ from HMMs. 
\n\n\fMarkov Processes on Curves for Automatic Speech Recognition \n\n753 \n\n2.1  Arc length\n\nLet g(x) define a D x D matrix-valued function over $x \\in \\mathbb{R}^D$. If g(x) is everywhere non-negative definite, then we can use it as a metric to compute distances along curves. In particular, consider two nearby points separated by the infinitesimal vector dx. We define the squared distance between these two points as:\n\n$$d\\ell^2 = dx^T g(x)\\, dx. \\qquad (1)$$\n\nArc length along a curve is the non-decreasing function computed by integrating these local distances. Thus, for the trajectory x(t), the arc length between the points $x(t_1)$ and $x(t_2)$ is given by:\n\n$$\\ell = \\int_{t_1}^{t_2} dt\\, [\\dot{x}^T g(x)\\, \\dot{x}]^{1/2}, \\qquad (2)$$\n\nwhere $\\dot{x} = \\frac{d}{dt}[x(t)]$ denotes the time derivative of x. Note that the arc length defined by eq. (2) is invariant under reparameterizations of the trajectory, $x(t) \\rightarrow x(f(t))$, where f(t) is any smooth monotonic function of time that maps the interval $[t_1, t_2]$ into itself.\n\nIn the special case where g(x) is the identity matrix, eq. (2) reduces to the standard definition of arc length in Euclidean space. More generally, however, eq. (1) defines a non-Euclidean metric for computing arc lengths. Thus, for example, if the metric g(x) varies as a function of x, then eq. (2) can assign different arc lengths to the trajectories x(t) and $x(t) + x_0$, where $x_0$ is a constant displacement.\n\n2.2  States and lifelengths\n\nWe now return to the problem of segmentation, as illustrated in figure 1. We refer to the possible values of s as states. MPCs are conditional random processes that evolve the state variable s stochastically as a function of the arc length traversed along the curve of x. In MPCs, the probability of remaining in a particular state decays exponentially with the cumulative arc length traversed in that state. 
The signature of a state is the particular way in which it computes arc length.\n\nTo formalize this idea, we associate with each state i the following quantities: (i) a feature-dependent matrix $g_i(x)$ that can be used to compute arc lengths, as in eq. (2); (ii) a decay parameter $\\lambda_i$ that measures the probability per unit arc length that s makes a transition from state i to some other state; and (iii) a set of transition probabilities $a_{ij}$, where $a_{ij}$ represents the probability that, having decayed out of state i, the variable s makes a transition to state j. Thus, $a_{ij}$ defines a stochastic transition matrix with zero elements along the diagonal and rows that sum to one: $a_{ii} = 0$ and $\\sum_j a_{ij} = 1$. A Markov process is defined by the set of differential equations:\n\n$$\\frac{dp_i}{dt} = -\\lambda_i p_i [\\dot{x}^T g_i(x)\\, \\dot{x}]^{1/2} + \\sum_{j \\neq i} \\lambda_j p_j a_{ji} [\\dot{x}^T g_j(x)\\, \\dot{x}]^{1/2}, \\qquad (3)$$\n\nwhere $p_i(t)$ denotes the (forward) probability that s is in state i at time t, based on its history up to that point in time. The right hand side of eq. (3) consists of two competing terms. The first term computes the probability that s decays out of state i; the second computes the probability that s decays into state i. Both terms are proportional to measures of arc length, making the evolution of $p_i$ along the curve of x invariant to nonlinear warpings of time. The decay parameter, $\\lambda_i$, controls the typical amount of arc length traversed in state i; it may be viewed as an inverse lifetime or, to be more precise, an inverse lifelength. The entire process is Markovian because the evolution of $p_i$ depends only on quantities available at time t. 
\n\n2.3  Decoding\n\nGiven a trajectory x(t), the Markov process in eq. (3) gives rise to a conditional probability distribution over possible segmentations, s(t). Consider the segmentation in which s(t) takes the value $s_k$ between times $t_{k-1}$ and $t_k$, and let\n\n$$\\ell_{s_k} = \\int_{t_{k-1}}^{t_k} dt\\, [\\dot{x}^T g_{s_k}(x)\\, \\dot{x}]^{1/2} \\qquad (4)$$\n\ndenote the arc length traversed in state $s_k$. By integrating eq. (3), one can show that the probability of remaining in state $s_k$ decays exponentially with the arc length $\\ell_{s_k}$. Thus, the conditional probability of the overall segmentation is given by:\n\n$$\\Pr[s, \\ell | x] = \\prod_{k=1}^{n} \\lambda_{s_k} e^{-\\lambda_{s_k} \\ell_{s_k}} \\prod_{k=0}^{n} a_{s_k s_{k+1}}, \\qquad (5)$$\n\nwhere we have used $s_0$ and $s_{n+1}$ to denote the START and END states of the Markov process. The first product in eq. (5) multiplies the probabilities that each segment traverses exactly its observed arc length. The second product multiplies the probabilities for transitions between states $s_k$ and $s_{k+1}$. The leading factors of $\\lambda_{s_k}$ are included to normalize each state's distribution over observed arc lengths.\n\nThere are many important quantities that can be computed from the distribution, $\\Pr[s|x]$. Of particular interest for ASR is the most probable segmentation: $s^*(x) = \\mathrm{argmax}_{s,\\ell} \\{\\ln \\Pr[s, \\ell | x]\\}$. As described elsewhere (Saul, 1997), this maximization can be performed by discretizing the time axis and applying a dynamic programming procedure. The resulting algorithm is similar to the Viterbi procedure for maximum likelihood decoding (Rabiner & Juang, 1993).\n\n2.4  Parameter estimation\n\nThe parameters $\\{\\lambda_i, a_{ij}, g_i(x)\\}$ in MPCs are estimated from training data to maximize the log-likelihood of target segmentations. 
In our preliminary experiments with MPCs, we estimated only the metric parameters, $g_i(x)$; the others were assigned the default values $\\lambda_i = 1$ and $a_{ij} = 1/f_i$, where $f_i$ is the fanout of state i. The metrics $g_i(x)$ were assumed to have the parameterized form:\n\n$$g_i(x) = \\sigma_i^{-1} \\Phi_i^2(x), \\qquad (6)$$\n\nwhere $\\sigma_i$ is a positive definite matrix with unit determinant, and $\\Phi_i(x)$ is a non-negative scalar-valued function of x. For the experiments in this paper, the form of $\\Phi_i(x)$ was fixed so that the MPCs reduced to HMMs as a special case, as described in the next section. Thus the only learning problem was to estimate the matrix parameters $\\sigma_i$. This was done using the reestimation formula:\n\n$$\\sigma_i \\leftarrow C \\int dt\\, \\frac{\\dot{x} \\dot{x}^T}{[\\dot{x}^T \\sigma_i^{-1} \\dot{x}]^{1/2}}\\, \\Phi_i(x(t)), \\qquad (7)$$\n\nwhere the integral is over all speech segments belonging to state i, and the constant C is chosen to enforce the determinant constraint $|\\sigma_i| = 1$. For fixed $\\Phi_i(x)$, we have shown previously (Saul, 1997) that this iterative update leads to monotonic increases in the log-likelihood.\n\n2.5  Relation to HMMs and previous work\n\nThere are several important differences between HMMs and MPCs. HMMs parameterize joint distributions of the form: $\\Pr[s, x] = \\prod_t \\Pr[s_{t+1}|s_t] \\Pr[x_t|s_t]$. Thus, in HMMs, parameter estimation is directed at learning a synthesis model, $\\Pr[x|s]$, while in MPCs, it is directed at learning a segmentation model, $\\Pr[s, \\ell | x]$. The direction of conditioning on x is a crucial difference. MPCs do not attempt to learn anything as ambitious as a joint distribution over acoustic feature trajectories.\n\nHMMs and MPCs also differ in how they weight the speech signal. 
In HMMs, each state contributes an amount to the overall log-likelihood that grows in proportion to its duration in time. In MPCs, on the other hand, each state contributes an amount that grows in proportion to its arc length. Naturally, the weighting by arc length attaches a more important role to short-lived but non-stationary phonemes, such as consonants. It also guarantees the invariance to nonlinear warpings of time (to which the predictions of HMMs are quite sensitive).\n\nIn terms of previous work, our motivation for MPCs resembles that of Tishby (1990), who several years ago proposed a dynamical systems approach to speech processing. Because MPCs exploit the continuity of acoustic feature trajectories, they also bear some resemblance to so-called segmental HMMs (Ostendorf et al., 1996). MPCs nevertheless differ from segmental HMMs in two important respects: the invariance to nonlinear warpings of time, and the emphasis on learning a segmentation model $\\Pr[s, \\ell | x]$, as opposed to a synthesis model, $\\Pr[x|s]$.\n\nFinally, we note that by admitting a slight generalization in the concept of arc length, we can essentially realize HMMs as a special case of MPCs. This is done by computing arc lengths along the spacetime trajectories $z(t) = \\{x(t), t\\}$, that is to say, replacing eq. (1) by $d\\ell^2 = [\\dot{z}^T g(z)\\, \\dot{z}]\\, dt^2$, where $\\dot{z} = \\{\\dot{x}, 1\\}$ and g(z) is a spacetime metric. This relaxes the invariance to nonlinear warpings of time and incorporates both movement in acoustic feature space and duration in time as measures of phonemic evolution. 
Moreover, in this setting, one can mimic the predictions of HMMs by setting the $\\sigma_i$ matrices to have only one non-zero element (namely, the diagonal element for delta-time contributions to the arc length) and by defining the functions $\\Phi_i(x)$ in terms of HMM emission probabilities P(x|i) as:\n\n$$\\Phi_i(x) = -\\ln \\left[ \\frac{P(x|i)}{\\sum_k P(x|k)} \\right]. \\qquad (8)$$\n\nThis relation is important because it allows us to initialize the parameters of an MPC by those of a continuous-density HMM. This initialization was used in all the experiments reported below.\n\n3  Automatic speech recognition\n\nBoth HMMs and MPCs were used to build connected speech recognizers. Training and test data came from speaker-independent databases of telephone speech. All data was digitized at the caller's local switch and transmitted in this form to the receiver. For feature extraction, input telephone signals (sampled at 8 kHz and band-limited between 100-3800 Hz) were pre-emphasized and blocked into 30 ms frames with a frame shift of 10 ms. Each frame was Hamming windowed, autocorrelated, and processed by LPC cepstral analysis to produce a vector of 12 liftered cepstral coefficients (Rabiner & Juang, 1993). The feature vector was then augmented by its normalized log energy value, as well as temporal derivatives of first and second order. Overall, each frame of speech was described by 39 features. These features were used differently by HMMs and MPCs, as described below.\n\nMixtures  HMM (%)  MPC (%)\n2         22.3     20.9\n4         18.9     17.5\n8         16.5     15.1\n16        14.6     13.3\n32        13.5     12.3\n64        11.7     11.4\n\n[Graph: NJ town names; error rate plotted against parameters per state.]\n\nTable 1: Word error rates for HMMs (dashed) and MPCs (solid) on the task of recognizing NJ town names. The table shows the error rates versus the number of mixture components; the graph, versus the number of parameters per hidden state.\n\nRecognizers were evaluated on two tasks. The first task was recognizing New Jersey town names (e.g., Newark). The training data for this task (Sachs et al., 1994) consisted of 12100 short phrases, spoken in the seven major dialects of American English. These phrases, ranging from two to four words in length, were selected to provide maximum phonetic coverage. The test data consisted of 2426 isolated utterances of 1219 New Jersey town names and was collected from nearly 100 speakers. Note that the training and test data for this task have non-overlapping vocabularies.\n\nBaseline recognizers were built using 43 left-to-right continuous-density HMMs, each corresponding to a context-independent English phone. Phones were modeled by three-state HMMs, with the exception of background noise, which was modeled by a single state. State emission probabilities were computed by mixtures of Gaussians with diagonal covariance matrices. Different sized models were trained using M = 2, 4, 8, 16, 32, and 64 mixture components per hidden state; for a particular model, the number of mixture components was the same across all states. Parameter estimation was handled by a Viterbi implementation of the Baum-Welch algorithm.\n\nMPC recognizers were built using the same overall grammar. 
Each hidden state in the MPCs was assigned a metric $g_i(x) = \\sigma_i^{-1} \\Phi_i^2(x)$. The functions $\\Phi_i(x)$ were initialized (and fixed) by the state emission probabilities of the HMMs, as given by eq. (8). The matrices $\\sigma_i$ were estimated by iterating eq. (7). We computed arc lengths along the 14 dimensional spacetime trajectories through cepstra, log-energy, and time. Thus each $\\sigma_i$ was a 14 x 14 symmetric matrix applied to tangent vectors consisting of delta-cepstra, delta-log-energy, and delta-time.\n\nThe table in figure 1 shows the results of these experiments comparing MPCs to HMMs. For various model sizes (as measured by the number of mixture components), we found the MPCs to yield consistently lower error rates than the HMMs. The graph in figure 1 plots these word error rates versus the number of modeling parameters per hidden state. This graph shows that the MPCs are not outperforming the HMMs merely because they have extra modeling parameters (i.e., the $\\sigma_i$ matrices). The beam widths for the decoding procedures in these experiments were chosen so that corresponding recognizers activated roughly equal numbers of arcs.\n\nThe second task in our experiments involved the recognition of connected alpha-digits (e.g., N Z 3 V J 4 E 3 U 2). The training and test data consisted of 14622 and 7255 utterances, respectively. Recognizers were built from 285 sub-word HMMs/MPCs, each corresponding to a context-dependent English phone. The recognizers were trained and evaluated in the same way as for the previous task. Results are shown in figure 2.\n\nMixtures  HMM (%)  MPC (%)\n2         12.5     10.0\n4         10.7     8.8\n8         10.0     8.2\n\n[Graph: error rate plotted against parameters per state.]\n\nFigure 2: Word error rates for HMMs and MPCs on the task of recognizing connected alpha-digits. The table shows the error rates versus the number of mixture components; the graph, versus the number of parameters per hidden state.\n\nWhile these results demonstrate the viability of MPCs for automatic speech recognition, several issues require further attention. The most important issues are feature selection (how to define meaningful acoustic trajectories from the raw speech signal) and learning (how to parameterize and estimate the hidden state metrics $g_i(x)$ from sampled trajectories {x(t)}). These issues and others will be studied in future work.\n\nReferences\n\nM. P. do Carmo (1976). Differential Geometry of Curves and Surfaces. Prentice Hall.\n\nM. Ostendorf, V. Digalakis, and O. Kimball (1996). From HMMs to segment models: a unified view of stochastic modeling for speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 4:360-378.\n\nL. Rabiner and B. Juang (1993). Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ.\n\nR. Sachs, M. Tikijian, and E. Roskos (1994). United States English subword speech data. AT&T unpublished report.\n\nL. Saul (1998). Automatic segmentation of continuous trajectories with invariance to nonlinear warpings of time. In Proceedings of the Fifteenth International Conference on Machine Learning, 506-514.\n\nM. A. Siegler and R. M. Stern (1995). On the effects of speech rate in large vocabulary speech recognition systems. 
In Proceedings of the 1995 IEEE International Conference on Acoustics, Speech, and Signal Processing, 612-615.\n\nN. Tishby (1990). A dynamical system approach to speech processing. In Proceedings of the 1990 IEEE International Conference on Acoustics, Speech, and Signal Processing, 365-368.\n", "award": [], "sourceid": 1508, "authors": [{"given_name": "Lawrence", "family_name": "Saul", "institution": null}, {"given_name": "Mazin", "family_name": "Rahim", "institution": null}]}