{"title": "A New Approach to Hybrid HMM/ANN Speech Recognition using Mutual Information Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 772, "page_last": 778, "abstract": null, "full_text": "A New Approach to Hybrid HMM/ANN Speech \nRecognition Using Mutual Information Neural \nNetworks \n\nG. Rigoll, C. Neukirchen \nGerhard-Mercator-University Duisburg \n\nFaculty of Electrical Engineering \nDepartment of Computer Science \n\nBismarckstr. 90, Duisburg, Germany \n\nABSTRACT \n\nThis paper presents a new approach to speech recognition with hybrid \nHMM/ANN technology. While the standard approach to hybrid \nHMM/ANN systems is based on the use of neural networks as \nposterior probability estimators, the new approach is based on the use \nof mutual information neural networks trained with a special learning \nalgorithm in order to maximize the mutual information between the \ninput classes of the network and its resulting sequence of firing output \nneurons during training. It is shown in this paper that such a neural \nnetwork is an optimal neural vector quantizer for a discrete hidden \nMarkov model system trained on Maximum Likelihood principles. \nOne of the main advantages of this approach is the fact that such \nneural networks can be easily combined with HMM's of any \ncomplexity with context-dependent capabilities. It is shown that the \nresulting hybrid system achieves very high recognition rates, which \nare already on the same level as the best conventional HMM \nsystems with continuous parameters, even though the capabilities of the \nmutual information neural networks are not yet entirely exploited. \n\n1  INTRODUCTION \n\nHybrid HMM/ANN systems deal with the optimal combination of artificial neural \nnetworks (ANN) and hidden Markov models (HMM). 
Especially in the area of automatic \nspeech recognition, it has been shown that hybrid approaches can lead to very powerful \nand efficient systems, combining the discriminative capabilities of neural networks and \nthe superior dynamic time warping abilities of HMM's. The most popular hybrid \napproach is described in (Hochberg, 1995) and replaces the component modeling the \nemission probabilities of the HMM by a neural net. This is possible because it is shown \nin (Bourlard, 1994) that neural networks can be trained so that the output of the m-th \nneuron approximates the posterior probability $p(\\Omega_m|x)$. In this paper, an alternative \nmethod for constructing a hybrid system is presented. It is based on the use of discrete \nHMM's which are combined with a neural vector quantizer (VQ) in order to form a hybrid \nsystem. Each speech feature vector is presented to the neural network, which generates a \nfiring neuron in its output layer. This neuron is processed as a VQ label by the HMM's. \nThe following arguments support this alternative hybrid approach: \n\n\u2022 The neural vector quantizer has to be trained on a special information theory criterion, \nbased on the mutual information between network input and resulting neuron firing \nsequence. It will be shown that such a network is the optimal acoustic processor for a \ndiscrete HMM system, resulting in a profound mathematical theory for this approach. \n\n\u2022 Resulting from this theory, a formula can be derived which jointly describes the \nbehavior of the HMM and the neural acoustic processor. In that way, both systems can \nbe described in a unified manner and both major components of the hybrid system can \nbe trained using a unified learning criterion. 
\n\n\u2022 The above-mentioned theoretical background leads to the development of new neural \nnetwork paradigms using novel training algorithms that have not been used before in \nother areas of neurocomputing, and therefore represent major challenges and issues in \nlearning and training for neural systems. \n\n\u2022 The neural networks can be easily combined with any HMM system of arbitrary \ncomplexity. This leads to the combination of optimally trained neural networks with \nvery powerful HMM's, having all features useful for speech recognition, e.g. triphones, \nfunction words, crossword triphones, etc. Context-dependency, which is very desirable \nbut relatively difficult to realize with a pure neural approach, can be left to the HMM's. \n\n\u2022 The resulting hybrid system still has the basic structure of a discrete system, and \ntherefore has all the effective features associated with discrete systems, e.g. quick and \neasy training as well as recognition procedures, real-time capabilities, etc. \n\n\u2022 The work presented in this paper has also been successfully implemented for a \ndemanding speech recognition problem, the 1000 word speaker-independent continuous \nResource Management speech recognition task. For this task, the hybrid system \nproduces one of the best recognition results obtained by any speech recognition system. \n\nIn the following section, the theoretical foundations of the hybrid approach are briefly \nexplained. A unified probabilistic model for the combined HMM/ANN system is derived, \ndescribing the interaction of the neural and the HMM component. Furthermore, it is \nshown that the optimal neural acoustic processor can be obtained from a special \ninformation theoretic network training algorithm. 
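The criterion named in the first argument above, the mutual information between the class sequence at the network input and the resulting sequence of firing neurons, can be estimated from simple co-occurrence counts. The following is only a minimal sketch under stated assumptions: the function name mutual_information, the relative-frequency estimates, and the toy sequences are illustrative choices and are not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(classes, labels):
    # Estimate the mutual information (in bits) between two equally long
    # sequences, using relative frequencies as probability estimates.
    # This helper is hypothetical and only illustrates the criterion.
    k = len(classes)
    joint = Counter(zip(classes, labels))
    n_class = Counter(classes)
    n_label = Counter(labels)
    mi = 0.0
    for (c, y), n_cy in joint.items():
        # p(c, y) * log2( p(c, y) / (p(c) * p(y)) ) written with raw counts
        mi += (n_cy / k) * math.log2(n_cy * k / (n_class[c] * n_label[y]))
    return mi


# Hypothetical toy data: four frames from two classes.
classes = ['a', 'a', 'b', 'b']
informative = [0, 0, 1, 1]      # the firing neuron identifies the class
uninformative = [0, 1, 0, 1]    # the firing neuron is unrelated to the class

mi_good = mutual_information(classes, informative)
mi_bad = mutual_information(classes, uninformative)
print(mi_good, mi_bad)
```

In this toy case a label stream that separates the two classes perfectly carries one bit of information about the class, while a label stream unrelated to the class carries none, which is the quantity the training criterion tries to increase.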
\n\n2  INFORMATION THEORY PRINCIPLES FOR NEURAL NETWORK TRAINING \n\nWe now consider a neural network of arbitrary topology used as a neural vector \nquantizer for a discrete HMM system. If K patterns are presented to the hybrid system \nduring training, the feature vectors resulting from these patterns using any feature \nextraction method can be denoted as $x(k)$, $k=1 \\ldots K$. If these feature vectors are presented to \nthe input layer of a neural network, the network will generate one firing neuron for each \npresentation. Hence, all K presentations will generate a stream of firing neurons with \nlength K resulting from the output layer of the neural net. This label stream is denoted as \n$Y = y(1) \\ldots y(K)$. The label stream $Y$ will be presented to the HMM's, which calculate the \nprobability that this stream has been observed while a pattern of a certain class has been \npresented to the system. It is assumed that M different classes $\\Omega_m$ are active in the \nsystem, e.g. the words or phonemes in speech recognition. Each feature vector $x(k)$ will \nbelong to one of these classes. The class $\\Omega_m$ to which feature vector $x(k)$ belongs is \ndenoted as $\\Omega(k)$. The major training issue for the neural network can now be formulated \nas follows: How should the weights of the network be trained, so that the network \nproduces a stream of firing neurons that can be used by the discrete HMM's in an optimal \nway? It is known that HMM's are usually trained with information theory methods which \nmostly rely on the Maximum Likelihood (ML) principle. If the parameters of the hybrid \nsystem (i.e. 
transition and emission probabilities and network weights) are summarized in \nthe vector $\\theta$, the probability $p_{\\theta}(x(k)|\\Omega(k))$ denotes the probability of the pattern $x$ at \ndiscrete time k, under the assumption that it has been generated by the model representing \nclass $\\Omega(k)$, with parameter set $\\theta$. The ML principle will then try to maximize the joint \nprobability of all presented training patterns $x(k)$, according to the following Maximum \nLikelihood function: \n\n$$\\theta^* = \\arg\\max_{\\theta} \\left\\{ \\sum_{k=1}^{K} \\log p_{\\theta}(x(k)|\\Omega(k)) \\right\\} \\tag{1}$$ \n\nwhere $\\theta^*$ is the optimal parameter vector maximizing this equation. Our goal is to feed \nthe feature vector $x$ into a neural network and to present the neural network output to the \nMarkov model. Therefore, one has to introduce the neural network output in a suitable \nmanner into the above formula. If the vector $x$ is presented to the network input layer, and \nwe assume that there is a chance that any neuron $y_n$, $n=1 \\ldots N$ (with network output layer \nsize N) can fire with a certain probability, then the output probability $p(x|\\Omega)$ in (1) can \nbe written as: \n\n$$p(x|\\Omega) = \\sum_{n=1}^{N} p(x, y_n|\\Omega) = \\sum_{n=1}^{N} p(y_n|\\Omega) \\cdot p(x|y_n, \\Omega) \\tag{2}$$ \n\nNow, the combination of the neural component with the HMM can be made more \nobvious: In (2), typically the probability $p(y_n|\\Omega)$ will be described by the Markov model, \nin terms of the emission probabilities of the HMM. For instance, in continuous \nparameter HMM's, these probabilities are interpreted as weights for Gaussian mixtures. In \nthe case of semi-continuous systems or discrete HMM's, these probabilities will serve as \ndiscrete emission probabilities of the codebook labels. 
The probability $p(x|y_n, \\Omega)$ \ndescribes the acoustic processor of the system and characterizes the relation between \nthe vector $x$ as input to the acoustic processor and the label $y_n$, which can be considered \nas the n-th output component of the acoustic processor. This n-th output component may \ncharacterize e.g. the n-th Gaussian mixture component in continuous parameter HMM's, \nor the generation of the n-th label of a vector quantizer in a discrete system. This \nprobability is often considered as independent of the class $\\Omega$ and can then be expressed as \n$p(x|y_n)$. It is exactly this probability that can be modeled efficiently by our neural \nnetwork. In this case, the vector $x$ serves as input to the neural network and $y_n$ \ncharacterizes the n-th neuron in the output layer of the network. Using Bayes' law, this \nprobability can be written as: \n\n$$p(x|y_n) = \\frac{p(y_n|x) \\cdot p(x)}{p(y_n)} \\tag{3}$$ \n\nyielding for (2): \n\n$$p(x|\\Omega) = p(x) \\cdot \\sum_{n=1}^{N} \\frac{p(y_n|\\Omega) \\cdot p(y_n|x)}{p(y_n)} \\tag{4}$$ \n\nUsing Bayes' law again to express \n\n$$p(y_n|\\Omega) = \\frac{p(\\Omega|y_n) \\cdot p(y_n)}{p(\\Omega)} \\tag{5}$$ \n\none obtains from (4): \n\n$$p(x|\\Omega) = \\frac{p(x)}{p(\\Omega)} \\cdot \\sum_{n=1}^{N} p(\\Omega|y_n) \\cdot p(y_n|x) \\tag{6}$$ \n\nWe have now modified the class-dependent probability of the feature vector $x$ in a way \nthat allows the incorporation of the probability $p(y_n|x)$. This probability allows a better \ncharacterization of the behavior of the neural network, because it describes the probability \nof the various neurons $y_n$ firing if the vector $x$ is presented to the network input. Therefore, \nthese probabilities give a good description of the input/output behavior of the neural \nnetwork. Eq. (6) can therefore be considered as a probabilistic model for the hybrid system, \nwhere the neural acoustic processor is characterized by its input/output behavior. 
Two \ncases can now be distinguished: In the first case, the neural network is assumed to be a \nprobabilistic paradigm, where each neuron fires with a certain probability if an input \nvector is presented. In this case all neurons contribute to the information forwarded to the \nHMM's. As already mentioned, in this paper, the second possible case is considered, \nnamely that only one neuron in the output layer fires and will be fed as observed label to \nthe HMM. In this case, we have a deterministic decision, and the probability $p(y_n|x)$ \ndescribes which neuron $y_{n^*}$ fires if vector $x$ is presented to the input layer. Therefore, this \nprobability reduces to \n\n$$p(y_n|x) = \\begin{cases} 1 & \\text{if } n = n^* \\\\ 0 & \\text{otherwise} \\end{cases} \\tag{7}$$ \n\nThen, (6) yields: \n\n$$p(x|\\Omega) = \\frac{p(x)}{p(\\Omega)} \\cdot p(\\Omega|y_{n^*}) \\tag{8}$$ \n\nNow, the class-dependent probability $p(x|\\Omega)$ is expressed through the probability \n$p(\\Omega|y_{n^*})$, directly involving the firing neuron $y_{n^*}$ when feature vector $x$ is presented. \nOne now has to turn back to (1), recalling that this equation expresses the fact \nthat the Markov models are trained with the ML criterion. It should also be recalled that \nthe entire sequence of feature vectors, $x(k)$, $k=1 \\ldots K$, results in a label stream of firing \nneurons $y_{n^*}(k)$, $k=1 \\ldots K$, where $y_{n^*}(k)$ is the firing neuron if the k-th vector $x(k)$ is \npresented to the neural network. Now, (8) can be substituted into (1) for each presentation \nk, yielding the modified ML criterion: \n\n$$\\theta^* = \\arg\\max_{\\theta} \\left\\{ \\sum_{k=1}^{K} \\log \\left[ \\frac{p(x(k))}{p(\\Omega(k))} \\cdot p(\\Omega(k)|y_{n^*}(k)) \\right] \\right\\} = \\arg\\max_{\\theta} \\left\\{ \\sum_{k=1}^{K} \\log p(x(k)) - \\sum_{k=1}^{K} \\log p(\\Omega(k)) + \\sum_{k=1}^{K} \\log p(\\Omega(k)|y_{n^*}(k)) \\right\\} \\tag{9}$$ \n\nUsually, in a continuous parameter system, the probability $p(x)$ can be expressed as: \n\n$$p(x) = \\sum_{n=1}^{N} p(x|y_n) \\cdot p(y_n) \\tag{10}$$ \n\nand is therefore dependent on the parameter vector $\\theta$, because in this case, $p(x|y_n)$ can be \ninterpreted as the probability provided by the Gaussian distributions, and the parameters of \n
the Gaussians will depend on $\\theta$. As mentioned before, in a discrete system, only one \nfiring neuron $y_{n^*}$ survives, so that only the $n^*$-th term remains in \nthe sum in (10). This would correspond to only one \"firing Gaussian\" in the continuous \ncase, leading to the following expression for $p(x)$: \n\n$$p(x) = p(x|y_{n^*}) \\cdot p(y_{n^*}) = p(x, y_{n^*}) = p(y_{n^*}|x) \\cdot p(x) \\tag{11}$$ \n\nConsidering now the fact that the acoustic processor is not represented by a Gaussian but \ninstead by a vector quantizer, where the probability $p(y_{n^*}|x)$ of the firing neuron is equal \nto 1, (11) reduces to $p(x) = p(x)$ and it becomes obvious that this probability is not \naffected by any distribution that depends on the parameter vector $\\theta$. This would be \ndifferent if $p(y_{n^*}|x)$ in (11) did not have binary characteristics as in (7), but were \ncomputed by a continuous function which in this case would depend on the parameter \nvector $\\theta$. Thus, without consideration of $p(x)$, the remaining expression to be maximized \nin (9) reduces to: \n\n$$\\theta^* = \\arg\\max_{\\theta} \\left[ -\\sum_{k=1}^{K} \\log p(\\Omega(k)) + \\sum_{k=1}^{K} \\log p(\\Omega(k)|y_{n^*}(k)) \\right] = \\arg\\max_{\\theta} \\left[ -E\\{\\log p(\\Omega)\\} + E\\{\\log p(\\Omega|y_{n^*})\\} \\right] \\tag{12}$$ \n\nDividing the sums by K does not change the maximizer and turns them into sample averages, \nwhich estimate the expectations in (12). The negative expectations of logarithmic \nprobabilities are the entropies $H(\\Omega)$ and $H(\\Omega|Y)$. Therefore, \n(9) can also be written as \n\n$$\\theta^* = \\arg\\max_{\\theta} \\left\\{ H(\\Omega) - H(\\Omega|Y) \\right\\} \\tag{13}$$ \n\nThis equation can be interpreted as follows: The term on the right side of (13) is also \nknown as the mutual information $I(\\Omega, Y)$ between the random variables $\\Omega$ and $Y$, \ni.e. 
: \n\n$$I(\\Omega, Y) = H(\\Omega) - H(\\Omega|Y) = H(Y) - H(Y|\\Omega) \\tag{14}$$ \n\nTherefore, the final information theory-based training criterion for the neural network can \nbe formulated as follows: The synaptic weights of the neural network should be chosen so as \nto maximize the mutual information between the string representing the classes of the \nvectors presented to the network input layer during training and the string representing the \nresulting sequence of firing neurons in the output layer of the neural network. This can \nalso be expressed as the Maximum Mutual Information (MMI) criterion for neural network \ntraining. This concludes the proof that MMI neural networks are indeed optimal acoustic \nprocessors for HMM's trained with maximum likelihood principles. \n\n3  REALIZATION OF MMI TRAINING ALGORITHMS FOR NEURAL NETWORKS \n\nTraining the synaptic weights of a neural network in order to achieve mutual information \nmaximization is not easy. Two different algorithms have been developed for this task and \ncan only be briefly outlined in this paper. A detailed description can be found in (Rigoll, \n1994) and (Neukirchen, 1996). The first experiments used a single-layer neural network \nwith Euclidean distance as propagation function. The first implementation of the MMI \ntraining paradigm has been realized in (Rigoll, 1994) and is based on a self-organizing \nprocedure, starting with initial weights derived from k-means clustering of the training \nvectors, followed by an iterative procedure to modify the weights. The mutual \ninformation increases in a self-organizing way from a low value at the start to a much \nhigher value after several iteration cycles. The second implementation has been realized \nrecently and is described in detail in (Neukirchen, 1996). 
It is based on the idea of using \ngradient methods for finding the MMI value. This technique has not been used before, \nbecause the maximum search for finding the firing neuron in the output layer has \nprevented the calculation of derivatives. This maximum search can be approximated using \nthe softmax function, denoted as $s_n$ for the n-th neuron. It can be computed from the \nactivations $z_l$ of all neurons as: \n\n$$s_n = \\frac{e^{z_n/T}}{\\sum_{l=1}^{N} e^{z_l/T}} \\tag{15}$$ \n\nwhere a small value for parameter T approximates a crisp maximum selection. Since the \nstring $\\Omega$ in (14) is always fixed during training and independent of the parameters in $\\theta$, \nonly the function $H(\\Omega|Y)$ has to be minimized. This function can also be expressed as \n\n$$H(\\Omega|Y) = -\\sum_{m=1}^{M} \\sum_{n=1}^{N} p(y_n, \\Omega_m) \\cdot \\log p(\\Omega_m|y_n) \\tag{16}$$ \n\nA derivative with respect to a weight $w_{lj}$ of the neural network yields: \n\n$$\\frac{\\partial H(\\Omega|Y)}{\\partial w_{lj}} = -\\sum_{m=1}^{M} \\sum_{n=1}^{N} \\left[ \\frac{\\partial p(y_n, \\Omega_m)}{\\partial w_{lj}} \\cdot \\log p(\\Omega_m|y_n) + \\frac{p(y_n, \\Omega_m)}{p(\\Omega_m|y_n)} \\cdot \\frac{\\partial p(\\Omega_m|y_n)}{\\partial w_{lj}} \\right] \\tag{17}$$ \n\nAs shown in (Neukirchen, 1996), all the required terms in (17) can be computed \neffectively and it is possible to realize a gradient descent method in order to maximize the \nmutual information of the training data. The great advantage of this method is the fact \nthat it is now possible to generalize this algorithm for use in all popular neural network \narchitectures, including multilayer and recurrent neural networks. \n\n4  RESULTS FOR THE HYBRID SYSTEM \n\nThe new hybrid system has been developed and extensively tested using the Resource \nManagement 1000 word speaker-independent continuous speech recognition task. First, a \nbaseline discrete HMM system has been built with all well-known features of a \ncontext-dependent HMM system. The performance of that baseline system is shown in \ncolumn 2 of Table 1. 
The 1st column shows the performance of the hybrid system with \nthe neural vector quantizer. This network has some special features not mentioned in the \nprevious sections, e.g. it uses multiple frame input and has been trained on context-dependent \nclasses. That means that the mutual information between the stream of firing \nneurons and the corresponding input stream of triphones has been maximized. In this \nway, the firing behavior of the network becomes sensitive to context-dependent units. \nTherefore, this network may be the only existing context-dependent acoustic processor, \ncarrying the principle of triphone modeling from the HMM structure to the acoustic front \nend. It can be seen that a substantially higher recognition performance is obtained with \nthe hybrid system, which compares well with the leading continuous system (HTK, in \ncolumn 3). It is expected that the system will be further improved in the near future \nthrough various additional features, including full exploitation of multilayer neural VQ's \nand several conventional HMM improvements, e.g. the use of crossword triphones. \nRecent results on the larger Wall Street Journal (WSJ) database have shown a 10.5% error \nrate for the hybrid system compared to a 13.4% error rate for a standard discrete system, \nusing the 5k vocabulary test with bigram language model of perplexity 110. This error \nrate can be further reduced to 8.9% using crossword triphones and 6.6% with a trigram \nlanguage model. This rate already compares quite favorably with the best continuous \nsystems for the same task. It should be noted that this hybrid WSJ system is still in its \ninitial stage and the neural component is not yet as sophisticated as in the RM system. 
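To put the reported figures in perspective, the improvement of the hybrid system over the discrete baseline can be expressed as a relative error reduction. The short calculation below is a sketch only: the helper name relative_error_reduction is hypothetical, and for the RM task it assumes that the word error rate equals 100% minus the average word accuracy given in parentheses in Table 1 (94.6% for the hybrid system, 92.0% for the baseline).

```python
def relative_error_reduction(baseline_error, new_error):
    # Fraction of the baseline errors that the new system removes.
    return (baseline_error - new_error) / baseline_error


# WSJ 5k test with bigram language model: 13.4% (discrete) vs. 10.5% (hybrid).
wsj = relative_error_reduction(13.4, 10.5)

# RM average, assuming error rate = 100% - word accuracy (an assumption here).
rm = relative_error_reduction(100.0 - 92.0, 100.0 - 94.6)

print(f'WSJ: {100 * wsj:.1f}% fewer errors than the discrete baseline')
print(f'RM:  {100 * rm:.1f}% fewer errors than the discrete baseline')
```

Under these assumptions the hybrid system removes roughly a fifth of the baseline errors on WSJ and roughly a third on RM.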
\n\n5  CONCLUSION \n\nA new neural network paradigm and the resulting hybrid HMM/ANN speech recognition \nsystem have been presented in this paper. The new approach already performs very well \nand can still be improved. It gains its good performance from the following facts: (1) The use \nof information theory-based training algorithms for the neural vector quantizer, which can \nbe shown to be optimal for the hybrid approach. (2) The possibility of introducing \ncontext-dependency not only to the HMM's, but also to the neural quantizer. (3) The fact \nthat this hybrid approach allows the combination of an optimal neural acoustic processor \nwith the most advanced context-dependent HMM system. We will continue to \nimplement various possible improvements for our hybrid speech recognition system. \n\nREFERENCES \n\nRigoll, G. (1994) Maximum Mutual Information Neural Networks for Hybrid \nConnectionist-HMM Speech Recognition Systems, IEEE Transactions on Speech and \nAudio Processing, Vol. 2, No. 1, Special Issue on Neural Networks for Speech \nProcessing, pp. 175-184 \nNeukirchen, C. & Rigoll, G. (1996) Training of MMI Neural Networks as Vector \nQuantizers, Internal Report, Gerhard-Mercator-University Duisburg, Faculty of Electrical \nEngineering, available via http://www.fb9-tLuni-duisburg.de/veroeffentl.html \nBourlard, H. & Morgan, N. (1994) Connectionist Speech Recognition: A Hybrid \nApproach, Kluwer Academic Publishers \nHochberg, M., Renals, S., Robinson, A., Cook, G. (1995) Recent Improvements to the \nABBOT Large Vocabulary CSR System, in Proc. IEEE-ICASSP, Detroit, pp. 69-72 \nRigoll, G., Neukirchen, C., Rottland, J. (1996) A New Hybrid System Based on MMI-Neural \nNetworks for the RM Speech Recognition Task, in Proc. 
IEEE-ICASSP, Atlanta \n\nTable 1: Comparison of recognition rates for different speech recognition systems. \nRM SI word recognition rate with word pair grammar: correctness (accuracy) \n\ntest set  hybrid MMI-NN system  baseline k-means VQ system  continuous pdf system (HTK) \nFeb.'89   96.3% (95.6%)         94.3% (93.6%)               96.0% (95.5%) \nOct.'89   95.4% (94.5%)         93.5% (92.0%)               95.4% (94.9%) \nFeb.'91   96.7% (95.9%)         94.4% (93.5%)               96.6% (96.0%) \nSep.'92   93.9% (92.5%)         90.7% (88.9%)               93.6% (92.6%) \naverage   95.6% (94.6%)         93.2% (92.0%)               95.4% (94.7%) \n", "award": [], "sourceid": 1193, "authors": [{"given_name": "Gerhard", "family_name": "Rigoll", "institution": null}, {"given_name": "Christoph", "family_name": "Neukirchen", "institution": null}]}