{"title": "ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1165, "page_last": 1171, "abstract": null, "full_text": "ALGONQUIN - Learning dynamic noise models from noisy speech for robust speech recognition \n\nBrendan J. Frey1, Trausti T. Kristjansson1, Li Deng2, Alex Acero2 \n\n1 Probabilistic and Statistical Inference Group, University of Toronto \n\nhttp://www.psi.toronto.edu \n\n2 Speech Technology Group, Microsoft Research \n\nAbstract \n\nA challenging, unsolved problem in the speech recognition community is recognizing speech signals that are corrupted by loud, highly nonstationary noise. One approach to noisy speech recognition is to automatically remove the noise from the cepstrum sequence before feeding it into a clean speech recognizer. In previous work published in Eurospeech, we showed how a probability model trained on clean speech and a separate probability model trained on noise could be combined for the purpose of estimating the noise-free speech from the noisy speech. We showed how an iterative 2nd order vector Taylor series approximation could be used for probabilistic inference in this model. In many circumstances, it is not possible to obtain examples of noise without speech. Noise statistics may change significantly during an utterance, so that speech-free frames are not sufficient for estimating the noise model. In this paper, we show how the noise model can be learned even when the data contains speech. In particular, the noise model can be learned from the test utterance and then used to denoise the test utterance. 
\nThe approximate inference technique is used as an approximate E step in a generalized EM algorithm that learns the parameters of the noise model from a test utterance. For both Wall Street Journal data with added noise samples and the Aurora benchmark, we show that the new noise adaptive technique performs as well as or significantly better than the non-adaptive algorithm, without the need for a separate training set of noise examples. \n\n1 Introduction \n\nTwo main approaches to robust speech recognition include \"recognizer domain approaches\" (Varga and Moore 1990; Gales and Young 1996), where the acoustic recognition model is modified or retrained to recognize noisy, distorted speech, and \"feature domain approaches\" (Boll 1979; Deng et al. 2000; Attias et al. 2001; Frey et al. 2001), where the features of noisy, distorted speech are first denoised and then fed into a speech recognition system whose acoustic recognition model is trained on clean speech. \n\nOne advantage of the feature domain approach over the recognizer domain approach is that the speech modeling part of the denoising model can have much lower complexity than the full acoustic recognition model. This can lead to a much faster overall system, since the denoising process uses probabilistic inference in a much smaller model. Also, since the complexity of the denoising model is much lower than the complexity of the recognizer, the denoising model can be adapted to new environments more easily, or a variety of denoising models can be stored and applied as needed. \n\nWe model the log-spectra of clean speech, noise, and channel impulse response function using mixtures of Gaussians. (In contrast, Attias et al. (2001) model autoregressive coefficients.) 
The relationship between these log-spectra and the log-spectrum of the noisy speech is nonlinear, leading to a posterior distribution over the clean speech that is a mixture of non-Gaussian distributions. We show how a variational technique that makes use of an iterative 2nd order vector Taylor series approximation can be used to infer the clean speech and compute sufficient statistics for a generalized EM algorithm that can learn the noise model from noisy speech. \n\nOur method, called ALGONQUIN, improves on previous work using the vector Taylor series approximation (Moreno 1996) by modeling the variance of the noise and channel instead of using point estimates, by modeling the noise and channel as a mixture model instead of a single component model, by iterating Laplace's method to track the clean speech instead of applying it once at the model centers, by accounting for the error in the nonlinear relationship between the log-spectra, and by learning the noise model from noisy speech. \n\n2 ALGONQUIN's Probability Model \n\nFor clarity, we present a version of ALGONQUIN that treats frames of log-spectra independently. The extension of the version presented here to HMM models of speech, noise and channel distortion is analogous to the extension of a mixture of Gaussians to an HMM with Gaussian outputs. \n\nFollowing (Moreno 1996), we derive an approximate relationship between the log-spectra of the clean speech, noise, channel and noisy speech. Assuming additive noise and linear channel distortion, the windowed FFT Y(f) for a particular frame (25 ms duration, spaced at 10 ms intervals) of noisy speech is related to the FFTs of the channel H(f), clean speech S(f) and additive noise N(f) by \n\nY(f) = H(f)S(f) + N(f). 
\n\n(1) \n\nWe use a mel-frequency scale, in which case this relationship is only approximate. However, it is quite accurate if the channel frequency response is roughly constant across each mel-frequency filter band. \n\nFor brevity, we will assume H(f) = 1 in the remainder of this paper. Assuming there is no channel distortion simplifies the description of the algorithm. To see how channel distortion can be accounted for in a nonadaptive way, see (Frey et al. 2001). The technique described in this paper for adapting the noise model can be extended to adapting the channel model. \n\nAssuming H(f) = 1, the energy spectrum is obtained as follows: \n\n|Y(f)|2 = Y(f)*Y(f) = S(f)*S(f) + N(f)*N(f) + 2Re(N(f)*S(f)) \n\n= |S(f)|2 + |N(f)|2 + 2Re(N(f)*S(f)), \n\nwhere \"*\" denotes the complex conjugate. If the phases of the noise and the speech are uncorrelated, the last term in the above expression is small and we can approximate the energy spectrum as follows: \n\n|Y(f)|2 ≈ |S(f)|2 + |N(f)|2. \n\n(2) \n\nAlthough we could model these spectra directly, they are constrained to be nonnegative. To make density modeling easier, we model the log-spectrum instead. An additional benefit of this approach is that channel distortion is an additive effect in the log-spectrum domain. \n\nLetting y be the vector containing the log-spectrum log |Y|2, and similarly for s and n, we can rewrite (2) as \n\nexp(y) ≈ exp(s) + exp(n) = exp(s) ◦ (1 + exp(n - s)), \n\n(3) \n\nwhere the exp() function operates in an element-wise fashion on its vector argument and the \"◦\" symbol indicates the element-wise product. \n\nTaking the logarithm, we obtain a function g() that is an approximate mapping of s and n to y (see (Moreno 1996) for more details): \n\ny ≈ g([s n]T) = s + ln(1 + exp(n - s)). 
\n\n(4) \n\nHere, \"T\" indicates matrix transpose and ln() and exp() operate on the individual elements of their vector arguments. \n\nAssuming the errors in the above approximation are Gaussian, the observation likelihood is \n\np(y|s, n) = N(y; g([s n]T), Ψ), \n\n(5) \n\nwhere Ψ is the diagonal covariance matrix of the errors. A more precise approximation to the observation likelihood can be obtained by writing Ψ as a function of s and n, but we assume Ψ is constant for clarity. \n\nUsing a prior p(s, n), the goal of denoising is to infer the log-spectrum of the clean speech s, given the log-spectrum of the noisy speech y. The minimum squared error estimate of s is ŝ = ∫ s p(s|y) ds, where p(s|y) ∝ ∫ p(y|s, n) p(s, n) dn. This inference is made difficult by the fact that the nonlinearity g([s n]T) in (5) makes the posterior non-Gaussian even if the prior is Gaussian. In the next section, we show how an iterative variational method that uses a 2nd order vector Taylor series approximation can be used for approximate inference and learning. \n\nWe assume that a priori the speech and noise are independent - p(s, n) = p(s)p(n) - and we model each using a separate mixture of Gaussians. cs = 1, ..., Ns is the class index for the clean speech and cn = 1, ..., Nn is the class index for the noise. The mixing proportions and Gaussian components are parameterized as follows: \n\np(s) = Σ_cs p(cs)p(s|cs), p(cs) = π_cs^s, p(s|cs) = N(s; μ_cs^s, Σ_cs^s), \n\np(n) = Σ_cn p(cn)p(n|cn), p(cn) = π_cn^n, p(n|cn) = N(n; μ_cn^n, Σ_cn^n). \n\n(6) \n\nWe assume the covariance matrices Σ_cs^s and Σ_cn^n are diagonal. \n\nCombining (5) and (6), the joint distribution over the noisy speech, clean speech class, clean speech vector, noise class and noise vector is \n\np(y, s, cs, n, cn) = N(y; g([s n]T), Ψ) π_cs^s N(s; μ_cs^s, Σ_cs^s) π_cn^n N(n; μ_cn^n, Σ_cn^n). 
\n\n(7) \n\nUnder this joint distribution, the posterior p(s, n|y) is a mixture of non-Gaussian distributions. In fact, for a given speech class and noise class, the posterior p(s, n|cs, cn, y) may have multiple modes. So, exact computation of ŝ is intractable and we use an approximation. \n\n3 Approximating the Posterior \n\nFor the current frame of noisy speech y, ALGONQUIN approximates the posterior using a simpler, parameterized distribution, q: \n\np(s, cs, n, cn|y) ≈ q(s, cs, n, cn). \n\n(8) \n\nThe \"variational parameters\" of q are adjusted to make this approximation accurate, and then q is used as a surrogate for the true posterior when computing ŝ and learning the noise model (c.f. (Jordan et al. 1998)). For each cs and cn, we approximate p(s, n|cs, cn, y) by a Gaussian, \n\nq(s, n|cs, cn) = N([s n]T; [η_cs,cn^s η_cs,cn^n]T, Φ_cs,cn), \n\n(9) \n\nwhere η_cs,cn^s and η_cs,cn^n are the approximate posterior means of the speech and noise for classes cs and cn, and Φ_cs,cn^ss, Φ_cs,cn^sn and Φ_cs,cn^nn specify the covariance matrix for the speech and noise for classes cs and cn. Since rows of vectors in (4) do not interact and since the likelihood covariance matrix Ψ and the prior covariance matrices Σ_cs^s and Σ_cn^n are diagonal, the matrices Φ_cs,cn^ss, Φ_cs,cn^sn and Φ_cs,cn^nn are diagonal. \n\nThe posterior mixing proportions for classes cs and cn are q(cs, cn) = ρ_cs,cn. The approximate posterior is given by q(s, n, cs, cn) = q(s, n|cs, cn)q(cs, cn). \n\nThe goal of variational inference is to minimize the relative entropy (Kullback-Leibler divergence) between q and p: \n\nK = Σ_cs Σ_cn ∫∫ q(s, n, cs, cn) ln ( q(s, n, cs, cn) / p(s, cs, n, cn|y) ) ds dn. 
\n\nS \n\nThis is  a  particularly good choice for  a  cost function,  because, since lnp(y)  doesn't \ndepend on the variational parameters, minimizing K  is  equivalent to maximizing \n\n()  K  \"''''11 (  S  n) \n\n=  ~ ~  q  s , n , c  ,c \n\ne' \n\nen \n\ns \n\nn \n\nF  =  lnp y  -\n\nIn \n\np(s,cS,n,cn,y) \n, \n\n( \nq  s, n, c  ,c \n\nS  n) \n\nwhich is  a  lower  bound on the log-probability of the data.  So,  variational inference \ncan  be  used  as  a  generalized  E  step  (Neal  and  Hinton  1998)  in  an  algorithm that \nalternatively  maximizes  a  lower  bound  on  lnp(y)  with  respect  to  the  variational \nparameters and the noise  model parameters, as described in the next section. \n\nVariational inference begins  by optimizing the means and variances  in  (9)  for  each \nCS  and  en.  Initially,  we  set  the  posterior  means  and  variances  to  the  prior  means \nand  variances.  F  does  not  have  a  simple  form  in  these  variational  parameters. \nSo,  at  each  iteration,  we  make  a  2nd  order  vector  Taylor  series  approximation of \nthe  likelihood,  centered  at  the  current  variational  parameters,  and  maximize  the \nresulting approximation to F.  The updates are \n\nwhere  g' 0 is  a  matrix of derivatives  whose  rows  correspond to the noisy  speech  y \nand whose columns  correspond to the clean speech  and noise  [s  n]. \n\n\fThe  inverse  posterior covariance  matrix is  the sum  of the  inverse prior covariance \nmatrix and the inverse likelihood covariance matrix, modified  by the Jacobian g' 0 \nfor  the mapping from  s  and n  to y \n\nThe  posterior  means  are  moved  towards  the  prior  means  and  toward  values  that \nmatch the  observation  y.  These  two  effects  are  weighted  by  the  inverse  prior  co(cid:173)\nvariance matrix and the inverse likelihood covariance matrix. 
\n\nAfter iterating the above updates (in our experiments, 3 to 5 times) for each cs and cn, the posterior mixing proportions ρ_cs,cn that maximize F are computed, up to a normalizing constant chosen so that Σ_cs Σ_cn ρ_cs,cn = 1. The minimum squared error estimate of the clean speech, ŝ, is \n\nŝ = Σ_cs Σ_cn ρ_cs,cn η_cs,cn^s. \n\nWe apply this algorithm on a frame-by-frame basis, until all frames in the test utterance have been denoised. \n\n4 Speed \n\nSince elements of s, n and y that are in different rows do not interact in (4), the above matrix algebra reduces to efficient scalar algebra. For 256 speech components, 4 noise components and 3 iterations of inference, our unoptimized C code takes 60 ms to denoise each frame. We are confident that this time can be reduced by an order of magnitude using standard implementation tricks. \n\n5 Adapting the Noise Model Using Noisy Speech \n\nThe version of ALGONQUIN described above requires that a mixture model of the noise be trained on noise samples, before the log-spectrum of the noisy speech can be denoised. Here, we describe how the iterative inference technique can be used as the E step in a generalized EM algorithm for learning the noise model from noisy speech. \n\nFor a set of frames y(1), ..., y(T) in a noisy test utterance, we construct a total bound \n\nF = Σ_t F(t) ≤ Σ_t ln p(y(t)). \n\nThe generalized EM algorithm alternates between updating one set of variational parameters ρ_cs,cn^(t), η_cs,cn^(t), etc., for each frame t = 1, ..., T, and maximizing F with respect to the noise model parameters π_cn^n, μ_cn^n and Σ_cn^n. Since F ≤ Σ_t ln p(y(t)), this procedure maximizes a lower bound on the log-probability of the data. 
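The M step of this alternation can be sketched as responsibility-weighted averaging of the E-step statistics. This is a minimal sketch for a single frequency bin, assuming the noise model updates reduce to weighted posterior moments; the array names and shapes are illustrative, not from the paper.

```python
import numpy as np

def noise_m_step(rho, eta_n, phi_nn):
    # M step for the noise mixture, for a single frequency bin.
    #   rho    : responsibilities rho_cs,cn^(t), shape (T, Ns, Nn)
    #   eta_n  : posterior noise means from the E step, shape (T, Ns, Nn)
    #   phi_nn : posterior noise variances from the E step, shape (T, Ns, Nn)
    # Returns updated mixing proportions, means and variances, each shape (Nn,).
    w = rho.sum(axis=(0, 1))                    # responsibility mass per noise class
    pi_n = w / w.sum()                          # mixing proportions
    mu_n = (rho * eta_n).sum(axis=(0, 1)) / w   # responsibility-weighted means
    # Variance combines the posterior variance with the spread of the
    # posterior means around the new class mean
    sigma2_n = (rho * (phi_nn + (eta_n - mu_n) ** 2)).sum(axis=(0, 1)) / w
    return pi_n, mu_n, sigma2_n
```

Because every frame contributes through its responsibilities, frames that contain speech still inform the noise estimate, which is what allows the noise model to be learned from the test utterance itself.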
The use of the vector Taylor series approximations leads to an algorithm that maximizes an approximation to a lower bound on the log-probability of the data. \n\n          Restaurant  Street  Airport  Station  Average \n20 dB        1.73      1.82     2.96     2.12     2.16 \n15 dB        3.24      2.27     4.78     3.87     3.54 \n10 dB        6.48      5.49    10.73     9.18     7.97 \n5 dB        15.18     14.97    23.52    20.51    18.54 \n0 dB        37.24     36.00    45.68    47.04    41.49 \n-5 dB       67.26     69.04    72.34    78.69    71.83 \nAverage     12.77     12.11    17.53    16.54    14.74 \n\nTable 1: Word error rates (in percent) on set B of the Aurora test set, for the adaptive version of ALGONQUIN with 4 noise components. \n\nSetting the derivatives of F with respect to the noise model parameters to zero, we obtain the following M step updates: \n\nπ_cn^n ← ( Σ_t Σ_cs ρ_cs,cn^(t) ) / T, \n\nμ_cn^n ← ( Σ_t Σ_cs ρ_cs,cn^(t) η_cs,cn^n(t) ) / ( Σ_t Σ_cs ρ_cs,cn^(t) ), \n\nΣ_cn^n ← ( Σ_t Σ_cs ρ_cs,cn^(t) ( Φ_cs,cn^nn(t) + diag((η_cs,cn^n(t) - μ_cn^n)(η_cs,cn^n(t) - μ_cn^n)T) ) ) / ( Σ_t Σ_cs ρ_cs,cn^(t) ). \n\nThe variational parameters can be updated multiple times before updating the model parameters, or the variational parameters can be updated only once before updating the model parameters. The latter approach may converge more quickly in some situations. \n\n6 Experimental Results \n\nAfter training a 256-component speech model on clean speech, we used the adaptive version of ALGONQUIN to denoise noisy test utterances on two tasks: the publicly available Aurora limited vocabulary speech recognition task (http://www.etsi.org/technicalactiv/dsr.htm), and the Wall Street Journal (WSJ) large vocabulary speech recognition task, with Microsoft's Whisper speech recognition system. \n\nWe obtained results on all 48 test sets from partitions A and B of the Aurora database. 
Each set contains 24,000 sentences that have been corrupted by one of 4 different noise types at one of 6 different signal-to-noise ratios. Table 1 gives the error rates for the adaptive version of ALGONQUIN, with 4 noise components. These error rates are superior to error rates obtained by the spectral subtraction technique of (Deng et al. 2000), and highly competitive with other results on the Aurora task. \n\nTable 2 compares the performances of the adaptive version of ALGONQUIN and the non-adaptive version. For the non-adaptive version, 20 non-speech frames are used to estimate the noise model. For the adaptive version, the parameters are initialized using 20 non-speech frames and then 3 iterations of generalized EM are used to learn the noise model. The average error rate over all noise types and SNRs for set B of Aurora drops from 17.65% to 15.19% when the noise adaptive algorithm is used to update the noise model. This is a relative gain of 13.94%. When 4 components are used there is a further gain of 2.5%. \n\nThe Wall Street Journal test set consists of 167 sentences spoken by female speakers. The Microsoft Whisper recognizer with a 5,000 word vocabulary was used to recognize these sentences. \n\n                   WER         WER      Reduction   WER       Reduction \n                   20 frames   1 comp   in WER      4 comps   in WER \nAurora, Set A      18.10%      15.91%   12.10%      15.62%    13.70% \nAurora, Set B      17.65%      15.19%   13.94%      14.74%    16.49% \nWSJ, XD14, 10dB    30.00%      21.80%   27.33%      21.50%    28.33% \nWSJ, XD10, 10dB    21.80%      20.60%   5.50%       20.60%    5.50% \n\nTable 2: Word error rates (WER) and percentage reduction in WER for the Aurora test data and the Wall Street Journal test data, without scaling. \n\nTable 2 shows that the adaptive version of ALGONQUIN 
\n\nperforms  better  than  the  non-adaptive  version,  especially  on  noise  type  \"XD14\" , \nwhich consists of the highly-nonstationary sound of a jet engine shutting down.  For \nnoise type  \"XD1O\", which is  stationary noise,  we  observe a gain, but we  do  not see \nany further  gain for  multiple noise  components. \n7  Conclusions \nA far as variational methods go,  ALGONQUIN is a fast technique for  denoising log(cid:173)\nspectrum or cepstrum speech feature  vectors.  ALGONQUIN improves on previous \nwork  using  the  vector  Taylor  series  approximation,  by  using  multiple  component \nspeech  and  noise  models,  and  it  uses  an  iterative  variational  method  to  produce \naccurate posterior  distributions  for  speech  and  noise.  By employing  a  generalized \nEM method, ALGONQUIN  can estimate a  noise  model from  noisy speech data. \n\nOur results  show that the noise  adaptive ALGONQUIN  algorithm can obtain  bet(cid:173)\nter  results  than  the  non-adaptive  version.  This  is  especially  important  for  non(cid:173)\nstationary  noise,  where  the  non-adaptive  algorithm  relies  on  an  estimate  of  the \nnoise  based  on  a  subset  of  the  frames ,  but  the  adaptive  algorithm  uses  all  the \nframes  of the utterance,  even those that contain speech. \n\nA  different  approach  to denoising  speech features  is  to learn  time-domain models. \nAttias  et  al.  (2001)  report results  on a  non-adaptive  time-domain technique.  Our \nresults  cannot  be  directly  compared with theirs,  since  our results  are for  unscaled \ndata.  Eventually, the two  approaches should  be thoroughly compared. \nReferences \nAttias,  H. ,  Platt,  J .  C.,  Acero,  A.,  and  Deng,  L.  2001.  Speech  denoising  and  derever(cid:173)\n\nberation  using  probabilistic  models.  In  Advances  in  Neural  Information  Processing \nSystems  13.  MIT  Press,  Cambridge  MA. \n\nBoll,  S.  1979.  
Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech and Signal Processing, 27:114-120. \n\nDeng, L., Acero, A., Plumpe, M., and Huang, X. D. 2000. Large-vocabulary speech recognition under adverse acoustic environments. In Proceedings of the International Conference on Spoken Language Processing, pages 806-809. \n\nFrey, B. J., Deng, L., Acero, A., and Kristjansson, T. 2001. ALGONQUIN: Iterating Laplace's method to remove multiple types of acoustic distortion for robust speech recognition. In Proceedings of Eurospeech 2001. \n\nGales, M. J. F. and Young, S. J. 1996. Robust continuous speech recognition using parallel model combination. IEEE Speech and Audio Processing, 4(5):352-359. \n\nJordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. 1998. An introduction to variational methods for graphical models. In Jordan, M. I., editor, Learning in Graphical Models. Kluwer Academic Publishers, Norwell, MA. \n\nMoreno, P. 1996. Speech Recognition in Noisy Environments. Doctoral dissertation, Carnegie Mellon University, Pittsburgh, PA. \n\nNeal, R. M. and Hinton, G. E. 1998. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M. I., editor, Learning in Graphical Models, pages 355-368. Kluwer Academic Publishers, Norwell, MA. \n\nVarga, A. P. and Moore, R. K. 1990. Hidden Markov model decomposition of speech and noise. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 845-848. IEEE Press. 
", "award": [], "sourceid": 2016, "authors": [{"given_name": "Brendan", "family_name": "Frey", "institution": null}, {"given_name": "Trausti", "family_name": "Kristjansson", "institution": null}, {"given_name": "Li", "family_name": "Deng", "institution": null}, {"given_name": "Alex", "family_name": "Acero", "institution": null}]}