{"title": "Ensemble Methods for Phoneme Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 800, "page_last": 806, "abstract": null, "full_text": "Ensemble Methods  for  Phoneme \n\nClassification \n\nSteve Waterhouse \n\nGary  Cook \n\nCambridge University  Engineering  Department \n\nCambridge CB2  IPZ,  England, Tel:  [+44]  1223  332754 \n\nEmail:  srwl00l@eng.cam .ac.uk.gdc@eng.cam .ac.uk \n\nAbstract \n\nThis paper investigates a  number of ensemble methods for  improv(cid:173)\ning  the  performance of phoneme classification  for  use  in  a  speech \nrecognition system.  Two ensemble methods are described; boosting \nand mixtures of experts,  both in isolation and in combination.  Re(cid:173)\nsults are presented on two speech recognition databases:  an isolated \nword database and a large vocabulary continuous speech database. \nThese results show that principled ensemble methods such as boost(cid:173)\ning  and  mixtures provide superior  performance to  more naive  en(cid:173)\nsemble methods such  as  averaging. \n\nINTRODUCTION \n\nThere  is  now  considerable  interest  in  using  ensembles  or  committees  of learning \nmachines  to  improve the  performance  of the system  over  that of a  single  learning \nmachine.  In most neural  network ensembles,  the ensemble members are trained on \neither the same data (Hansen & Salamon 1990) or different subsets of the data (Per(cid:173)\nrone  &  Cooper  1993) .  The ensemble  members typically  have different  initial con(cid:173)\nditions  and/or  different  architectures.  The  subsets  of the  data may  be  chosen  at \nrandom , with prior knowledge or by some principled approach e.g.  clustering.  Addi(cid:173)\ntionally, the outputs of the networks may be combined by any function which results \nin  an  output  that  is  consistent  with  the form  of the  problem.  The expectation  of \nensemble methods is  that the member networks pick out different properties present \nin the data,  thus improving the  performance when  their outputs are combined . \n\nThe  two  techniques  described  here,  boosting  (Drucker,  Schapire  &  Simard  1993) \nand mixtures of experts (Jacobs, Jordan, Nowlan &  Hinton 1991), differ from simple \nensemble methods. \n\n\fEnsemble Methods/or Phoneme Classification \n\n801 \n\nIn  boosting,  each  member  of the  ensemble  is  trained  on  patterns  that  have  been \nfiltered  by  previously trained members of the ensemble.  In mixtures,  the members \nof the ensemble,  or  \"experts\",  are  trained on data that is stochastically selected  by \na  gate which  additionally learns how  to best  combine the outputs of the experts. \n\nThe aim of the  work  presented  here  is  twofold and inspired from  two differing  but \ncomplimentary directions.  Firstly,  how does one  select  which  data to train the en(cid:173)\nsemble members on and secondly, given these members how does one combine them \nto achieve  the  optimal result?  The rest  of the  paper describes  how  a  combination \nof boosting and mixtures may be used  to improve phoneme error rates. \n\nPHONEME  CLASSIFICATION \n\nSpeech \n\nFigure  1:  The  ABBOT  hybrid  connectionist-HMM  speech  recognition  system  with \nan  MLP  ensemble  acoustic model \n\nThe  Cambridge University  Engineering  Department connectionist  speech  recogni(cid:173)\ntion system  (ABBOT)  uses  a  hybrid connectionist  - hidden  Markov  model  (HMM) \napproach.  This is shown in figure  1.  
A connectionist acoustic model is used to map each frame of acoustic data to posterior phone probabilities. These estimated phone probabilities are then used as estimates of the observation probabilities in an HMM framework. Given new acoustic data and the connectionist-HMM framework, the maximum a posteriori word sequence is then extracted using a single pass, start synchronous decoder. A more complete description of the system can be found in (Hochberg, Renals & Robinson 1994).

Previous work has shown how a novel boosting procedure based on utterance selection can be used to increase the performance of the recurrent network acoustic model (Cook & Robinson 1996). In this work a combined boosting and mixtures-of-experts approach is used to improve the performance of MLP acoustic models. Results are presented for two speech recognition tasks. The first is phonetic classification on a small isolated digit database. The second is a large vocabulary continuous speech recognition task from the Wall Street Journal corpus.

ENSEMBLE METHODS

Most ensemble methods can be divided into two separate steps: network selection and network combination. Network selection addresses the question of how to choose the data each network is trained on. Network combination addresses the question of what is the best way to combine the outputs of these trained networks. The simplest method for network selection is to train separate networks on separate regions of the data, chosen either randomly, with prior knowledge, or according to some other criterion, e.g. clustering.

The simplest method of combining the outputs of several networks is to form an average, or simple additive merge: y(t) = (1/K) sum_{k=1}^{K} y_k(t), where y_k(t) is the output of the kth network at time t and K is the number of networks.

Boosting

Boosting is a procedure which results in an ensemble of networks. The networks in a boosting ensemble are trained sequentially on data that has been filtered by the previously trained networks in the ensemble. This has the advantage that only data that is likely to result in improved generalization performance is used for training. The first practical application of a boosting procedure was for the optical character recognition task (Drucker et al. 1993). An ensemble of feedforward neural networks was trained using supervised learning; using boosting, the authors reported a 28% reduction in error rate on ZIP codes from the United States Postal Service compared to a single network. The boosting procedure is as follows: train a network on a randomly chosen subset of the available training data. This network is then used to filter the remaining training data to produce a training set for a second network, containing an even distribution of cases which the first network classifies correctly and incorrectly. After training the second network, the first and second networks are used to produce a training set for a third network. This training set is produced from cases in the remaining training data on which the first two networks disagree, as sketched below.
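To make the two filtering steps concrete, the following is a minimal Python sketch, assuming trained classifier objects with a predict method that returns one phone label per frame; the function names, the equal-sized correct/incorrect split, and the random seed are illustrative assumptions rather than details taken from the paper.

import numpy as np

def filter_for_second_net(net1, frames, labels, rng=None):
    # Build a training set with an even distribution of frames that the first
    # network classifies correctly and incorrectly, as in the boosting
    # procedure described above. net1.predict is assumed to return one phone
    # label per input frame.
    if rng is None:
        rng = np.random.default_rng(0)
    pred = net1.predict(frames)
    correct = np.flatnonzero(pred == labels)
    wrong = np.flatnonzero(pred != labels)
    n = min(len(correct), len(wrong))
    keep = np.concatenate([rng.choice(correct, n, replace=False),
                           rng.choice(wrong, n, replace=False)])
    rng.shuffle(keep)
    return frames[keep], labels[keep]

def filter_for_third_net(net1, net2, frames, labels):
    # The third network is trained only on frames where the first two disagree.
    disagree = np.flatnonzero(net1.predict(frames) != net2.predict(frames))
    return frames[disagree], labels[disagree]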
The boosted networks are combined using either a voting scheme or the simple additive merge described in the previous section. The voting scheme works as follows: if the first two networks agree, take their answer as the output; if they disagree, use the third network's answer as the output.

Mixtures of Experts

The mixture of experts (Jacobs et al. 1991) is a different type of ensemble to the two considered so far. The ensemble members, or experts, are trained with data which is stochastically selected by a gate. The gate in turn learns how to best combine the experts given the data. The training of the experts, which are typically single or multi-layer networks, proceeds as for standard networks, with an additional weighting of the output error terms by the posterior probability h_i(t) of selecting expert i given the current data point at time t:

    h_i(t) = g_i(t) P_i(t) / sum_j g_j(t) P_j(t),

where g_i(t) is the output of the gate for expert i, and P_i(t) is the probability of obtaining the correct output given expert i. In the case of classification, considered here, the experts use softmax output units. The gate, which is typically a single or multi-layer network with softmax output units, is trained using the posterior probabilities as targets. The overall output y(t) of the mixture of experts is given by the weighted combination of the gate and expert outputs:

    y(t) = sum_{k=1}^{K} g_k(t) y_k(t),

where y_k(t) is the output of the kth expert, and g_k(t) is the output of the gate for expert k at time t.

The mixture of experts is based on the principle of divide and conquer, in which a relatively hard problem is broken up into a series of smaller, easier to solve problems. By using the posterior probabilities to weight the experts and to provide targets for the gate, the effective data sets used to train each expert may overlap.

SPEECH RECOGNITION RESULTS

This section describes the results of experiments on two speech databases: the Bellcore isolated digits database and the Wall Street Journal corpus (Paul & Baker 1992). The inputs to the networks consist of 9 frames of acoustic feature vectors: the frame on which the network is currently performing classification, plus 4 frames of left context and 4 frames of right context (a sketch of this windowing is given below). The context frames allow the network to take account of the dynamical nature of speech. Each acoustic feature vector consists of 8th order PLP plus log energy coefficients, along with the dynamic delta coefficients of these, computed with an analysis window of 25 ms every 12.5 ms at a sampling rate of 8 kHz. The speech is labelled with 54 phonemes according to the standard ABBOT phone set.

Bellcore Digits

The Bellcore digits database consists of 150 speakers saying the words "zero" through "nine", "oh", "no" and "yes". The database was divided into a training set of 122 speakers, a cross validation set of 13 speakers and a test set of 15 speakers. Each method was evaluated over 10 partitions of the data into different training, cross validation and test sets.
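As an illustration of the input windowing described above, the following minimal Python sketch stacks each frame with its 4 frames of left and 4 frames of right context; the edge handling (repeating the first and last frame as padding) is an assumption, since the paper does not say how utterance boundaries are treated.

import numpy as np

def context_windows(features, left=4, right=4):
    # features: (T, D) array of per-frame PLP + log energy (+ delta) vectors.
    # Returns a (T, (left + 1 + right) * D) array: each row is the current
    # frame together with its left and right context, flattened into one
    # network input vector.
    T, D = features.shape
    padded = np.concatenate([np.repeat(features[:1], left, axis=0),
                             features,
                             np.repeat(features[-1:], right, axis=0)])
    return np.stack([padded[t:t + left + 1 + right].reshape(-1)
                     for t in range(T)])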
In all the experiments on the Bellcore digits, multi-layer perceptrons with 200 hidden units were used as the basic network members in the ensembles. The gates in the mixtures were also multi-layer perceptrons, with 20 hidden units.

Ensemble            Combination    Average Phone Error Rate    σ
Simple ensemble     cheat          14.7 %                      0.9
Simple ensemble     vote           20.3 %                      1.2
Simple ensemble     average        19.3 %                      1.2
Simple ensemble     soft gated     20.9 %                      1.2
Simple ensemble     hard gated     19.3 %                      1.0
Simple ensemble     mixed          17.1 %                      1.3
Boosted ensemble    cheat          11.9 %                      1.0
Boosted ensemble    vote           17.8 %                      1.1
Boosted ensemble    average        17.4 %                      1.1
Boosted ensemble    soft gated     17.8 %                      1.0
Boosted ensemble    hard gated     17.4 %                      1.2
Boosted ensemble    mixed          16.4 %                      1.0

Table 1: Comparison of phone error rates using different ensemble methods on the Bellcore isolated digits task.

Table 1 summarises the results obtained on the Bellcore digits database. The meaning of the entries is as follows. Two types of ensemble were trained:

Simple Ensemble: consisting of 3 networks, each trained on 1/3 of the training data (corresponding to 40 speakers used for training and 5 for cross validation for each network).

Boosted Ensemble: consisting of 3 networks trained according to the boosting algorithm of the previous section. Due to the relatively small size of the data set, it was necessary to ensure that the distributions of the randomly chosen data were consistent with the overall training data distribution.

Given each set of ensemble networks, 6 combination methods were evaluated:

cheat: The cheat scheme uses the best ensemble member for each example in the data set. The best ensemble member is determined by looking at the correct label in the labelled test set (hence cheating). This method is included as a lower bound on the error. Since the tests are performed on unseen data, this bound can only be approached by learning an appropriate combination function of the ensemble member outputs.

average: The ensemble members' outputs are combined using a simple average.

vote: The voting scheme outlined in the previous section.

gated: In the gated combination method, the ensemble networks were kept fixed whilst the gate was trained. Two types of gating were evaluated: standard or soft gating, and hard or winner-take-all (WTA) training. In WTA training the targets for the gate are binary, with a target of 1.0 for the output corresponding to the expert whose probability of generating the current data point correctly is greatest, and 0.0 for the other outputs (a sketch of the two kinds of gate target is given at the end of this section).

mixed: In contrast to the gated method, the mixed combination method both trains a gate and retrains the ensemble members using the mixture of experts framework.

From these results it can be concluded that boosting provides a significant improvement in performance over a simple ensemble. In addition, by training a gate to combine the boosted networks, performance can be further enhanced. As might be expected, re-training both the boosted networks and the gate provides the biggest improvement, as shown by the result for the mixed boosted networks.
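The following minimal Python sketch contrasts the soft and hard (WTA) gate targets used in the gated combination method, assuming gate_probs holds the softmax gate outputs g_k(t) and p_correct holds the probability each fixed ensemble member assigns to the correct phone; the function and argument names are illustrative.

import numpy as np

def gate_targets(gate_probs, p_correct, hard=False):
    # gate_probs: (T, K) softmax gate outputs g_k(t) for the K fixed members
    # p_correct:  (T, K) probability each member assigns to the correct phone,
    #             i.e. P_k(t) from the mixtures-of-experts section
    if hard:
        # winner-take-all: 1.0 for the member most likely to be correct,
        # 0.0 for the other outputs
        targets = np.zeros_like(p_correct)
        targets[np.arange(len(p_correct)), p_correct.argmax(axis=1)] = 1.0
        return targets
    # soft gating: posteriors h_k(t) = g_k(t) P_k(t) / sum_j g_j(t) P_j(t)
    joint = gate_probs * p_correct
    return joint / joint.sum(axis=1, keepdims=True)

With soft targets the gate is trained towards the mixture-of-experts posteriors; with hard targets it is trained as a classifier that picks the single most reliable member for each frame.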
Wall Street Journal Corpus

The training data used in these experiments is the short-term speakers from the Wall Street Journal corpus. This consists of approximately 36,400 sentences from 284 different speakers (SI284). The first network is trained on 1.5 million frames randomly selected from the available training data (15 million frames). This is then used to filter the unseen training data to select frames for training the second network. The first and second networks are then used to select data for the third network as described previously. The performance of the boosted MLP ensemble was evaluated on a number of ARPA benchmark tests. The results are summarised in Table 2.

Test Set    Language Model    Lexicon    Single MLP    Boosted    Gated    Mixed
et_h2_93    trigram           20k        16.0 %        12.9 %     12.9 %   11.2 %
dt_s5_93    bigram            5k         20.4 %        16.5 %     16.5 %   15.1 %

Table 2: Evaluation of the performance of boosting MLP acoustic models (word error rates).

Initial experiments use the November 1993 Hub 2 evaluation test set (et_h2_93). This is a 5,000 word closed vocabulary, non-verbalised punctuation test. It consists of 200 utterances, 20 from each of 10 different speakers, and is recorded using a Sennheiser HMD 410 microphone. The prompting texts are from the Wall Street Journal. Results are reported for a system using the standard ARPA 5k bigram language model.

The Spoke 5 test (dt_s5_93) is designed for evaluation of unsupervised channel adaptation algorithms. It consists of a total of 216 utterances from 10 different speakers. Each speaker's data was recorded with a different microphone. In all cases simultaneous recordings were made using a Sennheiser microphone. The task is a 5,000 word, closed vocabulary, non-verbalised punctuation test. Results are only reported for the data recorded using the Sennheiser microphone. This is a matched test since the same microphone is used to record the training data. The standard ARPA 5k bigram language model was used for the tests. Further details of the November 1993 spoke 5 and hub 2 tests can be found in (Pallett, Fiscus, Fisher, Garofolo, Lund & Pryzbocki 1994).

Four techniques were evaluated on the WSJ corpus: a single network with 500 hidden units, a boosted ensemble of 3 networks with 500 hidden units each, a gated ensemble of the boosted networks, and a mixture trained from the boosted ensembles. As can be seen from the table, boosting has resulted in significant improvements in performance over a single model for both test sets. In addition, in common with the results on the Bellcore digits, whilst the gating combination method does not give an improvement over simple averaging, retraining the whole ensemble using the mixed combination method gives an average improvement of a further 8% over the averaging method.

CONCLUSION

This paper has described a number of ensemble methods for use with neural network acoustic models. It has been shown that through the use of principled methods such as boosting and mixtures the performance of these models may be improved over standard ensemble techniques.
In addition, by combining the techniques, bootstrapping the mixtures from the boosted networks, the performance of the models can be improved further. Previous work, which focused on boosting at the word level, showed improvements for a recurrent network-HMM hybrid over the baseline system (Cook & Robinson 1996). This paper has shown how the performance of a static MLP system can also be improved by boosting at the frame level.

Acknowledgements

Many thanks to Bellcore for providing the digits data set to our partners, ICSI; Nikki Mirghafori for help with datasets; David Johnson for providing the starting point for our code development; and Dan Kershaw for his invaluable advice.

References

Cook, G. & Robinson, A. (1996), Boosting the performance of connectionist large-vocabulary speech recognition, in 'International Conference on Spoken Language Processing'.

Drucker, H., Schapire, R. & Simard, P. (1993), Improving Performance in Neural Networks Using a Boosting Algorithm, in S. Hanson, J. Cowan & C. Giles, eds, 'Advances in Neural Information Processing Systems 5', Morgan Kaufmann, pp. 42-49.

Hansen, L. & Salamon, P. (1990), 'Neural Network Ensembles', IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 993-1001.

Hochberg, M., Renals, S. & Robinson, A. (1994), 'ABBOT: The CUED hybrid connectionist-HMM large-vocabulary recognition system', Proc. of Spoken Language Systems Technology Workshop, ARPA.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J. & Hinton, G. E. (1991), 'Adaptive mixtures of local experts', Neural Computation 3(1), 79-87.

Pallett, D., Fiscus, J., Fisher, W., Garofolo, J., Lund, B. & Pryzbocki, M. (1994), '1993 Benchmark Tests for the ARPA Spoken Language Program', ARPA Workshop on Human Language Technology, pp. 51-73. Merrill Lynch Conference Center, Plainsboro, NJ.

Paul, D. & Baker, J. (1992), The Design for the Wall Street Journal-based CSR Corpus, in 'DARPA Speech and Natural Language Workshop', Morgan Kaufmann Publishers, Inc., pp. 357-62.

Perrone, M. P. & Cooper, L. N. (1993), When networks disagree: Ensemble methods for hybrid neural networks, in 'Neural Networks for Speech and Image Processing', Chapman & Hall.