{"title": "Adaptively Growing Hierarchical Mixtures of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 459, "page_last": 465, "abstract": null, "full_text": "Adaptively  Growing Hierarchical \n\nMixtures of Experts \n\nJiirgen Fritsch, Michael Finke,  Alex Waibel \n\n{fritsch+,finkem, waibel }@cs.cmu.edu \n\nInteractive Systems Laboratories \n\nCarnegie  Mellon University \n\nPittsburgh,  PA  15213 \n\nAbstract \n\nWe propose a novel approach to automatically growing and pruning \nHierarchical  Mixtures  of Experts.  The constructive  algorithm pro(cid:173)\nposed  here  enables  large  hierarchies  consisting  of several  hundred \nexperts  to be  trained effectively.  We  show  that HME's  trained  by \nour  automatic  growing  procedure  yield  better  generalization  per(cid:173)\nformance  than  traditional  static  and  balanced  hierarchies.  Eval(cid:173)\nuation  of  the  algorithm  is  performed  (1)  on  vowel  classification \nand  (2)  within  a  hybrid  version  of the  JANUS  r9]  speech  recog(cid:173)\nnition system  using a  subset  of the Switchboard large-vocabulary \nspeaker-independent  continuous speech  recognition database. \n\nINTRODUCTION \n\nThe  Hierarchical  Mixtures  of Experts  (HME)  architecture  [2,3,4]  has  proven  use(cid:173)\nful  for  classification  and  regression  tasks  in  small  to  medium  sized  applications \nwith  convergence  times  several  orders  of magnitude  lower  than  comparable  neu(cid:173)\nral networks such  as  the multi-layer perceptron.  The HME  is best  understood  as  a \nprobabilistic decision tree, making use of soft splits of the input feature space at the \ninternal nodes,  to divide a given task into smaller, overlapping tasks that are solved \nby  expert  networks  at the terminals of the tree.  Training of the hierarchy is  based \non a generative model using the Expectation  Maximisation (EM)  [1,3]  algorithm as \na  powerful  and efficient  tool for  estimating the  network  parameters. \nIn  [3],  the architecture of the HME is considered pre-determined and remains fixed \nduring  training.  This  requires  choice  of structural  parameters such  as  tree  depth \nand branching factor  in  advance.  As  with  other  classification  and regression  tech(cid:173)\nniques,  it  may  be  advantageous  to  have  some  sort  of data-driven  model-selection \nmechanism to  (1)  overcome false  initialisations (2)  speed-up  training time and  (3) \nadapt  model  size  to  task  complexity  for  optimal  generalization  performance.  In \n[11],  a constructive algorithm for  the HME is presented and evaluated on two small \nclassification  tasks:  the  two  spirals  and  the  8-bit  parity  problems.  However,  this \n\n\f460 \n\n1.  Fritsch, M.  Finke and A.  Waibel \n\nalgorithm requires  the evaluation of the increase in the overall log-likelihood for  all \npotential  splits  (all  terminal  nodes)  in  an  existing  tree  for  each  generation.  This \nmethod  is  computationally too expensive  when  applied to  the  large  HME's  neces(cid:173)\nsary  in  tasks  with several  million training vectors,  as  in speech  recognition,  where \nwe can not afford to train all potential splits to eventually determine the single  best \nsplit and discard all others.  We  have developed an alternative approach to growing \nHME trees  which allows the fast  training of even large HME's, when combined with \na  path  pruning  technique.  
Our algorithm monitors the performance of the hierarchy in terms of scaled log-likelihoods, assigning penalties to the expert networks, in order to determine the expert that performs worst in its local partition. This expert is then expanded into a new subtree consisting of a new gating network and several new expert networks.\n
\n
HIERARCHICAL MIXTURES OF EXPERTS\n
\n
We restrict the presentation of the HME to the case of classification, although it was originally introduced in the context of regression. The architecture is a tree with gating networks at the non-terminal nodes and expert networks at the leaves. The gating networks receive the input vectors and divide the input space into a nested set of regions that correspond to the leaves of the tree. The expert networks also receive the input vectors and produce estimates of the a-posteriori class probabilities, which are then blended by the gating network outputs. All networks in the tree are linear, with a softmax non-linearity as their activation function. Such networks are known in statistics as multinomial logit models, a special case of Generalized Linear Models (GLIM) [5] in which the probabilistic component is the multinomial density. This allows for a probabilistic interpretation of the hierarchy in terms of a generative likelihood-based model. For each input vector x, the outputs of the gating networks are interpreted as the input-dependent multinomial probabilities for the decisions about which child nodes are responsible for the generation of the actual target vector y. After a sequence of these decisions, a particular expert network is chosen as the current classifier and computes multinomial probabilities for the output classes. The overall output of the hierarchy is\n
\n
P(y|x, \\Theta) = \\sum_{i=1}^{N} g_i(x, v_i) \\sum_{j=1}^{N} g_{j|i}(x, v_{ij}) P(y|x, \\theta_{ij})\n
\n
where the g_i and g_{j|i} are the outputs of the gating networks.\n
\n
The HME is trained using the EM algorithm [1] (see [3] for the application of EM to the HME architecture). The E-step requires the computation of posterior node probabilities as expected values for the unknown decision indicators:\n
\n
h_i = \\frac{g_i \\sum_j g_{j|i} P_{ij}(y)}{\\sum_i g_i \\sum_j g_{j|i} P_{ij}(y)}, \\qquad h_{j|i} = \\frac{g_{j|i} P_{ij}(y)}{\\sum_j g_{j|i} P_{ij}(y)}\n
\n
The M-step then leads to the following independent maximum-likelihood equations\n
\n
\\theta_{ij} = \\arg\\max_{\\theta_{ij}} \\sum_t h_i^{(t)} h_{j|i}^{(t)} \\log P_{ij}(y^{(t)})\n
\n
v_i = \\arg\\max_{v_i} \\sum_t \\sum_k h_k^{(t)} \\log g_k^{(t)}\n
\n
v_{ij} = \\arg\\max_{v_{ij}} \\sum_t \\sum_k h_k^{(t)} \\sum_l h_{l|k}^{(t)} \\log g_{l|k}^{(t)}\n
\n
where the \\theta_{ij} are the parameters of the expert networks and the v_i and v_{ij} are the parameters of the gating networks. In the case of a multinomial logit model, P_{ij}(y) = y_c, where y_c is the output of the node associated with the correct class. The above maximum likelihood equations may be solved by gradient ascent, weighted least squares or Newton methods. In our implementation, we use a variant of Jordan & Jacobs' [3] least squares approach.\n
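\n
For concreteness, the following is a minimal sketch in Python/NumPy of the forward pass and the E-step posteriors for a two-level HME as defined above; the array layout and all function names are our own assumptions, not the original implementation.\n
\n
import numpy as np\n
\n
def softmax(z):\n
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability\n
    e = np.exp(z)\n
    return e / e.sum(axis=-1, keepdims=True)\n
\n
def hme_forward(x, V, W, U):\n
    # Two-level HME forward pass for a single input vector x of shape (d,).\n
    # V: (N, d) top-level gate weights, W: (N, N, d) second-level gate weights,\n
    # U: (N, N, C, d) expert weights; every node is linear + softmax (multinomial logit).\n
    g = softmax(V @ x)                            # g_i, shape (N,)\n
    g_ji = softmax(np.einsum('ijd,d->ij', W, x))  # g_{j|i}, shape (N, N)\n
    p = softmax(np.einsum('ijcd,d->ijc', U, x))   # P_ij(y|x), shape (N, N, C)\n
    out = np.einsum('i,ij,ijc->c', g, g_ji, p)    # P(y|x, Theta), shape (C,)\n
    return out, g, g_ji, p\n
\n
def e_step_posteriors(c, g, g_ji, p):\n
    # Posterior node probabilities h_i and h_{j|i} for the correct class index c.\n
    p_y = p[:, :, c]                              # P_ij(y), shape (N, N)\n
    joint = g[:, None] * g_ji * p_y               # g_i g_{j|i} P_ij(y)\n
    h_i = joint.sum(axis=1) / joint.sum()\n
    h_ji = (g_ji * p_y) / (g_ji * p_y).sum(axis=1, keepdims=True)\n
    return h_i, h_ji\n
\n
In the M-step, these posteriors weight the independent maximum-likelihood fits of the expert and gate parameters given above.\n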
\n
GROWING MIXTURES\n
\n
In order to grow an HME, we have to define an evaluation criterion that scores the experts' performance on the training data. This in turn allows us to select and split the worst expert into a new subtree, providing additional parameters which can help to overcome the errors made by this expert. Viewing the HME as a probabilistic model of the observed data, we partition the input-dependent likelihood using the expert selection probabilities provided by the gating networks:\n
\n
l(\\Theta; X) = \\sum_t \\log P(y^{(t)}|x^{(t)}, \\Theta) = \\sum_t \\sum_k g_k \\log P(y^{(t)}|x^{(t)}, \\Theta) = \\sum_k \\sum_t \\log [P(y^{(t)}|x^{(t)}, \\Theta)]^{g_k} = \\sum_k l_k(\\Theta; X)\n
\n
where the g_k are the products of the gating probabilities along the path from the root node to the k-th expert. g_k is the probability that expert k is responsible for generating the observed data (note that the g_k sum to one). The expert-dependent scaled likelihoods l_k(\\Theta; X) can be used as a measure of the performance of an expert within its region of responsibility. We use this measure as the basis of our tree growing algorithm (a code sketch of steps 2 and 3 is given at the end of this section):\n
\n
1. Initialize and train a simple HME consisting of only one gate and several experts.\n
2. Compute the expert-dependent scaled likelihoods l_k(\\Theta; X) for each expert in one additional pass through the training data.\n
3. Find the expert k with minimum l_k and expand the tree, replacing this expert by a new gate with random weights and new experts that copy the weights of the old expert with additional small random perturbations.\n
4. Train the architecture to a local minimum of the classification error using a cross-validation set.\n
5. Continue with step (2) until the desired tree size is reached.\n
\n
The number of tree growing phases may either be pre-determined or be based on the difference in the likelihoods before and after splitting a node. In contrast to the growing algorithm in [11], our algorithm does not hypothesize all possible node splits but determines the expansion node(s) directly, which is much faster, especially when dealing with large hierarchies. Furthermore, we implemented a path pruning technique similar to the one proposed in [11], which speeds up training and testing times significantly. During the recursive depth-first traversal of the tree (needed for forward evaluation, posterior probability computation and accumulation of node statistics), a path is pruned temporarily if the current node's probability of activation falls below a certain threshold. Additionally, we prune subtrees permanently if the sum of a node's activation probabilities over the whole training set falls below a certain threshold. This technique is consistent with the growing algorithm and also helps prevent instabilities and singularities in the parameter updates, since nodes that accumulate too little training information are automatically pruned and are therefore not considered for a parameter update.\n
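\n
A minimal sketch of steps 2 and 3 of the growing algorithm above, again in Python; the tree interface (num_experts, posterior, path_probs, expert, replace_expert, new_random_gate, copy_with_noise) is a hypothetical one we assume for illustration, not part of the original system.\n
\n
import numpy as np\n
\n
def scaled_log_likelihoods(model, X, Y):\n
    # Step 2: accumulate l_k = sum_t g_k(x_t) log P(y_t|x_t, Theta) for every\n
    # expert k, where g_k is the product of gating probabilities on the path\n
    # from the root to expert k (assumed returned by model.path_probs).\n
    lks = np.zeros(model.num_experts())\n
    for x, y in zip(X, Y):\n
        log_p = np.log(model.posterior(x)[y])   # log P(y|x, Theta)\n
        lks += model.path_probs(x) * log_p      # g_k * log P(y|x, Theta)\n
    return lks\n
\n
def grow_step(model, X, Y, branching=2, noise=1e-2):\n
    # Step 3: replace the worst-scoring expert by a new gate with random\n
    # weights and `branching` experts that copy the old weights plus small noise.\n
    lks = scaled_log_likelihoods(model, X, Y)\n
    worst = int(np.argmin(lks))\n
    old = model.expert(worst)\n
    children = [old.copy_with_noise(noise) for _ in range(branching)]\n
    model.replace_expert(worst, gate=model.new_random_gate(branching), experts=children)\n
    return model\n
\n
Combined with the pruning thresholds described above, a single additional pass over the data determines the expansion node, instead of training all candidate splits as in [11].\n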
\n
Figure 1: Histogram trees for a standard and a grown HME\n
\n
VOWEL CLASSIFICATION\n
\n
In initial experiments, we investigated the usefulness of the proposed tree growing algorithm on Peterson and Barney's [6] vowel classification data, which uses formant frequencies as features. We chose this data set since it is small, non-artificial and low-dimensional, which allows for visualization and understanding of the way the growing HME tree performs classification tasks.\n
\n
The vowel data set contains 1520 samples, each consisting of the formants F0, F1, F2 and F3 and a class label indicating one of 10 different vowels. Experiments were carried out on the 4-dimensional feature space; however, in this paper graphical representations are restricted to the F1-F2 plane. The accompanying figure shows the data set represented in this plane (the formant frequencies are normalized to the range [0,1]).\n
\n
In the following experiments, we use binary branching HME's exclusively, but in general the growing algorithm poses no restrictions on the tree branching factor. We compare a standard, balanced HME of depth 3 with an HME that grows from a two-expert tree to a tree with the same number of experts (eight) as the standard HME. The size of the standard HME was chosen based on a number of experiments with differently sized HME's to find an optimal one. Fig. 1 shows the topology of the standard and the fully grown HME together with histograms of the gating probability distributions at the internal nodes.\n
\n
Fig. 2 shows results on 4-dimensional feature vectors in terms of correct classification rate and log-likelihood. The growing HME achieved a slightly better (1.6% absolute) classification rate than the fixed HME. Note also that the growing HME outperforms the fixed HME even before it reaches its full size. The growing HME was expanded every 4 iterations, which explains the bumpiness of the curves.\n
\n
Fig. 3 shows the impact of path pruning during training on the final classification rate of the grown HME's. The pruning factor ranges from no pruning to full pruning (i.e. only the most likely path survives).\n
\n
Figure 2: Classification rate and log-likelihood for standard and growing HME\n
\n
Figure 3: Impact of path pruning during training of growing HME's\n
\n
Fig. 4 shows how the gating networks partition the feature space. It contains plots of the activation regions of all 8 experts of the standard HME in the 2-dimensional range [-0.1, 1.1]^2. Activation probabilities (products of the gating probabilities from root to expert) are colored in shades of gray from black to white. Fig. 5 shows the same kind of plot for all 8 experts of the grown HME. The plots in the upper right corner illustrate the class boundaries obtained by each HME.\n
\n
Figure 4: Expert activations for standard HME\n
\n
Fig. 4 reveals a weakness of standard HME's: gating networks at high levels in the tree can pinch off whole branches, rendering all the experts in the subtree useless. In our case, half of the experts of the standard HME do not contribute to the final decision at all (black boxes). The growing HME's are able to overcome this effect. All the experts of the grown HME (Fig. 5) have non-zero activation patterns, and the overlap between experts is much higher in the growing case, which indicates a higher degree of cooperation among experts. This can also be seen in the histogram trees in Fig. 1, where gating networks in lower levels of the grown tree tend to average the experts' outputs. The splits formed by the gating networks also have implications for the way class boundaries are formed by the HME: there are strong dependencies visible between the class boundaries and some of the experts' activation regions.\n
\n
Figure 5: Expert activations for grown HME\n
\n
EXPERIMENTS ON SWITCHBOARD\n
\n
We recently started experiments using standard and growing HME's as estimators of posterior phone probabilities in a hybrid version of the JANUS [9] speech recognizer. Following the work in [12], we use a different HME for each state of a phonetic HMM. The posteriors for 52 phonemes computed by the HME's are converted into scaled likelihoods by dividing by prior probabilities, to account for the likelihood-based training and decoding of HMM's. During training, targets for the HME's are generated by forced alignment using a baseline mixture-of-Gaussians HMM system. We evaluate the system on the Switchboard spontaneous telephone speech corpus. Our best current mixture-of-Gaussians based context-dependent HMM system achieves a word accuracy of 61.4% on this task, which is among the best current systems [7]. We started by using phonetic context-independent (CI) HME's for 3-state HMM's. We restricted the training set to all dialogues involving speakers from one dialect region (New York City), since the whole training set contains over 140 hours of speech. Our aim here was to reduce training time (the subset contains only about 5% of the data) in order to be able to compare different HME architectures.\n
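\n
The posterior-to-scaled-likelihood conversion used in this hybrid setup follows directly from Bayes' rule, P(x|phone)/P(x) = P(phone|x)/P(phone); a one-line sketch, where the function name and the flooring constant are our own assumptions:\n
\n
import numpy as np\n
\n
def scaled_likelihoods(posteriors, priors, floor=1e-8):\n
    # posteriors: (T, 52) frame-level HME outputs P(phone|x)\n
    # priors: (52,) phone prior probabilities estimated from the training alignments\n
    # Dividing by the (floored) priors yields scaled likelihoods for HMM decoding.\n
    return posteriors / np.maximum(priors, floor)\n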
\n
Context   HME type   # experts   Word Acc.\n
CI        standard   64          33.8%\n
CI        grown      64          35.1%\n
\n
Figure 6: Preliminary results on Switchboard telephone data\n
\n
To improve performance, we then built context-dependent (CD) models consisting of a separate HME for each biphone context and state. The CD HME's output is smoothed with the CI models based on prior context probabilities. Current work focuses on improving context modeling (e.g. larger contexts and decision-tree based clustering).\n
\n
Fig. 6 summarizes the results so far, showing consistently that growing HME's outperform equally sized standard HME's. The results are not directly comparable with our best Gaussian mixture system, since we restricted context modeling to biphones and used only a small subset of the Switchboard database for training.\n
\n
CONCLUSIONS\n
\n
In this paper, we presented a method for adaptively growing Hierarchical Mixtures of Experts. We showed that the algorithm allows the HME to use its resources (experts) more efficiently than a standard pre-determined HME architecture. The tree growing algorithm leads to better classification performance compared to standard HME's with equal numbers of parameters. Using growing instead of fixed HME's as continuous density estimators in a hybrid speech recognition system also improves performance.\n
\n
References\n
\n
[1] Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B 39, 1-38.\n
[2] Jacobs, R.A., Jordan, M.I., Nowlan, S.J. & Hinton, G.E. (1991) Adaptive mixtures of local experts. Neural Computation 3, pp. 79-87. MIT Press.\n
[3] Jordan, M.I. & Jacobs, R.A. (1994) Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation 6, pp. 181-214. MIT Press.\n
[4] Jordan, M.I. & Jacobs, R.A. (1992) Hierarchies of adaptive experts. In Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, eds., pp. 985-993. Morgan Kaufmann, San Mateo, CA.\n
[5] McCullagh, P. & Nelder, J.A. (1983) Generalized Linear Models. Chapman and Hall, London.\n
[6] Peterson, G.E. & Barney, H.L. (1952) Control methods used in a study of the vowels. Journal of the Acoustical Society of America 24, 175-184.\n
[7] Proceedings of the LVCSR Hub 5 workshop, Apr. 29 - May 1 (1996). MITAGS, Linthicum Heights, Maryland.\n
[8] Syrdal, A.K. & Gopal, H.S. (1986) A perceptual model of vowel recognition based on the auditory representation of American English vowels. Journal of the Acoustical Society of America 79(4):1086-1100.\n
[9] Zeppenfeld, T., Finke, M., Ries, K., Westphal, M. & Waibel, A. (1997) Recognition of Conversational Telephone Speech using the Janus Speech Engine. Proceedings of ICASSP 97, Muenchen, Germany.\n
[10] Waterhouse, S.R. & Robinson, A.J. (1994) Classification using Hierarchical Mixtures of Experts. In Proc. 1994 IEEE Workshop on Neural Networks for Signal Processing IV, pp. 177-186.\n
[11] Waterhouse, S.R. & Robinson, A.J. (1995) Constructive Algorithms for Hierarchical Mixtures of Experts. In Advances in Neural Information Processing Systems 8.\n
[12] Zhao, Y., Schwartz, R., Sroka, J.
 & Makhoul, J. (1995) Hierarchical Mixtures of Experts Methodology Applied to Continuous Speech Recognition. In ICASSP 1995, volume 5, pp. 3443-3446, May 1995.\n", "award": [], "sourceid": 1279, "authors": [{"given_name": "J\u00fcrgen", "family_name": "Fritsch", "institution": null}, {"given_name": "Michael", "family_name": "Finke", "institution": null}, {"given_name": "Alex", "family_name": "Waibel", "institution": null}]}