{"title": "Stacked Density Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 668, "page_last": 674, "abstract": null, "full_text": "Stacked Density Estimation\n\nPadhraic Smyth *\n\nInformation and Computer Science\n\nUniversity of California, Irvine\n\nCA 92697-3425\n\nsmyth@ics.uci.edu\n\nDavid Wolpert\n\nNASA Ames Research Center\n\nCaelum Research\n\nMS 269-2, Mountain View, CA 94035\n\ndhw@ptolemy.arc.nasa.gov\n\nAbstract\n\nIn this paper, the technique of stacking, previously only used for supervised learning, is applied to unsupervised learning. Specifically, it is used for non-parametric multivariate density estimation, to combine finite mixture model and kernel density estimators. Experimental results on both simulated data and real world data sets clearly demonstrate that stacked density estimation outperforms other strategies such as choosing the single best model based on cross-validation, combining with uniform weights, and even the single best model chosen by \"cheating\" by looking at the data used for independent testing.\n\n1 Introduction\n\nMultivariate probability density estimation is a fundamental problem in exploratory data analysis, statistical pattern recognition and machine learning. One frequently estimates density functions for which there is little prior knowledge on the shape of the density and for which one wants a flexible and robust estimator (allowing multimodality if it exists). In this context, the methods of choice tend to be finite mixture models and kernel density estimation methods. For mixture modeling, mixtures of Gaussian components are frequently assumed and model choice reduces to the problem of choosing the number k of Gaussian components in the model (Titterington, Smith and Makov, 1985). 
For kernel density estimation, kernel shapes are typically chosen from a selection of simple unimodal densities such as Gaussian, triangular, or Cauchy densities, and kernel bandwidths are selected in a data-driven manner (Silverman 1986; Scott 1994).\n\n* Also with the Jet Propulsion Laboratory 525-3660, California Institute of Technology, Pasadena, CA 91109\n\nAs argued by Draper (1995), model uncertainty can contribute significantly to predictive error in estimation. While usually considered in the context of supervised learning, model uncertainty is also important in unsupervised learning applications such as density estimation. Even when the model class under consideration contains the true density, if we are only given a finite data set, then there is always a chance of selecting the wrong model. Moreover, even if the correct model is selected, there will typically be estimation error in the parameters of that model. These difficulties are summarized by writing\n\n$$P(f \\mid D) = \\sum_{M} \\int d\\theta_M \\, P(\\theta_M \\mid D, M) \\times P(M \\mid D) \\times f_{M,\\theta_M} \\qquad (1)$$\n\nwhere $f$ is a density, $D$ is the data set, $M$ is a model, and $\\theta_M$ is a set of values for the parameters for model $M$. The posterior probability $P(M \\mid D)$ reflects model uncertainty, and the posterior $P(\\theta_M \\mid D, M)$ reflects uncertainty in setting the parameters even once one knows the model. Note that if one is privy to $P(M, \\theta_M)$, then Bayes' theorem allows us to write out both of our posteriors explicitly, so that we explicitly have $P(f \\mid D)$ (and therefore the Bayes-optimal density) given by a weighted average of the $f_{M,\\theta_M}$. (See also Escobar and West (1995).) However, even when we know $P(M, \\theta_M)$, calculating the combining weights can be difficult. 
Thus, various approximations and sampling techniques are often used, a process that necessarily introduces extra error (Chickering and Heckerman 1997). More generally, consider the case of mis-specified models where the model class does not include the true model, so our presumption for $P(M, \\theta_M)$ is erroneous. In this case often one should again average.\n\nThus, a natural approach to improving density estimators is to consider empirically-driven combinations of multiple density models. There are several ways to do this, especially if one exploits previous combining work in supervised learning. For example, Ormoneit and Tresp (1996) have shown that \"bagging\" (uniformly weighting different parametrizations of the same model trained on different bootstrap samples), originally introduced for supervised learning (Breiman 1996a), can improve accuracy for mixtures of Gaussians with a fixed number of components. Another supervised learning technique for combining different types of models is \"stacking\" (Wolpert 1992), which has been found to be very effective for both regression and classification (e.g., Breiman (1996b)). This paper applies stacking to density estimation, in particular to combinations involving kernel density estimators together with finite mixture model estimators.\n\n2 Stacked Density Estimation\n\n2.1 Background on Density Estimation with Mixtures and Kernels\n\nConsider a set of $d$ real-valued random variables $X = \\{X^1, \\ldots, X^d\\}$. Upper case symbols denote variable names (such as $X^j$) and lower-case symbols a particular value of a variable (such as $x^j$). $\\underline{x}$ is a realization of the vector variable $X$. $f(\\underline{x})$ is shorthand for $f(X = \\underline{x})$ and represents the joint probability distribution of $X$. $D = \\{\\underline{x}_1, \\ldots, \\underline{x}_N\\}$ is a training data set where each sample $\\underline{x}_i$, $1 \\le i \\le N$, is an independently drawn sample from the underlying density function $f(\\underline{x})$.\n\nA commonly used model for density estimation is the finite mixture model with $k$ components, defined as:\n\n$$f_k(\\underline{x}) = \\sum_{i=1}^{k} \\alpha_i g_i(\\underline{x}), \\qquad (2)$$\n\nwhere $\\sum_{i=1}^{k} \\alpha_i = 1$. The component $g_i$'s are usually relatively simple unimodal densities such as Gaussians. Density estimation with mixtures involves finding the locations, shapes, and weights of the component densities from the data (using for example the Expectation-Maximization (EM) procedure). Kernel density estimation can be viewed as a special case of mixture modeling where a component is centered at each data point, given a weight of $1/N$, and a common covariance structure (kernel shape) is estimated from the data.\n\nThe quality of a particular probabilistic model can be evaluated by an appropriate scoring rule on independent out-of-sample data, such as the test set log-likelihood (also referred to as the log-scoring rule in the Bayesian literature). Given a test data set $D^{test}$, the test log-likelihood is defined as\n\n$$\\log f(D^{test} \\mid f_k(\\underline{x})) = \\sum_{\\underline{x}_i \\in D^{test}} \\log f_k(\\underline{x}_i). \\qquad (3)$$\n\nThis quantity can play the role played by classification error in classification or squared error in regression. For example, cross-validated estimates of it can be used to find the best number of clusters to fit to a given data set (Smyth, 1996).\n\n2.2 Background on Stacking\n\nStacking can be used either to combine models or to improve a single model. In the former guise it proceeds as follows. First, subsamples of the training set are formed. 
Next the models are all trained on one subsample and the resultant joint predictive behavior on another subsample is observed, together with information concerning the optimal predictions on the elements in that other subsample. This is repeated for other pairs of subsamples of the training set. Then an additional (\"stacked\") model is trained to learn, from the subsample-based observations, the relationship between the observed joint predictive behavior of the models and the optimal predictions. Finally, this learned relationship is used in conjunction with the predictions of the individual models being combined (now trained on the entire data set) to determine the full system's predictions.\n\n2.3 Applying Stacking to Density Estimation\n\nConsider a set of $M$ different density models, $f_m(\\underline{x})$, $1 \\le m \\le M$. In this paper each of these models will be either a finite mixture with a fixed number of component densities or a kernel density estimate with a fixed kernel and a single fixed global bandwidth in each dimension. (In general though no such restrictions are needed.) The procedure for stacking the $M$ density models is as follows:\n\n1. Partition the training data set $D$ $v$ times, exactly as in $v$-fold cross validation (we use $v = 10$ throughout this paper), and for each fold:\n(a) Fit each of the $M$ models to the training portion of the partition of $D$.\n(b) Evaluate the likelihood of each data point in the test partition of $D$, for each of the $M$ fitted models.\n\n2. After doing this one has $M$ density estimates for each of $N$ data points, and therefore a matrix of size $N \\times M$, where each entry is $f_m(\\underline{x}_i)$, the out-of-sample likelihood of the $m$th model on the $i$th data point.\n\n3. Use that matrix to estimate the combination coefficients $\\{\\beta_1, \\ldots, \\beta_M\\}$ that maximize the log-likelihood at the points $\\underline{x}_i$ of a stacked density model of the form:\n\n$$f_{stacked}(\\underline{x}) = \\sum_{m=1}^{M} \\beta_m f_m(\\underline{x}).$$\n\nSince this is itself a mixture model, but where the $f_m(\\underline{x}_i)$ are fixed, the EM algorithm can be used to (easily) estimate the $\\beta_m$.\n\n4. Finally, re-estimate the parameters of each of the $M$ component density models using all of the training data $D$. The stacked density model is then the linear combination of those density models, with combining coefficients given by the $\\beta_m$.\n\n3 Experimental Results\n\nIn our stacking experiments $M = 6$: three triangular kernels with bandwidths of 0.1, 0.4, and 1.5 of the standard deviation (of the full data set) in each dimension, and three Gaussian mixture models with $k = 2$, 4, and 8 components. This set of models was chosen to provide a reasonably diverse representational basis for stacking. We follow roughly the same experimental procedure as described in Breiman (1996b) for stacked regression:\n\n\u2022 Each data set is randomly split into training and test partitions 50 times, where the test partition is chosen to be large enough to provide reasonable estimates of out-of-sample log-likelihood.\n\n\u2022 The following techniques are run on each training partition:\n\n1. Stacking: The stacked combination of the six constituent models.\n2. Cross-Validation: The single best model as indicated by the maximum likelihood score of the $M = 6$ single models in the $N \\times M$ cross-validated table of likelihood scores.\n3. Uniform Weighting: A uniform average of the six models.\n4. \"Cheating\": The best single model, i.e., the model having the largest likelihood on the test data partition.\n5. 
Truth: The true model structure, if the true model generating the data is one of the six (only valid for simulated data).\n\n\u2022 The log-likelihoods of the models resulting from these techniques are calculated on the test data partition. The log-likelihood of a single Gaussian model (parameters determined on the training data) is subtracted from each model's log-likelihood to provide some normalization of scale.\n\n3.1 Results on Real Data Sets\n\nFour real data sets were chosen for experimental evaluation. The diabetes data set consists of 145 data points used in Gaussian clustering studies by Banfield and Raftery (1993) and others. Fisher's iris data set is a classic data set in 4 dimensions with 150 data points. Both of these data sets are thought to consist roughly of 3 clusters which can be reasonably approximated by 3 Gaussians. The Barney and Peterson vowel data (2 dimensions, 639 data points) contains 10 distinct vowel sounds and so is highly multi-modal. The star-galaxy data (7 dimensions, 499 data points) contains non-Gaussian looking structure in various 2d projections.\n\nTable 1 summarizes the results. In all cases stacking had the highest average log-likelihood, even out-performing \"cheating\" (the single best model chosen from the test data). (Breiman (1996b) also found for regression that stacking outperformed the \"cheating\" method.) We considered two null hypotheses: stacking has the same predictive accuracy as cross-validation, and it has the same accuracy as uniform weighting. Each hypothesis can be rejected with a chance of less than 0.01% of being incorrect, according to the Wilcoxon signed-rank test, i.e., the observed differences in performance are extremely strong even given the fact that this particular test is not strictly applicable in this situation.\n\nTable 1: Relative performance of stacking multiple mixture models, for various data sets, measured (relative to the performance of a single Gaussian model) by mean log-likelihood on test data partitions. The maximum for each data set is in the Stacking column.\n\nData Set | Gaussian | Cross-Validation | \"Cheating\" | Uniform | Stacking\nDiabetes | -352.9 | 27.8 | 30.4 | 29.2 | 31.8\nFisher's Iris | -52.6 | 18.3 | 21.2 | 18.3 | 22.5\nVowel | 128.9 | 53.5 | 54.6 | 40.2 | 55.8\nStar-Galaxy | -257.0 | 678.9 | 721.6 | 789.1 | 888.9\n\nTable 2: Average across 20 runs of the stacked weights found for each constituent model. The columns with h = ... are for the triangular kernels and the columns with k = ... are for the Gaussian mixtures.\n\nData Set | h=0.1 | h=0.4 | h=1.5 | k=2 | k=4 | k=8\nDiabetes | 0.01 | 0.09 | 0.03 | 0.13 | 0.41 | 0.32\nFisher's Iris | 0.02 | 0.16 | 0.00 | 0.26 | 0.40 | 0.16\nVowel | 0.00 | 0.25 | 0.00 | 0.02 | 0.20 | 0.53\nStar-Galaxy | 0.00 | 0.04 | 0.03 | 0.03 | 0.27 | 0.62\n\nOn the vowel data set uniform weighting performs much worse than the other methods; it is closer in performance to stacking on the other 3 data sets. On three of the data sets, using cross-validation to select a single model is the worst method. \"Cheating\" is second-best to stacking except on the star-galaxy data, where it is also worse than uniform weighting: this may be because the star-galaxy data probably induces the greatest degree of mis-specification relative to this 6-model class (based on visual inspection). 
\n\nTable 2 shows the averages of the stacked weight vectors for each data set. The mixture components generally got higher weight than the triangular kernels. The vowel and star-galaxy data sets have more structure than can be represented by any of the component models and this is reflected in the fact that for each, most weight is placed on the most complex mixture model with k = 8.\n\n3.2 Results on Simulated Data with no Model Mis-Specification\n\nWe simulated data from a 2-dimensional 4-Gaussian mixture model with a reasonable degree of overlap (this is the data set used in Ripley (1994) with the class labels removed) and compared the same models and combining/selection schemes as before, except that \"truth\" is also included, i.e., the scheme which always selects the true model structure with k = 4 Gaussians. For each training sample size, 20 different training data sets were simulated, and the mean likelihood on an independent test data set of size 1000 was reported.\n\n[Figure 1 plot: horizontal axis \"Training sample size\" (20 to 200); curves labeled Cheating, Uniform, Stacking, and True K.]\n\nFigure 1: Plot of mean log-likelihood (relative to a single Gaussian model) for various density estimation schemes on data simulated from a 4-component Gaussian mixture. 
\n\nNote that here we are assured of having the true model in the set of models being considered, something that is presumably never exactly the case in the real world (and presumably was not the case for the experiments recounted in Table 1). Nonetheless, as indicated in Figure 1, stacking performed about the same as the \"cheating\" method and significantly outperformed the other methods, including \"truth.\" (Results where some of the methods had log-likelihoods lower than the single Gaussian are not shown for clarity.)\n\n\"Truth\" performed poorly on the smaller sample sizes because with smaller sample sizes it was often better to fit a simpler model with reliable parameter estimates (which is what \"cheating\" typically would do) than a more complex model which may overfit (even when it is the true model structure). As the sample size increases, both \"truth\" and cross-validation approach the performance of \"cheating\" and stacking; uniform weighting is universally poorer, as one would expect when the true model is within the model class. The stacked weights at the different sample sizes (not shown) start out with significant weight on the triangular kernel model and gradually shift to the k = 2 Gaussian mixture model and finally to the (true) k = 4 Gaussian model as sample size grows. Thus, stacking is seen to incur no penalty when the true model is within the model class being fit. In fact the opposite is true; for small sample sizes stacking outperforms other density estimation techniques which place full weight on a single (but poorly parametrized) model.\n\n4 Discussion and Conclusions\n\nSelecting a global bandwidth for kernel density estimation is still a topic of debate among statisticians. 
Stacking allows the possibility of side-stepping the issue of a single bandwidth by combining kernels with different bandwidths and different kernel shapes. A stacked combination of such kernel estimators is equivalent to using a single composite kernel that is a convex combination of the underlying kernels. For example, kernel estimators based on finite support kernels can be regularized in a data-driven manner by combining them with infinite support kernels. The key point is that the shape and width of the resulting \"effective\" kernel are driven by the data.\n\nIt is also worth noting that by combining Gaussian mixture models with different k values one gets a hierarchical \"mixture of mixtures\" model. This hierarchical model can provide a natural multi-scale representation of the data, which is clearly similar in spirit to wavelet density estimators, although the functional forms and estimation methodologies for each technique can be quite different. There is also a representational similarity to Jordan and Jacobs' (1994) \"mixture of experts\" model where the weights are allowed to depend directly on the inputs. Exploiting that similarity, one direction for further work is to investigate adaptive weight parametrizations in the stacked density estimation context.\n\nAcknowledgements\n\nThe work of P.S. was supported in part by NSF Grant IRI-9703120 and in part by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.\n\nReferences\n\nBanfield, J. D., and Raftery, A. E., 'Model-based Gaussian and non-Gaussian clustering,' Biometrics, 49, 803-821, 1993.\n\nBreiman, L., 'Bagging predictors,' Machine Learning, 24(2), 123-140, 1996a. 
\n\nBreiman, L., 'Stacked regressions,' Machine Learning, 24, 49-64, 1996b.\n\nChickering, D. M., and Heckerman, D., 'Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables,' Machine Learning, in press.\n\nDraper, D., 'Assessment and propagation of model uncertainty (with discussion),' Journal of the Royal Statistical Society B, 57, 45-97, 1995.\n\nEscobar, M. D., and West, M., 'Bayesian density estimation and inference with mixtures,' J. Am. Stat. Assoc., 90, 577-588, 1995.\n\nJordan, M. I., and Jacobs, R. A., 'Hierarchical mixtures of experts and the EM algorithm,' Neural Computation, 6, 181-214, 1994.\n\nMadigan, D., and Raftery, A. E., 'Model selection and accounting for model uncertainty in graphical models using Occam's window,' J. Am. Stat. Assoc., 89, 1535-1546, 1994.\n\nOrmoneit, D., and Tresp, V., 'Improved Gaussian mixture density estimates using Bayesian penalty terms and network averaging,' in Advances in Neural Information Processing Systems 8, 542-548, MIT Press, 1996.\n\nRipley, B. D., 'Neural networks and related methods for classification (with discussion),' J. Roy. Stat. Soc. B, 56, 409-456, 1994.\n\nSmyth, P., 'Clustering using Monte-Carlo cross-validation,' in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA: AAAI Press, pp. 126-133, 1996.\n\nTitterington, D. M., Smith, A. F. M., and Makov, U. E., Statistical Analysis of Finite Mixture Distributions, Chichester, UK: John Wiley and Sons, 1985.\n\nWolpert, D., 'Stacked generalization,' Neural Networks, 5, 241-259, 1992.", "award": [], "sourceid": 1353, "authors": [{"given_name": "Padhraic", "family_name": "Smyth", "institution": null}, {"given_name": "David", "family_name": "Wolpert", "institution": null}]}