{"title": "Supervised learning from incomplete data via an EM approach", "book": "Advances in Neural Information Processing Systems", "page_first": 120, "page_last": 127, "abstract": null, "full_text": "Supervised learning from incomplete data via an EM approach\n\nZoubin Ghahramani and Michael I. Jordan\n\nDepartment of Brain & Cognitive Sciences\nMassachusetts Institute of Technology\nCambridge, MA 02139\n\nAbstract\n\nReal-world learning tasks may involve high-dimensional data sets with arbitrary patterns of missing data. In this paper we present a framework based on maximum likelihood density estimation for learning from such data sets. We use mixture models for the density estimates and make two distinct appeals to the Expectation-Maximization (EM) principle (Dempster et al., 1977) in deriving a learning algorithm: EM is used both for the estimation of mixture components and for coping with missing data. The resulting algorithm is applicable to a wide range of supervised as well as unsupervised learning problems. Results from a classification benchmark, the iris data set, are presented.\n\n1 Introduction\n\nAdaptive systems generally operate in environments that are fraught with imperfections; nonetheless they must cope with these imperfections and learn to extract as much relevant information as needed for their particular goals. One form of imperfection is incompleteness in sensing information. Incompleteness can arise extrinsically from the data generation process and intrinsically from failures of the system's sensors. 
For example, an object recognition system must be able to learn to classify images with occlusions, and a robotic controller must be able to integrate multiple sensors even when only a fraction may operate at any given time.\n\nIn this paper we present a framework, derived from parametric statistics, for learning from data sets with arbitrary patterns of incompleteness. Learning in this framework is a classical estimation problem requiring an explicit probabilistic model and an algorithm for estimating the parameters of the model. A possible disadvantage of parametric methods is their lack of flexibility when compared with nonparametric methods. This problem, however, can be largely circumvented by the use of mixture models (McLachlan and Basford, 1988). Mixture models combine much of the flexibility of nonparametric methods with certain of the analytic advantages of parametric methods.\n\nMixture models have been utilized recently for supervised learning problems in the form of the \"mixtures of experts\" architecture (Jacobs et al., 1991; Jordan and Jacobs, 1994). This architecture is a parametric regression model with a modular structure similar to the nonparametric decision tree and adaptive spline models (Breiman et al., 1984; Friedman, 1991). The approach presented here differs from these regression-based approaches in that the goal of learning is to estimate the density of the data. No distinction is made between input and output variables; the joint density is estimated and this estimate is then used to form an input/output map. Similar approaches have been discussed by Specht (1991) and Tresp et al. (1993). 
To estimate the vector function $y = f(x)$ the joint density $P(x, y)$ is estimated and, given a particular input $x$, the conditional density $P(y \\mid x)$ is formed. To obtain a single estimate of $y$ rather than the full conditional density one can evaluate $\\hat{y} = E(y \\mid x)$, the expectation of $y$ given $x$.\n\nThe density-based approach to learning can be exploited in several ways. First, having an estimate of the joint density allows for the representation of any relation between the variables. From $P(x, y)$, we can estimate $\\hat{y} = f(x)$, the inverse $\\hat{x} = f^{-1}(y)$, or any other relation between two subsets of the elements of the concatenated vector $(x, y)$.\n\nSecond, this density-based approach is applicable both to supervised learning and unsupervised learning in exactly the same way. The only distinction between supervised and unsupervised learning in this framework is whether some portion of the data vector is denoted as \"input\" and another portion as \"target\".\n\nThird, as we discuss in this paper, the density-based approach deals naturally with incomplete data, i.e. missing values in the data set. This is because the problem of estimating mixture densities can itself be viewed as a missing data problem (the \"labels\" for the component densities are missing) and an Expectation-Maximization (EM) algorithm (Dempster et al., 1977) can be developed to handle both kinds of missing data.\n\n2 Density estimation using EM\n\n
This section outlines the basic learning algorithm for finding the maximum likelihood parameters of a mixture model (Dempster et al., 1977; Duda and Hart, 1973; Nowlan, 1991). We assume that the data $\\mathcal{X} = \\{x_1, \\ldots, x_N\\}$ are generated independently from a mixture density\n\n$$P(x_i) = \\sum_{j=1}^{M} P(x_i \\mid \\omega_j; \\theta_j) P(\\omega_j), \\tag{1}$$\n\nwhere each component of the mixture is denoted $\\omega_j$ and parametrized by $\\theta_j$. From equation (1) and the independence assumption we see that the log likelihood of the parameters given the data set is\n\n$$l(\\theta \\mid \\mathcal{X}) = \\sum_{i=1}^{N} \\log \\sum_{j=1}^{M} P(x_i \\mid \\omega_j; \\theta_j) P(\\omega_j). \\tag{2}$$\n\nBy the maximum likelihood principle the best model of the data has parameters that maximize $l(\\theta \\mid \\mathcal{X})$. This function, however, is not easily maximized numerically because it involves the log of a sum.\n\nIntuitively, there is a \"credit-assignment\" problem: it is not clear which component of the mixture generated a given data point and thus which parameters to adjust to fit that data point. The EM algorithm for mixture models is an iterative method for solving this credit-assignment problem. The intuition is that if one had access to a \"hidden\" random variable $z$ that indicated which data point was generated by which component, then the maximization problem would decouple into a set of simple maximizations. Using the indicator variable $z$, a \"complete-data\" log likelihood function can be written\n\n$$l_c(\\theta \\mid \\mathcal{X}, \\mathcal{Z}) = \\sum_{i=1}^{N} \\sum_{j=1}^{M} z_{ij} \\log \\left[ P(x_i \\mid z_i; \\theta) P(z_i; \\theta) \\right], \\tag{3}$$\n\nwhich does not involve a log of a summation.
\n\nSince $\\mathcal{Z}$ is unknown, $l_c$ cannot be utilized directly, so we instead work with its expectation, denoted by $Q(\\theta \\mid \\theta_k)$. As shown by Dempster et al. (1977), $l(\\theta \\mid \\mathcal{X})$ can be maximized by iterating the following two steps:\n\n$$\\begin{aligned} \\text{E step:} \\quad & Q(\\theta \\mid \\theta_k) = E[\\, l_c(\\theta \\mid \\mathcal{X}, \\mathcal{Z}) \\mid \\mathcal{X}, \\theta_k \\,] \\\\ \\text{M step:} \\quad & \\theta_{k+1} = \\arg\\max_{\\theta} Q(\\theta \\mid \\theta_k). \\end{aligned} \\tag{4}$$\n\nThe E (Expectation) step computes the expected complete data log likelihood and the M (Maximization) step finds the parameters that maximize this likelihood. These two steps form the basis of the EM algorithm; in the next two sections we will outline how they can be used for real and discrete density estimation.\n\n2.1 Real-valued data: mixture of Gaussians\n\nReal-valued data can be modeled as a mixture of Gaussians. For this model the E-step simplifies to computing $h_{ij} = E[z_{ij} \\mid x_i, \\theta_k]$, the probability that Gaussian $j$, as defined by the parameters estimated at time step $k$, generated data point $i$:\n\n$$h_{ij} = \\frac{|\\hat{\\Sigma}_j^k|^{-1/2} \\exp\\{-\\frac{1}{2} (x_i - \\hat{\\mu}_j^k)^T (\\hat{\\Sigma}_j^k)^{-1} (x_i - \\hat{\\mu}_j^k)\\}}{\\sum_{l=1}^{M} |\\hat{\\Sigma}_l^k|^{-1/2} \\exp\\{-\\frac{1}{2} (x_i - \\hat{\\mu}_l^k)^T (\\hat{\\Sigma}_l^k)^{-1} (x_i - \\hat{\\mu}_l^k)\\}}. \\tag{5}$$\n\nThe M-step re-estimates the means and covariances of the Gaussians\u00b9 using the data set weighted by the $h_{ij}$:\n\n$$\\hat{\\mu}_j^{k+1} = \\frac{\\sum_{i=1}^{N} h_{ij} x_i}{\\sum_{i=1}^{N} h_{ij}}, \\tag{6a}$$\n\n$$\\hat{\\Sigma}_j^{k+1} = \\frac{\\sum_{i=1}^{N} h_{ij} (x_i - \\hat{\\mu}_j^{k+1})(x_i - \\hat{\\mu}_j^{k+1})^T}{\\sum_{i=1}^{N} h_{ij}}. \\tag{6b}$$\n\n\u00b9 Though this derivation assumes equal priors for the Gaussians, if the priors are viewed as mixing parameters they can also be learned in the maximization step.\n\n2.2 Discrete-valued data: mixture of Bernoullis\n\n$D$-dimensional binary data $x = (x_1, \\ldots, x_d, \\ldots, x_D)$, $x_d \\in \\{0, 1\\}$, can be modeled as a mixture of $M$ Bernoulli densities. That is,\n\n$$P(x \\mid \\theta) = \\sum_{j=1}^{M} P(\\omega_j) \\prod_{d=1}^{D} \\mu_{jd}^{x_d} (1 - \\mu_{jd})^{(1 - x_d)}. \\tag{7}$$\n\nFor this model the E-step involves computing\n\n$$h_{ij} = \\frac{\\prod_{d=1}^{D} \\hat{\\mu}_{jd}^{x_{id}} (1 - \\hat{\\mu}_{jd})^{(1 - x_{id})}}{\\sum_{l=1}^{M} \\prod_{d=1}^{D} \\hat{\\mu}_{ld}^{x_{id}} (1 - \\hat{\\mu}_{ld})^{(1 - x_{id})}}, \\tag{8}$$\n\nand the M-step again re-estimates the parameters by\n\n$$\\hat{\\mu}_j^{k+1} = \\frac{\\sum_{i=1}^{N} h_{ij} x_i}{\\sum_{i=1}^{N} h_{ij}}. \\tag{9}$$\n\nMore generally, discrete or categorical data can be modeled as generated by a mixture of multinomial densities and similar derivations for the learning algorithm can be applied. Finally, the extension to data with mixed real, binary, and categorical dimensions can be readily derived by assuming a joint density with mixed components of the three types.\n\n
3 Learning from incomplete data\n\nIn the previous section we presented one aspect of the EM algorithm: learning mixture models. Another important application of EM is to learning from data sets with missing values (Little and Rubin, 1987; Dempster et al., 1977). This application has been pursued in the statistics literature for non-mixture density estimation problems; in this paper we combine this application of EM with that of learning mixture parameters.\n\nWe assume that the data set $\\mathcal{X} = \\{x_1, \\ldots, x_N\\}$ is divided into an observed component $\\mathcal{X}^o$ and a missing component $\\mathcal{X}^m$. Similarly, each data vector $x_i$ is divided into $(x_i^o, x_i^m)$, where each data vector can have different missing components; this would be denoted by superscripts $m_i$ and $o_i$, but we have simplified the notation for the sake of clarity.\n\nTo handle missing data we rewrite the EM algorithm as follows:\n\n$$\\begin{aligned} \\text{E step:} \\quad & Q(\\theta \\mid \\theta_k) = E[\\, l_c(\\theta \\mid \\mathcal{X}^o, \\mathcal{X}^m, \\mathcal{Z}) \\mid \\mathcal{X}^o, \\theta_k \\,] \\\\ \\text{M step:} \\quad & \\theta_{k+1} = \\arg\\max_{\\theta} Q(\\theta \\mid \\theta_k). \\end{aligned} \\tag{10}$$\n\nComparing to equation (4) we see that aside from the indicator variables $\\mathcal{Z}$ we have added a second form of incomplete data, $\\mathcal{X}^m$, corresponding to the missing values in the data set. 
The E-step of the algorithm estimates both these forms of missing information; in essence it uses the current estimate of the data density to complete the missing values.\n\n3.1 Real-valued data: mixture of Gaussians\n\nWe start by writing the log likelihood of the complete data,\n\n$$l_c(\\theta \\mid \\mathcal{X}^o, \\mathcal{X}^m, \\mathcal{Z}) = \\sum_{i=1}^{N} \\sum_{j=1}^{M} z_{ij} \\log P(x_i \\mid z_i, \\theta) + \\sum_{i=1}^{N} \\sum_{j=1}^{M} z_{ij} \\log P(z_i \\mid \\theta). \\tag{11}$$\n\nWe can ignore the second term since we will only be estimating the parameters of the $P(x_i \\mid z_i, \\theta)$. Using equation (11) for the mixture of Gaussians we note that if only the indicator variables $z_i$ are missing, the E step can be reduced to estimating $E[z_{ij} \\mid x_i, \\theta]$. For the case we are interested in, with two types of missing data $z_i$ and $x_i^m$, we expand equation (11) using $m$ and $o$ superscripts to denote subvectors and submatrices of the parameters matching the missing and observed components of the data,\n\n$$l_c(\\theta \\mid \\mathcal{X}^o, \\mathcal{X}^m, \\mathcal{Z}) = \\sum_{i=1}^{N} \\sum_{j=1}^{M} z_{ij} \\Big[ -\\frac{n}{2} \\log 2\\pi - \\frac{1}{2} \\log |\\Sigma_j| - \\frac{1}{2} (x_i^o - \\mu_j^o)^T \\Sigma_j^{-1,oo} (x_i^o - \\mu_j^o) - (x_i^o - \\mu_j^o)^T \\Sigma_j^{-1,om} (x_i^m - \\mu_j^m) - \\frac{1}{2} (x_i^m - \\mu_j^m)^T \\Sigma_j^{-1,mm} (x_i^m - \\mu_j^m) \\Big].$$\n\nNote that after taking the expectation, the sufficient statistics for the parameters involve three unknown terms, $z_{ij}$, $z_{ij} x_i^m$, and $z_{ij} x_i^m x_i^{mT}$. Thus we must compute: $E[z_{ij} \\mid x_i^o, \\theta_k]$, $E[z_{ij} x_i^m \\mid x_i^o, \\theta_k]$, and $E[z_{ij} x_i^m x_i^{mT} \\mid x_i^o, \\theta_k]$.\n\nOne intuitive approach to dealing with missing data is to use the current estimate of the data density to compute the expectation of the missing data in an E-step, complete the data with these expectations, and then use this completed data to re-estimate parameters in an M-step. 
However, this intuition fails even when dealing with a single two-dimensional Gaussian; the expectation of the missing data always lies along a line, which biases the estimate of the covariance. On the other hand, the approach arising from application of the EM algorithm specifies that one should use the current density estimate to compute the expectation of whatever incomplete terms appear in the likelihood maximization. For the mixture of Gaussians these incomplete terms involve interactions between the indicator variable $z_{ij}$ and the first and second moments of $x_i^m$. Thus, simply computing the expectation of the missing data $z_i$ and $x_i^m$ from our model and substituting those values into the M step is not sufficient to guarantee an increase in the likelihood of the parameters.\n\nThe above terms can be computed as follows: $E[z_{ij} \\mid x_i^o, \\theta_k]$ is again $h_{ij}$, the probability as defined in (5) measured only on the observed dimensions of $x_i$, and\n\n$$E[z_{ij} x_i^m \\mid x_i^o, \\theta_k] = h_{ij} E[x_i^m \\mid z_{ij} = 1, x_i^o, \\theta_k] = h_{ij} \\left( \\mu_j^m + \\Sigma_j^{mo} (\\Sigma_j^{oo})^{-1} (x_i^o - \\mu_j^o) \\right). \\tag{12}$$\n\nDefining $\\hat{x}_{ij}^m = E[x_i^m \\mid z_{ij} = 1, x_i^o, \\theta_k]$, the regression of $x_i^m$ on $x_i^o$ using Gaussian $j$,\n\n$$E[z_{ij} x_i^m x_i^{mT} \\mid x_i^o, \\theta_k] = h_{ij} \\left( \\Sigma_j^{mm} - \\Sigma_j^{mo} (\\Sigma_j^{oo})^{-1} \\Sigma_j^{moT} + \\hat{x}_{ij}^m \\hat{x}_{ij}^{mT} \\right). \\tag{13}$$\n\nThe M-step uses these expectations substituted into equations (6a) and (6b) to re-estimate the means and covariances. To re-estimate the mean vector, $\\mu_j$, we substitute the values $E[x_i^m \\mid z_{ij} = 1, x_i^o, \\theta_k]$ for the missing components of $x_i$ in equation (6a). To re-estimate the covariance matrix we substitute the values $E[x_i^m x_i^{mT} \\mid z_{ij} = 1, x_i^o, \\theta_k]$ for the outer product matrices involving the missing components of $x_i$ in equation (6b).
\n\n3.2 Discrete-valued data: mixture of Bernoullis\n\nFor the Bernoulli mixture the sufficient statistics for the M-step involve the incomplete terms $E[z_{ij} \\mid x_i^o, \\theta_k]$ and $E[z_{ij} x_i^m \\mid x_i^o, \\theta_k]$. The first is equal to $h_{ij}$ calculated over the observed subvector of $x_i$. The second, since we assume that within a class the individual dimensions of the Bernoulli variable are independent, is simply $h_{ij} \\mu_j^m$. The M-step uses these expectations substituted into equation (9).\n\n4 Supervised learning\n\nIf each vector $x_i$ in the data set is composed of an \"input\" subvector, $x_i^i$, and a \"target\" or output subvector, $x_i^o$, then learning the joint density of the input and target is a form of supervised learning. In supervised learning we generally wish to predict the output variables from the input variables. In this section we will outline how this is achieved using the estimated density.\n\n4.1 Function approximation\n\nFor real-valued function approximation we have assumed that the density is estimated using a mixture of Gaussians. Given an input vector $x_i^i$ we extract all the relevant information from the density $P(x^i, x^o)$ by conditionalizing to $P(x^o \\mid x_i^i)$. For a single Gaussian this conditional density is normal, and, since $P(x^i, x^o)$ is a mixture of Gaussians, so is $P(x^o \\mid x^i)$. In principle, this conditional density is the final output of the density estimator. That is, given a particular input the network returns the complete conditional density of the output. 
However, since many applications require a single estimate of the output, we note three ways to obtain estimates $\\hat{x}^o$ of $x^o = f(x_i^i)$: the least squares estimate (LSE), which takes $\\hat{x}^o(x_i^i) = E(x^o \\mid x_i^i)$; stochastic sampling (STOCH), which samples according to the distribution $\\hat{x}^o(x_i^i) \\sim P(x^o \\mid x_i^i)$; single component LSE (SLSE), which takes $\\hat{x}^o(x_i^i) = E(x^o \\mid x_i^i, \\omega_j)$ where $j = \\arg\\max_k P(z_k \\mid x_i^i)$. For a given input, SLSE picks the Gaussian with highest posterior and approximates the output with the LSE estimator given by that Gaussian alone.\n\nThe conditional expectation or LSE estimator for a Gaussian mixture is\n\n$$\\hat{x}^o(x_i^i) = \\sum_{j=1}^{M} h_{ij} \\left[ \\mu_j^o + \\Sigma_j^{oi} (\\Sigma_j^{ii})^{-1} (x_i^i - \\mu_j^i) \\right], \\tag{14}$$\n\nwhich is a convex sum of linear approximations, where the weights $h_{ij}$ vary nonlinearly according to equation (14) over the input space. The LSE estimator on a Gaussian mixture has interesting relations to algorithms such as CART (Breiman et al., 1984), MARS (Friedman, 1991), and mixtures of experts (Jacobs et al., 1991; Jordan and Jacobs, 1994), in that the mixture of Gaussians competitively partitions the input space, and learns a linear regression surface on each partition. This similarity has also been noted by Tresp et al. (1993).\n\nThe stochastic estimator (STOCH) and the single component estimator (SLSE) are better suited than any least squares method for learning non-convex inverse maps, where the mean of several solutions to an inverse might not be a solution. These estimators take advantage of the explicit representation of the input/output density by selecting one of the several solutions to the inverse.\n\nFigure 1: Classification of the iris data set. 100 data points were used for training and 50 for testing. Each data point consisted of 4 real-valued attributes and one of three class labels. The figure shows classification performance \u00b1 1 standard error (n = 5) as a function of proportion missing features for the EM algorithm and for mean imputation (MI), a common heuristic where the missing values are replaced with their unconditional means. [Plot: \"Classification with missing inputs\"; x-axis: % missing features; curves: EM and MI.]\n\n
4.2 Classification\n\nClassification problems involve learning a mapping from an input space into a set of discrete class labels. The density estimation framework presented in this paper lends itself to solving classification problems by estimating the joint density of the input and class label using a mixture model. For example, if the inputs have real-valued attributes and there are $D$ class labels, a mixture model with Gaussian and multinomial components will be used:\n\n$$P(x, C = d \\mid \\theta) = \\sum_{j=1}^{M} P(\\omega_j) \\frac{\\mu_{jd}}{(2\\pi)^{n/2} |\\Sigma_j|^{1/2}} \\exp\\{-\\frac{1}{2} (x - \\mu_j)^T \\Sigma_j^{-1} (x - \\mu_j)\\}, \\tag{15}$$\n\ndenoting the joint probability that the data point is $x$ and belongs to class $d$, where the $\\mu_{jd}$ are the parameters for the multinomial. Once this density has been estimated, the maximum likelihood label for a particular input $x$ may be obtained by computing $P(C = d \\mid x, \\theta)$. Similarly, the class conditional densities can be derived by evaluating $P(x \\mid C = d, \\theta)$. Conditionalizing over classes in this way yields class conditional densities which are in turn mixtures of Gaussians. 
Figure 1 shows the performance of the EM algorithm on an example classification problem with varying proportions of missing features. We have also applied these algorithms to the problems of clustering 35-dimensional greyscale images and approximating the kinematics of a three-joint planar arm from incomplete data.\n\n5 Discussion\n\nDensity estimation in high dimensions is generally considered to be more difficult (requiring more parameters) than function approximation. The density-estimation-based approach to learning, however, has two advantages. First, it permits ready incorporation of results from the statistical literature on missing data to yield flexible supervised and unsupervised learning architectures. This is achieved by combining two branches of application of the EM algorithm, yielding a set of learning rules for mixtures under incomplete sampling.\n\nSecond, estimating the density explicitly enables us to represent any relation between the variables. Density estimation is fundamentally more general than function approximation and this generality is needed for a large class of learning problems arising from inverting causal systems (Ghahramani, 1994). These problems cannot be solved easily by traditional function approximation techniques since the data is not generated from noisy samples of a function, but rather of a relation.\n\nAcknowledgments\n\nThanks to D. M. Titterington and David Cohn for helpful comments. This project was supported in part by grants from the McDonnell-Pew Foundation, ATR Auditory and Visual Perception Research Laboratories, Siemens Corporation, the National Science Foundation, and the Office of Naval Research. 
The iris data set was obtained from the UCI Repository of Machine Learning Databases.\n\nReferences\n\nBreiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, CA.\n\nDempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39:1-38.\n\nDuda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.\n\nFriedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19:1-141.\n\nGhahramani, Z. (1994). Solving inverse problems using an EM approach to density estimation. In Proceedings of the 1993 Connectionist Models Summer School. Erlbaum, Hillsdale, NJ.\n\nJacobs, R., Jordan, M., Nowlan, S., and Hinton, G. (1991). Adaptive mixtures of local experts. Neural Computation, 3:79-87.\n\nJordan, M. and Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214.\n\nLittle, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York.\n\nMcLachlan, G. and Basford, K. (1988). Mixture models: Inference and applications to clustering. Marcel Dekker.\n\nNowlan, S. J. (1991). Soft Competitive Adaptation: Neural Network Learning Algorithms based on Fitting Statistical Mixtures. CMU-CS-91-126, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.\n\nSpecht, D. F. (1991). A general regression neural network. IEEE Trans. Neural Networks, 2(6):568-576.\n\n
Tresp, V., Hollatz, J., and Ahmad, S. (1993). Network structuring and training using rule-based knowledge. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA.\n", "award": [], "sourceid": 767, "authors": [{"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}