{"title": "Unsupervised Learning of Mixtures of Multiple Causes in Binary Data", "book": "Advances in Neural Information Processing Systems", "page_first": 27, "page_last": 34, "abstract": null, "full_text": "Unsupervised Learning of Mixtures of \n\nMultiple  Causes  in Binary Data \n\nEric Saund \n\nXerox  Palo Alto Research  Center \n\n3333  Coyote  Hill  Rd.,  Palo Alto,  CA, 94304 \n\nAbstract \n\nThis paper presents a formulation for unsupervised learning of clus(cid:173)\nters  reflecting  multiple causal structure  in  binary data.  Unlike the \nstandard  mixture  model,  a  multiple cause  model  accounts  for  ob(cid:173)\nserved data by combining assertions from many hidden causes, each \nof which can pertain to varying degree  to any subset of the observ(cid:173)\nable dimensions.  A crucial  issue is  the  mixing-function for  combin(cid:173)\ning  beliefs  from  different  cluster-centers  in  order  to generate  data \nreconstructions whose errors are minimized both during recognition \nand  learning.  We  demonstrate a  weakness  inherent  to  the popular \nweighted  sum followed  by sigmoid squashing,  and  offer  an alterna(cid:173)\ntive form  of the  nonlinearity.  Results are  presented  demonstrating \nthe  algorithm's  ability  successfully  to  discover  coherent  multiple \ncausal  representat.ions  of noisy  test  data and  in  images of printed \ncharacters. \n\n1 \n\nIntroduction \n\nThe objective of unsupervised  learning is  to  identify patterns or features  reflecting \nunderlying regularities  in  data.  Single-cause  techniques,  including  the  k-means  al(cid:173)\ngorithm and the standard mixture-model (Duda and Hart,  1973), represent  clusters \nof data points sharing similar patterns of Is and Os  under the assumption that each \ndata point belongs to, or was  generated  by, one and only one cluster-center;  output \nactivity is constrained to sum to 1.  In contrast, a  multiple-cause model permits more \nthan  one  cluster-center  to  become  fully  active  in  accounting for  an  observed  data \nvector.  The advantage of a  multiple cause  model is  that a  relatively  small number \n\n27 \n\n\f28 \n\nSaund \n\nof hidden variables can be applied combinatorially to generate a large data set.  Fig(cid:173)\nure  1 illustrates with a  test set of nine  121-dimensional data vectors.  This  data set \nreflects  two  independent  processes,  one  of which  controls  the  position of the  black \nsquare  on  the  left  hand  side,  the other  controlling the  right.  While  a  single  cause \nmodel  requires  nine  cluster-centers  to account for  this data, a  perspicuous  multiple \ncause  formulation requires  only six hidden  units as  shown  in  figure  4b.  Grey levels \nindicate  dimensions  for  which  a  cluster-center  adopts  a  \"don't-know /don't-care\" \nassertion . \n\n\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022 \n\nFigure 1:  Nine  121-dimensional test data samples exhibiting multiple cause \nstructure.  Independent processes  control the position of the black rectangle \non the left  and  right hand  sides. \n\nWhile principal components analysis and its neural-network variants (Bourlard and \nKamp, 1988;  Sanger,  1989)  as  well  as the Harmonium Boltzmann Machine (Freund \nand  Haussler,  1992)  are  inherently  multiple  cause  models,  the  hidden  represen(cid:173)\ntations  they  arrive  at  are  for  many  purposes  intuitively  unsatisfactory.  
While principal components analysis and its neural-network variants (Bourlard and Kamp, 1988; Sanger, 1989), as well as the Harmonium Boltzmann Machine (Freund and Haussler, 1992), are inherently multiple cause models, the hidden representations they arrive at are for many purposes intuitively unsatisfactory. Figure 2 illustrates the principal components representation for the test data set presented in figure 1. Principal components is able to reconstruct the data without error using only four hidden units (plus a fixed centroid), but these vectors obscure the compositional structure of the data in that they reveal nothing about the statistical independence of the left and right hand processes. Similar results obtain for multiple cause unsupervised learning using a Harmonium network, and for a feedforward network using the sigmoid nonlinearity. We seek instead a multiple cause formulation which will deliver coherent representations exploiting "don't-know/don't-care" weights to make explicit the statistical dependencies and independencies present when clusters occur in lower-dimensional subspaces of the full $J$-dimensional data space.

Data domains differ in the ways their underlying causal processes interact. The present discussion focuses on data obeying a WRITE-WHITE-AND-BLACK model, under which hidden causes are responsible both for turning "on" and for turning "off" the observed variables.

Figure 2: Principal components representation for the test data from figure 1. (a) Centroid (white: -1, black: 1). (b) Four component vectors sufficient to encode the nine data points (lighter shadings: $c_{j,k} < 0$; grey: $c_{j,k} = 0$; darker shadings: $c_{j,k} > 0$).

2 Mixing Functions

A large class of unsupervised learning models share the architecture shown in figure 3. A binary vector $D_i = (d_{i,1}, d_{i,2}, \ldots, d_{i,j}, \ldots, d_{i,J})$ is presented at the data layer, and a measurement, or response, vector $m_i = (m_{i,1}, m_{i,2}, \ldots, m_{i,k}, \ldots, m_{i,K})$ is computed at the encoding layer using "weights" $c_{j,k}$ associating activity at data dimension $j$ with activity at hidden cluster-center $k$. Any activity pattern at the encoding layer can be turned around to compute a prediction vector $r_i = (r_{i,1}, r_{i,2}, \ldots, r_{i,j}, \ldots, r_{i,J})$ at the data layer. Different models employ different functions for performing the measurement and prediction mappings, and give different interpretations to the weights. Common to most models is a learning procedure which attempts to optimize an objective function on errors between the data vectors in a training set and the predictions of those data vectors under their respective responses at the encoding layer.

Figure 3: Architecture underlying a large class of unsupervised learning models: observed data $d_j$ at the data layer, cluster-centers at the encoding layer, and measurement and prediction mappings running between the two.

The key issue is the mixing function, which specifies how sometimes conflicting predictions from individual hidden units combine to predict values on the data dimensions. Most neural-network formulations, including principal components variants and the Boltzmann Machine, employ a linearly weighted sum of hidden unit activity followed by a squashing, bump, or other nonlinearity. This form of mixing function permits an error in prediction by one cluster-center to be cancelled out by correct predictions from others, without consequence in terms of error in the net prediction. As a result, there is little global pressure for cluster-centers to adopt don't-know values when they are not quite confident in their predictions.

Instead, a multiple cause formulation delivering coherent cluster-centers requires a form of nonlinearity in which active disagreement must result in a net "uncertain" or neutral prediction, which in turn results in nonzero error.
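The cancellation problem is easy to exhibit numerically. In the hypothetical two-unit example below (the weights are illustrative, not from the paper), weighted-sum-plus-sigmoid mixing lets one unit's confidently wrong vote be silently outvoted, so the net prediction, and hence the training error, exerts no pressure on that unit to retreat to a don't-know weight:

```python
import numpy as np

def sigmoid_mix(m, C):
    """Conventional mixing: linearly weighted sum squashed into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(m @ C)))

m = np.array([1.0, 1.0])            # two fully active cluster-centers
C = np.array([[ 8.0],               # unit 0 votes strongly "on" (correct)
              [-4.0]])              # unit 1 votes strongly "off" (wrong)
print(sigmoid_mix(m, C))            # ~[0.982]: near-perfect prediction, so
                                    # unit 1's wrong assertion is invisible
                                    # to the objective function
```

Under the criterion advocated here, such head-on disagreement should instead yield a neutral prediction that registers as error, driving the wrong unit toward $c_{j,k} = 0$.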
3 Multiple Cause Mixture Model

Our formulation employs a zero-based representation at the data layer to simplify the mathematical expression for a suitable mixing function. Data values are either 1 or -1; the sign of a weight $c_{j,k}$ indicates whether activity in cluster-center $k$ predicts a 1 or a -1 at data dimension $j$, and its magnitude ($|c_{j,k}| \le 1$) indicates strength of belief; $c_{j,k} = 0$ corresponds to "don't-know/don't-care" (grey in figure 4b).

The mixing function takes the form

$$
r_{i,j} \;=\;
\frac{\displaystyle\sum_{k:\,c_{j,k}<0} m_{i,k}(-c_{j,k})
      \Bigl[\prod_{k:\,c_{j,k}<0} \bigl(1 + m_{i,k} c_{j,k}\bigr) - 1\Bigr]
      \;+\;
      \sum_{k:\,c_{j,k}>0} m_{i,k} c_{j,k}
      \Bigl[1 - \prod_{k:\,c_{j,k}>0} \bigl(1 - m_{i,k} c_{j,k}\bigr)\Bigr]}
     {\displaystyle\sum_{k:\,c_{j,k}<0} m_{i,k}(-c_{j,k})
      \;+\;
      \sum_{k:\,c_{j,k}>0} m_{i,k} c_{j,k}}.
$$

Each bracketed factor pools, in noisy-OR fashion, the evidence pushing toward $-1$ and toward $+1$ respectively, and the two are averaged with weights given by the summed strength of belief on each side, so that equally strong opposing assertions mix to a neutral $r_{i,j} = 0$. This formula is a computationally tractable approximation to an idealized mixing function created by linearly interpolating boundary values on the extremes of $m_{i,k} \in \{0, 1\}$ and $c_{j,k} \in \{-1, 0, 1\}$, rationally designed to meet the criteria outlined above.

Both learning and measurement operate in the context of an objective function on predictions equivalent to log-likelihood. The weights $c_{j,k}$ are found through gradient ascent in this objective function, and at each training step the encoding $m_i$ of an observed data vector is likewise found by gradient ascent.
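The following sketch is one way to transcribe the mixing function and the gradient-ascent encoding step into code. It is a reading of the formula above rather than the author's implementation; the probability model $p(d_{i,j}) = (1 + d_{i,j} r_{i,j})/2$ behind the log-likelihood, the finite-difference gradients, and all step sizes and iteration counts are assumptions made for illustration.

```python
import numpy as np

def mix(m, C, eps=1e-8):
    """Mixing function: combine beliefs C (J x K, entries in [-1, 1]) under
    activities m (length K, entries in [0, 1]) into predictions in [-1, 1]."""
    pos = np.where(C > 0, C, 0.0)        # positive ("predict +1") beliefs
    neg = np.where(C < 0, -C, 0.0)       # magnitudes of negative beliefs
    s_pos, s_neg = pos @ m, neg @ m      # summed evidence per data dimension
    or_pos = 1.0 - np.prod(1.0 - pos * m, axis=1)   # pooled push toward +1
    or_neg = 1.0 - np.prod(1.0 - neg * m, axis=1)   # pooled push toward -1
    return (s_pos * or_pos - s_neg * or_neg) / (s_pos + s_neg + eps)

def log_likelihood(d, r, eps=1e-8):
    """Log-likelihood of d in {-1, +1} given predictions r in [-1, 1],
    reading (1 + d*r)/2 as the probability of each observed value
    (an assumption consistent with the paper's +/-1 coding)."""
    return np.sum(np.log(np.clip((1.0 + d * r) / 2.0, eps, 1.0)))

def encode(d, C, steps=100, lr=0.05, h=1e-4):
    """Measurement by gradient ascent: find activities m maximizing the
    log-likelihood of one data vector d under fixed weights C."""
    m = np.full(C.shape[1], 0.5)
    for _ in range(steps):
        grad = np.zeros_like(m)
        for k in range(m.size):          # central finite differences
            up, dn = m.copy(), m.copy()
            up[k] += h
            dn[k] -= h
            grad[k] = (log_likelihood(d, mix(up, C)) -
                       log_likelihood(d, mix(dn, C))) / (2 * h)
        m = np.clip(m + lr * grad, 0.0, 1.0)
    return m
```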
4 Experimental Results

Figure 4 shows that the model converges to the coherent multiple cause representation for the test data of figure 1, starting from random initial weights. The model is robust with respect to noisy training data, as indicated in figure 5.

Figure 4: Multiple Cause Mixture Model representation for the test data from figure 1. (a) Initial random cluster-centers. (b) Cluster-centers after seven training iterations (white: $c_{j,k} = -1$; grey: $c_{j,k} = 0$; black: $c_{j,k} = 1$).

Figure 5: Multiple Cause Mixture Model results for noisy training data. (a) Five test data sample suites with 10% bit-flip noise; twenty suites were used to train from random initial cluster-centers, resulting in the representation shown in (b). (c) Left: five test data samples $d_i$; middle: numerical activities $m_{i,k}$ for the most active cluster-centers (the corresponding cluster-center is displayed above each $m_{i,k}$ value); right: reconstructions (predictions) $r_i$ based on the activities. Note how these "clean up" the noisy samples from which they were computed.

In figure 6 the model was trained on data consisting of 21 x 21 pixel images of registered lower-case characters. Results for $K = 14$ are shown, indicating that the model has discovered statistical regularities associated with ascenders, descenders, circles, etc.

Figure 6: (a) Training set of twenty-six 441-dimensional binary vectors. (b) Multiple Cause Mixture Model representation at $K = 14$. (c) Left: five test data samples $d_i$; middle: numerical activities $m_{i,k}$ for the most active cluster-centers (the corresponding cluster-center is displayed above each $m_{i,k}$ value); right: reconstructions (predictions) $r_i$ based on the activities.
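For reference, a complete training loop in the spirit of these experiments alternates the two ascents of section 3: re-encode every sample under the current weights, then take a gradient step on the weights. The brute-force sketch below reuses make_test_data, mix, log_likelihood, and encode from the earlier listings; the learning rate, initialization scale, alternation schedule, and epoch count are illustrative assumptions rather than the paper's settings.

```python
def train(D, K, epochs=7, lr=0.1, h=1e-4, seed=0):
    """Fit cluster-center weights C (J x K) to binary data D (N x J) by
    alternating encoding with finite-difference gradient ascent on C."""
    rng = np.random.default_rng(seed)
    C = rng.uniform(-0.1, 0.1, (D.shape[1], K))    # small random weights
    for _ in range(epochs):
        M = [encode(d, C) for d in D]              # re-encode all samples
        grad = np.zeros_like(C)
        for (j, k), _ in np.ndenumerate(C):        # central differences
            C[j, k] += h
            up = sum(log_likelihood(d, mix(m, C)) for d, m in zip(D, M))
            C[j, k] -= 2 * h
            dn = sum(log_likelihood(d, mix(m, C)) for d, m in zip(D, M))
            C[j, k] += h                           # restore the weight
            grad[j, k] = (up - dn) / (2 * h)
        C = np.clip(C + lr * grad, -1.0, 1.0)
    return C

# e.g. C = train(make_test_data(), K=6): the aim, in the spirit of figure 4b,
# is six columns each showing one left or right square, with grey elsewhere.
```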
5 Conclusion

The ability to compress data, and statistical independence of response activities (Barlow, 1989), are not the only criteria by which to judge the success of an encoder network paradigm for unsupervised learning. For many purposes it is equally important that hidden units make explicit the statistically salient structure arising from causally distinct processes.

The difficulty lies in getting the internal knowledge-bearing entities sensibly to divvy up responsibility for training data not just pointwise, but dimensionwise. Mixing functions based on a linear weighted sum of activities (possibly followed by a nonlinearity) fail to achieve this because they fail to pressure the hidden units into giving up responsibility (adopting "don't-know" values) for data dimensions on which they are prone to be incorrect. We have outlined criteria, and offered a specific functional form, for nonlinearly combining beliefs in a predictive mixing function such that statistically coherent hidden representations of multiple causal structure can indeed be discovered in binary data.

References

Barlow, H. (1989). "Unsupervised Learning." Neural Computation, 1, 295-311.

Bourlard, H., and Kamp, Y. (1988). "Auto-Association by Multilayer Perceptrons and Singular Value Decomposition." Biological Cybernetics, 59(4-5), 291-294.

Duda, R., and Hart, P. (1973). Pattern Classification and Scene Analysis. Wiley, New York.

Foldiak, P. (1990). "Forming sparse representations by local anti-Hebbian learning." Biological Cybernetics, 64(2), 165-170.

Freund, Y., and Haussler, D. (1992). "Unsupervised learning of distributions on binary vectors using two-layer networks." In Moody, J., Hanson, S., and Lippmann, R., eds., Advances in Neural Information Processing Systems 4, Morgan Kaufmann, San Mateo, 912-919.

Nowlan, S. (1990). "Maximum Likelihood Competitive Learning." In Touretzky, D., ed., Advances in Neural Information Processing Systems 2, Morgan Kaufmann, San Mateo, 574-582.

Sanger, T. (1989). "An Optimality Principle for Unsupervised Learning." In Touretzky, D., ed., Advances in Neural Information Processing Systems 1, Morgan Kaufmann, San Mateo, 11-19.