{"title": "Maximum Likelihood Blind Source Separation: A Context-Sensitive Generalization of ICA", "book": "Advances in Neural Information Processing Systems", "page_first": 613, "page_last": 619, "abstract": null, "full_text": "Maximum Likelihood Blind Source Separation: A Context-Sensitive Generalization of ICA\n\nBarak A. Pearlmutter\nComputer Science Dept, FEC 313\nUniversity of New Mexico\nAlbuquerque, NM 87131\nbap@cs.unm.edu\n\nLucas C. Parra\nSiemens Corporate Research\n755 College Road East\nPrinceton, NJ 08540-6632\nlucas@scr.siemens.com\n\nAbstract\n\nIn the square linear blind source separation problem, one must find a linear unmixing operator which can detangle the result $x_i(t)$ of mixing $n$ unknown independent sources $s_i(t)$ through an unknown $n \\times n$ mixing matrix $A(t)$ of causal linear filters: $x_i = \\sum_j a_{ij} * s_j$. We cast the problem as one of maximum likelihood density estimation, and in that framework introduce an algorithm that searches for independent components using both temporal and spatial cues. We call the resulting algorithm \"Contextual ICA,\" after the (Bell and Sejnowski 1995) Infomax algorithm, which we show to be a special case of cICA. Because cICA can make use of the temporal structure of its input, it is able to separate in a number of situations where standard methods cannot, including sources with low kurtosis, colored Gaussian sources, and sources which have Gaussian histograms.\n\n1  The Blind Source Separation Problem\n\nConsider a set of $n$ independent sources $s_1(t), \\ldots, s_n(t)$. We are given $n$ linearly distorted sensor readings which combine these sources, $x_i = \\sum_j a_{ij} s_j$, where $a_{ij}$ is a filter between source $j$ and sensor $i$, as shown in figure 1a. 
This can be expressed as\n\n$$x_i(t) = \\sum_j \\sum_{\\tau=0}^{\\infty} a_{ij}(\\tau) s_j(t - \\tau) = \\sum_j a_{ij} * s_j$$\n\nor, in matrix notation, $x(t) = \\sum_{\\tau=0}^{\\infty} A(\\tau) s(t - \\tau) = A * s$. The square linear blind source separation problem is to recover $s$ from $x$. There is an inherent ambiguity in this, for if we define a new set of sources $s'$ by $s'_i = b_i * s_i$ where $b_i(\\tau)$ is some invertible filter, then the various $s'_i$ are independent, and constitute just as good a solution to the problem as the true $s_i$, since $x_i = \\sum_j (a_{ij} * b_j^{-1}) * s'_j$. Similarly the sources could be arbitrarily permuted.\n\nFigure 1: The left diagram (a) shows a generative model of data production for the blind source separation problem. The cICA algorithm fits the reparametrized generative model on the right (b) to the data. Since (unless the mixing process is singular) both diagrams give linear maps between the sources and the sensors, they are mathematically equivalent. However, (a) makes the transformation from $s$ to $x$ explicit, while (b) makes the transformation from $x$ to $y$, the estimated sources, explicit.\n\nSurprisingly, up to permutation of the sources and linear filtering of the individual sources, the problem is well posed, assuming that the sources $s_j$ are not Gaussian. The reason for this is that only with a correct separation are the recovered sources truly statistically independent, and this fact serves as a sufficient constraint. Under the assumptions we have made,¹ and further assuming that the linear transformation $A$ is invertible, we will speak of recovering $y_i(t) = \\sum_j w_{ij} * x_j$ where these $y_i$ are a filtered and permuted version of the original unknown $s_i$. 
For clarity of exposition, we will often refer to \"the\" solution and refer to the $y_i$ as \"the\" recovered sources, rather than referring to a point in the manifold of solutions and a set of consistent recovered sources.\n\n2  Maximum likelihood density estimation\n\nFollowing Pham, Garrat, and Jutten (1992) and Belouchrani and Cardoso (1995), we cast the BSS problem as one of maximum likelihood density estimation. In the MLE framework, one begins with a probabilistic model of the data production process. This probabilistic model is parametrized by a vector of modifiable parameters $w$, and it therefore assigns a $w$-dependent probability density $p(x_0, x_1, \\ldots; w)$ to each possible dataset $x_0, x_1, \\ldots$. The task is then to find a $w$ which maximizes this probability.\n\nThere are a number of approaches to performing this maximization. Here we apply the stochastic gradient method, in which a single stochastic sample $x$ is chosen from the dataset and $-d \\log p(x; w)/dw$ is used as a stochastic estimate of the gradient of the negative log likelihood $\\sum_t -d \\log p(x(t); w)/dw$.\n\n¹ Without these assumptions, for instance in the presence of noise, even a linear mixing process leads to an optimal unmixing process that is highly nonlinear.\n\n2.1  The likelihood of the data\n\nThe model of data production we consider is shown in figure 1a. In that model, the sensor readings $x$ are an explicit linear function of the underlying sources $s$.\n\nIn this model of the data production, there are two stages. In the first stage, the sources independently produce signals. These signals are time-dependent, and the probability density of source $j$ producing value $s_j(t)$ at time $t$ is $f_j(s_j(t) | s_j(t-1), s_j(t-2), \\ldots)$. 
Although this source model could be of almost any differentiable form, we used a generalized autoregressive model described in appendix A. For expository purposes, we can consider using a simple AR model, so we model $s_j(t) = b_j(1) s_j(t-1) + b_j(2) s_j(t-2) + \\cdots + b_j(T) s_j(t-T) + r_j$, where $r_j$ is an iid random variable, perhaps with a complicated density.\n\nIt is important to distinguish two different, although related, linear filters. When the source models are simple AR models, there are two types of linear convolutions being performed. The first is in the way each source produces its signal: as a linear function of its recent history plus a white driving term, which could be expressed as a moving average model, a convolution with a white driving term, $s_j = b_j * r_j$. The second is in the way the sources are mixed: linear functions of the output of each source are added, $x_i = \\sum_j a_{ij} * s_j = \\sum_j (a_{ij} * b_j) * r_j$. Thus, with AR sources, the source convolution could be folded into the convolutions of the linear mixing process.\n\nIf we were to estimate values for the free parameters of this model, i.e. to estimate the filters, then the task of recovering the estimated sources from the sensor output would require inverting the linear map $A = (a_{ij})$, as well as some technique to guarantee its non-singularity. Such a model is shown in figure 1a. Instead, we parameterize the model by $W = A^{-1}$, an estimated unmixing matrix, as shown in figure 1b. In this indirect representation, $s$ is an explicit linear function of $x$, and therefore $x$ is only an implicit linear function of $s$. This parameterization of the model is equally convenient for assigning probabilities to samples $x$, and is therefore suitable for MLE. 
Its advantage is that because the transformation from sensors to sources is estimated explicitly, the sources can be recovered directly from the data and the estimated model, without inversion. Note that in this inverse parameterization, the estimated mixture process is stored in inverse form. The source-specific models $f_i$ are kept in forward form. Each source-specific model $i$ has a vector of parameters, which we denote $w^{(i)}$.\n\nWe are now in a position to calculate the likelihood of the data. For simplicity we use a matrix $W$ of real numbers rather than FIR filters. Generalizing this derivation to a matrix of filters is straightforward, following the same techniques used by Lambert (1996), Torkkola (1996), A. Bell (1997), but space precludes a derivation here.\n\nThe individual generative source models give\n\n$$p(y(t) | y(t-1), y(t-2), \\ldots) = \\prod_i f_i(y_i(t) | y_i(t-1), y_i(t-2), \\ldots) \\qquad (1)$$\n\nwhere the probability densities $f_i$ are each parameterized by vectors $w^{(i)}$. Using these equations, we would like to express the likelihood of $x(t)$ in closed form, given the history $x(t-1), x(t-2), \\ldots$. Since the history is known, we therefore also know the history of the recovered sources, $y(t-1), y(t-2), \\ldots$. This means that we can calculate the density $p(y(t) | y(t-1), \\ldots)$. Using this, we can express the density of $x(t)$ and expand $G = \\log p(x; w) = \\log |W| + \\sum_j \\log f_j(y_j(t) | y_j(t-1), y_j(t-2), \\ldots; w^{(j)})$. There are two sorts of parameters which we must take the derivative with respect to: the matrix $W$ and the source parameters $w^{(j)}$. The source parameters do not influence our recovered sources, and their derivatives therefore have a simple form\n\n$$\\frac{dG}{dw_j} = \\frac{df_j(y_j; w_j)/dw_j}{f_j(y_j; w_j)}$$\n\nHowever, a change to the matrix $W$ changes $y$, which introduces a few extra terms. 
\nNote that $d \\log |W| / dW = W^{-T}$, the transpose inverse. Next, since $y = Wx$, we see that $dy_j/dW = (0 | x | 0)^T$, a matrix of zeros except for the vector $x$ in row $j$. Now we note that the $df_j(\\cdot)/dW$ term has two logical components: the first from the effect of changing $W$ upon $y_j(t)$, and the second from the effect of changing $W$ upon $y_j(t-1), y_j(t-2), \\ldots$. (This second is called the \"recurrent term\", and such terms are frequently dropped for convenience. As shown in figure 3, dropping this term here is not a reasonable approximation.)\n\n$$\\frac{df_j(y_j(t) | y_j(t-1), \\ldots; w_j)}{dW} = \\frac{\\partial f_j}{\\partial y_j(t)} \\frac{dy_j(t)}{dW} + \\sum_{\\tau} \\frac{\\partial f_j}{\\partial y_j(t-\\tau)} \\frac{dy_j(t-\\tau)}{dW}$$\n\nNote that the expression $dy_j(t-\\tau)/dW$ is the only matrix, and it is zero except for the $j$th row, which is $x(t-\\tau)$. The expression $\\partial f_j/\\partial y_j(t)$ we shall denote $f'_j(\\cdot)$, and the expression $\\partial f_j/\\partial y_j(t-\\tau)$ we shall denote $f_j^{(\\tau)}(\\cdot)$. We then have\n\n$$-\\frac{dG}{dW} = -W^{-T} - \\left( \\frac{f'_j(\\cdot)}{f_j(\\cdot)} \\right)_j x(t)^T - \\sum_{\\tau \\ge 1} \\left( \\frac{f_j^{(\\tau)}(\\cdot)}{f_j(\\cdot)} \\right)_j x(t-\\tau)^T \\qquad (2)$$\n\nwhere $(\\mathrm{expr}(j))_j$ denotes the column vector whose elements are $\\mathrm{expr}(1), \\ldots, \\mathrm{expr}(n)$.\n\n2.2  The natural gradient\n\nFollowing Amari, Cichocki, and Yang (1996), we follow a pseudogradient. Instead of using equation 2, we post-multiply this quantity by $W^T W$. Since this is a positive-definite matrix, it does not affect the stochastic gradient convergence criteria, and the resulting quantity simplifies in a fashion that neatly eliminates the costly matrix inversion otherwise required. Convergence is also accelerated.\n\n3  Experiments\n\nWe conducted a number of experiments to test the efficacy of the cICA algorithm. The first, shown in figure 2, was a toy problem involving a set of processes deliberately constructed to be difficult for conventional source separation algorithms. 
\nIn the second experiment, shown in figure 3, ten real sources were digitally mixed with an instantaneous matrix and separation performance was measured as a function of varying model complexity parameters. These sources are available for benchmarking purposes at http://www.cs.unm.edu/~bap/demos.html.\n\nFigure 2: cICA using a history of one time step and a mixture of five logistic densities for each source was applied to 5,000 samples of a mixture of two one-dimensional uniform distributions, each filtered by convolution with a decaying exponential with a time constant of 99.5. Shown is a scatterplot of the data input to the algorithm, along with the true source axes (left), the estimated residual probability density (center), and a scatterplot of the residuals of the data transformed into the estimated source space coordinates (right). The product of the true mixing matrix and the estimated unmixing matrix deviates from a scaling and permutation matrix by about 3%.\n\n[Figure 3 panels: (a) Truncated Gradient and (b) Full Gradient, each plotted against the number of AR filter taps; (c) Noise Model, plotted against the number of logistics.]\n\nFigure 3: The performance of cICA as a function of model complexity and gradient accuracy. In all simulations, ten five-second clips taken digitally from ten audio CDs were digitally mixed through a random ten-by-ten instantaneous mixing matrix. The signal to noise ratio of each original source as expressed in the recovered sources is plotted. In (a) and (b), AR source models with a logistic noise term were used, and the number of taps of the AR model was varied. (This reduces to Bell-Sejnowski infomax when the number of taps is zero.) 
In (a), the recurrent term of the gradient was left out, while in (b) the recurrent term was included. Clearly the recurrent term is important. In (c), a degenerate AR model with zero taps was used, but the noise term was a mixture of logistics, and the number of logistics was varied.\n\n4  Discussion\n\nThe Infomax algorithm (Baram and Roth 1994) used for source separation (Bell and Sejnowski 1995) is a special case of the above algorithm in which (a) the mixing is not convolutional, so $W(1) = W(2) = \\cdots = 0$, and (b) the sources are assumed to be iid, and therefore the distributions $f_i(y(t))$ are not history sensitive. Further, the form of the $f_i$ is restricted to a very special distribution: the logistic density, the derivative of the sigmoidal function $1/(1 + \\exp(-\\xi))$. Although ICA has enjoyed a variety of applications (Makeig et al. 1996; Bell and Sejnowski 1996b; Baram and Roth 1995; Bell and Sejnowski 1996a), there are a number of sources which it cannot separate. These include all sources with Gaussian histograms (e.g. colored Gaussian sources, or even speech run through the right sort of slight nonlinearity), and sources with low kurtosis. As shown in the experiments above, these are of more than theoretical interest.\n\nIf we simplify our model to use ordinary AR models for the sources, with Gaussian noise terms of fixed variance, it is possible to derive a closed-form expression for $W$ (Hagai Attias, personal communication). It may be that for many sources of practical interest, trading away this model accuracy for speed will be fruitful. 
\n\n4.1  Weakened assumptions\n\nIt seems clear that, in general, separating when there are fewer microphones than sources requires a strong bayesian prior, and even given perfect knowledge of the mixture process and perfect source models, inverting the mixing process will be computationally burdensome. However, when there are more microphones than sources, there is an opportunity to improve the performance of the system in the presence of noise. This seems straightforward to integrate into our framework. Similarly, fast-timescale microphone nonlinearities are easily incorporated into this maximum likelihood approach.\n\nThe structure of this problem would seem to lend itself to EM. Certainly the individual source models can be easily optimized using EM, assuming that they themselves are of suitable form.\n\nReferences\n\nA. Bell, T.-W. L. (1997). Blind separation of delayed and convolved sources. In Advances in Neural Information Processing Systems 9. MIT Press. In this volume.\n\nAmari, S., Cichocki, A., and Yang, H. H. (1996). A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems 8. MIT Press.\n\nBaram, Y. and Roth, Z. (1994). Density Shaping by Neural Networks with Application to Classification, Estimation and Forecasting. Tech. rep. CIS-94-20, Center for Intelligent Systems, Technion, Israel Institute for Technology, Haifa.\n\nBaram, Y. and Roth, Z. (1995). Forecasting by Density Shaping Using Neural Networks. In Computational Intelligence for Financial Engineering, New York City. IEEE Press.\n\nBell, A. J. and Sejnowski, T. J. (1995). An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation, 7(6), 1129-1159.\n\nBell, A. J. and Sejnowski, T. J. 
(1996a). The Independent Components of Natural Scenes. Vision Research. Submitted.\n\nBell, A. J. and Sejnowski, T. J. (1996b). Learning the higher-order structure of a natural sound. Network: Computation in Neural Systems. In press.\n\nBelouchrani, A. and Cardoso, J.-F. (1995). Maximum likelihood source separation by the expectation-maximization technique: Deterministic and stochastic implementation. In Proceedings of 1995 International Symposium on Non-Linear Theory and Applications, pp. 49-53, Las Vegas, NV. In press.\n\nLambert, R. H. (1996). Multichannel Blind Deconvolution: FIR Matrix Algebra and Separation of Multipath Mixtures. Ph.D. thesis, USC.\n\nMakeig, S., Anllo-Vento, L., Jung, T.-P., Bell, A. J., Sejnowski, T. J., and Hillyard, S. A. (1996). Independent component analysis of event-related potentials during selective attention. Society for Neuroscience Abstracts, 22.\n\nPearlmutter, B. A. and Parra, L. C. (1996). A Context-Sensitive Generalization of ICA. In International Conference on Neural Information Processing, Hong Kong. Springer-Verlag. URL ftp://ftp.cnl.salk.edu/pub/bap/iconip-96-cica.ps.gz.\n\nPham, D., Garrat, P., and Jutten, C. (1992). Separation of a mixture of independent sources through a maximum likelihood approach. In European Signal Processing Conference, pp. 771-774.\n\nTorkkola, K. (1996). Blind separation of convolved sources based on information maximization. In Neural Networks for Signal Processing VI, Kyoto, Japan. IEEE Press. In press.\n\nA  Fixed mixture AR models\n\nThe $f_j(u_j; w_j)$ we used were mixture AR processes driven by logistic noise terms, as in Pearlmutter and Parra (1996). Each source model was\n\n$$f_j(u_j(t) | u_j(t-1), u_j(t-2), \\ldots; w_j) = \\sum_k m_{jk} \\, h((u_j(t) - \\bar{u}_{jk})/\\sigma_{jk})/\\sigma_{jk} \\qquad (3)$$\n\nwhere $\\sigma_{jk}$ is a scale parameter for logistic density $k$ of source $j$ and is an element of $w_j$, and the mixing coefficients $m_{jk}$ are elements of $w_j$ and are constrained by $\\sum_k m_{jk} = 1$. The component means $\\bar{u}_{jk}$ are taken to be linear functions of the recent values of that source,\n\n$$\\bar{u}_{jk} = \\sum_{\\tau \\ge 1} a_{jk}(\\tau) u_j(t - \\tau) + b_{jk} \\qquad (4)$$\n\nwhere the linear prediction coefficients $a_{jk}(\\tau)$ and bias $b_{jk}$ are elements of $w_j$. The derivatives of these are straightforward; see Pearlmutter and Parra (1996) for details. One complication is to note that, after each weight update, the mixing coefficients must be normalized, $m_{jk} \\leftarrow m_{jk} / \\sum_{k'} m_{jk'}$.", "award": [], "sourceid": 1179, "authors": [{"given_name": "Barak", "family_name": "Pearlmutter", "institution": null}, {"given_name": "Lucas", "family_name": "Parra", "institution": null}]}