{"title": "Independent Factor Analysis with Temporally Structured Sources", "book": "Advances in Neural Information Processing Systems", "page_first": 386, "page_last": 392, "abstract": null, "full_text": "Independent  Factor  Analysis with \n\nTemporally  Structured  Sources \n\nHagai Attias \n\nhagai@gatsby.ucl.ac.uk \n\nGatsby Unit,  University College London \n\n17  Queen Square \n\nLondon WCIN 3AR,  U.K. \n\nAbstract \n\nWe  present  a  new  technique  for  time  series  analysis  based on  dy(cid:173)\nnamic probabilistic networks.  In this approach,  the observed data \nare modeled in terms of unobserved, mutually independent factors, \nas in the recently introduced technique of Independent Factor Anal(cid:173)\nysis  (IFA).  However,  unlike  in  IFA,  the  factors  are not  Li.d.;  each \nfactor has its own temporal statistical characteristics.  We  derive a \nfamily of EM  algorithms that learn the structure of the underlying \nfactors  and  their  relation  to  the  data.  These  algorithms  perform \nsource separation and noise reduction in an integrated manner, and \ndemonstrate superior performance compared to IFA. \n\n1 \n\nIntroduction \n\nThe  technique  of independent  factor  analysis  (IFA)  introduced  in  [1]  provides  a \ntool  for  modeling  L'-dim  data  in  terms  of  L  unobserved  factors.  These  factors \nare  mutually  independent  and  combine  linearly  with  added  noise  to  produce  the \nobserved data.  Mathematically, the model is  defined  by \n\nYt  =  HXt + Ut, \n\n(1) \n\nwhere  Xt  is  the vector of factor activities at time t,  Yt  is  the data vector,  H  is  the \nL' x  L  mixing matrix, and Ut  is  the noise. \nThe origins of IFA  lie in applied statistics on the one hand and in signal processing \non the other hand.  Its statistics ancestor is  ordinary factor analysis (FA),  which as(cid:173)\nsumes Gaussian factors.  In contrast, IFA allows each factor to have its own arbitrary \ndistribution, modeled semi-parametrically by a  I-dim mixture of Gaussians (MOG). \nThe  MOG  parameters,  as well  as  the mixing matrix and noise  covariance  matrix, \nare learned from the observed data by an expectation-maximization (EM) algorithm \nderived in  [1].  The signal processing ancestor of IFA is  the independent component \nanalysis  (ICA)  method for  blind source separation [2]-[6].  In ICA,  the factors  are \ntermed sources,  and the task of blind source separation is to recover them from  the \nobserved data with  no knowledge of the  mixing process.  The sources  in  ICA  have \nnon-Gaussian  distributions,  but unlike  in  IFA  these  distributions are usually  fixed \nby  prior knowledge or have quite limited adaptability.  More significant restrictions \n\n\fDynamic Independent Factor Analysis \n\n387 \n\nare  that  their number is  set  to the  data dimensionality,  i.e.  L  =  L'  ('square mix(cid:173)\ning'),  the mixing matrix is  assumed invertible, and the data are assumed noise-free \n(Ut  =  0).  In  contrast,  IFA  allows  any  L, L'  (including more  sources  than sensors, \nL  > L'), as well  as non-zero noise with  unknown covariance.  In addition, its use of \nthe flexible  MOG  model often proves crucial for achieving successful separation [1]. \n\nTherefore,  IFA  generalizes  and  unifies  FA  and  ICA.  Once  the  model  has  been \nlearned, it can be used for  classification  (fitting an IFA  model for  each class), com(cid:173)\npleting  missing  data,  and  so  on.  
Therefore, IFA generalizes and unifies FA and ICA. Once the model has been learned, it can be used for classification (fitting an IFA model for each class), completing missing data, and so on. In the context of blind separation, an optimal reconstruction of the sources $x_t$ from data is obtained [1] using a MAP estimator.

However, IFA and its ancestors suffer from the following shortcoming: they are oblivious to temporal information since they do not attempt to model the temporal statistics of the data (but see [4] for square, noise-free mixing). In other words, the model learned would not be affected by permuting the time indices of $\{y_t\}$. This is unfortunate since modeling the data as a time series would facilitate filtering and forecasting, as well as more accurate classification. Moreover, for source separation applications, learning temporal statistics would provide additional information on the sources, leading to cleaner source reconstructions.

To see this, one may think of the problem of blind separation of noisy data in terms of two components: source separation and noise reduction. A possible approach might be the following two-stage procedure. First, perform noise reduction using, e.g., Wiener filtering. Second, perform source separation on the cleaned data using, e.g., an ICA algorithm. Notice that this procedure directly exploits temporal (second-order) statistics of the data in the first stage to achieve stronger noise reduction. An alternative approach would be to exploit the temporal structure of the data indirectly, by using a temporal source model. In the resulting single-stage algorithm, the operations of source separation and noise reduction are coupled. This is the approach taken in the present paper.

In the following, we present a new approach to the independent factor problem based on dynamic probabilistic networks. In order to capture temporal statistical properties of the observed data, we describe each source by a hidden Markov model (HMM). The resulting dynamic model describes a multivariate time series in terms of several independent sources, each having its own temporal characteristics. Section 2 presents an EM learning algorithm for the zero-noise case, and section 3 presents an algorithm for the case of isotropic noise. The case of non-isotropic noise turns out to be computationally intractable; section 4 provides an approximate EM algorithm based on a variational approach.

Notation: The multivariable Gaussian density is denoted by $\mathcal{G}(z, \Sigma) = |2\pi\Sigma|^{-1/2} \exp(-z^T \Sigma^{-1} z/2)$. We work with $T$-point time blocks denoted $x_{1:T} = \{x_t\}_{t=1}^{T}$. The $i$th coordinate of $x_t$ is $x_t^i$. For a function $f$, $\langle f(x_{1:T}) \rangle$ denotes averaging over an ensemble of $x_{1:T}$ blocks.

2 Zero Noise

The MOG source model employed in IFA [1] has the advantages that (i) it is capable of approximating arbitrary densities, and (ii) it can be learned efficiently from data by EM. The Gaussians correspond to the hidden states of the sources, labeled by $s$. Assume that at time $t$, source $i$ is in state $s_t^i = s$. Its signal $x_t^i$ is then generated by sampling from a Gaussian distribution with mean $\mu_s^i$ and variance $\nu_s^i$. In order to capture temporal statistics of the data, we endow the sources with temporal structure by introducing a transition matrix $a_{s's}^i$ between the states.
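For a single source, the temporal model just described amounts to a hidden Markov chain over the MOG states. A minimal sketch of sampling such a source follows; the state count, parameter values, and function name are illustrative assumptions.

```python
import numpy as np

def sample_hmm_source(T, a, pi, mu, nu, rng):
    """Sample x_1..x_T from one HMM source: a[s', s] = p(s_t = s | s_{t-1} = s'),
    pi[s] = p(s_0 = s), Gaussian emissions with per-state mean mu[s], variance nu[s]."""
    s = rng.choice(len(pi), p=pi)                 # initial state s_0
    x = np.empty(T)
    for t in range(T):
        s = rng.choice(len(pi), p=a[s])           # Markov transition between states
        x[t] = mu[s] + np.sqrt(nu[s]) * rng.standard_normal()
    return x

rng = np.random.default_rng(1)
a  = np.array([[0.95, 0.05], [0.10, 0.90]])       # sticky states give temporal structure
pi = np.array([0.5, 0.5])
x  = sample_hmm_source(2000, a, pi, mu=np.array([-1.0, 2.0]), nu=np.array([0.3, 1.0]), rng=rng)
```

Setting every row of the transition matrix equal to $\pi$ recovers the time-independent MOG source of ordinary IFA.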
Focusing on a time block $t = 1, \ldots, T$, the resulting probabilistic model is defined by

$$p(s_t^i = s \mid s_{t-1}^i = s') = a_{s's}^i, \qquad p(s_0^i = s) = \pi_s^i,$$
$$p(x_t^i \mid s_t^i = s) = \mathcal{G}(x_t^i - \mu_s^i, \nu_s^i),$$
$$p(y_{1:T}) = |\det G|^T \, p(x_{1:T}), \tag{2}$$

where $p(x_{1:T})$ is the joint density of all sources $x_t^i$, $i = 1, \ldots, L$ at all time points, and the last equation follows from $x_t = G y_t$ with $G = H^{-1}$ being the unmixing matrix. As usual in the noise-free scenario (see [2]; section 7 of [1]), we are assuming that the mixing matrix is square and invertible.

The graphical model for the observed density $p(y_{1:T} \mid W)$ defined by (2) is parametrized by $W = \{G_{ij}, \mu_s^i, \nu_s^i, \pi_s^i, a_{s's}^i\}$. This model describes each source as a first-order HMM; it reduces to a time-independent model if $a_{s's}^i = \pi_s^i$. Whereas temporal structure can be described by other means, e.g. a moving-average [4] or autoregressive [6] model, the HMM is advantageous since it models high-order temporal statistics and facilitates EM learning. Omitting the derivation, maximization with respect to $G_{ij}$ results in the incremental update rule

$$\delta G = \epsilon G - \frac{\epsilon}{T} \sum_{t=1}^{T} \phi(x_t)\, x_t^T G, \tag{3}$$

where $\phi(x_t^i) = \sum_s \gamma_t^i(s)(x_t^i - \mu_s^i)/\nu_s^i$, and the natural gradient [3] was used; $\epsilon$ is an appropriately chosen learning rate. For the source parameters we obtain the update rules

$$\mu_s^i = \frac{\sum_t \gamma_t^i(s)\, x_t^i}{\sum_t \gamma_t^i(s)}, \qquad a_{s's}^i = \frac{\sum_t \xi_t^i(s', s)}{\sum_t \gamma_{t-1}^i(s')}, \tag{4}$$

with the initial probabilities updated via $\pi_s^i = \gamma_0^i(s)$. We used the standard HMM notation $\gamma_t^i(s) = p(s_t^i = s \mid x_{1:T}^i)$ and $\xi_t^i(s', s) = p(s_{t-1}^i = s', s_t^i = s \mid x_{1:T}^i)$. These posterior densities are computed in the E-step for each source, which is given in terms of the data via $x_t^i = \sum_j G_{ij} y_t^j$, using the forward-backward procedure [7].

The algorithm (3-4) may be used in several possible generalized EM schemes. An efficient one is given by the following two-phase procedure: (i) freeze the source parameters and learn the separating matrix $G$ using (3); (ii) freeze $G$ and learn the source parameters using (4), then go back to (i) and repeat. Notice that the rule (3) is similar to a natural gradient version of Bell and Sejnowski's ICA rule [2]; in fact, the two coincide for time-independent sources where $\phi(x^i) = -\partial \log p(x^i)/\partial x^i$. We also recognize (4) as the Baum-Welch method. Hence, in phase (i) our algorithm separates the sources using a generalized ICA rule, whereas in phase (ii) it learns an HMM for each source.

Remark. Often one would like to model a given $L'$-variable time series in terms of a smaller number $L \leq L'$ of factors. In the framework of our noise-free model $y_t = H x_t$, this can be achieved by applying the above algorithm to the $L$ largest principal components of the data; notice that if the data were indeed generated by $L$ factors, the remaining $L' - L$ principal components would vanish. Equivalently, one may apply the algorithm to the data directly, using a non-square $L \times L'$ unmixing matrix $G$.
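The two-phase scheme built from (3) and (4) can be sketched as follows. This is a minimal illustration rather than the paper's exact implementation: it assumes square, zero-mean data, uses a standard scaled forward-backward recursion for the per-source state posteriors $\gamma_t^i(s)$, re-estimates only the state means and variances in phase (ii), and all function names, learning rates, and initializations are illustrative.

```python
import numpy as np

def forward_backward(x, mu, nu, a, pi):
    """Scaled forward-backward for one source: gamma[t, s] = p(s_t = s | x_{1:T})."""
    T, S = len(x), len(mu)
    B = np.exp(-0.5 * (x[:, None] - mu) ** 2 / nu) / np.sqrt(2 * np.pi * nu)  # emission probs
    alpha, beta, c = np.zeros((T, S)), np.ones((T, S)), np.zeros(T)
    alpha[0] = pi * B[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ a) * B[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):
        beta[t] = (a @ (B[t + 1] * beta[t + 1])) / c[t + 1]
    return alpha * beta

def em_step(G, Y, mu, nu, a, pi, eps=0.01):
    """One pass of the two-phase scheme: phase (i) updates G by rule (3),
    phase (ii) re-fits the per-source means/variances from the state posteriors."""
    T, L = Y.shape
    X = Y @ G.T                                   # x_t = G y_t
    gam = [forward_backward(X[:, i], mu[i], nu[i], a[i], pi[i]) for i in range(L)]
    # Phase (i): natural-gradient separating-matrix update (3).
    phi = np.stack([(gam[i] * (X[:, [i]] - mu[i]) / nu[i]).sum(1) for i in range(L)], axis=1)
    G = G + eps * G - (eps / T) * (phi.T @ X) @ G
    # Phase (ii): Baum-Welch-style re-estimation of state means/variances (cf. (4));
    # transition and initial probabilities would be re-estimated similarly (omitted).
    for i in range(L):
        w = gam[i] / gam[i].sum(0)
        mu[i] = (w * X[:, [i]]).sum(0)
        nu[i] = np.maximum((w * (X[:, [i]] - mu[i]) ** 2).sum(0), 1e-3)
    return G, mu, nu

# Toy usage on random data: L = 2 sources, S = 3 states per source.
rng = np.random.default_rng(0)
L, S, T = 2, 3, 500
Y = rng.standard_normal((T, L))
G = np.eye(L); mu = rng.standard_normal((L, S)); nu = np.ones((L, S))
a = np.tile(np.full((S, S), 1.0 / S), (L, 1, 1)); pi = np.full((L, S), 1.0 / S)
for _ in range(5):
    G, mu, nu = em_step(G, Y, mu, nu, a, pi)
```

A faithful implementation would alternate several sub-iterations of each phase while recomputing the posteriors, as described above.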
Results. Figure 1 demonstrates the performance of the above method on a $4 \times 4$ mixture of speech signals, which were passed through a non-linear function to modify their distributions. This mixture cannot be separated by ICA because the source model used by the latter does not fit the actual source densities (see discussion in [1]). We also applied our dynamic network to a mixture of speech signals whose distributions were made Gaussian by an appropriate non-linear transformation. Since temporal information is crucial for separation in this case (see [4],[6]), this mixture cannot be separated by ICA or IFA; however, the algorithm (3-4) accomplished separation successfully.

[Figure 1: Three panels (distributions and scatter plots over $x_1$, $x_2$). Left: Two of the four source distributions. Middle: Outputs of the EM algorithm (3-4) are nearly independent. Right: the outputs of ICA [2] are correlated.]

3 Isotropic Noise

We now turn to the case of non-zero noise $u_t \neq 0$. We assume that the noise is white and has a zero-mean Gaussian distribution with covariance matrix $\Lambda$. In general, this case is computationally intractable (see section 4). The reason is that the E-step requires computing the posterior distribution $p(s_{0:T}, x_{1:T} \mid y_{1:T})$ not only over the source states (as in the zero-noise case) but also over the source signals, and this posterior has a quite complicated structure. We now show that if we assume isotropic noise, i.e. $\Lambda_{ij} = \lambda \delta_{ij}$, as well as square invertible mixing as above, this posterior simplifies considerably, making learning and inference tractable. This is done by adapting an idea suggested in [8] to our dynamic probabilistic network.

We start by pre-processing the data using a linear transformation that makes their covariance matrix unity, i.e., $\langle y_t y_t^T \rangle = I$ ('sphering'). Here $\langle \cdot \rangle$ denotes averaging over $T$-point time blocks. From (1) it follows that $H S H^T = \lambda' I$, where $S = \langle x_t x_t^T \rangle$ is the diagonal covariance matrix of the sources, and $\lambda' = 1 - \lambda$. This, for a square invertible $H$, implies that $H^T H$ is diagonal. In fact, since the unobserved sources can be determined only to within a scaling factor, we can set the variance of each source to unity and obtain the orthogonality property $H^T H = \lambda' I$. It can be shown that the source posterior now factorizes into a product over the individual sources, $p(s_{0:T}, x_{1:T} \mid y_{1:T}) = \prod_i p(s_{0:T}^i, x_{1:T}^i \mid y_{1:T})$, where

$$p(s_{0:T}^i, x_{1:T}^i \mid y_{1:T}) \propto \left[ \prod_{t=1}^{T} \mathcal{G}(x_t^i - \eta_t^i, \bar{\sigma}_t^i)\, v_t^i\, p(s_t^i \mid s_{t-1}^i) \right] v_0^i\, p(s_0^i). \tag{5}$$

The means and variances at time $t$ in (5), as well as the quantities $v_t^i$, depend on both the data $y_t$ and the states $s_t^i$; in particular, $\eta_t^i = (\nu_s^i \sum_j H_{ji} y_t^j + \lambda \mu_s^i)/(\lambda' \nu_s^i + \lambda)$ and $\bar{\sigma}_t^i = \lambda \nu_s^i/(\lambda' \nu_s^i + \lambda)$, using $s = s_t^i$; the expressions for the $v_t^i$ are omitted. The transition probabilities are the same as in (2). Hence, the posterior distribution (5) effectively defines a new HMM for each source, with $y_t$-dependent emission and transition probabilities.
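The sphering pre-processing is a standard whitening step. A minimal sketch follows, assuming zero-mean data and using an eigendecomposition of the sample covariance; the function name is illustrative.

```python
import numpy as np

def sphere(Y):
    """Whiten zero-mean data so the sample covariance <y_t y_t^T> becomes the identity."""
    C = Y.T @ Y / Y.shape[0]                      # sample covariance over the T-point block
    w, V = np.linalg.eigh(C)                      # C = V diag(w) V^T
    W = V @ np.diag(1.0 / np.sqrt(w)) @ V.T       # symmetric whitening transform
    return Y @ W.T, W                             # sphered data and the sphering matrix

# Example: after sphering, the empirical covariance of the rows is close to the identity.
rng = np.random.default_rng(0)
Y = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 4))
Ys, W = sphere(Y - Y.mean(axis=0))
```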
To derive the learning rule for $H$, we should first compute the conditional mean $\hat{x}_t$ of the source signals at time $t$ given the data. This can be done recursively using (5) as in the forward-backward procedure. We then obtain

$$C = \frac{1}{T} \sum_{t=1}^{T} y_t\, \hat{x}_t^T. \tag{6}$$

This fractional form results from imposing the orthogonality constraint $H^T H = \lambda' I$ using Lagrange multipliers and can be computed via a diagonalization procedure. The source parameters are computed using a learning rule (omitted) similar to the noise-free rule (4). It is easy to derive a learning rule for the noise level $\lambda$ as well; in fact, the ordinary FA rule would suffice. We point out that, while this algorithm has been derived for the case $L = L'$, it is perfectly well defined (though sub-optimal: see below) for $L \leq L'$.

4 Non-Isotropic Noise

The general case of non-isotropic noise and non-square mixing is computationally intractable. This is because the exact E-step requires summing over all possible source configurations $(s_{t_1}^1, \ldots, s_{t_L}^L)$ at all times $t_1, \ldots, t_L = 1, \ldots, T$. The intractability problem stems from the fact that, while the sources are independent, the sources conditioned on a data vector $y_{1:T}$ are correlated, resulting in a large number of hidden configurations. This problem does not arise in the noise-free case, and can be avoided in the case of isotropic noise and square mixing using the orthogonality property; in both cases, the exact posterior over the sources factorizes.

The EM algorithm derived below is based on a variational approach. This approach was introduced in [9] in the context of sigmoid belief networks, but constitutes a general framework for ML learning in intractable probabilistic networks; it was used in an HMM context in [10]. The idea is to use an approximate but tractable posterior to place a lower bound on the likelihood, and optimize the parameters by maximizing this bound.

A starting point for deriving a bound on the likelihood $\mathcal{L}$ is Neal and Hinton's [11] formulation of the EM algorithm:

$$\mathcal{L} = \log p(y_{1:T}) \geq \sum_{t=1}^{T} E_q \log p(y_t \mid x_t) + \sum_{i=1}^{L} E_q \log p(s_{0:T}^i, x_{1:T}^i) - E_q \log q, \tag{7}$$

where $E_q$ denotes averaging with respect to an arbitrary posterior density over the hidden variables given the observed data, $q = q(s_{0:T}, x_{1:T} \mid y_{1:T})$. Exact EM, as shown in [11], is obtained by maximizing the bound (7) with respect to both the posterior $q$ (corresponding to the E-step) and the model parameters $W$ (M-step). However, the resulting $q$ is the true but intractable posterior. In contrast, in variational EM we choose a $q$ that differs from the true posterior, but facilitates a tractable E-step.

E-Step. We use $q(s_{0:T}, x_{1:T} \mid y_{1:T}) = \prod_i q(s_{0:T}^i \mid y_{1:T}) \prod_t q(x_t \mid y_{1:T})$, parametrized as

$$q(s_t^i = s \mid s_{t-1}^i = s', y_{1:T}) \propto \lambda_{s,t}^i\, a_{s's}^i, \qquad q(s_0^i = s \mid y_{1:T}) \propto \lambda_{s,0}^i\, \pi_s^i, \qquad q(x_t \mid y_{1:T}) = \mathcal{G}(x_t - \rho_t, \Sigma_t). \tag{8}$$

Thus, the variational transition probabilities in (8) are described by multiplying the original ones $a_{s's}^i$ by the parameters $\lambda_{s,t}^i$, subject to the normalization constraints. The source signals $x_t$ at time $t$ are jointly Gaussian with mean $\rho_t$ and covariance $\Sigma_t$. The means, covariances and transition probabilities are all time- and data-dependent, i.e., $\rho_t = f(y_{1:T}, t)$ and so on.
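To make the parametrization (8) concrete, here is a small sketch of how the variational transition and initial probabilities for one source could be assembled from the original HMM parameters and the $\lambda_{s,t}^i$ factors; the array shapes, indexing convention, and names are illustrative assumptions.

```python
import numpy as np

def variational_hmm_params(a, pi, lam):
    """Build the q-distribution transition/initial probabilities of (8) for one source.
    a: (S, S) original transitions a[s', s]; pi: (S,) initial probabilities;
    lam: (T+1, S) factors lambda_{s,t} for t = 0..T."""
    q_pi = lam[0] * pi
    q_pi = q_pi / q_pi.sum()
    q_a = lam[1:, None, :] * a[None, :, :]          # scale destination state s by lambda_{s,t}
    q_a = q_a / q_a.sum(axis=2, keepdims=True)      # normalize over the destination state
    return q_a, q_pi                                # q_a[t-1, s', s] governs the step into time t

# These q-parameters would then feed a standard forward-backward pass, giving the state
# posteriors gamma_t^i(s) that appear in the fixed-point equations (9) below.
```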
This parametrization scheme is motivated by the form of the posterior in (5); notice that the quantities $\eta_t^i$, $\bar{\sigma}_t^i$, $v_{s,t}^i$ there become the variational parameters $\rho_t^i$, $\Sigma_t^{ij}$, $\lambda_{s,t}^i$ of (8). A related scheme was used in [10] in a different context. Since these parameters will be adapted independently of the model parameters, the non-isotropic algorithm is expected to give superior results compared to the isotropic one.

[Figure 2: Two panels, 'Mixing' and 'Reconstruction', plotting error (dB) against SNR (dB). Left: quality of the model parameter estimates. Right: quality of the source reconstructions. (See text.)]

Of course, in the true posterior the $x_t$ are correlated, both temporally among themselves and with $s_t$, and the latter do not factorize. To best approximate it, the variational parameters $V = \{\rho_t^i, \Sigma_t^{ij}, \lambda_{s,t}^i\}$ are optimized to maximize the bound on $\mathcal{L}$, or equivalently to minimize the KL distance between $q$ and the true posterior. This requirement leads to the fixed point equations

$$\rho_t = (H^T \Lambda^{-1} H + B_t)^{-1}(H^T \Lambda^{-1} y_t + b_t), \qquad \Sigma_t = (H^T \Lambda^{-1} H + B_t)^{-1},$$
$$\lambda_{s,t}^i = \frac{1}{z_t^i} \exp\left[ -\frac{1}{2} \log \nu_s^i - \frac{(\rho_t^i - \mu_s^i)^2 + \Sigma_t^{ii}}{2 \nu_s^i} \right], \tag{9}$$

where $B_t^{ij} = \sum_s [\gamma_t^i(s)/\nu_s^i]\, \delta_{ij}$, $b_t^i = \sum_s \gamma_t^i(s) \mu_s^i/\nu_s^i$, and the factors $z_t^i$ ensure normalization. The HMM quantities $\gamma_t^i(s)$ are computed by the forward-backward procedure using the variational transition probabilities (8). The variational parameters are determined by solving eqs. (9) iteratively for each block $y_{1:T}$; in practice, we found that fewer than 20 iterations are usually required for convergence.

M-Step. The update rules for $W$ are given for the mixing parameters by

$$\Lambda = \frac{1}{T} \sum_{t=1}^{T} \left( y_t y_t^T - y_t \rho_t^T H^T \right), \tag{10}$$

and for the source parameters by

$$\mu_s^i = \frac{\sum_t \gamma_t^i(s)\, \rho_t^i}{\sum_t \gamma_t^i(s)}, \qquad a_{s's}^i = \frac{\sum_t \xi_t^i(s', s)}{\sum_t \gamma_{t-1}^i(s')}, \qquad \nu_s^i = \frac{\sum_t \gamma_t^i(s) \left( (\rho_t^i - \mu_s^i)^2 + \Sigma_t^{ii} \right)}{\sum_t \gamma_t^i(s)}, \tag{11}$$

where the $\xi_t^i(s', s)$ are computed using the variational transition probabilities (8). Notice that the learning rules for the source parameters have the Baum-Welch form, in spite of the correlations between the conditioned sources. In our variational approach, these correlations are hidden in $V$, as manifested by the fact that the fixed point equations (9) couple the parameters $V$ across time points (since $\gamma_t^i(s)$ depends on $\lambda_{s,t=1:T}^i$) and sources.

Source Reconstruction. From $q(x_t \mid y_{1:T})$ (8), we observe that the MAP source estimate is given by $\hat{x}_t = \rho_t(y_{1:T})$, and depends on both $W$ and $V$.
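One sweep of the fixed-point equations (9) could look as follows. This is a minimal sketch with illustrative names; it assumes the per-source state posteriors (gammas) have already been obtained from a forward-backward pass using the variational transition probabilities (8).

```python
import numpy as np

def fixed_point_sweep(Y, H, Lam, mus, nus, gammas):
    """One sweep of the fixed-point equations (9).
    Y: (T, L') data block; H: (L', L) mixing matrix; Lam: (L', L') noise covariance;
    mus[i], nus[i]: (S_i,) per-state means/variances of source i;
    gammas[i]: (T, S_i) state posteriors gamma_t^i(s)."""
    T = Y.shape[0]
    L = H.shape[1]
    Lam_inv = np.linalg.inv(Lam)
    A = H.T @ Lam_inv @ H                                   # data-independent precision term
    rho = np.zeros((T, L))
    Sigma = np.zeros((T, L, L))
    lam = [np.zeros_like(g) for g in gammas]
    for t in range(T):
        # B_t is diagonal with entries sum_s gamma_t^i(s)/nu_s^i; b_t^i = sum_s gamma_t^i(s) mu_s^i / nu_s^i.
        Bt = np.diag([(gammas[i][t] / nus[i]).sum() for i in range(L)])
        bt = np.array([(gammas[i][t] * mus[i] / nus[i]).sum() for i in range(L)])
        Sigma[t] = np.linalg.inv(A + Bt)
        rho[t] = Sigma[t] @ (H.T @ Lam_inv @ Y[t] + bt)
        for i in range(L):
            log_lam = -0.5 * np.log(nus[i]) - ((rho[t, i] - mus[i]) ** 2 + Sigma[t, i, i]) / (2 * nus[i])
            lam_i = np.exp(log_lam - log_lam.max())
            lam[i][t] = lam_i / lam_i.sum()                 # the constant z_t^i absorbs the normalization
    return rho, Sigma, lam
```

In a full implementation these sweeps would alternate with forward-backward passes that recompute the gammas from the scaled transition probabilities, repeated until the lambda factors stop changing (fewer than 20 iterations in practice, as noted above).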
Results. The above algorithm is demonstrated on a source separation task in Figure 2. We used 6 speech signals, transformed by non-linearities to have arbitrary one-point densities, and mixed by a random $8 \times 6$ matrix $H_0$. Different signal-to-noise ratio (SNR) levels were used. The error in the estimated $H$ (left, solid line) is quantified by the size of the non-diagonal elements of $(H^T H)^{-1} H^T H_0$ relative to the diagonal; the results obtained by IFA [1], which does not use temporal information, are plotted for reference (dotted line). The mean squared error of the reconstructed sources (right, solid line) and the corresponding IFA result (right, dashed line) are also shown. The estimate and reconstruction errors of this algorithm are consistently smaller than those of IFA, reflecting the advantage of exploiting the temporal structure of the data. Additional experiments with different numbers of sources and sensors gave similar results. Notice that this algorithm, unlike the previous two, allows both $L \leq L'$ and $L > L'$. We also considered situations where the number of sensors was smaller than the number of sources; the separation quality was good, although, as expected, less so than in the opposite case.

5 Conclusion

An important issue that has not been addressed here is model selection. When applying our algorithms to an arbitrary dataset, the number of factors and of HMM states for each factor should be determined. Whereas this could be done, in principle, using cross-validation, the required computational effort would be fairly large. However, in a recent paper [12] we develop a new framework for Bayesian model selection, as well as model averaging, in probabilistic networks. This framework, termed Variational Bayes, proposes an EM-like algorithm which approximates full posterior distributions over not only hidden variables but also parameters and model structure, as well as predictive quantities, in an analytical manner. It is currently being applied to the algorithms presented here with good preliminary results.

One field in which our approach may find important applications is speech technology, where it suggests building more economical signal models based on combining independent low-dimensional HMMs, rather than fitting a single complex HMM. It may also contribute toward improving recognition performance in noisy, multi-speaker, reverberant conditions which characterize real-world auditory scenes.

References

[1] Attias, H. (1999). Independent factor analysis. Neur. Comp. 11, 803-851.
[2] Bell, A.J. & Sejnowski, T.J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neur. Comp. 7, 1129-1159.
[3] Amari, S., Cichocki, A. & Yang, H.H. (1996). A new learning algorithm for blind signal separation. Adv. Neur. Info. Proc. Sys. 8, 757-763 (Ed. by Touretzky, D.S. et al). MIT Press, Cambridge, MA.
[4] Pearlmutter, B.A. & Parra, L.C. (1997). Maximum likelihood blind source separation: A context-sensitive generalization of ICA. Adv. Neur. Info. Proc. Sys. 9, 613-619 (Ed. by Mozer, M.C. et al). MIT Press, Cambridge, MA.
[5] Hyvärinen, A. & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neur. Comp. 9, 1483-1492.
[6] Attias, H. & Schreiner, C.E. (1998). Blind source separation and deconvolution: the dynamic component analysis algorithm. Neur. Comp. 10, 1373-1424.
[7] Rabiner, L. & Juang, B.-H. (1993). Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ.
[8] Lee, D.D. & Sompolinsky, H. (1999), unpublished; D.D. Lee, personal communication.
[9] Saul, L.K., Jaakkola, T. & Jordan, M.I. (1996). Mean field theory of sigmoid belief networks. J. Art. Int. Res. 4, 61-76.
[10] Ghahramani, Z. & Jordan, M.I. (1997). Factorial hidden Markov models. Mach. Learn. 29, 245-273.
[11] Neal, R.M. & Hinton, G.E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 355-368 (Ed. by Jordan, M.I.). Kluwer Academic Press.
[12] Attias, H. (2000). A variational Bayesian framework for graphical models. Adv. Neur. Info. Proc. Sys. 12 (Ed. by Leen, T. et al). MIT Press, Cambridge, MA.
", "award": [], "sourceid": 1682, "authors": [{"given_name": "Hagai", "family_name": "Attias", "institution": null}]}