{"title": "Blind Separation of Delayed and Convolved Sources", "book": "Advances in Neural Information Processing Systems", "page_first": 758, "page_last": 764, "abstract": null, "full_text": "Blind separation of delayed and convolved sources. \n\nTe-Won Lee \nMax-Planck-Society, GERMANY, \nAND Interactive Systems Group \nCarnegie Mellon University \nPittsburgh, PA 15213, USA \ntewon@cs.cmu.edu \n\nAnthony J. Bell \nComputational Neurobiology, \nThe Salk Institute \n10010 N. Torrey Pines Road \nLa Jolla, California 92037, USA \ntony@salk.edu \n\nRussell H. Lambert \nDept of Electrical Engineering \nUniversity of Southern California, USA \nrlambert@sipi.usc.edu \n\nAbstract \n\nWe address the difficult problem of separating multiple speakers with multiple microphones in a real room. We combine the work of Torkkola and of Amari, Cichocki and Yang to give natural gradient information maximisation rules for recurrent (IIR) networks, blindly adjusting delays, separating and deconvolving mixed signals. While they work well on simulated data, these rules fail in real rooms, which usually involve non-minimum-phase transfer functions that are not invertible using stable IIR filters. An approach that sidesteps this problem is to perform infomax on a feedforward architecture in the frequency domain (Lambert 1996). We demonstrate real-room separation of two natural signals using this approach. \n\n1 The problem. \n\nIn the linear blind signal processing problem ([3, 2] and references therein), N signals, s(t) = [s_1(t) ... s_N(t)]^T, are transmitted through a medium so that an array of N sensors picks up a set of signals x(t) = [x_1(t) ... x_N(t)]^T, each of which has been mixed, delayed and filtered as follows: \n\nx_i(t) = \\sum_{j=1}^{N} \\sum_{k=0}^{M-1} a_{ijk} s_j(t - D_{ij} - k) \n\n(1) \n\n(Here D_{ij} are entries in a matrix of delays and there is an M-point filter, a_{ij}, between the jth source and the ith sensor.) The problem is to invert this mixing without knowledge of it, thus recovering the original signals, s(t). \n\n2 Architectures. \n\nThe obvious architecture for inverting eq.1 is the feedforward one: \n\nu_i(t) = \\sum_{j=1}^{N} \\sum_{k=0}^{M-1} w_{ijk} x_j(t - d_{ij} - k). \n\n(2) \n\nwhich has filters, w_{ij}, and delays, d_{ij}, which supposedly reproduce, at the u_i, the original uncorrupted source signals, s_i. This was the architecture implicitly assumed in [2]. However, it cannot solve the delay-compensation problem, since in eq.1 each delay, D_{ij}, delays a single source, while in eq.2 each delay, d_{ij}, is associated with a mixture, x_j. \n\nTorkkola [8] has addressed the problem of solving the delay-compensation problem with a feedback architecture. Such an architecture can, in principle, solve this problem, as shown earlier by Platt & Faggin [7]. Torkkola [9] also generalised the feedback architecture to remove dependencies across time, to achieve the deconvolution of mixtures which have been filtered, as in eq.1. \n\nHere we propose a slightly different architecture than Torkkola's ([9], eq.15). His architecture could fail since it is missing feedback cross-weights for t = 0, ie: w_{ij0}. A full feedback system looks like: \n\nu_i(t) = x_i(t) - \\sum_{j=1}^{N} \\sum_{k=0}^{M-1} w_{ijk} u_j(t - d_{ij} - k). \n\n(3) \n\nand is illustrated in Fig.1. 
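The mixing model of eq.1 is easy to simulate directly. The following sketch (a toy 2-source system with illustrative filter values and delays, not the paper's) builds delayed, filtered mixtures of the kind the architectures above try to invert:

```python
import numpy as np

def mix(sources, A, D, M):
    """Mix N sources per eq.1: x_i(t) = sum_j sum_k a_ijk * s_j(t - D_ij - k).

    sources : (N, T) array of source signals
    A       : (N, N, M) array of M-point mixing filters a_ijk
    D       : (N, N) integer matrix of propagation delays D_ij
    """
    N, T = sources.shape
    x = np.zeros((N, T))
    for i in range(N):
        for j in range(N):
            for k in range(M):
                d = D[i, j] + k          # total lag of this filter tap
                if d < T:
                    x[i, d:] += A[i, j, k] * sources[j, :T - d]
    return x

# two toy sources, a 3-tap mixing system, and small cross-channel delays
rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))          # white Laplacian sources, as in section 4
A = rng.normal(size=(2, 2, 3))
D = np.array([[0, 5], [5, 0]])
x = mix(s, A, D, M=3)
```

With zero delays and a single identity tap, `mix` reduces to a pass-through, which is a quick sanity check on the indexing.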
Because terms in u_i(t) appear on both sides, we rewrite this in vector terms: u(t) = x(t) - \\tilde{W}_0 u(t) - \\sum_{k=1}^{M-1} \\tilde{W}_k u(t - k), in order to solve it as follows: \n\nu(t) = (I + \\tilde{W}_0)^{-1} (x(t) - \\sum_{k=1}^{M-1} \\tilde{W}_k u(t - k)) \n\n(4) \n\nIn these equations, there is a feedback unmixing matrix, \\tilde{W}_k, for each time point of the filter, but the 'leading matrix', \\tilde{W}_0, has a special status in solving for u(t). The delay terms are useful since one metre of distance in air, at an 8kHz sampling rate, corresponds to a whole 25 zero-taps of a filter. Reintroducing them gives us: \n\nu(t) = (I + \\tilde{W}_0)^{-1} (x(t) - net(t)), \\quad net_i(t) = \\sum_{j=1}^{N} \\sum_{k=1}^{M-1} w_{ijk} u_j(t - d_{ij} - k) \n\n(5) \n\nFigure 1: The feedback neural architecture of eq.5, which is used to separate and deconvolve signals. Each box represents a causal filter and each circle denotes a time delay. [Diagram: sources s_1, s_2 pass through the mixing system A(z) to sensors x_1, x_2, then through the feedback unmixing network W(z) to outputs u_1, u_2, whose joint entropy H(y) is maximised.] \n\n3 Algorithms. \n\nLearning in this architecture is performed by maximising the joint entropy, H(y(t)), of the random vector y(t) = g(u(t)), where g is a bounded monotonic nonlinear function (a sigmoid function). The success of this for separating sources depends on four assumptions: (1) that the sources are statistically independent, (2) that each source is white, ie: there are no dependencies between time points, (3) that the non-linearity, g, has a derivative which has higher kurtosis than the probability density functions (pdf's) of the sources, and (4) that a stable IIR (feedback) inverse of the mixing exists; ie: that A is minimum phase (see section 5). \n\nAssumption (1) is reasonable and Assumption (3) allows some tailoring of our algorithm to fit data of different types. 
Assumption (2), on the other hand, is not true for natural signals. Our algorithm will whiten: it will remove dependencies across time which already existed in the original source signals, s_i. However, it is possible to restore the characteristic autocorrelations (amplitude spectra) of the sources by post-processing. For the reasoning behind Assumption (3) see [2]. We will discuss Assumption (4) in section 5. \n\nIn the static feedback case of eq.5, when M = 1, the learning rule for the feedback weights, \\tilde{W}_0, is just a co-ordinate transform of the rule for the feedforward weights, W_0, in the equivalent architecture of u(t) = W_0 x(t). Since W_0 = (I + \\tilde{W}_0)^{-1}, we have \\tilde{W}_0 = W_0^{-1} - I, which, due to the quotient rule for matrix differentiation, differentiates as: \n\n\\Delta\\tilde{W}_0 = -W_0^{-1} \\Delta W_0 W_0^{-1} \n\n(6) \n\nThe best way to maximise entropy in the feedforward system is not to follow the entropy gradient, as in [2], but to follow its 'natural' gradient, as reported by Amari et al [1]: \n\n\\Delta W \\propto \\frac{\\partial H(y)}{\\partial W} W^T W \n\n(7) \n\nThis is an optimal rescaling of the entropy gradient [1, 3]. It simplifies the learning rule and speeds convergence considerably. Evaluated, it gives [2]: \n\n\\Delta W_0 \\propto (I + \\hat{y} u^T) W_0 \n\n(8) \n\nSubstituting eq.8 into eq.6 gives the natural gradient rule for static feedback weights: \n\n\\Delta\\tilde{W}_0 \\propto -(I + \\tilde{W}_0)(I + \\hat{y} u^T) \n\n(9) \n\nThis reasoning may be extended to networks involving filters. For the feedforward filter architecture u(t) = \\sum_{k=0}^{M-1} W_k x(t - k), we derive a natural gradient rule (for k > 0) of: \n\n\\Delta W_k \\propto \\hat{y}_t u_{t-k}^T W_k \n\n(10) \n\nwhere, for convenience, time has become subscripted. Performing the same co-ordinate transforms as for \\tilde{W}_0 above gives the rule: \n\n\\Delta\\tilde{W}_k \\propto \\hat{y}_t u_{t-k}^T \\tilde{W}_k \n\n(11) \n\n(We note that learning rules similar to these have been independently derived by Cichocki et al [4]). 
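To make the static rule concrete, here is a minimal sketch of eq.8 separating an instantaneous (M = 1) mixture of two white Laplacian sources; the learning rate, block size, mixing matrix and random seed are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 20000))          # two independent white Laplacian sources
A = np.array([[0.9, 0.5],
              [0.4, 0.8]])                # instantaneous (M = 1) mixing matrix
x = A @ s

W = np.eye(2)                             # feedforward unmixing estimate W_0
lr, block = 0.01, 100
for sweep in range(20):
    for t in range(0, x.shape[1], block):
        u = W @ x[:, t:t + block]
        y = 1.0 / (1.0 + np.exp(-u))      # logistic nonlinearity
        y_hat = 1.0 - 2.0 * y             # its score term, as used in [2]
        # eq.8: natural-gradient infomax update for the static weights
        W += lr * (np.eye(2) + (y_hat @ u.T) / u.shape[1]) @ W

P = W @ A                                  # should approach a scaled permutation
```

At convergence I + E[ŷu^T] ≈ 0, so the updates vanish and the overall system W A approaches a scaled permutation of the sources.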
Finally, for the delays in eq.5, we derive [2, 8]: \n\n\\Delta d_{ij} \\propto \\frac{\\partial H(y)}{\\partial d_{ij}} = -\\hat{y}_i \\sum_{k=1}^{M-1} \\frac{\\partial}{\\partial t} w_{ijk} u_j(t - d_{ij} - k) \n\n(12) \n\nThis rule is different from that in [8] because it uses the collected temporal gradient information from all the taps. The algorithms of eq.9, eq.11 and eq.12 are the ones we use in our experiments on the architecture of eq.5. \n\n4 Simulation results for the feedback architecture \n\nTo test the learning rules in eq.9, eq.11 and eq.12, we used an IIR filter system to recover two sources which had been mixed and delayed as follows (in Z-transform notation): \n\nA_{11}(z) = 0.9 + 0.5z^{-1} + 0.3z^{-2} \nA_{12}(z) = 0.5z^{-5} + 0.3z^{-6} + 0.2z^{-7} \nA_{21}(z) = -0.7z^{-5} - 0.3z^{-6} - 0.2z^{-7} \nA_{22}(z) = 0.8 - 0.1z^{-1} \n\n(13) \n\nThe mixing system, A(z), is a minimum-phase system with all its zeros inside the unit circle. Hence, A(z) can be inverted using a stable causal IIR system, since all poles of the inverting system are also inside the unit circle. For this experiment, we chose an artificially-generated source: a white process with a Laplacian distribution [f(x) \\propto \\exp(-|x|)]. In the frequency domain the deconvolving system looks as follows: \n\nW(z) = A^{-1}(z) = \\frac{1}{D(z)} \\begin{bmatrix} A_{22}(z) & -A_{12}(z) \\\\ -A_{21}(z) & A_{11}(z) \\end{bmatrix} \n\n(14) \n\nwhere D(z) = W_{11}(z)W_{22}(z) - W_{12}(z)W_{21}(z). This leads to the following solution for the weight filters: \n\nW_{11}(z) = A_{22}(z), \\quad W_{22}(z) = A_{11}(z), \\quad W_{21}(z) = -A_{21}(z), \\quad W_{12}(z) = -A_{12}(z) \n\n(15) \n\nThe learning rule we used was that of eq.9 and eq.11 with the logistic non-linearity, y_i = 1/(1 + \\exp(-u_i)). Fig.2A shows the four filters learnt by our IIR algorithm. The bottom row shows the inverting system convolved with the mixing system, proving that W * A is approximately the identity mapping. 
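The adjugate structure behind eq.14 and eq.15 can be verified numerically: convolving the eq.15 unmixing filters with the eq.13 mixing filters should leave a diagonal system whose diagonal entries equal D(z). A short sketch, storing each filter as its coefficients in increasing powers of z^{-1} (the helper name `padsum` is ours, not the paper's):

```python
import numpy as np

# mixing filters of eq.13; each array holds coefficients of z^0, z^-1, z^-2, ...
A11 = np.array([0.9, 0.5, 0.3])
A12 = np.array([0, 0, 0, 0, 0, 0.5, 0.3, 0.2])
A21 = np.array([0, 0, 0, 0, 0, -0.7, -0.3, -0.2])
A22 = np.array([0.8, -0.1])

# eq.15 solution: W(z) is the adjugate of A(z)
W11, W12, W21, W22 = A22, -A12, -A21, A11

def padsum(p, q, sign=1.0):
    """Add (or subtract) two z^-1 polynomials of different lengths."""
    n = max(len(p), len(q))
    out = np.zeros(n)
    out[:len(p)] += p
    out[:len(q)] += sign * q
    return out

# entries of the product W(z) * A(z); polynomial product = convolution
P11 = padsum(np.convolve(W11, A11), np.convolve(W12, A21))
P12 = padsum(np.convolve(W11, A12), np.convolve(W12, A22))
P21 = padsum(np.convolve(W21, A11), np.convolve(W22, A21))
P22 = padsum(np.convolve(W21, A12), np.convolve(W22, A22))

# D(z) = det A(z); the paper states all its zeros lie inside the unit circle
D = padsum(np.convolve(A11, A22), np.convolve(A12, A21), sign=-1.0)
zero_moduli = np.abs(np.roots(D))   # moduli of the zeros of det A(z)
```

The off-diagonal products cancel identically, and both diagonal products equal D(z), so W * A is diagonal up to the common factor det A(z), which is what Fig.2A's delta-like responses show after learning.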
Delay learning is not demonstrated here, though for periodic signals like speech we observed that it is subject to local minima problems [8, 9]. \n\n(A) Feedback (IIR) learning \n\n(B) Feedforward (FIR) learning \n\nFigure 2: Top two rows: learned unmixing filters for (A) IIR learning on minimum-phase mixing, and (B) FIR freq.-domain learning on non-minimum-phase mixing. Bottom row: the convolved mixing and unmixing systems. The delta-like response indicates successful blind unmixing. In (B) this occurs acausally with a time-shift. \n\n5 Back to the feedforward architecture. \n\nThe feedback architecture is elegant but limited. It can only invert minimum-phase mixing (all zeros are inside the unit circle, meaning that all poles of the inverting system are as well). Unfortunately, real room acoustics usually involves non-minimum-phase mixing. \n\nThere does exist, however, a stable non-causal feedforward (FIR) inverse for non-minimum-phase mixing systems. The learning rules for such a system can be formulated using the FIR polynomial matrix algebra described by Lambert [5]. This may be performed in the time or frequency domain, the only requirements being that the inverting filters are long enough and that their main energy occurs more-or-less in their centre. This allows for the non-causal expansion of the non-minimum-phase roots, causing the roughly symmetrical \"flanged\" appearance of the filters in Fig.2B. \n\nFor convenience, we formulate the infomax and natural gradient infomax rules [2, 1] in the frequency domain: \n\n\\Delta W \\propto W^{-H} + fft(\\hat{y}) X^H \n\n(16) \n\n\\Delta W \\propto (I + fft(\\hat{y}) U^H) W \n\n(17) \n\nwhere the H superscript denotes the Hermitian (complex conjugate) transpose. 
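A minimal sketch of one update of eq.17, assuming a simple per-block FFT scheme with the logistic score ŷ = 1 − 2y applied in the time domain; the function name, block handling and learning rate are illustrative, not the authors' implementation:

```python
import numpy as np

def freq_domain_update(W_f, x_block, lr=0.001):
    """One natural-gradient infomax step, eq.17: dW ∝ (I + fft(y_hat) U^H) W.

    W_f     : (F, N, N) complex unmixing matrices, one per frequency bin
    x_block : (N, F) block of time-domain sensor samples (F = FFT length)
    """
    N, F = x_block.shape
    X = np.fft.fft(x_block, axis=1).T[:, :, None]      # (F, N, 1) per-bin spectra
    U = W_f @ X                                        # unmixed block, per bin
    u = np.fft.ifft(U[:, :, 0].T, axis=1).real         # back to the time domain
    y_hat = 1.0 - 2.0 / (1.0 + np.exp(-u))             # score of the logistic nonlinearity
    Y = np.fft.fft(y_hat, axis=1).T[:, :, None]        # fft applied at the output
    UH = U.conj().transpose(0, 2, 1)                   # Hermitian transpose, per bin
    grad = (np.eye(N) + Y @ UH / F) @ W_f              # eq.17, averaged over the block
    return W_f + lr * grad

# one update on a toy block: 2 sensors, 64-point FFT, identity initialisation
W = np.tile(np.eye(2, dtype=complex), (64, 1, 1))
x = np.random.default_rng(1).laplace(size=(2, 64))
W = freq_domain_update(W, x)
```

Note how the nonlinearity operates on the time-domain outputs u while the gradient itself is accumulated bin-by-bin in the frequency domain, matching the remark after eq.17.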
In these rules, as in eq.14, W is a matrix of filters and U and X are blocks of multi-sensor signal in the frequency domain. Note that the nonlinearity \\hat{y}_i = \\frac{\\partial}{\\partial u_i} \\ln \\left| \\frac{\\partial y_i}{\\partial u_i} \\right| still operates in the time domain and the fft is applied at the output. \n\n6 Simulation results for the feedforward architecture \n\nTo show the learning rule in eq.17 working, we altered the transfer function in eq.13 as follows: \n\n(18) \n\nThis system is now non-minimum phase, having zeros outside the unit circle. The inverse system can be approximated by stable non-causal FIR filters. These were learnt using the learning rule of eq.17 (again, with the logistic non-linearity). The resulting learnt filters are shown in Fig.2B, where the leading weights were chosen to be at half the filter size (M/2). Non-causality of the filters can be clearly observed for W_{12} and W_{21}, where there are non-zero coefficients before the maximum-amplitude weights. The bottom row of Fig.2B shows the successful separation by plotting the complete unmixing/mixing transfer function: W * A. \n\n7 Experiments with real recordings \n\nTo demonstrate separation in a real room, we set up two microphones and recorded firstly two people speaking and then one person speaking with music in the background. The microphones and the sources were both 60cm apart and 60cm from each other (arranged in a square), and the sampling rate was 16kHz. Fig.3A shows the two recordings of a person saying the digits \"one\" to \"ten\" while loud music plays in the background. The IIR system of eq.5, eq.9 and eq.11 was unable to separate these signals, presumably due to the non-minimum-phase nature of the room transfer functions. However, the algorithm of eq.17 converged after 30 passes through the 10-second recordings. 
The filter lengths were 256 (corresponding to 16ms). The separated signals are shown in Fig.3B. Listening to them conveys a sense of almost-clean separation, though interference is audible. The results on the two people speaking were similar. \n\nAn important application is in spontaneous speech recognition tasks, where the best recognizer may fail completely in the presence of background music or competing speakers (as in the teleconferencing problem). To test this application, we fed into a speech recognizer ten sentences recorded with loud music in the background and ten sentences recorded with a simultaneous speaker interference. After separation, the recognition rate increased considerably for both cases. These results are reported in detail in [6]. \n\n8 Conclusions \n\nStarting with 'natural gradient infomax' IIR learning rules for blind time-delay adjustment, separation and deconvolution, we showed how these worked well on minimum-phase mixing, but not on the non-minimum-phase mixing that usually occurs in rooms. This led us to an FIR frequency-domain infomax approach suggested by Lambert [5]. The latter approach shows much better separation of speech and music mixed in a real room. Based on these techniques, it should now be possible to develop real-world applications. \n\nFigure 3: Real-room separation/deconvolution. (A) recorded mixtures (B) separated speech (spoken digits 1-10) and music. \n\nAcknowledgments \n\nT.W.L. is supported by the Daimler-Benz-Fellowship, and A.J.B. by a grant from the Office of Naval Research. 
We are grateful to Kari Torkkola for sharing his results with us, and to Jürgen Fritsch, Terry Sejnowski and Alex Waibel for discussions and comments. \n\nReferences \n\n[1] Amari S-I., Cichocki A. & Yang H.H. 1996. A new learning algorithm for blind signal separation, Advances in Neural Information Processing Systems 8, MIT Press. \n\n[2] Bell A.J. & Sejnowski T.J. 1995. An information maximisation approach to blind separation and blind deconvolution, Neural Computation, 7, 1129-1159. \n\n[3] Cardoso J-F. & Laheld B. 1996. Equivariant adaptive source separation, IEEE Trans. on Signal Proc., Dec. 1996. \n\n[4] Cichocki A., Amari S-I. & Cao J. 1996. Blind separation of delayed and convolved signals with self-adaptive learning rate, in Proc. Intern. Symp. on Nonlinear Theory and Applications (NOLTA '96), Kochi, Japan. \n\n[5] Lambert R. 1996. Multichannel blind deconvolution: FIR matrix algebra and separation of multipath mixtures, PhD Thesis, University of Southern California, Department of Electrical Engineering, May 1996. \n\n[6] Lee T-W. & Orglmeister R. Blind source separation of real-world signals, submitted to Proc. ICNN, Houston, USA, 1997. \n\n[7] Platt J.C. & Faggin F. 1992. Networks for the separation of sources that are superimposed and delayed, in Moody J.E. et al. (eds), Advances in Neural Information Processing Systems 4, Morgan-Kaufmann. \n\n[8] Torkkola K. 1996. Blind separation of delayed sources based on information maximisation, Proc. IEEE ICASSP, Atlanta, May 1996. \n\n[9] Torkkola K. 1996. Blind separation of convolved sources based on information maximisation, Proc. IEEE Workshop on Neural Networks and Signal Processing, Kyoto, Japan, Sept. 
1996. \n", "award": [], "sourceid": 1235, "authors": [{"given_name": "Te-Won", "family_name": "Lee", "institution": null}, {"given_name": "Anthony", "family_name": "Bell", "institution": null}, {"given_name": "Russell", "family_name": "Lambert", "institution": null}]}