{"title": "One-unit Learning Rules for Independent Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 480, "page_last": 486, "abstract": null, "full_text": "One-unit  Learning Rules for \n\nIndependent  Component Analysis \n\nAapo Hyvarinen and Erkki Oja \nHelsinki  University of Technology \n\nLaboratory of Computer and Information Science \nRakentajanaukio 2 C,  FIN-02150 Espoo,  Finland \nemail:  {Aapo.Hyvarinen.Erkki.Oja}(Qhut.fi \n\nAbstract \n\nNeural one-unit learning rules for the problem of Independent Com(cid:173)\nponent Analysis (ICA)  and blind source separation are introduced. \nIn  these  new  algorithms,  every  ICA  neuron  develops  into  a  sepa(cid:173)\nrator that finds  one of the independent  components.  The learning \nrules  use  very  simple  constrained  Hebbianjanti-Hebbian  learning \nin  which  decorrelating feedback  may  be  added.  To  speed  up  the \nconvergence of these stochastic gradient descent rules, a novel com(cid:173)\nputationally efficient  fixed-point  algorithm is  introduced. \n\n1 \n\nIntroduction \n\nIndependent  Component Analysis  (ICA)  (Comon,  1994;  Jutten and Herault,  1991) \nis  a  signal  processing  technique  whose  goal  is  to  express  a  set  of  random  vari(cid:173)\nables as  linear combinations of statistically independent component  variables.  The \nmain  applications  of ICA  are  in  blind  source  separation,  feature  extraction,  and \nblind  deconvolution.  In  the  simplest  form  of ICA  (Comon,  1994),  we  observe  m \nscalar random variables  Xl, ... , Xm  which  are assumed to be linear combinations of \nn  unknown components 81, ... 8 n  that are zero-mean and mutually statistically  inde-\npendent.  In  addition, we  must assume n  ~ m.  If we  arrange the observed variables \nXi  into a  vector x  =  (Xl,X2, ... 
, xm)^T and the component variables sj into a vector s, the linear relationship can be expressed as\n\nx = As    (1)\n\nHere, A is an unknown m × n matrix of full rank, called the mixing matrix. Noise may also be added to the model, but it is omitted here for simplicity. The basic problem of ICA is then to estimate (separate) the realizations of the original independent components sj, or a subset of them, using only the mixtures xi. This is roughly equivalent to estimating the rows, or a subset of the rows, of the pseudoinverse of the mixing matrix A. The fundamental restriction of the model is that we can only estimate non-Gaussian independent components, or ICs (except if just one of the ICs is Gaussian). Moreover, the ICs and the columns of A can only be estimated up to a multiplicative constant, because any constant multiplying an IC in eq. (1) could be cancelled by dividing the corresponding column of the mixing matrix A by the same constant. For mathematical convenience, we define here that the ICs sj have unit variance. This makes the (non-Gaussian) ICs unique, up to their signs. Note that the assumption of zero mean of the ICs is in fact no restriction, as this can always be accomplished by subtracting the mean from the random vector x. Note also that no order is defined between the ICs.\n\nIn blind source separation (Jutten and Herault, 1991), the observed values of x correspond to a realization of an m-dimensional discrete-time signal x(t), t = 1, 2, .... Then the components sj(t) are called source signals. The source signals are usually original, uncorrupted signals or noise sources.
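As a concrete illustration of the mixing model in eq. (1), the following sketch (our own toy example; the variable names, the choice of sources, and the particular matrix A are ours, not the paper's) generates two independent zero-mean, unit-variance sources and mixes them with a full-rank matrix:

```python
import numpy as np

# Toy instance of x = As (eq. 1): n = 2 independent, zero-mean,
# unit-variance sources, mixed by an arbitrary full-rank 2x2 matrix.
rng = np.random.default_rng(0)
T = 10000
s = np.vstack([
    rng.uniform(-np.sqrt(3), np.sqrt(3), T),  # sub-Gaussian source
    rng.laplace(0, 1 / np.sqrt(2), T),        # super-Gaussian source
])
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])   # unknown mixing matrix (full rank)
x = A @ s                    # observed mixtures, one column per sample

# Each mixture is a linear combination of both sources, so the observed
# signals are correlated even though the underlying sources are not.
source_corr = np.corrcoef(s)[0, 1]
mixture_corr = np.corrcoef(x)[0, 1]
```

The blind source separation task is then to recover the rows of s from x alone, without knowing A.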
Another application of ICA is feature extraction (Bell and Sejnowski, 1996; Hurri et al., 1996), where the columns of the mixing matrix A define features, and the sj signal the presence and the amplitude of a feature. A closely related problem is blind deconvolution, in which a convolved version x(t) of a scalar i.i.d. signal s(t) is observed. The goal is then to recover the original signal s(t) without knowing the convolution kernel (Donoho, 1981). This problem can be represented in a way similar to eq. (1), replacing the matrix A by a filter.\n\nThe current neural algorithms for Independent Component Analysis, e.g. (Bell and Sejnowski, 1995; Cardoso and Laheld, 1996; Jutten and Herault, 1991; Karhunen et al., 1997; Oja, 1995), try to estimate all the components simultaneously. This is often not necessary, nor feasible, and it is often desired to estimate only a subset of the ICs. This is the starting point of our paper. We introduce learning rules for a single neuron, by which the neuron learns to estimate one of the ICs. A network of several such neurons can then estimate several (1 to n) ICs. Both learning rules for the 'raw' data (Section 3) and for whitened data (Section 4) are introduced. If the data is whitened, the convergence is speeded up, and some interesting simplifications and approximations are made possible. Feedback mechanisms (Section 5) are also mentioned. Finally, we introduce a novel approach for performing the computations needed in the ICA learning rules, which uses a very simple, yet highly efficient, fixed-point iteration scheme (Section 6). An important generalization of our learning rules is discussed in Section 7, and an illustrative experiment is shown in Section 8.\n\n2 Using Kurtosis for ICA Estimation\n\nWe begin by introducing the basic mathematical framework of ICA.
Most suggested solutions for ICA use the fourth-order cumulant or kurtosis of the signals, defined for a zero-mean random variable v as kurt(v) = E{v^4} - 3(E{v^2})^2. For a Gaussian random variable, kurtosis is zero. Therefore, random variables of positive kurtosis are sometimes called super-Gaussian, and variables of negative kurtosis sub-Gaussian. Note that for two independent random variables v1 and v2 and for a scalar α, it holds that kurt(v1 + v2) = kurt(v1) + kurt(v2) and kurt(α v1) = α^4 kurt(v1).\n\nLet us search for a linear combination of the observations xi, say w^T x, such that it has maximal or minimal kurtosis. Obviously, this is meaningful only if w is somehow bounded; let us assume that the variance of the linear combination is constant: E{(w^T x)^2} = 1. Using the mixing matrix A in eq. (1), let us define z = A^T w. Then also ||z||^2 = w^T A A^T w = w^T E{x x^T} w = E{(w^T x)^2} = 1. Using eq. (1) and the properties of the kurtosis, we have\n\nkurt(w^T x) = kurt(w^T A s) = kurt(z^T s) = Σ_{j=1}^{n} z_j^4 kurt(sj)    (2)\n\nUnder the constraint E{(w^T x)^2} = ||z||^2 = 1, the function in (2) has a number of local minima and maxima. To make the argument clearer, let us assume for the moment that the mixture contains at least one IC whose kurtosis is negative, and at least one whose kurtosis is positive. Then, as may be obvious, and was rigorously proven by Delfosse and Loubaton (1995), the extremal points of (2) are obtained when all the components zj of z are zero except one component which equals ±1. In particular, the function in (2) is maximized (resp. minimized) exactly when the linear combination w^T x = z^T s equals, up to the sign, one of the ICs sj of positive (resp. negative) kurtosis.
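The kurtosis properties used in this argument can be checked numerically; the sketch below is our own (the helper name kurt and the choice of sample distributions are ours), estimating kurt(v) = E{v^4} - 3(E{v^2})^2 from samples:

```python
import numpy as np

# Sample estimate of kurtosis for a zero-mean variable.
def kurt(v):
    return np.mean(v ** 4) - 3 * np.mean(v ** 2) ** 2

rng = np.random.default_rng(1)
n = 200000
gauss = rng.standard_normal(n)                   # Gaussian: kurt = 0
unif = rng.uniform(-np.sqrt(3), np.sqrt(3), n)   # uniform: kurt = -1.2 (sub-Gaussian)
lap = rng.laplace(0, 1 / np.sqrt(2), n)          # Laplacian: kurt = +3 (super-Gaussian)
```

The scaling property kurt(αv) = α^4 kurt(v) holds exactly for the sample estimator as well, since it is a fixed polynomial in the sample moments.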
Thus, finding the extrema of the kurtosis of w^T x enables estimation of the independent components. Equation (2) also shows that Gaussian components, or other components whose kurtosis is zero, cannot be estimated by this method.\n\nTo actually minimize or maximize kurt(w^T x), a neural algorithm based on gradient descent or ascent can be used. Then w is interpreted as the weight vector of a neuron with input vector x and linear output w^T x. The objective function can be simplified because of the constraint E{(w^T x)^2} = 1: it holds that kurt(w^T x) = E{(w^T x)^4} - 3. The constraint E{(w^T x)^2} = 1 itself can be taken into account by a penalty term. The final objective function is then of the form\n\nJ(w) = α E{(w^T x)^4} + β F(E{(w^T x)^2})    (3)\n\nwhere α, β > 0 are arbitrary scaling constants, and F is a suitable penalty function. Our basic ICA learning rules are stochastic gradient descents or ascents for an objective function of this form. In the next two sections, we present learning rules resulting from adequate choices of the penalty function F. Preprocessing of the data (whitening) is also used to simplify J in Section 4. An alternative method for finding the extrema of kurtosis is the fixed-point algorithm; see Section 6.\n\n3 Basic One-Unit ICA Learning Rules\n\nIn this section, we introduce learning rules for a single neural unit. These basic learning rules require no preprocessing of the data, except that the data must be made zero-mean. Our learning rules are divided into two categories. As explained in Section 2, the learning rules either minimize the kurtosis of the output to separate ICs of negative kurtosis, or maximize it for components of positive kurtosis.\n\nLet us assume that we observe a sample sequence x(t) of a vector x that is a linear combination of independent components s1, ..., sn according to eq. (1).
For separating one of the ICs of negative kurtosis, we use the following learning rule for the weight vector w of a neuron:\n\nΔw(t) ∝ x(t) g-(w(t)^T x(t))    (4)\n\nHere, the non-linear learning function g- is a simple polynomial: g-(u) = au - bu^3 with arbitrary scaling constants a, b > 0. This learning rule is clearly a stochastic gradient descent for a function of the form (3), with F(u) = -u. To separate an IC of positive kurtosis, we use the following learning rule:\n\nΔw(t) ∝ x(t) g+(w(t)^T x(t))    (5)\n\nwhere the learning function g+ is defined as follows: g+(u) = -au (w(t)^T C w(t))^2 + bu^3, where C is the covariance matrix of x(t), i.e. C = E{x(t) x(t)^T}, and a, b > 0 are arbitrary constants. This learning rule is a stochastic gradient ascent for a function of the form (3), with F(u) = -u^2. Note that (w(t)^T C w(t))^2 in g+ might also be replaced by (E{(w(t)^T x(t))^2})^2 or by ||w(t)||^4 to enable a simpler implementation.\n\nIt can be proven (Hyvärinen and Oja, 1996b) that using the learning rules (4) and (5), the linear output converges to c sj(t), where sj(t) is one of the ICs and c is a scalar constant. This multiplication of the source signal by the constant c is in fact not a restriction, as the variance and the sign of the sources cannot be estimated. The only condition for convergence is that one of the ICs must be of negative (resp. positive) kurtosis when learning rule (4) (resp. learning rule (5)) is used. Thus we can say that the neuron learns to separate (estimate) one of the independent components. It is also possible to combine these two learning rules into a single rule that separates an IC of any kurtosis; see (Hyvärinen and Oja, 1996b).
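A minimal numerical sketch of learning rule (4) follows. The toy data, the constants a and b, and the learning-rate schedule are all our own choices (the paper does not prescribe them); the point is only that the per-sample Hebbian/anti-Hebbian update drives the neuron's output toward one sub-Gaussian IC:

```python
import numpy as np

# Two sub-Gaussian (uniform, kurtosis -1.2), zero-mean, unit-variance ICs,
# mixed by an arbitrary full-rank matrix. Rule (4) needs only zero-mean data.
rng = np.random.default_rng(2)
T = 50000
s = rng.uniform(-np.sqrt(3), np.sqrt(3), (2, T))
A = np.array([[1.0, 0.6],
              [0.4, 1.2]])
x = A @ s

a, b = 1.0, 1.0                       # arbitrary positive constants
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for t in range(T):
    y = w @ x[:, t]                   # linear output of the neuron
    mu = 0.01 / (1.0 + t / 10000.0)   # decaying learning rate (our choice)
    w += mu * x[:, t] * (a * y - b * y ** 3)   # dw ∝ x g-(w^T x)

# The output should now be proportional to one IC and nearly
# uncorrelated with the other.
y = w @ x
corrs = [abs(np.corrcoef(y, s[j])[0, 1]) for j in range(2)]
```

Which of the two ICs is found depends on the random initialization, as the paper's convergence result predicts.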
\n\n4 One-Unit ICA Learning Rules for Whitened Data\n\nWhitening, also called sphering, is a very useful preprocessing technique. It speeds up the convergence considerably, makes the learning numerically more stable, and allows some interesting modifications of the learning rules. Whitening means that the observed vector x is linearly transformed to a vector v = Ux such that its elements vi are mutually uncorrelated and all have unit variance (Comon, 1994). Thus the correlation matrix of v equals unity: E{v v^T} = I. This transformation is always possible and can be accomplished by classical Principal Component Analysis. At the same time, the dimensionality of the data should be reduced so that the dimension of the transformed data vector v equals n, the number of independent components. This also has the effect of reducing noise.\n\nLet us thus suppose that the observed signal v(t) is whitened (sphered). Then, in order to separate one of the components of negative kurtosis, we can modify the learning rule (4) so as to get the following learning rule for the weight vector w:\n\nΔw(t) ∝ v(t) g-(w(t)^T v(t)) - w(t)    (6)\n\nHere, the function g- is the same polynomial as above: g-(u) = au - bu^3 with a > 1 and b > 0. This modification is valid because we now have E{v(w^T v)} = w, and thus we can add +w(t) in the linear part of g- and subtract w(t) explicitly afterwards. The modification is useful because it allows us to approximate g- with the 'tanh' function, as w(t)^T v(t) then stays in the range where this approximation is valid. Thus we get what is perhaps the simplest possible stable Hebbian learning rule for a nonlinear Perceptron.
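The whitening step can be sketched as follows. This is our own construction via an eigendecomposition of the sample covariance (the symmetric choice U = C^{-1/2} is one of several valid whitening matrices; the paper only requires E{v v^T} = I):

```python
import numpy as np

# Build whitened data v = Ux with identity covariance from raw mixtures.
rng = np.random.default_rng(3)
T = 20000
s = rng.uniform(-np.sqrt(3), np.sqrt(3), (2, T))
A = np.array([[1.0, 0.6],
              [0.4, 1.2]])
x = A @ s
x = x - x.mean(axis=1, keepdims=True)   # data must be zero-mean

C = np.cov(x)                           # sample covariance E{x x^T}
d, E = np.linalg.eigh(C)                # C = E diag(d) E^T
U = E @ np.diag(d ** -0.5) @ E.T        # symmetric whitening matrix C^{-1/2}
v = U @ x                               # whitened data: cov(v) = I
cov_v = np.cov(v)
```

Rule (6) would then be run on v instead of x; dimensionality reduction, when n < m, corresponds to keeping only the n largest eigenvalues in d.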
\n\nTo separate one of the components of positive kurtosis, rule (5) simplifies to:\n\nΔw(t) ∝ b v(t) (w(t)^T v(t))^3 - a ||w(t)||^4 w(t)    (7)\n\n5 Multi-Unit ICA Learning Rules\n\nIf estimation of several independent components is desired, it is possible to construct a neural network by combining N (1 <= N <= n) neurons that learn according to the learning rules given above, and adding a feedback term to each of those learning rules. A discussion of such networks can be found in (Hyvärinen and Oja, 1996b).\n\n6 Fixed-Point Algorithm for ICA\n\nThe advantage of neural on-line learning rules like those introduced above is that the inputs v(t) can be used in the algorithm at once, thus enabling faster adaptation in a non-stationary environment. A resulting trade-off, however, is that the convergence is slow and depends on a good choice of the learning rate sequence, i.e. the step size at each iteration. A bad choice of the learning rate can, in practice, destroy convergence. Therefore, some ways to make the learning radically faster and more reliable may be needed. The fixed-point iteration algorithms are such an alternative. Based on the learning rules introduced above, we introduce here a fixed-point algorithm, whose convergence is proven and analyzed in detail in (Hyvärinen and Oja, 1997). For simplicity, we only consider the case of whitened data here.\n\nConsider the general neural learning rule trying to find the extrema of kurtosis. In a fixed point of such a learning rule, the sum of the gradient of the kurtosis and the penalty term must equal zero: E{v (w^T v)^3} - 3||w||^2 w + f(||w||^2) w = 0. The solutions of this equation must satisfy\n\nw = (3||w||^2 - f(||w||^2))^{-1} E{v (w^T v)^3}    (8)\n\nActually, because the norm of w is irrelevant, it is the direction of the right-hand side that is important. Therefore the scalar in eq. 
(8) is not significant, and its effect can be replaced by explicit normalization.\n\nAssume now that we have collected a sample of the random vector v, which is a whitened (or sphered) version of the vector x in eq. (1). Using (8), we obtain the following fixed-point algorithm for ICA:\n\n1. Take a random initial vector w(0) of norm 1. Let k = 1.\n2. Let w(k) = E{v (w(k-1)^T v)^3} - 3w(k-1). The expectation can be estimated using a large sample of v vectors (say, 1,000 points).\n3. Divide w(k) by its norm.\n4. If |w(k)^T w(k-1)| is not close enough to 1, let k = k + 1 and go back to step 2. Otherwise, output the vector w(k).\n\nThe final vector w* = lim_k w(k) given by the algorithm separates one of the non-Gaussian ICs in the sense that w*^T v equals one of the ICs sj. No distinction between components of positive or negative kurtosis is needed here. A remarkable property of our algorithm is that a very small number of iterations, usually 5-10, seems to be enough to obtain the maximal accuracy allowed by the sample data. This is due to the fact that the convergence of the fixed-point algorithm is in fact cubic, as shown in (Hyvärinen and Oja, 1997).\n\nTo estimate N ICs, we run this algorithm N times. To ensure that we estimate each time a different IC, we only need to add a simple projection inside the loop, which forces the solution vector w(k) to be orthogonal to the previously found solutions. This is possible because the desired weight vectors are orthonormal for whitened data (Hyvärinen and Oja, 1996b; Karhunen et al., 1997). Symmetric methods of orthogonalization may also be used (Hyvärinen, 1997).
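The four steps above can be sketched directly in code. Everything below except the iteration w(k) = E{v (w(k-1)^T v)^3} - 3w(k-1) itself (the toy data, the whitening construction, the tolerance, and the iteration cap) is our own choice:

```python
import numpy as np

# Toy data: one sub-Gaussian and one super-Gaussian IC, mixed and whitened.
rng = np.random.default_rng(4)
T = 20000
s = np.vstack([
    rng.uniform(-np.sqrt(3), np.sqrt(3), T),
    rng.laplace(0, 1 / np.sqrt(2), T),
])
A = np.array([[1.0, 0.6],
              [0.4, 1.2]])
x = A @ s
d, E = np.linalg.eigh(np.cov(x))
v = (E @ np.diag(d ** -0.5) @ E.T) @ x   # whitened observations

# Fixed-point iteration, expectation estimated over the whole sample.
w = rng.standard_normal(2)               # step 1: random unit-norm start
w /= np.linalg.norm(w)
for k in range(1, 101):
    w_new = np.mean(v * (w @ v) ** 3, axis=1) - 3 * w   # step 2
    w_new /= np.linalg.norm(w_new)                      # step 3
    converged = abs(w_new @ w) > 1 - 1e-10              # step 4
    w = w_new
    if converged:
        break
iterations = k

y = w @ v
corrs = [abs(np.corrcoef(y, s[j])[0, 1]) for j in range(2)]
```

Note that for a sub-Gaussian IC the iterate flips sign at every step, which is why step 4 compares |w(k)^T w(k-1)| rather than the plain inner product.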
\n\nThis fixed-point algorithm has several advantages when compared to other suggested ICA methods. First, the convergence of our algorithm is cubic. This means very fast convergence and is rather unique among the ICA algorithms. Second, contrary to gradient-based algorithms, there is no learning rate or other adjustable parameters in the algorithm, which makes it easy to use and more reliable. Third, components of both positive and negative kurtosis can be directly estimated by the same fixed-point algorithm.\n\n7 Generalizations of Kurtosis\n\nIn the learning rules introduced above, we used kurtosis as an optimization criterion for ICA estimation. This approach can be generalized to a large class of such optimization criteria, called contrast functions. For the case of on-line learning rules, this approach is developed in (Hyvärinen and Oja, 1996a), in which it is shown that the function g in the learning rules in Section 4 can in fact be replaced by practically any non-linear function (provided that w is normalized properly). Whether one must use Hebbian or anti-Hebbian learning is then determined by the sign of certain 'non-polynomial cumulants'. The utility of such a generalization is that one can then choose the non-linearity according to some statistical optimality criteria, such as robustness against outliers.\n\nThe fixed-point algorithm may also be generalized for an arbitrary non-linearity, say g. Step 2 in the fixed-point algorithm then becomes (for whitened data) (Hyvärinen, 1997): w(k) = E{v g(w(k-1)^T v)} - E{g'(w(k-1)^T v)} w(k-1).\n\n8 Experiments\n\nA visually appealing way of demonstrating how ICA algorithms work is to use them to separate images from their linear mixtures. On the left in Fig. 1, four superimposed mixtures of 4 unknown images are depicted.
Defining the j-th IC sj to be the gray-level value of the j-th image in a given position, and scanning the 4 images simultaneously pixel by pixel, we can use the ICA model and recover the original images. For example, we ran the fixed-point algorithm four times, estimating the four images shown on the right in Fig. 1. The algorithm needed on average 7 iterations for each IC.\n\nFigure 1: Three photographs of natural scenes and a noise image were linearly mixed to illustrate our algorithms. The mixtures are depicted on the left. On the right, the images recovered by the fixed-point algorithm are shown.\n\nReferences\n\nBell, A. and Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159.\n\nBell, A. and Sejnowski, T. J. (1996). Edges are the independent components of natural scenes. In NIPS'96, Denver, Colorado.\n\nCardoso, J.-F. and Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12).\n\nComon, P. (1994). Independent component analysis - a new concept? Signal Processing, 36:287-314.\n\nDelfosse, N. and Loubaton, P. (1995). Adaptive blind separation of independent sources: a deflation approach. Signal Processing, 45:59-83.\n\nDonoho, D. (1981). On minimum entropy deconvolution. In Applied Time Series Analysis II. Academic Press.\n\nHurri, J., Hyvärinen, A., Karhunen, J., and Oja, E. (1996). Image feature extraction using independent component analysis. In Proc. NORSIG'96, Espoo, Finland.\n\nHyvärinen, A. (1997). A family of fixed-point algorithms for independent component analysis. In Proc. ICASSP'97, Munich, Germany.\n\nHyvärinen, A. and Oja, E. (1996a). 
Independent component analysis by general nonlinear Hebbian-like learning rules. Technical Report A41, Helsinki University of Technology, Laboratory of Computer and Information Science.\n\nHyvärinen, A. and Oja, E. (1996b). Simple neuron models for independent component analysis. Technical Report A37, Helsinki University of Technology, Laboratory of Computer and Information Science.\n\nHyvärinen, A. and Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation. To appear.\n\nJutten, C. and Herault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24:1-10.\n\nKarhunen, J., Oja, E., Wang, L., Vigario, R., and Joutsensalo, J. (1997). A class of neural networks for independent component analysis. IEEE Trans. on Neural Networks. To appear.\n\nOja, E. (1995). The nonlinear PCA learning rule and signal separation - mathematical analysis. Technical Report A 26, Helsinki University of Technology, Laboratory of Computer and Information Science. Submitted to a journal.\n", "award": [], "sourceid": 1315, "authors": [{"given_name": "Aapo", "family_name": "Hyv\u00e4rinen", "institution": null}, {"given_name": "Erkki", "family_name": "Oja", "institution": null}]}