{"title": "Non-Linear Statistical Analysis and Self-Organizing Hebbian Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 407, "page_last": 414, "abstract": null, "full_text": "Non-linear Statistical Analysis and Self-Organizing Hebbian Networks\n\nJonathan L. Shapiro and Adam Pr\u00fcgel-Bennett\n\nDepartment of Computer Science\nThe University, Manchester\nManchester, UK\nM13 9PL\n\nAbstract\n\nNeurons learning under an unsupervised Hebbian learning rule can perform a nonlinear generalization of principal component analysis. This relationship between nonlinear PCA and nonlinear neurons is reviewed. The stable fixed points of the neuron learning dynamics correspond to the maxima of the statistic optimized under nonlinear PCA. However, in order to predict what the neuron learns, knowledge of the basins of attraction of the neuron dynamics is required. Here the correspondence between nonlinear PCA and neural networks breaks down. This is shown for a simple model. Methods of statistical mechanics can be used to find the optima of the objective function of non-linear PCA. This determines what the neurons can learn. In order to find how the solutions are partitioned among the neurons, however, one must solve the dynamics.\n\n1 INTRODUCTION\n\nLinear neurons learning under an unsupervised Hebbian rule can learn to perform a linear statistical analysis of the input data. This was first shown by Oja (1982), who proposed a learning rule which finds the first principal component of the variance matrix of the input data. Based on this model, Oja (1989), Sanger (1989), and many others have devised numerous neural networks which find many components of this matrix. These networks perform principal component analysis (PCA), a well-known method of statistical analysis. 
\n\nSince PCA is a form of linear analysis, and the neurons used in the PCA networks are linear (the output of these neurons is equal to the weighted sum of inputs; there is no squashing function or sigmoid), it is natural to ask whether non-linear Hebbian neurons compute some form of non-linear PCA. Is this a useful way to understand the performance of the networks? Do these networks learn to extract features of the input data which are different from those learned by linear neurons? Currently in the literature, the phrase \"non-linear PCA\" is used to describe what is learned by any non-linear generalization of Oja neurons or other PCA networks (see for example, Oja, 1993 and Taylor, 1993).\n\nIn this paper, we discuss the relationship between a particular form of non-linear Hebbian neurons (Pr\u00fcgel-Bennett and Shapiro, 1992) and a particular generalization of non-linear PCA (Softky and Kammen, 1991). It is clear that non-linear neurons can perform very differently from linear ones. This has been shown through analysis (Pr\u00fcgel-Bennett and Shapiro, 1993) and in application (Karhunen and Joutsensalo, 1992). Non-linear PCA can also be a very useful way of understanding what the neurons learn. This is because non-linear PCA is equivalent to maximizing some objective function. The features that this extracts from a data set can be studied using techniques of statistical mechanics. However, non-linear PCA is ambiguous because there are multiple solutions. What the neuron can learn is given by non-linear PCA. The likelihood of learning the different solutions is governed by the dynamics chosen to implement non-linear PCA, and may differ in different implementations of the dynamics. 
\n\n2 NON-LINEAR HEBBIAN NEURONS\n\nNeurons with non-linear activation functions can learn to perform very different tasks from those learned by linear neurons. Nonlinear Hebbian neurons have been analyzed for general non-linearities by Oja (1991), and were applied to sinusoidal signal detection by Karhunen and Joutsensalo (1992).\n\nPreviously, we analysed a simple non-linear generalization of Oja's rule (Pr\u00fcgel-Bennett and Shapiro, 1993). We showed how the shape of the neuron activation function can control what a neuron learns. Whereas linear neurons learn a statistical mixture of all of the input patterns, non-linear neurons can learn to become tuned to individual patterns, or to small clusters of closely correlated patterns.\n\nIn this model, each neuron has weights, where $w_i$ is the weight from the $i$th input, and responds to the usual sum of inputs times weights through an activation function $A(y)$. This is assumed to be a simple power law above a threshold and zero below it, i.e.\n\n$$A(V^p) = (V^p - \\phi)^b \\quad \\text{for } V^p > \\phi, \\qquad A(V^p) = 0 \\quad \\text{otherwise}. \\qquad (1)$$\n\nHere $\\phi$ is the threshold, $b$ controls the power of the power law, $x_i^p$ is the $i$th component of the $p$th pattern, and $V^p = \\sum_i x_i^p w_i$. Curves of these functions are shown in figure 1a; if $b = 1$ the neurons are threshold-linear. For $b > 1$ the curves can be thought of as low-activation approximations to a sigmoid, which is shown in figure 1b.\n\nFigure 1: a) The form of the neuron activation function. It is controlled by two parameters, $b$ and $\\phi$. When $b > 1$, this activation function approximates a sigmoid, which is shown in b).\n\nThe generalization of Oja's learning rule is that the change in the weights $\\delta w_i$ 
\n\nis given by\n\n$$\\delta w_i = \\sum_p A(V^p) \\left[ x_i^p - V^p w_i \\right]. \\qquad (2)$$\n\nIf $b < 1$, the neuron learns to average a set of patterns. If $b = 1$, the neuron finds the principal component of the pattern set. When $b > 1$, the neuron learns to distinguish one of the patterns in the presence of the others, if those others are not too correlated with the pattern. There is a critical correlation which is determined by $b$; the neuron learns individual patterns which are less correlated than the critical value, but learns something like the center of the cluster if the patterns are more correlated. The threshold controls the size of the subset of patterns which the neuron can respond to.\n\nFor these neurons, the relationship between non-linear PCA and the activation function was not previously discussed. That is done in the next section.\n\n3 NON-LINEAR PCA\n\nA non-linear generalization of PCA was proposed by Softky and Kammen (1991). In this section, the relationship between non-linear PCA and unsupervised Hebbian learning is reviewed.\n\n3.1 WHAT IS NON-LINEAR PCA\n\nThe principal component of a set of data is the direction which maximises the variance, i.e. to find the principal component of the data set, find the vector $\\mathbf{w}$ of unit length which maximises\n\n$$\\left\\langle \\Big( \\sum_i w_i x_i \\Big)^2 \\right\\rangle. \\qquad (3)$$\n\nHere, $x_i$ denotes the $i$th component of an input pattern and $\\langle \\cdots \\rangle$ denotes the average over the patterns. Softky and Kammen suggested that an appropriate generalization is to find the vector $\\mathbf{w}$ which maximizes the $d$-dimensional correlation,\n\n$$\\left\\langle \\Big( \\sum_i w_i x_i \\Big)^d \\right\\rangle. \\qquad (4)$$\n\nThey argued this would give interesting results if higher order correlations are important, or if the shape of the data cloud is not second order.  
This can be generalized further, of course, by maximizing the average of any non-linear function of the input $U(y)$,\n\n$$\\left\\langle U\\Big( \\sum_i w_i x_i \\Big) \\right\\rangle. \\qquad (5)$$\n\nThe equations for the principal components are easily found using Lagrange multipliers. The extremal points are given by\n\n$$\\left\\langle U'\\Big( \\sum_k w_k x_k \\Big) x_i \\right\\rangle = \\lambda w_i. \\qquad (6)$$\n\nThese points will be (local) maxima if the Hessian $\\mathcal{H}_{ij}$,\n\n$$\\mathcal{H}_{ij} = \\left\\langle U''\\Big( \\sum_k w_k x_k \\Big) x_i x_j \\right\\rangle - \\lambda \\delta_{ij}, \\qquad (7)$$\n\nis negative definite. Here, $\\lambda$ is a Lagrange multiplier chosen to make $|\\mathbf{w}|^2 = 1$.\n\n3.2 NEURONS WHICH LEARN PCA\n\nA neuron learning via an unsupervised Hebbian learning rule can perform this optimization. This is done by associating $w_i$ with the weight from the $i$th input to the neuron, and the data average $\\langle \\cdot \\rangle$ with the sum over input patterns $x_i^p$. The nonlinear function which is optimized is determined by the integral of the activation function of the neuron,\n\n$$A(y) = U'(y).$$\n\nIn their paper, Softky and Kammen propose a learning rule which does not perform this optimization in general. The correct learning rule is a generalization of Oja's rule (equation (2) above); in this notation, with $V = \\sum_k w_k x_k$,\n\n$$\\delta w_i = \\left\\langle A(V) \\left[ x_i - V w_i \\right] \\right\\rangle. \\qquad (8)$$\n\nThe fixed points of this dynamical equation will be solutions to the extremal equation of nonlinear PCA, equation (6), when the associations\n\n$$\\lambda = \\langle A(V) V \\rangle$$\n\nand\n\n$$A(y) = U'(y)$$\n\nare made.\n\nHere $\\langle \\cdot \\rangle$ is interpreted as a sum over patterns; this is batch learning. The rule can also be used incrementally, but then the dynamics are stochastic and the optimization might be performed only on average, and then maybe only for small enough learning rates. These fixed points will be stable when the Hessian $\\mathcal{H}_{ij}$ is negative definite at the fixed point.  
This Hessian is now the same as the previous one, equation (7), in directions perpendicular to the fixed point, but contains additional terms in the direction of the fixed point which normalize it.\n\nThe neurons described in section 2 would perform precisely what Softky and Kammen proposed if the activation function were a pure power law and not thresholded; as it is, they maximize a more complicated objective function.\n\nSince there is a one-to-one correspondence between the stable fixed points of the dynamics and the local maxima of the non-linear correlation measure, one says that these non-linear neurons compute non-linear PCA.\n\n3.3 THEORETICAL STUDIES OF NONLINEAR PCA\n\nIn order to understand what these neurons learn, we have studied the networks learning on model data drawn from statistical distributions. For very dense clusters, $P \\to \\infty$ with $N$ fixed, the stable fixed point equations are algebraic. In a few simple cases they can be solved. For example, if the data is Gaussian or if the data cloud is a quadratic cloud (a function of a quadratic form), the neuron learns the principal component, like the linear neuron. Likewise, if the patterns are not random, the fixed point equations can be solved in some cases.\n\nFor a large number of patterns in high dimensions, fluctuations in the data are important ($N$ and $P$ go to $\\infty$ together in some way). In this case, methods of statistical mechanics can be used to average over the data. The objective function of the non-linear PCA acts as (minus) the energy in statistical mechanics. The free energy is formally,\n\n$$F = \\left\\langle \\log \\left( \\prod_i \\int dw_i \\, \\delta\\Big( \\sum_i w_i^2 - 1 \\Big) \\exp\\big[ \\beta U(V) \\big] \\right) \\right\\rangle. \\qquad (10)$$\n\nIn the limit that $\\beta$ is large, this calculation finds the local maxima of $U$.  
In this form of analysis, the fact that the neuron optimizes an objective function is very important. This technique was used to produce the results outlined in section 2.\n\n3.4 WHAT NON-LINEAR PCA FAILS TO REVEAL\n\nIn linear PCA, there is one unique solution, or if there are many solutions it is because the solutions are degenerate. However, for the non-linear situation, there are many stable fixed points of the dynamics and many local maxima of the non-linear correlation measure.\n\nThis has two effects. First, it means that you cannot predict what the neuron will learn simply by studying fixed point equations. This tells you what the neuron might learn, but the probability that a given solution will be learned can only be ascertained if the dynamics are understood. Second, it breaks the relationship between non-linear PCA and the neurons, because, in principle, there could be other dynamics which have the same fixed point structure, but do not have the same basins of attraction. Simple fixed point analysis would be incapable of predicting what these neurons would learn.\n\n4 PARTITIONING\n\nAn important question which the fixed-point analysis, or the corresponding statistical mechanics, cannot address is: what is the likelihood of learning the different solutions? This is the essential ambiguity of non-linear PCA: there are many solutions and the size of the basin of attraction of each is determined by the dynamics, not by the local maxima of the nonlinear correlation measure.\n\nAs an example, we consider the partitioning of the neurons described in section 2. These neurons act much like neurons in competitive networks: they become tuned to individual patterns or highly correlated clusters.  
Given that the density of patterns in the input set is $p(i)$, what is the probability $P(i)$ that a neuron will become tuned to this pattern? It is often said that the desired result should be $P(i) = p(i)$, although for Kohonen 1-d feature maps it has been shown to be $P(i) = p(i)^{2/3}$ (see for example, Hertz, Krogh, and Palmer 1991).\n\nWe have found that the partitioning cannot be calculated by finding the optima of the objective function. For example, in the case of weakly correlated patterns, the global maximum is the most likely pattern, whereas all of the patterns are local maxima. To determine the partitioning, the basin of attraction of each pattern must be computed. This could be different for different dynamics with the same fixed point structure.\n\nIn order to determine the partitioning, the dynamics must be understood. The details will be described elsewhere (Pr\u00fcgel-Bennett and Shapiro, 1994). For the case of weakly correlated patterns, a neuron will learn a pattern $p$ for which\n\n$$p(x^p) (V_0^p)^{b-1} > p(x^q) (V_0^q)^{b-1} \\qquad \\forall q \\neq p.$$\n\nHere $V_0^p$ is the initial overlap (before learning) of the neuron's weights with the $p$th pattern. This defines the basin of attraction for each pattern.\n\nIn the large $P$ limit and for random patterns\n\n$$P(i) \\approx p(i)^{\\alpha}, \\qquad (11)$$\n\nwhere $\\alpha \\approx 2\\log(P)/(b-1)$, $P$ is the number of patterns, and where $b$ is a parameter that controls the non-linearity of the neuron's response. If $b$ is chosen so that $\\alpha = 1$, then the probability of a neuron learning a pattern will be proportional to the frequency with which the pattern is presented.\n\n5 CONCLUSIONS\n\nThe relationship between a non-linear generalization of Oja's rule and a non-linear generalization of PCA was reviewed.  
Non-linear PCA is equivalent to maximizing an objective function which is a statistical measure of the data set. The objective function optimized is determined by the form of the activation function of the neuron. Viewing the neuron in this way is useful, because rather than solving the dynamics, one can use methods of statistical mechanics or other methods to find the maxima of the objective function. Since this function has many local maxima, however, these techniques cannot determine how the solutions are partitioned among the neurons. To determine this, the dynamics must be solved.\n\nAcknowledgements\n\nThis work was supported by SERC grant GRG20912.\n\nReferences\n\nJ. Hertz, A. Krogh, and R.G. Palmer. (1991) Introduction to the Theory of Neural Computation. Addison-Wesley.\n\nJ. Karhunen and J. Joutsensalo. (1992) Nonlinear Hebbian algorithms for sinusoidal frequency estimation, in Artificial Neural Networks, 2, I. Aleksander and J. Taylor, editors, North-Holland.\n\nErkki Oja. (1982) A simplified neuron model as a principal component analyzer. J. Math. Bio., 15:267-273.\n\nErkki Oja. (1989) Neural networks, principal components, and subspaces. Int. J. of Neural Systems, 1(1):61-68.\n\nE. Oja, H. Ogawa, and J. Wangviwattan. (1992) Principal Component Analysis by homogeneous neural networks: Part II: analysis and extension of the learning algorithms. IEICE Trans. on Information and Systems, E75-D, 3, pp 376-382.\n\nE. Oja. (1993) Nonlinear PCA: algorithms and applications, in Proceedings of World Congress on Neural Networks, Portland, Or. 1993.\n\nA. Pr\u00fcgel-Bennett and Jonathan L. Shapiro. (1993) Statistical Mechanics of Unsupervised Hebbian Learning. J. Phys. A: 26, 2343.\n\nA. Pr\u00fcgel-Bennett and Jonathan L. Shapiro. 
(1994) The Partitioning Problem for Unsupervised Learning for Non-linear Neurons. J. Phys. A, to appear.\n\nT. D. Sanger. (1989) Optimal Unsupervised Learning in a Single-Layer Linear Feedforward Neural Network. Neural Networks 2, 459-473.\n\nJonathan L. Shapiro and A. Pr\u00fcgel-Bennett. (1992) Unsupervised Hebbian Learning and the Shape of the Neuron Activation Function, in Artificial Neural Networks, 2, I. Aleksander and J. Taylor, editors, North-Holland.\n\nW. Softky and D. Kammen. (1991) Correlations in High Dimensional or Asymmetric Data Sets: Hebbian Neuronal Processing. Neural Networks 4, pp 337-347.\n\nJ. Taylor. (1993) Forms of Memory, in Proceedings of World Congress on Neural Networks, Portland, Or. 1993.\n", "award": [], "sourceid": 862, "authors": [{"given_name": "Jonathan", "family_name": "Shapiro", "institution": null}, {"given_name": "Adam", "family_name": "Pr\u00fcgel-Bennett", "institution": null}]}