{"title": "Neural Learning in Structured Parameter Spaces - Natural Riemannian Gradient", "book": "Advances in Neural Information Processing Systems", "page_first": 127, "page_last": 133, "abstract": null, "full_text": "Neural Learning in Structured Parameter Spaces\n\nNatural Riemannian Gradient\n\nShun-ichi Amari\n\nRIKEN Frontier Research Program, RIKEN,\n\nHirosawa 2-1, Wako-shi 351-01, Japan\n\namari@zoo.riken.go.jp\n\nAbstract\n\nThe parameter space of neural networks has a Riemannian metric structure. The natural Riemannian gradient should be used instead of the conventional gradient, since the former denotes the true steepest descent direction of a loss function in the Riemannian space. The stochastic gradient learning algorithm behaves much more effectively when the natural gradient is used. The present paper studies the information-geometrical structure of perceptrons and other networks, and proves that the on-line learning method based on the natural gradient is asymptotically as efficient as the optimal batch algorithm. Adaptive modification of the learning constant is proposed, analyzed in terms of the Riemannian measure, and shown to be efficient. The natural gradient is finally applied to blind separation of mixed independent signal sources.\n\n1 Introduction\n\nNeural learning takes place in the parameter space of modifiable synaptic weights of a neural network. The role of each parameter is different in the neural network, so the parameter space is structured in this sense. The Riemannian structure, which represents a local distance measure, is introduced in the parameter space by information geometry (Amari, 1985). 
\n\nOn-line learning is mostly based on the stochastic gradient descent method, where the current weight vector is modified in the gradient direction of a loss function. However, the ordinary gradient does not represent the steepest direction of a loss function in the Riemannian space. A geometrical modification is necessary, and the modified gradient is called the natural Riemannian gradient. The present paper studies the remarkable effects of using the natural Riemannian gradient in neural learning.\n\nWe first study the asymptotic behavior of on-line learning (Opper, NIPS'95 Workshop). Batch learning uses all the examples at any time to obtain the optimal weight vector, whereas on-line learning uses an example once when it is observed. Hence, in general, the target weight vector is estimated more accurately in the case of batch learning. However, we prove that, when the Riemannian gradient is used, on-line learning is asymptotically as efficient as optimal batch learning.\n\nOn-line learning is useful when the target vector fluctuates slowly (Amari, 1967). In this case, we need to modify a learning constant $\\eta_t$ depending on how far the current weight vector is located from the target function. We present an algorithm that adaptively changes the learning constant based on the Riemannian criterion and prove that it gives asymptotically optimal behavior. This is a generalization of the idea of Sompolinsky et al. [1995].\n\nWe then ask what Riemannian structure should be introduced in the parameter space of synaptic weights, and answer this question from the point of view of information geometry (Amari [1985, 1995], Amari et al. [1992]). The explicit forms of the Riemannian metric and its inverse matrix are given in the case of simple perceptrons. 
\n\nWe finally show how the Riemannian gradient is applied to blind separation of mixed independent signal sources. Here, the mixing matrix is unknown, so the parameter space is the space of matrices. The Riemannian structure is introduced in it. The natural Riemannian gradient is computationally much simpler and more effective than the conventional gradient.\n\n2 Stochastic Gradient Descent and On-Line Learning\n\nLet us consider a neural network which is specified by a vector parameter $w = (w_1, \\ldots, w_n) \\in R^n$. The parameter $w$ is composed of modifiable connection weights and thresholds. Let us denote by $l(x, w)$ a loss when input signal $x$ is processed by a network having parameter $w$. In the case of multilayer perceptrons, a desired output or teacher signal $y$ is associated with $x$, and a typical loss is given by\n\n$$l(x, y, w) = \\frac{1}{2} \\| y - f(x, w) \\|^2, \\tag{1}$$\n\nwhere $z = f(x, w)$ is the output from the network.\n\nWhen input $x$, or input-output training pair $(x, y)$, is generated from a fixed probability distribution, the expected loss $L(w)$ of the network specified by $w$ is\n\n$$L(w) = E[l(x, y; w)], \\tag{2}$$\n\nwhere $E$ denotes the expectation. A neural network is trained by using training examples $(x_1, y_1), (x_2, y_2), \\ldots$ to obtain the optimal network parameter $w^*$ that minimizes $L(w)$. If $L(w)$ is known, the gradient method is described by\n\n$$w_{t+1} = w_t - \\eta_t \\nabla L(w_t), \\quad t = 1, 2, \\ldots,$$\n\nwhere $\\eta_t$ is a learning constant depending on $t$ and $\\nabla L = \\partial L / \\partial w$. Usually $L(w)$ is unknown. The stochastic gradient learning method\n\n$$w_{t+1} = w_t - \\eta_t \\nabla l(x_{t+1}, y_{t+1}; w_t) \\tag{3}$$\n\nwas proposed in an old paper (Amari [1967]). This method has become popular since Rumelhart et al. [1986] rediscovered it. It is expected that, when $\\eta_t$ converges to 0 in a certain manner, the above $w_t$ converges to $w^*$. 
The dynamical behavior of (3) was studied by Amari [1967], Heskes and Kappen [1992] and many others when $\\eta_t$ is a constant.\n\nIt was also shown in Amari [1967] that\n\n$$w_{t+1} = w_t - \\eta_t C \\nabla l(x_{t+1}, y_{t+1}; w_t) \\tag{4}$$\n\nworks well for any positive-definite matrix $C$, in particular for the metric $G$. Geometrically speaking, $\\partial l / \\partial w$ is a covariant vector while $\\Delta w_t = w_{t+1} - w_t$ is a contravariant vector. Therefore, it is natural to use a (contravariant) metric tensor $G^{-1}$ to convert the covariant gradient into the contravariant form\n\n$$\\tilde{\\nabla} l = G^{-1} \\frac{\\partial l}{\\partial w} = \\Bigl( \\sum_j g^{ij} \\frac{\\partial}{\\partial w_j} l(w) \\Bigr), \\tag{5}$$\n\nwhere $G^{-1} = (g^{ij})$ is the inverse matrix of $G = (g_{ij})$. The present paper studies how the metric tensor $G$ should be defined in neural learning and how effective the new gradient learning rule\n\n$$w_{t+1} = w_t - \\eta_t \\tilde{\\nabla} l(x_{t+1}, y_{t+1}; w_t) \\tag{6}$$\n\nis.\n\n3 Gradient in Riemannian spaces\n\nLet $S = \\{w\\}$ be the parameter space and let $l(w)$ be a function defined on $S$. When $S$ is a Euclidean space and $w$ is an orthonormal coordinate system, the squared length of a small incremental vector $dw$ connecting $w$ and $w + dw$ is given by\n\n$$|dw|^2 = \\sum_{i=1}^{n} (dw_i)^2. \\tag{7}$$\n\nHowever, when the coordinate system is non-orthonormal or the space $S$ is Riemannian, the squared length is given by a quadratic form\n\n$$|dw|^2 = \\sum_{i,j} g_{ij}(w) \\, dw_i \\, dw_j = dw' G \\, dw. \\tag{8}$$\n\nHere, the matrix $G = (g_{ij})$ depends in general on $w$ and is called the metric tensor. It reduces to the identity matrix $I$ in the Euclidean orthonormal case. It will be shown soon that the parameter space $S$ of neural networks has Riemannian structure (see Amari et al. [1992], Amari [1995], etc.). 
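As an illustrative sketch (not part of the original paper), the following pure-Python fragment shows the two ingredients discussed in Sections 2 and 3: the Riemannian squared length dw' G dw of eq. (8), and one step of a learning rule that converts the ordinary gradient with the inverse metric G^{-1} before updating the weights. The helper names solve, squared_length and natural_gradient_step, the toy quadratic loss, and the choice G = A are all assumptions made for illustration only.

```python
def solve(G, b):
    # Solve G x = b by Gaussian elimination with partial pivoting
    # (intended for small dense systems only).
    n = len(b)
    M = [list(G[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def squared_length(G, dw):
    # Riemannian squared length dw' G dw, as in eq. (8).
    n = len(dw)
    return sum(G[i][j] * dw[i] * dw[j] for i in range(n) for j in range(n))


def natural_gradient_step(w, grad, G, eta):
    # One update w <- w - eta * G^{-1} grad, where G is the metric at w.
    nat = solve(G, grad)
    return [wi - eta * ni for wi, ni in zip(w, nat)]


# Toy quadratic loss l(w) = 0.5 * w' A w, whose ordinary gradient is A w.
A = [[4.0, 1.0], [1.0, 3.0]]
w0 = [1.0, -2.0]
grad0 = [sum(A[i][j] * w0[j] for j in range(2)) for i in range(2)]

# Illustrative assumption: take the metric G equal to A.
w1 = natural_gradient_step(w0, grad0, A, eta=1.0)
```

With this particular choice of metric and eta = 1, the single step lands on the minimizer w = 0 of the toy loss, whereas an ordinary gradient step with the same eta would move to (-1, 3) and increase the loss. With G equal to the identity matrix, squared_length reduces to the Euclidean case of eq. (7).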
\nThe steepest descent direction of a function $l(w)$ at $w$ is defined by a vector $dw$ that minimizes $l(w + dw)$ under the constraint $|dw|^2 = \\varepsilon^2$ (see eq. (8)) for a sufficiently small constant $\\varepsilon$.\n\nLemma 1. The steepest descent direction of $l(w)$ in a Riemannian space is given by\n\n$$-\\tilde{\\nabla} l(w) = -G^{-1}(w) \\nabla l(w).$$\n\nWe call\n\n$$\\tilde{\\nabla} l(w) = G^{-1}(w) \\nabla l(w)$$\n\nthe natural gradient of $l(w)$ in the Riemannian space. It shows the steepest descent direction of $l$, and is nothing but the contravariant form of $\\nabla l$ in the tensor notation. When the space is Euclidean and the coordinate system is orthonormal, $G$ is the unit matrix $I$ so that $\\tilde{\\nabla} l = \\nabla l$.\n\n4 Natural gradient gives efficient on-line learning\n\nLet us begin with the simplest case of noisy multilayer analog perceptrons. Given input $x$, the network emits output $z = f(x, w) + n$, where $f$ is a differentiable deterministic function of the multilayer perceptron with parameter $w$ and $n$ is a noise subject to the normal distribution $N(0, 1)$. The probability density of an input-output pair $(x, z)$ is given by\n\n$$p(x, z; w) = q(x) p(z | x; w),$$\n\nwhere $q(x)$ is the probability distribution of input $x$, and\n\n$$p(z | x; w) = \\frac{1}{\\sqrt{2\\pi}} \\exp\\Bigl\\{ -\\frac{1}{2} \\| z - f(x, w) \\|^2 \\Bigr\\}.$$\n\nThe squared error loss function (1) can be written as\n\n$$l(x, z, w) = -\\log p(x, z; w) + \\log q(x) - \\log \\sqrt{2\\pi}.$$\n\nHence, minimizing the loss is equivalent to maximizing the likelihood function $p(x, z; w)$.\n\nLet $D_T = \\{(x_1, z_1), \\ldots, (x_T, z_T)\\}$ be $T$ independent input-output examples generated by the network having the parameter $w^*$. 
Then the maximum likelihood estimator $\\hat{w}_T$ minimizes the log loss $l(x, z; w) = -\\log p(x, z; w)$ over the training data $D_T$, that is, it minimizes the training error\n\n$$L_{\\mathrm{train}}(w) = \\frac{1}{T} \\sum_{t=1}^{T} l(x_t, z_t; w). \\tag{9}$$\n\nThe maximum likelihood estimator is efficient (or Fisher-efficient), implying that it is the best consistent estimator satisfying the Cramer-Rao bound asymptotically,\n\n$$\\lim_{T \\to \\infty} T E[(\\hat{w}_T - w^*)(\\hat{w}_T - w^*)'] = G^{-1}, \\tag{10}$$\n\nwhere $G^{-1}$ is the inverse of the Fisher information matrix $G = (g_{ij})$ defined by\n\n$$g_{ij} = E\\Bigl[ \\frac{\\partial \\log p(x, z; w)}{\\partial w_i} \\frac{\\partial \\log p(x, z; w)}{\\partial w_j} \\Bigr] \\tag{11}$$\n\nin the component form. Information geometry (Amari, 1985) proves that the Fisher information $G$ is the only invariant metric to be introduced in the space $S = \\{w\\}$ of the parameters of probability distributions.\n\nExamples $(x_1, z_1), (x_2, z_2), \\ldots$ are given one at a time in the case of on-line learning. Let $\\tilde{w}_t$ be the estimated value at time $t$. At the next time $t + 1$, the estimator $\\tilde{w}_t$ is modified to give a new estimator $\\tilde{w}_{t+1}$ based on the observation $(x_{t+1}, z_{t+1})$. The old observations $(x_1, z_1), \\ldots, (x_t, z_t)$ cannot be reused to obtain $\\tilde{w}_{t+1}$, so that the learning rule is written as $\\tilde{w}_{t+1} = m(x_{t+1}, z_{t+1}, \\tilde{w}_t)$. The process $\\{\\tilde{w}_t\\}$ is hence Markovian. Whatever learning rule $m$ we choose, the behavior of the estimator $\\tilde{w}_t$ is never better than that of the optimal batch estimator $\\hat{w}_t$ because of this restriction. The conventional on-line learning rule is given by the following gradient form: $\\tilde{w}_{t+1} = \\tilde{w}_t - \\eta_t \\nabla l(x_{t+1}, z_{t+1}; \\tilde{w}_t)$. When $\\eta_t$ satisfies a certain condition, say $\\eta_t = c/t$, the stochastic approximation guarantees that $\\tilde{w}_t$ is a consistent estimator converging to $w^*$. However, it is not efficient in general.\n\nThere arises the question of whether there exists an on-line learning rule that gives an efficient estimator. 
If it exists, the asymptotic behavior of on-line learning is equivalent to that of the batch estimation method. The present paper answers the question by giving an efficient on-line learning rule\n\n$$\\tilde{w}_{t+1} = \\tilde{w}_t - \\frac{1}{t} \\tilde{\\nabla} l(x_{t+1}, z_{t+1}; \\tilde{w}_t). \\tag{12}$$\n\nTheorem 1. The natural gradient on-line learning rule gives a Fisher-efficient estimator, that is,\n\n$$\\lim_{t \\to \\infty} t E[(\\tilde{w}_t - w^*)(\\tilde{w}_t - w^*)'] = G^{-1}. \\tag{13}$$\n\n5 Adaptive modification of learning constant\n\nWe have proved that $\\eta_t = 1/t$ with the coefficient matrix $G^{-1}$ is the asymptotically best choice for on-line learning. However, when the target parameter $w^*$ is not fixed but fluctuates or changes suddenly, this choice is not good, because the learning system cannot follow the change if $\\eta_t$ is too small. It was proposed in Amari [1967] to choose $\\eta_t$ adaptively such that $\\eta_t$ becomes larger when the current target $w^*$ is far from $w_t$ and becomes smaller when it is close to $w_t$. However, no definite scheme was analyzed there. Sompolinsky et al. [1995] proposed an excellent scheme of an adaptive choice of $\\eta_t$ for deterministic dichotomy neural networks. We extend their idea to be applicable to stochastic cases, where the Riemannian structure plays a role.\n\nWe assume that $l(x, z; w)$ is differentiable with respect to $w$. (The non-differentiable case is usually more difficult to analyze. Sompolinsky et al. [1995] treated this case.) We moreover treat the realizable teacher so that $L(w^*) = 0$.\n\nWe propose the following learning scheme:\n\n$$w_{t+1} = w_t - \\eta_t \\tilde{\\nabla} l(x_{t+1}, z_{t+1}; w_t), \\tag{14}$$\n\n$$\\eta_{t+1} = \\eta_t + \\alpha \\eta_t [\\beta l(x_{t+1}, z_{t+1}; w_t) - \\eta_t], \\tag{15}$$\n\nwhere $\\alpha$ and $\\beta$ are constants. We analyze the dynamical behavior of learning by using the continuous version of the algorithm,\n\n$$\\frac{d}{dt} w_t = -\\eta_t \\tilde{\\nabla} l(x_t, z_t; w_t), \\tag{16}$$\n\n$$\\frac{d}{dt} \\eta_t = \\alpha \\eta_t [\\beta l(x_t, z_t; w_t) - \\eta_t]. \\tag{17}$$ 
\n\nIn order to show the dynamical behavior of $(w_t, \\eta_t)$, we use the averaged version of the above equations with respect to the current input-output pair $(x_t, z_t)$. We introduce the squared error variable\n\n$$e_t = \\frac{1}{2} (w_t - w^*)' G^* (w_t - w^*).$$\n\nBy using the average and continuous time version\n\n$$\\dot{w}_t = -\\eta_t G^{-1}(w_t) \\Bigl\\langle \\frac{\\partial}{\\partial w} l(x_t, z_t; w_t) \\Bigr\\rangle, \\qquad \\dot{\\eta}_t = \\alpha \\eta_t \\{ \\beta \\langle l(x_t, z_t; w_t) \\rangle - \\eta_t \\}, \\tag{18}$$\n\nwhere the dot denotes $d/dt$ and $\\langle \\cdot \\rangle$ the average over the current $(x, z)$, we have\n\n$$\\dot{e}_t = -2 \\eta_t e_t, \\tag{19}$$\n\n$$\\dot{\\eta}_t = \\alpha \\beta \\eta_t e_t - \\alpha \\eta_t^2. \\tag{20}$$\n\nThe behavior of the above equations is interesting: the origin $(0, 0)$ is its attractor. However, the basin of attraction has a fractal boundary. Starting from an adequate initial value, it has a solution of the form\n\n$$e_t \\approx \\frac{1}{\\beta} \\Bigl( \\frac{1}{2} - \\frac{1}{\\alpha} \\Bigr) \\frac{1}{t}, \\tag{21}$$\n\n$$\\eta_t \\approx \\frac{1}{2t}. \\tag{22}$$\n\nThis proves the $1/t$ convergence rate of the generalization error, which is optimal in order for any estimator $\\hat{w}_t$ converging to $w^*$.\n\n6 Riemannian structures of simple perceptrons\n\nWe first study the parameter space $S$ of simple perceptrons to obtain an explicit form of the metric $G$ and its inverse $G^{-1}$. This suggests how to calculate the metric in the parameter space of multilayer perceptrons.\n\nLet us consider a simple perceptron with input $x$ and output $z$. Let $w$ be its connection weight vector. 
For the analog stochastic perceptron, its input-output behavior is described by $z = f(w \\cdot x) + n$, where $n$ denotes a random noise subject to the normal distribution $N(0, \\sigma^2)$ and $f$ is the hyperbolic tangent,\n\n$$f(u) = \\frac{1 - e^{-u}}{1 + e^{-u}}.$$\n\nIn order to calculate the metric $G$ explicitly, let $e_w$ be the unit column vector in the direction of $w$ in the Euclidean space $R^n$,\n\n$$e_w = \\frac{w}{\\| w \\|},$$\n\nwhere $\\| w \\|$ is the Euclidean norm. We then have the following theorem.\n\nTheorem 2. The Fisher metric $G$ and its inverse $G^{-1}$ are given by\n\n$$G(w) = c_1(w) I + \\{ c_2(w) - c_1(w) \\} e_w e_w', \\tag{23}$$\n\n$$G^{-1}(w) = \\frac{1}{c_1(w)} I + \\Bigl( \\frac{1}{c_2(w)} - \\frac{1}{c_1(w)} \\Bigr) e_w e_w', \\tag{24}$$\n\nwhere $c_1(w)$ and $c_2(w)$ depend on $w$ only through its Euclidean norm $\\| w \\|$ and are given by\n\n$$c_1(w) = \\frac{1}{4 \\sqrt{2\\pi} \\sigma^2} \\int \\{ f^2(\\| w \\| \\varepsilon) - 1 \\}^2 \\exp\\Bigl\\{ -\\frac{1}{2} \\varepsilon^2 \\Bigr\\} d\\varepsilon, \\tag{25}$$\n\n$$c_2(w) = \\frac{1}{4 \\sqrt{2\\pi} \\sigma^2} \\int \\{ f^2(\\| w \\| \\varepsilon) - 1 \\}^2 \\varepsilon^2 \\exp\\Bigl\\{ -\\frac{1}{2} \\varepsilon^2 \\Bigr\\} d\\varepsilon. \\tag{26}$$\n\nTheorem 3. The Jeffrey prior is given by\n\n$$\\sqrt{|G(w)|} = \\frac{1}{V_n} \\sqrt{c_2(w) \\{ c_1(w) \\}^{n-1}}. \\tag{27}$$\n\n7 The natural gradient for blind separation of mixed signals\n\nLet $s = (s_1, \\ldots, s_n)$ be $n$ source signals which are $n$ independent random variables. We assume that their $n$ mixtures $x = (x_1, \\ldots, x_n)$,\n\n$$x = A s, \\tag{28}$$\n\nare observed. Here, $A$ is a matrix. When $s$ is a time series, we observe $x(1), \\ldots, x(t)$. The problem of blind separation is to estimate $W = A^{-1}$ adaptively from $x(t)$, $t = 1, 2, 3, \\ldots$, without knowing $s(t)$ or $A$. We can then recover the original $s$ by\n\n$$y = W x \\tag{29}$$\n\nwhen $W = A^{-1}$. Let $W \\in Gl(n)$, that is, a nonsingular $n \\times n$ matrix, and let $\\phi(W)$ be a scalar function. This is given by a measure of independence such as $\\phi(W) = KL[P(y); p(y)]$, which is represented by the expectation of a loss function. We define the natural gradient of $\\phi(W)$.\n\nNow we return to our manifold $Gl(n)$ of matrices. It has the Lie group structure: any $A \\in Gl(n)$ maps $Gl(n)$ to $Gl(n)$ by $W \\to W A$. We impose that the Riemannian structure should be invariant under this operation $A$.\n\nWe can then prove that the natural gradient in this case is\n\n$$\\tilde{\\nabla} \\phi = \\nabla \\phi \\, W' W. \\tag{30}$$\n\nThe natural gradient works surprisingly well for adaptive blind signal separation (Amari et al. [1995]; Cardoso and Laheld [1996]). 
\n\nReferences\n\n[1] S. Amari. Theory of adaptive pattern classifiers, IEEE Trans., EC-16, No. 3, 299-307, 1967.\n\n[2] S. Amari. Differential-Geometrical Methods in Statistics, Lecture Notes in Statistics, vol. 28, Springer, 1985.\n\n[3] S. Amari. Information geometry of the EM and em algorithms for neural networks, Neural Networks, 8, No. 9, 1379-1408, 1995.\n\n[4] S. Amari, A. Cichocki and H. H. Yang. A new learning algorithm for blind signal separation, in NIPS'95, vol. 8, 1996, MIT Press, Cambridge, Mass.\n\n[5] S. Amari, K. Kurata, H. Nagaoka. Information geometry of Boltzmann machines, IEEE Trans. on Neural Networks, 3, 260-271, 1992.\n\n[6] J. F. Cardoso and Beate Laheld. Equivariant adaptive source separation, to appear IEEE Trans. on Signal Processing, 1996.\n\n[7] T. M. Heskes and B. Kappen. Learning processes in neural networks, Physical Review A, 44, 2718-2726, 1991.\n\n[8] D. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1, Foundations, MIT Press, Cambridge, MA, 1986.\n\n[9] H. Sompolinsky, N. Barkai and H. S. Seung. 
On-line learning of dichotomies: algorithms and learning curves, Neural Networks: The Statistical Mechanics Perspective, Proceedings of the CTP-PBSRI Joint Workshop on Theoretical Physics, J.-H. Oh et al. eds, 105-130, 1995.\n", "award": [], "sourceid": 1248, "authors": [{"given_name": "Shun-ichi", "family_name": "Amari", "institution": null}]}