{"title": "Solvable Models of Artificial Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 423, "page_last": 430, "abstract": null, "full_text": "Solvable Models of Artificial Neural Networks\n\nSumio Watanabe\n\nInformation and Communication R&D Center\nRicoh Co., Ltd.\n3-2-3, Shin-Yokohama, Kohoku-ku, Yokohama, 222 Japan\nsumio@ipe.rdc.ricoh.co.jp\n\nAbstract\n\nSolvable models of nonlinear learning machines are proposed, and learning in artificial neural networks is studied based on the theory of ordinary differential equations. A learning algorithm is constructed by which the optimal parameter can be found without any recursive procedure. The solvable models enable us to analyze the reason why experimental results obtained by error backpropagation often contradict statistical learning theory.\n\n1 INTRODUCTION\n\nRecent studies have shown that learning in artificial neural networks can be understood as statistical parametric estimation using the maximum likelihood method [1], and that their generalization abilities can be estimated using statistical asymptotic theory [2]. However, as is often reported, even when the number of parameters is too large, the error for the testing sample is not as large as the theory predicts. The reason for this inconsistency has not yet been clarified, because it is difficult for an artificial neural network to find the globally optimal parameter.\n\nOn the other hand, exactly solvable models have played a central role in the analysis of nonlinear phenomena in mathematical physics, for example, the K-dV equation, the Toda lattice, and some statistical models that satisfy the Yang-Baxter equation [3].
\n\nThis paper proposes the first solvable models for the nonlinear learning problem. We consider simple three-layered neural networks, and show that the parameters from the inputs to the hidden units determine a function space that is characterized by a differential equation. This fact means that optimization of the parameters is equivalent to optimization of the differential equation. Based on this property, we construct a learning algorithm by which the optimal parameters can be found without any recursive procedure. Experimental results using the proposed algorithm show that the maximum likelihood estimator is not always obtained by error backpropagation, and that conventional statistical learning theory leaves much to be improved.\n\n2 The Basic Structure of Solvable Models\n\nLet us consider a function $f_{c,w}(x)$ given by a simple neural network with 1 input unit, $H$ hidden units, and 1 output unit,\n\n$$f_{c,w}(x) = \\sum_{i=1}^{H} c_i \\varphi_{w_i}(x), \\qquad (1)$$\n\nwhere both $c = \\{c_i\\}$ and $w = \\{w_i\\}$ are parameters to be optimized, and $\\varphi_{w_i}(x)$ is the output of the $i$-th hidden unit. We assume that $\\{\\varphi_i(x) = \\varphi_{w_i}(x)\\}$ is a set of independent functions of class $C^H$. The following theorem is the starting point of this paper.\n\nTheorem 1 The $H$-th order differential equation whose fundamental system of solutions is $\\{\\varphi_i(x)\\}$ and whose $H$-th order coefficient is 1 is uniquely given by\n\n$$(D_w g)(x) = (-1)^H \\frac{W_{H+1}(g, \\varphi_1, \\varphi_2, \\ldots, \\varphi_H)}{W_H(\\varphi_1, \\varphi_2, \\ldots, \\varphi_H)} = 0, \\qquad (2)$$\n\nwhere $W_H$ is the $H$-th order Wronskian,\n\n$$W_H(\\varphi_1, \\varphi_2, \\ldots, \\varphi_H) = \\det \\begin{pmatrix} \\varphi_1 & \\varphi_2 & \\cdots & \\varphi_H \\\\ \\varphi_1^{(1)} & \\varphi_2^{(1)} & \\cdots & \\varphi_H^{(1)} \\\\ \\varphi_1^{(2)} & \\varphi_2^{(2)} & \\cdots & \\varphi_H^{(2)} \\\\ \\vdots & \\vdots & & \\vdots \\\\ \\varphi_1^{(H-1)} & \\varphi_2^{(H-1)} & \\cdots & \\varphi_H^{(H-1)} \\end{pmatrix}.$$\n\nFor the proof, see [4]. From this theorem, we have the following corollary.\n\nCorollary 1 Let $g(x)$ be a function of class $C^H$.\n
Then the following conditions for $g(x)$ and $w = \\{w_i\\}$ are equivalent.\n(1) There exists a set $c = \\{c_i\\}$ such that $g(x) = \\sum_{i=1}^{H} c_i \\varphi_{w_i}(x)$.\n(2) $(D_w g)(x) = 0$.\n\nExample 1 Let us consider the case $\\varphi_{w_i}(x) = \\exp(w_i x)$. Then\n\n$$g(x) = \\sum_{i=1}^{H} c_i \\exp(w_i x)$$\n\nis equivalent to $\\{D^H + p_1 D^{H-1} + p_2 D^{H-2} + \\cdots + p_H\\} g(x) = 0$, where $D = (d/dx)$ and the set $\\{p_i\\}$ is determined from $\\{w_i\\}$ by the relation\n\n$$z^H + p_1 z^{H-1} + p_2 z^{H-2} + \\cdots + p_H = \\prod_{i=1}^{H} (z - w_i) \\qquad (\\forall z \\in \\mathbb{C}).$$\n\nExample 2 (RBF) A function $g(x)$ is given by radial basis functions,\n\n$$g(x) = \\sum_{i=1}^{H} c_i \\exp\\{-(x - w_i)^2\\},$$\n\nif and only if $e^{-x^2} \\{D^H + p_1 D^{H-1} + p_2 D^{H-2} + \\cdots + p_H\\} (e^{x^2} g(x)) = 0$, where the set $\\{p_i\\}$ is determined from $\\{w_i\\}$ by the relation\n\n$$z^H + p_1 z^{H-1} + p_2 z^{H-2} + \\cdots + p_H = \\prod_{i=1}^{H} (z - 2 w_i) \\qquad (\\forall z \\in \\mathbb{C}).$$\n\nFigure 1 shows a learning algorithm for the solvable models. When a target function $g(x)$ is given, let us consider the following function approximation problem,\n\n$$g(x) = \\sum_{i=1}^{H} c_i \\varphi_{w_i}(x) + \\epsilon(x). \\qquad (3)$$\n\nLearning in the neural network means optimizing both $\\{c_i\\}$ and $\\{w_i\\}$ such that $\\epsilon(x)$ is minimized for some error function. From the definition of $D_w$, eq. (3) is equivalent to $(D_w g)(x) = (D_w \\epsilon)(x)$, where the term $(D_w g)(x)$ is independent of $\\{c_i\\}$. Therefore, if we adopt $\\|D_w \\epsilon\\|$ as the error function to be minimized, $\\{w_i\\}$ is optimized by minimizing $\\|D_w g\\|$, independently of $\\{c_i\\}$, where $\\|f\\|^2 = \\int |f(x)|^2 dx$. After $\\|D_w g\\|$ is minimized, we have $(D_{w^*} g)(x) \\approx 0$, where $w^*$ is the optimized parameter. From Corollary 1, there exists a set $\\{c_i^*\\}$ such that $g(x) \\approx \\sum_i c_i^* \\varphi_{w_i^*}(x)$, where $\\{c_i^*\\}$ can be found using the ordinary least squares method.
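\n\nAs a small illustrative check of Corollary 1 (our own worked case with hypothetical values, not taken from the paper), consider Example 1 with $H = 2$, $w_1 = 1$ and $w_2 = 2$. The relation $z^2 + p_1 z + p_2 = (z - 1)(z - 2)$ gives $p_1 = -3$ and $p_2 = 2$, so that $D_w = D^2 - 3D + 2$. Applying it to $g(x) = c_1 e^{x} + c_2 e^{2x}$ yields\n\n$$(D_w g)(x) = c_1 (1 - 3 + 2) e^{x} + c_2 (4 - 6 + 2) e^{2x} = 0$$\n\nfor every choice of $c_1$ and $c_2$, which illustrates why $\\|D_w g\\|$ can be minimized over $w$ without knowing $\\{c_i\\}$.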
\n\n3 Solvable Models\n\nFor a general function $\\varphi_w$, the differential operator $D_w$ does not always have such a simple form as in the above examples. In this section, we consider a linear operator $L$ such that the differential equation of $L\\varphi_w$ has a simple form.\n\nDefinition A neural network $\\sum_i c_i \\varphi_{w_i}(x)$ is called solvable if there exist functions $a$, $b$, and a linear operator $L$ such that\n\n$$(L\\varphi_{w_i})(x) = \\exp(a(w_i) x + b(w_i)).$$\n\nThe following theorem shows that the optimal parameter of the solvable models can be found using the same algorithm as in Figure 1.\n\n[Figure 1: Structure of Solvable Models. Flow chart: it is difficult to optimize $w_i$ independently of $c_i$ in $g(x) = \\sum_{i=1}^{H} c_i \\varphi_{w_i}(x) + \\epsilon(x)$; this equation is equivalent to $D_w g(x) = D_w \\epsilon(x)$; minimizing $\\|D_w g\\|$ optimizes $w$; then $D_{w^*} g(x) \\approx 0$, which is equivalent to the existence of $c_i$ such that $g(x) \\approx \\sum_{i=1}^{H} c_i \\varphi_{w_i^*}(x)$; the least squares method then optimizes $c_i$.]\n\nTheorem 2 For a solvable model of a neural network, the following conditions are equivalent when $w_i \\neq w_j$ ($i \\neq j$).\n(1) There exist both $\\{c_i\\}$ and $\\{w_i\\}$ such that $g(x) = \\sum_{i=1}^{H} c_i \\varphi_{w_i}(x)$.\n(2) There exists $\\{p_i\\}$ such that $\\{D^H + p_1 D^{H-1} + p_2 D^{H-2} + \\cdots + p_H\\}(Lg)(x) = 0$.\n(3) For arbitrary $Q > 0$, we define a sequence $\\{y_n\\}$ by $y_n = (Lg)(nQ)$. Then, there exists $\\{q_i\\}$ such that $y_n + q_1 y_{n-1} + q_2 y_{n-2} + \\cdots + q_H y_{n-H} = 0$.\n\nNote that $\\|D_w L g\\|^2$ is a quadratic form for $\\{p_i\\}$, which is easily minimized by the least squares method. $\\sum_n |y_n + q_1 y_{n-1} + \\cdots + q_H y_{n-H}|^2$ is also a quadratic form for $\\{q_i\\}$.\n\nTheorem 3 The sequences $\\{w_i\\}$, $\\{p_i\\}$, and $\\{q_i\\}$ in Theorem 2 have the following relations.\n\n$$z^H + p_1 z^{H-1} + p_2 z^{H-2} + \\cdots + p_H = \\prod_{i=1}^{H} (z - a(w_i)) \\qquad (\\forall z \\in \\mathbb{C}),$$\n\n$$z^H + q_1 z^{H-1} + q_2 z^{H-2} + \\cdots + q_H = \\prod_{i=1}^{H} (z - \\exp(a(w_i) Q)) \\qquad (\\forall z \\in \\mathbb{C}).$$\n\nFor proofs of the above theorems, see [5]. These theorems show that, if $\\{p_i\\}$ or $\\{q_i\\}$ is optimized for a given function $g(x)$, then $\\{a(w_i)\\}$ can be found as the set of solutions of an algebraic equation.\n\nSuppose that a target function $g(x)$ is given. Then, from the above theorems, the globally optimal parameter $w^* = \\{w_i^*\\}$ can be found by minimizing $\\|D_w L g\\|$ independently of $\\{c_i\\}$. Moreover, if the function $a(w)$ is a one-to-one mapping, then $w^*$ exists uniquely up to permutation of $\\{w_i^*\\}$, if and only if the quadratic form $\\|\\{D^H + p_1 D^{H-1} + \\cdots + p_H\\} g\\|^2$ is not degenerate [4]. (Note that, if it is degenerate, we can use another neural network with a smaller number of hidden units.)\n\nExample 3 A neural network without scaling,\n\n$$f_{b,c}(x) = \\sum_{i=1}^{H} c_i \\sigma(x + b_i), \\qquad (4)$$\n\nis solvable when $(F\\sigma)(x) \\neq 0$ (a.e.), where $F$ denotes the Fourier transform. Define a linear operator $L$ by $(Lg)(x) = (Fg)(x)/(F\\sigma)(x)$; then it follows that\n\n$$(Lf_{b,c})(x) = \\sum_{i=1}^{H} c_i \\exp(-\\sqrt{-1}\\, b_i x). \\qquad (5)$$\n\nBy Theorem 2, the optimal $\\{b_i\\}$ can be obtained by using the differential or the sequential equation.\n\nExample 4 (MLP) A three-layered perceptron,\n\n$$f_{b,c}(x) = \\sum_{i=1}^{H} c_i \\tan^{-1}\\left(\\frac{x + b_i}{a_i}\\right), \\qquad (6)$$\n\nis solvable. Define a linear operator $L$ by $(Lg)(x) = x \\cdot (Fg)(x)$; then it follows that\n\n$$(Lf_{b,c})(x) = \\sum_{i=1}^{H} c_i \\exp(-(a_i + \\sqrt{-1}\\, b_i) x + \\alpha(a_i, b_i)) \\qquad (x \\geq 0), \\qquad (7)$$\n\nwhere $\\alpha(a_i, b_i)$ is some function of $a_i$ and $b_i$. Since the function $\\tan^{-1}(x)$ is monotone increasing and bounded, we can expect that a neural network given by eq.
\n(6) has the same ability in the function approximation problem as the ordinary three-layered perceptron using the sigmoid function, $\\tanh(x)$.\n\nExample 5 (Finite Wavelet Decomposition) A finite wavelet decomposition,\n\n$$f_{b,c}(x) = \\sum_{i=1}^{H} c_i \\sigma\\left(\\frac{x + b_i}{a_i}\\right), \\qquad (8)$$\n\nis solvable when $\\sigma(x) = (d/dx)^n (1/(1 + x^2))$ ($n \\geq 1$). Define a linear operator $L$ by $(Lg)(x) = x^{-n} \\cdot (Fg)(x)$; then it follows that\n\n$$(Lf_{b,c})(x) = \\sum_{i=1}^{H} c_i \\exp(-(a_i + \\sqrt{-1}\\, b_i) x + \\beta(a_i, b_i)) \\qquad (x \\geq 0), \\qquad (9)$$\n\nwhere $\\beta(a_i, b_i)$ is some function of $a_i$ and $b_i$. Note that $\\sigma(x)$ is an analyzing wavelet, and that this example shows how to optimize the parameters of the finite wavelet decomposition.\n\n4 Learning Algorithm\n\nWe construct a learning algorithm for solvable models, as shown in Figure 1.\n\n<<Learning Algorithm>>\n(0) A target function $g(x)$ is given.\n(1) $\\{y_m\\}$ is calculated by $y_m = (Lg)(mQ)$.\n(2) $\\{q_i\\}$ is optimized by minimizing $\\sum_m |y_m + q_1 y_{m-1} + q_2 y_{m-2} + \\cdots + q_H y_{m-H}|^2$.\n(3) $\\{z_i\\}$ is calculated by solving $z^H + q_1 z^{H-1} + q_2 z^{H-2} + \\cdots + q_H = 0$.\n(4) $\\{w_i\\}$ is determined by $a(w_i) = (1/Q) \\log z_i$.\n(5) $\\{c_i\\}$ is optimized by minimizing $\\sum_j (g(x_j) - \\sum_i c_i \\varphi_{w_i}(x_j))^2$.\n\nStrictly speaking, $g(x)$ should be given for arbitrary $x$. However, in practical applications, if the number of training samples is sufficiently large so that $(Lg)(x)$ can be approximated almost precisely, this algorithm is applicable. In the third step, the algebraic equation can be solved by, for example, the DKA method.
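\n\nTo make steps (1) to (4) concrete, here is a minimal worked case of our own (an illustration under simplifying assumptions, not an experiment from the paper): take $H = 1$, $\\varphi_w(x) = \\exp(wx)$, $L$ the identity and $a(w) = w$, with a noiseless target $g(x) = c\\,e^{wx}$, $c \\neq 0$. Then $y_m = c\\,e^{wmQ}$, and minimizing $\\sum_m |y_m + q_1 y_{m-1}|^2$ over $q_1$ gives\n\n$$q_1 = -\\frac{\\sum_m y_m y_{m-1}}{\\sum_m y_{m-1}^2} = -e^{wQ}, \\qquad z_1 = -q_1 = e^{wQ}, \\qquad w = \\frac{1}{Q} \\log z_1.$$\n\nIn this idealized setting the hidden-unit parameter is recovered in closed form, and step (5) reduces to an ordinary linear least squares fit for $c$.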
\n\n5 Experimental Results and Discussion\n\n5.1 The backpropagation and the proposed method\n\nFor the experiments, we used a probability density function and a regression function given by\n\n$$Q(y|x) = \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp\\left(-\\frac{(y - h(x))^2}{2\\sigma^2}\\right),$$\n\n$$h(x) = -\\frac{1}{3} \\tan^{-1}\\left(\\frac{x - 1/3}{0.04}\\right) + \\frac{1}{6} \\tan^{-1}\\left(\\frac{x - 2/3}{0.02}\\right),$$\n\nwhere $\\sigma = 0.2$. One hundred input samples were placed at equal intervals in $[0, 1)$, and output samples were drawn from the above conditional distribution.\n\nTable 1 shows the relation between the number of hidden units, training errors, and regression errors. In the table, the training error for backpropagation is the square error obtained after 100,000 training cycles. The training error for the proposed method is the square error obtained by the above algorithm. The regression error is the square error between the true regression curve $h(x)$ and the estimated curve.
\n\nFigure 2 shows the true and estimated regression lines: (0) the true regression line and sample points, (1) the estimated regression line with 2 hidden units, by the BP (the error backpropagation) after 100,000 training cycles, (2) the estimated regression line with 12 hidden units, by the BP after 100,000 training cycles, (3) the estimated line with 2 hidden units by the proposed method, and (4) the estimated line with 12 hidden units by the proposed method.\n\nTable 1: Training errors and regression errors\n\nHidden   Backpropagation          Proposed Method\nUnits    Training   Regression    Training   Regression\n2        4.1652     0.7698        4.0889     0.3301\n4        3.3464     0.4152        3.8755     0.2653\n6        3.3343     0.4227        3.5368     0.3730\n8        3.3267     0.4189        3.2237     0.4297\n10       3.3284     0.4260        3.2547     0.4413\n12       3.3170     0.4312        3.1988     0.5810\n\n5.2 Discussion\n\nWhen the number of hidden units was small, the training errors by the BP were smaller, but the regression errors were larger. When the number of hidden units was taken to be larger, the training error by the BP did not decrease as much as that of the proposed method, and the regression error did not increase as much as that of the proposed method.\n\nWith error backpropagation, the parameters did not reach the maximum likelihood estimator, or they fell into local minima. However, when the number of hidden units was large, the neural network without the maximum likelihood estimator attained better generalization. It seems that the parameters in the local minima were closer to the true parameter than the maximum likelihood estimator.
\n\nTheoretically, in the case of layered neural networks, the maximum likelihood estimator may not be subject to an asymptotically normal distribution because the Fisher information matrix may be degenerate. This can be one reason why the experimental results contradict ordinary statistical theory. In addition to this problem, the above experimental results show that local minima cause a strange problem. In order to construct a more precise learning theory for the backpropagation neural network, and to choose a better parameter for generalization, we may need a method to analyze learning and inference with a local minimum.\n\n6 Conclusion\n\nWe have proposed solvable models of artificial neural networks, and studied their learning structure. The experimental results have shown that the proposed method is useful in the analysis of the neural network generalization problem.\n\n[Figure 2: Experimental Results. $H$: the number of hidden units; $E_{train}$: the training error; $E_{reg}$: the regression error. Panels: (0) True curve and samples, sample error sum = 3.6874. (1) BP after 100,000 cycles, $H = 2$, $E_{train} = 4.1652$, $E_{reg} = 0.7698$. (2) BP after 100,000 cycles, $H = 12$, $E_{train} = 3.3170$, $E_{reg} = 0.4312$. (3) Proposed method, $H = 2$, $E_{train} = 4.0889$, $E_{reg} = 0.3301$. (4) Proposed method, $H = 12$, $E_{train} = 3.1988$, $E_{reg} = 0.5810$.]\n\nReferences\n\n[1] H. White. (1989) Learning in artificial neural networks: a statistical perspective. Neural Computation, 1, 425-464.\n\n[2] N. Murata, S. Yoshizawa, and S.-I. Amari. (1992) Learning Curves, Model Selection and Complexity of Neural Networks. Advances in Neural Information Processing Systems 5, San Mateo, Morgan Kaufmann, pp. 607-614.\n\n[3] R. J. Baxter. (1982) Exactly Solved Models in Statistical Mechanics, Academic Press.\n\n[4] E. A. Coddington. (1955) Theory of ordinary differential equations, the McGraw-Hill Book Company, New York.\n\n[5] S. Watanabe. (1993) Function approximation by neural networks and solution spaces of differential equations. Submitted to Neural Networks.", "award": [], "sourceid": 786, "authors": [{"given_name": "Sumio", "family_name": "Watanabe", "institution": null}]}