{"title": "Recovering a Feed-Forward Net From Its Output", "book": "Advances in Neural Information Processing Systems", "page_first": 335, "page_last": 342, "abstract": null, "full_text": "Recovering a Feed-Forward Net From Its Output\n\nCharles Fefferman * and Scott Markel\n\nDavid Sarnoff Research Center\nCN5300\nPrinceton, NJ 08543-5300\n\ne-mail: cf@math.princeton.edu\nsmarkel@sarnoff.com\n\nABSTRACT\n\nWe study feed-forward nets with arbitrarily many layers, using the standard sigmoid, tanh x. Aside from technicalities, our theorems are: 1. Complete knowledge of the output of a neural net for arbitrary inputs uniquely specifies the architecture, weights and thresholds; and 2. There are only finitely many critical points on the error surface for a generic training problem.\n\nNeural nets were originally introduced as highly simplified models of the nervous system. Today they are widely used in technology and studied theoretically by scientists from several disciplines. However, they remain little understood.\n\nMathematically, a (feed-forward) neural net consists of:\n\n(1) A finite sequence of positive integers (D_0, D_1, ..., D_L);\n(2) A family of real numbers (w_{jk}^ℓ) defined for 1 ≤ ℓ ≤ L, 1 ≤ j ≤ D_ℓ, 1 ≤ k ≤ D_{ℓ-1}; and\n(3) A family of real numbers (θ_j^ℓ) defined for 1 ≤ ℓ ≤ L, 1 ≤ j ≤ D_ℓ.\n\nThe sequence (D_0, D_1, ..., D_L) is called the architecture of the neural net, while the w_{jk}^ℓ are called weights and the θ_j^ℓ thresholds.\n\nNeural nets are used to compute non-linear maps from ℝ^N to ℝ^M by the following construction. We begin by fixing a nonlinear function σ(x) of one variable. 
Analogy with the nervous system suggests that we take σ(x) asymptotic to constants as x tends to ±∞; a standard choice, which we adopt throughout this paper, is σ(x) = tanh(½x). Given an \"input\" (t_1, ..., t_{D_0}) ∈ ℝ^{D_0}, we define real numbers x_j^ℓ for 0 ≤ ℓ ≤ L, 1 ≤ j ≤ D_ℓ by the following induction on ℓ.\n\n(4) If ℓ = 0 then x_j^ℓ = t_j.\n(5) If the x_k^{ℓ-1} are known with ℓ fixed (1 ≤ ℓ ≤ L), then we set x_j^ℓ = σ(Σ_{1 ≤ k ≤ D_{ℓ-1}} w_{jk}^ℓ x_k^{ℓ-1} + θ_j^ℓ) for 1 ≤ j ≤ D_ℓ.\n\nHere x_1^ℓ, ..., x_{D_ℓ}^ℓ are interpreted as the outputs of D_ℓ \"neurons\" in the ℓth \"layer\" of the net. The output map of the net is defined as the map\n\n(6) Φ: (t_1, ..., t_{D_0}) ↦ (x_1^L, ..., x_{D_L}^L).\n\n* Alternate address: Dept. of Mathematics, Princeton University, Princeton, NJ 08544-1000.\n\nIn practical applications, one tries to pick the neural net [(D_0, D_1, ..., D_L), (w_{jk}^ℓ), (θ_j^ℓ)] so that the output map Φ approximates a given map about which we have only imperfect information. The main result of this paper is that under generic conditions, perfect knowledge of the output map Φ uniquely specifies the architecture, the weights and the thresholds of a neural net, up to obvious symmetries.\n\nMore precisely, the obvious symmetries are as follows. Let (γ_0, γ_1, ..., γ_L) be permutations, with γ_ℓ: {1, ..., D_ℓ} → {1, ..., D_ℓ}; and let {ε_j^ℓ : 0 ≤ ℓ ≤ L, 1 ≤ j ≤ D_ℓ} be a collection of ±1's. Assume that γ_ℓ = (identity) and ε_j^ℓ = +1 whenever ℓ = 0 or ℓ = L. Then one checks easily that the neural nets\n\n(7) [(D_0, D_1, ..., D_L), (w_{jk}^ℓ), (θ_j^ℓ)]\n\nand\n\n(8) [(D_0, D_1, ..., D_L), (w̃_{jk}^ℓ), (θ̃_j^ℓ)]\n\nhave the same output map if we set\n\n(9) w̃_{jk}^ℓ = ε_j^ℓ w_{[γ_ℓ j][γ_{ℓ-1} k]}^ℓ ε_k^{ℓ-1} and θ̃_j^ℓ = ε_j^ℓ θ_{[γ_ℓ j]}^ℓ.\n\nThis reflects the facts that the neurons in layer ℓ are interchangeable (1 ≤ ℓ ≤ L - 1), and that the function σ(x) is odd. The nets (7) and (8) will be called isomorphic if they are related by (9). Note in particular that isomorphic neural nets have the same architecture. Our main theorem asserts that, under generic conditions, any two neural nets with the same output map are isomorphic.\n\nWe discuss the generic conditions which we impose on neural nets. We have to avoid obvious counterexamples such as:\n\n(10) Suppose all the weights w_{jk}^ℓ are zero. Then the output map Φ is constant. The architecture and thresholds of the neural net are clearly not uniquely determined by Φ.\n\n(11) Fix ℓ_0, j_1, j_2 with 1 ≤ ℓ_0 ≤ L - 1 and 1 ≤ j_1 < j_2 ≤ D_{ℓ_0}. Suppose we have θ_{j_1}^{ℓ_0} = θ_{j_2}^{ℓ_0} and w_{j_1 k}^{ℓ_0} = w_{j_2 k}^{ℓ_0} for all k. Then (5) gives x_{j_1}^{ℓ_0} = x_{j_2}^{ℓ_0}. Therefore the output depends on w_{j j_1}^{ℓ_0+1} and w_{j j_2}^{ℓ_0+1} only through the sum w_{j j_1}^{ℓ_0+1} + w_{j j_2}^{ℓ_0+1}. So the output map does not uniquely determine the weights.\n\nOur hypotheses are more than adequate to exclude these counterexamples. Specifically, we assume that\n\n(12) θ_j^ℓ ≠ 0 and |θ_j^ℓ| ≠ |θ_{j'}^ℓ| for j ≠ j'.\n(13) w_{jk}^ℓ ≠ 0; and for j ≠ j', the ratio w_{jk}^ℓ / w_{j'k}^ℓ is not equal to any fraction of the form p/q with p, q integers and 1 ≤ q ≤ 100 D_ℓ.\n\nEvidently, these conditions hold for generic neural nets. The precise statement of our main theorem is as follows. If two neural nets satisfy (12), (13) and have the same output, then the nets are isomorphic. It would be interesting to replace (12), (13) by minimal hypotheses, and to study functions σ(x) other than tanh(½x).\n\nWe now sketch the proof of our main result, 
sacrificing accuracy for simplicity. After a trivial reduction, we may assume D_0 = D_L = 1. Thus, the outputs of the nodes x_j^ℓ(t) are functions of one variable, and the output map of the neural net is t ↦ x_1^L(t). The key idea is to continue the x_j^ℓ(t) analytically to complex values of t, and to read off the structure of the net from the set of singularities of the x_j^ℓ. Note that σ(x) = tanh(½x) is meromorphic, with poles at the points of an arithmetic progression {(2m + 1)πi : m ∈ ℤ}. This leads to two crucial observations.\n\n(14) When ℓ = 1, the poles of x_j^ℓ(t) form an arithmetic progression Π_j^1, and\n(15) When ℓ > 1, every pole of any x_k^{ℓ-1}(t) is an accumulation point of poles of any x_j^ℓ(t).\n\nIn fact, (14) is immediate from the formula x_j^1(t) = σ(w_{j1}^1 t + θ_j^1), which is merely the special case D_0 = 1 of (5). We obtain\n\n(16) Π_j^1 = { ((2m + 1)πi - θ_j^1) / w_{j1}^1 : m ∈ ℤ }.\n\nTo see (15), fix ℓ, j, κ, and assume for simplicity that x_κ^{ℓ-1}(t) has a simple pole at t_0, while x_k^{ℓ-1}(t) (k ≠ κ) is analytic in a neighborhood of t_0. Then\n\n(17) x_κ^{ℓ-1}(t) = A/(t - t_0) + f(t), with f analytic in a neighborhood of t_0.\n\nFrom (17) and (5), we obtain\n\n(18) x_j^ℓ(t) = σ(w_{jκ}^ℓ A (t - t_0)^{-1} + g(t)), with\n(19) g(t) = w_{jκ}^ℓ f(t) + Σ_{k ≠ κ} w_{jk}^ℓ x_k^{ℓ-1}(t) + θ_j^ℓ analytic in a neighborhood of t_0.\n\nThus, in a neighborhood of t_0, the poles of x_j^ℓ(t) are the solutions t_m of the equation\n\n(20) w_{jκ}^ℓ A/(t_m - t_0) + g(t_m) = (2m + 1)πi, m ∈ ℤ.\n\nThere are infinitely many solutions of (20), accumulating at t_0. Hence, t_0 is an accumulation point of poles of x_j^ℓ(t), which completes the proof of (15).\n\nIn view of (14), (15), it is natural to make the following definitions. 
The natural domain of a neural net is the largest open subset of the complex plane to which the output map t ↦ x_1^L(t) can be analytically continued. For ℓ ≥ 0 we define the ℓth singular set Sing(ℓ) by setting\n\nSing(0) = complement of the natural domain in ℂ, and\nSing(ℓ + 1) = the set of all accumulation points of Sing(ℓ).\n\nThese definitions are made entirely in terms of the output map, without reference to the structure of the given neural net. On the other hand, the sets Sing(ℓ) contain nearly complete information on the architecture, weights and thresholds of the net. This will allow us to read off the structure of a neural net from the analytic continuation of its output map. To see how the sets Sing(ℓ) reflect the structure of the net, we reason as follows.\n\nFrom (14) and (15) we expect that\n\n(21) For 1 ≤ ℓ ≤ L, Sing(L - ℓ) is the union over j = 1, ..., D_ℓ of the set of poles of x_j^ℓ(t), together with their accumulation points (which we ignore here), and\n(22) For ℓ ≥ L, Sing(ℓ) is empty.\n\nImmediately, then, we can read off the \"depth\" L of the neural net; it is simply the smallest ℓ for which Sing(ℓ) is empty.\n\nWe need to solve for D_ℓ, w_{jk}^ℓ, θ_j^ℓ. We proceed by induction on ℓ.\n\nWhen ℓ = 1, (14) and (21) show that Sing(L - 1) is the union of arithmetic progressions Π_j^1, j = 1, ..., D_1. Therefore, from Sing(L - 1) we can read off D_1 and the Π_j^1. (We will return to this point later in the introduction.) In view of (16), Π_j^1 determines the weights and thresholds at layer 1, modulo signs. Thus, we have found D_1, w_{jk}^1, θ_j^1.\n\nWhen ℓ > 1, we may assume that\n\n(23) The D_{ℓ'}, w_{jk}^{ℓ'}, θ_j^{ℓ'} are already known, for 1 ≤ ℓ' < ℓ.\n\nOur task is to find D_ℓ, w_{jk}^ℓ, θ_j^ℓ. 
In view of (23), we can find a pole t_0 of x_κ^{ℓ-1}(t) for our favorite κ. Assume for simplicity that t_0 is a simple pole of x_κ^{ℓ-1}(t), and that the x_k^{ℓ-1}(t) (k ≠ κ) are analytic in a neighborhood of t_0. Then x_κ^{ℓ-1}(t) is given by (17) in a neighborhood of t_0, with A already known by virtue of (23). Let U be a small neighborhood of t_0.\n\nWe will look at the image Y of U ∩ Sing(L - ℓ) under the map t ↦ A/(t - t_0). Since A, t_0 and Sing(L - ℓ) are already known, so is Y. On the other hand, we can relate Y to D_ℓ, w_{jk}^ℓ, θ_j^ℓ as follows. From (21) we see that Y is the union over j = 1, ..., D_ℓ of\n\n(24) Y_j = image of U ∩ {Poles of x_j^ℓ(t)} under t ↦ A/(t - t_0).\n\nFor fixed j, the poles of x_j^ℓ(t) in a neighborhood of t_0 are the t_m given by (20). We write\n\n(25) w_{jκ}^ℓ A/(t_m - t_0) = [w_{jκ}^ℓ A/(t_m - t_0) + g(t_m)] + [g(t_0) - g(t_m)] - g(t_0).\n\nEquation (20) shows that the first expression in brackets in (25) is equal to (2m + 1)πi. Also, since t_m → t_0 as |m| → ∞ and g is analytic in a neighborhood of t_0, the second expression in brackets in (25) tends to zero. Hence,\n\nw_{jκ}^ℓ A/(t_m - t_0) = (2m + 1)πi - g(t_0) + o(1) for large m.\n\nComparing this with the definition (24), we see that Y_j is asymptotic to the arithmetic progression\n\n(26) Π_j^ℓ = { ((2m + 1)πi - g(t_0)) / w_{jκ}^ℓ : m ∈ ℤ }.\n\nThus, the known set Y is the union over j = 1, ..., D_ℓ of sets Y_j, with Y_j asymptotic to the arithmetic progression Π_j^ℓ. From Y, we can therefore read off D_ℓ and the Π_j^ℓ. (We will return to this point in a moment.) We see at once from (26) that w_{jκ}^ℓ is determined up to sign by Π_j^ℓ. Thus, we have found D_ℓ and w_{jk}^ℓ. With more work, we can also find the θ_j^ℓ, completing the induction on ℓ. 
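As an illustration of the base case of this induction (the first layer, where the poles of a node form a single arithmetic progression as in (16)), the following Python sketch may help. It is not part of the original argument. It generates the poles of one first-layer node x(t) = tanh((w t + theta)/2) and then recovers the weight and threshold from the poles alone, up to a common sign. The function names first_layer_poles and recover_weight_threshold and the sample values are illustrative assumptions, and the sketch assumes exact, noise-free poles coming from a single progression.

```python
import math


def first_layer_poles(w, theta, ms):
    # Poles of tanh((w*t + theta)/2): solve w*t + theta = (2m+1)*pi*i, as in (16).
    return [complex(-theta, (2 * m + 1) * math.pi) / w for m in ms]


def recover_weight_threshold(poles):
    # The poles lie on the vertical line Re t = -theta/w, spaced 2*pi/|w| apart.
    ordered = sorted(poles, key=lambda t: t.imag)
    spacing = ordered[1].imag - ordered[0].imag
    w_abs = 2 * math.pi / spacing
    theta = -w_abs * ordered[0].real
    # Only the pair up to the sign change (w, theta) -> (-w, -theta) is determined.
    return w_abs, theta


# Hypothetical sample node: w = -1.5, theta = 0.7.
poles = first_layer_poles(w=-1.5, theta=0.7, ms=range(-3, 4))
w_rec, theta_rec = recover_weight_threshold(poles)
```

With these sample values the recovered pair is approximately (1.5, -0.7), the sign-flipped image of the pair (-1.5, 0.7) used to generate the poles. This is the sign ambiguity that the isomorphisms account for.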
\n\nThe above induction shows that the structure of a neural net may be read off from the analytic continuation of its output map. We believe that the analytic continuation of the output map will lead to further consequences in the study of neural nets.\n\nLet us touch briefly on a few points which we glossed over above. First of all, suppose we are given a set Y ⊂ ℂ, and we know that Y is the union of sets Y_1, ..., Y_D, with Y_j asymptotic to an arithmetic progression Π_j. We assumed above that Π_1, ..., Π_D are uniquely determined by Y. In fact, without some further hypothesis on the Π_j, this need not be true. For instance, we cannot distinguish Π_1 ∪ Π_2 from Π_3 if Π_1 = {odd integers}, Π_2 = {even integers}, Π_3 = {all integers}. On the other hand, we can clearly recognize Π_1 = {all integers} and Π_2 = {m√2 : m an integer} from their union Π_1 ∪ Π_2. Thus, irrational numbers enter the picture. The role of our generic hypothesis (13) is to control the arithmetic progressions that arise in our proof.\n\nSecondly, suppose x_κ^ℓ(t) has a pole at t_0. We assumed for simplicity that x_k^ℓ(t) is analytic in a neighborhood of t_0 for k ≠ κ. However, one of the x_k^ℓ(t) (k ≠ κ) may also have a pole at t_0. In that case, the x_j^{ℓ+1}(t) may all be analytic in a neighborhood of t_0, because the contributions of the singularities of the x_k^ℓ in σ(Σ_k w_{jk}^{ℓ+1} x_k^ℓ + θ_j^{ℓ+1}) may cancel. Thus, the singularity at t_0 may disappear from the output map. While this circumstance is hardly generic, it is not ruled out by our hypotheses (12), (13).\n\nBecause singularities can disappear, we have to make technical changes in our description of Sing(ℓ). 
For example, in the discussion following (23), Y need not be the union of the sets Y_j. Rather, Y is their \"approximate union\". (See [F].)\n\nNext, we should point out that the signs of the weights and thresholds require some attention, even though we have some freedom to change signs by applying isomorphisms. (See (9).)\n\nFinally, in the definition of the natural domain, we have assumed that there is a unique maximal open set to which the output map continues analytically. This need not be true of a general real-analytic function on the line; for instance, take f(t) = (1 + t^2)^{1/2}. Fortunately, the natural domain is well-defined for any function that continues analytically to the complement of a countable set. The defining formula (5) lets us check easily that the output map continues to the complement of a countable set, so the natural domain makes sense. This concludes our overview of the proof of our main theorem. The full proof of our results will appear in [F].\n\nBoth the uniqueness problem and the use of analytic continuation have already appeared in the neural net literature. In particular, it was R. Hecht-Nielson who pointed out the role of isomorphisms and posed the uniqueness problem. His paper with Chen and Lu [CLH] on \"equioutput transformations\" on the space of all neural nets influenced our work. E. Sontag [So] and H. Sussman [Su] proved sharp uniqueness theorems for one hidden layer. The proof in [So] uses complex variables.\n\nAcknowledgements\n\nFefferman is grateful to R. Crane, S. Markel, J. Pearson, E. Sontag, R. Sverdlove, and N. Winarsky for introducing him to the study of neural nets. 
\n\nThis research was supported by the Advanced Research Projects Agency of the Department of Defense and was monitored by the Air Force Office of Scientific Research under Contract F49620-92-C-0072. The United States Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation hereon. This work was also supported by the National Science Foundation.\n\nThe following posters, presented at NIPS 93, may clarify our uniqueness theorem.\n\nReferences\n\n[CLH] R. Hecht-Nielson, et al., On the geometry of feedforward neural network error surfaces. (to appear).\n\n[F] C. Fefferman, Reconstructing a neural network from its output, Revista Mathematica Iberoamericana. (to appear).\n\n[So] F. Albertini and E. Sontag, Uniqueness of weights for neural networks. (to appear).\n\n[Su] H. Sussman, Uniqueness of the weights for minimal feedforward nets with a given input-output map, Neural Networks 5 (1992), pp. 589-593.\n\nRecovering a Feed-Forward Net from Its Output\n\nCharles Fefferman\nDavid Sarnoff Research Center and Princeton University\nPrinceton, New Jersey\n\nScott A. Markel\nDavid Sarnoff Research Center\nPrinceton, New Jersey\n\nSuppose an unknown neural network is placed in a black box.\n\nYou aren't allowed to look in the box, but you are allowed to observe the outputs produced by the network for arbitrary inputs.\n\nThen, in principle, you have enough information to determine the network architecture (number of layers and number of nodes in each layer) and the unique values for all the weights.\n\nThe Output Map of a Neural Network\n\nFix a feed-forward neural network with the standard sigmoid σ(x) = tanh x.\n\n[Figure: network diagram with inputs x and outputs y]\n\nThe map that carries input vectors (x_1, x_2, ...) to output vectors (y_1, y_2, ...) is called the OUTPUT MAP of the neural network.\n\nThe Key Question\n\nWhen can two neural networks have the same output map?\n\nObvious Examples of Two Neural Networks with the Same Output Map\n\nStart with a neural network N.\n\nThen either\n\n1. permute the nodes in a hidden layer, or\n\n2. fix a hidden node, and change the sign of every weight (including the bias weight) that involves that node.\n\nThis yields a new neural network with the same output map as N.\n\nUniqueness Theorem\n\nLet N and N' be neural networks that satisfy generic conditions described below.\n\nIf N and N' have the same output map, then they differ only by sign changes and permutations of hidden nodes.\n\nGeneric Conditions\n\nWe assume that\n\n• all weights are non-zero\n\n• bias weights within each layer have distinct absolute values\n\n• the ratio of weights from node i in layer l to nodes j and k in layer (l+1) is not equal to any fraction of the form p/q with p, q integers and 1 ≤ q ≤ 100·(number of nodes in layer l)\n\nSome such assumptions are needed to avoid obvious counterexamples.\n\nOutline of the Proof\n\n• it's enough to consider networks with one input node and one output node (see below)\n\n• all node outputs are now functions of a single, real variable t (the network input)\n\n• analytically continue the network output to a function f of a single, complex variable t\n\n• the qualitative geometry of the poles of the function f determines the network architecture (see below)\n\n• the asymptotics of the function f near its singularities determine the weights\n\nReduction to a Network with Single Input and Output Nodes\n\n• focus attention on a single output node, ignoring the others\n\n• study only input data with a single non-zero entry\n\n[Figure: network reduced to one input node and one output node]\n\nGeometric Description of the Poles\n\n[Figure: singularities of the output map in the complex plane]\n\n• poles (small dots) accumulate at essential singularities (small squares)\n\n• essential singularities (small squares) accumulate at more complicated essential singularities (large dots)\n\nDetermining the Network Architecture from the Picture\n\n• three kinds of singularities (small dots, small squares, large dots)\n=> three layers of sigmoids, i.e. two hidden layers and an output layer\n\n• three 'spiral arms' of small squares accumulate at each large dot\n=> three nodes in the second hidden layer\n\n• two 'spiral arms' of small dots accumulate at each small square\n=> two nodes in the first hidden layer\n\nDetermining the Network Architecture from the Picture (cont'd)\n\n• from the network reduction we know that there is one input node and one output node\n\n• therefore, the network architecture is as pictured\n\n[Figure: the recovered network architecture]\n", "award": [], "sourceid": 748, "authors": [{"given_name": "Charles", "family_name": "Fefferman", "institution": null}, {"given_name": "Scott", "family_name": "Markel", "institution": null}]}