{"title": "Recovering a Feed-Forward Net From Its Output", "book": "Advances in Neural Information Processing Systems", "page_first": 335, "page_last": 342, "abstract": null, "full_text": "Recovering a Feed-Forward Net \n\nFrom Its Output \n\nCharles Fefferman * and Scott Markel \n\nDavid Sarnoff Research Center \n\nCN5300 \n\nPrinceton, N J 08543-5300 \n\ne-mail: cf9imath.princeton .edu \n\nsmarkel@sarnoff.com \n\nABSTRACT \n\nWe study feed-forward nets with arbitrarily many layers, using the stan(cid:173)\ndard sigmoid, tanh x. Aside from technicalities, our theorems are: \n1. Complete knowledge of the output of a neural net for arbitrary inputs \nuniquely specifies the architecture, weights and thresholds; and 2. There \nare only finitely many critical points on the error surface for a generic \ntraining problem. \n\nNeural nets were originally introduced as highly simplified models of the nervous \nsystem. Today they are widely used in technology and studied theoretically by \nscientists from several disciplines. However, they remain little understood. \nMathematically, a (feed-forward) neural net consists of: \n\n(1) A finite sequence of positive integers (Do, D 1 , ... , D\u00a3); \n(2) A family of real numbers (wJ d defined for 1 :5 e 5: L, 1 5: j 5: D l , 1 5: k :5 Dl-l ; \n\nand \n\n(3) A family of real numbers (OJ) defined for 15: f 5: L, 15: j 5: Dl. \n\nThe sequence (Do, D 1 , .. \" DL ) is called the architecture of the neural net, while the \nW]k are called weights and the OJ thresholds. \nNeural nets are used to compute non-linear maps from }R.N to }R.M by the following \nconstruction. vVe begin by fixing a nonlinear function 0-( x) of one variable. Analogy \nwith the nervous system suggests that we take o-(x) asymptotic to constants as x \ntends to \u00b1oo; a standard choice, which we adopt throughout this paper, is o-(.r) = \n\n* Alternate address: Dept. of Mathematics. Princeton University, Princeton, NJ 08544-1000. 
\n\nσ(x) = tanh(x/2). Given an \"input\" (t_1, ..., t_{D_0}) ∈ ℝ^{D_0}, we define real numbers x^ℓ_j for 0 ≤ ℓ ≤ L, 1 ≤ j ≤ D_ℓ by the following induction on ℓ. \n\n(4) If ℓ = 0 then x^0_j = t_j. \n\n(5) If the x^{ℓ-1}_k are known with ℓ fixed (1 ≤ ℓ ≤ L), then we set x^ℓ_j = σ( Σ_{k=1}^{D_{ℓ-1}} w^ℓ_{jk} x^{ℓ-1}_k + θ^ℓ_j ) for 1 ≤ j ≤ D_ℓ. \n\nHere x^ℓ_1, ..., x^ℓ_{D_ℓ} are interpreted as the outputs of the D_ℓ \"neurons\" in the ℓth \"layer\" of the net. The output map of the net is defined as the map \n\n(6) (t_1, ..., t_{D_0}) ↦ (x^L_1, ..., x^L_{D_L}). \n\nIn practical applications, one tries to pick the neural net [(D_0, D_1, ..., D_L), (w^ℓ_{jk}), (θ^ℓ_j)] so that the output map (6) approximates a given map about which we have only imperfect information. The main result of this paper is that under generic conditions, perfect knowledge of the output map (6) uniquely specifies the architecture, the weights and the thresholds of a neural net, up to obvious symmetries. \n\nMore precisely, the obvious symmetries are as follows. Let (γ_0, γ_1, ..., γ_L) be permutations, with γ_ℓ: {1, ..., D_ℓ} → {1, ..., D_ℓ}; and let {ε^ℓ_j : 0 ≤ ℓ ≤ L, 1 ≤ j ≤ D_ℓ} be a collection of ±1's. Assume that γ_ℓ = (identity) and ε^ℓ_j = +1 whenever ℓ = 0 or ℓ = L. Then one checks easily that the neural nets \n\n(7) [(D_0, D_1, ..., D_L), (w^ℓ_{jk}), (θ^ℓ_j)] \n\nand \n\n(8) [(D_0, D_1, ..., D_L), (ŵ^ℓ_{jk}), (θ̂^ℓ_j)] \n\nhave the same output map if we set \n\n(9) ŵ^ℓ_{jk} = ε^ℓ_j ε^{ℓ-1}_k w^ℓ_{γ_ℓ(j) γ_{ℓ-1}(k)} and θ̂^ℓ_j = ε^ℓ_j θ^ℓ_{γ_ℓ(j)}. \n\nThis reflects the facts that the neurons in layer ℓ are interchangeable (1 ≤ ℓ ≤ L − 1), and that the function σ(x) is odd. The nets (7) and (8) will be called isomorphic if they are related by (9). Note in particular that isomorphic neural nets have the same architecture. Our main theorem asserts that, under generic conditions, any two neural nets with the same output map are isomorphic. \n\nWe discuss the generic conditions which we impose on neural nets. We have to avoid obvious counterexamples such as: \n\n(10) Suppose all the weights w^ℓ_{jk} are zero. 
Then the output map (6) is constant. The architecture and thresholds of the neural net are clearly not uniquely determined by (6). \n\n(11) Fix ℓ_0, j_1, j_2 with 1 ≤ ℓ_0 ≤ L − 1 and 1 ≤ j_1 < j_2 ≤ D_{ℓ_0}. Suppose we have θ^{ℓ_0}_{j_1} = θ^{ℓ_0}_{j_2} and w^{ℓ_0}_{j_1 k} = w^{ℓ_0}_{j_2 k} for all k. Then (5) gives x^{ℓ_0}_{j_1} = x^{ℓ_0}_{j_2}. Therefore the output depends on w^{ℓ_0+1}_{j j_1} and w^{ℓ_0+1}_{j j_2} only through the sum w^{ℓ_0+1}_{j j_1} + w^{ℓ_0+1}_{j j_2}. So the output map does not uniquely determine the weights. \n\nOur hypotheses are more than adequate to exclude these counterexamples. Specifically, we assume that \n\n(12) θ^ℓ_j ≠ 0 and |θ^ℓ_j| ≠ |θ^ℓ_{j'}| for j ≠ j'; and \n\n(13) w^ℓ_{jk} ≠ 0; and for j ≠ j', the ratio w^ℓ_{jk}/w^ℓ_{j'k} is not equal to any fraction of the form p/q with p, q integers and 1 ≤ q ≤ 100 D_{ℓ-1}. \n\nEvidently, these conditions hold for generic neural nets. The precise statement of our main theorem is as follows. If two neural nets satisfy (12), (13) and have the same output map, then the nets are isomorphic. It would be interesting to replace (12), (13) by minimal hypotheses, and to study functions σ(x) other than tanh(x/2). \n\nWe now sketch the proof of our main result, sacrificing accuracy for simplicity. After a trivial reduction, we may assume D_0 = D_L = 1. Thus, the outputs of the nodes x^ℓ_j(t) are functions of one variable, and the output map of the neural net is t ↦ x^L_1(t). The key idea is to continue the x^ℓ_j(t) analytically to complex values of t, and to read off the structure of the net from the set of singularities of the x^ℓ_j. Note that σ(x) = tanh(x/2) is meromorphic, with poles at the points of an arithmetic progression {(2m + 1)πi : m ∈ ℤ}. This leads to two crucial observations. \n\n(14) When ℓ = 1, the poles of x^1_j(t) form an arithmetic progression Π^1_j; and \n\n(15) When ℓ > 1, every pole of any x^{ℓ-1}_k(t) is an accumulation point of poles of any x^ℓ_j(t). 
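The induction (4)-(5) is short enough to state in code. The sketch below is ours, not the authors'; the variable names and the toy architecture are illustrative assumptions, with σ(x) = tanh(x/2) as in the paper:

```python
import numpy as np

# Minimal sketch of the construction (1)-(5); names are ours.
# weights[l] has shape (D_l, D_{l-1}); thresholds[l] has shape (D_l,).

def sigma(x):
    # the paper's sigmoid; meromorphic with poles at (2m+1)*pi*i
    return np.tanh(x / 2)

def output_map(weights, thresholds, t):
    # map (t_1, ..., t_{D_0}) to (x_1^L, ..., x_{D_L}^L) by (4)-(5)
    x = np.asarray(t, dtype=float)           # layer 0: x_j^0 = t_j
    for W, theta in zip(weights, thresholds):
        x = sigma(W @ x + theta)             # eq. (5)
    return x

# toy net with architecture (D_0, D_1, D_2) = (2, 3, 1)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
thresholds = [rng.standard_normal(3), rng.standard_normal(1)]
y = output_map(weights, thresholds, [0.5, -1.0])
print(y)  # a single value in (-1, 1), since |tanh| < 1
```

Knowing this map for arbitrary inputs is exactly the data the uniqueness theorem assumes.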
\n\nIn fact, (14) is immediate from the formula x^1_j(t) = σ(w^1_{j1} t + θ^1_j), which is merely the special case D_0 = 1 of (5). We obtain \n\n(16) Π^1_j = { ((2m + 1)πi − θ^1_j) / w^1_{j1} : m ∈ ℤ }. \n\nTo see (15), fix ℓ, j, k̄, and assume for simplicity that x^{ℓ-1}_{k̄}(t) has a simple pole at t_0, while the x^{ℓ-1}_k(t) (k ≠ k̄) are analytic in a neighborhood of t_0. Then \n\n(17) x^{ℓ-1}_{k̄}(t) = A/(t − t_0) + f(t), with f analytic in a neighborhood of t_0. \n\nFrom (17) and (5), we obtain \n\n(18) x^ℓ_j(t) = σ( w^ℓ_{jk̄} A (t − t_0)^{-1} + g(t) ), with \n\n(19) g(t) = w^ℓ_{jk̄} f(t) + Σ_{k ≠ k̄} w^ℓ_{jk} x^{ℓ-1}_k(t) + θ^ℓ_j analytic in a neighborhood of t_0. \n\nThus, in a neighborhood of t_0, the poles of x^ℓ_j(t) are the solutions t_m of the equation \n\n(20) w^ℓ_{jk̄} A/(t_m − t_0) + g(t_m) = (2m + 1)πi, m ∈ ℤ. \n\nThere are infinitely many solutions of (20), accumulating at t_0. Hence, t_0 is an accumulation point of poles of x^ℓ_j(t), which completes the proof of (15). \n\nIn view of (14), (15), it is natural to make the following definitions. The natural domain of a neural net is the largest open subset of the complex plane to which the output map t ↦ x^L_1(t) can be analytically continued. For ℓ ≥ 0 we define the ℓth singular set Sing(ℓ) by setting \n\nSing(0) = complement of the natural domain in ℂ, and \nSing(ℓ + 1) = the set of all accumulation points of Sing(ℓ). \n\nThese definitions are made entirely in terms of the output map, without reference to the structure of the given neural net. On the other hand, the sets Sing(ℓ) contain nearly complete information on the architecture, weights and thresholds of the net. This will allow us to read off the structure of a neural net from the analytic continuation of its output map. To see how the sets Sing(ℓ) reflect the structure of the net, we reason as follows. From (14) and (15) we expect that \n\n(21) For 1 ≤ ℓ ≤ L, Sing(L − ℓ) is the union over j = 1, ...
, D_ℓ of the set of poles of x^ℓ_j(t), together with their accumulation points (which we ignore here); and \n\n(22) For ℓ ≥ L, Sing(ℓ) is empty. \n\nImmediately, then, we can read off the \"depth\" L of the neural net; it is simply the smallest ℓ for which Sing(ℓ) is empty. \n\nWe need to solve for D_ℓ, w^ℓ_{jk}, θ^ℓ_j. We proceed by induction on ℓ. \n\nWhen ℓ = 1, (14) and (21) show that Sing(L − 1) is the union of arithmetic progressions Π^1_j, j = 1, ..., D_1. Therefore, from Sing(L − 1) we can read off D_1 and the Π^1_j. (We will return to this point later in the introduction.) In view of (16), Π^1_j determines the weights and thresholds at layer 1, modulo signs. Thus, we have found D_1, w^1_{jk}, θ^1_j. \n\nWhen ℓ > 1, we may assume that \n\n(23) The D_{ℓ'}, w^{ℓ'}_{jk}, θ^{ℓ'}_j are already known, for 1 ≤ ℓ' < ℓ. \n\nOur task is to find D_ℓ, w^ℓ_{jk}, θ^ℓ_j. In view of (23), we can find a pole t_0 of x^{ℓ-1}_{k̄}(t) for our favorite k̄. Assume for simplicity that t_0 is a simple pole of x^{ℓ-1}_{k̄}(t), and that the x^{ℓ-1}_k(t) (k ≠ k̄) are analytic in a neighborhood of t_0. Then x^{ℓ-1}_{k̄}(t) is given by (17) in a neighborhood of t_0, with A already known by virtue of (23). Let U be a small neighborhood of t_0. \n\nWe will look at the image Y of U ∩ Sing(L − ℓ) under the map t ↦ A/(t − t_0). Since A, t_0 and Sing(L − ℓ) are already known, so is Y. On the other hand, we can relate Y to D_ℓ, w^ℓ_{jk}, θ^ℓ_j as follows. From (21) we see that Y is the union over j = 1, ..., D_ℓ of \n\n(24) Y_j = image of U ∩ {poles of x^ℓ_j(t)} under t ↦ A/(t − t_0). \n\nFor fixed j, the poles of x^ℓ_j(t) in a neighborhood of t_0 are the t_m given by (20). We write \n\n(25) w^ℓ_{jk̄} A/(t_m − t_0) = [ w^ℓ_{jk̄} A/(t_m − t_0) + g(t_m) ] − [ g(t_m) − g(t_0) ] − g(t_0). \n\nEquation (20) shows that the first expression in brackets in (25) is equal to (2m + 1)πi. Also, since t_m → t_0 as |m| → ∞ and g is analytic in a neighborhood of t_0, the second expression in brackets in (25) tends to zero. Hence, \n\nw^ℓ_{jk̄} A/(t_m − t_0) = (2m + 1)πi − g(t_0) + o(1) for large m. 
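The arithmetic progressions of poles driving this argument are easy to check numerically. The sketch below is our own illustration (w, θ and the tolerances are arbitrary choices): it verifies that consecutive poles of a first-layer output σ(w t + θ) differ by the constant step 2πi/w, as in (16), and that the function blows up near a pole.

```python
import cmath

# Illustration of (14)/(16): for x(t) = sigma(w*t + theta) with
# sigma(z) = tanh(z/2), the poles are t_m = ((2m+1)*pi*i - theta)/w.
# w and theta are illustrative values of our own choosing.

def sigma(z):
    return cmath.tanh(z / 2)   # poles where z = (2m+1)*pi*i

w, theta = 1.7, 0.3

def pole(m):
    return ((2 * m + 1) * cmath.pi * 1j - theta) / w

# consecutive poles differ by the constant step 2*pi*i/w ...
step = pole(1) - pole(0)
print(abs(step - 2 * cmath.pi * 1j / w))  # ~0 (floating-point error only)

# ... and |sigma(w*t + theta)| blows up as t approaches a pole,
# roughly like 2/(|w| * |t - t_m|) near a simple pole
for eps in (1e-1, 1e-3, 1e-5):
    print(abs(sigma(w * (pole(4) + eps) + theta)))
```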
\n\nComparing this with the definition (24), we see that Y_j is asymptotic to the arithmetic progression \n\n(26) Π^ℓ_j = { ((2m + 1)πi − g(t_0)) / w^ℓ_{jk̄} : m ∈ ℤ }. \n\nThus, the known set Y is the union over j = 1, ..., D_ℓ of sets Y_j, with Y_j asymptotic to the arithmetic progression Π^ℓ_j. From Y, we can therefore read off D_ℓ and the Π^ℓ_j. (We will return to this point in a moment.) We see at once from (26) that w^ℓ_{jk̄} is determined up to sign by Π^ℓ_j. Thus, we have found D_ℓ and the w^ℓ_{jk}. With more work, we can also find the θ^ℓ_j, completing the induction on ℓ. \n\nThe above induction shows that the structure of a neural net may be read off from the analytic continuation of its output map. We believe that the analytic continuation of the output map will lead to further consequences in the study of neural nets. \n\nLet us touch briefly on a few points which we glossed over above. First of all, suppose we are given a set Y ⊂ ℂ, and we know that Y is the union of sets Y_1, ..., Y_D, with Y_j asymptotic to an arithmetic progression Π_j. We assumed above that Π_1, ..., Π_D are uniquely determined by Y. In fact, without some further hypothesis on the Π_j, this need not be true. For instance, we cannot distinguish Π_1 ∪ Π_2 from Π_3 if Π_1 = {odd integers}, Π_2 = {even integers}, Π_3 = {all integers}. On the other hand, we can clearly recognize Π_1 = {all integers} and Π_2 = {m√2 : m an integer} from their union Π_1 ∪ Π_2. Thus, irrational numbers enter the picture. The role of our generic hypothesis (13) is to control the arithmetic progressions that arise in our proof. \n\nSecondly, suppose x^ℓ_{k̄}(t) has a pole at t_0. We assumed for simplicity that x^ℓ_k(t) is analytic in a neighborhood of t_0 for k ≠ k̄. However, one of the x^ℓ_k(t) (k ≠ k̄) may also have a pole at t_0. 
In that case, the x^{ℓ+1}_j(t) may all be analytic in a neighborhood of t_0, because the contributions of the singularities of the x^ℓ_k in σ( Σ_k w^{ℓ+1}_{jk} x^ℓ_k + θ^{ℓ+1}_j ) may cancel. Thus, the singularity at t_0 may disappear from the output map. While this circumstance is hardly generic, it is not ruled out by our hypotheses (12), (13). \n\nBecause singularities can disappear, we have to make technical changes in our description of Sing(ℓ). For example, in the discussion following (23), Y need not be the union of the sets Y_j. Rather, Y is their \"approximate union\". (See [F].) \n\nNext, we should point out that the signs of the weights and thresholds require some attention, even though we have some freedom to change signs by applying isomorphisms. (See (9).) \n\nFinally, in the definition of the natural domain, we have assumed that there is a unique maximal open set to which the output map continues analytically. This need not be true of a general real-analytic function on the line; for instance, take f(t) = (1 + t²)^{1/2}. Fortunately, the natural domain is well-defined for any function that continues analytically to the complement of a countable set. The defining formula (5) lets us check easily that the output map continues to the complement of a countable set, so the natural domain makes sense. This concludes our overview of the proof of our main theorem. The full proof of our results will appear in [F]. \n\nBoth the uniqueness problem and the use of analytic continuation have already appeared in the neural net literature. In particular, it was R. Hecht-Nielsen who pointed out the role of isomorphisms and posed the uniqueness problem. His paper with Chen and Lu [CLH] on \"equioutput transformations\" on the space of all neural nets influenced our work. E. Sontag [So] and H. Sussmann [Su] proved sharp uniqueness theorems for one hidden layer. The proof in [So] uses complex variables. 
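For concreteness, the symmetries (7)-(9) can be verified numerically. The sketch below is ours (names and the toy one-hidden-layer architecture are assumptions): it permutes the hidden nodes, flips the sign of every weight and threshold attached to one hidden node, and checks that the output map is unchanged.

```python
import numpy as np

# Numerical check of (9): a hidden-layer permutation combined with
# per-node sign flips leaves the output map invariant, because the
# next layer undoes both and sigma is odd.

def sigma(x):
    return np.tanh(x / 2)

def output_map(weights, thresholds, t):
    x = np.asarray(t, dtype=float)
    for W, theta in zip(weights, thresholds):
        x = sigma(W @ x + theta)
    return x

rng = np.random.default_rng(1)
# architecture (2, 3, 1): one hidden layer with D_1 = 3
W1, W2 = rng.standard_normal((3, 2)), rng.standard_normal((1, 3))
th1, th2 = rng.standard_normal(3), rng.standard_normal(1)

p = [2, 0, 1]                    # permutation of the hidden nodes
s = np.array([-1.0, 1.0, 1.0])   # the signs epsilon_j in (9)
W1h = s[:, None] * W1[p, :]      # new hidden unit j computes s_j * x_{p[j]}
th1h = s * th1[p]
W2h = W2[:, p] * s[None, :]      # outgoing weights absorb signs and permutation

t = rng.standard_normal(2)
y = output_map([W1, W2], [th1, th2], t)
yh = output_map([W1h, W2h], [th1h, th2], t)
print(np.allclose(y, yh))  # True
```

The check relies only on σ being odd, exactly as noted after (9).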
\n\nAcknowledgements \n\nFefferman is grateful to R. Crane, S. Markel, J. Pearson, E. Sontag, R. Sverdlove, and N. Winarsky for introducing him to the study of neural nets. \n\nThis research was supported by the Advanced Research Projects Agency of the Department of Defense and was monitored by the Air Force Office of Scientific Research under Contract F49620-92-C-0072. The United States Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation hereon. This work was also supported by the National Science Foundation. \n\nThe following posters, presented at NIPS 93, may clarify our uniqueness theorem. \n\nReferences \n\n[CLH] R. Hecht-Nielsen et al., On the geometry of feedforward neural network error surfaces. (to appear). \n\n[F] C. Fefferman, Reconstructing a neural network from its output, Revista Matematica Iberoamericana. (to appear). \n\n[So] F. Albertini and E. Sontag, Uniqueness of weights for neural networks. (to appear). \n\n[Su] H. Sussmann, Uniqueness of the weights for minimal feedforward nets with a given input-output map, Neural Networks 5 (1992), pp. 589-593. \n\nRecovering a Feed-Forward Net from Its Output (posters) \n\nCharles Fefferman, David Sarnoff Research Center and Princeton University, Princeton, New Jersey \nScott A. Markel, David Sarnoff Research Center, Princeton, New Jersey \n\nSuppose an unknown neural network is placed in a black box. You aren't allowed to look in the box, but you are allowed to observe the outputs produced by the network for arbitrary inputs. Then, in principle, you have enough information to determine the network architecture (number of layers and number of nodes in each layer) and the unique values for all the weights. 
\n\nThe Output Map of a Neural Network \n\nFix a feed-forward neural network with the standard sigmoid σ(x) = tanh(x/2). [Figure: a feed-forward network carrying inputs (x_1, ..., x_n) to outputs (y_1, ..., y_m).] The map that carries input vectors (x_1, ..., x_n) to output vectors (y_1, ..., y_m) is called the OUTPUT MAP of the neural network. \n\nThe Key Question \n\nWhen can two neural networks have the same output map? \n\nObvious Examples of Two Neural Networks with the Same Output Map \n\nStart with a neural network N. Then either \n1. permute the nodes in a hidden layer, or \n2. fix a hidden node, and change the sign of every weight (including the bias weight) that involves that node. \nThis yields a new neural network with the same output map as N. \n\nUniqueness Theorem \n\nLet N and N' be neural networks that satisfy generic conditions described below. If N and N' have the same output map, then they differ only by sign changes and permutations of hidden nodes. \n\nGeneric Conditions \n\nWe assume that \n• all weights are non-zero \n• bias weights within each layer have distinct absolute values \n• the ratio of weights from node i in layer l to nodes j and k in layer (l+1) is not equal to any fraction of the form p/q with p, q integers and 1 ≤ q ≤ 100·(number of nodes in layer l) \nSome such assumptions are needed to avoid obvious counterexamples. \n\nOutline of the Proof \n\n• it's enough to consider networks with one input node and one output node (see below) \n• all node outputs are now functions of a single, real variable t (the network input) \n• analytically continue the network output to a function f of a single, complex variable t \n• the qualitative geometry of the poles of the function f determines the network architecture (see below) \n• the asymptotics of the function f near its singularities determine the weights \n\nReduction to a Network with Single Input and Output Nodes \n\n• focus attention on a single output node, ignoring the others \n• study only input data with a single non-zero entry \n\nGeometric Description of the Poles \n\n[Figure: scatter of poles in the complex plane.] \n• poles (small dots) accumulate at essential singularities (small squares) \n• essential singularities (small squares) accumulate at more complicated essential singularities (large dots) \n\nDetermining the Network Architecture from the Picture \n\n• three kinds of singularities (small dots, small squares, large dots) => three layers of sigmoids, i.e. two hidden layers and an output layer \n• three 'spiral arms' of small squares accumulate at each large dot => three nodes in the second hidden layer \n• two 'spiral arms' of small dots accumulate at each small square => two nodes in the first hidden layer \n\nDetermining the Network Architecture from the Picture (cont'd) \n\n• from the network reduction we know that there is one input node and one output node \n• therefore, the network architecture is as pictured \n", "award": [], "sourceid": 748, "authors": [{"given_name": "Charles", "family_name": "Fefferman", "institution": null}, {"given_name": "Scott", "family_name": "Markel", "institution": null}]}