{"title": "How Neural Nets Work", "book": "Neural Information Processing Systems", "page_first": 442, "page_last": 456, "abstract": null, "full_text": "442 \n\nAlan Lapedes \nRobert Farber \n\nTheoretical Division \n\nHow Neural Nets Work \n\nLos Alamos National Laboratory \n\nLos Alamos, NM 87545 \n\nAbstract: \n\nThere is presently great interest in the abilities of neural networks to mimic \n\"qualitative reasoning\" by manipulating neural incodings of symbols. Less work \nhas been performed on using neural networks to process floating point numbers \nand it is sometimes stated that neural networks are somehow inherently inaccu(cid:173)\nrate and therefore best suited for \"fuzzy\" qualitative reasoning. Nevertheless, \nthe potential speed of massively parallel operations make neural net \"number \ncrunching\" an interesting topic to explore. In this paper we discuss some of our \nwork in which we demonstrate that for certain applications neural networks can \nachieve significantly higher numerical accuracy than more conventional tech(cid:173)\nniques. In particular, prediction of future values of a chaotic time series can \nbe performed with exceptionally high accuracy. We analyze how a neural net \nis able to do this , and in the process show that a large class of functions from \nRn. ~ Rffl may be accurately approximated by a backpropagation neural net \nwith just two \"hidden\" layers. The network uses this functional approximation \nto perform either interpolation (signal processing applications) or extrapolation \n(symbol processing applicationsJ. Neural nets therefore use quite familiar meth(cid:173)\nods to perform. their tasks. The geometrical viewpoint advocated here seems to \nbe a useful approach to analyzing neural network operation and relates neural \nnetworks to well studied topics in functional approximation. \n1. 
Introduction \n\nAlthough a great deal of interest has been displayed in neural networks' capabilities to perform a kind of qualitative reasoning, relatively little work has been done on the ability of neural networks to process floating point numbers in a massively parallel fashion. Clearly, this is an important ability. In this paper we discuss some of our work in this area and show the relation between numerical and symbolic processing. We will concentrate on the subject of accurate prediction in a time series. Accurate prediction has applications in many areas of signal processing. It is also a useful, and fascinating, ability when dealing with natural, physical systems. Given some data from the past history of a system, can one accurately predict what it will do in the future? \n\nMany conventional signal processing tests, such as correlation function analysis, cannot distinguish deterministic chaotic behavior from stochastic noise. Particularly difficult systems to predict are those that are nonlinear and chaotic. Chaos has a technical definition based on nonlinear, dynamical systems theory, but intuitively means that the system is deterministic but \"random,\" in a rather similar manner to the deterministic, pseudo random number generators used on conventional computers. Examples of chaotic systems in nature include turbulence in fluids (D. Ruelle, 1971; H. Swinney, 1978), chemical reactions (K. Tomita, 1979), lasers (H. Haken, 1975), and plasma physics (D. Russell, 1980), to name but a few. Typically, chaotic systems also display the full range of nonlinear behavior (fixed points, limit cycles etc.) when parameters are varied, and therefore provide a good testbed in which to investigate techniques of nonlinear signal processing. 
Clearly, if one can uncover the underlying, deterministic algorithm from a chaotic time series, then one may be able to predict the future time series quite accurately. \n\n\u00a9 American Institute of Physics 1988 \n\n\f443 \n\nIn this paper we review and extend our work (Lapedes and Farber, 1987) on predicting the behavior of a particular dynamical system, the Glass-Mackey equation. We feel that the method will be fairly general, and use the Glass-Mackey equation solely for illustrative purposes. The Glass-Mackey equation has a strange attractor with fractal dimension controlled by a constant parameter appearing in the differential equation. We present results on a neural network's ability to predict this system at two values of this parameter, one value corresponding to the onset of chaos, and the other value deeply in the chaotic regime. We also present the results of more conventional predictive methods and show that a neural net is able to achieve significantly better numerical accuracy. This particular system was chosen because of D. Farmer's and J. Sidorowich's (D. Farmer, J. Sidorowich, 1987) use of it in developing a new, non-neural net method for predicting chaos. The accuracy of this non-neural net method, and the neural net method, are roughly equivalent, with various advantages or disadvantages accruing to one method or the other depending on one's point of view. We are happy to acknowledge many valuable discussions with Farmer and Sidorowich that have led to further improvements in each method. \n\nWe also show that a neural net never needs more than two hidden layers to solve most problems. This statement arises from a more general argument that a neural net can approximate functions from R^n -> R^m with only two hidden layers, and that the accuracy of the approximation is controlled by the number of neurons in each layer. 
The argument assumes that the global minimum to the backpropagation minimization problem may be found, or that a local minimum very close in value to the global minimum may be found. This seems to be the case in the examples we considered, and in many examples considered by other researchers, but is never guaranteed. The conclusion of an upper bound of two hidden layers is related to a similar conclusion of R. Lippmann (R. Lippmann, 1987), who has previously analyzed the number of hidden layers needed to form arbitrary decision regions for symbolic processing problems. Related issues are discussed by J. Denker (J. Denker et al., 1987). It is easy to extend the argument to draw similar conclusions about an upper bound of two hidden layers for symbol processing, and to place signal processing and symbol processing in a common theoretical framework. \n2. Backpropagation \n\nBackpropagation is a learning algorithm for neural networks that seeks to find weights, T_ij, such that given an input pattern from a training set of pairs of Input/Output patterns, the network will produce the Output of the training set given the Input. Having learned this mapping between I and O for the training set, one then applies a new, previously unseen Input, and takes the Output as the \"conclusion\" drawn by the neural net based on having learned fundamental relationships between Input and Output from the training set. A popular configuration for backpropagation is a totally feedforward net (Figure 1) where Input feeds up through \"hidden layers\" to an Output layer. \n\n\f444 \n\nFigure 1. A feedforward neural net. Arrows schematically indicate full feedforward connectivity. \n\nEach neuron forms a weighted sum of the inputs from previous layers to which it is connected, adds a threshold value, and produces a nonlinear function of this sum as its output value. 
This output value serves as input to the later layers to which the neuron is connected, and the process is repeated. Ultimately a value is produced for the outputs of the neurons in the Output layer. Thus, each neuron performs: \n\no_i = g( Sum_j T_ij o_j + theta_i )   (1) \n\nwhere the T_ij are continuous valued, positive or negative weights, theta_i is a constant, and g(x) is a nonlinear function that is often chosen to be of a sigmoidal form. For example, one may choose \n\ng(x) = (1/2)(1 + tanh x)   (2) \n\nwhere tanh is the hyperbolic tangent, although the exact formula of the sigmoid is irrelevant to the results. If t_i^(p) are the target output values for the pth Input pattern then one trains the network by minimizing \n\nE = Sum_p Sum_i ( t_i^(p) - o_i^(p) )^2   (3) \n\nwhere t_i^(p) is the target output value (taken from the training set) and o_i^(p) is the output of the network when the pth Input pattern of the training set is presented on the Input layer; i indexes the neurons in the Output layer. \n\nAn iterative procedure is used to minimize E. For example, the commonly used steepest descents procedure is implemented by changing T_ij and theta_i by Delta T_ij and Delta theta_i, where \n\n\f445 \n\nDelta T_ij = -epsilon dE/dT_ij   (4a) \n\nDelta theta_i = -epsilon dE/dtheta_i   (4b) \n\nThis implies that Delta E < 0 and hence E will decrease to a local minimum. Use of the chain rule and definition of some intermediate quantities allows the following expressions for Delta T_ij to be obtained (Rumelhart, 1986): \n\nDelta T_ij = epsilon Sum_p delta_i^(p) o_j^(p)   (5a) \n\nDelta theta_i = epsilon Sum_p delta_i^(p)   (5b) \n\nwhere \n\ndelta_i^(p) = ( t_i^(p) - o_i^(p) ) o_i^(p) ( 1 - o_i^(p) )   (6) \n\nif i labels a neuron in the Output layer; and \n\ndelta_i^(p) = o_i^(p) ( 1 - o_i^(p) ) Sum_j T_ji delta_j^(p)   (7) \n\nif i labels a neuron in the hidden layers. Therefore one computes delta_i^(p) for the Output layer first, then uses Eqn. (7) to compute delta_i^(p) for the hidden layers, and finally uses Eqn. (5) to make an adjustment to the weights. 
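The update rules of Eqns. (1)-(7) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: a single hidden layer, batch steepest descent, and illustrative layer sizes, learning rate, and random training data.

```python
import numpy as np

# Minimal sketch of Eqns. (1)-(7): one hidden layer trained by batch
# steepest descent.  Layer sizes, the learning rate eps, and the random
# training data are illustrative assumptions, not values from the paper.
rng = np.random.default_rng(0)

def g(x):
    return 0.5 * (1.0 + np.tanh(x))              # the sigmoid of Eqn. (2)

n_in, n_hid, n_out, eps = 3, 5, 2, 0.05
T1 = rng.normal(0.0, 0.5, (n_hid, n_in)); th1 = np.zeros(n_hid)
T2 = rng.normal(0.0, 0.5, (n_out, n_hid)); th2 = np.zeros(n_out)

I = rng.random((10, n_in))                       # 10 Input patterns
t = rng.random((10, n_out))                      # target Outputs

def error():                                     # Eqn. (3)
    o = g(g(I @ T1.T + th1) @ T2.T + th2)
    return ((t - o) ** 2).sum()

E0 = error()
for _ in range(2000):
    h = g(I @ T1.T + th1)                        # hidden outputs, Eqn. (1)
    o = g(h @ T2.T + th2)                        # Output layer
    d_out = (t - o) * o * (1.0 - o)              # Eqn. (6)
    d_hid = h * (1.0 - h) * (d_out @ T2)         # Eqn. (7)
    T2 += eps * d_out.T @ h; th2 += eps * d_out.sum(0)   # Eqn. (5)
    T1 += eps * d_hid.T @ I; th1 += eps * d_hid.sum(0)
E = error()                                      # E decreases toward a local minimum
```
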
We remark that the steepest descents procedure in common use is extremely slow in simulation, and that a better minimization procedure, such as the classic conjugate gradient procedure (W. Press, 1986), can offer quite significant speedups. Many applications use bit representations (0,1) for symbols, and attempt to have a neural net learn fundamental relationships between the symbols. This procedure has been successfully used in converting text to speech (T. Sejnowski, 1986) and in determining whether a given fragment of DNA codes for a protein or not (A. Lapedes, R. Farber, 1987). \n\nThere is no fundamental reason, however, to use integers as values for Input and Output. If the Inputs and Outputs are instead a collection of floating point numbers, then the network, after training, yields a specific continuous function in n variables (for n inputs) involving g(x) (i.e. hyperbolic tanh's) that provides a type of nonlinear, least mean square interpolant formula for the discrete set of data points in the training set. Use of this formula O = f(I_1, I_2, ..., I_n), when given a new input not in the training set, is then either interpolation or extrapolation. \n\nSince the Output values, when assumed to be floating point numbers, may have a dynamic range greater than [0,1], one may modify the g(x) on the Output layer to be a linear function, instead of sigmoidal, so as to encompass the larger dynamic range. Dynamic range of the Input values is not so critical; however, we have found that numerical problems may be avoided by scaling the Inputs (and \n\n\f446 \n\nalso the Outputs) to [0,1], training the network, and then rescaling the T_ij, theta_i, to encompass the original dynamic range. The point is that scale changes in I and O may, for feedforward networks, always be absorbed in the T_ij, theta_i, and vice versa. 
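The scale-absorption remark above can be checked directly: folding the input scaling into the first-layer weights and thresholds reproduces, on the raw inputs, exactly the hidden activations obtained from the [0,1]-scaled inputs. The dynamic range, layer sizes, and random "trained" weights below are arbitrary stand-ins.

```python
import numpy as np

# Sketch of the scale-absorption remark: a net trained on inputs scaled to
# [0,1] can be converted to run on raw inputs by folding the scaling into
# the first-layer weights and thresholds.  The dynamic range [lo, hi] and
# the random "trained" weights are illustrative assumptions.
rng = np.random.default_rng(1)

def g(x):
    return 0.5 * (1.0 + np.tanh(x))

lo, hi = -40.0, 75.0                      # assumed raw dynamic range
I_raw = rng.uniform(lo, hi, (8, 4))       # raw Input patterns
I_scl = (I_raw - lo) / (hi - lo)          # Inputs scaled to [0,1]

T = rng.normal(0.0, 1.0, (3, 4))          # first-layer weights after training
th = rng.normal(0.0, 1.0, 3)              # first-layer thresholds

T_new = T / (hi - lo)                     # absorb the scale factor
th_new = th - lo * T_new.sum(axis=1)      # absorb the shift

h_scaled = g(I_scl @ T.T + th)            # scaled net on scaled inputs
h_raw = g(I_raw @ T_new.T + th_new)       # rescaled net on raw inputs
```

The two activation matrices agree to machine precision, which is the sense in which scale changes in I "may always be absorbed in the T_ij, theta_i."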
We use this procedure (backpropagation, conjugate gradient, linear outputs and scaling) in the following section to predict points in a chaotic time series. \n3. Prediction \n\nLet us consider situations in Nature where a system is described by nonlinear differential equations. This is fairly generic. We choose a particular nonlinear equation that has an infinite dimensional phase space, so that it is similar to other infinite dimensional systems such as partial differential equations. A differential equation with an infinite dimensional phase space (i.e. an infinite number of values are necessary to describe the initial condition) is a delay differential equation. We choose to consider the time series generated by the Glass-Mackey equation: \n\nx_dot(t) = a x(t - tau) / ( 1 + x^10(t - tau) ) - b x(t)   (8) \n\nThis is a nonlinear differential, delay equation with an initial condition specified by an initial function defined over a strip of width tau (hence the infinite dimensional phase space, i.e. initial functions, not initial constants, are required). Choosing this function to be a constant function, and a = .2, b = .1, and tau = 17, yields a time series, x(t) (obtained by integrating Eqn. (8)), that is chaotic with a fractal attractor of dimension 2.1. Increasing tau to 30 yields more complicated evolution and a fractal dimension of 3.5. The time series for 500 time steps for tau = 30 (time in units of tau) is plotted in Figure 2. The nonlinear evolution of the system collapses the infinite dimensional phase space down to a low (approximately 2 or 3 dimensional) fractal, attracting set. Similar chaotic systems are not uncommon in Nature. \n\nFigure 2. Example time series at tau = 30. 
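Eqn. (8) is straightforward to integrate numerically once the delayed value x(t - tau) is drawn from a stored history. The sketch below uses a simple Euler step; the step size dt and the constant initial value 0.9 are assumptions (the paper specifies only a = 0.2, b = 0.1, tau, and a constant initial function).

```python
import numpy as np

# Euler integration of the Glass-Mackey equation, Eqn. (8).  The step size
# dt and the constant initial function x = 0.9 are illustrative assumptions.
a, b, tau, dt = 0.2, 0.1, 17.0, 0.1
n_lag = int(tau / dt)                   # how many steps back x(t - tau) lies
x = [0.9] * (n_lag + 1)                 # constant initial function on the strip

for _ in range(20000):
    x_tau = x[-1 - n_lag]               # the delayed value x(t - tau)
    x_dot = a * x_tau / (1.0 + x_tau ** 10) - b * x[-1]
    x.append(x[-1] + dt * x_dot)

series = np.array(x[n_lag + 1:])        # discard the initial strip
```
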
\n\n\f447 \n\nThe goal is to take a set of values of x(t) at discrete times in some time window containing times less than t, and use the values to accurately predict x(t + P), where P is some prediction time step into the future. One may fix P, collect statistics on accuracy for many prediction times t (by sliding the window along the time series), and then increase P and again collect statistics on accuracy. Thus one may observe how an average index of accuracy changes as P is increased. In terms of Figure 2 we will select various prediction time steps, P, that correspond to attempting to predict within a \"bump,\" to predicting a couple of \"bumps\" ahead. The fundamental nature of chaos dictates that prediction accuracy will decrease as P is increased. This is due to inescapable inaccuracies of finite precision in specifying the x(t) at discrete times in the past that are used for predicting the future. Thus, all predictive methods will degrade as P is increased - the question is \"How rapidly does the error increase with P?\" We will demonstrate that the neural net method can be orders of magnitude more accurate than conventional methods at large prediction time steps, P. \n\nOur goal is to use backpropagation, and a neural net, to construct a function \n\nO(t + P) = f( I_1, I_2, ..., I_m )   (9) \n\nwhere O(t + P) is the output of a single neuron in the Output layer, and I_1 ... I_m are input neurons that take on the values x(t), x(t - Delta), ..., x(t - m Delta), where Delta is a time delay. O(t + P) takes on the value x(t + P). We chose the network configuration of Figure 1. \n\nWe construct a training set by selecting a set of input values: \n\nI_1 = x(t_p) \nI_2 = x(t_p - Delta) \n... \nI_m = x(t_p - m Delta)   (10) \n\nwith associated output values O = x(t_p + P), for a collection of discrete times that are labelled by t_p. Typically we used 500 I/O pairs in the training set, so that p ranged from 1 to 500. Thus we have a collection of 500 sets of { I_1^(p), I_2^(p), ..., I_m^(p); O^(p) } to use in training the neural net. This procedure of using delayed sampled values of x(t) can be implemented by using tapped delay lines, just as is normally done in linear signal processing applications (B. Widrow, 1985). Our prediction procedure is a straightforward nonlinear extension of the linear Widrow-Hoff algorithm. After training is completed, prediction is performed on a new set of times, t_p, not in the training set, i.e. for p > 500. \n\nWe have not yet specified what m or Delta should be, nor given any indication why a formula like Eqn. (9) should work at all. An important theorem of Takens (Takens, 1981) states that for flows evolving to compact attracting manifolds of dimension d_A, a functional relation like Eqn. (9) does exist, and that m lies in the range d_A < m + 1 < 2 d_A + 1. We therefore choose m = 4 for tau = 30. Takens provides no information on Delta, and we chose Delta = 6 for both cases. We found that a few different choices of m and Delta can affect accuracy by a factor of 2 - a somewhat significant but not overwhelming sensitivity, in view of the fact that neural nets tend to be orders of magnitude more accurate than other methods. Takens' theorem gives no information on the form of f() in Eqn. (9). It therefore \n\n\f448 \n\nis necessary to show that neural nets provide a robust approximating procedure for continuous f(), which we do in the following section. It is interesting to note that attempts to predict future values of a time series using past values of x(t) from a tapped delay line is a common procedure in signal processing, and yet there is little, if any, reference to results of nonlinear dynamical systems theory showing why any such attempt is reasonable. \n\nAfter training the neural net as described above, we used it to predict 500 new values of x(t) in the future and computed the average accuracy for these points. 
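The training-set construction of Eqns. (9)-(10) amounts to a delay embedding of the series. A sketch, with m = 4 and Delta = 6 as in the text, but with a stand-in signal in place of the Glass-Mackey series:

```python
import numpy as np

# Delay-embedding construction of the I/O pairs in Eqn. (10).  m = 4 and
# Delta = 6 follow the text; the signal x is a stand-in for the Glass-Mackey
# series, and the taps used are x(t), x(t - Delta), ..., x(t - (m-1) Delta).
m, Delta, P = 4, 6, 6
n = np.arange(1000)
x = np.sin(0.3 * n) * np.cos(0.05 * n)       # stand-in time series

inputs, outputs = [], []
for t in range((m - 1) * Delta, len(x) - P):
    inputs.append([x[t - k * Delta] for k in range(m)])   # I_1 ... I_m
    outputs.append(x[t + P])                              # O = x(t + P)

inputs, outputs = np.array(inputs), np.array(outputs)
```
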
The accuracy is defined to be the average root mean square error, divided by a constant scale factor, which we took to be the standard deviation of the data. It is necessary to remove the scale dependence of the data, and dividing by the standard deviation of the data provides a scale to use. Thus the resulting \"index of accuracy\" is insensitive to the dynamic range of x(t). \n\nAs just described, if one wanted to use a neural net to continuously predict x(t) values at, say, 6 time steps past the last observed value (i.e. wanted to construct a net predicting x(t + 6)) then one would train one network, at P = 6, to do this. If one wanted to always predict 12 time steps past the last observed x(t) then a separate, P = 12, net would have to be trained. We, in fact, trained separate networks for P ranging between 6 and 100 in steps of 6. The index of accuracy for these networks (as obtained by computing the index of accuracy in the prediction phase) is plotted as curve D in Figure 3. There is, however, an alternate way to predict. If one wished to predict, say, x(t + 12) using a P = 6 net, then one can iterate the P = 6 net. That is, one uses the P = 6 net to predict the x(t + 6) values, and then feeds x(t + 6) back into the input line to predict x(t + 12) using the predicted x(t + 6) value instead of the observed x(t + 6) value. In fact, one can't use the observed x(t + 6) value, because it hasn't been observed yet - the rule of the game is to use only data occurring at time t and before, to predict x(t + 12). This procedure corresponds to iterating the map given by Eqn. (9) to perform prediction at multiples of P. Of course, the delays, Delta, must be chosen commensurate with P. \n\nThis iterative method of prediction has potential dangers. 
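The iteration scheme just described, feeding each P-step prediction back onto the tapped delay line to reach multiples of P, can be sketched as follows. The `predict` function here is a toy stand-in for a trained P = 6 net, not the paper's network.

```python
import numpy as np

# Sketch of iterated prediction: each call to `predict` stands in for one
# application of a trained P-step net, and its output is appended to the
# delay line so later predictions use predicted, not observed, values.
m, Delta = 4, 6

def predict(taps):
    # hypothetical stand-in for the trained net O = f(I_1, ..., I_m)
    return 0.9 * taps[0] + 0.1 * taps[-1]

def iterate(history, n_steps):
    buf = list(history)                          # observed values up to time t
    for _ in range(n_steps):
        taps = [buf[-1 - k * Delta] for k in range(m)]   # tapped delay line
        buf.append(predict(taps))                # predicted value re-enters input
    return buf[len(history):]                    # predictions at P, 2P, ..., n_steps*P

preds = iterate(np.linspace(0.0, 1.0, 40).tolist(), n_steps=3)
```
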
Because (in our example of iterating the P = 6 map) the predicted x(t + 6) is always made with some error, this error is compounded in iteration, because predicted, and not observed, values are used on the input lines. However, one may predict more accurately for smaller P, so it may be the case that choosing a very accurate small P prediction, and iterating, can ultimately achieve higher accuracy at the larger P's of interest. This turns out to be true, and the iterated net method is plotted as curve E in Figure 3. It is the best procedure to use. Curves A, B, C are alternative methods (iterated polynomial, Widrow-Hoff, and non-iterated polynomial respectively. More information on these conventional methods is in (Lapedes and Farber, 1987)). \n\n\f449 \n\nFigure 3. Index of accuracy versus prediction time step P (tau = 30) for curves A-E. \n\n4. Why It Works \n\nConsider writing out explicitly Eqn. (9) for a two hidden layer network where the output is assumed to be a linear neuron. We consider Input connected to Hidden Layer 1, Hidden Layer 1 to Hidden Layer 2, and Hidden Layer 2 to Output. Therefore: \n\nO_t = Sum_{k in H2} T_tk g( Sum_{j in H1} T_kj g( Sum_i T_ji I_i + theta_j ) + theta_k ) + theta_t   (11) \n\nRecall that the output neuron is a linear computing element, so that only two g()'s occur in formula (11), due to the two nonlinear hidden layers. For ease in later analysis, let us rewrite this formula as \n\nO_t = Sum_{k in H2} T_tk g( SUM_k + theta_k ) + theta_t   (12a) \n\nwhere \n\nSUM_k = Sum_{j in H1} T_kj g( Sum_i T_ji I_i + theta_j )   (12b) \n\n\f450 \n\nThe T's and theta's are specific numbers specified by the training algorithm, so that after training is finished one has a relatively complicated formula (12a, 12b) that expresses the Output value as a specific, known, function of the Input values: \n\nO_t = f( I_1, I_2, ..., I_m ). 
\n\nA functional relation of this form, when there is only one output, may be viewed as a surface in m + 1 dimensional space, in exactly the same manner one interprets the formula z = f(x,y) as a two dimensional surface in three dimensional space. The general structure of f() as determined by Eqn. (12a, 12b) is in fact quite simple. From Eqn. (12b) we see that one first forms a sum of g() functions (where g() is a sigmoidal function) and then from Eqn. (12a) one forms yet another sum involving g() functions. It may at first be thought that this special, simple form of f() restricts the type of surface that may be represented by O_t = f(I_i). This initial thought is wrong - the special form of Eqn. (12) is actually a general representation for quite arbitrary surfaces. \n\nTo prove that Eqn. (12) is a reasonable representation for surfaces we first point out that surfaces may be approximated by adding up a series of \"bumps\" that are appropriately placed. An example of this occurs in familiar Fourier analysis, where wave trains of suitable frequency and amplitude are added together to approximate curves (or surfaces). Each half period of each wave of fixed wavelength is a \"bump,\" and one adds all the bumps together to form the approximant. Let us now see how Eqn. (12) may be interpreted as adding together bumps of specified heights and positions. First consider SUM_k, which is a sum of g() functions. In Figure (4) we plot an example of such a g() function for the case of two inputs. \n\nFigure 4. A sigmoidal surface. \n\n\f451 \n\nThe orientation of this sigmoidal surface is determined by the T_ji, the position by theta_j, and the height by T_kj. Now consider another g() function that occurs in SUM_k. The theta_j of the second g() function is chosen to displace it from the first, the T_ji are chosen so that it has the same orientation as the first, and T_kj is chosen to have opposite sign to the first. 
These two g() functions occur in SUM_k, and so to determine their contribution to SUM_k we sum them together and plot the result in Figure 5. The result is a ridged surface. \n\nFigure 5. A ridge. \n\nSince our goal is to obtain localized bumps, we select another pair of g() functions in SUM_k, add them together to get a ridged surface perpendicular to the first ridged surface, and then add the two perpendicular ridged surfaces together to see the contribution to SUM_k. The result is plotted in Figure (6). \n\nFigure 6. A pseudo-bump. \n\n\f452 \n\nWe see that this almost worked, in so much as one obtains a local maximum by this procedure. However, there are also saddle-like configurations at the corners which corrupt the bump we were trying to obtain. Note that one way to fix this is to take g(SUM_k + theta_k), which will, if theta_k is chosen appropriately, depress the local minima and saddles to zero while simultaneously sending the central maximum towards 1. The result is plotted in Figure (7) and is the sought-after bump. \n\nFigure 7. A bump. \n\nFurthermore, note that the necessary g() function is supplied by Eqn. (12). Therefore Eqn. (12) is a procedure to obtain localized bumps of arbitrary height and position. For two inputs, the kth bump is obtained by using four g() functions from SUM_k (two g() functions for each ridged surface and two ridged surfaces per bump) and then taking g() of the result in Eqn. (12a). The height of the kth bump is determined by T_tk in Eqn. (12a), and the k bumps are added together by that equation as well. The general network architecture which corresponds to the above procedure of adding two g() functions together to form a ridge, two perpendicular ridges together to form a pseudo-bump, and the final g() to form the final bump is represented in Figure (8). 
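The ridge / pseudo-bump / bump construction just described can be reproduced numerically. In the sketch below the weight T_kj = 8, the inner slope 5, and theta_k = -12 are illustrative choices that make the saddles depress cleanly; the paper does not prescribe specific values.

```python
import numpy as np

# Numerical version of the bump construction: two opposed sigmoids form a
# ridge (Figure 5), two perpendicular ridges form a pseudo-bump with
# corner saddles (Figure 6), and the outer g() of Eqn. (12a) with a
# suitable theta_k isolates the bump (Figure 7).  T_kj = 8, the inner
# slope 5, and theta_k = -12 are illustrative choices.
def g(x):
    return 0.5 * (1.0 + np.tanh(x))

xx, yy = np.meshgrid(np.linspace(-4, 4, 81), np.linspace(-4, 4, 81))

Tkj = 8.0
ridge_x = g(5 * (xx + 1)) - g(5 * (xx - 1))   # ridge running along y
ridge_y = g(5 * (yy + 1)) - g(5 * (yy - 1))   # perpendicular ridge
SUM_k = Tkj * (ridge_x + ridge_y)             # pseudo-bump with saddles

theta_k = -12.0
bump = g(SUM_k + theta_k)                     # saddles -> 0, center -> 1
```

At the grid center the bump is near 1, while the saddle points and corners are depressed near 0, which is exactly the role of theta_k in the text.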
To obtain any number of bumps one adds more neurons to the hidden layers by repeatedly using the connectivity of Figure (8) as a template (i.e. four neurons per bump in Hidden Layer 1, and one neuron per bump in Hidden Layer 2). \n\n\f453 \n\nFigure 8. Connectivity needed to obtain one bump. Add four more neurons to Hidden Layer 1, and one more neuron to Hidden Layer 2, for each additional bump. \n\nOne never needs more than two layers, or any other type of connectivity than that already schematically specified by Figure (8). The accuracy of the approximation depends on the number of bumps, which in turn is specified by the number of neurons per layer. This result is easily generalized to higher dimensions (more than two Inputs), where one needs 2m hidden neurons in the first hidden layer, and one hidden neuron in the second layer, for each bump. \n\nThe argument given above also extends to the situation where one is processing symbolic information with a neural net. In this situation, the Input information is coded into bits (say 0s and 1s) and similarly for the Output. Or, the Inputs may still be real valued numbers, in which case the binary output is attempting to group the real valued Inputs into separate classes. To make the Output values tend toward 0 and 1, one takes a third and final g() on the output layer, i.e. each output neuron is represented by g(O_t) where O_t is given in Eqn. (11). Recall that up until now we have used linear neurons on the output layer. In typical backpropagation examples, one never actually achieves a hard 0 or 1 on the output layer but achieves instead some value between 0.0 and 1.0. Then typically any value over 0.5 is called 1, and values under 0.5 are called 0. This \"postprocessing\" step is not really outside the framework of the network formalism, because it may be performed by merely increasing the slope of the sigmoidal function on the Output layer. 
Therefore the only effect of the third and final g() function used on the Output layer in symbolic information processing is to pass a hyperplane through the surface we have just been discussing. This plane cuts the surface, forming \"decision regions,\" in which high values are called 1 and low values are called 0. Thus we see that the heart of the problem is to be able to form surfaces in a general manner, which are then cut by a hyperplane into general decision regions. We are therefore able to conclude that the network architecture consisting of just two hidden layers is sufficient for learning any symbol processing training set. For Boolean symbol mappings one need not use the second hidden layer to remove the saddles on the bump (c.f. Fig. 6). The saddles are lower than the central maximum, so one may choose a threshold on the output layer to cut the bump at a point over the saddles to yield the correct decision region. Whether this representation is a reasonable one for subsequently achieving good prediction on a prediction set, as opposed to \"memorizing\" a training set, is an issue that we address below. \n\n\f454 \n\nWe also note that use of Sigma-Pi units (Rumelhart, 1986) or high order correlation nets (Y. C. Lee, 1986) is an attempt to construct a surface by a general polynomial expansion, which is then cut by a hyperplane into decision regions, as in the above. Therefore the essential element of all these neural net learning algorithms is identical (i.e. surface construction); only the particular method of parameterizing the surface varies from one algorithm to another. This geometrical viewpoint, which provides a unifying framework for many neural net algorithms, may provide a useful framework in which to attempt construction of new algorithms. \n\nAdding together bumps to approximate surfaces is a reasonable procedure to use when dealing with real valued inputs. 
It ties in to general approximation theory (c.f. Fourier series, or better yet, B-splines), and can be quite successful as we have seen. Clearly some economy is gained by giving the neural net bumps to start with, instead of having the neural net form its own bumps from sigmoids. One way to do this would be to use multidimensional Gaussian functions with adjustable parameters. \n\nThe situation is somewhat different when processing symbolic (binary valued) data. When input symbols are encoded into N-bit bit-strings then one has well defined input values in an N dimensional input space. As shown above, one can learn the training set of input patterns by appropriately forming and placing bump surfaces over this space. This is an effective method for memorizing the training set, but a very poor method for obtaining correct predictions on new input data. The point is that, in contrast to real valued inputs that come from, say, a chaotic time series, the input points in symbolic processing problems are widely separated and the bumps do not add together to form smooth surfaces. Furthermore, each input bit string is a corner of a 2^N vertex hypercube, and there is no sense in which one corner of a hypercube is surrounded by the other corners. Thus the commonly used input representation for symbolic processing problems requires that the neural net extrapolate the surface to make a new prediction for a new input pattern (i.e. a new corner of the hypercube) and not interpolate, as is commonly the case for real valued inputs. Extrapolation is a far more dangerous procedure than interpolation, and in view of the separated bumps of the training set one might expect on the basis of this argument that neural nets would fail dismally at symbol processing. This is not the case. 
\n\nThe solution to this apparent conundrum, of course, is that although it is sufficient for a neural net to learn a symbol processing training set by forming bumps, it is not necessary for it to operate in this manner. The simplest example of this occurs in the XOR problem. One can implement the input/output mapping for this problem by duplicating the hidden layer architecture of Figure (8) appropriately for two bumps (i.e. 8 hiddens in layer 1, 2 hiddens in layer 2). As discussed above, for Boolean mappings, one can even eliminate the second hidden layer. However, the architecture of Figure (9) will also suffice. \n\nFigure 9. Connectivity for XOR. \n\n\f455 \n\nPlotting the output of this network, Figure (9), as a function of the two inputs yields a ridge oriented to run between (0,1) and (1,0), Figure (10). Thus a neural net may learn a symbolic training set without using bumps, and a high dimensional version of this process takes place in more complex symbol processing tasks. Ridge/ravine representations of the training data are considerably more efficient than bumps (fewer hidden neurons and weights) and the extended nature of the surface allows reasonable predictions, i.e. extrapolations. \n\nFigure 10. XOR surface. \n\n5. Conclusion. \n\nNeural nets, in contrast to popular misconception, are capable of quite accurate number crunching, with an accuracy for the prediction problem we considered that exceeds conventional methods by orders of magnitude. Neural nets work by constructing surfaces in a high dimensional space, and their operation when performing signal processing tasks on real valued inputs is closely related to standard methods of functional approximation. 
One does not need more than two hidden layers for processing real valued input data, and the accuracy of the approximation is controlled by the number of neurons per layer, and not the number of layers. We emphasize that although two layers of hidden neurons are sufficient, they may not be efficient. Multilayer architectures may provide very efficient networks (in the sense of number of neurons and number of weights) that can perform accurately and with minimal cost. \n\nEffective prediction for symbolic input data is achieved by a slightly different method than that used for real valued inputs. Instead of forming localized bumps (which would accurately represent the training data but would not predict well on new inputs), the network can use ridge/ravine like surfaces (and generalizations thereof) to efficiently represent the scattered input data. While neural nets generally perform prediction by interpolation for real valued data, they must perform extrapolation for symbolic data if the usual bit representations are used. An outstanding problem is: why do tanh representations seem to extrapolate well in symbol processing problems? How do other functional bases do? How does the representation for symbolic inputs affect the ability to extrapolate? This geometrical viewpoint provides a unifying framework for examining \n\n\f456 \n\nmany neural net algorithms, for suggesting questions about neural net operation, and for relating current neural net approaches to conventional methods. \nAcknowledgment. \n\nWe thank Y. C. Lee, J. D. Farmer, and J. Sidorowich for a number of valuable discussions. \n\nReferences \n\nC. Barnes, C. Burks, R. Farber, A. Lapedes, K. Sirotkin, \"Pattern Recognition by Neural Nets in Genetic Databases,\" manuscript in preparation \n\nJ. Denker et al., \"Automatic Learning, Rule Extraction, and Generalization,\" AT&T Bell Laboratories preprint, 1987 \n\nD. 
Farmer, J. Sidorowich, Phys. Rev. Lett. 59(8), p. 845, 1987 \n\nH. Haken, Phys. Lett. A53, p. 77 (1975) \n\nA. Lapedes, R. Farber, \"Nonlinear Signal Processing Using Neural Networks: Prediction and System Modelling,\" LA-UR-87-2662, 1987 \n\nY. C. Lee, Physica 22D (1986) \n\nR. Lippmann, IEEE ASSP Magazine, p. 4, 1987 \n\nD. Ruelle, F. Takens, Comm. Math. Phys. 20, p. 167 (1971) \n\nD. Rumelhart, J. McClelland, \"Parallel Distributed Processing,\" Vol. 1, M.I.T. Press, Cambridge, MA (1986) \n\nD. Russell et al., Phys. Rev. Lett. 45, p. 1175 (1980) \n\nT. Sejnowski et al., \"NETtalk: A Parallel Network that Learns to Read Aloud,\" Johns Hopkins Univ. preprint (1986) \n\nH. Swinney et al., Physics Today 31(8), p. 41 (1978) \n\nF. Takens, \"Detecting Strange Attractors in Turbulence,\" Lecture Notes in Mathematics, D. Rand, L. Young (editors), Springer, Berlin, p. 366 (1981) \n\nK. Tomita et al., J. Stat. Phys. 21, p. 65 (1979) \n\n\f", "award": [], "sourceid": 59, "authors": [{"given_name": "Alan", "family_name": "Lapedes", "institution": null}, {"given_name": "Robert", "family_name": "Farber", "institution": null}]}