{"title": "Optimization with Artificial Neural Network Systems: A Mapping Principle and a Comparison to Gradient Based Methods", "book": "Neural Information Processing Systems", "page_first": 474, "page_last": 484, "abstract": null, "full_text": "474 \n\nOPTIMIZA nON WITH ARTIFICIAL NEURAL NETWORK SYSTEMS: \n\nA MAPPING PRINCIPLE \n\nAND \n\nA COMPARISON TO GRADIENT BASED METHODS t \n\nHarrison MonFook Leong \n\nResearch Institute for Advanced Computer Science \n\nNASA Ames Research Center 230-5 \n\nMoffett Field, CA, 94035 \n\nABSTRACT \n\nGeneral formulae for mapping optimization problems into systems of ordinary differential \n\nequations associated with artificial neural networks are presented. A comparison is made to optim(cid:173)\nization using gradient-search methods. The perfonnance measure is the settling time from an initial \nstate to a target state. A simple analytical example illustrates a situation where dynamical systems \nrepresenting artificial neural network methods would settle faster than those representing gradient(cid:173)\nsearch. Settling time was investigated for a more complicated optimization problem using com(cid:173)\nputer simulations. The problem was a simplified version of a problem in medical imaging: deter(cid:173)\nmining loci of cerebral activity from electromagnetic measurements at the scalp. The simulations \nshowed that gradient based systems typically settled 50 to 100 times faster than systems based on \ncurrent neural network optimization methods. \n\nINTRODUCTION \n\nSolving optimization problems with systems of equations based on neurobiological principles \nhas recently received a great deal of attention. Much of this interest began when an artificial \nneural network was devised to find near-optimal solutions to an np-complete problem 13. Since \nthen, a number of problems have been mapped into the same artificial neural network and varia(cid:173)\ntions of it 10.13,14,17.18,19.21,23.24. 
In this paper, a unifying principle underlying these mappings is derived for systems of first to nth-order ordinary differential equations. This mapping principle bears similarity to the mathematical tools used to generate optimization methods based on the gradient. In view of this, it seemed important to compare the optimization efficiency of dynamical systems constructed by the neural network mapping principle with dynamical systems constructed from the gradient. \n\nTHE PRINCIPLE \n\nThis paper concerns itself with networks of computational units having a state variable v, a function f that describes how a unit is driven by inputs, a linear ordinary differential operator with constant coefficients D(v) that describes the dynamical response of each unit, and a function g that describes how the output of a computational unit is determined from its state v. In particular, the paper explores how outputs of the computational units evolve with time in terms of a scalar function E, a single state variable for the whole network. Fig. 1 summarizes the relationships between variables, functions, and operators associated with each computational unit. Eq. (1) summarizes the equations of motion for a network composed of such units: \n\nD⃗^(M)(v⃗) = f⃗( g_1(v_1), ..., g_N(v_N) )   (1) \n\nwhere the ith element of D⃗^(M) is D^(M)(v_i), superscript (M) denotes that operator D is Mth order, the ith element of f⃗ is f_i(g_1(v_1), ..., g_N(v_N)), and the network is comprised of N computational units. The network of Hopfield 12 has M = 1, functions f⃗ are weighted linear sums, and functions g⃗ (where the ith element of g⃗ is g_i(v_i)) are all the same sigmoid function. We will examine two ways of defining functions f⃗ given a function F. Along with these definitions will be \n\nt Work supported by NASA Cooperative Agreement No. 
NCC 2-408 \n\n© American Institute of Physics 1988 \n\ndefined corresponding functions E that will be used to describe the dynamics of Eq. (1). \n\nThe first method corresponds to optimization methods introduced by artificial neural network research. It will be referred to as method ∇g (\"del g\"): \n\nf⃗ ≡ ∇g F   (2a) \n\nwith associated E function \n\nE∇g = F(g⃗) − Σ_i^N ∫^t [ D^(M)(v_i(s)) − dv_i(s)/ds ] (dg_i(s)/ds) ds.   (2b) \n\nHere, ∇x H denotes the gradient of H, where partials are taken with respect to the variables of x⃗, and E∇g denotes the E function associated with gradient operator ∇g. With appropriate operator D and functions f⃗ and g⃗, E∇g is simply the \"energy function\" of Hopfield 12. Note that Eq. (2a) makes explicit that we will only be concerned with f⃗ that can be derived from scalar potential functions. For example, this restriction excludes artificial neural networks that have connections between excitatory and inhibitory units such as that of Freeman 8. The second method corresponds to optimization methods based on the gradient. It will be referred to as method ∇v (\"del v\"): \n\nf⃗ ≡ ∇v F   (3a) \n\nwith associated E function \n\nE∇v = F(g⃗) − Σ_i^N ∫^t [ D^(M)(v_i(s)) − dv_i(s)/ds ] (dv_i(s)/ds) ds   (3b) \n\nwhere notation is analogous to that for Eqs. (2). \n\nFigure 1 labels: computational unit i; differential operator specifying the dynamical characteristics of unit i; transform that determines unit i's output from state variable v_i. \n\nThe critical result that allows us to map optimization problems into networks described by Eq. (1) is that conditions on the constituents of the equation can be chosen so that along any solution trajectory, the E function corresponding to the system will be a monotonic function of time. 
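As an illustration of this monotonicity property, here is a minimal numerical sketch (my own, not from the paper) of method ∇v with M = 1 and D = d/dt, for which Eq. (1) reduces to gradient ascent and the integral term of Eq. (3b) vanishes, so E is just F along trajectories. The choice of F, the unit gain, the coupling matrix, and the step size are all assumptions made for the demonstration:

```python
# Sketch of method "del v" with M = 1 and D = d/dt: Eq. (1) becomes
# dv/dt = grad_v F, so E = F(v(t)) should be monotone (nondecreasing) in time.
import numpy as np

def F(v, S):
    g = np.tanh(v)                 # outputs g_i(v_i); gain G = 1 assumed
    return 0.5 * g @ S @ g         # an example F quadratic in the outputs

def grad_v_F(v, S):
    g = np.tanh(v)
    return (1.0 / np.cosh(v) ** 2) * (S @ g)   # chain rule: diag[g'(v)] S g

def settle(v0, S, dt=0.01, steps=2000):
    """Forward-Euler integration of dv/dt = grad_v F; returns F along the path."""
    v = v0.copy()
    energies = []
    for _ in range(steps):
        v += dt * grad_v_F(v, S)
        energies.append(F(v, S))
    return np.array(energies)

rng = np.random.default_rng(0)
S = rng.normal(size=(4, 4)); S = 0.5 * (S + S.T)   # symmetric S, as in the paper
E_path = settle(rng.normal(size=4), S)
assert np.all(np.diff(E_path) >= -1e-6)            # E is (numerically) monotone
```

A small explicit step size is needed only because Euler integration can slightly overshoot; the continuous-time statement is exact.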
\nFor method ∇g, here are the conditions: all functions g are 1) differentiable and 2) monotonic in the same sense. Only the first condition is needed to make a similar assertion for method ∇v. When these conditions are met and when solutions of Eq. (1) exist, the dynamical systems can be used for optimization. The appendix contains proofs for the monotonicity of function E along solution trajectories and references necessary existence theorems. In conclusion, mapping optimization problems onto dynamical systems summarized by Eq. (1) can be reduced to a matter of differentiation if a scalar function representation of the problem can be found and the integrals of Eqs. (2b) and (3b) are ignorable. This last assumption is certainly upheld for the case where operator D has no derivatives of less than Mth order. In simulations below, it will be observed to hold for the case M = 1 with a nonzero 0th order derivative in D. (Also see Lapedes and Farber 19.) \n\nFigure 1: Schematic of a computational unit i from which networks considered in this paper are constructed. Triangles suggest connections between computational units. (Remaining label: function governing how inputs to unit i are combined to drive it; inputs shown: g_1(v_1), g_2(v_2).) \n\nPERSPECTIVES OF RECENT WORK \n\nThe formulations above can be used to classify the neural network optimization techniques used in several recent studies. In these studies, the functions g were all identical. For the most part, following Hopfield's formulation, researchers 10,13,14,17,23,24 have used method ∇g to derive forms of Eq. (1) that exhibit the ability to find extrema of E∇g with E∇g quadratic in functions g⃗ and all functions g describable by sigmoid functions such as tanh(x). However, several researchers have written about artificial neural networks associated with non-quadratic E functions. 
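The practical difference between the two mappings can be made concrete in a few lines. This is my own illustrative sketch (gain, coupling matrix, and test state are assumptions, not taken from the paper): the ∇v field is the ∇g field attenuated elementwise by g'(v_i), which is tiny once a unit's sigmoid saturates.

```python
# The two mappings applied to the same scalar function F: method "del g"
# drives unit i with dF/dg_i, while method "del v" drives it with
# dF/dv_i = g'(v_i) * dF/dg_i (chain rule).  Near saturation g'(v) ~ 0,
# so the del-v field is strongly attenuated there while del-g is not.
import numpy as np

G = 2.0                                    # assumed sigmoid gain
g = lambda v: np.tanh(G * v)               # output transform g(v)
gprime = lambda v: G / np.cosh(G * v)**2   # g'(v) = G sech^2(G v)

S = np.array([[0.0, 1.0], [1.0, 0.0]])     # symmetric coupling, for illustration

def f_del_g(v):
    return S @ g(v)                        # dF/dg for F quadratic in outputs

def f_del_v(v):
    return gprime(v) * f_del_g(v)          # diag[g'] times the del-g field

v_near_corner = np.array([2.5, -2.5])      # unit states near saturation
ratio = np.abs(f_del_v(v_near_corner)) / np.abs(f_del_g(v_near_corner))
# the ratio equals g'(v) componentwise, which is tiny near hypercube corners
assert np.all(ratio < 0.1)
```

This attenuation is exactly why, near a corner of the hypercube, a ∇v network moves slowly while a ∇g network does not.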
Method ∇g has been used to derive systems capable of finding extrema of non-quadratic E∇g 19. Method ∇v has been used to derive systems capable of optimizing E∇v where E∇v were not necessarily quadratic in variables v⃗ 21. A sort of hybrid of the two methods was used by Jeffery and Rosner 18 to find extrema of functions that were not quadratic. The important distinction is that their functions f⃗ were derived from a given function F using Eq. (3a) where, in addition, a sign definite diagonal matrix was introduced; the left side of Eq. (3a) was left multiplied by this matrix. A perspective on the relationship between all three methods to construct dynamical systems for optimization is summarized by Eq. (4), which describes the relationship between methods ∇g and ∇v: \n\n∇v = diag[ dg_i(v_i)/dv_i ] ∇g   (4) \n\nConsider computational units whose output transform is \n\ng(v) = tanh G(v − Th)   (5) \n\nwhere G > 0 is the gain and Th is the threshold. Transforms similar to this are widely used in artificial neural network research. Suppose we wish to use such computational units to search a multi-dimensional binary solution space. We note that \n\ndg/dv = G sech^2 G(v − Th)   (6) \n\nis near 0 at valid solution states (corners of a hypercube for the case of binary solution spaces). We see from Eq. (4) that near a valid solution state, a network based on method ∇g will allow computational units to recede from incorrect states and approach correct states comparatively faster. Does this imply faster settling time for method ∇g? \n\nTo obtain an analytical comparison of settling times, consider the case where M = 1 and operator D has no 0th order derivatives and \n\nF = −(1/2) Σ_{i,j} S_{ij} (tanh G v_i)(tanh G v_j)   (7) \n\nwhere matrix S is symmetric. Method ∇g gives network equations \n\ndv⃗/dt = S tanh G v⃗   (8) \n\nand method ∇v gives network equations \n\ndv⃗/dt = diag[ G sech^2 G v_i ] S tanh G v⃗   (9) \n\nwhere tanh G v⃗ denotes a vector with ith component tanh G v_i. For method ∇g there is one stable point, i.e. 
where dv⃗/dt = 0, at v⃗ = 0. For method ∇v the stable points are v⃗ = 0 and v⃗ ∈ V where V is the set of vectors with component values that are either +1 or −1. Further trivialization allows for comparing estimates of settling times: suppose S is diagonal. For this case, if v_i = 0 is on the trajectory of any computational unit i for one method, v_i = 0 is on the trajectory of that unit for the other method; hence, a comparison of settling times can be obtained by comparing time estimates for a computational unit to evolve from near 0 to near an extremum or, equivalently, the converse. Specifically, let the interval be [δ, 1 − δ] where 0 < δ. \n\nAPPENDIX \n\nWith the condition that functions g are differentiable, we can show that the derivative of E∇v is semi-definite: \n\ndE∇v/dt = Σ_j^N (∂F/∂v_j)(dv_j/dt) − Σ_j^N [ D^(M)(v_j) − dv_j/dt ] (dv_j/dt).   (A2a) \n\nUsing Eqs. (3a) and (1), \n\ndE∇v/dt = Σ_j^N [ dv_j/dt ]^2 ≥ 0   (A2b) \n\nas needed. In order to use the results derived above to conclude that Eq. (1) can be used for optimization of functions E∇v and E∇g in the vicinity of some point v⃗_0, we need to show that there exists a neighborhood of v⃗_0 in which there exist solution trajectories to Eq. (1). The necessary existence theorems and transformations of Eq. (1) needed in order to apply the theorems can be found in many texts on ordinary differential equations; e.g. Guckenheimer and Holmes 11. Here, it is mainly important to state that the theorems require that functions f⃗ are C^(1), functions g are differentiable, and initial conditions are specified for all derivatives of lower order than M. \n\nACKNOWLEDGEMENTS \n\nI would like to thank Dr. Michael Raugh and Dr. Pentti Kanerva for constructive criticism and support. I would like to thank Bill Baird and Dr. James Keeler for reviewing this work. I would like to thank Dr. Derek Fender, Dr. John Hopfield, and Dr. 
Stanley Klein for giving me opportunities that fostered this conglomeration of ideas. \n\nREFERENCES \n\n[1] Ackley D.H., \"Stochastic iterated genetic hill climbing\", Ph.D. dissertation, Carnegie Mellon U., 1987. \n[2] Baum E., Neural Networks for Computing, ed. Denker J.S. (AIP Confrnc. Proc. 151, ed. Lerner R.G.), p53-58, 1986. \n[3] Brody D.A., IEEE Trans. vBME-32, n2, p106-110, 1968. \n[4] Brody D.A., Terry F.H., Ideker R.E., IEEE Trans. vBME-20, p141-143, 1973. \n[5] Cohen M.A., Grossberg S., IEEE Trans. vSMC-13, p815-826, 1983. \n[6] Cuffin B.N., IEEE Trans. vBME-33, n9, p854-861, 1986. \n[7] Darcey T.M., Ary J.P., Fender D.H., Prog. Brain Res., v54, p128-134, 1980. \n[8] Freeman W.J., \"Mass Action in the Nervous System\", Academic Press, Inc., 1975. \n[9] Gevins A.S., Morgan N.H., IEEE Trans., vBME-33, n12, p1054-1068, 1986. \n[10] Goles E., Vichniac G.Y., Neural Networks for Computing, ed. Denker J.S. (AIP Confrnc. Proc. 151, ed. Lerner R.G.), p165-181, 1986. \n[11] Guckenheimer J., Holmes P., \"Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields\", Springer Verlag, 1983. \n[12] Hopfield J.J., Proc. Natl. Acad. Sci., v81, p3088-3092, 1984. \n[13] Hopfield J.J., Tank D.W., Bio. Cybrn., v52, p141-152, 1985. \n[14] Hopfield J.J., Tank D.W., Science, v233, n4764, p625-633, 1986. \n[15] Horowitz P., Hill W., \"The Art of Electronics\", Cambridge U. Press, 1983. \n[16] Hosek R.S., Sances A., Jodat R.W., Larson S.J., IEEE Trans., vBME-25, n5, p405-413, 1978. \n[17] Hutchinson J.M., Koch C., Neural Networks for Computing, ed. Denker J.S. (AIP Confrnc. Proc. 151, ed. Lerner R.G.), p235-240, 1986. \n[18] Jeffery W., Rosner R., Astrophys. J., v310, p473-481, 1986. \n[19] Lapedes A., Farber R., Neural Networks for Computing, ed. Denker J.S. (AIP Confrnc. Proc. 151, ed. Lerner R.G.), p283-298, 1986. 
\n[20] Leong H.M.F., \"Frequency dependence of electromagnetic fields: models appropriate for the brain\", Ph.D. dissertation, California Institute of Technology, 1986. \n[21] Platt J.C., Hopfield J.J., Neural Networks for Computing, ed. Denker J.S. (AIP Confrnc. Proc. 151, ed. Lerner R.G.), p364-369, 1986. \n[22] Press W.H., Flannery B.P., Teukolsky S.A., Vetterling W.T., \"Numerical Recipes\", Cambridge U. Press, 1986. \n[23] Takeda M., Goodman J.W., Applied Optics, v25, n18, p3033-3046, 1986. \n[24] Tank D.W., Hopfield J.J., \"Neural computation by concentrating information in time\", preprint, 1987. \n", "award": [], "sourceid": 19, "authors": [{"given_name": "Harrison", "family_name": "Leong", "institution": null}]}