{"title": "Probabilistic Characterization of Neural Model Computations", "book": "Neural Information Processing Systems", "page_first": 310, "page_last": 316, "abstract": null, "full_text": "310 \n\nPROBABILISTIC CHARACTERIZATION OF NEURAL MODEL COMPUTATIONS \n\nRichard M. Golden † \n\nUniversity of Pittsburgh, Pittsburgh, Pa. 15260 \n\nABSTRACT \n\nInformation retrieval in a neural network is viewed as a procedure in which the network computes a \"most probable\" or MAP estimate of the unknown information. This viewpoint allows the class of probability distributions, P, that the neural network can acquire to be explicitly specified. Learning algorithms for the neural network which search for the \"most probable\" member of P can then be designed. Statistical tests which decide if the \"true\" or environmental probability distribution is in P can also be developed. Example applications of the theory to the highly nonlinear back-propagation learning algorithm and to the networks of Hopfield and Anderson are discussed. \n\nINTRODUCTION \n\nA connectionist system is a network of simple neuron-like computing elements which can store and retrieve information and, most importantly, make generalizations. Using terminology suggested by Rumelhart & McClelland 1, the computing elements of a connectionist system are called units, and each unit is associated with a real number indicating its activity level. The activity level of a given unit in the system can also influence the activity level of another unit. The degree of influence between two such units is often characterized by a parameter of the system known as a connection strength. During the information retrieval process some subset of the units in the system are activated, and these units in turn activate neighboring units via the inter-unit connection strengths. 
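The retrieval process described above can be sketched numerically. The toy network below, with connection strengths and an update rule of my own choosing for illustration (the paper prescribes neither), clamps one unit and lets activation spread to its neighbors:

```python
import numpy as np

# Minimal sketch (illustrative values, not from the paper): clamped units
# spread activation to their neighbors via the connection strengths.
W = np.array([[0.0, 0.8, 0.0],
              [0.8, 0.0, 0.5],
              [0.0, 0.5, 0.0]])   # symmetric connection strengths (assumed)

x = np.array([1.0, 0.0, 0.0])     # unit 0 is activated by incoming information

for _ in range(10):               # retrieval: repeatedly update the free units
    net = W @ x                   # net input to each unit
    x = np.clip(net, 0.0, 1.0)    # squash activity levels into [0, 1]
    x[0] = 1.0                    # keep the cue unit clamped

print(np.round(x, 3))             # activity levels of all units after retrieval
```

The final activity levels of the neighboring units are the retrieved information in the sense described above.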
The activation levels of the neighboring units are then interpreted as the retrieved information. During the learning process, the values of the inter-unit connection strengths in the system are slightly modified each time the units in the system become activated by incoming information. \n\n† Correspondence should be addressed to the author at the Department of Psychology, Stanford University, Stanford, California, 94305, USA. \n\n\u00a9 American Institute of Physics 1988 \n\nDERIVATION OF THE SUBJECTIVE PF \n\nSmolensky 2 demonstrated how the class of possible probability distributions that could be represented by a Harmony theory neural network model can be derived from basic principles. Using a simple variation of the arguments made by Smolensky, a procedure for deriving the class of probability distributions associated with any connectionist system whose information retrieval dynamics can be summarized by an additive energy function is briefly sketched. A rigorous presentation of this proof may be found in Golden 3. \n\nLet a sample space, Sp, be a subset of the activation pattern state space, Sd, for a particular neural network model. For notational convenience, define the term probability function (pf) to indicate a function that assigns numbers between zero and one to the elements of Sp. For discrete random variables, the pf is a probability mass function. For continuous random variables, the pf is a probability density function. Let a particular stationary stochastic environment be represented by the scalar-valued pf, pe(X), where X is a particular activation pattern. The pf, pe(X), indicates the relative frequency of occurrence of activation pattern X in the network model's environment. A second pf defined with respect to the sample space Sp must also be introduced. This probability function, ps(X), is called the network's subjective pf. 
The pf ps(X) is interpreted as the network's belief that X will occur in the network's environment. \n\nThe subjective pf may be derived by making the assumption that the information retrieval dynamical system, Ds, is optimal. That is, it is assumed that Ds is an algorithm designed to transform a less probable state X into a more probable state X*, where the probability of a state is defined by the subjective pf ps(X;A), and where the elements of A are the connection strengths among the units. Or, in traditional engineering terminology, it is assumed that Ds is a MAP (maximum a posteriori) estimation algorithm. The second assumption is that an energy function, V(X), that is minimized by the system during the information retrieval process can be found with an additivity property. The additivity property says that if the neural network were partitioned into two physically unconnected subnetworks, then V(X) can be rewritten as V1(X1) + V2(X2), where V1 is the energy function minimized by the first subnetwork and V2 is the energy function minimized by the second subnetwork. The third assumption is that V(X) provides a sufficient amount of information to specify the probability of activation pattern X. That is, p(X) = G(V(X)) where G is some continuous function. And the final assumption (following Smolensky 2) is that statistical and physical independence are equivalent. \n\nTo derive ps(X), it is necessary to characterize G more specifically. Note that if probabilities are assigned to activation patterns such that physically independent substates of the system are also statistically independent, then the additivity property of V(X) forces G to be an exponential function, since the only continuous function that maps addition into multiplication is the exponential 4. 
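This step can be checked numerically. In the sketch below (the quadratic energy and the connection strengths are my own illustrative choices), taking G to be the exponential makes two physically unconnected subnetworks statistically independent: the pf over the whole network factors into the product of the subnetwork pfs.

```python
import itertools
import numpy as np

# Sketch: exponential G maps an additive energy into a product of pfs,
# so physically unconnected subnetworks are statistically independent.
def gibbs_pf(M, states):
    V = {x: -(np.array(x) @ M @ np.array(x)) for x in states}  # additive energy
    Z = sum(np.exp(-v) for v in V.values())                    # normalization
    return {x: np.exp(-v) / Z for x, v in V.items()}

# Block-diagonal M (assumed values): units {0,1} and {2} are unconnected.
M = np.array([[0.0, 0.7, 0.0],
              [0.7, 0.0, 0.0],
              [0.0, 0.0, 0.3]])

p = gibbs_pf(M, list(itertools.product([-1, 1], repeat=3)))
p1 = gibbs_pf(M[:2, :2], list(itertools.product([-1, 1], repeat=2)))
p2 = gibbs_pf(M[2:, 2:], list(itertools.product([-1, 1], repeat=1)))

for x in p:  # the joint pf factorizes: ps(X) = ps1(X1) * ps2(X2)
    assert abs(p[x] - p1[x[:2]] * p2[x[2:]]) < 1e-12
```

Any non-exponential continuous G would break this factorization, which is why the independence assumption pins G down.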
\nAfter normalization and the assignment of unity to an irrelevant free parameter 2, the unique subjective pf for a network model that minimizes V(X) during the information retrieval process is: \n\nps(X;A) = Z^-1 exp[-V(X;A)] (1) \n\nZ = ∫ exp[-V(X;A)] dX (2) \n\nprovided that Z < C < ∞. Note that the integral in (2) is taken over Sp. Also note that the pf, ps, and sample space, Sp, specify a Markov Random Field, since (1) is a Gibbs distribution 5. \n\nExample 1: Subjective pfs for associative back-propagation networks \n\nThe information retrieval equation for an associative back-propagation 6 network can be written in the form O = Φ(I;A), where the elements of the vector O are the activity levels for the output units and the elements of the vector I are the activity levels for the input units. The parameter vector A specifies the values of the \"connection strengths\" among the units in the system. The function Φ specifies the architecture of the network. \n\nA natural additive energy function for the information retrieval dynamics of the least squares associative back-propagation algorithm is: \n\nV(O) = |O - Φ(I;A)|^2 (3) \n\nIf Sp is defined to be a real vector space such that O ∈ Sp, then direct substitution of V(O) for V(X;A) into (1) and (2) yields a multivariate Gaussian density function with mean Φ(I;A) and covariance matrix equal to the identity matrix multiplied by 1/2. This multivariate Gaussian density function is ps(O|I;A). That is, with respect to ps(O|I;A), information retrieval in an associative back-propagation network involves retrieving the \"most probable\" output vector, O, for a given input vector, I. \n\nExample 2: Subjective pfs for Hopfield and BSB networks. 
\n\nThe Hopfield 7 and BSB 8,9 neural network models minimize the following energy function during information retrieval: \n\nV(X) = -X^T M X (4) \n\nwhere the elements of X are the activation levels of the units in the system, and the elements of M are the connection strengths among the units. Thus, the subjective pf for these networks is: \n\nps(X) = Z^-1 exp[X^T M X], where Z = Σ exp[X^T M X] (5) \n\nand where the summation is taken over Sp. \n\nAPPLICATIONS OF THE THEORY \n\nIf the subjective pf for a given connectionist system is known, then traditional analyses from the theory of statistical inference are immediately applicable. In this section some examples of how these analyses can aid in the design and analysis of neural networks are provided. \n\nEvaluating Learning Algorithms \n\nLearning in a neural network model involves searching for a set of connection strengths or parameters that obtain a global minimum of a learning energy function. The theory proposed here explicitly shows how an optimal learning energy function can be constructed using the model's subjective pf and the environmental pf. In particular, optimal learning is defined as searching for the most probable connection strengths, given some set of observations (samples) drawn from the environmental pf. 
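A minimal numerical sketch of this search (the one-parameter linear architecture Φ(I;A) = A·I and all sample values below are my own assumptions, not the paper's): under the Gaussian subjective pf of Example 1, -log ps(O|I;A) is the squared retrieval error plus a constant in A, so gradient descent on that error is a search for the most probable connection strength.

```python
import numpy as np

# Sketch of optimal learning as a search for the most probable parameter.
# Assumed toy "network": Phi(I;A) = A * I, Gaussian subjective pf from the
# squared-error energy of Example 1.
rng = np.random.default_rng(0)
I = rng.normal(size=500)                       # inputs drawn from the environment
O = 2.0 * I + rng.normal(scale=0.1, size=500)  # observed outputs (true strength 2.0)

def neg_log_ps(A):
    # -log ps(O|I;A) = |O - Phi(I;A)|^2 + const, averaged over the samples
    return np.mean((O - A * I) ** 2)

A = 0.0
for _ in range(200):                           # descend toward the most probable A
    grad = np.mean(-2 * I * (O - A * I))       # gradient of the squared error
    A -= 0.1 * grad

print(A)                                       # close to the environmental value 2.0
```

With enough samples the minimizer of the empirical negative log-likelihood approaches the connection strength that generated the observations, which is the MAP/maximum-likelihood picture developed next.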
Given some mild restrictions upon the form of the a priori pf associated with the connection strengths, and for a sufficiently large set of observations, estimating the most probable connection strengths (MAP estimation) is equivalent to maximum likelihood estimation 10. \n\nA well-known result 11 is that if the parameters of the subjective pf are represented by the parameter vector A, then the maximum likelihood estimate of A is obtained by finding the A* that minimizes the function: \n\nE(A) = - < log ps(X;A) > (6) \n\nwhere < > is the expectation operator taken with respect to the environmental pf. Also note that (6) is the Kullback-Leibler 12 distance measure plus an irrelevant constant. Asymptotically, E(A) is the logarithm of the probability of A given some set of observations drawn from the environmental pf. \n\nEquation (6) is an important equation since it can aid in the evaluation and design of optimal learning algorithms. Substitution of the multivariate Gaussian associated with (3) into (6) shows that the back-propagation algorithm is doing gradient descent upon the function in (6). On the other hand, substitution of (5) into (6) shows that the Hebbian and Widrow-Hoff learning rules proposed for the Hopfield and BSB model networks are not doing gradient descent upon (6). \n\nEvaluating Network Architectures \n\nThe global minimum of (6) occurs if and only if the subjective and environmental pfs are equivalent 2. Thus, one crucial issue is whether any set of connection strengths exists such that the neural network's subjective pf can be made equivalent to a given environmental pf. If no such set of connection strengths exists, the subjective pf, ps, is defined to be misspecified. White 11 and Lancaster 13 have introduced a statistical test designed to reject the null hypothesis that the subjective pf, ps, is not misspecified. 
Golden suggests a version of this test that is suitable for subjective pfs with many parameters. \n\nREFERENCES \n\n1. D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition, 1 (MIT Press, Cambridge, 1986). \n\n2. P. Smolensky, In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, 1 (MIT Press, Cambridge, 1986), pp. 194-281. \n\n3. R. M. Golden, A unified framework for connectionist systems. Unpublished manuscript. \n\n4. C. Goffman, Introduction to real analysis (Harper and Row, N. Y., 1966), p. 65. \n\n5. J. L. Marroquin, Probabilistic solution of inverse problems. A.I. Memo 860, MIT Press (1985). \n\n6. D. E. Rumelhart, G. E. Hinton, & R. J. Williams, In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, 1 (MIT Press, Cambridge, 1986), pp. 318-362. \n\n7. J. J. Hopfield, Proceedings of the National Academy of Sciences, USA, 79, 2554-2558 (1982). \n\n8. J. A. Anderson, R. M. Golden, & G. L. Murphy, In H. Szu (Ed.), Optical and Hybrid Computing, SPIE, 634, 260-276 (1986). \n\n9. R. M. Golden, Journal of Mathematical Psychology, 30, 73-80 (1986). \n\n10. H. L. Van Trees, Detection, estimation, and modulation theory (Wiley, N. Y., 1968). \n\n11. H. White, Econometrica, 50, 1-25 (1982). \n\n12. S. Kullback & R. A. Leibler, Annals of Mathematical Statistics, 22, 79-86 (1951). \n\n13. T. Lancaster, Econometrica, 52, 1051-1053 (1984). \n\nACKNOWLEDGEMENTS \n\nThis research was supported in part by the Mellon Foundation while the author was an Andrew Mellon Fellow in the Psychology Department at the University of Pittsburgh, and partly by the Office of Naval Research under Contract No. 
N00014-86-K-0107 to Walter Schneider. This manuscript was revised while the author was an NIH postdoctoral scholar at Stanford University. This research was also supported in part by grants from the Office of Naval Research (Contract No. N00014-87-K-0671) and the System Development Foundation to David Rumelhart. I am very grateful to Dean C. Mumme for comments, criticisms, and helpful discussions concerning an earlier version of this manuscript. I would also like to thank David B. Cooper of Brown University for his suggestion that many neural network models might be viewed within a unified statistical framework. \n", "award": [], "sourceid": 75, "authors": [{"given_name": "Richard", "family_name": "Golden", "institution": null}]}