{"title": "Sigma-Pi Learning: On Radial Basis Functions and Cortical Associative Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 474, "page_last": 481, "abstract": null, "full_text": "474 Mel and Koch \n\nSigma-Pi Learning: \n\nOn Radial Basis Functions and Cortical \n\nAssociative Learning \n\nChristof Koch \nBartlett W. Mel \nComputation and Neural Systems Program \n\nCaltech, 216-76 \n\nPasadena, CA 91125 \n\nABSTRACT \n\nThe goal in this work has been to identify the neuronal elements \nof the cortical column that are most likely to support the learning \nof nonlinear associative maps. We show that a particular style of \nnetwork learning algorithm based on locally-tuned receptive fields \nmaps naturally onto cortical hardware, and gives coherence to a \nvariety of features of cortical anatomy, physiology, and biophysics \nwhose relations to learning remain poorly understood. \n\nINTRODUCTION \n\n1 \nSynaptic modification is widely believed to be the brain's primary mechanism for \nlong-term information storage. The enormous practical and theoretical importance \nof biological synaptic plasticity has stimulated interest among both experimental \nneuroscientists and neural network modelers, and has provided strong incentive for \nthe development of computational models that can both explain and predict. \n\nWe present here a model for the synaptic basis of associative learning in cerebral \ncortex. The main hypothesis of this work is that the principal output neurons \nof a cortical association area learn functions of their inputs as locally-generalizing \nlookup tables. As abstractions, locally-generalizing learning methods have a long \nhistory in statistics and approximation theory (see Atkeson, 1989; Barron & Barron, \n\n\fSigma-Pi Learning \n\n475 \n\nFigure 1: A Neural Lookup Table. A nonlinear function of several variables may \nbe decomposed as a weighted sum over a set of localized \"receptive fields\" units. 
\n\n1988). Radial Basis Function (RBF) methods are essentially similar (see Broomhead \n& Lowe, 1988) and have recently been discussed by Poggio and Girosi (1989) in \nrelation to regularization theory. As is standard for network learning problems, \nlocally-generalizing methods involve the learning of a map f(~) : ~ ~ y from \nexample (~, y) pairs. Rather than operate directly on the input space, however, \ninput vectors are first \"decoded\" by a population of \"receptive field\" units with \ncenters ei that each represents a local, often radially-symmetric, region in the input \nspace. Thus, an output unit computes its activation level y = L:i wig( x - ei), where \n9 defines a \"radial basis function\" , commonly a Gaussian, and Wi is its weight (Fig. \n1). The learning problem can then be characterized as one of finding weights w \nthat minimize the mean squared error over the N element training set. Learning \nschemes of this type lend themselves directly to very simple Hebb-type rules for \nsynaptic modification since the initially nonlinear learning problem is transformed \ninto a linear one in the unknown parameters w (see Broomhead & Lowe, 1988). \nLocally-generalizing learning algorithms as neurobiological models date at least to \nAlbus (1971) and Marr (1969, 1970); they have also been explored more recently by \na number of workers with a more pure computational bent (Broomhead & Lowe, \n1988; Lapedes & Farber, 1988; Mel, 1988, 1989; Miller, 1988; Moody, 1989; Poggio \n& Girosi, 1989). 
\n\n\f476 \n\nMel and Koch \n\n2 SIGMA-PI LEARNING \nUnlike the classic thresholded linear unit that is the mainstay of many current \nconnectionist models, the output of a sigma-pi unit is computed as a sum of contri(cid:173)\nbutions from a set of independent multiplicative clusters of input weights (adapted \nfrom Rumelhart & McClelland, 1986): y = O'(Ej WjCj), where Cj = rt ViXi is the \nproduct of weighted inputs to cluster j, Wj is the weight on cluster j as a whole, \nand 0' is an optional thresholding nonlinearity applied to the sum of total clus(cid:173)\nter activity. During learning, the output may also by clamped by an unconditioned \nteacher input, i.e. such that y = ti(~)' Units of this general type were first proposed \nby Feldman & Ballard (1982), and have been used occasionally by other connec(cid:173)\ntionist modelers, most commonly to allow certain inputs to gate others or to allow \nthe activation of one unit to control the strength of interconnection between two \nother units (Rumelhart & McClelland, 1986). The use of sigma-pi units as function \nlookup tables was suggested by Feldman & Ballard (1982), who cited a possible \nrelevance to local dendritic interactions among synaptic inputs (see also Durbin & \nRumelhart, 1989). \n\nIn the present work, the specific nonlinear interaction among inputs to a sigma-pi \ncluster is not of primary theoretical importance. The crucial property of a cluster \nis that its output should be AND-like, i.e. selective for the simultaneous activity \nof all of its k input lines!. \n\n2.1 NETWORK ARCHITECTURE \n\nWe assume an underlying d-dimensional input space X E Rd over which functions \nare to be learned. Vectors in X are represented by a population X of N units \nwhose state is denoted by ~ E RN. Within X, each of the d dimensions of X is \nindividually value-coded, i.e. 
consists of a set of units with Gaussian receptive fields distributed in overlapping fashion along the range of allowable parameter values, for example, the angle of a joint, or the orientation of a visual stimulus at a specific retinal location. (A more biologically realistic case would allow for individual units in X to have multi-dimensional Gaussian receptive fields, for example a 4-d visual receptive field encoding retinal x and y, edge orientation, and binocular disparity.) We assume a map t(x): x → y is to be learned, where the components of y ∈ R^M are represented by an output population Y of M units. According to the familiar single-layer feedforward network learning paradigm, X projects to Y via an \"associational\" pathway with modifiable synapses. We consider the task of a single output unit y_i (hereafter denoted by y), whose job is to estimate the underlying teacher function t_i(x): x → y from examples. Output unit y is assumed to have access to the entire input vector x, and a single unconditioned teacher input t_i. We further assume that all possible clusters c_j of size 1 through k = k_max pre-exist in y's dendritic field, with cluster weights w_j initially set to 0, and input weights v_i within each cluster set equal to 1. \n\n[1] A local threshold function can act as an AND in place of a multiplication, and for purposes of biological modeling, is a more likely dendritic mechanism than pure multiplication. In continuing work, we are exploring the more detailed interactions between Hebb-type learning rules and various post-synaptic nonlinearities, specifically the NMDA channel, that could underlie a multiplicative relation among nearby inputs. 
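A sigma-pi unit of the kind just defined can be sketched in a few lines. The cluster layout and the numerical values below are illustrative assumptions, not taken from the paper; all input weights v_i are set to 1 as in the text.

```python
def cluster_response(v, x):
    """c_j = prod_i v_i * x_i: AND-like output of one multiplicative cluster."""
    product = 1.0
    for vi, xi in zip(v, x):
        product *= vi * xi
    return product

def sigma_pi(clusters, cluster_weights, x, sigma=lambda s: s):
    """y = sigma(sum_j w_j * c_j). Each cluster is a pair (input weights v,
    indices of the input lines it samples); sigma defaults to the identity,
    i.e. no thresholding nonlinearity."""
    total = 0.0
    for (v, lines), w in zip(clusters, cluster_weights):
        total += w * cluster_response(v, [x[i] for i in lines])
    return sigma(total)

# Two clusters on a 4-line input vector x.
clusters = [([1.0, 1.0], [0, 1]),   # ANDs input lines 0 and 1
            ([1.0, 1.0], [2, 3])]   # ANDs input lines 2 and 3
weights = [0.5, 2.0]
x = [1.0, 1.0, 0.0, 1.0]            # line 2 silent, so cluster 1 is silent
y = sigma_pi(clusters, weights, x)  # 0.5*(1*1) + 2.0*(0*1) = 0.5
```

The second cluster illustrates the AND-like selectivity: one silent input line zeroes that cluster's entire contribution.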
Following from our assumption that each of the input lines x_i represents a 1-dimensional Gaussian receptive field in X, a multiplicative cluster of k such inputs can yield a k-dimensional receptive field in X that may then be weighted. In this way, a sigma-pi unit can directly implement an RBF decomposition over X. Additionally, since a sigma-pi unit is essentially a massively parallel lookup table with clusters as stored table entries, it is significant that the sigma-pi function is inherently modular, such that groups of sigma-pi units that receive the same teacher signal can, by simply adding their outputs, act as a single much larger virtual sigma-pi unit with correspondingly increased table capacity.[2] A neural architecture that allows system storage capacity to be multiplied by a factor of k by growing k neurons in the place of one is one that should be strongly preferred by biological evolution. \n\n2.2 THE LEARNING RULE \n\nThe cluster weights w_j are modified during training according to the following self-normalizing Hebb rule: \n\ndw_j/dt = α c_jp t_p − β w_j, \n\nwhere α and β are small positive constants, and c_jp and t_p are, respectively, the jth cluster response and the teacher signal in state p. The steady state of this learning rule occurs at w_j = (α/β)⟨c_j t⟩, which tends to maximize the correlation[3] of cluster output and teacher signal over the training set, while minimizing total synaptic weight across all clusters. The input weights v_i are unmodified during learning, representing the degree of cluster membership for each input line. \n\nWe briefly note that because this Hebb-type learning rule is truly local, i.e. depends only upon activity levels available directly at the synapse to be modified, it may be applied transparently to a group of neurons driven by the same global teacher input (see above discussion of sigma-pi modularity). 
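The steady state of the Hebb rule can be checked by simulating it on one cluster. The training stream below is synthetic (uniform random cluster responses and teacher values, an assumption for illustration); the weight should settle near (α/β)⟨c_j t⟩, the scaled average product of cluster response and teacher signal.

```python
import random

# Self-normalizing Hebb rule, dw_j/dt = alpha * c_jp * t_p - beta * w_j,
# simulated in discrete time for a single cluster weight.
random.seed(1)
alpha, beta = 0.01, 0.01

# Synthetic training set of (cluster response c_jp, teacher signal t_p) pairs.
patterns = [(random.random(), random.random()) for _ in range(200)]

w = 0.0
for _ in range(2000):
    c, t = random.choice(patterns)
    w += alpha * c * t - beta * w   # Hebbian growth minus weight decay

# Predicted fixed point: w_j = (alpha/beta) * <c_j * t> over the training set.
avg_product = sum(c * t for c, t in patterns) / len(patterns)
```

With α = β, as here, the weight converges to the average product itself, the quantity footnote 3 identifies as the "correlation" being maximized.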
Error-correcting rules that modify synapses based on a difference between desired and actual neural output do not share this property. \n\n3 TOWARD A BIOLOGICAL MODEL \n\nIn the remainder of this paper we examine the hypothesis that sigma-pi units underlie associative learning in cerebral cortex. To do so, we identify the six essential elements of the sigma-pi learning scheme and discuss the evidence for each: i) a population of output neurons, ii) a focal teacher input, iii) a diffuse association input, iv) Hebb-type synaptic plasticity, v) local dendritic multiplication (or thresholding), and vi) a cluster reservoir. \n\nFollowing Eccles (1985), we concern ourselves here with the cytoarchitecture of \"generic\" association cortex, rather than with the more specialized (and more often studied) primary sensory and motor areas. We propose the cortical circuit of fig. 2 to contain all of the basic elements necessary for associative learning, closely paralleling the accounts of Marr (1970) and Eccles (1985) at this level of description. \n\n[2] This assumes the global thresholding nonlinearity σ is weak, i.e. has an extended linear range. \n[3] Strictly speaking, the average product. \n\n[Figure 2: Elements of the cortical column in a generic association cortex.] \n\nWe limit our focus to the cortically-projecting \"output\" pyramids of layers II and III, which are posited to be sigma-pi units. These cells are a likely locus of associative learning, as they are well situated to receive both teacher and associational input pathways. With reference to the modularity property of sigma-pi learning (sec. 
2.1), we interpret the aggregates of layer II/III pyramidal cells whose apical dendrites rise toward the cortical surface in tight clumps (on the order of 100 cells; Peters, 1989) as a single virtual sigma-pi unit. \n\n3.1 THE TEACHER INPUT \n\nWe tentatively define the \"teacher\" input to an association area to be those inputs that terminate primarily in layer IV onto spiny stellate cells or small pyramidal cells. Lund (1985) points out that spiny stellate cells are most numerous in primary sensory areas, but that the morphologically similar class of small pyramidal cells in layer IV seems to mimic the spiny stellates in their local, vertically oriented excitatory axonal distributions. The layer IV spiny stellates are known to project primarily up (but also down) the narrow vertical cylinder in which they sit, probably making powerful \"cartridge\" synapses onto overlying pyramidal cells. These excitatory interneurons are presumably capable of strongly depolarizing entire output cells (Szentagothai, 1977), thus providing the needed unit-wide teacher signals to the output neurons. We therefore assume this teacher pathway plays a role analogous to the presumed role of cerebellar climbing fibers (Albus, 1971; Marr, 1969). The inputs to layer IV can be of thalamic and/or cortical origin. \n\n3.2 THE ASSOCIATIONAL INPUT \n\nA second major form of extrinsic excitatory input with access to layer II/III pyramidal cells is the massive system of horizontal fibers in layer I. The primary source of these fibers is currently believed to be long-range excitatory association fibers from both other cortical and subcortical areas (Jones, 1981). 
In accordance with Marr (1970) and Eccles (1985), we interpret this system of horizontal fibers, which virtually permeates the dendritic fields of the layer II/III pyramidal cells, as the primary conditioned input pathway at which cortical associative learning takes place. There is evidence that an individual layer I fiber can make excitatory synapses on apical dendrites of pyramidal cells across an area of cortex 5-6 mm in diameter (Szentagothai, 1977). \n\n3.3 HEBB RULES, MULTIPLICATION, AND CLUSTERING \n\nThe process of cluster formation in sigma-pi learning is driven by a local Hebb-type rule. Long-term Hebb-type synaptic modification has been demonstrated in several cortical areas, dependent only upon local post-synaptic depolarization (Kelso et al., 1986), and is thought to be mediated by the voltage-dependent NMDA channel (see Brown et al., 1988). In addition to the standard tendency for LTP with pre- and post-synaptic correlation, sigma-pi learning implicitly specifies cooperation among pre-synaptic units, in the sense that the largest increase in cluster weight w_j occurs when all inputs x_i to a cluster are simultaneously and strongly active. This type of cooperation among pre-synaptic inputs should follow directly from the assumption that local post-synaptic depolarization is the key ingredient in the induction of LTP. In other words, like-activated synaptic inputs must inevitably contribute to each other's enhancement during learning to the extent that they are clustered on a post-synaptic dendrite. This type of cooperativity in learning gives key importance to dendritic space in neural learning, and has not until very recently been modelled at a biophysical level (T. Brown, pers. comm.; J. Moody, pers. comm.). 
\n\nIn addition to its possible role in enhancing like-activated synaptic clusters, however, the NMDA channel may be hypothesized to simultaneously underlie the \"multiplicative\" interaction among neighboring inputs needed for ensuring cluster-selectivity in sigma-pi learning. Thus, if sufficiently endowed with NMDA channels, cortical pyramidal cells could respond highly selectively to associative input \"vectors\" whose active afferents are spatially clumped, rather than scattered uniformly, across the dendritic arbor. The possibility that dendritic computations could include local multiplicative nonlinearities is widely accepted (e.g. Shepherd et al., 1985; Koch et al., 1983). \n\n3.4 A VIRTUAL CLUSTER RESERVOIR \n\nThe abstract definition of sigma-pi learning specifies that all possible clusters c_j of size 1 ≤ k ≤ k_max pre-exist on the \"dendrites\" of each virtual sigma-pi unit (which we have previously proposed to consist of a vertically aggregated clump of 100 pyramidal cells that receive the same teacher input from layer IV). During learning, the weight on each cluster is governed by a simple Hebb rule. Since the number of possible clusters of size k overwhelms the total available dendritic space for even small k,[4] it must be possible to create a cluster when it is needed. We propose that the complex 3-d mesh of axonal and dendritic arborizations in layer I is ideal for maximizing the probability that arbitrary (small) subsets of association axons cross near to each other in space at some point in their collective arborizations. Thus, we propose that the tangle of axons within a dendrite's receptive field gives rise to an enormous set of almost-clusters, poised to \"latch\" onto a post-synaptic dendrite when called for by a Hebb-type learning rule. 
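The cluster-counting argument behind footnote 4 is simple enough to verify directly. Under the footnote's own assumptions (100 afferents per input dimension, a 3-d problem, clusters of size 3, 5,000 association synapses per pyramidal cell), a dendrite has room for only on the order of a thousand stored size-3 clusters, against a million possible ones:

```python
# Footnote 4's assumptions: one afferent per dimension enters each size-3
# cluster, and each stored cluster consumes 3 of the cell's synapses.
afferents_per_dim = 100
dims = 3
synapses_per_cell = 5000
cluster_size = 3

possible_clusters = afferents_per_dim ** dims          # 100^3 = 1,000,000
storable_clusters = synapses_per_cell // cluster_size  # 5000 // 3 = 1,666
```

The roughly 600-fold mismatch is what motivates the on-demand "virtual cluster reservoir" proposed in this section.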
This geometry of pre- and post-synaptic interface is to be strongly contrasted with the architecture of cerebellum, where the afferent \"parallel\" fibers have no possibility of clustering on post-synaptic dendrites. \n\nKnown biophysical mechanisms for the sprouting and guidance of growth cones during development, in some cases driven by neural activity, seem well suited to the task of cluster formation over small distances in the adult brain. \n\n4 CONCLUSIONS \n\nThe locally-generalizing, table-based sigma-pi learning scheme is a parsimonious mechanism that can account for the learning of nonlinear associative maps in cerebral cortex. Only a single layer of excitatory synapses is modified, under the control of a Hebb-type learning rule. Numerous open questions remain, however, for example the degree to which clusters of active synapses scattered across a pyramidal dendritic tree can function independently, providing the necessary AND-like selectivity. \n\nAcknowledgements \n\nThanks are due to Ojvind Bernander, Rodney Douglas, Richard Durbin, Kamil Grajski, David Mackay, and John Moody for numerous helpful discussions. We acknowledge support from the Office of Naval Research, the James S. McDonnell Foundation, and the Del Webb Foundation. \n\nReferences \n\nAlbus, J.S. A theory of cerebellar function. Math. Biosci., 1971, 10, 25-61. \n\nAtkeson, C.G. Using associative content-addressable memories to control robots. MIT A.I. Memo 1124, September 1989. \n\nBarron, A.R. & Barron, R.L. Statistical learning networks: a unifying view. Presented at the 1988 Symposium on the Interface: Statistics and Computing Science, Reston, Virginia. \n\nBliss, T.V.P. & Lømo, T. Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol., 1973, 232, 331-356. 
\n\n[4] For example, assume a 3-d learning problem and clusters of size k = 3; with 100 afferents per input dimension, there are 100^3 = 10^6 possible clusters. If we assume 5,000 available association synapses per pyramidal cell, there is dendritic space for at most about 1,666 clusters of size 3. \n\nBroomhead, D.S. & Lowe, D. Multivariable functional interpolation and adaptive networks. Complex Systems, 1988, 2, 321-355. \n\nBrown, T.H., Chapman, P.F., Kairiss, E.W., & Keenan, C.L. Long-term synaptic potentiation. Science, 1988, 242, 724-728. \n\nDurbin, R. & Rumelhart, D.E. Product units: a computationally powerful and biologically plausible extension to backpropagation networks. Complex Systems, 1989, 1, 133. \n\nEccles, J.C. The cerebral neocortex: a theory of its operation. In Cerebral Cortex, vol. 2, A. Peters & E.G. Jones (Eds.), Plenum: New York, 1985. \n\nFeldman, J.A. & Ballard, D.H. Connectionist models and their properties. Cognitive Science, 1982, 6, 205-254. \n\nGiles, C.L. & Maxwell, T. Learning, invariance, and generalization in high-order neural networks. Applied Optics, 1987, 26(23), 4972-4978. \n\nHebb, D.O. The organization of behavior. New York: Wiley, 1949. \n\nJones, E.G. Anatomy of cerebral cortex: columnar input-output relations. In The organization of cerebral cortex, F.O. Schmitt, F.G. Worden, G. Adelman, & S.G. Dennis (Eds.), MIT Press: Cambridge, MA, 1981. \n\nKelso, S.R., Ganong, A.H., & Brown, T.H. Hebbian synapses in hippocampus. PNAS USA, 1986, 83, 5326-5330. \n\nKoch, C., Poggio, T., & Torre, V. Nonlinear interactions in a dendritic tree: localization, timing, and role in information processing. PNAS, 1983, 80, 2799-2802. \n\nLapedes, A. & Farber, R. How neural nets work. In Neural Information Processing Systems, D.Z. Anderson (Ed.), American Institute of Physics: New York, 1988. \n\nLund, J.S. Spiny stellate neurons. In Cerebral Cortex, vol. 1, A. Peters & E.G. 
Jones (Eds.), Plenum: New York, 1985. \n\nMarr, D. A theory for cerebral neocortex. Proc. Roy. Soc. Lond. B, 1970, 176, 161-234. \n\nMarr, D. A theory of cerebellar cortex. J. Physiol., 1969, 202, 437-470. \n\nMel, B.W. MURPHY: A robot that learns by doing. In Neural Information Processing Systems, D.Z. Anderson (Ed.), American Institute of Physics: New York, 1988. \n\nMel, B.W. MURPHY: A neurally inspired connectionist approach to learning and performance in vision-based robot motion planning. Ph.D. thesis, University of Illinois, 1989. \n\nMiller, W.T., Hewes, R.P., Glanz, F.H., & Kraft, L.G. Real time dynamic control of an industrial manipulator using a neural network based learning controller. Technical Report, Dept. of Electrical and Computer Engineering, University of New Hampshire, 1988. \n\nMoody, J. & Darken, C. Learning with localized receptive fields. In Proc. 1988 Connectionist Models Summer School, Morgan-Kaufmann, 1988. \n\nPeters, A. Plenary address, 1989 Soc. Neurosci. Meeting, Phoenix, AZ. \n\nPoggio, T. & Girosi, F. Learning, networks and approximation theory. Science, in press. \n\nRumelhart, D.E., Hinton, G.E., & McClelland, J.L. A general framework for parallel distributed processing. In Parallel distributed processing: explorations in the microstructure of cognition, vol. 1, D.E. Rumelhart & J.L. McClelland (Eds.), Cambridge, MA: Bradford, 1986. \n\nShepherd, G.M., Brayton, R.K., Miller, J.P., Segev, I., Rinzel, J., & Rall, W. Signal enhancement in distal cortical dendrites by means of interactions between active dendritic spines. PNAS, 1985, 82, 2192-2195. \n\nSzentagothai, J. The neuron network of the cerebral cortex: a functional interpretation. Proc. R. Soc. Lond. B, 1977, 201, 219-248. ", "award": [], "sourceid": 282, "authors": [{"given_name": "Bartlett", "family_name": "Mel", "institution": null}, {"given_name": "Christof", "family_name": "Koch", "institution": null}]}