{"title": "Spherical Units as Dynamic Consequential Regions: Implications for Attention, Competition and Categorization", "book": "Advances in Neural Information Processing Systems", "page_first": 656, "page_last": 664, "abstract": null, "full_text": "Spherical Units as Dynamic Consequential Regions: \nImplications for Attention, Competition and Categorization \n\nStephen Jose Hanson* \nLearning and Knowledge \nAcquisition Group \nSiemens Corporate Research \nPrinceton, NJ 08540 \n\nMark A. Gluck \nCenter for Molecular & \nBehavioral Neuroscience \nRutgers University \nNewark, NJ 07102 \n\nAbstract \n\nto construct dynamic \n\nSpherical Units can be used \nreconfigurable \nconsequential regions, the geometric bases for Shepard's (1987) theory of \nstimulus generalization in animals and humans. We derive from Shepard's \n(1987) generalization theory a particular multi-layer network with dynamic \n(centers and radii) spherical regions which possesses a specific mass function \n(Cauchy). This learning model generalizes the configural-cue network model \n(Gluck & Bower 1988): (1) configural cues can be learned and do not require \npre-wiring the power-set of cues, (2) Consequential regions are continuous \nrather than discrete and (3) Competition amoungst receptive fields is shown \nto be increased by the global extent of a particular mass function (Cauchy). \nWe compare other common mass functions (Gaussian; used in models of \nMoody & Darken; 1989, Krushke, 1990) or just standard backpropogation \nnetworks with hyperplane/logistic hidden units showing that neither fare as \nwell as models of human generalization and learning. \n\n1 The Generalization Problem \n\nGiven a favorable or unfavorable consequence, what should an organism assume about \nthe contingent stimuli? If a moving shadow overhead appears prior to a hawk attack \nwhat should an organism assume about other moving shadows, their shapes and \npositions? 
If a dense food patch is occasioned by a particular density of certain kinds of shrubbery, what should the organism assume about other shrubbery, vegetation or its spatial density? In a pattern recognition context, given that a character of a certain shape, orientation, noise level, etc. has been recognized correctly, what should the system assume about other shapes, orientations and noise levels it has yet to encounter? \n\n* Also a member of Cognitive Science Laboratory, Princeton University, Princeton, NJ 08544 \n\nMany \"generalization\" theories assume stimulus similarity represents a \"failure to discriminate\", rather than a cognitive decision about what to assume is consequential about the stimulus event. In this paper we implement a generalization theory with a multilayer architecture and localized kernel functions (cf. Cooper, 1962; Albus, 1975; Kanerva, 1984; Hanson & Burr, 1987, 1990; Niranjan & Fallside, 1988; Moody & Darken, 1989; Nowlan, 1990; Kruschke, 1990) in which the learning system constructs hypotheses about novel stimulus events. \n\n2 Shepard's (1987) Generalization Theory \n\nConsiderable empirical evidence indicates that when stimuli are represented within a multi-dimensional psychological space, similarity, as measured by stimulus generalization, drops off in an approximately exponential decay fashion with psychological distance (Shepard, 1957, 1987). In comparison to a linear function, a similarity-distance relationship with upwards concave curvature, such as an exponential-decay curve, exaggerates the similarity of items which are nearby in psychological space and minimizes the impact of items which are further away. 
\n\nRecently, Roger Shepard (1987) has proposed a \"Universal Law of Generalization\" for stimulus generalization which derives this exponential decay similarity-distance function as a \"rational\" strategy given minimal information about the stimulus domain (see also Shepard & Kannappan, this volume). To derive the exponential-decay similarity-distance rule, Shepard (1987) begins by assuming that stimuli can be placed within a psychological space such that the response learned to any one stimulus will generalize to another according to an invariant monotonic function of the distance between them. If a stimulus, O, is known to have an important consequence, what is the probability that a novel test stimulus, X, will lead to the same consequence? Shepard shows, through arguments based on probabilistic reasoning, that regardless of the a priori expectations for regions of different sizes, this expectation will almost always yield an approximately exponentially decaying gradient away from a central memory point. In particular, very simple geometric constraints can lead to the exponential generalization gradient. Shepard (1987) assumes (1) that the consequential region overlaps the consequential stimulus event, and (2) bounded, center-symmetric consequential regions of unknown shape and size. In the 1-dimensional case it can be shown that g(x) is robust over a wide variety of assumptions for the distribution of p(s); although for p(s) exactly the Erlangian or discrete Gamma, g(x) is exactly exponential. \n\nWe now investigate possible ways to implement a model which can learn consequential regions and appropriate generalization behavior (cf. Shepard, 1990). 
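Shepard's 1-dimensional argument can be illustrated numerically: if consequential regions of random size are constrained only to contain the trained stimulus (placed at 0), the probability that a probe at distance x also falls inside the region decays with x. A minimal Monte Carlo sketch, with the region-size prior taken as the Gamma (Erlang) case for which the gradient is exactly exponential; the sample count and parameter values are illustrative, not from the original:

```python
import random, math

def generalization_gradient(xs, n_regions=200_000, shape=2, scale=1.0, seed=0):
    """Monte Carlo estimate of g(x): the probability that a random
    consequential region known to contain the trained stimulus at 0
    also contains a probe stimulus at distance x."""
    rng = random.Random(seed)
    sizes = [rng.gammavariate(shape, scale) for _ in range(n_regions)]
    # Given that a region of size s covers 0, its left edge is
    # uniformly distributed on [-s, 0].
    lefts = [-rng.uniform(0, s) for s in sizes]
    g = []
    for x in xs:
        inside = sum(1 for lo, s in zip(lefts, sizes) if lo <= x <= lo + s)
        g.append(inside / n_regions)
    return g

xs = [0.5 * i for i in range(1, 7)]
g = generalization_gradient(xs)
# For this Erlang prior the gradient works out to g(x) = exp(-x),
# so successive values fall off geometrically with distance.
```

Integrating (s - x)/s against the Gamma(2, 1) density gives g(x) = e^(-x) in closed form, which the sampled values approximate.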
\n\n3 Gluck & Bower's Configural-cue Network Model \n\nThe first point of contact is a discrete model due to Gluck and Bower: the configural-cue network model (Gluck & Bower, 1988). The network model adapts its weights (associations) according to Rescorla and Wagner's (1972) model of classical conditioning, which is a special case of Widrow & Hoff's (1961) Least-Mean-Squares (LMS) algorithm for training one-layer networks. Presentation of a stimulus pattern is represented by activating nodes on the input layer which correspond to the pattern's elementary features and pair-wise conjunctions of features. \n\nThe configural-cue network model implicitly embodies an exponential generalization (similarity) gradient (Gluck, 1991) as an emergent property of its stimulus representation scheme. This equivalence can be seen by computing how the number of overlapping active input nodes (similarity) changes as a function of the number of overlapping component cues (distance). If a stimulus pattern is associated with some outcome, the configural-cue model will generalize this association to other stimulus patterns in proportion to the number of common input nodes they both activate. \n\nAlthough the configural-cue model has been successful with various categorization data, it has several limitations: (1) it is discrete and cannot deal adequately with continuous stimuli, (2) it possesses a non-adaptable internal representation, and (3) it can involve pre-wiring the power set of possible cues. Nonetheless, there are several properties that make the configural-cue model successful and that are important to retain in generalizations of this model: (a) the competitive stimulus properties deriving from the delta rule, and (b) the exponential stimulus generalization property deriving from the successive combinations of higher-order features encoded by hidden units. 
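The configural-cue representation and its LMS/delta-rule learning can be sketched as follows. The toy discrimination, feature names, learning rate and epoch count below are illustrative choices, not taken from Gluck & Bower:

```python
from itertools import combinations

def configural_code(features):
    """Activate one input node per elementary feature plus one node
    per pair-wise conjunction of features (the configural cues)."""
    feats = sorted(features)
    return set(feats) | {pair for pair in combinations(feats, 2)}

def lms_train(patterns, epochs=500, lr=0.1):
    """Rescorla-Wagner / LMS (Widrow-Hoff) updates on the
    configural-cue code: all active nodes share one prediction
    error, so cues compete for the available associative strength."""
    w = {}
    for _ in range(epochs):
        for features, target in patterns:
            nodes = configural_code(features)
            out = sum(w.get(n, 0.0) for n in nodes)
            err = target - out
            for n in nodes:
                w[n] = w.get(n, 0.0) + lr * err
    return w

# An XOR-like discrimination: each feature alone predicts the outcome,
# the compound does not; only the conjunction node can carry that load.
patterns = [({"A"}, 1.0), ({"B"}, 1.0), ({"A", "B"}, 0.0), (set(), 0.0)]
w = lms_train(patterns)
```

After training, the elementary nodes carry positive weight while the conjunction node ("A", "B") carries compensating negative weight, which is how the one-layer LMS network solves a nonlinearly separable problem on this expanded code.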
\n\n4 A Continuous Version of Shepard's Theory \n\nWe derive in this section a new model which generalizes the configural-cue model and derives directly from Shepard's generalization theory. In Figure 1 is shown a one-dimensional depiction of the present theory. Similar to Shepard, we assume there is a consequential region associated with a significant stimulus event, O. \n\nFigure 1: Hypothesis Distributions based on Consequential Region \n\nAlso similar to Shepard, we assume the learning system knows that the significant stimulus event is contained in the consequential region, but does not know the size or location of the consequential region. In the absence of this information the learning system constructs hypothesis distributions (O') which may or may not be contained in the consequential region but at least overlap the significant stimulus event with some finite probability measure. In some hypothesis distributions the significant stimulus event is \"typical\" in the consequential region; in other hypothesis distributions the significant stimulus event is \"rare\". Consequently, the present model differs from Shepard's approach in that the learning system uses the consequential region to project into a continuous hypothesis space in order to construct the conditional probability of the novel stimulus, X, given the significant stimulus event O. \n\nGiven no further information on the location and size of the consequential region, the learning system averages over all possible locations (equally weighted) and all possible variances (equally weighted) over the known stimulus dimension: \n\ng(x) = ∫∫ p(s) p(c) H(x, s, c) dc ds (1) \n\nIn order to derive particular gradients we must assume particular forms for the hypothesis distribution, H(x, s, c). 
Although we have investigated many different hypothesis distributions and weighting functions (p(c), p(s)), we only have space here to report on two bounding cases: one with very \"light tails\", the Gaussian, and one with very \"heavy tails\", the Cauchy (see Figure 2). These two distributions are extremes and provide a test of the robustness of the generalization gradient. At the same time they represent different commitments to the amount of overlap of hidden unit receptive fields and the consequent amount of stimulus competition during learning. \n\nFigure 2: Gaussian compared to the Cauchy: Note heavier Cauchy tail \n\nEquation 1 was numerically integrated (using Mathematica) over a large range of variances and a large range of locations, using uniform densities representing the weighting functions and both Gaussian and Cauchy distributions representing the hypothesis distributions. Shown in Figure 3 are the results of the integrations for both the Cauchy and Gaussian distributions. The resultant gradients are shown by open circles (Cauchy) or stars (Gaussian), while the solid lines show the best fitting exponential gradient. We note that they approximate the derived gradients rather closely in spite of the fact that the underlying forms are quite complex; for example, the curve shown for the Cauchy integration is actually: \n\n-5 Arctan(x - c2) + 0.01[Arctan(100(x - c1))] + 5 Arctan(x + c1) - 0.01[Arctan(100(x + c2))] - ½((c2 + x) log(1 - s1 x + x²) + (c1 - x) log(s2 - s1 x + x²)) - ½((c1 - x) log(1 + s1 x + x²) + (c2 + x) log(s2 + s1 x + x²)) (2) \n\nConsequently we confirm Shepard's original observation, for a continuous version¹ of his theory, that the exponential gradient is a robust consequence of a minimal-information set of assumptions about generalization to novel stimuli. 
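The integration can be reproduced in outline. The sketch below is our reading, not the authors' Mathematica code: each hypothesis distribution (center c, scale s) is additionally weighted by the density it assigns to the trained stimulus at 0, so that distributions which barely overlap the significant event contribute little; the integration ranges and grid resolution are illustrative:

```python
import math

def gaussian(x, c, s):
    return math.exp(-0.5 * ((x - c) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def cauchy(x, c, s):
    return s / (math.pi * (s * s + (x - c) ** 2))

def gradient(x, H, n=120):
    """Midpoint-rule version of Eq. (1): average H(x; c, s) over
    uniformly weighted centers c and scales s, weighting each
    hypothesis distribution by H(0; c, s), the density it assigns
    to the trained stimulus."""
    total = 0.0
    for i in range(n):
        c = -5.0 + 10.0 * (i + 0.5) / n      # centers uniform on [-5, 5]
        for j in range(n):
            s = 0.1 + 4.9 * (j + 0.5) / n    # scales uniform on (0.1, 5]
            total += H(0.0, c, s) * H(x, c, s)
    return total

xs = [0.0, 1.0, 2.0, 3.0]
gc = [gradient(x, cauchy) for x in xs]
gg = [gradient(x, gaussian) for x in xs]
gc = [v / gc[0] for v in gc]   # normalize so g(0) = 1
gg = [v / gg[0] for v in gg]
# Both gradients decay monotonically with distance; at any fixed scale
# the Cauchy kernel gives distant probes relatively more support.
```

Plotting log g(x) against x for either family shows the roughly linear fall-off that Figure 3 compares against the best fitting exponential.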
\n\nFigure 3: Generalization Gradients Compared to Exponential (Solid Lines) \n\n4.1 Cauchy vs Gaussian \n\nAs pointed out before, the Cauchy has heavier tails than the Gaussian and thus provides more global support in the feature space. This leads to two main differences in the hypothesis distributions: \n\n(1) Global vs local support: unlike back-propagation with hyperplanes, the Cauchy can be local in the feature space, and unlike the Gaussian it can have a more global effect. \n\n(2) Competition, not dimensional scaling: dimensional \"attention\" in the configural-cue and Cauchy multilayer network models is based on competition and effective allocation of resources during learning rather than dimensional contraction or expansion. \n\n1 N-Dimensional Versions: we generalize the above continuous 1-d model to an N-dimensional model by assuming that a network of Cauchy units can be used to construct a set of consequential regions, each possibly composed of several Cauchy receptive fields. Consequently, dimensions can be differentially weighted by subsets of Cauchy units acting in concert, producing metrics like L-1 norms in separable (e.g., shape, size of arbitrary forms) dimension cases, while equally weighting dimensions produces metrics like L-2 norms in integral (e.g., lightness, hue in color) dimension cases. \n\nSince the stimulus generalization properties of both hypothesis distributions are indistinguishable (both close to exponential), it is important to compare categorization results based on a multilayer gradient descent model using both the Cauchy and Gaussian as hidden node functions. \n\n5 Comparisons with Human Categorization Performance \n\nWe consider in the final section two experiments from the human learning literature which constrain categorization results. 
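The contrast in (1) can be made concrete by comparing hidden-unit activations far from the receptive-field center. The unit forms below are a common radial-basis reading of the two mass functions (Gaussian kernel vs the Cauchy kernel 1/(1 + d²)); the specific radius is an illustrative choice:

```python
import math

def gaussian_unit(x, center=0.0, radius=1.0):
    d2 = ((x - center) / radius) ** 2
    return math.exp(-d2)        # light tails: activation dies off fast

def cauchy_unit(x, center=0.0, radius=1.0):
    d2 = ((x - center) / radius) ** 2
    return 1.0 / (1.0 + d2)     # heavy tails: global support

# Activation at increasing distance from the center.
support = {d: (gaussian_unit(d), cauchy_unit(d)) for d in (1.0, 3.0, 10.0)}
# At d = 10 the Gaussian unit is effectively silent (exp(-100)), while
# the Cauchy unit still responds at 1/101, so distant hidden units
# continue to compete for the same stimulus during learning.
```

It is this residual far-field activation that lets every Cauchy receptive field receive some error signal for every stimulus, increasing competition amongst hidden units relative to the Gaussian case.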
The model was a multilayer network using standard gradient descent in the radius, location and second-layer weights of either Cauchy or Gaussian functions in hidden units. \n\n5.1 Shepard, Hovland and Jenkins (1961) \n\nIn order to investigate adults' ability to learn simple classifications, SH&J used eight 3-dimensional stimuli (corners of the cube) representing separable stimulus dimensions like shape, color or size. Of the 70 possible 4-exemplar dichotomies there are only six unique 4-exemplar dichotomies which ignore the specific stimulus dimension. \n\nFigure 4: Classification Learning Rate for Gaussian and Cauchy on SHJ stimuli \n\nThese dichotomies involve both linearly separable and nonlinearly separable classifications as well as selective dependence on a specific dimension or dimensions. For both measures of trials to learn and the number of errors made during learning the order of difficulty was (easiest) I