{"title": "Statistical Prediction with Kanerva's Sparse Distributed Memory", "book": "Advances in Neural Information Processing Systems", "page_first": 586, "page_last": 593, "abstract": null, "full_text": "586 \n\nSTATISTICAL PREDICTION WITH KANERVA'S \n\nSPARSE DISTRmUTED MEMORY \n\nDavid Rogers \n\nResearch Institute for Advanced Computer Science \n\nMS 230-5, NASA Ames Research Center \n\nMoffett Field, CA 94035 \n\nABSTRACT \n\nA new viewpoint of the processing performed by Kanerva's sparse \ndistributed memory (SDM) is presented. \nIn conditions of near- or \nover- capacity, where the associative-memory behavior of the mod(cid:173)\nel breaks down, the processing performed by the model can be inter(cid:173)\npreted as that of a statistical predictor. Mathematical results are \npresented which serve as the framework for a new statistical view(cid:173)\npoint of sparse distributed memory and for which the standard for(cid:173)\nmulation of SDM is a special case. This viewpoint suggests possi(cid:173)\nble enhancements to the SDM model, including a procedure for \nimproving the predictiveness of the system based on Holland's \nwork with 'Genetic Algorithms', and a method for improving the \ncapacity of SDM even when used as an associative memory. \n\nOVERVIEW \n\nThis work is the result of studies involving two seemingly separate topics that \nproved to share a common framework. The fIrst topic, statistical prediction, is the \ntask of associating extremely large perceptual state vectors with future events. The \nsecond topic, over-capacity in Kanerva's sparse distributed memory (SDM), is a \nstudy of the computation done in an SDM when presented with many more associa(cid:173)\ntions than its stated capacity. \n\nI propose that in conditions of over-capacity, where the associative-memory behav(cid:173)\nior of an SDM breaks down, the processing performed by the SDM can be used for \nstatistical prediction. 
A mathematical study of the prediction problem suggests a variant of the standard SDM architecture. This variant not only behaves as a statistical predictor when the SDM is filled beyond capacity but is shown to double the capacity of an SDM when used as an associative memory. \n\nTHE PREDICTION PROBLEM \n\nThe earliest living creatures had an ability, albeit limited, to perceive the world through crude senses. This ability allowed them to react to changing conditions in the environment; for example, to move towards (or away from) light sources. As nervous systems developed, learning was possible; if food appeared simultaneously with some other perception, perhaps some odor, a creature could learn to associate that smell with food. \n\nAs the creatures evolved further, a more rewarding type of learning was possible. Some perceptions, such as the perception of pain or the discovery of food, are very important to an animal. However, by the time the perception occurs, damage may already be done, or an opportunity for gain missed. If a creature could learn to associate current perceptions with future ones, it would have a much better chance to do something about it before damage occurs. This is the prediction problem. \n\nThe difficulty of the prediction problem is in the extremely large number of possible sensory inputs. For example, a simple animal might have the equivalent of 1000 bits of sensory data at a given time; in this case, the number of possible inputs is greater than the number of atoms in the known universe! In essence, it is an enormous search problem: a living creature must find the subregions of the perceptual space which correlate with the features of interest. Most of the gigantic perceptual space will be uncorrelated, and hence uninteresting. 
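The scale of that search can be checked directly: 1000 bits of sensory input allow 2^1000 distinct states, while a commonly quoted rough bound for the number of atoms in the observable universe is about 10^80. A one-line check (illustrative only):

```python
# 1000 bits of sensory input allow 2**1000 distinct states; the commonly
# quoted estimate of ~10**80 atoms in the observable universe is far smaller.
n_states = 2 ** 1000
assert n_states > 10 ** 80
```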
\n\nTHE OVER-CAPACITY PROBLEM \n\nAn associative memory is a memory that can recall data when addressed 'close-to' an address where data were previously stored. A number of designs for associative memories have been proposed, such as Hopfield networks (Hopfield, 1982) or the nearest-neighbor associative memory of Baum, Moody, and Wilczek (1987). Memory-related standards such as capacity are usually selected to judge the relative performance of different models. Performance is severely degraded when these memories are filled beyond capacity. \n\nKanerva's sparse distributed memory is an associative memory model developed from the mathematics of high-dimensional spaces (Kanerva, 1988) and is related to the work of David Marr (1969) and James Albus (1971) on the cerebellum of the brain. (For a detailed comparison of SDM to random-access memory, to the cerebellum, and to neural networks, see (Rogers, 1988b).) Like other associative memory models, it exhibits non-memory behavior when near or over capacity. \n\nStudies of capacity are often over-simplified by the common assumption of uncorrelated random addresses and data. The capacity of some of these memories, including SDM, is degraded if the memory is presented with correlated addresses and data. Such correlations are likely if the addresses and data are from a real-world source. Thus, understanding the over-capacity behavior of an SDM may lead to better procedures for storing correlated data in an associative memory. \n\nSPARSE DISTRIBUTED MEMORY \n\nSparse distributed memory can best be illustrated as a variant of random-access memory (RAM). The structure of a twelve-location SDM with ten-bit addresses and ten-bit data is shown in figure 1. 
(Kanerva, 1988) \n\n[Figure 1: diagram of a twelve-location SDM showing the reference address, the location addresses, the distance and select columns, the data counters, the column sums, and the thresholded output data.] \n\nFigure 1. Structure of a Sparse Distributed Memory \n\nA memory location is a row in this figure. The location addresses are set to random addresses. The data counters are initialized to zero. All operations begin with addressing the memory; this entails finding the Hamming distance between the reference address and each of the location addresses. If this distance is less than or equal to the Hamming radius, the select-vector entry is set to 1, and that location is termed selected. The ensemble of such selected locations is called the selected set. Selection is noted in the figure as non-gray rows. A radius is chosen so that only a small percentage of the memory locations are selected for a given reference address. \n\n(Later, we will refer to the fact that a memory location defines an activation set of addresses in the address space; the activation set corresponding to a location is the set of reference addresses which activate that memory location. Note the reciprocity between the selected set corresponding to a given reference address, and the activation set corresponding to a given location.) \n\nWhen writing to the memory, all selected counters beneath elements of the input data equal to 1 are incremented, and all selected counters beneath elements of the input data equal to 0 are decremented. 
This completes a write operation. When reading from the memory, the selected data counters are summed columnwise into the register of sums. If the value of a sum is greater than or equal to zero, we set the corresponding bit in the output data to 1; otherwise, we set the bit in the output data to 0. (When reading, the contents of the input data are ignored.) \n\nThis example makes clear that a datum is distributed over the data counters of the selected locations when writing, and that the datum is reconstructed during reading by averaging the sums of these counters. However, depending on what additional data were written into some of the selected locations, and depending on how these data correlate with the original data, the reconstruction may contain noise. \n\nTHE BEHAVIOR OF AN SDM WHEN AT OVER-CAPACITY \n\nConsider an SDM with a 1,000-bit address and a 1-bit datum. In this memory, we are storing associations that are samples of some binary function f on the space S of all possible addresses. After storing only a few associations, each data counter will have no explicit meaning, since the data values stored in the memory are distributed over many locations. However, once a sufficiently large number of associations are stored in the memory, a data counter gains meaning: when appropriately normalized to the interval [0, 1], it contains a value which is the conditional probability that the data bit is 1, given that its location was selected. This is shown in figure 2. \n\n\u2022 S is the space of all possible addresses. \n\u2022 L is the set of addresses in S which activate a given memory location. \n\u2022 f is a binary function on S [0 or 1] that we want to estimate using the memory. \n\u2022 The data counter for L contains the average value of f over L, which equals P( f(X) = 1 | X \u2208 L ). \n\nFigure 2. 
The Normalized Content of a Data Counter is the Conditional Probability of the Value of f Being Equal to 1, Given that the Reference Addresses are Restricted to the Sphere L. \n\nIn the prediction problem, we want to find activation sets of the address space that correlate with some desired feature bit. When filled far beyond capacity, the individual memory locations of an SDM are collecting statistics about individual subregions of the address space. To estimate the value of f at a given address, it should be possible to combine the conditional probabilities in the data counters of the selected memory locations to make a \"best guess\". \n\nIn the prediction problem, S is the space of possible sensory inputs. Since most regions of S have no relationship with the datum we wish to predict, most of the memory locations will be in non-informative regions of the address space. Associative memories are not useful for the prediction problem because the key part of the problem is the search for subregions of the address space that are informative. Due to capacity limitations and the extreme size of the address space, memories fill to capacity and fail before enough samples can be written to identify the useful subregions. \n\nPREDICTING THE VALUE OF f \n\nEach data counter in an SDM can be viewed as an independent estimate of the conditional probability of f being equal to 1 over the activation set defined by the counter's memory location. If a point of S is contained in multiple activation sets, each with its own probability estimate, how do we combine these estimates? More directly, when does knowledge of membership in some activation set help us estimate f better? \n\nAssume that we know P( f(X) = 1 ), which is the average value of f over the entire space S. If a data counter in memory location L has the same conditional probability as P( f(X) = 1 ), 
then knowing an address is contained in the activation set defining L gives no additional information. (This is what makes the prediction problem hard: most activation sets in S will be uncorrelated with the desired datum.) \n\nWhen is a data counter useful? If a data counter contains a conditional probability far away from the probability for the entire space, then it is highly informative. The more committed a data counter is one way or the other, the more weight it should be given. Ambivalent data counters should be given less weight. \n\nFigure 3 illustrates this point. Two activation sets of S are shown; the numbers 0 and 1 are the values of f at points in these sets. (Assume that all the points in the activation sets are in these diagrams.) Membership in the left activation set is non-informative, while membership in the right activation set is highly informative. Most activation sets are neither as bad as the left example nor as good as the right example; instead, they are intermediate to these two cases. We can calculate the relative weights of different activation sets if we can estimate the relative signal/noise ratio of the sets. \n\n\u2022 In the left example, the mean of the activation set is the same as the mean of the entire space: P( f(X) = 1 | X \u2208 L ) = P( f(X) = 1 ). Membership in this activation set gives no information; the opinion of such a set should be given zero weight. \n\u2022 In the right example, the mean of the activation set is 1: P( f(X) = 1 | X \u2208 L ) = 1. Membership in this activation set completely determines the value of a point; the opinion of such a set should be given 'infinite' weight. \n\nFigure 3. The Predictive Value of an Activation Set Depends on How Much New Information it Gives About the Function f. \n\n(Note that this partition will not be unique.) 
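Anticipating the partition argument developed in the next section (a signal measure r and a weight r / (1 - r)^2), the weighting can be sketched in Python; the function names here are my own, not the paper's:

```python
def sector_signal(p_local, p_global):
    """Relative size r of the informative sector of an activation set.

    p_local:  P(f = 1 | location selected), the normalized counter value.
    p_global: P(f = 1) over the entire address space.
    """
    # VALUE is the bit (0 or 1) that fills the informative sector.
    value = 1.0 if p_local >= p_global else 0.0
    if p_local == p_global:
        return 0.0  # no information beyond the global average
    return (p_local - p_global) / (value - p_global)


def counter_weight(p_local, p_global):
    """Weight r / (1 - r)^2 from the signal/noise argument."""
    r = sector_signal(p_local, p_global)
    if r >= 1.0:
        return float('inf')  # fully determined set: 'infinite' weight
    return r / (1.0 - r) ** 2
```

The two extremes of Figure 3 come out as expected: a counter equal to the global probability gets weight 0, and a counter pinned at 1 gets infinite weight.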
\n\nTo obtain a measure of the amount of signal in an activation set L, imagine segregating the points of L into two sectors, which I call the informative sector and the non-informative sector. Include in the non-informative sector the largest number of points possible such that the percentages of 1's and 0's equal the corresponding percentages in the overall population of the entire space. The remaining points, which constitute the informative sector, will contain all 0's or all 1's. The relative size r of the informative sector compared to L constitutes a measure of the signal. The relative size of the non-informative sector to L is (1 - r), and is a measure of the noise. Such a conceptual partition is shown in figure 4. \n\nOnce the signal and the noise of an activation set are estimated, there are known methods for calculating the weight that should be given to this set when combining with other sets (Rogers, 1988a). That weight is r / (1 - r)^2. Thus, given the conditional probability and the global probability, we can calculate the weight which should be given to that data counter when combined with other counters. \n\nP( f(X) = 1 | X \u2208 L_inf ) = VALUE [0 or 1], where L_inf is the informative sector, of relative size r \nP( f(X) = 1 | X \u2208 L_non ) = P( f(X) = 1 ), where L_non is the non-informative sector, of relative size (1 - r) \n\nr = ( P( f(X) = 1 | X \u2208 L ) - P( f(X) = 1 ) ) / ( VALUE - P( f(X) = 1 ) ) \n\nFigure 4. An Activation Set Defined by a Memory Location can be Partitioned into Informative and Non-informative Sectors. \n\nEXPERIMENTAL \n\nThe given weighting scheme was used in the standard SDM to test its effect on capacity. In the case of random addresses and data, the weights doubled the capacity of the SDM. Even greater savings are likely with correlated data. These results are shown in figure 5. \n\n[Figure 5: two graphs of bitwise errors (0-20) vs. number of writes (0-300).] \n\nFigure 5. Number of Bitwise Errors vs. Number of Writes in a 256-bit Address, 256-bit Data, 1000-Location Sparse Distributed Memory. The Left is the Standard SDM; the Right is the Statistically-Weighted SDM. Graphs Shown are Averages of 16 Runs. \n\nIn deriving the weights, it was assumed that the individual data counters would become meaningful only when a sufficiently large number of associations were stored in the memory. This experiment suggests that even a small number of associations is sufficient to benefit from statistically-based weighting. These results are important, for they suggest that this scheme can be used in an SDM over the full continuum, from low-capacity memory-based uses to over-capacity statistical-prediction uses. \n\nCONCLUSIONS \n\nStudies of SDM under conditions of over-capacity, in combination with the new problem of statistical prediction, suggest a new range of uses for SDM. By weighting the locations differently depending on their contents, we also have discovered a technique for improving the capacity of the SDM even when used as a memory. \n\nThis weighting scheme opens new possibilities for learning; for example, these weights can be used to estimate the fitness of the locations for learning algorithms such as Holland's genetic algorithms. Since the statistical prediction problem is primarily a problem of search over extremely large address spaces, such techniques would allow redistribution of the memory locations to regions of the address space which are maximally useful, while abandoning the regions which are non-informative. The combination of learning with memory is a potentially rich area for future study. 
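One way to picture that redistribution idea is as a fitness-driven reallocation: score each memory location by its statistical weight, treat the score as a genetic-algorithm-style fitness, and move the least informative locations to fresh random addresses. The sketch below is hypothetical (it is not an algorithm given in the paper, and the location record fields are my own):

```python
import random

def redistribute(locations, weight_of, fraction=0.25, address_bits=256, seed=0):
    """Hypothetical sketch: use each location's statistical weight as a
    fitness score and re-randomize the addresses of the least informative
    fraction of locations, abandoning non-informative regions."""
    rng = random.Random(seed)
    ranked = sorted(locations, key=weight_of)  # least informative first
    for loc in ranked[:int(len(ranked) * fraction)]:
        # Give the location a fresh random address and clear its statistics.
        loc['address'] = [rng.randint(0, 1) for _ in range(address_bits)]
        loc['counter'] = 0
    return locations
```

For example, with a toy fitness equal to the counter's commitment, the two least committed of four locations are reset:

```python
locs = [{'address': [0] * 8, 'counter': c} for c in (9, 1, 5, 0)]
redistribute(locs, weight_of=lambda l: abs(l['counter']), fraction=0.5, address_bits=8)
```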
\n\nFinally, many studies of associative memories have explicitly assumed random data in their studies; most real-world applications have non-random data. This theory explicitly assumes, and makes use of, correlations between the associations given to the memory. Assumptions such as randomness, which are useful in mathematical studies, must be abandoned if we are to apply these tools to real-world problems. \n\nAcknowledgments \n\nThis work was supported in part by Cooperative Agreements NCC 2-408 and NCC 2-387 from the National Aeronautics and Space Administration (NASA) to the Universities Space Research Association (USRA). Funding related to the Connection Machine was jointly provided by NASA and the Defense Advanced Research Projects Agency (DARPA). All agencies involved were very helpful in promoting this work, for which I am grateful. \n\nThe entire RIACS staff and the SDM group have been supportive of my work. Louis Jaeckel gave important assistance which guided the early development of these ideas. Bruno Olshausen was a vital sounding-board for this work. Finally, I'll get mushy and thank those who supported my spirits during this project, especially Pentti Kanerva, Rick Claeys, John Bogan, and last but of course not least, my parents, Philip and Cecilia. Love you all. \n\nReferences \n\nAlbus, J. S., \"A theory of cerebellar functions,\" Math. Biosciences, 10, pp. 25-61 (1971). \nBaum, E., Moody, J., and Wilczek, F., \"Internal representations for associative memory,\" Biological Cybernetics (1987). \nHolland, J. H., Adaptation in natural and artificial systems, Ann Arbor: University of Michigan Press (1975). \nHolland, J. 
H., \"Escaping brittleness: the possibilities of general-purpose learning \nalgorithms applied to parallel rule-based systems,\" in Machine learning, an \nartificial intelligence approach, Volume II, R. I. Michalski, I. G. Carbonell, \nand T. M. Mitchell, eds. Los Altos, California: Morgan Kaufmann (1986). \n\nHopfield, IJ., \"Neural networks and physical systems with emergent collective \n\ncomputational abilities,\" Proc. Nat' I Acad. Sci. USA, 79, pp. 2554-8 (1982). \n\nKanerva, Pentti., \"Self-propagating Search: A Unified Theory of Memory,\" Center \n\nfor the Study of Language and Information Report No. CSLI-84-7 (1984). \nKanerva, Pentti., Sparse distributed memory, Cambridge, Mass: MIT Press, 1988. \nMarr, D., \"The cortex of the cerebellum,\" 1. Physio .\u2022 202, pp. 437-470 (1969). \nRogers, David, \"Using data-tagging to improve the performance of Kanerva's sparse \ndistributed memory,\" Research Institute for Advanced Computer Science \nTechnical Report 88.1, NASA Ames Research Center (1988a). \n\nRogers, David, \"Kanerva's sparse distributed memory: an associative memory algo(cid:173)\n\nrithm well-suited \nInstitute for \nAdvanced Computer Science Technical Report 88.32, NASA Ames Research \nCenter (l988b). \n\nthe Connection Machine,\" Research \n\nto \n\n\f", "award": [], "sourceid": 130, "authors": [{"given_name": "David", "family_name": "Rogers", "institution": null}]}