{"title": "Statistical Prediction with Kanerva's Sparse Distributed Memory", "book": "Advances in Neural Information Processing Systems", "page_first": 586, "page_last": 593, "abstract": null, "full_text": "STATISTICAL PREDICTION WITH KANERVA'S SPARSE DISTRIBUTED MEMORY\n\nDavid Rogers\n\nResearch Institute for Advanced Computer Science\n\nMS 230-5, NASA Ames Research Center\n\nMoffett Field, CA 94035\n\nABSTRACT\n\nA new viewpoint of the processing performed by Kanerva's sparse distributed memory (SDM) is presented. In conditions of near- or over-capacity, where the associative-memory behavior of the model breaks down, the processing performed by the model can be interpreted as that of a statistical predictor. Mathematical results are presented which serve as the framework for a new statistical viewpoint of sparse distributed memory and for which the standard formulation of SDM is a special case. This viewpoint suggests possible enhancements to the SDM model, including a procedure for improving the predictiveness of the system based on Holland's work with 'Genetic Algorithms', and a method for improving the capacity of SDM even when used as an associative memory.\n\nOVERVIEW\n\nThis work is the result of studies involving two seemingly separate topics that proved to share a common framework. The first topic, statistical prediction, is the task of associating extremely large perceptual state vectors with future events. The second topic, over-capacity in Kanerva's sparse distributed memory (SDM), is a study of the computation done in an SDM when presented with many more associations than its stated capacity. 
\n\nI propose that in conditions of over-capacity, where the associative-memory behavior of an SDM breaks down, the processing performed by the SDM can be used for statistical prediction. A mathematical study of the prediction problem suggests a variant of the standard SDM architecture. This variant not only behaves as a statistical predictor when the SDM is filled beyond capacity but is shown to double the capacity of an SDM when used as an associative memory.\n\nTHE PREDICTION PROBLEM\n\nThe earliest living creatures had an ability, albeit limited, to perceive the world through crude senses. This ability allowed them to react to changing conditions in the environment; for example, to move towards (or away from) light sources. As nervous systems developed, learning was possible; if food appeared simultaneously with some other perception, perhaps some odor, a creature could learn to associate that smell with food.\n\nAs the creatures evolved further, a more rewarding type of learning was possible. Some perceptions, such as the perception of pain or the discovery of food, are very important to an animal. However, by the time the perception occurs, damage may already be done, or an opportunity for gain missed. If a creature could learn to associate current perceptions with future ones, it would have a much better chance to do something about it before damage occurs. This is the prediction problem.\n\nThe difficulty of the prediction problem is in the extremely large number of possible sensory inputs. 
For example, a simple animal might have the equivalent of 1000 bits of sensory data at a given time; in this case, the number of possible inputs is greater than the number of atoms in the known universe! In essence, it is an enormous search problem: a living creature must find the subregions of the perceptual space which correlate with the features of interest. Most of the gigantic perceptual space will be uncorrelated, and hence uninteresting.\n\nTHE OVERCAPACITY PROBLEM\n\nAn associative memory is a memory that can recall data when addressed 'close-to' an address where data were previously stored. A number of designs for associative memories have been proposed, such as Hopfield networks (Hopfield, 1982) or the nearest-neighbor associative memory of Baum, Moody, and Wilczek (1987). Memory-related standards such as capacity are usually selected to judge the relative performance of different models. Performance is severely degraded when these memories are filled beyond capacity.\n\nKanerva's sparse distributed memory is an associative memory model developed from the mathematics of high-dimensional spaces (Kanerva, 1988) and is related to the work of David Marr (1969) and James Albus (1971) on the cerebellum of the brain. (For a detailed comparison of SDM to random-access memory, to the cerebellum, and to neural networks, see (Rogers, 1988b).) Like other associative memory models, it exhibits non-memory behavior when near- or over-capacity.\n\nStudies of capacity are often over-simplified by the common assumption of uncorrelated random addresses and data. The capacity of some of these memories, including SDM, is degraded if the memory is presented with correlated addresses and data. 
Such correlations are likely if the addresses and data are from a real-world source. Thus, understanding the over-capacity behavior of an SDM may lead to better procedures for storing correlated data in an associative memory.\n\nSPARSE DISTRIBUTED MEMORY\n\nSparse distributed memory can be best illustrated as a variant of random-access memory (RAM). The structure of a twelve-location SDM with ten-bit addresses and ten-bit data is shown in figure 1 (Kanerva, 1988).\n\n[Figure 1: diagram of a twelve-location SDM showing the reference address, the location addresses, the Hamming radius, the distance and select columns, the input data, the data counters, the column sums, the threshold at 0, and the output data.]\n\nFigure 1. Structure of a Sparse Distributed Memory\n\nA memory location is a row in this figure. The location addresses are set to random addresses. The data counters are initialized to zero. All operations begin with addressing the memory; this entails finding the Hamming distance between the reference address and each of the location addresses. If this distance is less than or equal to the Hamming radius, the select-vector entry is set to 1, and that location is termed selected. The ensemble of such selected locations is called the selected set. Selection is noted in the figure as non-gray rows. A radius is chosen so that only a small percentage of the memory locations are selected for a given reference address. 
\n\n(Later, we will refer to the fact that a memory location defines an activation set of addresses in the address space; the activation set corresponding to a location is the set of reference addresses which activate that memory location. Note the reciprocity between the selected set corresponding to a given reference address, and the activation set corresponding to a given location.)\n\nWhen writing to the memory, all selected counters beneath elements of the input data equal to 1 are incremented, and all selected counters beneath elements of the input data equal to 0 are decremented. This completes a write operation. When reading from the memory, the selected data counters are summed columnwise into the register sums. If the value of a sum is greater than or equal to zero, we set the corresponding bit in the output data to 1; otherwise, we set the bit in the output data to 0. (When reading, the contents of the input data are ignored.)\n\nThis example makes clear that a datum is distributed over the data counters of the selected locations when writing, and that the datum is reconstructed during reading by averaging the sums of these counters. However, depending on what additional data were written into some of the selected locations, and depending on how these data correlate with the original data, the reconstruction may contain noise.\n\nTHE BEHAVIOR OF AN SDM WHEN AT OVER-CAPACITY\n\nConsider an SDM with a 1,000-bit address and a 1-bit datum. In this memory, we are storing associations that are samples of some binary function f on the space S of all possible addresses. 
After storing only a few associations, each data counter will have no explicit meaning, since the data values stored in the memory are distributed over many locations. However, once a sufficiently large number of associations are stored in the memory, the data counter gains meaning: when appropriately normalized to the interval [0, 1], it contains a value which is the conditional probability that the data bit is 1, given that its location was selected. This is shown in figure 2.\n\n\u2022 S is the space of all possible addresses\n\n\u2022 L is the set of addresses in S which activate a given memory location\n\n\u2022 f is a binary function on S [0 or 1] that we want to estimate using the memory\n\n\u2022 The data counter for L contains the average value of f over L, which equals P( f(X) = 1 | X \u2208 L )\n\nFigure 2. The Normalized Content of a Data Counter is the Conditional Probability of the Value of f Being Equal to 1 Given the Reference Addresses are Restricted to the Sphere L.\n\nIn the prediction problem, we want to find activation sets of the address space that correlate with some desired feature bit. When filled far beyond capacity, the individual memory locations of an SDM are collecting statistics about individual subregions of the address space. To estimate the value of f at a given address, it should be possible to combine the conditional probabilities in the data counters of the selected memory locations to make a \"best guess\".\n\nIn the prediction problem, S is the space of possible sensory inputs. Since most regions of S have no relationship with the datum we wish to predict, most of the memory locations will be in non-informative regions of the address space. 
Associative memories are not useful for the prediction problem because the key part of the problem is the search for subregions of the address space that are informative. Due to capacity limitations and the extreme size of the address space, memories fill to capacity and fail before enough samples can be written to identify the useful subregions.\n\nPREDICTING THE VALUE OF f\n\nEach data counter in an SDM can be viewed as an independent estimate of the conditional probability of f being equal to 1 over the activation set defined by the counter's memory location. If a point of S is contained in multiple activation sets, each with its own probability estimate, how do we combine these estimates? More directly, when does knowledge of membership in some activation set help us estimate f better?\n\nAssume that we know P( f(X) = 1 ), which is the average value of f over the entire space S. If a data counter in memory location L has the same conditional probability as P( f(X) = 1 ), then knowing an address is contained in the activation set defining L gives no additional information. (This is what makes the prediction problem hard: most activation sets in S will be uncorrelated with the desired datum.)\n\nWhen is a data counter useful? If a data counter contains a conditional probability far away from the probability for the entire space, then it is highly informative. The more committed a data counter is one way or the other, the more weight it should be given. Ambivalent data counters should be given less weight.\n\nFigure 3 illustrates this point. Two activation sets of S are shown; the numbers 0 and 1 are the values of f at points in these sets. 
(Assume that all the points in the activation sets are in these diagrams.) Membership in the left activation set is non-informative, while membership in the right activation set is highly informative. Most activation sets are neither as bad as the left example nor as good as the right example; instead, they are intermediate to these two cases. We can calculate the relative weights of different activation sets if we can estimate the relative signal/noise ratio of the sets.\n\n\u2022 In the left example, P( f(X) = 1 | X \u2208 L ) = P( f(X) = 1 ): the mean of the activation set is the same as the mean of the entire space. Membership in this activation set gives no information; the opinion of such a set should be given zero weight.\n\n\u2022 In the right example, P( f(X) = 1 | X \u2208 L ) = 1: the mean of the activation set is 1; membership in this activation set completely determines the value of a point; the opinion of such a set should be given 'infinite' weight.\n\nFigure 3. The Predictive Value of an Activation Set Depends on How Much New Information it Gives About the Function f.\n\nTo obtain a measure of the amount of signal in an activation set L, imagine segregating the points of L into two sectors, which I call the informative sector and the non-informative sector. (Note that this partition will not be unique.) Include in the non-informative sector the largest number of points possible such that the percentage of 1's and 0's equals the corresponding percentages in the overall population of the entire space. The remaining points, which constitute the informative sector, will contain all 0's or 1's. The relative size r of the informative sector compared to L constitutes a measure of the signal. The relative size of the non-informative sector to L is (1 - r), and is a measure of the noise. Such a conceptual partition is shown in figure 4.\n\nOnce the signal and the noise of an activation set are estimated, there are known methods for calculating the weight that should be given to this set when combining with other sets (Rogers, 1988a). That weight is r / (1 - r)^2. Thus, given the conditional probability and the global probability, we can calculate the weight which should be given to that data counter when combined with other counters.\n\n\u2022 Informative sector (relative size r): P( f(X) = 1 | X \u2208 L_inf ) = VALUE [0 or 1]\n\n\u2022 Non-informative sector (relative size 1 - r): P( f(X) = 1 | X \u2208 L_non ) = P( f(X) = 1 )\n\nr = [ P( f(X) = 1 | X \u2208 L ) - P( f(X) = 1 ) ] / [ VALUE - P( f(X) = 1 ) ]\n\nFigure 4. An Activation Set Defined by a Memory Location can be Partitioned into Informative and Non-informative Sectors.\n\nEXPERIMENTAL\n\nThe given weighting scheme was used in the standard SDM to test its effect on capacity. In the case of random addresses and data, the weights doubled the capacity of the SDM. Even greater savings are likely with correlated data. These results are shown in figure 5.\n\n[Figure 5: two plots with number of writes (0 to 300) on the horizontal axis.]\n\nFigure 5. Number of Bitwise Errors vs. Number of Writes in a 256-bit Address, 256-bit Data, 1000-Location Sparse Distributed Memory. The Left is the Standard SDM; the Right is the Statistically-Weighted SDM. 
Graphs Shown are Averages of 16 Runs.\n\nIn deriving the weights, it was assumed that the individual data counters would become meaningful only when a sufficiently large number of associations were stored in the memory. This experiment suggests that even a small number of associations is sufficient to benefit from statistically-based weighting. These results are important, for they suggest that this scheme can be used in an SDM in the full continuum, from low-capacity memory-based uses to over-capacity statistical-prediction uses.\n\nCONCLUSIONS\n\nStudies of SDM under conditions of over-capacity, in combination with the new problem of statistical prediction, suggest a new range of uses for SDM. By weighting the locations differently depending on their contents, we also have discovered a technique for improving the capacity of the SDM even when used as a memory.\n\nThis weighting scheme opens new possibilities for learning; for example, these weights can be used to estimate the fitness of the locations for learning algorithms such as Holland's genetic algorithms. Since the statistical prediction problem is primarily a problem of search over extremely large address spaces, such techniques would allow redistribution of the memory locations to regions of the address space which are maximally useful, while abandoning the regions which are non-informative. The combination of learning with memory is a potentially rich area for future study.\n\nFinally, many studies of associative memories have explicitly assumed random data; most real-world applications have non-random data. 
This theory explicitly assumes, and makes use of, correlations between the associations given to the memory. Assumptions such as randomness, which are useful in mathematical studies, must be abandoned if we are to apply these tools to real-world problems.\n\nAcknowledgments\n\nThis work was supported in part by Cooperative Agreements NCC 2-408 and NCC 2-387 from the National Aeronautics and Space Administration (NASA) to the Universities Space Research Association (USRA). Funding related to the Connection Machine was jointly provided by NASA and the Defense Advanced Research Projects Agency (DARPA). All agencies involved were very helpful in promoting this work, for which I am grateful.\n\nThe entire RIACS staff and the SDM group have been supportive of my work. Louis Jaeckel gave important assistance which guided the early development of these ideas. Bruno Olshausen was a vital sounding-board for this work. Finally, I'll get mushy and thank those who supported my spirits during this project, especially Pentti Kanerva, Rick Claeys, John Bogan, and last but of course not least, my parents, Philip and Cecilia. Love you all.\n\nReferences\n\nAlbus, J. S., \"A theory of cerebellar functions,\" Math. Bio., 10, pp. 25-61 (1971).\nBaum, E., Moody, J., and Wilczek, F., \"Internal representations for associative memory,\" Biological Cybernetics, (1987).\nHolland, J. H., Adaptation in natural and artificial systems, Ann Arbor: University of Michigan Press (1975).\nHolland, J. H., \"Escaping brittleness: the possibilities of general-purpose learning algorithms applied to parallel rule-based systems,\" in Machine learning, an artificial intelligence approach, Volume II, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, eds. 
Los Altos, California: Morgan Kaufmann (1986).\nHopfield, J. J., \"Neural networks and physical systems with emergent collective computational abilities,\" Proc. Nat'l Acad. Sci. USA, 79, pp. 2554-8 (1982).\nKanerva, Pentti, \"Self-propagating Search: A Unified Theory of Memory,\" Center for the Study of Language and Information Report No. CSLI-84-7 (1984).\nKanerva, Pentti, Sparse distributed memory, Cambridge, Mass: MIT Press (1988).\nMarr, D., \"The cortex of the cerebellum,\" J. Physio., 202, pp. 437-470 (1969).\nRogers, David, \"Using data-tagging to improve the performance of Kanerva's sparse distributed memory,\" Research Institute for Advanced Computer Science Technical Report 88.1, NASA Ames Research Center (1988a).\nRogers, David, \"Kanerva's sparse distributed memory: an associative memory algorithm well-suited to the Connection Machine,\" Research Institute for Advanced Computer Science Technical Report 88.32, NASA Ames Research Center (1988b).\n", "award": [], "sourceid": 130, "authors": [{"given_name": "David", "family_name": "Rogers", "institution": null}]}