{"title": "Neural Models for Part-Whole Hierarchies", "book": "Advances in Neural Information Processing Systems", "page_first": 17, "page_last": 26, "abstract": null, "full_text": "Neural Models for Part-Whole Hierarchies\n\nMaximilian Riesenhuber\n\nPeter Dayan\n\nDepartment of Brain & Cognitive Sciences\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\n{max,dayan}@ai.mit.edu\n\nAbstract\n\nWe present a connectionist method for representing images that explicitly addresses their hierarchical nature. It blends data from neuroscience about whole-object viewpoint-sensitive cells in inferotemporal cortex [8] and attentional basis-field modulation in V4 [3] with ideas about hierarchical descriptions based on microfeatures [5,11]. The resulting model makes critical use of bottom-up and top-down pathways for analysis and synthesis [6]. We illustrate the model with a simple example of representing information about faces.\n\n1 Hierarchical Models\n\nImages of objects constitute an important paradigm case of a representational hierarchy, in which 'wholes', such as faces, consist of 'parts', such as eyes, noses and mouths. The representation and manipulation of part-whole hierarchical information in fixed hardware is a heavy millstone around connectionist necks, and has consequently been the inspiration for many interesting proposals, such as Pollack's RAAM [11].\n\nWe turned to the primate visual system for clues. Anterior inferotemporal cortex (IT) appears to construct representations of visually presented objects. Mouths and faces are both objects, and so require fully elaborated representations, presumably at the level of anterior IT, probably using different (or possibly partially overlapping) sets of cells. 
The natural way to represent the part-whole relationship between mouths and faces is to have a neuronal hierarchy, with connections bottom-up from the mouth units to the face units so that information about the mouth can be used to help recognize or analyze the image of a face, and connections top-down from the face units to the mouth units expressing the generative or synthetic knowledge that if there is a face in a scene, then there is (usually) a mouth too. There is little empirical support for or against such a neuronal hierarchy, but it seems extremely unlikely on the grounds that arranging for one with the correct set of levels for all classes of objects seems to be impossible.\n\nAcknowledgments: We thank Larry Abbott, Geoff Hinton, Bruno Olshausen, Tomaso Poggio, Alex Pouget, Emilio Salinas and Pawan Sinha for discussions and comments.\n\nThere is recent evidence that activities of cells in intermediate areas in the visual processing hierarchy (such as V4) are influenced by the locus of visual attention [3]. This suggests an alternative strategy for representing part-whole information, in which there is an interaction, subject to attentional control, between top-down generative and bottom-up recognition processing. In one version of our example, activating units in IT that represent a particular face leads, through the top-down generative model, to a pattern of activity in lower areas that is closely related to the pattern of activity that would be seen when the entire face is viewed. This activation in the lower areas in turn provides bottom-up input to the recognition system. 
In the bottom-up direction, the attentional signal controls which aspects of that activation are actually processed, for example, specifying that only the activity reflecting the lower part of the face should be recognized. In this case, the mouth units in IT can then recognize this restricted pattern of activity as being a particular sort of mouth. Therefore, we have provided a way by which the visual system can represent the part-whole relationship between faces and mouths.\n\nThis describes just one of many possibilities. For instance, attentional control could be mainly active during the top-down phase instead. Then it would create in V1 (or indeed in intermediate areas) just the activity corresponding to the lower portion of the face in the first place. Also the focus of attention need not be so ineluctably spatial.\n\nThe overall scheme is based on a hierarchical top-down synthesis and bottom-up analysis model for visual processing, as in the Helmholtz machine [6] (note that \"hierarchy\" here refers to a processing hierarchy rather than the part-whole hierarchy discussed above) with a synthetic model forming the effective map:\n\n$$\\text{'object'} \\otimes \\text{'attentional eye-position'} \\rightarrow \\text{'image'} \\qquad (1)$$\n\n(shown in cartoon form in figure 1) where 'image' stands in for the (probabilities over the) activities of units at various levels in the system that would be caused by seeing the aspect of the 'object' selected by placing the focus and scale of attention appropriately. We use this generative model during synthesis in the way described above to traverse the hierarchical description of any particular image. We use the statistical inverse of the synthetic model as the way of analyzing images to determine what objects they depict. 
This inversion process is clearly also sensitive to the attentional eye-position: it actually determines not only the nature of the object in the scene, but also the way that it is depicted (i.e., its instantiation parameters) as reflected in the attentional eye position.\n\nIn particular, the bottom-up analysis model exists in the connections leading to the 2D viewpoint-selective image cells in IT reported by Logothetis et al. [8], which form population codes for all the represented images (mouths, noses, etc). The top-down synthesis model exists in the connections leading in the reverse direction. In generalizations of our scheme, it may, of course, not be necessary to generate an image all the way down in V1.\n\n[Figure 1 diagram labels: attentional eye position $e = (e^x, e^y, e^s)$; layer 3 (o), layer 2 (m), layer 1 (p).]\n\nFigure 1: Cartoon of the model. In the top-down, generative, direction, the model generates images of faces, eyes, mouths or noses based on an attentional eye position and a selection of a single top-layer unit; the bottom-up, recognition, direction is the inverse of this map. The response of the neurons in the middle layer is modulated sigmoidally (as illustrated by the graphs shown inside the circles representing the neurons in the middle layer) by the attentional eye position. See section 2 for more details.\n\nThe map (1) specifies a top-down computational task very like the bottom-up one addressed using a multiplicatively controlled synaptic matrix in the shifter model 
of Olshausen et al. [9]. Our solution emerges from the control the attentional eye position exerts at various levels of processing, most relevantly modulating activity in V4 [3]. Equivalent modulation in the parietal cortex based on actual (rather than attentional) eye position [1] has been characterized by Pouget & Sejnowski [13] and Salinas & Abbott [15] in terms of basis fields. They showed that these basis fields can be used to solve the same tasks as the shifter model but with neuronal rather than synaptic multiplicative modulation. In fact, eye-position modulation almost certainly occurs at many levels in the system, possibly including V1 [17]. Our scheme clearly requires that the modulating attentional eye-position must be able to become detached from the spatial eye-position. Connor et al. [3] collected evidence for part of this hypothesis, although the coordinate system(s) of the modulation is not entirely clear from their data.\n\nBottom-up and top-down mappings are learned taking the eye-position modulation into account. In the experiments below, we used a version of the wake-sleep algorithm [6], for its conceptual and computational simplicity. This requires learning the bottom-up model from generated imagery (during sleep) and learning the top-down model from assigned explanations (during observation of real input during wake). In the current version, for simplicity, the eye position is set correctly during recognition, but we are also interested in exploring automatic ways of doing this.\n\n2 Results\n\nWe have developed a simple model that illustrates the feasibility of the scheme presented above in the context of recognizing and generating cartoon drawings of a face and its parts. 
Recognition involves taking an image of a face or a part thereof (the mouth, nose or one of the eyes) at an arbitrary position on the retina, and setting the appropriate top-level unit to 1 (and the remaining units to zero). Generation involves imaging either a whole face or one of its parts (selected by the active unit in the top layer) at an arbitrary position on the retina.\n\n[Figure 2: panel a) shows stimulus and activation pairs; panel b) shows generated images titled face, nose, mouth and eye.]\n\nFigure 2: a) Recognition: the left column of each pair shows the stimuli; the right shows the resulting activations in the top layer (ordered as face, mouth, nose and eye). The stimuli are faces at random positions in the retina. Recognition is performed by setting the attentional eye position in the image and setting the attentional scale, which creates a window of attention around the attended position, shown by a circle of corresponding size and position. b) Generation: each panel shows the output of the generative pathway for a randomly chosen attentional eye position on activating each of the top layer units in turn. The focus of attention is marked by a circle whose size reflects the attentional scale. The name of the object whose neuronal representation in the top layer was activated is shown above each panel.\n\nThe model (figure 1) consists of three layers. The lowest layer is a 32 x 32 'retina'. In the recognition direction, the retina feeds into a layer of 500 hidden units. 
These project to the top layer, which has four neurons. In the generative direction, the connectivity is reversed. The network is fully connected in both directions. The activity of each neuron based on input from the preceding (for recognition) or the following layer (for generation) is a linear function (weight matrices $W^r$, $V^r$ in the recognition and $V^g$, $W^g$ in the generative direction). The attentional eye position influences activity through multiplicative modulation of the neuronal responses in the hidden layer. The linear response $r_i = (W^r p)_i$ or $r_i = (V^g o)_i$ of each neuron $i$ in the middle layer based on the bottom-up or top-down connections is multiplied by $\\xi_i = \\phi_i^x(e^x)\\,\\phi_i^y(e^y)\\,\\phi_i^s(e^s)$, where $\\phi_i^{\\{x,y,s\\}}$ are the tuning curves in each dimension of the attentional eye position $e = (e^x, e^y, e^s)$, coding the x- and y-coordinates and the scale of the focus of attention, respectively. Thus, for the activity $m_i$ of hidden neuron $i$ we have $m_i = (W^r p)_i \\cdot \\xi_i$ in the recognition pathway and $m_i = (V^g o)_i \\cdot \\xi_i$ in the generative pathway. The tuning curves of the $\\xi_i$ are chosen to be sigmoid with random centers $c_i$ and random directions $d_i \\in \\{-1, 1\\}$, e.g., $\\phi_i^s = \\sigma(4\\, d_i^s\\, (e^s - c_i^s))$. In other implementations, we have also used Gaussian tuning functions. In fact, the only requirement regarding the shape of the tuning functions is that through a superposition of them one can construct functions that show a peaked dependence on the attentional eye position. In the recognition direction, the attentional eye position also has an influence on the activity in the input layer by defining a 'window of attention' [7], which we implemented using a Gaussian window centered at the attentional eye position with its size given by the attentional scale. 
This is to allow the system to learn models of parts based on experience with images of whole faces.\n\nTo train the model, we employ a variant of the unsupervised wake-sleep algorithm [6]. In this algorithm, the generative pathway is trained during a wake-phase, in which stimuli in the input layer (the retina, in our case) cause activation of the neurons in the network through the recognition pathway, providing an error signal to train the generative pathway using the delta rule. Conversely, in the sleep-phase, random activation of a top layer unit (in conjunction with a randomly chosen attentional eye-position) leads, via the generative connections, to the generation of activation in the middle layer and consequently an image in the input layer that is then used to adapt the recognition weights, again using the delta rule. Although the delta rule in wake-sleep is fine for the recognition direction, it leads to a poor generative model: in our simple case, generation is much more difficult than recognition. As an interim solution, we therefore train the generative weights using back-propagation, which uses the activity in the top layer created by the recognition pathway as the input and the retinal activation pattern as the target signal. Hence, learning is still unsupervised (except that appropriate attentional eye-positions are always set during recognition). We have also experimented with a system in which the weights $W^r$ and $W^g$ are preset and only the weights between layers 2 and 3 are trained. For this model, training could be done with the standard wake-sleep algorithm, i.e., using the local delta-rule for both sets of weights.\n\nFigure 2a shows several examples of the performance of the recognition pathway for the different stimuli after 300,000 iterations. 
The network is able to recognize the stimuli accurately at different positions in the visual field. Figure 2b shows several examples of the output of the generative model, illustrating its capacity to produce images of faces or their parts at arbitrary locations. By imaging a whole face and then focusing attention on, e.g., an area around its center, which activates the 'nose' unit through the recognition pathway, the relationship that, e.g., a nose is part of a face can be established in a straightforward way.\n\n3 Discussion\n\nRepresenting hierarchical structure is a key problem for connectionism. Visual images offer a canonical example for which it seems possible to elucidate some of the underlying neural mechanisms. The theory is based on 2D view object-selective cells in anterior IT, and attentional eye-position modulation of the firing of cells in V4. These work in the context of analysis by synthesis or recognition and generative models such that the part-whole hierarchy of an object such as a face (which contains eyes, which contain pupils, etc) can be traversed in the generative direction by choosing to view the object through a different effective eye-position, and in the recognition direction by allowing the real and the attentional eye-positions to be decoupled to activate the requisite 2D view-selective cells.\n\nThe scheme is related to Pollack's Recursive Auto-Associative Memory (RAAM) system [11]. RAAM provides a way of representing tree-structured information. For instance, to learn an object whose structure is {{A,B},{C,D}}, a standard three-layer auto-associative net would be taught AB, leading to a pattern of hidden unit activations $\\alpha$; then it would learn CD, leading to $\\beta$; and finally $\\alpha\\beta$, leading to $\\gamma$, which would itself be the representation of the whole object. 
The compression operation ($AB \\rightarrow \\alpha$) and its expansion inverse are required as explicit methods for manipulating tree structure.\n\nOur scheme for representing hierarchical information is similar to RAAM, using the notion of an attentional eye-position to perform its compression and expansion operations. However, whereas RAAM normally constructs its own codes for intermediate levels of the trees that it is fed, here, images of faces are as real and as available as those, for instance, of their associated mouths. This not only changes the learning task, but also renders sensible a notion of direct recognition without repeated RAAMification of the parts.\n\nVarious aspects of our scheme require comment: the way that eye position affects recognition; the coding of different instances of objects; the use of top-down information during bottom-up recognition; variants of the scheme for objects that are too big or too geometrically challenging to 'fit' in one go into a single image; and hierarchical objects other than images. We are also working on a more probabilistically correct version, taking advantage of the statistical soundness of the Helmholtz machine.\n\nEye position information is ubiquitous in visual processing areas [12], including the LGN and V1 [17], as well as the parietal cortex [1] and V4 [3]. Further, it can be revealed as having a dramatic effect on perception, as in Ramachandran et al.'s [14] study on intermittent exotropes. This is a form of squint in which the two eyes are normally aligned, but in which the exotropic eye can deviate (voluntarily or involuntarily) by as much as 60\u00b0. 
The study showed that even if an image is 'burnt' on the retina in this eye as an afterimage, and so is fixed in retinal coordinates, at least one component of the percept moves as the eye moves. This argues that information about eye position dramatically affects visual processing in a manner that is consistent with the model presented here of shifts based on modulation. This is also required by Bridgeman et al.'s [2] theory of perceptual stability across fixations, which essentially builds up an impression of a scene in exactly the form of mapping (1).\n\nIn general, there will be many instances for an object, e.g., many different faces. In this general case, the top level would implement a distributed code for the identity and instantiation parameters of the objects. We are currently investigating methods of implementing this form of representation in the model.\n\nA key feature of the model is the interaction of the synthesis and analysis pathways when traversing the part-whole hierarchies. This interaction between the two pathways can also aid the system when performing image analysis by integrating information across the hierarchy. Just as in RAAM, the extra feature required when traversing a hierarchy is short term memory. For RAAM, the memory stores information about the various separate sub-trees that have already been decoded (or encoded). For our system, the memory is required during generative traversal to force 'whole' activity on lower layers to persist even after the activity on upper layers has ceased, to free these upper units to recognize a 'part'. Memory during recognition traversal is necessary in marginal cases to accumulate information across separate 'parts' as well as the 'whole'. 
This solution to hierarchical representation inevitably gives up the computational simplicity of the naive neuronal hierarchical scheme described in the introduction, which does not require any such accumulation.\n\nKnowledge of images that are too large to fit naturally in a single view [4] at a canonical location and scale, or that theoretically cannot fit in a view (like 360\u00b0 information about a room), can be handled in a straightforward extension of the scheme. All this requires is generalizing further the notion of eye-position. One can explore one's generative model of a room in the same way that one can explore one's generative model of a face.\n\nWe have described our scheme from the perspective of images. This is convenient because of the substantial information available about visual processing. However, images are not the only examples of hierarchical structure: it is also very relevant to words, music and inferential mechanisms. We believe that our mechanisms are also more general; proving this will require the equivalent of the attentional eye-position that lies at the heart of the method.\n\nReferences\n[1] Andersen, R, Essick, GK & Siegel, RM (1985). Encoding of spatial location by posterior parietal neurons. Science, 230, 456-458.\n[2] Bridgeman, B, van der Heijden, AHC & Velichkovsky, BM (1994). A theory of visual stability across saccadic eye movements. Behavioral and Brain Sciences, 17, 247-292.\n[3] Connor, CE, Gallant, JL, Preddie, DC & Van Essen, DC (1996). Responses in area V4 depend on the spatial relationship between stimulus and attention. Journal of Neurophysiology, 75, 1306-1308.\n[4] Feldman, JA (1985). Four frames suffice: A provisional model of vision and space. 
The Behavioral and Brain Sciences, 8, 265-289.\n[5] Hinton, GE (1981). Implementing semantic networks in parallel hardware. In GE Hinton & JA Anderson, editors, Parallel Models of Associative Memory. Hillsdale, NJ: Erlbaum, 161-188.\n[6] Hinton, GE, Dayan, P, Frey, BJ & Neal, RM (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158-1160.\n[7] Koch, C & Ullman, S (1985). Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 4, 219-227.\n[8] Logothetis, NK, Pauls, J & Poggio, T (1995). Shape representation in the inferior temporal cortex of monkeys. Current Biology, 5, 552-563.\n[9] Olshausen, BA, Anderson, CH & Van Essen, DC (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13, 4700-4719.\n[10] Pearl, J (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.\n[11] Pollack, JB (1990). Recursive distributed representations. Artificial Intelligence, 46, 77-105.\n[12] Pouget, A, Fisher, SA & Sejnowski, TJ (1993). Egocentric spatial representation in early vision. Journal of Cognitive Neuroscience, 5, 150-161.\n[13] Pouget, A & Sejnowski, TJ (1995). Spatial representations in the parietal cortex may use basis functions. In G Tesauro, DS Touretzky & TK Leen, editors, Advances in Neural Information Processing Systems 7, 157-164.\n[14] Ramachandran, VS, Cobb, S & Levi, L (1994). The neural locus of binocular rivalry and monocular diplopia in intermittent exotropes. Neuroreport, 5, 1141-1144.\n[15] Salinas, E & Abbott, LF (1996). Transfer of coded information from sensory to motor networks. 
Journal of Neuroscience, 15, 6461-6474.\n[16] Sung, K & Poggio, T (1995). Example based learning for view-based human face detection. AI Memo 1521, CBCL paper 112, Cambridge, MA: MIT.\n[17] Trotter, Y, Celebrini, S, Stricanne, B, Thorpe, S & Imbert, M (1992). Modulation of neural stereoscopic processing in primate area V1 by the viewing distance. Science, 257, 1279-1281.\n", "award": [], "sourceid": 1236, "authors": [{"given_name": "Maximilian", "family_name": "Riesenhuber", "institution": null}, {"given_name": "Peter", "family_name": "Dayan", "institution": null}]}