{"title": "Neural Analog Diffusion-Enhancement Layer and Spatio-Temporal Grouping in Early Vision", "book": "Advances in Neural Information Processing Systems", "page_first": 289, "page_last": 296, "abstract": null, "full_text": "NEURAL ANALOG DIFFUSION-ENHANCEMENT LAYER AND SPATIO-TEMPORAL GROUPING IN EARLY VISION \n\nAllen M. Waxman*,\u2020, Michael Seibert*,\u2020, Robert Cunningham\u2020 and Jian Wu* \n\n* Laboratory for Sensory Robotics, Boston University, Boston, MA 02215 \n\u2020 Machine Intelligence Group, MIT Lincoln Laboratory, Lexington, MA 02173 \n\nABSTRACT \n\nA new class of neural network aimed at early visual processing is described; we call it a Neural Analog Diffusion-Enhancement Layer or \"NADEL.\" The network consists of two levels which are coupled through feedforward and shunted feedback connections. The lower level is a two-dimensional diffusion map which accepts visual features as input, and spreads activity over larger scales as a function of time. The upper layer is periodically fed the activity from the diffusion layer and locates local maxima in it (an extreme form of contrast enhancement) using a network of local comparators. These local maxima are fed back to the diffusion layer using an on-center/off-surround shunting anatomy. The maxima are also available as output of the network. The network dynamics serves to cluster features on multiple scales as a function of time, and can be used in a variety of early visual processing tasks such as: extraction of corners and high curvature points along edge contours, line end detection, gap filling in contours, generation of fixation points, perceptual grouping on multiple scales, correspondence and path impletion in long-range apparent motion, and building 2-D shape representations that are invariant to location, orientation, scale, and small deformation on the visual field. 
\n\nINTRODUCTION \n\nComputer vision is often divided into two main stages, \"early vision\" and \"late vision\", which correspond to image processing and knowledge-based recognition/interpretation, respectively. Image processing for early vision involves algorithms for feature enhancement and extraction (e.g., edges and corners), feature grouping (i.e., perceptual organization), and the extraction of physical properties for object surfaces that comprise a scene (e.g., reflectance, depth, surface slopes and curvatures, discontinuities). The computer vision literature is characterized by a plethora of algorithms to achieve many of these computations, though they are hardly robust in performance. \n\nWe acknowledge support from the Machine Intelligence Group of MIT Lincoln Laboratory. The views expressed are those of the authors and do not reflect the official policy or position of the U.S. Government. \n\nBiological neural network processing does, of course, achieve all of these early vision tasks, as evidenced by psychological studies of the preattentive phase of human visual processing. Often, such studies provide motivation for new algorithms in computer vision. In contrast to this algorithmic approach, computational neural network processing tries to glean organizational and functional insights from the biological realizations, in order to emulate their information processing capabilities. This is desirable, mainly because of the adaptive and real-time nature of the neural network architecture. Here, we shall demonstrate that a single neural architecture based on dynamical diffusion-enhancement networks can realize a large variety of early vision tasks that deal mainly with perceptual grouping. 
The ability to group image features on multiple scales as a function of time follows from \"attractive forces\" that emerge from the network dynamics. We have already implemented the NADEL (in 16-bit arithmetic) on a video-rate parallel computer, the PIPE [Kent et al., 1985], as well as on a SUN-3 workstation. \n\nTHE NADEL \n\nThe Neural Analog Diffusion-Enhancement Layer was recently introduced by Seibert & Waxman [1989], and is illustrated in Figure 1; it consists primarily of two levels which are coupled via feedforward and shunted feedback connections. Low-level features extracted from the imagery provide input to the lower level (a 2-D map), which spreads input activity over larger scales as time progresses via diffusion, allowing for passive decay of activity. The diffused activity is periodically sampled and passed upward to a contrast-enhancing level (another 2-D map) which locates local maxima in the terrain of diffuse activity. However, this forward pathway is masked by receptive fields which pass only regions of activity with positive Gaussian curvature and negative mean curvature; that is, these receptive fields play the role of inhibitory dendro-dendritic modulatory gates. This masking facilitates the local maxima detection in the upper level. The local maxima detected by the upper level are fed back to the lower diffusion level using a shunting dynamics with on-center/off-surround anatomy (cf. [Grossberg, 1973] on the importance of shunting for automatic gain control, and the role of center/surround anatomies in competitive networks). The local maxima are also available as outputs of the network, and take on different interpretations as a function of the input. A number of examples of spatio-temporal grouping will be illustrated in the next section. 
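The two-level update just described (diffuse, decay, curvature-mask, detect local maxima, shunt back) can be sketched as a discrete iteration in a few lines of NumPy. This is an illustrative reconstruction under stated assumptions, not the authors' PIPE implementation: the function name nadel_iteration, the parameters sigma, decay, and gain, and the simplified on-center-only feedback are ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def nadel_iteration(activity, sigma=3.0, decay=0.01, gain=1.0):
    """One sketched NADEL iteration (illustrative, not the original code)."""
    # Diffusion layer: spread activity by Gaussian convolution, with passive decay.
    activity = gaussian_filter(activity, sigma) * (1.0 - decay)

    # Curvature mask: pass only convex caps of the activity surface,
    # i.e. positive Gaussian curvature and negative mean curvature.
    axx = np.gradient(np.gradient(activity, axis=0), axis=0)
    ayy = np.gradient(np.gradient(activity, axis=1), axis=1)
    axy = np.gradient(np.gradient(activity, axis=0), axis=1)
    mask = (axx * ayy - axy ** 2 > 0) & (axx + ayy < 0)

    # Enhancement layer: local maxima within the masked convex regions.
    peaks = (activity == maximum_filter(activity, size=3)) & mask

    # Feedback: shunted excitation at the detected maxima (on-center only
    # here; the paper's anatomy also includes an inhibitory off-surround).
    activity = activity + gain * peaks * (1.0 - activity)
    return activity, peaks
```

Iterating this map on a sparse feature image makes nearby activity peaks drift together and merge, which is the grouping behavior exploited in the examples below.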
\n\nThe primary result of diffusion-enhancement network dynamics is to create a long-range attractive force between isolated featural inputs. This force manifests itself by shifting the local maxima of activity toward one another, leading to a featural grouping over multiple scales as a function of time. This is shown in Figure 1, where two featural inputs spread their initial excitations over time. The individual activities superpose, with the tail of one Gaussian crossing the maximum of the other Gaussian at an angle. This biases the superposition of activities, adding more activity to one side of a maximum than the other, causing a shift in the local maxima toward one another. Eventually, the local maxima merge into a single maximum at the centroid of the individual inputs. If we keep track of the local maxima as diffusion progresses (by connecting the output of the enhancement layer to another layer which stores activity in short term memory), then the two initial inputs will become connected by a line. In Figure 1 we also illustrate the grouping of five features in two clusters, a configuration possessing two spatial scales. After a little diffusion the local maxima are located where the initial inputs were. Further diffusion causes each cluster to form a single local maximum at the cluster centroid. Eventually, both clusters merge into a single hump of activity with one maximum at the centroid of the five initial inputs. Thus, multiscale grouping over time emerges. The examples of Figure 1 use only diffusion without any feedback, yet they illustrate the importance of localizing the local maxima through a kind of contrast-enhancement on another layer. The local maxima of activity serve as \"place tokens\" representing grouped features at a particular scale. 
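The shifting and merging of maxima is easy to reproduce even in one dimension. The script below is our own minimal demo (the grid size and iteration count are arbitrary; the sigma = 3 spreading and 1% decay echo the NADEL parameters quoted in this paper): two impulses are repeatedly blurred, and the tracked local maxima drift toward each other and merge at the centroid.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def local_maxima(a):
    """Indices of interior local maxima of a 1-D activity profile."""
    return [i for i in range(1, len(a) - 1) if a[i] > a[i - 1] and a[i] >= a[i + 1]]

# Two unit impulses 20 cells apart on a 1-D "diffusion layer".
activity = np.zeros(101)
activity[40] = activity[60] = 1.0

peak_history = []
for _ in range(30):
    # Spread with sigma = 3 and let 1% of the activity decay per iteration.
    activity = gaussian_filter1d(activity, sigma=3) * 0.99
    peak_history.append(local_maxima(activity))
```

Early entries of peak_history hold two place tokens at cells 40 and 60; as the Gaussian tails superpose they bias each peak inward, and after roughly a dozen iterations a single token remains at cell 50, the centroid, mirroring the two-point grouping of Figure 1.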
The feedback pathway re-activates the diffusion layer, thereby allowing the grouping process to proceed to still larger scales, even across featureless areas of imagery. \n\nThe dynamical evolution of activity in the NADEL can be modeled using a modified diffusion equation [Seibert & Waxman, 1989]. However, in our simulations of the NADEL we don't actually solve this differential equation directly. Instead, each iteration of the NADEL consists of a spreading of activity using Gaussian convolution, allowing for passive decay, then sampling the diffusion layer, masking out areas which are not positive Gaussian curvature and negative mean curvature activity surfaces, detecting one local maximum in each of these convex areas, and feeding this back to the diffusion layer with a shunted on-center/off-surround excitation at the local maxima. In the biological system, diffusion can be accomplished via a recurrent network of cells with off-center/on-surround lateral connectivity, or more directly using electrotonic coupling across gap junctions as in the horizontal cell layer of the retina [Dowling, 1987]. Curvature masking of the activity surface can be accomplished using oriented off-center/on-surround receptive fields that modulate the connections between the two primary layers of the NADEL. \n\nSPATIO-TEMPORAL GROUPING \n\nWe give several examples of grouping phenomena in early vision, utilizing the NADEL. In all cases its parameters correspond to Gaussian spreading with \u03c3 = 3 and passive decay of 1% per iteration, and on-center/off-surround feedback with \u03c3+ = 1/\u221a2 and \u03c3\u2212 = 1. \n\nGrouping of Two Points: The simple case of two instantaneous point stimuli input simultaneously to the NADEL is summarized in Figure 2. 
We plot the time (N network iterations) it takes to merge the two inputs, as a function of their initial separation (S pixels). For S \u2264 6 the points merge in one iteration; for S > 24 activity equilibrates and shifting of local maxima never begins. \n\nGrouping on Multiple Scales: Figure 3 illustrates the hierarchy of groupings generated by a square outline (31 pixels on a side) with gaps (9 pixels wide). Corner and line-end features are first enhanced using complementary center-surround receptive fields (modeled as a rectified response to a difference-of-Gaussians), and located at the local maxima of activity. These features are shown superimposed on the shape in 3a; they serve as input to the NADEL. Figure 3b shows the loci of local maxima determined up to the second stable grouping, superimposed over the shape. Boundary completion fills the gaps in the square. In Figure 3c we show the loci of local maxima on the image plane, after the final grouping has occurred (N = 100 iterations). The trajectory of local maxima through space-time (x,y,t) is shown in Figure 3d after the fourth grouping. It reveals a hierarchical organization similar to the \"scale-space diagrams\" of Witkin [1983]. \n\nIt can be seen from Figure 3d that successive groupings form stable entities in that the place tokens remain stationary for several iterations of the NADEL. It isn't until activity has diffused farther out to the next representative scale that these local maxima start moving once again, and eventually merge. This relates stable perceptual groupings to place tokens (i.e., local maxima of activity) that are not in motion on the diffusion layer. The motion of place tokens can be measured in the same fashion as feature point motion across the visual field. 
Real-time receptive fields for measuring the motion of image edge and point features have recently been developed by Waxman et al. [1988]. \n\nGrouping of Time-Varying Inputs: The simplest example in this case corresponds to the grouping of two lights that are flashed at different locations at different times. When the time interval between flashes (Stimulus Onset Asynchrony, SOA) is set appropriately, one perceives a smooth motion or \"path impletion\" between the stimuli. This percept of \"long-range apparent motion\" is the cornerstone of the Gestalt Psychology movement, and has remained unexplained for one hundred years now [Kolers, 1972]. We have applied the NADEL to a variety of classical problems in apparent motion including the \"split motion\" percept and the multi-point Ternus configuration [Waxman et al., 1989]. \n\nHere we consider only the case of motion between two stimuli, where we interpret the locus of local maxima as the impleted path in apparent motion. However, the direction of perceived motion is not determined by the grouping process itself; only the path. We make the additional assumption that grouping generates a motion percept only if the second stimulus begins to shift immediately upon input to the NADEL. We suggest that the motion percept occurs only after path impletion is complete. That is, while grouping is active, its outputs are suppressed from our perception (a form of \"transient-on-sustained inhibition\" analogous to saccadic suppression). By varying the separation between the two stimuli, and the time (SOA) between their inputs, we can plot regimes for which the NADEL predicts apparent motion. This is shown in Figure 4, which compares favorably with the psychophysical results summarized in Figure 3.2 of [Kolers, 1972]. 
We find regimes in which complete paths are formed (\"smooth motion\"), partial paths are formed (\"jumpy motion\"), and no immediate shifting occurs (\"no motion\"). The maximum allowable SOA between stimuli (upper curves) is determined by the passive decay rate. Increasing this decay from 1% to 3% will decrease the maximum SOA by a factor of five. The minimum allowable SOA (lower curves) increases with increasing separation, since it takes longer for activity from the first stimulus to influence a more distant second stimulus. The linearity of the lower boundary has been interpreted by [Waxman et al., 1989] as suggestive of Korte's \"third law\" [Kolers, 1972], when taken in combination with a logarithmic transformation of the visual field [Schwartz, 1980]. \n\nAttentional Cues and Invariant Representations: Place tokens which emerge as stable groupings over time can also provide attentional cues to a vision system. They would typically drive saccadic eye motions during scene inspection, with the relative activities of these maxima and their order of emergence determining the sequence of rapid eye motions. Such eye motions are known to play a key role in human visual perception [Yarbus, 1967]; they are influenced by both bottom-up perceptual cues as well as top-down expectations. The neuromorphic vision system developed by Seibert & Waxman [1989], shown in Figure 5, utilizes the NADEL to drive \"eye motions\", and thereby achieve translational invariance in 2-D object learning and recognition. This is followed by a log-polar transform (which emulates the geniculo-cortical connections [Schwartz, 1980]) and another NADEL to achieve rotation and scale invariance as well. 
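The role of the log-polar transform is that rotations and scalings about the fixation point become translations, which the subsequent NADEL and feature coding can absorb. The fragment below is a minimal nearest-neighbor sketch of such a transform; the function name, bin counts, and sampling scheme are our own assumptions for illustration, not the mapping of [Schwartz, 1980].

```python
import numpy as np

def log_polar(image, center, n_rho=32, n_theta=64, rho_min=1.0):
    """Resample an image on a (log r, theta) grid about a fixation point.

    Rotation about the center becomes a circular shift along the theta
    axis; uniform scaling becomes a shift along the log-r axis.
    Nearest-neighbor sampling; out-of-frame samples are left at zero.
    """
    h, w = image.shape
    cy, cx = center
    rho_max = np.hypot(max(cy, h - 1 - cy), max(cx, w - 1 - cx))
    log_rhos = np.linspace(np.log(rho_min), np.log(rho_max), n_rho)
    thetas = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    out = np.zeros((n_rho, n_theta))
    for i, r in enumerate(np.exp(log_rhos)):
        for j, t in enumerate(thetas):
            y = int(np.round(cy + r * np.sin(t)))
            x = int(np.round(cx + r * np.cos(t)))
            if 0 <= y < h and 0 <= x < w:
                out[i, j] = image[y, x]
    return out
```

For a 65x65 image fixated at its center, rotating the image by 90 degrees shifts the log-polar map by a quarter of the theta bins, so a shift-invariant matcher downstream sees the same pattern.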
Further coding of the transformed feature points by overlapping receptive fields provides invariance to small deformation. Pattern learning and recognition is then achieved using an Adaptive Resonance Theory (ART-2) network [Carpenter & Grossberg, 1987]. \n\nREFERENCES \n\nG. Carpenter & S. Grossberg (1987). ART-2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics 26, pp. 4919-4930. \n\nJ.E. Dowling (1987). The Retina: An Approachable Part of the Brain. Cambridge, MA: Harvard University Press. \n\nS. Grossberg (1973). Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics 52, pp. 217-257. \n\nE.W. Kent, M.O. Shneier & R. Lumia (1985). PIPE: Pipelined Image Processing Engine. Journal of Parallel and Distributed Computing 2, pp. 50-78. \n\nP.A. Kolers (1972). Aspects of Motion Perception. New York: Pergamon Press. \n\nE.L. Schwartz (1980). Computational anatomy and functional architecture of striate cortex: A spatial mapping approach to perceptual coding. Vision Research 20, pp. 645-669. \n\nM. Seibert & A.M. Waxman (1989). Spreading activation layers, visual saccades and invariant representations for neural pattern recognition systems. Neural Networks 2, pp. 9-27. \n\nA.M. Waxman, J. Wu & F. Bergholm (1988). Convected activation profiles and the measurement of visual motion. Proceedings, 1988 IEEE Conference on Computer Vision and Pattern Recognition. Ann Arbor, MI, pp. 717-723. \n\nA.M. Waxman, J. Wu & M. Seibert (1989). Computing visual motion in the short and the long: From receptive fields to neural networks. Proceedings, IEEE 1989 Workshop on Visual Motion. Irvine, CA. \n\nA.P. Witkin (1983). Scale space filtering. Proceedings of the International Joint Conference on Artificial Intelligence. 
Karlsruhe, pp. 1019-1021. \n\nA.L. Yarbus (1967). Eye Movements and Vision. New York: Plenum Press. \n\nFigure 1 - (left) The NADEL takes featural input and diffuses it over a 2-D map as a function of time. Local maxima of activity are detected by the upper layer and fed back to the diffusion layer using an on-center/off-surround shunting anatomy. (right, top) Features spread their activities, which superpose to generate an attractive force, causing distant features to group. (right, bottom) Grouping progresses in time over multiple scales, with small clusters emerging before extended clusters. Clusters are represented by their local maxima of activity, which serve as place tokens. \n\nFigure 2 - The time (network iterations N) to merge two points input simultaneously to the NADEL, as a function of their initial separation (S pixels). \n\nFigure 3 - Perceptual grouping of a square outline with gaps. \n\nFigure 4 - Apparent motion between two flashed lights: Stimulus Onset Asynchrony SOA (network iterations N) vs. Separation S (pixels). Solid curves indicate boundaries between which introduction of the second light yields immediate shifting of local maxima; dashed curves (above solid curves) indicate when final merge occurs, yielding the impleted path. \n\nFigure 5 - The neuromorphic vision system for invariant learning and recognition of 2-D objects utilizes three NADEL networks. \n", "award": [], "sourceid": 97, "authors": [{"given_name": "Allen", "family_name": "Waxman", "institution": null}, {"given_name": "Michael", "family_name": "Seibert", "institution": null}, {"given_name": "Robert", "family_name": "Cunningham", "institution": null}, {"given_name": "Jian", "family_name": "Wu", "institution": null}]}