{"title": "ARTEX: A Self-organizing Architecture for Classifying Image Regions", "book": "Advances in Neural Information Processing Systems", "page_first": 873, "page_last": 879, "abstract": null, "full_text": "ARTEX: A Self-Organizing Architecture for Classifying Image Regions\n\nStephen Grossberg and James R. Williamson\n{steve, jrw}@cns.bu.edu\nCenter for Adaptive Systems and\nDepartment of Cognitive and Neural Systems\nBoston University\n677 Beacon Street,\nBoston, MA 02215\n\nAbstract\n\nA self-organizing architecture is developed for image region classification. The system consists of a preprocessor that utilizes multiscale filtering, competition, cooperation, and diffusion to compute a vector of image boundary and surface properties, notably texture and brightness properties. This vector inputs to a system that incrementally learns noisy multidimensional mappings and their probabilities. The architecture is applied to difficult real-world image classification problems, including classification of synthetic aperture radar and natural texture images, and outperforms a recent state-of-the-art system at classifying natural textures.\n\n1 INTRODUCTION\n\nAutomatic processing of visual scenes often begins by detecting regions of an image with common values of simple local features, such as texture, and mapping the pattern of feature activation into a predicted region label. We develop a self-organizing neural architecture, called the ARTEX algorithm, for automatically extracting a novel and effective array of such features and mapping them to output region labels. 
ARTEX is made up of biologically motivated networks, the Boundary Contour System and Feature Contour System (BCS/FCS) networks for visual feature extraction (Cohen & Grossberg, 1984; Grossberg & Mingolla, 1985a, 1985b; Grossberg & Todorovic, 1988; Grossberg, Mingolla, & Williamson, 1995), and the Gaussian ARTMAP (GAM) network for classification (Williamson, 1996).\n\nARTEX is first evaluated on a difficult real-world task, classifying regions of synthetic aperture radar (SAR) images, where it reliably achieves high resolution (single pixel) classification results, and creates accurate probability maps for its class predictions. ARTEX is then evaluated on classification of natural textures, where it outperforms the texture classification system in Greenspan, Goodman, Chellappa, & Anderson (1994) using comparable preprocessing and training conditions.\n\n2 FEATURE EXTRACTION NETWORKS\n\nFilled-in surface brightness. Regions of interest in an image can often be segmented based on first-order differences in pixel intensity. An improvement over raw pixel intensities can be obtained by compensating for variable illumination of the image to yield a local brightness feature. A further improvement over local brightness features can be obtained with a surface brightness feature, which is obtained by smoothing local brightness values when they belong to the same region, while maintaining differences when they belong to different regions. Such a procedure tends to maximize the separability of different regions in brightness space by minimizing within-region variance while maximizing between-region variance.\n\nIn Grossberg et al. 
(1995) a multiple-scale BCS/FCS network was used to process noisy SAR images for use by human operators by normalizing and segmenting the SAR intensity distributions and using these transformed data to fill-in surface representations that smooth over noise while maintaining informative structures. The single-scale BCS/FCS used here employs the middle-scale BCS/FCS used in that study. The BCS/FCS equations and parameters are fully described in Grossberg et al. (1995). The BCS/FCS is herein applied to SAR images that are spatially consolidated to half the size (in each dimension) of the images used in that study, and so is comparable to the large-scale BCS/FCS used there.\n\nMultiple-scale oriented contrast. In addition to surface brightness, another image property that is useful for region segmentation is texture. One popular approach for analyzing texture, for which there is a great deal of supporting biological and computational evidence, decomposes an image, at each image location, into a set of energy measures at different oriented spatial frequencies. This may be done by applying a bank of orientation-selective bandpass filters followed by simple nonlinearities and spatial pooling, to extract multiple-scale oriented contrast features. The early stages of the BCS, which define a Static Oriented Contrast (or SOC) filtering network, carry out these operations, and variants of them have been used in many texture segregation algorithms (Bergen, 1991; Greenspan et al., 1994).\n\nHere, the SOC network produces K = 4 oriented contrast features at each of four spatial scales. The first stage of the SOC network is a shunting on-center off-surround network that compensates for variable illumination, normalizes, and computes ratio contrasts in the image. Given an input image, I, the output at pixel (i,j) and scale 
g in the first stage of the SOC network is\n\n$$a^g_{ij} = \\frac{I_{ij} - (G^g * I)_{ij} - DE}{D + I_{ij} + (G^g * I)_{ij}}, \\qquad (1)$$\n\nwhere E = 0.5, and $G^g$ is a Gaussian kernel defined by\n\n$$G^g_{ij}(p, q) = \\frac{1}{2\\pi\\sigma_g^2} \\exp\\left[-\\frac{(i - p)^2 + (j - q)^2}{2\\sigma_g^2}\\right], \\qquad (2)$$\n\nwith $\\sigma_g = 2^g$, for the spatial scales g = 0, 1, 2, 3. The value of D is determined by the range of pixel intensities in the input image. We use D = 2000 for SAR images and D = 255 for natural texture images. The next stage obtains a local measure of orientational contrast by convolving the output of (1) with Gabor filters, $H^g_k$, which are defined at four orientations, and then full-wave rectifying the result:\n\n$$b^g_{ijk} = |(H^g_k * a^g)_{ij}|. \\qquad (3)$$\n\nThe horizontal Gabor filter (k = 0) is defined by:\n\n$$H^g_{ij0}(p, q) = G^g_{ij}(p, q) \\cdot \\sin[0.75\\pi(j - q)/\\sigma_g]. \\qquad (4)$$\n\nOrientational contrast responses may exhibit high spatial variability. A smooth, reliable measure of orientational contrast is obtained by spatially pooling the responses within the same orientation:\n\n$$c^g_{ijk} = (G^g * b^g_k)_{ij}. \\qquad (5)$$\n\nEquation (5) yields an orientationally variant, or OV, representation of oriented contrast. A further optional stage yields an orientationally invariant, or OI, representation by shifting the oriented responses at each scale into a canonical ordering, to yield a common representation for rotated versions of the same texture:\n\n$$d^g_{ijk} = c^g_{ijk'}, \\quad \\text{where } k' = [k + \\arg\\max_{k''} (c^g_{ijk''})] \\bmod K.$$ 
\n\n(6)\n\n3 CLASSIFICATION NETWORK\n\nGAM is a constructive, incremental-learning network which self-organizes internal category nodes that learn a Gaussian mixture model of the M-dimensional input space, as well as mappings to output class labels. Here, mappings are learned from 17-dimensional input vectors (composed of a filled-in brightness feature and 16 oriented contrast features) to a class label representing a shadow, road, grass, or tree region. The jth category's receptive field is parametrized by two M-dimensional vectors: its mean, $\\vec{\\mu}_j$, and standard deviation, $\\vec{\\sigma}_j$. A scalar, $n_j$, also represents the node's cumulative credit. Category j is activated only if its match, $G_j$, satisfies the match criterion, which is determined by a vigilance parameter, $\\rho$. Match is a measure, obtained from the category's unit-height Gaussian distribution, of how close an input, $\\vec{x}$, is to the category's mean, relative to its standard deviation:\n\n$$G_j = \\exp\\left(-\\frac{1}{2} \\sum_{i=1}^{M} \\left(\\frac{x_i - \\mu_{ji}}{\\sigma_{ji}}\\right)^2\\right). \\qquad (7)$$\n\nThe match criterion is a threshold: the category is activated only if $G_j > \\rho$; otherwise, the category is reset. The input strength, $g_j$, is determined by\n\n$$g_j = \\frac{n_j}{\\prod_{i=1}^{M} \\sigma_{ji}} G_j \\ \\text{ if } G_j > \\rho; \\quad g_j = 0 \\ \\text{ otherwise}. \\qquad (8)$$\n\nThe category's activation, $y_j$, which represents P(j|x), is obtained by\n\n$$y_j = \\frac{g_j}{D + \\sum_{l=1}^{N} g_l}, \\qquad (9)$$\n\nwhere N is the number of categories and D is a shunting decay term that maintains sensitivity to the input magnitude in the activation level (D = 0.01 here).\n\nWhen category j is first chosen, it learns a permanent mapping to the output class, k, associated with the current training sample. All categories that map to the same class prediction belong to the same ensemble: $j \\in E(k)$. 
Each time an input is presented, the categories in each ensemble sum their activations to generate a net probability estimate, $z_k$, of the class prediction k that they share:\n\n$$z_k = \\sum_{j \\in E(k)} y_j. \\qquad (10)$$\n\nThe system prediction, K, is determined by the maximum probability estimate,\n\n$$K = \\arg\\max_k (z_k), \\qquad (11)$$\n\nwhich determines the chosen ensemble. Once the class prediction K is chosen, we obtain the category's \"chosen-ensemble\" activation, $y_j^*$, which represents P(j|x, K):\n\n$$y_j^* = \\frac{y_j}{\\sum_{l \\in E(K)} y_l} \\ \\text{ if } j \\in E(K); \\quad y_j^* = 0 \\ \\text{ otherwise}. \\qquad (12)$$\n\nIf K is the correct prediction, then the network resonates and learns; otherwise, match tracking is invoked: $\\rho$ is raised to the average match of the chosen ensemble,\n\n$$\\rho = \\exp\\left(-\\frac{1}{2} \\sum_{j \\in E(K)} y_j^* \\sum_{i=1}^{M} \\left(\\frac{x_i - \\mu_{ji}}{\\sigma_{ji}}\\right)^2\\right). \\qquad (13)$$\n\nIn addition, all categories in the chosen ensemble are reset. Equations (8)-(11) are then re-evaluated. Based on the remaining non-reset categories, a new prediction K in (11), and its corresponding ensemble, are chosen. This automatic search cycle continues until the correct prediction is made, or until all committed categories are reset and an uncommitted category is chosen. Upon presentation of the next training sample, $\\rho$ is reassigned its baseline value: $\\rho = \\bar{\\rho}$. Here, $\\bar{\\rho} \\approx 0$.\n\nWhen category j learns, $n_j$ is updated to represent the amount of training data the node has been assigned credit for:\n\n$$n_j := n_j + y_j^*. \\qquad (14)$$\n\nThe vectors $\\vec{\\mu}_j$ and $\\vec{\\sigma}_j$ are then updated to learn the input statistics:\n\n$$\\mu_{ji} := (1 - y_j^* n_j^{-1}) \\mu_{ji} + y_j^* n_j^{-1} x_i, \\qquad (15)$$\n\n$$\\sigma_{ji} := \\sqrt{(1 - y_j^* n_j^{-1}) \\sigma_{ji}^2 + y_j^* n_j^{-1} (x_i - \\mu_{ji})^2}. \\qquad (16)$$\n\nGAM is initialized with N = 0. 
When a category is first chosen, N is incremented, and the new category, indexed by J = N, is initialized with $n_J = 1$, $\\vec{\\mu}_J = \\vec{x}$, $\\sigma_{Ji} = \\gamma$, and with a permanent mapping to the correct output class. Initializing $\\sigma_{Ji} = \\gamma$ is necessary to make (7) and (8) well-defined. Varying $\\gamma$ has a marked effect on learning: as $\\gamma$ is raised, learning becomes slower, but fewer categories are created. The input vectors are normalized to have the same standard deviation in each dimension so that $\\gamma$ has the same meaning in each dimension.\n\n4 SIMULATION RESULTS\n\nClassifying SAR image regions. Figure 1 illustrates the classification results obtained on one SAR image after training on the other eight images in the data set. The final classification result (bottom, right) closely resembles the hand-labeled regions (middle, left). The caption summarizes the average results obtained on all nine images. ARTEX learns this problem very quickly, using a small number of self-organized categories, as shown in Figure 2 (left). The best classification result of 84.2% correct is obtained by filling-in the probability estimates from equation (10) within the BCS boundaries, using an FCS diffusion equation as described in Grossberg et al. (1995). These filled-in probability estimates predict the actual classification rates with remarkable accuracy (Figure 2, right).\n\n
Figure 1: Results are shown on a 180x180 pixel SAR image, which is one of nine images in the data set. Top row: Center/surround, first stage output (left); BCS boundaries to FCS filling-in (middle); final BCS/FCS filled-in output (right). Note that BCS accurately localizes region boundaries, and that FCS improves appearance by smoothing intensities within regions while maintaining sharp differences between regions. Middle row: Hand-labeled regions corresponding to shadow, road, grass, trees (left); Gaussian classifier results based on center/surround feature (middle, 59.6% correct), and based on filled-in feature (right, 70.7%). Note that filling-in greatly improves classification by reducing brightness variability within regions. However, the lack of textural information results in errors, such as the misclassification of the vertical road as a shadow region. Bottom row: GAM results ($\\gamma = 4$) based on 16 SOC features in addition to the filled-in brightness feature: using the OV representation (left, 81.9%), using the OI representation (middle, 83.2%), and using filled-in OI prediction probabilities (right, 84.2%). With the OV representation (bottom, left), the thin vertical road is misclassified as shadows because there are no thin vertical roads in the training set. With the OI representation, however (bottom, middle), the road is classified correctly because the training set includes thin roads at other orientations. Finally, the classification results are improved by filling-in the prediction probabilities from equation (10) within the BCS boundaries, thereby taking advantage of spatial and structural context (bottom, right).\n\nFigure 2: Left: classification rate is plotted as a function of the number of categories after training on different sized subsets of the SAR training data: (left-to-right) 0.01%, 0.1%, 1%, 10%, and 100% of the training set. Right: classification rate is plotted as a function of filled-in probability estimates.\n\n
Classifying natural textures. ARTEX performance is now compared to that of a texture analysis system described in Greenspan et al. (1994), which we refer to as the \"hybrid system\" because it is a hybrid architecture made up of a log-Gabor pyramid representation, followed by unsupervised k-means clustering in the feature space, followed by batch learning of mappings from clusters to output classes using a rule-based classifier. The hybrid system uses three pyramid levels and four orientations at each level. Each level of the pyramid is produced via three blurring/decimation steps, resulting in an 8x8 pixel resolution. For a fair comparison, sufficient blurring/decimation was added as a postprocessing step to ARTEX features to yield the same net amount of blurring. Both ARTEX and the hybrid system use an OV representation for these problems because the textures are not rotated. The first task is classification of a library of ten separate structured and unstructured textures after training on different example images. ARTEX obtains better performance, achieving 96.3% correct after 40 training epochs (with $\\gamma = 1$, 34 categories) versus 94.3% for the hybrid system. Even after only one training epoch, ARTEX achieves better results (94.9%, 23 categories). The second task (Figure 3) is classification of a five-texture mosaic, which requires discriminating texture boundaries, after training on examples of the five textures, plus an additional texture (sand). 
ARTEX achieves 93.6% correct after 40 training epochs (33 categories), and produces results which appear to be better than those produced by the hybrid system on a similar problem (see Greenspan et al., 1994, Figure 5).\n\nFigure 3: Left: mosaic of five natural textures. Right: ARTEX classification (93.6% correct) after training on examples of five textures and an additional texture (sand).\n\nIn summary, the ARTEX system demonstrates the utility of combining BCS texture and FCS brightness measures for image preprocessing. These features may be effectively classified by the GAM network, whose self-calibrating matching and search operations enable it to carry out fast, incremental, distributed learning of recognition categories and their probabilities. BCS boundaries may be further used to constrain the diffusion of these probabilities according to FCS rules to improve prediction probability.\n\nAcknowledgements\n\nStephen Grossberg was supported by the Office of Naval Research (ONR N00014-95-1-0409 and ONR N00014-95-1-0657). James Williamson was supported by the Advanced Research Projects Agency (ONR N00014-92-J-4015), the Air Force Office of Scientific Research (AFOSR F49620-92-J-0225 and AFOSR F49620-92-J-0334), the National Science Foundation (NSF IRI-90-00530 and NSF IRI-90-24877), and the Office of Naval Research (ONR N00014-91-J-4100 and ONR N00014-95-1-0409).\n\nReferences\n\nBergen, J.R. (1991). Theories of visual texture perception. In D. M. Regan (Ed.), Spatial Vision (pp. 114-134). New York: Macmillan.\n\nCohen, M. & Grossberg, S. (1984). Neural dynamics of brightness perception: Features, boundaries, diffusion, and resonance. Perception & Psychophysics, 36, 428-456. 
\n\nGreenspan, H., Goodman, R., Chellappa, R., & Anderson, C.H. (1994). Learning texture discrimination rules in a multiresolution system. IEEE Trans. PAMI, 16, 894-901.\n\nGrossberg, S. & Mingolla, E. (1985a). Neural dynamics of form perception: Boundary completion, illusory figures, and neon color spreading. Psychological Review, 92, 173-211.\n\nGrossberg, S. & Mingolla, E. (1985b). Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentations. Perception & Psychophysics, 38, 141-171.\n\nGrossberg, S., Mingolla, E., & Williamson, J. (1995). Synthetic aperture radar processing by a multiple scale neural system for boundary and surface representation. Neural Networks, 8, 1005-1028.\n\nGrossberg, S. & Todorovic, D. (1988). Neural dynamics of 1-D and 2-D brightness perception: A unified model of classical and recent phenomena. Perception & Psychophysics, 43, 241-277.\n\nGrossberg, S. & Williamson, J.R. (1996). A self-organizing system for classifying complex images: Natural textures and synthetic aperture radar. Technical Report CAS/CNS TR-96-002, Boston, MA: Boston University.\n\nWilliamson, J.R. (1996). Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps. Neural Networks, 9, 881-897.\n", "award": [], "sourceid": 1329, "authors": [{"given_name": "Stephen", "family_name": "Grossberg", "institution": null}, {"given_name": "James", "family_name": "Williamson", "institution": null}]}