{"title": "Learning to Make Coherent Predictions in Domains with Discontinuities", "book": "Advances in Neural Information Processing Systems", "page_first": 372, "page_last": 379, "abstract": null, "full_text": "Learning to  Make  Coherent  Predictions in \n\nDomains with Discontinuities \n\nSuzanna Becker and  Geoffrey  E.  Hinton \n\nDepartment of Computer Science,  University of Toronto \n\nToronto, Ontario, Canada M5S  1A4 \n\nAbstract \n\nWe  have  previously  described  an  unsupervised  learning  procedure  that \ndiscovers  spatially coherent  propertit>_<;  of the  world  by  maximizing the  in(cid:173)\nformation  that  parameters  extracted  from  different  parts  of  the  sensory \ninput convey  about some  common  underlying cause.  When given  random \ndot  stereograms  of curved  surfaces,  this  procedure  learns  to  extract  sur(cid:173)\nface  depth  because  that  is  the  property  that  is  coherent  across  space.  It \nalso  learns  how  to  interpolate  the  depth  at  one  location  from  the  depths \nat  nearby  locations  (Becker  and  Hint.oll.  1992).  1n  this  paper,  we  pro(cid:173)\npose  two  new  models  which  handle surfaces  with discontinuities.  The first \nmodel  attempts  to  detect  cases  of discontinuities  and  reject  them.  The \nsecond  model  develops  a  mixture of expert  interpolators.  It  learns  to  de(cid:173)\ntect  the  locations of discontinuities  and  to  invoke specialized,  asymmetric \ninterpolators  that  do  not cross  the  discontinuities. \n\nIntrod uction \n\n1 \nStandard  backpropagation is  implausible as  a  model  of perceptual learning because \nit  requires  an  external  teacher  to  specify  the  desired  output  of the  network.  We \nhave  shown  (Becker  and  Hinton,  1992)  how  the  external  teacher  can  be  replaced \nby  internally  derived  teaching  signals.  These  signals  are  generated  by  using  the \nassumption  that  different  parts  of  the  perceptual  input  have  common  causes  in \nthe  external  world.  Small  modules  that  look  at separate  but  related  parts  of the \nperceptual input discover  these  common causes  by  striving to produce outputs that \nagree with each other (see  Figure 1 a).  The modules may look at different modalities \n(e.g.  vision and touch), or the same modality at different  times (e.g.  the consecutive \n2-D  views  of a  rotating  3-D object),  or  even  spatially  adjacent  parts  of the  same \nimage.  In  previous  work,  we  showed  that  when  our  learning  procedure  is  applied \n\n372 \n\n\fLearning to Make Coherent Predictions in  Domains with Discontinuities \n\n373 \n\nto adjacent patches of 2-dimensional images, it allows a  neural network  that has no \nprior knowledge of the third dimension to discover depth in random dot stereograms \nof curved  surfaces.  A  more  general  version  of the  method  allows  the  network  to \ndiscover  the  best  way  of interpolating  the  depth  at  one  location  from  the  depths \nat  nearby  locations.  We  first  summarize  this  earlier  work,  and  then  introduce \ntwo  new  models  which  allow  coherent  predictions  to  be  made  in  the  presence  of \ndiscontinuities. \n\na) \n\nleft \nrightm~m~ \n\npatch A \n\npatch B \n\nFigure  1:  a)  Two  modules  that  receive  input  from  corresponding  parts  of  stereo \nimages.  
The first module receives input from stereo patch A, consisting of a horizontal strip from the left image (striped) and a corresponding strip from the right image (hatched). The second module receives input from an adjacent stereo patch B. The modules try to make their outputs, d_a and d_b, convey as much information as possible about some underlying signal (i.e., the depth) which is common to both patches. b) The architecture of the interpolating network, consisting of multiple copies of modules like those in a) plus a layer of interpolating units. The network tries to maximize the information that the locally extracted parameter d_c and the contextually predicted parameter d̂_c convey about some common underlying signal. We actually used 10 modules, and the central 6 modules tried to maximize agreement between their outputs and contextually predicted values. We used weight averaging to constrain the interpolating function to be identical for all modules. \n\n2 Learning spatially coherent features in images \n\nThe simplest way to get the outputs of two modules to agree is to use the squared difference between the outputs as a cost function, and to adjust the weights in each module so as to minimize this cost. Unfortunately, this usually causes each module to produce the same constant output that is unaffected by the input to the module and therefore conveys no information about it. What we want is for the outputs of two modules to agree closely (i.e. to have a small expected squared difference) relative to how much they both vary as the input is varied. When this happens, the two modules must be responding to something that is common to their two inputs. In the special case when the outputs, d_a, d_b, of the two modules are scalars, a good measure of agreement is: \n\nI = 0.5 log ( V(d_a + d_b) / V(d_a - d_b) )     (1) \n\nwhere V is the variance over the training cases. If d_a and d_b are both versions of the same underlying Gaussian signal that have been corrupted by independent Gaussian noise, it can be shown that I is the mutual information between the underlying signal and the average of d_a and d_b. By maximizing I we force the two modules to extract as pure a version as possible of the underlying common signal. \n\n2.1 The basic stereo net \n\nWe have shown how this principle can be applied to a multi-layer network that learns to extract depth from random dot stereograms (Becker and Hinton, 1992). Each network module received input from a patch of a left image and a corresponding patch of a right image, as shown in Figure 1a). Adjacent modules received input from adjacent stereo image patches, and learned to extract depth by trying to maximize agreement between their outputs. The real-valued depth (relative to the plane of fixation) of each patch of the surface gives rise to a disparity between features in the left and right images; since that disparity is the only property that is coherent across each stereo image, the output units of the modules were able to learn to accurately detect relative depth. \n
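Equation 1 is easy to state concretely. The following minimal NumPy sketch (illustrative code, not from the original implementation; array names and shapes are assumptions) computes the agreement measure for two modules' outputs collected over a batch of training cases: \n
\n
    import numpy as np \n
\n
    def agreement_information(d_a, d_b, eps=1e-12): \n
        # Eq. 1: I = 0.5 * log(V(d_a + d_b) / V(d_a - d_b)). \n
        # Large when the two outputs vary a lot together but rarely disagree. \n
        v_sum = np.var(d_a + d_b) \n
        v_diff = np.var(d_a - d_b) + eps  # guard against a zero denominator \n
        return 0.5 * np.log(v_sum / v_diff) \n
\n
    # Toy check: outputs that share a common underlying signal score much \n
    # higher than outputs that vary independently. \n
    rng = np.random.default_rng(0) \n
    signal = rng.normal(size=1000) \n
    print(agreement_information(signal + 0.1 * rng.normal(size=1000), \n
                                signal + 0.1 * rng.normal(size=1000)))  # large I \n
    print(agreement_information(rng.normal(size=1000), \n
                                rng.normal(size=1000)))  # I near zero \n
\n
Unlike a plain squared-difference cost, this ratio cannot be improved by collapsing both outputs to constants, because agreement only counts relative to how much the outputs vary. \n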
\n2.2 The interpolating net \n\nThe basic stereo net uses a very simple model of coherence in which an underlying parameter at one location is assumed to be approximately equal to the parameter at a neighbouring location. This model is fine for the depth of fronto-parallel surfaces but it is far from the best model of slanted or curved surfaces. Fortunately, we can use a far more general model of coherence in which the parameter at one location is assumed to be an unknown linear function of the parameters at nearby locations. The particular linear function that is appropriate can be learned by the network. \n\nWe used a network of the type shown in Figure 1b). The depth computed locally by a module, d_c, was compared with the depth predicted by a linear combination d̂_c of the outputs of nearby modules, and the network tried to maximize the agreement between d_c and d̂_c. \n\nThe contextual prediction, d̂_c, was produced by computing a weighted sum of the outputs of two adjacent modules on either side. The interpolating weights used in this sum, and all other weights in the network, were adjusted so as to maximize agreement between locally computed and contextually predicted depths. To speed the learning, we first trained the lower layers of the network as before, so that agreement was maximized between neighbouring locally computed outputs. This made it easier to learn good interpolating weights. When the network was trained on stereograms of cubic surfaces, it learned interpolating weights of -0.147, 0.675, 0.656, -0.131 (Becker and Hinton, 1992). Given noise-free estimates of local depth, the optimal linear interpolator for a cubic surface is -0.167, 0.667, 0.667, -0.167. \n
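The contextual prediction is just a learned linear filter over the neighbouring depth estimates. A minimal sketch of it and of the agreement objective it is trained to maximize (illustrative; the original used an equivalent network formulation with weight averaging across modules): \n
\n
    import numpy as np \n
\n
    def contextual_prediction(depths, w): \n
        # depths: (n_cases, 5) local depth estimates d_a, d_b, d_c, d_d, d_e. \n
        # w: the 4 interpolating weights for the two modules on either side. \n
        context = depths[:, [0, 1, 3, 4]] \n
        return context @ w  # predicted centre depth d_c_hat, one per case \n
\n
    def interpolation_objective(depths, w): \n
        # Agreement (as in eq. 1) between the local estimate d_c and the \n
        # contextual prediction d_c_hat, to be maximized with respect to w. \n
        d_c = depths[:, 2] \n
        d_c_hat = contextual_prediction(depths, w) \n
        return 0.5 * np.log(np.var(d_c + d_c_hat) / np.var(d_c - d_c_hat)) \n
\n
    # The optimal interpolator quoted above for noise-free cubic surfaces: \n
    w_cubic = np.array([-1/6, 2/3, 2/3, -1/6])  # = -0.167, 0.667, 0.667, -0.167 \n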
\n3 Throwing out discontinuities \n\nIf the surface is continuous, the depth at one patch can be accurately predicted from the depths of two patches on either side. If, however, the training data contains cases in which there are depth discontinuities (see figure 2), the interpolator will also try to model these cases and this will contribute considerable noise to the interpolating weights and to the depth estimates. One way of reducing this noise is to treat the discontinuity cases as outliers and to throw them out. Rather than making a hard decision about whether a case is an outlier, we make a soft decision by using a mixture model. For each training case, the network compares the locally extracted depth, d_c, with the depth predicted from the nearby context, d̂_c. It assumes that d_c - d̂_c is drawn from a zero-mean Gaussian if it is a continuity case and from a uniform distribution if it is a discontinuity case. It can then estimate the probability of a continuity case: \n\np_cont = N(d_c - d̂_c) / ( N(d_c - d̂_c) + k_discont ) \n\nwhere N is a zero-mean Gaussian and k_discont is a constant representing a uniform density. (We empirically select a good, fixed value of k_discont, and we choose a starting value of the Gaussian variance V_cont(d_c - d̂_c), some proportion of the initial variance of d_c - d̂_c, and gradually shrink it during learning.) \n\n[Figure 2 here: top, a spline curve with a depth discontinuity; bottom, the corresponding left and right intensity images.] \n\nFigure 2: Top: A curved surface strip with a discontinuity created by fitting 2 cubic splines through randomly chosen control points, 25 pixels apart, separated by a depth discontinuity. Feature points are randomly scattered on each spline with an average of 0.22 features per pixel. Bottom: A stereo pair of \"intensity\" images of the surface strip formed by taking two different projections of the feature points, filtering them through a Gaussian, and sampling the filtered projections at evenly spaced sample points. The sample values in corresponding patches of the two images are used as the inputs to a module. The depth of the surface for a particular image region is directly related to the disparity between corresponding features in the left and right patch. Disparity ranges continuously from -1 to +1 image pixels. Each stereo image was 120 pixels wide and divided into 10 receptive fields 10 pixels wide and separated by 2 pixel gaps, as input for the networks shown in figure 1. The receptive field of an interpolating unit spanned 58 image pixels, and discontinuities were randomly located a minimum of 40 pixels apart, so only rarely would more than one discontinuity lie within an interpolator's receptive field. \n\nWe can now optimize the average information d_c and d̂_c transmit about their common cause. We assume that no information is transmitted in discontinuity cases, so the average information depends on the probability of continuity and on the variance of d_c + d̂_c and d_c - d̂_c measured only in the continuity cases: \n\nI_cont = 0.5 log ( V_cont(d_c + d̂_c) / V_cont(d_c - d̂_c) )     (2) \n\nI = p_cont I_cont     (3) \n\nwhere V_cont denotes a variance measured only over the continuity cases. We tried several variations of this mixture approach. The network is quite good at rejecting the discontinuity cases, but this leads to only a modest improvement in the performance of the interpolator. In cases where there is a depth discontinuity between d_a and d_b or between d_d and d_e the interpolator works moderately well because the weights on d_a or d_e are small. Because of the term p_cont in equation 3 there is pressure to include these cases as continuity cases, so they probably contribute noise to the interpolating weights. In the next section we show how to avoid making a forced choice between rejecting these cases or treating them just like all the other continuity cases. \n
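In code, the soft decision amounts to computing a responsibility for the continuity component and using it to weight the statistics. The sketch below is one reading of equations 2 and 3; in particular, interpreting \"measured only in the continuity cases\" as p_cont-weighted variances is an assumption: \n
\n
    import numpy as np \n
\n
    def p_continuity(residuals, v_cont, k_discont): \n
        # Probability that each case is a continuity case: the residual \n
        # d_c - d_c_hat under a zero-mean Gaussian (continuity) versus a \n
        # uniform density k_discont (discontinuity). \n
        gauss = np.exp(-residuals**2 / (2 * v_cont)) / np.sqrt(2 * np.pi * v_cont) \n
        return gauss / (gauss + k_discont) \n
\n
    def mixture_information(d_c, d_c_hat, v_cont, k_discont): \n
        # Eqs. 2-3: information measured on the (soft) continuity cases, \n
        # scaled by the average probability of continuity. \n
        p = p_continuity(d_c - d_c_hat, v_cont, k_discont) \n
        w = p / p.sum() \n
        def wvar(x):  # p_cont-weighted variance \n
            mu = np.sum(w * x) \n
            return np.sum(w * (x - mu) ** 2) \n
        i_cont = 0.5 * np.log(wvar(d_c + d_c_hat) / wvar(d_c - d_c_hat)) \n
        return np.mean(p) * i_cont \n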
\n4 Learning a mixture of expert interpolators \n\nThe presence of a depth discontinuity somewhere within a strip of five adjacent patches does not entirely eliminate the coherence of depth across these patches. It just restricts the range over which this coherence operates. So instead of throwing out cases that contain a discontinuity, the network could try to develop a number of different, specialized interpolators each of which captures the particular type of coherence that remains in the presence of a discontinuity at a particular location. If, for example, there is a depth discontinuity between d_c and d_d, an extrapolator with weights of -1.0, +2.0, 0, 0 would be an appropriate predictor of d_c. \n\nFigure 3 shows the system of five expert interpolators that we used for predicting d_c from the neighboring depths. To allow the system to invoke the appropriate interpolator, each expert has its own \"controller\" which must learn to detect the presence of a discontinuity at a particular location (or the absence of a discontinuity in the case of the interpolator for pure continuity cases). The outputs of the controllers are normalized, as shown in figure 3, so that they form a probability distribution. We can think of these normalized outputs as the probability with which the system selects a particular expert. The controllers get to see all five local depth estimates and most of them learn to detect particular depth discontinuities by using large weights of opposite sign on the local depth estimates of neighboring patches. \n\n[Figure 3 here: experts 1-5 produce predictions d̂_c,1 ... d̂_c,5 with probabilities p_1 ... p_5; controllers 1-5 produce outputs x_1 ... x_5, normalized as p_i = exp(x_i^2) / Σ_j exp(x_j^2).] \n\nFigure 3: The architecture of the mixture of interpolators and discontinuity detectors. We actually used a larger modular network and equality constraints between modules, as described in figure 1b), with 6 copies of the architecture shown here. Each copy received input from different but overlapping parts of the input. \n\nFigure 4 shows the weights learned by the experts and by their controllers. As expected, there is one interpolator (the top one) that is appropriate for continuity cases and four other interpolators that are appropriate for the four different locations of a discontinuity. In interpreting the weights of the controllers it is important to remember that a controller which produces a small x value for a particular case may nevertheless assign high probability to its expert if all the other controllers produce even smaller x values. \n\n[Figure 4 here: a) weight diagrams for the five interpolators and five discontinuity detectors; b) mean controller output plotted against distance to the nearest discontinuity.] \n\nFigure 4: a) Typical weights learned by the five competing interpolators and corresponding five discontinuity detectors. Positive weights are shown in white, and negative weights in black. b) The mean probabilities computed by each discontinuity detector are plotted against the distance from the center of the units' receptive field to the nearest discontinuity. The probabilistic outputs are averaged over an ensemble of 1000 test cases. If the nearest discontinuity is beyond ±30 pixels, it is outside the units' receptive field and the case is therefore a continuity example. \n
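A forward pass through the architecture of figure 3 is short enough to sketch. The exp(x_i^2) normalization follows the formula given in figure 3; the weight shapes and names are illustrative assumptions: \n
\n
    import numpy as np \n
\n
    def mixture_forward(depths, W_experts, W_ctrl): \n
        # depths:    (n_cases, 5) local depth estimates d_a ... d_e. \n
        # W_experts: (5, 4) one row per expert, applied to the four context \n
        #            depths d_a, d_b, d_d, d_e to predict d_c. \n
        # W_ctrl:    (5, 5) one row per controller; controllers see all five \n
        #            local depth estimates. \n
        context = depths[:, [0, 1, 3, 4]] \n
        preds = context @ W_experts.T  # d_c_hat_i, shape (n_cases, 5) \n
        x = depths @ W_ctrl.T          # controller outputs x_i \n
        sq = x ** 2 \n
        e = np.exp(sq - sq.max(axis=1, keepdims=True))  # stable normalization \n
        p = e / e.sum(axis=1, keepdims=True)            # mixing proportions p_i \n
        return preds, p \n
\n
A controller that learns large weights of opposite sign on two neighbouring depth estimates produces a large |x_i| exactly when its discontinuity is present, so its expert then receives a high mixing proportion. \n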
\n4.1 The learning procedure \n\nIn the example presented here, we first trained the network shown in figure 1b) on images with discontinuities. We then used the outputs of the depth extracting layer, d_a, ..., d_e, as the inputs to the expert interpolators and their controllers. The system learned a set of expert interpolators without backpropagating derivatives all the way down to the weights of the local depth extracting modules. So the local depth estimates d_a, ..., d_e did not change as the interpolators were learned. \n\nTo train the system we used an unsupervised version of the competing experts algorithm described by Jacobs, Jordan, Nowlan and Hinton (1991). The output of the ith expert, d̂_c,i, is treated as the mean of a Gaussian distribution with variance σ², and the normalized output of each controller, p_i, is treated as the mixing proportion of that Gaussian. So, for each training case, the outputs of the experts and their controllers define a probability distribution that is a mixture of Gaussians. The aim of the learning is to maximize the log probability density of the desired output, d_c, under this mixture of Gaussians distribution. For a particular training case this log probability is given by: \n\nlog P(d_c) = log Σ_i p_i (1 / (√(2π) σ)) exp( -(d_c - d̂_c,i)² / (2σ²) )     (4) \n\nBy taking derivatives of this objective function we can simultaneously learn the weights in the experts and in the controllers. For the results shown here, the network was trained for 30 conjugate gradient iterations on a set of 1000 random dot stereograms with discontinuities. \n
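A sketch of the objective in equation 4, usable with the mixture_forward fragment above; sigma is the shared standard deviation of the expert Gaussians: \n
\n
    import numpy as np \n
\n
    def mixture_log_likelihood(d_c, preds, p, sigma): \n
        # Eq. 4: log-density of the locally extracted depth d_c under the \n
        # mixture of Gaussians whose means are the expert predictions and \n
        # whose mixing proportions are the controller outputs. \n
        resid = d_c[:, None] - preds  # (n_cases, n_experts) \n
        comp = p * np.exp(-resid**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma) \n
        return np.sum(np.log(comp.sum(axis=1)))  # summed over training cases \n
\n
Gradient ascent on this quantity trains experts and controllers simultaneously: each expert is pulled towards the cases its controller claims, and each controller learns to claim the cases its expert predicts well. \n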
The rationale for the use of a variance ratio in equation 1 is to prevent the variances of d_a and d_b collapsing to zero. Because the local estimates d_a, ..., d_e did not change as the system learned the expert interpolators, it was possible to use (d_c - d̂_c,i)² in the objective function without worrying about the possibility that the variance of d_c across cases would collapse to zero during the learning. Ideally we would like to refine the weights of the local depth estimators to maximize their agreement with the contextually predicted depths produced by the mixture of expert interpolators. One way to do this would be to generalize equation 3 to handle a mixture of expert interpolators: \n\nI = Σ_i p_i 0.5 log ( V_i(d_c + d̂_c,i) / V_i(d_c - d̂_c,i) )     (5) \n\nwhere V_i denotes a variance measured over the cases assigned to expert i. Alternatively we could modify equation 4 by normalizing the difference (d_c - d̂_c,i)² by the actual variance of d_c, though this makes the derivatives considerably more complicated. \n\n5 Discussion \n\nThe competing controllers in figure 3 explicitly represent which regularity applies in a particular region. The outputs of the controllers for nearby regions may themselves exhibit coherence at a larger spatial scale, so the same learning technique could be applied recursively. In 2-D images this should allow the continuity of depth edges to be discovered. \n\nThe approach presented here should be applicable to other domains which contain a mixture of alternative local regularities across space or time. For example, a rigid shape causes a linear constraint between the locations of its parts in an image, so if there are many possible shapes, there are many alternative local regularities (Zemel and Hinton, 1991). \n\nOur learning procedure differs from methods that try to capture as much information as possible about the input (Linsker, 1988; Atick and Redlich, 1990) because we ignore information in the input that is not coherent across space. \n\nAcknowledgements \n\nThis research was funded by grants from NSERC and the Ontario Information Technology Research Centre. Hinton is the Noranda fellow of the Canadian Institute for Advanced Research. Thanks to John Bridle and Steve Nowlan for helpful discussions. \n\nReferences \n\nAtick, J. J. and Redlich, A. N. (1990). Towards a theory of early visual processing. Technical Report IASSNS-HEP-90/10, Institute for Advanced Study, Princeton. \n\nBecker, S. and Hinton, G. E. (1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355:161-163. \n\nJacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1):79-87. \n\nLinsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, 21(3):105-117. \n\nZemel, R. S. and Hinton, G. E. (1991). Discovering viewpoint-invariant relationships that characterize objects. In Advances in Neural Information Processing Systems 3, pages 299-305. Morgan Kaufmann Publishers. \n", "award": [], "sourceid": 534, "authors": [{"given_name": "Suzanna", "family_name": "Becker", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}