{"title": "Probabilistic Image Sensor Fusion", "book": "Advances in Neural Information Processing Systems", "page_first": 824, "page_last": 830, "abstract": null, "full_text": "Probabilistic Image Sensor Fusion\n\nRavi K. Sharma1, Todd K. Leen2 and Misha Pavel1\n\n1 Department of Electrical and Computer Engineering\n2 Department of Computer Science and Engineering\nOregon Graduate Institute of Science and Technology\n\nP.O. Box 91000, Portland, OR 97291-1000\n\nEmail: {ravi,pavel}@ece.ogi.edu, tleen@cse.ogi.edu\n\nAbstract\n\nWe present a probabilistic method for fusion of images produced by multiple sensors. The approach is based on an image formation model in which the sensor images are noisy, locally linear functions of an underlying, true scene. A Bayesian framework then provides for maximum likelihood or maximum a posteriori estimates of the true scene from the sensor images. Maximum likelihood estimates of the parameters of the image formation model involve (local) second order image statistics, and thus are related to local principal component analysis. We demonstrate the efficacy of the method on images from visible-band and infrared sensors.\n\n1 Introduction\n\nAdvances in sensing devices have fueled the deployment of multiple sensors in several computational vision systems (see, for example, [1]). Using multiple sensors can increase reliability with respect to single sensor systems. This work was motivated by a need for an aircraft autonomous landing guidance (ALG) system [2, 3] that uses visible-band, infrared (IR) and radar-based imaging sensors to provide guidance to pilots for landing aircraft in low visibility. IR is suitable for night operation, whereas radar can penetrate fog. The application requires fusion algorithms [4] to combine the different sensor images. 
\n\nImages from different sensors have different characteristics arising from the varied physical imaging processes. Local contrast may be polarity reversed between visible-band and IR images [5, 6]. A particular sensor image may contain local features not found in another sensor image, i.e., sensors may report complementary features. Finally, individual sensors are subject to noise. Fig. 1(a) and 1(b) are visible-band and IR images, respectively, of a runway scene showing polarity-reversed (rectangle) and complementary (circle) features. These effects pose difficulties for fusion.\n\nAn obvious approach to fusion is to average the pixel intensities from different sensors. Averaging, Fig. 1(c), increases the signal-to-noise ratio, but reduces the contrast where there are polarity-reversed or complementary features [7].\n\nTransform-based fusion methods [8, 5, 9] select from one sensor or another for fusion. They consist of three steps: (i) decompose the sensor images using a specified transform, e.g., a multiresolution Laplacian pyramid, (ii) fuse at each level of the pyramid by selecting the highest energy transform coefficient, and (iii) invert the transform to synthesize the fused image. Since features are selected rather than averaged, they are rendered at full contrast, but the methods are sensitive to sensor noise; see Fig. 1(d).\n\nTo overcome the limitations of averaging or selection methods, and put sensor fusion on firm theoretical grounds, we explicitly model the production of sensor images from the true scene, including the effects of sensor noise. From the model and the sensor images, one can ask: what is the most probable true scene? This forms the basis for fusing the sensor images. 
Our technique uses the Laplacian pyramid representation [5], with step (ii) above replaced by our probabilistic fusion. A similar probabilistic framework for sensor fusion is discussed in [10].\n\n2 The Image Formation Model\n\nThe true scene, denoted $s$, gives rise to a sensor image through a noisy, non-linear transformation. For ALG, $s$ would be an image of the landing scene under conditions of uniform lighting, unlimited visibility, and perfect sensors. We model the map from the true scene to a sensor image by a noisy, locally affine transformation whose parameters are allowed to vary across the image (actually across the Laplacian pyramid):\n\n$$a_i(\\mathbf{r}, t) = \\beta_i(\\mathbf{r}, t)\\, s(\\mathbf{r}, t) + \\alpha_i(\\mathbf{r}, t) + \\epsilon_i(\\mathbf{r}, t) \\qquad (1)$$\n\nwhere $s$ is the true scene, $a_i$ is the $i$th sensor image, $\\mathbf{r} \\equiv (x, y, k)$ is the hyperpixel location, with $x, y$ the pixel coordinates and $k$ the level of the pyramid, $t$ is the time, $\\alpha$ is the sensor offset, $\\beta$ is the sensor gain (which includes the effects of local polarity reversals and complementarity), and $\\epsilon$ is the (zero-mean) sensor noise. To simplify notation, we adopt the matrix form\n\n$$\\mathbf{a} = \\boldsymbol{\\beta} s + \\boldsymbol{\\alpha} + \\boldsymbol{\\epsilon} \\qquad (2)$$\n\nwhere $\\mathbf{a} = [a_1, a_2, \\ldots, a_q]^T$, $\\boldsymbol{\\beta} = [\\beta_1, \\beta_2, \\ldots, \\beta_q]^T$, $\\boldsymbol{\\alpha} = [\\alpha_1, \\alpha_2, \\ldots, \\alpha_q]^T$, $s$ is a scalar, $\\boldsymbol{\\epsilon} = [\\epsilon_1, \\epsilon_2, \\ldots, \\epsilon_q]^T$, and we have dropped the reference to location and time.\n\nSince the image formation parameters $\\boldsymbol{\\beta}$, $\\boldsymbol{\\alpha}$, and the sensor noise covariance $\\Sigma_\\epsilon$ can vary from hyperpixel to hyperpixel, the model can express local polarity reversals, complementary features, spatial variation of sensor gain, and noise.\n\nWe do assume, however, that the image formation parameters and sensor noise distribution vary slowly with location (see footnote 1). 
Hence, a particular set of parameters is considered to hold true over a spatial region of several square hyperpixels. We will use this assumption implicitly when we estimate these parameters from data.\n\nFootnote 1: Specifically, the parameters vary slowly on the spatio-temporal scales over which the true scene $s$ may exhibit large variations.\n\nThe model (2) fits the framework of the factor analysis model in statistics [11, 12]. Here the hyperpixel values of the true scene $s$ are the latent variables or common factors, $\\boldsymbol{\\beta}$ contains the factor loadings, and the sensor noise values $\\boldsymbol{\\epsilon}$ are the independent factors. Estimation of the true scene is equivalent to estimating the common factors from the observations $\\mathbf{a}$.\n\n3 Bayesian Fusion\n\nGiven the sensor intensities $\\mathbf{a}$, we will estimate the true scene $s$ by appeal to a Bayesian framework. We assume that the probability density function of the latent variables $s$ is a Gaussian with local mean $s_0(\\mathbf{r}, t)$ and local variance $\\sigma_s^2(\\mathbf{r}, t)$. An attractive benefit of this setup is that the prior mean $s_0$ might be obtained from knowledge in the form of maps, or clear-weather images of the scene. Thus, such database information can be folded into the sensor fusion in a natural way.\n\nThe density on the sensor images conditioned on the true scene, $P(\\mathbf{a}|s)$, is normal with mean $\\boldsymbol{\\beta} s + \\boldsymbol{\\alpha}$ and covariance $\\Sigma_\\epsilon = \\mathrm{diag}[\\sigma_{\\epsilon_1}^2, \\sigma_{\\epsilon_2}^2, \\ldots, \\sigma_{\\epsilon_q}^2]$. The marginal density $P(\\mathbf{a})$ is normal with mean $\\boldsymbol{\\mu}_m = \\boldsymbol{\\beta} s_0 + \\boldsymbol{\\alpha}$ and covariance\n\n$$C = \\Sigma_\\epsilon + \\sigma_s^2 \\boldsymbol{\\beta} \\boldsymbol{\\beta}^T \\qquad (3)$$\n\nFinally, writing $M \\equiv \\boldsymbol{\\beta}^T \\Sigma_\\epsilon^{-1} \\boldsymbol{\\beta} + 1/\\sigma_s^2$, the posterior density on $s$ given the sensor data $\\mathbf{a}$, $P(s|\\mathbf{a})$, is also normal 
with mean $M^{-1}\\left(\\boldsymbol{\\beta}^T \\Sigma_\\epsilon^{-1} (\\mathbf{a} - \\boldsymbol{\\alpha}) + s_0/\\sigma_s^2\\right)$ and covariance $M^{-1}$. (4)\n\nGiven these densities, there are two obvious candidates for probabilistic fusion: maximum likelihood (ML), $\\hat{s} = \\arg\\max_s P(\\mathbf{a}|s)$, and maximum a posteriori (MAP), $\\hat{s} = \\arg\\max_s P(s|\\mathbf{a})$.\n\nThe MAP fusion estimate is simply the posterior mean\n\n$$\\hat{s} = \\left[\\boldsymbol{\\beta}^T \\Sigma_\\epsilon^{-1} \\boldsymbol{\\beta} + 1/\\sigma_s^2\\right]^{-1} \\left(\\boldsymbol{\\beta}^T \\Sigma_\\epsilon^{-1} (\\mathbf{a} - \\boldsymbol{\\alpha}) + s_0/\\sigma_s^2\\right) \\qquad (5)$$\n\nTo obtain the ML fusion estimate we take the limit $\\sigma_s^2 \\to \\infty$ in either (4) or (5). For both ML and MAP, the fused image $\\hat{s}$ is a locally linear combination of the sensor images that can, through the spatio-temporal variations in $\\boldsymbol{\\beta}$, $\\boldsymbol{\\alpha}$, and $\\Sigma_\\epsilon$, properly respond to changes in the sensor characteristics that tax averaging or selection schemes. For example, if the second sensor has a polarity reversal relative to the first, then $\\beta_2$ is negative and the two sensor contributions are properly subtracted. If the first sensor has high noise (large $\\sigma_{\\epsilon_1}^2$), its contribution to the fused image is attenuated. Finally, a feature missing from sensor 1 corresponds to $\\beta_1 = 0$. The model compensates by accentuating the contribution from sensor 2.\n\n4 Model Parameter Estimates\n\nWe need to estimate the local image formation model parameters $\\boldsymbol{\\alpha}(\\mathbf{r}, t)$, $\\boldsymbol{\\beta}(\\mathbf{r}, t)$ and the local sensor noise covariance $\\Sigma_\\epsilon(\\mathbf{r}, t)$. We estimate the latter from successive, motion-compensated video frames from each sensor. First we estimate the average value at each hyperpixel, $\\overline{a_i}(t)$, and the average square, $\\overline{a_i^2}(t)$, by exponential moving averages. We next estimate the noise variance by the difference $\\sigma_{\\epsilon_i}^2(t) = \\overline{a_i^2}(t) - \\overline{a_i}^2(t)$. 
\n\nTo estimate $\\boldsymbol{\\beta}$ and $\\boldsymbol{\\alpha}$, we assume that $\\boldsymbol{\\beta}$, $\\boldsymbol{\\alpha}$, $\\Sigma_\\epsilon$, $s_0$ and $\\sigma_s^2$ are nearly constant over small spatial regions ($5 \\times 5$ blocks) surrounding the hyperpixel for which the parameters are desired. Essentially we are invoking a spatial analog of ergodicity, where ensemble averages are replaced by spatial averages, carried out locally over regions in which the statistics are approximately constant.\n\nTo form a maximum likelihood (ML) estimate of $\\boldsymbol{\\alpha}$, we extremize the data log-likelihood $\\mathcal{L} = \\sum_{n=1}^{N} \\log[P(\\mathbf{a}_n)]$ with respect to $\\boldsymbol{\\alpha}$ to obtain\n\n$$\\boldsymbol{\\alpha}_{ML} = \\boldsymbol{\\mu}_a - \\boldsymbol{\\beta} s_0 \\qquad (6)$$\n\nwhere $\\boldsymbol{\\mu}_a$ is the data mean, computed over a $5 \\times 5$ hyperpixel local region ($N = 25$ points).\n\nTo obtain an ML estimate of $\\boldsymbol{\\beta}$, we set the derivatives of $\\mathcal{L}$ with respect to $\\boldsymbol{\\beta}$ equal to zero and recover\n\n$$(C - \\Sigma_a)\\, C^{-1} \\boldsymbol{\\beta} = 0 \\qquad (7)$$\n\nwhere $\\Sigma_a$ is the data covariance matrix, also computed over a $5 \\times 5$ hyperpixel local region. The only non-trivial solution to (7) is\n\n$$\\boldsymbol{\\beta}_{ML} = \\Sigma_\\epsilon^{1/2}\\, \\tilde{U}\\, r\\, \\frac{(\\tilde{\\lambda} - 1)^{1/2}}{\\sigma_s} \\qquad (8)$$\n\nwhere $\\tilde{U}$, $\\tilde{\\lambda}$ are the principal eigenvector and eigenvalue of the weighted data covariance matrix $\\tilde{\\Sigma}_a \\equiv \\Sigma_\\epsilon^{-1/2} \\Sigma_a \\Sigma_\\epsilon^{-1/2}$, and $r = \\pm 1$.\n\nAn alternative to maximum likelihood estimation is the least squares (LS) approach [11]. We obtain the LS estimate $\\boldsymbol{\\alpha}_{LS}$ by minimizing\n\n$$E_\\alpha = \\|\\boldsymbol{\\mu}_a - (\\boldsymbol{\\beta} s_0 + \\boldsymbol{\\alpha})\\|^2 \\qquad (9)$$\n\nwith respect to $\\boldsymbol{\\alpha}$. This gives\n\n$$\\boldsymbol{\\alpha}_{LS} = \\boldsymbol{\\mu}_a - \\boldsymbol{\\beta} s_0 \\qquad (10)$$\n\nThe least squares estimate $\\boldsymbol{\\beta}_{LS}$ is obtained by minimizing\n\n$$E_\\beta = \\|\\Sigma_a - C\\|^2 \\qquad (11)$$\n\nwith respect to $\\boldsymbol{\\beta}$. The solution to this minimization is\n\n$$\\boldsymbol{\\beta}_{LS} = \\frac{\\lambda^{1/2}}{\\sigma_s}\\, U r \\qquad (12)$$\n\nwhere $U$, $\\lambda$ are the principal eigenvector and eigenvalue of the noise-corrected covariance matrix $(\\Sigma_a - \\Sigma_\\epsilon)$, and $r = \\pm 1$. 
(See footnote 2.)\n\nThe estimation procedures cannot provide values of the priors $\\sigma_s^2$ and $s_0$. Were we dealing with a single global model, this would pose no problem. But we must impose a constraint in order to smoothly piece together our local models. We impose that $\\|\\boldsymbol{\\beta}\\| = 1$ everywhere, or by (12) $\\sigma_s^2 = \\lambda$. Recall that $\\lambda$ is the leading eigenvalue of $\\Sigma_a - \\Sigma_\\epsilon$, and thus captures the scale of variations in $\\mathbf{a}$ that arise from variations in $s$. Thus we would expect $\\lambda \\propto \\sigma_s^2$. Our constraint ensures that the proportionality constant is the same for each local model. Next, note that changing $s_0$ causes a shift in $s$. To maintain consistency between local regions, we take $s_0 = 0$ everywhere. These choices for $\\sigma_s^2$ and $s_0$ constrain the parameter estimates to\n\n$$\\boldsymbol{\\beta}_{LS} = r U \\quad \\text{and} \\quad \\boldsymbol{\\alpha}_{LS} = \\boldsymbol{\\mu}_a \\qquad (13)$$\n\nFootnote 2: The least squares and maximum likelihood solutions are identical when the model is exact, $\\Sigma_a \\equiv C$, i.e., the observed data covariance is exactly of the form dictated by the model. Under this condition, $\\tilde{U} = (U^T \\Sigma_\\epsilon^{-1} U)^{-1/2} \\Sigma_\\epsilon^{-1/2} U$ and $(\\tilde{\\lambda} - 1) = \\lambda (U^T \\Sigma_\\epsilon^{-1} U)$. The LS and ML solutions are also identical when the noise covariance is homoscedastic, $\\Sigma_\\epsilon = \\sigma_\\epsilon^2 I$, even if the model is not exact.\n\nIn (5) $\\sigma_s^2$ and $s_0$ are defined at each hyperpixel. However, to estimate $\\boldsymbol{\\beta}$ and $\\boldsymbol{\\alpha}$, we used spatial averages to compute the sample mean and covariance. This is somewhat inconsistent, since the spatial variation of $s_0$ (e.g., when there are edges in the scene) is not explicitly captured in the model mean and covariance. These variations are, instead, attributed to $\\sigma_s^2$, resulting in overestimation of the latter. A more complete model would explicitly model the spatial variations of $s_0$, though we expect this will produce only minor changes in the results.
\n\nFinally, the sign parameter $r$ is not specified. In order to properly piece together our local models, we must choose $r$ at each hyperpixel in such a way that $\\boldsymbol{\\beta}$ changes direction slowly as we move from hyperpixel to hyperpixel and encounter changes in the local image statistics. That is, large direction changes due to arbitrary sign reversals are not allowed. We use a simple heuristic to accomplish this.\n\n5 Relation to PCA\n\nThe MAP and ML fusion rules are closely related to PCA. To see this, assume that the noise is homoscedastic, $\\Sigma_\\epsilon = \\sigma_\\epsilon^2 I$, and use the parameter estimates (13) in the MAP fusion rule (5), reducing the latter to\n\n$$\\hat{s} = \\frac{1}{1 + \\sigma_\\epsilon^2/\\sigma_s^2}\\, U_a^T (\\mathbf{a} - \\boldsymbol{\\mu}_a) + \\frac{1}{1 + \\sigma_s^2/\\sigma_\\epsilon^2}\\, s_0 \\qquad (14)$$\n\nwhere $U_a$ is the principal eigenvector of the data covariance matrix $\\Sigma_a$, oriented according to the sign $r$. The MAP estimate $\\hat{s}$ is simply a scaled and shifted local PCA projection of the sensor data. Both the scaling and shift arise because the prior distribution on $s$ tends to bias $\\hat{s}$ towards $s_0$. When the prior is flat, $\\sigma_s^2 \\to \\infty$ (or equivalently when using the ML fusion estimate), or when the noise variance vanishes, the fused image is given by a simple local PCA projection\n\n$$\\hat{s} = U_a^T (\\mathbf{a} - \\boldsymbol{\\mu}_a) \\qquad (15)$$\n\n6 Experiments and Results\n\nWe applied our fusion method to visible-band and IR runway images, Fig. 1, containing additive Gaussian noise. Fig. 1(e) shows the result of ML fusion with $\\boldsymbol{\\beta}$ and $\\boldsymbol{\\alpha}$ estimated using (13). ML fusion performs better than either averaging or selection in regions that contain local polarity reversals or complementary features. ML fusion gives higher weight to IR in regions where the features in the two images are common, thus reducing the effects of noise in the visible-band image. 
ML fusion gives higher weight to the appropriate sensor in regions with complementary features.\n\nFig. 1(f) shows the result of MAP fusion (5) with the priors $\\sigma_s^2$ and $s_0$ dictated by the consistency requirements discussed in section 4. Clearly, the MAP image is less noisy than the ML image. In regions of low sensor image contrast, $\\sigma_s^2$ is low (since $\\lambda$ is low), thus the contribution from the sensor images is attenuated compared to the ML fusion rule. Hence the noise is attenuated. In regions containing features such as edges, $\\sigma_s^2$ is high (since $\\lambda$ is high); hence the contribution from the sensor images is similar to that in ML fusion.\n\nFigure 1: Fusion of visible-band and IR images containing additive Gaussian noise. Panels: (a) visible-band image, (b) IR image, (c) averaging, (d) selection, (e) ML, (f) MAP.\n\nIn Fig. 2 we demonstrate the use of a database image for fusion. Fig. 2(a) and 2(b) are simulated noisy sensor images from visible-band and IR that depict a runway with an aircraft on it. Fig. 2(c) is an image of the same scene as might be obtained from a terrain database. Although this image is clean, it does not show the actual situation on the runway. One can use the database image pixel intensities as the prior mean $s_0$ in the MAP fusion rule (5). The prior variance $\\sigma_s^2$ in (5) can be regarded as an inverse measure of confidence in the database image; its value controls the relative contribution of the sensors vs. the database image in the fused image. (The parameters $\\boldsymbol{\\beta}$ and $\\boldsymbol{\\alpha}$, and the sensor noise covariance $\\Sigma_\\epsilon$, were estimated exactly as before.) Fig. 2(d), 2(e) and 2(f) show the MAP-fused image as a function of increasing $\\sigma_s^2$. 
Higher values of $\\sigma_s^2$ accentuate the contribution of the sensor images, whereas lower values of $\\sigma_s^2$ accentuate the contribution of the database.\n\n7 Discussion\n\nWe presented a model-based probabilistic framework for fusion of images from multiple sensors and exercised the approach on visible-band and IR images. The approach provides both a rigorous framework for PCA-like fusion rules, and a principled way to combine information from a terrain database with sensor images.\n\nWe envision several refinements of the approach given here. Writing new image formation models at each hyperpixel produces an overabundance of models. Early experiments show that this can be relaxed by using the same model parameters over regions of several square hyperpixels, rather than recalculating for each hyperpixel. A further refinement could be provided by adopting a mixture of linear models to build up the non-linear image formation model. Finally, we have used multiple frames from a video sequence to obtain ML and MAP fused sequences, and one should be able to produce superior parameter estimates by suitable use of the video sequence.\n\nFigure 2: Fusion of simulated visible-band and IR images using database image. Panels: (a) visible-band image, (b) IR image, (c) database image.\n\nAcknowledgments: This work was supported by NASA Ames Research Center grant NCC2-S11. TKL was partially supported by NSF grant ECS-9704094.\n\nReferences\n[1] L. A. Klein. Sensor and Data Fusion Concepts and Applications. SPIE, 1993.\n[2] J. R. Kerr, D. P. Pond, and S. Inman. Infrared-optical multisensor for autonomous landing guidance. Proceedings of SPIE, 2463:38-45, 1995.\n[3] B. Roberts and P. Symosek. Image processing for flight crew situation awareness. 
Proceedings of SPIE, 2220:246-255, 1994.\n[4] M. Pavel and R. K. Sharma. Model-based sensor fusion for aviation. In J. G. Verly, editor, Enhanced and Synthetic Vision 1997, volume 3088, pages 169-176. SPIE, 1997.\n[5] P. J. Burt and R. J. Kolczynski. Enhanced image capture through fusion. In Fourth Int. Conf. on Computer Vision, pages 173-182. IEEE Comp. Soc., 1993.\n[6] H. Li and Y. Zhou. Automatic visual/IR image registration. Optical Engineering, 35(2):391-400, 1996.\n[7] M. Pavel, J. Larimer, and A. Ahumada. Sensor fusion for synthetic vision. In Proceedings of the Society for Information Display, pages 475-478. SPIE, 1992.\n[8] P. Burt. A gradient pyramid basis for pattern-selective image fusion. In Proceedings of the Society for Information Display, pages 467-470. SPIE, 1992.\n[9] A. Toet. Hierarchical image fusion. Machine Vision and Applications, 3:1-11, 1990.\n[10] J. J. Clark and A. L. Yuille. Data Fusion for Sensory Information Processing Systems. Kluwer, Boston, 1990.\n[11] A. Basilevsky. Statistical Factor Analysis and Related Methods. Wiley, 1994.\n[12] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Technical report NCRG/97/010, Neural Computing Research Group, Aston University, UK, 1997.\n", "award": [], "sourceid": 1516, "authors": [{"given_name": "Ravi", "family_name": "Sharma", "institution": null}, {"given_name": "Todd", "family_name": "Leen", "institution": null}, {"given_name": "Misha", "family_name": "Pavel", "institution": null}]}