{"title": "A Phase Space Approach to Minimax Entropy Learning and the Minutemax Approximations", "book": "Advances in Neural Information Processing Systems", "page_first": 761, "page_last": 767, "abstract": null, "full_text": "A Phase Space Approach to Minimax \nEntropy Learning and the Minutemax \n\nApproximations \n\nJames M. Coughlan \nSmith-Kettlewell Inst. \n\nSan Francisco, CA 94115 \n\nA.L.Yuille \n\nSmith-Kettlewell Inst. \n\nSan Francisco, CA 94115 \n\nAbstract \n\nThere has been much recent work on measuring image statistics \nand on learning probability distributions on images. We observe \nthat the mapping from images to statistics is many-to-one and \nshow it can be quantified by a phase space factor. This phase \nspace approach throws light on the Minimax Entropy technique for \nlearning Gibbs distributions on images with potentials derived from \nimage statistics and elucidates the ambiguities that are inherent to \ndetermining the potentials. In addition, it shows that if the phase \nfactor can be approximated by an analytic distribution then this \napproximation yields a swift \"Minutemax\" algorithm that vastly \nreduces the computation time for Minimax entropy learning. An \nillustration of this concept, using a Gaussian to approximate the \nphase factor, gives a good approximation to the results of Zhu \nand Mumford (1997) in just seconds of CPU time. The phase \nspace approach also gives insight into the multi-scale potentials \nfound by Zhu and Mumford (1997) and suggests that the forms of \nthe potentials are influenced greatly by phase space considerations. \nFinally, we prove that probability distributions learned in feature \nspace alone are equivalent to Minimax Entropy learning with a \nmultinomial approximation of the phase factor. \n\n1 \n\nIntroduction \n\nBayesian probability theory gives a powerful framework for visual perception (Knill \nand Richards 1996). This approach, however, requires specifying prior probabilities \nand likelihood functions. Learning these probabilities is difficult because it requires \nestimating distributions on random variables of very high dimensions (for example, \nimages with 200 x 200 pixels, or shape curves of length 400 pixels). An important \n\n\f762 \n\nJ M. Coughlan and A. L. Yuille \n\nrecent advance is the Minimax Entropy Learning theory. This theory was developed \nby Zhu, Wu and Mumford (1997 and 1998) and enables them to learn probability \ndistributions for the intensity properties and shapes of natural stimuli and clutter. \nIn addition, when applied to real world images it has an interesting link to the work \non natural image statistics (Field 1987), (Ruderman and Bialek 1994), (Olshaussen \nand Field 1996). We wish to simplify Minimax and make the learning easier, faster \nand more transparent. \n\nIn this paper we present a phase space approach to Minimax Entropy learning. This \napproach is based on the observation that the mapping from images to statistics \nis many-to-one and can be quantified by a phase space factnr. If this phase space \nfactor can be approximated by an analytic function then we obtain approximate \n\"Minutemax\" algorithms which greatly speed up the learning process. In one version \nof this approximation, the unknown parameters of the distribution to be learned \nare related linearly to the empirical statistics of the image data set, and may be \nsolved for in seconds or less. 
Independent of this approximation, the Minutemax framework also illuminates an important combinatoric aspect of Minimax Entropy learning, namely the fact that many different images can give rise to the same image statistics. This "phase space" factor explains the ambiguities inherent in learning the parameters of the unknown distribution, and motivates the approximation that reduces the problem to linear algebra. Finally, we prove that probability distributions learned in feature space alone are equivalent to Minimax Entropy learning with a multinomial approximation of the phase factor.

2 A Phase Space Perspective on Minimax

We wish to learn a distribution P(I) on images, where I denotes the set of pixel values I(x, y) on a finite image lattice, and each value I(x, y) is quantized to a finite set of intensity values. (In fact, this approach is general and applies to any patterns, not just images.) We define a set of image statistics φ_1(I), φ_2(I), ..., φ_S(I), which we concatenate into a single vector function φ(I). If these statistics have empirical mean d = ⟨φ(I)⟩ on a dataset of images (we assume a large enough dataset for the law of large numbers to apply; see Zhu and Mumford (1997) for an analysis of the errors inherent in this assumption), then the maximum entropy distribution P_M(I) with these empirical statistics is an exponential (Gibbs) distribution of the form

    P_M(I) = e^{λ·φ(I)} / Z(λ),    (1)

where the potential λ is set so that ⟨φ(I)⟩_M = d.

In summary, the goal of Minimax Entropy learning is to find an appropriate set of image filters for the domain of interest (i.e. maximally informative filters) and to estimate λ given d. Extensive computation is required to determine λ; the phase space approach to Minimax Entropy learning motivates approximations that make λ easy to estimate.

2.1 Image Histogram Statistics

The statistics we consider (following Zhu, Wu and Mumford (1997, 1998)) are defined as histograms of the responses of one or more filters applied across an entire image. Consider a single filter f (linear or non-linear) with response f_x(I) centered at position x in the image. Without loss of generality, we will assume the filter has quantized integer responses from 1 through f_max.

For notational convenience we transform the filter response f_x(I) to a binary representation b_x(I), defined as a column vector with f_max components: b_{x,z}(I) = δ_{z, f_x(I)}, where the index z ranges from 1 through f_max. This vector is composed of all zeros except for the entry corresponding to the filter response, which is set to one. The image statistics vector is then a histogram vector defined as the average of the b_x(I)'s over all N pixels: φ(I) = (1/N) Σ_x b_x(I). The entries of φ(I) then sum to 1. (We can generalize to the case of multiple filters f^(1), f^(2), ..., f^(m), as detailed in Coughlan and Yuille (1999).)
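To make this construction concrete, the following minimal Python sketch (ours, not from the paper) computes φ(I) for a single discretized ∂I/∂x filter; the wrap-around boundary handling and the shift used to map responses into {1, ..., f_max} are illustrative choices, not the authors' implementation.

```python
import numpy as np

def histogram_statistics(image, f_max):
    """Histogram statistics vector phi(I) for a horizontal-gradient
    filter with responses quantized to 1..f_max (illustrative sketch)."""
    # Filter response f_x(I): discretized dI/dx with kernel (1, -1),
    # wrapping at the boundary so every pixel has a response.
    responses = image - np.roll(image, -1, axis=1)
    # Shift the signed integer responses into the range 1..f_max.
    responses = np.clip(responses + f_max // 2 + 1, 1, f_max).astype(int)
    # b_x(I) is the one-hot representation of each response; phi(I) is
    # the average of b_x(I) over all N pixels, so its entries sum to 1.
    phi = np.bincount(responses.ravel(), minlength=f_max + 1)[1:]
    return phi / responses.size

# Example: a random image with Q = 8 intensity levels.
rng = np.random.default_rng(0)
I = rng.integers(0, 8, size=(64, 64))
phi = histogram_statistics(I, f_max=15)
print(phi.sum())  # 1.0
```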
2.2 The Phase Factor

The original Minimax distribution P_M(I) induces a distribution P_M(φ) on the statistics themselves, without reference to a particular image:

    P_M(φ) = g(φ) e^{λ·φ} / Z(λ),    (2)

where g(φ) is a combinatoric phase space factor, with a corresponding normalized combinatoric distribution ḡ(φ), defined by:

    g(φ₀) = Σ_I δ_{φ₀, φ(I)},   and   ḡ(φ) = g(φ) / Q^N,    (3)

where the phase space factor g(φ) counts the number of images I having statistics φ. N is the number of pixels and Q is the number of pixel intensity levels, i.e. Q^N is the total number of possible images I. It should be emphasized that the phase factor depends only on the set of filters chosen and is independent of the true distribution P(I). Thus the phase factor can be computed offline, independent of the image data set.

In this paper we will discuss two useful approximations to g(φ): a Gaussian approximation, which yields the swift approximation for learning, and a multinomial approximation, which establishes a connection between Minimax Entropy learning and standard feature learning.

2.3 The Non-Uniqueness of the Potential λ

Given a set of filters and their empirical mean statistics d, is the potential λ uniquely specified? Clearly, any solution for λ may be shifted by an additive constant (λ_i → λ'_i = λ_i + k for all i), yielding a different normalization constant Z(λ') but preserving P_M(I). In this section we show that other, non-trivial ambiguities in λ which preserve P_M(I) can exist, stemming from the fact that some values of φ are inconsistent with every possible image I and hence never arise (in any possible image dataset). These "intrinsic" ambiguities are inherent to Minimax Entropy learning and are independent of the true distribution P(I). We will also discuss a second type of possible ambiguity which depends on the characteristics of the image dataset used for learning.

We can uncover the intrinsic ambiguities in λ by examining the covariance C of ḡ(φ). (See Coughlan and Yuille (1999) for details on calculating the mean c and covariance C for any set of linear filters, or non-linear filters that are scalar functions of linear filters.) Defining the set Φ of all possible statistics values (i.e. those φ for which g(φ) > 0), any vector u in the null space of C satisfies u·φ = u·c for every φ ∈ Φ, since u·φ has zero variance under ḡ. Shifting λ → λ + u therefore multiplies e^{λ·φ(I)} by the constant factor e^{u·c}, which is absorbed into Z(λ), leaving P_M(I) unchanged: the null-space directions of C are the intrinsic ambiguities in λ.
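Since ḡ(φ) is by definition the distribution of φ(I) when images are drawn uniformly from all Q^N possibilities, its mean c and covariance C can be estimated offline by simple Monte Carlo. The sketch below (continuing the code above; Coughlan and Yuille (1999) instead compute c and C analytically) also exhibits one intrinsic ambiguity directly: the entries of φ always sum to 1, so the all-ones direction lies exactly in the null space of C.

```python
# Monte Carlo estimate of the mean c and covariance C of the normalized
# phase factor g-bar, by drawing images uniformly from all Q^N images.
# Illustrative only; the sample sizes here are arbitrary.
rng = np.random.default_rng(1)
Q, shape, f_max, n_samples = 8, (32, 32), 15, 5000

samples = np.array([
    histogram_statistics(rng.integers(0, Q, size=shape), f_max)
    for _ in range(n_samples)
])
c = samples.mean(axis=0)            # combinatoric mean c
C = np.cov(samples, rowvar=False)   # combinatoric covariance C

# A direction u with Cu = 0 gives u . phi = u . c for all feasible phi:
# an intrinsic ambiguity in lambda. The all-ones vector is one such u,
# because the entries of every histogram phi(I) sum to 1.
eigvals, eigvecs = np.linalg.eigh(C)
u = eigvecs[:, 0]                   # eigenvector of smallest eigenvalue
print(eigvals[0])                   # ~0
print(np.std(samples @ u))          # ~0: u . phi is (nearly) constant
```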
A second type of ambiguity depends on the image dataset. (As shown in Coughlan and Yuille (1999), there is a convex set of distributions, of which the true distribution P(I) is a member, which share the same mean statistics ⟨φ⟩.) This second kind of ambiguity stems from the fact that the mean statistics convey only a fraction of the information that is contained in the true distribution P(I). To resolve this second ambiguity it is necessary to extract more information from the image data set. The simplest way to achieve this is to use a larger (or more informative) set of filters to lower the entropy of P_M(I) (this topic is discussed in more detail in Zhu, Wu and Mumford (1997, 1998) and Coughlan and Yuille (1999)). Alternatively, one can extend Minimax Entropy learning to include second-order statistics, i.e. the covariance of φ in addition to its mean d. This is an important topic for future research.

3 The Minutemax Approximations

We now illustrate the phase space approach by showing that suitable approximations of the phase space factor g(φ) make it easy to estimate the potential λ given the empirical mean d. The resulting fast approximations to Minimax Entropy learning are called "Minutemax" algorithms.

3.1 The Gaussian Approximation of g(φ)

If the phase space factor g(φ) may be approximated as a multi-variate Gaussian (see Coughlan and Yuille (1999) for a justification of this approximation), then the probability distribution P_M(φ) = g(φ) e^{λ·φ} / Z(λ) reduces to another multi-variate Gaussian. (Note that we are making the Gaussian approximation in φ space, the space of all possible image statistics histograms, and not in filter response (feature) space.) As we will see, this result greatly simplifies the problem of estimating the potential λ.

Recall that the mean and covariance of ḡ(φ) are denoted by c and C, respectively. The null space of C has dimension n and is spanned by vectors u^(1), u^(2), ..., u^(n). As discussed in Theorem 1, for all feasible values of φ (i.e. all φ ∈ Φ) the components u^(i)·φ are fixed at u^(i)·c, so the component of λ along the null space of C does not affect P_M(I). Under the Gaussian approximation, multiplying the Gaussian ḡ(φ), with mean c and covariance C, by e^{λ·φ} yields a Gaussian with the same covariance and with mean c + Cλ. Matching this mean to the empirical statistics, ⟨φ⟩_gauss = ⟨φ⟩_M = d, and so we can write a linear equation relating λ and d:

    d = c + Cλ.
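In code, the Gaussian Minutemax step is therefore a single linear solve. Because C is singular (its null space encodes the intrinsic ambiguities), a pseudo-inverse is one reasonable choice: it returns the solution of d = c + Cλ with no component along the null space. The sketch below continues the session above and fakes a small "dataset" of smoothed random images, purely so that d differs from c; the real d would come from natural images.

```python
# Fake a "dataset" whose images are smoother than uniform noise, so the
# empirical mean statistic d is peaked around small gradients.
def smooth_image(img):
    # crude smoothing: average each pixel with its right neighbor (wrap)
    return (img + np.roll(img, -1, axis=1)) // 2

dataset = [smooth_image(rng.integers(0, Q, size=shape)) for _ in range(500)]
d = np.mean([histogram_statistics(J, f_max) for J in dataset], axis=0)

# Gaussian Minutemax: solve d = c + C lambda. The pseudo-inverse picks
# the solution with zero component along the null space of C, i.e. it
# fixes the intrinsic ambiguities in lambda arbitrarily but harmlessly.
lam = np.linalg.pinv(C) @ (d - c)
print(np.max(np.abs(c + C @ lam - d)))  # small if d - c lies in range(C)
```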
It can be shown (Zhu, private communication) that solving this equation is equivalent to one step of Newton-Raphson minimization of an appropriate cost function. This will fail to be a good approximation if the cost function is highly non-quadratic. As explained in Coughlan and Yuille (1999), the Gaussian approximation is also equivalent to a second-order perturbation expansion of the partition function Z(λ); higher-order corrections can be made by computing higher-order moments of g(φ).

3.2 Experimental Results

We tested the Gaussian Minutemax procedure on two sets of filters: a single (fine scale) image gradient filter ∂I/∂x, and a set of multi-scale image gradient filters defined at three scales, similar to those used by Zhu and Mumford (1997). In both sets, the fine scale gradient filter is linear with kernel (1, -1), representing a discretization of ∂/∂x. In the second set, the medium scale filter kernel is (U_2, -U_2)/4 and the coarse scale kernel is (U_4, -U_4)/16, where U_n denotes the n x n matrix of all ones. The responses of the medium and coarse filters were rounded (i.e. quantized) to the nearest integer, thus adding a non-linearity to these filters. Finally, d was measured on a data set of over 100 natural images; the fine scale components of d are shown in the first panel of Figure 1 and were empirically very similar to the medium and coarse scale components.

A λ that solves d = c + Cλ is shown in the third panel of Figure 1 for the first filter (along with c in the second panel) and in the three panels of Figure 2 for the multi-scale filter set. The form of λ is qualitatively similar to that obtained by Zhu and Mumford (1997) (bearing in mind that Zhu disregarded any filter responses with magnitude above Q/2, i.e. his filter response range is half of ours). In addition, the eigenvectors of C with small eigenvalues are large away from the origin, so one should not trust the values of the potentials there (obtained by any algorithm).

Zhu and Mumford (1997) report interactions between filters applied at different scales. This is because the resulting potentials appear different from the potential at the fine scale even though the histograms appear similar at all scales. We argue, however, that some of this "interaction" is due to the different phase factors at different scales. In other words, the potentials would look different at different scales, even if the empirical histograms were identical, because of the differing phase factors.

Figure 2: From left to right: the fine, medium and coarse components of -λ as computed by the Gaussian Minutemax approximation.

Figure 3: Left to right: d, c, and -λ as given by the multinomial approximation for the ∂/∂x filter at fine scale.

3.3 The Multinomial Approximation of g(φ)

Many learning theories simply construct probability distributions on feature space. How do they differ from Minimax Entropy learning, which works on image space? By examining the phase factor we will show that the two approaches are not identical in general. Feature space learning ignores the coupling between the filters which arises from how the statistics are obtained.
More precisely, the probability distribution obtained on feature space, P_F, is equivalent to the Minimax distribution P_M if, and only if, the phase factor is multinomial.

We begin the analysis by considering a single filter. As before, we define the combinatoric mean c = Σ_φ ḡ(φ) φ. The multinomial approximation of g(φ) is equivalent to assuming that the combinatoric frequencies of filter responses are independent from pixel to pixel. Since the combinatoric frequency of filter response j ∈ {1, 2, ..., f_max} is c_j and there are N pixels, the multinomial approximation to the normalized phase factor is

    ḡ_mult(φ) = (N! / ∏_j (Nφ_j)!) ∏_j c_j^{Nφ_j}.
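A closed-form consequence (our derivation under the stated independence assumption, so hedged accordingly): if the phase factor is multinomial then P_M(I) factorizes over pixels, with per-pixel response distribution proportional to c_j e^{λ_j/N} (the factor 1/N arises because φ(I) is the histogram of counts divided by N). Matching this to the empirical frequency d_j gives λ_j = N(log d_j - log c_j) up to an additive constant, which is exactly feature-space learning. Continuing the sketch above:

```python
# Multinomial Minutemax: with the multinomial (independent-pixel) phase
# factor, matching the per-pixel response frequency c_j * exp(lambda_j/N)
# to the empirical d_j gives, up to an additive constant,
#     lambda_j = N * (log d_j - log c_j).
# Assumed derivation, not the paper's code; eps guards empty bins.
N = np.prod(shape)
eps = 1e-12
lam_mult = N * (np.log(d + eps) - np.log(c + eps))
# Compare with the Gaussian Minutemax solution lam, modulo the additive
# constant ambiguity (remove the mean of each potential before comparing).
print(np.round(lam_mult - lam_mult.mean(), 2))
print(np.round(lam - lam.mean(), 2))
```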