{"title": "Learning Sparse Image Codes using a Wavelet Pyramid Architecture", "book": "Advances in Neural Information Processing Systems", "page_first": 887, "page_last": 893, "abstract": null, "full_text": "Learning Sparse Image Codes using a Wavelet Pyramid Architecture \n\nBruno A. Olshausen \nDepartment of Psychology and \nCenter for Neuroscience, UC Davis \n1544 Newton Ct. \nDavis, CA 95616 \nbaolshausen@ucdavis.edu \n\nPhil Sallee \nDepartment of Computer Science \nUC Davis \nDavis, CA 95616 \nsallee@cs.ucdavis.edu \n\nMichael S. Lewicki \nDepartment of Computer Science and \nCenter for the Neural Basis of Cognition \nCarnegie Mellon University \nPittsburgh, PA 15213 \nlewicki@cnbc.cmu.edu \n\nAbstract \n\nWe show how a wavelet basis may be adapted to best represent \nnatural images in terms of sparse coefficients. The wavelet basis, \nwhich may be either complete or overcomplete, is specified by a \nsmall number of spatial functions which are repeated across space \nand combined in a recursive fashion so as to be self-similar across \nscale. These functions are adapted to minimize the estimated code \nlength under a model that assumes images are composed of a linear \nsuperposition of sparse, independent components. When adapted \nto natural images, the wavelet bases take on different orientations \nand they evenly tile the orientation domain, in stark contrast to the \nstandard, non-oriented wavelet bases used in image compression. \nWhen the basis set is allowed to be overcomplete, it also yields \nhigher coding efficiency than standard wavelet bases. \n\n1 Introduction \n\nThe general problem we address here is that of learning efficient codes for representing \nnatural images. Our previous work in this area has focussed on learning basis \nfunctions that represent images in terms of sparse, independent components [1, 2]. 
\nThis is done within the context of a linear generative model for images, in which \nan image $I(x,y)$ is described in terms of a linear superposition of basis functions \n$b_i(x,y)$ with amplitudes $a_i$, plus noise $\\nu(x,y)$: \n\n$$I(x,y) = \\sum_i a_i b_i(x,y) + \\nu(x,y) \\tag{1}$$ \n\nA sparse, factorial prior is imposed upon the coefficients $a_i$, and the basis functions \nare adapted so as to maximize the average log-probability of images under the model \n(which is equivalent to minimizing the model's estimate of the code length of the \nimages). When the model is trained on an ensemble of whitened natural images, \nthe basis functions converge to a set of spatially localized, oriented, and bandpass \nfunctions that tile the joint space of position and spatial-frequency in a manner \nsimilar to a wavelet basis. Similar results have been achieved using other forms of \nindependent components analysis [3, 4]. \n\nOne of the disadvantages of this approach, from an image coding perspective, is \nthat it may only be applied to small sub-images (e.g., 12 x 12 pixels) extracted \nfrom a larger image. Thus, if an image were to be coded using this method, it \nwould need to be blocked and would thus likely introduce blocking artifacts as the \nresult of quantization or sparsification of the coefficients. In addition, the model \nis unable to capture spatial structure in the images that is larger than the image \nblock, and scaling up the algorithm to significantly larger blocks is computationally \nintractable. \n\nThe solution to these problems that we propose here is to assume translation- and \nscale-invariance among the basis functions, as in a wavelet pyramid architecture. 
\nThat is, if a basis function is learned at one position and scale, then it is assumed \nto be repeated at all positions (spaced apart by two positions horizontally and \nvertically) and scales (in octave increments). Thus, the entire set of basis functions \nfor tiling a large image may be learned by adapting only a handful of parameters, \ni.e., the wavelet filters and the scaling function that is used to expand them across \nscale. \n\nWe show here that when a wavelet image model is adapted to natural images to \nyield coefficients that are sparse and as statistically independent as possible, the \nwavelet functions converge to a set of oriented functions, and the scaling function \nconverges to a circularly symmetric lowpass filter appropriate for generating self-similarity \nacross scale. Moreover, the resulting coefficients achieve higher coding \nefficiency (higher SNR for a fixed bit-rate) than traditional wavelet bases which are \ntypically designed \"by hand\" according to certain mathematical desiderata [5]. \n\n2 Wavelet image model \n\nThe wavelet image model is specified by a relatively small number of parameters, \nconsisting of a set of wavelet functions $\\psi_i(x,y)$, $i = 1 \\ldots M$, and scaling function \n$\\phi(x,y)$. An image is generated by upsampling and convolving the coefficients at \na given band $i$ with $\\psi_i$ (or with $\\phi$ at the lowest-resolution level of the pyramid), \nfollowed by successive upsampling and convolution with $\\phi$, depending on their level \nwithin the pyramid. 
The wavelet image model for an $L$-level pyramid is specified \nmathematically as \n\n$$I(x,y) = g(x,y,0) + \\nu(x,y) \\tag{2}$$ \n$$g(x,y,l) = \\begin{cases} a^{L-1}(x,y) & l = L-1 \\\\ I^l(x,y) & l < L-1 \\end{cases} \\tag{3}$$ \n$$I^l(x,y) = [g(x,y,l+1) \\uparrow 2] * \\phi(x,y) + \\sum_{i=1}^{M} [a_i^l(x,y) \\uparrow 2] * \\psi_i(x,y) \\tag{4}$$ \n\nwhere the coefficients $a$ are indexed by their position, $x, y$, band, $i$, and level of \nresolution $l$ within the pyramid ($l = 0$ is the highest resolution level). The symbol \n$\\uparrow 2$ denotes upsampling by two and is defined as \n\n$$f(x,y) \\uparrow 2 \\equiv \\begin{cases} f(x/2, y/2) & x \\text{ even and } y \\text{ even} \\\\ 0 & \\text{otherwise} \\end{cases} \\tag{5}$$ \n\nFigure 1: Wavelet image model. Shown are the coefficients of the first three levels \nof a pyramid ($l = 0, 1, 2$), with each level split into a number of different bands \n($i = 1 \\ldots M$). The highest level ($l = 3$) is not shown and contains only one lowpass \nband. \n\nThe wavelet pyramid model is schematically illustrated in figure 1. Traditional \nwavelet bases typically utilize three bands ($M = 3$), in which case the representation \nis critically sampled (same number of coefficients as image pixels). Here, we shall \nalso examine the cases of $M = 4$ and 6, in which the representation is overcomplete \n(more coefficients than image pixels). \n\nBecause the image model is linear, it may be expressed compactly in vector/matrix \nnotation as \n\n$$I = G a + \\nu \\tag{6}$$ \n\nwhere the vector $a$ is the entire list of coefficient values at all positions, bands, and \nlevels of the pyramid, and the columns of $G$ are the basis functions corresponding to \neach coefficient, which are parameterized by $\\psi$ and $\\phi$. The probability of generating \nan image $I$ given a specific state of the coefficients $a$ and assuming Gaussian i.i.d. 
\nnoise $\\nu$ is then \n\n$$P(I|a,\\theta) = \\frac{1}{Z_{\\lambda_N}} e^{-\\frac{\\lambda_N}{2} |I - G a|^2} \\tag{7}$$ \n\nwhere $\\theta$ denotes the parameters of the model and includes the wavelet pyramid \nfunctions $\\psi_i$ and $\\phi$, as well as the noise variance, $\\sigma_\\nu^2 = 1/\\lambda_N$. \n\nThe prior probability distribution over the coefficients is assumed to be factorial \nand sparse: \n\n$$P(a) = \\prod_i P(a_i) \\tag{8}$$ \n$$P(a_i) = \\frac{1}{Z_S} e^{-S(a_i)} \\tag{9}$$ \n\nwhere $S$ is a non-convex function that shapes $P(a_i)$ to have the requisite \"sparse\" \nform, i.e., peaked at zero with heavy tails, or positive kurtosis. We choose here \n$S(x) = \\beta \\log(1 + (x/\\sigma)^2)$, which corresponds to a Cauchy-like prior over the coefficients \n(an exact Cauchy distribution would be obtained for $\\beta = 1$).$^1$ \n\n$^1$ A better choice for the prior would be a mixture-of-Gaussians distribution, \nwhich better captures the sharp peak at zero characteristic of a sparse representation. \nBut properly maximizing the posterior with such a prior presents formidable challenges [6]. \n\n3 Inferring the coefficients \n\nThe coefficients for a particular image are determined by finding the maximum of \nthe posterior distribution (MAP estimate) \n\n$$\\hat{a} = \\arg\\max_a P(a|I,\\theta) = \\arg\\max_a P(I|a,\\theta) P(a|\\theta) \\tag{10}$$ \n$$\\hat{a} = \\arg\\min_a \\left[ \\frac{\\lambda_N}{2} |I - G a|^2 + \\sum_i S(a_i) \\right] \\tag{11}$$ \n\nA local minimum may be found via gradient descent, yielding the differential equation \n\n$$\\dot{a} \\propto \\lambda_N G^T e - S'(a) \\tag{12}$$ \n$$e = I - G a \\tag{13}$$ \n\nThe computations involving $G^T e$ and $G a$ in equations 12 and 13 may be performed \nquickly and efficiently using fast algorithms for building pyramids and reconstructing \nfrom pyramids [7]. 
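As an illustration only, the gradient-descent inference of equations 11 to 13 can be sketched in Python for a small dense matrix G. This is not the pyramid implementation: G is formed explicitly rather than applied through fast pyramid routines, the function names are hypothetical, the fixed step size (the inverse of an upper bound on the curvature of the objective) is an assumption chosen so that each step lowers the energy, and the data are random toy values. The default parameter values and the 0.1% stopping rule are those reported in Section 5.

```python
import numpy as np


def sparse_penalty(a, beta, sigma):
    # Sum over coefficients of S(a_i) = beta * log(1 + (a_i / sigma)^2).
    return beta * np.sum(np.log1p((a / sigma) ** 2))


def sparse_penalty_grad(a, beta, sigma):
    # Elementwise derivative S'(a_i) = 2 * beta * a_i / (sigma^2 + a_i^2).
    return 2.0 * beta * a / (sigma ** 2 + a ** 2)


def energy(image, G, a, lam, beta, sigma):
    # Objective of equation 11: (lam / 2) * |I - G a|^2 + sum_i S(a_i).
    e = image - G @ a
    return 0.5 * lam * (e @ e) + sparse_penalty(a, beta, sigma)


def infer_coefficients(image, G, lam=400.0, beta=2.5, sigma=0.3,
                       rel_tol=1e-3, max_iter=5000):
    # Assumed step size: 1 / (upper bound on the curvature of the objective),
    # i.e. lam * (largest singular value of G)^2 + max |S''| = 2 * beta / sigma^2.
    curvature = lam * np.linalg.norm(G, 2) ** 2 + 2.0 * beta / sigma ** 2
    step = 1.0 / curvature
    a = np.zeros(G.shape[1])
    prev = energy(image, G, a, lam, beta, sigma)
    for _ in range(max_iter):
        e = image - G @ a  # residual, equation 13
        # Discretized equation 12: move along lam * G^T e - S'(a).
        a = a + step * (lam * (G.T @ e) - sparse_penalty_grad(a, beta, sigma))
        cur = energy(image, G, a, lam, beta, sigma)
        if prev - cur <= rel_tol * prev:  # relative decrease below 0.1%
            break
        prev = cur
    return a, image - G @ a


# Toy problem: 16 'pixels' and 32 unit-norm random basis functions (overcomplete).
rng = np.random.default_rng(0)
G = rng.standard_normal((16, 32))
G /= np.linalg.norm(G, axis=0)
image = rng.standard_normal(16)

a_hat, residual = infer_coefficients(image, G)
e_start = energy(image, G, np.zeros(32), 400.0, 2.5, 0.3)
e_end = energy(image, G, a_hat, 400.0, 2.5, 0.3)
print(e_start, e_end)
```

For full images, the products with G and its transpose would be replaced by pyramid reconstruction and pyramid construction respectively; that substitution is not shown here.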
\n\n4 Learning \n\nOur goal in adapting the wavelet model to natural images is to find the functions \n$\\psi_i$ and $\\phi$ that minimize the description length $\\mathcal{L}$ of images under the model \n\n$$\\mathcal{L} = -\\langle \\log P(I|\\theta) \\rangle \\tag{14}$$ \n$$P(I|\\theta) = \\int P(I|a,\\theta) P(a|\\theta) da \\tag{15}$$ \n\nA learning rule for the basis functions may be derived by gradient descent on $\\mathcal{L}$: \n\n$$\\frac{\\partial \\mathcal{L}}{\\partial \\theta_i} = -\\lambda_N \\left\\langle \\left\\langle e^T \\frac{\\partial G}{\\partial \\theta_i} a \\right\\rangle_{P(a|I,\\theta)} \\right\\rangle \\tag{16}$$ \n\nInstead of sampling from the full posterior distribution, however, we utilize a simpler \napproximation in which a single sample is taken at the posterior maximum, and so \nwe have \n\n$$\\Delta \\theta_i \\propto \\left\\langle \\hat{e}^T \\frac{\\partial G}{\\partial \\theta_i} \\hat{a} \\right\\rangle \\tag{17}$$ \n\nwhere $\\hat{e} = I - G \\hat{a}$. The price we pay for this approximation, though, is that the \nbasis functions will grow without bound, since the greater their norm, $|G_k|$, the \nsmaller each $a_k$ will become, thus decreasing the sparseness penalty in (11). This \ntrivial solution is avoided by adaptively rescaling the basis functions after each \nlearning step so that a target variance on the coefficients is met, as described in an \nearlier paper [1]. \n\nThe update rules for $\\psi_i$ and $\\phi$ are then derived from (17), and may be expressed in \nterms of the following recursive formulas: \n\n$$\\Delta \\psi_i(m,n) = F_\\psi(e(x,y), m, n, 0)$$ \n$$F_\\psi(f, m, n, l) = \\sum_{x,y} f(2x+m, 2y+n) a_i^l(x,y) + F_\\psi([f \\star \\phi] \\downarrow 2, m, n, l+1) \\tag{18}$$ \n\n$$\\Delta \\phi(m,n) = F_\\phi(e(x,y), m, n, 0)$$ \n$$F_\\phi(f, m, n, l) = \\sum_{x,y} f(2x+m, 2y+n) g(x,y,l+1) + F_\\phi([f \\star \\phi] \\downarrow 2, m, n, l+1) \\tag{19}$$ \n\nwhere $\\star$ denotes cross-correlation and $\\downarrow 2$ denotes downsampling by two. These \ncomputations may also be performed efficiently using fast algorithms for building \nand reconstructing from pyramids [7]. \n\n5 Results \n\nThe image model was trained on a set of 10 pre-whitened 512 x 512 natural images \nthat were used in previous studies [1]. 
The basis function parameters $\\psi_i$ and $\\phi$ \nwere represented as 5 x 5 pixel masks, and were initialized to random numbers. For \neach update, an 80 x 80 subimage was randomly extracted from one of the images, \nand the coefficients were computed iteratively via (12, 13) until the decrease in the \nenergy function was less than 0.1%. The resulting residual, $e$, was then used for \nupdating the functions $\\psi_i$ and $\\phi$ according to (18) and (19). The noise parameter \n$\\lambda_N$ was set to 400, corresponding to a noise variance that is 2.5% of the image \nvariance ($\\sigma_I^2 = 0.1$). At this level of noise, the image reconstructions are visually \nindistinguishable from the original. The parameters of the prior used were $\\beta = 2.5$, \n$\\sigma = 0.3$. A stable solution began to emerge after about one hour of training for \n$M = 3$, and after several hours for $M = 6$ (Pentium II, 450 MHz). \n\nShown in figure 2 are the basis functions learned for the cases $M = 3$, 4 and \n6, along with a standard bi-orthogonal 9/7 wavelet (FBI fingerprint standard [8]) \nfor comparison. The difference between the learned wavelets and the standard \nwavelet is striking, in that the learned wavelets tile the orientation domain more \nevenly. They also exhibit self-similarity in orientation, i.e., they appear to be \nrotated versions of one another. Increasing the number of bands $M$ from three to \nfour produces narrower orientation tuning, but increasing overcompleteness beyond \nthat point does not, as shown in the tiling diagram of figure 3. All the learned basis \nfunction spectra lie well within the Nyquist bounding box in the 2D Fourier plane, \nmatching the power spectrum of the images in the training set. 
\nCoding efficiency was evaluated by compressing the sparsified coefficients $\\hat{a}$ using \nthe embedded wavelet zerotree encoder [9] and measuring the signal-to-noise ratio \nfor a fixed bit rate (SNR $= 10 \\log_{10}(\\sigma_I^2/\\mathrm{mse})$). The results, shown in table 1, demonstrate \nthat the overcomplete bases ($M = 4$) achieve higher SNR than either of two \nstandard wavelet bases for the same bit rate. Note, however, that at these levels of \nSNR the reconstructions are visually identical to the original. At higher compression \nratios the learned bases lose their advantage, most likely due to the fact that \nthey are non-orthogonal and hence produce more errors in the reconstruction when \nthe coefficients are quantized. \n\nTable 1: Coding efficiency. \n\nbasis set | SNR (dB) \nM = 3 (learned) | 11.2 \nM = 4 (learned) | 11.9 \nDaubechies 6 | 11.2 \nFBI 9/7 | 11.4 \n\n[Figure 2 panels: M=3 (learned), M=3 (standard), M=4 (learned), M=6 (learned); each panel shows basis functions and spectra.] \n\nFigure 2: Basis functions and corresponding power spectra for $M = 3$, 4 and 6, \nalong with a standard 9/7 biorthogonal wavelet. Each column shows a different \nband, while each row shows a different level. The lone basis function in the last \nrow is the scaling function (twice convolved with itself). The power spectra are \nplotted in the 2D-Fourier plane (vertical vs. horizontal spatial-frequency) with the \nmaximum spatial-frequency at the Nyquist rate. \n\n[Figure 3 panels: M=3 (standard), M=3 (learned), M=4 (learned), M=6 (learned).] \n\nFigure 3: Frequency domain tiling properties. Shown are iso-power contours at 50% \nof the maximum for each band and level. \n\n6 Conclusion \n\nWe have shown in this work how a wavelet basis may be adapted so as to represent \nthe structures in natural images in terms of sparse, independent components. 
\nImportantly, the algorithm has the capacity to learn overcomplete basis sets, which \nare capable of tiling the joint space of position, orientation, and spatial-frequency \nin a more continuous fashion than traditional, critically sampled basis sets [10]. \nThe overcomplete bases exhibit superior coding efficiency, in the sense of achieving \nhigher SNR for a fixed bit rate. Although the improvements in coding efficiency are \nmodest, we believe the method described here has the potential to yield even greater \nimprovements when adapted to more specific image ensembles such as textures. \n\nAcknowledgments \n\nThis work benefited from extensive use of Eero Simoncelli's Matlab pyramid toolbox. \nSupported by NIMH R29-MH057921. \n\nReferences \n\n[1] Olshausen BA, Field DJ (1997) Sparse coding with an overcomplete basis set: A \nstrategy employed by V1? Vision Research, 37: 3311-3325. \n\n[2] Lewicki MS, Olshausen BA (1999) A probabilistic framework for the adaptation and \ncomparison of image codes, J. Opt. Soc. of Am., A, 16(7): 1587-1601. \n\n[3] Bell AJ, Sejnowski TJ (1997) The independent components of natural images are edge \nfilters, Vision Research, 37: 3327-3338. \n\n[4] van Hateren JH, van der Schaaf A (1997) Independent component filters of natural \nimages compared with simple cells in primary visual cortex, Proc. Royal Soc. Lond. \nB, 265: 359-366. \n\n[5] Mallat S (1999) A wavelet tour of signal processing. Academic Press. \n\n[6] Olshausen BA, Millman KJ (2000) Learning sparse codes with a mixture-of-Gaussians \nprior. In: Advances in Neural Information Processing Systems, 12, S.A. Solla, T.K. \nLeen, K.R. Muller, eds. MIT Press, pp. 841-847. \n\n[7] Simoncelli EP, Matlab pyramid toolbox. 
\nftp://ftp.cis.upenn.edu/pub/eero/matlabPyrTools.tar.gz \n\n[8] The Bath Wavelet Warehouse \nhttp://dmsun4.bath.ac.uk/wavelets/warehouse.html \n\n[9] Shapiro JM (1993) Embedded image coding using zerotrees of wavelet coefficients. \nIEEE Transactions on Signal Processing, 41(12): 3445-3462. \n\n[10] Simoncelli EP, Freeman WT, Adelson EH, Heeger DJ (1992) Shiftable multiscale \ntransforms, IEEE Transactions on Information Theory, 38(2): 587-607. \n", "award": [], "sourceid": 1878, "authors": [{"given_name": "Bruno", "family_name": "Olshausen", "institution": null}, {"given_name": "Phil", "family_name": "Sallee", "institution": null}, {"given_name": "Michael", "family_name": "Lewicki", "institution": null}]}