{"title": "Sparse Code Shrinkage: Denoising by Nonlinear Maximum Likelihood Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 473, "page_last": 479, "abstract": null, "full_text": "Sparse Code Shrinkage: Denoising by Nonlinear Maximum Likelihood Estimation\n\nAapo Hyv\u00e4rinen, Patrik Hoyer and Erkki Oja\n\nHelsinki University of Technology\n\nLaboratory of Computer and Information Science\n\nP.O. Box 5400, FIN-02015 HUT, Finland\n\naapo.hyvarinen@hut.fi,patrik.hoyer@hut.fi,erkki.oja@hut.fi\n\nhttp://www.cis.hut.fi/projects/ica/\n\nAbstract\n\nSparse coding is a method for finding a representation of data in which each of the components of the representation is only rarely significantly active. Such a representation is closely related to redundancy reduction and independent component analysis, and has some neurophysiological plausibility. In this paper, we show how sparse coding can be used for denoising. Using maximum likelihood estimation of nongaussian variables corrupted by gaussian noise, we show how to apply a shrinkage nonlinearity on the components of sparse coding so as to reduce noise. Furthermore, we show how to choose the optimal sparse coding basis for denoising. Our method is closely related to the method of wavelet shrinkage, but has the important benefit over wavelet methods that both the features and the shrinkage parameters are estimated directly from the data.\n\n1 Introduction\n\nA fundamental problem in neural network research is to find a suitable representation for the data. One of the simplest methods is to use linear transformations of the observed data. Denote by $x = (x_1, x_2, \\ldots, x_n)^T$ the observed $n$-dimensional random vector that is the input data (e.g., an image window), and by $s = (s_1, s_2, \\ldots, s_n)^T$ the vector of the linearly transformed component variables. Denoting further the $n \\times n$ transformation matrix by $W$, the linear representation is given by\n\n$$s = Wx. \\qquad (1)$$\n\nWe assume here that the number of transformed components equals the number of observed variables, but this need not be the case in general.\n\nAn important representation method is given by (linear) sparse coding [1, 10], in which the representation of the form (1) has the property that only a small number of the components $s_i$ of the representation are significantly non-zero at the same time. Equivalently, this means that a given component has a 'sparse' distribution. A random variable $s_i$ is called sparse when $s_i$ has a distribution with a peak at zero, and heavy tails, as is the case, for example, with the double exponential (or Laplace) distribution [6]; for all practical purposes, sparsity is equivalent to supergaussianity or leptokurtosis [8]. Sparse coding is an adaptive method, meaning that the matrix $W$ is estimated for a given class of data so that the components $s_i$ are as sparse as possible; such an estimation procedure is closely related to independent component analysis [2].\n\nSparse coding of sensory data has been shown to have advantages from both physiological and information processing viewpoints [1]. However, thorough analyses of the utility of such a coding scheme have been few. In this paper, we introduce and analyze a statistical method based on sparse coding. Given a signal corrupted by additive gaussian noise, we attempt to reduce gaussian noise by soft thresholding ('shrinkage') of the sparse components. 
Intuitively, because only a few of the components are significantly active in the sparse code of a given data point, one may assume that the activities of components with small absolute values are purely noise and set them to zero, retaining just a few components with large activities. This method is closely connected to the wavelet shrinkage method [3]. In fact, sparse coding may be viewed as a principled way for determining a wavelet-like basis and the corresponding shrinkage nonlinearities, based on data alone.\n\n2 Maximum likelihood estimation of sparse components\n\nThe starting point of a rigorous derivation of our denoising method is the fact that the distributions of the sparse components are nongaussian. Therefore, we shall begin by developing a general theory that shows how to remove gaussian noise from nongaussian variables, making minimal assumptions on the data.\n\nDenote by $s$ the original nongaussian random variable (corresponding here to a noise-free version of one of the sparse components $s_i$), and by $\\nu$ gaussian noise of zero mean and variance $\\sigma^2$. Assume that we only observe the random variable $y$:\n\n$$y = s + \\nu \\qquad (2)$$\n\nand we want to estimate the original $s$. Denoting by $p$ the probability density of $s$, and by $f = -\\log p$ its negative log-density, the maximum likelihood (ML) method gives the following estimator for $s$:\n\n$$\\hat{s} = \\arg\\min_u \\frac{1}{2\\sigma^2}(y - u)^2 + f(u). \\qquad (3)$$\n\nAssuming $f$ to be strictly convex and differentiable, this can be solved [6] to yield $\\hat{s} = g(y)$, where the function $g$ can be obtained from the relation\n\n$$g^{-1}(u) = u + \\sigma^2 f'(u), \\qquad (4)$$\n\nwhich follows from setting the derivative of the objective in (3) with respect to $u$ to zero. This nonlinear estimator forms the basis of our method.\n\nFigure 1: Shrinkage nonlinearities and associated probability densities. Left: Plots of the different shrinkage functions. Solid line: shrinkage corresponding to Laplace density. Dashed line: typical shrinkage function obtained from (6). Dash-dotted line: typical shrinkage function obtained from (8). For comparison, the line $x = y$ is given by dotted line. All the densities were normalized to unit variance, and noise variance was fixed to 0.3. Right: Plots of corresponding model densities of the sparse components. Solid line: Laplace density. Dashed line: a typical moderately supergaussian density given by (5). Dash-dotted line: a typical strongly supergaussian density given by (7). For comparison, gaussian density is given by dotted line.\n\n3 Parameterizations of sparse densities\n\nTo use the estimator defined by (3) in practice, the densities of the $s_i$ need to be modelled with a parameterization that is rich enough. We have developed two parameterizations that seem to describe very well most of the densities encountered in image denoising. Moreover, the parameters are easy to estimate, and the inversion in (4) can be performed analytically. Both models use two parameters and are thus able to model different degrees of supergaussianity, in addition to different scales, i.e. variances. The densities are here assumed to be symmetric and of zero mean.\n\nThe first model is suitable for supergaussian densities that are not sparser than the Laplace distribution [6], and is given by the family of densities\n\n$$p(s) = C \\exp(-a s^2/2 - b|s|), \\qquad (5)$$\n\nwhere $a, b > 0$ are parameters to be estimated, and $C$ is an irrelevant scaling constant. The classical Laplace density is obtained when $a = 0$, and gaussian densities correspond to $b = 0$. 
A simple method for estimating $a$ and $b$ was given in [6]. For this density, the nonlinearity $g$ takes the form:\n\n$$g(u) = \\frac{1}{1 + \\sigma^2 a}\\,\\mathrm{sign}(u)\\max(0, |u| - b\\sigma^2) \\qquad (6)$$\n\nwhere $\\sigma^2$ is the noise variance. The effect of the shrinkage function in (6) is to reduce the absolute value of its argument by a certain amount, which depends on the parameters, and then rescale. Small arguments are thus set to zero. Examples of the obtained shrinkage functions are given in Fig. 1.\n\nThe second model describes densities that are sparser than the Laplace density:\n\n$$p(s) = \\frac{1}{2d}\\,\\frac{(\\alpha + 2)[\\alpha(\\alpha + 1)/2]^{(\\alpha/2 + 1)}}{[\\sqrt{\\alpha(\\alpha + 1)/2} + |s/d|]^{(\\alpha + 3)}}. \\qquad (7)$$\n\nWhen $\\alpha \\to \\infty$, the Laplace density is obtained as the limit. A simple consistent method for estimating the parameters $d, \\alpha > 0$ in (7) can be obtained from the relations $d = \\sqrt{E\\{s^2\\}}$ and $\\alpha = (2 - k + \\sqrt{k(k + 4)})/(2k - 1)$ with $k = d^2 p_s(0)^2$, see [6]. The resulting shrinkage function can be obtained as [6]\n\n$$g(u) = \\mathrm{sign}(u)\\max\\left(0, \\frac{|u| - ad}{2} + \\frac{1}{2}\\sqrt{(|u| + ad)^2 - 4\\sigma^2(\\alpha + 3)}\\right) \\qquad (8)$$\n\nwhere $a = \\sqrt{\\alpha(\\alpha + 1)/2}$, and $g(u)$ is set to zero in case the square root in (8) is imaginary. This is a shrinkage function that has a certain hard-thresholding flavor, as depicted in Fig. 1.\n\nExamples of the shapes of the densities given by (5) and (7) are given in Fig. 1, together with a Laplace density and a gaussian density. For illustration purposes, the densities in the plot are normalized to unit variance, but these parameterizations allow the variance to be chosen freely.\n\nChoosing whether model (5) or (7) should be used can be based on moments of the distributions; see [6]. 
Methods for estimating the noise variance $\\sigma^2$ are given in [3, 6].\n\n4 Sparse code shrinkage\n\nThe above results imply the following sparse code shrinkage method for denoising. Assume that we observe a noisy version $\\tilde{x} = x + \\nu$ of the data $x$, where $\\nu$ is a gaussian white noise vector. To denoise $\\tilde{x}$, we transform the data to a sparse code, apply the above ML estimation procedure component-wise, and then transform back to the original variables. Here, we constrain the transformation to be orthogonal; this is motivated in Section 5. To summarize:\n\n1. First, using a noise-free training set of $x$, use some sparse coding method for determining the orthogonal matrix $W$ so that the components $s_i$ in $s = Wx$ have as sparse distributions as possible. Estimate a density model $p_i(s_i)$ for each sparse component, using the models in (5) and (7).\n\n2. Compute for each noisy observation $\\tilde{x}(t)$ of $\\tilde{x}$ the corresponding noisy sparse components $y(t) = W\\tilde{x}(t)$. Apply the shrinkage nonlinearity $g_i(\\cdot)$ as defined in (6), or in (8), on each component $y_i(t)$, for every observation index $t$. Denote the obtained components by $\\hat{s}_i(t) = g_i(y_i(t))$.\n\n3. Invert the relation (1) to obtain estimates of the noise-free $x$, given by $\\hat{x}(t) = W^T \\hat{s}(t)$.\n\nTo estimate the sparsifying transform $W$, we assume that we have access to a noise-free realization of the underlying random vector. This assumption is not unrealistic in many applications: for example, in image denoising it simply means that we can observe noise-free images that are somewhat similar to the noisy image to be treated, i.e., they belong to the same environment or context. This assumption can be, however, relaxed in many cases, see [7]. The problem of finding an optimal sparse code in step 1 is treated in the next section.
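\n\nThe role of the orthogonality constraint in steps 2 and 3 can be made explicit with a short calculation. This is only a sketch, assuming (as the phrase 'gaussian white noise' suggests) that the noise vector has covariance $\\sigma^2 I$; the symbols $\\tilde{x}$ for the noisy data, $\\nu$ for the noise vector and $\\eta = W\\nu$ for the transformed noise are notation chosen here for illustration:\n\n$$y = W\\tilde{x} = Wx + W\\nu = s + \\eta, \\qquad E\\{\\eta\\eta^T\\} = W\\,E\\{\\nu\\nu^T\\}\\,W^T = \\sigma^2 W W^T = \\sigma^2 I.$$\n\nUnder that assumption each component $y_i = s_i + \\eta_i$ has the form of the scalar model (2) with the same noise variance $\\sigma^2$, so the component-wise estimator applies unchanged, and $W^{-1} = W^T$ is what allows step 3 to transform back with the transpose.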
\n\nIn fact, it turns out that the shrinkage operation given above is quite similar to the one used in the wavelet shrinkage method derived earlier by Donoho et al. [3] from a very different approach. Their estimator consisted of applying the shrinkage operator in (6), with different values for the parameters, on the coefficients of the wavelet transform. There are two main differences between the two methods. The first is the choice of the transformation. We choose the transformation using the statistical properties of the data at hand, whereas Donoho et al. use a predetermined wavelet transform. The second important difference is that we estimate the shrinkage nonlinearities by the ML principle, again adapting to the data at hand, whereas Donoho et al. use fixed thresholding operators derived by the minimax principle.\n\n5 Choosing the optimal sparse code\n\nDifferent measures of sparseness (or nongaussianity) have been proposed in the literature [1, 4, 8, 10]. In this section, we show which measures are optimal for our method. We shall here restrict ourselves to the class of linear, orthogonal transformations. This restriction is justified by the fact that orthogonal transformations leave the gaussian noise structure intact, which makes the problem more simply tractable. This restriction can be relaxed, however, see [7].\n\nA simple, yet very attractive principle for choosing the basis for sparse coding is to consider the data to be generated by a noisy independent component analysis (ICA) model [10, 6, 9]:\n\n$$x = As + \\nu, \\qquad (9)$$\n\nwhere the $s_i$ are now the independent components, and $\\nu$ is multivariate gaussian noise. 
We could then estimate $A$ using ordinary maximum likelihood estimation of the ICA model. Under the restriction that $A$ is constrained to be orthogonal, estimation of the noise-free components $s_i$ then amounts to the above method of shrinking the values of $A^T x$, see [6]. In this ML sense, the optimal transformation matrix is thus given by $W = A^T$. In particular, using this principle means that ordinary ICA algorithms can be used to estimate the sparse coding basis. This is very fortunate since the computationally efficient methods for ICA estimation enable the basis estimation even in spaces of rather high dimensions [8, 5].\n\nAn alternative principle for determining the optimal sparsifying transformation is to minimize the mean-square error (MSE). In [6], a theorem is given that shows that the optimal basis in minimum MSE sense is obtained by maximizing $\\sum_{i=1}^{n} I_F(w_i^T x)$ where $I_F(s) = E\\{[p'(s)/p(s)]^2\\}$ is the Fisher information of the density of $s$, and the $w_i^T$ are the rows of $W$. Fisher information of a density [4] can be considered as a measure of its nongaussianity. It is well-known [4] that in the set of probability densities of unit variance, Fisher information is minimized by the gaussian density, and the minimum equals 1. Thus the theorem shows that the more nongaussian (sparse) $s$ is, the better we can reduce noise. Note, however, that Fisher information is not scale-invariant.\n\nThe former (ML) method of determining the basis matrix gives usually sparser components than the latter method based on minimizing MSE. In the case of image denoising, however, these two methods give essentially equivalent bases if a perceptually weighted MSE is used [6]. Thus we luckily avoid the classical dilemma of choosing between these two optimality criteria.
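\n\nThe statement about Fisher information can be checked by hand on two unit-variance densities. This is a worked illustration rather than a result taken from [6], and it assumes the standard unit-variance forms of the gaussian and Laplace densities (the Laplace case corresponds to model (5) with $a = 0$ and $b = \\sqrt{2}$):\n\n$$p(s) = \\frac{1}{\\sqrt{2\\pi}} e^{-s^2/2}: \\quad \\frac{p'(s)}{p(s)} = -s, \\quad I_F = E\\{s^2\\} = 1,$$\n\n$$p(s) = \\frac{1}{\\sqrt{2}} e^{-\\sqrt{2}|s|}: \\quad \\frac{p'(s)}{p(s)} = -\\sqrt{2}\\,\\mathrm{sign}(s) \\ (s \\neq 0), \\quad I_F = 2.$$\n\nThe sparser Laplace density thus has twice the Fisher information of the gaussian at the same variance, in line with reading Fisher information as a measure of nongaussianity.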
\n\n6 Experiments\n\nImage data seems to fulfill the assumptions inherent in sparse code shrinkage: It is possible to find linear representations whose components have sparse distributions, using wavelet-like filters [10]. Thus we performed a set of experiments to explore the utility of sparse code shrinkage in image denoising. The experiments are reported in more detail in [7].\n\nData. The data consisted of real-life images, mainly natural scenes. The images were randomly divided into two sets. The first set was used in estimating the matrix $W$ that gives the sparse coding transformation, as well as in estimating the shrinkage nonlinearities. The second set was used as a test set. It was artificially corrupted by gaussian noise, and sparse code shrinkage was used to reduce the noise. The images were used in the method in the form of subwindows of $8 \\times 8$ pixels.\n\nMethods. The sparse coding matrix $W$ was determined by first estimating the ICA model for the image windows (with DC component removed) using the FastICA algorithm [8, 5], and projecting the obtained estimate on the space of orthogonal matrices. The training images were also used to estimate the parametric density models of the sparse components. In the first series of experiments, the local variance was equalized as a preprocessing step [7]. This implied that the density in (5) was a more suitable model for the densities of the sparse components; thus the shrinkage function in (6) was used. In the second series, no such equalization was made, and the density model (7) and the shrinkage function (8) were used [7].\n\nResults. Fig. 
2 shows, on the left, a test image which was artificially corrupted with gaussian noise with standard deviation 0.5 (the standard deviations of the original images were normalized to 1). The result of applying our denoising method (without local variance equalization) on that image is shown on the right. Visual comparison of the images in Fig. 2 shows that our sparse code shrinkage method cancels noise quite effectively. One sees that contours and other sharp details are conserved quite well, while the overall reduction of noise is quite strong, which is in contrast to methods based on low-pass filtering. This result is in line with those obtained by wavelet shrinkage [3]. More experimental results are given in [7].\n\n7 Conclusion\n\nSparse coding and ICA can be applied for image feature extraction, resulting in a wavelet-like basis for image windows [10]. As a practical application of such a basis, we introduced the method of sparse code shrinkage. It is based on the fact that in sparse coding the energy of the signal is concentrated on only a few components, which are different for each observed vector. By shrinking the absolute values of the sparse components towards zero, noise can be reduced. The method is also closely connected to modeling image data with noisy independent component analysis [9]. We showed how to find the optimal sparse coding basis for denoising, and we developed families of probability densities that allow the shrinkage nonlinearities to adapt accurately to the data at hand. Experiments on image data showed that the performance of the method is very appealing. The method reduces noise without blurring edges or other sharp features as much as linear low-pass or median filtering. 
This is made possible by the strongly non-linear nature of the shrinkage operator that takes advantage of the inherent statistical structure of natural images.", "award": [], "sourceid": 1612, "authors": [{"given_name": "Aapo", "family_name": "Hyv\u00e4rinen", "institution": null}, {"given_name": "Patrik", "family_name": "Hoyer", "institution": null}, {"given_name": "Erkki", "family_name": "Oja", "institution": null}]}