{"title": "Boxlets: A Fast Convolution Algorithm for Signal Processing and Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 571, "page_last": 577, "abstract": null, "full_text": "Boxlets: a Fast Convolution Algorithm for Signal Processing and Neural Networks \n\nPatrice Y. Simard*, Léon Bottou, Patrick Haffner and Yann LeCun \n\npatrice@microsoft.com \n\n{leonb,haffner,yann}@research.att.com \n\nAT&T Labs-Research \n\n100 Schultz Drive, Red Bank, NJ 07701-7033 \n\nAbstract \n\nSignal processing and pattern recognition algorithms make extensive use of convolution. In many cases, computational accuracy is not as important as computational speed. In feature extraction, for instance, the features of interest in a signal are usually quite distorted. This form of noise justifies some level of quantization in order to achieve faster feature extraction. Our approach consists of approximating regions of the signal with low degree polynomials, and then differentiating the resulting signals in order to obtain impulse functions (or derivatives of impulse functions). With this representation, convolution becomes extremely simple and can be implemented quite effectively. The true convolution can be recovered by integrating the result of the convolution. This method yields substantial speed up in feature extraction and is applicable to convolutional neural networks. \n\n1 Introduction \n\nIn pattern recognition, convolution is an important tool because of its translation invariance properties. Feature extraction is a typical example: the distance between a small pattern (i.e. a feature) and a larger one is computed at all positions (i.e. translations) inside the larger pattern. 
The resulting \"distance image\" is typically obtained by convolving the feature template with the larger pattern. In the remainder of this paper we will use the terms image and pattern interchangeably (because of the topology implied by translation invariance). \n\nThere are many ways to convolve images efficiently. For instance, a multiplication of images of the same size in the Fourier domain corresponds to a convolution of the two images in the original space. Of course this requires K N log N operations (where N is the number of pixels of the image and K is a constant) just to go in and out of the Fourier domain. These methods are usually not appropriate for feature extraction because the feature to be extracted is small with respect to the image. For instance, if the image and the feature have respectively 32 x 32 and 5 x 5 pixels, the full convolution can be done in 25 x 1024 multiply-adds. In contrast, it would require 2 x K x 1024 x 10 operations just to go in and out of the Fourier domain. \n\n* Now with Microsoft, One Microsoft Way, Redmond, WA 98052 \n\n\f572 \n\nP. Y. Simard, L. Bottou, P. Haffner and Y. Le Cun \n\nFortunately, in most pattern recognition applications, the interesting features are already quite distorted when they appear in real images. Because of this inherent noise, the feature extraction process can usually be approximated (to a certain degree) without affecting the performance. For example, the result of the convolution is often quantized or thresholded to yield the presence and location of distinctive features [1]. 
Because precision is typically not critical at this stage (features are rarely optimal, thresholding is a crude operation), it is often possible to quantize the signals before the convolution with negligible degradation of performance. \n\nThe subtlety lies in choosing a quantization scheme which can speed up the convolution while maintaining the same level of performance. We now introduce the convolution algorithm, from which we will deduce the constraints it imposes on quantization. \n\nThe main algorithm introduced in this paper is based on a fundamental property of convolutions. Assuming that f and g have finite support and that f^n denotes the n-th integral of f (or the n-th derivative if n is negative), we can write the following convolution identity: \n\n(f * g)^n = f^n * g = f * g^n   (1) \n\nwhere * denotes the convolution operator. Note that f or g are not necessarily differentiable. For instance, the impulse function (also called the Dirac delta function), denoted δ, verifies the identity: \n\nδ_a^n * g = g_a^n   (2) \n\nwhere δ_a^n denotes the n-th integral of the delta function, translated by a (δ_a(x) = δ(x - a)). Equations 1 and 2 are not new to signal processing. Heckbert has developed an effective filtering algorithm [2] where the filter g is a simple combination of polynomials of degree n - 1. Convolution between a signal f and the filter g can then be written as \n\nf * g = f^n * g^{-n}   (3) \n\nwhere f^n is the n-th integral of the signal, and the n-th derivative g^{-n} of the filter can be written exclusively with delta functions (resulting from differentiating degree n - 1 polynomials n times). Since convolving with an impulse function is a trivial operation, the computation of Equation 3 can be carried out effectively. 
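Discretely, differentiation is itself a convolution with the kernel [1, -1], so Equation 1 follows from the associativity of convolution and can be checked numerically. The short NumPy sketch below (all variable names are illustrative, not from the paper) also verifies the rearrangement behind Equation 3 with n = 1: convolve with the derivative of the filter, then integrate (cumulative sum) to recover the true convolution.

```python
import numpy as np

# Discrete derivative as a convolution kernel: conv(x, D)[k] = x[k] - x[k-1]
D = np.array([1.0, -1.0])

rng = np.random.default_rng(0)
f = rng.standard_normal(32)   # a signal with finite support
g = rng.standard_normal(5)    # a small filter

# Equation 1, discretely: the derivative moves freely across *
lhs = np.convolve(np.convolve(f, g), D)   # (f * g)'
rhs = np.convolve(f, np.convolve(g, D))   # f * g'
assert np.allclose(lhs, rhs)

# Rearranged as in Equation 3: convolve with the derivative of the
# filter, then integrate once to recover f * g.
direct = np.convolve(f, g)
recovered = np.cumsum(np.convolve(f, np.convolve(g, D)))[: len(direct)]
assert np.allclose(direct, recovered)
print("identities hold")
```

When g is piecewise polynomial, its repeated derivative is sparse, which is what makes the rearranged form cheap to evaluate.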
\nUnfortunately, Heckbert's algorithm is limited to simple polynomial filters and is only interesting when the filter is wide and when the Fourier transform is unavailable (such as in variable length filters). \n\nIn contrast, in feature extraction, we are interested in small and arbitrary filters (the features). Under these conditions, the key to fast convolution is to quantize the images to combinations of low degree polynomials, which are differentiated, convolved and then integrated. The algorithm is summarized by the equation: \n\nf * g ≈ F * G = (F^{-n} * G^{-m})^{m+n}   (4) \n\nwhere F and G are polynomial approximations of f and g, such that F^{-n} and G^{-m} can be written as sums of impulse functions and their derivatives. Since the convolution F^{-n} * G^{-m} only involves applying Equation 2, it can be computed quite effectively. The computation of the convolution is illustrated in Figure 1. Let f and g be two arbitrary 1-dimensional signals (top of the figure). Let's assume that f and g can both be approximated by partitions of polynomials, F and G. On the figure, the polynomials are of degree 0 (they are constant), and are depicted in the second line. The details on how to compute F and G will be explained in the next section. In the next step, F and G are differentiated once, yielding successions of impulse functions (third line in the figure). The impulse representation has the advantage of having a finite support, and of being easy to convolve. Indeed two impulse functions can be convolved using Equation 2 (4 x 3 = 12 multiply-adds on the figure). 
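The whole pipeline, quantize, differentiate, convolve the impulses, and integrate back, can be sketched end to end in a few lines. This is a simplified stand-in for the paper's method: it quantizes with fixed-width segments rather than the adaptive error-bounded regions of the next section, and all function names are our own.

```python
import numpy as np

def piecewise_constant(x, width):
    """Degree-0 quantization: average over fixed-width segments.
    (The paper grows variable-size regions under an error bound;
    fixed segments keep this sketch short.)"""
    q = x.copy()
    for s in range(0, len(x), width):
        q[s:s + width] = x[s:s + width].mean()
    return q

def impulses(q):
    """Differentiate a piecewise-constant signal once, giving a sparse
    list of (position, coefficient) impulse functions."""
    d = np.convolve(q, [1.0, -1.0])   # diff with boundary terms included
    return [(i, c) for i, c in enumerate(d) if c != 0.0]

rng = np.random.default_rng(1)
f = piecewise_constant(rng.standard_normal(64), 8)
g = piecewise_constant(rng.standard_normal(16), 4)

# Convolve the two impulse lists: two impulses at a and b convolve to
# a single impulse at a + b (Equation 2)
out = np.zeros(len(f) + len(g) + 1)
for a, ca in impulses(f):
    for b, cb in impulses(g):
        out[a + b] += ca * cb

# Integrate twice and compare with the direct convolution
recovered = np.cumsum(np.cumsum(out))
direct = np.convolve(f, g)
assert np.allclose(recovered[: len(direct)], direct)
```

With 8 segments in f and 4 in g, the double loop costs 9 x 5 = 45 multiply-adds instead of the 64 x 16 = 1024 of the direct convolution of the quantized signals.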
Finally the result of the convolution must be integrated twice to yield \n\nF * G = (F^{-1} * G^{-1})^2   (5) \n\n\fBoxlets: A Fast Convolution Algorithm \n\n573 \n\nFigure 1: Example of convolution between 1-dimensional functions f and g, where the approximations of f and g are piecewise constant. \n\n2 Quantization: from Images to Boxlets \n\nThe goal of this section is to suggest efficient ways to approximate an image f by a cover of polynomials of degree d suited for convolution. Let S be the space on which f is defined, and let C = {c_i} be a partition of S (c_i ∩ c_j = ∅ for i ≠ j, and ∪_i c_i = S). For each c_i, let p_i be the polynomial of degree d which minimizes the equation: \n\ne_i = Σ_{x ∈ c_i} (f(x) - p_i(x))^2   (6) \n\nThe uniqueness of p_i is guaranteed if c_i is convex. The problem is to find a cover C which minimizes both the number of c_i and Σ_i e_i. Many different compromises are possible, but since the computational cost of the convolution is proportional to the number of regions, it seemed reasonable to choose the largest regions with a maximum error bounded by a threshold K. Since each region will be differentiated and integrated along the directions of the axes, the boundaries of the c_i are restricted to be parallel to the axes, hence the appellation boxlet. There are still many ways to compute valid partitions of boxlets and polynomials. We have investigated two very different approaches which both yield a polynomial cover of the image in reasonable time. The first algorithm is greedy. 
It uses a procedure which, starting from a top left corner, finds the biggest boxlet c_i which satisfies e_i < K without overlapping another boxlet. The algorithm starts with the top left corner of the image, and keeps a list of all possible starting points (uncovered top left corners) sorted by X and Y positions. When the list is exhausted, the algorithm terminates. Surprisingly, this algorithm can run in O(d(N + P log N)), where N is the number of pixels, P is the number of boxlets and d is the order of the polynomials p_i. Another, much simpler, algorithm consists of recursively splitting boxlets, starting from a boxlet which encompasses the whole image, until e_i < K for all the leaves of the tree. This algorithm runs in O(dN), is much easier to implement, and is faster (better time constant). Furthermore, even though the first algorithm yields a polynomial coverage with fewer boxlets, the second algorithm yields fewer impulse functions after differentiation, because more impulse functions can be combined (see next section). \n\nFigure 2: Effects of boxletization: original (top left), greedy (bottom left) with a threshold of 10,000, and recursive (top and bottom right) with a threshold of 10,000. \n\nBoth algorithms rely on the fact that Equation 6 can be computed in constant time. This computation requires the following quantities \n\nΣ f(x,y), Σ f(x,y)^2 (degree 0), Σ f(x,y)x, Σ f(x,y)y, Σ f(x,y)xy, ... (degree 1)   (7) \n\nto be pre-computed over the whole image, for the greedy algorithm, or over recursively embedded regions, for the recursive algorithm. 
In the case of the recursive algorithm these quantities are computed bottom up and very efficiently. To prevent the sums from becoming too large, a limit can be imposed on the maximum size of c_i. The coefficients of the polynomials are quickly evaluated by solving a small linear system, using the first two sums for polynomials of degree 0 (constants), the first 5 sums for polynomials of degree 1, and so on. \n\nFigure 2 illustrates the results of the quantization algorithms. The top left corner is a fraction of the original image. The bottom left image illustrates the boxletization of the greedy algorithm, with polynomials of degree 1 and e_i <= 10,000 (13,000 boxlets, 62,000 impulse functions and their derivatives). The top right image illustrates the boxletization of the recursive algorithm, with polynomials of degree 0 and e_i <= 10,000 (47,000 boxlets, 58,000 impulse functions). The bottom right is the same as the top right without displaying the boxlet boundaries. In this case the pixel to impulse function ratio is 5.8. \n\n3 Differentiation: from Boxlets to Impulse Functions \n\nIf p_i is a polynomial of degree d, its (d+1)-th derivative can be written as a sum of impulse function derivatives, which are zero everywhere but at the corners of c_i. These impulse functions summarize the boundary conditions and completely characterize p_i. They can be represented by four (d+1)-dimensional vectors associated with the 4 corners of c_i. Figure 3 (top) illustrates the impulse functions at the 4 
corners when the polynomial is a constant (degree zero). Note that the polynomial must be differentiated d + 1 times (in this example the polynomial is a constant, so d = 0), with respect to each dimension of the input space. This is illustrated at the top of Figure 3. The cover C being a partition, boundary conditions between adjacent squares do simplify; that is, the same derivatives of an impulse function at the same location can be combined by adding their coefficients. It is very advantageous to do so because it reduces the computation of the convolution in the next step. This is illustrated in Figure 3 (bottom). This combining of impulse functions is one of the reasons why the recursive algorithm for the quantization is preferred to the greedy algorithm. In the recursive algorithm, the boundaries of boxlets are often aligned, so that the impulse functions of adjacent boxlets can be combined. Typically, after simplification, there are only 20% more impulse functions than there are boxlets. In contrast, the greedy algorithm generates up to 60% more impulse functions than boxlets, due to the fact that there are no alignment constraints. For the same threshold the recursive algorithm generates 20% to 30% fewer impulse functions than the greedy algorithm. \n\nFigure 3: Differentiation of a constant polynomial in 2D (top). Combining the derivatives of adjacent polynomials (bottom). 
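A minimal sketch of the recursive quantization for degree-0 polynomials follows. The split rule (halving the longer side) and all names are our own simplifications; the real algorithm evaluates Equation 6 in O(1) per region using the precomputed sums of Equation 7, whereas this sketch recomputes the error directly.

```python
import numpy as np

def boxletize(img, K, r0=0, c0=0, r1=None, c1=None, out=None):
    """Recursively split a region into boxlets until the squared error of
    a degree-0 (constant) fit is below K. Returns (r0, c0, r1, c1, mean)
    tuples covering the image without overlap."""
    if r1 is None:
        r1, c1, out = img.shape[0], img.shape[1], []
    region = img[r0:r1, c0:c1]
    m = region.mean()
    err = np.sum((region - m) ** 2)      # Equation 6 with p_i constant
    if err <= K or region.size == 1:
        out.append((r0, c0, r1, c1, m))
    elif r1 - r0 >= c1 - c0:             # split the longer side in two
        mid = (r0 + r1) // 2
        boxletize(img, K, r0, c0, mid, c1, out)
        boxletize(img, K, mid, c0, r1, c1, out)
    else:
        mid = (c0 + c1) // 2
        boxletize(img, K, r0, c0, r1, mid, out)
        boxletize(img, K, r0, mid, r1, c1, out)
    return out

# A noisy image with a sharp horizontal edge: large flat boxlets away
# from the edge, small ones along it.
rng = np.random.default_rng(2)
img = rng.standard_normal((32, 32)) + 5.0 * (np.arange(32)[:, None] > 16)
boxes = boxletize(img, K=40.0)
approx = np.zeros_like(img)
for r0, c0, r1, c1, m in boxes:
    approx[r0:r1, c0:c1] = m
print(len(boxes), "boxlets")
```

Because the splits always halve a side, adjacent boxlet boundaries tend to align, which is exactly the property that lets impulse functions of neighboring boxlets cancel.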
\nFinding which impulse functions can be combined is a difficult task because the recursive representation returned by the recursive algorithm does not provide any means for matching the bottom of squares on one line with the top of squares from below that line. Sorting takes O(P log P) computational steps (where P is the number of impulse functions) and is therefore too expensive. A better algorithm is to visit the recursive tree and accumulate all the top corners into sorted (horizontal) lists. A similar procedure sorts all the bottom corners (also into horizontal lists). The horizontal lists corresponding to the same vertical positions can then be merged in O(P) operations. The complete algorithm which quantizes an image of N pixels and returns sorted lists of impulse functions runs in O(dN) (where d is the degree of the polynomials). \n\n4 Results \n\nThe convolution speed of the algorithm was tested with feature extraction on the image shown on the top left of Figure 2. The image is quantized, but the feature is not. The feature is tabulated in kernels of sizes 5 x 5, 10 x 10, 15 x 15 and 20 x 20. If the kernel is decomposable, the algorithm can be modified to do two 1D convolutions instead of the present 2D convolution. \n\nThe quantization of the image is done with constant polynomials, and with thresholds varying from 1,000 to 40,000. This corresponds to varying the pixel to impulse function ratio from 2.3 to 13.7. Since the feature is not quantized, these ratios correspond exactly to the ratios of number of multiply-adds for the standard convolution versus the boxlet convolution (excluding quantization and integration). 
Table 1: Convolution speed-up factors. \n\nFigure 4: Run-length X convolution: (a) horizontal convolution of two binary images; (b) convolution of two runs; (c) summation of the convolutions of runs; (d) the same computation with impulse functions. \n\nThe actual speed up factors are summarized in Table 1. The four last columns indicate the measured time ratios between the standard convolution and the boxlet convolution. For each threshold value, the top line indicates the time ratio of standard convolution versus quantization, convolution and integration time for the boxlet convolution. The bottom line does not take into account the quantization time. The feature size was varied from 5 x 5 to 20 x 20. Thus with a threshold of 10,000 and a 5 x 5 kernel, the quantization ratio is 5.8, and the speed up factor is 2.8. The loss in image quality can be seen by comparing the top left and the bottom right images. If several features are extracted, the quantization time of the image is shared amongst the features and the speed up factor is closer to 4.7. \n\nIt should be noted that these speed up factors depend on the quantization level, which depends on the data and affects the accuracy of the result. The good news is that for each application the optimal threshold (the maximum level of quantization which has a negligible effect on the result) can be evaluated quickly. Once the optimal threshold has been determined, one can enjoy the speed up factor. It is remarkable that with a quantization factor as low as 2.3, the speed up ratio can range from 1.5 to 2.3, depending on the number of features. 
We believe that this method is directly applicable to forward propagation in convolutional neural nets (although no results are available at this time). \n\nThe next application shows a case where quantization has no adverse effect on the accuracy of the convolution, and yet large speed ups are obtained. \n\n5 Binary images and run-length encoding \n\nThe quantization steps described in Sections 2 and 3 become particularly simple when the image is binary. If the threshold is set to zero, and if only the X derivative is considered, the impulse representation is equivalent to run-length encoding. Indeed the position of each positive impulse function codes the beginning of a run, while the position of each negative impulse codes the end of a run. The horizontal convolution can be computed effectively using the boxlet convolution algorithm. This is illustrated in Figure 4. In (a), the distance between two binary images must be evaluated for every horizontal position (a horizontally translation invariant distance). The result is obtained by convolving each horizontal line and by computing the sum of each of the convolution functions. The convolution of two runs is depicted in (b), while the summation of all the convolutions of two runs is depicted in (c). If an impulse representation is used for the runs (a first derivative), each summation of a convolution between two runs requires only 4 additions of impulse functions, as depicted in (d). The result must be integrated twice, according to Equation 5. The speed up factors can be considerable depending on the width of the images (an order of magnitude if the width is 40 pixels), and there is no accuracy penalty. 
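For a single pair of binary rows this reduces to a few lines. The sketch below (NumPy, with illustrative names) builds the run-length impulse representation of two rows, convolves the impulses, and integrates twice; since every run contributes exactly two unit impulses, each pair of runs costs the 4 operations claimed above.

```python
import numpy as np

def runs_to_impulses(row):
    """X derivative of a binary row: +1 at each run start, -1 one past
    its end, which is exactly a run-length encoding."""
    d = np.convolve(row.astype(float), [1.0, -1.0])
    return [(i, c) for i, c in enumerate(d) if c != 0.0]

rng = np.random.default_rng(3)
a = (rng.random(40) > 0.5).astype(float)
b = (rng.random(40) > 0.5).astype(float)

# Each pair of runs contributes 4 impulse updates (Figure 4d);
# the coefficients are all +1 or -1, so these are pure additions.
out = np.zeros(2 * 40 + 1)
for i, ci in runs_to_impulses(a):
    for j, cj in runs_to_impulses(b):
        out[i + j] += ci * cj

# Integrate twice (Equation 5) and check against direct convolution
overlap = np.cumsum(np.cumsum(out))[: 2 * 40 - 1]
assert np.allclose(overlap, np.convolve(a, b))
```

The exactness of the final assertion reflects the point of this section: for binary data the quantization threshold is zero, so there is no approximation at all.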
\n\nFigure 5: Binary image (left) and compact impulse function encoding (right). \n\nThis speed up also generalizes to 2-dimensional encoding of binary images. The gain comes from the frequent cancellations of impulse functions of adjacent boxlets. The number of impulse functions is proportional to the contour length of the binary shapes. In this case, the boxlet computation is mostly an efficient algorithm for 2-dimensional run-length encoding. This is illustrated in Figure 5. As with run-length encoding, a considerable speed up is obtained for convolution, at no accuracy penalty. \n\n6 Conclusion \n\nWhen convolutions are used for feature extraction, precision can often be sacrificed for speed with negligible degradation of performance. The boxlet convolution method combines quantization and convolution to offer a continuously adjustable trade-off between accuracy and speed. In some cases (such as in relatively simple binary images) large speed ups come with no adverse effects. The algorithm is directly applicable to the forward propagation in convolutional neural networks and in pattern matching when translation invariance results from the use of convolution. \n\nReferences \n\n[1] Yann LeCun and Yoshua Bengio, \"Convolutional networks for images, speech, and time-series,\" in The Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. 1995, MIT Press. \n\n[2] Paul S. Heckbert, \"Filtering by repeated integration,\" in ACM SIGGRAPH Conference on Computer Graphics, Dallas, TX, August 1986, vol. 20, pp. 315-321. 
\n\n\f", "award": [], "sourceid": 1602, "authors": [{"given_name": "Patrice", "family_name": "Simard", "institution": null}, {"given_name": "L\u00e9on", "family_name": "Bottou", "institution": null}, {"given_name": "Patrick", "family_name": "Haffner", "institution": null}, {"given_name": "Yann", "family_name": "LeCun", "institution": null}]}