{"title": "Implementation Issues in the Fourier Transform Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 260, "page_last": 266, "abstract": null, "full_text": "Implementation Issues  in the Fourier \n\nTransform Algorithm \n\nYishay Mansour\"  Sigal Sahar t \n\nComputer Science  Dept. \n\nTel-Aviv University \nTel-Aviv, ISRAEL \n\nAbstract \n\nThe  Fourier  transform of boolean  functions  has  come  to  play  an \nimportant role in proving many important learnability results.  We \naim to demonstrate that the  Fourier transform techniques  are also \na  useful  and  practical  algorithm in  addition  to  being  a  powerful \ntheoretical  tool.  We  describe  the more prominent changes  we  have \nintroduced  to  the  algorithm,  ones  that  were  crucial  and  without \nwhich  the  performance  of  the  algorithm  would  severely  deterio(cid:173)\nrate.  One of the benefits we  present  is  the confidence  level for each \nprediction  which  measures  the  likelihood the prediction is  correct. \n\n1 \n\nINTRODUCTION \n\nOver the last few  years the Fourier Transform (FT) representation of boolean func(cid:173)\ntions has  been  an instrumental tool in  the computational learning  theory  commu(cid:173)\nnity.  It has  been  used  mainly to demonstrate the learnability of various  classes  of \nfunctions with respect to the uniform distribution.  The first  connection between the \nFourier  representation  and  learnability of boolean  functions  was  established  in  [6] \nwhere  the  class  ACo  was  learned  (using  its  FT representation)  in  O(nPoly-log(n)) \ntime.  The  work  of [5]  developed  a  very  powerful  algorithmic  procedure:  given  a \nfunction  and a  threshold  parameter it finds  in  polynomial time all  the  Fourier  co(cid:173)\nefficients  of the  function  larger  than  the  threshold.  Originally the  procedure  was \nused  to learn  decision  trees  [5],  and in  [8,  2,  4]  it was used  to learn polynomial size \nDNF. The FT technique applies naturally to the uniform distribution, though some \nof the learnability results  were  extended to product  distribution  [1,  3] . \n\n.. e-mail:  manSQur@cs.tau.ac.il \nt e-mail:  gales@cs.tau.ac.il \n\n\fImplementation  Issues in the  Fourier Transform  Algorithm \n\n261 \n\nA  great  advantage of the  FT algorithm is  that it  does  not  make any assumptions \non  the function  it is  learning.  We  can apply it to  any  function  and hope  to obtain \n\"large\"  Fourier  coefficients.  The  prediction  function  simply computes  the  sum  of \nthe  coefficients  with  the  corresponding  basis  functions  and  compares  the  sum  to \nsome  threshold.  The  procedure  is  also  immune to  some  noise  and  will  be  able  to \noperate even if a fraction of the examples are maliciously misclassified.  Its drawback \nis  that it requires  to query  the  target function  on  randomly selected  inputs. \n\nWe  aim to  demonstrate  that  the  FT  technique  is  not  only  a  powerful  theoretical \ntool,  but also a practical one.  In the process of implementing the Fourier algorithm \nwe enhanced it in order to improve the accuracy of the hypothesis we generate while \nmaintaining a  desirable  run  time.  We  have  added  such  feartures  as  the  detection \nof inaccurate  approximations  \"on  the fly\"  and  immediate correction  of the  errors \nincurred at a  minimal cost.  
The methods we devised to choose the "right" parameters proved to be essential in order to achieve our goals. Furthermore, when making predictions, it is extremely beneficial to have the prediction algorithm supply an indicator that provides the confidence level we have in the prediction we made. Our algorithm provides us naturally with such an indicator, as detailed in Section 4.1.

The paper is organized as follows: Section 2 briefly defines the FT and describes the algorithm. In Section 3 we describe the experiments and their outcome, and in Section 4 the enhancements made. We end with our conclusions in Section 5.

2 FOURIER TRANSFORM (FT) THEORY

In this section we briefly introduce the FT theory, its connection to learning, and the algorithm that finds the large coefficients. A comprehensive survey of the theoretical results and proofs can be found in [7].

We consider boolean functions of n variables: f : {0,1}^n → {−1,1}. We define the inner product ⟨g,f⟩ = 2^{−n} Σ_{x∈{0,1}^n} f(x)g(x) = E[g·f], where E is the expected value with respect to the uniform distribution. The basis is defined as follows: for each z ∈ {0,1}^n, we define the basis function χ_z(x_1,...,x_n) = (−1)^{Σ_{i=1}^n x_i z_i}. Any function of n boolean inputs can be uniquely expressed as a linear combination of the basis functions. For a function f, the z-th Fourier coefficient of f is denoted by f̂(z), i.e., f(x) = Σ_{z∈{0,1}^n} f̂(z)·χ_z(x). The Fourier coefficients are computed by f̂(z) = ⟨f, χ_z⟩, and we call z the coefficient-name of f̂(z). We define a t-sparse function to be a function that has at most t non-zero Fourier coefficients.

2.1 PREDICTION

Our aim is to approximate the target function f by a t-sparse function h. In many cases h will simply include the "large" coefficients of f. That is, if A = {z_1,...,z_m} is the set of z's for which f̂(z_i) is "large", we set h(x) = Σ_{z_i∈A} a_i·χ_{z_i}(x), where a_i is our approximation of f̂(z_i). The hypothesis we generate using this process, h(x), does not have a boolean output. In order to obtain a boolean prediction we use sign(h(x)), i.e., output +1 if h(x) ≥ 0 and −1 if h(x) < 0. We want to bound the error we get from approximating f by h using the expected error squared, E[(f − h)²]. It can be shown that bounding it bounds the boolean prediction error probability, i.e., Pr[f(x) ≠ sign(h(x))] ≤ E[(f − h)²]. For a given t, the t-sparse hypothesis h that minimizes E[(f − h)²] simply includes the t largest coefficients of f. Note that the more coefficients we include in our approximation, and the better we approximate their values, the smaller E[(f − h)²] is going to be. This provides us with the motivation to find the "large" coefficients.
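To make the prediction scheme concrete, the following minimal Python sketch (our illustration, not the authors' implementation) evaluates such a sparse hypothesis; the list of coefficient-name/value pairs is assumed to come from the search procedure of Section 2.2:

    import numpy as np

    def chi(z, x):
        # Basis function chi_z(x) = (-1)^(sum_i z_i * x_i), for 0/1 vectors z and x.
        return -1.0 if np.dot(z, x) % 2 else 1.0

    def predict(coeffs, x, cutoff=0.0):
        # coeffs: list of (z, a) pairs, where z is a coefficient-name (0/1 vector)
        # and a approximates the corresponding Fourier coefficient.
        h = sum(a * chi(z, x) for z, a in coeffs)
        return 1 if h >= cutoff else -1

Here cutoff = 0 corresponds to the plain sign rule used in the theoretical analysis; Section 4.2 revisits this choice.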
2.2 FINDING THE LARGE COEFFICIENTS

The algorithm that finds the "large" coefficients receives as inputs a function f (a black box it can query) and an interest threshold parameter θ > 0. It outputs a list of coefficient-names that (1) includes all the coefficient-names whose corresponding coefficients are "large", i.e., at least θ, and (2) does not include "too many" coefficient-names. The algorithm runs in polynomial time in both 1/θ and n.

    SUBROUTINE search(α)
        IF TEST[f, α, θ] THEN
            IF |α| = n THEN OUTPUT α
            ELSE search(α0); search(α1)

    Figure 1: Subroutine search.

The basic idea of the algorithm is to perform a search in the space of the coefficient-names of f. Throughout the search algorithm (see Figure (1)) we maintain a prefix of a coefficient-name and try to estimate whether any of its extensions can be a coefficient-name whose value is "large". The algorithm commences by calling search(λ), where λ is the empty string. On each invocation it computes the predicate TEST[f, α, θ]. If the predicate is true, it recursively calls search(α0) and search(α1). Note that if TEST is very permissive we may reach all the coefficients, in which case our running time will not be polynomial; its implementation is therefore of utmost interest. Formally, TEST[f, α, θ] computes whether

    E_{x∈{0,1}^{n−k}} [ (E_{y∈{0,1}^k} [f(yx)·χ_α(y)])² ] ≥ θ²,        (1)

where k = |α|. Define f_α(x) = Σ_{β∈{0,1}^{n−k}} f̂(αβ)·χ_β(x). It can be shown that the expected value in (1) is exactly the sum of the squares of the coefficients whose prefix is α, i.e., E_x[(E_y[f(yx)·χ_α(y)])²] = E_x[f_α²(x)] = Σ_{β∈{0,1}^{n−k}} f̂²(αβ), implying that if there exists a coefficient |f̂(αβ)| ≥ θ, then E[f_α²] ≥ θ². This condition guarantees the correctness of our algorithm, namely that we reach all the "large" coefficients. We would also like to bound the number of recursive calls that search performs. We can show that TEST[f, α, θ] is true for at most 1/θ² of the prefixes of size k. This bounds the number of recursive calls in our procedure by O(n/θ²).

In TEST we would like to compute the expected value, but in order to do so efficiently we settle for an approximation of its value. This can be done as follows: (1) choose m1 random x_i ∈ {0,1}^{n−k}, (2) choose m2 random y_{i,j} ∈ {0,1}^k, (3) query f on y_{i,j}x_i (which is why we need the query model: to query f on many points with the same prefix x_i) and receive f(y_{i,j}x_i), and (4) compute the estimate as

    B_α = (1/m1) Σ_{i=1}^{m1} [ (1/m2) Σ_{j=1}^{m2} f(y_{i,j}x_i)·χ_α(y_{i,j}) ]².

Again, for more details see [7].
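The following Python sketch (ours; the helper name b_est is hypothetical, and chi is the basis-function helper from the previous sketch) implements this search, with the sampled estimate B_α standing in for the exact TEST predicate:

    import numpy as np

    def b_est(f, alpha, n, m1, m2, rng):
        # Monte-Carlo estimate B_alpha of E_x[(E_y[f(yx) chi_alpha(y)])^2],
        # following steps (1)-(4) above.
        k = len(alpha)
        total = 0.0
        for _ in range(m1):
            x = rng.integers(0, 2, n - k)      # random suffix x_i
            ys = rng.integers(0, 2, (m2, k))   # random prefixes y_{i,j}
            inner = np.mean([f(np.concatenate((y, x))) * chi(alpha, y) for y in ys])
            total += inner ** 2
        return total / m1

    def search(f, alpha, theta, n, m1, m2, rng, out):
        # Depth-first traversal of coefficient-name prefixes (Figure 1);
        # TEST[f, alpha, theta] is realized as b_est(...) >= theta^2.
        if b_est(f, alpha, n, m1, m2, rng) >= theta ** 2:
            if len(alpha) == n:
                out.append(tuple(alpha))
            else:
                search(f, alpha + [0], theta, n, m1, m2, rng, out)
                search(f, alpha + [1], theta, n, m1, m2, rng, out)

A run starts with out = []; search(f, [], theta, n, m1, m2, np.random.default_rng(0), out); the coefficient of each returned name z can then be approximated by sampling E[f·χ_z].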
3 EXPERIMENTS

We implemented the FT algorithm (Section 2.2) and went forth to run a series of experiments. The parameters of each experiment include the target function, θ, m1 and m2. We briefly introduce the parameters here and defer the detailed discussion. The parameter θ determines the threshold between "small" and "large" coefficients, thus controlling the number of coefficients we will output. The parameters m1 and m2 determine how accurately we approximate the TEST predicate. Failure to approximate it accurately may yield faulty, even random, results (e.g., for a ludicrous choice of m1 = 1 and m2 = 1) that may cause the algorithm to fail (as detailed in Section 4.3). An intelligent choice of m1 and m2 is therefore indispensable. This issue is discussed in greater detail in Sections 4.3 and 4.4.

Figure 2: Typical frequency plots and typical errors. Errors occur in two cases: (1) the algorithm predicts a +1 response when the actual response is −1 (the lightly shaded area), and (2) the algorithm predicts a −1 response while the true response is +1 (the darker shaded area).

Figures (3)-(5) present representative results of our experiments in the form of graphs that evaluate the output hypothesis of the algorithm on randomly chosen test points. The target function, f, returns a boolean response, ±1, while the FT hypothesis returns a real response. We therefore present, for each experiment, a graph consisting of two curves: the frequency of the values of the hypothesis, h(x), when f(x) = +1, and a second curve for f(x) = −1. If the two curves intersect, their intersection represents the inherent error the algorithm makes.

Figure 3: Decision trees of depth 5 and 3 with 41 variables. The 5-deep (3-deep) decision tree returns −1 about 50% (62.5%) of the time. The results shown above are for the values θ = 0.03, m1 = 100 and m2 = 5600 (θ = 0.06, m1 = 100 and m2 = 1300). Both graphs are disjoint, signifying 0% error.

4 RESULTS AND ALGORITHM ENHANCEMENTS

4.1 CONFIDENCE LEVELS

One of our most consistent and interesting empirical findings was the distribution of the error versus the value of the algorithm's hypothesis: its shape is always that of a bell-shaped curve. Knowing the error distribution permits us to determine with a high (often 100%) confidence level the result for most of the instances, yielding the much sought-after confidence level indicator. Though this simple logic thus far has not been supported by any theoretical result, our experimental results provide overwhelming evidence that this is indeed the case.

Figure 4: 16-term DNF. This (randomly generated) DNF of 40 variables returns −1 about 61% of the time. The results shown above are for the values θ = 0.02, m2 = 12500 and m1 = 100. The hypothesis uses 186 non-zero coefficients. A total of 9.628% error was detected.

Let us demonstrate the strength of this technique: consider the results of the 16-term DNF portrayed in Figure (4). If the algorithm's hypothesis outputs 0.3 (translated into 1 in boolean terms by the sign function), we know with an 83% confidence level that the prediction is correct. If the algorithm outputs −0.9 as its prediction, we can virtually guarantee that the response is correct. Thus, although the total error level is over 9%, we can supply a confidence level for each prediction. This is an indispensable tool for practical usage of the hypothesis.

4.2 DETERMINING THE THRESHOLD

Once the list of large coefficients is built and we compute the hypothesis h(x), we still need to determine the threshold, α, to which we compare h(x) (i.e., predict +1 iff h(x) > α). In the theoretical work it is assumed that α = 0, since a priori one cannot make a better guess. We observed that fixing α's value according to our hypothesis improves the hypothesis. α is chosen to minimize the error with respect to a number of random examples.
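Both of these post-processing steps can be read off a labeled validation sample. A minimal sketch (ours; it assumes numpy arrays h_vals of hypothesis values and labels of the true ±1 responses on random examples):

    import numpy as np

    def choose_threshold(h_vals, labels, num_candidates=200):
        # Section 4.2: pick the cutoff minimizing the empirical error of the
        # rule "predict +1 iff h(x) > cutoff" over a grid of candidates.
        candidates = np.linspace(h_vals.min(), h_vals.max(), num_candidates)
        errors = [np.mean(np.where(h_vals > c, 1, -1) != labels) for c in candidates]
        return candidates[int(np.argmin(errors))]

    def confidence_table(h_vals, labels, bins=20):
        # Section 4.1: empirical confidence level per h(x) bin, i.e., the fraction
        # of validation examples in the bin whose sign prediction is correct.
        edges = np.linspace(h_vals.min(), h_vals.max(), bins + 1)
        idx = np.clip(np.digitize(h_vals, edges) - 1, 0, bins - 1)
        conf = np.full(bins, np.nan)
        for b in range(bins):
            mask = idx == b
            if mask.any():
                preds = np.where(h_vals[mask] >= 0, 1, -1)
                conf[b] = np.mean(preds == labels[mask])
        return edges, conf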
Figure 5: 8-term DNF. This (randomly generated) DNF of 40 variables returns −1 about 43% of the time. The results shown above are for the values θ = 0.03, m2 = 5600 and m1 = 100. The hypothesis consists of 112 non-zero coefficients.

For example, when trying to learn an 8-term DNF with the zero threshold, we will receive a total of 1.22% overall error, as depicted in Figure (5). However, if we choose the threshold to be 0.32, we will get a diminished error of 0.068%.

4.3 ERROR DETECTION ON THE FLY - RETRY

During our experimentations we have noticed that at times the estimate B_α for E[f_α²] may be inaccurate. A faulty approximation may result in the abortion of the traversal of "interesting" subtrees, thus decreasing the hypothesis' accuracy, or in the traversal of "uninteresting" subtrees, thereby needlessly increasing the algorithm's runtime. Since the properties of the FT guarantee that E[f_α²] = E[f_{α0}²] + E[f_{α1}²], we expect B_α ≈ B_{α0} + B_{α1}. Whenever this is not true, we conclude that at least one of our approximations is somewhat lacking. We can remedy the situation by running the search procedure again on the children, i.e., retry node α. This solution increases the probability of finding all the "large" coefficients. A brute-force implementation may cost us an inordinate amount of time, since we may retraverse subtrees that we have previously visited. However, since any discrepancies between the parent and its children are discovered, and corrected, as soon as they appear, we can circumvent any retraversal. Thus, we correct the errors without any superfluous additions to the run time.

Figure 6: Majority function of 41 variables. The results portrayed are for the values m1 = 100, m2 = 800 and θ = 0.08. Note the majority-function characteristic distribution of the results.¹

¹ The "peaked" distribution of the results is not coincidental. The FT of the majority function has 42 large equal coefficients, labeled c_maj: one for each singleton (a vector of the form 0..010..0) and one for parity (the all-ones vector). The singletons of an input vector with z zeros will contribute ±(2z − 41)·c_maj to the result, and the parity will contribute ±c_maj (depending on whether z is odd or even), so that the total contribution is an even factor of c_maj. Since c_maj = (40 choose 20)·2^{−40} ≈ 0.12, we have peaks around factors of 0.24. The distribution around the peaks is due to the fact that we only approximate each coefficient and get a value close to c_maj.

We demonstrate the usefulness of this approach with an example of learning the majority function of 41 boolean variables. Without the retry mechanism, 8 (of a total of 42) large coefficients were missed, giving rise to 13.724% error, represented by the shaded area in Figure (6). With the retries all the correct coefficients were found, yielding perfect (flawless) results, represented by the dotted curve in Figure (6).
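One way to fold the retry into the recursion is sketched below (our illustration, reusing the hypothetical b_est estimator from the Section 2.2 sketch; the tolerance tol and the retry bound are assumptions, not values from the paper):

    def search_retry(f, alpha, b, theta, n, m1, m2, rng, out, tol=0.5, retries=3):
        # b is the current estimate B_alpha for node alpha.
        if b < theta ** 2:
            return
        if len(alpha) == n:
            out.append(tuple(alpha))
            return
        b0 = b_est(f, alpha + [0], n, m1, m2, rng)
        b1 = b_est(f, alpha + [1], n, m1, m2, rng)
        # Since E[f_alpha^2] = E[f_alpha0^2] + E[f_alpha1^2], we expect
        # b to be close to b0 + b1; on a discrepancy, re-estimate the children
        # on the spot (retry node alpha) instead of retraversing subtrees later.
        for _ in range(retries):
            if abs(b - (b0 + b1)) <= tol * theta ** 2:
                break
            b0 = b_est(f, alpha + [0], n, m1, m2, rng)
            b1 = b_est(f, alpha + [1], n, m1, m2, rng)
        search_retry(f, alpha + [0], b0, theta, n, m1, m2, rng, out, tol, retries)
        search_retry(f, alpha + [1], b1, theta, n, m1, m2, rng, out, tol, retries)

The top-level call is search_retry(f, [], b_est(f, [], n, m1, m2, rng), theta, n, m1, m2, rng, out); because the children's estimates are reconciled before descending, no subtree needs to be revisited later.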
4.4 DETERMINING THE PARAMETERS

One of our aims was to determine the values of the different parameters, m1, m2 and θ. Recall that in our algorithm we calculate B_α, the approximation of E_x[f_α²(x)], where m1 is the number of times we sample x in order to make this approximation. We sample y randomly m2 times to approximate f_α(x_i) = E_y[f(yx_i)·χ_α(y)], for each x_i. This approximation of f_α(x_i) has a standard deviation of approximately 1/√m2. Assume that the true value is β_i, i.e., β_i = f_α(x_i); then we expect the contribution of the i-th element to B_α to be (β_i ± 1/√m2)² = β_i² ± 2β_i/√m2 + 1/m2. The algorithm tests B_α = (1/m1)·Σ β_i² ≥ θ², therefore, to ensure a low error, based on the above argument, we choose m2 = 5/θ².

Choosing the right value for m2 is of great importance. We have noticed on more than one occasion that increasing the value of m2 actually decreases the overall run time. This is not obvious at first: seemingly, any increase in the number of times we loop in the algorithm only increases the run time. However, a more accurate value for m2 means a more accurate approximation of the TEST predicate, and therefore less chance of redundant recursive calls (the run time is linear in the number of recursive calls). We can see this exemplified in Figure (7), where the number of recursive calls increases drastically as m2 decreases. In order to present Figure (7), we learned the same 3-term DNF, always using θ = 0.05 and m1·m2 = 100000. The trials differ in the specific values chosen in each trial for m2.

Figure 7: Determining m2. Note that the number of recursive calls grows dramatically as m2's value decreases. For example, for m2 = 400 the number of recursive calls is 14,433, compared with only 1,329 recursive calls for m2 = 500.

SPECIAL CASES: When k = |α| is either very small or very large, the values we choose for m1 and m2 can be self-defeating: when k ≈ n we still loop m1 (≫ 2^{n−k}) times, though often without gaining additional information. The same holds for very small values of k and the corresponding m2 (≫ 2^k) values. We therefore add the following feature: for small and large values of k we calculate the expected value exactly, thereby decreasing the run time and increasing accuracy.
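As an illustration of the small-k case (ours; the switch-over rule, e.g. enumerating whenever 2^k ≤ m2, is an assumption), the inner expectation can be computed exactly by enumerating all 2^k prefixes instead of sampling them:

    import numpy as np
    from itertools import product

    def chi(z, y):
        # Basis function restricted to the first k = len(z) bits.
        return -1.0 if np.dot(z, y) % 2 else 1.0

    def b_est_small_k(f, alpha, n, m1, rng):
        # For small k = |alpha|: enumerate all 2^k prefixes y, so the inner
        # expectation E_y[f(yx) chi_alpha(y)] carries no sampling error;
        # only the suffixes x are still sampled.
        k = len(alpha)
        ys = [np.array(y, dtype=int) for y in product((0, 1), repeat=k)]
        total = 0.0
        for _ in range(m1):
            x = rng.integers(0, 2, n - k)
            inner = np.mean([f(np.concatenate((y, x))) * chi(alpha, y) for y in ys])
            total += inner ** 2
        return total / m1

Symmetrically, when k is close to n, all 2^{n−k} suffixes x are enumerated and the m1 sampling loop is dropped, so the outer expectation is computed exactly as well.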
\n\nAnnual  Work&hop  on  Computational  Learning Theory,  pages  62-70,  July  1992. \n\nIn  5 th \n\n(2)  Avrim  Blum,  Merrick  Furst,  Jeffrey  Jackson,  Michael  Kearns,  Yishay  Mansour,  and  Steven  Rudich.  Weakly \nlearning  DNF and  characterizing statistical  query learning using  fourier  analysis.  In The 26 th  Annual  AC M \nSympo&ium  on  Theory  of  Computing,  pages  253  - 262,  1994 . \n\n(3)  Merrick  L .  Furst ,  Jeffrey  C.  Jackson,  and  Sean  W.  Smith.  Improved  learning  of  ACO  functions . \n\nAnnual  Work&hop  on  Computational  Learning Theory,  pages  317-325,  August  1991. \n\nIn  4th \n\n(4)  J.  Jackson .  An  efficient  membership-query algorithm  for  learning  DNF  with  respect to  the uniform distribu(cid:173)\n\ntion.  In  Annual  Sympo&ium  on  Switching  and  Automata Theory,  pages  42  - 53,  1994. \n\n(5)  E.  Kushilevitz  and  Y .  Mansour.  Learning  decision  trees  using  the  fourier  spectrum.  SIAM  Journal  on \n\nComputing 22(6):  1331-1348,  1993. \n\n(6)  N.  Linial,  Y.  Mansour,  and  N .  Nisan.  Constant  depth  circuits,  fourier  transform  and  learnability.  JACM \n\n40(3):607-620,  1993. \n\n(7)  Y.  Mansour.  Learning  Boolean  Functions  via  the  Fourier  Transform.  Advance&  in  Neural  Computation, \nedited  by  V.P.  Roychodhury  and  K-Y.  Siu  and  A.  Orlitsky,  Kluwer  Academic  Pub.  1994.  Can  be  accessed \nvia  Up:/ /ftp .math.tau.ac.iJ/pub/mansour/PAPERS/LEARNING/fourier-survey.ps.Z. \n\n(8)  Yishay  Mansour.  An o(nlog log n) learning algorihm for DNF under the uniform distribution .  J.  of Computer \n\nand  Sy&tem  Science,  50(3):543-550,  1995. \n\n\f", "award": [], "sourceid": 1054, "authors": [{"given_name": "Yishay", "family_name": "Mansour", "institution": null}, {"given_name": "Sigal", "family_name": "Sahar", "institution": null}]}