{"title": "Implementation Issues in the Fourier Transform Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 260, "page_last": 266, "abstract": null, "full_text": "Implementation Issues in the Fourier Transform Algorithm \n\nYishay Mansour* Sigal Sahar** \n\nComputer Science Dept. \nTel-Aviv University \nTel-Aviv, ISRAEL \n\nAbstract \n\nThe Fourier transform of boolean functions has come to play an important role in proving many important learnability results. We aim to demonstrate that the Fourier transform techniques are also a useful and practical algorithm, in addition to being a powerful theoretical tool. We describe the more prominent changes we have introduced to the algorithm, ones that were crucial and without which the performance of the algorithm would severely deteriorate. One of the benefits we present is the confidence level for each prediction, which measures the likelihood that the prediction is correct. \n\n1 INTRODUCTION \n\nOver the last few years the Fourier Transform (FT) representation of boolean functions has been an instrumental tool in the computational learning theory community. It has been used mainly to demonstrate the learnability of various classes of functions with respect to the uniform distribution. The first connection between the Fourier representation and learnability of boolean functions was established in [6], where the class AC^0 was learned (using its FT representation) in O(n^{poly-log(n)}) time. The work of [5] developed a very powerful algorithmic procedure: given a function and a threshold parameter, it finds in polynomial time all the Fourier coefficients of the function larger than the threshold. Originally the procedure was used to learn decision trees [5], and in [8, 2, 4] it was used to learn polynomial size DNF. 
The FT technique applies naturally to the uniform distribution, though some of the learnability results were extended to product distributions [1, 3]. \n\n* e-mail: mansour@cs.tau.ac.il \n** e-mail: gales@cs.tau.ac.il \n\nA great advantage of the FT algorithm is that it does not make any assumptions about the function it is learning. We can apply it to any function and hope to obtain \"large\" Fourier coefficients. The prediction function simply computes the sum of the coefficients with the corresponding basis functions and compares the sum to some threshold. The procedure is also immune to some noise and will be able to operate even if a fraction of the examples are maliciously misclassified. Its drawback is that it requires the ability to query the target function on randomly selected inputs. \n\nWe aim to demonstrate that the FT technique is not only a powerful theoretical tool, but also a practical one. In the process of implementing the Fourier algorithm we enhanced it in order to improve the accuracy of the hypothesis we generate while maintaining a desirable run time. We have added such features as the detection of inaccurate approximations \"on the fly\" and immediate correction of the errors incurred, at a minimal cost. The methods we devised to choose the \"right\" parameters proved to be essential in order to achieve our goals. Furthermore, when making predictions, it is extremely beneficial to have the prediction algorithm supply an indicator that provides the confidence level we have in the prediction we made. Our algorithm provides us naturally with such an indicator, as detailed in Section 4.1. \n\nThe paper is organized as follows: Section 2 briefly defines the FT and describes the algorithm. In Section 3 we describe the experiments and their outcome, and in Section 4 the enhancements made. We end with our conclusions in Section 5. 
\n\n2 FOURIER TRANSFORM (FT) THEORY \n\nIn this section we briefly introduce the FT theory, its connection to learning, and the algorithm that finds the large coefficients. A comprehensive survey of the theoretical results and proofs can be found in [7]. \n\nWe consider boolean functions of n variables: f : {0,1}^n -> {-1,1}. We define the inner product: <g, f> = 2^{-n} sum_{x in {0,1}^n} f(x)g(x) = E[g \u00b7 f], where E is the expected value with respect to the uniform distribution. The basis is defined as follows: for each z in {0,1}^n, we define the basis function \u03c7_z(x_1, ..., x_n) = (-1)^{sum_{i=1}^n x_i z_i}. Any function of n boolean inputs can be uniquely expressed as a linear combination of the basis functions. For a function f, the z-th Fourier coefficient of f is denoted by f^(z), i.e., f(x) = sum_{z in {0,1}^n} f^(z) \u03c7_z(x). The Fourier coefficients are computed by f^(z) = <f, \u03c7_z>, and we call z the coefficient-name of f^(z). We define a t-sparse function to be a function that has at most t non-zero Fourier coefficients. \n\n2.1 PREDICTION \n\nOur aim is to approximate the target function f by a t-sparse function h. In many cases h will simply include the \"large\" coefficients of f. That is, if A = {z_1, ..., z_m} is the set of z's for which f^(z_i) is \"large\", we set h(x) = sum_{z_i in A} a_i \u03c7_{z_i}(x), where a_i is our approximation of f^(z_i). The hypothesis we generate using this process, h(x), does not have a boolean output. In order to obtain a boolean prediction we use sign(h(x)), i.e., output +1 if h(x) >= 0 and -1 if h(x) < 0. We want to bound the error we get from approximating f by h using the expected squared error, E[(f - h)^2]. It can be shown that bounding it bounds the boolean prediction error probability, i.e., Pr[f(x) != sign(h(x))] <= E[(f - h)^2]. For a given t, the t-sparse hypothesis h that minimizes E[(f - h)^2] simply includes the t largest coefficients of f. 
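\n\nTo make the definitions concrete, here is a minimal Python sketch of the basis functions, the Fourier coefficients (computed by exact enumeration, so only for small n), and the t-sparse sign prediction. The names (`chi`, `fourier_coefficient`, `sparse_predict`) and the 3-bit majority example are ours, for illustration only, not part of the original implementation:

```python
from itertools import product

def chi(z, x):
    # Basis function chi_z(x) = (-1)^(sum_i x_i * z_i) for z, x in {0,1}^n.
    return -1 if sum(zi & xi for zi, xi in zip(z, x)) % 2 else 1

def fourier_coefficient(f, z, n):
    # f^(z) = <f, chi_z> = 2^-n * sum_x f(x) * chi_z(x).
    # Exact enumeration over all 2^n inputs, feasible only for small n;
    # the search procedure of Section 2.2 exists precisely to avoid this.
    return sum(f(x) * chi(z, x) for x in product((0, 1), repeat=n)) / 2 ** n

def sparse_predict(coeffs, x):
    # h(x) = sum_i a_i * chi_{z_i}(x); the boolean prediction is sign(h(x)).
    h = sum(a * chi(z, x) for z, a in coeffs.items())
    return 1 if h >= 0 else -1

# Toy target: majority of 3 bits (+1 iff at least two input bits are 1).
n = 3
f = lambda x: 1 if sum(x) >= 2 else -1

# Keep the t = 4 largest coefficients; majority of 3 bits has exactly 4
# non-zero ones (three singletons and the all-ones parity), so h recovers f.
all_coeffs = {z: fourier_coefficient(f, z, n) for z in product((0, 1), repeat=n)}
top = dict(sorted(all_coeffs.items(), key=lambda kv: -abs(kv[1]))[:4])
assert all(sparse_predict(top, x) == f(x) for x in product((0, 1), repeat=n))
```

Here t = 4 happens to capture every non-zero coefficient, so the sparse hypothesis is exact; for a target that is only approximately sparse, the residual mass appears as the E[(f - h)^2] error bounded above.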
Note that the more coefficients we include in our approximation, and the better we approximate their values, the smaller E[(f - h)^2] is going to be. This provides us with the motivation to find the \"large\" coefficients. \n\n2.2 FINDING THE LARGE COEFFICIENTS \n\nThe algorithm that finds the \"large\" coefficients receives as inputs a function f (a black-box it can query) and an interest threshold parameter \u03b8 > 0. It outputs a list of coefficient-names that (1) includes all the coefficient-names whose corresponding coefficients are \"large\", i.e., at least \u03b8, and (2) does not include \"too many\" coefficient-names. The algorithm runs in polynomial time in both 1/\u03b8 and n. \n\nSUBROUTINE search(\u03b1) \n  IF TEST[f, \u03b1, \u03b8] THEN IF |\u03b1| = n THEN OUTPUT \u03b1 \n                        ELSE search(\u03b10); search(\u03b11); \n\nFigure 1: Subroutine search \n\nThe basic idea of the algorithm is to perform a search in the space of the coefficient-names of f. Throughout the search algorithm (see Figure (1)) we maintain a prefix of a coefficient-name and try to estimate whether any of its extensions can be a coefficient-name whose value is \"large\". The algorithm commences by calling search(\u03bb), where \u03bb is the empty string. On each invocation it computes the predicate TEST[f, \u03b1, \u03b8]. If the predicate is true, it recursively calls search(\u03b10) and search(\u03b11). Note that if TEST is very permissive we may reach all the coefficients, in which case our running time will not be polynomial; its implementation is therefore of utmost interest. Formally, TEST[f, \u03b1, \u03b8] computes whether \n\n  E_{x in {0,1}^{n-k}} [ ( E_{y in {0,1}^k} [ f(yx) \u03c7_\u03b1(y) ] )^2 ] >= \u03b8^2,     (1) \n\nwhere k = |\u03b1|. Define f_\u03b1(x) = sum_{\u03b2 in {0,1}^{n-k}} f^(\u03b1\u03b2) \u03c7_\u03b2(x). 
It can be shown that the expected value in (1) is exactly the sum of the squares of the coefficients whose prefix is \u03b1, i.e., E_{x in {0,1}^{n-k}} [ ( E_{y in {0,1}^k} [ f(yx) \u03c7_\u03b1(y) ] )^2 ] = E_x[f_\u03b1^2(x)] = sum_{\u03b2 in {0,1}^{n-k}} (f^(\u03b1\u03b2))^2, implying that if there exists a coefficient |f^(\u03b1\u03b2)| >= \u03b8, then E[f_\u03b1^2] >= \u03b8^2. This condition guarantees the correctness of our algorithm, namely that we reach all the \"large\" coefficients. We would also like to bound the number of recursive calls that search performs. We can show that for at most 1/\u03b8^2 of the prefixes of size k, TEST[f, \u03b1, \u03b8] is true. This bounds the number of recursive calls in our procedure by O(n/\u03b8^2). \n\nIn TEST we would like to compute the expected value, but in order to do so efficiently we settle for an approximation of its value. This can be done as follows: (1) choose m1 random x_i in {0,1}^{n-k}, (2) choose m2 random y_{i,j} in {0,1}^k, (3) query f on y_{i,j} x_i (which is why we need the query model: to query f on many points with the same prefix x_i) and receive f(y_{i,j} x_i), and (4) compute the estimate as B_\u03b1 = (1/m1) sum_{i=1}^{m1} ( (1/m2) sum_{j=1}^{m2} f(y_{i,j} x_i) \u03c7_\u03b1(y_{i,j}) )^2. Again, for more details see [7]. \n\n3 EXPERIMENTS \n\nWe implemented the FT algorithm (Section 2.2) and went forth to run a series of experiments. The parameters of each experiment include the target function, \u03b8, m1 and m2. We briefly introduce the parameters here and defer the detailed discussion. The parameter \u03b8 determines the threshold between \"small\" and \"large\" coefficients, thus controlling the number of coefficients we will output. The parameters m1 and m2 determine how accurately we approximate the TEST predicate. Failure to approximate it accurately may yield faulty, even random, results (e.g., for a ludicrous choice of m1 = 1 and m2 = 1) that may cause the algorithm to fail (as detailed in Section 4.3). An intelligent choice of m1 and m2 is therefore indispensable. 
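\n\nThe search of Figure (1), with the sampled estimator B_alpha standing in for TEST, can be sketched as follows; this shows exactly where the sample sizes m1 and m2 enter. The names (`estimate_B`, `km_search`), the default parameter values, and the fixed seed are illustrative choices of ours, not the paper's:

```python
import random

def chi_bits(z, y):
    # chi_z(y) = (-1)^(sum_i z_i * y_i).
    return -1 if sum(a & b for a, b in zip(z, y)) % 2 else 1

def estimate_B(f, alpha, n, m1, m2, rng):
    # B_alpha = (1/m1) sum_i ((1/m2) sum_j f(y_ij x_i) chi_alpha(y_ij))^2,
    # estimating E_x[f_alpha(x)^2]: the sum of squared coefficients whose
    # name starts with the prefix alpha.
    k = len(alpha)
    total = 0.0
    for _ in range(m1):
        x = tuple(rng.randint(0, 1) for _ in range(n - k))   # random suffix
        inner = 0.0
        for _ in range(m2):
            y = tuple(rng.randint(0, 1) for _ in range(k))   # random prefix
            inner += f(y + x) * chi_bits(alpha, y)           # query f on yx
        total += (inner / m2) ** 2
    return total / m1

def km_search(f, n, theta, m1=50, m2=200, seed=0):
    # Depth-first search over coefficient-name prefixes (Figure 1): recurse
    # on alpha0 and alpha1 whenever the TEST estimate reaches theta^2.
    rng = random.Random(seed)
    out = []
    def search(alpha):
        if estimate_B(f, alpha, n, m1, m2, rng) >= theta ** 2:
            if len(alpha) == n:
                out.append(alpha)
            else:
                search(alpha + (0,))
                search(alpha + (1,))
    search(())
    return out

# Sanity check: if f is itself a basis function chi_z, the only "large"
# coefficient is z, and the search should return exactly [z].
z = (1, 0, 1, 0)
f = lambda x: chi_bits(z, x)
found = km_search(f, 4, theta=0.5)
assert found == [z]
```

Mismatched prefixes make the inner sum a mean-zero average of m2 random signs, so its square concentrates near 1/m2 and the branch is pruned; this is the mechanism by which too small an m2 lets \"uninteresting\" branches survive.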
This issue is discussed in greater detail in Sections 4.3 and 4.4. \n\nFigure 2: Typical frequency plots and typical errors. Errors occur in two cases: (1) the algorithm predicts a +1 response when the actual response is -1 (the lightly shaded area), and (2) the algorithm predicts a -1 response while the true response is +1 (the darker shaded area). \n\nFigures (3)-(5) present representative results of our experiments in the form of graphs that evaluate the output hypothesis of the algorithm on randomly chosen test points. The target function, f, returns a boolean response, \u00b11, while the FT hypothesis returns a real response. We therefore present, for each experiment, a graph consisting of two curves: the frequency of the values of the hypothesis, h(x), when f(x) = +1, and the second curve for f(x) = -1. If the two curves intersect, their intersection represents the inherent error the algorithm makes. \n\nFigure 3: Decision trees of depth 5 and 3 with 41 variables. The 5-deep (3-deep) decision tree returns -1 about 50% (62.5%) of the time. The results shown above are for values \u03b8 = 0.03, m1 = 100 and m2 = 5600 (\u03b8 = 0.06, m1 = 100 and m2 = 1300). Both graphs are disjoint, signifying 0% error. \n\n4 RESULTS AND ALGORITHM ENHANCEMENTS \n\n4.1 CONFIDENCE LEVELS \n\nOne of our most consistent and interesting empirical findings was the distribution of the error versus the value of the algorithm's hypothesis: its shape is always that of a bell-shaped curve. Knowing the error distribution permits us to determine with a high (often 100%) confidence level the result for most of the instances, yielding the much sought-after confidence level indicator. Though this simple logic thus far has not been supported by any theoretical result, our experimental results provide overwhelming evidence that this is indeed the case. 
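\n\nThe paper does not spell out how the indicator is computed; one straightforward reading, sketched below under that assumption, is to calibrate on labeled samples by binning the real-valued outputs h(x) and recording how often sign(h(x)) was correct in each bin. All names, the bin edges, and the toy data are ours:

```python
import bisect

def confidence_table(h_values, labels, edges):
    # Calibration: bin the observed real-valued outputs h(x) by the sorted
    # boundaries in `edges` and record, per bin, the fraction of samples
    # whose true label f(x) matched the boolean prediction sign(h(x)).
    hits, totals = {}, {}
    for v, y in zip(h_values, labels):
        b = bisect.bisect_right(edges, v)
        totals[b] = totals.get(b, 0) + 1
        if (1 if v >= 0 else -1) == y:
            hits[b] = hits.get(b, 0) + 1
    return {b: hits.get(b, 0) / totals[b] for b in totals}

def confidence(table, edges, v):
    # Confidence level attached to a new prediction h(x) = v
    # (None if no calibration sample ever fell in v's bin).
    return table.get(bisect.bisect_right(edges, v))

# Toy calibration set mirroring the bell-shaped error curves: extreme
# outputs are reliably correct, outputs near zero are error-prone.
edges = [-0.5, -0.2, 0.2, 0.5]
h_obs = [0.9, 0.85, 0.05, 0.1, -0.8, 0.04]
y_obs = [1, 1, -1, 1, -1, 1]
table = confidence_table(h_obs, y_obs, edges)
assert confidence(table, edges, 0.95) == 1.0   # far from zero: fully confident
```

The per-bin accuracies are exactly the kind of figures quoted below (an 83% confidence level at h(x) = 0.3, near-certainty at h(x) = -0.9).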
\n\nLet us demonstrate the strength of this technique: consider the results of the 16-term DNF portrayed in Figure (4). If the algorithm's hypothesis outputs 0.3 (translated into 1 in boolean terms by the sign function), we know with an 83% confidence level that the prediction is correct. If the algorithm outputs -0.9 as its prediction, we can virtually guarantee that the response is correct. Thus, although the total error level is over 9%, we can supply a confidence level for each prediction. This is an indispensable tool for practical usage of the hypothesis. \n\nFigure 4: 16-term DNF. This (randomly generated) DNF of 40 variables returns -1 about 61% of the time. The results shown above are for the values \u03b8 = 0.02, m2 = 12500 and m1 = 100. The hypothesis uses 186 non-zero coefficients. A total of 9.628% error was detected. \n\n4.2 DETERMINING THE THRESHOLD \n\nOnce the list of large coefficients is built and we compute the hypothesis h(x), we still need to determine the threshold, \u03b1, to which we compare h(x) (i.e., predict +1 iff h(x) > \u03b1). In the theoretical work it is assumed that \u03b1 = 0, since a priori one cannot make a better guess. We observed that fixing \u03b1's value according to our hypothesis improves the hypothesis: \u03b1 is chosen to minimize the error with respect to a number of random examples. \n\nFigure 5: 8-term DNF. This (randomly generated) DNF of 40 variables returns -1 about 43% of the time. The results shown above are for the values \u03b8 = 0.03, m2 = 5600 and m1 = 100. The hypothesis consists of 112 non-zero coefficients. \n\nFor example, when trying to learn an 8-term DNF with the zero threshold we will receive a total of 1.22% overall error, as depicted in Figure (5). However, if we choose the threshold to be 0.32, we will get a diminished error of 0.068%. 
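\n\nMinimizing the empirical error over a labeled sample can be sketched as below. The name `best_threshold` and the exhaustive scan are our choices; the scan suffices because the empirical error can only change when the cut crosses an observed value of h:

```python
def best_threshold(h_values, labels):
    # Predict +1 iff h(x) > t. The empirical error only changes when t
    # crosses an observed value, so scanning those values (plus one
    # candidate below them all) finds an empirical minimizer.
    candidates = sorted(set(h_values))
    candidates = [candidates[0] - 1.0] + candidates
    def err(t):
        return sum((1 if v > t else -1) != y for v, y in zip(h_values, labels))
    return min(candidates, key=err)

# Toy sample: the best cut separates the three negative examples from the
# two positive ones, so any t in [0.1, 0.5) is optimal; the scan picks 0.1.
t = best_threshold([-0.9, -0.4, 0.1, 0.5, 0.8], [-1, -1, -1, 1, 1])
assert t == 0.1
```

The random examples used here must be held out from (or at least independent of) the queries used to estimate the coefficients, otherwise the chosen cut overfits them.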
\n\n4.3 ERROR DETECTION ON THE FLY - RETRY \n\nDuring our experimentations we have noticed that at times the estimate B_\u03b1 for E[f_\u03b1^2] may be inaccurate. A faulty approximation may result in abandoning the traversal of \"interesting\" subtrees, thus decreasing the hypothesis' accuracy, or in the traversal of \"uninteresting\" subtrees, thereby needlessly increasing the algorithm's runtime. Since the properties of the FT guarantee that E[f_\u03b1^2] = E[f_{\u03b10}^2] + E[f_{\u03b11}^2], we expect B_\u03b1 \u2248 B_{\u03b10} + B_{\u03b11}. Whenever this is not true, we conclude that at least one of our approximations is somewhat lacking. We can remedy the situation by running the search procedure again on the children, i.e., retry node \u03b1. This solution increases the probability of finding all the \"large\" coefficients. A brute force implementation may cost us an inordinate amount of time, since we may retraverse subtrees that we have previously visited. However, since any discrepancies between the parent and its children are discovered, and corrected, as soon as they appear, we can circumvent any retraversal. Thus, we correct the errors without any superfluous additions to the run time. \n\nFigure 6: Majority function of 41 variables. The results portrayed are for values m1 = 100, m2 = 800 and \u03b8 = 0.08. Note the majority-function characteristic distribution of the results^1. \n\nWe demonstrate the usefulness of this approach with an example of learning the majority function of 41 boolean variables. Without the retry mechanism, 8 (of a total of 42) large coefficients were missed, giving rise to 13.724% error, represented by the shaded area in Figure (6). With the retries all the correct coefficients were found, yielding perfect (flawless) results, represented by the dotted curve in Figure (6). 
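\n\nThe consistency check and retry step can be sketched as follows. The relative tolerance `slack`, the retry count, and the pluggable `estimate` callback are our assumptions about how such a check might be wired in, not the paper's exact scheme:

```python
def children_consistent(b_parent, b0, b1, slack=0.25):
    # The FT identity E[f_alpha^2] = E[f_{alpha0}^2] + E[f_{alpha1}^2]
    # means the estimates should satisfy B_alpha ~ B_{alpha0} + B_{alpha1};
    # a large relative gap flags a bad sample.
    scale = max(b_parent, b0 + b1, 1e-12)
    return abs(b_parent - (b0 + b1)) <= slack * scale

def search_with_retry(estimate, alpha, n, theta2, out, retries=2):
    # estimate(alpha) returns a fresh sample of B_alpha. When parent and
    # child estimates disagree, re-sample the children ("retry node alpha")
    # as soon as the discrepancy appears, before deciding what to prune.
    if len(alpha) == n:
        out.append(alpha)
        return
    b = estimate(alpha)
    b0, b1 = estimate(alpha + (0,)), estimate(alpha + (1,))
    for _ in range(retries):
        if children_consistent(b, b0, b1):
            break
        b0, b1 = estimate(alpha + (0,)), estimate(alpha + (1,))
    if b0 >= theta2:
        search_with_retry(estimate, alpha + (0,), n, theta2, out, retries)
    if b1 >= theta2:
        search_with_retry(estimate, alpha + (1,), n, theta2, out, retries)

# Toy run with a noiseless estimator: the target has the single coefficient
# f^(11) = 1, so only the prefix chain (), (1,), (1,1) survives pruning.
coeffs = {(1, 1): 1.0}
def exact_B(alpha):
    return sum(a * a for z, a in coeffs.items() if z[:len(alpha)] == alpha)
found = []
search_with_retry(exact_B, (), 2, theta2=0.25, out=found)
assert found == [(1, 1)]
```

Because each discrepancy is resolved at the node where it appears, no already-visited subtree is ever re-entered, which matches the claim that the fix adds no superfluous run time.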
\n\n4.4 DETERMINING THE PARAMETERS \n\nOne of our aims was to determine the values of the different parameters, m1, m2 and \u03b8. Recall that in our algorithm we calculate B_\u03b1, the approximation of E_x[f_\u03b1^2(x)], where m1 is the number of times we sample x in order to make this approximation. We sample y randomly m2 times to approximate f_\u03b1(x_i) = E_y[f(y x_i) \u03c7_\u03b1(y)], for each x_i. This approximation of f_\u03b1(x_i) has a standard deviation of approximately 1/sqrt(m2). Assume that the true value is \u03b2_i, i.e., \u03b2_i = f_\u03b1(x_i); then we expect the contribution of the i-th element to B_\u03b1 to be (\u03b2_i \u00b1 1/sqrt(m2))^2 = \u03b2_i^2 \u00b1 2\u03b2_i/sqrt(m2) + 1/m2. The algorithm tests whether B_\u03b1 = (1/m1) sum_i \u03b2_i^2 >= \u03b8^2; therefore, to ensure a low error, based on the above argument, we choose m2 on the order of 1/\u03b8^2. \n\nChoosing the right value for m2 is of great importance. We have noticed on more than one occasion that increasing the value of m2 actually decreases the overall run time. This is not obvious at first: seemingly, any increase in the number of times we loop in the algorithm only increases the run time. However, a more accurate value for m2 means a more accurate approximation of the TEST predicate, and therefore less chance of redundant recursive calls (the run time is linear in the number of recursive calls). We can see this exemplified in Figure (7), where the number of recursive calls increases drastically as m2 decreases. In order to present Figure (7), we learned the same 3-term DNF, always using \u03b8 = 0.05 and m1 * m2 = 100000. The trials differ in the specific values chosen in each trial for m2. \n\n^1 The \"peaked\" distribution of the results is not coincidental. The FT of the majority function has 42 large equal coefficients, labeled c_maj: one for each singleton (a vector of the form 0..010..0) and one for parity (the all-ones vector). The singletons of an input vector with z zeros will contribute (2z - 41) c_maj to the result, and the parity will contribute \u00b1c_maj (depending on whether z is odd or even), so that the total contribution is an even factor of c_maj. Since c_maj \u2248 0.12, we have peaks around factors of 0.24. The distribution around the peaks is due to the fact that we only approximate each coefficient and get a value close to c_maj. \n\nFigure 7: Determining m2. Note that the number of recursive calls grows dramatically as m2's value decreases. For example, for m2 = 400, the number of recursive calls is 14,433 compared with only 1,329 recursive calls for m2 = 500. \n\nSPECIAL CASES: When k = |\u03b1| is either very small or very large, the values we choose for m1 and m2 can be self-defeating: when k is close to n we still loop m1 (> 2^{n-k}) times, though often without gaining additional information. The same holds for very small values of k and the corresponding m2 (> 2^k) values. We therefore add the following feature: for small and large values of k we calculate the expected value exactly, thereby decreasing the run time and increasing accuracy. \n\n5 CONCLUSIONS \n\nIn this work we implemented the FT algorithm and showed it to be a useful practical tool as well as a powerful theoretical technique. We reviewed the major enhancements the algorithm underwent in the process. The algorithm successfully recovers functions in a reasonable amount of time. Furthermore, we have shown that the algorithm naturally derives a confidence parameter. This parameter enables the user in many cases to conclude that the prediction received is accurate with extremely high probability, even if the overall error probability is not negligible. 
\n\nAcknowledgements \n\nThis research was supported in part by The Israel Science Foundation, administered by The Israel Academy of Science and Humanities, and by a grant of the Israeli Ministry of Science and Technology. \n\nReferences \n\n[1] Mihir Bellare. A technique for upper bounding the spectral norm with applications to learning. In 5th Annual Workshop on Computational Learning Theory, pages 62-70, July 1992. \n\n[2] Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In The 26th Annual ACM Symposium on Theory of Computing, pages 253-262, 1994. \n\n[3] Merrick L. Furst, Jeffrey C. Jackson, and Sean W. Smith. Improved learning of AC^0 functions. In 4th Annual Workshop on Computational Learning Theory, pages 317-325, August 1991. \n\n[4] J. Jackson. An efficient membership-query algorithm for learning DNF with respect to the uniform distribution. In Annual Symposium on Foundations of Computer Science, pages 42-53, 1994. \n\n[5] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum. SIAM Journal on Computing, 22(6):1331-1348, 1993. \n\n[6] N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, Fourier transform and learnability. JACM, 40(3):607-620, 1993. \n\n[7] Y. Mansour. Learning Boolean Functions via the Fourier Transform. In Advances in Neural Computation, edited by V.P. Roychowdhury, K-Y. Siu and A. Orlitsky, Kluwer Academic Publishers, 1994. Can be accessed via ftp://ftp.math.tau.ac.il/pub/mansour/PAPERS/LEARNING/fourier-survey.ps.Z. \n\n[8] Yishay Mansour. An O(n^{log log n}) learning algorithm for DNF under the uniform distribution. Journal of Computer and System Sciences, 50(3):543-550, 1995. 
", "award": [], "sourceid": 1054, "authors": [{"given_name": "Yishay", "family_name": "Mansour", "institution": null}, {"given_name": "Sigal", "family_name": "Sahar", "institution": null}]}