{"title": "Near-minimax recursive density estimation on the binary hypercube", "book": "Advances in Neural Information Processing Systems", "page_first": 1305, "page_last": 1312, "abstract": "This paper describes a recursive estimation procedure for multivariate binary densities using orthogonal expansions. For $d$ covariates, there are $2^d$ basis coefficients to estimate, which renders conventional approaches computationally prohibitive when $d$ is large. However, for a wide class of densities that satisfy a certain sparsity condition, our estimator runs in probabilistic polynomial time and adapts to the unknown sparsity of the underlying density in two key ways: (1) it attains near-minimax mean-squared error, and (2) the computational complexity is lower for sparser densities. Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity.", "full_text": "Near-Minimax Recursive Density Estimation\n\non the Binary Hypercube\n\nMaxim Raginsky\nDuke University\n\nDurham, NC 27708\nm.raginsky@duke.edu\n\nSvetlana Lazebnik\nUNC Chapel Hill\n\nChapel Hill, NC 27599\nlazebnik@cs.unc.edu\n\nRebecca Willett\nDuke University\n\nDurham, NC 27708\nwillett@duke.edu\n\nJorge Silva\n\nDuke University\n\nDurham, NC 27708\njg.silva@duke.edu\n\nAbstract\n\nThis paper describes a recursive estimation procedure for multivariate binary den-\nsities using orthogonal expansions. For d covariates, there are 2d basis coef\ufb01cients\nto estimate, which renders conventional approaches computationally prohibitive\nwhen d is large. However, for a wide class of densities that satisfy a certain spar-\nsity condition, our estimator runs in probabilistic polynomial time and adapts to\nthe unknown sparsity of the underlying density in two key ways: (1) it attains\nnear-minimax mean-squared error, and (2) the computational complexity is lower\nfor sparser densities. 
Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity.\n\n1 Introduction\n\nMultivariate binary data arise in a variety of fields, such as biostatistics [1], econometrics [2] or artificial intelligence [3]. In these and other settings, it is often necessary to estimate a probability density from a number of independent observations. Formally, we have $n$ i.i.d. samples from a probability density $f$ (with respect to the counting measure) on the $d$-dimensional binary hypercube $B^d$, $B \triangleq \{0,1\}$, and seek an estimate $\hat f$ of $f$ with a small mean-squared error\n\n$$\mathrm{MSE}(f, \hat f) \triangleq \mathbb{E}\Big\{ \sum_{x \in B^d} (f(x) - \hat f(x))^2 \Big\}.$$\n\nIn many cases of practical interest, the number of covariates $d$ is much larger than $\log n$, so direct estimation of $f$ as a multinomial density with $2^d$ parameters is both unreliable and impractical. Thus, one has to resort to "nonparametric" methods and search for good estimators in a suitably defined class whose complexity grows with $n$. Some nonparametric methods proposed in the literature, such as kernels [4] and orthogonal expansions [5, 6], either have very slow rates of MSE convergence or are computationally prohibitive for large $d$. For example, the kernel method [4] requires $O(n2^d)$ operations to compute the estimate at any $x \in B^d$, yet its MSE decays as $O(n^{-4/(4+d)})$ [7], which is extremely slow when $d$ is large. In contrast, orthogonal function methods generally have much better MSE decay rates, but rely on estimating $2^d$ coefficients in a fixed basis, which requires enormous computational resources for large $d$. For instance, using the Fast Hadamard Transform to estimate the coefficients in the so-called Walsh basis from $n$ samples requires $O(nd2^d)$ operations [5].\n\nIn this paper we take up the problem of accurate, computationally tractable estimation of a density on the binary hypercube. 
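The $2^d$ bottleneck just described can be made concrete with a minimal in-place fast Walsh–Hadamard transform (an illustrative sketch in Python, not code from the paper): even this fast variant must touch all $2^d$ entries, costing $O(d2^d)$ butterfly operations.

```python
def fwht(a):
    """In-place fast Walsh-Hadamard transform (unnormalized).

    For len(a) == 2^d, performs d passes of 2^(d-1) butterflies each,
    i.e., O(d * 2^d) operations -- fast, but still exponential in d.
    """
    n, h = len(a), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a
```

Applying the transform twice recovers the input scaled by $2^d$, e.g. `fwht(fwht([1, 2, 3, 4]))` gives `[4, 8, 12, 16]`.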
We take the minimax point of view, where we assume that $f$ comes from a particular function class $\mathcal{F}$ and seek an estimator that approximately attains the minimax MSE\n\n$$R^*_n(\mathcal{F}) \triangleq \inf_{\hat f} \sup_{f \in \mathcal{F}} \mathrm{MSE}(f, \hat f),$$\n\nwhere the infimum is over all estimators based on $n$ i.i.d. samples. We will define our function class to reflect another feature often encountered in situations involving multivariate binary data: namely, that the shape of the underlying density is strongly influenced by small constellations of the $d$ covariates. For example, when working with panel data [2], it may be the case that the answers to some specific subset of questions are highly correlated among a particular group of the panel participants, while the responses of these participants to the other questions are nearly random; moreover, there may be several such distinct groups in the panel. To model such "constellation effects" mathematically, we will consider classes of densities that satisfy a particular sparsity condition.\n\nOur contribution consists in developing a thresholding density estimator that adapts to the unknown sparsity of the underlying density in two key ways: (1) it is near-minimax optimal, with the error decay rate depending on the sparsity, and (2) it can be implemented using a recursive algorithm that runs in probabilistic polynomial time and whose computational complexity is lower for sparser densities. The algorithm entails recursively examining empirical estimates of whole blocks of the $2^d$ basis coefficients. At each stage of the algorithm, the weights of the coefficients estimated at previous stages are used to decide which remaining coefficients are most likely to be significant, and computing resources are allocated accordingly. We show that this decision is accurate with high probability. 
An additional attractive feature of our approach is that it gives us a principled way of trading off MSE against computational complexity by controlling the decay of the threshold as a function of the recursion depth.\n\n2 Preliminaries\n\nWe first list some definitions and results needed in the sequel. Throughout the paper, $C$ and $c$ denote generic constants whose values may change from line to line. For two real numbers $a$ and $b$, $a \wedge b$ and $a \vee b$ denote, respectively, the smaller and the larger of the two.\n\nBiased Walsh bases. Let $\mu_d$ denote the counting measure on the $d$-dimensional binary hypercube $B^d$. Then the space of all real-valued functions on $B^d$ is the real Hilbert space $L^2(\mu_d)$ with the standard inner product $\langle f, g \rangle \triangleq \sum_{x \in B^d} f(x) g(x)$. Given any $\eta \in (0,1)$, we can construct an orthonormal system $\Phi_{d,\eta}$ in $L^2(\mu_d)$ as follows. Define two functions $\varphi_{0,\eta}, \varphi_{1,\eta} : B \to \mathbb{R}$ by\n\n$$\varphi_{0,\eta}(x) \triangleq (1-\eta)^{x/2} \eta^{(1-x)/2} \quad \text{and} \quad \varphi_{1,\eta}(x) \triangleq (-1)^x \eta^{x/2} (1-\eta)^{(1-x)/2}, \qquad x \in \{0,1\}. \quad (1)$$\n\nNow, for any $s = (s(1), \ldots, s(d)) \in B^d$ define the function $\varphi_{s,\eta} : B^d \to \mathbb{R}$ by\n\n$$\varphi_{s,\eta}(x) \triangleq \prod_{i=1}^d \varphi_{s(i),\eta}(x(i)), \qquad \forall x = (x(1), \ldots, x(d)) \in B^d \quad (2)$$\n\n(this is written more succinctly as $\varphi_{s,\eta} = \varphi_{s(1),\eta} \otimes \cdots \otimes \varphi_{s(d),\eta}$, where $\otimes$ is the tensor product). The set $\Phi_{d,\eta} = \{\varphi_{s,\eta} : s \in B^d\}$ is an orthonormal system in $L^2(\mu_d)$, which is referred to as the Walsh system with bias $\eta$ [8, 9]. Any function $f \in L^2(\mu_d)$ can be uniquely represented as\n\n$$f = \sum_{s \in B^d} \theta_{s,\eta} \varphi_{s,\eta},$$\n\nwhere $\theta_{s,\eta} = \langle f, \varphi_{s,\eta} \rangle$. 
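As a sanity check on (1) and (2), the following sketch builds $\Phi_{d,\eta}$ for a small $d$ and verifies orthonormality numerically (the choice $d = 3$, $\eta = 0.3$ and the helper names are ours):

```python
import itertools

def phi_bit(s_bit, x_bit, eta):
    # One-coordinate biased Walsh functions, Eq. (1).
    if s_bit == 0:
        return (1 - eta) ** (x_bit / 2) * eta ** ((1 - x_bit) / 2)
    return (-1) ** x_bit * eta ** (x_bit / 2) * (1 - eta) ** ((1 - x_bit) / 2)

def phi(s, x, eta):
    # Tensor-product construction, Eq. (2).
    p = 1.0
    for sb, xb in zip(s, x):
        p *= phi_bit(sb, xb, eta)
    return p

d, eta = 3, 0.3
cube = list(itertools.product((0, 1), repeat=d))
# Gram matrix of Phi_{d,eta} under the counting measure; it should be
# the 2^d x 2^d identity, i.e., the system is orthonormal in L2(mu_d).
gram = [[sum(phi(s, x, eta) * phi(t, x, eta) for x in cube) for t in cube]
        for s in cube]
```

The same check passes for any $\eta \in (0,1)$, since each one-coordinate pair in (1) is orthonormal on $\{0,1\}$ and orthonormality is preserved by tensor products.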
When $\eta = 1/2$, we get the standard Walsh system used in [5, 6]; in that case, we shall omit the index $\eta = 1/2$ for simplicity. The product structure of the biased Walsh bases makes them especially convenient for statistical applications, as it allows for a computationally efficient recursive method for computing accurate estimates of squared coefficients in certain hierarchically structured sets.\n\nSparsity and weak-$\ell^p$ balls. We are interested in densities whose representations in some biased Walsh basis satisfy a certain sparsity constraint. Given $\eta \in (0,1)$ and a function $f \in L^2(\mu_d)$, let $\theta(f)$ denote the list of its coefficients in $\Phi_{d,\eta}$. We are interested in cases when the components of $\theta(f)$ decay according to a power law. Formally, let $\theta_{(1)}, \ldots, \theta_{(M)}$, where $M = 2^d$, be the components of $\theta(f)$ arranged in decreasing order of magnitude: $|\theta_{(1)}| \ge |\theta_{(2)}| \ge \cdots \ge |\theta_{(M)}|$. Given some $0 < p < \infty$, we say that $\theta(f)$ belongs to the weak-$\ell^p$ ball of radius $R$ [10], and write $\theta(f) \in w\ell^p(R)$, if\n\n$$|\theta_{(m)}| \le R \cdot m^{-1/p}, \qquad 1 \le m \le M. \quad (3)$$\n\nIt is not hard to show that the coefficients of any probability density on $B^d$ in $\Phi_{d,\eta}$ are bounded by $R(\eta) = [\eta \vee (1-\eta)]^{d/2}$. With this in mind, let us define the class $\mathcal{F}_d(p,\eta)$ of all functions $f$ on $B^d$ satisfying $\theta(f) \in w\ell^p(R(\eta))$ in $\mathbb{R}^M$. We are particularly interested in the case $0 < p < 2$. When $\eta = 1/2$, with $R(\eta) = 2^{-d/2}$, we shall write simply $\mathcal{F}_d(p)$.\n\nWe will need approximation properties of weak-$\ell^p$ balls as listed, e.g., in [11]. 
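Condition (3) is straightforward to test directly for a given coefficient vector; a small sketch (the helper name is ours):

```python
def in_weak_lp_ball(theta, p, R):
    """Check the power-law condition (3): the m-th largest magnitude
    must be at most R * m^(-1/p), for m = 1, ..., M."""
    mags = sorted((abs(t) for t in theta), reverse=True)
    return all(mag <= R * m ** (-1.0 / p)
               for m, mag in enumerate(mags, start=1))
```

For instance, coefficients decaying like $0.4\,m^{-2}$ lie in $w\ell^{1/2}(0.5)$, while a vector with two entries of magnitude 1 does not.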
The basic fact is that the power-law condition (3) is equivalent to the concentration estimate\n\n$$\big|\{ s \in B^d : |\theta_s| \ge \lambda \}\big| \le (R/\lambda)^p, \qquad \forall \lambda > 0. \quad (4)$$\n\nFor any $1 \le k \le M$, let $\theta_k(f)$ denote the vector $\theta(f)$ with $\theta_{(k+1)}, \ldots, \theta_{(M)}$ set to zero. Then it follows from (3) that $\|\theta(f) - \theta_k(f)\|_{\ell^2_M} \le C R k^{-r}$, where $r \triangleq 1/p - 1/2$, and $C$ is some constant that depends only on $p$. Given any $f \in \mathcal{F}_d(p,\eta)$ and denoting by $f_k$ the function obtained from it by retaining only the $k$ largest coefficients, we get from Parseval's identity that\n\n$$\|f - f_k\|_{L^2(\mu_d)} \le C R k^{-r}. \quad (5)$$\n\nTo get a feeling for what the classes $\mathcal{F}_d(p,\eta)$ could model in practice, we note that, for a fixed $\eta \in (0,1)$, the product of $d$ Bernoulli($\eta^*$) densities with $\eta^* \triangleq \sqrt{\eta}/(\sqrt{\eta} + \sqrt{1-\eta})$ is the unique sparsest density in the entire scale of $\mathcal{F}_d(p,\eta)$ spaces with $0 < p < 2$: all of its coefficients in $\Phi_{d,\eta}$ are zero, except for $\theta_{s,\eta}$ with $s = (0, \ldots, 0)$, which is equal to $(\eta^*/\sqrt{\eta})^d$. Other densities in $\{\mathcal{F}_d(p,\eta) : 0 < p < 2\}$ include, for example, mixtures of components that, up to a permutation of $\{1, \ldots, d\}$, can be written as a tensor product of a large number of Bernoulli($\eta^*$) densities and some other density. The parameter $\eta$ can be interpreted either as the default noise level in measuring an individual covariate or as a smoothness parameter that interpolates between the point masses $\delta_{(0,\ldots,0)}$ and $\delta_{(1,\ldots,1)}$. 
We assume that $\eta$ is known (e.g., from some preliminary exploration of the data or from domain-specific prior information) and fixed.\n\nIn the following, we limit ourselves to the "noisiest" case $\eta = 1/2$ with $R(1/2) = 2^{-d/2}$. Our theory can be easily modified to cover any other $\eta \in (0,1)$: one would need to replace $R = 2^{-d/2}$ with the corresponding $R(\eta)$ and use the bound $\|\varphi_{s,\eta}\|_\infty \le R(\eta)$ instead of $\|\varphi_s\|_\infty \le 2^{-d/2}$ when estimating variances and higher moments.\n\n3 Density estimation via recursive Walsh thresholding\n\nWe now turn to our problem of estimating a density $f$ on $B^d$ from a sample $\{X_i\}_{i=1}^n$ when $f \in \mathcal{F}_d(p)$ for some unknown $0 < p < 2$. The minimax theory for weak-$\ell^p$ balls [10] says that\n\n$$R^*_n(\mathcal{F}_d(p)) \ge C M^{-p/2} n^{-2r/(2r+1)}, \qquad r = 1/p - 1/2,$$\n\nwhere $M = 2^d$. We shall construct an estimator that adapts to the unknown sparsity of $f$, in the sense that it achieves this minimax rate up to a logarithmic factor without prior knowledge of $p$ and that its computational complexity improves as $p \to 0$.\n\nOur method is based on the thresholding of empirical Walsh coefficients. A thresholding estimator is any estimator of the form\n\n$$\hat f = \sum_{s \in B^d} I_{\{T(\hat\theta_s) \ge \lambda_n\}} \hat\theta_s \varphi_s,$$\n\nwhere $\hat\theta_s = (1/n) \sum_{i=1}^n \varphi_s(X_i)$ are empirical estimates of the Walsh coefficients of $f$, $T(\cdot)$ is some statistic, and $I_{\{\cdot\}}$ is an indicator function. The threshold $\lambda_n$ depends on the sample size. For example, in [5, 6] the statistic $T(\hat\theta_s) = \hat\theta_s^2$ was used with the threshold $\lambda_n = 1/(M(n+1))$. This
This\nchoice was motivated by the considerations of bias-variance trade-off for each individual coef\ufb01cient.\n\nbf = Xs\u2208Bd\nwhere b\u03b8s = (1/n)Pn\nexample, in [5, 6] the statistic T (b\u03b8s) = b\u03b82\nThe main disadvantage of such direct methods is the need to estimate all M = 2d Walsh coef\ufb01cients.\nWhile this is not an issue when d \u224d log n, it is clearly impractical when d \u226b log n. To deal with this\nissue, we will consider a recursive thresholding approach that will allow us to reject whole groups\nof coef\ufb01cients based on ef\ufb01ciently computable statistics. This approach is motivated as follows. For\nany 1 \u2264 k \u2264 d, we can write any f \u2208 L2(\u00b5d) with the Walsh coef\ufb01cients \u03b8(f ) as\n\nf = Xu\u2208Bk Xv\u2208Bd\u2212k\n\n\u03b8uv\u03d5uv = Xu\u2208Bk\n\nfu \u2297 \u03d5u,\n\n\f\u25b3\n\n\u25b3= kfuk2\n\nwhere uv denotes the concatenation of u \u2208 Bk and v \u2208 Bd\u2212k and, for each u \u2208 Bk, fu\n=\nPv\u2208Bd\u2212k \u03b8uv\u03d5v lies in L2(\u00b5d\u2212k). By Parseval\u2019s identity, Wu\nL2(\u00b5d\u2212k) = Pv\u2208Bd\u2212k \u03b82\nuv.\nuv < \u03bb for every v \u2208 Bd\u2212k. Thus, we could\nThis means that if Wu < \u03bb for some u \u2208 Bk, then \u03b82\nstart at u = 0 and u = 1 and check whether Wu \u2265 \u03bb. If not, then we would discard all \u03b8uv with\nv \u2208 Bd\u22121; otherwise, we would proceed on to u0 and u1. At the end of this process, we will be left\ns \u2265 \u03bb. Let f\u03bb denote the resulting function. If f \u2208 Fd(p) for\nonly with those s \u2208 Bd for which \u03b82\nsome p, then we will have kf \u2212 f\u03bbk2\nL2(\u00b5d) \u2264 CM \u22121(M \u03bb)\u22122r/(2r+1).\nWe will follow this reasoning in constructing our estimator. We begin by developing an estimator\nfor Wu. 
We will use the following fact, easily proved using the definitions (1) and (2) of the Walsh functions: for any density $f$ on $B^d$, any $k$ and $u \in B^k$, we have\n\n$$f_u(y) = \mathbb{E}_f\big\{ \varphi_u(\pi_k(X)) I_{\{\sigma_k(X) = y\}} \big\}, \ \forall y \in B^{d-k} \quad \text{and} \quad W_u = \mathbb{E}_f\big\{ \varphi_u(\pi_k(X)) f_u(\sigma_k(X)) \big\},$$\n\nwhere $\pi_k(x) \triangleq (x(1), \ldots, x(k))$ and $\sigma_k(x) \triangleq (x(k+1), \ldots, x(d))$ for any $x \in B^d$. This suggests that we can estimate $W_u$ by\n\n$$\widehat W_u = \frac{1}{n^2} \sum_{i_1=1}^n \sum_{i_2=1}^n \varphi_u(\pi_k(X_{i_1})) \varphi_u(\pi_k(X_{i_2})) I_{\{\sigma_k(X_{i_1}) = \sigma_k(X_{i_2})\}}. \quad (6)$$\n\nUsing induction and Eqs. (1) and (2), we can prove that $\widehat W_u = \sum_{v \in B^{d-k}} \hat\theta_{uv}^2$. An advantage of computing $\widehat W_u$ indirectly via (6), rather than as a sum of $\hat\theta_{uv}^2$, $v \in B^{d-k}$, is that, while the latter has $O(2^{d-k} n)$ complexity, the former has only $O(n^2 d)$ complexity. This can lead to significant computational savings for small $k$. When $k \ge d - \log(nd)$, it becomes more efficient to use the direct estimator.\n\nNow we can define our density estimation procedure. Instead of using a single threshold for all $1 \le k \le d$, we consider a more flexible strategy: for every $k$, we shall compare each $\widehat W_u$ to a threshold that depends not only on $n$, but also on $k$. Specifically, we will let\n\n$$\lambda_{k,n} = \frac{\alpha_k \log n}{n}, \qquad 1 \le k \le d, \quad (7)$$\n\nwhere $\alpha = \{\alpha_k\}_{k=1}^d$ satisfies $\alpha_1 \ge \alpha_k \ge \alpha_d > 0$. (This $k$-dependent scaling will allow us to trade off MSE and computational complexity.) 
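The identity $\widehat W_u = \sum_v \hat\theta_{uv}^2$ and the recursive pruning it enables can be checked numerically. The sketch below (Python, for the unbiased case $\eta = 1/2$, where $\varphi_s(x) = 2^{-k/2}(-1)^{\langle s, x\rangle}$ on $B^k$; all helper names are ours) computes (6) by grouping samples on the suffix $\sigma_k(X_i)$, which collapses the double sum over pairs, and then implements the recursion in the spirit of Algorithm 1 below, with the threshold sequence passed as a dict indexed by depth:

```python
import random

def walsh(s, x):
    # Walsh function with bias 1/2 on B^k: 2^{-k/2} * (-1)^{<s, x>}.
    return (-1) ** sum(a & b for a, b in zip(s, x)) * 2 ** (-len(s) / 2)

def block_weight(u, samples):
    """Block-weight estimate (6). A pair (i1, i2) contributes only when
    the suffixes match, so grouping samples by suffix and squaring the
    per-group sums of phi_u(pi_k(X_i)) collapses the double sum."""
    n, k = len(samples), len(u)
    groups = {}
    for x in samples:
        groups.setdefault(x[k:], []).append(walsh(u, x[:k]))
    return sum(sum(g) ** 2 for g in groups.values()) / n ** 2

def recursive_walsh(u, samples, d, lam, out):
    """Descend into a child prefix u0 / u1 only if its estimated block
    weight clears the depth-(k+1) threshold; at depth d, keep the
    coefficient if its square clears lam[d]."""
    n, k = len(samples), len(u)
    if k == d:
        theta = sum(walsh(u, x) for x in samples) / n
        if theta ** 2 >= lam[d]:
            out[u] = theta
        return
    for b in (0, 1):
        if block_weight(u + (b,), samples) > lam[k + 1]:
            recursive_walsh(u + (b,), samples, d, lam, out)

# Toy run: d = 4, a product density biased on the first two covariates.
random.seed(1)
d, n = 4, 32
samples = [tuple(int(random.random() < (0.8 if i < 2 else 0.5))
                 for i in range(d)) for _ in range(n)]
lam = {k: 10.5 / 16384 for k in range(1, d + 1)}   # constant thresholds
found = {}
recursive_walsh((), samples, d, lam, found)
```

With constant thresholds, the output coincides with term-by-term thresholding of the empirical coefficients, since $\widehat W_u \ge \hat\theta_s^2$ for every prefix $u$ of $s$; the $k$-dependent schemes discussed below prune more aggressively.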
Given $\lambda = \{\lambda_{k,n}\}_{k=1}^d$, define the set $A(\lambda) \triangleq \{ s \in B^d : \widehat W_{\pi_k(s)} \ge \lambda_{k,n}, \ \forall 1 \le k \le d \}$ and the corresponding estimator\n\n$$\hat f_{\mathrm{RWT}} \triangleq \sum_{s \in B^d} I_{\{s \in A(\lambda)\}} \hat\theta_s \varphi_s, \quad (8)$$\n\nwhere RWT stands for "recursive Walsh thresholding." To implement $\hat f_{\mathrm{RWT}}$ on a computer, we adapt the algorithm of Goldreich and Levin [12], originally developed for cryptography and later applied to the problem of learning Boolean functions from membership queries [13]: we call the routine RECURSIVEWALSH, shown in Algorithm 1, with $u = \emptyset$ (the empty string) and with $\lambda$ from (7).\n\nAlgorithm 1 RECURSIVEWALSH($u, \lambda$)\n$k \leftarrow \mathrm{length}(u)$\nif $k = d$ then\n  compute $\hat\theta_u \leftarrow \frac{1}{n}\sum_{i=1}^n \varphi_u(X_i)$; if $\hat\theta_u^2 \ge \lambda_{d,n}$ then output $u, \hat\theta_u$; return\nend if\ncompute $\widehat W_{u0} \leftarrow \frac{1}{n^2}\sum_{i_1=1}^n\sum_{i_2=1}^n \varphi_{u0}(\pi_{k+1}(X_{i_1}))\varphi_{u0}(\pi_{k+1}(X_{i_2})) I_{\{\sigma_{k+1}(X_{i_1})=\sigma_{k+1}(X_{i_2})\}}$\ncompute $\widehat W_{u1} \leftarrow \frac{1}{n^2}\sum_{i_1=1}^n\sum_{i_2=1}^n \varphi_{u1}(\pi_{k+1}(X_{i_1}))\varphi_{u1}(\pi_{k+1}(X_{i_2})) I_{\{\sigma_{k+1}(X_{i_1})=\sigma_{k+1}(X_{i_2})\}}$\nif $\widehat W_{u0} > \lambda_{k+1,n}$ then RECURSIVEWALSH($u0, \lambda$) end if\nif $\widehat W_{u1} > \lambda_{k+1,n}$ then RECURSIVEWALSH($u1, \lambda$) end if\n\nAnalysis of the estimator. We now turn to the asymptotic analysis of the MSE and the computational complexity of $\hat f_{\mathrm{RWT}}$. We first prove that $\hat f_{\mathrm{RWT}}$ adapts to the unknown sparsity of $f$:\n\nTheorem 3.1 Suppose the threshold sequence $\lambda = \{\lambda_k\}_{k=1}^d$ is such that $\alpha_d \ge (20d + 25)^2/2^d$. Then for all $0 < p < 2$ the estimator (8) satisfies\n\n$$\sup_{f \in \mathcal{F}_d(p)} \mathrm{MSE}(f, \hat f_{\mathrm{RWT}}) = \sup_{f \in \mathcal{F}_d(p)} \mathbb{E}_f \|f - \hat f_{\mathrm{RWT}}\|^2_{L^2(\mu_d)} \le \frac{C}{2^d} \left( \frac{2^d \alpha_1 \log n}{n} \right)^{2r/(2r+1)}, \quad (9)$$\n\nwhere the constant $C$ depends only on $p$.\n\nProof: Let us decompose the squared $L^2$ error of $\hat f_{\mathrm{RWT}}$ as\n\n$$\|f - \hat f_{\mathrm{RWT}}\|^2_{L^2(\mu_d)} = \sum_s I_{\{s \in A(\lambda)\}} (\theta_s - \hat\theta_s)^2 + \sum_s I_{\{s \in A(\lambda)^c\}} \theta_s^2 \equiv T_1 + T_2.$$\n\nWe start by observing that $s \in A(\lambda)$ only if $\hat\theta_s^2 \ge \lambda_{d,n}$, while for any $s \in A(\lambda)^c$ there exists some $1 \le k \le d$ such that $\hat\theta_s^2 < \lambda_{k,n} \le \lambda_{1,n}$. Defining the sets $A_1 = \{ s \in B^d : \hat\theta_s^2 \ge \lambda_{d,n} \}$ and $A_2 = \{ s \in B^d : \hat\theta_s^2 < \lambda_{1,n} \}$, we get $T_1 \le \sum_s I_{\{s \in A_1\}} (\theta_s - \hat\theta_s)^2$ and $T_2 \le \sum_s I_{\{s \in A_2\}} \theta_s^2$. Further, defining $B = \{ s \in B^d : \theta_s^2 < \lambda_{d,n}/2 \}$ and $S = \{ s \in B^d : \theta_s^2 \ge 3\lambda_{1,n}/2 \}$, we can write\n\n$$T_1 = \sum_s I_{\{s \in A_1 \cap B\}} (\theta_s - \hat\theta_s)^2 + \sum_s I_{\{s \in A_1 \cap B^c\}} (\theta_s - \hat\theta_s)^2 \equiv T_{11} + T_{12},$$\n$$T_2 = \sum_s I_{\{s \in A_2 \cap S\}} \theta_s^2 + \sum_s I_{\{s \in A_2 \cap S^c\}} \theta_s^2 \equiv T_{21} + T_{22}.$$\n\nFirst we deal with the easy terms $T_{12}$, $T_{22}$. Applying (4), (5) and a bit of algebra, we get\n\n$$\mathbb{E}\, T_{12} \le \frac{1}{Mn} \big|\{ s : \theta_s^2 \ge \lambda_{d,n}/2 \}\big| \le \frac{1}{Mn} \left( \frac{2}{M\lambda_{d,n}} \right)^{p/2} \le \frac{C}{M}\, n^{-2r/(2r+1)}, \quad (10)$$\n$$\mathbb{E}\, T_{22} \le \sum_{s \in B^d} I_{\{\theta_s^2 < (3\alpha_1/2) \log n / n\}}\, \theta_s^2 \le \frac{C}{M} \left( \frac{M \alpha_1 \log n}{n} \right)^{2r/(2r+1)}. \quad (11)$$\n\nNext we deal with the large-deviation terms $T_{11}$ and $T_{21}$. Using Cauchy–Schwarz, we get\n\n$$\mathbb{E}\, T_{11} \le \sum_s \big[ \mathbb{E} (\theta_s - \hat\theta_s)^4 \cdot \mathbb{P}(s \in A_1 \cap B) \big]^{1/2}. \quad (12)$$\n\nTo estimate the fourth moment in (12), we use Rosenthal's inequality [14] to get $\mathbb{E}(\theta_s - \hat\theta_s)^4 \le c/(M^2 n^2)$. To bound the probability that $s \in A_1 \cap B$, we observe that $s \in A_1 \cap B$ implies $|\hat\theta_s - \theta_s| \ge (1/5)\sqrt{\lambda_{d,n}}$, and then use Bernstein's inequality [14] to get\n\n$$\mathbb{P}\big( |\hat\theta_s - \theta_s| \ge (1/5)\sqrt{\lambda_{d,n}} \big) \le 2 \exp\left( -\frac{\beta^2 \log n}{2(1 + 2\beta/3)} \right) = 2 n^{-\beta^2/[2(1+2\beta/3)]} \le 2 n^{-(\beta-1)/2}$$\n\nwith $\beta = (1/5)\sqrt{M \alpha_d} \ge 4d + 5$. Since $n^{-(\beta-1)/2} \le n^{-2(d+1)}$, we have\n\n$$\mathbb{E}\, T_{11} \le C n^{-(d+1)} \le C/(Mn). \quad (13)$$\n\nFinally, $\mathbb{E}\, T_{21} \le \sum_s \mathbb{P}(s \in A_2 \cap S)\, \theta_s^2$. Using the same argument as above, we get $\mathbb{P}(s \in A_2 \cap S) \le 2 n^{-(\gamma-1)/2}$, where $\gamma = (1/5)\sqrt{M \alpha_1}$. Since $\theta_s^2 \le 1/M$ for all $s \in B^d$ and since $\gamma \ge \beta$, this gives\n\n$$\mathbb{E}\, T_{21} \le 2 n^{-2(d+1)} \le 2/(Mn). \quad (14)$$\n\nPutting together Eqs. (10), (11), (13), and (14), we get (9), and the theorem is proved. $\square$\n\nOur second result concerns the running time of Algorithm 1. Let $K(\alpha, p) \triangleq \sum_{k=1}^d \alpha_k^{-p/2}$.\n\nTheorem 3.2 Given any $\delta \in (0,1)$, provided each $\alpha_k$ is chosen so that\n\n$$\sqrt{2^k \alpha_k n \log n} \ge 5 \big[ C_2 \sqrt{n} + (\log(d/\delta) + k)/\log e \big], \quad (15)$$\n\nAlgorithm 1 runs in $O(n^2 d\, (n/(M \log n))^{p/2} K(\alpha, p))$ time with probability at least $1 - \delta$.\n\nProof: The complexity is determined by the number of calls to RECURSIVEWALSH. For each $k$, a call to RECURSIVEWALSH is made at every $u \in B^k$ with $\widehat W_u \ge \lambda_{k,n}$. Let us say that a call to RECURSIVEWALSH($u, \lambda$) is correct if $W_u \ge \lambda_{k,n}/2$. We will show that, with probability at least $1 - \delta$, only the correct calls are made. The probability of making at least one incorrect call is\n\n$$\mathbb{P}\left( \bigcup_{k=1}^d \bigcup_{u \in B^k} \{ \widehat W_u \ge \lambda_{k,n}, W_u < \lambda_{k,n}/2 \} \right) \le \sum_{k=1}^d \sum_{u \in B^k} \mathbb{P}\big( \widehat W_u \ge \lambda_{k,n}, W_u < \lambda_{k,n}/2 \big).$$\n\nFor a given $u \in B^k$, $\widehat W_u \ge \lambda_{k,n}$ and $W_u < \lambda_{k,n}/2$ together imply that $\|f_u - \hat f_u\|_{L^2(\mu_{d-k})} \ge (1/5)\sqrt{\lambda_{k,n}}$, where $\hat f_u \triangleq \sum_{v \in B^{d-k}} \hat\theta_{uv} \varphi_v$. 
Now, it can be shown that, for every $u \in B^k$, the norm $\|f_u - \hat f_u\|_{L^2(\mu_{d-k})}$ can be expressed as a supremum of an empirical process [15] over a certain function class that depends on $k$ (details are omitted for lack of space). We can then use Talagrand's concentration-of-measure inequality for empirical processes [16] to get\n\n$$\mathbb{P}\big( \widehat W_u \ge \lambda_{k,n}, W_u < \lambda_{k,n}/2 \big) \le \exp\big\{ -n C_1 (2^k a_{k,n}^2 \wedge 2^{k/2} a_{k,n}) \big\},$$\n\nwhere $a_{k,n} = (1/5)\sqrt{\alpha_k \log n / n} - C_2/\sqrt{2^k n}$, and $C_1, C_2$ are the absolute constants in Talagrand's bound. If we choose $\alpha_k$ as in (15), then $\mathbb{P}(\widehat W_u \ge \lambda_{k,n}, W_u < \lambda_{k,n}/2) \le \delta/(d 2^{d-k})$ for all $u \in B^k$. Summing over $k$ and $u \in B^k$, we see that, with probability $\ge 1 - \delta$, only the correct calls will be made.\n\nIt remains to bound the number of the correct calls. For each $k$, $W_u \ge \lambda_{k,n}/2$ implies that there exists at least one $v \in B^{d-k}$ such that $\theta_{uv}^2 \ge \lambda_{k,n}/2$. Since for every $1 \le k \le d$ each $\theta_s$ contributes to exactly one $W_u$, we have by the pigeonhole principle that\n\n$$\big|\{ u \in B^k : W_u \ge \lambda_{k,n}/2 \}\big| \le \big|\{ s \in B^d : \theta_s^2 \ge \lambda_{k,n}/2 \}\big| \le (2/(M\lambda_{k,n}))^{p/2},$$\n\nwhere in the second inequality we used (4) with $R = 1/\sqrt{M}$. Hence, the number of correct recursive calls is bounded by $N = \sum_{k=1}^d (2/(M\lambda_{k,n}))^{p/2} = (2n/(M \log n))^{p/2} K(\alpha, p)$. At each call, we compute an estimate of the corresponding $W_{u0}$ and $W_{u1}$, which requires $O(n^2 d)$ operations. Therefore, with probability at least $1 - \delta$, the time complexity will be as stated in the theorem. $\square$\n\nMSE vs. complexity. By controlling the rate at which the sequence $\alpha_k$ decays with $k$, we can trade off MSE against complexity. Consider the following two extreme cases: (1) $\alpha_1 = \cdots$
$= \alpha_d \sim 1/M$ and (2) $\alpha_k \sim 2^{d-k}/M$. The first case, which reduces to term-by-term thresholding, achieves the best bias–variance trade-off, with MSE of $O((\log n/n)^{2r/(2r+1)} (1/M))$. However, it has $K(\alpha, p) = O(M^{p/2} d)$, resulting in $O(d^2 n^2 (n/\log n)^{p/2})$ complexity. The second case, which leads to a very severe estimator that will tend to reject a lot of coefficients, has MSE of $O((\log n/n)^{2r/(2r+1)} M^{-1/(2r+1)})$, but $K(\alpha, p) = O(M^{p/2})$, leading to a considerably better $O(d n^2 (n/\log n)^{p/2})$ complexity. From the computational viewpoint, it is preferable to use rapidly decaying thresholds. However, this reduction in complexity is offset by a corresponding increase in MSE. In fact, using exponentially decaying $\alpha_k$'s is not advisable in practice, as their low complexity is mainly due to the fact that they tend to reject even the big coefficients very early on, especially when $d$ is large. To achieve a good balance between complexity and MSE, a moderately decaying threshold sequence might be best, e.g., $\alpha_k \sim (d-k+1)^m/M$ for some $m \ge 1$. As $p \to 0$, the effect of $\lambda$ on complexity becomes negligible, and the complexity tends to $O(n^2 d)$.\n\nPositivity and normalization issues. As is the case with orthogonal series estimators, $\hat f_{\mathrm{RWT}}$ may not necessarily be a bona fide density. In particular, there may be some $x \in B^d$ such that $\hat f_{\mathrm{RWT}}(x) < 0$, and it may happen that $\int \hat f_{\mathrm{RWT}} \, d\mu_d \ne 1$. In principle, this can be handled by clipping the negative values at zero and renormalizing, which can only improve the MSE. In practice, renormalization may be computationally expensive when $d$ is very large. 
If the estimate is suitably sparse, however, the renormalization can be carried out approximately using Monte Carlo methods.\n\n4 Simulations\n\nThe focus of our work is theoretical, consisting in the derivation of a recursive thresholding procedure for estimating multivariate binary densities (Algorithm 1), with a proof of its near-minimaxity and an asymptotic analysis of its complexity. Although an extensive empirical evaluation is outside the scope of this paper, we have implemented the proposed estimator, and now present some simulation results to demonstrate its small-sample performance. We generated synthetic observations from a mixture density $f$ on a 15-dimensional binary hypercube. The mixture has 10 components, where each component is a product density with 12 randomly chosen covariates having Bernoulli(1/2) distributions and the other three having Bernoulli(0.9) distributions. For $d = 15$, it is still feasible to quickly compute the ground truth, consisting of the 32768 values of $f$ and its Walsh coefficients. These values are shown in Fig. 1 (left). As can be seen from the coefficient profile in the bottom of the figure, this density is clearly sparse. Fig. 1 also shows the estimated probabilities and the Walsh coefficients for sample sizes $n = 5000$ (middle) and $n = 10000$ (right).\n\nFigure 1: Ground truth (left) and estimated density for $n = 5000$ (middle) and $n = 10000$ (right) with constant thresholding. Top: true and estimated probabilities (clipped at zero and renormalized) arranged in lexicographic order. Bottom: absolute values of true and estimated Walsh coefficients arranged in lexicographic order. 
For the estimated densities, the coefficient plots also show the threshold level (dotted line) and the absolute values of the rejected coefficients (lighter color).\n\nFigure 2: Small-sample performance of $\hat f_{\mathrm{RWT}}$ in estimating $f$ with three different thresholding schemes, each plotted against the sample size $n$: (a) MSE ($\times 2^d$); (b) running time (in seconds); (c) number of recursive calls; (d) number of coefficients retained by the algorithm. All results are averaged over five independent runs for each sample size (the error bars show the standard deviations).\n\nTo study the trade-off between MSE and complexity, we implemented three different thresholding schemes: (1) constant, $\lambda_{k,n} = 2\log n/(2^d n)$; (2) logarithmic, $\lambda_{k,n} = 2\log(d-k+2)\log n/(2^d n)$; and (3) linear, $\lambda_{k,n} = 2(d-k+1)\log n/(2^d n)$. Up to the $\log n$ factor (dictated by the theory), the thresholds at $k = d$ are set to twice the variance of the empirical estimate of any coefficient whose value is zero; this forces the estimator to reject empirical coefficients whose values cannot be reliably distinguished from zero. Occasionally, spurious coefficients get retained, as can be seen in Fig. 1 (middle) for the estimate for $n = 5000$. Fig. 2 shows the performance of $\hat f_{\mathrm{RWT}}$. Fig. 2(a) is a plot of MSE vs. sample size. 
In agreement with the theory, the MSE is smallest for the constant thresholding scheme [which is simply an efficient recursive implementation of a term-by-term thresholding estimator with $\lambda_n \sim \log n/(Mn)$], and it increases for the logarithmic and linear schemes. Fig. 2(b,c) shows the running time (in seconds) and the number of recursive calls made to RECURSIVEWALSH vs. the sample size. The number of recursive calls is a platform-independent way of gauging the computational complexity of the algorithm, although it should be kept in mind that each recursive call has $O(n^2 d)$ overhead. The running time increases polynomially with $n$, and is the largest for the constant scheme, followed by the logarithmic and the linear schemes. We see that, while the MSE of the logarithmic scheme is fairly close to that of the constant scheme, its complexity is considerably lower, in terms of both the number of recursive calls and the running time. In all three cases, the number of recursive calls decreases with $n$: the weight estimates become increasingly accurate as $n$ grows, which causes the expected number of false discoveries (i.e., making a recursive call at an internal node of the tree only to reject its descendants later) to decrease. Finally, Fig. 2(d) shows the number of coefficients retained in the estimate. This number grows with $n$ as a consequence of the fact that the threshold decreases with $n$, while the number of accurately estimated coefficients increases. The true density $f$ has 40 parameters: 9 to specify the weights of the components, 3 per component to locate the indices of the nonuniform covariates, and the single Bernoulli parameter of the nonuniform covariates. 
It is interesting to note that the maximal number of coefficients returned by our algorithm approaches 40.\n\nOverall, these preliminary simulation results show that our implemented estimator behaves in accordance with the theory even in the small-sample regime. The performance of the logarithmic thresholding scheme is especially encouraging, suggesting that it may be possible to trade off MSE against complexity in a way that will scale to large values of $d$. In the future, we plan to test our method on high-dimensional real data sets. Our particular interest is in social network data, e.g., records of meetings among large groups of individuals. These are represented by binary strings most of whose entries are zero (i.e., only a very small number of people are present at any given meeting). To model their densities, we plan to experiment with Walsh bases with $\eta$ biased toward unity.\n\nAcknowledgments\n\nThis work was supported by NSF CAREER Award No. CCF-06-43947 and DARPA Grant No. HR0011-07-1-003.\n\nReferences\n\n[1] I. Shmulevich and W. Zhang. Binary analysis and optimization-based normalization of gene expression data. Bioinformatics 18(4):555\u2013565, 2002.\n[2] J.M. Carro. Estimating dynamic panel data discrete choice models with fixed effects. J. Econometrics 140:503\u2013528, 2007.\n[3] Z. Ghahramani and K. Heller. Bayesian sets. NIPS 18:435\u2013442, 2006.\n[4] J. Aitchison and C.G.G. Aitken. Multivariate binary discrimination by the kernel method. Biometrika 63(3):413\u2013420, 1976.\n[5] J. Ott and R.A. Kronmal. Some classification procedures for multivariate binary data using orthogonal functions. J. Amer. Stat. Assoc. 71(354):391\u2013399, 1976.\n[6] W.-Q. Liang and P.R. Krishnaiah. Nonparametric iterative estimation of multivariate binary density. J. Multivariate Anal. 16:162\u2013172, 1985.\n[7] J.S. Simonoff. Smoothing categorical data. J. Statist. 
Planning and Inference 47:41\u201360, 1995.\n[8] M. Talagrand. On Russo\u2019s approximate zero-one law. Ann. Probab. 22:1576\u20131587, 1994.\n[9] I. Dinur, E. Friedgut, G. Kindler and R. O\u2019Donnell. On the Fourier tails of bounded functions over the discrete cube. Israel J. Math. 160:389\u2013421, 2007.\n[10] I.M. Johnstone. Minimax Bayes, asymptotic minimax and sparse wavelet priors. In S.S. Gupta and J.O. Berger, eds., Statistical Decision Theory and Related Topics V, pp. 303\u2013326, Springer, 1994.\n[11] E.J. Cand\u00e8s and T. Tao. Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Inf. Theory 52(12):5406\u20135425, 2006.\n[12] O. Goldreich and L. Levin. A hard-core predicate for all one-way functions. STOC, pp. 25\u201332, 1989.\n[13] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum. SIAM J. Comput. 22(6):1331\u20131348, 1993.\n[14] W. H\u00e4rdle, G. Kerkyacharian, D. Picard and A.B. Tsybakov. Wavelets, Approximation, and Statistical Applications, Springer, 1998.\n[15] S.A. van de Geer. Empirical Processes in M-Estimation, Cambridge Univ. Press, 2000.\n[16] M. Talagrand. Sharper bounds for Gaussian and empirical processes. Ann. Probab. 22:28\u201376, 1994.\n\n\f", "award": [], "sourceid": 363, "authors": [{"given_name": "Maxim", "family_name": "Raginsky", "institution": null}, {"given_name": "Svetlana", "family_name": "Lazebnik", "institution": null}, {"given_name": "Rebecca", "family_name": "Willett", "institution": null}, {"given_name": "Jorge", "family_name": "Silva", "institution": null}]}