{"title": "Efficient Nonparametric Smoothness Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 1010, "page_last": 1018, "abstract": "Sobolev quantities (norms, inner products, and distances) of probability density functions are important in the theory of nonparametric statistics, but have rarely been used in practice, partly due to a lack of practical estimators. They also include, as special cases, L^2 quantities which are used in many applications. We propose and analyze a family of estimators for Sobolev quantities of unknown probability density functions. We bound the finite-sample bias and variance of our estimators, finding that they are generally minimax rate-optimal. Our estimators are significantly more computationally tractable than previous estimators, and exhibit a statistical/computational trade-off allowing them to adapt to computational constraints. We also draw theoretical connections to recent work on fast two-sample testing and empirically validate our estimators on synthetic data.", "full_text": "Ef\ufb01cient Nonparametric Smoothness Estimation\n\nShashank Singh\n\nCarnegie Mellon University\nsss1@andrew.cmu.edu\n\nSimon S. Du\n\nCarnegie Mellon University\n\nssdu@cs.cmu.edu\n\nBarnab\u00e1s P\u00f3czos\n\nCarnegie Mellon University\nbapoczos@cs.cmu.edu\n\nAbstract\n\nSobolev quantities (norms, inner products, and distances) of probability density\nfunctions are important in the theory of nonparametric statistics, but have rarely\nbeen used in practice, due to a lack of practical estimators. They also include,\nas special cases, L2 quantities which are used in many applications. We propose\nand analyze a family of estimators for Sobolev quantities of unknown probability\ndensity functions. We bound the \ufb01nite-sample bias and variance of our estimators,\n\ufb01nding that they are generally minimax rate-optimal. 
Our estimators are significantly more computationally tractable than previous estimators, and exhibit a statistical/computational trade-off allowing them to adapt to computational constraints. We also draw theoretical connections to recent work on fast two-sample testing and empirically validate our estimators on synthetic data.

1 Introduction
L2 quantities (i.e., inner products, norms, and distances) of continuous probability density functions are important information theoretic quantities with many applications in machine learning and signal processing. For example, L2 norm estimates can be used for goodness-of-fit testing [7], image registration and texture classification [12], and parameter estimation in semi-parametric models [36]. L2 inner product estimates can generalize linear or polynomial kernel methods to inputs which are distributions rather than numerical vectors [28]. L2 distance estimators are used for two-sample testing [1, 25], transduction learning [30], and machine learning on distributional inputs [27]. [29] gives applications of L2 quantities to adaptive information filtering, classification, and clustering.
L2 quantities are a special case of less-well-known Sobolev quantities. Sobolev norms measure global smoothness of a function in terms of integrals of squared derivatives. For example, for a non-negative integer s and a function f : R → R with an sth derivative f^(s), the s-order Sobolev norm ‖·‖_{H^s} is given by ‖f‖_{H^s} = ∫_R (f^(s)(x))² dx (when this quantity is finite). See Section 2 for more general definitions, and see [21] for an introduction to Sobolev spaces.
Estimation of general Sobolev norms has a long history in nonparametric statistics (e.g., [32, 13, 10, 2]). This line of work was motivated by the role of Sobolev norms in many semi- and non-parametric problems, including density estimation, density functional estimation, and regression (see [35], Section 1.7.1), where they dictate the convergence rates of estimators. Despite this, to our knowledge, these quantities have never been studied in real data, leaving an important gap between the theory and practice of nonparametric statistics. We suggest this is in part due to a lack of practical estimators for these quantities. For example, the only one of the above estimators that is statistically minimax-optimal [2] is extremely difficult to compute in practice, requiring numerical integration over each of O(n²) different kernel density estimates, where n denotes the sample size. We know of no estimators previously proposed for Sobolev inner products and distances.
The main goal of this paper is to propose and analyze a family of computationally and statistically efficient estimators for Sobolev inner products, norms, and distances. Our specific contributions are:

1. We propose families of nonparametric estimators for all Sobolev quantities (Section 4).

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

2. We analyze the estimators' bias and variance. 
Assuming the underlying density functions lie in a Sobolev class of smoothness parametrized by s', we show the estimator for Sobolev quantities of order s < s' converges to the true value at the "parametric" rate of O(n⁻¹) in mean squared error when s' ≥ 2s + D/4, and at a slower rate of O(n^{8(s−s')/(4s'+D)}) otherwise (Section 5).

3. We validate our theoretical results on simulated data (Section 8).
4. We derive asymptotic distributions for our estimators, and we use these to derive tests for the general statistical problem of two-sample testing. We also draw theoretical connections between our test and recent work on nonparametric two-sample testing (Section 9).

In terms of mean squared error, minimax lower bounds matching our convergence rates over Sobolev or Hölder smoothness classes have been shown by [16] for s = 0 (i.e., L2 quantities), and [3] for Sobolev norms with integer s. Since these lower bounds intuitively "span" the space of relevant quantities, it is a small step to conjecture that our estimators are minimax rate-optimal for all Sobolev quantities and s ∈ [0, ∞).
As described in Section 7, our estimators are computable in O(n^{1+ε}) time using only basic matrix operations, where n is the sample size and ε ∈ (0, 1) is a tunable parameter trading statistical and computational efficiency; the smallest value of ε at which the estimator continues to be minimax rate-optimal approaches 0 as we assume more smoothness of the true density.

2 Problem setup and notation
Let X = [−π, π]^D and let µ denote the Lebesgue measure on X. 
For D-tuples z ∈ Z^D of integers, let ψ_z ∈ L2 = L2(X)¹ defined by ψ_z(x) = e^{−i⟨z,x⟩} for all x ∈ X denote the zth element of the L2-orthonormal Fourier basis, and, for f ∈ L2, let f̃(z) := ⟨ψ_z, f⟩_{L2} = ∫_X ψ_z(x) f(x) dµ(x) denote the zth Fourier coefficient of f.² For any s ∈ [0, ∞), define the Sobolev space H^s = H^s(X) ⊆ L2 of order s on X by³

    H^s = { f ∈ L2 : Σ_{z ∈ Z^D} z^{2s} |f̃(z)|² < ∞ }.    (1)

Fix a known s ∈ [0, ∞) and unknown probability density functions p, q ∈ H^s, and suppose we have n IID samples X_1, ..., X_n ∼ p and Y_1, ..., Y_n ∼ q from each of p and q. We are interested in estimating the inner product

    ⟨p, q⟩_{H^s} := Σ_{z ∈ Z^D} z^{2s} p̃(z) \overline{q̃(z)},    defined for all p, q ∈ H^s.    (2)

Estimating the inner product gives an estimate for the (squared) induced norm and distance, since⁴

    ‖p‖²_{H^s} := Σ_{z ∈ Z^D} z^{2s} |p̃(z)|² = ⟨p, p⟩_{H^s}    and    ‖p − q‖²_{H^s} = ‖p‖²_{H^s} − 2⟨p, q⟩_{H^s} + ‖q‖²_{H^s}.    (3)

Since our theoretical results assume the samples from p and q are independent, when estimating ‖p‖²_{H^s}, we split the sample from p in half to compute two independent estimates of p̃, although this may not be optimal in practice.
For a more classical intuition, we note that, in the case D = 1 and s ∈ {0, 1, 2, ...} (via Parseval's identity and the identity \widetilde{f^{(s)}}(z) = (iz)^s f̃(z)), one can show the following: H^s includes the subspace of L2 functions with at least s derivatives in L2 and, if f^(s) denotes the sth derivative of f,

    ‖f‖²_{H^s} = 2π ∫_X (f^(s)(x))² dx = 2π ‖f^(s)‖²_{L2},    ∀ f ∈ H^s.    (4)

In particular, when s = 0, H^s = L2, ‖·‖_{H^s} = ‖·‖_{L2}, and ⟨·,·⟩_{H^s} = ⟨·,·⟩_{L2}. As we describe in the supplement, equation (4) and our results generalize trivially to weak derivatives, as well as to non-integer s ∈ [0, ∞) via a notion of fractional derivative.

1 We suppress dependence on X; all function spaces are over X except as discussed in Section 2.1.
2 Here, ⟨·,·⟩ denotes the dot product on R^D. For a complex number c = a + bi, c̄ = a − bi denotes the complex conjugate of c, and |c| = √(c c̄) = √(a² + b²) denotes the modulus of c.
3 When D > 1, z^{2s} = Π_{j=1}^D z_j^{2s}. For z < 0, z^{2s} should be read as (z²)^s, so that z^{2s} ∈ R even when 2s ∉ Z. In the L2 case, we use the convention that 0⁰ = 1.
4 ‖p‖_{H^s} is a pseudonorm on H^s because it fails to distinguish functions identical almost everywhere up to additive constants; a combination of ‖p‖_{L2} and ‖p‖_{H^s} is used when a proper norm is needed. However, since probability densities integrate to 1, ‖· − ·‖_{H^s} is a proper metric on the subset of (almost-everywhere equivalence classes of) probability density functions in H^s, which is important for two-sample testing (see Section 9). For simplicity, we use the terms "norm", "inner product", and "distance" for the remainder of the paper.

Functional Name          | Functional Form                                  | References
L2 norms                 | ‖p‖²_{L2} = ∫ (p(x))² dx                         | [32, 6]
(Integer) Sobolev norms  | ‖p‖²_{H^k} = ∫ (p^(k)(x))² dx                    | [2]
Density functionals      | ∫ φ(x, p(x)) dx                                  | [18, 19]
Derivative functionals   | ∫ φ(x, p(x), p'(x), ..., p^(k)(x)) dx            | [3]
L2 inner products        | ⟨p1, p2⟩_{L2} = ∫ p1(x) p2(x) dx                 | [16, 17]
Multivariate functionals | ∫ φ(x, p1(x), ..., pk(x)) dx                     | [34, 14]

Table 1: Some related functional forms for which nonparametric estimators have been developed and analyzed. p, p1, ..., pk are unknown probability densities, from each of which we draw n IID samples, φ is a known real-valued measurable function, and k is a non-negative integer.

2.1 Unbounded domains
A notable restriction above is that p and q are supported in X := [−π, π]^D. In fact, our estimators and tests are well-defined and valid for densities supported on arbitrary subsets of R^D. In this case, they act on the 2π-periodic summation p_2π : [−π, π]^D → [0, ∞] defined for x ∈ X by p_2π(x) := Σ_{z ∈ Z^D} p(x + 2πz), which is itself a probability density function on X. For example, the estimator for ‖p‖_{H^s} will instead estimate ‖p_2π‖_{H^s}, and the two-sample test for distributions p and q will attempt to distinguish p_2π from q_2π. Typically, this is not problematic; for example, for most realistic probability densities, p and p_2π have similar orders of smoothness, and p_2π = q_2π if and only if p = q. 
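The reason the method implicitly acts on p_2π is that each basis function ψ_z with integer frequency z is itself 2π-periodic, so wrapping samples into X by 2π-periodic reduction leaves every empirical Fourier coefficient unchanged. A minimal numerical check of this invariance (the helper name `wrap` is ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 3.0, size=1000)  # samples well outside [-pi, pi]

def wrap(x):
    # Map samples into [-pi, pi) by 2*pi-periodic reduction.
    return (x + np.pi) % (2 * np.pi) - np.pi

# For any integer frequency z, psi_z(x) = exp(-1j*z*x) is 2*pi-periodic,
# so the empirical Fourier coefficient is identical for wrapped samples.
for z in [-3, -1, 0, 2, 5]:
    c_raw = np.mean(np.exp(-1j * z * x))
    c_wrapped = np.mean(np.exp(-1j * z * wrap(x)))
    assert np.allclose(c_raw, c_wrapped)
```

Consequently, estimates computed from wrapped data are exactly the estimates for the wrapped density p_2π.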
However, there are (meagre) sets of exceptions; for example, if q is a translation of p\nby exactly 2\u03c0, then p2\u03c0 = q2\u03c0, and one can craft a highly discontinuous function p such that p2\u03c0 is\nuniform on X . [39] These exceptions make it dif\ufb01cult to extend theoretical results to densities with\narbitrary support, but they are \ufb01xed, in practice, by randomly rescaling the data (as in [4]). If the\ndensities have (known) bounded support, they can simply be shifted and scaled to be supported on X .\n\n3 Related work\nThere is a large body of work on estimating nonlinear functionals of probability densities, with\nvarious generalizations in terms of the class of functionals considered. Table 1 gives a subset of such\nwork, for functionals related to Sobolev quantities. As shown in Section 2, the functional form we\nconsider is a strict generalization of L2 norms, Sobolev norms, and L2 inner products. It overlaps\nwith, but is neither a special case nor a generalization of the remaining functional forms in the table.\nNearly all of the above approaches compute an optimally smoothed kernel density estimate and\nthen perform bias corrections based on Taylor series expansions of the functional of interest. They\ntypically consider distributions with densities that are \u03b2-H\u00f6lder continuous and satisfy periodicity\nassumptions of order \u03b2 on the boundary of their support, for some constant \u03b2 > 0 (see, for example,\nSection 4 of [16] for details of these assumptions). The Sobolev class we consider is a strict superset\nof this H\u00f6lder class, permitting, for example, certain \u201csmall\u201d discontinuities. In this regard, our\nresults are slightly more general than most of these prior works.\nFinally, there is much recent work on estimating entropies, divergences, and mutual informations,\nusing methods based on kernel density estimates [14, 17, 16, 24, 33, 34] or k-nearest neighbor\nstatistics [20, 23, 22, 26]. 
In contrast, our estimators are more similar to orthogonal series density estimators, which are computationally attractive because they require no pairwise operations between samples. However, they require quite different theoretical analysis; unlike prior work, our estimator is constructed and analyzed entirely in the frequency domain, and then related to the data domain via Parseval's identity. We hope our analysis can be adapted to analyze new, computationally efficient information theoretic estimators.

4 Motivation and construction of our estimator
For a non-negative integer parameter Zn (to be specified later), let

    pn := Σ_{‖z‖∞ ≤ Zn} p̃(z) ψ_z    and    qn := Σ_{‖z‖∞ ≤ Zn} q̃(z) ψ_z,    where ‖z‖∞ := max_{j ∈ {1,...,D}} |z_j|,    (5)

denote the L2 projections of p and q, respectively, onto the linear subspace spanned by the L2-orthonormal family Fn := {ψ_z : z ∈ Z^D, ‖z‖∞ ≤ Zn}. Note that, since ψ̃_z(y) = 0 whenever y ≠ z, the Fourier basis has the special property that it is orthogonal in ⟨·,·⟩_{H^s} as well. Hence, since pn and qn lie in the span of Fn while p − pn and q − qn lie in the span of {ψ_z : z ∈ Z^D} \ Fn, ⟨p − pn, qn⟩_{H^s} = ⟨pn, q − qn⟩_{H^s} = 0. Therefore,

    ⟨p, q⟩_{H^s} = ⟨pn, qn⟩_{H^s} + ⟨p − pn, qn⟩_{H^s} + ⟨pn, q − qn⟩_{H^s} + ⟨p − pn, q − qn⟩_{H^s}
                 = ⟨pn, qn⟩_{H^s} + ⟨p − pn, q − qn⟩_{H^s}.    (6)

We propose an unbiased estimate of Sn := ⟨pn, qn⟩_{H^s} = Σ_{‖z‖∞ ≤ Zn} z^{2s} p̃(z) \overline{q̃(z)}. Notice that Fourier coefficients of p are the expectations p̃(z) = E_{X∼p}[ψ_z(X)]. Thus, p̂(z) := (1/n) Σ_{j=1}^n ψ_z(X_j) and q̂(z) := (1/n) Σ_{j=1}^n ψ_z(Y_j) are independent unbiased estimates of p̃ and q̃, respectively. Since Sn is bilinear in p̃ and q̃, the plug-in estimator for Sn is unbiased. That is, our estimator for ⟨p, q⟩_{H^s} is

    Ŝn := Σ_{‖z‖∞ ≤ Zn} z^{2s} p̂(z) \overline{q̂(z)}.    (7)

5 Finite sample bounds
Here, we present our main theoretical results, bounding the bias, variance, and mean squared error of our estimator for finite n.
By construction, our estimator satisfies

    E[Ŝn] = Σ_{‖z‖∞ ≤ Zn} z^{2s} E[p̂(z)] \overline{E[q̂(z)]} = Σ_{‖z‖∞ ≤ Zn} z^{2s} p̃(z) \overline{q̃(z)} = Sn.

Thus, via (6) and Cauchy–Schwarz, the bias of the estimator Ŝn satisfies

    |E[Ŝn] − ⟨p, q⟩_{H^s}| = |⟨p − pn, q − qn⟩_{H^s}| ≤ √(‖p − pn‖²_{H^s} ‖q − qn‖²_{H^s}).    (8)

‖p − pn‖_{H^s} is the error of approximating p by an order-Zn trigonometric polynomial, a classic problem in approximation theory, for which Theorem 2.2 of [15] shows:

    if p ∈ H^{s'} for some s' > s, then ‖p − pn‖_{H^s} ≤ ‖p‖_{H^{s'}} Zn^{s−s'}.    (9)

In combination with (8), this implies the following bound on the bias of our estimator:
Theorem 1. 
(Bias bound) If p, q ∈ H^{s'} for some s' > s, then, for C_B := ‖p‖_{H^{s'}} ‖q‖_{H^{s'}},

    |E[Ŝn] − ⟨p, q⟩_{H^s}| ≤ C_B Zn^{2(s−s')}.    (10)

Hence, the bias of Ŝn decays polynomially in Zn, with a power depending on the "extra" s' − s orders of smoothness available. On the other hand, as we increase Zn, the number of frequencies at which we estimate p̂ increases, suggesting that the variance of the estimator will increase with Zn. Indeed, this is expressed in the following bound on the variance of the estimator.

Theorem 2. (Variance bound) If p, q ∈ H^{s'} for some s' ≥ s, then

    V[Ŝn] ≤ 2 C1 Zn^{4s+D} / n² + C2 / n,    (11)

where C1 := (2^D Γ(4s + 1) / Γ(4s + D + 1)) ‖p‖_{L2} ‖q‖_{L2} and C2 := (‖p‖_{H^s} + ‖q‖_{H^s}) ‖p‖_{W^{2s,4}} ‖q‖_{W^{2s,4}} + ‖p‖⁴_{H^s} ‖q‖⁴_{H^s} are constants (in n).
The proof of Theorem 2 is perhaps the most significant theoretical contribution of this work. Due to space constraints, the proof is given in the supplement. Combining Theorems 1 and 2 gives a bound on the mean squared error (MSE) of Ŝn via the usual decomposition into squared bias and variance:
Corollary 3. 
(Mean squared error bound) If p, q ∈ H^{s'} for some s' > s, then

    E[(Ŝn − ⟨p, q⟩_{H^s})²] ≤ C_B² Zn^{4(s−s')} + 2 C1 Zn^{4s+D} / n² + C2 / n.    (12)

If, furthermore, we choose Zn ≍ n^{2/(4s'+D)} (optimizing the rate in inequality (12)), then

    E[(Ŝn − ⟨p, q⟩_{H^s})²] ∈ O(n^{max{8(s−s')/(4s'+D), −1}}).    (13)

Corollary 3 recovers the phenomenon discovered by [2]: when s' ≥ 2s + D/4, the minimax optimal MSE decays at the "semi-parametric" n⁻¹ rate, whereas, when s' ∈ (s, 2s + D/4), the MSE decays at a slower rate. Also, the estimator is L2-consistent if Zn → ∞ and Zn n^{−2/(4s+D)} → 0 as n → ∞. This is useful in practice, since s is known but s' is not.
Finally, it is worth reiterating that, by (3), these finite sample rates extend, with additional constant factors, to estimating Sobolev norms and distances.

6 Asymptotic distributions
In this section, we derive the asymptotic distributions of our estimator in two cases: (1) the inner product estimator and (2) the distance estimator in the case p = q. These results provide confidence intervals and two-sample tests without computationally intensive resampling. While (1) is more general in that it can be used with (3) to bound the asymptotic distributions of the norm and distance estimators, (2) provides a more precise result leading to a more computationally and statistically efficient two-sample test. 
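The construction of equations (5)-(7) is simple enough to state in a few lines of code. The following is a minimal NumPy sketch for D = 1 (function names are ours; the authors' MATLAB implementation is linked in Section 8), using the convention z^{2s} = (z²)^s with 0⁰ = 1:

```python
import numpy as np

def sobolev_inner(x, y, s, zn):
    """Plug-in estimate of <p, q>_{H^s} from 1-D samples x ~ p, y ~ q (eq. (7)).

    Estimates each Fourier coefficient p~(z) = E[psi_z(X)] by the sample mean
    of psi_z(x_j) = exp(-1j * z * x_j) over |z| <= zn, then sums
    z^{2s} * p_hat(z) * conj(q_hat(z)).
    """
    z = np.arange(-zn, zn + 1)
    # z^{2s} read as (z^2)^s; 0.0 ** 0 evaluates to 1.0, matching 0^0 = 1.
    w = np.abs(z).astype(float) ** (2 * s)
    p_hat = np.exp(-1j * np.outer(z, x)).mean(axis=1)
    q_hat = np.exp(-1j * np.outer(z, y)).mean(axis=1)
    # Conjugate symmetry in z makes the sum real up to rounding error.
    return np.real(np.sum(w * p_hat * np.conj(q_hat)))

def sobolev_dist_sq(x, y, s, zn):
    """Squared Sobolev distance via eq. (3) (no sample splitting, for brevity)."""
    return (sobolev_inner(x, x, s, zn)
            - 2 * sobolev_inner(x, y, s, zn)
            + sobolev_inner(y, y, s, zn))
```

Note that the paper splits the sample when estimating ‖p‖²_{H^s} so that the two coefficient estimates are independent; this sketch omits that step, so sobolev_dist_sq(x, x, s, zn) is zero by construction rather than by estimation.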
Proofs are given in the supplementary material.
Theorem 4 shows that our estimator has a normal asymptotic distribution, assuming Zn → ∞ slowly enough as n → ∞, and also gives a consistent estimator for its asymptotic variance. From this, one can easily estimate asymptotic confidence intervals for inner products, and hence also for norms.

Theorem 4. (Asymptotic normality) Suppose that, for some s' > 2s + D/4, p, q ∈ H^{s'}, and suppose Zn n^{4(s−s')} → ∞ and Zn n^{−1/(4s+D)} → 0 as n → ∞. Then, Ŝn is asymptotically normal with mean ⟨p, q⟩_{H^s}. In particular, for j ∈ {1, ..., n} and z ∈ Z^D with ‖z‖∞ ≤ Zn, define W_{j,z} := z^s e^{izX_j} and V_{j,z} := z^s e^{izY_j}, so that W_j and V_j are column vectors in R^{(2Zn)^D}. Let W̄ := (1/n) Σ_{j=1}^n W_j and V̄ := (1/n) Σ_{j=1}^n V_j ∈ R^{(2Zn)^D}, and let

    Σ_W := (1/n) Σ_{j=1}^n (W_j − W̄)(W_j − W̄)^T    and    Σ_V := (1/n) Σ_{j=1}^n (V_j − V̄)(V_j − V̄)^T ∈ R^{(2Zn)^D × (2Zn)^D}

denote the empirical means and covariances of W and V, respectively. Then, for

    σ̂n² := [V̄; W̄]^T [Σ_W, 0; 0, Σ_V] [V̄; W̄],

we have

    √n (Ŝn − ⟨p, q⟩_{H^s}) / σ̂n →_D N(0, 1),

where →_D denotes convergence in distribution.
Since distances can be written as a sum of three inner products (Eq. (3)), Theorem 4 might suggest an asymptotic normal distribution for Sobolev distances. However, extending asymptotic normality from inner products to their sum requires that the three estimates be independent, and hence that we split data between the three estimates. 
This is inefficient in practice and somewhat unnatural, as we know, for example, that distances should be non-negative. For the particular case p = q (as in the null hypothesis of two-sample testing), the following theorem⁵ provides a more precise asymptotic (χ²) distribution of our Sobolev distance estimator, after an extra decorrelation step. This gives, for example, a more powerful two-sample test statistic (see Section 9 for details).

Theorem 5. (Asymptotic null distribution) Suppose that, for some s' > 2s + D/4, p, q ∈ H^{s'}, and suppose Zn n^{4(s−s')} → ∞ and Zn n^{−1/(4s+D)} → 0 as n → ∞. For j ∈ {1, ..., n} and z ∈ Z^D with ‖z‖∞ ≤ Zn, define W_{j,z} := z^s (e^{−izX_j} − e^{−izY_j}), so that W_j is a column vector in R^{(2Zn)^D}. Let

    W̄ := (1/n) Σ_{j=1}^n W_j ∈ R^{(2Zn)^D}    and    Σ := (1/n) Σ_{j=1}^n (W_j − W̄)(W_j − W̄)^T ∈ R^{(2Zn)^D × (2Zn)^D}

denote the empirical mean and covariance of W, and define T := n W̄^T Σ⁻¹ W̄. Then, if p = q, then

    Q_{χ²((2Zn)^D)}(T) →_D Uniform([0, 1])    as n → ∞,

where Q_{χ²(d)} : [0, ∞) → [0, 1] denotes the quantile function (inverse CDF) of the χ² distribution χ²(d) with d degrees of freedom.
Let M̂ denote our estimator for ‖p − q‖_{H^s} (i.e., plugging Ŝn into (3)). While Theorem 5 immediately provides a valid two-sample test of desired level, it is not immediately clear how this relates to M̂, nor is there any suggestion of why the test statistic ought to be a good (i.e., consistent) one. Some intuition is as follows. Notice that M̂ = W̄^T W̄. 
Since, by the central limit theorem, W̄ has a normal asymptotic distribution, if the components of W̄ were uncorrelated (and Zn were fixed), we would expect n M̂ to have an asymptotic χ² distribution with (2Zn)^D degrees of freedom. However, because we use the same data to compute each component of M̂, they are not typically uncorrelated, and so the asymptotic distribution of M̂ is difficult to derive. This motivates the statistic T = n (Σ^{−1/2} W̄)^T (Σ^{−1/2} W̄) = n W̄^T Σ⁻¹ W̄, since the components of Σ^{−1/2} W̄ are (asymptotically) uncorrelated.

7 Parameter selection and statistical/computational trade-off
Here, we give statistical and computational considerations for choosing the smoothing parameter Zn.
Statistical perspective: In practice, of course, we do not typically know s', so we cannot simply set Zn ≍ n^{2/(4s'+D)}, as suggested by the mean squared error bound (13). Fortunately (at least for ease of parameter selection), when s' ≥ 2s + D/4, the dominant term of (13) is C2/n for Zn ≍ n^{1/(4s+D)}. Hence, if we are willing to assume that the density has at least 2s + D/4 orders of smoothness (which may be a mild assumption in practice), then we achieve statistical optimality (in rate) by setting Zn ≍ n^{1/(4s+D)}, which depends only on known parameters. On the other hand, the estimator can continue to benefit from additional smoothness computationally.
Computational perspective: An attractive property of the estimator discussed is its computational simplicity and efficiency in low dimensions. Most competing nonparametric estimators, such as kernel-based or nearest-neighbor methods, either take O(n²) time or rely on complex data structures such as k-d trees or cover trees [31] for O(2^D n log n) time performance. 
Since computing the estimator takes O(n Zn^D) time and O(Zn^D) memory (that is, the cost of estimating each of (2Zn)^D Fourier coefficients by an average), a statistically optimal choice of Zn gives a runtime of O(n^{(4s'+2D)/(4s'+D)}). Since the estimate requires only a vector outer product, exponentiation, and averaging, constants involved are small and computations parallelize trivially over frequencies and data.
Under severe computational constraints, for very large data sets, or if D is large relative to s', we can reduce Zn to trade off statistical for computational efficiency. For example, if we want an estimator with runtime O(n^{1+θ}) and space requirement O(n^θ) for some θ ∈ (0, 2D/(4s'+D)), setting Zn ≍ n^{θ/D} still gives a consistent estimator, with mean squared error of the order O(n^{max{4θ(s−s')/D, −1}}).

5 This result is closely related to Proposition 4 of [4]. However, in their situation, s = 0 and the set of test frequencies is fixed as n → ∞, whereas our set is increasing.

[Figure 1: (a) 1D Gaussians with different means. (b) 1D Gaussians with different variance. (c) Uniform distributions with different range. (d) One uniform and one triangular distribution.]
[Figure 2: (a) 3D Gaussians with different means. (b) 3D Gaussians with different variance. (c) Estimation of H⁰ norm of N(0, 1). (d) Estimation of H¹ norm of N(0, 1).]

Kernel- or nearest-neighbor-based methods, including nearly all of the methods described in Section 3, tend to require storing previously observed data, resulting in O(n) space requirements. In contrast, orthogonal basis estimation requires storing only O(Zn^D) estimated Fourier coefficients. The estimated coefficients can be incrementally updated with each new data point, which may
make the estimator or close approximations feasible in streaming settings.

8 Experimental results
In this section, we use synthetic data to demonstrate the effectiveness of our methods.⁶ All experiments use 10, 10², ..., 10⁵ samples for estimation.
We first test our estimators on 1D L2 distances. Figure 1a shows the estimated distance between N(0, 1) and N(1, 1); Figure 1b shows the estimated distance between N(0, 1) and N(0, 4); Figure 1c shows the estimated distance between Unif[0, 1] and Unif[0.5, 1.5]; Figure 1d shows the estimated distance between Unif[0, 1] and a triangular distribution whose density is highest at x = 0.5. Error bars indicate asymptotic 95% confidence intervals based on Theorem 4. These experiments suggest 10⁵ samples is sufficient to estimate L2 distances with high confidence. Note that we need fewer samples to estimate Sobolev quantities of Gaussians than, say, of uniform distributions, consistent with our theory, since (infinitely differentiable) Gaussians are smoother than (discontinuous) uniform distributions.
Next, we test our estimators on L2 distances of multivariate distributions. Figure 2a shows the estimated distance between N([0, 0, 0], I) and N([1, 1, 1], I); Figure 2b shows the estimated distance between N([0, 0, 0], I) and N([0, 0, 0], 4I). These experiments show that our estimators can also handle multivariate distributions. Lastly, we test our estimators for H^s norms. Figure 2c shows the estimated H⁰ norm of N(0, 1) and Figure 2d shows the estimated H¹ norm of N(0, 1). Additional experiments with other distributions and larger values of s are given in the supplement.

9 Connections to two-sample testing
We now discuss the use of our estimator in two-sample testing. 
From the large literature on nonparametric two-sample testing, we discuss only some recent approaches closely related to ours.
Let M̂ denote our estimate of the Sobolev distance, consisting of plugging Ŝ into equation (3). Since ‖· − ·‖_{H^s} is a metric on the space of probability density functions in H^s, computing M̂ leads naturally to a two-sample test on this space. Theorem 5 suggests an asymptotic test, which is computationally preferable to a permutation test. In particular, for a desired Type I error rate α ∈ (0, 1), our test rejects the null hypothesis p = q if and only if Q_{χ²((2Zn)^D)}(T) < α.

6 MATLAB code for these experiments is available at https://github.com/sss1/SobolevEstimation.

[Plots for Figures 1 and 2: estimated distance or norm vs. number of samples, compared against the true value.]

When s = 0, this approach is closely related to several two-sample tests in the literature based on comparing empirical characteristic functions (CFs). Originally, these tests [11, 5] computed the same statistic T with a fixed number of random R^D-valued frequencies instead of deterministic Z^D-valued frequencies. This test runs in linear time, but is not generally consistent, since the two CFs need not differ almost everywhere. 
Recently, [4] suggested using smoothed CFs, i.e., the convolution of the CF with a universal smoothing kernel k. This is computationally easy (due to the convolution theorem) and, when p ≠ q, (p̃ ∗ k)(z) ≠ (q̃ ∗ k)(z) for almost all z ∈ R^D, reducing the need for carefully choosing test frequencies. Furthermore, this test is almost-surely consistent under very general alternatives. However, it is not clear what sort of assumptions would allow finite-sample analysis of the power of their test. Indeed, the convergence as n → ∞ can be arbitrarily slow, depending on the random test frequencies used. Our analysis instead uses the assumption p, q ∈ H^{s'} to ensure that small, Z^D-valued frequencies contain most of the power of p̃.7

These fixed-frequency approaches can be thought of as the extreme point θ = 0 of the computational/statistical trade-off described in Section 7: they are computable in linear time and (with smoothing) are strongly consistent, but do not satisfy finite-sample bounds under general conditions. At the other extreme (θ = 1) are the MMD-based tests of [8, 9], which use the entire spectrum p̃. These tests are statistically powerful and have strong guarantees for densities in an RKHS, but have O(n^2) computational complexity.
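The O(n^2) cost of MMD-based tests comes from kernel sums over all pairs of sample points. The following generic sketch of a biased (V-statistic) Gaussian-kernel MMD estimate, not the specific test of [8, 9], makes the quadratic pairwise structure explicit.

```python
import numpy as np

def mmd_sq_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between 1D samples x and y,
    using a Gaussian kernel. The three double sums over all pairs of points are
    what give MMD-based tests their O(n^2) cost, in contrast to the linear-time
    fixed-frequency CF tests discussed above."""
    def k(a, b):
        d = a[:, None] - b[None, :]  # n x m matrix of all pairwise differences
        return np.exp(-(d ** 2) / (2 * bandwidth ** 2))
    return float(k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean())
```

On samples from N(0, 1) and N(1, 1), this estimate is bounded well away from zero, while on two samples from the same distribution it is close to zero (up to the O(1/n) bias of the V-statistic).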
The computational/statistical trade-off discussed in Section 7 can be thought of as an interpolation (controlled by θ) between these two approaches, with runtime in the case θ = 1 approaching quadratic for large D and small s'.

10 Conclusions and future work

In this paper, we proposed nonparametric estimators for Sobolev inner products, norms, and distances of probability densities, for which we derived finite-sample bounds and asymptotic distributions.

A natural follow-up question to our work is whether estimating the smoothness of a density can guide the choice of smoothing parameters in nonparametric estimation. When analyzing many nonparametric estimators, Sobolev norms appear as the key unknown term in error bounds. While theoretically optimal smoothing parameter values are often suggested based on optimizing these error bounds, our work may suggest a practical way of mimicking this procedure by plugging estimated Sobolev norms into these bounds. For some problems, such as estimating functionals of a density, this may be especially useful, since no error metric is typically available for cross-validation. Even when cross-validation is an option, as in density estimation or regression, estimating smoothness may be faster, or may suggest an appropriate range of parameter values.

Acknowledgments

This material is based upon work supported by a National Science Foundation Graduate Research Fellowship to the first author under Grant No. DGE-1252522.

References

[1] N. H. Anderson, P. Hall, and D. M. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50(1):41–54, 1994.

[2] P. J. Bickel and Y. Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates.
Sankhyā: The Indian Journal of Statistics, Series A, pages 381–393, 1988.

[3] L. Birgé and P. Massart. Estimation of integral functionals of a density. The Annals of Statistics, pages 11–29, 1995.

[4] K. P. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton. Fast two-sample testing with analytic representations of probability measures. In NIPS, pages 1972–1980, 2015.

[5] T. Epps and K. J. Singleton. An omnibus test for the two-sample problem using the empirical characteristic function. Journal of Statistical Computation and Simulation, 26(3-4):177–203, 1986.

[6] E. Giné and R. Nickl. A simple adaptive estimator of the integrated square of a density. Bernoulli, pages 47–61, 2008.

7 Note that smoothed CFs can be used in our test by replacing p̂(z) with (1/n) Σ_{j=1}^n e^{−izX_j} k(x), where k is the inverse Fourier transform of a characteristic kernel. However, smoothing seems less desirable under Sobolev assumptions, as it spreads the power of the CF away from the small, Z^D-valued frequencies where our test focuses.

8 Fast MMD approximations have been proposed, including the Block MMD [37], FastMMD [38], and √n-sub-sampled MMD, but these lack the statistical guarantees of MMD.

[7] M. N. Goria, N. N. Leonenko, V. V. Mergel, and P. L. N. Inverardi. A new class of random vector entropy estimators and its applications in testing statistical hypotheses. J. Nonparametric Stat., 17:277–297, 2005.

[8] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pages 513–520, 2006.

[9] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.

[10] P. Hall and J. S. Marron. Estimation of integrated squared density derivatives.
Statistics & Probability Letters, 6(2):109–115, 1987.

[11] C. Heathcote. A test of goodness of fit for symmetric random variables. Australian Journal of Statistics, 14(2):172–181, 1972.

[12] A. O. Hero, B. Ma, O. J. J. Michel, and J. Gorman. Applications of entropic spanning graphs. IEEE Signal Processing Magazine, 19(5):85–95, 2002.

[13] I. Ibragimov and R. Khasminskii. On the nonparametric estimation of functionals. In Symposium in Asymptotic Statistics, pages 41–52, 1978.

[14] K. Kandasamy, A. Krishnamurthy, B. Poczos, L. Wasserman, et al. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In NIPS, pages 397–405, 2015.

[15] H.-O. Kreiss and J. Oliger. Stability of the Fourier method. SIAM Journal on Numerical Analysis, 16(3):421–433, 1979.

[16] A. Krishnamurthy, K. Kandasamy, B. Poczos, and L. A. Wasserman. Nonparametric estimation of Rényi divergence and friends. In ICML, pages 919–927, 2014.

[17] A. Krishnamurthy, K. Kandasamy, B. Poczos, and L. A. Wasserman. On estimating L2^2 divergence. In AISTATS, 2015.

[18] B. Laurent. Efficient estimation of integral functionals of a density. Université de Paris-sud, Département de mathématiques, 1992.

[19] B. Laurent et al. Efficient estimation of integral functionals of a density. The Annals of Statistics, 24(2):659–681, 1996.

[20] N. Leonenko, L. Pronzato, V. Savani, et al. A class of Rényi information estimators for multidimensional densities. The Annals of Statistics, 36(5):2153–2182, 2008.

[21] G. Leoni. A first course in Sobolev spaces, volume 105. American Math. Society, Providence, RI, 2009.

[22] K. Moon and A. Hero. Multivariate f-divergence estimation with confidence. In Advances in Neural Information Processing Systems, pages 2420–2428, 2014.

[23] K. R. Moon and A. O. Hero.
Ensemble estimation of multivariate f-divergence. In Information Theory (ISIT), 2014 IEEE International Symposium on, pages 356–360. IEEE, 2014.

[24] K. R. Moon, K. Sricharan, K. Greenewald, and A. O. Hero III. Improving convergence of divergence functional ensemble estimators. arXiv preprint arXiv:1601.06884, 2016.

[25] L. Pardo. Statistical inference based on divergence measures. CRC Press, 2005.

[26] B. Póczos and J. G. Schneider. On the estimation of alpha-divergences. In International Conference on Artificial Intelligence and Statistics, pages 609–617, 2011.

[27] B. Póczos, L. Xiong, and J. Schneider. Nonparametric divergence estimation with applications to machine learning on distributions. arXiv preprint arXiv:1202.3758, 2012.

[28] B. Póczos, L. Xiong, D. J. Sutherland, and J. Schneider. Nonparametric kernel estimators for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2989–2996. IEEE, 2012.

[29] J. C. Principe. Information theoretic learning: Renyi's entropy and kernel perspectives. Springer Science & Business Media, 2010.

[30] N. Quadrianto, J. Petterson, and A. J. Smola. Distribution matching for transduction. In Advances in Neural Information Processing Systems, pages 1500–1508, 2009.

[31] P. Ram, D. Lee, W. March, and A. G. Gray. Linear-time algorithms for pairwise statistical problems. In Advances in Neural Information Processing Systems, pages 1527–1535, 2009.

[32] T. Schweder. Window estimation of the asymptotic variance of rank estimators of location. Scandinavian Journal of Statistics, pages 113–126, 1975.

[33] S. Singh and B. Póczos. Generalized exponential concentration inequality for Rényi divergence estimation. In Proceedings of the 31st International Conference on Machine Learning, pages 333–341, 2014.

[34] S. Singh and B. Póczos.
Exponential concentration of a density functional estimator. In Advances in Neural Information Processing Systems, pages 3032–3040, 2014.

[35] A. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387790519, 9780387790510.

[36] E. Wolsztynski, E. Thierry, and L. Pronzato. Minimum-entropy estimation in semi-parametric models. Signal Processing, 85(5):937–949, 2005.

[37] W. Zaremba, A. Gretton, and M. Blaschko. B-test: A non-parametric, low variance kernel two-sample test. In Advances in Neural Information Processing Systems, pages 755–763, 2013.

[38] J. Zhao and D. Meng. FastMMD: Ensemble of circular discrepancy for efficient two-sample test. Neural Computation, 27(6):1345–1372, 2015.

[39] A. Zygmund. Trigonometric series, volume 1. Cambridge University Press, 2002.