{"title": "Exponential Concentration of a Density Functional Estimator", "book": "Advances in Neural Information Processing Systems", "page_first": 3032, "page_last": 3040, "abstract": "We analyse a plug-in estimator for a large class of integral functionals of one or more continuous probability densities. This class includes important families of entropy, divergence, mutual information, and their conditional versions. For densities on the d-dimensional unit cube [0,1]^d that lie in a beta-Holder smoothness class, we prove our estimator converges at the rate O(n^(1/(beta+d))). Furthermore, we prove that the estimator obeys an exponential concentration inequality about its mean, whereas most previous related results have bounded only expected error of estimators. Finally, we demonstrate our bounds to the case of conditional Renyi mutual information.", "full_text": "Exponential Concentration of a Density Functional\n\nEstimator\n\nShashank Singh\n\nStatistics & Machine Learning Departments\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nsss1@andrew.cmu.edu\n\nBarnab\u00b4as P\u00b4oczos\n\nMachine Learning Department\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nbapoczos@cs.cmu.edu\n\nAbstract\n\nWe analyze a plug-in estimator for a large class of integral functionals of one\nor more continuous probability densities. This class includes important families\nof entropy, divergence, mutual information, and their conditional versions. For\ndensities on the d-dimensional unit cube [0, 1]d that lie in a \u03b2-H\u00a8older smoothness\nclass, we prove our estimator converges at the rate O\n. Furthermore, we\nprove the estimator is exponentially concentrated about its mean, whereas most\nprevious related results have proven only expected error bounds on estimators.\n\n(cid:16)\n\n(cid:17)\n\n\u2212 \u03b2\n\n\u03b2+d\n\nn\n\n1\n\nIntroduction\n\nMany important quantities in machine learning and statistics can be viewed as integral functionals\nof one of more continuous probability densities; that is, quanitities of the form\n\n(cid:90)\n\nF (p1,\u00b7\u00b7\u00b7 , pk) =\n\nX1\u00d7\u00b7\u00b7\u00b7\u00d7Xk\n\nf (p1(x1), . . . , pk(xk)) d(x1, . . . , xk),\n\nwhere p1,\u00b7\u00b7\u00b7 , pk are probability densities of random variables taking values in X1,\u00b7\u00b7\u00b7 , Xk, re-\nspectively, and f : Rk \u2192 R is some measurable function. For simplicity, we refer to such integral\nfunctionals of densities as \u2018density functionals\u2019. In this paper, we study the problem of estimating\ndensity functionals. In our framework, we assume that the underlying distributions are not given\nexplicitly. Only samples of n independent and identically distributed (i.i.d.) points from each of the\nunknown, continuous, nonparametric distributions p1,\u00b7\u00b7\u00b7 , pk are given.\n\n1.1 Motivations and Goals\n\nOne density functional of interest is Conditional Mutual Information (CMI), a measure of con-\nditional dependence of random variables, which comes in several varieties including R\u00b4enyi-\u03b1 and\nTsallis-\u03b1 CMI (of which Shannon CMI is the \u03b1 \u2192 1 limit case). Estimating conditional dependence\nin a consistent manner is a crucial problem in machine learning and statistics; for many applications,\nit is important to determine how the relationship between two variables changes when we observe\nadditional variables. 
For example, upon observing a third variable, two correlated variables may become independent, and, similarly, two independent variables may become dependent. Hence, CMI estimators can be used in many scientific areas to detect confounding variables and avoid inferring causation from apparent correlation [20, 16]. Conditional dependencies are also central to Bayesian network learning [7, 35], where CMI estimation can be used to verify compatibility of a particular Bayes net with observed data under a local Markov assumption.

Other important density functionals are divergences between probability distributions, including Rényi-α [25] and Tsallis-α [32] divergences (of which Kullback-Leibler (KL) divergence [9] is the α → 1 limit case), and Lp divergence. Divergence estimators can be used to extend machine learning algorithms for regression, classification, and clustering from the standard setting where inputs are finite-dimensional feature vectors to settings where inputs are sets or distributions [23, 19]. Entropy and mutual information (MI) can be estimated as special cases of divergences. Entropy estimators are used in goodness-of-fit testing [5], parameter estimation in semi-parametric models [34], and texture classification [6], and MI estimators are used in feature selection [21], clustering [1], optimal experimental design [13], and boosting and facial expression recognition [26]. Both entropy and mutual information estimators are used in independent component and subspace analysis [10, 30] and image registration [6]. Further applications of divergence estimation are in [11].

Despite the practical utility of density functional estimators, little is known about their statistical performance, especially for functionals of more than one density. In particular, few density functional estimators have known convergence rates, and, to the best of our knowledge, no finite-sample exponential concentration bounds have been derived for general density functional estimators. One consequence of this exponential bound is that, using a union bound, we can guarantee accuracy of multiple estimates simultaneously. For example, [14] shows how this can be applied to optimally analyze forest density estimation algorithms. Because the CMI of variables X and Y given a third variable Z is zero if and only if X and Y are conditionally independent given Z, by estimating CMI with a confidence interval, we can test for conditional independence with bounded type I error probability.

Our main contribution is to derive convergence rates and an exponential concentration inequality for a particular, consistent, nonparametric estimator for a large class of density functionals, including conditional density functionals. We also apply our concentration inequality to the important case of Rényi-α CMI.

1.2 Related Work

Although lower bounds are not known for estimation of general density functionals (of arbitrarily many densities), [2] lower bounded the convergence rate for estimators of functionals of a single density (e.g., entropy functionals) by O(n^{−4β/(4β+d)}). [8] extended this lower bound to the two-density cases of L2, Rényi-α, and Tsallis-α divergences and gave plug-in estimators which achieve this rate.
These estimators enjoy the parametric rate of O(n^{−1/2}) when β > d/4, and work by optimally estimating the density and then applying a correction to the plug-in estimate. In contrast, our estimator undersmooths the density, and converges at a slower rate of O(n^{−β/(β+d)}) when β < d (and at the parametric rate O(n^{−1/2}) when β ≥ d), but obeys an exponential concentration inequality, which is not known for the estimators of [8].

Another exception for f-divergences is provided by [18], using empirical risk minimization. This approach involves solving an ∞-dimensional convex minimization problem which can be reduced to an n-dimensional problem for certain function classes defined by reproducing kernel Hilbert spaces (n is the sample size). When n is large, these optimization problems can still be very demanding. They studied the estimator's convergence rate, but did not derive concentration bounds.

A number of papers have studied k-nearest-neighbors estimators, primarily for Rényi-α density functionals including entropy [12], divergence [33], and conditional divergence and MI [22]. These estimators work directly, without the intermediate density estimation step, and generally have proofs of consistency, but their convergence rates and dependence on k, α, and the dimension are unknown. One recent exception is a k-nearest-neighbors based estimator that converges at the parametric rate when β > d, using an optimally weighted ensemble of weak estimators [28, 17]. These estimators appear to perform well in higher dimensions, but rates for these estimators require that k → ∞ as n → ∞, causing computational difficulties for large samples.

Although the literature on dependence measures is huge, few estimators have been generalized to the conditional case [4, 24]. There is some work on testing conditional dependence [29, 3], but, unlike CMI estimation, these tests are intended to simply accept or reject the hypothesis that variables are conditionally independent, rather than to measure conditional dependence. Our exponential concentration inequality also suggests a new test for conditional independence.

This paper continues a line of work begun in [14] and continued in [27]. [14] proved an exponential concentration inequality for an estimator of Shannon entropy and MI in the 2-dimensional case. [27] used similar techniques to derive an exponential concentration inequality for an estimator of Rényi-α divergence in d dimensions, for a larger family of densities. Both used plug-in estimators based on a mirrored kernel density estimator (KDE) on [0, 1]^d. Our work generalizes these results to a much larger class of density functionals, as well as to conditional density functionals (see Section 6). In particular, we use a plug-in estimator for general density functionals based on the same mirrored KDE, and also use some lemmas regarding this KDE proven in [27]. By considering the more general density functional case, we are also able to significantly simplify the proofs of the convergence rate and exponential concentration inequality.

Organization

In Section 2, we establish the theoretical context of our work, including notation, the precise problem statement, and our estimator. In Section 3, we outline our main theoretical results and state some consequences.
Sections 4 and 5 give precise statements and proofs of the results in Section 3. Finally, in Section 6, we extend our results to conditional density functionals, and state the consequences in the particular case of Rényi-α CMI.

2 Density Functional Estimator

2.1 Notation

For an integer k, [k] = {1, ..., k} denotes the set of positive integers at most k. Using the notation of multi-indices common in multivariable calculus, N^d denotes the set of d-tuples of non-negative integers, which we denote with a vector symbol i⃗, and, for i⃗ ∈ N^d,

    |i⃗| = Σ_{k=1}^{d} i_k    and    D^{i⃗} := ∂^{|i⃗|} / (∂x_1^{i_1} ··· ∂x_d^{i_d}).

For fixed β, L > 0, r ≥ 1, and a positive integer d, we will work with densities in the following bounded subset of a β-Hölder space:

    C^β_{L,r}([0, 1]^d) := { p : [0, 1]^d → R | max_{|i⃗| = ℓ} sup_{x ≠ y ∈ [0,1]^d} |D^{i⃗}p(x) − D^{i⃗}p(y)| / ‖x − y‖_r^{β−ℓ} ≤ L },    (1)

where ℓ = ⌊β⌋ is the greatest integer strictly less than β, and ‖·‖_r : R^d → R is the usual r-norm. To correct for boundary bias, we will require the densities to be nearly constant near the boundary of [0, 1]^d, in that their derivatives vanish at the boundary. Hence, we work with densities in

    Σ(β, L, r, d) := { p ∈ C^β_{L,r}([0, 1]^d) | max_{1 ≤ |i⃗| ≤ ℓ} |D^{i⃗}p(x)| → 0 as dist(x, ∂[0, 1]^d) → 0 },    (2)

where ∂[0, 1]^d = {x ∈ [0, 1]^d : x_j ∈ {0, 1} for some j ∈ [d]}.

2.2 Problem Statement

For each i ∈ [k], let X_i be a d_i-dimensional random vector taking values in X_i := [0, 1]^{d_i}, distributed according to a density p_i : X_i → R. For an appropriately smooth function f : R^k → R, we are interested in using a random sample of n i.i.d. points from the distribution of each X_i to estimate

    F(p_1, ..., p_k) := ∫_{X_1×···×X_k} f(p_1(x_1), ..., p_k(x_k)) d(x_1, ..., x_k).    (3)

2.3 Estimator

For a fixed bandwidth h, we first use the mirrored kernel density estimator (KDE) p̂_i described in [27] to estimate each density p_i. We then use a plug-in estimate of F(p_1, ..., p_k):

    F(p̂_1, ..., p̂_k) := ∫_{X_1×···×X_k} f(p̂_1(x_1), ..., p̂_k(x_k)) d(x_1, ..., x_k).

Our main results generalize those of [27] to a broader class of density functionals.
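To make the construction concrete, the following is a minimal sketch of such a plug-in estimate for k = 1 and d = 1 (our illustration, not the authors' code): a boundary-reflected Gaussian KDE stands in for the mirrored KDE of [27], and the integral is approximated on a grid; the bandwidth follows the choice discussed in Section 3.

import numpy as np

def mirrored_kde(samples, grid, h):
    # Reflect the sample about 0 and 1 (a simple stand-in for the mirrored KDE of
    # [27], which corrects boundary bias), then average Gaussian kernels of width h.
    pts = np.concatenate([samples, -samples, 2.0 - samples])
    u = (grid[:, None] - pts[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

def plugin_functional(samples, f, h, m=1000):
    # Plug-in estimate of F(p) = \int_0^1 f(p(x)) dx: evaluate f at the KDE on a grid.
    grid = np.linspace(0.0, 1.0, m)
    return np.trapz(f(mirrored_kde(samples, grid, h)), grid)

# Example: estimate the Shannon entropy -\int p log p of a Beta(2, 2) sample,
# using the undersmoothed bandwidth h ~ n^(-1/(beta + d)) discussed in Section 3.
rng = np.random.default_rng(0)
n, beta, d = 1000, 2.0, 1
h = n ** (-1.0 / (beta + d))
x = rng.beta(2.0, 2.0, size=n)
print(plugin_functional(x, lambda t: -t * np.log(np.maximum(t, 1e-300)), h))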
3 Main Results

In this section, we outline our main theoretical results, proven in Sections 4 and 5, and also discuss some important corollaries.

We decompose the estimator's error into a bias term and a variance-like term via the triangle inequality:

    |F(p̂_1, ..., p̂_k) − F(p_1, ..., p_k)|
        ≤ |F(p̂_1, ..., p̂_k) − EF(p̂_1, ..., p̂_k)|    (variance-like term)
        + |EF(p̂_1, ..., p̂_k) − F(p_1, ..., p_k)|.    (bias term)

We will prove the “variance” bound

    P(|F(p̂_1, ..., p̂_k) − EF(p̂_1, ..., p̂_k)| > ε) ≤ 2 exp(−2ε²n / C_V²)    (4)

for all ε > 0, and the bias bound

    |EF(p̂_1, ..., p̂_k) − F(p_1, ..., p_k)| ≤ C_B (h^β + h^{2β} + 1/(nh^d)),    (5)

where d := max_i d_i, and C_V and C_B are constant in the sample size n and the bandwidth h (see Sections 4 and 5 for their exact values). To the best of our knowledge, this is the first time an exponential inequality like (4) has been established for general density functional estimation. Since the variance bound does not depend on h and the bias bound is minimized by h ≍ n^{−1/(β+d)}, we have the convergence rate

    |EF(p̂_1, ..., p̂_k) − F(p_1, ..., p_k)| ∈ O(n^{−β/(β+d)}).

It is interesting to note that, in optimizing the bandwidth for our density functional estimate, we use a smaller bandwidth than is optimal for minimizing the bias of the KDE. Intuitively, this reflects the fact that the plug-in estimator, as an integral functional, performs some additional smoothing.
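For intuition, this bandwidth choice can be checked directly from the bias bound (5); the following is our own elementary sketch, ignoring the higher-order h^{2β} term:

\frac{d}{dh}\left( h^{\beta} + \frac{1}{n h^{d}} \right)
  = \beta h^{\beta-1} - \frac{d}{n h^{d+1}} = 0
  \;\Longleftrightarrow\;
  h^{\beta+d} = \frac{d}{\beta n}
  \;\Longleftrightarrow\;
  h \asymp n^{-\frac{1}{\beta+d}},
\qquad\text{and then}\qquad
  h^{\beta} \asymp \frac{1}{n h^{d}} \asymp n^{-\frac{\beta}{\beta+d}}.

By comparison, minimizing the KDE's own pointwise mean squared error, which scales as h^{2β} + 1/(nh^d), gives the larger bandwidth h ≍ n^{−1/(2β+d)}; in this sense the plug-in estimator indeed undersmooths.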
We can use our exponential concentration bound to obtain a bound on the true variance of F(p̂_1, ..., p̂_k). If G : [0, ∞) → R denotes the cumulative distribution function of the squared deviation of F(p̂_1, ..., p̂_k) from its mean, then

    1 − G(ε) = P((F(p̂_1, ..., p̂_k) − EF(p̂_1, ..., p̂_k))² > ε) ≤ 2 exp(−2εn / C_V²).

Thus,

    V[F(p̂_1, ..., p̂_k)] = E[(F(p̂_1, ..., p̂_k) − EF(p̂_1, ..., p̂_k))²] = ∫_0^∞ (1 − G(ε)) dε ≤ 2 ∫_0^∞ exp(−2εn / C_V²) dε = C_V² n^{−1}.

We then have a mean squared error of

    E[(F(p̂_1, ..., p̂_k) − F(p_1, ..., p_k))²] ∈ O(n^{−1} + n^{−2β/(β+d)}),

which is in O(n^{−1}) if β ≥ d and in O(n^{−2β/(β+d)}) otherwise.

It should be noted that the constants in both the bias bound and the variance bound depend exponentially on the dimension d. Lower bounds in terms of d are unknown for estimating most density functionals of interest, and an important open problem is whether this dependence can be made asymptotically better than exponential.

4 Bias Bound

In this section, we precisely state and prove the bound on the bias of our density functional estimator, as introduced in Section 3.

Assume each p_i ∈ Σ(β, L, r, d) (for i ∈ [k]), assume f : R^k → R is twice continuously differentiable, with first and second derivatives all bounded in magnitude by some C_f ∈ R (see footnote 1), and assume the kernel K : R → R has bounded support [−1, 1] and satisfies

    ∫_{−1}^{1} K(u) du = 1    and    ∫_{−1}^{1} u^j K(u) du = 0 for all j ∈ {1, ..., ℓ}.

Then, there exists a constant C_B ∈ R such that

    |EF(p̂_1, ..., p̂_k) − F(p_1, ..., p_k)| ≤ C_B (h^β + h^{2β} + 1/(nh^d)).

Footnote 1: If p_1(X_1) × ··· × p_k(X_k) is known to lie within some cube [κ_1, κ_2]^k, then it suffices for f to be twice continuously differentiable on [κ_1, κ_2]^k (and the boundedness condition follows immediately). This will be important for our application to Rényi-α Conditional Mutual Information.

4.1 Proof of Bias Bound

By Taylor's Theorem, for all x = (x_1, ..., x_k) ∈ X_1 × ··· × X_k, for some ξ ∈ R^k on the line segment between p̂(x) := (p̂_1(x_1), ..., p̂_k(x_k)) and p(x) := (p_1(x_1), ..., p_k(x_k)), letting H_f denote the Hessian of f,

    |Ef(p̂(x)) − f(p(x))| = |E[(∇f)(p(x)) · (p̂(x) − p(x)) + (1/2)(p̂(x) − p(x))^T H_f(ξ)(p̂(x) − p(x))]|
        ≤ C_f ( Σ_{i=1}^{k} |B_{p_i}(x_i)| + Σ_{i<j≤k} |B_{p_i}(x_i) B_{p_j}(x_j)| + Σ_{i=1}^{k} E[p̂_i(x_i) − p_i(x_i)]² ),

where we used that p̂_i and p̂_j are independent for i ≠ j. Applying Hölder's Inequality,

    |EF(p̂_1, ..., p̂_k) − F(p_1, ..., p_k)| ≤ ∫_{X_1×···×X_k} |Ef(p̂(x)) − f(p(x))| dx
        ≤ C_f ( Σ_{i=1}^{k} ∫_{X_i} ( |B_{p_i}(x_i)| + E[p̂_i(x_i) − p_i(x_i)]² ) dx_i + Σ_{i<j≤k} ∫_{X_i} |B_{p_i}(x_i)| dx_i ∫_{X_j} |B_{p_j}(x_j)| dx_j )
        ≤ C_f ( Σ_{i=1}^{k} ( (∫_{X_i} B²_{p_i}(x_i) dx_i)^{1/2} + ∫_{X_i} E[p̂_i(x_i) − p_i(x_i)]² dx_i ) + Σ_{i<j≤k} ( ∫_{X_i} B²_{p_i}(x_i) dx_i ∫_{X_j} B²_{p_j}(x_j) dx_j )^{1/2} ).

We now make use of the so-called Bias Lemma proven by [27], which bounds the integrated squared bias of the mirrored KDE p̂ on [0, 1]^d for an arbitrary p ∈ Σ(β, L, r, d). Writing the bias of p̂ at x ∈ [0, 1]^d as B_p(x) = Ep̂(x) − p(x), [27] showed that there exists C > 0 constant in n and h such that

    ∫_{[0,1]^d} B²_p(x) dx ≤ C h^{2β}.    (6)

Applying the Bias Lemma and certain standard results in kernel density estimation (see, for example, Propositions 1.1 and 1.2 of [31]) gives

    |EF(p̂_1, ..., p̂_k) − F(p_1, ..., p_k)| ≤ C (k² h^β + k h^{2β}) + ‖K‖_1^d / (nh^d) ≤ C_B (h^β + h^{2β} + 1/(nh^d)),

where ‖K‖_1 denotes the 1-norm of the kernel. □
5 Variance Bound

In this section, we precisely state and prove the exponential concentration inequality for our density functional estimator, as introduced in Section 3. Assume that f is Lipschitz continuous with constant C_f in the 1-norm on p_1(X_1) × ··· × p_k(X_k), i.e.,

    |f(x) − f(y)| ≤ C_f Σ_{i=1}^{k} |x_i − y_i|    for all x, y ∈ p_1(X_1) × ··· × p_k(X_k),    (7)

and assume the kernel K ∈ L¹(R) (i.e., it has finite 1-norm). Then, there exists a constant C_V ∈ R such that, for all ε > 0,

    P(|F(p̂_1, ..., p̂_k) − EF(p̂_1, ..., p̂_k)| > ε) ≤ 2 exp(−2ε²n / C_V²).

Note that, while we require no assumptions on the densities here, in certain specific applications, such as for some Rényi-α quantities, where f = log, assumptions such as lower bounds on the density may be needed to ensure f is Lipschitz on its domain.

5.1 Proof of Variance Bound

Consider i.i.d. samples (x_1^1, ..., x_k^1), ..., (x_1^n, ..., x_k^n) ∈ X_1 × ··· × X_k drawn according to the product distribution p = p_1 × ··· × p_k. In anticipation of using McDiarmid's Inequality [15], let p̂'_j denote the jth mirrored KDE when the sample x_j^i is replaced by a new sample (x_j^i)'. Then, applying the Lipschitz condition (7) on f,

    |F(p̂_1, ..., p̂_k) − F(p̂_1, ..., p̂'_j, ..., p̂_k)| ≤ C_f ∫_{X_j} |p̂_j(x) − p̂'_j(x)| dx,

since most terms of the sum in (7) are zero. Expanding the definition of the kernel density estimates p̂_j and p̂'_j and noting that most terms of the mirrored KDEs p̂_j and p̂'_j are identical gives

    |F(p̂_1, ..., p̂_k) − F(p̂_1, ..., p̂'_j, ..., p̂_k)| = (C_f / (nh^{d_j})) ∫_{X_j} |K_{d_j}((x − x_j^i)/h) − K_{d_j}((x − (x_j^i)')/h)| dx,

where K_{d_j} denotes the d_j-dimensional mirrored product kernel based on K. Performing a change of variables to remove h and applying the triangle inequality followed by the bound on the integral of the mirrored kernel proven in [27],

    |F(p̂_1, ..., p̂_k) − F(p̂_1, ..., p̂'_j, ..., p̂_k)| ≤ (C_f / n) ∫ |K_{d_j}(u − x_j^i / h) − K_{d_j}(u − (x_j^i)' / h)| du
        ≤ (2C_f / n) ∫_{[−1,1]^{d_j}} |K_{d_j}(u)| du ≤ (2C_f / n) ‖K‖_1^{d_j} = C_V / n,    (8)

for C_V = 2C_f max_j ‖K‖_1^{d_j}. Since F(p̂_1, ..., p̂_k) depends on kn independent variables, McDiarmid's Inequality then gives, for any ε > 0,

    P(|F(p̂_1, ..., p̂_k) − EF(p̂_1, ..., p̂_k)| > ε) ≤ 2 exp(−2ε² / (kn C_V² / n²)) = 2 exp(−2ε²n / (k C_V²)). □
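The single-point bound (8) is easy to check numerically; the following short sketch (our illustration, reusing the reflected-KDE stand-in from the Section 2.3 sketch and the Lipschitz choice f(t) = t²) resamples one data point and confirms that the change in the plug-in estimate is on the order of 1/n, independently of h:

import numpy as np

def mirrored_kde(samples, grid, h):
    # Boundary-reflected Gaussian KDE (stand-in for the mirrored KDE of [27]).
    pts = np.concatenate([samples, -samples, 2.0 - samples])
    u = (grid[:, None] - pts[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

def F_hat(samples, grid, h):
    # Plug-in estimate of \int_0^1 p(x)^2 dx.
    p_hat = mirrored_kde(samples, grid, h)
    return np.trapz(p_hat ** 2, grid)

rng = np.random.default_rng(1)
n, grid = 500, np.linspace(0.0, 1.0, 1000)
x = rng.beta(2.0, 2.0, size=n)
x_prime = x.copy()
x_prime[0] = rng.uniform()                     # resample a single point
for h in [0.02, 0.1, 0.5]:
    delta = abs(F_hat(x, grid, h) - F_hat(x_prime, grid, h))
    print(h, delta, 'vs 1/n =', 1.0 / n)       # delta stays O(1/n) for every h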
6 Extension to Conditional Density Functionals

Our convergence result and concentration bound can be fairly easily adapted to KDE-based plug-in estimators for many functionals of interest, including Rényi-α and Tsallis-α entropy, divergence, and MI, and Lp norms and distances, which have either the same or analytically similar forms as the functional (3). As long as the density of the variable being conditioned on is lower bounded on its domain, our results also extend to conditional density functionals of the form (see footnote 2)

    F(P) = ∫_Z P(z) f( ∫_{X_1×···×X_k} g( P(x_1, z)/P(z), ..., P(x_k, z)/P(z) ) d(x_1, ..., x_k) ) dz,    (9)

including, for example, Rényi-α conditional entropy, divergence, and mutual information, where f is the function x ↦ (1/(1−α)) log(x). The proof of this extension for general k is essentially the same as for the case k = 1, and so, for notational simplicity, we demonstrate the latter.

Footnote 2: We abuse notation slightly and also use P to denote all of its marginal densities.

6.1 Problem Statement, Assumptions, and Estimator

For given dimensions d_x, d_z ≥ 1, consider random vectors X and Z distributed on unit cubes X := [0, 1]^{d_x} and Z := [0, 1]^{d_z} according to a joint density P : X × Z → R. We use a random sample of 2n i.i.d. points from P to estimate a conditional density functional F(P), where F has the form (9).

Suppose that P is in the Hölder class Σ(β, L, r, d_x + d_z), noting that this implies an analogous condition on each marginal of P, and suppose that P is bounded below and above, i.e., 0 < κ_1 := inf_{x∈X, z∈Z} P(z) and ∞ > κ_2 := sup_{x∈X, z∈Z} P(x, z). Suppose also that f and g are continuously differentiable, with

    C_f := sup_{x∈[c_g, C_g]} |f(x)|    and    C_{f'} := sup_{x∈[c_g, C_g]} |f'(x)|,    (10)

where

    c_g := inf g([0, κ_2/κ_1])    and    C_g := sup g([0, κ_2/κ_1]).

After estimating the densities P(z) and P(x, z) by their mirrored KDEs, using n independent data samples for each, we clip the estimates of P(x, z) and P(z) below by κ_1 and above by κ_2 and denote the resulting density estimates by P̂. Our estimate F(P̂) for F(P) is simply the result of plugging P̂ into equation (9).
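A brief sketch of this construction (ours, not the authors' code; scipy.stats.gaussian_kde stands in for the mirrored KDE, d_z = 1 for simplicity, and κ_1, κ_2 are taken as known):

import numpy as np
from scipy.stats import gaussian_kde

def conditional_plugin_densities(xz, kappa1, kappa2):
    # xz: array of shape (2n, d_x + 1), an i.i.d. sample from P(x, z), z in the last
    # column. The two halves of the sample feed the two KDEs, as in Section 6.1,
    # and both estimates are clipped to [kappa1, kappa2].
    half = len(xz) // 2
    kde_joint = gaussian_kde(xz[:half].T)    # estimates P(x, z)
    kde_z = gaussian_kde(xz[half:, -1])      # estimates P(z)
    P_xz = lambda pts: np.clip(kde_joint(pts), kappa1, kappa2)   # pts: (d_x + 1, m)
    P_z = lambda z: np.clip(kde_z(z), kappa1, kappa2)            # z: (m,)
    return P_xz, P_z

The plug-in estimate F(P̂) is then obtained by substituting these clipped estimates into (9) and carrying out the two integrals numerically, exactly as in the unconditional sketch of Section 2.3.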
6.2 Proof of Bounds for Conditional Density Functionals

We bound the error of F(P̂) in terms of the error of estimating the corresponding unconditional density functional using our previous estimator, and then apply our previous results.

Suppose P_1 is either the true density P or a plug-in estimate of P computed as described above, and P_2 is a plug-in estimate of P computed in the same manner but using a different data sample. Applying the triangle inequality twice,

    |F(P_1) − F(P_2)| ≤ ∫_Z | P_1(z) f( ∫_X g(P_1(x, z)/P_1(z)) dx ) − P_2(z) f( ∫_X g(P_1(x, z)/P_1(z)) dx ) | dz
        + ∫_Z | P_2(z) f( ∫_X g(P_1(x, z)/P_1(z)) dx ) − P_2(z) f( ∫_X g(P_2(x, z)/P_2(z)) dx ) | dz
    ≤ ∫_Z ( |P_1(z) − P_2(z)| · | f( ∫_X g(P_1(x, z)/P_1(z)) dx ) |
        + P_2(z) | f( ∫_X g(P_1(x, z)/P_1(z)) dx ) − f( ∫_X g(P_2(x, z)/P_2(z)) dx ) | ) dz.

Applying the Mean Value Theorem and the bounds in (10) gives

    |F(P_1) − F(P_2)| ≤ ∫_Z ( C_f |P_1(z) − P_2(z)| + κ_2 C_{f'} |G_{P_1(z)}(P_1(·, z)) − G_{P_2(z)}(P_2(·, z))| ) dz,

where G_z is the density functional

    G_{P(z)}(Q) = ∫_X g( Q(x) / P(z) ) dx.

Note that, since the data are split to estimate P(z) and P(x, z), G_{P̂(z)}(P̂(·, z)) depends on each data point through only one of these KDEs. In the case that P_1 is the true density P, taking the expectation and using Fubini's Theorem gives

    E|F(P) − F(P̂)| ≤ ∫_Z ( C_f E|P(z) − P̂(z)| + κ_2 C_{f'} E|G_{P(z)}(P(·, z)) − G_{P̂(z)}(P̂(·, z))| ) dz
        ≤ C_f ( ∫_Z E(P(z) − P̂(z))² dz )^{1/2} + 2κ_2 C_{f'} C_B (h^β + h^{2β} + 1/(nh^d))
        ≤ (2κ_2 C_{f'} C_B + C_f C) (h^β + h^{2β} + 1/(nh^d)),

applying Hölder's Inequality and our bias bound (5), followed by the Bias Lemma (6). This extends our bias bound to conditional density functionals. For the variance bound, consider the case where P_1 and P_2 are each mirrored KDE estimates of P, but with one data point resampled (as in the proof of the variance bound, setting up to use McDiarmid's Inequality).
By the same sequence of steps used to show (8),

    ∫_Z |P_1(z) − P_2(z)| dz ≤ 2‖K‖_1^{d_z} / n    and    ∫_Z |G_{P_1(z)}(P_1(·, z)) − G_{P_2(z)}(P_2(·, z))| dz ≤ C_V / n

(by casing on whether the resampled data point was used to estimate P(x, z) or P(z)), for an appropriate C_V depending on sup_{x∈[κ_1/κ_2, κ_2/κ_1]} |g'(x)|. Then, by McDiarmid's Inequality,

    P(|F(P̂) − EF(P̂)| > ε) ≤ 2 exp(−ε²n / (4C_V²)). □

6.3 Application to Rényi-α Conditional Mutual Information

As an example, we apply our concentration inequality to the Rényi-α Conditional Mutual Information (CMI). Consider random vectors X, Y, and Z on X = [0, 1]^{d_x}, Y = [0, 1]^{d_y}, Z = [0, 1]^{d_z}, respectively. For α ∈ (0, 1) ∪ (1, ∞), the Rényi-α CMI of X and Y given Z is

    I(X; Y | Z) = (1/(1−α)) ∫_Z P(z) log ∫_{X×Y} ( P(x, y, z)/P(z) )^α ( P(x, z)P(y, z)/P(z)² )^{1−α} d(x, y) dz.    (11)

In this case, the estimator which plugs mirrored KDEs for P(x, y, z), P(x, z), P(y, z), and P(z) into (11) obeys the concentration inequality (4) with C_V = κ*‖K‖_1^{d_x+d_y+d_z}, where κ* depends only on α, κ_1, and κ_2.
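For concreteness, here is a minimal sketch of this plug-in estimate for scalar X, Y, Z (our illustration, not the authors' code): scipy.stats.gaussian_kde stands in for the mirrored KDE, the four densities are clipped to an assumed [κ_1, κ_2], and both integrals in (11) are approximated on a regular grid.

import numpy as np
from scipy.stats import gaussian_kde

def renyi_cmi(xyz, alpha=0.5, kappa1=1e-3, kappa2=1e3, m=30):
    # Plug-in estimate of the Renyi-alpha CMI (11) for X, Y, Z on [0, 1].
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    clip = lambda v: np.clip(v, kappa1, kappa2)
    p_xyz = gaussian_kde(xyz.T)
    p_xz = gaussian_kde(np.vstack([x, z]))
    p_yz = gaussian_kde(np.vstack([y, z]))
    p_z = gaussian_kde(z)

    g = np.linspace(0.0, 1.0, m)
    step = g[1] - g[0]
    X, Y = np.meshgrid(g, g, indexing='ij')
    total = 0.0
    for zv in g:                                   # outer integral over z
        Z = np.full_like(X, zv)
        pj = clip(p_xyz(np.vstack([X.ravel(), Y.ravel(), Z.ravel()])))
        pxz = clip(p_xz(np.vstack([X.ravel(), Z.ravel()])))
        pyz = clip(p_yz(np.vstack([Y.ravel(), Z.ravel()])))
        pz = clip(p_z(np.array([zv])))[0]
        integrand = (pj / pz) ** alpha * (pxz * pyz / pz ** 2) ** (1.0 - alpha)
        inner = integrand.sum() * step ** 2        # inner integral over (x, y)
        total += pz * np.log(inner) * step
    return total / (1.0 - alpha)

# X and Y are conditionally independent given Z, so the estimate should be near 0.
rng = np.random.default_rng(2)
z = rng.uniform(0.2, 0.8, 500)
x = np.clip(z + 0.05 * rng.standard_normal(500), 0.0, 1.0)
y = np.clip(z + 0.05 * rng.standard_normal(500), 0.0, 1.0)
print(renyi_cmi(np.column_stack([x, y, z])))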
References

[1] M. Aghagolzadeh, H. Soltanian-Zadeh, B. Araabi, and A. Aghagolzadeh. A hierarchical clustering based on mutual information maximization. In Proc. of IEEE International Conference on Image Processing, pages 277–280, 2007.

[2] L. Birgé and P. Massart. Estimation of integral functionals of a density. Annals of Statistics, 23:11–29, 1995.

[3] T. Bouezmarni, J. Rombouts, and A. Taamouti. A nonparametric copula based test for conditional independence with applications to Granger causality, 2009. Technical report, Universidad Carlos III, Departamento de Economia.

[4] K. Fukumizu, A. Gretton, X. Sun, and B. Schoelkopf. Kernel measures of conditional dependence. In Neural Information Processing Systems (NIPS), 2008.

[5] M. N. Goria, N. N. Leonenko, V. V. Mergel, and P. L. Novi Inverardi. A new class of random vector entropy estimators and its applications in testing statistical hypotheses. J. Nonparametric Statistics, 17:277–297, 2005.

[6] A. O. Hero, B. Ma, O. J. J. Michel, and J. Gorman. Applications of entropic spanning graphs. IEEE Signal Processing Magazine, 19(5):85–95, 2002.

[7] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, 2009.

[8] A. Krishnamurthy, K. Kandasamy, B. Poczos, and L. Wasserman. Nonparametric estimation of Rényi divergence and friends. In International Conference on Machine Learning (ICML), 2014.

[9] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.

[10] E. G. Learned-Miller and J. W. Fisher. ICA using spacings estimates of entropy. J. Machine Learning Research, 4:1271–1295, 2003.

[11] N. Leonenko, L. Pronzato, and V. Savani. A class of Rényi information estimators for multidimensional densities. Annals of Statistics, 36(5):2153–2182, 2008.

[12] N. Leonenko, L. Pronzato, and V. Savani. Estimation of entropies and divergences via nearest neighbours. Tatra Mt. Mathematical Publications, 39, 2008.

[13] J. Lewi, R. Butera, and L. Paninski. Real-time adaptive information-theoretic optimization of neurophysiology experiments. In Advances in Neural Information Processing Systems, volume 19, 2007.

[14] H. Liu, J. Lafferty, and L. Wasserman. Exponential concentration inequality for mutual information estimation. In Neural Information Processing Systems (NIPS), 2012.

[15] C. McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141:148–188, 1989.

[16] D. Montgomery. Design and Analysis of Experiments. John Wiley and Sons, 2005.

[17] K. R. Moon and A. O. Hero. Ensemble estimation of multivariate f-divergence. In Information Theory (ISIT), 2014 IEEE International Symposium on, pages 356–360, June 2014.

[18] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. on Information Theory, 2010.

[19] J. Oliva, B. Poczos, and J. Schneider. Distribution to distribution regression. In International Conference on Machine Learning (ICML), 2013.

[20] J. Pearl. Why there is no statistical test for confounding, why many think there is, and why they are almost right, 1998. UCLA Computer Science Department Technical Report R-256.

[21] H. Peng and C. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27, 2005.

[22] B. Poczos and J. Schneider. Nonparametric estimation of conditional information and divergences. In International Conference on AI and Statistics (AISTATS), volume 20 of JMLR Workshop and Conference Proceedings, 2012.

[23] B. Poczos, L. Xiong, D. Sutherland, and J. Schneider. Nonparametric kernel estimators for image classification. In 25th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[24] S. J. Reddi and B. Poczos. Scale invariant conditional dependence measures. In International Conference on Machine Learning (ICML), 2013.

[25] A. Rényi. Probability Theory. North-Holland Publishing Company, Amsterdam, 1970.

[26] C. Shan, S. Gong, and P. W. McOwan. Conditional mutual information based boosting for facial expression recognition. In British Machine Vision Conference (BMVC), 2005.

[27] S. Singh and B. Poczos. Generalized exponential concentration inequality for Rényi divergence estimation. In International Conference on Machine Learning (ICML), 2014.

[28] K. Sricharan, D. Wei, and A. Hero. Ensemble estimators for multivariate entropy estimation, 2013.

[29] L. Su and H. White. A nonparametric Hellinger metric test for conditional independence. Econometric Theory, 24:829–864, 2008.

[30] Z. Szabó, B. Póczos, and A. Lőrincz. Undercomplete blind subspace deconvolution. J. Machine Learning Research, 8:1063–1095, 2007.

[31] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008.

[32] T. Villmann and S. Haase. Mathematical aspects of divergence based vector quantization using Frechet-derivatives, 2010. University of Applied Sciences Mittweida.

[33] Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5), 2009.
[34] E. Wolsztynski, E. Thierry, and L. Pronzato. Minimum-entropy estimation in semi-parametric models. Signal Process., 85(5):937–949, 2005.

[35] K. Zhang, J. Peters, D. Janzing, and B. Scholkopf. Kernel-based conditional independence test and application in causal discovery. In Uncertainty in Artificial Intelligence (UAI), 2011.
", "award": [], "sourceid": 1576, "authors": [{"given_name": "Shashank", "family_name": "Singh", "institution": "Carnegie Mellon University"}, {"given_name": "Barnabas", "family_name": "Poczos", "institution": "CMU"}]}