{"title": "Estimation of Information Theoretic Measures for Continuous Random Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 1257, "page_last": 1264, "abstract": "We analyze the estimation of information theoretic measures of continuous random variables such as differential entropy, mutual information or Kullback-Leibler divergence. The objective of this paper is two-fold. First, we prove that the information theoretic measure estimates using the k-nearest-neighbor density estimation with fixed k converge almost surely, even though the k-nearest-neighbor density estimation with fixed k does not converge to its true measure. Second, we show that the information theoretic measure estimates do not converge for k growing linearly with the number of samples. Nevertheless, these nonconvergent estimates can be used for solving the two-sample problem and assessing if two random variables are independent. We show that the two-sample and independence tests based on these nonconvergent estimates compare favorably with the maximum mean discrepancy test and the Hilbert-Schmidt independence criterion, respectively.", "full_text": "Estimation of Information Theoretic Measures for Continuous Random Variables\n\nFernando Pérez-Cruz\n\nPrinceton University, Electrical Engineering Department\nB-311 Engineering Quadrangle, 08544 Princeton (NJ)\n\nfp@princeton.edu\n\nAbstract\n\nWe analyze the estimation of information theoretic measures of continuous random variables such as differential entropy, mutual information or Kullback-Leibler divergence. The objective of this paper is two-fold. First, we prove that the information theoretic measure estimates using the k-nearest-neighbor density estimation with fixed k converge almost surely, even though the k-nearest-neighbor density estimation with fixed k does not converge to its true measure.
Second, we show that the information theoretic measure estimates do not converge for k growing linearly with the number of samples. Nevertheless, these nonconvergent estimates can be used for solving the two-sample problem and assessing if two random variables are independent. We show that the two-sample and independence tests based on these nonconvergent estimates compare favorably with the maximum mean discrepancy test and the Hilbert-Schmidt independence criterion.\n\n1 Introduction\n\nKullback-Leibler divergence, mutual information and differential entropy are central to information theory [5]. The divergence [17] measures the 'distance' between two density distributions, while mutual information measures the information one random variable contains about a related random variable [23]. In machine learning, statistics and neuroscience, information theoretic measures also play a leading role. For instance, the divergence is the error exponent in large deviation theory [5] and can be directly applied to solving the two-sample problem [1]. Mutual information is extensively used to assess whether two random variables are independent [2] and has been proposed to solve the all-relevant feature selection problem [8, 24]. Information-theoretic analysis of neural data is unavoidable given the questions neurophysiologists are interested in1. There are other relevant applications in different research areas in which divergence estimation is used to measure the difference between two density functions, such as multimedia [19] and text [13] classification, among others.\n\nThe estimation of information theoretic quantities can be traced back to the late fifties [7], when Dobrushin estimated the differential entropy for one-dimensional random variables. The review paper by Beirlant et al. [4] analyzes the different contributions of nonparametric differential entropy estimation for continuous random variables.
The estimation of the divergence and mutual information for continuous random variables has been addressed by many different authors [25, 6, 26, 18, 20, 16]; see also the references therein. Most of these approaches are based on estimating the densities first. For example, in [25], the authors propose to estimate the densities based on data-dependent histograms with a fixed number of samples from q(x) in each bin. The authors of [6] compute relative frequencies on data-driven partitions achieving local independence for estimating mutual information. Also, in [20, 21], the authors compute the divergence using a variational approach, in which convergence is proven by ensuring that the estimate of p(x)/q(x) or log p(x)/q(x) converges to the true measure ratio or its log ratio.\n\n1See [22] for a detailed discussion on mutual information estimation in neuroscience.\n\nThere are only a handful of approaches that use k-nearest-neighbor (k-nn) density estimation [26, 18, 16] to estimate the divergence and mutual information for finite k. Although finite-k k-nn density estimation does not converge to the true measure, the authors are able to prove mean-square consistency of their divergence estimators by imposing some regularity constraints on the densities. These proofs are based on the results reported in [15] for estimating the differential entropy with k-nn density estimation.\n\nThe results in this paper are two-fold. First, we prove almost sure convergence of our divergence estimate based on k-nn density estimation with finite k. Our result is based on describing the statistics of p(x)/p̂(x) as a waiting-time distribution independent of p(x). We can readily apply this result to the estimation of the differential entropy and mutual information.\n\nSecond, we show that for k growing linearly with the number of samples, our estimates do not converge and do not present known statistics.
But they can be reliably used for solving the two-sample problem or for assessing if two random variables are independent. We show that, for this choice of k, the estimates of the divergence and the mutual information perform, respectively, as well as the maximum mean discrepancy (MMD) test in [9] and the Hilbert-Schmidt independence criterion (HSIC) proposed in [10].\n\nThe rest of the paper is organized as follows. We prove in Section 2 the almost sure convergence of the divergence estimate based on k-nn density estimation with fixed k. We extend this result to differential entropy and mutual information in Section 3. In Section 4 we present some examples to illustrate the convergence of our estimates and to show how they can be used to assess the independence of related random variables. Section 5 concludes the paper with some final remarks.\n\n2 Estimation of the Kullback-Leibler Divergence\n\nIf the densities p(x) and q(x) of P and Q exist with respect to the Lebesgue measure, the Kullback-Leibler divergence is given by:\n\nD(P||Q) = ∫_{R^d} p(x) log[p(x)/q(x)] dx ≥ 0.   (1)\n\nThis divergence is finite whenever P is absolutely continuous with respect to Q, and it is zero only if P = Q.\n\nThe idea of using k-nn density estimation to estimate the divergence was put forward in [26, 18], where mean-square consistency of the estimator is proven for finite k. In this paper, we prove the almost sure convergence of this divergence estimator using waiting-time distributions, without needing to impose additional conditions on the density models. Given a set with n i.i.d. samples from p(x), X = {x_i}_{i=1}^n, and m i.i.d.
samples from q(x), X′ = {x′_j}_{j=1}^m, we estimate D(P||Q) from a k-nn density estimate of p(x) and q(x) as follows:\n\nD̂_k(P||Q) = (1/n) Σ_{i=1}^n log[p̂_k(x_i)/q̂_k(x_i)] = (d/n) Σ_{i=1}^n log[s_k(x_i)/r_k(x_i)] + log[m/(n − 1)]   (2)\n\nwhere\n\np̂_k(x_i) = [k/(n − 1)] · [Γ(d/2 + 1)/π^{d/2}] · [1/r_k(x_i)^d]   (3)\n\nq̂_k(x_i) = (k/m) · [Γ(d/2 + 1)/π^{d/2}] · [1/s_k(x_i)^d]   (4)\n\nr_k(x_i) and s_k(x_i) are, respectively, the Euclidean distances to the k-nn of x_i in X\\x_i and in X′, and π^{d/2}/Γ(d/2 + 1) is the volume of the unit ball in R^d. Before proving that (2) converges almost surely to (1), let us show an intermediate necessary result.\n\nLemma 1. Given n i.i.d. samples, X = {x_i}_{i=1}^n, from an absolutely continuous probability distribution P, the limiting distribution of p(x)/p̂_1(x) is exponential with unit mean for any x in the support of p(x).\n\nProof. Let us initially assume p(x) is a d-dimensional uniform distribution with a given support. The set S_{x,R} = {x_i | ||x_i − x||_2 ≤ R, x_i ∈ X} contains all the samples from X inside the ball centered at x of radius R. The radius R has to be small enough for the ball centered at x to be contained within the support of p(x). The samples in {||x_i − x||_2^d | x_i ∈ S_{x,R}} are consequently uniformly distributed between 0 and R^d. Thereby, the limiting distribution of r_1(x)^d = min_{x_j ∈ S_{x,R}} ||x_j − x||_2^d is exponential, as it measures the waiting time between the origin and the first event of a uniformly-spaced sample (see Theorem 2.4 in [3]).
Since p(x)·n·π^{d/2}/Γ(d/2 + 1) is the mean number of samples per unit ball centered at x, p(x)/p̂_1(x) is distributed as a unit-mean exponential distribution as n tends to infinity.\n\nFor non-uniform absolutely continuous P, P(r_1(x) > ε) → 0 as n → ∞ for any x in the support of p(x) and any ε > 0. Therefore, as n tends to infinity, p(argmin_{x_j ∈ S_{x,R}} ||x_j − x||_2^d) → p(x) and the limiting distribution of p(x)/p̂_1(x) is a unit-mean exponential distribution.\n\nCorollary 1. Given n i.i.d. samples, X = {x_i}_{i=1}^n, from an absolutely continuous probability distribution P, the limiting distribution of p(x)/p̂_k(x) is a unit-mean, 1/k-variance gamma distribution for any x in the support of p(x).\n\nProof. In the previous proof, instead of measuring the waiting time to the first event, we compute the waiting time to the kth event of a uniformly-spaced sample. This waiting-time limiting distribution is a unit-mean, 1/k-variance Erlang (gamma) distribution [14].\n\nCorollary 2. Given n i.i.d. samples, X = {x_i}_{i=1}^n, from an absolutely continuous probability distribution P, p̂_k(x) → p(x) in probability for any x in the support of p(x), if k → ∞ and k/n → 0 as n → ∞.\n\nProof. The k-nn in X tends to x as k/n → 0 and n → ∞. Thereby the limiting distribution of p(x)/p̂_k(x) is a unit-mean, 1/k-variance gamma distribution. As k → ∞ the variance of the gamma distribution goes to zero and consequently p̂_k(x) converges to p(x).\n\nThe second corollary is the widely known result that k-nn density estimation converges to the true measure if k → ∞ and k/n → 0; we have included it here for clarity and completeness. If k grows linearly with n, the k-nn sample in X does not converge to x, which precludes p(x)/p̂_k(x) from presenting known statistics.
For this growth of k, the divergence estimate does not converge to D(P||Q).\n\nNow we can prove the almost sure convergence to (1) of the estimate in (2) based on k-nn density estimation.\n\nTheorem 1. Let P and Q be absolutely continuous probability measures and let P be absolutely continuous with respect to Q. Let X = {x_i}_{i=1}^n and X′ = {x′_i}_{i=1}^m be i.i.d. samples, respectively, from P and Q. Then\n\nD̂_k(P||Q) → D(P||Q) almost surely.   (5)\n\nProof. We can rearrange D̂_k(P||Q) in (2) as follows:\n\nD̂_k(P||Q) = (1/n) Σ_{i=1}^n log[p̂_k(x_i)/q̂_k(x_i)] = (1/n) Σ_{i=1}^n log[p(x_i)/q(x_i)] − (1/n) Σ_{i=1}^n log[p(x_i)/p̂_k(x_i)] + (1/n) Σ_{i=1}^n log[q(x_i)/q̂_k(x_i)]   (6)\n\nThe first term is the empirical estimate of (1) and, by the law of large numbers [11], it converges almost surely to its mean, D(P||Q).\n\nThe limiting distributions of p(x_i)/p̂_k(x_i) and q(x_i)/q̂_k(x_i) are unit-mean, 1/k-variance gamma distributions, independent of i, p(x) and q(x) (see Corollary 1). In the large-sample limit,\n\n(1/n) Σ_{i=1}^n log[p(x_i)/p̂_k(x_i)] → [k^k/(k − 1)!] ∫_0^∞ log(z) z^{k−1} e^{−kz} dz almost surely   (7)\n\nby the law of large numbers [11], and likewise for the third term.\n\nFinally, the sum of almost surely convergent terms also converges almost surely [11], which completes our proof.\n\nThe k-nn based divergence estimator is biased, because the convergence rates of p(x_i)/p̂_k(x_i) and q(x_i)/q̂_k(x_i) to the unit-mean, 1/k-variance gamma distribution depend on the density models, and we should not expect them to be identical.
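As a concrete illustration, the estimator in (2) reduces to nearest-neighbor distance computations. The following is a minimal sketch of such an estimator in Python (our own illustration, not a reference implementation; the function name `knn_divergence` and its interface are ours), using scipy's cKDTree for the neighbor searches:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_divergence(x, xq, k=1):
    """Estimate D(P||Q) as in Eq. (2): (d/n) sum_i log(s_k/r_k) + log(m/(n-1)).

    x  : (n, d) array of i.i.d. samples from p(x)
    xq : (m, d) array of i.i.d. samples from q(x)
    """
    x, xq = np.atleast_2d(x), np.atleast_2d(xq)
    n, d = x.shape
    m = xq.shape[0]
    # r_k(x_i): distance to the k-th nearest neighbor of x_i in X \ {x_i};
    # we query k+1 neighbors because x_i is returned as its own 0-distance neighbor.
    r = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    # s_k(x_i): distance to the k-th nearest neighbor of x_i in X'.
    s = cKDTree(xq).query(x, k=k)[0]
    if k > 1:
        s = s[:, -1]
    return d * np.mean(np.log(s / r)) + np.log(m / (n - 1.0))
```

Note that the prefactors k/(n − 1), k/m and the unit-ball volume in (3) and (4) cancel inside the log ratio, so only the distances r_k and s_k and the additive constant log[m/(n − 1)] remain; by Theorem 1, for fixed k the returned value converges almost surely to D(P||Q).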
If p(x) = q(x), the divergence is zero and our estimate is unbiased for any k (even if k/n does not tend to zero), since the statistics of the second and third terms in (6) are identical and they cancel each other out for any n (their expected means are the same). We use the Monte Carlo based test described in [9] with our divergence estimator to solve the two-sample problem and decide if the samples in X and X′ actually came from the same distribution.\n\n3 Differential Entropy and Mutual Information Estimation\n\nThe results obtained for the divergence can be readily applied to estimate the differential entropy of a random variable or the mutual information between two correlated random variables. The differential entropy for an absolutely continuous random variable with density p(x) is given by:\n\nh(x) = −∫ p(x) log p(x) dx   (8)\n\nWe can estimate the differential entropy, given a set with n i.i.d. samples from P, X = {x_i}_{i=1}^n, using k-nn density estimation as follows:\n\nĥ_k(x) = −(1/n) Σ_{i=1}^n log p̂_k(x_i)   (9)\n\nwhere p̂_k(x_i) is given by (3).\n\nTheorem 2. Let P be an absolutely continuous probability measure and let X = {x_i}_{i=1}^n be i.i.d. samples from P. Then\n\nĥ_k(x) → h(x) − γ_k almost surely   (10)\n\nwhere\n\nγ_k = −[k^k/(k − 1)!] ∫_0^∞ log(z) z^{k−1} e^{−kz} dz   (11)\n\nand γ_1 ≅ 0.5772 is the Euler-Mascheroni constant [12].\n\nProof.
We can rearrange ĥ_k(x) in (9) as follows:\n\nĥ_k(x) = −(1/n) Σ_{i=1}^n log p̂_k(x_i) = −(1/n) Σ_{i=1}^n log p(x_i) + (1/n) Σ_{i=1}^n log[p(x_i)/p̂_k(x_i)]   (12)\n\nThe first term is the empirical estimate of (8) and, by the law of large numbers [11], it converges almost surely to its mean, h(x).\n\nThe limiting distribution of p(x_i)/p̂_k(x_i) is a unit-mean, 1/k-variance gamma distribution, independent of i and p(x) (see Corollary 1). In the large-sample limit,\n\n(1/n) Σ_{i=1}^n log[p(x_i)/p̂_k(x_i)] → [k^k/(k − 1)!] ∫_0^∞ log(z) z^{k−1} e^{−kz} dz = −γ_k almost surely   (13)\n\nby the law of large numbers [11]. Finally, the sum of almost surely convergent terms also converges almost surely [11], which completes our proof.\n\nNow we can use the expansions of the conditional differential entropy, the mutual information and the conditional mutual information to prove the convergence of their estimates based on k-nn density estimation:\n\n• Conditional differential entropy:\n\nh(y|x) = −∫ p(x, y) log[p(y, x)/p(x)] dx dy   (14)\n\nĥ(y|x) = −(1/n) Σ_{i=1}^n log[p̂(y_i, x_i)/p̂(x_i)] → h(y|x) almost surely   (15)\n\n• Mutual information:\n\nI(x; y) = ∫ p(x, y) log[p(y, x)/(p(x)p(y))] dx dy   (16)\n\nÎ(x; y) = (1/n) Σ_{i=1}^n log[p̂(y_i, x_i)/(p̂(x_i)p̂(y_i))] → I(x; y) − γ_k almost surely   (17)\n\n• Conditional mutual information:\n\nI(x; y|z) = ∫ p(x, y, z) log[p(y, x, z)p(z)/(p(x, z)p(y, z))] dx dy dz   (18)\n\nÎ(x; y|z) = (1/n) Σ_{i=1}^n log[p̂(y_i, x_i, z_i)p̂(z_i)/(p̂(x_i, z_i)p̂(y_i, z_i))] → I(x; y|z) almost surely   (19)\n\n(The γ_k biases of the individual entropy estimates cancel in (15) and (19), while a single −γ_k term survives in (17).)\n\n4 Experiments\n\nWe have carried out two sets of experiments. In the first one, we show the convergence of the divergence estimates to their
limiting values as the number of samples tends to infinity, and we compare the divergence estimation to the MMD test in [9] on the MNIST dataset. In the second experiment, we test whether two random variables are independent and compare the results to the HSIC proposed in [10].\n\nWe first compute the divergence between a uniform distribution between 0 and 1 in d dimensions and a zero-mean Gaussian distribution with identity covariance matrix. We plot the divergence estimates for d = 1 and d = 5 in Figure 1 as a function of n, for k = 1, k = √n and k = n/2, with m = n.\n\nFigure 1: We plot the divergence for d = 1 in (a) and d = 5 in (b). The solid line with '⋆' represents the divergence estimate for k = 1, the solid line with '∗' represents the divergence estimate for k = √n, the solid line with '◦' represents the divergence estimate for k = n/2, and the dash-dotted line represents the true divergence. The dashed lines represent ±3 standard deviations for each divergence estimate; we have not added symbols to them to avoid cluttering the images, and from the plots it should be clear which confidence interval is assigned to which estimate.\n\nAs expected, the divergence estimate for k = n/2 does not converge to the true divergence, as the limiting distributions of p(x)/p̂_k(x) and q(x)/q̂_k(x) are unknown and depend on p(x) and q(x), respectively. Nevertheless, this estimate converges faster to its limiting value and its variance is much smaller than that provided by the estimates of the divergence with k = √n or k = 1.
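This behavior can be reproduced with a short simulation. The sketch below (our own illustration, not the code used to generate Figure 1) evaluates the estimator of (2) between P = U(0, 1) and Q = N(0, 1) in d = 1, where the true value is D(P||Q) = (1/2) log(2π) + 1/6 ≈ 1.086, for the three choices of k:

```python
import numpy as np
from scipy.spatial import cKDTree

def d_hat(x, xq, k):
    # Eq. (2): (d/n) * mean log(s_k / r_k) + log(m / (n - 1))
    n, d = x.shape
    m = xq.shape[0]
    r = cKDTree(x).query(x, k=k + 1)[0][:, -1]   # k-nn distance within X \ {x_i}
    s = cKDTree(xq).query(x, k=k)[0]             # k-nn distance into X'
    s = s if k == 1 else s[:, -1]
    return d * np.mean(np.log(s / r)) + np.log(m / (n - 1.0))

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 1.0, (n, 1))    # samples from P = U(0, 1)
xq = rng.standard_normal((n, 1))     # samples from Q = N(0, 1)
# True divergence: 0.5 * log(2 * pi) + 1/6 ≈ 1.086
for k in (1, int(np.sqrt(n)), n // 2):
    print(k, d_hat(x, xq, k))
```

In runs of this sketch, the k = 1 estimate lands near the true value, while the k = n/2 estimate settles on a stable value with very small run-to-run variability, a value that need not equal the true divergence.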
This may indicate that using k = n/2 might be a better option for solving the two-sample problem than actually trying to estimate the true divergence, as theorized in [9].\n\nBoth divergence estimates for k = 1 and k = √n converge to the true divergence as the number of samples tends to infinity. The convergence of the divergence estimate for k = 1 is significantly faster than that with k = √n, because p(x)/p̂_1(x) converges much faster to its limiting distribution than p(x)/p̂_√n(x); the nearest neighbor of x is much closer than the √n-nearest-neighbor, and we need the k-nn to be close enough to x for p(x)/p̂_k(x) to be close to its limiting distribution. As d grows, the divergence estimates need many more samples to converge, and even for small dimensions the number of samples can be enormously large.\n\nNevertheless, we can still use this divergence estimate to assess whether two sets of samples come from the same distribution, because the divergence estimate for p(x) = q(x) is unbiased for any k. In Figure 2(a) we plot the divergence estimate between the threes and the twos handwritten digits of the MNIST dataset (http://yann.lecun.com/exdb/mnist/) in a 784-dimensional space: the mean values of D̂_1(3||2) (solid line) and D̂_1(3||3) (dashed line) over 100 experiments, together with their 90% confidence intervals. For comparison purposes, in Figure 2(b) we plot the MMD test from [9], in which a kernel method was proposed for solving the two-sample problem; we use the code available at http://www.kyb.mpg.de/bs/people/arthur/mmd.htm and its bootstrap estimate for our comparisons. For n = 5, the error rate is 1% for the test using k = 1, 7% for k = √n, 43% for k = n/2, and 34% for the MMD test. For n ≥ 10 all tests reported zero error rate.
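The Monte Carlo two-sample test used here can be sketched as a standard permutation test built on the divergence estimate of (2). The following is our own hedged reconstruction, modeled on the resampling procedure of [9] rather than taken from it; the function names are ours:

```python
import numpy as np
from scipy.spatial import cKDTree

def d_hat(x, xq, k=1):
    # Divergence estimate of Eq. (2).
    n, d = x.shape
    m = xq.shape[0]
    r = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    s = cKDTree(xq).query(x, k=k)[0]
    s = s if k == 1 else s[:, -1]
    return d * np.mean(np.log(s / r)) + np.log(m / (n - 1.0))

def two_sample_test(x, xq, k=1, n_perm=200, alpha=0.05, seed=None):
    """Reject H0 (both samples come from the same distribution) when the
    observed divergence estimate exceeds the (1 - alpha) quantile of its
    permutation null distribution."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    pooled = np.vstack([x, xq])
    null = np.empty(n_perm)
    for b in range(n_perm):
        # Reassign the pooled samples at random to simulate H0.
        idx = rng.permutation(pooled.shape[0])
        null[b] = d_hat(pooled[idx[:n]], pooled[idx[n:]], k)
    return d_hat(x, xq, k) > np.quantile(null, 1.0 - alpha)
```

Under H0 the estimate is unbiased at zero for any k (the second and third terms of (6) cancel), which is what keeps such a test calibrated even for the nonconvergent k = n/2 choice.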
It seems that the k = 1 test is more powerful than the MMD test in this case, at least for small n. But we can see that the confidence interval for the MMD test decreases faster than that of the test based on the divergence estimate with k = 1, and we should expect better performance for larger n, similar to the divergence estimate with k = n/2.\n\nFigure 2: In (a) we plot D̂_1(3||2) (solid), D̂_1(3||3) (dashed) and their 90% confidence intervals (dotted). In (b) we repeat the same plots using the MMD test from [9].\n\nIn the second example we compute the mutual information between y_1 and y_2, which are given by:\n\n[y_1; y_2] = [cos(θ), sin(θ); −sin(θ), cos(θ)] [x_1; x_2]   (20)\n\nwhere x_1 and x_2 are independent and uniformly distributed between 0 and 1, and θ ∈ [0, π/4]. If θ is zero, y_1 and y_2 are independent; otherwise they are not independent, but still uncorrelated, for any θ.\n\nWe carry out a test for deciding if y_1 and y_2 are independent. The test is identical to the one described in [10], and we use the Monte Carlo resampling technique proposed in that paper with a 95% confidence interval and 1000 repetitions. In Figure 3 we report the acceptance of the null hypothesis (y_1 and y_2 are independent) as a function of θ for n = 100 in (a), and as a function of n for θ = π/8 in (b). We compute the mutual information with k = 1, k = √n and k = n/2 for our test, and compare it to the HSIC in [10].\n\nFigure 3: We plot the acceptance of the null hypothesis (y_1 and y_2 are independent) for a 95% confidence interval, in (a) as a function of θ and in (b) as a function of n.
The solid line uses the mutual information estimate with k = n/2 and the dash-dotted line uses the HSIC. The dashed and dotted lines use, respectively, the mutual information estimates with k = √n and k = 1.\n\nThe HSIC test and the mutual information based test with k = n/2 perform equally well at predicting whether y_1 and y_2 are independent, while the tests based on the mutual information estimates with k = 1 and k = √n clearly underperform. This example shows that if our goal is to predict whether two random variables are independent, we are better off using the HSIC or a nonconvergent estimate of the mutual information rather than trying to compute the mutual information as accurately as possible. Furthermore, in our tests, computing the HSIC for n = 5000 is over 10 times more costly in running time than computing the mutual information with k = n/2. (For the HSIC test we use A. Gretton's code at http://www.kyb.mpg.de/bs/people/arthur/indep.htm, and for finding the k-nn we use the sort function in Matlab.)\n\nAs we saw for the divergence estimate in Figure 1, the mutual information is more accurately estimated when k = 1, but at the cost of a higher variance. If our objective is to estimate the mutual information (or the divergence), we should use a small value of k, ideally k = 1. However, if we are interested in assessing whether two random variables are independent, it is better to use k = n/2, because the variance of the estimate is much lower, even though it does not converge to the mutual information (or the divergence).\n\n5 Conclusions\n\nWe have proved that the estimates of the differential entropy, mutual information and divergence based on k-nn density estimation with finite k converge almost surely, even though the density estimate itself does not converge. The previous literature could only prove mean-square consistency, and it required imposing some constraints on the density models. The proof in this paper relies on describing the limiting distribution of p(x)/p̂_k(x).
This limiting distribution can be easily described using waiting-time distributions, such as the exponential or the Erlang (gamma) distributions.\n\nWe have shown, experimentally, that fixing k = 1 achieves the fastest convergence rate, at the expense of a higher variance of our estimator. The divergence, mutual information and differential entropy estimates using k = 1 are much better than the estimates using k = √n, even though for k = √n we can prove that p̂_k(x) converges to p(x), while for finite k this convergence does not occur.\n\nFinally, if we are interested in solving the two-sample problem or in assessing if two random variables are independent, it is best to fix k to a fraction of n (we have used k = n/2 in our experiments), although in this case the estimates do not converge to the true value. Nevertheless, their variances are significantly lower, which allows our tests to perform better. The tests with k = n/2 perform as well as the MMD test for solving the two-sample problem and as well as the HSIC for assessing independence.\n\nAcknowledgment\n\nFernando Pérez-Cruz is supported by Marie Curie Fellowship 040883-AI-COM. This work was partially funded by the Spanish government (Ministerio de Educación y Ciencia, TEC2006-13514-C02-01/TCM).\n\nReferences\n\n[1] N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50(1):41–54, 7 1994.\n\n[2] F. R. Bach and M. I. Jordan. Kernel independent component analysis.
JMLR, 3:1–48, 2004.\n\n[3] K. Balakrishnan and A. P. Basu. The Exponential Distribution: Theory, Methods and Applications. Gordon and Breach Publishers, Amsterdam, Netherlands, 1996.\n\n[4] J. Beirlant, E. Dudewicz, L. Györfi, and E. van der Meulen. Nonparametric entropy estimation: An overview. International Journal of the Mathematical Statistics Sciences, pages 17–39, 1997.\n\n[5] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, New York, USA, 1991.\n\n[6] G. A. Darbellay and I. Vajda. Estimation of the information by an adaptive partitioning of the observation space. IEEE Trans. Information Theory, 45(4):1315–1321, 5 1999.\n\n[7] R. L. Dobrushin. A simplified method for experimental estimate of the entropy of a stationary sequence. Theory of Probability and its Applications, (4):428–430, 1958.\n\n[8] F. Fleuret. Fast binary feature selection with conditional mutual information. JMLR, 5:1531–1555, 2004.\n\n[9] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample-problem. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, Cambridge, MA, 2007. MIT Press.\n\n[10] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, Cambridge, MA, 2008. MIT Press.\n\n[11] G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Oxford University Press, Oxford, UK, 3rd edition, 2001.\n\n[12] J. Havil. Gamma: Exploring Euler's Constant. Princeton University Press, New York, USA, 2003.\n\n[13] I. S. Dhillon, S. Mallela, and R. Kumar. A divisive information-theoretic feature clustering algorithm for text classification. JMLR, 3:1265–1287, 3 2003.\n\n[14] L. Kleinrock. Queueing Systems.
Volume 1: Theory. Wiley, New York, USA, 1975.\n\n[15] L. F. Kozachenko and N. N. Leonenko. Sample estimate of the entropy of a random vector. Problems Inform. Transmission, 23(2):95–101, 4 1987.\n\n[16] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69(6):1–16, 6 2004.\n\n[17] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Stats., 22(1):79–86, 3 1951.\n\n[18] N. N. Leonenko, L. Pronzato, and V. Savani. A class of Rényi information estimators for multidimensional densities. Annals of Statistics, 2007. Submitted.\n\n[19] P. J. Moreno, P. P. Ho, and N. Vasconcelos. A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. Technical Report HPL-2004-4, HP Laboratories, 2004.\n\n[20] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Nonparametric estimation of the likelihood ratio and divergence functionals. In IEEE Int. Symp. Information Theory, Nice, France, 6 2007.\n\n[21] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, Cambridge, MA, 2008. MIT Press.\n\n[22] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 6 2003.\n\n[23] C. E. Shannon. A mathematical theory of communication. Bell System Tech. J., pages 379–423, 1948.\n\n[24] K. Torkkola. Feature extraction by non-parametric mutual information maximization. JMLR, 3:1415–1438, 2003.\n\n[25] Q. Wang, S. Kulkarni, and S. Verdú. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Trans. Information Theory, 51(9):3064–3074, 9 2005.\n\n[26] Q. Wang, S. Kulkarni, and S. Verdú.
A nearest-neighbor approach to estimating divergence between continuous random vectors. In IEEE Int. Symp. Information Theory, Seattle, USA, 7 2006.\n", "award": [], "sourceid": 755, "authors": [{"given_name": "Fernando", "family_name": "P\u00e9rez-Cruz", "institution": null}]}