{"title": "Distribution of Mutual Information", "book": "Advances in Neural Information Processing Systems", "page_first": 399, "page_last": 406, "abstract": null, "full_text": "Distribution of Mutual Information \n\nMarcus Hutter \n\nIDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland \n\nmarcus@idsia.ch \n\nhttp://www.idsia.ch/- marcus \n\nAbstract \n\nThe mutual information of two random variables z and J with joint \nprobabilities {7rij} is commonly used in learning Bayesian nets as \nwell as in many other fields. The chances 7rij are usually estimated \nby the empirical sampling frequency nij In leading to a point es(cid:173)\ntimate J(nij In) for the mutual information. To answer questions \nlike \"is J (nij In) consistent with zero?\" or \"what is the probability \nthat the true mutual information is much larger than the point es(cid:173)\ntimate?\" one has to go beyond the point estimate. In the Bayesian \nframework one can answer these questions by utilizing a (second \norder) prior distribution p( 7r) comprising prior information about \n7r. From the prior p(7r) one can compute the posterior p(7rln), from \nwhich the distribution p(Iln) of the mutual information can be cal(cid:173)\nculated. We derive reliable and quickly computable approximations \nfor p(Iln). We concentrate on the mean, variance, skewness, and \nkurtosis, and non-informative priors. For the mean we also give an \nexact expression. Numerical issues and the range of validity are \ndiscussed. \n\n1 \n\nIntroduction \n\nThe mutual information J (also called cross entropy) is a widely used information \ntheoretic measure for the stochastic dependency of random variables [CT91, SooOO] . \nIt is used, for instance, in learning Bayesian nets [Bun96, Hec98] , where stochasti(cid:173)\ncally dependent nodes shall be connected. The mutual information defined in (1) \ncan be computed if the joint probabilities {7rij} of the two random variables z and J \nare known. 
The standard procedure in the common case of unknown chances π_ij is to use the sample frequency estimates n_ij/n instead, as if they were precisely known probabilities; but this is not always appropriate. Furthermore, the point estimate I(n_ij/n) gives no clue about the reliability of the value if the sample size n is finite. For instance, for independent ı and ȷ, I(π) = 0, but I(n_ij/n) = O(n^{-1/2}) due to noise in the data. The criterion for judging dependency is how many standard deviations I(n_ij/n) is away from zero. In [KJ96, Kle99] the probability that the true I(π) is greater than a given threshold has been used to construct Bayesian nets. In the Bayesian framework one can answer these questions by utilizing a (second order) prior distribution p(π), which takes account of any impreciseness about π. From the prior p(π) one can compute the posterior p(π|n), from which the distribution p(I|n) of the mutual information can be obtained. \n\nThe objective of this work is to derive reliable and quickly computable analytical expressions for p(I|n). Section 2 introduces the mutual information distribution; Section 3 discusses some results in advance, before delving into the derivation. Since the central limit theorem ensures that p(I|n) converges to a Gaussian distribution, a good starting point is to compute the mean and variance of p(I|n). In Section 4 we relate the mean and variance to the covariance structure of p(π|n). Most non-informative priors lead to a Dirichlet posterior. An exact expression for the mean (Section 6) and approximate expressions for the variance (Section 5) are given for the Dirichlet distribution. More accurate estimates of the variance and higher central moments are derived in Section 7, which lead to good approximations of p(I|n) even for small sample sizes. We show that the expressions obtained in [KJ96, Kle99] by heuristic numerical methods are incorrect. 
Numerical issues and the range of validity are briefly discussed in Section 8. \n\n2 Mutual Information Distribution \n\nWe consider discrete random variables ı ∈ {1,...,r} and ȷ ∈ {1,...,s} and an i.i.d. random process with samples (i,j) ∈ {1,...,r}×{1,...,s} drawn with joint probability π_ij. An important measure of the stochastic dependence of ı and ȷ is the mutual information \n\n  I(π) = Σ_{i=1}^r Σ_{j=1}^s π_ij log(π_ij/(π_i+ π_+j)) = Σ_ij π_ij log π_ij − Σ_i π_i+ log π_i+ − Σ_j π_+j log π_+j,   (1) \n\nwhere log denotes the natural logarithm and π_i+ = Σ_j π_ij and π_+j = Σ_i π_ij are marginal probabilities. Often one does not know the probabilities π_ij exactly, but one has a sample set with n_ij outcomes of pair (i,j). The frequency π̂_ij := n_ij/n may be used as a first estimate of the unknown probabilities, where n := Σ_ij n_ij is the total sample size. This leads to a point (frequency) estimate I(π̂) = Σ_ij (n_ij/n) log(n_ij n/(n_i+ n_+j)) for the mutual information (per sample). \n\nUnfortunately the point estimate I(π̂) gives no information about its accuracy. In the Bayesian approach to this problem one assumes a prior (second order) probability density p(π) for the unknown probabilities π_ij on the probability simplex. From this one can compute the posterior distribution p(π|n) ∝ p(π)·Π_ij π_ij^{n_ij} (the n_ij are multinomially distributed). This allows one to compute the posterior probability density of the mutual information:¹ \n\n  p(I|n) = ∫ δ(I(π) − I) p(π|n) d^{rs}π.   (2) \n\nThe δ(·) distribution restricts the integral to π for which I(π) = I.² For large sample size n → ∞, p(π|n) is strongly peaked around π = π̂ and p(I|n) gets strongly peaked around the frequency estimate I = I(π̂). The mean E[I] = ∫_0^∞ I·p(I|n) dI = ∫ I(π) p(π|n) d^{rs}π and the variance Var[I] = E[(I − E[I])²] = E[I²] − E[I]² are of central interest. \n\n¹ I(π) denotes the mutual information for the specific chances π, whereas I in the context above is just some non-negative real number. I will also denote the mutual information random variable in the expectation E[I] and variance Var[I]. Expectations are always w.r.t. the posterior distribution p(π|n). \n\n² Since 0 ≤ I(π) ≤ I_max with sharp upper bound I_max := min{log r, log s}, the integral may be restricted to I(π) ≤ I_max, which shows that the domain of p(I|n) is [0, I_max]. \n\n3 Results for I under the Dirichlet P(oste)rior \n\nMost³ non-informative priors for p(π) lead to a Dirichlet posterior distribution p(π|n) ∝ Π_ij π_ij^{n_ij−1} with the interpretation n_ij = n'_ij + n''_ij, where n'_ij are the number of samples (i,j), and n''_ij comprises prior information (1 for the uniform prior, ½ for Jeffreys' prior, 0 for Haldane's prior, 1/(rs) for Perks' prior [GCSR95]). In principle this allows one to compute the posterior density p(I|n) of the mutual information. In Sections 4 and 5 we expand the mean and variance in terms of n^{-1}: \n\n  E[I] = Σ_ij (n_ij/n) log(n_ij n/(n_i+ n_+j)) + (r−1)(s−1)/(2n) + O(n^{-2}), \n  Var[I] = (1/n) [Σ_ij (n_ij/n) log²(n_ij n/(n_i+ n_+j)) − (Σ_ij (n_ij/n) log(n_ij n/(n_i+ n_+j)))²] + O(n^{-2}).   (3) \n\nThe first term for the mean is just the point estimate I(π̂). The second term is a small correction if n ≫ r·s. Kleiter [KJ96, Kle99] determined the correction by Monte Carlo studies as min{(r−1)/(2n), (s−1)/(2n)}. This is wrong unless s or r are 2. The expression 2E[I]/n they determined for the variance has a completely different structure than ours. Note that the mean is lower bounded by const./n + O(n^{-2}), which is strictly positive for large but finite sample sizes, even if ı and ȷ are statistically independent and independence is perfectly represented in the data (I(π̂) = 0). 
On the other hand, in this case the standard deviation σ = √(Var[I]) ∼ 1/n ∼ E[I] correctly indicates that the mean is still consistent with zero. \n\nOur approximations (3) for the mean and variance are good if r·s/n is small. The central limit theorem ensures that p(I|n) converges to a Gaussian distribution with mean E[I] and variance Var[I]. Since I is non-negative, it is more appropriate to approximate p(I|n) as a Gamma (= scaled χ²) or log-normal distribution with mean E[I] and variance Var[I], which is of course also asymptotically correct. \n\nA systematic expansion in n^{-1} of the mean, variance, and higher moments is possible but gets arbitrarily cumbersome. The O(n^{-2}) terms for the variance and the leading order terms for the skewness and kurtosis are given in Section 7. For the mean it is possible to give an exact expression \n\n  E[I] = (1/n) Σ_ij n_ij [ψ(n_ij+1) − ψ(n_i+ +1) − ψ(n_+j +1) + ψ(n+1)]   (4) \n\nwith ψ(n+1) = −γ + Σ_{k=1}^n 1/k = log n + O(1/n) for integer n. See Section 6 for details and more general expressions for ψ for non-integer arguments. \n\nThere may be other prior information available which cannot be comprised in a Dirichlet distribution. In this general case, the mean and variance of I can still be related to the covariance structure of p(π|n), which will be done in the following section. \n\n³ But not all priors which one can argue to be non-informative lead to Dirichlet posteriors. Brand [Bra99] (and others), for instance, advocates the entropic prior p(π) ∝ e^{−H(π)}. \n\n4 Approximation of Expectation and Variance of I \n\nIn the following let π̂_ij := E[π_ij]. Since p(π|n) is strongly peaked around π = π̂ for large n, we may expand I(π) around π̂ in the integrals for the mean and the variance. With Δ_ij := π_ij − π̂_ij and using Σ_ij π_ij = 1 = Σ_ij π̂_ij we get for the expansion of (1) \n\n  I(π) = I(π̂) + Σ_ij log(π̂_ij/(π̂_i+ π̂_+j)) Δ_ij + Σ_ij Δ_ij²/(2π̂_ij) − Σ_i Δ_i+²/(2π̂_i+) − Σ_j Δ_+j²/(2π̂_+j) + O(Δ³).   (5) \n\nTaking the expectation, the linear term E[Δ_ij] = 0 drops out. The quadratic terms E[Δ_ij Δ_kl] = Cov(π_ij, π_kl) are the covariance of π under the distribution p(π|n) and are proportional to n^{-1}. It can be shown that E[Δ³] ∼ n^{-2} (see Section 7). Hence \n\n  E[I] = I(π̂) + (1/2) Σ_ijkl (δ_ik δ_jl/π̂_ij − δ_ik/π̂_i+ − δ_jl/π̂_+j) Cov(π_ij, π_kl) + O(n^{-2}).   (6) \n\nThe Kronecker delta δ_ij is 1 for i = j and 0 otherwise. The variance of I in leading order in n^{-1} is \n\n  Var[I] ≐ Σ_ijkl log(π̂_ij/(π̂_i+ π̂_+j)) log(π̂_kl/(π̂_k+ π̂_+l)) Cov(π_ij, π_kl),   (7) \n\nwhere ≐ means equality up to terms of order n^{-2}. So the leading order variance and the leading and next-to-leading order mean of the mutual information I(π) can be expressed in terms of the covariance of π under the posterior distribution p(π|n). \n\n5 The Second Order Dirichlet Distribution \n\nNon-informative priors for p(π) are commonly used if no additional prior information is available. Many non-informative choices (uniform, Jeffreys', Haldane's, Perks' prior) lead to a Dirichlet posterior distribution \n\n  p(π|n) = (1/N(n)) Π_ij π_ij^{n_ij−1} δ(π_++ − 1)  with normalization  N(n) = Π_ij Γ(n_ij)/Γ(n),   (8) \n\nwhere Γ is the Gamma function, and n_ij = n'_ij + n''_ij, where n'_ij are the number of samples (i,j), and n''_ij comprises prior information (1 for the uniform prior, ½ for Jeffreys' prior, 0 for Haldane's prior, 1/(rs) for Perks' prior). 
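The leading-order approximations (3) can be turned into a few lines of code; this is our own sketch (function name hypothetical), assuming the counts already include any prior sample sizes n''_ij, that all marginals are positive, and keeping the n+1 denominators:

```python
import numpy as np

def mi_mean_var_approx(counts):
    # Leading-order posterior mean and variance of I under the Dirichlet
    # posterior: E[I] ~ J + (r-1)(s-1)/(2(n+1)), Var[I] ~ (K - J^2)/(n+1).
    n_ij = np.asarray(counts, dtype=float)
    n = n_ij.sum()
    r, s = n_ij.shape
    marg = n_ij.sum(axis=1, keepdims=True) * n_ij.sum(axis=0, keepdims=True)
    log_term = np.zeros_like(n_ij)
    mask = n_ij > 0                      # cells with n_ij = 0 contribute 0
    log_term[mask] = np.log(n_ij[mask] * n / marg[mask])
    J = (n_ij / n * log_term).sum()
    K = (n_ij / n * log_term ** 2).sum()
    mean = J + (r - 1) * (s - 1) / (2.0 * (n + 1))
    var = (K - J ** 2) / (n + 1)
    return mean, var
```

Note that even for perfectly balanced counts the approximate mean is strictly positive, illustrating the lower bound on E[I] discussed in Section 3.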
Mean and covariance of p(π|n) are \n\n  π̂_ij := E[π_ij] = n_ij/n,   Cov(π_ij, π_kl) = (π̂_ij δ_ik δ_jl − π̂_ij π̂_kl)/(n+1).   (9) \n\nInserting this into (6) and (7), we get after some algebra for the mean and variance of the mutual information I(π), up to terms of order n^{-2}: \n\n  E[I] = J + (r−1)(s−1)/(2(n+1)) + O(n^{-2}),   (10) \n  Var[I] = (K − J²)/(n+1) + O(n^{-2}),   (11) \n  J = Σ_ij (n_ij/n) log(n_ij n/(n_i+ n_+j)),   (12) \n  K = Σ_ij (n_ij/n) log²(n_ij n/(n_i+ n_+j)).   (13) \n\nJ and K (and L, M, P, Q defined later) depend on π̂_ij = n_ij/n only, i.e. they are O(1) in n. Strictly speaking we should expand 1/(n+1) = 1/n + O(n^{-2}), i.e. drop the +1, but the exact expression (9) for the covariance suggests to keep the +1. We compared both versions with the exact values (from Monte Carlo simulations) for various parameters π. In most cases the expansion in 1/(n+1) was more accurate, so we suggest to use this variant. \n\n6 Exact Value for E[I] \n\nIt is possible to get an exact expression for the mean mutual information E[I] under the Dirichlet distribution. By noting that x log x = (d/dβ) x^β |_{β=1} (x ∈ {π_ij, π_i+, π_+j}), one can replace the logarithms in the last expression of (1) by powers. From (8) we see that E[(π_ij)^β] = Γ(n_ij+β)Γ(n)/(Γ(n_ij)Γ(n+β)). Taking the derivative and setting β = 1 we get \n\n  E[π_ij log π_ij] = (d/dβ) E[(π_ij)^β] |_{β=1} = (n_ij/n) [ψ(n_ij+1) − ψ(n+1)]. \n\nThe ψ function has the following properties (see [AS74] for details): \n\n  ψ(z) = d log Γ(z)/dz = Γ'(z)/Γ(z),   ψ(z+1) = log z + 1/(2z) − 1/(12z²) + O(z^{-4}), \n  ψ(n) = −γ + Σ_{k=1}^{n−1} 1/k,   ψ(n+½) = −γ − 2 log 2 + 2 Σ_{k=1}^n 1/(2k−1).   (14) \n\nThe value of the Euler constant γ is irrelevant here, since it cancels out. Since the marginal distributions of π_i+ and π_+j are also Dirichlet (with parameters n_i+ and n_+j), we get similarly \n\n  E[π_i+ log π_i+] = (n_i+/n) [ψ(n_i+ +1) − ψ(n+1)],   E[π_+j log π_+j] = (n_+j/n) [ψ(n_+j +1) − ψ(n+1)]. \n\nInserting this into (1) and rearranging terms we get the exact expression⁴ \n\n  E[I] = (1/n) Σ_ij n_ij [ψ(n_ij+1) − ψ(n_i+ +1) − ψ(n_+j +1) + ψ(n+1)].   (15) \n\n⁴ This expression has independently been derived in [WW93]. \n\nFor large sample sizes, ψ(z+1) ≈ log z and (15) approaches the frequency estimate I(π̂), as it should. Inserting the expansion ψ(z+1) = log z + 1/(2z) + ... into (15) we also get the correction term (r−1)(s−1)/(2n) of (3). \n\nThe presented method (with some refinements) may also be used to determine an exact expression for the variance of I(π). All but one term can be expressed in terms of Gamma functions. The final result after differentiating w.r.t. β₁ and β₂ can be represented in terms of ψ and its derivative ψ'. The mixed term E[(π_i+)^{β₁} (π_+j)^{β₂}] is more complicated and involves confluent hypergeometric functions, which limits its practical use [WW93]. \n\n7 Generalizations \n\nA systematic expansion of all moments of p(I|n) to arbitrary order in n^{-1} is possible, but soon gets quite cumbersome. For the mean we already gave an exact expression (15), so we concentrate here on the variance, skewness and kurtosis of p(I|n). The 3rd and 4th central moments of π under the Dirichlet distribution are \n\n  E[Δ_a Δ_b Δ_c] = 2/((n+1)(n+2)) [2π̂_a π̂_b π̂_c − π̂_a π̂_b δ_bc − π̂_b π̂_c δ_ca − π̂_c π̂_a δ_ab + π̂_a δ_ab δ_bc],   (16) \n  E[Δ_a Δ_b Δ_c Δ_d] = (1/n²) [3π̂_a π̂_b π̂_c π̂_d − π̂_c π̂_d π̂_a δ_ab − π̂_b π̂_d π̂_a δ_ac − π̂_b π̂_c π̂_a δ_ad − π̂_a π̂_d π̂_b δ_bc − π̂_a π̂_c π̂_b δ_bd − π̂_a π̂_b π̂_c δ_cd + π̂_a π̂_c δ_ab δ_cd + π̂_a π̂_b δ_ac δ_bd + π̂_a π̂_b δ_ad δ_bc] + O(n^{-3}),   (17) \n\nwith a = ij, b = kl, ... ∈ {1,...,r}×{1,...,s} being double indices, δ_ab = δ_ik δ_jl, etc., and π̂_ij = n_ij/n. Expanding Δ^k = (π − π̂)^k in E[Δ_a Δ_b ...] leads to expressions containing E[π_a π_b ...], which can be computed by a case analysis of all combinations of equal/unequal indices a, b, c, ... using (8). Many terms cancel, leading to the above expressions. 
They allow one to compute the order n^{-2} term of the variance of I(π). Again, inspection of (16) suggests to expand in [(n+1)(n+2)]^{-1} rather than in n^{-2}. The variance in leading and next-to-leading order is \n\n  Var[I] = (K − J²)/(n+1) + (M + (r−1)(s−1)(½ − J) − Q)/((n+1)(n+2)) + O(n^{-3}),   (18) \n  M = Σ_ij (1/n_ij − 1/n_i+ − 1/n_+j + 1/n) n_ij log(n_ij n/(n_i+ n_+j)),   (19) \n  Q = 1 − Σ_ij n_ij²/(n_i+ n_+j).   (20) \n\nJ and K are defined in (12) and (13). Note that the first term (K − J²)/(n+1) also contains second order terms when expanded in n^{-1}. The 3rd central moment of p(I|n) is of order n^{-2} and involves, in addition to J and K, the cubic sum L := Σ_ij (n_ij/n) log³(n_ij n/(n_i+ n_+j)); the leading order term for the 4th central moment is \n\n  E[(I − E[I])⁴] = (3/n²) [K − J²]² + O(n^{-3}), \n\nfrom which the skewness and kurtosis can be obtained by dividing the central moments by Var[I]^{3/2} and Var[I]², respectively. One can see that the skewness is of order n^{-1/2} and the kurtosis is 3 + O(n^{-1}). Significant deviation of the skewness from 0 or of the kurtosis from 3 would indicate a non-Gaussian I. They can be used to get an improved approximation for p(I|n) by making, for instance, the ansatz \n\n  p(I|n) ∝ (1 + bI + cI²) p₀(I|μ, σ²) \n\nand fitting the parameters b, c, μ, and σ² to the mean, variance, skewness, and kurtosis expressions above. Here p₀ is the Normal or Gamma distribution (or any other distribution with Gaussian limit). From this, quantiles p(I > I*|n) := ∫_{I*}^∞ p(I|n) dI, needed in [KJ96, Kle99], can be computed. A systematic expansion of arbitrarily high moments to arbitrarily high order in n^{-1} leads, in principle, to arbitrarily accurate estimates. \n\n8 Numerics \n\nThere are short and fast implementations of ψ. The code of the Gamma function in [PFTV92], for instance, can be modified to compute the ψ function. For integer and half-integer values one may create a lookup table from (14). 
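Since γ cancels in (4), the exact mean requires only harmonic numbers when all n_ij are integers (e.g. the uniform or Haldane's prior with integer sample counts); this is precisely the lookup-table idea for (14). A minimal sketch with our own naming (non-integer counts would instead need a general digamma routine):

```python
def harmonic(m):
    # H_m = sum_{k=1}^m 1/k; equals psi(m+1) + gamma for integer m >= 0
    return sum(1.0 / k for k in range(1, m + 1))

def mi_exact_mean(counts):
    # Exact posterior mean (4); Euler's gamma cancels between the four
    # psi terms, so harmonic numbers suffice for integer counts.
    rows = [sum(row) for row in counts]
    cols = [sum(col) for col in zip(*counts)]
    n = sum(rows)
    total = 0.0
    for i, row in enumerate(counts):
        for j, nij in enumerate(row):
            if nij > 0:                  # empty cells contribute nothing
                total += nij * (harmonic(nij) - harmonic(rows[i])
                                - harmonic(cols[j]) + harmonic(n))
    return total / n
```

For a balanced 2x2 table with two samples per cell this gives H_2 − 2·H_4 + H_8 = 43/840 ≈ 0.0512, compared with the correction term (r−1)(s−1)/(2n) = 0.0625 of (3).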
The needed quantities J, K, L, M, and Q (depending on n) involve a double sum, P only a single sum, and the r+s quantities J_i+ and J_+j also only a single sum. Hence, the computation time for the (central) moments is of the same order O(r·s) as for the point estimate (1). \"Exact\" values have been obtained for representative choices of π_ij, r, s, and n by Monte Carlo simulation: the π_ij := x_ij/x_++ are Dirichlet distributed if each x_ij follows a Gamma distribution (see [PFTV92] for how to sample from a Gamma distribution). The variance has been expanded in r·s/n, so the relative error (Var[I]_approx − Var[I]_exact)/Var[I]_exact of the approximations (11) and (18) is of the order of r·s/n and (r·s/n)², respectively, if ı and ȷ are dependent. If they are independent, the leading term (11) itself drops down to order n^{-2}, resulting in a reduced relative accuracy O(r·s/n) of (18). Comparison with the Monte Carlo values confirmed an accuracy in the range (r·s/n)^{1...2}. The mean (4) is exact. Together with the skewness and kurtosis we have a good description of the distribution of the mutual information p(I|n) for not too small sample bin sizes n_ij. We want to conclude with some notes on useful accuracy. The hypothetical prior sample sizes n''_ij ∈ {0, 1/(rs), ½, 1} can all be argued to be non-informative [GCSR95]. Since the central moments are expansions in n^{-1}, the next-to-leading order term can be freely adjusted by adjusting n''_ij ∈ [0...1]. So one may argue that anything beyond leading order is arbitrary, and that the leading order terms may be regarded as accurate as we can specify our prior knowledge. On the other hand, exact expressions have the advantage of being safe against cancellations. For instance, the leading order of E[I] and E[I²] does not suffice to compute the leading order of Var[I]. 
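The Monte Carlo procedure used in this section (π_ij = x_ij/x_++ with Gamma-distributed x_ij) can be sketched as follows; the sample counts, seed, and sample size below are our own illustrative choices, and all counts are assumed positive:

```python
import numpy as np

def mi(pi):
    # I(pi) for a batch of joint probability matrices (natural log).
    p_i = pi.sum(axis=-1, keepdims=True)
    p_j = pi.sum(axis=-2, keepdims=True)
    return (pi * np.log(pi / (p_i * p_j))).sum(axis=(-2, -1))

def mc_moments(counts, samples=20000, seed=0):
    # Draw pi ~ Dirichlet(counts) by normalizing Gamma variates, then
    # estimate the posterior mean and variance of I by Monte Carlo.
    c = np.asarray(counts, dtype=float)  # requires all n_ij > 0
    rng = np.random.default_rng(seed)
    x = rng.gamma(c, size=(samples,) + c.shape)
    pi = x / x.sum(axis=(1, 2), keepdims=True)
    i_vals = mi(pi)
    return i_vals.mean(), i_vals.var()
```

For counts [[10, 2], [2, 10]] the Monte Carlo mean lands near the exact value from (4) (about 0.26) and the variance near the leading-order (K − J²)/(n+1) of (11), which is the kind of comparison behind the accuracy statements above.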
\n\nn \n\n\fAcknowledgements \n\nI want to thank Ivo Kwee for valuable discussions and Marco Zaffalon for encour(cid:173)\naging me to investigate this topic. This work was supported by SNF grant 2000-\n61847.00 to Jiirgen Schmidhuber. \n\nReferences \n\n[AS74] \n\n[Bra99] \n\n[Bun96] \n\n[CT91] \n\nM. Abramowitz and 1. A. Stegun, editors. Handbook of mathematical functions. \nDover publications, inc., 1974. \nM. Brand. Structure learning in conditional probability models via an entropic \nprior and parameter extinction. N eural Computation, 11(5):1155- 1182, 1999. \nW. Buntine. A guide to the literature on learning probabilistic networks from \ndata. \nIEEE Transactions on Knowledge and Data Engineering, 8:195- 210, \n1996. \nT. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series \nin Telecommunications. John Wiley & Sons, New York, NY, USA, 1991. \n\n[GCSR95] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. \n\nChapman, 1995. \n\n[Hec98] \n\n[KJ96] \n\n[Kle99] \n\nD. Heckerman. A tutorial on learning with Bayesian networks. Learnig in \nGraphical Models, pages 301-354, 1998. \nG. D. Kleiter and R. Jirousek. Learning Bayesian networks under the control \nof mutual information. Proceedings of the 6th International Conference on \nInformation Processing and Management of Uncertainty in Knowledge-Based \nSystems (IPMU-1996), pages 985- 990, 1996. \nG. D. Kleiter. The posterior probability of Bayes nets with strong dependences. \nSoft Computing, 3:162- 173, 1999. \n\n[PFTV92] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical \nR ecipes in C: Th e Art of Scientific Computing. Cambridge University Press, \nCambridge, second edition, 1992. \nE. S. Soofi. Principal information theoretic approaches. Journal of the Ameri(cid:173)\ncan Statistical Association, 95:1349- 1353, 2000. \n\n[SooOO] \n\n[WW93] D. R. Wolf and D. H. Wolpert. 
Estimating functions of distributions from a finite set of samples, part 2: Bayes estimators for mutual information, chi-squared, covariance and other statistics. Technical Report LANL-LA-UR-93-833, Los Alamos National Laboratory, 1993. Also Santa Fe Institute report SFI-TR-93-07-047. \n", "award": [], "sourceid": 2071, "authors": [{"given_name": "Marcus", "family_name": "Hutter", "institution": null}]}