{"title": "Bayesian estimation of discrete entropy with mixtures of stick-breaking priors", "book": "Advances in Neural Information Processing Systems", "page_first": 2015, "page_last": 2023, "abstract": "We consider the problem of estimating Shannon's entropy H in the under-sampled regime, where the number of possible symbols may be unknown or countably infinite. Pitman-Yor processes (a generalization of Dirichlet processes) provide tractable prior distributions over the space of countably infinite discrete distributions, and have found major applications in Bayesian non-parametric statistics and machine learning. Here we show that they also provide natural priors for Bayesian entropy estimation, due to the remarkable fact that the moments of the induced posterior distribution over H can be computed analytically. We derive formulas for the posterior mean (Bayes' least squares estimate) and variance under such priors. Moreover, we show that a fixed Dirichlet or Pitman-Yor process prior implies a narrow prior on H, meaning the prior strongly determines the entropy estimate in the under-sampled regime. We derive a family of continuous mixing measures such that the resulting mixture of Pitman-Yor processes produces an approximately flat (improper) prior over H. We explore the theoretical properties of the resulting estimator, and show that it performs well on data sampled from both exponential and power-law tailed distributions.", "full_text": "Bayesian estimation of discrete entropy with mixtures\n\nof stick-breaking priors\n\nEvan Archer\u21e4124, Il Memming Park\u21e4234, & Jonathan W. Pillow234\n\nThe University of Texas at Austin\n\n1. Institute for Computational and Engineering Sciences\n2. Center for Perceptual Systems, 3. Dept. of Psychology,\n\n4. 
Division of Statistics & Scienti\ufb01c Computation\n\nAbstract\n\nWe consider the problem of estimating Shannon\u2019s entropy H in the under-sampled\nregime, where the number of possible symbols may be unknown or countably\nin\ufb01nite. Dirichlet and Pitman-Yor processes provide tractable prior distributions\nover the space of countably in\ufb01nite discrete distributions, and have found major\napplications in Bayesian non-parametric statistics and machine learning. Here\nwe show that they provide natural priors for Bayesian entropy estimation, due\nto the analytic tractability of the moments of the induced posterior distribution\nover entropy H. We derive formulas for the posterior mean and variance of H\ngiven data. However, we show that a \ufb01xed Dirichlet or Pitman-Yor process prior\nimplies a narrow prior on H, meaning the prior strongly determines the estimate\nin the under-sampled regime. We therefore de\ufb01ne a family of continuous mixing\nmeasures such that the resulting mixture of Dirichlet or Pitman-Yor processes\nproduces an approximately \ufb02at prior over H. We explore the theoretical properties\nof the resulting estimators and show that they perform well on data sampled from\nboth exponential and power-law tailed distributions.\n\n1\n\nIntroduction\n\nAn important statistical problem in the study of natural systems is to estimate the entropy of an\nunknown discrete distribution on the basis of an observed sample. This is often much easier than\nthe problem of estimating the distribution itself; in many cases, entropy can be accurately estimated\nwith fewer samples than the number of distinct symbols. Entropy estimation remains a dif\ufb01cult\nproblem, however, as there is no unbiased estimator for entropy, and the maximum likelihood es-\ntimator exhibits severe bias for small datasets. Previous work has tended to focus on methods for\ncomputing and reducing this bias [1\u20135]. 
Here, we instead take a Bayesian approach, building on a framework introduced by Nemenman et al. [6]. The basic idea is to place a prior over the space of probability distributions that might have generated the data, and then perform inference using the induced posterior distribution over entropy. (See Fig. 1).\nWe focus on the setting where our data are a finite sample from an unknown, or possibly even countably infinite, number of symbols. A Bayesian approach requires us to consider distributions over the infinite-dimensional simplex, \u2206\u221e. To do so, we employ the Pitman-Yor (PYP) and Dirichlet (DP) processes [7\u20139]. These processes provide an attractive family of priors for this problem, since: (1) the posterior distribution over entropy has analytically tractable moments; and (2) distributions drawn from a PYP can exhibit power-law tails, a feature commonly observed in data from social, biological, and physical systems [10\u201312]. However, we show that a fixed PYP prior imposes a narrow\n\n\u21e4 These authors contributed equally.\n\n\fFigure 1: Graphical model illustrating the ingredients for Bayesian entropy estimation. Arrows indicate conditional dependencies between variables, and the gray \u201cplate\u201d denotes multiple copies of a random variable (with the number of copies N indicated at bottom). For entropy estimation, the joint probability distribution over entropy H, data x = {xj}, discrete distribution \u21e1 = {\u21e1i}, and parameter \u2713 factorizes as: p(H, x, \u21e1, \u2713) = p(H|\u21e1)p(x|\u21e1)p(\u21e1|\u2713)p(\u2713). Entropy is a deterministic function of \u21e1, so p(H|\u21e1) = \u03b4(H \u2212 \u2211i \u21e1i log \u21e1i).\n\nprior over entropy, leading to severe bias and overly narrow credible intervals for small datasets. 
We address this shortcoming by introducing a set of mixing measures such that the resulting Pitman-Yor Mixture (PYM) prior provides an approximately non-informative (i.e., flat) prior over entropy.\nThe remainder of the paper is organized as follows. In Section 2, we introduce the entropy estimation problem and review prior work. In Section 3, we introduce the Dirichlet and Pitman-Yor processes and discuss key mathematical properties relating to entropy. In Section 4, we introduce a novel entropy estimator based on PYM priors and derive several of its theoretical properties. In Section 5, we show applications to data.\n\n2 Entropy Estimation\n\nConsider samples x := {xj}_{j=1}^{N} drawn iid from an unknown discrete distribution \u21e1 := {\u21e1i}_{i=1}^{A} on a finite or (countably) infinite alphabet X. We wish to estimate the entropy of \u21e1,\n\nH(\u21e1) = \u2212 \u2211_{i=1}^{A} \u21e1i log \u21e1i,   (1)\n\nwhere we identify X = {1, 2, . . . , A} as the alphabet without loss of generality (where the alphabet size A may be infinite), and \u21e1i > 0 denotes the probability of observing symbol i. We focus on the setting where N \u226a A.\nA reasonable first step toward estimating H is to estimate the distribution \u21e1. The sum of observed counts nk = \u2211_{i=1}^{N} 1{xi=k} for each k \u2208 X yields the empirical distribution \u02c6\u21e1, where \u02c6\u21e1k = nk/N. Plugging this estimate for \u21e1 into eq. 1, we obtain the so-called \u201cplugin\u201d estimator: \u02c6Hplugin = \u2212\u2211 \u02c6\u21e1i log \u02c6\u21e1i, which is also the maximum-likelihood estimator. It exhibits substantial negative bias in the undersampled regime.\n\n2.1 Bayesian entropy estimation\n\nThe Bayesian approach to entropy estimation involves formulating a prior over distributions \u21e1, and then turning the crank of Bayesian inference to infer H using the posterior distribution. 
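As a quick baseline, the plugin estimator described above can be computed directly from samples. This is a sketch in our own notation (the function name is ours, not from the paper's code):

```python
# Plugin (maximum-likelihood) entropy estimate, in nats:
# H_plugin = -sum_i p_hat_i * log(p_hat_i), with p_hat_i = n_i / N.
import numpy as np

def plugin_entropy(samples):
    """Entropy of the empirical distribution of `samples`."""
    _, counts = np.unique(samples, return_counts=True)
    p_hat = counts / counts.sum()
    return float(-np.sum(p_hat * np.log(p_hat)))

# Uniform on 4 symbols: true entropy is log(4) ~ 1.386 nats.
print(plugin_entropy([0, 1, 2, 3, 0, 1, 2, 3]))
```

In the undersampled regime this estimate is biased downward on average, which is the motivation for the Bayesian estimators that follow.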
Bayes\u2019 least squares (BLS) estimators take the form:\n\n\u02c6H(x) = E[H|x] = \u222b H(\u21e1) p(\u21e1|x) d\u21e1   (2)\n\nwhere p(\u21e1|x) is the posterior over \u21e1 under some prior p(\u21e1) and categorical likelihood p(x|\u21e1) = \u220fj p(xj|\u21e1), where p(xj = i) = \u21e1i. The conditional p(H|\u21e1) = \u03b4(H \u2212 \u2211i \u21e1i log \u21e1i), since H is deterministically related to \u21e1. To the extent that p(\u21e1) expresses our true prior uncertainty over the unknown distribution that generated the data, this estimate is optimal in a least-squares sense, and the corresponding credible intervals capture our uncertainty about H given the data.\nFor distributions with known finite alphabet size A, the Dirichlet distribution provides an obvious choice of prior due to its conjugacy to the discrete (or multinomial) likelihood. It takes the form p(\u21e1) \u221d \u220f_{i=1}^{A} \u21e1i^{\u21b5\u22121}, for \u21e1 on the A-dimensional simplex (\u21e1i \u2265 0, \u2211 \u21e1i = 1), with concentration\n\n\f[Figure 2 appears here: two log-log plots of P[wordcount > n] versus wordcount n. Left panel: \u201cNeural Alphabet Frequency (27 spiking neurons)\u201d, cell data with 95% confidence bands; right panel: \u201cWord Frequency in Moby Dick\u201d, word data with DP and PY fits and 95% confidence bands.]\n\nFigure 2: Power-law frequency distributions from neural signals and natural language. We compare samples from the DP (red) and PYP (blue) priors for two datasets with heavy tails (black). In both cases, we compare the empirical CDF with distributions sampled given d and \u21b5 fixed to their ML estimates. For both datasets, the PYP better captures the heavy-tailed behavior of the data. 
Left: Frequencies among N = 1.2e6 neural spike words from 27 simultaneously-recorded retinal ganglion cells, binarized and binned at 10 ms [18]. Right: Frequency of N = 217826 words in the novel Moby Dick by Herman Melville.\n\nparameter \u21b5 [13]. Many previously proposed estimators can be viewed as Bayesian estimators with a particular fixed choice of \u21b5. (See [14] for an overview).\n\n2.2 Nemenman-Shafee-Bialek (NSB) estimator\n\nIn a seminal paper, Nemenman et al. [6] showed that Dirichlet priors impose a narrow prior over entropy. In the under-sampled regime, Bayesian estimates using a fixed Dirichlet prior are severely biased, and have small credible intervals (i.e., they give highly confident wrong answers!). To address this problem, [6] suggested a mixture-of-Dirichlets prior:\n\np(\u21e1) = \u222b pDir(\u21e1|\u21b5) p(\u21b5) d\u21b5,   (3)\n\nwhere pDir(\u21e1|\u21b5) denotes a Dir(\u21b5) prior on \u21e1. To construct an approximately flat prior on entropy, [6] proposed the mixing weights on \u21b5 given by,\n\np(\u21b5) \u221d d/d\u21b5 E[H|\u21b5] = A \u03c81(A\u21b5 + 1) \u2212 \u03c81(\u21b5 + 1),   (4)\n\nwhere E[H|\u21b5] denotes the expected value of H under a Dir(\u21b5) prior, and \u03c81(\u00b7) denotes the trigamma function. To the extent that p(H|\u21b5) resembles a delta function, eq. 3 implies a uniform prior for H on [0, log A]. The BLS estimator under the NSB prior can then be written as,\n\n\u02c6Hnsb = E[H|x] = \u222b\u222b H(\u21e1) p(\u21e1|x, \u21b5) p(\u21b5|x) d\u21e1 d\u21b5 = \u222b E[H|x, \u21b5] (p(x|\u21b5)p(\u21b5)/p(x)) d\u21b5,   (5)\n\nwhere E[H|x, \u21b5] is the posterior mean under a Dir(\u21b5) prior, and p(x|\u21b5) denotes the evidence, which has a Polya distribution. Given analytic expressions for E[H|x, \u21b5] and p(x|\u21b5), this estimate is extremely fast to compute via 1D numerical integration in \u21b5. (See Appendix for details).\nNext, we shall consider the problem of extending this approach to infinite-dimensional discrete distributions. 
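The NSB mixing density of eq. 4 is easy to evaluate with standard special functions. A sketch under our own naming (scipy's `polygamma(1, .)` is the trigamma function):

```python
# NSB mixing weight (eq. 4): p(alpha) is proportional to
# d/d_alpha E[H|alpha] = A*psi_1(A*alpha + 1) - psi_1(alpha + 1).
import numpy as np
from scipy.special import polygamma
from scipy.integrate import quad

def nsb_weight(alpha, A):
    """Unnormalized NSB mixing density over the Dirichlet concentration."""
    return A * polygamma(1, A * alpha + 1) - polygamma(1, alpha + 1)

# Since E[H|alpha] runs from 0 (alpha -> 0) to log(A) (alpha -> inf),
# this derivative integrates to log(A) over alpha in (0, inf):
A = 100
total, _ = quad(lambda a: nsb_weight(a, A), 0, np.inf)
print(total, np.log(A))
```

Normalizing by log(A) turns this into the proper mixing density on a finite alphabet; the sections that follow extend the same idea to the infinite-alphabet case.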
Nemenman et al. proposed one such extension using an approximation to \u02c6Hnsb in the limit A \u2192 \u221e, which we refer to as \u02c6Hnsb\u221e [15, 16]. Unfortunately, \u02c6Hnsb\u221e increases unboundedly with N (as noted by [17]), and it performs poorly for the examples we consider.\n\n3 Stick-Breaking Priors\n\nTo construct a prior over countably infinite discrete distributions we employ a class of distributions from nonparametric Bayesian statistics known as stick-breaking processes [19]. In particular, we\n\n\ffocus on two well-known subclasses of stick-breaking processes: the Dirichlet Process (DP) and Pitman-Yor process (PYP). Both are stochastic processes whose samples are discrete probability distributions [7, 20]. A sample from a DP or PYP may be written as \u2211_{i=1}^{\u221e} \u21e1i \u03b4\u03c6i, where \u21e1 = {\u21e1i} denotes a countably infinite set of \u2018weights\u2019 on a set of atoms {\u03c6i} drawn from some base probability measure, where \u03b4\u03c6i denotes a delta function on the atom \u03c6i.1 The prior distribution over \u21e1 under the DP and PYP is technically called the GEM distribution or the two-parameter Poisson-Dirichlet distribution, but we will abuse terminology and refer to it more simply as DP or PY.\nThe DP weight distribution DP(\u21b5) may be described as a limit of the finite Dirichlet distributions where the alphabet size grows and the concentration parameter shrinks, A \u2192 \u221e and \u21b50 \u2192 0, such that \u21b50A \u2192 \u21b5 [20]. The PYP generalizes the DP to allow power-law tails, and includes DP as a special case [7].\nLet PY(d, \u21b5) denote the PYP weight distribution with discount parameter d and concentration parameter \u21b5 (also called the \u201cDirichlet parameter\u201d), for d \u2208 [0, 1), \u21b5 > \u2212d. When d = 0, this reduces to the DP weight distribution, denoted DP(\u21b5). 
The name \u201cstick-breaking\u201d refers to the fact that the weights of the DP and PYP can be sampled by transforming an infinite sequence of independent Beta random variables in a procedure known as \u201cstick-breaking\u201d [21]. Stick-breaking provides samples \u21e1 \u21e0 PY(d, \u21b5) according to:\n\n\u03b2i \u21e0 Beta(1 \u2212 d, \u21b5 + id),   \u02dc\u21e1i = \u03b2i \u220f_{k=1}^{i\u22121} (1 \u2212 \u03b2k),   (6)\n\nwhere \u02dc\u21e1i is known as the i\u2019th size-biased sample from \u21e1. (The \u02dc\u21e1i sampled in this manner are not strictly decreasing, but decrease on average such that \u2211_{i=1}^{\u221e} \u02dc\u21e1i = 1 with probability 1). Asymptotically, the tails of a (sorted) sample from DP(\u21b5) decay exponentially, while for PY(d, \u21b5) with d \u2260 0, the tails approximately follow a power-law: \u21e1i \u221d i^{\u22121/d} ( [7], pp. 867)2. Many natural phenomena such as city size, language, spike responses, etc., also exhibit power-law tails [10, 12]. (See Fig. 2).\n\n3.1 Expectations over DP and PY weight distributions\n\nA key virtue of PYP priors is a mathematical property called invariance under size-biased sampling, which allows us to convert expectations over \u21e1 on the infinite-dimensional simplex to one or two-dimensional integrals with respect to the distribution of the first two size-biased samples [23, 24]. These expectations are required for computing the mean and variance of H under the prior (or posterior) over \u21e1.\nProposition 1 (Expectations with first two size-biased samples). 
For \u21e1 \u21e0 PY(d, \u21b5) and arbitrary integrable functionals f and g of \u21e1,\n\nE(\u21e1|d,\u21b5)[ \u2211_{i=1}^{\u221e} f(\u21e1i) ] = E(\u02dc\u21e11|d,\u21b5)[ f(\u02dc\u21e11)/\u02dc\u21e11 ],   (7)\n\nE(\u21e1|d,\u21b5)[ \u2211_{i, j\u2260i} g(\u21e1i, \u21e1j) ] = E(\u02dc\u21e11,\u02dc\u21e12|d,\u21b5)[ g(\u02dc\u21e11, \u02dc\u21e12)(1 \u2212 \u02dc\u21e11)/(\u02dc\u21e11\u02dc\u21e12) ],   (8)\n\nwhere \u02dc\u21e11 and \u02dc\u21e12 are the first two size-biased samples from \u21e1.\n\nThe first result (eq. 7) appears in [7], and we construct an analogous proof for eq. 8 (see Appendix). The direct consequence of this lemma is that the first two moments of H(\u21e1) under the DP and PY priors have closed forms, which can be obtained using (from eq. 6): \u02dc\u21e11 \u21e0 Beta(1 \u2212 d, \u21b5 + d), and \u02dc\u21e12/(1 \u2212 \u02dc\u21e11) | \u02dc\u21e11 \u21e0 Beta(1 \u2212 d, \u21b5 + 2d), with f(\u21e1i) = \u2212\u21e1i log(\u21e1i) for E[H], and f(\u21e1i) = \u21e1i^2 (log \u21e1i)^2 and g(\u21e1i, \u21e1j) = \u21e1i\u21e1j(log \u21e1i)(log \u21e1j) for E[H^2].\n\n1Here, we will assume the base measure is non-atomic, so that the atoms \u03c6i are distinct with probability 1. This allows us to ignore the base measure, making the entropy of the distribution equal to the entropy of the weights \u21e1.\n\n2Note that the power-law exponent is given incorrectly in [9, 22].\n\n\f[Figure 3 appears here: two panels, \u201cPrior Mean\u201d and \u201cPrior Uncertainty\u201d, showing expected entropy and standard deviation of entropy (nats) as functions of \u21b5 from 10^0 to 10^10, for discount values d = 0.0 through 0.9.]\n\nFigure 3: Prior mean and standard deviation over entropy H under a fixed PY prior, as a function of \u21b5 and d. Note that expected entropy is approximately linear in log \u21b5. 
Small prior standard deviations (right) indicate that p(H(\u21e1)|d, \u21b5) is highly concentrated around the prior mean (left).\n\n3.2 Posterior distribution over weights\n\nA second desirable property of the PY distribution is that the posterior p(\u21e1post|x, d, \u21b5) takes the form of a (finite) Dirichlet mixture of point masses and a PY distribution [8]. This makes it possible to apply the above results to the posterior mean and variance of H.\nLet ni denote the count of symbol i in an observed dataset. Then let \u21b5i = ni \u2212 d, N = \u2211 ni, and A = \u2211 \u21b5i = \u2211i ni \u2212 Kd = N \u2212 Kd, where K = \u2211_{i=1}^{A} 1{ni>0} is the number of unique symbols observed. Given data, the posterior over (countably infinite) discrete distributions, written as \u21e1post = (p1, p2, p3, . . . , pK, p\u21e4\u21e1), has the distribution (given in [19]):\n\n(p1, p2, p3, . . . , pK, p\u21e4) \u21e0 Dir(n1 \u2212 d, n2 \u2212 d, . . . , nK \u2212 d, \u21b5 + Kd),\n\u21e1 := (\u21e11, \u21e12, \u21e13, . . . ) \u21e0 PY(d, \u21b5 + Kd).   (9)\n\n4 Bayesian entropy inference with PY priors\n\n4.1 Fixed PY priors\n\nUsing the results of the previous section (eqs. 7 and 8), we can derive the prior mean and variance of H under a PY(d, \u21b5) prior on \u21e1:\n\nE[H(\u21e1)|d, \u21b5] = \u03c80(1 + \u21b5) \u2212 \u03c80(1 \u2212 d),   (10)\n\nvar[H(\u21e1)|d, \u21b5] = (\u21b5 + d)/((1 + \u21b5)^2 (1 \u2212 d)) + ((1 \u2212 d)/(1 + \u21b5)) (\u03c81(2 \u2212 d) \u2212 \u03c81(2 + \u21b5)),   (11)\n\nwhere \u03c8n is the polygamma of n-th order (i.e., \u03c80 is the digamma function). Fig. 3 shows these functions for a range of d and \u21b5 values. These reveal the same phenomenon that [6] observed for finite Dirichlet distributions: a PY prior with fixed (d, \u21b5) induces a narrow prior over H. 
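The stick-breaking construction of eq. 6 gives a direct sampler for the weights, which in turn lets the analytic prior mean (eq. 10) be checked by Monte Carlo. This is our own illustrative code, with a finite truncation of the infinite stick-breaking sequence:

```python
# Truncated stick-breaking sampler for PY(d, alpha) weights (eq. 6),
# used to verify the analytic prior mean of entropy (eq. 10):
#   E[H | d, alpha] = psi_0(1 + alpha) - psi_0(1 - d).
import numpy as np
from scipy.special import digamma

def py_stick_breaking(d, alpha, n_sticks, rng):
    """First n_sticks size-biased weights pi_tilde_i from PY(d, alpha)."""
    betas = rng.beta(1 - d, alpha + d * np.arange(1, n_sticks + 1))
    remaining = np.concatenate(([1.0], np.cumprod(1 - betas[:-1])))
    return betas * remaining

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

rng = np.random.default_rng(0)
d, alpha = 0.3, 5.0
mc_mean = np.mean([entropy(py_stick_breaking(d, alpha, 3000, rng))
                   for _ in range(1000)])
analytic = digamma(1 + alpha) - digamma(1 - d)
print(mc_mean, analytic)  # agree to within Monte Carlo error
```

For these parameter values the residual stick mass after 3000 sticks is negligible; heavier tails (d near 1) would need a longer truncation.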
In the undersampled regime, Bayesian estimates under PY priors will therefore be strongly determined by the choice of (d, \u21b5), and posterior credible intervals will be unrealistically narrow.3\n\n4.2 Pitman-Yor process mixture (PYM) prior\n\nThe narrow prior on H induced by fixed PY priors suggests a strategy for constructing a non-informative prior: mix together a family of PY distributions with some hyper-prior p(d, \u21b5) selected to yield an approximately flat prior on H. Following the approach of [6], we set p(d, \u21b5) proportional to the derivative of the expected entropy. This leaves one extra degree of freedom, since large\n\n3The only exception is near the corner d \u2192 1 and \u21b5 \u2192 \u2212d. There, one can obtain arbitrarily large prior variance over H for a given mean. However, such priors have very heavy tails and seem poorly suited to data with finite or exponential tails; we do not explore them further here.\n\n\f[Figure 4 appears here: (A) two surface plots of expected entropy (nats); (B) three histograms of sampled p(H) over entropy H from 0 to 5 nats.]\n\nFigure 4: Expected entropy under Pitman-Yor and Pitman-Yor Mixture priors. (A) Left: expected entropy as a function of the natural parameters (d, \u21b5). Right: expected entropy as a function of transformed parameters (h, \u03b3). (B) Sampled prior distributions (N = 5e3) over entropy implied by three different PY mixtures: (1) p(\u03b3, h) \u221d \u03b4(\u03b3 \u2212 1) (red), a mixture of PY(d, 0) distributions; (2) p(\u03b3, h) \u221d \u03b4(\u03b3) (blue), a mixture of DP(\u21b5) distributions; and (3) p(\u03b3, h) \u221d exp(\u221210\u03b3/(1 \u2212 \u03b3)) (grey), which provides a tradeoff between (1) & (2). Note that the implied prior over H is approximately flat.\n\nprior entropies can arise either from large values of \u21b5 (as in the DP) or from values of d near 1. 
(See Fig. 4A). We can explicitly control this trade-off by reparametrizing the PY distribution, letting\n\nh = \u03c80(1 + \u21b5) \u2212 \u03c80(1 \u2212 d),   \u03b3 = (\u03c80(1) \u2212 \u03c80(1 \u2212 d)) / (\u03c80(1 + \u21b5) \u2212 \u03c80(1 \u2212 d)),   (12)\n\nwhere h > 0 is equal to the expected entropy of the prior (eq. 10) and \u03b3 > 0 captures prior beliefs about the tail behavior of \u21e1. For \u03b3 = 0, we have the DP (d = 0); for \u03b3 = 1 we have a PY(d, 0) process (i.e., \u21b5 = 0). Where required, the inverse transformation to standard PY parameters is given by: \u21b5 = \u03c80^{\u22121}(h(1 \u2212 \u03b3) + \u03c80(1)) \u2212 1, d = 1 \u2212 \u03c80^{\u22121}(\u03c80(1) \u2212 \u03b3h), where \u03c80^{\u22121}(\u00b7) denotes the inverse digamma function.\nWe can construct an (approximately) flat improper prior over H on [0, \u221e) by setting p(h, \u03b3) = q(\u03b3), where q is any density on [0, \u221e). The induced prior on entropy is thus:\n\np(H) = \u222b\u222b p(H|\u21e1) pPY(\u21e1|\u03b3, h) p(\u03b3, h) d\u03b3 dh,   (13)\n\nwhere pPY(\u21e1|\u03b3, h) denotes a PY distribution on \u21e1 with parameters \u03b3, h. Fig. 4B shows samples from this prior under three different choices of q(\u03b3), for h uniform on [0, 3]. We refer to the resulting prior distribution over \u21e1 as the Pitman-Yor mixture (PYM) prior. All results in the figures are generated using the prior q(\u03b3) \u221d max(1 \u2212 \u03b3, 0).\n\n4.3 Posterior inference\n\nPosterior inference under the PYM prior amounts to computing the two-dimensional integral over the hyperparameters (d, \u21b5),\n\n\u02c6HPYM = E[H|x] = \u222b E[H|x, d, \u21b5] (p(x|d, \u21b5) p(\u21b5, d) / p(x)) d(d, \u21b5).   (14)\n\nAlthough in practice we parametrize our prior using the variables \u03b3 and h, for clarity and consistency with other literature we present results in terms of d and \u21b5. 
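The change of variables in eq. 12 and its stated inverse can be checked numerically. The inverse digamma is computed here by simple root bracketing (our own helper; the paper does not specify an implementation):

```python
# Reparametrization (eq. 12): (d, alpha) -> (h, gamma), and its inverse:
#   alpha = psi_0^{-1}(h(1 - gamma) + psi_0(1)) - 1,
#   d     = 1 - psi_0^{-1}(psi_0(1) - gamma * h).
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def to_h_gamma(d, alpha):
    h = digamma(1 + alpha) - digamma(1 - d)      # expected prior entropy
    gamma = (digamma(1) - digamma(1 - d)) / h    # tail-behavior parameter
    return h, gamma

def inv_digamma(y):
    """Solve digamma(x) = y for x > 0 (bisection on a wide bracket)."""
    return brentq(lambda x: digamma(x) - y, 1e-9, 1e12)

def to_d_alpha(h, gamma):
    alpha = inv_digamma(h * (1 - gamma) + digamma(1)) - 1
    d = 1 - inv_digamma(digamma(1) - gamma * h)
    return d, alpha

h, g = to_h_gamma(0.4, 2.0)
print(to_d_alpha(h, g))  # recovers (0.4, 2.0)
```

The round trip works because h(1 \u2212 \u03b3) + \u03c80(1) = \u03c80(1 + \u21b5) and \u03c80(1) \u2212 \u03b3h = \u03c80(1 \u2212 d), term by term from eq. 12.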
Just as with the prior mean, the posterior mean E[H|x, d, \u21b5] is given by a convenient analytic form (derived in the Appendix),\n\nE[H|\u21b5, d, x] = \u03c80(\u21b5 + N + 1) \u2212 ((\u21b5 + Kd)/(\u21b5 + N)) \u03c80(1 \u2212 d) \u2212 (1/(\u21b5 + N)) \u2211_{i=1}^{K} (ni \u2212 d) \u03c80(ni \u2212 d + 1).   (15)\n\nThe evidence, p(x|d, \u21b5), is given by\n\np(x|d, \u21b5) = ( \u220f_{l=1}^{K\u22121} (\u21b5 + ld) ) ( \u220f_{i=1}^{K} \u0393(ni \u2212 d) ) \u0393(1 + \u21b5) / ( \u0393(1 \u2212 d)^K \u0393(\u21b5 + N) ).   (16)\n\nWe can obtain confidence regions for \u02c6HPYM by computing the posterior variance E[(H \u2212 \u02c6HPYM)^2|x]. The estimate takes the same form as eq. 14, except that we substitute var[H|x, d, \u21b5] for E[H|x, d, \u21b5]. Although var[H|x, d, \u21b5] has an analytic closed form that is fast to compute, it is a lengthy expression that we do not have space to reproduce here; we provide it in the Appendix.\n\n4.4 Computation\n\nIn practice, the two-dimensional integral over \u21b5 and d is fast to compute numerically. Computation of the integrand can be carried out more efficiently using a representation in terms of multiplicities (also known as the empirical histogram distribution function [4]), the number of symbols that have occurred with a given frequency in the sample. Letting mk = |{i : ni = k}| denote the total number of symbols with exactly k observations in the sample gives the compressed statistic m = [m0, m1, . . . , m_{nmax}]\u22a4, where nmax is the largest number of samples for any symbol. Note that the inner product [0, 1, . . . , nmax] \u00b7 m = N, the total number of samples.\nThe multiplicities representation significantly reduces the time and space complexity of our computations for most datasets, as we need only compute sums and products involving the number of symbols with distinct frequencies (at most nmax), rather than the total number of symbols K. In practice, we compute all expressions not explicitly involving \u21e1 using the multiplicities representation. 
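The posterior mean (eq. 15) and the multiplicities vector are both straightforward to compute from a table of counts. A sketch in our own notation; with no observations the expression reduces to the prior mean of eq. 10, which gives a convenient sanity check:

```python
# Posterior mean of H under PY(d, alpha) (eq. 15), plus the multiplicities
# compression m_k = |{i : n_i = k}| described in the text.
import numpy as np
from scipy.special import digamma

def posterior_mean_H(counts, d, alpha):
    n = np.asarray(counts, dtype=float)
    n = n[n > 0]
    N, K = n.sum(), len(n)
    return (digamma(alpha + N + 1)
            - (alpha + K * d) / (alpha + N) * digamma(1 - d)
            - np.sum((n - d) * digamma(n - d + 1)) / (alpha + N))

def multiplicities(counts):
    """m[k] = number of symbols observed exactly k times (k = 0..nmax)."""
    return np.bincount(np.asarray(counts, dtype=int))

counts = [5, 3, 3, 1, 1, 1]
m = multiplicities(counts)
print(np.arange(len(m)) @ m, sum(counts))  # inner-product identity: both N
print(posterior_mean_H(counts, d=0.1, alpha=1.0))
```

With empty counts (N = K = 0) the function returns \u03c80(1 + \u21b5) \u2212 \u03c80(1 \u2212 d), i.e., the prior mean, as expected.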
For instance, in terms of the multiplicities, the evidence takes the compressed form\n\np(x|d, \u21b5) = p(m1, . . . , m_{nmax}|d, \u21b5) = ( \u0393(1 + \u21b5) / \u0393(\u21b5 + N) ) \u220f_{l=1}^{K\u22121} (\u21b5 + ld) \u220f_{i=1}^{nmax} ( \u0393(i \u2212 d) / (i! \u0393(1 \u2212 d)) )^{mi} N! / \u220f_i mi!.   (17)\n\n4.5 Existence of posterior mean\n\nGiven that the PYM prior with p(h) \u221d 1 on [0, \u221e) is improper, the prior expectation E[H] does not exist. It is therefore reasonable to ask what conditions on the data are sufficient to obtain finite posterior expectation E[H|x]. We give an answer to this question in the following short proposition, the proof of which we provide in Appendix B.\nTheorem 1. Given a fixed dataset x of N samples and any bounded (potentially improper) prior p(\u03b3, h), \u02c6HPYM < \u221e when N \u2212 K \u2265 2.\nThis result says that the BLS entropy estimate is finite whenever there are at least two \u201ccoincidences\u201d, i.e., two fewer unique symbols than samples, even though the prior expectation is infinite.\n\n5 Results\n\nWe compare PYM to other proposed entropy estimators using four example datasets in Fig. 5. The Miller-Maddow estimator is a well-known method for bias correction based on a first-order Taylor expansion of the entropy functional. The CAE (\u201cCoverage Adjusted Estimator\u201d) addresses bias by combining the Horvitz-Thompson estimator with a nonparametric estimate of the proportion of total probability mass (the \u201ccoverage\u201d) accounted for by the observed data x [17, 25]. When d = 0, PYM becomes a DP mixture (DPM). It may also be thought of as NSB with a very large A, and indeed the empirical performance of NSB with large A is nearly identical to that of DPM. All estimators appear to converge except \u02c6Hnsb\u221e, the asymptotic extension of NSB discussed in Section 2.2, which increases unboundedly with data size. In addition PYM performs competitively with other estimators. Note that unlike frequentist estimators, PYM error bars in Fig. 
5 arise from direct computation of the posterior variance of the entropy.\n\n6 Discussion\n\nIn this paper we introduced PYM, a novel entropy estimator for distributions with unknown support. We derived analytic forms for the conditional mean and variance of entropy under a DP and PY prior for fixed parameters. Inspired by the work of [6], we defined a novel PY mixture prior, PYM, which implies an approximately flat prior on entropy. PYM addresses two major issues with NSB: its dependence on knowledge of A and its inability (inherited from the Dirichlet distribution) to\n\n\fFigure 5: Convergence of entropy estimators with sample size, on two simulated and two real datasets. We write \u201cMiMa\u201d for \u201cMiller-Maddow\u201d and \u201cNSB\u221e\u201d for \u02c6Hnsb\u221e. Note that DPM (\u201cDP mixture\u201d) is simply a PYM with \u03b3 = 0. Credible intervals are indicated by two standard deviations of the posterior for DPM and PYM. (A) Exponential distribution \u21e1i \u221d e^{\u2212i}. (B) Power law distribution with exponent 2 (\u21e1i \u221d i^{\u22122}). (C) Word frequency from the novel Moby Dick. (D) Neural words from 8 simultaneously-recorded retinal ganglion cells. Note that for clarity \u02c6Hnsb\u221e has been cropped from B and D. All plots are averages of 16 Monte Carlo runs.\n\naccount for the heavy-tailed distributions which abound in biological and other natural data. We have shown that PYM performs well in comparison to other entropy estimators, and indicated its practicality in example applications to data.\nWe note, however, that despite its strong performance in simulation and in many practical examples, we cannot assure that PYM will always be well-behaved. There may be specific distributions for which the PYM estimate is so heavily biased that the credible intervals fail to bracket the true entropy. 
This re\ufb02ects a general state of affairs for entropy estimation on countable distributions: any\nconvergence rate result must depend on restricting to a subclass of distributions [26]. Rather than\nworking within some analytically-de\ufb01ned subclass of discrete distributions (such as, for instance,\nthose with \ufb01nite \u201centropy variance\u201d [17]), we work within the space of distributions parametrized\nby PY which spans both the exponential and power-law tail distributions. Although PY parameter-\nizes a large class of distributions, its structure allows us to use the PY parameters to understand the\nqualitative features of the distributions made likely under a choice of prior. We feel this is a key\nfeature for small-sample inference, where the choice of prior is most relevant. Moreover, in a forth-\ncoming paper, we demonstrate the consistency of PYM, and show that its small-sample \ufb02exibility\ndoes not sacri\ufb01ce desirable asymptotic properties.\nIn conclusion, we have de\ufb01ned the PYM prior through a reparametrization that assures an approx-\nimately \ufb02at prior on entropy. Moreover, although parametrized over the space of countably-in\ufb01nite\ndiscrete distributions, the computation of PYM depends primarily on the \ufb01rst two conditional mo-\nments of entropy under PY. We derive closed-form expressions for these moments that are fast to\ncompute, and allow the ef\ufb01cient computation of both the PYM estimate and its posterior credible\ninterval. 
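Putting the pieces together, eq. 14 can be approximated by brute force on a hyperparameter grid. This is only a rough numerical sketch in our own code: it reuses the posterior mean of eq. 15 and the evidence of eq. 16, but places a simple uniform grid weight on (d, \u21b5) instead of the flattening (h, \u03b3) hyper-prior the paper actually uses.

```python
# Grid approximation to eq. 14: H_PYM ~ sum over (d, alpha) of
# w(d, alpha) * E[H|x, d, alpha], with w proportional to the evidence
# p(x|d, alpha) (eq. 16, in log space). Uniform grid weight on (d, alpha)
# here -- NOT the paper's (h, gamma) parametrization.
import numpy as np
from scipy.special import digamma, gammaln

def log_evidence(counts, d, alpha):
    n = np.asarray(counts, dtype=float); n = n[n > 0]
    N, K = n.sum(), len(n)
    return (np.sum(np.log(alpha + d * np.arange(1, K)))   # prod_{l=1}^{K-1}
            + np.sum(gammaln(n - d)) - K * gammaln(1 - d)
            + gammaln(1 + alpha) - gammaln(alpha + N))

def posterior_mean_H(counts, d, alpha):
    n = np.asarray(counts, dtype=float); n = n[n > 0]
    N, K = n.sum(), len(n)
    return (digamma(alpha + N + 1)
            - (alpha + K * d) / (alpha + N) * digamma(1 - d)
            - np.sum((n - d) * digamma(n - d + 1)) / (alpha + N))

def pym_estimate(counts):
    logw, Hs = [], []
    for d in np.linspace(0.0, 0.95, 20):
        for a in np.logspace(-2, 3, 40):
            logw.append(log_evidence(counts, d, a))
            Hs.append(posterior_mean_H(counts, d, a))
    logw, Hs = np.array(logw), np.array(Hs)
    w = np.exp(logw - logw.max())    # shift for numerical stability
    return float(np.sum(w * Hs) / w.sum())

counts = [10, 6, 4, 2, 1, 1, 1]   # toy histogram: N = 25, K = 7
print(pym_estimate(counts))       # a finite estimate; cf. Theorem 1
```

A faithful implementation would integrate in the (h, \u03b3) coordinates with the prior q(\u03b3) \u221d max(1 \u2212 \u03b3, 0) and use the multiplicities form (eq. 17) for speed on large alphabets.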
As we demonstrate in application to data, PYM is competitive with previously proposed estimators, and is especially well-suited to neural applications, where heavy-tailed distributions are commonplace.\n\n\fAcknowledgments\n\nWe thank E. J. Chichilnisky, A. M. Litke, A. Sher and J. Shlens for retinal data, and Y. W. Teh for helpful comments on the manuscript. This work was supported by a Sloan Research Fellowship, McKnight Scholar\u2019s Award, and NSF CAREER Award IIS-1150186 (JP).\n\nReferences\n[1] G. Miller. Note on the bias of information estimates. Information theory in psychology: Problems and methods, 2:95\u2013100, 1955.\n[2] S. Panzeri and A. Treves. Analytical estimates of limited sampling biases in different information measures. Network: Computation in Neural Systems, 7:87\u2013107, 1996.\n[3] S. Strong, R. Koberle, R. de Ruyter van Steveninck, and W. Bialek. Entropy and information in neural spike trains. Physical Review Letters, 80:197\u2013202, 1998.\n[4] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15:1191\u20131253, 2003.\n[5] P. Grassberger. Entropy estimates from insufficient samplings. arXiv preprint, arXiv:0307138 [physics], January 2008.\n[6] I. Nemenman, F. Shafee, and W. Bialek. Entropy and inference, revisited. Adv. Neur. Inf. Proc. Sys., 14, 2002.\n[7] J. Pitman and M. Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855\u2013900, 1997.\n[8] H. Ishwaran and L. James. 
Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, 13(4):1211\u20131236, 2003.\n[9] S. Goldwater, T. Griffiths, and M. Johnson. Interpolating between types and tokens by estimating power-law generators. Adv. Neur. Inf. Proc. Sys., 18:459, 2006.\n[10] G. Zipf. Human behavior and the principle of least effort. Addison-Wesley Press, 1949.\n[11] T. Dudok de Wit. When do finite sample effects significantly affect entropy estimates? Eur. Phys. J. B - Cond. Matter and Complex Sys., 11(3):513\u2013516, October 1999.\n[12] M. Newman. Power laws, Pareto distributions and Zipf\u2019s law. Contemporary Physics, 46(5):323\u2013351, 2005.\n[13] M. Hutter. Distribution of mutual information. Adv. Neur. Inf. Proc. Sys., 14:399, 2002.\n[14] J. Hausser and K. Strimmer. Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks. The Journal of Machine Learning Research, 10:1469\u20131484, 2009.\n[15] I. Nemenman, W. Bialek, and R. de Ruyter van Steveninck. Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69(5):056111, 2004.\n[16] I. Nemenman. Coincidences and estimation of entropies of random variables with large cardinalities. Entropy, 13(12):2013\u20132023, 2011.\n[17] V. Q. Vu, B. Yu, and R. E. Kass. Coverage-adjusted entropy estimation. Statistics in Medicine, 26(21):4039\u20134060, 2007.\n[18] J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. J. Chichilnisky, and E. P. Simoncelli. Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature, 454:995\u2013999, 2008.\n[19] H. Ishwaran and M. Zarepour. Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 30(2):269\u2013283, 2002.\n[20] J. Kingman. Random discrete distributions. Journal of the Royal Statistical Society. Series B (Methodological), 37(1):1\u201322, 1975.\n[21] H. Ishwaran and L. F. James. 
Gibbs sampling methods for stick-breaking priors. Journal of the American\n\nStatistical Association, 96(453):161\u2013173, March 2001.\n\n[22] Y. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. Proceedings of the 21st\nInternational Conference on Computational Linguistics and the 44th annual meeting of the Association\nfor Computational Linguistics, pages 985\u2013992, 2006.\n\n[23] M. Perman, J. Pitman, and M. Yor. Size-biased sampling of poisson point processes and excursions.\n\nProbability Theory and Related Fields, 92(1):21\u201339, March 1992.\n\n[24] J. Pitman. Random discrete distributions invariant under size-biased permutation. Advances in Applied\n\nProbability, pages 525\u2013539, 1996.\n\n[25] A. Chao and T. Shen. Nonparametric estimation of Shannon\u2019s index of diversity when there are unseen\n\nspecies in sample. Environmental and Ecological Statistics, 10(4):429\u2013443, 2003.\n\n[26] A. Antos and I. Kontoyiannis. Convergence properties of functional estimates for discrete distributions.\n\nRandom Structures & Algorithms, 19(3-4):163\u2013193, 2001.\n\n[27] D. Wolpert and D. Wolf. Estimating functions of probability distributions from a \ufb01nite set of samples.\n\nPhysical Review E, 52(6):6841\u20136854, 1995.\n\n9\n\n\f", "award": [], "sourceid": 996, "authors": [{"given_name": "Evan", "family_name": "Archer", "institution": null}, {"given_name": "Il Memming", "family_name": "Park", "institution": null}, {"given_name": "Jonathan", "family_name": "Pillow", "institution": null}]}