{"title": "How biased are maximum entropy models?", "book": "Advances in Neural Information Processing Systems", "page_first": 2034, "page_last": 2042, "abstract": "Maximum entropy models have become popular statistical models in neuroscience and other areas in biology, and can be useful tools for obtaining estimates of mu- tual information in biological systems. However, maximum entropy models fit to small data sets can be subject to sampling bias; i.e. the true entropy of the data can be severely underestimated. Here we study the sampling properties of estimates of the entropy obtained from maximum entropy models. We show that if the data is generated by a distribution that lies in the model class, the bias is equal to the number of parameters divided by twice the number of observations. However, in practice, the true distribution is usually outside the model class, and we show here that this misspecification can lead to much larger bias. We provide a perturba- tive approximation of the maximally expected bias when the true model is out of model class, and we illustrate our results using numerical simulations of an Ising model; i.e. the second-order maximum entropy distribution on binary data.", "full_text": "How biased are maximum entropy models?\n\nJakob H. Macke\n\nGatsby Computational Neuroscience Unit\n\nUniversity College London, UK\njakob@gatsby.ucl.ac.uk\n\nIain Murray\n\nSchool of Informatics\n\nUniversity of Edinburgh, UK\n\ni.murray@ed.ac.uk\n\nPeter E. Latham\n\nGatsby Computational Neuroscience Unit\n\nUniversity College London, UK\npel@gatsby.ucl.ac.uk\n\nAbstract\n\nMaximum entropy models have become popular statistical models in neuroscience\nand other areas in biology, and can be useful tools for obtaining estimates of mu-\ntual information in biological systems. However, maximum entropy models \ufb01t to\nsmall data sets can be subject to sampling bias; i.e. the true entropy of the data can\nbe severely underestimated. 
Here we study the sampling properties of estimates of the entropy obtained from maximum entropy models. We show that if the data is generated by a distribution that lies in the model class, the bias is equal to the number of parameters divided by twice the number of observations. However, in practice, the true distribution is usually outside the model class, and we show here that this misspecification can lead to much larger bias. We provide a perturbative approximation of the maximally expected bias when the true model is out of model class, and we illustrate our results using numerical simulations of an Ising model; i.e. the second-order maximum entropy distribution on binary data.\n\n1 Introduction\n\nOver the last several decades, information theory [1, 2] has played a major role in our effort to understand the neural code in the brain [3, 4]. Its usefulness, however, is limited by the fact that the quantity of interest, mutual information (typically between stimuli and neuronal responses), is hard to compute from data [5]. Consequently, although this approach has led to a relatively deep understanding of neural coding in single neurons [4], it has told us far less about populations [6, 7]. In essence, the brute-force approaches to measuring mutual information that have worked so well on single spike trains simply do not work on populations. This is because the key ingredient of mutual information is the entropy, and in general, estimation of the entropy from finite data sets suffers from a severe downward bias [8, 9]: on average, the entropy estimated on the data set will be lower than the actual entropy of the underlying model. 
While a number of improved estimators have been developed (see [5, 10] for an overview), the amount of data one needs is, ultimately, exponential in the number of neurons, so even modest populations (tens of neurons) are out of reach.\n\nTo apply information-theoretic techniques to populations, then, our only hope is to develop models in which the number of unconstrained parameters grows (relatively) slowly with the number of neurons [11]. For such models, estimating information requires much less data than brute-force methods. Still, the amount of data is nontrivial, and naive estimators of information can be badly biased. Here we consider one class of models – maximum entropy models subject to linear constraints – and compute the bias in the entropy. We show that if the true distribution lies in the parametric model class, then the bias is equal to the number of parameters divided by twice the number of observations. When the true distribution is outside the model class, however, the bias can be much larger.\n\nWe illustrate our results using a very popular model in neuroscience, the Ising model [12], which is the second-order maximum entropy distribution on binary data. Recently, this model has become a popular means of characterizing the distribution of firing patterns in multi-electrode recordings, and has been used extensively in a wide range of applications, including recordings in the retina [13, 14] and visual cortex [15]. In addition, several recent studies [16, 17, 18] have used numerical simulations of large Ising models to understand the scaling of the entropy of the model with population size. 
And, finally, Ising models have been used in other fields in biology, for example to model gene-regulation networks [19].\n\n2 Theory\n\n2.1 Maximum entropy models\n\nOur starting point is an underlying true distribution, denoted p(x) where x is a (typically real valued) vector; the goal is to model it with a maximum entropy distribution. For simplicity, when developing the formalism we take x to be discrete; however, all our results apply to continuous variables. The maximum entropy distribution is the distribution with the highest entropy subject to a set of constraints, where the entropy is given by\n\nS = −Σ_x p(x) log p(x) .   (1)\n\nSpecifically, suppose that under the true distribution a set of m functions, denoted g_i(x), i = 1, ..., m, average to µ_i,\n\nµ_i = Σ_x p(x) g_i(x) .   (2)\n\nIf we use q(x|µ) to denote the maximum entropy distribution (with µ ≡ (µ_1, µ_2, ..., µ_m)), the constraints (here taken to be linear in the probability) are of the form\n\nΣ_x q(x|µ) g_i(x) = µ_i .   (3)\n\nFinding an explicit expression for q(x|µ) is a straightforward optimization problem (see, e.g., [2]). It can be shown that the maximum entropy distribution is in the exponential family,\n\nq(x|µ) = exp[Σ_{i=1}^m λ_i(µ) g_i(x)] / Z(µ)   (4)\n\nwhere the parameters λ_i (the Lagrange multipliers of the optimization problem) are chosen such that the constraints in Eq. (2) are satisfied. The partition function, Z(µ), ensures that the probabilities normalize to one,\n\nZ(µ) = Σ_x exp[Σ_{i=1}^m λ_i(µ) g_i(x)] .   (5)\n\nOnce we have identified the parameters of this model, we can insert Eq. (4) into Eq. 
(1), which allows us to write the entropy in the form\n\nS_q(µ) = log Z(µ) − Σ_{i=1}^m λ_i(µ) µ_i .   (6)\n\n2.2 Estimation bias in maximum entropy models\n\nSo far we have assumed that the true µ_i are known. In general, though, we have to estimate the µ_i from data. Specifically, if we have K observations of x, denoted x^(k), k = 1, ..., K, then the estimate of µ_i, denoted µ̂_i, is given by\n\nµ̂_i = (1/K) Σ_{k=1}^K g_i(x^(k)) .   (7)\n\nWe can still use the maximum entropy formulation described above; the only difference is that we replace µ by µ̂. Thus, the maximum entropy distribution is given by q(x|µ̂) (Eq. (4)) and the entropy by S_q(µ̂) (Eq. (6)).\n\nBecause of sampling error, the µ̂_i are not equal to their true values, µ_i; consequently, neither is S_q(µ̂). This leads to variability, in the sense that different sets of x^(k) lead to different entropies and, because the entropy is concave, to bias. Thus, the entropy estimated from a finite data set will be lower, on average, than the entropy obtained from the true underlying model. In the large K limit, so that µ̂_i is close to µ_i, the bias can be computed by Taylor expanding around S_q(µ) and averaging over the true distribution, p(x). 
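The fitting step described by Eqs. (3)–(6) can be sketched numerically. The following is a minimal illustration (not the authors' code): on a small binary state space, the Lagrange multipliers of Eq. (4) are found by gradient ascent on the concave log-likelihood, i.e. iterative moment matching, and the entropy is then computed both from the definition, Eq. (1), and from Eq. (6). The example distribution `p` and all parameter values are illustrative.

```python
import itertools
import numpy as np

def features(x):
    # g(x): the components x_i and pairwise products x_i x_j (i < j)
    n = len(x)
    pairs = [x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.array(list(x) + pairs, dtype=float)

def fit_maxent(mu, n, lr=0.5, iters=5000):
    # solve for the Lagrange multipliers of Eq. (4) by gradient ascent,
    # stepping lambda toward satisfying the moment constraints, Eq. (3)
    X = np.array([features(x) for x in itertools.product([0, 1], repeat=n)])
    lam = np.zeros(X.shape[1])
    for _ in range(iters):
        logits = X @ lam
        w = np.exp(logits - logits.max())
        q = w / w.sum()                  # q(x|mu), Eq. (4)
        lam += lr * (mu - q @ X)         # gradient of the log-likelihood
    logZ = np.log(np.exp(X @ lam).sum())  # partition function, Eq. (5)
    return lam, q, X, logZ

# target moments taken from an example distribution p on 2 bits
p = np.array([0.4, 0.2, 0.1, 0.3])       # p(00), p(01), p(10), p(11)
X2 = np.array([features(x) for x in itertools.product([0, 1], repeat=2)])
mu = p @ X2                              # Eq. (2)
lam, q, X, logZ = fit_maxent(mu, n=2)

S_direct = -(q * np.log(q)).sum()        # entropy from the definition, Eq. (1)
S_param = logZ - lam @ mu                # entropy from Eq. (6): the two agree
```

Since all first and second moments of two bits are constrained here, the model is saturated, so the fitted q reproduces p exactly; with fewer constraints than degrees of freedom it would not.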
Anticipating somewhat our result, we use −b/2K to denote the bias, and we have\n\n−b/2K ≡ ⟨S_q(µ̂) − S_q(µ)⟩_p(x) = Σ_{i=1}^m (∂S_q(µ)/∂µ_i) ⟨δµ_i⟩_p(x) + (1/2) Σ_{i,j=1}^m (∂²S_q(µ)/∂µ_i ∂µ_j) ⟨δµ_i δµ_j⟩_p(x) + ...   (8)\n\nwhere\n\nδµ_i ≡ µ̂_i − µ_i = (1/K) Σ_{k=1}^K g_i(x^(k)) − µ_i .   (9)\n\nThe angle brackets with subscript p(x) indicate an average with respect to the true distribution, p(x). The quantity we focus on is b, the normalized bias (as it is independent of K in the large K limit). Computing the averages and derivatives in Eq. (8) is straightforward (see Appendix A in the supplementary material for details), and we find that, through second order in δµ,\n\nb = Σ_{ij} C^q−1_ij C^p_ji ,   (10)\n\nwhere\n\nC^q_ij ≡ ⟨δg_i(x) δg_j(x)⟩_q(x|µ)   (11a)\nC^p_ij ≡ ⟨δg_i(x) δg_j(x)⟩_p(x) .   (11b)\n\nHere C^q−1_ij denotes the ijth entry of C^q−1 and\n\nδg_i(x) ≡ g_i(x) − µ_i .   (12)\n\n2.3 Bias when the true model is in the model class\n\nEquation (10) tells us the normalized bias (to first order in 1/K). Evaluating it is, typically, hard, but there is one case in which we can write down an explicit expression for it: when the true distribution lies in the model class, so that p(x) = q(x|µ). In that case, C^q = C^p, the normalized bias is the trace of the identity matrix, and we have b = m (recall that m is the number of constraints); alternatively, Bias[S] = −m/2K.\n\nAn important within-model-class case arises when x is discrete and the "parametrized" model is a direct histogram of the data. 
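Equation (10) and the resulting −b/2K bias can be checked on a deliberately small example (a hypothetical illustration, not from the paper): three states with a single feature g(x) = x, so m = 1 and Eq. (10) reduces to the variance ratio C^p/C^q. The true p below matches the constrained mean but lies outside the model class, and a Monte-Carlo average of the plug-in entropy reproduces the predicted bias −b/2K. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.array([0., 1., 2.])        # states; single feature g(x) = x, so m = 1
p = np.array([0.4, 0.2, 0.4])     # true distribution; <x>_p = 1
q0 = np.ones(3) / 3               # maxent model with <x> = 1 is uniform

Cq = q0 @ (x - 1.0) ** 2          # Eq. (11a): 2/3
Cp = p @ (x - 1.0) ** 2           # Eq. (11b): 0.8
b = Cp / Cq                       # Eq. (10): 1.2 > m = 1, out of model class

def maxent_entropy(mu_hat):
    # fit lambda so that <x>_q = mu_hat (Newton steps), then S_q via Eq. (6)
    lam = 0.0
    for _ in range(50):
        w = np.exp(lam * x); w /= w.sum()
        mean, var = w @ x, w @ x**2 - (w @ x) ** 2
        lam -= (mean - mu_hat) / var
    return np.log(np.exp(lam * x).sum()) - lam * mu_hat

K, trials = 100, 20000
S_true = maxent_entropy(1.0)      # entropy at the true moment (= log 3)
samples = rng.choice(x, p=p, size=(trials, K))
S_hat = np.array([maxent_entropy(s.mean()) for s in samples])
bias_mc = S_hat.mean() - S_true   # Monte-Carlo bias, compare with -b/(2K)
```

Here the simulated bias is close to −b/2K = −0.006 rather than the in-class value −m/2K = −0.005: the misspecification inflates the bias, as Eq. (10) predicts.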
If x can take on D values, then there are D − 1 parameters (the "−1" comes from the fact that p(x) must sum to 1), and the bias is −(D − 1)/2K. We thus recover a general version of the Miller–Madow [8] or Panzeri & Treves bias correction [9], which was derived for a multinomial distribution. (Note that our expression differs from theirs by a factor of log 2; that's because they use base 2 logarithms whereas we use natural logarithms.) Alternatively, one can exploit the relationship between entropy-maximization and maximum-likelihood estimation in the exponential family to deduce this result from the asymptotic distribution of maximum likelihood estimators [20]. For details see Appendix B in the supplementary material.\n\n2.4 Bias when the true model is not in the model class\n\nIn practice, it is rare for the true distribution to lie in the model class, so it is important to know how the normalized bias behaves in general. In this section, we investigate how quickly it changes when we leave the model class. We concentrate on the worst-case scenario and determine the largest normalized bias that is consistent with a given "distance" from the true model class. For cases in which we are close to the true model class, we provide a perturbative expression for this quantity.\n\nTo assess the normalized bias out of model class, we assume that p(x), the distribution from which the data was generated, can be written as\n\np(x) = q(x|µ) + δp(x)   (13)\n\nwith δp(x) chosen so that it is orthogonal to all the constraints; that is, Σ_x δp(x) g_i(x) = 0, which in turn implies that\n\nΣ_x p(x) g_i(x) = Σ_x q(x|µ) g_i(x)   (14)\n\n(and both, of course, are equal to µ_i). We then ask how the normalized bias behaves as δp(x) varies. Because q(x|µ) is independent of δp(x), so is C^q_ij, and the normalized bias, b, that appears in Eq. (10) can be written (using Eq. 
(11b))\n\nb = ⟨B(x)⟩_p(x)   (15)\n\nwhere\n\nB(x) ≡ Σ_{ij} δg_i(x) C^q−1_ij δg_j(x) .   (16)\n\nIt's not possible to say anything definitive about the normalized bias in general, but what we can do is compute its maximum as a function of the distance between p(x) and q(x|µ), with "distance" measured by the Kullback–Leibler divergence. The latter quantity, denoted ΔS, is given by\n\nΔS = Σ_x p(x) log [p(x)/q(x|µ)] = S_q(µ) − S_p   (17)\n\nwhere S_p is the entropy of p(x). The second equality follows from the definition of q(x|µ), Eq. (4), and the fact that ⟨g_i(x)⟩_p(x) = ⟨g_i(x)⟩_q(x|µ), which comes from Eq. (14).\n\nWe are interested in finding the maximal normalized bias that is consistent with a given ΔS. Rather than maximizing the normalized bias at fixed ΔS, we take the complementary approach: for each possible bias, we find the minimal possible ΔS. This gives us a relationship between bias and minimal ΔS, which we can invert to obtain the maximal bias for a given ΔS. Since S_q(µ) is independent of p(x), minimizing ΔS is equivalent to maximizing S_p (see Eq. (17)). Thus, again we have a maximum entropy problem. Now, though, we have an additional constraint on the normalized bias, which gives us an additional Lagrange multiplier in addition to the λ_i we had for the original optimization problem. This leads to (in analogy to Eq. (4))\n\np(x|µ,β) = exp[βB(x) + Σ_i λ_i(µ,β) g_i(x)] / Z(µ,β)   (18)\n\nwhere Z(µ,β) is the partition function and the λ_i(µ,β) are chosen to satisfy Eq. (2), but with p(x) replaced by p(x|µ,β). 
Amongst all models that satisfy the moment constraints and have the same normalized bias, this is the one that is closest (in KL–divergence) to the maximum entropy model. Note that we have slightly abused notation: whereas in the previous sections the λ_i and Z depended only on µ, they now depend on both µ and β. However, the previous variables are closely related to the new ones: when β = 0 the constraint associated with b disappears, and we recover q(x|µ); that is, p(x|µ, 0) = q(x|µ). Consequently, λ_i(µ, 0) = λ_i(µ), and Z(µ, 0) = Z(µ).\n\nRelating ΔS to b is now a purely numerical task: choose a set of µ_i and a normalized bias, b; determine the Lagrange multipliers, λ_i(µ,β) and β, that appear in Eq. (18); then compute S_p, the entropy of p(x|µ,β), and subtract that from S_q(µ) to find ΔS (see Eq. (17)). In section 3.2 we do exactly that. First, however, to gain some intuition into how the normalized bias depends on ΔS, we compute the relationship between the two perturbatively. This can be done by considering the small β limit. In this limit we can expand both ΔS and b as a Taylor series in β. Defining\n\nΔS(β) ≡ S_q(µ) − S_p(β)   (19)\n\nwhere S_p(β) is the entropy of p(x|µ,β), and using primes to denote derivatives with respect to β, we have, through second order in β,\n\nΔS(β) = S_q(µ) − S_p(0) − βS'_p(0) − (β²/2) S''_p(0)   (20a)\nb(β) = b(0) + βb'(0) .   (20b)\n\nWe expand ΔS(β) to second order in β because S'_p(0) = 0, which follows from the fact that when β ≠ 0 there is an additional constraint on the normalized bias, and so any β ≠ 0 can only lower the entropy; therefore, β = 0 must be a local maximum. 
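The "purely numerical task" above can be carried out explicitly on a small example (a hypothetical three-state toy, not the paper's Ising computation): one feature g(x) = x constrained to ⟨x⟩ = 1, for which the symmetry of the states makes λ(µ,β) = 0, so Eq. (18) can be evaluated directly for a grid of β and compared with the perturbative prediction of Eq. (22) below. All numbers are illustrative.

```python
import numpy as np

x = np.array([0., 1., 2.])           # states; single feature g(x) = x, m = 1
q = np.ones(3) / 3                   # maxent model with <x> = 1 (uniform)
Cq = q @ (x - 1.0) ** 2              # Eq. (11a): 2/3
B = (x - 1.0) ** 2 / Cq              # Eq. (16) in the scalar case
Sq = np.log(3.0)                     # S_q(mu)

# b'(0) from Eq. (23); here <B(x) dg(x)>_q = 0 by symmetry, so only
# the variance term survives:
bp0 = q @ B**2 - (q @ B) ** 2        # = 0.5

for beta in [0.01, 0.05, 0.1]:
    w = np.exp(beta * B)             # Eq. (18); lambda = 0 by symmetry,
    p = w / w.sum()                  # so <x>_p = 1 holds automatically
    b = p @ B                        # normalized bias of this p, Eq. (15)
    dS = Sq + (p * np.log(p)).sum()  # Delta S = S_q - S_p, Eq. (17)
    pred = (b - 1.0) ** 2 / (2 * bp0)  # perturbative Eq. (22), with m = 1
    print(beta, dS, pred)
```

As β shrinks, the exact ΔS and the perturbative prediction agree ever more closely, as the small-β expansion requires.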
Alternatively, a straightforward calculation in which we write down the entropy of p(x|µ,β) using Eq. (18) (which results in an expression analogous to Eq. (6)) and differentiate with respect to β yields\n\nS'_p(β) = −βb'(β) .   (21)\n\nFrom this it follows that S'_p(0) = 0; in addition, we see that S''_p(0) = −b'(0). Thus, using the fact that when β = 0, p(x|µ, 0) is within the model class, so S_p(0) = S_q(µ), Eq. (20) tells us that when β is sufficiently small,\n\nΔS = (b − m)² / 2b'(0) .   (22)\n\nThe term in the denominator, b'(0), is relatively easy to compute, and we show in Appendix C (in the supplementary material) that it is given by\n\nb'(0) = Var[B]_q(x|µ) − Σ_{i,j=1}^m ⟨B(x) δg_i(x)⟩_q(x|µ) C^q−1_ij ⟨δg_j(x) B(x)⟩_q(x|µ) .   (23)\n\nThe key result of the perturbative analysis is that when the true distribution is out of the model class, the normalized bias can be increased by a term proportional to b'(0)^1/2. Thus, the size of b'(0) is crucial for telling us how big the bias really is. In the next section we investigate this numerically for a particular model, the Ising model.\n\n3 Numerical Results: Estimation bias in Ising models\n\nFor our numerical simulations, we consider the second-order maximum entropy model on n binary variables, also known as the Ising model [12] (see [13, 14] for an application of Ising models to neuroscience). In this section, we use numerical studies to verify that the asymptotic bias gives an accurate characterization of the expected bias for relevant sample sizes K, investigate the size of the normalized bias when the true model is not in the model class, and study the scaling of the normalized bias with the number of parameters. 
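For completeness, the substitution that takes Eqs. (20) and (21) to Eq. (22) can be spelled out:

```latex
% Since p(x|\mu,0) = q(x|\mu), we have S_p(0) = S_q(\mu) and b(0) = m.
% Eq. (21) gives S'_p(0) = 0 and, differentiating once more,
% S''_p(0) = -b'(0). Substituting into Eq. (20a):
\Delta S(\beta) = -\tfrac{\beta^2}{2}\, S''_p(0) = \tfrac{\beta^2}{2}\, b'(0).
% Eq. (20b) with b(0) = m gives \beta = (b - m)/b'(0), hence
\Delta S = \frac{1}{2}\,\frac{(b-m)^2}{b'(0)^2}\, b'(0) = \frac{(b-m)^2}{2\, b'(0)}.
```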
We show numerically that, for the Ising model, the model misspecification can result in the normalized bias increasing rapidly with population size.\n\n3.1 Estimation in a binary maximum entropy model\n\nWe consider n interacting spins s_i, i = 1, ..., n with s_i ∈ {0, 1}. We put constraints on the first and second moments only, so m, the number of constraints, is n(n + 1)/2: g_i(s) = s_i and g_ij(s) = s_i s_j, i < j. The maximum entropy model (with the λ_i's replaced by h_i and J_ij and the g_i written explicitly) has the form\n\nq(s|h, J) = (1/Z(h, J)) exp[Σ_i h_i s_i + Σ_{i<j} J_ij s_i s_j] .   (24)\n\nFor large sample sizes (K > 10^3), such deviations could be a consequence of numerical errors in the fitting procedure.\n\nWe note that our choice of J_ij = 0 is merely for concreteness, and that the validity of our formulation is not dependent on the values of J_ij. We also performed simulations with models in which J_ij is non-zero and drawn from a Gaussian distribution, which yielded qualitatively similar results.\n\nFigure 1: Asymptotic bias in Ising models. A) Comparison of asymptotic bias with expected bias calculated via simulations of an independent model with a mean of 0.5 (see text). The thin black lines correspond to the bias as predicted by our asymptotic calculation. We have here inverted the sign of the bias; the actual biases are negative numbers. B) Same data as in A, but on a semi-log plot to illustrate how many samples are necessary for the asymptotic bias to be an accurate representation of the actual bias: for the parameters used here, the bias seems to be accurate even for small (< 100) values of K. We rescaled the estimated biases of each population size n such that the predicted asymptotic biases (thin black lines) are on top of each other, and such that the biases are positive. C and D) Same as in A and B, but for an independent model with mean 0.1. Error bars show standard errors on the mean estimates from 10^4 simulated data sets.\n\n3.2 Estimation bias when the data has higher-order correlations\n\nWhat happens when the true model is not in the model class? To investigate this question, we first consider homogeneous pairwise maximum entropy models (h_i = h and J_ij = J) of sizes n ∈ {5, 10, 15}, common means ⟨s_i⟩ = 0.5 or 0.1, and pairwise correlation coefficient ρ_i,j = 0.1 for each pair i, j. For a range of normalized biases, we calculated ΔS, the maximum entropy difference between q(x|µ) and an out-of-model-class distribution, as a function of normalized bias, b. For very small or large normalized biases, the optimization did not converge to values consistent with the moment constraints, indicating that such an extreme normalized bias would be inconsistent with the specified second-order moments. The results are shown in Fig. 2, along with the perturbative predictions. For these choices of parameters, the maximum and minimum normalized bias did not deviate much from the within-model-class case. 
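The Dichotomized Gaussian distribution used in the next example can be simulated by thresholding correlated Gaussian variables, following the construction of [21, 22]. Below is a minimal sketch with illustrative parameters (means of 0.5, i.e. a threshold at zero, rather than the 0.02 used in the text; `lam` is the latent Gaussian correlation). At threshold zero the resulting binary correlation has the known closed form (2/π) arcsin(λ), which the simulation reproduces.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 5, 200000
lam = 0.3                                  # latent (Gaussian) correlation
C = np.full((n, n), lam)                   # equicorrelated latent covariance
np.fill_diagonal(C, 1.0)

z = rng.multivariate_normal(np.zeros(n), C, size=K)
s = (z > 0).astype(float)                  # threshold at 0 -> <s_i> = 0.5

rho_emp = np.corrcoef(s.T)[0, 1]           # empirical pairwise correlation
rho_theory = 2 / np.pi * np.arcsin(lam)    # exact value at threshold 0
```

Because the thresholding introduces correlations of all orders, such a p(x) generally lies outside the second-order model class, which is what makes it a useful test case here.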
In the next example, we illustrate that the deviation can be very large.\n\nTo get a better understanding of the additional bias (or, potentially, reduction in bias) due to model misspecification, we studied the bias of the Dichotomized Gaussian distribution, which can be interpreted as a very simple model of neural population activity in which correlations among neurons are induced by common, Gaussian inputs into threshold neurons [21, 22]. In this case we simply set p(x) to a Dichotomized Gaussian, and numerically computed the bias and the KL–divergence between p(x) and the maximum entropy model with the same first and second moments. We did this for means set to ⟨s_i⟩ = 0.02, a realistic value for applications of maximum entropy models in neuroscience, and different values of the pairwise correlation coefficient ρ ∈ {0.02, 0.1, 0.5}. We also included, for comparison, the normalized bias for a within-model-class distribution (i.e. a maximum entropy model with matched first and second moments), which is just n(n + 1)/2.\n\nFor the Dichotomized Gaussian, the normalized bias was substantially larger than the within-model-class bias. For example, for population size n = 15, its bias is 2.3 times larger for ρ = 0.1, and 6.8 times larger for ρ = 0.5. Figure 3B shows ΔS versus population size for the models in Fig. 3A, and the corresponding "maximally biased" model; i.e. the model which has the same normalized bias as the Dichotomized Gaussian, but minimal ΔS. Interestingly, ΔS for the maximally biased models (equation (18)) is very similar to ΔS for the Dichotomized Gaussian. This suggests that our extremal calculation of the bias is relevant for a reasonably mechanistic model of neural population activity.\n\nFigure 2: Bias in the case of model misspecification. Top row: ΔS/S_2, where S_2 is the entropy of the second-order model, as a function of the normalized bias for a model with means ⟨s_i⟩ = 0.5 and correlation coefficient 0.1. The red (dashed) lines show the exact ΔS calculated by using equation (18), and the green (solid) lines using the perturbative expansion in equation (22). The curves end because for normalized biases too large or too small the optimization does not converge to values which satisfy the moment constraints. Bottom row: Same as top row, but using means of ⟨s_i⟩ = 0.1.\n\n4 Conclusions\n\nIn recent years, there has been a resurgence of interest in maximum entropy models in neuroscience and related fields [13, 14, 15]. In particular, maximum entropy models can be useful for model-based estimation of the information content of neural populations [11], as direct information estimates do not scale well for large population sizes. In this paper, we studied estimation biases in the entropy of maximum entropy models. We focused on "naive" estimators, i.e. 
estimators of the entropy which simply calculate it from the empirical estimates of the probabilities of the model, and do not attempt to do any bias reduction.\n\nWe found that if the true model is in the model class, the (downward) bias in a maximum entropy estimate from finite observations is proportional to the ratio of the number of parameters to the number of observations, a relationship which is identical to that of the (naive) histogram estimators [8, 9]. However, we also show that if the model is misspecified (i.e. if the true data do not come from the specified exponential family model), then the bias can be much larger. We numerically investigated the bias in second-order binary maximum entropy models (also known as Ising models), and showed that in this case, model misspecification can lead to substantially bigger biases.\n\nFigure 3: Bias in the case of model misspecification, using the Dichotomized Gaussian. A) Scaling of the normalized bias with population size. The normalized bias of the Dichotomized Gaussian (DG) is much larger than that of the maximum entropy model. B) Distance from model class, ΔS, versus population size for the Dichotomized Gaussian and maximum entropy models. They are about the same, indicating that the Dichotomized Gaussian model has close to maximum bias.\n\nNon-parametric estimation of entropy is a well-researched subject, and various estimators with optimized properties have been proposed (see e.g. [5, 23]). 
A number of studies have looked at entropy estimation for the multivariate normal distribution [24, 25, 26, 27] and other continuous distributions, and improved estimators for the Gaussian distribution have been described [28]. As the (differential) entropy of a Gaussian distribution is essentially its log-determinant, the bias of this model can be related to results about the eigenvalues of random matrices [29]. An overview of estimators of the entropy of continuous-valued distributions is given in [30].\n\nHowever, to our knowledge, the entropy bias of maximum entropy models in the presence of model misspecification has not been characterized or studied numerically. We provided here an asymptotic derivation of this bias, and studied it numerically for the pairwise binary maximum entropy model, the Ising model. Our characterization of the bias relates the (worst-case) bias in the case of model misspecification to the distance (as measured by KL–divergence) between the model and the actual data. This characterization does not yield a precise estimate of the bias on a given data set which could simply be "subtracted off"; thus, our derivation does not directly yield an improved estimator of the bias for such data sets. However, importantly, our results show that model misspecification can indeed lead to additional bias which can be much larger than generally appreciated. Using numerical simulations, we showed that this also happens for a realistic model which shares many properties with neural recordings. In addition, our results could be useful for deriving general guidelines for how many samples a neurophysiological data set needs to contain to achieve a bias which is less than some desired accuracy.\n\nAcknowledgements\n\nWe acknowledge support from the Gatsby Charitable Foundation. 
JHM is supported by an EC Marie Curie Fellowship, and IM in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. This publication only reflects the authors' views.\n\nReferences\n\n[1] C.E. Shannon and W. Weaver. The mathematical theory of communication. University of Illinois Press, 1949.\n\n[2] T.M. Cover and J.A. Thomas. Elements of information theory. Wiley, 1991.\n\n[3] F. Rieke, D. Warland, R. de Ruyter van Steveninck, and W. Bialek. Spikes: exploring the neural code (computational neuroscience). The MIT Press, 1999.\n\n[4] A. Borst and F. E. Theunissen. Information theory and neural coding. Nat Neurosci, 2(11):947–957, 1999.\n\n[5] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.\n\n[6] B. B. Averbeck, P. E. Latham, and A. Pouget. Neural correlations, population coding and computation. Nature Reviews Neuroscience, 7(5):358–366, 2006.\n\n[7] R. Quian Quiroga and S. Panzeri. Extracting information from neuronal populations: information theory and decoding approaches. Nat Rev Neurosci, 10(3):173–185, 2009.\n\n[8] G. Miller. Note on the bias of information estimates. In Information Theory in Psychology II-B, pages 95–100. Free Press, Glencoe, IL, 1955.\n\n[9] A. Treves and S. Panzeri. The upward bias in measures of information derived from limited data samples. Neural Computation, 7(2):399–407, 1995.\n\n[10] S. Panzeri, R. Senatore, M. A. Montemurro, and R. S. Petersen. Correcting for the sampling bias problem in spike train information measures. J Neurophysiol, 98(3):1064–1072, 2007.\n\n[11] R. A. A. Ince, A. Mazzoni, R. S. Petersen, and S. Panzeri. Open source tools for the information theoretic analysis of neural data. Front Neurosci, 4, 2010.\n\n[12] E. Ising. Beitrag zur Theorie des Ferromagnetismus. Z. 
Phys, 31:253, 1925.\n[13] E. Schneidman, M. J. 2nd Berry, R. Segev, and W. Bialek. Weak pairwise correlations imply strongly\n\ncorrelated network states in a neural population. Nature, 440(7087):1007\u201312, 2006.\n\n[14] J. Shlens, G. D. Field, J. L. Gauthier, M. I. Grivich, D. Petrusca, A. Sher, A. M. Litke, and E. J.\nChichilnisky. The structure of multi-neuron \ufb01ring patterns in primate retina. J Neurosci, 26(32):8254\u201366,\n2006.\n\n[15] I. E. Ohiorhenuan, F. Mechler, K. P. Purpura, A. M. Schmid, Q. Hu, and J. D. Victor. Sparse coding and\n\nhigh-order correlations in \ufb01ne-scale cortical networks. Nature, 466(7306):617\u2013621, 2010.\n\n[16] G. Tkacik, E. Schneidman, M. J. Berry, II, and W. Bialek. Spin glass models for a network of real\n\nneurons. arXiv:q-bio/0611072v2, 2009.\n\n[17] Y. Roudi, J. Tyrcha, and J. Hertz. Ising model for neural data: model quality and approximate methods\nfor extracting functional connectivity. Phys Rev E Stat Nonlin Soft Matter Phys, 79(5 Pt 1):051915, May\n2009.\n\n[18] Y. Roudi, E. Aurell, and J. A. Hertz. Statistical physics of pairwise probability models. Front Comput\n\nNeurosci, 3:22, 2009.\n\n[19] T. Mora, A. M. Walczak, W. Bialek, and C. G. Jr Callan. Maximum entropy models for antibody diversity.\n\nProc Natl Acad Sci U S A, 107(12):5405\u20135410, 2010.\n\n[20] A.W. Van der Vaart. Asymptotic statistics. Cambridge Univ Pr, 2000.\n[21] J.H. Macke, P. Berens, A.S. Ecker, A.S. Tolias, and M. Bethge. Generating spike trains with speci\ufb01ed\n\ncorrelation coef\ufb01cients. Neural Computation, 21(2):397\u2013423, 2009.\n\n[22] J.H. Macke, M. Opper, and M. Bethge. Common input explains higher-order correlations and entropy in\n\na simple model of neural population activity. Physical Review Letters, 106(20):208102, 2011.\n\n[23] I. Nemenman, W. Bialek, and R.D.R. Van Steveninck. Entropy and information in neural spike trains:\n\nProgress on the sampling problem. 
Physical Review E, 69(5):056111, 2004.\n\n[24] N.A. Ahmed and D. V. Gokhale. Entropy expressions and their estimators for multivariate distributions. Information Theory, IEEE Transactions on, 35(3):688–692, 1989.\n\n[25] O. Oyman, R. U. Nabar, H. Bolcskei, and A. J. Paulraj. Characterizing the statistical properties of mutual information in MIMO channels: insights into diversity-multiplexing tradeoff. In Signals, Systems and Computers, 2002. Conference Record of the Thirty-Sixth Asilomar Conference on, volume 1, pages 521–525. IEEE, 2002.\n\n[26] N. Misra, H. Singh, and E. Demchuk. Estimation of the entropy of a multivariate normal distribution. Journal of Multivariate Analysis, 92(2):324–342, 2005.\n\n[27] G. Marrelec and H. Benali. Large-sample asymptotic approximations for the sampling and posterior distributions of differential entropy for multivariate normal distributions. Entropy, 13(4):805–819, 2011.\n\n[28] S. Srivastava and M.R. Gupta. Bayesian estimation of the entropy of the multivariate Gaussian. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on, pages 1103–1107. IEEE, 2008.\n\n[29] N.R. Goodman. The distribution of the determinant of a complex Wishart distributed matrix. The Annals of Mathematical Statistics, 34(1):178–180, 1963.\n\n[30] M. Gupta and S. Srivastava. Parametric Bayesian estimation of differential entropy and relative entropy. Entropy, 12(4):818–843, 2010.\n", "award": [], "sourceid": 1145, "authors": [{"given_name": "Jakob", "family_name": "Macke", "institution": null}, {"given_name": "Iain", "family_name": "Murray", "institution": null}, {"given_name": "Peter", "family_name": "Latham", "institution": null}]}