{"title": "Estimating Entropy of Distributions in Constant Space", "book": "Advances in Neural Information Processing Systems", "page_first": 5162, "page_last": 5173, "abstract": "We consider the task of estimating the entropy of $k$-ary distributions from samples in the streaming model, where space is limited. Our main contribution is an algorithm that requires $O\\left(\\frac{k \\log (1/\\varepsilon)^2}{\\varepsilon^3}\\right)$ samples and a constant $O(1)$ memory words of space and outputs a $\\pm\\varepsilon$ estimate of $H(p)$. Without space limitations, the sample complexity has been established as $S(k,\\varepsilon)=\\Theta\\left(\\frac k{\\varepsilon\\log k}+\\frac{\\log^2 k}{\\varepsilon^2}\\right)$, which is sub-linear in the domain size $k$, and the current algorithms that achieve optimal sample complexity also require nearly-linear space in $k$. \n\nOur algorithm partitions $[0,1]$ into intervals and estimates the entropy contribution of probability values in each interval. The intervals are designed to trade bias and variance. \n\nDistribution property estimation and testing with limited memory is a largely unexplored research area. We hope our work will motivate research in this field.", "full_text": "Estimating Entropy of Distributions in Constant Space\n\nJayadev Acharya\nCornell University\n\nacharya@cornell.edu\n\nSourbh Bhadane\nCornell University\n\nsnb62@cornell.edu\n\nPiotr Indyk\n\nMassachusetts Institute of Technology\n\nindyk@mit.edu\n\nZiteng Sun\n\nCornell University\n\nzs335@cornell.edu\n\nAbstract\n\n\"3\n\nWe consider the task of estimating the entropy of k-ary distributions from samples\nin the streaming model, where space is limited. Our main contribution is an\n\n\u2318 samples and a constant O(1) memory\nalgorithm that requires O\u21e3 k log(1/\")2\nwords of space and outputs a \u00b1\" estimate of H(p). 
Without space limitations, the sample complexity has been established as S(k, ε) = Θ(k/(ε log k) + (log k)²/ε²), which is sub-linear in the domain size k, and the current algorithms that achieve the optimal sample complexity also require nearly-linear space in k.
Our algorithm partitions [0, 1] into intervals and estimates the entropy contribution of probability values in each interval. The intervals are designed to trade off the bias and variance of these estimates.

1 Introduction

Streaming Algorithms. Algorithms that operate with limited memory/space/storage¹ have garnered great interest over the last two decades, and are popularly known as streaming algorithms. Initially studied by [1, 2], this setting became mainstream with the seminal work of [3]. Streaming algorithms are particularly useful for handling massive datasets that cannot be stored in the memory of the system. They are also applicable in networks where data is generated sequentially at rates far exceeding the capacity to store it, e.g., on a router.
The literature on streaming algorithms is large, and many problems have been studied in this model. With roots in computer science, a large fraction of this literature considers the worst-case model, where it is assumed that the input X^n := X1, . . . , Xn is an arbitrary sequence over some domain of size k (e.g., [k] := {1, . . . , k}). The set-up is as follows:
Given a system with limited memory that can make a few (usually just one) passes over X^n, the objective is to estimate some function f(X^n) of the underlying dataset.
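As a concrete point of reference, the routine below (our illustration; the names are not from the paper) makes one pass over a stream and computes the entropy of its empirical distribution exactly, but keeps one counter per distinct symbol, i.e., Θ(k) words in the worst case. The streaming algorithms discussed below are precisely about avoiding this cost.

```python
import math
from collections import Counter

def empirical_entropy_one_pass(stream):
    """One pass over the stream, keeping an exact counter per distinct
    symbol. This baseline uses O(k) words of space, which is what
    small-space streaming sketches are designed to avoid."""
    counts = Counter()
    n = 0
    for x in stream:
        counts[x] += 1
        n += 1
    # Entropy of the empirical distribution of the stream (in nats).
    return sum(-(c / n) * math.log(c / n) for c in counts.values())
```

A fair-coin stream of 0s and 1s, for instance, yields an empirical entropy of log 2 nats.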
The primary objective is to solve the task with as little memory as possible; the memory required is called the space complexity.
Some of the research closest to our task is the estimation of frequency moments of the data stream [3, 4, 5], the Shannon and Rényi entropy of the empirical distribution of the data stream [6, 7, 8, 9, 10], heavy hitters [11, 12, 13, 14], and distinct elements [15, 16]. There has also been work on random-order streams, where one still considers a worst-case data stream X^n, but feeds a random permutation X(1), . . . , X(n) of X^n as input to the algorithm [10, 17, 18].

¹We use space, storage, and memory interchangeably.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Statistical Estimation. At the same time, there has been great progress in the classical fields of statistical learning and distribution property estimation. The typical set-up is as follows:
Given independent samples X^n from an unknown distribution p, the objective is to estimate a property f(p) using the fewest samples, called the sample complexity.
The distribution property estimation literature most related to our work includes entropy estimation [19, 20, 21, 22, 23, 24, 25], support size estimation [21, 23, 26], Rényi entropy estimation [27, 28, 29], support coverage estimation [30, 31], and divergence estimation [32, 33]. In these tasks, the optimal sample complexity is sub-linear in k, the domain size of the distribution.
Streaming Algorithms for Statistical Estimation. While the space complexity of streaming algorithms and the sample complexity of statistical estimation have both received great attention, the problem of statistical estimation under memory constraints has received relatively little attention.
Interestingly, almost half a century ago, Cover and Hellman [34, 35] studied this setting for hypothesis testing with finite memory, and [36] studied estimating the bias of a coin using a finite state machine. Until recently, however, there were few works on learning with memory constraints. There has been recent interest in space-sample trade-offs in statistical estimation [37, 38, 39, 40, 41, 42, 43, 44]. Among these, [40] is the closest to our paper. They consider estimating the integer moments of distributions, which is equivalent to estimating Rényi entropy of integer orders under memory constraints. They present natural algorithms for the problem and, perhaps more interestingly, prove non-trivial lower bounds on the space complexity of this task. More recently, [45] obtained memory-sample trade-offs for testing discrete distributions.
We initiate the study of distribution entropy estimation with space limitations, with the goal of understanding the space-sample trade-offs.

1.1 Problem Formulation
Let Δ_k be the class of all k-ary discrete distributions over the set X = [k] := {0, 1, . . . , k − 1}. The Shannon entropy of p ∈ Δ_k is H(p) := −Σ_{x∈[k]} p(x) log p(x). Entropy is a fundamental measure of randomness and a central quantity in information theory and communications. Entropy estimation is also a key primitive in various machine learning applications, e.g., for feature selection.
Given independent samples X^n := X1, . . . , Xn from an unknown p ∈ Δ_k, an entropy estimator is a possibly randomized mapping Ĥ : [k]^n → R. Given ε > 0, δ > 0, Ĥ is an (ε, δ)-estimator if

for all p ∈ Δ_k,  Pr_{X^n ∼ p^⊗n}(|Ĥ(X^n) − H(p)| > ε) < δ,   (1)

where p^⊗n denotes the joint distribution of n independent samples from p.
Sample Complexity. The sample complexity S(H, k, ε, δ) is the smallest n for which an estimator satisfying (1) exists.
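For a concrete baseline, the sketch below (our code; the distribution and sample size are illustrative) instantiates an estimator Ĥ : [k]^n → R in the sense of (1), namely the empirical plug-in estimator discussed in Section 1.2: with n well above the plug-in sample complexity, it lands within a small additive error of H(p).

```python
import math
import random
from collections import Counter

def H(p):
    # Shannon entropy of a distribution given as a list of probabilities (nats).
    return sum(-q * math.log(q) for q in p if q > 0)

def plug_in_estimator(samples):
    # Empirical plug-in estimator: entropy of the empirical distribution.
    n = len(samples)
    counts = Counter(samples)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())

random.seed(0)
p = [0.5, 0.25, 0.125, 0.125]
xs = random.choices(range(4), weights=p, k=20000)
# With n far above the plug-in sample complexity for this small k,
# the estimate lands within a small additive error of H(p).
assert abs(plug_in_estimator(xs) - H(p)) < 0.05
```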
Throughout this paper, we assume a constant error probability, say δ = 1/3,² and exclusively study entropy estimation. We therefore denote S(H, k, ε, 1/3) by S(k, ε).
Memory Model and Space Complexity. The basic unit of our storage model is a word, which consists of log k + log(1/ε) bits. This choice of storage model is motivated by the fact that at least log(1/ε) bits are needed for a precision of ±ε, and log k bits are needed to store a symbol in [k]. The space complexity of an algorithm is the smallest space (in words) required for its implementation.

1.2 Prior Work
Distribution Entropy Estimation. Entropy estimation from samples has a long history [19, 46, 47]. The most popular method is empirical plug-in estimation, which outputs the entropy of the empirical distribution of the samples. Its sample complexity [47, 20] is

Se(k, ε) = Θ(k/ε + (log k)²/ε²).   (2)

Paninski [48] showed that there exists an estimator with sample complexity sub-linear in k. A recent line of work [21, 23, 22] has characterized the optimal sample complexity as

S(k, ε) = Θ(k/(ε log k) + (log k)²/ε²).   (3)

²For smaller δ's, we can apply the median trick at the cost of an extra factor of log(1/δ) samples.

Note that the optimal sample complexity is sub-linear in k, while that of the empirical estimator is linear.
Estimating Entropy of Streams. There is significant work on estimating the entropy of a stream with limited memory. Here, no distributional assumptions are made on the input stream X^n, and the goal is to estimate H(X^n), the entropy of the empirical distribution of X^n. [6, 49, 10, 9, 8] consider multiplicative entropy estimation. These algorithms can be modified for additive entropy estimation by noting that a (1 ± ε/log n) multiplicative estimate is equivalent to a ±ε additive estimate. With this, [8, 10] give an algorithm requiring O(log³n/ε²) words of space for a ±ε estimate of H(X^n).
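The median trick of footnote 2 can be sketched as follows (our illustration; the repetition constant 24 is an arbitrary choice, not from the paper): given any routine that is within the target accuracy with probability at least 2/3, the median of O(log(1/δ)) independent runs is within the target accuracy with probability at least 1 − δ.

```python
import math
import random
import statistics

def median_trick(estimate_once, delta):
    """Boost a constant-confidence estimator to failure probability delta
    by taking the median of O(log(1/delta)) independent runs.
    The constant 24 is illustrative, not from the paper."""
    reps = max(1, math.ceil(24 * math.log(1 / delta)))
    return statistics.median(estimate_once() for _ in range(reps))

# Toy estimator: correct value 1.0 with probability 3/4, a wild outlier otherwise.
random.seed(0)
noisy = lambda: 1.0 if random.random() < 0.75 else 100.0
assert median_trick(noisy, 1e-6) == 1.0
```

The point of the median (rather than the mean) is that a minority of arbitrarily bad runs cannot move it.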
[9] proposes an algorithm using O(log²n · log log n/ε²) words of space. A space lower bound of Ω(1/ε²) was proved in [8] for the worst-case setting.
Another widely used notion of entropy is Rényi entropy [50]. The Rényi entropy of p of order α > 0 is Hα(p) := log(Σ_x p(x)^α)/(1 − α). [51, 52, 27] show that the sample complexity of estimating Hα(p) is Θ(k^{1−1/α}/ε²) for α ∈ N. [40] studies the problem of estimating the collision probability, which can be seen as estimating Hα(p) for α = 2, under memory constraints. They propose an algorithm whose sample complexity n and memory M satisfy n · M ≥ Ω(k), when n is at least O(k^{1−1/α}). They also provide some (non-tight) lower bounds on the memory requirements.

1.3 Our Results and Techniques
We consider the problem of estimating H(p) from samples X^n ∼ p, with as little space as possible. Our motivating question is: What is the space-sample trade-off of entropy estimation over Δ_k?
The optimal sample complexity is given in (3). However, straightforward implementations of the sample-optimal schemes in [21, 23, 22] require space nearly linear in S(k, ε), which is nearly linear (in k) words of space. At the same time, when the number of samples exceeds Se(k, ε) given in (2), the empirical entropy of X^n is within ±ε of H(p). We can therefore use results from the streaming literature to estimate the empirical entropy of a data stream of n = Se(k, ε) samples to within ±ε, and in doing so obtain a ±2ε estimate of H(p). In particular, the algorithm of [9] requires Se(k, ε) samples and, with O(log²(Se(k, ε)) log log(Se(k, ε))/ε²) words of space, estimates H(p) to ±ε. Note that Se(k, ε) is linear in k.
Our work requires a constant number of words of space while maintaining sample complexity linear in k.
Theorem 1.
There is an algorithm that requires O(k(log(1/ε))²/ε³) samples and 20 words of space, and estimates H(p) to ±ε.
The results and the state of the art are given in Table 1. A few remarks are in order.
Remark. (1) Our algorithm can bypass the Ω(1/ε²) lower bound for entropy estimation of data streams, since X^n is generated by a distribution and is not a worst-case data stream. (2) Consider the case when ε is a constant, say ε = 1. Then the optimal sample complexity is Θ(k/log k) (from (3)), and all known implementations of these algorithms require Θ̃(k) space. The streaming literature provides an algorithm with O(k) samples and Õ((log k)²) memory words. We provide an algorithm with O(k) samples and 20 memory words. Compared to the sample-optimal algorithms, we incur a log k blow-up in the sample complexity, but gain an exponential reduction in space.

Table 1: Sample and space complexity for estimating H(p).

Algorithm | Samples | Space (in words)
Sample-Optimal [21], [23, 22] | Θ(k/(ε log k) + (log k)²/ε²) | O(k/(ε log k) + (log k)²/ε²)
Streaming [8, 9] | O(k/ε + (log k)²/ε²) | O(log²(k/ε) log log(k/ε)/ε²)
Algorithm 6 | O(k(log(1/ε))²/ε³) | 20

We now describe the high-level approach and techniques. We can write H(p) as

H(p) = −Σ_x p(x) log p(x) = E_{X∼p}[− log p(X)].   (4)

A Simple Method. We build layers of sophistication on a simple approach: In each iteration,

1. Obtain a sample X ∼ p.
2. Using constant memory, over the next N samples, estimate log(1/p(X)).

From (4), for large enough N, we can obtain a good estimate p̂(X) of p(X), and log(1/p̂(X)) will be an almost unbiased estimate of the entropy.
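The two steps above can be transcribed almost directly. The sketch below is our illustration, using the constants of Algorithm 1 from Section 2; note that it stores only the current symbol, one counter, and a running sum, never a histogram.

```python
import math
import random

def simple_entropy_estimate(sample, k, eps):
    """Sketch of the simple method: draw X ~ p, estimate p(X) from the
    next N samples with add-1 smoothing, and average log(1/p_hat(X))
    over R iterations. `sample()` draws one symbol from p."""
    N = int(2 * k / eps)
    R = int(4 * math.log(1 + 2 * k / eps) ** 2 / eps ** 2)
    S = 0.0
    for _ in range(R):
        x = sample()                                    # one word: current symbol
        Nx = sum(1 for _ in range(N) if sample() == x)  # one word: a counter
        S += math.log(N / (Nx + 1))                     # add-1 smoothing
    return S / R

# Uniform distribution over 4 symbols: H(p) = log 4 ≈ 1.386 nats.
random.seed(0)
draw = lambda: random.randrange(4)
est = simple_entropy_estimate(draw, k=4, eps=0.5)
assert abs(est - math.log(4)) < 0.5
```

At these toy parameters the bias k/N is visible; driving eps down shrinks it, at the ε³ sample cost analyzed below.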
We can then maintain a running average of log(1/p̂(X)) over R iterations, where R is large enough for the empirical mean of log(1/p̂(X)) to concentrate. The total sample requirement is NR. This approach is described in Algorithm 1 (in Section 2). Theorem 4 states that it requires O(1) memory words, but its sample complexity is super-linear in k.
Intervals for Better Sample Complexity. To improve the sample complexity, we partition [0, 1] into T disjoint intervals (Algorithm 1 corresponds to T = 1). In Lemma 7 we express H(p) as a sum of entropy-like expressions defined over the probability values in these T intervals. We then estimate each of the terms separately with the approach stated above. We will show that the sample complexity as a function of k drops roughly to k(log^{(T)} k)², where log^{(T)} is the T-th iterated logarithm, while the space complexity remains constant. In the simple algorithm described above, we need

1. N, the number of samples in each iteration, to be large enough for good estimates of p(X).
2. R, the number of iterations, to be large enough for concentration.

Note that when p(X) is large, fewer samples are needed to estimate p(X) (small N), and for small p(X) more samples are needed. However, if the intervals are chosen such that small probabilities are also contained in short intervals, the number of iterations R needed for these intervals can be made small (the range of the random variables in Hoeffding's inequality is smaller). Succinctly:
Fewer samples are needed to estimate the large probabilities, and fewer iterations are needed for convergence of the estimates of small probabilities, by choosing the intervals carefully.
Some Useful Tools. We now state two concentration inequalities that we use throughout this paper.
Lemma 2. (Hoeffding's Inequality) [53] Let X1, . . . , Xm ∈ [ai, bi] be independent random variables. Let X̄ = (X1 + · · ·
+ Xm)/m. Then Pr(|X̄ − E[X̄]| ≥ t) ≤ 2 exp(−2(mt)²/Σ_i(bi − ai)²).

In some algorithms we consider, m itself is a random variable. In those cases, we use the following variant of Hoeffding's inequality, which is proved in Section B.
Lemma 3. (Random Hoeffding's Inequality) Let M ∼ Bin(m, p). Let X1, . . . , Xm be independent random variables such that Xi ∈ [a, b]. Let X̄ = (Σ_{i=1}^{M} Xi)/M. Then, for any 0 < p ≤ 1,

Pr(|X̄ − E[X̄]| ≥ t/p) ≤ 3 exp(−mt²/(8p(b − a)²)).   (5)

Outline. In Section 2 we describe the simple approach and its performance in Theorem 4. In Section 3.1 (Algorithm 5) we show how the sample complexity can be reduced from k log²k in Theorem 4 to k(log log k)² in Theorem 8 by choosing two intervals (T = 2). The intervals are chosen such that the number of iterations R for the small interval is poly(log log k) in Algorithm 5, compared to poly(log k) in Algorithm 1. The algorithm for general T is described in Section 3.2, and the performance of our main algorithm is given in Theorem 1.

2 A Building Block: A Simple Algorithm with Constant Space

We propose a simple method (Algorithm 1) with the following guarantee.

Theorem 4. Let ε > 0. Algorithm 1 takes O(k log²(k/ε)/ε³) samples from p ∈ Δ_k, uses at most 20 words of memory, and outputs H̄ such that, with probability at least 2/3, |H̄ − H(p)| < ε.

Based on (4), each iteration of Algorithm 1 obtains a sample X from p and estimates log(1/p(X)). To avoid assigning a zero probability value to p(X), we apply add-1 smoothing to our empirical estimate of p(X). The bias of our estimator can be controlled by the choice of N.
Performance Guarantee. Algorithm 1 only maintains a running sum at the end of each iteration. We reserve two words for N, R, and S.
We reserve one word to store x and two words to keep track of Nx in each iteration. We reserve three words for the counters. Thus the algorithm uses fewer than 20 words of space.

Algorithm 1 Entropy estimation with constant space: Simple Algorithm
Require: Accuracy parameter ε > 0, a data stream X1, X2, . . . ∼ p
1: Set N ← 2k/ε, R ← 4 log²(1 + 2k/ε)/ε², S ← 0
2: for t = 1, . . . , R do
3:   Let x ← the next element in the data stream
4:   Nx ← # appearances of x in the next N symbols
5:   Ĥt = log(N/(Nx + 1))
6:   S = S + Ĥt
7: H̄ = S/R

To bound the accuracy, note that H̄ is the mean of R i.i.d. random variables Ĥ1, . . . , ĤR. We bound the bias and prove concentration of H̄ using Lemma 2.
Bias Bound. A larger value of N provides a better estimate of p(X), and therefore a smaller bias in estimation. This is captured in the next lemma, which is proved in Section C.
Lemma 5. (Bias Bound) |E[H̄] − H(p)| ≤ k/N.
Concentration. Since Ĥt ∈ [log(N/(N + 1)), log N] for all t, we show in the next lemma that, with large enough R, H̄ concentrates. This is proved in Section C.
Lemma 6. (Concentration) For any µ > 0, Pr(|H̄ − E[H̄]| ≥ µ) ≤ 2 exp(−2Rµ²/log²(N + 1)).
The choice of N implies that |E[H̄] − H(p)| ≤ ε/2, and choosing µ = ε/2 and R = 4 log²(1 + 2k/ε)/ε² implies that H̄ is within H(p) ± ε with probability at least 2/3. The total sample complexity of Algorithm 1 is (N + 1)R = O(k log²(k/ε)/ε³).

3 Interval-based Algorithms

In the previous section, the simple algorithm treats each symbol equally and uses the same N and R. To reduce the sample complexity, we express H(p) as an expectation of various conditional expectations depending on the symbol probability values.
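For intuition, this decomposition can be checked numerically with a deterministic interval assignment (a special case of the randomized assignment introduced below); the sketch and its names are ours.

```python
import math

def entropy_terms(p, intervals):
    """Check of the decomposition H(p) = sum_j pA(Ij) * Hj for a
    deterministic rule mapping x to the interval containing p(x).
    `intervals` are [lo, hi) pairs covering (0, 1]."""
    total = 0.0
    for lo, hi in intervals:
        mass = [q for q in p
                if (lo <= q < hi) or (hi == 1.0 and q == 1.0)]
        pA = sum(mass)                  # pA(Ij): probability of landing in Ij
        if pA == 0:
            continue
        # Hj = E[-log p(X)] under the conditional distribution pA(x | Ij).
        Hj = sum((q / pA) * -math.log(q) for q in mass)
        total += pA * Hj
    return total

p = [0.5, 0.2, 0.2, 0.05, 0.05]
H = sum(-q * math.log(q) for q in p)
assert abs(entropy_terms(p, [(0.0, 0.1), (0.1, 1.0)]) - H) < 1e-12
```

Each interval contributes its own weight pA(Ij) and conditional entropy-like term Hj, and the weighted sum recovers H(p) exactly.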
For larger probability values we use a smaller N, and for smaller probabilities we use a smaller R. We then estimate the terms separately to obtain the final estimate.
Entropy as a Weighted Sum of Conditional Expectations. Let T ∈ N (decided later), and 0 = a0 < a1 < . . . < aT = 1. Let I := {I1, I2, . . . , IT}, where Ij = [a_{T−j}, a_{T−j+1}), be a partition of [0, 1] into T intervals.
Consider a randomized algorithm A : [k] → {I1, . . . , IT} that takes as input x ∈ [k] and outputs an interval in I. Let pA(Ij|x) = Pr(A(x) = Ij). For a distribution p ∈ Δ_k, let

pA(Ij) := Σ_{x∈[k]} p(x) · pA(Ij|x),   pA(x|Ij) := p(x) · pA(Ij|x)/pA(Ij).   (6)

Then pA(Ij) is the probability that A(X) = Ij when X ∼ p, and pA(x|Ij) is the conditional distribution over [k] given A(X) = Ij. We then have the following lemma:
Lemma 7. Let Hj := E_{X∼pA(x|Ij)}[− log p(X)]. Then H(p) = Σ_{j=1}^{T} pA(Ij) Hj.

Proof.

H(p) = Σ_x p(x) (Σ_j pA(Ij|x)) log(1/p(x)) = Σ_x Σ_j (pA(Ij) pA(x|Ij) log(1/p(x)))   (7)
     = Σ_j pA(Ij) E_{X∼pA(x|Ij)}[− log p(X)],

where (7) follows from (6).

We will choose the intervals and the algorithm A appropriately. By estimating each term in the summation above, we will design an algorithm with T intervals that uses O(k(log^{(T)} k + log(1/ε))²/ε³) samples and a constant number of words of space, and estimates H(p) to ±ε.
In Section 3.1, we provide the details for T = 2. That section fleshes out the key arguments; finally, in Section 3.2, we extend this to T = log* k intervals, where log* k = min_i{log^{(i)} k ≤ 1}, to further reduce the sample complexity to O(k(log(1/ε))²/ε³).

3.1 Two Intervals Algorithm

We propose Algorithm 5 with T = 2 and the following guarantee.

Theorem 8.
Algorithm 5 uses O(NR + N1R1 + N2R2) = O(k(log((log k)/ε))²/ε³) samples and 20 words of space, and outputs a ±ε estimate of H(p) with probability at least 2/3.

3.1.1 Description of the Algorithm
Let T = 2, and let β > 16 be a constant. Consider the following partition of [0, 1]:

I2 = [0, ℓ),  I1 = [ℓ, 1],  where  ℓ = (log k)^β/k.   (8)

We now specify the algorithm A : [k] → {I1, I2} to be used in Lemma 7. A is denoted ESTINT (Algorithm 2). For x ∈ [k], it takes N samples from p and outputs the interval in which the empirical fraction of occurrences of x lies. ESTINT tries to predict the interval in which p(x) lies.

Algorithm 2 A : ESTINT(N, x)
1: Obtain N samples from p
2: if x appears at least Nℓ times, output I1
3: else output I2

Algorithm 3 ESTPROBINT(N, R)
1: p̂A(I1) = 0
2: for t = 1 to R do
3:   Sample x ∼ p
4:   if ESTINT(N, x) = I1 then
5:     p̂A(I1) = p̂A(I1) + 1/R

By Lemma 7, H(p) = pA(I1)H1 + pA(I2)H2. We estimate the terms in this expression as follows.
Estimating the pA(Ij)'s. We run ESTINT multiple times on samples generated from p, and output the fraction of times the output is Ij as an estimate of pA(Ij). We only estimate pA(I1), since pA(I1) + pA(I2) = 1. The complete procedure is specified in Algorithm 3.
Estimating the Hj's. Recall that the Hj's are expectations of −log p(x) under the different conditional distributions given in (6). Since the expectations are with respect to the conditional distributions, we first sample a symbol from p and then, conditioned on the event that ESTINT outputs Ij, we use an algorithm similar to Algorithm 1 to estimate log(1/p(x)). The full algorithm is given in Algorithm 4. Notice that when computing Ĥ2 in Step 8, we clip Ĥ2 to log(1/(4ℓ)) if Nx,2 > 4ℓN2 − 1. This is done to restrict each Ĥ2 to the range [log(1/(4ℓ)), log N2], which helps when proving concentration.

3.1.2 Performance Guarantees
Memory Requirements.
We reserve 5 words to store the parameters R1, R2, N1, N2, and ℓ. ESTINT uses one word to keep track of the number of occurrences of x. For ESTPROBINT, we use one word to store x and one word to keep track of the final sum p̂A(I1). We execute CONDEXP for each interval separately and use one word each to store x and to keep track of Si and Ĥi. We use two words to store the outputs H̄1 and H̄2, and store the final output ĤII in one of them. Hence, at most 20 words of memory suffice.

Algorithm 4 Estimating H1 and H2: CONDEXP(N1, N2, R1, R2)
1: for i = 1, 2, set Ĥi = 0, Si = 0, do
2:   for t = 1 to Ri do
3:     Sample x ∼ p
4:     if ESTINT(N, x) = Ii then
5:       Si = Si + 1
6:       Let Nx,i ← # occurrences of x in the next Ni samples
7:       Ĥi = Ĥi + log(Ni/(Nx,i + 1)) if i = 1
8:       Ĥi = Ĥi + max{log(Ni/(Nx,i + 1)), log(1/(4ℓ))} if i = 2
9:   H̄i = Ĥi/Si

Algorithm 5 Entropy Estimation with constant space: Two Intervals Algorithm
Require: Accuracy parameter ε > 0, a data stream X1, X2, . . . ∼ p
1: Set N = N1 = C1k/(ε log k), N2 = C1 · k/ε, R = R1 = C2 · (log(k/ε))²/ε², R2 = C2 · (log((log k)/ε))²/ε²
2: p̂A(I1) = ESTPROBINT(N, R)
3: H̄1, H̄2 = CONDEXP(N1, N2, R1, R2)
4: ĤII = p̂A(I1) H̄1 + (1 − p̂A(I1)) H̄2

Sample Guarantees. Let Ĥ*II be the unclipped version of the estimator, in which we do not clip in Step 8 of Algorithm 4 (all other steps remain the same). We can bound the estimation error by the following three terms, each of which we bound separately:

|H(p) − ĤII| ≤ |H(p) − E[Ĥ*II]| (unclipped bias) + |E[ĤII] − E[Ĥ*II]| (clipping error) + |ĤII − E[ĤII]| (concentration).

Clipping Error.
By the design of CONDEXP, Ĥ2 is clipped only when the event Ex = {ESTINT(N, x) = I2, Nx,2 > 4N2ℓ − 1} occurs for some x ∈ X. We bound the clipping error in the following lemma (proof in Section D.3) by showing that Pr(Ex) is small.
Lemma 9. (Clipping Error Bound) Let ĤII be the entropy estimate of Algorithm 5 and let Ĥ*II be the entropy estimate of the unclipped version of Algorithm 5. Then |E[ĤII] − E[Ĥ*II]| ≤ ε/3.
Concentration Bound. To prove the concentration bound, we use Lemma 10 to decompose it into three terms, each of which can be viewed as the difference between some empirical mean and its true expectation, which can be bounded using concentration inequalities. (Proof in Section D.4.)
Lemma 10. (Concentration Bound) Let ĤII be the entropy estimate of Algorithm 5 and let H̄i be as defined in Algorithm 5. Let pA(Ii) be the distribution defined in (6), where A is ESTINT. Then

|E[ĤII] − ĤII| ≤ Σ_{i=1}^{2} pA(Ii)|H̄i − E[H̄i]| + |pA(I1) − p̂A(I1)| |H̄1 − H̄2| ≤ ε/3.

We provide a brief outline of the proof below. By the union bound, in order to show that with probability at least 2/3 the sum is less than ε/3, it suffices to show that each of the terms exceeds ε/9 with probability at most 1/9.
To bound |pA(I1) − p̂A(I1)| |H̄1 − H̄2|, we first bound the range of |H̄1 − H̄2| and then use Hoeffding's inequality (Lemma 2) to obtain concentration of p̂A(I1). To bound |H̄i − E[H̄i]|, note that we cannot obtain concentration using Hoeffding's inequality, because Ri (the number of samples that we average over) is a random variable. Therefore we apply the Random Hoeffding Inequality (Lemma 3) to H̄i. Since Ri depends on the range of the random variables being averaged over, we obtain a reduction in the sample complexity for i = 2 because the estimate is clipped below
To bound \u00afHi E\u21e5 \u00afHi\u21e4, note\n\n7\n\n\fto log 1\n\n4`. Therefore the range for the second interval is log(N2) log 1\n\nimplying R2 = O(log ((log k)/\"))2/\"2 suf\ufb01ces for the desired probability. For i = 1, since the\n\nrange is the same as the one interval case, we use the same R1 as in the one interval case. Note\nR2 < R1.\nBias Bound. We bound the bias of the unclipped version, \u02c6H\u21e4II using the following lemma:\nLemma 11. (Unclipped Bias Bound) Let \u02c6H\u21e4II be the unclipped estimate of Algorithm 5 and let\npA (Ii|x) be the conditional distribution de\ufb01ned in (6) where A is ESTPROBINT. Then,\n\n4` = O (log ((log k) /\"))\n\nH (p) Eh \u02c6H\u21e4IIi \uf8ff\n\n2Xi=1 Xx2X\n\npA (Ii|x)/Ni! \uf8ff \"/3.\n\n(9)\n\n. For interval I1, we improve upon k\nN1\n\n(Proof in Section D.2) Lemma 11 allows us to choose N1 and N2 separately to bound the bias.\nInterval I2\u2019s contribution is at most k\nby partitioning X\nN2\ninto sets X1 = {x 2X| p(x) <`/ 2} and X2 = {x 2X| p(x) `/2}. For X1, pA (I1|x) is small\nby Chernoff bound. For X2, since p(x) `/2, |X2|\uf8ff 2/` which is smaller than k. Hence we can\nchoose N2 < N1.\nIn the sample complexity of the two interval algorithm, observe that the term N2R2 dominates.\nReducing N2 is hard because it is independent of the interval length. Therefore we hope to reduce\nR2 by partitioning into intervals with smaller lengths. In the smallest interval, if we reduce the range\nof each estimate to be within a constant, then O( 1\n\"2 ) samples would suf\ufb01ce for concentration. In the\nnext section, we make this concrete by considering an algorithm that uses multiple intervals.\n\n3.2 General Intervals Algorithm\nThe general algorithm follows the same principles as the previous section with a larger number of\nintervals, decreasing the sample requirements at each step, as discussed in Section 1.3. 
However, the proofs are much more involved, particularly in order to obtain an O(k) upper bound on the sample complexity. We sketch some of the key points here and defer the bulk of the algorithm and details to the appendix due to lack of space.
Intervals. Let T = log* k, where log* k := min_i{log^{(i)} k ≤ 1}. Consider the following partition of [0, 1]: {Ii}_{i=1}^{T}, where I1 = [l1, h1] and, for i = 2, . . . , T, Ii = [li, hi), with hi = (log^{(i−1)} k)^β/k (β > 16) and l_{i−1} = hi. Define lT = 0 and h1 = 1. Then, for i = 2, . . . , T − 1:

I1 = [(log^{(1)} k)^β/k, 1],  Ii = [(log^{(i)} k)^β/k, (log^{(i−1)} k)^β/k),  IT = [0, (log^{(T−1)} k)^β/k).

We divide I2, the bottleneck of the two-intervals algorithm, into further intervals until the width of the smallest interval is a constant over k (e^β/k), which implies concentration with fewer samples than before. Using Lemma 7, as in the two-intervals case, we estimate each of the pA(Ii) and Hi independently, in Algorithm 8 (GENESTPROBINT) and Algorithm 9 (GENCONDEXP), presented in Appendix E.1. The complete algorithm for T = log* k is presented in Algorithm 6.
Memory Requirements. The analysis of the memory requirement is similar to that of the two-interval case. To store the parameters ℓi, Ni, Ri, we only store k, ε, β, CN, and CR, and compute the parameters on the fly. Notice that for each interval, the executions of GENESTINT, GENESTPROBINT, and GENCONDEXP require the same memory as their two-interval counterparts. The trick here is that we do not need to store the p̂A(Ii)'s and H̄i's, since we can run GENESTPROBINT and GENCONDEXP for one interval at a time and maintain a running sum of the p̂A(Ii)H̄i's. Therefore, Algorithm 6 uses at most 20 words of space.
Sample Complexity. Algorithm 6 proves the main claim of our paper, Theorem 1.
The key idea for removing the extra log log factor of Theorem 8 is to progressively make the error requirements stricter for the larger-probability intervals. We denote by Ĥ*I the final estimate without the clipping step (Step 8; all other steps remain the same). Then the error can be bounded by the following three terms:

|H(p) − ĤI| ≤ |H(p) − E[Ĥ*I]| (unclipped bias) + |E[ĤI] − E[Ĥ*I]| (clipping error) + |ĤI − E[ĤI]| (concentration).   (10)

Algorithm 6 Entropy Estimation with constant space: General Intervals Algorithm
Require: Accuracy parameter ε > 0, a data stream X1, X2, . . . ∼ p.
1: Set, for 1 ≤ i ≤ T − 1,
   Ni = CN · k/(ε log^{(i)} k),  Ri = CR · (log(log^{(i−1)}(k)/ε))²/ε²,
   and NT = CN · k/ε,  RT = CR · (log(log^{(T−1)}(k)/ε))²/ε²
2: {p̂A(Ii)}_{i=1}^{T−1} = GENESTPROBINT({Ni}_{i=1}^{T−1}, {Ri}_{i=1}^{T−1})
3: {H̄i}_{i=1}^{T} = GENCONDEXP({Ni}_{i=1}^{T}, {Ri}_{i=1}^{T})
4: ĤI = Σ_{i=1}^{T−1} p̂A(Ii)H̄i + (1 − Σ_{i=1}^{T−1} p̂A(Ii))H̄T

With the parameters defined in Algorithm 6, we can bound the unclipped bias and the clipping error in (10) by ε/3 each, and show that the concentration term is also bounded by ε/3 with probability at least 2/3. The details are given in Lemmas 13, 14, and 15 in Appendix E.

4 Open Problems

There are several interesting questions that arise from our work. While our algorithms require only a constant number of memory words, they require a multiplicative factor of log k more samples (as a function of k) than the optimal sample complexity in (3).

• Does there exist an algorithm for entropy estimation with the optimal sample complexity and a space requirement of at most poly(log k)?

We are unaware of any implementation that requires space sub-linear in k.
Designing a strictly sublinear-space (space requirement k^α for some α < 1) sample-optimal algorithm could be a first step toward solving the question above. At the same time, such an algorithm might not exist. This leads to the following complementary question.

• Prove a lower bound on the space requirement of a sample-optimal algorithm for entropy estimation.

Beyond these, obtaining sample-space trade-offs for distribution testing and property estimation tasks is an exciting future direction.

Acknowledgements. This work is supported by NSF-CCF-1657471. This research started with the support of an MIT-Shell Energy Research Fellowship to JA and PI, while JA was at MIT.

References

[1] J. Ian Munro and Mike S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.

[2] Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209, 1985.

[3] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on the Theory of Computing, 1996.

[4] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science, page 189. IEEE, 2000.

[5] Piotr Indyk and David Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the 37th Annual ACM Symposium on the Theory of Computing, STOC '05, 2005.

[6] Ashwin Lall, Vyas Sekar, Mitsunori Ogihara, Jun Xu, and Hui Zhang. Data streaming algorithms for estimating entropy of network traffic. ACM SIGMETRICS Performance Evaluation Review, 34(1):145–156, June 2006.

[7] Amit Chakrabarti, Khanh Do Ba, and S. Muthukrishnan. Estimating entropy and entropy norm on data streams.
Internet Mathematics, 3(1):63–78, 2006.

[8] Amit Chakrabarti, Graham Cormode, and Andrew McGregor. A near-optimal algorithm for estimating the entropy of a stream. ACM Transactions on Algorithms, 6(3):51:1–51:21, July 2010.

[9] Nicholas J. A. Harvey, Jelani Nelson, and Krzysztof Onak. Sketching and streaming entropy via approximation theory. In Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS '08, pages 489–498, Philadelphia, PA, USA, 2008. IEEE Computer Society.

[10] Sudipto Guha, Andrew McGregor, and Suresh Venkatasubramanian. Sublinear estimation of entropy and information distances. ACM Transactions on Algorithms, 5(4), 2009.

[11] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pages 693–703. Springer, 2002.

[12] Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.

[13] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory, pages 398–412. Springer, 2005.

[14] Arnab Bhattacharyya, Palash Dey, and David P. Woodruff. An optimal algorithm for l1-heavy hitters in insertion streams and related problems. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 385–400. ACM, 2016.

[15] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In International Workshop on Randomization and Approximation Techniques in Computer Science, pages 1–10. Springer, 2002.

[16] Daniel M. Kane, Jelani Nelson, and David P. Woodruff.
An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 41–52. ACM, 2010.

[17] Sudipto Guha and Andrew McGregor. Stream order and order statistics: Quantile estimation in random-order streams. SIAM Journal on Computing, 38(5):2044–2059, 2009.

[18] Amit Chakrabarti, T. S. Jayram, and Mihai Pătrașcu. Tight lower bounds for selection in randomly ordered streams. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 720–729. Society for Industrial and Applied Mathematics, 2008.

[19] George A. Miller. Note on the bias of information estimates. Information Theory in Psychology: Problems and Methods, 2:95–100, 1955.

[20] Liam Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.

[21] Gregory Valiant and Paul Valiant. Estimating the unseen: An n/log n-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd Annual ACM Symposium on the Theory of Computing, STOC '11, pages 685–694, New York, NY, USA, 2011. ACM.

[22] Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, May 2015.

[23] Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.

[24] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In Proceedings of the 34th International Conference on Machine Learning, ICML '17, pages 11–21.
JMLR, Inc., 2017.

[25] Yi Hao, Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Data amplification: A unified and competitive approach to property estimation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8834–8843. Curran Associates, Inc., 2018.

[26] Yi Hao and Alon Orlitsky. The broad optimality of profile maximum likelihood. arXiv preprint arXiv:1906.03794, 2019.

[27] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. Estimating Rényi entropy of discrete distributions. IEEE Transactions on Information Theory, 63(1):38–56, January 2017.

[28] Maciej Obremski and Maciej Skorski. Rényi entropy estimation revisited. In Proceedings of the 20th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, APPROX '17, pages 20:1–20:15, Dagstuhl, Germany, 2017. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

[29] Kazuto Fukuchi and Jun Sakuma. Minimax optimal estimators for additive scalar functionals of discrete distributions. In Proceedings of the 2017 IEEE International Symposium on Information Theory, ISIT '17, pages 2103–2107, Washington, DC, USA, 2017. IEEE Computer Society.

[30] Bradley Efron and Ronald Thisted. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3):435–447, 1976.

[31] Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283–13288, 2016.

[32] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax rate-optimal estimation of KL divergence between discrete distributions. In 2016 International Symposium on Information Theory and Its Applications, ISITA '16, pages 256–260. IEEE, 2016.

[33] Jayadev Acharya.
Profile maximum likelihood is optimal for estimating KL divergence. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1400–1404. IEEE, 2018.

[34] Martin E. Hellman and Thomas M. Cover. Learning with finite memory. The Annals of Mathematical Statistics, pages 765–782, 1970.

[35] Thomas M. Cover. Hypothesis testing with finite statistics. The Annals of Mathematical Statistics, pages 828–835, 1969.

[36] Thomas Leighton and Ronald Rivest. Estimating a probability using finite memory. IEEE Transactions on Information Theory, 32(6):733–742, 1986.

[37] Sudipto Guha and Andrew McGregor. Space-efficient sampling. In Artificial Intelligence and Statistics, pages 171–178, 2007.

[38] Steve Chien, Katrina Ligett, and Andrew McGregor. Space-efficient estimation of robust statistics and distribution testing. In Innovations in Computer Science, ICS '10. Tsinghua University Press, 2010.

[39] Yuval Dagan and Ohad Shamir. Detecting correlations with little memory and communication. arXiv preprint arXiv:1803.01420, 2018.

[40] Michael Crouch, Andrew McGregor, Gregory Valiant, and David P. Woodruff. Stochastic streams: Sample complexity vs. space complexity. In LIPIcs-Leibniz International Proceedings in Informatics, volume 57. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016.

[41] Jacob Steinhardt, Gregory Valiant, and Stefan Wager. Memory, communication, and statistical queries. In Proceedings of the 29th Annual Conference on Learning Theory, pages 1490–1516, 2016.

[42] Ran Raz. Fast learning requires good memory: A time-space lower bound for parity learning. In Proceedings of the IEEE 57th Annual Symposium on Foundations of Computer Science, 2016.

[43] Dana Moshkovitz and Michal Moshkovitz. Mixing implies lower bounds for space bounded learning. In Proceedings of the 30th Annual Conference on Learning Theory, pages 1516–1566, 2017.

[44] Ayush Jain and Himanshu Tyagi.
Effective memory shrinkage in estimation. In Proceedings of the 2018 IEEE International Symposium on Information Theory, 2018.

[45] Ilias Diakonikolas, Themis Gouleakis, Daniel M. Kane, and Sankeerth Rao. Communication and memory efficient testing of discrete distributions. In Proceedings of the 32nd Annual Conference on Learning Theory, COLT '19, 2019.

[46] Georgij P. Basharin. On a statistical estimate for the entropy of a sequence of independent random variables. Theory of Probability and Its Applications, 4(3):333–336, 1959.

[47] András Antos and Ioannis Kontoyiannis. Convergence properties of functional estimates for discrete distributions. Random Structures & Algorithms, 19(3-4):163–193, October 2001.

[48] Liam Paninski. Estimating entropy on m bins given fewer than m samples. IEEE Transactions on Information Theory, 50(9):2200–2203, 2004.

[49] Lakshminath Bhuvanagiri and Sumit Ganguly. Estimating entropy over data streams. In Proceedings of the 14th Annual European Symposium on Algorithms, volume 4168, page 148. Springer, 2006.

[50] Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 547–561, 1961.

[51] Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. Electronic Colloquium on Computational Complexity (ECCC), 7(20), 2000.

[52] Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Sampling algorithms: Lower bounds and applications. In Proceedings of the 33rd Annual ACM Symposium on Theory of Computing, pages 266–275. ACM, 2001.

[53] Wassily Hoeffding. Probability inequalities for sums of bounded random variables.
Journal of the American Statistical Association, 58(301):13–30, 1963.