NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2814 Estimating Entropy of Distributions in Constant Space

### Reviewer 1

While a huge amount of work exists on sample-complexity of entropy estimation, the space complexity was overlooked. However, it is clear that from the practical point of view, it is important to study the trade-off between sample and space complexities of statistical estimation tasks. So in that sense I consider the main idea of the paper original. Most part of the paper are clear, but I do think that the authors should read their paper carefully and fix some typos/grammar mistakes. The proofs are easy to follow. For me, there is one main disadvantage for this paper: Overall, the trade-off is not clear. The authors just proposed a simple algorithm which has sub-optimal sample complexity and "constant/optimal" memory complexity. However, to really study the complexity a lower bound is needed. It would be much more interesting to come up with a statement that roughly says that: "if one insist on a memory complexity of at most X then sample complexity must be at least Y". Update: reading the detailed author response I now realize that obtaining a lower bound is quite challenging. Thus, I decided to increase my score accordingly.

### Reviewer 2

The algorithm is also simple: the basic idea is to choose an element randomly and estimate its frequency by further sampling enough elements. Repeat this process to estimate E[-log X], which is the entropy. A more sophisticated algorithm is built by dividing [0,1] into several subintervals, and estimate the conditional entropy within each subinterval and combine them in the end. The savings come from the intuition that a large frequency can be estimated with fewer samples. Overall, the algorithm is neat and simple, easy to implement. Minor comments: Line 60: “denoted by S(H,k,eps,delta),” -- insert “by” and add a comma at the end of the phrase Line 71: Do not use citations as subjects. Say instead “Paninski showed … ”. Such bad style occurs a lot in this section. Line 76: “denoted by H(X^n)” -- insert “by”. Line 107: delete “bit” in “eps = 1 bit” Line 115: At -> In Algorithm 1, line 5: Do you mean S += …? Please edit the typesetting Algorithm 3, line 4: Does the symbol of + above = mean +=? Please use the same notation throughout the paper Algorithm 4, line 4: Is it ESTINT(N_i, x)?

### Reviewer 3

Originality: The space-sample tradeoff in memory-constrained statistical inference, especially the lower bounds, is typically very challenging and underexplored until quite recently, e.g., the seminal paper [Raz'16]. The Shannon entropy estimation problem is also a challenging example of high-dimensional functional estimation, which was worked out recently in [Jiao et al.'15, Wu--Yang'16]. This manuscript combines the entropy estimation problem with memory constraints, which I believe to be quite novel. The techniques in the simple algorithm are similar in spirit to [Alon et al.'96], while the authors found a suitable version for entropy estimation. Besides, the authors wrote an excellent literature review in the introduction. Quality: The authors exclusively focused on entropy estimation with constant memory, and worked out an upper bound for the sample complexity. Specifically, a simple algorithm would achieve the sub-optimal bound with an additional log factor, and a careful interval algorithm would fix the log factor. I checked the high-level ideas and most of the proofs, which are all correct. However, I still have some specific comments as follows: 1. It is unfortunate that the authors did not work out a lower bound. Without a matching lower bound, one may doubt whether the considerable amount of effort for the logarithmic improvement in this manuscript is significant or not. I understand that the lower bound may take huge effort and deserves another independent paper, while I feel that the authors should highlight its difficulty. 2. It is also to some extent unsatisfactory that the authors only considered constant memory. From Table I one may be tempted to think that as long as sample complexity * space complexity >= k/eps^3, the entropy estimation problem becomes feasible. Can the proposed idea be generalized to incorporate the case where the space can be larger? 3. The authors first proposed a simple algorithm (Algorithm 1) which has a logarithmic gap from the desired sample complexity. I have the following questions: 3.1. Algorithm 1 relies heavily on the specific form of the Shannon entropy in Eqn. (4). Can the same idea be generalized to other functionals as well, say the power sum function? Besides, in the later interval algorithm the log function also becomes important in the analysis. Will the same idea of local intervals work for other functionals too? 3.2. The logarithmic gap is only an upper bound, and it may happen that this bound is loose. To motivate the necessity of the later interval algorithm, the authors should show that the logarithmic gap is unavoidable in the simple algorithm. I checked simple examples and believe that this lower bound is true. 3.3. One may wonder whether fixing a logarithmic gap would be important, given no good lower bound is known. The authors may at least comment on this issue. 4. I am wondering whether some components in the interval-based algorithm are necessary in theory, or are just artifacts in the proof. Specifically, I am curious about whether the clipping in Algorithms 4 and 9 is indeed necessary. Note that without clipping, the algorithm description could be much clearer: we still use the same estimator in different intervals, and the only difference is that we observe fewer samples for large probabilities, and fewer iterations for small probabilities. I strongly feel that clipping can only be used in the proof, and the same bound holds without clipping in the algorithm. In fact, a simple Chebyshev inequality can be used to bound the concentration term in Eqn. (9), where by triangle inequality the variance can be upper bounded by the sum of the variance of the clipped estimate, and the second moment of the difference. The only remaining problem is about the difference, whose variance seems to be upper bounded by some desirable quantity using the same decomposition arguments as in Appendix F.3. Note that in this way, clipping is only used in the proof but not in the algorithm. 5. Continuing with the previous point, I also feel that some other components of the algorithm may not be really necessary. Note that currently there are lots of wasted/unused samples in Algorithm 4, which seems unnatural. A possibly more natural streaming algorithm would be as follows: after observing some symbol x, first make fewer observations N_1 of the subsequent observations. If the resulting estimate of p(x) is large (i.e., in the largest interval), we simply stop and move on to the next iteration. If it is small, then we make some more observations (N_2 - N_1) to have N_2 observations in total. If the resulting estimate of p(x) is in the second largest interval, we stop; otherwise we continue this process. It seems that the new algorithm shares the same idea of the current one and makes use of samples more efficiently. Will this algorithm work? Clarity: The introduction and the sketch of main ideas are very clear. However, the interval-based algorithm looks a bit messy (due to some theoretical artifacts I think?), which could be hopefully fixed by my above points. Some minor comments: 1. The a += b symbol in the algorithms may be changed into a \leftarrow a + b to avoid ambiguity. 2. In Algorithm 7, line 4-5 seems redundant if we change t-1 to t on line 1. Significance: From the theoretical side, it would be more significant if some lower bound could be proved. However, the algorithms/ideas proposed in this manuscript are quite interesting, and could potentially be generalized to other statistical inference problems with memory constraints. Update: I've read the response. I'm fine with the lower bound issue - as already included in my comments above. However, it's a bit unfortunate to me that the authors did not address the minor technicalities presented above, including the clipping (point 4) and the dependency in the suggested algorithm (point 5). A good paper should be presented in the most natural way and get rid of all unnecessary parts. As a result I will keep my score.