{"title": "Data Amplification: A Unified and Competitive Approach to Property Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 8834, "page_last": 8843, "abstract": "Estimating properties of discrete distributions is a fundamental problem in statistical learning. We design the first unified, linear-time, competitive, property estimator that for a wide class of properties and for all underlying distributions uses just 2n samples to achieve the performance attained by the empirical estimator with n\\sqrt{\\log n} samples. This provides off-the-shelf, distribution-independent, ``amplification'' of the amount of data available relative to common-practice estimators. \n\nWe illustrate the estimator's practical advantages by comparing it to existing estimators for a wide variety of properties and distributions. In most cases, its performance with n samples is even as good as that of the empirical estimator with n\\log n samples, and for essentially all properties, its performance is comparable to that of the best existing estimator designed specifically for that property.", "full_text": "Data Ampli\ufb01cation: A Uni\ufb01ed and Competitive\n\nApproach to Property Estimation\n\nYi HAO\n\nAlon Orlitsky\n\nDept. of Electrical and Computer Engineering\n\nDept. of Electrical and Computer Engineering\n\nUniversity of California, San Diego\n\nUniversity of California, San Diego\n\nLa Jolla, CA 92093\n\nyih179@eng.ucsd.edu\n\nLa Jolla, CA 92093\nalon@eng.ucsd.edu\n\nAnanda T. Suresh\n\nGoogle Research, New York\n\nNew York, NY 10011\n\ntheertha@google.com\n\nYihong Wu\n\nDept. of Statistics and Data Science\n\nYale University\n\nNew Haven, CT 06511\nyihong.wu@yale.edu\n\nAbstract\n\n\u221a\n\nEstimating properties of discrete distributions is a fundamental problem in sta-\ntistical learning. 
We design the first unified, linear-time, competitive, property estimator that for a wide class of properties and for all underlying distributions uses just 2n samples to achieve the performance attained by the empirical estimator with n√log n samples. This provides off-the-shelf, distribution-independent, "amplification" of the amount of data available relative to common-practice estimators.

We illustrate the estimator's practical advantages by comparing it to existing estimators for a wide variety of properties and distributions. In most cases, its performance with n samples is even as good as that of the empirical estimator with n log n samples, and for essentially all properties, its performance is comparable to that of the best existing estimator designed specifically for that property.

1 Distribution Properties

Let D_X denote the collection of distributions over a countable set X of finite or infinite cardinality k. A distribution property is a mapping f : D_X → R. Many applications call for estimating properties of an unknown distribution p ∈ D_X from its samples. Often these properties are additive, namely, they can be written as a sum of functions of the probabilities. Symmetric additive properties can be written as
f(p) := Σ_{x∈X} f(p_x),

and arise in many biological, genomic, and language-processing applications:

Shannon entropy Σ_{x∈X} p_x log(1/p_x), where throughout the paper log is the natural logarithm, is the fundamental information measure arising in a variety of applications [1].

Normalized support size Σ_{x∈X} (1/k) 1_{p_x>0} plays an important role in population [2] and vocabulary size estimation [3].

Normalized support coverage Σ_{x∈X} (1 − e^{−m p_x})/m is the normalized expected number of distinct elements observed upon drawing Poi(m) independent samples; it arises in ecological [4], genomic [5], and database studies [6].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Power sum Σ_{x∈X} p_x^a arises in Rényi entropy [7], Gini impurity [8], and related diversity measures.

Distance to uniformity Σ_{x∈X} |p_x − 1/k| appears in property testing [9].

More generally, non-symmetric additive properties can be expressed as

f(p) := Σ_{x∈X} f_x(p_x),

for example distances to a given distribution, such as:

L1 distance Σ_{x∈X} |p_x − q_x|, the L1 distance of the unknown distribution p from a given distribution q, appears in hypothesis-testing errors [10].

KL divergence Σ_{x∈X} p_x log(p_x/q_x), the KL divergence of the unknown distribution p from a given distribution q, reflects the compression [1] and prediction [11] degradation when estimating p by q.

Given one of these, or other, properties, we would like to estimate its value based on samples from an underlying distribution.

2 Recent Results

In the common property-estimation setting, the unknown distribution p generates n i.i.d.
samples X^n ∼ p^n, which in turn are used to estimate f(p). Specifically, given property f, we would like to construct an estimator f̂ : X* → R such that f̂(X^n) is as close to f(p) as possible. The standard estimation loss is the expected squared loss

E_{X^n∼p^n} (f̂(X^n) − f(p))².

Generating exactly n samples creates dependence between the number of times different symbols appear. To avoid these dependencies and simplify derivations, we use the well-known Poisson sampling [12] paradigm. We first select N ∼ Poi(n), and then generate N independent samples according to p. This modification does not change the statistical nature of the estimation problem since a Poisson random variable is exponentially concentrated around its mean. Correspondingly, the estimation loss is

L_f̂(p, n) := E_{N∼Poi(n)} [ E_{X^N∼p^N} (f̂(X^N) − f(p))² ].

For simplicity, let N_x be the number of occurrences of symbol x in X^N.
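The Poissonized sampling scheme is easy to simulate, and the simulation makes the simplification concrete: under Poi(n) sampling, each count N_x is itself distributed Poi(n·p_x), independently of the other counts. A minimal sketch (function names and the toy distribution are ours, not the paper's):

```python
import math
import random

random.seed(7)

def poisson(lam, rng=random):
    """Sample from Poi(lam) by summing draws over chunks of rate <= 10
    (Knuth's multiplication method per chunk, to avoid exp underflow)."""
    total = 0
    while lam > 0:
        chunk = min(lam, 10.0)
        lam -= chunk
        limit, k, prod = math.exp(-chunk), 0, rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        total += k
    return total

def poissonized_counts(p, n, rng=random):
    """Draw N ~ Poi(n), then N i.i.d. samples from p; return per-symbol counts.
    Equivalent in law to drawing each count independently as Poi(n * p_x)."""
    N = poisson(n, rng)
    counts = [0] * len(p)
    for x in rng.choices(range(len(p)), weights=p, k=N):
        counts[x] += 1
    return counts

# Demo: the count of symbol 2 fluctuates around its Poisson mean n * p[2] = 380.
p = [0.003, 0.007, 0.19, 0.8]   # toy distribution
n = 2000                        # expected (Poisson) sample size
trials = [poissonized_counts(p, n)[2] for _ in range(400)]
mean = sum(trials) / len(trials)
```

Under exact-n multinomial sampling the counts would be negatively correlated; Poissonization removes that correlation, which is exactly what simplifies the derivations.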
An intuitive estimator is the plug-in empirical estimator f^E that first uses the N samples to estimate p_x by N_x/N and then estimates f(p) as

f^E(X^N) := Σ_{x∈X} f_x(N_x/N) if N > 0, and f^E(X^N) := 0 if N = 0.

Given an error-tolerance parameter δ > 0, the (δ, p)-sample complexity of an estimator f̂ in estimating f(p) is the smallest number of samples n allowing for estimation loss smaller than δ,

n_f̂(δ, p) := min{ n ∈ N : L_f̂(p, n) < δ }.

Since p is unknown, the common min-max approach considers the worst-case (δ, p)-sample complexity of an estimator f̂ over all possible p,

n_f̂(δ) := max_{p∈D_X} n_f̂(δ, p).

Finally, the estimator minimizing n_f̂(δ) is called the min-max estimator of property f, denoted f^M. It follows that n_{f^M}(δ) is the smallest Poisson parameter n, or roughly the number of samples, needed for any estimator f̂ to estimate f(p) to estimation loss δ for all p.

There has been a significant amount of recent work on property estimation. In particular, it was shown that for all seven properties mentioned earlier, f^M improves the sample complexity by a logarithmic factor compared to f^E. For example, for Shannon entropy [13], normalized support size [14], normalized support coverage [15], and distance to uniformity [16], n_{f^E}(δ) = Θ_δ(k) while n_{f^M}(δ) = Θ_δ(k/log k). Note that for normalized support size, D_X is typically replaced by D_k := {p ∈ D_X : p_x ≥ 1/k, ∀x ∈ X}, and for normalized support coverage, k is replaced by m.

3 New Results

While the results already obtained are impressive, they also have some shortcomings. Recent state-of-the-art estimators are designed [13, 14, 16] or analyzed [15, 19] to estimate each individual property. Consequently these estimators cover only a few properties.
Second, estimators proposed for more general properties [15, 20] are limited to symmetric properties and are not known to be computable in time linear in the sample size. Last but not least, by design, min-max estimators are optimized for the "worst" distribution in a class. In practice, this distribution is often very different, and frequently much more complex, than the actual underlying distribution. This "pessimistic" worst-case design results in sub-optimal estimation, as borne out by both the theoretical and experimental results.

In Section 6, we design an estimator f* that addresses all these issues. It is unified and applies to a wide range of properties, including all previously-mentioned properties (a > 1 for power sums) and all Lipschitz properties f where each f_x is Lipschitz. It can be computed in time linear in the sample size. It is competitive in that it is guaranteed to perform well not just for the worst distribution in the class, but for each and every distribution. It "amplifies" the data in that it uses just Poi(2n) samples to approximate the performance of the empirical estimator with Poi(n√log n) samples regardless of the underlying distribution p, thereby providing an off-the-shelf, distribution-independent, "amplification" of the amount of data available relative to the estimators used by many practitioners. As we show in Section 8, it also works well in practice, outperforming existing estimators and often working as well as the empirical estimator with even n log n samples.

For a more precise description, let o(1) represent a quantity that vanishes as n → ∞ and write a ≲ b for a ≤ b(1 + o(1)).
Suppressing small ε for simplicity first, we show that

L_{f*}(p, 2n) ≲ L_{f^E}(p, n√log n) + o(1),

where the first right-hand-side term relates the performance of f* with 2n samples to that of f^E with n√log n samples. The second term adds a small loss that diminishes at a rate independent of the support size k, and for fixed k decreases roughly as 1/n. Specifically, we prove,

Theorem 1. For every property f satisfying the smoothness conditions in Section 5, there is a constant C_f such that for all p ∈ D_X and all ε ∈ (0, 1/2),

L_{f*}(p, 2n) ≤ (1 + 3/log^ε n) · L_{f^E}(p, n log^{1/2−ε} n) + C_f · min{ (k/n) log^ε n + Õ(1/n), 1/log^ε n }.

The Õ reflects a multiplicative polylog(n) factor unrelated to k and p. Again, for normalized support size, D_X is replaced by D_k, and we also modify f* as follows: if k > n, we apply f*, and if k ≤ n, we apply the corresponding min-max estimator [14]. However, for the experiments shown in Section 8, the original f* is used without such modification. In Section 7, we note that for several properties, the second term can be strengthened so that it does not depend on ε.

4 Implications

Theorem 1 has three important implications.

Data amplification Many modern applications, such as those arising in genomics and natural-language processing, concern properties of distributions whose support size k is comparable to or even larger than the number of samples n. For these properties, the estimation loss of the empirical estimator f^E is often much larger than 1/log^ε n, hence the proposed estimator, f*, yields a much better estimate whose performance parallels that of f^E with n√log n samples.
This allows us to amplify the available data by a factor of √log n regardless of the underlying distribution.

Note however that for some properties f, when the underlying distributions are limited to a fixed small support size, L_{f^E}(p, n) = Θ(1/n) ≪ 1/log^ε n. For such small support sizes, f* may not improve the estimation loss.

Unified estimator Recent works either prove efficacy results individually for each property [13, 14, 16], or are not known to be computable in linear time [15, 20].

By contrast, f* is a linear-time estimator that works well for all properties satisfying simple Lipschitz-type and second-order smoothness conditions. All properties described earlier: Shannon entropy, normalized support size, normalized support coverage, power sum, L1 distance, and KL divergence satisfy these conditions, and f* therefore applies to all of them.

More generally, recall that a property f is Lipschitz if all f_x are Lipschitz. It can be shown, e.g. [21], that with O(k) samples, f^E approximates a k-element distribution to a constant L1 distance, and hence also estimates any Lipschitz property to a constant loss. It follows that f* estimates any Lipschitz property over a distribution of support size k to constant estimation loss with O(k/√log k) samples. This provides the first general sublinear-sample estimator for all Lipschitz properties.

Competitive optimality Previous results were geared towards the estimator's worst estimation loss over all possible distributions. For example, they derived estimators that approximate the distance to uniformity of any k-element distribution with O(k/log k) samples, and showed that this number is optimal, as for some distribution classes estimating this distance requires Ω(k/log k) samples.

However, this approach may be too pessimistic. Distributions encountered in practice are rarely maximally complex, or hardest to estimate.
For example, most natural scenes have distinct simple patterns, such as straight lines or flat faces, and hence can be learned relatively easily.

More concretely, consider learning distance to uniformity for the collection of distributions with entropy bounded by log log k. It can be shown that for sufficiently large k, f^E can learn distance to uniformity to constant estimation loss using O((log k)^{Θ(1)}) samples. Theorem 1 therefore shows that the distance to uniformity can be learned to constant estimation loss with O((log k)^{Θ(1)}/√log log k) samples. (In fact, without even knowing that the entropy is bounded.) By contrast, the original min-max estimator results would still require the much larger Ω(k/log k) samples.

The rest of the paper is organized as follows. Section 5 describes mild smoothness conditions satisfied by many natural properties, including all those mentioned above. Section 6 describes the estimator's explicit form and some intuition behind its construction and performance. Section 7 describes two improvements of the estimator addressed in the supplementary material. Lastly, Section 8 describes various experiments that illustrate the estimator's power and competitiveness. For space considerations, we relegate all the proofs to the supplemental material.

5 Smooth properties

Many natural properties, including all those mentioned in the introduction, satisfy some basic smoothness conditions.
For h ∈ (0, 1], consider the Lipschitz-type parameter

ℓ_f(h) := max_{x} max_{u,v∈[0,1]: max{u,v}≥h} |f_x(u) − f_x(v)| / |u − v|,

and the second-order smoothness parameter, resembling the modulus of continuity in approximation theory [17, 18],

ω²_f(h) := max_{x} max_{u,v∈[0,1]: |u−v|≤2h} | (f_x(u) + f_x(v))/2 − f_x((u + v)/2) |.

We consider properties f satisfying the following conditions: (1) ∀x ∈ X, f_x(0) = 0; (2) ℓ_f(h) ≤ polylog(1/h) for h ∈ (0, 1]; (3) ω²_f(h) ≤ S_f · h for some absolute constant S_f.

Note that the first condition, f_x(0) = 0, entails no loss of generality. The second condition implies that f_x is continuous over [0, 1], and in particular right-continuous at 0 and left-continuous at 1. It is easy to see that continuity is also essential for consistent estimation. Observe also that these conditions are more general than assuming that f_x is Lipschitz, as can be seen for entropy where f_x(p) = p log(1/p), and that all seven properties described earlier satisfy these three conditions. Finally, to ensure that L1 distance satisfies these conditions, we let f_x(p_x) = |p_x − q_x| − q_x.

6 The Estimator f*

Given the sample size n, define an amplification parameter t > 1, and let N'' ∼ Poi(nt) be the amplified sample size. Generate a sample sequence X^{N''} independently from p, and let N''_x denote the number of times symbol x appeared in X^{N''}.
The empirical estimate of f(p) with Poi(nt) samples is then

f^E(X^{N''}) = Σ_{x∈X} f_x(N''_x / N'').

Our objective is to construct an estimator f* that approximates f^E(X^{N''}) for large t using just Poi(2n) samples.

Since N'' sharply concentrates around nt, we can show that f^E(X^{N''}) can be approximated by the modified empirical estimator,

f^{ME}(X^{N''}) := Σ_{x∈X} f_x(N''_x / (nt)),

where f_x(p) := f_x(1) for all p > 1 and x ∈ X.

Since large probabilities are easier to estimate, it is natural to set a threshold parameter s and rewrite the modified estimator as a separate sum over small and large probabilities,

f^{ME}(X^{N''}) = Σ_{x∈X} f_x(N''_x / (nt)) 1_{p_x≤s} + Σ_{x∈X} f_x(N''_x / (nt)) 1_{p_x>s}.

Note however that we do not know the exact probabilities. Instead, we draw two independent sample sequences X^N and X^{N'} from p, each of an independent Poi(n) size, and let N_x and N'_x be the number of occurrences of x in the first and second sample sequence respectively.
We then set a small/large-probability threshold s0 and classify a probability p_x as large or small according to N'_x:

f^{ME}_S(X^{N''}, X^{N'}) := Σ_{x∈X} f_x(N''_x / (nt)) 1_{N'_x≤s0}

is the modified small-probability empirical estimator, and

f^{ME}_L(X^{N''}, X^{N'}) := Σ_{x∈X} f_x(N''_x / (nt)) 1_{N'_x>s0}

is the modified large-probability empirical estimator. We rewrite the modified empirical estimator as

f^{ME}(X^{N''}) = f^{ME}_S(X^{N''}, X^{N'}) + f^{ME}_L(X^{N''}, X^{N'}).

Correspondingly, we express our estimator f* as a combination of small- and large-probability estimators,

f*(X^N, X^{N'}) := f*_S(X^N, X^{N'}) + f*_L(X^N, X^{N'}).

The large-probability estimator approximates f^{ME}_L(X^{N''}, X^{N'}) as

f*_L(X^N, X^{N'}) := Σ_{x∈X} f_x(N_x / (nt)) 1_{N'_x>s0}.

Note that we replaced the length-Poi(nt) sample sequence X^{N''} by the independent length-Poi(n) sample sequence X^N.
We can do so as large probabilities are well estimated from fewer samples.

The small-probability estimator f*_S(X^N, X^{N'}) approximates f^{ME}_S(X^{N''}, X^{N'}) and is more involved. We outline its construction below; details can be found in Section 8 of the supplemental material.

The expected value of f^{ME} for the small probabilities is

E[f^{ME}_S(X^{N''}, X^{N'})] = Σ_{x∈X} E[1_{N'_x≤s0}] · E[f_x(N''_x / (nt))].

Let λ_x := n p_x be the expected number of times symbol x will be observed in X^N, and define

g_x(v) := f_x(v / (nt)) · (t/(t−1))^v.

Then

E[f_x(N''_x / (nt))] = Σ_{v=0}^∞ e^{−λ_x t} ((λ_x t)^v / v!) · f_x(v / (nt)) = e^{−λ_x} Σ_{v=1}^∞ e^{−λ_x(t−1)} ((λ_x(t−1))^v / v!) · g_x(v).

As explained in Section 8.1 of the supplemental material, the sum beyond a truncation threshold

u_max := 2 s0 t + 2 s0 − 1

is small, hence it suffices to consider the truncated sum

e^{−λ_x} Σ_{v=1}^{u_max} e^{−λ_x(t−1)} ((λ_x(t−1))^v / v!) · g_x(v).

Applying the polynomial smoothing technique in [22], Section 8.2 of the supplemental material approximates the above summation by

e^{−λ_x} Σ_{v=1}^∞ h_{x,v} λ_x^v,

where

h_{x,v} = (t−1)^v Σ_{u=1}^{u_max ∧ v} ( g_x(u) (−1)^{v−u} / ((v−u)! u!) ) · ( 1 − e^{−r} Σ_{j=0}^{v+u} r^j / j! ),

and

r := 10 s0 t + 10 s0.

Observe that 1 − e^{−r} Σ_{j=0}^{v+u} r^j / j! is the tail probability of a Poi(r) distribution, which diminishes rapidly beyond r. Hence r determines which summation terms will be attenuated, and serves as a smoothing parameter.

An unbiased estimator of e^{−λ_x} Σ_{v=1}^∞ h_{x,v} λ_x^v is

Σ_{v=1}^∞ h_{x,v} v! · 1_{N_x=v} = h_{x,N_x} · N_x!.

Finally, the small-probability estimator is

f*_S(X^N, X^{N'}) := Σ_{x∈X} h_{x,N_x} · N_x! · 1_{N'_x≤s0}.

7 Extensions

In Theorem 1, for fixed n, as ε → 0, the final slack term 1/log^ε n approaches a constant. For certain properties it can be improved. For normalized support size, normalized support coverage, and distance to uniformity, a more involved estimator improves this term to

C_{f,γ} min{ k/(n log^{1−ε} n) + 1/n^{1−γ}, 1/log^{1+ε} n },

for any fixed constant γ ∈ (0, 1/2).

For Shannon entropy, correcting the bias of f*_L [23] and further dividing the probability regions reduces the slack term even more, to

C_{f,γ} min{ k²/(n² log^{2−ε} n) + 1/n^{1−γ}, 1/log^{2+2ε} n }.

Finally, the theorem compares the performance of f* with 2n samples to that of f^E with n√log n samples. As shown in the next section, the performance is often comparable to that of n log n samples. It would be interesting to prove a competitive result that enlarges the amplification to n log^{1−ε} n or even n log n. This would be essentially the best possible, as it can be shown that for the symmetric properties mentioned in the introduction, amplification cannot exceed O(n log n).

8 Experiments

We evaluated the new estimator f* by comparing its performance to several recent estimators [13–15, 22, 27].
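Before turning to the comparisons, the construction of Section 6 can be condensed into code. The sketch below is ours and deliberately simplified: it treats a symmetric property (one f for all x), takes the counts of the two Poi(n) sequences as input, uses Shannon entropy as the stand-in property, leaves the choice of t and s0 to the caller, and ignores the numerical care a real implementation of the small-count coefficients would need:

```python
import math

def smoothed_small_coeff(f, v, n, t, s0):
    """The coefficient h_v of the small-probability estimator: a truncated,
    smoothed series inversion of the Poisson expectation of f(count / (nt))."""
    umax = int(2 * s0 * t + 2 * s0 - 1)   # truncation threshold
    r = 10 * s0 * t + 10 * s0             # smoothing parameter
    g = lambda u: f(min(u / (n * t), 1.0)) * (t / (t - 1)) ** u
    total = 0.0
    for u in range(1, min(umax, v) + 1):
        # Poi(r) tail beyond v + u attenuates high-order (high-variance) terms.
        poi_tail = 1.0 - math.exp(-r) * sum(r ** j / math.factorial(j)
                                            for j in range(v + u + 1))
        total += g(u) * (-1) ** (v - u) / (math.factorial(v - u) * math.factorial(u)) * poi_tail
    return (t - 1) ** v * total

def f_star(counts, counts2, f, n, t, s0):
    """Sketch of f* = f*_S + f*_L: counts come from the first Poi(n) sequence;
    counts2, from the second, only classifies probabilities as small or large."""
    est = 0.0
    for Nx, Nx2 in zip(counts, counts2):
        if Nx2 > s0:       # large probability: plug in Nx / (nt), clipped at 1
            est += f(min(Nx / (n * t), 1.0))
        elif Nx >= 1:      # small probability: bias-corrected smoothed series
            est += smoothed_small_coeff(f, Nx, n, t, s0) * math.factorial(Nx)
    return est

# Demo with Shannon entropy as the property: f(p) = -p log p, f(0) = 0.
H = lambda p: -p * math.log(p) if p > 0 else 0.0
est = f_star([30, 2, 1, 0], [28, 3, 0, 1], H, n=40, t=2.0, s0=2)
```

For a symbol seen once among the small probabilities, the contribution is approximately f(1/(nt)) · t/(t−1), i.e., the plug-in value inflated toward the amplified sample size, which gives a rough sense of where the amplification comes from.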
To ensure robustness of the results, we performed the comparisons for all the symmetric properties described in the introduction: entropy, support size, support coverage, power sums, and distance to uniformity. For each property, we considered six underlying distributions: uniform, Dirichlet-drawn, Zipf, binomial, Poisson, and geometric. The results for the first three properties are shown in Figures 1–3; the plots for the final two properties can be found in Section 9 of the supplemental material. For nearly all tested properties and distributions, f* achieved state-of-the-art performance.

As Theorem 1 implies, for all five properties, with just n (not even 2n) samples, f* performed as well as the empirical estimator f^E with roughly n√log n samples. Interestingly, in most cases f* performed even better, similar to f^E with n log n samples.

Relative to previous estimators, depending on the property and distribution, different previous estimators were best. But in essentially all experiments, f* was either comparable to or outperformed the best previous estimator. The only exception was PML, which attempts to smooth the estimate, and hence performed better on uniform and near-uniform Dirichlet-drawn distributions for several properties.

Two additional advantages of f* may be worth noting. First, underscoring its competitive performance for each distribution, the more skewed the distribution, the better its relative efficacy. This is because most other estimators are optimized for the worst distribution, and work less well for skewed ones. Second, by its simple nature, the empirical estimator f^E is very stable. Designed to emulate f^E for more samples, f* is therefore stable as well. Note also that f^E is not always the best estimator choice. For example, it always underestimates the distribution's support size.
Yet even for normalized support size, Figure 2 shows that f* outperforms other estimators, including those designed specifically for this property (except, as above, for PML on near-uniform distributions).

The next subsection describes the experimental settings. Additional details and further interpretation of the observed results can be found in Section 9 of the supplemental material.

Experimental settings

We tested the five properties on the following distributions: the uniform distribution; a distribution randomly generated from a Dirichlet prior with parameter 2; the Zipf distribution with power 1.5; the binomial distribution with success probability 0.3; the Poisson distribution with mean 3,000; and the geometric distribution with success probability 0.99.

With the exception of normalized support coverage, all other properties were tested on distributions of support size k = 10,000. The geometric, Poisson, and Zipf distributions were truncated at k and re-normalized. The number of samples, n, ranged from 1,000 to 100,000, shown logarithmically on the horizontal axis. Each experiment was repeated 100 times and the reported results, shown on the vertical axis, reflect their mean squared error (MSE).

We compared the estimator's performance with n samples to that of four other recent estimators as well as the empirical estimator with n, n√log n, and n log n samples. We chose the amplification parameter t as log^{1−α} n + 1, where α ∈ {0.0, 0.1, 0.2, ..., 0.6} was selected based on independent data, and similarly for s0. Since f* performed even better than Theorem 1 guarantees, α ended up between 0 and 0.3 for all properties, indicating amplification even beyond n√log n.
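The test distributions above are straightforward to reconstruct. The sketch below is our reading of the stated settings (in particular, the binomial's number of trials, k − 1, is our assumption, since only the success probability is specified):

```python
import math
import random

def normalize(w):
    """Rescale nonnegative weights into a probability vector."""
    s = sum(w)
    return [x / s for x in w]

def test_distributions(k=10_000, rng=random):
    """The six test distributions of the experiments, each supported on k
    symbols; Zipf, Poisson, and geometric are truncated at k and renormalized."""
    uniform = [1 / k] * k
    dirichlet = normalize([rng.gammavariate(2.0, 1.0) for _ in range(k)])
    zipf = normalize([1 / (i + 1) ** 1.5 for i in range(k)])
    geometric = normalize([0.99 * 0.01 ** i for i in range(k)])
    # Binomial(k - 1, 0.3) and Poi(3000) pmfs over {0, ..., k - 1}, computed in
    # log space to avoid overflow, then renormalized.
    binomial = normalize([math.exp(math.lgamma(k) - math.lgamma(i + 1) - math.lgamma(k - i)
                                   + i * math.log(0.3) + (k - 1 - i) * math.log(0.7))
                          for i in range(k)])
    poisson = normalize([math.exp(-3000 + i * math.log(3000) - math.lgamma(i + 1))
                         for i in range(k)])
    return {"uniform": uniform, "dirichlet": dirichlet, "zipf": zipf,
            "binomial": binomial, "poisson": poisson, "geometric": geometric}
```

Sampling n observations from any of these (e.g. with `random.choices`) and averaging squared errors over 100 repetitions reproduces the MSE protocol described above.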
The graphs denote f* by NEW, f^E with n samples by Empirical, f^E with n√log n samples by Empirical+, f^E with n log n samples by Empirical++, the pattern maximum likelihood estimator in [15] by PML, the Shannon-entropy estimator in [27] by JVHW, the normalized-support-size estimator in [14] and the entropy estimator in [13] by WY, and the smoothed Good-Toulmin estimator for normalized support coverage estimation [22], slightly modified to account for previously-observed elements that may appear in the subsequent sample, by SGT.

While the empirical and the new estimators have the same form for all properties, as noted in the introduction, the recent estimators are property-specific, and each was derived for a subset of the properties. In the experiments we applied these estimators to all the properties for which they were derived. Also, additional estimators [28–34] for various properties were compared in [13, 14, 22, 27] and found to perform similarly to or worse than recent estimators, hence we do not test them here.

Figure 1: Shannon Entropy

Figure 2: Normalized Support Size

Figure 3: Normalized Support Coverage

9 Conclusion

In this paper, we considered the fundamental learning problem of estimating properties of discrete distributions. The best-known distribution-property estimation technique is the "empirical estimator" that takes the data's empirical frequency and plugs it into the property functional. We designed a general estimator that for a wide class of properties uses only n samples to achieve the same accuracy as the plug-in estimator with n√log n samples. This provides an off-the-shelf method for amplifying the data available relative to traditional approaches. For all the properties and distributions we have tested, the proposed estimator performed as well as the best estimator(s).
A meaningful future research direction would be to verify the optimality of our results: the amplification factor √log n and the slack terms. There are also several important properties that are not included in our paper, for example, Rényi entropy [35] and the generalized distance to uniformity [36, 37]. It would be interesting to determine whether data amplification could be obtained for these properties as well.

References

[1] COVER, T. M., & THOMAS, J. A. (2012). Elements of information theory. John Wiley & Sons.

[2] GOOD, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4), 237-264.

[3] MCNEIL, D. R. (1973). Estimating an author's vocabulary. Journal of the American Statistical Association, 68(341), 92-96.

[4] COLWELL, R. K., CHAO, A., GOTELLI, N. J., LIN, S. Y., MAO, C. X., CHAZDON, R. L., & LONGINO, J. T. (2012). Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. Journal of Plant Ecology, 5(1), 3-21.

[5] IONITA-LAZA, I., LANGE, C., & LAIRD, N. M. (2009). Estimating the number of unseen variants in the human genome. Proceedings of the National Academy of Sciences, 106(13), 5008-5013.

[6] HAAS, P. J., NAUGHTON, J. F., SESHADRI, S., & STOKES, L. (1995). Sampling-based estimation of the number of distinct values of an attribute. VLDB, Vol. 95, pp. 311-322.

[7] RÉNYI, A. (1961). On measures of entropy and information. Hungarian Academy of Sciences, Budapest, Hungary.

[8] LOH, W. Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 14-23.

[9] CANONNE, C. L. (2017). A survey on distribution testing: Your data is big. But is it blue?

[10] LEHMANN, E. L., & ROMANO, J. P. (2006). Testing statistical hypotheses.
Springer Science & Business Media.

[11] KULLBACK, S., & LEIBLER, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.

[12] SÄRNDAL, C. E., SWENSSON, B., & WRETMAN, J. (2003). Model assisted survey sampling. Springer Science & Business Media.

[13] WU, Y., & YANG, P. (2016). Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6), 3702-3720.

[14] WU, Y., & YANG, P. (2015). Chebyshev polynomials, moment matching, and optimal estimation of the unseen. arXiv preprint arXiv:1504.01227.

[15] ACHARYA, J., DAS, H., ORLITSKY, A., & SURESH, A. T. (2017). A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning (pp. 11-21).

[16] JIAO, J., HAN, Y., & WEISSMAN, T. (2016). Minimax estimation of the L1 distance. In Information Theory (ISIT), 2016 IEEE International Symposium on (pp. 750-754). IEEE.

[17] TIMAN, A. F. (2014). Theory of approximation of functions of a real variable. Elsevier.

[18] KORNEICHUK, N. P. (1991). Exact constants in approximation theory (Vol. 38). Cambridge University Press.

[19] VALIANT, G., & VALIANT, P. (2011). The power of linear estimators. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on (pp. 403-412). IEEE.

[20] HAN, Y., JIAO, J., & WEISSMAN, T. (2018). Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. arXiv preprint arXiv:1802.08405.

[21] KAMATH, S., ORLITSKY, A., PICHAPATI, D., & SURESH, A. T. (2015, June). On learning distributions from their samples. In Conference on Learning Theory (pp. 1066-1100).

[22] ORLITSKY, A., SURESH, A. T., & WU, Y. (2016). Optimal prediction of the number of unseen species.
Proceedings of the National Academy of Sciences, 201607774.\n\n[23] CARLTON, A. G. (1969). On the bias of information estimates. Psychological Bulletin, 71(2),\n\n108.\n\n[24] CHUNG, F. R., & LU, L. (2017). Complex graphs and networks. (No. 107). American\n\nMathematical Soc.\n\n[25] BUSTAMANTE, J. (2017). Bernstein operators and their properties. Chicago.\n\n[26] WATSON, G. N. (1995). A treatise on the theory of Bessel functions. Cambridge University\n\nPress.\n\n[27] JIAO, J., VENKAT, K., HAN, Y., & WEISSMAN, T. (2015). Minimax estimation of functionals\n\nof discrete distributions. IEEE Transactions on Information Theory, 61(5), 2835-2885.\n\n[28] VALIANT, P., & VALIANT, G. (2013). Estimating the unseen: improved estimators for entropy\nand other properties. In Advances in Neural Information Processing Systems (pp. 2157-2165).\n\n[29] PANINSKI, L. (2003). Estimation of entropy and mutual information. Neural computation,\n\n15(6), 1191-1253.\n\n[30] CARLTON, A. G. (1969). On the bias of information estimates. Psychological Bulletin, 71(2),\n\n108.\n\n[31] GOOD, I. J. (1953). The population frequencies of species and the estimation of population\n\nparameters. Biometrika, 40(3-4), 237-264.\n\n[32] CHAO, A. (1984). Nonparametric estimation of the number of classes in a population. Scandi-\n\nnavian Journal of Statistics, 265-270.\n\n[33] CHAO, A. (2005). Species estimation and applications. Encyclopedia of statistical sciences.\n\n[34] SMITH, E. P., & VAN BELLE, G. (1984). Nonparametric estimation of species richness.\n\nBiometrics, 119-129.\n\n[35] ACHARYA, J., ORLITSKY, A., SURESH, A. T., & TYAGI, H. (2017). Estimating R\u00e9nyi entropy\n\nof discrete distributions. IEEE Transactions on Information Theory, 63(1), 38-56.\n\n[36] HAO, Y., & ORLITSKY, A. (2018, June). Adaptive estimation of generalized distance to\nuniformity. In 2018 IEEE International Symposium on Information Theory (ISIT) (pp. 1076-\n1080). IEEE.\n\n[37] BATU, T., & CANONNE, C. L. 
(2017, October). Generalized Uniformity Testing. In 2017 IEEE\n\n58th Annual Symposium on Foundations of Computer Science (FOCS) (pp. 880-889). IEEE.\n\n10\n\n\f", "award": [], "sourceid": 5310, "authors": [{"given_name": "Yi", "family_name": "Hao", "institution": "University of California, San Diego"}, {"given_name": "Alon", "family_name": "Orlitsky", "institution": "University of California, San Diego"}, {"given_name": "Ananda Theertha", "family_name": "Suresh", "institution": "Google"}, {"given_name": "Yihong", "family_name": "Wu", "institution": "Yale University"}]}