Sharp Bounds for Generalized Uniformity Testing

Ilias Diakonikolas
University of Southern California
diakonik@usc.edu

Daniel M. Kane
University of California, San Diego
dakane@ucsd.edu

Alistair Stewart
University of Southern California
stewart.al@gmail.com

Advances in Neural Information Processing Systems (NeurIPS 2018), pp. 6201-6210.

Abstract

We study the problem of generalized uniformity testing of a discrete probability distribution: Given samples from a probability distribution p over a discrete domain Ω of unknown size, we want to distinguish, with probability at least 2/3, between the case that p is uniform on some subset of Ω versus ε-far, in total variation distance, from any such uniform distribution. We establish tight bounds on the sample complexity of generalized uniformity testing. In more detail, we present a computationally efficient tester whose sample complexity is optimal, within constant factors, and a matching worst-case information-theoretic lower bound.
Specifically, we show that the sample complexity of generalized uniformity testing is Θ(1/(ε^{4/3}||p||_3) + 1/(ε^2||p||_2)).

1 Introduction

Consider the following statistical task: Given independent samples from a distribution over a discrete domain Ω of unknown size, determine whether it is uniform on some subset of the domain versus significantly different from any such uniform distribution. Formally, let C_U := {u_S : S ⊆ Ω} denote the set of uniform distributions u_S over subsets S of Ω. Given sample access to an unknown distribution p on Ω and a proximity parameter ε > 0, we want to correctly distinguish between the case that p ∈ C_U versus d_TV(p, C_U) := min_{S⊆Ω} d_TV(p, u_S) ≥ ε, with probability at least 2/3. Here, d_TV(p, q) = (1/2)||p − q||_1 denotes the total variation distance between distributions p and q. This natural problem, termed generalized uniformity testing, was recently introduced by Batu and Canonne [BC17], who gave the first upper and lower bounds on its sample complexity.

Generalized uniformity testing bears a strong resemblance to the familiar task of uniformity testing, where one is given samples from a distribution p on a domain of known size n and the goal is to determine, with probability at least 2/3, whether p is the uniform distribution u_n on this domain versus d_TV(p, u_n) ≥ ε. Uniformity testing is arguably the most extensively studied problem in distribution property testing [GR00, Pan08, VV14, DKN15b, Gol16, DGPP16, DGPP17] and its sample complexity is well understood. Specifically, it is known [Pan08, CDVV14, VV14, DKN15b] that Θ(n^{1/2}/ε^2) samples are necessary and sufficient for this task.

The field of distribution property testing [BFR+00] has seen substantial progress in the past decade; see [Rub12, Can15] for two recent surveys.
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

A large body of the literature has focused on characterizing the sample size needed to test properties of arbitrary distributions of a given support size. This regime is fairly well understood: for many properties of interest there exist sample-efficient testers [Pan08, CDVV14, VV14, DKN15b, ADK15, CDGR16, DK16, DGPP16, CDS17, DGPP18, CDKS18]. Moreover, an emerging body of work has focused on leveraging a priori structure of the underlying distributions to obtain significantly improved sample complexities [BKR04, DDS+13, DKN15b, DKN15a, CDKS17, DP17, DDK16, DKN17].

Perhaps surprisingly, the natural setting where the distribution is arbitrary on a discrete domain of unknown size does not seem to have been explicitly studied before the recent work of Batu and Canonne [BC17]. Returning to the specific problem studied here, at first glance it might seem that generalized uniformity testing and uniformity testing are essentially the same task. Naively, one might attempt to apply the existing uniformity testers directly without explicit knowledge of the domain. This nearly works, as standard testers do not need to make use of any particular information about the names of domain elements. However, these algorithms do make use of the domain size in a critical way. This difficulty is not so easy to overcome. In fact, as was shown in [BC17], the sample complexity with an unknown domain size is significantly different. Specifically, [BC17] gave a generalized uniformity tester with expected sample complexity O(1/(ε^6||p||_3)) and showed a lower bound of Ω(1/||p||_3). This should be compared to the O(n^{1/2}/ε^2)-sample tester for distributions on domains of size n.
Of particular interest here is that distributions p with support size n can have 1/||p||_3 as large as n^{2/3}, making the problem with unknown domain substantially harder in the worst case.

1.1 Our Results and Techniques

An immediate open question arising from the work of [BC17] is to precisely characterize the sample complexity of generalized uniformity testing. The main result of this paper provides an answer to this question. In particular, we show the following:

Theorem 1.1 (Main Result). There is an algorithm with the following performance guarantee: Given sample access to an arbitrary distribution p over a discrete domain Ω of unknown size and a parameter 0 < ε < 1, the algorithm uses O(1/(ε^{4/3}||p||_3) + 1/(ε^2||p||_2)) independent samples from p in expectation, and distinguishes between the case p ∈ C_U versus d_TV(p, C_U) ≥ ε with probability at least 2/3. Moreover, for every 0 < ε < 1/10 and n > 1, any algorithm that distinguishes between p ∈ C_U and d_TV(p, C_U) ≥ ε requires at least Ω(n^{2/3}/ε^{4/3} + n^{1/2}/ε^2) samples, where p is guaranteed to have ||p||_3 = Θ(n^{−2/3}) and ||p||_2 = Θ(n^{−1/2}).

In the following paragraphs, we provide an intuitive explanation of our algorithm and our matching sample size lower bound, in tandem with a comparison to the prior work [BC17].

Sample-Optimal Generalized Uniformity Tester. Our algorithm requires considering two cases based on the relative size of ε and ||p||_2^2. This case analysis seems somewhat intrinsic to the problem, as the correct sample complexity branches into these cases.

For large ε, we use the same overall technique as [BC17], noting that p is uniform if and only if ||p||_3 = ||p||_2^{4/3}, and that for p far from uniform, ||p||_3 must be substantially larger.
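The characterization behind the large-ε tester — F_3(p) = F_2(p)^2 exactly when p is uniform on its support, with a strict gap otherwise — is easy to sanity-check numerically. A minimal sketch, where the example distributions are our own illustrative choices:

```python
def power_sum(p, r):
    # F_r(p) = sum_i p_i^r = ||p||_r^r
    return sum(pi ** r for pi in p)

# For u_S uniform on a support of size k: F_2 = 1/k and F_3 = 1/k^2,
# so F_3 = F_2^2 holds exactly.
uniform = [1.0 / 50] * 50
assert abs(power_sum(uniform, 3) - power_sum(uniform, 2) ** 2) < 1e-12

# Away from uniformity, F_3 - F_2^2 = sum_i p_i (p_i - F_2)^2 becomes
# strictly positive (quantified by Lemma 2.8).
skewed = [0.03] * 25 + [0.01] * 25
assert power_sum(skewed, 3) - power_sum(skewed, 2) ** 2 > 0
```

In ||·||-norm terms the same identity reads ||p||_3 = ||p||_2^{4/3}, which is the form used above.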
The basic idea from here is to first obtain rough approximations to ||p||_2 and ||p||_3 in order to ascertain the correct number of samples to use, and then use standard unbiased estimators of ||p||_2^2 and ||p||_3^3 to approximate them to appropriate precision, so that their relative sizes can be compared with appropriate accuracy.

We improve upon the work of [BC17] in this parameter regime in a couple of ways. First, we obtain more precise lower bounds on the difference ||p||_3^3 − ||p||_2^4 in the case where p is far from uniform (Lemma 2.8). This allows us to reduce the accuracy needed in estimating ||p||_2 and ||p||_3. Second, we refine the method used for performing the approximations to these moments (ℓ_r-norms). In particular, we observe that using the generic estimators for these quantities yields sub-optimal bounds for the following reason: The error of the unbiased estimators is related to their variance, which in turn can be expressed in terms of the higher moments of p (Fact 2.1). This implies, for example, that the worst-case sample complexity for estimating ||p||_3 comes when the fourth and fifth moments of p are large. However, since we are trying to test for the case of uniformity (where these higher moments are minimal), we do not need to worry about this worst case. In particular, after applying sample-efficient tests to ensure that the higher moments of p are not much larger than expected, the standard estimators for the second and third moments of p can be shown to converge more rapidly than they would in the worst case (Fact 2.1).

The above algorithm is not sufficient for small values of ε. For ε sufficiently small, we employ a different, perhaps more natural, algorithm.
Here we take m samples (for m appropriately chosen based on an approximation to ||p||_2) and consider the subset S of the domain that appears in the sample. We then test whether the conditional distribution of p on S is uniform, and output the answer of this tester. The number of samples m drawn in the first step is sufficiently large so that p(S), the probability mass of S under p, is relatively high. Hence, it is easy to sample from the conditional distribution using rejection sampling. Furthermore, we can use a standard uniformity testing algorithm requiring O(√|S|/ε^2) samples.

To establish correctness of this algorithm, we need to show that if p is far from uniform, then the conditional distribution of p on S is far from uniform as well. We show (Lemma 2.10) that for any x = Θ(1/n), with high constant probability, the random variable Z(x) = Σ_{i∈S} |p_i − x| is large. It is not hard to show that this holds with high probability for each fixed x, as p being far from uniform implies that Σ_{i∈Ω} min(p_i, |p_i − x|) is large. This latter condition can be shown to provide a clean lower bound for the expectation of Z(x). To conclude the argument, we show that Z(x) is tightly concentrated around its expectation. Applying an appropriate union bound allows us to show that Z(x) is large for all x, and thus that the conditional distribution is far from uniform.

Sample Complexity Lower Bound. The lower bound of Ω(n^{1/2}/ε^2) follows directly from the standard lower bound of [Pan08] for uniformity testing on a given domain of size n. The other branch of the lower bound, namely Ω(n^{2/3}/ε^{4/3}), is more involved.
To prove this lower bound, we use the shared information method of [DK16] for the following family of hard instances: In the "YES" case, we consider the distribution over (pseudo-)distributions on N bins, where each p_i is (1+ε^2)/n with probability n/(N(1+ε^2)), and 0 otherwise. (Here we assume that the parameter N is sufficiently large compared to the other parameters.) In the "NO" case, we consider the distribution over (pseudo-)distributions on N bins, where each p_i is (1+ε)/n with probability n/(2N), (1−ε)/n with probability n/(2N), and 0 otherwise.

Notation. Let Ω denote the unknown discrete domain. Each probability distribution over Ω can be associated with a probability mass function p : Ω → R_+ such that Σ_{i∈Ω} p_i = 1. We will use p_i, instead of p(i), to denote the probability of element i ∈ Ω in p. For a distribution p and a set S ⊆ Ω, we denote by p(S) := Σ_{i∈S} p_i the mass of S and by (p|S) the conditional distribution of p on S. For r ≥ 1, the ℓ_r-norm of a function p : Ω → R is ||p||_r := (Σ_{i∈Ω} |p_i|^r)^{1/r}. For convenience, we will denote F_r(p) := ||p||_r^r = Σ_{i∈Ω} |p_i|^r. For ∅ ≠ S ⊆ Ω, let u_S be the uniform distribution over S. Let C_U := {u_S : ∅ ≠ S ⊆ Ω} be the set of uniform distributions over subsets of Ω. The total variation distance between distributions p, q on Ω is defined as d_TV(p, q) := max_{S⊆Ω} |p(S) − q(S)| = (1/2)·||p − q||_1. Finally, we denote by Poi(λ) the Poisson distribution with parameter λ.

2 Generalized Uniformity Tester

Before we describe our algorithm, we summarize a few preliminary results on estimating the power sums F_r(p) = Σ_{i∈Ω} |p_i|^r of an unknown distribution p.
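For explicit finite distributions, the quantities just defined are cheap to compute directly, which is convenient for sanity-checking the estimators developed below. A small sketch (restricting the scan for the nearest uniform distribution to the k heaviest elements can be justified by an exchange argument):

```python
def power_sum(p, r):
    # F_r(p) = sum_i p_i^r = ||p||_r^r, for a pmf given as a list.
    return sum(pi ** r for pi in p)

def tv_distance(p, q):
    # d_TV(p, q) = (1/2) * ||p - q||_1 for equal-length pmfs.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def tv_to_uniform_family(p):
    # d_TV(p, C_U) = min over nonempty S of d_TV(p, u_S); it suffices to
    # scan uniform distributions on the k heaviest elements.
    q = sorted(p, reverse=True)
    best = 1.0
    for k in range(1, len(q) + 1):
        u = 1.0 / k
        best = min(best, 0.5 * (sum(abs(pi - u) for pi in q[:k]) + sum(q[k:])))
    return best
```

For instance, `tv_to_uniform_family` returns 0 (up to rounding) on any uniform distribution, and a value bounded away from 0 on a distribution that is far from every member of C_U.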
We present these results in Section 2.1. In Section 2.2, we present and analyze the algorithm for large values of ε. In Section 2.3, we do the same for the small-ε algorithm. Finally, in Section 2.4, we present the full algorithm.

2.1 Estimating the Power Sums of a Discrete Distribution

We will require various notions of approximation for the power sums of a discrete distribution.

Fact 2.1 ([AOST17]). Let p be a probability distribution on an unknown discrete domain. For any r ≥ 1, there exists an estimator F̂_r(p) for F_r(p) that draws Poi(m) samples from p and satisfies the following: E[F̂_r(p)] = F_r(p) and Var[F̂_r(p)] = m^{−2r} Σ_{t=0}^{r−1} m^{r+t} (r choose t) r^{r−t} F_{r+t}(p).

The estimator F̂_r(p) draws Poi(m) samples from p, and m^r · F̂_r(p) equals the number of r-wise collisions, i.e., ordered r-tuples of samples that land in the same bin. We use this fact to get a few useful algorithms for approximating these moments:

Lemma 2.2. There exists an algorithm that, given an integer r ≥ 1 and sample access to a distribution p, returns a positive real number x so that:

1. With at least 99% probability, x is within a constant (depending on r) multiple of ||p||_r.
2. The expectation of 1/x is O_r(1/||p||_r).
3. The expected number of samples taken by the algorithm is O_r(1/||p||_r).

Proof. The algorithm is as follows:

Algorithm 1 Algorithm for Rough Moment Estimation
1: procedure ROUGH-MOMENT-ESTIMATOR(p, r)
input: Sample access to distribution p on unknown discrete domain Ω and an integer r > 0.
output: A value x approximating ||p||_r.
2: Draw samples from p until there is some r-wise collision among these samples.
3: Return 1/n, where n is the number of samples taken in Step 2.

Firstly, we note that with large constant probability, n = Ω_r(1/||p||_r).
This is because after taking m samples, the expected number of r-wise collisions is at most F_r(p)m^r = (||p||_r m)^r. Thus, by Markov's inequality, if m ≪ 1/||p||_r, then with large constant probability our algorithm will not have terminated yet. To finish the proof, it suffices to show that E[n] = O_r(1/||p||_r). This implies by Markov's inequality that with large constant probability n = O_r(1/||p||_r), and it bounds the expectations of the number of samples and of 1/x. Let m = 1/||p||_r. We note, by Fact 2.1, that if we take Poi(m) samples from p, the expected number of r-wise collisions is 1, and the variance is O_r(1). By the Paley-Zygmund inequality, every time the algorithm takes Poi(m) samples, there is at least a c_r > 0 probability of seeing an r-wise collision. Therefore, if we consider our algorithm to take samples in blocks of size Poi(m), the probability that we have not found an r-wise collision after t blocks is at most (1 − c_r)^t. Thus, the expected number of blocks until we have an r-wise collision is O_r(1). Therefore, the expected number of samples is O_r(m) = O_r(1/||p||_r), completing the proof.

From the above, we derive an algorithm that approximates ||p||_r to a small relative error:

Lemma 2.3. There exists an algorithm that, given sample access to a distribution p, a positive integer r, and a 1 > δ > 0, computes a value γ̂_r so that with probability at least 19/20 we have that |γ̂_r − F_r(p)| ≤ δ · F_r(p). Furthermore, this algorithm uses an expected O_r(1/(δ^2||p||_r)) samples.

Proof.
The algorithm is as follows:

Algorithm 2 Algorithm for Moment Estimation
1: procedure MOMENT-ESTIMATOR(p, r, δ)
input: Sample access to arbitrary distribution p on unknown discrete domain Ω, an integer r > 0, and a 1 > δ > 0.
output: A value γ̂_r approximating F_r(p).
2: Run Rough-Moment-Estimator(p, r), returning a value x.
3: Let m be C_r/(δ^2 x) for C_r a sufficiently large constant in terms of r.
4: Run the algorithm from Fact 2.1 using Poi(m) samples and return the result.

To show correctness, first note that with 99% probability we have that x = Θ_r(||p||_r), and thus m is at least a sufficiently large multiple of 1/(δ^2||p||_r). If this holds, then the output of our algorithm will be a random variable with mean F_r(p). We need to bound the variance, which we do as follows:

Claim 2.4. If m||p||_r ≫ 1, then Var(F̂_r(p)) = O_r(F_r(p)^2/(m||p||_r)).

Proof. The variance is O_r(Σ_{t=0}^{r−1} m^{t−r} ||p||_r^{t+r}) = O_r(m^{−1}||p||_r^{2r−1}) = O_r(F_r(p)^2/(m||p||_r)), which completes the proof.

If C_r is large enough, this implies that Var(F̂_r(p)) ≤ (F_r(p)^2 δ^2)/100. Given this, our bound on |γ̂_r − F_r(p)| follows from Chebyshev's inequality. In terms of sample complexity, we note that the expected number of samples in Step 2 is O_r(1/||p||_r), and the expected number of samples in Step 4 is O(m) = O_r(1/(δ^2 x)), which in expectation is O_r(1/(δ^2||p||_r)). This completes the proof.

Our algorithm will begin by running Rough-Moment-Estimator to compute rough estimates of the second and third moments of p.
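The two estimators just described can be rendered compactly. The sketch below is a simplification, not the paper's exact procedure: it draws exactly m samples instead of Poi(m) (so the collision statistic is only approximately the unbiased estimator of Fact 2.1), and the constant `C_r` is an arbitrary illustrative choice:

```python
import random
from collections import Counter
from math import factorial

def rough_moment_estimator(sampler, r):
    # Lemma 2.2: draw until some element has appeared r times (an r-wise
    # collision); 1/n is then a constant-factor estimate of ||p||_r.
    counts, n = Counter(), 0
    while True:
        n += 1
        x = sampler()
        counts[x] += 1
        if counts[x] == r:
            return 1.0 / n

def moment_estimator(sampler, r, delta, C_r=20.0):
    # Lemma 2.3: use the rough estimate of ||p||_r to choose m, then
    # estimate F_r(p) from the number of ordered r-wise collisions.
    x = rough_moment_estimator(sampler, r)
    m = int(C_r / (delta ** 2 * x)) + 1
    counts = Counter(sampler() for _ in range(m))
    # Ordered r-tuples landing in the same bin: sum over bins of c!/(c-r)!.
    collisions = sum(factorial(c) // factorial(c - r)
                     for c in counts.values() if c >= r)
    return collisions / m ** r
```

On the uniform distribution over k elements, the returned value should concentrate near F_r = k^{1−r}.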
Unless there is some n for which ||p||_2 = Θ(n^{−1/2}) and ||p||_3 = Θ(n^{−2/3}), we know that p cannot possibly be uniform. Otherwise, we know that if p is uniform, then its support must have size Θ(n). Our algorithm will thus critically depend on the following proposition:

Proposition 2.5. There exists an algorithm that, given sample access to a distribution p and n, ε > 0, takes an expected O(n^{2/3}/ε^{4/3} + n^{1/2}/ε^2) samples from p and distinguishes with probability at least 2/3 between the cases: (i) p is the uniform distribution on a domain of size Θ(n), and (ii) p is ε-far from any uniform distribution.

Our algorithm will begin by verifying that ||p||_2 = Θ(n^{−1/2}) and ||p||_3 = Θ(n^{−2/3}) using Lemma 2.2. Thus, in the second case, we can assume that ||p||_2 = Θ(n^{−1/2}) and ||p||_3 = Θ(n^{−2/3}). We will further split our algorithm into cases depending on whether ε is bigger than n^{−1/4}, which in particular determines which term dominates the sample complexity.

We will need the following simple claim giving a useful condition for the soundness case:

Claim 2.6. If d_TV(p, C_U) ≥ ε, then for all x ∈ [0, 1] we have that Σ_{i∈Ω} min{p_i, |x − p_i|} ≥ ε/2.

Proof. Let S_h be the set of i ∈ Ω on which p_i > x/2. Let δ = Σ_{i∈Ω} min{p_i, |x − p_i|}. Note that δ = ||p − c_{x,S_h}||_1, where c_{x,S_h} is the pseudo-distribution that is x on S_h and 0 elsewhere. If ||c_{x,S_h}||_1 were 1, c_{x,S_h} would be the uniform distribution u_{S_h} and we would have δ ≥ ε. However, this need not be the case.
That said, it is easy to see that ||u_{S_h} − c_{x,S_h}||_1 = |1 − ||c_{x,S_h}||_1| ≤ ||p − c_{x,S_h}||_1 = δ. Therefore, by the triangle inequality, 2δ ≥ ||p − c_{x,S_h}||_1 + ||u_{S_h} − c_{x,S_h}||_1 ≥ ||p − u_{S_h}||_1 ≥ ε.

2.2 Algorithm for Large ε

Lemma 2.7. There exists an algorithm that, given sample access to a distribution p and n, ε > 0 with ε ≥ n^{−1/4}, takes an expected O(n^{2/3}/ε^{4/3}) samples from p and distinguishes with probability at least 9/10 between the cases: (i) p is the uniform distribution on a domain of size Θ(n); (ii) p satisfies ||p||_2 = Θ(n^{−1/2}), ||p||_3 = Θ(n^{−2/3}), and p is ε-far from any uniform distribution.

The basic idea of this algorithm is that if p is uniform over any discrete domain then

F_3(p) = F_2(p)^2 .    (1)

We claim that this condition is robust. Namely, for p far from uniform, Equation (1) will fail by a lot. Therefore, we can distinguish between the relevant cases by finding suitably close approximations to F_2(p) and F_3(p). To start with, we need to prove the robust version of Equation (1):

Lemma 2.8. We have the following: (i) If p ∈ C_U, then F_3(p) = F_2(p)^2. (ii) If d_TV(p, C_U) ≥ ε, then F_3(p) − F_2(p)^2 > ε^2 F_2(p)^2/64.

Proof. The proof of (i) is straightforward. Suppose that p = u_S for some ∅ ≠ S ⊆ Ω. It then follows that F_2(p) = 1/|S| and F_3(p) = 1/|S|^2, yielding part (i) of the lemma. We now proceed to prove part (ii). Suppose that d_TV(p, C_U) ≥ ε. First, it will be useful to rewrite the quantity F_3(p) − F_2(p)^2 as follows:

F_3(p) − F_2(p)^2 = Σ_{i∈Ω} p_i (p_i − F_2(p))^2 .    (2)

Note that (2) follows from the identity p_i(p_i − F_2(p))^2 = p_i^3 + p_i F_2(p)^2 − 2p_i^2 F_2(p) by summing over i ∈ Ω. Since d_TV(p, C_U) ≥ ε, an application of Claim 2.6 for x = F_2(p) ∈ [0, 1] gives that Σ_{i∈Ω} min{p_i, |F_2(p) − p_i|} ≥ ε/2. We partition Ω into the sets S_l = {i ∈ Ω | p_i < F_2(p)/2} and its complement S_h = Ω \ S_l. Note that Σ_{i∈Ω} min{p_i, |F_2(p) − p_i|} = Σ_{i∈S_l} p_i + Σ_{i∈S_h} |F_2(p) − p_i|. It follows that either Σ_{i∈S_l} p_i ≥ ε/4 or Σ_{i∈S_h} |F_2(p) − p_i| ≥ ε/4. We analyze each case separately.

First, suppose that Σ_{i∈S_l} p_i ≥ ε/4. Using (2) we can now write

F_3(p) − F_2(p)^2 ≥ Σ_{i∈S_l} p_i (p_i − F_2(p))^2 > (F_2(p)/2)^2 · Σ_{i∈S_l} p_i ≥ ε F_2(p)^2/16 .

Now suppose that Σ_{i∈S_h} |F_2(p) − p_i| ≥ ε/4. Note that 1 ≤ |S_h| ≤ 2/F_2(p). In this case, using (2) we obtain

F_3(p) − F_2(p)^2 ≥ Σ_{i∈S_h} p_i (p_i − F_2(p))^2 ≥ (F_2(p)/2) · Σ_{i∈S_h} (p_i − F_2(p))^2 ≥ (F_2(p)/2) · (Σ_{i∈S_h} |F_2(p) − p_i|)^2 / |S_h| ≥ (F_2(p)/2)^2 · (ε/4)^2 = ε^2 F_2(p)^2/64 ,

where the second inequality uses the definition of S_h, the third is Cauchy-Schwarz, and the last uses |S_h| ≤ 2/F_2(p).

We are now ready to prove Lemma 2.7. At a high level, the algorithm is simple: compute approximations to F_2(p) and F_3(p) using Fact 2.1 and apply Lemma 2.8. However, there is one technical problem with this scheme.
Namely, the variance in our estimator for F_3(p) depends on the values of F_4(p) and F_5(p). If either of these is too large, then it will affect the accuracy of our final estimator. However, if p is uniform on a domain of size Θ(n), it must be the case that F_4(p) = O(n^{−3}) and F_5(p) = O(n^{−4}). So we will first perform a pre-processing step where we verify that neither F_4(p) nor F_5(p) is too large, before estimating F_2(p) and F_3(p).

Proof of Lemma 2.7. The pseudocode is described in Algorithm 3.

Algorithm 3 Algorithm for Large ε
1: procedure LARGE-EPS-TESTER(p, n, ε)
input: Sample access to arbitrary distribution p on unknown discrete domain Ω and n, ε > 0 with ε ≥ n^{−1/4}.
output: "YES" with probability 9/10 if p is uniform on a set of size Θ(n); "NO" with probability 9/10 if ||p||_2 = Θ(n^{−1/2}), ||p||_3 = Θ(n^{−2/3}), and p is ε-far from any uniform distribution.
2: Let C, C′ be sufficiently large constants, with C large enough relative to C′. Let m = Cn^{2/3}/ε^{4/3}.
3: Draw Poi(O(m)) samples from p and let γ̂_4 denote the value of F̂_4(p) on this sample.
4: if γ̂_4 > C′n^{−3} then return "NO".
5: Draw Poi(O(m)) samples from p and let γ̂_5 denote the value of F̂_5(p) on this sample.
6: if γ̂_5 > C′n^{−4} then return "NO".
7: Compute the estimates F̂_2(p), F̂_3(p) on two separate sets of Poi(m) samples.
8: if F̂_3(p) − F̂_2(p)^2 ≤ ε^2/(300n^2) then return "YES".
9: else return "NO".

Note that the expected number of samples taken by this algorithm is O(m) = O(n^{2/3}/ε^{4/3}). We next prove correctness. We start by considering Steps 3 through 6. Firstly, in the completeness case, we note that F_r(p) = Θ(n^{1−r}), and therefore, by the Markov bound, γ̂_r ≤ C′n^{1−r} with at least 99% probability. In the soundness case, we claim that these steps will reject with at least 99% probability unless F_r(p) = O(C′(n^{1−r} + m^{−r})). In particular, if F_r(p) ≥ KC′(n^{1−r} + m^{−r}), then m||p||_r ≥ 1, and therefore, by Claim 2.4, we have that E[γ̂_r] = F_r(p) and Var(γ̂_r) = O(F_r(p)^2/K^2). So, if K is sufficiently large, by Chebyshev's inequality, with 99% probability we have that γ̂_r > F_r(p)/2 ≥ C′n^{1−r}. Thus, in the remainder, we can assume that F_4(p) = O(C′(n^{−3} + m^{−4})) and F_5(p) = O(C′(n^{−4} + m^{−5})). To analyze Step 7, we note that

Var(F̂_2(p)) = O(m^{−2}F_2(p) + m^{−1}F_3(p)) = O(m^{−2}n^{−1} + m^{−1}n^{−2}) = O(ε^4/n^2)/C ,

where we use that ε ≥ n^{−1/4} and m = Cn^{2/3}/ε^{4/3}. Similarly, we have

Var(F̂_3(p)) = O(m^{−3}F_3(p) + m^{−2}F_4(p) + m^{−1}F_5(p)) = O(m^{−3}n^{−2} + C′m^{−2}n^{−3} + C′m^{−6} + C′m^{−1}n^{−4}) = O(ε^4/n^4)(C′/C).

Therefore, by Chebyshev's inequality, with 99% probability we have that |F̂_2(p) − F_2(p)| = O(ε^2/n)/√C and |F̂_3(p) − F_3(p)| = O(ε^2/n^2)·√(C′/C). Assuming these hold, we have that

|(F_3(p) − F_2(p)^2) − (F̂_3(p) − F̂_2(p)^2)| = O(ε^2/n^2)·√(C′/C).

Thus, if C/C′ is sufficiently large: if p is uniform, we accept, and if p is ε-far from uniform, then by Lemma 2.8, we reject. This completes the proof.

2.3 Algorithm for Small ε

In this section, we give a tester that works for ε ≤ n^{−1/4}.

Lemma 2.9. There exists an algorithm that, given sample access to a distribution p and n, ε > 0 with ε ≤ n^{−1/4}, takes an expected O(n^{1/2}/ε^2) samples from p and distinguishes with probability at least 9/10 between the cases: (i) p is the uniform distribution on a domain of size Θ(n), and (ii) p is ε-far from any uniform distribution.

Proof. The basic idea is that we will take Θ(n) samples from p and let S be the set of distinct elements seen. We then test uniformity of (p|S) using the standard uniformity tester.

Algorithm 4 Algorithm for Small ε
1: procedure SMALL-EPS-TESTER(p, n, ε)
input: Sample access to arbitrary distribution p on unknown discrete domain Ω and n, ε > 0 with ε ≤ n^{−1/4}.
output: "YES" with probability 9/10 if p is uniform on a set of size Θ(n); "NO" with probability 9/10 if p is ε-far from any uniform distribution.
2: Let C, C′ be sufficiently large constants with C large even relative to C′. Let m = Cn.
3: Draw Poi(m) samples from p.
Let S be the subset of Ω that appears in the sample.
4: Verify the following conditions: (i) each i ∈ S appears O(C log n) times; (ii) |S| = Θ(n).
5: if either of conditions (i) or (ii) is violated then return "NO".
6: Draw m′ = C√n/ε^2 samples from p.
7: if fewer than half of these samples were in S then return "NO".
8: Use the first m′/2 of these samples that landed in S to run the standard uniformity tester for (p|S) with distance ε/C′ and 1% probability of error.
9: return the answer of the tester in Step 8.

We note that the expected number of samples is O(m + m′) = O(n^{1/2}/ε^2), using that ε ≤ n^{−1/4}. It remains to prove correctness. We begin with the completeness case. If p is uniform over a set of size Θ(n), with high probability no bin will see more than O(C log n) samples, so condition (i) is satisfied. Furthermore, with high probability, Poi(Cn) samples from p will cover more than two thirds of the bins, so condition (ii) will be satisfied. Additionally, this means that p(S) ≥ 2/3, so again with high probability, at least half of our m′ samples will lie in S. The first m′/2 of these samples that landed in S are independent samples from (p|S), which is uniform, and therefore with 99% probability will pass the uniformity tester. Therefore, in this case, our algorithm will return "YES" with probability at least 9/10.

For the soundness case, we note that if any bin has probability more than a sufficiently large multiple of log(n)/n, we will fail to satisfy condition (i) with high probability and reject. We would like to claim next that (p|S) is likely to be far from uniform, and thus that we will fail the final test.
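Algorithm 4's sampling pipeline (Steps 3 through 8, with the standard uniformity tester at the end left abstract) can be sketched as follows; for simplicity this sketch draws exactly m samples rather than Poi(m), and `sampler` is a hypothetical black-box sample oracle for p:

```python
import random

def sample_conditional(sampler, S, max_tries=100000):
    # Rejection sampling from (p|S): draw from p until we land in S.
    # When p(S) is bounded below, this terminates after few tries.
    for _ in range(max_tries):
        x = sampler()
        if x in S:
            return x
    raise RuntimeError("p(S) appears to be too small")

def small_eps_pipeline(sampler, m, m2):
    # Stage 1: the set S of distinct elements among m draws from p.
    S = {sampler() for _ in range(m)}
    # Stage 2: samples from the conditional distribution (p|S), to be
    # handed to a standard uniformity tester on the known domain S.
    cond = [sample_conditional(sampler, S) for _ in range(m2)]
    return S, cond
```

The multiplicity and |S| checks of Steps 4-5, and the final tester invocation, are omitted; the point is only the two-stage structure of the sampling.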
Of course, this may depend on the randomness of our first set of samples, but we claim that it holds with high probability. In particular, we show (see the supplementary material for the proof):

Lemma 2.10. If d_TV(p, C_U) ≥ ε and p assigns no more than O(log(n)/n) mass to any single bin, then with high probability over the Poi(m) samples, we have at least one of the following: (i) |S| is not Θ(n), (ii) p(S) ≤ 1/3, (iii) d_TV((p|S), C_U) ≥ ε/C′.

2.4 Full Tester

Algorithm 5 The Full Tester
1: procedure GENERALIZED-UNIFORMITY-TESTER(p, ε)
input: Sample access to arbitrary distribution p on unknown discrete domain Ω and ε > 0.
output: "YES" with probability 2/3 if p is uniform on its support; "NO" with probability 2/3 if p is ε-far from any uniform distribution.
2: Let γ̂_2 = Rough-Moment-Estimator(p, 2).
3: Let γ̂_3 = Rough-Moment-Estimator(p, 3).
4: if γ̂_3 is not Θ(γ̂_2^{4/3}) then return "NO".
5: Let n = γ̂_3^{−3/2}.
6: if ε ≥ n^{−1/4} then return Large-Eps-Tester(p, n, ε)
7: if n^{−1/4} > ε then return Small-Eps-Tester(p, n, ε)

First, we verify correctness. With appropriately high probability, γ̂_2 and γ̂_3 approximate ||p||_2 and ||p||_3, respectively, to within constant factors. In this case, p cannot be uniform unless γ̂_3 = Θ(γ̂_2^{4/3}). Assuming this holds, ||p||_2 = Θ(n^{−1/2}) and ||p||_3 = Θ(n^{−2/3}), so the assumptions necessary for our small-ε and large-ε testers are satisfied, and they will work with appropriate probability.

For the sample complexity, we note that the first two lines take O(1/||p||_3) samples in expectation. The remaining lines use an expected O(n^{2/3}/ε^{4/3} + n^{1/2}/ε^2) samples.
In terms of the estimators, this expected sample count is O(1/(ε^{4/3}γ̂₃) + 1/(ε²γ̂₂)). Our final expected sample bound then follows from Lemma 2.2, which shows that the expected values of 1/γ̂₃ and 1/γ̂₂ are O(1/‖p‖₃) and O(1/‖p‖₂), respectively. This completes our proof.

3 Sample Complexity Lower Bound

In this section, we sketch a sample-size lower bound matching our algorithm in Proposition 2.5. One part of the lower bound is fairly easy. In particular, it is known [Pan08] that Ω(√n/ε²) samples are required to test uniformity of a distribution with a known support of size n, and it is easy to see that the hard instances for this lower bound still work when ‖p‖₂ = Θ(n^{−1/2}) and ‖p‖₃ = Θ(n^{−2/3}).
The other half of the lower bound is somewhat more difficult, and we rely on the lower bound techniques of [DK16]. In particular, for n > 0, 1/10 > ε > n^{−1/4}, and N sufficiently large, we produce a pair of distributions D and D′ over positive measures on [N] such that:
1. A random sample from D or D′ has total mass Θ(1) with high probability.
2. A random sample from D or D′ has ‖p‖₂ = Θ(n^{−1/2}) and ‖p‖₃ = Θ(n^{−2/3}) with high probability.
3. A sample μ from D has μ/‖μ‖₁ equal to the uniform distribution over some subset of [N] with probability 1.
4. A sample μ from D′ has μ/‖μ‖₁ at least Ω(ε)-far from any uniform distribution with high probability.
5.
Given a measure μ drawn at random from either D or D′, no algorithm that observes the output of a Poisson process with intensity kμ, for k = o(min(n^{2/3}/ε^{4/3}, n)), can reliably distinguish between μ taken from D and μ taken from D′.
Before we exhibit these families, we first discuss why the above suffices. This Poissonization technique has been used previously in various settings [VV14, DK16, WY16, DGPP17], so we only provide a sketch here. In particular, suppose that we have such families D and D′, but that there is also an algorithm A that distinguishes between a distribution p being uniform and being ε-far from uniform, when ‖p‖₂ = Θ(n^{−1/2}) and ‖p‖₃ = Θ(n^{−2/3}), using m = o(n^{2/3}/ε^{4/3}) samples. We show that we can use algorithm A to violate property 5 above. In particular, letting p = μ/‖μ‖₁ for μ a random measure taken from either D or D′, we note that with high probability ‖p‖₂ = Θ(n^{−1/2}) and ‖p‖₃ = Θ(n^{−2/3}). Therefore, m′ = o(n^{2/3}/ε^{4/3}) samples suffice to distinguish between p being uniform and being Ω(ε)-far from uniform. By properties 3 and 4, this is equivalent to distinguishing between μ being taken from D and being taken from D′. On the other hand, given the output of a Poisson process with intensity Cm′μ, for C a sufficiently large constant, a random m′ of these samples (note that there are at least m′ total samples with high probability) are distributed identically to m′ samples from p. Thus, applying A to these samples distinguishes between μ taken from D and μ taken from D′, contradicting property 5.
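The reduction above relies on the standard fact that when the sample size is itself Poisson, the per-bin counts are independent, with the count in bin i distributed as Poisson(m·p_i). A quick stdlib-only numerical check of the Poisson marginals (mean and variance both ≈ m·p_i) follows; the distribution `p` and the parameters are arbitrary illustrative choices.

```python
import random

def poissonized_counts(p, m, rng):
    """Draw N ~ Poisson(m), then N i.i.d. samples from the distribution p
    (a list of bin probabilities); return the per-bin counts."""
    # sample N ~ Poisson(m) by counting unit-rate exponential arrivals in [0, m)
    n, t = 0, rng.expovariate(1.0)
    while t < m:
        n += 1
        t += rng.expovariate(1.0)
    counts = [0] * len(p)
    for _ in range(n):
        counts[rng.choices(range(len(p)), weights=p)[0]] += 1
    return counts

rng = random.Random(0)
p, m = [0.5, 0.3, 0.2], 20
trials = [poissonized_counts(p, m, rng) for _ in range(4000)]
for i, pi in enumerate(p):
    xs = [t[i] for t in trials]
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    # Poisson(m * p_i) has mean == variance == m * p_i
    print(i, round(mean, 2), round(var, 2))
```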
Due to space constraints, the technical details are deferred to the supplementary material.

4 Conclusions

In this paper, we gave tight upper and lower bounds on the sample complexity of generalized uniformity testing, a natural non-trivial generalization of uniformity testing recently introduced in [BC17]. The obvious research question is to understand the sample complexity of testing more general symmetric properties (e.g., closeness, independence, etc.) in the regime where the domain of the underlying distributions is discrete but of unknown size. Is it possible to obtain sub-learning sample complexities for these problems? And what is the optimal sample complexity for each of these tasks?

References

[ADK15] J. Acharya, C. Daskalakis, and G. Kamath. Optimal testing for properties of distributions. In NIPS, pages 3591-3599, 2015.

[AOST17] J. Acharya, A. Orlitsky, A. T. Suresh, and H. Tyagi. Estimating Rényi entropy of discrete distributions. IEEE Trans. Information Theory, 63(1):38-56, 2017.

[BC17] T. Batu and C. Canonne. Generalized uniformity testing. CoRR, abs/1708.04696, 2017. To appear in FOCS'17.

[BFR+00] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing that distributions are close. In IEEE Symposium on Foundations of Computer Science, pages 259-269, 2000.

[BKR04] T. Batu, R. Kumar, and R. Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In ACM Symposium on Theory of Computing, pages 381-390, 2004.

[BLM13] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013.

[Can15] C. L. Canonne. A survey on distribution testing: Your data is big. But is it blue? Electronic Colloquium on Computational Complexity (ECCC), 22:63, 2015.

[CDGR16] C. L. Canonne, I. Diakonikolas, T. Gouleakis, and R. Rubinfeld.
Testing shape restrictions of discrete distributions. In 33rd Symposium on Theoretical Aspects of Computer Science, STACS 2016, pages 25:1-25:14, 2016.

[CDKS17] C. L. Canonne, I. Diakonikolas, D. M. Kane, and A. Stewart. Testing Bayesian networks. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 370-448, 2017.

[CDKS18] C. L. Canonne, I. Diakonikolas, D. M. Kane, and A. Stewart. Testing conditional independence of discrete distributions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 735-748, 2018.

[CDS17] C. L. Canonne, I. Diakonikolas, and A. Stewart. Fourier-based testing for families of distributions. CoRR, abs/1706.05738, 2017.

[CDVV14] S. Chan, I. Diakonikolas, P. Valiant, and G. Valiant. Optimal algorithms for testing closeness of discrete distributions. In SODA, pages 1193-1203, 2014.

[DDK16] C. Daskalakis, N. Dikkala, and G. Kamath. Testing Ising models. CoRR, abs/1612.03147, 2016.

[DDS+13] C. Daskalakis, I. Diakonikolas, R. Servedio, G. Valiant, and P. Valiant. Testing k-modal distributions: Optimal algorithms via reductions. In SODA, pages 1833-1852, 2013.

[DGPP16] I. Diakonikolas, T. Gouleakis, J. Peebles, and E. Price. Collision-based testers are optimal for uniformity and closeness. Electronic Colloquium on Computational Complexity (ECCC), 23:178, 2016.

[DGPP17] I. Diakonikolas, T. Gouleakis, J. Peebles, and E. Price. Sample-optimal identity testing with high probability. CoRR, abs/1708.02728, 2017.

[DGPP18] I. Diakonikolas, T. Gouleakis, J. Peebles, and E. Price. Sample-optimal identity testing with high probability. In 45th International Colloquium on Automata, Languages, and Programming, ICALP 2018, pages 41:1-41:14, 2018.

[DK16] I. Diakonikolas and D. M. Kane. A new approach for testing properties of discrete distributions. In FOCS, pages 685-694, 2016.
Full version available at abs/1601.05557.

[DKN15a] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Optimal algorithms and lower bounds for testing closeness of structured distributions. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, pages 1183-1202, 2015.

[DKN15b] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Testing identity of structured distributions. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, pages 1841-1854, 2015.

[DKN17] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Near-optimal closeness testing of discrete histogram distributions. In 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, pages 8:1-8:15, 2017.

[DP17] C. Daskalakis and Q. Pan. Square Hellinger subadditivity for Bayesian networks and its applications to identity testing. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 697-703, 2017.

[Gol16] O. Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. ECCC, 23, 2016.

[GR00] O. Goldreich and D. Ron. On testing expansion in bounded-degree graphs. Technical Report TR00-020, Electronic Colloquium on Computational Complexity, 2000.

[Pan08] L. Paninski. A coincidence-based test for uniformity given very sparsely-sampled discrete data. IEEE Transactions on Information Theory, 54:4750-4755, 2008.

[Rub12] R. Rubinfeld. Taming big probability distributions. XRDS, 19(1):24-28, 2012.

[VV14] G. Valiant and P. Valiant. An automatic inequality prover and instance optimal identity testing. In FOCS, 2014.

[WY16] Y. Wu and P. Yang.
Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702-3720, June 2016.