{"title": "Testing Closeness With Unequal Sized Samples", "book": "Advances in Neural Information Processing Systems", "page_first": 2611, "page_last": 2619, "abstract": "We consider the problem of testing whether two unequal-sized samples were drawn from identical distributions, versus distributions that differ significantly.  Specifically, given a target error parameter $\\eps > 0$,  $m_1$ independent draws from an unknown distribution $p$ with discrete support, and $m_2$ draws from an unknown distribution $q$ of discrete support, we describe a test for distinguishing the case that $p=q$ from the case that $||p-q||_1 \\geq \\eps$. If $p$ and $q$ are supported on at most $n$ elements, then our test is successful with high probability provided $m_1\\geq n^{2/3}/\\varepsilon^{4/3}$ and $m_2 = \\Omega\\left(\\max\\{\\frac{n}{\\sqrt m_1\\varepsilon^2}, \\frac{\\sqrt n}{\\varepsilon^2}\\}\\right).$ We show that this tradeoff is information theoretically optimal throughout this range, in the dependencies on all parameters, $n,m_1,$ and $\\eps$, to constant factors. As a consequence, we obtain an algorithm for estimating the mixing time of a Markov chain on $n$ states up to a $\\log n$ factor that uses $\\tilde{O}(n^{3/2} \\tau_{mix})$ queries to a ``next node'' oracle. The core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic data and on natural language data.  We believe that this statistic might prove to be a useful primitive within larger machine learning and natural language processing systems.", "full_text": "Testing Closeness With Unequal Sized Samples\n\nBhaswar B. 
Bhattacharya
Department of Statistics
Stanford University
Stanford, CA 94305
bhaswar@stanford.edu

Gregory Valiant∗
Department of Computer Science
Stanford University
Stanford, CA 94305
valiant@stanford.edu

Abstract

We consider the problem of testing whether two unequal-sized samples were drawn from identical distributions, versus distributions that differ significantly. Specifically, given a target error parameter ε > 0, m1 independent draws from an unknown distribution p with discrete support, and m2 draws from an unknown distribution q of discrete support, we describe a test for distinguishing the case that p = q from the case that ||p − q||1 ≥ ε. If p and q are supported on at most n elements, then our test is successful with high probability provided m1 ≥ n^{2/3}/ε^{4/3} and m2 = Ω(max{ n/(√m1 ε^2), √n/ε^2 }). We show that this tradeoff is information theoretically optimal throughout this range in the dependencies on all parameters, n, m1, and ε, to constant factors for worst-case distributions. As a consequence, we obtain an algorithm for estimating the mixing time of a Markov chain on n states up to a log n factor that uses Õ(n^{3/2} τmix) queries to a “next node” oracle. The core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic and on natural language data. We believe that this statistic might prove to be a useful primitive within larger machine learning and natural language processing systems.

1 Introduction

One of the most basic problems in statistical hypothesis testing is the question of distinguishing whether two unknown distributions are very similar, or significantly different. 
Classical tests, like the Chi-squared test or the Kolmogorov-Smirnov statistic, are optimal in the asymptotic regime, for fixed distributions as the sample sizes tend towards infinity. Nevertheless, in many modern settings—such as the analysis of customer data, web logs, natural language processing, and genomics—despite the quantity of available data, the support sizes and complexity of the underlying distributions are far larger than the datasets, as evidenced by the fact that many phenomena are observed only a single time in the datasets, and the empirical distributions of the samples are poor representations of the true underlying distributions.1 In such settings, we must understand these statistical tasks not only in the asymptotic regime (in which the amount of available data goes to infinity), but in the “undersampled” regime in which the dataset is significantly smaller than the size or complexity of the distribution in question. Surprisingly, despite an intense history of study by the statistics, information theory, and computer science communities, aspects of basic hypothesis testing and estimation questions—especially in the undersampled regime—remain unresolved, and require both new algorithms, and new analysis techniques.

∗Supported in part by NSF CAREER Award CCF-1351108
1To give some specific examples, two recent independent studies [19, 26] each considered the genetic sequences of over 14,000 individuals, and found that rare variants are extremely abundant, with over 80% of mutations observed just once in the sample. 
A separate recent paper [16] found that the discrepancy in rare mutation abundance cited in different demographic modeling studies can largely be explained by discrepancies in the sample sizes of the respective studies, as opposed to differences in the actual distributions of rare mutations across demographics, highlighting the importance of improved statistical tests in this “undersampled” regime.

In this work, we examine the basic hypothesis testing question of deciding whether two unknown distributions over discrete supports are identical (or extremely similar), versus have total variation distance at least ε, for some specified parameter ε > 0. We consider (and largely resolve) this question in the extremely practically relevant setting of unequal sample sizes. Informally, taking ε to be a small constant, we show that provided p and q are supported on at most n elements, for any γ ∈ [0, 1/3], the hypothesis test can be successfully performed (with high probability over the random samples) given samples of size m1 = Θ(n^{2/3+γ}) from p, and m2 = Θ(n^{2/3−γ/2}) from q, where n is the size of the supports of the distributions p and q. Furthermore, for every γ in this range, this tradeoff between m1 and m2 is necessary, up to constant factors. Thus, our results smoothly interpolate between the known bound of Θ(n^{2/3}) on the sample size necessary in the setting where one is given two equal-sized samples [6, 9], and the bound of Θ(√n) on the sample size in the setting in which the sample is drawn from one distribution and the other distribution is known to the algorithm [22, 29]. Throughout most of the regime of parameters, when m1 ≪ m2^2, our algorithm is a natural extension of the algorithm proposed in [9], and is similar to the algorithm proposed in [3] except with the addition of a normalization term that seems crucial to obtaining our information theoretic optimality. In the extreme regime when m1 ≈ n and m2 ≈ √n, our algorithm introduces an additional statistic which (we believe) is new. Our algorithm is relatively simple, and practically viable. In Section 4 we illustrate the efficacy of our approach on both synthetic data, and on the real-world problem of deducing whether two words are synonyms, based on a small sample of the bi-grams in which they occur.

We also note that, as pointed out in several related works [3, 12, 6], this hypothesis testing question has applications to other problems, such as estimating or testing the mixing time of Markov chains, and our results yield improved algorithms in these settings.

1.1 Related Work

The general question of how to estimate or test properties of distributions using fewer samples than would be necessary to actually learn the distribution has been studied extensively since the late ’90s. Most of the work has focused on “symmetric” properties (properties whose value is invariant to relabeling domain elements) such as entropy, support size, and distance metrics between distributions (such as ℓ1 distance). This has included both algorithmic work (e.g. [4, 5, 7, 8, 10, 13, 20, 21, 27, 28, 29]), and results on developing techniques and tools for establishing lower bounds (e.g. [23, 30, 27]). See the recent survey by Rubinfeld [24] for a more thorough summary of the developments in this area.

The specific problem of “closeness testing” or “identity testing”, that is, deciding whether two distributions, p and q, are similar, versus have significant distance, has two main variants: the one-unknown-distribution setting in which q is known and a sample is drawn from p, and the two-unknown-distributions setting in which both p and q are unknown and samples are drawn from both. 
We briefly summarize the previous results for these two settings.

In the one-unknown-distribution setting (which can be thought of as the limiting setting in the case that we have an arbitrarily large sample drawn from distribution q, and a relatively modest sized sample from p), initial work of Goldreich and Ron [12] considered the problem of testing whether p is the uniform distribution over [n], versus has distance at least ε. The tight bounds of Θ(√n/ε^2) were later shown by Paninski [22], essentially leveraging the birthday paradox and the intuition that, among distributions supported on n elements, the uniform distribution maximizes the number of domain elements that will be observed once. Batu et al. [8] showed that, up to polylogarithmic factors of n, and polynomial factors of ε, this dependence was optimal for worst-case distributions over [n]. Recently, an “instance-optimal” algorithm and matching lower bound was shown: for any distribution q, up to constant factors, max{ 1/ε, ε^{−2} ||q^{−max}_{−Θ(ε)}||_{2/3} } samples from p are both necessary and sufficient to test p = q versus ||p − q|| ≥ ε, where ||q^{−max}_{−Θ(ε)}||_{2/3} ≤ ||q||_{2/3} is the 2/3-rd norm of the vector of probabilities of distribution q after the maximum element has been removed, and the smallest elements up to Θ(ε) total mass have been removed. (This immediately implies the tight bounds that if q is any distribution supported on [n], O(√n/ε^2) samples are sufficient to test its identity.)

The two-unknown-distribution setting was introduced to this community by Batu et al. [6]. The optimal sample complexity of this problem was recently determined by Chan et al. [9]: they showed that m = Θ(n^{2/3}/ε^{4/3}) samples are necessary and sufficient. In a slightly different vein, Acharya et al. [1, 2] recently considered the question of closeness testing with two unknown distributions from the standpoint of competitive analysis. They proposed an algorithm that performs the desired task using O(s^{3/2} polylog s) samples, and established a lower bound of Ω(s^{7/6}), where s represents the number of samples required to determine whether a set of samples were drawn from p versus q, in the setting where p and q are explicitly known.

A natural generalization of this hypothesis testing problem, which interpolates between the two-unknown-distribution setting and the one-unknown-distribution setting, is to consider unequal sized samples from the two distributions. More formally, given m1 samples from the distribution p, the asymmetric closeness testing problem is to determine how many samples, m2, are required from the distribution q such that the hypothesis p = q versus ||p − q||1 > ε can be distinguished with large constant probability (say 2/3). Note that the results of Chan et al. [9] imply that it is sufficient to consider m1 ≥ Θ(n^{2/3}/ε^{4/3}). This problem was studied recently by Acharya et al. [3]: they gave an algorithm that given m1 samples from the distribution p uses m2 = O(max{ n log n/(√m1 ε^3), √(n log n)/ε^2 }) samples from q, to distinguish the two distributions with high probability. They also proved a lower bound of m2 = Ω(max{ √n/ε^2, n^2/(m1^2 ε^4) }). There is a polynomial gap in these upper and lower bounds in the dependence on n, √m1, and ε.

As a corollary to our main hypothesis testing result, we obtain an improved algorithm for testing the mixing time of a Markov chain. The idea of testing mixing properties of a Markov chain goes back to the work of Goldreich and Ron [12], which conjectured an algorithm for testing expansion of bounded-degree graphs. 
Their test is based on picking a random node and testing whether random walks from this node reach a distribution that is close to the uniform distribution on the nodes of the graph. They conjectured that their algorithm had O(√n) query complexity. Later, Czumaj and Sohler [11], Kale and Seshadhri [15], and Nachmias and Shapira [18] have independently concluded that the algorithm of Goldreich and Ron is provably a test for the expansion property of graphs. Rapid mixing of a chain can also be tested using eigenvalue computations. Mixing is related to the separation between the two largest eigenvalues [25, 17], and eigenvalues of a dense n × n matrix can be approximated in O(n^3) time and O(n^2) space. However, for a sparse n × n symmetric matrix with m nonzero entries, the same task can be achieved in O(n(m + log n)) operations and O(n + m) space. Batu et al. [6] used their ℓ1 distance test on the t-step distributions to test mixing properties of Markov chains. Given a finite Markov chain with state space [n] and transition matrix P = ((P(x, y))), they essentially show that one can estimate the mixing time τmix up to a factor of log n using Õ(n^{5/3} τmix) queries to a next node oracle, which takes a state x ∈ [n] and outputs a state y ∈ [n] drawn from the distribution P(x, ·). Such an oracle can often be simulated significantly more easily than actually computing the transition matrix P(x, y).

We conclude this related work section with a comment on “robust” hypothesis testing and distance estimation. A natural hope would be to simply estimate ||p − q|| to within some additive ε, which is a strictly more difficult task than distinguishing p = q from ||p − q|| ≥ ε. 
The results of Valiant and Valiant [27, 28, 29] show that this problem is significantly more difficult than hypothesis testing: the distance can be estimated to additive error ε for distributions supported on ≤ n elements using samples of size O(n/log n) (in both the setting where either one, or both distributions are unknown). Moreover, Ω(n/log n) samples are information theoretically necessary, even if q is the uniform distribution over [n], and one wants to distinguish the case that ||p − q||1 ≤ 1/10 from the case that ||p − q||1 ≥ 9/10. Recall that the non-robust test of distinguishing p = q versus ||p − q|| > 9/10 requires a sample of size only O(√n). The exact worst-case sample complexity of distinguishing whether ||p − q||1 ≤ 1/n^c versus ||p − q||1 ≥ ε is not well understood, though in the case of constant ε, up to logarithmic factors, the required sample size seems to scale linearly in the exponent between n^{2/3} and n as c goes from 1/3 to 0.

1.2 Our results

Our main result resolves the minimax sample complexity of the closeness testing problem in the unequal sample setting, to constant factors, in terms of n, the support sizes of the distributions in question:

Theorem 1. 
Given m1 ≥ n^{2/3}/ε^{4/3} and ε > n^{−1/12}, and sample access to distributions p and q over [n], there is an O(m1) time algorithm which takes m1 independent draws from p and m2 = O(max{ n/(√m1 ε^2), √n/ε^2 }) independent draws from q, and with probability at least 2/3 distinguishes whether

||p − q||1 ≤ O(1/m2)    versus    ||p − q||1 ≥ ε.    (1)

Moreover, given m1 samples from p, Ω(max{ n/(√m1 ε^2), √n/ε^2 }) samples from q are information-theoretically necessary to distinguish p = q from ||p − q||1 ≥ ε with any constant probability bounded below by 1/2.

The lower bound in the above theorem is proved using the machinery developed in Valiant [30], and “interpolates” between the Θ(√n/ε^2) lower bound in the one-unknown-distribution setting of testing uniformity [22] and the Θ(n^{2/3}/ε^{4/3}) lower bound in the setting of equal sample sizes from two unknown distributions [9]. The algorithm establishing the upper bound involves a re-weighted version of a statistic proposed in [9], and is similar to the algorithm proposed in [3] modulo the addition of a normalizing term, which seems crucial to obtaining our tight results. In the extreme regime when m1 ≈ n and m2 ≈ √n/ε^2, we incorporate an additional statistic that has not appeared before in the literature.

As an application of Theorem 1 in the extreme regime when m1 ≈ n, we obtain an improved algorithm for estimating the mixing time of a Markov chain:

Corollary 1. Consider a finite Markov chain with state space [n] and a next node oracle; there is an algorithm that estimates the mixing time, τmix, up to a multiplicative factor of log n, that uses Õ(n^{3/2} τmix) time and queries to the next node oracle.

Concurrently to our work, Hsu et al. 
[14] considered the question of estimating the mixing time based on a single sample path (as opposed to our model of a sampling oracle). In contrast to our approach via hypothesis testing, they considered the natural spectral approach, and showed that the mixing time can be approximated, up to logarithmic factors, given a path of length Õ(τmix^3/πmin), where πmin is the minimum probability of a state under the stationary distribution. Hence, if the stationary distribution is uniform over n states, this becomes Õ(n τmix^3). It remains an intriguing open question whether one can simultaneously achieve both the linear dependence on τmix of our results and the linear dependence on 1/πmin or the size of the state space, n, as in their results.

1.3 Outline

We begin by stating our testing algorithm, and describe the intuition behind the algorithm. The formal proof of the performance guarantees of the algorithm requires rather involved bounds on the moments of various parameters, and is provided in the supplementary material. We also defer the entirety of the matching information theoretic lower bounds to the supplementary material, as the techniques may not appeal to as wide an audience as the algorithmic portion of our work. The application of our testing results to the problem of testing or estimating the mixing time of a Markov chain is discussed in Section 3. Finally, Section 4 contains some empirical results, suggesting that the statistic at the core of our testing algorithm performs very well in practice. This section contains both results on synthetic data, as well as an illustration of how to apply these ideas to the problem of estimating the semantic similarity of two words based on samples of the n-grams that contain the words in a corpus of text.

2 Algorithms for ℓ1 Testing

In this section we describe our algorithm for ℓ1 testing with unequal samples. 
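For concreteness, the sample-size tradeoff of Theorem 1 can be computed directly. The following is a minimal sketch (in Python; the function name is ours and all constant factors are suppressed), checking that for constant ε and m1 = n^(2/3+γ) the bound collapses to the m2 = Θ(n^(2/3−γ/2)) tradeoff stated in the introduction:

```python
import math

def required_m2(n, m1, eps):
    """Order-of-magnitude bound on m2 from Theorem 1 (constants dropped):
    max(n / (sqrt(m1) * eps^2), sqrt(n) / eps^2)."""
    return max(n / (math.sqrt(m1) * eps ** 2), math.sqrt(n) / eps ** 2)

# For eps constant and m1 = n^(2/3 + gamma) with gamma in [0, 1/3],
# the bound matches the n^(2/3 - gamma/2) tradeoff from the introduction.
n = 10 ** 6
for gamma in (0.0, 0.1, 1 / 3):
    m1 = n ** (2 / 3 + gamma)
    assert math.isclose(required_m2(n, m1, 1.0),
                        n ** (2 / 3 - gamma / 2), rel_tol=1e-6)
```

At γ = 0 both samples have the equal-sample size n^{2/3}; at γ = 1/3 the first sample reaches size n and the second drops to √n, the one-unknown-distribution rate.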
This gives the upper bound in Theorem 1 on the sample sizes necessary to distinguish p = q from ||p − q||1 ≥ ε. For clarity and ease of exposition, in this section we consider ε to be some absolute constant, and suppress the dependency on ε. The slightly more involved algorithm that also obtains the optimal dependency on the parameter ε is given in the supplementary material.

We begin by presenting the algorithm, and then discuss the intuition for the various steps.

Algorithm 1 The Closeness Testing Algorithm

Suppose ε = Ω(1) and m1 = O(n^{1−γ}) for some γ ≥ 0. Let S1, S2 denote two independent sets of m1 samples drawn from p and let T1, T2 denote two independent sets of m2 samples drawn from q. We wish to test p = q versus ||p − q||1 > ε.

• Let b = C0 log n / m2, for an absolute constant C0, and define the set B = {i ∈ [n] : X^{S1}_i/m1 > b} ∪ {i ∈ [n] : Y^{T1}_i/m2 > b}, where X^{S1}_i denotes the number of occurrences of i in S1, and Y^{T1}_i denotes the number of occurrences of i in T1.

• Let Xi denote the number of occurrences of element i in S2, and Yi denote the number of occurrences of element i in T2.

1. Check if

Σ_{i ∈ B} | Xi/m1 − Yi/m2 | ≤ ε/6.    (2)

2. Check if

Z := Σ_{i ∈ [n]\B} [ (m2 Xi − m1 Yi)^2 − (m2^2 Xi + m1^2 Yi) ] / (Xi + Yi) ≤ Cγ m1^{3/2} m2,    (3)

for an appropriately chosen constant Cγ (depending on γ).

3. If γ ≥ 1/9:

• If (2) and (3) hold, then ACCEPT. Otherwise, REJECT.

4. 
Otherwise, if γ < 1/9:

• Check if

R := Σ_{i ∈ [n]\B} 1{Yi = 2} / (Xi + 1) ≤ C1 m2^2/m1,    (4)

where C1 is an appropriately chosen absolute constant.

• REJECT if there exists i ∈ [n] such that Yi ≥ 3 and Xi ≤ C2 m1/(m2 n^{1/3}), where C2 is an appropriately chosen absolute constant.

• If (2), (3), and (4) hold, then ACCEPT. Otherwise, REJECT.

The intuition behind the above algorithm is as follows: with high probability, all elements in the set B satisfy either pi > b/2, or qi > b/2 (or both). Given that these elements are “heavy”, their contribution to the ℓ1 distance will be accurately captured by the ℓ1 distance of their empirical frequencies (where these empirical frequencies are based on the second set of samples, S2, T2). For the elements that are not in set B—the “light” elements—their empirical frequencies will, in general, not accurately reflect their true probabilities, and hence the distance between the empirical distributions of the “light” elements will be misleading. The Z statistic of Equation (3) is designed specifically for this regime. If the denominator of this statistic were omitted, then this would give an estimator for the squared ℓ2 distance between the distributions (scaled by a factor of m1^2 m2^2). To see this, note that if pi and qi are small, then Binomial(m1, pi) ≈ Poisson(m1 pi) and Binomial(m2, qi) ≈ Poisson(m2 qi); furthermore, a simple calculation yields that if Xi ← Poisson(m1 pi) and Yi ← Poisson(m2 qi), then E[(m2 Xi − m1 Yi)^2 − (m2^2 Xi + m1^2 Yi)] = m1^2 m2^2 (pi − qi)^2. The normalization by Xi + Yi “linearizes” the Z statistic, essentially turning the squared ℓ2 distance into an estimate of the ℓ1 distance between the light elements of the two distributions. Similar results can possibly be obtained using other linear functions of Xi and Yi in the denominator, though we note that the “obvious” normalizing factor of Xi + (m1/m2) Yi does not seem to work theoretically, and seems to have extremely poor performance in practice.

For the extreme case (corresponding to γ < 1/9) where m1 ≈ n and m2 ≈ √n/ε^2, the statistic Z might have a prohibitively large variance; this is essentially due to the “birthday paradox”, which might cause a constant number of rare elements (having probability O(1/n)) to occur twice in a sample of size m2 ≈ √n/ε^2. Each such element will contribute Ω(m1^2) ≈ n^2 to the Z statistic, and hence the variance can be ≈ n^4. The statistic R of Equation (4) is tailored to deal with these cases, and captures the intuition that we are more tolerant of indices i for which Yi = 2 if the corresponding Xi is larger. It is worth noting that one can also define a natural analog of the R statistic corresponding to the indices i for which Yi = 3, etc., using which the robustness parameter of the test can be improved. The final check—ensuring that in this regime with m1 ≫ m2 there are no elements for which Yi ≥ 3 but Xi is small—rules out the remaining sets of distributions, p, q, for which the variance of the Z statistic is intolerably large.

Finally, we should emphasize that the crude step of using two independent batches of samples—the first to obtain the partition of the domain into “heavy” and “light” elements, and the second to actually compute the statistics—is for ease of analysis. 
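The statistics at the core of the algorithm are simple to compute. Here is a minimal sketch (in Python; the function name is ours), computing the Z statistic of Equation (3) and the R statistic of Equation (4) over all observed elements, omitting the heavy/light partition and the constant-factor acceptance thresholds that the formal analysis specifies:

```python
from collections import Counter

def closeness_statistics(S2, T2, m1, m2):
    """Z statistic of Eq. (3) and R statistic of Eq. (4), summed over all
    observed elements (the heavy/light split via S1, T1 is omitted here)."""
    X, Y = Counter(S2), Counter(T2)
    # Z: centered chi-squared-style term, "linearized" by the Xi + Yi factor.
    Z = sum(((m2 * X[i] - m1 * Y[i]) ** 2 - (m2 ** 2 * X[i] + m1 ** 2 * Y[i]))
            / (X[i] + Y[i])
            for i in set(X) | set(Y))
    # R: penalizes elements seen exactly twice in the second sample but
    # rarely (or never) in the first.
    R = sum(1.0 / (X[i] + 1) for i in Y if Y[i] == 2)
    return Z, R

# Identical samples drive Z negative; disjoint samples drive it positive.
z_same, _ = closeness_statistics([1, 2], [1, 2], m1=2, m2=2)
z_diff, r = closeness_statistics([1, 1], [2, 2], m1=2, m2=2)
assert z_same < 0 < z_diff
```

The acceptance rule would then compare Z against Cγ m1^{3/2} m2 and R against C1 m2^2/m1, with constants from the analysis.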
As our empirical results of Section 4 suggest, for practical applications one may want to use only the Z statistic of (3), and one certainly should not “waste” half the samples to perform the “heavy”/“light” partition.

3 Estimating Mixing Times in Markov Chains

The basic hypothesis testing question of distinguishing identical distributions from those with significant ℓ1 distance can be employed for several other practically relevant tasks. One example is the problem of estimating the mixing time of Markov chains.

Consider a finite Markov chain with state space [n], transition matrix P = ((P(x, y))), and stationary distribution π. The t-step distribution starting at the point x ∈ [n], P^t_x(·), is the probability distribution on [n] obtained by running the chain for t steps starting from x.

Definition 1. The ε-mixing time of a Markov chain with transition matrix P = ((P(x, y))) is defined as tmix(ε) := inf{ t ∈ [n] : sup_{x ∈ [n]} (1/2) Σ_{y ∈ [n]} |P^t_x(y) − π(y)| ≤ ε }.

Definition 2. The average t-step distribution of a Markov chain P with n states is the distribution P̄^t = (1/n) Σ_{x ∈ [n]} P^t_x, that is, the distribution obtained by choosing x uniformly from [n] and walking t steps from the state x.

The connection between closeness testing and testing whether a Markov chain is close to mixing was first observed by Batu et al. [6], who proposed testing the ℓ1 difference between the distributions P^{t0}_x and P̄^{t0}, for every x ∈ [n]. The algorithm leveraged their equal sample-size hypothesis testing 
This yields an\nresults, drawing \u02dcO(n2/3 log n) samples from both the distributions P t0\noverall running time of \u02dcO(n5/3t0).\nHere, we note that our unequal sample-size hypothesis testing algorithm can yield an improved\nt0 is independent of the starting state x, it suf\ufb01ces to take \u02dcO(n)\nruntime. Since the distribution P\n\u221a\nx, for every x \u2208 [n]. This results in a query and\nsamples from P\nn) samples from P t\nruntime complexity of \u02dcO(n3/2t0). We sketch this algorithm below.\n\nt0 once and \u02dcO(\n\nx and P\n\nAlgorithm 2 Testing for Mixing Times in Markov Chains\nGiven t0 \u2208 R and a \ufb01nite Markov chain with state space [n] and transition matrix PPP = ((P (x, y))),\nwe wish to test\n\nH0 : tmix\n\nO\n\n\u2264 t0,\n\nversus H1 : tmix (1/4) > t0.\n\n(5)\n\n(cid:18)\n\n(cid:19)(cid:19)\n\n(cid:18) 1\u221a\n\nn\n\ndistribution.\n\n1. Draw O(log n) samples S1, . . . , SO(log n), each of size Pois(C1n) from the average t0-step\n2. For each state x \u2208 [n] we will distinguish whether ||P t0\n\nt0||1 \u2264 O( 1\u221a\n||P t0\nx \u2212 P\n\u221a\nruns of Algorithm 1, with the i-th run using Si and a fresh set of Pois(O(\nfrom P t\nx.\n\nn ), versus\nt0||1 > 1/4, with probability of error (cid:28) 1/n. We do this by running O(log n)\nn)) samples\n\nx \u2212 P\n\n3. If all n of the (cid:96)1 closeness testing problems are accepted, then we ACCEPT H0.\n\n6\n\n\fThe above testing algorithm can be leveraged to estimate the mixing time of a Markov chain, via the\nn) \u2264\nbasic observation that if tmix(1/4) \u2264 t0, then for any \u03b5, tmix(\u03b5) \u2264 log \u03b5\n\u221a\n2 log n \u00b7 tmix(1/4). 
Because tmix(1/4) and tmix(O(1/\nn)) differ by at most a factor of log n,\nby applying Algorithm 2 for a geometrically increasing sequence of t0\u2019s, and repeating each test\nO(log t0 + log n) times, one obtains Corollary 1, restated below:\nCorollary 1 For a \ufb01nite Markov chain with state space [n] and a next node oracle, there is an\nalgorithm that estimates the mixing time, \u03c4mix, up to a multiplicative factor of log n, that uses\n\u02dcO(n3/2\u03c4mix) time and queries to the next node oracle.\n\n\u221a\nlog 1/2 t0, and thus tmix(1/\n\n4 Empirical Results\nBoth our formal algorithms and the corresponding theorems involve some unwieldy constant factors\n(that can likely be reduced signi\ufb01cantly). Nevertheless, in this section we provide some evidence\nthat the statistic at the core of our algorithm can be fruitfully used in practice, even for surprisingly\nsmall sample sizes.\n\n(cid:80)\n\ni\n\nm3/2\n\n1 m2(Xi+Yi)\n\n2Xi+m2\n\n1Yi)\n\n(m2Xi\u2212m1Yi)2\u2212(m2\n\n4.1 Testing similarity of words\nAn extremely important primitive in natural\nmate the semantic similarity of two words. Here, we show that\n\nlanguage processing is the ability to esti-\nthe Z statistic, Z =\n, which is the core of our testing algorithm, can accurately dis-\ntinguish whether two words are very similar based on surprisingly small samples of the contexts in\nwhich they occur. Speci\ufb01cally, for each pair of words, a, b that we consider, we select m1 random\noccurrences of a and m2 random occurrences of word b from the Google books corpus, using the\nGoogle Books Ngram Dataset.2 We then compare the sample of words that follow a with the sample\nof words that follow b. 
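This comparison of two context samples via the normalized Z statistic can be sketched in a few lines (in Python; the function name and the toy word lists are ours, not the paper's experimental code):

```python
from collections import Counter

def word_similarity_z(contexts_a, contexts_b):
    """Normalized Z statistic applied to two samples of context words
    (e.g., the words observed to follow each target word). Values near or
    below zero suggest matching context distributions; large positive
    values suggest the words are used differently."""
    m1, m2 = len(contexts_a), len(contexts_b)
    X, Y = Counter(contexts_a), Counter(contexts_b)
    z = sum(((m2 * X[w] - m1 * Y[w]) ** 2 - (m2 ** 2 * X[w] + m1 ** 2 * Y[w]))
            / (X[w] + Y[w])
            for w in set(X) | set(Y))
    # Normalize by m1^(3/2) * m2, as in Section 4.1.
    return z / (m1 ** 1.5 * m2)

# Toy check: identical usage pattern vs. completely disjoint contexts.
same = word_similarity_z(["the"] * 8 + ["a"] * 2, ["the"] * 8 + ["a"] * 2)
different = word_similarity_z(["the"] * 10, ["quickly"] * 10)
assert same < 0 < different
```

In practice the inputs would be the sampled bi-gram continuations of each word, with m1 and m2 of unequal sizes as in the experiments below.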
Henceforth, we refer to these as samples of the set of bi-grams involving each word.

Figure 1(a) illustrates the Z statistic for various pairs of words that range from rather similar words like “smart” and “intelligent”, to essentially identical word pairs such as “grey” and “gray” (whose usage differs mainly as a result of historical variation in the preference for one spelling over the other); the sample size of bi-grams containing the first word is fixed at m1 = 1,000, and the sample size corresponding to the second word varies from m2 = 50 through m2 = 1,000. To provide a frame of reference, we also compute the value of the statistic for independent samples corresponding to the same word (i.e. two different samples of words that follow “wolf”); these are depicted in red. For comparison, we also plot the total variation distance between the empirical distributions of the pair of samples, which does not clearly differentiate between pairs of identical words, versus different words, particularly for the smaller sample sizes.

One subtle point is that the issue with using the empirical distance between the distributions goes beyond simply not having a consistent reference point. For example, let X denote a large sample of size m1 from distribution p, X′ denote a small sample of size m2 from p, and Y denote a small sample of size m2 from a different distribution q. It is tempting to hope that the empirical distance between X and X′ will be smaller than the empirical distance between X and Y. 
As Figure 1(b) illustrates, this is not always the case, even for natural distributions: for the specific example illustrated in the figure, over much of the range of m2, the empirical distance between X and X′ is indistinguishable from that of X and Y, though the Z statistic easily discerns that these distributions are very different.

This point is further emphasized in Figure 2, which depicts this phenomenon in the synthetic setting where p = Unif[n] is the uniform distribution over n elements, and q is the distribution whose elements have probabilities (1 ± ε)/n, for ε = 1/2. The second and fourth plots represent the probability that the distance between two empirical distributions of samples from p is smaller than the distance between the empirical distributions of the samples from p and q; the first and third plots represent the analogous probability involving the Z statistic. The first two plots correspond to n = 1,000 and the last two correspond to n = 50,000. In all plots, we consider a pair of samples of respective sizes m1 and m2, as m1 and m2 range between √n and n.

²The Google Books Ngram Dataset is freely available here: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

Figure 1: (a) Two measures of the similarity between words, based on samples of the bi-grams containing each word. Each line represents a pair of words, and is obtained by taking a sample of m1 = 1,000 bi-grams containing the first word, and m2 = 50, …, 1,000 bi-grams containing the second word, where m2 is depicted along the x-axis on a logarithmic scale. In both plots, the red lines represent pairs of identical words (e.g. “wolf/wolf”, “almost/almost”, …). The blue lines represent pairs of similar words (e.g. “wolf/fox”, “almost/nearly”, …
), and the black line represents the pair “grey/gray”, whose distributions of bi-grams differ because of historical variations in the preference for each spelling. Solid lines indicate the average over 200 trials for each word pair and choice of m2, with error bars of one standard deviation depicted. The left plot depicts our statistic, which clearly distinguishes identical words and demonstrates some intuitive sense of semantic distance. The right plot depicts the total variation distance between the empirical distributions, which does not successfully distinguish the identical words, given the range of sample sizes considered. The plot would not be significantly different if other distance metrics between the empirical distributions, such as f-divergence, were used in place of total variation distance. Finally, note the extremely uniform magnitudes of the error bars in the left plot as m2 increases, which is an added benefit of the Xi + Yi normalization term in the Z statistic. (b) Illustration of how the empirical distance can be misleading: here, the empirical distance between the distributions of samples of bi-grams for “wolf/wolf” is indistinguishable from that for the pair “wolf/fox*” over much of the range of m2; nevertheless, our statistic clearly discerns that these are significantly different distributions. Here, “fox*” denotes the distribution of bi-grams whose first word is “fox”, restricted to only the most common 100 bi-grams.

Figure 2: The first and third plots depict the probability that the Z statistic applied to samples of sizes m1, m2 drawn from p = Unif[n] is smaller than the Z statistic applied to a sample of size m1 drawn from p and m2 drawn from q, where q is a perturbed version of p in which all elements have probability (1 ± 1/2)/n.
The second and fourth plots depict the probability that the empirical distance between a pair of samples (of respective sizes m1, m2) drawn from p is less than the empirical distance between a sample of size m1 drawn from p and m2 drawn from q. The first two plots correspond to n = 1,000 and the last two correspond to n = 50,000. In all plots, m1 and m2 range between √n and n on a logarithmic scale. In all plots, the colors depict the average probability based on 100 trials.
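A miniature version of the Figure 2 experiment is easy to run. The sketch below uses illustrative parameters (n, m1, m2, and a 40-trial count far smaller than the figure's sweeps) and helper names of our own; it estimates, for both the Z statistic and the empirical total variation distance, how often samples from p and q look farther apart than two independent samples from p:

```python
import random
from collections import Counter

def z_stat(s1, s2):
    # Sketch of the closeness statistic; s1, s2 are lists of draws.
    m1, m2 = len(s1), len(s2)
    x, y = Counter(s1), Counter(s2)
    return sum(((m2 * x[k] - m1 * y[k]) ** 2 - (m2 ** 2 * x[k] + m1 ** 2 * y[k]))
               / (m1 ** 1.5 * m2 * (x[k] + y[k])) for k in set(x) | set(y))

def tv_stat(s1, s2):
    # Total variation distance between the two empirical distributions.
    m1, m2 = len(s1), len(s2)
    x, y = Counter(s1), Counter(s2)
    return 0.5 * sum(abs(x[k] / m1 - y[k] / m2) for k in set(x) | set(y))

def separation_rate(stat, p, q, m1, m2, trials, rng):
    # Fraction of trials in which stat(p-sample, q-sample) exceeds
    # stat(p-sample, p-sample), i.e. how often the statistic ranks the
    # p-vs-q pair as farther apart than two fresh samples of p.
    support = range(len(p))
    wins = 0
    for _ in range(trials):
        X = rng.choices(support, weights=p, k=m1)
        X2 = rng.choices(support, weights=p, k=m2)
        Y = rng.choices(support, weights=q, k=m2)
        if stat(X, Y) > stat(X, X2):
            wins += 1
    return wins / trials

# p = Unif[n]; q has probabilities (1 +/- eps)/n, as in the Figure 2 setup.
n, eps, m1, m2 = 2000, 0.5, 2000, 2000
p = [1.0 / n] * n
q = [(1 + eps) / n if i % 2 == 0 else (1 - eps) / n for i in range(n)]

rng = random.Random(1)
z_rate = separation_rate(z_stat, p, q, m1, m2, 40, rng)
tv_rate = separation_rate(tv_stat, p, q, m1, m2, 40, rng)
print(z_rate, tv_rate)
```

With sample sizes comfortably above the paper's threshold, the Z statistic's separation rate should be high, while the empirical-distance rate varies with the regime of m1, m2 relative to n.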
\u2013 qm2 || > || pm1 \u2013 pm2 || ] n = 1,000m1m2n 0.5 n 0.75 n n 0.75 nm1n 0.5 n 0.75 n n 0.75 nm2Pr [ || pm1 \u2013 qm2 || > || pm1 \u2013 pm2 || ] n = 50,0001 0.9 0.8 0.7 0.6 0.5\fReferences\n[1] J. Acharya, H. Das, A. Jafarpour, A. Orlitsky, and S. Pan, Competitive closeness testing, COLT, 2011.\n[2] J. Acharya, H. Das, A. Jafarpour, A. Orlitsky, and S. Pan, Competitive classi\ufb01cation and closeness testing.\n\nCOLT, 2012.\n\n[3] J. Acharya, A. Jafarpour, A. Orlitsky, and A. T. Suresh, Sublinear algorithms for outlier detection and\n\ngeneralized closeness testing, ISIT, 3200\u20133204, 2014.\n\n[4] J. Acharya, C. Daskalakis, and G. Kamath, Optimal testing for properties of distributions, NIPS, 2015.\n[5] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Sampling algorithms: lower bounds and applications, STOC,\n\n2001.\n\n[6] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White, Testing that distributions are close, FOCS,\n\n2000.\n\n[7] T. Batu, S. Dasgupta, R. Kumar, and R. Rubinfeld, The complexity of approximating the entropy, SIAM\n\nJournal on Computing, 2005.\n\n[8] T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White, Testing random variables for\n\nindependence and identity, FOCS, 2001.\n\n[9] S.-on Chan, I. Diakonikolas, P. Valiant, G. Valiant, Optimal Algorithms for Testing Closeness of Discrete\n\nDistributions, Symposium on Discrete Algorithms (SODA), 1193\u20131203, 2014,\n\n[10] M. Charikar, S. Chaudhuri, R. Motwani, and V.R. Narasayya, Towards estimation error guarantees for\n\ndistinct values, Symposium on Principles of Database Systems (PODS), 2000.\n\n[11] A. Czumaj and C. Sohler, Testing expansion in bounded-degree graphs, FOCS, 2007.\n[12] O. Goldreich and D. Ron, On testing expansion in bounded-degree graphs, ECCC, TR00-020, 2000.\n[13] S. Guha, A. McGregor, and S. 
Venkatasubramanian, Streaming and sublinear approximation of entropy\n\nand information distances, Symposium on Discrete Algorithms (SODA), 2006.\n\n[14] D. Hsu, A. Kontorovich, and C. Szepesv\u00b4ari, Mixing time estimation in reversible Markov chains from a\n\nsingle sample path, NIPS, 2015.\n\n[15] S. Kale and C. Seshadhri, An expansion tester for bounded degree graphs, ICALP, LNCS, Vol. 5125,\n\n527\u2013538, 2008.\n\n[16] A. Keinan and A. G. Clark. Recent explosive human population growth has resulted in an excess of rare\n\ngenetic variants. Science, 336(6082):740743, 2012.\n\n[17] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times, Amer. Math. Soc., 2009.\n[18] A. Nachmias and A. Shapira, Testing the expansion of a graph, Electronic Colloquium on Computational\n\nComplexity (ECCC), Vol. 14 (118), 2007.\n\n[19] M. R. Nelson and D. Wegmann et al., An abundance of rare functional variants in 202 drug target genes\n\nsequenced in 14,002 people. Science, 337(6090):100104, 2012.\n\n[20] L. Paninski, Estimation of entropy and mutual information, Neural Comp., Vol. 15 (6), 1191\u20131253, 2003.\n[21] L. Paninski, Estimating entropy on m bins given fewer than m samples, IEEE Transactions on Informa-\n\ntion Theory, Vol. 50 (9), 2200\u20132203, 2004.\n\n[22] L. Paninski, A coincidence-based test for uniformity given very sparsely-sampled discrete data, IEEE\n\nTransactions on Information Theory, Vol. 54, 4750\u20134755, 2008.\n\n[23] S. Raskhodnikova, D. Ron, A. Shpilka, and A. Smith, Strong lower bounds for approximating distribution\nsupport size and the distinct elements problem, SIAM Journal on Computing, Vol. 39(3), 813\u2013842, 2009.\n\n[24] R. Rubinfeld, Taming big probability distributions, XRDS, Vol. 19(1), 24\u201328, 2012.\n[25] A. Sinclair and M. Jerrum, Approximate counting, uniform generation and rapidly mixing Markov chains,\n\nInformation and Computation, Vol. 82(1), 93\u2013133, 1989.\n\n[26] J. A. Tennessen, A.W. 
Bigham, and T.D. O\u2019Connor et al. Evolution and functional impact of rare coding\n\nvariation from deep sequencing of human exomes. Science, 337(6090):6469, 2012\n\n[27] G. Valiant and P. Valiant, Estimating the unseen: an n/ log n-sample estimator for entropy and support\n\nsize, shown optimal via new CLTs, STOC, 2011.\n\n[28] G. Valiant and P. Valiant, Estimating the unseen: improved estimators for entropy and other properties,\n\nNIPS, 2013.\n\n[29] G. Valiant and P. Valiant, An Automatic Inequality Prover and Instance Optimal Identity Testing, FOCS,\n\n51\u201360, 2014.\n\n[30] P. Valiant, Testing symmetric properties of distributions, STOC, 2008.\n[31] P. Valiant, Testing Symmetric Properties of Distributions, PhD thesis, M.I.T., 2008.\n\n9\n\n\f", "award": [], "sourceid": 1527, "authors": [{"given_name": "Bhaswar", "family_name": "Bhattacharya", "institution": "Stanford University"}, {"given_name": "Gregory", "family_name": "Valiant", "institution": "Stanford University"}]}