Part of Advances in Neural Information Processing Systems 28 (NIPS 2015)
Bhaswar Bhattacharya, Gregory Valiant
We consider the problem of testing whether two unequal-sized samples were drawn from identical distributions, versus distributions that differ significantly. Specifically, given a target error parameter $\epsilon > 0$, $m_1$ independent draws from an unknown distribution $p$ with discrete support, and $m_2$ draws from an unknown distribution $q$ with discrete support, we describe a test for distinguishing the case that $p = q$ from the case that $\|p - q\|_1 \ge \epsilon$. If $p$ and $q$ are supported on at most $n$ elements, then our test is successful with high probability provided $m_1 \ge n^{2/3}/\epsilon^{4/3}$ and $m_2 = \Omega(\max\{n/(\sqrt{m_1}\,\epsilon^2),\ \sqrt{n}/\epsilon^2\})$. We show that this tradeoff is information-theoretically optimal throughout this range, in its dependence on all parameters, $n$, $m_1$, and $\epsilon$, to constant factors. As a consequence, we obtain an algorithm for estimating the mixing time of a Markov chain on $n$ states, up to a $\log n$ factor, that uses $\tilde{O}(n^{3/2} \tau_{mix})$ queries to a "next node" oracle. The core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic data and on natural language data. We believe that this statistic might prove to be a useful primitive within larger machine learning and natural language processing systems.
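The abstract does not spell the statistic out, so the following is a minimal Python sketch of a statistic of this general flavor: per-element counts from the two samples, each rescaled by the size of the other sample, combined chi-squared-style with a correction term. The function name and the toy distributions are our own illustrative choices, and the paper's exact statistic and acceptance threshold may differ.

    import random
    from collections import Counter

    def closeness_statistic(sample1, sample2):
        # Sketch of an unequal-sample closeness statistic: counts from each
        # sample are rescaled by the size of the other sample, and a
        # correction term keeps the expectation near zero when p = q
        # (under Poissonized sampling). Illustrative only; the paper's
        # exact statistic and threshold may differ.
        m1, m2 = len(sample1), len(sample2)
        x, y = Counter(sample1), Counter(sample2)
        z = 0.0
        for i in set(x) | set(y):
            xi, yi = x[i], y[i]
            z += ((m2 * xi - m1 * yi) ** 2
                  - (m2 ** 2 * xi + m1 ** 2 * yi)) / (xi + yi)
        return z

    # Toy usage: identical versus clearly different distributions.
    random.seed(0)
    p_draws = [random.randrange(100) for _ in range(5000)]  # m1 draws from p
    q_same  = [random.randrange(100) for _ in range(500)]   # q = p
    q_diff  = [random.randrange(50)  for _ in range(500)]   # q far from p in L1
    print(closeness_statistic(p_draws, q_same))  # small, relative to the value below
    print(closeness_statistic(p_draws, q_diff))  # large positive

A test of this shape accepts $p = q$ when the statistic falls below a threshold calibrated to $n$, $m_1$, $m_2$, and $\epsilon$; the rescaling by the opposite sample size is what lets the two samples be of very different sizes.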