{"title": "A Fast, Consistent Kernel Two-Sample Test", "book": "Advances in Neural Information Processing Systems", "page_first": 673, "page_last": 681, "abstract": "A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P=Q) an infinite weighted sum of $\\chi^2$ random variables. The main result of the present work is a novel, consistent estimate of this null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q. This estimate may be computed faster than a previous consistent estimate based on the bootstrap. Another prior approach was to compute the null distribution based on fitting a parametric family with the low order moments of the test statistic: unlike the present work, this heuristic has no guarantee of being accurate or consistent. We verify the performance of our null distribution estimate on both an artificial example and on high dimensional multivariate data.", "full_text": "A Fast, Consistent Kernel Two-Sample Test\n\nArthur Gretton\n\nCarnegie Mellon University\n\nMPI for Biological Cybernetics\n\narthur.gretton@gmail.com\n\nKenji Fukumizu\n\nInst. of Statistical Mathematics\n\nTokyo Japan\n\nfukumizu@ism.ac.jp\n\nZaid Harchaoui\n\nCarnegie Mellon University\n\nPittsburgh, PA, USA\n\nzaid.harchaoui@gmail.com\n\nBharath K. Sriperumbudur\n\nDept. 
of ECE, UCSD
La Jolla, CA 92037
bharathsv@ucsd.edu

Abstract

A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P = Q) an infinite weighted sum of χ² random variables. Prior finite sample approximations to the null distribution include using bootstrap resampling, which yields a consistent estimate but is computationally costly; and fitting a parametric model with the low order moments of the test statistic, which can work well in practice but has no consistency or accuracy guarantees. The main result of the present work is a novel estimate of the null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q, and having lower computational cost than the bootstrap. A proof of consistency of this estimate is provided. The performance of the null distribution estimate is compared with the bootstrap and parametric approaches on an artificial example, high dimensional multivariate data, and text.

1 Introduction

Learning algorithms based on kernel methods have enjoyed considerable success in a wide range of supervised learning tasks, such as regression and classification [25]. 
One reason for the popularity of these approaches is that they solve difficult non-parametric problems by representing the data points in high dimensional spaces of features, specifically reproducing kernel Hilbert spaces (RKHSs), in which linear algorithms can be brought to bear. While classical kernel methods have addressed the mapping of individual points to feature space, more recent developments [14, 29, 28] have focused on the embedding of probability distributions in RKHSs. When the embedding is injective, the RKHS is said to be characteristic [11, 29, 12], and the distance between feature mappings constitutes a metric on distributions. This distance is known as the maximum mean discrepancy (MMD).

One well-defined application of the MMD is in testing whether two samples are drawn from two different distributions (i.e., a two-sample or homogeneity test). For instance, we might wish to find whether DNA microarrays obtained on the same tissue type by different labs are distributed identically, or whether differences in lab procedure are such that the data have dissimilar distributions (and cannot be aggregated) [8]. Other applications include schema matching in databases, where tests of distribution similarity can be used to determine which fields correspond [14], and speaker verification, where MMD can be used to identify whether a speech sample corresponds to a person for whom previously recorded speech is available [18].

A major challenge when using the MMD in two-sample testing is in obtaining a significance threshold, which the MMD should exceed with small probability when the null hypothesis (that the samples share the same generating distribution) is satisfied. 
Following [14, Section 4], we define this threshold as an upper quantile of the asymptotic distribution of the MMD under the null hypothesis. Unfortunately this null distribution takes the form of an infinite weighted sum of χ² random variables. Thus, obtaining a consistent finite sample estimate of this threshold (that is, an estimate that converges to the true threshold in the infinite sample limit) is a significant challenge. Three approaches have previously been applied: distribution-free large deviation bounds [14, Section 3], which are generally too loose for practical settings; fitting to the Pearson family of densities [14], a simple heuristic that performs well in practice, but has no guarantees of accuracy or consistency; and a bootstrap approach, which is guaranteed to be consistent, but has a high computational cost.

The main contribution of the present study is a consistent finite sample estimate of the null distribution (not based on bootstrap), and a proof that this estimate converges to the true null distribution in the infinite sample limit. Briefly, the infinite sequence of weights that defines the null distribution is identical to the sequence of normalized eigenvalues obtained in kernel PCA [26, 27, 7]. Thus, we show that the null distribution defined using finite sample estimates of these eigenvalues converges to the population distribution, using only convergence results on certain statistics of the eigenvalues. In experiments, our new estimate of the test threshold has a smaller computational cost than that of resampling-based approaches such as the bootstrap, while providing performance as good as the alternatives for larger sample sizes.

We begin our presentation in Section 2 by describing how probability distributions may be embedded in an RKHS. 
We also review the maximum mean discrepancy as our chosen distance measure on these embeddings, and recall the asymptotic behaviour of its finite sample estimate. In Section 3, we present both moment-based approximations to the null distribution of the MMD (which have no consistency guarantees), and our novel, consistent estimate of the null distribution, based on the spectrum of the kernel matrix over the aggregate sample. Our experiments in Section 4 compare the different approaches on an artificial dataset, and on high-dimensional microarray and neuroscience data. We also demonstrate the generality of a kernel-based approach by testing whether two samples of text are on the same topic, or on different topics.

2 Background

In testing whether two samples are generated from the same distribution, we require both a measure of distance between probabilities, and a notion of whether this distance is statistically significant. For the former, we define an embedding of probability distributions in a reproducing kernel Hilbert space (RKHS), such that the distance between these embeddings is our test statistic. For the latter, we give an expression for the asymptotic distribution of this distance measure, from which a significance threshold may be obtained.

Let $\mathcal{F}$ be an RKHS on the separable metric space $\mathcal{X}$, with a continuous feature mapping $\phi(x) \in \mathcal{F}$ for each $x \in \mathcal{X}$. The inner product between feature mappings is given by the positive definite kernel function $k(x, x') := \langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$. We assume in the following that the kernel $k$ is bounded. Let $\mathcal{P}$ be the set of Borel probability measures on $\mathcal{X}$. 
Following [4, 10, 14], we define the mapping to $\mathcal{F}$ of $P \in \mathcal{P}$ as the expectation of $\phi(x)$ with respect to $P$, or

$$\mu_P : \mathcal{P} \to \mathcal{F}, \qquad P \mapsto \int_{\mathcal{X}} \phi(x)\, dP.$$

The maximum mean discrepancy (MMD) [14, Lemma 7] is defined as the distance between two such mappings,

$$\mathrm{MMD}(P, Q) := \| \mu_P - \mu_Q \|_{\mathcal{F}} = \left( \mathbf{E}_{x,x'} k(x, x') + \mathbf{E}_{y,y'} k(y, y') - 2\, \mathbf{E}_{x,y} k(x, y) \right)^{1/2},$$

where $x$ and $x'$ are independent random variables drawn according to $P$, $y$ and $y'$ are independent and drawn according to $Q$, and $x$ is independent of $y$. This quantity is a pseudo-metric on distributions: that is, it satisfies all the properties of a metric besides $\mathrm{MMD}(P, Q) = 0$ iff $P = Q$. For MMD to be a metric, we require that the kernel be characteristic [11, 29, 12].¹ This criterion is satisfied for many common kernels, such as the Gaussian kernel (both on compact domains and on $\mathbb{R}^d$) and the $B_{2l+1}$ spline kernel on $\mathbb{R}^d$.

We now consider two possible empirical estimates of the MMD, based on i.i.d. samples $(x_1, \ldots, x_m)$ from $P$ and $(y_1, \ldots, y_m)$ from $Q$ (we assume an equal number of samples for simplicity). An unbiased estimate of $\mathrm{MMD}^2$ is the one-sample U-statistic

$$\mathrm{MMD}^2_u := \frac{1}{m(m-1)} \sum_{i \neq j}^{m} h(z_i, z_j), \qquad (1)$$

where $z_i := (x_i, y_i)$ and $h(z_i, z_j) := k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)$. We also define the biased estimate $\mathrm{MMD}^2_b$ by replacing the U-statistic in (1) with a V-statistic (the sum then includes terms $i = j$).

Our goal is to determine whether $P$ and $Q$ differ, based on $m$ samples from each. To this end, we require a measure of whether $\mathrm{MMD}^2_u$ differs significantly from zero; or, if the biased statistic $\mathrm{MMD}^2_b$ is used, whether this value is significantly greater than its expectation when $P = Q$. In other words we conduct a hypothesis test with null hypothesis $H_0$ defined as $P = Q$, and alternative hypothesis $H_1$ as $P \neq Q$. 
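Equation (1) maps directly to a few lines of code. The following sketch (our own Python/NumPy, not from the paper, whose experiments used Matlab; the Gaussian kernel and function names are our choices) computes MMD²_u from the three Gram matrix blocks:

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def mmd2_u(X, Y, sigma=1.0):
    """Unbiased one-sample U-statistic MMD^2_u of equation (1).

    X, Y are (m, d) arrays holding m samples from P and Q respectively.
    All i = j terms are excluded, matching the sum over i != j.
    """
    m = X.shape[0]
    Kxx = gaussian_gram(X, X, sigma)
    Kyy = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    # h(z_i, z_j) summed over i != j: subtract the diagonal of each block.
    sum_xx = Kxx.sum() - np.trace(Kxx)
    sum_yy = Kyy.sum() - np.trace(Kyy)
    sum_xy = Kxy.sum() - np.trace(Kxy)
    return (sum_xx + sum_yy - 2.0 * sum_xy) / (m * (m - 1))
```

For identical samples the statistic is exactly zero, and under H0 it may fall slightly below zero, consistent with its unbiasedness.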
We must therefore specify a threshold that the empirical MMD will exceed with small probability, when $P = Q$. For an asymptotic false alarm probability (Type I error) of $\alpha$, an appropriate threshold is the $1 - \alpha$ quantile of the asymptotic distribution of the empirical MMD assuming $P = Q$. According to [14, Theorem 8], this distribution takes the form

$$m\, \mathrm{MMD}^2_u \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l (z_l^2 - 2), \qquad (2)$$

where $\xrightarrow{D}$ denotes convergence in distribution, $z_l \sim N(0, 2)$ i.i.d., and the $\lambda_l$ are the solutions to the eigenvalue equation

$$\int_{\mathcal{X}} \tilde{k}(x_i, x_j) \psi_l(x_i)\, dP := \lambda_l \psi_l(x_j), \qquad (3)$$

where $\tilde{k}(x_i, x_j) := k(x_i, x_j) - \mathbf{E}_x k(x_i, x) - \mathbf{E}_x k(x, x_j) + \mathbf{E}_{x,x'} k(x, x')$. Consistency in power of the resulting hypothesis test (that is, the convergence of its Type II error to zero for increasing $m$) is shown in [14].

The eigenvalue problem (3) has been studied extensively in the context of kernel PCA [26, 27, 7]: this connection will be used in obtaining a finite sample estimate of the null distribution in (2), and we summarize certain important results. Following [3, 10], we define the covariance operator $C : \mathcal{F} \to \mathcal{F}$ as

$$\langle f, C f \rangle_{\mathcal{F}} := \mathrm{var}(f(x)) = \mathbf{E}_x f^2(x) - \left[ \mathbf{E}_x f(x) \right]^2. \qquad (4)$$

The eigenvalues $\lambda_l$ of $C$ are the solutions to the eigenvalue problem in (3) [19, Proposition 2]. Following e.g. [27, p. 2511], empirical estimates of these eigenvalues are

$$\hat{\lambda}_l = \frac{1}{m} \nu_l, \qquad (5)$$

where $\nu_l$ are the eigenvalues of the centered Gram matrix $\tilde{K} := H K H$, where $K_{i,j} := k(x_i, x_j)$ and $H = I - \frac{1}{m} \mathbf{1} \mathbf{1}^\top$ is a centering matrix. Finally, by subtracting $m\, \mathrm{MMD}^2_u$ from $m\, \mathrm{MMD}^2_b$, we observe that these differ by a quantity with expectation $\mathrm{tr}(C) = \sum_{l=1}^{\infty} \lambda_l$, and thus

$$m\, \mathrm{MMD}^2_b \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l z_l^2.$$

¹Other interpretations of the MMD are also possible, for particular kernel choices. 
The most closely related is the $L_2$ distance between probability density estimates [1], although this requires the kernel bandwidth to decrease with increasing sample size. See [1, 14] for more detail. Yet another interpretation is given in [32].

3 Theory

In the present section, we describe three approaches for approximating the null distribution of MMD. We first present the Pearson curve and Gamma-based approximations, which consist of parametrized families of distributions that we fit by matching the low order moments of the empirical MMD. Such approximations can be accurate in practice, although they remain heuristics with no consistency guarantees. Second, we describe a null distribution estimate based on substituting the empirical estimates (5) of the eigenvalues into (2). We prove that this estimate converges to its population counterpart in the large sample limit.

3.1 Moment-based null distribution estimates

The Pearson curves and the Gamma approximation are both based on the low order moments of the empirical MMD. The second and third moments for MMD are obtained in [14]:

$$\mathbf{E}\left( \left[ \mathrm{MMD}^2_u \right]^2 \right) = \frac{2}{m(m-1)}\, \mathbf{E}_{z,z'}\left[ h^2(z, z') \right] \qquad (6)$$

and

$$\mathbf{E}\left( \left[ \mathrm{MMD}^2_u \right]^3 \right) = \frac{8(m-2)}{m^2 (m-1)^2}\, \mathbf{E}_{z,z'}\left[ h(z, z')\, \mathbf{E}_{z''}\left( h(z, z'') h(z', z'') \right) \right] + O(m^{-4}). \qquad (7)$$

Pearson curves take as arguments the variance, skewness, and kurtosis. As in [14], we replace the kurtosis with a lower bound due to [31], $\mathrm{kurt}\left( \mathrm{MMD}^2_u \right) \ge \left( \mathrm{skew}\left( \mathrm{MMD}^2_u \right) \right)^2 + 1$. An alternative, more computationally efficient approach is to use a two-parameter Gamma approximation [20, p. 343, p. 359],

$$m\, \mathrm{MMD}_b(Z) \sim \frac{x^{\alpha - 1} e^{-x/\beta}}{\beta^{\alpha} \Gamma(\alpha)}, \quad \text{where} \quad \alpha = \frac{\left( \mathbf{E}(\mathrm{MMD}_b(Z)) \right)^2}{\mathrm{var}(\mathrm{MMD}_b(Z))}, \quad \beta = \frac{m\, \mathrm{var}(\mathrm{MMD}_b(Z))}{\mathbf{E}(\mathrm{MMD}_b(Z))}, \qquad (8)$$

and we use the biased statistic $\mathrm{MMD}^2_b$. Although the Gamma approximation is necessarily less accurate than the Pearson approach, it has a substantially lower computational cost ($O(m^2)$ for the Gamma approximation, as opposed to $O(m^3)$ for Pearson). Moreover, we will observe in our experiments that it performs remarkably well, at a substantial cost saving over the Pearson curves.

3.2 Null distribution estimates using Gram matrix spectrum

In [14, Theorem 8], it was established that for large sample sizes, the null distribution of MMD approaches an infinite weighted sum of independent $\chi^2_1$ random variables, the weights being the population eigenvalues of the covariance operator $C$. Hence, an efficient and theoretically grounded way to calibrate the test is to compute the quantiles by replacing the population eigenvalues of $C$ with their empirical counterparts, as computed from the Gram matrix (see also [18], where a similar strategy is proposed for the KFDA test with fixed regularization).

The following result shows that this empirical estimate of the null distribution converges in distribution to its population counterpart. In other words, a test using the MMD statistic, with the threshold computed from quantiles of the null distribution estimate, is asymptotically consistent in level.

Theorem 1 Let $z_1, \ldots, z_l, \ldots$ be an infinite sequence of i.i.d. random variables, with $z_1 \sim N(0, 2)$. Assume $\sum_{l=1}^{\infty} \lambda_l^{1/2} < \infty$. 
Then, as $m \to \infty$,

$$\sum_{l=1}^{\infty} \hat{\lambda}_l (z_l^2 - 2) \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l (z_l^2 - 2).$$

Furthermore, as $m \to \infty$,

$$\sup_t \left| P\left( m\, \mathrm{MMD}^2_u > t \right) - P\left( \sum_{l=1}^{\infty} \hat{\lambda}_l (z_l^2 - 2) > t \right) \right| \to 0.$$

Proof (sketch) We begin with a proof of conditions under which the sum $\sum_{l=1}^{\infty} \lambda_l (z_l^2 - 2)$ is finite w.p. 1. According to [16, Exercise 30, p. 358], we may use Kolmogorov's inequality to determine that this sum converges a.s. if

$$\sum_{l=1}^{\infty} \mathbf{E}_z\left[ \lambda_l^2 (z_l^2 - 2)^2 \right] < \infty,$$

from which it follows that the covariance operator must be Hilbert-Schmidt: this is guaranteed by the assumption $\sum_{l=1}^{\infty} \lambda_l^{1/2} < \infty$ (see also [7]). We now proceed to the convergence result. Let $C$ and $\hat{C}$ be the covariance operator and its empirical estimator, and let $\lambda_l$ and $\hat{\lambda}_l$ ($l = 1, 2, \ldots$) be the eigenvalues of $C$ and $\hat{C}$, respectively, in descending order. We want to prove

$$\sum_{l=1}^{\infty} (\hat{\lambda}_l - \lambda_l) Z_l^2 \to 0 \qquad (9)$$

in probability as $m \to \infty$, where $Z_l \sim N(0, 2)$ are i.i.d. random variables. The constant $-2$ in $Z_l^2 - 2$ can be neglected, as $\mathrm{Tr}[\hat{C}] \to \mathrm{Tr}[C]$; the proof is given in the online supplement. Thus

$$\Bigl| \sum_l (\hat{\lambda}_l - \lambda_l) Z_l^2 \Bigr| \le \Bigl| \sum_l \hat{\lambda}_l^{1/2} \bigl( \hat{\lambda}_l^{1/2} - \lambda_l^{1/2} \bigr) Z_l^2 \Bigr| + \Bigl| \sum_l \bigl( \hat{\lambda}_l^{1/2} - \lambda_l^{1/2} \bigr) \lambda_l^{1/2} Z_l^2 \Bigr|$$
$$\le \Bigl\{ \sum_l \hat{\lambda}_l Z_l^4 \Bigr\}^{1/2} \Bigl\{ \sum_l \bigl| \hat{\lambda}_l^{1/2} - \lambda_l^{1/2} \bigr|^2 \Bigr\}^{1/2} + \Bigl\{ \sum_l \lambda_l Z_l^4 \Bigr\}^{1/2} \Bigl\{ \sum_l \bigl| \hat{\lambda}_l^{1/2} - \lambda_l^{1/2} \bigr|^2 \Bigr\}^{1/2} \quad \text{(Cauchy-Schwarz).} \qquad (10)$$

We now establish that $\sum_l \lambda_l Z_l^4$ and $\sum_l \hat{\lambda}_l Z_l^4$ are $O_p(1)$. The former follows from Chebyshev's inequality. To prove the latter, we use that since $\hat{\lambda}_i$ and $Z_i$ are independent,

$$\mathbf{E} \sum_i \hat{\lambda}_i Z_i^4 = \sum_i \mathbf{E}[\hat{\lambda}_i]\, \mathbf{E}[Z_i^4] = \kappa\, \mathbf{E}[\mathrm{tr}(\hat{C})], \qquad (11)$$

where $\kappa = \mathbf{E}[Z^4]$. Since $\mathbf{E}[\mathrm{tr}(\hat{C})]$ is bounded when the kernel has bounded expectation, we again have the desired result by Chebyshev's inequality. The proof is complete if we show

$$\sum_l \bigl( \hat{\lambda}_l^{1/2} - \lambda_l^{1/2} \bigr)^2 = o_p(1). \qquad (12)$$

From $\bigl| \hat{\lambda}_l^{1/2} - \lambda_l^{1/2} \bigr| \bigl( \hat{\lambda}_l^{1/2} + \lambda_l^{1/2} \bigr) = \bigl| \hat{\lambda}_l - \lambda_l \bigr|$, we have

$$\sum_l \bigl( \hat{\lambda}_l^{1/2} - \lambda_l^{1/2} \bigr)^2 \le \sum_l \bigl| \hat{\lambda}_l - \lambda_l \bigr|. \qquad (13)$$

It is known as an extension of the Hoffmann-Wielandt inequality that $\sum_l | \hat{\lambda}_l - \lambda_l | \le \| \hat{C} - C \|_1$, where $\| \cdot \|_1$ is the trace norm (see [23], also shown in [5, p. 490]). Using [18, Prop. 12], which gives $\| \hat{C} - C \|_1 \to 0$ in probability, the proof of the first statement is completed. The proof of the second statement follows immediately from the Polya theorem [21], as in [18].

3.3 Discussion

We now have several ways to calibrate the MMD test statistic, ranked in order of increasing computational cost: 1) the Gamma approximation; 2) the "empirical null distribution", that is, the null distribution estimate using the empirical Gram matrix spectrum; and 3) the Pearson curves and the resampling procedures (subsampling or bootstrap with replacement). We include the final two approaches in the same cost category since even though the Pearson approach scales worse with $m$ than the bootstrap ($O(m^3)$ vs $O(m^2)$), the bootstrap has a higher cost for sample sizes less than about $10^3$ due to the requirement to repeatedly re-compute the test statistic. We also note that our result of large-sample consistency in level holds under a restrictive condition on the decay of the spectrum of the covariance operator, whereas the Gamma approximation calculations are straightforward and remain possible for any spectrum decay behaviour. The Gamma approximation remains a heuristic, however, and we give an example of a distribution and kernel for which it performs less accurately than the spectrum-based estimate in the upper tail, which is of most interest for testing purposes.

4 Experiments

In this section, we compare the four approaches to obtaining the null distribution, both in terms of the approximation error computed with respect to simulations from the true null, and when used in homogeneity testing. 
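To make the spectrum-based procedure of Section 3.2 concrete, the following sketch (our own Python/NumPy, not the authors' code) computes the empirical eigenvalues (5) from the centered Gram matrix of the aggregate sample and simulates the null distribution (2) by Monte Carlo:

```python
import numpy as np

def spec_null_samples(K, num_draws=1000, rng=None):
    """Simulate the null (2) with the population eigenvalues replaced by
    empirical estimates (5) from the centered Gram matrix.

    K is the (n, n) Gram matrix on the aggregate sample (n = 2m points).
    Returns num_draws samples of  sum_l lambda_hat_l * (z_l^2 - 2),
    where z_l ~ N(0, 2) i.i.d.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    nu = np.linalg.eigvalsh(H @ K @ H)         # spectrum of H K H
    lam = nu[nu > 1e-12] / n                   # empirical eigenvalues (5)
    z2 = rng.normal(0.0, np.sqrt(2.0), size=(num_draws, lam.size)) ** 2
    return z2 @ lam - 2.0 * lam.sum()

def spec_threshold(K, alpha=0.05, num_draws=1000, rng=None):
    """(1 - alpha) quantile of the estimated null; m * MMD^2_u is compared
    against this threshold."""
    return np.quantile(spec_null_samples(K, num_draws, rng), 1.0 - alpha)
```

The dominant cost is the eigendecomposition, O(n³) once per test, versus recomputing the statistic hundreds of times for the bootstrap.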
Our approaches are denoted Gamma (the two-parameter Gamma approximation), Pears (the Pearson curves based on the first three moments, using a lower bound for the kurtosis), Spec (our new approximation to the null distribution, using the Gram matrix eigenspectrum), and Boot (the bootstrap approach).

Artificial data: We first provide an example of a distribution $P$ for which the heuristics Gamma and Pears have difficulty in approximating the null distribution, whereas Spec converges. We chose $P$ to be a mixture of normals $P = 0.5\, N(-1, 0.44) + 0.5\, N(+1, 0.44)$, and $k$ a Gaussian kernel with bandwidth ranging over $\sigma = 2^{-4}, 2^{-3}, 2^{-2}, 2^{-1}, 2^{0}, 2^{1}, 2^{2}$. The sample sizes were set to $m = 5000$, the total sample size hence being 10,000, and the results were averaged over 50,000 replications. The eigenvalues of the Gram matrix were estimated in this experiment using [13], which is slower but more accurate than standard Matlab routines. The true quantiles of the MMD null distribution, referred to as the oracle quantiles, were estimated by Monte Carlo simulations with 50,000 runs. We report the empirical performance of Spec compared to the oracle in terms of $\Delta_q = \max_{t_r \ge t_q} \left| P(m\, \mathrm{MMD}^2_u > t_r) - \hat{P}_m(m\, \mathrm{MMD}^2_u > t_r) \right|$, where $t_q$ is such that $P(m\, \mathrm{MMD}^2_u \le t_q) = q$ for $q = 0.6, 0.7, 0.8, 0.9$, and $\hat{P}_m$ is the Spec null distribution estimate obtained with $m$ samples from each of $P$ and $Q$. We also use this performance measure for the Gamma and Pears approximations. This focuses the performance comparison on the quantiles corresponding to the upper tail of the null distribution, while still addressing uniform accuracy over a range of thresholds so as to ensure reliable p-values. The results are shown in Figure 1, and demonstrate that for this combination of distribution and kernel, Spec performs almost uniformly better than both Gamma and Pears. 
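The Gamma baseline is equally simple once the two null moments are in hand. In the sketch below (our own Python with SciPy; `mean_mmd_b` and `var_mmd_b` are placeholders for the null mean and variance of MMD²_b, whose estimation is a separate step), we fit (8) by moment matching and return the test threshold:

```python
from scipy.stats import gamma

def gamma_threshold(mean_mmd_b, var_mmd_b, m, alpha=0.05):
    """Two-parameter Gamma approximation (8) to the null of m * MMD^2_b.

    mean_mmd_b, var_mmd_b: null mean and variance of the biased statistic
    (assumed given here). Returns the 1 - alpha quantile of the fitted
    Gamma, i.e. the significance threshold for m * MMD^2_b.
    """
    a = mean_mmd_b**2 / var_mmd_b      # shape alpha in (8)
    b = m * var_mmd_b / mean_mmd_b     # scale beta in (8)
    return gamma.ppf(1.0 - alpha, a, scale=b)
```

As a sanity check on the parametrization, the fitted Gamma has mean a·b = m·mean_mmd_b and variance a·b² = m²·var_mmd_b, matching the first two moments of m·MMD²_b.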
We emphasize that the performance advantage of Spec is greatest when we restrict ourselves to higher quantiles, which are of most interest in testing.

Figure 1: Evolution of Δ_q for the Gamma (Gam), Spectrum (Spec), and Pearson (Pears) approximations to the null distribution, as the Gaussian kernel bandwidth parameter varies. From left to right, plots of Δ_q versus σ = 2⁻⁴, 2⁻³, ..., 2² for q = 0.6, 0.7, 0.8, 0.9.

Benchmark data: We next demonstrate the performance of the MMD tests on a number of multivariate datasets, taken from [14, Table 1]. We compared microarray data from normal and tumor tissues (Health status), microarray data from different subtypes of cancer (Subtype), and local field potential (LFP) electrode recordings from the Macaque primary visual cortex (V1) with and without spike events (Neural Data I and II, described in [24]). 
In all cases, we were provided with two samples having different statistical properties, where the detection of these differences was made difficult by the high data dimensionality (for the microarray data, density estimation is impossible given the small sample size and high data dimensionality, and a successful test cannot rely on accurate density estimates as an intermediate step).

In computing the null distributions for both the Spec and Pears cases, we drew 500 samples from the associated null distribution estimates, and computed the test thresholds using the resulting empirical quantiles. For the Spec case, we computed the eigenspectrum on the Gram matrix of the aggregate data from P and Q, retaining in all circumstances the maximum number 2m − 1 of nonzero eigenvalues of the empirical Gram matrix. This is a conservative approach, given that the Gram matrix spectrum may decay rapidly [2, Appendix C], in which case it might be possible to safely discard the smallest eigenvalues. For the bootstrap approach Boot, we aggregated points from the two samples, then assigned these randomly without replacement to P and Q. In our experiments, we performed 500 such iterations, and used the resulting histogram of MMD values as our null distribution. We used a Gaussian kernel in all cases, with the bandwidth set to the median distance between points in the aggregation of samples from P and Q.

We applied our tests to the benchmark data as follows: given datasets A and B, we either drew one sample with replacement from A and the other from B (in which case a Type II error was made when the null hypothesis H₀ was accepted); or we drew both samples with replacement from a single pool consisting of A and B combined (in which case a Type I error was made when H₀ was rejected; here H₀ should be accepted a fraction 1 − α of the time). This procedure was repeated 1000 times to obtain average performance figures. 
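The Boot procedure and the median-distance bandwidth described above can be sketched as follows (again our own Python/NumPy, under the assumption that the biased statistic MMD²_b is recomputed for each random reassignment):

```python
import numpy as np

def median_bandwidth(Z):
    """Median heuristic: bandwidth = median pairwise distance in the
    aggregate sample Z of shape (n, d)."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return float(np.median(np.sqrt(sq[np.triu_indices_from(sq, k=1)])))

def boot_null(K, m, num_perms=500, rng=None):
    """Resampling null: reassign the 2m aggregated points to the two
    samples without replacement and recompute the biased MMD^2_b.

    K is the (2m, 2m) Gram matrix on the aggregate sample. Returns the
    num_perms recomputed statistics, whose histogram estimates the null.
    """
    rng = np.random.default_rng() if rng is None else rng
    stats = np.empty(num_perms)
    for t in range(num_perms):
        idx = rng.permutation(2 * m)
        a, b = idx[:m], idx[m:]
        # Biased V-statistic: || mean embedding of a - mean embedding of b ||^2
        stats[t] = (K[np.ix_(a, a)].mean() + K[np.ix_(b, b)].mean()
                    - 2.0 * K[np.ix_(a, b)].mean())
    return stats
```

Each iteration touches the full Gram matrix, which is why the bootstrap dominates the runtime for moderate sample sizes despite its O(m²) per-iteration scaling.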
We summarize our results in Table 1. Note that an extensive benchmark of the MMD Boot and Pears tests against other nonparametric approaches to two-sample testing is provided in [14]: these include the Friedman-Rafsky generalisation of the Kolmogorov-Smirnov and Wald-Wolfowitz tests [9], the Biau-Györfi test [6], and the Hall-Tajvidi test [17]. See [14] for details.

We observe that the kernel tests perform extremely well on these data: the Type I error is in the great majority of cases close to its design value of 1 − α, and the Type II error is very low (and often zero). The Spec test is occasionally slightly conservative, and has a lower Type I error than required: this is most pronounced in the Health Status dataset, for which the sample size m is low. The computational cost shows the expected trend, with Gamma being least costly, followed by Spec, Pears, and finally Boot (this trend is only visible for the larger m = 500 datasets). Note that for yet larger sample sizes, however, we expect the cost of Pears to exceed that of the remaining methods, due to its O(m³) cost requirement (vs O(m²) for the other approaches).

Dataset          Attribute         Gamma        Pears        Spec         Boot
Neural Data I    Type I/Type II    0.95 / 0.00  0.96 / 0.00  0.96 / 0.00  0.96 / 0.00
                 Time (sec)        0.06         3.92         2.79         5.79
Neural Data II   Type I/Type II    0.96 / 0.00  0.96 / 0.00  0.97 / 0.00  0.96 / 0.00
                 Time (sec)        0.08         3.97         2.91         8.08
Health status    Type I/Type II    0.96 / 0.00  0.96 / 0.00  0.98 / 0.00  0.95 / 0.00
                 Time (sec)        0.01         0.01         0.01         0.03
Subtype          Type I/Type II    0.95 / 0.02  0.95 / 0.01  0.96 / 0.01  0.94 / 0.01
                 Time (sec)        0.05         0.05         0.05         0.07

Table 1: Benchmarks for the kernel two-sample tests on high dimensional multivariate data. Type I and Type II errors are provided, as are average run times. 
Sample size (dimension): Neural I 500 (63); Neural II 500 (100); Health Status 25 (12,600); Subtype 25 (2,118).

Finally, we demonstrate the performance of the test on structured (text) data. Our data are taken from the Canadian Hansard corpus (http://www.isi.edu/natural-language/download/hansard/). As in the earlier work on dependence testing presented in [15], debate transcripts on the three topics of agriculture, fisheries, and immigration were used. Transcripts were in English and French; however, we confine ourselves to reporting results on the English data (the results on the French data were similar). Our goal was to distinguish samples on different topics, for instance P being drawn from transcripts on agriculture and Q from transcripts on immigration (in the null case, both samples were from the same topic). The data were processed following the same procedures as in [15]. We investigated two different kernels on text: the k-substring kernel of [22, 30] with k = 10, and a bag-of-words kernel. In both cases, we computed kernels between five-line extracts, ignoring lines shorter than five words long. Results are presented in Figure 2, and represent an average over all three combinations of different topic pairs: agriculture-fisheries, agriculture-immigration, and fisheries-immigration. For each topic pairing, results are averaged over 300 repetitions.

Figure 2: Canadian Hansard data. Left: Average Type II error over all of agriculture-fisheries, agriculture-immigration, and fisheries-immigration, for the bag-of-words kernel. Center: Average Type II error for the k-substring kernel. Right: Eigenspectrum of a centered Gram matrix obtained by drawing m = 10 points from each of P and Q, where P ≠ Q, for the bag-of-words kernel.

We observe that in general, the MMD is very effective at distinguishing distributions of text fragments on different topics: for sample sizes above 30, all the test procedures are able to detect differences in distribution with zero Type II error, for both kernels. When the k-substring kernel is used, the Boot, Gamma, and Pears approximations can distinguish the distributions for sample sizes as low as 10: this indicates that a more sophisticated encoding of the text than provided by bag-of-words results in tests of greater sensitivity (consistent with the independence testing observations of [15]).

We now investigate the fact that for sample sizes below m = 30 on the Hansard data, the Spec test has a much higher Type II error than the alternatives. The k-substring and bag-of-words kernels are diagonally dominant: thus for small sample sizes, the empirical estimate of the kernel spectrum is effectively truncated at a point where the eigenvalues remain large, introducing a bias (Figure 2). This effect vanishes on the Hansard benchmark once the number of samples reaches 25-30. By contrast, for the Neural data using a Gaussian kernel, this small sample bias is not observed, and the Spec test has equivalent Type II performance to the other three tests (see Figure 1 in the online supplement). In this case, for sample sizes of interest (i.e., where there are sufficient samples to obtain a Type II error of less than 50%), the bias in the Spec test due to spectral truncation is negligible. 
We emphasize that the speed advantage of the Spec test becomes important only for larger sample sizes (and the consistency guarantee is only meaningful in this regime).

5 Conclusion

We have presented a novel method for estimating the null distribution of the RKHS distance between probability distribution embeddings, for use in a nonparametric test of homogeneity. Unlike previous parametric heuristics based on moment matching, our new distribution estimate is consistent; moreover, it is computationally less costly than the bootstrap, which is the only alternative consistent approach. We have demonstrated in experiments that our method performs well on high dimensional multivariate data and text, as well as for distributions where the parametric heuristics show inaccuracies. We anticipate that our approach may also be generalized to kernel independence tests [15], and to homogeneity tests based on the kernel Fisher discriminant [18].

Acknowledgments: The ordering of the second through fourth authors is alphabetical. We thank Choon-Hui Teo for generating the Gram matrices for the text data, Malte Rasch for his assistance in the experimental evaluation, and Karsten Borgwardt for his assistance with the microarray data. A. G. was supported by grants DARPA IPTO FA8750-09-1-0141, ONR MURI N000140710747, and ARO MURI W911NF0810242. Z. H. was supported by grants from the Technical Support Working Group through funding from the Investigative Support and Forensics subgroup and NIMH 51435, and from Agence Nationale de la Recherche under contract ANR-06-BLAN-0078 KERNSIG. B. K. S. was supported by the MPI for Biological Cybernetics, NSF (grant DMS-MSPA 0625409), the Fair Isaac Corporation and the University of California MICRO program.

References

[1] N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates.
Journal of Multivariate Analysis, 50:41–54, 1994.

[2] F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, 2002.

[3] C. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.

[4] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer-Verlag, Berlin, 2003.

[5] R. Bhatia and L. Elsner. The Hoffman-Wielandt inequality in infinite dimensions. Proceedings of the Indian Academy of Sciences (Mathematical Sciences), 104(3):483–494, 1994.

[6] G. Biau and L. Györfi. On the asymptotic properties of a nonparametric L1-test statistic of homogeneity. IEEE Transactions on Information Theory, 51(11):3965–3973, 2005.

[7] G. Blanchard, O. Bousquet, and L. Zwald. Statistical properties of kernel principal component analysis. Machine Learning, 66:259–294, 2007.

[8] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics (ISMB), 22(14):e49–e57, 2006.

[9] J. Friedman and L. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 7(4):697–717, 1979.

[10] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5:73–99, 2004.

[11] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In NIPS 20, pages 489–496, 2008.

[12] K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schölkopf. Characteristic kernels on groups and semigroups. In NIPS 21, pages 473–480, 2009.

[13] G. Golub and Q. Ye.
An inverse free preconditioned Krylov subspace method for symmetric generalized eigenvalue problems. SIAM Journal on Scientific Computing, 24:312–334, 2002.

[14] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample-problem. In NIPS 19, pages 513–520, 2007.

[15] A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In NIPS 20, pages 585–592, 2008.

[16] G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Oxford University Press, Oxford, third edition, 2001.

[17] P. Hall and N. Tajvidi. Permutation tests for equality of distributions in high-dimensional settings. Biometrika, 89(2):359–374, 2002.

[18] Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In NIPS 20, pages 609–616, 2008. (Long version: arXiv:0804.1026v1.)

[19] M. Hein and O. Bousquet. Kernels, associated structures, and generalizations. Technical Report 127, Max Planck Institute for Biological Cybernetics, 2004.

[20] N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions, Volume 1 (second edition). John Wiley and Sons, 1994.

[21] E. Lehmann and J. Romano. Testing Statistical Hypotheses (3rd ed.). Wiley, New York, 2005.

[22] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, pages 564–575, 2002.

[23] A. S. Markus. The eigen- and singular values of the sum and product of linear operators. Russian Mathematical Surveys, 19(4):93–123, 1964.

[24] M. Rasch, A. Gretton, Y. Murayama, W. Maass, and N. K. Logothetis. Predicting spiking activity from local field potentials. Journal of Neurophysiology, 99:1461–1476, 2008.

[25] B. Schölkopf and A. Smola.
Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[26] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.

[27] J. Shawe-Taylor, C. Williams, N. Cristianini, and J. Kandola. On the eigenspectrum of the Gram matrix and the generalisation error of kernel PCA. IEEE Trans. Inf. Theory, 51(7):2510–2522, 2005.

[28] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In ALT 18, pages 13–31, 2007.

[29] B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In COLT 21, pages 111–122, 2008.

[30] C. H. Teo and S. V. N. Vishwanathan. Fast and space efficient string kernels using suffix arrays. In ICML, pages 929–936, 2006.

[31] J. E. Wilkins. A note on skewness and kurtosis. Ann. Math. Stat., 15(3):333–335, 1944.

[32] G. Zech and B. Aslan. A multivariate two-sample test based on the concept of minimum energy. In PHYSTAT, pages 97–100, 2003.