{"title": "A Linear-Time Kernel Goodness-of-Fit Test", "book": "Advances in Neural Information Processing Systems", "page_first": 262, "page_last": 271, "abstract": "We propose a novel adaptive test of goodness-of-fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are constructed via Stein's method, meaning that it is not necessary to compute the normalising constant of the model. We analyse the asymptotic Bahadur efficiency of the new test, and prove that under a mean-shift alternative, our test always has greater relative efficiency than a previous linear-time kernel test, regardless of the choice of parameters for that test. In experiments, the performance of our method exceeds that of the earlier linear-time test, and matches or exceeds the power of a quadratic-time kernel test. In high dimensions and where model structure may be exploited, our goodness of fit test performs far better than a quadratic-time two-sample test based on the Maximum Mean Discrepancy, with samples drawn from the model.", "full_text": "A Linear-Time Kernel Goodness-of-Fit Test\n\nWittawat Jitkrittum\n\nGatsby Unit, UCL\n\nwittawatj@gmail.com\n\nWenkai Xu\n\nZolt\u00e1n Szab\u00f3\u2217\n\nGatsby Unit, UCL\n\nwenkaix@gatsby.ucl.ac.uk\n\nCMAP, \u00c9cole Polytechnique\n\nzoltan.szabo@polytechnique.edu\n\nKenji Fukumizu\n\nThe Institute of Statistical Mathematics\n\nfukumizu@ism.ac.jp\n\nArthur Gretton\u2217\nGatsby Unit, UCL\n\narthur.gretton@gmail.com\n\nAbstract\n\nWe propose a novel adaptive test of goodness-of-\ufb01t, with computational cost\nlinear in the number of samples. We learn the test features that best indicate the\ndifferences between observed samples and a reference model, by minimizing the\nfalse negative rate. 
These features are constructed via Stein\u2019s method, meaning that\nit is not necessary to compute the normalising constant of the model. We analyse\nthe asymptotic Bahadur ef\ufb01ciency of the new test, and prove that under a mean-shift\nalternative, our test always has greater relative ef\ufb01ciency than a previous linear-time\nkernel test, regardless of the choice of parameters for that test. In experiments, the\nperformance of our method exceeds that of the earlier linear-time test, and matches\nor exceeds the power of a quadratic-time kernel test. In high dimensions and where\nmodel structure may be exploited, our goodness of \ufb01t test performs far better than\na quadratic-time two-sample test based on the Maximum Mean Discrepancy, with\nsamples drawn from the model.\n\nIntroduction\n\n1\nThe goal of goodness of \ufb01t testing is to determine how well a model density p(x) \ufb01ts an observed\ni=1 \u2282 X \u2286 Rd from an unknown distribution q(x). This goal may be achieved via\nsample D = {xi}n\na hypothesis test, where the null hypothesis H0 : p = q is tested against H1 : p (cid:54)= q. The problem\nof testing goodness of \ufb01t has a long history in statistics [11], with a number of tests proposed for\nparticular parametric models. Such tests can require space partitioning [18, 3], which works poorly in\nhigh dimensions; or closed-form integrals under the model, which may be dif\ufb01cult to obtain, besides\nin certain special cases [2, 5, 30, 26]. An alternative is to conduct a two-sample test using samples\ndrawn from both p and q. This approach was taken by [23], using a test based on the (quadratic-time)\nMaximum Mean Discrepancy [16], however this does not take advantage of the known structure of p\n(quite apart from the increased computational cost of dealing with samples from p).\nMore recently, measures of discrepancy with respect to a model have been proposed based on Stein\u2019s\nmethod [21]. 
A Stein operator for p may be applied to a class of test functions, yielding functions that\nhave zero expectation under p. Classes of test functions can include the W 2,\u221e Sobolev space [14],\nand reproducing kernel Hilbert spaces (RKHS) [25]. Statistical tests have been proposed by [9, 22]\nbased on classes of Stein transformed RKHS functions, where the test statistic is the norm of the\nsmoothness-constrained function with largest expectation under q . We will refer to this statistic as\nthe Kernel Stein Discrepancy (KSD). For consistent tests, it is suf\ufb01cient to use C0-universal kernels\n[6, De\ufb01nition 4.1], as shown by [9, Theorem 2.2], although inverse multiquadric kernels may be\npreferred if uniform tightness is required [15].2\n\n\u2217Zolt\u00e1n Szab\u00f3\u2019s ORCID ID: 0000-0001-6183-7603. Arthur Gretton\u2019s ORCID ID: 0000-0003-3169-7624.\n2Brie\ufb02y, [15] show that when an exponentiated quadratic kernel is used, a sequence of sets D may be\nconstructed that does not correspond to any q, but for which the KSD nonetheless approaches zero. In a statistical\ntesting setting, however, we assume identically distributed samples from q, and the issue does not arise.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fThe minimum variance unbiased estimate of the KSD is a U-statistic, with computational cost\nquadratic in the number n of samples from q. It is desirable to reduce the cost of testing, however,\nso that larger sample sizes may be addressed. A \ufb01rst approach is to replace the U-statistic with a\nrunning average with linear cost, as proposed by [22] for the KSD, but this results in an increase in\nvariance and corresponding decrease in test power. An alternative approach is to construct explicit\nfeatures of the distributions, whose empirical expectations may be computed in linear time. 
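The trade-off just described, a complete U-statistic with quadratic cost versus a running average over disjoint pairs with linear cost but higher variance, can be illustrated with a toy sketch (ours, not the paper's code), using the variance as the estimand:

```python
import numpy as np

def u_complete(x, h):
    """Complete U-statistic: average h over all unordered pairs, O(n^2)."""
    n = len(x)
    s = sum(h(x[i], x[j]) for i in range(n) for j in range(i + 1, n))
    return 2.0 * s / (n * (n - 1))

def u_incomplete(x, h):
    """Incomplete U-statistic: average h over disjoint pairs, O(n).

    Unbiased like the complete version, but with higher variance."""
    m = len(x) - len(x) % 2
    return float(np.mean([h(x[i], x[i + 1]) for i in range(0, m, 2)]))

# h is the U-statistic kernel of the sample variance: E[h(X, X')] = Var(X).
h = lambda a, b: 0.5 * (a - b) ** 2
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=2000)  # true variance = 4
```

Both estimators target the same population quantity; the incomplete version trades statistical efficiency for runtime, which is exactly the trade-off faced by the linear-time KSD estimator of [22].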
In the two-sample and independence settings, these features were initially chosen at random by [10, 8, 32]. More recently, features have been constructed explicitly to maximize test power in the two-sample [19] and independence testing [20] settings, resulting in tests that are not only more interpretable, but which can yield performance matching quadratic-time tests.

We propose to construct explicit linear-time features for testing goodness of fit, chosen so as to maximize test power. These features further reveal where the model and data differ, in a readily interpretable way. Our first theoretical contribution is a derivation of the null and alternative distributions for tests based on such features, and a corresponding power optimization criterion. Note that the goodness-of-fit test requires somewhat different strategies from those employed for two-sample and independence testing [19, 20], which become computationally prohibitive in high dimensions for the Stein discrepancy (specifically, the normalization used in prior work to simplify the asymptotics would incur a cost cubic in the dimension d and the number of features in the optimization). Details may be found in Section 3.

Our second theoretical contribution, given in Section 4, is an analysis of the relative Bahadur efficiency of our test vs. the linear-time test of [22]: this represents the relative rate at which the p-value decreases under H1 as we observe more samples. We prove that our test has greater asymptotic Bahadur efficiency relative to the test of [22], for Gaussian distributions under the mean-shift alternative. This is shown to hold regardless of the bandwidth of the exponentiated quadratic kernel used for the earlier test.
The proof techniques developed are of independent interest, and we anticipate that they may provide a foundation for the analysis of relative efficiency of linear-time tests in the two-sample and independence testing domains. In experiments (Section 5), our new linear-time test is able to detect subtle local differences between the density p(x) and the unknown q(x) as observed through samples. We show that our linear-time test constructed based on optimized features has comparable performance to the quadratic-time test of [9, 22], while uniquely providing an explicit visual indication of where the model fails to fit the data.

2 Kernel Stein Discrepancy (KSD) Test

We begin by introducing the Kernel Stein Discrepancy (KSD) and the associated statistical test, as proposed independently by [9] and [22]. Assume that the data domain is a connected open set X ⊆ R^d. Consider a Stein operator T_p that takes in a multivariate function f(x) = (f_1(x), ..., f_d(x))^⊤ ∈ R^d and constructs a function (T_p f)(x) : R^d → R. The constructed function has the key property that for all f in an appropriate function class, E_{x∼q}[(T_p f)(x)] = 0 if and only if q = p. Thus, one can use this expectation as a statistic for testing goodness of fit.

The function class F^d for the function f is chosen to be a unit-norm ball in a reproducing kernel Hilbert space (RKHS) in [9, 22]. More precisely, let F be an RKHS associated with a positive definite kernel k : X × X → R. Let φ(x) = k(x, ·) denote a feature map of k so that k(x, x') = ⟨φ(x), φ(x')⟩_F. Assume that f_i ∈ F for all i = 1, ...
, d so that f ∈ F × ... × F := F^d, where F^d is equipped with the standard inner product ⟨f, g⟩_{F^d} := Σ_{i=1}^d ⟨f_i, g_i⟩_F. The kernelized Stein operator T_p studied in [9] is

(T_p f)(x) := Σ_{i=1}^d [ (∂ log p(x)/∂x_i) f_i(x) + ∂f_i(x)/∂x_i ] =(a) ⟨f, ξ_p(x, ·)⟩_{F^d},

where at (a) we use the reproducing property of F, i.e., f_i(x) = ⟨f_i, k(x, ·)⟩_F, and the fact that ∂k(x, ·)/∂x_i ∈ F [28, Lemma 4.34]; hence ξ_p(x, ·) := (∂ log p(x)/∂x) k(x, ·) + ∂k(x, ·)/∂x is in F^d. We note that the Stein operator presented in [22] is defined such that (T_p f)(x) ∈ R^d. This distinction is not crucial and leads to the same goodness-of-fit test. Under appropriate conditions, e.g. that lim_{‖x‖→∞} p(x) f_i(x) = 0 for all i = 1, ..., d, it can be shown using integration by parts that E_{x∼p}(T_p f)(x) = 0 for any f ∈ F^d [9, Lemma 5.1]. Based on the Stein operator, [9, 22] define the kernelized Stein discrepancy as

S_p(q) := sup_{‖f‖_{F^d} ≤ 1} E_{x∼q} ⟨f, ξ_p(x, ·)⟩_{F^d} =(a) sup_{‖f‖_{F^d} ≤ 1} ⟨f, E_{x∼q} ξ_p(x, ·)⟩_{F^d} = ‖g(·)‖_{F^d},   (1)

where at (a), ξ_p(x, ·) is Bochner integrable [28, Definition A.5.20] as long as E_{x∼q} ‖ξ_p(x, ·)‖_{F^d} < ∞, and g(y) := E_{x∼q} ξ_p(x, y) is what we refer to as the Stein witness function. The Stein witness function will play a crucial role in our new test statistic in Section 3.
When a C0-universal kernel is used [6, Definition 4.1], and as long as E_{x∼q} ‖∇_x log p(x) − ∇_x log q(x)‖² < ∞, it can be shown that S_p(q) = 0 if and only if p = q [9, Theorem 2.2].

The KSD S_p(q) can be written as S_p²(q) = E_{x∼q} E_{x'∼q} h_p(x, x'), where

h_p(x, y) := s_p(x)^⊤ s_p(y) k(x, y) + s_p(y)^⊤ ∇_x k(x, y) + s_p(x)^⊤ ∇_y k(x, y) + Σ_{i=1}^d ∂²k(x, y)/(∂x_i ∂y_i),

and s_p(x) := ∇_x log p(x) is a column vector. An unbiased empirical estimator of S_p²(q), denoted by Ŝ² = (2/(n(n−1))) Σ_{i<j} h_p(x_i, x_j), is a U-statistic; for the test, the rejection threshold can be computed by a bootstrap procedure. All these properties make Ŝ² a very flexible criterion to detect the discrepancy of p and q: in particular, it can be computed even if p is known only up to a normalization constant. Further studies on nonparametric Stein operators can be found in [25, 14].

Linear-Time Kernel Stein (LKS) Test. Computation of Ŝ² costs O(n²). To reduce this cost, a linear-time (i.e., O(n)) estimator based on an incomplete U-statistic is proposed in [22, Eq. 17], given by Ŝ²_l := (2/n) Σ_{i=1}^{n/2} h_p(x_{2i−1}, x_{2i}), where we assume n is even for simplicity. Empirically, [22] observed that the linear-time estimator performs much worse (in terms of test power) than the quadratic-time U-statistic estimator, agreeing with our findings presented in Section 5.

3 New Statistic: The Finite Set Stein Discrepancy (FSSD)

Rather than using the RKHS norm of the Stein witness g, our new statistic evaluates g at a finite set of test locations V = {v_j}_{j=1}^J, giving the Finite Set Stein Discrepancy FSSD²_p(q) := (1/(dJ)) Σ_{j=1}^J ‖g(v_j)‖², which, for locations drawn from a distribution with a density, is zero if and only if p = q (Theorem 1). The proposed test rejects H0 when n \widehat{FSSD^2} > T_α, where T_α is the rejection threshold (critical value), and \widehat{FSSD^2} is an empirical estimate of FSSD²_p(q). The threshold which guarantees that the type-I error (i.e., the probability of rejecting H0 when it is true) is bounded above by α is given by the (1−α)-quantile of the null distribution, i.e., the distribution of n \widehat{FSSD^2} under H0. In the following, we start by giving the expression for \widehat{FSSD^2}, and summarize its asymptotic distributions in Proposition 2.

Let Ξ(x) ∈ R^{d×J} be such that [Ξ(x)]_{i,j} = ξ_{p,i}(x, v_j)/√(dJ). Define τ(x) := vec(Ξ(x)) ∈ R^{dJ}, where vec(M) concatenates the columns of the matrix M into a column vector.
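For concreteness, the quadratic-time estimator Ŝ² and the linear-time Ŝ²_l of Section 2 can be sketched in NumPy for a univariate standard normal model p = N(0, 1), where s_p(x) = −x and the Gaussian-kernel derivatives are available in closed form. This is our own illustration, not the authors' code; names are illustrative:

```python
import numpy as np

def h_p(x, y, s2=1.0):
    """Stein kernel h_p(x, y) for p = N(0, 1), d = 1, Gaussian kernel bandwidth s2.

    h_p = s(x)s(y)k + s(y) dk/dx + s(x) dk/dy + d2k/(dx dy), with score s(x) = -x.
    Works elementwise on broadcastable arrays."""
    d = x - y
    k = np.exp(-d ** 2 / (2 * s2))
    return k * (x * y + (y - x) * d / s2 + 1.0 / s2 - d ** 2 / s2 ** 2)

def ksd2_quad(x):
    """Quadratic-time unbiased U-statistic estimate of S_p^2, O(n^2)."""
    H = h_p(x[:, None], x[None, :])
    n = len(x)
    return (H.sum() - np.trace(H)) / (n * (n - 1))

def ksd2_lin(x):
    """Linear-time incomplete U-statistic over disjoint pairs, O(n)."""
    m = len(x) - len(x) % 2
    return float(np.mean(h_p(x[0:m:2], x[1:m:2])))
```

Under a mean-shift alternative both estimators concentrate around the same positive S_p²(q), but the linear-time version has visibly higher variance, matching the power gap reported for LKS in Section 5.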
We note that τ(x) depends on the test locations V = {v_j}_{j=1}^J. Let Δ(x, y) := τ(x)^⊤ τ(y) = tr(Ξ(x)^⊤ Ξ(y)). Given an i.i.d. sample {x_i}_{i=1}^n ∼ q, a consistent, unbiased estimator of FSSD²_p(q) is

\widehat{FSSD^2} = (1/(n(n−1))) Σ_{i=1}^n Σ_{j≠i} (1/(dJ)) Σ_{l=1}^d Σ_{m=1}^J ξ_{p,l}(x_i, v_m) ξ_{p,l}(x_j, v_m) = (2/(n(n−1))) Σ_{i<j} Δ(x_i, x_j),   (2)

which is a second-order U-statistic with kernel Δ. Its asymptotic distributions are as follows.

Proposition 2 (Asymptotic distributions of \widehat{FSSD^2}). Let Z_1, ..., Z_{dJ} be i.i.d. N(0, 1), let μ := E_{x∼q}[τ(x)], and let {ω_i}_{i=1}^{dJ} be the eigenvalues of Σ_p := E_{x∼p}[τ(x) τ(x)^⊤].
1. Under H0 : p = q, n \widehat{FSSD^2} converges in distribution to Σ_{i=1}^{dJ} (Z_i² − 1) ω_i.
2. Under H1 : p ≠ q, if σ²_{H1} := 4 μ^⊤ Σ_q μ > 0, where Σ_q := cov_{x∼q}[τ(x)], then √n (\widehat{FSSD^2} − FSSD²) →d N(0, σ²_{H1}).

Proof. Recognizing that (2) is a degenerate U-statistic, the results follow directly from [27, Section 5.5.1, 5.5.2].

Claims 1 and 2 of Proposition 2 imply that under H1, the test power (i.e., the probability of correctly rejecting H0) goes to 1 asymptotically, if the threshold T_α is defined as above. In practice, simulating from the asymptotic null distribution in Claim 1 can be challenging, since the plug-in estimator of Σ_p requires a sample from p, which is not available. A straightforward solution is to draw a sample from p, either by assuming that p can be sampled easily or by using a Markov chain Monte Carlo (MCMC) method, although this adds an additional computational burden to the test procedure. A more subtle issue is that when dependent samples from p are used in obtaining the test threshold, the test may become more conservative than required for i.i.d. data [7]. An alternative approach is to use the plug-in estimate Σ̂_q instead of Σ_p. The covariance matrix Σ̂_q can be directly computed from the data. This is the approach we take. Theorem 3 guarantees that the replacement of the covariance in the computation of the asymptotic null distribution still yields a consistent test.
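A minimal sketch of the linear-time computation of \widehat{FSSD^2} (Eq. 2) for the toy model p = N(0, 1) in d = 1, together with the power criterion \widehat{FSSD^2}/(σ̂_{H1} + γ) that Section 3.2 maximizes. Again the naming and setup are ours, assumed for illustration:

```python
import numpy as np

def tau(x, V, s2=1.0):
    """Features tau(x_i) in R^J for p = N(0,1), d = 1: [xi_p(x, v_j)]_j / sqrt(dJ),
    with xi_p(x, v) = s_p(x) k(x, v) + dk(x, v)/dx and score s_p(x) = -x."""
    d = x[:, None] - V[None, :]                # n x J
    k = np.exp(-d ** 2 / (2 * s2))
    xi = (-x[:, None] - d / s2) * k
    return xi / np.sqrt(len(V))

def fssd2_hat(T):
    """Unbiased estimator of FSSD^2 from features T (n x J), computed in O(nJ):
    sum_{i != j} tau_i^T tau_j = ||sum_i tau_i||^2 - sum_i ||tau_i||^2."""
    n = T.shape[0]
    s = T.sum(axis=0)
    return (s @ s - (T * T).sum()) / (n * (n - 1))

def power_criterion(T, gamma=0.01):
    """FSSD^2-hat / (sigma_H1-hat + gamma): the objective maximized over V, s2."""
    mu = T.mean(axis=0)
    Sq = np.cov(T, rowvar=False, bias=True)    # plug-in covariance Sigma_q-hat
    sigma2 = 4.0 * mu @ Sq @ mu                # sigma_H1^2 estimate
    return fssd2_hat(T) / (np.sqrt(max(sigma2, 0.0)) + gamma)
```

Because the estimator only needs the feature sum and the per-sample norms, it runs in O(nJ) rather than the O(n²) of the full Gram matrix.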
We write P_{H1} for the distribution of n \widehat{FSSD^2} under H1.

Theorem 3. Let Σ̂_q := (1/n) Σ_{i=1}^n τ(x_i) τ(x_i)^⊤ − [(1/n) Σ_{i=1}^n τ(x_i)][(1/n) Σ_{j=1}^n τ(x_j)]^⊤ with {x_i}_{i=1}^n i.i.d. ∼ q. Suppose that the test threshold T_α is set to the (1−α)-quantile of the distribution of Σ_{i=1}^{dJ} (Z_i² − 1) ν̂_i, where {Z_i}_{i=1}^{dJ} are i.i.d. N(0, 1), and ν̂_1, ..., ν̂_{dJ} are the eigenvalues of Σ̂_q. Then, under H0, asymptotically the false positive rate is α. Under H1, for {v_j}_{j=1}^J drawn from a distribution with a density, the test power P_{H1}(n \widehat{FSSD^2} > T_α) → 1 as n → ∞.

Remark 1. The proof of Theorem 3 relies on two facts. First, under H0, Σ̂_q = Σ̂_p, i.e., the plug-in estimate of Σ_p. Thus, under H0, the null distribution approximated with Σ̂_q is asymptotically correct, following the convergence of Σ̂_p to Σ_p. Second, the rejection threshold obtained from the approximated null distribution is asymptotically constant. Hence, under H1, Claim 2 of Proposition 2 implies that n \widehat{FSSD^2} → ∞ in probability as n → ∞, and consequently P_{H1}(n \widehat{FSSD^2} > T_α) → 1.

3.2 Optimizing the Test Parameters

Theorem 1 guarantees that the population quantity FSSD² = 0 if and only if p = q for any choice of {v_i}_{i=1}^J drawn from a distribution with a density. In practice, we are forced to rely on the empirical \widehat{FSSD^2}, and some test locations will give a higher detection rate (i.e., test power) than others for finite n. Following the approaches of [17, 20, 19, 29], we choose the test locations V = {v_j}_{j=1}^J and kernel bandwidth σ²_k so as to maximize the test power, i.e., the probability of rejecting H0 when it is false.
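Before turning to parameter optimization, note that the plug-in null approximation of Theorem 3 can be simulated directly: estimate Σ̂_q from the features τ(x_i), then draw from Σ_i (Z_i² − 1) ν̂_i. A minimal sketch (our naming), applied here to a synthetic feature matrix:

```python
import numpy as np

def null_threshold(T, alpha=0.05, n_sim=20000, seed=0):
    """(1 - alpha)-quantile of sum_i (Z_i^2 - 1) nu_i, where nu_i are the
    eigenvalues of the plug-in covariance Sigma_q-hat of features T (n x dJ)."""
    Sq = np.cov(T, rowvar=False, bias=True)      # Sigma_q-hat, dJ x dJ
    nu = np.linalg.eigvalsh(Sq)                  # real eigenvalues (symmetric input)
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n_sim, len(nu)))
    draws = (Z ** 2 - 1.0) @ nu                  # simulated null statistics
    return float(np.quantile(draws, 1.0 - alpha))

# Synthetic stand-in for tau-features; the threshold grows as alpha shrinks.
rng = np.random.default_rng(3)
T = rng.normal(size=(500, 4))
t05 = null_threshold(T, alpha=0.05)
t01 = null_threshold(T, alpha=0.01)
```

If the eigenvalues decay quickly, the sum can be truncated to the largest few ν̂_i, as noted in Section 3.2.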
We first give an approximate expression for the test power when n is large.

Proposition 4 (Approximate test power of n \widehat{FSSD^2}). Under H1, for large n and fixed r, the test power satisfies P_{H1}(n \widehat{FSSD^2} > r) ≈ 1 − Φ( r/(√n σ_{H1}) − √n FSSD²/σ_{H1} ), where Φ denotes the cumulative distribution function of the standard normal distribution, and σ_{H1} is defined in Proposition 2.

Proof. P_{H1}(n \widehat{FSSD^2} > r) = P_{H1}(\widehat{FSSD^2} > r/n) = P_{H1}( √n (\widehat{FSSD^2} − FSSD²)/σ_{H1} > √n (r/n − FSSD²)/σ_{H1} ). For sufficiently large n, the alternative distribution is approximately normal as given in Proposition 2. It follows that P_{H1}(n \widehat{FSSD^2} > r) ≈ 1 − Φ( r/(√n σ_{H1}) − √n FSSD²/σ_{H1} ).

Let ζ := {V, σ²_k} be the collection of all tuning parameters, and assume that n is sufficiently large. Following the same argument as in [29], in r/(√n σ_{H1}) − √n FSSD²/σ_{H1} we observe that the first term r/(√n σ_{H1}) = O(n^{−1/2}) goes to 0 as n → ∞, while the second term √n FSSD²/σ_{H1} = O(n^{1/2}) dominates the first for large n. Thus, the best parameters that maximize the test power are given by ζ* = arg max_ζ P_{H1}(n \widehat{FSSD^2} > T_α) ≈ arg max_ζ FSSD²/σ_{H1}. Since FSSD² and σ_{H1} are unknown, we divide the sample {x_i}_{i=1}^n into two disjoint training and test sets, and use the training set to compute \widehat{FSSD^2}/(σ̂_{H1} + γ), where a small regularization parameter γ > 0 is added for numerical stability. The goodness-of-fit test is performed on the test set to avoid overfitting.
The idea of splitting the data into training and test sets to learn good features for hypothesis testing was successfully used in [29, 20, 19, 17].

To find a local maximum of \widehat{FSSD^2}/(σ̂_{H1} + γ), we use gradient ascent for its simplicity. The initial points of {v_i}_{i=1}^J are set to random draws from a normal distribution fitted to the training data, a heuristic we found to perform well in practice. The objective is non-convex in general, reflecting many possible ways to capture the differences of p and q. The regularization parameter γ is not tuned, and is fixed to a small constant. Assume that ∇_x log p(x) costs O(d²) to evaluate. Computing ∇_ζ [\widehat{FSSD^2}/(σ̂_{H1} + γ)] costs O(d²J²n). The computational complexity of n \widehat{FSSD^2} and σ̂²_{H1} is O(d²Jn). Thus, finding a local optimum via gradient ascent is still linear-time, for a fixed maximum number of iterations. Computing Σ̂_q costs O(d²J²n), and obtaining all the eigenvalues of Σ̂_q costs O(d³J³) (required only once). If the eigenvalues decay to zero sufficiently rapidly, one can approximate the asymptotic null distribution with only a few eigenvalues. The cost to obtain the largest few eigenvalues alone can be much smaller.

Remark 2. Let μ̂ := (1/n) Σ_{i=1}^n τ(x_i). It is possible to normalize the FSSD statistic to get a new statistic λ̂_n := n μ̂^⊤ (Σ̂_q + γI)^{−1} μ̂, where γ ≥ 0 is a regularization parameter that goes to 0 as n → ∞. This was done in the case of the ME (mean embeddings) statistic of [8, 19].
The asymptotic null distribution of this statistic takes the convenient form of χ²(dJ) (independent of p and q), eliminating the need to obtain the eigenvalues of Σ̂_q. It turns out that the test power criterion for tuning the parameters in this case is the statistic λ̂_n itself. However, the optimization is computationally expensive, as (Σ̂_q + γI)^{−1} (costing O(d³J³)) needs to be reevaluated in each gradient ascent iteration. This is not needed in our proposed FSSD statistic.

4 Relative Efficiency and Bahadur Slope

Both the linear-time kernel Stein (LKS) and FSSD tests have the same computational cost of O(d²n), and are consistent, achieving maximum power of 1 as n → ∞ under H1. It is thus of theoretical interest to understand which test is more sensitive in detecting the differences of p and q. This can be quantified by the Bahadur slope of the test [1]. Two given tests can then be compared by computing the Bahadur efficiency (Theorem 7), which is given by the ratio of the slopes of the two tests. We note that the constructions and techniques in this section may be of independent interest, and can be generalised to other statistical testing settings.

We start by introducing the concept of Bahadur slope for a general test, following the presentation of [12, 13]. Consider a hypothesis testing problem on a parameter θ. The test proposes a null hypothesis H0 : θ ∈ Θ0 against the alternative hypothesis H1 : θ ∈ Θ\Θ0, where Θ, Θ0 are arbitrary sets. Let T_n be a test statistic computed from a sample of size n, such that large values of T_n provide evidence to reject H0.
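To make the role of the slope concrete: with ρ(n) = n, the defining condition of the approximate Bahadur slope can be read as an exponential decay rate for the p-value under the alternative. This is a standard reformulation of Bahadur's definition, not a statement proved in the paper:

```latex
p_n := 1 - F(T_n)
     = \exp\!\Big(-\tfrac{n}{2}\, c(\theta_A)\,\big(1 + o_P(1)\big)\Big),
\qquad
\frac{\log p_n^{(1)}}{\log p_n^{(2)}} \xrightarrow{P}
\frac{c^{(1)}(\theta_A)}{c^{(2)}(\theta_A)} = E(\theta_A).
```

Thus a test with twice the slope drives its p-value below a fixed level with roughly half the sample size, which is the sense in which Theorem 7's bound E₁ > 2 favors the FSSD test.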
We use plim to denote convergence in probability, and write E_r for E_{x∼r} E_{x'∼r}.

Approximate Bahadur Slope (ABS). For θ0 ∈ Θ0, let the asymptotic null distribution of T_n be F(t) = lim_{n→∞} P_{θ0}(T_n < t), where we assume that the CDF F is continuous and common to all θ0 ∈ Θ0. The continuity of F will be important later, when Theorems 9 and 10 are used to compute the slopes of the LKS and FSSD tests. Assume that there exists a continuous, strictly increasing function ρ : (0, ∞) → (0, ∞) such that lim_{n→∞} ρ(n) = ∞, and that −2 plim_{n→∞} [log(1 − F(T_n)) / ρ(n)] = c(θ), where T_n ∼ P_θ, for some function c such that 0 < c(θ_A) < ∞ for θ_A ∈ Θ\Θ0, and c(θ0) = 0 when θ0 ∈ Θ0. The function c(θ) is known as the approximate Bahadur slope (ABS) of the sequence T_n. The quantifier "approximate" comes from the use of the asymptotic null distribution instead of the exact one [1]. Intuitively, the slope c(θ_A), for θ_A ∈ Θ\Θ0, is the rate of convergence of p-values (i.e., 1 − F(T_n)) to 0 as n increases. The higher the slope, the faster the p-value vanishes, and thus the lower the sample size required to reject H0 under θ_A.

Approximate Bahadur Efficiency. Given two sequences of test statistics T_n^(1) and T_n^(2) having the same ρ(n) (see Theorem 10), the approximate Bahadur efficiency of T_n^(1) relative to T_n^(2) is defined as E(θ_A) := c^(1)(θ_A)/c^(2)(θ_A) for θ_A ∈ Θ\Θ0. If E(θ_A) > 1, then T_n^(1) is asymptotically more efficient than T_n^(2) in the sense of Bahadur, for the particular problem specified by θ_A ∈ Θ\Θ0. We now give approximate Bahadur slopes for two sequences of linear-time test statistics: the proposed n \widehat{FSSD^2}, and the LKS test statistic √n Ŝ²_l discussed in Section 2.

Theorem 5. 
The approximate Bahadur slope of n \widehat{FSSD^2} is c^(FSSD) := FSSD²/ω₁, where ω₁ is the maximum eigenvalue of Σ_p := E_{x∼p}[τ(x) τ(x)^⊤], and ρ(n) = n.

Theorem 6. The approximate Bahadur slope of the linear-time kernel Stein (LKS) test statistic √n Ŝ²_l is c^(LKS) = (1/2) [E_q h_p(x, x')]² / E_p[h_p²(x, x')], where h_p is the U-statistic kernel of the KSD statistic, and ρ(n) = n.

To make these results concrete, we consider the setting where p = N(0, 1) and q = N(μ_q, 1). We assume that both tests use the Gaussian kernel k(x, y) = exp(−(x − y)²/(2σ²_k)), possibly with different bandwidths. We write σ²_k and κ² for the FSSD and LKS bandwidths, respectively. Under these assumptions, the slopes given in Theorem 5 and Theorem 6 can be derived explicitly. The full expressions of the slopes are given in Proposition 12 and Proposition 13 (in the appendix). By [12, 13] (recalled as Theorem 10 in the supplement), the approximate Bahadur efficiency can be computed by taking the ratio of the two slopes. The efficiency is given in Theorem 7.

Theorem 7 (Efficiency in the Gaussian mean shift problem). Let E₁(μ_q, v, σ²_k, κ²) be the approximate Bahadur efficiency of n \widehat{FSSD^2} relative to √n Ŝ²_l for the case where p = N(0, 1), q = N(μ_q, 1), and J = 1 (i.e., one test location v for n \widehat{FSSD^2}). Fix σ²_k = 1 for n \widehat{FSSD^2}. Then, for any μ_q ≠ 0, for some v ∈ R, and for any κ² > 0, we have E₁(μ_q, v, σ²_k, κ²) > 2.

When p = N(0, 1) and q = N(μ_q, 1) for μ_q ≠ 0, Theorem 7 guarantees that our FSSD test is asymptotically at least twice as efficient as the LKS test in the Bahadur sense. We note that the efficiency is conservative in the sense that σ²_k = 1 regardless of μ_q. Choosing σ²_k dependent on μ_q will likely improve the efficiency further.

5 Experiments

In this section, we demonstrate the performance of the proposed test on a number of problems. The primary goal is to understand the conditions under which the test can perform well.

Figure 1: The power criterion FSSD²/σ_{H1} as a function of test location v.

Sensitivity to Local Differences. We start by demonstrating that the test power objective FSSD²/σ_{H1} captures local differences of p and q, and that interpretable features v are found. Consider a one-dimensional problem in which p = N(0, 1) and q = Laplace(0, 1/√2), a zero-mean Laplace distribution with scale parameter 1/√2. These parameters are chosen so that p and q have the same mean and variance. Figure 1 plots the (rescaled) objective as a function of v. The objective illustrates that the best features (indicated by v*) are at the most discriminative locations.

Test Power. We next investigate the power of different tests on two problems:

1. Gaussian vs. Laplace: p(x) = N(x | 0, I_d) and q(x) = Π_{i=1}^d Laplace(x_i | 0, 1/√2), where the dimension d will be varied. The two distributions have the same mean and variance. The main characteristic of this problem is local differences of p and q (see Figure 1). Set n = 1000.

2. 
Restricted Boltzmann Machine (RBM): p(x) is the marginal distribution of p(x, h) = (1/Z) exp(x^⊤ B h + b^⊤ x + c^⊤ h − ½‖x‖²), where x ∈ R^d, h ∈ {−1, 1}^{d_h} is a random vector of hidden variables, and Z is the normalization constant. The exact marginal density p(x) = Σ_{h∈{−1,1}^{d_h}} p(x, h) is intractable when d_h is large, since it involves summing over 2^{d_h} terms. Recall that the proposed test only requires the score function ∇_x log p(x) (not the normalization constant), which can be computed in closed form in this case. In this problem, q is another RBM where entries of the matrix B are corrupted by Gaussian noise. This was the problem considered in [22]. We set d = 50 and d_h = 40, and generate samples by n independent chains (i.e., n independent samples) of blocked Gibbs sampling with 2000 burn-in iterations.

We evaluate the following six kernel-based nonparametric tests with α = 0.05, all using the Gaussian kernel.

1. FSSD-rand: the proposed FSSD test with the test locations set to random draws from a multivariate normal distribution fitted to the data. The kernel bandwidth is set by the commonly used median heuristic, i.e., σ_k = median({‖x_i − x_j‖ : i < j}).
2. FSSD-opt: the proposed FSSD test where both the test locations and the Gaussian bandwidth are optimized (Section 3.2).
3. KSD: the quadratic-time Kernel Stein Discrepancy test with the median heuristic.
4. LKS: the linear-time version of KSD with the median heuristic.
5. MMD-opt: the quadratic-time MMD two-sample test of [16], where the kernel bandwidth is optimized by grid search to maximize a power criterion as described in [29].
6. ME-opt: the linear-time mean embeddings (ME) two-sample test of [19], where parameters are optimized.

We draw n samples from p to run the two-sample tests (MMD-opt, ME-opt). For FSSD tests, we use J = 5 (see Section A for an investigation of test power as J varies). All tests with optimization use 20% of the sample size n for parameter tuning.
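The closed-form RBM score mentioned above follows by summing h out: log p(x) = b^⊤x − ½‖x‖² + Σ_j log 2cosh((B^⊤x + c)_j) + const, so ∇_x log p(x) = b − x + B tanh(B^⊤x + c). A sketch of this derivation in code (our naming; the numerical-gradient helper is for checking only):

```python
import numpy as np

def rbm_score(x, B, b, c):
    """Score grad_x log p(x) of the RBM marginal with h in {-1, 1}^dh.

    Summing h out of p(x, h) ∝ exp(x^T B h + b^T x + c^T h - ||x||^2 / 2) gives
    log p(x) = b^T x - ||x||^2/2 + sum_j log 2cosh((B^T x + c)_j) + const,
    whose gradient is b - x + B tanh(B^T x + c). Z never appears."""
    return b - x + B @ np.tanh(B.T @ x + c)

def rbm_logp_unnorm(x, B, b, c):
    """Unnormalized log marginal density, used to verify the score numerically."""
    a = B.T @ x + c
    return b @ x - 0.5 * x @ x + np.sum(np.logaddexp(a, -a))  # log 2cosh(a)

# Small random instance mirroring the experimental setup (tiny d, dh here).
rng = np.random.default_rng(4)
d, dh = 6, 4
B = rng.choice([-1.0, 1.0], size=(d, dh))
b, c, x = rng.normal(size=d), rng.normal(size=dh), rng.normal(size=d)
score = rbm_score(x, B, b, c)
```

This is why the goodness-of-fit tests apply even though p(x) itself is intractable: the normalizer Z cancels in the score, while the two-sample baselines must represent p through samples.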
Code is available at https://github.com/wittawatj/kernel-gof.

Figure 2 shows the rejection rates of the six tests for the two problems, where each problem is repeated for 200 trials, resampling n points from q every time. In Figure 2a (Gaussian vs. Laplace), the high performance of FSSD-opt indicates that the test performs well when there are local differences between p and q. The low performance of FSSD-rand emphasizes the importance of the optimization in FSSD-opt to pinpoint regions where p and q differ. The power of KSD quickly drops as the dimension increases, which can be understood since KSD is the RKHS norm of a function witnessing differences in p and q across the entire domain, including where these differences are small.

We next consider the case of RBMs. Following [22], b, c are independently drawn from the standard multivariate normal distribution, and entries of B ∈ R^{50×40} are drawn with equal probability from {±1}, in each trial. The density q represents another RBM having the same b, c as in p, and with all entries of B corrupted by independent zero-mean Gaussian noise with standard deviation σ_per.

Figure 2: Rejection rates of the six tests. The proposed linear-time FSSD-opt has a comparable or higher test power in some cases than the quadratic-time KSD test. (a) Gaussian vs. Laplace, n = 1000. (b) RBM, n = 1000, perturb all entries of B. (c) RBM, σ_per = 0.1, perturb B_{1,1}. (d) Runtime (RBM).

Figure 2b shows the test powers as σ_per increases, for a fixed sample size n = 1000. We observe that all the tests have correct false positive rates (type-I errors) at roughly α = 0.05 when there is no perturbation noise. In particular, the optimization in FSSD-opt does not increase the false positive rate when H0 holds. We see that the performance of the proposed FSSD-opt matches that of the quadratic-time KSD at all noise levels.
MMD-opt and ME-opt perform far worse than the goodness-of-fit tests when the difference between p and q is small (σ_per is low), since these tests simply represent p using samples, and do not take advantage of its structure.

The advantage of having O(n) runtime can be clearly seen when the problem is much harder, requiring larger sample sizes to tackle. Consider a similar problem on RBMs in which the parameter B ∈ R^{50×40} in q is given by that of p, where only the first entry B_{1,1} is perturbed by random N(0, 0.1²) noise. The results are shown in Figure 2c, where the sample size n is varied. We observe that the two two-sample tests fail to detect this subtle difference even with a large sample size. The test powers of KSD and FSSD-opt are comparable when n is relatively small. It appears that KSD has higher test power than FSSD-opt in this case for large n. However, this moderate gain in test power comes with an order of magnitude more computation: as shown in Figure 2d, the runtime of KSD is much larger than that of FSSD-opt, especially at large n. In these problems, the performance of the new test (even without optimization) far exceeds that of the LKS test. Further simulation results can be found in Section B.

Interpretable Features. In the final simulation, we demonstrate that the learned test locations are informative in visualising where the model does not fit the data well. We consider crime data from the Chicago Police Department, recording n = 11957 locations (latitude-longitude coordinates) of robbery events in Chicago in 2016.³ We address the situation in which a model p for the robbery location density is given, and we wish to visualise where it fails to match the data. We fit a Gaussian mixture model (GMM) with the expectation-maximization algorithm to a subsample of 5500 points.
We then test the model on a held-out test set of the same size to obtain proposed locations of relevant features v. Figure 3a shows the test robbery locations in purple, the model with two Gaussian components in wireframe, and the optimization objective for v as a grayscale contour plot (a red star indicates the maximum). We observe that the 2-component model is a poor fit to the data, particularly in the right tail areas of the data, as indicated in dark gray (i.e., the objective is high). Figure 3b shows a similar plot with a 10-component GMM. The additional components appear to have eliminated some mismatch in the right tail; however, a discrepancy still exists in the left region. Here, the data have a sharp boundary on the right side, following the geography of Chicago, and do not exhibit exponentially decaying Gaussian-like tails. We note that tests based on a learned feature located at the maximum correctly reject H0 in both cases.

Figure 3: Plots of the optimization objective as a function of test location v ∈ R² in the Gaussian mixture model (GMM) evaluation task. (a) p = 2-component GMM. (b) p = 10-component GMM.

³Data can be found at https://data.cityofchicago.org.

Acknowledgement

WJ, WX, and AG thank the Gatsby Charitable Foundation for the financial support. ZSz was financially supported by the Data Science Initiative. KF has been supported by KAKENHI Innovative Areas 25120012.

References

[1] R. R. Bahadur. Stochastic comparison of tests. The Annals of Mathematical Statistics, 31(2):276–295, 1960.

[2] L. Baringhaus and N. Henze.
A consistent test for multivariate normality based on the empirical characteristic function. Metrika, 35:339–348, 1988.

[3] J. Beirlant, L. Györfi, and G. Lugosi. On the asymptotic normality of the l1- and l2-errors in histogram density estimation. Canadian Journal of Statistics, 22:309–318, 1994.

[4] R. Bhatia. Matrix Analysis, volume 169. Springer Science & Business Media, 2013.

[5] A. Bowman and P. Foster. Adaptive smoothing and density based tests of multivariate normality. Journal of the American Statistical Association, 88:529–537, 1993.

[6] C. Carmeli, E. De Vito, A. Toigo, and V. Umanità. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 08(01):19–61, Jan. 2010.

[7] K. Chwialkowski, D. Sejdinovic, and A. Gretton. A wild bootstrap for degenerate kernel tests. In NIPS, pages 3608–3616, 2014.

[8] K. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton. Fast two-sample testing with analytic representations of probability measures. In NIPS, pages 1981–1989, 2015.

[9] K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In ICML, pages 2606–2615, 2016.

[10] T. Epps and K. Singleton. An omnibus test for the two-sample problem using the empirical characteristic function. Journal of Statistical Computation and Simulation, 26(3–4):177–203, 1986.

[11] F. J. Massey Jr. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68–78, 1951.

[12] L. J. Gleser. On a measure of test efficiency proposed by R. R. Bahadur. 35(4):1537–1544, 1964.

[13] L. J. Gleser. The comparison of multivariate tests of hypothesis by means of Bahadur efficiency. 28(2):157–174, 1966.

[14] J. Gorham and L. Mackey. Measuring sample quality with Stein's method. In NIPS, pages 226–234, 2015.

[15] J.
Gorham and L. Mackey. Measuring sample quality with kernels. In ICML, pages 1292–1301. PMLR, 2017.

[16] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012.

[17] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In NIPS, pages 1205–1213, 2012.

[18] L. Györfi and E. C. van der Meulen. A consistent goodness of fit test based on the total variation distance. In G. Roussas, editor, Nonparametric Functional Estimation and Related Topics, pages 631–645, 1990.

[19] W. Jitkrittum, Z. Szabó, K. P. Chwialkowski, and A. Gretton. Interpretable distribution features with maximum testing power. In NIPS, pages 181–189, 2016.

[20] W. Jitkrittum, Z. Szabó, and A. Gretton. An adaptive test of independence with analytic kernel embeddings. In ICML, pages 1742–1751. PMLR, 2017.

[21] C. Ley, G. Reinert, and Y. Swan. Stein's method for comparison of univariate distributions. Probability Surveys, 14:1–52, 2017.

[22] Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In ICML, pages 276–284, 2016.

[23] J. Lloyd and Z. Ghahramani. Statistical model criticism using kernel two sample tests. In NIPS, pages 829–837, 2015.

[24] B. Mityagin. The zero set of a real analytic function. Dec. 2015. arXiv:1512.07276.

[25] C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718, 2017.

[26] M. L. Rizzo. New goodness-of-fit tests for Pareto distributions. ASTIN Bulletin: Journal of the International Association of Actuaries, 39(2):691–715, 2009.

[27] R. J. Serfling.
Approximation Theorems of Mathematical Statistics. John Wiley & Sons, 2009.

[28] I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.

[29] D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton. Generative models and model criticism via optimized Maximum Mean Discrepancy. In ICLR, 2016.

[30] G. J. Székely and M. L. Rizzo. A new test for multivariate normality. Journal of Multivariate Analysis, 93(1):58–80, 2005.

[31] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.

[32] Q. Zhang, S. Filippi, A. Gretton, and D. Sejdinovic. Large-scale kernel methods for independence testing. Statistics and Computing, pages 1–18, 2017.