{"title": "A Kernel Statistical Test of Independence", "book": "Advances in Neural Information Processing Systems", "page_first": 585, "page_last": 592, "abstract": null, "full_text": "A Kernel Statistical Test of Independence\n\nArthur Gretton\n\nKenji Fukumizu\n\nMPI for Biological Cybernetics\n\nInst. of Statistical Mathematics\n\nT\u00a8ubingen, Germany\n\narthur@tuebingen.mpg.de\n\nTokyo Japan\n\nfukumizu@ism.ac.jp\n\nLe Song\n\nNICTA, ANU\n\nand University of Sydney\nlesong@it.usyd.edu.au\n\nBernhard Sch\u00a8olkopf\n\nMPI for Biological Cybernetics\n\nT\u00a8ubingen, Germany\nbs@tuebingen.mpg.de\n\nChoon Hui Teo\nNICTA, ANU\n\nCanberra, Australia\n\nchoonhui.teo@gmail.com\n\nAlexander J. Smola\n\nNICTA, ANU\n\nCanberra, Australia\n\nalex.smola@gmail.com\n\nAbstract\n\nAlthough kernel measures of independence have been widely applied in machine\nlearning (notably in kernel ICA), there is as yet no method to determine whether\nthey have detected statistically signi\ufb01cant dependence. We provide a novel test of\nthe independence hypothesis for one particular kernel independence measure, the\nHilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m2),\nwhere m is the sample size. We demonstrate that this test outperforms established\ncontingency table and functional correlation-based tests, and that this advantage\nis greater for multivariate data. Finally, we show the HSIC test also applies to\ntext (and to structured data more generally), for which no other independence test\npresently exists.\n\n1 Introduction\n\nKernel independence measures have been widely applied in recent machine learning literature, most\ncommonly in independent component analysis (ICA) [2, 11], but also in \ufb01tting graphical models [1]\nand in feature selection [22]. 
One reason for their success is that these criteria have a zero expected\nvalue if and only if the associated random variables are independent, when the kernels are universal\n(in the sense of [23]). There is presently no way to tell whether the empirical estimates of these\ndependence measures indicate a statistically signi\ufb01cant dependence, however. In other words, we\nare interested in the threshold an empirical kernel dependence estimate must exceed, before we can\ndismiss with high probability the hypothesis that the underlying variables are independent.\n\nStatistical tests of independence have been associated with a broad variety of dependence measures.\nClassical tests such as Spearman\u2019s \u03c1 and Kendall\u2019s \u03c4 are widely applied, however they are not\nguaranteed to detect all modes of dependence between the random variables. Contingency table-\nbased methods, and in particular the power-divergence family of test statistics [17], are the best\nknown general purpose tests of independence, but are limited to relatively low dimensions, since they\nrequire a partitioning of the space in which each random variable resides. Characteristic function-\nbased tests [6, 13] have also been proposed, which are more general than kernel density-based tests\n[19], although to our knowledge they have been used only to compare univariate random variables.\n\nIn this paper we present three main results: \ufb01rst, and most importantly, we show how to test whether\nstatistically signi\ufb01cant dependence is detected by a particular kernel independence measure, the\nHilbert Schmidt independence criterion (HSIC, from [9]). That is, we provide a fast (O(m2) for\nsample size m) and accurate means of obtaining a threshold which HSIC will only exceed with\nsmall probability, when the underlying variables are independent. 
Second, we show the distribution\nof our empirical test statistic in the large sample limit can be straightforwardly parameterised in\nterms of kernels on the data. Third, we apply our test to structured data (in this case, by establishing\nthe statistical dependence between a text and its translation). To our knowledge, ours is the first\nindependence test for structured data.\n\nWe begin our presentation in Section 2, with a short overview of cross-covariance operators\nbetween RKHSs and their Hilbert-Schmidt norms: the latter are used to define the Hilbert-Schmidt\nIndependence Criterion (HSIC). In Section 3, we describe how to determine whether the dependence\nreturned via HSIC is statistically significant, by proposing a hypothesis test with HSIC as its\nstatistic. In particular, we show that this test can be parameterised using a combination of covariance\noperator norms and norms of mean elements of the random variables in feature space. Finally, in\nSection 4, we give our experimental results, both for testing dependence between random vectors\n(which could be used for instance to verify convergence in independent subspace analysis [25]),\nand for testing dependence between text and its translation. Software to implement the test may be\ndownloaded from http://www.kyb.mpg.de/bs/people/arthur/indep.htm\n\n2 Definitions and description of HSIC\n\nOur problem setting is as follows:\n\nProblem 1 Let Pxy be a Borel probability measure defined on a domain X \u00d7 Y, and let Px and\nPy be the respective marginal distributions on X and Y. Given an i.i.d. sample Z := (X, Y) =\n{(x1, y1), . . . , (xm, ym)} of size m drawn according to\nPxy, does Pxy factorise as PxPy (equivalently, we may write x \u22a5\u22a5 y)?\n\nWe begin with a description of our kernel dependence criterion, leaving to the following section the\nquestion of whether this dependence is significant. 
This presentation is largely a review of material\nfrom [9, 11, 22], the main difference being that we establish links to the characteristic function-based\nindependence criteria in [6, 13]. Let F be an RKHS, with the continuous feature mapping \u03c6(x) \u2208 F\nfrom each x \u2208 X, such that the inner product between the features is given by the kernel function\nk(x, x\u2032) := \u27e8\u03c6(x), \u03c6(x\u2032)\u27e9. Likewise, let G be a second RKHS on Y with kernel l(\u00b7, \u00b7) and feature\nmap \u03c8(y). Following [7], the cross-covariance operator Cxy : G \u2192 F is defined such that for all\nf \u2208 F and g \u2208 G,\n\n$$\\langle f, C_{xy} g \\rangle_F = E_{xy}\\big( [f(x) - E_x(f(x))] \\, [g(y) - E_y(g(y))] \\big).$$\n\nThe cross-covariance operator itself can then be written\n\n$$C_{xy} := E_{xy}\\big[ (\\phi(x) - \\mu_x) \\otimes (\\psi(y) - \\mu_y) \\big], \\tag{1}$$\n\nwhere \u00b5x := Ex \u03c6(x), \u00b5y := Ey \u03c8(y), and \u2297 is the tensor product [9, Eq. 6]: this is a generalisation\nof the cross-covariance matrix between random vectors. When F and G are universal reproducing\nkernel Hilbert spaces (that is, dense in the space of bounded continuous functions [23]) on the\ncompact domains X and Y, then the largest singular value of this operator, \u2016Cxy\u2016, is zero if and only\nif x \u22a5\u22a5 y [11, Theorem 6]: the operator therefore induces an independence criterion, and can be used\nto solve Problem 1. The maximum singular value gives a criterion similar to that originally proposed\nin [18], but with more restrictive function classes (rather than functions of bounded variance). 
Rather\nthan the maximum singular value, we may use the squared Hilbert-Schmidt norm (the sum of the\nsquared singular values), which has a population expression\n\n$$\\mathrm{HSIC}(P_{xy}, F, G) = E_{xx'yy'}[k(x, x') l(y, y')] + E_{xx'}[k(x, x')] E_{yy'}[l(y, y')] - 2 E_{xy}\\big[ E_{x'}[k(x, x')] E_{y'}[l(y, y')] \\big] \\tag{2}$$\n\n(assuming the expectations exist), where x\u2032 denotes an independent copy of x [9, Lemma 1]: we\ncall this the Hilbert-Schmidt independence criterion (HSIC).\n\nWe now address the problem of estimating HSIC(Pxy, F, G) on the basis of the sample Z. An\nunbiased estimator of (2) is a sum of three U-statistics [21, 22],\n\n$$\\mathrm{HSIC}(Z) = \\frac{1}{(m)_2} \\sum_{(i,j) \\in \\mathbf{i}_2^m} k_{ij} l_{ij} + \\frac{1}{(m)_4} \\sum_{(i,j,q,r) \\in \\mathbf{i}_4^m} k_{ij} l_{qr} - 2 \\frac{1}{(m)_3} \\sum_{(i,j,q) \\in \\mathbf{i}_3^m} k_{ij} l_{iq}, \\tag{3}$$\n\nwhere $(m)_n := \\frac{m!}{(m-n)!}$, the index set $\\mathbf{i}_r^m$ denotes the set of all r-tuples drawn without replacement from\nthe set {1, . . . , m}, $k_{ij} := k(x_i, x_j)$, and $l_{ij} := l(y_i, y_j)$. For the purpose of testing independence,\nhowever, we will find it easier to use an alternative, biased empirical estimate [9, Definition 2],\nobtained by replacing the U-statistics with V-statistics (see footnote 1),\n\n$$\\mathrm{HSIC}_b(Z) = \\frac{1}{m^2} \\sum_{i,j}^{m} k_{ij} l_{ij} + \\frac{1}{m^4} \\sum_{i,j,q,r}^{m} k_{ij} l_{qr} - 2 \\frac{1}{m^3} \\sum_{i,j,q}^{m} k_{ij} l_{iq} = \\frac{1}{m^2} \\mathrm{trace}(KHLH), \\tag{4}$$\n\nwhere the summation indices now denote all r-tuples drawn with replacement from {1, . . . , m} (r\nbeing the number of indices below the sum), K is the m \u00d7 m matrix with entries $k_{ij}$, $H = I - \\frac{1}{m} \\mathbf{1}\\mathbf{1}^\\top$,\nand $\\mathbf{1}$ is an m \u00d7 1 vector of ones (the cost of computing this statistic is $O(m^2)$). When a Gaussian\nkernel $k_{ij} := \\exp\\left( -\\sigma^{-2} \\|x_i - x_j\\|^2 \\right)$ is used (or a kernel deriving from [6, Eq. 4.10]), the latter\nstatistic is equivalent to the characteristic function-based statistic [6, Eq. 4.11] and the $T_{2n}$ statistic\nof [13, p. 
54]: details are reproduced in [10] for comparison. Our setting allows for more general\nkernels, however, such as kernels on strings (as in our experiments in Section 4) and graphs (see\n[20] for further details of kernels on structures): this is not possible under the characteristic function\nframework, which is restricted to Euclidean spaces ($R^d$ in the case of [6, 13]). As pointed out in [6,\nSection 5], the statistic in (4) can also be linked to the original quadratic test of Rosenblatt [19] given\nan appropriate kernel choice; the main differences are that characteristic function-based tests (and\nRKHS-based tests) are not restricted to using kernel densities, nor should they reduce their kernel\nwidth with increasing sample size. Another related test described in [4] is based on the functional\ncanonical correlation between F and G, rather than the covariance: in this sense the test statistic\nresembles those in [2]. The approach in [4] differs from both the present work and [2], however,\nin that the function spaces F and G are represented by finite sets of basis functions (specifically\nB-spline kernels) when computing the empirical test statistic.\n\n3 Test description\n\nWe now describe a statistical test of independence for two random variables, based on the test\nstatistic HSICb(Z). We begin with a more formal introduction to the framework and terminology\nof statistical hypothesis testing. Given the i.i.d. sample Z defined earlier, the statistical test\n$T(Z) : (X \\times Y)^m \\mapsto \\{0, 1\\}$ is used to distinguish between the null hypothesis H0 : Pxy = PxPy and\nthe alternative hypothesis H1 : Pxy \u2260 PxPy. This is achieved by comparing the test statistic, in\nour case HSICb(Z), with a particular threshold: if the threshold is exceeded, then the test rejects\nthe null hypothesis (bearing in mind that a zero population HSIC indicates Pxy = PxPy). The\nacceptance region of the test is thus defined as the set of real numbers below the threshold. 
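As a rough illustration of the test just described, the sketch below computes the biased statistic HSICb(Z) of (4) and applies the threshold rule. It relies on the identity trace(KHLH) = sum over i, j of (HKH)_ij l_ij, which holds when L is symmetric. This is a minimal pure-Python sketch under stated assumptions, not the authors' released software: the function names gaussian_gram, center, hsic_b and reject_h0 are hypothetical, samples are assumed to be lists of numeric tuples, and the threshold is assumed to come from one of the quantile approximations discussed later in this section.

```python
import math

def gaussian_gram(xs, sigma):
    # Gram matrix with k_ij = exp(-||x_i - x_j||^2 / sigma^2); each x is a tuple of floats.
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [[math.exp(-sqdist(a, b) / sigma ** 2) for b in xs] for a in xs]

def center(K):
    # Return H K H for H = I - (1/m) 1 1^T, without forming H explicitly.
    m = len(K)
    row = [sum(r) / m for r in K]
    col = [sum(K[i][j] for i in range(m)) / m for j in range(m)]
    tot = sum(row) / m
    return [[K[i][j] - row[i] - col[j] + tot for j in range(m)] for i in range(m)]

def hsic_b(K, L):
    # Biased statistic (4): trace(K H L H) / m^2 in O(m^2) time (L assumed symmetric).
    m = len(K)
    Kc = center(K)
    return sum(Kc[i][j] * L[i][j] for i in range(m) for j in range(m)) / m ** 2

def reject_h0(statistic, threshold):
    # Test T(Z): 1 rejects independence, 0 accepts it.
    return 1 if statistic > threshold else 0

# Hypothetical toy sample, for illustration only.
xs = [(0.0,), (1.0,), (2.0,), (3.0,)]
ys = [(0.1,), (0.9,), (2.2,), (2.8,)]
stat = hsic_b(gaussian_gram(xs, 1.0), gaussian_gram(ys, 1.0))
```

Centring only one of the two Gram matrices is enough here because H is idempotent; the cost stays quadratic in m, in line with the O(m2) figure quoted for (4).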
Since the test\nis based on a finite sample, it is possible that an incorrect answer will be returned: the Type I error\nis defined as the probability of rejecting H0 based on the observed sample, despite x and y being\nindependent. Conversely, the Type II error is the probability of accepting Pxy = PxPy when the\nunderlying variables are dependent. The level \u03b1 of a test is an upper bound on the Type I error, and\nis a design parameter of the test, used to set the test threshold. A consistent test achieves a level \u03b1,\nand a Type II error of zero, in the large sample limit.\n\nHow, then, do we set the threshold of the test given \u03b1? The approach we adopt here is to derive\nthe asymptotic distribution of the empirical estimate HSICb(Z) of HSIC(Pxy, F, G) under H0. We\nthen use the 1 \u2212 \u03b1 quantile of this distribution as the test threshold (see footnote 2). Our presentation in this section\nis therefore divided into two parts. First, we obtain the distribution of HSICb(Z) under both H0 and\nH1; the latter distribution is also needed to ensure consistency of the test. We shall see, however, that\nthe null distribution has a complex form, and cannot be evaluated directly. Thus, in the second part\nof this section, we describe ways to accurately approximate the 1 \u2212 \u03b1 quantile of this distribution.\n\nAsymptotic distribution of HSICb(Z) We now describe the distribution of the test statistic in (4).\nThe first theorem holds under H1.\n\nFootnote 1: The U- and V-statistics differ in that the latter allow indices of different sums to be equal.\nFootnote 2: An alternative would be to use a large deviation bound, as provided for instance by [9] based on Hoeffding\u2019s\ninequality. 
It has been reported in [8], however, that such bounds are generally too loose for hypothesis testing.\n\nTheorem 1 Let\n\n$$h_{ijqr} = \\frac{1}{4!} \\sum_{(t,u,v,w)}^{(i,j,q,r)} \\left( k_{tu} l_{tu} + k_{tu} l_{vw} - 2 k_{tu} l_{tv} \\right), \\tag{5}$$\n\nwhere the sum represents all ordered quadruples (t, u, v, w) drawn without replacement from\n(i, j, q, r), and assume $E(h^2) < \\infty$. Under H1, HSICb(Z) converges in distribution as m \u2192 \u221e\nto a Gaussian according to\n\n$$m^{1/2} \\left( \\mathrm{HSIC}_b(Z) - \\mathrm{HSIC}(P_{xy}, F, G) \\right) \\xrightarrow{D} N(0, \\sigma_u^2). \\tag{6}$$\n\nThe variance is $\\sigma_u^2 = 16 \\left( E_i \\left( E_{j,q,r} h_{ijqr} \\right)^2 - \\mathrm{HSIC}(P_{xy}, F, G) \\right)$, where $E_{j,q,r} := E_{z_j, z_q, z_r}$.\n\nProof We first rewrite (4) as a single V-statistic,\n\n$$\\mathrm{HSIC}_b(Z) = \\frac{1}{m^4} \\sum_{i,j,q,r}^{m} h_{ijqr}, \\tag{7}$$\n\nwhere we note that $h_{ijqr}$ defined in (5) does not change with permutation of its indices. The associated\nU-statistic HSICs(Z) converges in distribution as (6) with variance $\\sigma_u^2$ [21, Theorem 5.5.1(A)]:\nsee [22]. Since the difference between HSICb(Z) and HSICs(Z) drops as 1/m (see [9], or Theorem\n3 below), HSICb(Z) converges asymptotically to the same distribution.\n\nThe second theorem applies under H0.\n\nTheorem 2 Under H0, the U-statistic HSICs(Z) corresponding to the V-statistic in (7) is degenerate,\nmeaning $E_i h_{ijqr} = 0$. 
In this case, HSICb(Z) converges in distribution according to [21,\nSection 5.5.2]\n\n$$m \\, \\mathrm{HSIC}_b(Z) \\xrightarrow{D} \\sum_{l=1}^{\\infty} \\lambda_l z_l^2, \\tag{8}$$\n\nwhere $z_l \\sim N(0, 1)$ i.i.d., and $\\lambda_l$ are the solutions to the eigenvalue problem\n\n$$\\lambda_l \\psi_l(z_j) = \\int h_{ijqr} \\psi_l(z_i) \\, dF_{i,q,r},$$\n\nwhere the integral is over the distribution of variables $z_i$, $z_q$, and $z_r$.\n\nProof This follows from the discussion of [21, Section 5.5.2], making appropriate allowance for\nthe fact that we are dealing with a V-statistic (which is why the terms in (8) are not centred: in the\ncase of a U-statistic, the sum would be over terms $\\lambda_l (z_l^2 - 1)$).\n\nApproximating the 1 \u2212 \u03b1 quantile of the null distribution A hypothesis test using HSICb(Z)\ncould be derived from Theorem 2 above by computing the (1 \u2212 \u03b1)th quantile of the distribution (8),\nwhere consistency of the test (that is, the convergence to zero of the Type II error for m \u2192 \u221e) is\nguaranteed by the decay as $m^{-1}$ of the variance of HSICb(Z) under H1. The distribution under H0\nis complex, however: the question then becomes how to accurately approximate its quantiles.\n\nOne approach, taken by [6], is to use a Monte Carlo resampling technique: the ordering of the Y\nsample is permuted repeatedly while that of X is kept fixed, and the 1 \u2212 \u03b1 quantile is obtained\nfrom the resulting distribution of HSICb values. This can be very expensive, however. A second\napproach, suggested in [13, p. 34], is to approximate the null distribution as a two-parameter Gamma\ndistribution [12, p. 343, p. 359]: this is one of the more straightforward approximations of an infinite\nsum of \u03c7\u00b2 variables (see [12, Chapter 18.8] for further ways to approximate such distributions; in\nparticular, we wish to avoid using moments of order greater than two, since these can become\nexpensive to compute). 
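The permutation approach of [6] described above can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the names centered and permutation_threshold are hypothetical, the inputs are assumed to be symmetric Gram matrices stored as lists of lists, and the empirical quantile rule (taking a sorted null value at index ceil((1 - alpha) n) - 1) is one simple choice among several. Since centring commutes with reordering the sample, both matrices are centred once and only the indices of the Y matrix are permuted.

```python
import math
import random

def centered(G):
    # H G H for a symmetric Gram matrix G, with H = I - (1/m) 1 1^T.
    m = len(G)
    row = [sum(r) / m for r in G]
    tot = sum(row) / m
    # Symmetry of G means column means equal row means.
    return [[G[i][j] - row[i] - row[j] + tot for j in range(m)] for i in range(m)]

def permutation_threshold(K, L, alpha=0.05, n_perm=200, seed=0):
    # Returns (observed HSIC_b, estimated 1 - alpha quantile of the permutation null).
    m = len(K)
    Kc, Lc = centered(K), centered(L)

    def stat(p):
        # trace(K H L_p H) / m^2, where L_p is L with rows and columns reordered by p.
        return sum(Kc[i][j] * Lc[p[i]][p[j]] for i in range(m) for j in range(m)) / m ** 2

    rng = random.Random(seed)
    identity = list(range(m))
    observed = stat(identity)
    null = []
    for _ in range(n_perm):
        p = identity[:]
        rng.shuffle(p)  # reorder the Y sample while the X sample stays fixed
        null.append(stat(p))
    null.sort()
    k = min(n_perm - 1, max(0, int(math.ceil((1 - alpha) * n_perm)) - 1))
    return observed, null[k]
```

Each permutation costs one O(m2) evaluation of the statistic, so the total cost grows linearly with the number of permutations. The Gamma approximation introduced above avoids this resampling.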
Specifically, we make the approximation\n\n$$m \\, \\mathrm{HSIC}_b(Z) \\sim \\frac{x^{\\alpha - 1} e^{-x/\\beta}}{\\beta^{\\alpha} \\Gamma(\\alpha)}, \\quad \\text{where} \\quad \\alpha = \\frac{(E(\\mathrm{HSIC}_b(Z)))^2}{\\mathrm{var}(\\mathrm{HSIC}_b(Z))}, \\quad \\beta = \\frac{m \\, \\mathrm{var}(\\mathrm{HSIC}_b(Z))}{E(\\mathrm{HSIC}_b(Z))}. \\tag{9}$$\n\nAn illustration of the cumulative distribution function\n(CDF) obtained via the Gamma approximation is given\nin Figure 1, along with an empirical CDF obtained by\nrepeated draws of HSICb. We note the Gamma approximation\nis quite accurate, especially in areas of high probability\n(which we use to compute the test quantile). The\naccuracy of this approximation will be further evaluated\nexperimentally in Section 4.\n\nFigure 1: mHSICb cumulative distribution\nfunction (Emp) under H0 for m = 200,\nobtained empirically using 5000 independent\ndraws of mHSICb. The two-parameter\nGamma distribution (Gamma) is fit using\n\u03b1 = 1.17 and \u03b2 = 8.3 \u00d7 10^{-4} in (9), with\nmean and variance computed via Theorems\n3 and 4. [Plot: x-axis mHSICb; y-axis P(mHSICb(Z) < mHSICb); curves Emp and Gamma.]\n\nTo obtain the Gamma distribution from our observations,\nwe need empirical estimates for E(HSICb(Z)) and\nvar(HSICb(Z)) under the null hypothesis. Expressions\nfor these quantities are given in [13, pp. 26-27], however\nthese are in terms of the joint and marginal characteristic\nfunctions, and not in our more general kernel setting\n(see also [14, p. 313]). In the following two theorems,\nwe provide much simpler expressions for both quantities,\nin terms of norms of mean elements \u00b5x and \u00b5y, and the\ncovariance operators\n\n$$C_{xx} := E_x\\big[ (\\phi(x) - \\mu_x) \\otimes (\\phi(x) - \\mu_x) \\big]$$\n\nand Cyy, in feature space. 
The main advantage of our new expressions is that they are computed\nentirely in terms of kernels, which makes possible the application of the test to any domains on\nwhich kernels can be defined, and not only $R^d$.\n\nTheorem 3 Under H0,\n\n$$E(\\mathrm{HSIC}_b(Z)) = \\frac{1}{m} \\mathrm{Tr} C_{xx} \\mathrm{Tr} C_{yy} = \\frac{1}{m} \\left( 1 + \\|\\mu_x\\|^2 \\|\\mu_y\\|^2 - \\|\\mu_x\\|^2 - \\|\\mu_y\\|^2 \\right), \\tag{10}$$\n\nwhere the second equality assumes $k_{ii} = l_{ii} = 1$. An empirical estimate of this statistic is obtained\nby replacing the norms above with $\\widehat{\\|\\mu_x\\|^2} = (m)_2^{-1} \\sum_{(i,j) \\in \\mathbf{i}_2^m} k_{ij}$, bearing in mind that this results\nin a (generally negligible) bias of $O(m^{-1})$ in the estimate of $\\|\\mu_x\\|^2 \\|\\mu_y\\|^2$.\n\nTheorem 4 Under H0,\n\n$$\\mathrm{var}(\\mathrm{HSIC}_b(Z)) = \\frac{2(m - 4)(m - 5)}{(m)_4} \\|C_{xx}\\|_{HS}^2 \\|C_{yy}\\|_{HS}^2 + O(m^{-3}).$$\n\nDenoting by \u2299 the entrywise matrix product, $A^{\\cdot 2}$ the entrywise matrix power, and\n$B = ((HKH) \\odot (HLH))^{\\cdot 2}$, an empirical estimate with negligible bias may be found by replacing the\nproduct of covariance operator norms with $\\mathbf{1}^\\top (B - \\mathrm{diag}(B)) \\mathbf{1}$: this is slightly more efficient than\ntaking the product of the empirical operator norms (although the scaling with m is unchanged).\n\nProofs of both theorems may be found in [10], where we also compare with the original characteristic\nfunction-based expressions in [13]. We remark that these parameters, like the original test statistic\nin (4), may be computed in $O(m^2)$.\n\n4 Experiments\n\nGeneral tests of statistical independence are most useful for data having complex interactions that\nsimple correlation does not detect. 
We investigate two cases where this situation arises: \ufb01rst, we\ntest vectors in Rd which have a dependence relation but no correlation, as occurs in independent\nsubspace analysis; and second, we study the statistical dependence between a text and its translation.\n\nIndependence of subspaces One area where independence tests have been applied is in deter-\nmining the convergence of algorithms for independent component analysis (ICA), which involves\nseparating random variables that have been linearly mixed, using only their mutual independence.\nICA generally entails optimisation over a non-convex function (including when HSIC is itself the\noptimisation criterion [9]), and is susceptible to local minima, hence the need for these tests (in fact,\nfor classical approaches to ICA, the global minimum of the optimisation might not correspond to\nindependence for certain source distributions). Contingency table-based tests have been applied [15]\n\n5\n\n\fin this context, while the test of [13] has been used in [14] for verifying ICA outcomes when the\ndata are stationary random processes (through using a subset of samples with a suf\ufb01ciently large\ndelay between them). Contingency table-based tests may be less useful in the case of independent\nsubspace analysis (ISA, see e.g. [25] and its bibliography), where higher dimensional independent\nrandom vectors are to be separated. Thus, characteristic function-based tests [6, 13] and kernel\nindependence measures might work better for this problem.\n\nIn our experiments, we tested the independence of random vectors, as a way of verifying the so-\nlutions of independent subspace analysis. We assumed for ease of presentation that our subspaces\nhave respective dimension dx = dy = d, but this is not required. The data were constructed as\nfollows. 
First, we generated m samples of two univariate random variables, each drawn at random\nfrom the ICA benchmark densities in [11, Table 3]: these include super-Gaussian, sub-Gaussian,\nmultimodal, and unimodal distributions. Second, we mixed these random variables using a rota-\ntion matrix parameterised by an angle \u03b8, varying from 0 to \u03c0/4 (a zero angle means the data are\nindependent, while dependence becomes easier to detect as the angle increases to \u03c0/4: see the two\nplots in Figure 2, top left). Third, we appended d \u2212 1 dimensional Gaussian noise of zero mean\nand unit standard deviation to each of the mixtures. Finally, we multiplied each resulting vector\nby an independent random d-dimensional orthogonal matrix, to obtain vectors dependent across all\nobserved dimensions. We emphasise that classical approaches (such as Spearman\u2019s \u03c1 or Kendall\u2019s\n\u03c4) are completely unable to \ufb01nd this dependence, since the variables are uncorrelated; nor can we\nrecover the subspace in which the variables are dependent using PCA, since this subspace has the\nsame second order properties as the noise. We investigated sample sizes m = 128, 512, 1024, 2048,\nand d = 1, 2, 4.\nWe compared two different methods for computing the 1 \u2212 \u03b1 quantile of the HSIC null distribution:\nrepeated random permutation of the Y sample ordering as in [6] (HSICp), where we used 200 per-\nmutations; and Gamma approximation (HSICg) as in [13], based on (9). We used a Gaussian kernel,\nwith kernel size set to the median distance between points in input space. We also compared with\ntwo alternative tests, the \ufb01rst based on a discretisation of the variables, and the second on functional\ncanonical correlation. 
The discretisation based test was a power-divergence contingency table test\nfrom [17] (PD), which consisted in partitioning the space, counting the number of samples falling\nin each partition, and comparing this with the number of samples that would be expected under the\nnull hypothesis (the test we used, described in [15], is more re\ufb01ned than this short description would\nsuggest). Rather than a uniform space partitioning, we divided our space into roughly equiprobable\nbins as in [15], using a Gessaman partition for higher dimensions [5, Figure 21.4] (Ku and Fine did\nnot specify a space partitioning strategy for higher dimensions, since they dealt only with univariate\nrandom variables). All remaining parameters were set according to [15]. The functional correlation-\nbased test (fCorr) is described in [4]: the main differences with respect to our test are that it uses\nthe spectrum of the functional correlation operator, rather than the covariance operator; and that it\napproximates the RKHSs F and G by \ufb01nite sets of basis functions. Parameter settings were as in\n[4, Table 1], with the second order B-spline kernel and a twofold dyadic partitioning. Note that\nfCorr applies only in the univariate case. Results are plotted in Figure 2 (average over 500 repeti-\ntions). The y-intercept on these plots corresponds to the acceptance rate of H0 at independence, or\n1 \u2212 (Type I error), and should be close to the design parameter of 1 \u2212 \u03b1 = 0.95. Elsewhere, the\nplots indicate acceptance of H0 where the underlying variables are dependent, i.e. the Type II error.\nAs expected, we observe that dependence becomes easier to detect as \u03b8 increases from 0 to \u03c0/4,\nwhen m increases, and when d decreases. The PD and fCorr tests perform poorly at m = 128,\nbut approach the performance of HSIC-based tests for increasing m (although PD remains slightly\nworse than HSIC at m = 512 and d = 1, while fCorr becomes slightly worse again than PD). 
PD\nalso scales very badly with d, and never rejects the null hypothesis when d = 4, even for m = 2048.\nAlthough HSIC-based tests are unreliable for small \u03b8, they generally do well as \u03b8 approaches \u03c0/4\n(besides m = 128, d = 2). We also emphasise that HSICp and HSICg perform identically, although\nHSICp is far more costly (by a factor of around 100, given the number of permutations used).\n\nFigure 2: Top left plots: Example dataset for d = 1, m = 200, and rotation angles \u03b8 = \u03c0/8 (left) and \u03b8 = \u03c0/4\n(right). In this case, both sources are mixtures of two Gaussians (source (g) in [11, Table 3]). We remark that\nthe random variables appear \u201cmore dependent\u201d as the angle \u03b8 increases, although their correlation is always\nzero. Remaining plots: Rate of acceptance of H0 for the PD, fCorr, HSICp, and HSICg tests. \u201cSamp\u201d is the\nnumber m of samples, and \u201cdim\u201d is the dimension d of x and y. [Panels: Samp:128, Dim:1; Samp:128, Dim:2; Samp:512, Dim:1; Samp:512, Dim:2; Samp:1024, Dim:4; Samp:2048, Dim:4. x-axis: Angle (\u00d7\u03c0/4); y-axis: % acceptance of H0.]\n\nDependence and independence between text In this section, we demonstrate independence\ntesting on text. Our data are taken from the Canadian Hansard corpus\n(http://www.isi.edu/natural-language/download/hansard/). These consist of the official\nrecords of the 36th Canadian parliament, in English and French. We used debate transcripts\non the three topics of Agriculture, Fisheries, and Immigration, due to the relatively large volume of\ndata in these categories. Our goal was to test whether there exists a statistical dependence between\nEnglish text and its French translation. Our dependent data consisted of a set of paragraph-long\n(5 line) English extracts and their French translations. For our independent data, the English paragraphs\nwere matched to random French paragraphs on the same topic: for instance, an English\nparagraph on fisheries would always be matched with a French paragraph on fisheries. This was\ndesigned to prevent a simple vocabulary check from being used to tell when text was mismatched.\nWe also ignored lines shorter than five words long, since these were not always part of the text (e.g.\nidentification of the person speaking). We used the k-spectrum kernel of [16], computed according\nto the method of [24]. We set k = 10 for both languages, where this was chosen by cross validating\non an SVM classifier for Fisheries vs National Defense, separately for each language (performance\nwas not especially sensitive to choice of k; k = 5 also worked well). We compared this kernel with\na simple kernel between bags of words [3, pp. 186\u2013189]. 
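To make the k-spectrum kernel mentioned above more concrete, the sketch below evaluates it naively as the inner product of substring-count vectors. It is an illustration only and rests on assumptions: it works at the character level, the function names spectrum_features and spectrum_kernel are hypothetical, and it uses plain dictionaries rather than the suffix-array method of [24] that the experiments relied on.

```python
from collections import Counter

def spectrum_features(text, k):
    # Count every contiguous length-k substring of the text (empty if the text is shorter than k).
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

def spectrum_kernel(s, t, k):
    # k-spectrum kernel: sum over shared substrings of the product of their counts.
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    if len(ft) < len(fs):
        fs, ft = ft, fs  # iterate over the smaller feature map
    return sum(count * ft[sub] for sub, count in fs.items())

# Hypothetical usage on two short strings.
value = spectrum_kernel('independence test', 'dependence testing', 5)
```

A Gram matrix built from such a kernel over paired extracts can be passed to the HSIC statistic in place of a Gaussian Gram matrix, which is what allows the test to be applied to text.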
Results are in Table 1.\n\nOur results demonstrate the excellent performance of the HSICp test on this task: even for small\nsample sizes, HSICp with a spectral kernel always achieves zero Type II error, and a Type I error\nclose to the design value (0.05, i.e. a null acceptance rate of 0.95). We further observe for m = 10 that HSICp with the spectral kernel\nalways has better Type II error than the bag-of-words kernel. This suggests that a kernel with a more\nsophisticated encoding of text structure induces a more sensitive test, although for larger sample\nsizes, the advantage vanishes. The HSICg test does less well on these data, always accepting H0 for\nm = 10, and returning a Type I error of zero, rather than the design value of 5%, when m = 50. It\nappears that this is due to a very low variance estimate returned by the Theorem 4 expression, which\ncould be caused by the high diagonal dominance of kernels on strings. Thus, while the test threshold\nfor HSICg at m = 50 still fell between the dependent and independent values of HSICb, this was\nnot the result of an accurate modelling of the null distribution. We would therefore recommend the\npermutation approach for this problem. Finally, we also tried testing with 2-line extracts and 10-line\nextracts, which yielded similar results.\n\n5 Conclusion\n\nWe have introduced a test of whether significant statistical dependence is obtained by a kernel dependence\nmeasure, the Hilbert-Schmidt independence criterion (HSIC). Our test costs $O(m^2)$ for sample\nsize m. In our experiments, HSIC-based tests always outperformed the contingency table [17]\nand functional correlation [4] approaches, for both univariate random variables and higher dimensional\nvectors which were dependent but uncorrelated. We would therefore recommend HSIC-based\ntests for checking the convergence of independent component analysis and independent subspace\nanalysis. 
Finally, our test also applies on structured domains, being able to detect the dependence\nof passages of text and their translation. Another application along these lines might be in testing\ndependence between data of completely different types, such as images and captions.\n\nTable 1: Independence tests for cross-language dependence detection. Topics are in the first column, where the\ntotal number of 5-line extracts for each dataset is in parentheses. BOW(10) denotes a bag of words kernel and\nm = 10 sample size, Spec(50) is a k-spectrum kernel with m = 50. The first entry in each cell is the null\nacceptance rate of the test under H0 (i.e. 1 \u2212 (Type I error); should be near 0.95); the second entry is the null\nacceptance rate under H1 (the Type II error, small is better). Each entry is an average over 300 repetitions.\n\nTopic | BOW(10) HSICg | BOW(10) HSICp | Spec(10) HSICg | Spec(10) HSICp | BOW(50) HSICg | BOW(50) HSICp | Spec(50) HSICg | Spec(50) HSICp\nAgriculture (555) | 1.00, 0.99 | 0.94, 0.18 | 1.00, 1.00 | 0.95, 0.00 | 1.00, 0.00 | 0.93, 0.00 | 1.00, 0.00 | 0.95, 0.00\nFisheries (408) | 1.00, 1.00 | 0.94, 0.20 | 1.00, 1.00 | 0.94, 0.00 | 1.00, 0.00 | 0.93, 0.00 | 1.00, 0.00 | 0.95, 0.00\nImmigration (289) | 1.00, 1.00 | 0.96, 0.09 | 1.00, 1.00 | 0.91, 0.00 | 0.99, 0.00 | 0.94, 0.00 | 1.00, 0.00 | 0.95, 0.00\n\nAcknowledgements: NICTA is funded through the Australian Government\u2019s Backing Australia\u2019s\nAbility initiative, in part through the ARC. This work was supported in part by the IST Programme\nof the European Community, under the PASCAL Network of Excellence, IST-2002-506778.\n\nReferences\n[1] F. Bach and M. Jordan. Tree-dependent component analysis. In UAI 18, 2002.\n[2] F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1\u201348, 2002.\n[3] I. Calvino. If on a winter\u2019s night a traveler. Harvest Books, Florida, 1982.\n[4] J. Dauxois and G. M. Nkiet. 
Nonlinear canonical analysis and independence tests. Ann. Statist., 26(4):1254\u20131278, 1998.\n[5] L. Devroye, L. Gy\u00f6rfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of Mathematics. Springer, New York, 1996.\n[6] A. Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419\u2013433, 1993.\n[7] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73\u201399, 2004.\n[8] A. Gretton, K. Borgwardt, M. Rasch, B. Sch\u00f6lkopf, and A. Smola. A kernel method for the two-sample-problem. In NIPS 19, pages 513\u2013520, Cambridge, MA, 2007. MIT Press.\n[9] A. Gretton, O. Bousquet, A.J. Smola, and B. Sch\u00f6lkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, pages 63\u201377, 2005.\n[10] A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Sch\u00f6lkopf, and A. Smola. A kernel statistical test of independence. Technical Report 168, MPI for Biological Cybernetics, 2008.\n[11] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Sch\u00f6lkopf. Kernel methods for measuring independence. J. Mach. Learn. Res., 6:2075\u20132129, 2005.\n[12] N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions. Volume 1 (Second Edition). John Wiley and Sons, 1994.\n[13] A. Kankainen. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyv\u00e4skyl\u00e4, 1995.\n[14] J. Karvanen. A resampling test for the total independence of stationary time series: Application to the performance evaluation of ICA algorithms. Neural Processing Letters, 22(3):311\u2013324, 2005.\n[15] C.-J. Ku and T. Fine. Testing for stochastic independence: application to blind source separation. 
IEEE Transactions on Signal Processing, 53(5):1815\u20131826, 2005.\n[16] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Pacific Symposium on Biocomputing, pages 564\u2013575, 2002.\n[17] T. Read and N. Cressie. Goodness-Of-Fit Statistics for Discrete Multivariate Analysis. Springer-Verlag, New York, 1988.\n[18] A. R\u00e9nyi. On measures of dependence. Acta Math. Acad. Sci. Hungar., 10:441\u2013451, 1959.\n[19] M. Rosenblatt. A quadratic measure of deviation of two-dimensional density estimates and a test of independence. The Annals of Statistics, 3(1):1\u201314, 1975.\n[20] B. Sch\u00f6lkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, 2004.\n[21] R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.\n[22] L. Song, A. Smola, A. Gretton, K. Borgwardt, and J. Bedo. Supervised feature selection via dependence estimation. In Proc. Intl. Conf. Machine Learning, pages 823\u2013830. Omnipress, 2007.\n[23] I. Steinwart. The influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2, 2002.\n[24] C. H. Teo and S. V. N. Vishwanathan. Fast and space efficient string kernels using suffix arrays. In ICML, pages 929\u2013936, 2006.\n[25] F.J. Theis. Towards a general independent subspace analysis. In NIPS 19, 2007.\n", "award": [], "sourceid": 730, "authors": [{"given_name": "Arthur", "family_name": "Gretton", "institution": null}, {"given_name": "Kenji", "family_name": "Fukumizu", "institution": null}, {"given_name": "Choon", "family_name": "Teo", "institution": null}, {"given_name": "Le", "family_name": "Song", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}]}