{"title": "A Kernel Test for Three-Variable Interactions", "book": "Advances in Neural Information Processing Systems", "page_first": 1124, "page_last": 1132, "abstract": "We introduce kernel nonparametric tests for Lancaster three-variable interaction and for total independence, using embeddings of signed measures into a reproducing kernel Hilbert space. The resulting test statistics are straightforward to compute, and are used in powerful three-variable interaction tests, which are consistent against all alternatives for a large family of reproducing kernels. We show the Lancaster test to be sensitive to cases where two independent causes individually have weak influence on a third dependent variable, but their combined effect has a strong influence. This makes the Lancaster test especially suited to finding structure in directed graphical models, where it outperforms competing nonparametric tests in detecting such V-structures.", "full_text": "A Kernel Test for Three-Variable Interactions\n\nDino Sejdinovic, Arthur Gretton\n\nGatsby Unit, CSML, UCL, UK\n\n{dino.sejdinovic, arthur.gretton}@gmail.com\n\nWicher Bergsma\n\nDepartment of Statistics, LSE, UK\n\nw.p.bergsma@lse.ac.uk\n\nAbstract\n\nWe introduce kernel nonparametric tests for Lancaster three-variable interaction\nand for total independence, using embeddings of signed measures into a repro-\nducing kernel Hilbert space. The resulting test statistics are straightforward to\ncompute, and are used in powerful interaction tests, which are consistent against\nall alternatives for a large family of reproducing kernels. We show the Lancaster\ntest to be sensitive to cases where two independent causes individually have weak\nin\ufb02uence on a third dependent variable, but their combined effect has a strong\nin\ufb02uence. 
This makes the Lancaster test especially suited to finding structure in directed graphical models, where it outperforms competing nonparametric tests in detecting such V-structures.

1 Introduction

The problem of nonparametric testing of interaction between variables has been widely treated in the machine learning and statistics literature. Much of the work in this area focuses on measuring or testing pairwise interaction: for instance, the Hilbert-Schmidt Independence Criterion (HSIC) or Distance Covariance [1, 2, 3], kernel canonical correlation [4, 5, 6], and mutual information [7]. In cases where more than two variables interact, however, the questions we can ask about their interaction become significantly more involved. The simplest case we might consider is whether the variables are mutually independent, $P_X = \prod_{i=1}^{d} P_{X_i}$, as considered in $\mathbb{R}^d$ by [8]. This is already a more general question than pairwise independence, since pairwise independence does not imply total (mutual) independence, while the implication holds in the other direction. For example, if X and Y are i.i.d. uniform on {-1, 1}, then (X, Y, XY) is a pairwise independent but mutually dependent triplet [9]. Tests of total and pairwise independence are insufficient, however, since they do not rule out all third-order factorizations of the joint distribution.

An important class of higher-order interactions occurs when the simultaneous effect of two variables on a third may not be additive. In particular, it may be possible that X ⊥⊥ Z and Y ⊥⊥ Z, whereas ¬((X, Y) ⊥⊥ Z) (for example, neither adding sugar to coffee nor stirring the coffee individually has an effect on its sweetness, but the joint presence of the two does). In addition, the study of three-variable interactions can elucidate certain switching mechanisms between positive and negative correlation of two gene expressions, as controlled by a third gene [10].
The presence of such interactions is typically tested using some form of analysis of variance (ANOVA) model which includes additional interaction terms, such as products of individual variables. Since each such additional term requires a new hypothesis test, this increases the risk that some hypothesis test will produce a false positive by chance. Therefore, a test that is able to directly detect the presence of any kind of higher-order interaction would be of broad interest in statistical modeling. In the present work, we provide to our knowledge the first nonparametric test for three-variable interaction. This work generalizes the HSIC test of pairwise independence, and has as its test statistic the norm of an embedding of an appropriate signed measure into a reproducing kernel Hilbert space (RKHS). When the statistic is non-zero, all third-order factorizations can be ruled out. Moreover, this test is applicable to cases where X, Y and Z are themselves multivariate objects, and may take values in non-Euclidean or structured domains.1

One important application of interaction measures is in learning structure for graphical models. If the graphical model is assumed to be Gaussian, then second-order interaction statistics may be used to construct an undirected graph [11, 12]. When the interactions are non-Gaussian, however, other approaches are brought to bear. An alternative approach to structure learning is to employ conditional independence tests. In the PC algorithm [13, 14, 15], a V-structure (a directed graphical model with two independent parents pointing to a single child) is detected when an independence test between the parent variables accepts the null hypothesis, while a test of dependence of the parents conditioned on the child rejects the null hypothesis.
The PC algorithm gives a correct equivalence class of structures subject to the causal Markov and faithfulness assumptions, in the absence of hidden common causes. The original implementations of the PC algorithm rely on partial correlations for testing, and assume Gaussianity. A number of algorithms have since extended the basic PC algorithm to arbitrary probability distributions over multivariate random variables [16, 17, 18], by using nonparametric kernel independence tests [19] and conditional dependence tests [20, 18]. We observe that our Lancaster-interaction-based test provides a strong alternative to the conditional dependence testing approach, and is seen to outperform earlier approaches in detecting cases where independent parent variables weakly influence the child variable when considered individually, but have a strong combined influence.

We begin our presentation in Section 2 with a definition of interaction measures, these being the signed measures we will embed in an RKHS. We cover this embedding procedure in Section 3. We then proceed in Section 4 to define pairwise and three-way interactions. We describe a statistic to test mutual independence for more than three variables, and provide a brief overview of the more complex higher-order interactions that may be observed when four or more variables are considered. Finally, we provide experimental benchmarks in Section 5.

2 Interaction Measure

An interaction measure [21, 22] associated to a multidimensional probability distribution P of a random vector (X_1, ..., X_D) taking values in the product space X_1 × ··· × X_D is a signed measure ΔP that vanishes whenever P can be factorised in a non-trivial way as a product of its (possibly multivariate) marginal distributions.
For the cases D = 2, 3 the correct interaction measure coincides with the notion introduced by Lancaster [21] as a formal product

$$\Delta_L P = \prod_{i=1}^{D}\left(P^*_{X_i} - P_{X_i}\right),\qquad(1)$$

where each product $\prod_{j=1}^{D'} P^*_{X_{i_j}}$ signifies the joint probability distribution $P_{X_{i_1}\cdots X_{i_{D'}}}$ of a subvector $(X_{i_1}, \ldots, X_{i_{D'}})$. We will term the signed measure in (1) the Lancaster interaction measure. In the case of a bivariate distribution, the Lancaster interaction measure is simply the difference between the joint probability distribution and the product of the marginal distributions (the only possible non-trivial factorization for D = 2), $\Delta_L P = P_{XY} - P_X P_Y$, while in the case D = 3, we obtain

$$\Delta_L P = P_{XYZ} - P_{XY} P_Z - P_{YZ} P_X - P_{XZ} P_Y + 2 P_X P_Y P_Z.\qquad(2)$$

It is readily checked that

$$(X, Y) \perp\!\!\!\perp Z \;\vee\; (X, Z) \perp\!\!\!\perp Y \;\vee\; (Y, Z) \perp\!\!\!\perp X \;\Rightarrow\; \Delta_L P = 0.\qquad(3)$$

For D > 3, however, (1) does not capture all possible factorizations of the joint distribution, e.g., for D = 4, it need not vanish if (X_1, X_2) ⊥⊥ (X_3, X_4), but X_1 and X_2 are dependent and X_3 and X_4 are dependent. Streitberg [22] corrected this definition using a more complicated construction with the Möbius function on the lattice of partitions, which we describe in Section 4.3. In this work, however, we will focus on the case of three variables and formulate interaction tests based on embedding of (2) into an RKHS.

The implication (3) states that the presence of Lancaster interaction rules out the possibility of any factorization of the joint distribution, but the converse is not generally true; see Appendix C for details.

1 As the reader might imagine, the situation becomes more complex again when four or more variables interact simultaneously; we provide a brief technical overview in Section 4.3.
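For finite discrete distributions, (2) and the implication (3) can be checked directly. The following sketch (ours, using NumPy; the paper works with general RKHS embeddings rather than explicit probability tables) computes $\Delta_L P$ for the pairwise independent but mutually dependent triplet (X, Y, XY) of [9], and for a distribution that factorizes as $P_{XY}P_Z$:

```python
import numpy as np

def lancaster(P):
    """Delta_L P of eq. (2) for a discrete joint pmf P[x, y, z]."""
    Pxy, Pyz, Pxz = P.sum(2), P.sum(0), P.sum(1)
    Px, Py, Pz = P.sum((1, 2)), P.sum((0, 2)), P.sum((0, 1))
    return (P
            - np.einsum('xy,z->xyz', Pxy, Pz)
            - np.einsum('yz,x->xyz', Pyz, Px)
            - np.einsum('xz,y->xyz', Pxz, Py)
            + 2 * np.einsum('x,y,z->xyz', Px, Py, Pz))

# Bernstein triplet: X, Y i.i.d. uniform on {-1, +1}, Z = XY
# (index 0 stands for -1, index 1 for +1, so Z = +1 exactly when X = Y).
P = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        P[x, y, 1 if x == y else 0] = 0.25

assert np.allclose(P.sum(2), 0.25)      # P_XY = P_X P_Y: pairwise independent
print(np.abs(lancaster(P)).max())       # 0.125 -- Lancaster interaction present

# Any non-trivial factorization gives Delta_L P = 0, e.g. P_XYZ = P_XY P_Z
# with X = Y maximally dependent and Z an independent fair coin:
Pxy = np.array([[0.5, 0.0], [0.0, 0.5]])
Q = np.einsum('xy,z->xyz', Pxy, np.array([0.5, 0.5]))
print(np.abs(lancaster(Q)).max())       # 0.0
```

The first triplet has no pairwise dependence at all, yet $\Delta_L P \neq 0$; the second is strongly pairwise dependent, yet $\Delta_L P = 0$, matching (3).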
In addition, it is important to note the distinction between the absence of Lancaster interaction and the total (mutual) independence of (X, Y, Z), i.e., $P_{XYZ} = P_X P_Y P_Z$. While total independence implies the absence of Lancaster interaction, the signed measure $\Delta_{tot} P = P_{XYZ} - P_X P_Y P_Z$ associated to the total (mutual) independence of (X, Y, Z) does not vanish if, e.g., (X, Y) ⊥⊥ Z, but X and Y are dependent.

In this contribution, we construct a nonparametric test for the hypothesis $\Delta_L P = 0$ (no Lancaster interaction), as well as a nonparametric test for the hypothesis $\Delta_{tot} P = 0$ (total independence), based on the embeddings of the corresponding signed measures $\Delta_L P$ and $\Delta_{tot} P$ into an RKHS. Both tests are particularly suited to cases where X, Y and Z take values in a high-dimensional space and, moreover, they remain valid for a variety of non-Euclidean and structured domains, i.e., for all topological spaces where it is possible to construct a valid positive definite function; see [23] for details. In the case of total independence testing, our approach can be viewed as a generalization of the tests proposed in [24], based on empirical characteristic functions.

3 Kernel Embeddings

We review the embedding of signed measures to a reproducing kernel Hilbert space. The RKHS norms of such embeddings will then serve as our test statistics. Let Z be a topological space. According to the Moore-Aronszajn theorem [25, p. 19], for every symmetric, positive definite function (henceforth kernel) k : Z × Z → R, there is an associated reproducing kernel Hilbert space (RKHS) $H_k$ of real-valued functions on Z with reproducing kernel k. The map $\varphi : Z \to H_k$, $\varphi : z \mapsto k(\cdot, z)$ is called the canonical feature map or the Aronszajn map of k. Denote by M(Z) the Banach space of all finite signed Borel measures on Z.
The notion of a feature map can then be extended to kernel embeddings of elements of M(Z) [25, Chapter 4].

Definition 1 (Kernel embedding). Let k be a kernel on Z, and ν ∈ M(Z). The kernel embedding of ν into the RKHS $H_k$ is the element $\mu_k(\nu) \in H_k$ such that $\int f(z)\, d\nu(z) = \langle f, \mu_k(\nu) \rangle_{H_k}$ for all $f \in H_k$.

Alternatively, the kernel embedding can be defined by the Bochner integral $\mu_k(\nu) = \int k(\cdot, z)\, d\nu(z)$. If a measurable kernel k is a bounded function, it is straightforward to show using the Riesz representation theorem that $\mu_k(\nu)$ exists for all ν ∈ M(Z).2 For many interesting bounded kernels k, including the Gaussian, Laplacian and inverse multiquadratics, the embedding $\mu_k : M(Z) \to H_k$ is injective. Such kernels are said to be integrally strictly positive definite (ISPD) [26, p. 4]. A related but weaker notion is that of a characteristic kernel [20, 27], which requires the kernel embedding to be injective only on the set $M^1_+(Z)$ of probability measures. In the case that k is ISPD, since $H_k$ is a Hilbert space, we can introduce a notion of an inner product between two signed measures ν, ν′ ∈ M(Z),

$$\langle\langle \nu, \nu' \rangle\rangle_k := \langle \mu_k(\nu), \mu_k(\nu') \rangle_{H_k} = \iint k(z, z')\, d\nu(z)\, d\nu'(z').$$

Since $\mu_k$ is injective, this is a valid inner product and induces a norm on M(Z), for which $\|\nu\|_k = \langle\langle \nu, \nu \rangle\rangle_k^{1/2} = 0$ if and only if ν = 0.
This fact has been used extensively in the literature to formulate: (a) a nonparametric two-sample test based on estimation of the maximum mean discrepancy $\|P - Q\|_k$, for samples $\{X_i\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P$, $\{Y_i\}_{i=1}^{m} \overset{\text{i.i.d.}}{\sim} Q$ [28], and (b) a nonparametric independence test based on estimation of $\|P_{XY} - P_X P_Y\|_{k \otimes l}$, for a joint sample $\{(X_i, Y_i)\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P_{XY}$ [19] (the latter is also called a Hilbert-Schmidt independence criterion), with kernel $k \otimes l$ on the product space defined as $k(x, x')\,l(y, y')$. When a bounded characteristic kernel is used, the above tests are consistent against all alternatives, and their alternative interpretation is as a generalization [29, 3] of energy distance [30, 31] and distance covariance [2, 32].

2 Unbounded kernels can also be considered, however [3]. In this case, one can still study embeddings of the signed measures $M_k^{1/2}(Z) \subset M(Z)$, which satisfy a finite moment condition, i.e., $M_k^{1/2}(Z) = \{\nu \in M(Z) : \int k^{1/2}(z, z)\, d|\nu|(z) < \infty\}$.

Table 1: V-statistic estimates of $\langle\langle \nu, \nu' \rangle\rangle_{k \otimes l}$ in the two-variable case

ν \ ν′    | P_XY                  | P_X P_Y
P_XY      | (1/n²) (K ∘ L)_{++}   | (1/n³) (KL)_{++}
P_X P_Y   |                       | (1/n⁴) K_{++} L_{++}

In this article, we extend this approach to the three-variable case, and formulate tests for both the Lancaster interaction and for total independence, using simple consistent estimators of $\|\Delta_L P\|_{k \otimes l \otimes m}$ and $\|\Delta_{tot} P\|_{k \otimes l \otimes m}$ respectively, which we describe in the next Section. Using the same arguments as in the tests of [28, 19], these tests are also consistent against all alternatives as long as ISPD kernels are used.

4 Interaction Tests

Notational remarks: Throughout the paper, ∘ denotes the Hadamard (entrywise) product. Let A be an n × n matrix, and K a symmetric n × n matrix.
We will fix the following notational conventions: 1 denotes an n × 1 column of ones; $A_{+j} = \sum_{i=1}^{n} A_{ij}$ denotes the sum of all elements of the j-th column of A; $A_{i+} = \sum_{j=1}^{n} A_{ij}$ denotes the sum of all elements of the i-th row of A; $A_{++} = \sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij}$ denotes the sum of all elements of A; $K_+ = \mathbf{1}\mathbf{1}^\top K$, i.e., $[K_+]_{ij} = K_{+j} = K_{j+}$, and $[K_+^\top]_{ij} = K_{i+} = K_{+i}$.

4.1 Two-Variable (Independence) Test

We provide a short overview of the kernel independence test of [19], which we write as the RKHS norm of the embedding of a signed measure. While this material is not new (it appears in [28, Section 7.4]), it will help define how to proceed when a third variable is introduced, and the signed measures become more involved. We begin by expanding the squared RKHS norm $\|P_{XY} - P_X P_Y\|^2_{k \otimes l}$ as inner products, and applying the reproducing property,

$$\|P_{XY} - P_X P_Y\|^2_{k \otimes l} = E_{XY} E_{X'Y'}\, k(X, X')\, l(Y, Y') + E_X E_{X'}\, k(X, X')\, E_Y E_{Y'}\, l(Y, Y') - 2\, E_{X'Y'}\!\left[E_X\, k(X, X')\, E_Y\, l(Y, Y')\right],\qquad(4)$$

where (X, Y) and (X′, Y′) are independent copies of random variables on X × Y with distribution $P_{XY}$.

Given a joint sample $\{(X_i, Y_i)\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P_{XY}$, an empirical estimator of $\|P_{XY} - P_X P_Y\|^2_{k \otimes l}$ is obtained by substituting corresponding empirical means into (4), which can be represented using Gram matrices K and L ($K_{ij} = k(X_i, X_j)$, $L_{ij} = l(Y_i, Y_j)$):

$$\hat{E}_{XY}\hat{E}_{X'Y'}\, k(X, X')\, l(Y, Y') = \frac{1}{n^2}\sum_{a=1}^{n}\sum_{b=1}^{n} K_{ab} L_{ab} = \frac{1}{n^2}(K \circ L)_{++},$$

$$\hat{E}_X \hat{E}_{X'}\, k(X, X')\, \hat{E}_Y \hat{E}_{Y'}\, l(Y, Y') = \frac{1}{n^4}\sum_{a=1}^{n}\sum_{b=1}^{n}\sum_{c=1}^{n}\sum_{d=1}^{n} K_{ab} L_{cd} = \frac{1}{n^4}\, K_{++} L_{++},$$

$$\hat{E}_{X'Y'}\!\left[\hat{E}_X\, k(X, X')\, \hat{E}_Y\, l(Y, Y')\right] = \frac{1}{n^3}\sum_{a=1}^{n}\sum_{b=1}^{n}\sum_{c=1}^{n} K_{ac} L_{bc} = \frac{1}{n^3}(KL)_{++}.$$

Since these are
V-statistics [33, Ch. 5], there is a bias of $O_P(n^{-1})$; U-statistics may be used if an unbiased estimate is needed. Each of the terms above corresponds to an estimate of an inner product $\langle\langle \nu, \nu' \rangle\rangle_{k \otimes l}$ for probability measures ν and ν′ taking values in $\{P_{XY}, P_X P_Y\}$, as summarized in Table 1. Even though the second and third terms involve triple and quadruple sums, each of the empirical means can be computed using sums of all terms of certain matrices, where the dominant computational cost is in computing the matrix product KL. In fact, the overall estimator can be computed in an even simpler form (see Proposition 9 in Appendix F), as

$$\left\|\hat{P}_{XY} - \hat{P}_X \hat{P}_Y\right\|^2_{k \otimes l} = \frac{1}{n^2}\,(K \circ HLH)_{++},$$

where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the centering matrix. Note that by the idempotence of H, we also have that $(K \circ HLH)_{++} = (HKH \circ HLH)_{++}$. In the rest of the paper, for any Gram matrix K, we will denote its corresponding centered matrix HKH by $\tilde{K}$. When three variables are present, a two-variable test already allows us to determine whether for instance (X, Y) ⊥⊥ Z, i.e., whether $P_{XYZ} = P_{XY} P_Z$. It is sufficient to treat (X, Y) as a single variable on the product space X × Y, with the product kernel k ⊗ l.

Table 2: V-statistic estimates of $\langle\langle \nu, \nu' \rangle\rangle_{k \otimes l \otimes m}$ in the three-variable case

ν \ ν′          | n P_XYZ          | n² P_XY P_Z         | n² P_XZ P_Y         | n² P_YZ P_X         | n³ P_X P_Y P_Z
n P_XYZ         | (K ∘ L ∘ M)_{++} | ((K ∘ L) M)_{++}    | ((K ∘ M) L)_{++}    | ((M ∘ L) K)_{++}    | tr(K_+ ∘ L_+ ∘ M_+)
n² P_XY P_Z     |                  | (K ∘ L)_{++} M_{++} | (M K L)_{++}        | (K L M)_{++}        | (K L)_{++} M_{++}
n² P_XZ P_Y     |                  |                     | (K ∘ M)_{++} L_{++} | (K M L)_{++}        | (K M)_{++} L_{++}
n² P_YZ P_X     |                  |                     |                     | (L ∘ M)_{++} K_{++} | (L M)_{++} K_{++}
n³ P_X P_Y P_Z  |                  |                     |                     |                     | K_{++} L_{++} M_{++}
Then, the Gram matrix associated to (X, Y) is simply K ∘ L, and the corresponding V-statistic is $\frac{1}{n^2}\left(K \circ L \circ \tilde{M}\right)_{++}$.3 What is not obvious, however, is whether a V-statistic for the Lancaster interaction (which can be thought of as a surrogate for the composite hypothesis of various factorizations) can be obtained in a similar form. We will address this question in the next section.

4.2 Three-Variable Tests

As in the two-variable case, it suffices to derive V-statistics for inner products $\langle\langle \nu, \nu' \rangle\rangle_{k \otimes l \otimes m}$, where ν and ν′ take values in all possible combinations of the joint and the products of the marginals, i.e., $P_{XYZ}$, $P_{XY} P_Z$, etc. Again, it is easy to see that these can be expressed as certain expectations of kernel functions, and thereby can be calculated by an appropriate manipulation of the three Gram matrices. We summarize the resulting expressions in Table 2; their derivation is a tedious but straightforward linear algebra exercise. For compactness, the appropriate normalizing terms are moved inside the measures considered.

Based on the individual RKHS inner product estimators, we can now easily derive estimators for various signed measures arising as linear combinations of $P_{XYZ}$, $P_{XY} P_Z$, and so on. The first such measure is an "incomplete" Lancaster interaction measure $\Delta^{(Z)} P = P_{XYZ} + P_X P_Y P_Z - P_{YZ} P_X - P_{XZ} P_Y$, which vanishes if (Y, Z) ⊥⊥ X or (X, Z) ⊥⊥ Y, but not necessarily if (X, Y) ⊥⊥ Z. We obtain the following result for the empirical measure $\hat{P}$.

Proposition 2 (Incomplete Lancaster interaction). $\left\|\Delta^{(Z)} \hat{P}\right\|^2_{k \otimes l \otimes m} = \frac{1}{n^2}\left(\tilde{K} \circ \tilde{L} \circ M\right)_{++}$.

Analogous expressions hold for $\Delta^{(X)} \hat{P}$ and $\Delta^{(Y)} \hat{P}$. Unlike in the two-variable case, where either matrix or both can be centered, centering of each matrix in the three-variable case has a different meaning. In particular, one requires centering of all three kernel matrices to perform a "complete" Lancaster interaction test, as given by the following Proposition.

Proposition 3 (Lancaster interaction). $\left\|\Delta_L \hat{P}\right\|^2_{k \otimes l \otimes m} = \frac{1}{n^2}\left(\tilde{K} \circ \tilde{L} \circ \tilde{M}\right)_{++}$.

The proofs of these Propositions are given in Appendix A. We summarize the various hypotheses and the associated V-statistics in Appendix B. As we will demonstrate in the experiments in Section 5, while particularly useful for testing the factorization hypothesis, i.e., for (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X, the statistic $\left\|\Delta_L \hat{P}\right\|^2_{k \otimes l \otimes m}$ can also be used for powerful tests of either the individual hypotheses (Y, Z) ⊥⊥ X, (X, Z) ⊥⊥ Y, or (X, Y) ⊥⊥ Z, or for total independence testing, i.e., $P_{XYZ} = P_X P_Y P_Z$, as it vanishes in all of these cases.

3 In general, however, this approach would require some care since, e.g., X and Y could be measured on very different scales, and the choice of kernels k and l needs to take this into account.
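The statistics above reduce to a few entrywise products of centered Gram matrices. A minimal sketch, assuming Gaussian kernels and our own function names (the paper's reference implementation is not shown):

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def centered(K):
    """K-tilde = HKH with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def hsic(K, L):
    """Two-variable V-statistic (1/n^2) (K o HLH)_{++}."""
    return (K * centered(L)).sum() / K.shape[0] ** 2

def lancaster_stat(K, L, M):
    """Proposition 3: (1/n^2) (K~ o L~ o M~)_{++}."""
    return (centered(K) * centered(L) * centered(M)).sum() / K.shape[0] ** 2

rng = np.random.default_rng(0)
X, Y, Z = (rng.standard_normal((100, 2)) for _ in range(3))
K, L, M = (gaussian_gram(A, 1.0) for A in (X, Y, Z))

# By the idempotence of H, centering one or both matrices gives the same value.
assert np.isclose(hsic(K, L), (centered(K) * centered(L)).sum() / 100 ** 2)
# The Lancaster statistic is a squared RKHS norm, hence nonnegative (up to rounding).
assert lancaster_stat(K, L, M) >= -1e-12
```

Each statistic costs $O(n^2)$ memory and at most $O(n^3)$ time (for the centering products), with no triple or quadruple sums appearing explicitly.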
The null distribution under each of these hypotheses can be estimated using a standard permutation-based approach described in Appendix D.

Another way to obtain the Lancaster interaction statistic is as the RKHS norm of the joint "central moment" $\Sigma_{XYZ} = E_{XYZ}\left[(k_X - \mu_X) \otimes (l_Y - \mu_Y) \otimes (m_Z - \mu_Z)\right]$ of RKHS-valued random variables $k_X$, $l_Y$ and $m_Z$ (understood as an element of the tensor RKHS $H_k \otimes H_l \otimes H_m$). This is related to a classical characterization of the Lancaster interaction [21, Ch. XII]: there is no Lancaster interaction between X, Y and Z if and only if cov[f(X), g(Y), h(Z)] = 0 for all $L^2$ functions f, g and h. There is an analogous result in our case (the proof is given in Appendix A), which states:

Proposition 4. $\|\Delta_L P\|_{k \otimes l \otimes m} = 0$ if and only if cov[f(X), g(Y), h(Z)] = 0 for all $f \in H_k$, $g \in H_l$, $h \in H_m$.

And finally, we give an estimator of the RKHS norm of the total independence measure $\Delta_{tot} P$.

Proposition 5 (Total independence). Let $\Delta_{tot} \hat{P} = \hat{P}_{XYZ} - \hat{P}_X \hat{P}_Y \hat{P}_Z$. Then:

$$\left\|\Delta_{tot} \hat{P}\right\|^2_{k \otimes l \otimes m} = \frac{1}{n^2}(K \circ L \circ M)_{++} - \frac{2}{n^4}\,\mathrm{tr}(K_+ \circ L_+ \circ M_+) + \frac{1}{n^6}\, K_{++} L_{++} M_{++}.$$

The proof follows simply from reading off the corresponding inner-product V-statistics from Table 2. While the test statistic for total independence has a somewhat more complicated form than that of the Lancaster interaction, it can also be computed in quadratic time.

4.3 Interaction for D > 3

Streitberg's correction of the interaction measure for D > 3 has the form

$$\Delta_S P = \sum_{\pi} (-1)^{|\pi| - 1}\, (|\pi| - 1)!\; J_\pi P,\qquad(5)$$

where the sum is taken over all partitions of the set {1, 2, . . .
, n}, |π| denotes the size of the partition (number of blocks), and $J_\pi : P \mapsto P_\pi$ is the partition operator on probability measures, which for a fixed partition $\pi = \pi_1 | \pi_2 | \ldots | \pi_r$ maps the probability measure P to the product measure $P_\pi = \prod_{j=1}^{r} P_{\pi_j}$, where $P_{\pi_j}$ is the marginal distribution of the subvector $(X_i : i \in \pi_j)$. The coefficients correspond to the Möbius inversion on the partition lattice [34]. While the Lancaster interaction has an interpretation in terms of joint central moments, Streitberg's correction corresponds to joint cumulants [22, Section 4]. Therefore, a central moment expression like $E_{X_1 \ldots X_n}\left[\left(k^{(1)}_{X_1} - \mu_{X_1}\right) \otimes \cdots \otimes \left(k^{(n)}_{X_n} - \mu_{X_n}\right)\right]$ does not capture the correct notion of the interaction measure. Thus, while one can in principle construct RKHS embeddings of higher-order interaction measures, and compute RKHS norms using a calculus of V-statistics and Gram matrices analogous to that of Table 2, it does not seem possible to avoid summing over all partitions when computing the corresponding statistics, yielding a computationally prohibitive approach in general.
This can be viewed by analogy with the scalar case, where it is well known that the second and third cumulants coincide with the second and third central moments, whereas the higher-order cumulants are neither moments nor central moments, but some other polynomials of the moments.

4.4 Total independence for D > 3

In general, the test statistic for total independence in the D-variable case is

$$\left\|\hat{P}_{X_{1:D}} - \prod_{i=1}^{D} \hat{P}_{X_i}\right\|^2_{\bigotimes_{i=1}^{D} k^{(i)}} = \frac{1}{n^2}\sum_{a=1}^{n}\sum_{b=1}^{n}\prod_{i=1}^{D} K^{(i)}_{ab} \;-\; \frac{2}{n^{D+1}}\sum_{a=1}^{n}\prod_{i=1}^{D}\sum_{b=1}^{n} K^{(i)}_{ab} \;+\; \frac{1}{n^{2D}}\prod_{i=1}^{D}\sum_{a=1}^{n}\sum_{b=1}^{n} K^{(i)}_{ab}.$$

A similar statistic for total independence is discussed by [24], where testing of total independence based on empirical characteristic functions is considered. Our test has a direct interpretation in terms of characteristic functions as well, which is straightforward to see in the case of translation-invariant kernels on Euclidean spaces, using their Bochner representation, similarly as in [27, Corollary 4].

[Figure 1 (plots omitted): panels "Marginal independence tests: Dataset A" and "Marginal independence tests: Dataset B"; y-axis: null acceptance rate (Type II error); x-axis: dimension (1-19); curves: 2var: X ⊥⊥ Y; 2var: X ⊥⊥ Z; 2var: (X, Y) ⊥⊥ Z; Δ_L: (X, Y) ⊥⊥ Z.]
Figure 1: Two-variable kernel independence tests and the test for (X, Y) ⊥⊥ Z using the Lancaster statistic

[Figure 2 (plots omitted): panels "Total independence test: Dataset A" and "Total independence test: Dataset B"; y-axis: null acceptance rate (Type II error); x-axis: dimension (1-19); curves: Δ_L: total indep.; Δ_tot: total indep.]
Figure 2: Total independence: $\Delta_{tot} \hat{P}$ vs. $\Delta_L \hat{P}$.

5 Experiments

We investigate the performance of various permutation-based tests that use the Lancaster statistic $\left\|\Delta_L \hat{P}\right\|^2_{k \otimes l \otimes m}$ and the total independence statistic $\left\|\Delta_{tot} \hat{P}\right\|^2_{k \otimes l \otimes m}$ on two synthetic datasets where X, Y and Z are random vectors of increasing dimensionality:

Dataset A: Pairwise independent, mutually dependent data. Our first dataset is a triplet of random vectors (X, Y, Z) on $\mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p$, with $X, Y \overset{\text{i.i.d.}}{\sim} N(0, I_p)$, $W \sim \mathrm{Exp}(\frac{1}{\sqrt{2}})$, $Z_1 = \mathrm{sign}(X_1 Y_1)\, W$, and $Z_{2:p} \sim N(0, I_{p-1})$, i.e., the product $X_1 Y_1$ determines the sign of $Z_1$, while the remaining p − 1 dimensions are independent (and serve as noise in this example).4 In this case, (X, Y, Z) is clearly a pairwise independent but mutually dependent triplet. The mutual dependence becomes increasingly difficult to detect as the dimensionality p increases.

Dataset B: Joint dependence can be easier to detect. In this example, we consider a triplet of random vectors (X, Y, Z) on $\mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p$, with $X, Y \overset{\text{i.i.d.}}{\sim} N(0, I_p)$, $Z_{2:p} \sim N(0, I_{p-1})$, and

$$Z_1 = \begin{cases} X_1^2 + \epsilon, & \text{w.p. } 1/3, \\ Y_1^2 + \epsilon, & \text{w.p. } 1/3, \\ X_1 Y_1 + \epsilon, & \text{w.p. } 1/3, \end{cases}$$

where $\epsilon \sim N(0, 0.1^2)$.
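Dataset A and a basic permutation test can be sketched as follows. This is our own simplified version: we permute the Z sample to simulate the null for a Lancaster-based test of (X, Y) ⊥⊥ Z, we read the Exp(1/√2) parameter as a scale, and the authors' full procedure (including the Holm-Bonferroni correction, Appendix D) is not reproduced here:

```python
import numpy as np

def gram(X, sigma):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def median_bandwidth(X):
    """Interpoint median distance heuristic used in the experiments."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.sqrt(np.median(sq[sq > 0]))

def centered(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def lancaster_stat(K, L, M):
    """Proposition 3: (1/n^2) (K~ o L~ o M~)_{++}."""
    return (centered(K) * centered(L) * centered(M)).sum() / K.shape[0] ** 2

def total_indep_stat(K, L, M):
    """Proposition 5, computable in quadratic time."""
    n = K.shape[0]
    t1 = (K * L * M).sum() / n ** 2
    t2 = (K.sum(0) * L.sum(0) * M.sum(0)).sum()   # tr(K_+ o L_+ o M_+)
    t3 = K.sum() * L.sum() * M.sum()
    return t1 - 2.0 * t2 / n ** 4 + t3 / n ** 6

def dataset_A(n, p, rng):
    X = rng.standard_normal((n, p))
    Y = rng.standard_normal((n, p))
    Z = rng.standard_normal((n, p))
    W = rng.exponential(scale=1.0 / np.sqrt(2.0), size=n)
    Z[:, 0] = np.sign(X[:, 0] * Y[:, 0]) * W      # sign of Z_1 set by X_1 Y_1
    return X, Y, Z

def perm_pvalue(X, Y, Z, n_perm=100, seed=1):
    """Permutation p-value for (X, Y) independent of Z, Lancaster statistic."""
    rng = np.random.default_rng(seed)
    K, L, M = (gram(A, median_bandwidth(A)) for A in (X, Y, Z))
    stat = lancaster_stat(K, L, M)
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(K.shape[0])         # permute the Z sample
        null[b] = lancaster_stat(K, L, M[np.ix_(idx, idx)])
    return (1 + np.sum(null >= stat)) / (1 + n_perm)

rng = np.random.default_rng(0)
X, Y, Z = dataset_A(n=256, p=1, rng=rng)
print(perm_pvalue(X, Y, Z))   # small p-value: the joint dependence is detected
```

Even though all pairwise dependencies in Dataset A are absent, the permuted null statistics sit well below the observed Lancaster statistic, so the null is rejected at low dimension.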
Thus, the dependence of Z on the pair (X, Y) is stronger than on X and Y individually.

4 Note that there is no reason for X, Y and Z to have the same dimensionality p; this is done for simplicity of exposition.

[Figure 3 (plots omitted): panels "V-structure discovery: Dataset A" and "V-structure discovery: Dataset B"; y-axis: null acceptance rate (Type II error); x-axis: dimension (1-19); curves: 2var: Factor; Δ_L: Factor; CI: X ⊥⊥ Y | Z.]
Figure 3: Factorization hypothesis: Lancaster statistic vs. a two-variable based test; test for X ⊥⊥ Y | Z from [18]

In all cases, we use permutation tests as described in Appendix D. The test level is set to α = 0.05, the sample size to n = 500, and we use Gaussian kernels with the bandwidth set to the interpoint median distance. In Figure 1, we plot the null hypothesis acceptance rates of the standard kernel two-variable tests for X ⊥⊥ Y (which is true for both datasets A and B, and accepted at the correct rate across all dimensions) and for X ⊥⊥ Z (which is true only for dataset A), as well as of the standard kernel two-variable test for (X, Y) ⊥⊥ Z, and the test for (X, Y) ⊥⊥ Z using the Lancaster statistic. As expected, in dataset B, we see that the dependence of Z on the pair (X, Y) is somewhat easier to detect than on X individually with two-variable tests. In both datasets, however, the Lancaster interaction appears significantly more sensitive in detecting this dependence as the dimensionality p increases. Figure 2 plots the Type II error of total independence tests with statistics $\left\|\Delta_L \hat{P}\right\|^2_{k \otimes l \otimes m}$ and $\left\|\Delta_{tot} \hat{P}\right\|^2_{k \otimes l \otimes m}$. The Lancaster statistic outperforms the total independence statistic everywhere apart from Dataset B when the number of dimensions is small (between 1 and 5). Figure 3 plots the Type II error of the factorization test, i.e., the test for (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X with the Lancaster statistic and the Holm-Bonferroni correction as described in Appendix D, as well as the two-variable based test (which performs three standard two-variable tests and applies the Holm-Bonferroni correction). We also plot the Type II error for the conditional independence test for X ⊥⊥ Y | Z from [18]. Under the assumption that X ⊥⊥ Y (correct on both datasets), the negation of each of these three hypotheses is equivalent to the presence of the V-structure X → Z ← Y, so rejection of the null can be viewed as a V-structure detection procedure. As dimensionality increases, the Lancaster statistic appears significantly more sensitive to the interactions present than the competing approaches, which is particularly pronounced in Dataset A.

6 Conclusions

We have constructed permutation-based nonparametric tests for three-variable interactions, including the Lancaster interaction and total independence. The tests can be used in datasets where only higher-order interactions persist, i.e., variables are pairwise independent; as well as in cases where joint dependence may be easier to detect than pairwise dependence, for instance when the effect of two variables on a third is not additive. The flexibility of the framework of RKHS embeddings of signed measures allows us to consider variables that are themselves multidimensional.
While the total independence case readily generalizes to more than three dimensions, the combinatorial nature of joint cumulants implies that detecting interactions of higher order requires significantly more costly computation.

References

[1] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, pages 63–78, 2005.
[2] G. Székely, M. Rizzo, and N. K. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.
[3] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat., 41(5):2263–2291, 2013.
[4] F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, 2002.
[5] K. Fukumizu, F. Bach, and A. Gretton. Statistical consistency of kernel canonical correlation analysis. J. Mach. Learn. Res., 8:361–383, 2007.
[6] J. Dauxois and G. M. Nkiet. Nonlinear canonical analysis and independence tests. Ann. Stat., 26(4):1254–1278, 1998.
[7] D. Pál, B. Póczos, and Cs. Szepesvári. Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. In NIPS 23, 2010.
[8] A. Kankainen. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyväskylä, 1995.
[9] S. Bernstein. The Theory of Probabilities. Gastehizdat Publishing House, Moscow, 1946.
[10] M. Kayano, I. Takigawa, M. Shiga, K. Tsuda, and H. Mamitsuka. Efficiently finding genome-wide three-way gene interactions from transcript- and genotype-data. Bioinformatics, 25(21):2735–2743, 2009.
[11] N. Meinshausen and P. Bühlmann. High dimensional graphs and variable selection with the lasso. Ann. Stat., 34(3):1436–1462, 2006.
[12] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Stat., 4:935–980, 2011.
[13] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2001.
[14] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. 2nd edition, 2000.
[15] M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC algorithm. J. Mach. Learn. Res., 8:613–636, 2007.
[16] X. Sun, D. Janzing, B. Schölkopf, and K. Fukumizu. A kernel-based causal learning algorithm. In ICML, pages 855–862, 2007.
[17] R. Tillman, A. Gretton, and P. Spirtes. Nonlinear directed acyclic structure learning with weakly additive noise models. In NIPS 22, 2009.
[18] K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causal discovery. In UAI, pages 804–813, 2011.
[19] A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In NIPS 20, pages 585–592, Cambridge, MA, 2008. MIT Press.
[20] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In NIPS 20, pages 489–496, 2008.
[21] H. O. Lancaster. The Chi-Squared Distribution. Wiley, London, 1969.
[22] B. Streitberg. Lancaster interactions revisited. Ann. Stat., 18(4):1878–1885, 1990.
[23] K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schölkopf. Characteristic kernels on groups and semigroups. In NIPS 21, pages 473–480, 2009.
[24] A. Kankainen. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyväskylä, 1995.
[25] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer, 2004.
[26] B. Sriperumbudur, K. Fukumizu, and G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. J. Mach. Learn. Res., 12:2389–2410, 2011.
[27] B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res., 11:1517–1561, 2010.
[28] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773, 2012.
[29] D. Sejdinovic, A. Gretton, B. Sriperumbudur, and K. Fukumizu. Hypothesis testing using pairwise distances and associated kernels. In ICML, 2012.
[30] G. Székely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, (5), November 2004.
[31] L. Baringhaus and C. Franz. On a new multivariate two-sample test. J. Multivariate Anal., 88(1):190–206, 2004.
[32] G. Székely and M. Rizzo. Brownian distance covariance. Ann. Appl. Stat., 4(3):1233–1303, 2009.
[33] R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
[34] T. P. Speed. Cumulants and partition lattices. Austral. J. Statist., 25:378–388, 1983.
[35] S. Holm. A simple sequentially rejective multiple test procedure. Scand. J. Statist., 6(2):65–70, 1979.
[36] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. Sriperumbudur. A fast, consistent kernel two-sample test. In NIPS 22, Red Hook, NY, 2009. Curran Associates Inc.