{"title": "Kernel Measures of Independence for non-iid Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1937, "page_last": 1944, "abstract": "Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering.", "full_text": "Kernel Measures of Independence for non-iid Data\n\nXinhua Zhang\n\nNICTA and Australian National University\n\nCanberra, Australia\n\nxinhua.zhang@anu.edu.au\n\nArthur Gretton\n\nMPI T\u00a8ubingen for Biological Cybernetics\n\nT\u00a8ubingen, Germany\n\narthur@tuebingen.mpg.de\n\nLe Song\u2217\n\nSchool of Computer Science\n\nCarnegie Mellon University, Pittsburgh, USA\n\nlesong@cs.cmu.edu\n\nAlex Smola\u2217\nYahoo! Research\n\nSanta Clara, CA, United States\n\nalex@smola.org\n\nAbstract\n\nMany machine learning algorithms can be formulated in the framework of statis-\ntical independence such as the Hilbert Schmidt Independence Criterion. In this\npaper, we extend this criterion to deal with structured and interdependent obser-\nvations. This is achieved by modeling the structures using undirected graphical\nmodels and comparing the Hilbert space embeddings of distributions. We apply\nthis new criterion to independent component analysis and sequence clustering.\n\n1 Introduction\nStatistical dependence measures have been proposed as a unifying framework to address many ma-\nchine learning problems. For instance, clustering can be viewed as a problem where one strives to\nmaximize the dependence between the observations and a discrete set of labels [14]. 
Conversely, if labels are given, feature selection can be achieved by finding a subset of features in the observations which maximize the dependence between labels and features [15]. Similarly in supervised dimensionality reduction [13], one looks for a low dimensional embedding which retains additional side information such as class labels. Likewise, blind source separation (BSS) tries to unmix independent sources, which requires a contrast function quantifying the dependence of the unmixed signals.\nThe use of mutual information is well established in this context, as it is theoretically well justified. Unfortunately, it typically involves density estimation or at least a nontrivial optimization procedure [11]. This problem can be averted by using the Hilbert Schmidt Independence Criterion (HSIC). The latter enjoys concentration of measure properties and it can be computed efficiently on any domain where a Reproducing Kernel Hilbert Space (RKHS) can be defined.\nHowever, the application of HSIC is limited to independent and identically distributed (iid) data, a property that many problems do not share (e.g., BSS on audio data). For instance many random variables have a pronounced temporal or spatial structure. A simple motivating example is given in Figure 1a. Assume that the observations xt are drawn iid from a uniform distribution on {0, 1} and yt is determined by an XOR operation via yt = xt ⊗ xt−1. Algorithms which treat the observation pairs {(xt, yt)}, t = 1, 2, . . ., as iid will consider the random variables x, y as independent. However, it is trivial to detect the XOR dependence by using the information that xi and yi are, in fact, sequences.\nIn view of its importance, temporal correlation has been exploited in the independence test for blind source separation. 
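A quick numerical check of the XOR example above (our illustrative sketch, not part of the paper; all variable names are ours): treated as iid pairs, (xt, yt) shows no pairwise association, while the sequence structure makes the dependence deterministic.

```python
import numpy as np

rng = np.random.default_rng(0)

# x_t iid uniform on {0, 1}; y_t = x_t XOR x_{t-1} as in Figure 1a.
T = 100000
x = rng.integers(0, 2, size=T)
y = x[1:] ^ x[:-1]      # y_t for t = 1, ..., T-1
xt = x[1:]              # x_t aligned with y_t

# Treated as iid pairs, all four (x_t, y_t) combinations occur with
# frequency close to 1/4, i.e. x and y look pairwise independent.
joint = np.array([[np.mean((xt == a) & (y == b)) for b in (0, 1)]
                  for a in (0, 1)])
assert np.allclose(joint, 0.25, atol=0.01)

# Using the sequence information, the dependence is exact:
assert np.array_equal(y, xt ^ x[:-1])
```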
For instance, [9] used this insight to reject nontrivial nonseparability of nonlinear mixtures, and [18] exploited multiple time-lagged second-order correlations to decorrelate over time. These methods work well in practice. But they are rather ad hoc and appear very different from standard criteria. In this paper, we propose a framework which extends HSIC to structured non-iid data. Our new approach is built upon the connection between exponential family models and the marginal polytope in an RKHS. This is doubly attractive since distributions can be uniquely identified by the expectation operator in the RKHS and moreover, for distributions with conditional independence properties the expectation operator decomposes according to the clique structure of the underlying undirected graphical model [2].\n∗This work was partially done when the author was with the Statistical Machine Learning Group of NICTA.\n\nFigure 1: From left to right: (a) Graphical model representing the XOR sequence, (b) a graphical model representing iid observations, (c) a graphical model for first order sequential data, and (d) a graphical model for dependency on a two dimensional mesh.\n\n2 The Problem\nDenote by X and Y domains from which we will be drawing observations Z := {(x1, y1), . . . , (xm, ym)} according to some distribution p(x, y) on Z := X × Y. Note that the domains X and Y are fully general and we will discuss a number of different structural assumptions on them in Section 3 which allow us to recover existing and propose new measures of dependence. For instance x and y may represent sequences or a mesh for which we wish to establish dependence.\nTo assess whether x and y are independent we briefly review the notion of Hilbert Space embeddings of distributions [6]. 
Subsequently we discuss properties of the expectation operator in the case of conditionally independent random variables which will lead to a template for a dependence measure.\nHilbert Space Embedding of Distribution Let H be a RKHS on Z with kernel v : Z × Z ↦ R. Moreover, let P be the space of all distributions over Z, and let p ∈ P. The expectation operator in H and its corresponding empirical average can be defined as in [6]\n\nµ[p] := E_{z∼p(z)}[v(z, ·)] such that E_{z∼p(z)}[f(z)] = ⟨µ[p], f⟩ (1)\n\nµ[Z] := (1/m) ∑_{i=1}^{m} v((xi, yi), ·) such that (1/m) ∑_{i=1}^{m} f(xi, yi) = ⟨µ[Z], f⟩. (2)\n\nThe map µ : P ↦ H characterizes a distribution by an element in the RKHS. The following theorem shows that the map is injective [16] for a large class of kernels such as Gaussian and Laplacian RBF.\nTheorem 1 If E_{z∼p}[v(z, z)] < ∞ and H is dense in the space of bounded continuous functions C0(Z) in the L∞ norm then the map µ is injective.\n2.1 Exponential Families\nWe are interested in the properties of µ[p] in the case where p satisfies the conditional independence relations specified by an undirected graphical model. In [2], it is shown that for this case the sufficient statistics decompose along the maximal cliques of the conditional independence graph.\nMore formally, denote by C the set of maximal cliques of the graph G and let zc be the restriction of z ∈ Z to the variables on clique c ∈ C. Moreover, let vc be universal kernels in the sense of [17] acting on the restrictions of Z on clique c ∈ C. 
In this case, [2] showed that\n\nv(z, z′) = ∑_{c∈C} vc(zc, z′c) (3)\n\ncan be used to describe all probability distributions with the above mentioned conditional independence relations using an exponential family model with v as its kernel. Since for exponential families expectations of the sufficient statistics yield injections, we have the following result:\n\nCorollary 2 On the class of probability distributions satisfying conditional independence properties according to a graph G with maximal clique set C and with full support on their domain, the operator\n\nµ[p] = ∑_{c∈C} µc[pc] = ∑_{c∈C} E_{zc}[vc(zc, ·)] (4)\n\nis injective if the kernels vc are all universal. The same decomposition holds for the empirical counterpart µ[Z].\n\nThe condition of full support arises from the conditions of the Hammersley-Clifford Theorem [4, 8]: without it, not all conditionally independent random variables can be represented as the product of potential functions. Corollary 2 implies that we will be able to perform all subsequent operations on structured domains simply by dealing with mean operators on the corresponding maximal cliques.\n\n2.2 Hilbert Schmidt Independence Criterion\nTheorem 1 implies that we can quantify the difference between two distributions p and q by simply computing the square distance between their RKHS embeddings, i.e., ‖µ[p] − µ[q]‖²_H. 
Similarly, we can quantify the strength of dependence between random variables x and y by simply measuring the square distance between the RKHS embeddings of the joint distribution p(x, y) and the product of the marginals p(x) · p(y) via\n\nI(x, y) := ‖µ[p(x, y)] − µ[p(x)p(y)]‖²_H. (5)\n\nMoreover, Corollary 2 implies that for an exponential family consistent with the conditional independence graph G we may decompose I(x, y) further into\n\nI(x, y) = ∑_{c∈C} ‖µc[pc(xc, yc)] − µc[pc(xc)pc(yc)]‖²_{Hc}\n= ∑_{c∈C} {E_{(xc yc)(x′c y′c)} + E_{xc yc x′c y′c} − 2 E_{(xc yc) x′c y′c}} [vc((xc, yc), (x′c, y′c))] (6)\n\nwhere bracketed random variables in the subscripts are drawn from their joint distributions and unbracketed ones are from their respective marginals, e.g., E_{(xc yc) x′c y′c} := E_{(xc yc)} E_{x′c} E_{y′c}. Obviously the challenge is to find good empirical estimates of (6). In its simplest form we may replace each of the expectations by sums over samples, that is, by replacing\n\nE_{(x,y)}[f(x, y)] ← (1/m) ∑_{i=1}^{m} f(xi, yi) and E_{(x)(y)}[f(x, y)] ← (1/m²) ∑_{i,j=1}^{m} f(xi, yj). (7)\n\n3 Estimates for Special Structures\nTo illustrate the versatility of our approach we apply our model to a number of graphical models ranging from independent random variables to meshes proceeding according to the following recipe:\n\n1. Define a conditional independence graph.\n2. Identify the maximal cliques.\n3. Choose suitable joint kernels on the maximal cliques.\n4. Exploit stationarity (if existent) in I(x, y) in (6).\n5. 
Derive the corresponding empirical estimators for each clique, and hence for all of I(x, y).\n\n3.1 Independent and Identically Distributed Data\nAs the simplest case, we first consider the graphical model in Figure 1b, where {(xt, yt)}, t = 1, . . . , T, are iid random variables. Correspondingly the maximal cliques are {(xt, yt)}, t = 1, . . . , T. We choose the joint kernel on the cliques to be\n\nvt((xt, yt), (x′t, y′t)) := k(xt, x′t) l(yt, y′t), hence v((x, y), (x′, y′)) = ∑_{t=1}^{T} k(xt, x′t) l(yt, y′t). (8)\n\nThe representation for vt implies that we are taking an outer product between the Hilbert Spaces on xt and yt induced by kernels k and l respectively. If the pairs of random variables (xt, yt) are not identically distributed, all that is left is to use (8) to obtain an empirical estimate via (7).\nWe may improve the estimate considerably if we are able to assume that all pairs (xt, yt) are drawn from the same distribution p(xt, yt). Consequently all coordinates of the mean map are identical and we can use all the data to estimate just one of the discrepancies ‖µc[pc(xc, yc)] − µc[pc(xc)pc(yc)]‖². The latter expression is identical to the standard HSIC criterion and we obtain the biased estimate\n\nÎ(x, y) = (1/T) tr HKHL where Kst := k(xs, xt), Lst := l(ys, yt) and Hst := δst − 1/T. (9)\n\n3.2 Sequence Data\nA more interesting application beyond iid data is sequences with a Markovian dependence as depicted in Figure 1c. Here the maximal cliques are the sets {(xt, xt+1, yt, yt+1)}, t = 1, . . . , T − 1. More generally, for longer range dependency of order τ ∈ N, the maximal cliques will involve the random variables (xt, . . . , xt+τ, yt, . . . , yt+τ) =: (xt,τ, yt,τ).\nWe assume homogeneity and stationarity of the random variables: that is, all cliques share the same sufficient statistics (feature map) and their expected value is identical. In this case the kernel\n\nv0((xt,τ, yt,τ), (x′t,τ, y′t,τ)) := k(xt,τ, x′t,τ) l(yt,τ, y′t,τ)\n\ncan be used to measure discrepancy between the random variables. Stationarity means that µc[pc(xc, yc)] and µc[pc(xc)pc(yc)] are the same for all cliques c, hence I(x, y) is a multiple of the difference for a single clique.\nUsing the same argument as in the iid case, we can obtain a biased estimate of the measure of dependence by using Kij = k(xi,τ, xj,τ) and Lij = l(yi,τ, yj,τ) instead of the definitions of K and L in (9). This works well in experiments. In order to obtain an unbiased estimate we need some more work. Recall the unbiased estimate of I(x, y) is a fourth order U-statistic [6].\nTheorem 3 An unbiased empirical estimator for ‖µ[p(x, y)] − µ[p(x)p(y)]‖² is\n\nÎ(x, y) := ((m − 4)!/m!) ∑_{(i,j,q,r)} h(xi, yi, . . . , xr, yr), (10)\n\nwhere the sum is over all terms such that i, j, q, r are mutually different, and\n\nh(x1, y1, . . . , x4, y4) := (1/4!) ∑_{(t,u,v,w)}^{(1,2,3,4)} k(xt, xu) l(yt, yu) + k(xt, xu) l(yv, yw) − 2 k(xt, xu) l(yt, yv),\n\nand the latter sum denotes all ordered quadruples (t, u, v, w) drawn from (1, 2, 3, 4).\n\nThe theorem implies that in expectation h takes on the value of the dependence measure. To establish that this also holds for dependent random variables we use a result from [1] which establishes convergence for stationary mixing sequences under mild regularity conditions, namely whenever the kernel of the U-statistic h is bounded and the process generating the observations is absolutely regular. 
See also [5, Section 4].\nTheorem 4 Whenever I(x, y) > 0, that is, whenever the random variables are dependent, the estimate Î(x, y) is asymptotically normal with\n\n√m (Î(x, y) − I(x, y)) →d N(0, 4σ²) (11)\n\nwhere the variance is given by\n\nσ² = Var[h3(x1, y1)] + 2 ∑_{t=1}^{∞} Cov(h3(x1, y1), h3(xt, yt)) (12)\n\nand\n\nh3(x1, y1) := E_{(x2,y2,x3,y3,x4,y4)}[h(x1, y1, . . . , x4, y4)]. (13)\n\nThis follows from [5, Theorem 7], again under mild regularity conditions (note that [5] state their results for U-statistics of second order, and claim the results hold for higher orders). The proof is tedious but does not require additional techniques and is therefore omitted.\n\n3.3 TD-SEP as a special case\nSo far we did not discuss the freedom of choosing different kernels. In general, an RBF kernel will lead to an effective criterion for measuring the dependence between random variables, especially in time-series applications. However, we could also choose linear kernels for k and l, for instance, to obtain computational savings.\nFor a specific choice of cliques and kernels, we can recover the work of [18] as a special case of our framework. In [18], for two centered scalar time series x and y, the contrast function is chosen as the sum of same-time and time-lagged cross-covariance E[xt yt] + E[xt yt+τ]. Using our framework, two types of cliques, (xt, yt) and (xt, yt+τ), are considered in the corresponding graphical model. Furthermore, we use a joint kernel of the form\n\n⟨xs, xt⟩⟨ys, yt⟩ + ⟨xs, xt⟩⟨ys+τ, yt+τ⟩, (14)\n\nwhich leads to the estimator of structured HSIC: I(x, y) = (1/T)(tr HKHL + tr HKHLτ). Here Lτ denotes the linear covariance matrix for the time lagged y signals. 
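As an illustration of this special case, here is a minimal sketch of the biased structured-HSIC contrast with linear kernels and a single time lag (our code, not the authors'; normalization constants are ours and may differ from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def biased_structured_hsic(x, y, tau):
    # Biased estimate (1/T)(tr HKHL + tr HKHL_tau) for scalar series with
    # linear kernels, mirroring the TD-SEP special case above.
    T = len(x) - tau
    xs, ys = x[:T], y[:T]
    H = np.eye(T) - np.ones((T, T)) / T                 # H_st = delta_st - 1/T
    K = np.outer(xs, xs)                                # K_st = <x_s, x_t>
    L = np.outer(ys, ys)                                # L_st = <y_s, y_t>
    L_tau = np.outer(y[tau:tau + T], y[tau:tau + T])    # lagged y covariance kernel
    return (np.trace(H @ K @ H @ L) + np.trace(H @ K @ H @ L_tau)) / T

# Sanity check: a lag-1 dependent pair scores higher than an independent one.
x = rng.standard_normal(500)
y_dep = np.roll(x, 1) + 0.1 * rng.standard_normal(500)  # y_t roughly x_{t-1}
y_ind = rng.standard_normal(500)
assert biased_structured_hsic(x, y_dep, 1) > biased_structured_hsic(x, y_ind, 1)
```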
For scalar time series, basic algebra shows that tr HKHL and tr HKHLτ are the estimators of E[xt yt] and E[xt yt+τ] respectively (up to a multiplicative constant).\nFurther generalization can incorporate several time lagged cross-covariances into the contrast function. For instance, TD-SEP [18] uses a range of time lags from 1 to τ. That said, by using a nonlinear kernel we are able to obtain better contrast functions, as we will show in our experiments.\n\n3.4 Grid Structured Data\nStructured HSIC can go beyond sequence data and be applied to more general dependence structures such as 2-D grids for images. Figure 1d shows the corresponding graphical model. Here each node of the graphical model is indexed by two subscripts, i for row and j for column. In the simplest case, the maximal cliques are\n\nC = {(xij, xi+1,j, xi,j+1, xi+1,j+1, yij, yi+1,j, yi,j+1, yi+1,j+1)}ij.\n\nIn other words, we are using a cross-shaped stencil to connect vertices. Provided that the kernel v can also be decomposed into the product of k and l, then a biased estimate of the independence measure can again be formulated as tr HKHL up to a multiplicative constant. The statistical analysis of U-statistics for stationary Markov random fields is highly nontrivial. We are not aware of results equivalent to those discussed in Section 3.2.\n\n4 Experiments\nHaving a dependence measure for structured spaces is useful for a range of applications. Analogous to iid HSIC, structured HSIC can be applied to non-iid data in applications such as independent component analysis [12], independence tests [6], feature selection [15], clustering [14], and dimensionality reduction [13]. The fact that structured HSIC can take into account the interdependency between observations provides us with a principled generalization of these algorithms to, e.g., time series analysis. 
In this paper, we will focus on two examples: independent component analysis, where we wish to minimize the dependence, and time series segmentation, where we wish to maximize the dependence instead. Two simple illustrative experiments on independence tests for an XOR binary sequence and a Gaussian process can be found in the longer version of this paper.\n\n4.1 Independent Component Analysis\nIn independent component analysis (ICA), we observe a time series of vectors u that corresponds to a linear mixture u = As of n mutually independent sources s (each entry in the source vector here is a random process, and depends on its past values; examples include music and EEG time series). Based on the series of observations u, we wish to recover the sources using only the independence assumption on s. Note that sources can only be recovered up to scaling and permutation. The core of ICA is a contrast function that measures the independence of the estimated sources. An ICA algorithm searches over the space of mixing matrices A such that this contrast function is minimized.\nThus, we propose to use structured HSIC as the contrast function for ICA. By incorporating time lagged variables in the cliques, we expect that structured HSIC can better deal with the non-iid nature of time series. In this respect, we generalize the TD-SEP algorithm [18], which implements this idea using a linear kernel on the signal. Thus, we address the question of whether correlations between higher order moments, as encoded using non-linear kernels, can improve the performance of TD-SEP on real data.\n\nTable 1: Median performance of ICA on music using HSIC, TDSEP, and structured HSIC. In the top row, the number n of sources and m of samples are given. In the second row, the number of time lags τ used by TDSEP and structured HSIC are given: thus the observation vectors x, xt−1, . . . , xt−τ were compared. 
The remaining rows contain the median Amari divergence (multiplied by 100) for the three methods tested. The original HSIC method does not take into account time dependence (τ = 0), and returns a single performance number. Results are in all cases averaged over 136 repetitions: for two sources, this represents all possible pairings, whereas for larger n the sources are chosen at random without replacement.\n\nMethod | n = 2, m = 5000 | n = 3, m = 10000 | n = 4, m = 10000\nτ | 1 2 3 | 1 2 3 | 1 2 3\nHSIC | - 1.51 - | - 1.70 - | - 2.68 -\nTDSEP | 1.54 1.62 1.74 | 1.84 1.72 1.54 | 2.90 2.08 1.91\nStructured HSIC | 1.48 1.62 1.64 | 1.65 1.58 1.56 | 2.65 2.12 1.83\n\nData Following the settings of [7, Section 5.5], we unmixed various musical sources, combined using a randomly generated orthogonal matrix A (since optimization over the orthogonal part of a general mixing matrix is the more difficult step in ICA). We considered mixtures of two to four sources, drawn at random without replacement from 17 possibilities. We used the sum of pairwise dependencies as the overall contrast function when more than two sources were present.\nMethods We compared structured HSIC to TD-SEP and iid HSIC. While iid HSIC does not take the temporal dependence in the signal into account, it has been shown to perform very well for iid data [12]. Following [7], we employed a Laplace kernel, k(x, x′) = exp(−λ‖x − x′‖) with λ = 3 for both structured and iid HSIC. For both structured and iid HSIC, we used gradient descent over the orthogonal group with a Golden search, and low rank Cholesky decompositions of the Gram matrices to reduce computational cost, as in [3].\nResults We chose the Amari divergence as the index for comparing performance of the various ICA methods. 
This is a divergence measure between the estimated and true unmixing matrices, which is invariant to the output ordering and scaling ambiguities. A smaller Amari divergence indicates better performance. Results are shown in Table 1. Overall, contrast functions that take time delayed information into account perform best, although the best time lag is different when the number of sources varies.\n\n4.2 Time Series Clustering and Segmentation\nWe can also extend clustering to time series and sequences using structured HSIC. This is carried out in a similar way to the iid case. One can formulate clustering as generating the labels y from a finite discrete set, such that their dependence on x is maximized [14]:\n\nmaximize_y tr HKHL subject to constraints on y. (15)\n\nHere K and L are the kernel matrices for x and the generated y respectively. More specifically, assuming Lst := δ(ys, yt) for discrete labels y, we recover clustering. Relaxing discrete labels to yt ∈ R with bounded norm ‖y‖2 and setting Lst := ys yt, we obtain Principal Component Analysis.\nThis reasoning for iid data carries over to sequences by introducing additional dependence structure through the kernels: Kst := k(xs,τ, xt,τ) and Lst := l(ys,τ, yt,τ). In general, the interacting label sequences make the optimization in (15) intractable. However, for a class of kernels l an efficient decomposition can be found by applying a reverse convolution on k: assume that l is given by\n\nl(ys,τ, yt,τ) = ∑_{u,v=0}^{τ} l̄(ys+u, yt+v) Muv, (16)\n\nwhere M ∈ R^{(τ+1)×(τ+1)} with M ⪰ 0, and l̄ is a base kernel between individual time points. A common choice is l̄(ys, yt) = δ(ys, yt). In this case we can rewrite tr HKHL by applying the summation over M to HKH, i.e.,\n\ntr HKHL = ∑_{s,t=1}^{T} [HKH]st ∑_{u,v=0}^{τ} l̄(ys+u, yt+v) Muv = ∑_{s,t=1}^{T+τ} l̄(ys, yt) ∑_{u,v=0; s−u,t−v∈[1,T]}^{τ} Muv [HKH]s−u,t−v =: ∑_{s,t=1}^{T+τ} l̄(ys, yt) K̄st. (17)\n\nThis means that we may apply the matrix M to HKH and thereby we are able to decouple the dependency within y. Denote the convolution by K̄ = [HKH] ⋆ M. Consequently using K̄ we can directly apply (15) to time series and sequence data. In practice, approximate algorithms such as incomplete Cholesky decomposition are needed to efficiently compute K̄.\n\nTable 2: Segmentation errors by various methods on the four studied time series.\n\nMethod | Swimming I | Swimming II | Swimming III | BCI\nstructured HSIC | 111.5 | 108.6 | 118.5 | 99.0\nspectral clustering | 162 | 143.9 | 212.3 | 125\nHMM | 168 | 150 | 120 | 153.2\n\nDatasets We study two datasets in this experiment. The first dataset is collected by the Australian Institute of Sport (AIS) from a 3-channel orientation sensor attached to a swimmer. The three time series we used in our experiment have the following configurations: T = 23000 time steps with 4 laps; T = 47000 time steps with 16 laps; and T = 67000 time steps with 20 laps. The task is to automatically find the starting and finishing time of each lap based on the sensor signals. We treated this problem as a segmentation problem. Since the dataset contains 4 different styles of swimming, we used 6 as the number of clusters (there are 2 additional clusters for starting and finishing a lap).\nThe second dataset is a brain-computer interface dataset (data IVb of the Berlin BCI group1). 
It contains EEG signals collected when a subject was performing three types of cued imagination. Furthermore, the relaxation period between two imaginations is also recorded in the EEG. Including the relaxation period, the dataset consists of T = 10000 time points with 16 different segments. The task is to automatically detect the start and end of an imagination. We used 4 clusters for this problem.\nMethods We compared three algorithms: structured HSIC for clustering, spectral clustering [10], and HMM. For structured HSIC, we used the maximal cliques of (xt, yt−50,100), where y is the discrete label sequence to be generated. The kernel l on y took the form of equation (16), with M ∈ R^{101×101} and Muv := exp(−(u − v)²). The kernel k on x was Gaussian RBF: exp(−‖x − x′‖²). As a baseline, we used spectral clustering with the same kernel k on x, and a first order HMM with 6 hidden states and a diagonal Gaussian observation model2.\nFurther details regarding preprocessing of the above two datasets (which is common to all algorithms subsequently compared), parameters of algorithms and protocols of experiments, are available in the longer version of this paper.\nResults To evaluate the segmentation quality, the boundaries found by various methods were compared to the ground truth. First, each detected boundary was matched to a true boundary, and then the discrepancy between them was counted into the error. The overall error was this sum divided by the number of boundaries. Figure 2d gives an example of how to compute this error.\nAccording to Table 2, in all of the four time series we studied, segmentation using structured HSIC leads to lower error compared with spectral clustering and HMM. For instance, structured HSIC reduces nearly 1/3 of the segmentation error in the BCI dataset. 
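The boundary-error protocol just described can be sketched as follows. This is our reading of the matching rule (the exact matching used in the paper may differ), and the boundary values below are hypothetical:

```python
import numpy as np

def segmentation_error(pred, true):
    # Match each detected boundary to its nearest true boundary, sum the
    # absolute offsets, and divide by the number of detected boundaries.
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    errs = [np.min(np.abs(true - p)) for p in pred]
    return sum(errs) / len(pred)

# Worked example in the spirit of Figure 2d: with offsets a, b, c, d between
# detected and true boundaries, the overall error is (a + b + c + d) / 4.
true_b = [100, 200, 300, 400]
pred_b = [103, 195, 310, 398]          # offsets 3, 5, 10, 2
assert segmentation_error(pred_b, true_b) == (3 + 5 + 10 + 2) / 4
```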
To provide a visual feel of the improvement, we plot the true boundaries together with the segmentation results in Figures 2a, 2b, 2c. Clearly, segment boundaries produced by structured HSIC fit better with the ground truth.\n\n5 Conclusion\nIn this paper, we extended the Hilbert Schmidt Independence Criterion from iid data to structured and non-iid data. Our approach is based on RKHS embeddings of distributions, and utilizes the efficient factorizations provided by the exponential family associated with undirected graphical models. Encouraging experimental results were demonstrated on independence tests, ICA, and segmentation for time series. Further work will be done in the direction of applying structured HSIC to PCA and feature selection on structured data.\nAcknowledgements\nNICTA is funded by the Australian Government's Backing Australia's Ability and the Centre of Excellence programs. This work is also supported by the IST Program of the European Community, under the FP7 Network of Excellence, ICT-216886-NOE.\n\n1http://ida.first.fraunhofer.de/projects/bci/competition-iii/desc-IVb.html\n2http://www.torch.ch\n\nFigure 2: Segmentation results produced by (a) structured HSIC, (b) spectral clustering and (c) HMM. (d) An example for counting the segmentation error. Red line denotes the ground truth and blue line is the segmentation results. The error introduced for segment R1 to R′1 is a + b, while that for segment R2 to R′2 is c + d. The overall error in this example is then (a + b + c + d)/4.\n\nReferences\n[1] Aaronson, J., Burton, R., Dehling, H., Gilat, D., Hill, T., & Weiss, B. (1996). Strong laws for L and U-statistics. Transactions of the American Mathematical Society, 348, 2845–2865.\n[2] Altun, Y., Smola, A. J., & Hofmann, T. (2004). Exponential families for conditional random fields. In UAI.\n[3] Bach, F. R., & Jordan, M. I. (2002). 
Kernel independent component analysis. JMLR, 3, 1–48.\n[4] Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Roy. Stat. Soc. B, 36(B), 192–326.\n[5] Borovkova, S., Burton, R., & Dehling, H. (2001). Limit theorems for functionals of mixing processes with applications to dimension estimation. Transactions of the American Mathematical Society, 353(11), 4261–4318.\n[6] Gretton, A., Fukumizu, K., Teo, C.-H., Song, L., Schölkopf, B., & Smola, A. (2008). A kernel statistical test of independence. Tech. Rep. 168, MPI for Biological Cybernetics.\n[7] Gretton, A., Herbrich, R., Smola, A., Bousquet, O., & Schölkopf, B. (2005). Kernel methods for measuring independence. JMLR, 6, 2075–2129.\n[8] Hammersley, J. M., & Clifford, P. E. (1971). Markov fields on finite graphs and lattices. Unpublished manuscript.\n[9] Hosseini, S., & Jutten, C. (2003). On the separability of nonlinear mixtures of temporally correlated sources. IEEE Signal Processing Letters, 10(2), 43–46.\n[10] Ng, A., Jordan, M., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In NIPS.\n[11] Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2008). Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In NIPS.\n[12] Shen, H., Jegelka, S., & Gretton, A. (submitted). Fast kernel-based independent component analysis. IEEE Transactions on Signal Processing.\n[13] Song, L., Smola, A., Borgwardt, K., & Gretton, A. (2007). Colored maximum variance unfolding. In NIPS.\n[14] Song, L., Smola, A., Gretton, A., & Borgwardt, K. (2007). A dependence maximization view of clustering. In Proc. Intl. Conf. Machine Learning.\n[15] Song, L., Smola, A., Gretton, A., Borgwardt, K., & Bedo, J. (2007). Supervised feature selection via dependence estimation. In ICML.\n[16] Sriperumbudur, B., Gretton, A., Fukumizu, K., Lanckriet, G., & Schölkopf, B. (2008). Injective Hilbert space embeddings of probability measures. In COLT.\n[17] Steinwart, I. (2002). The influence of the kernel on the consistency of support vector machines. JMLR, 2.\n[18] Ziehe, A., & Müller, K.-R. (1998). TDSEP – an efficient algorithm for blind separation using time structure. In ICANN.\n", "award": [], "sourceid": 855, "authors": [{"given_name": "Xinhua", "family_name": "Zhang", "institution": null}, {"given_name": "Le", "family_name": "Song", "institution": null}, {"given_name": "Arthur", "family_name": "Gretton", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}]}