{"title": "Characteristic Kernels on Groups and Semigroups", "book": "Advances in Neural Information Processing Systems", "page_first": 473, "page_last": 480, "abstract": "Embeddings of random variables in reproducing kernel Hilbert spaces (RKHSs) may be used to conduct statistical inference based on higher order moments. For sufficiently rich (characteristic) RKHSs, each probability distribution has a unique embedding, allowing all statistical properties of the distribution to be taken into consideration. Necessary and sufficient conditions for an RKHS to be characteristic exist for $\\R^n$. In the present work, conditions are established for an RKHS to be characteristic on groups and semigroups. Illustrative examples are provided, including characteristic kernels on periodic domains, rotation matrices, and $\\R^n_+$.", "full_text": "Characteristic Kernels on Groups and Semigroups\n\nKenji Fukumizu\n\nInstitute of Statistical Mathematics\n\n4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569 Japan\n\nfukumizu@ism.ac.jp\n\nArthur Gretton\n\nMPI for Biological Cybernetics\n\nSpemannstra\u00dfe 38, 72076 T\u00fcbingen, Germany\narthur.gretton@tuebingen.mpg.de\n\nBharath Sriperumbudur\n\nDepartment of ECE, UC San Diego\n/ MPI for Biological Cybernetics\n\nbharathsv@ucsd.edu\n\nBernhard Sch\u00f6lkopf\n\nMPI for Biological Cybernetics\n\nbs@tuebingen.mpg.de\n\nAbstract\n\nEmbeddings of random variables in reproducing kernel Hilbert spaces (RKHSs)\nmay be used to conduct statistical inference based on higher order moments. For\nsufficiently rich (characteristic) RKHSs, each probability distribution has a unique\nembedding, allowing all statistical properties of the distribution to be taken into\nconsideration. Necessary and sufficient conditions for an RKHS to be characteristic\nexist for $\\R^n$. In the present work, conditions are established for an RKHS\nto be characteristic on groups and semigroups. 
Illustrative examples are provided,\nincluding characteristic kernels on periodic domains, rotation matrices, and $\\R^n_+$.\n\n1 Introduction\n\nRecent studies have shown that mapping random variables into a suitable reproducing kernel Hilbert\nspace (RKHS) gives a powerful and straightforward method of dealing with higher-order statistics\nof the variables. For sufficiently rich RKHSs, it becomes possible to test whether two samples\nare from the same distribution, using the difference in their RKHS mappings [8], as well as to test\nindependence and conditional independence [6, 9]. It is also useful to optimize over kernel mappings\non distributions, for instance to find the most predictive subspace in regression [5], or for ICA [1].\n\nKey to the above work is the notion of a characteristic kernel, as introduced in [5, 6]: it gives an\nRKHS for which probabilities have unique images (i.e., the mapping is injective). Such RKHSs\nare sufficiently rich in the sense required above. Universal kernels on compact metric spaces [16]\nare characteristic [8], as are Gaussian and Laplace kernels on $\\R^n$ [6]. Recently, it has been shown\n[14] that a continuous shift-invariant $\\R$-valued positive definite kernel on $\\R^n$ is characteristic if and\nonly if the support of its Fourier transform is the entire $\\R^n$. This completely determines the set of\ncharacteristic kernels in the convex cone of continuous shift-invariant positive definite kernels on $\\R^n$.\n\nOne of the chief advantages of kernel methods is that they allow us to deal straightforwardly with\ncomplex domains, through use of a kernel function to determine the similarity between objects in\nthese domains [13]. A question that naturally arises is whether characteristic kernels can be defined\non spaces besides $\\R^n$. Several such domains constitute topological groups/semigroups, and our\nfocus is on kernels defined by their algebraic structure. 
Broadly speaking, our approach is based on\nextensions of Fourier analysis to groups and semigroups, where we apply appropriate extensions of\nBochner\u2019s theorem to obtain the required conditions on the kernel.\n\nThe most immediate generalization of the results in [14] is to locally compact Abelian groups, of\nwhich $(\\R^n, +)$ is one example. Thus, in Section 2 we provide a review of characteristic kernels on\n$(\\R^n, +)$ from this viewpoint. In Section 3 we derive necessary and sufficient conditions for kernels\non locally compact Abelian groups to be characteristic. Besides $(\\R^n, +)$, such groups include $[0, 1]^n$\nwith periodic boundary conditions [13, Section 4.4.4]. We next address non-Abelian compact groups\nin Section 4, for which we obtain a sufficient condition for a characteristic kernel. We illustrate with\nthe example of SO(3), which describes rotations in $\\R^3$, and is used in fields such as geophysics\n[10] and robotics [15]. Finally, in Section 5, we consider the Abelian semigroup $(\\R^n_+, +)$, where\n$\\R_+ = [0, \\infty)$. This semigroup has many practical applications, including expressions of nonnegative\nmeasures or frequency on n points [3]. Note that in all cases, we provide specific examples of\ncharacteristic kernels to illustrate the properties required.\n\n2 Preliminaries: Characteristic kernels and shift-invariant kernels\n\nLet X be a random variable taking values on a measurable space $(\\Omega, \\mathcal{B})$, and H be an RKHS defined\nby a measurable kernel k on $\\Omega$ such that $E[\\sqrt{k(X, X)}] < \\infty$. The mean element $m_X$ of X is\ndefined as the element in H such that $\\langle m_X, f \\rangle_H = E[f(X)]$ ($\\forall f \\in H$) (see [6, 7]). By plugging\n$f = k(\\cdot, y)$ into the definition, the explicit functional form of $m_X$ is given by $m_X(y) = E[k(y, X)]$.\nA bounded measurable kernel k on $\\Omega$ is called characteristic if the mapping\n\n$$\\{P : \\text{probability on } (\\Omega, \\mathcal{B})\\} \\to H, \\qquad P \\mapsto m_P = E_{X \\sim P}[k(\\cdot, X)] \\qquad (1)$$\n\nis injective ([5, 6]). Therefore, by definition, a characteristic kernel uniquely determines a probability\nby its mean element. This property is important in making inference on properties of distributions.\nIt guarantees, for example, that $\\mathrm{MMD} = \\|m_X - m_Y\\|_H$ is a (strict) distance on the space\nof probabilities on $\\Omega$ [8]. The following result provides a necessary and sufficient condition for a\nkernel to be characteristic and shows its associated RKHS to be a rich function class.\nLemma 1 ([7] Prop. 5). Let $(\\Omega, \\mathcal{B})$ be a measurable space, k be a bounded measurable positive\ndefinite kernel on $\\Omega$, and H be the associated RKHS. Then, k is characteristic if and only if $H + \\R$\n(direct sum of the two RKHSs) is dense in $L^2(P)$ for every probability P on $(\\Omega, \\mathcal{B})$.\nThe above lemma and Theorem 3 of [6] imply that characteristic kernels give a criterion of (conditional)\nindependence through (conditional) covariance on RKHS, which enables statistical tests of\nindependence with kernels [6]. This also explains the practical importance of characteristic kernels.\n\nThe following result shows that the characteristic property is invariant under some conformal mappings\nintroduced in [17] and provides a construction to generate new characteristic kernels.\nLemma 2. Let $\\Omega$ be a topological space with Borel $\\sigma$-field, k be a measurable positive definite\nkernel on $\\Omega$ such that $\\int_\\Omega k(\\cdot, y)\\,d\\mu(y) = 0$ means $\\mu = 0$ for a finite Borel measure $\\mu$, and $f : \\Omega \\to \\mathbb{C}$\nbe a bounded continuous function such that $f(x) > 0$ for all $x \\in \\Omega$ and $k(x, x)|f(x)|^2$ is bounded.\nThen, the kernel $\\tilde{k}(x, y) = f(x)k(x, y)f(y)$ is characteristic.\nProof. Let P and Q be Borel probabilities such that $\\int \\tilde{k}(\\cdot, x)\\,dP(x) = \\int \\tilde{k}(\\cdot, x)\\,dQ(x)$. We have\n$\\int k(\\cdot, x)f(x)\\,d(P - Q)(x) = 0$, which means fP = fQ. 
We have P = Q by the positivity and continuity of f.\n\nWe will focus on spaces with algebraic structure for better description of characteristic kernels.\nLet G be a group. A function $\\phi : G \\to \\mathbb{C}$ is called positive definite if $k(x, y) = \\phi(y^{-1}x)$ is\na positive definite kernel. We call positive definite kernels of this type shift-invariant, because\n$k(zx, zy) = \\phi((zy)^{-1}zx) = \\phi(y^{-1}x) = k(x, y)$ for any $z \\in G$.\nThere are many examples of shift-invariant positive definite kernels on the additive group $\\R^n$: the Gaussian\nRBF kernel $k(x, y) = \\exp(-\\|x - y\\|^2/\\sigma^2)$ and the Laplacian kernel $k(x, y) = \\exp(-\\beta\\sum_{i=1}^n |x_i - y_i|)$\nare well-known ones. In the case of $\\R^n$, the following theorem of Bochner is well known.\nTheorem 3 (Bochner). Let $\\phi : \\R^n \\to \\mathbb{C}$ be a continuous function. $\\phi$ is positive definite if and only\nif there is a unique finite non-negative Borel measure $\\Lambda$ on $\\R^n$ such that\n\n$$\\phi(x) = \\int_{\\R^n} e^{\\sqrt{-1}\\,x^T\\omega}\\,d\\Lambda(\\omega). \\qquad (2)$$\n\nBochner\u2019s theorem completely characterizes the set of continuous shift-invariant positive definite\nkernels on $\\R^n$ by the Fourier transform. It also implies that the continuous positive definite functions\nform a convex cone with the extreme points given by the Fourier kernels $\\{e^{\\sqrt{-1}\\,x^T\\omega} \\mid \\omega \\in \\R^n\\}$.\n\nIt is interesting to determine the class of continuous shift-invariant \u201ccharacteristic\u201d kernels on $\\R^n$.\n[14] gives a complete solution: if $\\mathrm{supp}(\\Lambda) = \\R^n$ (see footnote 1), then $\\phi(x - y)$ is characteristic. In addition, if\na continuous positive definite function of the form in Eq. (2) is real-valued and characteristic, then\n$\\mathrm{supp}(\\Lambda) = \\R^n$. 
The basic idea is the following: since the mean element $E_P[\\phi(y - X)]$ is equal to\nthe convolution $\\phi * P$, the Fourier transform rewrites the definition of the characteristic property as\n\n$$(\\hat{P} - \\hat{Q})\\Lambda = 0 \\implies P = Q,$$\n\nwhere the hat denotes the Fourier transform, and we use $\\widehat{\\phi * P} = \\Lambda\\hat{P}$. Hence, it is natural to expect that\nif $\\Lambda$ is everywhere positive, then $(\\hat{P} - \\hat{Q})$ must be zero, which means P = Q.\n\nWe will extend these results to more general algebraic objects, such as groups and semigroups, to\nwhich Fourier analysis and Bochner\u2019s theorem can be extended.\n\n3 Characteristic kernels on locally compact Abelian groups\n\nIt is known that most of the results on Fourier analysis for $\\R^n$ extend to any locally compact\nAbelian (LCA) group, that is, an Abelian (i.e. commutative) topological group whose topology is\nHausdorff and locally compact. The basic terminology is provided in the supplementary material\nfor readers who are not familiar with it. The group operation is denoted by \u201c+\u201d in Abelian cases.\nHereafter, for an LCA group G, we consider only the probability measures included in the set of finite\nregular measures M(G) (see Supplements) to discuss the characteristic property. This slightly restricts\nthe class of measures, but removes only pathological ones.\n\n3.1 Fourier analysis on LCA Group\n\nWe briefly summarize the results needed to show our main theorems. For the details, see [12, 11].\n\nFor an LCA group G, there exists a non-negative regular measure m on G such that m(E + x) =\nm(E) for every x \u2208 G and every Borel set E in G. This measure is called the Haar measure. We use\ndx to denote the Haar measure of G. 
With the Haar measure, the integral is shift-invariant, that is,\n\n$$\\int_G f(x + y)\\,dx = \\int_G f(x)\\,dx \\qquad (\\forall y \\in G).$$\n\nThe space $L^p(G, dx)$ is simply denoted by $L^p(G)$.\nA function $\\gamma : G \\to \\mathbb{C}$ is called a character of G if $\\gamma(x + y) = \\gamma(x)\\gamma(y)$ and $|\\gamma(x)| = 1$. The set of\nall continuous characters of G forms an Abelian group with the operation $(\\gamma_1\\gamma_2)(x) = \\gamma_1(x)\\gamma_2(x)$.\nBy convention, the group operation is denoted by addition \u201c+\u201d instead of multiplication; i.e.,\n$(\\gamma_1 + \\gamma_2)(x) = \\gamma_1(x)\\gamma_2(x)$. This group is called the dual group of G, and is denoted by $\\hat{G}$.\nFor any $x \\in G$, the function $\\hat{x}$ on $\\hat{G}$ given by $\\hat{x}(\\gamma) = \\gamma(x)$ ($\\gamma \\in \\hat{G}$) defines a character of $\\hat{G}$. It is\nknown that $\\hat{G}$ is an LCA group if the weakest topology is introduced so that $\\hat{x}$ is continuous for each\n$x \\in G$. We can therefore consider the dual of $\\hat{G}$, denoted by $\\hat{\\hat{G}}$, and the group homomorphism\n\n$$G \\to \\hat{\\hat{G}}, \\qquad x \\mapsto \\hat{x}.$$\n\nThe Pontryagin duality guarantees that this homomorphism is an isomorphism and a homeomorphism;\nthus $\\hat{\\hat{G}}$ can be identified with G. In view of the duality, it is customary to write $(x, \\gamma) := \\gamma(x)$.\nWe have $(-x, \\gamma) = (x, -\\gamma) = \\gamma(x)^{-1} = \\overline{(x, \\gamma)}$, where $\\bar{z}$ is the complex conjugate of z.\nFor $f \\in L^1(G)$ and $\\mu \\in M(G)$, the Fourier transforms of f and $\\mu$ are respectively defined by\n\n$$\\hat{f}(\\gamma) = \\int_G (-x, \\gamma) f(x)\\,dx, \\qquad \\hat{\\mu}(\\gamma) = \\int_G (-x, \\gamma)\\,d\\mu(x) \\qquad (\\gamma \\in \\hat{G}). \\qquad (3)$$\n\nLet $f \\in L^\\infty(G)$, $g \\in L^1(G)$, and $\\mu, \\nu \\in M(G)$. 
The convolutions are de\ufb01ned respectively by\n(g \u2217f )(x) =ZG\n\nf (x\u2212y)g(y)dy, (\u00b5\u2217f )(x) =ZG\n\nf (x\u2212y)d\u00b5(y), (\u00b5\u2217\u03bd)(E) =ZG\n\n\u03c7E(x+y)d\u00b5(x)d\u03bd(y).\n\n1For a \ufb01nite regular measure, there is the largest open set U with \u00b5(U ) = 0. The complement of U is called\n\nthe support of \u00b5, and denoted by supp(\u00b5). See the supplementary material for the detail.\n\n3\n\n\fg \u2217 f is uniformly continuous on G. For any f, g \u2208 L1(G) and \u00b5, \u03bd \u2208 M (G), we have the formula\n(4)\n\n[f \u2217 g = \u02c6f \u02c6g, [\u00b5 \u2217 f = b\u00b5bf , [\u00b5 \u2217 \u03bd = b\u00b5b\u03bd.\n\nThe following facts are basic ( [12], Section 1.3).\nProposition 4. For \u00b5 \u2208 M (G), the Fourier transform \u02c6\u00b5 is bounded and uniformly continuous.\nTheorem 5 (Uniqueness theorem). If \u00b5 \u2208 M (G) satis\ufb01es b\u00b5 = 0, then \u00b5 = 0.\nIt is known that the dual group of the LCA group Rn is {e\u221a\u22121\u03c9T x | \u03c9 \u2208 Rn}, which can be\n\nidenti\ufb01ed with Rn. The above de\ufb01nition and properties of Fourier transform for LCA groups are\nextension of the ordinary Fourier transform for Rn. Bochner\u2019s theorem can be also extended.\nTheorem 6 (Bochner\u2019s theorem. e.g., [12] Section 1.4.3). A continuous function \u03c6 on G is positive\n\nde\ufb01nite if and only if there is a unique non-negative measure \u039b \u2208 M (bG) such that\n\n(x, \u03b3)d\u039b(\u03b3)\n\n(x \u2208 G).\n\n(5)\n\n\u03c6(x) =Z\nbG\n\n3.2 Shift-invariant characteristic kernels on LCA group\n\nBased on Bochner\u2019s theorem, a suf\ufb01cient condition of the characteristic property is obtained.\nTheorem 7. Let \u03c6 be a continuous positive de\ufb01nite function on a LCA group G given by Eq. (5)\n\n\u03c6)(x)d\u00b5(x) = 0. On the other hand, by using Fubini\u2019s theorem,\n\nTheorem 8. 
Let \u03c6 be a R-valued continuous positive de\ufb01nite function on a LCA group G given\n\nwith \u039b. If supp(\u039b) = bG, then the positive de\ufb01nite kernel k(x, y) = \u03c6(x \u2212 y) is characteristic.\nProof. It suf\ufb01ces to prove that if \u00b5 \u2208 M (G) satis\ufb01es \u00b5 \u2217 \u03c6 = 0 then \u00b5 = 0. We have RG(\u00b5 \u2217\nRG(\u00b5 \u2217 \u03c6)(x)d\u00b5(x) =RGRG\u03c6(x \u2212 y)d\u00b5(y)d\u00b5(x) =RGRGRbG(x \u2212 y, \u03b3)d\u039b(\u03b3)d\u00b5(y)d\u00b5(x)\n=RbGRG(x, \u03b3)d\u00b5(x)RG(\u2212y, \u03b3)d\u00b5(y)d\u039b(\u03b3) =RbG |b\u00b5(\u03b3)|2d\u039b(\u03b3).\nSince b\u00b5 is continuous and supp(\u039b) = bG, we have b\u00b5 = 0, which means \u00b5 = 0 by Theorem 5.\nIn real-valued cases, the condition supp(\u039b) = bG is almost necessary.\nby Eq. (5) with \u039b. The kernel \u03c6(x \u2212 y) is characteristic if and only if (i) 0 \u2208 bG is not open and\nsupp(\u039b) = bG, or (ii) 0 \u2208 bG is open and supp(\u039b) \u2283 bG \u2212 {0}. The case (ii) occurs if G is compact.\nProof. It suf\ufb01ces to prove the only if part. Assume k(x, y) = \u03c6(x \u2212 y) is characteristic.\nIt is\nobvious that k is characteristic if and only if so is k(x, y) + 1. Thus, we can assume 0 \u2208 supp(\u039b).\nSuppose supp(\u039b) 6= bG. Since \u03c6 is real-valued, \u039b(\u2212E) = \u039b(E) for every Borel set E. Thus\nU := bG\\supp(\u039b) is a non-empty open set, with \u2212U = U, and 0 /\u2208 U by assumption. Let \u03b30 \u2208 U\nand \u03c4 : bG \u00d7 bG \u2192 bG, (\u03b31, \u03b32) 7\u2192 \u03b31 \u2212 \u03b32. Take an open neighborhood W of 0 in bG with compact\nclosure such that W \u2282 \u03c4\u22121(U \u2212 \u03b30). Then, (W + (\u2212W ) + \u03b30) \u222a (W + (\u2212W ) \u2212 \u03b30) \u2282 U.\nLet g = \u03c7W \u2217 \u03c7\u2212W , where \u03c7E denotes the indicator function of a set E.\nuous, and supp(g) \u2282 cl(W + (\u2212W )). 
Also, g is positive de\ufb01nite, since Pi,jcicjg(xi \u2212\nxj) = Pi,j cicjRG\u03c7W (xi \u2212 xj \u2212 y)\u03c7\u2212W (y)dy = Pi,j cicjRG\u03c7W (xi \u2212 y)\u03c7\u2212W (y \u2212 xj)dy =\nRG(cid:0)Pici\u03c7W (xi \u2212 y)(cid:1)(cid:0)Pj cj\u03c7W (xj \u2212 y)(cid:1)dy \u2265 0. By Bochner\u2019s theorem and Pontryagin duality,\nthere is a non-negative measure \u00b5 \u2208 M (G) such that\ng(\u03b3) =RG(x, \u03b3)d\u00b5(x)\n\ng(\u03b3 \u2212 \u03b30) + g(\u03b3 + \u03b30) =RG{(x, \u03b3 \u2212 \u03b30) + (x, \u03b3 + \u03b30)}d\u00b5(x) =RG(x, \u03b3)d((\u03b30 +\nIt follows that\n\u03b30)\u00b5)(x).\nSince supp(g) \u2282 cl(W + (\u2212W )), the left hand side is non-zero only in (W + (\u2212W ) + \u03b30)\u222a (W +\n(\u2212W ) \u2212 \u03b30) \u2282 U, which does not contain 0. Thus, by setting \u03b3 = 0, we have\n\n(\u03b3 \u2208 bG).\n\ng is contin-\n\n((\u03b30 + \u03b30)\u00b5)(G) = 0.\n\n(6)\n\n4\n\n\fThe measure (\u03b30 + \u03b30)\u00b5 is real-valued, and non-zero since the function g(\u03b3 \u2212 \u03b30) + g(\u03b3 + \u03b30) is\nnot constant zero. Let m = |(\u03b30 + \u03b30)\u00b5|(G), and de\ufb01ne the non-negative measures\n\u00b52 = {|(\u03b30 + \u03b30)\u00b5| \u2212 (\u03b30 + \u03b30)\u00b5}/m.\n\n\u00b51 = |(\u03b30 + \u03b30)\u00b5|/m,\n\nBoth of \u00b51 and \u00b52 are probability measures on G from Eq. (6), and \u00b51 6= \u00b52. From Fubini\u2019s theorem,\nm \u00d7 ((\u00b51 \u2212 \u00b52) \u2217 \u03c6)(x) =RG\u03c6(x \u2212 y)(\u03b30(y) + \u03b30(y))d\u00b5(y)\n=RbG(x, \u03b3)RG {(y, \u03b3 \u2212 \u03b30) + (y, \u03b3 + \u03b30)}d\u00b5(y)d\u039b(\u03b3) =RbG(x, \u03b3){g(\u03b3 \u2212 \u03b30) + g(\u03b3 + \u03b30)}d\u039b(\u03b3)\nSince the integrand is zero in supp(\u039b), we have (\u00b51 \u2212 \u00b52) \u2217 \u03c6 = 0, which derives contradiction.\nThe last assertion is obvious, since bG is discrete if and only if G is compact [12, Sec. 1.7.3].\n\nProof. We show the proof only for (i). 
Let \u039b1, \u039b2 be the non-negative measures to give \u03c61 and \u03c62,\n\nTheorems 7 and 8 are generalization of the results in [14]. From Theorem 8, we can see that the\ncharacteristic property is stable under the product for shift-invariant kernels.\nCorollary 9. Let \u03c61(x \u2212 y) and \u03c62(x \u2212 y) be R-valued continuous shift-invariant characteristic\nkernels on a LCA group G. If (i) G is non-compact, or (ii) G is compact and 2\u03b3 6= 0 for any nonzero\n\u03b3 \u2208 bG. Then (\u03c61\u03c62)(x \u2212 y) is characteristic.\nrespectively, in Eq. (5). By Theorem 8, supp(\u039b1) = supp(\u039b2) = bG. This means supp(\u039b1 \u2217 \u039b2) =\nbG. The proof is completed because \u039b1 \u2217 \u039b2 gives a positive de\ufb01nite function \u03c61\u03c62.\nExample 1. (Rn, +): As already shown in [6, 14], the Gaussian RBF kernel exp(\u2212 1\n2\u03c32kx \u2212 yk2)\nand Laplacian kernel exp(\u2212\u03b2Pn\ni=1 |xi \u2212 yi|) are characteristic on Rn. An example of a positive\nde\ufb01nite kernel that is not characteristic on Rn is sinc(x \u2212 y) = sin(x\u2212y)\nx\u2212y\nExample 2. ([0, 2\u03c0), +): The addition is made modulo 2\u03c0. The dual group is {e\u221a\u22121nx | n \u2208 Z},\nwhich is isomorphic to Z. The Fourier transform is equal to the ordinary Fourier expansion. 
The\nfollowing are examples of characteristic kernels given by the expression\n\n$$\\phi(x) = \\sum_{n=-\\infty}^{\\infty} a_n e^{\\sqrt{-1}\\,nx}, \\qquad a_0 \\geq 0,\\ a_n > 0\\ (n \\neq 0),\\ \\sum_{n=0}^{\\infty} a_n < \\infty.$$\n\n(1) $a_0 = \\pi^2/3$, $a_n = 2/n^2$ ($n \\neq 0$) $\\Rightarrow$ $k_1(x, y) = (\\pi - (x - y)_{\\mathrm{mod}\\,2\\pi})^2$.\n(2) $a_0 = 1/2$, $a_n = 1/(1 + n^2)$ ($n \\neq 0$) $\\Rightarrow$ $k_2(x, y) = \\cosh(\\pi - (x - y)_{\\mathrm{mod}\\,2\\pi})$.\n(3) $a_0 = 0$, $a_n = \\alpha^n/n$ ($n \\neq 0$), ($|\\alpha| < 1$) $\\Rightarrow$ $k_3(x, y) = -\\log(1 - 2\\alpha\\cos(x - y) + \\alpha^2)$.\n(4) $a_n = \\alpha^{|n|}$ ($0 < \\alpha < 1$) $\\Rightarrow$ $k_4(x, y) = 1/(1 - 2\\alpha\\cos(x - y) + \\alpha^2)$ (Poisson kernel).\nExamples of non-characteristic kernels on $[0, 2\\pi)$ include $\\cos(x - y)$, the Fej\u00e9r kernel, and the Dirichlet kernel.\n\n4 Characteristic kernels on compact groups\n\nWe discuss non-Abelian cases in this section. Non-Abelian groups include various matrix groups,\nsuch as $SO(3) = \\{A \\in M(3 \\times 3; \\R) \\mid A^T A = I_3,\\ \\det A = 1\\}$, which represents rotations in $\\R^3$.\nSO(3) is used in practice as the data space of rotational data, which commonly appear in many fields\nsuch as geophysics [10] and robotics [15]. Providing useful positive definite kernels on this class is\nimportant in those application areas. First, we give a brief summary of known results on the Fourier\nanalysis on locally compact and compact groups. See [11, 4] for the details.\n\n4.1 Unitary representation and Fourier analysis\n\nLet G be a locally compact group, which may not be Abelian. 
A unitary representation (T, H) of\nG is a group homomorphism T into the group U (H) of unitary operators on some nonzero Hilbert\nspace H, that is, a map T : G \u2192 U (H) that satis\ufb01es T (xy) = T (x)T (y) and T (x\u22121) = T (x)\u22121 =\nT (x)\u2217, and for which x 7\u2192 T (x)u is continuous from G to H for any u \u2208 H.\nFor a unitary representation (T, H) on a locally compact group G, a subspace V in H is called G-\ninvariant if T (x)V \u2282 V for every x \u2208 G. A unitary representation (T, H) is irreducible if there are\n\n5\n\n\fno closed G-invariant subspace except {0} and H. Unitary representations (T1, H1) and (T2, H2)\nare said to be equivalent if there is a unitary isomorphism A : H1 \u2192 H2 such that T1 = A\u22121T2A.\nThe following facts are basic (e.g., [4], Section 3,1, 5.1).\nTheorem 10. (i) If G is a compact group, every irreducible unitary representation (T, H) of G is\n\ufb01nite dimensional, that is, H is \ufb01nite dimensional. (ii) If G is an Abelian group, every irreducible\nunitary representation of G is one dimensional. They are the continuous characters of G.\n\nIt is possible to extend the Fourier analysis on locally compact non-Abelian groups. Unlike Abelian\ncases, the Fourier transform by the characters are not possible, but we need to consider unitary\nrepresentations and operator-valued Fourier transform. Since extending the results of the LCA case\nto the general cases causes very complicated topology, we focus on compact groups. Also, for\nsimplicity, we assume that G is second countable, i.e., there are countable open basis on G.\n\nf (x)T (x\u22121)dx =ZG\n\nf (x)T (x)\u2217dx,\n\nT (x\u22121)d\u00b5(x) =ZG\n\nT (x)\u2217d\u00b5(x),\n\ngroup G. The equivalence class of a unitary representation (T, HT ) is denoted by [T ], and the\n\nWe de\ufb01ne bG to be the set of equivalent classes of irreducible unitary representations of a compact\ndimensionality of HT by dT . 
We \ufb01x a representative T for every [T ] \u2208 bG for all.\n\nrespectively. These are operators on HT . This is a natural extension of the Fourier transform on\n\nIt is known that on a compact group G there is a Haar measure m, which is a left and right invariant\nnon-negative \ufb01nite measure. We normalize it so that m(G) = 1 and denote it by dx.\nLet (T, HT ) be a unitary representation. For f \u2208 L1(G) and \u00b5 \u2208 M (G), the Fourier transform of f\nand \u00b5 are de\ufb01ned by the \u201coperator-valued\u201d functions on bG,\nb\u00b5(T ) =ZG\nbf (T ) =ZG\nLCA groups, where bG is the characters serving as the Fourier kernel in view of Theorem 10.\nWe can de\ufb01ne the \u201cinverse Fourier transform\u201d. Let AT ([T ] \u2208 bG) be an operator on HT . The series\nis said to be absolutely convergent ifP[T ]\u2208bG dT Tr[|AT|] < \u221e, where |A| = \u221aAT A. It is obvious\nif G is second countable, bG is at most countable, thus the sum is taken over the countable set.\nFourier transform b\u03c6(T ) is positive semide\ufb01nite, gives an absolutely convergent series Eq. (7), and\nj xi) =Pi,jcicjP[T ]\u2208bGdT Tr[b\u03c6(T )T (x\u22121\nThe proof of \u201cif\u201d part is easy; in fact,Pi,jcicj\u03c6(x\u22121\n=Pi,jcicjP[T ]dT Tr[T (xi)b\u03c6(T )T (xj)\u2217] =P[T ]dT Tr[(cid:0)PiciT (xi)(cid:1)b\u03c6(T )(cid:0)PjcjT (xj)(cid:1)\u2217] \u2265 0.\n\nBochner\u2019s theorem can be extended to compact groups as follows [11, Section 34.10].\nTheorem 11. A continuous function \u03c6 on a compact group G is positive de\ufb01nite if and only if the\n\n\u03c6(x) =P[T ]\u2208bGdT Tr[b\u03c6(T )T (x)].\n\nthat if the above series is absolutely convergent, the convergence is uniform on G. It is known that\n\nP[T ]\u2208bGdT Tr[AT T (x)]\n\n4.2 Shift-invariant characteristic kernels on compact groups\n\n(7)\n\n(8)\n\nj xi)]\n\nWe have the following suf\ufb01cient condition of characteristic property for compact groups.\nTheorem 12. 
Let \u03c6 be a positive de\ufb01nite function of the form Eq. (8) on a compact group G. If\n\nb\u03c6(T ) is strictly positive de\ufb01nite for every [T ] \u2208 bG\\{1}, the kernel \u03c6(y\u22121x) is characteristic.\nDe\ufb01ne \u00b5 = P \u2212 Q,\nProof. Let P, Q \u2208 M (G) be probabilities on G.\nsuppose RG \u03c6(y\u22121x)d\u00b5(y) = 0.\ntheorem shows 0 = RGRGP[T ]dT Tr[b\u03c6(T )T (y\u22121x)]d\u00b5(y)d\u00b5(x) =\nsure \u00b5, Fubini\u2019s\nP[T ]dTRGRGTr[T (x)b\u03c6(T )T (y)\u2217]d\u00b5(x)d\u00b5(y) = P[T ]dT Tr[b\u00b5(T )b\u03c6(T )b\u00b5(T )\u2217]. Since dT > 0 and\nb\u03c6(T ) is strictly positive, b\u00b5(T ) = 0 for every [T ] \u2208 bG, that is, RG T (x)\u2217d\u00b5(x) = O. If we \ufb01x an\n\nand\nIf we take the integral over x with the mea-\n\northonormal basis of HT and express T (x) by the matrix elements Tij(x), we have\n\nRGTij(x)d\u00b5(x) = 0\n\n(\u2200[T ] \u2208 bG, i, j = 1, . . . , dT ).\n\n6\n\n\fThe Peter-Weyl Theorem (e.g., [4, Section 5.2]) shows that {\u221adT Tij(x) |\n1, . . . , dT} is a complete orthonormal basis of L2(G), which means \u00b5 = 0.\nIt is interesting to ask whether Theorem 8 can be extended to compact groups. The same proof does\n\n[T ] \u2208 bG, i, j =\n\nnot possible by the lack of duality.\n\nnot apply, however, because application of Bochner\u2019s theorem to a positive de\ufb01nite function on bG is\nIt is known that \\SO(3) consists of (Tn, Hn) (n = 0, 1, 2, . . .), where dTn =\nExample of SO(3).\n2n + 1. We omit the explicit form of Tn, while it is known (e.g., [4], Section 5.4), but use the\ncharacter de\ufb01ned by \u03b3n(x) = Tr[Tn(x)]. It is also known that \u03b3n is given by\n\n\u03b3n(A) =\n\nsin((2n + 1)\u03b8)\n\nsin \u03b8\n\n(n = 0, 1, 2, . . .),\n\nwhere e\u00b1\u221a\u22121\u03b8 (0 \u2264 \u03b8 \u2264 \u03c0) are the eigenvalues of A, i.e., cos \u03b8 = 1\nb\u03c6(Tn) = anIdTn in Eq. 
(8) derives an\u03b3n for each term, we see that a sequence {an}\u221en=0 such that\na0 \u2265 0, an > 0 (n \u2265 1), and P\u221en=0 an(2n + 1)2 < \u221e de\ufb01nes a characteristic positive de\ufb01nite\n\n2 Tr[A]. Since plugging\n\nkernel on SO(3) by\n\nk(A, B) =P\u221en=0(2n + 1)an\n\n1\n2\nSome examples are listed below (\u03b1 is a parameter such that |\u03b1| < 1).\n\nsin((2n + 1)\u03b8)\n\n(cos \u03b8 =\n\nsin \u03b8\n\nTr[B\u22121A], 0 \u2264 \u03b8 \u2264 \u03c0).\n\n(1) an =\n\n(2) an =\n\n1\n\n(2n + 1)4 :\n\u03b12n+1\n(2n + 1)2 :\n\nk1(A, B) =\n\n1\n\nsin \u03b8\n\nsin((2n + 1)\u03b8)\n\n(2n + 1)3 =\n\n\u221eXn=0\n\nk2(A, B) =\n\n\u221eXn=0\n\n\u03b12n+1 sin((2n + 1)\u03b8)\n\n(2n + 1) sin \u03b8\n\n=\n\n.\n\n\u03c0\u03b8(\u03c0 \u2212 \u03b8)\n8 sin \u03b8\narctan(cid:16) 2\u03b1 sin \u03b8\n1 \u2212 \u03b12 (cid:17).\n\n1\n\n2 sin \u03b8\n\n5 Characteristic kernels on the semigroup Rn\n\n+\n\nIn this section, we consider kernels on an Abelian semigroup (S, +). In this case, a kernel based\non the semigroup structure is de\ufb01ned by k(x, y) = \u03c6(x + y). For an Abelian semigroup (S, +), a\nsemicharacter is de\ufb01ned by a map \u03c1 : S \u2192 C such that \u03c1(x + y) = \u03c1(x)\u03c1(y).\nWhile extensions of Bochner\u2019s theorem are known for semigroups [2], the topology on the set of\nsemicharacters are not as obvious as LCA groups, and the straightforward extension of the results\nin Section 3 is dif\ufb01cult. We focus only on the Abelian semigroup (Rn\n+, +), where R+ = [0,\u221e).\nThis semigroup has many practical applications of data analysis including expressions of nonneg-\n+, it is easy to see the bounded continuous\native measures or frequency on n points [3]. For Rn\n\nsemicharacters are given by {Qn\n\ni=1 e\u2212\u03bbix | \u03bbi \u2265 0 (i = 1, . . . , n)} [2, Section 4.4].\n+, Laplace transform replaces Fourier transform to give Bochner\u2019s theorem.\n\nFor Rn\nTheorem 13 ([2], Section 4.4). 
Let \u03c6 be a bounded continuous function on Rn\nif and only if there exists a unique non-negative measure \u039b \u2208 M (Rn\n\n+) such that\n\n+. \u03c6 is positive de\ufb01nite\n\n\u03c6(x) =ZRn\n\n+\n\ne\u2212P n\n\ni=1 tixid\u039b(t)\n\n(\u2200x \u2208 Rn\n+).\n\n(9)\n\nBased on the above theorem, we have the following suf\ufb01cient condition of characteristic property.\nTheorem 14. Let \u03c6 be a positive de\ufb01nite function given by Eq. (9). If supp\u039b = Rn\n+, then the\npositive de\ufb01nite kernel k(x, y) = \u03c6(x + y) is characteristic.\n\ne\u2212P n\n\nProof. Let P and Q be probabilities on Rn\n\nL\u00b5(t) = RRn\nR \u03c6(x + y)d\u00b5(y) = 0 for all x \u2208 Rn\nLP = LQ. By the uniqueness part of Theorem 13, we conclude P = Q.\n\n+, and \u00b5 = P \u2212 Q. De\ufb01ne the Laplace transform by\n+. Suppose\n+. In exactly the same way as the proof of Theorem 7, we have\n\ni=1 tixid\u00b5(x). It is easy to see L\u00b5 is bounded and continuous on Rn\n\n+\n\n7\n\n\fi\n\n2 shows\n\n+), Lemma\n\n+, +). Let a = (ai)n\n\ni=1 and b = (bi)n\n\ni=1\n\ne\u03bbti (\u03bb > 0) :\n(2) \u039b = t\u22123/2e\u2212\u03b22/(4t) (\u03b2 > 0) :\n\ni=1t\u03bd\u22121\n\nWe show some examples of characteristic kernels on (Rn\n(ai \u2265 0, bi \u2265 0) be non-negative measures on n points.\n(1) \u039b =Qn\nSince the proof of Theorem 14 showsR \u03c6(x + y)d\u00b5(y) = 0 means \u00b5 = 0 for \u00b5 \u2208 M (Rn\ni=1pbi)/2(cid:1)(cid:9)\nis also characteristic. The exponent has the form h(cid:0) a+b\nwith h(c) =Pn\n\ni=1\u221aci, which\ncompares the value of h of the merged measure (a + b)/2 and the average of h(a) and h(b). 
This\ntype of kernel on non-negative measures is discussed in [3] in connection with semigroup structure.\n\ni=1p(ai + bi)/2 \u2212 (Pn\n\n\u02dck2(a, b) = exp(cid:8)\u2212\u03b2(cid:0)Pn\n\nk1(a, b) =Qn\nk2(a, b) = e\u2212\u03b2P n\n\ni=1\u221aai +Pn\n\n2 (cid:1)\u2212 h(a)+h(b)\n\n2\n\ni=1(ai + bi + \u03bb)\u22121.\n\ni=1 \u221aai+bi.\n\n6 Conclusions\n\nWe have discussed conditions that kernels de\ufb01ned by the algebraic structure of groups and semi-\ngroups are characteristic. For locally compact Abelian groups, the continuous shift-invariant R-\nvalued characteristic kernels are completely determined by the Fourier inverse of positive measures\nwith support equal to the entire dual group. For compact (non-Abelian) groups, we show a suf\ufb01cient\ncondition of continuous shift-invariant characteristic kernels in terms of the operator-valued Fourier\ntransform. We show a condition for the semigroup Rn\n+. In the advanced theory of harmonic analysis,\nBochner\u2019s theorem and Fourier analysis can be extended to more general algebraic structure to some\nextent. It is interesting to consider generalization of the results in this paper to such general classes.\n\nIn practical applications of machine learning, we are given a \ufb01nite sample from a distribution, rather\nthan the distribution itself. In this setting, it becomes important to choose the best possible kernel\nfor inference on this sample. While the characteristic property gives a necessary requirement for\nRKHS embeddings of distributions to be distinguishable, it does not address optimal kernel choice\nat \ufb01nite sample sizes. Theoretical approaches to this problem are the basis for future work.\n\nReferences\n[1] F. R. Bach and M. I. Jordan. Kernel independent component analysis. JMLR, 3:1\u201348, 2002.\n[2] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, 1984.\n[3] M. Cuturi, K. Fukumizu, and J.-P. Vert. Semigroup kernels on measures. 
JMLR, 6:1169\u20131198, 2005.\n[4] G. B. Folland. A course in abstract harmonic analysis. CRC Press, 1995.\n[5] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. JMLR, 5:73\u201399, 2004.\n[6] K. Fukumizu, A. Gretton, X. Sun, and B. Sch\u00f6lkopf. Kernel measures of conditional dependence. Advances in NIPS 20, 489\u2013496. MIT Press, 2008.\n[7] K. Fukumizu, F. R. Bach, and M. I. Jordan. Kernel dimension reduction in regression. The Annals of Statistics, 2009, in press.\n[8] A. Gretton, K. M. Borgwardt, M. Rasch, B. Sch\u00f6lkopf, and A. Smola. A kernel method for the two-sample-problem. Advances in NIPS 19. MIT Press, 2007.\n[9] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Sch\u00f6lkopf, and A. Smola. A kernel statistical test of independence. Advances in NIPS 20, 585\u2013592. MIT Press, 2008.\n[10] M. S. Hanna and T. Chang. Fitting smooth histories to rotation data. Journal of Multivariate Analysis, 75:47\u201361, 2000.\n[11] E. Hewitt and K. A. Ross. Abstract Harmonic Analysis II. 1970.\n[12] W. Rudin. Fourier Analysis on Groups. Interscience, 1962.\n[13] B. Sch\u00f6lkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.\n[14] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Sch\u00f6lkopf. Injective Hilbert space embeddings of probability measures. In Proc. COLT 2008, to appear, 2008.\n[15] O. Stavdahl, A. K. Bondhus, K. Y. Pettersen, and K. E. Malvig. Optimal statistical operators for 3-dimensional rotational data: geometric interpretations and application to prosthesis kinematics. Robotica, 23(3):283\u2013292, 2005.\n[16] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. JMLR, 2:67\u201393, 2001.\n[17] S. Wu and S-I. Amari. Conformal Transformation of Kernel Functions: A Data-Dependent Way to Improve Support Vector Machine Classifiers. 
Neural Process. Lett., 15(1):59\u201367, 2002.\n", "award": [], "sourceid": 458, "authors": [{"given_name": "Kenji", "family_name": "Fukumizu", "institution": null}, {"given_name": "Arthur", "family_name": "Gretton", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "Bharath", "family_name": "Sriperumbudur", "institution": null}]}