{"title": "Binet-Cauchy Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 1441, "page_last": 1448, "abstract": null, "full_text": "Binet-Cauchy Kernels\n\nS.V.N. Vishwanathan, Alexander J. Smola\n\nNational ICT Australia, Machine Learning Program, Canberra, ACT 0200, Australia\n\n{SVN.Vishwanathan, Alex.Smola}@nicta.com.au\n\nAbstract\n\nWe propose a family of kernels based on the Binet-Cauchy theorem and its ex-\ntension to Fredholm operators. This includes as special cases all currently known\nkernels derived from the behavioral framework, diffusion processes, marginalized\nkernels, kernels on graphs, and the kernels on sets arising from the subspace angle\napproach. Many of these kernels can be seen as the extrema of a new continuum\nof kernel functions, which leads to numerous new special cases. As an application,\nwe apply the new class of kernels to the problem of clustering of video sequences\nwith encouraging results.\n\nIntroduction\n\n1\nRecent years have see a combinatorial explosion of results on kernels for structured and\nsemi-structured data, including trees, strings, graphs, transducers and dynamical systems\n[6, 8, 15, 13]. The fact that these kernels are very speci\ufb01c to the type of discrete data under\nconsideration is a major cause of confusion to the practitioner. What is required is a) an\nuni\ufb01ed view of the \ufb01eld and b) a recipe to design new kernels easily.\nThe present paper takes a step in this direction by unifying these diverse kernels by means\nof the Binet-Cauchy theorem. Our point of departure is the work of Wolf and Shashua [17],\nor more speci\ufb01cally, their proof that det A\u22a4B is a kernel on matrices A, B \u2208 Rm\u00d7n. We\nextend the results of [17] in the following three ways:\n1. There exists an operator-valued equivalent of the Binet-Cauchy theorem.\n2. Wolf and Shashua only exploit the Binet-Cauchy theorem for one particular choice of\nparameters. 
It turns out that the continuum of these values corresponds to a large class of kernels, some of which are well known and others which are novel.\n\n3. The Binet-Cauchy theorem can be extended to semirings. This points to a close connection with rational kernels [3].\n\nOutline of the paper: Section 2 contains the main result of the present paper: the definition of Binet-Cauchy kernels and their efficient computation. Subsequently, Section 3 discusses a number of special cases, which allows us to recover well known kernel functions. Section 4 applies our derivations to the analysis of video sequences, and we conclude with a discussion of our results.\n\n2 Binet-Cauchy Kernels\n\nIn this section we deal with linear mappings from X = Rn to Y = Rm (typically denoted by matrices), their coordinate-free extensions to Fredholm operators (here Rn and Rm are replaced by measurable sets), and their extensions to semirings (here addition and multiplication are replaced by an abstract class of symbols (\u2295, \u2297) with the same distributive properties).\n\n2.1 The General Composition Formula\n\nWe begin by defining compound matrices. They arise by picking subsets of entries of a matrix and computing their determinants.\n\nDefinition 1 (Compound Matrix) Let A \u2208 Rm\u00d7n, let q \u2264 min(m, n) and let I_q^n = {i = (i_1, i_2, . . . , i_q) : 1 \u2264 i_1 < . . . < i_q \u2264 n, i_j \u2208 N} and likewise I_q^m. Then the compound matrix of order q is defined as\n\n[Cq(A)]_{i,j} := det(A(i_k, j_l))_{k,l=1}^q where i \u2208 I_q^n and j \u2208 I_q^m.    (1)\n\nHere i, j are assumed to be arranged in lexicographical order.\n\nTheorem 2 (Binet-Cauchy) Let A \u2208 Rl\u00d7m and B \u2208 Rl\u00d7n. For q \u2264 min(m, n, l) we have Cq(A\u22a4B) = Cq(A)\u22a4Cq(B).\n\nWhen q = m = n = l we have Cq(A) = det(A) and the Binet-Cauchy theorem becomes the well known identity det(A\u22a4B) = det(A) det(B). 
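Theorem 2 is easy to check numerically. The following is a minimal sketch (assuming NumPy; the helper `compound` is ours, not from the paper) that implements Definition 1 and verifies the composition formula:

```python
# A minimal numerical check of Theorem 2 (Binet-Cauchy): Cq(A^T B) = Cq(A)^T Cq(B).
# Illustrative helper code assuming NumPy; not code from the paper.
from itertools import combinations
import numpy as np

def compound(A, q):
    # q-th compound matrix: determinants of all q x q submatrices (Definition 1),
    # with row/column index sets enumerated in lexicographic order.
    m, n = A.shape
    rows = list(combinations(range(m), q))
    cols = list(combinations(range(n), q))
    C = np.empty((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            C[i, j] = np.linalg.det(A[np.ix_(r, c)])
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
B = rng.standard_normal((5, 4))
q = 2
lhs = compound(A.T @ B, q)
rhs = compound(A, q).T @ compound(B, q)
assert np.allclose(lhs, rhs)
```

For q = 1 the helper returns the matrix itself, matching the remark that C1(A) = A.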
On the other hand, when q = 1 we have C1(A) = A, so Theorem 2 reduces to a tautology.\n\nTheorem 3 (Binet-Cauchy for Semirings) When the common semiring (R, +, \u00b7, 0, 1) is replaced by an abstract semiring (K, \u2295, \u2297, \u00af0, \u00af1) the equality Cq(A\u22a4B) = Cq(A)\u22a4Cq(B) still holds. Here all operations occur on the monoid K, addition and multiplication are replaced by \u2295, \u2297, and (\u00af0, \u00af1) take the role of (0, 1).\n\nA second extension of Theorem 2 is to replace matrices by Fredholm operators, as they can be expressed as integral operators with corresponding kernels. In this case, Theorem 2 becomes a statement about convolutions of integral kernels.\n\nDefinition 4 (Fredholm Operator) A Fredholm operator is a bounded linear operator between two Hilbert spaces with closed range and whose kernel and co-kernel are finite-dimensional.\n\nTheorem 5 (Kernel Representation of Fredholm Operators) Let A : L2(Y) \u2192 L2(X) and B : L2(Y) \u2192 L2(Z) be Fredholm operators. Then there exists some kA : X \u00d7 Y \u2192 R such that for all f \u2208 L2(Y) we have\n\n[Af](x) = \u222b_Y kA(x, y) f(y) dy.    (2)\n\nMoreover, for the composition A\u22a4B we have kA\u22a4B(x, z) = \u222b_Y kA\u22a4(x, y) kB(y, z) dy. Here the convolution of the kernels kA and kB plays the same role as matrix multiplication. To extend the Binet-Cauchy theorem we need to introduce the analog of compound matrices:\n\nDefinition 6 (Compound Kernel and Operator) Denote by X, Y ordered sets and let k : X \u00d7 Y \u2192 R. Define I_q^X = {x \u2208 Xq : x_1 \u2264 . . . \u2264 x_q} and likewise I_q^Y. Then the compound kernel of order q is defined as\n\nk[q](x, y) := det(k(x_k, y_l))_{k,l=1}^q where x \u2208 I_q^X and y \u2208 I_q^Y.    (3)\n\nIf k is the integral kernel of an operator A we define Cq(A) to be the integral operator corresponding to k[q].\n\nTheorem 7 (General Composition Formula [11]) Let X, Y, Z be ordered sets and let A : L2(Y) \u2192 L2(X), B : L2(Y) \u2192 L2(Z) be Fredholm operators. Then for q \u2208 N we have\n\nCq(A\u22a4B) = Cq(A)\u22a4Cq(B).    (4)\n\nTo recover Theorem 2 from Theorem 7 set X = [1..m], Y = [1..n] and Z = [1..l].\n\n2.2 Kernels\n\nThe key idea in turning the Binet-Cauchy theorem and its various incarnations into a kernel is to exploit the fact that tr A\u22a4B and det A\u22a4B are kernels on operators A, B. We extend this by replacing A\u22a4B with some functions \u03c8(A)\u22a4\u03c8(B) involving compound operators.\n\nTheorem 8 (Trace and Determinant Kernel) Let A, B : L2(X) \u2192 L2(Y) be Fredholm operators and let S : L2(Y) \u2192 L2(Y), T : L2(X) \u2192 L2(X) be positive trace-class operators. Then the following two kernels are well defined and they satisfy Mercer's condition:\n\nk(A, B) = tr[S A\u22a4 T B]    (5)\nk(A, B) = det[S A\u22a4 T B].    (6)\n\nNote that determinants are not defined in general for infinite dimensional operators, hence our restriction to Fredholm operators A, B in (6).\n\nProof Observe that S and T are positive and compact. Hence they admit decompositions S = V_S V_S\u22a4 and T = V_T\u22a4 V_T. By virtue of the commutativity of the trace we have that k(A, B) = tr([V_T A V_S]\u22a4 [V_T B V_S]). Analogously, using the Binet-Cauchy theorem, we can decompose the determinant. 
The remaining terms V_T A V_S and V_T B V_S are again Fredholm operators for which determinants are well defined.\n\nNext we use special choices of A, B, S, T involving compound operators directly to state the main theorem of our paper.\n\nTheorem 9 (Binet-Cauchy Kernel) Under the assumptions of Theorem 8 it follows that for all q \u2208 N the kernels k(A, B) = tr Cq[S A\u22a4 T B] and k(A, B) = det Cq[S A\u22a4 T B] satisfy Mercer's condition.\n\nProof We exploit the factorizations S = V_S V_S\u22a4, T = V_T\u22a4 V_T and apply Theorem 7. This yields Cq(S A\u22a4 T B) = Cq(V_T A V_S)\u22a4 Cq(V_T B V_S), which proves the theorem.\n\nFinally, we define a kernel based on the Fredholm determinant itself. It is essentially a weighted combination of Binet-Cauchy kernels. Fredholm determinants are defined as follows [11]:\n\nD(A, \u00b5) := \u2211_{q=1}^\u221e (\u00b5^q / q!) tr Cq(A).    (7)\n\nThis series converges for all \u00b5 \u2208 C and it is an entire function of \u00b5. It suggests a kernel involving weighted combinations of the kernels of Theorem 9. We have the following:\n\nCorollary 10 (Fredholm Kernel) Let A, B, S, T be as in Theorem 9 and let \u00b5 > 0. Then the following kernel satisfies Mercer's condition:\n\nk(A, B) := D(A\u22a4B, \u00b5).    (8)\n\nD(A\u22a4B, \u00b5) is a weighted combination of the kernels discussed above. The exponential down-weighting via 1/q! ensures that the series converges even in the case of exponential growth of the values of the compound kernel.\n\n2.3 Efficient Computation\n\nAt first glance, computing the kernels of Theorem 9 and Corollary 10 presents a formidable computational task even in the finite dimensional case. If A, B \u2208 Rm\u00d7n, the matrix Cq(A\u22a4B) has (n choose q) rows and columns, and each of its entries requires the computation of the determinant of a q-dimensional matrix. A brute-force approach would involve O(q^3 n^q) operations (assuming 2q \u2264 n). 
Clearly we need more efficient techniques.\n\nWhen computing determinants, we can take recourse to Franke's Theorem [7], which states that\n\ndet Cq(A) = (det A)^{(n\u22121 choose q\u22121)},    (9)\n\nand consequently k(A, B) = det Cq[S A\u22a4 T B] = (det[S A\u22a4 T B])^{(n\u22121 choose q\u22121)} (see footnote 1). This indicates that the determinant kernel may be of limited use, due to the typically quite high power in the exponent. Kernels building on tr Cq are not plagued by this problem and we give an efficient recursion below. It follows from the ANOVA kernel recursion of [1]:\n\nLemma 11 Denote by A \u2208 Cm\u00d7m a square matrix and let \u03bb_1, . . . , \u03bb_m be its eigenvalues. Then tr Cq(A) can be computed by the following recursion:\n\ntr Cq(A) = (1/q) \u2211_{j=1}^q (\u22121)^{j+1} \u00afC_{q\u2212j}(A) \u00afC_j(A) where \u00afC_q(A) = \u2211_{j=1}^m \u03bb_j^q.    (10)\n\nProof We begin by writing A in its Jordan normal form as A = P D P^{\u22121} where D is a block diagonal, upper triangular matrix. Furthermore, the diagonal elements of D consist of the eigenvalues of A. Repeated application of the Binet-Cauchy theorem yields\n\ntr Cq(A) = tr Cq(P) Cq(D) Cq(P^{\u22121}) = tr Cq(D) Cq(P^{\u22121}) Cq(P) = tr Cq(D).    (11)\n\nFor a triangular matrix the determinant is the product of its diagonal entries. Since all the square submatrices of D are also upper triangular, to construct tr(Cq(D)) we need to sum over all products of exactly q eigenvalues. This is analogous to the requirement of the ANOVA kernel of [1]. In its simplified version it can be written as (10), which completes the proof.\n\nWe can now compute the Jordan normal form of S A\u22a4 T B in O(n^3) time and apply Lemma 11 directly to it to compute the kernel value.\n\nFinally, in the case of Fredholm determinants, we can use the recursion directly, because for n-dimensional matrices the sum terminates after n terms. This is no more expensive than computing tr Cq directly. 
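For the finite-dimensional case, the recursion of Lemma 11 can be sketched as follows (an illustration assuming NumPy, with our own helper names; it cross-checks the recursion against a brute-force sum of principal q x q minors, which equals tr Cq(A)):

```python
# Sketch of the Lemma 11 recursion: tr Cq(A) from eigenvalue power sums,
# checked against a brute-force sum of principal q x q minors.
# Illustrative code assuming NumPy; not the authors' implementation.
from itertools import combinations
import numpy as np

def tr_compound_recursive(A, q):
    lam = np.linalg.eigvals(A)                        # eigenvalues (possibly complex)
    pbar = [np.sum(lam ** j) for j in range(q + 1)]   # bar-C_j(A) = sum_i lambda_i^j
    e = [1.0]                                         # e[r] = tr Cr(A); tr C0(A) = 1
    for r in range(1, q + 1):
        e.append(sum((-1) ** (j + 1) * e[r - j] * pbar[j]
                     for j in range(1, r + 1)) / r)
    return e[q].real                                  # real for real input matrices

def tr_compound_direct(A, q):
    # tr Cq(A) equals the sum of all principal q x q minors of A
    n = A.shape[0]
    return sum(np.linalg.det(A[np.ix_(idx, idx)]) for idx in combinations(range(n), q))

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
for q in (1, 2, 3):
    assert np.isclose(tr_compound_recursive(A, q), tr_compound_direct(A, q))
```

Once the eigenvalues are known, the recursion costs only O(q^2) operations, in contrast to the combinatorial cost of the brute-force check.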
Note that in the general nonsymmetric case (i.e. A \u2260 A\u22a4) no such efficient recursions are known.\n\nFootnote 1: Eq. (9) can be seen as follows: the compound matrix of an orthogonal matrix is orthogonal and consequently its determinant is unity. Subsequently use an SVD factorization of the argument of the compound operator to compute the determinant of the compound matrix of a diagonal matrix.\n\n3 Special Cases\n\nWe now focus our attention on various special cases to show how they fit into the general framework which we developed in the previous section. For this to succeed, we will map various systems such as graphs, dynamical systems, or video sequences into Fredholm operators. A suitable choice of this mapping and of the operators S, T of Theorem 9 will allow us to recover many well-known kernels as special cases.\n\n3.1 Dynamical Systems\n\nWe begin by describing a partially observable discrete time LTI (Linear Time Invariant) model commonly used in control theory. Its time-evolution equations are given by\n\nyt = P xt + wt where wt \u223c N(0, R)    (12a)\nxt = Q xt\u22121 + vt where vt \u223c N(0, S).    (12b)\n\nHere yt \u2208 Rm is observed, xt \u2208 Rn is the hidden or latent variable, and P \u2208 Rm\u00d7n, Q \u2208 Rn\u00d7n, R \u2208 Rm\u00d7m and S \u2208 Rn\u00d7n; moreover R, S \u2ab0 0. Typically m \u226b n. A similar model exists for continuous LTI systems. Further details on it can be found in [14].\n\nFollowing the behavioral framework of [16] we associate dynamical systems, X := (P, Q, R, S, x0), with their trajectories, that is, the set of yt with t \u2208 N for discrete time systems (and t \u2208 [0, \u221e) for the continuous-time case). These trajectories can be interpreted as linear operators mapping from Rm (the space of observations y) into the time domain (N or [0, \u221e)) and vice versa. 
The diagram below depicts this mapping:\n\nX \u2192 Traj(X) \u2192 Cq(Traj(X))\n\nFinally, Cq(Traj(X)) is weighted in a suitable fashion by operators S and T and the trace is evaluated. This yields an element from the family of Binet-Cauchy kernels.\n\nIn the following we discuss several kernels and we show that they differ essentially in how the mapping into a dynamical system occurs (discrete-time or continuous-time, fully observed or partial observations), whether any other preprocessing is carried out on Cq(Traj(X)) (such as the QR decomposition in the case of the kernel proposed by [10] and rediscovered by [17]), or which weighting S, T is chosen.\n\n3.2 Dynamical Systems Kernels\n\nWe begin with kernels on dynamical systems (12) as proposed in [14]. Set S = 1, q = 1 and T to be the diagonal operator with entries e^{\u2212\u03bbt}. In this case the Binet-Cauchy kernel between systems X and X\u2032 becomes\n\ntr Cq(S Traj(X) T Traj(X\u2032)\u22a4) = \u2211_{t=1}^\u221e e^{\u2212\u03bbt} y_t\u22a4 y\u2032_t.    (13)\n\nSince y_t, y\u2032_t are random variables, we also need to take expectations over w_t, v_t, w\u2032_t, v\u2032_t. Some tedious yet straightforward algebra [14] allows us to compute (13) as follows:\n\nk(X, X\u2032) = x_0\u22a4 M_1 x\u2032_0 + (1/(e^\u03bb \u2212 1)) tr[S M_2 + R],    (14)\n\nwhere M_1, M_2 satisfy the Sylvester equations:\n\nM_1 = e^{\u2212\u03bb} Q\u22a4 P\u22a4 P\u2032 Q\u2032 + e^{\u2212\u03bb} Q\u22a4 M_1 Q\u2032 and M_2 = P\u22a4 P\u2032 + e^{\u2212\u03bb} Q\u22a4 M_2 Q\u2032.    (15)\n\nSuch kernels can be computed in O(n^3) time [5]. Analogous expressions for continuous-time systems exist [14]. In Section 4 we will use this kernel to compute similarities between video sequences, after having encoded the latter as a dynamical system. 
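Each equation in (15) has the fixed-point form M = C + e^{-lambda} Q^T M Q', a discrete Sylvester (Stein-type) equation. As a hedged sketch (synthetic matrices, NumPy, our own helper names, and a kron-based vectorisation for clarity rather than the O(n^3) solver of [5]), it can be solved as follows:

```python
# Hedged sketch: solve M = C + exp(-lam) * Q.T @ M @ Q2, the common form of the
# two equations in (15), by vectorisation. All matrices are synthetic stand-ins.
import numpy as np

def solve_stein(C, Q, Q2, lam):
    # vec(Q.T M Q2) = kron(Q2.T, Q.T) vec(M) for column-major vec, so the
    # fixed-point equation becomes a single linear system in vec(M).
    n, m = C.shape
    c = np.exp(-lam)
    K = np.eye(n * m) - c * np.kron(Q2.T, Q.T)
    return np.linalg.solve(K, C.flatten(order='F')).reshape((n, m), order='F')

rng = np.random.default_rng(2)
n, lam = 4, 1.0
Q, Q2 = 0.5 * rng.standard_normal((2, n, n))    # stand-ins for Q, Q' of (12)
P, P2 = rng.standard_normal((2, n, n))          # stand-ins for P, P'
M2 = solve_stein(P.T @ P2, Q, Q2, lam)          # M_2 of (15); M_1 uses C = exp(-lam) Q.T P.T P' Q'
# the solution satisfies equation (15) for M_2
assert np.allclose(M2, P.T @ P2 + np.exp(-lam) * Q.T @ M2 @ Q2)
```

The kron-based solve costs O(n^6) and serves only to make the structure of (15) explicit; the solver of [5] achieves the O(n^3) cost quoted in the text.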
This will allow us to compare sequences of different length, as they are all mapped to dynamical systems in the first place.\n\n3.3 Martin Kernel\n\nA characteristic property of (14) is that it takes initial conditions of the dynamical system into account. If this is not desired, one may choose to pick only the subspace spanned by the trajectory yt. This is what was proposed in [10] (see footnote 2). More specifically, set S = T = 1, consider the trajectory up to only a finite number of time steps, say up to n, and let q = n. Furthermore let Traj(X) = Q_X R_X denote the QR-decomposition of Traj(X), where Q_X is an orthogonal matrix and R_X is upper triangular. Then it can be easily verified that the kernel proposed by [10] can be written as\n\nk(X, X\u2032) = tr Cq(S Q_X T Q_{X\u2032}\u22a4) = det(Q_X Q_{X\u2032}\u22a4).    (16)\n\nThis similarity measure has been used by Soatto, Doretto, and coworkers [4] for the analysis and computation of similarity in video sequences. Subsequently Wolf and Shashua [17] modified (16) to allow for kernels: to deal with determinants on a possibly infinite-dimensional feature space they simply project the trajectories on a reduced set of points in feature space (see footnote 3). This is what [17] refer to as a kernel on sets.\n\nFootnote 2: Martin [10] suggested the use of Cepstrum coefficients of a dynamical system to define a Euclidean metric. Later De Cock and De Moor [2] showed that this distance is, indeed, given by the computation of subspace angles, which can be achieved by computing the QR-decomposition.\n\nFootnote 3: To be precise, [17] are unaware of the work of [10] or of [2] and they rediscover the notion of subspace angles for the purpose of similarity measures.\n\n3.4 Graph Kernels\n\nYet another class of kernels can be seen to fall into this category: the graph kernels proposed in [6, 13, 9, 8]. Denote by G(V, E) a graph with vertices V and edges E. 
In some cases, such as in the analysis of molecules, the vertices will be equipped with labels L. For recovering these kernels from our framework we set q = 1 and systematically map graph kernels to dynamical systems.\n\nWe denote by xt a probability distribution over the set of vertices at time t. The time-evolution xt \u2192 xt+1 occurs by performing a random walk on the graph G(V, E). This yields xt+1 = W D^{\u22121} xt, where W is the connectivity matrix of the graph and D is a diagonal matrix where Dii denotes the outdegree of vertex i. For continuous-time systems one uses x(t) = exp(\u2212\u02dcLt) x(0), where \u02dcL is the normalized graph Laplacian [9].\n\nIn the graph kernels of [9, 13] one assumes that the variables xt are directly observed and no special mapping is required in order to obtain yt. Various choices of S and T yield the following kernels:\n\n\u2022 [9] consider a snapshot of the diffusion process at t = \u03c4. This amounts to choosing T = 1 and an S which is zero except for a diagonal entry at \u03c4.\n\n\u2022 The inverse Graph-Laplacian kernel proposed in [13] uses a weighted combination of diffusion processes and corresponds to choosing S = W for a diagonal weight matrix W.\n\n\u2022 The model proposed in [6] can be seen as using a partially observable model: rather than observing the states directly, we only observe the labels emitted at the states. If we associate this mapping from states to labels with the matrix P of (12), set T = 1 and let S be the projector on the first n time instances, we recover the kernels from [6].\n\nSo far, we deliberately made no distinction between kernels on graphs and kernels between graphs. This is for good reason: the trajectories depend on both initial conditions and the dynamical system itself. Consequently, whenever we want to consider kernels between initial conditions, we choose the same dynamical system in both cases. 
Conversely, whenever we want to consider kernels between dynamical systems, we average over initial conditions. This is what allows us to cover all the aforementioned kernels in one framework.\n\n3.5 Extensions\n\nObviously the aforementioned kernels are just specific instances of what is possible by using kernels of Theorem 9. While it is pretty much impossible to enumerate all combinations, we give a list of suggestions for possible kernels below:\n\n\u2022 Use the continuous-time diffusion process and a partially observable model. This would extend the diffusion kernels of [9] to comparisons between vertices of a labeled graph (e.g. atoms in a molecule).\n\n\u2022 Use diffusion processes to define similarity measures between graphs.\n\n\u2022 Compute the determinant of the trajectory associated with an n-step random walk on a graph, that is, use Cq with q = n instead of C1. This gives a kernel analogous to the one proposed by Wolf and Shashua [17], however without the whitening incurred by the QR factorization.\n\n\u2022 Take Fredholm determinants of the above mentioned trajectories.\n\n\u2022 Use a nonlinear version of the dynamical system as described in [14].\n\n4 Experiments\n\nTo test the utility of our kernels we applied them to the task of clustering short video clips. We randomly sampled 480 short clips from the movie Kill Bill and modeled them as linear ARMA models (see Section 3.1).\n\nThe sub-optimal procedure outlined in [4] was used for estimating the model parameters P, Q, R and S, and the kernels described in Section 3.2 were applied to these models. Locally Linear Embedding (LLE) [12] was used to cluster and embed the clips in two dimensions. The two dimensional embedding obtained by LLE is depicted in Figure 1. 
We randomly selected a few data points from Figure 1 and depict the first frame of the corresponding clips in Figure 2.\n\nObserve the linear cluster (with a projecting arm) in Figure 1. This corresponds to clips which are temporally close to each other and hence have similar dynamics. For instance, the clips in the far right depict a person rolling in the snow, while those in the far left corner depict a sword fight, while clips in the center involve conversations between two characters. A naive comparison of the intensity values or a dot product of the actual clips would not be able to extract such semantic information. Even though the camera angle varies with time, our kernel is able to successfully pick out the underlying dynamics of the scene. These experiments are encouraging and future work will concentrate on applying this to video sequence querying.\n\nFigure 1: LLE embeddings of 480 random clips from Kill Bill\n\nFigure 2: LLE embeddings of a subset of our dataset. A larger version is available from http://mlg.anu.edu.au/~vishy/papers/KillBill.png\n\n5 Discussion\n\nIn this paper, we introduced a unifying framework for defining kernels on discrete objects using the Binet-Cauchy theorem on compounds of Fredholm operators. We demonstrated that many of the previously known kernels can be explained neatly by our framework. In particular, many graph kernels and dynamical system related kernels fall out as natural special cases. The main advantage of our unifying framework is that it allows kernel engineers to use domain knowledge in a principled way to design kernels for solving real life problems.\n\nAcknowledgement We thank Stephane Canu and Ren\u00e9 Vidal for useful discussions. National ICT Australia is supported by the Australian Government's Program Backing Australia's Ability. 
This work was partly supported by grants of the Australian Research Council. This work was supported by the IST Programme of the European Community, under the Pascal Network of Excellence, IST-2002-506778.\n\nReferences\n\n[1] C. J. C. Burges and V. Vapnik. A new method for constructing artificial neural networks. Interim technical report, ONR contract N00014-94-c-0186, AT&T Bell Laboratories, 1995.\n\n[2] K. De Cock and B. De Moor. Subspace angles between ARMA models. Systems and Control Letters, 46:265\u2013270, 2002.\n\n[3] C. Cortes, P. Haffner, and M. Mohri. Rational kernels. In Proceedings of Neural Information Processing Systems 2002, 2002. In press.\n\n[4] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto. Dynamic textures. International Journal of Computer Vision, 51(2):91\u2013109, 2003.\n\n[5] J. D. Gardiner, A. L. Laub, J. J. Amato, and C. B. Moler. Solution of the Sylvester matrix equation AXB\u22a4 + CXD\u22a4 = E. ACM Transactions on Mathematical Software, 18(2):223\u2013231, 1992.\n\n[6] T. G\u00e4rtner, P. A. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In B. Sch\u00f6lkopf and M. Warmuth, editors, Sixteenth Annual Conference on Computational Learning Theory and Seventh Kernel Workshop, COLT. Springer, 2003.\n\n[7] W. Gr\u00f6bner. Matrizenrechnung. BI Hochschultaschenb\u00fccher, 1965.\n\n[8] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning (ICML), Washington, DC, United States, 2003.\n\n[9] I. R. Kondor and J. D. Lafferty. Diffusion kernels on graphs and other discrete structures. In Proceedings of the ICML, 2002.\n\n[10] R. J. Martin. A metric for ARMA processes. IEEE Transactions on Signal Processing, 48(4):1164\u20131170, 2000.\n\n[11] A. Pinkus. Spectral properties of totally positive kernels and matrices. In M. Gasca and C. A. 
Micchelli, editors, Total Positivity and its Applications, volume 359 of Mathematics and its Applications, pages 1\u201335. Kluwer, March 1996.\n\n[12] S. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323\u20132326, December 2000.\n\n[13] A. J. Smola and I. R. Kondor. Kernels and regularization on graphs. In B. Sch\u00f6lkopf and M. K. Warmuth, editors, Proceedings of the Annual Conference on Computational Learning Theory, Lecture Notes in Computer Science. Springer, 2003.\n\n[14] A. J. Smola, R. Vidal, and S. V. N. Vishwanathan. Kernels and dynamical systems. Automatica, 2004. Submitted.\n\n[15] S. V. N. Vishwanathan and A. J. Smola. Fast kernels on strings and trees. In Proceedings of Neural Information Processing Systems 2002, 2002.\n\n[16] J. C. Willems. From time series to linear system. I. Finite-dimensional linear time invariant systems. Automatica J. IFAC, 22(5):561\u2013580, 1986.\n\n[17] L. Wolf and A. Shashua. Learning over sets using kernel principal angles. Journal of Machine Learning Research, 4:913\u2013931, 2003.\n", "award": [], "sourceid": 2717, "authors": [{"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "S.v.n.", "family_name": "Vishwanathan", "institution": null}]}