{"title": "Rational Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 617, "page_last": 624, "abstract": null, "full_text": "Rational Kernels\n\nCorinna Cortes, Patrick Haffner, Mehryar Mohri\nAT&T Labs – Research\n180 Park Avenue, Florham Park, NJ 07932, USA\n{corinna, haffner, mohri}@research.att.com\n\nAbstract\n\nWe introduce a general family of kernels based on weighted transducers or rational relations, rational kernels, that can be used for analysis of variable-length sequences or more generally weighted automata, in applications such as computational biology or speech recognition. We show that rational kernels can be computed efficiently using a general algorithm of composition of weighted transducers and a general single-source shortest-distance algorithm. We also describe several general families of positive definite symmetric rational kernels. These general kernels can be combined with Support Vector Machines to form efficient and powerful techniques for spoken-dialog classification: highly complex kernels become easy to design and implement and lead to substantial improvements in the classification accuracy. We also show that the string kernels considered in applications to computational biology are all specific instances of rational kernels.\n\n1 Introduction\n\nIn many applications such as speech recognition and computational biology, the objects to study and classify are not just fixed-length vectors, but variable-length sequences, or even large sets of alternative sequences and their probabilities. Consider for example the problem that originally motivated the present work, that of classifying speech recognition outputs in a large spoken-dialog application. 
For a given speech utterance, the output of a large-vocabulary speech recognition system is a weighted automaton called a word lattice compactly representing the possible sentences and their respective probabilities based on the models used. Such lattices, while containing sometimes just a few thousand transitions, may contain hundreds of millions of paths each labeled with a distinct sentence.\nThe application of discriminant classification algorithms to word lattices, or more generally weighted automata, raises two issues: that of handling variable-length sequences, and that of applying a classifier to a distribution of alternative sequences. We describe a general technique that solves both of these problems.\nKernel methods are widely used in statistical learning techniques such as Support Vector Machines (SVMs) [18] due to their computational efficiency in high-dimensional feature spaces. This motivates the introduction and study of kernels for weighted automata. We present a general family of kernels based on weighted transducers or rational relations, rational kernels, which apply to weighted automata. We show that rational kernels can be computed efficiently using a general algorithm of composition of weighted transducers and a general single-source shortest-distance algorithm.\nWe also briefly describe some specific rational kernels and their applications to spoken-dialog classification. These kernels are symmetric and positive definite and can thus be combined with SVMs to form efficient and powerful classifiers. An important benefit of our approach is its generality and its simplicity: the same efficient algorithm can be used to compute arbitrarily complex rational kernels. This makes highly complex kernels easy to use and helps us achieve substantial improvements in classification accuracy.\n\n2 Weighted automata and transducers\n\nIn this section, we present the algebraic definitions and notation necessary to introduce rational kernels.\n\nDefinition 1 ([7]) A system (𝕂, ⊕, ⊗, 0̄, 1̄) is a semiring if: (𝕂, ⊕, 0̄) is a commutative monoid with identity element 0̄; (𝕂, ⊗, 1̄) is a monoid with identity element 1̄; ⊗ distributes over ⊕; and 0̄ is an annihilator for ⊗: for all a ∈ 𝕂, a ⊗ 0̄ = 0̄ ⊗ a = 0̄.\n\nThus, a semiring is a ring that may lack negation. Table 1 lists some familiar examples of semirings. In addition to the Boolean semiring and the probability semiring used to combine probabilities, two semirings often used in applications are the log semiring, which is isomorphic to the probability semiring via a log morphism, and the tropical semiring, which is derived from the log semiring using the Viterbi approximation.\n\nSEMIRING      SET               ⊕        ⊗    0̄     1̄\nBoolean       {0, 1}            ∨        ∧    0     1\nProbability   ℝ₊                +        ×    0     1\nLog           ℝ ∪ {−∞, +∞}      ⊕_log    +    +∞    0\nTropical      ℝ ∪ {−∞, +∞}      min      +    +∞    0\n\nTable 1: Semiring examples. ⊕_log is defined by: x ⊕_log y = −log(e^{−x} + e^{−y}).\n\nDefinition 2 A weighted finite-state transducer T over a semiring 𝕂 is an 8-tuple T = (Σ, Δ, Q, I, F, E, λ, ρ) where: Σ is the finite input alphabet of the transducer; Δ is the finite output alphabet; Q is a finite set of states; I ⊆ Q the set of initial states; F ⊆ Q the set of final states; E ⊆ Q × (Σ ∪ {ε}) × (Δ ∪ {ε}) × 𝕂 × Q a finite set of transitions; λ : I → 𝕂 the initial weight function; and ρ : F → 𝕂 the final weight function mapping F to 𝕂.\n\nWeighted automata can be formally defined in a similar way by simply omitting the input or the output labels.\nGiven a transition e ∈ E, we denote by p[e] its origin or previous state, by n[e] its destination state or next state, and by w[e] its weight. A path π = e_1 ⋯ e_k is an element of E* with consecutive transitions: n[e_{i−1}] = p[e_i], i = 2, …, k. We extend n and p to paths by setting: n[π] = n[e_k] and p[π] = p[e_1]. The weight function w can also be extended to paths by defining the weight of a path as the ⊗-product of the weights of its constituent transitions: w[π] = w[e_1] ⊗ ⋯ ⊗ w[e_k]. We denote by P(q, q′) the set of paths from q to q′ and by P(q, x, y, q′) the set of paths from q to q′ with input label x ∈ Σ* and output label y (transducer case). 
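As a minimal sketch of Definition 1, the semirings of Table 1 and the weight of a path can be written in a few lines of Python. The names Semiring, log_plus, path_weight and path_sum are illustrative assumptions and are not taken from the paper or from the FSM library; weights in the log and tropical semirings are assumed here to be negative log probabilities.

```python
import math
from dataclasses import dataclass
from functools import reduce
from typing import Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable    # semiring sum
    times: Callable   # semiring product
    zero: object      # identity for plus, annihilator for times
    one: object       # identity for times


def log_plus(x, y):
    # Log-semiring sum: -log(exp(-x) + exp(-y)), arranged to avoid overflow.
    if x == math.inf:
        return y
    if y == math.inf:
        return x
    if x == y:
        return x - math.log(2.0)
    return min(x, y) - math.log1p(math.exp(-abs(x - y)))


BOOLEAN = Semiring(lambda x, y: x or y, lambda x, y: x and y, False, True)
PROBABILITY = Semiring(lambda x, y: x + y, lambda x, y: x * y, 0.0, 1.0)
LOG = Semiring(log_plus, lambda x, y: x + y, math.inf, 0.0)
TROPICAL = Semiring(min, lambda x, y: x + y, math.inf, 0.0)


def path_weight(sr, weights):
    # Weight of one path: semiring product of its transition weights.
    return reduce(sr.times, weights, sr.one)


def path_sum(sr, paths):
    # Semiring sum of the weights of a collection of paths.
    return reduce(sr.plus, (path_weight(sr, p) for p in paths), sr.zero)


# Two hypothetical paths, each given by its transition probabilities.
paths = [[0.5, 0.4], [0.1, 0.3]]
total = path_sum(PROBABILITY, paths)   # 0.5*0.4 + 0.1*0.3
```

Mapping every probability p to −log p should turn the probability-semiring total into the log-semiring total, while the tropical semiring keeps only the weight of the best path; this is the sense in which the tropical semiring is a Viterbi approximation of the log semiring.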
These definitions can be extended to subsets R, R′ ⊆ Q, by: P(R, x, y, R′) = ∪_{q ∈ R, q′ ∈ R′} P(q, x, y, q′). A transducer T is regulated if the output weight associated by T to any pair of input-output strings (x, y) by:\n\n[[T]](x, y) = ⊕_{π ∈ P(I, x, y, F)} λ(p[π]) ⊗ w[π] ⊗ ρ(n[π])    (1)\n\nis well-defined and in 𝕂. [[T]](x, y) = 0̄ when P(I, x, y, F) = ∅. In the following, we will assume that all the transducers considered are regulated. Weighted transducers are closed under ⊕, ⊗ and Kleene-closure. In particular, the ⊕-sum and ⊗-multiplications of two transducers T_1 and T_2 are defined for each pair (x, y) by:\n\n[[T_1 ⊕ T_2]](x, y) = [[T_1]](x, y) ⊕ [[T_2]](x, y)    (2)\n[[T_1 ⊗ T_2]](x, y) = ⊕_{x = x_1 x_2, y = y_1 y_2} [[T_1]](x_1, y_1) ⊗ [[T_2]](x_2, y_2)    (3)\n\n3 Rational 
kernels\n\nThis section introduces rational kernels, presents a general algorithm for computing them efficiently and describes several examples of rational kernels.\n\n3.1 Definition\n\nDefinition 3 A kernel K is rational if there exist a weighted transducer T = (Σ, Δ, Q, I, F, E, λ, ρ) over the semiring 𝕂 and a function ψ : 𝕂 → ℝ such that for all x, y ∈ Σ*:\n\nK(x, y) = ψ([[T]](x, y))    (4)\n\nIn general, ψ is an arbitrary function mapping 𝕂 to ℝ. In some cases, it may be desirable to assume that it is a semiring morphism as in Section 3.6. It is often the identity function when 𝕂 = ℝ and may be a projection when the semiring 𝕂 is the cross-product of ℝ and another semiring (𝕂 = ℝ × 𝕂′).\nRational kernels can be naturally extended to kernels over weighted automata. In the following, to simplify the presentation, we will restrict ourselves to the case of acyclic weighted automata, which is the case of interest for our applications, but our results apply similarly to arbitrary weighted automata. Let A and B be two acyclic weighted automata over the semiring 𝕂, then K(A, B) is defined by:\n\nK(A, B) = ψ(⊕_{x, y} [[A]](x) ⊗ [[T]](x, y) ⊗ [[B]](y))    (5)\n\nMore generally, the results mentioned in the following for strings all apply similarly to acyclic weighted automata. Since the set of weighted transducers over a semiring 𝕂 is also closed under ⊕-sum and ⊗-product [2, 3], it follows that rational kernels over a semiring 𝕂 are closed under sum and product. We denote by K_1 + K_2 the sum and by K_1 · K_2 the product of two rational kernels K_1 and K_2. 
Let T_1 and T_2 be the associated transducers of these kernels, we have for example:\n\n(K_1 + K_2)(x, y) = ψ([[T_1 ⊕ T_2]](x, y)) = K_1(x, y) + K_2(x, y)    (6)\n\nIn learning techniques such as those based on SVMs, we are particularly interested in positive definite symmetric kernels, which guarantee the existence of a corresponding reproducing kernel Hilbert space. Not all rational kernels are positive definite symmetric but in the following sections we will describe some general classes of rational kernels that have this property.\nPositive definite symmetric kernels can be used to construct other families of kernels that also meet these conditions [17]. Polynomial kernels of degree p are formed from the expression (K + a)^p, and Gaussian kernels can be formed as exp(−d^2/σ^2) with d^2(x, y) = K(x, x) + K(y, y) − 2K(x, y). Since the class of symmetric positive definite kernels is closed under sum [1], the sum of two positive definite rational kernels is also a positive definite rational kernel.\nIn what follows, we will focus on the algorithm for computing rational kernels. The algorithm for computing K(x, y), or K(A, B), for any two acyclic weighted automata, is based on two general algorithms that we briefly present: composition of weighted transducers to combine A, T, and B, and a general shortest-distance algorithm in a semiring 𝕂 to compute the ⊕-sum of the weights of the successful paths of the combined machine.\n\n3.2 Composition of weighted transducers\n\nComposition is a fundamental operation on weighted transducers that can be used in many applications to create complex weighted transducers from simpler ones. Let 𝕂 be a commutative semiring and let T_1 and T_2 be two weighted transducers defined over 𝕂 such that the input alphabet of T_2 coincides with the output alphabet of T_1. Then, the composition of T_1 and T_2 is a weighted transducer T_1 ∘ T_2 which, when it is regulated, is defined for all x, y by [2, 3, 15, 7]:¹\n\n[[T_1 ∘ T_2]](x, y) = ⊕_z [[T_1]](x, z) ⊗ [[T_2]](z, y)    (7)\n\nFigure 1: (a) Weighted transducer T_1 over the log semiring. (b) Weighted transducer T_2 over the log semiring. (c) Construction of the result of composition T_1 ∘ T_2. Initial states are represented by bold circles, final states by double circles. Inside each circle, the first number indicates the state number, the second, at final states only, the value of the final weight function ρ at that state. Arrows represent transitions and are labeled with symbols followed by their corresponding weight.\n\nNote that a transducer can be viewed as a matrix over a countable set Σ* × Δ* and composition as the corresponding matrix-multiplication. There exists a general and efficient composition algorithm for weighted transducers which takes advantage of the sparsity of the input transducers [14, 12]. States in the composition T_1 ∘ T_2 of two weighted transducers T_1 and T_2 are identified with pairs of a state of T_1 and a state of T_2. 
Leaving aside transitions with ε inputs or outputs, the following rule specifies how to compute a transition of T_1 ∘ T_2 from appropriate transitions of T_1 and T_2:²\n\n(q_1, a, b, w_1, q_1′) and (q_2, b, c, w_2, q_2′) ⟹ ((q_1, q_2), a, c, w_1 ⊗ w_2, (q_1′, q_2′))    (8)\n\nIn the worst case, all transitions of T_1 leaving a state q_1 match all those of T_2 leaving state q_2, thus the space and time complexity of composition is quadratic: O((|Q_1| + |E_1|)(|Q_2| + |E_2|)). Fig.(1) (a)-(c) illustrate the algorithm when applied to the transducers of Fig.(1) (a)-(b) defined over the log semiring. The intersection of two weighted automata is a special case of composition. It corresponds to the case where the input and output label of each transition are identical.\n\n¹ We use a matrix notation for the definition of composition as opposed to a functional notation. This is a deliberate choice motivated by an improved readability in many applications.\n² See [14, 12] for a detailed presentation of the algorithm including the use of a transducer filter for dealing with ε-multiplicity in the case of non-idempotent semirings.\n\n3.3 Single-source shortest distance algorithm over a semiring\n\nGiven a weighted automaton or transducer M, the shortest-distance from state q to the set of final states F is defined as the ⊕-sum of all the paths from q to F:\n\nd[q] = ⊕_{π ∈ P(q, F)} w[π] ⊗ ρ(n[π])    (9)\n\nwhen this sum is well-defined and in 𝕂, which is always the case when the semiring is k-closed or when M is acyclic [11], the case of interest in what follows. There exists a general algorithm for computing the shortest-distance d[q] in linear time O(|Q| + (T_⊕ + T_⊗)|E|), where T_⊕ denotes the maximum time to compute ⊕ and T_⊗ the time to compute ⊗ [11]. The algorithm is a generalization of Lawler’s algorithm [8] to the case of an arbitrary semiring 𝕂. It is based on a generalized relaxation of the outgoing transitions of each state of M visited in reverse topological order [11].\n\nFigure 2: Weighted transducers associated to two rational kernels. (a) Edit-distance kernel. (b) Gappy n-gram count kernel, with n = 2.\n\n3.4 Algorithm\n\nLet K be a rational kernel and let T be the associated weighted transducer. Let A and B be two acyclic weighted automata. A and B may represent just two strings x, y ∈ Σ* or may be any other complex weighted acceptors. By definition of rational kernels (Eq.(5)) and the shortest-distance (Eq.(9)), K(A, B) can be computed by the following steps, using the shortest-distance algorithm described in the previous section:\n\n1. 
Constructing the acyclic composed transducer A ∘ T ∘ B.\n2. Computing d[A ∘ T ∘ B], the shortest-distance from the initial states of A ∘ T ∘ B to its final states.\n3. Computing ψ(d[A ∘ T ∘ B]).\n\nThus, the total complexity of the algorithm is O(|T||A||B| + Φ), where |T|, |A|, and |B| denote respectively the size of T, A and B, and Φ the worst case complexity of computing ψ(x), x ∈ 𝕂. If we assume that Φ can be computed in constant time as in many applications, then the complexity of the computation of K(A, B) is quadratic with respect to A and B and is: O(|T||A||B|).\n\n3.5 Edit-distance kernels\n\nRecently, several kernels, string kernels, have been introduced in computational biology for input vectors representing biological sequences [4, 19]. String kernels are specific instances of rational kernels. Fig.(2) (a) shows the weighted transducer over the tropical semiring associated to a classical type of string kernel. The kernel corresponds to an edit-distance based on a symbol substitution with cost 1, deletion with cost 2, and insertion of cost 3. All classical edit-distances can be represented by weighted transducers over the tropical semiring [13, 10]. The kernel computation algorithm just described can be used to compute efficiently the edit-distance of two strings or two sets of strings represented by automata.³\n\n³ We have proved and will present elsewhere a series of results related to kernels based on the notion of edit-distance. In particular, we have shown that the classical edit-distance d with equal costs for insertion, deletion and substitution is not negative definite [1] and that the Gaussian kernel exp(−t d^2) is not positive definite.\n\n3.6 Rational kernels of the type T ∘ T^{-1}\n\nThere exists a general method for constructing a positive definite and symmetric rational kernel from a weighted transducer T when ψ : 𝕂 → ℝ is a semiring morphism – this implies in particular that 𝕂 is commutative. Denote by T^{-1} the inverse of T, that is the transducer obtained from T by transposing the input and output labels of each transition. Then the composed transducer S = T ∘ T^{-1} is symmetric and, when it is regulated, defines a positive definite symmetric rational kernel K. Indeed, since ψ is a semiring morphism, by definition of composition:\n\nK(x, y) = ψ([[S]](x, y)) = Σ_z ψ([[T]](x, z)) · ψ([[T]](y, z))\n\nwhich shows that K is symmetric. For any non-negative integer n and for all x, y we define a symmetric kernel K_n by:\n\nK_n(x, y) = Σ_{|z| ≤ n} ψ([[T]](x, z)) · ψ([[T]](y, z))\n\nwhere the sum runs over all strings z of length less or equal to n. Let z_1, z_2, …, z_m be an arbitrary ordering of these strings. For any l ≥ 1 and any x_1, …, x_l ∈ Σ*, define the matrix M by M_{ij} = K_n(x_i, x_j). Then, M = A Aᵗ with A defined by A_{ij} = ψ([[T]](x_i, z_j)). Thus, the eigenvalues of M are all non-negative, which implies that K_n is positive definite [1]. Since K is a point-wise limit of K_n, K(x, y) = lim_{n→∞} K_n(x, y), K is also positive definite [1].\n\n4 Application to spoken-dialog classification\n\nRational kernels can be used in a variety of applications ranging from computational biology to optical character recognition. This section singles out one specific application, that of topic classification applied to the output of a speech recognizer. We will show how the use of weighted transducers rationalizes the design and optimization of kernels. Simple equations and graphs replace complex diagrams and intricate algorithms often used for the definition and analysis of string kernels.\nAs mentioned in the introduction, the output of a speech recognition system associated to a speech utterance is a weighted automaton called a word lattice representing a set of alternative sentences and their respective probabilities based on the models used. Rational kernels help address both the problem of handling variable-length sentences and that of applying a classification algorithm to such distributions of alternatives.\nThe traditional solution to sentence classification is the “bag-of-words” approach used in information retrieval. Because of the very large dimension of the input space, the use of large-margin classifiers such as SVMs [6] and AdaBoost [16] was found to be appropriate in such applications.\nOne approach adopted in various recent studies to measure the topic-similarity of two sentences consists of counting their common non-contiguous n-grams, i.e., their common substrings of n words with possible insertions. These n-grams can be extracted explicitly from each sentence [16] or matched implicitly through a string kernel [9]. We will show that such kernels are rational and will describe how they can be easily constructed and computed using the general algorithms given in the previous section. 
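To make the feature space of common non-contiguous n-grams concrete, here is a minimal sketch for n = 2 that enumerates gappy bigrams explicitly. It assumes that each skipped word multiplies the count by a decay factor lam, and it stands in for a word lattice with a plain list of (probability, sentence) pairs; all function names are hypothetical. This brute-force enumeration only illustrates the quantity being computed and is not the transducer-based algorithm of the paper.

```python
from collections import defaultdict


def gappy_bigram_counts(words, lam):
    # Every ordered pair (words[i], words[j]) with i < j is a gappy bigram;
    # each skipped word multiplies its count by the decay factor lam.
    counts = defaultdict(float)
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            counts[(words[i], words[j])] += lam ** (j - i - 1)
    return counts


def expected_counts(lattice, lam):
    # lattice: list of (probability, sentence) pairs, a stand-in for a
    # weighted automaton over the probability semiring.
    total = defaultdict(float)
    for prob, words in lattice:
        for bigram, c in gappy_bigram_counts(words, lam).items():
            total[bigram] += prob * c
    return total


def dot(c1, c2):
    # Inner product of two sparse count vectors.
    return sum(v * c2[b] for b, v in c1.items() if b in c2)


def gappy_bigram_kernel(s, t, lam=0.5):
    # Similarity of two sentences given as lists of words.
    return dot(gappy_bigram_counts(s, lam), gappy_bigram_counts(t, lam))


def lattice_kernel(lat_a, lat_b, lam=0.5):
    # Similarity of two distributions over sentences via expected counts.
    return dot(expected_counts(lat_a, lam), expected_counts(lat_b, lam))


s = 'a b a'.split()
t = 'a a'.split()
k_st = gappy_bigram_kernel(s, t)
k_lat = lattice_kernel([(0.7, s), (0.3, t)], [(1.0, t)])
```

Because the kernel is an inner product of count vectors, it is symmetric and satisfies the Cauchy-Schwarz inequality, which the checks on this toy input exercise.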
More generally, we will show how rational kernels can be used to compute the expected counts of common non-contiguous n-grams of two weighted automata and thus define the topic-similarity of two lattices. This will demonstrate the simplicity, power, and flexibility of our framework for the design of kernels.\n\n4.1 Application of T ∘ T^{-1} kernels\n\nConsider a word lattice L over the probability semiring. L can be viewed as a probability distribution P_L over all strings s ∈ Σ*. The expected count or number of occurrences of an n-gram sequence x in a string s for the probability distribution P_L is: Σ_s P_L(s) |s|_x, where |s|_x denotes the number of occurrences of x in s. It is easy to construct a weighted transducer T_n that outputs the set of n-grams of an input lattice with their corresponding expected counts. Fig.(3) (a) shows that transducer, when the alphabet is reduced to Σ = {a, b} and n = 2. Similarly, the transducer T_{n,λ} of Fig.(3) (b) can be used to output non-contiguous or gappy n-grams with their expected counts.⁴ Long gaps are penalized with a decay factor 0 ≤ λ < 1: a gap of length k reduces the count by λ^k. A transducer counting variable-length n-grams is obtained by simply taking the sum of these transducers: Σ_{k=1}^{n} T_{k,λ}.\n\nFigure 3: n-gram transducers (n = 2) defined over the probability semiring. (a) Bigram counter transducer T_2. (b) Gappy bigram counter T_{2,λ}.\n\nIn the remainder of this section, we will omit the subscripts n and λ since our results are independent of the choice of these parameters. Thus the topic-similarity of two strings or lattices A and B based on the expected counts of their common substrings is given by:\n\nK(A, B) = d[A ∘ (T ∘ T^{-1}) ∘ B]    (10)\n\nThe kernel K is of the type studied in section 3.6 and thus is symmetric and positive definite.\n\n⁴ The transducers shown in the figures of this section are all defined over the probability semiring, thus a transition corresponding to a gap in T_{n,λ} is weighted by λ.\n\n4.2 Computation\n\nThe specific form of the kernel K and the associativity of composition provide us with several alternatives for computing K(A, B).\nGeneral algorithm. We can use the general algorithm described in Section 3.4 to compute K(A, B) by precomputing the transducer T ∘ T^{-1}. Fig.(2)(b) shows the result of that composition in the case of gappy bigrams. Using that algorithm, the complexity of the computation of the kernel K(A, B) as described in the previous section is quadratic: O(|A||B|). This particular example has been treated by ad hoc algorithms with a similar complexity, but that only work with strings [9, 5] and not with weighted automata or lattices.\nOther factoring. Thanks to the associativity of composition, we can consider a different factoring of the composition cascade defining K(A, B):\n\nK(A, B) = d[(A ∘ T) ∘ (T^{-1} ∘ B)]    (11)\n\nThis factoring suggests computing A ∘ T and T^{-1} ∘ B first and then composing the resulting transducers rather than constructing T ∘ T^{-1}. The choice between the two methods does not affect the overall time complexity of the algorithm, but in practice one method may be preferable over the other. We are showing elsewhere that in the specific case of the counting transducers such as those described in previous sections, the kernel computation can in fact be performed in linear time, that is in O(|A| + |B|), in particular by using the notion of failure functions.\n\n4.3 Experimental results\n\nWe used the T ∘ T^{-1}-type kernel with SVMs for call-classification in the spoken language understanding (SLU) component of the AT&T How May I Help You natural dialog system. In this system, users ask questions about their bill or calling plans and the objective is to assign a class to each question out of a finite set of 38 classes made of call-types and named entities such as Billing Services, or Calling Plans.\nIn our experiments, we used 7,449 utterances as our training data and 2,228 utterances as our test data. The feature space corresponding to our lattice kernel is that of all possible trigrams over a vocabulary of 5,405 words. Training required just a few minutes on a single processor of a 1GHz Intel Pentium processor Linux cluster with 2GB of memory and 256 KB cache. 
The implementation took only a few hours and was entirely based on the FSM library. Compared to the standard approach of using trigram counts over the best recognized sentence, our experiments with a trigram rational kernel showed a [illegible] reduction in error rate at a [illegible] rejection level.\n\n5 Conclusion\n\nIn our classification experiments in spoken-dialog applications, we found rational kernels to be a very powerful exploration tool for constructing and generalizing highly efficient string and weighted automata kernels. In the design of learning machines such as SVMs, rational kernels give us access to the existing set of efficient and general weighted automata algorithms [13]. Prior knowledge about the task can be crafted into the kernel using graph editing tools or weighted regular expressions, in a way that is often more intuitive and easy to modify than complex matrices or formal algorithms.\n\nReferences\n[1] Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on Semigroups. Springer-Verlag: Berlin-New York, 1984.\n[2] Jean Berstel. Transductions and Context-Free Languages. Teubner Studienbucher: Stuttgart, 1979.\n[3] Samuel Eilenberg. Automata, Languages and Machines, volume A-B. Academic Press, 1974.\n[4] David Haussler. Convolution kernels on discrete structures. 
Technical Report UCSC-CRL-99-10, University of California at Santa Cruz, 1999.\n[5] Ralf Herbrich. Learning Kernel Classifiers. MIT Press, Cambridge, 2002.\n[6] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Proc. of ECML-98. Springer Verlag, 1998.\n[7] Werner Kuich and Arto Salomaa. Semirings, Automata, Languages. Number 5 in EATCS Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, Germany, 1986.\n[8] Eugene L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart, and Winston, 1976.\n[9] Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. Text classification using string kernels. In NIPS, pages 563–569, 2000.\n[10] Mehryar Mohri. Edit-Distance of Weighted Automata. In Jean-Marc Champarnaud and Denis Maurel, editor, Seventh International Conference, CIAA 2002, volume to appear of Lecture Notes in Computer Science, Tours, France, July 2002. Springer-Verlag, Berlin-NY.\n[11] Mehryar Mohri. Semiring Frameworks and Algorithms for Shortest-Distance Problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350, 2002.\n[12] Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Weighted automata in text and speech processing. In ECAI-96 Workshop, Budapest, Hungary. ECAI, 1996.\n[13] Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. The Design Principles of a Weighted Finite-State Transducer Library. Theoretical Computer Science, 231:17–32, January 2000. http://www.research.att.com/sw/tools/fsm.\n[14] Fernando C. N. Pereira and Michael D. Riley. Speech recognition by composition of weighted finite automata. In Emmanuel Roche and Yves Schabes, editors, Finite-State Language Processing, pages 431–453. MIT Press, Cambridge, Massachusetts, 1997.\n[15] Arto Salomaa and Matti Soittola. 
Automata-Theoretic Aspects of Formal Power Series. Springer-Verlag: New York, 1978.\n[16] Robert E. Schapire and Yoram Singer. Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.\n[17] Bernhard Scholkopf and Alex Smola. Learning with Kernels. MIT Press: Cambridge, MA, 2002.\n[18] Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New-York, 1998.\n[19] Chris Watkins. Dynamic alignment kernels. Technical Report CSD-TR-98-11, Royal Holloway, University of London, 1999.\n", "award": [], "sourceid": 2159, "authors": [{"given_name": "Corinna", "family_name": "Cortes", "institution": null}, {"given_name": "Patrick", "family_name": "Haffner", "institution": null}, {"given_name": "Mehryar", "family_name": "Mohri", "institution": null}]}