{"title": "Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 1177, "page_last": 1184, "abstract": null, "full_text": "Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees\n\nKonrad Rieck, Pavel Laskov, Sören Sonnenburg\nFraunhofer FIRST.IDA, Kekuléstr. 7, 12489 Berlin, Germany\nrieck@first.fhg.de, laskov@first.fhg.de, sonne@first.fhg.de\n\nAbstract\n\nWe propose a generic algorithm for computation of similarity measures for sequential data. The algorithm uses generalized suffix trees for efficient calculation of various kernel, distance and non-metric similarity functions. Its worst-case run-time is linear in the length of sequences and independent of the underlying embedding language, which can cover words, k-grams or all contained subsequences. Experiments with network intrusion detection, DNA analysis and text processing applications demonstrate the utility of distances and similarity coefficients for sequences as alternatives to classical kernel functions.\n\n1 Introduction\n\nThe ability to operate on sequential data is a vital prerequisite for application of machine learning techniques in many challenging domains. Examples of such applications are natural language processing (text documents), bioinformatics (DNA and protein sequences) and computer security (byte streams or system call traces). A key instrument for handling such data is the efficient computation of pairwise similarity between sequences. 
Similarity measures can be seen as an abstraction layer between the particular structure of the data and learning theory.\n\nOne of the most successful similarity measures thoroughly studied in recent years is the kernel function [e.g. 1–3]. Various kernels have been developed for sequential data, starting from the original ideas of Watkins [4] and Haussler [5] and extending to application-specific kernels such as the ones for text and natural language processing [e.g. 6–8], bioinformatics [e.g. 9–14], spam filtering [15] and computer security [e.g. 16; 17].\n\nAlthough kernel-based learning has gained a major focus in machine learning research, a kernel function is obviously only one of various possibilities for measuring similarity between objects. The choice of a similarity measure is essentially determined by (a) the understanding of a problem and (b) the properties of the learning algorithm to be applied. Some algorithms operate in vector spaces, others in inner-product, metric or even non-metric feature spaces. Investigation of techniques for learning in spaces other than an RKHS is currently one of the active research fields in machine learning [e.g. 18–21].\n\nThe focus of this contribution lies on general similarity measures for sequential data, especially on efficient algorithms for their computation. A large number of such similarity measures can be expressed in a generic form, so that a simple linear-time algorithm can be applied for computation of a wide class of similarity measures. This algorithm enables the investigation of alternative representations of problem domain knowledge other than kernel functions. As an example, two applications are presented for which replacement of a kernel – or equivalently, the Euclidean distance – with a different similarity measure yields a significant improvement of accuracy in an unsupervised learning scenario.\n\nThe rest of the paper is organized as follows. 
Section 2 provides a brief review of common similarity measures for sequential data and introduces a generic form in which a large variety of them can be cast. The generalized suffix tree and a corresponding algorithm for linear-time computation of similarity measures are presented in Section 3. Finally, the experiments in Section 4 demonstrate efficiency and utility of the proposed algorithm on real-world applications: network intrusion detection, DNA sequence analysis and text processing.\n\n2 Similarity measures for sequences\n\n2.1 Embedding of sequences\n\nA common way to define similarity measures for sequential data is via explicit embedding into a high-dimensional feature space. A sequence x is defined as a concatenation of symbols from a finite alphabet Σ. To model the content of a sequence, we consider a language L ⊆ Σ* comprising subsequences w ∈ L. We refer to these subsequences as words, even though they may not correspond to a natural language. Typical examples for L are a “bag of words” [e.g. 22], the set of all sequences of fixed length (k-grams or k-mers) [e.g. 10; 23] or the set of all contained subsequences [e.g. 8; 24]. Given a language L, a sequence x can be mapped into an |L|-dimensional feature space by calculating an embedding function φw(x) for every w ∈ L appearing in x. The function φw is defined as follows:\n\nφw : Σ* → R+ ∪ {0},   φw(x) := ψ(occ(w, x)) · Ww   (1)\n\nwhere occ(w, x) is the number of occurrences of w in x, ψ a numerical transformation, e.g. a conversion to frequencies, and Ww a weighting assigned to individual words, e.g. length-dependent or position-dependent weights [cf. 3; 24]. 
By employing the feature space induced through L and φ, one can adapt many vectorial similarity measures to operate on sequences.\n\nThe feature space defined via explicit embedding is sparse, since the number of non-zero dimensions for each feature vector is bounded by the sequence length. Thus the essential parameter for measuring complexity of computation is the sequence length, denoted hereinafter as n. Furthermore, the length of a word |w| – or, in case of a set of words, the maximum length – is denoted by k.\n\n2.2 Vectorial similarity measures\n\nSeveral vectorial kernel and distance functions can be applied to the proposed embedding of sequential data. A list of common functions in terms of L and φ is given in Table 1.\n\nKernel function k(x, y):\n  Linear       Σ_{w∈L} φw(x)φw(y)\n  Polynomial   (Σ_{w∈L} φw(x)φw(y) + θ)^d\n  RBF          exp(−d(x, y)² / σ)\n\nDistance function d(x, y):\n  Manhattan    Σ_{w∈L} |φw(x) − φw(y)|\n  Canberra     Σ_{w∈L} |φw(x) − φw(y)| / (φw(x) + φw(y))\n  Minkowski    (Σ_{w∈L} |φw(x) − φw(y)|^k)^{1/k}\n  Hamming      Σ_{w∈L} sgn |φw(x) − φw(y)|\n  Chebyshev    max_{w∈L} |φw(x) − φw(y)|\n\nTable 1: Kernels and distances for sequential data\n\nSimilarity coefficient s(x, y):\n  Simpson                       a / min(a + b, a + c)\n  Jaccard                       a / (a + b + c)\n  Braun-Blanquet                a / max(a + b, a + c)\n  Czekanowski, Sorensen-Dice    2a / (2a + b + c)\n  Sokal-Sneath, Anderberg       a / (a + 2(b + c))\n  Kulczynski (1st)              a / (b + c)\n  Kulczynski (2nd)              (a/(a + b) + a/(a + c)) / 2\n  Otsuka, Ochiai                a / √((a + b)(a + c))\n\nTable 2: Similarity coefficients for sequential data\n\nBeside kernel and distance functions, a set of rather exotic similarity coefficients is also suitable for application to sequential data [25]. 
The coefficients are constructed using three summation variables a, b and c, which in the case of binary vectors correspond to the number of matching component pairs (1-1), left mismatching pairs (0-1) and right mismatching pairs (1-0) [cf. 26; 27]. Common similarity coefficients are given in Table 2. For application to non-binary data these summation variables can be extended as proposed in [25]:\n\na = Σ_{w∈L} min(φw(x), φw(y))\nb = Σ_{w∈L} [φw(x) − min(φw(x), φw(y))]\nc = Σ_{w∈L} [φw(y) − min(φw(x), φw(y))]\n\n2.3 A generic representation\n\nOne can easily see that the presented similarity measures can be cast in a generic form that consists of an outer function ⊕ and an inner function m:\n\ns(x, y) = ⊕_{w∈L} m(φw(x), φw(y))   (2)\n\nGiven this definition, the kernel and distance functions presented in Table 1 can be re-formulated in terms of ⊕ and m. Adaptation of similarity coefficients to the generic form (2) involves a re-formulation of the summation variables a, b and c. The particular definitions of outer and inner functions for the presented similarity measures are given in Table 3. The polynomial and RBF kernels are not shown, since they can be expressed in terms of a linear kernel or a distance respectively.\n\nKernel function: ⊕, m(x, y)\n  Linear: +, x · y\n\nSimilarity coef.: ⊕, m(x, y)\n  Variable a: +, min(x, y)\n  Variable b: +, x − min(x, y)\n  Variable c: +, y − min(x, y)\n\nDistance function: ⊕, m(x, y)\n  Manhattan: +, |x − y|\n  Canberra: +, |x − y| / (x + y)\n  Minkowski^k: +, |x − y|^k\n  Hamming: +, sgn |x − y|\n  Chebyshev: max, |x − y|\n\nTable 3: Generalized formulation of similarity measures\n\n3 Generalized suffix trees for comparison of sequences\n\nThe key to efficient comparison of two sequences lies in considering only the minimum number of words necessary for computation of the generic form (2) of similarity measures. In the case of kernels only the intersection of words in both sequences needs to be considered, while the union of words is needed for calculating distances and non-metric similarity coefficients. A simple and well-known approach for such comparison is representing the words of each sequence in a sorted list. For words of maximum length k such a list can be constructed in O(kn log n) using general sorting or O(kn) using radix-sort. If the length of words k is unbounded, sorted lists are no longer an option, as the sorting time becomes quadratic.\n\nThus, special data structures are needed for efficient comparison of sequences. Two data structures previously used for computation of kernels are tries [28; 29] and suffix trees [30]. Both have been applied for computation of a variety of kernel functions in O(kn) [3; 10] and also in O(n) run-time using matching statistics [24]. In this contribution we will argue that a generalized suffix tree is suitable for computation of all similarity measures of the form (2) in O(n) run-time.\n\nA generalized suffix tree (GST) is a tree containing all suffixes of a set of strings x1, . . . , xl [31]. The simplest way to construct a generalized suffix tree is to extend each string xi with a delimiter $i and to apply a suffix tree construction algorithm [e.g. 
32] to the concatenation of strings x1$1 . . . xl$l. In the remaining part we will restrict ourselves to the case of two strings x and y delimited by # and $; computation of an entire similarity matrix using a single GST for a set of strings is a straightforward extension. An example of a generalized suffix tree for the strings “aab#” and “babab$” is shown in Fig. 1(a).\n\n[Figure 1: Generalized suffix tree for “aab#” and “babab$” and a snapshot of its traversal. (a) Generalized suffix tree (GST); (b) Traversal of a GST.]\n\nOnce a generalized suffix tree is constructed, it remains to determine the number of occurrences occ(w, x) and occ(w, y) of each word w present in the sequences x and y. Unlike the case for kernels, for which only nodes corresponding to both sequences need to be considered [24], the contributions must be correctly computed for all nodes in the generalized suffix tree. The following simple recursive algorithm computes a generic similarity measure between the sequences x and y in one depth-first traversal of the generalized suffix tree (cf. Algorithm 1).\n\nThe algorithm exploits the fact that a leaf in a GST representing a suffix of x contributes exactly 1 to occ(w, x) if w is a prefix of this suffix – and similarly for y and occ(w, y). As the GST contains all suffixes of x and y, every word w in x and y is represented by at least one leaf. Whether a leaf contributes to x or y can be determined by considering the edge at the leaf. 
Due to the uniqueness of the delimiter #, no branching nodes can occur below an edge containing #; thus a leaf node at an edge starting before the index of # must contain a suffix of x, otherwise it contains a suffix of y. The contributions of all leaves are aggregated in two variables x and y during a post-order traversal. At each node the inner function m of (2) is calculated using ψ(x) and ψ(y) according to the embedding φ in (1). A snapshot of the traversal procedure is illustrated in Fig. 1(b).\n\nTo account for implicit nodes along the edges of the GST and to support weighted embeddings φ, the weighting function WEIGHT introduced in [24] is employed. At a node v the function takes the beginning (begin[v]) and the end (end[v]) of the incoming edge and the depth of the node (depth[v]) as arguments to determine how much the node and edge contribute to the similarity measure; e.g. for k-gram models only nodes up to a path depth of k need to be considered.\n\nAlgorithm 1 Suffix tree comparison\n1: function COMPARE(x, y)\n2:   S ← SUFFIXTREE(x # y $)\n3:   (x, y, s) ← MATCH(root[S])\n4:   return s\n5:\n6: function MATCH(v)\n7:   if v is leaf then\n8:     s ← 0\n9:     if begin[v] ≤ index# then   ⊲ Leaf of a suffix of x\n10:      (x, y) ← (1, 0)\n11:      j ← index# − 1\n12:    else   ⊲ Leaf of a suffix of y\n13:      (x, y) ← (0, 1)\n14:      j ← index$ − 1\n15:  else\n16:    (x, y, s) ← (0, 0, 0)\n17:    j ← end[v]\n18:    for all c in children[v] do   ⊲ Traverse GST\n19:      (x̂, ŷ, ŝ) ← MATCH(c)\n20:      (x, y, s) ← (x + x̂, y + ŷ, s ⊕ ŝ)\n21:  W ← WEIGHT(begin[v], j, depth[v])\n22:  s ← s ⊕ m(ψ(x)W, ψ(y)W)\n23:  return (x, y, s)   ⊲ Cf. 
definitions in (1) and (2)\n\nSimilarly to the extension of string kernels proposed in [33], the GST traversal can be performed on an enhanced suffix array [34] for further run-time and space reduction.\n\nTo prove correctness of our algorithm, a different approach must be taken than the one in [24]. We cannot claim that the computed similarity value is equivalent to the one returned by the matching statistics algorithm, since the latter is restricted to kernel functions. Instead we show that at each recursive call to the MATCH function the correct numbers of occurrences are maintained.\n\nTheorem 1. A word w occurs occ(w, x) and occ(w, y) times in x and y if and only if MATCH(w̄) returns x = occ(w, x) and y = occ(w, y), where w̄ is the node at the end of the path from the root representing w in the generalized suffix tree of x and y.\n\nProof. If w occurs m times in x, there exist exactly m suffixes of x with w as prefix. Since w corresponds to a path from the root of the GST to a node w̄, all m suffixes must pass w̄. Due to the unique delimiter #, each suffix of x corresponds to one leaf node in the GST whose incoming edge contains #. Hence m equals occ(w, x) and is exactly the aggregated quantity x returned by MATCH(w̄). Likewise, occ(w, y) is the number of suffixes beginning after # and having a prefix w, which is computed by y.\n\n4 Experimental Results\n\n4.1 Run-time experiments\n\nIn order to illustrate the efficiency of the proposed algorithm, we conducted run-time experiments on three benchmark data sets for sequential data: network connection payloads from the DARPA 1999 IDS evaluation [35], news articles from the Reuters-21578 data set [36] and DNA sequences from the human genome [14]. 
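Before turning to the benchmark results, the quantities being measured can be made concrete with a small dictionary-based sketch of the generic form (2) over a k-gram embedding. This is a naive reference, not the suffix-tree algorithm of Section 3, and the function names are illustrative; it uses the identity transformation ψ and unit weights W:

```python
from collections import Counter

def kgram_counts(x, k):
    """Embedding for the k-gram language: occ(w, x) for every k-gram w of x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def manhattan(cx, cy):
    """Generic form (2): outer function '+', inner function |x - y|,
    evaluated over the union of words of both sequences."""
    return sum(abs(cx[w] - cy[w]) for w in set(cx) | set(cy))

def kulczynski2(cx, cy):
    """Second Kulczynski coefficient via the summation variables a, b, c."""
    words = set(cx) | set(cy)
    a = sum(min(cx[w], cy[w]) for w in words)          # matching counts
    b = sum(cx[w] - min(cx[w], cy[w]) for w in words)  # left mismatches
    c = sum(cy[w] - min(cx[w], cy[w]) for w in words)  # right mismatches
    return 0.5 * (a / (a + b) + a / (a + c))

cx, cy = kgram_counts("aab", 2), kgram_counts("babab", 2)
print(manhattan(cx, cy))  # union of 2-grams is {aa, ab, ba} -> 4
```

Note that this dictionary version costs O(kn) per comparison and stores all k-grams explicitly; the point of the GST algorithm is to obtain the same values in O(n), independent of k and of the chosen embedding language.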
Table 4 gives an overview of the data sets and their specific properties. We compared the run-time of the generalized suffix tree algorithm with a recent trie-based method supporting computation of distances. Tries yield better or equal run-time complexity for computation of similarity measures over k-grams than algorithms using indexed arrays and hash tables. A detailed description of the trie-based approach is given in [25]. Note that in all of the following experiments tries were generated in a pre-processing step and the reported run-time corresponds to the comparison procedure only.\n\nFor each of the three data sets, we implemented the following experimental protocol: the Manhattan distances were calculated for 1000 pairs of randomly selected sequences using k-grams as an embedding language.\n\nName | Type | Alphabet | Min. length | Max. length\nDNA | Human genome sequences | 4 | 2400 | 2400\nNIDS | TCP connection payloads | 108 | 53 | 132753\nTEXT | Reuters Newswire articles | 93 | 43 | 10002\n\nTable 4: Sequential data sets\n\nThe procedure was repeated 10 times for various values of k, and the run-time was averaged over all runs. Fig. 2 compares the run-time of sequence comparison algorithms using the generalized suffix trees and tries. On all three data sets the trie-based comparison has a low run-time for small values of k but grows linearly with k. The algorithm using a generalized suffix tree is independent of the complexity of the embedding language, although this comes at the price of higher constants due to a more complex data structure. 
It is obvious that a generalized suffix tree is the algorithm of choice for higher values of k.\n\n[Figure 2: Run-time performance for varying k-gram lengths. Manhattan distance run-time (mean run-time in seconds per 1000 comparisons) of trie and GST comparison for k-gram lengths 5–20 on (a) the NIDS, (b) the TEXT and (c) the DNA data set.]\n\n4.2 Applications\n\nAs a second part of our evaluation, we show that the ability of our approach to compute diverse similarity measures pays off when it comes to real applications, especially in an unsupervised learning scenario. The experiments were performed for (a) intrusion detection in real network traffic and (b) transcription start site (TSS) recognition in DNA sequences.\n\nFor the first application, network data was generated by members of our laboratory using virtual network servers. Recent attacks were injected by a penetration-testing expert. The distance-based anomaly detection method Zeta [17] was applied to 5-grams extracted from byte sequences of TCP connections using different similarity measures: the linear kernel, the Manhattan distance and the Kulczynski coefficient. The results on network data from the HTTP protocol are shown in Fig. 
3(a). Application of the Kulczynski coefficient yields the highest detection accuracy. Over 78% of all attacks are identified with no false-positives in an unsupervised setup. In comparison, the linear kernel yields roughly 30% lower detection rates.\n\nThe second application focused on TSS recognition in DNA sequences. The data set comprises fixed-length DNA sequences that either cover the TSS of protein coding genes or have been extracted randomly from the interior of genes [14]. We evaluated three methods on this data: an unsupervised k-nearest neighbor (kNN) classifier, a supervised and bagged kNN classifier and a Support Vector Machine (SVM). Each method was trained and tested using a linear kernel and the Manhattan distance as a similarity measure over 4-grams. Fig. 3(b) shows the performance achieved by the unsupervised and supervised versions of the kNN classifier.¹ Even though the linear kernel and the Manhattan distance yield similar accuracy in a supervised setup, their performance differs significantly in unsupervised application. In the absence of prior knowledge of labels the Manhattan distance expresses better discriminative properties for TSS recognition than the linear kernel. For the supervised application the classification performance is bounded for both similarity measures, since only some discriminative features for TSS recognition are encapsulated in n-gram models [14].\n\n¹Results for the SVM are similar to the supervised kNN and have been omitted.\n\n[Figure 3: Comparison of similarity measures on the network and DNA data. (a) ROC for intrusion detection in HTTP: true positive rate over false positive rate (0–0.01) for the Kulczynski coefficient, linear kernel and Manhattan distance with unsupervised kNN. (b) ROC for transcription site recognition: true positive rate over false positive rate (0–0.1) for the linear kernel and Manhattan distance with unsupervised and supervised kNN.]\n\n5 Conclusions\n\nKernel functions for sequences have recently gained strong attention in many applications of machine learning, especially in bioinformatics and natural language processing. In this contribution we have shown that other similarity measures such as metric distances or non-metric similarity coefficients can be computed with the same run-time complexity as kernel functions. The proposed algorithm is based on a post-order traversal of a generalized suffix tree of two or more sequences. During the traversal, the counts of matching and mismatching words from an embedding language are computed in time linear in the sequence length – regardless of the particular kind of chosen language: words, k-grams or even all consecutive subsequences. 
By using a generic representation of the considered similarity measures based on an outer and an inner function, the same algorithm can be applied for various kernel, distance and similarity functions on sequential data.\n\nOur experiments demonstrate that the use of general similarity measures can bring significant improvement to learning accuracy – in our case observed for unsupervised learning – and emphasize the importance of further investigation of distance- and similarity-based learning algorithms.\n\nAcknowledgments\n\nThe authors gratefully acknowledge the funding from Bundesministerium für Bildung und Forschung under the project MIND (FKZ 01-SC40A) and would like to thank Klaus-Robert Müller and Mikio Braun for fruitful discussions and support.\n\nReferences\n\n[1] V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.\n[2] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.\n[3] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.\n[4] C. Watkins. Dynamic alignment kernels. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39–50, Cambridge, MA, 2000. MIT Press.\n[5] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, July 1999.\n[6] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. Technical Report 23, LS VIII, University of Dortmund, 1997.\n[7] E. Leopold and J. Kindermann. Text categorization with Support Vector Machines: how to represent texts in input space? Machine Learning, 46:423–444, 2002.\n[8] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.\n[9] A. Zien, G. Rätsch, S. 
Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. Bioinformatics, 16(9):799–807, September 2000.\n[10] C. Leslie, E. Eskin, and W.S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proc. Pacific Symp. Biocomputing, pages 564–575, 2002.\n[11] C. Leslie, E. Eskin, A. Cohen, J. Weston, and W.S. Noble. Mismatch string kernel for discriminative protein classification. Bioinformatics, 1(1):1–10, 2003.\n[12] J. Rousu and J. Shawe-Taylor. Efficient computation of gapped substring kernels for large alphabets. Journal of Machine Learning Research, 6:1323–1344, 2005.\n[13] G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21:i369–i377, June 2005.\n[14] S. Sonnenburg, A. Zien, and G. Rätsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–e480, 2006.\n[15] H. Drucker, D. Wu, and V.N. Vapnik. Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5):1048–1054, 1999.\n[16] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: detecting intrusions in unlabeled data. In Applications of Data Mining in Computer Security. Kluwer, 2002.\n[17] K. Rieck and P. Laskov. Detecting unknown network attacks using language models. In Proc. DIMVA, pages 74–90, July 2006.\n[18] T. Graepel, R. Herbrich, P. Bollmann-Sdorra, and K. Obermayer. Classification on pairwise proximity data. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 438–444. MIT Press, 1999.\n[19] V. Roth, J. Laub, M. Kawanabe, and J.M. Buhmann. 
Optimal cluster preserving embedding of non-metric proximity data. IEEE Trans. PAMI, 25:1540–1551, December 2003.\n[20] J. Laub and K.-R. Müller. Feature discovery in non-metric pairwise data. Journal of Machine Learning Research, 5(Jul):801–818, July 2004.\n[21] C. Ong, X. Mary, S. Canu, and A.J. Smola. Learning with non-positive kernels. In Proc. ICML, pages 639–646, 2004.\n[22] G. Salton. Mathematics and information retrieval. Journal of Documentation, 35(1):1–29, 1979.\n[23] M. Damashek. Gauging similarity with n-grams: Language-independent categorization of text. Science, 267(5199):843–848, 1995.\n[24] S.V.N. Vishwanathan and A.J. Smola. Fast kernels for string and tree matching. In Kernels and Bioinformatics, pages 113–130. MIT Press, 2004.\n[25] K. Rieck, P. Laskov, and K.-R. Müller. Efficient algorithms for similarity measures over sequential data: A look beyond kernels. In Proc. DAGM, pages 374–383, September 2006.\n[26] R.R. Sokal and P.H. Sneath. Principles of Numerical Taxonomy. Freeman, San Francisco, CA, USA, 1963.\n[27] M.R. Anderberg. Cluster Analysis for Applications. Academic Press, Inc., New York, NY, USA, 1973.\n[28] E. Fredkin. Trie memory. Communications of the ACM, 3(9):490–499, 1960.\n[29] D. Knuth. The Art of Computer Programming, volume 3. Addison-Wesley, 1973.\n[30] P. Weiner. Linear pattern matching algorithms. In Proc. 14th Annual Symposium on Switching and Automata Theory, pages 1–11, 1973.\n[31] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.\n[32] E. Ukkonen. Online construction of suffix trees. Algorithmica, 14(3):249–260, 1995.\n[33] C.H. Teo and S.V.N. Vishwanathan. Fast and space efficient string kernels using suffix arrays. In Proceedings of the 23rd ICML, pages 929–936. ACM Press, 2006.\n[34] M.I. Abouelhoda, S. Kurtz, and E. Ohlebusch. 
Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1):53–86, 2002.\n[35] R. Lippmann, J.W. Haines, D.J. Fried, J. Korba, and K. Das. The 1999 DARPA off-line intrusion detection evaluation. Computer Networks, 34(4):579–595, 2000.\n[36] D.D. Lewis. Reuters-21578 text categorization test collection. AT&T Labs Research, 1997.\n", "award": [], "sourceid": 2999, "authors": [{"given_name": "Konrad", "family_name": "Rieck", "institution": null}, {"given_name": "Pavel", "family_name": "Laskov", "institution": null}, {"given_name": "S\u00f6ren", "family_name": "Sonnenburg", "institution": null}]}