{"title": "Semi-supervised Protein Classification Using Cluster Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 595, "page_last": 602, "abstract": "", "full_text": "Semi-supervised protein classi\ufb01cation using\n\ncluster kernels\n\nJason Weston(cid:3)\n\nChristina Leslie\n\nMax Planck Institute for Biological Cybernetics,\n\nDepartment of Computer Science,\n\n72076 T\u00a8ubingen, Germany\n\nColumbia University\n\nweston@tuebingen.mpg.de\n\ncleslie@cs.columbia.edu\n\nDengyong Zhou, Andre Elisseeff\n\nWilliam Stafford Noble\n\nMax Planck Institute for Biological Cybernetics,\n\nDepartment of Genome Sciences\n\n72076 T\u00a8ubingen, Germany\n\nUniversity of Washington\n\nzhou@tuebingen.mpg.de\n\nnoble@gs.washington.edu\n\nAbstract\n\nA key issue in supervised protein classi\ufb01cation is the representation of in-\nput sequences of amino acids. Recent work using string kernels for pro-\ntein data has achieved state-of-the-art classi\ufb01cation performance. How-\never, such representations are based only on labeled data \u2014 examples\nwith known 3D structures, organized into structural classes \u2014 while\nin practice, unlabeled data is far more plentiful. In this work, we de-\nvelop simple and scalable cluster kernel techniques for incorporating un-\nlabeled data into the representation of protein sequences. We show that\nour methods greatly improve the classi\ufb01cation performance of string ker-\nnels and outperform standard approaches for using unlabeled data, such\nas adding close homologs of the positive examples to the training data.\nWe achieve equal or superior performance to previously presented cluster\nkernel methods while achieving far greater computational ef\ufb01ciency.\n\n1\n\nIntroduction\n\nA central problem in computational biology is the classi\ufb01cation of proteins into functional\nand structural classes given their amino acid sequences. 
The 3D structure that a protein assumes after folding largely determines its function in the cell. However, it is far easier to determine experimentally the primary sequence of a protein than it is to solve the 3D structure. Through evolution, structure is more conserved than sequence, so that detecting even very subtle sequence similarities, or remote homology, is important for predicting function.\n\n*Supplemental information for the paper, including the data sets and Matlab source code, can be found on this author's web page at http://www.kyb.tuebingen.mpg.de/bs/people/weston/semiprot\n\nThe major methods for homology detection can be split into three basic groups: pairwise sequence comparison algorithms [1, 2], generative models for protein families [3, 4], and discriminative classifiers [5, 6, 7]. Popular sequence comparison methods such as BLAST and Smith-Waterman are based on unsupervised alignment scores. Generative models such as profile hidden Markov models (HMMs) model positive examples of a protein family, but they can be trained iteratively using both positively labeled and unlabeled examples by pulling in close homologs and adding them to the positive set. A compromise between these methods is PSI-BLAST [8], which uses BLAST to iteratively build a probabilistic profile of a query sequence and obtain a more sensitive sequence comparison score. Finally, classifiers such as SVMs use both positive and negative examples and provide state-of-the-art performance when used with appropriate kernels [5, 6, 7]. 
However, these classifiers still require an auxiliary method (such as PSI-BLAST) to handle unlabeled data: one generally adds predicted homologs of the positive training examples to the training set before training the classifier.\n\nIn practice, relatively little labeled data is available \u2014 approximately 30,000 proteins with known 3D structure, some belonging to families and superfamilies with only a handful of labeled members \u2014 whereas there are close to one million sequenced proteins, providing abundant unlabeled data. New semi-supervised learning techniques should be able to make better use of this unlabeled data.\n\nRecent work in semi-supervised learning has focused on changing the representation given to a classifier by taking into account the structure described by the unlabeled data [9, 10, 11]. These works can be viewed as cases of cluster kernels, which produce similarity metrics based on the cluster assumption: namely, two points in the same \u201ccluster\u201d or region of high density should have a small distance to each other. In this work, we investigate the use of cluster kernels for protein classification by developing two simple and scalable methods for modifying a base kernel. The neighborhood kernel uses averaging over a neighborhood of sequences defined by a local sequence similarity measure, and the bagged kernel uses bagged clustering of the full sequence data set to modify the base kernel. In both the semi-supervised and transductive settings, these techniques greatly improve classification performance when used with mismatch string kernels, and the techniques achieve equal or superior results to all previously presented cluster kernel methods that we tried. 
Moreover, the neighborhood and bagged kernel approaches are far more computationally efficient than these competing methods.\n\n2 Representations and kernels for protein sequences\n\nProteins can be represented as variable length sequences, typically several hundred characters long, from the alphabet of 20 amino acids. In order to use learning algorithms that require vector inputs, we must first find a suitable feature vector representation, mapping a sequence x into a vector space by x -> Phi(x). If we use kernel methods such as SVMs, which only need to compute inner products K(x, y) = <Phi(x), Phi(y)> for training and testing, then we can accomplish the above mapping using a kernel for sequence data.\n\nBiologically motivated sequence comparison scores, like Smith-Waterman or BLAST, provide an appealing representation of sequence data. The Smith-Waterman (SW) algorithm [2] uses dynamic programming to compute the optimal local gapped alignment score between two sequences, while BLAST [1] approximates SW by computing a heuristic alignment score. Both methods return empirically estimated E-values indicating the confidence of the score. These alignment-based scores do not define a positive definite kernel; however, one can use a feature representation based on the empirical kernel map\n\nPhi(x) = <d(x_1, x), ..., d(x_m, x)>\n\nwhere d(x, y) is the pairwise score (or E-value) between x and y, and the x_i, i = 1, ..., m, are the training sequences. Using SW E-values in this fashion gives strong classification performance [7]. Note, however, that the method is slow, both because computing each SW score is O(|x|^2) and because computing each empirically mapped kernel value is O(m).\n\nAnother appealing idea is to derive the feature representation from a generative model for a protein family. 
In the Fisher kernel method [5], one first builds a profile HMM for the positive training sequences, defining a log likelihood function log P(x|theta) for any protein sequence x. Then the gradient vector grad_theta log P(x|theta), evaluated at the maximum likelihood estimate theta_0 of the model parameters, defines an explicit vector of features, called Fisher scores, for x. This representation gives excellent classification results, but the Fisher scores must be computed by an O(|x|^2) forward-backward algorithm, making the kernel tractable but slow.\n\nIt is possible to construct useful kernels directly, without explicitly depending on generative models, by using string kernels. For example, the mismatch kernel [6] is defined by a histogram-like feature map that uses mismatches to capture inexact string matching. The feature space is indexed by all possible k-length subsequences alpha = a_1 a_2 ... a_k, where each a_i is a character in the alphabet A of amino acids. The feature map is defined on a k-gram alpha by Phi(alpha) = (phi_beta(alpha))_{beta in A^k}, where phi_beta(alpha) = 1 if alpha is within m mismatches of beta, and 0 otherwise; the map is extended additively to longer sequences: Phi(x) = sum over k-grams alpha in x of Phi(alpha). The mismatch kernel can be computed efficiently using a trie data structure: the complexity of calculating K(x, y) is O(c_K (|x| + |y|)), where c_K = k^{m+1} |A|^m. For typical kernel parameters k = 5 and m = 1 [6], the mismatch kernel is fast, scalable and yields impressive performance.\n\nMany other interesting models and examples of string kernels have recently been presented. A survey of related string kernel work is given in the longer version of this paper.\n\nString kernel methods with SVMs are a powerful approach to protein classification and have consistently performed better than non-discriminative techniques [5, 7, 6]. 
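As a concrete illustration, the mismatch feature map described above can be sketched by brute-force enumeration over A^k. This is only feasible for toy parameters (the paper's trie computation achieves O(c_K(|x| + |y|))); the function names and the small DNA-style alphabet are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

def mismatch_features(seq, k=3, m=1, alphabet="ACGT"):
    """Phi(x) = sum over k-grams alpha in x of Phi(alpha): each observed
    k-gram alpha adds 1 to coordinate beta whenever beta lies within m
    mismatches of alpha (brute force over all of A^k; toy use only)."""
    feats = {}
    for i in range(len(seq) - k + 1):
        alpha = seq[i:i + k]
        for beta in ("".join(b) for b in product(alphabet, repeat=k)):
            if sum(a != b for a, b in zip(alpha, beta)) <= m:
                feats[beta] = feats.get(beta, 0) + 1
    return feats

def mismatch_kernel(x, y, k=3, m=1, alphabet="ACGT"):
    """K(x, y) = <Phi(x), Phi(y)> as an explicit sparse dot product."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    return sum(v * fy.get(beta, 0) for beta, v in fx.items())
```

With k = 3, m = 1 over {A, C, G, T}, the single 3-gram "AAA" lies within one mismatch of 10 of the 64 possible 3-grams, so mismatch_kernel("AAA", "AAA") returns 10, while "AAA" and "CCC" share no such 3-grams and give 0.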
However, in a real-world setting, protein classifiers have access to unlabeled data. We now discuss how to incorporate such data into the representation given to SVMs via the use of cluster kernels.\n\n3 Cluster kernels for protein sequences\n\nIn semi-supervised learning, one tries to improve a classifier trained on labeled data by exploiting (a relatively large set of) unlabeled data. An extensive review of techniques can be found in [12]. It has been shown experimentally that under certain conditions, the decision function can be estimated more accurately in a semi-supervised setting, yielding lower generalization error. The most common assumption one makes in this setting is called the \u201ccluster assumption,\u201d namely that the class does not change in regions of high density.\n\nAlthough classifiers implement the cluster assumption in various ways, we focus on classifiers that re-represent the given data to reflect structure revealed by unlabeled data. The main idea is to change the distance metric so that the relative distance between two points is smaller if the points are in the same cluster. If one is using kernels, rather than explicit feature vectors, one can modify the kernel representation by constructing a cluster kernel. In [10], a general framework is presented for producing cluster kernels by modifying the eigenspectrum of the kernel matrix. Two of the main methods presented are the random walk kernel and the spectral clustering kernel.\n\nThe random walk kernel is a normalized and symmetrized version of a transition matrix corresponding to a t-step random walk. The random walk representation described in [11] interprets an RBF kernel as a transition matrix of a random walk on a graph with vertices x_i, with P(x_i -> x_j) = K_ij / sum_p K_ip. After t steps, the probability of going from a point x_i to a point x_j should be high if the points are in the same cluster. 
This transition probability can be calculated for the entire matrix as P^t = (D^{-1} K)^t, where D is a diagonal matrix such that D_ii = sum_p K_ip. To obtain a kernel, one performs the following steps. Compute L = D^{-1/2} K D^{-1/2} and its eigendecomposition L = U Lambda U^T. Let lambda_i <- lambda_i^t, where lambda_i = Lambda_ii, and let L~ = U Lambda~ U^T. Then the new kernel is K~ = D~^{1/2} L~ D~^{1/2}, where D~ is a diagonal matrix with D~_ii = 1 / L~_ii.\n\nThe spectral clustering kernel is a simple use of the representation derived from spectral clustering [13] using the first k eigenvectors. One computes the eigenvectors (v_1, ..., v_k) of D^{-1/2} K D^{-1/2}, with D defined as before, giving the representation phi(x_i)_p = v_pi. Each such vector can also then be normalized to have length 1. This approach has been shown to produce a well-clustered representation. While in spectral clustering one then performs k-means in this representation, here one simply gives the representation as input to a classifier.\n\nA serious problem with these methods is that one must diagonalize a matrix the size of the set of labeled and unlabeled data. Other methods of implementing the cluster assumption, such as transductive SVMs [14], also suffer from computational efficiency issues. A second drawback is that these kernels are better suited to a transductive setting (where one is given both the unlabeled and test points in advance) rather than a semi-supervised setting. In order to estimate the kernel for a sequence not present during training, one is forced to solve a difficult regression problem [10]. 
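The random walk kernel recipe above can be sketched in a few lines of numpy, assuming a precomputed RBF-style kernel matrix as input; the function name is illustrative, and real protein-scale use would run into exactly the diagonalization cost discussed here.

```python
import numpy as np

def random_walk_kernel(K, t=2):
    """t-step random walk kernel: form L = D^{-1/2} K D^{-1/2}, raise its
    eigenvalues to the power t, then rescale so that the resulting
    K~ = D~^{1/2} L~ D~^{1/2} has unit diagonal (D~_ii = 1 / L~_ii)."""
    d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))      # D_ii = sum_p K_ip
    L = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
    lam, U = np.linalg.eigh(L)                     # L = U diag(lam) U^T
    L_t = (U * lam ** t) @ U.T                     # lambda_i <- lambda_i^t
    d_tilde_sqrt = 1.0 / np.sqrt(np.diag(L_t))
    return d_tilde_sqrt[:, None] * L_t * d_tilde_sqrt[None, :]
```

On a toy RBF kernel over the 1-D points {0, 0.1, 5}, the resulting kernel keeps the two nearby points strongly connected while the isolated point remains nearly orthogonal to them.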
In the next two sections we will describe two simple methods to implement the cluster assumption that do not suffer from these issues.\n\n4 The neighborhood mismatch kernel\n\nIn most current learning applications for prediction of protein properties, such as prediction of three-state secondary structure, neural nets are trained on probabilistic profiles of a sequence window \u2014 a matrix of position-specific emission and gap probabilities \u2014 learned from a PSI-BLAST alignment rather than an encoding of the sequence itself. In this way, each input sequence is represented probabilistically by its \u201cneighborhood\u201d in a large sequence database, where PSI-BLAST neighbors are sequences that are closely related through evolution. We wish to transfer the notion of profiles to our mismatch representation of protein sequences.\n\nWe use a standard sequence similarity measure like BLAST or PSI-BLAST to define a neighborhood Nbd(x) for each input sequence x as the set of sequences x' with similarity score to x below a fixed E-value threshold, together with x itself. Now, given a fixed original feature representation, we represent x by the average of the feature vectors for members of its neighborhood: Phi_nbd(x) = (1/|Nbd(x)|) sum_{x' in Nbd(x)} Phi_orig(x'). The neighborhood kernel is then defined by:\n\nK_nbd(x, y) = (1/(|Nbd(x)| |Nbd(y)|)) sum_{x' in Nbd(x), y' in Nbd(y)} K_orig(x', y').\n\nWe will see in the experimental results that this simple neighborhood-averaging technique, used in a semi-supervised setting with the mismatch kernel, dramatically improves classification performance.\n\nTo see how the neighborhood approach fits with the cluster assumption, consider a set of points in feature space that form a \u201ccluster\u201d or dense region of the data set, and consider the region R formed by the union of the convex hulls of the neighborhood point sets. 
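Given a precomputed base kernel matrix over the training plus unlabeled sequences and a neighbor list per sequence (e.g. from E-value thresholding), the neighborhood kernel defined above reduces to block averaging of the base kernel matrix; a minimal sketch with illustrative names:

```python
import numpy as np

def neighborhood_kernel(K_orig, neighborhoods):
    """K_nbd(x_i, x_j) = average of K_orig over Nbd(x_i) x Nbd(x_j).
    neighborhoods[i] lists the indices of x_i's neighbors and is assumed
    to contain i itself, as in the definition above."""
    n = len(neighborhoods)
    K_nbd = np.zeros((n, n))
    for i, Ni in enumerate(neighborhoods):
        for j, Nj in enumerate(neighborhoods):
            # mean over the |Nbd(x_i)| x |Nbd(x_j)| block of K_orig
            K_nbd[i, j] = K_orig[np.ix_(Ni, Nj)].mean()
    return K_nbd
```

With singleton neighborhoods the base kernel is returned unchanged; larger neighborhoods smooth each sequence's kernel row toward that of its cluster.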
If the dissimilarity measure is a true distance, the neighborhood averaged vector Phi_nbd(x) stays inside the convex hull of the vectors in its neighborhood, and all the neighborhood vectors stay within region R. In general, the cluster contracts inside R under the averaging operation. Thus, under the new representation, different clusters can become better separated from each other.\n\n5 The bagged mismatch kernel\n\nThere exist a number of clustering techniques that are much more efficient than the methods mentioned in Section 3. For example, the classical k-means algorithm is O(rkmd), where m is the number of data points, d is their dimensionality, and r is the number of iterations required. Empirically, this running time grows sublinearly with k, m and d. In practice, it is computationally efficient even to run k-means multiple times, which can be useful since k-means can converge to local minima. We therefore consider the following method:\n\n1. Run k-means n times, giving cluster assignments c_p(x_i), p = 1, ..., n, for each point x_i.\n\n2. Build a bagged-clustering representation based upon the fraction of times that x_i and x_j are in the same cluster:\n\nK_bag(x_i, x_j) = (1/n) sum_p [c_p(x_i) = c_p(x_j)].\n\n3. Take the product between the original and bagged kernel:\n\nK(x_i, x_j) = K_orig(x_i, x_j) * K_bag(x_i, x_j).   (1)\n\nBecause k-means gives different solutions on each run, step (1) will give different results; for other clustering algorithms one could sub-sample the data instead. Step (2) is a valid kernel because it is the inner product in an nk-dimensional space Phi(x_i) = <[c_p(x_i) = q] : p = 1, ..., n, q = 1, ..., k>, and products of kernels as in step (3) are also valid kernels. The intuition behind the approach is that the original kernel is rescaled by the \u201cprobability\u201d that two points are in the same cluster, hence encoding the cluster assumption. 
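Steps (1)-(3) above can be sketched with a minimal Lloyd's k-means standing in for whatever clustering is used in practice; the function names, the fixed iteration count, and the toy setup are illustrative assumptions, not the experimental configuration of the paper.

```python
import numpy as np

def kmeans_labels(X, k, rng, iters=20):
    """One run of Lloyd's k-means; returns a cluster label per row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):  # guard against empty clusters
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def bagged_kernel(K_orig, X, k=2, n_runs=10, seed=0):
    """Run k-means n_runs times; K_bag(x_i, x_j) is the fraction of runs
    in which x_i and x_j share a cluster; return K_orig * K_bag, eq. (1)."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_labels(X, k, rng) for _ in range(n_runs)]
    K_bag = sum((c[:, None] == c[None, :]).astype(float) for c in runs) / n_runs
    return K_orig * K_bag
```

On two well-separated pairs of points, every run co-clusters each pair, so K_bag is block constant; on real data K_bag rescales K_orig by an empirical co-clustering probability.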
To estimate the kernel on a test sequence x in a semi-supervised setting, one can assign x to the nearest cluster in each of the bagged runs to compute K_bag(x, x_i). We apply the bagged kernel method with K_orig as the mismatch kernel and K_bag built using PSI-BLAST.\n\n6 Experiments\n\nWe measure the recognition performance of cluster kernel methods by testing their ability to classify protein domains into superfamilies in the Structural Classification of Proteins (SCOP) [15]. We use the same 54 target families and the same test and training set splits as in the remote homology experiments in [7]. The sequences are 7329 SCOP domains obtained from version 1.59 of the database after purging with astral.stanford.edu so that no pair of sequences share more than 95% identity. Compared to [7], we reduce the number of available labeled training patterns by roughly a third. Data set sequences that were neither in the training nor test sets for experiments from [7] are included as unlabeled data. All methods are evaluated using the receiver operating characteristic (ROC) score and the ROC-50, which is the ROC score computed only up to the first 50 false positives. More details concerning the experimental setup can be found at http://www1.cs.columbia.edu/compbio/svm-pairwise.\n\nIn all experiments, we use an SVM classifier with a small soft margin parameter, set as in [7]. The SVM computations are performed using the freely available Spider Matlab machine learning package, available at http://www.kyb.tuebingen.mpg.de/bs/people/spider. More information concerning the experiments, including data and source code scripts, can be found at http://www.kyb.tuebingen.mpg.de/bs/people/weston/semiprot.\n\nSemi-supervised setting. 
Our first experiment shows that the neighborhood mismatch kernel makes better use of unlabeled data than the baseline method of \u201cpulling in homologs\u201d prior to training the SVM classifier, that is, simply finding close homologs of the positive training examples in the unlabeled set and adding them to the positive training set for the SVM. Homologs come from the unlabeled set (not the test set), and \u201cneighbors\u201d for the neighborhood kernel come from the training plus unlabeled data. We compare the methods using the mismatch kernel representation with k = 5 and m = 1, as used in [6].\n\n[Figure 1: two plots comparing mismatch(5,1), mismatch(5,1)+homologs, and neighborhood mismatch(5,1). Left: number of families versus ROC-50, using PSI-BLAST for homologs and neighborhoods. Right: per-family scatter of mismatch(5,1)+homologs ROC-50 against neighborhood mismatch(5,1) ROC-50.]\n\nFigure 1: Comparison of protein representations and classifiers using unlabeled data. The mismatch kernel is used to represent proteins, with close homologs being pulled in from the unlabeled set with PSI-BLAST. Building a neighborhood with the neighborhood mismatch kernel improves over the baseline of pulling in homologs.\n\n                                   BLAST             PSI-BLAST\n                               ROC-50   ROC       ROC-50   ROC\nmismatch kernel                 0.416   0.870      0.416   0.870\nmismatch kernel + homologs      0.480   0.900      0.550   0.910\nneighborhood mismatch kernel    0.639   0.922      0.699   0.923\n\nTable 1: Mean ROC-50 and ROC scores over 54 target families for semi-supervised experiments, using BLAST and PSI-BLAST.\n\n
Homologs are chosen via PSI-BLAST as having a pairwise score (E-value) with any of the positive training samples less than 0.05, the default parameter setting [1]. The neighborhood mismatch kernel uses the same threshold to choose neighborhoods. For the neighborhood kernel, we normalize before and after the averaging operation via K_ij <- K_ij / sqrt(K_ii K_jj). The results are given in Figure 1 and Table 1. The former plots the number of families achieving a given ROC-50 score, and a strongly performing method thus produces a curve close to the top right of the plot. A signed rank test shows that the neighborhood mismatch kernel yields significant improvement over adding homologs (p-value 3.9e-05). Note that the PSI-BLAST scores in these experiments are built using the whole database of 7329 sequences (that is, test sequences in a given experiment are also available to the PSI-BLAST algorithm), so these results are slightly optimistic. However, the comparison of methods in a truly inductive setting using BLAST shows the same improvement of the neighborhood mismatch kernel over adding homologs (p-value 8.4e-05). Adding homologs to the (much larger) negative training set in addition to pulling in the positive homologs gives poorer performance than only adding the positive homologs (results not shown).\n\nTransductive setting. In the following experiments, we consider a transductive setting, in which the test points are given to the methods in advance as unlabeled data, giving slightly improved results over the last section. Although this setting is unrealistic for a real protein classification system, it more easily enables comparison with random walk and spectral clustering kernels, which do not easily work in another setting. In Figure 2 (left), we again show the mismatch kernel compared with pulling in homologs and the neighborhood kernel. 
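The normalization applied before and after neighborhood averaging is the standard unit-norm kernel normalization; a one-line sketch, with an illustrative function name:

```python
import numpy as np

def normalize_kernel(K):
    """K_ij <- K_ij / sqrt(K_ii * K_jj): each point gets unit norm in
    feature space, so kernel values become cosine similarities."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```

After normalization the diagonal is identically 1 and off-diagonal entries lie in [-1, 1] for any positive semidefinite K.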
This time we also compare with the bagged mismatch kernel using bagged k-means with k = 100 and n = 100 runs, which gave the best results. We found the method quite insensitive to k. The result for k = 400 is also given in Table 2.\n\n[Figure 2: two plots of number of families versus ROC-50. Left: mismatch kernel with PSI-BLAST distance, comparing mismatch(5,1), mismatch(5,1)+homologs, neighborhood mismatch(5,1), and bagged mismatch(5,1) with k = 100. Right: PSI-BLAST kernel with varying methods, comparing PSI-BLAST, PSI-BLAST + close homologs, spectral clustering with k = 100, and the random walk kernel with t = 2.]\n\nFigure 2: Comparison of protein representations and classifiers using unlabeled data in a transductive setting. Neighborhood and bagged mismatch kernels outperform pulling in close homologs (left) and equal or outperform previous semi-supervised methods (right).\n\n                                    ROC-50   ROC\nmismatch kernel                      0.416   0.875\nmismatch kernel + homologs           0.625   0.924\nneighborhood mismatch kernel         0.704   0.917\nbagged mismatch kernel (k = 100)     0.719   0.943\nbagged mismatch kernel (k = 400)     0.671   0.935\n\n                                    ROC-50   ROC\nPSI-BLAST kernel                     0.533   0.866\nPSI-BLAST + homologs kernel          0.585   0.873\nspectral clustering kernel           0.581   0.861\nrandom walk kernel                   0.691   0.915\ntransductive SVM                     0.637   0.874\n\nTable 2: Mean ROC-50 and ROC scores over 54 target families for transductive experiments.\n\nWe then compare these methods to using random walk and spectral clustering kernels. Neither method works well for the mismatch kernel (see online supplement), perhaps because the feature vectors are so orthogonal. However, for a PSI-BLAST representation via the empirical kernel map, the random walk kernel outperforms pulling in homologs. 
We take the empirical map with Phi(x) = <exp(-lambda d(x_1, x)), ..., exp(-lambda d(x_m, x))>, where the d(x, y) are PSI-BLAST E-values and lambda = 1/1000, which improves over a linear map. We report results for the best parameter choices, t = 2 for the random walk and k = 200 for spectral clustering. We found the latter quite brittle with respect to the parameter choice; results for other parameters can be found on the supplemental web site. For pulling in close homologs, we take the empirical kernel map only for points in the training set and the chosen close homologs. Finally, we also run transductive SVMs. The results are given in Table 2 and Figure 2 (right). A signed rank test (with adjusted p-value cut-off of 0.05) finds no significant difference between the neighborhood kernel, the bagged kernel (k = 100), and the random walk kernel in this transductive setting. Thus the new techniques are comparable with the random walk kernel, but are feasible to calculate on full scale problems.\n\n7 Discussion\n\nTwo of the most important issues in protein classification are representation of sequences and handling unlabeled data. Two developments in recent kernel methods research, string kernels and cluster kernels, address these issues separately. We have described two kernels \u2014 the neighborhood mismatch kernel and the bagged mismatch kernel \u2014 that combine both approaches and yield state-of-the-art performance in protein classification. Practical use of semi-supervised protein classification techniques requires computational efficiency. Many cluster kernels require diagonalization of the full labeled plus unlabeled data kernel matrix. The neighborhood and bagged kernel approaches, used with an efficient string kernel, are fast and scalable cluster kernels for sequence data. 
Moreover, these techniques can be applied to any problem with a meaningful local similarity measure or distance function.\n\nFuture work will deal with additional challenges of protein classification: addressing the full multi-class problem, which potentially involves thousands of classes; handling very small classes with few homologs; and dealing with missing classes, for which no labeled examples exist.\n\nAcknowledgments\n\nWe would like to thank Eleazar Eskin for discussions that contributed to the neighborhood kernel, and Olivier Chapelle and Navin Lal for their help with this work.\n\nReferences\n\n[1] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. A basic local alignment search tool. Journal of Molecular Biology, 215:403\u2013410, 1990.\n\n[2] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195\u2013197, 1981.\n\n[3] A. Krogh, M. Brown, I. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235:1501\u20131531, 1994.\n\n[4] J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard, and C. Chothia. Sequence comparisons using multiple sequences detect twice as many remote homologues as pairwise methods. Journal of Molecular Biology, 284(4):1201\u20131210, 1998.\n\n[5] T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 2000.\n\n[6] C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. Neural Information Processing Systems 15, 2002.\n\n[7] C. Liao and W. S. Noble. Combining pairwise sequence similarity and support vector machines for remote protein homology detection. Proceedings of RECOMB, 2002.\n\n[8] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 
Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389\u20133402, 1997.\n\n[9] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, CMU, 2002.\n\n[10] O. Chapelle, J. Weston, and B. Schoelkopf. Cluster kernels for semi-supervised learning. Neural Information Processing Systems 15, 2002.\n\n[11] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. Neural Information Processing Systems 14, 2001.\n\n[12] M. Seeger. Learning with labeled and unlabeled data. Technical report, University of Edinburgh, 2001.\n\n[13] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm. Neural Information Processing Systems 14, 2001.\n\n[14] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of ICML, 1999.\n\n[15] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536\u2013540, 1995.\n", "award": [], "sourceid": 2496, "authors": [{"given_name": "Jason", "family_name": "Weston", "institution": null}, {"given_name": "Dengyong", "family_name": "Zhou", "institution": null}, {"given_name": "Andr\u00e9", "family_name": "Elisseeff", "institution": null}, {"given_name": "William", "family_name": "Noble", "institution": null}, {"given_name": "Christina", "family_name": "Leslie", "institution": null}]}