{"title": "Cluster Kernels for Semi-Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 601, "page_last": 608, "abstract": null, "full_text": "Cluster Kernels for \n\nSemi-Supervised Learning \n\nOlivier Chapelle, Jason Weston, Bernhard Scholkopf \n\nMax Planck Institute for Biological Cybernetics, 72076 Tiibingen, Germany \n\n{first. last} @tuebingen.mpg.de \n\nAbstract \n\nWe propose a framework to incorporate unlabeled data in kernel \nclassifier, based on the idea that two points in the same cluster are \nmore likely to have the same label. This is achieved by modifying \nthe eigenspectrum of the kernel matrix. Experimental results assess \nthe validity of this approach. \n\n1 \n\nIntroduction \n\nWe consider the problem of semi-supervised learning, where one has usually few \nlabeled examples and a lot of unlabeled examples. One of the first semi-supervised \nalgorithms [1] was applied to web page classification. This is a typical example \nwhere the number of unlabeled examples can be made as large as possible since \nthere are billions of web page, but labeling is expensive since it requires human \nintervention. Since then, there has been a lot of interest for this paradigm in the \nmachine learning community; an extensive review of existing techniques can be \nfound in [10]. \nIt has been shown experimentally that under certain conditions, the decision func(cid:173)\ntion can be estimated more accurately, yielding lower generalization error [1, 4, 6] . \nHowever, in a discriminative framework, it is not obvious to determine how unla(cid:173)\nbeled data or even the perfect knowledge of the input distribution P(x) can help in \nthe estimation of the decision function. Without any assumption, it turns out that \nthis information is actually useless [10]. \n\nThus, to make use of unlabeled data, one needs to formulate assumptions. One \nwhich is made, explicitly or implicitly, by most of the semi-supervised learning \nalgorithms is the so-called \"cluster assumption\" saying that two points are likely to \nhave the same class label if there is a path connecting them passing through regions \nof high density only. Another way of stating this assumption is to say that the \ndecision boundary should lie in regions of low density. In real world problems, this \nmakes sense: let us consider handwritten digit recognition and suppose one tries to \nclassify digits 0 from 1. The probability of having a digit which in between a 0 and \n1 is very low. \n\nIn this article, we will show how to design kernels which implement the cluster \nassumption, i.e. kernels such that the induced distance is small for points in the \nsame cluster and larger for points in different clusters. \n\n\f' :.. + .... . \n\n+ \n\nFigure 1: Decision function obtained by an SVM with the kernel (1). On this \ntoy problem, this kernel implements perfectly the cluster assumption: the decision \nfunction cuts a cluster only when necessary. \n\n2 Kernels implementing the cluster assumption \n\nIn this section, we explore different ideas on how to build kernels which take into \naccount the fact that the data is clustered. In section 3, we will propose a framework \nwhich unifies the methods proposed in [11] and [5]. \n\n2.1 Kernels from mixture models \n\nIt is possible to design directly a kernel taking into account the generative model \nlearned from the unlabeled data. Seeger [9] derived such a kernel in a Bayesian \nsetting. 
2.2 Random walk kernel

The kernels presented in the previous section have the drawback of depending on a generative model: first, they require an unsupervised learning step, but, more importantly, in many real-world problems they cannot model the input distribution with sufficient accuracy. When applying the mixture-of-Gaussians method presented above to real-world problems, one cannot expect the "ideal" result of figure 1.

For this reason, in clustering and semi-supervised learning, there has been a lot of interest in algorithms which do not depend on a generative model. We present two of them, show how they are related, and introduce a kernel which extends them. The first is the random walk representation proposed in [11]. The main idea is to compute the RBF kernel matrix on the labeled and unlabeled points, $K_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$, and to interpret it as the transition matrix of a random walk on a graph with vertices $x_i$:

$$P(x_i \to x_j) = \frac{K_{ij}}{\sum_p K_{ip}}.$$

After t steps (where t is a parameter to be determined), the probability of going from a point $x_i$ to a point $x_j$ should be high if both points belong to the same cluster, and should stay low if they are in two different clusters.

Let D be the diagonal matrix whose elements are $D_{ii} = \sum_j K_{ij}$. The one-step transition matrix is $D^{-1}K$, and after t steps it is $P^t = (D^{-1}K)^t$. In [11], the authors design a classifier which uses these transition probabilities directly. One would be tempted to use $P^t$ as a kernel matrix for an SVM classifier, but this is not possible, since $P^t$ is not even symmetric. We will see in section 3 how a modified version of $P^t$ can be used as a kernel.
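For concreteness, the computation of $P^t$ can be sketched in a few lines (the function and variable names are ours); the returned matrix is precisely the object that cannot be used directly as a kernel:

    import numpy as np

    def t_step_transition(X, sigma, t):
        """Sketch of section 2.2: P^t = (D^{-1} K)^t for an RBF affinity."""
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-sq_dists / (2 * sigma ** 2))   # K_ij on all points
        P = K / K.sum(axis=1, keepdims=True)       # one step: D^{-1} K
        return np.linalg.matrix_power(P, t)        # t steps; not symmetric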
2.3 Kernel induced by a clustered representation

Another idea to implement the cluster assumption is to change the representation of the input points such that points in the same cluster are grouped together in the new representation. For this purpose, one can use the tools of spectral clustering (see [13] for a review). Using the first eigenvectors of a similarity matrix, a representation in which the points are naturally well clustered was recently presented in [5]. We suggest training a discriminative learning algorithm in this representation. The algorithm, which resembles kernel PCA [8], is the following:

1. Compute the affinity matrix K, which is an RBF kernel matrix but with diagonal elements equal to 0 instead of 1.

2. Let D be a diagonal matrix whose diagonal elements are the sums of the rows (or the columns) of K, and construct the matrix $L = D^{-1/2} K D^{-1/2}$.

3. Find the eigenvectors $(v_1, \ldots, v_k)$ of L corresponding to the first k eigenvalues.

4. The new representation of the point $x_i$ is $(v_{i1}, \ldots, v_{ik})$, normalized to have length one: $\varphi(x_i)_p = v_{ip} \,/\, (\sum_{j=1}^{k} v_{ij}^2)^{1/2}$.

The reason for considering the first eigenvectors of the affinity matrix is the following. Suppose there are k clusters in the dataset, infinitely far apart from each other. One can show that in this case the first k eigenvalues of the affinity matrix are 1, and the (k+1)-st eigenvalue is strictly less than 1 [5]. The size of this gap depends on how well connected each cluster is: the better connected, the larger the gap (the smaller the (k+1)-st eigenvalue). Moreover, in the new representation in $\mathbb{R}^k$, there are k vectors $z_1, \ldots, z_k$, orthonormal to each other, such that each training point is mapped to one of them, depending on the cluster it belongs to.

This simple example shows that in this new representation the points are naturally clustered, and we suggest training a linear classifier on the mapped points.
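A minimal sketch of steps 1-4, assuming an RBF affinity (the function and variable names are ours):

    import numpy as np

    def spectral_representation(X, sigma, k):
        """Sketch of the clustered representation of section 2.3."""
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-sq_dists / (2 * sigma ** 2))
        np.fill_diagonal(K, 0.0)                  # step 1: zero diagonal
        d = K.sum(axis=1)
        L = K / np.sqrt(np.outer(d, d))           # step 2: D^{-1/2} K D^{-1/2}
        w, V = np.linalg.eigh(L)                  # eigenvalues, ascending order
        V = V[:, np.argsort(w)[::-1][:k]]         # step 3: first k eigenvectors
        return V / np.linalg.norm(V, axis=1, keepdims=True)  # step 4: normalize

    # A linear classifier trained on the rows of the returned matrix implements
    # the suggestion above.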
3 Extension of the cluster kernel

Based on the ideas of the previous section, we propose the following algorithm:

1. As before, compute the RBF matrix K from both labeled and unlabeled points (this time with 1 on the diagonal, not 0) and D, the diagonal matrix whose elements are the sums of the rows of K.

2. Compute $L = D^{-1/2} K D^{-1/2}$ and its eigendecomposition $L = U \Lambda U^\top$.

3. Given a transfer function $\varphi$, let $\tilde\lambda_i = \varphi(\lambda_i)$, where the $\lambda_i$ are the eigenvalues of L, and construct $\tilde L = U \tilde\Lambda U^\top$.

4. Let $\tilde D$ be the diagonal matrix with $\tilde D_{ii} = 1/\tilde L_{ii}$ and compute $\tilde K = \tilde D^{1/2} \tilde L \tilde D^{1/2}$.

The last step is a normalization which ensures that $\tilde K_{ii} = 1$, as for an RBF kernel. The choice of the transfer function shows how this framework unifies the previous methods: for a polynomial transfer function $\varphi(\lambda) = \lambda^t$, we get $\tilde L = L^t = D^{1/2} (D^{-1} K)^t D^{-1/2}$, a symmetrized version of the t-step transition matrix $P^t$ of section 2.2, which can therefore be used as a kernel; for a step transfer function which keeps the first k eigenvalues and sets the others to 0, the new kernel is the dot product in the normalized clustered representation of section 2.3.

So far, the new kernel $\tilde K$ is only defined on the labeled and unlabeled training points. To extend it to a test point x, suppose $\Phi$ is the feature map corresponding to K, i.e., $K(x, x') = \Phi(x) \cdot \Phi(x')$, and let $v$ be the vector of dot products with the training points, $v_i = K(x, x_i)$. Writing $\Phi(x)$ as a linear combination of the mapped training points, $\Phi(x) = \sum_i \alpha_i \Phi(x_i)$ with $\alpha = K^{-1} v$ (the matrix K is always invertible since we consider an RBF kernel), the new dot product between the test point x and the other points is expressed as the same linear combination of the new dot products:

$$\tilde K(x, x_i) = (\tilde K \alpha)_i = (\tilde K K^{-1} v)_i.$$

Note that for a linear transfer function, $\tilde K = K$, and the new dot product is the standard one.
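The whole construction fits in a few lines of numpy. Below is a sketch operating on a precomputed Gram matrix (the helper name and the convention of passing the transfer function as a callable are ours):

    import numpy as np

    def cluster_kernel(K, transfer):
        """Sketch of steps 1-4 of section 3; `transfer` maps the vector of
        eigenvalues of L to the modified eigenvalues."""
        d = K.sum(axis=1)
        L = K / np.sqrt(np.outer(d, d))       # step 2: L = D^{-1/2} K D^{-1/2}
        lam, U = np.linalg.eigh(L)            # L = U diag(lam) U^T
        L_new = (U * transfer(lam)) @ U.T     # step 3: U diag(phi(lam)) U^T
        d_new = 1.0 / np.diag(L_new)          # step 4: D~_ii = 1 / L~_ii
        return np.sqrt(np.outer(d_new, d_new)) * L_new  # K~ with unit diagonal

    # Example with a polynomial transfer function phi(lam) = lam^2:
    # K_tilde = cluster_kernel(K, lambda lam: lam ** 2)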
4 Experiments

4.1 Influence of the transfer function

We applied the different cluster kernels of section 3 to the text classification task of [11], following the same experimental protocol. There are two categories, mac and windows, with respectively 958 and 961 examples of dimension 7511. The width of the RBF kernel was chosen as in [11], giving $\sigma = 0.55$. Out of all examples, 987 were taken away to form the test set. Out of the remaining points, 2 to 128 were randomly selected to be labeled, and the other points remained unlabeled. Results are presented in figure 2 and averaged over 100 random selections of the labeled examples. The following transfer functions were compared: linear (i.e. the standard SVM), polynomial, step (with a cut-off index of n + 10, where n is the number of labeled examples), and poly-step (square root on the first n + 10 eigenvalues, square on the remaining ones).

[Figure 2: Test error of the different transfer functions for 2 to 128 labeled examples.]

For large sizes of the (labeled) training set, all approaches give similar results. The interesting case is that of small training sets. Here, the step and poly-step functions work very well. The polynomial transfer function does not give good results for very small training sets (but nevertheless outperforms the standard SVM for medium sizes). This might be due to the fact that, in this example, the second largest eigenvalue is 0.073 (the largest is 1 by construction). Since the polynomial transfer function tends to push the small eigenvalues towards 0, the new kernel has "rank almost one", and it is more difficult to learn with such a kernel. To avoid this problem, the authors of [11] consider a sparse affinity matrix with non-zero entries only for neighboring examples. In this way, the data are by construction more clustered and the eigenvalues are larger. We verified experimentally that the polynomial transfer function gave better results when applied to a sparse affinity matrix.

Concerning the step transfer function, the value of the cut-off index corresponds to the number of dimensions of the feature space induced by the kernel, since the latter is linear in the representation given by the eigendecomposition of the affinity matrix. Intuitively, it makes sense to let the number of dimensions grow with the number of training examples; this is the reason why we chose a cut-off index equal to n + 10.

The poly-step transfer function is similar to the step function, but not as rough: the square root puts more importance on the dimensions corresponding to large eigenvalues (recall that they are smaller than 1), while the square function tends to discard the components with small eigenvalues. This method achieves the best results.

4.2 Automatic selection of the transfer function

The choice of the poly-step transfer function in the previous section corresponds to the intuition that more emphasis should be put on the dimensions corresponding to the largest eigenvalues (they are useful for cluster discrimination) and less on the dimensions with small eigenvalues (corresponding to intra-cluster directions). With the eigenvalues sorted in decreasing order, the general form of this transfer function is

$$\tilde\lambda_i = \begin{cases} \lambda_i^{1/p} & i \le r \\ \lambda_i^{q} & i > r \end{cases} \qquad (2)$$

where $p, q \in \mathbb{R}$ and $r \in \mathbb{N}$ are 3 hyperparameters. As before, it is possible to choose some values for these parameters qualitatively, but ideally one would like a method which chooses good values automatically. This can be done by gradient descent on an estimate of the generalization error [2]. To assess the possibility of estimating the test error associated with the poly-step kernel accurately, we computed the span estimate [2] in the same setting as in the previous section. We fixed p = q = 2 and the number of training points to 16 (8 per class). The span estimate and the test error are plotted on the left side of figure 3.

Another possibility would be to explore methods that take into account the spectrum of the kernel matrix in order to predict the test error [7].
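A sketch of (2), written so that it can be passed as the transfer argument of the cluster_kernel sketch of section 3 (the clipping guard against numerical round-off is ours):

    import numpy as np

    def poly_step(lam, p=2.0, q=2.0, r=10):
        """Sketch of (2): lam_i^(1/p) for the r largest eigenvalues,
        lam_i^q for the remaining ones."""
        lam = np.clip(lam, 0.0, None)      # guard against tiny negative values
        order = np.argsort(lam)[::-1]      # eigenvalue indices, largest first
        phi = np.empty_like(lam)
        phi[order[:r]] = lam[order[:r]] ** (1.0 / p)
        phi[order[r:]] = lam[order[r:]] ** q
        return phi

    # With n labeled examples, the choice of section 4.1 would read:
    # K_tilde = cluster_kernel(K, lambda lam: poly_step(lam, p=2, q=2, r=n + 10))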
4.3 Comparison with other algorithms

We summarized the test errors (averaged over 100 trials) of different algorithms trained on 16 labeled examples in the following table.

The transductive SVM algorithm consists in maximizing the margin on both the labeled and the unlabeled points. To some extent, it also implements the cluster assumption, since it tends to place the decision function in low-density regions. This algorithm has been applied successfully to text categorization [4] and is a state-of-the-art algorithm for performing semi-supervised learning. The result of the random walk kernel is taken directly from [11]. Finally, the cluster kernel performance was obtained with p = q = 2 and r = 10 in the transfer function (2). The value of r was the one minimizing the span estimate (see the left side of figure 3).

Figure 3: The span estimate predicts accurately the minimum of the test error for different values of the cut-off index r in the poly-step kernel (2). Left: text classification task; right: handwritten digit classification.

Future experiments include, for instance, the Marginalized kernel (1) with the standard generative model used in text classification by the Naive Bayes classifier [6].

4.4 Digit recognition

In a second set of experiments, we considered the task of classifying the handwritten digits 0 to 4 against 5 to 9 in the USPS database. The cluster assumption should apply fairly well to this database, since the different digits are likely to be clustered.

2000 training examples were selected and divided into 50 subsets of 40 examples. For a given run, one of the subsets was used as the labeled training set, whereas the other points remained unlabeled. The width of the RBF kernel was set to 5 (the value minimizing the test error in the supervised case).

The mean test error of the standard SVM is 17.8% (standard deviation 3.5%), whereas the transductive SVM algorithm of [4] did not yield a significant improvement (17.6% ± 3.2%). As for the cluster kernel (2), the cut-off index r was again selected by minimizing the span estimate (see the right side of figure 3). It gave a test error of 14.9% (standard deviation 3.3%). It is interesting to note in figure 3 the local minimum at r = 10, which can be interpreted easily, since it corresponds to the number of different digits in the database.

It is somewhat surprising that the transductive SVM algorithm did not improve the test error on this classification problem, whereas it did for text classification. We conjecture the following explanation: the transductive SVM is more sensitive to outliers in the unlabeled set than the cluster kernel methods, since it directly tries to maximize the margin on the unlabeled points. For instance, in the top middle part of figure 1, there is an unlabeled point which would probably have perturbed this algorithm. However, in high-dimensional problems such as text classification, the influence of outlier points is smaller. Another explanation is that this method can get stuck in local minima, but, again, in a higher-dimensional space it is easier to escape local minima.
5 Conclusion

In a discriminative setting, a reasonable way to incorporate unlabeled data is through the cluster assumption. Based on the ideas of spectral clustering and random walks, we proposed a framework for constructing kernels which implement the cluster assumption: the induced distance depends on whether the points are in the same cluster or not. This is done by changing the spectrum of the kernel matrix. Since there exist several bounds for SVMs which depend on the shape of this spectrum, the main direction for future research is to perform automatic model selection based on these theoretical results. Finally, note that the cluster assumption might also be useful in purely supervised learning tasks.

Acknowledgments

The authors would like to thank Martin Szummer for helpful discussions on this topic and for providing us with his database.

References

[1] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers, 1998.

[2] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131-159, 2002.

[3] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems, volume 11, pages 487-493. The MIT Press, 1998.

[4] T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning, pages 200-209. Morgan Kaufmann, San Francisco, CA, 1999.

[5] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, volume 14, 2001.

[6] K. Nigam, A. K. McCallum, S. Thrun, and T. M. Mitchell. Learning to classify text from labeled and unlabeled documents. In Proceedings of AAAI-98, 15th Conference of the American Association for Artificial Intelligence, pages 792-799, Madison, US, 1998. AAAI Press, Menlo Park, US.

[7] B. Schölkopf, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Generalization bounds via eigenvalues of the Gram matrix. Technical Report 99-035, NeuroColt, 1999.

[8] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.

[9] M. Seeger. Covariance kernels from Bayesian generative models. In Advances in Neural Information Processing Systems, volume 14, 2001.

[10] M. Seeger. Learning with labeled and unlabeled data. Technical report, Edinburgh University, 2001.

[11] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In Advances in Neural Information Processing Systems, volume 14, 2001.

[12] K. Tsuda, T. Kin, and K. Asai. Marginalized kernels for biological sequences. Bioinformatics, 2002. To appear. Also presented at ICMB 2002.

[13] Y. Weiss. Segmentation using eigenvectors: A unifying view. In International Conference on Computer Vision, pages 975-982, 1999.