{"title": "Cluster Kernels for Semi-Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 601, "page_last": 608, "abstract": null, "full_text": "Cluster Kernels  for \n\nSemi-Supervised  Learning \n\nOlivier  Chapelle, Jason Weston, Bernhard  Scholkopf \n\nMax Planck Institute for  Biological  Cybernetics,  72076 Tiibingen,  Germany \n\n{first. last} @tuebingen.mpg.de \n\nAbstract \n\nWe  propose  a  framework  to  incorporate  unlabeled  data in  kernel \nclassifier,  based on the idea that two points in the same cluster are \nmore likely  to have  the same label.  This is  achieved  by modifying \nthe eigenspectrum of the kernel matrix.  Experimental results assess \nthe validity of this  approach. \n\n1 \n\nIntroduction \n\nWe  consider  the  problem  of semi-supervised  learning,  where  one  has  usually  few \nlabeled examples and a  lot of unlabeled examples.  One of the first  semi-supervised \nalgorithms  [1]  was  applied  to  web  page  classification.  This  is  a  typical  example \nwhere  the  number  of  unlabeled  examples  can  be  made  as  large  as  possible  since \nthere  are  billions  of web  page,  but  labeling  is  expensive  since  it  requires  human \nintervention.  Since  then,  there  has  been  a  lot  of interest  for  this  paradigm in  the \nmachine  learning  community;  an  extensive  review  of  existing  techniques  can  be \nfound  in  [10]. \nIt has been shown experimentally that under certain conditions,  the decision func(cid:173)\ntion can be estimated more accurately,  yielding lower generalization error [1,  4,  6] . \nHowever,  in  a  discriminative framework,  it is  not  obvious  to determine  how  unla(cid:173)\nbeled data or even the perfect knowledge of the input distribution P(x)  can help in \nthe estimation of the decision function.  Without any assumption, it turns out that \nthis  information is  actually useless  [10]. \n\nThus,  to  make  use  of  unlabeled  data,  one  needs  to  formulate  assumptions.  One \nwhich  is  made,  explicitly  or  implicitly,  by  most  of  the  semi-supervised  learning \nalgorithms is  the so-called  \"cluster assumption\"  saying that two points are likely to \nhave the same class label if there is  a path connecting them passing through regions \nof  high  density  only.  Another  way  of stating  this  assumption  is  to  say  that  the \ndecision  boundary should lie  in regions of low  density.  In  real world problems, this \nmakes sense:  let us consider handwritten digit recognition and suppose one tries to \nclassify digits 0 from  1.  The probability of having a digit which in  between a  0 and \n1 is  very low. \n\nIn  this  article,  we  will  show  how  to  design  kernels  which  implement  the  cluster \nassumption,  i.e.  kernels  such  that  the  induced  distance  is  small for  points  in  the \nsame cluster and larger for  points in  different  clusters. \n\n\f' :..  + .... . \n\n+ \n\nFigure  1:  Decision  function  obtained  by  an  SVM  with  the  kernel  (1).  On  this \ntoy problem,  this kernel implements  perfectly the cluster assumption:  the decision \nfunction  cuts a  cluster only when necessary. \n\n2  Kernels  implementing the cluster assumption \n\nIn  this  section,  we  explore  different  ideas  on  how  to  build  kernels  which  take into \naccount the fact that the data is clustered.  In section 3, we will propose a framework \nwhich  unifies  the methods  proposed in  [11]  and  [5]. \n\n2.1  Kernels from  mixture models \n\nIt is  possible  to  design  directly  a  kernel  taking  into  account  the  generative  model \nlearned  from  the  unlabeled  data.  Seeger  [9]  derived  such  a  kernel  in  a  Bayesian \nsetting.  He  proposes  to  use  the  unlabeled  data to  learn  a  mixture  of models  and \nhe  introduces  the  Mutual  Information  kernel  which  is  defined  in  such  way  that \ntwo  points  belonging  to  different  components  of  the  mixture  model  will  have  a \nlow  dot  product.  Thus,  in  the  case  of  a  mixture  of  Gaussians,  this  kernel  is  an \nimplementation of the cluster assumption.  Note that in the case of a single mixture \nmodel, the  Fisher kernel [3]  is  an approximation of this Mutual Information kernel. \n\nIndependently,  another  extension  of  the  Fisher  kernel  has  been  proposed  in  [12] \nwhich  leads,  in  the  case  of  a  mixture  of  Gaussians  (J.Lk,  ~k) to  the  Marginalized \nkernel whose  behavior is  similar to the mutual information kernel, \n\nq \n\nK(x, y)  =  L P(klx)P(kly)x T~kly. \n\nk=l \n\n(1) \n\nTo understand the behavior of the Marginalized kernel, we  designed a  2D-toy prob(cid:173)\nlem  (figure  1):  200  unlabeled  points  have  been  sampled  from  a  mixture  of  two \nGaussians,  whose  parameters  have  then  been  learned  with  EM  applied  to  these \npoints.  An SVM has been trained on 3 labeled points using the Marginalized kernel \n(1).  The  behavior  of  this  decision  function  is  intuitively  very  satisfying:  on  the \none hand, when not enough label data is  available, it takes into account the cluster \nassumption  and  does  not  cut  clusters  (right  cluster),  but  on  the  other  hand,  the \nkernel is flexible  enough to cope with different labels in the same cluster  (left  side). \n\n2.2  Random walk  kernel \n\nThe  kernels  presented in  the  previous  section  have  the  drawback of depending on \na  generative  model:  first,  they  require  an  unsupervised  learning  step,  but  more \n\n\fimportantly,  in  a  lot  of real  world  problems,  they  cannot  model  the  input  distri(cid:173)\nbution  with  sufficient  accuracy.  When  applying the mixture  of Gaussians  method \n(presented  above)  to  real  world  problems, one  cannot  expect  the  \"ideal\"  result  of \nfigure  1. \n\nFor  this  reason,  in  clustering  and  semi-supervised  learning,  there  has  been  a  lot \nof  interest  to  find  algorithms  which  do  not  depend  on  a  generative  model.  We \nwill  present two of them, find  out how they are related and present a  kernel which \nextends  them.  The  first  one  is  the  random  walk  representation  proposed  in  [11] . \nThe main idea is to compute the RBF kernel matrix (with the labeled and unlabeled \npoints)  Kij  =  exp( -llxi - Xj 112  /2( 2 )  and to interpret it  as  a  transition matrix of \na  random  walk  on  a  graph  with  vertices  Xi ,  P(Xi  -+  Xj)  =  \"K'k . . After  t  steps \n(where  t  is  a  parameter to  be  determined) ,  the  probability  of going  from  a  point \nXi  to a  point  Xj  should be quite high if both points belong to the same cluster and \nshould stay low  if they are in two  different  clusters. \n\nL.J p \n\ntp \n\nLet  D  be  the  diagonal  matrix  whose  elements  are  Dii  =  Lj K ij .  The  one  step \ntransition  matrix  is  D - 1 K  and  after  t  steps  it  is  pt  =  (D - 1 K)t.  In  [11],  the \nauthors  design  a  classifier  which  uses  directly  those  transition  probabilities.  One \nwould  be  tempted  to  use  pt  as  a  kernel  matrix for  a  SVM  classifier.  However,  it \nis  not possible to directly use  pt as a  kernel matrix since it is  not even symmetric. \nWe  will  see  in  section 3 how  a  modified version of pt can be used as a  kernel. \n\n2.3  Kernel  induced by a  clustered representation \n\nAnother idea to implement  the  cluster  assumption is  to  change  the representation \nof the input points such that points in the same cluster are grouped together in the \nnew  representation.  For this  purpose,  one  can use  tools  of spectral  clustering  (see \n[13]  for  a review)  Using the first eigenvectors of a similarity matrix, a representation \nwhere the points are naturally well  clustered has been recently presented in  [5].  We \nsuggest  to  train  a  discriminative  learning  algorithm  in  this  representation.  This \nalgorithm,  which resembles kernel PCA,  is  the following: \n\n1.  Compute  the  affinity  matrix,  which  is  an  RBF  kernel  matrix  but  with \n\ndiagonal elements being 0 instead of 1. \n\n2.  Let D  be a diagonal matrix with diagonal elements equal to the sum of the \nrows  (or the columns)  of K  and construct the matrix L  =  D - 1/ 2KD - 1/ 2 . \n3.  Find the eigenvectors (Vi, ... , Vk)  of L corresponding the first k eigenvalues. \n\n4.  The  new  representation of the  point  Xi  is  (Vii' ... ' Vik)  and  is  normalized \n\nto have length one:  ip(Xi)p  =  Vip  /  0:=;=1 Vfj)1/2. \n\nThe reason to consider the first  eigenvectors of the affinity  matrix is  the following. \nSuppose there are k  clusters in the dataset infinitely far  apart from each other.  One \ncan show that in this case, the first  k eigenvalues of the affinity matrix will  be 1 and \nthe eigenvalue  k + 1 will  be strictly less  than  1  [5].  The value  of this  gap  depends \non  how  well  connected each  cluster is:  the  better connected,  the  larger  the  gap  is \n(the  smaller  the  k + 1st eigenvalue).  Also,  in  the  new  representation  in  Rk  there \nwill  be k  vectors Zl, .. .  ,Zk orthonormal to each other such that each training point \nis  mapped to one of those  k  points depending on the cluster it belongs to. \n\nThis  simple  example  show  that  in  this  new  representation  points  are  naturally \nclustered and we  suggest to train a  linear  classifier on the mapped points. \n\n\f3  Extension of the cluster kernel \n\nBased on the ideas of the previous section,  we  propose the following  algorithm: \n\n1.  As  before,  compute  the  RBF  matrix  K  from  both  labeled  and  unlabeled \npoints  (this  time  with  1  on  the  diagonal  and  not  0)  and  D,  the  diagonal \nmatrix whose  elements are the sum of the rows  of K. \n\n2.  Compute L  =  D- 1 / 2 K D- 1 / 2  and its eigendecomposition L  =  U AUT. \n3.  Given a transfer function <p,  let :Xi  =  <p(Ai),  where the Ai  are the eigenvalues \n\nof L, and construct L =  U AuT. \n\n4.  Let iJ be a diagonal matrix with iJii  =  1/ Lii and compute K  =  iJ1 /2 LiJ 1/2. \n\nThe new kernel  matrix is  K.  Different  transfer function  lead to different  kernels: \nLinear  <p(A)  =  A.  In this  case L =  L  and  iJ  =  D  (since  the  diagonal elements of \n\nK  are 1).  It turns out that K  = K  and no  transformation is  performed. \n\nStep  <p(A)  =  1 if A 2:  Acut  and 0 otherwise.  If Acut  is  chosen to be equal to the k-th \nlargest eigenvalue  of L,  then the  new  kernel  matrix K is  the  dot  product \nmatrix in the representation of [5]  described in the previous section. \n\nLinear-step  Same  as  the step function,  but  with  <p(A)  =  A for  A 2:  Acut .  This  is \nclosely  related  to  the  approach  consisting  in  building  a  linear  classifier  in \nthe space given  by the first  Kernel  PCA components  [8]:  if the normaliza(cid:173)\ntion matrix D  and iJ  were equal to the identity, both approaches would be \nidentical.  Indeed, if the eigendecomposition of K  is  K  =  U AUT , the coor(cid:173)\ndinates  of the  training  points  in  the  kernel  PCA  representation  are given \nby the matrix U A 1 /2 . \nAt. \n\nIn \niJ1 /2 D1 /2 (D- 1 K)t D- 1/2 iJ1/2 . \nthe  transition \nmatrix  in  the  random  walk  described  in  section  2.2  and  K  can  be  inter(cid:173)\npreted  as  a  normalized  and  symmetrized  version  of the  transition  matrix \ncorresponding to a  t  step random walk. \n\nLt  and  K \nis \n\nPolynomial  <p(A) \n\nthis  case,  L \n\nThe  matrix  D- 1 K \n\nThis  makes  the  connection  between  the idea of the random walk  kernel  of section \n2.2  and a linear classifier trained in a space induced by either the spectral clustering \nalgorithm of [5]  or the Kernel  PCA algorithm. \n\nHow  to  handle  test  points  If test  points  are  available  during  training  and  if \nthey are also drawn from the same distribution as the training points (an assumption \nwhich is  commonly made), then they should be considered as  unlabeled points and \nthe  matrix K  described  above  should  be  built  using  training,  unlabeled  and  test \npoints. \n\nHowever, it might happen that test points are not available during training.  This is \na problem, since our method produces a new kernel matrix, but not an analytic form \nof the effective  new  kernel  that could readily be evaluated on novel test points.  In \nthis case,  we  propose the following  solution:  approximate a  test point x  as  a  linear \ncombination  of the  training  and  unlabeled  points,  and  use  this  approximation to \nexpress  the  required  dot  product  between  the  test  point  and  other  points  in  the \nfeature space.  More  precisely,  let \n\naD  =  argm~n 11<p(X)  - n~u lli<P(Xi)II  =  K- 1v \n\n\fLinear (Normal SVM) \n\n-\n--e-- Polynomial \n- - Step \n- - Pol  - sle \n\n0.2 \n\n0.15 \n\n0.' \n\n0.OS'-::'------:------:------:C'S:------=3C:-2  --64:':--='28 \n\nNb of labeled  points \n\nFigure  2:  Test  error  on  a  text  classification  problem  for  training  set  size  varying \nfrom 2 to 128 examples.  The different kernels correspond to different kind of transfer \nfunctions. \n\nwith Vi  =  K(x, Xi)l .  Here,  <I>  is the feature map corresponding to K, i.e., K(x, x') = \n(<I>(x)  . <I>(x/)).  The new dot product between the test point x  and the other points \nis  expressed as a  linear combination of the dot products of k, \n\nK(X,Xi)  = (Ka  )i  = (KK  vk \n-\n\n- 1 \n\n-\n\n-\n\n0 \n\n-\n\nNote  that  for  a  linear  transfer  function,  k  =  K,  and  the  new  dot  product  is  the \nstandard one. \n\n4  Experiments \n\n4.1 \n\nInfluence  of the transfer function \n\nWe  applied the different kernel clusters of section 3 to the text classification task of \n[11],  following  the  same experimental  protocol.  There  are  two  categories  mac  and \nwindows  with  respectively 958  and 961  examples of dimension  7511.  The width of \nthe  RBF  kernel  was  chosen  as  in  [11]  giving  a  =  0.55.  Out  of all  examples,  987 \nwere  taken away  to form  the  test  set.  Out  of the remaining  points,  2 to  128  were \nrandomly selected  to be labeled and the other points remained unlabeled.  Results \nare  presented  in  figure  2  and  averaged  over  100  random  selections  of the  labeled \nexamples.  The  following  transfer  functions  were  compared:  linear  (i.e.  standard \nSVM),  polynomial <p(A)  =  A5 ,  step keeping only the n + 10 where n  is  the number of \nlabeled points, and poly-step defined in the following way (with 1 2  Ai  2  A2  2  . .. ), \n\ni  :S  n + 10 \ni  > n + 10 \n\nFor large sizes of the  (labeled)  training set, all  approaches give similar results.  The \ninteresting case are small training sets.  Here,  the step and poly-step functions  work \nvery well.  The polynomial transfer function does not give good results for very small \ntraining sets  (but  nevertheless  outperforms  the  standard SVM  for  medium  sizes). \nThis might be due to the fact  that in this example, the second largest eigenvalue is \n0.073 (the largest is by construction 1).  Since the polynomial transfer function tends \n\n1 We consider here an RBF kernel and for  this reason the matrix K  is  always invertible. \n\n\fto push to 0 the small eigenvalues, it turns out that the new kernel has \"rank almost \none\"  and  it  is  more  difficult  to  learn  with  such  a  kernel.  To  avoid  this  problem, \nthe authors of [11]  consider a  sparse affinity  matrix with  non-zeros entries only for \nneighbor  examples.  In  this  way  the  data are  by  construction  more  clustered  and \nthe eigenvalues  are larger.  We  verified experimentally that the polynomial transfer \nfunction  gave better results  when applied to a  sparse affinity  matrix. \n\nConcerning  the  step  transfer  function,  the  value  of the  cut-off index  corresponds \nto  the  number of dimensions  in  the feature  space induced  by the kernel,  since  the \nlatter is  linear in the representation given by the eigendecomposition of the affinity \nmatrix.  Intuitively,  it makes sense to have the number of dimensions increase with \nthe  number  of training  examples,  that  is  the  reason  why  we  chose  a  cutoff index \nequal to n + 10. \nThe poly-step transfer function is somewhat similar to the step function, but is not as \nrough:  the square root tends to put more importance on dimensions corresponding \nto  large  eigenvalues  (recall  that  they  are  smaller  than  1)  and  the  square  function \ntends to discard components with small eigenvalues.  This method achieves the best \nresults. \n\n4.2  Automatic selection of the transfer function \n\nThe choice  of the  poly-step  transfer function  in the previous choice  corresponds to \nthe intuition that more emphasis should be put on the dimensions corresponding to \nthe  largest  eigenvalues  (they  are  useful  for  cluster  discrimination)  and  less  on the \ndimensions  with  small  eigenvalues  (corresponding to intra-cluster directions).  The \ngeneral form  of this transfer function  is \n\ni  ~ r \ni  > r \n\n' \n\n(2) \n\nwhere p, q E  lR  and r  E  N are 3 hyperparameters.  As  before, it is  possible to choose \nqualitatively some values for these parameters, but ideally, one would like a method \nwhich automatically chooses good values.  It is  possible to do so by gradient descent \non an estimate of the generalization error [2].  To assess the possibility of estimating \naccurately the test error associated with the poly-step kernel, we  computed the span \nestimate  [2]  in  the same setting as in  the previous section.  We  fixed  p  =  q =  2 and \nthe number  of training  points to 16  (8  per class).  The span estimate and the test \nerror are plotted on the left  side of figure  3. \n\nAnother possibility would  be to explore  methods  that take  into  account  the spec(cid:173)\ntrum of the kernel  matrix in order to predict the test error  [7]. \n\n4.3  Comparison with other algorithms \n\nWe  summarized  the  test  errors  (averaged  over  100  trials)  of different  algorithms \ntrained on 16  labeled examples in  the following  table. \n\nThe transductive SVM algorithm consists in maximizing the margin on both labeled \nand unlabeled.  To  some  extent  it implements  also  the  cluster  assumption  since  it \ntends  to put the decision function  in low  density regions.  This algorithm has  been \nsuccessfully applied to text categorization [4]  and is  a state-of-the-art algorithm for \n\n\f0.25,-----~-~-~-r=_=:=;:'T'=\" ='''O==,  ==;] \n\n0.22 ,----~~-~~-7_~T'=\"='''==O, ==]l \n\n--e- S  an estimale \n\n-e- S  an estimate \n\n0.2 \n\n0.2 1 \n\n0.2 \n\n0.19 \n\n0. 18 \n\n0 15 \"~ ~ \n\n~  0.17 \n\n0.1 \n\n0. 16 \n\n0. 15 \n\n10 \n\n15 \n\n20 \n\n25 \n\n30 \n\n10 \n\n12 \n\n14 \n\n16 \n\n16 \n\n20 \n\nFigure  3:  The  span  estimate  predicts  accurately  the  minimum  of  the  test  error \nfor  different  values  of  the  cutoff  index  r  in  the  poly-step  kernel  (2).  Left:  text \nclassification task,  right:  handwritten digit  classification \n\nperforming semi-supervised learning.  The result of the Random  walk kernel is  taken \ndirectly  from  [11].  Finally,  the  cluster  kernel  performance has  been obtained with \np  =  q  =  2  and  r  =  10  in  the  transfer  function  2.  The  value  of  r  was  the  one \nminimizing the span estimate  (see left  side of figure  3). \n\nFuture experiments include for  instance the Marginalized kernel  (1)  with the stan(cid:173)\ndard generative model  used in text  classification by Naive Bayes classifier  [6]. \n\n4.4  Digit  recognition \n\nIn a second set of experiments, we considered the task of classifying the handwritten \ndigits  0 to  4  against  5 to 9 of the  USPS  database.  The cluster  assumption should \napply fairly well on this database since the different digits are likely to be clustered. \n\n2000  training  examples  have  been  selected  and  divided  into  50  subsets  on  40  ex(cid:173)\namples.  For  a  given  run,  one  of the  subsets  was  used  as  the  labeled  training  set, \nwhereas the other points remained unlabeled.  The width of the RBF kernel was set \nto 5  (it  was the value minimizing the test error in  the supervised case). \n\nThe  mean  test  error  for  the  standard  SVM  is  17.8%  (standard  deviation  3.5%), \nwhereas the transductive SVM  algorithm of [4]  did not yield  a  significant improve(cid:173)\nment  (17.6% \u00b1 3.2%).  As  for  the  cluster  kernel  (2),  the  cutoff index  r  was  again \nselected by minimizing the span estimate  (see right side of figure  3).  It gave a  test \nerror of 14.9%  (standard  deviation  3.3%).  It  is  interesting  to  note  in figure  3  the \nlocal  minimum  at r  =  10,  which  can  be  interpreted  easily  since  it  corresponds  to \nthe number of different  digits in  the database. \nIt is  somehow surprising that the transductive SVM algorithm did not improve the \ntest  error on this  classification  problem,  whereas  it  did  for  text  classification.  We \nconjecture  the  following  explanation:  the  transductive  SVM  is  more  sensitive  to \noutliers in  the unlabeled set than  the  cluster  kernel  methods since  it  directly tries \nto  maximize  the margin on the  unlabeled  points.  For  instance,  in  the  top  middle \npart of figure  1,  there is  an unlabeled  point which  would  have probably perturbed \nthis  algorithm.  However, in  high  dimensional  problems  such  as  text  classification, \nthe influence  of outlier  points  is  smaller.  Another explanation is  that this  method \ncan  get  stuck  in  local  minima,  but  that  again,  in  higher  dimensional  space,  it  is \neasier to get out of local minima. \n\n\f5  Conclusion \n\nIn  a  discriminative  setting,  a  reasonable  way  to  incorporate  unlabeled  data  is \nthrough  the  cluster  assumption.  Based  on  the  ideas  of  spectral  clustering  and \nrandom walks,  we  proposed a  framework for  constructing kernels  which implement \nthe cluster assumption:  the induced distance depends  on whether the points are in \nthe  same cluster or not.  This  is  done  by changing the spectrum of the kernel  ma(cid:173)\ntrix.  Since there exist several bounds for  SVMs  which  depend on the shape of this \nspectrum, the main direction for  future  research is  to perform automatic model se(cid:173)\nlection based on these theoretical results.  Finally,  note that the cluster assumption \nmight also be useful  in a  purely supervised  learning task. \n\nAcknowledgments \n\nThe  authors  would  like  to  thank  Martin  Szummer  for  helpful  discussion  on  this \ntopic and for  having provided us  with his database. \n\nReferences \n\n[1]  A.  Blum  and T.  Mitchell.  Combining  labeled  and  unlabeled  data with  co-training. \nIn  COLT:  Proceedings  of the  Workshop  on  Computational  Learning  Theory.  Morgan \nKaufmann  Publishers,  1998. \n\n[2]  O.  Chapelle,  V.  Vapnik,  O. Bousquet,  and S. Mukherjee.  Choosing  multiple param(cid:173)\n\neters for  support vector machines.  Machine  Learning, 46(1-3):131-159,  2002. \n\n[3]  T.  Jaakkola and D.  Haussler.  Exploiting generative  models  in  discriminative  classi(cid:173)\n\nfiers.  In  Advances  in  Neural  Information  Processing,  volume 11,  pages 487-493.  The \nMIT Press,  1998. \n\n[4]  T.  Joachims.  Transductive inference  for  text  classification  using support  vector  ma(cid:173)\n\nchines.  In  Proceedings  of the  16th  International  Conference  on  Machine  Learning, \npages  200- 209.  Morgan  Kaufmann,  San Francisco,  CA,  1999. \n\n[5]  A.  Y.  Ng,  M.  I.  Jordan,  and  Y.  Weiss.  On  spectral  clustering:  Analysis  and  an \nalgorithm.  In  Advances  in  Neural  Information  Processing  Systems,  volume 14,  200l. \n[6]  K  Nigam, A.  K  McCallum,  S.  Thrun,  and T.  M.  Mitchell.  Learning to classify  text \nfrom  labeled and unlabeled documents.  In Proceedings  of AAAI-9S,  15th  Conference \nof the  American  Association for  Artificial Intelligence,  pages  792- 799,  Madison,  US, \n1998.  AAAI Press,  Menlo  Park,  US. \n\n[7]  B.  Scholkopf,  J.  Shawe-Taylor,  A.  J.  Smola,  and  R.  C.  Williamson.  Generalization \nbounds  via  eigenvalues  of  the  Gram  matrix.  Technical  Report  99-035,  NeuroColt, \n1999. \n\n[8]  B.  Scholkopf,  A. Smola,  and K-R.  Muller.  Nonlinear component analysis  as  a  kernel \n\neigenvalue  problem.  Neural  Computation, 10:1299- 1310,  1998. \n\n[9]  M.  Seeger.  Covariance  kernels  from  Bayesian  generative  models.  In  Advances  in \n\nNeural  Information  Processing  Systems, volume 14,  200l. \n\n[10]  M.  Seeger.  Learning  with  labeled  and  unlabeled  data.  Technical  report ,  Edinburgh \n\nUniversity, 200l. \n\n[11]  M.  Szummer and  T.  Jaakkola.  Partially  labeled  classification  with  markov  random \n\nwalks.  In  Advances  in  Neural  Information  Processing  Systems, volume 14,  200l. \n\n[12]  K  Tsuda, T.  Kin, and K  Asai.  Marginalized kernels for  biological sequences.  Bioin(cid:173)\n\nformatics , 2002.  To  appear.  Also  presented at ICMB  2002. \n\n[13]  Y.  Weiss.  Segmentation using  eigenvectors:  A  unifying  view.  In  International  Con(cid:173)\n\nference  on  Computer  Vision, pages 975- 982,  1999. \n\n\f", "award": [], "sourceid": 2257, "authors": [{"given_name": "Olivier", "family_name": "Chapelle", "institution": null}, {"given_name": "Jason", "family_name": "Weston", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}