{"title": "Classification on Pairwise Proximity Data", "book": "Advances in Neural Information Processing Systems", "page_first": 438, "page_last": 444, "abstract": null, "full_text": "Classification on Pairwise Proximity Data\n\nThore Graepel, Ralf Herbrich,\n\nPeter Bollmann-Sdorra, Klaus Obermayer\n\nTechnical University of Berlin,\n\nStatistics Research Group, Sekr. FR 6-9,\n\nNeural Information Processing Group, Sekr. FR 2-1,\n\nFranklinstr. 28/29, 10587 Berlin, Germany\n\nAbstract\n\nWe investigate the problem of learning a classification task on data\nrepresented in terms of their pairwise proximities. This representation\ndoes not refer to an explicit feature representation of the data\nitems and is thus more general than the standard approach of using\nEuclidean feature vectors, from which pairwise proximities can\nalways be calculated. Our first approach is based on a combined\nlinear embedding and classification procedure resulting in an extension\nof the Optimal Hyperplane algorithm to pseudo-Euclidean\ndata. As an alternative we present another approach based on a\nlinear threshold model in the proximity values themselves, which is\noptimized using Structural Risk Minimization. We show that prior\nknowledge about the problem can be incorporated by the choice of\ndistance measures and examine different metrics w.r.t. their generalization.\nFinally, the algorithms are successfully applied to protein\nstructure data and to data from the cat's cerebral cortex. They\nshow better performance than K-nearest-neighbor classification.\n\n1 Introduction\n\nIn most areas of pattern recognition, machine learning, and neural computation it\nhas become common practice to represent data as feature vectors in a Euclidean\nvector space.
This kind of representation is very convenient because the Euclidean\nvector space offers powerful analytical tools for data analysis not available in other\nrepresentations. However, such a representation incorporates assumptions about\nthe data that may not hold and of which the practitioner may not even be aware.\nAnd - an even more severe restriction - no domain-independent procedures for the\nconstruction of features are known [3].\n\nA more general approach to the characterization of a set of data items is to define\na proximity or distance measure between data items - not necessarily given as\nfeature vectors - and to provide a learning algorithm with a proximity matrix of\na set of training data. Since pairwise proximity measures can be defined on structured\nobjects like graphs this procedure provides a bridge between the classical and\nthe \"structural\" approaches to pattern recognition [3]. Additionally, pairwise data\noccur frequently in empirical sciences like psychology, psychophysics, economics,\nbiochemistry etc., and most of the algorithms developed for this kind of data - predominantly\nclustering [5, 4] and multidimensional scaling [8, 6] - fall into the realm\nof unsupervised learning.\n\nIn contrast to nearest-neighbor classification schemes [10] we suggest algorithms\nwhich operate on the given proximity data via linear models. After a brief discussion\nof different kinds of proximity data in terms of possible embeddings, we suggest\nhow the Optimal Hyperplane (OHC) algorithm for classification [2, 9] can be applied\nto distance data from both Euclidean and pseudo-Euclidean spaces.
Subsequently,\na more general model is introduced which is formulated as a linear threshold model\non the proximities, and is optimized using the principle of Structural Risk Minimization\n[9]. We demonstrate how the choice of proximity measure influences the\ngeneralization behavior of the algorithm and apply both algorithms to real-world\ndata from biochemistry and neuroanatomy.\n\n2 The Nature of Proximity Data\n\nWhen faced with proximity data in the form of a matrix $P = \\{p_{ij}\\}$ of pairwise\nproximity values between data items, one idea is to embed the data in a suitable\nspace for visualization and analysis. This is referred to as multidimensional scaling,\nand Torgerson [8] suggested a procedure for the linear embedding of proximity data.\nInterpreting the proximities as Euclidean distances in some unknown Euclidean\nspace one can calculate an inner product matrix $H = X^T X$ w.r.t. the center of\nmass of the data from the proximities according to [8]\n\n$$(H)_{ij} = -\\frac{1}{2}\\left(|p_{ij}|^2 - \\frac{1}{\\ell}\\sum_{m=1}^{\\ell}|p_{mj}|^2 - \\frac{1}{\\ell}\\sum_{n=1}^{\\ell}|p_{in}|^2 + \\frac{1}{\\ell^2}\\sum_{m,n=1}^{\\ell}|p_{mn}|^2\\right). \\tag{1}$$\n\nLet us perform a spectral decomposition $H = U D U^T = X^T X$ and choose $D$\nand $U$ such that their columns are sorted in decreasing order of magnitude of\nthe eigenvalues $\\lambda_i$ of $H$. The embedding in an $n$-dimensional space is achieved\nby calculating the first $n$ rows of $X = D^{1/2} U^T$. In order to embed a new data\nitem characterized by a vector $p$ consisting of its pairwise proximities $p_i$ w.r.t.\nthe previously known data items, one calculates the corresponding inner product\nvector $h$ using (1) with $(H)_{ij}$, $p_{ij}$, and $p_{mj}$ replaced by $h_i$, $p_i$, and $p_m$ respectively,\nand then obtains the embedding $x = D^{-1/2} U^T h$.
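The embedding procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names inner_products_from_distances, embed and embed_new are hypothetical, and numpy.linalg.eigh is assumed as the eigensolver. The double centering of the squared distances is the matrix form of Eq. (1), eigenvalues are sorted by decreasing magnitude as in the text, and absolute values are taken under the square root so that the same sketch also covers the pseudo-Euclidean case discussed next.

```python
import numpy as np

def inner_products_from_distances(P):
    # Matrix form of Eq. (1): double centering of the squared distances.
    P2 = np.asarray(P, dtype=float) ** 2
    ell = P2.shape[0]
    J = np.eye(ell) - np.ones((ell, ell)) / ell
    return -0.5 * J @ P2 @ J

def embed(P, n):
    # Returns X (n x ell, one column per data item) together with the
    # n eigenvalues and eigenvectors needed to embed new items later.
    H = inner_products_from_distances(P)
    lam, U = np.linalg.eigh(H)
    order = np.argsort(-np.abs(lam))[:n]  # decreasing magnitude
    lam, U = lam[order], U[:, order]
    X = np.sqrt(np.abs(lam))[:, None] * U.T
    return X, lam, U

def embed_new(p, P, lam, U):
    # Eq. (1) applied to the distance vector p of a new item,
    # followed by x = D^(-1/2) U^T h.
    p2 = np.asarray(p, dtype=float) ** 2
    P2 = np.asarray(P, dtype=float) ** 2
    h = -0.5 * (p2 - p2.mean() - P2.mean(axis=1) + P2.mean())
    return (U.T @ h) / np.sqrt(np.abs(lam))
```

For genuinely Euclidean distances the embedded points reproduce the input distance matrix up to rotation and translation, which gives a quick sanity check on the sketch.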
\nThe matrix $H$ has negative eigenvalues if the distance data $P$ were not Euclidean.\nThen the data can be isometrically embedded only in a pseudo-Euclidean\nor Minkowski space $\\mathbb{R}^{(n^+, n^-)}$, equipped with a bilinear form $\\phi$, which is not\npositive definite. In this case the distance measure takes the form\n$p(x_i, x_j) = \\sqrt{\\phi(x_i - x_j)} = \\sqrt{(x_i - x_j)^T M (x_i - x_j)}$, where $M$ is any $n \\times n$ symmetric matrix\nassumed to have full rank, but not necessarily positive definite. However, we can\nalways find a basis such that the matrix $M$ assumes the form $M = \\mathrm{diag}(I_{n^+}, -I_{n^-})$\nwith $n = n^+ + n^-$, where the pair $(n^+, n^-)$ is called the signature of the pseudo-Euclidean\nspace [3]. Also in this case (1) serves to reconstruct the symmetric bilinear\nform, and the embedding proceeds as above with $D$ replaced by $\\bar{D}$, whose diagonal\ncontains the moduli of the eigenvalues of $H$.\n\nFrom the eigenvalue spectrum of $H$ the effective dimensionality of the proximity\npreserving embedding can be obtained. (i) If there is only a small number of large\npositive eigenvalues, the data items can be reasonably embedded in a Euclidean\nspace. (ii) If there is a small number of positive and negative eigenvalues of large\nabsolute value, then an embedding in a pseudo-Euclidean space is possible. (iii) If\nthe spectrum is continuous and relatively flat, then no linear embedding is possible\nin fewer than $\\ell - 1$ dimensions.\n\n3 Classification in Euclidean and Pseudo-Euclidean Space\n\nLet the training set $S$ be given by an $\\ell \\times \\ell$ matrix $P$ of pairwise distances of unknown\ndata vectors $x$ in a Euclidean space, and a target class $y_i \\in \\{-1, +1\\}$ for each data\nitem.
Assuming that the data are linearly separable, we follow the OHC algorithm\n[2] and set up a linear model for the classification in data space,\n\n$$y(x) = \\mathrm{sign}(x^T w + b). \\tag{2}$$\n\nThen we can always find a weight vector $w$ and threshold $b$ such that\n\n$$y_i (x_i^T w + b) \\geq 1, \\qquad i = 1, \\ldots, \\ell. \\tag{3}$$\n\nNow the optimal hyperplane with maximal margin is found by minimizing $\\|w\\|^2$\nunder the constraints (3). This is equivalent to maximizing the Wolfe dual $W(\\alpha)$\nw.r.t. $\\alpha$,\n\n$$W(\\alpha) = \\alpha^T 1 - \\frac{1}{2} \\alpha^T Y X^T X Y \\alpha, \\tag{4}$$\n\nwith $Y = \\mathrm{diag}(y)$, and the $\\ell$-vector $1$. The constraints are $\\alpha_i \\geq 0$, $\\forall i$, and $1^T Y \\alpha^* = 0$.\nSince the optimal weight vector $w^*$ can be expressed as a linear combination of\ntraining examples\n\n$$w^* = X Y \\alpha^*, \\tag{5}$$\n\nand the optimal threshold $b^*$ is obtained by evaluating $b^* = y_i - x_i^T w^*$ for any\ntraining example $x_i$ with $\\alpha_i^* \\neq 0$, the decision function (2) can be fully evaluated\nusing inner products between data vectors only. This formulation allows us to learn\non the distance data directly.\n\nIn the Euclidean case we can apply (1) to the distance matrix $P$ of the training\ndata, obtain the inner product matrix $H = X^T X$, and introduce it directly -\nwithout explicit embedding of the data - into the Wolfe dual (4). The same is true\nfor the test phase, where only the inner products of the test vector with the training\nexamples are needed.\n\nIn the case of pseudo-Euclidean distance data the inner product matrix $H$ obtained\nfrom the distance matrix $P$ via (1) has negative eigenvalues. This means that\nthe corresponding data vectors can only be embedded in a pseudo-Euclidean space\n$\\mathbb{R}^{(n^+, n^-)}$ as explained in the previous section. Also $H$ cannot serve as the Hessian\nin the quadratic programming (QP) problem (4).
It turns out, however, that the\nindefiniteness of the bilinear form in pseudo-Euclidean spaces does not forestall\nlinear classification [3]. A decision plane is characterized by the equation $x^T M w = 0$,\nas illustrated in Fig. 1. However, Fig. 1 also shows that the same plane can just\nas well be described by $x^T \\bar{w} = 0$ - as if the space were Euclidean - where $\\bar{w} = M w$\nis simply the mirror image of $w$ w.r.t. the axes of negative signature. For the\nOHC algorithm this means that if we can reconstruct the Euclidean inner product\nmatrix $X^T X$ from the distance data, we can proceed with the OHC algorithm as\nusual. $\\bar{H} = X^T X$ is calculated by \"flipping\" the axes of negative signature, i.e.,\nwith $\\bar{D} = \\mathrm{diag}(|\\lambda_1|, \\ldots, |\\lambda_\\ell|)$, we can calculate $\\bar{H}$ according to\n\n$$\\bar{H} = U \\bar{D} U^T, \\tag{6}$$\n\nwhich serves now as the Hessian matrix for normal OHC classification. Note that\n$\\bar{H}$ is positive semi-definite, which ensures a unique solution for the QP problem (4).\n\nFigure 1: Plot of a decision line (thick) in a 2D pseudo-Euclidean space with signature\n(1,1), i.e., $M = \\mathrm{diag}(1, -1)$. The decision line is described by $x^T M w = 0$.\nWhen interpreted as Euclidean it is at right angles with $\\bar{w}$, which is the mirror image\nof $w$ w.r.t. the axis $x^-$ of negative signature. In physics this plot is referred to as a\nMinkowski space-time diagram, where $x^+$ corresponds to the space axis and $x^-$ to the\ntime axis. The dashed diagonal lines indicate the points $x^T M x = 0$ of zero length,\nthe light cone.
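The axis flip of Eq. (6) amounts to replacing each eigenvalue of the inner product matrix by its absolute value. The following NumPy sketch illustrates this under stated assumptions: the function names flip_axis and cut_off are hypothetical, numpy.linalg.eigh is assumed as the eigensolver, and cut_off is included only for contrast, as the simpler variant that discards the directions with negative eigenvalues (the OHC-cut-off baseline of Section 6).

```python
import numpy as np

def flip_axis(H):
    # Eq. (6): keep the eigenvectors, replace every eigenvalue by its modulus.
    lam, U = np.linalg.eigh(H)
    return (U * np.abs(lam)) @ U.T

def cut_off(H):
    # Contrast case: drop projections onto eigenvectors with negative eigenvalues.
    lam, U = np.linalg.eigh(H)
    return (U * np.clip(lam, 0.0, None)) @ U.T
```

Either result is positive semi-definite and can therefore take the place of the inner product matrix in the Wolfe dual (4); only the flipped version retains the information carried by the axes of negative signature.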
\n\n4 Learning a Linear Decision Function in Proximity Space\n\nIn order to cope with general proximity data (case (iii) of Section 2) let the training\nset $S$ be given by an $\\ell \\times \\ell$ proximity matrix $P$ whose elements $p_{ij} = p(x_i, x_j)$ are\nthe pairwise proximity values between data items $x_i$, $i = 1, \\ldots, \\ell$, and a target class\n$y_i \\in \\{-1, +1\\}$ for each data item. Let us assume that the proximity values satisfy\nreflexivity, $p_{ii} = 0$, $\\forall i$, and symmetry, $p_{ij} = p_{ji}$, $\\forall i, j$. We can make a linear model\nfor the classification of a new data item $x$ represented by a vector of proximities\n$p = (p_1, \\ldots, p_\\ell)^T$, where $p_i = p(x, x_i)$ are the proximities of $x$ w.r.t. the items $x_i$\nin the training set,\n\n$$y(x) = \\mathrm{sign}(p^T w + b). \\tag{7}$$\n\nComparing (7) to (2) we note that this is equivalent to using the vector of proximities\n$p$ as the feature vector $x$ characterizing data item $x$. Consequently, the OHC\nalgorithm from the previous section can be used to learn a proximity model when\n$x$ is replaced by $p$ in (2), $X^T X$ is replaced by $P^2$ in the Wolfe dual (4), and the\ncolumns $p_i$ of $P$ serve as the training data.\nNote that the formal correspondence does not imply that the columns of the proximity\nmatrix are Euclidean feature vectors as used in the SV setting. We merely\nconsider a linear threshold model on the proximities of a data item to all the training\ndata items. Since the Hessian of the QP problem (4) is the square of the proximity\nmatrix, it is always at least positive semi-definite, which guarantees a unique solution\nof the QP problem.
Once the optimal coefficients $\\alpha_i^*$ have been found, a test\ndata item can be classified by determining its proximities $p_i$ from the elements $x_i$ of\nthe training set and by using conditions (2) together with (5) for its classification.\n\n5 Metric Proximities\n\nLet us consider two examples in order to see what learning on pairwise metric data\namounts to. The first example is the minimalistic 0-1 metric, which for two objects\n$x_i$ and $x_j$ is defined as follows:\n\n$$p_0(x_i, x_j) = \\begin{cases} 0 & \\text{if } x_i = x_j \\\\ 1 & \\text{otherwise} \\end{cases}. \\tag{8}$$\n\nFigure 2: Decision functions in a simple two-class classification problem for different\nMinkowski metrics. The algorithm described in Sect. 4 was applied with (a) the\ncity-block metric ($r = 1$), (b) the Euclidean metric ($r = 2$), and (c) the maximum\nmetric ($r \\to \\infty$). The three metrics result in considerably different generalization\nbehavior, and use different Support Vectors (circled).\n\nThe corresponding $\\ell \\times \\ell$ proximity matrix $P_0$ has full rank as can be seen from its\nnon-vanishing determinant $\\det(P_0) = (-1)^{\\ell-1}(\\ell - 1)$.
From the definition of the\n0-1 metric it is clear that every data item $x$ not contained in the training set is\nrepresented by the same proximity vector $p = 1$, and will be assigned to the same\nclass. For the 0-1 metric the QP problem (4) can be solved analytically by matrix\ninversion, and using $P_0^{-1} = (\\ell - 1)^{-1} 1 1^T - I$ we obtain for the classification\n\nThis result means that each new data item is assigned to the majority class of\nthe training sample, which is - given the available information - the Bayes optimal\ndecision. This example demonstrates how the prior information - in the case of the\n0-1 metric the minimal information of identity - is encoded in the chosen distance\nmeasure.\n\nAs an easy-to-visualize example of metric distance measures on vectors $x \\in \\mathbb{R}^n$ let\nus consider the Minkowski $r$-metrics defined for $r \\geq 1$ as\n\n$$p_r(x_i, x_j) = \\left(\\sum_{k=1}^{n} |x_{ik} - x_{jk}|^r\\right)^{1/r}. \\tag{10}$$\n\nFor $r = 2$ the Minkowski metric is equivalent to the Euclidean distance. The case\n$r = 1$ corresponds to the so-called city-block metric, in which the distance is given\nby the sum of absolute differences for each feature. On the other extreme, the maximum\nnorm, $r \\to \\infty$, takes only the largest absolute difference in feature values as\nthe distance between objects. Note that with increasing $r$ more weight is given to\nthe larger differences in feature values, and that in the literature on multidimensional\nscaling [1] Minkowski metrics have been used to examine the dominance of\nfeatures in human perception. Using the Minkowski metrics for classification in a\ntoy example, we observed that different values of $r$ lead to very different generalization\nbehavior on the same set of data points, as can be seen in Fig. 2.
Since there\nis no a priori reason to prefer one metric over the other, using a particular metric is\nequivalent to incorporating prior knowledge into the solution of the problem.\n\nSize of Class \nOHC-cut-off \nOHC-flip-axis \nOHC-proximity  3.08 \n1-NN \n5.82 \n2-NN \n6.09 \n5.29 \n3-NN \n4-NN \n6.45 \n5.55 \n5-NN \n\n3.08 \n4.62 \n3.08  1.54 \n4.62 \n6.00 \n4.46 \n2.29 \n5.14 \n2.75 \n\n6.15 \n4.62 \n3.08 \n6.09 \n7.91 \n4.18 \n3.68 \n2.72 \n\n3.08 \n3.08 \n1.54 \n6.74 \n5.09 \n4.71 \n5.17 \n5.29 \n\n4.01 \n0.91 \n0.91 \n4.01 \n0.45  3.60 \n3.66 \n1.65 \n5.27 \n2.01 \n6.34 \n2.14 \n2.46 \n5.13 \n5.09 \n1.65 \n\n0.45 \n0.45 \n0.45 \n0.00 \n0.00 \n0.00 \n0.00 \n0.00 \n\n0.00 \n0.00 \n0.00 \n2.01 \n3.44 \n2.68 \n4.87 \n4.11 \n\nTable 1: Classification results for Cat Cortex and Protein data. Bold numbers\nindicate best results.\n\n6 Real-World Proximity Data\n\nIn the numerical experiments we focused on two real-world data sets, which are both\ngiven in terms of a proximity matrix $P$ and class labels $y$ for each data item. The\ndata set called \"cat cortex\" consists of a matrix of connection strengths between\n65 cortical areas of the cat. The data was collected by Scannell [7] from text\nand figures of the available anatomical literature and the connections are assigned\nproximity values $p$ as follows: self-connection ($p = 0$), strong and dense connection\n($p = 1$), intermediate connection ($p = 2$), weak connection ($p = 3$), and absent or\nunreported connection ($p = 4$). From functional considerations the areas can be\nassigned to four different regions: auditory (A), visual (V), somatosensory (SS),\nand frontolimbic (FL). The classification task is to discriminate between these four\nregions, each time one against the three others.
\nThe second data set consists of a proximity matrix from the structural comparison of\n224 protein sequences based upon the concept of evolutionary distance. The majority\nof these proteins can be assigned to one of four classes of globins: hemoglobin-$\\alpha$\n(H-$\\alpha$), hemoglobin-$\\beta$ (H-$\\beta$), myoglobin (M), and heterogeneous globins (GH). The\nclassification task is to assign proteins to one of these classes, one against the rest.\nWe compared three different procedures for the described two-class classification\nproblems, performing leave-one-out cross-validation for the \"cat cortex\" data set\nand 10-fold cross-validation for the \"protein\" data set to estimate the generalization\nerror. Table 1 shows the results. OHC-cut-off refers to the simple method\nof making the inner product matrix $H$ positive semi-definite by neglecting projections\nto those eigenvectors with negative eigenvalues. OHC-flip-axis flips the axes\nof negative signature as described in (6) and thus preserves the information contained\nin those directions for classification. OHC-proximity, finally, refers to the\nmodel linear in the proximities as introduced in Section 4. It can be seen that\nOHC-proximity shows a better generalization than OHC-flip-axis, which in turn\nperforms slightly better than OHC-cut-off. This is especially the case on the cat\ncortex data set, whose inner product matrix $H$ has negative eigenvalues. For comparison,\nthe lower part of Table 1 shows the corresponding cross-validation results\nfor K-nearest-neighbor, which is a natural choice to use, because it only needs the\npairwise proximities to determine the training data to participate in the voting.
\nThe presented algorithms OHC-flip-axis and OHC-proximity perform consistently\nbetter than K-nearest-neighbor, even when the value of K is optimally chosen.\n\n7 Conclusion and Future Work\n\nIn this contribution we investigated the nature of proximity data and suggested\nways for performing classification on them. Due to the generality of the proximity\napproach we expect that many other problems can be fruitfully cast into this\nframework. Although we focused on classification problems, regression can be considered\non proximity data in an analogous way. Noting that Support Vector kernels\nand covariance functions for Gaussian processes are similarity measures for vector\nspaces, we see that this approach has recently gained a lot of popularity. However,\none problem with pairwise proximities is that their number scales quadratically\nwith the number of objects under consideration. Hence, for large scale practical\napplications the problems of missing data and active data selection for proximity\ndata will be of increasing importance.\n\nAcknowledgments\n\nWe thank Prof. U. Kockelkorn for fruitful discussions. We also thank S. Gunn for\nproviding his Support Vector implementation. Finally, we are indebted to M. Vingron\nand T. Hofmann for providing the protein data set. This project was funded\nby the Technical University of Berlin via the Forschungsinitiativprojekt FIP 13/41.\n\nReferences\n\n[1] I. Borg and J. Lingoes. Multidimensional Similarity Structure Analysis, volume 13\nof Springer Series in Statistics. Springer-Verlag, Berlin, Heidelberg,\n1987.\n\n[2] B. Boser, I. Guyon, and V. N. Vapnik.
A training algorithm for optimal margin\nclassifiers. In Proceedings of the Fifth Annual Workshop on Computational\nLearning Theory, pages 144-152, 1992.\n\n[3] L. Goldfarb. Progress in Pattern Recognition, volume 2, chapter 9: A New\nApproach To Pattern Recognition, pages 241-402. Elsevier Science Publishers,\n1985.\n\n[4] T. Graepel and K. Obermayer. A stochastic self-organizing map for proximity\ndata. Neural Computation (accepted for publication), 1998.\n\n[5] T. Hofmann and J. Buhmann. Pairwise data clustering by deterministic annealing.\nIEEE Transactions on Pattern Analysis and Machine Intelligence,\n19(1):1-14, 1997.\n\n[6] H. Klock and J. M. Buhmann. Multidimensional scaling by deterministic annealing.\nIn M. Pelillo and E. R. Hancock, editors, Energy Minimization Methods\nin Computer Vision and Pattern Recognition, volume 1223, pages 246-260,\nBerlin, Heidelberg, 1997. Springer-Verlag.\n\n[7] J. W. Scannell, C. Blakemore, and M. P. Young. Analysis of connectivity in\nthe cat cerebral cortex. The Journal of Neuroscience, 15(2):1463-1483, 1995.\n\n[8] W. S. Torgerson. Theory and Methods of Scaling. Wiley, New York, 1958.\n\n[9] V. Vapnik. The Nature of Statistical Learning. Springer-Verlag, Berlin, Heidelberg,\nGermany, 1995.\n\n[10] D. Weinshall, D. W. Jacobs, and Y. Gdalyahu. Classification in non-metric\nspace. In Advances in Neural Information Processing Systems, volume 11, 1999.\nin press.\n", "award": [], "sourceid": 1571, "authors": [{"given_name": "Thore", "family_name": "Graepel", "institution": null}, {"given_name": "Ralf", "family_name": "Herbrich", "institution": null}, {"given_name": "Peter", "family_name": "Bollmann-Sdorra", "institution": null}, {"given_name": "Klaus", "family_name": "Obermayer", "institution": null}]}