{"title": "A Randomized Algorithm for Pairwise Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 424, "page_last": 430, "abstract": null, "full_text": "A  Randomized  Algorithm  for  Pairwise  Clustering \n\nYoram  Gdalyahu, Daphna Weinshall, Michael Werman \n\nInstitute of Computer Science,  The Hebrew  University,  91904  Jerusalem, Israel \n\n{yoram,daphna,werman}@cs.huji.ac.il \n\nAbstract \n\nWe present a stochastic clustering algorithm based on pairwise sim(cid:173)\nilarity  of datapoints.  Our  method  extends  existing  deterministic \nmethods, including agglomerative algorithms, min-cut graph  algo(cid:173)\nrithms,  and  connected  components.  Thus  it  provides  a  common \nframework for  all  these  methods.  Our  graph-based  method  differs \nfrom  existing  stochastic  methods  which  are  based  on  analogy  to \nphysical  systems.  The  stochastic  nature  of our  method  makes  it \nmore  robust  against  noise,  including  accidental  edges  and  small \nspurious clusters.  We demonstrate the superiority of our algorithm \nusing  an example with  3 spiraling bands and  a  lot of noise. \n\n1 \n\nIntroduction \n\nClustering algorithms can  be divided into two categories:  those  that  require  a  vec(cid:173)\ntorial  representation  of the data, and those which  use only pairwise representation. \nIn the former case, every data item must be represented  as a vector in a real normed \nspace,  while  in  the  second  case  only  pairwise  relations  of similarity or  dissimilar(cid:173)\nity  are  used.  The  pairwise  information  can  be  represented  by  a  weighted  graph \nG(V, E):  the nodes  V  represent  data items, and  the  positive weight  Wij  of an edge \n(i, j) representing  the  amount of similarity or  dissimilarity between  items i  and  j. \nThe graph G  might not be a complete graph.  In the rest of this paper Wij  represents \na  similarity value. \n\nA  vectorial  representation  is  very  convenient  when  one  has  either  an  explicit  or \nan  implicit  parametric  model  for  the  data.  An  implicit  model  means  that  the \ndata distribution function  is  not  known,  but  it is  assumed,  e.g.,  that every  cluster \nis  symmetrically  distributed  around  some  center.  An  explicit  model  specifically \ndescribes  the shape of the distribution  (e.g.,  Gaussian).  In these cases,  if a vectorial \nrepresentation is available, the clustering procedure may rely on iterative estimation \nof means  (e.g.,  [2,  8]). \n\nIn  the absence  of a  vectorial  representation,  one can either  try to embed the graph \nof distances  in  a  vector  space,  or  use  a  direct  pairwise  clustering  method.  The \n\n\fA Randomized Algorithm for Pairwise Clustering \n\n425 \n\nembedding problem  is  difficult,  since  it  is  desirable  to  use  a  representation  that  is \nboth low  dimensional and has a low distortion of distances  [6,  7,  3].  Moreover, even \nif such  embedding  is  achieved,  it  can  help  to  cluster  the  data only  if at  least  an \nimplicit  parametric  model  is  valid.  Hence,  direct  methods for  pairwise clustering \nare of great  value. \n\nOne strategy  of pairwise clustering  is  to  use  a  similarity threshold  (),  remove edges \nwith  weight  less  than  (),  and  identify  the  connected  components  that  remain  as \nclusters.  A transformation of weights may precede the thresholding!.  The physically \nmotivated transformation in  [1]  uses  a granular magnet model and replaces  weights \nby  \"spin correlations\".  Our  algorithm is  similar to this  model, see  Section 2.4. \n\nA second pairwise clustering strategy is used by agglomerative algorithms [2],  which \nstart with  the trivial partition of N  points into N  clusters of size one,  and continue \nby  subsequently  merging  pairs  of clusters.  At  every  step  the  two  clusters  which \nare  most similar are  merged  together,  until  the  similarity of the  closest  clusters  is \nlower  than  some  threshold.  Different  similarity measures  between  clusters  distin(cid:173)\nguish  between  different  agglomerative algorithms.  In  particular,  the single linkage \nalgorithm defines  the similarity between clusters as the maximal similarity between \ntwo of their members, and  the complete linkage algorithm uses  the minimal value. \n\nA  third  strategy  of pairwise  clustering  uses  the  notion  of cuts  in  a  graph.  A  cut \n(A, B)  in  a  graph  G(V, E)  is  a  partition of V  into  two  disjoint  sets  A  and  B.  The \ncapacity  of the  cut  is  the  sum  of weights  of all  edges  that  cross  the  cut,  namely: \nc(A, B)  = 2:iEA,jEB Wij.  Among  all  the  cuts  that  separate  two  marked  vertices, \nthe  minimal cut is the one which  has  minimal capacity.  The minimal cut clustering \nalgorithm [11]  divides the graph into components using a cascade of minimal cuts2 . \nThe normalized cut algorithm [9]  uses  the association of A  (sum of weights incident \non  A)  and the association of B  to  normalize the capacity c(A, B).  In contrast with \nthe easy min-cut problem, the problem of finding  a minimal normalized cut  (Ncut) \nis  NP-hard,  but with certain  approximations it  reduces  to a  generalized  eigenvalue \nproblem  [9]. \n\nOther pairwise clustering methods include techniques of non parametric density es(cid:173)\ntimation [4]  and pairwise deterministic annealing [3].  However,  the three categories \nof methods above are of special  importance to us,  since our current  work  provides a \ncommon framework for  all of them.  Specifically, our new  algorithm may be viewed \nas a  randomized version of an agglomerative clustering procedure,  and in  the same \ntime  it  generalizes  the  minimal  cut  algorithm.  It is  also  strongly  related  to  the \nphysically motivated granular magnet model algorithm.  By showing the connection \nbetween  these  methods, which  may seem  very  different  at a first  glance,  we  provide \na  better  understanding of pairwise clustering. \n\nOur method is  unique in its stochastic nature while provenly maintaining low  com(cid:173)\nplexity.  Thus our method performs as well  as the aforementioned methods in \"easy\" \ncases,  while  keeping  the  good  performance  in  \"difficult\"  ,cases.  In  particular,  it  is \nmore robust  against  noise  and pathological configurations:  (i)  A  minimal cut algo(cid:173)\nrithm is  intuitively reasonable since  it  optimizes so  that  as  much of the similarity \n\n1 For example,  the  mutual' neighborhood  clustering  algorithm  [10]  substitutes  the edge \nweight  W,}  with a  new  weight w:}  =  m + n  where  i  is  the  m th  nearest neighbor  of j  and j \nis  the  nth  nearest neighbor  of i. \n\n2The reader who is familiar  with flow  theory may notice that this algorithm also belongs \nto the first  category  of methods,  as it is equivalent  to a  weight  transformation  followed  by \nthresholding.  The weight  transformation replaces  Wi}  by  the  maximal  flow  between  i  and \nJ. \n\n\f426 \n\nY.  Gdalyahu,  D.  Weins hall and M  Werman \n\nweight  remains within the parts of the clusters,  and  as little as possible is  \"wasted\" \nbetween  the clusters.  However,  it  tends  to  fail  when  there  is  no  clean  separation \ninto 2  parts,  or when  there  are  many small spurious  parts due,  e.g.,  to  noise.  Our \nstochastic  approach  avoids  these  problems  and  behaves  more  robustly. \n(ii)  The \nsingle  linkage algorithm deals well  with  chained  data,  where  items  in  a  cluster  are \nconnected  by  transitive  relations.  Unfortunately  the  deterministic  construction  of \nchains  can  be  harmful  in  the  presence  of noise,  where  a  few  points  can  make  a \n\"bridge\"  between  two  large clusters  and  merge  them  together.  Our  algorithm in(cid:173)\nherits the ability to cluster chained  data;  at the same time it is  robust against such \nnoisy bridges as long as the probability to select  all the edges  in  the bridge remains \nsmall. \n\n2  Stochastic pairwise  clustering \n\nOur randomized clustering algorithm is  constructed  of two  main steps: \n\n1.  Stochastic  partition  of  the  similarity  graph  into  r  parts  (by  randomized \n\nagglomeration).  For each  partition index  r  (r = N  ... 1): \n(a)  for  every  pair of points,  the probability that they  remain in  the same \n\npart  is computed; \n\n(b)  the weight of the edge between  the two points is  replaced  by this prob(cid:173)\n\nability; \n\n(c)  clusters  are formed  using connected  components and  threshold  of 0.5. \nThis is  described  in  Sections  2.1  and 2.2. \n\n2.  Selection  of proper  r  values,  which  reflect  \"interesting\"  structure  m  our \n\nproblem.  This is  described  in  Section 2.3. \n\n2.1  The similarity transformation \n\nAt  each  level  r,  our  algorithm  performs  a  similarity  transformation  followed  by \nthresholding.  In  introducing  this  process,  our  starting  point  is  a  generalization  of \nthe minimal cut algorithm; then we  show how  this generalization is obtained by the \nrandomization of a  single linkage algorithm. \n\nFirst,  instead  of  considering  only  the  minimal  cuts,  let  us  induce  a  probability \ndistribution  on  the  set  of  all  cuts.  We  assign  to  each  cut  a  probability  which \ndecreases  with increasing capacity.  Hence the minimal cut is  the most probable cut \nin the graph,  but it does  not  determine the graph partition on its own. \n\nAs  a  second  generalization  to  the  min-cut  algorithm  we  consider  multi-way cuts. \nAn  r-way  cut is  a  partition of G  into r  connected  components.  The capacity of an \nr-way cut  is  the  sum of weights  of all  edges  that connect  different  components.  In \nthe rest  of this paper we  may refer  to r-way cuts  simply as  \"cuts\". \n\nUsing  the  distribution  induced  on  r-way  cuts,  we  apply  the  following  family  of \nweight  transformations.  The weight  Wij  is  replaced  by  the probability that  nodes  i \nand  j  are in the same side of a  random  r-way cut:  Wij  -+ pij'  This transformation \nis  defined  for  every  integer  r  between  1 and  N. \n\nSince  the  number  of cuts  in  a  graph  is  exponentially  large,  one must  ask  whether \npi\u00b7  is computable.  Here  the decaying  rate of the cut  probability plays an essential \nr01e.  The induced probability is found  to decay fast enough with the capacity, hence \npij  is  dominated  by  the  low  capacity  cuts.  Thus,  since  there  exists  a  polynomial \n\n\fA Randomized Algorithm/or Pairwise Clustering \n\n427 \n\nbound on  the  number  of low  capacity cuts  in  any  graph  [5],  the  problem  becomes \ncomputable. \n\nThis strong property suggests  a  sampling scheme  to estimate the pairing probabil(cid:173)\nities.  Assume  that  a  sampling tool  is  available,  which  generates  cuts  according  to \ntheir  probability.  Under  this condition,  a  sample of polynomial size  is  sufficient  to \nestimate the P'ij's. \nThe sampling tool that we use is called the \"contraction algorithm\" [5].  Its discovery \nled  to an efficient probabilistic algorithm for  the minimal cut problem.  It was shown \nthat for a given r, the probability that the contraction algorithm returns the minimal \nr-way cut of any graph is  at least N- 2(r-l), and it decays with increasing capacity3. \nFor a  graph which  is  really  made of clusters  this is  a  rough  underestimation. \n\nThe contraction algorithm can  be  implemented  in  several  ways.  We  describe  here \nits simplest form,  which  is  constructed  from  N-l  edge  contraction steps.  Each edge \ncontraction follows  the procedure  below: \n\n\u2022  Select  edge  (i, j) with  probability proportional to Wij. \n\u2022  Replace  nodes  i and  j  by  a  single node  {ij}. \n\u2022  Let the set of edges incident on {ij} be the union of the sets of edges incident \non  i  and  j,  but  remove self loops formed  by  edges  originally connecting  i \nto j. \n\nIt is  shown  in  [5]  that each  step of edge contraction  can  be  implemented in  O(N) \ntime, hence this simple form of the contraction algorithm has complexity of O(N2). \nFor  sparse graphs an O(N log N)  implementation can be shown. \n\nThe  contraction  algorithm  as  described  above  is  a  randomized  version  of the  ag(cid:173)\nglomerative single  linkage procedure.  If the  probabilistic selection  rule  is  replaced \nby  a  greedy  selection  of the  maximal weight  edge,  the single  linkage  algorithm is \nobtained. \nIn  terms of similarity transformations,  a  single  linkage algorithm which  halts  with \nr  clusters  may  be  associated  with  the  transformation  Wij~O,1 (1  if  i  and  j  are \nreturned  at  the  same  cluster,  0  otherwise).  Our  similarity  transformation  (P'ij) \nuses  the  expected  value  (or  the  average)  of of  this  binary  assignment  under  the \nprobabilistic relaxation of the selection  rule. \n\nWe could estimate pij by repeating the contraction algorithm M  times and averaging \nthese binary indicators (a better way is  described  belowJ.  Using Chernoff inequality \nit can  be shown4  that if M  2:  (21n 2 + 4ln N - 2ln Il) / t \nthen each P'ij  is  estimated, \nwith  probability 2:1  - Il,  within  t  from  its true value. \n\n2.2  Construction of partitions \n\nTo  compute a  partition  at  every  r  level,  it  is  sufficient  to  know  for  every  i-j  pair \nwhich  r  satisfies P'ij  =  0.5. \nThis  is  found  by  repeating  the  contraction  algorithm  M  times.  In  each  iteration \nthere  exists  a  single  r  at  which  the  edge  between  points  i -\nj  is  marked  and  the \npoints are merged.  Denote by rm  the level r  which joins i and j  at the m-th iteration \n(m =  1 ... M).  The median r' of the sequence {rl' r2  ... r M  }  is the sample estimate \n\n3The exact  decay  rate is  not known,  but found  experimentally  to be adequate.  Other(cid:173)\n\nwise  we  would  ignore cuts generated  with high  capacity. \n\n4Thanks  to  Ido Bergman for pointing  this  out. \n\n\f428 \n\nY.  Gdalyahu,  D.  Weins hall and M  Werman \n\nfor  the level  r  that satisfies  pij  =  0.5.  We  use  an on-line technique  (not  described \nhere)  to estimate the median r' using constant  and small memory. \n\nHaving  computed  the  matrix  r',  where  the  entry  r;j  is  the  estimator  for  r  that \nsatisfies  pij  = 0.5,  we  find  the  connected  components  at  a  given  r  value  after \ndisconnecting  every  edge  (i,j)  for  which  r~j > r.  This gives  the  r  level  partition. \n\n2.3  Hierarchical clustering \n\nWe  now  address  the  problem of choosing  \"good\"  r  values. \n\nThe transformed weight  pij  has  the advantage of reflecting  transitive relations  be(cid:173)\ntween  data items i  and  j.  For  a  selected  value  of r  (which  defines  a  specification \nlevel) the partition of data items into clusters is obtained by eliminating edges whose \nweight  (pij)  is  less  than  a  fixed  threshold  (0.5).  That is:  nodes  are  assigned  to the \nsame cluster if at level  r  their probability to be on the same side of a  random r-way \ncut is  larger  than half. \n\nPartitions which  correspond  to  subsequent  r  values  might  be  very  similar to each \nother,  or  even  identical,  in  the  sense  that  only  a  few  nodes  (if at  all)  change  the \ncomponent to which  they  belong.  Events  which  are of interest,  therefore,  are when \nthe  variation  between  subsequent  partitions  is  of the  order  of the  size  of a  clus(cid:173)\nter .  This  typically  happens  when  two  clusters  combine  to  form  one  cluster  which \ncorresponds  to a  higher  scale  (less  resolution). \n\nIn  accordance,  using the  hierarchical  partition obtained in Section 2.2,  we  measure \nthe  variation  between  subsequent  partitions  by  L~=l ~Nk ,  where  J{  is  a  small \nconstant  (of  the  order  of the  number  of  clusters)  and  Nk  is  the  size  of  the  kth \nlargest component of the partition. \n\n2.4  The granular magnet model \n\nOur  algorithm  is  closely  related  to  the  successful  granular  magnet  model  recently \nproposed  in  [1].  However, the  two  methods draw  the random cuts  effectively  from \ndifferent  distributions.  In our case  the distribution is  data driven,  imposed  by  the \ncontraction  algorithm .  The  physical  model  imposes  the  Boltzmann  distribution, \nwhere  a  cut  of capacity  E  is  assigned  a  probability  proportional  to  exp(-E/T) , \nand T  is  a  temperature parameter. \nThe probability P&  measures  whether  nodes  i  and  j  are on  the  same side  of a  cut \nat temperature T  (originally called  \"spin-spin correlation function\" ).  The magnetic \nmodel  uses  the  similarity transformation Wij  --+ P&  and  a  threshold  (0.5)  to  break \nthe  graph  into components.  However,  even  if identical  distributions were  used, P& \nis  inherently  different  from  pij  since  at  a  fixed  temperature  the  random cuts  may \nhave different  numbers of components. \n\nSuperficially,  the parameter T  plays in the magnetic model a  similar role  to our pa(cid:173)\nrameter  r.  But  the two parameterizations are quite different .  First,  r  is  a  discrete \nparameter while T  is  a continuous one.  Moreover,  in order to find  the pairing prob(cid:173)\nabilities P'fJ..  for  different  temperatures,  the  stochastic  process  should  be  employed \nfor  every  '1 '  value  separately.  On  the  other  hand ,  our  algorithm estimates  pij  for \nevery  1 ::;  r  ::;  N  at  once.  For  hard  clustering  (v.s.  soft  clustering)  it  was  shown \nabove that even  this is not necessary,  since we can get a direct  estimation of r  which \nsatisfies pij =  0.5 . \n\n\fA Randomized Algorithm/or Pairwise Clustering \n\n429 \n\n3  Example \n\nPairwise  clustering  has  the  advantage  that  a  vectorial  representation  of the  data \nis  not  needed.  However,  graphs  of distances  are  hard  to  visualize  and  we  there(cid:173)\nfore  demonstrate  our  algorithm  using  vectorial  data.  In  spite  of having  vectorial \nrepresentation,  the information which is  made available to the clustering algorithm \nincludes only  the  matrix of pairwise  Euclidean  distances 5  dij .  Since our  algorithm \nworks  with  similarity  values  and  not  with  distances,  it  is  necessary  to  invert  the \ndistances  using  Wij  = f(d ij ).  We choose  f  to be similar to the function  used  in  [1]: \nWij  = exp( -drj / a2 )  where  a  is  the  average  distance  to  the  n-th  nearest  neighbor \n(we  used  n=10,  but  the  results  remain  the  same  as  long  as  a  reasonable  value  is \nselected). \n\nbl'---____  -.J \n\nFigure  1:  The  2000  data points  (left),  and  the  three  most  pronounced  hierarchical  levels \nof clustering  (right).  At  r=353  the  three  spirals  form  one  cluster  (figure  a) .  This  cluster \nsplits  at  r=354  into  two  (figures  b1,b2),  and into  three  parts  at  r=368  (figures  c1,c2,c3) . \nThe background  points  form  isolated  clusters,  usualy  of size  1 (not shown). \n\nFigure  1 shows  2000  data points  in  the  Euclidean plane.  In  the stochastic stage of \nthe  algorithm  we  used  only  200  iterations  of graph  contraction,  during  which  we \nestimated for  every pair i-j the value of r  which satisfies pij = 0.5  (see  Section 2.2). \nAs expected, subsequent partitions are typically identical or differ only slightly from \neach  other  (Figure  2).  The  variation between subsequent  partitions was  measured \nusing the  10  largest  parts (I{ = 10, see  Section 2.3).  The results  did not depend on \nthe exact  value of J{  since  the sum was  dominated  by its first  terms. \n\nAt low  r  values  (partition into a small number of components) a  typical partition is \ncomposed of one giant component and a few  tiny components that capture isolated \nnoise  points.  The incorporation  of these  tiny  components  into  the  giant  one  pro(cid:173)\nduce  negligible  variations  between  subsequent  partitions.  At  high  r  values  all  the \ncomponents  are  small , and  therefore  the  variation  between  subsequent  partitions \nmust decay.  At  intermediate r  values  a  small number of sharp peaks  appear. \n\nThe two  highest  peaks in Figure 2 are at r=354  and  r=368;  they  mark meaningful \nhierarchies  for  the  data clustering,  as  shown  in  Figure  1.  We  compare our  results \nwith  two other methods in  Figures  3 and 4. \n\n5The  vectorial  representation  of data points  is  not  useful  even  if it  was  available,  since \n\nthe  parametric  model  is  not  known  (see  Section  1) \n\n\f430 \n\nY.  Gdalyahu.  D.  Weinshall and M.  Werman \n\n1200_-------------, \n\n1000 \n\nBOO \n\n600 \n\n400 \n\nFigure  2:  The  variation  between  subse(cid:173)\nquent  partitions  (see  text)  as  a  function  of \nthe  number  of components  (r).  The  vari(cid:173)\nation  is  computed  for  every  integer  r  (the \nspacing  between peaks is  not due  to sparse \nsampling).  Outside the displayed range  the \nvariation  vanishes. \n\nFigure  3:  The best  bi-partition  according \nto  the  normalized  cut  algorithm  [9).  Since \nthe first  partition  breaks one  of the spirals, \na  satisfactory  solution  cannot  be  achieved \nin  any  of the later stages. \n\nFigure  4:  A  three  (macro(cid:173)\nscopic)  clusters  partition  by \na  deterministic  single  linkage \nalgorithm.  The  probabilistic \nscheme  avoids  the  \"bridging \nthanks  to  the  small \neffect\" \nprobability  of  selecting \nthe \nparticular  chain  of edges. \n\nReferences \n\n[1]  Blatt  M.,  Wiseman  S.  and  Domany  E.,  \"Data  clustering  usmg  a  model  granular \n\nmagnet\",  Neural  Computation 9,  1805-1842,  1997. \n\n[2]  Duda O. and Hart E.,  \"Pattern classification and scene analysis\", Wiley-Interscience, \n\nNew  York,  1973. \n\n[3]  Hofmann T. and Buhmann J.,  \"Pairwise data clustering  by  deterministic  annealing\", \n\nPAMI  19,  1-14,  1997. \n\n[4]  Jain  A.  and  Dubes R.,  \"Algorithms for  clustering data\",  Prentice Hall,  NJ,  1988. \n[5]  Karger  D.,  \"A  new  approach  to  the  minimum  cut  problem\",  Journal  of the  ACM, \n\n43(4)  1996. \n\n[6]  Klock  H.  and Buhmann  J.,  \"Data visualization  by multidimensional  scaling:  a  deter(cid:173)\n\nministic  annealing  approach\",  Technical Report  IAI-TR-96-8,  Institut  fur Informatik \nIII,  University  of Bonn.  October 1996. \n\n[7]  Linial  N.,  London  E.  and  Rabinovich  Y.,  \"The  geometry  of graphs  and  some  of its \n\nalgorithmic  applications\",  Combinatorica  15,  215-245,  1995. \n\n[8]  Rose  K.,  Gurewitz  E.  and  Fox  G.,  \"Constrained  clustering  as  an  optimization \n\nmethod\",  PAMI 15,  785-794,  1993. \n\n[9]  Shi  J.  and  Malik  J.,  \"Normalized  cuts  and image  segmentation\",  Proc.  CVPR,  731-\n\n737,  1997. \n\n[10]  Smith  S.,  \"Threshold  validity  for  mutual neighborhood  clustering\",  PAMI  15,  89-92, \n\n1993. \n\n[ll]  Wu Z.  and Leahy R.,  \"An optimal graph theoretic approach to data clustering:  theory \n\nand  its application  to image segmentation\",  PAMI  15,  1l01-1113,  1993. \n\n\f", "award": [], "sourceid": 1593, "authors": [{"given_name": "Yoram", "family_name": "Gdalyahu", "institution": null}, {"given_name": "Daphna", "family_name": "Weinshall", "institution": null}, {"given_name": "Michael", "family_name": "Werman", "institution": null}]}