{"title": "Spectral Relaxation for K-means Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 1057, "page_last": 1064, "abstract": null, "full_text": "Spectral  Relaxation  for  K-means \n\nClustering \n\nHongyuan  Zha &  Xiaofeng He \n\nDept.  of Compo  Sci.  &  Eng. \n\nThe Pennsylvania State University \n\nUniversity Park, PA  16802 \n\n{zha,xhe}@cse.psu.edu \n\nChris  Ding &  Horst  Simon \n\nNERSC Division \n\nLawrence Berkeley National Lab. \nUC  Berkeley, Berkeley,  CA  94720 \n\n{chqding,hdsimon}@lbl.gov \n\nMing  Gu \n\nDept.  of Mathematics \n\nUC  Berkeley,  Berkeley,  CA  95472 \n\nmgu@math.berkeley.edu \n\nAbstract \n\nThe popular K-means  clustering partitions a  data set by minimiz(cid:173)\ning  a  sum-of-squares cost function.  A  coordinate descend  method \nis  then used to find  local minima.  In  this  paper we  show that the \nminimization can be reformulated as a trace maximization problem \nassociated with the Gram matrix of the data vectors.  Furthermore, \nwe  show that a  relaxed version of the trace maximization problem \npossesses  global  optimal solutions  which  can be obtained by  com(cid:173)\nputing a  partial eigendecomposition  of the  Gram matrix,  and the \ncluster assignment for  each data vectors  can be found  by  comput(cid:173)\ning  a  pivoted  QR decomposition  of the  eigenvector  matrix.  As  a \nby-product  we  also  derive  a  lower bound  for  the  minimum  of the \nsum-of-squares cost function. \n\n1 \n\nIntroduction \n\nK-means  is  a  very  popular method for  general clustering  [6].  In  K-means  clusters \nare represented by centers of mass of their members,  and it  can be shown that the \nK-means  algorithm  of alternating  between  assigning  cluster  membership  for  each \ndata vector to the nearest  cluster center and computing the  center of each  cluster \nas the centroid of its member data vectors is equivalent to finding the minimum of a \nsum-of-squares cost function using coordinate descend.  Despite the popularity of K(cid:173)\nmeans clustering, one of its major drawbacks is  that the coordinate descend search \nmethod is prone to local minima.  Much research has been done on computing refined \ninitial points and adding explicit  constraints to the sum-of-squares cost function for \nK-means clustering so  that the search can converge to better local minimum  [1 ,2]. \nIn  this  paper  we  tackle  the  problem from  a  different  angle:  we  find  an equivalent \nformulation  of  the  sum-of-squares  minimization  as  a  trace  maximization  problem \nwith  special  constraints;  relaxing the  constraints leads to  a  maximization problem \n\n\fthat  possesses  optimal  global  solutions.  As  a  by-product  we  also  have  an  easily \ncomputable lower bound for the minimum of the sum-of-squares cost function.  Our \nwork  is  inspired  by  [9,  3]  where  connection  to  Gram  matrix  and  extension  of K(cid:173)\nmeans method to general Mercer kernels  were investigated. \n\nThe rest of the paper is  organized as follows:  in section 2,  we  derive the equivalent \ntrace maximization formulation and discuss its spectral relaxation.  In section 3, we \ndiscuss  how to assign cluster membership using pivoted  QR decomposition,  taking \ninto  account  the  special  structure  of  the  partial  eigenvector  matrix.  Finally,  in \nsection 4, we illustrate the performance of the clustering algorithms using document \nclustering as  an example. \nNotation.  Throughout,  II  . 
The rest of the paper is organized as follows: in Section 2, we derive the equivalent trace maximization formulation and discuss its spectral relaxation. In Section 3, we discuss how to assign cluster membership using a pivoted QR decomposition, taking into account the special structure of the partial eigenvector matrix. Finally, in Section 4, we illustrate the performance of the clustering algorithms using document clustering as an example.

Notation. Throughout, \|\cdot\| denotes the Euclidean norm of a vector. The trace of a matrix A, i.e., the sum of its diagonal elements, is denoted as trace(A). The Frobenius norm of a matrix is \|A\|_F = \sqrt{trace(A^T A)}. I_n denotes the identity matrix of order n.

2 Spectral Relaxation

Given a set of m-dimensional data vectors a_i, i = 1, ..., n, we form the m-by-n data matrix A = [a_1, ..., a_n]. A partition \Pi of the data vectors can be written in the form

    A E = [A_1, ..., A_k],                                                     (1)

where E is a permutation matrix, and A_i is m-by-s_i, i.e., the ith cluster contains the s_i data vectors in A_i. For a given partition \Pi in (1), the associated sum-of-squares cost function is defined as

    ss(\Pi) = \sum_{i=1}^{k} \sum_{s=1}^{s_i} \| a_s^{(i)} - m_i \|^2,  \qquad  m_i = \frac{1}{s_i} \sum_{s=1}^{s_i} a_s^{(i)},

i.e., m_i is the mean vector of the data vectors in cluster i. Let e be a vector of appropriate dimension with all elements equal to one; it is easy to see that m_i = A_i e / s_i and

    ss_i \equiv \sum_{s=1}^{s_i} \| a_s^{(i)} - m_i \|^2 = \| A_i - m_i e^T \|_F^2 = \| A_i (I_{s_i} - e e^T / s_i) \|_F^2.

Noticing that I_{s_i} - e e^T / s_i is a projection matrix, i.e., (I_{s_i} - e e^T / s_i)^2 = I_{s_i} - e e^T / s_i, it follows that

    ss_i = trace( A_i (I_{s_i} - e e^T / s_i) A_i^T ) = trace( (I_{s_i} - e e^T / s_i) A_i^T A_i ).

Therefore,

    ss(\Pi) = \sum_{i=1}^{k} ss_i = \sum_{i=1}^{k} \Big( trace(A_i^T A_i) - (e^T/\sqrt{s_i}) A_i^T A_i (e/\sqrt{s_i}) \Big).

Let the n-by-k orthonormal matrix X be

    X = E \, \mathrm{diag}( e/\sqrt{s_1}, \ldots, e/\sqrt{s_k} ),              (2)

where the ith diagonal block is the length-s_i vector e/\sqrt{s_i}. The sum-of-squares cost function can now be written as

    ss(\Pi) = trace(A^T A) - trace(X^T A^T A X),

and its minimization is equivalent to

    max { trace(X^T A^T A X) | X of the form in (2) }.

REMARK. Without loss of generality, let E = I in (1), and let x_i be the cluster indicator vector of cluster i, i.e.,

    x_i^T = [0, \ldots, 0, \underbrace{1, \ldots, 1}_{s_i}, 0, \ldots, 0].

Then it is easy to see that

    trace(X^T A^T A X) = \sum_{i=1}^{k} \frac{x_i^T A^T A x_i}{x_i^T x_i} = \sum_{i=1}^{k} \frac{\| A x_i \|^2}{\| x_i \|^2}.

Using the partition in (1), the right-hand side of the above can be written as

    \sum_{i=1}^{k} \frac{\| A_i e \|^2}{s_i} = \sum_{i=1}^{k} s_i \| m_i \|^2,

a weighted sum of the squared Euclidean norms of the mean vectors of the clusters.

REMARK. If we consider the elements of the Gram matrix A^T A as measuring similarity between data vectors, then we have shown that Euclidean distance leads to Euclidean inner-product similarity. This inner product can be replaced by a general Mercer kernel, as is done in [9, 3].

Ignoring the special structure of X and letting it be an arbitrary orthonormal matrix, we obtain a relaxed maximization problem

    \max_{X^T X = I_k} trace(X^T A^T A X).                                     (3)

It turns out that the above trace maximization problem has a closed-form solution.

Theorem (Ky Fan). Let H be a symmetric matrix with eigenvalues

    \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n

and corresponding eigenvectors U = [u_1, \ldots, u_n]. Then

    \lambda_1 + \cdots + \lambda_k = \max_{X^T X = I_k} trace(X^T H X).

Moreover, the optimal X^* is given by X^* = [u_1, \ldots, u_k] Q with Q an arbitrary orthogonal matrix.

It follows from the above theorem that we need to compute the k largest eigenvectors of the Gram matrix A^T A. As a by-product, we have

    \min_\Pi ss(\Pi) \ge trace(A^T A) - \max_{X^T X = I_k} trace(X^T A^T A X) = \sum_{i=k+1}^{\min\{m,n\}} \sigma_i^2(A),      (4)

where \sigma_i(A) is the ith largest singular value of A. This gives a lower bound for the minimum of the sum-of-squares cost function.
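A short sketch of how the relaxed solution and the lower bound (4) can be computed, assuming A is available as a dense NumPy array; the function and variable names are illustrative. Computing the singular value decomposition of A avoids forming the Gram matrix A^T A explicitly, since the right singular vectors of A are exactly the eigenvectors of A^T A.

    import numpy as np

    def spectral_relaxation(A, k):
        # A: m-by-n data matrix, one data vector per column.
        # Returns X_k, the n-by-k matrix whose columns are the k largest
        # eigenvectors of A^T A, and the lower bound sum_{i>k} sigma_i(A)^2
        # on min_Pi ss(Pi) from (4).
        A = np.asarray(A, dtype=float)
        _, s, Vt = np.linalg.svd(A, full_matrices=False)
        X_k = Vt[:k].T                         # n-by-k, orthonormal columns
        lower_bound = float(np.sum(s[k:] ** 2))
        return X_k, lower_bound

For a large sparse term-document matrix one would presumably compute only the k leading singular vectors with a truncated SVD (e.g., scipy.sparse.linalg.svds); the paper does not discuss such implementation details.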
REMARK. It is easy to see from the above derivation that we can replace A with A - a e^T, where a is an arbitrary vector, since the sum-of-squares cost is invariant under this translation. We therefore have the lower bound

    \min_\Pi ss(\Pi) \ge \max_a \sum_{i=k+1}^{\min\{m,n\}} \sigma_i^2(A - a e^T).

REMARK. One might also try the following approach: notice that

    \| A_i - m_i e^T \|_F^2 = \frac{1}{2 s_i} \sum_{a_j \in A_i} \sum_{a_{j'} \in A_i} \| a_j - a_{j'} \|^2.

Let W = ( \| a_i - a_j \|^2 )_{i,j=1}^{n} and x_i = [x_{ij}]_{j=1}^{n} with x_{ij} = 1 if a_j \in A_i and x_{ij} = 0 otherwise. Then

    ss(\Pi) = \frac{1}{2} \sum_{i=1}^{k} \frac{x_i^T W x_i}{x_i^T x_i} \ge \frac{1}{2} \min_{Z^T Z = I_k} trace(Z^T W Z) = \frac{1}{2} \sum_{i=n-k+1}^{n} \lambda_i(W).

Unfortunately, some of the smallest eigenvalues of W can be negative.

Let X_k be the n-by-k matrix consisting of the k largest eigenvectors of A^T A. Each row of X_k corresponds to a data vector, and the above process can be considered as transforming the original data vectors, which live in an m-dimensional space, into new data vectors which live in a k-dimensional space. One might be tempted to compute the cluster assignment by applying the ordinary K-means method to those data vectors in the reduced-dimension space. In the next section, we discuss an alternative that takes into account the structure of the eigenvector matrix X_k [5].

REMARK. The similarity of the projection process to principal component analysis is deceiving: the goal here is not to reconstruct the data matrix using a low-rank approximation but rather to capture its cluster structure.

3 Cluster Assignment Using Pivoted QR Decomposition

Without loss of generality, let us assume that the best partition of the data vectors in A that minimizes ss(\Pi) is given by A = [A_1, ..., A_k], each submatrix A_i corresponding to a cluster. Now write the Gram matrix of A as

    A^T A = \mathrm{diag}( A_1^T A_1, \ldots, A_k^T A_k ) + E =: B + E.

If the overlaps among the clusters represented by the submatrices A_i are small, then the norm of E will be small as compared with the block diagonal matrix B in the above equation. Let the largest eigenvector of A_i^T A_i be y_i, i.e.,

    A_i^T A_i y_i = \mu_i y_i,  \quad  \| y_i \| = 1,  \quad  i = 1, \ldots, k;

then the columns of the n-by-k block diagonal matrix

    \hat{Y}_k \equiv \mathrm{diag}( y_1, \ldots, y_k )

span an invariant subspace of B. Let the eigenvalues and eigenvectors of A^T A be

    \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n,  \qquad  A^T A x_i = \lambda_i x_i,  \quad  i = 1, \ldots, n.

Assume that there is a gap between the two eigenvalue sets \{\mu_1, \ldots, \mu_k\} and \{\lambda_{k+1}, \ldots, \lambda_n\}, i.e.,

    0 < \delta = \min\{ |\mu_i - \lambda_j| \;:\; i = 1, \ldots, k, \; j = k+1, \ldots, n \}.

Then the Davis-Kahan sin(\Theta) theorem states that \| \hat{Y}_k^T [x_{k+1}, \ldots, x_n] \| \le \|E\| / \delta [11, Theorem 3.4]. After some manipulation, it can be shown that

    X_k \equiv [x_1, \ldots, x_k] = \hat{Y}_k V + O(\|E\|),

where V is a k-by-k orthogonal matrix. Ignoring the O(\|E\|) term, we see that

    X_k^T \approx V^T \hat{Y}_k^T = [\, \underbrace{y_{11} v_1, \ldots, y_{1 s_1} v_1}_{\text{cluster } 1}, \; \ldots, \; \underbrace{y_{k1} v_k, \ldots, y_{k s_k} v_k}_{\text{cluster } k} \,],

where we have used y_i^T = [y_{i1}, \ldots, y_{i s_i}] and V^T = [v_1, \ldots, v_k]. A key observation is that all the v_i are orthogonal to each other: once we have selected a v_i, we can jump to other clusters by looking at the orthogonal complement of v_i. Also notice that \|y_i\| = 1, so the elements of y_i cannot all be small. A robust implementation of the above idea can be obtained as follows: we pick the column of X_k^T which has the largest norm; say it belongs to cluster i. We orthogonalize the rest of the columns of X_k^T against this column. For the columns belonging to cluster i the residual vectors will have small norm, while for the other columns the residual vectors will tend not to be small. We then pick the vector with the largest residual norm, and orthogonalize the other residual vectors against this residual vector. The process can be carried out for k steps, and it turns out to be exactly QR decomposition with column pivoting applied to X_k^T [4], i.e., we find a permutation matrix P such that

    X_k^T P = Q R = Q [R_{11}, R_{12}],

where Q is a k-by-k orthogonal matrix, and R_{11} is a k-by-k upper triangular matrix. We then compute the matrix

    \hat{R} = R_{11}^{-1} [R_{11}, R_{12}] P^T = [I_k, R_{11}^{-1} R_{12}] P^T.

The cluster membership of each data vector is then determined by the row index of the largest element in absolute value of the corresponding column of \hat{R}.
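The cluster assignment step can be sketched as follows, assuming X_k comes from the spectral relaxation above; scipy.linalg.qr with pivoting=True is used here as one available implementation of QR decomposition with column pivoting, and the function name is an illustrative choice.

    import numpy as np
    from scipy.linalg import qr, solve_triangular

    def assign_clusters_pqr(X_k):
        # X_k: n-by-k matrix of the k largest eigenvectors of the Gram matrix.
        # Returns a cluster label in {0, ..., k-1} for each of the n data vectors.
        n, k = X_k.shape
        # QR with column pivoting: X_k^T P = Q [R11, R12]
        Q, R, piv = qr(X_k.T, pivoting=True)
        # R_hat = R11^{-1} [R11, R12] P^T = [I_k, R11^{-1} R12] P^T
        R_hat = np.empty((k, n))
        R_hat[:, piv] = solve_triangular(R[:, :k], R)  # undo the column permutation
        # membership: row index of the largest entry in absolute value per column
        return np.argmax(np.abs(R_hat), axis=0)

The labels produced this way identify the clusters only up to a relabeling of the cluster indices, which is all that is needed for the accuracy computation in Section 4.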
REMARK. Sometimes it may be advantageous to include more than k eigenvectors to form X_s^T with s > k. We can still use QR decomposition with column pivoting to select k columns of X_s^T to form an s-by-k matrix, say \hat{X}. Then for each column z of X_s^T we compute the least squares solution t^* = \arg\min_{t \in R^k} \| z - \hat{X} t \|. The cluster membership of z is then determined by the row index of the largest element in absolute value of t^*.

4 Experimental Results

In this section we present our experimental results on clustering a dataset of newsgroup articles submitted to 20 newsgroups.¹ This dataset contains about 20,000 articles (email messages) evenly divided among the 20 newsgroups. We list the names of the newsgroups together with the associated group labels.

¹ The newsgroup dataset, together with the bow toolkit for processing it, can be downloaded from http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html.

Figure 1: Clustering accuracy for five newsgroups NG2/NG9/NG10/NG15/NG18: p-QR vs. p-Kmeans (left) and p-Kmeans vs. Kmeans (right).

NG1:  alt.atheism                NG2:  comp.graphics
NG3:  comp.os.ms-windows.misc    NG4:  comp.sys.ibm.pc.hardware
NG5:  comp.sys.mac.hardware      NG6:  comp.windows.x
NG7:  misc.forsale               NG8:  rec.autos
NG9:  rec.motorcycles            NG10: rec.sport.baseball
NG11: rec.sport.hockey           NG12: sci.crypt
NG13: sci.electronics            NG14: sci.med
NG15: sci.space                  NG16: soc.religion.christian
NG17: talk.politics.guns         NG18: talk.politics.mideast
NG19: talk.politics.misc         NG20: talk.religion.misc

We used the bow toolkit to construct the term-document matrix for this dataset; specifically, we used the tokenization option so that the UseNet headers are stripped, and we also applied stemming [8]. The following three preprocessing steps were then applied: 1) we apply the usual tf.idf weighting scheme; 2) we delete words that appear too few times; 3) we normalize each document vector to have unit Euclidean length.
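A rough stand-in for these three steps, under the assumption that scikit-learn's TfidfVectorizer is an acceptable substitute for the bow toolkit; stemming and UseNet header stripping, which the paper performs with bow, are omitted here, and the min_df parameter is used as a proxy for deleting words that appear too few times.

    from sklearn.feature_extraction.text import TfidfVectorizer

    def build_term_document_matrix(documents, min_doc_count=5):
        # documents: list of raw document strings.
        # Returns a sparse m-by-n term-document matrix with one unit-norm
        # column per document, roughly mirroring the paper's preprocessing.
        vec = TfidfVectorizer(min_df=min_doc_count,  # drop rare words
                              norm='l2')             # unit Euclidean length per document
        X = vec.fit_transform(documents)             # n-by-m, one row per document
        return X.T                                   # columns are documents, as in the paper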
We tested three clustering algorithms: 1) p-QR, the algorithm using the eigenvector matrix followed by pivoted QR decomposition for cluster membership assignment; 2) p-Kmeans, where we compute the eigenvector matrix and then apply K-means to the rows of the eigenvector matrix; 3) K-means, which is K-means applied directly to the original data vectors. For both K-means methods, we start with a set of cluster centers chosen randomly from the (projected) data vectors, and we also make sure that the same random set is used for both methods for comparison. To assess the quality of a clustering algorithm, we take advantage of the fact that the newsgroup data are already labeled, and we measure the performance by the accuracy of the clustering algorithm against the document category labels [10]. In particular, for a k-cluster case, we compute a k-by-k confusion matrix C = [c_{ij}] with c_{ij} the number of documents in cluster i that belong to newsgroup category j. It is actually quite subtle to compute the accuracy using the confusion matrix because we do not know which cluster matches which newsgroup category. An optimal way is to solve the maximization problem

    max { trace(C P) | P is a permutation matrix },

and divide the maximum by the total number of documents to get the accuracy. This is equivalent to finding a perfect matching in a complete weighted bipartite graph, for which one can use the Kuhn-Munkres algorithm [7]. In all our experiments, we used a greedy algorithm to compute a sub-optimal solution.
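A small sketch of this accuracy computation; scipy.optimize.linear_sum_assignment solves the matching problem exactly (Kuhn-Munkres), whereas the experiments reported below used a greedy approximation, and the function name here is an illustrative choice.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def clustering_accuracy(cluster_labels, category_labels, k):
        # cluster_labels, category_labels: integer arrays with values in {0, ..., k-1}.
        # Builds the confusion matrix C and solves max_P trace(C P) over
        # permutation matrices P, then divides by the number of documents.
        C = np.zeros((k, k), dtype=int)
        for i, j in zip(cluster_labels, category_labels):
            C[i, j] += 1            # documents in cluster i belonging to category j
        rows, cols = linear_sum_assignment(-C)  # maximize trace(C P)
        return C[rows, cols].sum() / len(category_labels)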
Table 1: Comparison of p-QR, p-Kmeans, and K-means for two-way clustering

    Newsgroups    p-QR             p-Kmeans         K-means
    NG1/NG2       89.29 ± 7.51%    89.62 ± 6.90%    76.25 ± 13.06%
    NG2/NG3       62.37 ± 8.39%    63.84 ± 8.74%    61.62 ± 8.03%
    NG8/NG9       75.88 ± 8.88%    77.64 ± 9.00%    65.65 ± 9.26%
    NG10/NG11     73.32 ± 9.08%    74.86 ± 8.89%    62.04 ± 8.61%
    NG1/NG15      73.32 ± 9.08%    74.86 ± 8.89%    62.04 ± 8.61%
    NG18/NG19     63.86 ± 6.09%    64.04 ± 7.23%    63.66 ± 8.48%

Table 2: Comparison of p-QR, p-Kmeans, and K-means for multi-way clustering

    Newsgroups                                             p-QR             p-Kmeans          K-means
    NG2/NG3/NG4/NG5/NG6 (50)                               40.36 ± 5.17%    41.15 ± 5.73%     35.77 ± 5.19%
    NG2/NG3/NG4/NG5/NG6 (100)                              41.67 ± 5.06%    42.53 ± 5.02%     37.20 ± 4.39%
    NG2/NG9/NG10/NG15/NG18 (50)                            77.83 ± 9.26%    70.13 ± 11.67%    58.10 ± 9.60%
    NG2/NG9/NG10/NG15/NG18 (100)                           79.91 ± 9.90%    75.56 ± 10.63%    66.37 ± 10.89%
    NG1/NG5/NG7/NG8/NG11/NG12/NG13/NG14/NG15/NG17 (50)     60.21 ± 4.88%    58.18 ± 4.41%     40.18 ± 4.64%
    NG1/NG5/NG7/NG8/NG11/NG12/NG13/NG14/NG15/NG17 (100)    65.08 ± 5.14%    58.99 ± 5.22%     48.33 ± 5.64%

EXAMPLE 1. In this example we look at binary clustering. We choose 50 random document vectors each from two newsgroups. We tested 100 runs for each pair of newsgroups, and list the means and standard deviations in Table 1. The two clustering algorithms p-QR and p-Kmeans are comparable to each other, and both are better, and sometimes substantially better, than K-means.

EXAMPLE 2. In this example we consider k-way clustering with k = 5 and k = 10. Three newsgroup sets are chosen, with 50 and 100 random samples from each newsgroup as indicated in the parentheses. Again 100 runs are used for each test, and the means and standard deviations are listed in Table 2. Moreover, in Figure 1 we also plot the accuracy for the 100 runs for the test NG2/NG9/NG10/NG15/NG18 (50). Both p-QR and p-Kmeans perform better than K-means. For newsgroup sets with small overlaps, p-QR performs better than p-Kmeans. This might be explained by the fact that p-QR exploits the special structure of the eigenvector matrix and is therefore more efficient. As a less thorough comparison, with the information bottleneck method used in [10], for 15 runs of NG2/NG9/NG10/NG15/NG18 (100) a mean accuracy of 56.67% with maximum accuracy 67.00% is obtained. For 15 runs of the 10-newsgroup set with 50 samples, a mean accuracy of 35.00% with maximum accuracy of about 40.00% is obtained.

EXAMPLE 3. We compare with the lower bound given in (4). We only list a typical sample from NG2/NG9/NG10/NG15/NG18 (50). The column with "NG labels" indicates clustering using the newsgroup labels and by definition has 100% accuracy. It is quite clear that the newsgroup categories are not completely captured by the sum-of-squares cost function, because p-QR and "NG labels" both have higher accuracy but also larger sum-of-squares values. Interestingly, it seems that p-QR captures some of the information in the newsgroup categories.

                p-QR        p-Kmeans    K-means     NG labels    lower bound
    accuracy    86.80%      83.60%      57.60%      100%         N/A
    ss(\Pi)     224.1110    223.8966    228.8416    224.4040     219.0266

Acknowledgments

This work was supported in part by NSF grant CCR-9901986 and by the Department of Energy through an LBL LDRD fund.

References

[1] P. S. Bradley and Usama M. Fayyad. (1998). Refining Initial Points for K-Means Clustering. Proc. 15th International Conf. on Machine Learning, 91-99.
[2] P. S. Bradley, K. Bennett and A. Demiriz. (2000). Constrained K-means Clustering. Microsoft Research, MSR-TR-2000-65.
[3] M. Girolami. (2001). Mercer Kernel Based Clustering in Feature Space. To appear in IEEE Transactions on Neural Networks.
[4] G. Golub and C. Van Loan. (1996). Matrix Computations. Johns Hopkins University Press, 3rd Edition.
[5] Ming Gu, Hongyuan Zha, Chris Ding, Xiaofeng He and Horst Simon. (2001). Spectral Embedding for K-Way Graph Clustering. Technical Report CSE-01-007, Department of Computer Science and Engineering, Pennsylvania State University.
[6] J. A. Hartigan and M. A. Wong. (1979). A K-means Clustering Algorithm. Applied Statistics, 28:100-108.
[7] L. Lovász and M. D. Plummer. (1986). Matching Theory. Amsterdam: North Holland.
[8] A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow.
[9] B. Schölkopf, A. Smola and K. R. Müller. (1998). Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10:1299-1319.
[10] N. Slonim and N. Tishby. (2000). Document clustering using word clusters via the information bottleneck method. Proceedings of SIGIR-2000.
[11] G. W. Stewart and J. G. Sun. (1990). Matrix Perturbation Theory. Academic Press, San Diego, CA.
", "award": [], "sourceid": 1992, "authors": [{"given_name": "Hongyuan", "family_name": "Zha", "institution": null}, {"given_name": "Xiaofeng", "family_name": "He", "institution": null}, {"given_name": "Chris", "family_name": "Ding", "institution": null}, {"given_name": "Ming", "family_name": "Gu", "institution": null}, {"given_name": "Horst", "family_name": "Simon", "institution": null}]}