{"title": "Clustering via Concave Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 368, "page_last": 374, "abstract": null, "full_text": "Clustering via Concave Minimization\n\nP. S. Bradley and O. L. Mangasarian\n\nComputer Sciences Department\nUniversity of Wisconsin\n1210 West Dayton Street\nMadison, WI 53706\n\nemail: paulb@cs.wisc.edu, olvi@cs.wisc.edu\n\nW. N. Street\n\nComputer Science Department\nOklahoma State University\n205 Mathematical Sciences\nStillwater, OK 74078\n\nemail: nstreet@cs.okstate.edu\n\nAbstract\n\nThe problem of assigning m points in the n-dimensional real space $R^n$ to k clusters is formulated as that of determining k centers in $R^n$ such that the sum of distances of each point to the nearest center is minimized. If a polyhedral distance is used, the problem can be formulated as that of minimizing a piecewise-linear concave function on a polyhedral set, which is shown to be equivalent to a bilinear program: minimizing a bilinear function on a polyhedral set. A fast finite k-Median Algorithm, consisting of solving a few linear programs in closed form, leads to a stationary point of the bilinear program. Computational testing on a number of real-world databases was carried out. On the Wisconsin Diagnostic Breast Cancer (WDBC) database, k-Median training set correctness was comparable to that of the k-Mean Algorithm; however, its testing set correctness was better. Additionally, on the Wisconsin Prognostic Breast Cancer (WPBC) database, distinct and clinically important survival curves were extracted by the k-Median Algorithm, whereas the k-Mean Algorithm failed to obtain such distinct survival curves for the same database. 
\n\n1 Introduction\n\nThe unsupervised assignment of elements of a given set to groups or clusters of like points is the objective of cluster analysis. There are many approaches to this problem, including statistical [9], machine learning [7], and integer and mathematical programming [18, 1] approaches. In this paper we concentrate on a simple concave minimization formulation of the problem that leads to a finite and fast algorithm. Our point of departure is the following explicit description of the problem: given m points in the n-dimensional real space $R^n$ and a fixed number k of clusters, determine k centers in $R^n$ such that the sum of \"distances\" of each point to the nearest center is minimized. If the 1-norm is used, the problem can be formulated as the minimization of a piecewise-linear concave function on a polyhedral set. This is a hard problem to solve because a local minimum is not necessarily a global minimum. However, by converting this problem to a bilinear program, a fast successive-linearization k-Median Algorithm terminates after a few linear programs (each explicitly solvable in closed form) at a point satisfying the minimum principle necessary optimality condition for the problem. Although there is no guarantee that such a point is a global solution to our original problem, numerical tests on five real-world databases indicate that the k-Median Algorithm is comparable to or better than the k-Mean Algorithm [18, 9, 8]. This may be due to the fact that outliers have less influence on the k-Median Algorithm, which utilizes the 1-norm distance. In contrast, the k-Mean Algorithm uses squares of 2-norm distances to generate cluster centers, which may be inaccurate if outliers are present. 
We also note that clustering algorithms based on statistical assumptions that minimize some function of scatter matrices do not appear to have convergence proofs [8, pp. 508-515]; however, convergence to a partial optimal solution is given in [18] for k-Mean type algorithms.\n\nWe now outline the contents of the paper. In Section 2, we formulate the clustering problem for a fixed number of clusters as that of minimizing the sum of the 1-norm distances of each point to the nearest cluster center. This piecewise-linear concave function minimization on a polyhedral set turns out to be equivalent to a bilinear program [3]. We use an effective linearization of the bilinear program proposed in [3, Algorithm 2.1] to solve our problem by solving a few linear programs. Because of their simple structure, these linear programs can be explicitly solved in closed form, thus leading to the finite k-Median Algorithm 2.3 below. In Section 3 we give computational results on five real-world databases. Section 4 concludes the paper.\n\nA word about our notation now. All vectors are column vectors unless otherwise specified. For a vector $x \\in R^n$, $x_i$, $i = 1, \\ldots, n$, will denote its components. The norm $\\|\\cdot\\|_p$ will denote the p-norm, $1 \\le p \\le \\infty$, while $A \\in R^{m \\times n}$ will signify a real $m \\times n$ matrix. For such a matrix, $A^T$ will denote the transpose, and $A_i$ will denote row i. A vector of ones in a real space of arbitrary dimension will be denoted by e.\n\n2 Clustering as Bilinear Programming\n\nGiven a set $\\mathcal{A}$ of m points in $R^n$ represented by the matrix $A \\in R^{m \\times n}$ and a number k of desired clusters, we formulate the clustering problem as follows. Find cluster centers $C_\\ell$, $\\ell = 1, \\ldots, k$, in $R^n$ such that the sum of the minima over $\\ell \\in \\{1, \\ldots, k\\}$ of the 1-norm distance between each point $A_i$, $i = 1, \\ldots, m$, and the cluster centers $C_\\ell$, $\\ell = 1, \\ldots, k$, is minimized. More specifically, we need to solve the following mathematical program:\n\n$$\\min_{C, D} \\; \\sum_{i=1}^{m} \\min_{\\ell = 1, \\ldots, k} \\{ e^T D_{i\\ell} \\} \\quad \\text{subject to} \\quad -D_{i\\ell} \\le A_i^T - C_\\ell \\le D_{i\\ell}, \\; i = 1, \\ldots, m, \\; \\ell = 1, \\ldots, k \\qquad (1)$$\n\nHere $D_{i\\ell} \\in R^n$ is a dummy variable that bounds the components of the difference $A_i^T - C_\\ell$ between point $A_i^T$ and center $C_\\ell$, and e is a vector of ones in $R^n$. Hence $e^T D_{i\\ell}$ bounds the 1-norm distance between $A_i$ and $C_\\ell$. We note immediately that since the objective function of (1) is the sum of minima of k linear (and hence concave) functions, it is a piecewise-linear concave function [13, Corollary 4.1.14]. If the 2-norm or p-norm, $p \\ne 1, \\infty$, is used, the objective function will be neither concave nor convex. Nevertheless, minimizing a piecewise-linear concave function on a polyhedral set is NP-hard, because the general linear complementarity problem, which is NP-complete [4], can be reduced to such a problem [11, Lemma 1]. Given this fact, we look for effective methods for processing this problem. We propose a reformulation of problem (1) as a bilinear program. Such reformulations have been very effective in computationally solving NP-complete linear complementarity problems [14] as well as other difficult machine learning [12] and optimization problems with equilibrium constraints [12]. In order to carry out this reformulation we need the following simple lemma.\n\nLemma 2.1 Let $a \\in R^k$. Then\n\n$$\\min_{1 \\le \\ell \\le k} \\{ a_\\ell \\} = \\min_{t \\in R^k} \\left\\{ \\sum_{\\ell=1}^{k} a_\\ell t_\\ell \\;\\middle|\\; \\sum_{\\ell=1}^{k} t_\\ell = 1, \\; t_\\ell \\ge 0, \\; \\ell = 1, \\ldots, k \\right\\} \\qquad (2)$$\n\nProof This essentially obvious result follows immediately upon writing the dual of the linear program appearing on the right-hand side of (2), which is\n\n$$\\max_{h \\in R} \\{ h \\mid h \\le a_\\ell, \\; \\ell = 1, \\ldots, k \\} \\qquad (3)$$\n\nObviously, the maximum of this dual problem is $h = \\min_{1 \\le \\ell \\le k} \\{ a_\\ell \\}$. By linear programming duality theory, this maximum equals the minimum of the primal linear program on the right-hand side of (2). This establishes the equality of (2). $\\Box$\n\nBy defining $a^i_\\ell = e^T D_{i\\ell}$, $i = 1, \\ldots, m$, $\\ell = 1, \\ldots, k$, Lemma 2.1 can be used to reformulate the clustering problem (1) as a bilinear program as follows.\n\nProposition 2.2 Clustering as a Bilinear Program The clustering problem (1) is equivalent to the following bilinear program:\n\n$$\\min_{C_\\ell \\in R^n, \\, D_{i\\ell} \\in R^n, \\, T_{i\\ell} \\in R} \\; \\sum_{i=1}^{m} \\sum_{\\ell=1}^{k} e^T D_{i\\ell} T_{i\\ell} \\quad \\text{subject to} \\quad -D_{i\\ell} \\le A_i^T - C_\\ell \\le D_{i\\ell}, \\quad \\sum_{\\ell=1}^{k} T_{i\\ell} = 1, \\quad T_{i\\ell} \\ge 0, \\quad i = 1, \\ldots, m, \\; \\ell = 1, \\ldots, k \\qquad (4)$$\n\nNote that the constraints of (4) are uncoupled in the variables (C, D) and the variable T. Hence the Uncoupled Bilinear Program Algorithm UBPA [3, Algorithm 2.1] is applicable. Simply stated, this algorithm alternates between solving a linear program in the variable T and a linear program in the variables (C, D). The algorithm terminates in a finite number of iterations at a stationary point satisfying the minimum principle necessary optimality condition for problem (4) [3, Theorem 2.1]. We note, however, that because of the simple structure of the bilinear program (4), the two linear programs can be solved explicitly in closed form. This leads to the following algorithmic implementation.\n\nAlgorithm 2.3 k-Median Algorithm Given $C_1^j, \\ldots, C_k^j$ at iteration j, compute $C_1^{j+1}, \\ldots, C_k^{j+1}$ by the following two steps:\n\n(a) Cluster Assignment: For each $A_i^T$, $i = 1, \\ldots, m$, determine $\\ell(i)$ such that $C_{\\ell(i)}^j$ is closest to $A_i^T$ in the 1-norm.\n\n(b) Cluster Center Update: For $\\ell = 1, \\ldots, k$ choose $C_\\ell^{j+1}$ as a median of all $A_i^T$ assigned to $C_\\ell^j$.\n\nStop when $C_\\ell^{j+1} = C_\\ell^j$, $\\ell = 1, \\ldots, k$.\n\nAlthough the k-Median Algorithm is similar to the k-Mean Algorithm, wherein the 2-norm distance is used [18, 8, 9], it differs from it computationally and theoretically. In fact, the underlying problem (1) of the k-Median Algorithm is a concave minimization on a polyhedral set, while the corresponding problem for the p-norm, $p \\ne 1$, is:\n\n$$\\min_{C, D} \\; \\sum_{i=1}^{m} \\min_{\\ell = 1, \\ldots, k} \\| D_{i\\ell} \\|_p \\quad \\text{subject to} \\quad -D_{i\\ell} \\le A_i^T - C_\\ell \\le D_{i\\ell}, \\; i = 1, \\ldots, m, \\; \\ell = 1, \\ldots, k. \\qquad (5)$$\n\nThis is not a concave minimization on a polyhedral set, because the minimum of a set of convex functions is not in general concave. The concave minimization problem of [18] is not in the original space of the problem variables, that is, the cluster center variables (C, D), but merely in the space of variables T that assign points to clusters. We also note that the k-Mean Algorithm finds a stationary point not of problem (5) with p = 2, but of the same problem except that $\\|D_{i\\ell}\\|_2$ is replaced by $\\|D_{i\\ell}\\|_2^2$. Without this squared distance term, the subproblem of the k-Mean Algorithm becomes the considerably harder Weber problem [17, 5], which locates a center in $R^n$ closest in sum of Euclidean distances (not their squares!) to a finite set of given points. The Weber problem has no closed form solution. 
However, using the mean as a cluster center of points assigned to the cluster minimizes the sum of the squares of the distances from the cluster center to the points. It is precisely the mean that is used in the k-Mean Algorithm subproblem.\n\nBecause there is no guaranteed way to ensure global optimality of the solution obtained by either the k-Median or k-Mean Algorithms, different starting points can be used to initiate the algorithm. Random starting cluster centers or some other heuristic can be used, such as placing k initial centers along the coordinate axes at the densest, second densest, ..., k-th densest intervals on the axes.\n\n3 Computational Results\n\nAn important computational issue is how to measure the correctness of the results obtained by the proposed algorithm. We decided on the following three ways.\n\nRemark 3.1 Training Set Correctness The k-Median Algorithm (k = 2) is applied to a database with two known classes to obtain centers. Training correctness is measured by the ratio of the sum of the number of examples of the majority class in each cluster to the total number of points in the database. The k-Median training set correctness is compared to that of the k-Mean Algorithm as well as the training correctness of a supervised learning method, a perceptron trained by robust linear programming [2]. Table 1 shows results averaged over ten random starts for the publicly available Wisconsin Diagnostic Breast Cancer (WDBC) database as well as three others [15, 16]. We note that for two of the databases k-Median outperformed k-Mean, and for the other two k-Mean was better.\n\nAlgorithm ↓ Database → WDBC Cleveland Votes Star/Galaxy-Bright\nUnsupervised k-Median\nUnsupervised k-Mean\nSupervised Robust LP\n\n84.6%\n85.5%\n95.6%\n\n80.6%\n83.1%\n86.5%\n\n93.2%\n91.1%\n100%\n\n87.6%\n85.6%\n99.7%\n\nTable 1: Training set correctness using the unsupervised k-Median and k-Mean Algorithms and the supervised Robust LP on four databases\n\nRemark 3.2 Testing Set Correctness The idea behind this approach is that supervised learning may be costly due to problem size, difficulty in obtaining true classification, etc., hence the importance of good performance of an unsupervised learning algorithm on a testing subset of a database. The WDBC database [15] is split into training and testing subsets of different proportions. The k-Median and k-Mean Algorithms (k = 2) are applied to the training subset. The centers are given class labels determined by the majority class of training subset points assigned to the cluster. Class labels are assigned to the testing subset by the label of the closest center. Testing correctness is determined by the number of points in the testing subset correctly classified by this assignment. This is compared to the correctness of a supervised learning method, a perceptron trained via robust linear programming [2], using the leave-one-out strategy applied to the testing subset only. This comparison is then carried out for various sizes of the testing subset. 
Figure 1 shows the results averaged over 50 runs for each of 7 testing subset sizes. As expected, the performance of the supervised learning algorithm (Robust LP) improved as the size of the testing subset increased. The k-Median Algorithm test set correctness remained fairly constant in the range of 92.3% to 93.5%, while the k-Mean Algorithm test set correctness was lower and more varied in the range 88.0% to 91.3%.\n\n[Figure 1: Correctness on variable-size test set of unsupervised k-Median and k-Mean Algorithms versus correctness of the supervised Robust LP on WDBC. Horizontal axis: Testing Set Size (% of Original).]\n\nRemark 3.3 Separability of Survival Curves In mining medical databases, survival curves [10] are important prognostic tools. We applied the k-Median and k-Mean (k = 3) Algorithms, as knowledge discovery in databases (KDD) tools [6], to the Wisconsin Prognostic Breast Cancer Database (WPBC) [15] using only two features: tumor size and lymph node status. Survival curves were constructed for each cluster, representing expected percent of surviving patients as a function of time, for patients in that cluster.\n\n[Figure 2: Survival curves for the 3 clusters obtained by the k-Median and k-Mean Algorithms. Panels: (a) k-Median, (b) k-Mean.]\n\n
Figure 2(a) depicts the survival curves from clusters obtained from the k-Median Algorithm; Figure 2(b) depicts curves for the k-Mean Algorithm. The key observation to make here is that the curves in Figure 2(a) are well separated, and hence the clusters can be used as prognostic indicators. In contrast, the curves in Figure 2(b) are poorly separated, and hence are not useful for prognosis.\n\n4 Conclusion\n\nWe have proposed a new approach for assigning points to clusters based on a simple concave minimization model. Although a global solution to the problem cannot be guaranteed, a finite and simple k-Median Algorithm quickly locates a very useful stationary point. The utility of the proposed algorithm lies in its ability to handle large databases, and hence it would be a useful tool for data mining. Comparing it with the k-Mean Algorithm, we have exhibited instances where the k-Median Algorithm is superior, and hence preferable. Further research is needed to pinpoint types of problems for which the k-Median Algorithm is best.\n\n5 Acknowledgements\n\nOur colleague Jude Shavlik suggested the testing set strategy used in Remark 3.2. This research is supported by National Science Foundation Grant CCR-9322479 and National Institutes of Health INRSA Fellowship 1 F32 CA 68690-01.\n\nReferences\n\n[1] K. Al-Sultan. A Tabu search approach to the clustering problem. Pattern Recognition, 28(9):1443-1451, 1995.\n\n[2] K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23-34, 1992.\n\n[3] K. P. Bennett and O. L. Mangasarian. Bilinear separation of two sets in n-space. Computational Optimization & Applications, 2:207-227, 1993.\n\n[4] S.-J. Chung. NP-completeness of the linear complementarity problem. Journal of Optimization Theory and Applications, 60:393-399, 1989.\n\n[5] F. Cordellier and J. Ch. Fiorot. On the Fermat-Weber problem with convex cost functionals. Mathematical Programming, 14:295-311, 1978.\n\n[6] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39:27-34, 1996.\n\n[7] D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.\n\n[8] K. Fukunaga. Statistical Pattern Recognition. Academic Press, NY, 1990.\n\n[9] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc, Englewood Cliffs, NJ, 1988.\n\n[10] E. L. Kaplan and P. Meier. Nonparametric estimation from incomplete observations. J. Am. Stat. Assoc., 53:457-481, 1958.\n\n[11] O. L. Mangasarian. Characterization of linear complementarity problems as linear programs. Mathematical Programming Study, 7:74-87, 1978.\n\n[12] O. L. Mangasarian. Misclassification minimization. Journal of Global Optimization, 5:309-323, 1994.\n\n[13] O. L. Mangasarian. Nonlinear Programming. SIAM, Philadelphia, PA, 1994.\n\n[14] O. L. Mangasarian. The linear complementarity problem as a separable bilinear program. Journal of Global Optimization, 6:153-161, 1995.\n\n[15] P. M. Murphy and D. W. Aha. UCI repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine, www.ics.uci.edu/AI/ML/MLDBRepository.html, 1992.\n\n[16] S. Odewahn, E. Stockwell, R. Pennington, R. Humphreys, and W. Zumach. Automated star/galaxy discrimination with neural networks. Astronomical Journal, 103(1):318-331, 1992.\n\n[17] M. L. Overton. A quadratically convergent method for minimizing a sum of euclidean norms. Mathematical Programming, 27:34-63, 1983.\n\n[18] S. Z. Selim and M. A. Ismail. K-Means-Type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6:81-87, 1984.\n", "award": [], "sourceid": 1260, "authors": [{"given_name": "Paul", "family_name": "Bradley", "institution": null}, {"given_name": "Olvi", "family_name": "Mangasarian", "institution": null}, {"given_name": "W.", "family_name": "Street", "institution": null}]}