{"title": "An Information-theoretic Learning Algorithm for Neural Network Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 591, "page_last": 597, "abstract": null, "full_text": "An Information-theoretic  Learning \n\nAlgorithm for  Neural Network \n\nClassification \n\nDavid J.  Miller \n\nDepartment of Electrical Engineering \nThe Pennsylvania State University \n\nState College,  Pa:  16802 \n\nAjit Rao,  Kenneth Rose,  and Allen  Gersho \n\nDepartment of Electrical and Computer  Engineering \n\nUniversity of California \nSanta Barbara, Ca.  93106 \n\nAbstract \n\nA  new  learning  algorithm is  developed for  the design  of statistical \nclassifiers  minimizing  the  rate  of misclassification.  The  method, \nwhich  is  based on  ideas  from  information  theory  and  analogies  to \nstatistical  physics,  assigns  data to  classes  in  probability.  The  dis(cid:173)\ntributions  are  chosen  to minimize the  expected  classification  error \nwhile  simultaneously enforcing  the classifier's structure and a level \nof \"randomness\"  measured  by  Shannon's entropy.  Achievement of \nthe classifier structure is  quantified by an  associated cost.  The con(cid:173)\nstrained optimization  problem is  equivalent to the minimization of \na  Helmholtz  free  energy,  and  the  resulting  optimization  method \nis  a  basic  extension  of the  deterministic  annealing  algorithm  that \nexplicitly  enforces  structural  constraints on  assignments  while  re(cid:173)\nducing  the  entropy  and  expected  cost  with  temperature.  In  the \nlimit of low temperature, the error rate is minimized directly and a \nhard classifier with the requisite structure is  obtained.  This learn(cid:173)\ning algorithm can be used to design a variety of classifier structures. \nThe approach is  compared with standard methods for  radial basis \nfunction  design  and  is  demonstrated  to  substantially  outperform \nother  design  methods  on  several  benchmark  examples,  while  of(cid:173)\nten retaining design complexity comparable to,  or only moderately \ngreater than that of strict descent-based methods. \n\n\f592 \n\nD.  NnLLER.A.RAO.K. ROSE.A. GERSHO \n\n1 \n\nIntroduction \n\nThe problem of designing a  statistical classifier  to minimize  the  probability of mis(cid:173)\nclassification or a  more general  risk  measure  has been a  topic of continuing interest \nsince the 1950s.  Recently, with the increase in power of serial and parallel computing \nresources,  a  number of complex neural network classifier structures have  been pro(cid:173)\nposed, along with associated learning algorithms to design them.  While  these struc(cid:173)\ntures  offer  great  potential for  classification,  this  potenl ial  cannot  be fully  realized \nwithout  effective  learning  procedures  well-matched to  the  minimllm classification(cid:173)\nerror oh.iective.  Methods such as  back propagation which approximate class targets \nin  a  sqllared error sense  do  not directly minimize  the  probability of error.  Rather, \nit has been shown that these  approaches design  networks  to approximate the  class \na  posteriori probabilities.  The probability estimates can then be  used to form a  de(cid:173)\ncision rule.  While large networks can in principle accurately approximate the Bayes \ndiscriminant,  in  practice  the network size  must be  constrained  to  avoid overfitting \nthe  (finite)  training set.  Thus,  discriminative learning techniques, e.g.  
As an alternative to strict descent-based procedures, we propose a new deterministic learning algorithm for statistical classifier design with a demonstrated potential for avoiding local optima of the cost. Several deterministic, annealing-based techniques have been proposed for avoiding nonglobal optima in computer vision and image processing (Yuille, 1990), (Geiger and Girosi, 1991), in combinatorial optimization, and elsewhere. Our approach is derived based on ideas from information theory and statistical physics, and builds on the probabilistic framework of the deterministic annealing (DA) approach to clustering and related problems (Rose et al., 1990, 1992, 1993). In the DA approach for data clustering, the probability distributions are chosen to minimize the expected clustering cost, given a constraint on the level of randomness, as measured by Shannon's entropy.¹

In this work, the DA approach is extended in a novel way, most significantly to incorporate structural constraints on data assignments, but also to minimize the probability of error as the cost. While the general approach we suggest is likely applicable to problems of structured vector quantization and regression as well, we focus on the classification problem here. Most design methods have been developed for specific classifier structures. In this work, we will develop a general approach but only demonstrate results for RBF classifiers. The design of nearest prototype and MLP classifiers is considered in (Miller et al., 1995a,b). Our method provides substantial performance gains over conventional designs for all of these structures, while retaining design complexity in many cases comparable to the strict descent methods. Our approach often designs small networks to achieve training set performance that can only be obtained by a much larger network designed in a conventional way. The design of smaller networks may translate to superior performance outside the training set.

¹Note that in (Rose et al., 1990, 1992, 1993), the DA method was formally derived using the maximum entropy principle. Here we emphasize the alternative, but mathematically equivalent, description that the chosen distributions minimize the expected cost given constrained entropy. This formulation may have more intuitive appeal for the optimization problem at hand.

2 Classifier Design Formulation

2.1 Problem Statement

Let $T = \{(x, c)\}$ be a training set of $N$ labelled vectors, where $x \in \mathbb{R}^n$ is a feature vector and $c \in I$ is its class label from an index set $I$. A classifier is a mapping $C : \mathbb{R}^n \rightarrow I$, which assigns a class label in $I$ to each vector in $\mathbb{R}^n$. Typically, the classifier is represented by a set of model parameters $\Lambda$. The classifier specifies a partitioning of the feature space into regions $R_j = \{x \in \mathbb{R}^n : C(x) = j\}$, where $\bigcup_j R_j = \mathbb{R}^n$ and $\bigcap_j R_j = \emptyset$. It also induces a partitioning of the training set into sets $T_j \subset T$, where $T_j = \{(x, c) : x \in R_j, (x, c) \in T\}$. A training pair $(x, c) \in T$ is misclassified if $C(x) \neq c$. The performance measure of primary interest is the empirical error fraction $P_e$ of the classifier, i.e. the fraction of the training set (for generalization purposes, the fraction of the test set) which is misclassified:

$$P_e = \frac{1}{N} \sum_{(x,c) \in T} \delta(c, C(x)) = \frac{1}{N} \sum_{j \in I} \sum_{(x,c) \in T_j} \delta(c, j), \qquad (1)$$

where $\delta(c, j) = 1$ if $c \neq j$ and $0$ otherwise. In this work, we will assume that the classifier produces an output $F_j(x)$ associated with each class, and uses a \"winner-take-all\" classification rule:

$$R_j \equiv \{x \in \mathbb{R}^n : F_j(x) \geq F_k(x) \ \forall k \in I\}. \qquad (2)$$

This rule is consistent with MLP and RBF-based classification.
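As a minimal illustration of (1) and (2), the Python sketch below computes the winner-take-all decisions and the resulting empirical error fraction from an array of per-class outputs $F_j(x)$; the array shapes and names are illustrative assumptions, not part of the formulation above.

    import numpy as np

    def empirical_error_fraction(F, labels):
        # F: (N, |I|) array with F[i, j] = F_j(x_i), the output for class j on sample i.
        # labels: (N,) array of true class indices c_i.
        decisions = np.argmax(F, axis=1)        # winner-take-all rule (2): C(x) = argmax_j F_j(x)
        return np.mean(decisions != labels)     # empirical error fraction P_e of (1)

    # Illustrative usage with 5 samples and 3 classes:
    F = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.1, 0.8, 0.1],
                  [0.6, 0.3, 0.1]])
    labels = np.array([0, 1, 2, 1, 2])
    print(empirical_error_fraction(F, labels))  # 0.2, since only the last sample is misclassified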
2.2 Randomized Classifier Partition

As in the original DA approach for clustering (Rose et al., 1990, 1992), we cast the optimization problem in a framework in which data are assigned to classes in probability. Accordingly, we define the probabilities of association between a feature $x$ and the class regions, i.e. $\{P[x \in R_j]\}$. As our design method, which optimizes over these probabilities, must ultimately form a classifier that makes \"hard\" decisions based on a specified network model, the distributions must be chosen to be consistent with the decision rule of the model. In other words, we need to introduce randomness into the classifier's partition. Clearly, there are many ways one could define probability distributions which are consistent with the hard partition at some limit. We use an information-theoretic approach. We measure the randomness or uncertainty by Shannon's entropy, and determine the distribution for a given level of entropy. At the limit of zero entropy we should recover a hard partition. For now, suppose that the values of the model parameters $\Lambda$ have been fixed. We can then write an objective function whose maximization determines the hard partition for a given $\Lambda$:

$$F_h = \frac{1}{N} \sum_{j \in I} \sum_{(x,c) \in T_j} F_j(x). \qquad (3)$$

Note specifically that maximizing (3) over all possible partitions captures the decision rule of (2). The probabilistic generalization of (3) is

$$F = \frac{1}{N} \sum_{(x,c) \in T} \sum_j P[x \in R_j] F_j(x), \qquad (4)$$

where the (randomized) partition is now represented by association probabilities, and the corresponding entropy is

$$H = -\frac{1}{N} \sum_{(x,c) \in T} \sum_j P[x \in R_j] \log P[x \in R_j]. \qquad (5)$$

We determine the distribution at a given level of randomness as the one which maximizes $F$ while maintaining $H$ at a prescribed level $\hat{H}$:

$$\max_{\{P[x \in R_j]\}} F \quad \text{subject to} \quad H = \hat{H}. \qquad (6)$$

The result is the best probabilistic partition, in the sense of $F$, at the specified level of randomness. For $\hat{H} = 0$ we get back the hard partition maximizing (3). At any $\hat{H}$, the solution of (6) is the Gibbs distribution

$$P[x \in R_j] \equiv P_{j|x}(\Lambda) = \frac{e^{\gamma F_j(x)}}{\sum_k e^{\gamma F_k(x)}}, \qquad (7)$$

where $\gamma$ is the Lagrange multiplier.
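For concreteness, here is a minimal sketch of the Gibbs association probabilities in (7), reusing the output-array layout assumed in the earlier sketch; subtracting the row maximum is a standard numerical-stability device and is our implementation choice, not part of the derivation.

    import numpy as np

    def association_probabilities(F, gamma):
        # Gibbs distribution of (7): P[x in R_j] proportional to exp(gamma * F_j(x)).
        Z = gamma * F
        Z = Z - Z.max(axis=1, keepdims=True)     # stabilize the exponentials
        P = np.exp(Z)
        return P / P.sum(axis=1, keepdims=True)  # normalize over the classes

    # Small gamma gives nearly uniform associations; large gamma approaches the
    # hard winner-take-all partition of (2).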
For $\gamma \rightarrow 0$, the associations become increasingly uniform, while for $\gamma \rightarrow \infty$, they revert to hard classifications, equivalent to application of the rule in (2). Note that the probabilities depend on $\Lambda$ through the network outputs. Here we have emphasized this dependence through our choice of concise notation.

2.3 Information-Theoretic Classifier Design

Until now we have formulated a controlled way of introducing randomness into the classifier's partition while enforcing its structural constraint. However, the derivation assumed that the model parameters were given, and thus produced only the form of the distribution $P_{j|x}(\Lambda)$, without actually prescribing how to choose the values of its parameter set. Moreover, the derivation did not consider the ultimate goal of minimizing the probability of error. Here we remedy both shortcomings.

The method we suggest gradually enforces formation of a hard classifier minimizing the probability of error. We start with a highly random classifier and a high expected misclassification cost. We then gradually reduce both the randomness and the cost in a deterministic learning process which enforces formation of a hard classifier with the requisite structure. As before, we need to introduce randomness into the partition while enforcing the classifier's structure, only now we are also interested in minimizing the expected misclassification cost. While satisfying these multiple objectives may appear to be a formidable task, the problem is greatly simplified by restricting the choice of random classifiers to the set of distributions $\{P_{j|x}(\Lambda)\}$ as given in (7) - these random classifiers naturally enforce the structural constraint through $\gamma$. Thus, from the parametrized set $\{P_{j|x}(\Lambda)\}$, we seek that distribution which minimizes the average misclassification cost while constraining the entropy:

$$\min_{\Lambda, \gamma} \langle P_e \rangle \quad \text{subject to} \quad H = \hat{H}. \qquad (8)$$

The solution yields the best random classifier in the sense of minimum $\langle P_e \rangle$ for a given $\hat{H}$. At the limit of zero entropy, we should get the best hard classifier in the sense of $P_e$ with the desired structure, i.e. satisfying (2).

The constrained minimization (8) is equivalent to the unconstrained minimization of the Lagrangian:

$$\min_{\Lambda, \gamma} L \equiv \min_{\Lambda, \gamma} \, \beta \langle P_e \rangle - H, \qquad (9)$$

where $\beta$ is the Lagrange multiplier associated with (8). For $\beta = 0$, the sole objective is entropy maximization, which is achieved by the uniform distribution. This solution, which is the global minimum for $L$ at $\beta = 0$, can be obtained by choosing $\gamma = 0$. At the other end of the spectrum, for $\beta \rightarrow \infty$, the sole objective is to minimize $\langle P_e \rangle$, and is achieved by choosing a non-random (hard) classifier (hence minimizing $P_e$). The hard solution satisfies the classification rule (2) and is obtained for $\gamma \rightarrow \infty$.

Motivation for minimizing the Lagrangian can be obtained from a physical perspective by noting that $L$ is the Helmholtz free energy of a simulated system, with $\langle P_e \rangle$ the \"energy\", $H$ the system entropy, and $1/\beta$ the \"temperature\".
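The pieces above can be combined into a direct evaluation of the Lagrangian. In the sketch below (ours, building on the association_probabilities sketch above), $\langle P_e \rangle$ is taken to be the expected value of the misclassification indicator $\delta(c, j)$ under the association probabilities, which is our reading of the average misclassification cost referred to in the text; the entropy is computed as in (5).

    import numpy as np

    def lagrangian(F, labels, gamma, beta, eps=1e-12):
        P = association_probabilities(F, gamma)                # Gibbs associations of (7)
        N = P.shape[0]
        expected_pe = np.mean(1.0 - P[np.arange(N), labels])   # probability mass on wrong classes
        H = -np.mean(np.sum(P * np.log(P + eps), axis=1))      # Shannon entropy of (5)
        return beta * expected_pe - H                          # the free-energy-like L of (9)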
Thus, from this physical view we can suggest a deterministic annealing (DA) process which involves minimizing $L$ starting at the global minimum for $\beta = 0$ (high temperature) and tracking the solution while increasing $\beta$ towards infinity (zero temperature). In this way, we obtain a sequence of solutions of decreasing entropy and average misclassification cost. Each such solution is the best random classifier in the sense of $\langle P_e \rangle$ for a given level of randomness. The annealing process is useful for avoiding local optima of the cost $\langle P_e \rangle$, and minimizes $\langle P_e \rangle$ directly at low temperature. While this annealing process ostensibly involves the quantities $H$ and $\langle P_e \rangle$, the restriction to $\{P_{j|x}(\Lambda)\}$ from (7) ensures that the process also enforces the structural constraint on the classifier in a controlled way. Note in particular that $\gamma$ has not lost its interpretation as a Lagrange multiplier determining $F$. Thus, $\gamma = 0$ means that $F$ is unconstrained - we are free to choose the uniform distribution. Similarly, sending $\gamma \rightarrow \infty$ requires maximizing $F$ - hence the hard solution. Since $\gamma$ is chosen to minimize $L$, this parameter effectively determines the level of $F$ - the level of structural constraint - consistent with $H$ and $\langle P_e \rangle$ for a given $\beta$. As $\beta$ is increased, the entropy constraint is relaxed, allowing greater satisfaction of both the minimum $\langle P_e \rangle$ and maximum $F$ objectives. Thus, annealing in $\beta$ gradually enforces both the structural constraint (via $\gamma$) and the minimum $\langle P_e \rangle$ objective.²

²While not shown here, the method does converge directly for $\beta \rightarrow \infty$, and at this limit enforces the classifier's structure.

Our formulation clearly identifies what distinguishes the annealing approach from direct descent procedures. Note that a descent method could be obtained by simply neglecting the constraint on the entropy, instead choosing to directly minimize $\langle P_e \rangle$ over the parameter set. This minimization will directly lead to a hard classifier, and is akin to the method described in (Juang and Katagiri, 1992) as well as other related approaches which attempt to directly minimize a smoothed probability of error cost. However, as we will experimentally verify through simulations, our annealing approach outperforms design based on directly minimizing $\langle P_e \rangle$.

For conciseness, we will not derive necessary optimality conditions for minimizing the Lagrangian at a given temperature, nor will we specialize the formulation for individual classification structures here. The reader is referred to (Miller et al., 1995a) for these details.

3 Experimental Comparisons

We demonstrate the performance of our design approach in comparison with other methods for the normalized RBF structure (Moody and Darken, 1989). For the DA method, steepest descent was used to minimize $L$ at a sequence of exponentially increasing $\beta$, given by $\beta(n+1) = \alpha \beta(n)$, for $\alpha$ between 1.05 and 1.1.
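A schematic of the resulting design loop is sketched below under the assumptions of the earlier sketches; grad_L is a placeholder for the gradient of $L$ with respect to the model parameters $\Lambda$ and $\gamma$ (hand-derived for the chosen structure or obtained by automatic differentiation), and the schedule constants are the ones quoted above.

    def da_design(params, gamma, grad_L, beta0=1e-3, alpha=1.05, beta_max=1e3,
                  steps_per_beta=100, lr=0.01):
        # Deterministic annealing: descend on L = beta*<P_e> - H at each temperature,
        # then raise beta exponentially, beta(n+1) = alpha * beta(n).
        beta = beta0
        while beta < beta_max:
            for _ in range(steps_per_beta):
                g_params, g_gamma = grad_L(params, gamma, beta)
                params = params - lr * g_params   # steepest descent on L at fixed beta
                gamma = gamma - lr * g_gamma
            beta = alpha * beta                   # exponential annealing schedule
        # For large beta the associations harden; the designed parameters are then
        # used with the winner-take-all rule (2) as a hard classifier.
        return params, gamma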
We have found that much of the optimization occurs at or near a critical temperature in the solution process. Beyond this critical temperature, the annealing process can often be \"quenched\" to zero temperature by sending $\gamma \rightarrow \infty$ without incurring significant performance loss. Quenching the process often makes the design complexity of our method comparable to that of descent-based methods such as back propagation or gradient descent on $\langle P_e \rangle$.

We have compared our RBF design approach with the method in (Moody and Darken, 1989) (MD-RBF), with a method described in (Tarassenko and Roberts, 1994) (TR-RBF), with the approach in (Musavi et al., 1992), and with steepest descent on $\langle P_e \rangle$ (G-RBF). MD-RBF combines unsupervised learning of receptive field parameters with supervised learning of the weights from the receptive fields so as to minimize the squared distance to target class outputs. The primary advantage of this approach is its modest design complexity. However, the receptive fields are not optimized in a supervised fashion, which can cause performance degradation. TR-RBF optimizes all of the RBF parameters to approximate target class outputs. This design is more complex than MD-RBF and achieves better performance for a given model size. However, as aforementioned, the TR-RBF design objective is not equivalent to minimizing $P_e$, but rather to approximating the Bayes-optimal discriminant. While direct descent on $\langle P_e \rangle$ may minimize the \"right\" objective, problems of local optima may be quite severe. In fact, we have found that the performance of all of these methods can be quite poor without a judicious initialization. For all of these methods, we have employed the unsupervised learning phase described in (Moody and Darken, 1989) (based on Isodata clustering and variance estimation) as model initialization. Then, steepest descent was performed on the respective cost surface. We have found that the complexity of our design is typically 1-5 times that of TR-RBF or G-RBF (though occasionally our design is actually faster than G-RBF). Accordingly, we have chosen the best results based on five random initializations for these techniques, and compared with the single DA design run.
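For reference, one common form of the normalized RBF discriminant in the spirit of (Moody and Darken, 1989) is sketched below; the paper does not spell out its exact parameterization, so the isotropic Gaussian receptive fields and the weight layout here are assumptions. The centers, widths, and output weights together play the role of the model parameters $\Lambda$ adjusted by the designs compared above.

    import numpy as np

    def normalized_rbf_outputs(X, centers, widths, weights):
        # X: (N, n) features; centers: (M, n); widths: (M,); weights: (M, K) to K classes.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # squared distances, (N, M)
        phi = np.exp(-d2 / (2.0 * widths ** 2))                        # Gaussian receptive fields
        phi = phi / phi.sum(axis=1, keepdims=True)                     # normalization across units
        return phi @ weights                                           # class outputs F_j(x)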
One example reported here is the 40D \"noisy\" waveform data used in (Breiman et al., 1980), obtained from the UC-Irvine machine learning database repository. We split the 5000 vectors into equal size training and test sets. Our results in Table 1 demonstrate quite substantial performance gains over all the other methods, and performance quite close to the estimated Bayes rate of 14%. Note in particular that the other methods perform quite poorly for a small number of receptive fields ($M$), and need to increase $M$ to achieve training set performance comparable to our approach. However, performance on the test set does not necessarily improve, and may degrade for increasing $M$.

Table 1: A comparison of DA with known design techniques for RBF classification on the 40-dimensional noisy waveform data from (Breiman et al., 1980).

Method     |     DA       |           TR-RBF           |   MD-RBF    | G-RBF
M          |  4     30    |  4     10     30     50    |  10    50   |  10
Pe (train) | 0.11  0.028  | 0.33  0.162  0.145  0.129  | 0.19  0.18  | 0.18
Pe (test)  | 0.13  0.167  | 0.35  0.165  0.168  0.179  | 0.3   0.37  | 0.20

To further justify this claim, we compared our design with results reported in (Musavi et al., 1992) for the two and eight dimensional mixture examples. For the 2D example, our method achieved $P_{e,\mathrm{train}} = 6.0\%$ for a 400 point training set and $P_{e,\mathrm{test}} = 6.1\%$ on a 20,000 point test set, using $M = 3$ units (these results are near-optimal, based on the Bayes rate). By contrast, the method of Musavi et al. used 86 receptive fields and achieved $P_{e,\mathrm{test}} = 9.26\%$. For the 8D example and $M = 5$, our method achieved $P_{e,\mathrm{train}} = 8\%$ and $P_{e,\mathrm{test}} = 9.4\%$ (again near-optimal), while the method in (Musavi et al., 1992) achieved $P_{e,\mathrm{test}} = 12.0\%$ using $M = 128$.

In summary, we have proposed a new, information-theoretic learning algorithm for classifier design, demonstrated to outperform other design methods, and with general applicability to a variety of structures. Future work may investigate important applications, such as recognition problems for speech and images. Moreover, our extension of DA to incorporate structure is likely applicable to structured vector quantizer design and to regression modelling. These problems will be considered in future work.

Acknowledgements

This work was supported in part by the National Science Foundation under grant no. NCR-9314335, the University of California MICRO program, DSP Group, Inc., Echo Speech Corporation, Moseley Associates, National Semiconductor Corp., Qualcomm, Inc., Rockwell International Corporation, Speech Technology Labs, and Texas Instruments, Inc.

References

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. The Wadsworth Statistics/Probability Series, Belmont, CA, 1980.

D. Geiger and F. Girosi. Parallel and deterministic algorithms from MRFs: Surface reconstruction. IEEE Trans. on Patt. Anal. and Mach. Intell., 13:401-412, 1991.

B.-H. Juang and S. Katagiri. Discriminative learning for minimum error classification. IEEE Trans. on Sig. Proc., 40:3043-3054, 1992.

D. Miller, A. Rao, K. Rose, and A. Gersho. A global optimization technique for statistical classifier design. Submitted for publication, 1995.

D. Miller, A. Rao, K. Rose, and A. Gersho. A maximum entropy framework for optimal statistical classification. In IEEE Workshop on Neural Networks for Signal Processing, 1995.

J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Comp., 1:281-294, 1989.

M. T. Musavi, W. Ahmed, K. H. Chan, K. B. Faris, and D. M. Hummels. On the training of radial basis function classifiers. Neural Networks, 5:595-604, 1992.

K. Rose, E. Gurewitz, and G. C. Fox. Statistical mechanics and phase transitions in clustering. Phys. Rev. Lett., 65:945-948, 1990.
K. Rose, E. Gurewitz, and G. C. Fox. Vector quantization by deterministic annealing. IEEE Trans. on Inform. Theory, 38:1249-1258, 1992.

K. Rose, E. Gurewitz, and G. C. Fox. Constrained clustering as an optimization method. IEEE Trans. on Patt. Anal. and Mach. Intell., 15:785-794, 1993.

L. Tarassenko and S. Roberts. Supervised and unsupervised learning in radial basis function classifiers. IEE Proc.-Vis. Image Sig. Proc., 141:210-216, 1994.

A. L. Yuille. Generalized deformable models, statistical physics, and matching problems. Neural Comp., 2:1-24, 1990.", "award": [], "sourceid": 1161, "authors": [{"given_name": "David", "family_name": "Miller", "institution": null}, {"given_name": "Ajit", "family_name": "Rao", "institution": null}, {"given_name": "Kenneth", "family_name": "Rose", "institution": null}, {"given_name": "Allen", "family_name": "Gersho", "institution": null}]}