{"title": "GDS: Gradient Descent Generation of Symbolic Classification Rules", "book": "Advances in Neural Information Processing Systems", "page_first": 1093, "page_last": 1100, "abstract": null, "full_text": "GDS: Gradient Descent Generation of Symbolic Classification Rules \n\nReinhard Blasig \n\nKaiserslautern University, Germany \n\nPresent address: Siemens AG, ZFE ST SN 41, 81730 M\u00fcnchen, Germany \n\nAbstract \n\nImagine you have designed a neural network that successfully learns a complex classification task. What are the relevant input features the classifier relies on and how are these features combined to produce the classification decisions? There are applications where a deeper insight into the structure of an adaptive system and thus into the underlying classification problem may well be as important as the system's performance characteristics, e.g. in economics or medicine. GDS^1 is a backpropagation-based training scheme that produces networks transformable into an equivalent and concise set of IF-THEN rules. This is achieved by imposing penalty terms on the network parameters that adapt the network to the expressive power of this class of rules. Thus during training we simultaneously minimize classification and transformation error. Some real-world tasks demonstrate the viability of our approach. \n\n1 Introduction \n\nThis paper deals with backpropagation networks trained to perform a classification task on Boolean or real-valued data. Given such a classification task, in most cases it is not too difficult to devise a network architecture that is capable of learning the input-output relation as represented by a number of training examples. 
Once training is finished one has a black box which often does quite a good job not only on the training patterns but also on some previously unseen test patterns. A good generalization performance indicates that the network has grasped part of the structure inherent in the classification task. The net has figured out which input features are relevant to make a classification decision and which are not. It has also modelled the way the relevant features have to be combined in order to produce the classifying output. In many applications it is important to get an understanding of this information hidden inside the neural network. Not only does this help to create or verify a domain theory; the analysis of this information may also help human experts to determine when and in what way the classifier will fail. \n\n^1 Gradient Descent Symbolic Rule Generation \n\nIn order to explicate the network's implicit information, we transform it into a set of rules. This idea is not new, cf. (Saito and Nakano, 1988), (Bochereau and Bourgine, 1990), (Hayashi, 1991) and (Towell and Shavlik, 1992). In contrast to these approaches, which extract rules after BP-training is finished, we apply penalty terms during training to adapt the network's expressive power to that of the rules we want to generate. Consequently the net will be transformable into an equivalent set of rules. \n\nDue to their good comprehensibility we restrict the rules to be of the form IF <premise> THEN <conclusion>, where the premise as well as the conclusion are Boolean expressions. 
To actually make the transformation two problems have to be solved: \n\n\u2022 Neural nets are well known for their distributed representation of information; so in order to transform a net into a concise and comprehensible rule set one has to find a way of condensing this information without substantially changing it. \n\n\u2022 In the case of backpropagation networks a continuous activation function determines a node's output depending on its activation. However, the dynamics of this function have no counterpart in the context of rule-based descriptions. \n\nWe address these problems by introducing a penalty function $E_P$, which we add to the classification error $E_C$, yielding the total backpropagation error \n\n$E_T = E_C + \\lambda E_P$. (1) \n\n2 The Penalty Term \n\nThe term $E_P$ is intended to have two effects on the network weights. First, by a weight decay component it aims at reducing network complexity by pushing a (hopefully large) fraction of the weights to 0. The smaller the net, the more concise the rules describing its behavior will be. As a positive side effect, this component will tend to act as a form of \"Occam's razor\": simple networks are more likely to exhibit good generalization than complex ones. Secondly, the penalty term should minimize the error caused by transforming the network into a set of rules. Adopting the common approach that each non-input neuron represents one rule, there would be no transformation error if the neurons' activation functions were threshold functions; the Boolean node output would then indicate whether the conclusion is drawn or not. 
But since backpropagation neurons use continuous activation functions like $y = \\tanh(x)$ to transform their activation value x into the output value y, we are left with the difficulty of interpreting the continuous output of a neuron. Thus our penalty term will be designed to produce a high penalty for those neurons of the backpropagation net whose behavior cannot be well approximated by threshold neurons, because their activation values are likely to fall into the nonsaturated region of the tanh function.^2 \n\nFigure 1 (plot of y = tanh(x); x axis from -3.00 to 3.00, y axis from -1.00 to 1.00): We regard |x| > 3 with |y| = |tanh(x)| > 0.9 as the regions where a sigmoidal neuron can be approximated by a threshold neuron. The nonsaturated region is marked by the dashed box. \n\nFor a better understanding of our penalty term one has to be aware of the fact that IF-THEN rules with a Boolean premise and conclusion are essentially Boolean functions. It can easily be shown that any such function can be calculated by a network of threshold neurons provided there is one (sufficiently large) hidden layer. This is still true if we restrict connection weights to the values {-1, 0, 1} and node thresholds to be integers (Hertz, Krogh and Palmer, 1991). 
In order to transfer this scenario to nets with sigmoidal activation functions, and having in mind that the activation values of the sigmoidal neurons should always lie outside the interval (-3, 3) (see figure 1), we require the nodes' biases to be odd multiples of \u00b13 and the weights $w_{ji}$ to obey \n\n$w_{ji} \\in \\{-6, 0, 6\\}$. (2) \n\nWe briefly comment on the practical problem that sometimes bias values as large as $\\pm 6 m_i$ ($m_i$ being the fan-in of node i) may be necessary to implement certain Boolean functions. This may slow down or even block the learning process. A simple solution to this problem is to use some additional input units with a constant output of +1. If the connections to these units are also subject to the penalty function $E_P$, it is sufficient to restrict the bias values to \n\n$b_i \\in \\{-3, 3\\}$. (3) \n\n^2 We have to point out that the conversion of sigmoidal neurons to threshold neurons will reduce the net's computational power: there are Boolean functions which can be computed by a net of sigmoidal neurons, but which exceed the capacity of a threshold net of the same topology (Maass, Schnitger and Sontag, 1991). Note that the objective to use threshold units is a consequence of the decision to search for rules of the type IF <premise> THEN <conclusion>. A failure of the net to simultaneously minimize both parts of the error measure may indicate that other rule types are more adequate to handle the given classification task. \n\nNow we can define penalty functions that push the biases and weights to the desired values. Obviously $E_b$ (the bias penalty) and $E_w$ (the weight penalty) have to be different: \n\n$E_b(b_i) = |3 - |b_i||$ (4) \n\n$E_w(w_{ji}) = \\begin{cases} |6 - |w_{ji}|| & \\text{for } |w_{ji}| \\ge \\Theta \\\\ |w_{ji}| & \\text{for } |w_{ji}| < \\Theta \\end{cases}$ (5) \n\nThe parameter $\\Theta$ determines whether a weight should be subject to decay or pushed to attain the value 6 (or -6 respectively). Figure 2 displays the graphs of the penalty functions. \n\nFigure 2 (plots with axis marks at -3.0 and 3.0 for $E_b$, and at -6.0, $-\\Theta$, $\\Theta$ and 6.0 for $E_w$): The penalty functions $E_b$ and $E_w$. \n\nThe value of $\\Theta$ is chosen with the objective that only those weights should exceed this value which almost certainly have to be nonzero to solve the given classification task. Since we initialize the network with weights uniformly distributed in the interval [-0.5, 0.5], $\\Theta = 1.5$ works well at the beginning of the training process. The penalty term then has the effect of a pure weight decay. When learning proceeds and the weights converge, we can slowly reduce the value of $\\Theta$, because superfluous weights will already have decayed. So after each sequence of 100 training patterns, say, we multiply $\\Theta$ by a factor of 0.995. \n\nObservation shows that weights which once exceeded the value of $\\Theta$ quickly reach 6 or -6 and that there are relatively few cases where a large weight is reduced again to a value smaller than $\\Theta$. Accordingly, the number of weights in {-6, 6} successively grows in the course of learning, and the criterion to stop training thus influences the number of nonzero weights. \n\nThe end of training is determined by means of cross validation. However, we do not examine the cross validation performance of the trained net, but that of the corresponding rule set. This is accomplished by calculating the performance of the original net with all weights and biases replaced by their optimal values according to (2) and (3). \n\nThe weighting factor $\\lambda$ of the penalty term (see equation 1) is critical for good learning performance. 
We pursued the strategy to start learning with $\\lambda = 0$, so that the network parameters first move into a region where the classification error is small. If this error falls below a prespecified tolerance level L, $\\lambda$ is incremented by 0.001. The factor $\\lambda$ goes down by the same amount when the error grows larger than L.^3 By adjusting the weighting factor every 100 training patterns we keep the classification error close to the tolerance level. The choice of L of course depends on the learning task. As a heuristic, L should be slightly larger than the classification error attainable by a non-penalized network. \n\n3 Splice-Junction Recognition \n\nDNA, which carries the genetic information of biological cells, can be thought of as being composed of two types of subsequences: exons and introns. The task is to classify each DNA position as either an exon-to-intron transition (EI), an intron-to-exon transition (IE) or neither (N). The only information available is a sequence of 30 nucleotides (A, C, G or T) before and 30 nucleotides after the position to be classified. Splice-junction recognition is a classification task that has already been investigated by a number of machine learning researchers using various adaptive models. \n\nThe pattern reservoir contains about 3200 DNA samples, 30% of which were used for training, 10% for cross-validation and 60% for testing. Since we used a grandmother-cell coding for the input DNA sequence, the network has an input layer of 4*60 neurons. With a hidden layer of 20 neurons^4 and two output units for the classes EI and IE, this amounts to about 5000 free parameters. 
The following table compares the classification performance of our penalty term approach and other machine learning algorithms, cf. (Murphy and Aha, 1992). \n\nTable 1: Splice-junction recognition: error (in percent) of various machine learning algorithms \n\nalgorithm             N      EI      IE   total \nKBANN              4.62    7.56    8.47    6.32 \nGDS                6.71    9.24    4.43    6.75 \nBackprop           5.29    5.74   10.75    6.77 \nPerceptron         3.99   16.32   17.41   10.43 \nID3                8.84   10.58   13.99   10.56 \nNearest Neighbor  31.11   11.65    9.09   20.74 \n\nSurprisingly, the GDS network turned out to be very small. The weight decay component of our penalty term managed to push all but 61 weights to zero, making use of only three hidden neurons. Thus in addition to performing very well, the network is transformable into a concise rule set, as follows:^5 \n\n^3 Negative $\\lambda$-values are not allowed. \n^4 A reasonable size, considering the experiments described in (Shavlik et al., 1991). \n^5 We adopt a notation commonly used in this domain: @n denotes the position of the first nucleotide in the given sequence being left (negative n) or right (positive n) to the point to be classified. Nucleotide 'Y' stands for ('C' or 'T'), 'X' is any of {A, C, G, T}. Consequently, e.g. neuron hidden(2) is active iff at least four of the five nucleotides of the sequence 'GTAXG' are identical to the input pattern at positions 1 to 5 right of the possible splice junction. 
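As a minimal sketch of the m-of-n reading described in footnote 5, the following Python fragment checks whether a hidden unit's pattern matches a window of nucleotides. The names WILDCARDS, matches and m_of_n are hypothetical and not part of the original work; the wildcard for 'C' or 'T' is written 'Y' here, as in the pattern 'YAG'.

```python
# Wildcard symbols from footnote 5: 'Y' is C or T, 'X' is any nucleotide.
WILDCARDS = {'Y': 'CT', 'X': 'ACGT'}


def matches(symbol, nucleotide):
    # A plain symbol matches only itself; a wildcard matches its whole set.
    return nucleotide in WILDCARDS.get(symbol, symbol)


def m_of_n(window, pattern, m):
    # True iff at least m positions of the window agree with the pattern.
    if len(window) != len(pattern):
        raise ValueError('window and pattern must have the same length')
    hits = sum(matches(p, n) for p, n in zip(pattern, window))
    return hits >= m
```

Under these assumptions, m_of_n('GTCCG', 'GTAXG', 4) is True, because the positions G, T, X and G agree and only the third position differs.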
\n\nhidden(2): at least 4 nucleotides match sequence @1: 'GTAXG' \nhidden(11): at least 3 nucleotides match sequence @-3: 'YAG' \nhidden(17): at least 1 nucleotide matches sequence @-1: 'GG' \n\nclass EI: hidden(2) AND hidden(11) \nclass IE: NOT(hidden(2)) AND hidden(17) \n\n4 Prediction of Interest Rates \n\nThis is an application where the network input is a vector of real numbers. Since our approach can only handle binary input, we supplement the net with a discretization layer that provides a thermometer code representation (Hancock, 1988) of the continuous-valued input. In contrast to pure Boolean learning algorithms (Goodman, Miller and Smyth, 1989), (Mezard and Nadal, 1989), which can also be endowed with discretization facilities, here the discretization process is fully integrated into the learning scheme, as the discretization intervals will be adapted by the backpropagation algorithm. \n\nThe data comprises a total of 226 patterns, which we distribute randomly on three sets: training set (60%), cross-validation set (20%) and test set (20%). The input represents the monthly development of 14 economic time series during the last 19 years. The Boolean target indicates whether the interest rates will go up or down during the six months succeeding the reference month.^6 The time series include among others month of the year, income of private households or the amount of German foreign investments. For some time series it is useful not to take the raw feature measurements as input, but the difference between two succeeding measurements; this is advantageous if the underlying time series show only small changes relative to their absolute values. 
All series were normalized to have values in the range from -1 to +1. \n\n^6 I.e. the month where the input data has been measured. \n\nWe used a network containing a discretization layer of two neurons per input dimension. So there are 28 discretization neurons, which are fully connected to the 10 hidden nodes. The output layer consists of a single neuron. Since our data set is relatively small, the intention to obtain simple rules is not only motivated by the objective of comprehensibility, but also by the notion that we cannot expect a large rule set to be justified by a small amount of training data. In fact, during training 90% of the weights were set to zero and three hidden units proved to be sufficient for this task. Nevertheless the prediction error on the test set could be reduced to 25%. This compares to an error rate of about 20% attainable by a standard backpropagation network with one hidden layer of ten neurons and no input discretization. We thus sacrificed 5 percentage points of prediction performance to yield a very compact net that can be easily transformed into a set of rules. Some of the generated rules are shown below. The first rule e.g. states that interest rates will rise if private income increases AND foreign investments decrease by a certain amount during the reference month. \n\nIf the rules produce contradicting predictions for a given input, the final decision will be made according to a majority vote. A tie is broken by the bias value of the output unit, which states that by default interest rates will rise. \n\nIF (at least 2 of { increase of private income < 0.73%, \n decrease of foreign investments < 64 Mio DM }) \nTHEN 
(interest rates will rise) \nELSE (interest rates will fall). \n\nIF (at least 3 of { increase of business climate estimate < 1.76%, \n treasury bond yields (11 months ago) > 7.36%, \n treasury bond yields (12 months ago) > 8.2%, \n increase of foreign investments < 60 Mio DM }) \nTHEN (interest rates will fall) \nELSE (interest rates will rise). \n\n5 Conclusion and Future Work \n\nGDS is a learning algorithm that utilizes a penalty term in order to prepare a backpropagation network for rule extraction. The term is designed to have two effects on the network's weights: \n\n\u2022 By a weight decay component, the number of nonzero weights is reduced: thus we get a net that can hopefully be transformed into a concise and comprehensible rule set. \n\n\u2022 The penalty term encourages weight constellations that keep the node activations out of the nonsaturated part of the activation function. This is motivated by the fact that rules of the type IF <premise> THEN <conclusion> can only mimic the behavior of threshold units. \n\nThe important point is that our penalty function adapts the net to the expressive power of the type of rules we wish to obtain. Consequently, we are able to transform the network into an equivalent rule set. The applicability of GDS was demonstrated on two tasks: splice-junction recognition and the prediction of German interest rates. In both cases the generated rules not only showed a generalization performance close to or even superior to what can be attained by other machine learning approaches such as MLPs or ID3; they also proved to be very concise and comprehensible. This is even more remarkable, since both applications represent real-world tasks with a large number of inputs. 
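The two effects of the penalty term can be made concrete with a small Python sketch of equations (4) and (5) and of the schedules for lambda and Theta described in section 2. The function names, the plain summation used for the overall penalty, and the handling of an error exactly equal to the tolerance level are assumptions made for illustration; the paper does not spell them out.

```python
def bias_penalty(b):
    # Equation (4): zero when the bias sits at -3 or +3.
    return abs(3.0 - abs(b))


def weight_penalty(w, theta):
    # Equation (5): weights at or above theta in magnitude are pulled
    # towards -6 or +6, smaller weights are subject to plain decay.
    if abs(w) >= theta:
        return abs(6.0 - abs(w))
    return abs(w)


def total_penalty(weights, biases, theta):
    # Assumption: the overall penalty is the sum over all parameters.
    return (sum(weight_penalty(w, theta) for w in weights)
            + sum(bias_penalty(b) for b in biases))


def update_lambda(lam, class_error, tolerance, step=0.001):
    # Raise lambda while the classification error is below the tolerance
    # level, lower it when the error is above; never go below zero.
    if class_error < tolerance:
        return lam + step
    if class_error > tolerance:
        return max(0.0, lam - step)
    return lam  # assumption: leave lambda unchanged on exact equality


def decay_theta(theta, factor=0.995):
    # Applied after each block of 100 training patterns.
    return theta * factor
```

With Theta = 1.5, a weight of 0.4 is penalized by its magnitude (pure decay), while a weight of 2.0 is penalized by its distance of 4.0 from the target value 6.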
\nClearly the applied penalty terms impose severe restrictions on the network parameters: besides minimizing the number of nonzero weights, the weights are restricted to a small set of distinct values. Last but not least, the simplification of sigmoidal to threshold units also affects the net's computational power. There are applications where such a strong bias may negatively influence the net's learning capabilities. Furthermore our current approach is only applicable to tasks with binary target patterns. These limitations can be overcome by dealing with more general rules than those of the Boolean IF-THEN type. Future work will go in this direction. \n\nAcknowledgements \n\nI wish to thank Hans-Georg Zimmermann and Ferdinand Hergert for many useful discussions and for providing the data on interest rates, and Patrick Murphy and David Aha for providing the UCI Repository of ML databases. This work was supported by a grant of the Siemens AG, Munich. \n\nReferences \n\nL. Bochereau, P. Bourgine. (1990) Extraction of Semantic Features and Logical Rules from a Multilayer Neural Network. Proceedings of the 1990 IJCNN - Washington DC, Vol. II, 579-582. \n\nR.M. Goodman, J.W. Miller, P. Smyth. (1989) An Information Theoretic Approach to Rule-Based Connectionist Expert Systems. Advances in Neural Information Processing Systems 1, 256-263. San Mateo, CA: Morgan Kaufmann. \n\nP.J.B. Hancock. (1988) Data Representation in Neural Nets: an Empirical Study. Proc. Connectionist Summer School. \n\nY. Hayashi. (1991) A Neural Expert System with Automated Extraction of Fuzzy If-Then Rules and its Application to Medical Diagnosis. Advances in Neural Information Processing Systems 3, 578-584. San Mateo, CA: Morgan Kaufmann. \n\nJ. Hertz, A. Krogh, R.G. Palmer. 
(1991) Introduction to the Theory of Neural Computation. Addison-Wesley. \n\nC.M. Higgins, R.M. Goodman. (1991) Incremental Learning with Rule-Based Neural Networks. Proceedings of the 1991 IEEE INNS International Joint Conference on Neural Networks - Seattle, Vol. 1, 875-880. \n\nM. Mezard, J.-P. Nadal. (1989) Learning in Feedforward Layered Networks: The Tiling Algorithm. J. Phys. A: Math. Gen. 22, 2191-2203. \n\nW. Maass, G. Schnitger, E.D. Sontag. (1991) On the Computational Power of Sigmoids versus Boolean Threshold Circuits. Proceedings of the 32nd Annual IEEE Symposium on Foundations of Computer Science, 767-776. \n\nP.M. Murphy, D.W. Aha. (1992) UCI Repository of machine learning databases [ftp-site: ics.uci.edu: pub/machine-learning-databases]. Irvine, CA: University of California, Department of Information and Computer Science. \n\nJ.R. Quinlan. (1986) Induction of Decision Trees. Machine Learning, 1: 81-106. \n\nK. Saito, R. Nakano. (1988) Medical diagnostic expert systems based on PDP model. Proc. IEEE International Conference on Neural Networks, Vol. I, 255-262. \n\nV. Tresp, J. Hollatz, S. Ahmad. (1993) Network Structuring and Training Using Rule-Based Knowledge. Advances in Neural Information Processing Systems 5, 871-878. San Mateo, CA: Morgan Kaufmann. \n\nG.G. Towell, J.W. Shavlik. (1991) Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences. In: Lippmann, Moody, Touretzky (eds.), Advances in Neural Information Processing Systems 3, 530-536. San Mateo, CA: Morgan Kaufmann.", "award": [], "sourceid": 774, "authors": [{"given_name": "Reinhard", "family_name": "Blasig", "institution": null}]}