{"title": "Using Pairs of Data-Points to Define Splits for Decision Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 507, "page_last": 513, "abstract": null, "full_text": "Using Pairs of Data-Points to Define Splits for Decision Trees

Geoffrey E. Hinton
Department of Computer Science
University of Toronto
Toronto, Ontario, M5S 1A4, Canada
hinton@cs.toronto.edu

Michael Revow
Department of Computer Science
University of Toronto
Toronto, Ontario, M5S 1A4, Canada
revow@cs.toronto.edu

Abstract

Conventional binary classification trees such as CART either split the data using axis-aligned hyperplanes or they perform a computationally expensive search in the continuous space of hyperplanes with unrestricted orientations. We show that the limitations of the former can be overcome without resorting to the latter. For every pair of training data-points, there is one hyperplane that is orthogonal to the line joining the data-points and bisects this line. Such hyperplanes are plausible candidates for splits. In a comparison on a suite of 12 datasets we found that this method of generating candidate splits outperformed the standard methods, particularly when the training sets were small.

1 Introduction

Binary decision trees come in many flavours, but they all rely on splitting the set of k-dimensional data-points at each internal node into two disjoint sets. Each split is usually performed by projecting the data onto some direction in the k-dimensional space and then thresholding the scalar value of the projection. There are two commonly used methods of picking a projection direction. The simplest method is to restrict the allowable directions to the k axes defined by the data. This is the default method used in CART [1]. If this set of directions is too restrictive, the usual alternative is to search general directions in the full k-dimensional space or general directions in a space defined by a subset of the k axes.

Projections onto one of the k axes defined by the data have many advantages over projections onto a more general direction:

1. It is very efficient to perform the projection for each of the data-points. We simply ignore the values of the data-point on the other axes.

2. For N data-points, it is feasible to consider all possible axis-aligned projections and thresholds because there are only k possible projections and for each of these there are at most N - 1 threshold values that yield different splits. Selecting from a fixed set of projections and thresholds is simpler than searching the k-dimensional continuous space of hyperplanes that correspond to unrestricted projections and thresholds.

3. Since a split is selected from only about Nk candidates, it takes only about log2 N + log2 k bits to define the split. So it should be possible to use many more of these axis-aligned splits before overfitting occurs than if we use more general hyperplanes. If the data-points are in general position, each subset of size k defines a different hyperplane, so there are N!/(k!(N - k)!) distinctly different hyperplanes, and if k << N it takes approximately k log2 N bits to specify one of them (the short numerical sketch after this list makes the comparison concrete).
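The counting in point 3 can be made concrete with a small numerical sketch. The particular values N = 1000 and k = 10 below are illustrative assumptions, not taken from the paper.

```python
import math

# Illustrative sizes only: N training points in k dimensions.
N, k = 1000, 10

# Axis-aligned split: chosen from about N*k (axis, threshold) candidates.
axis_bits = math.log2(N * k)               # ~13.3 bits

# Unrestricted hyperplane through k points in general position: one of
# C(N, k) distinct splits, roughly k*log2(N) bits when k << N.
general_bits = math.log2(math.comb(N, k))  # ~77.8 bits here; k*log2(N) ~ 99.7

print(f'axis-aligned: {axis_bits:.1f} bits, general hyperplane: {general_bits:.1f} bits')
```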
For some datasets, the restriction to axis-aligned projections is too limiting. This is especially true for high-dimensional data, like images, in which there are strong correlations between the intensities of neighbouring pixels. In such cases, many axis-aligned boundaries may be required to approximate a planar boundary that is not axis-aligned, so it is natural to consider unrestricted projections, and some versions of the CART program allow this. Unfortunately this greatly increases the computational burden and the search may get trapped in local minima. Also, significant care must be exercised to avoid overfitting. There is, however, an intermediate approach which allows the projections to be non-axis-aligned but preserves all three of the attractive properties of axis-aligned projections: it is trivial to decide which side of the resulting hyperplane a given data-point lies on; the hyperplanes can be selected from a modest-sized set of sensible candidates; and hence many splits can be used before overfitting occurs because only a few bits are required to specify each split.

2 Using two data-points to define a projection

Each pair of data-points defines a direction in the data space. This direction is a plausible candidate for a projection to be used in splitting the data, especially if it is a classification task and the two data-points are in different classes. For each such direction, we could consider all of the N - 1 possible thresholds that would give different splits, or, to save time and reduce complexity, we could consider only the threshold value that is halfway between the two data-points that define the projection. If we use this threshold value, each pair of data-points defines exactly one hyperplane and we call the two data-points the 'poles' of this hyperplane.

For a general k-dimensional hyperplane it requires O(k) operations to decide whether a data-point, C, is on one side or the other. But we can save a factor of k by using hyperplanes defined by pairs of data-points. If we already know the distances of C from each of the two poles, A and B, then we only need to compare these two distances (see Figure 1 and footnote 1). So if we are willing to do O(kN^2) operations to compute all the pairwise distances between the data-points, we can then decide in constant time which side of the hyperplane a point lies on.

Figure 1: A hyperplane orthogonal to the line joining points A and B. We can quickly determine on which side a test point, C, lies by comparing the distances AC and BC.

As we are building the decision tree, we need to compute the gain in performance from using each possible split at each existing terminal node. Since all the terminal nodes combined contain N data-points and there are N(N - 1)/2 possible splits (footnote 2), this takes time O(N^3) instead of O(kN^3). So the work in computing all the pairwise distances is trivial compared with the savings.
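Below is a minimal sketch of the distance-comparison trick in present-day Python/NumPy; it is not the implementation used in the paper, and the array layout and function names are assumptions. The signed_offset helper corresponds to footnote 1: once the pairwise distances are cached, even a threshold that is not midway between the poles costs only a subtraction and a division per point.

```python
import numpy as np

def pairwise_sq_dists(X):
    # All pairwise squared Euclidean distances: the one-off O(k N^2) cost.
    sq = np.sum(X ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)

def pole_pair_side(d2, a, b):
    # Side of the pole-pair hyperplane for every point, in constant time per
    # point: with the threshold midway between the poles, a point lies on pole
    # a's side exactly when it is closer to a than to b, so we compare two
    # cached distances instead of doing an O(k) projection.
    return d2[:, a] < d2[:, b]

def signed_offset(d2, a, b):
    # Footnote 1: signed distance of each point from the mid-hyperplane along
    # the A-B direction, (d_AC^2 - d_BC^2) / (2 d_AB); works with any threshold.
    return (d2[:, a] - d2[:, b]) / (2.0 * np.sqrt(d2[a, b]))

# Tiny usage example: six random points in five dimensions, poles 0 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
d2 = pairwise_sq_dists(X)
print(pole_pair_side(d2, a=0, b=3))
print(signed_offset(d2, a=0, b=3))
```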
Using the Minimum Description Length framework, it is clear that pole-pair splits can be described very cheaply, so a lot of them can be used before overfitting occurs. When applying MDL to a supervised learning task we can assume that the receiver gets to see the input vectors for free. It is only the output vectors that need to be communicated. So if splits are selected from a set of N(N - 1)/2 possibilities that is determined by the input vectors, it takes only about 2 log2 N bits to communicate a split to a receiver. Even if we allow all N - 1 possible threshold values along the projection defined by two data-points, it takes only about 3 log2 N bits. So the number of these splits that can be used before overfitting occurs should be greater by a factor of about k/2 or k/3 than for general hyperplanes. Assuming that k << N, the same line of argument suggests that even more axis-aligned planes can be used, but only by a factor of about 2 or 3.

To summarize, the hyperplanes defined by pairs of data-points are computationally convenient and seem like natural candidates for good splits. They overcome the major weakness of axis-aligned splits and, because they can be specified in a modest number of bits, they may be more effective than fully general hyperplanes when the training set is small.

Footnote 1: If the threshold value is not midway between the poles, we can still save a factor of k, but we need to compute (d_AC^2 - d_BC^2)/(2 d_AB) instead of just the sign of this expression.

Footnote 2: Since we only consider splits in which the poles are in different classes, this number ignores a factor that is independent of N.

3 Building the decision tree

We want to compare the 'pole-pair' method of generating candidate hyperplanes with the standard axis-aligned method and the method that uses unrestricted hyperplanes. We can see no reason to expect strong interactions between the method of building the tree and the method of generating the candidate hyperplanes, but to minimize confounding effects we always use exactly the same method of building the decision tree.

We faithfully followed the method described in [1], except for a small modification: the code that was kindly supplied by Leo Breiman used a slightly different method for determining the amount of pruning.

Training a decision tree involves two distinct stages. In the first stage, nodes are repeatedly split until each terminal node is 'pure', which means that all of its data-points belong to the same class. The pure tree therefore fits the training data perfectly. A node is split by considering all candidate decision planes and choosing the one that maximizes the decrease in impurity. Breiman et al. recommend using the Gini index to measure impurity (footnote 3). If p(j|t) is the probability of class j at node t, then the Gini index is 1 - sum_j p(j|t)^2.

Footnote 3: Impurity is not an information measure but, like an information measure, it is minimized when all the nodes are pure and maximized when all classes at each node have equal probability.
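A hedged sketch of how a candidate pole-pair split could be scored by the decrease in Gini impurity follows; the node representation, the size-weighting of the two children (the usual CART convention), and the brute-force search are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def gini(labels):
    # Gini index 1 - sum_j p(j|t)^2 for the labels reaching a node.
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def impurity_decrease(d2, y, node_idx, a, b):
    # Decrease in size-weighted Gini impurity at a node for the split whose
    # poles are training points a and b (threshold midway between them).
    go_left = d2[node_idx, a] < d2[node_idx, b]
    left, right = node_idx[go_left], node_idx[~go_left]
    n = len(node_idx)
    children = (len(left) * gini(y[left]) + len(right) * gini(y[right])) / n
    return gini(y[node_idx]) - children

def best_pole_pair_split(d2, y, node_idx):
    # Brute-force search over all pole pairs whose poles are in different
    # classes, evaluated on the data-points reaching this node.
    best_pair, best_gain = None, -np.inf
    N = len(y)
    for a in range(N):
        for b in range(a + 1, N):
            if y[a] != y[b]:
                gain = impurity_decrease(d2, y, node_idx, a, b)
                if gain > best_gain:
                    best_pair, best_gain = (a, b), gain
    return best_pair, best_gain

# Tiny usage example on random data, reusing the cached distance matrix idea.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
y = rng.integers(0, 2, size=20)
sq = np.sum(X ** 2, axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
print(best_pole_pair_split(d2, y, np.arange(len(y))))
```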
Clearly the tree obtained at the end of the first stage will overfit the data, so in the second stage the tree is pruned by recombining nodes. For a tree T_i with |T_i| terminal nodes we consider the regularized cost

    C_α(T_i) = E(T_i) + α |T_i|    (1)

where E is the classification error and α is a pruning parameter. In 'weakest-link' pruning the terminal nodes are eliminated in the order which keeps (1) minimal as α increases. This leads to a particular sequence, T = {T_1, T_2, ..., T_k}, of subtrees in which |T_1| > |T_2| > ... > |T_k|. We call this the 'main' sequence of subtrees because they are trained on all of the training data.

The last remaining issue to be resolved is which tree in the main sequence to use. The simplest method is to use a separate validation set and choose the tree size that gives the best classification on it. Unfortunately, many of the datasets we used were too small to hold back a reserved validation set. So we always used 10-fold cross validation to pick the size of the tree. We first grew 10 different subsidiary trees until their terminal nodes were pure, using 9/10 of the data for training each of them. Then we pruned back each of these pure subsidiary trees, as above, producing 10 sequences of subsidiary subtrees. These subsidiary sequences could then be used for estimating the performance of each subtree in the main sequence. For each of the main subtrees, T_i, we found the largest tree in each subsidiary sequence that was no larger than T_i and estimated the performance of T_i to be the average of the performance achieved by each subsidiary subtree on the 1/10 of the data that was not used for training that subsidiary tree. We then chose the T_i that achieved the best performance estimate and used it on the test set (footnote 4). Results are expressed as the ratio of the test error rate to the baseline rate, which is the error rate of a tree with only a single terminal node.

Footnote 4: This differs from the conventional application of cross validation, where it is used to determine the best value of α rather than the tree size.
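The tree-size selection step is the easiest part of this procedure to mis-read, so here is a small sketch of that step alone. The data structures (a list of main-sequence subtree sizes and, for each fold, a largest-first list of (size, held-out error) pairs for the subsidiary pruning sequence) are assumptions, not the authors' implementation.

```python
def pick_main_subtree(main_sizes, fold_sequences):
    # main_sizes[i]     -- number of terminal nodes of main subtree T_i
    # fold_sequences[f] -- for the f-th subsidiary tree, (size, held_out_error)
    #                      pairs along its pruning sequence, largest tree first.
    # For each main subtree, take in every fold the largest subsidiary subtree
    # that is no larger than it, and average the held-out errors.
    best_i, best_err = None, float('inf')
    for i, size in enumerate(main_sizes):
        fold_errs = []
        for seq in fold_sequences:
            # Every pruning sequence ends with a single-node tree, so a match exists.
            fold_errs.append(next(e for s, e in seq if s <= size))
        avg = sum(fold_errs) / len(fold_errs)
        if avg < best_err:
            best_i, best_err = i, avg
    return best_i, best_err

# Toy usage example with a five-tree main sequence and two folds.
main_sizes = [9, 6, 4, 2, 1]
folds = [[(8, 0.30), (5, 0.22), (3, 0.20), (1, 0.35)],
         [(9, 0.28), (6, 0.24), (2, 0.21), (1, 0.33)]]
print(pick_main_subtree(main_sizes, folds))   # -> (2, 0.205)
```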
4 The Datasets

Eleven datasets were selected from the database of machine learning tasks maintained by the University of California at Irvine (see the appendix for a list of the datasets used). Except as noted in the appendix, the datasets were used exactly in the form of the distribution as of June 1993. All datasets have only continuous attributes and there are no missing values (footnote 5). The synthetic 'waves' example [1] was added as a twelfth dataset.

Footnote 5: In the BC dataset we removed the case identification number attribute and had to delete 16 cases with missing values.

Table 1 gives a brief description of the datasets. Datasets are identified by a two-letter abbreviation along the top. The rows in the table give the total number of instances, number of classes and number of attributes for each dataset.

                 IR   TR   LV   DB   BC   GL   VW   WN   VH    WV   IS   SN
Size (N)        150  215  345  768  683  163  990  178  846  2100  351  208
Classes (c)       3    3    2    2    2    2   11    3    4     3    2    2
Attributes (k)    4    5    6    8    9    9   10   13   18    21   34   60

Table 1: Summary of the datasets used.

A few datasets in the original distribution have designated training and testing subsets while others do not. To ensure regularity among datasets, we pooled all usable examples in a given dataset, randomized the order in the pool and then divided the pool into training and testing sets. Two divisions were considered. The large training division had ~ of the pooled examples allocated to the training set and ~ to the test set. The small training division had ~ of the data in the training set and ~ in the test set.

5 Results

Table 2 gives the error rates for both the large and small divisions of the data, expressed as a percentage of the error rate obtained by guessing the dominant class.

                Small Train               Large Train
Database     cart  linear   pole      cart  linear   pole
IR           14.3    14.3    4.3       5.6     5.6    5.6
TR           36.6    26.8   14.6      33.3    33.3   20.8
LV           88.9   100.0  100.0     108.7    87.0   97.8
DB           85.8    82.2   87.0      69.7    69.7   59.6
BC           12.8    14.1    8.3      15.7    12.0    9.6
GL           62.5    81.3   89.6      46.4    46.4   35.7
VW           31.8    37.7   30.0      21.4    26.2   19.2
WN           17.8    13.7   11.0      14.7    11.8   14.7
VH           42.5    46.5   44.2      36.2    43.9   40.7
WV           28.9    25.8   24.3      30.6    24.8   26.6
IS           44.0    31.0   41.7      21.4    23.8   42.9
SN           65.2    71.2   48.5      48.4    45.2   48.4

Table 2: Relative error rates expressed as a percentage of the baseline rate on the small and large training sets.

In both the small and large training divisions of the datasets, the pole-pair method had lower error rates than axis-aligned or linear CART in the majority of datasets tested. While these results are interesting, they do not provide any measure of confidence that one method performs better or worse than another. Since all methods were trained and tested on the same data, we can perform a two-tailed McNemar test [2] on the predictions for pairs of methods. The resulting P-values are given in Table 3. On most of the tasks, the pole-pair method is significantly better than at least one of the standard methods for at least one of the training set sizes, and there are only 2 tasks for which either of the other methods is significantly better on either training set size.
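The significance test itself is straightforward to reproduce. The paper cites Fleiss [2] for the two-tailed McNemar test but does not spell out the exact variant used, so the exact-binomial form sketched below is an assumption.

```python
from math import comb

def mcnemar_exact_two_sided(y_true, pred_a, pred_b):
    # Only discordant test cases matter: b counts cases method A gets right and
    # method B gets wrong, c counts the reverse. Under the null hypothesis each
    # discordant case is equally likely to favour either method, so
    # b ~ Binomial(b + c, 1/2); the two-sided P-value doubles the smaller tail.
    b = sum(1 for t, pa, pb in zip(y_true, pred_a, pred_b) if pa == t and pb != t)
    c = sum(1 for t, pa, pb in zip(y_true, pred_a, pred_b) if pa != t and pb == t)
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Toy usage example: two methods' predictions on the same ten test cases.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
pred_a = [0, 1, 1, 0, 1, 0, 1, 1, 1, 0]
pred_b = [0, 0, 1, 1, 1, 0, 1, 0, 1, 0]
print(mcnemar_exact_two_sided(y_true, pred_a, pred_b))   # -> 0.25
```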
6 Discussion

We only considered hyperplanes whose poles were in different classes, since these seemed more plausible candidates. An alternative strategy is to disregard class membership and consider all possible pole-pairs. Another variant of the method arises depending on whether the inputs are scaled. We transformed all inputs so that the training data has zero mean and unit variance. However, using unscaled inputs and/or allowing both poles to have the same class makes little difference to the overall advantage of the pole-pair method.

To summarize, we have demonstrated that the pole-pair method is a simple, effective method for generating projection directions at binary tree nodes. The same idea of minimizing complexity by selecting among a sensible fixed set of possibilities rather than searching a continuous space can also be applied to the choice of input-to-hidden weights in a neural network.

A Databases used in the study

IR - Iris plant database.
TR - Thyroid gland data.
LV - BUPA liver disorders.
DB - Pima Indians Diabetes.
BC - Breast cancer database from the University of Wisconsin Hospitals.
GL - Glass identification database. In these experiments we only considered the classification into float/nonfloat processed glass, ignoring other types of glass.
VW - Vowel recognition.
WN - Wine recognition.
VH - Vehicle silhouettes.
WV - Waveform example, the synthetic example from [1].
IS - Johns Hopkins University Ionosphere database.
SN - Sonar, mines versus rocks discrimination. We did not control for aspect-angle.

[Table 3 entries not reproduced: P-values for the Axis-Pole, Linear-Pole and Axis-Linear comparisons on each of the 12 datasets, for the small and large training divisions.]

Table 3: P-values using a two-tailed McNemar test on the small (top) and large (bottom) training sets. Each row gives P-values when the methods in the leftmost column are compared. A significant difference at the P = 0.05 level is indicated with a line above (below) the P-value, depending on whether the first (second) mentioned method in the first column had superior performance. For example, in the topmost row, the pole-pair method was significantly better than the axis-aligned method on the TR dataset.

Acknowledgments

We thank Leo Breiman for kindly making his CART code available to us. This research was funded by the Institute for Robotics and Intelligent Systems and by NSERC. Hinton is a fellow of the Canadian Institute for Advanced Research.

References

[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, California, 1984.

[2] J. L. Fleiss. Statistical Methods for Rates and Proportions. Second edition. Wiley, 1981.
", "award": [], "sourceid": 1171, "authors": [{"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Michael", "family_name": "Revow", "institution": null}]}