{"title": "Ordered Classes and Incomplete Examples in Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 550, "page_last": 556, "abstract": null, "full_text": "Ordered Classes and Incomplete Examples \n\nin Classification \n\nMark Mathieson \n\nDepartment of Statistics, University of Oxford \n1 South Parks Road, Oxford OXI 3TG, UK \n\nE-mail:  mathies@stats.ox.ac.uk \n\nAbstract \n\nThe classes in classification tasks often have a natural ordering, and the \ntraining and testing examples are often incomplete.  We  propose a non(cid:173)\nlinear ordinal  model for  classification into ordered classes.  Predictive, \nsimulation-based approaches are used to learn from past and classify fu(cid:173)\nture  incomplete examples.  These techniques are  illustrated  by  making \nprognoses for patients who have suffered severe head injuries. \n\n1  Motivation \n\nJennett et al.  (1979) reported data on patients with  severe head injuries.  For each  patient \nsome of the  information in Table  1 was available shortly after injury.  The objective is to \npredict the degree of recovery attained within six months as  measured by  outcome.  This \nproblem exhibits two characteristics that are common in classification tasks:  allocation qf \nexamples into classes which have a natural ordering, and learning from past and classifying \nfuture incomplete examples. \n\n2  A Flexible Model for Ordered Classes \n\nThe Bayes decision rule (see, for example, Ripley,  1996) depends on the loss  L(j, k)  in(cid:173)\ncurred in assigning to class  k  an object belonging to class j.  When better information is \nunavailable, for unordered or nominal classes we treat every mis-classification as  equally \nserious:  LU, k)  is 0 when j  = k and 1 otherwise. For ordered classes, when the K  classes \nare numbered from 1 to K  in their natural order, a better default choice is LU, k) =1  j - k  I. \nA class is then given support by its position in the ordering, and the Bayes rule will some(cid:173)\ntimes assign  patterns to  classes  that do not have maximum posterior probability to avoid \nmaking a serious error. \n\n\fOrdered Classes and Incomplete Examples in Classification \n\n551 \n\nTable 1:  Definition of variables with proportion missing. \n\nVariable \nage \nemv \nmotor \nchange \neye \npupils \noutcome \n\nDefinition \n\nAge in decades (1=0-9, 2=10-19,  ... ,8=70+). \nMeasure of eye, motor and verbal response to stimulation (1-7). \nMotor response patterns for all limbs (1-7). \nChange in neurological function over the first 24 hours (-1,0,+1). \nEye indicant.  1 (bad), 2 (impaired), 3 (good). \nPupil reaction to light.  1 (non-reacting), 2 (reacting). \nRecovery after six months based on Glasgow Outcome Scale. \n1 (dead/vegetative), 2 (severe disability), 3 (moderate/good recovery). \n\nMissing % \n\n\u00b0 41 \n\n33 \n78 \n65 \n30 \n\n\u00b0 \n\nIf the classes in  a classification problem are ordered the ordering should also be reflected \nin  the  probability  model.  Methods  for  nominal  tasks  can  certainly  be  used  for  ordinal \nproblems,  but an  ordinal model should have  a simpler parameterization than comparable \nnominal models, and interpretation will be easier.  Suppose that an example represented by \na row vector X  belongs to class C  =  C(X).  To  make the Bayes-optimal classification it \nis  sufficient to know the  posterior probabilities p(C  =  k  I X  = x).  
The ordinal logistic regression (OLR) model for K ordered classes models the cumulative posterior class probabilities p(C ≤ k | X = x) by

log [ p(C ≤ k | X = x) / (1 - p(C ≤ k | X = x)) ] = φ_k - η(x),   k = 1, ..., K - 1,   (1)

for some function η. We impose the constraints φ_1 ≤ ... ≤ φ_{K-1} on the cut-points to ensure that p(C ≤ k | X = x) increases with k. If φ_0 = -∞ and φ_K = ∞ then (1) gives

p(C = k | X = x) = σ(φ_k - η(x)) - σ(φ_{k-1} - η(x)),   k = 1, ..., K,

where σ(x) = 1/(1 + e^{-x}). McCullagh (1980) proposed linear OLR, where η(x) = xβ. The posterior probabilities depend on the patterns x only through η, and high values of η(x) correspond to higher predicted classes (Figure 1a). This can be useful for interpreting the fitted model. However, linear OLR is rather inflexible since the decision boundaries are always parallel hyperplanes. Departures from linearity can be accommodated by allowing η to be a non-linear function of the feature space. We extend OLR to non-linear ordinal logistic regression (NOLR) by letting η(x) be the single linear output of a feed-forward neural network with input vector x, having skip-layer connections and sigmoid transfer functions in the hidden layer (Figure 1b). Then for weights w_{ij} and biases b_j we have

η(x) = Σ_{i→o} w_{io} x_{(i)} + Σ_{j→o} w_{jo} σ(b_j + Σ_{i→j} w_{ij} x_{(i)}),

where Σ_{i→j} denotes the sum over i such that node i is connected to node j, and node o is the single output node. The usual output-unit bias is incorporated in the cut-points. Observe that OLR is the special case of NOLR with no hidden nodes. Although the network component of NOLR is a universal approximator, the NOLR model cannot approximate all probability densities arbitrarily well (unlike 'softmax', the most similar nominal method). The likelihood for the cut-points φ = (φ_1, ..., φ_{K-1}) and network parameters w given a training set T = {(x_i, c_i) | i = 1, ..., n} of n correctly classified examples is

L(w, φ) = ∏_{i=1}^n p(c_i | x_i) = ∏_{i=1}^n [ σ(φ_{c_i} - η(x_i; w)) - σ(φ_{c_i - 1} - η(x_i; w)) ].   (2)

Figure 1: (a) p(k | η) plotted against η for an OLR model with K = 5 classes and φ = (-7, -6, -3, -1). (b) The network output η(x) from a NOLR model used to predict change given all other variables (except outcome) predicts that young patients with a high emv score are likely to improve over the first 24 hours. While age (years) and emv are varied, other variables are fixed. Dark shading denotes low values of η(x). The Bayes decision boundaries are shown for loss L(j, k) = |j - k|.

If we estimate the classifier by substituting the maximum likelihood estimates we must maximize (2) whilst constraining the cut-points to be increasing (Mathieson, 1996).
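The forward pass of a NOLR model is compact enough to state in code. The sketch below (Python with NumPy; all names are illustrative and the parameters are random rather than fitted) computes η(x) for a skip-layer network and converts it into class probabilities through the cut-points, as in (1) and the display above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nolr_probs(x, w_skip, W_in, b, w_out, phi):
    # eta(x): direct skip-layer term plus one hidden layer of sigmoid units
    eta = w_skip @ x + w_out @ sigmoid(b + W_in @ x)
    # cumulative probabilities p(C <= k | x) = sigma(phi_k - eta),
    # padded with phi_0 = -inf and phi_K = +inf
    cum = np.concatenate(([0.0], sigmoid(phi - eta), [1.0]))
    return np.diff(cum)                 # p(C = k | x) for k = 1..K

rng = np.random.default_rng(0)
p_dim, H = 6, 4                         # 6 inputs, 4 hidden nodes
x = rng.normal(size=p_dim)
probs = nolr_probs(x, rng.normal(size=p_dim), rng.normal(size=(H, p_dim)),
                   rng.normal(size=H), rng.normal(size=H),
                   phi=np.array([-2.0, 0.0, 1.5]))    # K = 4 ordered classes
print(probs, probs.sum())               # non-negative, sums to 1: phi is increasing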
To avoid over-fitting we regularize both by weight decay (which is equivalent to putting independent Gaussian priors on the network weights) and by imposing independent Gamma priors on the differences between adjacent cut-points. The minimand is then -log L(w, φ) + λ D(w) + E(φ; t, α) with hyperparameters λ > 0, t and α (to be chosen by cross-validation, for example, or averaged over under a Bayesian scheme), where D(w) = Σ_{i,j} w_{ij}^2 and

E(φ) = Σ_{i=2}^{K-1} [ t(φ_i - φ_{i-1}) + (1 - α) log(φ_i - φ_{i-1}) ].

3 Classification and Incomplete Examples

We now consider simulation-based methods for training diagnostic-paradigm classifiers from incomplete examples, and for classifying future incomplete examples. To avoid modelling the missing data we assume that the missing data mechanism is independent of the missing values given the observed values (missing at random) and that the missing data and data generation mechanisms are independent (ignorable) (Little & Rubin, 1987). This assumption is rarely true but is usually less damaging than adopting crude ad hoc approaches to missing values.

3.1 Learning from Incomplete Examples

The training set is T = {(x_i, c_i) | i = 1, ..., n}, where x_i^o and x_i^u are the observed and unobserved parts of the ith example, which belongs to class c_i. Define X^o = {x_i^o | i = 1, ..., n} and X^u = {x_i^u | i = 1, ..., n}, and use C to denote all the classes, so the observed data are T = (X^o, C). We assume that C is fully observed. Under the diagnostic paradigm (which includes logistic regression and its non-linear and ordinal variants such as 'softmax' and NOLR) we model p(c | x) by p(c | x; θ), giving the conditional likelihood

L(θ) = ∏_{i=1}^n p(c_i | x_i^o; θ) = ∏_{i=1}^n E_{X_i^u | x_i^o} p(c_i | x_i^o, X_i^u; θ) = E_{X^u | X^o} ∏_{i=1}^n p(c_i | x_i^o, X_i^u; θ)   (3)

when the examples are independent. The model for p(c | x) contains no information about p(x), and so we construct a model for p(X^u | X^o) separately using T (Section 3.2). Once we can sample x_{i1}^u, ..., x_{im}^u from p(x_i^u | x_i^o, c_i), a Monte Carlo approximation for L(θ) based on the last expression in (3) is obtained by averaging over repeated imputations of the missing values in the training set (Little & Rubin, 1987, and earlier):

log L(θ) ≈ Σ_{i=1}^n log ( (1/m) Σ_{j=1}^m p(c_i | x_i^o, x_{ij}^u; θ) ).   (4)

Existing algorithms for finding maximum likelihood estimates for θ allow maximization of the individual summands in (4) with respect to θ, but in general the software will require extensive modification in order to maximize the average. This problem can be avoided if we approximate the arithmetic average over the imputations by a geometric one, so that L(θ) ≈ [ ∏_j ∏_i p(c_i | x_i^o, x_{ij}^u; θ) ]^{1/m}. The log-posterior then averages over the logs of the likelihoods of the completed training sets, so standard estimation algorithms can be used on a training set formed by pooling all completions of the training set, giving each weight 1/m. The approximation log[(1/m) Σ_j p_j] ≈ (1/m) Σ_j log p_j has been made, where we define p_j(θ) = ∏_i p(c_i | x_i^o, x_{ij}^u; θ), although in fact log[(1/m) Σ_j p_j] ≥ (1/m) Σ_j log p_j everywhere.
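The gap between the arithmetic average in (4) and the geometric average used for pooled estimation is easy to check numerically. Below is a small sketch (Python with NumPy) in which invented likelihood values stand in for p(c_i | x_i^o, x_{ij}^u; θ); it only illustrates the inequality just stated, not the head injury model.

import numpy as np

rng = np.random.default_rng(1)
# Invented per-example likelihoods for m = 5 completions of a training
# set of n = 10 examples; rows index completions, columns index examples.
lik = rng.uniform(0.2, 0.9, size=(5, 10))

p_j = lik.prod(axis=1)             # p_j(theta): likelihood of the jth completed set
arith = np.log(p_j.mean())         # log[(1/m) sum_j p_j], the target quantity
geom = np.log(p_j).mean()          # (1/m) sum_j log p_j, the geometric surrogate
print(arith, geom, arith >= geom)  # True everywhere, by Jensen's inequality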
Suppose that the p_j are well approximated by some function p̄ for the region of interest in the parameter space. Then in this region

log[(1/m) Σ_j p_j] - (1/m) Σ_j log p_j ≈ (1/(2m)) Σ_i (p_i - p̄)^2 / p̄^2 - (1/(2m^2)) Σ_{i,j} (p_i - p̄)(p_j - p̄) / p̄^2,   (5)

and so the approximation will be reasonable when the imputed values have little effect on the likelihood of the completed training sets. Note that the approximation cannot be improved by increasing m; (5) does not tend to zero as m → ∞. The relative effects on the likelihood of making this approximation and the Monte Carlo approximation (4) will be problem specific and dependent on m.

The predictive approach (Ripley, 1996, for example) incorporates uncertainty in θ by estimating p(c | x) as p̂(c | x) = E_{θ|T} p(c | x; θ). Changing the order of integration gives

p̂(c | x) = ∫ p(c | x; θ) p(θ | T) dθ ∝ ∫ p(c | x; θ) p(θ) ∏_{i=1}^n E_{X_i^u | x_i^o} p(c_i | x_i^o, X_i^u; θ) dθ = E_{X^u | X^o} ∫ p(c | x; θ) p(θ) ∏_{i=1}^n p(c_i | x_i^o, X_i^u; θ) dθ.   (6)

This justifies applying standard techniques for complete data to build a separate classifier using each completed training set, and then averaging the posterior class probabilities that they predict. The integral over θ in (6) will usually require approximation; in particular we could average over plug-in estimates to obtain p̂(c | x) ≈ (1/m) Σ_{j=1}^m p(c | x; θ̂_j), where θ̂_j is the MAP estimate of θ based only on the jth imputed training set. A more subtle approach (Ripley, 1994) approximates each posterior by a mixture of Gaussians centred at the local maxima θ̂_{j1}, ..., θ̂_{jR_j} of p(θ | T, X_j^u) to give

p(θ | T, X_j^u) ≈ Σ_{r=1}^{R_j} w_{jr} N(θ; θ̂_{jr}, -H_{jr}^{-1}) / Σ_{s=1}^{R_j} w_{js},   (7)

where N(·; μ, Σ) is the Gaussian density function with mean μ and covariance matrix Σ, the Hessian H_{jr} = ∂^2 log p(θ | T, X_j^u) / ∂θ ∂θ^T is evaluated at θ̂_{jr} and, using Laplace's approximation, w_{jr} = p(θ̂_{jr} | T, X_j^u) |H_{jr}|^{-1/2}. We can average over the maxima to get p̂(c | x) ≈ (1/m) Σ_{j,r} [w_{jr} / Σ_s w_{js}] p(c | x; θ̂_{jr}), but the full-blooded approach samples from the 'mixture of mixtures' approximation to p(θ | T) and also uses importance sampling to compute the predictive estimates p̂.

Table 2: Classifier performance on 301 complete test examples. See Section 4.

Training set                                                                 Test set loss
40 complete training examples only                                           132
40 complete + 206 incomplete training examples:
• Median imputation (in each variable, substitute the median for
  missing values whenever they occur)                                        149
• Averaging predicted probabilities over 1000 completions of T generated by:
  ▷ Unconditional imputation (sample missing values from the empirical
    distribution of each variable in the training set)                       133
  ▷ Gibbs sampling from p(X^u | X^o; ψ̂)                                     118
• Pooling the 1000 completions from the line above to form a single
  training set                                                               117

3.2 The Imputation Model

We need samples from p(x_i^u | x_i^o, c_i) for each i. When many patterns of missing values occur it is not practical to model p(x^u | x^o, c) for each pattern, but Markov chain Monte Carlo methods can be employed. The Gibbs sampler is convenient and in its most basic form requires models for the distribution of each element of x given the others, that is p(x_{(j)} | x_{(-j)}, c), where x_{(-j)} = (x_{(1)}, ..., x_{(j-1)}, x_{(j+1)}, ..., x_{(p)}).
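A minimal sketch of one such componentwise Gibbs sweep is given below (Python with NumPy). The callback standing in for the full conditionals is hypothetical: in the paper these conditionals are themselves fitted models (NOLR fits, as described in Section 4), whereas here any function returning a draw from p(x_{(j)} | x_{(-j)}, c) can be supplied.

import numpy as np

def gibbs_impute(x_init, missing, sample_conditional, sweeps=500, thin=10):
    # Basic Gibbs sampler over the missing coordinates of a single example.
    # sample_conditional(j, x) returns one draw from p(x_(j) | x_(-j), c).
    x = np.array(x_init, dtype=float)
    draws = []
    for s in range(sweeps):
        for j in missing:               # update one unobserved coordinate at a time
            x[j] = sample_conditional(j, x)
        if (s + 1) % thin == 0:         # keep every thin-th sweep (the 'w' mentioned below)
            draws.append(x.copy())
    return np.array(draws)              # approximate draws from p(x^u | x^o, c)

# Toy usage: coordinate 2 is missing; its full conditional is taken to be a
# unit-variance Gaussian centred on the mean of the other coordinates
# (purely illustrative, not the model used in the paper).
rng = np.random.default_rng(2)
draws = gibbs_impute(np.array([1.0, 3.0, 0.0]), missing=[2],
                     sample_conditional=lambda j, x: rng.normal(np.delete(x, j).mean()))
print(draws.shape)                      # (50, 3): every 10th of 500 sweeps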
We model these full conditionals parametrically as p(x_{(j)} | x_{(-j)}, c; ψ) and assume here that the parameters for each of the full conditionals are disjoint, so p(x_{(j)} | x_{(-j)}, c; ψ_{(j)}) where ψ = (ψ_{(1)}, ..., ψ_{(p)}). When x_{(j)} takes discrete values this is a classification task, and for continuous values a regression problem. Under certain conditions the chain of dependent samples of X^u converges in distribution to p(X^u | X^o; ψ), and the ergodic average of p(c | x^o, X^u) converges as required to the predictive estimate p̂(c | x^o). We usually take every w-th sample to provide a cover of the space in fewer samples, reducing the computation required to learn the classifier. It is essential to check convergence of the Gibbs sampler, although we do not give details here.

If we have sufficient complete examples we might use them to estimate ψ by ψ̂ and Gibbs sample from p(X^u | X^o; ψ̂). Otherwise, in the Bayesian framework, we can incorporate ψ into the sampling scheme by Gibbs sampling from p(ψ, X^u | X^o) (the solution suggested by Li, 1988). In the head injury example we report results using the former approach. (The latter was found to make little improvement and requires considerably more computation time.)

Table 3: Predictive approximations for a NOLR model fitted to a single completion (T, X^u) of the training set. The likelihood maxima at θ̂_1 and θ̂_2 account for over 0.99 of the posterior probability.

                                                    θ̂_1      θ̂_2      Predictive
Posterior probability                               0.929     0.071
-log p(θ̂_i | T, X^u)                               176.10    174.65
Test set loss:
• using the plug-in classifier p(c | x; θ̂_i)       128       149       126
• averaging over 10,000 samples from the Gaussian   120       137       119

3.3 Classifying Incomplete Examples

We could build a separate classifier for each pattern of missing data that occurs, but this can be computationally expensive, will lose information, and the classifiers need not make consistent predictions. We know that p(c | x^o) = E_{X^u | x^o} p(c | x^o, X^u), so it seems better to classify x^o by averaging over repeated imputations of X^u from the imputation model.
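In code this averaging is only a few lines. The sketch below (Python with NumPy; classify_incomplete and predict_probs are illustrative names, the latter standing for any fitted classifier that maps a complete feature vector to its K posterior probabilities) completes the example with each imputation and averages the predictions.

import numpy as np

def classify_incomplete(x_obs, missing, imputations, predict_probs):
    # Monte Carlo estimate of p(c | x^o) = E_{X^u | x^o} p(c | x^o, X^u):
    # complete the example with each imputed x^u, classify, and average.
    preds = []
    for d in imputations:               # one row per draw from p(x^u | x^o)
        x = np.array(x_obs, dtype=float)
        x[missing] = d                  # fill in the unobserved components
        preds.append(predict_probs(x))  # (K,) posterior probabilities
    return np.mean(preds, axis=0)

The imputations could, for instance, be the thinned Gibbs draws from the sketch in Section 3.2, and predict_probs could be the nolr_probs function above with its parameters fixed at fitted values.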
4 Prognosis After Head Injury

We now return to the head injury prognosis example to learn a NOLR classifier from a training set containing 40 complete and 206 incomplete examples. The NOLR architecture (4 hidden nodes, skip-layer connections and λ = 0.01) was selected by cross-validation on a single imputation of the training set, and we use a predictive approximation.^1 Table 2 shows the performance of this classifier on a test set of 301 complete examples under loss L(j, k) = |j - k| for different strategies for dealing with the missing values. For imputation by Gibbs sampling we modelled each of the full conditionals using NOLR because all variables in this dataset are ordinal. Categorical inputs to models are entered as level indicators, so change corresponds to two indicators taking values (0,0), (1,0) and (1,1). Throughout this example we predict age, emv and motor as categorical variables but treat them as continuous inputs to models. Models were selected by cross-validation based on the complete training examples only and used the predictive approximation described above. Several full conditionals benefited from a non-linear model.

We now classify 199 incomplete test examples using the classifier found in the last line of Table 2. Median imputation of missing values in the test set incurs loss 132, whereas unconditional imputation incurs loss 106. The Gibbs sampling imputation model incurs loss 91 and predicts probabilities accurately (Figure 2). Michie et al. (1994) and references therein give alternative analyses of the head injury data.

NOLR has provided an interpretable network model for ordered classes, the missing data strategy successfully learns from incomplete training examples and classifies incomplete future examples, and the predictive approach is beneficial.

^1 For each completion (T, X_j^u) of the training set we form a mixture approximation (7) to p(θ | T, X_j^u), sample from this 10,000 times and average the predicted probabilities. These predictions are then averaged over completions. Maxima were found by running the optimizer 50 times from randomized starting weights. Up to 26 distinct maxima were found, and approximately 5 generally accounted for over 95% of the posterior probability in most cases. Table 3 gives an example: averaging over maxima has a greater effect than sampling around them, although both are useful. The cut-points φ in the NOLR model must satisfy order constraints, so we rejected samples of θ where these did not hold. However, the parameters were sufficiently well determined that this occurred in less than 0.5% of samples.

Figure 2: (a) Test set calibration for median imputation (dashed) and conditional imputation (solid). For predictions by conditional imputation we average p(c | x^o, X^u) over 100 pseudo-independent samples from p(X^u | x^o). Ticks on the lower (upper) axis denote predicted probabilities for the test examples using median (conditional) imputation. (b) In 100 pseudo-independent conditional imputations of the missing parts x^u of a particular incomplete test example, eight distinct values x_i^u (i = 1, ..., 8) occur. (Recall that all components of x are discrete.) For each distinct imputation we plot a circle with centre corresponding to (p(1 | x^o, x_i^u), p(2 | x^o, x_i^u), p(3 | x^o, x_i^u)) on the simplex over the three outcome classes, and area proportional to the number of occurrences of x_i^u in the 100 imputations. The prediction by median imputation is located by ×; the average prediction over conditional imputations is located by •. The actual outcome is 'good recovery'. The conditional method correctly classifies the example and shows that it is close to the Bayes decision boundary under loss L(j, k) = |j - k| (dashed). Median imputation results in a confident and incorrect classification.

Software: A software library for fitting NOLR models in S-Plus is available at URL http://www.stats.ox.ac.uk/~mathies

Acknowledgements: The author thanks Brian Ripley for productive discussions of this work and Gordon Murray for permission to use the head injury dataset. This research was funded by the UK EPSRC and investment managers GMO Woolley Ltd.

References

Jennett, B., Teasdale, G., Braakman, R., Minderhoud, J., Heiden, J. & Kurze, T. (1979) Prognosis of patients with severe head injury.
Neurosurgery, 4, 782-790.

Li, K.-H. (1988) Imputation using Markov chains. Journal of Statistical Computation and Simulation, 30, 57-79.

Little, R. & Rubin, D. B. (1987) Statistical Analysis with Missing Data. (Wiley, New York).

Mathieson, M. J. (1996) Ordinal models for neural networks. In Neural Networks in Financial Engineering, eds A.-P. Refenes, Y. Abu-Mostafa, J. Moody and A. S. Weigend (World Scientific, Singapore), 523-536.

McCullagh, P. (1980) Regression models for ordinal data. Journal of the Royal Statistical Society Series B, 42, 109-142.

Michie, D., Spiegelhalter, D. J. & Taylor, C. C. (eds) (1994) Machine Learning, Neural and Statistical Classification. (Ellis Horwood, New York).

Ripley, B. D. (1994) Flexible non-linear approaches to classification. In From Statistics to Neural Networks: Theory and Pattern Recognition Applications, eds V. Cherkassky, J. H. Friedman and H. Wechsler (Springer Verlag, New York), 108-126.

Ripley, B. D. (1996) Pattern Recognition and Neural Networks. (Cambridge University Press, Cambridge).
", "award": [], "sourceid": 1241, "authors": [{"given_name": "Mark", "family_name": "Mathieson", "institution": null}]}