{"title": "Minkowski-r Back-Propagation: Learning in Connectionist Models with Non-Euclidian Error Signals", "book": "Neural Information Processing Systems", "page_first": 348, "page_last": 357, "abstract": null, "full_text": "348 \n\nMinkowski-r Back-Propaaation:  Learnine in Connectionist \n\nModels with Non-Euclidian Error Silllais \n\nStephen Jose Hanson and David J. Burr \n\nBell Communications Research \nMorristown, New Jersey 07960 \n\nAbstract \n\nMany connectionist learning models are implemented using a gradient descent \nin a least squares error function of the output and teacher signal.  The present model \nFneralizes. in particular. back-propagation [1]  by using Minkowski-r power metrics. \nFor  small  r's  a  \"city-block\"  error  metric  is  approximated  and  for  large  r's  the \n\"maximum\" or \"supremum\"  metric is  approached.  while  for r=2  the  standard  back(cid:173)\npropagation  model  results.  An  implementation  of Minkowski-r back-propagation  is \ndescribed.  and  several  experiments  are  done  which  show  that  different values  of r \nmay be desirable for various purposes. Different r values may be appropriate for the \nreduction  of  the  effects  of outliers  (noise).  modeling  the  input  space  with  more \ncompact clusters. or modeling  the statistics of a particular domain more naturally or \nin a way that may be more perceptually or psychologically meaningful (e.g. speech or \nvision). \n\n1.  Introduction \n\nThe recent resurgence of connectionist models can be traced to their ability to \ndo complex modeling of an input domain.  It can be shown that neural-like networks \ncontaining  a  single  hidden  layer  of non-linear  activation  units  can  learn  to  do  a \npiece-wise linear partitioning of a feature space [2].  One result of such a partitioning \nis  a  complex  gradient  surface  on  which  decisions  about  new  input  stimuli  will  be \nmade.  The generalization, categorization and clustering propenies of the network are \ntherefore detennined  by this mapping of input stimuli  to  this  gradient swface in  the \noutput  space.  This  gradient  swface  is  a  function  of  the  conditional  probability \ndistributions of the output vectors given the input feature vectors as well as a function \nof the error relating the teacher signal and output. \n\nf'F'I  an\" c.,-i,. ....  T..  ............ \n\n.. ., \n\n~.r 01. \n\n\f349 \n\nPresently many of the models have been implemented using least squares error. \nIn this paper we describe a new model of gradient descent back-propagation [I] using \nMinkowski-r power error metrics.  For small r's a \"city-block\" error measure (r=I) is \napproximated  and  for  larger  r's  a  \"maximum\"  or  supremum  error  measure  is \napproached,  while the  standard case of Euclidian back-propagation is a  special  case \nwith 1'*2.  Fll\"St  we derive  the general case and then discuss some of the implications \nof varying the power in the general metric. \n\n2.  Derivation of Minkowski-r Back-propagation \n\nThe standard back-propagation is derived by minimizing least squares error as \na  function  of connection  weights  within  a  completely  connected  layered  network. \nThe error for the Euclidian case is (for a single input-output pair), \n\n..  2 \nE = - L O'j-Yj)  , \n\n1 \n2  . J \n\n(1) \n\nwhere Y is  the  activation  of a  unit  and y represents  an  independent  teacher  signal. 
A gradient for the Euclidian or standard back-propagation case could be found by taking the partial derivative of the error with respect to each weight, and can be expressed in this three-term differential,

\frac{\partial E}{\partial w_{hi}} = \frac{\partial E}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial x_i} \cdot \frac{\partial x_i}{\partial w_{hi}}     (4)

which from the equations before turns out to be,

\frac{\partial E}{\partial w_{hi}} = (\hat{y}_i - y_i)\, \hat{y}_i (1 - \hat{y}_i)\, \hat{y}_h     (5)

Generalizing the error for Minkowski-r power metrics (see Figure 1 for the family of curves),

E = \frac{1}{r} \sum_i |\hat{y}_i - y_i|^r     (6)

Figure 1: Minkowski-r Family

Using equations 2-4 above with equation 6 we can easily find an expression for the gradient in the general Minkowski-r case,

\frac{\partial E}{\partial w_{hi}} = |\hat{y}_i - y_i|^{\,r-1}\, \hat{y}_i (1 - \hat{y}_i)\, \hat{y}_h\, \mathrm{sgn}(\hat{y}_i - y_i)     (7)

This gradient is used in the weight update rule proposed by Rumelhart, Hinton and Williams [1],

w_{hi}(n+1) = -\alpha \frac{\partial E}{\partial w_{hi}} + w_{hi}(n)     (8)

Since the gradient computed for the hidden layer is a function of the gradient for the output, the hidden layer weight updating proceeds in the same way as in the Euclidian case [1], simply substituting this new Minkowski-r gradient.

It is also possible to define a gradient over r such that a minimum in error would be sought. Such a gradient was suggested by White [3, see also 4] for maximum likelihood estimation of r, and can be shown to be,

\frac{\partial \log(L)}{\partial r} = \left(1 - \frac{1}{r}\right)\frac{1}{r} + \frac{1}{r^2}\log(r) + \frac{1}{r^2}\,\psi\!\left(\frac{1}{r}\right) + \frac{1}{r^2}\,|\hat{y}_i - y_i|^{\,r} - \frac{1}{r}\,|\hat{y}_i - y_i|^{\,r}\log(|\hat{y}_i - y_i|)     (9)

An approximation of this gradient (using the last term of equation 9) has been implemented and investigated for simple problems and shown to be fairly robust in recovering similar r values. However, it is important that the r update rule change more slowly than the weight update rule. In the simulations we ran, r was changed once for every 10 times the weight values were changed. This rate might be expected to vary with the problem and rate of convergence. Local minima may be expected in larger problems while seeking an optimal r. It may be more informative for the moment to examine different classes of problems with fixed r and consider the specific rationale for those classes of problems.
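Equations (6)-(8) translate directly into code. The sketch below is our NumPy rendering for a layer of logistic output units, not the authors' implementation; the function names and the vectorized form are illustrative assumptions.

import numpy as np

def minkowski_error(y_hat, y_target, r):
    # Equation (6): the Minkowski-r error; r=2 recovers the Euclidian case.
    return np.sum(np.abs(y_hat - y_target) ** r) / r

def output_delta(y_hat, y_target, r):
    # The per-unit error term |e|^(r-1) sgn(e) y(1-y) from equation (7);
    # multiplying by the fan-in activation y_h gives dE/dw_hi.
    e = y_hat - y_target
    return (np.abs(e) ** (r - 1)) * np.sign(e) * y_hat * (1.0 - y_hat)

def weight_update(W, y_hat, y_target, y_in, r, lr):
    # Equation (8): gradient descent on the connection weights; the outer
    # product pairs each output delta with each fan-in activation.
    return W - lr * np.outer(output_delta(y_hat, y_target, r), y_in)

With r=2 the delta reduces to the familiar (ŷ - y) ŷ (1 - ŷ) term, so standard back-propagation falls out as a special case; hidden-layer updates are unchanged apart from receiving this modified output delta.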
3. Variations in r

Various r values may be useful for various aspects of representing information in the feature domain. Changing r basically results in a reweighting of errors from output bits.¹ Small r's give less weight for large deviations and tend to reduce the influence of outlier points in the feature space during learning. In fact, it can be shown that if the distributions of feature vectors are non-gaussian, then the r=2 case will not be a maximum likelihood estimator of the weights [5]. The city-block case, r=1, in fact arises if the underlying conditional probability distributions are Laplace [5]. More generally, r's less than two will tend to model non-gaussian distributions where the tails of the distributions are more pronounced than in the gaussian. Better estimators can be shown to exist for general noise reduction and have been studied in the area of robust estimation procedures [5], of which the Minkowski-r metric is only one possible case to consider.

1. It is possible to entertain r values that are negative, which would give largest weight to small errors close to zero and smallest weight to very large errors. Values of r less than 1 generally are non-metric, i.e. they violate at least one of the metric axioms. For example, r<0 violates the triangle inequality. For some problems this may make sense and the need for a metric error weighting may be unnecessary. These issues are not explored in this paper.

r<2. It is generally recommended that r=1.5 may be optimal for many noise reduction problems [6]. However, noise reduction may also be expected to vary with the problem and nature of the noise. One example we have looked at involves the recovery of an arbitrary 3-dimensional smooth surface, as shown in Figure 2a, after the addition of random noise. This surface was generated from a gaussian curve in the two dimensions. Uniform random noise equal to the width (standard deviation) of the surface shape was added point-wise to the surface, producing the noise-plus-surface shape shown in Figure 2b.

Figure 2: Shape surface (2a), Shape plus noise surface (2b) and recovered Shape surface (2c)

The shape in Figure 2a was used as target points for Minkowski-r back-propagation² and recovered with some distortion of the slope of the shape near the peak of the surface (see Figure 2c). Next the noise-plus-shape surface was used as target points for the learning procedure with r=2. The shape shown in Figure 3a was recovered, however, with considerable distortion around the base and peak. The value of r was reduced to 1.5 (Figure 3b) and then finally to 1.2 (Figure 3c) before shape distortions were eliminated. Although the major properties of the shape of the surface were recovered, the scale seems distorted (however, easily restored with renormalization into the (0,1) range).

2. All simulation runs, unless otherwise stated, used the same learning rate (.05) and smoothing value (.9) and a stopping criterion defined in terms of absolute mean deviation. The number of iterations to meet the stopping criterion varied considerably as r was changed (see below).

Figure 3: Shape surface recovered with r=2 (3a), r=1.5 (3b) and r=1.2 (3c)

r>2. Large r's tend to weight large deviations. When noise is not possible in the feature space (as in an arbitrary boolean problem) or where the token clusters are compact and isolated, then simpler (in the sense of the number and placement of partition planes) generalization surfaces may be created with larger r values. For example, in the simple XOR problem, the main effect of increasing r is to pull the decision boundaries closer into the non-zero targets (compare high activation regions in Figure 4a and 4b).

Figure 4: XOR solved with r=2 (4a) and r=4 (4b)

In this particular problem clearly such compression of the target regions does not constitute simpler decision surfaces. However, if more hidden units are used than are needed for pattern class separation, then increasing r during training will tend to reduce the number of cuts in the space to the minimum needed. This seems to be primarily due to the sensitivity of the hyper-plane placement in the feature space to the geometry of the targets.
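To reproduce the flavor of the XOR example, the following is a small self-contained training sketch using the Minkowski-r output gradient of equation (7) with one hidden layer of logistic units. It is our illustration, not the authors' simulator: it uses plain per-pattern gradient descent (the paper's runs used a learning rate of .05 with a smoothing term of .9), and all hyperparameter values below are illustrative assumptions.

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_xor(r=2.0, lr=0.5, n_hidden=2, max_iter=20000, tol=0.05, seed=0):
    # XOR patterns and targets; a bias input of 1.0 is appended to each pattern.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    T = np.array([0., 1., 1., 0.])
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, size=(n_hidden, 3))       # hidden weights (incl. bias)
    W2 = rng.uniform(-0.5, 0.5, size=(1, n_hidden + 1))   # output weights (incl. bias)

    for it in range(max_iter):
        err = 0.0
        for x, t in zip(X, T):
            x1 = np.append(x, 1.0)
            h = logistic(W1 @ x1)            # hidden activations, eqs. (2)-(3)
            h1 = np.append(h, 1.0)
            y = logistic(W2 @ h1)[0]         # output activation

            e = y - t
            err += abs(e)
            # Minkowski-r output delta, eq. (7): |e|^(r-1) sgn(e) y (1-y).
            d_out = (abs(e) ** (r - 1)) * np.sign(e) * y * (1.0 - y)
            # Hidden deltas: back-propagated exactly as in the Euclidian case.
            d_hid = (W2[0, :n_hidden] * d_out) * h * (1.0 - h)

            W2 -= lr * d_out * h1            # eq. (8)
            W1 -= lr * np.outer(d_hid, x1)
        if err / len(X) < tol:               # absolute mean deviation criterion
            return it + 1, W1, W2
    return max_iter, W1, W2

Calling train_xor with r=2 and then with a larger r (e.g. 4) allows a comparison of the resulting decision regions in the spirit of Figure 4; convergence with only two hidden units can depend on the initial weights (seed) and learning rate.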
A more complex case illustrating the same idea comes from an example suggested by Minsky & Papert [7] called "the mesh". This type of pattern recognition problem is also, like XOR, a non-linearly separable problem. An optimal solution involves only three cuts in feature space to separate the two "meshed" clusters (see Figure 5a).

Figure 5: Mesh problem with minimum cut solution (5a) and Performance Surface (5b)

Typical solutions for r=2 in this case tend to use a large number of hidden units to separate the two sets of exemplars (see Figure 5b for a performance surface). For example, in Figure 6a notice that a typical (based on several runs) Euclidian back-prop starting with 16 hidden units has found a solution involving five decision boundaries (lines shown in the plane also representing hidden units), while the r=3 case used primarily three decision boundaries and placed a number of other boundaries redundantly near the center of the meshed region (see Figure 6b), where there is maximum uncertainty about the cluster identification.

Figure 6: Mesh solved with r=2 (6a) and r=3 (6b)

Speech Recognition. A final case in which large r's may be appropriate is data that has been previously processed with a transformation that produced compact regions requiring separation in the feature space. One example we have looked at involves spoken digit recognition. The first 10 cepstral coefficients of spoken digits ("one" through "ten") were used for input to a network. In this case an advantage is shown for larger r's with smaller training set sizes. Shown in Figure 7 are transfer data for 50 spoken digits replicated in ten different runs per point (bars show standard error of the mean). Transfer shows a training set size effect for both r=2 and r=3; however, for the larger r value at smaller training set sizes (10 and 20) note that transfer is enhanced.

Figure 7: Digit Recognition Set Size Effect

We speculate that this may be due to the larger r back-prop creating discrimination regions that are better able to capture the compactness of the clusters inherent in a small number of training points.

4. Convergence Properties

It should be generally noted that as r increases, convergence time tends to grow roughly linearly (although this may be problem dependent). Consequently, decreasing r can significantly improve convergence, without much change to the nature of the solution. Further, if noise is present, decreasing r may reduce it dramatically. Note finally that the gradient for the Minkowski-r back-propagation is nonlinear and therefore more complex for implementing learning procedures.
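The reweighting that drives both the noise-reduction and the convergence observations can be read directly off the |e|^(r-1) factor in equation (7). The following standalone check (the residual values 0.1 and 0.9 are purely illustrative) prints how much more strongly an outlier-sized residual pulls on the weights than a typical one as r varies:

# How strongly a residual e pulls on the weights is governed by the factor
# |e|^(r-1) in equation (7).  Comparing a typical small residual with an
# outlier-sized one shows why small r damps outliers and large r emphasizes them.
small, large = 0.1, 0.9
for r in (1.2, 1.5, 2.0, 3.0, 4.0):
    ratio = (large ** (r - 1)) / (small ** (r - 1))
    print(f"r={r:3.1f}  outlier/typical gradient weight ratio = {ratio:8.2f}")

As r approaches 1 the ratio approaches 1, so outlier-sized and typical residuals pull with nearly equal strength, while for larger r the outlier increasingly dominates the weight update.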
5. Summary and Conclusion

A new procedure which is a variation on the back-propagation algorithm is derived and simulated in a number of different problem domains. Noise in the target domain may be reduced by using power values less than 2, and the sensitivity of partition planes to the geometry of the problem may be increased with increasing power values. Other types of objective functions should be explored for their potential consequences on network resources and ensuing pattern recognition capabilities.

References

1. Rumelhart, D. E., Hinton, G. E., Williams, R., Learning Internal Representations by Error Propagation. Nature, 1986.

2. Burr, D. J. and Hanson, S. J., Knowledge Representation in Connectionist Networks. Bellcore Technical Report.

3. White, H. Personal Communication, 1987.

4. White, H. Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Network Models. Unpublished Manuscript, 1987.

5. Mosteller, F. & Tukey, J. Robust Estimation Procedures. Addison Wesley, 1980.

6. Tukey, J. Personal Communication, 1987.

7. Minsky, M. & Papert, S., Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969.
", "award": [], "sourceid": 65, "authors": [{"given_name": "Stephen", "family_name": "Hanson", "institution": null}, {"given_name": "David", "family_name": "Burr", "institution": null}]}