{"title": "Graph Matching for Shape Retrieval", "book": "Advances in Neural Information Processing Systems", "page_first": 896, "page_last": 902, "abstract": null, "full_text": "The Bias-Variance Tradeoff and the Randomized \n\nGACV \n\nGrace Wahba, Xiwu Lin and Fangyu Gao \n\nDept of Statistics \nUniv of Wisconsin \n\n1210 W Dayton Street \nMadison, WI 53706 \n\nwahba,xiwu,fgao@stat.wisc.edu \n\nDong Xiang \n\nSAS Institute, Inc. \nSAS Campus Drive \n\nCary, NC 27513 \n\nsasdxx@unx.sas.com \n\nRonald Klein, MD and Barbara Klein, MD \n\nDept of Ophthalmalogy \n610 North Walnut Street \n\nMadison, WI 53706 \n\nkleinr,kleinb@epi.ophth.wisc.edu \n\nAbstract \n\nWe  propose a new in-sample cross validation based method (randomized \nGACV) for choosing smoothing or bandwidth parameters that govern the \nbias-variance or fit-complexity tradeoff in  'soft' classification.  Soft clas(cid:173)\nsification refers to  a  learning procedure which  estimates the probability \nthat an example with a given attribute vector is  in  class  1 vs  class O.  The \ntarget  for  optimizing  the  the  tradeoff is  the  Kullback-Liebler distance \nbetween  the  estimated  probability  distribution  and  the  'true'  probabil(cid:173)\nity  distribution,  representing  knowledge of an  infinite  population.  The \nmethod uses a randomized estimate of the trace of a Hessian and mimics \ncross validation at the cost of a single relearning with perturbed outcome \ndata. \n\n1 \n\nINTRODUCTION \n\nWe propose and test a new in-sample cross-validation based method for optimizing the bias(cid:173)\nvariance tradeoff in 'soft classification' (Wahba et al1994), called ranG ACV (randomized \nGeneralized Approximate Cross Validation). Summarizing from Wahba et al(l994) we are \ngiven a  training  set consisting of n  examples,  where for  each  example  we  have a  vector \nt  E  T  of attribute values,  and an outcome y, which is  either 0 or 1.  Based on the training \ndata it is desired to estimate the probability p of the outcome 1 for any new examples in the \n\n\fThe Bias-Variance TradeofJand the Randomized GACV \n\n621 \n\nfuture.  In  'soft' classification the estimate p(t) of p(t) is of particular interest, and might be \nused  by a physician to  tell  patients how they might modify their risk p by changing (some \ncomponent of) t,  for example, cholesterol as  a risk factor for heart attack.  Penalized like(cid:173)\nlihood estimates are obtained for p by assuming that the  logit  f(t), t  E  T, which  satisfies \np(t)  =  ef(t) 1(1 + ef(t\u00bb)  is  in  some space 1{ of functions .  Technically 1{ is a reproducing \nkernel  Hilbert space, but you don't need to  know  what that is  to  read on.  Let the training \nset be {Yi, ti, i  =  1,\u00b7\u00b7\u00b7, n}. Letting Ii =  f(td, the negative log likelihood .c{Yi, ti, fd of \nthe observations, given f  is \n\nn \n\n.c{Yi, ti, fd  =  2::[-Ydi + b(li)], \n\ni=1 \n\n(1) \n\nwhere  b(f)  =  log(l  + ef ).  The  penalized  likelihood  estimate of the  function  f  is  the \nsolution to:  Find f  E  1{ to minimize h. (I): \n\nn \n\nh.(f) =  2::[-Ydi + b(ld) + J>.(I), \n\ni =1 \n\n(2) \n\nwhere 1>.(1)  is  a quadratic penalty functional depending on parameter(s) A =  (AI, ... , Aq) \nwhich govern the so called bias-variance tradeoff.  Equivalently the components of A con(cid:173)\ntrol the tradeoff between the complexity of f and the fit to the training data.  In this paper we \nsketch the derivation of the ranG ACV method for choosing A,  and present some prelim(cid:173)\ninary but favorable simulation results, demonstrating its efficacy.  This method is designed \nfor use with  penalized likelihood estimates, but it  is clear that it can be used with  a variety \nof other methods which contain bias-variance parameters to be chosen, and for which mini(cid:173)\nmizing the Kullback-Liebler (K L) distance is the target.  In the work of which this is a part, \nwe are concerned with  A having multiple components.  Thus, it  will  be highly convenient \nto  have an  in-sample method for  selecting A,  if one that  is  accurate and  computationally \nconvenient can be found. \n\nLet P>.  be  the  the estimate and  p  be the  'true'  but  unknown  probability  function  and  let \nPi  =  p(td,p>.i  =  p>.(ti).  For  in-sample tuning,  our criteria  for  a  good  choice of A is \nthe KL distance KL(p,p>.)  = ~ E~I[PilogP7. + (1- pdlogg~::?)]. We  may replace \nK L(p,p>.)  by the comparative K L distance (C K L), which differs from K L by a quantity \nwhich does not depend on A.  Letting hi = h (ti),  the C K L  is given by \nCKL(p,p>.)  ==  CKL(A)  =  ;;, 2:: [-pd>'i  + b(l>.i)). \n\n1  n \n\n(3) \n\ni=) \n\nC K L(A)  depends on the unknown p, and it  is  desired is  to  have a good estimate or proxy \nfor it, which can then be minimized with respect to A. \nIt is  known (Wong 1992) that no exact unbiased estimate of CK L(A) exists in this case, so \nthat only approximate methods are possible.  A number of authors have tackled this prob(cid:173)\nlem, including Utans and M90dy(1993), Liu(l993), Gu(1992). The iterative U BR method \nof Gu(l992) is  included in  GRKPACK  (Wang  1997),  which  implements general smooth(cid:173)\ning spline ANOVA penalized likelihood estimates with multiple smoothing parameters.  It \nhas  been  successfully  used  in  a  number of practical  problems,  see,  for  example,  Wahba \net al  (1994,1995).  The  present  work  represents  an  approach  in  the  spirit  of GRKPACK \nbut which employs several approximations, and may  be used  with  any data set,  no matter \nhow large, provided that an algorithm for solving the penalized likelihood equations, either \nexactly or approximately, can be implemented. \n\n\f622 \n\nG.  Wahba et al. \n\n2  THE GACV ESTIMATE \n\nIn the general penalized likelihood problem the minimizer 1>,(-)  of (2) has a representation \n\nM \n\n1>.(t)  = L  dv<Pv(t)  + L CiQ>.(ti, t) \n\nn \n\n(4) \n\nv=l \n\ni=l \n\nwhere the <Pv  span the  null space of 1>\"  Q>.(8, t)  is  a  reproducing kernel  (positive definite \nfunction)  for the penalized part of 7-1.,  and C  =  (Cl' ... ,Cn)'  satisfies M  linear conditions, \nso  that  there  are  (at most)  n  free  parameters  in  1>..  Typically  the  unpenalized  functions \n<Pv  are  low  degree polynomials.  Examples of Q(ti,')  include radial  basis  functions  and \nvarious kinds of splines; minor modifications include sigmoidal basis functions, tree basis \nfunctions and so on. See, for example Wahba( 1990, 1995), Girosi, Jones and Poggio( 1995). \nIf f>.C)  is  of the  form  (4)  then  1>,(f>.)  is  a  quadratic form  in  c.  Substituting  (4)  into  (2) \nresults  in  h  a  convex  functional  in  C  and d,  and  C  and  d  are obtained  numerically  via  a \nNewton Raphson iteration,  subject to  the conditions on c.  For large n, the second sum on \nthe right of (4)  may be replaced by  L~=1 Cik Q>. (tik , t), where the tik  are chosen via one \nof several principled methods. \n\nTo obtain the CACV we begin with the ordinary leaving-out-one cross validation function \nCV(.\\) for the CKL: \n\nCV  .\\)  -\n\n( \n\n_  1  \"\" \n\n- LJ-yd>.i  + b 1>.i)  , \n] \nn \n\n[-i] \n\n( \n\nn \n\ni=1 \n\n(5) \n\nwhere  fl- i ]  the  solution  to  the  variational  problem of (2)  with  the  ith data  point left out \nand  fti]  is  the  value  of fl- i] at  ti .  Although  f>.C)  is  computed by  solving  for  C  and d \nthe CACV  is  derived  in  terms of the  values  (it\"\", fn)'  of f  at  the ti.  Where  there  is \nno confusion between functions  f(-)  and vectors (it, ... ,fn)' of values of fat tl, ... ,tn, \nwe let  f  = (it, ... \" fn)'.  For any  f(-)  of the form  (4), J>. (f) also has a representation as \na  non-negative definite quadratic form  in  (it, ... , fn)'.  Letting L:>.  be twice the matrix of \nthis quadratic form  we can rewrite (2) as \n\nn \n\nh(f,Y) = L[-Ydi + b(/i)] + 2f'L:>.f. \n\n1 \n\n(6) \n\ni=1 \n\nLet W  =  W(f)  be the n  x  n  diagonal matrix  with (/ii  ==  Pi(l  - Pi)  in  the iith position. \nUsing the  fact  that (/ii  is  the second derivative of b(fi),  we  have that H  =  [W + L:>.] - 1 \nis  the inverse Hessian of the  variational problem (6).  In Xiang and  Wahba (1996), several \nTaylor  series  approximations,  along  with  a  generalization of the  leaving-out-one  lemma \n(see  Wahba  1990)  are  applied  to  (5)  to  obtain  an  approximate  cross  validation  function \nACV(.\\), which  is  a second order approximation to  CV(.\\) . Letting hii be the iith entry \nof H , the result is \n\nCV(.\\) ~ ACV('\\) =  .!. t[-Yd>.i + b(f>.i)] + .!.  t \nn  i=1 \n\nn  i= l \n\nhiiYi(Yi  - P>.i)  . \n\n[1  - hiwii] \n\n(7) \n\nThen the GACV  is  obtained from  the  ACV by  replacing hii  by  ~ L~1 hii  ==  ~tr(H) \nand replacing 1 - hiWii  by  ~tr[I - (Wl/2 HWl/2)], giving \n\nCACV('\\) = ;; t;;[-Yd>.i + b(1).i)  + -n-tr[I _  (Wl/2HWl /2)]  , \n\n] \n\ntr(H)  L~l Yi(Yi  - P>.i) \n\n1 ~ \n\n(8) \n\nwhere W  is  evaluated at 1>..  Numerical results based on an exact calculation of (8) appear \nin Xiang and Wahba (1996). The exact calculation is  limited to small n  however. \n\n\fThe Bias-Variance TradeofJand the Randomized GACV \n\n623 \n\n3  THE RANDOMIZED GACV ESTIMATE \n\nGiven any 'black box' which, given >.,  and a training set {Yi, ti} produces f>. (.)  as the min(cid:173)\nimizer of (2),  and thence f>.  = (fA 1 ,  \" ' ,  f>.n)',  we can produce randomized estimates of \ntrH and tr[! - W 1/ 2 HW 1/ 2 J without having any explicit calculations of these matrices. \nThis is  done by  running the  'black box' on perturbed data  {Vi  + <5i , td.  For the Yi  Gaus(cid:173)\nsian, randomized trace estimates of the Hessian of the variational problem (the  'influence \nmatrix') have been studied extensively and shown to  be essentially as good as exact calcu(cid:173)\nlations for  large n, see for example Girard(1998) .  Randomized trace estimates are  based \non the fact that if A  is any square matrix and <5  is  a zero mean random n-vector with  inde(cid:173)\npendent components with  variance (TJ,  then  E<5' A<5  = ~ tr A.  See Gong et al( 1998) and \nreferences cited there for experimental results with multiple regularization parameters. Re-\nturning to the 0-1  data case, it is easy to see that the minimizer fA(')  of 1;..  is continuous in \nY,  not withstanding the fact that in our training set the Yi  take on only values 0 or 1. Letting \nif =  UA1,\"', f>.n)'  be  the  minimizer of (6)  given  y  =  (Y1,\"', Yn)',  and  if+O  be the \nminimizer given data y+<5  =  (Y1 +<51, ... ,Yn +<5n)' (the ti remain fixed), Xiang and Wahba \n(1997) show, again using Taylor series expansions, that if+O  - ff ,.....,  [WUf) +  ~AJ-1<5. \nThis suggests that ~<5'Uf+O - ff) provides an estimate oftr[W(ff) + ~At1. However, \nif we take the solution ff to the nonlinear system for the original data Y as the initial value \nfor  a Newton-Raphson calculation of ff+O  things  become even simpler.  Applying a one \nstep Newton-Raphson iteration gives \n\nu\" \n\nu\" \n\n(9) \n\nSince  Pjf(ff,y  +  <5)  = \n[  82 h(fY  )J- 1 \n8?7if  A' Y \n[WUf) + EAt 1<5.  The result is  the following ranGACV function: \n\n-<5  +  PjfUf,Y)  =  -<5,  and  [:;~f(ff,Y +  <5)J - 1 \nf y+o,l  -\nfY \n- A \n-\n\nfY \nA +  8?7if  A' Y \n\n[ 8 2 h(fY  )J- 1 J: \n\n,we  ave  A \n\nu  so  t  at  A \n\nf y+o,l \n\nh \n\nh \n\nranGACV(>.)  =  .!. ~[- 'I '+bU  .)J+ \n\nn \n\nn  ~ Yz  At \n\nAt \n\n<5'  (f Y+O,l \nn \n\nA \n\nfY) \n- A \n\n) \n\",n \nwi=l Yi  Yi  - PAi \n\n( \n\n. \n\n[<5'<5  - <5'WUf)Uf+O,l  - ff)J \n(10) \nin  (10),  we  may  draw  R \n'+'  in \nto  obtain  an  R-replicated \n\nthe  variance  in \n\nTo  reduce \nindependent  replicate  vectors  <51,'\"  ,<5R ,  and  replace  the  term  after  the \n(1O)b \n\n1...  \",R  o:(fr Hr .1 -ff) \n\n2:7-1 y.(y.-P>..) \n\nterm  after  the \n\nthe \n\n'+' \n\n[O~Or-O~ W(fn(f~+Ar . l-ff)1 \n\ny  R  wr=l \n\nn \nranGACV(>.)  function. \n\n4  NUMERICAL RESULTS \n\nIn  this  section  we  present simulation  results  which  are  representative of more extensive \nsimulations to  appear elsewhere.  In each case,  K  < < n  was chosen by  a sequential clus(cid:173)\ntering algorithm. In that case, the ti were grouped into K  clusters and one member of each \ncluster selected at random. The model is fit.  Then the number of clusters is doubled and the \nmodel is fit again. This procedure continues until the fit does not change. In the randomized \ntrace estimates the random variates were Gaussian.  Penalty functionals were (multivariate \ngeneralizations of) the cubic spline penalty functional>. fa1 U\" (X))2, and smoothing spline \nANOVA models were fit. \n\n\f624 \n\nG.  Wahba  et at. \n\n4.1  EXPERIMENT 1.  SINGLE SMOOTHING PARAMETER \n\nIn  this  experiment t  E  [0,1], f(t)  =  2sin(10t), ti  =  (i  -\n.5)/500, i  =  1,\u00b7\u00b7\u00b7,500.  A \nrandom number generator produced  'observations' Yi  =  1 with  probability Pi  =  el , /(1 + \neli), to get the training set.  Q A is given in Wahba( 1990) for this cubic spline case, K  = 50. \nSince  the  true  P  is  known,  the  true  CKL  can  be  computed.  Fig. \nl(a)  gives  a  plot  of \nCK L(A)  and  10 replicates of ranGACV(A).  In each  replicate  R  was  taken as  1,  and  J \nwas generated anew as  a Gaussian random vector with  (115  =  .001.  Extensive simulations \nwith  different  (115  showed  that  the  results  were  insensitive  to  (115  from  1.0  to  10-6 \u2022  The \nminimizer of C K L  is  at  the filled-in  circle and  the  10  minimizers of the  10 replicates of \nranGACV are the open circles.  Anyone of these 10 provides a  rather good  estimate of \nthe A that goes  with  the  filled-in  circle.  Fig.  l(b) gives  the  same experiment, except that \nthis time R  =  5.  It can be seen that the minimizers ranGACV become even more reliable \nestimates of the minimizer of C K L, and the C K L  at all  of the ranG ACV estimates are \nactually quite close to its minimum value. \n\n4.2  EXPERIMENT 2.  ADDITIVE MODEL WITH A =  (Al' A2) \n\nHere  t  E  [0,1]  0  [0,1].  n  =  500  values  of ti  were  generated  randomly  according  to \na  uniform  distribution  on  the  unit  square  and  the  Yi  were  generated  according  to  Pi  = \neli j(l + el ,)  with  t  =  (Xl,X2)  and  f(t)  =  5 sin 27rXl  - 3sin27rX2.  An  additive model \nas  a  special case of the smoothing spline ANOVA  model  (see Wahba  et al,  1995), of the \nform  f(t)  =  /-l  + h(xd + h(X2)  with  cubic  spline  penalties  on  hand h  were  used. \n.001, R  =  5.  Figure  l(c)  gives  a  plot of CK L(Al' A2)  and Figure  l(d) \nK  =  50, (115  = \ngives a  plot of ranGACV(Al, A2).  The open circles  mark the  minimizer of ranGACV \nin  both  plots  and  the  filled  in  circle  marks  the  minimizer of C K L.  The  inefficiency,  as \nmeasured  by  CKL()..)/minACKL(A)  is  1.01.  Inefficiencies  near  1  are  typical  of our \nother similar simulations. \n\n4.3  EXPERIMENT 3.  COMPARISON OF ranGACV AND UBR \n\nThis  experiment  used  a  model  similar  to  the  model  fit  by  GRKPACK  for  the  risk  of \nprogression  of diabetic  retinopathy  given  t  = \n(Xl, X2,  X3)  =  (duration,  glycosylated \nhemoglobin,  body  mass  index)  in  Wahba  et al(l995)  as  'truth'.  A  training  set  of 669 \nexamples was generated according to that model,  which  had the structure  f(Xl, X2,  X3)  = \n/-l + fl (xd + h  (X2)  + h  (X3)  + fl,3 (Xl, X3).  This (synthetic) training set was fit  by GRK(cid:173)\nPACK  and  also  using  K  =  50  basis  functions  with  ranG ACV.  Here  there  are P  =  6 \nsmoothing parameters (there are 3 smoothing parameters in f13)  and the ranGACV func(cid:173)\ntion  was searched by a downhill simplex method to  find  its minimizer.  Since the  'truth'  is \nknown,  the  CKL for)\"  and for  the  GRKPACK  fit  using  the  iterative UBR method  were \ncomputed. This was repeated  100 times, and the  100 pairs of C K L  values appears in Fig(cid:173)\nure  l(e).  It can be seen that the U BR and ranGACV give similar C K L  values about 90% \nof the time, while the ranG ACV has lower C K L  for most of the remaining cases. \n\n4.4  DATA ANALYSIS: AN APPLICATION \n\nFigure 1(f) represents part of the results of a study of association at baseline of pigmentary \nabnormalities with various risk factors in 2585 women between the ages of 43 and 86 in the \nBeaver Dam Eye Study, R.  Klein et al( 1995). The attributes are:  Xl  =  age, X2  =body mass \nindex, X3  =  systolic blood pressure, X4  =  cholesterol.  X5  and X6  are indicator variables for \ntaking hormones, and history of drinking. The smoothing spline ANOVA model fitted  was \nf(t)  =  /-l+dlXl +d2X2 + h(X3)+ f4(X4)+ h4(X3, x4)+d5I(x5) +d6I(x6), where I is the \nindicator function.  Figure  l(e) represents a cross section of the fit  for  X5  =  no, X6  =  no, \n\n\fThe Bias- Variance Tradeoff and the Randomized GACV \n\n625 \n\nX2, X3  fixed  at  their medians and Xl  fixed  at  the  75th  percentile.  The dotted  lines are the \nBayesian confidence intervals, see Wahba et al( 1995). There is a suggestion of a borderline \ninverse association of cholesterol. The reason for this association is uncertain. More details \nwill appear elsewhere. \n\nPrincipled soft classification procedures can now  be implemented in much larger data sets \nthan previously possible, and the ranG ACV should be applicable in general learning. \n\nReferences \n\nGirard, D. (1998), 'Asymptotic comparison of (partial) cross-validation, GCV and random(cid:173)\nized GCV in nonparametric regression', Ann.  Statist.  126, 315-334. \nGirosi,  F.,  Jones,  M.  &  Poggio,  T.  (1995),  'Regularization  theory  and  neural  networks \narchitectures', Neural Computatioll 7,219-269. \nGong,  J.,  Wahba,  G.,  Johnson,  D.  &  Tribbia,  J.  (1998),  'Adaptive  tuning  of numerical \nweather prediction models:  simultaneous estimation of weighting, smoothing and physical \nparameters', MOllthly  Weather Review 125, 210-231. \nGu,  C.  (1992),  'Penalized  likelihood  regression:  a  Bayesian  analysis',  Statistica  Sinica \n2,255-264. \nKlein, R., Klein, B.  & Moss, S.  (1995),  'Age-related eye disease and survival. the Beaver \nDam Eye Study', Arch Ophthalmol113, 1995. \nLiu,  Y.  (1993),  Unbiased  estimate  of generalization error and  model  selection  in  neural \nnetwork, manuscript, Department of Physics, Institute of Brain and Neural Systems, Brown \nUniversity. \nUtans,  J.  &  Moody,  J.  (1993),  Selecting  neural  network  architectures  via  the  prediction \nrisk:  application to corporate bond rating prediction, in  'Proc. First Int'I Conf. on Artificial \nIntelligence Applications on Wall Street', IEEE Computer Society Press. \nWahba,  G.  (1990), Spline Models for Observational Data, SIAM.  CBMS-NSF Regional \nConference Series in Applied Mathematics, v.  59. \nWahba,  G.  (1995),  Generalization  and  regularization  in  nonlinear  learning  systems,  in \nM.  Arbib,  ed.,  'Handbook of Brain Theory and  Neural  Networks',  MIT Press,  pp.  426-\n430. \nWahba, G.,  Wang,  Y.,  Gu, c., Klein, R.  & Klein,  B.  (1994), Structured machine learning \nfor  'soft'  classification  with  smoothing  spline  ANOVA  and  stacked  tuning,  testing  and \nevaluation, in J.  Cowan, G.  Tesauro &  J. Alspector, eds,  'Advances in  Neural Information \nProcessing Systems 6', Morgan Kauffman, pp. 415-422. \nWahba, G., Wang, Y., Gu, C., Klein, R.  &  Klein, B. (1995), 'Smoothing spline AN OVA for \nexponential families,  with application to the Wisconsin Epidemiological Study of Diabetic \nRetinopathy' , Ann. Statist. 23,  1865-1895. \nWang, Y.  (1997), 'GRKPACK: Fitting smoothing spline analysis of variance models to data \nfrom exponential families',  Commun. Statist.  Sim.  Compo  26,765-782. \nWong,  W.  (1992), Estimation of the loss  of an estimate,  Technical  Report 356, Dept.  of \nStatistics, University of Chicago, Chicago, II. \nXiang, D.  &  Wahba, G.  (1996), 'A generalized approximate cross validation for smoothing \nsplines  with  non-Gaussian data', Statistica  Sinica  6, 675-692, preprint TR 930 available \nvia www. stat. wise. edu/-wahba - > TRLIST. \nXiang, D. &  Wahba, G.  (1997), Approximate smoothing spline methods for large data sets \nin the binary case, Technical Report 982, Department of Statistics, University of Wisconsin, \nMadison  WI.  To  appear in  the Proceedings of the  1997  ASA Joint Statistical  Meetings, \nBiometrics Section, pp 94-98 (1998). Also in TRLIST as above. \n\n\fG,  Wahba et aI, \n\nCKL \n\nCKL \nranGACV \n\n626 \n\n10 \n(0 \nc:i \n\n0 \n(0 \nc:i \n\n10 \n10 \nc:i \n\n0 \n~ \n0 \n\n.' . \n\n10 \n(0 \nc:i \n\no \n(0 \nc:i \n\n10 \n10 \nc:i \n\no \n10 \nc:i \n\n-8 \n\n-7 \n\n-6 \n-5 \nlog lambda \n\n(a) \n\n-4 \n\n-3 \n\n-8 \n\n-7 \n\n-6 \n-5 \nlog lambda \n\n(b) \n\n-4 \n\n-3 \n\n9.29 \n\nr,,6 \n:0' \n~,  O.  7  O.  9 \n\n-7 \n\nlog lambda1 \n\n-6 \n\n(c) \n\n-5 \n\no \n\n12! \no \n\nC\\I \n(0 \nc:i \n\nco \n10 \nc:i \n\n\"f \n\n... 0,28 \n\n.. \u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7r ..... ranGACV \n\n\\~7 \n0\\ \n\n.' \n\n\"  ...... \n\n0:\\13 \n\n': \n\n:.\u00b7 .. O-:!4!7 \n: \n\n0 \n\n.  . \n.  . \n\n. \n. \n\n.:  0'F5  0'F8  0.[32 \n\nlog lambda1 \n\n-6 \n\n(d) \n\n-5 \n\n-4 \n\nO.  4 \n\n-4 \n\n.25  0'.2,4 \n\n-7 \n\n(0 \nc:i \n\n.~ =...,. \n.0  . \nca O \n.0 e a.. \n\nC\\I \nc:i \n\n(0 \n10 \nc:i  ~--------.-------.--------r--~ \n\n0,56 \n\n0,58 \n\n0,60 \nranGACV \n\n(e) \n\n0,62 \n\no \nc:i  ~ __ ~ ____ , -____ . -__ -. ____ . -__ ~ \n\n100 \n\n150 \n\n200 \n300 \nCholesterol (mg/dL) \n\n250 \n\n(f) \n\n350 \n\n400 \n\nFigure  1:  (a)  and (b):  Single smoothing parameter comparison of ranGACV and CK L. \n(c) and (d):  Two smoothing parameter comparison of ranGACV and CK L. (e):  Compar(cid:173)\nison of ranG ACV and U B R.  (f):  Probability estimate from Beaver Dam Study \n\n\fGraph Matching for  Shape  Retrieval \n\nBenoit  Huet,  Andrew  D.J.  Cross and  Edwin R.  Hancock' \n\nDepartment of Computer Science,  University of York \n\nYork,  YOI  5DD,  UK \n\nAbstract \n\nThis  paper  describes  a  Bayesian  graph  matching  algorithm  for \ndata-mining  from  large  structural  data-bases.  The  matching  al(cid:173)\ngorithm uses edge-consistency and node attribute similarity to de(cid:173)\ntermine the a posteriori probability of a query graph for each of the \ncandidate matches in  the data-base.  The node feature-vectors are \nconstructed  by  computing  normalised  histograms  of  pairwise  ge(cid:173)\nometric  attributes.  Attribute  similarity  is  assessed  by  computing \nthe  Bhattacharyya distance  between  the  histograms.  Recognition \nis  realised by selecting the candidate from the data-base which  has \nthe  largest  a  posteriori probability.  We  illustrate  the  recognition \ntechnique  on  a  data-base  containing  2500  line  patterns  extracted \nfrom  real-world imagery.  Here  the recognition  technique  is  shown \nto significantly outperform a  number of algorithm alternatives. \n\n1 \n\nIntroduction \n\nSince Barrow and Popplestone [1]  first  suggested that relational structures could be \nused to represent and interpret 2D scenes, there has been considerable interest in the \nmachine  vision  literature in  developing  practical  graph-matching algorithms  [8,  3, \n10].  The main computational issues are how to compare relational descriptions when \nthere is significant structural corruption [8,  10]  and how to search for the best match \n[3].  Despite resulting in  significant  improvements in  the available methodology for \ngraph-matching, there has  been  little progress in  applying the resulting algorithms \nto large-scale object recognition problems.  Most of the algorithms developed in the \nliterature are evaluated for the relatively simple problem of matching a model-graph \nagainst a scene known to contain the relevant structure.  A more realistic problem is \nthat of taking a  large number (maybe thousands)  of scenes and retrieving the ones \nthat best match the model.  Although this problem is key to data-mining from  large \nlibraries  of  visual  information,  it  has  invariably  been  approached  using  low-level \nfeature comparison techniques.  Very little effort [7,4]  has been devoted to matching \n\n\u2022 corresponding author erh@cs.york.ac.uk \n\n\fGraph Matching for Shape Retrieval \n\n897 \n\nhigher-level structural primitives such as lines, curves or regions.  Moreover, because \nof the  perceived  fragility  of the  graph  matching  process,  there  has  been  even  less \neffort  directed at attempting to retrieve shapes using relational information. \n\nHere  we  aim  to fill  this  gap  in  the  literature by  using  graph-matching as  a  means \nof retrieving the shape from  a large data-based that most closely resembles a  query \nshape.  Although the indexation  images in  large data-bases is  a  problem of current \ntopicality  in  the  computer  vision  literature  [5,  6,  9],  the  work  presented  in  this \npaper  is  more  ambitious.  Firstly,  we  adopt  a  structural  abstraction  of the  shape \nrecognition  problem  and match  using  attributed  relational  graphs.  Each  shape in \nour data-base is a  pattern of line-segments.  The structural abstraction is  a  nearest \nneighbour graph for  the centre-points of the line-segments.  In addition,  we  exploit \nattribute information for  the line patterns.  Here the geometric arrangement of the \nline-segments is encapsulated using a histogram of Euclidean invariant pairwise (bi(cid:173)\nnary) attributes.  For each line-segment in turn we construct a normalised histogram \nof relative angle and length with the remaining line-segments in the pattern.  These \nhistograms  capture the  global  geometric  context  of each  line-segment.  Moreover, \nwe  interpret  the  pairwise  geometric  histograms  as  measurement  densities  for  the \nline-segments which  we  compare using the Bhattacharyya distance. \n\nOnce we  have  established the pattern representation,  we  realise object recognition \nusing  a  Bayesian  graph-matching  algorithm.  This  is  a  two-step  process.  Firstly, \nwe  establish  correspondence  matches  between  the  individual  tokens  in  the  query \npattern  and each  of the  patterns  in  the  data-base.  The  correspondences  matches \nare sought  so  as to maximise  the  a  posteriori  measurement  probability.  Once  the \nMAP  correspondence matches  have  been  established,  then  the second  step  in  our \nrecognition architecture involves selecting the line-pattern from the data-base which \nhas maximum matching probability. \n\n2  MAP  Framework \n\nFormally our recognition problem is  posed as follows.  Each ARG in the database is \na  triple,  G  =  (Vc, Ec, Ac), where  Vc  is  the set of vertices  (nodes),  Ec  is  the edge \nset  (Ec  C  Vc  x  Vc),  and  Ac  is  the  set  of node  attributes.  In  our  experimental \nexample, the nodes represent line-structures segmented from  2D  images.  The edges \nare  established  by  computing  the  N-nearest  neighbour  graph  for  the  line-centres. \nEach  node  j  E  Vc  is  characterised by  a  vector  of attributes,  ~j  and  hence  Ac  = \n{~j jj  E  Vc}. \nIn  the  work  reported  here  the  attribute-vector  is  represents  the \ncontents of a  normalised pairwise attribute histogram . \nThe  data-base of line-patterns  is  represented  by  the  set  of ARG's  D  =  {G}.  The \ngoal  is  to  retrieve  from  the  data-base  D,  the  individual  ARG  that  most  closely \nresembles a  query  pattern Q =  (VQ' EQ, AQ).  We  pose the retrieval process as one \nof associating  with  the query  the graph  from  the  data-base that  has  the largest  a \nposteriori  probability.  In  other  words,  the  class  identity  of the graph  which  most \nclosely  corresponds to the query  is \n\nwQ  =  arg max P(G' IQ) \n\nC'EV \n\nHowever,  since  we  wish  to  make  a  detailed  structural  comparison  of  the  graphs, \nrather  than  comparing  their  overall  statistical  properties,  we  must  first  establish \na  set  of best-match  correspondences  between  each  ARG  in  the data-base  and  the \nquery  Q.  The  set  of  correspondences  between  the  query  Q and  the  ARG  G  is \na  relation  fc  :  Vc  f-7  VQ  over  the  vertex  sets  of  the  two  graphs.  The  mapping \nfunction  consists of a set of Cartesian pairings between the nodes of the two graphs, \n\n\f898 \n\nB.  Huet,  A.  D.  1.  Cross and E.  R.  Hancock \n\ni.e.  Ie = {(a,a);a  E  Ve,a  E  VQ}  ~ Ve  x  VQ .  Although this  may appear to be  a \nbrute force method, it must be stressed that we  view  this process of correspondence \nmatching  as  the  final  step  in  the  filtering  of  the  line-patterns.  We  provide  more \ndetails of practical implementation in  the experimental section of this paper. \n\nWith the correspondences to hand we  can re-state our maximum  a posteriori prob(cid:173)\nability  recognition  objective  as  a  two  step  process.  For  each  graph G  in  turn,  we \nlocate the  maximum  a  posteriori probability  mapping function  Ie  onto the query \nQ.  The second step is to perform recognition by selecting the graph whose mapping \nfunction  results in the largest matching probability.  These two steps are succinctly \ncaptured by the following statement of the recognition condition \n\nwQ  = arg max max P(fe,IG', Q) \n\ne'ED  la' \n\nThis global MAP condition is developed into a useful local update formula by apply(cid:173)\ning the Bayes formula  to the  a posteriori matching probability.  The simplification \nis  as follows \n\nPU  IG  Q) = p(Ae, AQl/e)P(felVe, Ee, VQ, EQ)P(Ve , Ee)P(VQ, EQ) \n\ne \n\n, \n\nP(G)P(Q) \n\nThe  terms  on  the  right-hand  side  of  the  Bayes  formula  convey  the  following \nmeaning.  The  conditional  measurement  density  p(Ae,AQl/e)  models  the  mea(cid:173)\nsurement  similarity  of  the  node-sets  of  the  two  graphs.  The  conditional  prob(cid:173)\nability  P(feIEe, EQ)  models  the  structural  similarity  of  the  two  graphs  under \nthe  current  set  of  correspondence  matches.  The  assumptions  used  in  develop(cid:173)\ning  our  simplification  of  the  a  posteriori  matching  probability  are  as  follows. \nFirstly,  we  assume  that  the  joint  measurements  are  conditionally  independent \nof  the  structure  of  the  two  graphs  provided  that  the  set  of  correspondences  is \nknown,  i.e.  P(Ae, AQl/e, Ee, Ve, EQ, VQ)  =  P(Ae, AQl/e).  Secondly,  we  as(cid:173)\nsume  that  there  is  conditional  independence  of the  two  graphs  in  the  absence  of \ncorrespondences.  In  other  words,  P(Ve, Ee, VQ, EQ)  =  P(VQ, EQ)P(Ve, Ee)  and \nP(G, Q)  =  P(G)P(Q).  Finally,  the graph priors P(Ve, Ee) , P(VQ, EQ)  P(G)  and \nP( Q)  are taken as  uniform and are eliminated from  the decision  making process. \n\nTo continue our development, we first focus on the conditional measurement density, \np(Ae, AQl/e)  which  models  the  process  of  comparing  attribute  similarity  on  the \nnodes of the two graphs.  Assuming statistical independence of node attributes, the \nconditional measurement density p( Ae, AQ lie), can be factorised over the Cartesian \npairs  (a, a)  E  Ve  x  VQ  which  constitute  the  the  correspondence match  Ie in  the \nfollowing  manner \n\np(Ae, AQl/e) =  II  P(~a' ~ol/e(a) = a) \n\n(a,o)E/a \n\nAs  a  result the correspondence matches may be optimised using a  simple node-by(cid:173)\nnode  discrete  relaxation  procedure.  The rule  for  updating  the  match  assigned  to \nthe node a of the graph G  is \n\nle(a) =  arg  max \n\nP(~a'~o)l/(a) =  a)P(feIEe,EQ) \n\noEVQU{4>} \n\nIn order to model the structural consistency of the set of assigned matches,we turn \nto the framework  recently  reported by  Finch,  Wilson  and Hancock  [2}.  This  work \nprovides  a  framework  for  computing  graph-matching  energies  using  the  weighted \nHamming distance between matched cliques.  Since we are dealing with a large-scale \nobject recognition system, we  would  like to minimise the computational overheads \nassociated  with  establishing correspondence matches.  For this  reason,  rather than \n\n\fGraph Matchingfor Shape Retrieval \n\n899 \n\nworking with graph neighbourhoods or cliques,  we  chose to work with the relational \nunits of the smallest practical size.  In  other words we satisfy ourself with measuring \nconsistency  at  the  edge  level.  For  edge-units,  the structural matching probability \nP(fa!Va, Ea, VQ, EQ)  is  computed from  the formula \n\n(a,b)EEG  (Ct ,(J )EEQ \n\nwhere Pe  is  the probability of an error appearing on one of the edges of the matched \nstructure.  The Sa,Ct  are assignment variables which are used to represent the current \nstate of match and convey the following meaning \n\nSa  Ct  = {I  if fa (a)  = a \n\n0  otherwise \n\n, \n\n3  Histogram-based  consistency \n\nWe  now  furnish  some  details  of the  shape  retrieval  task  used  in  our  experimental \nevaluation  of the  recognition  method.  In  particular,  we  focus  on  the  problem  of \nrecognising 2D  line patterns in a  manner which is  invariant to rotation, translation \nand  scale.  The raw  information  available  for  each  line  segment  are its  orientation \n(angle with respect to the horizontal axis)  and its length  (see figure 1).  To illustrate \nhow  the Euclidean invariant pairwise feature attributes are computed, suppose that \nwe  denote  the  line  segments  associated  with  the  nodes  indexed  a  and  b  by  the \nvectors  Ya  and  Yb  respectively.  The vectors  are directed  away  from  their  point  of \nintersection.  The pairwise relative angle attribute is  given  by \n\n(Ja ,b = arccos [I:: 1\u00b71::1] \n\nFrom the relative angle we compute the directed relative angle.  This involves giving \n\nd \n\n~:.~~: \n\nb - !  \n\n---------c:----;-~:---\n\no..b \n\n~------. \n\n---------------. \n\nD;b \n\nFigure 1:  Geometry for  shape representation \n\nthe relative angle a  positive sign if the direction of the angle from  the baseline Ya  to \nits partner Yb  is  clockwise and a  negative sign if it is  counter-clockwise.  This allows \nus to extend the range of angles describing pairs of segments from  [0,7I\"J  to [-7I\",7I\"J. \nThe  directed  relative  position  {}a,b  is  represented  by  the  normalised  length  ratio \nbetween the oriented baseline vector  Ya  and the vector yl joining the end (b)  of the \nbaseline segment  (ab)  to the intersection of the segment  pair  (cd). \n\n{}a,b  = \n\n1 \nD \n\nl+~ \n2 \n\nDab \n\n\f900 \n\nB.  Huet,  A.  D. 1.  Cross and E.  R.  Hancock \n\nThe physical range of this attribute is  (0, IJ.  A relative position of 0 indicates that \nthe two  segments are parallel,  while  a  relative position  of 1 indicates that the two \nsegments intersect at the middle point of the baseline. \nThe Euclidean invariant angle and position attributes 8a,b  and {)a ,b are binned in  a \nhistogram.  Suppose  that Sa(J-l, v)  = {(a , b)18a,b E  All 1\\  {)a,b  E  Rv  1\\  bE VD}  is  the \nset  of nodes  whose  pairwise geometric  attributes  with  the  node  a  are  spanned  by \nthe  range  of directed  relative  angles  All  and  the  relative  position  attribute  range \nRv.  The contents  of the  histogram  bin  spanning the two  attribute ranges is  given \nby  Ha(J-l, v)  =  ISa(J-l, v)l.  Each  histogram  contains  nA  relative  angle  bins  and  nR \nlength ratio bins.  The normalised geometric histogram bin-entries are computed as \nfollows \n\nha(J-l, v)  =  \"nA  \"nR  H  ( \n\n) \n~Il'=l ~v'=l  a J-l, v \n\nHa(J-l, v) \n\nThe probability of match between the pattern-vectors is  computed using the Bhat(cid:173)\ntacharyya distance between the normalised histograms. \n\nP(f(a) =  al~a' ~a) =  L \n\nI:~~l I:~~l ha(J-l, v)ha(J-l, v) \n)h  ( \n\nI:nA  I:nR  h  ( \n\nj'EQ \n\nIl'=l \n\nv'=l  a J-l,  V \n\na  J-l,  V \n\n)  =  exp[-Ba ,aJ \n\nWith this  modelling ingredient, the condition for  recognition  is \n\nWQ  = arg~~%  L  L  {-Ba,a-Bb,iJ+ln(I-Pe)Sa,aSb,iJ+lnPe(I-Sa,aSb,/3)} \n\n(a , b}EE~ (a,iJ}EEQ \n\n4  Experiments \n\nThe aim in this section is  to evaluate the graph-based recognition scheme on a data(cid:173)\nbase  of real-world  line-patterns.  We  have  conducted  our  recognition  experiments \nwith  a  data-base of 2500  line-patterns each  containing over  a  hundred  lines.  The \nline-patterns have been obtained by  applying line/edge detection algorithms to the \nraw grey-scale images followed by polygonisation.  For each line-pattern in the data(cid:173)\nbase,  we  construct the six-nearest neighbour graph.  The feature extraction process \ntogether  with  other  details  of the  data used  in  our  study  are  described  in  recent \npapers where we  have focussed on the issues of histogram representation [4J  and the \noptimal choice of the relational structure for the purposes of recognition.  In order to \nprune the set of line-patterns for detailed graph-matching we select about 10% of the \ndata-base using a two-step process.  This consists of first refining the data-base using \na  global histogram of pairwise attributes  [4J .  The top  quartile of matches selected \nin  this  way  are then  further  refined  using  a  variant  of the  Haussdorff distance  to \nselect the set of pairwise attributes that best match against the query. \n\nThe recognition task is posed as one of recovering the line-pattern which most closely \nresembles a digital map.  The original images from which our line-patterns have been \nobtained are from  a number of diverse sources.  However, a subset of the images are \naerial infra-red line-scan  views  of southern England.  Two of these infra-red images \ncorrespond to different  views  of the area covered  by  the digital  map.  These views \nare obtained when  the line-scan device is  flying  at different altitudes.  The line-scan \ndevice  used  to  obtain  the  aerial  images  introduces  severe  barrel  distortions  and \nhence  the  map  and  aerial  images  are  not  simply  related  via  a  Euclidean  or  affine \ntransformation.  The remaining line-patterns in  the data-base have been  extracted \nfrom  trademarks and logos.  It is  important to stress that although the raw  images \nare obtained from  different  sources,  there is  nothing salient  about their associated \nline-pattern  representations  that  allows  us  to  distinguish  them  from  one-another. \n\n\fGraph Matchingfor Shape Retrieval \n\n901 \n\n(a)  Digital  Map \n\n(b)  Target  1 \n\n(c)  Target 2 \n\nFigure 2:  Images  from  the data-base \n\nMoreover,  since  it  is  derived  from  a  digital  map  rather  than  one  of the  images  in \nthe  data-base,  the  query  is  not  identical  to  any  of the  line-patterns  in  the  model \nlibrary. \n\nWe  aim  to  assess  the  importance  of different  attributes  representation  on  the  re(cid:173)\ntrieval  process.  To  this end,  we  compare node-based  and  the  histogram-based at(cid:173)\ntribute representation.  \\Ve  also consider  the effect  of taking the relative angle and \nrelative  position  attributes  both  singly  and  in  tandem.  The  final  aspect  of  the \ncomparison is to consider the effects of using the attributes purely for  initialisation \npurposes and also in a  persistent way during the iteration of the matching process. \nTo this end  we  consider the following  variants of our algorithm . \n\n\u2022  Non-Persistent  Attributes:  Here  we  ignore  the  attribute  information \nprovided  by  the  node-histograms  after  the  first  iteration  and  attempt  to \nmaximise the structural congruence of the graphs . \n\n\u2022  Local attributes:  Here we  use only the single node attributes rather than \n\nan attribute histogram to model  the  a  posteriori matching probabilities. \n\nGraph Matching Strategy \n\nRetrieval \nAccuracy \n\nIterations \nper recall \n\nReI.  Position Attribute iInitialisation only) \nReI.  Angle  Attribute  (Initialisation only) \n\nReI.  Angle + Position  Attributes  (Initialisation only) \n\n1D  ReI.  Position Histogram  (Initialisation only) \n1D  ReI.  Angle  Histogram  (Initialisation only) \n\n2D  Histogram  (Initialisation only) \nReI.  Position Attribute  (Persistent) \nReI.  Angle  Attribute  (Persistent) \n\nReI.  Angle + Position Attributes  (Persistent) \n\n1D  ReI.  Position Histogram  (Persistent) \n1D  ReI.  Angle  Histogram  (Persistent) \n\n39% \n45% \n58% \n42% \n59% \n68% \n63% \n89% \n98% \n66% \n92% \n100% \n\n5.2 \n4.75 \n4.27 \n4.7 \n4.2 \n3.9 \n3.96 \n3.59 \n3.31 \n3.46 \n3.23 \n3.12 \n\n2D  Histogram  (Persistent) \n\nTable  1:  Recognition  performance  of various  recognition  strategies  averaged  over \n26  queries  in  a  database of 260  line-patterns \nIn Table 1 we present the recognition performance for each of the recognition strate(cid:173)\ngies in turn.  The table lists the recall performance together with the average number \n\n\f902 \n\nB.  Huet,  A.  D. 1.  Cross and E.  R.  Hancock \n\nof iterations per recall  for  each of the recognition strategies in turn.  The main  fea(cid:173)\ntures to note from this table are as follows .  Firstly, the iterative recall using the full \nhistogram  representation  outperforms  each  of the  remaining  recognition  methods \nin  terms of both accuracy and computational overheads.  Secondly,  it is  interesting \nto compare the effect  of using the histogram in the initialisation-only and iteration \npersistent modes.  In the latter case the recall performance is some 32%  better than \nin  the former  case.  In  the  non-persistent  mode  the best  recognition accuracy  that \ncan be obtained  is  68%.  Moreover,  the  recall  is  typically  achieved  in  only  3.12  it(cid:173)\nerations as  opposed  to  3.9  (average over  26  queries  on  a  database of 260  images) . \nFinally, the histogram representation provides better performance, and more signif(cid:173)\nicantly, much  faster  recall  than the single-attribute similarity  measure.  When  the \nattributes are used singly,  rather than in  tandem, then  it  is  the relative angle that \nappears to be the most  powerful. \n\n5  Conclusions \n\nWe  have  presented  a  practical graph-matching algorithm  for  data-mining in  large \nstructural libraries.  The main  conclusion  to be  drawn  from  this  study  is  that  the \ncombined  use  of structural  and  histogram  information  improves  both  recognition \nperformance  and  recall  speed.  There  are  a  number  of  ways  in  which  the  ideas \npresented  in this  paper can  be extended.  Firstly,  we  intend to explore more a  per(cid:173)\nceptually meaningful  representation of the line  patterns,  using grouping principals \nderived  from  Gestalt  psychology.  Secondly,  we  are exploring the  possibility of for(cid:173)\nmulating the filtering of line-patterns prior to graph matching using Bayes decision \ntrees. \n\nReferences \n\n[1]  H.  Barrow and  R.  Popplestone.  Relational descriptions  in  picture processing. \n\nMachine  Intelligence,  5:377- 396, 1971. \n\n[2]  A.  Finch,  R.  Wilson,  and E.  Hancock.  Softening discrete relaxation.  Advances \nin NIPS 9,  Edited  by  M.  Mozer,  M.  Jordan  and  T.  Petsche,  MIT Press,  pages \n438- 444,  1997. \n\n[3]  S.  Gold  and  A.  Rangarajan.  A  graduated  assignment  algorithm  for  graph \n\nmatching.  IEEE PAMI,  18:377- 388,  1996. \n\n[4]  B.  Huet  and  E.  Hancock.  Relational  histograms  for  shape  indexing.  IEEE \n\nICC V,  pages 563- 569, 1998. \n\n[5]  W.  Niblack  et  al..  The QBIC project:  Querying images by content using color, \n\ntexture and shape.  Image  and  Vision  Storage  and Retrieval,  173- 187, 1993. \n[6]  A.  P.  Pentland,  R.  W.  Picard, and S.  Scarloff.  Photobook:  tools for  content(cid:173)\n\nbased  manipulation  of image databases.  Storage  and  Retrieval for  Image  and \nVideo  Database  II,  pages 34- 47,  February 1994. \n\n[7]  K. Sengupta and K. Boyer. Organising large structural databases.  IEEE PAMI, \n\n17(4):321- 332,1995. \n\n[8]  1.  Shapiro  and  R.  Haralick.  A  metric  for  comparing  relational  descriptions. \n\nIEEE PAMI,  7(1):90- 94,  1985. \n\n[9]  M.  Swain  and D.  Ballard.  Color indexing.  International  Journal  of Computer \n\nVision,  7(1) :11- 32,  1991. \n\n[10]  R.  Wilson  and  E.  R.  Hancock.  Structural  matching  by  discrete  relaxation. \n\nIEEE PAMI,  19(6):634- 648, June 1997. \n\n\f", "award": [], "sourceid": 1484, "authors": [{"given_name": "Benoit", "family_name": "Huet", "institution": null}, {"given_name": "Andrew", "family_name": "Cross", "institution": null}, {"given_name": "Edwin", "family_name": "Hancock", "institution": null}]}