{"title": "Exploratory Data Analysis Using Radial Basis Function Latent Variable Models", "book": "Advances in Neural Information Processing Systems", "page_first": 529, "page_last": 535, "abstract": null, "full_text": "Exploratory Data Analysis Using Radial Basis \n\nFunction Latent Variable Models \n\nDERA \n\nSt Andrews Road, Malvern \n\nWorcestershire U.K. WR14 3PS \n\nAlan D. Marrs and Andrew R. Webb \n\n{marrs,webb}@signal.dera.gov.uk \n\n@British Crown Copyright 1998 \n\nAbstract \n\nTwo  developments of nonlinear latent variable models  based  on  radial \nbasis functions are discussed:  in the first, the use of priors or constraints \non allowable models is considered as a means of preserving data structure \nin  low-dimensional representations for  visualisation  purposes.  Also,  a \nresampling approach  is  introduced which  makes  more  effective use  of \nthe latent samples in evaluating the likelihood. \n\n1 \n\nINTRODUCTION \n\nRadial basis  functions  (RBF)  have been extensively used for problems in  discrimination \nand regression.  Here  we consider their application for obtaining low-dimensional repre(cid:173)\nsentations of high-dimensional data as part of the exploratory data analysis process. There \nhas  been a  great deal of research over the years  into linear and nonlinear techniques for \ndimensionality  reduction.  The technique  most  commonly used  is  principal components \nanalysis (PCA) and there have been several nonlinear generalisations, each taking a partic(cid:173)\nular definition of PCA and generalising it to the nonlinear situation. \n\nOne  approach is  to  find  surfaces of closest fit  (as  a generalisation of the PCA definition \ndue  to  the  work of Pearson  (1901)  for finding  lines  and  planes  of closest fit).  This has \nbeen explored by Hastie and  Stuetzle  (1989), Tibshirani (1992) (and  further by  LeBlanc \nand Tibshirani,  1994) and various authors using a neural network approach (for example, \nKramer,  1991).  Another approach is  one of variance maximisation subject to  constraints \non the transformation (Hotelling, 1933). This has been investigated by Webb (1996), using \na transformation modelled as an RBF network, and in a supervised context in Webb (1998). \n\nAn alternative strategy also using RBFs,  based on metric multidimensional scaling, is de(cid:173)\nscribed by Webb  (1995) and Lowe and Tipping (1996).  Here,  an  optimisation criterion, \n\n\f530 \n\nA. D. Marrs and A. R. Webb \n\ntermed stress, is defined in the transformed space and the weights in an RBF model deter(cid:173)\nmined by minimising the stress. \n\nThe above methods use a  radial basis function to model a transformation from the high(cid:173)\ndimensional data space to a low-dimensional representation space.  A complementary ap(cid:173)\nproach is provided by Bishop et al (1998) in which the structure of the data is modelled as \na function of hidden or latent variables.  Termed generative topographic mapping (GTM), \nthe  model may be regarded as a  nonlinear generalisation of factor analysis in  which  the \nmapping from latent space to data space is characterised by an RBF. \n\nSuch generative models are relevant to a  wide range of applications including radar target \nmodelling, speech recognition and handwritten character recognition. \n\nHowever, one of the problems with GTM that limits its practical use for visualising data on \nmanifolds in high dimensional space arises from distortions in the structure that it imposes. \nThis is  acknowledged in Bishop et al (1997) where  'magnification factors' are introduced \nto correct for the GTM's deficiency as a means of data visualisation. \n\nThis paper considers two developments: constraints on the permissible models and resam(cid:173)\npiing of the  latent space.  Section 2  presents  the  background to  latent  variable  models; \nModel constraints are discussed in Section 3.  Section 4 describes a re-sampling approach \nto estimation of the posterior pdf on the latent samples.  An illustration is provided in  Sec(cid:173)\ntion 5. \n\n2  BACKGROUND \n\nBriefly,  we  shall  re-state  the  basic  GTM  model,  retaining the  notation  of Bishop  et al \n(1998).  Let Ui, i  =  1, ... , N}, ti  E RP  represent measurements on the data space vari(cid:173)\nables; Z  E R.  represent the latent variables. \nLet t  be normally-distributed with mean y(z; W) and covariance matrix {3-1 I; y(z; W) \nis a nonlinear transformation that depends on a set of parameters W. Specifically, we shall \nassume a basis function model \n\ny(z; W)  =  L Wi<Pi(Z) \n\nM \n\ni=1 \n\nwhere  the  vectors Wi  E  R. D  are  to  be  determined through optimisation  and  {<Pi, i  = \n1, . . . , M} is a set of basis functions defined on the latent space. \n\nThe data distribution may be written \n\np(tIW,{3)  =  ! p(tlz; W ,{3)p(z)dz \n\nwhere, under the assumptions of normality, \n\n(1) \n\np(tlz; W,{3)  = \n\n(  {3  )  D /2 \n271\" \n\n{{3 \n\n} \nexp  -\"2ly(z; W) - tll 2 \n\nApproximating the integral by a finite sum (assuming the functions p(z) and y(z) do not \nvary too greatly compared with the sample spacing), we have \n\nK \n\np(tIW,{3)  =  LPiP(tlzi; W,{3) \n\n(2) \n\ni =1 \n\nwhich may be regarded as a function of the parameters W  and {3  that characterise y . \n\n\fExploratory Data Analysis Using Radial Basis Function Latent Variable Models \nGiven the data set {ti' i  = 1, ... ,N}, the log likelihood is given by \n\nN \n\nL(W,J3)  =  I)n[p(t;IW,J3)] \n\nj=l \n\nwhich may be maximised using a standard EM approach (Bishop et ai,  1998). \n\nIn this case, we have \n\n1  N \n\nn=l \n\nas the re-estimate of the mixture component weights, Pj, at the (m + 1) step, where \n\nPj  =  N  L~n \n\n~  _ \n\nn \n\n-\n\np~m)p(tnIZi; w(m), J3(m\u00bb) \nEiP~m)p(tnlzi;W(m),J3(m\u00bb) \n\n531 \n\n(3) \n\n(4) \n\nand (.) (m) denotes values at the mth step.  Note that Bishop et al (1998) do not re-estimate \nP;; all values are taken to be equal. \nThe number of P;  terms to be re-estimated is K, the number of terms used to approximate \nthe integral (1).  We  might expect that the density is  smoothly varying and governed by a \nmuch fewer number of parameters (not dependent on K). \nThe re-estimation equation for the D  x  M  matrix W  =  [w 11 ... I W M] is \n\nW(m+l)  = TT RT +[+T G+]-l \n\n(5) \n\nwhere G is the K  x  K  diagonal matrix with \n\nN \n\nGjj  =  LRjn \n\nn=l \n\nand  TT  =  [tIl .. . ltN],  +T  =  [tfJ(Zl)I .. . ltfJ(ZK)].  The  term  J3  is  re-estimated  as \nl/J3(m)  =  1/(ND) E~l E~l Rjilti - w(m+l)tfJ(Zj) 12. \nOnce we  have  determined parameters of the transformation, we  may invert the model by \nasking for the distribution of Z  given a measurement ti. That is, we require \n\np(Zlti)  = \n\np(tilz)p(z) \n\nf p(tilz)p(z)dz \n\n(6) \n\nFor example, we may plot the position of the peak of the distribution p( Ziti) for each data \nsample ti. \n\n3  APPLYING A CONSTRAINT \n\nOne  way  to  retain  structure is to  impose  a condition that ensures that a  unit step  in  the \nlatent space corresponds to a unit step in the data space (more or less).  For a single latent \nvariable, Xl, we may impose the constraints that \n\nlay  12 \n\naXl \n\n1 \n\n= \n\nwhich may be written, in terms of W  as \n\nifwTWil \n\n1 \n\n\f532 \n\nA.  D.  Marrs and A.  R.  Webb \n\n~ \n\nwhere;l  =  8c/J. \nThe derivative of the data space variable with respect to the latent variable has  unit mag(cid:173)\nnitude.  The derivative is of course a function of Xl  and imposing such a condition at each \nsample point in  latent space would not be possible owing to the  smoothness of the  RBF \nmodel.  However, we may average over the latent space, \n\nwhere ( .) denotes average over the latent space. \n\nIn general, for L  latent variables we may impose a constraint JTWTW J  =  1 L  leading \nto the penalty term \n\nTr {A(JTWTW J  - IL)} \n\nwhere J  is  an  M  x  L  matrix with  jth column 8\u00a2/8xj and  A  is  a symmetric matrix of \nLagrange multipliers.  This  is  very  similar to  regularisation  terms.  It is  a  condition  on \nthe norm of W; it incorporates the Jacobian matrix J  and a symmetric L  x  L  matrix of \nLagrange multipliers, A. The re-estimation solution for W  may be written \n\n(7) \n\nwith A chosen so that the constraint JT W T W J  =  1 L  is satisfied. \nWe  may  also  use  the  derivatives of the  transformation to  define  a  distortion  measure  or \nmagnification factor, \n\nM(Zj W)  =  IIJTWTW J  - 1112 \n\nwhich is a function of the latent variables and the model parameters. A value of zero shows \nthat there is no distortion 1 \u2022 \n\nAn  alternative to the  constraint approach above is to  introduce a  prior on  the allowable \ntransformations using the magnification factor; for example, \n\nP(W)  ~ exp(-AM(zj W)) \n\n(8) \nwhere  A is  a  regularisation  parameter.  This  leads  to  a  modification  to  the  M-step  re(cid:173)\nestimation equation for W, providing a maximum a posteriori estimate. Equation (8) pro(cid:173)\nvides a natural generalisation of PCA since for the special case of a linear transformation \n(Pi  = Xi, M  = L), the solution for W  is the PCA space as A ~ 00. \n\n4  RESAMPLING THE LATENT SPACE \n\nHaving obtained a mapping from latent space to data space using the above constraint, we \nseek a better estimate to the posterior pdf of the latent samples.  Current versions of GTM \nrequire the  latent samples  to  be  uniformly distributed  in  the  latent space  which  leads to \ndistortions  when  the data of interest are  projected into  the latent space  for visualisation. \nSince the responsibility matrix R can be used to determine a weight for each of the latent \nsamples it is possible to update these samples using a resampling scheme. \n\nWe propose to use a resampling scheme based upon adaptive kernel density estimation. The \nbasic procedure places a Gaussian kernel on each latent sample. This results in a Gaussian \n\n1 Note  that this differs  from  the  measure  in  the  paper by  Bishop et aI,  where  a rati()-()f-areas \ncriterion is used, a factor which is unity for zero distortion, but may also be unity for some distortions. \n\n\fExploratory Data Analysis Using Radial Basis Function Latent Variable Models \n\nmixture representation of the pdf of the latent samples p( x It), \n\nK \n\np(xlt) =  ~PiN(lLi' E i ), \n\ni=l \n\n533 \n\n(9) \n\nwhere each mixture component is  weighted according to the latent sample weight Pi.  Ini(cid:173)\ntially,  the  Ei'S are all  equal,  taking their  value  from  the  standard  formula  of Silverman \n(1986), \n\nwhere matrix Y  is an estimate of the covariance of p( x )and, \n\nEi =  hLy, \n\n(10) \n\n(11) \n\nIf the kernels are centered exactly on the latent samples, this model artificially inflates the \nvariance of the  latent  samples.  Following West  (1993) we  perform kernel  shrinkage by \nmaking the lLi  take the values \n\n(12) \n\nwhere jL  is the mean of the latent samples.  This ensures that there is no artificial inflation \nof the variance. \n\nTo reduce the redundancy in our initially large number of mixture components, we propose \na kernel  reduction scheme in  a similar manner to West.  However,  the scheme used  here \ndiffers from that of West and follows a scheme proposed by Salmond (1990).  Essentially, \nwe chose the component with the smallest weight and its nearest neighbour, denoting these \nwith  subscripts 1 and  2  respectively.  These components are then combined into a single \ncomponent denoted with subscript c as follows, \n\nPc  = Pl + P2 \n\nPllLl + P21L2 \nIL  =  --=---= \n\nc \n\nPc \n\nEc =  Pl[El  + (lLc  -lLl)(lLc -lLl)T] + P2[E2 + (lLc  -1L2)(lLc  -1L2)T]. \n\nPc \n\n(13) \n\n(14) \n\n(15) \n\nThis  procedure is  repeated  until  some  stopping  criterion  is  met.  The  stopping criterion \ncould be a  simple limit upon the number of mixture components ie;  smaller than  K  but \nsufficiently large to model the data structure.  Alternatively, the average kernel covariance \nand between kernel covariance can be monitored and the reduction stopped before some \nmultiple (eg.  10) of the average kernel covariance exceeds the between kernel covariance. \n\nOnce a  final  mixture  density  estimate  is  obtained,  a  new  set  of equally  weighted  latent \nsamples can be  drawn  from  it.  The new  latent samples represent a better estimate of the \nposterior pdf of the latent samples and can be used, along with the existing RBF mapping, \nto  calculate  a  new  responsibility  matrix  R.  This  procedure  can be  repeated  to  obtain  a \nfurther improved estimate of the posterior pdf which, after only a couple of iterations can \nlead to good estimates of the posterior pdf which further iterations fail to improve upon. \n\n5  RESULTS \n\nA  latent  variable  model  based  oil  a  spherically-symmetric Gaussian  RBF  has  been  im(cid:173)\nplemented.  The weights and the centres of the RBF  were initialised  so that the solution \nbest approximated the zer<rdistortion principal components solution for tw<rdimensional \nprojection. \n\n\f534 \n\nA. D. Marrs and A. R. Webb \n\nFor our example we chose to construct a simulated data set with easily identifiable structure. \nFour hundred points lying on the letters \"NIPS\" were sampled and projected onto a sphere \nof radius 50 such that the points lay between 25 0  and  1750  longitude and 750  and  1250 \nlatitude with Gaussian noise of variance 4.0 on the radius of each point. The resulting data \nare shown in figure 1. \n\nToY  dataset \n\nFigure 1: Simulated data. \n\n:: I  I . .,:.' .:\\  ~ .. ~. \n.i,~ \u2022 \n\"I: \n. > I  I \n.~.,......  I \n\u00a3~:  l' \n-u  r ... 1 \n-to.,. \n\n:.,  ...... \n\nI \n\nFigure 2: Results for standard GTM model. \n\nFigure 3: Results for regularisedlresampled model. \n\nFigure 2 shows results for the standard GTM (uniform grid of latent samples) projection of \nthe data to two  dimensions.  The central figure shows the projection onto the latent space, \nexhibiting significant distortion.  The left figure shows the projection of the regular grid of \nlatent samples (red points) into the data space.  Distortion of this grid can be easily seen. \nThe right figure is a plot of the magnification factor as defined in section 3, with mean value \nof 4.577. For this data set most stretching occurs at the edges of the latent variable space. \n\nFigure 3 shows results for the regularisedlresampled version of the latent variable model \nfor  A =  1.0.  Again the central figure  shows the projection onto the latent space after 2 \niterations of the resampling procedure.  The left-hand  figure  shows the projection of the \ninitial  regular grid  of latent samples  into the  data  space.  The effect of regularisation  is \nevident by the lack of severe distortions.  Finally the magnification factors can be seen in \nthe right-hand figure to be lower, with a mean value of 0.976. \n\n\fExploratory Data Analysis Using Radial Basis Function Latent Variable Models \n\n535 \n\n6  DISCUSSION \n\nWe have considered two developments of the GTM latent variable model: the incorporation \nof priors on the allowable model and a resampling approach to the maximum likelihood pa(cid:173)\nrameter estimation.  Results have been presented for this regularisedlresampling approach \nand magnification  factors  lower than the  standard model achieved,  using  the  same  RBF \nmodel.  However, further reduction in magnification factor is possible with different RBF \nmodels, but the example illustrates that resampling offers a more robust approach. Current \nwork is aimed at assessing the approach on realistic data sets. \n\nReferences \n\nBishop, C.M. and Svensen, M.  and Williams, C.K.1.  (1997).  Magnification factors for the \nGTM algorithm. lEE International Conference on Artificial Neural Networks, 465-471. \n\nBishop, C.M.  and  Svensen,  M.  and Williams,  C.K.1.  (1998).  GTM:  the generative topo(cid:173)\ngraphic mapping. Neural Computation, 10,215-234. \n\nHastie, T.  and  Stuetzle,  W.  (1989).  Principal curves,  Journal of the American Statistical \nAssociation, 84, 502-516. \n\nHotelling, H.  (1933).  Analysis of a complex of statistical variables into principal compo(cid:173)\nnents. Journal of Educational Psychology, 24, 417-441,498-520. \n\nKramer, M.A. (1991). Nonlinear principal component analysis using autoassociative neural \nnetworks. American Institute of Chemical Engineers Journal, 37(2),233-243. \n\nLeBlanc, M.  and Tibshirani, R.  (1994).  Adaptive principal surfaces. Journal of the Ameri(cid:173)\ncan Statistical Association, 89(425), 53-664. \n\nLowe,  D.  and Tipping, M.  (1996).  Feed-forward neural networks and topographic map(cid:173)\npings for exploratory data analysis. Neural Computing and Applications, 4, 83-95. \n\nPearson, K. (1901). On lines and planes of closest fit.  Philosophical Magazine, 6, 559-572. \n\nSalmond, D.J. (1990). Mixture reduction algorithms for target tracking in clutter.  Signal & \nData processing of small targets,  edited by O.  Drummond, SPlE,  1305. \n\nSilverman, B.W.  (1986). Density Estimation for Statistics and Data Analysis.  Chapman & \nHall,1986. \n\nTibshirani, R.  (1992). Principal curves revisited.  Statistics and Computing, 2(4), 183-190. \n\nWebb, A.R. (1995).  Multidimensional scaling by iterative majorisation using radial basis \nfunctions.  Pattern Recognition, 28(5), 753-759. \n\nWebb,  A.R.  (1996).  An  approach  to  nonlinear  principal  components  analysis  using \nradially-symmetric kernel functions.  Statistics and Computing, 6,  159-168. \n\nWebb, A.R.  (1997).  Radial basis functions for exploratory data analysis:  an iterative ma(cid:173)\njorisation approach for Minkowski distances based on multidimensional scaling.  Journal \nof Classification,  14(2),249-267. \n\nWebb,  A.R.  (1998).  Supervised nonlinear principal components analysis.  (submitted for \npublication ). \n\nWest, M.  (1993).  Approximating posterior distributions by mixtures.  J.  R.  Statist.  Soc B, \n55(2), 409-422. \n\n\f", "award": [], "sourceid": 1544, "authors": [{"given_name": "Alan", "family_name": "Marrs", "institution": null}, {"given_name": "Andrew", "family_name": "Webb", "institution": null}]}