{"title": "Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data", "book": "Advances in Neural Information Processing Systems", "page_first": 329, "page_last": 336, "abstract": "", "full_text": "Gaussian Process Latent Variable Models for\n\nVisualisation of High Dimensional Data\n\nNeil D. Lawrence\n\nDepartment of Computer Science,\n\nUniversity of Shef\ufb01eld,\n\nRegent Court, 211 Portobello Street,\n\nShef\ufb01eld, S1 4DP, U.K.\n\nneil@dcs.shef.ac.uk\n\nAbstract\n\nIn this paper we introduce a new underlying probabilistic model for prin-\ncipal component analysis (PCA). Our formulation interprets PCA as a\nparticular Gaussian process prior on a mapping from a latent space to\nthe observed data-space. We show that if the prior\u2019s covariance func-\ntion constrains the mappings to be linear the model is equivalent to PCA,\nwe then extend the model by considering less restrictive covariance func-\ntions which allow non-linear mappings. This more general Gaussian pro-\ncess latent variable model (GPLVM) is then evaluated as an approach to\nthe visualisation of high dimensional data for three different data-sets.\nAdditionally our non-linear algorithm can be further kernelised leading\nto \u2018twin kernel PCA\u2019 in which a mapping between feature spaces occurs.\n\n1 Introduction\n\nVisualisation of high dimensional data can be achieved through projecting a data-set onto\na lower dimensional manifold. Linear projections have traditionally been preferred due\nto the ease with which they can be computed. One approach to visualising a data-set in\ntwo dimensions is to project the data along two of its principal components. If we were\nforced to choose a priori which components to project along, we might sensibly choose\nthose associated with the largest eigenvalues. 
The probabilistic reformulation of principal component analysis (PCA) also informs us that choosing the first two components is the choice that maximises the likelihood of the data [11].\n\n1.1 Integrating Latent Variables, Optimising Parameters\n\nProbabilistic PCA (PPCA) is formulated as a latent variable model: given a set of centred D-dimensional data points y_n, and denoting the latent variable associated with each data-point by x_n, we may write the likelihood for an individual data-point under the PPCA model as\n\np(y_n | x_n, W, β) = N(y_n | W x_n, β^{-1} I),\n\nwhere W is the mapping matrix and x_n is Gaussian distributed with unit covariance, p(x_n) = N(x_n | 0, I). Marginalising the latent variable then gives\n\np(y_n | W, β) = N(y_n | 0, W W^T + β^{-1} I).\n\nThe solution for W can then be found^1 by assuming that each y_n is i.i.d. and maximising the likelihood of the data-set, where Y = [y_1, ..., y_N]^T is the N × D design matrix.\n\nProbabilistic principal component analysis and other latent variable models, such as factor analysis (FA) or independent component analysis (ICA), require a marginalisation of the latent variables and optimisation of the parameters. 
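The marginalisation step above can be checked numerically. The following sketch (illustrative code, not from the paper; the values of D, q, β and N are arbitrary choices of ours) samples from the PPCA generative model and confirms that the sample covariance of y approaches W W^T + β^{-1} I.

```python
import numpy as np

# Illustrative check: sample y_n = W x_n + eta_n with x_n ~ N(0, I) and
# eta_n ~ N(0, (1/beta) I), then compare the sample covariance of y
# with the analytic marginal covariance W W^T + (1/beta) I.
rng = np.random.default_rng(0)
D, q, beta, N = 4, 2, 10.0, 200_000

W = rng.standard_normal((D, q))                   # mapping matrix
X = rng.standard_normal((N, q))                   # latent points, one per row
E = rng.standard_normal((N, D)) / np.sqrt(beta)   # observation noise
Y = X @ W.T + E                                   # centred data, rows are y_n^T

C_empirical = Y.T @ Y / N                         # data are zero mean
C_analytic = W @ W.T + np.eye(D) / beta
print(np.max(np.abs(C_empirical - C_analytic)))   # shrinks as N grows
```

The agreement improves at the usual O(1/sqrt(N)) Monte Carlo rate.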
In this paper we consider the dual approach of marginalising W and optimising each x_n. This probabilistic model also turns out to be equivalent to PCA.\n\n1.2 Integrating Parameters, Optimising Latent Variables\n\nBy first specifying a prior distribution, p(W) = ∏_i N(w_i | 0, I), where w_i is the ith row of the matrix W, and integrating over W, we obtain a marginalised likelihood for X = [x_1, ..., x_N]^T,\n\np(Y | X, β) = (2π)^{-DN/2} |K|^{-D/2} exp(-(1/2) tr(K^{-1} Y Y^T)),   (1)\n\nwhere K = X X^T + β^{-1} I. The corresponding log-likelihood is\n\nL = -(DN/2) ln 2π - (D/2) ln|K| - (1/2) tr(K^{-1} Y Y^T).   (2)\n\nNow that the parameters are marginalised we may focus on optimisation of the likelihood with respect to X. The gradients of (2) with respect to X may be found as\n\n∂L/∂X = K^{-1} Y Y^T K^{-1} X - D K^{-1} X,\n\nwhich implies that at our solution Y Y^T K^{-1} X = D X. Some algebraic manipulation of this formula [11] leads to\n\nX = U Λ V^T,\n\nwhere U is an N × q matrix (q is the dimension of the latent space) whose columns are eigenvectors of Y Y^T, Λ is a q × q diagonal matrix whose jth element is l_j = (λ_j / D - 1/β)^{1/2}, where λ_j is the jth eigenvalue of Y Y^T, and V is an arbitrary q × q orthogonal matrix^2. Note that the eigenvalue problem we have developed can easily be shown to be equivalent to that solved in PCA (see e.g. [10]); indeed, the formulation of PCA in this manner is a key step in the development of kernel PCA [9], where Y Y^T is replaced with a kernel. Our probabilistic PCA model shares an underlying structure with [11] but differs in that where they optimise we marginalise and where they marginalise we optimise.\n\nThe marginalised likelihood we are optimising in (1) is recognised as the product of D independent Gaussian processes where the (linear) covariance function is given by K = X X^T + β^{-1} I. Therefore a natural extension is the non-linearisation of the mapping from latent space to the data space through the introduction of a non-linear covariance function.\n\n1As can the solution for β, but since the solution for W is not dependent on β we will disregard it.\n\n2For independent component analysis the correct rotation matrix V must also be found; here we have placed no constraints on the orientation of the axes, so this matrix cannot be recovered.\n\n2 Gaussian Process Latent Variable Models\n\nWe saw in the previous section how PCA can be interpreted as a Gaussian process 'mapping^3' from a latent space to a data space, where the locale of the points in latent space is determined by maximising the Gaussian process likelihood with respect to X. We will refer to models of this class as Gaussian process latent variable models (GPLVM). 
Principal component analysis is a GPLVM where the process prior is based on the N × N inner product matrix of X; in this section we develop an alternative GPLVM by considering a prior which allows for non-linear processes. Specifically we focus on the popular 'RBF kernel', which takes the form\n\nk(x_n, x_m) = α exp(-(γ/2)(x_n - x_m)^T (x_n - x_m)) + δ_{nm} β^{-1},\n\nwhere k(x_n, x_m) is the element in the nth row and mth column of K, α is a scale parameter and δ_{nm} denotes the Kronecker delta. Gradients of (2) with respect to the latent points can be found through combining\n\n∂L/∂K = (1/2)(K^{-1} Y Y^T K^{-1} - D K^{-1})\n\nwith ∂K/∂x_n via the chain rule. These gradients may be used in combination with (2) in a non-linear optimiser such as scaled conjugate gradients (SCG) [7] to obtain a latent variable representation of the data. Furthermore, gradients with respect to the parameters of the kernel matrix may be computed and used to jointly optimise X, α, β and γ. The solution for X will naturally not be unique; even for the linear case described above the solution is subject to an arbitrary rotation, and here we may expect multiple local minima.\n\n2.1 Illustration of GPLVM via SCG\n\nTo illustrate a simple Gaussian process latent variable model we turn to the 'multi-phase oil flow' data [2]. 
This is a twelve dimensional data-set containing data of three known classes corresponding to the phase of flow in an oil pipeline: stratified, annular and homogeneous. In this illustration, for computational reasons, the data is sub-sampled to 100 data-points.\n\nFigure 1 shows visualisations of the data using both PCA and our GPLVM algorithm, which required 766 iterations of SCG. The X positions for the GPLVM model were initialised using PCA (see http://www.dcs.shef.ac.uk/~neil/gplvm/ for the MATLAB code used).\n\nThe gradient based optimisation of the RBF based GPLVM's latent space shows results which are clearly superior (in terms of greater separation between the different flow domains) to those achieved by the linear PCA model. Additionally, the use of a Gaussian process to perform our 'mapping' means that there is uncertainty in the positions of the points in the data space. For our formulation the level of uncertainty is shared across all D dimensions^4 and thus may be visualised in the latent space. 
In Figure 1 (and subsequently) this is done through varying the intensity of the background pixels.\n\nUnfortunately, a quick analysis of the complexity of the algorithm shows that each gradient step requires an inverse of the kernel matrix, an O(N^3) operation, rendering the algorithm impractical for many data-sets of interest.\n\n3Strictly speaking the model does not represent a mapping, as a Gaussian process 'maps' to a distribution in data space rather than a point.\n\n4This apparent weakness in the model may be easily rectified to allow different levels of uncertainty for each output dimension; our more constrained model allows us to visualise this uncertainty in the latent space and is therefore preferred for this work.\n\nFigure 1: Visualisation of the Oil data with (a) PCA (a linear GPLVM) and (b) a GPLVM which uses an RBF kernel. Crosses, circles and plus signs represent stratified, annular and homogeneous flows respectively. The greyscales in plot (b) indicate the precision with which the manifold was expressed in data-space for that latent point. 
2.2 A Practical Algorithm for GPLVMs\n\nThere are three main components to our revised, computationally efficient, optimisation process:\n\nSparsification. Kernel methods may be sped up through sparsification, i.e. representing the data-set by a subset of points known as the active set, I. The remainder, the inactive set, is denoted by J. We make use of the informative vector machine [6], which selects points sequentially according to the reduction in the posterior process's entropy that they induce.\n\nLatent Variable Optimisation. A point from the inactive set, j, can be shown to project into the data space as a Gaussian distribution\n\np(y_j | x_j) = N(y_j | μ_j, σ_j^2 I),   (3)\n\nwhose mean is μ_j = Y_I^T K_{I,I}^{-1} k_{I,j}, where K_{I,I} denotes the kernel matrix developed from the active set and k_{I,j} is a column vector consisting of the elements from the jth column of K that correspond to the active set. The variance is\n\nσ_j^2 = k(x_j, x_j) - k_{I,j}^T K_{I,I}^{-1} k_{I,j}.\n\nNote that since x_j does not appear in the inverse, gradients with respect to x_j do not depend on other data in J. We can therefore independently optimise the likelihood of each y_j with respect to each x_j. Thus the full set X_J can be optimised with one pass through the data.\n\nKernel Optimisation. The likelihood of the active set is given by\n\np(Y_I) = (2π)^{-Dd/2} |K_{I,I}|^{-D/2} exp(-(1/2) tr(K_{I,I}^{-1} Y_I Y_I^T)),   (4)\n\nwhich can be optimised^5 with respect to α, β and γ, with gradient evaluations costing O(d^3), where d is the size of the active set.\n\nAlgorithm 1 summarises the order in which we implemented these steps.\n\n5In practice we looked for MAP solutions for all our optimisations, specifying a unit covariance Gaussian prior for the matrix X.\n\nAlgorithm 1 An algorithm for modelling with a GPLVM.\nRequire: A size for the active set, d. A number of iterations, T.\nInitialise X through PCA.\nfor T iterations do\n  Select a new active set using the IVM algorithm.\n  Optimise (4) with respect to the parameters of K_{I,I} using scaled conjugate gradients.\n  Select a new active set.\n  for each point not in the active set, j, do\n    Optimise (3) with respect to x_j using scaled conjugate gradients.\n  end for\nend for\n\nNote that whilst we never optimise points in the active set, we repeatedly reselect the active set, so it is unlikely that many points remain in their original location. All the experiments that follow used the same number of iterations and the same active set size. The experiments were run on a 'one-shot' basis^6, so we cannot make statements as to the effects that significant modification of these parameters would have. 
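The projection of an inactive point in (3) is the standard Gaussian process prediction through the active set. A minimal sketch (our illustrative code; the kernel parameters are assumed values, and a fixed index range stands in for the IVM's entropy-based selection):

```python
import numpy as np

# Illustrative sketch of projecting an inactive point j through active set I:
#   mu_j      = Y_I^T K_II^{-1} k_Ij
#   sigma_j^2 = k(x_j, x_j) - k_Ij^T K_II^{-1} k_Ij
rng = np.random.default_rng(4)
N, D, q = 20, 3, 2
alpha, gamma, beta = 1.0, 1.0, 10.0           # assumed kernel parameters

X = rng.standard_normal((N, q))
Y = rng.standard_normal((N, D))

def k_rbf(a, b):
    return alpha * np.exp(-0.5 * gamma * np.sum((a - b) ** 2))

active = np.arange(7)                         # stand-in for IVM selection
j = 15                                        # an inactive point

K_II = np.array([[k_rbf(X[a], X[b]) for b in active] for a in active])
K_II += np.eye(len(active)) / beta            # white noise term on the diagonal
k_Ij = np.array([k_rbf(X[a], X[j]) for a in active])

mu_j = Y[active].T @ np.linalg.solve(K_II, k_Ij)
sigma2_j = k_rbf(X[j], X[j]) + 1.0 / beta - k_Ij @ np.linalg.solve(K_II, k_Ij)
print(mu_j.shape, sigma2_j)                   # a D-vector and a positive scalar
```

Because K_II does not involve x_j, each inactive point can be optimised independently, which is what makes the single pass through the data possible.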
We present results on three data-sets. For the oil flow data (Figure 2) from the previous section we now make use of all 1000 available points, and we include a comparison with the generative topographic mapping (GTM) [4].\n\nFigure 2: The full oil flow data-set visualised with (a) GTM with 225 latent points laid out on a grid and with 16 RBF nodes and (b) an RBF based GPLVM. Notice how the GTM artificially 'discretises' the latent space around the locations of the 225 latent points.\n\nWe follow [5] in our 2-D visualisation of a sub-set of 3000 of the digits 0-4 (600 of each digit) from a 16 × 16 greyscale version of the USPS digit data-set (Figure 3).\n\nFinally, we modelled a face data-set [8] consisting of 1965 images from a video sequence digitised at 20 × 28. Since the images are originally from a video sequence we might expect the underlying dimensionality of the data to be one; the images are produced in a smooth way over time, which can be thought of as a piece of string embedded in a high (560) dimensional pixel space. We therefore present ordered results from a 1-D visualisation in Figure 4.\n\nAll the code used for performing the experiments is available from http://www.dcs.shef.ac.uk/~neil/gplvm/ along with avi video files of the 1-D visualisation and results from two further experiments on the same data (a 1-D GPLVM model of the digits and a 2-D GPLVM model of the faces).\n\n6By one-shot we mean that, given the algorithm above, each experiment was only run once, with one setting of the random seed and one setting of the algorithm's parameters. If we were producing a visualisation for only one dataset this would leave us open to the criticism that our one-shot result was 'lucky'. However, we present three data-sets in what follows, and using a one-shot approach in problems with multiple local minima removes the temptation of preferentially selecting 'prettier' results.\n\nFigure 3: The digit images visualised in the 2-D latent space. We followed [5] in plotting images in a random order but not plotting any image which would overlap an existing image. 538 of the 3000 digits are plotted. Note how little space is taken by the 'ones' (the thin line running from (-4, -1.5) to (-1, 0)) in our visualisation; this may be contrasted with the visualisation of a similar data-set in [5]. We suggest this is because 'ones' are easier to model and therefore do not require a large region in latent space.\n\n3 Discussion\n\nEmpirically the RBF based GPLVM model gives useful visualisations of a range of data-sets. Strengths of the method include the ability to optimise the kernel parameters and to generate fantasy data from any point in latent space. 
Through the use of a probabilistic process we can obtain error bars on the position of the manifolds, which can be visualised by imposing a greyscale image upon the latent space.\n\nWhen Kernels Collide: Twin Kernel PCA. The eigenvalue problem which provides the maxima of (2) with respect to X for the linear kernel is exploited in kernel PCA. One could consider a 'twin kernel' PCA where both X X^T and Y Y^T are replaced by kernel functions. Twin kernel PCA could no longer be undertaken with an eigenvalue decomposition, but Algorithm 1 would still be a suitable mechanism with which to determine the values of X and the parameters of X's kernel.\n\nFigure 4: Top: Fantasy faces from the 1-D model for the face data. These faces were created by taking 64 uniformly spaced and ordered points from the latent space and visualising the mean of their distribution in data space. The plots above show this sequence unfolding (starting at the top left and moving right). Ideally the transition between the images should be smooth. Bottom: Examples from the data-set which are closest to the corresponding fantasy images in latent space. Full sequences of 2000 fantasies and the entire dataset are available on the web as avi files.\n\nStochastic neighbor embedding. Consider a Gaussian distribution p(z | S) = N(z | 0, S), where we have introduced a vector z of length N and have defined S = (1/D) Y Y^T. Up to an additive constant, (2) could be written as\n\nL = -(D/2)(ln|K| + tr(K^{-1} S)).\n\nThe entropy of p(z | S) is constant in X^7; we therefore may add it (scaled by D) to L to obtain\n\nL = -D KL(p(z | S) || p(z | K)) + const,   (5)\n\nwhich is recognised as the Kullback-Leibler (KL) divergence between the two distributions. Stochastic neighbor embedding (SNE) [5] also minimises this KL divergence to visualise data. However, in SNE the vector z is discrete.\n\n7Computing the entropy requires S to be of full rank; this is not true in general but can be forced by adding 'jitter' to S, e.g. replacing S with S + εI for some small ε.\n\nGenerative topographic mapping. The generative topographic mapping [3] makes use of a radial basis function network to perform the mapping from latent space to observed space. Marginalisation of the latent space is achieved with an expectation-maximisation (EM) algorithm. A radial basis function network is a special case of a generalised linear model and can be interpreted as a Gaussian process. Under this interpretation the GTM becomes a GPLVM with a particular covariance function. The special feature of the GTM is the manner in which the latent space is represented: as a set of uniformly spaced delta functions. One could view the GPLVM as having a delta function associated with each data-point: in the GPLVM the positions of the delta functions are optimised, whereas in the GTM each data point is associated with several different fixed delta functions.\n\n4 Conclusions\n\nWe have presented a new class of models for probabilistic modelling and visualisation of high dimensional data. 
We provided strong theoretical grounding for the approach by proving that principal component analysis is a special case. On three real world data-sets we showed that visualisations provided by the model cluster the data in a reasonable way. Our model has an advantage over the various spectral clustering algorithms that have been presented in recent years in that, in common with the GTM, it is truly generative with an underlying probabilistic interpretation. However, it does not suffer from the artificial 'discretisation' suffered by the GTM. Our theoretical analysis also suggested a novel non-linearisation of PCA involving two kernel functions.\n\nAcknowledgements We thank Aaron Hertzmann for comments on the manuscript.\n\nReferences\n\n[1] S. Becker, S. Thrun, and K. Obermayer, editors. Advances in Neural Information Processing Systems, volume 15, Cambridge, MA, 2003. MIT Press.\n\n[2] C. M. Bishop and G. D. James. Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research, A327:580-593, 1993.\n\n[3] C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: a principled alternative to the Self-Organizing Map. In Advances in Neural Information Processing Systems, volume 9, pages 354-360. MIT Press, 1997.\n\n[4] C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: the Generative Topographic Mapping. Neural Computation, 10(1):215-234, 1998.\n\n[5] G. Hinton and S. Roweis. Stochastic neighbor embedding. In Becker et al. [1], pages 857-864.\n\n[6] N. D. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: The informative vector machine. In Becker et al. [1], pages 625-632.\n\n[7] I. T. Nabney. Netlab: Algorithms for Pattern Recognition. Advances in Pattern Recognition. Springer, Berlin, 2001. Code available from http://www.ncrg.aston.ac.uk/netlab/.\n\n[8] S. Roweis, L. K. Saul, and G. Hinton. Global coordination of local linear models. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, pages 889-896, Cambridge, MA, 2002. MIT Press.\n\n[9] B. Schölkopf, A. J. Smola, and K.-R. Müller. Kernel principal component analysis. In Proceedings 1997 International Conference on Artificial Neural Networks, ICANN'97, page 583, Lausanne, Switzerland, 1997.\n\n[10] M. E. Tipping. Sparse kernel principal component analysis. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, pages 633-639, Cambridge, MA, 2001. MIT Press.\n\n[11] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, B, 61(3):611-622, 1999.\n", "award": [], "sourceid": 2540, "authors": [{"given_name": "Neil", "family_name": "Lawrence", "institution": null}]}