{"title": "GTM: A Principled Alternative to the Self-Organizing Map", "book": "Advances in Neural Information Processing Systems", "page_first": 354, "page_last": 360, "abstract": null, "full_text": "GTM: A Principled Alternative to the Self-Organizing Map

Christopher M. Bishop
C.M.Bishop@aston.ac.uk

Markus Svensen
svensjfm@aston.ac.uk

Christopher K. I. Williams
C.K.I.Williams@aston.ac.uk

Neural Computing Research Group
Aston University, Birmingham, B4 7ET, UK
http://www.ncrg.aston.ac.uk/

Abstract

The Self-Organizing Map (SOM) algorithm has been extensively studied and has been applied with considerable success to a wide variety of problems. However, the algorithm is derived from heuristic ideas and this leads to a number of significant limitations. In this paper, we consider the problem of modelling the probability density of data in a space of several dimensions in terms of a smaller number of latent, or hidden, variables. We introduce a novel form of latent variable model, which we call the GTM algorithm (for Generative Topographic Mapping), which allows general non-linear transformations from latent space to data space, and which is trained using the EM (expectation-maximization) algorithm. Our approach overcomes the limitations of the SOM, while introducing no significant disadvantages. We demonstrate the performance of the GTM algorithm on simulated data from flow diagnostics for a multi-phase oil pipeline.

1 Introduction

The Self-Organizing Map (SOM) algorithm of Kohonen (1982) represents a form of unsupervised learning in which a set of unlabelled data vectors t_n (n = 1, ..., N) in a D-dimensional data space is summarized in terms of a set of reference vectors having a spatial organization corresponding (generally) to a two-dimensional sheet.[1]

[1] Biological metaphor is sometimes invoked when motivating the SOM procedure. It should be stressed that our goal here is not neuro-biological modelling, but rather the development of effective algorithms for data analysis.

While this algorithm has achieved many successes in practical applications, it also suffers from some major deficiencies, many of which are highlighted in Kohonen (1995) and reviewed in this paper.

From the perspective of statistical pattern recognition, a fundamental goal in unsupervised learning is to develop a representation of the distribution p(t) from which the data were generated. In this paper we consider the problem of modelling p(t) in terms of a number (usually two) of latent or hidden variables. By considering a particular class of such models we arrive at a formulation in terms of a constrained Gaussian mixture which can be trained using the EM (expectation-maximization) algorithm. The topographic nature of the representation is an intrinsic feature of the model and is not dependent on the details of the learning process. Our model defines a generative distribution p(t) and will be referred to as the GTM (Generative Topographic Mapping) algorithm (Bishop et al., 1996a).

2 Latent Variables

The goal of a latent variable model is to find a representation for the distribution p(t) of data in a D-dimensional space t = (t_1, ..., t_D) in terms of a number L of latent variables x = (x_1, ..., x_L). This is achieved by first considering a non-linear function y(x; W), governed by a set of parameters W, which maps points x in the latent space into corresponding points y(x; W) in the data space.
Typically we are interested in the situation in which the dimensionality L of the latent space is less than the dimensionality D of the data space, since our premise is that the data itself has an intrinsic dimensionality which is less than D. The transformation y(x; W) then maps the latent space into an L-dimensional non-Euclidean manifold embedded within the data space.

If we define a probability distribution p(x) on the latent space, this will induce a corresponding distribution p(y|W) in the data space. We shall refer to p(x) as the prior distribution of x for reasons which will become clear shortly. Since L < D, the distribution in t-space would be confined to a manifold of dimension L and hence would be singular. Since in reality the data will only approximately live on a lower-dimensional manifold, it is appropriate to include a noise model for the t vector. We therefore define the distribution of t, for given x and W, to be a spherical Gaussian centred on y(x; W) having variance β^{-1}, so that p(t|x, W, β) ∼ N(t | y(x; W), β^{-1} I). The distribution in t-space, for a given value of W, is then obtained by integration over the x-distribution

p(t|W, β) = ∫ p(t|x, W, β) p(x) dx.    (1)

For a given data set D = (t_1, ..., t_N) of N data points, we can determine the parameter matrix W, and the inverse variance β, using maximum likelihood, where the log likelihood function is given by

L(W, β) = Σ_{n=1}^{N} ln p(t_n | W, β).    (2)

In principle we can now seek the maximum likelihood solution for the weight matrix, once we have specified the prior distribution p(x) and the functional form of the mapping y(x; W), by maximizing L(W, β).

Figure 1: We consider a prior distribution p(x) consisting of a superposition of delta functions, located at the nodes of a regular grid in latent space. Each node x_l is mapped to a point y(x_l; W) in data space, which forms the centre of the corresponding Gaussian distribution.

The latent variable model can be related to the Kohonen SOM algorithm by choosing p(x) to be a sum of delta functions centred on the nodes of a regular grid in latent space,

p(x) = (1/K) Σ_{l=1}^{K} δ(x − x_l).

This form of p(x) allows the integral in (1) to be performed analytically. Each point x_l is then mapped to a corresponding point y(x_l; W) in data space, which forms the centre of a Gaussian density function, as illustrated in Figure 1. Thus the distribution function in data space takes the form of a Gaussian mixture model p(t|W, β) = (1/K) Σ_{l=1}^{K} p(t|x_l, W, β) and the log likelihood function (2) becomes

L(W, β) = Σ_{n=1}^{N} ln { (1/K) Σ_{l=1}^{K} p(t_n | x_l, W, β) }.    (3)

This distribution is a constrained Gaussian mixture since the centres of the Gaussians cannot move independently but are related through the function y(x; W). Note that, provided the mapping function y(x; W) is smooth and continuous, the projected points y(x_l; W) will necessarily have a topographic ordering.

2.1 The EM Algorithm

If we choose a particular parametrized form for y(x; W) which is a differentiable function of W we can use standard techniques for non-linear optimization, such as conjugate gradients or quasi-Newton methods, to find a weight matrix W*, and inverse variance β*, which maximize L(W, β). However, our model consists of a mixture distribution which suggests that we might seek an EM algorithm (Dempster et al., 1977).
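The log likelihood (3) is straightforward to evaluate numerically. The following is a minimal NumPy sketch (our own illustration, not the authors' code), assuming the mixture centres y(x_l; W) have already been computed and stacked into an array; a log-sum-exp keeps the evaluation stable when β is large:

```python
import numpy as np

def gtm_log_likelihood(T, Y, beta):
    """Evaluate eq. (3): T is the (N, D) data matrix, Y the (K, D) array
    of mapped grid points y(x_l; W), beta the inverse noise variance."""
    N, D = T.shape
    K = Y.shape[0]
    # Squared distances ||t_n - y(x_l; W)||^2, shape (N, K).
    d2 = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    # log N(t_n | y(x_l; W), beta^{-1} I) for every (n, l) pair.
    log_gauss = 0.5 * D * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * d2
    # ln{ (1/K) sum_l exp(log_gauss) } per data point, via log-sum-exp.
    m = log_gauss.max(axis=1, keepdims=True)
    log_mix = m[:, 0] + np.log(np.exp(log_gauss - m).sum(axis=1)) - np.log(K)
    return log_mix.sum()
```

Because the mixing coefficients are fixed at 1/K, duplicating a centre leaves the likelihood unchanged; only the positions y(x_l; W) and β carry adaptive parameters.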
By making a careful choice of model y(x; W) we will see that the M-step can be solved exactly. In particular we shall choose y(x; W) to be given by a generalized linear network model of the form

y(x; W) = W φ(x),

in which φ(x) denotes a vector of fixed basis functions, so that the adaptive parameters W enter the model linearly.
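To make the generalized linear model concrete, here is a small NumPy sketch (an illustration under our own conventions, not the paper's implementation): Gaussian basis functions evaluated on the latent grid give a design matrix Φ, the mixture centres follow as y(x_l; W) = W φ(x_l), and the E-step responsibilities p(x_l | t_n, W, β) come from Bayes' theorem applied to the mixture (3). The Gaussian basis form, the shared width σ, and the (M, D) storage convention for W are assumptions made for the example.

```python
import numpy as np

def rbf_design_matrix(X, centres, sigma):
    """Phi[l, j] = exp(-||x_l - mu_j||^2 / (2 sigma^2)) for latent grid
    nodes X of shape (K, L) and basis centres of shape (M, L).  The
    Gaussian form and shared width sigma are illustrative assumptions."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def responsibilities(T, Phi, W, beta):
    """E-step posteriors R[n, l] = p(x_l | t_n, W, beta) for the
    constrained mixture (3).  W is stored (M, D), so the centres
    y(x_l; W) = W phi(x_l) are the rows of Phi @ W."""
    Y = Phi @ W                                       # (K, D) centres
    d2 = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    log_r = -0.5 * beta * d2      # equal mixing weights 1/K cancel in Bayes' rule
    log_r -= log_r.max(axis=1, keepdims=True)         # stabilise the exponentials
    R = np.exp(log_r)
    return R / R.sum(axis=1, keepdims=True)
```

Since W enters linearly through Φ, the M-step that follows these responsibilities reduces to a weighted least-squares problem for W, which is what allows it to be solved exactly.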