{"title": "Learning Topology with the Generative Gaussian Graph and the EM Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 83, "page_last": 90, "abstract": null, "full_text": "Learning Topology with the Generative Gaussian Graph and the EM Algorithm\n Michael Aupetit CEA - DASE BP 12 - 91680 ` ^ Bruyeres-le-Chatel, France aupetit@dase.bruyeres.cea.fr\n\nAbstract\nGiven a set of points and a set of prototypes representing them, how to create a graph of the prototypes whose topology accounts for that of the points? This problem had not yet been explored in the framework of statistical learning theory. In this work, we propose a generative model based on the Delaunay graph of the prototypes and the ExpectationMaximization algorithm to learn the parameters. This work is a first step towards the construction of a topological model of a set of points grounded on statistics.\n\n1\n1.1\n\nIntroduction\nTopology what for?\n\nGiven a set of points in a high-dimensional euclidean space, we intend to extract the topology of the manifolds from which they are drawn. There are several reasons for this among which: increasing our knowledge about this set of points by measuring its topological features (connectedness, intrinsic dimension, Betti numbers (number of voids, holes, tunnels. . . )) in the context of exploratory data analysis [1], allowing to compare two sets of points wrt their topological characteristics or to find clusters as connected components in the context of pattern recognition [2], or finding shortest path along manifolds in the context of robotics [3]. 
There are two families of approaches which deal with \"topology\": on the one hand, the \"topology preserving\" approaches, based on nonlinear projection of the data into lower-dimensional spaces with a constrained topology to allow visualization [4, 5, 6, 7, 8]; on the other hand, the \"topology modelling\" approaches, based on the construction of a structure whose topology is not constrained a priori, so that it is expected to better account for that of the data [9, 10, 11], at the expense of visualisability. Much work has been done on the former problem, also called \"manifold learning\", from the Generative Topographic Mapping [4] to Multi-Dimensional Scaling and its variants [5, 6], Principal Curves [7] and so on. In all these approaches, the intrinsic dimension of the model is fixed a priori, which eases the visualization but arbitrarily forces the topology of the model. And when the dimension is not fixed, as in the mixture of Principal Component Analyzers [8], the connectedness is lost. The latter problem, which we deal with, had never been explored from the statistical learning perspective. Its aim is not to project and visualize a high-dimensional set of points, but to extract the topological information from it directly in the high-dimensional space, so that the model must be freed as much as possible from any a priori topological constraint.\n\n1.2 Learning topology: a state of the art\n\nAs we may learn a complicated function by combining simple basis functions, we shall learn a complicated manifold1 by combining simple basis manifolds. A simplicial complex2 is such a model, based on the combination of simplices, each with its own dimension (a 1-simplex is a line segment, a 2-simplex is a triangle... a k-simplex is the convex hull of a set of k + 1 points). In a simplicial complex, the simplices are exclusively connected by their vertices or their faces. 
Such a structure is appealing because it is possible to extract from it topological information like Betti numbers, connectedness and intrinsic dimension [10]. A particular simplicial complex is the Delaunay complex, defined as the set of simplices whose vertices have adjacent Voronoi cells3, assuming general position for the vertices. The Delaunay graph is made of the vertices and edges of the Delaunay complex [12]. All the previous work about topology modelling is grounded on the result of Edelsbrunner and Shah [13], which proves that, given a manifold M ⊂ R^D and a set of N0 vector prototypes w ∈ (R^D)^N0 nearby M, there exists a simplicial subcomplex of the Delaunay complex of w which has the same topology as M under what we call the \"ES-conditions\". In the present work, the manifold M is not known but through a finite set of M data points v ∈ M^M. Martinetz and Schulten proposed to build a graph of the prototypes with an algorithm called \"Competitive Hebbian Learning\" (CHL) [11] to tackle this problem. Their approach has been extended to simplicial complexes by De Silva and Carlsson with the definition of \"weak witnesses\" [10]. In both cases, the ES-conditions about M are weakened so they can be verified by a finite sample v of M, so that the graph or the simplicial complex built over w is proved to have the same topology as M if v is a sufficiently dense sampling of M. The CHL consists in connecting two prototypes in w if they are the first and the second closest neighbors to a point of v (closeness wrt the Euclidean norm). Each point of v leads to an edge, and is called a \"weak witness\" of the connected prototypes [10]. The topology representing graph obtained is a subgraph of the Delaunay graph. The region of R^D in which any data point would connect the same prototypes is the \"region of influence\" (ROI) of this edge (see Figure 2 d-f). 
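As an illustration, the CHL rule described above can be sketched in a few lines of numpy. This is a minimal sketch under our own naming; the optional hit-count threshold T anticipates the edge-filtering variant discussed below and is not part of the plain CHL of [11]:

```python
import numpy as np

def competitive_hebbian_learning(data, prototypes, T=0):
    # Connect the two prototypes closest to each datum (CHL).
    # Returns the set of edges (i, j), i < j, witnessed by more than T data points.
    # T = 0 gives the plain CHL graph.
    data = np.asarray(data, float)
    prototypes = np.asarray(prototypes, float)
    hits = {}
    for v in data:
        # squared Euclidean distances from this datum to all prototypes
        d2 = np.sum((prototypes - v) ** 2, axis=1)
        i, j = (int(k) for k in np.argsort(d2)[:2])  # two nearest prototypes
        edge = (min(i, j), max(i, j))
        hits[edge] = hits.get(edge, 0) + 1
    return {e for e, n in hits.items() if n > T}

# Toy example: three collinear prototypes; data in the gaps act as weak witnesses
protos = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
data = [[0.4, 0.1], [0.6, -0.1], [1.5, 0.05]]
print(sorted(competitive_hebbian_learning(data, protos)))  # [(0, 1), (1, 2)]
```

With T = 1, the edge (1, 2), witnessed by a single datum, is filtered out, which mimics the aging-based pruning discussed below.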
This principle is extended to create k-simplices connecting k + 1 prototypes, which are part of the Delaunay simplicial complex of w [10]. Therefore, the model obtained is based on regions of influence: a simplex exists in the model if there is at least one datum in its ROI. Hence, the capacity of this model to correctly represent the topology of a set of points strongly depends on the shape and location of the ROI wrt the points, and on the presence of noise in the data. Moreover, as soon as N0 > 2, there cannot exist an isolated prototype representing an isolated bump in the data distribution, because any datum of this bump will have two closest prototypes to connect to each other. An aging process has been proposed by Martinetz and Schulten to filter out the noise, which works roughly such that edges with fewer data than a threshold in their ROI are pruned from the graph. This looks like a filter based on the probability density of the data distribution, but no statistical criterion is proposed to tune the parameters. Moreover, the area of the ROI may be intractable in high dimension and is not trivially related to the corresponding line segment, so measuring the frequency over such a region is not relevant to define a useful probability density. Finally, the line segment associated to an edge of the graph is not part of the model: data are not projected on it, data drawn from such a line segment may not give rise to the corresponding edge, and the line segment may not intersect at all its associated ROI.\n1 For simplicity, we call \"manifold\" what can actually be a set of manifolds, connected or not to each other, with possibly various intrinsic dimensions. 2 The terms \"simplex\" or \"graph\" denote both the abstract object and its geometrical realization. 3 Given a set of points w in R^D, Vi = {v ∈ R^D | (v − wi)² ≤ (v − wj)², ∀j} defines the Voronoi cell associated to wi ∈ w. 
In other words, the model is not self-consistent, that is, the geometrical realization of the graph is not always a good model of its own topology, whatever the density of the sampling. We proposed to define Voronoi cells of line segments as ROIs for the edges, and defined a criterion to cut edges with a lower density of data projecting on their middle than on their borders [9]. This solves some of the limits of the CHL, but one important problem remains, common to both approaches: they rely on a visual control of their quality, i.e. no criterion allows to assess the quality of the model, especially in dimension greater than 3.\n\n1.3 Emerging topology from a statistical generative model\n\nFor all the above reasons, we propose another way of modelling topology. The idea is to construct a \"good\" statistical generative model of the data taking the noise into account, and to assume that its topology is therefore a \"good\" model of the topology of the manifold which generated the data. The only constraint we impose on this generative model is that its topology must be as \"flexible\" as possible and must be \"extractible\". \"Flexible\" to avoid as much as possible any a priori constraint on the topology, so as to allow the modelling of any one. \"Extractible\" to get a \"white box\" model from which the topological characteristics are tractable in terms of computation. So we propose to define a \"generative simplicial complex\". However, this work being preliminary, we expose here the simpler case of defining a \"generative graph\" (a simplicial complex made only of vertices and edges) and tuning its parameters. This allows us to demonstrate the feasibility of this approach and to foresee future difficulties when it is extended to simplicial complexes. It works as follows. Given a set of prototypes located over the data distribution using e.g. Vector Quantization [14], the Delaunay graph (DG) of the prototypes is constructed [15]. 
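In low dimension, the Delaunay graph of a set of prototypes can be obtained from any standard Delaunay triangulation by collecting the edges of its simplices. The sketch below uses scipy.spatial.Delaunay; the helper name is ours, and [15] describes a different construction suited to arbitrary dimension:

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_edges(prototypes):
    # Edge set of the Delaunay graph: every pair of vertices sharing a Delaunay simplex.
    tri = Delaunay(np.asarray(prototypes, float))
    edges = set()
    for simplex in tri.simplices:
        for a in range(len(simplex)):
            for b in range(a + 1, len(simplex)):
                i, j = int(simplex[a]), int(simplex[b])
                edges.add((min(i, j), max(i, j)))
    return edges

# Toy example: the four corners of a square plus its center
protos = [[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]]
print(sorted(delaunay_edges(protos)))
# 8 edges: the 4 sides plus the 4 spokes to the center; the square's diagonals are absent
```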
Then, each edge and each vertex of the graph is the basis of a generative model, so that the graph generates a mixture of gaussian density functions. The maximization of the likelihood of the data wrt the model, using Expectation-Maximization, allows to tune the weights of this mixture and leads to the emergence of the expected topology representing graph through the edges with non-negligible weights that remain after the optimization process. We first present the framework and the algorithm we use in section 2. Then we test it on artificial data in section 3, before the discussion and conclusion in section 4.\n\n2 A Generative Gaussian Graph to learn topology\n\n2.1 The Generative Gaussian Graph\n\nIn this work, M is the support of the probability density function (pdf) p from which the data v are drawn. In fact, it is not the topology of M which is of interest, but the topology of the manifolds Mprin, called \"principal manifolds\" of the distribution p (in reference to the definition of Tibshirani [7]), which can be viewed as the manifold M without the noise. We assume the data have been generated by some set of points and segments constituting the set of manifolds Mprin, which have been corrupted with additive spherical gaussian noise with mean 0 and unknown variance σ²_noise. Then, we define a gaussian mixture model to account for the observed data, which is based on both gaussian kernels that we call \"gaussian-points\", and what we call \"gaussian-segments\", forming a \"Generative Gaussian Graph\" (GGG).\n\nThe value at point vj ∈ v of a normalized gaussian-point centered on a prototype wi ∈ w with variance σ² is defined as:\n\ng0(vj, wi, σ) = (2πσ²)^(−D/2) exp(−(vj − wi)² / (2σ²))\n\nA normalized gaussian-segment is defined as the sum of an infinite number of gaussian-points evenly spread on a line segment. Thus, this is the integral of a gaussian-point along a line segment. 
The value at point vj of the gaussian-segment [wai, wbi] associated to the ith edge {ai, bi} in DG with variance σ² is:\n\ng1(vj, {wai, wbi}, σ) = (1 / Laibi) ∫ from wai to wbi of (2πσ²)^(−D/2) exp(−(vj − w)² / (2σ²)) dw\n = [exp(−(vj − qij)² / (2σ²)) / (2πσ²)^((D−1)/2)] · [erf(Qjaibi / (σ√2)) − erf((Qjaibi − Laibi) / (σ√2))] / (2 Laibi) (1)\n\nwhere Laibi = |wbi − wai|, Qjaibi = (vj − wai) · (wbi − wai) / Laibi, and qij = wai + Qjaibi (wbi − wai) / Laibi is the orthogonal projection of vj on the straight line passing through wai and wbi. In the case where wai = wbi, we set g1(vj, {wai, wbi}, σ) = g0(vj, wai, σ).\n\nThe left factor of the product accounts for the gaussian noise orthogonal to the line segment, and the right factor for the gaussian noise integrated along the line segment. The functions g0 and g1 are positive, and we can prove that ∫_{R^D} g0(v, wi, σ) dv = 1 and ∫_{R^D} g1(v, {wa, wb}, σ) dv = 1, so they are both probability density functions. A gaussian-point is associated to each prototype in w and a gaussian-segment to each edge in DG. The gaussian mixture is obtained by a weighted sum of the N0 gaussian-points and N1 gaussian-segments, such that the weights sum to 1 and are non-negative:\n\np(vj | π, w, σ, DG) = Σ from k=0 to 1, Σ from i=1 to Nk of πik gk(vj, sik, σ) (2)\n\nwith Σ from k=0 to 1, Σ from i=1 to Nk of πik = 1 and πik ≥ 0 for all i, k, where si0 = wi and si1 = {wai, wbi} such that {ai, bi} is the ith edge in DG. The weight πi0 (resp. πi1) is the probability that a datum v was drawn from the gaussian-point associated to wi (resp. the gaussian-segment associated to the ith edge of DG).\n\n2.2 Measure of quality\n\nThe function p(vj | π, w, σ, DG) is the probability density at vj given the parameters of the model. 
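Equation (1) lends itself to a numerical sanity check. Below is a minimal sketch of the two densities in Python (function and variable names are ours; math.erf comes from the standard library); summing the gaussian-segment density over a fine 2-D grid should return a value close to 1:

```python
import numpy as np
from math import erf, pi, sqrt, exp

def gaussian_point(v, w, sigma):
    # normalized gaussian-point g0 centered on prototype w, isotropic variance sigma^2
    v, w = np.asarray(v, float), np.asarray(w, float)
    D = v.size
    return (2 * pi * sigma ** 2) ** (-D / 2) * exp(-np.sum((v - w) ** 2) / (2 * sigma ** 2))

def gaussian_segment(v, wa, wb, sigma):
    # normalized gaussian-segment g1: a gaussian-point integrated along [wa, wb]
    v, wa, wb = (np.asarray(x, float) for x in (v, wa, wb))
    L = np.linalg.norm(wb - wa)
    if L == 0.0:
        return gaussian_point(v, wa, sigma)  # degenerate edge: fall back to g0
    Q = float(np.dot(v - wa, wb - wa)) / L   # coordinate of v along the segment axis
    q = wa + Q * (wb - wa) / L               # orthogonal projection of v on the supporting line
    D = v.size
    # gaussian factor orthogonal to the segment (D - 1 directions)
    ortho = (2 * pi * sigma ** 2) ** (-(D - 1) / 2) * exp(-np.sum((v - q) ** 2) / (2 * sigma ** 2))
    # gaussian mass integrated along the segment, averaged over its length
    axial = (erf(Q / (sigma * sqrt(2))) - erf((Q - L) / (sigma * sqrt(2)))) / (2 * L)
    return ortho * axial

# Riemann-sum check that g1 integrates to about 1 over the plane (D = 2)
xs = np.arange(-4.0, 5.0, 0.1)
total = sum(gaussian_segment([x, y], [0.0, 0.0], [1.0, 0.0], 0.5)
            for x in xs for y in xs) * 0.1 * 0.1
print(round(total, 3))  # close to 1.0
```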
We measure the likelihood P of the data v wrt the parameters of the GGG model:\n\nP = P(π, w, σ, DG) = Π from j=1 to M of p(vj | π, w, σ, DG) (3)\n\n2.3 The Expectation-Maximization algorithm\n\nIn order to maximize the likelihood P, or equivalently to minimize the negative log-likelihood L = −log(P), wrt π and σ, we use the Expectation-Maximization algorithm. We refer to [2] (pages 59-73) and [16] for further details. The minimization of the negative log-likelihood consists in tmax iterative steps updating π and σ, which ensure the decrease of L. The updating rules take into account the constraints of positivity and sum to unity of the parameters:\n\nπik[new] = (1/M) Σ from j=1 to M of P(k, i|vj)\n\nσ²[new] = (1/(DM)) [ Σ from i=1 to N0, Σ from j=1 to M of P(0, i|vj) (vj − wi)² + Σ from i=1 to N1, Σ from j=1 to M of P(1, i|vj) Eij ] (4)\n\nwhere\n\nEij = (2πσ²)^(−D/2) exp(−(vj − qij)² / (2σ²)) (I1 [(vj − qij)² + σ²] + I2) / (Laibi g1(vj, {wai, wbi}, σ))\n\nI1 = σ √(π/2) [erf(Qjaibi / (σ√2)) − erf((Qjaibi − Laibi) / (σ√2))]\n\nI2 = σ² [(Qjaibi − Laibi) exp(−(Qjaibi − Laibi)² / (2σ²)) − Qjaibi exp(−(Qjaibi)² / (2σ²))] (5)\n\nand P(k, i|vj) = πik gk(vj, sik, σ) / p(vj | π, w, σ, DG) is the posterior probability that the datum vj was generated by the component associated to (k, i).\n\n2.4 Emerging topology by maximizing the likelihood\n\nFinally, to get the topology representing graph from the generative model, the core idea is to prune from the initial DG the edges for which there is a negligible probability that they generated the data. The complete algorithm is the following:\n\n1. Initialize the location of the prototypes w using vector quantization [14].\n2. Construct the Delaunay graph DG of the prototypes.\n3. Initialize the weights π to 1/(N0 + N1) to give equal probability to each vertex and edge.\n4. Given w and DG, use the updating rules (4) to find σ² and π maximizing the likelihood P.\n5. Prune the edges {ai, bi} of DG associated to the gaussian-segments with probability πi1 ≤ ε. 
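Steps 3 to 5 above can be sketched as follows. This is a simplified, hedged illustration under our own naming and toy data: it applies only the weight update of rule (4) while holding sigma fixed, whereas the full algorithm also re-estimates sigma^2 at each step:

```python
import numpy as np
from math import erf, pi, sqrt, exp

def g0(v, w, sigma):
    # gaussian-point density at v
    v, w = np.asarray(v, float), np.asarray(w, float)
    D = v.size
    return (2 * pi * sigma ** 2) ** (-D / 2) * exp(-np.sum((v - w) ** 2) / (2 * sigma ** 2))

def g1(v, wa, wb, sigma):
    # gaussian-segment density at v: a gaussian-point integrated along [wa, wb]
    v, wa, wb = (np.asarray(x, float) for x in (v, wa, wb))
    L = np.linalg.norm(wb - wa)
    if L == 0.0:
        return g0(v, wa, sigma)
    Q = float(np.dot(v - wa, wb - wa)) / L
    q = wa + Q * (wb - wa) / L
    D = v.size
    ortho = (2 * pi * sigma ** 2) ** (-(D - 1) / 2) * exp(-np.sum((v - q) ** 2) / (2 * sigma ** 2))
    axial = (erf(Q / (sigma * sqrt(2))) - erf((Q - L) / (sigma * sqrt(2)))) / (2 * L)
    return ortho * axial

def ggg_weights(data, protos, edges, sigma, t_max=100, eps=0.001):
    # EM on the mixture weights (sigma held fixed); prune edges whose weight is <= eps
    protos = np.asarray(protos, float)
    dens = np.array([[g0(v, p, sigma) for p in protos]
                     + [g1(v, protos[a], protos[b], sigma) for a, b in edges]
                     for v in np.asarray(data, float)])   # shape (M, N0 + N1)
    w = np.full(dens.shape[1], 1.0 / dens.shape[1])       # step 3: equiprobability
    for _ in range(t_max):
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)           # E-step: posteriors P(k, i | vj)
        w = resp.mean(axis=0)                             # M-step: weight update of rule (4)
    kept = [e for e, p in zip(edges, w[len(protos):]) if p > eps]
    return w, kept

# Toy example: data dense along the segment between the first two prototypes,
# plus an isolated bump at the third prototype
protos = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
edges = [(0, 1), (1, 2)]
data = [[0.1, 0.0], [0.3, 0.02], [0.5, -0.02], [0.7, 0.01], [0.9, 0.0],
        [3.0, 0.05], [2.95, -0.03], [3.05, 0.0]]
w, kept = ggg_weights(data, protos, edges, sigma=0.2)
print(kept)  # -> [(0, 1)]: the spurious edge (1, 2) is pruned
```

The isolated bump keeps its weight on the gaussian-point at the third prototype, so the edge towards it vanishes, which the CHL cannot achieve without a hand-tuned threshold.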
The topology representing graph emerges from the edges with probabilities πi1 > ε. It is the graph which best models the topology of the data, in the sense of the maximum likelihood wrt π, σ, and the set of prototypes w and their Delaunay graph.\n\n3 Experiments\n\nIn these experiments, given a set of points and a set of prototypes located thanks to vector quantization [14], we want to verify the relevance of the GGG to learn the topology in various noise conditions. The principle of the GGG is shown in Figure 1. In Figure 2, we show the comparison of the GGG to a CHL for which we filter out the edges whose number of hits is lower than a threshold T. The data and the prototypes are the same for both algorithms. We set T such that the graph obtained matches visually as closely as possible the expected solution. We optimize π and σ using (4) for tmax = 100 steps, with ε = 0.001. Conditions and conclusions of the experiments are given in the captions.\n\nFigure 1: Principle of the Generative Gaussian Graph: (a) Data drawn from an oblique segment, an horizontal one and an isolated point, with respective densities {0.25; 0.5; 0.25}. The prototypes are located at the extreme points of the segments and at the isolated point. They are connected with edges from the Delaunay graph. (b) The corresponding initial Generative Gaussian Graph. (c) The optimal GGG obtained after optimization of the likelihood according to π and σ. (d) The edges of the optimal GGG associated to non-negligible probabilities model the topology of the data.\n\n4 Discussion\n\nWe propose that the problem of learning the topology of a set of points can be posed as a statistical learning problem: we assume that the topology of a statistical generative model of a set of points is an estimator of the topology of the principal manifold of this set. From this assumption, we define a topologically flexible statistical generative mixture model, that we call the Generative Gaussian Graph, from which we can extract the topology. The final topology representing graph emerges from the edges with non-negligible probability. We propose to use the Delaunay graph as an initial graph, assuming it is rich enough to contain as a subgraph a good topological model of the data. The use of the likelihood criterion makes cross-validation possible to select the best generative model, hence the best topological model in terms of generalization capacities. The GGG allows to avoid the limits of the CHL for modelling topology. In particular, it allows to take the noise into account and to model isolated bumps. Moreover, the likelihood of the data wrt the GGG is maximized during the learning, allowing to measure the quality of the model even when no visualization is possible. For some particular data distributions where all the data lie on the Delaunay line segments, no maximum of the likelihood exists. This case is not a problem, because σ = 0 effectively defines a good solution (no noise in a data set drawn from a graph). If only some of the data lie exactly on the line segments, a maximum of the likelihood still exists, because σ² defines the variance for all the generative gaussian-points and gaussian-segments at the same time, so it cannot vanish to 0. 
The computing time complexity of the GGG is O(D(N0 + N1)M tmax), plus the O(DN0³) time [15] needed to build the Delaunay graph, which dominates the overall worst-case time complexity. The Competitive Hebbian Learning runs in time O(DN0 M). As the CHL in general builds more edges than needed to model the topology, it would be interesting to use the Delaunay subgraph obtained with the CHL as a starting point for the GGG model. The Generative Gaussian Graph can be viewed as a generalization of gaussian mixtures to points and segments: a gaussian mixture is a GGG with no edge. The GGG provides at the same time an estimation of the data distribution density which is more accurate than that of the gaussian mixture based on the same set of prototypes and the same noise isovariance hypothesis (because it adds gaussian-segments to the pool of gaussian-points), and intrinsically an explicit model of the topology of the data set which provides most of the topological information at once. In contrast, other generative models do not provide any insight about the topology of the data, except the Generative Topographic Map (GTM) [4], the revisited Principal Manifolds [7] or the mixture of Probabilistic Principal Component Analysers (PPCA) [8]. 
However, in the two former cases, the intrinsic dimension of the model is fixed a priori and not learned from the data, while in the latter the local intrinsic dimension is learned but the connectedness between the local models is not.\n\n[Figure 2 panels, by column: σ_noise = 0.05, 0.15, 0.2. Rows: (a-c) GGG with learned σ = 0.06, 0.17, 0.21; (d-f) CHL with T = 0; (g-i) CHL with T = 60, 65, 58.]\n\nFigure 2: Learning the topology of a data set: 600 data drawn from a spiral and an isolated point, corrupted with additive gaussian noise with mean 0 and variance σ²_noise. Prototypes are located by vector quantization [14]. (a-c) The edges of the GGG with weights greater than ε allow to recover the topology of the principal manifolds, except for a large noise variance (c), where a triangle was created at the center of the spiral; σ over-estimates σ_noise because the model is piecewise linear while the true manifolds are non-linear. (d-f) The CHL without threshold (T = 0) is not able to recover the true topology of the data, even for small noise. In particular, the isolated bump cannot be recovered. The grey cells correspond to the ROI of the edges (darker cells contain more data). It shows these cells are not intuitively related to the edges they are associated to (e.g. they may have very tiny areas (e), and may partly (d) or never (f) contain the corresponding line segment). (g-i) The CHL with a threshold T allows to recover the topology of the data only for a small noise variance (g) (notice that T1 < T2 implies DGCHL(T2) ⊆ DGCHL(T1)). Moreover, setting T requires visual control and is not associated to the optimum of any energy function, which prevents its use in higher-dimensional spaces.\n\nOne obvious way to extend this work is to consider a simplicial complex in place of the graph, so as to make the full topological information extractible. Some other interesting questions arise about the curse of dimensionality, the selection of the number of prototypes and of the threshold ε, the theoretical grounding of the connection between the likelihood and some topological measure of accuracy, the possibility to devise a \"universal topology estimator\", and the way to deal with data sets with multi-scale structures or background noise... This preliminary work is an attempt to bridge the gap between Statistical Learning Theory [17] and Computational Topology [18][19]. We wish it to cross-fertilize and to open new perspectives in both fields.\n\nReferences\n[1] M. Aupetit and T. Catz. High-dimensional labeled data analysis with topology representing graphs. Neurocomputing, Elsevier, 63:139-169, 2005.\n[2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford Univ. Press, New York, 1995.\n[3] M. Zeller, R. Sharma, and K. Schulten. Topology representing network for sensor-based robot motion planning. World Congress on Neural Networks, INNS Press, pages 100-103, 1996.\n[4] C. M. Bishop, M. Svensen, and C. K. I. Williams. GTM: the generative topographic mapping. Neural Computation, MIT Press, 10(1):215-234, 1998.\n[5] V. de Silva and J. B. Tenenbaum. Global versus local methods for nonlinear dimensionality reduction. In S. Becker, S. Thrun, K. Obermayer (Eds) Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, 15:705-712, 2003.\n[6] J. A. Lee, A. Lendasse, and M. Verleysen. Curvilinear distance analysis versus isomap. Europ. Symp. on Art. 
Neural Networks, Bruges (Belgium), d-side eds., pages 185-192, 2002.\n[7] R. Tibshirani. Principal curves revisited. Statistics and Computing, 2:183-190, 1992.\n[8] M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11(2):443-482, 1999.\n[9] M. Aupetit. Robust topology representing networks. European Symp. on Artificial Neural Networks, Bruges (Belgium), d-side eds., pages 45-50, 2003.\n[10] V. de Silva and G. Carlsson. Topological estimation using witness complexes. In M. Alexa and S. Rusinkiewicz (Eds) Eurographics Symposium on Point-Based Graphics, ETH, Zurich, Switzerland, June 2-4, 2004.\n[11] T. M. Martinetz and K. J. Schulten. Topology representing networks. Neural Networks, Elsevier London, 7:507-522, 1994.\n[12] A. Okabe, B. Boots, and K. Sugihara. Spatial tessellations: concepts and applications of Voronoi diagrams. John Wiley, Chichester, 1992.\n[13] H. Edelsbrunner and N. R. Shah. Triangulating topological spaces. International Journal on Computational Geometry and Applications, 7:365-378, 1997.\n[14] T. M. Martinetz, S. G. Berkovitch, and K. J. Schulten. \"Neural-gas\" network for vector quantization and its application to time-series prediction. IEEE Trans. on NN, 4(4):558-569, 1993.\n[15] E. Agrell. A method for examining vector quantizer structures. Proceedings of IEEE International Symposium on Information Theory, San Antonio, TX, page 394, 1993.\n[16] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.\n[17] V. N. Vapnik. Statistical Learning Theory. John Wiley, 1998.\n[18] T. Dey, H. Edelsbrunner, and S. Guha. Computational topology. In B. Chazelle, J. Goodman and R. Pollack, editors, Advances in Discrete and Computational Geometry. American Math. Society, Princeton, NJ, 1999.\n[19] V. Robins, J. Abernethy, N. Rooney, and E. Bradley. Topology and intelligent data analysis. 
IDA-03 (International Symposium on Intelligent Data Analysis), Berlin, 2003.\n", "award": [], "sourceid": 2922, "authors": [{"given_name": "Micha\u00ebl", "family_name": "Aupetit", "institution": null}]}