Part of Advances in Neural Information Processing Systems 15 (NIPS 2002)
Yee Teh, Sam Roweis
We present an automatic alignment procedure which maps the disparate internal representations learned by several local dimensionality reduction experts into a single, coherent global coordinate system for the original data space. Our algorithm can be applied to any set of experts, each of which produces a low-dimensional local representation of a high-dimensional input. Unlike recent efforts to coordinate such models by modifying their objective functions [1, 2], our algorithm is invoked after training and applies an efficient eigensolver to post-process the trained models. The post-processing has no local optima and the size of the system it must solve scales with the number of local models rather than the number of original data points, making it more efficient than model-free algorithms such as Isomap [3] or LLE [4].

1 Introduction: Local vs. Global Dimensionality Reduction

Beyond density modelling, an important goal of unsupervised learning is to discover compact, informative representations of high-dimensional data. If the data lie on a smooth low-dimensional manifold, then an excellent encoding is the coordinates internal to that manifold. The process of determining such coordinates is dimensionality reduction. Linear dimensionality reduction methods such as principal component analysis and factor analysis are easy to train but cannot capture the structure of curved manifolds. Mixtures of these simple unsupervised models [5, 6, 7, 8] have been used to perform local dimensionality reduction, and can provide good density models for curved manifolds, but unfortunately such mixtures cannot do dimensionality reduction. They do not describe a single, coherent low-dimensional coordinate system for the data since there is no pressure for the local coordinates of each component to agree.

Roweis et al. [1] recently proposed a model which performs global coordination of local coordinate systems in a mixture of factor analyzers (MFA). Their model is trained by maximizing the likelihood of the data, with an additional variational penalty term to encourage the internal coordinates of the factor analyzers to agree. While their model can trade off modelling the data and having consistent local coordinate systems, it requires a user-specified trade-off parameter, training is quite inefficient (although [2] describes an improved training algorithm for a more constrained model), and it has quite serious local minima problems (methods like LLE [4] or Isomap [3] have to be used for initialization).

In this paper we describe a novel, automatic way to align the hidden representations used by each component of a mixture of dimensionality reducers into a single global representation of the data throughout space. Given an already trained mixture, the alignment is achieved by applying an eigensolver to a matrix constructed from the internal representations of the mixture components. Our method is efficient, simple to implement, and has no local optima in its optimization nor any learning rates or annealing schedules.

2 The Locally Linear Coordination Algorithm

Suppose we have a set of data points given by the rows of a matrix $X$ in a $D$-dimensional space, which we assume are sampled from a $d$-dimensional manifold. We approximate the manifold coordinates using images $y_n$ in a $d$-dimensional embedding space. Suppose also that we have already trained, or have been given, a mixture of $K$ local dimensionality reducers. The $k$th reducer produces a $d_k$-dimensional internal representation $z_{nk}$ for data point $x_n$, as well as a "responsibility" $r_{nk} \ge 0$ describing how reliable the $k$th reducer's representation of $x_n$ is. These satisfy $\sum_k r_{nk} = 1$ and can be obtained, for example, using a gating network in a mixture of experts, or the posterior probabilities in a probabilistic network. Notice that the manifold coordinates and internal representations need not have the same number of dimensions.
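As a concrete illustration of this setup, the sketch below shows one way the internal representations $z_{nk}$ and responsibilities $r_{nk}$ might be extracted from an already trained mixture of local linear models. The particular mixture (Gaussian components, each keeping a local linear subspace) and the helper name local_reps_and_responsibilities are illustrative assumptions, not part of the algorithm; the method only requires that each expert supply $z_{nk}$ and $r_{nk}$ with $\sum_k r_{nk} = 1$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def local_reps_and_responsibilities(X, means, bases, covs, weights):
    """Hypothetical helper: given a trained mixture of K local linear models,
    return
      Z: list of (N, d_k) arrays, the internal coordinates z_nk of each point
         under expert k (projection onto that expert's local subspace), and
      R: (N, K) responsibilities r_nk (posterior component probabilities,
         so each row sums to one).
    means[k]: (D,), bases[k]: (D, d_k) with orthonormal columns,
    covs[k]: (D, D) component covariance, weights[k]: mixing proportion."""
    N, _ = X.shape
    K = len(means)
    Z, log_post = [], np.empty((N, K))
    for k in range(K):
        Z.append((X - means[k]) @ bases[k])        # local coordinates z_nk
        log_post[:, k] = (np.log(weights[k])
                          + multivariate_normal(means[k], covs[k]).logpdf(X))
    log_post -= log_post.max(axis=1, keepdims=True)  # stabilize the softmax
    R = np.exp(log_post)
    R /= R.sum(axis=1, keepdims=True)                # enforce sum_k r_nk = 1
    return Z, R

# Tiny usage example: two made-up 3-D Gaussian "experts", each with a 2-D
# local subspace (so d_k = 2 while D = 3).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
means = [np.zeros(3), np.array([3.0, 0.0, 0.0])]
bases = [np.eye(3)[:, :2], np.eye(3)[:, 1:]]
covs = [np.eye(3), 0.5 * np.eye(3)]
weights = [0.5, 0.5]
Z, R = local_reps_and_responsibilities(X, means, bases, covs, weights)
assert np.allclose(R.sum(axis=1), 1.0)
```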
Given the data, internal representations, and responsibilities, our algorithm automatically aligns the various hidden representations into a single global coordinate system. Two key ideas motivate the method. First, to use a convex cost function whose unique minimum is attained at the desired global coordinates. Second, to restrict the global coordinates $y_n$ to depend on the data $x_n$ only through the local representations $z_{nk}$ and responsibilities $r_{nk}$, thereby leveraging the structure of the mixture model to regularize and reduce the effective size of the optimization problem. In effect, rather than working with individual data points, we work with large groups of points belonging to particular submodels.
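To make the second idea concrete, one simple way to impose such a restriction is to express each global coordinate as a responsibility-weighted blend of affine images of the local coordinates. The per-model maps $A_k$ and offsets $b_k$ below are illustrative symbols not defined in the text above; this is a sketch of the kind of parameterization the restriction allows, not a statement of the exact form used later:

$y_n = \sum_k r_{nk}\,(A_k z_{nk} + b_k)$

Under a parameterization of this kind the unknowns are the $K$ small maps $(A_k, b_k)$ rather than the $N$ individual points $y_n$, which is consistent with the claim above that the size of the system to be solved scales with the number of local models rather than the number of data points.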