{"title": "Global Coordination of Local Linear Models", "book": "Advances in Neural Information Processing Systems", "page_first": 889, "page_last": 896, "abstract": null, "full_text": "Global Coordination of Local Linear Models

Sam Roweis, Lawrence K. Saul, and Geoffrey E. Hinton

Department of Computer Science, University of Toronto

Department of Computer and Information Science, University of Pennsylvania

Abstract

High dimensional data that lies on or near a low dimensional manifold can be described by a collection of local linear models. Such a description, however, does not provide a global parameterization of the manifold, arguably an important goal of unsupervised learning. In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. Our local linear models are represented by a mixture of factor analyzers, and the “global coordination” of these models is achieved by adding a regularizing term to the standard maximum likelihood objective function. The regularizer breaks a degeneracy in the mixture model's parameter space, favoring models whose internal coordinate systems are aligned in a consistent way. As a result, the internal coordinates change smoothly and continuously as one traverses a connected path on the manifold, even when the path crosses the domains of many different local models. The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones.

1 Manifold Learning

Consider an ensemble of images, each of which contains a face against a neutral background. Each image can be represented by a point in the high dimensional vector space of pixel intensities.
This representation, however, does not exploit the strong correlations between pixels of the same image, nor does it support many useful operations for reasoning about faces. If, for example, we select two images with faces in widely different locations and then average their pixel intensities, we do not obtain an image of a face at their average location. Images of faces lie on or near a low-dimensional, curved manifold, and we can represent them more usefully by the coordinates on this manifold than by pixel intensities. Using these “intrinsic coordinates”, the average of two faces is another face with the average of their locations, poses and expressions.

To analyze and manipulate faces, it is helpful to imagine a “magic black box” with levers or dials corresponding to the intrinsic coordinates on this manifold. Given a setting of the levers and dials, the box generates an image of a face. Given an image of a face, the box deduces the appropriate setting of the levers and dials. In this paper, we describe a fairly general way to construct such a box automatically from an ensemble of high-dimensional vectors. We assume only that there exists an underlying manifold of low dimensionality and that the relationship between the raw data and the manifold coordinates is locally linear and smoothly varying. Thus our method applies not only to images of faces, but also to many other forms of highly distributed perceptual and scientific data (e.g., spectrograms of speech, robotic sensors, gene expression arrays, document collections).

2 Local Linear Models

The global structure of perceptual manifolds (such as images of faces) tends to be highly nonlinear. Fortunately, despite their complicated global structure, we can usually characterize these manifolds as locally linear.
Thus, to a good approximation, they can be represented by collections of simpler models, each of which describes a locally linear neighborhood [3, 6, 8]. For unsupervised learning tasks, a probabilistic model that nicely captures this intuition is a mixture of factor analyzers (MFA) [5]. The model is used to describe high dimensional data that lies on or near a lower dimensional manifold. MFAs parameterize a joint distribution over observed and hidden variables:

P(x, s, z_s) = P(x | s, z_s) P(z_s | s) P(s),   (1)

where the observed variable, x ∈ R^D, represents the high dimensional data; the discrete hidden variable, s ∈ {1, 2, ..., S}, indexes different neighborhoods on the manifold; and the continuous hidden variables, z_s ∈ R^d, represent low dimensional local coordinates. The model assumes that data is sampled from different neighborhoods on the manifold with prior probabilities P(s) = p_s, and that within each neighborhood, the data's local coordinates are normally distributed¹ as:

P(z_s | s) = (2π)^{-d/2} exp(-½ z_s^T z_s).   (2)

Finally, the model assumes that the data's high and low dimensional coordinates are related by linear processes parameterized by centers μ_s, loading matrices Λ_s, and noise levels Ψ_s:

P(x | s, z_s) = |2π Ψ_s|^{-1/2} exp{-½ (x − μ_s − Λ_s z_s)^T Ψ_s^{-1} (x − μ_s − Λ_s z_s)}.   (3)

The marginal data distribution, P(x), is obtained by summing/integrating out the model's discrete and continuous latent variables. The result is a mixture of Gaussian distributions with parameterized covariance matrices of the form:

P(x) = Σ_s p_s |2π (Λ_s Λ_s^T + Ψ_s)|^{-1/2} exp{-½ (x − μ_s)^T (Λ_s Λ_s^T + Ψ_s)^{-1} (x − μ_s)}.   (4)

The learning problem for MFAs is to estimate the centers μ_s, transformations Λ_s, and noise levels Ψ_s of these linear processes, as well as the prior probabilities p_s of sampling data from different parts of the manifold. Parameter estimation in MFAs can be handled by an Expectation-Maximization (EM) algorithm [5] that attempts to maximize the log-probability, log P(x), averaged over training examples.

Note that the parameter space of this model exhibits an invariance: taking Λ_s → Λ_s R_s, where the R_s are d × d orthogonal matrices (R_s R_s^T = I), does not change the marginal distribution, P(x). The transformations Λ_s → Λ_s R_s correspond to arbitrary rotations and reflections of the local coordinates in each linear model. The objective function for the EM algorithm is unchanged by these transformations. Thus, maximum likelihood estimation in MFAs does not favor any particular alignment; instead, it produces models whose internal representations change unpredictably as one traverses connected paths on the manifold. Can we encourage models whose local coordinate systems are aligned in a consistent way?

3 Global Coordination

Suppose the data lie near a smooth manifold with a locally flat (developable) structure. Then there exist a single set of “global coordinates” g which parametrize the manifold everywhere. Furthermore, to a good approximation, these global coordinates can be related to the local coordinates of different neighborhoods (in their region of validity) by linear² transformations:

g(s, z_s) = A_s z_s + κ_s.   (5)

Figure 1 (graphical model with hidden variables s, z_s; global coordinates g; and data x): Graphical model for globally coordinated MFAs. Although global coordinates g are unobserved, they affect the learning through a regularization term. After learning, inferences about the global variables are made by computing posterior distributions, P(g | x). Likewise, data can easily be generated by sampling from the conditional distribution, P(x | g). All these operations are particularly tractable due to the conditional independencies of the model.

¹ Although in principle each neighborhood could have a different prior on its local coordinates, without loss of generality we have made the standard assumption that P(z_s | s) is the same for all settings of s and absorbed the shape of each local Gaussian model into the matrices Λ_s.
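The marginal density in eq. (4) can be evaluated directly from the mixture parameters. The following sketch is not the authors' code; it is a minimal NumPy illustration of eq. (4), with the function and parameter names (priors, means, loadings, noise_vars) chosen here for exposition:

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian N(mean, cov) at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(2 * np.pi * cov)
    return -0.5 * (logdet + diff @ np.linalg.solve(cov, diff))

def mfa_marginal_loglik(x, priors, means, loadings, noise_vars):
    """log P(x) for a mixture of factor analyzers, eq. (4): component s is
    Gaussian with mean mu_s and covariance Lam_s Lam_s^T + Psi_s (Psi_s diagonal)."""
    logps = []
    for p_s, mu, Lam, psi in zip(priors, means, loadings, noise_vars):
        cov = Lam @ Lam.T + np.diag(psi)
        logps.append(np.log(p_s) + gaussian_logpdf(x, mu, cov))
    logps = np.array(logps)
    m = logps.max()
    return m + np.log(np.exp(logps - m).sum())  # log-sum-exp over components
```

With a single component, zero loadings, and unit noise this reduces to a standard diagonal Gaussian, which provides a quick sanity check of the covariance construction.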
What does it mean to say that the coordinates g(s, z_s) provide a global parameterization of the manifold? Intuitively, if a data point belongs to overlapping neighborhoods, then the global coordinates computed from their local coordinate systems, given by eq. (5), should agree. We can formalize this “global coordination” of different local models by treating the coordinates g as unobserved variables and incorporating them into the probabilistic model:

P(g | s, z_s) = δ(g − A_s z_s − κ_s).   (6)

(Here we posit a deterministic relationship between local and global coordinates, although it is possible to add noise to this mapping as well.) The globally coordinated MFA is represented by the graphical model in Fig. 1. We can appeal to its conditional independencies to make other useful inferences. In particular:

P(g | s, x) = ∫ dz_s P(g | s, z_s) P(z_s | s, x),   (7)

P(g | x) = Σ_s P(s | x) P(g | s, x).   (8)

Now, if two or more mixture components, say s₁ and s₂, explain a data point x_n with non-negligible probability, then the posterior distributions for the global coordinates of this data point, as induced by eq. (8), should be nearly identical: that is, P(g | s₁, x_n) ≈ P(g | s₂, x_n). To enforce this criterion of agreement, we need to penalize models whose posterior distributions P(g | x_n) given by eq. (8) are multimodal, since multiple modes only arise when different mixture components give rise to inconsistent global coordinates. While directly penalizing multimodality of P(g | x_n) is difficult, a penalty which encourages consistency can be easily incorporated into the learning algorithm. We introduce a family of unimodal distributions over both g and s, and encourage the true posteriors, P(g, s | x_n), to be close to some member, Q(g, s | x_n), of this family.

Developing this idea further, we introduce a new objective function for unsupervised learning in MFAs. The new objective function incorporates a regularizer to encourage the global consistency of local models:

Φ = Σ_n { log P(x_n) − λ Σ_s ∫ dg Q(g, s | x_n) log [Q(g, s | x_n) / P(g, s | x_n)] }.   (9)

The first term in this objective function computes the log-probability of the data. The second term computes a sum of Kullback-Leibler (KL) divergences; these are designed to penalize MFAs whose posterior distributions over global coordinates are not unimodal. The twin goals of density estimation and manifold learning in MFAs are pursued by attempting to balance these terms in the objective function. The factor λ controls the tradeoff between density modeling and global coordination: as λ → ∞ only strict invariances (which do not affect likelihood) are exploited in order to achieve submodel agreement. In what follows we have set λ = 1 arbitrarily; further optimization is possible.

The most convenient way to parameterize the family of unimodal distributions is a factorized form involving a Gaussian density and a multinomial:

Q(g, s | x_n) = Q(g | x_n) Q(s | x_n) = N(g; g_n, Σ_n) q_{ns}.   (10)

Note that the distribution Q(g, s | x_n) in eq. (10) factorizes over g and s, implying that, according to this family of models, the global coordinate g is independent of the mixture component s given the data point x_n. Also, Q(g | x_n) is Gaussian, and thus unimodal. These are exactly the constraints we wish to impose on the posterior P(g, s | x_n). At each iteration of learning, the means g_n, covariance matrices Σ_n, and mixture weights q_{ns} are determined separately for each data point, x_n, so as to maximize the objective function in eq. (9): this amounts to computing the unimodal distributions, Q(g, s | x_n), best matched to the true posterior distributions, P(g, s | x_n).

² Without loss of generality, the matrices A_s can be taken to be symmetric and positive-definite, by exploiting the polar factorization and absorbing reflection and rotation into the local coordinate systems. (In practice, though, it may be easier to optimize the objective function without constraining the matrices to be of this form.) In the experiments reported below, we have further restricted them to be diagonal. Together, then, the coordination matrices A_s and vectors κ_s account for an axis-aligned scaling and uniform translation between the global and local coordinate systems.

4 Learning Algorithm

Latent variable models are traditionally estimated by maximum likelihood or Bayesian methods whose objective functions do not reward the interpretability of their internal representations. Note how the goal of developing more useful internal representations has changed the learning problem in a fundamental way. Now we have additional “coordination” parameters (the offsets κ_s and weights A_s) that must also be learned from examples. We also have auxiliary parameters for each data point (the means g_n, covariance matrices Σ_n, and mixture weights q_{ns}) that determine the target distributions, Q(g, s | x_n). All these parameters, as well as the MFA model parameters {p_s, μ_s, Λ_s, Ψ_s}, must be chosen to “stitch together” the local coordinate systems in a smooth way and to learn internal representations easily coordinated by the local-to-global mapping in eq. (6).

Optimization of the objective function in eq. (9) is reminiscent of so-called “variational” methods for approximate learning [7]. In these methods, an approximation to an exact (but intractable) posterior distribution is fitted by minimizing a KL divergence between the two distributions. The auxiliary parameters of the approximating distribution are known as variational parameters.
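The inferences in eqs. (7) and (8) can be made concrete as follows: within each analyzer the posterior over local coordinates z_s given x is Gaussian (the standard factor analysis posterior), and the deterministic map of eq. (5) pushes it forward to a Gaussian over g; mixing these with the responsibilities P(s | x) yields the mixture-of-Gaussians posterior P(g | x). The sketch below is an illustrative NumPy rendering of this computation, not the authors' code; all names are chosen here for exposition:

```python
import numpy as np

def global_posterior(x, priors, means, loadings, noise_vars, A_mats, offsets):
    """Posterior P(g | x) as a mixture of Gaussians, following eqs. (7)-(8).

    For component s, E[z | x, s] = K_s (x - mu_s) with
    K_s = Lam_s^T (Lam_s Lam_s^T + Psi_s)^{-1}, and g = A_s z + kappa_s
    maps this Gaussian into the global coordinate space.
    Returns (responsibilities P(s|x), per-component means, covariances)."""
    log_w, g_means, g_covs = [], [], []
    for p_s, mu, Lam, psi, A_s, k_s in zip(priors, means, loadings,
                                           noise_vars, A_mats, offsets):
        C = Lam @ Lam.T + np.diag(psi)          # covariance of x under component s
        diff = x - mu
        _, logdet = np.linalg.slogdet(2 * np.pi * C)
        log_w.append(np.log(p_s) - 0.5 * (logdet + diff @ np.linalg.solve(C, diff)))
        K = Lam.T @ np.linalg.inv(C)
        V_z = np.eye(Lam.shape[1]) - K @ Lam    # Cov[z | x, s]
        g_means.append(A_s @ (K @ diff) + k_s)  # mean of P(g | s, x)
        g_covs.append(A_s @ V_z @ A_s.T)        # covariance of P(g | s, x)
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())
    return w / w.sum(), g_means, g_covs
```

A multimodal P(g | x) shows up here as well-separated component means with large responsibilities on more than one component; the regularizer in eq. (9) penalizes exactly this situation.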
Our objective function illustrates an unexpected application of such variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones. We introduce the unimodal and factorized distributions Q(g, s | x_n) to regularize the multimodal distributions P(g, s | x_n). Penalizing the KL divergence between these distributions lifts a degeneracy in the model's parameter space and favors local linear models that can be globally aligned.

4.1 Computing and optimizing the objective function

Evaluating the objective function in eq. (9) requires a sum and integral over the latent variables of the model. These operations are simplified by rewriting the objective function as:

Φ = Σ_n Σ_s ∫ dg Q(g, s | x_n) [ log P(x_n, g, s) − log Q(g, s | x_n) ].   (11)

The factored form of the distributions Q(g, s | x_n) makes it straightforward to perform the required sums and integrals. The final result is a simple form in terms of entropies S_ns and energies E_ns associated with the nth data point:

Φ = Σ_n Σ_s q_{ns} (S_ns + E_ns),   (12)

S_ns = ½ log |Σ_n| − log q_{ns} + (d/2) log(2πe),   (13)

E_ns = log p_s − log |A_s| − ½ log |2πΨ_s| − (d/2) log 2π − ½ (x_ns − Λ_s A_s^{-1} g_ns)^T Ψ_s^{-1} (x_ns − Λ_s A_s^{-1} g_ns) − ½ g_ns^T A_s^{-T} A_s^{-1} g_ns − ½ Tr[Γ_s Σ_n],   (14)

where we have introduced simplifying notation for the vector differences x_ns = x_n − μ_s and g_ns = g_n − κ_s, and for the local precision matrices Γ_s = A_s^{-T} (I + Λ_s^T Ψ_s^{-1} Λ_s) A_s^{-1}. Iteratively maximizing the objective function by coordinate ascent now leads to a learning algorithm of the same general style as EM.

4.2 E-step

Maximizing the objective function, eq. (9), with respect to the regularizing parameters {g_n, Σ_n, q_{ns}} (and subject to the constraint Σ_s q_{ns} = 1) leads to the fixed point equations:

q_{ns} = e^{E_ns} / Σ_{s'} e^{E_{ns'}},   (15)

Σ_n^{-1} = Σ_s q_{ns} Γ_s,   g_n = Σ_n Σ_s q_{ns} [ (Λ_s A_s^{-1})^T Ψ_s^{-1} x_ns + Γ_s κ_s ].   (16)

These equations can be solved by iteration with the initialization q_{ns} = p_s. Notice that the precision matrices Γ_s only need to be computed once before iterating the fixed point equations. The objective function is completely invariant to translation and rescaling of g_n and κ_s (since these parameters appear only through the combination A_s^{-1}(g_n − κ_s)). To remove this degeneracy, after solving the equations above we further constrain the global coordinates to have mean zero and unit variance in each direction. These constraints are enforced without changing the value of the objective function by simply translating the offsets κ_s and rescaling the diagonal matrices A_s.

4.3 M-step

The M-step consists of maximizing the objective function, eq. (9), with respect to the generative model parameters. Let us denote the updated parameter estimates by {p̂_s, μ̂_s, Λ̂_s, Ψ̂_s}. Letting q_s = Σ_n q_{ns}, the update for the mixture weights is:

p̂_s = q_s / Σ_{s'} q_{s'}.   (17)

The remaining updates, to be performed in the order shown, are given in terms of the updated difference vectors x_ns = x_n − μ̂_s and g_ns = g_n − κ̂_s, the responsibility-weighted correlations between these differences, and the variances Σ_n of the global coordinates: the offsets κ̂_s and centers μ̂_s are responsibility-weighted means (corrected for the current linear maps), while the loadings Λ̂_s and noise levels Ψ̂_s follow from a weighted least-squares regression of x_ns on g_ns. At the optimum, the coordination weights A_s satisfy an algebraic Riccati equation which can be solved by iterating their update. (Such equations can also be solved by much more sophisticated methods well known in the engineering community. Most approaches involve inverting the previous value of A_s, which may be expensive for full matrices but is fast in our diagonal implementation.)

Figure 2: Global coordination of local linear models. (left) A model trained using maximum likelihood, with the arrows indicating the direction of increase for each factor analyzer's local coordinate system.
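The degeneracy-removing step at the end of the E-step (zero mean and unit variance for the global coordinates, compensated by the offsets and coordination matrices) can be applied as a simple post-processing pass. A minimal sketch, assuming diagonal coordination matrices A_s as in the experiments; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def normalize_global_coords(G, kappa, A_diag):
    """Constrain global coordinates to zero mean and unit variance per direction.

    The offsets kappa_s are translated and the diagonal matrices A_s rescaled
    by the same affine map, so the combination A_s^{-1}(g - kappa_s), and hence
    the objective function, is unchanged.
    G: (N, d) means g_n; kappa: (S, d) offsets; A_diag: (S, d) diagonals of A_s."""
    m = G.mean(axis=0)
    v = G.std(axis=0)
    G_new = (G - m) / v
    kappa_new = (kappa - m) / v   # offsets follow the same translation/rescaling
    A_new = A_diag / v            # keeps (g - kappa_s) / A_s fixed componentwise
    return G_new, kappa_new, A_new
```

Since (g'_i − κ'_i)/A'_i = (g_i − κ_i)/A_i componentwise, every local coordinate, and therefore every term of the objective, is preserved by the transformation.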
(right) A coordinated model; arrows indicate the direction in the data space corresponding to increasing the global coordinate as inferred by the algorithm. The ellipses show the one standard deviation contour of the density of each analyzer.

5 Experiments

We have tested our model on simple synthetic manifolds whose structure is known as well as on collections of images of handwritten digits and faces. Figure 2 illustrates the basic concept of coordination, as achieved by our learning rule. In the coordinated model, the global coordinate always points in the same direction along the data manifold, as defined by the composition of the transformations Λ_s and A_s. In the model trained with maximum likelihood, the density is well captured but each local latent variable has a random orientation along the manifold.

We also applied the algorithm to collections of images of handwritten digits and of faces. The representation of x was an unprocessed vector of raw 8-bit grayscale pixel intensities for each image (of dimensionality 256 for the 16x16 digits and 560 for the faces). The MFAs had 64 local models and the global coordinates were two dimensional. After training, the coordinated MFAs had learned a smooth, continuous mapping from the plane to images of digits or of faces. This allows us both to infer a two-dimensional location given any image by computing P(g | x) and to generate new images from any point in the plane by computing P(x | g). (Precisely what we wanted from the magic box.) In general, both of these conditional distributions have the form of a mixture of Gaussians. Figure 3 shows the inferred global coordinates g_n (i.e. the means of the unimodal distributions Q(g | x_n)) of the training points after the last iteration of training as well as examples of new images from the generative model, created by evaluating the mean of P(x | g) along straight line paths in the global coordinate space. In the case of digits, it seems as though our models have captured tilt/shape and identity and represented them as the two axes of the g space; in the case of the faces the axes seem to capture pose and expression. (For the faces, the final space was rotated by hand to align interpretable directions with the coordinate axes.)

As with all EM algorithms, the coordinated MFA learning procedure is susceptible to local optima. Crucial to the success of our experiments is a good initialization, which was provided by the Locally Linear Embedding algorithm [9]. We clamped g_n equal to the embedding coordinate provided by LLE and Σ_n to a small value and trained until convergence (typically 30-100 iterations). Then we proceeded with training using the full EM equations to update g_n, again until convergence (usually 5-10 more iterations). Note, however, that LLE and other embedding algorithms such as Isomap [10] are themselves unsupervised, so the overall procedure, including this initial phase, is still unsupervised.

6 Discussion

Mixture models provide a simple way to approximate the density of high dimensional data that lies on or near a low dimensional manifold. However, their hidden representations do not make explicit the relationship between dissimilar data vectors. In this paper, we have shown how to learn global coordinates that can act as an encapsulating interface, so that other parts of a learning system do not need to interact with the individual components of a mixture.
This should improve generalization as well as facilitate the propagation and exchange of information when these models are incorporated into a larger (perhaps hierarchical) architecture for probabilistic reasoning.

Figure 3: Automatically constructed two dimensional global parameterizations of manifolds of digits and faces. Each plot shows the global coordinate space discovered by the unsupervised algorithm; points indicate the inferred means g_n for each training item at the end of learning. The image stacks on the borders are not from the training set but are generated from the model itself and represent the mean of the predictive distribution P(x | g) at the corresponding open circles (sampled along the straight lines in the global space). The models provide both a two degree-of-freedom generator for complex images via P(x | g) as well as a pose/slant recognition system via P(g | x). For the handwritten digits, the training set consisted of 1100 examples of the digit “2” (shown as crosses) mixed with 1100 examples of “3”s (shown as triangles). The digits are from the NIST dataset, digitized at 16x16 pixels. For the faces, we used 2000 images of a single person with various poses and expressions taken from consecutive frames of a video digitized at 20x20 pixels. Brendan Frey kindly provided the face data.

Two variants of our purely unsupervised proposal are possible. The first is to use an embedding algorithm (such as LLE or Isomap) not only as an initialization step but to provide clamped values for the global coordinates.
While this supervised approach may work in practice, unsupervised coordination makes clear the objective function that is being optimized, which unifies the goals of manifold learning and density estimation. Another variant is to train an unsupervised mixture model (such as an MFA) using a traditional maximum likelihood objective function and then to “post-coordinate” its parameters by applying local reflections/rotations and translations to create global coordinates. As illustrated in figure 4, however, this two-step procedure can go awry because of noise in the original training set. When both density estimation and coordination are optimized simultaneously there is extra pressure for local experts to fit the global structure of the manifold.

Figure 4: A situation in which an un-coordinated mixture model (trained to do density estimation) cannot be “post-coordinated”. Noise has caused one of the local density models to orient orthogonal to the manifold. In globally coordinated learning, there is an additional pressure to align with neighbouring models which would force the local model to lie in the correct subspace.

Our work can be viewed as a synthesis of two long lines of research in unsupervised learning. In the first are efforts at learning the global structure of nonlinear manifolds [1, 4, 9, 10]; in the second are efforts at developing probabilistic graphical models for reasoning under uncertainty [5, 6, 7]. Our work proposes to model the global coordinates on manifolds as latent variables, thus attempting to combine the representational advantages of both frameworks. It differs from embedding by providing a fully probabilistic model valid away from the training set, and from work in generative topographic mapping [2] by not requiring a uniform discretized gridding of the latent space.
Moreover, by extending the usefulness of mixture models, it further develops an architecture that has already proved quite powerful and enormously popular in applications of statistical learning.

Acknowledgements

We thank Mike Revow for sharing his unpublished work (at the University of Toronto) on coordinating mixtures, and Zoubin Ghahramani, Peter Dayan, Jakob Verbeek and two anonymous reviewers for helpful comments and corrections.

References

[1] D. Beymer & T. Poggio. Image representations for visual learning. Science 272 (1996).

[2] C. Bishop, M. Svensen, and C. Williams. GTM: The generative topographic mapping. Neural Computation 10 (1998).

[3] C. Bregler & S. Omohundro. Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems 7 (1995).

[4] D. DeMers & G. W. Cottrell. Nonlinear dimensionality reduction. Advances in Neural Information Processing Systems 5 (1993).

[5] Z. Ghahramani & G. Hinton. The EM algorithm for mixtures of factor analyzers. University of Toronto Technical Report CRG-TR-96-1 (1996).

[6] G. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks 8 (1997).

[7] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. Machine Learning 37(2) (1999).

[8] N. Kambhatla & T. K. Leen. Dimension reduction by local principal component analysis. Neural Computation 9 (1997).

[9] S. T. Roweis & L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science 290 (2000).

[10] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction.
Science 290 (2000).\n\n\f", "award": [], "sourceid": 2082, "authors": [{"given_name": "Sam", "family_name": "Roweis", "institution": null}, {"given_name": "Lawrence", "family_name": "Saul", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}