{"title": "Non-Local Manifold Parzen Windows", "book": "Advances in Neural Information Processing Systems", "page_first": 115, "page_last": 122, "abstract": null, "full_text": "Non-Local Manifold Parzen Windows\n\nYoshua Bengio, Hugo Larochelle and Pascal Vincent\n\nDept. IRO, Universit\u00b4e de Montr\u00b4eal\n\nP.O. Box 6128, Downtown Branch, Montreal, H3C 3J7, Qc, Canada\n\n{bengioy,larocheh,vincentp}@iro.umontreal.ca\n\nAbstract\n\nTo escape from the curse of dimensionality, we claim that one can learn\nnon-local functions, in the sense that the value and shape of the learned\nfunction at x must be inferred using examples that may be far from x.\nWith this objective, we present a non-local non-parametric density esti-\nmator. It builds upon previously proposed Gaussian mixture models with\nregularized covariance matrices to take into account the local shape of\nthe manifold. It also builds upon recent work on non-local estimators of\nthe tangent plane of a manifold, which are able to generalize in places\nwith little training data, unlike traditional, local, non-parametric models.\n\n1 Introduction\n\nA central objective of statistical machine learning is to discover structure in the joint dis-\ntribution between random variables, so as to be able to make predictions about new com-\nbinations of values of these variables. A central issue in obtaining generalization is how\ninformation from the training examples can be used to make predictions about new exam-\nples and, without strong prior assumptions (i.e.\nin non-parametric models), this may be\nfundamentally dif\ufb01cult, as illustrated by the curse of dimensionality.\n\n(Bengio, Delalleau and Le Roux, 2005) and (Bengio and Monperrus, 2005) present sev-\neral arguments illustrating some fundamental limitations of modern kernel methods due\nto the curse of dimensionality, when the kernel is local (like the Gaussian kernel). 
These arguments are all based on the locality of the estimators, i.e., on the fact that very important information about the predicted function at x is derived mostly from the near neighbors of x in the training set. This analysis has been applied to supervised learning algorithms such as SVMs as well as to unsupervised manifold learning algorithms and graph-based semi-supervised learning. The analysis in (Bengio, Delalleau and Le Roux, 2005) highlights intrinsic limitations of such local learning algorithms, which can make them fail when applied to problems where one has to look beyond what happens locally in order to overcome the curse of dimensionality, or more precisely when the function to be learned has many variations while there exist more compact representations of these variations than a simple enumeration. This strongly suggests investigating non-local learning methods, which can in principle generalize at x using information gathered at training points xi that are far from x. We present here such a non-local learning algorithm, in the realm of density estimation.

The proposed non-local non-parametric density estimator builds upon the Manifold Parzen density estimator (Vincent and Bengio, 2003), which associates a regularized Gaussian with each training point, and upon recent work on non-local estimators of the tangent plane of a manifold (Bengio and Monperrus, 2005). The local covariance matrix characterizing the density in the immediate neighborhood of a data point is learned as a function of that data point, with global parameters. This makes it possible to generalize in places with little or no training data, unlike traditional, local, non-parametric models. Here, the implicit assumption is that there is some kind of regularity in the shape of the density, such that learning about its shape in one region could be informative of the shape in another region that is not adjacent. 
Note that the smoothness assumption typically underlying non-parametric\nmodels relies on a simple form of such transfer, but only for neighboring regions, which is\nnot very helpful when the intrinsic dimension of the data (the dimension of the manifold\non which or near which it lives) is high or when the underlying density function has many\nvariations (Bengio, Delalleau and Le Roux, 2005). The proposed model is also related to\nthe Neighborhood Component Analysis algorithm (Goldberger et al., 2005), which learns\na global covariance matrix for use in the Mahalanobis distance within a non-parametric\nclassi\ufb01er. Here we generalize this global matrix to one that is a function of the datum x.\n\n2 Manifold Parzen Windows\n\nIn the Parzen Windows estimator, one puts a spherical (isotropic) Gaussian around each\ntraining point xi, with a single shared variance hyper-parameter. One approach to improve\non this estimator, introduced in (Vincent and Bengio, 2003), is to use not just the presence\nof xi and its neighbors but also their geometry, trying to infer the principal characteristics of\nthe local shape of the manifold (where the density concentrates), which can be summarized\nin the covariance matrix of the Gaussian, as illustrated in Figure 1. If the data concentrates\nin certain directions around xi, we want that covariance matrix to be \u201c\ufb02at\u201d (near zero\nvariance) in the orthogonal directions.\n\nOne way to achieve this is to parametrize each of these covariance matrices in terms of\n\u201cprincipal directions\u201d (which correspond to the tangent vectors of the manifold, if the data\nconcentrates on a manifold). In this way we do not need to specify individually all the\nentries of the covariance matrix. 
The only required assumption is that the "noise directions" orthogonal to the "principal directions" all have the same variance.

p̂(y) = (1/n) Σ_{i=1}^{n} N(y; xi + µ(xi), S(xi))    (1)

where N(y; xi + µ(xi), S(xi)) is a Gaussian density at y, with mean vector xi + µ(xi) and covariance matrix S(xi) represented compactly by

S(xi) = σ²noise(xi) I + Σ_{j=1}^{d} sj²(xi) vj(xi) vj(xi)′    (2)

where sj²(xi) and σ²noise(xi) are scalars, and vj(xi) denotes a "principal" direction with variance sj²(xi) + σ²noise(xi), while σ²noise(xi) is the noise variance (the variance in all the other directions). vj(xi)′ denotes the transpose of vj(xi).

In (Vincent and Bengio, 2003), µ(xi) = 0, and σ²noise(xi) = σ²0 is a global hyper-parameter, while (λj(xi), vj(xi)) = (sj²(xi) + σ²noise(xi), vj(xi)) are the leading (eigenvalue, eigenvector) pairs from the eigen-decomposition of a locally weighted covariance matrix (e.g. the empirical covariance of the vectors xl − xi, with xl a near neighbor of xi). The "noise level" hyper-parameter σ²0 must be chosen such that the principal eigenvalues are all greater than σ²0. Another hyper-parameter is the number d of principal components to keep. Alternatively, one can choose σ²noise(xi) to be the (d + 1)-th eigenvalue, which guarantees that λj(xi) > σ²noise(xi), and gets rid of a hyper-parameter. 
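As a concrete illustration (a minimal NumPy sketch under our own naming, not the authors' code), the density of equations (1) and (2) can be evaluated cheaply by exploiting the low-rank-plus-isotropic structure of S(xi), avoiding any explicit D × D inverse or determinant:

```python
import numpy as np

def manifold_parzen_logpdf(y, x, mu, V, s2, sigma2_noise):
    """Log-density of N(y; x + mu, S) with S = sigma2_noise*I + sum_j s2_j v_j v_j'.
    V: (d, D) matrix whose rows are orthonormal principal directions.
    Exploits the low-rank + isotropic structure: cost O(d*D), not O(D^3)."""
    D = y.shape[0]
    d = V.shape[0]
    r = y - x - mu
    proj = V @ r                             # coordinates along principal directions
    lam = s2 + sigma2_noise                  # total variances along principal directions
    # log|S| = (D - d) log sigma2_noise + sum_j log(s2_j + sigma2_noise)
    logdet = (D - d) * np.log(sigma2_noise) + np.sum(np.log(lam))
    # r' S^{-1} r split into the orthogonal (noise) part and the principal part
    quad = (r @ r - proj @ proj) / sigma2_noise + np.sum(proj**2 / lam)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + quad)

def parzen_mixture_logpdf(y, X, mu_fn, cov_fn):
    """Eq. (1): average of per-training-point Gaussians.  mu_fn and cov_fn are
    hypothetical callables returning mu(x_i) and (V, s2, sigma2_noise)."""
    logs = [manifold_parzen_logpdf(y, xi, mu_fn(xi), *cov_fn(xi)) for xi in X]
    return np.logaddexp.reduce(logs) - np.log(len(X))
```

The log-sum-exp accumulation keeps the mixture numerically stable in high dimension, where individual component densities underflow.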
This very simple model was found to be consistently better than the ordinary Parzen density estimator in numerical experiments in which all hyper-parameters are chosen by cross-validation.

3 Non-Local Manifold Tangent Learning

In (Bengio and Monperrus, 2005) a manifold learning algorithm was introduced in which the tangent plane of a d-dimensional manifold at x is learned as a function of x ∈ R^D, using globally estimated parameters. The output of the predictor function F(x) is a d × D matrix whose d rows are the d (possibly non-orthogonal) vectors that span the tangent plane. The training information about the tangent plane is obtained by considering pairs of near neighbors xi and xj in the training set. Consider the predicted tangent plane of the manifold at xi, characterized by the rows of F(xi). For a good predictor we expect the vector (xi − xj) to be close to its projection on the tangent plane, with local coordinates w ∈ R^d; w can be obtained analytically by solving a linear system of dimension d. The training criterion chosen in (Bengio and Monperrus, 2005) then minimizes, over such pairs (xi, xj), the sum of the squared sine of the projection angle, i.e. ||F′(xi)w − (xj − xi)||²/||xj − xi||². It is a heuristic criterion, which will be replaced in our new algorithm by one derived from the maximum likelihood criterion, considering that F(xi) indirectly provides the principal eigenvectors of the local covariance matrix at xi. Both criteria gave similar results experimentally, but the model proposed here yields a complete density estimator. 
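The local coordinates w and the projection criterion above can be computed as follows (a minimal NumPy sketch under our own naming; F has d rows spanning the predicted tangent plane):

```python
import numpy as np

def tangent_criterion(F, x_i, x_j):
    """Heuristic criterion of (Bengio and Monperrus, 2005): the squared sine of
    the angle between (x_j - x_i) and its projection on the tangent plane
    spanned by the rows of F (d x D).  The local coordinates w solve the
    d-dimensional linear system (F F') w = F (x_j - x_i)."""
    delta = x_j - x_i
    w = np.linalg.solve(F @ F.T, F @ delta)   # local coordinates, w in R^d
    residual = F.T @ w - delta                # component off the tangent plane
    return w, (residual @ residual) / (delta @ delta)
```

The criterion is 0 when the neighbor difference lies exactly in the predicted tangent plane and 1 when it is orthogonal to it, so minimizing its sum over neighbor pairs aligns the predicted planes with the data manifold.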
In both cases F(xi) can be interpreted as specifying the directions in which one expects to see the most variations when going from xi to one of its near neighbors in a finite sample.

[Figure 1: a 2D sketch showing a training point xi, the offset µ to the Gaussian center, the principal direction v1 with standard deviation √(s1² + σ²noise) along the tangent plane, and the off-plane standard deviation σnoise.]

Figure 1: Illustration of the local parametrization of local or Non-Local Manifold Parzen. The examples around training point xi are modeled by a Gaussian. µ(xi) specifies the center of that Gaussian, which should be non-zero when xi is off the manifold. The vk's are principal directions of the Gaussian and are tangent vectors of the manifold. σnoise represents the thickness of the manifold.

4 Proposed Algorithm: Non-Local Manifold Parzen Windows

In equations (1) and (2) we wrote µ(xi) and S(xi) as if they were functions of xi rather than simply using indices µi and Si. This is because we introduce here a non-local version of Manifold Parzen Windows, inspired by the non-local manifold tangent learning algorithm, i.e., one in which we can share information about the density across different regions of space. In our experiments we use a neural network with nhid hidden neurons, taking xi as input to predict µ(xi), σ²noise(xi), and the sj²(xi) and vj(xi). The vectors computed by the neural network do not need to be orthonormal: we only need to consider the subspace that they span. Also, the vectors' squared norms are used to infer the sj²(xi), instead of having a separate output for them. We will denote by F(xi) the matrix whose rows are the vectors output by the neural network. From it we obtain the sj²(xi) and vj(xi) by performing a singular value decomposition, i.e. F′F = Σ_{j=1}^{d} sj² vj vj′. Moreover, to make sure σ²noise does not get too small, which could make the optimization unstable, we impose σ²noise(xi) = s²noise(xi) + σ²0, where snoise(·) is an output of the neural network and σ²0 is a fixed constant.

Imagine that the data were lying near a lower dimensional manifold. Consider a training example xi near the manifold. The Gaussian centered near xi tells us how neighbors of xi are expected to differ from xi. Its "principal" vectors vj(xi) span the tangent of the manifold near xi. The Gaussian center variation µ(xi) tells us how xi is located with respect to its projection on the manifold. The noise variance σ²noise(xi) tells us how far from the manifold to expect neighbors, and the directional variances sj²(xi) + σ²noise(xi) tell us how far to expect neighbors on the different local axes of the manifold, near xi's projection on the manifold. Figure 1 illustrates this in 2 dimensions.

The important element of this model is that the parameters of the predictive neural network can potentially represent non-local structure in the density, i.e., they make it possible to discover shared structure among the different covariance matrices in the mixture. 
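Recovering the directional variances and principal directions from the network output can be sketched as follows (a minimal NumPy sketch under our own naming, not the authors' code), including the noise-variance floor σ²0 described above:

```python
import numpy as np

def covariance_from_network(F, s_noise, sigma2_0):
    """Recover the s2_j and v_j of Eq. (2) from the d x D matrix F whose rows
    are the (non-orthogonal) vectors output by the neural network, using the
    identity F'F = sum_j s2_j v_j v_j' obtained from an SVD of F.  The noise
    variance is floored at the fixed constant sigma2_0, as in the text."""
    _, sing, Vt = np.linalg.svd(F, full_matrices=False)
    s2 = sing**2                          # directional variances s2_j
    V = Vt                                # orthonormal principal directions (rows)
    sigma2_noise = s_noise**2 + sigma2_0  # keeps the noise level away from zero
    return V, s2, sigma2_noise
```

Note that the network's raw vectors need not be orthogonal; the SVD orthonormalizes them while preserving the spanned subspace and converting their norms into the variances.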
Here is the pseudo-code algorithm for training Non-Local Manifold Parzen (NLMP):

Algorithm NLMP::Train(X, d, k, kµ, µ(·), S(·), σ²0)
Input: training set X, chosen number of principal directions d, chosen numbers of neighbors k and kµ, initial functions µ(·) and S(·), and regularization hyper-parameter σ²0.
(1) For xi ∈ X:
(2)   Collect the max(k, kµ) nearest neighbors of xi.
      Below, call yj one of the k nearest neighbors, and yµj one of the kµ nearest neighbors.
(3)   Perform a stochastic gradient step on the parameters of S(·) and µ(·), using the negative log-likelihood error signal on the yj, with a Gaussian of mean xi + µ(xi) and covariance matrix S(xi). The approximate gradients are:

      ∂C(yµj, xi)/∂µ(xi) = −(1/n_kµ(yµj)) S(xi)⁻¹ (yµj − xi − µ(xi))

      ∂C(yj, xi)/∂σ²noise(xi) = 0.5 (1/n_k(yj)) ( Tr(S(xi)⁻¹) − ||(yj − xi − µ(xi))′ S(xi)⁻¹||² )

      ∂C(yj, xi)/∂F(xi) = (1/n_k(yj)) F(xi) S(xi)⁻¹ ( I − (yj − xi − µ(xi))(yj − xi − µ(xi))′ S(xi)⁻¹ )

      where n_k(y) = |N_k(y)| is the number of points in the training set that have y among their k nearest neighbors.
(4) Go to (1) until a given criterion is satisfied (e.g. the average NLL of the NLMP density estimate on a validation set stops decreasing).
Result: trained µ(·) and S(·) functions, with corresponding σ²0.

Deriving the gradient formula (the derivative of the log-likelihood with respect to the neural network outputs) is lengthy but straightforward. 
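The closed-form gradients in the pseudo-code can be verified numerically. The sketch below (our own naming; the 1/n_k kernel weights are omitted for clarity) implements the per-pair negative log-likelihood and its gradients with respect to µ, σ²noise and F:

```python
import numpy as np

def nll(y, x, mu, F, sigma2_noise):
    """Per-pair negative log-likelihood C(y, x) = -log N(y; x + mu, S),
    with S = sigma2_noise*I + F'F (kernel weights 1/n_k omitted)."""
    D = y.shape[0]
    S = sigma2_noise * np.eye(D) + F.T @ F
    r = y - x - mu
    _, logdet = np.linalg.slogdet(S)
    return 0.5 * (D * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(S, r))

def nll_gradients(y, x, mu, F, sigma2_noise):
    """Closed-form gradients matching the pseudo-code (unweighted)."""
    D = y.shape[0]
    S = sigma2_noise * np.eye(D) + F.T @ F
    Sinv = np.linalg.inv(S)
    r = y - x - mu
    g_mu = -Sinv @ r                                       # dC/dmu
    g_noise = 0.5 * (np.trace(Sinv) - (Sinv @ r) @ (Sinv @ r))  # dC/dsigma2_noise
    g_F = F @ Sinv @ (np.eye(D) - np.outer(r, r) @ Sinv)   # dC/dF
    return g_mu, g_noise, g_F
```

A finite-difference check of these formulas against `nll` is a cheap sanity test before plugging them into a stochastic gradient loop.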
The main trick is to do a Singular Value Decomposition of the basis vectors computed by the neural network, and to use known simplifying formulas for the derivative of the inverse of a matrix and of the determinant of a matrix. Details on the gradient derivation and on the optimization of the neural network are given in the technical report (Bengio and Larochelle, 2005).

5 Computationally Efficient Extension: Test-Centric NLMP

While the NLMP algorithm appears to perform very well, one of its main practical limitations for density estimation, which it shares with Manifold Parzen, is the large amount of computation required at test time: for each test point x, the complexity of the computation is O(n·d·D) (where D is the dimensionality of the input space R^D).

However there may be a different and cheaper way to compute an estimate of the density at x. We build here on an idea suggested in (Vincent, 2003), which yields an estimator that does not exactly integrate to one, but this is not an issue if the estimator is to be used for applications such as classification. Note that in our presentation of NLMP we are using "hard" neighborhoods (i.e. a local weighting kernel that assigns a weight of 1 to the k nearest neighbors and 0 to the rest), but it could easily be generalized to "soft" weighting, as in (Vincent, 2003).

Let us decompose the true density at x as p(x) = p(x | x ∈ B_k(x)) P(B_k(x)), where B_k(x) represents the spherical ball centered on x and containing the k nearest neighbors of x (i.e., the ball with radius ||x − N_k(x)||, where N_k(x) is the k-th neighbor of x in the training set).

It can be shown that the above NLMP learning procedure looks for functions µ(·) and S(·) that best characterize the distribution of the k training-set nearest neighbors of x as the normal N(·; x + µ(x), S(x)). 
If we trust this locally normal (unimodal) approximation of the neighborhood distribution to be appropriate, then we can approximate p(x | x ∈ B_k(x)) by N(x; x + µ(x), S(x)). The approximation should be good when B_k(x) is small and p(x) is continuous. Moreover, as B_k(x) contains k points among n, we can approximate P(B_k(x)) by k/n.

This yields the estimator p̂(x) = N(x; x + µ(x), S(x)) · k/n, which requires only O(d·D) time to evaluate at a test point. We call this estimator Test-centric NLMP, since it considers only the Gaussian predicted at the test point, rather than a mixture of all the Gaussians obtained at the training points.

6 Experimental Results

We have performed comparative experiments on both toy and real-world data, on density estimation and classification tasks. All hyper-parameters are selected by cross-validation, and the costs on a large test set are used to compare the final performance of all algorithms.

Experiments on toy 2D data. To understand and validate the non-local algorithm we tested it on toy 2D data where it is easy to understand what is being learned. The sinus data set includes examples sampled around a sinus curve. In the spiral data set examples are sampled near a spiral. Respectively, 57 and 113 examples are used for training, 23 and 48 for validation (hyper-parameter selection), and 920 and 3839 for testing. The following algorithms were compared:
• Non-Local Manifold Parzen Windows. The hyper-parameters are the number of principal directions (i.e., the dimension of the manifold), the numbers of nearest neighbors k and kµ, the minimum constant noise variance σ²0 and the number of hidden units of the neural network.
• Gaussian mixture with full but regularized covariance matrices. Regularization is done by setting a minimum constant value σ²0 for the eigenvalues of the Gaussians. It is trained by EM and initialized using the k-means algorithm. 
The hyper-parameter is σ²0, and early stopping of the EM iterations is done with the validation set.
• Parzen Windows density estimator, with a spherical Gaussian kernel. The hyper-parameter is the spread of the Gaussian kernel.
• Manifold Parzen density estimator. The hyper-parameters are the number of principal components, the number k of the nearest-neighbor kernel and the minimum eigenvalue σ²0.
Note that, for these experiments, the number of principal directions (or components) was fixed to 1 for both NLMP and Manifold Parzen.

Density estimation results are shown in Table 1. To help understand why Non-Local Manifold Parzen works well on these data, Figure 2 illustrates the learned densities for the sinus and spiral data. Basically, it works better here because it yields an estimator that is less sensitive to the specific samples around each test point, thanks to its ability to share structure across the whole training set.

Algorithm        | sinus | spiral
Non-Local MP     | 1.144 | -1.346
Manifold Parzen  | 1.345 | -0.914
Gauss Mix Full   | 1.567 | -0.857
Parzen Windows   | 1.841 | -0.487

Table 1: Average out-of-sample negative log-likelihood on two toy problems, for Non-Local Manifold Parzen, a Gaussian mixture with full covariance, Manifold Parzen, and Parzen Windows. The non-local algorithm dominates all the others.

Algorithm        | Valid. | Test
Non-Local MP     | -73.10 | -76.03
Manifold Parzen  |  65.21 |  58.33
Parzen Windows   |  77.87 |  65.94

Table 2: Average negative log-likelihood on the digit rotation experiment, when testing on a digit class (1's) not used during training, for Non-Local Manifold Parzen, Manifold Parzen, and Parzen Windows. The non-local algorithm is clearly superior.

Figure 2: Illustration of the learned densities (sinus on top, spiral on bottom) for four compared models. From left to right: Non-Local Manifold Parzen, Gaussian mixture, Parzen Windows, Manifold Parzen. 
Parzen Windows wastes probability mass in the spheres around each point, while leaving many holes. Gaussian mixtures tend to choose too few components to avoid overfitting. Non-Local Manifold Parzen exploits global structure to yield the best estimator.

Experiments on rotated digits. The next experiment is meant to show both qualitatively and quantitatively the power of non-local learning, by using 9 classes of rotated digit images (from the first 729 examples of the USPS training set) to learn about the rotation manifold, and testing on the left-out class (digit 1), not used for training. Each training digit was rotated by 0.1 and 0.2 radians, and all these images were used as training data. We used NLMP for training, and for testing we formed an augmented mixture with Gaussians centered not only on the training examples, but also on the original unrotated 1 digits. We tested our estimator on the rotated versions of each of the 1 digits. We compared this to Manifold Parzen trained on the training data containing both the original and rotated images of the training-class digits and the unrotated 1 digits. The objective of the experiment was to see if the model was able to infer the density correctly around the original unrotated images, i.e., to predict a high probability for the rotated versions of these images. In Table 2 we see quantitatively that the non-local estimator predicts the rotated images much better.

As qualitative evidence, we used small steps in the principal direction predicted by Test-centric NLMP to rotate an image of the digit 1. 
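This tangent-walking procedure can be sketched as follows (our own helper, not the authors' code; `predict_tangent` is a hypothetical stand-in for the trained network's leading principal direction v1(x), with d = 1):

```python
import numpy as np

def tangent_walk(x0, predict_tangent, n_steps=20, step=0.05):
    """Follow the leading predicted principal direction in small steps,
    as in the rotated-digit experiment: x <- x + step * v1(x).
    Each step's sign is aligned with the previous one so the walk keeps
    moving in a consistent direction instead of oscillating."""
    x = x0.copy()
    prev = None
    for _ in range(n_steps):
        v = predict_tangent(x)
        v = v / np.linalg.norm(v)
        if prev is not None and v @ prev < 0:
            v = -v                 # keep a consistent direction of travel
        x = x + step * v
        prev = v
    return x
```

Negating `step` walks the manifold in the opposite direction, which is how the rotation opposite to the training rotations can be explored.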
To make this task even more illustrative of the generalization potential of non-local learning, we followed the tangent in the direction opposite to the rotations of the training set. It can be seen in Figure 3 that the rotated digit obtained is quite similar to the same digit analytically rotated. For comparison, we tried to apply the same rotation technique to that digit, but using the principal direction, computed by Manifold Parzen, of its nearest neighbor's Gaussian component in the training set. This clearly did not work, which shows how crucial non-local learning is for this task.

Figure 3: From left to right: original image of a digit 1; rotated analytically by −0.2 radians; rotation predicted using Non-Local MP; rotation predicted using MP. Rotations are obtained by following the tangent vector in small steps.

In this experiment, to make sure that NLMP focuses on the tangent plane of the rotation manifold, we fixed the number of principal directions d = 1 and the number of nearest neighbors k = 1, and also imposed µ(·) = 0. The same was done for Manifold Parzen.

Experiments on Classification by Density Estimation. The USPS data set was used to perform a classification experiment. The original training set (7291 examples) was split into a training set (first 6291) and a validation set (last 1000), used to tune hyper-parameters. One density estimator is trained for each of the 10 digit classes. For comparison we also show the results obtained with a Gaussian kernel Support Vector Machine (already used in (Vincent and Bengio, 2003)). 
Non-local MP* refers to the variation described in (Bengio and Larochelle, 2005), which attempts to train faster the components with larger variance. The t-test statistic for the null hypothesis of no difference in the average classification error on the test set of 2007 examples, between Non-local MP and the strongest competitor (Manifold Parzen), is shown in parentheses. Figure 4 also shows some of the invariant transformations learned by Non-local MP for this task.

Note that better SVM results (about 3% error) can be obtained using prior knowledge about image invariances, e.g. with virtual support vectors (Decoste and Scholkopf, 2002). However, as far as we know the NLMP performance is the best on the original USPS dataset among algorithms that do not use prior knowledge about images.

Algorithm        | Valid. | Test            | Hyper-Parameters
SVM              | 1.2%   | 4.68%           | C = 100, σ = 8
Parzen Windows   | 1.8%   | 5.08%           | σ = 0.8
Manifold Parzen  | 0.9%   | 4.08%           | d = 11, k = 11, σ²0 = 0.1
Non-local MP     | 0.6%   | 3.64% (-1.5218) | d = 7, k = 10, kµ = 10, σ²0 = 0.05, nhid = 70
Non-local MP*    | 0.6%   | 3.54% (-1.9771) | d = 7, k = 10, kµ = 4, σ²0 = 0.05, nhid = 30

Table 3: Classification error obtained on USPS with SVM, Parzen Windows, and Local and Non-Local Manifold Parzen Windows classifiers. The hyper-parameters shown are those selected with the validation set.

Figure 4: Transformations learned by Non-local MP. The top row shows digits taken from the USPS training set, and the two following rows display the results of steps taken along one of the 7 principal directions learned by Non-local MP, the third row corresponding to more steps than the second.

7 Conclusion

We have proposed a non-parametric density estimator that, unlike its predecessors, is able to generalize far from the training examples by capturing global structural features of the density. 
It does so by learning a function with global parameters that successfully predicts the local shape of the density, i.e., the tangent plane of the manifold along which the density concentrates. Three types of experiments showed that this idea works, yielding improved density estimation and reduced classification error compared to its local predecessors.

Acknowledgments
The authors would like to thank the following funding organizations for support: NSERC, MITACS, and the Canada Research Chairs. The authors are also grateful to Sam Roweis and Olivier Delalleau for the feedback and stimulating exchanges that helped to shape this paper.

References

Bengio, Y., Delalleau, O., and Le Roux, N. (2005). The curse of dimensionality for local kernel machines. Technical Report 1258, Département d'informatique et recherche opérationnelle, Université de Montréal.

Bengio, Y. and Larochelle, H. (2005). Non-local manifold Parzen windows. Technical report, Département d'informatique et recherche opérationnelle, Université de Montréal.

Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In Saul, L., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17. MIT Press.

Decoste, D. and Schölkopf, B. (2002). Training invariant support vector machines. Machine Learning, 46:161–190.

Goldberger, J., Roweis, S., Hinton, G., and Salakhutdinov, R. (2005). Neighbourhood component analysis. In Saul, L., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17. MIT Press.

Vincent, P. (2003). Modèles à Noyaux à Structure Locale. PhD thesis, Université de Montréal, Département d'informatique et recherche opérationnelle, Montreal, Qc., Canada.

Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. 
In Becker, S., Thrun, S.,\nand Obermayer, K., editors, Advances in Neural Information Processing Systems 15,\nCambridge, MA. MIT Press.\n\n\f", "award": [], "sourceid": 2914, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Hugo", "family_name": "Larochelle", "institution": null}, {"given_name": "Pascal", "family_name": "Vincent", "institution": null}]}