{"title": "Manifold Parzen Windows", "book": "Advances in Neural Information Processing Systems", "page_first": 849, "page_last": 856, "abstract": null, "full_text": "Manifold Parzen Windows\n\nPascal Vincent and Yoshua Bengio\nDept. IRO, Universit\u00e9 de Montr\u00e9al\n\nC.P. 6128, Montreal, Qc, H3C 3J7, Canada\n\n{vincentp,bengioy}@iro.umontreal.ca\nhttp://www.iro.umontreal.ca/ vincentp\n\nAbstract\n\nThe similarity between objects is a fundamental element of many learn-\ning algorithms. Most non-parametric methods take this similarity to be\n\ufb01xed, but much recent work has shown the advantages of learning it, in\nparticular to exploit the local invariances in the data or to capture the\npossibly non-linear manifold on which most of the data lies. We propose\na new non-parametric kernel density estimation method which captures\nthe local structure of an underlying manifold through the leading eigen-\nvectors of regularized local covariance matrices. Experiments in density\nestimation show signi\ufb01cant improvements with respect to Parzen density\nestimators. The density estimators can also be used within Bayes classi-\n\ufb01ers, yielding classi\ufb01cation rates similar to SVMs and much superior to\nthe Parzen classi\ufb01er.\n\n1 Introduction\nIn [1], while attempting to better understand and bridge the gap between the good perfor-\nmance of the popular Support Vector Machines and the more traditional K-NN (K Nearest\nNeighbors) for classi\ufb01cation problems, we had suggested a modi\ufb01ed Nearest-Neighbor\nalgorithm. This algorithm, which was able to slightly outperform SVMs on several real-\nworld problems, was based on the geometric intuition that the classes actually lived \u201cclose\nto\u201d a lower dimensional non-linear manifold in the high dimensional input space. 
When this was not properly taken into account, as with traditional K-NN, the sparsity of the data points due to having a finite number of training samples would cause "holes" or "zig-zag" artifacts in the resulting decision surface, as illustrated in Figure 1.

Figure 1: A local view of the decision surface, with "holes", produced by the Nearest Neighbor algorithm when the data have a local structure (horizontal direction).

The present work is based on the same underlying geometric intuition, but applied to the well known Parzen windows [2] non-parametric method for density estimation, using Gaussian kernels.

Most of the time, Parzen Windows estimates are built using a "spherical Gaussian" with a single scalar variance (or width) parameter $\sigma^2$. It is also possible to use a "diagonal Gaussian", i.e. with a diagonal covariance matrix, or even a "full Gaussian" with a full covariance matrix, usually set to be proportional to the global empirical covariance of the training data. However these are equivalent to using a spherical Gaussian on preprocessed, normalized data (i.e. normalized by subtracting the empirical sample mean, and multiplying by the inverse sample covariance).
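To make the baseline concrete, here is a minimal sketch of ordinary Parzen windows with a spherical Gaussian kernel (NumPy; the function name, variable names and the toy data are illustrative, not from the paper):

```python
import numpy as np

def parzen_density(x, train, sigma2):
    """Spherical-Gaussian Parzen windows estimate of the density at x.

    Averages one isotropic Gaussian of variance sigma2 centered on every
    training point -- the fixed-shape kernel that Manifold Parzen
    generalizes by giving each point its own covariance.
    """
    train = np.asarray(train, dtype=float)
    l, n = train.shape
    sq_dists = np.sum((train - x) ** 2, axis=1)   # ||x - x_i||^2 for all i
    norm = (2.0 * np.pi * sigma2) ** (n / 2.0)    # Gaussian normalization constant
    return np.mean(np.exp(-0.5 * sq_dists / sigma2) / norm)

# Toy usage: estimate the density of a standard 2D Gaussian sample at the origin
rng = np.random.default_rng(0)
sample = rng.normal(size=(300, 2))
print(parzen_density(np.zeros(2), sample, sigma2=0.5))
```

Note how the single width $\sigma^2$ is shared by all points and all directions; the next sections replace this fixed shape with locally adapted covariances.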
Whatever the shape of the kernel, if, as is customary, a fixed shape is used, merely centered on every training point, the shape can only compensate for the global structure (such as global covariance) of the data.

Now if the true density that we want to model is indeed "close to" a non-linear lower dimensional manifold embedded in the higher dimensional input space, in the sense that most of the probability density is concentrated around such a manifold (with a small noise component away from it), then using Parzen Windows with a spherical or fixed-shape Gaussian is probably not the most appropriate method, for the following reason.

While the true density mass, in the vicinity of a particular training point $x_i$, will be mostly concentrated in a few local directions along the manifold, a spherical Gaussian centered on that point will spread its density mass equally along all input space directions, thus giving too much probability to irrelevant regions of space and too little along the manifold. This is likely to result in an excessive "bumpyness" of the thus modeled density, much like the "holes" and "zig-zag" artifacts observed in KNN (see Fig. 1 and Fig. 2).

If the true density in the vicinity of $x_i$ is concentrated along a lower dimensional manifold, then it should be possible to infer the local direction of that manifold from the neighborhood of $x_i$, and then anchor on $x_i$ a Gaussian "pancake" parameterized in such a way that it spreads mostly along the directions of the manifold, and is almost flat along the other directions. The resulting model is a mixture of Gaussian "pancakes", similar to [3], mixtures of probabilistic PCAs [4] or mixtures of factor analyzers [5, 6], in the same way that the most traditional Parzen Windows is a mixture of spherical Gaussians.
But it remains a memory-based method, with a Gaussian kernel centered on each training point, yet with a differently shaped kernel for each point.

2 The Manifold Parzen Windows algorithm
In the following we formally define and justify in detail the proposed algorithm. Let $X$ be an $n$-dimensional random variable with values in $R^n$, and an unknown probability density function $p(\cdot)$. Our training set contains $l$ samples of that random variable, collected in an $l \times n$ matrix $M$ whose row $x_i$ is the $i$-th sample. Our goal is to estimate the density $p$. Our estimator $\hat{p}(x)$ has the form of a mixture of Gaussians, but unlike the Parzen density estimator, its covariances $C_i$ are not necessarily spherical and not necessarily identical everywhere:

$$\hat{p}(x) = \frac{1}{l} \sum_{i=1}^{l} N_{x_i, C_i}(x) \qquad (1)$$

where $N_{\mu, C}$ is the multivariate Gaussian density with mean vector $\mu$ and covariance matrix $C$:

$$N_{\mu, C}(x) = \frac{1}{\sqrt{(2\pi)^n |C|}}\, e^{-\frac{1}{2} (x-\mu)^T C^{-1} (x-\mu)} \qquad (2)$$

where $|C|$ is the determinant of $C$. How should we select the individual covariances $C_i$? From the above discussion, we expect that if there is an underlying "non-linear principal manifold", those Gaussians would be "pancakes" aligned with the plane locally tangent to this underlying manifold. The only available information (in the absence of further prior knowledge) about this tangent plane can be gathered from the training samples in the neighborhood of $x_i$.
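Taken literally, equations (1) and (2) amount to the following naive sketch (NumPy; names are illustrative, and the per-point covariances are taken as given here, since choosing them is exactly the question addressed next):

```python
import numpy as np

def gaussian_density(x, mu, C):
    """Multivariate Gaussian N_{mu,C}(x) as in equation (2)."""
    n = len(mu)
    diff = x - mu
    # Solve a linear system instead of explicitly inverting C
    quad = diff @ np.linalg.solve(C, diff)
    norm = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(C))
    return np.exp(-0.5 * quad) / norm

def mixture_density(x, centers, covariances):
    """Equation (1): average of one Gaussian per training point."""
    return np.mean([gaussian_density(x, xi, Ci)
                    for xi, Ci in zip(centers, covariances)])
```

Storing and inverting a full covariance per training point is exactly what the paper argues is ill-conditioned and too expensive in high dimension, motivating the low-rank representation below.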
In other words, we are interested in computing the principal directions of the samples in the neighborhood of $x_i$.

For generality, we can define a soft neighborhood of $x_i$ with a neighborhood kernel $K(x; x_i)$ that will associate an influence weight to any point $x$ in the neighborhood of $x_i$. We can then compute the weighted covariance matrix

$$C_{K_i} = \frac{\sum_{j} K(x_j; x_i)\, (x_j - x_i)(x_j - x_i)^T}{\sum_{j} K(x_j; x_i)} \qquad (3)$$

where $(x_j - x_i)(x_j - x_i)^T$ denotes the outer product. $K(x; x_i)$ could be a spherical Gaussian centered on $x_i$, for instance, or any other positive definite kernel, possibly incorporating prior knowledge as to what constitutes a reasonable neighborhood for point $x_i$. Notice that if $K$ is a constant (uniform kernel), $C_{K_i}$ is the global training sample covariance. As an important special case, we can define a hard k-neighborhood for training sample $x_i$ by assigning a weight of 1 to any point no further than the $k$-th nearest neighbor of $x_i$ among the training set, according to some metric such as the Euclidean distance in input space, and assigning a weight of 0 to points further than the $k$-th neighbor. In that case, $C_{K_i}$ is the unweighted covariance of the $k$ nearest neighbors of $x_i$.

Notice what is happening here: we start with a possibly rough prior notion of neighborhood, such as one based on the ordinary Euclidean distance in input space, and use this to compute a local covariance matrix, which implicitly defines a refined local notion of neighborhood, taking into account the local directions observed in the training samples.

Now that we have a way of computing a local covariance matrix for each training point, we might be tempted to use this directly in equations 2 and 1. But a number of problems must first be addressed:

- Equation 2 requires the inverse covariance matrix, whereas $C_{K_i}$ is likely to be ill-conditioned. This situation will definitely arise if we use a hard k-neighborhood with $k < n$. In this case we get a Gaussian that is totally flat outside of the affine subspace spanned by $x_i$ and its $k$ neighbors, and it does not constitute a proper density in $R^n$. A common way to deal with this problem is to add a small isotropic (spherical) Gaussian noise of variance $\sigma^2$ in all directions, which is done by simply adding $\sigma^2$ to the diagonal of the covariance matrix: $C_i = C_{K_i} + \sigma^2 I$.

- Even if we regularize $C_{K_i}$ by adding $\sigma^2 I$, when we deal with high dimensional spaces, it would be prohibitive in computation time and storage to keep and use the full inverse covariance matrix as expressed in (2). This would in effect multiply both the time and storage requirement of the already expensive ordinary Parzen Windows by $n^2$. So instead, we use a different, more compact representation of the inverse Gaussian, by storing only the eigenvectors associated with the first few largest eigenvalues of $C_{K_i}$, as described below.

The eigen-decomposition of a covariance matrix $C$ can be expressed as $C = V \Lambda V^T$, where the columns of $V$ are the orthonormal eigenvectors and $\Lambda$ is a diagonal matrix with the eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$, that we will suppose sorted in decreasing order, without loss of generality.

The first $d$ eigenvectors with largest eigenvalues correspond to the principal directions of the local neighborhood, i.e. the high variance local directions of the supposed underlying $d$-dimensional manifold (but the true underlying dimension is unknown and may actually vary across space). The last few eigenvalues and eigenvectors are but noise directions with a small variance. So we may, without too much risk, force those last few components to the same low noise level $\sigma^2$. We have done this by zeroing the last $n - d$ eigenvalues (by considering only the first $d$ leading eigenvalues) and then adding $\sigma^2$ to all eigenvalues. This allows us to store only the first $d$ eigenvectors, and to later compute $N_{x_i, C_i}(x)$ in time $O(d \times n)$ instead of $O(n^2)$. Thus both the storage requirement and the computational cost when estimating the density at a test point is only about $d$ times that of ordinary Parzen.

It can easily be shown that such an approximation of the covariance matrix yields the following computation of $N_{x_i, C_i}(x)$:

Algorithm LocalGaussian($x$, $x_i$, $V_i$, $\lambda_i$, $d$, $\sigma^2$)
Input: test vector $x \in R^n$, training vector $x_i \in R^n$, the $d$ eigenvalues $\lambda_{ik}$ and the $d$ eigenvectors in the columns of $V_i$, dimension $d$, and the regularization hyper-parameter $\sigma^2$.
(1) $r = (n - d) \log(\sigma^2) + \sum_{k=1}^{d} \log(\lambda_{ik})$
(2) $q = \frac{1}{\sigma^2} \|x - x_i\|^2 + \sum_{k=1}^{d} \left( \frac{1}{\lambda_{ik}} - \frac{1}{\sigma^2} \right) (V_{ik} \cdot (x - x_i))^2$
Output: Gaussian density $e^{-\frac{1}{2} (r + q + n \log(2\pi))}$

In the case of the hard k-neighborhood, the training algorithm pre-computes the local principal directions $V_i$ of the $k$ nearest neighbors of each training point $x_i$ (in practice we compute them with a SVD rather than an eigen-decomposition of the covariance matrix, see below). Note that with $d = 0$, we trivially obtain the traditional Parzen windows estimator.

Algorithm MParzen::Train($M$, $d$, $k$, $\sigma^2$)
Input: training set matrix $M$ with $l$ rows $x_i$, chosen number of principal directions $d$, chosen number of neighbors $k$, and regularization hyper-parameter $\sigma^2$.
(1) For $i = 1, \ldots, l$
(2) Collect the $k$ nearest neighbors of $x_i$, and put the differences $x_j - x_i$ in the rows of matrix $U$.
(3) Perform a partial singular value decomposition of $U$, to obtain the leading $d$ singular values $s_k$ and singular column vectors $V_{ik}$ of $U$, and let $\lambda_{ik} = s_k^2 / k$.
(4) Output: the model $\Theta = (M, V, \lambda, d, \sigma^2)$, where $V$ is an $l \times d \times n$ tensor that collects all the eigenvectors and $\lambda$ is an $l \times d$ matrix with all the leading eigenvalues.

Algorithm MParzen::Test($x$, $\Theta$)
Input: test point $x$ and model $\Theta = (M, V, \lambda, d, \sigma^2)$.
(1) $s \leftarrow 0$
(2) For $i = 1, \ldots, l$
(3) $s \leftarrow s + $ LocalGaussian($x$, $x_i$, $V_i$, $\lambda_i$, $d$, $\sigma^2$)
Output: manifold Parzen estimator $\hat{p}(x) = s / l$.
3 Related work
As we have already pointed out, Manifold Parzen Windows, like traditional Parzen Windows and so many other density estimation algorithms, results in defining the density as a mixture of Gaussians. What differs is mostly how those Gaussians and their parameters are chosen. The idea of having a parameterization of each Gaussian that orients it along the local principal directions also underlies the already mentioned work on mixtures of Gaussian pancakes [3], mixtures of probabilistic PCAs [4], and mixtures of factor analysers [5, 6]. All these algorithms typically model the density using a relatively small number of Gaussians, whose centers and parameters must be learnt with some iterative optimisation algorithm such as EM (procedures which are known to be sensitive to local minima traps). By contrast our approach is, like the original Parzen windows, heavily memory-based. It avoids the problem of optimizing the centers by assigning a Gaussian to every training point, and uses simple analytic SVD to compute the local principal directions for each.

Another successful memory-based approach that uses local directions and inspired our work is the tangent distance algorithm [7].
While this approach was initially aimed at solving classification tasks with a nearest neighbor paradigm, some work has already been done in developing it into a probabilistic interpretation for mixtures with a few gaussians, as well as for full-fledged kernel density estimation [8, 9]. The main difference between our approach and the above is that the Manifold Parzen estimator does not require prior knowledge, as it infers the local directions directly from the data, although it should be easy to also incorporate prior knowledge if available.

We should also mention similarities between our approach and the Local Linear Embedding and recent related dimensionality reduction methods [10, 11, 12, 13].
There are also links with previous work on locally-defined metrics for nearest-neighbors [14, 15, 16, 17]. Lastly, it can also be seen as an extension along the line of traditional variable and adaptive kernel estimators that adapt the kernel width locally (see [18] for a survey).

4 Experimental results
Throughout this whole section, when we mention Parzen Windows (sometimes abbreviated Parzen), we mean ordinary Parzen windows using a spherical Gaussian kernel with a single hyper-parameter $\sigma$, the width of the Gaussian. When we mention Manifold Parzen Windows (sometimes abbreviated MParzen), we used a hard k-neighborhood, so that the hyper-parameters are: the number of neighbors $k$, the number of retained principal components $d$, and the additional isotropic Gaussian noise parameter $\sigma^2$.

When measuring the quality of a density estimator $\hat{p}$, we used the average negative log likelihood: $\mathrm{ANLL} = -\frac{1}{m} \sum_{i=1}^{m} \log \hat{p}(x_i)$ with the $m$ examples $x_i$ from a test set.

4.1 Experiment on 2D artificial data

A training set of 300 points, a validation set of 300 points and a test set of 10000 points were generated from the following distribution of two dimensional $(x, y)$ points:

$$x = 0.04\, t \sin(t) + \epsilon_x, \qquad y = 0.04\, t \cos(t) + \epsilon_y$$

where $t \sim U(3, 15)$, $\epsilon_x \sim N(0, 0.01)$, $\epsilon_y \sim N(0, 0.01)$, $U(a, b)$ is uniform in the interval $(a, b)$ and $N(\mu, \sigma)$ is a normal density.

We trained an ordinary Parzen, as well as MParzen with $d = 1$ and $d = 2$ on the training set, tuning the hyper-parameters to achieve best performance on the validation set. Figure 2 shows the training set and gives a good idea of the densities produced by both kinds of algorithms (as the visual representation for MParzen with $d = 1$ and $d = 2$ did not appear very different, we show only the case $d = 1$). The graphic reveals the anticipated "bumpyness" artifacts of ordinary Parzen, and shows that MParzen is indeed able to better concentrate the probability density along the manifold, even when the training data is scarce.

Quantitative comparative results of the two models are reported in Table 1.

Table 1: Comparative results on the artificial data (standard errors are in parenthesis).

Algorithm        ANLL on test-set
Parzen           -1.183 (0.016)
MParzen (d = 1)  -1.466 (0.009)
MParzen (d = 2)  -1.419 (0.009)

Several points are worth noticing:

- Both MParzen models seem to achieve a lower ANLL than ordinary Parzen (even though the underlying manifold really has dimension $d = 1$), and with more consistency over the test sets (lower standard error).
- The optimal width $\sigma$ for ordinary Parzen is much larger than the noise parameter of the true generating model (0.01), probably because of the finite sample size.
- The optimal regularization parameter $\sigma^2$ for MParzen with $d = 1$ (i.e. supposing a one-dimensional underlying manifold) is very close to the actual noise parameter of the true generating model. This suggests that it was able to capture the underlying structure quite well. Also it is the best of the three models, which is not surprising, since the true model is indeed a one dimensional manifold with an added isotropic Gaussian noise.
- The optimal additional noise parameter $\sigma^2$ for MParzen with $d = 2$ (i.e. supposing a two-dimensional underlying manifold) is close to 0, which suggests that the model was able to capture all the noise in the second "principal direction".

Figure 2: Illustration of the density estimated by ordinary Parzen Windows (left) and Manifold Parzen Windows (right). The two images on the bottom are a zoomed area of the corresponding image at the top. The 300 training points are represented as black dots and the area where the estimated density $\hat{p}(x)$ is above 1.0 is painted in gray. The excessive "bumpyness" and holes produced by the ordinary Parzen windows model can clearly be seen, whereas the Manifold Parzen density is better aligned with the underlying manifold, allowing it to even successfully "extrapolate" in regions with few data points but high true density.

4.2 Density estimation on OCR data

In order to compare the performance of both algorithms for density estimation on a real-world problem, we estimated the density of one class of the MNIST OCR data set, namely the "2" digit. The available data for this class was divided into 5400 training points, 558 validation points and 1032 test points. Hyper-parameters were tuned on the validation set. The results are summarized in Table 2, using the performance measures introduced above (average negative log-likelihood). Note that the improvement with respect to Parzen windows is extremely large and of course statistically significant.

Table 2: Density estimation of class '2' in the MNIST data set.
Standard errors are in parenthesis.

Algorithm   validation ANLL    test ANLL
Parzen      -197.27 (4.18)     -197.19 (3.55)
MParzen     -696.42 (5.94)     -695.15 (5.21)

4.3 Classification performance

To obtain a probabilistic classifier with a density estimator we train an estimator $\hat{p}_c(x)$ for each class $c$, and apply Bayes' rule to obtain

$$\hat{P}(c \mid x) = \frac{\hat{p}_c(x)\, P(c)}{\sum_{c'} \hat{p}_{c'}(x)\, P(c')}.$$

When measuring the quality of a probabilistic classifier $\hat{P}(c \mid x)$, we used the average negative conditional log likelihood: $\mathrm{ANCLL} = -\frac{1}{m} \sum_{i=1}^{m} \log \hat{P}(c_i \mid x_i)$, with the $m$ examples $(c_i, x_i)$ (correct class, input) from a test set.

This method was applied to both the Parzen and the Manifold Parzen density estimators, which were compared with state-of-the-art Gaussian SVMs on the full USPS data set. The original training set (7291) was split into a training (first 6291) and validation set (last 1000), used to tune hyper-parameters. The classification errors for all three methods are compared in Table 3, where the hyper-parameters are chosen based on validation classification error. The log-likelihoods are compared in Table 4, where the hyper-parameters are chosen based on validation ANCLL. Hyper-parameters for SVMs are the box constraint $C$ and the Gaussian width $\sigma$. MParzen has the lowest classification error and ANCLL of the three algorithms.

Table 3: Classification error obtained on USPS with SVM, Parzen windows and Manifold Parzen windows classifiers.

Algorithm   validation error   test error
SVM         1.2%               4.68%
Parzen      1.8%               5.08%
MParzen     0.9%               4.08%

Table 4: Comparative negative conditional log likelihood obtained on USPS.

Algorithm   valid ANCLL   test ANCLL
Parzen      0.1022        0.3478
MParzen     0.0658        0.3384

5 Conclusion
The rapid increase in computational power now allows experimentation with sophisticated non-parametric models such as those presented here. They have allowed us to show the usefulness of learning the local structure of the data through a regularized covariance matrix estimated for each data point. By taking advantage of local structure, the new kernel density estimation method outperforms the Parzen windows estimator. Classifiers built from this density estimator yield state-of-the-art knowledge-free performance, which is remarkable for a not discriminatively trained classifier. Besides, in some applications, the accurate estimation of probabilities can be crucial, e.g. when the classes are highly imbalanced.

Future work should consider other alternative methods of estimating the local covariance matrix, for example as suggested here using a weighted estimator, or taking advantage of prior knowledge (e.g. the Tangent distance directions).

References
[1] P. Vincent and Y. Bengio. K-local hyperplane and convex distance nearest neighbor algorithms. In T.G. Dietterich, S. Becker, and Z.
Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14. The MIT Press, 2002.

[2] E. Parzen. On the estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1064-1076, 1962.

[3] G.E. Hinton, M. Revow, and P. Dayan. Recognizing handwritten digits using mixtures of linear models. In G. Tesauro, D.S. Touretzky, and T.K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 1015-1022. MIT Press, Cambridge, MA, 1995.

[4] M.E. Tipping and C.M. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11(2):443-482, 1999.

[5] Z. Ghahramani and G.E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, Dpt. of Comp. Sci., Univ. of Toronto, 1996.

[6] Z. Ghahramani and M.J. Beal. Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems 12, Cambridge, MA, 2000. MIT Press.

[7] P.Y. Simard, Y.A. LeCun, J.S. Denker, and B. Victorri. Transformation invariance in pattern recognition - tangent distance and tangent propagation. Lecture Notes in Computer Science, 1524, 1998.

[8] D. Keysers, J. Dahmen, and H. Ney. A probabilistic view on tangent distance. In 22nd Symposium of the German Association for Pattern Recognition, Kiel, Germany, 2000.

[9] J. Dahmen, D. Keysers, M. Pitz, and H. Ney. Structured covariance matrices for statistical image object recognition. In 22nd Symposium of the German Association for Pattern Recognition, Kiel, Germany, 2000.

[10] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, Dec. 2000.

[11] Y.W. Teh and S. Roweis. Automatic alignment of local representations. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15.
The MIT Press, 2003.

[12] V. de Silva and J.B. Tenenbaum. Global versus local approaches to nonlinear dimensionality reduction. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15. The MIT Press, 2003.

[13] M. Brand. Charting a manifold. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15. The MIT Press, 2003.

[14] R.D. Short and K. Fukunaga. The optimal distance measure for nearest neighbor classification. IEEE Transactions on Information Theory, 27:622-627, 1981.

[15] J. Myles and D. Hand. The multi-class measure problem in nearest neighbour discrimination rules. Pattern Recognition, 23:1291-1297, 1990.

[16] J. Friedman. Flexible metric nearest neighbor classification. Technical Report 113, Stanford University Statistics Department, 1994.

[17] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification and regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 409-415. The MIT Press, 1996.

[18] A.J. Izenman. Recent developments in nonparametric density estimation. Journal of the American Statistical Association, 86(413):205-224, 1991.
", "award": [], "sourceid": 2203, "authors": [{"given_name": "Pascal", "family_name": "Vincent", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}]}