{"title": "Submanifold density estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 1375, "page_last": 1382, "abstract": "Kernel density estimation is the most widely-used practical method for accurate nonparametric density estimation. However, long-standing worst-case theoretical results showing that its performance worsens exponentially with the dimension of the data have quashed its application to modern high-dimensional datasets for decades. In practice, it has been recognized that often such data have a much lower-dimensional intrinsic structure. We propose a small modification to kernel density estimation for estimating probability density functions on Riemannian submanifolds of Euclidean space. Using ideas from Riemannian geometry, we prove the consistency of this modified estimator and show that the convergence rate is determined by the intrinsic dimension of the submanifold. We conclude with empirical results demonstrating the behavior predicted by our theory.", "full_text": "Submanifold density estimation

Arkadas Ozakin
Georgia Tech Research Institute
Georgia Institute of Technology
arkadas.ozakin@gtri.gatech.edu

Alexander Gray
College of Computing
Georgia Institute of Technology
agray@cc.gatech.edu

Abstract

Kernel density estimation is the most widely-used practical method for accurate nonparametric density estimation. However, long-standing worst-case theoretical results showing that its performance worsens exponentially with the dimension of the data have quashed its application to modern high-dimensional datasets for decades. In practice, it has been recognized that often such data have a much lower-dimensional intrinsic structure. We propose a small modification to kernel density estimation for estimating probability density functions on Riemannian submanifolds of Euclidean space. 
Using ideas from Riemannian geometry, we prove the consistency of this modified estimator and show that the convergence rate is determined by the intrinsic dimension of the submanifold. We conclude with empirical results demonstrating the behavior predicted by our theory.

1 Introduction: Density estimation and the curse of dimensionality

Kernel density estimation (KDE) [8] is one of the most popular methods for estimating the underlying probability density function (PDF) of a dataset. Roughly speaking, KDE consists of having the data points “contribute” to the estimate at a given point according to their distances from the point. In the simplest multi-dimensional KDE [3], the estimate \hat{f}_m(y_0) of the PDF f(y_0) at a point y_0 ∈ R^N is given in terms of a sample {y_1, ..., y_m} as,

    \hat{f}_m(y_0) = \frac{1}{m} \sum_{i=1}^m \frac{1}{h_m^N} K\left( \frac{\|y_i - y_0\|}{h_m} \right),    (1)

where h_m > 0, the bandwidth, is chosen to approach zero at a suitable rate as the number m of data points increases, and K : [0, ∞) → [0, ∞) is a kernel function that satisfies certain properties such as boundedness. Various theorems exist on the different types of convergence of the estimator to the correct result and the rates of convergence. The earliest result on the pointwise convergence rate in the multivariable case seems to be given in [3], where it is stated that under certain conditions for f and K, assuming h_m → 0 and m h_m^N → ∞ as m → ∞, the mean squared error in the estimate \hat{f}_m(y_0) of the density at a point goes to zero with the rate,

    MSE[\hat{f}_m(y_0)] = E\left[ \left( \hat{f}_m(y_0) - f(y_0) \right)^2 \right] = O\left( h_m^4 + \frac{1}{m h_m^N} \right)

as m → ∞. If h_m is chosen to be proportional to m^{-1/(N+4)}, one gets,

    MSE[\hat{f}_m(p)] = O\left( \frac{1}{m^{4/(N+4)}} \right),    (2)

as m → ∞. 
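The estimator in (1) can be sketched in a few lines. The sketch below is an illustration, not code from the paper; it substitutes a Gaussian kernel, which integrates to 1 over R^N, for the compactly supported kernels assumed later in the text:

```python
import numpy as np

def kde(y0, data, h):
    """Estimate f(y0) from an (m, N) sample array as in eq. (1):
    each sample point contributes K(||y_i - y0|| / h) / (m h^N)."""
    m, N = data.shape
    t = np.linalg.norm(data - y0, axis=1) / h
    # Gaussian kernel, normalized so its integral over R^N equals 1.
    K = (2 * np.pi) ** (-N / 2) * np.exp(-(t ** 2) / 2)
    return K.sum() / (m * h ** N)
```

With the bandwidth chosen proportional to m^{-1/(N+4)}, this attains the rate in (2); in one dimension, for instance, a few tens of thousands of standard-normal samples estimate f(0) ≈ 0.399 to within a few hundredths.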
This is an example of a curse of dimensionality; the convergence rate slows as the dimensionality N of the data set increases. In Table 4.2 of [12], Silverman demonstrates how the sample size required for a given mean squared error in the estimate of a multivariable normal distribution increases with the dimensionality. The numbers look as discouraging as formula (2).

One source of optimism towards various curses of dimensionality is the fact that although the data for a given problem may have many features, in reality the intrinsic dimensionality of the “data subspace” of the full feature space may be low. This may result in there being no curse at all, if the performance of the method/algorithm under consideration can be shown to depend only on the intrinsic dimensionality of the data. Alternatively, one may be able to avoid the curse by devising ways to work with the low-dimensional data subspace by using dimensionality reduction techniques on the data. One example of the former case is the results on nearest neighbor search [6, 2], which indicate that the performance of certain nearest-neighbor search algorithms is determined not by the full dimensionality of the feature space, but only by the intrinsic dimensionality of the data subspace.

Riemannian manifolds. In this paper, we will assume that the data subspace is a Riemannian manifold. Riemannian manifolds provide a generalization of the notion of a smooth surface in R^3 to higher dimensions. As first clarified by Gauss in the two-dimensional case (and by Riemann in the general case), it turns out that intrinsic features of the geometry of a surface, such as the lengths of its curves or the intrinsic distances between its points, can be given in terms of the so-called metric tensor¹ g without referring to the particular way the surface is embedded in R^3. 
A space whose geometry is defined in terms of a metric tensor is called a Riemannian manifold (for a rigorous definition, see, e.g., [5, 7, 1]).

Previous work. In [9], Pelletier defines an estimator of a PDF on a Riemannian manifold M by using the distances measured on M via its metric tensor, and obtains the same convergence rate as in (2), with N replaced by the dimensionality of the Riemannian manifold. Thus, if we know that the data lives on a Riemannian manifold M, the convergence rate of this estimator will be determined by the dimensionality of M, instead of the full dimensionality of the feature space in which the data may have been originally sampled. While an interesting generalization of the usual KDE, this approach assumes that the data manifold M is known in advance, and that we have access to certain geometric quantities related to this manifold, such as intrinsic distances between its points and the so-called volume density function. Thus, this Riemannian KDE cannot be used directly in a case where the data lives on an unknown Riemannian submanifold of R^N. 
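For concreteness, a Pelletier-style estimator can be sketched in a case where everything it needs is available in closed form: the unit sphere S^2 ⊂ R^3, where the geodesic distance is d(p, q) = arccos(p · q) and the volume density function is θ_p(q) = sin(d)/d. This is an illustrative sketch, not code from [9], and it substitutes a Gaussian kernel for simplicity:

```python
import numpy as np

def pelletier_kde_sphere(p, data, h):
    """Sketch of a Pelletier-style estimator on the unit sphere S^2 (n = 2):
    f_hat(p) = (1/m) sum_j K(d(p, q_j)/h) / (h^n * theta_p(q_j)),
    using the sphere's closed-form geodesic distance and volume density."""
    m, n = data.shape[0], 2
    d = np.arccos(np.clip(data @ p, -1.0, 1.0))  # geodesic distances on S^2
    theta = np.ones_like(d)                      # volume density sin(d)/d
    mask = d > 1e-12
    theta[mask] = np.sin(d[mask]) / d[mask]
    K = (2 * np.pi) ** (-n / 2) * np.exp(-(d / h) ** 2 / 2)  # Gaussian kernel
    return np.sum(K / theta) / (m * h ** n)
```

For uniform samples on the sphere, the estimate approaches the uniform density 1/(4π) ≈ 0.0796; the point of the sketch is that d and θ_p had to be known analytically, which is exactly what an unknown submanifold does not provide.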
Certain tools from existing nonlinear dimensionality reduction methods could perhaps be utilized to estimate the quantities needed in the estimator of [9]; however, a more straightforward method that directly estimates the density of the data as measured in the subspace is desirable.

Other related works include [13], where the authors propose a submanifold density estimation method that uses a kernel function with a variable covariance but do not present theoretical results; [4], where the author proposes a method for doing density estimation on a Riemannian manifold by using the eigenfunctions of the Laplace-Beltrami operator, which, as in [9], assumes that the manifold is known in advance, together with intricate geometric information pertaining to it; and [10, 11], which discuss various issues related to statistics on a Riemannian manifold.

This paper. In this paper, we propose a direct way to estimate the density of Euclidean data that lives on a Riemannian submanifold of R^N with known dimension n < N. We prove the pointwise consistency of the estimator, and prove bounds on its convergence rates given in terms of the intrinsic dimension of the submanifold the data lives in. This is an example of the avoidance of the curse of dimensionality in the manner mentioned above, by a method whose performance depends on the intrinsic dimensionality of the data instead of the full dimensionality of the feature space. Our method is practical in that it works with Euclidean distances in R^N. In particular, we do not assume any knowledge of the quantities pertaining to the intrinsic geometry of the underlying submanifold, such as its metric tensor, geodesic distances between its points, its volume form, etc.

2 The estimator and its convergence rate

Motivation. In this paper, we are concerned with the estimation of a PDF that lives on an (unknown) n-dimensional Riemannian submanifold M of R^N, where N > n. 
Usual N-dimensional kernel density estimation would not work for this problem, since if interpreted as living on R^N, the underlying PDF would involve a “delta function” that vanishes when one moves away from M, and “becomes infinite” on M in order to have proper normalization. More formally, the N-dimensional probability measure for such an n-dimensional PDF on M will have support only on M, will not be absolutely continuous with respect to the Lebesgue measure on R^N, and will not have a probability density function on R^N. If one attempts to use the usual N-dimensional KDE for data drawn from such a probability measure, the estimator will “try to converge” to a singular PDF, one that is infinite on M, zero outside.

¹The metric tensor can be thought of as giving the “infinitesimal distance” ds between two points whose coordinates differ by the infinitesimal amounts (dy^1, ..., dy^N) as ds^2 = \sum_{ij} g_{ij} dy^i dy^j.

In order to estimate the probability density function on M by using data given in R^N, we propose a simple modification of usual KDE on R^N, namely, to use a kernel that is normalized for n dimensions instead of N, while still using the Euclidean distances in R^N. The intuition behind this approach is based on three facts: 1) For small distances, an n-dimensional Riemannian manifold “looks like” R^n, and densities in R^n should be estimated by an n-dimensional kernel; 2) For points of M that are close enough to each other, the intrinsic distances as measured on M are close to the Euclidean distances as measured in R^N; and 3) For small bandwidths, the main contribution to the estimate at a point comes from data points that are nearby. 
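These three observations lead directly to the estimator studied below: keep the ambient Euclidean distances, but normalize the kernel for n dimensions. A minimal sketch (again with a Gaussian kernel standing in, for illustration, where the theorem below assumes a compactly supported K):

```python
import numpy as np

def submanifold_kde(p, data, h, n):
    """Proposed modification of KDE: Euclidean distances in R^N inside the
    kernel, but normalization 1/(m h^n) for the intrinsic dimension n."""
    m = data.shape[0]
    t = np.linalg.norm(data - p, axis=1) / h            # ambient distances / h
    K = (2 * np.pi) ** (-n / 2) * np.exp(-(t ** 2) / 2)  # normalized for R^n
    return K.sum() / (m * h ** n)
```

For example, for points uniformly distributed on the unit circle embedded in R^2 (intrinsic dimension n = 1), the estimate converges to the uniform intrinsic density 1/(2π) ≈ 0.159, even though the data is 2-dimensional in the ambient sense.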
Thus, as the number of data points increases and the bandwidth is taken to be smaller and smaller, estimating the density by using a kernel normalized for n dimensions and distances as measured in R^N should give a result closer and closer to the correct value.

We will next give the formal definition of the estimator motivated by these considerations, and state our theorem on its asymptotics. As in the original work of Parzen [8], the proof that the estimator is asymptotically unbiased consists of proving that as the bandwidth converges to zero, the kernel function becomes a “delta function”. This result is also used in showing that with an appropriate choice of vanishing rate for the bandwidth, the variance also vanishes asymptotically, hence the estimator is pointwise consistent.

Statement of the theorem. Let M be an n-dimensional, embedded, complete Riemannian submanifold of R^N (n < N) with an induced metric g and injectivity radius r_inj > 0.² Let d(p, q) be the length of a length-minimizing geodesic in M between p, q ∈ M, and let u(p, q) be the straight-line (Euclidean) distance between p and q as measured in R^N. Note that u(p, q) ≤ d(p, q). We will use the notation u_p(q) = u(p, q) and d_p(q) = d(p, q). We will denote the Riemannian volume measure on M by V, and the volume form by dV.

Theorem 2.1. Let f : M → [0, ∞) be a probability density function defined on M (so that the related probability measure is f V), and let K : [0, ∞) → [0, ∞) be a continuous function that vanishes outside [0, 1), is differentiable with a bounded derivative in [0, 1), and satisfies \int_{\|z\| \le 1} K(\|z\|) d^n z = 1. Assume f is differentiable to second order in a neighborhood of p ∈ M, and for a sample q_1, ... 
, q_m of size m drawn from the density f, define an estimator \hat{f}_m(p) of f(p) as,

    \hat{f}_m(p) = \frac{1}{m} \sum_{j=1}^m \frac{1}{h_m^n} K\left( \frac{u_p(q_j)}{h_m} \right),    (3)

where h_m > 0. If h_m satisfies lim_{m→∞} h_m = 0 and lim_{m→∞} m h_m^n = ∞, then there exist non-negative numbers m*, C_b, and C_V such that for all m > m* we have,

    MSE[\hat{f}_m(p)] = E\left[ \left( \hat{f}_m(p) - f(p) \right)^2 \right] < C_b h_m^4 + \frac{C_V}{m h_m^n}.    (4)

If h_m is chosen to be proportional to m^{-1/(n+4)}, this gives E[(\hat{f}_m(p) - f(p))^2] = O(1/m^{4/(n+4)}) as m → ∞.

Thus, the convergence rate of the estimator is given as in [3, 9], with the dimensionality replaced by the intrinsic dimension n of M. The proof will follow from the two lemmas below on the convergence rates of the bias and the variance.

²The injectivity radius r_inj of a Riemannian manifold is a distance such that all geodesic pieces (i.e., curves with zero intrinsic acceleration) of length less than r_inj minimize the length between their endpoints. On a complete Riemannian manifold, there exists a distance-minimizing geodesic between any given pair of points; however, an arbitrary geodesic need not be distance minimizing. For example, any two non-antipodal points on the sphere can be connected with two geodesics with different lengths, namely, the two pieces of the great circle passing through the points. For a detailed discussion of these issues, see, e.g., [1].

3 Preliminary results

The following theorem, which is analogous to Theorem 1A in [8], tells us that up to a constant, the kernel becomes a “delta function” as the bandwidth gets smaller.

Theorem 3.1. 
Let K : [0, ∞) → [0, ∞) be a continuous function that vanishes outside [0, 1) and is differentiable with a bounded derivative in [0, 1), and let ξ : M → R be a function that is differentiable to second order in a neighborhood of p ∈ M. Let

    \xi_h(p) = \frac{1}{h^n} \int_M K\left( \frac{u_p(q)}{h} \right) \xi(q) \, dV(q),    (5)

where h > 0 and dV(q) denotes the Riemannian volume form on M at point q. Then, as h → 0,

    \xi_h(p) - \xi(p) \int_{R^n} K(\|z\|) \, d^n z = O(h^2),    (6)

where z = (z^1, ..., z^n) denotes the Cartesian coordinates on R^n and d^n z = dz^1 ... dz^n denotes the volume form on R^n. In particular, lim_{h→0} \xi_h(p) = \xi(p) \int_{R^n} K(\|z\|) d^n z.

Before proving this theorem, we prove some results on the relation between u_p(q) and d_p(q).

Lemma 3.1. There exist \delta_{u_p} > 0 and M_{u_p} > 0 such that for all q with d_p(q) ≤ \delta_{u_p}, we have,

    d_p(q) ≥ u_p(q) ≥ d_p(q) - M_{u_p} [d_p(q)]^3.    (7)

In particular, lim_{q→p} u_p(q)/d_p(q) = 1.

Proof. Let c_{v_0}(s) be a geodesic in M parametrized by arclength s, with c_{v_0}(0) = p and initial velocity dc_{v_0}/ds |_{s=0} = v_0. When s < r_inj, s is equal to d_p(c_{v_0}(s)) [7, 1]. Now let x_{v_0}(s) be the representation of c_{v_0}(s) in R^N in terms of Cartesian coordinates with the origin at p. We have u_p(c_{v_0}(s)) = \|x_{v_0}(s)\| and \|x'_{v_0}(s)\| = 1, which gives³ x'_{v_0}(s) · x''_{v_0}(s) = 0. Using these we get,

    du_p(c_{v_0}(s))/ds |_{s=0} = 1, and d^2 u_p(c_{v_0}(s))/ds^2 |_{s=0} = 0.

Let M_3 ≥ 0 be an upper bound on the absolute value of the third derivative of u_p(c_{v_0}(s)) for all s ≤ r_inj and all unit length v_0: |d^3 u_p(c_{v_0}(s))/ds^3| ≤ M_3. Taylor's theorem gives u_p(c_{v_0}(s)) = s + R_{v_0}(s), where |R_{v_0}(s)| ≤ M_3 s^3/3!. Thus, (7) holds with M_{u_p} = M_3/3!, for all r < r_inj. For later convenience, instead of \delta_{u_p} = r_inj, we will pick \delta_{u_p} as follows. The polynomial r - M_{u_p} r^3 is monotonically increasing in the interval 0 ≤ r ≤ 1/\sqrt{3 M_{u_p}}. 
We let \delta_{u_p} = min{r_inj, 1/\sqrt{3 M_{u_p}}}, so that r - M_{u_p} r^3 is ensured to be monotonic for 0 ≤ r ≤ \delta_{u_p}.

Definition 3.2. For 0 ≤ r_1 < r_2, let,

    H_p(r_1, r_2) = inf{ u_p(q) : r_1 ≤ d_p(q) < r_2 },    (8)
    H_p(r) = H_p(r, ∞) = inf{ u_p(q) : r ≤ d_p(q) },    (9)

i.e., H_p(r_1, r_2) is the smallest u-distance from p among all points that have a d-distance between r_1 and r_2.

Since M is assumed to be an embedded submanifold, we have H_p(r) > 0 for all r > 0. In the below, we will assume that all radii are smaller than r_inj; in particular, a set of the form {q : r_1 ≤ d_p(q) < r_2} will be assumed to be non-empty and so, due to the completeness of M, to contain a point q ∈ M such that d_p(q) = r_1. Note that,

    H_p(r_1) = min{ H_p(r_1, r_2), H_p(r_2) }.    (10)

Lemma 3.2. H_p(r) is a non-decreasing, non-negative function, and there exist \delta_{H_p} > 0 and M_{H_p} ≥ 0 such that, r ≥ H_p(r) ≥ r - M_{H_p} r^3, for all r < \delta_{H_p}. In particular, lim_{r→0} H_p(r)/r = 1.

³Primes denote differentiation with respect to s.

Proof. H_p(r) is clearly non-decreasing, and H_p(r) ≤ r follows from u_p(q) ≤ d_p(q) and the fact that there exists at least one point q with d_p(q) = r in the set {q : r ≤ d_p(q)}.

Let \delta_{H_p} = H_p(\delta_{u_p}), where \delta_{u_p} is as in the proof of Lemma 3.1, and let r < \delta_{H_p}. Since r < \delta_{H_p} = H_p(\delta_{u_p}) ≤ \delta_{u_p}, by Lemma 3.1 we have, for any q with d_p(q) = r,

    r ≥ u_p(q) ≥ r - M_{u_p} r^3,    (11)

where M_{u_p} > 0 is the constant of Lemma 3.1. Now, since r and r - M_{u_p} r^3 are both monotonic for 0 ≤ r ≤ \delta_{u_p}, we have (see figure)

    r ≥ H_p(r, \delta_{u_p}) ≥ r - M_{u_p} r^3.    (12)

In particular, H_p(r, \delta_{u_p}) ≤ r < \delta_{H_p} = H_p(\delta_{u_p}), i.e., H_p(r, \delta_{u_p}) < H_p(\delta_{u_p}). Using (10), this gives, H_p(r) = H_p(r, \delta_{u_p}). 
Combining this with (12), we get r ≥ H_p(r) ≥ r - M_{u_p} r^3 for all r < \delta_{H_p}.

Next we show that for all small enough h, there exists some radius R_p(h) such that for all points q with d_p(q) ≥ R_p(h), we have u_p(q) ≥ h. R_p(h) will roughly be the inverse function of H_p(r).

Lemma 3.3. For any h < H_p(r_inj), let R_p(h) = sup{ r : H_p(r) ≤ h }. Then, u_p(q) ≥ h for all q with d_p(q) ≥ R_p(h), and there exist \delta_{R_p} > 0 and M_{R_p} > 0 such that for all h ≤ \delta_{R_p}, R_p(h) satisfies,

    h ≤ R_p(h) ≤ h + M_{R_p} h^3.    (13)

In particular, lim_{h→0} R_p(h)/h = 1.

Proof. That u_p(q) ≥ h when d_p(q) ≥ R_p(h) follows from the definitions. In order to show (13), we will use Lemma 3.2. Let \alpha(r) = r - M_{H_p} r^3, where M_{H_p} is as in Lemma 3.2. Then, \alpha(r) is one-to-one and continuous in the interval 0 ≤ r ≤ \delta_{H_p} ≤ \delta_{u_p}. Let \beta = \alpha^{-1} be the inverse function of \alpha in this interval. From the definition of R_p(h) and Lemma 3.2, it follows that h ≤ R_p(h) ≤ \beta(h) for all h ≤ \alpha(\delta_{H_p}). Now, \beta(0) = 0, \beta'(0) = 1, \beta''(0) = 0, so by Taylor's theorem and the fact that the third derivative of \beta is bounded in a neighborhood of 0, there exist \delta_g and M_{R_p} such that \beta(h) ≤ h + M_{R_p} h^3 for all h ≤ \delta_g. Thus,

    h ≤ R_p(h) ≤ h + M_{R_p} h^3,    (14)

for all h ≤ \delta_{R_p}, where \delta_{R_p} = min{\alpha(\delta_{H_p}), \delta_g}.

Proof of Theorem 3.1. We will begin by proving that for small enough h, there is no contribution to the integral in the definition of \xi_h(p) (see (5)) from outside the coordinate patch covered by normal coordinates.⁴

Let h_0 > 0 be such that R_p(h_0) < r_inj (such an h_0 exists since lim_{h→0} R_p(h) = 0). For any h ≤ h_0, all points q with d_p(q) > r_inj will satisfy u_p(q) > h. 
This means that if h is small enough, K(u_p(q)/h) = 0 for all points outside the injectivity radius, and we can perform the integral in (5) solely in the patch of normal coordinates at p.

For normal coordinates y = (y^1, ..., y^n) around the point p with y(p) = 0, we have d_p(q) = \|y(q)\| [7, 1]. With slight abuse of notation, we will write u_p(y(q)) = u_p(q), \xi(y(q)) = \xi(q), and g(q) = g(y(q)), where g is the metric tensor of M.

Since K(u_p(q)/h) = 0 for all q with d_p(q) > R_p(h), we have,

    \xi_h(p) = \frac{1}{h^n} \int_{\|y\| \le R_p(h)} K\left( \frac{u_p(y)}{h} \right) \xi(y) \sqrt{g(y)} \, dy^1 ... dy^n,    (15)

⁴Normal coordinates at a point p in a Riemannian manifold are a close approximation to Cartesian coordinates, in the sense that the components of the metric have vanishing first derivatives at p, and g_ij(p) = \delta_ij [1]. Normal coordinates can be defined in a “geodesic ball” of radius less than r_inj.

where g denotes the determinant of g as calculated in normal coordinates. Changing the variable of integration to z = y/h, we get,

    \xi_h(p) - \xi(p) \int K(\|z\|) d^n z
      = \int_{\|z\| \le R_p(h)/h} K\left( \frac{u_p(zh)}{h} \right) \xi(zh) \sqrt{g(zh)} \, d^n z - \xi(0) \int_{\|z\| \le 1} K(\|z\|) \, d^n z
      = \int_{\|z\| \le 1} K\left( \frac{u_p(zh)}{h} \right) \xi(zh) \left( \sqrt{g(zh)} - 1 \right) d^n z
      + \int_{\|z\| \le 1} \left( K\left( \frac{u_p(zh)}{h} \right) - K(\|z\|) \right) \xi(zh) \, d^n z
      + \int_{\|z\| \le 1} K(\|z\|) \left( \xi(zh) - \xi(0) \right) d^n z
      + \int_{1 \le \|z\| \le R_p(h)/h} K\left( \frac{u_p(zh)}{h} \right) \xi(zh) \sqrt{g(zh)} \, d^n z