{"title": "Random Projection Trees Revisited", "book": "Advances in Neural Information Processing Systems", "page_first": 496, "page_last": 504, "abstract": "The Random Projection Tree (RPTree) structures proposed in [Dasgupta-Freund-STOC-08] are space partitioning data structures that automatically adapt to various notions of intrinsic dimensionality of data. We prove new results for both the RPTree-Max and the RPTree-Mean data structures. Our result for RPTree-Max gives a near-optimal bound on the number of levels required by this data structure to reduce the size of its cells by a factor s >= 2. We also prove a packing lemma for this data structure. Our final result shows that low-dimensional manifolds possess bounded Local Covariance Dimension. As a consequence we show that RPTree-Mean adapts to manifold dimension as well.", "full_text": "Random Projection Trees Revisited\n\nAman Dhesi\u2217\n\nPurushottam Kar\n\nDepartment of Computer Science\n\nDepartment of Computer Science and Engineering\n\nPrinceton University\n\nPrinceton, New Jersey, USA.\nadhesi@princeton.edu\n\nIndian Institute of Technology\nKanpur, Uttar Pradesh, INDIA.\n\npurushot@cse.iitk.ac.in\n\nAbstract\n\nThe Random Projection Tree (RPTREE) structures proposed in [1] are space par-\ntitioning data structures that automatically adapt to various notions of intrinsic\ndimensionality of data. We prove new results for both the RPTREE-MAX and\nthe RPTREE-MEAN data structures. Our result for RPTREE-MAX gives a near-\noptimal bound on the number of levels required by this data structure to reduce\nthe size of its cells by a factor s \u2265 2. We also prove a packing lemma for this data\nstructure. Our \ufb01nal result shows that low-dimensional manifolds have bounded\nLocal Covariance Dimension. 
As a consequence we show that RPTREE-MEAN adapts to manifold dimension as well.\n\n1 Introduction\n\nThe Curse of Dimensionality [2] has inspired research in several directions in Computer Science and has led to the development of several novel techniques such as dimensionality reduction, sketching, etc. Almost all these techniques try to map data to lower dimensional spaces while approximately preserving useful information. However, most of these techniques do not assume anything about the data other than that it is embedded in some high dimensional Euclidean space endowed with some distance/similarity function.\n\nAs it turns out, in many situations the data is not simply scattered in the Euclidean space in a random fashion. Often, generative processes impose (non-linear) dependencies on the data that restrict the degrees of freedom available and result in the data having low intrinsic dimensionality. There exist several formalizations of this concept of intrinsic dimensionality. For example, [1] provides an excellent example of automated motion capture in which a large number of points on the body of an actor are sampled through markers and their coordinates transferred to an animated avatar. Now, although a large sample of points is required to ensure a faithful recovery of all the motions of the body (which causes each captured frame to lie in a very high dimensional space), these points are nevertheless constrained by the degrees of freedom offered by the human body, which are very few.\n\nAlgorithms that try to exploit such non-linear structure in data have been studied extensively, resulting in a large number of Manifold Learning algorithms, for example [3, 4, 5]. These techniques typically assume knowledge about the manifold itself or the data distribution. 
For example, [4] and [5] require knowledge about the intrinsic dimensionality of the manifold whereas [3] requires a sampling of points that is \u201csufficiently\u201d dense with respect to some manifold parameters.\n\nRecently in [1], Dasgupta and Freund proposed space partitioning algorithms that adapt to the intrinsic dimensionality of data and do not assume explicit knowledge of this parameter. Their data structures are akin to the k-d tree structure and offer guaranteed reduction in the size of the cells after a bounded number of levels. Such a size reduction is of immense use in vector quantization [6] and regression [7]. Two such tree structures are presented in [1] \u2013 each adapting to a different notion of intrinsic dimensionality. Both variants have already found numerous applications in regression [7], spectral clustering [8], face recognition [9] and image super-resolution [10].\n\n\u2217Work done as an undergraduate student at IIT Kanpur\n\n1.1 Contributions\n\nThe RPTREE structures are new entrants in a large family of space partitioning data structures such as k-d trees [11], BBD trees [12], BAR trees [13] and several others (see [14] for an overview). The typical guarantees given by these data structures are of the following types:\n\n1. Space Partitioning Guarantee: There exists a bound L(s), s \u2265 2, on the number of levels one has to go down before all descendants of a node of size \u2206 are of size \u2206/s or less. The size of a cell is variously defined as the length of the longest side of the cell (for box-shaped cells), the radius of the cell, etc.\n\n2. 
Bounded Aspect Ratio: There exists a certain \u201croundedness\u201d to the cells of the tree \u2013 this notion is variously defined as the ratio of the length of the longest to the shortest side of the cell (for box-shaped cells), the ratio of the radius of the smallest circumscribing ball of the cell to that of the largest ball that can be inscribed in the cell, etc.\n\n3. Packing Guarantee: Given a fixed ball B of radius R and a size parameter r, there exists a bound on the number of disjoint cells of the tree that are of size greater than r and intersect B. Such bounds are usually arrived at by first proving a bound on the aspect ratio for cells of the tree.\n\nThese guarantees play a crucial role in algorithms for fast approximate nearest neighbor searches [12] and clustering [15]. We present new results for the RPTREE-MAX structure for all these types of guarantees. We first present a bound on the number of levels required for size reduction by any given factor in an RPTREE-MAX. Our result improves the bound obtainable from results presented in [1]. Next, we prove an \u201ceffective\u201d aspect ratio bound for RPTREE-MAX. Given the randomized nature of the data structure it is difficult to directly bound the aspect ratios of all the cells. Instead we prove a weaker result that can nevertheless be exploited to give a packing lemma of the kind mentioned above. More specifically, given a ball B, we prove an aspect ratio bound for the smallest cell in the RPTREE-MAX that completely contains B.\n\nOur final result concerns the RPTREE-MEAN data structure. The authors in [1] prove that this structure adapts to the Local Covariance Dimension of data (see Section 5 for a definition). By showing that low-dimensional manifolds have bounded local covariance dimension, we show its adaptability to the manifold dimension as well. 
Our result demonstrates the robustness of the notion of manifold dimension \u2013 a notion that is able to connect to a geometric notion of dimensionality such as the doubling dimension (proved in [1]) as well as a statistical notion such as the Local Covariance Dimension (this paper).\n\nDue to lack of space we relegate some proofs to the Supplementary Material document and present proofs of only the main theorems here. All results cited from other papers are presented as Facts in this paper. We will denote by B(x, r) a closed ball of radius r centered at x. We will denote by d the intrinsic dimensionality of data and by D the ambient dimensionality (typically d \u226a D).\n\n2 The RPTREE-MAX structure\n\nThe RPTREE-MAX structure adapts to the doubling dimension of data (see definition below). Since low-dimensional manifolds have low doubling dimension (see [1] Theorem 22), the structure adapts to manifold dimension as well.\n\nDefinition 1. The doubling dimension of a set S \u2282 R^D is the smallest integer d such that for any ball B(x, r) \u2282 R^D, the set B(x, r) \u2229 S can be covered by 2^d balls of radius r/2.\n\nThe RPTREE-MAX algorithm is given data embedded in R^D having doubling dimension d. The algorithm splits data lying in a cell C of radius \u2206 by first choosing a random direction v \u2208 R^D, projecting all the data inside C onto that direction, choosing a random value \u03b4 in the range [\u22121, 1] \u00b7 6\u2206/\u221aD, and then assigning a data point x to the left child if x \u00b7 v < median({z \u00b7 v : z \u2208 C}) + \u03b4 and to the right child otherwise. Since it is difficult to get the exact value of the radius of a data set, the algorithm settles for a constant factor approximation to the value by choosing an arbitrary data point x \u2208 C and using the estimate \u02dc\u2206 = max({||x \u2212 y|| : y \u2208 C}).\n\nThe following result is proven in [1]:\n\nFact 2 (Theorem 3 in [1]). 
There is a constant c1 with the following property. Suppose an RPTREE-MAX is built using a data set S \u2282 R^D. Pick any cell C in the RPTREE-MAX; suppose that S \u2229 C has doubling dimension \u2264 d. Then with probability at least 1/2 (over the randomization in constructing the subtree rooted at C), every descendant C\u2032 more than c1\u00b7d log d levels below C has radius(C\u2032) \u2264 radius(C)/2.\n\nIn Sections 2, 3 and 4 we shall always assume that the data has doubling dimension d and shall not state this explicitly each time. Let us consider extensions of this result to bound the number of levels it takes for the size of all descendants to go down by a factor s > 2, starting with the case s = 4. Starting off in a cell C of radius \u2206, we are assured of a reduction in size by a factor of 2 after c1\u00b7d log d levels. Hence all 2^(c1\u00b7d log d) nodes at this level have radius \u2206/2 or less. Now we expect that after c1\u00b7d log d more levels, the size should go down further by a factor of 2, thereby giving us our desired result. However, given the large number of nodes at this level and the fact that the success probability in Fact 2 is just a constant bounded away from 1, it is not possible to argue that after c1\u00b7d log d more levels the descendants of all these 2^(c1\u00b7d log d) nodes will be of radius \u2206/4 or less. It turns out that this can be remedied by utilizing the following extension of the basic size reduction result in [1]. We omit the proof of this extension.\n\nFact 3 (Extension of Theorem 3 in [1]). 
For any \u03b4 > 0, with probability at least 1 \u2212 \u03b4, every descendant C\u2032 which is more than c1\u00b7d log d + log(1/\u03b4) levels below C has radius(C\u2032) \u2264 radius(C)/2.\n\nThis gives us a way to boost the confidence and do the following: go down L = c1\u00b7d log d + 2 levels from C to get the radius of all the 2^(c1\u00b7d log d + 2) descendants down to \u2206/2 with confidence 1 \u2212 1/4. Afterward, go an additional L\u2032 = c1\u00b7d log d + L + 2 levels from each of these descendants so that for any cell at level L, the probability of it having a descendant of radius > \u2206/4 after L\u2032 levels is less than 1/(4\u00b72^L). Hence conclude with confidence at least 1 \u2212 1/4 \u2212 (1/(4\u00b72^L))\u00b72^L \u2265 1/2 that all descendants of C after 2L + c1\u00b7d log d + 2 levels have radius \u2264 \u2206/4. This gives a way to prove the following result:\n\nTheorem 4. There is a constant c2 with the following property. For any s \u2265 2, with probability at least 1 \u2212 1/4, every descendant C\u2032 which is more than c2\u00b7s\u00b7d log d levels below C has radius(C\u2032) \u2264 radius(C)/s.\n\nProof. Refer to Supplementary Material.\n\nNotice that the dependence on the factor s is linear in the above result whereas one expects it to be logarithmic. Indeed, typical space partitioning algorithms such as k-d trees do give such guarantees. The first result we prove in the next section is a bound on the number of levels that is poly-logarithmic in the size reduction factor s.\n\n3 A generalized size reduction lemma for RPTREE-MAX\n\nIn this section we prove the following theorem:\n\nTheorem 5 (Main). There is a constant c3 with the following property. Suppose an RPTREE-MAX is built using a data set S \u2282 R^D. Pick any cell C in the RPTREE-MAX; suppose that S \u2229 C has doubling dimension \u2264 d. 
Then for any s \u2265 2, with probability at least 1 \u2212 1/4 (over the randomization in constructing the subtree rooted at C), for every descendant C\u2032 which is more than c3 \u00b7 log s \u00b7 d log sd levels below C, we have radius(C\u2032) \u2264 radius(C)/s.\n\nCompared to this, data structures such as [12] give deterministic guarantees for such a reduction in D log s levels, which can be shown to be optimal (see [1] for an example). Thus our result is optimal but for a logarithmic factor. Moving on with the proof, let us consider a cell C of radius \u2206 in the RPTREE-MAX that contains a dataset S having doubling dimension \u2264 d. Then for any \u01eb > 0, a repeated application of Definition 1 shows that S can be covered using at most 2^(d log(1/\u01eb)) balls of radius \u01eb\u2206. We will cover S \u2229 C using balls of radius \u2206/960s\u221ad so that O((sd)^d) balls suffice. Now consider all pairs of these balls, the distance between whose centers is \u2265 \u2206/s \u2212 \u2206/960s\u221ad.\n\nFigure 1 (panels: neutral split, good split, bad split): Balls B1 and B2 are of radius \u2206/s\u221ad and their centers are \u2206/s \u2212 \u2206/s\u221ad apart.\n\nIf random splits separate data from all such pairs of balls, i.e. for no pair does any cell contain data from both balls of the pair, then each resulting cell would only contain data from pairs whose centers are closer than \u2206/s \u2212 \u2206/960s\u221ad. Thus the radius of each such cell would be at most \u2206/s.\n\nWe fix such a pair of balls, calling them B1 and B2. A split in the RPTREE-MAX is said to be good with respect to this pair if it sends points inside B1 to one child of the cell in the RPTREE-MAX and points inside B2 to the other, bad if it sends points from both balls to both children, and neutral otherwise (see Figure 1). We have the following properties of a random split:\n\nLemma 6. 
Let B = B(x, \u03b4) be a ball contained inside an RPTREE-MAX cell of radius \u2206 that contains a dataset S of doubling dimension d. Let us say that a random split splits this ball if the split separates the data inside B into two parts. Then a random split of the cell splits B with probability at most 3\u03b4\u221ad/\u2206.\n\nProof. Refer to Supplementary Material.\n\nLemma 7. Let B1 and B2 be a pair of balls as described above contained in the cell C that contains data of doubling dimension d. Then a random split of the cell is a good split with respect to this pair with probability at least 1/56s.\n\nProof. Refer to Supplementary Material. Proof similar to that of Lemma 9 of [1].\n\nLemma 8. Let B1 and B2 be a pair of balls as described above contained in the cell C that contains data of doubling dimension d. Then a random split of the cell is a bad split with respect to this pair with probability at most 1/320s.\n\nProof. The proof of a similar result in [1] uses a conditional probability argument. However, the technique does not work here since we require a bound that is inversely proportional to s. We instead make a simple observation: the probability of a bad split is upper bounded by the probability that one of the balls is split, since for any two events A and B, P[A \u2229 B] \u2264 min{P[A], P[B]}. The result then follows from an application of Lemma 6.\n\nWe are now in a position to prove Theorem 5. What we will prove is that starting with a pair of balls in a cell C, the probability that some cell k levels below has data from both the balls is exponentially small in k. Thus, after going down enough levels we can take a union bound over all pairs of balls whose centers are well separated (which are O((sd)^(2d)) in number) and conclude the proof.\n\nProof. (of Theorem 5) Consider a cell C of radius \u2206 in the RPTREE-MAX and fix a pair of balls contained inside C with radii \u2206/960s\u221ad and centers separated by at least \u2206/s \u2212 \u2206/960s\u221ad. Let p^i_j denote the probability that a cell i levels below C has a descendant j levels below itself that contains data points from both the balls. Then the following holds:\n\nLemma 9. p^0_k \u2264 (1 \u2212 1/68s)^l \u00b7 p^l_(k\u2212l).\n\nProof. Refer to Supplementary Material. Proof similar to that of Lemma 11 of [1].\n\nNote that this gives us p^0_k \u2264 (1 \u2212 1/68s)^k as a corollary. However, using this result would require us to go down k = \u2126(sd log(sd)) levels before p^0_k = 1/\u2126((sd)^(2d)), which results in a bound that is worse (by a factor logarithmic in s) than the one given by Theorem 4. This can be attributed to the small probability of a good split for a tiny pair of balls in large cells. However, here we are completely neglecting the fact that as we go down the levels, the radii of cells go down as well and good splits become more frequent.\n\nIndeed, setting s = 2 in Lemmas 7 and 8 tells us that if the pair of balls were to be contained in a cell of radius \u2206/(s/2) then the good and bad split probabilities are 1/112 and 1/640 respectively. This paves the way for an inductive argument: assume that with probability > 1 \u2212 1/4, in L(s) levels the size of all descendants goes down by a factor s. Denote by p^l_g the probability of a good split in a cell at depth l and by p^l_b the corresponding probability of a bad split. Set l\u2217 = L(s/2) and let E be the event that the radius of every cell at level l\u2217 is less than \u2206/(s/2). Let C\u2032 represent a cell at depth l\u2217. 
Then,\n\np^(l\u2217)_g \u2265 P[good split in C\u2032 | E] \u00b7 P[E] \u2265 (1/112) \u00b7 (1 \u2212 1/4) \u2265 1/150,\n\np^(l\u2217)_b = P[bad split in C\u2032 | E] \u00b7 P[E] + P[bad split in C\u2032 | \u00acE] \u00b7 P[\u00acE] \u2264 (1/640) \u00b7 1 + (1/640) \u00b7 (1/4) \u2264 1/512.\n\nNotice that now, for any m > 0, we have p^(l\u2217)_m \u2264 (1 \u2212 1/213)^m. Thus, for some constant c4, setting k = l\u2217 + c4\u00b7d log(sd) and applying Lemma 9 gives us p^0_k \u2264 (1 \u2212 1/68s)^(l\u2217) \u00b7 (1 \u2212 1/213)^(c4\u00b7d log(sd)) \u2264 1/(4(sd)^(2d)). Thus we have\n\nL(s) \u2264 L(s/2) + c4\u00b7d log(sd)\n\nwhich gives us the desired result on solving the recurrence, i.e. L(s) = O(d log s log sd).\n\n4 A packing lemma for RPTREE-MAX\n\nIn this section we prove a probabilistic packing lemma for RPTREE-MAX. A formal statement of the result follows:\n\nTheorem 10 (Main). Given any fixed ball B(x, R) \u2282 R^D, with probability greater than 1/2 (where the randomization is over the construction of the RPTREE-MAX), the number of disjoint RPTREE-MAX cells of radius greater than r that intersect B is at most (R/r)^O(d log d log(dR/r)).\n\nData structures such as BBD-trees give a bound of the form O((R/r)^D), which behaves like (R/r)^O(1) for fixed D. In comparison, our result behaves like (R/r)^O(log(R/r)) for fixed d. We will prove the result in two steps: first of all we will show that with high probability, the ball B will be completely inscribed in an RPTREE-MAX cell C of radius no more than O(Rd\u221ad log d). Thus the number of disjoint cells of radius at least r that intersect this ball is bounded by the number of descendants of C with this radius. 
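As a rough numerical illustration of this counting strategy, the following sketch (our own, not the paper's code) composes the two steps with every constant hidden in the O(\u00b7) notation set to 1 \u2013 an assumption made purely for illustration:

```python
# Back-of-the-envelope sketch of the packing bound of Theorem 10, with all
# O(.) constants set to 1 (an illustrative assumption, not the paper's values).
import math

def enclosing_cell_radius(R, d):
    # Theorem 12: a fixed ball of radius R lies in a cell of radius
    # O(R * d * sqrt(d) * log d) with constant probability
    return R * d * math.sqrt(d) * math.log(d)

def packing_bound(R, r, d):
    R_prime = enclosing_cell_radius(R, d)
    # Theorem 5: O(log s * d * log(s d)) levels shrink cells by a factor s
    levels = math.log2(R_prime / r) * d * math.log2(d * R_prime / r)
    # a binary tree has at most 2^levels cells that many levels down
    return 2 ** levels

bound = packing_bound(R=16.0, r=1.0, d=2)
```

The helper only illustrates how the bound scales with R/r and d; the true constants come from Theorems 5 and 12.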
To bound this number we then invoke Theorem 5 and conclude the proof.\n\n4.1 An effective aspect ratio bound for RPTREE-MAX cells\n\nIn this section we prove an upper bound on the radius of the smallest RPTREE-MAX cell that completely contains a given ball B of radius R. Note that this effectively bounds the aspect ratio of this cell. Consider any cell C of radius \u2206 that contains B. We proceed with the proof by first showing that the probability that B will be split before it lands up in a cell of radius \u2206/2 is at most a quantity inversely proportional to \u2206. Note that we are not interested in all descendants of C \u2013 only the ones that contain B. That is why we argue differently here. We consider balls of radius \u2206/512\u221ad surrounding B at a distance of \u2206/2 (see Figure 2). These balls are made to cover the annulus centered at B of mean radius \u2206/2 and thickness \u2206/512\u221ad \u2013 clearly d^O(d) balls suffice. Without loss of generality assume that the centers of all these balls lie in C.\n\nFigure 2 (panels: useless split, useful split): Balls Bi are of radius \u2206/512\u221ad and their centers are \u2206/2 far from the center of B.\n\nNotice that if B gets separated from all these balls without getting split in the process then it will lie in a cell of radius < \u2206/2. Fix a Bi and call a random split of the RPTREE-MAX useful if it separates B from Bi and useless if it splits B. Using a proof technique similar to that used in Lemma 7 we can show that the probability of a useful split is at least 1/192, whereas Lemma 6 tells us that the probability of a useless split is at most 3R\u221ad/\u2206.\n\nLemma 11. There exists a constant c5 such that the probability of a ball of radius R in a cell of radius \u2206 getting split before it lands up in a cell of radius \u2206/2 is at most c5\u00b7Rd\u221ad log d/\u2206.\n\nProof. 
Refer to Supplementary Material.\n\nWe now state our result on the \u201ceffective\u201d bound on aspect ratios of RPTREE-MAX cells.\n\nTheorem 12. There exists a constant c6 such that with probability > 1 \u2212 1/4, a given (fixed) ball B of radius R will be completely inscribed in an RPTREE-MAX cell C of radius no more than c6 \u00b7 Rd\u221ad log d.\n\nProof. Refer to Supplementary Material.\n\nProof. (of Theorem 10) Given a ball B of radius R, Theorem 12 shows that with probability at least 3/4, B will lie in a cell C of radius at most R\u2032 = O(Rd\u221ad log d). Hence all cells of radius at least r that intersect this ball must be either descendants or ancestors of C. Since we want an upper bound on the largest number of such disjoint cells, it suffices to count the number of descendants of C of radius no less than r. We know from Theorem 5 that with probability at least 3/4, in log(R\u2032/r)\u00b7d log(dR\u2032/r) levels the radius of all cells must go below r. The result follows by observing that the RPTREE-MAX is a binary tree and hence the number of such descendants can be at most 2^(log(R\u2032/r)\u00b7d log(dR\u2032/r)). The success probability is at least (3/4)^2 > 1/2.\n\nFigure 3 (labels: M, Tp(M), p): Locally, almost all the energy of the data is concentrated in the tangent plane.\n\n5 Local covariance dimension of a smooth manifold\n\nThe second variant of RPTREE, namely RPTREE-MEAN, adapts to the local covariance dimension (see definition below) of data. We do not go into the details of the guarantees presented in [1] due to lack of space. Informally, the guarantee is of the following kind: given data that has small local covariance dimension, on expectation, a data point in a cell of radius r in the RPTREE-MEAN will be contained in a cell of radius c7 \u00b7 r in the next level, for some constant c7 < 1. The randomization is over the construction of RPTREE-MEAN as well as the choice of the data point. 
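The local covariance dimension condition that drives this guarantee (Definition 13, stated shortly) can be checked empirically. The following sketch is our own illustration, not the paper's code; it checks the condition along coordinate axes only (an assumption \u2013 Definition 13 allows an arbitrary isometry first), on data lying on a 1-dimensional curve in R^3:

```python
# Empirical illustration of the local covariance dimension condition: within
# a small ball, do some d diagonal entries of the covariance matrix capture a
# (1 - eps) fraction of its trace? We check coordinate axes directly, i.e.
# we assume the identity isometry (illustrative simplification).
import math
import random

def covariance_diagonal(points):
    """Diagonal of the covariance matrix of a list of D-dimensional tuples."""
    D = len(points[0])
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(D)]
    return [sum((p[i] - mean[i]) ** 2 for p in points) / n for i in range(D)]

def has_local_cov_dim(points, d, eps):
    """True if the d largest diagonal entries hold (1 - eps) of the trace."""
    diag = sorted(covariance_diagonal(points), reverse=True)
    return sum(diag[:d]) >= (1.0 - eps) * sum(diag)

# points on a short arc of the unit circle embedded in R^3: locally, one
# direction (the tangent) should dominate the energy
rng = random.Random(0)
arc = [(math.cos(t), math.sin(t), 0.0)
       for t in [rng.uniform(-0.05, 0.05) for _ in range(500)]]
flat_enough = has_local_cov_dim(arc, d=1, eps=0.1)
```

On such a small neighborhood the tangent direction carries almost all of the variance, in line with the intuition behind Theorem 14 below.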
This gives per-level improvement albeit in expectation, whereas RPTREE-MAX gives improvement in the worst case but after a certain number of levels.\n\nWe will prove that a d-dimensional Riemannian submanifold M of R^D has bounded local covariance dimension, thus proving that RPTREE-MEAN adapts to manifold dimension as well.\n\nDefinition 13. A set S \u2282 R^D has local covariance dimension (d, \u01eb, r) if there exists an isometry M of R^D under which the set S, when restricted to any ball of radius r, has a covariance matrix for which some d diagonal elements contribute a (1 \u2212 \u01eb) fraction of its trace.\n\nThis is a more general definition than the one presented in [1], which expects the top d eigenvalues of the covariance matrix to account for a (1 \u2212 \u01eb) fraction of its trace. However, all that [1] requires for the guarantees of RPTREE-MEAN to hold is that there exist d orthonormal directions such that a (1 \u2212 \u01eb) fraction of the energy of the dataset, i.e. \u2211x\u2208S ||x \u2212 mean(S)||^2, is contained in those d dimensions. This is trivially true when M is a d-dimensional affine set. However we also expect that for small neighborhoods on smooth manifolds, most of the energy would be concentrated in the tangent plane at a point in that neighborhood (see Figure 3). Indeed, we can show the following:\n\nTheorem 14 (Main). Given a data set S \u2282 M where M is a d-dimensional Riemannian manifold with condition number \u03c4, then for any \u01eb \u2264 1/4, S has local covariance dimension (d, \u01eb, \u221a\u01eb\u03c4/3).\n\nFor manifolds, the local curvature decides how small a neighborhood one should take in order to expect a sense of \u201cflatness\u201d in the non-linear surface. This is quantified using the Condition Number \u03c4 of M (introduced in [16]) which restricts the amount by which the manifold can curve locally. The condition number is related to more prevalent notions of local curvature such as the second fundamental form [17] in that the inverse of the condition number upper bounds the norm of the second fundamental form [16]. Informally, if we restrict ourselves to regions of the manifold of radius \u03c4 or less, then we get the requisite flatness properties. This is formalized in [16] as follows. For any hyperplane T \u2282 R^D and a vector v \u2208 R^D, let v\u2225(T) denote the projection of v onto T.\n\nFact 15 (Implicit in Lemma 5.3 of [16]). Suppose M is a Riemannian manifold with condition number \u03c4. For any p \u2208 M and r \u2264 \u221a\u01eb\u03c4, \u01eb \u2264 1/4, let M\u2032 = B(p, r) \u2229 M. Let T = Tp(M) be the tangent space at p. Then for any x, y \u2208 M\u2032, ||x\u2225(T) \u2212 y\u2225(T)||^2 \u2265 (1 \u2212 \u01eb)||x \u2212 y||^2.\n\nThis already seems to give us what we want \u2013 a large fraction of the length between any two points on the manifold lies in the tangent plane, i.e. in d dimensions. However, in our case we have to show that for some d-dimensional plane P, \u2211x\u2208S ||(x \u2212 \u00b5)\u2225(P)||^2 > (1 \u2212 \u01eb)\u2211x\u2208S ||x \u2212 \u00b5||^2, where \u00b5 = mean(S). The problem is that we cannot apply Fact 15 since there is no surety that the mean will lie on the manifold itself. However it turns out that certain points on the manifold can act as \u201cproxies\u201d for the mean and provide a workaround to the problem.\n\nProof. (of Theorem 14) Refer to Supplementary Material.\n\n6 Conclusion\n\nIn this paper we considered the two random projection trees proposed in [1]. 
For the RPTREE-MAX data structure, we provided an improved bound (Theorem 5) on the number of levels required to decrease the size of the tree cells by any factor s \u2265 2. However, the bound we proved is poly-logarithmic in s. It would be nice if this could be brought down to logarithmic, since that would directly improve the packing lemma (Theorem 10) as well. More specifically, the packing bound would become (R/r)^O(1) instead of (R/r)^O(log(R/r)) for fixed d.\n\nAs far as dependence on d is concerned, there is room for improvement in the packing lemma. We have shown that the smallest cell in the RPTREE-MAX that completely contains a fixed ball B of radius R has an aspect ratio no more than O(d\u221ad log d), since it has a ball of radius R inscribed in it and can be circumscribed by a ball of radius no more than O(Rd\u221ad log d). Any improvement in the aspect ratio of the smallest cell that contains a given ball will also directly improve the packing lemma.\n\nMoving on to our results for the RPTREE-MEAN, we demonstrated that it adapts to manifold dimension as well. However, the constants involved in our guarantee are pessimistic. For instance, the radius parameter in the local covariance dimension is given as \u221a\u01eb\u03c4/3 \u2013 this can be improved to \u221a\u01eb\u03c4/2 if one can show that there will always exist a point q \u2208 B(x0, r) \u2229 M at which the function g : x \u2208 M \u21a6 ||x \u2212 \u00b5|| attains a local extremum.\n\nWe conclude with a word on the applications of our results. As we already mentioned, packing lemmas and size reduction guarantees for arbitrary factors are typically used in applications for nearest neighbor searching and clustering. However, these applications (viz. [12], [15]) also require that the tree have bounded depth. 
The RPTREE-MAX is a pure space partitioning data structure that can be coerced by an adversarial placement of points into being a primarily left-deep or right-deep tree having depth \u2126(n), where n is the number of data points.\n\nExisting data structures such as BBD trees remedy this by alternating space partitioning splits with data partitioning splits. Thus every alternate split is forced to send at most a constant fraction of the points into either of the children, ensuring a depth that is logarithmic in the number of data points. A similar technique is used in [7] to bound the depth of the version of RPTREE-MAX used in that paper. However it remains to be seen if the same trick can be used to bound the depth of RPTREE-MAX while maintaining the packing guarantees, because although such \u201cspace partitioning\u201d splits do not seem to hinder Theorem 5, they do hinder Theorem 10 (more specifically, they hinder Lemma 11).\n\nWe leave open the question of a possible augmentation of the RPTREE-MAX structure, or a better analysis, that can simultaneously give the following guarantees:\n\n1. Bounded Depth: depth of the tree should be o(n), preferably (log n)^O(1)\n\n2. Packing Guarantee: of the form (R/r)^((d log(R/r))^O(1))\n\n3. Space Partitioning Guarantee: assured size reduction by factor s in (d log s)^O(1) levels\n\nAcknowledgments\n\nThe authors thank James Lee for pointing out an incorrect usage of the term Assouad dimension in a previous version of the paper. Purushottam Kar thanks Chandan Saha for several fruitful discussions and for his help with the proofs of Theorems 5 and 10. Purushottam is supported by the Research I Foundation of the Department of Computer Science and Engineering, IIT Kanpur.\n\nReferences\n\n[1] Sanjoy Dasgupta and Yoav Freund. Random Projection Trees and Low dimensional Manifolds. 
In 40th Annual ACM Symposium on Theory of Computing, pages 537\u2013546, 2008.\n\n[2] Piotr Indyk and Rajeev Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In 30th Annual ACM Symposium on Theory of Computing, pages 604\u2013613, 1998.\n\n[3] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319\u20132323, 2000.\n\n[4] Piotr Indyk and Assaf Naor. Nearest-Neighbor-Preserving Embeddings. ACM Transactions on Algorithms, 3, 2007.\n\n[5] Richard G. Baraniuk and Michael B. Wakin. Random Projections of Smooth Manifolds. Foundations of Computational Mathematics, 9(1):51\u201377, 2009.\n\n[6] Yoav Freund, Sanjoy Dasgupta, Mayank Kabra, and Nakul Verma. Learning the structure of manifolds using random projections. In Twenty-First Annual Conference on Neural Information Processing Systems, 2007.\n\n[7] Samory Kpotufe. Escaping the curse of dimensionality with a tree-based regressor. In 22nd Annual Conference on Learning Theory, 2009.\n\n[8] Donghui Yan, Ling Huang, and Michael I. Jordan. Fast Approximate Spectral Clustering. In 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 907\u2013916, 2009.\n\n[9] John Wright and Gang Hua. Implicit Elastic Matching with Random Projections for Pose-Variant Face Recognition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1502\u20131509, 2009.\n\n[10] Jian Pu, Junping Zhang, Peihong Guo, and Xiaoru Yuan. Interactive Super-Resolution through Neighbor Embedding. In 9th Asian Conference on Computer Vision, pages 496\u2013505, 2009.\n\n[11] Jon Louis Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Communications of the ACM, 18(9):509\u2013517, 1975.\n\n[12] Sunil Arya, David M. Mount, Nathan S. Netanyahu, Ruth Silverman, and Angela Y. Wu. An Optimal Algorithm for Approximate Nearest Neighbor Searching Fixed Dimensions. Journal of the ACM, 45(6):891\u2013923, 1998.\n\n[13] Christian A. Duncan, Michael T. Goodrich, and Stephen G. Kobourov. Balanced Aspect Ratio Trees: Combining the Advantages of k-d Trees and Octrees. Journal of Algorithms, 38(1):303\u2013333, 2001.\n\n[14] Hanan Samet. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers, 2005.\n\n[15] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3):89\u2013112, 2004.\n\n[16] Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the Homology of Submanifolds with High Confidence from Random Samples. Discrete & Computational Geometry, 39(1-3):419\u2013441, 2008.\n\n[17] Sebasti\u00e1n Montiel and Antonio Ros. Curves and Surfaces, volume 69 of Graduate Studies in Mathematics. American Mathematical Society and Real Sociedad Matem\u00e1tica Espa\u00f1ola, 2005.", "award": [], "sourceid": 125, "authors": [{"given_name": "Aman", "family_name": "Dhesi", "institution": null}, {"given_name": "Purushottam", "family_name": "Kar", "institution": null}]}