{"title": "Cluster Trees on Manifolds", "book": "Advances in Neural Information Processing Systems", "page_first": 2679, "page_last": 2687, "abstract": "We investigate the problem of estimating the cluster tree for a density $f$ supported on or near a smooth $d$-dimensional manifold $M$ isometrically embedded in $\\mathbb{R}^D$. We study a $k$-nearest neighbor based algorithm recently proposed by Chaudhuri and Dasgupta. Under mild assumptions on $f$ and $M$, we obtain rates of convergence that depend on $d$ only but not on the ambient dimension $D$. We also provide a sample complexity lower bound for a natural class of clustering algorithms that use $D$-dimensional neighborhoods.", "full_text": "Cluster Trees on Manifolds\n\nSivaraman Balakrishnan\u2020\nsbalakri@cs.cmu.edu\n\nSrivatsan Narayanan\u2020\n\nAlessandro Rinaldo\u2021\n\nsrivatsa@cs.cmu.edu\n\narinaldo@stat.cmu.edu\n\nAarti Singh\u2020\n\naarti@cs.cmu.edu\n\nLarry Wasserman\u2021\n\nlarry@stat.cmu.edu\n\nSchool of Computer Science\u2020 and Department of Statistics\u2021\n\nCarnegie Mellon University\n\nIn this paper we investigate the problem of estimating the cluster tree for a density f supported on\nor near a smooth d-dimensional manifold M isometrically embedded in RD. We analyze a modi-\n\ufb01ed version of a k-nearest neighbor based algorithm recently proposed by Chaudhuri and Dasgupta\n(2010). The main results of this paper show that under mild assumptions on f and M, we obtain\nrates of convergence that depend on d only but not on the ambient dimension D. Finally, we sketch\na construction of a sample complexity lower bound instance for a natural class of manifold oblivious\nclustering algorithms.\n\n1\n\nIntroduction\n\nIn this paper, we study the problem of estimating the cluster tree of a density when the density\nis supported on or near a manifold. Let X := {X1, . . . , Xn} be a sample drawn i.i.d.\nfrom a\ndistribution P with density f. 
The connected components C_f(λ) of the upper level set {x : f(x) ≥ λ} are called density clusters. The collection C = {C_f(λ) : λ ≥ 0} of all such clusters is called the cluster tree, and estimating this cluster tree is referred to as density clustering.

The density clustering paradigm is attractive for various reasons. One of the main difficulties of clustering is that its true goals are often unclear, which makes clusters, and clustering as a task, seem poorly defined. Density clustering, however, estimates a well-defined population quantity, which makes its goal clear: consistent recovery of the population density clusters. Typically only mild assumptions are made on the density f, and this allows extremely general shapes and numbers of clusters at each level. Finally, the cluster tree is an inherently hierarchical object and thus density clustering algorithms typically do not require specification of the "right" level; rather, they capture a summary of the density across all levels.

The search for a simple, statistically consistent estimator of the cluster tree has a long history. Hartigan (1981) showed that the popular single-linkage algorithm is not consistent for a sample from R^D, with D > 1. Recently, Chaudhuri and Dasgupta (2010) analyzed an algorithm which is both simple and consistent. The algorithm finds the connected components of a sequence of carefully constructed neighborhood graphs. They showed that, as long as the parameters of the algorithm are chosen appropriately, the resulting collection of connected components correctly estimates the cluster tree with high probability.

In this paper, we are concerned with the problem of estimating the cluster tree when the density f is supported on or near a low dimensional manifold.
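As a toy illustration of this definition (not the estimator analyzed in this paper), one can tabulate the clusters C_f(λ) of an explicit one-dimensional density on a grid and watch them merge as λ decreases; the bimodal mixture density below is purely illustrative:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def level_set_components(density, lam):
    """Contiguous runs of grid cells with density >= lam.

    Each run is one (grid-level) density cluster in C_f(lam)."""
    components, run = [], []
    for i, above in enumerate(density >= lam):
        if above:
            run.append(i)
        elif run:
            components.append(run)
            run = []
    if run:
        components.append(run)
    return components

# A bimodal density: two modes of height ~0.33 separated by a valley of
# height ~0.17 at x = 0.
grid = np.linspace(-4, 4, 400)
density = 0.5 * gaussian_pdf(grid, -1, 0.6) + 0.5 * gaussian_pdf(grid, 1, 0.6)

print(len(level_set_components(density, 0.10)))  # 1: below the valley, one cluster
print(len(level_set_components(density, 0.25)))  # 2: above the valley, the cluster splits
print(len(level_set_components(density, 0.40)))  # 0: above both modes, empty level set
```

Sweeping λ from high to low traces out the cluster tree: the two clusters present at λ = 0.25 merge into a single cluster once λ drops below the valley height.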
The motivation for this work stems from the problem of devising and analyzing clustering algorithms with provable performance that can be used in high dimensional applications. When data live in high dimensions, clustering (as well as other statistical tasks) generally becomes prohibitively difficult due to the curse of dimensionality, which demands a very large sample size. In many high dimensional applications, however, data are not spread uniformly but rather concentrate around a low dimensional set. This so-called manifold hypothesis motivates the study of data generated on or near low dimensional manifolds and the study of procedures that can adapt effectively to the intrinsic dimensionality of this data.

Here is a brief summary of the main contributions of our paper: (1) We show that the simple algorithm studied in the paper Chaudhuri and Dasgupta (2010) is consistent and has fast rates of convergence for data on or near a low dimensional manifold M. The algorithm does not require the user to first estimate M (which is a difficult problem). In other words, the algorithm adapts to the (unknown) manifold. (2) We show that the sample complexity for identifying salient clusters is independent of the ambient dimension. (3) We sketch a construction of a sample complexity lower bound instance for a natural class of clustering algorithms that we study in this paper. (4) We introduce a framework for studying consistency of clustering when the distribution is not supported on a manifold but rather is concentrated near a manifold. The generative model in this case is that the data are first sampled from a distribution on a manifold and then noise is added. The original data are latent (unobserved).
We show that for certain noise models we can still efficiently recover the cluster tree on the latent samples.

1.1 Related Work

The idea of using probability density functions for clustering dates back to Wishart (1969). Hartigan (1981) expanded on this idea and formalized the notions of high-density clustering, of the cluster tree and of consistency and fractional consistency of clustering algorithms. In particular, Hartigan (1981) showed that single linkage clustering is consistent when D = 1 but is only fractionally consistent when D > 1. Stuetzle and Nugent (2010) and Stuetzle (2003) have also proposed procedures for recovering the cluster tree. None of these procedures, however, come with the theoretical guarantees given by Chaudhuri and Dasgupta (2010), which demonstrated that a generalization of Wishart's algorithm allows one to estimate parts of the cluster tree for distributions with full-dimensional support near-optimally under rather mild assumptions. This paper forms the starting point for our work and is reviewed in more detail in the next section.

In the last two decades, much of the research effort involving the use of nonparametric density estimators for clustering has focused on the more specialized problems of optimal estimation of the support of the distribution or of a fixed level set. However, consistency of estimators of a fixed level set does not imply cluster tree consistency, and extending the techniques and analyses mentioned above to hold simultaneously over a variety of density levels is non-trivial. See for example the papers Polonik (1995); Tsybakov (1997); Walther (1997); Cuevas and Fraiman (1997); Cuevas et al. (2006); Rigollet and Vert (2009); Maier et al. (2009); Singh et al. (2009); Rinaldo and Wasserman (2010); Rinaldo et al. (2012), and references therein.
Estimating the cluster tree has more recently been considered by Kpotufe and von Luxburg (2011) who also give a simple pruning procedure for removing spurious clusters. Steinwart (2011) and Sriperumbudur and Steinwart (2012) propose procedures for determining recursively the lowest split in the cluster tree and give conditions for asymptotic consistency with minimal assumptions on the density.

2 Background and Assumptions

Let P be a distribution supported on an unknown d-dimensional manifold M. We assume that the manifold M is a d-dimensional Riemannian manifold without boundary embedded in a compact set X ⊂ R^D with d < D. We further assume that the volume of the manifold is bounded from above by a constant, i.e., vol_d(M) ≤ C. The main regularity condition we impose on M is that its condition number be not too large. The condition number of M is 1/τ, where τ is the largest number such that the open normal bundle about M of radius r is imbedded in R^D for every r < τ. The condition number is discussed in more detail in the paper Niyogi et al. (2008).

The Euclidean norm is denoted by ‖·‖ and v_d denotes the volume of the d-dimensional unit ball in R^d. B(x, r) denotes the full-dimensional ball of radius r centered at x and B_M(x, r) := B(x, r) ∩ M. For Z ⊂ R^D and σ > 0, define Z_σ = Z + B(0, σ) and Z_{M,σ} = (Z + B(0, σ)) ∩ M. Note that Z_σ is full dimensional, while if Z ⊆ M then Z_{M,σ} is d-dimensional.

Let f be the density of P with respect to the uniform measure on M. For λ ≥ 0, let C_f(λ) be the collection of connected components of the level set {x ∈ X : f(x) ≥ λ} and define the cluster tree of f to be the hierarchy C = {C_f(λ) : λ ≥ 0}. For a fixed λ, any member of C_f(λ) is a cluster. For a cluster C its restriction to the sample X is defined to be C[X] = C ∩ X.
The restriction of the cluster tree C to X is defined to be C[X] = {C ∩ X : C ∈ C}. Informally, this restriction is a dendrogram-like hierarchical partition of X.

To give finite sample results, following Chaudhuri and Dasgupta (2010), we define the notion of salient clusters. Our definitions are slight modifications of those in Chaudhuri and Dasgupta (2010) to take into account the manifold assumption.

Definition 1 Clusters A and A′ are (σ, ε) separated if there exists a nonempty S ⊂ M such that:

1. Any path along M from A to A′ intersects S.
2. sup_{x ∈ S_{M,σ}} f(x) < (1 − ε) inf_{x ∈ A_{M,σ} ∪ A′_{M,σ}} f(x).

Chaudhuri and Dasgupta (2010) analyze a robust single linkage (RSL) algorithm (in Figure 1). An RSL algorithm estimates the connected components at a level λ in two stages. In the first stage, the sample is cleaned by thresholding the k-nearest neighbor distance of the sample points at a radius r and then, in the second stage, the cleaned sample is connected at a connection radius R. The connected components of the resulting graph give an estimate of the restriction C_f(λ)[X]. In Section 4 we prove a sample complexity lower bound for the class of RSL algorithms, which we now define.

Definition 2 The class of RSL algorithms refers to any algorithm that is of the form described in the algorithm in Figure 1 and relying on Euclidean balls, with any choice of k, r and R.

We define two notions of consistency for an estimator Ĉ of the cluster tree:

Definition 3 (Hartigan consistency) For any sets A, A′ ⊂ X, let A_n (resp., A′_n) denote the smallest cluster of Ĉ containing A ∩ X (resp., A′ ∩ X). We say Ĉ is consistent if, whenever A and A′ are different connected components of {x : f(x) ≥ λ} (for some λ > 0), the probability that A_n is disconnected from A′_n approaches 1 as n → ∞.

Definition 4 ((σ, ε) consistency) For any sets A, A′ ⊂ X such that A and A′ are (σ, ε) separated, let A_n (resp., A′_n) denote the smallest cluster of Ĉ containing A ∩ X (resp., A′ ∩ X). We say Ĉ is (σ, ε) consistent if, whenever A and A′ are different connected components of {x : f(x) ≥ λ} (for some λ > 0), the probability that A_n is disconnected from A′_n approaches 1 as n → ∞.

The notion of (σ, ε) consistency is similar to that of Hartigan consistency except restricted to (σ, ε) separated clusters A and A′.

Chaudhuri and Dasgupta (2010) prove a theorem establishing finite sample bounds for a particular RSL algorithm. In their result there is no manifold and f is a density with respect to the Lebesgue measure on R^D. Their result in essence says that if

n ≥ O( (D/(λε²v_D(σ/2)^D)) log (D/(λε²v_D(σ/2)^D)) )

then an RSL algorithm with appropriately chosen parameters can resolve any pair of (σ, ε) clusters at level at least λ. It is important to note that this theorem does not apply to the setting when distributions are supported on a lower dimensional set for at least two reasons: (1) the density f is singular with respect to the Lebesgue measure on X and so the cluster tree is trivial, and (2) the definitions of saliency with respect to X are typically not satisfied when f has a lower dimensional support.

1. For each X_i, r_k(X_i) := inf{r : B(X_i, r) contains k data points}.
2.
As r grows from 0 to ∞:

(a) Construct a graph G_{r,R} with nodes {X_i : r_k(X_i) ≤ r} and edges (X_i, X_j) if ‖X_i − X_j‖ ≤ R.
(b) Let C(r) be the connected components of G_{r,R}.

3. Denote Ĉ = {C(r) : r ∈ [0, ∞)} and return Ĉ.

Figure 1: Robust Single Linkage (RSL) Algorithm

3 Clustering on Manifolds

In this section we show that the RSL algorithm can be adapted to recover the cluster tree of a distribution supported on a manifold of dimension d < D with the rates depending only on d. In place of the cluster salience parameter σ, our rates involve a new parameter ρ, defined as

ρ := min( 3σ/16, ετ/72d, τ/16 ).

The precise reason for this definition of ρ will be clear from the proofs (particularly of Lemma 7) but for now notice that in addition to σ it is dependent on the condition number 1/τ and deteriorates as the condition number increases. Finally, to succinctly present our results we use µ := log n + d log(1/ρ).

Theorem 5 There are universal constants C1 and C2 such that the following holds. For any δ > 0, 0 < ε < 1/2, run the algorithm in Figure 1 on a sample X drawn from f, where the parameters are set according to the equations

R = 4ρ and k = C1 log²(1/δ)(µ/ε²).

Then with probability at least 1 − δ, Ĉ is (σ, ε) consistent. In particular, the clusters containing A[X] and A′[X], where A and A′ are (σ, ε) separated, are internally connected and mutually disconnected in C(r) for r defined by

v_d r^d λ = (1/(1 − ε/6)) ( k/n + (C2 log(1/δ)/n) √(kµ) ),

provided λ ≥ (2/(v_d ρ^d)) (k/n).

Before we prove this theorem a few remarks are in order:

1. To obtain an explicit sample complexity we plug in the value of k and solve for n from the inequality restricting λ.
The sample complexity of the RSL algorithm for recovering (σ, ε) clusters at level at least λ on a manifold M with condition number at most 1/τ is

n = O( (d/(λε²v_d ρ^d)) log (d/(λε²v_d ρ^d)) )

where ρ = C min(σ, ετ/d, τ). Ignoring constants that depend on d, the main difference between this result and the result of Chaudhuri and Dasgupta (2010) is that our results only depend on the manifold dimension d and not the ambient dimension D (typically D ≫ d). There is also a dependence of our result on 1/(ετ)^d, for ετ ≪ σ. In Section 4 we sketch the construction of an instance that suggests that this dependence is not an artifact of our analysis and that the sample complexity of the class of RSL algorithms is at least n ≥ 1/(ετ)^Ω(d).

2. Another aspect is that our choice of the connection radius R depends on the (typically) unknown ρ, while for comparison, the connection radius in Chaudhuri and Dasgupta (2010) is chosen to be √2 r. Under the mild assumption that λ ≤ n^O(1) (which is satisfied, for instance, if the density on M is bounded from above), we show in Appendix A.8 that an identical theorem holds for R = 4r. k is the only real tuning parameter of this algorithm, whose choice depends on ε and an unknown leading constant.

3. It is easy to see that this theorem also establishes consistency for recovering the entire cluster tree by selecting an appropriate schedule on σ_n, ε_n and k_n that ensures that all clusters are distinguished for n large enough (see Chaudhuri and Dasgupta (2010) for a formal proof).

Our proofs structurally mirror those in Chaudhuri and Dasgupta (2010). We begin with a few technical results in Section 3.1. In Section 3.2 we establish (σ, ε) consistency by showing that the clusters are mutually disjoint and internally connected.
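For concreteness, the two-stage clean-then-connect procedure of Figure 1, at a single pair (r, R), can be sketched in a few lines of Python. This is a naive O(n²) illustration; the synthetic data set and the choices k = 10, r = 0.4, R = 0.8 below are illustrative, not the values prescribed by Theorem 5:

```python
import numpy as np
from itertools import combinations

def rsl_components(X, k, r, R):
    """One level of an RSL-style algorithm: keep points whose k-NN radius
    r_k(X_i) is at most r, join kept points within Euclidean distance R,
    and return the connected components of the resulting graph."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # r_k(X_i): radius of the smallest ball around X_i containing k sample
    # points (X_i itself counts, since X_i lies in B(X_i, r)).
    rk = np.sort(dist, axis=1)[:, k - 1]
    kept = list(np.flatnonzero(rk <= r))

    parent = {i: i for i in kept}  # union-find over the kept points
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(kept, 2):
        if dist[i, j] <= R:
            parent[find(i)] = find(j)

    groups = {}
    for i in kept:
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Two dense blobs joined by a sparse "bridge": the cleaning step removes
# the low-density bridge points, so the blobs come out as two components.
rng = np.random.default_rng(1)
blob1 = rng.normal(0.0, 0.2, size=(100, 2))   # around (0, 0)
blob2 = rng.normal(5.0, 0.2, size=(100, 2))   # around (5, 5)
bridge = np.column_stack([np.linspace(1.5, 3.5, 5)] * 2)
X = np.vstack([blob1, blob2, bridge])

comps = rsl_components(X, k=10, r=0.4, R=0.8)
print(len(comps))
```

Running this over a grid of r values (with R tied to r as in the theorem) yields the nested family Ĉ = {C(r)} that estimates the cluster tree.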
The main technical challenge is that the curvature of the manifold, modulated by its condition number 1/τ, limits our ability to resolve the density level sets from a finite sample, by limiting the maximum cleaning and connection radii the algorithm can use. In what follows, we carefully analyze this effect and show that, somewhat surprisingly, despite this curvature essentially the same algorithm is able to adapt to the unknown manifold and produce a consistent estimate of the entire cluster tree. Similar manifold adaptivity results have been shown in classification Dasgupta and Freund (2008) and in non-parametric regression Kpotufe and Dasgupta (2012); Bickel and Li (2006).

3.1 Technical results

In our proof, we use the uniform convergence of the empirical mass of Euclidean balls to their true mass. In the full dimensional setting of Chaudhuri and Dasgupta (2010), this follows from standard VC inequalities. To the best of our knowledge, however, sharp (ambient dimension independent) inequalities for manifolds are unknown. We get around this obstacle by using the insight that uniform convergence for Euclidean balls centered at the sample points and at a fixed minimum s-net N of M (for an appropriately chosen s) suffices to analyze the RSL algorithms.

Recall, an s-net N ⊆ M is such that every point of M is at a distance at most s from some point in N. Let B_{n,N} := {B(z, s) : z ∈ N ∪ X, s ≥ 0} be the collection of balls whose centers are sample or net points. We now state our uniform convergence lemma. The proof is in Appendix A.3.

Lemma 6 (Uniform Convergence) Assume k ≥ µ. Then there exists a constant C0 such that the following holds.
For every δ > 0, with probability > 1 − δ, for all B ∈ B_{n,N}, we have:

P(B) ≥ C_δ µ/n ⟹ P_n(B) > 0,
P(B) ≥ k/n + (C_δ/n) √(kµ) ⟹ P_n(B) ≥ k/n,
P(B) ≤ k/n − (C_δ/n) √(kµ) ⟹ P_n(B) < k/n,

where C_δ := 2C0 log(2/δ), and µ := 1 + log n + log|N| = Cd + log n + d log(1/s). Here P_n(B) = |X ∩ B|/n denotes the empirical probability measure of B, and C is a universal constant.

Next we provide a tight estimate of the volume of a small ball intersected with M. This bounds the distortion of the apparent density due to the curvature of the manifold and is central to many of our arguments. Intuitively, the claim states that the volume is approximately that of a d-dimensional Euclidean ball, provided that its radius is small enough compared to τ. The lower bound is based on Lemma 5.3 of Niyogi et al. (2008) while the upper bound is based on a modification of the main result of Chazal (2013).

Lemma 7 (Ball volumes) Assume r < τ/2. Define S := B(x, r) ∩ M for a point x ∈ M. Then

(1 − r²/(4τ²))^{d/2} v_d r^d ≤ vol_d(S) ≤ v_d (τ/(τ − 2r₁))^d r₁^d,

where r₁ = τ − τ√(1 − 2r/τ). In particular, if r ≤ ετ/72d for 0 ≤ ε < 1, then

v_d r^d (1 − ε/6) ≤ vol_d(S) ≤ v_d r^d (1 + ε/6).

3.2 Separation and Connectedness

Lemma 8 (Separation) Assume that we pick k, r and R to satisfy the conditions:

r ≤ ρ, R = 4ρ,
v_d r^d (1 − ε/6) λ ≥ k/n + (C_δ/n) √(kµ),
v_d r^d (1 + ε/6) λ (1 − ε) ≤ k/n − (C_δ/n) √(kµ).

Then with probability 1 − δ, we have: (1) All points in A_{σ−r} and A′_{σ−r} are kept, and all points in S_{σ−r} are removed. (2) The two point sets A ∩ X and A′ ∩ X are disconnected in G_{r,R}.

Proof. The proof is analogous to the separation proof of Chaudhuri and Dasgupta (2010) with several modifications. Most importantly, we need to ensure that despite the curvature of the manifold we can still resolve the density well enough to guarantee that we can identify and eliminate points in the region of separation.

Throughout the proof, we will assume that the good event in Lemma 6 (uniform convergence for B_{n,N}) occurs. Since r ≤ ετ/72d, by Lemma 7 vol(B_M(x, r)) is between v_d r^d (1 − ε/6) and v_d r^d (1 + ε/6), for any x ∈ M. So if X_i ∈ A ∪ A′, then B_M(X_i, r) has mass at least v_d r^d (1 − ε/6)·λ. Since this is ≥ k/n + (C_δ/n)√(kµ) by assumption, this ball contains at least k sample points, and hence X_i is kept. On the other hand, if X_i ∈ S_{σ−r}, then the set B_M(X_i, r) contains mass at most v_d r^d (1 + ε/6)·λ(1 − ε). This is ≤ k/n − (C_δ/n)√(kµ). Thus by Lemma 6 B_M(X_i, r) contains fewer than k sample points, and hence X_i is removed.

To prove the graph is disconnected, we first need a bound on the geodesic distance between two points that are at most R apart in Euclidean distance. Such an estimate follows from Proposition 6.3 in Niyogi et al. (2008) who show that if ‖p − q‖ = R ≤ τ/2, then the geodesic distance d_M(p, q) ≤ τ − τ√(1 − 2R/τ). In particular, if R ≤ τ/4, then d_M(p, q) < R(1 + 4R/τ) ≤ 2R. Now, notice that if the graph is connected there must be an edge that connects two points that are at a geodesic distance of at least 2(σ − r). Any path between a point in A and a point in A′ along M must pass through S_{σ−r} and must have a geodesic length of at least 2(σ − r). This is impossible if the connection radius satisfies 2R < 2(σ − r), which follows by the assumptions on r and R. □

All the conditions in Lemma 8 can be simultaneously satisfied by setting k := 16C_δ² (µ/ε²), and

v_d r^d (1 − ε/6) · λ = k/n + (C_δ/n) √(kµ).    (1)

The condition on r is satisfied since λ ≥ (2/(v_d ρ^d)) (k/n), and the condition on R is satisfied by its definition.

Lemma 9 (Connectedness) Assume that the parameters k, r and R satisfy the separation conditions (in Lemma 8). Then, with probability at least 1 − δ, A[X] is connected in G_{r,R}.

Proof. Let us show that any two points in A ∩ X are connected in G_{r,R}. Consider y, y′ ∈ A ∩ X. Since A is connected, there is a path P between y, y′ lying entirely inside A, i.e., a continuous map P : [0, 1] → A such that P(0) = y and P(1) = y′. We can find a sequence of points y₀, . . . , y_t ∈ P such that y₀ = y, y_t = y′, and the geodesic distance on M (and hence the Euclidean distance) between y_{i−1} and y_i is at most η, for an arbitrarily small constant η.

Let N be a minimal R/4-net of M. There exist z_i ∈ N such that ‖y_i − z_i‖ ≤ R/4. Since y_i ∈ A, we have z_i ∈ A_{M,R/4}, and hence the ball B_M(z_i, R/4) lies completely inside A_{M,R/2} ⊆ A_{M,σ−r}. In particular, the density inside the ball is at least λ everywhere, and hence the mass inside it is at least

v_d (R/4)^d (1 − ε/6) λ ≥ C_δ µ/n.

Observe that R ≥ 4r and so this condition is satisfied as a consequence of satisfying Equation 1. Thus Lemma 6 guarantees that the ball B_M(z_i, R/4) contains at least one sample point, say x_i. (Without loss of generality, we may assume x₀ = y and x_t = y′.)
Since the ball lies completely in A_{M,σ−r}, the sample point x_i is not removed in the cleaning step (Lemma 8).

Finally, we bound d(x_{i−1}, x_i) by considering the sequence of points (x_{i−1}, z_{i−1}, y_{i−1}, y_i, z_i, x_i). The pair (y_{i−1}, y_i) are at most η apart and the other successive pairs at most R/4 apart, hence d(x_{i−1}, x_i) ≤ 4(R/4) + η = R + η. The claim follows by letting η → 0. □

4 A lower bound instance for the class of RSL algorithms

Recall that the sample complexity in Theorem 5 scales as n = O( (d/(λε²v_d ρ^d)) log (d/(λε²v_d ρ^d)) ), where ρ = C min(σ, ετ/d, τ). For full dimensional densities, Chaudhuri and Dasgupta (2010) showed the information theoretic lower bound n = Ω( (1/(λε²v_D σ^D)) log (1/(λε²v_D σ^D)) ). Their construction can be straightforwardly modified to a d-dimensional instance on a smooth manifold. Ignoring constants that depend on d, these upper and lower bounds can still differ by a factor of 1/(ετ)^d, for ετ ≪ σ. In this section we provide an informal sketch of a hard instance for the class of RSL algorithms (see Definition 2) that suggests a sample complexity lower bound of n ≥ 1/(ετ)^Ω(d).

We first describe our lower bound instance. The manifold M consists of two disjoint components, C and C′ (whose sole function is to ensure f integrates to 1). The component C in turn contains three parts, which we call ‘top’, ‘middle’, and ‘bottom’ respectively. The middle part, denoted M2, is the portion of the standard d-dimensional unit sphere S^d(0, 1) between the planes x₁ = +√(1 − 4τ²) and x₁ = −√(1 − 4τ²).
The top part, denoted M1, is the upper hemisphere of radius 2τ centered at (+√(1 − 4τ²), 0, 0, . . . , 0). The bottom part, denoted M3, is a symmetric hemisphere centered at (−√(1 − 4τ²), 0, 0, . . . , 0). Thus C is obtained by gluing a portion of the unit sphere with two (small) hemispherical caps. C as described does not have a condition number at most 1/τ because of the “corners” at the intersection of M2 and M1 ∪ M3. This can be fixed without affecting the essence of the construction by smoothing this intersection by rolling a ball of radius τ around it (a similar construction is made rigorous in Theorem 6 of Genovese et al. (2012)). Let P be the distribution on M whose density over C is λ if |x₁| > 1/2, and λ(1 − ε) if |x₁| ≤ 1/2, where λ is chosen small enough such that λ vol_d(C) ≤ 1. The density over C′ is chosen such that the total mass of the manifold is 1. Now M1 and M3 are (σ, ε) separated at level λ for σ = Ω(1). The separator set S is the equator of M2 in the plane x₁ = 0.

We now provide some intuition for why RSL algorithms will require n ≥ 1/(ετ)^Ω(d) to succeed on this instance. We focus our discussion on RSL algorithms with k > 2, i.e. on algorithms that do in fact use a cleaning step, ignoring the single linkage algorithm which is known to be inconsistent for full dimensional densities. Intuitively, because of the curvature of the described instance, the mass of a sufficiently large Euclidean ball in the separator set is larger than the mass of a corresponding ball in the true clusters. This means that any algorithm that uses large balls cannot reliably clean the sample and this restricts the size of the balls that can be used.
Now if points in the regions of high density are to survive then there must be k sample points in the small ball around any point in the true clusters, and this gives us a lower bound on the necessary sample size.

The RSL algorithms work by counting the number of sample points inside the balls B(x, r) centered at the sample points x, for some radius r. In order for the algorithm to reliably resolve (σ, ε) clusters, it should distinguish points in the separator set S ⊂ M2 from those in the level λ clusters M1 ∪ M3. A necessary condition for this is that the mass of a ball B(x, r) for x ∈ S_{σ−r} should be strictly smaller than the mass inside B(y, r) for y ∈ M1 ∪ M3. In Appendix A.4, we show that this condition restricts the radius r to be at most O(τ√(ε/d)). Now, consider any sample point x₀ in M1 ∪ M3 (such an x₀ exists with high probability). Since x₀ should not be removed during the cleaning step, the ball B(x₀, r) must contain some other sample point (indeed, it must contain at least k − 1 more sample points). By a union bound, this happens with probability at most (n − 1) v_d r^d λ ≤ O(d^{−d/2} n τ^d ε^{d/2} λ). If we want the algorithm to succeed with probability at least 1/2 (say) then

n ≥ Ω( d^{d/2} / (τ^d λ ε^{d/2}) ).

5 Cluster tree recovery in the presence of noise

So far we have considered the problem of recovering the cluster tree given samples from a density supported on a lower dimensional manifold. In this section we extend these results to the more general situation when we have noisy samples concentrated near a lower dimensional manifold. Indeed it can be argued that the manifold + noise model is a natural and general model for high-dimensional data. In the noisy setting, it is clear that we can infer the cluster tree of the noisy density in a straightforward way.
A stronger requirement would be consistency with respect to\nthe underlying latent sample. Following the literature on manifold estimation (Balakrishnan et al.\n(2012); Genovese et al. (2012)) we consider two main noise models. For both of them, we specify a\ndistribution Q for the noisy sample.\n1. Clutter Noise: We observe data Y1, . . . , Yn from the mixture Q := (1 \u2212 \u03c0)U + \u03c0P where\n0 < \u03c0 \u2264 1 and U is a uniform distribution on X . Denote the samples drawn from P in this mixture\nX = {X1, . . . , Xm}. The points drawn from U are called background clutter. In this case, we can\nshow:\n\nR := 4\u03c1\n\nk := C1 log2(1/\u03b4)(\u00b5/\u01eb2).\n\nA[X] and A\u2032[X] are internally connected and mutually disconnected in C(r) for r de\ufb01ned by\n\nTheorem 10 There are universal constants C1 and C2 such that the following holds. For any \u03b4 > 0,\n0 < \u01eb < 1/2, run the algorithm in Figure 1 on a sample {Y1, . . . , Yn}, with parameters\nThen with probability at least 1 \u2212 \u03b4, bC is (\u03c3, \u01eb) consistent. In particular, the clusters containing\n1 \u2212 \u01eb/6(cid:18) k\nprovided \u03bb \u2265 max(cid:26) 2\nn(cid:1)1\u2212d/D(cid:27) where \u03c1 is now slightly modi\ufb01ed (in con-\n(cid:0) k\nstants), i.e., \u03c1 := min(cid:0) \u03c3\n2. Additive Noise: The data are of the form Yi = Xi+\u03b7i where X1, . . . , Xn \u223c P ,and \u03b71, . . . , \u03b7n are\na sample from any bounded noise distribution \u03a6, with \u03b7i \u2208 B(0, \u03b8). Note that Q is the convolution\nof P and \u03a6, Q = P \u22c6 \u03a6.\n\nn , 2vd/D\n24(cid:1).\n72d , \u03c4\n\npk\u00b5(cid:19)\n\nvd\u03c1d\n7 , \u01eb\u03c4\n\nC2 log(1/\u03b4)\n\n\u03c0vdrd\u03bb =\n\nD (1\u2212\u03c0)d/D\n\n+\n\nn\n\nvd\u01ebd/D\u03c0\n\nn\n\n1\n\nk\n\nTheorem 11 There are universal constants C1 and C2 such that the following holds. For any \u03b4 > 0,\n0 < \u01eb < 1/2, run the algorithm in Figure 1 on the sample {Y1, . . . 
, Yn} with parameters

R := 5ρ  and  k := C1 log^2(1/δ) (μ/ε^2).

Then with probability at least 1 − δ, Ĉ is (σ, ε) consistent for θ ≤ ρε/(24d). In particular, the clusters containing {Yi : Xi ∈ A} and {Yi : Xi ∈ A′} are internally connected and mutually disconnected in C(r) for r defined by

v_d r^d (1 − ε/12)(1 − ε/6) λ = k/n + (C_δ/n) √(kμ),

if λ ≥ (2 / (v_d ρ^d)) (k/n) and θ ≤ ρε/(24d), where ρ := min( σ/7, τ/24, ετ/(144d) ).

The proofs for both Theorems 10 and 11 appear in Appendix A.5. Notice that in each case we receive samples from a full D-dimensional distribution but are still able to achieve rates independent of D because these distributions are concentrated around the lower dimensional M. For the clutter noise case we produce a tree that is consistent for samples drawn from P (which lie exactly on M), while in the additive noise case we produce a tree on the observed Yi's which is (σ, ε) consistent for the latent Xi's (for θ small enough). It is worth noting that in the case of clutter noise we can still consistently recover the entire cluster tree. Intuitively, this is because the k-NN distances for points on M are much smaller than for clutter points that are far away from M. As a result, the clutter noise only affects a vanishingly low level set of the cluster tree.

References

S. Balakrishnan, A. Rinaldo, D. Sheehy, A. Singh, and L. Wasserman. Minimax rates for homology inference. AISTATS, 2012.

P. Bickel and B. Li. Local polynomial regression on unknown manifolds. Technical report, Department of Statistics, UC Berkeley, 2006.

K. Chaudhuri and S. Dasgupta. Rates of convergence for the cluster tree. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A.
Culotta, editors, Advances in Neural Information Processing Systems 23, pages 343–351, 2010.

F. Chazal. An upper bound for the volume of geodesic balls in submanifolds of Euclidean spaces. Personal communication, available at http://geometrica.saclay.inria.fr/team/Fred.Chazal/BallVolumeJan2013.pdf, 2013.

A. Cuevas and R. Fraiman. A plug-in approach to support estimation. Annals of Statistics, 25(6):2300–2312, 1997.

A. Cuevas, W. González-Manteiga, and A. Rodríguez-Casal. Plug-in estimation of general level sets. Aust. N. Z. J. Stat., 48(1):7–19, 2006.

S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. In STOC, pages 537–546, 2008.

C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Minimax manifold estimation. Journal of Machine Learning Research, 13:1263–1291, 2012.

J. A. Hartigan. Consistency of single linkage for high-density clusters. Journal of the American Statistical Association, 76(374):388–394, 1981.

S. Kpotufe and S. Dasgupta. A tree-based regressor that adapts to intrinsic dimension. J. Comput. Syst. Sci., 78(5):1496–1515, 2012.

S. Kpotufe and U. von Luxburg. Pruning nearest neighbor cluster trees. In L. Getoor and T. Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 225–232. ACM, New York, NY, USA, 2011.

M. Maier, M. Hein, and U. von Luxburg. Optimal construction of k-nearest-neighbor graphs for identifying noisy clusters. Theor. Comput. Sci., 410(19):1749–1764, 2009.

P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39(1-3):419–441, 2008.

W. Polonik. Measuring mass concentrations and estimating density contour clusters: an excess mass approach. Annals of Statistics, 23(3):855–882, 1995.

P. Rigollet and R. Vert. Fast rates for plug-in estimators of density level sets. Bernoulli, 15(4):1154–1178, 2009.

A. Rinaldo, A. Singh, R. Nugent, and L. Wasserman. Stability of density-based clustering. Journal of Machine Learning Research, 13:905–948, 2012.

A. Rinaldo and L. Wasserman. Generalized density clustering. The Annals of Statistics, 38(5):2678–2722, 2010. arXiv:0907.3454.

A. Singh, C. Scott, and R. Nowak. Adaptive Hausdorff estimation of density level sets. Ann. Statist., 37(5B):2760–2782, 2009.

B. K. Sriperumbudur and I. Steinwart. Consistency and rates for clustering with DBSCAN. Journal of Machine Learning Research - Proceedings Track, 22:1090–1098, 2012.

I. Steinwart. Adaptive density level set clustering. Journal of Machine Learning Research - Proceedings Track, 19:703–738, 2011.

W. Stuetzle. Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J. Classification, 20(1):25–47, 2003.

W. Stuetzle and R. Nugent. A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics, 19(2):397–418, 2010.

A. B. Tsybakov. On nonparametric estimation of density level sets. Ann. Statist., 25(3):948–969, 1997.

G. Walther. Granulometric smoothing. Annals of Statistics, 25(6):2273–2299, 1997.

D. Wishart. Mode analysis: a generalization of nearest neighbor which reduces chaining. In Proceedings of the Colloquium on Numerical Taxonomy held in the University of St. Andrews, pages 282–308, 1969.