{"title": "Rates of convergence for the cluster tree", "book": "Advances in Neural Information Processing Systems", "page_first": 343, "page_last": 351, "abstract": "For a density f on R^d, a high-density cluster is any connected component of {x: f(x) >= c}, for some c > 0. The set of all high-density clusters forms a hierarchy called the cluster tree of f. We present a procedure for estimating the cluster tree given samples from f. We give finite-sample convergence rates for our algorithm, as well as lower bounds on the sample complexity of this estimation problem.", "full_text": "Rates of convergence for the cluster tree\n\nKamalika Chaudhuri\n\nUC San Diego\n\nkchaudhuri@ucsd.edu\n\nSanjoy Dasgupta\n\nUC San Diego\n\ndasgupta@cs.ucsd.edu\n\nAbstract\n\nFor a density f on Rd, a high-density cluster is any connected component of {x :\nf (x) \u2265 \u03bb}, for some \u03bb > 0. The set of all high-density clusters forms a hierarchy\ncalled the cluster tree of f. We present a procedure for estimating the cluster tree\ngiven samples from f. We give \ufb01nite-sample convergence rates for our algorithm,\nas well as lower bounds on the sample complexity of this estimation problem.\n\n1\n\nIntroduction\n\nA central preoccupation of learning theory is to understand what statistical estimation based on a\n\ufb01nite data set reveals about the underlying distribution from which the data were sampled. For\nclassi\ufb01cation problems, there is now a well-developed theory of generalization. For clustering,\nhowever, this kind of analysis has proved more elusive.\n\nConsider for instance k-means, possibly the most popular clustering procedure in use today. If\nthis procedure is run on points X1, . . . , Xn from distribution f, and is told to \ufb01nd k clusters, what\ndo these clusters reveal about f? 
Pollard [8] proved a basic consistency result: if the algorithm\nalways \ufb01nds the global minimum of the k-means cost function (which is NP-hard, see Theorem 3\nof [3]), then as n \u2192 \u221e, the clustering converges to the globally optimal k-means solution for f. This result,\nhowever impressive, leaves the fundamental question unanswered: is the best k-means solution to f\nan interesting or desirable quantity, in settings outside of vector quantization?\n\nIn this paper, we are interested in clustering procedures whose output on a \ufb01nite sample converges\nto \u201cnatural clusters\u201d of the underlying distribution f. There are doubtless many meaningful ways\nto de\ufb01ne natural clusters. Here we follow some early work on clustering (for instance, [5]) by\nassociating clusters with high-density connected regions. Speci\ufb01cally, a cluster of density f is any\nconnected component of {x : f (x) \u2265 \u03bb}, for any \u03bb > 0. The collection of all such clusters forms\nan (in\ufb01nite) hierarchy called the cluster tree (Figure 1).\n\nAre there hierarchical clustering algorithms which converge to the cluster tree? Previous theory\nwork [5, 7] has provided weak consistency results for the single-linkage clustering algorithm, while\nother work [13] has suggested ways to overcome the de\ufb01ciencies of this algorithm by making it\nmore robust, but without proofs of convergence. In this paper, we propose a novel way to make\nsingle-linkage more robust, while retaining most of its elegance and simplicity (see Figure 3). We\nestablish its \ufb01nite-sample rate of convergence (Theorem 6); the centerpiece of our argument is a\nresult on continuum percolation (Theorem 11). We also give a lower bound on the problem of\ncluster tree estimation (Theorem 12), which matches our upper bound in its dependence on most of\nthe parameters of interest.\n\n2 De\ufb01nitions and previous work\n\nLet X be a subset of Rd. 
We exclusively consider Euclidean distance on X , denoted \u2016 \u00b7 \u2016. Let\nB(x, r) be the closed ball of radius r around x.\n\nFigure 1: A probability density f on R, and three of its clusters: C1, C2, and C3.\n\n2.1 The cluster tree\n\nWe start with notions of connectivity. A path P in S \u2282 X is a continuous 1 \u2212 1 function P :\n[0, 1] \u2192 S. If x = P (0) and y = P (1), we write x \u21ddP y, and we say that x and y are connected in\nS. This relation \u2013 \u201cconnected in S\u201d \u2013 is an equivalence relation that partitions S into its connected\ncomponents. We say S \u2282 X is connected if it has a single connected component.\nThe cluster tree is a hierarchy each of whose levels is a partition of a subset of X , which we will\noccasionally call a subpartition of X . Write \u03a0(X ) = {subpartitions of X}.\nDe\ufb01nition 1 For any f : X \u2192 R, the cluster tree of f is a function Cf : R \u2192 \u03a0(X ) given by\n\nCf (\u03bb) = connected components of {x \u2208 X : f (x) \u2265 \u03bb}.\n\nAny element of Cf (\u03bb), for any \u03bb, is called a cluster of f.\n\nFor any \u03bb, Cf (\u03bb) is a set of disjoint clusters of X . They form a hierarchy in the following sense.\nLemma 2 Pick any \u03bb\u2032 \u2264 \u03bb. Then:\n\n1. For any C \u2208 Cf (\u03bb), there exists C \u2032 \u2208 Cf (\u03bb\u2032) such that C \u2286 C \u2032.\n2. For any C \u2208 Cf (\u03bb) and C \u2032 \u2208 Cf (\u03bb\u2032), either C \u2286 C \u2032 or C \u2229 C \u2032 = \u2205.\n\nWe will sometimes deal with the restriction of the cluster tree to a \ufb01nite set of points x1, . . . , xn.\nFormally, the restriction of a subpartition C \u2208 \u03a0(X ) to these points is de\ufb01ned to be C[x1, . . . , xn] =\n{C \u2229 {x1, . . . , xn} : C \u2208 C}. Likewise, the restriction of the cluster tree is Cf [x1, . . . , xn] : R \u2192\n\u03a0({x1, . . . , xn}), where Cf [x1, . . . 
, xn](\u03bb) = Cf (\u03bb)[x1, . . . , xn]. See Figure 2 for an example.\n\n2.2 Notion of convergence and previous work\n\nSuppose a sample Xn \u2282 X of size n is used to construct a tree Cn that is an estimate of Cf . Hartigan\n[5] provided a very natural notion of consistency for this setting.\nDe\ufb01nition 3 For any sets A, A\u2032 \u2282 X , let An (resp, A\u2032n) denote the smallest cluster of Cn containing\nA \u2229 Xn (resp, A\u2032 \u2229 Xn). We say Cn is consistent if, whenever A and A\u2032 are different connected\ncomponents of {x : f (x) \u2265 \u03bb} (for some \u03bb > 0), P(An is disjoint from A\u2032n) \u2192 1 as n \u2192 \u221e.\nIt is well known that if Xn is used to build a uniformly consistent density estimate fn (that is,\nsupx |fn(x) \u2212 f (x)| \u2192 0), then the cluster tree Cfn is consistent; see the appendix for details.\nThe big problem is that Cfn is not easy to compute for typical density estimates fn: imagine, for\ninstance, how one might go about trying to \ufb01nd level sets of a mixture of Gaussians!\n\nFigure 2: A probability density f, and the restriction of Cf to a \ufb01nite set of eight points.\n\nWong and Lane [14] have an ef\ufb01cient procedure that tries to approximate Cfn when fn is a k-nearest neighbor\ndensity estimate, but they have not shown that it preserves the consistency property of Cfn.\nThere is a simple and elegant algorithm that is a plausible estimator of the cluster tree: single\nlinkage (or Kruskal\u2019s algorithm); see the appendix for pseudocode. Hartigan [5] has shown that it is\nconsistent in one dimension (d = 1). But he also demonstrates, by a lovely reduction to continuum\npercolation, that this consistency fails in higher dimension d \u2265 2. 
The problem is the requirement\nthat A \u2229 Xn \u2282 An: by the time the clusters are large enough that one of them contains all of A,\nthere is a reasonable chance that this cluster will be so big as to also contain part of A\u2032.\nWith this insight, Hartigan de\ufb01nes a weaker notion of fractional consistency, under which An (resp,\nA\u2032n) need not contain all of A \u2229 Xn (resp, A\u2032 \u2229 Xn), but merely a sizeable chunk of it \u2013 and ought to\nbe very close (at distance \u2192 0 as n \u2192 \u221e) to the remainder. He then shows that single linkage has\nthis weaker consistency property for any pair A, A\u2032 for which the ratio of inf{f (x) : x \u2208 A \u222a A\u2032} to\nsup{inf{f (x) : x \u2208 P} : paths P from A to A\u2032} is suf\ufb01ciently large. More recent work by Penrose\n[7] closes the gap and shows fractional consistency whenever this ratio is > 1.\nA more robust version of single linkage has been proposed by Wishart [13]: when connecting points\nat distance r from each other, only consider points that have at least k neighbors within distance r\n(for some k > 2). Thus initially, when r is small, only the regions of highest density are available for\nlinkage, while the rest of the data set is ignored. As r gets larger, more and more of the data points\nbecome candidates for linkage. This scheme is intuitively sensible, but Wishart does not provide a\nproof of convergence. Thus it is unclear how to set k, for instance.\nStuetzle and Nugent [12] have an appealing top-down scheme for estimating the cluster tree, along\nwith a post-processing step (called runt pruning) that helps identify modes of the distribution. The\nconsistency of this method has not yet been established.\n\nSeveral recent papers [6, 10, 9, 11] have considered the problem of recovering the connected components\nof {x : f (x) \u2265 \u03bb} for a user-speci\ufb01ed \u03bb: the \ufb02at version of our problem. 
In particular,\nthe algorithm of [6] is intuitively similar to ours, though they use a single graph in which each point\nis connected to its k nearest neighbors, whereas we have a hierarchy of graphs in which each point\nis connected to other points at distance \u2264 r (for various r). Interestingly, k-nn graphs are valuable\nfor \ufb02at clustering because they can adapt to clusters of different scales (different average interpoint\ndistances). But they are challenging to analyze and seem to require various regularity assumptions\non the data. A pleasant feature of the hierarchical setting is that different scales appear at different\nlevels of the tree, rather than being collapsed together. This allows the use of r-neighbor graphs, and\nmakes possible an analysis that has minimal assumptions on the data.\n\n3 Algorithm and results\n\nIn this paper, we consider a generalization of Wishart\u2019s scheme and of single linkage, shown in\nFigure 3. It has two free parameters: k and \u03b1. For practical reasons, it is of interest to keep these as\nsmall as possible. We provide \ufb01nite-sample convergence rates for all 1 \u2264 \u03b1 \u2264 2 and we can achieve\nk \u223c d log n, which we conjecture to be the best possible, if \u03b1 > \u221a2. Our rates for \u03b1 = 1 force k to\nbe much larger, exponential in d.\n\n1. For each xi set rk(xi) = inf{r : B(xi, r) contains k data points}.\n2. As r grows from 0 to \u221e:\n\n(a) Construct a graph Gr with nodes {xi : rk(xi) \u2264 r}. Include edge (xi, xj) if \u2016xi \u2212 xj\u2016 \u2264 \u03b1r.\n(b) Let \u0108(r) be the connected components of Gr.\n\nFigure 3: Algorithm for hierarchical clustering. The input is a sample Xn = {x1, . . . , xn} from\ndensity f on X . Parameters k and \u03b1 need to be set. Single linkage is (\u03b1 = 1, k = 2). Wishart\nsuggested \u03b1 = 1 and larger k.\n\n
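As a concrete illustration, here is a minimal, unoptimized sketch (our own illustrative code, not from the paper or its appendix) of the Figure 3 procedure at a single scale r: keep the points whose k-NN radius rk is at most r, connect surviving points at distance at most \u03b1r, and read off connected components with a union-find. Sweeping r upward and nesting the resulting component sets gives the empirical tree; (\u03b1 = 1, k = 2) recovers single linkage.

```python
import math
from itertools import combinations

def knn_radii(points, k):
    """r_k(x_i): radius of the smallest ball around x_i containing k sample
    points (x_i itself included)."""
    return [sorted(math.dist(p, q) for q in points)[k - 1] for p in points]

def clusters_at_scale(points, r, k=2, alpha=1.0):
    """Connected components of G_r: nodes are {x_i : r_k(x_i) <= r}, with an
    edge (x_i, x_j) whenever ||x_i - x_j|| <= alpha * r."""
    radii = knn_radii(points, k)
    nodes = [i for i in range(len(points)) if radii[i] <= r]
    parent = {i: i for i in nodes}

    def find(i):  # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(nodes, 2):
        if math.dist(points[i], points[j]) <= alpha * r:
            parent[find(i)] = find(j)

    components = {}
    for i in nodes:
        components.setdefault(find(i), []).append(i)
    return list(components.values())
```

For two well-separated clumps, a small r yields two components and a large r one; this all-pairs version costs O(n^2) per scale and is meant only to make the definitions concrete, not to be efficient.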
It is a fascinating open problem to determine whether the setting\n(\u03b1 = 1, k \u223c d log n) yields consistency.\n\n3.1 A notion of cluster salience\n\nSuppose density f is supported on some subset X of Rd. We will show that the hierarchical clustering\nprocedure is consistent in the sense of De\ufb01nition 3. But the more interesting question is, what\nclusters will be identi\ufb01ed from a \ufb01nite sample? To answer this, we introduce a notion of salience.\n\nThe \ufb01rst consideration is that a cluster is hard to identify if it contains a thin \u201cbridge\u201d that would\nmake it look disconnected in a small sample. To control this, we consider a \u201cbuffer zone\u201d of width\n\u03c3 around the clusters.\nDe\ufb01nition 4 For Z \u2282 Rd and \u03c3 > 0, write Z\u03c3 = Z + B(0, \u03c3) = {y \u2208 Rd : inf z\u2208Z \u2016y \u2212 z\u2016 \u2264 \u03c3}.\nAn important technical point is that Z\u03c3 is a full-dimensional set, even if Z itself is not.\nSecond, the ease of distinguishing two clusters A and A\u2032 depends inevitably upon the separation\nbetween them. To keep things simple, we\u2019ll use the same \u03c3 as a separation parameter.\nDe\ufb01nition 5 Let f be a density on X \u2282 Rd. We say that A, A\u2032 \u2282 X are (\u03c3, \u01eb)-separated if there\nexists S \u2282 X (separator set) such that:\n\n\u2022 Any path in X from A to A\u2032 intersects S.\n\u2022 sup x\u2208S\u03c3 f (x) < (1 \u2212 \u01eb) inf x\u2208A\u03c3\u222aA\u2032\u03c3 f (x).\n\nUnder this de\ufb01nition, A\u03c3 and A\u2032\u03c3 must lie within X , otherwise the right-hand side of the inequality\nis zero. However, S\u03c3 need not be contained in X .\n\n3.2 Consistency and \ufb01nite-sample rate of convergence\n\nHere we state the result for \u03b1 > \u221a2 and k \u223c d log n. The analysis section also has results for\n1 \u2264 \u03b1 \u2264 2 and k \u223c (2/\u03b1)^d d log n.\nTheorem 6 There is an absolute constant C such that the following holds. 
Pick any \u03b4, \u01eb > 0, and\nrun the algorithm on a sample Xn of size n drawn from f, with settings\n\n\u221a2 (1 + \u01eb^2/\u221ad) \u2264 \u03b1 \u2264 2 and k = C \u00b7 ((d log n)/\u01eb^2) \u00b7 log^2(1/\u03b4).\n\nThen there is a mapping r : [0, \u221e) \u2192 [0, \u221e) such that with probability at least 1 \u2212 \u03b4, the following\nholds uniformly for all pairs of connected subsets A, A\u2032 \u2282 X : If A, A\u2032 are (\u03c3, \u01eb)-separated (for \u01eb\nand some \u03c3 > 0), and if\n\n\u03bb := inf x\u2208A\u03c3\u222aA\u2032\u03c3 f (x) \u2265 (1/(vd (\u03c3/2)^d)) \u00b7 (k/n) \u00b7 (1 + \u01eb/2)   (*)\n\nwhere vd is the volume of the unit ball in Rd, then:\n\n1. Separation. A \u2229 Xn is disconnected from A\u2032 \u2229 Xn in Gr(\u03bb).\n2. Connectedness. A \u2229 Xn and A\u2032 \u2229 Xn are each individually connected in Gr(\u03bb).\n\nThe two parts of this theorem \u2013 separation and connectedness \u2013 are proved in Sections 3.3 and 3.4.\nWe mention in passing that this \ufb01nite-sample result implies consistency (De\ufb01nition 3): as n \u2192 \u221e,\ntake kn = (d log n)/\u01ebn^2 with any schedule of (\u01ebn : n = 1, 2, . . .) such that \u01ebn \u2192 0 and kn/n \u2192 0.\nUnder mild conditions, any two connected components A, A\u2032 of {f \u2265 \u03bb} are (\u03c3, \u01eb)-separated for\nsome \u03c3, \u01eb > 0 (see appendix); thus they will get distinguished for suf\ufb01ciently large n.\n\n3.3 Analysis: separation\n\nThe cluster tree algorithm depends heavily on the radii rk(x): the distance within which x\u2019s nearest\nk neighbors lie (including x itself). Thus the empirical probability mass of B(x, rk(x)) is k/n. To\nshow that rk(x) is meaningful, we need to establish that the mass of this ball under density f is also,\nvery approximately, k/n. 
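By construction, the closed ball B(x, rk(x)) contains exactly k sample points, so its empirical mass fn(B(x, rk(x))) is k/n. A toy numeric check (our own illustrative code, assuming no ties among pairwise distances):

```python
import math

def rk(points, i, k):
    """k-NN radius of points[i], counting the point itself."""
    return sorted(math.dist(points[i], q) for q in points)[k - 1]

def empirical_mass(points, center, radius):
    """f_n(B) = |X_n intersected with B| / n for the closed ball B(center, radius)."""
    inside = sum(1 for q in points if math.dist(center, q) <= radius)
    return inside / len(points)
```

With distinct pairwise distances, empirical_mass(points, points[i], rk(points, i, k)) equals k/len(points); Lemma 7 below is what lets the true mass f(B) be read off from such empirical counts.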
The uniform convergence of these empirical counts follows from the fact\nthat balls in Rd have \ufb01nite VC dimension, d + 1. Using uniform Bernstein-type bounds, we get a\nset of basic inequalities that we use repeatedly.\n\nLemma 7 Assume k \u2265 d log n, and \ufb01x some \u03b4 > 0. Then there exists a constant C\u03b4 such that with\nprobability > 1 \u2212 \u03b4, every ball B \u2282 Rd satis\ufb01es the following conditions:\n\nf (B) \u2265 (C\u03b4 d log n)/n =\u21d2 fn(B) > 0\nf (B) \u2265 k/n + (C\u03b4/n) \u221a(kd log n) =\u21d2 fn(B) \u2265 k/n\nf (B) \u2264 k/n \u2212 (C\u03b4/n) \u221a(kd log n) =\u21d2 fn(B) < k/n\n\nHere fn(B) = |Xn \u2229 B|/n is the empirical mass of B, while f (B) = \u222bB f (x) dx is its true mass.\nPROOF: See appendix. C\u03b4 = 2Co log(2/\u03b4), where Co is the absolute constant from Lemma 16. \u25a1\n\nWe will henceforth think of \u03b4 as \ufb01xed, so that we do not have to repeatedly quantify over it.\n\nLemma 8 Pick 0 < r < 2\u03c3/(\u03b1 + 2) such that\n\nvd r^d \u03bb \u2265 k/n + (C\u03b4/n) \u221a(kd log n)\nvd r^d \u03bb (1 \u2212 \u01eb) < k/n \u2212 (C\u03b4/n) \u221a(kd log n)\n\n(recall that vd is the volume of the unit ball in Rd). Then with probability > 1 \u2212 \u03b4:\n\n1. Gr contains all points in (A\u03c3\u2212r \u222a A\u2032\u03c3\u2212r) \u2229 Xn and no points in S\u03c3\u2212r \u2229 Xn.\n2. A \u2229 Xn is disconnected from A\u2032 \u2229 Xn in Gr.\n\nPROOF: For (1), any point x \u2208 (A\u03c3\u2212r \u222a A\u2032\u03c3\u2212r) has f (B(x, r)) \u2265 vd r^d \u03bb; and thus, by Lemma 7, has\nat least k neighbors within radius r. Likewise, any point x \u2208 S\u03c3\u2212r has f (B(x, r)) < vd r^d \u03bb (1 \u2212 \u01eb);\nand thus, by Lemma 7, has strictly fewer than k neighbors within distance r.\nFor (2), since points in S\u03c3\u2212r are absent from Gr, any path from A to A\u2032 in that graph must have an\nedge across S\u03c3\u2212r. 
But any such edge has length at least 2(\u03c3 \u2212 r) > \u03b1r and is thus not in Gr. \u25a1\n\nDe\ufb01nition 9 De\ufb01ne r(\u03bb) to be the value of r for which vd r^d \u03bb = k/n + (C\u03b4/n) \u221a(kd log n).\n\nTo satisfy the conditions of Lemma 8, it suf\ufb01ces to take k \u2265 4C\u03b4^2 (d/\u01eb^2) log n; this is what we use.\n\nFigure 4: Left: P is a path from x to x\u2032, and \u03c0(xi) is the point furthest along the path that is within\ndistance r of xi. Right: The next point, xi+1 \u2208 Xn, is chosen from a slab of B(\u03c0(xi), r) that is\nperpendicular to xi \u2212 \u03c0(xi) and has width 2\u03b6/\u221ad.\n\n3.4 Analysis: connectedness\n\nWe need to show that points in A (and similarly A\u2032) are connected in Gr(\u03bb). First we state a simple\nbound (proved in the appendix) that works if \u03b1 = 2 and k \u223c d log n; later we consider smaller \u03b1.\nLemma 10 Suppose 1 \u2264 \u03b1 \u2264 2. Then with probability \u2265 1 \u2212 \u03b4, A \u2229 Xn is connected in Gr\nwhenever r \u2264 2\u03c3/(2 + \u03b1) and the conditions of Lemma 8 hold, and\n\nvd r^d \u03bb \u2265 (2/\u03b1)^d \u00b7 (C\u03b4 d log n)/n.\n\nComparing this to the de\ufb01nition of r(\u03bb), we see that choosing \u03b1 = 1 would entail k \u2265 2^d, which is\nundesirable. We can get a more reasonable setting of k \u223c d log n by choosing \u03b1 = 2, but we\u2019d like\n\u03b1 to be as small as possible. A more re\ufb01ned argument shows that \u03b1 \u2248 \u221a2 is enough.\nTheorem 11 Suppose \u03b1^2 \u2265 2(1 + \u03b6/\u221ad), for some 0 < \u03b6 \u2264 1. Then, with probability > 1 \u2212 \u03b4,\nA \u2229 Xn is connected in Gr whenever r \u2264 \u03c3/2 and the conditions of Lemma 8 hold, and\n\nvd r^d \u03bb \u2265 (8/\u03b6) \u00b7 (C\u03b4 d log n)/n.\n\nPROOF: We have already made heavy use of uniform convergence over balls. 
We now also require\na more complicated class G, each element of which is the intersection of an open ball and a slab\nde\ufb01ned by two parallel hyperplanes. Formally, each of these functions is de\ufb01ned by a center \u00b5 and\na unit direction u, and is the indicator function of the set\n\n{z \u2208 Rd : \u2016z \u2212 \u00b5\u2016 < r, |(z \u2212 \u00b5) \u00b7 u| \u2264 \u03b6r/\u221ad}.\n\nWe will describe any such set as \u201cthe slab of B(\u00b5, r) in direction u\u201d. A simple calculation (see\nLemma 4 of [4]) shows that the volume of this slab is at least \u03b6/4 that of B(x, r). Thus, if the slab lies\nentirely in A\u03c3, its probability mass is at least (\u03b6/4) vd r^d \u03bb. By uniform convergence over G (which\nhas VC dimension 2d), we can then conclude (as in Lemma 7) that if (\u03b6/4) vd r^d \u03bb \u2265 (2C\u03b4 d log n)/n,\nthen with probability at least 1 \u2212 \u03b4, every such slab within A contains at least one data point.\nPick any x, x\u2032 \u2208 A \u2229 Xn; there is a path P in A with x \u21ddP x\u2032. We\u2019ll identify a sequence of data points\nx0 = x, x1, x2, . . ., ending in x\u2032, such that for every i, point xi is active in Gr and \u2016xi \u2212 xi+1\u2016 \u2264 \u03b1r.\nThis will con\ufb01rm that x is connected to x\u2032 in Gr.\nTo begin with, recall that P is a continuous 1 \u2212 1 function from [0, 1] into A. We are also interested\nin the inverse P^\u22121, which sends a point on the path back to its parametrization in [0, 1]. For any\npoint y \u2208 X , de\ufb01ne N (y) to be the portion of [0, 1] whose image under P lies in B(y, r): that is,\nN (y) = {0 \u2264 z \u2264 1 : P (z) \u2208 B(y, r)}. If y is within distance r of P , then N (y) is nonempty.\nDe\ufb01ne \u03c0(y) = P (sup N (y)), the furthest point along the path within distance r of y (Figure 4, left).\nThe sequence x0, x1, x2, . . . is de\ufb01ned iteratively; x0 = x, and for i = 0, 1, 2, . . . 
:\n\n\u2022 If \u2016xi \u2212 x\u2032\u2016 \u2264 \u03b1r, set xi+1 = x\u2032 and stop.\n\u2022 By construction, xi is within distance r of path P and hence N (xi) is nonempty.\n\u2022 Let B be the open ball of radius r around \u03c0(xi). The slab of B in direction xi \u2212 \u03c0(xi)\nmust contain a data point; this is xi+1 (Figure 4, right).\n\nThe process eventually stops because each \u03c0(xi+1) is strictly further along path P than \u03c0(xi);\nformally, P^\u22121(\u03c0(xi+1)) > P^\u22121(\u03c0(xi)). This is because \u2016xi+1 \u2212 \u03c0(xi)\u2016 < r, so by continuity of\nthe function P , there are points further along the path (beyond \u03c0(xi)) whose distance to xi+1 is still\n< r. Thus xi+1 is distinct from x0, x1, . . . , xi. Since there are \ufb01nitely many data points, the process\nmust terminate, so the sequence {xi} does constitute a path from x to x\u2032.\nEach xi lies in Ar \u2286 A\u03c3\u2212r and is thus active in Gr (Lemma 8). Finally, the distance between\nsuccessive points is:\n\n\u2016xi \u2212 xi+1\u2016^2 = \u2016xi \u2212 \u03c0(xi) + \u03c0(xi) \u2212 xi+1\u2016^2\n= \u2016xi \u2212 \u03c0(xi)\u2016^2 + \u2016\u03c0(xi) \u2212 xi+1\u2016^2 + 2(xi \u2212 \u03c0(xi)) \u00b7 (\u03c0(xi) \u2212 xi+1)\n\u2264 2r^2 + 2\u03b6r^2/\u221ad \u2264 \u03b1^2 r^2,\n\nwhere the second-last inequality comes from the de\ufb01nition of slab. \u25a1\n\nTo complete the proof of Theorem 6, take k = 4C\u03b4^2 (d/\u01eb^2) log n, which satis\ufb01es the requirements\nof Lemma 8 as well as those of Theorem 11, using \u03b6 = 2\u01eb^2. The relationship that de\ufb01nes r(\u03bb)\n(De\ufb01nition 9) then translates into\n\nvd r^d \u03bb = (k/n) (1 + \u01eb/2).\n\nThis shows that clusters at density level \u03bb emerge when the growing radius r of the cluster tree\nalgorithm reaches roughly (k/(\u03bb vd n))^(1/d). 
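This emergence scale is easy to compute numerically. The following small helper is our own (not from the paper); it uses the standard formula vd = \u03c0^(d/2)/\u0393(d/2 + 1) for the volume of the unit ball in Rd.

```python
import math

def unit_ball_volume(d):
    """v_d = pi^(d/2) / Gamma(d/2 + 1), the volume of the unit ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def emergence_radius(lam, n, k, d):
    """Approximate scale r at which clusters of density level lam appear:
    r ~ (k / (lam * v_d * n))**(1/d)."""
    return (k / (lam * unit_ball_volume(d) * n)) ** (1.0 / d)
```

As the theorem suggests, this radius shrinks as n grows (with k on the order of d log n) and grows for lower-density clusters.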
In order for (\u03c3, \u01eb)-separated clusters to be distinguished,\nwe need this radius to be at most \u03c3/2; this is what yields the \ufb01nal lower bound on \u03bb.\n\n4 Lower bound\n\nWe have shown that the algorithm of Figure 3 distinguishes pairs of clusters that are (\u03c3, \u01eb)-separated.\nThe number of samples it requires to capture clusters at density \u2265 \u03bb is, by Theorem 6,\n\nO( (d/(vd (\u03c3/2)^d \u03bb \u01eb^2)) log (d/(vd (\u03c3/2)^d \u03bb \u01eb^2)) ).\n\nWe\u2019ll now show that this dependence on \u03c3, \u03bb, and \u01eb is optimal. The only room for improvement,\ntherefore, is in constants involving d.\n\nTheorem 12 Pick any \u01eb in (0, 1/2), any d > 1, and any \u03c3, \u03bb > 0 such that \u03bb vd\u22121 \u03c3^d < 1/50. Then\nthere exist: an input space X \u2282 Rd; a \ufb01nite family of densities \u0398 = {\u03b8i} on X ; subsets Ai, A\u2032i, Si \u2282\nX such that Ai and A\u2032i are (\u03c3, \u01eb)-separated by Si for density \u03b8i, and inf x\u2208Ai,\u03c3\u222aA\u2032i,\u03c3 \u03b8i(x) \u2265 \u03bb, with\nthe following additional property.\nConsider any algorithm that is given n \u2265 100 i.i.d. samples Xn from some \u03b8i \u2208 \u0398 and, with\nprobability at least 1/2, outputs a tree in which the smallest cluster containing Ai \u2229 Xn is disjoint\nfrom the smallest cluster containing A\u2032i \u2229 Xn. Then\n\nn = \u2126( (1/(vd \u03c3^d \u03bb \u01eb^2 d^(1/2))) log (1/(vd \u03c3^d \u03bb)) ).\n\nPROOF: We start by constructing the various spaces and densities. X is made up of two disjoint\nregions: a cylinder X0, and an additional region X1 whose sole purpose is as a repository for excess\nprobability mass. Let Bd\u22121 be the unit ball in Rd\u22121, and let \u03c3Bd\u22121 be this same ball scaled to have\nradius \u03c3. 
The cylinder X0 stretches along the x1-axis; its cross-section is \u03c3Bd\u22121 and its length is\n4(c + 1)\u03c3 for some c > 1 to be speci\ufb01ed: X0 = [0, 4(c + 1)\u03c3] \u00d7 \u03c3Bd\u22121. Here is a picture of it:\n\n[cylinder X0 of radius \u03c3, with tick marks at 0, 4\u03c3, 8\u03c3, 12\u03c3, . . . , 4(c + 1)\u03c3 along the x1 axis]\n\nWe will construct a family of densities \u0398 = {\u03b8i} on X , and then argue that any cluster tree algorithm\nthat is able to distinguish (\u03c3, \u01eb)-separated clusters must be able, when given samples from some \u03b8I,\nto determine the identity of I. The sample complexity of this latter task can be lower-bounded using\nFano\u2019s inequality (typically stated as in [2], but easily rewritten in the convenient form of [15], see\nappendix): it is \u2126((log |\u0398|)/\u03b2), for \u03b2 = max i\u2260j K(\u03b8i, \u03b8j), where K(\u00b7, \u00b7) is KL divergence.\nThe family \u0398 contains c \u2212 1 densities \u03b81, . . . , \u03b8c\u22121, where \u03b8i is de\ufb01ned as follows:\n\n\u2022 Density \u03bb on [0, 4\u03c3i + \u03c3] \u00d7 \u03c3Bd\u22121 and on [4\u03c3i + 3\u03c3, 4(c + 1)\u03c3] \u00d7 \u03c3Bd\u22121. Since the cross-sectional\narea of the cylinder is vd\u22121 \u03c3^(d\u22121), the total mass here is \u03bb vd\u22121 \u03c3^d (4(c + 1) \u2212 2).\n\u2022 Density \u03bb(1 \u2212 \u01eb) on (4\u03c3i + \u03c3, 4\u03c3i + 3\u03c3) \u00d7 \u03c3Bd\u22121.\n\u2022 Point masses 1/(2c) at locations 4\u03c3, 8\u03c3, . . . , 4c\u03c3 along the x1-axis (use arbitrarily narrow\nspikes to avoid discontinuities).\n\u2022 The remaining mass, 1/2 \u2212 \u03bb vd\u22121 \u03c3^d (4(c + 1) \u2212 2\u01eb), is placed on X1 in some \ufb01xed manner\n(that does not vary between different densities in \u0398).\n\nHere is a sketch of \u03b8i. 
The low-density region of width 2\u03c3 is centered at 4\u03c3i + 2\u03c3 on the x1-axis.\n\n[sketch of \u03b8i: density \u03bb(1 \u2212 \u01eb) on the region of width 2\u03c3, density \u03bb elsewhere on the cylinder, point masses 1/(2c) at the spikes]\n\nFor any i \u2260 j, the densities \u03b8i and \u03b8j differ only on the cylindrical sections (4\u03c3i + \u03c3, 4\u03c3i + 3\u03c3) \u00d7\n\u03c3Bd\u22121 and (4\u03c3j + \u03c3, 4\u03c3j + 3\u03c3) \u00d7 \u03c3Bd\u22121, which are disjoint and each have volume 2 vd\u22121 \u03c3^d. Thus\n\nK(\u03b8i, \u03b8j) = 2 vd\u22121 \u03c3^d (\u03bb log(\u03bb/(\u03bb(1 \u2212 \u01eb))) + \u03bb(1 \u2212 \u01eb) log(\u03bb(1 \u2212 \u01eb)/\u03bb))\n= 2 vd\u22121 \u03c3^d \u03bb (\u2212\u01eb log(1 \u2212 \u01eb)) \u2264 (4/ln 2) vd\u22121 \u03c3^d \u03bb \u01eb^2\n\n(using ln(1 \u2212 x) \u2265 \u22122x for 0 < x \u2264 1/2). This is an upper bound on the \u03b2 in the Fano bound.\nNow de\ufb01ne the clusters and separators as follows: for each 1 \u2264 i \u2264 c \u2212 1,\n\n\u2022 Ai is the line segment [\u03c3, 4\u03c3i] along the x1-axis,\n\u2022 A\u2032i is the line segment [4\u03c3(i + 1), 4(c + 1)\u03c3 \u2212 \u03c3] along the x1-axis, and\n\u2022 Si = {4\u03c3i + 2\u03c3} \u00d7 \u03c3Bd\u22121 is the cross-section of the cylinder at location 4\u03c3i + 2\u03c3.\n\nThus Ai and A\u2032i are one-dimensional sets while Si is a (d \u2212 1)-dimensional set. It can be checked\nthat Ai and A\u2032i are (\u03c3, \u01eb)-separated by Si in density \u03b8i.\nWith the various structures de\ufb01ned, what remains is to argue that if an algorithm is given a sample\nXn from some \u03b8I (where I is unknown), and is able to separate AI \u2229 Xn from A\u2032I \u2229 Xn, then it can\neffectively infer I. This has sample complexity \u2126((log c)/\u03b2). Details are in the appendix. \u25a1\n\nThere remains a discrepancy of 2^d between the upper and lower bounds; it is an interesting open\nproblem to close this gap. 
Does the (\u03b1 = 1, k \u223c d log n) setting (yet to be analyzed) do the job?\nAcknowledgments. We thank the anonymous reviewers for their detailed and insightful comments,\nand the National Science Foundation for support under grant IIS-0347646.\n\nReferences\n\n[1] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Lecture\nNotes in Arti\ufb01cial Intelligence, 3176:169\u2013207, 2004.\n[2] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 2005.\n[3] S. Dasgupta and Y. Freund. Random projection trees for vector quantization. IEEE Transactions\non Information Theory, 55(7):3229\u20133242, 2009.\n[4] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Journal\nof Machine Learning Research, 10:281\u2013299, 2009.\n[5] J.A. Hartigan. Consistency of single linkage for high-density clusters. Journal of the American\nStatistical Association, 76(374):388\u2013394, 1981.\n[6] M. Maier, M. Hein, and U. von Luxburg. Optimal construction of k-nearest neighbor graphs\nfor identifying noisy clusters. Theoretical Computer Science, 410:1749\u20131764, 2009.\n[7] M. Penrose. Single linkage clustering and continuum percolation. Journal of Multivariate\nAnalysis, 53:94\u2013109, 1995.\n[8] D. Pollard. Strong consistency of k-means clustering. Annals of Statistics, 9(1):135\u2013140, 1981.\n[9] P. Rigollet and R. Vert. Fast rates for plug-in estimators of density level sets. Bernoulli,\n15(4):1154\u20131178, 2009.\n[10] A. Rinaldo and L. Wasserman. Generalized density clustering. Annals of Statistics,\n38(5):2678\u20132722, 2010.\n[11] A. Singh, C. Scott, and R. Nowak. Adaptive Hausdorff estimation of density level sets. Annals\nof Statistics, 37(5B):2760\u20132782, 2009.\n[12] W. Stuetzle and R. Nugent. A generalized single linkage method for estimating the cluster tree\nof a density. 
Journal of Computational and Graphical Statistics, 19(2):397\u2013418, 2010.\n[13] D. Wishart. Mode analysis: a generalization of nearest neighbor which reduces chaining effects.\nIn Proceedings of the Colloquium on Numerical Taxonomy held in the University of St.\nAndrews, pages 282\u2013308, 1969.\n[14] M.A. Wong and T. Lane. A kth nearest neighbour clustering procedure. Journal of the Royal\nStatistical Society Series B, 45(3):362\u2013368, 1983.\n[15] B. Yu. Assouad, Fano and Le Cam. Festschrift for Lucien Le Cam, pages 423\u2013435, 1997.\n", "award": [], "sourceid": 496, "authors": [{"given_name": "Kamalika", "family_name": "Chaudhuri", "institution": null}, {"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": null}]}