{"title": "Statistical Inference for Cluster Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 1839, "page_last": 1847, "abstract": "A cluster tree provides an intuitive summary of a density function that reveals essential structure about the high-density clusters. The true cluster tree is estimated from a finite sample from an unknown true density. This paper addresses the basic question of quantifying our uncertainty by assessing the statistical significance of different features of an empirical cluster tree. We first study a variety of metrics that can be used to compare different trees, analyzing their properties and assessing their suitability for our inference task. We then propose methods to construct and summarize confidence sets for the unknown true cluster tree. We introduce a partial ordering on cluster trees which we use to prune some of the statistically insignificant features of the empirical tree, yielding interpretable and parsimonious cluster trees. Finally, we provide a variety of simulations to illustrate our proposed methods and furthermore demonstrate their utility in the analysis of a Graft-versus-Host Disease (GvHD) data set.", "full_text": "Statistical Inference for Cluster Trees\n\nJisu Kim\n\nDepartment of Statistics\n\nCarnegie Mellon University\n\nPittsburgh, USA\n\njisuk1@andrew.cmu.edu\n\nYen-Chi Chen\n\nDepartment of Statistics\nUniversity of Washington\n\nSeattle, USA\n\nyenchic@uw.edu\n\nSivaraman Balakrishnan\nDepartment of Statistics\n\nCarnegie Mellon University\n\nPittsburgh, USA\n\nsiva@stat.cmu.edu\n\nAlessandro Rinaldo\nDepartment of Statistics\n\nCarnegie Mellon University\n\nPittsburgh, USA\n\narinaldo@stat.cmu.edu\n\nLarry Wasserman\n\nDepartment of Statistics\n\nCarnegie Mellon University\n\nPittsburgh, USA\n\nlarry@stat.cmu.edu\n\nAbstract\n\nA cluster tree provides a highly-interpretable summary of a density function by\nrepresenting the hierarchy of its high-density clusters. 
It is estimated using the\nempirical tree, which is the cluster tree constructed from a density estimator. This\npaper addresses the basic question of quantifying our uncertainty by assessing the\nstatistical signi\ufb01cance of topological features of an empirical cluster tree. We \ufb01rst\nstudy a variety of metrics that can be used to compare different trees, analyze their\nproperties and assess their suitability for inference. We then propose methods to\nconstruct and summarize con\ufb01dence sets for the unknown true cluster tree. We\nintroduce a partial ordering on cluster trees which we use to prune some of the\nstatistically insigni\ufb01cant features of the empirical tree, yielding interpretable and\nparsimonious cluster trees. Finally, we illustrate the proposed methods on a variety\nof synthetic examples and furthermore demonstrate their utility in the analysis of a\nGraft-versus-Host Disease (GvHD) data set.\n\n1\n\nIntroduction\n\nClustering is a central problem in the analysis and exploration of data. It is a broad topic, with several\nexisting distinct formulations, objectives, and methods. Despite the extensive literature on the topic,\na common aspect of the clustering methodologies that has hindered its widespread scienti\ufb01c adoption\nis the dearth of methods for statistical inference in the context of clustering. Methods for inference\nbroadly allow us to quantify our uncertainty, to discern \u201ctrue\u201d clusters from \ufb01nite-sample artifacts, as\nwell as to rigorously test hypotheses related to the estimated cluster structure.\nIn this paper, we study statistical inference for the cluster tree of an unknown density. We assume that\nwe observe an i.i.d. sample {X1, . . . , Xn} from a distribution P0 with unknown density p0. Here,\nXi \u2208 X \u2282 Rd. The connected components C(\u03bb), of the upper level set {x : p0(x) \u2265 \u03bb}, are called\nhigh-density clusters. 
The set of high-density clusters forms a nested hierarchy which is referred to\nas the cluster tree (footnote 1) of p0, which we denote as Tp0.\nMethods for density clustering fall broadly in the space of hierarchical clustering algorithms, and\ninherit several of their advantages: they allow for extremely general cluster shapes and sizes, and\nin general do not require the pre-specification of the number of clusters. Furthermore, unlike flat\nclustering methods, hierarchical methods are able to provide a multi-resolution summary of the\nunderlying density. The cluster tree, irrespective of the dimensionality of the input random variable, is\ndisplayed as a two-dimensional object and this makes it an ideal tool to visualize data. In the context\nof statistical inference, density clustering has another important advantage over other clustering\nmethods: the object of inference, the cluster tree of the unknown density p0, is clearly specified.\n\nFootnote 1: It is also referred to as the density tree or the level-set tree.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn practice, the cluster tree is estimated from a finite sample, {X1, . . . , Xn} \u223c p0. In a scientific\napplication, we are often most interested in reliably distinguishing topological features genuinely\npresent in the cluster tree of the unknown p0, from topological features that arise due to random\nfluctuations in the finite sample {X1, . . . , Xn}. In this paper, we focus our inference on the cluster\ntree of the kernel density estimator, $T_{\\hat{p}_h}$, where $\\hat{p}_h$ is the kernel density estimator,\n\n$$\\hat{p}_h(x) = \\frac{1}{nh^d} \\sum_{i=1}^{n} K\\left(\\frac{\\|x - X_i\\|}{h}\\right), \\tag{1}$$\n\nwhere K is a kernel and h is an appropriately chosen bandwidth (footnote 2).\nTo develop methods for statistical inference on cluster trees, we construct a confidence set for Tp0,\ni.e. 
a collection of trees that will include Tp0 with some (pre-specified) probability. A confidence\nset can be converted to a hypothesis test, and a confidence set shows both statistical and scientific\nsignificance while a hypothesis test can only show statistical significance [23, p.155].\nTo construct and understand the confidence set, we need to solve a few technical and conceptual\nissues. The first issue is that we need a metric on trees, in order to quantify the collection of trees\nthat are in some sense \u201cclose enough\u201d to $T_{\\hat{p}_h}$ to be statistically indistinguishable from it. We use the\nbootstrap to construct tight data-driven confidence sets. However, only some metrics are sufficiently\n\u201cregular\u201d to be amenable to bootstrap inference, which guides our choice of a suitable metric on trees.\nOn the basis of a finite sample, the true density is indistinguishable from a density with additional\ninfinitesimal perturbations. This leads to the second technical issue, which is that our confidence\nset invariably contains infinitely complex trees. Inspired by the idea of one-sided inference [9],\nwe propose a partial ordering on the set of all density trees to define simple trees. To find simple\nrepresentative trees in the confidence set, we prune the empirical cluster tree by removing statistically\ninsignificant features. These pruned trees come with statistical guarantees and are simpler than\nthe empirical cluster tree in the proposed partial ordering.\nOur contributions: We begin by considering a variety of metrics on trees, studying their properties\nand discussing their suitability for inference. We then propose methods for constructing confidence\nsets and for visualizing trees in this set. This distinguishes aspects of the estimated tree that correspond\nto real features (those present in the cluster tree Tp0) from noise features. 
Finally, we apply our\nmethods to several simulations, and a Graft-versus-Host Disease (GvHD) data set to demonstrate the\nusefulness of our techniques and the role of statistical inference in clustering problems.\nRelated work: There is a vast literature on density trees (see for instance the book by Klemel\u00e4 [16]),\nand we focus our review on works most closely aligned with our paper. The formal de\ufb01nition of\nthe cluster tree, and notions of consistency in estimation of the cluster tree date back to the work of\nHartigan [15]. Hartigan studied the ef\ufb01cacy of single-linkage in estimating the cluster tree and showed\nthat single-linkage is inconsistent when the input dimension d > 1. Several \ufb01xes to single-linkage\nhave since been proposed (see for instance [21]). The paper of Chaudhuri and Dasgupta [4] provided\nthe \ufb01rst rigorous minimax analysis of the density clustering and provided a computationally tractable,\nconsistent estimator of the cluster tree. The papers [1, 5, 12, 17] propose various modi\ufb01cations and\nanalyses of estimators for the cluster tree. While the question of estimation has been extensively\naddressed, to our knowledge our paper is the \ufb01rst concerning inference for the cluster tree.\nThere is a literature on inference for phylogenetic trees (see the papers [13, 10]), but the object of\ninference and the hypothesized generative models are typically quite different. Finally, in our paper,\nwe also consider various metrics on trees. There are several recent works, in the computational\ntopology literature, that have considered different metrics on trees. The most relevant to our own\nwork, are the papers [2, 18] that propose the functional distortion metric and the interleaving distance\non trees. These metrics, however, are NP-hard to compute in general. 
In Section 3, we consider a\nvariety of computationally tractable metrics and assess their suitability for inference.\n\nFootnote 2: We address computing the tree $T_{\\hat{p}_h}$ and the choice of bandwidth in more detail in what follows.\n\nFigure 1: Examples of density trees. Black curves are the original density functions and the red trees\nare the associated density trees.\n\n2 Background and Definitions\nWe work with densities defined on a subset X \u2282 R^d, and denote by \u2016.\u2016 the Euclidean norm on X .\nThroughout this paper we restrict our attention to cluster tree estimators that are specified in terms of\na function f : X \u21a6 [0,\u221e), i.e. we have the following definition:\nDefinition 1. For any f : X \u21a6 [0,\u221e) the cluster tree of f is a function Tf : R \u21a6 2^X , where 2^X is\nthe set of all subsets of X , and Tf (\u03bb) is the set of the connected components of the upper-level set\n{x \u2208 X : f (x) \u2265 \u03bb}. We define the collection of connected components {Tf}, as {Tf} = \u22c3_\u03bb Tf (\u03bb).\nAs will be clearer in what follows, working only with cluster trees defined via a function f simplifies\nour search for metrics on trees, allowing us to use metrics specified in terms of the function f. With a\nslight abuse of notation, we will use Tf to denote also {Tf}, and write C \u2208 Tf to signify C \u2208 {Tf}.\nThe cluster tree Tf indeed has a tree structure, since for every pair C1, C2 \u2208 Tf , either C1 \u2282 C2,\nC2 \u2282 C1, or C1 \u2229 C2 = \u2205 holds. See Figure 1 for a graphical illustration of a cluster tree. The formal\ndefinition of the tree requires some topological theory; these details are in Appendix B.\nIn the context of hierarchical clustering, we are often interested in the \u201cheight\u201d at which two points or\ntwo clusters merge in the clustering. 
We introduce the merge height from [12, Definition 6]:\nDefinition 2. For any two points x, y \u2208 X , any f : X \u21a6 [0,\u221e), and its tree Tf , their merge height\nmf (x, y) is defined as the largest \u03bb such that x and y are in the same density cluster at level \u03bb, i.e.\n\nmf (x, y) = sup{\u03bb \u2208 R : there exists C \u2208 Tf (\u03bb) such that x, y \u2208 C} .\n\nWe refer to the function mf : X \u00d7 X \u21a6 R as the merge height function. For any two clusters\nC1, C2 \u2208 {Tf}, their merge height mf (C1, C2) is defined analogously,\n\nmf (C1, C2) = sup{\u03bb \u2208 R : there exists C \u2208 Tf (\u03bb) such that C1, C2 \u2282 C} .\n\nOne of the contributions of this paper is to construct valid confidence sets for the unknown true\ntree and to develop methods for visualizing the trees contained in this confidence set. Formally, we\nassume that we have samples {X1, . . . , Xn} from a distribution P0 with density p0.\nDefinition 3. An asymptotic (1 \u2212 \u03b1) confidence set, C\u03b1, is a collection of trees with the property that\n\nP0(Tp0 \u2208 C\u03b1) = 1 \u2212 \u03b1 + o(1).\n\nWe also provide non-asymptotic upper bounds on the o(1) term in the above definition. Additionally,\nwe provide methods to summarize the confidence set above. In order to summarize the confidence\nset, we define a partial ordering on trees.\nDefinition 4. For any f, g : X \u21a6 [0,\u221e) and their trees Tf , Tg, we say Tf \u2aaf Tg if there exists a map\n\u03a6 : {Tf} \u2192 {Tg} such that for any C1, C2 \u2208 Tf , we have C1 \u2282 C2 if and only if \u03a6(C1) \u2282 \u03a6(C2).\nWith Definitions 3 and 4, we describe the confidence set succinctly via some of the simplest trees in\nthe confidence set in Section 4. 
Intuitively, these are trees without statistically insignificant splits.\nIt is easy to check that the partial order \u2aaf in Definition 4 is reflexive (i.e. Tf \u2aaf Tf ) and transitive (i.e.\nthat Tf1 \u2aaf Tf2 and Tf2 \u2aaf Tf3 implies Tf1 \u2aaf Tf3). However, to argue that \u2aaf is a partial order, we\nneed to show antisymmetry, i.e. that Tf \u2aaf Tg and Tg \u2aaf Tf implies that Tf and Tg are equivalent in\nsome sense. In Appendices A and B, we show an important result: for an appropriate topology on\ntrees, Tf \u2aaf Tg and Tg \u2aaf Tf implies that Tf and Tg are topologically equivalent.\n\nFigure 2: Three illustrations of the partial order \u2aaf in Definition 4. In each case, in agreement with\nour intuitive notion of simplicity, the tree on the top ((a), (b), and (c)) is lower than the corresponding\ntree on the bottom ((d), (e), and (f)) in the partial order, i.e. for each example Tp \u2aaf Tq.\n\nThe partial order \u2aaf in Definition 4 matches intuitive notions of the complexity of the tree for several\nreasons (see Figure 2). Firstly, Tf \u2aaf Tg implies (number of edges of Tf ) \u2264 (number of edges of Tg)\n(compare Figure 2(a) and (d), and see Lemma 6 in Appendix B). Secondly, if Tg is obtained from\nTf by adding edges, then Tf \u2aaf Tg (compare Figure 2(b) and (e), and see Lemma 7 in Appendix B).\nFinally, the existence of a topology preserving embedding from {Tf} to {Tg} implies the relationship\nTf \u2aaf Tg (compare Figure 2(c) and (f), and see Lemma 8 in Appendix B).\n\n3 Tree Metrics\n\nIn this section, we introduce some natural metrics on cluster trees and study some of their properties\nthat determine their suitability for statistical inference. 
We let p, q : X \u2192 [0,\u221e) be nonnegative\nfunctions and let Tp and Tq be the corresponding trees.\n\n3.1 Metrics\n\nWe consider three metrics on cluster trees: the first is the standard \u2113\u221e metric, while the second and\nthird are metrics that appear in the work of Eldridge et al. [12].\n\u2113\u221e metric: The simplest metric is d\u221e(Tp, Tq) = \u2016p \u2212 q\u2016\u221e = sup_{x\u2208X} |p(x) \u2212 q(x)|. We will show\nin what follows that, in the context of statistical inference, this metric has several advantages over\nother metrics.\nMerge distortion metric: The merge distortion metric intuitively measures the discrepancy in the\nmerge height functions of two trees in Definition 2. We consider the merge distortion metric [12,\nDefinition 11] defined by\n\n$$d_M(T_p, T_q) = \\sup_{x,y \\in \\mathcal{X}} |m_p(x, y) - m_q(x, y)|.$$\n\nThe merge distortion metric we consider is a special case of the metric introduced by Eldridge et al.\n[12] (footnote 3). The merge distortion metric was introduced by Eldridge et al. [12] to study the convergence of\ncluster tree estimators. They establish several interesting properties of the merge distortion metric:\nin particular, the metric is stable to perturbations in \u2113\u221e, and further, convergence in the merge\ndistortion metric strengthens previous notions of convergence of the cluster trees.\n\nFootnote 3: They further allow flexibility in taking a sup over a subset of X .\n\nModified merge distortion metric: We also consider the modified merge distortion metric given by\n\n$$d_{MM}(T_p, T_q) = \\sup_{x,y \\in \\mathcal{X}} |d_{T_p}(x, y) - d_{T_q}(x, y)|,$$\n\nwhere dTp (x, y) = p(x) + p(y) \u2212 2mp(x, y), which corresponds to the (pseudo)-distance between x\nand y along the tree. The metric dMM is used in various proofs in the work of Eldridge et al. [12].\nIt is sensitive to both distortions of the merge heights in Definition 2, as well as of the underlying\ndensities. 
Since the metric captures the distortion of distances between points along the tree, it is\nin some sense most closely aligned with the cluster tree. Finally, it is worth noting that unlike the\ninterleaving distance and the functional distortion metric [2, 18], the three metrics we consider in this\npaper are quite simple to approximate to a high-precision.\n\n3.2 Properties of the Metrics\n\nThe following Lemma gives some basic relationships between the three metrics d\u221e, dM and dMM. We\nde\ufb01ne pinf = inf x\u2208X p(x), and qinf analogously, and a = inf x\u2208X{p(x) + q(x)}\u2212 2 min{pinf , qinf}.\nNote that when the Lebesgue measure \u00b5(X ) is in\ufb01nite, then pinf = qinf = a = 0.\nLemma 1. For any densities p and q, the following relationships hold: (i) When p and q are\ncontinuous, then d\u221e(Tp, Tq) = dM(Tp, Tq). (ii) dMM(Tp, Tq) \u2264 4d\u221e(Tp, Tq). (iii) dMM(Tp, Tq) \u2265\nd\u221e(Tp, Tq) \u2212 a, where a is de\ufb01ned as above. Additionally when \u00b5(X ) = \u221e, then dMM(Tp, Tq) \u2265\nd\u221e(Tp, Tq).\n\nThe proof is in Appendix F. From Lemma 1, we can see that under a mild assumption (continuity of\nthe densities), d\u221e and dM are equivalent. We note again that the work of Eldridge et al. [12] actually\nde\ufb01nes a family of merge distortion metrics, while we restrict our attention to a canonical one. We\ncan also see from Lemma 1 that while the modi\ufb01ed merge metric is not equivalent to d\u221e, it is usually\nmultiplicatively sandwiched by d\u221e.\nOur next line of investigation is aimed at assessing the suitability of the three metrics for the task\nof statistical inference. Given the strong equivalence of d\u221e and dM we focus our attention on d\u221e\nand dMM. Based on prior work (see [7, 8]), the large sample behavior of d\u221e is well understood. 
In\nparticular, $d_\\infty(T_{\\hat{p}_h}, T_{p_0})$ converges to the supremum of an appropriate Gaussian process, on the basis\nof which we can construct confidence intervals for the d\u221e metric.\nThe situation for the metric dMM is substantially more subtle. One of our eventual goals is to use\nthe non-parametric bootstrap to construct valid estimates of the confidence set. In general, a way to\nassess the amenability of a functional to the bootstrap is via Hadamard differentiability [24]. Roughly\nspeaking, Hadamard differentiability is a type of statistical stability that ensures that the functional\nunder consideration is stable to perturbations in the input distribution. In Appendix C, we formally\ndefine Hadamard differentiability and prove that dMM is not point-wise Hadamard differentiable.\nThis does not completely rule out the possibility of finding a way to construct confidence sets based\non dMM, but doing so would be difficult and so far we know of no way to do it.\nIn summary, based on computational considerations we eliminate the interleaving distance and\nthe functional distortion metric [2, 18]; we eliminate the dMM metric based on its unsuitability for\nstatistical inference; and we focus the rest of our paper on the d\u221e (or equivalently dM) metric, which is\nboth computationally tractable and has well understood statistical behavior.\n\n4 Confidence Sets\n\nIn this section, we consider the construction of valid confidence intervals centered around the kernel\ndensity estimator, defined in Equation (1). We first observe that a fixed bandwidth for the KDE\ngives a dimension-free rate of convergence for estimating a cluster tree. For estimating a density\nin high dimensions, the KDE has a poor rate of convergence, due to a decreasing bandwidth for\nsimultaneously optimizing the bias and the variance of the KDE.\nWhen estimating a cluster tree, the bias of the KDE does not affect its cluster tree. 
Intuitively, the\ncluster tree is a shape characteristic of a function, which is not affected by the bias. Defining the\nbiased density, $p_h(x) = E[\\hat{p}_h(x)]$, the cluster trees of ph and of the true density p0 are equivalent\nwith respect to the topology in Appendix A, if h is small enough and p0 is regular enough:\nLemma 2. Suppose that the true unknown density p0 has no degenerate critical points (footnote 4). Then\nthere exists a constant h0 > 0 such that for all 0 < h \u2264 h0, the two cluster trees, Tp0 and Tph, have\nthe same topology in Appendix A.\n\nFootnote 4: The Hessian of p0 at every critical point is non-degenerate. Such functions are known as Morse functions.\n\nFrom Lemma 2, proved in Appendix G, a fixed bandwidth for the KDE can be applied to give a\ndimension-free rate of convergence for estimating the cluster tree. Instead of decreasing bandwidth h\nand inferring the cluster tree of the true density Tp0 at rate O_P(n^{\u22122/(4+d)}), Lemma 2 implies that we\ncan fix h > 0 and infer the cluster tree of the biased density Tph at rate O_P(n^{\u22121/2}) independently of\nthe dimension. Hence a fixed bandwidth crucially enhances the convergence rate of the proposed\nmethods in high-dimensional settings.\n\n4.1 A data-driven confidence set\n\nWe recall that we base our inference on the d\u221e metric, and we recall the definition of a valid\nconfidence set (see Definition 3). As a conceptual first step, suppose that for a specified value \u03b1 we\ncould compute the 1 \u2212 \u03b1 quantile of the distribution of $d_\\infty(T_{\\hat{p}_h}, T_{p_h})$, and denote this value t\u03b1. Then\na valid confidence set for the unknown Tph is $C_\\alpha = \\{T : d_\\infty(T, T_{\\hat{p}_h}) \\le t_\\alpha\\}$. To estimate t\u03b1, we use\nthe bootstrap. Specifically, we generate B bootstrap samples, $\\{\\widetilde{X}^1_1, \\cdots, \\widetilde{X}^1_n\\}, \\ldots, \\{\\widetilde{X}^B_1, \\cdots, \\widetilde{X}^B_n\\}$,\nby sampling with replacement from the original sample. On each bootstrap sample, we compute\nthe KDE, and the associated cluster tree. We denote the cluster trees $\\{\\widetilde{T}^1_{p_h}, \\ldots, \\widetilde{T}^B_{p_h}\\}$. Finally, we\nestimate t\u03b1 by $\\widehat{t}_\\alpha = \\widehat{F}^{-1}(1 - \\alpha)$, where\n\n$$\\widehat{F}(s) = \\frac{1}{B} \\sum_{i=1}^{B} I\\left(d_\\infty(\\widetilde{T}^i_{p_h}, T_{\\hat{p}_h}) < s\\right).$$\n\nThen the data-driven confidence set is $\\widehat{C}_\\alpha = \\{T : d_\\infty(T, \\widehat{T}_h) \\le \\widehat{t}_\\alpha\\}$. Using techniques from [8, 7],\nthe following can be shown (proof omitted):\nTheorem 3. Under mild regularity conditions on the kernel (footnote 5), we have that the constructed confidence\nset is asymptotically valid and satisfies,\n\n$$P\\left(T_h \\in \\widehat{C}_\\alpha\\right) = 1 - \\alpha + O\\left(\\left(\\frac{\\log^7 n}{nh^d}\\right)^{1/6}\\right).$$\n\nFootnote 5: See Appendix D.1 for details.\n\nHence our data-driven confidence set is consistent at a dimension independent rate. When h is a fixed\nsmall constant, Lemma 2 implies that Tp0 and Tph have the same topology, and Theorem 3 guarantees\nthat the non-parametric bootstrap is consistent at a dimension independent O(((log n)^7/n)^{1/6}) rate.\nFor reasons explained in [8], this rate is believed to be optimal.\n\n4.2 Probing the Confidence Set\n\nThe confidence set $\\widehat{C}_\\alpha$ is an infinite set with a complex structure. Infinitesimal perturbations of the\ndensity estimate are in our confidence set and so this set contains very complex trees. One way to\nunderstand the structure of the confidence set is to focus attention on simple trees in the confidence\nset. Intuitively, these trees only contain topological features (splits and branches) that are sufficiently\nstrongly supported by the data.\nWe propose two pruning schemes to find trees that are simpler than the empirical tree $T_{\\hat{p}_h}$ and that are in\nthe confidence set. Pruning the empirical tree aids visualization and also de-noises the empirical\ntree by eliminating some features that arise solely due to the stochastic variability of the finite sample.\nThe algorithms are (see Figure 3):\n\n1. Pruning only leaves: Remove all leaves of length less than $2\\widehat{t}_\\alpha$ (Figure 3(b)).\n\n2. Pruning leaves and internal branches: In this case, we first prune the leaves as above. This\nyields a new tree. Now we again prune (using cumulative length) any leaf of length less than $2\\widehat{t}_\\alpha$. We\ncontinue iteratively until all remaining leaves are of cumulative length larger than $2\\widehat{t}_\\alpha$ (Figure 3(c)).\n\nIn Appendix D.2 we formally define the pruning operation and show the following. The remaining\ntree $\\widetilde{T}$ after either of the above pruning operations satisfies: (i) $\\widetilde{T} \\preceq T_{\\hat{p}_h}$, (ii) there exists a function f\nwhose tree is $\\widetilde{T}$, and (iii) $\\widetilde{T} \\in \\widehat{C}_\\alpha$ (see Lemma 10 in Appendix D.2). In other words, we have identified a\nvalid tree with a statistical guarantee that is simpler than the original estimate $T_{\\hat{p}_h}$. Intuitively, some\nof the statistically insignificant features have been removed from $T_{\\hat{p}_h}$. We should point out, however,\nthat there may exist other trees that are simpler than $T_{\\hat{p}_h}$ that are in $\\widehat{C}_\\alpha$. Ideally, we would like to\nhave an algorithm that identifies all trees in the confidence set that are minimal with respect to the\npartial order \u2aaf in Definition 4. This is an open question that we will address in future work.\n\nFigure 3: Illustrations of our two pruning strategies. Panels: (a) The empirical tree. (b) Pruning only leaves. (c) Pruning leaves and branches. (a) shows the empirical tree. In (b), leaves that\nare insignificant are pruned, while in (c), insignificant internal branches are further pruned top-down.\n\nFigure 4: Simulation examples. (a) and (d) are the ring data; (b) and (e) are the Mickey Mouse data;\n(c) and (f) are the yingyang data. The solid lines are the pruned trees; the dashed lines are leaves (and\nedges) removed by the pruning procedure. A bar of length $2\\widehat{t}_\\alpha$ is at the top right corner. The pruned\ntrees recover the actual structure of connected components.\n\n5 Experiments\n\nIn this section, we demonstrate the techniques we have developed for inference on synthetic data, as\nwell as on a real dataset.\n\n5.1 Simulated data\n\nWe consider three simulations: the ring data (Figure 4(a) and (d)), the Mickey Mouse data (Figure 4(b)\nand (e)), and the yingyang data (Figure 4(c) and (f)). The smoothing bandwidth is chosen by the\nSilverman reference rule [20] and we pick the significance level \u03b1 = 0.05.\n\nFigure 5: The GvHD data. Panels: (a) The positive treatment data. (b) The control data. The solid brown lines are the remaining branches after pruning; the blue\ndashed lines are the pruned leaves (or edges). A bar of length $2\\widehat{t}_\\alpha$ is at the top right corner.\n\nExample 1: The ring data. (Figure 4(a) and (d)) The ring data consists of two structures: an outer\nring and a center node. The outer circle consists of 1000 points and the central node contains 200\npoints. To construct the tree, we used h = 0.202.\nExample 2: The Mickey Mouse data. 
(Figure 4(b) and (e)) The Mickey Mouse data has three\ncomponents: the top left and right uniform circle (400 points each) and the center circle (1200 points).\nIn this case, we select h = 0.200.\nExample 3: The yingyang data. (Figure 4(c) and (f)) This data has 5 connected components: outer\nring (2000 points), the two moon-shape regions (400 points each), and the two nodes (200 points\neach). We choose h = 0.385.\nFigure 4 shows those data ((a), (b), and (c)) along with the pruned density trees (solid parts in (d), (e),\nand (f)). Before pruning the tree (both solid and dashed parts), there are more leaves than the actual\nnumber of connected components. But after pruning (only the solid parts), every leaf corresponds to\nan actual connected component. This demonstrates the power of a good pruning procedure.\n\n5.2 GvHD dataset\n\nNow we apply our method to the GvHD (Graft-versus-Host Disease) dataset [3]. GvHD is a\ncomplication that may occur when transplanting bone marrow or stem cells from one subject to\nanother [3]. We obtained the GvHD dataset from R package \u2018mclust\u2019. There are two subsamples: the\ncontrol sample and the positive (treatment) sample. The control sample consists of 9083 observations\nand the positive sample contains 6809 observations on 4 biomarker measurements (d = 4). By the\nnormal reference rule [20], we pick h = 39.1 for the positive sample and h = 42.2 for the control\nsample. We set the signi\ufb01cance level \u03b1 = 0.05.\nFigure 5 shows the density trees in both samples. The solid brown parts are the remaining components\nof density trees after pruning and the dashed blue parts are the branches removed by pruning. As can\nbe seen, the pruned density tree of the positive sample (Figure 5(a)) is quite different from the pruned\ntree of the control sample (Figure 5(b)). The density function of the positive sample has fewer bumps\n(2 signi\ufb01cant leaves) than the control sample (3 signi\ufb01cant leaves). 
By comparing the pruned trees,\nwe can see how the two distributions differ from each other.\n\n6 Discussion\n\nThere are several open questions that we will address in future work. First, it would be useful to have\nan algorithm that can find all trees in the confidence set that are minimal with respect to the partial\norder \u2aaf. These are the simplest trees consistent with the data. Second, we would like to find a way\nto derive valid confidence sets using the metric dMM which we view as an appealing metric for tree\ninference. Finally, we have used the Silverman reference rule [20] for choosing the bandwidth but we\nwould like to find a bandwidth selection method that is more targeted to tree inference.\n\nReferences\n[1] S. Balakrishnan, S. Narayanan, A. Rinaldo, A. Singh, and L. Wasserman. Cluster trees on manifolds. In Advances in Neural Information Processing Systems, 2012.\n\n[2] U. Bauer, E. Munch, and Y. Wang. Strong equivalence of the interleaving and functional distortion metrics for Reeb graphs. In 31st International Symposium on Computational Geometry (SoCG 2015), volume 34, pages 461\u2013475. Schloss Dagstuhl\u2013Leibniz-Zentrum fuer Informatik, 2015.\n\n[3] R. R. Brinkman, M. Gasparetto, S.-J. J. Lee, A. J. Ribickas, J. Perkins, W. Janssen, R. Smiley, and C. Smith. High-content flow cytometry and temporal data analysis for defining a cellular signature of graft-versus-host disease. Biology of Blood and Marrow Transplantation, 13(6):691\u2013700, 2007.\n\n[4] K. Chaudhuri and S. Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pages 343\u2013351, 2010.\n\n[5] K. Chaudhuri, S. Dasgupta, S. Kpotufe, and U. von Luxburg. Consistent procedures for cluster tree estimation and pruning. 
IEEE Transactions on Information Theory, 2014.\n\n[6] F. Chazal, B. T. Fasy, F. Lecci, B. Michel, A. Rinaldo, and L. Wasserman. Robust topological inference: Distance to a measure and kernel distance. arXiv preprint arXiv:1412.7197, 2014.\n\n[7] Y.-C. Chen, C. R. Genovese, and L. Wasserman. Density level sets: Asymptotics, inference, and visualization. arXiv:1504.05438, 2015.\n\n[8] V. Chernozhukov, D. Chetverikov, and K. Kato. Central limit theorems and bootstrap in high dimensions. Annals of Probability, 2016.\n\n[9] D. Donoho. One-sided inference about functionals of a density. The Annals of Statistics, 16(4):1390\u20131420, 1988.\n\n[10] B. Efron, E. Halloran, and S. Holmes. Bootstrap confidence levels for phylogenetic trees. Proceedings of the National Academy of Sciences, 93(23), 1996.\n\n[11] U. Einmahl and D. M. Mason. Uniform in bandwidth consistency of kernel-type function estimators. The Annals of Statistics, 33(3):1380\u20131403, 2005.\n\n[12] J. Eldridge, M. Belkin, and Y. Wang. Beyond Hartigan consistency: Merge distortion metric for hierarchical clustering. In Proceedings of The 28th Conference on Learning Theory, pages 588\u2013606, 2015.\n\n[13] J. Felsenstein. Confidence limits on phylogenies, a justification. Evolution, 39, 1985.\n\n[14] C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Nonparametric ridge estimation. The Annals of Statistics, 42(4):1511\u20131545, 2014.\n\n[15] J. A. Hartigan. Consistency of single linkage for high-density clusters. Journal of the American Statistical Association, 1981.\n\n[16] J. Klemel\u00e4. Smoothing of multivariate data: density estimation and visualization, volume 737. John Wiley & Sons, 2009.\n\n[17] S. Kpotufe and U. V. Luxburg. Pruning nearest neighbor cluster trees. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 225\u2013232, 2011.\n\n[18] D. Morozov, K. Beketayev, and G. Weber. 
Interleaving distance between merge trees. Discrete and Computational Geometry, 49:22\u201345, 2013.\n\n[19] D. W. Scott. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2015.\n\n[20] B. W. Silverman. Density estimation for statistics and data analysis, volume 26. CRC Press, 1986.\n\n[21] W. Stuetzle and R. Nugent. A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics, 19(2), 2010.\n\n[22] L. Wasserman. All of nonparametric statistics. Springer Science & Business Media, 2006.\n\n[23] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media, 2010. ISBN 1441923225, 9781441923226.\n\n[24] J. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Science & Business Media, 2013.", "award": [], "sourceid": 1000, "authors": [{"given_name": "Jisu", "family_name": "KIM", "institution": "Carnegie Mellon University"}, {"given_name": "Yen-Chi", "family_name": "Chen", "institution": "Carnegie Mellon University"}, {"given_name": "Sivaraman", "family_name": "Balakrishnan", "institution": "Carnegie Mellon University"}, {"given_name": "Alessandro", "family_name": "Rinaldo", "institution": "Carnegie Mellon University"}, {"given_name": "Larry", "family_name": "Wasserman", "institution": "Carnegie Mellon University"}]}