{"title": "Hierarchical Clustering Beyond the Worst-Case", "book": "Advances in Neural Information Processing Systems", "page_first": 6201, "page_last": 6209, "abstract": "Hiererachical clustering, that is computing a recursive partitioning of a dataset to obtain clusters at increasingly finer granularity is a fundamental problem in data analysis. Although hierarchical clustering has mostly been studied through procedures such as linkage algorithms, or top-down heuristics, rather than as optimization problems, recently Dasgupta [1] proposed an objective function for hierarchical clustering and initiated a line of work developing algorithms that explicitly optimize an objective (see also [2, 3, 4]). In this paper, we consider a fairly general random graph model for hierarchical clustering, called the hierarchical stochastic blockmodel (HSBM), and show that in certain regimes the SVD approach of McSherry [5] combined with specific linkage methods results in a clustering that give an O(1)-approximation to Dasgupta\u2019s cost function. We also show that an approach based on SDP relaxations for balanced cuts based on the work of Makarychev et al. [6], combined with the recursive sparsest cut algorithm of Dasgupta, yields an O(1) approximation in slightly larger regimes and also in the semi-random setting, where an adversary may remove edges from the random graph generated according to an HSBM. 
Finally, we report an empirical evaluation on synthetic and real-world data showing that our proposed SVD-based method does indeed achieve a better cost than other widely-used heuristics and also results in better classification accuracy when the underlying problem is one of multi-class classification.", "full_text": "Hierarchical Clustering Beyond the Worst-Case\n\nVincent Cohen-Addad\nUniversity of Copenhagen\nvcohenad@gmail.com\n\nVarun Kanade\nUniversity of Oxford\nAlan Turing Institute\nvarunk@cs.ox.ac.uk\n\nFrederik Mallmann-Trenn\nMIT\nmallmann@mit.edu\n\nAbstract\n\nHierarchical clustering, that is, computing a recursive partitioning of a dataset to obtain clusters at increasingly finer granularity, is a fundamental problem in data analysis. Although hierarchical clustering has mostly been studied through procedures such as linkage algorithms or top-down heuristics, rather than as optimization problems, Dasgupta [9] recently proposed an objective function for hierarchical clustering and initiated a line of work developing algorithms that explicitly optimize an objective (see also [7, 22, 8]). In this paper, we consider a fairly general random graph model for hierarchical clustering, called the hierarchical stochastic block model (HSBM), and show that in certain regimes the SVD approach of McSherry [18] combined with specific linkage methods results in a clustering that gives an O(1)-approximation to Dasgupta\u2019s cost function. Finally, we report an empirical evaluation on synthetic and real-world data showing that our proposed SVD-based method does indeed achieve a better cost than other widely-used heuristics and also results in better classification accuracy when the underlying problem is one of multi-class classification.\n\n1 Introduction\n\nComputing a recursive partitioning of a dataset to obtain a finer and finer classification of the data is a classic problem in data analysis.
Such a partitioning is often referred to as a hierarchical clustering and represented as a rooted tree whose leaves correspond to data elements and where each internal node induces a cluster of the leaves of its subtree. There exists a large literature on the design and analysis of algorithms for hierarchical clustering (see e.g., [21]). Two main approaches have proven to be successful in practice so far: on the one hand, divisive heuristics compute the hierarchical clustering tree in a top-down fashion by recursively partitioning the data (see e.g., [14]); on the other hand, agglomerative heuristics produce a tree by first defining a cluster for each data element and successively merging clusters according to a carefully defined function (see e.g., [19]). These heuristics are widely used in practice and are now part of the data scientist\u2019s toolkit\u2014standard machine learning libraries contain implementations of both types of heuristics.\nAgglomerative heuristics have several appealing features: they are easy to implement, easy to tune, and their running time is Õ(n² polylog n) on a dataset of size n.
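The agglomerative template described above can be made concrete in a few lines. The following is a minimal pure-Python sketch (the function name and the toy similarity are ours, purely illustrative); practical implementations use smarter data structures to reach near-quadratic time, while this naive version is cubic.

```python
from itertools import combinations

def single_linkage(points, similarity, num_clusters):
    """Naive single-linkage agglomeration: repeatedly merge the two
    clusters containing the most similar cross-cluster pair of points."""
    clusters = [{p} for p in points]
    while len(clusters) > num_clusters:
        best = None  # (similarity, i, j)
        for i, j in combinations(range(len(clusters)), 2):
            s = max(similarity(a, b) for a in clusters[i] for b in clusters[j])
            if best is None or s > best[0]:
                best = (s, i, j)
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```

With points on a line and similarity the negative distance, `single_linkage([0, 1, 10, 11], lambda a, b: -abs(a - b), 2)` groups the two nearby pairs together.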
Standard divisive heuristics based on graph partitioning or clustering methods (like, for example, the bisecting k-means or the recursive sparsest-cut approaches) often involve solving or approximating NP-hard problems.1 Therefore, it is natural to\n\n1 In some cases, it may be possible to have very fast algorithms based on heuristics to compute partitions; however, we are unaware of any such methods that would have provable guarantees for the kinds of graphs that appear in hierarchical clustering.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nask how good the solution output by an agglomerative method is compared to the solution output by a top-down method.\nFrom a qualitative perspective, this question has been addressed in a large body of work (see e.g., [5]). However, from a quantitative perspective little is known. As Dasgupta observes in his recent work [9], both agglomerative and divisive heuristics are defined procedurally rather than in terms of an objective function to optimize, one reason why a quantitative comparison of the different heuristics is rather difficult. Dasgupta introduced an objective function to model the problem of finding a hierarchical clustering of a similarity graph\u2014such an objective can be used to explicitly design optimization algorithms that minimize this cost function, as well as serve as a quantitative measure of the quality of the output.\nGiven a similarity graph, i.e., a graph where vertices represent data elements and edge weights represent similarities between data elements, Dasgupta\u2019s objective function associates a cost to any hierarchical clustering tree of the graph.
He showed that his objective function exhibits several desirable properties: for example, if the graph is disconnected, i.e., data elements in different connected components are very dissimilar, a tree minimizing this objective function will first split the graph according to the connected components.\nThis axiomatic approach to defining a \u201cmeaningful\u201d objective function for hierarchical clustering has been further explored in recent work by Cohen-Addad et al. [8]. Roughly speaking, they characterize a family of cost functions, which includes Dasgupta\u2019s cost function, such that when the input graph has a \u201cnatural\u201d ground-truth hierarchical clustering tree (in other words, a natural classification of the data), this tree has optimal cost (and any tree that is not a \u201cnatural\u201d hierarchical clustering tree of the graph has higher cost). Therefore, the results of Dasgupta and Cohen-Addad et al. indicate that Dasgupta\u2019s cost function provides a sound framework for a rigorous quantitative analysis of agglomerative and divisive heuristics.\nA suitable objective function to measure the quality of a clustering also allows one to explicitly design algorithms that minimize the cost. Dasgupta showed that the recursive sparsest-cut heuristic is an O(log^{3/2} n)-approximation algorithm for his objective function. His analysis has been improved by Charikar and Chatziafratis [7] and Cohen-Addad et al. [8] to O(√log n). Unfortunately, Charikar and Chatziafratis [7] and Roy and Pokutta [22] showed that, for general inputs, the problem cannot be approximated within any constant factor under the Small-Set Expansion hypothesis. Thus, as suggested by Charikar and Chatziafratis [7], a natural way to obtain a more fine-grained analysis of the classic agglomerative and divisive heuristics is to study beyond-worst-case scenarios.\nRandom Graph Model for Hierarchical Clustering.
A natural way to analyse a problem beyond the worst-case is to consider a suitable random input model, which is the focus of this paper. More precisely, we introduce a random graph model based on the notion of a \u201chierarchical stochastic block model\u201d (HSBM) introduced by Cohen-Addad et al., which is a natural extension of the stochastic block model. Our random graph model relies on the notion of an ultrametric, a metric in which the triangle inequality is strengthened by requiring d(x,y) ≤ max(d(x,z), d(y,z)). This is a key concept, as ultrametrics exactly capture the notion of data having a \u201cnatural\u201d hierarchical structure (cf. [5]). The random graphs are generated from data that comes from an ultrametric, but the randomness hides the natural hierarchical structure. Two natural questions are: given a random graph generated in such a fashion, when is it possible to identify the underlying ultrametric, and is the optimization of Dasgupta\u2019s cost function easier for graphs generated according to such a model? The former question was partially addressed by Cohen-Addad et al., and our focus is primarily on developing algorithms that achieve an O(1)-approximation to the expected Dasgupta cost, not on recovering the underlying ultrametric.\nMore formally, assume that the data elements lie in an unknown ultrametric space (A, dist) and so exhibit a natural hierarchical clustering defined by this ultrametric. The input is a random graph generated as follows: an edge is added between nodes u, v ∈ A with probability p = f(dist(u,v)), where f is an (unknown) non-increasing function with range (0,1). Thus, vertices that are very close in the ultrametric (and so very similar) have a higher probability of having an edge between them than vertices that are further apart. Given such a random graph, the goal is to obtain a hierarchical clustering tree that has a good cost for the objective function.
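The generative process just described can be sketched directly. In the following sketch, `toy_ultrametric` is an illustrative two-level ultrametric of our own construction and `f` stands in for the unknown non-increasing function; neither is from the paper.

```python
import random

def toy_ultrametric(x, y):
    # Toy ultrametric on two-character strings: distance 2 if the first
    # characters differ, 1 if only the second differs, 0 if equal.
    # One can check d(x, y) <= max(d(x, z), d(y, z)) for all z.
    return 0 if x == y else (2 if x[0] != y[0] else 1)

def sample_similarity_graph(points, dist, f, seed=0):
    """Sample the model's random graph: each edge {u, v} is present
    independently with probability f(dist(u, v)), f non-increasing."""
    rng = random.Random(seed)
    edges = set()
    for i, u in enumerate(points):
        for v in points[i + 1:]:
            if rng.random() < f(dist(u, v)):
                edges.add((u, v))
    return edges
```

With a step function `f` that is 1 on distances at most 1 and 0 beyond, only the pairs sharing a first character are connected, so the hidden two-cluster structure is visible in the sampled graph.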
The actual ground-truth tree is optimal in expectation, and we focus on designing algorithms that with high probability output a tree whose cost is within a constant factor of the expected cost of the ground-truth tree. Although we do not study it in this work, the question of exact recovery is also an interesting one, and the work of Cohen-Addad et al. [8] addresses this partially in certain regimes.\nAlgorithmic Results. Even in the case of random graphs, the linkage algorithms may perform quite poorly, mainly because ties may be broken unfavourably at the very bottom, when the clusters are singleton nodes; these choices cannot be easily compensated for later on in the algorithm. We thus consider the LINKAGE++ algorithm, which first uses a seeding step based on a standard SVD approach to build clusters of a significant size, an extension of the algorithm introduced in [8]. Then, we show that using these clusters as a starting point, the classic single-linkage approach achieves a (1+ε)-approximation for the problem (cf. Theorem 2.4).\nExperimental Results. We evaluate the performance of LINKAGE++ on real-world data (Scikit-learn) as well as on synthetic hierarchical data. The measure of interest is the Dasgupta cost function, and for completeness we also consider the classification error (see e.g., [22]). Our experiments show that 1) LINKAGE++ performs well on all accounts and 2) a clustering with a low Dasgupta cost appears to be correlated with a good classification. On synthetic data, LINKAGE++ seems to be clearly superior.\nRelated Work. Our work follows the line of research initiated by Dasgupta [9] and further studied by [22, 7, 8].
Dasgupta [9] introduced the cost function studied in this paper and showed that the recursive sparsest-cut approach yields an O(log^{3/2} n)-approximation. His analysis was recently improved to O(√log n) by [7, 8]. Roy and Pokutta [22] and Charikar and Chatziafratis [7] also considered LP and SDP formulations with spreading constraints to obtain approximation algorithms with approximation factors O(log n) and O(√log n), respectively. Both these works also showed the infeasibility of constant-factor approximations under the small-set expansion hypothesis. Cohen-Addad et al. [8] took an axiomatic approach to identify suitable cost functions for data generated from ultrametrics, which results in a natural ground-truth clustering. They also looked at a slightly less general hierarchical stochastic blockmodel (HSBM), where each bottom-level cluster must have linear size and with stronger conditions on allowable probabilities. Their algorithm also has a \u201cseeding phase\u201d followed by an agglomerative approach. We go beyond their bounds by focusing on approximation algorithms (we obtain a (1+ε)-approximation), whereas they aim at recovering the underlying ultrametric. As the experiments show, this trade-off seems not to impact the classification error compared to other classic approaches.\nThere is also a vast literature on graph partitioning problems in random and semi-random models. Most of this work (see e.g., [18, 11]) focuses on recovering a hidden subgraph, e.g., a clique, whereas we address the problem of obtaining good approximation guarantees w.r.t. an objective function. The reader may refer to [24, 13] for the definitions and the classic properties of agglomerative and divisive heuristics. Agglomerative and divisive heuristics have been widely studied from either a qualitative perspective or for classic \u201cflat\u201d clustering objectives like the classic k-median and k-means, see e.g., [20, 10, 16, 3, 2].
For further background on hierarchical clustering and its applications in machine learning and data science, the reader may refer to e.g., [15, 23, 12, 6].\nPreliminaries. In this paper, we work with undirected weighted graphs G = (V, E, w), where V is a set of vertices, E a set of edges, and w : E → R+. In the random and semi-random models, we work with unweighted graphs. We slightly abuse notation and extend the function w to subsets of V. Namely, for any A, B ⊆ V, let w(A,B) = Σ_{a∈A, b∈B} w(a,b). We use weights to model similarity, namely w(u,v) > w(u,w) means that data element u is more similar to v than to w. When G is clear from the context, we let |V| = n and |E| = m. For any subset S of vertices of a graph G, let G[S] be the subgraph induced by the nodes of S.\nIn the following, let G = (V, E, w) be a weighted graph on n vertices. A cluster tree or hierarchical clustering T for G is a rooted binary tree with exactly |V| leaves, each of which is labeled by a distinct vertex v ∈ V. We denote by LCA_T(u,v) the lowest common ancestor of vertices u, v in T. Given a tree T and a node N of T, we say that the subtree of N in T is the connected subgraph containing all the leaves of T that are descendants of N, and we denote this set of leaves by V(N). A metric space (X,d) is an ultrametric if for every x, y, z ∈ X, d(x,y) ≤ max{d(x,z), d(y,z)}.\nWe borrow the notions of a (similarity) graph generated from an ultrametric and of a generating tree introduced by [8]. A weighted graph G = (V, E, w) is generated from an ultrametric if there exists an ultrametric (X,d) such that V ⊆ X, and for every x, y ∈ V, x ≠ y, the edge e = {x,y} exists and w(e) = f(d(x,y)), where f : R+ → R+ is a non-increasing function.\nDefinition 1.1 (Generating Tree). Let G = (V, E, w) be a graph generated by a minimal ultrametric (V, d).
Let T be a rooted binary tree with |V| leaves; let N denote the internal nodes and L the set of leaves of T, and let σ : L → V denote a bijection between the leaves of T and the nodes of V. We say that T is a generating tree for G if there exists a weight function W : N → R+ such that for N1, N2 ∈ N, if N1 appears on the path from N2 to the root, then W(N1) ≤ W(N2); moreover, for every x, y ∈ V, w({x,y}) = W(LCA_T(σ^{-1}(x), σ^{-1}(y))).\nAs noted in [8], the above notion bears similarities to what is referred to as a dendrogram in the machine learning literature (see e.g., [5]).\nObjective Function. We consider the objective function introduced by Dasgupta [9]. Let G = (V, E, w) be a weighted graph and T = (N, E) be any rooted binary tree with leaf set V. The cost induced by a node N of T is cost_T(N) = |V(N)| · w(V(C1), V(C2)), where C1, C2 are the children of N in T. The cost of T is cost_T = Σ_{N∈N} cost_T(N). As pointed out by Dasgupta [9], this can be rephrased as cost_T = Σ_{(u,v)∈E} w(u,v) · |V(LCA_T(u,v))|.\n\n2 A General Hierarchical Stochastic Block Model\n\nWe introduce a generalization of the HSBM studied by [8] and [17]. Cohen-Addad et al. [8] introduce an algorithm to recover a \u201cground-truth\u201d hierarchical clustering in the HSBM setting. The regime in which their algorithm works is the following: (1) there is a set of hidden clusters that have linear size, and (2) the ratio between the minimum edge probability and the maximum edge probability is O(1). We aim at obtaining an algorithm that \u201cworks\u201d in a more general setting. We reach this goal by proposing a (1+ε)-approximation algorithm. Our algorithm is very similar to the widely-used linkage approach and remains easy to implement and parallelize.
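Dasgupta's cost function in its edge form can be computed directly from a cluster tree. Below is a short sketch in which trees are represented as nested 2-tuples of leaf labels; this representation and the function name are ours, for illustration only.

```python
def dasgupta_cost(tree, weights):
    """Dasgupta cost: sum over edges (u,v) of w(u,v) * |V(LCA_T(u,v))|,
    computed equivalently node by node. `tree` is a nested 2-tuple of
    leaf labels; `weights` maps frozenset pairs {u, v} to similarities."""
    def leaves(t):
        return {t} if not isinstance(t, tuple) else leaves(t[0]) | leaves(t[1])
    def rec(t):
        if not isinstance(t, tuple):
            return 0.0
        left, right = leaves(t[0]), leaves(t[1])
        # edges crossing this node's split pay |V(N)| times their weight
        cross = sum(weights.get(frozenset({u, v}), 0.0)
                    for u in left for v in right)
        return (len(left) + len(right)) * cross + rec(t[0]) + rec(t[1])
    return rec(tree)
```

On the path graph a-b-c-d with unit weights plus the pair c-d, the tree that first separates {a,b} from {c,d} costs 8, whereas the tree splitting {a,c} from {b,d} costs 12, matching the intuition that cutting few edges near the root is cheap.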
Thus, the main message of our work is that on \u201cstructured inputs\u201d the agglomerative heuristics perform well, hence making a step toward explaining their success in practice.\nThe graphs generated from our model possess an underlying, hidden (because of noise) \u201cground-truth hierarchical clustering tree\u201d (see Definition 2.1). This aims at modeling real-world classification problems for which we believe there is a natural hierarchical clustering, but one perturbed because of missing information or measurement errors. For example, in the tree of life, there is a hidden natural hierarchical clustering that we would like to reconstruct. Unfortunately, because of extinct species, we do not have a perfect input and must account for noise. We formalize this intuition using the notion of a generating tree (Def. 1.1) which, as hinted at by the definition, can be associated to an ultrametric (and so a \u201cnatural\u201d hierarchical clustering). The \u201cground-truth tree\u201d is the tree obtained from a generating tree on k leaves, to which we will refer as \u201cbottom\u201d-level clusters containing n1, n2, ..., nk nodes (following the terminology in [8]). Each edge of a generated graph has a fixed probability of being present, which only depends on the underlying ground-truth tree. This probability is a function of the clusters in which the endpoints lie and of the underlying graph on k vertices for which the tree is generating (as in Def. 1.1).\nDefinition 2.1 (Hierarchical Stochastic Block Model \u2013 Generalization of [8]). Let n be a positive integer. A hierarchical stochastic block model with k bottom-level clusters is defined as follows:\n\n1) Let G̃_k = (Ṽ_k, Ẽ_k, w) be a graph generated from an ultrametric, where |Ṽ_k| = k and, for each e ∈ Ẽ_k, w(e) ∈ (0,1). Let T̃_k be a tree on k leaves; let Ñ denote the internal nodes of T̃ and L̃ the leaves, and let σ̃ : L̃ → [k] be a bijection.
Let T̃ be generating for G̃_k with weight function W̃ : Ñ → [0,1).\n2) For each i ∈ [k], let p_i ∈ (0,1] be such that p_i > W̃(N), where N denotes the parent of σ̃^{-1}(i) in T̃.\n3) For each i ∈ [k], there is a positive integer n_i such that Σ_{i=1}^k n_i = n.\nThen a random graph G = (V,E) on n nodes is defined as follows. Each vertex i ∈ [n] is assigned a label ψ(i) ∈ [k], so that exactly n_j nodes are assigned the label j for j ∈ [k]. An edge (i,j) is added to the graph with probability p_{ψ(i)} if ψ(i) = ψ(j), and with probability W̃(N) if ψ(i) ≠ ψ(j) and N is the least common ancestor of σ̃^{-1}(ψ(i)) and σ̃^{-1}(ψ(j)) in T̃. The graph G = (V,E) is returned without any labels.\nWe use, for a generating tree T̃, the notation p_min to denote W̃(N0), where N0 is the root node of T̃. Let n_min be the size of the smallest of the k clusters. As in [8], we will use the notion of the expected graph. The expected graph is the weighted complete graph Ḡ in which an edge (i,j) has weight p_{i,j}, where p_{i,j} is the probability with which it appears in the random graph G. We refer to any tree that is generating for the expected graph Ḡ as a ground-truth tree for G. In order to avoid ambiguity, we denote by cost_T(G) and cost_T(Ḡ) the costs of the cluster tree T for the unweighted (random) graph G and the weighted graph Ḡ, respectively. Observe that due to linearity of expectation, for any tree T and any admissible cost function, cost_T(Ḡ) = E[cost_T(G)], where the expectation is with respect to the random choices of edges in G. We have\nTheorem 2.2. Let n be a positive integer and p_min = ω(√(log n / n)). Let k be a fixed constant and G be a graph generated from an HSBM (as per Defn. 2.1) where the underlying graph G̃_k has k nodes and minimum probability p_min.
For any binary tree T with n leaves labelled by the vertices of G, the following holds with high probability: |cost(T) − E[cost(T)]| ≤ o(E[cost(T)]). The expectation is taken only over the random choice of edges. In particular, if T* is a ground-truth tree for G, then, with high probability, cost(T*) ≤ (1+o(1)) min_{T'} cost(T') = (1+o(1)) OPT.\nAlgorithm LINKAGE++, a (1+ε)-Approximation Algorithm in the HSBM. We consider a simple algorithm, called LINKAGE++, which works in two phases (see Alg. 1). We use a result of [18], who considers the planted partition model. His approach, however, does not directly recover a hierarchical structure when the input has one.\n\nAlgorithm 1 LINKAGE++\n1: Input: Graph G = (V,E) generated from an HSBM.\n2: Parameter: An integer k.\n3: Apply the (SVD) projection algorithm of [18, Thm. 12] with parameters G, k, δ = |V|^{-2}, to get ζ(1), ..., ζ(|V|) ∈ R^{|V|} for the vertices in V, where dim(span(ζ(1), ..., ζ(|V|))) = k.\n4: Run the single-linkage algorithm on the points {ζ(1), ..., ζ(|V|)} until there are exactly k clusters. Let C = {C^ζ_1, ..., C^ζ_k} be the clusters (of points ζ(i)) obtained. Let C_i ⊆ V denote the set of vertices corresponding to the cluster C^ζ_i.\n5: Define dist : C × C → R+: dist(C^ζ_i, C^ζ_j) = w(C_i, C_j) / (|C^ζ_i| |C^ζ_j|).\n6: while there are at least two clusters in C do\n7: Take the pair of clusters C'_i, C'_j of C at maximum dist(C'_i, C'_j). Define a new cluster C' = C'_i ∪ C'_j.\n8: Update dist: dist(C', C'_ℓ) = max(dist(C'_i, C'_ℓ), dist(C'_j, C'_ℓ)).\n9: C ← C \\ {C'_i, C'_j} ∪ {C'}.\n10: end while\n11: The sequence of merges in the while-loop (Steps 6 to 10) induces a hierarchical clustering tree on {C^ζ_1, ..., C^ζ_k}, say T', with k leaves (C^ζ_1, ..., C^ζ_k). Replace each leaf C^ζ_i of T' by the tree obtained for C^ζ_i at Step 4 to obtain T.\n12: Repeat the algorithm k' = 2k log n times. Let T^1, ..., T^{k'} be the corresponding outputs.\n13: Output: The tree T^i (out of the k' candidates) that minimises Γ(T^i).\n\nTheorem 2.3 ([18], Observation 11 and a simplification of Theorem 12). Let δ be the confidence parameter. Assume that all u, v belonging to different clusters, with adjacency vectors u, v (i.e., u_i is 1 if the edge (u,i) exists in G and 0 otherwise), satisfy\n\n‖E[u] − E[v]‖₂² ≥ c · k · (σ²n/n_min + log(n/δ))   (1)\n\nfor a large enough constant c, where E[u] is the entry-wise expectation and σ² = ω(log⁶ n / n) is an upper bound on the variance. Then, the algorithm of [18, Thm. 12] with parameters G, k, δ projects the columns of the adjacency matrix of G to points {ζ(1), ..., ζ(|V|)} in a k-dimensional subspace of R^{|V|} such that the following holds w.p. at least 1−δ over the random graph G and with probability 1/k over the random bits of the algorithm. There exists η > 0 such that for any u in the i-th cluster and v in the j-th cluster: 1) if i = j then ‖ζ(u) − ζ(v)‖₂ ≤ η, and 2) if i ≠ j then ‖ζ(u) − ζ(v)‖₂ > 2η.\nIn the remainder we assume δ = 1/|V|². We are ready to state our main theorem.\nTheorem 2.4. Let n be a positive integer and ε > 0 a constant. Assume that the separation of bottom clusters given by (1) holds, p_min = ω(√(log n / n)), and n_min ≥ √n · log^{1/4} n. Let k be a fixed constant and G be a graph generated from an HSBM (as per Defn.
2.1) where the underlying graph G̃_k has k nodes satisfying the above constraints.\nWith high probability, Algorithm 1 with parameter k on graph G outputs a tree T' that satisfies cost_{T'} ≤ (1+ε) OPT.\nWe note that k might not be known in advance. However, different values of k can be tested, and an O(1)-estimate of k is enough for the proofs to hold. Thus, it is possible to run Algorithm 1 O(log n) times with different \u201cguesses\u201d for k and take the best of these runs.\nLet G = (V,E) be the input graph generated according to an HSBM. Let T be the tree output by Algorithm 1. We divide the proof into two main lemmas that correspond to the outcomes of the two phases mentioned above.\nThe algorithm of [18, Thm. 12] might fail for two reasons. The first reason is that the random choices made by the algorithm result in an incorrect clustering. This happens w.p. at most 1−1/k, and we can simply repeat the algorithm sufficiently many times to be sure that at least once we get the desired result, i.e., the projections satisfy the conclusion of Thm. 2.3. Lemmas 2.6 and 2.7 show that in this case, Steps 6 to 10 of LINKAGE++ produce a tree that has cost close to optimal. Ultimately, the algorithm simply outputs a tree that has the least cost among all the ones produced (and, with high probability, one of them is guaranteed to have cost (1+ε) OPT).\nThe second reason why McSherry\u2019s algorithm may fail is that the generated random graph G might \u201cdeviate\u201d too much from its expectation. This is controlled by the parameter δ (which we set to 1/|V|²). Deviations from expected behaviour will cause our algorithm to fail as well. We bound this failure probability in terms of two events. The first bad event is that McSherry\u2019s algorithm fails for either of the aforementioned reasons.
We denote the complement of this event by E1. The second bad event is that the number of edges between the vertices of two nodes of the ground-truth tree deviates from its expectation. Namely, given two nodes N1, N2 of T*, we expect the cut to be E(N1,N2) = |V(N1)| · |V(N2)| · W(LCA_{T*}(N1,N2)). Thus, we define E2 to be the event that |w(V(N1),V(N2)) − E(N1,N2)| < ε² E(N1,N2) for all cuts of the k bottom leaves. Note that the number of cuts is bounded by 2^k, and we will show that, due to the sizes of n_min and p_min, this event holds w.h.p. The assumptions on the ground-truth tree will ensure that the latter holds w.h.p., allowing us to argue that both events hold w.p. at least Ω(1/k).\nThus, from now on we assume that both \u201cgood\u201d events E1 and E2 occur. We bound the probability of event E1 in Lemma 2.5. To prove a structural property of the tree output by the algorithm, we introduce the following definition. We say that a tree T = (N, E) is a γ-approximate ground-truth tree for G and T* if there exists a weight function W' : N → R+ such that for any two vertices a, b, we have that\n\n1. γ^{-1} W'(LCA_T(a,b)) ≤ W(LCA_{T*}(a,b)) ≤ γ W'(LCA_T(a,b)), and\n2. for any node N of T and any node N' descendant of N in T, W'(N) ≤ W'(N').\n\nLemma 2.5. Let G be generated by an HSBM. Assume that the separation of bottom clusters given by (1) holds. Let C*_1, ..., C*_k be the hidden bottom-level clusters, i.e., C*_i = {v | ψ(v) = i}. With probability at least Ω(1/k), the clusters obtained after Step 4 correspond to the assignment ψ, i.e., there exists a permutation π : [k] → [k] such that C_j = C*_{π(j)}.\nLemma 2.6. Assume that the separation of bottom clusters given by (1) holds, p_min = ω(√(log n / n)), and n_min ≥ √n · log^{1/4} n.
Let G be generated according to an HSBM and let T* be a ground-truth tree for G. Assume that events E1 and E2 occur, and furthermore that the clusters obtained after Step 4 correspond to the assignment ψ, i.e., there exists a permutation π : [k] → [k] such that for each v ∈ C_i, ψ(v) = π(i). Then, the tree output by the algorithm is a (1+ε)-approximate ground-truth tree.\nThe following lemma allows us to bound the cost of an approximate ground-truth tree.\nLemma 2.7. Let G be a graph generated according to an HSBM and let T* be a ground-truth tree for G. Let Ḡ be the expected graph associated to T* and G. Let T be a γ-approximate ground-truth tree. Then, cost_T ≤ γ² OPT.\nProof of Theorem 2.4. Conditioning on E1 and E2, which occur w.h.p., and combining Lemmas 2.5, 2.7, and 2.6 together with Theorem 2.2 yields the result. As argued before, E1 holds w.p. at least 1/k, and it is possible to boost part of this probability by running Algorithm 1 multiple times. Running it Ω(k log n) times and taking the tree with the smallest cost yields the result. Moreover, E2 also holds w.h.p.\n\n3 Empirical Evaluation\n\nIn this section, we evaluate the effectiveness of LINKAGE++ on real-world and synthetic datasets. We compare our results to the classic agglomerative heuristics for hierarchical clustering, both in terms of the cost function and the classification error. Our goal is to answer the question: how good is LINKAGE++ compared to the classic agglomerative approaches on real-world and synthetic data that exhibit a ground-truth clustering?\nDatasets. The datasets we use are part of the standard Scikit-learn library [4] (and most of them are available at the UCI machine learning repository [1]). Most of these datasets exhibit a \u201cflat\u201d clustering structure, with the exception of the newsgroup dataset, which is truly hierarchical.
The goal of the algorithm is to perform a clustering of the data by finding the underlying classes. The datasets are: iris, digits, newsgroup2, diabetes, cancer, boston. For a given dataset, we define similarity between data elements using the cosine similarity; this is a standard approach for defining similarity between data elements (see, e.g., [22]). This induces a weighted similarity graph that is given as input to LINKAGE++.\nSynthetic Data. We generate random graphs of sizes n ∈ {256, 512, 1024} according to the model described in Section 2. More precisely, we define a binary tree on ℓ ∈ {4, 8} bottom clusters/leaves. Each leaf represents a \u201cclass\u201d. We create n/ℓ vertices for each class. The probability of having an edge between two vertices of classes a and b is given by the probability induced by the lowest common ancestor of the leaves corresponding to a and b, respectively. We first define p_min = 2 log n · ℓ / n. The probabilities induced by the vertices of the binary tree are the following: the probability at the root is p = p_min + (1 − p_min)/log(ℓ), and the probability induced by a node at distance d from the root is (d+1)p. In particular, the probability induced by the leaves is p_min + log(ℓ)(1 − p_min)/log(ℓ) = 1. We also investigate a less structured setting using a ground-truth tree on three nodes.\nMethod. We run LINKAGE++ with 9 different breakpoints at which we switch between phase 1 and phase 2 (which correspond to \u201cguesses\u201d of k). We output the clustering with the smallest cost. To evaluate our algorithm, we compare its performance to classic agglomerative heuristics (for the similarity setting): single linkage and complete linkage (see also [24, 13] for a complete description), and to the approach of performing only phase 1 of LINKAGE++ until only one cluster remains; we will denote this approach by PCA+.
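The agglomerative phase 2 of LINKAGE++ (Steps 5-10 of Algorithm 1), which several of the compared variants share, can be sketched as follows. This is our illustrative rendering: the phase-1 seed clusters are taken as given, `weight` is an assumed pairwise-weight callback, and the returned nested-tuple tree representation is ours.

```python
def merge_seed_clusters(clusters, weight):
    """Repeatedly merge the pair of clusters with the largest normalized
    cut weight dist(Ci, Cj) = w(Ci, Cj) / (|Ci| |Cj|), updating distances
    to the merged cluster with the max-rule of Step 8."""
    clusters = [frozenset(c) for c in clusters]
    def density(a, b):
        return sum(weight(u, v) for u in a for v in b) / (len(a) * len(b))
    dist = {frozenset((a, b)): density(a, b)
            for i, a in enumerate(clusters) for b in clusters[i + 1:]}
    tree = {c: c for c in clusters}          # cluster -> nested merge tree
    while len(clusters) > 1:
        a, b = max(dist, key=dist.get)       # Step 7: pair at max density
        merged = a | b
        tree[merged] = (tree.pop(a), tree.pop(b))
        new = {frozenset((c, merged)): max(dist[frozenset((a, c))],
                                           dist[frozenset((b, c))])
               for c in clusters if c not in (a, b)}       # Step 8
        dist = {p: v for p, v in dist.items() if not p & {a, b}}
        dist.update(new)
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return tree[clusters[0]]
```

On four singleton seed clusters where pairs (1,2) and (3,4) have weight 1 and (2,3) has weight 0.5, the densest pairs are merged first, so the resulting tree separates {1,2} from {3,4} at the root.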
Additionally, we compare against applying only phase 2 of LINKAGE++; we call this approach density-based linkage. We observe that the running times of the algorithms are of order Õ(n²), stemming already from the agglomerative parts.³ This is close to the Õ(n²) running time achieved by the classic agglomerative heuristics.

We compare the results using both the cost of the output tree w.r.t. the hierarchical clustering cost function and the classification error. The classification error is a classic tool for comparing different (usually flat) clusterings (see, e.g., [22]). For a k-clustering C : V → {1, ..., k}, the classification error w.r.t. a ground-truth flat clustering C* : V → {1, ..., k} is defined as

min_{σ ∈ S_k} (1/|V|) · Σ_{x ∈ V} 1{C(x) ≠ σ(C*(x))},

where S_k is the set of all permutations σ over k elements.

We note that the cost function is more relevant for the newsgroup dataset, since that dataset exhibits a truly hierarchical structure, and so the cost function presumably captures the quality of the classification at different levels. On the other hand, the classification error is more relevant for the other datasets, as they are intrinsically flat. All experiments are repeated at least 10 times, and the standard deviation is shown.

Results. The results are summarized in Figures 1, 2, and 3 (App. 3). In almost all experiments, LINKAGE++ performs extremely well w.r.t. both the cost and the classification error. Moreover, we observe that a low cost correlates with a good classification error. For the synthetic data, for both LINKAGE++ and PCA+, we observe in Figure 2b that the classification error drops drastically, from 0.5 to 0, as the number of nodes is increased from n = 512 to n = 1024. We observe this threshold phenomenon for every fixed k we considered (k = 4 and k = 8).
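The classification error above can be computed directly from its definition by minimizing over all k! label permutations. The following is an illustrative sketch (the function name is ours); it is practical only for small k.

```python
from itertools import permutations

def classification_error(C, C_star, k):
    """Classification error of a k-clustering C w.r.t. a ground-truth flat
    clustering C_star, both given as sequences of labels in {0, ..., k-1}:
    the minimum over permutations sigma of the fraction of points x with
    C[x] != sigma(C_star[x])."""
    n = len(C)
    best = n  # upper bound on the number of mismatches
    for sigma in permutations(range(k)):
        mismatches = sum(1 for c, c_star in zip(C, C_star) if c != sigma[c_star])
        best = min(best, mismatches)
    return best / n
```

For example, `classification_error([0, 0, 1, 1], [1, 1, 0, 0], 2)` is 0, since the permutation swapping the two labels aligns the clusterings perfectly. For larger k, the same minimum can be computed in polynomial time as a maximum-weight bipartite matching between labels (e.g., via the Hungarian algorithm) instead of brute-force enumeration.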
We can observe in Figure 2a that the normalized cost of the other linkage algorithms increases in the aforementioned setting.

Moreover, the only dataset on which LINKAGE++ and PCA+ differ significantly is the hierarchical dataset newsgroup. Here, the cost achieved by PCA+ is much higher. While the classification error of all algorithms is large, inspecting the final clusterings of LINKAGE++ and PCA+ reveals that the categories being misclassified are mostly subcategories of the same category. On the dataset of Figure 3 (App. 3), only LINKAGE++ performs well.

Conclusion. Overall, both LINKAGE++ and single linkage perform considerably better than the other algorithms on real-world data, and LINKAGE++ and PCA+ dominate on our synthetic datasets. However, in general there is no reason to believe that PCA+ would perform well at clustering truly hierarchical data: there are regimes of the HSBM for which applying only phase 1 of the algorithm may lead to a high misclassification error and a high cost, and for which we can prove that LINKAGE++ is a (1+ε)-approximation. This is exemplified in Figure 3 (App. 3). Moreover, our experiments suggest running, in addition to LINKAGE++, the other linkage algorithms and picking the tree with the lowest cost, which appears to correlate with the classification error. Nevertheless, a high classification error on hierarchical

²Due to the enormous size of the dataset, we consider a subset consisting of 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'rec.sport.baseball', 'rec.sport.hockey'.

³The top k singular vectors of an n × n matrix can be approximately computed in time Õ(kn²).

Figure 1: A comparison of the algorithms on real-world data.
(a) The figure shows the cost cost(·) of each algorithm normalized by the cost of LINKAGE++. (b) The figure shows the percentage of misclassified nodes. By looking more closely at the output of the algorithms, one can see that a large fraction of the misclassifications happen within subgroups of the same group.

Figure 2: A comparison of the algorithms on synthetic data with highly structured ground truth, for different n, k. PCA+ performs well on these inputs, and we conjecture that this is due to the highly structured nature of the ground truth. (a) The costs of LINKAGE++ and PCA+ are well below the costs of the standard linkage algorithms. (b) We see a threshold phenomenon for k = 8 from n = 512 to n = 1024: the classification error drops from 0.5 to 0, which is explained by the concentration of the eigenvalues allowing the PCA to separate the bottom clusters correctly.

data is not a bad sign per se: a misclassification of subcategories of the same category (as we observe in our experiments) is arguably tolerable, but it is ignored by the classification error. On the other hand, the cost function captures such errors nicely through its inherently hierarchical nature, and we thus strongly advocate it.

Figure 3: The clustering obtained by PCA+ on a ground-truth tree on three nodes, induced by the adjacency matrix [[1, 0.49, 0.39], [0.49, 0.49, 0.39], [0.39, 0.39, 0.62]] and n = 999 nodes split equally. Here, only LINKAGE++ and PCA+ classify the bottom clusters of the subtrees correctly. However, the projection to Euclidean space (PCA) does not preserve the underlying ultrametric, causing PCA+ to merge incorrectly. (a) LINKAGE++ recovers the ground truth. All other algorithms merge incorrectly.
(b) LINKAGE++ and PCA+ classify the bottom clusters correctly, making the classification perfect even though PCA+ fails to correctly reconstruct the ground truth. This suggests that the classification error is a less suitable measure for hierarchical data. (c) PCA+, in contrast to LINKAGE++, incorrectly merges two bottom clusters from different branches of the ground-truth tree (green and blue, as opposed to green and red).

Acknowledgement. The project leading to this application has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 748094. This work was supported in part by EPSRC grant EP/N510129/1. This work was supported in part by NSF Award Numbers BIO-1455983, CCF-1461559, and CCF-0939370.

References

[1] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.

[2] M. Balcan and Y. Liang. Clustering under perturbation resilience. SIAM J. Comput., 45(1):102–155, 2016.

[3] M.-F. Balcan, A. Blum, and S. Vempala. A discriminative framework for clustering via similarity functions. In STOC '08, pages 671–680. ACM, 2008.

[4] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013.

[5] G. Carlsson and F. Mémoli. Characterization, stability and convergence of hierarchical clustering methods. Journal of Machine Learning Research, 11:1425–1470, 2010.

[6] R. M. Castro, M. J. Coates, and R. D. Nowak. Likelihood based hierarchical clustering. IEEE Transactions on Signal Processing, 52(8):2308–2321, 2004.

[7] M. Charikar and V. Chatziafratis.
Approximate hierarchical clustering via sparsest cut and spreading metrics. In SODA '17, pages 841–854, 2017.

[8] V. Cohen-Addad, V. Kanade, F. Mallmann-Trenn, and C. Mathieu. Hierarchical clustering: Objective functions and algorithms. To appear at SODA '18, 2017.

[9] S. Dasgupta. A cost function for similarity-based hierarchical clustering. In Proc. of the 48th Annual ACM Symposium on Theory of Computing, STOC 2016. ACM, 2016.

[10] S. Dasgupta and P. M. Long. Performance guarantees for hierarchical clustering. Journal of Computer and System Sciences, 70(4):555–569, 2005.

[11] U. Feige and J. Kilian. Heuristics for semirandom graph problems. J. Comput. Syst. Sci., 63(4):639–671, Dec. 2001.

[12] J. Felsenstein. Inferring Phylogenies, volume 2. Sinauer Associates, Sunderland, 2004.

[13] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer, 2001.

[14] A. Guénoche, P. Hansen, and B. Jaumard. Efficient algorithms for divisive hierarchical clustering with the diameter criterion. Journal of Classification, 8(1):5–30, 1991.

[15] N. Jardine and R. Sibson. Mathematical Taxonomy. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, 1972.

[16] G. Lin, C. Nagarajan, R. Rajaraman, and D. P. Williamson. A general approach for incremental approximation and hierarchical clustering. In SODA '06, pages 1147–1156. SIAM, 2006.

[17] V. Lyzinski, M. Tang, A. Athreya, Y. Park, and C. E. Priebe. Community detection and classification in hierarchical stochastic blockmodels. IEEE Transactions on Network Science and Engineering, 4(1):13–26, 2017.

[18] F. McSherry. Spectral partitioning of random graphs. In FOCS '01, pages 529–537, 2001.

[19] F. Murtagh. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4):354–359, 1983.

[20] C. G. Plaxton. Approximation algorithms for hierarchical location problems. In STOC '03, pages 40–49, 2003.

[21] C. K. Reddy and B. Vinzamuri. A survey of partitional and hierarchical clustering algorithms. Data Clustering: Algorithms and Applications, 87, 2013.

[22] A. Roy and S. Pokutta. Hierarchical clustering via spreading metrics. In NIPS '16, pages 2316–2324, 2016.

[23] P. H. Sneath and R. R. Sokal. Numerical taxonomy. Nature, 193(4818):855–860, 1962.

[24] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.