{"title": "Hierarchical Clustering via Spreading Metrics", "book": "Advances in Neural Information Processing Systems", "page_first": 2316, "page_last": 2324, "abstract": "We study the cost function for  hierarchical clusterings introduced by [Dasgupta, 2015]  where hierarchies are treated as first-class objects rather than deriving their cost from projections into flat clusters. It was also shown in [Dasgupta, 2015] that a top-down algorithm  returns a hierarchical clustering of cost at most \\(O\\left(\\alpha_n \\log n\\right)\\) times the cost of the optimal hierarchical clustering, where \\(\\alpha_n\\) is the approximation ratio of the Sparsest Cut subroutine used. Thus using the best known approximation algorithm for Sparsest Cut due to Arora-Rao-Vazirani,  the top down algorithm returns a hierarchical clustering of cost at most  \\(O\\left(\\log^{3/2} n\\right)\\) times the cost of the optimal solution. We improve this by giving an \\(O(\\log{n})\\)-approximation algorithm for this problem. Our main technical ingredients are a combinatorial characterization of ultrametrics induced by this cost function, deriving an Integer Linear Programming (ILP) formulation for this family of ultrametrics, and showing how to iteratively round an LP relaxation of this formulation by  using the idea of \\emph{sphere growing} which has been extensively used in the context of graph  partitioning. We also prove that our algorithm returns an \\(O(\\log{n})\\)-approximate  hierarchical clustering for a generalization of this cost function also studied in [Dasgupta, 2015]. Experiments show that the hierarchies found by using the ILP formulation as well  as our rounding algorithm often have better projections into flat clusters than the standard linkage based algorithms. 
We conclude with an inapproximability result for this problem, namely that no polynomial sized LP or SDP can be used to obtain a constant factor approximation for this problem.", "full_text": "Hierarchical Clustering via Spreading Metrics

Aurko Roy1 and Sebastian Pokutta2

1College of Computing, Georgia Institute of Technology, Atlanta, GA, USA. Email: aurko@gatech.edu

2ISyE, Georgia Institute of Technology, Atlanta, GA, USA. Email: sebastian.pokutta@isye.gatech.edu

Abstract

We study the cost function for hierarchical clusterings introduced by [16] where hierarchies are treated as first-class objects rather than deriving their cost from projections into flat clusters. It was also shown in [16] that a top-down algorithm returns a hierarchical clustering of cost at most O(αn log n) times the cost of the optimal hierarchical clustering, where αn is the approximation ratio of the Sparsest Cut subroutine used. Thus using the best known approximation algorithm for Sparsest Cut due to Arora-Rao-Vazirani, the top-down algorithm returns a hierarchical clustering of cost at most O(log^{3/2} n) times the cost of the optimal solution. We improve this by giving an O(log n)-approximation algorithm for this problem. Our main technical ingredients are a combinatorial characterization of ultrametrics induced by this cost function, deriving an Integer Linear Programming (ILP) formulation for this family of ultrametrics, and showing how to iteratively round an LP relaxation of this formulation by using the idea of sphere growing which has been extensively used in the context of graph partitioning. We also prove that our algorithm returns an O(log n)-approximate hierarchical clustering for a generalization of this cost function also studied in [16]. We also give constant factor inapproximability results for this problem.

1 Introduction

Hierarchical clustering is an important method in cluster analysis where a data set is recursively partitioned into clusters of successively smaller size. Such hierarchies are typically represented by rooted trees where the root corresponds to the entire data set, the leaves correspond to individual data points and the intermediate nodes correspond to a cluster of its descendant leaves. Such a hierarchy represents several possible flat clusterings of the data at various levels of granularity; indeed every pruning of this tree returns a possible clustering. Therefore in situations where the number of desired clusters is not known beforehand, a hierarchical clustering scheme is often preferred to flat clustering.

The most popular algorithms for hierarchical clustering are bottom-up agglomerative algorithms like single linkage, average linkage and complete linkage. In terms of theoretical guarantees these algorithms are known to correctly recover a ground truth clustering if the similarity function on the data satisfies corresponding stability properties (see, e.g., [5]). Often, however, one wishes to think of a good clustering as optimizing some kind of cost function rather than recovering a hidden “ground truth”. This is the standard approach in the classical clustering setting where popular objectives are k-means, k-median, min-sum and k-center (see Chapter 14, [23]). However, as pointed out by [16], for many popular hierarchical clustering algorithms, including linkage based algorithms, it is hard to pinpoint explicitly the cost function that these algorithms are optimizing.
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Moreover, many of the existing cost function based approaches towards hierarchical clustering evaluate a hierarchy based on a cost function for flat clustering, e.g., assigning the k-means or k-median cost to a pruning of this tree. Motivated by this, [16] introduced a cost function for hierarchical clustering where the cost takes into account the entire structure of the tree rather than just the projections into flat clusterings. This cost function is shown to recover the intuitively correct hierarchies on several synthetic examples like planted partitions and cliques. In addition, a top-down graph partitioning algorithm is presented that outputs a tree with cost at most O(αn log n) times the cost of the optimal tree, where αn is the approximation guarantee of the Sparsest Cut subroutine used. Thus using the Leighton-Rao algorithm [33] or the Arora-Rao-Vazirani algorithm [3] gives an approximation factor of O(log^2 n) and O(log^{3/2} n) respectively.

In this work we give a polynomial time algorithm to recover a hierarchical clustering of cost at most O(log n) times the cost of the optimal clustering according to this cost function. We also analyze a generalization of this cost function studied by [16] and show that our algorithm still returns an O(log n) approximate clustering in this setting. We do this by giving a combinatorial characterization of the ultrametrics induced by this cost function, writing a convex relaxation for it and showing how to iteratively round a fractional solution into an integral one using a rounding scheme used in graph partitioning algorithms.
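The top-down heuristic of [16] referenced above is simple to state operationally: split the data with a Sparsest Cut subroutine and recurse on both sides. The following is an illustrative sketch only; it substitutes a brute-force (exponential-time) search for the approximate Sparsest Cut subroutine (Leighton-Rao or Arora-Rao-Vazirani above), so it is usable on toy instances only, and all names are ours rather than from the paper's code.

```python
# Illustrative sketch of the top-down heuristic: recursively split V with a
# sparsest-cut subroutine. The subroutine here is a brute-force stand-in
# (exponential in |V|), NOT the Leighton-Rao/ARV approximation used in the paper.
import itertools

def sparsity(S, T, kappa):
    """Sparsest-cut objective: total similarity across the cut over |S|*|T|."""
    return sum(kappa[i, j] for i in S for j in T) / (len(S) * len(T))

def sparsest_cut(V, kappa):
    """Brute-force sparsest cut of V (toy instances only)."""
    best = None
    for r in range(1, len(V) // 2 + 1):
        for S in itertools.combinations(V, r):
            T = tuple(v for v in V if v not in S)
            val = sparsity(S, T, kappa)
            if best is None or val < best[0]:
                best = (val, S, T)
    return best[1], best[2]

def top_down(V, kappa):
    """Return a hierarchy as nested tuples; leaves are the data points."""
    if len(V) == 1:
        return V[0]
    S, T = sparsest_cut(V, kappa)
    return (top_down(S, kappa), top_down(T, kappa))
```

On four points consisting of two highly similar pairs, the first cut separates the two pairs and the recursion then splits each pair into singletons.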
We also implement the integer program, its LP relaxation, and the rounding algorithm, and test them on some synthetic and real world data sets to compare the cost of the rounded solutions to the true optimum, as well as to compare their performance to other hierarchical clustering algorithms used in practice. Our experiments suggest that the hierarchies found by this algorithm are often better than the ones found by linkage based algorithms as well as the k-means algorithm in terms of the error of the best pruning of the tree compared to the ground truth. We conclude with constant factor hardness results for this problem.

1.1 Related Work

The immediate precursor to this work is [16] where the cost function for evaluating a hierarchical clustering was introduced. Prior to this there has been a long line of research on hierarchical clustering in the context of phylogenetics and taxonomy (see, e.g., [22]). Several authors have also given theoretical justifications for the success of the popular linkage based algorithms for hierarchical clustering (see, e.g., [1]). In terms of cost functions, one approach has been to evaluate a hierarchy in terms of the k-means or k-median cost that it induces (see [17]). The cost function and the top-down algorithm in [16] can also be seen as a theoretical justification for several graph partitioning heuristics that are used in practice.

LP relaxations for hierarchical clustering have also been studied in [2] where the objective is to fit a tree metric to a data set given pairwise dissimilarities. Another work that is indirectly related to our approach is [18] where an ILP was studied in the context of obtaining the closest ultrametric to arbitrary functions on a discrete set. Our approach is to give a combinatorial characterization of the ultrametrics induced by the cost function of [16] which allows us to use the tools from [18] to model the problem as an ILP.
The natural LP relaxation of this ILP turns out to be closely related to LP relaxations considered before for several graph partitioning problems (see, e.g., [33, 19, 32]) and we use a rounding technique studied in this context to round this LP relaxation.

Recently, we became aware of independent work by Charikar and Chatziafratis [12] obtaining similar results for hierarchical clustering. In particular they improve the approximation factor to O(√(log n)) by showing how to round a spreading metric SDP relaxation for this cost function. They also analyze a similar LP relaxation using the divide-and-conquer approximation algorithms using spreading metrics paradigm of [20] together with a result of [7] to prove an O(log n) approximation. Finally, they also give similar inapproximability results for this problem.

2 Preliminaries

A similarity based clustering problem consists of a dataset V of n points and a similarity function κ : V × V → R such that κ(i, j) is a measure of the similarity between i and j for any i, j ∈ V. We will assume that the similarity function is symmetric, i.e., κ(i, j) = κ(j, i) for every i, j ∈ V. We also require κ ≥ 0 as in [16]; see supplementary material for a discussion. Note that we do not make any assumptions about the points in V coming from an underlying metric space. For a given instance of a clustering problem we have an associated weighted complete graph Kn with vertex set V and weight function given by κ. A hierarchical clustering of V is a tree T with a designated root r and with the elements of V as its leaves, i.e., leaves(T) = V. For any set S ⊆ V we denote the lowest common ancestor of S in T by lca(S). For pairs of points i, j ∈ V we will abuse the notation for the sake of simplicity and denote lca({i, j}) simply by lca(i, j). For a node v of T we denote the subtree of T rooted at v by T[v].
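The quantities leaves(T) and lca(i, j) are all the cost function stated next depends on. A minimal sketch of computing them, under an assumed tuple-based tree encoding (an internal node is a tuple of subtrees, a leaf is a data point); the names are ours, not from the paper's implementation:

```python
# Hypothetical tuple-based hierarchy: internal node = tuple of subtrees,
# leaf = data point. leaves() and the leaf count at lca(i, j) are the
# ingredients of the cost function of [16] stated next (equation (1)).
def leaves(T):
    """Set of data points at the leaves of (sub)tree T."""
    return {T} if not isinstance(T, tuple) else set().union(*(leaves(c) for c in T))

def lca_leaf_count(T, i, j):
    """|leaves(T[lca(i, j)])|: leaf count under the lowest common ancestor of i, j."""
    if isinstance(T, tuple):
        for child in T:
            L = leaves(child)
            if i in L and j in L:       # both points in the same subtree: descend
                return lca_leaf_count(child, i, j)
    return len(leaves(T))               # T's root is the lca

def cost(T, kappa):
    """Cost function (1): sum over pairs of kappa(i, j) * |leaves(T[lca(i, j)])|."""
    V = sorted(leaves(T))
    return sum(kappa[i, j] * lca_leaf_count(T, i, j)
               for a, i in enumerate(V) for j in V[a + 1:])
```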
The following cost function was introduced by [16] to measure the quality of the hierarchical clustering T:

cost(T) := Σ_{{i,j} ∈ E(Kn)} κ(i, j) |leaves(T[lca(i, j)])|.    (1)

The intuition behind this cost function is as follows. Let T be a hierarchical clustering with designated root r so that r represents the whole data set V. Since leaves(T) = V, every internal node v ∈ T represents a cluster of its descendant leaves, with the leaves themselves representing singleton clusters of V. Starting from r and going down the tree, every distinct pair of points i, j ∈ V will eventually be separated at the leaves. If κ(i, j) is large, i.e., i and j are very similar to each other, then we would like them to be separated as far down the tree as possible if T is a good clustering of V. This is enforced in the cost function (1): if κ(i, j) is large then the number of leaves of lca(i, j) should be small, i.e., lca(i, j) should be far from the root r of T.

Under the cost function (1), one can interpret the tree T as inducing an ultrametric dT on V given by dT(i, j) := |leaves(T[lca(i, j)])| − 1. This is an ultrametric since dT(i, j) = 0 iff i = j and for any triple i, j, k ∈ V we have dT(i, j) ≤ max{dT(i, k), dT(j, k)}. The following definition introduces the notion of non-trivial ultrametrics. These turn out to be precisely the ultrametrics that are induced by tree decompositions of V corresponding to cost function (1), as we will show in Lemma 5.

Definition 1. An ultrametric d on a set of points V is non-trivial if the following conditions hold.

1. For every non-empty set S ⊆ V, there is a pair of points i, j ∈ S such that d(i, j) ≥ |S| − 1.

2. For any t, if St is an equivalence class of V under the relation i ∼ j iff d(i, j) ≤ t, then max_{i,j ∈ St} d(i, j) ≤ |St| − 1.

Note that for an equivalence class St where d(i, j) ≤ t for every i, j ∈ St, it follows from Condition 1 that t ≥ |St| − 1. Thus in the case when t = |St| − 1 the two conditions imply that the maximum distance between any two points in St is t and that there is a pair i, j ∈ St for which this maximum is attained. The following lemma shows that non-trivial ultrametrics behave well under restrictions to equivalence classes St of the form i ∼ j iff d(i, j) ≤ t. Due to page limitations, full proofs are included in the supplementary material.

Lemma 2. Let d be a non-trivial ultrametric on V and let St ⊆ V be an equivalence class under the relation i ∼ j iff d(i, j) ≤ t. Then d restricted to St is a non-trivial ultrametric on St.

The intuition behind the two conditions in Definition 1 is as follows. Condition 1 imposes a certain lower bound by ruling out trivial ultrametrics where, e.g., d(i, j) = 1 for every distinct pair i, j ∈ V. On the other hand, Condition 2 discretizes and imposes an upper bound on d by restricting its range to the set {0, 1, . . . , n − 1} (see Lemma 3). This rules out the other spectrum of triviality where for example d(i, j) = n for every distinct pair i, j ∈ V with |V| = n.

Lemma 3. Let d be a non-trivial ultrametric on the set V. Then the range of d is contained in the set {0, 1, . . . , n − 1} with |V| = n.

3 Ultrametrics and Hierarchical Clusterings

In this section we study the combinatorial properties of the ultrametrics induced by cost function (1). We start with the following easy lemma showing that if a subset S ⊆ V has r as its lowest common ancestor, then there must be a pair of points i, j ∈ S for which r = lca(i, j).

Lemma 4. Let S ⊆ V be of size ≥ 2.
If r = lca(S) then there is a pair i, j ∈ S such that lca(i, j) = r.

The following lemma shows that non-trivial ultrametrics exactly capture the ultrametrics that are induced by tree decompositions of V using cost function (1). The proof of Lemma 5 is inductive and uses Lemma 4 as a base case. As it turns out, the inductive proof also gives an algorithm to build the corresponding hierarchical clustering given such a non-trivial ultrametric in polynomial time. Since this algorithm is relatively straightforward, we refer the reader to the supplementary material for the details.

Lemma 5. Let T be a hierarchical clustering on V and let dT be the ultrametric on V induced by cost function (1). Then dT is a non-trivial ultrametric on V. Conversely, let d be a non-trivial ultrametric on V. Then there is a hierarchical clustering T on V such that for any pair i, j ∈ V we have dT(i, j) = |leaves(T[lca(i, j)])| − 1 = d(i, j). Moreover this hierarchy can be constructed in time O(n^3) where |V| = n.

Therefore to find the hierarchical clustering of minimum cost, it suffices to minimize ⟨κ, d⟩ over non-trivial ultrametrics d : V × V → {0, . . . , n − 1}. A natural approach is to formulate this problem as an Integer Linear Program (ILP) and then study Linear Programming (LP) relaxations of it. We consider the following ILP for this problem that is motivated by [18]. We have the variables x^1_ij, . . . , x^{n−1}_ij for every distinct pair i, j ∈ V, with x^t_ij = 1 if and only if d(i, j) ≥ t. For any positive integer n, let [n] := {1, 2, . . . , n}.

min Σ_{t=1}^{n−1} Σ_{{i,j} ∈ E(Kn)} κ(i, j) x^t_ij    (ILP-ultrametric)

s.t.  x^t_ij ≥ x^{t+1}_ij    ∀ i, j ∈ V, t ∈ [n − 2]    (2)

x^t_ij + x^t_jk ≥ x^t_ik    ∀ i, j, k ∈ V, t ∈ [n − 1]    (3)

Σ_{i,j ∈ S} x^t_ij ≥ 2    ∀ t ∈ [n − 1], S ⊆ V, |S| = t + 1    (4)

Σ_{i,j ∈ S} x^{|S|}_ij ≤ |S|^2 (Σ_{i,j ∈ S} x^t_ij + Σ_{i ∈ S, j ∉ S} (1 − x^t_ij))    ∀ t ∈ [n − 1], S ⊆ V    (5)

x^t_ij = x^t_ji, x^t_ii = 0    ∀ i, j ∈ V, t ∈ [n − 1]    (6)

x^t_ij ∈ {0, 1}    ∀ i, j ∈ V, t ∈ [n − 1]    (7)

Note that constraint (3) is the same as the strong triangle inequality since the variables x^t_ij are in {0, 1}. Constraint (6) ensures that the ultrametric is symmetric. Constraint (4) ensures that the ultrametric satisfies Condition 1 of non-triviality: for every S ⊆ V of size t + 1 we know that there must be points i, j ∈ S such that d(i, j) = d(j, i) ≥ t, or in other words x^t_ij = x^t_ji = 1. Constraint (5) ensures that the ultrametric satisfies Condition 2 of non-triviality. To see this, note that the constraint is active only when Σ_{i,j ∈ S} x^t_ij = 0 and Σ_{i ∈ S, j ∉ S} (1 − x^t_ij) = 0. In other words d(i, j) ≤ t − 1 for every i, j ∈ S and S is a maximal such set, since if i ∈ S and j ∉ S then d(i, j) ≥ t. Thus S is an equivalence class under the relation i ∼ j iff d(i, j) ≤ t − 1 and so for every i, j ∈ S we have d(i, j) ≤ |S| − 1, or equivalently x^{|S|}_ij = 0. The ultrametric d represented by a feasible solution x^t_ij is given by d(i, j) = Σ_{t=1}^{n−1} x^t_ij.

Definition 6. For any {x^t_ij | t ∈ [n − 1], i, j ∈ V} let Et be defined as Et := {{i, j} | x^t_ij = 0}. The sets {Et}_{t=1}^{n−1} induce a natural sequence of graphs {Gt}_{t=1}^{n−1} where Gt = (V, Et) with V being the data set.

Note that if x^t_ij is feasible for ILP-ultrametric then Et ⊆ E_{t+1} for any t since x^t_ij ≥ x^{t+1}_ij.

For a fixed t ∈ {1, . . . , n − 1} it is instructive to study the combinatorial properties of the so called layer-t problem, where we fix a choice of t and restrict ourselves to the constraints corresponding to that particular t. In particular we drop the inter-layer constraint (2), and constraints (3), (4) and (5) only range over i, j, k ∈ V and S ⊆ V with t fixed. The following lemma provides a combinatorial characterization of feasible solutions to the layer-t problem.

Lemma 7. Fix a choice of t ∈ [n − 1]. Let Gt = (V, Et) be the graph as in Definition 6 corresponding to a solution x^t_ij to the layer-t problem. Then Gt is a disjoint union of cliques of size ≤ t. Moreover this exactly characterizes all feasible solutions to the layer-t ILP.

By Lemma 7 the layer-t problem is to find a set of edges of minimum weight under κ whose removal from Kn leaves a graph Gt = (V, Et) that is a disjoint union of cliques of size ≤ t. Our algorithmic approach is to solve an LP relaxation of ILP-ultrametric and then round the solution to get a feasible solution to ILP-ultrametric. The rounding however proceeds iteratively in a layer-wise manner and so we need to make sure that the rounded solution satisfies the inter-layer constraints (2) and (5).
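Lemma 7's characterization is simple to verify for a candidate layer-t solution: build the graph Gt on the pairs with x^t_ij = 0 and check that it is a disjoint union of cliques of size at most t. A small illustrative sketch (our names, not the paper's code):

```python
# Check the Lemma 7 condition: G_t must be a disjoint union of cliques of size <= t.
from itertools import combinations

def components(V, edges):
    """Connected components of the graph (V, edges), via depth-first search."""
    adj = {v: set() for v in V}
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    seen, comps = set(), []
    for v in V:
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def is_layer_t_feasible(V, edges, t):
    """True iff (V, edges) is a disjoint union of cliques, each of size <= t."""
    edge_set = {frozenset(e) for e in edges}
    for comp in components(V, edges):
        if len(comp) > t:
            return False
        # every pair inside a component must be an edge, i.e. the component is a clique
        if any(frozenset(p) not in edge_set for p in combinations(comp, 2)):
            return False
    return True
```

For example, a triangle plus a disjoint edge on five points passes for t = 3 but fails for t = 2, and a three-vertex path fails because its component is not a clique.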
The following lemma gives a combinatorial characterization of solutions that satisfy these two constraints.

Lemma 8. For every t ∈ [n − 1], let x^t_ij be feasible for the layer-t problem and let Gt = (V, Et) be the graph as in Definition 6 corresponding to x^t_ij, so that by Lemma 7, Gt is a disjoint union of cliques K^t_1, . . . , K^t_{l_t}, each of size at most t. Then x^t_ij is feasible for ILP-ultrametric if and only if the following conditions hold.

Nested cliques: For any s ≤ t, every clique K^s_p with p ∈ [l_s] in Gs is a subclique of some clique K^t_q in Gt where q ∈ [l_t].

Realization: If |K^t_p| = s for some s ≤ t, then Gs contains K^t_p as a component clique, i.e., K^s_q = K^t_p for some q ∈ [l_s].

The combinatorial interpretation of the individual layer-t problems allows us to simplify the formulation of ILP-ultrametric by replacing the constraints for sets of a specific size (constraint (4)) by a global constraint about all sets.

Lemma 9. We may replace constraint (4) of ILP-ultrametric by the following equivalent constraint: Σ_{j ∈ S} x^t_ij ≥ |S| − t, for every t ∈ [n − 1], S ⊆ V and i ∈ S.

4 Rounding an LP relaxation

In this section we consider the following natural LP relaxation for ILP-ultrametric.
We keep the variables x^t_ij for every t ∈ [n − 1] and i, j ∈ V but relax the integrality constraint on the variables.

min Σ_{t=1}^{n−1} Σ_{{i,j} ∈ E(Kn)} κ(i, j) x^t_ij    (LP-ultrametric)

s.t.  x^t_ij ≥ x^{t+1}_ij    ∀ i, j ∈ V, t ∈ [n − 2]    (8)

x^t_ij + x^t_jk ≥ x^t_ik    ∀ i, j, k ∈ V, t ∈ [n − 1]    (9)

Σ_{j ∈ S} x^t_ij ≥ |S| − t    ∀ t ∈ [n − 1], S ⊆ V, i ∈ S    (10)

x^t_ij = x^t_ji, x^t_ii = 0    ∀ i, j ∈ V, t ∈ [n − 1]    (11)

0 ≤ x^t_ij ≤ 1    ∀ i, j ∈ V, t ∈ [n − 1]    (12)

Note that the LP relaxation LP-ultrametric differs from ILP-ultrametric in not having constraint (5). A feasible solution x^t_ij to LP-ultrametric induces a sequence {d_t}_{t ∈ [n−1]} of distance metrics over V defined as d_t(i, j) := x^t_ij. Constraint (10) enforces an additional restriction on this metric: informally, points in a “large enough” subset S should be spread apart according to the metric d_t. Metrics of type d_t are called spreading metrics and were first studied by [19, 20] in relation to graph partitioning problems. The following lemma gives a technical interpretation of spreading metrics (see, e.g., [19, 20]).

Lemma 10. Let x^t_ij be feasible for LP-ultrametric and for a fixed t ∈ [n − 1], let d_t be the induced spreading metric. Let i ∈ V be an arbitrary vertex and let S ⊆ V be a set containing i such that |S| > (1 + ε)t for some ε > 0. Then max_{j ∈ S} d_t(i, j) > ε/(1 + ε).

The following lemma states that we can optimize over LP-ultrametric in polynomial time.

Lemma 11.
An optimal solution to LP-ultrametric can be computed in time polynomial in n and log(max_{i,j} κ(i, j)).

From now on we will simply refer to a feasible solution of LP-ultrametric by the sequence of spreading metrics {d_t}_{t ∈ [n−1]} it induces. The following definition introduces the notion of an open ball B_U(i, r, t) of radius r centered at i ∈ V according to the metric d_t and restricted to the set U ⊆ V.

Definition 12. Let {d_t | t ∈ [n − 1]} be the sequence of spreading metrics feasible for LP-ultrametric. Let U ⊆ V be an arbitrary subset of V. For a vertex i ∈ U, r ∈ R, and t ∈ [n − 1] we define the open ball B_U(i, r, t) of radius r centered at i as B_U(i, r, t) := {j ∈ U | d_t(i, j) < r} ⊆ U. If U = V then we denote B_U(i, r, t) simply by B(i, r, t).

To round LP-ultrametric to get a feasible solution for ILP-ultrametric, we will use the technique of sphere growing which was introduced in [33] to show an O(log n) approximation for the maximum multicommodity flow problem. The basic idea is to grow a ball around a vertex until the expansion of this ball is below a certain threshold, chop off this ball and declare it as a partition, and recurse on the remaining vertices. Since then this idea has been used by [25, 19, 14] to design approximation algorithms for various graph partitioning problems. The first step is to associate to every ball B_U(i, r, t) a volume vol(B_U(i, r, t)) and a boundary ∂B_U(i, r, t) so that its expansion is defined. For any t ∈ [n − 1] and U ⊆ V we denote by γ^U_t the value of the layer-t objective for solution d_t restricted to the set U, i.e., γ^U_t := Σ_{i,j ∈ U, i<j} κ(i, j) d_t(i, j). When U = V we refer to γ^V_t simply by γ_t. Since κ : V × V → R≥0, it follows that γ^U_t ≤ γ_t for any U ⊆ V. We are now ready to define the volume, boundary and expansion of a ball B_U(i, r, t). We use the definition of [19] modified for restrictions to arbitrary subsets U ⊆ V.

Definition 13. [19] Let U be an arbitrary subset of V. For a vertex i ∈ U, radius r ∈ R, and t ∈ [n − 1], let B_U(i, r, t) be the ball of radius r as in Definition 12. Then we define its volume as

vol(B_U(i, r, t)) := γ^U_t/(n log n) + Σ_{j,k ∈ B_U(i,r,t), j<k} κ(j, k) d_t(j, k) + Σ_{j ∈ B_U(i,r,t), k ∈ U \ B_U(i,r,t)} κ(j, k) (r − d_t(i, j)).

The boundary of the ball ∂B_U(i, r, t) is the partial derivative of volume with respect to the radius, i.e., ∂B_U(i, r, t) := ∂vol(B_U(i, r, t))/∂r. The expansion φ(B_U(i, r, t)) of the ball B_U(i, r, t) is then defined as the ratio of its boundary to its volume, i.e., φ(B_U(i, r, t)) := ∂B_U(i, r, t)/vol(B_U(i, r, t)).

The following theorem establishes that the rounding procedure of Algorithm 1 ensures that the cliques in Ct are “small” and that the cost of the edges removed to form them is not too high. It also shows that Algorithm 1 can be implemented to run in time polynomial in n. Let m_ε := ⌊(n − 1)/(1 + ε)⌋ as in Algorithm 1.

Theorem 14. Let {x^t_ij | t ∈ [m_ε], i, j ∈ V} be the output of Algorithm 1 on a feasible solution {d_t}_{t ∈ [n−1]} of LP-ultrametric and any choice of ε ∈ (0, 1). For any t ∈ [m_ε], x^t_ij is feasible for the layer-⌊(1 + ε)t⌋ problem and there is a constant c(ε) > 0 depending only on ε such that Σ_{{i,j} ∈ E(Kn)} κ(i, j) x^t_ij ≤ c(ε)(log n) γ_t. Moreover, Algorithm 1 can be implemented to run in time polynomial in n.

We are now ready to state the main theorem showing that we can obtain a low cost non-trivial ultrametric from Algorithm 1. The proof idea of the main theorem is to use the combinatorial characterization of Lemma 8 to show that the rounded solution is feasible for ILP-ultrametric, besides using Theorem 14 for the individual layer-t guarantees.

Theorem 15. Let {x^t_ij | t ∈ [m_ε], i, j ∈ V} be the output of Algorithm 1 on an optimal solution {d_t}_{t ∈ [n−1]} of LP-ultrametric for any choice of ε ∈ (0, 1). Define the sequence {y^t_ij} for every t ∈ [n − 1] and i, j ∈ V as y^t_ij := x^{⌊t/(1+ε)⌋}_ij if t > 1 + ε and y^t_ij := 1 otherwise. Then y^t_ij is feasible for ILP-ultrametric and satisfies Σ_{t=1}^{n−1} Σ_{{i,j} ∈ E(Kn)} κ(i, j) y^t_ij ≤ (2c(ε) log n) OPT, where OPT is the optimal solution to ILP-ultrametric and c(ε) is the constant in the statement of Theorem 14.

Lemma 11 and Theorem 15 imply the following corollary where we put everything together to obtain a hierarchical clustering of V in time polynomial in n with |V| = n.
Let T denote the set of all possible hierarchical clusterings of V.

Algorithm 1: Iterative rounding algorithm to find a low cost ultrametric

Input: Data set V, spreading metrics {d_t}_{t ∈ [n−1]} on V × V, ε > 0, κ : V × V → R≥0
Output: A solution set of the form {x^t_ij ∈ {0, 1} | t ∈ [⌊(n − 1)/(1 + ε)⌋], i, j ∈ V}

m_ε ← ⌊(n − 1)/(1 + ε)⌋
t ← m_ε
C_{t+1} ← {V}
Δ ← ε/(1 + ε)
while t ≥ 1 do
    C_t ← ∅
    for U ∈ C_{t+1} do
        if |U| ≤ (1 + ε)t then
            C_t ← C_t ∪ {U}
            continue with the next U
        end
        while U ≠ ∅ do
            Let i be arbitrary in U
            Let r ∈ (0, Δ] be s.t. φ(B_U(i, r, t)) ≤ (1/Δ) log(vol(B_U(i, Δ, t))/vol(B_U(i, 0, t)))
            C_t ← C_t ∪ {B_U(i, r, t)}
            U ← U \ B_U(i, r, t)
        end
    end
    x^t_ij = 1 if i ∈ U1 ∈ C_t, j ∈ U2 ∈ C_t and U1 ≠ U2, else x^t_ij = 0
    t ← t − 1
end
return {x^t_ij | t ∈ [m_ε], i, j ∈ V}

Corollary 16. Given a data set V of n points and a similarity function κ : V × V → R≥0, there is an algorithm to compute a hierarchical clustering T of V satisfying cost(T) ≤ O(log n) min_{T′ ∈ T} cost(T′) in time polynomial in n and log(max_{i,j ∈ V} κ(i, j)).

5 Generalized Cost Function

In this section we study the following natural generalization of cost function (1), also introduced by [16], where the distance between the two points is scaled by a function f : R≥0 → R≥0, i.e., cost_f(T) := Σ_{{i,j} ∈ E(Kn)} κ(i, j) f(|leaves(T[lca(i, j)])|). In order for this cost function to make sense, f should be strictly increasing and satisfy f(0) = 0. Possible choices for f could be in {x^2, e^x − 1, log(1 + x)}. The top-down heuristic in [16] finds the optimal hierarchical clustering up to an approximation factor of c_n log n with c_n defined as c_n := 3 α_n max_{1 ≤ n′ ≤ n} f(n′)/f(⌈n′/3⌉), where α_n is the approximation factor from the Sparsest Cut algorithm used.

A naive approach to solving this problem using the ideas of Algorithm 1 would be to replace the objective function of ILP-ultrametric by Σ_{{i,j} ∈ E(Kn)} κ(i, j) f(Σ_{t=1}^{n−1} x^t_ij). This makes the corresponding analogue of LP-ultrametric non-linear however, and for a general κ and f it is not clear how to compute an optimum solution in polynomial time. Using a small trick, one can still prove that Algorithm 1 returns a good approximation in this case, as the following theorem states. For more details on the generalized cost function we refer the reader to the supplementary material.

Theorem 17. Let a_n := max_{n′ ∈ [n]} (f(n′) − f(n′ − 1)). Given a data set V of n points and a similarity function κ : V × V → R≥0, there is an algorithm to compute a hierarchical clustering T of V satisfying cost_f(T) ≤ O(log n + a_n) min_{T′ ∈ T} cost_f(T′) in time polynomial in n, log(max_{i,j ∈ V} κ(i, j)) and log f(n).

Note that, in this case, we pay a price of O(log f(n)) in the running time due to binary search.

6 Experiments

Finally, we describe the experiments we performed. We implemented a generalized version of ILP-ultrametric where one can plug in any strictly increasing function f satisfying f(0) = 0. For the sake of exposition, we limited ourselves to {x, x^2, log(1 + x), e^x − 1} for the function f.
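Evaluating cost_f for these choices of f can be sketched as follows; the tuple-based tree encoding and all names are our own assumptions for illustration, not the experimental code of the paper.

```python
# Generalized cost of Section 5: cost_f(T) = sum over pairs of
# kappa(i, j) * f(|leaves(T[lca(i, j)])|), with the scaling functions f
# used in the experiments. Tree encoding (an assumption for this sketch):
# internal node = tuple of subtrees, leaf = data point.
import math

def leaves(T):
    return {T} if not isinstance(T, tuple) else set().union(*(leaves(c) for c in T))

def lca_leaf_count(T, i, j):
    if isinstance(T, tuple):
        for child in T:
            L = leaves(child)
            if i in L and j in L:
                return lca_leaf_count(child, i, j)
    return len(leaves(T))

def cost_f(T, kappa, f):
    V = sorted(leaves(T))
    return sum(kappa[i, j] * f(lca_leaf_count(T, i, j))
               for a, i in enumerate(V) for j in V[a + 1:])

# the choices of f used in the experiments (each strictly increasing, f(0) = 0)
F = {
    "linear": lambda x: x,
    "square": lambda x: x * x,
    "log": lambda x: math.log(1 + x),
    "exp": lambda x: math.exp(x) - 1,
}
```

With f(x) = x this reduces to the original cost function (1).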
We used the dual simplex method and separated constraints (9) and (10) to obtain fast computations. For the similarity function \(\kappa\) we limited ourselves to cosine similarity \(\kappa_{\mathrm{cos}}\) and the Gaussian kernel \(\kappa_{\mathrm{gauss}}\) with \(\sigma = 1\). Since Algorithm 1 requires \(\kappa \ge 0\), in practice we use \(1 + \kappa_{\mathrm{cos}}\) instead of \(\kappa_{\mathrm{cos}}\). Note that both Ward's method and the k-means algorithm work on the squared Euclidean distance and thus need vector representations of the data set. For the linkage-based algorithms we use the same similarity function that we use for Algorithm 1.

We considered synthetic data sets and some data sets from the UCI database [36]. The synthetic data sets were mixtures of Gaussians in various small-dimensional spaces, and for some of the larger data sets we repeatedly subsampled a smaller number of points uniformly at random, with the number of points depending on the performance of the MIP and LP solver. For a comparison of the cost of the hierarchy returned by Algorithm 1 and the optimal hierarchy obtained by solving ILP-ultrametric, see the supplementary material.

To compare the different hierarchical clustering algorithms, we prune the hierarchy to get the best \(k\) flat clusters and measure its error relative to the ground truth. We use the following notion of error, also known as Classification Error, which is standard in the literature on hierarchical clustering (see, e.g., [37]).

Definition 18. Given a proposed clustering \(h : V \to \{1, \ldots, k\}\), its classification error relative to a target clustering \(g : V \to \{1, \ldots, k\}\) is denoted by \(\operatorname{err}(g, h)\) and is defined as \(\operatorname{err}(g, h) := \min_{\sigma \in S_k} \Pr_{x \in V}[h(x) \neq \sigma(g(x))]\).

Figure 1 shows that Algorithm 1 often gives better prunings compared to the other standard clustering algorithms with respect to this notion of error.

7 Conclusion

In this work we have studied the cost function introduced by [16] for hierarchical clustering of data under a pairwise similarity function. We have shown a combinatorial characterization of ultrametrics induced by this cost function, leading to an improved approximation algorithm for this problem. It remains for future work to investigate combinatorial algorithms for this cost function, as well as algorithms for other cost functions of a similar flavor; see the supplementary material for a discussion.

Figure 1: Comparison of Algorithm 1 with other algorithms for clustering using \(1 + \kappa_{\mathrm{cos}}\) (left) and \(\kappa_{\mathrm{gauss}}\) (right)

8 Acknowledgments

Research reported in this paper was partially supported by NSF CAREER award CMMI-1452463 and NSF grant CMMI-1333789. The authors thank Kunal Talwar and Mohit Singh for helpful discussions and the anonymous reviewers for helping improve the presentation of this paper.

References

[1] Margareta Ackerman, Shai Ben-David, and David Loker. Characterization of linkage-based clustering. In COLT, pages 270–281, 2010.

[2] Nir Ailon and Moses Charikar. Fitting tree metrics: Hierarchical clustering and phylogeny. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05), pages 73–82. IEEE, 2005.

[3] Sanjeev Arora, Satish Rao, and Umesh Vazirani.
Expander flows, geometric embeddings and graph partitioning. Journal of the ACM (JACM), 56(2):5, 2009.

[5] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. A discriminative framework for clustering via similarity functions. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 671–680. ACM, 2008.

[7] Yair Bartal. Graph decomposition lemmas and their role in metric embedding methods. In European Symposium on Algorithms, pages 89–97. Springer, 2004.

[12] Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. arXiv preprint arXiv:1609.09548, 2016.

[14] Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. In 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS'03), pages 524–533. IEEE, 2003.

[16] Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In Daniel Wichs and Yishay Mansour, editors, Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 118–127. ACM, 2016. ISBN 978-1-4503-4132-5. doi: 10.1145/2897518.2897527. URL http://doi.acm.org/10.1145/2897518.2897527.

[17] Sanjoy Dasgupta and Philip M. Long. Performance guarantees for hierarchical clustering. Journal of Computer and System Sciences, 70(4):555–569, 2005.

[18] Marco Di Summa, David Pritchard, and Laura Sanità. Finding the closest ultrametric. Discrete Applied Mathematics, 180:70–80, 2015.

[19] Guy Even, Joseph Naor, Satish Rao, and Baruch Schieber. Fast approximate graph partitioning algorithms. SIAM Journal on Computing, 28(6):2187–2214, 1999.

[20] Guy Even, Joseph Seffi Naor, Satish Rao, and Baruch Schieber. Divide-and-conquer approximation algorithms via spreading metrics.
Journal of the ACM (JACM), 47(4):585–616, 2000.

[22] Joseph Felsenstein. Inferring Phylogenies, volume 2. Sinauer Associates, Sunderland, 2004.

[23] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, Berlin, 2001.

[25] Naveen Garg, Vijay V. Vazirani, and Mihalis Yannakakis. Approximate max-flow min-(multi) cut theorems and their applications. SIAM Journal on Computing, 25(2):235–251, 1996.

[32] Robert Krauthgamer, Joseph Seffi Naor, and Roy Schwartz. Partitioning graphs into balanced components. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 942–949. Society for Industrial and Applied Mathematics, 2009.

[33] Tom Leighton and Satish Rao. An approximate max-flow min-cut theorem for uniform multicommodity flow problems with applications to approximation algorithms. In 29th Annual Symposium on Foundations of Computer Science, pages 422–431. IEEE, 1988.

[36] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

[37] Marina Meilă and David Heckerman. An experimental comparison of model-based clustering methods. Machine Learning, 42(1-2):9–29, 2001.