{"title": "Data-Driven Clustering via Parameterized Lloyd's Families", "book": "Advances in Neural Information Processing Systems", "page_first": 10641, "page_last": 10651, "abstract": "Algorithms for clustering points in metric spaces is a long-studied area of research. Clustering has seen a multitude of work both theoretically, in understanding the approximation guarantees possible for many objective functions such as k-median and k-means clustering, and experimentally, in finding the fastest algorithms and seeding procedures for Lloyd's algorithm. The performance of a given clustering algorithm depends on the specific application at hand, and this may not be known up front. For example, a \"typical instance\" may vary depending on the application, and different clustering heuristics perform differently depending on the instance.\n\nIn this paper, we define an infinite family of algorithms generalizing Lloyd's algorithm, with one parameter controlling the the initialization procedure, and another parameter controlling the local search procedure. This family of algorithms includes the celebrated k-means++ algorithm, as well as the classic farthest-first traversal algorithm. We design efficient learning algorithms which receive samples from an application-specific distribution over clustering instances and learn a near-optimal clustering algorithm from the class. We show the best parameters vary significantly across datasets such as MNIST, CIFAR, and mixtures of Gaussians. Our learned algorithms never perform worse than k-means++, and on some datasets we see significant improvements.", "full_text": "Data-Driven Clustering via Parameterized Lloyd\u2019s\n\nFamilies\n\nMaria-Florina Balcan\n\nDepartment of Computer Science\n\nCarnegie-Mellon University\n\nPittsburgh, PA 15213\n\nninamf@cs.cmu.edu\n\nTravis Dick\n\nDepartment of Computer Science\n\nCarnegie-Mellon University\n\nPittsburgh, PA 15213\ntdick@cs.cmu.edu\n\nColin White\n\nDepartment of Computer Science\n\nCarnegie-Mellon University\n\nPittsburgh, PA 15213\n\ncrwhite@cs.cmu.edu\n\nAbstract\n\nAlgorithms for clustering points in metric spaces is a long-studied area of research.\nClustering has seen a multitude of work both theoretically, in understanding the\napproximation guarantees possible for many objective functions such as k-median\nand k-means clustering, and experimentally, in \ufb01nding the fastest algorithms and\nseeding procedures for Lloyd\u2019s algorithm. The performance of a given clustering\nalgorithm depends on the speci\ufb01c application at hand, and this may not be known\nup front. For example, a \u201ctypical instance\u201d may vary depending on the application,\nand different clustering heuristics perform differently depending on the instance.\nIn this paper, we de\ufb01ne an in\ufb01nite family of algorithms generalizing Lloyd\u2019s al-\ngorithm, with one parameter controlling the initialization procedure, and another\nparameter controlling the local search procedure. This family of algorithms in-\ncludes the celebrated k-means++ algorithm, as well as the classic farthest-\ufb01rst\ntraversal algorithm. We design ef\ufb01cient learning algorithms which receive samples\nfrom an application-speci\ufb01c distribution over clustering instances and learn a near-\noptimal clustering algorithm from the class. 
We show the best parameters vary\nsigni\ufb01cantly across datasets such as MNIST, CIFAR, and mixtures of Gaussians.\nOur learned algorithms never perform worse than k-means++, and on some datasets\nwe see signi\ufb01cant improvements.\n\n1\n\nIntroduction\n\nClustering is a fundamental problem in machine learning with applications in many areas including\ntext analysis, transportation networks, social networks, and so on. The high-level goal of clustering\nis to divide a dataset into natural subgroups. For example, in text analysis we may want to divide\ndocuments based on topic, and in social networks we might want to \ufb01nd communities. A common\napproach to clustering is to set up an objective function and then approximately \ufb01nd the optimal\nsolution according to the objective. There has been a wealth of both theoretical and empirical research\nin clustering using this approach Gonzalez [1985], Charikar et al. [1999], Arya et al. [2004], Arthur\nand Vassilvitskii [2007], Kaufman and Rousseeuw [2009], Ostrovsky et al. [2012], Byrka et al. [2015],\nAhmadian et al. [2017].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fThe most popular method in practice for clustering is local search, where we start with k centers and\niteratively make incremental improvements until a local optimum is reached. For example, Lloyd\u2019s\nmethod (sometimes called k-means) Lloyd [1982] and k-medoids Friedman et al. [2001], Cohen et al.\n[2016] are two popular local search algorithms. There are multiple decisions an algorithm designer\nmust make when using a local search algorithm. First, the algorithm designer must decide how to seed\nlocal search, e.g., how the algorithm chooses the k initial centers. There is a large body of work on\nseeding algorithms, since the initial choice of centers can have a large effect on both the quality of the\noutputted clustering and the time it takes for the algorithm to converge Higgs et al. [1997], Pena et al.\n[1999], Arai and Barakbah [2007]. The best seeding method often depends on the speci\ufb01c application\nat hand. For example, a \u201ctypical problem instance\u201d in one setting may have signi\ufb01cantly different\nproperties from that in another, causing some seeding methods to perform better than others. Second,\nthe algorithm designer must decide on an objective function for the local search phase (k-means,\nk-median, etc.) For some applications, there is an obvious choice. For instance, if the application is\nWi-Fi hotspot location, then the explicit goal is to minimize the k-center objective function. For many\nother applications such as clustering communities in a social network, the goal is to \ufb01nd clusters\nwhich are close to an unknown target clustering, and we may use an objective function for local\nsearch in the hopes that approximately minimizing the chosen objective will produce clusterings\nwhich are close to matching the target clustering (in terms of the number of misclassi\ufb01ed points). As\nbefore, the best objective function for local search may depend on the speci\ufb01c application.\nIn this paper, we show positive theoretical and empirical results for learning the best initialization and\nlocal search procedures over a large family of algorithms. 
We take a transfer learning approach where\nwe assume there is an unknown distribution over problem instances corresponding to our application,\nand the goal is to use experience from the early instances to perform well on the later instances. For\nexample, if our application is clustering facilities in a city, we would look at a sample of cities with\nexisting optimally-placed facilities, and use this information to \ufb01nd the empirically best seeding/local\nsearch pair from an in\ufb01nite family, and we use this pair to cluster facilities in new cities.\n(\u03b1, \u03b2)-Lloyds++ We de\ufb01ne an in\ufb01nite family of algorithms generalizing Lloyd\u2019s method, with two\nparameters \u03b1 and \u03b2. Our algorithms have two phases, a seeding phase to \ufb01nd k initial centers\n(parameterized by \u03b1), and a local search phase which uses Lloyd\u2019s method to converge to a local\noptimum (parameterized by \u03b2). In the seeding phase, each point v is sampled with probability propor-\ntional to dmin(v, C)\u03b1, where C is the set of centers chosen so far and dmin(v, C) = minc\u2208C d(v, c).\nThen Lloyd\u2019s method is used to converge to a local minima for the (cid:96)\u03b2 objective. By ranging\n\u03b1 \u2208 [0,\u221e) \u222a {\u221e} and \u03b2 \u2208 [1,\u221e) \u222a {\u221e}, we de\ufb01ne our in\ufb01nite family of algorithms which we call\n(\u03b1, \u03b2)-Lloyds++. Setting \u03b1 = \u03b2 = 2 corresponds to the k-means++ algorithm Arthur and Vassilvit-\nskii [2007]. The seeding phase is a spectrum between random seeding (\u03b1 = 0), and farthest-\ufb01rst\ntraversal Gonzalez [1985], Dasgupta and Long [2005] (\u03b1 = \u221e), and the Lloyd\u2019s step is able to\noptimize over common objectives including k-median (\u03b2 = 1), k-means (\u03b2 = 2), and k-center\n(\u03b2 = \u221e). We design ef\ufb01cient learning algorithms which receive samples from an application-speci\ufb01c\ndistribution over clustering instances and learn a near-optimal clustering algorithm from our family.\n\nTheoretical analysis In Section 4, we prove that O(cid:0) 1\n\n\u00012 min(T, k) log n(cid:1) samples are suf\ufb01cient to\n\nguarantee the empirically optimal parameters (\u02c6\u03b1, \u02c6\u03b2) have expected cost at most \u0001 higher than the\noptimal parameters (\u03b1\u2217, \u03b2\u2217) over the distribution, with high probability over the random sample,\nwhere n is the size of the clustering instances and T is the maximum number of Lloyd\u2019s iterations.\nThe key challenge is that for any clustering instance, the cost of the outputted clustering is not even\na continuous function in \u03b1 or \u03b2 since a slight tweak in the parameters may lead to a completely\ndifferent run of the algorithm. We overcome this obstacle by showing a strong bound on the expected\nnumber of discontinuities of the cost function, which requires a delicate reasoning about the structure\nof the \u201cdecision points\u201d in the execution of the algorithm; in other words, for a given clustering\ninstance, we must reason about the total number of outcomes the algorithm can produce over the full\nrange of parameters. This allows us to use Rademacher complexity, a distribution-speci\ufb01c technique\nfor achieving uniform convergence.\nNext, we complement our sample complexity result with a computational ef\ufb01ciency result. Specif-\nically, we give a novel meta-algorithm which ef\ufb01ciently \ufb01nds a near-optimal value \u02c6\u03b1 with high\nprobability. 
The high-level idea of our algorithm is to run depth-\ufb01rst-search over the \u201cexecution tree\u201d\nof the algorithm, where a node in the tree represents a state of the algorithm, and edges represent a\ndecision point. A key step in our meta-algorithm is to iteratively solve for the decision points of the\nalgorithm, which itself is nontrivial since the equations governing the decision points do not have\n\n2\n\n\fa closed-form solution. We show the equations have a certain structure which allows us to binary\nsearch through the range of parameters to \ufb01nd the decision points.\nExperiments We give a thorough experimental analysis of our family of algorithms by evaluating their\nperformance on a number of different real-world and synthetic datasets including MNIST, Cifar10,\nCNAE-9, and mixtures of Gaussians. In each case, we create clustering instances by choosing subsets\nof the labels. For example, we look at an instance of MNIST with digits {0, 1, 2, 3, 4}, and also an\ninstance with digits {5, 6, 7, 8, 9}. We show the the optimal parameters transfer from one instance\nto the other. Among datasets, there is no single parameter setting that is nearly optimal, and for\nsome datasets, the best algorithm from the (\u03b1, \u03b2)-Lloyds++ family performs signi\ufb01cantly better than\nknown algorithms such as k-means++ and farthest-\ufb01rst traversal.\n\n2 Related Work\n\nLloyd\u2019s method for clustering The iterative local search method for clustering, known as Lloyd\u2019s\nalgorithm or sometimes called k-means, is one of the most popular algorithms for k-means clustering\nLloyd [1982], and improvements are still being found Max [1960], MacQueen et al. [1967], Dempster\net al. [1977], Pelleg and Moore [1999], Kanungo et al. [2002], Kaufman and Rousseeuw [2009].\nMany different initialization approaches have been proposed Higgs et al. [1997], Pena et al. [1999],\nArai and Barakbah [2007]. When using d2-sampling to \ufb01nd the initial k centers, the algorithm is\nknown as k-means++, and the approximation guarantee is provably O(log k) Arthur and Vassilvitskii\n[2007].\n\nLearning to Learn A recent paper shows positive results for learning linkage-based algorithms\nwith pruning over a distribution over clustering instances Balcan et al. [2017], although there is\nno empirical study done. There are several related models for learning the best representation and\ntransfer learning for clustering. Ashtiani and Ben-David [2015] show how to learn an instance-speci\ufb01c\nembedding for a clustering instance, such that k-means does well over the embedding. There has\nalso been work on related questions for transfer learning on unlabeled data and unsupervised tasks\nRaina et al. [2007], Yang et al. [2009], Jiang and Chung [2012]. To our knowledge, there is no prior\nwork on learning the best clustering objective for a speci\ufb01c distribution over problem instances, given\nlabeled clustering instances for training.\n\n(cid:80)\n\ni\n\nv\u2208Ci\n\nd(v, ci)p(cid:1) 1\n\na center ci and cost(C) =(cid:0)(cid:80)\n\n3 Preliminaries\nClustering A clustering instance V consists of a point set V of size n, a distance metric d (such\nas Euclidean distance in Rd), and a desired number of clusters 1 \u2264 k \u2264 n. A clustering C =\n{C1, . . . , Ck} is a k-partitioning of V . Often in practice, clustering is carried out by approximately\nminimizing an objective function (which maps each clustering to a nonzero value). 
Common objective\nfunctions such as k-median and k-means come from the (cid:96)p family, where each cluster Ci is assigned\np (k-median and k-means correspond to p = 1 and\np = 2, respectively). There are two distinct goals for clustering depending on the application. For\nsome applications such as computing facility locations, the algorithm designer\u2019s only goal is to \ufb01nd\nthe best centers, and the actual partition {C1, . . . , Ck} is not needed. For many other applications\nsuch as clustering documents by subject, clustering proteins by function, or discovering underlying\nk},\ncommunities in a social network, there exists an unknown \u201ctarget\u201d clustering C\u2217 = {C\u2217\n1 , . . . , C\u2217\nand the goal is to output a clustering C which is close to C\u2217. Formally, we de\ufb01ne C and C(cid:48) to be\n\u03c3(i)| \u2264 \u0001n. For these applications, the\nalgorithm designer chooses an objective function while hoping that minimizing the objective function\nwill lead to a clustering that is close to the target clustering. In this paper, we will focus on the cost\nfunction set to the distance to the target clustering, however, our analysis holds for an abstract cost\nfunction cost which can be set to an objective function or any other well-de\ufb01ned measure of cost.\nAlgorithm Con\ufb01guration In this work, we assume that there exists an unknown, application-speci\ufb01c\ndistribution D over a set of clustering instances such that for each instance V, |V | \u2264 n. We suppose\nthere is a cost function that measures the quality of a clustering of each instance. As discussed in\nthe previous paragraph, we can set the cost function to be the expected Hamming distance of the\nreturned clustering to the target clustering, the cost of an (cid:96)p objective, or any other function. The\n\n\u0001-close if there exists a permutation \u03c3 such that(cid:80)k\n\ni=1 |Ci \\ C(cid:48)\n\n3\n\n\flearner\u2019s goal is to \ufb01nd the parameters \u03b1 and \u03b2 that approximately minimize the expected cost with\nrespect to the distribution D. Our main technical results bound the intrinsic complexity of the class\nof (\u03b1, \u03b2)-Lloyds++ clustering algorithms, which leads to generalization guarantees through standard\nRademacher complexity Bartlett and Mendelson [2002], Koltchinskii [2001]. This implies that the\nempirically optimal parameters are also nearly optimal in expectation.\n\n4\n\n(\u03b1, \u03b2)-Lloyds++\n\nIn this section, we de\ufb01ne an in\ufb01nite family of algorithms generalizing Lloyd\u2019s algorithm, with one\nparameter controlling the the initialization procedure, and another parameter controlling the local\nsearch procedure. Our main results bound the intrinsic complexity of this family of algorithms\n(Theorems 4 and 5) and lead to sample complexity results guaranteeing the empirically optimal\nparameters over a sample are close to the optimal parameters over the unknown distribution. We\nmeasure optimality in terms of agreement with the target clustering. We also show theoretically that\nno parameters are optimal over all clustering applications (Theorem 2). Finally, we give an ef\ufb01cient\nalgorithm for learning the best initialization parameter (Theorem 7).\nOur family of algorithms is parameterized by choices of \u03b1 \u2208 [0,\u221e) \u222a {\u221e} and \u03b2 \u2208 [1,\u221e) \u222a {\u221e}.\nEach choice of (\u03b1, \u03b2) corresponds to one local search algorithm. A summary of the algorithm is as\nfollows (see Algorithm 1). 
The algorithm has two phases. The goal of the first phase is to output k initial centers. Each center is iteratively chosen by picking a point with probability proportional to the minimum distance to all centers picked so far, raised to the power of α. The second phase is an iterative two-step procedure similar to Lloyd's method, where the first step is to create a Voronoi partitioning of the points induced by the current set of centers, and the second step is to choose a new set of centers by computing the ℓ_β mean of each Voronoi tile.

Algorithm 1 (α, β)-Lloyds++ Clustering

Input: Instance V = (V, d, k), parameter α.
Phase 1: Choosing initial centers with d^α-sampling
1. Initialize C = ∅ and draw a vector Z = (z_1, . . . , z_k) from [0, 1]^k uniformly at random.
2. For each t = 1, . . . , k:
   (a) Partition [0, 1] into n intervals, where there is an interval I_{v_i} for each v_i with size equal to the probability of choosing v_i during d^α-sampling in round t (see Figure 1).
   (b) Denote c_t as the point such that z_t ∈ I_{c_t}, and add c_t to C.
Phase 2: Lloyd's algorithm
5. Set C′ = ∅. Let {C_1, . . . , C_k} denote the Voronoi tiling of V induced by centers C.
6. For each 1 ≤ i ≤ k, compute argmin_{x∈V} Σ_{v∈C_i} d(x, v)^β and add it to C′.
7. If C′ ≠ C, set C = C′ and go to step 5.
Output: Centers C and clustering induced by C.

Our goal is to find parameters which return clusterings close to the ground-truth clustering in expectation. Setting α = β = 2 corresponds to the k-means++ algorithm. The seeding phase is a spectrum between random seeding (α = 0) and farthest-first traversal (α = ∞), and the Lloyd's algorithm can optimize for common clustering objectives including k-median (β = 1), k-means (β = 2), and k-center (β = ∞).
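For illustration, the following is a minimal Python sketch of Algorithm 1 for finite α and β on Euclidean data. It is our own simplification, not the implementation used in the experiments: the uniform choice of the first center (as in k-means++) is our reading of round t = 1, and all identifier names are ours. The seeding is driven by the shared random vector Z through the cumulative intervals of step 2(a), and the center update follows step 6 literally (the new center is chosen among the data points).

```python
import numpy as np

def alpha_beta_lloyds(X, k, alpha, beta, T=100, seed=0):
    """Simplified sketch of (alpha, beta)-Lloyds++ on Euclidean data X (n x d).

    Phase 1: d^alpha-sampling seeding driven by a pre-drawn random vector Z
             (the first center is drawn uniformly, as in k-means++).
    Phase 2: Lloyd's-style iterations in which each new center is the data point
             minimizing the sum of beta-th powers of distances to its Voronoi cell.
    Finite alpha and beta only; alpha -> infinity would recover farthest-first traversal.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    Z = rng.random(k)                            # shared randomness z_1, ..., z_k

    # ----- Phase 1: seeding by d^alpha-sampling -----
    centers = [int(Z[0] * n)]                    # uniform first center via z_1
    d_min = np.linalg.norm(X - X[centers[0]], axis=1)
    for t in range(1, k):
        w = d_min ** alpha
        cum = np.cumsum(w) / w.sum()             # interval boundaries D_i(alpha) / D_n(alpha)
        c_t = min(int(np.searchsorted(cum, Z[t])), n - 1)   # index whose interval contains z_t
        centers.append(c_t)
        d_min = np.minimum(d_min, np.linalg.norm(X - X[c_t], axis=1))
    C = X[centers].copy()

    # ----- Phase 2: Lloyd's iterations for the l_beta objective -----
    for _ in range(T):
        dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = dists.argmin(axis=1)            # Voronoi tiling induced by C
        new_C = C.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) == 0:                # keep the old center for an empty tile
                continue
            cost = (np.linalg.norm(X[:, None, :] - members[None, :, :], axis=2) ** beta).sum(axis=1)
            new_C[i] = X[cost.argmin()]          # argmin over x in V, as in step 6
        if np.allclose(new_C, C):
            break
        C = new_C
    return C, labels
```

With alpha = beta = 2 this reproduces k-means++-style seeding followed by Lloyd's iterations (with centers restricted to data points, as in the pseudocode above), and increasing alpha pushes the seeding toward farthest-first traversal.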
We start with two structural results about the family of (α, β)-Lloyds++ clustering algorithms. The first shows that for sufficiently large α, phase 1 of Algorithm 1 is equivalent to farthest-first traversal. This means that it is sufficient to consider α parameters in a bounded range.

Farthest-first traversal Gonzalez [1985] starts by choosing a random center, and then iteratively choosing the point farthest from all centers chosen so far, until there are k centers. We assume that ties are broken uniformly at random.

Lemma 1. Given a clustering instance V and δ > 0, if α > log(nk/δ) / log s, then d^α-sampling will give the same output as farthest-first traversal with probability > 1 − δ. Here, s denotes the minimum ratio d_1/d_2 between two distances d_1 > d_2 in the point set.

For some datasets, 1/log s might be very large. In Section 5, we empirically observe that for all datasets we tried, (α, β)-Lloyds++ behaves the same as farthest-first traversal for α > 20. Also, in the full version of this paper, we show that if the dataset satisfies a stability assumption called separability Kobren et al. [2017], Pruitt et al. [2011], then (α, β)-Lloyds++ outputs the same clustering as farthest-first traversal with high probability when α > log n.

Next, to motivate learning the best parameters, we show that for any pair of parameters (α*, β*), there exists a clustering instance such that (α*, β*)-Lloyds++ outperforms all other values of α, β. This implies that d^β-sampling is not always the best choice of seeding for the ℓ_β objective. Let clus_{α,β}(V) denote the expected cost of the clustering outputted by (α, β)-Lloyds++, with respect to the target clustering. Formally, clus_{α,β}(V) = E_{Z∼[0,1]^k}[clus_{α,β}(V, Z)], where clus_{α,β}(V, Z) is the cost of the clustering outputted by (α, β)-Lloyds++ with randomness Z ∈ [0, 1]^k (see line 1 of Algorithm 1).

Theorem 2. For α* ∈ [0,∞) ∪ {∞} and β* ∈ [1,∞) ∪ {∞}, there exists a clustering instance V whose target clustering is the optimal ℓ_{β*} clustering, such that clus_{α*,β*}(V) < clus_{α,β}(V) for all (α, β) ≠ (α*, β*).

Sample efficiency Now we give sample complexity bounds for learning the best algorithm from the class of (α, β)-Lloyds++ algorithms. We analyze the phases of Algorithm 1 separately. For the first phase, our main structural result is to show that for a given clustering instance and value of β, with high probability over the randomness in Algorithm 1, the number of discontinuities of the cost function clus_{α,β}(V, Z) as we vary α ∈ [0, α_h] is O(nk(log n)α_h). Our analysis crucially harnesses the randomness in the algorithm to achieve this bound. For instance, if we use a combinatorial approach as in prior algorithm configuration work, we would only achieve a bound of n^{O(k)}, which is the total number of sets of k centers. For completeness, we give a combinatorial proof of O(n^{k+3}) discontinuities in the full version of this paper.

To show the O(nk(log n)α_h) upper bound, we start by giving a few definitions of concepts used in the proof. Assume we start to run Algorithm 1 without a specific setting of α, but rather a range [α_ℓ, α_h], for some instance V and randomness Z. In some round t, if Algorithm 1 would choose a center c_t for every setting of α ∈ [α_ℓ, α_h], then we continue normally. However, if the algorithm would choose a different center depending on the specific value of α used from the interval [α_ℓ, α_h], then we fork the algorithm, making one copy for each possible center. In particular, we partition [α_ℓ, α_h] into a finite number of sub-intervals such that the next center is constant on each interval. The boundaries between these intervals are "breakpoints", since as α crosses those values, the next center chosen by the algorithm changes.
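As a toy numerical illustration (the distances and the draw below are invented for this example): suppose three points lie at distances d_1 = 4 ≥ d_2 = 2 ≥ d_3 = 1 from the current centers and the shared draw is z_t = 0.6. For small α the sampled center is v_2, while for larger α it becomes v_1; the breakpoint is the α solving D_1(α)/D_n(α) = z_t, which, using the monotonicity stated in Lemma 3 below, can be located by bisection.

```python
import numpy as np

# Toy numbers (ours): distances d_1 >= d_2 >= d_3 to the current centers, and a fixed draw z_t.
d = np.array([4.0, 2.0, 1.0])
z_t = 0.6

def chosen_index(alpha):
    """Index i such that z_t falls in the interval between D_{i-1}/D_n and D_i/D_n."""
    w = d ** alpha
    cum = np.cumsum(w) / w.sum()          # D_1/D_n, D_2/D_n, D_3/D_n
    return int(np.searchsorted(cum, z_t))

print([chosen_index(a) for a in (0.0, 1.0, 2.0, 4.0)])   # prints [1, 1, 0, 0]

# The breakpoint is the alpha solving D_1(alpha)/D_n(alpha) = z_t; this ratio is
# monotone increasing in alpha (Lemma 3), so simple bisection finds it.
lo, hi = 0.0, 10.0
for _ in range(50):
    mid = (lo + hi) / 2
    if (d[:1] ** mid).sum() / (d ** mid).sum() < z_t:
        lo = mid
    else:
        hi = mid
print(round(lo, 3))   # approximate breakpoint alpha*, about 1.13 for these numbers
```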
Our goal is to bound the total number of breakpoints over all\nk rounds in phase 1 of Algorithm 1, which bounds the number of discontinuities of the cost of the\noutputted clustering as a function of \u03b1 over [\u03b1(cid:96), \u03b1h].\nA crucial step in the above approach is determining when to fork and where the breakpoints are\nd\u03b1\nDn(\u03b1),\nlocated. Recall that in round t of Algorithm 1, each datapoint vi has an interval in [0, 1] of size\ni\n1 + \u00b7\u00b7\u00b7 + d\u03b1\nwhere di is the minimum distance from vi to the current set of centers, and Dj(\u03b1) = d\u03b1\nj .\nFurthermore, the interval is located between Di\u22121(\u03b1)\nDn(\u03b1) (see Figure 1). WLOG, we assume\nd1 \u2265 \u00b7\u00b7\u00b7 \u2265 dn. We prove the following nice structure about these intervals.\nLemma 3. Assume that v1, . . . , vn are sorted in decreasing distance from a set C of centers. Then\nfor each i = 1, . . . , n, the function \u03b1 (cid:55)\u2192 Di(\u03b1)\nDn(\u03b1) is monotone increasing and continuous along [0,\u221e).\nFurthermore, for all 1 \u2264 i \u2264 j \u2264 n and \u03b1 \u2208 [0,\u221e), we have Di(\u03b1)\nThis lemma guarantees two crucial properties. First, we know that for every (ordered) set C of t \u2264 k\ncenters chosen by phase 1 of Algorithm 1 up to round t, there is a single interval (as opposed to a\nmore complicated set) of \u03b1-parameters that would give rise to C. Second, for an interval [\u03b1(cid:96), \u03b1h],\nthe set of possible next centers is exactly vi(cid:96), vi(cid:96)+1, . . . , vih, where i(cid:96) and ih are the centers sampled\nwhen \u03b1 is \u03b1(cid:96) and \u03b1h, respectively (see Figure 1). Now we are ready to prove our main structural\nresult. Formally, we de\ufb01ne seed\u03b1(V, (cid:126)Z) as the outputted centers from phase 1 of Algorithm 1 on\ninstance V with randomness (cid:126)Z.\n\nDn(\u03b1) \u2264 Dj (\u03b1)\nDn(\u03b1) .\n\nDn(\u03b1) and Di(\u03b1)\n\n5\n\n\fFigure 1: The algorithm chooses v3 as a center (left). In the interval [\u03b1(cid:96), \u03b1(cid:96)+1], the algorithm may\nchoose v4, v3, v2, or v1 as a center, based on the value of \u03b1 (right).\n\nTheorem 4. Given a clustering instance V, the expected number of discontinuities of seed\u03b1(V, (cid:126)Z)\nas a function of \u03b1 over [0, \u03b1h] is O(nk(log n)\u03b1h). Here, the expectation is over the uniformly random\ndraw of (cid:126)Z \u2208 [0, 1]k.\n\nProof sketch. Consider round t of a run of Algorithm 1. Suppose at the beginning of round t, there\nare L possible states of the algorithm, e.g., L sets of \u03b1 such that within a set, the choice of the \ufb01rst\nt \u2212 1 centers is \ufb01xed. By Lemma 3, we can write these sets as [\u03b10, \u03b11], . . . , [\u03b1L\u22121, \u03b1L], where\n0 = \u03b10 < \u00b7\u00b7\u00b7 < \u03b1L = \u03b1h. Given one interval, [\u03b1(cid:96), \u03b1(cid:96)+1], we claim the expected number of new\nbreakpoints #It,(cid:96) by choosing a center in round t is bounded by 4n log n(\u03b1(cid:96)+1 \u2212 \u03b1(cid:96)). Note that\n#It,(cid:96) + 1 is the number of possible choices for the next center in round t using \u03b1 in [\u03b1(cid:96), \u03b1(cid:96)+1].\nThe claim gives an upper bound on the expected number of new breakpoints, where the expectation is\nonly over zt (the uniformly random draw from [0, 1] used by Algorithm 1 in round t), and the bound\nholds for any given con\ufb01guration of d1 \u2265 \u00b7\u00b7\u00b7 \u2265 dn. 
Assuming the claim, we can \ufb01nish off the proof\nby using linearity of expectation as follows. Let #I denote the total number of discontinuities of\nseed\u03b1(V, (cid:126)Z).\n\n(cid:34) k(cid:88)\n\nL(cid:88)\n\n(cid:35)\n\n\u2264 k(cid:88)\n\nL(cid:88)\n\nEZ\u2208[0,1]k [#I] \u2264 EZ\u2208[0,1]k\n\n(#It,(cid:96))\n\nEZ\u2208[0,1]k [#It,(cid:96)] \u2264 4nk log n \u00b7 \u03b1h.\n\nt=1\n\n(cid:96)=1\n\nt=1\n\n(cid:96)=1\n\nDx(\u03b1(cid:96))\n\nNow we will prove the claim. Given zt \u2208 [0, 1], let x and y denote the minimum indices s.t.\nDn(\u03b1(cid:96)) > zt and Dy(\u03b1(cid:96)+1)\nDn(\u03b1(cid:96)+1) > zt, respectively. Then from Lemma 3, the number of breakpoints is\nexactly x \u2212 y. Therefore, our goal is to compute Ezt\u2208[0,1][x \u2212 y]. One method is to sum up the\nexpected number of breakpoints for each interval Iv by bounding the maximum possible number of\nbreakpoints given that zt lands in Iv. However, this will sometimes lead to a bound that is too coarse.\nFor example, if \u03b1(cid:96)+1 \u2212 \u03b1(cid:96) = \u0001 \u2248 0, then for each bucket Ivj , the maximum number of breakpoints\nis 1, but we want to show the expected number of breakpoints is proportional to \u0001. To tighten up\nthis analysis, we will show that for each bucket, the probability (over zt) of achieving the maximum\nnumber of breakpoints is low.\nAssuming that zt lands in a bucket Ivj , we further break into cases as follows. Let i denote the\nDn(\u03b1(cid:96)+1) > Dj (\u03b1(cid:96))\nminimum index such that Di(\u03b1(cid:96)+1)\nDn(\u03b1(cid:96)). Note that i is a function of j, \u03b1(cid:96), and \u03b1(cid:96)+1, but it is\nindependent of zt. If zt is less than Di(\u03b1(cid:96)+1)\nDn(\u03b1(cid:96)+1), then we have the maximum number of breakpoints\npossible, since the algorithm chooses center vi\u22121 when \u03b1 = \u03b1(cid:96)+1 and it chooses center vj when\n\u03b1 = \u03b1(cid:96). The number of breakpoints is therefore j \u2212 i + 1, by Lemma 3. We denote this event by Et,j,\ni.e., Et,j is the event that in round t, zt lands in Ivj and is less than Di(\u03b1(cid:96)+1)\nDn(\u03b1(cid:96)+1). If zt is instead greater\nthan Di(\u03b1(cid:96)+1)\nDn(\u03b1(cid:96)+1), then the algorithm chooses center vi when \u03b1 = \u03b1(cid:96)+1, so the number of breakpoints\nis \u2264 j \u2212 i. We denote this event by E(cid:48)\nt,j is the\nevent that zt \u2208 Ivj .\n\nt,j are disjoint and Et,j \u222a E(cid:48)\n\nt,j. Note that Et,j and E(cid:48)\n\n6\n\n\fWithin an interval Ivj , the expected number of breakpoints is\n\nP (Et,j)(j \u2212 i + 1) + P (E(cid:48)\n\nt,j)(j \u2212 i) = P (Et,j \u222a Et,j)(j \u2212 i) + P (E(cid:48)\n\nt,j).\n\nWe will show that j \u2212 i and P (Et,j) are both proportional to (log n)(\u03b1(cid:96)+1 \u2212 \u03b1(cid:96)), which \ufb01nishes off\nthe claim.\nFirst we upper bound P (Et,j). Recall this is the probability that zt is in between Dj (\u03b1(cid:96))\nDn(\u03b1(cid:96)) and\n(cid:16) Dj (\u03b1)\nDn(\u03b1(cid:96)+1), which we can show is at most (4 log n)(\u03b1(cid:96)+1 \u2212 \u03b1(cid:96)) by upper bounding the derivative\nDi(\u03b1(cid:96)+1)\n\n(cid:12)(cid:12)(cid:12) \u2202\n\n\u2202\u03b1\n\n(cid:17)(cid:12)(cid:12)(cid:12).\n\nDn(\u03b1)\n\nNow we upper bound j \u2212 i. Recall that j \u2212 i represents the number of intervals between Di(\u03b1(cid:96))\nDn(\u03b1(cid:96)). 
Therefore, we can bound j \u2212 i by\nand Dj (\u03b1(cid:96))\ndividing Dj (\u03b1(cid:96))\nDn(\u03b1) to show this value is\nproportional to 4 log n(\u03b1(cid:96)+1 \u2212 \u03b1(cid:96)). This concludes the proof.\n\nDn(\u03b1(cid:96)). We can again use the derivative of Di(\u03b1)\n\nDn(\u03b1(cid:96)), and the smallest interval in this range is\n\nDn(\u03b1(cid:96)) \u2212 Di(\u03b1(cid:96))\n\nDn(\u03b1(cid:96)) by\n\nDn(\u03b1(cid:96))\n\n\u03b1(cid:96)\nd\nj\n\n\u03b1(cid:96)\nj\n\nd\n\nNow we analyze phase 2 of Algorithm 1. Since phase 2 does not have randomness, we use combina-\ntorial techniques. We de\ufb01ne lloyds\u03b2(V, C, T ) as the cost of the outputted clustering from phase 2\nof Algorithm 1 on instance V with initial centers C, and a maximum of T iterations.\nTheorem 5. Given T \u2208 N, a clustering instance V, and a \ufb01xed set C of initial centers, the number\nof discontinuities of lloyds\u03b2(V, C, T ) as a function of \u03b2 on instance V is O(min(n3T , nk+3)).\nProof sketch. Given V and a set of initial centers C, we bound the number of discontinuities\nintroduced in the Lloyd\u2019s step of Algorithm 1. First, we give a bound of nk+3 which holds for\nany value of T . Recall that Lloyd\u2019s algorithm is a two-step procedure, and note that the Voronoi\npartitioning step is independent of \u03b2. Let {C1, . . . , Ck} denote the Voronoi partition of V induced by\nd(c, v)\u03b2. Given any\nC. Given one of these clusters Ci, the next center is computed by minc\u2208Ci\nd(c1, v)\u03b2 <\nd(c2, v)\u03b2. By a consequence of Rolle\u2019s theorem, this equation has at most 2n + 1 roots. This\n\nc1, c2 \u2208 Ci, the decision for whether c1 is a better center than c2 is governed by(cid:80)\n(cid:80)\nequation depends on the set C of centers, and the two points c1 and c2, therefore, there are(cid:0)n\n\n2\nequations each with 2n + 1 roots. We conclude that there are nk+3 total intervals of \u03b2 such that the\noutcome of Lloyd\u2019s method is \ufb01xed.\nNext we give a different analysis which bounds the number of discontinuities by n3T , where T is the\nmaximum number of Lloyd\u2019s iterations. By the same analysis as the previous paragraph, if we only\nconsider one round, then the total number of equations which govern the output of a Lloyd\u2019s iteration\n\n(cid:1), since the set of centers C is \ufb01xed. These equations have 2n + 1 roots, so the total number of\n\n(cid:1) \u00b7(cid:0)n\n\nis(cid:0)n\n\n(cid:80)\n\nv\u2208Ci\n\nv\u2208Ci\n\nv\u2208Ci\n\n(cid:1)\n\nk\n\nintervals in one round is O(n3). Therefore, over T rounds, the number of intervals is O(n3T ).\nBy combining Theorem 4 with Theorem 5, and using standard learning theory results, we can bound\nthe sample complexity needed to learn near-optimal parameters \u03b1, \u03b2 for an unknown distribution\nD over clustering instances. Recall that clus\u03b1,\u03b2(V) denotes the expected cost of the clustering\noutputted by (\u03b1, \u03b2)-Lloyds++, with respect to the target clustering, and let H denote the maximum\nvalue of clus\u03b1,\u03b2(V).\nTheorem 6. 
Given α_h and a sample of size m = O((H/ε)^2 (min(T, k) log n + log(1/δ) + log α_h)) drawn from (D × [0, 1]^k)^m, with probability at least 1 − δ over the choice of the sample, for all α ∈ [0, α_h] and β ∈ [1,∞) ∪ {∞},

|(1/m) Σ_{i=1}^m clus_{α,β}(V^(i), Z^(i)) − E_{V∼D}[clus_{α,β}(V)]| < ε.

Note that a corollary of Theorem 6 and Lemma 1 is a uniform convergence bound for all α ∈ [0,∞) ∪ {∞}; however, the algorithm designer may decide to set α_h < ∞.

Computational efficiency In this section, we present an algorithm for tuning α whose running time scales with the true number of discontinuities over the sample. Combined with Theorem 4, this gives a bound on the expected running time of tuning α.

Algorithm 2 Dynamic algorithm configuration

Input: Instance V = (V, d, k), randomness Z, α_h, ε > 0
1. Initialize Q to be an empty queue, then push the root node (⟨⟩, [0, α_h]) onto Q.
2. While Q is non-empty:
   (a) Pop node (C, A) from Q with centers C and alpha interval A.
   (b) For each point u_i that can be chosen as the next center, compute A_i = {α ∈ A : u_i is the sampled center} up to error ε and set C_i = C ∪ {u_i}.
   (c) For each i, if |C_i| < k, push (C_i, A_i) onto Q. Otherwise, output (C_i, A_i).

The high-level idea of our algorithm is to directly enumerate the set of centers that can possibly be output by d^α-sampling for a given clustering instance V and pre-sampled randomness Z. We know from the previous section how to count the number of new breakpoints at any given state in the algorithm; however, efficiently solving for the breakpoints poses a new challenge. From the previous section, we know the breakpoints in α occur when D_i(α)/D_n(α) = z_t. This is an exponential equation with n terms, and there is no closed-form solution for α. Although an arbitrary equation of this form may have up to n solutions, our key observation is that if d_1 ≥ ··· ≥ d_n, then D_i(α)/D_n(α) must be monotone increasing (from Lemma 3); therefore, it suffices to binary search over α to find the unique solution to this equation. We cannot find the exact value of the breakpoint from binary search (and even if there were a closed-form solution for the breakpoint, it might not be rational); however, we can find the value to within additive error ε for all ε > 0. We show that the expected cost function is (Hnk log n)-Lipschitz in α, therefore it suffices to run O(log(Hnk/ε)) rounds of binary search to find a solution whose expected cost is within ε of the optimal cost. This motivates Algorithm 2.
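A minimal sketch of the inner step of Algorithm 2 (our own simplification; identifier names are ours) is the following routine, which, given the sorted distances to the current centers and the draw z_t for this round, returns the sub-intervals of α on which the next sampled center is constant, locating each breakpoint by bisection on D_i(α)/D_n(α) = z_t.

```python
import numpy as np

def alpha_intervals_for_round(d_sorted, z_t, a_lo, a_hi, tol=1e-6):
    """Sub-intervals of [a_lo, a_hi] on which the next sampled center is constant.

    d_sorted are the current distances d_1 >= ... >= d_n to the chosen centers and
    z_t is the shared draw for this round.  Mirrors step (b) of Algorithm 2: the
    possible next centers are those sampled at the two endpoints and everything in
    between, and each breakpoint solves D_i(alpha)/D_n(alpha) = z_t (bisection).
    """
    d = np.asarray(d_sorted, dtype=float)

    def ratio(i, alpha):                       # D_{i+1}(alpha) / D_n(alpha), 0-based prefix
        w = d ** alpha
        return w[: i + 1].sum() / w.sum()

    def sampled_index(alpha):                  # smallest index whose interval contains z_t
        w = d ** alpha
        return int(np.searchsorted(np.cumsum(w) / w.sum(), z_t, side="right"))

    i_lo = sampled_index(a_lo)                 # center chosen at alpha = a_lo
    i_hi = sampled_index(a_hi)                 # center chosen at alpha = a_hi
    intervals, left = [], a_lo
    # As alpha grows, each D_i/D_n grows (Lemma 3), so the sampled index only decreases.
    for i in range(i_lo, i_hi, -1):
        lo, hi = left, a_hi                    # breakpoint: ratio(i - 1, alpha) = z_t
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if ratio(i - 1, mid) < z_t:
                lo = mid
            else:
                hi = mid
        intervals.append((i, left, lo))
        left = lo
    intervals.append((i_hi, left, a_hi))
    return intervals                           # list of (center index, alpha_left, alpha_right)
```

In the full enumeration, each returned interval spawns a child node with the corresponding center appended, exactly as in step (c).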
Theorem 7. Given parameters α_h > 0, ε > 0, β ≥ 1, and a sample S of size m = O((H/ε)^2 (min(T, k) log n + log(1/δ) + log α_h)) from (D × [0, 1]^k)^m, run Algorithm 2 on each sample and collect all break-points (i.e., boundaries of the intervals A_i). With probability at least 1 − δ, the break-point ᾱ with lowest empirical cost satisfies |clus_{ᾱ,β}(S) − min_{0≤α≤α_h} clus_{α,β}(S)| < ε. The total running time to find the best break-point is O(mn^2 k^2 α_h log(nH/ε) log n).

Proof sketch. First we argue that one of the break-points output by Algorithm 2 on the sample is approximately optimal. If we exactly solved for the break-points, then every value of α ∈ [0, α_h] would produce exactly the same clusterings on the sample as one of the break-points, so some break-point must be empirically optimal. By Theorem 6 this break-point is also approximately optimal in expectation. Since the algorithm approximately calculates the break-points to within additive error ε, we are guaranteed that all true break-points are within distance ε of one output by the algorithm. In the full version of the paper, we show that the expected cost function is (Hn^2 k log n)-Lipschitz. Therefore, the best approximate break-point has approximately optimal expected cost.

Now we analyze the runtime of Algorithm 2. Let (C, A) be any node in the algorithm, with centers C and alpha interval A = [α_ℓ, α_h]. Sorting the points in V according to their distance to C has complexity O(n log n). Finding the points sampled by d^α-sampling with α set to α_ℓ and α_h costs O(n) time. Finally, computing the alpha interval A_i for each child node of (C, A) costs O(n log(nH/ε)) time, since we need to perform log(nkH log n / ε) iterations of binary search on α ↦ D_i(α)/D_n(α) and each evaluation of the function costs O(n) time. We charge this O(n log(nH/ε)) time to the corresponding child node. If we let #I denote the total number of α-intervals for V, then each layer of the execution tree has at most #I nodes, and the depth is k, giving a total running time of O(#I · kn log(nH/ε)). With Theorem 4 this gives us an expected runtime of O(mn^2 k^2 α_h (log n) log(nH/ε)).

Since we showed that d^α-sampling is Lipschitz as a function of α, it is also possible to find the best α parameter with sub-optimality at most ε by finding the best point from a discretization of [0, α_h] with step-size s = ε/(Hn^2 k log n). The running time of this algorithm is O(n^3 k^2 H log n / ε), which is significantly slower than the efficient algorithm presented in this section. Intuitively, Algorithm 2 is able to binary search to find each breakpoint in time O(log(nH/ε)), whereas a discretization-based algorithm must check all values of α uniformly, so the runtime of the discretization-based algorithm increases by a multiplicative factor of O((nH/ε) · (log(nH/ε))^{-1}).
5 Experiments

In this section, we empirically evaluate the effect of the α parameter on clustering cost for real-world and synthetic clustering domains. In the full version of this paper we provide full details and additional experiments exploring the effect of β and the number of possible clusterings as we vary α.

Experiment Setup. Our experiments evaluate the (α, β)-Lloyds++ family of algorithms on distributions over clustering instances derived from multi-class classification datasets. For each classification dataset, we sample a clustering instance by choosing a random subset of k labels and sampling N examples belonging to each of the k chosen labels. The clustering instance then consists of the kN points, and the target clustering is given by the ground-truth labels. This sampling distribution covers many related clustering tasks (i.e., clustering different subsets of the same labels). We always measure distance between points using the ℓ2 distance and set β = 2. We measure clustering cost in terms of the majority cost, which is the fraction of points whose label disagrees with the majority label in their cluster. The majority cost takes values in [0, 1] and is zero iff the output clustering matches the target clustering perfectly. We generate m = 50,000 samples from each distribution and divide them into equal-sized training and test sets. We then use Algorithm 2 to evaluate the average majority cost for all values of α on the train and test sets. Figure 2 shows the average majority cost for all values of α on both training and testing sets.

We ran experiments on datasets including MNIST, CIFAR-10, CNAE-9, and a synthetic Gaussian Grid dataset. For MNIST and CIFAR-10 we set k = 5 and N = 100, while for CNAE-9 and the Gaussian Grid we set k = 4 and N = 120. For MNIST, we used more samples (m = 250,000).

We find that the optimal value of α varies significantly between tasks, showing that tuning α on a per-task basis can lead to improved performance. Moreover, we find strong agreement in the average cost of each value of α across the independent training and testing samples of clustering instances, as predicted by our sample complexity results.

Figure 2: Majority cost for (α, β)-Lloyds++ as a function of α for β = 2. Panels: (a) MNIST, (b) CIFAR-10, (c) CNAE-9, (d) Gaussian Grid.
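For concreteness, the evaluation protocol above (instance sampling and majority cost) can be sketched as follows. This is an illustration under our own assumptions about the data handles X, y (features and integer class labels) and helper names, not the exact experiment code.

```python
import numpy as np
from collections import Counter

def sample_instance(X, y, k, N, rng):
    """One clustering instance: k random classes from (X, y), N points per class."""
    classes = rng.choice(np.unique(y), size=k, replace=False)
    idx = np.concatenate([rng.choice(np.where(y == c)[0], size=N, replace=False)
                          for c in classes])
    return X[idx], y[idx]

def majority_cost(cluster_labels, true_labels):
    """Fraction of points whose class disagrees with the majority class of their cluster."""
    agree = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        agree += Counter(members.tolist()).most_common(1)[0][1]
    return 1.0 - agree / len(true_labels)

# One evaluation step; `cluster_fn` stands in for any routine returning one cluster
# label per point (for instance, the labels produced by the (alpha, beta)-Lloyds++ sketch above):
# Xi, yi = sample_instance(X, y, k=5, N=100, rng=np.random.default_rng(0))
# cost = majority_cost(cluster_fn(Xi, k=5, alpha=2.0, beta=2.0), yi)
```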
6 Conclusion

We define an infinite family of algorithms generalizing Lloyd's method, with one parameter controlling the initialization procedure, and another parameter controlling the local search procedure. This family of algorithms includes the celebrated k-means++ algorithm, as well as the classic farthest-first traversal algorithm. We provide a sample-efficient and computationally efficient algorithm to learn a near-optimal parameter over an unknown distribution of clustering instances, by developing techniques to bound the expected number of discontinuities in the cost as a function of the parameter. We give a thorough empirical analysis, showing that the values of the optimal parameters transfer to related clustering instances. We show the optimal parameters vary among different types of datasets, and the optimal parameters often significantly improve the error compared to existing algorithms such as k-means++ and farthest-first traversal.

7 Acknowledgments

This work was supported in part by NSF grants CCF-1535967, IIS-1618714, an Amazon Research Award, a Microsoft Research Faculty Fellowship, a National Defense Science & Engineering Graduate (NDSEG) fellowship, and by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program.

References

Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS), 2017.

Kohei Arai and Ali Ridho Barakbah. Hierarchical k-means: an algorithm for centroids initialization for k-means. Reports of the Faculty of Science and Engineering, 36(1):25–31, 2007.

David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Annual Symposium on Discrete Algorithms (SODA), pages 1027–1035, 2007.

Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM Journal on Computing, 33(3):544–562, 2004.

Hassan Ashtiani and Shai Ben-David. Representation learning for clustering: a statistical framework. In Proceedings of the Annual Conference on Uncertainty in Artificial Intelligence, pages 82–91, 2015.

Maria-Florina Balcan, Vaishnavh Nagarajan, Ellen Vitercik, and Colin White. Learning-theoretic foundations of algorithm configuration for combinatorial partitioning problems. In Proceedings of the Annual Conference on Learning Theory (COLT), pages 213–274, 2017.

Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Jarosław Byrka, Thomas Pensyl, Bartosz Rybicki, Aravind Srinivasan, and Khoa Trinh. An improved approximation for k-median, and positive correlation in budgeted optimization. In Proceedings of the Annual Symposium on Discrete Algorithms (SODA), pages 737–756, 2015.

Moses Charikar, Sudipto Guha, Éva Tardos, and David B Shmoys. A constant-factor approximation algorithm for the k-median problem. In Proceedings of the Annual Symposium on Theory of Computing (STOC), pages 1–10, 1999.

Michael B Cohen, Yin Tat Lee, Gary Miller, Jakub Pachocki, and Aaron Sidford. Geometric median in nearly linear time. In Proceedings of the Annual Symposium on Theory of Computing (STOC), pages 9–21. ACM, 2016.

Sanjoy Dasgupta and Philip M Long. Performance guarantees for hierarchical clustering. Journal of Computer and System Sciences, 70(4):555–569, 2005.

Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, pages 1–38, 1977.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani.
The elements of statistical learning, volume 1.\n\nSpringer series in statistics New York, NY, USA:, 2001.\n\nTeo\ufb01lo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer\n\nScience, 38:293\u2013306, 1985.\n\nRichard E Higgs, Kerry G Bemis, Ian A Watson, and James H Wikel. Experimental designs for\nselecting molecules from large chemical databases. Journal of chemical information and computer\nsciences, 37(5):861\u2013870, 1997.\n\nWenhao Jiang and Fu-lai Chung. Transfer spectral clustering.\n\nIn Proceedings of the Annual\n\nConference on Knowledge Discovery and Data Mining (KDD), pages 789\u2013803, 2012.\n\nTapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and An-\ngela Y Wu. An ef\ufb01cient k-means clustering algorithm: Analysis and implementation. transactions\non pattern analysis and machine intelligence, 24(7):881\u2013892, 2002.\n\nLeonard Kaufman and Peter J Rousseeuw. Finding groups in data: an introduction to cluster analysis,\n\nvolume 344. John Wiley & Sons, 2009.\n\n10\n\n\fAri Kobren, Nicholas Monath, Akshay Krishnamurthy, and Andrew McCallum. An online hierar-\nchical algorithm for extreme clustering. In Proceedings of the Annual Conference on Knowledge\nDiscovery and Data Mining (KDD), 2017.\n\nVladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions\n\non Information Theory, 47(5):1902\u20131914, 2001.\n\nStuart Lloyd. Least squares quantization in pcm. transactions on information theory, 28(2):129\u2013137,\n\n1982.\n\nJames MacQueen et al. Some methods for classi\ufb01cation and analysis of multivariate observations. In\nsymposium on mathematical statistics and probability, volume 1, pages 281\u2013297. Oakland, CA,\nUSA, 1967.\n\nJoel Max. Quantizing for minimum distortion. IRE Transactions on Information Theory, 6(1):7\u201312,\n\n1960.\n\nRafail Ostrovsky, Yuval Rabani, Leonard J Schulman, and Chaitanya Swamy. The effectiveness of\n\nlloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):28, 2012.\n\nDan Pelleg and Andrew Moore. Accelerating exact k-means algorithms with geometric reasoning. In\nProceedings of the Annual Conference on Knowledge Discovery and Data Mining (KDD), pages\n277\u2013281, 1999.\n\nJos\u00e9 M Pena, Jose Antonio Lozano, and Pedro Larranaga. An empirical comparison of four ini-\ntialization methods for the k-means algorithm. Pattern recognition letters, 20(10):1027\u20131040,\n1999.\n\nKim D Pruitt, Tatiana Tatusova, Garth R Brown, and Donna R Maglott. Ncbi reference sequences\n(refseq): current status, new features and genome annotation policy. Nucleic acids research, 40\n(D1):D130\u2013D135, 2011.\n\nRajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y Ng. Self-taught learning:\ntransfer learning from unlabeled data. In Proceedings of the International Conference on Machine\nLearning (ICML), pages 759\u2013766, 2007.\n\nQiang Yang, Yuqiang Chen, Gui-Rong Xue, Wenyuan Dai, and Yong Yu. Heterogeneous transfer\nlearning for image clustering via the social web. In Proceedings of the Conference on Natural\nLanguage Processing, pages 1\u20139, 2009.\n\n11\n\n\f", "award": [], "sourceid": 6782, "authors": [{"given_name": "Maria-Florina", "family_name": "Balcan", "institution": "Carnegie Mellon University"}, {"given_name": "Travis", "family_name": "Dick", "institution": "Carnegie Mellon University"}, {"given_name": "Colin", "family_name": "White", "institution": "Carnegie Mellon University"}]}