{"title": "Nested Mini-Batch K-Means", "book": "Advances in Neural Information Processing Systems", "page_first": 1352, "page_last": 1360, "abstract": "A new algorithm is proposed which accelerates the mini-batch k-means algorithm of Sculley (2010) by using the distance bounding approach of Elkan (2003). We argue that, when incorporating distance bounds into a mini-batch algorithm, already used data should preferentially be reused. To this end we propose using nested mini-batches, whereby data in a mini-batch at iteration t is automatically reused at iteration t+1. Using nested mini-batches presents two difficulties. The first is that unbalanced use of data can bias estimates, which we resolve by ensuring that each data sample contributes exactly once to centroids. The second is in choosing mini-batch sizes, which we address by balancing premature fine-tuning of centroids with redundancy-induced slow-down. Experiments show that the resulting nmbatch algorithm is very effective, often arriving within 1\\% of the empirical minimum 100 times earlier than the standard mini-batch algorithm.", "full_text": "Nested Mini-Batch K-Means\n\nJames Newling\nIdiap Research Institute & EPFL\njames.newling@idiap.ch\n\nFran\u00e7ois Fleuret\nIdiap Research Institute & EPFL\nfrancois.fleuret@idiap.ch\n\nAbstract\n\nA new algorithm is proposed which accelerates the mini-batch k-means algorithm of Sculley (2010) by using the distance bounding approach of Elkan (2003). We argue that, when incorporating distance bounds into a mini-batch algorithm, already used data should preferentially be reused. To this end we propose using nested mini-batches, whereby data in a mini-batch at iteration t is automatically reused at iteration t + 1. Using nested mini-batches presents two difficulties. The first is that unbalanced use of data can bias estimates, which we resolve by ensuring that each data sample contributes exactly once to centroids. 
The second is in choosing mini-batch sizes, which we address by balancing premature fine-tuning of centroids with redundancy-induced slow-down. Experiments show that the resulting nmbatch algorithm is very effective, often arriving within 1% of the empirical minimum 100\u00d7 earlier than the standard mini-batch algorithm.\n\n1 Introduction\n\nThe k-means problem is to find k centroids to minimise the mean distance between samples and their nearest centroids. Specifically, given N training samples X = {x(1), . . . , x(N)} in vector space V, one must find C = {c(1), . . . , c(k)} in V to minimise the energy E defined by\n\nE(C) = (1/N) \u2211_{i=1}^{N} ||x(i) \u2212 c(a(i))||^2,    (1)\n\nwhere a(i) = arg min_{j\u2208{1,...,k}} ||x(i) \u2212 c(j)||. In general the k-means problem is NP-hard, and so a trade-off must be made between low energy and low run time. The k-means problem arises in data compression, classification, density estimation, and many other areas.\n\nA popular algorithm for k-means is Lloyd\u2019s algorithm, henceforth lloyd. It relies on a two-step iterative refinement technique. In the assignment step, each sample is assigned to the cluster whose centroid is nearest. In the update step, cluster centroids are updated in accordance with assigned samples. lloyd is also referred to as the exact algorithm, which can lead to confusion as it does not solve the k-means problem exactly. Similarly, approximate k-means algorithms often refer to algorithms which perform an approximation in either the assignment or the update step of lloyd.\n\n1.1 Previous works on accelerating the exact algorithm\n\nSeveral approaches for accelerating lloyd have been proposed, where the required computation is reduced without changing the final clustering. 
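As a concrete illustration, the two-step refinement of lloyd and the energy (1) can be sketched as follows. This is a minimal NumPy rendering for exposition only, not the authors' implementation; the initialisation by random samples is one common choice among several.

```python
import numpy as np

def lloyd(X, k, n_iter=50, seed=0):
    """Illustrative sketch of Lloyd's algorithm (not the paper's code).

    X: (N, d) array of samples. Returns centroids C and assignments a."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]  # init: k random samples
    for _ in range(n_iter):
        # assignment step: nearest centroid for every sample
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        a = d2.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned samples
        for j in range(k):
            if (a == j).any():
                C[j] = X[a == j].mean(axis=0)
    return C, a

def energy(X, C):
    """Energy E(C) of equation (1): mean squared distance to nearest centroid."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()
```

Note that each assignment step costs k distance calculations per sample; the accelerations discussed below all target this cost.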
Hamerly (2010) shows that approaches relying on triangle inequality based distance bounds (Phillips, 2002; Elkan, 2003; Hamerly, 2010) always provide greater speed-ups than those based on spatial data structures (Pelleg and Moore, 1999; Kanungo et al., 2002). Improving bounding based methods remains an active area of research (Drake, 2013; Ding et al., 2015). We discuss the bounding based approach in \u00a7 2.1.\n\n1.2 Previous approximate k-means algorithms\n\nThe assignment step of lloyd requires more computation than the update step. The majority of approximate algorithms thus focus on relaxing the assignment step, in one of two ways. The first is to assign all data approximately, so that centroids are updated using all data, but some samples may be incorrectly assigned. This is the approach used in Wang et al. (2012) with cluster closures. The second approach is to exactly assign a fraction of the data at each iteration. This is the approach used in Agarwal et al. (2005), where a representative core-set is clustered, and in Bottou and Bengio (1995) and Sculley (2010), where random samples are drawn at each iteration. Using only a fraction of the data is effective in reducing redundancy-induced slow-downs.\n\nThe mini-batch k-means algorithm of Sculley (2010), henceforth mbatch, proceeds as follows. Centroids are initialised as a random selection of k samples. Then at every iteration, b of the N samples are selected uniformly at random and assigned to clusters. Cluster centroids are updated as the mean of all samples ever assigned to them, and are therefore running averages of assignments. 
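One iteration of this running-average scheme can be sketched as follows; an illustrative sketch in the style of Sculley (2010), where S(j) is the cumulative sum of samples assigned to cluster j and v(j) the number of such assignments (these names follow the notation introduced in \u00a7 2.2).

```python
import numpy as np

def mbatch_step(X, C, S, v, b, rng):
    """One illustrative iteration of mini-batch k-means: exactly assign a
    random mini-batch, then set each centroid to the running average
    S(j)/v(j) of all samples ever assigned to it."""
    M = rng.choice(len(X), size=b, replace=False)    # mini-batch indices
    d2 = ((X[M, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    a = d2.argmin(axis=1)                            # exact assignment
    for i, j in zip(M, a):                           # accumulate sums/counts
        S[j] += X[i]
        v[j] += 1
    nonzero = v > 0
    C[nonzero] = S[nonzero] / v[nonzero][:, None]    # c(j) <- S(j)/v(j)
    return C, S, v
```

Because assignments are never revisited, a sample drawn several times contributes several terms to the running averages, which is the bias issue taken up next.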
Samples selected at random more often have more influence on centroids, as they reappear more frequently in the running averages, although the law of large numbers smooths out any discrepancies in the long run. mbatch is presented in greater detail in \u00a7 2.2.\n\n1.3 Our contribution\n\nThe underlying goal of this work is to accelerate mbatch by using triangle inequality based distance bounds. In so doing, we hope to merge the complementary strengths of two powerful and widely used approaches for accelerating lloyd.\n\nThe effective incorporation of bounds into mbatch requires a new sampling approach. To see this, first note that bounding can only accelerate the processing of samples which have already been visited, as the first visit is used to establish bounds. Next, note that the expected proportion of visits during the first epoch which are revisits is at most 1/e, as shown in SM-A. Thus the majority of visits are first-time visits and hence cannot be accelerated by bounds. However, for highly redundant datasets, mbatch often obtains a satisfactory clustering in a single epoch, and so bounds need to be effective during the first epoch if they are to contribute more than a minor speed-up.\n\nTo better harness bounds, one must preferentially reuse already visited samples. To this end, we propose nested mini-batches. Specifically, letting Mt \u2286 {1, . . . , N} be the mini-batch indices used at iteration t \u2265 1, we enforce that Mt \u2286 Mt+1. One concern with nesting is that samples entering in early iterations have more influence than samples entering at late iterations, thereby introducing bias. 
To resolve this problem, we enforce that samples appear at most once in running averages. Specifically, when a sample is revisited, its old assignment is first removed before it is reassigned. The idea of nested mini-batches is discussed in \u00a7 3.1.\n\nThe second challenge introduced by using nested mini-batches is determining the size of Mt. On the one hand, if Mt grows too slowly, then one may suffer from premature fine-tuning. Specifically, when updating centroids using Mt \u2282 {1, . . . , N}, one is using the energy estimated on samples indexed by Mt as a proxy for the energy over all N training samples. If Mt is small and the energy estimate is poor, then minimising the energy estimate exactly is a waste of computation, since as soon as the mini-batch is augmented, the proxy energy loss function will change. On the other hand, if Mt grows too rapidly, the problem of redundancy arises. Specifically, if centroid updates obtained with a small fraction of Mt are similar to the updates obtained with Mt, then it is a waste of computation to use Mt in its entirety. These ideas are pursued in \u00a7 3.2.\n\n2 Related works\n\n2.1 Exact acceleration using the triangle inequality\n\nThe standard approach to performing the assignment step of lloyd requires k distance calculations. The idea introduced in Elkan (2003) is to eliminate some of these k calculations by maintaining bounds on the distances between samples and centroids. Several novel bounding based algorithms have since been proposed, the most recent being the yinyang algorithm of Ding et al. (2015). A thorough comparison of bounding based algorithms is presented in Drake (2013). We illustrate the basic idea of Elkan (2003) in Alg. 1, where for every sample i, one maintains k lower bounds, l(i, j) for j \u2208 {1, . . . , k}, each bound satisfying l(i, j) \u2264 ||x(i) \u2212 c(j)||. Before computing ||x(i) \u2212 c(j)|| on line 4 of Alg. 
1, one checks that l(i, j) < d(i), where d(i) is the distance from sample i to the nearest currently found centroid. If l(i, j) \u2265 d(i) then ||x(i) \u2212 c(j)|| \u2265 d(i), and thus j can automatically be eliminated as a nearest centroid candidate.\n\nAlgorithm 1 assignment-with-bounds(i)\n1: d(i) \u2190 ||x(i) \u2212 c(a(i))||   \u25b7 where d(i) is distance to nearest centroid found so far\n2: for all j \u2208 {1, . . . , k} \\ {a(i)} do\n3:   if l(i, j) < d(i) then\n4:     l(i, j) \u2190 ||x(i) \u2212 c(j)||   \u25b7 make lower bound on distance between x(i) and c(j) tight\n5:     if l(i, j) < d(i) then\n6:       a(i) = j\n7:       d(i) = l(i, j)\n8:     end if\n9:   end if\n10: end for\n\nThe fully-fledged algorithm of Elkan (2003) uses additional tests beyond the one shown in Alg. 1, and includes upper bounds and inter-centroid distances. The most recently published bounding based algorithm, yinyang of Ding et al. (2015), is like that of Elkan (2003) but does not maintain bounds on all k distances to centroids; rather, it maintains lower bounds on groups of centroids simultaneously. To maintain the validity of the bounds, after each centroid update one performs l(i, j) \u2190 l(i, j) \u2212 p(j), where p(j) is the distance moved by centroid j during the centroid update; the validity of this correction follows from the triangle inequality. Lower bounds are initialised as exact distances in the first iteration, and only in subsequent iterations can bounds help in eliminating distance calculations. Therefore, the algorithm of Elkan (2003) and its derivatives are all at least as slow as lloyd during the first iteration.\n\n2.2 Mini-batch k-means\n\nThe work of Sculley (2010) introduces mbatch, presented in Alg. 4, as a scalable alternative to lloyd. Reusing notation, we let the mini-batch size be b, and the total number of assignments ever made to cluster j be v(j). 
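The bound test of Alg. 1 can be rendered for a single sample as follows; an illustrative Python sketch, not the authors' implementation, with a `skipped` counter added purely to expose how many distance calculations the bounds eliminate.

```python
import math

def assignment_with_bounds(x_i, a_i, centroids, l_i):
    """Sketch of the Elkan-style test in Alg. 1 for one sample.

    x_i: the sample; a_i: its current assignment; centroids: list of
    centroid vectors; l_i: mutable list of lower bounds l(i, j).
    Returns the new assignment, the distance d(i) to it, and the number
    of centroids eliminated without a distance computation."""
    dist = lambda u, w: math.sqrt(sum((p - q) ** 2 for p, q in zip(u, w)))
    d_i = dist(x_i, centroids[a_i])
    skipped = 0
    for j in range(len(centroids)):
        if j == a_i:
            continue
        if l_i[j] < d_i:                        # bound test: j may be closer
            l_i[j] = dist(x_i, centroids[j])    # make the bound tight
            if l_i[j] < d_i:
                a_i, d_i = j, l_i[j]
        else:                                   # l(i,j) >= d(i): j eliminated
            skipped += 1
    return a_i, d_i, skipped
```

When the bounds are loose (e.g. all zero, as in a first visit), every distance is computed; tight bounds from earlier visits are what make revisits cheap.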
Let S(j) be the cumulative sum of the data samples assigned to cluster j. The centroid update, line 9 of Alg. 4, is then c(j) \u2190 S(j)/v(j). Sculley (2010) presents mbatch in the context of sparse datasets, where at the end of each round an l1-sparsification operation is performed to encourage sparsity. In this paper we are interested in mbatch in a more general context and do not consider sparsification.\n\nAlgorithm 2 initialise-c-S-v\nfor j \u2208 {1, . . . , k} do\n  c(j) \u2190 x(i) for some i \u2208 {1, . . . , N}\n  S(j) \u2190 x(i)\n  v(j) \u2190 1\nend for\n\nAlgorithm 3 accumulate(i)\nS(a(i)) \u2190 S(a(i)) + x(i)\nv(a(i)) \u2190 v(a(i)) + 1\n\nAlgorithm 4 mbatch\n1: initialise-c-S-v()\n2: while convergence criterion not satisfied do\n3:   M \u2190 uniform random sample of size b from {1, . . . , N}\n4:   for all i \u2208 M do\n5:     a(i) \u2190 arg min_{j\u2208{1,...,k}} ||x(i) \u2212 c(j)||\n6:     accumulate(i)\n7:   end for\n8:   for all j \u2208 {1, . . . , k} do\n9:     c(j) \u2190 S(j)/v(j)\n10:  end for\n11: end while\n\n3 Nested mini-batch k-means : nmbatch\n\nThe bottleneck of mbatch is the assignment step, on line 5 of Alg. 4, which requires k distance calculations per sample. The underlying motivation of this paper is to reduce the number of distance calculations at assignment by using distance bounds. However, as already discussed in \u00a7 1.3, simply wrapping line 5 in a bound test would not result in much gain, as only a minority of visited samples would benefit from bounds in the first epoch. For this reason, we replace the random mini-batches at line 3 of Alg. 4 by nested mini-batches. This modification motivates a change to the running average centroid updates, discussed in \u00a7 3.1. It also introduces the need for a scheme to choose mini-batch sizes, discussed in \u00a7 3.2. The resulting algorithm, which we refer to as nmbatch, is presented in Alg. 
5.\n\nThere is no random sampling in nmbatch, although an initial random shuffling of the samples can be performed to remove any ordering that may exist. Let bt be the size of the mini-batch at iteration t, that is bt = |Mt|. We simply take Mt to be the first bt indices, that is Mt = {1, . . . , bt}. Thus Mt \u2286 Mt+1 corresponds to bt \u2264 bt+1. Let T be the number of iterations of nmbatch before terminating. As stopping criterion we use that no assignments change on the full training set, although this is not important and can be modified.\n\n3.1 One sample, one vote : modifying cumulative sums to prevent duplicity\n\nIn mbatch, a sample used n times makes n contributions to one or more centroids, through line 6 of Alg. 4. Due to the extreme and systematic difference in the number of times samples are used with nested mini-batches, it is necessary to curtail any potential bias that duplicitous contributions may incur. To this end, we only allow a sample\u2019s most recent assignment to contribute to centroids. This is done by removing old assignments before samples are reused, shown on lines 15 and 16 of Alg. 5.\n\n3.2 Finding the sweet spot : balancing premature fine-tuning with redundancy\n\nWe now discuss how to sensibly select the mini-batch size bt, where we recall that the sample indices of the mini-batch at iteration t are Mt = {1, . . . , bt}. The only constraint imposed so far is that bt \u2264 bt+1 for t \u2208 {1, . . . , T \u2212 1}, that is, that bt does not decrease. We consider two extreme schemes to illustrate the importance of finding a scheme where bt grows neither too rapidly nor too slowly. The first extreme scheme is bt = N for t \u2208 {1, . . . , T}. This is just a return to full-batch k-means, and thus redundancy is a problem, particularly at early iterations. 
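The one-sample-one-vote bookkeeping of \u00a7 3.1 can be sketched as follows; an illustrative rendering in which the previous contribution of a revisited sample is subtracted from the cumulative statistics before reassignment (bounds are omitted here for clarity; they are orthogonal to the bookkeeping).

```python
import numpy as np

def revisit(i, X, a, d, S, v, centroids):
    """Sketch of the 'one sample, one vote' rule: on a revisit, sample i's
    old contribution is removed from S and v before it is reassigned, so
    each sample contributes exactly once to the centroid averages.
    Illustrative code, not the authors' implementation."""
    j_old = a[i]
    S[j_old] -= X[i]            # remove expired contribution to the sum
    v[j_old] -= 1               # ... and to the count
    d2 = ((centroids - X[i]) ** 2).sum(axis=1)
    a[i] = int(d2.argmin())     # reassign to the nearest centroid
    d[i] = float(np.sqrt(d2[a[i]]))
    S[a[i]] += X[i]
    v[a[i]] += 1
```

An invariant worth noting: the total count sum(v) is unchanged by a revisit, and S always equals the sum of samples under their current assignments.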
The second extreme scheme, where Mt grows very slowly, is the following: if any assignment changes at iteration t, then bt+1 = bt; otherwise bt+1 = bt + 1. The problem with this second scheme is that computation may be wasted in finding centroids which accurately minimise the energy estimated on unrepresentative subsets of the full training set. This is what we refer to as premature fine-tuning.\n\nTo develop a scheme which balances redundancy and premature fine-tuning, we need to find sensible definitions for these terms. A first attempt might be to define them in terms of the energy (1), as this is ultimately what we wish to minimise. Redundancy would correspond to a slow decrease in energy caused by long iteration times, and premature fine-tuning would correspond to approaching a local minimum of a poor proxy for (1). A difficulty with an energy based approach is that we do not want to compute (1) at each iteration, and there is no clear way to quantify the underestimation of (1) using a mini-batch. We instead consider definitions based on centroid statistics.\n\nAlgorithm 5 nmbatch\n1: t = 1   \u25b7 Iteration number\n2: M0 \u2190 {}\n3: M1 \u2190 {1, . . . , bs}   \u25b7 Indices of samples in current mini-batch\n4: initialise-c-S-v()\n5: for j \u2208 {1, . . . , k} do\n6:   sse(j) \u2190 0   \u25b7 Initialise sum of squares of samples in cluster j\n7: end for\n8: while stop condition is false do\n9:   for i \u2208 Mt\u22121 and j \u2208 {1, . . . , k} do\n10:    l(i, j) \u2190 l(i, j) \u2212 p(j)   \u25b7 Update bounds of reused samples\n11:  end for\n12:  for i \u2208 Mt\u22121 do\n13:    aold(i) \u2190 a(i)\n14:    sse(aold(i)) \u2190 sse(aold(i)) \u2212 d(i)^2   \u25b7 Remove expired sse, S and v contributions\n15:    S(aold(i)) \u2190 S(aold(i)) \u2212 x(i)\n16:    v(aold(i)) \u2190 v(aold(i)) \u2212 1\n17:    assignment-with-bounds(i)   \u25b7 Reset assignment a(i)\n18:    accumulate(i)\n19:    sse(a(i)) \u2190 sse(a(i)) + d(i)^2\n20:  end for\n21:  for i \u2208 Mt \\ Mt\u22121 and j \u2208 {1, . . . , k} do\n22:    l(i, j) \u2190 ||x(i) \u2212 c(j)||   \u25b7 Tight initialisation for new samples\n23:  end for\n24:  for i \u2208 Mt \\ Mt\u22121 do\n25:    a(i) \u2190 arg min_{j\u2208{1,...,k}} l(i, j)\n26:    d(i) \u2190 l(i, a(i))\n27:    accumulate(i)\n28:    sse(a(i)) \u2190 sse(a(i)) + d(i)^2\n29:  end for\n30:  for j \u2208 {1, . . . , k} do\n31:    cold(j) \u2190 c(j)\n32:    c(j) \u2190 S(j)/v(j)\n33:    p(j) \u2190 ||c(j) \u2212 cold(j)||\n34:    \u02c6\u03c3C(j) \u2190 \u221a( sse(j) / (v(j)(v(j) \u2212 1)) )\n35:  end for\n36:  if min_{j\u2208{1,...,k}} (\u02c6\u03c3C(j)/p(j)) > \u03c1 then   \u25b7 Check doubling condition\n37:    Mt+1 \u2190 {1, . . . , min(2|Mt|, N)}\n38:  else\n39:    Mt+1 \u2190 Mt\n40:  end if\n41:  t \u2190 t + 1\n42: end while\n\n3.2.1 Balancing intra-cluster standard deviation with centroid displacement\n\nLet ct(j) denote centroid j at iteration t, and let ct+1(j|b) be ct+1(j) when Mt+1 = {1, . . . , b}, so that ct+1(j|b) is the update to ct(j) using samples {x(1), . . . , x(b)}. Consider two options, bt+1 = bt with resulting update ct+1(j|bt), and bt+1 = 2bt with update ct+1(j|2bt). If\n\n||ct+1(j|2bt) \u2212 ct+1(j|bt)|| \u226a ||ct(j) \u2212 ct+1(j|bt)||,    (2)\n\nthen it makes little difference whether centroid j is updated with bt+1 = bt or bt+1 = 2bt, as illustrated in Figure 1, left. Using bt+1 = 2bt would therefore be redundant. If on the other hand\n\n||ct+1(j|2bt) \u2212 ct+1(j|bt)|| \u226b ||ct(j) \u2212 ct+1(j|bt)||,    (3)\n\nthis suggests premature fine-tuning, as illustrated in Figure 1, right. 
Balancing redundancy and premature fine-tuning thus equates to balancing the terms on the left and right hand sides of (2) and (3). Let us denote by Mt(j) the indices of samples in Mt assigned to cluster j. In SM-B we show that the term on the left hand side of (2) and (3) can be estimated by (1/2)\u02c6\u03c3C(j), where\n\n\u02c6\u03c3C(j)^2 = (1/|Mt(j)|^2) \u2211_{i\u2208Mt(j)} ||x(i) \u2212 ct(j)||^2.    (4)\n\nFigure 1: Centroid based definitions of redundancy and premature fine-tuning. Starting from centroid ct(j), the update can be performed with a mini-batch of size bt or 2bt. On the left, it makes little difference, and so using all 2bt points would be redundant. On the right, using 2bt samples results in a much larger change to the centroid, suggesting that ct(j) is near to a local minimum of the energy computed on bt points, corresponding to premature fine-tuning.\n\n\u02c6\u03c3C(j) may underestimate ||ct+1(j|2bt) \u2212 ct+1(j|bt)||, as samples {x(bt + 1), . . . , x(2bt)} have not been used by centroids at iteration t; however, our goal here is to establish dimensional homogeneity. The right hand sides of (2) and (3) can be estimated by the distance moved by centroid j in the preceding iteration, which we denote by p(j). Balancing redundancy and premature fine-tuning thus equates to preventing \u02c6\u03c3C(j)/p(j) from getting too large or too small.\n\nIt may be that \u02c6\u03c3C(j)/p(j) differs significantly between clusters j. It is not possible to independently control the number of samples per cluster, and so a joint decision must be made by the clusters as to whether or not to increase bt. We choose to make the decision based on the minimum ratio, on line 37 of Alg. 
5, as premature fine-tuning is less costly when performed on a small mini-batch, and so it makes sense to allow slowly converging centroids to catch up with rapidly converging ones.\n\nThe decision to use a double-or-nothing scheme for growing the mini-batch is motivated by the fact that \u02c6\u03c3C(j) drops by a constant factor when the mini-batch doubles in size. A linearly increasing mini-batch would be prone to premature fine-tuning, as the mini-batch would not be able to grow rapidly enough.\n\nStarting with an initial mini-batch size b0, nmbatch iterates until min_j \u02c6\u03c3C(j)/p(j) is above some threshold \u03c1, at which point the mini-batch size increases as bt \u2190 min(2bt, N), shown on line 37 of Alg. 5. The mini-batch size is guaranteed to eventually reach N, as p(j) eventually goes to zero. The doubling threshold \u03c1 reflects the relative costs of premature fine-tuning and redundancy.\n\n3.3 A note on parallelisation\n\nThe parallelisation of nmbatch can be done in the same way as in mbatch, whereby a mini-batch is simply split into sub-mini-batches to be distributed. For mbatch, the only constraint on sub-mini-batches is that they are of equal size, to guarantee equal processing times. With nmbatch the constraint is slightly stricter, as the time required to process a sample depends on its time of entry into the mini-batch, due to bounds. Samples from all iterations should thus be balanced, the constraint becoming that each sub-mini-batch contains an equal number of samples from Mt \\ Mt\u22121 for all t.\n\n4 Results\n\nWe have performed experiments on 3 dense datasets and the sparse dataset used in Sculley (2010). The INFMNIST dataset (Loosli et al., 2007) is an extension of MNIST, consisting of 28\u00d728 hand-written digits (d = 784). We use 400,000 such digits for performing k-means and 40,000 for computing a validation energy EV . 
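The doubling rule of \u00a7 3.2 can be summarised in a few lines; an illustrative sketch following the description in the text, with `sse`, `v` and `p` as per-cluster arrays matching the quantities maintained in Alg. 5.

```python
import numpy as np

def next_batch_size(sse, v, p, b, N, rho):
    """Sketch of the nmbatch double-or-nothing rule: the mini-batch doubles
    when min_j sigma_hat_C(j)/p(j) exceeds rho, i.e. when every centroid's
    movement p(j) has become small relative to the statistical uncertainty
    of its update. Illustrative code, not the authors' implementation."""
    sigma_hat = np.sqrt(sse / (v * (v - 1)))   # per-cluster uncertainty estimate
    if (sigma_hat / p).min() > rho:
        return min(2 * b, N)                   # double, capped at N
    return b                                   # otherwise keep the batch size
```

Since \u02c6\u03c3C(j) shrinks by a constant factor at each doubling while p(j) tends to zero near convergence, the rule is eventually triggered and the batch size reaches N.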
STL10P (Coates et al., 2011) consists of 6\u00d76\u00d73 image patches (d = 108); we train with 960,000 patches and use 40,000 for validation. KDDC98 contains 75,000 training samples and 20,000 validation samples, in 310 dimensions. Finally, the sparse RCV1 dataset of Lewis et al. (2004) consists of data in 47,237 dimensions, with two partitions containing 781,265 and 23,149 samples respectively. As done in Sculley (2010), we use the larger partition to learn clusters.\n\nThe experimental setup used on each of the datasets is the following: for 20 random seeds, the training dataset is shuffled and the first k datapoints are taken as initialising centroids. Then, for each of the algorithms, k-means is run on the shuffled training set. At regular intervals, a validation energy EV is computed on the validation set. The time taken to compute EV is not included in run times. The batch size for mbatch and the initial batch size for nmbatch are 5,000, and k = 50 clusters.\n\nFigure 2: The mean energy on validation data (EV ) relative to the lowest energy (E\u2217) across 20 runs, with standard deviations. Baselines are lloyd, yinyang, and mbatch, shown with the new algorithm nmbatch with \u03c1 = 100. We see that nmbatch is consistently faster than all baselines, and obtains final minima very similar to those obtained by the exact algorithms. On the sparse dataset RCV1, the speed-up is noticeable within 0.5% of the empirical minimum E\u2217. On the three dense datasets, the speed-up over mbatch is between 10\u00d7 and 100\u00d7 at 2% of E\u2217, with even greater speed-ups below 2%, where nmbatch converges very quickly to local minima.\n\nFigure 3: Relative errors on validation data at t \u2208 {2, 10}, for nmbatch with and without bound tests, for \u03c1 \u2208 {10^{\u22121}, 10^0, 10^1, 10^2, 10^3}. In the standard case of active bound testing, large values of \u03c1 work well, as premature fine-tuning is less of a concern. 
However, with the bound test deactivated, premature fine-tuning becomes costly for large \u03c1, and an optimal \u03c1 value is one which trades off redundancy (\u03c1 too small) and premature fine-tuning (\u03c1 too large).\n\nThe mean and standard deviation of EV over the 20 runs are computed, and this is what is plotted in Figure 2, relative to the lowest obtained validation energy over all runs on a dataset, E\u2217. Before comparing algorithms, we note that our implementation of the baseline mbatch is competitive with existing implementations, as shown in Appendix A.\n\nIn Figure 2, we plot time-energy curves for nmbatch with three baselines. We use \u03c1 = 100, as described in the following paragraph. On the 3 dense datasets, we see that nmbatch is much faster than mbatch, obtaining a solution within 2% of E\u2217 between 10\u00d7 and 100\u00d7 earlier than mbatch. On the sparse dataset RCV1, the speed-up becomes noticeable within 0.5% of E\u2217. Note that in a single epoch nmbatch gets very near to E\u2217, whereas the full-batch algorithms lloyd and yinyang only complete one iteration. The mean final energies of nmbatch and the exact algorithms are consistently within one initialisation standard deviation. This means that the random initialisation seed has a larger impact on the final energy than the choice between nmbatch and an exact algorithm.\n\nWe now discuss the choice of \u03c1. 
Recall that the mini-batch size doubles when min_j \u02c6\u03c3C(j)/p(j) > \u03c1. Thus a large \u03c1 means that smaller p(j)s are needed to invoke a doubling, which means less robustness against premature fine-tuning. The relative costs of premature fine-tuning and redundancy are influenced by the use of bounds. Consider the case of premature fine-tuning with bounds: p(j) becomes small, and thus bound tests become more effective, as the bounds decrease more slowly (line 10 of Alg. 5). Thus, while premature fine-tuning does result in more samples being visited than necessary, each visit is processed rapidly and so is less costly. We have found that taking \u03c1 to be large works well for nmbatch. Indeed, there is little difference in performance for \u03c1 \u2208 {10, 100, 1000}. To test that our formulation is sensible, we performed tests with the bound test (line 3 of Alg. 1) deactivated. When deactivated, \u03c1 = 10 is in general better than larger values of \u03c1, as seen in Figure 3. Full time-energy curves for different \u03c1 values are provided in SM-C.\n\n5 Conclusion and future work\n\nWe have shown how triangle inequality based bounding can be used to accelerate mini-batch k-means. The key is the use of nested batches, which enable rapid processing of already used samples. The idea of replacing uniformly sampled mini-batches with nested mini-batches is quite general, and is applicable to other mini-batch algorithms. In particular, we believe that the sparse dictionary learning algorithm of Mairal et al. (2009) could benefit from nesting. One could also consider adapting nested mini-batches to stochastic gradient descent, although this is more speculative. Celebi et al. (2013) show that specialised initialisation schemes such as k-means++ can result in better clusterings. 
While this is not the case for the datasets we have used, it would be interesting to consider adapting such initialisation schemes to the mini-batch context.\n\nOur nested mini-batch algorithm nmbatch uses a very simple bounding scheme. We believe that further improvements could be obtained through more advanced bounding, and that the memory footprint of O(kN) could be reduced by using a scheme where, as the mini-batch grows, the number of bounds maintained decreases, so that bounds on groups of clusters merge.\n\nA Comparing Baseline Implementations\n\nWe compare our implementation of mbatch with two publicly available implementations: that accompanying Sculley (2010), in C++, and that in scikit-learn (Pedregosa et al., 2011), written in Cython. Comparisons are presented in Table 1, where our implementation is seen to be competitive. Experiments were all single threaded. Our C++ and Python code is available at https://github.com/idiap/eakmeans.\n\nTable 1: Comparing implementations of mbatch on INFMNIST (left) and RCV1 (right). Time in seconds to process N datapoints, where N = 400,000 for INFMNIST and N = 781,265 for RCV1. Implementations are our own (ours), that in scikit-learn (sklearn), and that of Sculley (2010) (sofia).\n\nINFMNIST (dense): ours 12.4, sklearn 20.6\nRCV1 (sparse): ours 15.2, sklearn 63.6, sofia 23.3\n\nAcknowledgments\n\nJames Newling was funded by the Hasler Foundation under the grant 13018 MASH2.\n\nReferences\n\nAgarwal, P. K., Har-Peled, S., and Varadarajan, K. R. (2005). Geometric approximation via coresets. In Combinatorial and Computational Geometry, MSRI, pages 1\u201330. University Press.\n\nBottou, L. and Bengio, Y. (1995). Convergence properties of the k-means algorithm. Pages 585\u2013592.\n\nCelebi, M. E., Kingravi, H. A., and Vela, P. A. (2013). A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst. 
Appl., 40(1):200\u2013210.\n\nCoates, A., Lee, H., and Ng, A. (2011). An analysis of single-layer networks in unsupervised feature learning. In Gordon, G., Dunson, D., and Dud\u00edk, M., editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR Workshop and Conference Proceedings, pages 215\u2013223. JMLR W&CP.\n\nDing, Y., Zhao, Y., Shen, X., Musuvathi, M., and Mytkowicz, T. (2015). Yinyang k-means: A drop-in replacement of the classic k-means with consistent speedup. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 579\u2013587.\n\nDrake, J. (2013). Faster k-means clustering. Accessed online 19 August 2015.\n\nElkan, C. (2003). Using the triangle inequality to accelerate k-means. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 147\u2013153.\n\nHamerly, G. (2010). Making k-means even faster. In SDM, pages 130\u2013140.\n\nKanungo, T., Mount, D., Netanyahu, N., Piatko, C., Silverman, R., and Wu, A. (2002). An efficient k-means clustering algorithm: analysis and implementation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(7):881\u2013892.\n\nLewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361\u2013397.\n\nLoosli, G., Canu, S., and Bottou, L. (2007). Training invariant support vector machines using selective sampling. In Bottou, L., Chapelle, O., DeCoste, D., and Weston, J., editors, Large Scale Kernel Machines, pages 301\u2013320. MIT Press, Cambridge, MA.\n\nMairal, J., Bach, F., Ponce, J., and Sapiro, G. (2009). 
Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML \u201909, pages 689\u2013696, New York, NY, USA. ACM.\n\nPedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825\u20132830.\n\nPelleg, D. and Moore, A. (1999). Accelerating exact k-means algorithms with geometric reasoning. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD \u201999, pages 277\u2013281, New York, NY, USA. ACM.\n\nPhillips, S. (2002). Acceleration of k-means and related clustering algorithms. Volume 2409 of Lecture Notes in Computer Science. Springer.\n\nSculley, D. (2010). Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, WWW \u201910, pages 1177\u20131178, New York, NY, USA. ACM.\n\nWang, J., Wang, J., Ke, Q., Zeng, G., and Li, S. (2012). Fast approximate k-means via cluster closures. In CVPR, pages 3037\u20133044. IEEE Computer Society.\n", "award": [], "sourceid": 747, "authors": [{"given_name": "James", "family_name": "Newling", "institution": "Idiap Research Institute"}, {"given_name": "Fran\u00e7ois", "family_name": "Fleuret", "institution": "Idiap Research Institute"}]}