{"title": "Ultra Fast Medoid Identification via Correlated Sequential Halving", "book": "Advances in Neural Information Processing Systems", "page_first": 3655, "page_last": 3664, "abstract": "The medoid of a set of n points is the point in the set that minimizes the sum of distances to other points. It can be determined exactly in O(n^2) time by computing the distances between all pairs of points. Previous works show that one can significantly reduce the number of distance computations needed by adaptively querying distances. The resulting randomized algorithm is obtained by a direct conversion of the computation problem to a multi-armed bandit statistical inference problem. In this work, we show that we can better exploit the structure of the underlying computation problem by modifying the traditional bandit sampling strategy and using it in conjunction with a suitably chosen multi-armed bandit algorithm. Four to five orders of magnitude gains over exact computation are obtained on real data, in terms of both number of distance computations needed and wall clock time. Theoretical results are obtained to quantify such gains in terms of data parameters. Our code is publicly available online at https://github.com/TavorB/Correlated-Sequential-Halving.", "full_text": "Ultra Fast Medoid Identification via Correlated Sequential Halving\n\nTavor Z. Baharav\nDepartment of Electrical Engineering\nStanford University\nStanford, CA 94305\ntavorb@stanford.edu\n\nDavid Tse\nDepartment of Electrical Engineering\nStanford University\nStanford, CA 94305\ndntse@stanford.edu\n\nAbstract\n\nThe medoid of a set of $n$ points is the point in the set that minimizes the sum of distances to other points. It can be determined exactly in $O(n^2)$ time by computing the distances between all pairs of points. Previous works show that one can significantly reduce the number of distance computations needed by adaptively querying distances [1]. 
The resulting randomized algorithm is obtained by a direct conversion of the computation problem to a multi-armed bandit statistical inference problem. In this work, we show that we can better exploit the structure of the underlying computation problem by modifying the traditional bandit sampling strategy and using it in conjunction with a suitably chosen multi-armed bandit algorithm. Four to five orders of magnitude gains over exact computation are obtained on real data, in terms of both number of distance computations needed and wall clock time. Theoretical results are obtained to quantify such gains in terms of data parameters. Our code is publicly available online at https://github.com/TavorB/Correlated-Sequential-Halving.\n\n1 Introduction\n\nIn large datasets, one often wants to find a single element that is representative of the dataset as a whole. While the mean, a point potentially outside the dataset, may suffice in some problems, it will be uninformative when the data is sparse in some domain; taking the mean of an image dataset will yield visually random noise [2]. In such instances the medoid is a more appropriate representative, where the medoid is defined as the point in a dataset which minimizes the sum of distances to other points. For one dimensional data under $\\ell_1$ distance, this is equivalent to the median. This has seen use in algorithms such as k-medoid clustering due to its reduced sensitivity to outliers [3].\n\nFormally, let $x_1, \\ldots, x_n \\in \\mathcal{U}$, where the underlying space $\\mathcal{U}$ is equipped with some distance function $d : \\mathcal{U} \\times \\mathcal{U} \\mapsto \\mathbb{R}_+$. It is convenient to think of $\\mathcal{U} = \\mathbb{R}^d$ and $d(x, y) = \\|x - y\\|_2$ for concreteness, but other spaces and distance functions (which need not be symmetric or satisfy the triangle inequality) can be substituted. 
The medoid of $\\{x_i\\}_{i=1}^n$, assumed here to be unique, is defined as $x_{i^*}$ where\n\n$$i^* = \\operatorname*{argmin}_{i \\in [n]} \\theta_i, \\quad \\text{where} \\quad \\theta_i \\triangleq \\frac{1}{n} \\sum_{j=1}^{n} d(x_i, x_j) \\qquad (1)$$\n\nNote that for non-adversarially constructed data, the medoid will almost certainly be unique. Unfortunately, brute force computation of the medoid becomes infeasible for large datasets, e.g. RNA-Seq datasets with $n$ = 100k points [4].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThis issue has been addressed in recent works by noting that in most problem instances solving for the value of each $\\theta_i$ exactly is unnecessary, as we are only interested in identifying $x_{i^*}$ and not in computing every $\\theta_i$ [1, 5, 6, 7]. This allows us to solve the problem by only estimating each $\\theta_i$, such that we are able to distinguish with high probability whether it is the medoid. By turning this computational problem into a statistical one of estimating the $\\theta_i$\u2019s one can greatly decrease algorithmic complexity and running time. The key insight here is that sampling a random $J \\sim \\mathrm{Unif}([n])$ and computing $d(x_i, x_J)$ gives an unbiased estimate of $\\theta_i$. Clearly, as we sample and average over more independently selected $J_k \\overset{\\mathrm{iid}}{\\sim} \\mathrm{Unif}([n])$, we will obtain a better estimate of $\\theta_i$. Estimating each $\\theta_i$ to the same degree of precision by computing $\\hat{\\theta}_i = \\frac{1}{T} \\sum_{k=1}^{T} d(x_i, x_{J_k})$ yields an order of magnitude improvement over exact computation, via an algorithm like RAND [7].\n\nIn a recent work [1] it was observed that this statistical estimation could be done much more efficiently by adaptively allocating estimation budget to each of the $\\theta_i$ in eq. (1). This is due to the observation that we only need to estimate each $\\theta_i$ to a necessary degree of accuracy, such that we are able to say with high probability whether it is the medoid or not. 
By reducing to a stochastic multi-armed bandit problem, where each arm corresponds to a $\\theta_i$, existing multi-armed bandit algorithms can be leveraged leading to the algorithm Med-dit [1]. As can be seen in Fig. 1 adding adaptivity to the statistical estimation problem yields another order of magnitude improvement.\n\nFigure 1: Empirical performance of exact computation, RAND, Med-dit and Correlated Sequential Halving. The error probability is the probability of not returning the correct medoid. (a) RNA-Seq 20k dataset [4], $\\ell_1$ distance. (b) 100k users from Netflix dataset [8], cosine distance.\n\n1.1 Contribution\n\nWhile adaptivity is already a drastic improvement, current schemes are still unable to process large datasets efficiently; running Med-dit on datasets with $n$ = 100k takes 1.5 hours. The main contribution of this paper is a novel algorithm that is able to perform this same computation in 1 minute. Our algorithm achieves this by observing that we want to find the minimum element and not the minimum value, and so our interest is only in the relative ordering of the $\\theta_i$, not their actual values. In the simple case of trying to determine if $\\theta_1 > \\theta_2$, we are interested in estimating $\\theta_1 - \\theta_2$ rather than $\\theta_1$ or $\\theta_2$ separately. One can imagine the first step is to take one sample for each, i.e. $d(x_1, x_{J_1})$ to estimate $\\theta_1$ and $d(x_2, x_{J_2})$ to estimate $\\theta_2$, and compare the two estimates. In the direct bandit reduction used in the design of Med-dit, $J_1$ and $J_2$ would be independently chosen, since successive samples in the multi-armed bandit formulation are independent. In effect, we are trying to compare $\\theta_1$ and $\\theta_2$, but not using a common reference point to estimate them. This can be problematic for a sampling based algorithm, as it could be the case that $\\theta_1 < \\theta_2$, but the reference point $x_{J_1}$ we pick for estimating $\\theta_1$ is on the periphery of the dataset as in Fig. 2a. 
This issue can fortunately be remedied by using the same reference point for both $x_1$ and $x_2$ as in Fig. 2b. By using the same reference point we are correlating the samples and intuitively reducing the variance of the estimator for $\\theta_1 - \\theta_2$. Here, we are exploiting the structure of the underlying computation problem rather than simply treating this as a standard multi-armed bandit statistical inference problem.\n\nBuilding on this idea, we correlate the random sampling in our reduction to statistical estimation and design a new medoid algorithm, Correlated Sequential Halving. This algorithm is based on the Sequential Halving algorithm in the multi-armed bandit literature [9]. We see in Fig. 1 that we are able to gain another one to two orders of magnitude improvement, yielding an overall four to five orders of magnitude improvement over exact computation. This is accomplished by exploiting the fact that the underlying problem is computational rather than statistical.\n\nFigure 2: Toy 2D example. (a) Shortcoming of direct bandit reduction. (b) Improvement afforded by correlation.\n\n1.2 Theoretical Basis\n\nWe now provide high level insight into the theoretical basis for our observed improvement, later formalized in Theorem 2.1. We assume without loss of generality that the points are sorted so that $\\theta_1 < \\theta_2 \\leq \\ldots \\leq \\theta_n$, and define $\\Delta_i \\triangleq \\theta_i - \\theta_1$ for $i \\in [n] \\setminus \\{1\\}$, where $[n]$ is the set $\\{1, 2, \\ldots, n\\}$. For visual clarity, we use the standard notation $a \\vee b \\triangleq \\max(a, b)$ and $a \\wedge b \\triangleq \\min(a, b)$, and assume a base of 2 for all logarithms.\n\nOur proposed algorithm samples in a correlated manner as in Fig. 2b, and so we introduce new notation to quantify this improvement. As formalized later, $\\rho_i$ is the improvement afforded by correlated sampling in distinguishing arm $i$ from arm 1. 
$\\rho_i$ can be thought of as the relative reduction in variance, where a small $\\rho_i$ indicates that $d(x_1, x_{J_1}) - d(x_i, x_{J_1})$ concentrates$^1$ faster than $d(x_1, x_{J_1}) - d(x_i, x_{J_2})$ about $-\\Delta_i$ for $J_1, J_2$ drawn independently from $\\mathrm{Unif}([n])$, shown graphically in Fig. 3.\n\nFigure 3: Correlated $d(1, J_1) - d(i, J_1)$ vs Independent $d(1, J_1) - d(i, J_2)$ sampling in RNA-Seq 20k dataset [4]. Averaged over the dataset, the independent samples have standard deviation $\\sigma = 0.25$, so for (a) $\\rho_i = .11$, and (b) $\\rho_i = .25$. (a) Comparison of top 2 points ($i = 2$). (b) Comparison of top and mid ($i = 10000$).\n\nIn the standard bandit setting with independent sampling, one needs a number of samples proportional to $H_2 = \\max_{i \\geq 2} i / \\Delta_i^2$ to determine the best arm [10]. Replacing the standard arm difficulty of $1 / \\Delta_i^2$ with $\\rho_i^2 / \\Delta_i^2$, the difficulty accounting for correlation, we show that one can solve the problem using a number of samples proportional to $\\tilde{H}_2 = \\max_{i \\geq 2} i \\rho_{(i)}^2 / \\Delta_{(i)}^2$, an analogous measure. Here the permutation $(\\cdot)$ indicates that the arms are sorted by decreasing $\\rho_i / \\Delta_i$ as opposed to just by $1 / \\Delta_i$. These details are formalized in Theorem 2.1.\n\nOur theoretical improvement incorporating correlation can thus be quantified as $H_2 / \\tilde{H}_2$. As we show later in Fig. 5, in real datasets arms with small $\\Delta_i$ have similarly small $\\rho_i$, indicating that correlation yields a larger relative gain for previously difficult arms. Indeed, for the RNA-Seq 20k dataset we see that the ratio is $H_2 / \\tilde{H}_2 = 6.6$. The Netflix 100k dataset is too large to perform this calculation on, but for similar datasets like MNIST [11] this ratio is 4.8. 
We hasten to note that this ratio does not fully encapsulate the gains afforded by the correlation our algorithm uses, as only pairwise correlation is considered in our analysis. This is discussed further in Appendix B.\n\nFootnote 1: Throughout this work we talk about concentration in the sense of the empirical average of a random variable concentrating about the true mean of that random variable.\n\n1.3 Related Works\n\nSeveral algorithms have been proposed for the problem of medoid identification. An $O(n^{3/2} 2^{\\Theta(d)})$ algorithm called TRIMED was developed for finding the true medoid of a dataset under certain assumptions on the distribution of the points near the medoid [5]. This algorithm cleverly carves away non-medoid points, but unfortunately does not scale well with the dimensionality of the dataset. In the use cases we consider the data is very high dimensional, often with $d \\approx n$. While this algorithm works well for small $d$, it becomes infeasible to run when $d > 20$. A similar problem, where the central vertex in a graph is desired, has also been analyzed. One proposed algorithm for this problem is RAND, which selects a random subset of vertices of size $k$ and measures the distance between each vertex in the graph and every vertex in the subset [7]. This was later improved upon with the advent of TOPRANK [6]. We build off of the algorithm Med-dit (Medoid-Bandit), which finds the medoid in $\\tilde{O}(n)$ time under mild distributional assumptions [1].\n\nMore generally, the use of bandits in computational problems has gained recent interest. 
In addition to medoid finding [1], other examples include Monte Carlo Tree Search for game playing AI [12], hyper-parameter tuning [13], k-nearest neighbor, hierarchical clustering and mutual information feature selection [14], approximate k-nearest neighbor [15], and Monte-Carlo multiple testing [16]. All of these works use a direct reduction of the computation problem to the multi-armed bandit statistical inference problem. In contrast, the present work further exploits the fact that the inference problem comes from a computational problem, which allows a more effective sampling strategy to be devised. This idea of preserving the structure of the computation problem in the reduction to a statistical estimation one has potentially broader impact and applicability to these other applications.\n\n2 Correlated Sequential Halving\n\nIn previous works it was noted that sampling a random $J \\sim \\mathrm{Unif}([n])$ and computing $d(x_i, x_J)$ gives an unbiased estimate of $\\theta_i$ [1, 14]. This was where the problem was reduced to that of a multi-armed bandit and solved with an Upper Confidence Bound (UCB) based algorithm [17]. In their analysis, estimates of $\\theta_i$ are generated as $\\hat{\\theta}_i = \\frac{1}{|J_i|} \\sum_{j \\in J_i} d(x_i, x_j)$ for $J_i \\subseteq [n]$, and the analysis hinges on showing that as we sample the arms more, $\\hat{\\theta}_1 < \\hat{\\theta}_i \\; \\forall \\, i \\in [n]$ with high probability$^2$. In a standard UCB analysis this is done by showing that each $\\hat{\\theta}_i$ individually concentrates. However on closer inspection, we see that this is not necessary; it is sufficient for the differences $\\hat{\\theta}_1 - \\hat{\\theta}_i$ to concentrate for all $i \\in [n]$.\n\nUsing our intuition from Fig. 2 we see that one way to get this difference to concentrate faster is by sampling the same $j$ for both arms 1 and $i$. We can see that if $|J_1| = |J_i|$, one possible approach is to set $J_1 = J_i = J$. 
This allows us to simplify $\\hat{\\theta}_1 - \\hat{\\theta}_i$ as\n\n$$\\hat{\\theta}_1 - \\hat{\\theta}_i = \\frac{1}{|J_1|} \\sum_{j \\in J_1} d(x_1, x_j) - \\frac{1}{|J_i|} \\sum_{j \\in J_i} d(x_i, x_j) = \\frac{1}{|J|} \\sum_{j \\in J} \\left[ d(x_1, x_j) - d(x_i, x_j) \\right].$$\n\nWhile UCB algorithms yield a serial process that samples one arm at a time, this observation suggests that a different algorithm that pulls many arms at the same time would perform better, as then the same reference $j$ could be used. By estimating each point\u2019s centrality $\\theta_i$ independently, we are ignoring the dependence of our estimators on the random reference points selected; using the same set of reference points for estimating each $\\theta_i$ reduces the variance in the choice of random reference points. We show that a modified version of Sequential Halving [10] is much more amenable to this type of analysis. At a high level this is due to the fact that Sequential Halving proceeds in stages by sampling arms uniformly, eliminating the worse half of arms from consideration, and repeating. This very naturally obeys this \u201ccorrelated sampling\u201d condition, as we can now use the same set of reference points $J$ for all arms under consideration in each round. We present the slightly modified\n\nFootnote 2: In order to maintain the unbiasedness of the estimator given the sequential nature of UCB, reference points are chosen with replacement in Med-dit, potentially yielding a multiset $J_i$. 
For the sake of clarity we ignore this subtlety for Med-dit, as our algorithm samples without replacement.\n\nalgorithm below, introducing correlation and capping the number of pulls per round, noting that the main difference comes in the analysis rather than the algorithm itself.\n\nAlgorithm 1 Correlated Sequential Halving\nInput: Sampling budget $T$, dataset $\\{x_i\\}_{i=1}^n$\ninitialize $S_0 \\leftarrow [n]$\nfor $r = 0$ to $\\lceil \\log n \\rceil - 1$ do\n    select a set $J_r$ of $t_r$ data point indices uniformly at random without replacement from $[n]$ where $t_r = \\left\\{ 1 \\vee \\left\\lfloor \\frac{T}{|S_r| \\lceil \\log n \\rceil} \\right\\rfloor \\right\\} \\wedge n$\n    For each $i \\in S_r$ set $\\hat{\\theta}_i^{(r)} = \\frac{1}{t_r} \\sum_{j \\in J_r} d(x_i, x_j)$\n    if $t_r = n$ then\n        Output arm in $S_r$ with the smallest $\\hat{\\theta}_i^{(r)}$\n    else\n        Let $S_{r+1}$ be the set of $\\lceil |S_r| / 2 \\rceil$ arms in $S_r$ with the smallest $\\hat{\\theta}_i^{(r)}$\n    end if\nend for\nreturn arm in $S_{\\lceil \\log n \\rceil}$\n\nExamining the random variables $\\hat{\\Delta}_i \\triangleq d(x_1, x_J) - d(x_i, x_J)$ for $J \\sim \\mathrm{Unif}([n])$, we see that for any fixed dataset all $\\hat{\\Delta}_i$ are bounded, as $\\max_{i,j \\in [n]} d(x_i, x_j)$ is finite. In particular, this means that all $\\hat{\\Delta}_i$ are sub-Gaussian.\n\nDefinition 1. We define $\\sigma$ to be the minimum sub-Gaussian constant of $d(x_I, x_J)$ for $I, J$ drawn independently from $\\mathrm{Unif}([n])$. Additionally, for $i \\in [n]$ we define $\\rho_i \\sigma$ to be the minimum sub-Gaussian constant of $d(x_1, x_J) - d(x_i, x_J)$, where $\\sigma$ is as above and $\\rho_i$ is an arm (point) dependent scaling, as displayed in Figure 3.\n\nThis shifts the direction of the analysis, as where in previous works the sub-Gaussianity of $d(x_1, x_J)$ was used [1], we now instead utilize the sub-Gaussianity of $d(x_1, x_J) - d(x_i, x_J)$. Here $\\rho_i \\leq 1$ indicates that the correlated sampling improves the concentration and by extension the algorithmic performance.\n\nA standard UCB algorithm is unable to algorithmically make use of these $\\{\\rho_i\\}$. 
Even considering batch UCB algorithms, in order to incorporate correlation the confidence bounds would need to be calculated differently for each pair of arms depending on the number of $j$\u2019s they\u2019ve pulled in common and the sub-Gaussian parameter of $d(x_{i_1}, x_J) - d(x_{i_2}, x_J)$. It is unreasonable to assume this is known for all pairs of points a priori, and so we restrict ourselves to an algorithm that only uses these pairwise correlations implicitly in its analysis instead of explicitly in the algorithm. Below we state the main theorem of the paper.\n\nTheorem 2.1. Assuming that $T \\geq n \\log n$ and denoting the sub-Gaussian constants of $d(x_1, x_J) - d(x_i, x_J)$ as $\\sigma \\rho_i$ for $i \\in [n]$ as in definition 1, Correlated Sequential Halving (Algorithm 1) correctly identifies the medoid in at most $T$ distance computations with probability at least\n\n$$1 - 3 \\log n \\cdot \\exp\\left( -\\frac{T}{16 \\sigma^2 \\log n} \\cdot \\min_{i \\geq \\frac{T}{n \\log n}} \\left[ \\frac{\\Delta_{(i)}^2}{i \\rho_{(i)}^2} \\right] \\right)$$\n\nwhich can be coarsely lower bounded as\n\n$$1 - 3 \\log n \\cdot \\exp\\left( -\\frac{T}{16 \\tilde{H}_2 \\sigma^2 \\log n} \\right)$$\n\nwhere\n\n$$\\tilde{H}_2 = \\max_{i \\geq 2} \\frac{i \\rho_{(i)}^2}{\\Delta_{(i)}^2}, \\qquad (\\cdot) : [n] \\mapsto [n], \\quad (1) = 1, \\quad \\frac{\\Delta_{(2)}}{\\rho_{(2)}} \\leq \\frac{\\Delta_{(3)}}{\\rho_{(3)}} \\leq \\cdots \\leq \\frac{\\Delta_{(n)}}{\\rho_{(n)}}$$\n\nAbove $\\tilde{H}_2$ is a natural measure of hardness for this problem analogous to $H_2 = \\max_i \\frac{i}{\\Delta_i^2}$ in the standard bandit case, and $(\\cdot)$ orders the arms by difficulty in distinguishing from the best arm after taking into account the $\\rho_i$. We defer the proof of Thm. 2.1 and necessary lemmas to Appendix A for readability.\n\n2.1 Lower bounds\n\nIdeally in such a bandit problem, we would like to provide a matching lower bound. We can naively lower bound the sample complexity as $\\Omega(n)$, but unfortunately no tighter results are known. 
A more traditional bandit lower bound was recently proved for adaptive sampling in the approximate k-NN case, but requires that the algorithm only interact with the data by sampling coordinates uniformly at random [15]. This lower bound can be transferred to the medoid setting, however this constraint becomes that an algorithm can only interact with the data by measuring the distance from a desired point to another point selected uniformly at random. This unfortunately removes all the correlation effects we analyze. For a more in depth discussion of the difficulty of providing a lower bound for this problem and the higher order problem structure causing this, we refer the reader to Appendix B.\n\n3 Simulation Results\n\nCorrelated Sequential Halving (corrSH) empirically performs much better than UCB type algorithms on all datasets tested, reducing the number of comparisons needed by 2 orders of magnitude for the RNA-Seq dataset and by 1 order of magnitude for the Netflix dataset to achieve comparable error probabilities, as shown in Table 1. 
This yields a similarly drastic reduction in wall clock time, which contrasts with most UCB based algorithms; usually, when implemented, the overhead needed to run UCB makes it so that even though there is a significant reduction in number of pulls, the wall clock time improvement is marginal [14].\n\ndataset, metric | n, d | quantity | corrSH | Med-dit | Rand | Exact Comp.\nRNA-Seq 20k, $\\ell_1$ | 20k, 28k | time | 10.9 | 246 | 2131 | 40574\nRNA-Seq 20k, $\\ell_1$ | 20k, 28k | # pulls | 2.43 | 121 (2.1%) | 1000 (.1%) | 20000\nRNA-Seq 100k, $\\ell_1$ | 109k, 28k | time | 64.2 | 5819 | 10462 | -\nRNA-Seq 100k, $\\ell_1$ | 109k, 28k | # pulls | 2.10 | 420 | 1000 (.5%) | 100000\nNetflix 20k, cosine dist | 20k, 18k | time | 6.82 | 593 | 70.2 | 139\nNetflix 20k, cosine dist | 20k, 18k | # pulls | 15.0 | 85.8 | 1000 (.6%) | 20000\nNetflix 100k, cosine dist | 100k, 18k | time | 53.4 | 6495 | 959 | -\nNetflix 100k, cosine dist | 100k, 18k | # pulls | 18.5 | 90.5 (6%) | 1000 (3.6%) | 100000\nMNIST Zeros, $\\ell_2$ | 6424, 784 | time | 1.46 | 151 | 65.7 | 22.8\nMNIST Zeros, $\\ell_2$ | 6424, 784 | # pulls | 47.9 | 91.2 (.1%) | 1000 (65.2%) | 6424\n\nTable 1: Performance in average number of pulls per arm. Final percent error noted parenthetically if nonzero. corrSH was run with varying budgets until it had no failures on the 1000 trials.\n\nWe note that in our simulations we only used 1 pull to initialize each arm for Med-dit for plotting purposes where in reality one would use 16 or some larger constant, sacrificing a small additional number of pulls for a roughly 10% reduction in wall clock time. In these plots we show a comparison between Med-dit [1], Correlated Sequential Halving, and RAND [7], shown in Figures 1 and 4.\n\nFigure 4: Number of pulls versus error probability for various datasets and distance metrics. (a) Netflix 20k, cosine [8]. (b) RNA-Seq 100k, $\\ell_1$ [4]. (c) MNIST, $\\ell_2$ [11].\n\n3.1 Simulation details\n\nThe 3 curves for the randomized algorithms previously discussed are generated in different ways. 
For RAND and Med-dit the curves represent the empirical probability, averaged over 1000 trials, that after $nx$ pulls ($x$ pulls per arm on average) the true medoid was the empirically best arm. RAND was run with a budget of 1000 pulls per arm, and Med-dit was run with target error probability of $\\delta = 1/n$. Since Correlated Sequential Halving behaves differently after $x$ pulls per arm depending on what its input budget was, it requires a different method of simulation; every solid dot in the plots represents the average of 1000 trials at a fixed budget, and the dotted line connecting them is simply interpolating the projected performance. In all cases the only variable across trials was the random seed, which was varied across 0-999 for reproducibility. The value noted for Correlated Sequential Halving in Table 1 is the minimum budget above which all simulated error probabilities were 0.\n\nRemark 1. In theory it is much cleaner to discard samples from previous stages when constructing the estimators in stage $r$ to avoid dependence issues in the analysis. In practice we use these past samples, that is we construct our estimator for arm $i$ in stage $r$ from all the samples seen of arm $i$ so far, rather than just the $t_r$ fresh ones.\n\nMany different datasets and distance metrics were used to validate the performance of our algorithm. The first dataset used was a single cell RNA-Seq one, which contains the gene expressions corresponding to each cell in a tissue sample. A common first step in analyzing single cell RNA-Seq datasets is clustering the data to discover sub classes of cells, where medoid finding is used as a subroutine. Since millions of cells are sequenced and tens of thousands of gene expressions are measured in such a process, this naturally gives us a large high dimensional dataset. Since the gene expressions are normalized to a probability distribution for each cell, $\\ell_1$ distance is commonly used for clustering [18]. 
We use the 10xGenomics dataset consisting of 27,998 gene-expressions over 1.3 million neuron cells from the cortex, hippocampus, and subventricular zone of a mouse brain [4]. We test on two subsets of this dataset, a small one of 20,000 cells randomly subsampled, and a larger one of 109,140 cells, the largest true cluster in the dataset. While we can exactly compute a solution for the 20k dataset, it is computationally difficult to do so for the larger one, so we use the most commonly returned point from our algorithms as ground truth (all 3 algorithms have the same most frequently returned point).\n\nAnother dataset we used was the famous Netflix-prize dataset [8]. In such recommendation systems, the objective is to cluster users with similar preferences. One challenge in such problems is that the data is very sparse, with only .21% of the entries in the Netflix-prize dataset being nonzero. This necessitates the use of normalized distance measures in clustering the dataset, like cosine distance, as discussed in [2, Chapter 9]. This dataset consists of 17,769 movies and their ratings by 480,000 Netflix users. We again subsample this dataset, generating a small and large dataset of 20,000 and 100,000 users randomly subsampled. Ground truth is generated as before.\n\nThe final dataset we used was the zeros from the commonly used MNIST dataset [11]. This dataset consists of centered images of handwritten digits. We subsample this, using only the images corresponding to handwritten zeros, in order to truly have one cluster. We use $\\ell_2$ distance, as root mean squared error (RMSE) is a frequently used metric for image reconstruction. 
Combining the train and test datasets we get 6,424 images, and since each image is 28x28 pixels we get $d = 784$. Since this is a smaller dataset, we are able to compute the ground truth exactly.\n\n3.2 Discussion on $\\{\\rho_i\\}$\n\nFor correlation to improve our algorithmic performance, we ideally want $\\rho_i \\ll 1$ and decaying with $\\Delta_i$. Empirically this appears to be the case as seen in Fig. 5, where we plot $\\rho_i$ versus $\\Delta_i$ for the RNA-Seq and MNIST datasets. $1 / \\rho_i^2$ can be thought of as the multiplicative reduction in number of pulls needed to differentiate that arm from the best arm, i.e. $1 / \\rho_i = 10$ roughly implies that we need a factor of 100 fewer pulls to differentiate it from the best arm due to our \u201ccorrelation\u201d. Notably, for arms that would normally require many pulls to differentiate from the best arm (small $\\Delta_i$), $\\rho_i$ is also small. Since algorithms spend the bulk of their time differentiating between the top few arms, this translates into large practical gains.\n\nOne candidate explanation for the phenomenon that small $\\Delta_i$ lead to small $\\rho_i$ is that the points themselves are close in space. However, this intuition fails for high dimensional datasets as shown in Fig. 6. We do see empirically however that $\\rho_i$ decreases with $\\Delta_i$, which drastically decreases the number of comparisons needed as desired.\n\nFigure 5: $1 / \\Delta_i$ vs. $1 / \\rho_i$ in real world dataset. (a) RNA-Seq 20k dataset [4]. (b) MNIST dataset [11].\n\nWe can bound $\\rho_i$ if our distance function obeys the triangle inequality, as $\\hat{\\Delta}_i \\triangleq d(x_i, x_J) - d(x_1, x_J)$ is then a bounded random variable since $|\\hat{\\Delta}_i| \\leq d(x_i, x_1)$. 
Combining this with the knowledge that $\\mathbb{E} \\hat{\\Delta}_i = \\Delta_i$ we get $\\hat{\\Delta}_i$ is sub-Gaussian with parameter at most\n\n$$\\rho_i \\sigma \\leq \\frac{2 d(x_i, x_1) + \\Delta_i}{2}$$\n\nAlternatively, if we assume that $\\hat{\\Delta}_i$ is normally distributed with variance $\\rho_i^2 \\sigma^2$, we are able to get a tighter characterization of $\\rho_i$:\n\n$$\\rho_i^2 \\sigma^2 = \\mathrm{Var}(d(1, J) - d(i, J)) = \\mathbb{E}\\left[ (d(1, J) - d(i, J))^2 \\right] - \\left( \\mathbb{E}\\left[ d(1, J) - d(i, J) \\right] \\right)^2 \\leq d(1, i)^2 - \\Delta_i^2$$\n\nWe can clearly see that as $d(1, i) \\to 0$, $\\rho_i$ decreases, to 0 in the normal case. However in high dimensional datasets $d(1, i)$ is usually not small for almost any $i$. This is empirically shown in Fig. 6.\n\nFigure 6: Distance from point $i$ to the medoid, $d(x_1, x_i)$ versus $\\Delta_i$. (a) RNA-Seq 20k, $\\ell_1$. (b) MNIST zeros, $\\ell_2$. (c) Netflix 20k, cosine distance.\n\nWhile $\\rho_i$ can be small, it is not immediately clear that it is bounded above. However, since our distances are bounded for any given dataset, we have that $d(1, J)$ and $d(i, J)$ are both $\\sigma$-sub-Gaussian for some $\\sigma$, and so we can bound the sub-Gaussian parameter of the quantity $d(1, J) - d(i, J)$ using the Orlicz norm.\n\n$$\\rho_i^2 \\sigma^2 = \\| d(1, J) - d(i, J) \\|_{\\Psi}^2 \\leq \\left( \\| d(1, J) \\|_{\\Psi} + \\| d(i, J) \\|_{\\Psi} \\right)^2 = 4 \\sigma^2$$\n\nWhile this appears to be worse at first glance, we are able to jointly bound $\\mathbb{P}(\\hat{\\theta}_i - \\hat{\\theta}_1 - \\Delta_i < -\\Delta_i) \\leq \\exp\\left( -\\frac{n \\Delta_i^2}{2 \\rho_i^2 \\sigma^2} \\right) \\leq \\exp\\left( -\\frac{n \\Delta_i^2}{8 \\sigma^2} \\right)$ by the control of $\\rho_i$ above. 
In the regular case, this bound is achieved by separating the two and bounding the probability that either $\\hat{\\theta}_i < \\theta_i - \\Delta_i / 2$ or $\\hat{\\theta}_1 > \\theta_1 + \\Delta_i / 2$, which yields an equivalent probability since we need $\\hat{\\theta}_i, \\hat{\\theta}_1$ to concentrate to half the original width. Hence, even for data without significant correlation, attempting to correlate the noise will not increase the number of pulls required when using this standard analysis method.\n\n3.3 Fixed Budget\n\nIn simulating Correlated Sequential Halving, we swept the budget over a range and reported the smallest budget above which there were 0 errors in 1000 trials. One logical question given a fixed budget algorithm like corrSH is then, for a given problem, what to set the budget to. This is an important question for further investigation, as there does not seem to be an efficient way to estimate $\\tilde{H}_2$. Before providing our doubling trick based solution, we would like to note that it is unclear what to set the hyperparameters to for any of the aforementioned randomized algorithms. RAND is similarly fixed budget, and for Med-dit, while setting $\\delta = \\frac{1}{n}$ achieves vanishing error probability theoretically, using this setting in practice for finite $n$ yields an error rate of 6% for the Netflix 100k dataset. Additionally, the fixed budget setting makes sense in the case of limited computing power or time sensitive applications.\n\nThe approach we propose is a variant of the doubling trick, which is commonly used to convert fixed budget or finite horizon algorithms to data dependent or anytime ones. Here this would translate to running the algorithm with a certain budget $T$ (say $3n$), then doubling the budget to $6n$ and rerunning the algorithm. If the two answers are the same, declare this the medoid and output it. If the answers are different, double the budget again to $12n$ and compare. 
The odds that the same incorrect arm is output both times are exceedingly small, as even with a budget that is too small, the most likely output of this algorithm is the true medoid. This requires a budget of at most $8T$ to yield approximately the same error probability as that of just running our algorithm with budget $T$.\n\n4 Summary\n\nWe have presented a new algorithm, Correlated Sequential Halving, for computing the medoid of a large dataset. We prove bounds on its performance, deviating from standard multi-armed bandit analysis due to the correlation in the arms. We include experimental results to corroborate our theoretical gains, showing the massive improvement to be gained from utilizing correlation in real world datasets. There remains future practical work to be done in seeing if other computational or statistical problems can benefit from this correlation trick. Additionally there are open theoretical questions in proving lower bounds for this special query model, seeing if there is any larger view of correlation beyond pairwise that is analytically tractable, and analyzing this generalized stochastic bandits setting.\n\nAcknowledgements\n\nThe authors gratefully acknowledge funding from the NSF GRFP, Alcatel-Lucent Stanford Graduate Fellowship, NSF grant under CCF-1563098, and the Center for Science of Information (CSoI), an NSF Science and Technology Center under grant agreement CCF-0939370.\n\nReferences\n\n[1] V. Bagaria, G. Kamath, V. Ntranos, M. Zhang, and D. Tse, \u201cMedoids in almost-linear time via multi-armed bandits,\u201d in Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pp. 500\u2013509, 2018.\n\n[2] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of massive datasets. Cambridge university press, 2014.\n\n[3] L. Kaufman and P. J. 
Rousseeuw, \u201cClustering by means of medoids,\u201d 1987.\n\n[4] 10xGenomics, \u201c1.3 million brain cells from e18 mice,\u201d 2017. Available at\nhttps://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons.\n\n[5] J. Newling and F. Fleuret, \u201cA sub-quadratic exact medoid algorithm,\u201d arXiv preprint\narXiv:1605.06950, 2016.\n\n[6] K. Okamoto, W. Chen, and X.-Y. Li, \u201cRanking of closeness centrality for large-scale social\nnetworks,\u201d in International Workshop on Frontiers in Algorithmics, pp. 186\u2013195, Springer,\n2008.\n\n[7] D. E. J. Wang, \u201cFast approximation of centrality,\u201d Graph Algorithms and Applications, vol. 5,\nno. 5, p. 39, 2006.\n\n[8] J. Bennett, S. Lanning, et al., \u201cThe netflix prize,\u201d in Proceedings of KDD cup and workshop,\nvol. 2007, p. 35, New York, NY, USA., 2007.\n\n[9] E. Kaufmann, O. Capp\u00e9, and A. Garivier, \u201cOn the complexity of best-arm identification in multi-armed\nbandit models,\u201d The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1\u201342,\n2016.\n\n[10] Z. Karnin, T. Koren, and O. Somekh, \u201cAlmost optimal exploration in multi-armed bandits,\u201d in\nInternational Conference on Machine Learning, pp. 1238\u20131246, 2013.\n\n[11] L. Yann, C. Cortes, and C. J. Burges, \u201cThe mnist database of handwritten digits,\u201d 1998. Available\nat http://yann.lecun.com/exdb/mnist/.\n\n[12] L. Kocsis and C. Szepesv\u00e1ri, \u201cBandit based monte-carlo planning,\u201d in European conference on\nmachine learning, pp. 282\u2013293, Springer, 2006.\n\n[13] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar, \u201cHyperband: A novel\nbandit-based approach to hyperparameter optimization,\u201d arXiv preprint arXiv:1603.06560,\n2016.\n\n[14] V. Bagaria, G. M. Kamath, and D. N. Tse, \u201cAdaptive monte-carlo optimization,\u201d arXiv preprint\narXiv:1805.08321, 2018.\n\n[15] D. LeJeune, R. Heckel, and R. 
Baraniuk, \u201cAdaptive estimation for approximate k-nearest-neighbor\ncomputations,\u201d in Proceedings of Machine Learning Research, pp. 3099\u20133107, 2019.\n\n[16] M. J. Zhang, J. Zou, and D. N. Tse, \u201cAdaptive monte carlo multiple testing via multi-armed\nbandits,\u201d arXiv preprint arXiv:1902.00197, 2019.\n\n[17] T. L. Lai and H. Robbins, \u201cAsymptotically efficient adaptive allocation rules,\u201d Advances in\napplied mathematics, vol. 6, no. 1, pp. 4\u201322, 1985.\n\n[18] V. Ntranos, G. M. Kamath, J. M. Zhang, L. Pachter, and D. N. Tse, \u201cFast and accurate single-cell\nrna-seq analysis by clustering of transcript-compatibility counts,\u201d Genome biology, vol. 17,\nno. 1, p. 112, 2016.\n", "award": [], "sourceid": 1972, "authors": [{"given_name": "Tavor", "family_name": "Baharav", "institution": "Stanford University"}, {"given_name": "David", "family_name": "Tse", "institution": "Stanford University"}]}