{"title": "Practical and Optimal LSH for Angular Distance", "book": "Advances in Neural Information Processing Systems", "page_first": 1225, "page_last": 1233, "abstract": "We show the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal running time exponent. Unlike earlier algorithms with this property (e.g., Spherical LSH (Andoni-Indyk-Nguyen-Razenshteyn 2014) (Andoni-Razenshteyn 2015)), our algorithm is also practical, improving upon the well-studied hyperplane LSH (Charikar 2002) in practice. We also introduce a multiprobe version of this algorithm and conduct an experimental evaluation on real and synthetic data sets.We complement the above positive results with a fine-grained lower bound for the quality of any LSH family for angular distance. Our lower bound implies that the above LSH family exhibits a trade-off between evaluation time and quality that is close to optimal for a natural class of LSH functions.", "full_text": "Practical and Optimal LSH for Angular Distance\n\nAlexandr Andoni\u2217\nColumbia University\n\nPiotr Indyk\n\nMIT\n\nThijs Laarhoven\nTU Eindhoven\n\nIlya Razenshteyn\n\nMIT\n\nLudwig Schmidt\n\nMIT\n\nAbstract\n\nWe show the existence of a Locality-Sensitive Hashing (LSH) family for the angu-\nlar distance that yields an approximate Near Neighbor Search algorithm with the\nasymptotically optimal running time exponent. Unlike earlier algorithms with this\nproperty (e.g., Spherical LSH [1, 2]), our algorithm is also practical, improving\nupon the well-studied hyperplane LSH [3] in practice. We also introduce a mul-\ntiprobe version of this algorithm and conduct an experimental evaluation on real\nand synthetic data sets.\nWe complement the above positive results with a \ufb01ne-grained lower bound for the\nquality of any LSH family for angular distance. 
Our lower bound implies that the above LSH family exhibits a trade-off between evaluation time and quality that is close to optimal for a natural class of LSH functions.

1 Introduction

Nearest neighbor search is a key algorithmic problem with applications in several fields including computer vision, information retrieval, and machine learning [4]. Given a set of n points P ⊂ R^d, the goal is to build a data structure that answers nearest neighbor queries efficiently: for a given query point q ∈ R^d, find the point p ∈ P that is closest to q under an appropriately chosen distance metric. The main algorithmic design goals are usually a fast query time, a small memory footprint, and, in the approximate setting, a good quality of the returned solution.
There is a wide range of algorithms for nearest neighbor search based on techniques such as space partitioning with indexing, as well as dimension reduction or sketching [5]. A popular method for point sets in high-dimensional spaces is Locality-Sensitive Hashing (LSH) [6, 3], an approach that offers a provably sub-linear query time and sub-quadratic space complexity, and has been shown to achieve good empirical performance in a variety of applications [4]. The method relies on the notion of locality-sensitive hash functions. Intuitively, a hash function is locality-sensitive if its probability of collision is higher for “nearby” points than for points that are “far apart”. More formally, two points are nearby if their distance is at most r1, and they are far apart if their distance is at least r2 = c · r1, where c > 1 quantifies the gap between “near” and “far”. The quality of a hash function is characterized by two key parameters: p1 is the collision probability for nearby points, and p2 is the collision probability for points that are far apart. 
The gap between p1 and p2 determines how “sensitive” the hash function is to changes in distance, and this property is captured by the parameter ρ = log(1/p1)/log(1/p2), which can usually be expressed as a function of the distance gap c. The problem of designing good locality-sensitive hash functions and LSH-based efficient nearest neighbor search algorithms has attracted significant attention over the last few years.

∗The authors are listed in alphabetical order.

In this paper, we focus on LSH for the Euclidean distance on the unit sphere, which is an important special case for several reasons. First, the spherical case is relevant in practice: Euclidean distance on a sphere corresponds to the angular distance or cosine similarity, which are commonly used in applications such as comparing image feature vectors [7], speaker representations [8], and tf-idf data sets [9]. Moreover, on the theoretical side, the paper [2] shows a reduction from Nearest Neighbor Search in the entire Euclidean space to the spherical case. These connections lead to a natural question: what are good LSH families for this special case?
On the theoretical side, the recent work of [1, 2] gives the best known provable guarantees for LSH-based nearest neighbor search w.r.t. the Euclidean distance on the unit sphere. Specifically, their algorithm has a query time of O(n^ρ) and space complexity of O(n^{1+ρ}) for ρ = 1/(2c² − 1).¹ E.g., for the approximation factor c = 2, the algorithm achieves a query time of n^{1/7 + o(1)}. At the heart of the algorithm is an LSH scheme called Spherical LSH, which works for unit vectors. 
Its key property is that it can distinguish between distances r1 = √2/c and r2 = √2 with probabilities yielding ρ = 1/(2c² − 1) (the formula for the full range of distances is more complex and given in Section 3). Unfortunately, the scheme as described in the paper is not applicable in practice as it is based on rather complex hash functions that are very time consuming to evaluate. E.g., simply evaluating a single hash function from [2] can take more time than a linear scan over 10^6 points. Since an LSH data structure contains many individual hash functions, using their scheme would be slower than a simple linear scan over all points in P unless the number of points n is extremely large.
On the practical side, the hyperplane LSH introduced in the influential work of Charikar [3] has worse theoretical guarantees, but works well in practice. Since the hyperplane LSH can be implemented very efficiently, it is the standard hash function in practical LSH-based nearest neighbor algorithms,² and the resulting implementations have been shown to improve over a linear scan on real data by multiple orders of magnitude [14, 9].
The aforementioned discrepancy between the theory and practice of LSH raises an important question: is there a locality-sensitive hash function with optimal guarantees that also improves over the hyperplane LSH in practice?
In this paper we show that there is a family of locality-sensitive hash functions that achieves both objectives. Specifically, the hash functions match the theoretical guarantee of Spherical LSH from [2] and, when combined with additional techniques, give better experimental results than the hyperplane LSH. More specifically, our contributions are:

Theoretical guarantees for the cross-polytope LSH. 
We show that a hash function based on randomly rotated cross-polytopes (i.e., unit balls of the ℓ1-norm) achieves the same parameter ρ as the Spherical LSH scheme in [2], assuming data points are unit vectors. While the cross-polytope LSH family has been proposed by researchers before [15, 16], we give the first theoretical analysis of its performance.

Fine-grained lower bound for cosine similarity LSH. To highlight the difficulty of obtaining optimal and practical LSH schemes, we prove the first non-asymptotic lower bound on the trade-off between the collision probabilities p1 and p2. So far, the schemes achieving the optimal exponent ρ = 1/(2c² − 1) (those of [1, 2] and the cross-polytope LSH studied here) attain this bound only in the limit, as p1, p2 → 0. Very small p1 and p2 are undesirable since the hash evaluation time is often proportional to 1/p2. Our lower bound proves this is unavoidable: if we require p2 to be large, ρ has to be suboptimal.
This result has two important implications for designing practical hash functions. First, it shows that the trade-offs achieved by the cross-polytope LSH and the scheme of [1, 2] are essentially optimal. Second, the lower bound guides the design of future LSH functions: if one is to significantly improve upon the cross-polytope LSH, one has to design a hash function that is computed more efficiently than by explicitly enumerating its range (see Section 4 for a more detailed discussion).

Multiprobe scheme for the cross-polytope LSH. The space complexity of an LSH data structure is sub-quadratic, but even this is often too large (i.e., strongly super-linear in the number of points),

¹This running time is known to be essentially optimal for a large class of algorithms [10, 11].
²Note that if the data points are binary, more efficient LSH schemes exist [12, 13]. 
However, in this paper we consider algorithms for general (non-binary) vectors.

and several methods have been proposed to address this issue. Empirically, the most efficient scheme is multiprobe LSH [14], which leads to a significantly reduced memory footprint for the hyperplane LSH. In order to make the cross-polytope LSH competitive in practice with the multiprobe hyperplane LSH, we propose a novel multiprobe scheme for the cross-polytope LSH.
We complement these contributions with an experimental evaluation on both real and synthetic data (SIFT vectors, tf-idf data, and a random point set). In order to make the cross-polytope LSH practical, we combine it with fast pseudo-random rotations [17] via the Fast Hadamard Transform, and with feature hashing [18] to exploit the sparsity of the data. Our results show that for data sets with around 10^5 to 10^8 points, our multiprobe variant of the cross-polytope LSH is up to 10× faster than an efficient implementation of the hyperplane LSH, and up to 700× faster than a linear scan. To the best of our knowledge, our combination of techniques provides the first “exponent-optimal” algorithm that empirically improves over the hyperplane LSH in terms of query time for an exact nearest neighbor search.

1.1 Related work

The cross-polytope LSH functions were originally proposed in [15]. However, the analysis in that paper was mostly experimental. Specifically, the probabilities p1 and p2 of the proposed LSH functions were estimated empirically using the Monte Carlo method. Similar hash functions were later proposed in [16]. The latter paper also uses the DFT to speed up the random matrix-vector multiplication. 
Both of the aforementioned papers consider only the single-probe algorithm.
There are several works that show lower bounds on the quality of LSH hash functions [19, 10, 20, 11]. However, those papers provide only a lower bound on the ρ parameter for asymptotic values of p1 and p2, as opposed to an actual trade-off between these two quantities. In this paper we provide such a trade-off, with implications as outlined in the introduction.

2 Preliminaries

We use ‖·‖ to denote the Euclidean (a.k.a. ℓ2) norm on R^d. We also use S^{d−1} to denote the unit sphere in R^d centered at the origin. The Gaussian distribution with mean zero and variance one is denoted by N(0, 1). Let µ be the normalized Haar measure on S^{d−1} (that is, µ(S^{d−1}) = 1); note that µ corresponds to the uniform distribution over S^{d−1}. We also let u ∼ S^{d−1} be a point sampled from S^{d−1} uniformly at random. For η ∈ R we denote
$$\Phi_c(\eta) = \Pr_{X \sim N(0,1)}[X \ge \eta] = \frac{1}{\sqrt{2\pi}} \int_\eta^\infty e^{-t^2/2}\, dt.$$
We will be interested in Near Neighbor Search on the sphere S^{d−1} with respect to the Euclidean distance. Note that the angular distance can be expressed via the Euclidean distance between normalized vectors, so our results apply to the angular distance as well.
Definition 1. Given an n-point dataset P ⊂ S^{d−1} on the sphere, the goal of the (c, r)-Approximate Near Neighbor problem (ANN) is to build a data structure that, given a query q ∈ S^{d−1} with the promise that there exists a datapoint p ∈ P with ‖p − q‖ ≤ r, reports a datapoint p′ ∈ P within distance cr from q.
Definition 2. 
We say that a hash family H on the sphere S^{d−1} is (r1, r2, p1, p2)-sensitive if for every x, y ∈ S^{d−1} one has Pr_{h∼H}[h(x) = h(y)] ≥ p1 whenever ‖x − y‖ ≤ r1, and Pr_{h∼H}[h(x) = h(y)] ≤ p2 whenever ‖x − y‖ ≥ r2.
It is known [6] that an efficient (r, cr, p1, p2)-sensitive hash family implies a data structure for (c, r)-ANN with space O(n^{1+ρ}/p1 + dn) and query time O(d · n^ρ/p1), where ρ = log(1/p1)/log(1/p2).

3 Cross-polytope LSH

In this section, we describe the cross-polytope LSH, analyze it, and show how to make it practical. First, we recall the definition of the cross-polytope LSH [15]: Consider the following hash family H for points on a unit sphere S^{d−1} ⊂ R^d. Let A ∈ R^{d×d} be a random matrix with i.i.d. Gaussian entries (“a random rotation”). To hash a point x ∈ S^{d−1}, we compute y = Ax/‖Ax‖ ∈ S^{d−1} and then find the point closest to y from {±e_i}_{1≤i≤d}, where e_i is the i-th standard basis vector of R^d. We use this closest neighbor as the hash of x.
The following theorem bounds the collision probability for two points under the above family H.
Theorem 1. Suppose that p, q ∈ S^{d−1} are such that ‖p − q‖ = τ, where 0 < τ < 2. Then,
$$\ln \frac{1}{\Pr_{h \sim H}[h(p) = h(q)]} = \frac{\tau^2}{4 - \tau^2} \cdot \ln d + O_\tau(\ln \ln d).$$
Before we show how to prove this theorem, we briefly describe its implications. Theorem 1 shows that the cross-polytope LSH achieves essentially the same bounds on the collision probabilities as the (theoretically) optimal LSH for the sphere from [2] (see Section “Spherical LSH” there). 
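As a concrete illustration, one evaluation of the hash family defined above can be sketched in a few lines of pure Python. This is a minimal sketch of our own (the function names are illustrative, not from the paper), and it uses a dense Gaussian matrix rather than the fast pseudo-random rotations introduced in Section 3.1:

```python
import random

def random_rotation(d, rng):
    # A d x d matrix with i.i.d. Gaussian entries plays the role of "a random rotation".
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]

def cross_polytope_hash(x, A):
    """Hash a unit vector x to the signed basis vector closest to Ax.

    Normalizing Ax does not change which +-e_i is closest, so we skip it.
    The hash value is encoded as +(j+1) for +e_j and -(j+1) for -e_j.
    """
    y = [sum(a * t for a, t in zip(row, x)) for row in A]
    j = max(range(len(y)), key=lambda i: abs(y[i]))
    return (j + 1) if y[j] >= 0 else -(j + 1)
```

Close pairs of unit vectors agree under this hash noticeably more often than random pairs, which is exactly the gap between p1 and p2 that the definition above asks for.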
In particular, substituting the bounds from Theorem 1 for the cross-polytope LSH into the standard reduction from Near Neighbor Search to LSH [6], we obtain the following data structure with sub-quadratic space and sublinear query time for Near Neighbor Search on a sphere.
Corollary 1. The (c, r)-ANN on a unit sphere S^{d−1} can be solved in space O(n^{1+ρ} + dn) and query time O(d · n^ρ), where
$$\rho = \frac{1}{c^2} \cdot \frac{4 - c^2 r^2}{4 - r^2} + o(1).$$
We now outline the proof of Theorem 1. For the full proof, see Appendix B.
Due to the spherical symmetry of Gaussians, we can assume that p = e1 and q = αe1 + βe2, where α, β are such that α² + β² = 1 and (α − 1)² + β² = τ². Then, we expand the collision probability:
$$\Pr_{h \sim H}[h(p) = h(q)] = 2d \cdot \Pr_{h \sim H}[h(p) = h(q) = e_1]$$
$$= 2d \cdot \Pr_{u, v \sim N(0,1)^d}\bigl[\forall i\ |u_i| \le u_1 \text{ and } |\alpha u_i + \beta v_i| \le \alpha u_1 + \beta v_1\bigr]$$
$$= 2d \cdot \mathop{\mathbb{E}}_{X_1, Y_1}\Bigl[\Pr_{X_2, Y_2}\bigl[|X_2| \le X_1 \text{ and } |\alpha X_2 + \beta Y_2| \le \alpha X_1 + \beta Y_1\bigr]^{d-1}\Bigr], \qquad (1)$$
where X1, Y1, X2, Y2 ∼ N(0, 1). Indeed, the first step is due to the spherical symmetry of the hash family; the second step follows from the above discussion about replacing a random orthogonal matrix with a Gaussian one and from the fact that one can assume w.l.o.g. that p = e1 and q = αe1 + βe2; the last step is due to the independence of the entries of u and v.
Thus, proving Theorem 1 reduces to estimating the right-hand side of (1). Note that the probability Pr[|X2| ≤ X1 and |αX2 + βY2| ≤ αX1 + βY1] is equal to the Gaussian area of the planar set S_{X1,Y1} shown in Figure 1a. 
The latter is heuristically equal to 1 − e^{−∆²/2}, where ∆ is the distance from the origin to the complement of S_{X1,Y1}, which is easy to compute (see Appendix A for the precise statement of this argument). Using this estimate, we compute (1) by taking the outer expectation.

3.1 Making the cross-polytope LSH practical

As described above, the cross-polytope LSH is not quite practical. The main bottleneck is sampling, storing, and applying a random rotation. In particular, to multiply a random Gaussian matrix with a vector, we need time proportional to d², which is infeasible for large d.

Pseudo-random rotations. To rectify this issue, we instead use pseudo-random rotations. Instead of multiplying an input vector x by a random Gaussian matrix, we apply the following linear transformation: x ↦ HD₃HD₂HD₁x, where H is the Hadamard transform and D_i for i ∈ {1, 2, 3} is a random diagonal ±1-matrix. Clearly, this is an orthogonal transformation, which one can store in space O(d) and evaluate in time O(d log d) using the Fast Hadamard Transform. This is similar to the pseudo-random rotations used in the context of LSH [21], dimensionality reduction [17], and compressed sensing [22]. While we are currently not aware how to prove rigorously that such pseudo-random rotations perform as well as fully random ones, our empirical evaluations show that three applications of HD_i are indistinguishable from a truly random rotation for this purpose (as d tends to infinity). 
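A minimal sketch of such a pseudo-random rotation follows. This is our own illustrative code, not the paper's implementation; it uses the unnormalized Hadamard transform, which scales all lengths by the same fixed factor d^{3/2} and therefore does not affect which coordinate attains the maximum absolute value in the hash:

```python
import random

def fht(v):
    """In-place unnormalized Fast Hadamard Transform; len(v) must be a power of 2."""
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v

def pseudo_random_rotation(x, sign_diagonals):
    """Apply x -> H D3 H D2 H D1 x for random +-1 diagonals D1, D2, D3."""
    y = list(x)
    for diag in sign_diagonals:  # one (D_i, H) round per diagonal
        y = [s * t for s, t in zip(diag, y)]
        fht(y)
    return y
```

Since H·Hᵀ = d·I, the composed map is orthogonal up to the scalar d^{3/2}, so inner products (and hence angles) are preserved up to the fixed factor d³.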
We note that two applications of HD_i are not sufficient.

Figure 1: (a) The set appearing in the analysis of the cross-polytope LSH: S_{X1,Y1} = {(x, y) : |x| ≤ X1 and |αx + βy| ≤ αX1 + βY1}. (b) Trade-off between the sensitivity ρ and the number of parts T for distances √2/2 and √2 (approximation c = √2); both bounds tend to 1/7 (see the discussion in Section 4). [Plot in (b): two curves, for the cross-polytope LSH and for the lower bound, with ρ from 0.15 to 0.4 on the vertical axis against the number of parts T from 10^0 to 10^16 on the horizontal axis.]

Feature hashing. While we can apply a pseudo-random rotation in time O(d log d), even this can be too slow. E.g., consider an input vector x that is sparse: the number s of non-zero entries of x is much smaller than d. In this case, we can evaluate the hyperplane LSH from [3] in time O(s), while computing the cross-polytope LSH (even with pseudo-random rotations) still takes time O(d log d). To speed up the cross-polytope LSH for sparse vectors, we apply feature hashing [18]: before performing a pseudo-random rotation, we reduce the dimension from d to d′ ≪ d by applying a linear map x ↦ Sx, where S is a random sparse d′ × d matrix whose columns each have one non-zero ±1 entry sampled uniformly. This way, the evaluation time becomes O(s + d′ log d′).³

“Partial” cross-polytope LSH. In the above discussion, we defined the cross-polytope LSH as a hash family that returns the closest neighbor among {±e_i}_{1≤i≤d} as a hash (after a (pseudo-)random rotation). In principle, we do not have to consider all d basis vectors when computing the closest neighbor. By restricting the hash to d′ ≤ d basis vectors instead, Theorem 1 still holds for the new hash family (with d replaced by d′) since the analysis is essentially dimension-free. 
This slight generalization of the cross-polytope LSH turns out to be useful for experiments (see Section 6). Note that the case d′ = 1 corresponds to the hyperplane LSH.

4 Lower bound

Let H be a hash family on S^{d−1}. For 0 < r1 < r2 < 2 we would like to understand the trade-off between p1 and p2, where p1 is the smallest probability of collision under H for points at distance at most r1 and p2 is the largest probability of collision for points at distance at least r2. We focus on the case r2 ≈ √2 because setting r2 to √2 − o(1) (as d tends to infinity) allows us to replace p2 with the following quantity that is somewhat easier to handle:
$$p_2^* = \Pr_{\substack{h \sim H \\ u, v \sim S^{d-1}}}[h(u) = h(v)].$$
This quantity is at most p2 + o(1), since the distance between two random points on a unit sphere S^{d−1} is tightly concentrated around √2. So for a hash family H on a unit sphere S^{d−1}, we would like to understand the upper bound on p1 in terms of p₂* and 0 < r1 < √2.
For 0 ≤ τ ≤ √2 and η ∈ R, we define
$$\Lambda(\tau, \eta) = \Pr_{X, Y \sim N(0,1)}\left[X \ge \eta \text{ and } \Bigl(1 - \frac{\tau^2}{2}\Bigr) \cdot X + \sqrt{\tau^2 - \frac{\tau^4}{4}} \cdot Y \ge \eta\right] \Big/ \Pr_{X \sim N(0,1)}[X \ge \eta].$$

³Note that one can apply Lemma 2 from the arXiv version of [18] to claim that, after such a dimension reduction, the distance between any two points remains sufficiently concentrated for the bounds from Theorem 1 to still hold (with d replaced by d′).

We are now ready to formulate the main result of this section.
Theorem 2. 
Let H be a hash family on S^{d−1} such that every function in H partitions the sphere into at most T parts, each of measure at most 1/2. Then we have p1 ≤ Λ(r1, η) + o(1), where η ∈ R is such that Φ_c(η) = p₂* and o(1) is a quantity that depends on T and r1 and tends to 0 as d tends to infinity.

The idea of the proof is first to reason about one part of the partition using the isoperimetric inequality from [23], and then to apply a certain averaging argument by proving concavity of a function related to Λ using a delicate analytic argument. For the full proof, see Appendix C.
We note that the above requirement that all parts induced by H have measure at most 1/2 is only a technicality. We conjecture that Theorem 2 holds without this restriction. In any case, as we will see below, in the interesting range of parameters this restriction is essentially irrelevant.
One can observe that if every hash function in H partitions the sphere into at most T parts, then p₂* ≥ 1/T (indeed, p₂* is precisely the average sum of squares of the measures of the parts). This observation, combined with Theorem 2, leads to the following interesting consequence. Specifically, we can numerically estimate Λ in order to give a lower bound on ρ = log(1/p1)/log(1/p2) for any hash family H in which every function induces at most T parts of measure at most 1/2. See Figure 1b, where we plot this lower bound for r1 = √2/2,⁴ together with an upper bound that is given by the cross-polytope LSH⁵ (for which we use numerical estimates for (1)). We can make several conclusions from this plot. First, the cross-polytope LSH gives an almost optimal trade-off between ρ and T. 
Given that the evaluation time for the cross-polytope LSH is O(T log T) (if one uses pseudo-random rotations), we conclude that in order to improve substantially upon the cross-polytope LSH in practice, one should design an LSH family with ρ close to optimal and evaluation time sublinear in T. We note that none of the known LSH families for the sphere has been shown to have this property. This direction looks especially interesting since the convergence of ρ to the optimal value (as T tends to infinity) is extremely slow: for instance, according to Figure 1b, for r1 = √2/2 and r2 ≈ √2 we need more than 10^5 parts to achieve ρ ≤ 0.2, whereas the optimal ρ is 1/7 ≈ 0.143.

5 Multiprobe LSH for the cross-polytope LSH

We now describe our multiprobe scheme for the cross-polytope LSH, which is a method for reducing the number of independent hash tables in an LSH data structure. Given a query point q, a “standard” LSH data structure considers only a single cell in each of the L hash tables (the cell is given by the hash value h_i(q) for i ∈ [L]). In multiprobe LSH, we consider candidates from multiple cells in each table [14]. The rationale is the following: points p that are close to q but fail to collide with q under hash function h_i are still likely to hash to a value that is close to h_i(q). By probing multiple hash locations close to h_i(q) in the same table, multiprobe LSH achieves a given probability of success with a smaller number of hash tables than “standard” LSH. Multiprobe LSH has been shown to perform well in practice [14, 24].
The main ingredient in multiprobe LSH is a probing scheme for generating and ranking possible modifications of the hash value h_i(q). The probing scheme should be computationally efficient and ensure that more likely hash locations are probed first. 
For a single cross-polytope hash, the order of alternative hash values is straightforward: let x be the (pseudo-)randomly rotated version of the query point q. Recall that the “main” hash value is h_i(q) = arg max_{j∈[d]} |x_j|.⁶ Then it is easy to see that the second highest probability of collision is achieved for the hash value corresponding to the coordinate with the second largest absolute value, etc. Therefore, we consider the indices j ∈ [d] sorted by their absolute value |x_j| as our probing sequence or “ranking” for a single cross-polytope.
The remaining question is how to combine multiple cross-polytope rankings when we have more than one hash function. As in the analysis of the cross-polytope LSH (see Section 3), we consider two points q = e1 and p = αe1 + βe2 at distance R. Let A^(i) be the i.i.d. Gaussian matrix of hash function h_i, and let x^(i) = A^(i)e1 be the randomly rotated version of point q. Given x^(i), we are interested in the probability of p hashing to a certain combination of the individual cross-polytope rankings. More formally, let r^(i)_{v_i} be the index of the v_i-th largest element of |x^(i)|, where v ∈ [d]^k specifies the alternative probing location.

⁴The situation is qualitatively similar for other values of r1.
⁵More specifically, for the “partial” version from Section 3.1, since T should be constant while d grows.
⁶In order to simplify notation, we consider a slightly modified version of the cross-polytope LSH that maps both the standard basis vector +e_j and its opposite −e_j to the same hash value. It is easy to extend the multiprobe scheme defined here to the “full” cross-polytope LSH from Section 3. 
Then we would like to compute
$$\Pr_{A^{(1)}, \ldots, A^{(k)}}\bigl[h_i(p) = r^{(i)}_{v_i} \text{ for all } i \in [k] \bigm| A^{(i)} q = x^{(i)}\bigr] = \prod_{i=1}^{k} \Pr_{A^{(i)}}\Bigl[\arg\max_{j \in [d]} \bigl|(\alpha \cdot A^{(i)} e_1 + \beta \cdot A^{(i)} e_2)_j\bigr| = r^{(i)}_{v_i} \Bigm| A^{(i)} e_1 = x^{(i)}\Bigr].$$
If we knew this probability for all v ∈ [d]^k, we could sort the probing locations by their probability. We now show how to approximate this probability efficiently for a single value of i (and hence drop the superscripts to simplify notation). W.l.o.g., we permute the rows of A so that r_v = v and get
$$\Pr_{A}\Bigl[\arg\max_{j \in [d]} \bigl|(\alpha x + \beta \cdot A e_2)_j\bigr| = v \Bigm| A e_1 = x\Bigr] = \Pr_{y \sim N(0, I_d)}\Bigl[\arg\max_{j \in [d]} \bigl|\bigl(x + \tfrac{\beta}{\alpha} \cdot y\bigr)_j\bigr| = v\Bigr].$$
The RHS is the Gaussian measure of the set S = {y ∈ R^d | arg max_{j∈[d]} |(x + (β/α) · y)_j| = v}. Similar to the analysis of the cross-polytope LSH, we approximate the measure of S by its distance to the origin. Then the probability of probing location v is proportional to exp(−‖y_{x,v}‖²), where y_{x,v} is the shortest vector y such that arg max_j |x + y|_j = v. Note that the factor β/α becomes a proportionality constant, and hence the probing scheme does not require knowing the distance R. For computational performance and simplicity, we make a further approximation and use y_{x,v} = (max_i |x_i| − |x_v|) · e_v, i.e., we only consider modifying a single coordinate to reach the set S.
Once we have estimated the probabilities for each v_i ∈ [d], we incrementally construct the probing sequence using a binary heap, similar to the approach in [14]. 
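The single-polytope part of this probing scheme can be sketched as follows. This is our own illustrative Python: it ranks the probing locations of one cross-polytope by the score (max_i |x_i| − |x_v|)², i.e., by the squared length of the single-coordinate approximation y_{x,v} above, smallest first:

```python
def probing_ranking(x):
    """Rank the probing locations of one cross-polytope hash, most likely first.

    x is the (pseudo-)randomly rotated query vector. Location v is scored by
    (max_i |x_i| - |x_v|)**2, the squared length of the single-coordinate shift
    y_{x,v}; its probing probability is approximately proportional to exp(-score).
    """
    m = max(abs(t) for t in x)
    scored = [((m - abs(t)) ** 2, v) for v, t in enumerate(x)]
    scored.sort()  # smallest shift first, i.e. most likely location first
    return [v for _, v in scored]
```

The first entry of the ranking is the “main” hash value arg max_j |x_j|; a multiprobe query would walk down this list and, with several hash functions per table, merge the per-polytope rankings by total score using a binary heap.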
For a probing sequence of length m, the resulting algorithm has running time O(L · d log d + m log m). In our experiments, we found that the O(L · d log d) time taken to sort the probing candidates dominated the running time of the hash function evaluation. In order to circumvent this issue, we use an incremental sorting approach that only sorts the relevant parts of each cross-polytope and gives a running time of O(L · d + m log m).

6 Experiments

We now show that the cross-polytope LSH, combined with our multiprobe extension, leads to an algorithm that is also efficient in practice and improves over the hyperplane LSH on several data sets. The focus of our experiments is the query time for an exact nearest neighbor search. Since the hyperplane LSH has been compared to other nearest-neighbor algorithms before [8], we limit our attention to the relative speed-up compared with hyperplane hashing.
We evaluate the two hashing schemes on three types of data sets. We use a synthetic data set of randomly generated points because this allows us to vary a single problem parameter while keeping the remaining parameters constant. We also investigate the performance of our algorithm on real data: two tf-idf data sets [25] and a set of SIFT feature vectors [7]. We have chosen these data sets in order to illustrate when the cross-polytope LSH gives large improvements over the hyperplane LSH, and when the improvements are more modest. See Appendix D for a more detailed description of the data sets and our experimental setup (implementation details, CPU, etc.).
In all experiments, we set the algorithm parameters so that the empirical probability of successfully finding the exact nearest neighbor is at least 0.9. Moreover, we set the number of LSH tables L so that the amount of additional memory occupied by the LSH data structure is comparable to the amount of memory necessary for storing the data set. 
We believe that this is the most interesting regime because significant memory overheads are often impossible for large data sets. In order to determine the parameters that are not fixed by the above constraints, we perform a grid search over the remaining parameter space and report the best combination of parameters. For the cross-polytope hash, we consider “partial” cross-polytopes in the last of the k hash functions in order to get a smooth trade-off between the various parameters (see Section 3.1).

Multiprobe experiments. In order to demonstrate that the multiprobe scheme is critical for making the cross-polytope LSH competitive with hyperplane hashing, we compare the performance of a “standard” cross-polytope LSH data structure with our multiprobe variant on an instance of the random data set (n = 2^20, d = 128). As can be seen in Table 2 (Appendix D), the multiprobe variant is about 13× faster in our memory-constrained setting (L = 10).

Table 1: Average running times for a single nearest neighbor query with the hyperplane (HP) and cross-polytope (CP) algorithms on three real data sets. The cross-polytope LSH is faster than the hyperplane LSH on all data sets, with significant speed-ups for the two tf-idf data sets NYT and pubmed. For the cross-polytope LSH, the entries for k include both the number of individual hash functions per table and (in parentheses) the dimension of the last of the k cross-polytopes.

Data set | Method | Query time (ms) | Speed-up vs HP | Best k  | Number of candidates | Hashing time (ms) | Distances time (ms)
NYT      | HP     | 120             | –              | 19      | 57,200               | 16                | 96
NYT      | CP     | 35              | 3.4×           | 2 (64)  | 17,900               | 3.0               | 30
pubmed   | HP     | 857             | –              | 20      | 1,481,000            | 36                | 762
pubmed   | CP     | 213             | 4.0×           | 2 (512) | 304,000              | 18                | 168
SIFT     | HP     | 3.7             | –              | 30      | 18,600               | 0.2               | 3.0
SIFT     | CP     | 3.1             | 1.2×           | 6 (1)   | 13,400               | 0.6               | 2.2

Note that in all of the following experiments, the speed-up of the multiprobe cross-polytope LSH compared to the multiprobe hyperplane LSH is less than 11×. Hence without our multiprobe addition, the cross-polytope LSH would be slower than the hyperplane LSH, for which a multiprobe scheme is already known [14].

Experiments on random data. Next, we show that the better time complexity of the cross-polytope LSH already applies for moderate values of n. In particular, we compare the cross-polytope LSH, combined with fast rotations (Section 3.1) and our multiprobe scheme, to a multiprobe hyperplane LSH on random data. We keep the dimension d = 128 and the distance to the nearest neighbor R = √2/2 fixed, and vary the size of the data set from 2^20 to 2^28. The number of hash tables L is set to 10. For 2^20 points, the cross-polytope LSH is already 3.5× faster than the hyperplane LSH, and for n = 2^28 the speed-up is 10.3× (see Table 3 in Appendix D). Compared to a linear scan, the speed-up achieved by the cross-polytope LSH ranges from 76× for n = 2^20 to about 700× for n = 2^28.

Experiments on real data. On the SIFT data set (n = 10^6 and d = 128), the cross-polytope LSH achieves a modest speed-up of 1.2× compared to the hyperplane LSH (see Table 1). On the other hand, the speed-up is 3–4× on the two tf-idf data sets, which is a significant improvement considering the relatively small size of the NYT data set (n ≈ 300,000). One important difference between the data sets is that the typical distance to the nearest neighbor is smaller in the SIFT data set, which can make the nearest neighbor problem easier (see Appendix D).
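For reference, the two hash families compared throughout this section can be sketched in a few lines: the hyperplane hash [3] takes the sign of a random projection, while the cross-polytope hash maps a randomly rotated point to its closest signed standard basis vector. The sketch below uses plain Gaussian matrices rather than the fast pseudo-random rotations of Section 3.1, so it illustrates the hash values, not the fast evaluation.

```python
import random

def hyperplane_hash(x, r):
    # One bit: which side of the random hyperplane with normal r the point lies on.
    return int(sum(ri * xi for ri, xi in zip(r, x)) >= 0)

def cross_polytope_hash(x, G):
    # Rotate x by the random Gaussian matrix G (a list of rows), then return
    # the closest signed standard basis vector as (index, sign); this gives
    # 2d possible hash values in dimension d.
    y = [sum(g * xi for g, xi in zip(row, x)) for row in G]
    j = max(range(len(y)), key=lambda i: abs(y[i]))
    return (j, 1 if y[j] >= 0 else -1)

random.seed(0)
d = 8
x = [random.gauss(0, 1) for _ in range(d)]
r = [random.gauss(0, 1) for _ in range(d)]
G = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

# Both hashes depend only on the direction of x, not on its norm:
assert hyperplane_hash(x, r) == hyperplane_hash([2 * xi for xi in x], r)
assert cross_polytope_hash(x, G) == cross_polytope_hash([2 * xi for xi in x], G)
```

The scale invariance checked at the end reflects that both families are designed for angular distance: a single hyperplane hash yields 2 buckets, whereas a single cross-polytope hash yields 2d buckets, which is why fewer cross-polytope hashes (smaller k) suffice per table.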
Since the tf-idf data sets are very high-dimensional but sparse (d ≈ 100,000), we use the feature hashing approach described in Section 3.1 in order to reduce the hashing time of the cross-polytope LSH (the standard hyperplane LSH already runs in time proportional to the sparsity of a vector). We use 1024 and 2048 as feature hashing dimensions for NYT and pubmed, respectively.

Acknowledgments

We thank Michael Kapralov for many valuable discussions during various stages of this work. We also thank Stefanie Jegelka and Rasmus Pagh for helpful conversations. This work was supported in part by the NSF and the Simons Foundation. Work done in part while the first author was at the Simons Institute for the Theory of Computing.

References

[1] Alexandr Andoni, Piotr Indyk, Huy L. Nguyen, and Ilya Razenshteyn. Beyond locality-sensitive hashing. In SODA, 2014. Full version at http://arxiv.org/abs/1306.1547.
[2] Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In STOC, 2015. Full version at http://arxiv.org/abs/1501.01062.
[3] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
[4] Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. MIT Press, Cambridge, MA, 2005.
[5] Hanan Samet. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006.
[6] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(14):321–350, 2012.
[7] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.
[8] Ludwig Schmidt, Matthew Sharifi, and Ignacio Lopez Moreno. Large-scale speaker identification. In ICASSP, 2014.
[9] Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, Todd Mostak, Piotr Indyk, Samuel Madden, and Pradeep Dubey. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. In VLDB, 2013.
[10] Moshe Dubiner. Bucketing coding and information theory for the statistical high-dimensional nearest-neighbor problem. IEEE Transactions on Information Theory, 56(8):4166–4179, 2010.
[11] Alexandr Andoni and Ilya Razenshteyn. Tight lower bounds for data-dependent locality-sensitive hashing, 2015. Available at http://arxiv.org/abs/1507.04299.
[12] Anshumali Shrivastava and Ping Li. Fast near neighbor search in high-dimensional binary data. In Machine Learning and Knowledge Discovery in Databases, pages 474–489. Springer, 2012.
[13] Anshumali Shrivastava and Ping Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, 2014.
[14] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In VLDB, 2007.
[15] Kengo Terasawa and Yuzuru Tanaka. Spherical LSH for approximate nearest neighbor search on unit hypersphere. In Algorithms and Data Structures, pages 27–38. Springer, 2007.
[16] Kave Eshghi and Shyamsundar Rajaram. Locality sensitive hash functions based on concomitant rank order statistics. In KDD, 2008.
[17] Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.
[18] Kilian Q. Weinberger, Anirban Dasgupta, John Langford, Alexander J. Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In ICML, 2009.
[19] Rajeev Motwani, Assaf Naor, and Rina Panigrahy. Lower bounds on locality sensitive hashing. SIAM Journal on Discrete Mathematics, 21(4):930–935, 2007.
[20] Ryan O'Donnell, Yi Wu, and Yuan Zhou. Optimal lower bounds for locality-sensitive hashing (except when q is tiny). ACM Transactions on Computation Theory, 6(1):5, 2014.
[21] Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. Fast locality-sensitive hashing. In KDD, 2011.
[22] Nir Ailon and Holger Rauhut. Fast and RIP-optimal transforms. Discrete & Computational Geometry, 52(4):780–798, 2014.
[23] Uriel Feige and Gideon Schechtman. On the optimality of the random hyperplane rounding technique for MAX CUT. Random Structures and Algorithms, 20(3):403–440, 2002.
[24] Malcolm Slaney, Yury Lifshits, and Junfeng He. Optimal parameters for locality-sensitive hashing. Proceedings of the IEEE, 100(9):2604–2623, 2012.
[25] Moshe Lichman. UCI machine learning repository, 2013.
[26] Persi Diaconis and David Freedman. A dozen de Finetti-style results in search of a theory. Annales de l'institut Henri Poincaré (B) Probabilités et Statistiques, 23(S2):397–423, 1987.