{"title": "An algorithm for L1 nearest neighbor search via monotonic embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 983, "page_last": 991, "abstract": "Fast algorithms for nearest neighbor (NN) search have in large part focused on L2 distance. Here we develop an approach for L1 distance that begins with an explicit and exact embedding of the points into L2. We show how this embedding can efficiently be combined with random projection methods for L2 NN search, such as locality-sensitive hashing or random projection trees. We rigorously establish the correctness of the methodology and show by experimentation that it is competitive in practice with available alternatives.", "full_text": "An algorithm for \u2113_1 nearest neighbor search via monotonic embedding\n\nXinan Wang\u2217\nUC San Diego\nxinan@ucsd.edu\n\nSanjoy Dasgupta\nUC San Diego\ndasgupta@cs.ucsd.edu\n\nAbstract\n\nFast algorithms for nearest neighbor (NN) search have in large part focused on \u2113_2\ndistance. Here we develop an approach for \u2113_1 distance that begins with an explicit\nand exactly distance-preserving embedding of the points into \u2113_2^2. We show how\nthis can efficiently be combined with random-projection based methods for \u2113_2 NN\nsearch, such as locality-sensitive hashing (LSH) or random projection trees. We\nrigorously establish the correctness of the methodology and show by experimentation\nusing LSH that it is competitive in practice with available alternatives.\n\n1 Introduction\n\nNearest neighbor (NN) search is a basic primitive of machine learning and statistics. Its utility in\npractice hinges on two critical issues: (1) picking the right distance function and (2) using algorithms\nthat find the nearest neighbor, or an approximation thereof, quickly.\nThe default distance function is very often Euclidean distance. 
This is a matter of convenience\nand can be partially justified by theory: a classical result of Stone [1] shows that k-nearest neighbor\nclassification is universally consistent in Euclidean space. This means that no matter what the\ndistribution of data and labels might be, as the number of samples n goes to infinity, the k_n-NN\nclassifier converges to the Bayes-optimal decision boundary, for any sequence (k_n) with k_n \u2192 \u221e\nand k_n/n \u2192 0. The downside is that the rate of convergence could be slow, leading to poor performance\non finite data sets. A more careful choice of distance function can help, by better separating\nthe different classes. For the well-known MNIST data set of handwritten digits, for instance, the 1-NN\nclassifier using Euclidean distance has an error rate of about 3%, whereas a more careful choice\nof distance function (tangent distance [2] or shape context [3], for instance) brings this below 1%.\nThe second impediment to nearest neighbor search in practice is that a naive search through n\ncandidate neighbors takes O(n) time, ignoring the dependence on dimension. A wide variety of\ningenious data structures have been developed to speed this up. The most popular of these fall into\ntwo categories: hashing-based and tree-based.\nPerhaps the best-known hashing approach is locality-sensitive hashing (LSH) [4, 5, 6, 7, 8, 9, 10].\nThese randomized data structures find approximate nearest neighbors with high probability, where\nc-approximate solutions are those that are at most c times as far as the nearest neighbor.\nWhereas hashing methods create a lattice-like spatial partition, tree methods [11, 12, 13, 14] create\na hierarchical partition that can also be used to speed up nearest neighbor search. 
There are families\nof randomized trees with strong guarantees on the tradeoff between query time and probability of\nfinding the exact nearest neighbor [15].\nThese hashing and tree methods for \u2113_2 distance both use the same primitive: random projection [16].\nFor data in R^d, they (repeatedly) choose a random direction u from the multivariate Gaussian\nN(0, I_d) and then project points x onto this direction: x \u21a6 u \u00b7 x. Such projections have many\nappealing mathematical properties that make it possible to give algorithmic guarantees, and that\nalso produce good performance in practice.\n\n\u2217 Supported by UC San Diego Jacobs Fellowship\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFor distance functions other than \u2113_2, there has been far less work. In this paper, we develop nearest\nneighbor methods for \u2113_1 distance. This is a more natural choice than \u2113_2 in many situations, for\ninstance when the data points are probability distributions: documents are often represented as distributions\nover topics, images as distributions over categories, and so on. Earlier works on \u2113_1 search\nare summarized below. We adopt a different approach, based on a novel embedding.\nOne basic fact is that \u2113_1 distance is not embeddable in \u2113_2 [17]. That is, given a set of points\nx_1, . . . , x_n \u2208 R^d, it is in general not possible to find corresponding points z_1, . . . , z_n \u2208 R^q such that\n\u2016x_i \u2212 x_j\u2016_1 = \u2016z_i \u2212 z_j\u2016_2. This can be seen even from the four points at the vertices of a square: any\nembedding of these into \u2113_2 induces a multiplicative distortion of at least \u221a2.\nInterestingly, however, the square root of \u2113_1 distance is embeddable in \u2113_2 [18]. And the nearest\nneighbor with respect to \u2113_1 distance is the same as the nearest neighbor with respect to \u2113_1^{1/2}. 
This\nobservation is the starting point of our approach. It suggests that we might be able to embed data\ninto \u2113_2 and then simply apply well-established methods for \u2113_2 nearest neighbor search. However,\nthere are numerous hurdles to overcome.\nFirst, the embeddability of \u2113_1^{1/2} into \u2113_2 is an existential, not algorithmic, fact. Indeed, all that is\nknown for the general case is that there exists such an embedding into Hilbert space. For the special\ncase of data in {0, 1, . . . , M}^d, earlier work has suggested a unary embedding into a Hamming\nspace {0, 1}^{Md} (where 0 \u2264 x \u2264 M gets mapped to x 1\u2019s followed by (M \u2212 x) 0\u2019s) [19], but this is\nwasteful of space and is inefficient for use by dimension reduction algorithms [16] when M is\nlarge. Our embedding is general and is more efficient.\nNow, given a finite point set x_1, . . . , x_n \u2208 R^d and the knowledge that an embedding exists, we\ncould use multidimensional scaling [20] to find such an embedding. But this takes O(n^3) time,\nwhich is often not viable. Instead, we exhibit an explicit embedding: we give an expression for\npoints z_1, . . . , z_n \u2208 R^{O(nd)} such that \u2016x_i \u2212 x_j\u2016_1 = \u2016z_i \u2212 z_j\u2016_2^2.\nThis brings us to the second hurdle. The explicit construction avoids infinite-dimensional space but\nis still much higher-dimensional than we would like. The space requirement for writing down the n\nembedded points is O(n^2 d), which is prohibitive in practice. To deal with this, we recall that the two\npopular schemes for \u2113_2 embedding described above are both based on Gaussian random projections,\nand in fact look at the data only through the lens of such projections. We show how to compute these\nprojections without ever constructing the O(nd)-dimensional embeddings explicitly.\nFinally, even if it is possible to efficiently build a data structure on the n points, how can queries\nbe incorporated? 
It turns out that if a query point is added to the original n points, our explicit\nembedding changes significantly. Nonetheless, by again exploiting properties of Gaussian random\nprojections, we show that it is possible to hold on to the random projections of the original n embedded\npoints and to set the projection of the query point so that the correct joint distribution is\nachieved. Moreover, this can be done very efficiently.\nWe also run a variety of experiments showing the good practical performance of this approach.\n\nRelated work\n\nThe k-d tree [11] is perhaps the prototypical tree-based method for nearest neighbor search, and can\nbe used for \u2113_1 distance. It builds a hierarchical partition of the data using coordinate-wise splits, and\nuses geometric reasoning to discard subtrees during NN search. Its query time can degenerate badly\nwith increasing dimension, as a result of which several variants have been developed, such as trees\nin which the cells are allowed to overlap slightly [21]. Various tree-based methods have also been\ndeveloped for general metrics, such as the metric tree and cover tree [14, 12].\nFor k-d tree variants, theoretical guarantees are available for exact \u2113_2 nearest neighbor search when\nthe split direction is chosen at random from a multivariate Gaussian [15]. For a data set of n points,\nthe tree has size O(n) and the query time is O(2^d log n), where d is the intrinsic dimension of the\ndata. Such analysis is not available for \u2113_1 distance.\nAlso in wide use is locality-sensitive hashing for approximate nearest neighbor search [22]. For a\ndata set of n points, this scheme builds a data structure of size O(n^{1+\u03c1}) and finds a c-approximate\nnearest neighbor in time O(n^\u03c1), for some \u03c1 > 0 that depends on c, on the specific distance function,\nand on the hash family. 
For \u2113_2 distance, it is known how to achieve \u03c1 \u2248 1/c^2 [23], although the\nscheme most commonly used in practice has \u03c1 \u2248 1/c [8]. This works by repeatedly using the\nfollowing hash function:\n\nh(x) = \u230a(v \u00b7 x + b)/R\u230b,\n\nwhere v is chosen at random from a multivariate Gaussian, R > 0 is a constant, and b is uniformly\ndistributed in [0, R). A similar scheme also works for \u2113_1, using Cauchy random projection: each\ncoordinate of v is picked independently from a standard Cauchy distribution. This achieves exponent\n\u03c1 \u2248 1/c, although one downside is the high variance of this distribution. Another LSH family\n[22, 10] uses a randomly shifted grid for \u2113_1 nearest neighbor search. But it is less used in practice,\ndue to its restrictions on data. For example, if the nearest neighbor is further away than the width of\nthe grid, it may never be found.\nBesides LSH, random projection is the basis for some other NN search algorithms [24, 25], classification\nmethods [26], and dimension reduction techniques [27, 28, 29].\nThere are several impediments to developing NN methods for \u2113_1 spaces. 1) There is no\nJohnson-Lindenstrauss type dimension reduction technique for \u2113_1 [30]. 2) The Cauchy random projection\ndoes not preserve the \u2113_1 distance as a norm, which restricts its usage for norm based algorithms [31].\n3) Useful random properties [26] cannot be formulated exactly; only approximations exist. Fortunately,\nall three of these problems are absent in \u2113_2 space, which motivates developing efficient embedding\nalgorithms from \u2113_1 to \u2113_2.\n\n2 Explicit embedding\n\nWe begin with an explicit isometric embedding from \u2113_1 to \u2113_2^2 for 1-dimensional data. This extends\nimmediately to multiple dimensions because both \u2113_1 and \u2113_2^2 distance are coordinatewise additive.\n\n2.1 The 1-dimensional case\nFirst, sort the points x1, . . . 
, xn \u2208 R so that x1 \u2264 x2 \u2264 \u00b7\u00b7\u00b7 \u2264 xn. Then, construct the embedding\n\u03c6(x1), \u03c6(x2), . . . , \u03c6(xn) \u2208 R^{n\u22121} as follows:\n\n\u03c6(x_1) = (0, 0, 0, . . . , 0)\n\u03c6(x_2) = (\u221a(x_2 \u2212 x_1), 0, 0, . . . , 0)\n\u03c6(x_3) = (\u221a(x_2 \u2212 x_1), \u221a(x_3 \u2212 x_2), 0, . . . , 0)\n. . .\n\u03c6(x_n) = (\u221a(x_2 \u2212 x_1), \u221a(x_3 \u2212 x_2), \u221a(x_4 \u2212 x_3), . . . , \u221a(x_n \u2212 x_{n\u22121}))    (1)\n\nFor any 1 \u2264 i < j \u2264 n, \u03c6(xi) and \u03c6(xj) agree on all coordinates except i to (j \u2212 1). Therefore,\n\n\u2016\u03c6(x_i) \u2212 \u03c6(x_j)\u2016_2 = ( \u03a3_{k=i+1}^{j} (\u221a(x_k \u2212 x_{k\u22121}))^2 )^{1/2} = ( \u03a3_{k=i+1}^{j} (x_k \u2212 x_{k\u22121}) )^{1/2} = |x_j \u2212 x_i|^{1/2},    (2)\n\nso the embedding preserves the \u2113_1^{1/2} distance between these points. Since the construction places no\nrestrictions on the range of x1, x2, . . . , xn, it is applicable to any finite set of points.\n\n2.2 Extension to multiple dimensions\n\nWe construct an embedding of d-dimensional points by stacking 1-dimensional embeddings.\nConsider points x1, x2, . . . , xn \u2208 R^d.\nSuppose we have a collection of embedding maps\n\u03c6_1, \u03c6_2, . . . , \u03c6_d, one per dimension. 
Each of the embeddings is constructed from the values on a\nsingle coordinate: if we let x_i^{(j)} denote the j-th coordinate of x_i, for 1 \u2264 j \u2264 d, then embedding \u03c6_j\nis based on x_1^{(j)}, x_2^{(j)}, . . . , x_n^{(j)} \u2208 R. The overall embedding is the concatenation\n\n\u03c6(x_i) = ( \u03c6_1^\u03c4(x_i^{(1)}), \u03c6_2^\u03c4(x_i^{(2)}), . . . , \u03c6_d^\u03c4(x_i^{(d)}) )^\u03c4 \u2208 R^{d(n\u22121)}    (3)\n\nwhere 1 \u2264 i \u2264 n, and \u03c4 denotes transpose. For any 1 \u2264 i < j \u2264 n,\n\n\u2016\u03c6(x_i) \u2212 \u03c6(x_j)\u2016_2 = ( \u03a3_{k=1}^{d} \u2016\u03c6_k(x_i^{(k)}) \u2212 \u03c6_k(x_j^{(k)})\u2016_2^2 )^{1/2}    (4)\n= ( \u03a3_{k=1}^{d} |x_i^{(k)} \u2212 x_j^{(k)}| )^{1/2}    (5)\n= \u2016x_i \u2212 x_j\u2016_1^{1/2}    (6)\n\nIt may be of independent interest to consider the properties of this explicit embedding. We can\nrepresent it by a matrix of n columns with one embedded point per column. The rank of this\nmatrix (and, therefore, the dimensionality of the embedded points) turns out to be O(n). But we\ncan show that the \u201ceffective rank\u201d [32] of the centered matrix is just O(d log n); see Appendix B.\n\n3 Incorporating a query\n\nOnce again, we begin with the 1-dimensional case and then extend to higher dimension.\n\n3.1 The 1-dimensional case\nFor nearest neighbor search, we need a joint embedding of the data points S = {x_1, x_2, . . . , x_n}\nwith the subsequent query point q. In fact, we need to embed S first and then incorporate q later, but\nthis is non-trivial since adding q changes the explicit embedding of other points.\nWe start with an example. 
Again, assume x_1 \u2264 x_2 \u2264 \u00b7\u00b7\u00b7 \u2264 x_n.\nExample 1. Suppose query q has x_2 \u2264 q < x_3. Adding q to the original n points changes the\nembedding \u03c6(\u00b7) \u2208 R^{n\u22121} of Eq. 1 to a new embedding \u03c6\u0304(\u00b7) \u2208 R^n. Notice that the dimension increases by one.\n\n\u03c6\u0304(x_1) = (0, 0, 0, 0, . . . , 0)^\u03c4\n\u03c6\u0304(x_2) = (\u221a(x_2 \u2212 x_1), 0, 0, 0, . . . , 0)^\u03c4\n\u03c6\u0304(x_3) = (\u221a(x_2 \u2212 x_1), \u221a(q \u2212 x_2), \u221a(x_3 \u2212 q), 0, . . . , 0)^\u03c4\n. . .\n\u03c6\u0304(x_n) = (\u221a(x_2 \u2212 x_1), \u221a(q \u2212 x_2), \u221a(x_3 \u2212 q), \u221a(x_4 \u2212 x_3), . . . , \u221a(x_n \u2212 x_{n\u22121}))^\u03c4\n\nThe query point is mapped to \u03c6\u0304(q) = (\u221a(x_2 \u2212 x_1), \u221a(q \u2212 x_2), 0, . . . , 0)^\u03c4 \u2208 R^n.\n\nFrom the example above, it is clear what happens when q lies between some x_i and x_{i+1}. There are\nalso two \u201ccorner cases\u201d that can occur: q < x_1 and q \u2265 x_n. Fortunately, the embedding of S is\nalmost unchanged for the corner cases: \u03c6\u0304(x_i) = (\u03c6^\u03c4(x_i), 0)^\u03c4 \u2208 R^n, appending a zero at the end.\nFor q < x_1, the query is mapped to \u03c6\u0304(q) = (0, . . . , 0, \u221a(x_1 \u2212 q))^\u03c4 \u2208 R^n; for q \u2265 x_n, the query is\nmapped to \u03c6\u0304(q) = (\u221a(x_2 \u2212 x_1), \u221a(x_3 \u2212 x_2), . . . , \u221a(x_n \u2212 x_{n\u22121}), \u221a(q \u2212 x_n))^\u03c4 \u2208 R^n.\n\n3.2 Random projection for the 1-dimensional case\n\nWe would like to generate Gaussian random projections of the \u2113_2 embeddings of the data points. In\nthis subsection, we mainly focus on the typical case when the query q lies between two data points,\nand we leave the treatment of the (simpler) corner cases to Alg. 1. The notation follows section 3.1,\nand we assume the x_i are arranged in increasing order for i = 1, 2, . . . , n.\nSetting 1. The query lies between two data points: x_\u03b1 \u2264 q < x_{\u03b1+1} for some 1 \u2264 \u03b1 \u2264 n \u2212 1.\nWe will consider two methods for randomly projecting the embedding of S \u222a {q} and show that they\nyield exactly the same joint distribution.\nThe first method applies Gaussian random projection to the embedding \u03c6\u0304 of S \u222a {q}. Sample a\nmultivariate Gaussian vector v from N(0, I_n). For any x \u2208 S \u222a {q}, the projection is\n\np_g(x) := v \u00b7 \u03c6\u0304(x)    (7)\n\nThis is exactly the projection we want. However, it requires both S and q, whereas in practice, we\nwill initially have to project just S by itself, and we will only later be given some (arbitrary) q.\nThe second method starts by projecting the explicitly embedded points S. Later, it receives query\nq and finds a suitable projection for it as well. So, we begin by sampling a multivariate Gaussian\nvector u from N(0, I_{n\u22121}), and for any x \u2208 S, use the projection\n\np_e(x) := u \u00b7 \u03c6(x)    (8)\n\nwhere the subindex e stands for embedding. Conditioned on the value (p_e(x_{\u03b1+1}) \u2212 p_e(x_\u03b1)), namely\n\u221a(x_{\u03b1+1} \u2212 x_\u03b1) \u00b7 u^{(\u03b1)}, the projection of a subsequent query q is taken to be\n\np_e(q) = p_e(x_\u03b1) + \u0394,    \u0394 \u223c N( \u03c3_1^2 \u00b7 (p_e(x_{\u03b1+1}) \u2212 p_e(x_\u03b1)) / (\u03c3_1^2 + \u03c3_2^2),  \u03c3_1^2 \u03c3_2^2 / (\u03c3_1^2 + \u03c3_2^2) )    (9)\n\nwhere \u03c3_1^2 = q \u2212 x_\u03b1, \u03c3_2^2 = x_{\u03b1+1} \u2212 q.\nTheorem 1. Fix any x_1, . . . , x_n, q \u2208 R satisfying Setting 1. Consider the joint distribution of\n[p_g(x_1), p_g(x_2), . . . , p_g(x_n), p_g(q)] induced by a random choice of v (as per Eq. 7), and the joint\ndistribution of [p_e(x_1), p_e(x_2), . . . , p_e(x_n), p_e(q)] induced by a random choice of u and \u0394 (as\nper Eqs. 8 and 9). These distributions are identical.\n\nThe details are in Appendix A: briefly, we show that both joint distributions are multivariate Gaussians,\nand that they have the same mean and covariance.\nWe highlight the advantages of our method. First, projecting the data set using Eq. 8 does not require\nadvance knowledge of the query, which is crucial for nearest neighbor search; second, generating\nthe projection for the 1-dimensional query takes O(log n) time, which makes this method efficient.\nWe describe the 1-dimensional algorithm in Alg. 1, where we assume that a permutation that sorts\nthe points, denoted \u03a0, is provided, along with the location of q within this ordering, denoted \u03b1. We\nwill resolve this later in Alg. 2.\n\n3.3 Random projection for the higher dimensional case\n\nWe will henceforth use ERP (Euclidean random projection) to denote our overall scheme consisting\nof embedding \u2113_1 into \u2113_2^2, followed by random Gaussian projection (Alg. 2). A competitor scheme,\nas described earlier, applies Cauchy random projection directly in the \u2113_1 space; we refer to this as\nCRP. 
The time and space costs for ERP are shown in Table 1, if we generate k projections for n data\npoints and m queries in R^d. The costs scale linearly in d, since the constructions and computation\nare dimension by dimension. A detailed analysis follows.\nPreprocessing: This involves sorting the points along each coordinate separately and storing the\nresulting permutations \u03a0_1, . . . , \u03a0_d. The time and space costs are acceptable, because reading or\nstoring the data takes as much as O(nd).\nProject data: The time taken by ERP to project the n points is comparable to that of CRP. But\nERP requires a factor O(n) more space, compared to O(kd) for CRP, because it needs to store the\nprojections of each of the individual coordinates of the data points.\nProject query: ERP methods are efficient for query answering. The projection is calculated directly\nin the original d-dimensional space. The log n overhead comes from using binary search, coordinatewise,\nto place the query within the ordering of the data points. Once these ranks are obtained,\nthey can be reused for as many projections as needed.\n\nAlgorithm 1 Random projection (1-dimensional case)\n\nfunction project-data(S, \u03a0)\ninput:\n- data set S = (x_i : 1 \u2264 i \u2264 n)\n- sorted indices \u03a0 = (\u03c0_i : 1 \u2264 i \u2264 n) such that x_{\u03c0_1} \u2264 x_{\u03c0_2} \u2264 . . . \u2264 x_{\u03c0_n}\noutput:\n- projections P = (p_i : 1 \u2264 i \u2264 n) for S\np_{\u03c0_1} \u2190 0\nfor i = 2, 3, . . . , n do\n  u_i \u2190 N(0, 1)\n  p_{\u03c0_i} \u2190 p_{\u03c0_{i\u22121}} + u_i \u00b7 \u221a(x_{\u03c0_i} \u2212 x_{\u03c0_{i\u22121}})\nend for\nreturn P\n\nfunction project-query(q, \u03b1, S, \u03a0, P)\ninput:\n- query q and its rank \u03b1 in data set S\n- sorted indices \u03a0 of S\n- projections P of S\noutput:\n- projection p_q for q\ncase: 1 \u2264 \u03b1 \u2264 n \u2212 1\n  \u03c3_1^2 \u2190 q \u2212 x_{\u03c0_\u03b1}\n  \u03c3_2^2 \u2190 x_{\u03c0_{\u03b1+1}} \u2212 q\n  \u0394 \u2190 N( \u03c3_1^2 \u00b7 (p_{\u03c0_{\u03b1+1}} \u2212 p_{\u03c0_\u03b1}) / (\u03c3_1^2 + \u03c3_2^2),  \u03c3_1^2 \u03c3_2^2 / (\u03c3_1^2 + \u03c3_2^2) )\n  p_q \u2190 p_{\u03c0_\u03b1} + \u0394\ncase: \u03b1 = 0\n  r \u2190 N(0, 1),  p_q \u2190 r \u00b7 \u221a(x_{\u03c0_1} \u2212 q)\ncase: \u03b1 = n\n  r \u2190 N(0, 1),  p_q \u2190 p_{\u03c0_n} + r \u00b7 \u221a(q \u2212 x_{\u03c0_n})\nreturn p_q\n\nTable 1: Efficiency of ERP algorithm: Generate k projections for n data points and m queries in R^d.\n\n           | Preprocessing | Project data | Project query\nTime cost  | O(dn log n)   | O(knd)       | O(md(k + log n))\nSpace cost | O(dn)         | O(knd)       | NA\n\n4 Experiment\n\nIn this section, we demonstrate that ERP can be directly used by existing NN search algorithms,\nsuch as LSH, for efficient \u2113_1 NN search. We choose commonly used data sets for image retrieval\nand text classification. Besides our method, we also implement the metric tree (a popular tree-type\ndata structure) and Cauchy LSH for comparison.\nData sets When data points represent distributions, \u2113_1 distance is natural. We use four such data\nsets. 1) Corel uci [21], available at [33], contains 68,040 histograms (32-dimension) for color images\nfrom Corel image collections; 2) Corel hist [34, 21], processed by [21], contains 19,797 histograms\n(64-dimension, non-zero dimension is 44) for color images from Corel Stock Library; 3)\nCade [35], is a collection of documents from Brazilian web pages. 
Topics are extracted using the latent\nDirichlet allocation algorithm [36]. We use 13,016 documents with distributions over the 120 topics\n(120-dimension); 4) We download about 35,000 images from ImageNet [37], and process each of\nthem into a probabilistic distribution over 1,000 classes using a trained convolutional neural network\n[38]. Furthermore, we collapse the distribution into a 100-dimension representation, summing the\nprobability mass of each 10 consecutive classes. This reduces the training and testing time.\nIn each data set, we remove duplicates. For either parameter optimization or testing, we randomly\nseparate out 10% of the data as queries such that the query-to-data ratio is 1 : 9.\nPerformance evaluation We evaluate performance using query cost. For linear scan or metric\ntree, this is the average number of points accessed when answering a query. For LSH, we also need\nto add the overhead of evaluating the LSH functions.\nThe scheme [8, 39] of LSH is summarized as follows. Given three parameters k, L and R (k, L\nare positive integers, k is even, R is a positive real), the LSH algorithm uses k-tuple hash functions\nof the form g(x) = (h_1(x), h_2(x), . . . , h_k(x)) to distribute data or queries to their bins. L is the\ntotal number of such g-functions. The h-functions are of the form h(x) = \u230a(v \u00b7 x + b)/R\u230b, each\neither explicitly or implicitly associated with a random vector v and a uniformly distributed variable\nb \u2208 [0, R). As suggested in [39], we implement the reuse of h-functions so that only (k/2 \u00b7 \u221a(2L)) of\nthem are actually evaluated. For ERP-LSH, there is an additional overhead of log n due to the use\nof binary search. We summarize these costs in Table 2; for conciseness, we have removed the linear\ndependence on d in both the retrieval cost and the overhead.\n\nAlgorithm 2 Overall algorithm for Random projection, in context of NN search\n\nStarting information:\n- data set S = {x_i : 1 \u2264 i \u2264 n} \u2282 R^d\nSubsequent arrival:\n- query q \u2208 R^d\n\npreprocessing:\nSort data along each dimension:\nfor j \u2208 {1, . . . , d} do\n  S_j = {x_i^{(j)} : 1 \u2264 i \u2264 n}\n  \u03a0_j \u2190 index-sort(S_j), where \u03a0_j = {\u03c0_{ji} : 1 \u2264 i \u2264 n} satisfying x^{(j)}_{\u03c0_{j1}} \u2264 x^{(j)}_{\u03c0_{j2}} \u2264 \u00b7\u00b7\u00b7 \u2264 x^{(j)}_{\u03c0_{jn}}\nend for\nsave \u03a0 = (\u03a0_1, \u03a0_2, . . . , \u03a0_d)\n\nproject data:\nfor j = 1, 2, . . . , d do\n  P_j \u2190 project-data(S_j, \u03a0_j) where P_j = {p_{ji} : 1 \u2264 i \u2264 n}\nend for\nsave P = (P_1, P_2, . . . , P_d)\nprojection of x_i \u2208 S is \u03a3_{j=1}^{d} p_{ji}\n\nproject query:\nfor j = 1, 2, . . . , d do\n  \u03b1_j \u2190 binary-search(q^{(j)}, S_j, \u03a0_j) satisfying x^{(j)}_{\u03c0_{j\u03b1_j}} \u2264 q^{(j)} < x^{(j)}_{\u03c0_{j(\u03b1_j+1)}}\nend for\nsave rank \u03b1 for use in multiple projections\np_q \u2190 0\nfor j = 1, 2, . . . , d do\n  p_q \u2190 p_q + project-query(q^{(j)}, \u03b1_j, S_j, \u03a0_j, P_j)\nend for\nprojection for q is p_q\n\nTable 2: Performance evaluation: Query cost = T_r + T_o.\n\n                           | Retrieval cost: T_r | Overhead: T_o\nLinear Scan or Metric Tree | # Accessed points   | 0\nCRP-LSH                    | # Accessed points   | k/2 \u00b7 \u221a(2L)\nERP-LSH                    | # Accessed points   | k/2 \u00b7 \u221a(2L) + log n\n\nImplementations The linear scan and the metric tree are for exact NN search. We use the code\n[40] for metric tree. For LSH, public code is available only for \u2113_2 NN search. We implement the LSH\nscheme, referring to the manual [39]. 
In particular, we implement the reuse of the h-functions, such\nthat the number of actually evaluated h-functions is (k/2 \u00b7 \u221a(2L)), in contrast to (k \u00b7 L).\nWe choose approximation factor c = 1.5 (the results turn out to be much closer to true NN), and\nset the success rate to be 0.9, which means that the algorithm should report c-approximate NN\nsuccessfully for at least 90% of the queries. Taking the parameter suggestions [8] into account, we\nchoose R for CRP-LSH from d_NN \u00d7 {1, 5, 10, 50, 100}; we choose R for ERP-LSH from d\u2032_NN \u00d7\n{1, 2, 3, 4}, where d_NN = (1/|Q|) \u03a3_{q\u2208Q} \u2016q \u2212 x_NN(q)\u2016_1 is the average \u2113_1 NN distance; d\u2032_NN =\n(1/|Q|) \u03a3_{q\u2208Q} \u221a(\u2016q \u2212 x_NN(q)\u2016_1) is the average \u2113_1^{1/2} NN distance. The term d_NN or d\u2032_NN normalizes\nthe average NN distance to 1 for LSH. Fixing R, we optimize k and L in the following range:\nk \u2208 {2, 4, . . . , 30}, L \u2208 {1, 2, . . . , 40}.\nResults Both CRP-LSH and ERP-LSH achieve competitive efficiency relative to the other two methods.\nWe list the test results in Table 3, and put parameters in Table 4 in Appendix C.\n\nTable 3: Average query cost and average approximation rate if applicable (in parentheses).\n\n            | Corel uci (d = 32) | Corel hist (d = 44) | Cade (d = 120)  | ImageNet (d = 100)\nLinear scan | 61220              | 17809               | 11715           | 31458\nMetric tree | 2575               | 718                 | 9184            | 12375\nCRP-LSH     | 329 \u00b1 55 (1.07)    | 245 \u00b1 43 (1.05)     | 292 \u00b1 11 (1.11) | 548 \u00b1 66 (1.09)\nERP-LSH     | 330 \u00b1 18 (1.11)    | 250 \u00b1 15 (1.08)     | 218 \u00b1 8 (1.15)  | 346 \u00b1 15 (1.13)\n\n5 Conclusion\n\nIn this paper, we have proposed an explicit embedding from \u2113_1 to \u2113_2^2, and we have found an algorithm\nto generate the random projections, reducing the time dependence of n from O(n) to O(log n). 
In\naddition, we have observed that the effective rank of the (centered) embedding is as low as O(d ln n),\ncompared to its rank O(n). Algorithms remain to be explored, in order to take advantage of such a\nlow rank.\nOur current method takes space O(ndm) to store the parameters of the random vectors, where m is\nthe number of hash functions. We have implemented one empirical scheme [39] to reuse the hashing\nfunctions. Other reuse schemes remain to be developed.\n\nAcknowledgement\n\nThe authors are grateful to the National Science Foundation for support under grant IIS-1162581.\n\nReferences\n[1] C. J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5:595\u2013620, 1977.\n[2] P. Y. Simard, Y. A. LeCun, J. S. Denker, and B. Victorri. Transformation invariance in pattern recognition: Tangent distance and tangent propagation. In Neural networks: Tricks of the trade, volume 1524, pages 239\u2013274. Springer-Verlag, New York, 1998.\n[3] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell., 24(4):509\u2013522, 2002.\n[4] A. Broder. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences, pages 21\u201329, 1997.\n[5] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604\u2013613, 1998.\n[6] A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60:630\u2013659, 2000.\n[7] M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380\u2013388, 2002.\n[8] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SoCG, pages 253\u2013262, 2004.\n[9] A. Shrivastava and P. Li. 
Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In NIPS, pages 2321\u20132329, 2014.\n[10] A. Andoni and P. Indyk. Efficient algorithms for substring near neighbor problem. In SODA, pages 1203\u20131212, 2006.\n[11] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509\u2013517, 1975.\n[12] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML, pages 97\u2013104, 2006.\n[13] S. M. Omohundro. Bumptrees for efficient function, constraint, and classification learning. In NIPS, volume 40, pages 175\u2013179, 1991.\n[14] J. K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information processing letters, 40:175\u2013179, 1991.\n[15] S. Dasgupta and K. Sinha. Randomized partition trees for nearest neighbor search. Algorithmica, 72:237\u2013263, 2015.\n[16] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemporary Math, 26:189\u2013206, 1984.\n[17] J. H. Wells and L. R. Williams. Embeddings and extensions in analysis, volume 84. Springer-Verlag, New York, 1975.\n[18] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Trans. Pattern Anal. Mach. Intell., 34:480\u2013492, 2012.\n[19] N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some of its algorithmic applications. In FOCS, pages 577\u2013591, 1994.\n[20] I. Borg and P. Groenen. Modern multidimensional scaling: Theory and applications. Springer-Verlag, Berlin, 1997.\n[21] T. Liu, A. Moore, A. Gray, and K. Yang. An investigation of practical approximate nearest neighbor algorithms. In NIPS, pages 825\u2013832, 2004.\n[22] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. 
In Communications of the ACM, volume 51, pages 117\u2013122, 2008.\n[23] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, pages 459\u2013468, 2006.\n[24] J. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. In STOC, pages 599\u2013608, 1997.\n[25] N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39:302\u2013322, 2009.\n[26] P. Li, G. Samorodnitsky, and J. Hopcroft. Sign Cauchy projections and chi-square kernel. In NIPS, pages 2571\u20132579, 2013.\n[27] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22:60\u201365, 2003.\n[28] D. Achlioptas. Database-friendly random projections. In Proceedings of the Symposium on Principles of Database Systems, pages 274\u2013281, 2001.\n[29] R. I. Arriaga and S. Vempala. An algorithmic theory of learning: Robust concepts and random projection. In FOCS, pages 616\u2013623, 1999.\n[30] M. Charikar and A. Sahai. Dimension reduction in the L1 norm. In FOCS, pages 551\u2013560, 2002.\n[31] P. Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM, 53(3):307\u2013323, 2006.\n[32] M. Rudelson and R. Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM, 54(4):21, 2007.\n[33] https://archive.ics.uci.edu/ml/datasets/Corel+Image+Features.\n[34] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518\u2013529, 1999.\n[35] A. Cardoso-Cachopo. Improving Methods for Single-label Text Categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa. 
Data available at http://ana.cachopo.org/datasets-for-single-label-text-categorization, 2007.\n[36] http://www.cs.columbia.edu/~blei/topicmodeling_software.html.\n[37] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248\u2013255. IEEE, 2009.\n[38] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.\n[39] A. Andoni and P. Indyk. E2LSH 0.1 user manual. Technical report, 2005.\n[40] J. K. Uhlmann. Implementing metric trees to satisfy general proximity/similarity queries. Manuscript, 1991.\n", "award": [], "sourceid": 582, "authors": [{"given_name": "Xinan", "family_name": "Wang", "institution": "UCSD"}, {"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": "UC San Diego"}]}