{"title": "An Investigation of Practical Approximate Nearest Neighbor Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 825, "page_last": 832, "abstract": null, "full_text": "      An Investigation of Practical Approximate\n                   Nearest Neighbor Algorithms\n\n\n\n              Ting Liu, Andrew W. Moore, Alexander Gray and Ke Yang\n                                 School of Computer Science\n                                 Carnegie-Mellon University\n                                  Pittsburgh, PA 15213 USA\n                     {tingliu, awm, agray, yangke}@cs.cmu.edu\n\n                                          Abstract\n\n         This paper concerns approximate nearest neighbor searching algorithms,\n         which have become increasingly important, especially in high dimen-\n         sional perception areas such as computer vision, with dozens of publica-\n         tions in recent years. Much of this enthusiasm is due to a successful new\n         approximate nearest neighbor approach called Locality Sensitive Hash-\n         ing (LSH). In this paper we ask the question: can earlier spatial data\n         structure approaches to exact nearest neighbor, such as metric trees, be\n         altered to provide approximate answers to proximity queries and if so,\n         how? We introduce a new kind of metric tree that allows overlap: certain\n         datapoints may appear in both the children of a parent. We also intro-\n         duce new approximate k-NN search algorithms on this structure. We\n         show why these structures should be able to exploit the same random-\n         projection-based approximations that LSH enjoys, but with a simpler al-\n         gorithm and perhaps with greater efficiency. We then provide a detailed\n         empirical evaluation on five large, high dimensional datasets which show\n         up to 31-fold accelerations over LSH. 
This result holds true throughout\n         the spectrum of approximation levels.\n\n\n1    Introduction\n\nThe k-nearest-neighbor searching problem is to find the k nearest points in a dataset X ⊂\nR^D containing n points to a query point q ∈ R^D, usually under the Euclidean distance. It\nhas applications in a wide range of real-world settings, in particular pattern recognition,\nmachine learning [7] and database querying [11]. Several effective methods exist for this\nproblem when the dimension D is small (e.g. 1 or 2), such as Voronoi diagrams [26],\nor when the dimension is moderate (e.g. up to the 10's), such as kd-trees [8] and metric\ntrees. Metric trees [29], or ball-trees [24], so far represent the practical state of the art for\nachieving efficiency in the largest dimensionalities possible [22, 6]. However, many real-\nworld problems are posed with very large dimensionalities which are beyond the capability\nof such search structures to achieve sub-linear efficiency, for example in computer vision,\nin which each pixel of an image represents a dimension. Thus, the high-dimensional case\nis the long-standing frontier of the nearest-neighbor problem.\n\nApproximate searching. One approach to dealing with this apparent intractability has\nbeen to define a different problem, the (1 + ε) approximate k-nearest-neighbor searching\nproblem, which returns points whose distance from the query is no more than (1 + ε) times\nthe distance of the true kth nearest-neighbor. Further, the problem is often relaxed to only\ndo this with high probability, and without a certificate property telling the user when it\nhas failed to do so, nor any guarantee on the actual rank of the distance of the points\nreturned, which may be arbitrarily far from k [4]. Another commonly used modification to\nthe problem is to perform the search under the L1 norm rather than L2.\n\nLocality Sensitive Hashing. 
Several methods of this general nature have been proposed\n[17, 18, 12], and locality-sensitive hashing (LSH) [12] has received considerable recent\nattention because it was shown that its runtime is independent of the dimension D and\nhas been put forth as a practical tool [9]. Roughly speaking, a locality sensitive hashing\nfunction has the property that if two points are \"close,\" then they hash to the same bucket\nwith \"high\" probability; if they are \"far apart,\" then they hash to the same bucket with \"low\"\nprobability. Formally, a function family H = {h : S → U} is (r1, r2, p1, p2)-sensitive, where\nr1 < r2, p1 > p2, for distance function D if for any two points p, q ∈ S, the following\nproperties hold:\n        1. if p ∈ B(q, r1), then Pr_{h∈H}[h(q) = h(p)] ≥ p1, and\n        2. if p ∉ B(q, r2), then Pr_{h∈H}[h(q) = h(p)] ≤ p2,\nwhere B(q, r) denotes a hypersphere of radius r centered at q. By defining an LSH scheme,\nnamely a (r, r(1 + ε), p1, p2)-sensitive hash family, the (1 + ε)-NN problem can be solved\nby performing a series of hashing and searching within the buckets. See [12, 13] for details.\n\nApplications such as computer vision, e.g. [23, 28], have found (1 + ε) approximation to\nbe useful, for example when the k-nearest-neighbor search is just one component in a large\nsystem with many parts, each of which can be highly inaccurate. In this paper we explore\nthe extent to which the most successful exact search structures can be adapted to perform\n(1 + ε) approximate high-dimensional searches. A notable previous approach along this\nline is a simple modification of kd-trees [3]; ours takes the more powerful metric trees as\na starting point. We next review metric trees, then introduce a variant, known as spill trees.\n\n\n2      Metric Trees and Spill Trees\n\n2.1    Metric Trees\nThe metric tree [29, 25, 5] is a data structure that supports efficient nearest neighbor search.\nA metric tree organizes a set of points in a spatial hierarchical manner. 
It is a\nbinary tree whose nodes each represent a set of points. The root node represents all points, and\nthe points represented by an internal node v are partitioned into two subsets, represented by\nits two children. Formally, if we use N(v) to denote the set of points represented by node\nv, and use v.lc and v.rc to denote the left child and the right child of node v, then we have\n                                     N(v)     = N(v.lc) ∪ N(v.rc)                                        (1)\n                                         ∅    = N(v.lc) ∩ N(v.rc)                                        (2)\n\nfor all the non-leaf nodes. At the lowest level, each leaf node contains very few points.\n\nPartitioning. The key to building a metric-tree is how to partition a node v. A typical\nway is as follows. We first choose two pivot points from N(v), denoted as v.lpv and v.rpv.\nIdeally, v.lpv and v.rpv are chosen so that the distance between them is the largest of all-\npair distances within N(v). More specifically, ||v.lpv - v.rpv|| = max_{p1,p2 ∈ N(v)} ||p1 - p2||.\nHowever, it takes O(n^2) time to find the optimal v.lpv and v.rpv. In practice, we resort to a\nlinear-time heuristic that is still able to find reasonable pivot points.1 After v.lpv and v.rpv\nare found, we can go ahead to partition node v.\n\nHere is one possible strategy for partitioning. We first project all the points onto the\nvector u = v.rpv - v.lpv, and then find the median point A along u. Next, we assign all\nthe points projected to the left of A to v.lc, and all the points projected to the right of A\nto v.rc. We use L to denote the (d-1)-dimensional plane that is orthogonal to u and goes through\nA. It is known as the decision boundary since all points to the left of L belong to v.lc\n\n     1 Basically, we first randomly pick a point p from v. Then we search for the point that is the\nfarthest from p and set it to be v.lpv. 
Next we find a third point that is farthest from v.lpv and set it as v.rpv.\n\n  Figure 1: partitioning in a metric tree.           Figure 2: partitioning in a spill tree.\n\nand all points to the right of L belong to v.rc (see Figure 1). By using a median point to\nsplit the datapoints, we can ensure that the depth of a metric-tree is log n. However, in our\nimplementation, we use a mid point (i.e. the point at (v.lpv + v.rpv)/2) instead, since it is\nmore efficient to compute, and in practice, we can still have a metric tree of depth O(log n).\n\nEach node v also has a hypersphere B, such that all points represented by v fall in the ball\ncentered at v.center with radius v.r, i.e. we have N(v) ⊆ B(v.center, v.r). Notice that the\nballs of the two children nodes are not necessarily disjoint.\n\nSearching. A search on a metric-tree is simply a guided DFS (for simplicity, we assume\nthat k = 1). The decision boundary L is used to decide which child node to search first. If\nthe query q is on the left of L, then v.lc is searched first, otherwise, v.rc is searched first. At\nall times, the algorithm maintains a \"candidate NN\", which is the nearest neighbor it finds\nso far while traversing the tree. We call this point x, and denote the distance between q and\nx by r. If DFS is about to explore a node v, but discovers that no member of v can be within\ndistance r of q, then it prunes this node (i.e., it skips searching this node, along with all its\ndescendants). This happens whenever ||v.center - q|| - v.r ≥ r. We call this DFS search\nalgorithm MT-DFS hereafter.\n\nIn practice, the MT-DFS algorithm is very efficient for NN search, and particularly when\nthe dimension of a dataset is low (say, less than 30). Typically for MT-DFS, we observe\nan order of magnitude speed-up over naive linear scan and other popular data structures\nsuch as SR-trees. 
However, MT-DFS starts to slow down as the dimension of the datasets\nincreases. We have found that in practice, metric tree search typically finds a very good NN\ncandidate quickly, and then spends up to 95% of the time verifying that it is in fact the true\nNN. This motivated our new proposed structure, the spill-tree, which is designed to avoid\nthe cost of exact NN verification.\n\n2.2    Spill-Trees\n\nA spill-tree (sp-tree) is a variant of metric-trees in which the children of a node can \"spill\nover\" onto each other, and contain shared datapoints. The partition procedure of a metric-\ntree implies that point-sets of v.lc and v.rc are disjoint: these two sets are separated by the\ndecision boundary L. In a sp-tree, we change the splitting criteria to allow overlaps between\ntwo children. In other words, some datapoints may belong to both v.lc and v.rc.\n\nWe first explain how to split an internal node v. See Figure 2 as an example. Like a metric-\ntree, we first choose two pivots v.lpv and v.rpv, and find the decision boundary L that goes\nthrough the mid point A. Next, we define two new separating planes, LL and LR, both of\nwhich are parallel to L and at distance τ from L. Then, all the points to the right of plane\nLL belong to the child v.rc, and all the points to the left of plane LR belong to the child v.lc.\nMathematically, we have\n\n                      N(v.lc)    = {x | x ∈ N(v), d(x, LR) + 2τ > d(x, LL)}                     (3)\n                      N(v.rc)    = {x | x ∈ N(v), d(x, LL) + 2τ > d(x, LR)}                     (4)\n\nNotice that points that fall in the region between LL and LR are shared by v.lc and v.rc. We call\nthis region the overlapping buffer, and we call τ the overlapping size. 
For v.lc and v.rc, we\ncan repeat the splitting procedure, until the number of points within a node is less than a\nspecific threshold, at which point we stop.\n\n3      Approximate Spill-tree-based Nearest Neighbor Search\n\nIt may seem strange that we allow overlapping in sp-trees. The overlapping obviously\nmakes both the construction and the MT-DFS less efficient than regular metric-trees, since\nthe points in the overlapping buffer may be searched twice. Nonetheless, the advantage of\nsp-trees over metric-trees becomes clear when we perform the defeatist search, a (1 + ε)-NN\nsearch algorithm based on sp-trees.\n\n3.1    Defeatist Search\nAs we have stated, the MT-DFS algorithm typically spends a large fraction of time back-\ntracking to prove a candidate point is the true NN. Based on this observation, a quick\nrevision would be to descend the metric tree using the decision boundaries at each level\nwithout backtracking, and then output the point x in the first leaf node it visits as the NN of\nquery q. We call this the defeatist search on a metric-tree. Since the depth of a metric-tree\nis O(log n), the complexity of defeatist search is O(log n) per query.\n\nThe problem with this approach is very low accuracy. Consider the case where q is very\nclose to a decision boundary L: it is then almost equally likely that the NN of q is on the\nsame side of L as on the opposite side of L, and the defeatist search can make a mistake\nwith probability close to 1/2. In practice, we observe that there exists a non-negligible\nfraction of the query points that are close to one of the decision boundaries. Thus the\naverage accuracy of the defeatist search algorithm is typically unacceptably low, even for\napproximate NN search.\n\nThis is precisely the place where sp-trees can help: the defeatist search on sp-trees has much\nhigher accuracy and remains very fast. We first describe the algorithm. For simplicity, we\ncontinue to use the example shown in Figure 2. 
As before, the decision boundary at node\nv is plane L. If a query q is to the left of L, we decide that its nearest neighbor is in v.lc. In\nthis case, we only search points within N(v.lc), i.e., the points to the left of LR. Conversely,\nif q is to the right of L, we only search node v.rc, i.e. points to the right of LL. Notice that\nin either case, points in the overlapping buffer are always searched. By introducing this\nbuffer of size τ, we can greatly reduce the probability of making a wrong decision. To see\nthis, suppose that q is to the left of L, then the only points eliminated are the ones to the right\nof plane LR, all of which are at least distance τ away from q.\n\n3.2    Hybrid Sp-Tree Search\nOne problem with spill-trees is that their depth varies considerably depending on the over-\nlapping size τ. If τ = 0, a sp-tree reverts to a metric tree with depth O(log n). On the\nother hand, if τ ≥ ||v.rpv - v.lpv||/2, then N(v.lc) = N(v.rc) = N(v). In other words, both\nchildren of node v contain all points of v. In this case, the construction of a sp-tree does\nnot even terminate and the depth of the sp-tree is ∞.\n\nTo solve this problem, we introduce hybrid sp-trees and actually use them in practice. First\nwe define a balance threshold ρ < 1, which is usually set to 70%. The construction of\na hybrid sp-tree is similar to that of a sp-tree except for the following. For each node v, we\nfirst split the points using the overlapping buffer. However, if either of its children contains\nmore than a ρ fraction of the total points in v, we undo the overlapping splitting. Instead, a\nconventional metric-tree partition (without overlapping) is used, and we mark v as a non-\noverlapping node. In contrast, all other nodes are marked as overlapping nodes. 
In this\nway, we can ensure that each split reduces the number of points of a node by at least a\nconstant factor and thus we can maintain the logarithmic depth of the tree.\n\nThe NN search on a hybrid sp-tree also becomes a hybrid of the MT-DFS search and the\ndefeatist search. We only do defeatist search on overlapping nodes; for non-overlapping\nnodes, we still backtrack as in MT-DFS search. Notice that we can control the hybrid\nby varying τ. If τ = 0, we have a pure sp-tree with defeatist search -- very efficient but\nnot accurate enough; if τ ≥ ||v.rpv - v.lpv||/2, then every node is a non-overlapping node\n(due to the balance threshold mechanism) -- in this way we get back to the traditional\nmetric-tree with MT-DFS, which is perfectly accurate but inefficient. By setting τ to be\nsomewhere in between, we can achieve a balance of efficiency and accuracy. As a general\nrule, the greater τ is, the more accurate and the slower the search algorithm becomes.\n\n3.3    Further Efficiency Improvement Using Random Projection\nThe hybrid sp-tree search algorithm is much more efficient than the traditional MT-DFS\nalgorithm. However, this speed-up becomes less pronounced when the dimension of a\ndataset becomes high (say, over 30). In some sense, the hybrid sp-tree search algorithm\nalso suffers from the curse of dimensionality, only much less severely than MT-DFS.\n\nHowever, a well-known technique, namely, random projection is readily available to deal\nwith the high-dimensional datasets. In particular, the Johnson-Lindenstrauss Lemma [15]\nstates that one can embed a dataset of n points in a subspace of dimension O(log n) with lit-\ntle distortion on the pair-wise distances. 
Furthermore, the embedding is extremely simple:\none simply picks a random subspace S and projects all points to S.\n\nIn our (1 + ε)-NN search algorithm, we use random projection as a pre-processing step:\nproject the datapoints to a subspace of lower dimension, and then do the hybrid sp-\ntree search. Both the construction of sp-tree and the search are conducted in the low-\ndimensional subspace. Naturally, by doing random projection, we will lose some accuracy.\nBut we can easily fix this problem by doing multiple rounds of random projections and do-\ning one hybrid sp-tree search for each round. Assume the failure probability of each round\nis δ, then by doing L rounds, we drive down this probability to δ^L.\n\nThe core idea of the hash function used in [9] can be viewed as a variant of random projec-\ntion.2 Random projection can be used as a pre-processing step in conjunction with other\ntechniques such as conventional MT-DFS. We conducted a series of experiments which\nshow that a modest speed-up is obtained by using random projection with MT-DFS (about\n4-fold), but greater (up to 700-fold) speed-up when used with sp-tree search. Due to limited\nspace these results will appear in the full version of this paper [19].\n\n4      Experimental Results\n\nWe report our experimental results based on hybrid sp-tree search on a variety of real-world\ndatasets, with the number of datapoints ranging from 20,000 to 275,465, and dimensions\nfrom 60 to 3,838. The first two datasets are the same as the ones used in [9], where it is\ndemonstrated that LSH can have a significant speedup over SR-trees.\nAerial Texture feature data contain 275,465 feature vectors of 60 dimensions representing texture\n          information of large aerial photographs [21, 20].\n\nCorel hist 20,000 histograms (64-dimensional) of color thumbnail-sized images taken from the\n          COREL STOCK PHOTO library. 
However, of the 64 dimensions, only 44 of them con-\n          tain non-zero entries. See [27] for more discussions. We were unable to obtain the original\n          dataset used in [9] from the authors, and so we reproduced our own version, following their\n          description. We expect the two datasets to be almost identical.\n\nCorel uci 68,040 histograms (64-dimensional) of color images from the COREL library. This\n          dataset differs significantly from Corel hist and is available from the UCI repository [1].\n\nDisk trace 40,000 content traces of disk-write operations, each being a 1 Kilo-Byte block (therefore\n          having dimension 1,024). The traces are generated from a desktop computer running SuSe\n          Linux during daily operation.\n\nGalaxy Spectra of 40,000 galaxies from the Sloan Digital Sky Survey, with 4000 dimensions.\n\nBesides the sp-tree search algorithm, we also run a number of other algorithms:\nLSH The original LSH implementation used in [9] is not public and we were unable to\n          obtain it from the authors. So we used our own efficient implementation. Experi-\n          ments (described later) show that ours is comparable to the one in [9].\n\nNaive The naive linear-scan algorithm.\n\n     2 The Johnson-Lindenstrauss Lemma only works for L2 norm. The \"random sampling\" done in\nthe LSH in [9] roughly corresponds to the L1 version of the Johnson-Lindenstrauss Lemma.\n\nSR-tree We use the implementation of SR-trees by Katayama and Satoh [16].\n\nMetric-Tree This is a highly optimized k-NN search based on metric trees [29, 22], and\n                            the code is publicly available [2].\n\nThe experiments are run on a dual-processor AMD Opteron machine of 1.60 GHz with\n8 GB RAM. We perform 10-fold cross-validation on all the datasets. We measure the\nCPU time and accuracy of each algorithm. 
Since all the experiments are memory-based\n(all the data fit into memory completely), there is no disk access during our experiments.\nTo measure accuracy, we use the effective distance error [3, 9], which is defined as\nE = (1/|Q|) Σ_{q∈Q} (d_alg/d* - 1), where d_alg is the distance from a query q to the NN found by the\nalgorithm, and d* is the distance from q to the true NN. The sum is taken over all queries.\nFor the k-NN case where k > 1, we measure separately the distance ratios between the\nclosest points found to the nearest neighbor, the 2nd closest one to the 2nd nearest neighbor\nand so on, and then take the average. Obviously, for all exact k-NN algorithms, E = 0; for\nall approximate algorithms, E ≥ 0.\n\n4.1                   The Experiments\nFirst, as a benchmark, we run the Naive, SR-tree, and the Metric-tree algorithms. All of\nthem find exact NN. The results are summarized in Table 1.\n\n       Table 1: the CPU time of exact SR-tree, Metric-tree, and Naive search\n\n     Algorithm       Aerial        Corel hist          Corel uci    Disk trace    Galaxy\n                               (k = 1)   (k = 10)\n     Naive           43620     462       465           5460         27050         46760\n     SR-tree         23450     184       330           3230         n/a           n/a\n     Metric-tree     3650      58.4      91.2          791          19860         6600\n\nAll the datasets are rather large, and the metric-tree is consistently the fastest. 
On the other\nhand, the SR-tree implementation only has limited speedup over the Naive algorithm, and it fails\nto run on Disk trace and Galaxy, both of which have very high dimensions.\n\nThen, for approximate NN search, we compare sp-tree with three other algorithms: LSH,\ntraditional Metric-tree and SR-tree. For each algorithm, we measure the CPU time needed\nfor the error E to be 1%, 2%, 5%, 10% and 20%, respectively. Since Metric-tree and SR-\ntree are both designed for exact NN search, we also run them on randomly chosen subsets\nof the whole dataset to produce approximate answers. We show the comparison results of\nall algorithms for the Aerial and the Corel hist datasets, both for k = 1, in Figure 3. We\nalso examine the speed-up of sp-tree over other algorithms. In particular, the CPU time and\nthe speedup of sp-tree search over LSH are summarized in Table 2.\n\n  Figure 3: CPU time (s) vs. Error (%) for selected datasets. Panels: Aerial (D=60, n=275,476, k=1)\n  and Corel_hist (D=64, n=20,000, k=1); curves: Sp-tree, LSH, Metric-tree, SR-tree.\n\n       Table 2: the CPU time (s) of Sp-tree and its speedup (in parentheses) over LSH\n\n     Error       Aerial            Corel hist            Corel uci     Disk trace        Galaxy\n      (%)                     (k = 1)      (k = 10)\n       20       33.5 (31)    1.67 (8.3)    3.27 (6.3)    8.7 (8.0)     13.2 (5.3)    24.9 (5.5)\n       10       73.2 (27)    2.79 (9.8)    5.83 (7.0)    19.1 (4.9)    43.1 (2.9)    44.2 (7.8)\n        5       138 (31)     4.5 (11)      9.58 (6.8)    33.1 (4.8)    123 (3.7)     76.9 (11)\n        2       286 (26)      8 (9.5)      20.6 (4.2)    61.9 (4.4)    502 (2.5)     110 (14)\n        1       426 (23)     13.5 (6.4)    27.9 (4.1)    105 (4.1)     1590 (3.2)    170 (12)\n\nSince we used our own implementation of LSH, we first need to verify that it has com-\nparable performance to the one used in [9]. We do so by examining the speedup of both\nimplementations over SR-tree on the Aerial and Corel hist datasets, with both k = 1 and\nk = 10 for the latter.3 For the Aerial dataset, in the case where E varies from 10% to 20%,\nthe speedup of LSH in [9] over SR-tree varies from 4 to 6, and as for our implementation,\nthe speedup varies from 4.5 to 5.4. 
For Corel hist, when E ranges from 2% to 20%, in\nthe case k = 1, the speedups of LSH in [9] range from 2 to 7, ours from 2 to 13. In the\ncase k = 10, the speedup in [9] is from 3 to 12, and ours from 4 to 16. So overall, our\nimplementation is comparable to, and often outperforms, the one in [9].\n\nPerhaps a little surprisingly, the Metric-tree search algorithm (MT-DFS) performs very\nwell on Aerial and Corel hist datasets. In both cases, when E is small (1%), MT-\nDFS outperforms LSH by a factor of up to 2.7, even though it aims at finding the exact\nNN, while LSH only finds an approximate NN. Furthermore, the approximate MT-DFS\nalgorithm (conventional metric-tree based search using a random subset of the training\ndata) consistently outperforms LSH across the entire error spectrum on Aerial. We believe\nthis is because in both datasets, the intrinsic dimensions are quite low and thus the\nMetric-tree does not suffer from the curse of dimensionality.\n\nFor the rest of the datasets, namely, Corel uci, Disk trace, and Galaxy, metric-tree becomes\nrather inefficient because of the curse of dimensionality, and LSH becomes competitive.\nBut in all cases, sp-tree search remains the fastest among all algorithms, frequently achiev-\ning 2 or 3 orders of magnitude in speed-up. Space does not permit a lengthy conclusion, but\nthe summary of this paper is that there is empirical evidence that with appropriate redesign\nof the data structures and search algorithms, spatial data structures remain a useful tool in\nthe realm of approximate k-NN search.\n\n5     Related Work\n\nThe idea of defeatist search, i.e., non-backtracking search, has been explored by various\nresearchers in different contexts. See, for example, Goldstein and Ramakrishnan [10],\nYianilos [30], and Indyk [14]. The latter also proposed a data structure similar to the spill-\ntree, where the decision boundary needs to be aligned with a coordinate and there is no\nhybrid version. 
Indyk proved how this data structure can be used to solve approximate NN\nin the L∞ norm.\n\nReferences\n\n [1] http://kdd.ics.uci.edu/databases/CorelFeatures/CorelFeatures.data.html.\n\n [2] http://www.autonlab.org/autonweb/showsoftware/154/.\n\n [3] S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for ap-\n       proximate nearest neighbor searching in fixed dimensions. Journal of the ACM, 45(6):891-923,\n       1998.\n\n [4] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is \"nearest neigh-\n       bor\" meaningful? Lecture Notes in Computer Science, 1540:217-235, 1999.\n\n [5] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search\n       in metric spaces. In Proceedings of the 23rd VLDB International Conference, September 1997.\n\n     3 The comparison in [9] is on disk access while we compare CPU time. So strictly speaking, these\nresults are not comparable. Nonetheless we expect them to be more or less consistent.\n\n [6] K. Clarkson. Nearest Neighbor Searching in Metric Spaces: Experimental Results for sb(S).\n     2002.\n\n [7] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons,\n     1973.\n\n [8] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in loga-\n     rithmic expected time. ACM Transactions on Mathematical Software, 3(3):209-226, September\n     1977.\n\n [9] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In\n     Proc 25th VLDB Conference, 1999.\n\n[10] J. Goldstein and R. Ramakrishnan. Contrast Plots and P-Sphere Trees: Space vs. Time in\n     Nearest Neighbor Searches. In Proc. 26th VLDB conference, 2000.\n\n[11] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proceedings of\n     the Third ACM SIGACT-SIGMOD Symposium on Principles of Database Systems. Assn for\n     Computing Machinery, April 1984.\n\n[12] P. 
Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of\n     dimensionality. In STOC, pages 604-613, 1998.\n\n[13] Piotr Indyk. High Dimensional Computational Geometry. PhD. Thesis, 2000.\n\n[14] Piotr Indyk. On approximate nearest neighbors under l∞ norm. J. Comput. Syst. Sci., 63(4),\n     2001.\n\n[15] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemp.\n     Math., 26:189-206, 1984.\n\n[16] Norio Katayama and Shin'ichi Satoh. The SR-tree: an index structure for high-dimensional\n     nearest neighbor queries. pages 369-380, 1997.\n\n[17] J. Kleinberg. Two Algorithms for Nearest Neighbor Search in High Dimension. In Proceedings\n     of the Twenty-ninth Annual ACM Symposium on the Theory of Computing, pages 599-608,\n     1997.\n\n[18] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient Search for Approximate Nearest Neigh-\n     bors in High Dimensional Spaces. In Proceedings of the Thirtieth Annual ACM Symposium on\n     the Theory of Computing, 1998.\n\n[19] T. Liu, A. W. Moore, A. Gray, and K. Yang. An investigation of practical approximate nearest\n     neighbor algorithms (full version). Manuscript in preparation.\n\n[20] B. S. Manjunath. Airphoto dataset, http://vivaldi.ece.ucsb.edu/Manjunath/research.htm.\n\n[21] B. S. Manjunath and W. Y. Ma. Texture features for browsing and retrieval of large image data.\n     IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):837-842, 1996.\n\n[22] A. W. Moore. The Anchors Hierarchy: Using the Triangle Inequality to Survive High-\n     Dimensional Data. In Twelfth Conference on Uncertainty in Artificial Intelligence. AAAI Press,\n     2000.\n\n[23] G. Mori, S. Belongie, and J. Malik. Shape contexts enable efficient retrieval of similar shapes.\n     In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2001.\n\n[24] S. M. Omohundro. Efficient Algorithms with Neural Network Behaviour. 
Journal of Complex\n     Systems, 1(2):273-347, 1987.\n\n[25] S. M. Omohundro. Bumptrees for Efficient Function, Constraint, and Classification Learning.\n     In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information\n     Processing Systems 3. Morgan Kaufmann, 1991.\n\n[26] F. P. Preparata and M. Shamos. Computational Geometry. Springer-Verlag, 1985.\n\n[27] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover's distance as a metric for image\n     retrieval. International Journal of Computer Vision, 40(2):99-121, 2000.\n\n[28] Gregory Shakhnarovich, Paul Viola, and Trevor Darrell. Fast pose estimation with parameter\n     sensitive hashing. In Proceedings of the International Conference on Computer Vision, 2003.\n\n[29] J. K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information\n     Processing Letters, 40:175-179, 1991.\n\n[30] P. Yianilos. Excluded middle vantage point forests for nearest neighbor search. In DIMACS\n     Implementation Challenge, 1999.\n", "award": [], "sourceid": 2666, "authors": [{"given_name": "Ting", "family_name": "Liu", "institution": null}, {"given_name": "Andrew", "family_name": "Moore", "institution": null}, {"given_name": "Ke", "family_name": "Yang", "institution": null}, {"given_name": "Alexander", "family_name": "Gray", "institution": null}]}