{"title": "Predictive Indexing for Fast Search", "book": "Advances in Neural Information Processing Systems", "page_first": 505, "page_last": 512, "abstract": "We tackle the computational problem of query-conditioned search. Given a machine-learned scoring rule and a query distribution, we build a predictive index by precomputing lists of potential results sorted based on an expected score of the result over future queries. The predictive index datastructure supports an anytime algorithm for approximate retrieval of the top elements. The general approach is applicable to webpage ranking, internet advertisement, and approximate nearest neighbor search. It is particularly effective in settings where standard techniques (e.g., inverted indices) are intractable. We experimentally find substantial improvement over existing methods for internet advertisement and approximate nearest neighbors.", "full_text": "Predictive Indexing for Fast Search\n\nSharad Goel\nYahoo! Research\n\nNew York, NY 10018\n\ngoel@yahoo-inc.com\n\nJohn Langford\nYahoo! Research\n\nNew York, NY 10018\njl@yahoo-inc.com\n\nAbstract\n\nAlex Strehl\n\nYahoo! Research\n\nNew York, NY 10018\n\nstrehl@yahoo-inc.com\n\nWe tackle the computational problem of query-conditioned search. Given a\nmachine-learned scoring rule and a query distribution, we build a predictive in-\ndex by precomputing lists of potential results sorted based on an expected score\nof the result over future queries. The predictive index datastructure supports an\nanytime algorithm for approximate retrieval of the top elements. The general ap-\nproach is applicable to webpage ranking, internet advertisement, and approximate\nnearest neighbor search. It is particularly effective in settings where standard tech-\nniques (e.g., inverted indices) are intractable. We experimentally \ufb01nd substantial\nimprovement over existing methods for internet advertisement and approximate\nnearest neighbors.\n\n1 Introduction\n\nThe Problem. 
The objective of web search is to quickly return the set of most relevant web pages given a particular query string. Accomplishing this task for a fixed query involves both determining the relevance of potential pages and then searching over the myriad set of all pages for the most relevant ones. Here we consider only the second problem. More formally, let Q \u2286 Rn be an input space, W \u2286 Rm a finite output space of size N, and f : Q \u00d7 W \u21a6 R a known scoring function. Given an input (search query) q \u2208 Q, the goal is to find, or closely approximate, the top-k output objects (web pages) p1, . . . , pk in W (i.e., the top k objects as ranked by f(q, \u00b7)).\nThe extreme speed constraint, often 100ms or less, and the large number of web pages (N \u2248 10^10) make web search a computationally-challenging problem. Even with perfect 1000-way parallelization on modern machines, there is far too little time to directly evaluate every page when a particular query is submitted. This observation limits the applicability of machine-learning methods for building ranking functions. The question addressed here is: \u201cCan we quickly return the highest scoring pages as ranked by complex scoring rules typical of learning algorithms?\u201d\nPredictive Indexing. We describe a method for rapidly retrieving the top elements over a large set as determined by general scoring functions. The standard method for mitigating the computational difficulties of search is to pre-process the data so that far less computation is necessary at runtime. Taking the empirical probability distribution of queries into account, we pre-compute collections of web pages that have a large expected score conditioned on the query falling into particular sets of related queries {Qi}. 
For example, we may pre-compute and store the list of web pages that have the\nhighest average score when the query contains the phrase \u201cmachine learning\u201d. To yield a practical\nalgorithm, these sets should form meaningful groups of pages with respect to the scoring function\nand query distribution. At runtime, we then optimize only over those collections of top-scoring web\npages for sets Qi containing the submitted query.\nOur main contribution is optimizing the search index with respect to the query distribution. The em-\npirical evidence presented shows that predictive indexing is an effective technique, making general\nmachine learning style prediction methods viable for quickly ranking over large numbers of objects.\n\n1\n\n\fThe general methodology applies to other optimization problems as well, including approximate\nnearest neighbor search.\nIn the remainder of Section 1 we describe existing solutions to large-scale search, and their applica-\nbility to general scoring functions. Section 2 describes the predictive indexing algorithm and covers\nan example and lemma suggesting that predictive indexing has signi\ufb01cant advantages over existing\ntechniques. We present empirical evaluation of the method in Section 3, using both proprietary web\nadvertising data and public data for nearest neighbor search.\n\n1.1 Feature Representation\n\nOne concrete way to map web search into the general predictive index framework is to represent\nboth queries and pages as sparse binary feature vectors in a high-dimensional Euclidean space.\nSpeci\ufb01cally, we associate each word with a coordinate: A query (page) has a value of 1 for that\ncoordinate if it contains the word, and a value of 0 otherwise. We call this the word-based feature\nrepresentation, because each query and page can be summarized by a list of its features (i.e., words)\nthat it contains. 
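As a concrete sketch of the word-based representation described above (the helper names and toy documents are illustrative, not from the paper):

```python
# Sketch of the word-based feature representation: each word is assigned
# a coordinate, and a query or page is represented by the set of
# coordinates (words) it contains, i.e., a sparse binary vector.

def build_vocabulary(documents):
    """Assign each observed word a distinct coordinate index."""
    vocab = {}
    for doc in documents:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_sparse_features(text, vocab):
    """Return the set of coordinates whose value is 1 for this text."""
    return {vocab[w] for w in text.split() if w in vocab}

docs = ["machine learning news", "car rental france"]
vocab = build_vocabulary(docs)
query_features = to_sparse_features("france car rental", vocab)
```

A page is encoded the same way; a scoring rule over such vectors then only needs to touch the nonzero coordinates.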
The general predictive framework supports many other possible representations, including those that incorporate the difference between words in the title and words in the body of the web page, the number of times a word occurs, or the IP address of the user entering the query.\n\n1.2 Related Work\n\nGiven the substantial importance of large-scale search, a variety of techniques have been developed to address the rapid ranking problem. Past work that has referenced the query distribution includes (Cheng et al., 2006; Chierichetti et al., 2008). Here we describe two commonly applied methods related to the predictive index approach.\nFagin\u2019s Threshold Algorithm. Fagin\u2019s threshold algorithm (Fagin et al., 2003) supports the top-k problem for linear scoring functions of the form f(q, p) = \u2211_{i=1}^{n} q_i g_i(p), where q_i \u2208 {0, 1} is the ith coordinate of the query q, and g_i : W \u21a6 R are partial scores for pages as determined by the ith feature1. For each query feature i, construct an ordered list Li containing every web page, sorted in descending order by their partial scores g_i(p). We refer to this as the projective order, since it is attained by projecting the scoring rule onto individual coordinates. Given a query q, we evaluate web pages in the lists Li that correspond to features of q. The algorithm maintains two statistics, upper and lower bounds on the score of the top-kth page, halting when these bounds cross. The lower bound is the score of the kth best page seen so far; the upper bound is the sum of the partial scores (i.e., g_i(p)) for the next-to-be-scored page in each list. Since the lists are ordered by the partial scores, the upper threshold does in fact bound the score of any page yet to be seen.\nThe threshold algorithm is particularly effective when a query contains a small number of features, facilitating fast convergence of the upper bound. 
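A minimal sketch of this procedure for a linear rule (the dictionary-based layout of the partial scores is an assumption for illustration, not the paper's implementation):

```python
import heapq

def threshold_algorithm(query_features, partial_scores, k):
    """Fagin-style threshold algorithm for f(q, p) = sum of g_i(p)
    over the features i present in the query.

    partial_scores: dict mapping feature i -> {page: g_i(page)}.
    Returns the exact top-k (score, page) pairs, best first.
    """
    # Projective order: one list per query feature, sorted by g_i(p) descending.
    lists = {i: sorted(partial_scores[i].items(), key=lambda t: -t[1])
             for i in query_features}
    full_score = lambda p: sum(partial_scores[i].get(p, 0.0) for i in query_features)
    top, seen, depth = [], set(), 0  # `top` is a min-heap of the best k seen
    while not all(depth >= len(lists[i]) for i in query_features):
        upper = 0.0  # bound on the score of any page not yet encountered
        for i in query_features:
            if depth < len(lists[i]):
                page, g = lists[i][depth]
                upper += g
                if page not in seen:
                    seen.add(page)
                    heapq.heappush(top, (full_score(page), page))
                    if len(top) > k:
                        heapq.heappop(top)
        # Halt once the k-th best score seen so far meets the upper bound.
        if len(top) == k and top[0][0] >= upper:
            break
        depth += 1
    return sorted(top, reverse=True)
```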
In our experiments, we find that the halting condition is rarely satisfied within the imposed computational restrictions. One can, of course, simply halt the algorithm when it has expended the computational budget (Fagin, 2002), which we refer to as the Halted Threshold Algorithm.\n\n1More general monotone scoring functions (e.g., coordinate-wise product and max) are in fact supported; for clarity, however, we restrict to the linear case.\n\nInverted Indices. An inverted index is a datastructure that maps every page feature x to a list of pages p that contain x. When a new query arrives, a subset of page features relevant to the query is first determined. For instance, when the query contains \u201cdog\u201d, the page feature set might be {\u201cdog\u201d, \u201ccanine\u201d, \u201ccollar\u201d, ...}. Note that a distinction is made between query features and page features, and in particular, the relevant page features may include many more words than the query itself. Once a set of page features is determined, their respective lists (i.e., inverted indices) are searched, and from them the final list of output pages is chosen. One method for searching over these lists is to execute Fagin\u2019s threshold algorithm. Other methods, such as the \u201cWeighted-And\u201d algorithm (Broder et al., 2003), use one global order for pages in the lists and walk down the lists synchronously to compute page scores. See (Zobel & Moffat, 2006) for an overview of inverted indices applied to web search.\nStandard approaches based on inverted indices suffer from a shortcoming. The resulting algorithms are efficient only when it is sufficient to search over a relatively small set of inverted indices for each query. They require, for each query q, that there exists a small set2 Xq of page features such that the score of any page against q depends only on its intersection with Xq. 
In other words, the scoring\nrule must be extremely sparse, with most words or features in the page having zero contribution to\nthe score for q. In Section 3.1, we consider a machine-learned scoring rule, derived from internet\nadvertising data, with the property that almost all page features have substantial in\ufb02uence on the\nscore for every query, making any straightforward approach based on inverted indices intractable.\nFurthermore, algorithms that use inverted indices do not typically optimize the datastructure against\nthe query distribution and our experiments suggest that doing so may be bene\ufb01cial.\n\n2 An Algorithm for Rapid Approximate Ranking\n\nSuppose we are provided with a categorization of possible queries into related, potentially overlap-\nping, sets. For example, these sets might be de\ufb01ned as, \u201cqueries containing the word \u2018France\u2019,\u201d\nor \u201cqueries with the phrase \u2018car rental\u2019.\u201d For each query set, the associated predictive index is an\nordered list of web pages sorted by their expected score for random queries drawn from that set. In\nparticular, we expect web pages at the top of the \u2018France\u2019 list to be good, on average, for queries\ncontaining the word \u2018France.\u2019 In contrast to an inverted index, the pages in the \u2018France\u2019 list need not\nthemselves contain the word \u2018France\u2019. To retrieve results for a particular query (e.g., \u201cFrance car\nrental\u201d), we optimize only over web pages in the relevant, pre-computed lists. Note that the predic-\ntive index is built on top of an already existing categorization of queries, a critical, and potentially\ndif\ufb01cult initial step. In the applications we consider, however, we \ufb01nd that predictive indexing works\nwell even when applied to naively de\ufb01ned query sets. 
Furthermore, in our application to approxi-\nmate nearest neighbor search, we found predictive indexing to be robust to cover sets generated via\nrandom projections whose size and shape were varied across experiments.\nWe represent queries and web pages as points in, respectively, Q \u2286 Rn and W \u2286 Rm. This setting\nis general, but for the experimental application we consider n, m \u2248 106, with any given page or\nquery having about 102 non-zero entries (see Section 3.1 for details). Thus, pages and points are\ntypically sparse vectors in very high dimensional spaces. A coordinate may indicate, for example,\nwhether a particular word is present in the page/query, or more generally, the number of times that\nword appears. Given a scoring function f : Q \u00d7 W (cid:55)\u2192 R, and a query q, we attempt to rapidly \ufb01nd\nthe top-k pages p1, . . . , pk. Typically, we \ufb01nd an approximate solution, a set of pages \u02c6p1, . . . , \u02c6pk\nthat are among the top l for l \u2248 k. We assume queries are generated from a probability distribution\nD that may be sampled.\n\n2.1 Predictive Indexing for General Scoring Functions\nConsider a \ufb01nite collection Q of sets Qi \u2286 Q that cover the query space (i.e., Q \u2286 \u222aiQi). For each\nQi, de\ufb01ne the conditional probability distribution Di over queries in Qi by Di(\u00b7) = D(\u00b7|Qi), and\nde\ufb01ne fi : W (cid:55)\u2192 R as fi(p) = Eq\u223cDi[f(q, p)]. The function fi(p) is the expected score of the\nweb page p for the (related) queries in Qi. The hope is that any page p has approximately the same\nscore for any query q \u2208 Qi. If, for example, Qi is the set of queries that contain the word \u201cdog\u201d, we\nmay expect every query in Qi to score high against pages about dogs and to score low against those\npages not about dogs.\nFor each set of queries Qi we pre-compute a sorted list Li of pages pi1 , pi2, . . . , piN ordered in\ndescending order of fi(p). 
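In Python-like terms, the sampling-based construction and the runtime search can be sketched as follows (the data layout and the externally supplied `cover_of` map are assumptions for illustration):

```python
from collections import defaultdict

def build_predictive_index(cover_of, objects, score, sample_queries):
    """For each cover set id j, accumulate the score of every object over
    sampled queries falling in that set, then sort descending: the sorted
    list approximates ordering by the conditional expected score f_j."""
    totals = defaultdict(lambda: defaultdict(float))
    for q in sample_queries:
        for j in cover_of(q):
            for s in objects:
                totals[j][s] += score(q, s)
    return {j: sorted(acc, key=lambda s: -acc[s]) for j, acc in totals.items()}

def find_top(index, cover_of, score, q, k, budget):
    """Walk down the lists for the cover sets containing q, fully scoring
    at most `budget` objects, and return the best k found."""
    scored, depth = {}, 0
    while len(scored) < budget:
        advanced = False
        for j in cover_of(q):
            lst = index.get(j, [])
            if depth < len(lst):
                advanced = True
                s = lst[depth]
                if s not in scored:
                    scored[s] = score(q, s)
        if not advanced:  # every relevant list is exhausted
            break
        depth += 1
    return sorted(scored, key=lambda s: -scored[s])[:k]
```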
At runtime, given a query q, we identify the query sets Qi containing q, and compute the scoring function f only on the restricted set of pages at the beginning of their associated lists Li. We search down these lists for as long as the computational budget allows.\nIn general, it is difficult to compute exactly the conditional expected scores of pages fi(p). One can, however, approximate these scores by sampling from the query distribution D. Algorithm 1 outlines the construction of the sampling-based predictive indexing datastructure. Algorithm 2 shows how the method operates at run time.\nNote that in the special case where we cover Q with a single set, we end up with a global ordering of web pages, independent of the query, which is optimized for the underlying query distribution.\n\n2The size of these sets is typically on the order of 100 or smaller.\n\nAlgorithm 1 Construct-Predictive-Index(Cover Q, Dataset S)\n  Lj[s] = 0 for all objects s and query sets Qj\n  for t random queries q \u223c D do\n    for all objects s in the data set do\n      for all query sets Qj containing q do\n        Lj[s] \u2190 Lj[s] + f(q, s)\n      end for\n    end for\n  end for\n  for all lists Lj do\n    sort Lj\n  end for\n  return {L}\n\nAlgorithm 2 Find-Top(query q, count k)\n  i = 0\n  top-k list V = \u2205\n  while time remains do\n    for each query set Qj containing q do\n      s \u2190 Lj[i]\n      if f(q, s) > kth best seen so far then\n        insert s into ordered top-k list V\n      end if\n    end for\n    i \u2190 i + 1\n  end while\n  return V\n\nWhile this global ordering may not be effective in isolation, it could perhaps be used to order pages in traditional inverted indices.\n\n2.2 Discussion\n\nWe present an elementary example to help develop intuition for why we can sometimes expect predictive indexing to improve upon projective datastructures such as those used in Fagin\u2019s threshold algorithm. Suppose we have: two query features t1 and t2; three possible queries q1 = {t1}, q2 = {t2} and q3 = {t1, t2}; and three web pages p1, p2 and p3. Further suppose we have a simple linear scoring function defined by\n\nf(q, p1) = I_{t1 \u2208 q} \u2212 I_{t2 \u2208 q}\nf(q, p2) = I_{t2 \u2208 q} \u2212 I_{t1 \u2208 q}\nf(q, p3) = 0.5 \u00b7 I_{t2 \u2208 q} + 0.5 \u00b7 I_{t1 \u2208 q}\n\nwhere I is the indicator function. That is, pi is the best match for query qi, but p3 does not score highly for either query feature alone. Thus, an ordered, projective datastructure would have\n\nt1 \u2192 {p1, p3, p2}\nt2 \u2192 {p2, p3, p1}.\n\nSuppose, however, that we typically only see query q3. In this case, if we know t1 is in the query, we infer that t2 is likely to be in the query (and vice versa), and construct the predictive index\n\nt1 \u2192 {p3, p1, p2}\nt2 \u2192 {p3, p2, p1}.\n\nOn the high probability event, namely query q3, we see the predictive index outperforms the projective, query independent, index.\nWe expect predictive indices to generally improve on datastructures that are agnostic to the query distribution. In the simple case of a single cover set (i.e., a global web page ordering) and when we wish to optimize the probability of returning the highest-scoring object, Lemma 2.1 shows that a predictive ordering is the best ordering relative to any particular query distribution.\nLemma 2.1. Suppose we have a set of points S, a query distribution D, and a function f that scores queries against points in S. Further assume that for each query q, there is a unique highest scoring point Hq. For s \u2208 S, let h(s) = Pr_{q \u223c D}(s = Hq), and let s1, s2, . . . , sN be ordered according to h(s), largest first. For any fixed k,\n\nPr_{q \u223c D}(Hq \u2208 {s1, ..., sk}) = max_{permutations \u03c0} Pr_{q \u223c D}(Hq \u2208 {s\u03c0(1), ..., s\u03c0(k)}).\n\nProof. For any ordering of points, s\u03c0(1), . . . , s\u03c0(k), the probability of the highest scoring point appearing in the top k entries equals \u2211_{j=1}^{k} h(s\u03c0(j)). This sum is clearly maximized by ordering the list according to h(\u00b7).\n\n3 Empirical Evaluation\n\nWe evaluate predictive indexing for two applications: Internet advertising and approximate nearest neighbor.\n\n3.1 Internet Advertising\n\nWe present results on Internet advertising, a problem closely related to web search. We have obtained proprietary data, both testing and training, from an online advertising company. The data are comprised of logs of events, where each event represents a visit by a user to a particular web page p, from a set of web pages Q \u2286 Rn. From a large set of advertisements W \u2286 Rm, the commercial system chooses a smaller, ordered set of ads to display on the page (generally around 4). The set of ads seen and clicked by users is logged. Note that the role played by web pages has switched, from result to query. The total number of ads in the data set is |W| \u2248 6.5 \u00d7 10^5. Each ad contains, on average, 30 ad features, and a total of m \u2248 10^6 ad features are observed. The training data consist of 5 million events (web page \u00d7 ad displays). The total number of distinct web pages is 5 \u00d7 10^5. Each page consists of approximately 50 page features, and a total of n \u2248 9 \u00d7 10^5 page features are observed.\nWe used a sparse feature representation (see Section 1.1) and trained a linear scoring rule f of the form f(p, a) = \u2211_{i,j} w_{i,j} p_i a_j to approximately rank the ads by their probability of click. Here, w_{i,j} are the learned weights (parameters) of the linear model. The search algorithms we compare were given the scoring rule f, the training pages, and the ads W for the necessary pre-computations. They were then evaluated by their serving of k = 10 ads, under a time constraint, for each page in the test set. There was a clear separation of test and training. 
We measured computation time in terms of the number of full evaluations by the algorithm (i.e., the number of ads scored against a given page). Thus, the true test of an algorithm was to quickly select the most promising T ads to fully score against the page, where T \u2208 {100, 200, 300, 400, 500} was externally imposed and varied over the experiments. These numbers were chosen to be in line with real-world computational constraints.\nWe tested four methods: the halted threshold algorithm (TA), as described in Section 1.2, two variants of predictive indexing (PI-AVG and PI-DCG), and a fourth method, called best global ordering (BO), which is a degenerate form of PI discussed in Section 2.1. An inverted index approach is prohibitively expensive since almost all ad features have substantial influence on the score for every web page (see Section 1.2).\nPI-AVG and PI-DCG require a covering of the web page space. We used the natural covering suggested by the binary features\u2014each page feature i corresponds to a cover set consisting of precisely those pages p that contain i. The resulting datastructure is therefore similar to that maintained by the TA algorithm\u2014lists, for each page feature, containing all the ads. However, while TA orders ads by partial score \u2211_j w_{i,j} p_i a_j for each fixed page feature i, the predictive methods order by expected score. PI-AVG sorts ad lists by the expected score of f, E_{p \u223c Di}[f(p, a)] = E_{p \u223c D}[f(p, a) | i \u2208 p], conditioned on the page containing feature i. PI-DCG and BO optimize the expected value of a modified scoring rule, DCG_f(p, a) = I_{r(p,a) \u2264 16} / log_2(r(p, a) + 1), where r is the rank function and I is the indicator function. Here, r(p, a) = j indicates that ad a has rank j according to f(p, a) over all ads in W. BO stores a single list of all ads, sorted by expected DCG_f(p, a), while PI-DCG stores a list for each page feature i sorted by E_{p \u223c Di}[DCG_f(p, a)]. We chose this measure because:\n\n1. Compared with using the average score of f, we empirically observe that expected DCG_f greatly improves the performance of BO on these data.\n2. It is related to \u201cdiscounted cumulative gain\u201d, a common measure for evaluating search results in the information retrieval literature (J\u00e4rvelin & Kek\u00e4l\u00e4inen, 2002).\n3. Expected DCG_f is zero for many ads, enabling significant compression of the predictive index.\n4. Lemma 2.1 suggests ordering by the probability an ad is in the top 10. The DCG_f score is a softer version of the top-10 indicator.\n\nAll three predictive methods were trained by sampling from the training set, as described in Section 2.1.\nFigure 1 plots the results of testing the four algorithms on the web advertising data. Each point in the figure corresponds to one experiment, which consisted of executing each algorithm on 1000 test pages. Along the x-axis we vary the time constraint imposed on the algorithm. The y-axis plots the frequency, over the test pages, with which the algorithm succeeded in serving the top scoring ad for position 1 (Figure 1(a)) and for position 10 (Figure 1(b)). Thus, vertical slices through each plot show the difference in performance between the algorithms when they are given the same amount of serving time per page. The probabilities were computed by off-line scoring of all 6.5 \u00d7 10^5 ads for each test page and computing the true top-10 ads. Serving correctly for position 10 is more difficult than for position 1, because it also requires correctly serving ads for positions 1 through 9. We see that all three methods of predictive indexing are superior to Fagin\u2019s halted threshold algorithm. In addition, the use of a richer covering, for PI-DCG and PI-AVG, provides a large boost in performance. 
These latter two predictive indexing methods attain relatively high accuracy even when fully evaluating only 0.05% of the potential results.\n\nFigure 1: Comparison of the first and tenth results returned from the four serving algorithms on the web advertisement dataset.\n\nOur implementation of the predictive index, and also the halted threshold algorithm, required about 50ms per display event when 500 ad evaluations are allowed. The RAM use for the predictive index is also reasonable, requiring about a factor of 2 more RAM than the ads themselves.\n\n3.2 Approximate Nearest Neighbor Search\n\nA special case application of predictive indexing is approximate nearest neighbor search. Given a set of points W in n-dimensional Euclidean space, and a query point x in that same space, the nearest neighbor problem is to quickly return the top-k neighbors of x. This problem is of considerable interest for a variety of applications, including data compression, information retrieval, and pattern recognition. In the predictive indexing framework, the nearest neighbor problem corresponds to
First, we form \u03b1\nrandom partitions of the input space. Geometrically, each partition splits the n-dimensional space\non \u03b2 random hyperplanes. Formally, for all 1 \u2264 i \u2264 \u03b1 and 1 \u2264 j \u2264 \u03b2, generate a random unit-\nn ) \u2208 Rn from the Gaussian (normal) distribution. For \ufb01xed\nnorm n-vector Y ij = (Y ij\ni \u2208 {1, . . . , \u03b1} and subset J \u2286 {1, . . . , \u03b2} de\ufb01ne the cover set Qi,J = {x \u2208 Rn : x \u00b7 Y ij \u2265\n0 if and only if j \u2208 J}. Note that for \ufb01xed i, the set {Qi,J|J \u2286 {1, . . . , k}} partitions the space by\nrandom planes.\n\nGiven a query point x, consider the union Ux =(cid:83){Qi,J\u2208Q | x \u2208 Qi,J} Qi,J of all cover sets contain-\n\n1 , . . . , Y ij\n\ning x. Standard LSH approaches to the nearest neighbor problem work by scoring points in the set\nQx = W \u2229 Ux. That is, LSH considers only those points in W that are covered by at least one of\nthe same \u03b1 sets as x. Predictive indexing, in contrast, maps each cover set Qi,J to an ordered list\nof points sorted by their probability of being a top-10 nearest point to points in Qi,J. That is, the\nlists are sorted by hQi,J (p) = Prq\u223cD|Qi,J (p is one of the nearest 10 points to q). For the query x,\nwe then consider those points in W with large probability hQi,J for at least one of the sets Qi,J that\ncover x.\nWe compare LSH and predictive indexing over four data sets: (1) MNIST\u201460,000 training and\n10,000 test points in 784 dimensions; (2) Corel\u201437,749 points in 32 dimensions, split randomly\ninto 95% training and 5% test subsets; (3) Pendigits\u20147494 training and 3498 test points in 17\ndimensions; and (4) Optdigits\u20143823 training and 1797 test points in 65 dimensions. The MNIST\ndata is available at http://yann.lecun.com/exdb/mnist/ and the remaining three data\nsets are available at the UCI Machine Learning Repository (http://archive.ics.uci.edu/\nml/). 
Random projections were generated for each experiment, inducing a covering of the space that was provided to both LSH and predictive indexing. The predictive index was generated by sampling over the training set as discussed in Section 2.1. The number of projections \u03b2 per partition was set to 24 for the larger sets (Corel and MNIST) and 63 for the smaller sets (Pendigits and Optdigits), while the number of partitions \u03b1 was varied as an experimental parameter. Larger \u03b1 corresponds to more full evaluations per query, resulting in improved accuracy at the expense of increased computation time. Both algorithms were restricted to the same average number of full evaluations per query.\nPredictive indexing offers substantial improvements over LSH for all four data sets. Figure 2(a) displays the true rank of the first point returned by LSH and predictive indexing on the MNIST data set as a function of \u03b1, averaged over all points in the test set and over multiple trials. Predictive indexing outperforms LSH at each parameter setting, with the difference particularly noticeable when fewer full evaluations are permitted (i.e., small \u03b1). Figure 2(b) displays the performance of LSH and predictive indexing for the tenth point returned, over all four data sets, with values of \u03b1 varying from 5 to 70, averaged over the test sets, and replicated by multiple runs. In over 300 trials, we did not observe a single instance of LSH outperforming predictive indexing.\nRecent work has proposed more sophisticated partitionings for LSH (Andoni & Indyk, 2006). Approaches based on metric trees (Liu et al., 2004), which take advantage of the distance metric structure, have also been shown to perform well for approximate nearest neighbor. 
Presumably, taking\nadvantage of the query distribution could further improve these algorithms as well, although that is\nnot studied here.\n\n4 Conclusion\n\nPredictive indexing is the \ufb01rst datastructure capable of supporting scalable, rapid ranking based on\ngeneral purpose machine-learned scoring rules. In contrast, existing alternatives such as the Thresh-\nold Algorithm (Fagin, 2002) and Inverted Index approaches (Broder et al., 2003) are either substan-\ntially slower, inadequately expressive, or both, for common machine-learned scoring rules. In the\nspecial case of approximate nearest neighbors, predictive indexing offers substantial and consistent\nimprovements over the Locality Sensitive Hashing algorithm.\n\n7\n\n\f(a) The y-axis, \u201cRank of 1st Result\u201d measures the\ntrue rank of the \ufb01rst result returned by each method.\nAs the number of partitions \u03b1 is increased, improved\naccuracy is achieved at the expense of longer com-\nputation time.\n\n(b) Each point represents the outcome of a single ex-\nperiment for one of the four data sets at various pa-\nrameter settings.\n\nFigure 2: Comparison of the \ufb01rst and tenth results returned from LSH and predictive indexing.\n\nReferences\nAndoni, A., & Indyk, P. (2006). Near-optimal hashing algorithms for near neighbor problem in high dimen-\n\nsions. Proceedings of the Symposium on Foundations of Computer Science (FOCS\u201906).\n\nBroder, A. Z., Carmel, D., Herscovici, M., Soffer, A., & Zien, J. (2003). Ef\ufb01cient query evaluation using a\ntwo-level retrieval process. CIKM \u201903: Proceedings of the twelfth international conference on Information\nand knowledge management (pp. 426\u2013434).\n\nCheng, C.-S., Chung, C.-P., & Shann, J. J.-J. (2006). Fast query evaluation through document identi\ufb01er assign-\n\nment for inverted \ufb01le-based information retrieval systems. Inf. Process. Manage., 42, 729\u2013750.\n\nChierichetti, F., Lattanzi, S., Mari, F., & Panconesi, A. 
(2008). On placing skips optimally in expectation. WSDM \u201908: Proceedings of the international conference on Web search and web data mining (pp. 15\u201324). New York, NY, USA: ACM.\n\nDatar, M., Immorlica, N., Indyk, P., & Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. SCG \u201904: Proceedings of the twentieth annual symposium on Computational geometry (pp. 253\u2013262). New York, NY, USA: ACM.\n\nFagin, R. (2002). Combining fuzzy information: an overview. SIGMOD Rec., 31, 109\u2013118.\n\nFagin, R., Lotem, A., & Naor, M. (2003). Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66, 614\u2013656.\n\nGionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing. The VLDB Journal (pp. 518\u2013529).\n\nJ\u00e4rvelin, K., & Kek\u00e4l\u00e4inen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20, 422\u2013446.\n\nLiu, T., Moore, A., Gray, A., & Yang, K. (2004). An investigation of practical approximate nearest neighbor algorithms. Neural Information Processing Systems.\n\nZobel, J., & Moffat, A. (2006). Inverted files for text search engines. ACM Comput. Surv., 38, 6.\n\n[Figure 2 plots: (a) \u201cLSH vs. Predictive Indexing on MNIST Data\u201d, x-axis \u201cNumber of Partitions \u03b1\u201d, y-axis \u201cRank of 1st Result\u201d; (b) \u201cLSH vs. Predictive Indexing \u2212 All Data Sets\u201d, x-axis \u201cLSH \u2212 Rank of 10th Result\u201d, y-axis \u201cPredictive Indexing \u2212 Rank of 10th Result\u201d.]", "award": [], "sourceid": 840, "authors": [{"given_name": "Sharad", "family_name": "Goel", "institution": null}, {"given_name": "John", "family_name": "Langford", "institution": null}, {"given_name": "Alexander", "family_name": "Strehl", "institution": null}]}