{"title": "Polynomial Semantic Indexing", "book": "Advances in Neural Information Processing Systems", "page_first": 64, "page_last": 72, "abstract": "We present a class of nonlinear (polynomial) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Dealing with polynomial models on word features is computationally challenging. We propose a low rank (but diagonal preserving) representation of our polynomial models to induce feasible memory and computation requirements. We provide an empirical study on retrieval tasks based on Wikipedia documents, where we obtain state-of-the-art performance while providing realistically scalable methods.", "full_text": "Polynomial Semantic Indexing\n\nBing Bai(1)\n\nKunihiko Sadamasa(1)\n\nJason Weston(1)(2) David Grangier(1)\n\nRonan Collobert(1)\nCorinna Cortes(2) Mehryar Mohri(2)(3)\n\nYanjun Qi(1)\n\n{bbai, dgrangier, collober, kunihiko, yanjun}@nec-labs.com\n\n(1)NEC Labs America, Princeton, NJ\n\n(2) Google Research, New York, NY\n\n{jweston, corinna, mohri}@google.com\n\n(3) NYU Courant Institute, New York, NY\n\nmohri@cs.nyu.edu\n\nAbstract\n\nWe present a class of nonlinear (polynomial) models that are discriminatively\ntrained to directly map from the word content in a query-document or document-\ndocument pair to a ranking score. Dealing with polynomial models on word fea-\ntures is computationally challenging. We propose a low-rank (but diagonal pre-\nserving) representation of our polynomial models to induce feasible memory and\ncomputation requirements. We provide an empirical study on retrieval tasks based\non Wikipedia documents, where we obtain state-of-the-art performance while pro-\nviding realistically scalable methods.\n\n1\n\nIntroduction\n\nRanking text documents given a text-based query is one of the key tasks in information retrieval.\nA typical solution is to: (i) embed the problem in a feature space, e.g. 
model queries and target documents using a vector representation; and then (ii) choose (or learn) a similarity metric that operates in this vector space. Ranking is then performed by sorting the documents based on their similarity score with the query.
A classical vector space model, see e.g. [24], uses weighted word counts (e.g. via tf-idf) as the feature space, and the cosine similarity for ranking. In this case, the model is chosen by hand and no machine learning is involved. This type of model often performs remarkably well, but suffers from the fact that only exact matches of words between query and target texts contribute to the similarity score. That is, words are considered to be independent, which is clearly a false assumption.
Latent Semantic Indexing [8], and related methods such as pLSA and LDA [18, 2], are unsupervised methods that choose a low dimensional feature representation of "latent concepts" in which words are no longer independent. They are trained with reconstruction objectives, either based on mean squared error (LSI) or likelihood (pLSA, LDA). These models, being unsupervised, are still agnostic to the particular task of interest.
More recently, supervised models for ranking texts have been proposed that can be trained on a supervised signal (i.e., labeled data) to provide a ranking of a database of documents given a query. For example, if one has click-through data yielding query-target relationships, one can use this to train such models to perform well on this task. Or, if one is interested in finding documents related to a given query document, one can use known hyperlinks to learn a model that performs well on this task. Many of these models have typically relied on optimizing over only a few hand-constructed features, e.g.
based on existing vector space models such as tf-idf, the title, URL, PageRank and other information [20, 5]. In this work, we investigate an orthogonal research direction, as we analyze supervised methods that are based on words only. Such models are both more flexible, e.g. can be used for tasks such as cross-language retrieval, and can still be used in conjunction with other features explored in previous work for further gains. At least one recent work, called Hash Kernels [25], has been proposed that does construct a word-feature based model in a learning-to-rank context.
In this article we define a class of nonlinear (polynomial) models that can capture higher order relationships between words. Our nonlinear representation of the words results in a very high dimensional feature space. To deal with this space we propose low-rank (but diagonal preserving) representations of our polynomial models to induce feasible memory and computation requirements, resulting in a method that both exhibits strong performance and is tractable to train and test.
We show experimentally on retrieval tasks derived from Wikipedia that our method strongly outperforms other word based models, including tf-idf vector space models, LSI, query expansion, margin rank perceptrons and Hash Kernels.
The rest of this article is organized as follows. In Section 2, we describe our method, Section 3 discusses prior work, and Section 4 describes the experimental study of our method.

2 Polynomial Semantic Indexing

Let us denote the set of documents in the corpus as {d_t}_{t=1}^ℓ ⊂ R^D and a query text as q ∈ R^D, where D is the dictionary size, and the jth dimension of a vector indicates the frequency of occurrence of the jth word, e.g. using the tf-idf weighting and then normalizing to unit length.
Given a query q and a document d we wish to learn a (nonlinear) function f(q, d) that returns a score measuring the relevance of d given q.
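To make the vector representation concrete, the following is a minimal dense sketch of building unit-normalized tf-idf document vectors as described above (the helper name `tfidf_vectors` and the exact idf variant are illustrative assumptions; a realistic corpus would use sparse representations):

```python
import numpy as np

def tfidf_vectors(docs_as_word_ids, D):
    """Build unit-normalized tf-idf vectors for a corpus.

    docs_as_word_ids: list of documents, each a list of word indices in [0, D).
    Returns an array of shape (n_docs, D) with rows normalized to unit length.
    """
    n = len(docs_as_word_ids)
    tf = np.zeros((n, D))
    for i, doc in enumerate(docs_as_word_ids):
        for w in doc:
            tf[i, w] += 1.0                      # raw term frequency
    df = np.count_nonzero(tf, axis=0)            # document frequency per word
    idf = np.log(n / np.maximum(df, 1))          # one common idf variant
    vecs = tf * idf
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-12)       # normalize to unit length
```

With such vectors, the cosine similarity of two documents is simply their dot product, which is the baseline similarity the paper starts from.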
Let us first consider the naive approach of concatenating (q, d) into a single vector and using f(q, d) = w^T [q, d] as a linear ranking model. This clearly does not learn anything useful, as it would result in the same document ordering for any query, given fixed parameters w. However, considering a polynomial model

f(q, d) = w^T Φ_k([q, d]),

where Φ_k(·) is a feature map that considers all possible degree-k terms

Φ_k(x_1, . . . , x_D) = ⟨x_{i_1} · · · x_{i_k} : 1 ≤ i_1, . . . , i_k ≤ D⟩,

does render a useful discriminative model. For example, for degree k = 2 we obtain

f(q, d) = Σ_{ij} w¹_{ij} q_i q_j + Σ_{ij} w²_{ij} d_i q_j + Σ_{ij} w³_{ij} d_i d_j,

where w has been rewritten as w¹ ∈ R^{D×D}, w² ∈ R^{D×D} and w³ ∈ R^{D×D}. The ranking order of documents d given a fixed query q is independent of w¹, and the value of the term with w³ is independent of the query, so in the following we will consider models containing only terms with both q and d. In particular, we will consider the following degree k = 2 model:

f²(q, d) = Σ_{i,j=1}^{D} W_{ij} q_i d_j = q^T W d,        (1)

where W ∈ R^{D×D}, and the degree k = 3 model:

f³(q, d) = Σ_{i,j,k=1}^{D} W_{ijk} q_i d_j d_k + f²(q, d).        (2)

Note that if W is an identity matrix in equation (1), we obtain the cosine similarity with tf-idf weighting. When other weights are nonzero this model can capture synonymy and polysemy, as it looks at all possible cross terms, which can be tuned directly for the task of interest during training: for example, the value of W_{ij} corresponding to the related words "jagger" in the query and "stones" in the target could be given a large value during training.
The degree k = 3 model goes one stage further and can upweight W_{ijk} for the triple "jagger", "stones" and "rolling" and can downweight the triple "jagger", "gem" and "stones". Note that we do not necessarily require preprocessing methods such as stemming here, since these models can already match words with common stems (if it is useful for the task). Note also that in equation (2) we could just as easily have considered pairs of words in the query (rather than the document) as well.
Unfortunately, using such polynomial models is clearly infeasible for several reasons. Firstly, it will hardly be possible to fit W in memory for realistic tasks. If the dictionary size is D = 30,000, then for k = 2 this requires 3.4GB of RAM (assuming floats), and if the dictionary size is 2.5 million (as it will be in our experiments in Section 4) this amounts to 14.5TB. For k = 3 this is even worse. Besides memory requirements, the huge number of parameters can of course also affect the generalization ability of this model.
We thus propose a low-rank (but diagonal preserving) approximation of these models, which leads to capacity control, faster computation speed and a smaller memory footprint. For k = 2 we propose to replace W with W̄, where

W̄_{ij} = (U^T V)_{ij} + I_{ij} = Σ_l U_{li} V_{lj} + I_{ij}.

Plugging this into equation (1) yields:

f²_LR(q, d) = q^T (U^T V + I) d                          (3)
            = Σ_{i=1}^{N} (Uq)_i (Vd)_i + q^T d.         (4)

Here, U and V are N × D matrices. Before looking at higher degree polynomials, let us first analyze this case. This induces an N-dimensional "latent concept" space in a way similar to LSI. However, it is different in several ways:

• First, and most importantly, we advocate training from a supervised signal using preference relations (ranking constraints).
• Further, U and V differ, so it is not assumed that the query and target document should be embedded in the same way. This can hence model the case where the query text distribution is very different from the document text distribution, e.g. queries are often short and have different word occurrence and co-occurrence statistics. In the extreme case of cross-language retrieval, query and target texts are in different languages yet are naturally modeled in this setup.
• Finally, the addition of the identity term means this model automatically learns the trade-off between using the low dimensional space and a classical vector space model. This is important because the diagonal of the W matrix gives the specificity of picking out when a word co-occurs in both documents (indeed, setting W = I is equivalent to cosine similarity using tf-idf). The matrix I is full rank and therefore cannot be approximated with the low-rank model U^T V, so our model combines both terms in the approximation.

However, the efficiency and memory footprint are as favorable as LSI. Typically, one caches the N-dimensional representation for each document to use at query time.
For higher degree polynomials, e.g. k = 3, one can perform a similar approximation. Indeed, W_{ijk} is approximated with

W̄_{ijk} = Σ_l U_{li} V_{lj} Y_{lk},

where U, V and Y are N × D. When adding the diagonal preserving term and the lower order terms from the k = 2 polynomial, we obtain

f³_LR(q, d) = Σ_{i=1}^{N} (Uq)_i (Vd)_i (Yd)_i + f²_LR(q, d).

Clearly, we can approximate any degree k polynomial using a product of k linear embeddings in such a scheme. Note that at test time one can again cache an N-dimensional representation for each document, by computing the product between the V and Y terms, and is then still left with only N multiplications per document for the embedding term at query time.
Interestingly, one can view this model as a "product of experts": the document is projected twice, i.e. by two experts V and Y, and the training will force them to focus on different aspects.

2.1 Training

Training such models could take many forms. In this paper we adopt the typical "learning to rank" setup [17, 20]. Suppose we are given a set of tuples R (labeled data), where each tuple contains a query q, a relevant document d+ and a non-relevant (or lower ranked) document d−. We would like to choose W such that f(q, d+) > f(q, d−), that is, d+ should be ranked higher than d−. We thus employ the margin ranking loss [17], which has already been used in several IR methods before [20, 5, 14], and minimize:

Σ_{(q, d+, d−) ∈ R} max(0, 1 − f(q, d+) + f(q, d−)).        (5)

We train this using stochastic gradient descent (see, e.g. [5]): iteratively, one picks a random tuple and makes a gradient step for that tuple. We choose the (fixed) learning rate which minimizes the training error. Convergence (or early stopping) is assessed with a validation set. Stochastic training is highly scalable and is easy to implement for our model. For example, for k = 2, one makes the following updates:

U ← U + λ V (d+ − d−) q^T,    if 1 − f²_LR(q, d+) + f²_LR(q, d−) > 0
V ← V + λ U q (d+ − d−)^T,    if 1 − f²_LR(q, d+) + f²_LR(q, d−) > 0.

Clearly, it is important to exploit the sparsity of q and d when calculating these updates.
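For concreteness, the degree k = 2 score and its margin-ranking SGD step can be sketched in NumPy as follows (the sizes, seed and learning rate are illustrative, and a practical implementation would exploit the sparsity of q and d rather than use dense vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 1000, 50      # dictionary size and embedding dimension (illustrative)
lam = 0.1            # fixed learning rate (illustrative)

# U, V drawn from a standard normal (illustrative initialization)
U = rng.standard_normal((N, D))
V = rng.standard_normal((N, D))

def f2_lr(q, d):
    """Degree k = 2 low-rank score, eqs. (3)-(4): q^T (U^T V + I) d."""
    return (U @ q) @ (V @ d) + q @ d

def sgd_step(q, d_pos, d_neg):
    """One stochastic update of the margin ranking loss, eq. (5),
    applied only when the tuple (q, d+, d-) violates the margin."""
    global U, V
    if 1.0 - f2_lr(q, d_pos) + f2_lr(q, d_neg) > 0:
        diff = d_pos - d_neg
        # simultaneous update: both gradients use the current U and V
        U_new = U + lam * np.outer(V @ diff, q)
        V_new = V + lam * np.outer(U @ q, diff)
        U, V = U_new, V_new
```

At test time one would precompute and cache V @ d for every document in the database, so that scoring a query against the whole collection costs one U @ q projection plus N multiplications per document.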
In our experiments we initialized the matrices U and V randomly using a normal distribution with mean zero and standard deviation one. The gradients for k = 3 are similar.
Note that researchers have also explored optimizing alternative loss functions other than the ranking loss, including directly optimizing normalized discounted cumulative gain (NDCG) and mean average precision (MAP) [5, 4, 6, 28]. In fact, one could use those optimization strategies to train our models instead of optimizing the ranking loss. One could also just as easily use them in unsupervised learning such as LSI, e.g. by stochastic gradient descent on the reconstruction error.

3 Prior Work

Joachims [20] trained an SVM with hand-designed features based on the title, body, search engine rankings and the URL. Burges et al. [5] proposed a neural network method using a similar set of features (569 in total). As described before, in contrast, we limited ourselves to body text (not using title, URL, etc.) and trained on millions of features based on these words.
The authors of [15] used a model similar to the naive full rank model (1), but for the task of image retrieval, and [13] also used a related (regression-based) method for advert placement. These techniques are implemented in software related to these two publications, PAMIR1 and Vowpal Wabbit2. When the memory usage is too large, the latter bins the features randomly into a reduced space (hence with random collisions), a technique called Hash Kernels [25]. In all these cases, neither the task of document retrieval nor the use of low-rank approximation or polynomial features is studied. The current work generalizes and extends the Supervised Semantic Indexing approach [1] to general polynomial models.
Another related area of research is distance metric learning [27, 19, 12]. Methods like LMNN [27] also learn a model similar to the naive full rank model (1), i.e.
with the full matrix W (but not with our improvements of this model that make it tractable for word features). They impose the constraint during the optimization that W be a positive semidefinite matrix. Their method has substantial computational cost: for example, even after considerable optimization of the algorithm, it still takes 3.5 hours to train on 60,000 examples and 169 features (a pre-processed version of MNIST). This would hence not be scalable for large-scale text ranking experiments. Nevertheless, [7] compared LMNN [27], LEGO [19] and MCML [12] to a stochastic gradient method with a full matrix W (identical to the model (1)) on a small image ranking task, and reported in fact that the stochastic method provides both improved results and efficiency. Our method, on the other hand, both outperforms models like (1) and is feasible for word features, when (1) is not.

1 http://www.idiap.ch/pamir/
2 http://hunch.net/~vw/

A tf-idf vector space model and LSI [8] are two standard baselines we will also compare to. We already mentioned pLSA [18] and LDA [2]; both have scalability problems and are not reported to generally outperform LSA and TF-IDF [11]. Query Expansion, often referred to as blind relevance feedback, is another way to deal with synonyms, but requires manual tuning and does not always yield a consistent improvement [29].
Several authors [23, 21] have proposed interesting nonlinear versions of unsupervised LSI using neural networks and showed they outperform LSI or pLSA. However, in the case of [23] we note their method is rather slow, and a dictionary size of only 2000 was used. A supervised method for LDA (sLDA) [3] has also been proposed, where a set of auxiliary labels are trained jointly with the unsupervised task.
This provides supervision at the document level (via a class label or regression value), which is not a learning-to-rank task, whereas here we study supervision at the (query, documents) level. The authors of [10] proposed "Explicit Semantic Analysis", which represents the meaning of texts in a high-dimensional space of concepts by building a feature space derived from the human-organized knowledge of an encyclopedia, e.g. Wikipedia. In the new space, cosine similarity is applied. Our method could be applied to such feature representations as well, so that they are no longer agnostic to a particular supervised task.
As we will also evaluate our model on cross-language retrieval, we briefly mention methods previously applied to this problem. These include first applying machine translation and then a conventional retrieval method such as LSI [16], a direct method of applying LSI to this task called CL-LSI [9], or using Kernel Canonical Correlation Analysis, KCCA [26]. While the latter is a strongly performing method, it also suffers from scalability problems.

4 Experimental Study

Learning a model of term correlations over a large vocabulary is a considerable challenge that requires a large amount of training data. Standard retrieval datasets like TREC3 or LETOR [22] contain only a few hundred training queries, and are hence too small for that purpose. Moreover, some datasets only provide pre-processed features like tf, idf or BM25, and not the actual words. Click-through data from web search engines could provide valuable supervision. However, such data is not publicly available.
We hence conducted experiments on Wikipedia and used links within Wikipedia to build a large-scale ranking task.
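Building the training set R from a hyperlink graph as described above can be sketched as follows (the helper name and the uniform negative-sampling scheme are illustrative assumptions; the paper only specifies that tuples contain a linked document d+ and a lower-ranked document d−):

```python
import random

def sample_triples(links, all_doc_ids, n_triples, seed=0):
    """Sample (q, d+, d-) training tuples from a hyperlink graph.

    links: dict mapping a document id to the set of ids it links to.
    d+ is a document linked from q; d- is a random unlinked document,
    presumed to be lower ranked.
    """
    rng = random.Random(seed)
    queries = [q for q, targets in links.items() if targets]
    triples = []
    while len(triples) < n_triples:
        q = rng.choice(queries)
        d_pos = rng.choice(sorted(links[q]))     # a linked (relevant) document
        d_neg = rng.choice(all_doc_ids)          # a random candidate negative
        if d_neg != q and d_neg not in links[q]:
            triples.append((q, d_pos, d_neg))
    return triples
```

Each sampled tuple then feeds one stochastic gradient step of the margin ranking loss of Section 2.1.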
We considered several tasks: document-document and query-document retrieval described in Section 4.1, and cross-language document-document retrieval described in Section 4.2. In these experiments we compared our approach, Polynomial Semantic Indexing (PSI), to the following methods: tf-idf + cosine similarity (TFIDF), Query Expansion (QE), LSI4, αLSI + (1 − α)TFIDF, and the margin ranking perceptron and Hash Kernels with hash size h using model (1). Query Expansion involves applying TFIDF and then adding the mean vector β Σ_{i=1}^{E} d_{r_i} of the top E retrieved documents to the query, and applying TFIDF again. For all methods, hyperparameters such as the embedding dimension N ∈ {50, 100, 200, 500, 1000}, h ∈ {1M, 3M, 6M}, α, β and E were chosen using a validation set.
For each method, we measured the ranking loss (the percentage of tuples in R that are incorrectly ordered), precision P(n) at position n = 10 (P@10) and the mean average precision (MAP), as well as their standard deviations. For computational reasons, MAP and P@10 were measured by averaging over a fixed set of 1000 test queries, and the true test links and random subsets of 10,000 documents were used as the database, rather than the whole testing set. The ranking loss is measured using 100,000 testing tuples.

3 http://trec.nist.gov/
4 We use the SVDLIBC software http://tedlab.mit.edu/~dr/svdlibc/ and the cosine distance in the latent concept space.

Table 1: Document-document ranking results on Wikipedia (limited dictionary size of 30,000 words). Polynomial Semantic Indexing (PSI) outperforms all baselines, and performs better with higher degree k = 3.

Algorithm                              Rank-Loss   MAP           P@10
TFIDF                                  1.62%       0.329±0.010   0.163±0.006
QE                                     1.62%       0.330±0.010   0.163±0.006
LSI                                    4.79%       0.158±0.006   0.098±0.005
αLSI + (1 − α)TFIDF                    1.28%       0.346±0.011   0.170±0.007
Margin Ranking Perceptron using (1)    0.41%       0.477±0.011   0.212±0.007
PSI (k = 2)                            0.30%       0.517±0.011   0.229±0.007
PSI (k = 3)                            0.14%       0.539±0.011   0.236±0.007

Table 2: Empirical results for document-document ranking on Wikipedia (unlimited dictionary size).

Algorithm                    Rank-Loss   MAP           P@10
TFIDF                        0.842%      0.432±0.012   0.1933±0.007
QE                           0.842%      0.432±0.012   0.1933±0.007
αLSI + (1 − α)TFIDF          0.721%      0.433±0.012   0.193±0.007
Hash Kernels using (1)       0.347%      0.485±0.011   0.215±0.007
PSI (k = 2)                  0.158%      0.547±0.012   0.239±0.008
PSI (k = 3)                  0.099%      0.590±0.012   0.249±0.008

Table 3: Empirical results for document-document ranking in two train/test setups: partitioning into train+test sets of links, or into train+test sets of documents with no cross-links (limited dictionary size of 30,000 words). The two setups yield similar results.

Algorithm     Testing Setup            Rank-Loss   MAP           P@10
PSI (k = 2)   Partitioned links        0.407%      0.506±0.012   0.225±0.007
PSI (k = 2)   Partitioned docs+links   0.401%      0.503±0.010   0.225±0.006

Table 4: Empirical results for query-document ranking on Wikipedia where the query has n keywords (this experiment uses a limited dictionary size of 30,000 words). For each n we measure the ranking loss, MAP and P@10 metrics.

                       n = 5                    n = 10                   n = 20
Algorithm              Rank    MAP     P@10     Rank    MAP     P@10     Rank    MAP     P@10
TFIDF                  21.6%   0.047   0.023    14.0%   0.083   0.035    9.14%   0.128   0.054
αLSI + (1 − α)TFIDF    14.2%   0.049   0.023    9.73%   0.089   0.037    6.36%   0.133   0.059
PSI (k = 2)            4.37%   0.166   0.083    2.91%   0.229   0.100    1.80%   0.302   0.130

4.1 Document Retrieval

We considered a set of 1,828,645 English Wikipedia documents as a database, and split the 24,667,286 links randomly into two portions, 70% for training (plus validation) and 30% for testing.5 We then considered the following task: given a query document q, rank the other documents such that if q links to d then d is highly ranked.
In our first experiment we constrained all methods to use only the top 30,000 most frequent words. This allowed us to compare to a margin ranking perceptron using model (1), which would otherwise not fit in memory. For our approach, Polynomial Semantic Indexing (PSI), we report results for degrees k = 2 and k = 3. Results on the test set are given in Table 1. Both variants of our method PSI strongly outperform the existing techniques. The margin rank perceptron using (1) can be seen as a full rank version of PSI for k = 2 (with W unconstrained) but is outperformed by its low-rank counterpart – probably because it has too much capacity. Degree k = 3 outperforms k = 2, indicating that the higher order nonlinearities captured provide better ranking scores. For LSI and PSI, embedding dimension N = 200 worked best, but other values gave similar results.
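The evaluation protocol described earlier (ranking loss, P@10 and MAP) can be sketched as follows (hypothetical helper names; MAP is the mean of average_precision over the fixed query set):

```python
import numpy as np

def ranking_loss(f, triples):
    """Fraction of (q, d+, d-) tuples that are mis-ordered by score f."""
    return np.mean([f(q, dp) <= f(q, dn) for q, dp, dn in triples])

def precision_at_k(ranked_ids, relevant_ids, k=10):
    """P@k: fraction of the top k retrieved documents that are relevant."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / k

def average_precision(ranked_ids, relevant_ids):
    """Average of precision values at the ranks of the relevant documents."""
    relevant = set(relevant_ids)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / max(len(relevant), 1)
```

These are standard IR definitions; the paper's exact tie-breaking and truncation conventions are not specified, so treat this as a sketch.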
In terms of other techniques, LSI is slightly better than TFIDF, but QE in this case does not improve much over TFIDF, perhaps because of the difficulty of this task (too often there may be many irrelevant documents among the top E documents initially retrieved for QE to help).

5 We removed links to calendar years as they provide little information while being very frequent.

Table 5: The closest five words in the document embedding space to some example query words.

cat
veterinarian
computer
york
programming windows
console

kitten
vet
ibm
nyc
c++
xbox
beatles mccartney
britney

game
lennon
album

spears

cats
animals
veterinary medicine
company
new

technology
manhattan
mac
games
song
music

dogs
animal
data
brooklyn
linux

species
animals
software
city
unix
microsoft windows
harrison
band
pop
her

In our second experiment we no longer constrained methods to a fixed dictionary size, so all 2.5 million words are used. In this setting we compare to Hash Kernels, which can deal with these dictionary sizes. The results, given in Table 2, show the same trends, indicating that the dictionary size restriction in the previous experiment did not bias the results in favor of any one algorithm. Note also that, as a page has on average just over 3 test set links to other pages, the maximum P@10 one can achieve in this case is 0.31.
In some cases, one might be worried that our experimental setup has split training and testing data only by partitioning the links, but not the documents; hence the performance of our model when new unseen documents are added to the database might be in question.
We therefore also tested an experimental setup where the test set of documents is completely separate from the training set of documents, by removing all training set links between training and testing documents. In fact, this does not alter the performance significantly, as shown in Table 3.

Query-Document Ranking  So far, our evaluation uses whole Wikipedia articles as queries. One might wonder if the reported improvements also hold in a setup where queries consist of only a few keywords. We thus also tested our approach in this setup. We used the same setup as before, but we constructed queries by keeping only n random words from query documents in an attempt to mimic a "keyword search". Table 4 reports the results for keyword queries of length n = 5, 10 and 20. PSI yields improvements over the baselines similar to those in the document-document retrieval case.

Word Embedding  The document embedding V d in equation (3) (similarly for the query embedding Uq) can be viewed as V d = Σ_i V_{·i} d_i, in which each column V_{·i} is the embedding of the ith word, weighted by d_i. It is natural that semantically similar words are more likely to have similar embeddings. Table 5 shows a few examples. The first column contains query words; on the right are the 5 words with smallest Euclidean distance in the embedded space. We can see that they are quite relevant.

4.2 Cross Language Document Retrieval

Cross Language Retrieval [16] is the task of retrieving documents in a target language E given a query in a different source language F. For example, Google provides such a service6. This is an interesting case for word-based learning-to-rank models, which can naturally deal with this task without the need for machine translation, as they directly learn the correspondence between the two languages from bilingual labeled data in the form of tuples R.
The use of a non-symmetric low-rank model like (3) also naturally suits this task (however, in this case adding the identity does not make sense). We therefore also provide a case study in this setting.
We thus considered the same set of 1,828,645 English Wikipedia documents and a set of 846,582 Japanese Wikipedia documents, where 135,737 of the documents are known to be about the same concept as a corresponding English page (this information can be found in the wiki mark-up provided in a Wikipedia dump). For example, the page about "Microsoft" can be found in both English and Japanese, and the two are cross-referenced. These pairs are referred to as "mates" in the literature (see, e.g. [9]).
We then consider a cross-language retrieval task that is analogous to the task in Section 4.1: given a Japanese query document q_Jap that is the mate of the English document q_Eng, rank the English documents so that the documents linked to q_Eng appear above the others. The document q_Eng is removed and not considered during training or testing. The dataset is split into train/test as before.
The first type of baseline we considered is based on machine translation. We used a machine translation tool on the Japanese query, and then applied TFIDF or LSI. We considered three methods of machine translation: Google's API7 or Fujitsu's ATLAS8 was used to translate each query document, or we translated each word in the Japanese dictionary using ATLAS and then applied this word-based translation to a query. We also compared to CL-LSI [9], trained on all 90,000 Jap-Eng pairs from the training set.
For PSI, we considered two cases: (i) apply the ATLAS machine translation tool first, and then use PSI trained on the task in Section 4.1, i.e. the model given in equation (3) (PSI_EngEng), which was trained on English queries and English target documents; or (ii) train PSI directly with Japanese queries and English target documents using model (3) without the identity, which we call PSI_JapEng. We use degree k = 2 for PSI (trying k = 3 would have been interesting, but we have not performed this experiment). The results are given in Table 6.

6 http://translate.google.com/translate_s

Table 6: Cross-lingual Japanese document-English document ranking (limited dictionary size of 30,000 words).

Algorithm                                          Rank-Loss   MAP           P@10
TFIDF_EngEng (Google translated queries)           4.78%       0.319±0.009   0.259±0.008
TFIDF_EngEng (ATLAS word-based translation)        8.27%       0.115±0.005   0.103±0.005
TFIDF_EngEng (ATLAS translated queries)            4.83%       0.290±0.008   0.243±0.008
LSI_EngEng (ATLAS translated queries)              7.54%       0.169±0.007   0.150±0.007
αLSI_EngEng(ATLAS) + (1 − α)TFIDF_EngEng(ATLAS)    3.71%       0.300±0.008   0.253±0.008
CL-LSI_JapEng                                      9.29%       0.190±0.007   0.161±0.007
αCL-LSI_JapEng + (1 − α)TFIDF_EngEng(ATLAS)        3.31%       0.275±0.009   0.212±0.008
PSI_EngEng(ATLAS)                                  1.72%       0.399±0.009   0.325±0.009
PSI_JapEng                                         0.96%       0.438±0.009   0.351±0.009
αPSI_JapEng + (1 − α)TFIDF_EngEng(ATLAS)           0.75%       0.493±0.009   0.377±0.009
αPSI_JapEng + (1 − α)PSI_EngEng(ATLAS)             0.63%       0.524±0.009   0.386±0.009
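Dropping the identity term gives a two-tower scorer over two separate vocabularies, as used for PSI_JapEng; a minimal sketch (sizes and initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_jap, D_eng = 50, 800, 1000       # illustrative embedding and dictionary sizes
U = rng.standard_normal((N, D_jap))   # embeds Japanese query vectors
V = rng.standard_normal((N, D_eng))   # embeds English document vectors

def f_cross(q_jap, d_eng):
    """Cross-language score q^T U^T V d, with no identity term, since the
    query and the document live in different vocabularies."""
    return (U @ q_jap) @ (V @ d_eng)
```

Because the two towers never share a vocabulary, the model learns the Japanese-English correspondence directly from the training tuples, with no machine translation step.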
The dictionary size was again limited to the 30,000 most frequent words in both languages for ease of comparison with CL-LSI. TFIDF using the three translation methods gave relatively similar results. Using LSI or CL-LSI slightly improved these results, depending on the metric. Machine translation followed by PSI_EngEng outperformed all these methods; however, the direct PSI_JapEng, which required no machine translation tool at all, improved results even further. We conjecture that this is because translation mistakes generate noisy features which PSI_JapEng circumvents.
We also considered combining PSI_JapEng with TFIDF or PSI_EngEng using a mixing parameter α, and this provided further gains at the expense of requiring a machine translation tool.
Note that many cross-lingual experiments, e.g. [9], typically measure the performance of finding a "mate", the same document in another language, whereas our experiment tries to model a query-based retrieval task. We also performed an experiment in the mate-finding setting. In this case, PSI achieves a ranking error of 0.53%, and CL-LSI achieves 0.81%.

5 Conclusion

We described a versatile, powerful set of discriminatively trained models for document ranking based on polynomial features over words, made feasible with a low-rank (but diagonal preserving) approximation. Many generalizations are possible: adding more features into our model, using other choices of loss function, and exploring the use of the same models for tasks other than document retrieval, for example applying these models to ranking images rather than text, or to classification, rather than ranking, tasks.

7 http://code.google.com/p/google-api-translate-java/
8 http://www.fujitsu.com/global/services/software/translation/atlas/

References
[1] B. Bai, J. Weston, R. Collobert, and D. Grangier. Supervised Semantic Indexing.
In European Conference on Information Retrieval, pages 761–765, 2009.

[2] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[3] D. M. Blei and J. D. McAuliffe. Supervised topic models. In NIPS, pages 121–128, 2007.

[4] C. Burges, R. Ragno, and Q. Le. Learning to rank with nonsmooth cost functions. In NIPS, pages 193–200, 2007.

[5] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML, pages 89–96, 2005.

[6] Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In ICML, pages 129–136, 2007.

[7] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. Large scale online learning of image similarity through ranking. In (Snowbird) Learning Workshop, 2009.

[8] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391–407, 1990.

[9] S. Dumais, T. Letsche, M. Littman, and T. Landauer. Automatic cross-language retrieval using latent semantic indexing. In AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, pages 15–21, 1997.

[10] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI, pages 1606–1611, 2007.

[11] P. Gehler, A. Holub, and M. Welling. The rate adapting Poisson (RAP) model for information retrieval and object recognition. In ICML, pages 337–344, 2006.

[12] A. Globerson and S. Roweis. Visualizing pairwise similarity via semidefinite programming. In AISTATS, 2007.

[13] S. Goel, J. Langford, and A. Strehl. Predictive indexing for fast search. In NIPS, pages 505–512, 2008.

[14] D. Grangier and S. Bengio. Inferring document similarity from hyperlinks. In CIKM, pages 359–360, 2005.

[15] D. Grangier and S. Bengio. A discriminative kernel-based approach to rank images from text queries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8):1371–1384, 2008.

[16] G. Grefenstette. Cross-Language Information Retrieval. Kluwer, Norwell, MA, USA, 1998.

[17] R. Herbrich, T. Graepel, and K. Obermayer. Advances in Large Margin Classifiers, chapter Large margin rank boundaries for ordinal regression. MIT Press, Cambridge, MA, 2000.

[18] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR, pages 50–57, 1999.

[19] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman. Online metric learning and fast similarity search. In NIPS, pages 761–768, 2008.

[20] T. Joachims. Optimizing search engines using clickthrough data. In SIGKDD, pages 133–142, 2002.

[21] M. Keller and S. Bengio. A neural network for text representation. In International Conference on Artificial Neural Networks, 2005. IDIAP-RR 05-12.

[22] T. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In Proceedings of the SIGIR 2007 Workshop on Learning to Rank, 2007.

[23] R. Salakhutdinov and G. Hinton. Semantic hashing. In Proceedings of the SIGIR Workshop on Information Retrieval and Applications of Graphical Models, 2007.

[24] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1986.

[25] Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, A. Strehl, and V. Vishwanathan. Hash kernels. In AISTATS, 2009.

[26] A. Vinokourov, J. Shawe-Taylor, and N. Cristianini. Inferring a semantic representation of text via cross-language correlation analysis. In NIPS, pages 1497–1504, 2003.

[27] K. Weinberger and L. Saul. Fast solvers and efficient implementations for distance metric learning. In ICML, pages 1160–1167, 2008.

[28] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In SIGIR, pages 271–278, 2007.

[29] L. Zighelnic and O. Kurland. Query-drift prevention for robust query expansion. In SIGIR, pages 825–826, 2008.