Jaz Kandola, Nello Cristianini, John Shawe-Taylor
The standard representation of text documents as bags of words suffers from well-known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing, the use of semantic networks, and probabilistic methods. In this paper we propose two methods for inferring such similarity from a corpus. The first defines word similarity in terms of document similarity and vice versa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. We show how the alignment measure can be used to perform model selection over this parameter. Combined with the use of support vector machines, we obtain positive results.
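As a rough illustration of alignment-based model selection over a kernel parameter, the following minimal sketch scores a family of kernels against a label matrix and keeps the best-scoring parameter. It makes several assumptions not stated in the abstract: the alignment is computed as the normalised Frobenius inner product between two Gram matrices, the target matrix is the outer product of the ±1 labels, and the parametrised kernel takes the exponential form K0 exp(λ K0). The function names (`alignment`, `diffusion_kernel`, `select_lambda`), the toy document-term matrix, and the candidate values of λ are all hypothetical, and the sketch should not be read as the authors' exact construction.

```python
import numpy as np

def alignment(K1, K2):
    """Normalised Frobenius inner product between two Gram matrices."""
    num = np.sum(K1 * K2)
    den = np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))
    return num / den

def diffusion_kernel(K0, lam):
    """Assumed parametrised kernel K0 @ expm(lam * K0).

    Computed through the eigendecomposition of the symmetric matrix K0,
    so each eigenvalue w is mapped to w * exp(lam * w).
    """
    w, V = np.linalg.eigh(K0)
    return (V * (w * np.exp(lam * w))) @ V.T

def select_lambda(K0, y, lambdas):
    """Return the lambda whose kernel is best aligned with y y^T, plus all scores."""
    target = np.outer(y, y)
    scores = [alignment(diffusion_kernel(K0, lam), target) for lam in lambdas]
    best = int(np.argmax(scores))
    return lambdas[best], scores

# Toy document-term counts (rows: documents, columns: terms); values are made up.
D = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 1, 2, 1]], dtype=float)
K0 = D @ D.T                      # bag-of-words Gram matrix
y = np.array([1, 1, -1, -1])      # hypothetical class labels

best_lam, scores = select_lambda(K0, y, [0.0, 0.01, 0.05, 0.1])
```

The selected kernel matrix could then be passed to any support vector machine implementation that accepts a precomputed Gram matrix. In practice the alignment would be estimated on training data only, so that the choice of λ does not depend on test labels.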