Alexei Vinokourov, Nello Cristianini, John Shawe-Taylor
The problem of learning a semantic representation of a text document from data is addressed, in the situation where a corpus of unlabeled paired documents is available, each pair being formed by a short English document and its French translation. This representation can then be used for any retrieval, categorization or clustering task, both in a standard and in a cross-lingual setting. By using kernel functions, in this case simple bag-of-words inner products, each part of the corpus is mapped to a high-dimensional space. The correlations between the two spaces are then learnt by using kernel Canonical Correlation Analysis. A set of directions is found in the first and in the second space that are maximally correlated. Since we assume the two representations are completely independent apart from the semantic content, any correlation between them should reflect some semantic similarity. Certain patterns of English words that relate to a specific meaning should correlate with certain patterns of French words corresponding to the same meaning, across the corpus. Using the semantic representation obtained in this way, we first demonstrate that the correlations detected between the two versions of the corpus are significantly higher than random, and hence that a representation based on such features does capture statistical patterns that should reflect semantic information. We then use this representation in both cross-language and single-language retrieval tasks, observing performance that is consistently and significantly superior to LSI on the same data.
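The pipeline described above (bag-of-words kernels on each language, then kernel CCA to find maximally correlated directions) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes one common regularized formulation of kernel CCA, in which each kernel matrix is shrunk by a ridge term `kappa` and the correlations are the singular values of the resulting matrix. The toy corpus, the helper names `bag_of_words`, `kcca`, and `vectorize`, the value of `kappa`, and the omission of kernel centering are all illustrative assumptions.

```python
import numpy as np

def bag_of_words(docs):
    """Return a (documents x vocabulary) term-count matrix and the word index."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            X[i, index[w]] += 1.0
    return X, index

def kcca(Kx, Ky, kappa=1.0, n_components=2):
    """Regularized kernel CCA (one common variant, assumed here).

    Solves  Kx Ky b = rho (Kx + kappa I)^2 a  and the symmetric equation
    for a, by taking the SVD of (Kx + kappa I)^-1 Kx Ky (Ky + kappa I)^-1.
    """
    n = Kx.shape[0]
    I = np.eye(n)
    Rx = np.linalg.solve(Kx + kappa * I, Kx)  # (Kx + kappa I)^-1 Kx
    Ry = np.linalg.solve(Ky + kappa * I, Ky)  # (Ky + kappa I)^-1 Ky
    T = Rx @ Ry.T                             # Ry.T = Ky (Ky + kappa I)^-1
    U, s, Vt = np.linalg.svd(T)
    # Map the singular vectors back to dual coefficients over training docs.
    alpha = np.linalg.solve(Kx + kappa * I, U[:, :n_components])
    beta = np.linalg.solve(Ky + kappa * I, Vt[:n_components].T)
    return alpha, beta, s[:n_components]

# Hypothetical toy paired corpus (illustration only, not the paper's data).
english = [
    "the parliament votes on the budget",
    "the fishing quota was reduced",
    "budget debate in parliament",
    "fishing boats and quota rules",
]
french = [
    "le parlement vote le budget",
    "le quota de pêche a été réduit",
    "débat sur le budget au parlement",
    "bateaux de pêche et règles de quota",
]

Xe, index_e = bag_of_words(english)
Xf, index_f = bag_of_words(french)
Ke = Xe @ Xe.T  # bag-of-words inner-product kernel, English side
Kf = Xf @ Xf.T  # bag-of-words inner-product kernel, French side

alpha, beta, rho = kcca(Ke, Kf, kappa=1.0, n_components=2)

def vectorize(text, index):
    """Count known words of `text`; words outside the vocabulary are ignored."""
    v = np.zeros(len(index))
    for w in text.lower().split():
        if w in index:
            v[index[w]] += 1.0
    return v

# Cross-lingual retrieval: project an English query and the French documents
# into the shared semantic space, then rank by cosine similarity.
query = vectorize("parliament budget", index_e)
q_proj = (Xe @ query) @ alpha  # kernel values against training docs, then project
f_proj = Kf @ beta             # French training documents in the same space
eps = 1e-12                    # guards against division by zero
scores = (f_proj @ q_proj) / (
    np.linalg.norm(f_proj, axis=1) * np.linalg.norm(q_proj) + eps
)
ranking = np.argsort(-scores)  # French document indices, best match first
print("correlations:", rho)
print("ranking:", ranking)
```

In this formulation the ridge term keeps every reported correlation strictly below 1. Without it, any two full-rank kernel matrices on the same documents would be perfectly correlated, so some regularization is needed for the learnt directions to be meaningful.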