Ran El-Yaniv, Shai Fine, Naftali Tishby
Classification of finite sequences without explicit knowledge of their statistical nature is a fundamental problem with many important applications. We propose a new information theoretic approach to this problem which is based on the following ingredients: (i) se(cid:173) quences are similar when they are likely to be generated by the same source; (ii) cross entropies can be estimated via "universal compres(cid:173) sion"; (iii) Markovian sequences can be asymptotically-optimally merged. With these ingredients we design a method for the classification of discrete sequences whenever they can be compressed. We introduce the method and illustrate its application for hierarchical clustering of languages and for estimating similarities of protein sequences.