Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)
Jaety Edwards, Yee Teh, Roger Bock, Michael Maire, Grace Vesom, David Forsyth
We describe a method that can make a scanned, handwritten mediaeval Latin manuscript accessible to full text search. A generalized HMM is fitted, using transcribed Latin to obtain a transition model and one example each of 22 letters to obtain an emission model. We show results for unigram, bigram and trigram models. Our method transcribes 25 pages of a manuscript of Terence with fair accuracy (75% of letters correctly transcribed). Search results are very strong; we use examples of variant spellings to demonstrate that the search respects the ink of the document. Furthermore, our model produces fair searches on a document from which we obtained no training data.
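To make the role of the transition model concrete, the following is a minimal sketch, not the method of the paper. It assumes an ordinary HMM with one observation per letter, whereas the abstract describes a generalized HMM, and it assumes the per-position emission log-likelihoods are already available. The function names `fit_bigram` and `viterbi`, the add-alpha smoothing, and the uniform prior on the first letter are all illustrative assumptions.

```python
import math
from collections import Counter


def fit_bigram(text, alphabet, alpha=1.0):
    """Estimate log P(next | prev) from transcribed text.

    Assumption: add-alpha smoothing; characters outside `alphabet`
    (spaces, punctuation) are simply dropped.
    """
    letters = [c for c in text.lower() if c in alphabet]
    pair_counts = Counter(zip(letters, letters[1:]))
    prev_counts = Counter(letters[:-1])
    k = len(alphabet)
    return {
        (a, b): math.log((pair_counts[(a, b)] + alpha)
                         / (prev_counts[a] + alpha * k))
        for a in alphabet
        for b in alphabet
    }


def viterbi(emission_logp, log_trans, alphabet):
    """Return the most probable letter string.

    emission_logp: one dict per position, mapping letter -> log-likelihood
    of the ink at that position (assumed given; hypothetical input format).
    """
    # Uniform prior over the first letter (an assumption of this sketch).
    score = {a: emission_logp[0][a] - math.log(len(alphabet)) for a in alphabet}
    back = []
    for obs in emission_logp[1:]:
        new_score, pointers = {}, {}
        for b in alphabet:
            best_prev = max(alphabet, key=lambda a: score[a] + log_trans[(a, b)])
            new_score[b] = score[best_prev] + log_trans[(best_prev, b)] + obs[b]
            pointers[b] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final letter.
    path = [max(alphabet, key=lambda a: score[a])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


if __name__ == "__main__":
    alphabet = "aet"
    log_trans = fit_bigram("et et et", alphabet)
    # The second position is ambiguous between 'e' and 't' in the ink;
    # the bigram model is what breaks the tie.
    obs = [
        {"a": -10.0, "e": 0.0, "t": -10.0},
        {"a": -10.0, "e": 0.0, "t": 0.0},
    ]
    print(viterbi(obs, log_trans, alphabet))
```

In this toy setting the emission scores alone cannot decide the second letter, so the decoded string depends on the letter statistics of the transcribed text. A unigram or trigram variant would change only how `log_trans` is estimated and indexed.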