Verba Volant, Scripta Manent: Automatic Transcription of Medieval Latin Manuscripts
The transcription of handwritten historical documents into machine-encoded text has always been a difficult and time-consuming task. Much work has been done to ease that burden through software packages that make transcription less tedious and more accessible to non-experts. An automated solution would not only help preserve our cultural heritage but also open the door to using recent advances in artificial intelligence to analyze these documents automatically. We have therefore embarked on a project to automatically transcribe and analyze Medieval Latin manuscripts of literary and liturgical significance.
To that end, we have developed a software platform for collecting pixelwise ground truth data from high-quality scans of the manuscripts. The platform lets the expert classicists and medievalists on our team produce accurate ground truth data.
These pixelwise annotations allow us to apply U-Net, a fully convolutional neural network more commonly used to segment biomedical images, in an attempt to isolate and classify individual characters, even within connected scripts and ligatures.
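To make the idea of pixelwise character labels concrete, here is a minimal sketch of one possible post-processing step. It assumes a segmentation network has already produced a per-pixel class map, and it groups connected pixels of the same class into character regions. The function name `extract_characters`, the `BACKGROUND` label, and the toy label map are all hypothetical illustrations; the abstract does not describe the project's actual post-processing.

```python
from collections import deque

BACKGROUND = 0  # assumed label for non-ink pixels (hypothetical convention)


def extract_characters(label_map):
    """Group 4-connected pixels that share a class label into regions.

    Returns a list of (label, (top, left, bottom, right)) tuples, ordered
    by the row-major position of each region's first pixel.
    """
    height = len(label_map)
    width = len(label_map[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            label = label_map[r][c]
            if label == BACKGROUND or seen[r][c]:
                continue
            # Breadth-first flood fill over pixels with the same label.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and label_map[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append((label, (top, left, bottom, right)))
    return regions


# Toy example: two touching strokes labeled as different character classes.
toy_map = [
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 2, 0, 0],
    [0, 0, 0, 2, 0, 0],
]
characters = extract_characters(toy_map)
```

Because the labels are per pixel, the two touching strokes in the toy map separate into distinct regions even though their ink is connected, which is the property that makes pixelwise annotation attractive for ligatures.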
We have also been building a language model of contemporaneous Latin, which opens the door to more traditional OCR pipelines based on recurrent neural networks and to exploring unsupervised models.
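As a rough illustration of how a language model can support transcription, the sketch below builds a character-bigram model with add-one smoothing and uses it to compare two candidate readings. This is an assumption chosen for brevity: the abstract does not say what kind of language model the project uses, and the two-line corpus and the names `train_bigram_model` and `log_prob` are hypothetical.

```python
import math
from collections import Counter


def train_bigram_model(corpus):
    """Count character bigrams over lines; '^' and '$' mark line boundaries."""
    bigrams = Counter()
    contexts = Counter()
    vocab = set()
    for line in corpus:
        padded = "^" + line.lower() + "$"
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    # The vocabulary size counts every symbol seen, boundary markers included.
    return bigrams, contexts, len(vocab)


def log_prob(text, model):
    """Log-probability of text under the add-one-smoothed bigram model."""
    bigrams, contexts, vocab_size = model
    padded = "^" + text.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
    return total


# Toy corpus (placeholder text, not the project's data).
model = train_bigram_model(["verba volant", "scripta manent"])
plausible = log_prob("verba", model)
implausible = log_prob("vxrba", model)
```

In a pipeline of this kind, the score could be used to rank alternative character sequences proposed by the recognizer, preferring readings that resemble attested Latin.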
Samuel Grieggs, Bingyu Shen, Walter Scheirer
Collaborators: John Nolan, Luke Song, Ivy Wang, Christine Ascik, Erik Ellis, Mihow McKenny, Nikolas Churik, Emily Mahan, Hildegund Muller