Zentrum für Philologie und Digitalität "Kallimachos"

Best Paper Award at HIP'21

September 6, 2021

Our paper "Mixed Model OCR Training on Historical Latin Script for Out-of-the-Box Recognition and Fine-tuning" received the Best Paper Award at HIP'21. In cooperation with colleagues from PlanetAI GmbH in Rostock and LMU München, we first collected and curated an extensive corpus of OCR training data covering several languages and almost 450 years of printing history. Building on established best practices combined with new data-side optimizations, we then trained an OCR model that clearly outperformed existing methods when applied out-of-the-box to unseen material and that also proved to be a valuable starting point for further, more targeted training.

Abstract:
"In order to apply Optical Character Recognition (OCR) to historical printings of Latin script fully automatically, we report on our efforts to construct a widely-applicable polyfont recognition model yielding text with a Character Error Rate (CER) around 2% when applied out-of-the-box. Moreover, we show how this model can be further finetuned to specific classes of printings with little manual and computational effort. The mixed or polyfont model is trained on a wide variety of materials, in terms of age (from the 15th to the 19th century), typography (various types of Fraktur and Antiqua), and languages (among others, German, Latin, and French). To optimize the results we combined established techniques of OCR training like pretraining, data augmentation, and voting. In addition, we used various preprocessing methods to enrich the training data and obtain more robust models. We also implemented a two-stage approach which first trains on all available, considerably unbalanced data and then refines the output by training on a selected more balanced subset. Evaluations on 29 previously unseen books resulted in a CER of 1.73%, outperforming a widely used standard model with a CER of 2.84% by almost 40%. Training a more specialized model for some unseen Early Modern Latin books starting from our mixed model led to a CER of 1.47%, an improvement of up to 50% compared to training from scratch and up to 30% compared to training from the aforementioned standard model. Our new mixed model is made openly available to the community."