Open Source Handwritten Text Recognition on Medieval Manuscripts using Mixed Models and Document-Specific Finetuning. Reul, Christian; Tomasek, Stefan; Langhanki, Florian; Springmann, Uwe in 2022 15th IAPR International Workshop on Document Analysis Systems (2022).
This paper deals with the task of practical and open source Handwritten Text Recognition (HTR) on German medieval manuscripts. We report on our efforts to construct mixed recognition models which can be applied out-of-the-box without any further document-specific training but also serve as a starting point for finetuning by training a new model on a few pages of transcribed text (ground truth). To train the mixed models we collected a corpus of 35 manuscripts and ca. 12.5k text lines for two widely used handwriting styles, Gothic and Bastarda cursives. Evaluating the mixed models out-of-the-box on four unseen manuscripts resulted in an average Character Error Rate (CER) of 6.22%. After training on 2, 4 and eventually 32 pages the CER dropped to 3.27%, 2.58%, and 1.65%, respectively. While the in-domain recognition and training of models (Bastarda model to Bastarda material, Gothic to Gothic) unsurprisingly yielded the best results, finetuning out-of-domain models to unseen scripts was still shown to be superior to training from scratch. Our new mixed models have been made openly available to the community.
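For reference, the Character Error Rates (CERs) quoted in these abstracts are edit-distance based. A minimal illustrative sketch of such a computation (not the authors' evaluation code, which may normalise or align differently):

```python
# Minimal CER sketch: Levenshtein edit distance between prediction and
# ground truth, normalised by the ground-truth length. Illustrative only.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def cer(prediction: str, ground_truth: str) -> float:
    """CER = edit distance / number of ground-truth characters."""
    return levenshtein(prediction, ground_truth) / max(len(ground_truth), 1)


print(cer("Gothic cursiue", "Gothic cursive"))  # 1 substitution / 14 chars, about 0.071
```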
Mixed Model OCR Training on Historical Latin Script for Out-of-the-Box Recognition and Finetuning. Reul, Christian; Wick, Christoph; Nöth, Maximilian; Büttner, Andreas; Wehner, Maximilian; Springmann, Uwe in 6th International Workshop on Historical Document Imaging and Processing (2021).
In order to apply Optical Character Recognition (OCR) to historical printings of Latin script fully automatically, we report on our efforts to construct a widely applicable polyfont recognition model yielding text with a Character Error Rate (CER) around 2% when applied out-of-the-box. Moreover, we show how this model can be further finetuned to specific classes of printings with little manual and computational effort. The mixed or polyfont model is trained on a wide variety of materials, in terms of age (from the 15th to the 19th century), typography (various types of Fraktur and Antiqua), and languages (among others, German, Latin, and French). To optimize the results, we combined established techniques of OCR training like pretraining, data augmentation, and voting. In addition, we used various preprocessing methods to enrich the training data and obtain more robust models. We also implemented a two-stage approach which first trains on all available, considerably unbalanced data and then refines the output by training on a selected, more balanced subset. Evaluations on 29 previously unseen books resulted in a CER of 1.73%, outperforming a widely used standard model with a CER of 2.84% by almost 40%. Training a more specialized model for some unseen Early Modern Latin books starting from our mixed model led to a CER of 1.47%, an improvement of up to 50% compared to training from scratch and up to 30% compared to training from the aforementioned standard model. Our new mixed model is made openly available to the community.
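Document-specific finetuning of a mixed model, as described above, amounts to continuing training from the mixed model's weights on a small amount of book-specific ground truth. A rough Keras-style sketch under assumed file names and hyperparameters (Calamari provides its own training tooling; this only illustrates the idea):

```python
# Illustrative finetuning sketch: continue training a mixed (polyfont) model
# on a few pages of book-specific ground truth. File name, learning rate,
# frozen-layer count, and epoch count are assumptions, not the authors' setup.
import tensorflow as tf

mixed_model = tf.keras.models.load_model("mixed_polyfont_model.keras")

# Freeze the early (convolutional) layers so the small ground-truth set
# mainly adapts the recurrent and output layers (a common finetuning heuristic).
for layer in mixed_model.layers[:4]:
    layer.trainable = False

mixed_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                    loss=mixed_model.loss)  # keep the original (e.g. CTC) loss

# book_ds would be a tf.data.Dataset of (line image, encoded transcription)
# pairs built from the transcribed pages of the target document.
# mixed_model.fit(book_ds, epochs=20)
```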
One-Model Ensemble-Learning for Text Recognition of Historical Printings. Wick, Christoph; Reul, Christian in Proceedings of the 16th International Conference on Document Analysis and Recognition ICDAR 2021 (2021).
In this paper, we propose a novel method for Automatic Text Recognition (ATR) on early printed books. Our approach significantly reduces the Character Error Rates (CERs) for book-specific training when only a few lines of Ground Truth (GT) are available and considerably outperforms previous methods. An ensemble of models is trained simultaneously by optimising each one independently but also with respect to a fused output obtained by averaging the individual confidence matrices. Various experiments on five early printed books show that this approach already outperforms the current state-of-the-art by up to 20% and 10% on average. Replacing the averaging of the confidence matrices during prediction with a confidence-based voting boosts our results by an additional 8%, leading to a total average improvement of about 17%.
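A minimal sketch of the fused prediction described in this abstract, assuming each ensemble member outputs a per-frame probability (confidence) matrix: the matrices are averaged and then decoded greedily. The alphabet and values are toy assumptions; the paper's actual pipeline also includes confidence-based voting:

```python
# Fuse ensemble confidence matrices by averaging, then decode greedily (CTC).
import numpy as np

def fuse_and_decode(conf_matrices, alphabet, blank=0):
    """conf_matrices: list of (T, C) softmax outputs, one per ensemble member."""
    fused = np.mean(np.stack(conf_matrices), axis=0)   # (T, C) averaged matrix
    best_path = fused.argmax(axis=1)                    # greedy best path
    # Collapse repeats and drop blanks (standard CTC best-path decoding).
    decoded, prev = [], blank
    for label in best_path:
        if label != blank and label != prev:
            decoded.append(alphabet[label])
        prev = label
    return "".join(decoded)

# Toy example with a 3-symbol alphabet (index 0 = CTC blank).
alphabet = ["", "a", "b"]
m1 = np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
m2 = np.array([[0.2, 0.7, 0.1], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]])
print(fuse_and_decode([m1, m2], alphabet))  # -> "ab"
```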
Calamari - A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition. Wick, Christoph; Reul, Christian; Puppe, Frank in Digital Humanities Quarterly (2020). 14(2)
Optical Character Recognition (OCR) on contemporary and historical data is still a focus of many researchers. Historical prints in particular require book-specific trained OCR models to achieve applicable results [Springmann and Lüdeling 2017] [Reul et al. 2017a]. To reduce the human effort for manually annotating ground truth (GT), various techniques such as voting and pretraining have been shown to be very efficient [Reul et al. 2018a] [Reul et al. 2018b]. Calamari is a new open source OCR line recognition software that uses state-of-the-art Deep Neural Networks (DNNs) implemented in Tensorflow and provides native support for techniques such as pretraining and voting. The customizable network architectures, constructed of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers, are trained with the Connectionist Temporal Classification (CTC) algorithm of Graves et al. (2006). Optional usage of a GPU drastically reduces the computation times for both training and prediction. We use two different datasets to compare the performance of Calamari to OCRopy, OCRopus3, and Tesseract 4. Calamari reaches a Character Error Rate (CER) of 0.11% on the UW3 dataset written in modern English and 0.18% on the DTA19 dataset written in German Fraktur, considerably outperforming the results of the existing software.
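A rough sketch of the kind of CNN/LSTM line recogniser described above, written with tf.keras. The layer sizes, line height, and alphabet size are illustrative assumptions, not Calamari's actual default network; training would attach a CTC loss (e.g. tf.nn.ctc_loss) to the per-frame softmax outputs:

```python
# Sketch of a CNN + BiLSTM line recogniser with a per-frame softmax output,
# as typically trained with CTC. All sizes below are assumptions.
import tensorflow as tf

NUM_CLASSES = 100   # alphabet size + 1 for the CTC blank (assumed)
LINE_HEIGHT = 48    # normalised line-image height (assumed)

inputs = tf.keras.Input(shape=(None, LINE_HEIGHT, 1))          # (width, height, 1)
x = tf.keras.layers.Conv2D(40, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.MaxPool2D(pool_size=(2, 2))(x)
x = tf.keras.layers.Conv2D(60, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.MaxPool2D(pool_size=(2, 2))(x)
# Collapse the height axis so each remaining width step becomes one frame.
x = tf.keras.layers.Reshape((-1, (LINE_HEIGHT // 4) * 60))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(200, return_sequences=True))(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.summary()
```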
OCR4all - An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings. Reul, Christian; Christ, Dennis; Hartelt, Alexander; Balbach, Nico; Wehner, Maximilian; Springmann, Uwe; Wick, Christoph; Grundig, Christine; Büttner, Andreas; Puppe, Frank in Applied Sciences (2019). 9(22)
Optical Character Recognition (OCR) on historical printings is a challenging task mainly due to the complexity of the layout and the highly variant typography. Nevertheless, in the last few years, great progress has been made in the area of historical OCR, resulting in several powerful open-source tools for preprocessing, layout analysis and segmentation, character recognition, and post-processing. The drawback of these tools is often their limited applicability by non-technical users like humanist scholars, and in particular the combined use of several tools in a workflow. In this paper, we present an open-source OCR software called OCR4all, which combines state-of-the-art OCR components and continuous model training into a comprehensive workflow. While a variety of materials can already be processed fully automatically, books with more complex layouts require manual intervention by the users. This is mostly due to the fact that the ground truth required for training stronger mixed models (for segmentation as well as text recognition) is not yet available in the desired quantity or quality. To deal with this issue in the short run, OCR4all offers a comfortable GUI that allows error corrections not only in the final output but already in early stages, in order to minimize error propagation. In the long run, this constant manual correction produces large quantities of valuable, high-quality training material, which can be used to improve fully automatic approaches. In addition, extensive configuration capabilities are provided to set the degree of automation of the workflow and to adapt the carefully selected default parameters to specific printings, if necessary. During experiments, the fully automated application on 19th-century novels showed that OCR4all can considerably outperform the commercial state-of-the-art tool ABBYY FineReader on moderate layouts if suitably pretrained mixed OCR models are available. Furthermore, on very complex early printed books, even users with minimal or no experience were able to capture the text with manageable effort and great quality, achieving excellent Character Error Rates (CERs) below 0.5%. The architecture of OCR4all allows the easy integration (or substitution) of newly developed tools for its main components via standardized interfaces like PageXML, thus aiming at continually higher automation for historical printings.
Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification. Reul, Christian; Göttel, Sebastian; Springmann, Uwe; Wick, Christoph; Würzner, Kay-Michael; Puppe, Frank in Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage (2019).
When converting historical lexica into electronic form, the goal is not only to obtain a high-quality OCR result for the text but also to perform a precise automatic recognition of typographical attributes in order to capture the logical structure. For that purpose, we present a method that enables a fine-grained typography classification by training an open source OCR engine on both traditional OCR and typography recognition, and we show how to map the obtained typography information onto the OCR-recognized text output. As a test case, we used a German dictionary (Sanders' "Wörterbuch der Deutschen Sprache") from the 19th century, in which typography carries a particularly complex semantic function. Despite the very challenging material, we achieved a character error rate below 0.4% and a typography recognition that assigns the correct label to close to 99% of the words. In contrast to many existing methods, our novel approach works with real historical data and can deal with frequent typography changes even within lines.
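One way to picture the mapping of typography information onto the recognized text, as described above, is to assign each word the majority label of its characters. The label names and data layout below are illustrative assumptions, not the paper's exact procedure:

```python
# Assign each recognised word the most frequent typography label among its
# characters. Labels and the example line are purely illustrative.
from collections import Counter

def word_typography(chars, labels):
    """chars: recognised characters of one line; labels: one typography label
    per character (e.g. 'fraktur', 'antiqua'). Returns (word, label) pairs."""
    words, cur_chars, cur_labels = [], [], []
    for ch, lab in zip(chars, labels):
        if ch.isspace():
            if cur_chars:
                words.append(("".join(cur_chars),
                              Counter(cur_labels).most_common(1)[0][0]))
                cur_chars, cur_labels = [], []
        else:
            cur_chars.append(ch)
            cur_labels.append(lab)
    if cur_chars:
        words.append(("".join(cur_chars),
                      Counter(cur_labels).most_common(1)[0][0]))
    return words

line = "Haus, n."
labs = ["fraktur"] * 5 + ["antiqua"] * 3
print(word_typography(line, labs))  # -> [('Haus,', 'fraktur'), ('n.', 'antiqua')]
```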
Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning. Reul, Christian; Springmann, Uwe; Wick, Christoph; Puppe, Frank in JLCL: Special Issue on Automatic Text and Layout Recognition (2018). 33(1) 3–24.
We combine three methods which significantly improve the accuracy of OCR models trained on early printed books: (1) The pretraining method utilizes the information stored in already existing models trained on a variety of typesets (mixed models) instead of starting the training from scratch. (2) Performing cross fold training on a single set of ground truth data (line images and their transcriptions) with a single OCR engine (OCRopus) produces a committee whose members then vote for the best outcome, also taking the top-N alternatives and their intrinsic confidence values into account. (3) Following the principle of maximal disagreement, we select additional training lines on which the voters disagree most, expecting them to offer the highest information gain for a subsequent training round (active learning). Evaluations on six early printed books yielded the following results: On average, the combination of pretraining and voting improved the character accuracy by 46% when training five folds starting from the same mixed model. This number rose to 53% when using different models for pretraining, underlining the importance of diverse voters. Incorporating active learning improved the obtained results by another 16% on average (evaluated on three of the six books). Overall, the proposed methods lead to an average error rate of 2.5% when training on only 60 lines. Using a substantial ground truth pool of 1,000 lines brought the error rate down even further, to less than 1% on average.
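The maximal-disagreement selection in step (3) can be illustrated as ranking unlabelled lines by how much the voters' outputs differ and transcribing the top-ranked ones next. The disagreement measure below (mean pairwise dissimilarity via difflib) is an assumption chosen for a self-contained sketch, not necessarily the measure used in the paper:

```python
# Rank lines by how strongly the voters (cross-fold models) disagree and pick
# the top k for the next round of transcription and training.
from difflib import SequenceMatcher
from itertools import combinations

def disagreement(voter_outputs):
    """Mean pairwise dissimilarity between the voters' outputs for one line."""
    pairs = list(combinations(voter_outputs, 2))
    return sum(1 - SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def select_lines(lines_by_id, k):
    """lines_by_id: {line_id: [output of voter 1, output of voter 2, ...]}.
    Returns the ids of the k lines the voters disagree on most."""
    ranked = sorted(lines_by_id, key=lambda lid: disagreement(lines_by_id[lid]),
                    reverse=True)
    return ranked[:k]

lines = {"p3_l07": ["vnnd", "vnnd", "vnnd"],      # full agreement
         "p3_l08": ["gnade", "guade", "gnadc"]}   # voters disagree
print(select_lines(lines, 1))  # -> ['p3_l08']
```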
Improving OCR Accuracy on Early Printed Books by Utilizing Cross Fold Training and Voting. Reul, Christian; Springmann, Uwe; Wick, Christoph; Puppe, Frank in 2018 13th IAPR International Workshop on Document Analysis Systems (2018).
In this paper we introduce a method that significantly reduces the character error rates for OCR text obtained from OCRopus models trained on early printed books. The method uses a combination of cross fold training and confidence-based voting. After allocating the available ground truth to different subsets, several training processes are performed, each resulting in a specific OCR model. The OCR texts generated by these models are then voted on to determine the final output, taking the recognized characters, their alternatives, and the confidence values assigned to each character into consideration. Experiments on seven early printed books show that the proposed method considerably outperforms the standard approach, reducing the number of errors by up to 50% and in some cases even more.
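A greatly simplified sketch of confidence-based voting as described above: each voter contributes a character and a confidence per position, and the character with the highest summed confidence wins. Real voting additionally has to align outputs of different lengths and can consider top-N alternatives; the example assumes pre-aligned, equal-length outputs:

```python
# Toy confidence-based voting over pre-aligned voter outputs.
from collections import defaultdict

def vote(voter_outputs):
    """voter_outputs: list of [(char, confidence), ...], one list per voter,
    all of the same (aligned) length."""
    result = []
    for pos in range(len(voter_outputs[0])):
        scores = defaultdict(float)
        for voter in voter_outputs:
            char, conf = voter[pos]
            scores[char] += conf
        result.append(max(scores, key=scores.get))
    return "".join(result)

# Two voters disagree on the third character; the more confident reading wins.
v1 = [("v", 0.9), ("n", 0.8), ("d", 0.4)]
v2 = [("v", 0.8), ("n", 0.9), ("o", 0.7)]
print(vote([v1, v2]))  # -> "vno"
```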