Open Source Handwritten Text Recognition on Medieval Manuscripts using Mixed Models and Document-Specific Finetuning. Reul, Christian; Tomasek, Stefan; Langhanki, Florian; Springmann, Uwe in 2022 15th IAPR International Workshop on Document Analysis Systems (2022).
This paper deals with the task of practical and open source Handwritten Text Recognition (HTR) on German medieval manuscripts. We report on our efforts to construct mixed recognition models which can be applied out-of-the-box without any further document-specific training but also serve as a starting point for finetuning by training a new model on a few pages of transcribed text (ground truth). To train the mixed models we collected a corpus of 35 manuscripts and ca. 12.5k text lines for two widely used handwriting styles, Gothic and Bastarda cursives. Evaluating the mixed models out-of-the-box on four unseen manuscripts resulted in an average Character Error Rate (CER) of 6.22%. After training on 2, 4 and eventually 32 pages the CER dropped to 3.27%, 2.58%, and 1.65%, respectively. While the in-domain recognition and training of models (Bastarda model to Bastarda material, Gothic to Gothic) unsurprisingly yielded the best results, finetuning out-of-domain models to unseen scripts was still shown to be superior to training from scratch. Our new mixed models have been made openly available to the community.
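For reference, the Character Error Rates (CERs) quoted in these abstracts are edit-distance based. A minimal illustrative sketch of such a computation (not the authors' evaluation code, which may normalise or align differently):

```python
# Minimal CER sketch: Levenshtein edit distance between prediction and
# ground truth, normalised by the ground-truth length. Illustrative only.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def cer(prediction: str, ground_truth: str) -> float:
    """CER = edit distance / number of ground-truth characters."""
    return levenshtein(prediction, ground_truth) / max(len(ground_truth), 1)


print(cer("Gothic cursiue", "Gothic cursive"))  # 1 substitution / 14 chars, about 0.071
```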
Mixed Model OCR Training on Historical Latin Script for Out-of-the-Box Recognition and Finetuning. Reul, Christian; Wick, Christoph; Nöth, Maximilian; Büttner, Andreas; Wehner, Maximilian; Springmann, Uwe in 6th International Workshop on Historical Document Imaging and Processing (2021).
In order to apply Optical Character Recognition (OCR) to historical printings of Latin script fully automatically, we report on our efforts to construct a widely applicable polyfont recognition model yielding text with a Character Error Rate (CER) around 2% when applied out-of-the-box. Moreover, we show how this model can be further finetuned to specific classes of printings with little manual and computational effort. The mixed or polyfont model is trained on a wide variety of materials, in terms of age (from the 15th to the 19th century), typography (various types of Fraktur and Antiqua), and languages (among others, German, Latin, and French). To optimize the results, we combined established techniques of OCR training like pretraining, data augmentation, and voting. In addition, we used various preprocessing methods to enrich the training data and obtain more robust models. We also implemented a two-stage approach which first trains on all available, considerably unbalanced data and then refines the output by training on a selected, more balanced subset. Evaluations on 29 previously unseen books resulted in a CER of 1.73%, outperforming a widely used standard model with a CER of 2.84% by almost 40%. Training a more specialized model for some unseen Early Modern Latin books starting from our mixed model led to a CER of 1.47%, an improvement of up to 50% compared to training from scratch and up to 30% compared to training from the aforementioned standard model. Our new mixed model is made openly available to the community.
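Document-specific finetuning of a mixed model, as described above, amounts to continuing training from the mixed model's weights on a small amount of book-specific ground truth. A rough Keras-style sketch under assumed file names and hyperparameters (Calamari provides its own training tooling; this only illustrates the idea):

```python
# Illustrative finetuning sketch: continue training a mixed (polyfont) model
# on a few pages of book-specific ground truth. File name, learning rate,
# frozen-layer count, and epoch count are assumptions, not the authors' setup.
import tensorflow as tf

mixed_model = tf.keras.models.load_model("mixed_polyfont_model.keras")

# Freeze the early (convolutional) layers so the small ground-truth set
# mainly adapts the recurrent and output layers (a common finetuning heuristic).
for layer in mixed_model.layers[:4]:
    layer.trainable = False

mixed_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                    loss=mixed_model.loss)  # keep the original (e.g. CTC) loss

# book_ds would be a tf.data.Dataset of (line image, encoded transcription)
# pairs built from the transcribed pages of the target document.
# mixed_model.fit(book_ds, epochs=20)
```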
One-Model Ensemble-Learning for Text Recognition of Historical Printings. Wick, Christoph; Reul, Christian in Proceedings of the 16th International Conference on Document Analysis and Recognition ICDAR 2021 (2021).
In this paper, we propose a novel method for Automatic Text Recognition (ATR) on early printed books. Our approach significantly reduces the Character Error Rates (CERs) for book-specific training when only a few lines of Ground Truth (GT) are available and considerably outperforms previous methods. An ensemble of models is trained simultaneously by optimising each one independently but also with respect to a fused output obtained by averaging the individual confidence matrices. Various experiments on five early printed books show that this approach already outperforms the current state-of-the-art by up to 20% and 10% on average. Replacing the averaging of the confidence matrices during prediction with a confidence-based voting boosts our results by an additional 8%, leading to a total average improvement of about 17%.
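A minimal sketch of the fused prediction described in this abstract, assuming each ensemble member outputs a per-frame probability (confidence) matrix: the matrices are averaged and then decoded greedily. The alphabet and values are toy assumptions; the paper's actual pipeline also includes confidence-based voting:

```python
# Fuse ensemble confidence matrices by averaging, then decode greedily (CTC).
import numpy as np

def fuse_and_decode(conf_matrices, alphabet, blank=0):
    """conf_matrices: list of (T, C) softmax outputs, one per ensemble member."""
    fused = np.mean(np.stack(conf_matrices), axis=0)   # (T, C) averaged matrix
    best_path = fused.argmax(axis=1)                    # greedy best path
    # Collapse repeats and drop blanks (standard CTC best-path decoding).
    decoded, prev = [], blank
    for label in best_path:
        if label != blank and label != prev:
            decoded.append(alphabet[label])
        prev = label
    return "".join(decoded)

# Toy example with a 3-symbol alphabet (index 0 = CTC blank).
alphabet = ["", "a", "b"]
m1 = np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
m2 = np.array([[0.2, 0.7, 0.1], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]])
print(fuse_and_decode([m1, m2], alphabet))  # -> "ab"
```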
Calamari - A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition. Wick, Christoph; Reul, Christian; Puppe, Frank in Digital Humanities Quarterly (2020). 14(2)
Optical Character Recognition (OCR) on contemporary and historical data is still a focus of many researchers. Historical prints in particular require book-specific trained OCR models to achieve applicable results [Springmann and Lüdeling 2017] [Reul et al. 2017a]. To reduce the human effort for manually annotating ground truth (GT), various techniques such as voting and pretraining have been shown to be very efficient [Reul et al. 2018a] [Reul et al. 2018b]. Calamari is a new open source OCR line recognition software that uses state-of-the-art Deep Neural Networks (DNNs) implemented in Tensorflow and provides native support for techniques such as pretraining and voting. The customizable network architectures, constructed of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers, are trained with the Connectionist Temporal Classification (CTC) algorithm of Graves et al. (2006). Optional usage of a GPU drastically reduces the computation times for both training and prediction. We use two different datasets to compare the performance of Calamari to OCRopy, OCRopus3, and Tesseract 4. Calamari reaches a Character Error Rate (CER) of 0.11% on the UW3 dataset written in modern English and 0.18% on the DTA19 dataset written in German Fraktur, considerably outperforming the results of the existing software.
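A rough sketch of the kind of CNN/LSTM line recogniser described above, written with tf.keras. The layer sizes, line height, and alphabet size are illustrative assumptions, not Calamari's actual default network; training would attach a CTC loss (e.g. tf.nn.ctc_loss) to the per-frame softmax outputs:

```python
# Sketch of a CNN + BiLSTM line recogniser with a per-frame softmax output,
# as typically trained with CTC. All sizes below are assumptions.
import tensorflow as tf

NUM_CLASSES = 100   # alphabet size + 1 for the CTC blank (assumed)
LINE_HEIGHT = 48    # normalised line-image height (assumed)

inputs = tf.keras.Input(shape=(None, LINE_HEIGHT, 1))          # (width, height, 1)
x = tf.keras.layers.Conv2D(40, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.MaxPool2D(pool_size=(2, 2))(x)
x = tf.keras.layers.Conv2D(60, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.MaxPool2D(pool_size=(2, 2))(x)
# Collapse the height axis so each remaining width step becomes one frame.
x = tf.keras.layers.Reshape((-1, (LINE_HEIGHT // 4) * 60))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(200, return_sequences=True))(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.summary()
```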
OCR4all - An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings. Reul, Christian; Christ, Dennis; Hartelt, Alexander; Balbach, Nico; Wehner, Maximilian; Springmann, Uwe; Wick, Christoph; Grundig, Christine; Büttner, Andreas; Puppe, Frank in Applied Sciences (2019). 9(22)
Optical Character Recognition (OCR) on historical printings is a challenging task mainly due to the complexity of the layout and the highly variant typography. Nevertheless, in the last few years, great progress has been made in the area of historical OCR, resulting in several powerful open-source tools for preprocessing, layout analysis and segmentation, character recognition, and post-processing. The drawback of these tools is often their limited applicability by non-technical users like humanist scholars, and in particular the combined use of several tools in a workflow. In this paper, we present an open-source OCR software called OCR4all, which combines state-of-the-art OCR components and continuous model training into a comprehensive workflow. While a variety of materials can already be processed fully automatically, books with more complex layouts require manual intervention by the users. This is mostly due to the fact that the ground truth required for training stronger mixed models (for segmentation as well as text recognition) is not yet available in the desired quantity or quality. To deal with this issue in the short run, OCR4all offers a comfortable GUI that allows error corrections not only in the final output but already in early stages, in order to minimize error propagation. In the long run, this constant manual correction produces large quantities of valuable, high-quality training material, which can be used to improve fully automatic approaches. In addition, extensive configuration capabilities are provided to set the degree of automation of the workflow and to adapt the carefully selected default parameters to specific printings, if necessary. During experiments, the fully automated application on 19th-century novels showed that OCR4all can considerably outperform the commercial state-of-the-art tool ABBYY FineReader on moderate layouts if suitably pretrained mixed OCR models are available. Furthermore, on very complex early printed books, even users with minimal or no experience were able to capture the text with manageable effort and great quality, achieving excellent Character Error Rates (CERs) below 0.5%. The architecture of OCR4all allows the easy integration (or substitution) of newly developed tools for its main components via standardized interfaces like PageXML, thus aiming at continually higher automation for historical printings.
Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification. Reul, Christian; Göttel, Sebastian; Springmann, Uwe; Wick, Christoph; Würzner, Kay-Michael; Puppe, Frank in Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage (2019).
When converting historical lexica into electronic form, the goal is not only to obtain a high-quality OCR result for the text but also to perform a precise automatic recognition of typographical attributes in order to capture the logical structure. For that purpose, we present a method that enables a fine-grained typography classification by training an open source OCR engine on both traditional OCR and typography recognition, and we show how to map the obtained typography information onto the OCR-recognized text output. As a test case, we used a German dictionary (Sanders' "Wörterbuch der Deutschen Sprache") from the 19th century, in which typography carries a particularly complex semantic function. Despite the very challenging material, we achieved a character error rate below 0.4% and a typography recognition that assigns the correct label to close to 99% of the words. In contrast to many existing methods, our novel approach works with real historical data and can deal with frequent typography changes even within lines.
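One way to picture the mapping of typography information onto the recognized text, as described above, is to assign each word the majority label of its characters. The label names and data layout below are illustrative assumptions, not the paper's exact procedure:

```python
# Assign each recognised word the most frequent typography label among its
# characters. Labels and the example line are purely illustrative.
from collections import Counter

def word_typography(chars, labels):
    """chars: recognised characters of one line; labels: one typography label
    per character (e.g. 'fraktur', 'antiqua'). Returns (word, label) pairs."""
    words, cur_chars, cur_labels = [], [], []
    for ch, lab in zip(chars, labels):
        if ch.isspace():
            if cur_chars:
                words.append(("".join(cur_chars),
                              Counter(cur_labels).most_common(1)[0][0]))
                cur_chars, cur_labels = [], []
        else:
            cur_chars.append(ch)
            cur_labels.append(lab)
    if cur_chars:
        words.append(("".join(cur_chars),
                      Counter(cur_labels).most_common(1)[0][0]))
    return words

line = "Haus, n."
labs = ["fraktur"] * 5 + ["antiqua"] * 3
print(word_typography(line, labs))  # -> [('Haus,', 'fraktur'), ('n.', 'antiqua')]
```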
Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning. Reul, Christian; Springmann, Uwe; Wick, Christoph; Puppe, Frank in JLCL: Special Issue on Automatic Text and Layout Recognition (2018). 33(1) 3–24.
We combine three methods which significantly improve the accuracy of OCR models trained on early printed books: (1) The pretraining method utilizes the information stored in already existing models trained on a variety of typesets (mixed models) instead of starting the training from scratch. (2) Performing cross fold training on a single set of ground truth data (line images and their transcriptions) with a single OCR engine (OCRopus) produces a committee whose members then vote for the best outcome, also taking the top-N alternatives and their intrinsic confidence values into account. (3) Following the principle of maximal disagreement, we select additional training lines on which the voters disagree most, expecting them to offer the highest information gain for a subsequent training round (active learning). Evaluations on six early printed books yielded the following results: On average, the combination of pretraining and voting improved the character accuracy by 46% when training five folds starting from the same mixed model. This number rose to 53% when using different models for pretraining, underlining the importance of diverse voters. Incorporating active learning improved the obtained results by another 16% on average (evaluated on three of the six books). Overall, the proposed methods lead to an average error rate of 2.5% when training on only 60 lines. Using a substantial ground truth pool of 1,000 lines brought the error rate down even further, to less than 1% on average.
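The maximal-disagreement selection in step (3) can be illustrated as ranking unlabelled lines by how much the voters' outputs differ and transcribing the top-ranked ones next. The disagreement measure below (mean pairwise dissimilarity via difflib) is an assumption chosen for a self-contained sketch, not necessarily the measure used in the paper:

```python
# Rank lines by how strongly the voters (cross-fold models) disagree and pick
# the top k for the next round of transcription and training.
from difflib import SequenceMatcher
from itertools import combinations

def disagreement(voter_outputs):
    """Mean pairwise dissimilarity between the voters' outputs for one line."""
    pairs = list(combinations(voter_outputs, 2))
    return sum(1 - SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def select_lines(lines_by_id, k):
    """lines_by_id: {line_id: [output of voter 1, output of voter 2, ...]}.
    Returns the ids of the k lines the voters disagree on most."""
    ranked = sorted(lines_by_id, key=lambda lid: disagreement(lines_by_id[lid]),
                    reverse=True)
    return ranked[:k]

lines = {"p3_l07": ["vnnd", "vnnd", "vnnd"],      # full agreement
         "p3_l08": ["gnade", "guade", "gnadc"]}   # voters disagree
print(select_lines(lines, 1))  # -> ['p3_l08']
```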
Improving OCR Accuracy on Early Printed Books by Utilizing Cross Fold Training and Voting. Reul, Christian; Springmann, Uwe; Wick, Christoph; Puppe, Frank in 2018 13th IAPR International Workshop on Document Analysis Systems (2018).
In this paper we introduce a method that significantly reduces the character error rates for OCR text obtained from OCRopus models trained on early printed books. The method uses a combination of cross fold training and confidence-based voting. After allocating the available ground truth to different subsets, several training processes are performed, each resulting in a specific OCR model. The OCR texts generated by these models are then voted on to determine the final output, taking the recognized characters, their alternatives, and the confidence values assigned to each character into consideration. Experiments on seven early printed books show that the proposed method considerably outperforms the standard approach, reducing the number of errors by up to 50% and in some cases even more.
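A greatly simplified sketch of confidence-based voting as described above: each voter contributes a character and a confidence per position, and the character with the highest summed confidence wins. Real voting additionally has to align outputs of different lengths and can consider top-N alternatives; the example assumes pre-aligned, equal-length outputs:

```python
# Toy confidence-based voting over pre-aligned voter outputs.
from collections import defaultdict

def vote(voter_outputs):
    """voter_outputs: list of [(char, confidence), ...], one list per voter,
    all of the same (aligned) length."""
    result = []
    for pos in range(len(voter_outputs[0])):
        scores = defaultdict(float)
        for voter in voter_outputs:
            char, conf = voter[pos]
            scores[char] += conf
        result.append(max(scores, key=scores.get))
    return "".join(result)

# Two voters disagree on the third character; the more confident reading wins.
v1 = [("v", 0.9), ("n", 0.8), ("d", 0.4)]
v2 = [("v", 0.8), ("n", 0.9), ("o", 0.7)]
print(vote([v1, v2]))  # -> "vno"
```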