Zentrum für Philologie und Digitalität "Kallimachos"

OCR4all

OCR4all is a software tool designed for the digital processing of primarily very early printed texts, whose elaborate typefaces and often irregular layouts are beyond the capabilities of most other recognition software. Easy to understand and usable without outside help, OCR4all’s semi-automatic workflow is also explicitly aimed at users with no technical background, and it combines different tools in one consistent interface. Frequent switching between programs is no longer necessary.

From image preparation (“Preprocessing”) through layout segmentation (“Region Segmentation” with LAREX), line segmentation (“Line Segmentation”) and character recognition (“Recognition” with Calamari) to the correction of the recognized characters (“Ground Truth Production”) and the training of book-specific OCR models, OCR4all covers a full OCR workflow in a single piece of software.

Fig.: Main components of an OCR workflow: original image, image preparation, layout segmentation, character recognition, post correction.
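
The order of the workflow steps described above can be summarized in a small sketch. This is only an illustration of the sequence, not OCR4all's actual code or API: the `Stage` class, the `WORKFLOW` tuple, the `remaining_stages` helper and the label "Model Training" are hypothetical names chosen for this example, while the other stage and tool names are taken from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    """One step of the workflow (hypothetical helper, not part of OCR4all)."""
    name: str
    tool: Optional[str] = None  # tool named in the text, if any


# Stage order as described in the text. "Model Training" is a label assumed
# here for the building of book-specific OCR models.
WORKFLOW = (
    Stage("Preprocessing"),
    Stage("Region Segmentation", tool="LAREX"),
    Stage("Line Segmentation"),
    Stage("Recognition", tool="Calamari"),
    Stage("Ground Truth Production"),
    Stage("Model Training"),
)


def remaining_stages(last_completed: Optional[str]) -> List[str]:
    """Return the names of the stages that still follow the given stage."""
    names = [stage.name for stage in WORKFLOW]
    if last_completed is None:
        return names
    return names[names.index(last_completed) + 1:]


if __name__ == "__main__":
    for stage in WORKFLOW:
        suffix = f" (with {stage.tool})" if stage.tool else ""
        print(f"{stage.name}{suffix}")
```

Calling `remaining_stages("Recognition")` in this sketch would list the correction and training steps that follow character recognition.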

In particular, the ability to train book-specific recognition models enables OCR4all to achieve very good results in digital character recognition.

                    

Fig.: Semantic layout segmentation with LAREX.

Fig.: Text correction in page view (left), line by line (center), and with a virtual keyboard (right).

Latest Release: OCR4all 0.4.0

With the release of OCR4all 0.4.0 in July 2020, the software can now be installed via VirtualBox, which should be much easier for non-technical users. Other notable new features include an upgrade to Calamari 1.0.5, which requires new mixed model ensembles; an automatic project conversion from the "Legacy" working mode to "Latest"; a selection of preconfigured virtual keyboards for different languages in LAREX; the option to switch between binary, grayscale and despeckled images in LAREX; and a checkbox for word-level PageXML generation during recognition, along with many minor bug fixes and optimizations (please see the corresponding release notes for OCR4all and LAREX).

 

Installation guide

An installation via VirtualBox is strongly recommended for non-technical users.

Publications

  • OCR4all - An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings. Reul, Christian; Christ, Dennis; Hartelt, Alexander; Balbach, Nico; Wehner, Maximilian; Springmann, Uwe; Wick, Christoph; Grundig, Christine; Büttner, Andreas; Puppe, Frank in ArXiv Preprints (submitted to MDPI - Applied Sciences) (2019). [PDF]
  • Texterkennungssoftware für historische Drucke. Wehner, Maximilian in KulturBetrieb 25 (2019). [PDF]
  • OCR4all - Eine semi-automatische Open-Source-Software für die OCR historischer Drucke. Wehner, Maximilian; Dahnke, Michael; Landes, Florian; Nasarek, Robert; Reul, Christian in DHd 2020 Spielräume: Digital Humanities zwischen Modellierung und Interpretation. Konferenzabstracts (2020). [PDF]

Media coverage

OCR4all received coverage by the following news outlets and websites: