Enhancing Historical Dutch OCR Accuracy with Post-Correction & Synthetic Data
T. Eckhardt (TU Delft - Electrical Engineering, Mathematics and Computer Science)
C.C.S. Liem – Mentor (Multimedia Computing)
C. Lofi – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
This paper presents a novel approach to synthetic data generation for OCR post-correction that uses background and font variations tailored to specific time periods. The goal is to use synthetic data to improve text accuracy in digitized historical documents. The proposed three-step process begins by generating synthetic images that emulate the characteristics of historical documents from different years, incorporating year-specific backgrounds and fonts. A dataset is then created from these images. Finally, multiple T5 sequence-to-sequence transformers are fine-tuned on the generated dataset. The trained models demonstrate the capability to correct OCR output, bringing it closer to the ground-truth text. The effectiveness of the approach is evaluated with several performance metrics, which highlight the benefits of using year-specific synthetic data for training. This work contributes to the field of OCR post-correction by providing a framework for improving the accuracy of OCR systems on historical text.
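The abstract does not name the performance metrics used. As an illustration only, the sketch below computes character error rate (CER), a metric commonly used to compare OCR output against ground-truth text. The function names and the sample strings are assumptions made for this example and are not taken from the thesis.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b; insert, delete and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance normalised by the length of the ground-truth reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


# Hypothetical sample strings, invented for illustration.
ground_truth = "de stad Delft"
raw_ocr = "dc stad Dclft"          # two characters misread
post_corrected = "de stad Delft"   # what an ideal correction model would return

cer_before = character_error_rate(raw_ocr, ground_truth)
cer_after = character_error_rate(post_corrected, ground_truth)
print(f"CER before correction: {cer_before:.3f}")
print(f"CER after correction:  {cer_after:.3f}")
```

Under this metric, a post-correction model helps when the CER of its output is lower than the CER of the raw OCR text it was given.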