Enhancing Historical Dutch OCR Accuracy with Post-Correction & Synthetic Data

Master Thesis (2023)
Author(s)

T. Eckhardt (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

C.C.S. Liem – Mentor (Multimedia Computing)

C. Lofi – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2023
Language
English
Graduation Date
26-06-2023
Awarding Institution
Delft University of Technology
Programme
Computer Science, Data Science and Technology
Downloads counter
329
Collections
thesis
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This paper presents a novel approach to synthetic data generation for OCR post-correction, using background and font variations tailored to specific time periods. The goal is to use synthetic data to improve text accuracy in digitized historical documents. The proposed three-step process first generates synthetic images that emulate the characteristics of historical documents from different years, incorporating year-specific backgrounds and fonts. These images are then used to create a dataset. Finally, multiple T5 sequence-to-sequence transformers are fine-tuned on the generated dataset. The trained models demonstrate the ability to improve OCR output and bring it closer to the ground-truth text. The effectiveness of the approach is evaluated with several performance metrics, which highlight the benefits of using year-specific synthetic data for training. This work contributes to the field of OCR post-correction by providing a framework for improving the accuracy of OCR systems on historical text.
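
The abstract does not name the performance metrics used. As an illustration only, the sketch below assumes character error rate (CER), a common measure in OCR post-correction, and shows how a corrected text could be compared against the ground truth. The function names and the sample strings are hypothetical and are not taken from the thesis.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute all cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(hypothesis, reference) / len(reference)


def relative_cer_reduction(ocr_text: str, corrected_text: str,
                           ground_truth: str) -> float:
    """Fraction of the original OCR error removed by post-correction.

    Positive values mean the correction helped, negative values mean
    it introduced more errors than it fixed.
    """
    before = cer(ocr_text, ground_truth)
    after = cer(corrected_text, ground_truth)
    if before == 0:
        return 0.0  # nothing to correct
    return (before - after) / before


if __name__ == "__main__":
    # Hypothetical example: a long s misread as "f" in the raw OCR output.
    ground_truth = "de oude stad"
    ocr_text = "de oude ftad"
    corrected_text = "de oude stad"
    print(cer(ocr_text, ground_truth))
    print(relative_cer_reduction(ocr_text, corrected_text, ground_truth))
```

Reporting the relative reduction alongside the raw CER makes it visible when a post-correction model degrades text that was already clean, which a single averaged CER can hide.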

Files

Thesis_Thomas_Eckhardt.pdf
(pdf | 19.7 MB)
License info not available