Enhancing Historical Dutch OCR Accuracy with Post-Correction & Synthetic Data

Master thesis (2023)

Authors

T. Eckhardt (Electrical Engineering, Mathematics and Computer Science)

Contributors

C.C.S. Liem (mentor)

C. Lofi, Web Information Systems (graduation committee member)

Faculty

Electrical Engineering, Mathematics and Computer Science

To reference this document use:

http://resolver.tudelft.nl/uuid:549c8ae6-dc15-4708-9c0b-b7afd4d021fa

Published Date

26-06-2023

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This paper presents a novel approach to synthetic data generation for OCR post-correction, using background and font variations tailored to specific time periods. The goal is to use synthetic data to improve text accuracy in digitized historical documents. The proposed three-step process begins by generating synthetic images that emulate the characteristics of historical documents from different years, incorporating year-specific backgrounds and fonts. These images are then used to create a dataset, on which multiple T5 sequence-to-sequence transformers are fine-tuned. The trained models demonstrate the ability to improve OCR output and align it more closely with the ground-truth text. The effectiveness of the approach is evaluated through various performance metrics, highlighting the benefits of using year-specific synthetic data for training. This work contributes to the field of OCR post-correction by providing a framework for improving the accuracy of OCR systems on historical text.

Files

Thesis_Thomas_Eckhardt.pdf

(pdf | 19.7 Mb)