Evaluating transformers for OCR post-correction in early modern Dutch theatre

Publication type
C1
Publication status
Published
Authors
Debaene, F., Maladry, A., Lefever, E., & Hoste, V.
Editor
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio and Steven Schockaert
Series
Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025
Pagination
10367-10374
Publisher
Association for Computational Linguistics (ACL)
Conference
31st International Conference on Computational Linguistics (COLING 2025) (Abu Dhabi, United Arab Emirates)

Abstract

This paper explores the effectiveness of two types of transformer models, large generative models and sequence-to-sequence models, for automatically post-correcting Optical Character Recognition (OCR) output in early modern Dutch plays. To address the need for optimally aligned data, we create a parallel dataset based on the OCRed and ground truth versions from the EmDComF corpus using state-of-the-art alignment techniques. By combining character-based and semantic methods, we design and release a high-quality OCR-to-gold parallel dataset, selecting the alignment with the lowest Character Error Rate (CER) for each alignment pair. We then fine-tune and evaluate five generative models and four sequence-to-sequence models on the OCR post-correction dataset. Results show that sequence-to-sequence models generally outperform generative models on this task, correcting more OCR errors while overgenerating and undergenerating less, with mBART as the best-performing system.
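
To make the dataset-construction step concrete, below is a minimal Python sketch of the CER-based selection described above: each alignment method proposes a candidate (OCR, gold) pair, and the pair with the lowest CER is kept. The function names and the toy candidates are illustrative, not the authors' code; in the paper the candidates come from character-based and semantic alignment methods.

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character Error Rate: edit distance normalised by reference length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

def select_alignment(candidates: list[tuple[str, str]]) -> tuple[str, str]:
    """Keep the candidate (OCR, gold) pair with the lowest CER."""
    return min(candidates, key=lambda pair: cer(pair[0], pair[1]))

# Toy candidates standing in for the output of two alignment methods.
candidates = [
    ("Een waerachtige hiftorie", "Een waerachtige historie"),
    ("Een waerachtige hiftorie va", "Een waerachtige historie"),
]
ocr_line, gold_line = select_alignment(candidates)  # first pair wins (CER ~0.04 vs ~0.17)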
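
Below is a similarly hedged sketch of the sequence-to-sequence framing with mBART, the best-performing system in the paper. The checkpoint, the Dutch language code, and the toy sentence pair are assumptions; the abstract does not specify the authors' exact fine-tuning configuration.

from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

checkpoint = "facebook/mbart-large-50"  # assumed base model
tokenizer = MBart50TokenizerFast.from_pretrained(
    checkpoint, src_lang="nl_XX", tgt_lang="nl_XX"
)
model = MBartForConditionalGeneration.from_pretrained(checkpoint)

# One training step: the noisy OCR line is the source sequence, the gold
# transcription is the target; loss is standard seq2seq cross-entropy.
batch = tokenizer("Een waerachtige hiftorie",
                  text_target="Een waerachtige historie",
                  return_tensors="pt")
loss = model(**batch).loss
loss.backward()  # wire into an optimizer loop in practice

# Inference: generate a corrected line from raw OCR output.
inputs = tokenizer("Een waerachtige hiftorie", return_tensors="pt")
corrected = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["nl_XX"],
    max_new_tokens=64,
)
print(tokenizer.decode(corrected[0], skip_special_tokens=True))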