Evaluating Transformers for OCR Post-Correction in Early Modern Dutch Theatre

Publication type
Conference paper
Publication status
Published
Authors
Debaene, F., Maladry, A., Lefever, E., & Hoste, V.
Editor
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio and Steven Schockaert
Series
Proceedings of the 31st International Conference on Computational Linguistics
Pagination
10367–10374
Publisher
Association for Computational Linguistics (Abu Dhabi, UAE)

Abstract

This paper explores the effectiveness of two types of transformer models (large generative models and sequence-to-sequence models) for automatically post-correcting Optical Character Recognition (OCR) output in early modern Dutch plays. To address the need for optimally aligned data, we create a parallel dataset based on the OCRed and ground-truth versions from the EmDComF corpus using state-of-the-art alignment techniques. By combining character-based and semantic methods, we design and release a high-quality OCR-to-gold parallel dataset, selecting the alignment with the lowest Character Error Rate (CER) for all alignment pairs. We then fine-tune and evaluate five generative models and four sequence-to-sequence models on the OCR post-correction dataset. Results show that sequence-to-sequence models generally outperform generative models on this task, correcting more OCR errors while overgenerating and undergenerating less, with mBART as the best-performing system.
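
The alignment selection described in the abstract reduces to computing a CER for each candidate alignment and keeping the minimum. The paper's own pipeline is not reproduced here; the Python sketch below is a minimal illustration under assumptions: a plain character-level Levenshtein distance and invented helper names (levenshtein, cer, best_alignment).

    def levenshtein(a: str, b: str) -> int:
        """Character-level edit distance via the standard dynamic program."""
        if len(a) < len(b):
            a, b = b, a
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution
                ))
            prev = curr
        return prev[-1]

    def cer(reference: str, hypothesis: str) -> float:
        """Character Error Rate: edit distance normalised by reference length."""
        if not reference:
            return float(len(hypothesis) > 0)
        return levenshtein(reference, hypothesis) / len(reference)

    def best_alignment(gold: str, candidates: list[str]) -> str:
        """Keep the candidate alignment with the lowest CER against the gold text."""
        return min(candidates, key=lambda c: cer(gold, c))

For example, cer("De liefde", "De Iiefde") is 1/9 ≈ 0.111 (one substituted character over nine reference characters); best_alignment keeps whichever candidate minimises that value for a given gold segment.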
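
The abstract names mBART as the best-performing system but does not state the checkpoint or training configuration. The following is a minimal fine-tuning sketch using the Hugging Face transformers library; the facebook/mbart-large-50 checkpoint, the Dutch language code nl_XX, the learning rate, and the OCR/gold example pair are all illustrative assumptions, not the authors' setup.

    import torch
    from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

    # Assumed checkpoint; the paper does not specify which mBART variant was used.
    model_name = "facebook/mbart-large-50"
    tokenizer = MBart50TokenizerFast.from_pretrained(
        model_name, src_lang="nl_XX", tgt_lang="nl_XX"
    )
    model = MBartForConditionalGeneration.from_pretrained(model_name)

    # One invented aligned pair: noisy OCR line -> gold transcription
    # ("Iiefde" shows a typical I/l letter confusion).
    ocr_line = "De Iiefde die ick u draeg"
    gold_line = "De liefde die ick u draeg"

    # Encode source and target together; labels are built from text_target.
    batch = tokenizer(ocr_line, text_target=gold_line,
                      return_tensors="pt", truncation=True, max_length=128)

    # A single optimisation step; a real run would loop over the full dataset.
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
    model.train()
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()

    # Inference: generate a corrected line for new OCR input.
    model.eval()
    inputs = tokenizer(ocr_line, return_tensors="pt")
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["nl_XX"],
        max_length=128,
    )
    print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])

Treating post-correction as monolingual translation in this way lets any sequence-to-sequence checkpoint be swapped in; the decoder's forced Dutch BOS token simply tells mBART-50 which language to generate.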