Evaluating Transformers for OCR Post-Correction in Early Modern Dutch Theatre

Publication type
Conference paper
Publication status
Published
Authors
Debaene, F., Maladry, A., Lefever, E., & Hoste, V.
Editor
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio and Steven Schockaert
Series
Proceedings of the 31st International Conference on Computational Linguistics
Pagination
10367–10374
Publisher
Association for Computational Linguistics (Abu Dhabi, UAE)

Abstract

This paper explores the effectiveness of two types of transformer models (large generative models and sequence-to-sequence models) for automatically post-correcting Optical Character Recognition (OCR) output in early modern Dutch plays. To address the need for optimally aligned data, we create a parallel dataset based on the OCRed and ground-truth versions from the EmDComF corpus using state-of-the-art alignment techniques. By combining character-based and semantic methods, we design and release a high-quality OCR-to-gold parallel dataset, selecting the alignment with the lowest Character Error Rate (CER) for all alignment pairs. We then fine-tune and evaluate five generative models and four sequence-to-sequence models on the OCR post-correction dataset. Results show that sequence-to-sequence models generally outperform generative models on this task, correcting more OCR errors while overgenerating and undergenerating less, with mBART as the best-performing system.
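
The alignment selection described in the abstract reduces to computing a CER for each candidate alignment and keeping the minimum. The paper's own pipeline is not reproduced here; the Python sketch below is a minimal illustration under assumptions: a plain character-level Levenshtein distance and invented helper names (levenshtein, cer, best_alignment).

    def levenshtein(a: str, b: str) -> int:
        """Character-level edit distance via the standard dynamic program."""
        if len(a) < len(b):
            a, b = b, a
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution
                ))
            prev = curr
        return prev[-1]

    def cer(reference: str, hypothesis: str) -> float:
        """Character Error Rate: edit distance normalised by reference length."""
        if not reference:
            return float(len(hypothesis) > 0)
        return levenshtein(reference, hypothesis) / len(reference)

    def best_alignment(gold: str, candidates: list[str]) -> str:
        """Keep the candidate alignment with the lowest CER against the gold text."""
        return min(candidates, key=lambda c: cer(gold, c))

For example, cer("De liefde", "De Iiefde") is 1/9 ≈ 0.111 (one substituted character over nine reference characters); best_alignment keeps whichever candidate minimises that value for a given gold segment.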
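
The abstract names mBART as the best-performing system but does not state the checkpoint or training configuration. The following is a minimal fine-tuning sketch using the Hugging Face transformers library; the facebook/mbart-large-50 checkpoint, the Dutch language code nl_XX, the learning rate, and the OCR/gold example pair are all illustrative assumptions, not the authors' setup.

    import torch
    from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

    # Assumed checkpoint; the paper does not specify which mBART variant was used.
    model_name = "facebook/mbart-large-50"
    tokenizer = MBart50TokenizerFast.from_pretrained(
        model_name, src_lang="nl_XX", tgt_lang="nl_XX"
    )
    model = MBartForConditionalGeneration.from_pretrained(model_name)

    # One invented aligned pair: noisy OCR line -> gold transcription
    # ("Iiefde" shows a typical I/l letter confusion).
    ocr_line = "De Iiefde die ick u draeg"
    gold_line = "De liefde die ick u draeg"

    # Encode source and target together; labels are built from text_target.
    batch = tokenizer(ocr_line, text_target=gold_line,
                      return_tensors="pt", truncation=True, max_length=128)

    # A single optimisation step; a real run would loop over the full dataset.
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
    model.train()
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()

    # Inference: generate a corrected line for new OCR input.
    model.eval()
    inputs = tokenizer(ocr_line, return_tensors="pt")
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["nl_XX"],
        max_length=128,
    )
    print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])

Treating post-correction as monolingual translation in this way lets any sequence-to-sequence checkpoint be swapped in; the decoder's forced Dutch BOS token simply tells mBART-50 which language to generate.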