Classifying TEI encoding for DutchDraCor with transformer models

Publication type
C1
Publication status
In press
Authors
Debaene, F., & Hoste, V.
Series
Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX)
Pagination
1-5
Publisher
Association for Computational Linguistics (ACL)
Conference
19th Linguistic Annotation Workshop Co-located with ACL 2025 (Vienna, Austria)

Abstract

Computational Drama Analysis relies on well-structured textual data, yet many dramatic works remain in need of encoding. The Dutch dramatic tradition is one such example: 180 plays are currently available in the DraCor database, while many more await integration. To facilitate this process, we propose a semi-automated TEI encoding annotation methodology that uses transformer encoder language models to classify structural elements in Dutch drama. We fine-tune 4 Dutch models on the DutchDraCor dataset to predict the 9 most relevant labels used in the DraCor TEI encoding, experimenting with 2 model input settings. Our results show that incorporating additional context through beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens greatly improves performance, increasing the average macro F1 score across models from 0.717 to 0.923 (+0.206). Using the best-performing model, we generate silver-standard DraCor labels for EmDComF, an unstructured corpus of early modern Dutch comedies and farces, paving the way for its integration into DutchDraCor after validation.
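
The abstract does not spell out how the BOS/EOS-delimited context is constructed, but one plausible reading is that each line to be classified is wrapped in BOS/EOS-style tokens with the neighbouring lines attached as context before being fed to the encoder. The sketch below illustrates that idea with hypothetical `[BOS]`/`[EOS]` placeholder strings (a real setup would use the tokenizer's own special tokens); it is an assumption, not the authors' actual preprocessing.

```python
def build_context_inputs(lines, bos="[BOS]", eos="[EOS]"):
    """For each play line, build a model input that marks the target line
    with BOS/EOS-style delimiters and adds the surrounding lines as context.

    Illustrative sketch only: the delimiter strings and one-line context
    window are assumptions, not the paper's documented configuration.
    """
    inputs = []
    for i, line in enumerate(lines):
        prev_ctx = lines[i - 1] if i > 0 else ""
        next_ctx = lines[i + 1] if i < len(lines) - 1 else ""
        # Target line sits between the delimiters; context flanks it.
        inputs.append(f"{prev_ctx} {bos} {line} {eos} {next_ctx}".strip())
    return inputs


# Example: the middle line is classified with its neighbours as context.
examples = build_context_inputs(["EERSTE BEDRYF.", "KRELIS.", "Wel Trijn!"])
print(examples[1])  # → EERSTE BEDRYF. [BOS] KRELIS. [EOS] Wel Trijn!
```

Each augmented string would then be paired with its TEI label (e.g. speaker, stage direction) for fine-tuning a sequence-classification head on a Dutch encoder model.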