Classifying TEI encoding for DutchDraCor with transformer models

Publication type
C1
Publication status
In press
Authors
Debaene, F., & Hoste, V.
Series
Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX)
Pagination
1-5
Publisher
Association for Computational Linguistics (ACL)
Conference
19th Linguistic Annotation Workshop Co-located with ACL 2025 (Vienna, Austria)

Abstract

Computational Drama Analysis relies on well-structured textual data, yet many dramatic works remain in need of encoding. The Dutch dramatic tradition is one such example: 180 plays are currently available in the DraCor database, while many more await integration. To facilitate this process, we propose a semi-automated TEI encoding annotation methodology that uses transformer encoder language models to classify structural elements in Dutch drama. We fine-tune 4 Dutch models on the DutchDraCor dataset to predict the 9 most relevant labels used in the DraCor TEI encoding, experimenting with 2 model input settings. Our results show that incorporating additional context through beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens greatly improves performance, increasing the average macro F1 score across models from 0.717 to 0.923 (+0.206). Using the best-performing model, we generate silver-standard DraCor labels for EmDComF, an unstructured corpus of early modern Dutch comedies and farces, paving the way for its integration into DutchDraCor after validation.
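
The abstract does not spell out how the BOS/EOS-delimited context is constructed, but one plausible reading is that each line to be classified is wrapped in BOS/EOS-style tokens with the neighbouring lines attached as context before being fed to the encoder. The sketch below illustrates that idea with hypothetical `[BOS]`/`[EOS]` placeholder strings (a real setup would use the tokenizer's own special tokens); it is an assumption, not the authors' actual preprocessing.

```python
def build_context_inputs(lines, bos="[BOS]", eos="[EOS]"):
    """For each play line, build a model input that marks the target line
    with BOS/EOS-style delimiters and adds the surrounding lines as context.

    Illustrative sketch only: the delimiter strings and one-line context
    window are assumptions, not the paper's documented configuration.
    """
    inputs = []
    for i, line in enumerate(lines):
        prev_ctx = lines[i - 1] if i > 0 else ""
        next_ctx = lines[i + 1] if i < len(lines) - 1 else ""
        # Target line sits between the delimiters; context flanks it.
        inputs.append(f"{prev_ctx} {bos} {line} {eos} {next_ctx}".strip())
    return inputs


# Example: the middle line is classified with its neighbours as context.
examples = build_context_inputs(["EERSTE BEDRYF.", "KRELIS.", "Wel Trijn!"])
print(examples[1])  # → EERSTE BEDRYF. [BOS] KRELIS. [EOS] Wel Trijn!
```

Each augmented string would then be paired with its TEI label (e.g. speaker, stage direction) for fine-tuning a sequence-classification head on a Dutch encoder model.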