Unlocking domain knowledge: model adaptation for non-normative Dutch

Publication type
A2
Publication status
In press
Authors
Debaene, F., Maladry, A., Singh, P., Lefever, E., & Hoste, V.
Journal
Computational Linguistics in the Netherlands Journal
Volume
14

Abstract

This study examines the adaptation of transformer models to two non-normative Dutch language variants: early modern Dutch and contemporary social media Dutch. Both share linguistic features that set them apart from standard Dutch, including spelling inconsistencies, semantic shifts and out-of-domain vocabulary. To address this, we explore two domain adaptation techniques: (1) continued full-model pre-training and (2) training specialized adapters integrated into existing models. We evaluate both techniques on sentiment and emotion detection in early modern Dutch comedies and farces, and on emotion and irony detection in Dutch tweets. Our results show that both adaptation methods significantly improve performance on historical and social media Dutch tasks, with the greatest gains occurring when domain-relevant datasets are used. The effectiveness of model adaptation is task-dependent and sensitive to the selection of pre-training data, underscoring the importance of domain relevance over data quantity for optimizing downstream performance. We hypothesize that contemporary Dutch encoder models already capture informal language but lack exposure to historical Dutch, making adaptation more impactful for the latter. Additionally, we compare the adapted encoder models to generative decoder models, which are state-of-the-art for many NLP tasks. While generative models fail to match the performance of our adapted models on historical Dutch, fine-tuned generative models outperform the adapted models on social media Dutch tasks. This suggests that task-specific fine-tuning remains crucial for effective generative modelling. Finally, we release two pre-training corpora for Dutch encoder adaptation and two novel task-specific datasets for early modern Dutch on Hugging Face.
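
As a concrete illustration of technique (1), the sketch below shows what continued full-model masked-language-model pre-training of a Dutch encoder on an in-domain corpus could look like with the Hugging Face Transformers library. The base model ID, corpus file, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of continued MLM pre-training on an in-domain corpus.
# Model ID, corpus path, and hyperparameters are assumptions for illustration.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "pdelobelle/robbert-v2-dutch-base"  # assumed Dutch encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Illustrative in-domain corpus: one raw-text document per line.
corpus = load_dataset("text", data_files={"train": "early_modern_dutch.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Standard 15% dynamic token masking, as in BERT-style pre-training.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="robbert-early-modern",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

After this step, the adapted checkpoint would be fine-tuned on the downstream task (e.g., sentiment or emotion detection) in the usual way.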
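For technique (2), a comparable sketch using the AdapterHub `adapters` library: instead of updating all encoder weights, a small bottleneck adapter is inserted into each layer and trained while the base model stays frozen. Again, all names and settings here are assumptions for illustration, not the paper's setup.

```python
# Minimal sketch of adapter-based domain adaptation with the `adapters`
# library; adapter name and config are illustrative assumptions.
import adapters
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "pdelobelle/robbert-v2-dutch-base"  # assumed Dutch encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

adapters.init(model)  # make the Transformers model adapter-capable
model.add_adapter("social_media_dutch", config="seq_bn")  # bottleneck adapter
model.train_adapter("social_media_dutch")  # freeze base weights, train adapter only

# The MLM Trainer setup from the previous sketch applies unchanged from here;
# only the small adapter modules receive gradient updates during training.
model.save_adapter("adapter-social-media-dutch", "social_media_dutch")
```

Because only the adapter parameters are updated, this variant is far cheaper to train and store than full-model pre-training, at the cost of less capacity to absorb the new domain.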