Unsupervised authorship attribution for Medieval Latin using transformer-based embeddings

Publication type
C1
Publication status
Published
Authors
De Langhe, L., De Clercq, O., & Hoste, V.
Editors
Rachele Sprugnoli and Marco Passarotti
Series
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024
Pagination
57-64
Publisher
ELRA
Conference
Third Workshop on Language Technologies for Historical and Ancient Languages @ LREC-COLING-2024 (LT4HALA 2024) (Turin, Italy)

Abstract

We explore the potential of employing transformer-based embeddings in an unsupervised authorship attribution task for medieval Latin. The development of Large Language Models (LLMs) and recent advances in transfer learning alleviate many of the traditional issues associated with authorship attribution in lower-resourced (ancient) languages. Despite this, these methods remain heavily understudied within this domain. Concretely, we generate strong contextual embeddings using a variety of mono- and multilingual transformer models and use these as input for two unsupervised clustering methods: a standard agglomerative clustering algorithm and a self-organizing map. We show that these transformer-based embeddings can be used to generate high-quality and interpretable clusterings, making them an attractive alternative to traditional feature-based methods.