Unsupervised authorship attribution for Medieval Latin using transformer-based embeddings

Publication type
C1
Publication status
Published
Authors
De Langhe, L., De Clercq, O., & Hoste, V.
Editors
Rachele Sprugnoli and Marco Passarotti
Series
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024
Pagination
57-64
Publisher
ELRA
Conference
Third Workshop on Language Technologies for Historical and Ancient Languages @ LREC-COLING-2024 (LT4HALA 2024) (Turin, Italy)

Abstract

We explore the potential of employing transformer-based embeddings in an unsupervised authorship attribution task for medieval Latin. The development of Large Language Models (LLMs) and recent advances in transfer learning alleviate many of the traditional issues associated with authorship attribution in lower-resourced (ancient) languages. Despite this, these methods remain heavily understudied within this domain. Concretely, we generate strong contextual embeddings using a variety of mono- and multilingual transformer models and use these as input for two unsupervised clustering methods: a standard agglomerative clustering algorithm and a self-organizing map. We show that these transformer-based embeddings can be used to generate high-quality and interpretable clusterings, making them an attractive alternative to traditional feature-based methods.