Linguistic annotation of Byzantine book epigrams : revisited

Publication type
B2
Publication status
In press
Authors
Swaelens, C., De Vos, I., & Lefever, E.
Series
Computational approaches to Ancient Greek and Latin
Publisher
De Gruyter (Berlin ; Boston)

Abstract

In the current surge of interest in large language models (LLMs) within the field of natural language processing (NLP), the automatic assignment of linguistic information may seem straightforward. However, tasks like part-of-speech tagging, morphological analysis, and lemmatisation pose significant challenges for ancient languages such as Greek, Latin, and Sanskrit. A major issue is that the corpora for these languages are closed, meaning the available data is finite and cannot be expanded. Furthermore, these corpora are relatively small compared to those for languages like English or Chinese. Smaller corpora provide less training data, which, in the case of LLMs, typically results in reduced performance. This challenge is compounded by the morphological richness of languages like Greek, which makes automatic linguistic annotation even more difficult. This article presents our recently developed techniques for part-of-speech tagging, morphological analysis, and lemmatisation for Byzantine Greek. We compare these techniques to existing algorithms in order to assess whether these seemingly simple tasks benefit from complex solutions. This is followed by a discussion of the challenges researchers have faced over the past fifty years of developing linguistic analysis tools for Greek.