From orthography to semantics: large-scale unsupervised textual similarity detection in historical Greek

Publication type: A2
Publication status: Published
Authors: Lemay, P., Lefever, E., & Bentein, K.
Journal: COMPUTATIONAL LINGUISTICS IN THE NETHERLANDS JOURNAL
Volume: 15
Pagination: 217-253

Abstract

Computational methods for detecting textual similarity provide a powerful lens for exploring linguistic patterns, formulaic language, and textual transmission in historical corpora. In this paper, we investigate what insights become possible when such similarity measures are applied across a vast corpus of Greek texts from Antiquity to the Byzantine period. We propose two methods that enable analysis at this scale: orthographic similarity using MinHash-LSH and semantic similarity using transformer-based sentence embeddings. We first validate both approaches on the Database of Byzantine Book Epigrams, which serves as a gold standard for assessing performance, before applying them to a much larger and more heterogeneous corpus. Scaling MinHash-LSH reveals repeated formulae across textual traditions, while clustering semantic embeddings uncovers conceptual and thematic relationships between texts, highlighting recurring motifs and ideas despite orthographic variation. Our findings illustrate how unsupervised methods suited to high-volume data uncover structures and relationships that targeted studies may overlook.
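As a rough illustration of the orthographic method named in the abstract, the following is a minimal MinHash-LSH sketch in plain Python. It is not the authors' implementation: the abstract does not state a shingle size, signature length, band count, or library, so `SHINGLE`, `NUM_PERM`, `BANDS`, and all function names here are assumptions chosen for readability. Texts are reduced to character n-gram sets, each set is summarised by a fixed-length signature of minimum hash values, and signatures are split into bands so that texts sharing any band land in the same bucket and become candidate pairs.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

# All three values are illustrative assumptions, not taken from the paper.
NUM_PERM = 64   # number of hash functions, i.e. signature length
BANDS = 16      # number of LSH bands (rows per band = NUM_PERM // BANDS)
SHINGLE = 3     # character n-gram size


def shingles(text, n=SHINGLE):
    """Return the set of character n-grams of a whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """Signature: for each seeded hash function, the minimum over all shingles."""
    sh = shingles(text)
    return [min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures):
    """Banding step: texts that agree on every row of any band become candidates."""
    rows = NUM_PERM // BANDS
    pairs = set()
    for b in range(BANDS):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(key)
        for keys in buckets.values():
            pairs.update(combinations(sorted(keys), 2))
    return pairs


# Usage with made-up identifiers and toy strings (not data from the paper).
texts = {
    "t1": "ὥσπερ ξένοι χαίρουσιν ἰδεῖν πατρίδα",
    "t2": "ὥσπερ ξένοι χαίρουσιν ἰδεῖν πατρίδα",
    "t3": "an unrelated line of text",
}
signatures = {key: minhash(value) for key, value in texts.items()}
candidates = candidate_pairs(signatures)
```

The banding step is what makes this kind of approach workable at corpus scale: only texts that collide in at least one bucket are compared directly, instead of every possible pair. More bands with fewer rows each make the filter more permissive, and fewer bands with more rows make it stricter.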
