A corpus-based list of frequently used words in Sesotho

Publication type
C1
Publication status
Published
Authors
Sibeko, J., & De Clercq, O.
Editor
Rooweither Mabuya, Don Mthobela, Mmasibidi Setaka and Menno Van Zaanen
Series
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)
Pagination
32-41
Publisher
Association for Computational Linguistics
Conference
Fourth workshop on Resources for African Indigenous Languages (RAIL 2023) (Dubrovnik, Croatia)

Abstract

This article describes the development of a list of frequently used words in written Sesotho. The list was created for use in frequency-based text readability metrics and was derived using a corpus-based approach. Frequency lists were derived from three existing Sesotho corpora, then merged, qualitatively analysed, and fine-tuned by an experienced speaker of Sesotho. The main challenges in compiling the list were reconciling spelling variations, deciding how to treat abbreviations, and handling unexpected words in the preliminary lists. The final list comprises 3037 entries and is publicly available to the research community.
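The corpus-based step summarised in the abstract (derive a frequency list per corpus, then merge) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the regex tokenisation, the lowercasing, the merge by summing counts, and the function names `frequency_list` and `merge_frequency_lists` are all assumptions made for illustration, and the three short strings are toy stand-ins for the corpora. The manual review by an experienced speaker is not modelled.

```python
import re
from collections import Counter


def frequency_list(text):
    """Count word tokens in one corpus.

    Assumption: naive tokenisation that lowercases the text and keeps runs
    of letters, allowing internal apostrophes or hyphens.
    """
    tokens = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text.lower())
    return Counter(tokens)


def merge_frequency_lists(freq_lists, top_n):
    """Merge per-corpus counts by summing them and return the top_n entries.

    Assumption: simple summation; the paper may weight or combine the
    corpora differently.
    """
    merged = Counter()
    for freq in freq_lists:
        merged.update(freq)  # Counter.update adds counts, it does not replace
    return merged.most_common(top_n)


# Toy stand-ins for three corpora (illustrative strings only, not real data).
corpora = ["ke a leboha", "ke a tseba", "ke teng"]
per_corpus = [frequency_list(text) for text in corpora]
top_words = merge_frequency_lists(per_corpus, top_n=2)
print(top_words)
```

In a real pipeline the merged list would only be a preliminary candidate: spelling variants, abbreviations, and unexpected words would still need to be reviewed by hand, as the abstract notes.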