Part-of-Speech Tagging for Bantu Languages

Developers
Guy De Pauw and Gilles-Maurice de Schryver
Website
https://demos.aflat.org/

About Part-of-Speech Tagging for Bantu Languages

This demo showcases our data-driven part-of-speech tagging for four Bantu languages: Cilubà, Northern Sotho, Swahili and Zulu. It is based on the paper whose abstract is given below.

Abstract
Recent scientific publications on data-driven part-of-speech tagging of Sub-Saharan African languages have reported encouraging accuracy scores, using off-the-shelf tools and often fairly limited amounts of training data. Unfortunately, no research efforts exist that explore which type of linguistic features contribute to accurate part-of-speech tagging for the languages under investigation. This paper describes feature selection experiments with a memory-based tagger, as well as a resource-light alternative approach. Experimental results show that contextual information is often not strictly necessary to achieve a good accuracy for tagging Bantu languages and that decent results can be achieved using a very straightforward unigram approach, based on orthographic features.
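
Illustrative sketch
The abstract mentions a context-free unigram approach based on orthographic features. The following Python sketch illustrates that general idea only; it is not the authors' implementation. The feature set (prefixes and suffixes of up to three characters, capitalisation, digits, hyphens), the vote-based scoring, the class name UnigramTagger, the tag labels N, V and CONJ, and the handful of Swahili training words are all assumptions made up for this example. Each word is tagged in isolation: known words receive their most frequent training tag, and unknown words receive the tag most strongly supported by their spelling.

```python
from collections import Counter, defaultdict


def orthographic_features(word):
    """Return features derived only from the word's spelling (no context)."""
    lower = word.lower()
    # Prefixes and suffixes of length 1 to 3 (an assumed, illustrative choice).
    feats = [f"pre={lower[:n]}" for n in (1, 2, 3) if len(lower) >= n]
    feats += [f"suf={lower[-n:]}" for n in (1, 2, 3) if len(lower) >= n]
    feats.append(f"cap={word[:1].isupper()}")
    feats.append(f"digit={any(c.isdigit() for c in word)}")
    feats.append(f"hyphen={'-' in word}")
    return feats


class UnigramTagger:
    """Hypothetical context-free tagger: lexicon lookup plus spelling votes."""

    def __init__(self):
        self.word_tags = defaultdict(Counter)  # word -> tag counts
        self.feat_tags = defaultdict(Counter)  # feature -> tag counts
        self.tag_counts = Counter()            # overall tag frequencies

    def train(self, tagged_words):
        for word, tag in tagged_words:
            self.word_tags[word.lower()][tag] += 1
            self.tag_counts[tag] += 1
            for feat in orthographic_features(word):
                self.feat_tags[feat][tag] += 1

    def tag(self, word):
        # Known word: return its most frequent tag in the training data.
        known = self.word_tags.get(word.lower())
        if known:
            return known.most_common(1)[0][0]
        # Unknown word: every feature seen in training votes for tags in
        # proportion to how often it co-occurred with each tag.
        votes = Counter()
        for feat in orthographic_features(word):
            counts = self.feat_tags.get(feat)
            if counts:
                total = sum(counts.values())
                for tag, count in counts.items():
                    votes[tag] += count / total
        if votes:
            # Break ties in favour of the globally more frequent tag.
            return max(votes, key=lambda t: (votes[t], self.tag_counts[t]))
        # No feature matched at all: fall back to the most frequent tag.
        return self.tag_counts.most_common(1)[0][0]


# Toy training data, invented for illustration (not from the paper's corpora).
TOY_TRAINING_DATA = [
    ("mtoto", "N"), ("watoto", "N"), ("kitabu", "N"), ("vitabu", "N"),
    ("anasoma", "V"), ("wanasoma", "V"), ("anakula", "V"), ("ninapenda", "V"),
    ("na", "CONJ"),
]

tagger = UnigramTagger()
tagger.train(TOY_TRAINING_DATA)

for token in ["na", "anapenda", "kikapu"]:
    print(token, tagger.tag(token))
```

In this toy setup the unseen words "anapenda" and "kikapu" are tagged purely from their prefixes and suffixes, which is why such an approach can work without looking at neighbouring words. A realistic system would need a proper tagged corpus and a tuned feature set.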