APARSIN: a multi-variety sentiment and translation benchmark for Iranic languages

Publication type
C1
Publication status
Published
Authors
Jafari, Sadegh, Azin, T., Roodi, F., Dehghani Tafti, Z., Ghadrdan, M., Vatankhahan Esfahani, E., Naebzadeh, A., Shahhosseini, M., Khan, G., Forghani, K., Namazi, D., Hossein Hashemi, M., Farsi, F., Osoolian, M., Mohammadi, M., Erfan Zare, M., Hasnain Khan, M., Hussain, M., Zaki, N., Mohammadi, J., Bali, S., Javad Ranjbar, M., Lefever, E., & Hoste, V.
Series
The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family
Pagination
83-97
Publisher
Association for Computational Linguistics (ACL)
Conference
Association for Computational Linguistics (Rabat, Morocco)

Abstract

The Iranic language family includes many underrepresented languages and dialects that remain largely unexplored in modern NLP research. We introduce APARSIN, a multi-variety benchmark covering 14 Iranic languages, dialects, and accents, designed for sentiment analysis and machine translation. The dataset includes both high- and low-resource varieties, several of which are endangered, and captures the linguistic variation across them. We evaluate a set of instruction-tuned Large Language Models (LLMs) on these tasks and analyze their performance across the varieties. Our results highlight substantial performance gaps between standard Persian and other Iranic languages and dialects, demonstrating the need for more inclusive multilingual and dialectally diverse NLP benchmarks.