Predicting machine translation performance on low-resource languages: the role of domain similarity

Publication type
C1
Publication status
Published
Authors
Khiu, E., Toossi, H., Liu, J., Li, J., Anugraha, D., Flores, J., Roman, L., Doğruöz, A.S., & Lee, E.
Editors
Yvette Graham and Matthew Purver
Series
Findings of the Association for Computational Linguistics: EACL 2024
Pagination
1474-1486
Publisher
Association for Computational Linguistics
Conference
18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2024) (St. Julian’s, Malta)

Abstract

Fine-tuning and testing a multilingual large language model is an expensive and challenging process for low-resource languages (LRLs). While previous studies have used machine learning methods to predict the performance of natural language processing (NLP) tasks, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we use classical regression models to investigate three factors that can affect model performance: the size of the fine-tuning corpus, the domain similarity between the fine-tuning and testing corpora, and the language similarity between the source and target languages. Our results indicate that domain similarity has the most important impact on predicting the performance of Machine Translation models.
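
To make the setup concrete, below is a minimal sketch of a performance predictor of this kind, assuming the scikit-learn API. It is not the authors' code: all data, feature values, and coefficients are synthetic placeholders invented for illustration. The regression simply maps the three factors named in the abstract to a translation quality score.

# Hypothetical sketch: predicting an MT quality score from the three
# factors studied in the paper. Every value here is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# One synthetic record per fine-tuning/testing configuration.
n = 200
corpus_size = rng.uniform(1e3, 1e5, n)   # fine-tuning corpus size (sentence pairs)
domain_sim = rng.uniform(0.0, 1.0, n)    # fine-tuning/testing domain similarity
lang_sim = rng.uniform(0.0, 1.0, n)      # source/target language similarity

# Invented target in which domain similarity dominates, loosely mirroring
# the paper's headline finding; real targets would be measured MT scores.
score = (10 * domain_sim + 2 * lang_sim
         + 1.5 * np.log10(corpus_size) + rng.normal(0, 0.5, n))

X = np.column_stack([np.log10(corpus_size), domain_sim, lang_sim])
X_train, X_test, y_train, y_test = train_test_split(X, score, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("coefficients (log corpus size, domain sim, lang sim):", model.coef_)
print("test RMSE:", mean_squared_error(y_test, model.predict(X_test)) ** 0.5)

In a sketch like this, comparing the relative influence of the three factors would require standardizing the features first, since raw coefficient magnitudes depend on feature scale.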