Towards better language representation in Natural Language Processing: a multilingual dataset for text-level Grammatical Error Correction

Publication type
A2
Publication status
In press
Authors
Masciolini, A., Caines, A., De Clercq, O., Kruijsbergen, J., Kurfalı, M., Muñoz Sánchez, R., Volodina, E., Östling, R., Allkivi, K., Arhar Holdt, Š., Auzina, I., Darģis, R., Drakonaki, E., Frey, J., Glišić, I., Kikilintza, P., Nicolas, L., Romanyshyn, M., Rosen, A., Rozovskaya, A., Suluste, K., Syvokon, O., Tantos, A., Touriki, D., Tsiotskas, K., Tsourilla, E., Varsamopoulos, V., Wisniewski, K., Žagar, A., & Zesch, T.
Journal
International Journal of Learner Corpus Research

Abstract

This paper introduces MultiGEC, a dataset for multilingual Grammatical Error Correction (GEC) in twelve European languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian. MultiGEC differs from previous GEC datasets in that it covers several underrepresented languages, which we argue should be included in resources used to train models for Natural Language Processing tasks that, like GEC itself, have implications for Learner Corpus Research and Second Language Acquisition. Beyond its multilingualism, the novelty of MultiGEC is that it consists of full texts (typically learner essays) rather than individual sentences, making it possible to train systems that take broader context into account. The dataset was built for MultiGEC-2025, the first shared task in multilingual text-level GEC, but it remains accessible after the shared task's competitive phase, serving as a resource for training new error correction systems and performing cross-lingual GEC studies.