MultiGEC (https://doi.org/10.23695%2Fh9f5-8143) is a dataset for Multilingual Grammatical Error Correction in 12 European languages (Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian) compiled by the CompSLA working group and over 20 external data providers in the context of MultiGEC-2025, the first text-level GEC shared task.
Participants in the shared task were asked to create a model for multilingual Grammatical Error Correction (GEC). They were provided with MultiGEC, a dataset containing 17 subcorpora that cover the 12 European languages listed above. Each subcorpus is a collection of original learner texts, each accompanied by one or more correction hypotheses that can be treated as the ‘gold standard’ when evaluating system hypotheses. For more details on the dataset, please refer to the corpus paper (LINK TO BE ADDED).
Generally, texts could be rewritten in two ways: 1) by making minimal corrections only or 2) by also including fluency edits. Minimal corrections are meant to produce texts that conform to the norms of the target language while preserving the intended meaning of the learner production and staying close to its grammar, lexis and writing style. Fluency edits, on the other hand, may also include more extensive rephrasings. For instance, the text
in the pass when I have free time I go to see my laptop. In this, almost the music and movie.
could be minimally corrected to
In the past when I had free time I went to see my laptop. In this, there is almost all the music and movies.
whereas a fluency-edited version of the same text could be
I used to spend most of my free time on my laptop, where I had a lot of music and movies.
These examples come from the corpus paper (LINK TO BE ADDED).
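The difference between the two correction styles can be made concrete by measuring how much of the original token sequence each rewrite keeps. The following is a minimal sketch that uses Python's standard `difflib` module on the three example texts above. The helper name `token_similarity` and the choice of whitespace tokenization are illustrative assumptions; this is not the evaluation procedure of the shared task.

```python
from difflib import SequenceMatcher

# The three example texts quoted above.
ORIGINAL = ("in the pass when I have free time I go to see my laptop. "
            "In this, almost the music and movie.")
MINIMAL = ("In the past when I had free time I went to see my laptop. "
           "In this, there is almost all the music and movies.")
FLUENCY = ("I used to spend most of my free time on my laptop, "
           "where I had a lot of music and movies.")


def token_similarity(source: str, hypothesis: str) -> float:
    """Return difflib's similarity ratio (0.0 to 1.0) over whitespace tokens.

    Hypothetical helper for illustration only: a higher value means the
    hypothesis keeps more of the source's token sequence.
    """
    matcher = SequenceMatcher(None, source.split(), hypothesis.split(),
                              autojunk=False)
    return matcher.ratio()


minimal_score = token_similarity(ORIGINAL, MINIMAL)
fluency_score = token_similarity(ORIGINAL, FLUENCY)

print(f"minimal correction: {minimal_score:.2f}")
print(f"fluency edit:       {fluency_score:.2f}")
```

For this example pair, the minimally corrected version scores higher than the fluency-edited one, which matches the intuition that minimal corrections stay close to the learner's own wording.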
To access the MultiGEC data, fill in your credentials at the top of this page.
The download includes:
All data that can be distributed as part of the MultiGEC-2025 shared task, that is, the data for every language except English and Russian, for which the data providers required a separate data access process.
Diane Nicholls, Andrew Caines & Paula Buttery (2024). The Write & Improve Corpus 2024: Error-annotated and CEFR-labelled essays by learners of English. https://doi.org/10.17863/CAM.112997
@article{wicorpus24,
author = {Diane Nicholls and Andrew Caines and Paula Buttery},
year = {2024},
title = {The {W}rite \& {I}mprove {C}orpus 2024: Error-annotated and {CEFR}-labelled essays by learners of {E}nglish},
publisher = {Cambridge University Press \& Assessment},
url = {https://doi.org/10.17863/CAM.112997}
}
For the languages included in this repository, the data comprises all training, development and test splits. Please note that the test splits do not come with a gold standard version. Users wishing to evaluate a GEC system on the held-out test splits can refer to the shared task’s Codalab page.
Please note that by downloading the data you agree to the following terms and conditions:
Contact multigec@svenska.gu.se for more information.