MultiGEC is a dataset for Multilingual Grammatical Error Correction in 12 European languages (Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian) compiled by the CompSLA working group and over 20 external data providers in the context of MultiGEC-2025, the first text-level GEC shared task.
Participants in the shared task were asked to create a model for multilingual Grammatical Error Correction (GEC). They were provided with MultiGEC, a dataset containing 17 subcorpora that cover the 12 languages listed above. Each subcorpus is a collection of original learner texts, each accompanied by one or more correction hypotheses which can be treated as the ‘gold standard’ for the purpose of evaluating system hypotheses. For more details on the dataset, please refer to the corpus paper (LINK TO BE ADDED).
Generally, texts could be rewritten in two ways: 1) by applying minimal corrections only and 2) by also including fluency edits. Minimal corrections are meant to produce texts that conform to the norms of the target language while preserving the intended meaning of the learner production and staying close to its grammar, lexis and writing style. Fluency edits, on the other hand, may also include more extensive rephrasings. For instance, the text
in the pass when I have free time I go to see my laptop. In this, almost the music and movie.
could be minimally corrected to
In the past when I had free time I went to see my laptop. In this, there is almost all the music and movies.
whereas a fluency-edited version of the same text could be
I used to spend most of my free time on my laptop, where I had a lot of music and movies.
These examples come from the corpus paper (LINK TO BE ADDED).
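To make the difference between the two correction styles more concrete, the following minimal sketch compares the example texts above token by token using Python's standard difflib module. The function name token_similarity and the whitespace tokenization are assumptions made for illustration only; this is not the evaluation metric used in the shared task.

```python
from difflib import SequenceMatcher


def token_similarity(source: str, target: str) -> float:
    """Return a similarity score in [0, 1] between two texts, compared token by token.

    Hypothetical helper for illustration; not an official MultiGEC metric.
    """
    # Whitespace tokenization is a simplification: punctuation stays attached to words.
    source_tokens = source.split()
    target_tokens = target.split()
    matcher = SequenceMatcher(None, source_tokens, target_tokens, autojunk=False)
    return matcher.ratio()


original = (
    "in the pass when I have free time I go to see my laptop. "
    "In this, almost the music and movie."
)
minimal = (
    "In the past when I had free time I went to see my laptop. "
    "In this, there is almost all the music and movies."
)
fluency = (
    "I used to spend most of my free time on my laptop, "
    "where I had a lot of music and movies."
)

minimal_sim = token_similarity(original, minimal)
fluency_sim = token_similarity(original, fluency)

print(f"original vs. minimal correction: {minimal_sim:.2f}")
print(f"original vs. fluency edit:       {fluency_sim:.2f}")
```

On this example the minimally corrected version shares far more tokens with the learner text than the fluency-edited version does, which matches the intent of the two correction styles described above.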
To access the MultiGEC data, fill in your credentials at the top of this page.
The download includes:
For English, please refer to … to get access (INFO WILL BE ADDED SHORTLY)
For Russian, please reach out to Alla Rozovskaya to get access.
For the languages included in this repository, the data includes all training, development and test splits. Please note that the test split does not come with a gold standard version. Users wishing to evaluate a GEC system on the held-out test splits can refer to the shared task’s CodaLab page.
Please note that by downloading the data you agree to the following terms and conditions:
Contact multigec@svenska.gu.se for more information.