MultiGEC dataset


About MultiGEC dataset

MultiGEC (https://doi.org/10.23695/h9f5-8143) is a dataset for Multilingual Grammatical Error Correction in 12 European languages (Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian). It was compiled by the CompSLA working group and over 20 external data providers in the context of MultiGEC-2025, the first text-level GEC shared task.

Task description

Participants were asked to create a model for multilingual Grammatical Error Correction (GEC). They were provided with MultiGEC, a dataset containing 17 subcorpora covering 12 individual European languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian. Each subcorpus is a collection of original learner texts accompanied by one or more correction hypotheses which can be treated as the ‘gold standard’ for the purpose of evaluating system hypotheses. For more details on the dataset, please refer to the corpus paper (LINK TO BE ADDED).


Generally, texts could be rewritten in two ways: 1) by implementing minimal corrections or 2) by also including fluency edits. Minimal corrections are meant to produce texts that conform to the norms of the target language whilst preserving the intended meaning of the learner production and staying close to its grammar, lexis and writing style. Fluency edits, on the other hand, may also include more extensive rephrasings. For instance, the text

      in the pass when I have free time I go to see my laptop. In this, almost the music and movie.

could be minimally corrected to

      In the past when I had free time I went to see my laptop. In this, there is almost all the music and movies.

whereas a fluency-edited version of the same text could be

      I used to spend most of my free time on my laptop, where I had a lot of music and movies.


These examples come from the corpus paper (LINK TO BE ADDED).
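The difference between the two correction styles can be made concrete by measuring how far each rewrite departs from the learner text. The sketch below is purely illustrative and is not the shared task's evaluation method: it compares whitespace-separated tokens with `difflib.SequenceMatcher` from the Python standard library, and the function name `token_similarity` is an assumption introduced here.

```python
from difflib import SequenceMatcher


def token_similarity(source: str, rewrite: str) -> float:
    """Return a similarity ratio in [0, 1] over whitespace-separated tokens."""
    # autojunk=False switches off difflib's "popular element" heuristic,
    # which is only relevant for long sequences but is disabled for clarity.
    matcher = SequenceMatcher(None, source.split(), rewrite.split(), autojunk=False)
    return matcher.ratio()


# Example texts taken from this page.
original = ("in the pass when I have free time I go to see my laptop. "
            "In this, almost the music and movie.")
minimal = ("In the past when I had free time I went to see my laptop. "
           "In this, there is almost all the music and movies.")
fluency = ("I used to spend most of my free time on my laptop, "
           "where I had a lot of music and movies.")

minimal_score = token_similarity(original, minimal)
fluency_score = token_similarity(original, fluency)

# The minimal correction shares far more tokens with the learner text
# than the fluency edit does, so its ratio is higher.
print(f"minimal correction: {minimal_score:.2f}")
print(f"fluency edit:       {fluency_score:.2f}")
```

A token-level ratio like this is only a rough proxy for "closeness"; it says nothing about whether a rewrite is grammatical or preserves meaning.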


Data

You can access the data of MultiGEC by filling in your credentials at the top of this page.


The download includes:

All data that can be distributed as part of the MultiGEC-2025 shared task, that is, the data for all languages except English and Russian, whose data providers require a separate data access process.

  • For English, the dataset is available from the ELiT website. Please visit that page, agree to the licence terms, download the data and move it to this directory. Please also cite the following if using the data:

Diane Nicholls, Andrew Caines & Paula Buttery (2024). The Write & Improve Corpus 2024: Error-annotated and CEFR-labelled essays by learners of English. https://doi.org/10.17863/CAM.112997

@article{wicorpus24,
  author = {Diane Nicholls and Andrew Caines and Paula Buttery},
  year = {2024},
  title = {The {W}rite \& {I}mprove {C}orpus 2024: Error-annotated and {CEFR}-labelled essays by learners of {E}nglish},
  publisher = {Cambridge University Press \& Assessment},
  url = {https://doi.org/10.17863/CAM.112997}
}

For the languages included in this repository, the data comprises all training, development and test splits. Please note that the test split does not come with a gold standard version. Users wishing to evaluate a GEC system on the held-out test splits can refer to the shared task’s CodaLab page.
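Because the test split ships without gold corrections, only training and development texts can be scored locally. The following is a minimal sketch of that distinction, not a description of the dataset's actual file format: the class `LearnerEssay`, its fields, the helper `locally_scorable` and the identifiers are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LearnerEssay:
    """Hypothetical in-memory record; not the dataset's real file format."""
    essay_id: str    # made-up identifier for illustration
    split: str       # "train", "dev" or "test"
    original: str    # uncorrected learner text
    # Gold corrections; one or more for train/dev, empty for test.
    references: list[str] = field(default_factory=list)

    def has_gold(self) -> bool:
        """True when at least one reference correction is available."""
        return bool(self.references)


def locally_scorable(essays: list[LearnerEssay]) -> list[LearnerEssay]:
    """Keep only essays that can be evaluated without held-out gold data."""
    return [essay for essay in essays if essay.has_gold()]


essays = [
    LearnerEssay(
        "dev-001", "dev",
        "in the pass when I have free time I go to see my laptop.",
        ["In the past when I had free time I went to see my laptop."],
    ),
    # Test essays carry no references, mirroring the released test split.
    LearnerEssay("test-001", "test", "In this, almost the music and movie."),
]

print([essay.essay_id for essay in locally_scorable(essays)])
```

Keeping `references` as a list allows for texts that have more than one correction hypothesis, while an empty list marks texts that have to be evaluated through the shared task's own infrastructure.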

Terms of use

Please note that by downloading the data you agree to the following terms and conditions:

  • The organizers and their affiliated institutions make no warranties regarding the datasets provided, including but not limited to their correctness or completeness. They cannot be held liable for providing access to the datasets or for the usage of the datasets.
  • Each subset of the dataset is subject to the license of its source corpus.
  • Access to the dataset is personal. Each user should apply individually.
  • The dataset should only be used for scientific or research purposes, as well as for relevant academic purposes. Any other use is explicitly prohibited.
  • Re-identification of the data subjects (learners and/or authors of the texts) is explicitly prohibited.
  • Feeding the data to proprietary machine learning models that retain data for model training is explicitly prohibited.
  • The datasets must not be redistributed or shared, in part or in full, with any third party. Please redirect interested parties to multigec@svenska.gu.se.
  • If you use any of the subsets provided in the shared task, you agree to cite the associated papers.

Contact multigec@svenska.gu.se for more information.