Annotated Corpora for Term Extraction Research (ACTER)

Publication type
Publication status
Published
Authors
Rigouts Terryn, A., Hoste, V., & Lefever, E.
Publisher
Ghent University

Abstract

The Annotated Corpora for Term Extraction Research (ACTER) contain texts in four domains (corruption, dressage (horse riding), heart failure, and wind energy) and three languages (English, French, and Dutch). For each corpus (a combination of domain and language), around 50k tokens have been manually annotated to identify terminology and named entities, for almost 600k annotated tokens in total. The results are presented as lists of annotations per corpus, with one lowercased, unlemmatised, unique annotation per line, followed by its label and separated from it by a tab. In total, there are 19k unique annotations. The annotation process is transparent and well documented, with freely available guidelines (http://hdl.handle.net/1854/LU-8503113) and several published papers validating the dataset. The dataset was also used for the TermEval 2020 shared task on automatic term extraction, organised at the CompuTerm workshop at LREC 2020.
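The annotation list format described above (one annotation per line, a tab, then its label) can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset's own tooling: the function name `parse_annotations`, the sample entries, and the label strings `Specific_Term` and `Named_Entity` are illustrative assumptions and should be checked against the actual files and guidelines.

```python
import io
from collections import Counter


def parse_annotations(lines):
    """Parse lines of the form '<annotation><TAB><label>' into (annotation, label) pairs.

    `lines` can be any iterable of strings, such as an open text file.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab: annotation on the left, label on the right.
        annotation, _, label = line.partition("\t")
        pairs.append((annotation, label))
    return pairs


# Hypothetical sample data; the entries and label names are assumptions.
sample = (
    "heart failure\tSpecific_Term\n"
    "ejection fraction\tSpecific_Term\n"
    "ghent university\tNamed_Entity\n"
)

pairs = parse_annotations(io.StringIO(sample))
label_counts = Counter(label for _, label in pairs)
```

To read a real annotation list, pass an open file instead of the in-memory sample, for example `open(path, encoding="utf-8")`, where `path` points at one of the per-corpus lists.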