Annotated Corpora for Term Extraction Research (ACTER)

Publication type
Publication status: Published
Authors: Rigouts Terryn, A., Hoste, V., & Lefever, E.
Publisher: Zenodo

Abstract

The Annotated Corpora for Term Extraction Research (ACTER) contain texts in four domains (corruption, dressage (horse riding), heart failure, and wind energy) and three languages (English, French, and Dutch). For each corpus (a combination of domain and language), around 50k tokens have been manually annotated to identify terminology and named entities, for almost 600k annotated tokens in total. The results are presented as one list of annotations per corpus, with one unique, lowercased, unlemmatised annotation per line, separated from its label by a tab. In total, there are 19k unique annotations. The annotation process is transparent and well documented, with freely available guidelines (http://hdl.handle.net/1854/LU-8503113) and several published papers validating the dataset. The dataset was also used for the TermEval 2020 shared task on automatic term extraction, organised at the CompuTerm workshop at LREC 2020.
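
The list format described above (one annotation per line, followed by a tab and its label) can be read with a few lines of Python. This is a minimal sketch, not part of the dataset's own tooling: the function name `parse_annotations`, the sample annotations, and the label strings in the sample are illustrative assumptions, so check the actual label inventory in the annotation guidelines before relying on them.

```python
import io
from collections import Counter


def parse_annotations(lines):
    """Parse 'annotation<TAB>label' lines into (annotation, label) pairs.

    Hypothetical helper: assumes each non-empty line holds exactly one
    annotation and one label separated by a single tab.
    """
    pairs = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        annotation, _, label = line.partition("\t")
        pairs.append((annotation, label))
    return pairs


# Illustrative in-memory sample; the annotations and label names here are
# assumptions, not copied from the dataset.
sample = io.StringIO(
    "heart failure\tSpecific_Term\n"
    "ejection fraction\tSpecific_Term\n"
    "new york heart association\tNamed_Entity\n"
)

pairs = parse_annotations(sample)
label_counts = Counter(label for _, label in pairs)
```

To read a real list, the same function could be passed an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to one of the per-corpus annotation files.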
