About D-Terminer
Terminology
Automatic term extraction is the process of automatically identifying terminology in domain-specific text.
Terms consist of one or more words that express a specific concept in a domain.
Terminology has multiple meanings. It can refer to the specialised vocabulary of a domain, i.e., the collection of all terms in that domain, or it can be used to describe the study of terms.
Examples of terms in the domain of heart failure are: cardiology, beta-blockers, myocardial infarction, and heart failure with reduced ejection fraction.
Domain in this context can be interpreted as any area in which people can build expertise. Common examples are the domains of medicine, technology, or finance. Equally valid domains are music, football, or cooking. Domains can be defined very broadly, e.g., medicine, or they can be more specific, e.g., heart failure. The domains included in the ACTER dataset that is used to train the system are: corruption, dressage (horse riding), heart failure, and wind energy. This means the system will work particularly well on those domains, or on domains that are strongly related to them. However, it will also generalise to other domains.
The definition of terms leaves much room for interpretation, so the boundary between terms and general language is not always clear. Some people only consider very specific terms to be valid, while others will include more general words. For instance, in the domain of heart failure, some people will consider heart to be a valid term, while others consider this general language. The interpretation usually depends on the intended application.
This project distinguishes between four types of terms. For the monolingual term extraction, you can choose to focus on all of these types or on a subset. The results are most accurate when the system is trained to extract all of these types (standard settings), but you can also use a system trained to find a subset.
- Specific Terms: strongly related to the domain and not part of general language, so they require domain expertise to understand
- Common Terms: strongly related to the domain, but also part of general language, so they are familiar to most laypeople
- Out-of-Domain Terms: not related to the domain, but not part of general language either
- Named Entities: proper names of people, places, organisations, etc. (not necessarily domain-specific)
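The four types above can be thought of as labels attached to each extracted candidate. The following is a minimal Python sketch of that idea, not the demo's actual code or output format: the names `TermType`, `Candidate`, and `filter_by_type` are hypothetical, and the example candidates are labelled by hand for illustration. Note that it filters a finished list after the fact, whereas the demo offers systems trained to find a given subset of types.

```python
from dataclasses import dataclass
from enum import Enum


class TermType(Enum):
    """The four term types described above (hypothetical representation)."""
    SPECIFIC = "Specific Term"
    COMMON = "Common Term"
    OUT_OF_DOMAIN = "Out-of-Domain Term"
    NAMED_ENTITY = "Named Entity"


@dataclass(frozen=True)
class Candidate:
    """One extracted candidate with its assigned type."""
    text: str
    term_type: TermType


def filter_by_type(candidates, wanted):
    """Keep only the candidates whose type is in the `wanted` set."""
    return [c for c in candidates if c.term_type in wanted]


# Hand-labelled illustration for the heart failure domain; not system output.
candidates = [
    Candidate("heart failure with reduced ejection fraction", TermType.SPECIFIC),
    Candidate("heart", TermType.COMMON),
    Candidate("p-value", TermType.OUT_OF_DOMAIN),
    Candidate("Ghent University", TermType.NAMED_ENTITY),
]

# Focus on the two types that are strongly related to the domain.
domain_terms = filter_by_type(candidates, {TermType.SPECIFIC, TermType.COMMON})
print([c.text for c in domain_terms])
```

Which subset is useful depends on the intended application: a glossary for specialists might keep only Specific Terms, while a resource for laypeople might also include Common Terms.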
Publications
This demo has been developed based on Ayla Rigouts Terryn's PhD research. More information about the methodology, evaluation, and dataset can be found in the following publications:
- PhD (please cite when using demo):
Rigouts Terryn, A. (2021). D-TERMINE: Data-driven Term Extraction Methodologies Investigated [Doctoral thesis]. Ghent University. http://hdl.handle.net/1854/LU-8709150
- conference paper on demo and multilingual term extraction:
Rigouts Terryn, A., Hoste, V., & Lefever, E. (2022). D-Terminer: Online Demo for Monolingual and Bilingual Automatic Term Extraction. Proceedings of Terminology in the 21st Century: Many Faces, Many Places, an LREC2022 Workshop.
- journal paper on monolingual term extraction:
Rigouts Terryn, A., Hoste, V., & Lefever, E. (2022). Tagging Terms in Text: A Supervised Sequential Labelling Approach to Automatic Term Extraction. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 28(1). https://doi.org/10.1075/term.21010.rig
- journal paper on ACTER dataset:
Rigouts Terryn, A., Hoste, V., & Lefever, E. (2020). In No Uncertain Terms: A Dataset for Monolingual and Multilingual Automatic Term Extraction from Comparable Corpora. Language Resources and Evaluation, 54(2), 385–418. https://doi.org/10.1007/s10579-019-09453-9
Contact
This demo is a work in progress, so feel free to contact us at ayla.rigoutsterryn@kuleuven.be with suggestions on how to improve it. We will also gladly answer your questions regarding this demo.
Planned improvements include:
- exports in .tbx format
- easier submission of parallel corpora for multilingual extraction