MuST: Multilingual Corpora for the automatic Structuring of Terms.

Start date: Jan. 1, 2013
End date: Dec. 31, 2014
Sponsor: HOF (Research Fund University College Ghent)

About MuST

The MuST project aims to extract all domain-specific terms, as well as the semantic relationships between them, from a multilingual technical corpus. The automatic detection of synonymy and hyponymy links allows us to take an important step forward from a flat term list to a structured concept list.

For the automatic terminology extraction and the detection of semantic relations between the terms, we will use all parallel corpora at hand without adding any external lexical resources. This makes the approach generic and language-independent, so that it can be deployed dynamically to new domains or documents.

The bilingual terminology extraction is carried out with a previously developed terminology extraction tool that generates bilingual term pairs from a parallel corpus (Lefever et al. 2009).

For the automatic detection of synonyms, a distributional approach will be combined with a multilingual method. The distributional approach starts from the hypothesis that semantically related words occur in similar contexts. By comparing the context and syntactic information of terms, we can identify semantically related terms in the term list. For the multilingual method, we apply a previously developed approach to word sense disambiguation using parallel corpora (Lefever et al. 2011).

In order to automatically extract hyponymy relations, we develop an algorithm that adapts an existing pattern-based approach (Hearst 1992) to a multilingual context and that further optimizes the results by means of comparable corpora.

References:

Lefever, E., Macken, L. and Hoste, V. (2009). Language-independent bilingual terminology extraction from a multilingual parallel corpus. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. The Association for Computational Linguistics, Athens, Greece.

Lefever, E., Hoste, V. and De Cock, M. (2011). ParaSense or how to use Parallel Corpora for Word Sense Disambiguation. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA.

Hearst, M. (1992). Automatic acquisition of hyponyms from large text corpora. Proceedings of the International Conference on Computational Linguistics (COLING-1992), 539-545. Nantes, France.
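
Illustration:

Two of the ideas in the project description, the distributional hypothesis and pattern-based hyponym extraction, can be illustrated with a small monolingual sketch. This is not the MuST implementation: the function names, the window size, the toy sentences, and the single "X such as Y" pattern are all assumptions chosen for illustration, and the sketch ignores the syntactic information and the multilingual evidence that the project relies on.

```python
import math
import re
from collections import Counter


def context_vectors(sentences, window=2):
    """Count the words occurring within `window` tokens of each word.

    Hypothetical helper: plain whitespace tokenization, no syntactic analysis.
    """
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            vectors.setdefault(token, Counter()).update(neighbours)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# One lexico-syntactic pattern in the style of Hearst (1992): "X such as Y".
# Assumption: single-word terms only, English only.
SUCH_AS = re.compile(r"(\w+) such as (\w+)")


def hyponym_pairs(text):
    """Return (hyponym, hypernym) candidates matched by the 'such as' pattern."""
    return [(m.group(2), m.group(1)) for m in SUCH_AS.finditer(text.lower())]


# Toy corpus (invented for this example).
sentences = [
    "tighten the bolt with a wrench",
    "tighten the screw with a wrench",
    "read the manual before use",
]
vectors = context_vectors(sentences)

# Terms that share contexts score higher than terms that do not.
similar = cosine(vectors["bolt"], vectors["screw"])
dissimilar = cosine(vectors["bolt"], vectors["manual"])

pairs = hyponym_pairs("Use fasteners such as bolts in this assembly.")
```

In this toy setting, "bolt" and "screw" appear in identical contexts and therefore receive a higher similarity score than "bolt" and "manual", while the pattern proposes "bolts" as a hyponym candidate of "fasteners". A realistic system would need multi-word term handling, more patterns, and filtering of noisy matches.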