Parallel corpora for Dutch word sense disambiguation.

Ambiguity remains one of the major problems for current machine translation systems. The example sentence "Apple has doubled its profits in 2005" is translated by Babelfish (babelfish.altavista.com) as "De appel heeft zijn winsten in 2005 verdubbeld". Although "appel" (the fruit) is a correct translation of the word "apple", it is the wrong translation in this context. Other language technology applications, such as question answering (QA) and information retrieval (IR) systems, also suffer from poor contextual word sense disambiguation (WSD). WSD is considered one of the most difficult problems in language technology today: it requires a form of artificial text understanding, since the system must detect the correct sense of a word on the basis of its context.

In this project we want to develop a generic automatic WSD system for Dutch. This system should detect words with more than one sense and assign the correct contextual sense. Current state-of-the-art WSD systems are mainly based on supervised learning algorithms that learn from labeled data, i.e. corpora in which sense labels have been assigned manually. Given that such corpora hardly exist for Dutch and that manual labeling is very time-consuming and expensive, we will start from parallel corpora instead. The approach of deriving word senses automatically from parallel corpora is based on the observation that a word with more than one sense often has a different translation for each of these senses. For example, since the Dutch word "blik" is translated into English as both "glance" and "tin", we can conclude that "blik" has at least two distinct senses. The use of parallel corpora for WSD has been investigated in several studies, among others for English and Chinese, and appears to be a promising method (Ng et al. 2003; Shao and Ng 2004; etc.). Using parallel corpora solves a couple of other issues as well.
Defining the possible senses of a polysemous word is rather subjective, and many words are assigned different senses in different dictionaries. In addition, there is a granularity problem: it is not clear how fine-grained sense distinctions must be in order to be useful in concrete applications, and not all sense distinctions are lexicalised in all languages. Taking the English word "head" as an example, we see that this word is always translated as "hoofd" in Dutch (both for the "chief" sense and for the "body part" sense). In this project we want to examine the following research topics:
- to what extent can we detect word senses automatically on the basis of parallel corpora, without using any information from dictionaries or other lexical resources?
- how large is the error rate of automatic word alignment of parallel corpora?
- how much syntactic knowledge is needed for reliable detection of word senses/translations?
- what is the optimal variation in contrasting languages for establishing an efficient sense inventory?
- what is the optimal granularity for achieving good performance?
- which improvements in terms of precision and recall can we obtain by integrating our automatic WSD system into a practical application?
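The core intuition behind the approach, that distinct translations of a word signal distinct senses, can be sketched in a few lines of code. The fragment below is a minimal illustration, not part of the proposed system: the corpus is a hand-made toy sample, and the word alignments (index pairs linking Dutch and English tokens) are assumed to come from an external alignment tool such as GIZA++.

```python
from collections import Counter

# Toy word-aligned sentence pairs: (Dutch tokens, English tokens, alignment).
# Each alignment entry (i, j) links Dutch token i to English token j.
# These alignments are hand-made for illustration; in practice they would
# be produced automatically, with the error rate studied in this project.
corpus = [
    (["hij", "wierp", "een", "blik", "op", "de", "kaart"],
     ["he", "cast", "a", "glance", "at", "the", "map"],
     [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]),
    (["zij", "opende", "een", "blik", "soep"],
     ["she", "opened", "a", "tin", "of", "soup"],
     [(0, 0), (1, 1), (2, 2), (3, 3), (4, 5)]),
    (["een", "blik", "van", "verstandhouding"],
     ["a", "glance", "of", "understanding"],
     [(0, 0), (1, 1), (2, 2), (3, 3)]),
]

def translation_counts(corpus, source_word):
    """Count the English words aligned to one Dutch word across the corpus."""
    counts = Counter()
    for dutch, english, alignment in corpus:
        for i, j in alignment:
            if dutch[i] == source_word:
                counts[english[j]] += 1
    return counts

senses = translation_counts(corpus, "blik")
print(senses)  # Counter({'glance': 2, 'tin': 1})
```

Each distinct translation ("glance", "tin") is then a candidate sense of "blik", giving a sense inventory induced purely from the parallel data, without any dictionary. The open questions listed above (alignment error rate, choice of contrasting languages, granularity) determine how noisy and how fine-grained this induced inventory becomes.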