Dutch compound splitting for bilingual terminology extraction

Publication type
B2
Publication status
Published
Authors
Macken, L., & Tezcan, A.
Editors
Ruslan Mitkov, Johanna Monti, Gloria Corpas Pastor and Violeta Seretan
Series
Multiword units in machine translation and translation technology
Volume
341
Pagination
148-162
Publisher
John Benjamins

Abstract

Compounds pose a problem for applications that rely on precise word alignments such as bilingual terminology extraction. We therefore developed a state-of-the-art hybrid compound splitter for Dutch that makes use of corpus frequency information and linguistic knowledge. Domain-adaptation techniques are used to combine large out-of-domain and dynamically compiled in-domain frequency lists. We perform an extensive intrinsic evaluation on a Gold Standard set of 50,000 Dutch compounds and a set of 5,000 Dutch compounds belonging to the automotive domain. We also propose a novel methodology for word alignment that makes use of the compound splitter. As compounds are not always translated compositionally, we train the word alignment models twice: a first time on the original data set and a second time on the data set in which the compounds are split into their component parts. The obtained word alignment points are then combined.
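
The two ideas in the abstract, frequency-driven splitting and combining alignment points from an unsplit and a split run, can be illustrated with a minimal sketch. This is not the authors' splitter: it uses only a simple geometric-mean frequency heuristic, omits the linguistic knowledge and domain adaptation the abstract mentions, and assumes that the alignment points are combined by taking their union after projecting split parts back to the original token. The frequency counts, the set of linking elements, and all function names are hypothetical.

```python
from math import prod

# Toy frequency list (hypothetical counts, for illustration only).
FREQ = {"auto": 500, "deur": 300, "autodeur": 2, "stad": 400, "bestuur": 250}
LINKERS = ("", "s", "e", "en")  # assumed set of Dutch linking elements


def candidate_splits(word, min_len=3):
    """Yield the unsplit word, then every split whose non-final parts are known."""
    yield [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, rest = word[:i], word[i:]
        for linker in LINKERS:
            if linker and not head.endswith(linker):
                continue
            stem = head[: len(head) - len(linker)] if linker else head
            if stem in FREQ:
                for tail in candidate_splits(rest, min_len):
                    yield [stem] + tail


def score(parts):
    """Geometric mean of the part frequencies (unknown parts count as 0)."""
    return prod(FREQ.get(p, 0) for p in parts) ** (1.0 / len(parts))


def best_split(word):
    """Pick the highest-scoring candidate; ties keep the unsplit word."""
    return max(candidate_splits(word), key=score)


def split_sentence(tokens):
    """Split every token and remember which original token each part came from."""
    parts, origin = [], []
    for idx, tok in enumerate(tokens):
        for part in best_split(tok):
            parts.append(part)
            origin.append(idx)
    return parts, origin


def combine_alignments(original, split, origin):
    """Project split-run alignment points to original indices and take the union.

    The union is an assumption; the abstract only says the points are combined.
    """
    projected = {(origin[s], t) for s, t in split}
    return set(original) | projected


# Usage: Dutch "de autodeur" aligned with English "the car door".
parts, origin = split_sentence(["de", "autodeur"])
original_run = {(0, 0), (1, 2)}       # hypothetical output of the unsplit run
split_run = {(0, 0), (1, 1), (2, 2)}  # hypothetical output of the split run
combined = combine_alignments(original_run, split_run, origin)
```

In this toy case the unsplit run links "autodeur" only to "door", while the split run recovers the link to "car"; the combined set keeps both points on the original compound token.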