Leveraging synthetic monolingual data for fuzzy-match augmentation in neural machine translation: a preliminary study

Publication type
C1
Publication status
Published
Authors
Moerman, T.M., & Tezcan, A.
Editor
Arda Tezcan, Víctor M. Sánchez-Cartagena and Miquel Esplà-Gomis
Series
Proceedings of the First International Workshop on Knowledge-Enhanced Machine Translation
Pagination
34-39
Publisher
European Association for Machine Translation (EAMT)
Conference
1st International Workshop on Knowledge-Enhanced Machine Translation (KEMT 2024) (Sheffield, UK)

Abstract

This study investigates the integration of Large Language Models (LLMs) with Fuzzy Match (FM)-augmentation techniques to enhance Neural Machine Translation (NMT) systems. While previous research has underscored the efficacy of FM-augmentation in improving translation quality, particularly in domain-specific contexts with ample training data, its benefits diminish in general and low-resource settings. This study extends current methodologies by exploring the generation of monolingual data through LLMs for back translation (BT), enhancing the augmentation process. Neural Fuzzy Repair (NFR) is employed to refine the integration of back-translated data, contrasting this method with direct fuzzy match retrievals from target monolingual data using sentence embeddings. Preliminary findings from ongoing experiments, which utilize the DGT Translation Memory for the English-to-French language pair, suggest that while direct synthetic data incorporation through BT might not yield performance improvements, its use in conjunction with NFR enhances output quality. This research aims to broaden the applicability of FM-augmentation in NMT, particularly in scenarios lacking extensive bilingual or monolingual datasets.
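The fuzzy-match augmentation step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a real sentence encoder, the `@@@` separator and the 0.5 similarity threshold are assumptions, and the function names are hypothetical. The translation memory is modelled as a list of (source, target) pairs, where the source side could equally be a back-translation of LLM-generated target text.

```python
import math
from collections import Counter

SEPARATOR = "@@@"  # assumed token marking where the fuzzy-match target starts


def embed(sentence):
    """Toy stand-in for a sentence encoder: lowercase bag-of-words counts."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_fuzzy_match(query, memory, threshold=0.5):
    """Return (score, source, target) for the most similar memory entry
    whose score reaches the threshold, or None if no entry qualifies."""
    query_vec = embed(query)
    best = None
    for source, target in memory:
        score = cosine(query_vec, embed(source))
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best


def augment(query, memory, threshold=0.5):
    """Append the target side of the best fuzzy match to the input sentence;
    leave the sentence unchanged when nothing similar enough is found."""
    match = best_fuzzy_match(query, memory, threshold)
    if match is None:
        return query
    return f"{query} {SEPARATOR} {match[2]}"


# Hypothetical two-entry memory; the entries are illustrative, not DGT data.
memory = [
    ("the committee adopted the regulation", "le comité a adopté le règlement"),
    ("member states shall inform the commission", "les États membres informent la Commission"),
]
print(augment("the committee adopted the directive", memory))
```

In this sketch the NMT model would be trained and queried on the augmented inputs, so it can copy or adapt material from the retrieved translation. Swapping `embed` for a multilingual sentence encoder would allow matching directly against target-language sentences, which is the alternative retrieval route the abstract contrasts with back-translation.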