Leveraging synthetic monolingual data for fuzzy-match augmentation in neural machine translation: a preliminary study

Publication type: C1
Publication status: Published
Authors: Moerman, T.M., & Tezcan, A.
Editor: Arda Tezcan, Víctor M. Sánchez-Cartagena and Miquel Esplà-Gomis
Series: Proceedings of the First International Workshop on Knowledge-Enhanced Machine Translation
Pagination: 34-39
Publisher: European Association for Machine Translation (EAMT)
Conference: 1st International Workshop on Knowledge-Enhanced Machine Translation (KEMT 2024) (Sheffield, UK)

Abstract

This study investigates the integration of Large Language Models (LLMs) with Fuzzy Match (FM)-augmentation techniques to enhance Neural Machine Translation (NMT) systems. While previous research has underscored the efficacy of FM-augmentation in improving translation quality, particularly in domain-specific contexts with ample training data, its benefits diminish in general and low-resource settings. This study extends current methodologies by exploring the generation of monolingual data through LLMs for back translation (BT), enhancing the augmentation process. Neural Fuzzy Repair (NFR) is employed to refine the integration of back-translated data, contrasting this method with direct fuzzy match retrievals from target monolingual data using sentence embeddings. Preliminary findings from ongoing experiments, which utilize the DGT Translation Memory for the English-to-French language pair, suggest that while direct synthetic data incorporation through BT might not yield performance improvements, its use in conjunction with NFR enhances output quality. This research aims to broaden the applicability of FM-augmentation in NMT, particularly in scenarios lacking extensive bilingual or monolingual datasets.
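The fuzzy-match augmentation idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a real sentence-embedding model, and the function names, the `" @@@ "` separator, and the 0.5 threshold are all assumptions chosen for illustration. The sketch retrieves the most similar source sentence from a small translation memory and appends its target side to the input, which is the general shape of the augmented input an NFR-style system would be trained on. In the setting the abstract describes, the source side of such a memory could be produced by back-translating LLM-generated target-language text.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_fuzzy_match(query, memory, threshold=0.5):
    """Return (score, source, target) for the most similar entry, or None.

    `memory` is a list of (source, target) pairs; `threshold` is an
    assumed cut-off below which a match is considered too weak to use.
    """
    best = None
    query_vec = embed(query)
    for source, target in memory:
        score = cosine(query_vec, embed(source))
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best


def augment(query, memory, sep=" @@@ ", threshold=0.5):
    """Append the target side of the best fuzzy match to the query.

    The separator string is a hypothetical choice; the query is returned
    unchanged when no match reaches the threshold.
    """
    match = best_fuzzy_match(query, memory, threshold)
    return query if match is None else query + sep + match[2]


memory = [
    ("the committee adopted the regulation", "le comite a adopte le reglement"),
    ("the weather is nice", "il fait beau"),
]
augmented = augment("the committee adopted the directive", memory)
unchanged = augment("hello world", memory)
```

Here `augmented` carries the French side of the first memory entry after the separator, while `unchanged` is returned as-is because no entry is similar enough. A real system would replace `embed` with a trained sentence encoder and an approximate nearest-neighbour index.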
