Comparing MT approaches for text normalization

Publication type: C1
Publication status: Published
Authors: Matos Veliz, C., De Clercq, O., & Hoste, V.
Series: Proceedings of Recent Advances in Natural Language Processing (RANLP 2019) : natural language processing in a deep learning world
Pagination: 740-749
Conference: 12th International Conference on 'Recent Advances in Natural Language Processing' (RANLP 2019) (Varna, Bulgaria)

Abstract

One of the main characteristics of social media data is the use of non-standard language. Since NLP tools have been trained on traditional text material, their performance drops when applied to social media data. One way to overcome this is to first perform text normalization. In this work, we apply text normalization to noisy English and Dutch text coming from different genres: text messages, message board posts and tweets. We consider the normalization task as a Machine Translation problem and test the two leading paradigms: statistical and neural machine translation. For SMT we explore the added value of varying background corpora for training the language model. For NMT we examine data augmentation, since the parallel datasets we are working with are limited in size. Our results reveal that when relying on SMT to perform the normalization, it is beneficial to use a background corpus that is close to the genre to be normalized. Regarding NMT, we find that the translations - or normalizations - coming out of this model are far from perfect and that, for a low-resource language like Dutch, adding training data works better than artificially augmenting the data.
