Detecting machine-translated subtitles in large parallel corpora

Publication type
C1
Publication status
Published
Authors
Lison, P., & Doğruöz, A.S.
Series
11th Workshop on Building and Using Comparable Corpora (LREC'18)
Pagination
25-32
Conference
11th Workshop on Building and Using Comparable Corpora (BUCC 2018) (Miyazaki, Japan)

Abstract

Parallel corpora extracted from online repositories of movie and TV subtitles are employed in a wide range of NLP applications, from language modelling to machine translation and dialogue systems. However, the subtitles uploaded to such repositories exhibit varying levels of quality. A particularly difficult problem stems from the fact that a substantial number of these subtitles are not written by human subtitlers but are simply generated through the use of online translation engines. This paper investigates whether these machine-generated subtitles can be detected automatically using a combination of linguistic and extra-linguistic features. We show that a feedforward neural network trained on a small dataset of subtitles can detect machine-generated subtitles with an F1-score of 0.64. Furthermore, applying this detection model to an unlabelled sample of subtitles allows us to provide a statistical estimate for the proportion of subtitles that are machine-translated (or are at least of very low quality) in the full corpus.
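
As a rough illustration of the two quantities mentioned in the abstract, the sketch below computes an F1-score from confusion counts and derives a proportion estimate from the rate at which a classifier flags items in an unlabelled sample. The abstract does not say how the paper obtains its estimate; the "adjusted count" correction used here is a standard technique swapped in as an assumption, and the function names and all numbers are hypothetical, not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def adjusted_proportion(flagged_rate: float, tpr: float, fpr: float) -> float:
    """Adjusted-count estimate of the true positive-class proportion p.

    Assumes flagged_rate = p * tpr + (1 - p) * fpr, where tpr and fpr are
    the classifier's true and false positive rates measured on labelled
    data. Solves for p and clips the result to [0, 1].
    """
    if tpr <= fpr:
        raise ValueError("the correction requires tpr > fpr")
    p = (flagged_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))


if __name__ == "__main__":
    # Invented counts and rates, for illustration only.
    print(f1_score(tp=6, fp=2, fn=2))
    print(adjusted_proportion(flagged_rate=0.3, tpr=0.8, fpr=0.1))
```

The correction matters because a detector with moderate F1 both misses some machine-translated subtitles and flags some human-written ones, so the raw fraction of flagged subtitles is generally a biased estimate of the true proportion.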