SCATE taxonomy and corpus of machine translation errors

Publication type
B2
Publication status
Published
Authors
Tezcan, A., Hoste, V., & Macken, L.
Editor
Gloria Corpas Pastor and Isabel Durán-Muñoz
Series
Trends in E-tools and resources for translators and interpreters
Volume
45
Pagination
219-244
Publisher
Brill | Rodopi

Abstract

Quality Estimation (QE) and error analysis of Machine Translation (MT) output remain
active areas in Natural Language Processing (NLP) research. Many recent efforts have
focused on Machine Learning (ML) systems to estimate MT quality, translation errors,
post-editing speed or post-editing effort. As the accuracy of such ML tasks relies
on the availability of corpora, there is an increasing need for large corpora of machine
translations annotated with translation errors, and for error annotation guidelines that
produce consistent annotations. Drawing on previous work on translation error taxonomies,
we present the SCATE (Smart Computer-aided Translation Environment)
MT error taxonomy, which is hierarchical in nature and is based upon the familiar
notions of accuracy and fluency. In the SCATE annotation framework, we annotate
fluency errors in the target text and accuracy errors in both the source and target text,
while linking the source and target annotations. We also propose a novel method for
alignment-based Inter-Annotator Agreement (IAA) analysis and show that this method
can be used effectively on large annotation sets. Using the SCATE taxonomy and
guidelines, we create the first corpus of MT errors for the English-Dutch language pair,
consisting of Statistical Machine Translation (SMT) and Rule-Based Machine Translation
(RBMT) errors, which is a valuable resource not only for NLP tasks in this field
but also to study the relationship between MT errors and post-editing efforts in the
future. Finally, we analyse the error profiles of the SMT and the RBMT systems used in
this study and compare the quality of these two different MT architectures based on
the error types.