Several studies, covering many language pairs and translation tasks, have demonstrated that translation quality has improved enormously since the emergence of neural machine translation (NMT) systems. This raises the question of whether such systems can produce high-quality translations of more difficult text types such as literature, and whether they can generate coherent translations at the document level.
We used Google’s NMT system to translate Agatha Christie’s novel The Mysterious Affair at Styles and report on a fine-grained error analysis of the complete novel. For the error classification, we used the SCATE taxonomy (Tezcan, 2017). This taxonomy differentiates between fluency errors (well-formedness of the target language) and accuracy errors (correct transfer of source content) and was adapted for document-level literary MT. We added two fluency categories to the original classification: 'coherence' and 'style & register'. These categories cover errors that are harder to spot. Coherence errors, for instance, can sometimes only be detected when evaluating at the document level (Läubli et al., 2018). A further reason for adding the 'coherence' category is that it is regarded as essential to literary MT evaluation (Voigt and Jurafsky, 2012; Moorkens et al., 2018).
Before analyzing the error annotation of the whole novel, we calculated the inter-annotator agreement (IAA) for annotations made independently by two annotators on the first chapter. We report the IAA on error detection (how many errors were detected by both annotators) and on error categorization (how many of those were assigned the same category). To find out how our annotation guidelines could be improved in future work, we also study the category distribution of the isolated annotations (i.e., annotations made by only one of the two annotators).
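As a rough illustration of the two agreement figures, the sketch below pairs the annotations of two annotators and derives a detection score, a categorization score, and the category distribution of the isolated annotations. It rests on assumptions that are not taken from the study: annotations are character spans, two annotations count as the same detected error when their spans overlap, pairing is greedy and one-to-one, and the `Annotation` class, the `agreement` function, and the toy data are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int      # character offset where the error span begins (inclusive)
    end: int        # character offset where the error span ends (exclusive)
    category: str   # error category label


def overlaps(a, b):
    """Assumed matching criterion: two spans match if they share a character."""
    return a.start < b.end and b.start < a.end


def match_annotations(ann_a, ann_b):
    """Greedily pair each annotation of A with the first unused overlapping one of B."""
    pairs, isolated_a, used_b = [], [], set()
    for a in ann_a:
        match = next(
            (j for j, b in enumerate(ann_b) if j not in used_b and overlaps(a, b)),
            None,
        )
        if match is None:
            isolated_a.append(a)
        else:
            used_b.add(match)
            pairs.append((a, ann_b[match]))
    isolated_b = [b for j, b in enumerate(ann_b) if j not in used_b]
    return pairs, isolated_a, isolated_b


def agreement(ann_a, ann_b):
    """Return (detection agreement, categorization agreement, isolated categories)."""
    pairs, iso_a, iso_b = match_annotations(ann_a, ann_b)
    total = len(ann_a) + len(ann_b)
    # Share of all annotations that were detected by both annotators.
    detection = 2 * len(pairs) / total if total else 0.0
    # Share of jointly detected errors that received the same category.
    same = sum(1 for a, b in pairs if a.category == b.category)
    categorization = same / len(pairs) if pairs else 0.0
    # Category distribution of annotations made by only one annotator.
    isolated = Counter(x.category for x in iso_a + iso_b)
    return detection, categorization, isolated


# Toy data; spans and most category names are placeholders.
annotator_a = [
    Annotation(0, 5, "mistranslation"),
    Annotation(10, 15, "coherence"),
    Annotation(20, 25, "grammar"),
]
annotator_b = [
    Annotation(3, 8, "mistranslation"),
    Annotation(12, 18, "style & register"),
    Annotation(40, 45, "coherence"),
]

detection, categorization, isolated = agreement(annotator_a, annotator_b)
print(f"{detection:.2f}")   # 0.67
print(categorization)       # 0.5
print(dict(isolated))       # {'grammar': 1, 'coherence': 1}
```

Raw proportions like these do not correct for chance agreement; a chance-corrected coefficient such as Cohen's kappa could be computed over the matched pairs instead.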
Finally, we take a close look at the error annotation of the whole novel. We investigate the co-occurrence of fluency and accuracy errors: if specific accuracy and fluency errors co-occur regularly, it is highly likely that the fluency errors are caused by the accuracy errors. We also compare the category distribution of all error annotations with the distributions reported in other studies that use the SCATE taxonomy to evaluate NMT output. We expect to find fewer errors, since the NMT system we used is more recent.
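One possible way to count such co-occurrences is sketched below. It assumes, without support from the study itself, that an accuracy error and a fluency error co-occur when their annotated spans overlap within the same text; the `cooccurrence` function, the tuple layout, and the toy data are hypothetical.

```python
from collections import Counter


def cooccurrence(accuracy, fluency):
    """Count (accuracy category, fluency category) pairs whose spans overlap.

    Each annotation is a (start, end, category) tuple with an exclusive end.
    """
    counts = Counter()
    for a_start, a_end, a_cat in accuracy:
        for f_start, f_end, f_cat in fluency:
            # Assumed criterion: the two spans share at least one character.
            if a_start < f_end and f_start < a_end:
                counts[(a_cat, f_cat)] += 1
    return counts


# Toy data; spans and most category names are placeholders.
accuracy_errors = [(0, 10, "mistranslation"), (30, 35, "omission")]
fluency_errors = [(5, 12, "coherence"), (8, 9, "lexicon"), (50, 55, "grammar")]

counts = cooccurrence(accuracy_errors, fluency_errors)
for (acc_cat, flu_cat), n in counts.most_common():
    print(acc_cat, flu_cat, n)
```

Pairs with high counts relative to the overall frequency of their categories would be the candidates for fluency errors that stem from accuracy errors.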