Interpretability can be used as a means to understand the decisions made by (black-box) models such as machine translation (MT) systems or large language models (LLMs). Yet it has barely been explored in relation to a long-standing problem in these systems: gender bias, which, although widely studied, remains unsolved (Savoldi et al., 2025). One study, by Attanasio et al. (2023), has explored the interplay between these two domains and found that interpretability is a “valuable tool for studying and mitigating bias in language models”. This interplay is the focus of this research.
It has been shown that certain contextual cues influence a human’s perception of gender in an ambiguous sentence context (Hackenbuchner et al., forthcoming). We aim to understand whether the same contextual cues influence an MT system when it translates a person’s gender with a certain inflection. To study this, we use saliency scores to analyse which input tokens are most relevant for a given translation decision. Specifically, we focus on contrastive explanations (Yin and Neubig, 2022), which have been shown to outperform non-contrastive ones and which identify the input tokens that lead a model to produce one output rather than another (why is X predicted instead of Y?). This interpretability technique helps us understand which contextual cues (input tokens) in the source sentence steer the model towards one gender inflection rather than another in the target translation.
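One way to formalise what is being attributed (a minimal formulation in the spirit of Yin and Neubig (2022) and of a probability-difference objective; the notation here is ours, not taken from the study) is the probability difference at the target position where the two translations diverge:

Δ_t = p(y_t | x, y_<t) − p(ỹ_t | x, y_<t)

where x is the source sentence, y_<t the shared translation prefix, y_t the observed gender inflection (e.g., Schriftsteller) and ỹ_t the contrastive one (e.g., Schriftstellerin). Saliency scores over the input tokens are then computed with respect to Δ_t rather than with respect to a single output probability.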
This research is conducted on the dataset of gender-ambiguous sentences introduced in Hackenbuchner et al. (forthcoming) and is compared to that study’s human annotations of the contextual cues that affected annotators’ gender perceptions. Using the inseq toolkit (Sarti et al., 2023), we compute contrastive explanations as the difference between two attribution targets (the difference in probability between two options): the target translation taken from the original dataset and a translation that contrasts with it in gender. For example, we analyse which input tokens in the source “The business writer from Miami” lead to a higher probability of a masculine translation (e.g., DE: Schriftsteller) or a feminine one (e.g., DE: Schriftstellerin).
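To illustrate how such a contrastive attribution can be obtained with inseq, a minimal sketch follows; the model, attribution method and sentences are illustrative assumptions rather than the exact configuration of this study, and argument names may vary slightly across inseq versions.

import inseq

# Load an MT model together with a gradient-based saliency method
# (illustrative choices, not necessarily those used in this study).
model = inseq.load_model("Helsinki-NLP/opus-mt-en-de", "saliency")

# Attribute the observed translation against a contrastive one that
# differs only in the gender inflection of the person noun.
out = model.attribute(
    input_texts="The business writer from Miami was on holiday.",
    generated_texts="Der Schriftsteller aus Miami war im Urlaub.",      # masculine inflection
    attributed_fn="contrast_prob_diff",                                 # p(target) - p(contrast)
    contrast_targets="Die Schriftstellerin aus Miami war im Urlaub.",   # feminine inflection
)

# Inspect which source tokens raise the probability of one inflection over the other.
out.show()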
Preliminary results show a noticeable overlap between human perceptions and model attributions. The contextual cues (words) that influence human perception of gender are among the tokens that most frequently influence a model’s translation of gender in the target (i.e., that increase the probability of one gender output over another). This suggests that humans and models are influenced by very similar (if not the same) contextual cues with regard to gender, even in ambiguous scenarios. With this study, we contribute to the still very limited research on interpretability of model decisions in the translation of gender.
----------------------------------------------------------------------------------------------------------------------------------------------
Beatrice Savoldi, Jasmijn Bastings, Luisa Bentivogli, and Eva Vanmassenhove. 2025. A decade of gender bias in machine translation. Patterns, 6(6).
Gabriele Sarti, Nils Feldhus, Ludwig Sickert, and Oskar van der Wal. 2023. Inseq: An Interpretability Toolkit for Sequence Generation Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 421–435, Toronto, Canada. Association for Computational Linguistics.
Giuseppe Attanasio, Flor Miriam Plaza del Arco, Debora Nozza, and Anne Lauscher. 2023. A Tale of Pronouns: Interpretability Informs Gender Bias Mitigation for Fairer Instruction-Tuned Machine Translation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3996–4014, Singapore. Association for Computational Linguistics.
Janiça Hackenbuchner, Arda Tezcan, and Joke Daems. Forthcoming. Gender Bias and the Role of Context in Human Perception and Machine Translation. CLIN, vol. 14.
Kayo Yin and Graham Neubig. 2022. Interpreting Language Models with Contrastive Explanations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 184–198, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.