Mimicking how humans interpret out-of-context sentences through controlled toxicity decoding

Publication type
C1
Publication status
Published
Authors
Trusca, M., & Allein, L.
Series
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Pagination
291-297
Publisher
Association for Computational Linguistics (ACL)
Conference
5th Workshop on Trustworthy NLP (TrustNLP 2025), colocated with the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2025) (Albuquerque, New Mexico)

Abstract

Interpretations of a single sentence can vary, particularly when its context is lost. This paper aims to simulate how readers perceive content with varying toxicity levels by generating diverse interpretations of out-of-context sentences. By modeling toxicity, we can anticipate misunderstandings and reveal hidden toxic meanings. Our proposed decoding strategy explicitly controls toxicity in the set of generated interpretations by (i) aligning interpretation toxicity with the input, (ii) relaxing toxicity constraints for more toxic input sentences, and (iii) promoting diversity in toxicity levels within the set of generated interpretations. Experimental results show that our method improves alignment with human-written interpretations in both syntax and semantics while reducing model prediction uncertainty.
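
The abstract does not give implementation details, but the three constraints (i)-(iii) can be illustrated with a minimal sketch of a candidate-selection step. Everything below is an assumption for illustration: the paper's method operates at decoding time, whereas this sketch re-ranks already-generated candidates, and the `toxicity_score` function, tolerance formula, and all names are hypothetical stand-ins rather than the authors' implementation.

```python
# Illustrative sketch only: selecting interpretations so that their toxicity
# (i) stays close to the input's toxicity, (ii) with a tolerance that widens
# for more toxic inputs, and (iii) with toxicity levels spread apart for
# diversity. All names and formulas here are hypothetical.

def toxicity_score(text: str) -> float:
    """Hypothetical toxicity scorer in [0, 1]; in practice a trained classifier."""
    toxic_markers = {"hate", "stupid", "idiot"}
    words = text.lower().split()
    return min(1.0, 5 * sum(w in toxic_markers for w in words) / max(len(words), 1))


def select_interpretations(input_sentence, candidates, k=3, base_tol=0.1):
    """Pick k candidate interpretations under constraints (i)-(iii)."""
    t_in = toxicity_score(input_sentence)
    # (ii) relax the toxicity constraint as the input itself gets more toxic
    tol = base_tol + 0.5 * t_in
    scored = [(c, toxicity_score(c)) for c in candidates]
    # (i) keep candidates whose toxicity falls inside the tolerance band;
    # fall back to the closest candidates if none qualify
    admissible = [(c, t) for c, t in scored if abs(t - t_in) <= tol] or \
        sorted(scored, key=lambda ct: abs(ct[1] - t_in))[:k]
    # (iii) greedily add the candidate whose toxicity differs most from
    # the toxicity levels already selected
    chosen = [min(admissible, key=lambda ct: abs(ct[1] - t_in))]
    remaining = [ct for ct in admissible if ct is not chosen[0]]
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda ct: min(abs(ct[1] - s[1]) for s in chosen))
        chosen.append(best)
        remaining.remove(best)
    return [c for c, _ in chosen]


if __name__ == "__main__":
    sentence = "They never listen to anyone."
    candidates = [
        "The speaker feels ignored by a group of people.",
        "Those idiots refuse to hear any other opinion.",
        "The group tends to make decisions without consulting others.",
    ]
    print(select_interpretations(sentence, candidates))
```

In this reading, the tolerance term plays the role of constraint (ii) and the greedy max-min step approximates constraint (iii); the actual decoding strategy in the paper may realize these constraints quite differently.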