Exploring Cross-Modal Interactions in Unimodal and Multimodal Emotion Recognition: An Empirical Study

Publication type
U
Publication status
In press
Authors
Du, Q., De Langhe, L., Lefever, E., & Hoste, V.
Conference
Workshop on Computational Affective Science (Palma, Mallorca)

Abstract

Understanding how cross-modal interactions influence unimodal and multimodal emotion recognition remains an open question in multimodal affective computing. This study presents a systematic empirical investigation of how multimodal inputs affect both unimodal and multimodal emotion recognition performance. Using the UniC dataset, which provides modality-specific and global multimodal annotations across text, audio, and visual modalities, we conduct experiments based on the Tensor Fusion Network (TFN) under unimodal, bimodal, and trimodal configurations. Results show that cross-modal interactions exert complex and asymmetric effects. While additional modalities can provide complementary emotional cues, they may also introduce interference when their signals diverge. Models continue to struggle with less frequent or extreme emotions such as disgust. Notably, multimodal embeddings combined with unimodal annotations outperform fully multimodal supervision in the same setup, highlighting the role of annotation consistency and cue reliability. These findings offer systematic empirical validation of long-held assumptions, demonstrating that cross-modal effects are not simply additive, and underscore the need for more interpretable multimodal fusion strategies.
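
The Tensor Fusion Network referenced in the abstract (Zadeh et al., 2017) fuses modalities through an outer product of the per-modality embeddings, each augmented with a constant 1 so that the fused tensor retains unimodal and bimodal interaction terms alongside the trimodal ones. The minimal PyTorch sketch below illustrates that fusion step only; the embedding sizes, class count, and names (`TensorFusion`, `d_text`, etc.) are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of the TFN fusion step: augment each modality
# embedding with a constant 1, fuse via a three-way outer product,
# and classify from the flattened fusion tensor. Dimensions are
# assumed for illustration.
import torch
import torch.nn as nn


class TensorFusion(nn.Module):
    def __init__(self, d_text=32, d_audio=16, d_visual=16, n_classes=6):
        super().__init__()
        fused_dim = (d_text + 1) * (d_audio + 1) * (d_visual + 1)
        self.classifier = nn.Linear(fused_dim, n_classes)

    def forward(self, z_t, z_a, z_v):
        ones = torch.ones(z_t.size(0), 1, device=z_t.device)
        # Appending a 1 to each embedding means the outer product
        # contains unimodal and bimodal sub-tensors, not only the
        # full trimodal interaction terms.
        z_t = torch.cat([z_t, ones], dim=1)  # (B, d_text + 1)
        z_a = torch.cat([z_a, ones], dim=1)  # (B, d_audio + 1)
        z_v = torch.cat([z_v, ones], dim=1)  # (B, d_visual + 1)
        # Batched three-way outer product, then flatten per sample.
        fused = torch.einsum("bi,bj,bk->bijk", z_t, z_a, z_v)
        return self.classifier(fused.flatten(1))


# Usage with a batch of 4 samples under the assumed dimensions.
model = TensorFusion()
logits = model(torch.randn(4, 32), torch.randn(4, 16), torch.randn(4, 16))
print(logits.shape)  # torch.Size([4, 6])
```

Bimodal and unimodal configurations of the kind the study compares can be obtained by dropping one or two input embeddings, which reduces the einsum to two factors or one.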