Multimodal emotion recognition has attracted increasing attention in affective computing, yet it is often grounded in a simplifying assumption: that emotions expressed through text, audio, and visual signals can be represented by a single, unified “gold” label. This dissertation challenges that assumption by arguing that emotional meaning is inherently perspectival, and that each modality carries its own interpretation of emotion, which may align with or diverge from those of other modalities. Rather than dismissing cross-modal divergence as noise, this work conceptualizes it as a meaningful signal that reflects the complexity of human emotion perception and can be systematically modeled.
From a human perspective, the dissertation first investigates how emotional interpretations diverge across modalities and annotators. A pilot study demonstrates that independently annotating text, audio, and visual channels yields unimodal emotion labels that do not always align with multimodal labels. Building on these findings, the UniC dataset is introduced as the first multimodal resource providing parallel unimodal and multimodal emotion annotations, enabling controlled comparisons of emotional perspectives across modalities. To better characterize annotation variability, the dissertation also proposes Absolute Annotation Difference (AAD) as a complement to traditional agreement measures. Empirical results show that disagreement is structured, modality-sensitive, and even predictable, challenging the view that low scores on traditional agreement measures necessarily indicate poor data quality.
From a machine perspective, the dissertation explores how perspectival structure can be operationalized in computational models. A hierarchical sentiment analysis study of Classical Chinese poetry serves as a pilot, demonstrating how modeling fine-grained line-level sentiment improves overall sentiment prediction. Extending to multimodal settings, the dissertation proposes modality weighting strategies based on unimodal supervision, which learn modality-specific weights from annotation divergence. These approaches enhance performance, robustness, and interpretability across datasets with different languages and cultural contexts.
Finally, the dissertation critically examines the conditions under which multimodality is beneficial. Systematic comparisons of unimodal, bimodal, and trimodal configurations reveal that cross-modal interactions are neither uniformly positive nor inherently additive. In certain scenarios, unimodal supervision rivals or even surpasses multimodal supervision, highlighting the importance of alignment between modalities, annotation schemes, and modeling assumptions.
Overall, this work advances a perspective-aware framework for multimodal emotion recognition, demonstrating that explicitly modeling modality divergence, hierarchical structure, and annotation disagreement leads to more theoretically grounded, interpretable, and context-sensitive affective computing systems.