Categorical multimodal emotion recognition (MER) aims to infer discrete emotional states by integrating heterogeneous signals such as text, speech, and visual expressions. A persistent challenge in this setting is cross-modal inconsistency, where different modalities convey divergent emotional cues. Existing MER models often rely on attention mechanisms or implicit interaction layers to address this issue, but modality contributions are typically learned in an opaque manner and are rarely supervised directly. Leveraging the annotation scheme of the UniC dataset, which provides parallel unimodal and multimodal categorical emotion labels, this paper investigates how unimodal emotion supervision can be explicitly incorporated into multimodal learning. We examine representative late fusion and tensor fusion strategies and propose an explicit, per-sample modality weighting framework built upon multitask tensor fusion. During training, the proposed method derives modality importance from unimodal–multimodal label disagreement; at inference time, it predicts modality weights without relying on attention mechanisms. Experiments on the UniC dataset demonstrate that explicit modality weighting consistently improves performance and stability over strong multimodal baselines, achieving the highest average accuracy under seven-class emotion recognition. Additional evaluations on the CH-SIMS dataset confirm the generalisability of the approach. Beyond performance gains, the weighting design enables modality reliance analysis, offering interpretable insights into emotion-specific modality dependencies. This study provides one of the earliest systematic investigations into supervised modality weighting for robust and interpretable categorical MER.