Most datasets for multimodal emotion recognition contain only a single emotion annotation for all modalities combined, which then also serves as the gold standard for the individual modalities. This procedure, however, ignores the fact that each modality constitutes a unique perspective with its own cues. Moreover, as in unimodal emotion analysis, the perspectives of annotators can diverge in a multimodal setup as well. In this paper, we therefore propose to annotate each modality independently and to investigate more closely how perspectives diverge across modalities and annotators. We also explore the role of annotator training in perspectivism. We find that, among the unimodal settings, the annotations made on text most closely resemble those of the multimodal setup. Furthermore, we see that annotator training improves agreement in modalities with lower agreement scores, but that it also reduces the variety of perspectives. We therefore suggest that moderate training, which still values the individual perspectives of annotators, might be beneficial before starting annotation. Finally, we observe that negative sentiment and emotions tend to be annotated more inconsistently across the different modality setups.