Why gender and age prediction from tweets is hard: lessons from a crowdsourcing experiment

Publication type: C1
Publication status: Published
Authors: Nguyen, D., Trieschnigg, D., Doğruöz, A.S., Gravel, R., Theune, M., Meder, T., & de Jong, F.
Series: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: technical papers
Pagination: 1950-1961
Publisher: Dublin City University and Association for Computational Linguistics (Dublin, Ireland)
Conference: 25th International Conference on Computational Linguistics (COLING 2014) (Dublin, Ireland)

Abstract

There is a growing interest in automatically predicting the gender and age of authors from texts. However, most research so far ignores that language use is related to the social identity of speakers, which may be different from their biological identity. In this paper, we combine insights from sociolinguistics with data collected through an online game, to underline the importance of approaching age and gender as social variables rather than static biological variables. In our game, thousands of players guessed the gender and age of Twitter users based on tweets alone. We show that more than 10% of the Twitter users do not employ language that the crowd associates with their biological sex. It is also shown that older Twitter users are often perceived to be younger. Our findings highlight the limitations of current approaches to gender and age prediction from texts.
