News topic classification as a first step towards diverse news recommendation

Publication type
A2
Publication status
Published
Authors
De Clercq, O., De Bruyne, L., & Hoste, V.
Journal
Computational Linguistics in the Netherlands Journal
Volume
10
Pagination
37-55

Abstract

When developing an algorithm that uses news diversity as a key driver for personalized news recommendation, it is crucial to find ways to cluster news articles in a fine-grained manner, ideally by leveraging the content of the text. In this paper we investigate semantic classification of news articles in an unfiltered news stream. We first present an analysis of the EventDNA corpus, a collection of Dutch-language news articles annotated with event data according to a predefined typology. We found that the types assigned as features of events do not allow for such a semantic classification, and we therefore investigate the IPTC News Media Topics standard as an alternative. By mapping event types to manually assigned IPTC topics, we observe that a more diversified picture emerges, which leads us to conclude that the IPTC classification is a useful proxy. Based on a historical data sample of Dutch news articles covering the year 2018, we then perform a series of machine learning experiments to automatically predict the top two levels of the IPTC taxonomy. Various multi-label classification models are built with BERTje using a bottom-up and a top-down approach. The results reveal that the top-down approach performs best, with an overall macro F1 score of 86.4% and a Jaccard accuracy of 89.2% for the level-one topics, and 83.7% and 87.5%, respectively, for the level-two predictions.
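
As an illustration of the two evaluation measures named in the abstract, the following is a minimal sketch of how macro F1 and Jaccard accuracy are commonly computed for multi-label predictions. The abstract does not spell out the exact definitions used in the paper, so the sample-averaged Jaccard and per-label macro F1 below are assumptions, and the function names, topic labels, and data are hypothetical placeholders rather than material from the study.

```python
from typing import List, Set


def jaccard_accuracy(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Mean over articles of |gold ∩ pred| / |gold ∪ pred|."""
    scores = []
    for g, p in zip(gold, pred):
        union = g | p
        # Assumption: an article with no gold and no predicted labels
        # is counted as a perfect match.
        scores.append(len(g & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def macro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    labels = sorted(set().union(*gold, *pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical gold and predicted topic sets for three articles.
gold = [{"sport"}, {"politics", "economy"}, {"economy"}]
pred = [{"sport"}, {"politics"}, {"economy", "sport"}]

print(f"Jaccard accuracy: {jaccard_accuracy(gold, pred):.3f}")
print(f"Macro F1:         {macro_f1(gold, pred):.3f}")
```

Macro averaging weights every topic equally regardless of how often it occurs, whereas the Jaccard measure is averaged per article, so the two figures can diverge when rare topics are predicted poorly.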