StirWaC, Compiling a diverse corpus based on texts from the web for South Tyrolean German

Publication type
C1
Publication status
Published
Author
Editor
Stefan Evert, Egon Stemle, Paul Rayson
Series
Proceedings of the 8th Web as Corpus Workshop (WAC-8) @Corpus Linguistics 2013
Pagination
37-45

Abstract

In this paper, we report on the creation of a web corpus for the variety of German spoken in South Tyrol. We thereby provide an example of compiling a corpus for a language variety that has neighboring varieties and for which content on the internet is both sparse and published under various top-level domains. We discuss how we balanced data quantity against data quality. Our aim was twofold: to create a web corpus that is both diverse in terms of text types and highly representative of South Tyrolean German. We present our procedure for collecting relevant texts and an approach to enhancing diversity by detecting and filling gaps in a corpus.