Universal dependencies for spoken Spanish

Publication type
D1
Publication status
Published
Author
Bonilla, J.
Publisher
Ghent University. Faculty of Arts and Philosophy (Ghent, Belgium)

Abstract

In Natural Language Processing (NLP), Part-of-Speech (PoS) tagging and parsing are foundational tasks that support the structural analysis and interpretation of language. PoS tagging assigns each word its grammatical category and prepares text for later processes such as parsing, which uncovers the syntactic structure of a sentence by identifying the relationships between its words and phrases. The accuracy of these processes is crucial for many NLP applications, including text analysis and machine translation.
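To make the two tasks concrete, the sketch below stores one sentence with a PoS tag, a head index, and a dependency relation for each word. The sentence is a hypothetical example, not taken from COSER; the tag and relation names follow Universal Dependencies conventions, and the `Token` class and helper functions are illustrative names only.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int     # 1-based position in the sentence
    form: str    # surface word
    upos: str    # Universal Dependencies PoS tag
    head: int    # index of the syntactic head (0 marks the root)
    deprel: str  # dependency relation to the head


# Hypothetical example sentence (not from COSER): "El perro come pan"
SENTENCE = [
    Token(1, "El", "DET", 2, "det"),
    Token(2, "perro", "NOUN", 3, "nsubj"),
    Token(3, "come", "VERB", 0, "root"),
    Token(4, "pan", "NOUN", 3, "obj"),
]


def pos_tags(sentence):
    """PoS tagging output: one (word, tag) pair per token."""
    return [(t.form, t.upos) for t in sentence]


def dependents(sentence, head_idx):
    """Parsing output: the words attached to a given head."""
    return [t.form for t in sentence if t.head == head_idx]


def root(sentence):
    """Return the single token whose head is 0."""
    roots = [t for t in sentence if t.head == 0]
    if len(roots) != 1:
        raise ValueError("a dependency tree needs exactly one root")
    return roots[0]


print(pos_tags(SENTENCE))
print(root(SENTENCE).form, dependents(SENTENCE, root(SENTENCE).idx))
```

The tagger's job is to fill in the `upos` field; the parser's job is to fill in `head` and `deprel`.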

Despite significant advances, NLP resources and models are primarily designed for high-resource languages and dialects, often neglecting the complexities of spoken language. This oversight is particularly evident in dialects with limited written representation, where spoken language's fluid syntax and regional idiosyncrasies pose unique challenges for standard models trained predominantly on written text.

Addressing this gap, our comprehensive research involves four in-depth studies dedicated to developing and evaluating a treebank based on the transcriptions of the Corpus Oral y Sonoro del Español Rural (COSER - Audible Corpus of Spoken Rural Spanish). This project targets spoken Spanish, especially its rural dialects, filling a crucial void in NLP resources for low-resource dialects. The studies follow a trajectory from PoS tagging to syntactic parsing, incorporating innovative methods such as gamification for data annotation and fine-tuning existing NLP models. This approach adapts to the syntactic and semantic nuances of spoken language, making significant strides in the visibility and understanding of spoken Spanish in NLP.

The overarching goal is to improve the processing of spoken Spanish in NLP applications. This involves creating a gold standard corpus for PoS tagging (COSER-PoS) and leveraging crowdsourcing through Games with a Purpose (GWAP) for the subsequent development of the COSER-UD treebank. This treebank includes not only PoS tags but also dependency annotations, in line with the need to adapt NLP models to the syntactic and semantic intricacies of spoken Spanish. These efforts are vital in addressing the underrepresentation of spoken language in current NLP resources, particularly for dialects with minimal written presence.

The research methodology unfolds in two interconnected phases. The first phase focuses on PoS tagging and gamification. It aims to tag spoken Spanish accurately at the morphosyntactic level, using gamification to make data annotation more efficient and assessing how effective that approach is. In addition, a model built on the latest neural network architecture is fine-tuned so that tagger performance on the COSER-PoS dataset can be compared with performance on standard datasets such as AnCora.
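The abstract does not say how answers collected through the game are combined into a final tag. One common option in crowdsourced annotation is majority voting with a minimum agreement threshold, and the sketch below illustrates that option only. The function name, the threshold value, and the sample votes are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_votes(votes, min_agreement=0.6):
    """Combine several players' tags for one token by majority vote.

    Returns (tag, agreement). The tag is None when the most frequent
    answer does not reach min_agreement, so the token can be sent to
    an expert annotator instead. The 0.6 default is an arbitrary,
    illustrative threshold.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    tag, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    if agreement < min_agreement:
        return None, agreement
    return tag, agreement


# Hypothetical votes from three players for a single token
print(aggregate_votes(["NOUN", "NOUN", "VERB"]))
# A split vote falls below the threshold and is left undecided
print(aggregate_votes(["NOUN", "VERB"]))
```

Keeping the threshold above 0.5 means a tied vote can never produce a tag, which avoids depending on the order in which answers arrived.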

The second phase shifts to parsing and the development of the COSER-UD treebank for spoken Spanish. This involves annotating dependency relations and handling features specific to spoken language, such as ellipsis, disfluencies, and word order variation. Training and evaluating parsing models reveals significant performance disparities between spoken and written language datasets.
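Universal Dependencies treebanks are distributed in the CoNLL-U format, and dependency parsers are usually scored with unlabeled and labeled attachment scores (UAS and LAS). The abstract names neither the file format nor the metrics, so the sketch below is an assumption about a typical setup, not the thesis's evaluation code. The sentence is hypothetical and contains a repeated determiner, which Universal Dependencies annotates with the `reparandum` relation; `make_conllu`, `read_conllu`, and `attachment_scores` are illustrative names.

```python
def make_conllu(rows):
    """Build a minimal CoNLL-U sentence from (form, upos, head, deprel) rows."""
    lines = [
        f"{i}\t{form}\t_\t{upos}\t_\t_\t{head}\t{rel}\t_\t_"
        for i, (form, upos, head, rel) in enumerate(rows, start=1)
    ]
    return "\n".join(lines) + "\n"


def read_conllu(text):
    """Parse CoNLL-U text into sentences of (form, upos, head, deprel)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():  # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip ranges ("1-2") and empty nodes ("3.1")
            continue
        current.append((cols[1], cols[3], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def attachment_scores(gold, pred):
    """Return (UAS, LAS) over all tokens of aligned gold and predicted parses."""
    total = correct_head = correct_both = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("gold and predicted sentences differ in length")
        for (_, _, g_head, g_rel), (_, _, p_head, p_rel) in zip(g_sent, p_sent):
            total += 1
            if g_head == p_head:
                correct_head += 1
                if g_rel == p_rel:
                    correct_both += 1
    if total == 0:
        raise ValueError("no tokens to score")
    return correct_head / total, correct_both / total


# Hypothetical spoken-style sentence with a repetition: "el el perro come pan"
GOLD = make_conllu([
    ("el", "DET", 2, "reparandum"),  # abandoned first attempt, attached to its repair
    ("el", "DET", 3, "det"),
    ("perro", "NOUN", 4, "nsubj"),
    ("come", "VERB", 0, "root"),
    ("pan", "NOUN", 4, "obj"),
])
# Invented parser output with one wrong head and one wrong label
PRED = make_conllu([
    ("el", "DET", 3, "det"),
    ("el", "DET", 3, "det"),
    ("perro", "NOUN", 4, "nsubj"),
    ("come", "VERB", 0, "root"),
    ("pan", "NOUN", 4, "nsubj"),
])

uas, las = attachment_scores(read_conllu(GOLD), read_conllu(PRED))
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

A parser trained only on written text has rarely seen repetitions like this, which is one plausible source of the kind of spoken-versus-written gap the paragraph above describes.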

The studies show that gamification is effective for PoS tagging, increasing engagement and efficiency in data annotation, although attracting and retaining players remains a challenge. They also highlight the distinct challenges of PoS tagging and parsing spoken Spanish, which call for fine-tuning existing models and re-evaluating annotation strategies. A notable disparity in model performance emerges when moving from written to spoken Spanish datasets, underscoring the need for models trained specifically on spoken language data.

These studies contribute significantly to representing the linguistic diversity in NLP. By focusing on spoken language resources from low-resource dialects, such as those in the COSER corpus, the research highlights the importance of diversifying NLP resources. The COSER-UD treebank becomes a pivotal resource for future research in understanding and processing spoken Spanish.

In conclusion, these studies provide an in-depth understanding of the complexities involved in processing spoken Spanish. They set a precedent for future NLP research, advocating for the inclusion of diverse linguistic resources to develop more comprehensive and inclusive language technologies. The methodologies and findings are poised to significantly influence NLP applications, particularly in handling spoken languages and dialects, marking a substantial advancement towards more inclusive and representative language processing technologies.