IVESS: Intelligent Vocabulary and Example Selection for Spanish vocabulary learning

Publication type
U
Publication status
Published
Authors
Degraeuwe, JRD, & Goethals, P.
Conference
The 30th Meeting of Computational Linguistics In the Netherlands (CLIN30) (Utrecht)

Abstract

In this poster, we will outline the research aims and work packages of the recently started PhD project “IVESS”, which focuses on Intelligent Computer-Assisted Language Learning (ICALL) for vocabulary learning in Spanish as a Foreign Language (SFL). ICALL uses NLP techniques to facilitate the creation of digital, customisable language learning materials. In this PhD project, we are primarily studying and improving NLP-driven methodologies for (1) vocabulary retrieval; (2) vocabulary selection; (3) example selection; and (4) example simplification. As a secondary research question, we will also analyse the attitudes of students and teachers towards ICALL.
When vocabulary is retrieved from corpora, every retrieved item should ideally be a “lexical unit”, i.e. a particular lexeme linked to a particular meaning. This requires automatically distinguishing between single-word lexemes (unigrams) and multiword lexemes (multigrams; e.g. darse cuenta [EN “to realise”], por tanto [EN “thus”]), as well as disambiguating polysemous lexemes (e.g. función: EN “function”, “theatre play”). Attempts at automatic multigram retrieval for Spanish have yielded F1 scores between 11.08 and 38.39 (Ramisch et al., 2018). In our project, we are conducting supervised machine learning experiments, with human-rated multigram scores as the dependent variable and features such as frequency, entropy and asymmetrical word association measure scores as independent variables. As for word sense disambiguation, we will test different methodologies based on word and synset embeddings, and evaluate their suitability in a didactic context.
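To make the feature set more concrete, the following is a minimal sketch of how a frequency feature and an asymmetrical association feature could be computed for a candidate two-word multigram. The abstract does not name the association measures used in the project; ΔP (Delta P) is chosen here only as one well-known asymmetrical measure, and the function names `safe_div` and `bigram_features` are hypothetical. Entropy features are left out for brevity.

```python
from collections import Counter


def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def bigram_features(tokens, w1, w2):
    """Frequency and Delta P scores for the candidate bigram (w1, w2).

    Assumption: Delta P stands in for the unspecified asymmetrical
    association measures mentioned in the abstract.
    """
    n = len(tokens) - 1                      # number of bigram positions
    bigrams = Counter(zip(tokens, tokens[1:]))
    first = Counter(tokens[:-1])             # word counts in first position
    second = Counter(tokens[1:])             # word counts in second position

    a = bigrams[(w1, w2)]                    # w1 followed by w2
    b = first[w1] - a                        # w1 followed by another word
    c = second[w2] - a                       # w2 preceded by another word
    d = n - a - b - c                        # neither w1 first nor w2 second

    return {
        "frequency": a,
        # P(w2 | w1) - P(w2 | not w1)
        "delta_p_forward": safe_div(a, a + b) - safe_div(c, c + d),
        # P(w1 | w2) - P(w1 | not w2)
        "delta_p_backward": safe_div(a, a + c) - safe_div(b, b + d),
    }


tokens = "se dio cuenta de que no se dio cuenta".split()
features = bigram_features(tokens, "dio", "cuenta")
```

In a supervised setup such as the one described above, a dictionary like `features` would form one row of the input matrix, with the human-rated multigram score as the target.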
Next, regarding vocabulary selection, we focus on domain specificity (keyness) and difficulty grading as selection criteria. For keyness calculation, which indicates how typical vocabulary items are of a specific domain, we are building upon previous research (Degraeuwe & Goethals, subm.), in which we used keyness metrics to select key items from a domain-specific study corpus compared to a general reference corpus. As for vocabulary grading, we are also building upon previous research: in Goethals, Tezcan & Degraeuwe (2019) we built a machine learning classifier to predict the difficulty level of unigram vocabulary items in Spanish, obtaining an accuracy of 62%.
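As an illustration of the study-versus-reference comparison, the sketch below scores items with the log-likelihood statistic, one commonly used keyness metric. The abstract does not state which keyness metrics were used, so this choice is an assumption, as are the function names `log_likelihood_keyness` and `rank_key_items` and the toy counts.

```python
import math


def log_likelihood_keyness(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one item: study corpus vs. reference corpus.

    Assumption: log-likelihood stands in for the unspecified keyness metrics.
    """
    total = freq_study + freq_ref
    combined_size = size_study + size_ref
    expected_study = size_study * total / combined_size
    expected_ref = size_ref * total / combined_size

    score = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:                     # a zero count contributes nothing
            score += observed * math.log(observed / expected)
    return 2 * score


def rank_key_items(study_counts, ref_counts, size_study, size_ref):
    """Rank items that are relatively more frequent in the study corpus."""
    scored = []
    for item, freq_study in study_counts.items():
        freq_ref = ref_counts.get(item, 0)
        if freq_study / size_study > freq_ref / size_ref:
            score = log_likelihood_keyness(freq_study, size_study, freq_ref, size_ref)
            scored.append((item, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy counts, invented for illustration only.
study_counts = {"paciente": 50, "de": 60}
ref_counts = {"paciente": 100, "de": 600}
ranked = rank_key_items(study_counts, ref_counts, size_study=1000, size_ref=10000)
```

With these toy counts, only "paciente" is kept, because "de" has the same relative frequency in both corpora and is therefore not typical of the domain.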
With respect to example selection, we intend to develop a methodology similar to the one proposed by Pilán (2018) for Swedish. Concretely, we will collect the features and adapt them for Spanish in order to arrive at a two-dimensional grading of the examples, based upon (1) readability and (2) typicality.
Moreover, we will investigate the feasibility of applying example simplification techniques to those examples that have a good score for typicality but not for readability, a challenging task given the high error margins of the current systems for Spanish (Saggion, 2017).
The NLP-driven methodologies under investigation can be integrated (as a pipeline or as separate modules) into an ICALL tool to generate digital, customisable vocabulary learning materials for students and teachers. By conducting surveys and interviews, we aim to gain insight into the attitudes of both groups towards working with automatically generated learning materials.