Identifying cognates in English-Dutch and French-Dutch by means of orthographic information and cross-lingual word embeddings

Publication type
P1
Publication status
Published
Authors
Lefever, E., Labat, S., & Singh, P.
Series
PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020)
Pagination
4096-4101
Publisher
European Language Resources Association (ELRA)
Conference
12th International Conference on Language Resources and Evaluation (LREC) (Marseille, France)

Abstract

This paper investigates the validity of combining more traditional orthographic information with cross-lingual word embeddings to identify cognate pairs in English-Dutch and French-Dutch. In a first step, lists of potential cognate pairs in English-Dutch and French-Dutch are manually labelled. The resulting gold standard is used to train and evaluate a multi-layer perceptron that can distinguish cognates from non-cognates. Fifteen orthographic features capture string similarities between source and target words, while the cosine similarity between their word embeddings represents the semantic relation between these words. By adding domain-specific information to pretrained fastText embeddings, we are able to obtain good embeddings for words that did not yet have a pretrained embedding (e.g. Dutch compound nouns). These embeddings are then aligned in a cross-lingual vector space by exploiting their structural similarity (cf. adversarial learning). Our results indicate that although the classifier already achieves good results on the basis of orthographic information, the performance further improves by including semantic information in the form of cross-lingual word embeddings.
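
To make the feature design described in the abstract more concrete, the following is a minimal sketch of how one word pair could be turned into a feature vector that combines orthographic similarity with the cosine similarity of the two word embeddings. The abstract does not list the fifteen orthographic features, so the three string measures used here (normalised edit similarity, Dice coefficient over character bigrams, common-prefix ratio) are illustrative assumptions, as are all function names. The embeddings are assumed to be already aligned in a shared cross-lingual space and are passed in as plain lists of floats.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_edit_similarity(a: str, b: str) -> float:
    """1 minus the edit distance divided by the longer word's length."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def dice_bigrams(a: str, b: str) -> float:
    """Dice coefficient over the sets of character bigrams of both words."""
    bigrams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    bigrams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not bigrams_a and not bigrams_b:
        return 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


def common_prefix_ratio(a: str, b: str) -> float:
    """Length of the shared prefix divided by the longer word's length."""
    if not a and not b:
        return 0.0
    shared = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        shared += 1
    return shared / max(len(a), len(b))


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def feature_vector(src: str, tgt: str,
                   src_vec: list[float], tgt_vec: list[float]) -> list[float]:
    """Hypothetical feature layout: three orthographic scores, then one semantic score."""
    return [
        normalized_edit_similarity(src, tgt),
        dice_bigrams(src, tgt),
        common_prefix_ratio(src, tgt),
        cosine_similarity(src_vec, tgt_vec),
    ]
```

Feature vectors of this kind, paired with the manually assigned cognate or non-cognate labels, could then be used to train a classifier such as the multi-layer perceptron mentioned above. The toy measures shown here are stand-ins for the paper's actual feature set, not a reproduction of it.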