EnerGIZAr: leveraging GIZA++ for effective tokenizer initialization

Publication type
C1
Publication status
Published
Authors
Singh, P., Agirre, E., Azkune, G., De Clercq, O., & Lefever, E.
Editor
Wanxiang Che, Joyce Nabende, Ekaterina Shutova and Mohammad Taher Pilehvar
Series
Findings of the Association for Computational Linguistics: ACL 2025
Pagination
2124-2137
Publisher
Association for Computational Linguistics (ACL)
Conference
63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Vienna, Austria

Abstract

Continual pre-training has long been considered the default strategy for adapting models to non-English languages, but it struggles with initializing new embeddings, particularly for non-Latin scripts. In this work, we propose EnerGIZAr, a novel methodology that improves continual pre-training by leveraging statistical word alignment techniques. Our approach uses GIZA++ to construct a subword-level alignment matrix between source (English) and target-language tokens. This matrix enables informed initialization of the target tokenizer's embeddings, providing a more effective starting point for adaptation. We evaluate EnerGIZAr against state-of-the-art initialization strategies such as OFA and FOCUS across four typologically diverse languages: Hindi, Basque, Arabic, and Korean. Experimental results on key NLP tasks (POS tagging, Sentiment Analysis, NLI, and NER) demonstrate that EnerGIZAr achieves superior monolingual performance while also outperforming all methods on cross-lingual transfer when tested on XNLI. With EnerGIZAr, we propose an intuitive, explainable, and state-of-the-art initialization technique for continual pre-training of English models.
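
For illustration only, the minimal Python sketch below shows one way a subword-level alignment matrix could inform embedding initialization: each new target-token vector is built as an alignment-weighted average of the English source embeddings, with a small random fallback for unaligned tokens. This is not the paper's exact procedure; the function, its parameters, and the weighting scheme are assumptions made for the sake of the example.

    import numpy as np

    def init_target_embeddings(alignment, source_emb, fallback_std=0.02, seed=0):
        # alignment: (T, S) array of subword-level alignment counts or probabilities
        #            (rows = target tokens, columns = source tokens), e.g. derived
        #            from GIZA++ word alignments projected onto subwords.
        # source_emb: (S, d) source (English) embedding matrix.
        rng = np.random.default_rng(seed)
        n_target = alignment.shape[0]
        dim = source_emb.shape[1]
        target_emb = np.empty((n_target, dim), dtype=source_emb.dtype)
        row_sums = alignment.sum(axis=1)
        for t in range(n_target):
            if row_sums[t] > 0:
                weights = alignment[t] / row_sums[t]   # normalize alignment mass per target token
                target_emb[t] = weights @ source_emb   # weighted average of aligned source vectors
            else:
                # no alignment evidence for this token: fall back to small random init
                target_emb[t] = rng.normal(0.0, fallback_std, size=dim)
        return target_emb

In such a setup, the resulting target embedding matrix would replace the English input embeddings before continual pre-training resumes, which is the general role the alignment-informed initialization plays in the abstract.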