Improving Low-Resource Languages in Pre-Trained Multilingual Language Models

MCML Authors

Dr. Viktor Hangya (Former Member), Group Alexander Fraser, Data Analytics & Statistics

Prof. Dr. Alexander Fraser, Principal Investigator, Data Analytics & Statistics

Abstract

Pre-trained multilingual language models are the foundation of many NLP approaches, including cross-lingual transfer solutions. However, languages with small available monolingual corpora are often not well supported by these models, leading to poor performance. We propose an unsupervised approach to improve the cross-lingual representations of low-resource languages by bootstrapping word translation pairs from monolingual corpora and using them to improve language alignment in pre-trained language models. We perform experiments on nine languages, using contextual word retrieval and zero-shot named entity recognition to measure both intrinsic cross-lingual word representation quality and downstream task performance, showing improvements on both tasks. Our results show that it is possible to improve pre-trained multilingual language models by relying only on non-parallel resources.
