Home  | Publications | HSR+23

Multilingual Word Embeddings for Low-Resource Languages Using Anchors and a Chain of Related Languages

MCML Authors

Link to Profile Alexander Fraser PI Matchmaking

Alexander Fraser

Prof. Dr.

Principal Investigator

Link to Profile Hinrich Schütze PI Matchmaking

Hinrich Schütze

Prof. Dr.

Principal Investigator

Abstract

Very low-resource languages, having only a few million tokens worth of data, are not well-supported by multilingual NLP approaches due to poor quality cross-lingual word representations. Recent work showed that good crosslingual performance can be achieved if a source language is related to the low-resource target language. However, not all language pairs are related. In this paper, we propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach, that incorporates intermediate related languages to bridge the gap between the distant source and target. We build MWEs one language at a time by starting from the resource rich source and sequentially adding each language in the chain till we reach the target. We extend a semi-joint bilingual approach to multiple languages in order to eliminate the main weakness of previous works, i.e., independently trained monolingual embeddings, by anchoring the target language around the multilingual space. We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (≤ 5M tokens) and 4 moderately low-resource (≤ 50M) target languages, showing improved performance in both categories. Additionally, our analysis reveals the importance of good quality embeddings for intermediate languages as well as the importance of leveraging anchor points from all languages in the multilingual space.

inproceedings


MRL @EMNLP 2023

3rd Workshop on Multi-lingual Representation Learning at the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023.

Authors

V. Hangya • S. Severini • R. Ralev • A. FraserH. Schütze

Links

DOI

Research Area

 B2 | Natural Language Processing

BibTeXKey: HSR+23

Back to Top