Masoud Jalili Sabet
This thesis examines methods to improve Natural Language Processing (NLP) for low-resource languages, addressing challenges such as limited training data, the lack of tokenization models, and difficulties in word segmentation. While pretrained language models have advanced multilingual representation learning, they primarily benefit high-resource languages. This work explores multilinguality in language models and develops techniques for word alignment that do not require parallel data. Key contributions include analyzing multilingual word alignments, extracting alignments from the Bible corpus, applying graph algorithms to improve alignments, generating cross-lingual embeddings from small parallel corpora, and enhancing alignment quality through subword sampling. Together, these efforts aim to improve NLP for underrepresented languages.
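The idea of aligning words without parallel training data can be illustrated with a toy similarity-based aligner. The sketch below is an assumption for illustration only, not the thesis's actual method: random vectors stand in for multilingual contextual embeddings of a source and a target sentence, and a pair of tokens is kept only when each is the other's mutual nearest neighbour under cosine similarity.

```python
import numpy as np

# Toy stand-ins for multilingual contextual embeddings (hypothetical data;
# a real system would embed both sentences with a pretrained model).
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))   # 4 source tokens, 8-dim embeddings
tgt = rng.normal(size=(5, 8))   # 5 target tokens

# Cosine similarity between every source/target token pair.
src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
sim = src_n @ tgt_n.T           # shape (4, 5)

# Mutual-argmax alignment: keep (i, j) only if token i's best match
# is j AND token j's best match is i.
fwd = sim.argmax(axis=1)        # best target index for each source token
bwd = sim.argmax(axis=0)        # best source index for each target token
alignment = {(i, int(fwd[i])) for i in range(len(fwd)) if bwd[fwd[i]] == i}
print(sorted(alignment))
```

Because the globally highest-scoring pair is always mutually best, at least one alignment link is produced; the mutual-argmax constraint then filters out one-sided, low-confidence matches.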
BibTeX key: Sab22