
Multilingual Representations and Models for Improved Low-Resource Language Processing

MCML Authors

Dr. Masoud Jalili Sabet

Abstract

This thesis examines methods for improving Natural Language Processing (NLP) for low-resource languages, addressing challenges such as limited training data, the lack of tokenization models, and difficulties in word segmentation. While pretrained language models have advanced multilingual representation learning, their benefits accrue primarily to high-resource languages. This work explores multilinguality in language models and develops techniques for word alignment that do not require parallel data. Key contributions include analyzing multilingual word alignments, extracting alignments from the Bible corpus, applying graph algorithms to improve alignments, generating cross-lingual embeddings from small parallel corpora, and enhancing alignment quality through subword sampling. Together, these efforts aim to improve NLP for underrepresented languages.
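The abstract mentions word alignment without parallel data. One common family of such methods scores each source–target word pair by the similarity of their (multilingual) embeddings and keeps mutual best matches as alignment edges. The sketch below illustrates only that matching step on a toy, hypothetical similarity matrix; it is not the thesis's exact method, and the function name and scores are assumptions for illustration.

```python
# Illustrative sketch: similarity-based word alignment via mutual argmax.
# The similarity matrix `sim` is a toy stand-in for embedding similarities
# (hypothetical values, not taken from the thesis).

def mutual_argmax_align(sim):
    """Return (i, j) pairs where source word i and target word j
    are each other's highest-scoring match."""
    n_src = len(sim)
    n_tgt = len(sim[0])
    # Best target index for each source word.
    best_tgt = [max(range(n_tgt), key=lambda j: sim[i][j]) for i in range(n_src)]
    # Best source index for each target word.
    best_src = [max(range(n_src), key=lambda i: sim[i][j]) for j in range(n_tgt)]
    # Keep only mutual best matches.
    return [(i, best_tgt[i]) for i in range(n_src) if best_src[best_tgt[i]] == i]

# Toy 3x3 similarity matrix for a three-word source and target sentence.
sim = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.3],
    [0.1, 0.3, 0.7],
]
print(mutual_argmax_align(sim))  # [(0, 0), (1, 1), (2, 2)]
```

Because only mutual best matches are kept, the output is a high-precision alignment; methods in this line of work typically add relaxation steps to recover the edges this strict criterion misses.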

Dissertation

LMU München. Jul. 2022

Authors

M. J. Sabet

Links

DOI

Research Area

 B2 | Natural Language Processing

BibTeX Key: Sab22
