Masoud Jalili Sabet
This thesis examines methods to improve Natural Language Processing (NLP) for low-resource languages, addressing challenges such as limited training data, the lack of tokenization models, and difficulties in word segmentation. While pretrained language models have advanced multilingual representation learning, they primarily benefit high-resource languages. This work explores multilinguality in language models and develops techniques for word alignment that do not require parallel data. Key contributions include analyzing multilingual word alignments, extracting alignments from the Bible corpus, applying graph algorithms to improve alignments, generating cross-lingual embeddings from small parallel corpora, and enhancing alignment quality through subword sampling. Together, these efforts aim to improve NLP for underrepresented languages.
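The idea of aligning words without parallel training data can be illustrated with a toy similarity-based aligner. The sketch below is an assumption for illustration only, not the thesis's actual method: random vectors stand in for multilingual contextual embeddings of a source and a target sentence, and a pair of tokens is kept only when each is the other's mutual nearest neighbour under cosine similarity.

```python
import numpy as np

# Toy stand-ins for multilingual contextual embeddings (hypothetical data;
# a real system would embed both sentences with a pretrained model).
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))   # 4 source tokens, 8-dim embeddings
tgt = rng.normal(size=(5, 8))   # 5 target tokens

# Cosine similarity between every source/target token pair.
src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
sim = src_n @ tgt_n.T           # shape (4, 5)

# Mutual-argmax alignment: keep (i, j) only if token i's best match
# is j AND token j's best match is i.
fwd = sim.argmax(axis=1)        # best target index for each source token
bwd = sim.argmax(axis=0)        # best source index for each target token
alignment = {(i, int(fwd[i])) for i in range(len(fwd)) if bwd[fwd[i]] == i}
print(sorted(alignment))
```

Because the globally highest-scoring pair is always mutually best, at least one alignment link is produced; the mutual-argmax constraint then filters out one-sided, low-confidence matches.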
BibTeX key: Sab22