Parallel sentence mining is crucial for downstream tasks such as Machine Translation, especially for low-resource languages, where such resources are scarce. In this context, we apply a pipeline approach with contextual embeddings to two endangered Slavic languages spoken in Germany, Upper and Lower Sorbian, to evaluate mining quality. To this end, we compare off-the-shelf multilingual language models with word encoders pre-trained on Upper Sorbian to understand their impact on sentence mining. Moreover, to filter out irrelevant pairs, we experiment with post-processing the mined sentences using an unsupervised word aligner based on word embeddings. We observe that additional pre-training on Upper Sorbian is useful, leading to direct improvements when mining not only Upper Sorbian itself but also the closely related Lower Sorbian.
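A minimal sketch of what such an embedding-based mining step with an alignment-coverage filter can look like, assuming pre-computed sentence embeddings and a word-vector lookup; the greedy best-match scoring, the thresholds, and the coverage heuristic below are illustrative choices, not the exact method of the paper.

```python
# Illustrative sketch: mine candidate sentence pairs by cosine similarity of
# sentence embeddings, then filter them by word-embedding alignment coverage.
# All thresholds and the embedding sources are placeholder assumptions.
import numpy as np


def cosine_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between the rows of a and b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


def mine_pairs(src_emb: np.ndarray, tgt_emb: np.ndarray, threshold: float = 0.7):
    """Greedy mining: for each source sentence keep its best-scoring target
    if the cosine similarity exceeds the threshold."""
    sims = cosine_matrix(src_emb, tgt_emb)
    pairs = []
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if row[j] >= threshold:
            pairs.append((i, j, float(row[j])))
    return pairs


def alignment_coverage(src_tokens, tgt_tokens, word_vec, sim_threshold=0.5):
    """Fraction of source tokens that can be greedily matched to some target
    token whose word-embedding cosine similarity exceeds sim_threshold."""
    src = [word_vec[w] for w in src_tokens if w in word_vec]
    tgt = [word_vec[w] for w in tgt_tokens if w in word_vec]
    if not src or not tgt:
        return 0.0
    sims = cosine_matrix(np.stack(src), np.stack(tgt))
    aligned = int((sims.max(axis=1) >= sim_threshold).sum())
    return aligned / len(src_tokens)


def filter_pairs(pairs, src_sents, tgt_sents, word_vec, min_coverage=0.5):
    """Drop mined pairs whose alignment coverage falls below min_coverage."""
    kept = []
    for i, j, score in pairs:
        cov = alignment_coverage(src_sents[i].split(), tgt_sents[j].split(), word_vec)
        if cov >= min_coverage:
            kept.append((i, j, score, cov))
    return kept
```

In practice, margin-based scoring (comparing each candidate's similarity to that of its nearest neighbours) is a common alternative to a raw cosine threshold, and a dedicated unsupervised word aligner would replace the greedy coverage heuristic shown here.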