Bilingual Sentence Mining for Low-Resource Languages: A Case Study on Upper and Lower Sorbian

MCML Authors

Prof. Dr. Alexander Fraser, Principal Investigator

Abstract

Parallel sentence mining is crucial for downstream tasks such as machine translation, especially for low-resource languages, where parallel data are scarce. In this context, we apply a pipeline approach with contextual embeddings to two endangered Slavic languages spoken in Germany, Upper and Lower Sorbian, and evaluate mining quality. To this end, we compare off-the-shelf multilingual language models with word encoders pre-trained on Upper Sorbian to understand their impact on sentence mining. Moreover, to filter out irrelevant pairs, we experiment with post-processing the mined sentences using an unsupervised word aligner based on word embeddings. We observe that additional pre-training on Upper Sorbian is useful: it directly improves mining for that language and also for its related language, Lower Sorbian.
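
To make the idea of embedding-based mining concrete, the following is a minimal sketch of one common generic technique: pairing sentences that are mutual nearest neighbours under cosine similarity and keeping only pairs above a threshold. It is not taken from the paper; the function names, the toy vectors standing in for sentence embeddings, and the threshold of 0.8 are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_idx, tgt_idx, score) for mutual nearest neighbours.

    Hypothetical helper: a pair is kept only if each sentence is the
    other's best match and the similarity reaches the threshold.
    """
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target for every source sentence.
    best_tgt = [max(range(len(tgt_vecs)), key=lambda j: row[j]) for row in sims]
    # Best source for every target sentence.
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sims[i][j])
        for j in range(len(tgt_vecs))
    ]
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((i, j, sims[i][j]))
    return pairs

# Toy 3-dimensional vectors standing in for real sentence embeddings (assumed data).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]
tgt = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
mined = mine_pairs(src, tgt)
for i, j, score in mined:
    print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

In this toy example the third source vector and the third target vector are mutual nearest neighbours, but their similarity falls below the assumed threshold, so the pair is discarded. A real pipeline would obtain the vectors from a multilingual encoder and could apply a further filtering step, such as the word-alignment post-processing mentioned in the abstract.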

inproceedings


Compute-EL @ICLDC 2025

8th Workshop on The Use of Computational Methods in the Study of Endangered Languages at the 9th International Conference on Language Documentation and Conservation. Honolulu, Hawaii, USA, Mar 06, 2025.

Authors

S. Okabe, A. Fraser

Research Area

 B2 | Natural Language Processing

BibTeXKey: OF25