Improving Parallel Sentence Mining for Low-Resource and Endangered Languages

MCML Authors

Shu Okabe

Dr.

→ Group Alexander Fraser
Data Analytics & Statistics

Katharina Hämmerl

→ Group Alexander Fraser
Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

Core PI

Data Analytics & Statistics

Abstract

While parallel sentence mining has been extensively covered for fairly well-resourced languages, pairs involving low-resource languages have received comparatively little attention. To address this gap, we present Belopsem, a benchmark of new datasets for parallel sentence mining on three language pairs where the source side is low-resource and endangered: Occitan-Spanish, Upper Sorbian-German, and Chuvash-Russian. These combinations also reflect varying linguistic similarity within each pair. We compare three language models in an established parallel sentence mining pipeline and apply two types of improvements to one of them, Glot500. We observe better mining quality overall by both applying alignment post-processing with an unsupervised aligner and using a cluster-based isotropy enhancement technique. These findings are crucial for optimising parallel data extraction for low-resource languages in a realistic way.

inproceedings OHF25

ACL 2025

63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025.

Authors

S. Okabe • K. Hämmerl • A. Fraser

Research Area

B2 | Natural Language Processing

BibTeXKey: OHF25
