Neural Text Normalization for Luxembourgish Using Real-Life Variation Data

MCML Authors

Barbara Plank

Prof. Dr.

Core PI

Abstract

Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.

inproceedings LPP+25


VarDial @COLING 2025

12th Workshop on NLP for Similar Languages, Varieties and Dialects at the 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025.

Authors

A.-M. Lutgen • A. Plum • C. Purschke • B. Plank

Research Area

 B2 | Natural Language Processing

BibTeXKey: LPP+25
