
Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors

Abstract

Large language models (LLMs) are increasingly deployed in multilingual, real-world applications with user inputs that naturally introduce typographical errors (typos). Yet most benchmarks assume clean input, leaving the robustness of LLMs to typos across languages largely underexplored. To address this gap, we introduce MulTypo, a multilingual typo generation algorithm that simulates human-like errors based on language-specific keyboard layouts and typing behavior. We evaluate 18 open-source LLMs across three model families and five downstream tasks spanning natural language inference, multiple-choice question answering, mathematical reasoning, and machine translation. Our results show that typos consistently degrade performance, particularly in generative tasks and those requiring reasoning, while natural language inference is comparatively robust. Instruction tuning improves clean-input performance but may increase brittleness under noise. We also observe language-dependent robustness: high-resource languages are generally more robust than low-resource ones, and translation from English is more robust than translation into English. Our findings underscore the need for noise-aware training and multilingual robustness evaluation. We release a Python package for MulTypo and make the source code publicly available.
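
The core idea of keyboard-layout-based typo simulation can be sketched in a few lines of Python. The snippet below is a minimal, hypothetical illustration of substitution-only noise from a toy QWERTY adjacency map; the names (inject_typos, QWERTY_NEIGHBORS), the adjacency table, and the single-error-type model are assumptions for illustration, not the released MulTypo API, which covers language-specific layouts and richer typing behavior.

    import random

    # Toy QWERTY adjacency map (illustrative assumption; MulTypo uses full
    # language-specific keyboard layouts rather than this small subset).
    QWERTY_NEIGHBORS = {
        "a": "qwsz", "s": "awedxz", "d": "serfcx",
        "e": "wsdr", "r": "edft", "t": "rfgy",
    }

    def inject_typos(text, rate=0.05, seed=None):
        """Replace each mapped character with an adjacent key with probability `rate`."""
        rng = random.Random(seed)
        out = []
        for ch in text:
            neighbors = QWERTY_NEIGHBORS.get(ch.lower())
            if neighbors and rng.random() < rate:
                sub = rng.choice(neighbors)
                # Keep the casing of the character being replaced.
                out.append(sub.upper() if ch.isupper() else sub)
            else:
                out.append(ch)
        return "".join(out)

    print(inject_typos("the red tree starts east", rate=0.3, seed=7))

Run on clean task inputs, such a perturbation yields occasional adjacent-key substitutions whose rate is controlled by a single parameter; for actual robustness evaluation, the released package linked below should be used.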

ACL 2026

64th Annual Meeting of the Association for Computational Linguistics. San Diego, CA, USA, Jul 02-07, 2026. To be published. Preprint available.
A* Conference

Authors

Y. Liu • R. Zhao • L. Altinger • H. Schütze • M. A. Hedderich

Links

arXiv • GitHub

Research Area

B2 | Natural Language Processing

BibTeX Key: LZA+26
