Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models
MCML Authors
Ercong Nie
Dr.
* Former Member
Abstract
Ercong Nie
Dr.
* Former Member
Abstract
Language confusion—where large language models (LLMs) generate unintended languages against the user’s need—remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs)—specific positions where language switches occur—are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified via comparative analysis with a multilingual-tuned counterpart, substantially mitigates confusion while largely preserving general competence and fluency. Our approach matches multilingual alignment in confusion reduction for many languages and yields cleaner, higher-quality outputs. These findings provide new insights into the internal dynamics of LLMs and highlight neuron-level interventions as a promising direction for robust, interpretable multilingual language modeling.
inproceedings NSS25
Findings @EMNLP 2025
Findings of the Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025.Authors
E. Nie • H. Schmid • H. SchützeLinks
DOIResearch Area
BibTeXKey: NSS25