PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs

MCML Authors

Mingyang Wang

→ Group Hinrich Schütze
Computational Linguistics

Ercong Nie

Dr.

* Former Member

→ Group Hinrich Schütze
Computational Linguistics

Hinrich Schütze

Prof. Dr.

Core PI

Computational Linguistics

Abstract

Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this text reasoning capability, which in turn undermines multimodal performance. To mitigate this degradation, we propose a training-free framework. Through layer-wise vision token masking, we reveal a common three-stage pattern in MLLMs: early-modal separation, mid-modal alignment, and late-modal degradation. Based on an analysis of MLLM behavior at each stage, we propose a plateau-guided model merging method that selectively injects base language model parameters into MLLMs. Experiments with five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions.
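To make "selectively injects base language model parameters" concrete, the following is a minimal sketch of one possible reading: linear interpolation of base-model weights into the fine-tuned model, restricted to a chosen range of layers. The abstract does not specify the merge rule, the coefficient, or how the layer range is derived from the plateau, so the function name `merge_layer_range`, the `layers.<i>.` naming scheme, the `alpha` coefficient, and the layer range are all illustrative assumptions rather than the authors' method.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical parameter store: name -> flat list of weights.
Params = Dict[str, List[float]]

# Assumed naming scheme: transformer block parameters contain "layers.<i>.".
_LAYER_RE = re.compile(r"layers\.(\d+)\.")


def layer_index(name: str) -> Optional[int]:
    """Return the block index encoded in a parameter name, or None."""
    match = _LAYER_RE.search(name)
    return int(match.group(1)) if match else None


def merge_layer_range(
    mllm: Params,
    base: Params,
    layer_range: Tuple[int, int],
    alpha: float,
) -> Params:
    """Interpolate base-LM weights into MLLM weights for selected layers.

    For layers with index in [start, end] (inclusive) the result is
    (1 - alpha) * mllm + alpha * base; all other parameters, including
    any without a counterpart in the base model, are copied unchanged.
    """
    start, end = layer_range
    merged: Params = {}
    for name, weights in mllm.items():
        idx = layer_index(name)
        if idx is not None and start <= idx <= end and name in base:
            merged[name] = [
                (1 - alpha) * w + alpha * b
                for w, b in zip(weights, base[name])
            ]
        else:
            merged[name] = list(weights)
    return merged
```

No training step is involved: the merged model is obtained purely by arithmetic on two existing checkpoints, which is the sense in which such a procedure is training-free. Vision-specific parameters have no counterpart in the base language model and are left untouched in this sketch.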

inproceedings WLW+26

Findings @ACL 2026

Findings at the 64th Annual Meeting of the Association for Computational Linguistics. San Diego, CA, USA, Jul 02-07, 2026.

Authors

Z. Wang • Y. Liu • M. Wang • E. Nie • D. Chen • Z. Zhao • S. Feng • D. Wang • X. Yang • Y. Zhang • H. Schütze

Links

DOI GitHub

Research Area

B2 | Natural Language Processing

BibTeXKey: WLW+26