PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs
MCML Authors
Ercong Nie
Dr.
* Former Member
Abstract
Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this text reasoning capability, which in turn undermines multimodal performance. We propose a training-free framework to mitigate this degradation. Through layer-wise vision token masking, we reveal a common three-stage pattern in MLLMs: early-modal separation, mid-modal alignment, and late-modal degradation. Based on an analysis of MLLM behavior at each stage, we propose a plateau-guided model merging method that selectively injects base language model parameters into MLLMs. Experimental results with five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions.
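To make the idea of selectively injecting base language model parameters concrete, the following is a minimal sketch only, not the authors' implementation. The abstract does not state the merging formula or how the layer range is chosen, so the sketch assumes plain linear interpolation over a contiguous layer range. The function names, the `layers.<i>.` naming scheme, and the `alpha` weight are all hypothetical.

```python
import re


def layer_index(name):
    """Return the layer index encoded in a parameter name, or None.

    Assumes (hypothetically) that names look like 'layers.<i>.<rest>'.
    """
    match = re.search(r"layers\.(\d+)\.", name)
    return int(match.group(1)) if match else None


def merge_layers(mllm_params, base_params, start, end, alpha=0.5):
    """Interpolate toward base-LM weights for layers in [start, end].

    alpha is the share taken from the base language model (an assumed knob).
    Parameters outside the range, or absent from the base model (for example
    a vision projector), are copied from the MLLM unchanged.
    """
    merged = {}
    for name, weights in mllm_params.items():
        idx = layer_index(name)
        if idx is not None and start <= idx <= end and name in base_params:
            base = base_params[name]
            merged[name] = [(1 - alpha) * w + alpha * b
                            for w, b in zip(weights, base)]
        else:
            merged[name] = list(weights)
    return merged


# Toy parameters as flat lists of floats; real models would use tensors.
mllm = {"layers.0.w": [1.0, 1.0], "layers.5.w": [1.0, 1.0], "vision.proj": [2.0]}
base = {"layers.0.w": [0.0, 0.0], "layers.5.w": [0.0, 0.0]}
merged = merge_layers(mllm, base, start=4, end=7, alpha=0.5)
```

In this toy setup only `layers.5.w` falls inside the assumed range, so it is the only parameter pulled toward the base model. No retraining is involved, which matches the training-free framing of the abstract.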
inproceedings WLW+26
Findings @ACL 2026
Findings at the 64th Annual Meeting of the Association for Computational Linguistics. San Diego, CA, USA, Jul 02-07, 2026. To be published. Preprint available.
Authors
Z. Wang • Y. Liu • M. Wang • E. Nie • D. Chen • Z. Zhao • S. Feng • D. Wang • X. Yang • Y. Zhang • H. Schütze
Links
arXiv URL
Research Area
BibTeXKey: WLW+26