PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs

Abstract

Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this text reasoning capability, which in turn undermines multimodal performance. To address this issue, we propose a training-free framework that mitigates the degradation. Through layer-wise vision token masking, we reveal a common three-stage pattern in MLLMs: early-modal separation, mid-modal alignment, and late-modal degradation. By analyzing the behavior of MLLMs at each stage, we propose a plateau-guided model merging method that selectively injects base language model parameters into MLLMs. Experiments with five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions.
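
As a rough illustration of what "selectively injecting base language model parameters" could look like, the sketch below linearly interpolates base-model weights into an MLLM for a chosen range of layers only. The abstract does not specify the merging rule, the layer range, or the mixing weight, so the interpolation formula, the function name `merge_layers`, the `layers.<i>.` naming scheme, and the parameters `layer_range` and `alpha` are all assumptions made for illustration, not the authors' method.

```python
import re


def merge_layers(mllm_state, base_state, layer_range, alpha=0.5):
    """Blend base-LM weights into MLLM weights for layers in [start, end).

    Hypothetical sketch: parameters are plain lists of floats keyed by
    names such as "layers.5.w". Parameters outside the range, or without
    a counterpart in the base model (e.g. a vision projector), are copied
    unchanged.
    """
    start, end = layer_range
    merged = {}
    for name, weights in mllm_state.items():
        match = re.search(r"layers\.(\d+)\.", name)
        in_range = match is not None and start <= int(match.group(1)) < end
        if in_range and name in base_state:
            # Assumed rule: simple linear interpolation toward the base model.
            merged[name] = [
                (1 - alpha) * w + alpha * b
                for w, b in zip(weights, base_state[name])
            ]
        else:
            merged[name] = list(weights)
    return merged


# Toy example: only layer 5 falls inside the (hypothetical) merge range 4..7.
mllm = {"layers.0.w": [1.0, 1.0], "layers.5.w": [1.0, 1.0], "vision.proj": [2.0]}
base = {"layers.0.w": [0.0, 0.0], "layers.5.w": [0.0, 0.0]}
merged = merge_layers(mllm, base, layer_range=(4, 8), alpha=0.5)
```

In this toy setup the range `(4, 8)` is arbitrary. In the paper, the layers to merge are chosen from the stage analysis obtained through layer-wise vision token masking, which this sketch does not attempt to reproduce.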

inproceedings WLW+26


Findings @ACL 2026

Findings at the 64th Annual Meeting of the Association for Computational Linguistics. San Diego, CA, USA, Jul 02-07, 2026. To be published. Preprint available.

Authors

Z. Wang • Y. Liu • M. Wang • E. Nie • D. Chen • Z. Zhao • S. Feng • D. Wang • X. Yang • Y. Zhang • H. Schütze

Links

arXiv URL

Research Area

 B2 | Natural Language Processing

BibTeXKey: WLW+26
