
PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs


Abstract

Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this textual reasoning capability, undermining multimodal performance. We propose a training-free framework to mitigate this degradation. Through layer-wise vision token masking, we reveal a common three-stage pattern across MLLMs: early-stage modality separation, mid-stage modality alignment, and late-stage modality degradation. By analyzing the behavior of MLLMs at each stage, we propose a plateau-guided model merging method that selectively injects base language model parameters into the MLLM. Experiments with five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions.
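The merging step can be pictured as layer-selective linear interpolation between the fine-tuned MLLM and its base language model. The sketch below is a minimal illustration under assumptions, not the paper's implementation: the function name merge_selected_layers, the Hugging Face-style parameter naming, and the single uniform interpolation weight alpha are all assumptions, and the plateau-based selection of layer_ids is left to the caller.

```python
import torch

def merge_selected_layers(mllm_state, base_lm_state, layer_ids, alpha=0.5):
    """Interpolate the MLLM's language-tower weights toward the base LM on a
    chosen subset of transformer layers; all other parameters (e.g. the vision
    encoder and projector) are left untouched. Hypothetical sketch: the
    paper's actual merging rule is not specified in this abstract."""
    merged = dict(mllm_state)  # shallow copy; tensors replaced only where merged
    for name, weight in mllm_state.items():
        # Assumes HF-style names such as "model.layers.17.mlp.down_proj.weight";
        # the trailing dot keeps "layers.1." from also matching "layers.17.".
        in_selected = any(f"layers.{i}." in name for i in layer_ids)
        if in_selected and name in base_lm_state:
            merged[name] = (1.0 - alpha) * weight + alpha * base_lm_state[name]
    return merged

# Toy usage with dummy state dicts; real use would load two checkpoints and
# pick layer_ids from the plateau found by the layer-wise masking analysis.
mllm = {f"model.layers.{i}.mlp.weight": torch.ones(2, 2) for i in range(3)}
base = {f"model.layers.{i}.mlp.weight": torch.zeros(2, 2) for i in range(3)}
merged = merge_selected_layers(mllm, base, layer_ids=[2], alpha=0.5)
print(merged["model.layers.2.mlp.weight"])  # 0.5s: base weights injected
print(merged["model.layers.0.mlp.weight"])  # 1.0s: MLLM weights unchanged
```

One design note on the sketch: merging only the selected layers preserves the vision encoder and projector exactly, which matches the abstract's framing of injecting base language model parameters into the MLLM rather than averaging the two models wholesale.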



Preprint

Jan. 2026

Authors

Z. Wang • Y. Liu • M. Wang • E. Nie • D. Chen • Z. Zhao • S. Feng • D. Wang • X. Yang • Y. Zhang • H. Schütze

Links

arXiv URL

Research Area

B2 | Natural Language Processing

BibTeX Key: WLW+26
