Reasoning With Fewer Eyes: Efficient Visual Token Withdrawal for Multimodal Reasoning

Abstract

Vision-language models have shown strong promise for multimodal reasoning tasks, where autoregressive generation allows the model to combine perception and abstract reasoning. However, especially when processing high-resolution images or long videos, the large number of visual tokens severely slows down inference. Drawing on the observation that the attention devoted to vision tokens consistently drops during autoregressive text generation, we propose a simple method to accelerate multimodal reasoning: after the model has generated a small number of text tokens, we remove all vision tokens from subsequent decoding steps. This reduces both memory usage and computation while retaining the model's ability to ground its reasoning in the visual input. Our approach requires no additional training and is fully compatible with popular efficiency techniques such as KV caching and FlashAttention. Experiments on multiple datasets and with different models demonstrate that our method achieves substantial speedups with minimal impact on reasoning accuracy.
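The core idea from the abstract can be sketched as a small toy in Python. This is an illustrative simplification, not the paper's implementation: the function names, the single-threshold `withdraw_after` parameter, and the list-based KV cache are all assumptions made for clarity. In a real system, the same pruning would slice tensor-backed KV caches per layer.

```python
# Toy sketch of visual token withdrawal (assumed names, not the paper's API).
# The KV cache is modeled as a per-layer list of (keys, values), where keys
# and values are lists with one entry per cached token position.

def withdraw_vision_tokens(kv_cache, vision_start, vision_end):
    """Drop the cached keys/values of the vision-token span from every layer."""
    pruned = []
    for keys, values in kv_cache:
        pruned.append((
            keys[:vision_start] + keys[vision_end:],
            values[:vision_start] + values[vision_end:],
        ))
    return pruned

def generate(model_step, kv_cache, vision_start, vision_end,
             max_new_tokens=32, withdraw_after=8):
    """Autoregressive decoding loop: once `withdraw_after` text tokens have
    been generated, remove all vision entries from the cache so that later
    steps attend only to text (and to the text already conditioned on the
    image). `model_step` takes the cache and returns (token, updated_cache)."""
    tokens = []
    for t in range(max_new_tokens):
        if t == withdraw_after:
            kv_cache = withdraw_vision_tokens(kv_cache, vision_start, vision_end)
        token, kv_cache = model_step(kv_cache)
        tokens.append(token)
    return tokens, kv_cache
```

Because the withdrawal is a one-time slice of the cache, it composes naturally with standard KV caching; attention over the shortened cache is what yields the memory and compute savings described above.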

inproceedings RHF+25


ER @NeurIPS 2025

Workshop on Efficient Reasoning at the 39th Conference on Neural Information Processing Systems. San Diego, CA, USA, Nov 30-Dec 07, 2025. To be published. Preprint available.

Authors

A. Ramazzina • T. Haab • D. Fitzek • S. Gasperini • J. Uhrig • M. Bijelic

Links

URL

Research Area

C1 | Medicine

BibTeX Key: RHF+25