From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA
Abstract
High benchmark accuracy does not guarantee genuine use of visual evidence. We study this problem in traffic accident Video Question Answering (VideoQA), where correct answers should depend on scene-specific visual evidence but may instead be inferred from textual shortcuts. Through an audit of four public benchmarks, we find that several recent open-weight Vision-Language Models (VLMs) perform competitively, and sometimes better, without video input. On the MMAU benchmark, removing video consistently improves accuracy, and adding more frames further degrades performance. To quantify visual dependence, we introduce two dataset-level diagnostics: Blind Gap, measuring above-chance text-only performance, and Visual Gain, measuring the marginal benefit of adding video. We further propose an instance-level Shortcut Score that combines text-only confidence with visual necessity signals, enabling continuous, training-free filtering of shortcut-prone questions. The resulting subsets reduce shortcut bias and improve visual grounding. Our findings reveal large differences in grounding quality across benchmarks and show that visually grounded evaluation, not just high accuracy, is essential in safety-critical VideoQA.
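The two dataset-level diagnostics can be illustrated with a minimal sketch. The abstract does not give exact formulas, so the definitions below are assumptions read off its wording: Blind Gap is taken as text-only accuracy minus random-guess accuracy, and Visual Gain as accuracy with video minus text-only accuracy. The `Prediction` record, the function names, and the toy data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    # Hypothetical per-question record for a multiple-choice benchmark.
    correct_text_only: bool   # model was right when given the question only
    correct_with_video: bool  # model was right when given question + video
    num_options: int          # number of answer choices for this question


def accuracy(flags):
    """Fraction of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags)


def blind_gap(preds):
    """Assumed form: text-only accuracy minus expected random-guess accuracy."""
    chance = sum(1.0 / p.num_options for p in preds) / len(preds)
    return accuracy(p.correct_text_only for p in preds) - chance


def visual_gain(preds):
    """Assumed form: accuracy with video minus text-only accuracy."""
    with_video = accuracy(p.correct_with_video for p in preds)
    text_only = accuracy(p.correct_text_only for p in preds)
    return with_video - text_only


# Toy data: four 4-way questions, made up purely for illustration.
preds = [
    Prediction(True, True, 4),
    Prediction(True, True, 4),
    Prediction(True, False, 4),
    Prediction(False, False, 4),
]
gap = blind_gap(preds)     # 0.75 text-only accuracy - 0.25 chance
gain = visual_gain(preds)  # 0.50 with video - 0.75 text-only
```

Under these assumed definitions, a large positive Blind Gap together with a near-zero or negative Visual Gain would match the pattern the abstract reports for benchmarks where models do as well or better without video.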
inproceedings KBK+26
CTB @ICML 2026
Workshop on Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand at the 43rd International Conference on Machine Learning. Seoul, South Korea, Jul 06-11, 2026. To be published. Preprint available.
Authors
S. Korkut • M. A. Bravo Sarmiento • S. Kim • Z. Akata
BibTeXKey: KBK+26