From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA

MCML Authors

Sanghwan Kim

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Core PI

Interpretable and Reliable Machine Learning

Abstract

High benchmark accuracy does not guarantee genuine use of visual evidence. We study this problem in traffic accident Video Question Answering (VideoQA), where correct answers should depend on scene-specific visual evidence but may instead be inferred from textual shortcuts. Through an audit of four public benchmarks, we find that several recent open-weight Vision-Language Models (VLMs) perform competitively, and sometimes better, without video input. On the MMAU benchmark, removing video consistently improves accuracy, and adding more frames further degrades performance. To quantify visual dependence, we introduce two dataset-level diagnostics: Blind Gap, measuring above-chance text-only performance, and Visual Gain, measuring the marginal benefit of adding video. We further propose an instance-level Shortcut Score that combines text-only confidence with visual necessity signals, enabling continuous, training-free filtering of shortcut-prone questions. The resulting subsets reduce shortcut bias and improve visual grounding. Our findings reveal large differences in grounding quality across benchmarks and show that visually grounded evaluation, not just high accuracy, is essential in safety-critical VideoQA.
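
The two dataset-level diagnostics can be illustrated with a small sketch. The abstract describes them only in words, so the formulas below are an assumed reading rather than the paper's exact definitions: Blind Gap is taken as text-only accuracy minus chance accuracy, and Visual Gain as accuracy with video minus text-only accuracy. The function names, the toy predictions, and the four-option chance level are all hypothetical.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that match the reference answers."""
    if not answers or len(preds) != len(answers):
        raise ValueError("preds and answers must be non-empty and equal in length")
    return sum(p == a for p, a in zip(preds, answers)) / len(answers)


def blind_gap(text_only_acc: float, chance_acc: float) -> float:
    """Assumed form: how far text-only accuracy sits above chance."""
    return text_only_acc - chance_acc


def visual_gain(video_acc: float, text_only_acc: float) -> float:
    """Assumed form: accuracy change from adding video to the text-only input."""
    return video_acc - text_only_acc


# Toy multiple-choice data (hypothetical, four options per question).
answers = ["A", "B", "C", "D"]
text_only_preds = ["A", "B", "C", "A"]  # model sees question and options only
video_preds = ["A", "B", "A", "A"]      # same model with video frames added

acc_text = accuracy(text_only_preds, answers)
acc_video = accuracy(video_preds, answers)
chance = 1 / 4  # uniform guessing over four options

print("Blind Gap:", blind_gap(acc_text, chance))
print("Visual Gain:", visual_gain(acc_video, acc_text))
```

Under this reading, a large positive Blind Gap paired with a Visual Gain near or below zero is the pattern the abstract reports: questions are answerable from text alone and video adds little or even hurts.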

inproceedings KBK+26

CTB @ICML 2026

Workshop on Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand at the 43rd International Conference on Machine Learning. Seoul, South Korea, Jul 06-11, 2026. To be published. Preprint available.

Authors

S. Korkut • M. A. Bravo Sarmiento • S. Kim • Z. Akata

Research Area

B1 | Computer Vision

BibTeXKey: KBK+26