
VGGSounder: Audio-Visual Evaluations for Foundation Models

Abstract

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The classification dataset VGGSound is commonly used as a benchmark for evaluating audio-visual understanding. However, our analysis identifies several critical issues in VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These flaws lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set extending VGGSound that is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations that enable precise analyses of modality-specific performance and reveal previously unnoticed model limitations, offering a robust benchmark to support the future development of audio-visual foundation models.

ICCV 2025

IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai'i, Oct 19-23, 2025. To be published.
A* Conference

Authors

D. Zverev • T. Wiedemer • A. Prabhu • M. Bethge • W. Brendel • A. S. Koepke

Research Area

B1 | Computer Vision

BibTeX key: ZWP+25a