
VGGSounder: Audio-Visual Evaluations for Foundation Models

Abstract

Designing effective foundation models requires high-quality evaluation datasets. With the emergence of audio-visual foundation models, reliable assessment of their multi-modal understanding is essential. The current gold standard for evaluating audio-visual understanding is the popular classification dataset VGGSound. However, our analysis identifies several critical issues in VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These flaws lead to distorted evaluations of models' true auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set extending VGGSound that is explicitly designed to accurately evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance and revealing previously unnoticed model limitations. We believe VGGSounder offers a robust and reliable benchmark supporting the future development of audio-visual foundation models.
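To illustrate how modality annotations of this kind can be used, the minimal sketch below computes per-modality recall for a multi-label classifier. The function name, array layout, and metric choice are illustrative assumptions, not the benchmark's official evaluation protocol.

```python
import numpy as np

def modality_recall(preds, labels, modality_mask):
    """Illustrative modality-aware multi-label recall (hypothetical helper).

    preds, labels: (n_samples, n_classes) binary arrays of model
        predictions and ground-truth labels.
    modality_mask: (n_samples, n_classes) binary array marking which
        labels are perceivable in the modality under evaluation
        (e.g. audible vs. visible), as per-modality annotations provide.
    """
    # Restrict the evaluation to ground-truth labels present in the modality.
    relevant = labels * modality_mask
    hits = (preds * relevant).sum()
    return hits / max(relevant.sum(), 1)

# Example: 3 clips, 4 classes; a label may be audible, visible, or both.
labels  = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]])
preds   = np.array([[1, 0, 0, 0], [0, 1, 0, 1], [1, 0, 0, 1]])
audible = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 1]])
visible = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 1]])

print("audio recall:", modality_recall(preds, labels, audible))   # 0.8
print("visual recall:", modality_recall(preds, labels, visible))  # 0.75
```

Comparing the two scores separates a model's auditory from its visual performance, which is exactly the kind of modality-specific analysis a single-label benchmark cannot support.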



ICCV 2025

IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai'i, Oct 19-23, 2025. To be published. Preprint available.
A* Conference

Authors

D. Zverev • T. Wiedemer • A. Prabhu • M. Bethge • W. Brendel • A. S. Koepke



Research Area

B1 | Computer Vision

BibTeX key: ZWP+25a
