
VGGSounder: Audio-Visual Evaluations for Foundation Models

Abstract

Designing effective foundation models requires high-quality evaluation datasets. With the emergence of audio-visual foundation models, reliable assessment of their multi-modal understanding is essential. The current gold standard for evaluating audio-visual understanding is the popular classification dataset VGGSound. However, our analysis identifies several critical issues in VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These flaws lead to distorted evaluations of models' true auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set extending VGGSound that is explicitly designed to accurately evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance and revealing previously unnoticed model limitations. We believe VGGSounder offers a robust and reliable benchmark supporting the future development of audio-visual foundation models.
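To illustrate how modality annotations of this kind can be used, the minimal sketch below computes per-modality recall for a multi-label classifier. The function name, array layout, and metric choice are illustrative assumptions, not the benchmark's official evaluation protocol.

```python
import numpy as np

def modality_recall(preds, labels, modality_mask):
    """Illustrative modality-aware multi-label recall (hypothetical helper).

    preds, labels: (n_samples, n_classes) binary arrays of model
        predictions and ground-truth labels.
    modality_mask: (n_samples, n_classes) binary array marking which
        labels are perceivable in the modality under evaluation
        (e.g. audible vs. visible), as per-modality annotations provide.
    """
    # Restrict the evaluation to ground-truth labels present in the modality.
    relevant = labels * modality_mask
    hits = (preds * relevant).sum()
    return hits / max(relevant.sum(), 1)

# Example: 3 clips, 4 classes; a label may be audible, visible, or both.
labels  = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]])
preds   = np.array([[1, 0, 0, 0], [0, 1, 0, 1], [1, 0, 0, 1]])
audible = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 1]])
visible = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 1]])

print("audio recall:", modality_recall(preds, labels, audible))   # 0.8
print("visual recall:", modality_recall(preds, labels, visible))  # 0.75
```

Comparing the two scores separates a model's auditory from its visual performance, which is exactly the kind of modality-specific analysis a single-label benchmark cannot support.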



ICCV 2025

IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai'i, Oct 19-23, 2025. To be published. Preprint available.
A* Conference

Authors

D. Zverev • T. Wiedemer • A. Prabhu • M. Bethge • W. Brendel • A. S. Koepke



Research Area

B1 | Computer Vision

BibTeX key: ZWP+25a
