
Research Group Almut Sophia Koepke



Dr. Almut Sophia Koepke

JRG Leader Multi-Modal Learning

Computer Vision & Artificial Intelligence

Almut Sophia Koepke leads the MCML Junior Research Group ‘Multi-Modal Learning’ at TU Munich.

She and her team conduct research on multi-modal learning from vision, sound, and text. They focus on advancing video understanding, with an emphasis on capturing temporal dynamics and cross-modal relationships. To this end, they aim to improve how information from different modalities is combined within learning frameworks, and they explore how large pre-trained models can be adapted for audio-visual understanding tasks. Funded as a BMBF project, the group pursues research directions that go beyond MCML's current focus while maintaining close collaboration with MCML.

Team members @MCML

PhD Students


Monica Riedler

Computer Vision & Artificial Intelligence


Daniil Zverev

Computer Vision & Artificial Intelligence

Publications @MCML

2025


[1] P. Mondorf, S. Zhou, M. Riedler and B. Plank. Enabling Systematic Generalization in Abstract Spatial Reasoning through Meta-Learning for Compositionality. Preprint (Apr. 2025). arXiv.
Abstract

Systematic generalization refers to the capacity to understand and generate novel combinations from known components. Despite recent progress by large language models (LLMs) across various domains, these models often fail to extend their knowledge to novel compositional scenarios, revealing notable limitations in systematic generalization. There has been an ongoing debate about whether neural networks possess the capacity for systematic generalization, with recent studies suggesting that meta-learning approaches designed for compositionality can significantly enhance this ability. However, these insights have largely been confined to linguistic problems, leaving their applicability to other tasks an open question. In this study, we extend the approach of meta-learning for compositionality to the domain of abstract spatial reasoning. To this end, we introduce SYGAR, a dataset designed to evaluate the capacity of models to systematically generalize from known geometric transformations (e.g., translation, rotation) of two-dimensional objects to novel combinations of these transformations (e.g., translation+rotation). Our results show that a transformer-based encoder-decoder model, trained via meta-learning for compositionality, can systematically generalize to previously unseen transformation compositions, significantly outperforming state-of-the-art LLMs, including o3-mini, GPT-4o, and Gemini 2.0 Flash, which fail to exhibit similar systematic behavior. Our findings highlight the effectiveness of meta-learning in promoting systematicity beyond linguistic tasks, suggesting a promising direction toward more robust and generalizable models.
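
To make the evaluated setting concrete, here is a minimal Python sketch (a hypothetical illustration, not the paper's SYGAR implementation) of how known geometric transformations of a two-dimensional object compose into a novel combination such as translation+rotation:

    import numpy as np

    # Hypothetical illustration (assumed object and transformations, not the
    # paper's actual code). It shows the compositional setting the dataset
    # evaluates: primitive transformations of 2D objects are seen in
    # isolation, and models must generalize to unseen combinations.

    def translate(points: np.ndarray, dx: int, dy: int) -> np.ndarray:
        """Shift every (x, y) grid coordinate of the object by (dx, dy)."""
        return points + np.array([dx, dy])

    def rotate90(points: np.ndarray) -> np.ndarray:
        """Rotate the object 90 degrees counter-clockwise about the origin."""
        # Row-vector convention: (x, y) -> (-y, x).
        return points @ np.array([[0, 1], [-1, 0]])

    # A small L-shaped object given as (x, y) grid coordinates.
    obj = np.array([[0, 0], [1, 0], [0, 1]])

    # Primitives seen during training ...
    print(translate(obj, 2, 0))   # translation alone
    print(rotate90(obj))          # rotation alone

    # ... and a held-out composition (translation+rotation): a systematically
    # generalizing model must handle it despite never having seen the two
    # transformations applied together.
    print(translate(rotate90(obj), 2, 0))

In the meta-learning-for-compositionality setup the abstract describes, a model is trained across many such episodes, so it learns to infer novel compositions from demonstrations of the primitives rather than memorizing individual transformations.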

MCML Authors

Philipp Mondorf

AI and Computational Linguistics


Shijia Zhou

AI and Computational Linguistics


Monica Riedler

Computer Vision & Artificial Intelligence

Prof. Dr. Barbara Plank

AI and Computational Linguistics