27.02.2025

MCML Researchers With Eight Papers at WACV 2025

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025). Tucson, AZ, USA, 28.02.2025–04.03.2025

We are happy to announce that MCML researchers are represented with eight papers at WACV 2025. Congrats to our researchers!

Main Track (8 papers)

R. Amoroso, G. Zhang, R. Koner, L. Baraldi, R. Cucchiara and V. Tresp.
Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI
Abstract

Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos, identify the most relevant information based on contextual cues from a given question, and reason accurately to provide answers. Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities. This progress is largely driven by the effective alignment between visual data and the language space of MLLMs. However, for video QA, an additional space-time alignment poses a considerable challenge for extracting question-relevant information across frames. In this work, we investigate diverse temporal modeling techniques to integrate with MLLMs, aiming to achieve question-guided temporal modeling that leverages pre-trained visual and textual alignment in MLLMs. We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs. Our evaluation across multiple video QA benchmarks demonstrates that T-Former competes favorably with existing temporal modeling approaches and aligns with recent advancements in video QA.
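
The paper's exact T-Former architecture is not reproduced here, but the core idea of question-guided temporal queries can be sketched as a small cross-attention module. Everything below (class name, dimensions, conditioning by addition) is an illustrative assumption, not the authors' code:

import torch
import torch.nn as nn

class QuestionGuidedTemporalQueries(nn.Module):
    """Learnable queries, conditioned on the question, attend over frame features."""
    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.question_proj = nn.Linear(dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats, question_feat):
        # frame_feats: (B, T, dim) per-frame features; question_feat: (B, dim) pooled question embedding
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        q = q + self.question_proj(question_feat).unsqueeze(1)   # condition the queries on the question
        fused, _ = self.cross_attn(q, frame_feats, frame_feats)  # queries attend across time
        return fused  # (B, num_queries, dim), passed to the LLM as soft visual tokens

In such a design the fused query tokens replace per-frame tokens at the LLM input, keeping the sequence length independent of the number of frames.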

MCML Authors

Gengyuan Zhang (Database Systems and Data Mining)
Rajat Koner (Database Systems and Data Mining)
Prof. Dr. Volker Tresp (Database Systems and Data Mining)


A. H. Berger, L. Lux, S. Shit, I. Ezhov, G. Kaissis, M. Menten, D. Rückert and J. C. Paetzold.
Cross-Domain and Cross-Dimension Learning for Image-to-Graph Transformers.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI
Abstract

Direct image-to-graph transformation is a challenging task that involves solving object detection and relationship prediction in a single model. Due to this task’s complexity, large training datasets are rare in many domains, making the training of deep-learning methods challenging. This data sparsity necessitates transfer learning strategies akin to the state-of-the-art in general computer vision. In this work, we introduce a set of methods enabling cross-domain and cross-dimension learning for image-to-graph transformers. We propose (1) a regularized edge sampling loss to effectively learn object relations in multiple domains with different numbers of edges, (2) a domain adaptation framework for image-to-graph transformers aligning image- and graph-level features from different domains, and (3) a projection function that allows using 2D data for training 3D transformers. We demonstrate our method’s utility in cross-domain and cross-dimension experiments, where we utilize labeled data from 2D road networks for simultaneous learning in vastly different target domains. Our method consistently outperforms standard transfer learning and self-supervised pretraining on challenging benchmarks, such as retinal or whole-brain vessel graph extraction.
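
The abstract names a regularized edge sampling loss without spelling it out. One plausible reading, a minimal sketch assuming balanced subsampling of positive and negative edge candidates under a binary cross-entropy objective (function and parameter names are mine, not the paper's):

import torch
import torch.nn.functional as F

def sampled_edge_loss(edge_logits, edge_labels, num_samples=128):
    # edge_logits, edge_labels: (E,) scores and 0/1 labels over candidate node pairs.
    # Sampling the same number of positives and negatives keeps the gradient signal
    # comparable across domains whose graphs have very different edge densities.
    pos = torch.nonzero(edge_labels == 1).squeeze(1)
    neg = torch.nonzero(edge_labels == 0).squeeze(1)
    k = min(num_samples, pos.numel(), neg.numel())
    idx = torch.cat([pos[torch.randperm(pos.numel())[:k]],
                     neg[torch.randperm(neg.numel())[:k]]])
    return F.binary_cross_entropy_with_logits(edge_logits[idx], edge_labels[idx].float())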

MCML Authors

Laurin Lux (Artificial Intelligence in Healthcare and Medicine)
Dr. Georgios Kaissis (former member)
Dr. Martin Menten (Artificial Intelligence in Healthcare and Medicine)
Prof. Dr. Daniel Rückert (Artificial Intelligence in Healthcare and Medicine)


S. Chen, Z. Han, B. He, J. Liu, M. Buckley, Y. Qin, P. Torr, V. Tresp and J. Gu.
Can Multimodal Large Language Models Truly Perform Multimodal In-Context Learning?
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI URL
Abstract

Large Language Models (LLMs) with in-context learning (ICL) ability can quickly adapt to a specific context given a few demonstrations (demos). Recently, Multimodal Large Language Models (MLLMs) built upon LLMs have also shown multimodal ICL ability, i.e., responding to queries given a few multimodal demos, including images, queries, and answers. While ICL has been extensively studied for LLMs, research on ICL for MLLMs remains limited. One essential question is whether these MLLMs can truly conduct multimodal ICL, or if only the textual modality is necessary. We investigate this question by examining two primary factors that influence ICL: 1) demo content, i.e., understanding the influence of demo content in different modalities, and 2) demo selection strategy, i.e., how to select better multimodal demos for improved performance. Our experiments reveal that multimodal ICL is predominantly driven by the textual content, whereas the visual information in the demos has little influence. Interestingly, visual content is still necessary and useful for selecting demos to increase performance. Motivated by our analysis, we propose a simple yet effective approach, termed Mixed Modality In-Context Example Selection (MMICES), which considers both visual and language modalities when selecting demos. Extensive experiments support our findings and verify the improvement brought by our method.
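
The abstract describes MMICES only as considering both modalities when selecting demos. The sketch below assumes one plausible two-stage reading, a visual pre-filter followed by text-based re-ranking over precomputed embeddings; the function and parameter names are hypothetical:

import torch.nn.functional as F

def mmices_select(q_img, q_txt, demo_imgs, demo_txts, pre_k=20, k=4):
    # q_img, q_txt: (d,) query embeddings; demo_imgs, demo_txts: (N, d) demo pools.
    img_sim = F.cosine_similarity(demo_imgs, q_img.unsqueeze(0), dim=1)
    cand = img_sim.topk(pre_k).indices                            # stage 1: visually similar demos
    txt_sim = F.cosine_similarity(demo_txts[cand], q_txt.unsqueeze(0), dim=1)
    return cand[txt_sim.topk(k).indices]                          # stage 2: re-rank by text similarity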

MCML Authors

Shuo Chen (Database Systems and Data Mining)
Prof. Dr. Volker Tresp (Database Systems and Data Mining)


F. Fundel, J. Schusterbauer, V. T. Hu and B. Ommer.
Distillation of Diffusion Features for Semantic Correspondence.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI
Abstract

Semantic correspondence, the task of determining relationships between different parts of images, underpins various applications including 3D reconstruction, image-to-image translation, object tracking, and visual place recognition. Recent studies have begun to explore representations learned in large generative image models for semantic correspondence, demonstrating promising results. Building on this progress, current state-of-the-art methods rely on combining multiple large models, resulting in high computational demands and reduced efficiency. In this work, we address this challenge with a novel knowledge distillation technique. We show how to use two large vision foundation models and distill the capabilities of these complementary models into one smaller model that maintains high accuracy at reduced computational cost. Furthermore, we demonstrate that by incorporating 3D data, we are able to further improve performance, without the need for human-annotated correspondences. Overall, our empirical results demonstrate that our distilled model with 3D data augmentation achieves performance superior to current state-of-the-art methods while significantly reducing computational load and enhancing practicality for real-world applications, such as semantic video correspondence. Our code and weights are publicly available on our project page.
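
At its core, distilling two teachers into one student is standard feature regression. A minimal sketch, assuming two frozen teachers whose feature maps are concatenated as the student's target (the training-step shape and loss choice are assumptions, not the paper's recipe):

import torch
import torch.nn.functional as F

def distill_step(student, teacher_a, teacher_b, images, optimizer):
    # Teachers stay frozen; the student regresses their concatenated features.
    with torch.no_grad():
        target = torch.cat([teacher_a(images), teacher_b(images)], dim=1)
    loss = F.mse_loss(student(images), target)  # a cosine objective is a common alternative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()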

MCML Authors

Dr. Vincent Tao Hu (Computer Vision & Learning)
Prof. Dr. Björn Ommer (Computer Vision & Learning)


F. Hofherr, B. Haefner and D. Cremers.
On Neural BRDFs: A Thorough Comparison of State-of-the-Art Approaches.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. Oral Presentation. DOI
Abstract

The bidirectional reflectance distribution function (BRDF) is an essential tool to capture the complex interaction of light and matter. Recently, several works have employed neural methods for BRDF modeling, following various strategies, ranging from utilizing existing parametric models to purely neural parametrizations. While all methods yield impressive results, a comprehensive comparison of the different approaches is missing in the literature. In this work, we present a thorough evaluation of several approaches, including results for qualitative and quantitative reconstruction quality and an analysis of reciprocity and energy conservation. Moreover, we propose two extensions that can be added to existing approaches: A novel additive combination strategy for neural BRDFs that split the reflectance into a diffuse and a specular part, and an input mapping that ensures reciprocity exactly by construction, while previous approaches only ensure it by soft constraints.
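
The reciprocity-by-construction idea can be illustrated with a symmetric input mapping: a network fed only swap-invariant functions of the two directions is exactly reciprocal. The mapping below (elementwise sum and product) is one such choice for illustration, not necessarily the paper's:

import torch
import torch.nn as nn

class ReciprocalBRDF(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(6, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3), nn.Softplus())  # non-negative RGB

    def forward(self, wi, wo):
        # (wi + wo, wi * wo) is invariant to swapping wi and wo, so
        # f(wi, wo) == f(wo, wi) holds exactly, not just via a soft penalty.
        return self.net(torch.cat([wi + wo, wi * wo], dim=-1))

brdf = ReciprocalBRDF()
wi, wo = torch.randn(4, 3), torch.randn(4, 3)
assert torch.allclose(brdf(wi, wo), brdf(wo, wi))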

MCML Authors

Florian Hofherr (Computer Vision & Artificial Intelligence)
Prof. Dr. Daniel Cremers (Computer Vision & Artificial Intelligence)


Y. Li, M. Ghahremani, Y. Wally and C. Wachinger.
DiaMond: Dementia Diagnosis with Multi-Modal Vision Transformers Using MRI and PET.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI
Abstract

Diagnosing dementia, particularly for Alzheimer’s Disease (AD) and frontotemporal dementia (FTD), is complex due to overlapping symptoms. While magnetic resonance imaging (MRI) and positron emission tomography (PET) data are critical for the diagnosis, integrating these modalities in deep learning faces challenges, often resulting in suboptimal performance compared to using single modalities. Moreover, the potential of multi-modal approaches in differential diagnosis, which holds significant clinical importance, remains largely unexplored. We propose a novel framework, DiaMond, to address these issues with vision Transformers to effectively integrate MRI and PET. DiaMond is equipped with self-attention and a novel bi-attention mechanism that synergistically combine MRI and PET, alongside a multi-modal normalization to reduce redundant dependency, thereby boosting the performance. DiaMond significantly outperforms existing multi-modal methods across various datasets, achieving a balanced accuracy of 92.4% in AD diagnosis, 65.2% for AD-MCI-CN classification, and 76.5% in differential diagnosis of AD and FTD. We also validated the robustness of DiaMond in a comprehensive ablation study.
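
The abstract does not detail the bi-attention mechanism; a plausible minimal sketch is symmetric cross-attention, where each modality's tokens query the other's. The block below is an assumption-laden illustration of that pattern, not DiaMond's actual module:

import torch
import torch.nn as nn

class BiAttentionBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.mri_to_pet = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pet_to_mri = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_mri, self.norm_pet = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, mri, pet):
        # mri, pet: (B, N, dim) token sequences from the two imaging modalities
        mri_upd, _ = self.mri_to_pet(self.norm_mri(mri), pet, pet)
        pet_upd, _ = self.pet_to_mri(self.norm_pet(pet), mri, mri)
        return mri + mri_upd, pet + pet_upd  # residual updates for both streams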

MCML Authors

Yitong Li (Artificial Intelligence in Medical Imaging)
Dr. Morteza Ghahremani (Artificial Intelligence in Medical Imaging)
Prof. Dr. Christian Wachinger (Artificial Intelligence in Medical Imaging)


O. Wysocki, Y. Tan, T. Froech, Y. Xia, M. Wysocki, L. Hoegner, D. Cremers and C. Holst.
ZAHA: Introducing the Level of Facade Generalization and the Large-Scale Point Cloud Facade Semantic Segmentation Benchmark Dataset.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI GitHub
Abstract

Facade semantic segmentation is a long-standing challenge in photogrammetry and computer vision. Although the last decades have witnessed the influx of facade segmentation methods, there is a lack of comprehensive facade classes and data covering the architectural variability. In ZAHA (project page: https://github.com/OloOcki/zaha), we introduce the Level of Facade Generalization (LoFG), novel hierarchical facade classes designed based on international urban modeling standards, ensuring compatibility with real-world challenging classes and a uniform comparison of methods. Realizing the LoFG, we present the largest semantic 3D facade segmentation dataset to date, providing 601 million annotated points at five and 15 classes for LoFG2 and LoFG3, respectively. Moreover, we analyze the performance of baseline semantic segmentation methods on our introduced LoFG classes and data, complementing it with a discussion on the unresolved challenges for facade segmentation. We firmly believe that ZAHA shall facilitate further development of 3D facade semantic segmentation methods, enabling robust segmentation indispensable in creating urban digital twins.
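
A hierarchical class scheme like LoFG implies a deterministic projection from fine to coarse labels. The class names below are hypothetical placeholders (the actual taxonomy is defined in the paper); the sketch only shows how such a projection would be applied to point labels:

# Hypothetical fine-to-coarse mapping, illustrating the hierarchy idea only.
LOFG3_TO_LOFG2 = {
    "window": "opening", "door": "opening",
    "balcony": "protrusion", "cornice": "protrusion",
    "wall": "surface", "roof": "surface",
}

def coarsen(point_labels, mapping=LOFG3_TO_LOFG2):
    """Project fine (LoFG3-style) point labels onto a coarser (LoFG2-style) level."""
    return [mapping.get(label, "other") for label in point_labels]

print(coarsen(["window", "balcony", "wall"]))  # ['opening', 'protrusion', 'surface']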

MCML Authors

Dr. Yan Xia (Computer Vision & Artificial Intelligence)
Magdalena Wysocki (Computer Aided Medical Procedures & Augmented Reality)
Prof. Dr. Daniel Cremers (Computer Vision & Artificial Intelligence)


Y. Zhang, H. Chen, A. Frikha, Y. Yang, D. Krompass, G. Zhang, J. Gu and V. Tresp.
CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI
Abstract

Visual Question Answering (VQA) systems have witnessed significant advances in recent years due to the development of large-scale Vision-Language Pre-trained Models (VLPMs). As application scenarios and user demands change over time, an advanced VQA system is expected to continuously expand its knowledge and capabilities, not only to handle new tasks (i.e., new question types or visual scenes) but also to answer questions in new specialized domains without forgetting previously acquired knowledge and skills. Existing works studying continual learning (CL) on VQA tasks primarily consider answer- and question-type incremental learning or scene- and function-incremental learning, whereas how VQA systems perform when they encounter new domains and increasing user demands has not been studied. Motivated by this, we introduce CL-CrossVQA, a rigorous Continual Learning benchmark for Cross-domain Visual Question Answering, through which we conduct extensive experiments on 4 VLPMs, 5 CL approaches, and 5 VQA datasets from different domains. In addition, by probing the forgetting phenomenon of the intermediate layers, we provide insights into how model architecture affects CL performance, why CL approaches can help mitigate forgetting in VLPMs, and how to design CL approaches suitable for VLPMs in this challenging continual learning environment. To facilitate future work on developing an advanced all-in-one VQA system, we will release our datasets and code.
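
The benchmark's core protocol, sequential training over domains with re-evaluation of all previously seen domains, can be sketched generically; train_fn and eval_fn are placeholders for any VLPM fine-tuning and accuracy routine, not the benchmark's actual API:

def continual_vqa(model, domains, train_fn, eval_fn):
    """Train sequentially over domains; accuracy drops on earlier test sets indicate forgetting."""
    history = []
    for i, domain in enumerate(domains):
        train_fn(model, domain.train)
        # Evaluate on every domain seen so far, including the current one.
        history.append([eval_fn(model, d.test) for d in domains[: i + 1]])
    return history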

MCML Authors

Yao Zhang (Database Systems and Data Mining)
Haokun Chen (Database Systems and Data Mining)
Dr. Ahmed Frikha (former member)
Gengyuan Zhang (Database Systems and Data Mining)
Prof. Dr. Volker Tresp (Database Systems and Data Mining)

