27.02.2025

MCML Researchers With Four Papers at WACV 2025
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025). Tucson, AZ, USA, 28.02.2025–04.03.2024
We are happy to announce that MCML researchers are represented with four papers at WACV 2025:
Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. To be published. Preprint available. arXiv
Abstract
Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos, identify the most relevant information based on contextual cues from a given question, and reason accurately to provide answers. Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities. This progress is largely driven by the effective alignment between visual data and the language space of MLLMs. However, for video QA, an additional space-time alignment poses a considerable challenge for extracting question-relevant information across frames. In this work, we investigate diverse temporal modeling techniques to integrate with MLLMs, aiming to achieve question-guided temporal modeling that leverages pre-trained visual and textual alignment in MLLMs. We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs. Our evaluation across multiple video QA benchmarks demonstrates that T-Former competes favorably with existing temporal modeling approaches and aligns with recent advancements in video QA.
MCML Authors
Cross-domain and Cross-dimension Learning for Image-to-Graph Transformers.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. To be published. Preprint available. arXiv
Abstract
Direct image-to-graph transformation is a challenging task that involves solving object detection and relationship prediction in a single model. Due to this task’s complexity, large training datasets are rare in many domains, making the training of deep-learning methods challenging. This data sparsity necessitates transfer learning strategies akin to the state-of-the-art in general computer vision. In this work, we introduce a set of methods enabling cross-domain and cross-dimension learning for image-to-graph transformers. We propose (1) a regularized edge sampling loss to effectively learn object relations in multiple domains with different numbers of edges, (2) a domain adaptation framework for image-to-graph transformers aligning image- and graph-level features from different domains, and (3) a projection function that allows using 2D data for training 3D transformers. We demonstrate our method’s utility in cross-domain and cross-dimension experiments, where we utilize labeled data from 2D road networks for simultaneous learning in vastly different target domains. Our method consistently outperforms standard transfer learning and self-supervised pretraining on challenging benchmarks, such as retinal or whole-brain vessel graph extraction.
MCML Authors
Artificial Intelligence in Healthcare and Medicine
Can Multimodal Large Language Models Truly Perform Multimodal In-Context Learning?
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. To be published. Preprint available. arXiv URL
Abstract
Large Language Models (LLMs) with in-context learning (ICL) ability can quickly adapt to a specific context given a few demonstrations (demos). Recently, Multimodal Large Language Models (MLLMs) built upon LLMs have also shown multimodal ICL ability, i.e., responding to queries given a few multimodal demos, including images, queries, and answers. While ICL has been extensively studied on LLMs, its research on MLLMs remains limited. One essential question is whether these MLLMs can truly conduct multimodal ICL, or if only the textual modality is necessary. We investigate this question by examining two primary factors that influence ICL: 1) Demo content, i.e., understanding the influences of demo content in different modalities. 2) Demo selection strategy, i.e., how to select better multimodal demos for improved performance. Experiments revealed that multimodal ICL is predominantly driven by the textual content whereas the visual information in the demos has little influence. Interestingly, visual content is still necessary and useful for selecting demos to increase performance. Motivated by our analysis, we propose a simple yet effective approach, termed Mixed Modality In-Context Example Selection (MMICES), which considers both visual and language modalities when selecting demos. Extensive experiments are conducted to support our findings and verify the improvement brought by our method.
MCML Authors
DiaMond: Dementia Diagnosis with Multi-Modal Vision Transformers Using MRI and PET.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. To be published. Preprint available. arXiv
Abstract
Diagnosing dementia, particularly for Alzheimer’s Disease (AD) and frontotemporal dementia (FTD), is complex due to overlapping symptoms. While magnetic resonance imaging (MRI) and positron emission tomography (PET) data are critical for the diagnosis, integrating these modalities in deep learning faces challenges, often resulting in suboptimal performance compared to using single modalities. Moreover, the potential of multi-modal approaches in differential diagnosis, which holds significant clinical importance, remains largely unexplored. We propose a novel framework, DiaMond, to address these issues with vision Transformers to effectively integrate MRI and PET. DiaMond is equipped with self-attention and a novel bi-attention mechanism that synergistically combine MRI and PET, alongside a multi-modal normalization to reduce redundant dependency, thereby boosting the performance. DiaMond significantly outperforms existing multi-modal methods across various datasets, achieving a balanced accuracy of 92.4% in AD diagnosis, 65.2% for AD-MCI-CN classification, and 76.5% in differential diagnosis of AD and FTD. We also validated the robustness of DiaMond in a comprehensive ablation study.
MCML Authors
Artificial Intelligence in Medical Imaging
27.02.2025
Related

24.02.2025
MCML Researchers With Seven Papers at AAAI 2025
39th Conference on Artificial Intelligence (AAAI 2025). Philadelphia, PA, USA, 25.02.2025 - 04.03.2024

05.12.2024
MCML Researchers With 27 Papers at NeurIPS 2024
38th Conference on Neural Information Processing Systems (NeurIPS 2024). Vancouver, Canada, 10.12.2024 - 15.12.2024

06.11.2024
MCML Researchers With 22 Papers at EMNLP 2024
Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). Miami, FL, USA, 12.11.2024 - 16.11.2024