02.01.2024

MCML Researchers With Seven Papers at WACV 2024

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2024). Waikoloa, Hawaii, 04.01.2024–08.01.2024

We are happy to announce that MCML researchers are represented with seven papers at WACV 2024. Congrats to our researchers!

Main Track (7 papers)

M. Bernhard, R. Amoroso, Y. Kindermann, M. Schubert, L. Baraldi, R. Cucchiara and V. Tresp.
What's Outside the Intersection? Fine-grained Error Analysis for Semantic Segmentation Beyond IoU.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI GitHub

Abstract

Semantic segmentation represents a fundamental task in computer vision with various application areas such as autonomous driving, medical imaging, or remote sensing. For evaluating and comparing semantic segmentation models, the mean intersection over union (mIoU) is currently the gold standard. However, while mIoU serves as a valuable benchmark, it does not offer insights into the types of errors incurred by a model. Moreover, different types of errors may have different impacts on downstream applications. To address this issue, we propose an intuitive method for the systematic categorization of errors, thereby enabling a fine-grained analysis of semantic segmentation models. Since we assign each erroneous pixel to precisely one error type, our method seamlessly extends the popular IoU-based evaluation by shedding more light on the false positive and false negative predictions. Our approach is model- and dataset-agnostic, as it does not rely on additional information besides the predicted and ground-truth segmentation masks. In our experiments, we demonstrate that our method accurately assesses model strengths and weaknesses on a quantitative basis, thus reducing the dependence on time-consuming qualitative model inspection. We analyze a variety of state-of-the-art semantic segmentation models, revealing systematic differences across various architectural paradigms. Exploiting the gained insights, we showcase that combining two models with complementary strengths in a straightforward way is sufficient to consistently improve mIoU, even for models setting the current state of the art on ADE20K.
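For readers unfamiliar with the metric, the following is a minimal sketch of how per-class IoU and mIoU follow from true positive, false positive, and false negative pixel counts. It is not the paper's error categorization; the function names `iou_breakdown` and `mean_iou` and the toy label lists are assumptions made for illustration.

```python
from collections import Counter


def iou_breakdown(pred, gt, num_classes):
    """Per-class TP/FP/FN pixel counts and IoU from two flat label sequences."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of pixels")
    tp, fp, fn = Counter(), Counter(), Counter()
    for p, g in zip(pred, gt):
        if p == g:
            tp[g] += 1
        else:
            fp[p] += 1  # pixel wrongly assigned to class p
            fn[g] += 1  # pixel of class g that was missed
    stats = {}
    for c in range(num_classes):
        union = tp[c] + fp[c] + fn[c]
        # IoU is undefined for a class absent from both masks.
        iou = tp[c] / union if union else None
        stats[c] = {"tp": tp[c], "fp": fp[c], "fn": fn[c], "iou": iou}
    return stats


def mean_iou(stats):
    """Average IoU over all classes for which IoU is defined."""
    values = [s["iou"] for s in stats.values() if s["iou"] is not None]
    return sum(values) / len(values) if values else float("nan")


# Toy example: six pixels, three classes (hypothetical data).
gt = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
stats = iou_breakdown(pred, gt, num_classes=3)
miou = mean_iou(stats)
```

In this toy case, classes 0 and 1 each have one false positive, yet mIoU alone would not reveal which class absorbs which mistakes. That is the kind of information a finer-grained error analysis aims to expose.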

MCML Authors

Maximilian Bernhard

Dr.

* Former Member

→ Group Matthias Schubert
Spatial Artificial Intelligence

Matthias Schubert

Prof. Dr.

Principal Investigator

Spatial Artificial Intelligence

Volker Tresp

Prof. Dr.

Principal Investigator

Database Systems, Data Mining and AI

M. Brahimi, B. Haefner, T. Yenamandra, B. Goldluecke and D. Cremers.
SupeRVol: Super-Resolution Shape and Reflectance Estimation in Inverse Volume Rendering.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI

Abstract

We propose an end-to-end inverse rendering pipeline called SupeRVol that allows us to recover 3D shape and material parameters from a set of color images in a super-resolution manner. To this end, we represent both the parameters of the bidirectional reflectance distribution function (BRDF) and the signed distance function (SDF) by multi-layer perceptrons (MLPs). In order to obtain both the surface shape and its reflectance properties, we revert to a differentiable volume renderer with a physically based illumination model that allows us to decouple reflectance and lighting. This physical model takes into account the effect of the camera’s point spread function, thereby enabling a reconstruction of shape and material in super-resolution quality. Experimental validation confirms that SupeRVol achieves state-of-the-art performance in terms of inverse rendering quality. It generates reconstructions that are sharper than the individual input images, making this method ideally suited for 3D modeling from low-resolution imagery.

MCML Authors

Tarun Yenamandra

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

Computer Vision & Artificial Intelligence

S. Klenk, D. Bonello, L. Koestler, N. Araslanov and D. Cremers.
Masked Event Modeling: Self-Supervised Pretraining for Event Cameras.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI

Abstract

Event cameras asynchronously capture brightness changes with low latency, high temporal resolution, and high dynamic range. However, annotation of event data is a costly and laborious process, which limits the use of deep learning methods for classification and other semantic tasks with the event modality. To reduce the dependency on labeled event data, we introduce Masked Event Modeling (MEM), a self-supervised framework for events. Our method pretrains a neural network on unlabeled events, which can originate from any event camera recording. Subsequently, the pretrained model is finetuned on a downstream task, leading to a consistent improvement of the task accuracy. For example, our method reaches state-of-the-art classification accuracy across three datasets, N-ImageNet, N-Cars, and N-Caltech101, increasing the top-1 accuracy of previous work by significant margins. When tested on real-world event data, MEM is even superior to supervised RGB-based pretraining. The models pretrained with MEM are also label-efficient and generalize well to the dense task of semantic image segmentation.

MCML Authors

Simon Klenk

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Nikita Araslanov

Dr.

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

Computer Vision & Artificial Intelligence

U. Sahin, H. Li, Q. Khan, D. Cremers and V. Tresp.
Enhancing Multimodal Compositional Reasoning of Visual Language Models With Generative Negative Mining.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI GitHub

Abstract

Contemporary large-scale visual language models (VLMs) exhibit strong representation capacities, making them ubiquitous for enhancing image and text understanding tasks. They are often trained in a contrastive manner on a large and diverse corpus of images and corresponding text captions scraped from the internet. Despite this, VLMs often struggle with compositional reasoning tasks which require a fine-grained understanding of the complex interactions of objects and their attributes. This failure can be attributed to two main factors: 1) Contrastive approaches have traditionally focused on mining negative examples from existing datasets. However, the mined negative examples might not be difficult for the model to discriminate from the positive. An alternative to mining would be negative sample generation. 2) Existing generative approaches primarily focus on generating hard negative texts associated with a given image. Mining in the other direction, i.e., generating negative image samples associated with a given text, has been ignored. To overcome both these limitations, we propose a framework that not only mines in both directions but also generates challenging negative samples in both modalities, i.e., images and texts. Leveraging these generative hard negative samples, we significantly enhance VLMs’ performance in tasks involving multimodal compositional reasoning.

MCML Authors

Hang Li

* Former Member

→ Group Volker Tresp
Database Systems, Data Mining and AI

Qadeer Khan

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

Computer Vision & Artificial Intelligence

Volker Tresp

Prof. Dr.

Principal Investigator

Database Systems, Data Mining and AI

T. Tewari, N. Yang, F. Bernard, C. Theobalt and D. Cremers.
FIRe: Fast Inverse Rendering Using Directional and Signed Distance Functions.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI

Abstract

Neural 3D implicit representations learn priors that are useful for diverse applications, such as single- or multiple-view 3D reconstruction. A major downside of existing approaches while rendering an image is that they require evaluating the network multiple times per camera ray so that the high computational time forms a bottleneck for downstream applications. We address this problem by introducing a novel neural scene representation that we call the directional distance function (DDF). To this end, we learn a signed distance function (SDF) along with our DDF model to represent a class of shapes. Specifically, our DDF is defined on the unit sphere and predicts the distance to the surface along any given direction. Therefore, our DDF allows rendering images with just a single network evaluation per camera ray. Based on our DDF, we present a novel fast algorithm (FIRe) to reconstruct 3D shapes given a posed depth map. We evaluate our proposed method on 3D reconstruction from single-view depth images, where we empirically show that our algorithm reconstructs 3D shapes more accurately and it is more than 15 times faster (per iteration) than competing methods.
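The difference between many evaluations per ray and a single one can be illustrated with a toy sketch. It uses an analytic unit sphere instead of learned networks: `sphere_sdf` stands in for an SDF that must be queried repeatedly (sphere tracing), and `sphere_ddf` stands in for a function that returns the distance along a ray in one call. Both functions and the example ray are assumptions for illustration and do not reproduce the FIRe model.

```python
import math


def sphere_sdf(point, radius=1.0):
    """Signed distance from a point to a sphere centered at the origin."""
    return math.sqrt(sum(c * c for c in point)) - radius


def sphere_trace(origin, direction, sdf, max_steps=100, eps=1e-6):
    """March along a unit-length direction, querying the SDF once per step.

    Returns (distance, number_of_sdf_evaluations), or (None, max_steps) on a miss.
    """
    t = 0.0
    for step in range(1, max_steps + 1):
        point = [o + t * d for o, d in zip(origin, direction)]
        dist = sdf(point)
        if dist < eps:
            return t, step
        t += dist  # safe step: no surface is closer than dist
    return None, max_steps


def sphere_ddf(origin, direction, radius=1.0):
    """Closed-form distance to the sphere along a unit-length direction.

    Plays the role of a directional distance function: one evaluation per ray.
    Returns None if the ray misses the sphere or the hit lies behind the origin.
    """
    b = sum(o * d for o, d in zip(origin, direction))
    c = sum(o * o for o in origin) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None
    t = -b - math.sqrt(disc)
    return t if t >= 0 else None


# Hypothetical camera ray, offset from the sphere's axis.
origin = [0.5, 0.0, -3.0]
direction = [0.0, 0.0, 1.0]
t_traced, steps = sphere_trace(origin, direction, sphere_sdf)
t_direct = sphere_ddf(origin, direction)
```

Both routes agree on the hit distance, but the traced version needs several SDF queries while the direct version needs one. With neural networks in place of the analytic functions, each query is a forward pass, which is where the cost difference comes from.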

MCML Authors

Daniel Cremers

Prof. Dr.

Director

Computer Vision & Artificial Intelligence

T. Weber, M. Ingrisch, B. Bischl and D. Rügamer.
Constrained Probabilistic Mask Learning for Task-specific Undersampled MRI Reconstruction.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI

Abstract

Undersampling is a common method in Magnetic Resonance Imaging (MRI) to subsample the number of data points in k-space, reducing acquisition times at the cost of decreased image quality. A popular approach is to employ undersampling patterns following various strategies, e.g., variable density sampling or radial trajectories. In this work, we propose a method that directly learns the undersampling masks from data points, thereby also providing task- and domain-specific patterns. To solve the resulting discrete optimization problem, we propose a general optimization routine called ProM: a fully probabilistic, differentiable, versatile, and model-free framework for mask optimization that enforces acceleration factors through a convex constraint. Analyzing knee, brain, and cardiac MRI datasets with our method, we discover that different anatomic regions reveal distinct optimal undersampling masks, demonstrating the benefits of using custom masks tailored for a downstream task. For example, ProM can create undersampling masks that maximize performance in downstream tasks like segmentation with networks trained on fully-sampled MRIs. Even with extreme acceleration factors, ProM yields reasonable performance while being more versatile than existing methods, paving the way for data-driven all-purpose mask generation.
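To make the notions of a probabilistic mask and an acceleration factor concrete, here is a minimal sketch of a hand-crafted variable-density mask, the kind of fixed strategy the abstract contrasts with learned masks. It is not ProM: the functions `scale_to_budget` and `sample_mask`, the weights, and the rescaling scheme are assumptions for illustration. Per-line sampling probabilities are rescaled so that their sum equals the number of k-space lines divided by the acceleration factor, and a binary mask is then drawn from them.

```python
import random


def scale_to_budget(weights, acceleration):
    """Rescale positive weights into probabilities in [0, 1].

    The probabilities sum to len(weights) / acceleration, so the expected
    number of sampled lines matches the requested acceleration factor.
    """
    if acceleration < 1:
        raise ValueError("acceleration must be at least 1")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    probs = [0.0] * len(weights)
    free = list(range(len(weights)))  # indices not yet clipped to 1
    remaining = len(weights) / acceleration
    while free:
        scale = remaining / sum(weights[i] for i in free)
        clipped = {i for i in free if weights[i] * scale >= 1.0}
        if not clipped:
            for i in free:
                probs[i] = weights[i] * scale
            break
        # Lines that would exceed probability 1 are always sampled;
        # the rest share what is left of the budget.
        for i in clipped:
            probs[i] = 1.0
        remaining -= len(clipped)
        free = [i for i in free if i not in clipped]
    return probs


def sample_mask(probs, seed=0):
    """Draw a binary mask with independent Bernoulli samples per line."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for p in probs]


# Hypothetical center-heavy weights over 8 k-space lines, 4x acceleration.
weights = [1, 2, 4, 8, 8, 4, 2, 1]
probs = scale_to_budget(weights, acceleration=4)
mask = sample_mask(probs)
```

Here the probabilities sum to 2, so two of the eight lines are sampled on average. A learned approach would optimize such probabilities for a downstream task instead of fixing them by hand.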

MCML Authors

Tobias Weber

* Former Member

→ Group Bernd Bischl
Statistical Learning and Data Science

Michael Ingrisch

Prof. Dr.

Principal Investigator

Clinical Data Science in Radiology

Bernd Bischl

Prof. Dr.

Director

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

Principal Investigator

Statistics, Data Science and Machine Learning

G. Zhang, Y. Zhang, K. Zhang and V. Tresp.
Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI GitHub

Abstract

Vision-Language Models (VLMs) are expected to be capable of reasoning with commonsense knowledge as human beings do. One example is that humans can reason where and when an image was taken based on their knowledge. This makes us wonder if, based on visual cues, Vision-Language Models that are pre-trained with large-scale image-text resources can achieve and even surpass human capability in reasoning about times and location. To address this question, we propose a two-stage Recognition & Reasoning probing task applied to discriminative and generative VLMs to uncover whether VLMs can recognize times and location-relevant features and further reason about them. To facilitate the studies, we introduce WikiTiLo, a well-curated image dataset comprising images with rich socio-cultural cues. In extensive evaluation experiments, we find that although VLMs can effectively retain times and location-relevant features in visual encoders, they still fail to make perfect reasoning with context-conditioned visual features.

MCML Authors

Gengyuan Zhang

→ Group Volker Tresp
Database Systems, Data Mining and AI

Volker Tresp

Prof. Dr.

Principal Investigator

Database Systems, Data Mining and AI

