
Publications by our Members

2025


[2093]
A. Triantafyllopoulos, A. Spiesberger, I. Tsangko, X. Jing, V. Distler, F. Dietz, F. Alt and B. W. Schuller.
Vishing: Detecting social engineering in spoken communication — A first survey & urgent roadmap to address an emerging societal challenge.
Computer Speech and Language 94.101802 (Nov. 2025). DOI
Abstract

Vishing – the use of voice calls for phishing – is a form of Social Engineering (SE) attack. SE attacks have become a pervasive challenge in modern societies, with over 300,000 yearly victims in the US alone. An increasing number of these attacks are conducted via voice communication, be it through machine-generated ‘robocalls’ or human actors. The goals of ‘social engineers’ can be manifold, from outright fraud to more subtle forms of persuasion. Accordingly, social engineers adopt multi-faceted strategies for voice-based attacks, utilising a variety of ‘tricks’ to exert influence and achieve their goals. Importantly, while organisations have put in place a series of guardrails against other types of SE attacks, voice calls remain ‘open ground’ for potential bad actors. In the present contribution, we provide an overview of the existing speech technology subfields that need to coalesce into a protective net against one of the major challenges to societies worldwide. Given the dearth of speech science and technology works targeting this issue, we have opted for a narrative review that bridges the gap between the existing psychological literature on the topic and research that has been pursued in parallel by the speech community on some of the constituent constructs. Our review reveals that very little literature exists on addressing this very important topic from a speech technology perspective, an omission further exacerbated by the lack of available data. Thus, our main goal is to highlight this gap and sketch out a roadmap to mitigate it, beginning with the psychological underpinnings of vishing, which primarily include deception and persuasion strategies, continuing with the speech-based approaches that can be used to detect those, as well as the generation and detection of AI-based vishing attempts, and closing with a discussion of ethical and legal considerations.

MCML Authors
Link to website

Andreas Triantafyllopoulos

Health Informatics

Link to website

Anika Spiesberger

Health Informatics

Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[2092]
S. Bamberger, R. Heckel and F. Krahmer.
Approximating Positive Homogeneous Functions with Scale Invariant Neural Networks.
Journal of Approximation Theory 311.106177 (Nov. 2025). DOI
Abstract

We investigate the approximation of positive homogeneous functions, i.e., functions satisfying f(λx) = λf(x) for all λ ≥ 0, with neural networks. Extending previous work, we establish new results explaining under which conditions such functions can be approximated with neural networks. As a key application for this, we analyze to what extent it is possible to solve linear inverse problems with networks. Due to the scaling invariance arising from the linearity, an optimal reconstruction function for such a problem is positive homogeneous. In a network, this condition translates to considering networks without bias terms. For the recovery of sparse vectors from few linear measurements, our results imply that networks with two hidden layers allow approximate recovery with arbitrary precision and arbitrary sparsity level in a stable way. In contrast, we also show that with only one hidden layer such networks cannot even recover 1-sparse vectors, not even approximately, and regardless of the width of the network. These findings even apply to a wider class of recovery problems including low-rank matrix recovery and phase retrieval. Our results also shed some light on the seeming contradiction between previous works showing that neural networks for inverse problems typically have very large Lipschitz constants, but still perform very well also for adversarial noise. Namely, the error bounds in our expressivity results include a combination of a small constant term and a term that is linear in the noise level, indicating that robustness issues may occur only for very small noise levels.
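The scale invariance discussed above is easy to check concretely: a ReLU network with no bias terms is positively homogeneous by construction, since ReLU commutes with nonnegative scaling and the weight maps are linear. A minimal sketch (all weights here are arbitrary illustrative values, not taken from the paper):

```python
# Positive homogeneity of a bias-free ReLU network: f(c*x) = c*f(x) for c >= 0,
# because relu(c*z) = c*relu(z) when c >= 0 and matrix products are linear.
def relu(v):
    return [max(0.0, z) for z in v]

def matvec(W, v):
    return [sum(w * z for w, z in zip(row, v)) for row in W]

def net(x, W1, W2):
    # two layers, no bias terms anywhere
    return matvec(W2, relu(matvec(W1, x)))

W1 = [[1.0, -2.0], [0.5, 3.0]]   # illustrative weights
W2 = [[2.0, -1.0]]
x, c = [0.7, -1.3], 4.0

lhs = net([c * xi for xi in x], W1, W2)   # f(c*x)
rhs = [c * y for y in net(x, W1, W2)]     # c*f(x)
assert all(abs(a - b) < 1e-9 for a, b in zip(lhs, rhs))
```

Adding any nonzero bias term breaks this identity, which is why the bias-free restriction appears in the abstract.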

MCML Authors
Link to Profile Reinhard Heckel

Reinhard Heckel

Prof. Dr.

Machine Learning and Information Processing

Link to Profile Felix Krahmer

Felix Krahmer

Prof. Dr.

Optimization & Data Analysis


[2091]
A. Jevtić, C. Reich, F. Wimbauer, O. Hahn, C. Rupprecht, S. Roth and D. Cremers.
Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion.
ICCV 2025 - IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published. Preprint available. arXiv
Abstract

Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.

MCML Authors
Link to website

Christoph Reich

Computer Vision & Artificial Intelligence

Link to website

Felix Wimbauer

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[2090]
Z. Jonassen, K. Lawrence, B. M. Wiesenfeld, S. Feuerriegel and D. Mann.
A qualitative analysis of remote patient monitoring: how a paradox mindset can support balancing emotional tensions in the design of healthcare technologies.
CSCW 2025 - 28th ACM SIGCHI Conference on Computer-Supported Cooperative Work and Social Computing. Bergen, Norway, Oct 18-22, 2025. To be published. Preprint available. DOI
Abstract

Remote patient monitoring (RPM) is the use of digital technologies to improve patient care at a distance. However, current RPM solutions are often biased toward tech-savvy patients. To foster health equity, researchers have studied how to address the socio-economic and cognitive needs of diverse patient groups, but their emotional needs have remained largely neglected. We perform the first qualitative study to explore the emotional needs of diverse patients around RPM. Specifically, we conduct a thematic analysis of 18 interviews and 4 focus groups at a large US healthcare organization. We identify emotional needs that lead to four emotional tensions within and across stakeholder groups when applying an equity focus to the design and implementation of RPM technologies. The four emotional tensions are making diverse patients feel: (i) heard vs. exploited; (ii) seen vs. deprioritized for efficiency; (iii) empowered vs. anxious; and (iv) cared for vs. detached from care. To manage these emotional tensions across stakeholders, we develop design recommendations informed by a paradox mindset (i.e., ‘both-and’ rather than ‘either-or’ strategies).

MCML Authors
Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[2089]
P. Jahn, W. Durani, C. Leiber, A. Beer and T. Seidl.
Going Offline: An Evaluation of the Offline Phase in Stream Clustering.
ECML-PKDD 2025 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Porto, Portugal, Sep 15-19, 2025. To be published. GitHub
Abstract

Data streams are a challenging and ever more relevant setting for clustering methods as more data arrives faster and faster. Stream clustering strategies either determine the clusters in an online manner directly as the instances appear, or they employ an offline phase where the online summarization structures are processed to obtain a clustering result. A recent analysis found that offline clustering may often be unnecessary or even counterproductive. The methods used in the offline phase are usually fixed for each stream clustering approach and typically stem from only a handful of clustering techniques. In this paper, we perform a broad experimental analysis specifically targeting the offline phase of stream clustering. We analyze several ways of extracting information from the summarization structures, including a novel strategy based on data generation. Ultimately, we showcase that an offline phase is an impactful design choice for stream clustering. We also find that the chosen offline method significantly impacts the clustering performance, with the clustering quality improving drastically for some settings.
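The online/offline split described above can be sketched in a heavily simplified form; the radius threshold, the choice of k-means as the offline method, and all data here are our own illustrative assumptions, not the paper's setup:

```python
# Two-phase stream clustering sketch: online micro-cluster summaries,
# then an offline weighted k-means over the micro-cluster centers.
import math, random

def online_phase(stream, radius=1.0):
    """Absorb each arriving point into the nearest micro-cluster summary."""
    micro = []                                  # [sum_x, sum_y, count] each
    for x, y in stream:
        best, best_d = None, radius
        for m in micro:
            d = math.hypot(x - m[0] / m[2], y - m[1] / m[2])
            if d <= best_d:
                best, best_d = m, d
        if best is None:
            micro.append([x, y, 1])             # open a new micro-cluster
        else:
            best[0] += x; best[1] += y; best[2] += 1
    return micro

def offline_phase(micro, k, iters=20, seed=0):
    """Offline step: weighted k-means over the micro-cluster centers."""
    centers = [(m[0] / m[2], m[1] / m[2], m[2]) for m in micro]
    cents = [(c[0], c[1]) for c in random.Random(seed).sample(centers, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for cx, cy, w in centers:
            j = min(range(k),
                    key=lambda i: math.hypot(cx - cents[i][0], cy - cents[i][1]))
            groups[j].append((cx, cy, w))
        cents = [(sum(cx * w for cx, _, w in g) / sum(w for _, _, w in g),
                  sum(cy * w for _, cy, w in g) / sum(w for _, _, w in g))
                 if g else cents[i]
                 for i, g in enumerate(groups)]
    return cents

rng = random.Random(1)                          # stream: two separated blobs
stream = ([(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(200)] +
          [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(200)])
rng.shuffle(stream)
final = sorted(offline_phase(online_phase(stream), k=2))
```

Swapping the offline method (here k-means) while keeping the online summaries fixed is exactly the design axis the paper evaluates.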

MCML Authors
Link to website

Philipp Jahn

Database Systems and Data Mining

Link to website

Walid Durani

Database Systems and Data Mining

Collin Leiber

Dr.

* Former Member

Anna Beer

Dr.

* Former Member

Link to Profile Thomas Seidl

Thomas Seidl

Prof. Dr.

Database Systems and Data Mining


[2088]
E. Özeren, A. Ulbrich, S. Filimon, D. Rügamer and A. Bender.
Enhancing Traffic Accident Classifications: Application of NLP Methods for City Safety.
ECML-PKDD 2025 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Porto, Portugal, Sep 15-19, 2025. To be published. Preprint available. arXiv
Abstract

A comprehensive understanding of traffic accidents is essential for improving city safety and informing policy decisions. In this study, we analyze traffic incidents in Munich to identify patterns and characteristics that distinguish different types of accidents. The dataset consists of both structured tabular features, such as location, time, and weather conditions, as well as unstructured free-text descriptions detailing the circumstances of each accident. Each incident is categorized into one of seven predefined classes. To assess the reliability of these labels, we apply NLP methods, including topic modeling and few-shot learning, which reveal inconsistencies in the labeling process. These findings highlight potential ambiguities in accident classification and motivate a refined predictive approach. Building on these insights, we develop a classification model that achieves high accuracy in assigning accidents to their respective categories. Our results demonstrate that textual descriptions contain the most informative features for classification, while the inclusion of tabular data provides only marginal improvements. These findings emphasize the critical role of free-text data in accident analysis and highlight the potential of transformer-based models in improving classification reliability.

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Link to website

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[2087]
Y. Sale, A. Javanmardi and E. Hüllermeier.
Aleatoric and Epistemic Uncertainty in Conformal Prediction.
COPA 2025 - 14th Symposium on Conformal and Probabilistic Prediction with Applications. Egham, UK, Sep 10-12, 2025. To be published.
Abstract

Recently, there has been a particular interest in distinguishing different types of uncertainty in supervised machine learning (ML) settings (Hüllermeier and Waegeman, 2021). Aleatoric uncertainty captures the inherent randomness in the data-generating process. As it represents variability that cannot be reduced even with more data, it is often referred to as irreducible uncertainty. In contrast, epistemic uncertainty arises from a lack of knowledge about the underlying data-generating process, which—in principle—can be reduced by acquiring additional data or improving the model itself (viz. reducible uncertainty). In parallel, interest in conformal prediction (CP)—both its theory and applications—has become equally vigorous. Conformal Prediction (Vovk et al., 2005) is a model-agnostic framework for uncertainty quantification that provides prediction sets or intervals with rigorous statistical coverage guarantees. Notably, CP is distribution-free and makes only the mild assumption of exchangeability. Under this assumption, it yields prediction intervals that contain the true label with a user-specified probability. Thus, conformal prediction is seen as a promising tool to quantify uncertainty. But how is it related to aleatoric and epistemic uncertainty? In particular, we first analyze how (estimates of) aleatoric and epistemic uncertainty enter into the construction of vanilla CP—that is, how noise and model error jointly shape the global threshold. We then review ‘uncertainty-aware’ extensions that integrate these uncertainty estimates into the CP pipeline.
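The coverage guarantee described above can be illustrated with split conformal prediction for regression; a minimal sketch, where the data, the fixed point predictor, and alpha are our own toy assumptions, not taken from the paper:

```python
# Split conformal prediction: calibrate a score threshold on held-out data so
# intervals cover the true label with probability >= 1 - alpha (exchangeability).
import math, random

random.seed(0)
predict = lambda x: 2.0 * x     # an assumed, already-trained point predictor

def sample(n):                  # exchangeable toy data: y = 2x + N(0, 1) noise
    return [(x, predict(x) + random.gauss(0, 1))
            for x in (random.uniform(0, 10) for _ in range(n))]

calib, alpha = sample(500), 0.1
scores = sorted(abs(y - predict(x)) for x, y in calib)   # nonconformity scores
n = len(scores)
q = scores[min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)]  # conformal quantile

def interval(x):                # prediction interval with marginal coverage
    return (predict(x) - q, predict(x) + q)

test_pts = sample(2000)
coverage = sum(interval(x)[0] <= y <= interval(x)[1]
               for x, y in test_pts) / len(test_pts)
```

Note how both the noise level (aleatoric) and any error in the fixed predictor (epistemic) inflate the same global threshold q — the entanglement the abstract sets out to analyze.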

MCML Authors
Link to website

Yusuf Sale

Artificial Intelligence and Machine Learning

Link to website

Alireza Javanmardi

Artificial Intelligence and Machine Learning

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[2086]
L. Schneider, B. Bischl and M. Feurer.
Overtuning in Hyperparameter Optimization.
AutoML 2025 - Methods Track - Methods Track at the International Conference on Automated Machine Learning. New York City, NY, USA, Sep 08-11, 2025. To be published. URL
Abstract

Hyperparameter optimization (HPO) aims to identify an optimal hyperparameter configuration (HPC) such that the resulting model generalizes well to unseen data. Since directly optimizing the expected generalization error is impossible, resampling techniques like holdout validation or cross-validation are used as proxy measures in HPO. However, this implicitly assumes that the HPC minimizing validation error will also yield the best true generalization performance. Given that our inner validation error estimate is inherently stochastic and depends on the resampling, we study: Can excessive optimization of the validation error lead to a similarly detrimental effect as excessive optimization of the empirical risk of an ML model? This phenomenon, which we refer to as overtuning, represents a form of overfitting at the HPO level. Despite its potential impact, overtuning has received limited attention in the HPO and automated machine learning (AutoML) literature. We first formally define overtuning and distinguish it from related concepts such as meta-overfitting. We then reanalyze large-scale HPO benchmark data, assessing how frequently overtuning occurs and its practical relevance. Our findings suggest that overtuning is more common than expected, although often mild. However, in 10% of cases, severe overtuning results in selecting an HPC whose generalization performance is worse than the default HPC. We further examine how factors such as the chosen performance metric, resampling method, dataset size, learning algorithm, and optimization strategy influence overtuning and discuss potential mitigation strategies. Our results highlight the need to raise awareness of overtuning, particularly in the small-data regime, indicating that further mitigation strategies should be studied.
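The phenomenon can be simulated in a few lines; the following toy model (our own illustration, not the paper's benchmark protocol) treats the validation error as a noisy estimate of the true generalization error and tracks the incumbent over an HPO run:

```python
# Toy model of overtuning: with noisy validation estimates, later
# "improvements" in validation error can select a configuration whose
# *true* error is worse than that of an earlier incumbent.
import random

random.seed(42)
n_configs = 300
true_err = [random.uniform(0.10, 0.40) for _ in range(n_configs)]
val_err = [t + random.gauss(0, 0.05) for t in true_err]   # resampling noise

incumbent, best_val = None, float("inf")
incumbent_true = []            # true error of the incumbent after each step
for i in range(n_configs):
    if val_err[i] < best_val:  # HPO keeps the best validation error so far
        best_val, incumbent = val_err[i], i
    incumbent_true.append(true_err[incumbent])

# overtuning at the end of the run: how much the final incumbent's true error
# exceeds the best true error any incumbent achieved (>= 0 by construction)
overtuning = incumbent_true[-1] - min(incumbent_true)
```

Plotting `incumbent_true` over the run typically shows it decreasing at first and then creeping back up — validation keeps improving while generalization degrades.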

MCML Authors
Link to website

Lennart Schneider

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to Profile Matthias Feurer

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science


[2085]
Z. Li, D. Muhtar, F. Gu, X. Zhang, P. Xiao, G. He and X. Zhu.
LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation.
ISPRS Journal of Photogrammetry and Remote Sensing 227 (Sep. 2025). DOI GitHub
Abstract

Automatically and rapidly understanding Earth’s surface is fundamental to our grasp of the living environment and informed decision-making. This underscores the need for a unified system with comprehensive capabilities in analyzing Earth’s surface to address a wide range of human needs. The emergence of multimodal large language models (MLLMs) holds great potential for boosting the efficiency and convenience of intelligent Earth observation. These models can engage in human-like conversations, serve as unified platforms for understanding images, follow diverse instructions, and provide insightful feedback. In this study, we introduce LHRS-Bot-Nova, an MLLM specialized in understanding remote sensing (RS) images, designed to expertly perform a wide range of RS understanding tasks aligned with human instructions. LHRS-Bot-Nova features an enhanced vision encoder and a novel bridge layer, enabling efficient visual compression and better language-vision alignment. To further enhance RS-oriented vision-language alignment, we propose a large-scale RS image-caption dataset, generated through feature-guided image recaptioning. Additionally, we introduce an instruction dataset specifically designed to improve spatial recognition abilities. Extensive experiments demonstrate the superior performance of LHRS-Bot-Nova across various RS image understanding tasks. We also evaluate the performance of different MLLMs in complex RS perception and instruction following using a complicated multi-choice question evaluation benchmark, providing a reliable guide for future model selection and improvement.

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[2084]
S. Rauch, C. M. M. Frey, A. Maldonado and T. Seidl.
BEST: Bilaterally Expanding Subtrace Tree for Event Sequence Prediction.
BPM 2025 - 23rd International Conference on Business Process Management. Seville, Spain, Aug 31-Sep 05, 2025. To be published.
Abstract

In Predictive Process Monitoring, handling uncertainty regarding future case execution is the core building block for reliable predictive or prescriptive methods. In the last decade, deep learning methods have increasingly become the preferred approach when it comes to Next Activity Prediction and/or Remaining Trace Prediction. However, it remains an open question whether deep learning models finally surpass traditional data mining techniques for these tasks. In our paper, we contribute to answering this question by proposing a sequence prediction framework based on bilaterally expanding hierarchical subtraces that serves as an alternative to currently established deep learning techniques. We mine sequential patterns from activity traces and arrange them into a hierarchical subtrace tree by their structural relationship and inter-pattern distances. The tree structure can directly be leveraged for forecasting the most probable future activities given the trace history. We achieve competitive forecasting results for Remaining Trace Prediction, even surpassing state-of-the-art deep learning approaches on the majority of the analyzed real-world benchmark process event logs while only relying on the available control-flow information.

MCML Authors
Link to website

Simon Rauch

Database Systems and Data Mining

Christian Frey

Dr.

* Former Member

Link to website

Andrea Maldonado

Database Systems and Data Mining

Link to Profile Thomas Seidl

Thomas Seidl

Prof. Dr.

Database Systems and Data Mining


[2083]
J. Blake and M. Schubert.
Aerial Coverage Path Planning in Nuclear Emergencies: A Training and Evaluation Environment.
Demonstration Track @IJCAI 2025 - Demonstration Track at the 34th International Joint Conference on Artificial Intelligence (IJCAI 2025). Montreal, Canada, Aug 16-22, 2025. To be published.
Abstract

We formulate a Coverage Path Planning (CPP) problem for a helicopter or a UAV tasked with mapping ground-level radiation while avoiding radiation that is too strong. We introduce a simulation environment that incorporates digital elevation models, altitude-dependent measurement footprints and realistic flight constraints, as well as state-of-the-art radiation scenario simulations, such as nuclear explosions, provided by the German Federal Office for Radiation Protection. We highlight the complexity of radiological survey missions and demonstrate the necessity for new CPP approaches that address these unique challenges. The code to our simulation environment will be provided upon acceptance.

MCML Authors
Link to website

Johann Blake

Spatial Artificial Intelligence

Link to Profile Matthias Schubert

Matthias Schubert

Prof. Dr.

Spatial Artificial Intelligence


[2082]
T. Benoit, Y. Wang, M. Dannehl and J. Kinder.
BLens: Contrastive Captioning of Binary Functions using Ensemble Embedding.
USENIX 2025 - 34th USENIX Security Symposium. Seattle, WA, USA, Aug 13-15, 2025. To be published. Preprint available. PDF
Abstract

Function names can greatly aid human reverse engineers, which has spurred the development of machine learning-based approaches to predicting function names in stripped binaries. Much current work in this area now uses transformers, applying a metaphor of machine translation from code to function names. Still, function naming models face challenges in generalizing to projects unrelated to the training set. In this paper, we take a completely new approach by transferring advances in automated image captioning to the domain of binary reverse engineering, such that different parts of a binary function can be associated with parts of its name. We propose BLens, which combines multiple binary function embeddings into a new ensemble representation, aligns it with the name representation latent space via a contrastive learning approach, and generates function names with a transformer architecture tailored for function names. Our experiments demonstrate that BLens significantly outperforms the state of the art. In the usual setting of splitting per binary, we achieve an F1 score of 0.79 compared to 0.70. In the cross-project setting, which emphasizes generalizability, we achieve an F1 score of 0.46 compared to 0.29. Finally, in an experimental setting reducing shared components across projects, we achieve an F1 score of 0.32 compared to 0.19.
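Contrastive alignment of two embedding spaces, as used in this line of work, is typically driven by an InfoNCE-style objective: matching (function, name) pairs should score higher than mismatched ones in the shared latent space. A two-dimensional toy sketch (illustrative of the general technique, not the BLens implementation):

```python
# InfoNCE-style contrastive loss: for each function embedding, the matching
# name embedding should win a softmax over all name embeddings in the batch.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(fn_embs, name_embs, tau=0.1):
    """Average -log softmax probability of each function's true name."""
    loss = 0.0
    for i, f in enumerate(fn_embs):
        logits = [dot(f, n) / tau for n in name_embs]
        m = max(logits)                       # stabilized log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]             # -log p(correct name | function)
    return loss / len(fn_embs)

names      = [[1.0, 0.0], [0.0, 1.0]]
aligned    = [[1.0, 0.0], [0.0, 1.0]]         # function embs match their names
misaligned = [[0.0, 1.0], [1.0, 0.0]]         # function embs swapped

assert info_nce(aligned, names) < info_nce(misaligned, names)
```

Minimizing such a loss pulls each function embedding toward its name embedding and pushes it away from the others, which is what "aligning latent spaces via contrastive learning" amounts to.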

MCML Authors
Link to website

Yunru Wang

Programming Languages and Artificial Intelligence

Link to website

Moritz Dannehl

Programming Languages and Artificial Intelligence

Link to Profile Johannes Kinder

Johannes Kinder

Prof. Dr.

Programming Languages and Artificial Intelligence


[2081]
M. Windl, O. Akgul, N. Malkin and L. F. Cranor.
Privacy Solution or Menace? Investigating Perceptions of Radio-Frequency Sensing.
USENIX 2025 - 34th USENIX Security Symposium. Seattle, WA, USA, Aug 13-15, 2025. To be published. Preprint available. PDF
Abstract

Radio-frequency sensors are often introduced as privacy-preserving alternatives to cameras, as they enable similar use cases without relying on visual data. However, researchers argue that radio-frequency sensors cause privacy risks similar to cameras and even introduce additional risks. We conducted in-depth interviews (N = 14) and a large-scale vignette survey (N = 510) to understand people’s perceptions and privacy concerns around radio-frequency sensing. Most interviewees were initially unaware of the full capabilities of radio-frequency sensing but expressed nuanced concerns upon learning more. Our survey revealed that, while people expressed concerns, they mostly preferred radio-frequency sensors over cameras in private locations. However, they preferred cameras when considering radio-frequency sensing from a neighbor’s perspective and in security-relevant situations. Protective measures can reduce concerns, but the best protection depends on the context. Our findings can inform educational and legislative efforts to ensure a privacy-preserving future with radio-frequency technology.

MCML Authors
Link to website

Maximiliane Windl

Human-Centered Ubiquitous Media


[2080]
Y. Ma, J. Schweisthal, H. Zhang and S. Feuerriegel.
A Diffusion-Based Method for Learning the Multi-Outcome Distribution of Medical Treatments.
KDD 2025 - 31st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Toronto, ON, Canada, Aug 03-07, 2025. To be published. Preprint available. arXiv
Abstract

In medicine, treatments often influence multiple, interdependent outcomes, such as primary endpoints, complications, adverse events, or other secondary endpoints. Hence, to make optimal treatment decisions, clinicians are interested in learning the distribution of multi-dimensional treatment outcomes. However, the vast majority of machine learning methods for predicting treatment effects focus on single-outcome settings, despite the fact that medical data often include multiple, interdependent outcomes. To address this limitation, we propose a novel diffusion-based method called DIME to learn the joint distribution of multiple outcomes of medical treatments. It addresses three challenges relevant in medical practice: (i) it is tailored to learn the joint interventional distribution of multiple medical outcomes, which enables reliable decision-making with uncertainty quantification rather than relying solely on point estimates; (ii) it explicitly captures the dependence structure between outcomes; (iii) it can handle outcomes of mixed type, including binary, categorical, and continuous variables. In DIME, we take into account the fundamental problem of causal inference through causal masking. For training, our method decomposes the joint distribution into a series of conditional distributions with a customized conditional masking to account for the dependence structure across outcomes. For inference, our method auto-regressively generates predictions. This allows our method to move beyond point estimates of causal quantities and thus learn the joint interventional distribution. To the best of our knowledge, DIME is the first neural method tailored to learn the joint, multi-outcome distribution of medical treatments. Across various experiments, we demonstrate that our method effectively learns the joint distribution and captures shared information among multiple outcomes.

MCML Authors
Link to website

Yuchen Ma

Artificial Intelligence in Management

Link to website

Jonas Schweisthal

Artificial Intelligence in Management

Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[2079]
Z. Ding, Y. Li, Y. He, A. Norelli, J. Wu, V. Tresp, Y. Ma and M. Bronstein.
DyGMamba: Efficiently Modeling Long-Term Temporal Dependency on Continuous-Time Dynamic Graphs with State Space Models.
TGL @KDD 2025 - Temporal Graph Learning Workshop at the 31st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2025). Toronto, ON, Canada, Aug 03-07, 2025. To be published. Preprint available. arXiv
Abstract

Learning useful representations for continuous-time dynamic graphs (CTDGs) is challenging, due to the concurrent need to span long node interaction histories and grasp nuanced temporal details. In particular, two problems emerge: (1) Encoding longer histories requires more computational resources, making it crucial for CTDG models to maintain low computational complexity to ensure efficiency; (2) Meanwhile, more powerful models are needed to identify and select the most critical temporal information within the extended context provided by longer histories. To address these problems, we propose a CTDG representation learning model named DyGMamba, originating from the popular Mamba state space model (SSM). DyGMamba first leverages a node-level SSM to encode the sequence of historical node interactions. Another time-level SSM is then employed to exploit the temporal patterns hidden in the historical graph, where its output is used to dynamically select the critical information from the interaction history. We validate DyGMamba experimentally on the dynamic link prediction task. The results show that our model achieves state-of-the-art performance in most cases. DyGMamba also maintains high efficiency in terms of computational resources, making it possible to capture long temporal dependencies with a limited computation budget.

MCML Authors
Link to website

Zifeng Ding

Database Systems and Data Mining

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems and Data Mining

Link to website

Yunpu Ma

Dr.

Database Systems and Data Mining


[2078]
D. Strieder and M. Drton.
Identifying total causal effects in linear models under partial homoscedasticity.
International Journal of Approximate Reasoning 183.109455 (Aug. 2025). DOI
Abstract

A fundamental challenge of scientific research is inferring causal relations based on observed data. One commonly used approach involves utilizing structural causal models that postulate noisy functional relations among interacting variables. A directed graph naturally represents these models and reflects the underlying causal structure. However, classical identifiability results suggest that, without conducting additional experiments, this causal graph can only be identified up to a Markov equivalence class of indistinguishable models. Recent research has shown that focusing on linear relations with equal error variances can enable the identification of the causal structure from mere observational data. Nonetheless, practitioners are often primarily interested in the effects of specific interventions, rendering the complete identification of the causal structure unnecessary. In this work, we investigate the extent to which less restrictive assumptions of partial homoscedasticity are sufficient for identifying the causal effects of interest. Furthermore, we construct mathematically rigorous confidence regions for total causal effects under structure uncertainty and explore the performance gain of relying on stricter error assumptions in a simulation study.

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[2077]
A. Scagliotti and S. Farinelli.
Normalizing flows as approximations of optimal transport maps via linear-control neural ODEs.
Nonlinear Analysis 257.113811 (Aug. 2025). DOI
Abstract

In this paper, we consider the problem of recovering the W2-optimal transport map T between absolutely continuous measures as the flow of a linear-control neural ODE, where the control depends only on the time variable and takes values in a finite-dimensional space. We first show that, under suitable assumptions on the measures and on the controlled vector fields governing the neural ODE, the optimal transport map is contained in the closure of the flows generated by the system. Then, we tackle the problem under the assumption that only discrete approximations of the original measures are available: we formulate approximated optimal control problems, and we show that their solutions give flows that approximate the original optimal transport map T. In the framework of generative models, the approximating flow constructed here can be seen as a ‘Normalizing Flow’, which usually refers to the task of providing invertible transport maps between probability measures by means of deep neural networks. We propose an iterative numerical scheme based on the Pontryagin Maximum Principle for the resolution of the optimal control problem, resulting in a method for the practical computation of the approximated optimal transport map, and we test it on a two-dimensional example.

MCML Authors
Link to website

Alessandro Scagliotti

Applied Numerical Analysis


[2076]
S. Dirksen, W. Li and J. Maly.
Subspace estimation under coarse quantization.
SampTA 2025 - 15th International Conference on Sampling Theory and Applications. Vienna, Austria, Jul 28-Aug 01, 2025. To be published. Preprint available. URL
Abstract

We study subspace estimation from coarsely quantized data. In particular, we analyze two stochastic quantization schemes which use dithering: a one-bit quantizer combined with rectangular dither and a multi-bit quantizer with triangular dither. For each quantizer, we derive rigorous high probability bounds for the distances between the true and estimated signal subspaces. Using our analysis, we identify scenarios in which subspace estimation via triangular dithering qualitatively outperforms rectangular dithering. We verify in numerical simulations that our estimates are optimal in their dependence on the smallest non-zero eigenvalue of the target matrix.
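Rectangular dithering, one of the two schemes analyzed here, can be illustrated in one dimension. This is a sketch of the dithered quantizer only, not of the paper's subspace estimator: adding uniform dither before a one-bit quantizer makes it unbiased, which is the property that enables consistent estimation downstream. The range `lam` and signal value are illustrative.

```python
import random

# One-bit quantization with rectangular (uniform) dither, in 1-D:
#   q_i = lam * sign(s + tau_i),   tau_i ~ Uniform[-lam, lam].
# For |s| <= lam the dithered quantizer is unbiased, E[q_i] = s:
# P(s + tau >= 0) = (lam + s) / (2*lam), so E[lam * sign(s + tau)] = s.
random.seed(1)
lam = 2.0   # dither/quantizer range, assumed to cover the signal
s = 0.7     # signal value to quantize

def q1bit(s, lam):
    tau = random.uniform(-lam, lam)
    return lam if s + tau >= 0 else -lam

n = 200000
est = sum(q1bit(s, lam) for _ in range(n)) / n   # Monte-Carlo mean of q_i
```

Triangular dither (the sum of two independent uniforms) additionally whitens the quantization error, which is what drives the qualitative gap the authors identify.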

MCML Authors
Johannes Maly

Prof. Dr.

Mathematical Data Science and Artificial Intelligence


[2075]
F. Krahmer, F. Pagginelli Patricio and P. Catala.
On a Recovery Method with Approximation Guarantees for Noisy Unlimited Sampling.
SampTA 2025 - 15th International Conference on Sampling Theory and Applications. Vienna, Austria, Jul 28-Aug 01, 2025. To be published. Preprint available. URL
Abstract

The unlimited sampling problem of recovering a bandlimited signal from measurements that are affected by a modulo operation has recently been addressed in a number of works employing different approaches. Many of these methods, however, are not robust to Gaussian noise, as local outliers can affect the global solution quality. In this talk we propose and analyze a method to address this challenge by locally optimizing the choice of the function representation among the many equivalent modulo representatives – separately for each sub-interval in a given subdivision of the domain. Our analysis reveals that a successful recovery requires a careful balance between two types of potential limitations. On the one hand, the feasibility of our least-squares retrieval strategy requires the number of sub-intervals to be large enough, so that the input varies little inside each of them. On the other hand, we show that the conditioning of the resulting linear system matrix deteriorates for too many intervals. The study of this trade-off provides a first step towards the theoretical understanding of our proposed algorithm and a practical guidance for its implementation.

MCML Authors
Felix Krahmer

Prof. Dr.

Optimization & Data Analysis


[2074]
A. Bavaresco, R. Bernardi, L. Bertolazzi, D. Elliott, R. Fernández, A. Gatt, E. Ghaleb, M. Giulianelli, M. Hanna, A. Koller, A. F. T. Martins, P. Mondorf, V. Neplenbroek, S. Pezzelle, B. Plank, D. Schlangen, A. Suglia, A. K. Surikuchi, E. Takmaz and A. Testoni.
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.

MCML Authors
Philipp Mondorf

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[2073]
J. Bi, Y. Wang, H. Chen, X. Xiao, A. Hecker, V. Tresp and Y. Ma.
LLaVA Steering: Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced visual tasks by integrating visual representations into large language models (LLMs). The textual modality, inherited from LLMs, equips MLLMs with abilities like instruction following and in-context learning. In contrast, the visual modality enhances performance in downstream tasks by leveraging rich semantic content, spatial information, and grounding capabilities. These intrinsic modalities work synergistically across various visual tasks. Our research initially reveals a persistent imbalance between these modalities, with text often dominating output generation during visual instruction tuning. This imbalance occurs when using both full fine-tuning and parameter-efficient fine-tuning (PEFT) methods. We then found that re-balancing these modalities can significantly reduce the number of trainable parameters required, inspiring a direction for further optimizing visual instruction tuning. We introduce Modality Linear Representation-Steering (MoReS) to achieve the goal. MoReS effectively re-balances the intrinsic modalities throughout the model, where the key idea is to steer visual representations through linear transformations in the visual subspace across each model layer. To validate our solution, we composed LLaVA Steering, a suite of models integrated with the proposed MoReS method. Evaluation results show that the composed LLaVA Steering models require, on average, 500 times fewer trainable parameters than LoRA needs while still achieving comparable performance across three visual benchmarks and eight visual question-answering tasks. Last, we present the LLaVA Steering Factory, an in-house developed platform that enables researchers to quickly customize various MLLMs with component-based architecture for seamlessly integrating state-of-the-art models, and evaluate their intrinsic modality imbalance.

MCML Authors
Haokun Chen

Database Systems and Data Mining

Volker Tresp

Prof. Dr.

Database Systems and Data Mining

Yunpu Ma

Dr.

Database Systems and Data Mining


[2072]
F. Eichin, Y. J. Liu, B. Plank and M. A. Hedderich.
Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

Discourse understanding is essential for many NLP tasks, yet most existing work remains constrained by framework-dependent discourse representations. This work investigates whether large language models (LLMs) capture discourse knowledge that generalizes across languages and frameworks. We address this question along two dimensions: (1) developing a unified discourse relation label set to facilitate cross-lingual and cross-framework discourse analysis, and (2) probing LLMs to assess whether they encode generalizable discourse abstractions. Using multilingual discourse relation classification as a testbed, we examine a comprehensive set of 23 LLMs of varying sizes and multilingual capabilities. Our results show that LLMs, especially those with multilingual training corpora, can generalize discourse information across languages and frameworks. Further layer-wise analyses reveal that language generalization at the discourse level is most salient in the intermediate layers. Lastly, our error analysis provides an account of challenging relation classes.

MCML Authors
Florian Eichin

AI and Computational Linguistics

Yang Janet Liu

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics

Michael Hedderich

Dr.

AI and Computational Linguistics


[2071]
M. Fayyaz, A. Modarressi, H. Schütze and N. Peng.
Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid failures. In this work, by repurposing a relation extraction dataset (Re-DocRED), we design controlled experiments to quantify the impact of heuristic biases, such as favoring shorter documents, in retrievers like Dragon+ and Contriever. Our findings reveal significant vulnerabilities: retrievers often rely on superficial patterns like over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches. Additionally, they tend to overlook whether the document contains the query’s answer, lacking deep semantic understanding. Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document in less than 3% of cases over a biased document without the answer. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34% performance drop compared to not providing any documents at all.

MCML Authors
Ali Modarressi

Computational Linguistics

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[2070]
F. Friedrich, K. Hämmerl, P. Schramowski, M. Brack, J. Libovicky, K. Kersting and A. Fraser.
Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

Text-to-image generation models have recently achieved astonishing results in image quality, flexibility, and text alignment, and are consequently employed in a fast-growing number of applications. Through improvements in multilingual abilities, a larger community now has access to this technology. However, our results show that multilingual models suffer from significant gender biases just as monolingual models do. Furthermore, the natural expectation that multilingual models will provide similar results across languages does not hold up. Instead, there are important differences between languages. We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models. We use MAGBIG to investigate the effect of multilingualism on gender bias in T2I models. To this end, we construct multilingual prompts requesting portraits of people with a certain occupation or trait. Our results show that not only do models exhibit strong gender biases but they also behave differently across languages. Furthermore, we investigate prompt engineering strategies, such as indirect, neutral formulations, to mitigate these biases. Unfortunately, these approaches have limited success and result in worse text-to-image alignment. Consequently, we call for more research into diverse representations across languages in image generators, as well as into steerability to address biased model behavior.

MCML Authors
Katharina Hämmerl

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[2069]
M. A. Hedderich, A. Wang, R. Zhao, F. Eichin, J. Fischer and B. Plank.
What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

Prompt engineering for large language models is challenging, as even small prompt perturbations or model changes can significantly impact the generated output texts. Existing evaluation methods, either automated metrics or human evaluation, have limitations, such as providing limited insights or being labor-intensive. We propose Spotlight, a new approach that combines both automation and human analysis. Based on data mining techniques, we automatically distinguish between random (decoding) variations and systematic differences in language model outputs. This process provides token patterns that describe the systematic differences and guide the user in manually analyzing the effects of their prompt and model changes efficiently. We create three benchmarks to quantitatively test the reliability of token pattern extraction methods and demonstrate that our approach provides new insights into established prompt data. From a human-centric perspective, through demonstration studies and a user study, we show that our token pattern approach helps users understand the systematic differences of language model outputs, and we are able to discover relevant differences caused by prompt and model changes (e.g. related to gender or culture), thus supporting the prompt engineering process and human-centric model behavior research.

MCML Authors
Michael Hedderich

Dr.

AI and Computational Linguistics

Raoyuan Zhao

AI and Computational Linguistics

Florian Eichin

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[2068]
T. Liu, Z. Lai, G. Zhang, P. Torr, V. Demberg, V. Tresp and J. Gu.
Multimodal Pragmatic Jailbreak on Text-to-image Models.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

Diffusion models have recently achieved remarkable advancements in terms of image quality and fidelity to textual prompts. Concurrently, the safety of such generative models has become an area of growing concern. This work introduces a novel type of jailbreak, which triggers T2I models to generate images with visual text, where the image and the text, although considered to be safe in isolation, combine to form unsafe content. To systematically explore this phenomenon, we propose a dataset to evaluate the current diffusion-based text-to-image (T2I) models under such jailbreak. We benchmark nine representative T2I models, including two closed-source commercial models. Experimental results reveal a concerning tendency to produce unsafe content: all tested models suffer from this type of jailbreak, with rates of unsafe generation ranging from 8% to 74%. In real-world scenarios, various filters such as keyword blocklists, customized prompt filters, and NSFW image filters, are commonly employed to mitigate these risks. We evaluate the effectiveness of such filters against our jailbreak and find that, while current classifiers may be effective for single modality detection, they fail to work against our jailbreak. Our work provides a foundation for further development towards more secure and reliable T2I models.

MCML Authors
Tong Liu

Database Systems and Data Mining

Gengyuan Zhang

Database Systems and Data Mining

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[2067]
T. Liu, X. Yu, W. Zhou, J. Gu and V. Tresp.
FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

Efficient preference optimization algorithms such as Direct Preference Optimization (DPO) have become a popular approach in aligning large language models (LLMs) with human preferences. These algorithms implicitly treat the LLM as a reward model, and focus on training it to correct misranked preference pairs. However, recent work (Chen et al., 2024) empirically finds that DPO training rarely improves these misranked preference pairs, despite its gradient emphasizing these cases. We introduce FocalPO, a DPO variant that instead down-weighs misranked preference pairs and prioritizes enhancing the model’s understanding of pairs that it can already rank correctly. Inspired by Focal Loss used in vision tasks, FocalPO achieves this by adding a modulating factor to dynamically scale DPO loss. Our experiments demonstrate that FocalPO surpasses DPO and its variants on popular benchmarks like Alpaca Eval 2.0 using Mistral-Base-7B and Llama-3-Instruct-8B. Additionally, we empirically reveal how FocalPO affects training on correct and incorrect sample groups, further underscoring its effectiveness.
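The modulating-factor idea can be sketched in a few lines. The exact form of FocalPO's factor is given in the paper; the `p ** gamma` weighting below is only an assumption in the spirit of Focal Loss, and the margin values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(margin, beta=0.1):
    # Standard DPO loss: -log sigma(beta * margin), where `margin` is the
    # difference of policy/reference log-ratio rewards for (chosen, rejected).
    return -math.log(sigmoid(beta * margin))

def focal_dpo_loss(margin, beta=0.1, gamma=2.0):
    # Hypothetical focal-style modulation (the paper's exact factor may differ):
    # scale the DPO loss by p^gamma with p = sigma(beta * margin), so pairs the
    # model already ranks correctly (p > 0.5) keep most of their weight while
    # misranked pairs (p < 0.5) are down-weighted.
    p = sigmoid(beta * margin)
    return (p ** gamma) * dpo_loss(margin, beta)

# A correctly ranked pair retains much more of its loss weight than a
# misranked one under the focal-style factor.
correct = focal_dpo_loss(margin=20.0) / dpo_loss(margin=20.0)   # ~0.78
wrong = focal_dpo_loss(margin=-20.0) / dpo_loss(margin=-20.0)   # ~0.014
```

This inverts the emphasis of vision-style Focal Loss, which down-weights easy examples; here the easy (correctly ranked) pairs are the ones kept.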

MCML Authors
Tong Liu

Database Systems and Data Mining

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[2066]
Y. Liu, H. Ye, C. Ma, M. Wang and H. Schütze.
LangSAMP: Language-Script Aware Multilingual Pretraining.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv GitHub
Abstract

Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings – learnable vectors assigned to different languages. These embeddings are discarded for two main reasons: (1) mPLMs are expected to have a single, unified parameter set across all languages, and (2) they need to function seamlessly as universal text encoders without requiring language IDs as input. However, this removal increases the burden on token embeddings to encode all language-specific information, which may hinder the model’s ability to produce more language-neutral representations. To address this challenge, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning while maintaining a simple architecture. Specifically, we integrate these embeddings into the output of the transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline. Extensive analysis further shows that language/script embeddings encode language/script-specific information, which improves the selection of source languages for crosslingual transfer.

MCML Authors
Mingyang Wang

Computational Linguistics

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[2065]
B. Ma, Y. Li, W. Zhou, Z. Gong, Y. J. Liu, K. Jasinskaja, A. Friedrich, J. Hirschberg, F. Kreuter and B. Plank.
Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

Understanding pragmatics, the use of language in context, is crucial for developing NLP systems capable of interpreting nuanced language use. Despite recent advances in language technologies, including large language models, evaluating their ability to handle pragmatic phenomena such as implicatures and references remains challenging. To advance pragmatic abilities in models, it is essential to understand current evaluation trends and identify existing limitations. In this survey, we provide a comprehensive review of resources designed for evaluating pragmatic capabilities in NLP, categorizing datasets by the pragmatics phenomena they address. We analyze task designs, data collection methods, evaluation approaches, and their relevance to real-world applications. By examining these resources in the context of modern language models, we highlight emerging trends, challenges, and gaps in existing benchmarks. Our survey aims to clarify the landscape of pragmatic evaluation and guide the development of more comprehensive and targeted benchmarks, ultimately contributing to more nuanced and context-aware NLP models.

MCML Authors
Yang Janet Liu

AI and Computational Linguistics

Frauke Kreuter

Prof. Dr.

Social Data Science and AI

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[2064]
B. Ma, B. Yoztyurk, A.-C. Haensch, X. Wang, M. Herklotz, F. Kreuter, B. Plank and M. Aßenmacher.
Algorithmic Fidelity of Large Language Models in Generating Synthetic German Public Opinions: A Case Study.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

In recent research, large language models (LLMs) have been increasingly used to investigate public opinions. This study investigates the algorithmic fidelity of LLMs, i.e., the ability to replicate the socio-cultural context and nuanced opinions of human participants. Using open-ended survey data from the German Longitudinal Election Studies (GLES), we prompt different LLMs to generate synthetic public opinions reflective of German subpopulations by incorporating demographic features into the persona prompts. Our results show that Llama performs better than other LLMs at representing subpopulations, particularly when there is lower opinion diversity within those groups. Our findings further reveal that the LLM performs better for supporters of left-leaning parties like The Greens and The Left compared to other parties, and matches least with the right-wing party AfD. Additionally, the inclusion or exclusion of specific variables in the prompts can significantly impact the models’ predictions. These findings underscore the importance of aligning LLMs to more effectively model diverse public opinions while minimizing political biases and enhancing robustness in representativeness.

MCML Authors
Anna-Carolina Haensch

Dr.

Social Data Science and AI

Xinpeng Wang

AI and Computational Linguistics

Frauke Kreuter

Prof. Dr.

Social Data Science and AI

Barbara Plank

Prof. Dr.

AI and Computational Linguistics

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[2063]
P. Mondorf, S. Wold and B. Plank.
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions via subnetworks that can be composed to perform more complex tasks. Recent developments in mechanistic interpretability have made progress in identifying subnetworks, often referred to as circuits, which represent the minimal computational subgraph responsible for a model’s behavior on specific tasks. However, most studies focus on identifying circuits for individual tasks without investigating how functionally similar circuits relate to each other. To address this gap, we examine the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. Specifically, given a probabilistic context-free grammar, we identify and compare circuits responsible for ten modular string-edit operations. Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness. Moreover, we demonstrate that the circuits identified can be reused and combined through subnetwork set operations to represent more complex functional capabilities of the model.

MCML Authors
Philipp Mondorf

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[2062]
E. Nie, B. Shao, Z. Ding, M. Wang, H. Schmid and H. Schütze.
BMIKE-53: Investigating Cross-Lingual Knowledge Editing with In-Context Learning.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv GitHub
Abstract

Large language models (LLMs) possess extensive parametric knowledge, but this knowledge is difficult to update with new information because retraining is very expensive and infeasible for closed-source models. Knowledge editing (KE) has emerged as a viable solution for updating the knowledge of LLMs without compromising their overall performance. On-the-fly KE methods, inspired by in-context learning (ICL), have shown great promise and allow LLMs to be treated as black boxes. In the past, KE was primarily employed in English contexts, whereas the potential for cross-lingual KE in current English-centric LLMs has not been fully explored. To foster more research in this direction, we introduce the BMIKE-53 benchmark for evaluating cross-lingual KE on 53 diverse languages across three KE task types. We also propose a gradient-free KE method called Multilingual In-context Knowledge Editing (MIKE) and evaluate it on BMIKE-53. Our evaluation focuses on cross-lingual knowledge transfer in terms of reliability, generality, locality, and portability, offering valuable insights and a framework for future research in cross-lingual KE.

MCML Authors
Zifeng Ding

Database Systems and Data Mining

Mingyang Wang

Computational Linguistics

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[2061]
R. Pei, Y. Liu, P. Lin, F. Yvon and H. Schütze.
Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT, as it can readily take advantage of linguistic resources such as grammar books and dictionaries. Such resources are usually selectively integrated into the prompt so that LLMs can directly perform translation without any specific training, via their in-context learning capability (ICL). However, the relative importance of each type of resource, e.g., dictionary, grammar book, and retrieved parallel examples, is not entirely clear. To address this gap, this study systematically investigates how each resource and its quality affects the translation performance, with the Manchu language as our case study. To remove any prior knowledge of Manchu encoded in the LLM parameters and single out the effect of ICL, we also experiment with an encrypted version of Manchu texts. Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help. In a follow-up study, we showcase a promising application of in-context MT: parallel data augmentation as a way to bootstrap the conventional MT model. When monolingual data abound, generating synthetic parallel data through in-context MT offers a pathway to mitigate data scarcity and build effective and efficient low-resource neural MT systems.

MCML Authors
Hinrich Schütze

Prof. Dr.

Computational Linguistics


[2060]
M. Wang, H. Adel, L. Lange, Y. Liu, E. Nie, J. Strötgen and H. Schütze.
Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

Multilingual language models (MLMs) store factual knowledge across languages but often struggle to provide consistent responses to semantically equivalent prompts in different languages. While previous studies point out this cross-lingual inconsistency issue, the underlying causes remain unexplored. In this work, we use mechanistic interpretability methods to investigate cross-lingual inconsistencies in MLMs. We find that MLMs encode knowledge in a language-independent concept space through most layers, and only transition to language-specific spaces in the final layers. Failures during the language transition often result in incorrect predictions in the target language, even when the answers are correct in other languages. To mitigate this inconsistency issue, we propose a linear shortcut method that bypasses computations in the final layers, enhancing both prediction accuracy and cross-lingual consistency. Our findings shed light on the internal mechanisms of MLMs and provide a lightweight, effective strategy for producing more consistent factual outputs.

MCML Authors
Mingyang Wang

Computational Linguistics

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[2059]
B. Chen, S. Peng, A. Korhonen and B. Plank.
A Rose by Any Other Name: LLM-Generated Explanations Are Good Proxies for Human Explanations to Collect Label Distributions on NLI.
ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

Disagreement in human labeling is ubiquitous, and can be captured in human judgment distributions (HJDs). Recent research has shown that explanations provide valuable information for understanding human label variation (HLV) and that large language models (LLMs) can approximate HJDs from a few human-provided label-explanation pairs. However, collecting explanations for every label is still time-consuming. This paper examines whether LLMs can be used to replace humans in generating explanations for approximating HJDs. Specifically, we use LLMs as annotators to generate model explanations for a few given human labels. We test ways to obtain and combine these label-explanations with the goal of approximating the human judgment distribution. We further compare the resulting human- and model-generated explanations, and test automatic and human explanation selection. Our experiments show that LLM explanations are promising for NLI: to estimate HJD, generated explanations yield comparable results to humans’ when provided with human labels. Importantly, our results generalize from datasets with human explanations to i) datasets where they are not available and ii) challenging out-of-distribution test sets.

MCML Authors
Beiduo Chen

AI and Computational Linguistics

Siyao Peng

Dr.

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[2058]
L. Edman, H. Schmid and A. Fraser.
EXECUTE: A Multilingual Benchmark for LLM Token Understanding.
ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

The CUTE benchmark showed that LLMs struggle with character understanding in English. We extend it to more languages with diverse scripts and writing systems, introducing EXECUTE. Our simplified framework allows easy expansion to any language. Tests across multiple LLMs reveal that challenges in other languages are not always on the character level as in English. Some languages show word-level processing issues, some show no issues at all. We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs’ understanding of character components.

MCML Authors
Link to website

Lukas Edman

Dr.

Data Analytics & Statistics

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[2057]
A. D. Hakimi, A. Modarressi, P. Wicke and H. Schütze.
Time Course MechInterp: Analyzing the Evolution of Components and Knowledge in Large Language Models.
ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

Understanding how large language models (LLMs) acquire and store factual knowledge is crucial for enhancing their interpretability and reliability. In this work, we analyze the evolution of factual knowledge representation in the OLMo-7B model by tracking the roles of its attention heads and feed forward networks (FFNs) over the course of pre-training. We classify these components into four roles: general, entity, relation-answer, and fact-answer specific, and examine their stability and transitions. Our results show that LLMs initially depend on broad, general-purpose components, which later specialize as training progresses. Once the model reliably predicts answers, some components are repurposed, suggesting an adaptive learning process. Notably, attention heads display the highest turnover. We also present evidence that FFNs remain more stable throughout training. Furthermore, our probing experiments reveal that location-based relations converge to high accuracy earlier in training than name-based relations, highlighting how task complexity shapes acquisition dynamics. These insights offer a mechanistic view of knowledge formation in LLMs.

MCML Authors
Link to website

Ahmad Dawar Hakimi

Computational Linguistics

Link to website

Ali Modarressi

Computational Linguistics

Link to website

Philipp Wicke

Dr.

Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[2056]
L. He, E. Nie, H. Schmid, H. Schütze, N. Mesgarani and J. Brennan.
Large Language Models as Neurolinguistic Subjects: Discrepancy in Performance and Competence for Form and Meaning.
ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

This study investigates the linguistic understanding of Large Language Models (LLMs) regarding signifier (form) and signified (meaning) by distinguishing two LLM evaluation paradigms: psycholinguistic and neurolinguistic. Traditional psycholinguistic evaluations often reflect statistical biases that may misrepresent LLMs’ true linguistic capabilities. We introduce a neurolinguistic approach, utilizing a novel method that combines minimal pair and diagnostic probing to analyze activation patterns across model layers. This method allows for a detailed examination of how LLMs represent form and meaning, and whether these representations are consistent across languages. Our contributions are three-fold: (1) We compare neurolinguistic and psycholinguistic methods, revealing distinct patterns in LLM assessment; (2) We demonstrate that LLMs exhibit higher competence in form compared to meaning, with the latter largely correlated to the former; (3) We present new conceptual minimal pair datasets for Chinese (COMPS-ZH) and German (COMPS-DE), complementing existing English datasets.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[2055]
A. H. Kargaran, Y. Liu, F. Yvon and H. Schütze.
How Programming Concepts and Neurons Are Shared in Code Language Models.
ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv GitHub
Abstract

Several studies have explored the mechanisms of large language models (LLMs) in coding tasks, but most have focused on programming languages (PLs) in a monolingual setting. In this paper, we investigate the relationship between multiple PLs and English in the concept space of LLMs. We perform a few-shot translation task on 21 PL pairs using two Llama-based models. By decoding the embeddings of intermediate layers during this task, we observe that the concept space is closer to English (including PL keywords) and assigns high probabilities to English tokens in the second half of the intermediate layers. We analyze neuron activations for 11 PLs and English, finding that while language-specific neurons are primarily concentrated in the bottom layers, those exclusive to each PL tend to appear in the top layers. For PLs that are highly aligned with multiple other PLs, identifying language-specific neurons is not feasible. These PLs also tend to have a larger keyword set than other PLs and are closer to the model’s concept space regardless of the input/output PL in the translation task. Our findings provide insights into how LLMs internally represent PLs, revealing structural patterns in the model’s concept space.

MCML Authors
Link to website

Amir Hossein Kargaran

Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[2054]
A. H. Kargaran, A. Modarressi, N. Nikeghbal, J. Diesner, F. Yvon and H. Schütze.
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment.
ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

English-centric large language models (LLMs) often show strong multilingual capabilities. However, the multilingual performance of these models remains unclear and is not thoroughly evaluated for many languages. Most benchmarks for multilinguality focus on classic NLP tasks, or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages the fact that English-centric LLMs use English as a kind of pivot language in their intermediate layers. It computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in other languages. We conduct studies using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves a statistically significant average Pearson correlation of 0.90 with three established downstream tasks across nine models and two parallel datasets. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs.
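The alignment computation described above can be illustrated with a small sketch. This is not the paper's code: it uses a simple retrieval-style accuracy (fraction of parallel sentences whose true translation is the cosine-nearest candidate) as one plausible instantiation of an alignment score, and all data here is synthetic.

```python
import numpy as np

def alignment_score(eng_emb, other_emb):
    """Fraction of parallel sentences whose true translation is the
    cosine-nearest neighbour among all candidates (a retrieval-style
    alignment measure; the paper's exact scoring may differ)."""
    e = eng_emb / np.linalg.norm(eng_emb, axis=1, keepdims=True)
    o = other_emb / np.linalg.norm(other_emb, axis=1, keepdims=True)
    sim = e @ o.T  # pairwise cosine similarities, shape (N, N)
    return float(np.mean(sim.argmax(axis=1) == np.arange(len(sim))))

rng = np.random.default_rng(1)
eng = rng.normal(size=(50, 64))                  # "English" sentence embeddings
aligned = eng + 0.1 * rng.normal(size=(50, 64))  # a well-aligned "language"
unrelated = rng.normal(size=(50, 64))            # an unrelated "language"

print(alignment_score(eng, aligned))    # high: translations are retrievable
print(alignment_score(eng, unrelated))  # near chance level (1/50)
```

A high score for a language suggests its representations are well aligned with the English pivot space, which MEXA uses as a proxy for downstream multilingual performance.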

MCML Authors
Link to website

Amir Hossein Kargaran

Computational Linguistics

Link to website

Ali Modarressi

Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[2053]
S. Urchs, V. Thurner, M. Aßenmacher, C. Heumann and S. Thiemichen.
taz2024full: Analysing German Newspapers for Gender Bias and Discrimination across Decades.
ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available.
Abstract

Open-access corpora are essential for advancing natural language processing (NLP) and computational social science (CSS). However, large-scale resources for German remain limited, restricting research on linguistic trends and societal issues such as gender bias. We present taz2024full, the largest publicly available corpus of German newspaper articles to date, comprising over 1.8 million texts from taz, spanning 1980 to 2024. As a demonstration of the corpus’s utility for bias and discrimination research, we analyse gender representation across four decades of reporting. We find a consistent overrepresentation of men, but also a gradual shift toward more balanced coverage in recent years. Using a scalable, structured analysis pipeline, we provide a foundation for studying actor mentions, sentiment, and linguistic framing in German journalistic texts. The corpus supports a wide range of applications, from diachronic language analysis to critical media studies, and is freely available to foster inclusive and reproducible research in German-language NLP.

MCML Authors
Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[2052]
I. Bueno, A. Bavaresco, J. M. Cunha and P. Wicke.
Analogy Prompting: Testing Spatial Intuitions of Humans and Multimodal Models in Analogies.
Analogy-Angle II @ACL 2025 - 2nd Workshop on Analogical Abstraction in Cognition, Perception, and Language at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. URL
Abstract

Language and Vision-Language Models exhibit impressive language capabilities akin to human reasoning. However, unlike humans who acquire language through embodied, interactive experiences, these models learn from static datasets without real-world interaction. This difference raises questions about how they conceptualize abstract notions and whether their reasoning aligns with human cognition. We investigate spatial conceptualizations of LLMs and VLMs by conducting analogy prompting studies with LLMs, VLMs, and human participants. We assess their ability to generate and interpret analogies for spatial concepts. We quantitatively compare the analogies produced by each group, examining the impact of multimodal inputs and reasoning mechanisms. Our findings indicate that generative models can produce and interpret analogies but differ significantly from human reasoning in their abstraction of spatial concepts - variability influenced by input modality, model size, and prompting methods, with analogy-based prompts not consistently enhancing alignment. Contributions include a methodology for probing generative models through analogies; a comparative analysis of analogical reasoning among models, and humans; and insights into the effect of multimodal inputs on reasoning.

MCML Authors
Link to website

Philipp Wicke

Dr.

Computational Linguistics


[2051]
A. Säuberli, D. Frassinelli and B. Plank.
Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?
BEA @ACL 2025 - 20th Workshop on Innovative Use of NLP for Building Educational Applications at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

Knowing how test takers answer items in educational assessments is essential for test development, to evaluate item quality, and to improve test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness or psychometric plausibility of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics which are commonly used in educational assessment, classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can be more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans in reading comprehension items compared to other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
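The calibration step mentioned above, temperature scaling, simply divides the answer-option logits by a scalar T before the softmax; T > 1 flattens an overconfident distribution. A minimal sketch with hypothetical logit values (not the paper's setup):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over answer-option logits; T > 1 flattens the distribution,
    making an overconfident model's response distribution less peaked."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()            # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 1.0, 0.5, 0.0]  # one multiple-choice item, four options
print(softmax_with_temperature(logits, T=1.0))  # sharply peaked
print(softmax_with_temperature(logits, T=3.0))  # flatter, less confident
```

Fitting T against human response distributions is one way to make a model's option probabilities more human-like without retraining.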

MCML Authors
Link to website

Andreas Säuberli

AI and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[2050]
E. Garces Arias, H. Blocher, J. Rodemann, M. Li, C. Heumann and M. Aßenmacher.
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework.
GEM2 @ACL 2025 - 4th Workshop on Generation, Evaluation and Metrics at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

Open-ended text generation has become a prominent task in natural language processing due to the rise of powerful (large) language models. However, evaluating the quality of these models and the employed decoding strategies remains challenging because of trade-offs among widely used metrics such as coherence, diversity, and perplexity. Decoding methods often excel in some metrics while underperforming in others, complicating the establishment of a clear ranking. In this paper, we present novel ranking strategies within this multicriteria framework. Specifically, we employ benchmarking approaches based on partial orderings and present a new summary metric designed to balance existing automatic indicators, providing a more holistic evaluation of text generation quality. Furthermore, we discuss the alignment of these approaches with human judgments. Our experiments demonstrate that the proposed methods offer a robust way to compare decoding strategies, exhibit similarities with human preferences, and serve as valuable tools in guiding model selection for open-ended text generation tasks. Finally, we suggest future directions for improving evaluation methodologies in text generation. Our codebase, datasets, and models are publicly available.

MCML Authors
Link to website

Esteban Garces Arias

Statistical Learning and Data Science

Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[2049]
O. Kononykhina, A.-C. Haensch and F. Kreuter.
How Much Can Stratification Improve the Approximation of Shapley Values?
GeBNLP @ACL 2025 - 6th Workshop on Gender Bias in Natural Language Processing at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Vienna, Austria, Jul 27-Aug 01, 2025. To be published.
Abstract

Large Language Models (LLMs) offer promising alternatives to traditional occupational coding approaches in survey research. Using a German dataset, we examine the extent to which LLM-based occupational coding differs by gender. Our findings reveal systematic disparities: gendered job titles (e.g., “Autor” vs. “Autorin”, meaning “male author” vs. “female author”) frequently result in diverging occupation codes, even when semantically identical. Across all models, 54%–82% of gendered inputs obtain different Top-5 suggestions. The practical impact, however, depends on the model. GPT includes the correct code most often (62%) but demonstrates female bias (up to +18 pp). IBM is less accurate (51%) but largely balanced. Alibaba, Gemini, and MiniLM achieve about 50% correct-code inclusion, and their small (< 10 pp) and direction-flipping gaps could indicate sampling noise rather than gender bias. We discuss these findings in the context of fairness and reproducibility in NLP applications for social data.

MCML Authors
Link to website

Olga Kononykhina

Social Data Science and AI

Link to website

Anna-Carolina Haensch

Dr.

Social Data Science and AI

Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI


[2048]
T. Lindenbauer, G. Groh and H. Schütze.
From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents.
REALM @ACL 2025 - 1st Workshop for Research on Agent Language Models at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv
Abstract

We introduce CTIM-Rover, an AI agent for Software Engineering (SE) built on top of AutoCodeRover (Zhang et al., 2024) that extends agentic reasoning frameworks with an episodic memory, more specifically, a general and repository-level Cross-Task-Instance Memory (CTIM). While existing open-source SE agents mostly rely on ReAct (Yao et al., 2023b), Reflexion (Shinn et al., 2023), or Code-Act (Wang et al., 2024), all of these reasoning and planning frameworks inefficiently discard their long-term memory after a single task instance. As repository-level understanding is pivotal for identifying all locations requiring a patch for fixing a bug, we hypothesize that SE is particularly well positioned to benefit from CTIM. For this, we build on the Experiential Learning (EL) approach ExpeL (Zhao et al., 2024), proposing a Mixture-Of-Experts (MoEs) inspired approach to create both a general-purpose and repository-level CTIM. We find that CTIM-Rover does not outperform AutoCodeRover in any configuration and thus conclude that neither ExpeL nor DoT-Bank (Lingam et al., 2024) scale to real-world SE problems. Our analysis indicates noise introduced by distracting CTIM items or exemplar trajectories as the likely source of the performance degradation.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[2047]
Q. Feng, Y. Liu and H. Schütze.
Your Pretrained Model Tells the Difficulty Itself: A Self-Adaptive Curriculum Learning Paradigm for Natural Language Understanding.
SRW @ACL 2025 - Student Research Workshop at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. URL
Abstract

Curriculum learning is a widely adopted training strategy in natural language processing (NLP), where models are exposed to examples organized by increasing difficulty to enhance learning efficiency and performance. However, most existing approaches rely on manually defined difficulty metrics – such as text length – which may not accurately reflect the model’s own perspective. To overcome this limitation, we present a self-adaptive curriculum learning paradigm that prioritizes fine-tuning examples based on difficulty scores predicted by pre-trained language models (PLMs) themselves. Building on these scores, we explore various training strategies that differ in the ordering of examples for the fine-tuning: from easy-to-hard, hard-to-easy, to mixed sampling. We evaluate our method on four natural language understanding (NLU) datasets covering both binary and multi-class classification tasks. Experimental results show that our approach leads to faster convergence and improved performance compared to standard random sampling.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[2046]
M. Koshil, M. Feurer and K. Eggensperger.
In-Context Learning of Soft Nearest Neighbor Classifiers for Intelligible Tabular Machine Learning.
TRL @ACL 2025 - 4th Table Representation Learning Workshop at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. URL
Abstract

With in-context learning foundation models like TabPFN excelling on small supervised tabular learning tasks, it has been argued that ‘boosted trees are not the best default choice when working with data in tables’. However, such foundation models are inherently black-box models that do not provide interpretable predictions. We introduce a novel learning task to train ICL models to act as a nearest neighbor algorithm, which enables intelligible inference and does not decrease performance empirically.

MCML Authors
Link to Profile Matthias Feurer

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science


[2045]
J. Hanselle, A. Javanmardi, T. Oberkofler, Y. Sale and E. Hüllermeier.
Conformal Prediction without Nonconformity Scores.
UAI 2025 - 41st Conference on Uncertainty in Artificial Intelligence. Rio de Janeiro, Brazil, Jul 21-25, 2025. To be published.
Abstract

Conformal prediction (CP) is an uncertainty quantification framework that allows for constructing statistically valid prediction sets. Key to the construction of these sets is the notion of a nonconformity function, which assigns a real-valued score to individual data points: only those (hypothetical) data points contribute to a prediction set that sufficiently conform to the data. The point of departure of this work is the observation that CP predictions are invariant against (strictly) monotone transformations of a nonconformity function. In other words, it is only the ordering of the scores that matters, not their quantitative values. Consequently, instead of scoring individual data points, a conformal predictor only needs to be able to compare pairs of data points, deciding which of them is the more conforming one. This suggests an interesting connection between CP and preference learning, in particular learning-to-rank methods, and makes CP amenable to training data in the form of (qualitative) preferences. Elaborating on this connection, we propose methods for learning (latent) nonconformity functions from data of that kind and show their usefulness in real-world classification tasks.
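The monotone-invariance observation is easy to check concretely. The sketch below is not the authors' code: it builds standard split-conformal prediction sets from synthetic calibration scores and verifies that applying a strictly increasing transform (here, exp) to all scores leaves the resulting sets unchanged.

```python
import numpy as np

def conformal_prediction_set(cal_scores, test_scores, alpha=0.1):
    """Split conformal: include a candidate whenever its nonconformity score
    does not exceed the finite-sample-corrected (1 - alpha) quantile of the
    calibration scores."""
    n = len(cal_scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n
    # method="higher" picks an actual order statistic, so the threshold
    # transforms along with the scores under any monotone map
    q = np.quantile(cal_scores, level, method="higher")
    return test_scores <= q

rng = np.random.default_rng(0)
cal = rng.normal(size=100)   # calibration nonconformity scores
test = rng.normal(size=20)   # scores of candidate test labels

base = conformal_prediction_set(cal, test)
# strictly increasing transform of *all* scores: same ordering, same sets
transformed = conformal_prediction_set(np.exp(cal), np.exp(test))

assert np.array_equal(base, transformed)  # identical prediction sets
```

Since only the ordering of scores enters the construction, a model that can merely rank pairs of points by conformity suffices, which is the bridge to learning-to-rank exploited in the paper.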

MCML Authors
Link to website

Jonas Hanselle

Artificial Intelligence and Machine Learning

Link to website

Alireza Javanmardi

Artificial Intelligence and Machine Learning

Link to website

Tobias Oberkofler

Artificial Intelligence and Machine Learning

Link to website

Yusuf Sale

Artificial Intelligence and Machine Learning

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[2044]
M. Drton, M. Garrote-López, N. Nikov, E. Robeva and Y. S. Wang.
Causal Discovery for Linear Non-Gaussian Models with Disjoint Cycles.
UAI 2025 - 41st Conference on Uncertainty in Artificial Intelligence. Rio de Janeiro, Brazil, Jul 21-25, 2025. To be published. Preprint available. URL GitHub
Abstract

The paradigm of linear structural equation modeling readily allows one to incorporate causal feedback loops in the model specification. These appear as directed cycles in the common graphical representation of the models. However, the presence of cycles entails difficulties such as the fact that models need no longer be characterized by conditional independence relations. As a result, learning cyclic causal structures remains a challenging problem. In this paper, we offer new insights on this problem in the context of linear non-Gaussian models. First, we precisely characterize when two directed graphs determine the same linear non-Gaussian model. Next, we take up a setting of cycle-disjoint graphs, for which we are able to show that simple quadratic and cubic polynomial relations among low-order moments of a non-Gaussian distribution allow one to locate source cycles. Complementing this with a strategy of decorrelating cycles and multivariate regression allows one to infer a block-topological order among the directed cycles, which leads to a consistent and computationally efficient algorithm for learning causal structures with disjoint cycles.

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[2043]
O. Kononykhina and M. Schierholz.
Can Large Language Models Advance Occupational Coding? Evidence and Methodological Insights.
ESRA 2025 - 11th Conference of the European Survey Research Association. Utrecht, The Netherlands, Jul 14-18, 2025. To be published.
Abstract

Occupational coding is a critical funnel between open-ended job descriptions and the statistical frameworks that shape employment research and policies. Automatic coding tools—whether rule-based or machine learning (ML)—have streamlined the process, and demonstrate promising results. Yet, ML approaches typically require extensive, high-quality training data that exceed what a typical national survey can provide and fall under data protection constraints. This study asks whether mainstream large language models (LLMs) can serve as a viable alternative, largely bypassing the need for exhaustive training data and requiring only some coding skills and API access. We created embeddings for standardized German (KldB) job descriptions, then used respondents’ own words (e.g., “doctor”) from a representative German survey to generate job embeddings. Cosine similarity was applied to find the five most likely occupational codes for each response. To assess performance, we compared LLM-based suggestions with those from a German ML occupational coding tool (OccuCoDe), using professional manual coding as our benchmark. Results show that in 55% of the cases, both LLM and OccuCoDe included the correct code among their top five suggestions. However, there was limited overlap: in 60% of the cases, the two tools shared at most two out of their five recommended codes. While OccuCoDe more frequently placed the correct code as the first suggestion, LLM embeddings suggested the correct occupation in 45% of cases where OccuCoDe did not provide any result. Additionally, LLM performance was sensitive to minor changes in job descriptions (e.g., capitalisation or gendered job titles) and sometimes showed “embedding drift,” raising reproducibility concerns. Our findings highlight LLMs’ promise as a complement or substitute to other tools for occupational coding in limited-training-data contexts, while underscoring critical limitations that must be addressed before fully entrusting them with classifying the work we do.
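The embedding-plus-cosine-similarity pipeline described above can be sketched as follows. This is a minimal illustration with synthetic vectors: the function and code names are hypothetical, and the study's actual setup with KldB descriptions and LLM embedding APIs is more involved.

```python
import numpy as np

def top5_codes(response_emb, code_embs, code_ids):
    """Rank occupation codes by cosine similarity between the respondent's
    free-text embedding and each code-description embedding; return top 5."""
    r = response_emb / np.linalg.norm(response_emb)
    c = code_embs / np.linalg.norm(code_embs, axis=1, keepdims=True)
    order = np.argsort(c @ r)[::-1][:5]  # descending by cosine similarity
    return [code_ids[i] for i in order]

rng = np.random.default_rng(2)
code_embs = rng.normal(size=(20, 32))            # 20 hypothetical code descriptions
code_ids = [f"code_{i:02d}" for i in range(20)]
# a response embedded very close to code 7's description
response = code_embs[7] + 0.05 * rng.normal(size=32)

print(top5_codes(response, code_embs, code_ids))  # "code_07" ranked first
```

The sensitivity issues the abstract reports (capitalisation, gendered titles, embedding drift) arise because any perturbation of `response_emb` can reshuffle this ranking.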

MCML Authors
Link to website

Olga Kononykhina

Social Data Science and AI

Link to website

Malte Schierholz

Dr.

Social Data Science and AI


[2042]
F. Kiwitt, B. Tahmasebi and S. Jegelka.
Symmetries in Weight Space Learning: To Retain or Remove?
HiLD @ICML 2025 - Workshop on High-dimensional Learning Dynamics at the 42nd International Conference on Machine Learning (ICML 2025). Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. URL
Abstract

Weight space learning, an emerging paradigm that seeks to understand neural networks through their space of parameters (weights), has shown promise in a variety of applications, including but not limited to predicting model behavior and addressing privacy concerns. However, weight spaces often exhibit inherent symmetries that impact both theory and practice, such as the scale and rotational invariances found in the Low-Rank Adaptation (LoRA) method, which is the state-of-the-art fine-tuning algorithm for Large Language Models (LLMs). In this work, we investigate a general weight space learning problem under symmetries, focusing on a fundamental question: What is the appropriate formulation for this problem in the presence of symmetries (such as those in LoRA), and should redundant representations that encode the same end-to-end function be removed? We address this question by fully characterizing a new space of symmetric weights, demonstrating that the relevance of redundancy depends on the function being predicted. Specifically, we show that end-to-end symmetries (such as those in LoRA) should not always be removed, as doing so may compromise the universality of the weight space learning problem. To our knowledge, this is the first time this phenomenon has been formally identified and presented, yielding insights into a broad class of weight space learning problems.

MCML Authors
Link to Profile Stefanie Jegelka

Stefanie Jegelka

Prof. Dr.

Foundations of Deep Neural Networks


[2041]
J. von Berg, A. Fono, M. Datres, S. Maskey and G. Kutyniok.
The Price of Robustness: Stable Classifiers Need Overparameterization.
HiLD @ICML 2025 - Workshop on High-dimensional Learning Dynamics at the 42nd International Conference on Machine Learning (ICML 2025). Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. URL
Abstract

In this work, we show that class stability, the expected distance of an input to the decision boundary, captures what classical capacity measures, such as weight norms, fail to explain. We prove a generalization bound that improves inversely with the class stability, interpreted as a quantifiable notion of robustness. As a corollary, we derive a law of robustness for classification: any interpolating model with too few parameters must be unstable, so high stability requires significant overparameterization. Crucially, our results extend beyond smoothness assumptions and apply to discontinuous classifiers. Preliminary experiments support our theory: empirical stability increases with model size, while norm-based measures remain uninformative.

MCML Authors
Link to website

Jonas von Berg

Mathematical Foundations of Artificial Intelligence

Link to website

Adalbert Fono

Mathematical Foundations of Artificial Intelligence

Link to website

Sohir Maskey

Mathematical Foundations of Artificial Intelligence

Link to Profile Gitta Kutyniok

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence


[2040]
F. Kreuter.
Adaptive Alignment: Designing AI for a Changing World - Frauke Kreuter.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. Invited Talk. URL
Abstract

As artificial intelligence systems become deeply embedded in our institutions, economies, and personal lives, the challenge of alignment—ensuring AI acts in accordance with human values and societal norms—has become both urgent and complex. But what exactly should these systems be aligned to—and how do we know we’re getting it right? To address this, we turn to a long-standing body of work: how societies have historically measured public preferences and moral norms—and what often goes wrong in the process. The talk will introduce underutilized datasets—from decades of survey archives to international value studies—that could serve as empirical benchmarks for aligning AI systems with lived human norms. In addition to highlighting valuable data sources, we will examine how lessons from social science can inform the design of human feedback loops in AI. These insights help avoid common pitfalls in capturing human intentions and preferences—such as measurement error, framing effects, and unrepresentative sampling—that have plagued opinion research for decades. We’ll close by addressing the fluid and evolving nature of societal norms, emphasizing the need for alignment strategies that are adaptive to cultural and temporal change. Achieving this kind of adaptability requires not just better data, but durable collaborations between social scientists and machine learning researchers—so that updates to human values can be continuously reflected in system design. The goal is to provoke a deeper, interdisciplinary conversation about what it truly means to align AI with human values—and how to do so responsibly, reliably, and at scale.

MCML Authors
Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI


[2039]
S. Müller, A. Reuter, N. Hollmann, D. Rügamer and F. Hutter.
Position: The Future of Bayesian Prediction Is Prior-Fitted.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. Preprint. arXiv
Abstract

Training neural networks on randomly generated artificial datasets yields Bayesian models that capture the prior defined by the dataset-generating distribution. Prior-data Fitted Networks (PFNs) are a class of methods designed to leverage this insight. In an era of rapidly increasing computational resources for pre-training and a near stagnation in the generation of new real-world data in many applications, PFNs are poised to play a more important role across a wide range of applications. They enable the efficient allocation of pre-training compute to low-data scenarios. Originally applied to small Bayesian modeling tasks, the field of PFNs has significantly expanded to address more complex domains and larger datasets. This position paper argues that PFNs and other amortized inference approaches represent the future of Bayesian inference, leveraging amortized learning to tackle data-scarce problems. We thus believe they are a fruitful area of research. In this position paper, we explore their potential and directions to address their current limitations.

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[2038]
L. Thede, K. Roth, M. Bethge, Z. Akata and T. Hartvigsen.
WikiBigEdit: Understanding the Limits of Lifelong Knowledge Editing in LLMs.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. Preprint. arXiv
Abstract

Keeping large language models factually up-to-date is crucial for deployment, yet costly retraining remains a challenge. Knowledge editing offers a promising alternative, but methods are only tested on small-scale or synthetic edit benchmarks. In this work, we aim to bridge research into lifelong knowledge editing to real-world edits at practically relevant scale. We first introduce WikiBigEdit, a large-scale benchmark of real-world Wikidata edits, built to automatically extend lifelong for future-proof benchmarking. In its first instance, it includes over 500K question-answer pairs for knowledge editing alongside a comprehensive evaluation pipeline. Finally, we use WikiBigEdit to study existing knowledge editing techniques’ ability to incorporate large volumes of real-world facts and contrast their capabilities to generic modification techniques such as retrieval augmentation and continual finetuning to acquire a complete picture of the practical extent of current lifelong knowledge editing.

MCML Authors
Link to website

Karsten Roth

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[2037]
U. Fischer Abaigar, C. Kern and J. Perdomo.
The Value of Prediction in Identifying the Worst-Off.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. Spotlight Presentation. Outstanding Paper Award. To be published. Preprint available. arXiv
Abstract

Machine learning is increasingly used in government programs to identify and support the most vulnerable individuals, prioritizing assistance for those at greatest risk over optimizing aggregate outcomes. This paper examines the welfare impacts of prediction in equity-driven contexts, and how they compare to other policy levers, such as expanding bureaucratic capacity. Through mathematical models and a real-world case study on long-term unemployment amongst German residents, we develop a comprehensive understanding of the relative effectiveness of prediction in surfacing the worst-off. Our findings provide clear analytical frameworks and practical, data-driven tools that empower policymakers to make principled decisions when designing these systems.

MCML Authors
Link to website

Unai Fischer Abaigar

Social Data Science and AI Lab

Link to Profile Christoph Kern

Christoph Kern

Prof. Dr.

Social Data Science and AI Lab


[2036]
W. Durani, T. Nitzl, C. Plant and C. Böhm.
Weakly Supervised Anomaly Detection via Dual-Tailed Kernel.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. URL
Abstract

Detecting anomalies with limited supervision is challenging due to the scarcity of labeled anomalies, which often fail to capture the diversity of abnormal behaviors. We propose Weakly Supervised Anomaly Detection via Dual-Tailed Kernel (WSAD-DT), a novel framework that learns robust latent representations to distinctly separate anomalies from normal samples under weak supervision. WSAD-DT introduces two centroids—one for normal samples and one for anomalies—and leverages a dual-tailed kernel scheme: a light-tailed kernel to compactly model in-class points and a heavy-tailed kernel to maintain a wider margin against out-of-class instances. To preserve intra-class diversity, WSAD-DT incorporates kernel-based regularization, encouraging richer representations within each class. Furthermore, we devise an ensemble strategy that partitions unlabeled data into diverse subsets, while sharing the limited labeled anomalies among these partitions to maximize their impact. Empirically, WSAD-DT achieves state-of-the-art performance on several challenging anomaly detection benchmarks, outperforming leading ensemble-based methods such as XGBOD.
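The dual-tailed idea can be sketched with two textbook kernels. The NumPy snippet below is an illustration of ours, not the paper's exact formulation (kernel choices, parameters, and the training objective differ): a light-tailed Gaussian kernel pulls in-class points tightly around their centroid, while a heavy-tailed Cauchy-style kernel decays slowly and thus keeps exerting a pull on far-away out-of-class points, supporting a wide margin.

```python
import numpy as np

def light_tail(x, c, gamma=1.0):
    """Gaussian (light-tailed) kernel: similarity to centroid c collapses
    quickly with distance -- compact modelling of in-class points."""
    return float(np.exp(-gamma * np.sum((x - c) ** 2)))

def heavy_tail(x, c, gamma=1.0):
    """Cauchy-style (heavy-tailed) kernel: similarity decays slowly, so even
    distant points feel a non-negligible pull -- useful for a wide margin."""
    return float(1.0 / (1.0 + gamma * np.sum((x - c) ** 2)))

centroid = np.zeros(2)
far_point = np.array([3.0, 0.0])
# the heavy tail assigns the far point far more similarity than the light tail
print(light_tail(far_point, centroid), heavy_tail(far_point, centroid))
```

At squared distance 9, the Gaussian kernel has essentially vanished (exp(-9) ≈ 1.2e-4) while the Cauchy kernel still returns 0.1, which is the asymmetry the two centroids exploit.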

MCML Authors
Link to website

Walid Durani

Database Systems and Data Mining


[2035]
X. Feng, Z. Jiang, T. Kaufmann, E. Hüllermeier, P. Weng and Y. Zhu.
Comparing Comparisons: Informative and Easy Human Feedback with Distinguishability Queries.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. URL
Abstract

Learning human objectives from preference feedback has significantly advanced reinforcement learning (RL) in domains with hard-to-formalize objectives. However, traditional methods based on pairwise trajectory comparisons face notable challenges, including the difficulty in comparing trajectories with subtle differences and the limitation of conveying only ordinal information, limiting direct inference of preference strength. In this paper, we introduce a novel distinguishability query, allowing humans to express preference strength by comparing two pairs of trajectories. Labelers first indicate which pair is easier to compare, then provide preference feedback only on the easier pair. Our proposed query type directly captures preference strength and is expected to reduce the cognitive load on the labeler. We further connect this query to cardinal utility and difference relations and develop an efficient query selection scheme to achieve a better trade-off between query informativeness and easiness. Experimental results demonstrate the potential of our method for faster, data-efficient learning and improved user-friendliness in RLHF benchmarks, particularly in classical control settings where preference strength is critical for expected utility maximization.

MCML Authors
Link to website

Timo Kaufmann

Artificial Intelligence and Machine Learning

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[2034]
R. Sharma, S. Mukherjee, A. Šipka, E. Hüllermeier, S. Vollmer, S. Redyuk and D. A. Selby.
X-Hacking: The Threat of Misguided AutoML.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. URL
Abstract

Explainable AI (XAI) and interpretable machine learning methods help to build trust in model predictions and derived insights, yet also present a perverse incentive for analysts to manipulate XAI metrics to support pre-specified conclusions. This paper introduces the concept of X-hacking, a form of p-hacking applied to XAI metrics such as SHAP values. We show how easily an automated machine learning pipeline can be adapted to exploit model multiplicity at scale: searching a set of ‘defensible’ models with similar predictive performance to find a desired explanation. We formulate the trade-off between explanation and accuracy as a multi-objective optimisation problem, and illustrate empirically on familiar real-world datasets that, on average, Bayesian optimisation accelerates X-hacking 3-fold for features susceptible to it, versus random sampling. We show that the vulnerability of a dataset to X-hacking can be determined by information redundancy among features. Finally, we suggest possible methods for detection and prevention, and discuss ethical implications for the credibility and reproducibility of XAI.
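The exploit itself needs very little machinery. The pure-NumPy sketch below is a toy stand-in for the paper's setup (ridge penalties play the role of an AutoML search, and raw coefficients the role of SHAP values; both are simplifications of ours), but it reproduces the mechanism: among near-equally accurate models fit on redundant features, cherry-pick the one that credits a chosen feature the most.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)       # redundant with x1
y = x1 + 0.05 * rng.normal(size=n)       # only x1 truly matters
X = np.column_stack([x1, x2])

def fit_ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# 'AutoML search': sweep a hyperparameter, keep every fitted model
candidates = [(np.mean((X @ w - y) ** 2), w)
              for lam in np.logspace(-3, 2, 50)
              for w in [fit_ridge(X, y, lam)]]

best_mse = min(m for m, _ in candidates)
# 'defensible' set: models within 5% of the best training error
defensible = [w for m, w in candidates if m <= 1.05 * best_mse]
# X-hack: among defensible models, pick the one crediting x2 the most
picked = max(defensible, key=lambda w: w[1])
print(len(defensible), picked)
```

Because x1 and x2 are nearly collinear, many models fit almost equally well while distributing the attribution between them differently, which is exactly the information-redundancy vulnerability the abstract points to.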

MCML Authors
Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[2033]
P. Fatemi, E. Sharifian and M. H. Yassaee.
A New Approach to Backtracking Counterfactual Explanations: A Unified Causal Framework for Efficient Model Interpretability.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. arXiv
Abstract

Counterfactual explanations enhance interpretability by identifying alternative inputs that produce different outputs, offering localized insights into model decisions. However, traditional methods often neglect causal relationships, leading to unrealistic examples. While newer approaches integrate causality, they are computationally expensive. To address these challenges, we propose an efficient method called BRACE based on backtracking counterfactuals that incorporates causal reasoning to generate actionable explanations. We first examine the limitations of existing methods and then introduce our novel approach and its features. We also explore the relationship between our method and previous techniques, demonstrating that it generalizes them in specific scenarios. Finally, experiments show that our method provides deeper insights into model outputs.

MCML Authors
Link to website

Pouria Fatemi

Resource Aware Machine Learning


[2032]
S. Karnik, A. Veselovska, M. Iwen and F. Krahmer.
Implicit Regularization for Tubal Tensor Factorizations via Gradient Descent.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. arXiv
Abstract

We provide a rigorous analysis of implicit regularization in an overparametrized tensor factorization problem beyond the lazy training regime. For matrix factorization problems, this phenomenon has been studied in a number of works. A particular challenge has been to design universal initialization strategies which provably lead to implicit regularization in gradient-descent methods. At the same time, it has been argued by Cohen et al. 2016 that more general classes of neural networks can be captured by considering tensor factorizations. However, in the tensor case, implicit regularization has only been rigorously established for gradient flow or in the lazy training regime. In this paper, we prove the first tensor result of its kind for gradient descent rather than gradient flow. We focus on the tubal tensor product and the associated notion of low tubal rank, encouraged by the relevance of this model for image data. We establish that gradient descent in an overparametrized tensor factorization model with a small random initialization exhibits an implicit bias towards solutions of low tubal rank. Our theoretical findings are illustrated in an extensive set of numerical simulations showcasing the dynamics predicted by our theory as well as the crucial role of using a small random initialization.

MCML Authors
Link to website

Anna Veselovska

Dr.

Applied Numerical Analysis

Link to Profile Felix Krahmer

Felix Krahmer

Prof. Dr.

Optimization & Data Analysis


[2031]
W. Lai, A. Fraser and I. Titov.
Joint Localization and Activation Editing for Low-Resource Fine-Tuning.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. arXiv
Abstract

Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, are commonly used to adapt LLMs. However, the effectiveness of standard PEFT methods is limited in low-resource scenarios with only a few hundred examples. Recent advances in interpretability research have inspired the emergence of activation editing techniques, which modify the activations of specific model components. These methods, due to their extremely small parameter counts, show promise for small datasets. However, their performance is highly dependent on identifying the correct modules to edit and often lacks stability across different datasets. In this paper, we propose Joint Localization and Activation Editing (JoLA), a method that jointly learns (1) which heads in the Transformer to edit, (2) whether the intervention should be additive, multiplicative, or both, and (3) the intervention parameters themselves – the vectors applied as additive offsets or multiplicative scalings to the head output. Through evaluations on three benchmarks spanning commonsense reasoning, natural language understanding, and natural language generation, we demonstrate that JoLA consistently outperforms existing methods.
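The intervention JoLA learns can be pictured as a per-head gate combining additive and multiplicative edits. The NumPy sketch below uses our own shape conventions and a hard 0/1 gate for readability; the actual method learns the gates and intervention vectors jointly during fine-tuning.

```python
import numpy as np

def edit_heads(H, gate, mult, add):
    """Apply a gated intervention to attention-head outputs.
    H:    (n_heads, d) head outputs
    gate: (n_heads, 1) in [0, 1]; which heads get edited (localization)
    mult: (n_heads, d) multiplicative scaling
    add:  (n_heads, d) additive offset
    """
    edited = mult * H + add
    return gate * edited + (1.0 - gate) * H   # gate=0 leaves a head untouched

H = np.ones((2, 3))
gate = np.array([[0.0], [1.0]])               # edit only the second head
out = edit_heads(H, gate, mult=2.0 * np.ones((2, 3)), add=np.ones((2, 3)))
```

With these illustrative values the first head passes through unchanged while the second becomes 2·1 + 1 = 3, showing how localization (the gate) and the edit itself are separate learnable pieces.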

MCML Authors
Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[2030]
A. Modarressi, H. Deilamsalehy, F. Dernoncourt, T. Bui, R. A. Rossi, S. Yoon and H. Schütze.
NoLiMa: Long-Context Evaluation Beyond Literal Matching.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. arXiv URL
Abstract

Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a ‘needle’ (relevant information) from a ‘haystack’ (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.

MCML Authors
Link to website

Ali Modarressi

Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[2029]
D. A. Nguyen, E. Araya, A. Fono and G. Kutyniok.
Time to Spike? Understanding the Representational Power of Spiking Neural Networks in Discrete Time.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. arXiv URL
Abstract

Recent years have seen significant progress in developing spiking neural networks (SNNs) as a potential solution to the energy challenges posed by conventional artificial neural networks (ANNs). However, our theoretical understanding of SNNs remains relatively limited compared to the ever-growing body of literature on ANNs. In this paper, we study a discrete-time model of SNNs based on leaky integrate-and-fire (LIF) neurons, referred to as discrete-time LIF-SNNs, a widely used framework that still lacks solid theoretical foundations. We demonstrate that discrete-time LIF-SNNs with static inputs and outputs realize piecewise constant functions defined on polyhedral regions, and more importantly, we quantify the network size required to approximate continuous functions. Moreover, we investigate the impact of latency (number of time steps) and depth (number of layers) on the complexity of the input space partitioning induced by discrete-time LIF-SNNs. Our analysis highlights the importance of latency and contrasts these networks with ANNs employing piecewise linear activation functions. Finally, we present numerical experiments to support our theoretical findings.
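For readers unfamiliar with the model class, a discrete-time LIF neuron is just a leaky accumulator with a threshold. The sketch below uses illustrative parameters (leak beta, threshold theta, subtractive reset); the paper's exact formulation of discrete-time LIF-SNNs may differ in reset rule and layer structure.

```python
import numpy as np

def lif_neuron(inputs, beta=0.9, theta=1.0):
    """Discrete-time leaky integrate-and-fire neuron.
    inputs: (T,) input current per time step; returns the 0/1 spike train."""
    v, spikes = 0.0, []
    for i in inputs:
        v = beta * v + i              # leaky integration of input current
        s = int(v >= theta)           # fire when the threshold is crossed
        spikes.append(s)
        if s:
            v -= theta                # subtractive ('soft') reset
    return spikes

print(lif_neuron(np.array([0.5, 0.5, 0.5, 0.0])))   # fires once, on step 3
```

The threshold comparison is what makes the input-output map piecewise constant: sweeping the input continuously changes nothing until a spike flips, which is the partitioning of input space into polyhedral regions that the paper quantifies.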

MCML Authors
Link to website

Adalbert Fono

Mathematical Foundations of Artificial Intelligence

Link to Profile Gitta Kutyniok

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence


[2028]
T. Pielok, B. Bischl and D. Rügamer.
Revisiting Unbiased Implicit Variational Inference.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. arXiv URL
Abstract

Recent years have witnessed growing interest in semi-implicit variational inference (SIVI) methods due to their ability to rapidly generate samples from highly complicated distributions. However, since the likelihood of these samples is non-trivial to estimate in high dimensions, current research focuses on finding effective SIVI training routines. While unbiased implicit variational inference (UIVI) has largely been dismissed as imprecise and computationally prohibitive because of its inner MCMC loop, we revisit this method and identify key shortcomings. In particular, we show that UIVI’s MCMC loop can be effectively replaced via importance sampling and the optimal proposal distribution can be learned stably by minimizing an expected forward Kullback–Leibler divergence without bias. Our refined approach demonstrates superior performance or parity with state-of-the-art methods on established SIVI benchmarks.

MCML Authors
Link to website

Tobias Pielok

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[2027]
A. Reuter, T. G. J. Rudner, V. Fortuin and D. Rügamer.
Can Transformers Learn Full Bayesian Inference in Context?
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. arXiv URL
Abstract

Transformers have emerged as the dominant architecture in the field of deep learning, with a broad range of applications and remarkable in-context learning (ICL) capabilities. While not yet fully understood, ICL has already proved to be an intriguing phenomenon, allowing transformers to learn in context – without requiring further training. In this paper, we further advance the understanding of ICL by demonstrating that transformers can perform full Bayesian inference for commonly used statistical models in context. More specifically, we introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows which enables us to infer complex posterior distributions for methods such as generalized linear models and latent factor models. Extensive experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods not operating in context.

MCML Authors
Link to Profile Vincent Fortuin

Vincent Fortuin

Dr.

Bayesian Deep Learning

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[2026]
R. Schulte, D. Rügamer and T. Nagler.
Adjustment for Confounding using Pre-Trained Representations.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. URL
Abstract

There is growing interest in extending average treatment effect (ATE) estimation to incorporate non-tabular data, such as images and text, which may act as sources of confounding. Neglecting these effects risks biased results and flawed scientific conclusions. However, incorporating non-tabular data necessitates sophisticated feature extractors, often in combination with ideas of transfer learning. In this work, we investigate how latent features from pre-trained neural networks can be leveraged to adjust for sources of confounding. We formalize conditions under which these latent features enable valid adjustment and statistical inference in ATE estimation, demonstrating results using the example of double machine learning. In this context, we also discuss critical challenges inherent to latent feature learning and to downstream parameter estimation based on such features. As our results are agnostic to the considered data modality, they represent an important first step towards a theoretical foundation for the usage of latent representation from foundation models in ATE estimation.

MCML Authors
Link to website

Rickmer Schulte

Statistics, Data Science and Machine Learning

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Link to Profile Thomas Nagler

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science


[2025]
J. Schweisthal, D. Frauen, M. Schröder, K. Heß, N. Kilbertus and S. Feuerriegel.
Learning Representations of Instruments for Partial Identification of Treatment Effects.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. arXiv
Abstract

Reliable estimation of treatment effects from observational data is important in many disciplines such as medicine. However, estimation is challenging when unconfoundedness as a standard assumption in the causal inference literature is violated. In this work, we leverage arbitrary (potentially high-dimensional) instruments to estimate bounds on the conditional average treatment effect (CATE). Our contributions are three-fold: (1) We propose a novel approach for partial identification through a mapping of instruments to a discrete representation space so that we yield valid bounds on the CATE. This is crucial for reliable decision-making in real-world applications. (2) We derive a two-step procedure that learns tight bounds using a tailored neural partitioning of the latent instrument space. As a result, we avoid instability issues due to numerical approximations or adversarial training. Furthermore, our procedure aims to reduce the estimation variance in finite-sample settings to yield more reliable estimates. (3) We show theoretically that our procedure obtains valid bounds while reducing estimation variance. We further perform extensive experiments to demonstrate the effectiveness across various settings. Overall, our procedure offers a novel path for practitioners to make use of potentially high-dimensional instruments (e.g., as in Mendelian randomization).

MCML Authors
Link to website

Jonas Schweisthal

Artificial Intelligence in Management

Link to website

Dennis Frauen

Artificial Intelligence in Management

Link to website

Maresa Schröder

Artificial Intelligence in Management

Link to website

Konstantin Heß

Artificial Intelligence in Management

Link to Profile Niki Kilbertus

Niki Kilbertus

Prof. Dr.

Ethics in Systems Design and Machine Learning

Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[2024]
A. Soleymani, B. Tahmasebi, S. Jegelka and P. Jaillet.
Learning with Exact Invariances in Polynomial Time.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. arXiv
Abstract

We study the statistical-computational trade-offs for learning with exact invariances (or symmetries) using kernel regression. Traditional methods, such as data augmentation, group averaging, canonicalization, and frame-averaging, either fail to provide a polynomial-time solution or are not applicable in the kernel setting. However, with oracle access to the geometric properties of the input space, we propose a polynomial-time algorithm that learns a classifier with exact invariances. Moreover, our approach achieves the same excess population risk (or generalization error) as the original kernel regression problem. To the best of our knowledge, this is the first polynomial-time algorithm to achieve exact (not approximate) invariances in this context. Our proof leverages tools from differential geometry, spectral theory, and optimization. A key result in our development is a new reformulation of the problem of learning under invariances as optimizing an infinite number of linearly constrained convex quadratic programs, which may be of independent interest.

MCML Authors
Link to Profile Stefanie Jegelka

Stefanie Jegelka

Prof. Dr.

Foundations of Deep Neural Networks


[2023]
A. Uselis, A. Dittadi and S. J. Oh.
Does Data Scaling Lead to Visual Compositional Generalization?
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. URL GitHub
Abstract

Compositional understanding is crucial for human intelligence, yet it remains unclear whether contemporary vision models exhibit it. The dominant machine learning paradigm is built on the premise that scaling data and model sizes will improve out-of-distribution performance, including compositional generalization. We test this premise through controlled experiments that systematically vary data scale, concept diversity, and combination coverage. We find that compositional generalization is driven by data diversity, not mere data scale. Increased combinatorial coverage forces models to discover a linearly factored representational structure, where concepts decompose into additive components. We prove this structure is key to efficiency, enabling perfect generalization from few observed combinations. Evaluating pretrained models (DINO, CLIP), we find above-random yet imperfect performance, suggesting partial presence of this structure. Our work motivates stronger emphasis on constructing diverse datasets for compositional generalization, and considering the importance of representational structure that enables efficient compositional learning.

MCML Authors
Link to website

Andrea Dittadi

Dr.

Algorithmic Machine Learning & Explainable AI


[2022]
J. Zausinger, L. Pennig, A. Kozina, S. Sdahl, J. Sikora, A. Dendorfer, T. Kuznetsov, M. Hagog, N. Wiedemann, K. Chlodny, V. Limbach, A. Ketteler, T. Prein, V. M. Singh, M. M. Danziger and J. Born.
Regress, Don't Guess – A Regression-like Loss on Number Tokens for Language Models.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. URL GitHub
Abstract

While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving quantitative reasoning, especially arithmetic. One fundamental limitation is the nature of the Cross Entropy loss, which assumes a nominal scale and thus cannot convey proximity between generated number tokens. In response, we here present a regression-like loss that operates purely on token level. Our proposed Number Token Loss (NTL) comes in two flavors and minimizes either the norm or the Wasserstein distance between the numerical values of the real and predicted number tokens. NTL can easily be added to any language model and extend the Cross Entropy objective during training without runtime overhead. We evaluate the proposed scheme on various mathematical datasets and find that it consistently improves performance in math-related tasks. In a direct comparison on a regression task, we find that NTL can match the performance of a regression head, despite operating on token level. Finally, we scale NTL up to 3B parameter models and observe improved performance, demonstrating its potential for seamless integration into LLMs. We hope that this work can inspire LLM developers to improve their pretraining objectives.
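The two flavors are easy to state concretely. The NumPy sketch below is our reading of the idea, not the authors' code (token values and the point-mass target are illustrative): the predicted distribution over number tokens is compared to the true value either through its expectation or through a one-dimensional Wasserstein distance.

```python
import numpy as np

def ntl_mse(probs, token_values, y):
    """Squared error between the probability-weighted mean of the
    number-token values and the true numeric value y."""
    return (float(np.dot(probs, token_values)) - y) ** 2

def ntl_wasserstein(probs, token_values, y):
    """Wasserstein-1 distance between the predicted distribution over
    number-token values and a point mass at the true value y."""
    order = np.argsort(token_values)
    vals = np.asarray(token_values, dtype=float)[order]
    cdf_pred = np.cumsum(np.asarray(probs)[order])
    cdf_true = (vals >= y).astype(float)       # CDF of a point mass at y
    return float(np.sum(np.abs(cdf_pred - cdf_true)[:-1] * np.diff(vals)))

vals = np.arange(10)                           # digit tokens 0..9
p = np.zeros(10); p[3], p[4] = 0.6, 0.4        # mass near the true digit 3
print(ntl_mse(p, vals, 3), ntl_wasserstein(p, vals, 3))
```

Unlike Cross Entropy, both quantities shrink as probability mass moves toward tokens whose values are numerically close to the target, which is the proximity signal the loss is designed to add.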

MCML Authors
Link to website

Lars Pennig

Ethics in Systems Design and Machine Learning


[2021]
L. Xu, M. Sarkar, A. I. Lonappan, Í. Zubeldia, P. Villanueva-Domingo, S. Casas, C. Fidler, C. Amancharla, U. Tiwari, A. Bayer, C. A. Ekioui, M. Cranmer, A. Dimitrov, J. Fergusson, K. Gandhi, S. Krippendorf, A. Laverick, J. Lesgourgues, A. Lewis, T. Meier, B. Sherwin, K. Surrao, F. Villaescusa-Navarro, C. Wang, X. Xu and B. Bolliet.
Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery.
ML4Astro @ICML 2025 - Machine Learning for Astrophysics at the 42nd International Conference on Machine Learning (ICML 2025). Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. arXiv
Abstract

We present a multi-agent system for automation of scientific research tasks, cmbagent. The system is formed by about 30 Large Language Model (LLM) agents and implements a Planning & Control strategy to orchestrate the agentic workflow, with no human-in-the-loop at any point. Each agent specializes in a different task (performing retrieval on scientific papers and codebases, writing code, interpreting results, critiquing the output of other agents) and the system is able to execute code locally. We successfully apply cmbagent to carry out a PhD level cosmology task (the measurement of cosmological parameters using supernova data) and evaluate its performance on two benchmark sets, finding superior performance over state-of-the-art LLMs. The source code is available on GitHub, demonstration videos are also available, and the system is deployed on HuggingFace and will be available on the cloud.

MCML Authors
Link to website

Thomas Meier

Dr.


[2020]
P. Spohn, L. Girrbach, J. Bader and Z. Akata.
Align-then-Unlearn: Embedding Alignment for LLM Unlearning.
MUGen @ICML 2025 - Workshop on Machine Unlearning for Generative AI at the 42nd International Conference on Machine Learning (ICML 2025). Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. URL

MCML Authors
Link to website

Leander Girrbach

Interpretable and Reliable Machine Learning

Link to website

Jessica Bader

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[2019]
Z. Li, X. Han, Y. Li, N. Strauß and M. Schubert.
DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions.
WM @ICML 2025 - Workshop on Building Physically Plausible World Models at the 42nd International Conference on Machine Learning (ICML 2025). Vancouver, Canada, Jul 13-19, 2025. To be published.
Abstract

Diffusion-based world models have demonstrated strong capabilities in synthesizing realistic long-horizon trajectories for offline reinforcement learning (RL). However, many existing methods do not directly generate actions alongside states and rewards, limiting their compatibility with standard value-based offline RL algorithms that rely on one-step temporal difference (TD) learning. While prior work has explored joint modeling of states, rewards, and actions to address this issue, such formulations often lead to increased training complexity and reduced performance in practice. Therefore, in this paper, we propose a diffusion-based world model that generates state-reward trajectories conditioned on the current state, action, and return-to-go value, and efficiently infers missing actions via an inverse dynamics model (IDM). This modular design produces complete synthetic transitions suitable for one-step TD-based offline RL, enabling effective and computationally efficient training. Empirically, we show that conservative offline RL algorithms such as TD3BC and IQL benefit significantly from training on these augmented trajectories, consistently outperforming prior diffusion-based baselines across multiple tasks in the D4RL benchmark.

MCML Authors
Link to website

Zongyue Li

Spatial Artificial Intelligence

Link to website

Niklas Strauß

Dr.

Spatial Artificial Intelligence

Link to Profile Matthias Schubert

Matthias Schubert

Prof. Dr.

Spatial Artificial Intelligence


[2018]
C. Pellegrini, E. Özsoy, B. Busam, B. Wiestler, N. Navab and M. Keicher.
RaDialog: Large Vision-Language Models for X-Ray Reporting and Dialog-Driven Assistance.
MIDL 2025 - Medical Imaging with Deep Learning. Salt Lake City, UT, USA, Jul 09-11, 2025. URL GitHub
Abstract

Conversational AI tools for generating and discussing accurate radiology reports could transform radiology by enabling collaborative, human-in-the-loop diagnostic processes, saving time and enhancing report quality. While, to this end, Large Vision-Language Models hold promise, current methods lack clinical correctness or are single-task models without conversational abilities. We propose a novel architecture and dataset to address these limitations. First, we propose a secondary image branch, explicitly focusing on structured clinical findings, improving the clinical correctness score by 13.3%. Second, we propose a catastrophic forgetting mitigation strategy and instruct dataset with variable dialog-based tasks, to enable our model to handle a multitude of different queries. RaDialog marks a foundational step toward clinical dialog systems, outperforming existing medical LVLMs by 15.0% in clinical correctness in report generation, 23.4% in interactive report correction, and is preferred by radiologists in 84.0% of cases over a comparative method.

MCML Authors
Link to website

Chantal Pellegrini

Computer Aided Medical Procedures & Augmented Reality

Link to website

Ege Özsoy

Computer Aided Medical Procedures & Augmented Reality

Link to website

Benjamin Busam

Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Benedikt Wiestler

Benedikt Wiestler

Prof. Dr.

AI for Image-Guided Diagnosis and Therapy

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to website

Matthias Keicher

Dr.

Computer Aided Medical Procedures & Augmented Reality


[2017]
L. A. Heidrich, A. Rastogi, P. Upadhya, G. Brugnara, M. Foltyn-Dumitru, B. Wiestler and P. Vollmuth.
Curriculum Learning for Language-guided, Multi-modal Detection of Various Pathologies.
MIDL 2025 - Medical Imaging with Deep Learning. Salt Lake City, UT, USA, Jul 09-11, 2025. To be published. Preprint available. URL
Abstract

Pathology detection in medical imaging is crucial for radiologists, yet current approaches that train specialized models for each region of interest often lack efficiency and robustness. Furthermore, the scarcity of annotated medical data, particularly for diverse phenotypes, poses significant challenges in achieving generalizability. To address these challenges, we present a novel language-guided object detection pipeline for medical imaging that leverages curriculum learning strategies, chosen for their ability to progressively train models on increasingly complex samples, thereby improving generalization across pathologies, phenotypes, and modalities. We developed a unified pipeline to convert segmentation datasets into bounding box annotations, and applied two curriculum learning approaches - teacher curriculum and bounding box size curriculum - to train a Grounding DINO model. Our method was evaluated on different tumor types in MRI and CT scans and showed significant improvements in detection accuracy. The teacher and bounding box size curriculum learning approaches yielded a 4.9% AP and 5.2% AP increase over baseline, respectively. The results highlight the potential of curriculum learning to optimize medical image analysis and clinical workflow by providing a versatile and efficient detection algorithm.

MCML Authors
Link to Profile Benedikt Wiestler

Benedikt Wiestler

Prof. Dr.

AI for Image-Guided Diagnosis and Therapy


[2016]
V. M. Singh, A. G. V. Asiares, L. S. Schuhmacher, K. Rendall, S. Weißbrod, D. Rügamer and I. Körte.
An Interpretable Representation Learning Approach for Diffusion Tensor Imaging.
MIDL 2025 - Medical Imaging with Deep Learning. Salt Lake City, UT, USA, Jul 09-11, 2025. To be published. Preprint available. arXiv
Abstract

Diffusion Tensor Imaging (DTI) tractography offers detailed insights into the structural connectivity of the brain, but presents challenges in effective representation and interpretation in deep learning models. In this work, we propose a novel 2D representation of DTI tractography that encodes tract-level fractional anisotropy (FA) values into a 9x9 grayscale image. This representation is processed through a Beta-Total Correlation Variational Autoencoder with a Spatial Broadcast Decoder to learn a disentangled and interpretable latent embedding. We evaluate the quality of this embedding using supervised and unsupervised representation learning strategies, including auxiliary classification, triplet loss, and SimCLR-based contrastive learning. Compared to the 1D Group deep neural network (DNN) baselines, our approach improves the F1 score in a downstream sex classification task by 15.74% and shows a better disentanglement than the 3D representation.
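The 9x9 grayscale encoding the abstract describes can be illustrated with a minimal sketch. This assumes exactly 81 tract-level FA values and uses a hypothetical fixed ordering of tracts to grid cells; the paper's actual tract-to-cell assignment is not replicated here.

```python
import numpy as np

def fa_grid(fa_values, size=9):
    """Pack tract-level fractional anisotropy (FA) values into a
    size x size grayscale image.

    Assumes size*size FA values in [0, 1]; the row-major layout used
    here is a placeholder ordering, not the paper's actual assignment.
    """
    fa = np.asarray(fa_values, dtype=np.float32)
    if fa.shape != (size * size,):
        raise ValueError(f"expected {size * size} FA values, got {fa.shape}")
    # Scale [0, 1] FA to 8-bit grayscale and reshape into the grid.
    return (np.clip(fa, 0.0, 1.0) * 255).astype(np.uint8).reshape(size, size)

img = fa_grid(np.linspace(0.0, 1.0, 81))
print(img.shape, img.dtype)  # (9, 9) uint8
```

The resulting single-channel image can then be fed to any 2D encoder, such as the Beta-Total Correlation VAE mentioned above.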

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[2015]
P. Kolpaczki, T. Nielen and E. Hüllermeier.
Antithetic Sampling for Top-k Shapley Identification.
xAI 2025 - 3rd World Conference on Explainable Artificial Intelligence. Istanbul, Turkey, Jul 09-11, 2025. Preprint. arXiv
Abstract

Additive feature explanations rely primarily on game-theoretic notions such as the Shapley value by viewing features as cooperating players. The Shapley value’s popularity in and outside of explainable AI stems from its axiomatic uniqueness. However, its computational complexity severely limits practicability. Most works investigate the uniform approximation of all features’ Shapley values, needlessly consuming samples for insignificant features. In contrast, identifying the k most important features can already be sufficiently insightful and yields the potential to leverage algorithmic opportunities connected to the field of multi-armed bandits. We propose Comparable Marginal Contributions Sampling (CMCS), a method for the top-k identification problem utilizing a new sampling scheme taking advantage of correlated observations. We conduct experiments to showcase the efficacy of our method compared to competitive baselines. Our empirical findings reveal that estimation quality for the approximate-all problem does not necessarily transfer to top-k identification and vice versa.
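To make the idea of correlated (antithetic) sampling concrete, here is a minimal sketch of Monte Carlo Shapley estimation in which each sampled permutation is paired with its reverse, so marginal-contribution observations are negatively correlated. This is a generic textbook construction, not the paper's CMCS algorithm; `value` is an arbitrary coalition payoff function supplied by the caller.

```python
import random

def shapley_antithetic(value, n_players, n_pairs=2000, seed=0):
    """Monte Carlo Shapley estimates using antithetic permutation pairs.

    `value` maps a frozenset coalition to a real payoff. Each random
    permutation is evaluated together with its reverse (antithetic pair)
    to reduce estimation variance.
    """
    rng = random.Random(seed)
    phi = [0.0] * n_players
    count = 0
    for _ in range(n_pairs):
        perm = list(range(n_players))
        rng.shuffle(perm)
        for order in (perm, perm[::-1]):  # permutation and its reverse
            coalition = frozenset()
            prev = value(coalition)
            for p in order:
                coalition = coalition | {p}
                v = value(coalition)
                phi[p] += v - prev  # marginal contribution of player p
                prev = v
            count += 1
    return [s / count for s in phi]

# Additive game: each player's Shapley value equals its weight.
w = [1.0, 2.0, 3.0]
est = shapley_antithetic(lambda S: sum(w[i] for i in S), 3, n_pairs=500)
print([round(x, 2) for x in est])  # [1.0, 2.0, 3.0]
```

A top-k identification method would stop refining players whose estimates are clearly outside the top k, which is where the bandit-style sample allocation mentioned above comes in.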

MCML Authors
Link to website

Patrick Kolpaczki

Artificial Intelligence and Machine Learning

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[2014]
P. Knab, S. Marton, U. Schlegel and C. Bartelt.
Which LIME should I trust? Concepts, Challenges, and Solutions.
xAI 2025 - 3rd World Conference on Explainable Artificial Intelligence. Istanbul, Turkey, Jul 09-11, 2025. To be published. Preprint available. arXiv GitHub
Abstract

As neural networks become dominant in essential systems, Explainable Artificial Intelligence (XAI) plays a crucial role in fostering trust and detecting potential misbehavior of opaque models. LIME (Local Interpretable Model-agnostic Explanations) is among the most prominent model-agnostic approaches, generating explanations by approximating the behavior of black-box models around specific instances. Despite its popularity, LIME faces challenges related to fidelity, stability, and applicability to domain-specific problems. Numerous adaptations and enhancements have been proposed to address these issues, but the growing number of developments can be overwhelming, complicating efforts to navigate LIME-related research. To the best of our knowledge, this is the first survey to comprehensively explore and collect LIME’s foundational concepts and known limitations. We categorize and compare its various enhancements, offering a structured taxonomy based on intermediate steps and key issues. Our analysis provides a holistic overview of advancements in LIME, guiding future research and helping practitioners identify suitable approaches. Additionally, we provide a continuously updated interactive website (this https URL), offering a concise and accessible overview of the survey.

MCML Authors
Link to website

Udo Schlegel

Database Systems and Data Mining


[2013]
Y. Li, M. Ghahremani and C. Wachinger.
MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis.
ICVSS 2025 - International Computer Vision Summer School: Computer Vision for Spatial Intelligence. Sicily, Italy, Jul 06-12, 2025. To be published. Preprint available. arXiv GitHub
Abstract

Recent vision-language foundation models deliver state-of-the-art results on natural image classification but falter on medical images due to pronounced domain shifts. At the same time, training a medical foundation model requires substantial resources, including extensive annotated data and high computational capacity. To bridge this gap with minimal overhead, we introduce MedBridge, a lightweight multimodal adaptation framework that re-purposes pretrained VLMs for accurate medical image diagnosis. MedBridge comprises three key components. First, a Focal Sampling module extracts high-resolution local regions to capture subtle pathological features and compensate for the limited input resolution of general-purpose VLMs. Second, a Query Encoder (QEncoder) injects a small set of learnable queries that attend to the frozen feature maps of the VLM, aligning them with medical semantics without retraining the entire backbone. Third, a Mixture of Experts mechanism, driven by learnable queries, harnesses the complementary strength of diverse VLMs to maximize diagnostic performance. We evaluate MedBridge on five medical imaging benchmarks across three key adaptation tasks, demonstrating its superior performance in both cross-domain and in-domain adaptation settings, even under varying levels of training data availability. Notably, MedBridge achieved over 6-15% improvement in AUC compared to state-of-the-art VLM adaptation methods in multi-label thoracic disease diagnosis, underscoring its effectiveness in leveraging foundation models for accurate and data-efficient medical diagnosis.

MCML Authors
Link to website

Yitong Li

Artificial Intelligence in Medical Imaging

Link to website

Morteza Ghahremani

Dr.

Artificial Intelligence in Medical Imaging

Link to Profile Christian Wachinger

Christian Wachinger

Prof. Dr.

Artificial Intelligence in Medical Imaging


[2012]
W. Li, W. Chen, S. Qian, J. Chen, D. Cremers and H. Li.
DynSUP: Dynamic Gaussian Splatting from An Unposed Image Pair.
To be published. Preprint available (Jul 06-12, 2025). arXiv GitHub
Abstract

Recent advances in 3D Gaussian Splatting have shown promising results. Existing methods typically assume static scenes and/or multiple images with prior poses. Dynamics, sparse views, and unknown poses significantly increase the problem complexity due to insufficient geometric constraints. To overcome this challenge, we propose a method that can use only two images without prior poses to fit Gaussians in dynamic environments. To achieve this, we introduce two technical contributions. First, we propose an object-level two-view bundle adjustment. This strategy decomposes dynamic scenes into piece-wise rigid components, and jointly estimates the camera pose and motions of dynamic objects. Second, we design an SE(3) field-driven Gaussian training method. It enables fine-grained motion modeling through learnable per-Gaussian transformations. Our method leads to high-fidelity novel view synthesis of dynamic scenes while accurately preserving temporal consistency and object motion. Experiments on both synthetic and real-world datasets demonstrate that our method significantly outperforms state-of-the-art approaches designed for the cases of static environments, multiple images, and/or known poses.

MCML Authors
Link to website

Weihang Li

Computer Aided Medical Procedures & Augmented Reality

Link to website

Weirong Chen

Computer Vision & Artificial Intelligence

Link to website

Shenhan Qian

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence

Haoang Li

Dr.

* Former Member


[2011]
J. Homer, O. Friedrich and D. Grün.
Simulation-based inference has its own Dodelson-Schneider effect (but it knows that it does).
Astronomy & Astrophysics 699.A213 (Jul. 2025). DOI
Abstract

Making inferences about physical properties of the Universe requires knowledge of the data likelihood. A Gaussian distribution is commonly assumed for the uncertainties with a covariance matrix estimated from a set of simulations. The noise in such covariance estimates causes two problems: it distorts the width of the parameter contours, and it adds scatter to the location of those contours which is not captured by the widths themselves. For non-Gaussian likelihoods, an approximation may be derived via Simulation-Based Inference (SBI). It is often implicitly assumed that parameter constraints from SBI analyses, which do not use covariance matrices, are not affected by the same problems as parameter estimation with a covariance matrix estimated from simulations. We investigate whether SBI suffers from effects similar to those of covariance estimation in Gaussian likelihoods. We use Neural Posterior and Likelihood Estimation with continuous and masked autoregressive normalizing flows for density estimation. We fit our approximate posterior models to simulations drawn from a Gaussian linear model, so that the SBI result can be compared to the true posterior. We test linear and neural-network-based compression, demonstrating that neither method circumvents the issues of covariance estimation. SBI suffers an inflation of posterior variance that is equal to or greater than the analytical result in covariance estimation for Gaussian likelihoods for the same number of simulations. The assumption that SBI requires a smaller number of simulations than covariance estimation for a Gaussian likelihood analysis is inaccurate. The limitations of traditional likelihood analysis with simulation-based covariance remain for SBI with a finite simulation budget. Despite these issues, we show that SBI correctly draws the true posterior contour given enough simulations.

MCML Authors
Link to website

Jed Homer

Astrophysics, Cosmology and Artificial Intelligence

Link to Profile Daniel Grün

Daniel Grün

Prof. Dr.

Astrophysics, Cosmology and Artificial Intelligence


[2010]
D. Geissler, A. Maarouf, D. Bär, N. Pröllochs and S. Feuerriegel.
A comment on 'A 2 million-person, campaign-wide field experiment shows how digital advertising affects voter turnout'.
I4R Discussion Paper Series.237 (Jul. 2025). URL
Abstract

Aggarwal et al. (2023) analyze the effects of an 8-month-long advertising program on voter turnout in the 2020 US presidential election. Therein, 2 million voters were exposed to pro-Biden and anti-Trump advertisements on social media in five battleground states. The study finds no average treatment effect on voter turnout but differential effects when modeling by Trump support: Biden supporters are 0.4 percentage points more likely to vote while Trump supporters are 0.3 percentage points less likely to vote (t = −2.09 with p-value < 0.05). We conduct a direct reproduction of the paper by using their data and code. In addition, we check that their claims are robust to new analyses for understanding heterogeneity through the use of the causal forest methodology. We confirm the sign, magnitude, and statistical significance of the point estimates for the new analyses for understanding heterogeneity. The only significant discrepancy in results is that we find greater and statistically significant effects for the ATE, nearly all CATEs (age 18-39, gender, race, vote margin, partisanship (except Democrats), and Trump support score), and the differential effects of the Trump support score using a causal forest. These differences are likely due to the use of the causal forest and do not question the validity of the findings of the original paper.

MCML Authors
Link to website

Dominique Geissler

Artificial Intelligence in Management

Link to website

Abdurahman Maarouf

Artificial Intelligence in Management

Link to website

Dominik Bär

Artificial Intelligence in Management

Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[2009]
Z. Ge, X. Xu, H. Guo and B. W. Schuller.
Multi-Task Partially Spoofed Speech Detection Using a Dual-View Graph Neural Network Assisted Segment-Level Module.
IEEE Transactions on Audio, Speech and Language Processing 33 (Jul. 2025). DOI
Abstract

Partially Spoofed Speech Detection (PSSD), as a multi-task learning problem, typically comprises segment- and utterance-level detection tasks, benefitting from diverse feature representations for effective classification. However, existing models for multi-task PSSD usually employ a shared feature processing module for the two tasks, which may lead to suboptimal performance compared with task-specific strategies. Further, most existing works mainly capture segment-level information from a single view, which may result in poor modeling of local differences between fake and bonafide segments. In this regard, we propose a Dual-view Graph neural network Assisted segment-level Module (DGAM) for multi-task PSSD. The proposed approach contains three modules: shared representation extraction, task-specific feature processing for the utterance-level task, and a Dual-View Graph Neural Network (D-GNN) with a dual-view consistency loss for the segment-level task. The two views are a graph attention mechanism with cosine similarity and a heat kernel function with Euclidean distance, which capture semantic and Euclidean spatial relationships, respectively. Experimental evaluations on multiple spoofed-speech datasets demonstrate that the proposed approach outperforms existing approaches in both segment- and utterance-level detection in terms of equal error rate, showcasing its effectiveness for the multi-task partially spoofed scenario.

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[2008]
U. Fischer Abaigar, C. Kern and F. Kreuter.
Adjusting survey estimates with multi-accuracy post-processing.
ITACOSM 2025 - Italian Conference on Survey Methodology. Bologna, Italy, Jul 01-04, 2025. Invited talk. To be published. Preprint available.
Abstract

With the rise of non-probability samples and new data sources, survey researchers face growing challenges related to selection bias. One emerging line of work adapts algorithmic tools from machine learning to improve robustness in such settings. This talk introduces multi-accuracy boosting (Kim et al., 2019), a post-processing method that reduces subgroup-level prediction error. Originally developed in the context of fairness, it has since been explored for use in survey adjustment tasks (Kim & Kern et al., 2022). I offer an accessible overview of the method and share reflections on its potential and open questions for future research.

MCML Authors
Link to website

Unai Fischer Abaigar

Social Data Science and AI Lab

Link to Profile Christoph Kern

Christoph Kern

Prof. Dr.

Social Data Science and AI Lab

Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI


[2007]
M. Herold, J. S. Jehle, F. Krahmer and A. Veselovska.
Non-intrusive surrogate modelling using sparse random features with applications in crashworthiness analysis.
International Journal for Uncertainty Quantification 15.4 (Jul. 2025).
Abstract

Efficient surrogate modelling is a key requirement for uncertainty quantification in data-driven scenarios. In this work, a novel approach using Sparse Random Features for surrogate modelling, in combination with self-supervised dimensionality reduction, is described. The method is compared to other methods on synthetic and real data obtained from crashworthiness analyses. The results show the superiority of the described approach over state-of-the-art surrogate modelling techniques such as Polynomial Chaos Expansions and Neural Networks.

MCML Authors
Link to Profile Felix Krahmer

Felix Krahmer

Prof. Dr.

Optimization & Data Analysis

Link to website

Anna Veselovska

Dr.

Applied Numerical Analysis


[2006]
M. Keinert, S. Pistrosch, A. Mallol-Ragolta, B. W. Schuller and M. Berking.
Facial Emotion Recognition of 16 Distinct Emotions From Smartphone Videos: Comparative Study of Machine Learning and Human Performance.
Journal of Medical Internet Research 27 (Jul. 2025). DOI
Abstract

Background: The development of automatic emotion recognition models from smartphone videos is a crucial step toward the dissemination of psychotherapeutic app interventions that encourage emotional expressions. Existing models focus mainly on the 6 basic emotions while neglecting other therapeutically relevant emotions. To support this research, we introduce the novel Stress Reduction Training Through the Recognition of Emotions Wizard-of-Oz (STREs WoZ) dataset, which contains facial videos of 16 distinct, therapeutically relevant emotions.
Objective: This study aimed to develop deep learning–based automatic facial emotion recognition (FER) models for binary (positive vs negative) and multiclass emotion classification tasks, assess the models’ performance, and validate them by comparing the models with human observers.
Methods: The STREs WoZ dataset contains 14,412 facial videos of 63 individuals displaying the 16 emotions. The selfie-style videos were recorded during a stress reduction training using front-facing smartphone cameras in a nonconstrained laboratory setting. Automatic FER models using both appearance and deep-learned features for binary and multiclass emotion classification were trained on the STREs WoZ dataset. The appearance features were based on the Facial Action Coding System and extracted with OpenFace. The deep-learned features were obtained through a ResNet50 model. For our deep learning models, we used the appearance features, the deep-learned features, and their concatenation as inputs. We used 3 recurrent neural network (RNN)–based architectures: RNN-convolution, RNN-attention, and RNN-average networks. For validation, 3 human observers were also trained in binary and multiclass emotion recognition. A test set of 3018 facial emotion videos of the 16 emotions was completed by both the automatic FER model and human observers. The performance was assessed with unweighted average recall (UAR) and accuracy.
Results: Models using appearance features outperformed those using deep-learned features, as well as models combining both feature types in both tasks, with the attention network using appearance features emerging as the best-performing model. The attention network achieved a UAR of 92.9% in the binary classification task, and accuracy values ranged from 59.0% to 90.0% in the multiclass classification task. Human performance was comparable to that of the automatic FER model in the binary classification task, with a UAR of 91.0%, and superior in the multiclass classification task, with accuracy values ranging from 87.4% to 99.8%.
Conclusions: Future studies are needed to enhance the performance of automatic FER models for practical use in psychotherapeutic apps. Nevertheless, this study represents an important first step toward advancing emotion-focused psychotherapeutic interventions via smartphone apps.

MCML Authors
Link to website

Simon Pistrosch

Health Informatics

Link to website

Adria Mallol-Ragolta

Health Informatics

Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[2005]
B. Bischl, G. Casalicchio, T. Das, M. Feurer, S. Fischer, P. Gijsbers, S. Mukherjee, A. C. Müller, L. Németh, L. Oala, L. Purucker, S. Ravi, J. N. van Rijn, P. Singh, J. Vanschoren, J. van der Velde and M. Wever.
OpenML: Insights from 10 years and more than a thousand papers.
Patterns In Press, Corrected Proof (Jul. 2025). DOI
Abstract

OpenML is an open-source platform that democratizes machine-learning evaluation by enabling anyone to share datasets in uniform standards, define precise machine-learning tasks, and automatically share detailed workflows and model evaluations. More than just a platform, OpenML fosters a collaborative ecosystem where scientists create new tools, launch initiatives, and establish standards to advance machine learning. Over the past decade, OpenML has inspired over 1,500 publications across diverse fields, from scientists releasing new datasets and benchmarking new models to educators teaching reproducible science. Looking back, we detail the platform’s impact through usage and citation statistics. We share lessons from a decade of building, maintaining, and expanding OpenML, highlighting how rich metadata, collaborative benchmarking, and open interfaces have enhanced research and interoperability. Looking ahead, we cover ongoing efforts to expand OpenML’s capabilities and integrate with other platforms, informing a broader vision for open-science infrastructure for machine learning.

MCML Authors
Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to website

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Link to Profile Matthias Feurer

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science

Link to website

Sebastian Fischer

Statistical Learning and Data Science


[2004]
A. Datar, A. Datar, F. Dietrich and W. Schilders.
Systematic Construction of Continuous-Time Neural Networks for Linear Dynamical Systems.
SIAM Journal on Scientific Computing 47.4 (Jul. 2025). DOI
Abstract

Discovering a suitable neural network architecture for modeling complex dynamical systems poses a formidable challenge, often involving extensive trial and error and navigation through a high-dimensional hyperparameter space. In this paper, we discuss a systematic approach to constructing neural architectures for modeling a subclass of dynamical systems, namely, linear time-invariant (LTI) systems. We use a variant of continuous-time neural networks in which the output of each neuron evolves continuously as a solution of a first-order or second-order ordinary differential equation. Instead of deriving the network architecture and parameters from data, we propose a gradient-free algorithm to compute sparse architecture and network parameters directly from the given LTI system, leveraging its properties. We bring forth a novel neural architecture paradigm featuring horizontal hidden layers and provide insights into why employing conventional neural architectures with vertical hidden layers may not be favorable. We also provide an upper bound on the numerical errors of our neural networks. Finally, we demonstrate the high accuracy of our constructed networks on three numerical examples.

MCML Authors
Link to Profile Felix Dietrich

Felix Dietrich

Prof. Dr.

Physics-enhanced Machine Learning


[2003]
Y. Sale and A. Ramdas.
Online Selective Conformal Prediction: Errors and Solutions.
Transactions on Machine Learning Research (Jul. 2025). Preprint. URL
Abstract

In online selective conformal inference, data arrives sequentially, and prediction intervals are constructed only when an online selection rule is met. Since online selections may break the exchangeability between the selected test datum and the rest of the data, one must correct for this by suitably selecting the calibration data. In this paper, we evaluate existing calibration selection strategies and pinpoint some fundamental errors in the associated claims that guarantee selection-conditional coverage and control of the false coverage rate (FCR). To address these shortcomings, we propose novel calibration selection strategies that provably preserve the exchangeability of the calibration data and the selected test datum. Consequently, we demonstrate that online selective conformal inference with these strategies guarantees both selection-conditional coverage and FCR control. Our theoretical findings are supported by experimental evidence examining tradeoffs between valid methods.
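As background for the guarantee being discussed, here is a minimal sketch of standard (offline) split conformal prediction, where an interval is built from the empirical quantile of calibration residuals. This is shown only for context; the paper's contribution concerns how to select the calibration data so that this kind of guarantee survives online selection rules.

```python
import numpy as np

def split_conformal_interval(cal_residuals, y_pred, alpha=0.1):
    """Split conformal prediction interval from calibration residuals.

    Returns (lo, hi) such that, under exchangeability, the true label
    falls inside with probability at least 1 - alpha.
    """
    r = np.sort(np.abs(np.asarray(cal_residuals, dtype=float)))
    n = len(r)
    # Finite-sample-corrected quantile index: ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = r[min(k, n) - 1]
    return y_pred - q, y_pred + q

lo, hi = split_conformal_interval(np.arange(1, 100), y_pred=0.0, alpha=0.1)
print(lo, hi)  # -90.0 90.0
```

The exchangeability assumption in the comment is exactly what online selection can break, which motivates the calibration selection strategies studied in the paper.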

MCML Authors
Link to website

Yusuf Sale

Artificial Intelligence and Machine Learning


[2002]
E. M. Achour, K. Kohn and H. Rauhut.
The Riemannian Geometry associated to Gradient Flows of Linear Convolutional Networks.
Preprint (Jul. 2025). arXiv
Abstract

We study geometric properties of the gradient flow for learning deep linear convolutional networks. For linear fully connected networks, it has been shown recently that the corresponding gradient flow on parameter space can be written as a Riemannian gradient flow on function space (i.e., on the product of weight matrices) if the initialization satisfies a so-called balancedness condition. We establish that the gradient flow on parameter space for learning linear convolutional networks can be written as a Riemannian gradient flow on function space regardless of the initialization. This result holds for D-dimensional convolutions with D≥2, and for D=1 it holds if all so-called strides of the convolutions are greater than one. The corresponding Riemannian metric depends on the initialization.

MCML Authors
Link to Profile Holger Rauhut

Holger Rauhut

Prof. Dr.

Mathematical Data Science and Artificial Intelligence


[2001]
S. Ball, G. Gluch, S. Goldwasser, F. Kreuter, O. Reingold and G. N. Rothblum.
On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment.
Preprint (Jul. 2025). arXiv
Abstract

With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering of the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient prompt filters: adversarial prompts that elicit harmful behavior can be easily constructed, which are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results are under cryptographic hardness assumptions. In addition to these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational barriers. We conclude that safety cannot be achieved by designing filters external to the LLM internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI system’s intelligence cannot be separated from its judgment.

MCML Authors
Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI


[2000]
D. Chemnitz, M. Engel, C. Kühn and S.-V. Kuntz.
A Dynamical Systems Perspective on the Analysis of Neural Networks.
Preprint (Jul. 2025). arXiv
Abstract

In this chapter, we utilize dynamical systems to analyze several aspects of machine learning algorithms. As an expository contribution we demonstrate how to re-formulate a wide variety of challenges from deep neural networks, (stochastic) gradient descent, and related topics into dynamical statements. We also tackle three concrete challenges. First, we consider the process of information propagation through a neural network, i.e., we study the input-output map for different architectures. We explain the universal embedding property for augmented neural ODEs representing arbitrary functions of given regularity, the classification of multilayer perceptrons and neural ODEs in terms of suitable function classes, and the memory-dependence in neural delay equations. Second, we consider the training aspect of neural networks dynamically. We describe a dynamical systems perspective on gradient descent and study stability for overdetermined problems. We then extend this analysis to the overparameterized setting and describe the edge of stability phenomenon, also in the context of possible explanations for implicit bias. For stochastic gradient descent, we present stability results for the overparameterized setting via Lyapunov exponents of interpolation solutions. Third, we explain several results regarding mean-field limits of neural networks. We describe a result that extends existing techniques to heterogeneous neural networks involving graph limits via digraph measures. This shows how large classes of neural networks naturally fall within the framework of Kuramoto-type models on graphs and their large-graph limits. Finally, we point out that similar strategies to use dynamics to study explainable and reliable AI can also be applied to settings such as generative models or fundamental issues in gradient training methods, such as backpropagation or vanishing/exploding gradients.

MCML Authors
Christian Kühn, Prof. Dr., Multiscale and Stochastic Dynamics
Sara-Viola Kuntz, Multiscale and Stochastic Dynamics


[1999]
A. F. Dima, S. Shit, H. Qiu, R. Holland, T. T. Mueller, F. A. Musio, K. Yang, B. Menze, R. Braren, M. Makowski and D. Rückert.
Parametric shape models for vessels learned from segmentations via differentiable voxelization.
Preprint (Jul. 2025). arXiv
Abstract

Vessels are complex structures in the body that have been studied extensively in multiple representations. While voxelization is the most common of them, meshes and parametric models are critical in various applications due to their desirable properties. However, these representations are typically extracted through segmentations and used disjointly from each other. We propose a framework that joins the three representations under differentiable transformations. By leveraging differentiable voxelization, we automatically extract a parametric shape model of the vessels through shape-to-segmentation fitting, where we learn shape parameters from segmentations without the explicit need for ground-truth shape parameters. The vessel is parametrized as centerlines and radii using cubic B-splines, ensuring smoothness and continuity by construction. Meshes are differentiably extracted from the learned shape parameters, resulting in high-fidelity meshes that can be manipulated post-fit. Our method can accurately capture the geometry of complex vessels, as demonstrated by the volumetric fits in experiments on aortas, aneurysms, and brain vessels.
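The centerline-plus-radius parametrization can be sketched with a uniform cubic B-spline evaluator (hypothetical control points; the paper additionally learns these parameters from segmentations via differentiable voxelization, which is not reproduced here):

```python
import numpy as np

def cubic_bspline(ctrl, n_samples=100):
    """Evaluate a uniform cubic B-spline for control points ctrl (n, d).
    Each sample blends four consecutive control points with the standard
    uniform cubic B-spline basis, giving C^2 smoothness by construction."""
    n = len(ctrl)
    pts = []
    for t in np.linspace(0, n - 3, n_samples, endpoint=False):
        i = int(t)
        u = t - i
        b = np.array([(1 - u)**3, 3*u**3 - 6*u**2 + 4,
                      -3*u**3 + 3*u**2 + 3*u + 1, u**3]) / 6.0
        pts.append(b @ ctrl[i:i + 4])
    return np.array(pts)

# Hypothetical vessel: control points carry (x, y, z, radius) jointly, so the
# centerline and the radius profile are smooth and continuous together.
ctrl = np.array([[0, 0, 0, 1.0], [1, 0, 0, 1.2], [2, 1, 0, 0.9],
                 [3, 1, 1, 0.8], [4, 2, 1, 0.7], [5, 2, 2, 0.6]])
samples = cubic_bspline(ctrl, n_samples=200)
centerline, radii = samples[:, :3], samples[:, 3]
```

Because every sample is a convex combination of four neighbouring control points, the evaluated radii stay within the range of the control radii and vary smoothly along the centerline.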

MCML Authors
Daniel Rückert, Prof. Dr., Artificial Intelligence in Healthcare and Medicine


[1998]
Y. Du, P. Mondorf, S. Casola, Y. Yao, R. Litschko and B. Plank.
Reason to Rote: Rethinking Memorization in Reasoning.
Preprint (Jul. 2025). arXiv
Abstract

Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization in many cases does not heavily affect generalizable reasoning capabilities. Using two controllable synthetic reasoning datasets with noisy labels, four-digit addition (FDA) and two-hop relational reasoning (THR), we discover a reliance of memorization on generalizable reasoning mechanisms: models continue to compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening on reasoning adversely affects memorization. We further show that memorization operates through distributed encoding, i.e., aggregating various inputs and intermediate results, rather than building a look-up mechanism from inputs to noisy labels. Moreover, our FDA case study reveals memorization occurs via outlier heuristics, where existing neuron activation patterns are slightly shifted to fit noisy labels. Together, our findings suggest that memorization of label noise in language models builds on, rather than overrides, the underlying reasoning mechanisms, shedding light on the intriguing phenomenon of benign memorization.
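The four-digit addition (FDA) setting with injected label noise can be sketched as a toy generator (the noise scheme and fraction here are illustrative assumptions, not the paper's exact protocol):

```python
import random

def make_fda_dataset(n=1000, noise_frac=0.1, seed=0):
    """Toy four-digit-addition dataset: each input is a digit-string expression
    and the label is its sum; a fraction of labels is replaced by random
    (noisy) values, mimicking controllable label noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a, b = rng.randint(1000, 9999), rng.randint(1000, 9999)
        label, noisy = a + b, False
        if rng.random() < noise_frac:
            label, noisy = rng.randint(2000, 19998), True
        data.append((f"{a}+{b}", label, noisy))
    return data

data = make_fda_dataset()
noise_rate = sum(noisy for _, _, noisy in data) / len(data)
```

Keeping the noise flag per instance makes it possible to probe, as the paper does, whether a model retrieving a noisy label still computes the correct intermediate sum internally.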

MCML Authors
Philipp Mondorf, AI and Computational Linguistics
Silvia Casola, Dr., AI and Computational Linguistics
Robert Litschko, AI and Computational Linguistics
Barbara Plank, Prof. Dr., AI and Computational Linguistics


[1997]
C. Gruber, H. Alber, B. Bischl, G. Kauermann, B. Plank and M. Aßenmacher.
Revisiting Active Learning under (Human) Label Variation.
Preprint (Jul. 2025). arXiv
Abstract

Access to high-quality labeled data remains a limiting factor in applied supervised learning. While label variation (LV), i.e., differing labels for the same instance, is common, especially in natural language processing, annotation frameworks often still rest on the assumption of a single ground truth. This overlooks human label variation (HLV), the occurrence of plausible differences in annotations, as an informative signal. Similarly, active learning (AL), a popular approach to optimizing the use of limited annotation budgets in training ML models, often relies on at least one of several simplifying assumptions, which rarely hold in practice when acknowledging HLV. In this paper, we examine foundational assumptions about truth and label nature, highlighting the need to decompose observed LV into signal (e.g., HLV) and noise (e.g., annotation error). We survey how the AL and (H)LV communities have addressed – or neglected – these distinctions and propose a conceptual framework for incorporating HLV throughout the AL loop, including instance selection, annotator choice, and label representation. We further discuss the integration of large language models (LLM) as annotators. Our work aims to lay a conceptual foundation for HLV-aware active learning, better reflecting the complexities of real-world annotation.

MCML Authors
Helen Alber, Statistical Learning and Data Science
Bernd Bischl, Prof. Dr., Statistical Learning and Data Science
Göran Kauermann, Prof. Dr., Applied Statistics in Social Sciences, Economics and Business
Barbara Plank, Prof. Dr., AI and Computational Linguistics
Matthias Aßenmacher, Dr., Statistical Learning and Data Science


[1996]
S. Haas and E. Hüllermeier.
Aleatoric and Epistemic Uncertainty Measures for Ordinal Classification through Binary Reduction.
Preprint (Jul. 2025). arXiv
Abstract

Ordinal classification problems, where labels exhibit a natural order, are prevalent in high-stakes fields such as medicine and finance. Accurate uncertainty quantification, including the decomposition into aleatoric (inherent variability) and epistemic (lack of knowledge) components, is crucial for reliable decision-making. However, existing research has primarily focused on nominal classification and regression. In this paper, we introduce a novel class of measures of aleatoric and epistemic uncertainty in ordinal classification, which is based on a suitable reduction to (entropy- and variance-based) measures for the binary case. These measures effectively capture the trade-off in ordinal classification between exact hit-rate and minimal error distances. We demonstrate the effectiveness of our approach on various tabular ordinal benchmark datasets using ensembles of gradient-boosted trees and multi-layer perceptrons for approximate Bayesian inference. Our method significantly outperforms standard and label-wise entropy and variance-based measures in error detection, as indicated by misclassification rates and mean absolute error. Additionally, the ordinal measures show competitive performance in out-of-distribution (OOD) detection. Our findings highlight the importance of considering the ordinal nature of classification problems when assessing uncertainty.
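One plausible instantiation of the binary reduction for the entropy-based variant, assuming an ensemble of predictive distributions over the ordered classes (the paper's exact measures may differ): each threshold k defines a binary event Y <= k, and the standard entropy decomposition is applied per threshold and averaged.

```python
import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def ordinal_uncertainties(probs):
    """probs: (n_members, n_classes) ensemble of predictive distributions over
    ordered classes. Reduce to the K-1 binary events P(Y <= k), then decompose
    entropy per threshold: total = entropy of the mean, aleatoric = mean of
    entropies, epistemic = the (non-negative) difference."""
    cum = np.cumsum(probs, axis=1)[:, :-1]        # P(Y <= k) per member
    total = binary_entropy(cum.mean(axis=0))      # entropy of the mean
    aleatoric = binary_entropy(cum).mean(axis=0)  # mean of entropies
    epistemic = total - aleatoric                 # >= 0 by concavity of entropy
    return total.mean(), aleatoric.mean(), epistemic.mean()

# Two disagreeing ensemble members -> positive epistemic uncertainty.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.2, 0.7]])
tu, au, eu = ordinal_uncertainties(probs)
```

Total uncertainty splits additively into the aleatoric part and the epistemic part, and the epistemic term vanishes exactly when all ensemble members agree.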

MCML Authors
Eyke Hüllermeier, Prof. Dr., Artificial Intelligence and Machine Learning


[1995]
J. Li and G. Kutyniok.
Expressivity of deep neural networks.
Preprint (Jul. 2025). PDF
Abstract

This chapter focuses on the approximation theory of deep ReLU neural networks, analyzing their ability to approximate various target functions with different network architectures. We begin by introducing the universal approximation theory of deep neural networks, stating that given enough neurons, neural networks can approximate general functions. We then delve into the fundamental properties of ReLU neural networks and explore the role of width and depth of neural networks, highlighting that increasing layers could be more effective than increasing width in improving approximation accuracy. Next, we discuss the approximation rates for Sobolev functions using fully connected and convolutional neural networks. To alleviate the curse of dimensionality, we further consider Korobov functions. Finally, we focus on the approximation properties of self-attention and transformers, which have become increasingly important in modern deep learning. These results shed light on the expressivity and reliability of deep learning models, providing valuable insights into networks’ behavior and performance.
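The point that depth can beat width is captured by the classical construction approximating x² on [0, 1] with compositions of a ReLU-expressible triangle function: each extra layer of composition quarters the error (a standard textbook illustration in this literature, not code from the chapter):

```python
import numpy as np

def hat(x):
    """Triangle ('sawtooth generator') function, expressible with three ReLUs:
    hat(x) = 2*relu(x) - 4*relu(x - 0.5) + 2*relu(x - 1) on [0, 1]."""
    r = np.maximum
    return 2 * r(x, 0) - 4 * r(x - 0.5, 0) + 2 * r(x - 1, 0)

def approx_square(x, depth):
    """Depth-m ReLU approximation of x^2 on [0, 1] (Yarotsky-style):
    x^2 = x - sum_{s=1..m} hat^{(s)}(x) / 4^s. Each additional layer of
    composition quarters the approximation error, so accuracy grows
    exponentially in depth while the per-layer width stays constant."""
    out = np.asarray(x, dtype=float).copy()
    g = np.asarray(x, dtype=float).copy()
    for s in range(1, depth + 1):
        g = hat(g)              # one more layer of composition
        out = out - g / 4**s
    return out

xs = np.linspace(0, 1, 1001)
err = np.max(np.abs(approx_square(xs, depth=6) - xs**2))
```

A width-based approximation of the same accuracy would need exponentially many pieces in a single layer, which is the depth-versus-width contrast the chapter discusses.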

MCML Authors
Gitta Kutyniok, Prof. Dr., Mathematical Foundations of Artificial Intelligence


[1994]
T. Meier and K. Khutsishvili.
Who Owns the Future? Ways to Understand Power, Technology, and the Moral Commons.
Preprint (Jul. 2025). URL
Abstract

The ascent of tech billionaires—and, depending on the market, soon trillionaires—signals more than a shift in global economic structures; it marks a transformation in the moral and cultural conditions under which democratic life is sustained. This contribution offers a communitarian critique of Big Tech’s influence, grounded in the philosophical frameworks of Charles Taylor, Michael Sandel, and virtue ethicist Shannon Vallor, and further supported by public goods theory and economic insights from Paul Samuelson and Joseph Stiglitz, with Elinor Ostrom’s work emphasizing the civic importance of collective stewardship. It contends that the challenge to democracy posed by concentrated digital power is not merely institutional, economic, or ethical, but a disruption of the very conditions for democratic citizenship.

MCML Authors
Thomas Meier, Dr.


[1993]
B. Pulido, A. M. Franco-Pereira, R. E. Lillo and F. Scheipl.
Area-based epigraph and hypograph indices for functional outlier detection.
Preprint (Jul. 2025). arXiv
Abstract

Detecting outliers in Functional Data Analysis is challenging because curves can stray from the majority in many different ways. The Modified Epigraph Index (MEI) and Modified Hypograph Index (MHI) rank functions by the fraction of the domain on which one curve lies above or below another. While effective for spotting shape anomalies, their construction limits their ability to flag magnitude outliers. This paper introduces two new metrics, the Area-Based Epigraph Index (ABEI) and Area-Based Hypograph Index (ABHI) that quantify the area between curves, enabling simultaneous sensitivity to both magnitude and shape deviations. Building on these indices, we present EHyOut, a robust procedure that recasts functional outlier detection as a multivariate problem: for every curve, and for its first and second derivatives, we compute ABEI and ABHI and then apply multivariate outlier-detection techniques to the resulting feature vectors. Extensive simulations show that EHyOut remains stable across a wide range of contamination settings and often outperforms established benchmark methods. Moreover, applications to Spanish weather data and United Nations world population data further illustrate the practical utility and meaningfulness of this methodology.

MCML Authors
Fabian Scheipl, PD Dr., Functional Data Analysis


[1992]
Y. Qu, Q. Wang, Y. Mao, V. T. Hu and X. Ji.
Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?
Preprint (Jul. 2025). arXiv
Abstract

Recent advances have witnessed the effectiveness of reinforcement learning (RL) finetuning in enhancing the reasoning capabilities of large language models (LLMs). The optimization process often requires numerous iterations to achieve satisfactory performance, resulting in high computational costs due to the need for frequent prompt evaluations under intensive LLM interactions and repeated policy updates. Appropriate online prompt selection methods reduce iteration steps by prioritizing informative prompts during training, while the pipeline’s reliance on exhaustive prompt evaluation and subset selection for optimization still incurs substantial computational overhead due to frequent LLM inference calls. Distinguished from these direct evaluate-then-select schemes, this work investigates iterative approximate evaluation for arbitrary prompts and introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework that online estimates prompt difficulty without requiring costly LLM interactions. Technically, MoPPS models each prompt’s success rate as a latent variable, performs streaming Bayesian inference, and employs posterior sampling in a constructed multi-armed bandit machine, enabling sample efficient and adaptive prompt selection. Extensive experiments across mathematics, planning, and vision-based geometry tasks show that MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced LLM rollouts.
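The core of such a predictive selector can be sketched with a discounted Beta-Bernoulli bandit (illustrative class and parameter names; MoPPS itself is more elaborate): each prompt's latent success rate gets a streaming posterior, and Thompson sampling picks prompts near a target difficulty without extra LLM calls for scoring.

```python
import random

class PromptSelector:
    """Sketch: model each prompt's success rate as a Beta-Bernoulli latent
    variable, update the posterior from observed rollout outcomes, and use
    posterior (Thompson) sampling to pick prompts whose sampled success rate
    is closest to a target difficulty."""

    def __init__(self, n_prompts, discount=0.95, seed=0):
        self.alpha = [1.0] * n_prompts   # Beta(1, 1) priors
        self.beta = [1.0] * n_prompts
        self.discount = discount         # forgets stale evidence (streaming)
        self.rng = random.Random(seed)

    def select(self, k, target=0.5):
        draws = [self.rng.betavariate(a, b)
                 for a, b in zip(self.alpha, self.beta)]
        ranked = sorted(range(len(draws)), key=lambda i: abs(draws[i] - target))
        return ranked[:k]

    def update(self, prompt_id, successes, failures):
        a, b = self.alpha[prompt_id], self.beta[prompt_id]
        self.alpha[prompt_id] = self.discount * a + successes
        self.beta[prompt_id] = self.discount * b + failures

sel = PromptSelector(n_prompts=100)
batch = sel.select(k=8)
for pid in batch:   # pretend rollouts: each selected prompt succeeded twice, failed twice
    sel.update(pid, successes=2, failures=2)
```

Targeting a sampled success rate near 0.5 prioritizes prompts of intermediate difficulty, which tend to be the most informative for policy updates; the discount keeps the posterior tracking the non-stationary policy.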

MCML Authors
Vincent Tao Hu, Dr., Computer Vision & Learning


[1991]
M. Tuci, L. Bastian, B. Dupuis, N. Navab, T. Birdal and U. Şimşekli.
Mutual Information Free Topological Generalization Bounds via Stability.
Preprint (Jul. 2025). arXiv
Abstract

Providing generalization guarantees for stochastic optimization algorithms is a major challenge in modern learning theory. Recently, several studies highlighted the impact of the geometry of training trajectories on the generalization error, both theoretically and empirically. Among these works, a series of topological generalization bounds have been proposed, relating the generalization error to notions of topological complexity that stem from topological data analysis (TDA). Despite their empirical success, these bounds rely on intricate information-theoretic (IT) terms that can be bounded in specific cases but remain intractable for practical algorithms (such as ADAM), potentially reducing the relevance of the derived bounds. In this paper, we seek to formulate comprehensive and interpretable topological generalization bounds free of intractable mutual information terms. To this end, we introduce a novel learning theoretic framework that departs from the existing strategies via proof techniques rooted in algorithmic stability. By extending an existing notion of hypothesis set stability to trajectory stability, we prove that the generalization error of trajectory-stable algorithms can be upper bounded in terms of (i) TDA quantities describing the complexity of the trajectory of the optimizer in the parameter space, and (ii) the trajectory stability parameter of the algorithm. Through a series of experimental evaluations, we demonstrate that the TDA terms in the bound are of great importance, especially as the number of training samples grows. This ultimately forms an explanation of the empirical success of the topological generalization bounds.

MCML Authors
Lennart Bastian, Computer Aided Medical Procedures & Augmented Reality
Nassir Navab, Prof. Dr., Computer Aided Medical Procedures & Augmented Reality


[1990]
X. You, R. Yang, C. Zhang, Z. Jiang, J. Yang and N. Navab.
FB-Diff: Fourier Basis-guided Diffusion for Temporal Interpolation of 4D Medical Imaging.
Preprint (Jul. 2025). arXiv
Abstract

The temporal interpolation task for 4D medical imaging plays a crucial role in the clinical practice of respiratory motion modeling. Following the simplified linear-motion hypothesis, existing approaches adopt optical flow-based models to interpolate intermediate frames. However, realistic respiratory motions should be nonlinear and quasi-periodic with specific frequencies. Motivated by this property, we resolve the temporal interpolation task from the frequency perspective, and propose a Fourier basis-guided Diffusion model, termed FB-Diff. Specifically, due to the regular motion discipline of respiration, physiological motion priors are introduced to describe general characteristics of temporal data distributions. Then a Fourier motion operator is elaborately devised to extract Fourier bases by incorporating physiological motion priors and case-specific spectral information in the feature space of Variational Autoencoder. Well-learned Fourier bases can better simulate respiratory motions with motion patterns of specific frequencies. Conditioned on starting and ending frames, the diffusion model further leverages well-learned Fourier bases via the basis interaction operator, which promotes the temporal interpolation task in a generative manner. Extensive results demonstrate that FB-Diff achieves state-of-the-art (SOTA) perceptual performance with better temporal consistency while maintaining promising reconstruction metrics. Codes are available.

MCML Authors
Xin You, Computer Aided Medical Procedures & Augmented Reality
Zhongliang Jiang, Dr., Computer Aided Medical Procedures & Augmented Reality
Nassir Navab, Prof. Dr., Computer Aided Medical Procedures & Augmented Reality


[1989]
P. O. Schenk, C. Kern and T. D. Buskirk.
Fares on Fairness: Using a Total Error Framework to Examine the Role of Measurement and Representation in Training Data on Model Fairness and Bias.
EWAF 2025 - 4th European Workshop on Algorithmic Fairness. Eindhoven, The Netherlands, Jun 30-Jul 02, 2025. To be published.
Abstract

Data-driven decisions, often based on predictions from machine learning (ML) models are becoming ubiquitous. For these decisions to be just, the underlying ML models must be fair, i.e., work equally well for all parts of the population such as groups defined by gender or age. What are the logical next steps if, however, a trained model is accurate but not fair? How can we guide the whole data pipeline such that we avoid training unfair models based on inadequate data, recognizing possible sources of unfairness early on? How can the concepts of data-based sources of unfairness that exist in the fair ML literature be organized, perhaps in a way to gain new insight? In this paper, we explore two total error frameworks from the social sciences, Total Survey Error and its generalization Total Data Quality, to help elucidate issues related to fairness and trace its antecedents. The goal of this thought piece is to acquaint the fair ML community with these two frameworks, discussing errors of measurement and errors of representation through their organized structure. We illustrate how they may be useful, both practically and conceptually.

MCML Authors
Christoph Kern, Prof. Dr., Social Data Science and AI Lab


[1988]
C. Strasser Ceballos, M. Novotny and C. Kern.
Re-evaluating the role of refugee integration factors for building more equitable allocation algorithms.
EWAF 2025 - 4th European Workshop on Algorithmic Fairness. Eindhoven, The Netherlands, Jun 30-Jul 02, 2025. To be published.
Abstract

Numerous studies in the social sciences have examined how individual and location-level characteristics influence refugees’ integration outcomes. A more recent, smaller body of computational research has developed algorithmic tools that aim to improve refugee integration by optimizing matching to resettlement locations based on predicted outcomes. These tools, which are piloted in a number of countries, raise several concerns. These include, first, their reliance on a narrow set of individual-level predictors – most of which are protected attributes under global anti-discrimination laws – overlooking valuable insights from migration studies that may improve predictive accuracy. Second, they guide refugee placement decisions without assessing group fairness, potentially reinforcing existing inequalities. Against this background, we draw on comprehensive refugee panel data from Germany and study the economic integration of refugees through the lens of predictive modeling. Specifically, we develop prediction models that integrate and test a wide range of integration factors from migration research. We then compare our extended model configurations with existing refugee-location matching algorithms, and evaluate group model performance to assess generalizability and fairness. Overall, we highlight the importance of integrating insights from migration studies into the development of algorithmic decision-making tools to improve their reliability and promote fair outcomes across diverse groups.

MCML Authors
Christoph Kern, Prof. Dr., Social Data Science and AI Lab


[1987]
L. Bothmann, P. A. Boustani, J. M. Alvarez, G. Casalicchio, B. Bischl and S. Dandl.
Privilege Scores.
EWAF 2025 - 4th European Workshop on Algorithmic Fairness. Eindhoven, The Netherlands, Jun 30-Jul 02, 2025. To be published. Preprint available. arXiv
Abstract

Bias-transforming methods of fairness-aware machine learning aim to correct a non-neutral status quo with respect to a protected attribute (PA). Current methods, however, lack an explicit formulation of what drives non-neutrality. We introduce privilege scores (PS) to measure PA-related privilege by comparing the model predictions in the real world with those in a fair world in which the influence of the PA is removed. At the individual level, PS can identify individuals who qualify for affirmative action; at the global level, PS can inform bias-transforming policies. After presenting estimation methods for PS, we propose privilege score contributions (PSCs), an interpretation method that attributes the origin of privilege to mediating features and direct effects. We provide confidence intervals for both PS and PSCs. Experiments on simulated and real-world data demonstrate the broad applicability of our methods and provide novel insights into gender and racial privilege in mortgage and college admissions applications.

MCML Authors
Ludwig Bothmann, Dr., Statistical Learning and Data Science
Philip Amir Boustani, Statistical Learning and Data Science
Giuseppe Casalicchio, Dr., Statistical Learning and Data Science
Bernd Bischl, Prof. Dr., Statistical Learning and Data Science


[1986]
L. Bothmann, K. Peters and B. Bischl.
What Is Fairness? On the Role of Protected Attributes and Fictitious Worlds.
EWAF 2025 - 4th European Workshop on Algorithmic Fairness. Eindhoven, The Netherlands, Jun 30-Jul 02, 2025. To be published. Preprint available. arXiv
Abstract

A growing body of literature in fairness-aware machine learning (fairML) aims to mitigate machine learning (ML)-related unfairness in automated decision-making (ADM) by defining metrics that measure fairness of an ML model and by proposing methods to ensure that trained ML models achieve low scores on these metrics. However, the underlying concept of fairness, i.e., the question of what fairness is, is rarely discussed, leaving a significant gap between centuries of philosophical discussion and the recent adoption of the concept in the ML community. In this work, we try to bridge this gap by formalizing a consistent concept of fairness and by translating the philosophical considerations into a formal framework for the training and evaluation of ML models in ADM systems. We argue that fairness problems can arise even without the presence of protected attributes (PAs), and point out that fairness and predictive performance are not irreconcilable opposites, but that the latter is necessary to achieve the former. Furthermore, we argue why and how causal considerations are necessary when assessing fairness in the presence of PAs by proposing a fictitious, normatively desired (FiND) world in which PAs have no causal effects. In practice, this FiND world must be approximated by a warped world in which the causal effects of the PAs are removed from the real-world data. Finally, we achieve greater linguistic clarity in the discussion of fairML. We outline algorithms for practical applications and present illustrative experiments on COMPAS data.

MCML Authors
Ludwig Bothmann, Dr., Statistical Learning and Data Science
Bernd Bischl, Prof. Dr., Statistical Learning and Data Science


[1985]
C. Leininger, S. Rittel and L. Bothmann.
Overcoming Fairness Trade-offs via Pre-processing: A Causal Perspective.
EWAF 2025 - 4th European Workshop on Algorithmic Fairness. Eindhoven, The Netherlands, Jun 30-Jul 02, 2025. To be published. Preprint available. arXiv
Abstract

Training machine learning models for fair decisions faces two key challenges: The fairness-accuracy trade-off results from enforcing fairness, which weakens predictive performance relative to an unconstrained model. The incompatibility of different fairness metrics poses another trade-off – also known as the impossibility theorem. Recent work identifies the bias within the observed data as a possible root cause and shows that fairness and predictive performance are in fact in accord when predictive performance is measured on unbiased data. We offer a causal explanation for these findings using the framework of the FiND (fictitious and normatively desired) world, a ‘fair’ world, where protected attributes have no causal effects on the target variable. We show theoretically that (i) classical fairness metrics deemed to be incompatible are naturally satisfied in the FiND world, while (ii) fairness aligns with high predictive performance. We extend our analysis by suggesting how one can benefit from these theoretical insights in practice, using causal pre-processing methods that approximate the FiND world. Additionally, we propose a method for evaluating the approximation of the FiND world via pre-processing in practical use cases where we do not have access to the FiND world. In simulations and empirical studies, we demonstrate that these pre-processing methods are successful in approximating the FiND world and resolve both trade-offs. Our results provide actionable solutions for practitioners to achieve fairness and high predictive performance simultaneously.

MCML Authors
Simon Rittel, Statistical Learning and Data Science
Ludwig Bothmann, Dr., Statistical Learning and Data Science


[1984]
S. Yuan, E. Nie, B. Ma and M. Färber.
Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers.
IJCNN 2025 - International Joint Conference on Neural Networks. Rome, Italy, Jun 30-Jul 05, 2025. Preprint. arXiv
Abstract

Large Language Models (LLMs) possess outstanding capabilities in addressing various natural language processing (NLP) tasks. However, the sheer size of these models poses challenges in terms of storage, training and inference due to the inclusion of billions of parameters through layer stacking. While traditional approaches such as model pruning or distillation offer ways for reducing model size, they often come at the expense of performance retention. In our investigation, we systematically explore the approach of reducing the number of layers in LLMs. Surprisingly, we observe that even with fewer layers, LLMs maintain similar or better performance levels, particularly in prompt-based fine-tuning for text classification tasks. Remarkably, in certain cases, models with a single layer outperform their fully layered counterparts. These findings offer valuable insights for future work aimed at mitigating the size constraints of LLMs while preserving their performance, thereby opening avenues for significantly more efficient use of LLMs.
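Mechanically, cutting off layers amounts to truncating the layer stack before the forward pass; a toy residual stack makes this concrete (hypothetical sizes and random weights, not an actual LLM):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(dim):
    """One toy residual block: x + W2 @ relu(W1 @ x)."""
    w1 = rng.normal(0, 0.1, (dim, dim))
    w2 = rng.normal(0, 0.1, (dim, dim))
    return lambda x: x + w2 @ np.maximum(w1 @ x, 0)

dim, n_layers = 16, 12
layers = [make_layer(dim) for _ in range(n_layers)]

def forward(x, layers):
    for layer in layers:
        x = layer(x)
    return x

x = rng.normal(size=dim)
full = forward(x, layers)        # all 12 layers
slim = forward(x, layers[:3])    # keep only the first 3 layers
```

The truncated stack still yields a valid (and much cheaper) forward pass; whether its outputs remain useful is then an empirical question, which is what the paper evaluates with prompt-based fine-tuning on text classification tasks.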

MCML Authors

[1983]
V. Ehm, N. El Amrani, Y. Xie, L. Bastian, M. Gao, W. Wang, L. Sang, D. Cao, Z. Lähner, D. Cremers and F. Bernard.
Beyond Complete Shapes: A Quantitative Evaluation of 3D Shape Matching Algorithms.
SGP 2025 - Symposium on Geometry Processing. Bilbao, Spain, Jun 30-Jul 04, 2025. To be published. Preprint available. arXiv
Abstract

Finding correspondences between 3D shapes is an important and long-standing problem in computer vision, graphics and beyond. While approaches based on machine learning dominate modern 3D shape matching, almost all existing (learning-based) methods require that at least one of the involved shapes is complete. In contrast, the most challenging and arguably most practically relevant setting of matching partially observed shapes is currently underexplored. One important factor is that existing datasets contain only a small number of shapes (typically below 100), which are unable to serve data-hungry machine learning approaches, particularly in the unsupervised regime. In addition, the type of partiality present in existing datasets is often artificial and far from realistic. To address these limitations and to encourage research on these relevant settings, we provide a generic and flexible framework for the procedural generation of challenging partial shape matching scenarios. Our framework allows for a virtually infinite generation of partial shape matching instances from a finite set of shapes with complete geometry. Further, we manually create cross-dataset correspondences between seven existing (complete geometry) shape matching datasets, leading to a total of 2543 shapes. Based on this, we propose several challenging partial benchmark settings, for which we evaluate respective state-of-the-art methods as baselines.

MCML Authors
Viktoria Ehm, Computer Vision & Artificial Intelligence
Lennart Bastian, Computer Aided Medical Procedures & Augmented Reality
Maolin Gao, Computer Vision & Artificial Intelligence
Lu Sang, Computer Vision & Artificial Intelligence
Daniel Cremers, Prof. Dr., Computer Vision & Artificial Intelligence


[1982]
C. Strasser Ceballos and C. Kern.
Location matching on shaky grounds: Re-evaluating algorithms for refugee allocation.
ACM FAccT 2025 - 8th ACM Conference on Fairness, Accountability, and Transparency. Athens, Greece, Jun 23-26, 2025. DOI
Abstract

The initial location to which refugees are assigned upon arrival in a host country plays a key role in their integration. Several research groups have developed tools to optimize refugee-location matching, with the overall aim of improving refugees’ integration outcomes. Four primary tools are already being piloted across various countries: GeoMatch, Annie™ Moore, Match’In, and Re:Match. The first two tools combine supervised machine learning with optimal matching techniques, while the latter two rely on heuristic methods to match refugee preferences with suitable locations. These tools are used in a highly sensitive context and directly impact human lives. It is, therefore, not only desirable but critical to (re-)evaluate them through the lens of algorithmic fairness. We contribute in three key aspects: First, we provide a comprehensive overview and systematization of the tools aimed at the algorithmic fairness community. Second, we identify sources of biases along the tool design stages that can contribute to disparate impacts downstream. Finally, we simulate the application of the GeoMatch tool using German survey data to empirically illustrate the impact of target variable choice on matching outcomes. While GeoMatch optimizes economic integration, we demonstrate that the integration gains differ substantially when social integration is prioritized instead. With our use case, we highlight the susceptibility of algorithmic matching tools to design decisions such as the operationalization of the integration outcome and emphasize the need for more holistic evaluations of their social impacts.

MCML Authors
Christoph Kern, Prof. Dr., Social Data Science and AI Lab


[1981]
A. Mallol-Ragolta, M. Gonzalez-Machorro, R. von Heynitz, K. Scherzer, I. Cordts and B. W. Schuller.
Early Detection of ALS in Absence of Speech Impairments with Computer Audition.
AIME 2025 - 23rd International Conference on Artificial Intelligence in Medicine. Pavia, Italy, Jun 23-26, 2025. DOI
Abstract

We investigate whether Amyotrophic Lateral Sclerosis (ALS) can be detected in patients without speech impairments utilising computer audition techniques. We exploit the information embedded in the patients’ speech while performing five different speech tasks. Specifically, producing the sustained vowel /a:/, repeating the syllables /da/-/da/ and /da/-/ba/ (separately), reading a text passage, and describing a picture. The implemented models are task-dedicated, as they are solely trained and assessed with the speech samples of the corresponding task. We conduct our experiments on the novel, German-speaking AIMnd dataset. We define the Unweighted Average Recall (UAR) as the evaluation metric. When differentiating ALS patients with normal speech from controls – binary classification –, the best models, which obtain a UAR score of 88% on the Test set, mostly exploit the speech samples corresponding to the /da/-/ba/ task. When including the ALS patients with, at least, detectable speech disturbances in the detection – three-class classification –, the best model on the Test set scores a UAR of 70%, also exploiting the speech samples corresponding to the /da/-/ba/ task.

MCML Authors
Link to website

Adria Mallol-Ragolta

Health Informatics

Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[1980]
F. Ghorbanpour, T. Z. Malaguth and A. Akbaritabar.
Differentiating Emigration from Return Migration of Scholars Using Name-Based Nationality Detection Models.
ICWSM 2025 - 19th International AAAI Conference on Web and Social Media. Copenhagen, Denmark, Jun 23-26, 2025. DOI
Abstract

Most web and digital trace data do not include information about an individual’s nationality due to privacy concerns. The lack of data on nationality can create challenges for migration research. It can lead to a left-censoring issue since we are uncertain about the migrant’s country of origin. Once we observe an emigration event, if we know the nationality, we can differentiate it from return migration. We propose methods to detect the nationality with the least available data, i.e., full names. We use the detected nationality in comparison with the country of academic origin, which is a common approach in studying the migration of researchers. We gathered 2.6 million unique name-nationality pairs from Wikipedia and categorized them into families of nationalities with three granularity levels to use as our training data. Using a character-based machine learning model, we achieved a weighted F1 score of 84% for the broadest and 67% for the most granular, country-level categorization. In our empirical study, we used the trained and tested model to assign nationality to 8+ million scholars’ full names in Scopus data. Our results show that using the country of first publication as a proxy for nationality underestimates the size of return flows, especially for countries with a more diverse academic workforce, such as the USA, Australia, and Canada. We found that around 48% of emigration from the USA was return migration once we used the country of name origin, in contrast to 33% based on academic origin. In the most recent period, 79% of scholars whose affiliation has consistently changed from the USA to China, and who are thus considered emigrants, have Chinese names, in contrast to 41% with a Chinese academic origin. Our proposed methods for addressing left-censoring issues are beneficial for other research that uses digital trace data to study migration.
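The weighted F1 score reported here averages per-class F1 weighted by each class's support, so frequent nationality classes dominate the aggregate; a self-contained sketch of the computation (labels are illustrative):

```python
def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1 scores averaged by class support."""
    n = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * sum(t == c for t in y_true) / n
    return score

# Toy name-origin labels: one 'CN' name misclassified as 'US'
print(round(weighted_f1(['CN', 'CN', 'US', 'US'], ['CN', 'US', 'US', 'US']), 4))  # → 0.7333
```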

MCML Authors
Link to website

Faeze Ghorbanpour

Data Analytics & Statistics


[1979]
I. Tsangko, A. Triantafyllopoulos, E. Kyriakidis, G. Margetis and B. W. Schuller.
Large Language Models for the Analysis of Project Proposals.
AI-HCI 2025 - 6th International Conference on Artificial Intelligence in Human Computer Interaction. Gothenburg, Sweden, Jun 22-27, 2025. DOI
Abstract

We introduce a framework that integrates traditional topic modeling methods, Latent Dirichlet Allocation (LDA) and BERTopic, with Large Language Models (LLMs) to automatically identify topics featured in project proposals for the cultural heritage (CH) domain. Applied to a dataset of 1,757 English project proposals aimed at protecting and promoting CH in Africa, our approach begins by extracting initial topics using LDA and BERTopic. These topics are further refined by LLaMA3, generating precise and semantically meaningful categories that incorporate domain expert-curated labels to ensure contextual relevance. The consistency of assigned labels is evaluated using automatic classification. Additionally, we explore the role of linguistic features, such as sentence complexity, sentiment analysis, and gendered language, as predictors of proposal success. Results highlight the potential of combining traditional topic modeling with LLMs to uncover hidden insights into funding allocation patterns, aiming to enhance the equitable distribution of resources in CH projects.

MCML Authors
Link to website

Andreas Triantafyllopoulos

Health Informatics

Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[1978]
O. Dhaouadi, J. Meier, J. Kaiser and D. Cremers.
Shape Your Ground: Refining Road Surfaces Beyond Planar Representations.
IV 2025 - 36th IEEE Intelligent Vehicles Symposium. Cluj-Napoca, Romania, Jun 22-25, 2025. To be published. Preprint available. arXiv
Abstract

Road surface reconstruction from aerial images is fundamental for autonomous driving, urban planning, and virtual simulation, where smoothness, compactness, and accuracy are critical quality factors. Existing reconstruction methods often produce artifacts and inconsistencies that limit usability, while downstream tasks have a tendency to represent roads as planes for simplicity but at the cost of accuracy. We introduce FlexRoad, the first framework to directly address road surface smoothing by fitting Non-Uniform Rational B-Splines (NURBS) surfaces to 3D road points obtained from photogrammetric reconstructions or geodata providers. Our method at its core utilizes the Elevation-Constrained Spatial Road Clustering (ECSRC) algorithm for robust anomaly correction, significantly reducing surface roughness and fitting errors. To facilitate quantitative comparison between road surface reconstruction methods, we present GeoRoad Dataset (GeRoD), a diverse collection of road surface and terrain profiles derived from openly accessible geodata. Experiments on GeRoD and the photogrammetry-based DeepScenario Open 3D Dataset (DSC3D) demonstrate that FlexRoad considerably surpasses commonly used road surface representations across various metrics while being insensitive to various input sources, terrains, and noise types. By performing ablation studies, we identify the key role of each component towards high-quality reconstruction performance, making FlexRoad a generic method for realistic road surface modeling.

MCML Authors
Link to website

Johannes Meier

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1977]
O. Dhaouadi, J. Meier, L. Wahl, J. Kaiser, L. Scalerandi, N. Wandelburg, Z. Zhou, N. Berinpanathan, H. Banzhaf and D. Cremers.
Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset.
IV 2025 - 36th IEEE Intelligent Vehicles Symposium. Cluj-Napoca, Romania, Jun 22-25, 2025. To be published. Preprint available. arXiv
Abstract

Accurate 3D trajectory data is crucial for advancing autonomous driving. Yet, traditional datasets are usually captured by fixed sensors mounted on a car and are susceptible to occlusion. Additionally, such an approach can precisely reconstruct the dynamic environment only in the close vicinity of the measurement vehicle, while neglecting objects that are further away. In this paper, we introduce the DeepScenario Open 3D Dataset (DSC3D), a high-quality, occlusion-free dataset of 6 degrees of freedom bounding box trajectories acquired through a novel monocular camera drone tracking pipeline. Our dataset includes more than 175,000 trajectories of 14 types of traffic participants and significantly exceeds existing datasets in terms of diversity and scale, containing many unprecedented scenarios such as complex vehicle-pedestrian interactions on highly populated urban streets and comprehensive parking maneuvers from entry to exit. The DSC3D dataset was captured at five locations in Europe and the United States: a parking lot, a crowded inner city, a steep urban intersection, a federal highway, and a suburban intersection. Our 3D trajectory dataset aims to enhance autonomous driving systems by providing detailed environmental 3D representations, which could lead to improved obstacle interactions and safety. We demonstrate its utility across multiple applications including motion prediction, motion planning, scenario mining, and generative reactive traffic agents. Our interactive online visualization platform and the complete dataset are publicly available at this https URL, facilitating research in motion prediction, behavior modeling, and safety validation.

MCML Authors
Link to website

Johannes Meier

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1976]
F. Förster, Q. Khan and D. Cremers.
Decentralized Reinforcement Learning for Multi-Agent Navigation in Unconstrained Environments.
IV 2025 - 36th IEEE Intelligent Vehicles Symposium. Cluj-Napoca, Romania, Jun 22-25, 2025. To be published. Preprint available. PDF
Abstract

Supervised learning has been shown to be an effective strategy for training neural networks for vehicle navigation. However, it requires labeled data, which may not be available when a large number of vehicles need to be controlled simultaneously. In contrast, Deep Reinforcement Learning (DRL) circumvents the necessity for ground truth labels through environmental exploration. However, most current DRL approaches either operate in a discrete action/state space or do not consider the vehicle kinematics. In this paper, we use DRL to control multiple vehicles while also considering their kinematics. The task is for all the vehicles to reach their desired destination/target while avoiding collisions with each other or static obstacles in an unconstrained environment. For this, we propose a decentralized Proximal Policy Optimization (PPO) based DRL agent that independently provides control commands to each vehicle. The agent is based on two separate PPO models. The first is used to drive each vehicle to the proximity of its target. Once within the target’s proximity, the second model is used to park that vehicle at the correct position and orientation. The decentralized nature of the algorithm allows each agent to rely only on information about its current state and target, along with details regarding the closest obstacle/agent. By scaling this approach to all vehicles, simultaneous navigation of multiple vehicles can be achieved. Experimental results show a collective strategy that allows consistent results across a wide range of scenarios while scaling to situations with up to 20 vehicles and 12 stationary obstacles.

MCML Authors
Link to website

Qadeer Khan

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1975]
J. Meier, L. Inchingolo, O. Dhaouadi, Y. Xia, J. Kaiser and D. Cremers.
MonoCT: Overcoming Monocular 3D Detection Domain Shift with Consistent Teacher Models.
IV 2025 - 36th IEEE Intelligent Vehicles Symposium. Cluj-Napoca, Romania, Jun 22-25, 2025. To be published. Preprint available.
Abstract

We tackle the problem of monocular 3D object detection across different sensors, environments, and camera setups. In this paper, we introduce a novel unsupervised domain adaptation approach, MonoCT, that generates highly accurate pseudo labels for self-supervision. Inspired by our observation that accurate depth estimation is critical to mitigating domain shifts, MonoCT introduces a novel Generalized Depth Enhancement (GDE) module with an ensemble concept to improve depth estimation accuracy. Moreover, we introduce a novel Pseudo Label Scoring (PLS) module by exploring inner-model consistency measurement and a Diversity Maximization (DM) strategy to further generate high-quality pseudo labels for self-training. Extensive experiments on six benchmarks show that MonoCT outperforms existing SOTA domain adaptation methods by large margins (~21% minimum for AP Mod.) and generalizes well to car, traffic camera and drone views.

MCML Authors
Link to website

Johannes Meier

Computer Vision & Artificial Intelligence

Yan Xia

Dr.

* Former Member

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1974]
F. Li, Y. Bi, D. Huang, Z. Jiang and N. Navab.
Robotic CBCT Meets Robotic Ultrasound.
IPCAI 2025 - International Conference on Information Processing in Computer-Assisted Interventions. Berlin, Germany, Jun 17-18, 2025. To be published. Preprint available. arXiv
Abstract

The multi-modality imaging system offers optimal fused images for safe and precise interventions in modern clinical practices, such as computed tomography - ultrasound (CT-US) guidance for needle insertion. However, the limited dexterity and mobility of current imaging devices hinder their integration into standardized workflows and the advancement toward fully autonomous intervention systems. In this paper, we present a novel clinical setup where robotic cone beam computed tomography (CBCT) and robotic US are pre-calibrated and dynamically co-registered, enabling new clinical applications. This setup allows registration-free rigid registration, facilitating multi-modal guided procedures in the absence of tissue deformation. First, a one-time pre-calibration is performed between the systems. To ensure a safe insertion path by highlighting critical vasculature on the 3D CBCT, SAM2 segments vessels from B-mode images, using the Doppler signal as an autonomously generated prompt. Based on the registration, the Doppler image or segmented vessel masks are then mapped onto the CBCT, creating an optimally fused image with comprehensive detail. To validate the system, we used a specially designed phantom, featuring lesions covered by ribs and multiple vessels with simulated moving flow. The mapping error between US and CBCT resulted in an average deviation of 1.72 ± 0.62 mm. A user study demonstrated the effectiveness of CBCT-US fusion for needle insertion guidance, showing significant improvements in time efficiency, accuracy, and success rate. Needle intervention performance improved by approximately 50% compared to the conventional US-guided workflow. We present the first robotic dual-modality imaging system designed to guide clinical applications. The results show significant performance improvements compared to traditional manual interventions.

MCML Authors
Link to website

Feng Li

Computer Aided Medical Procedures & Augmented Reality

Link to website

Yuan Bi

Computer Aided Medical Procedures & Augmented Reality

Link to website

Dianye Huang

Computer Aided Medical Procedures & Augmented Reality

Link to website

Zhongliang Jiang

Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1973]
S. A. Baumann, F. Krause, M. Neumayr, N. Stracke, M. Sevi, V. T. Hu and B. Ommer.
Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. URL GitHub
Abstract

In recent years, advances in text-to-image (T2I) diffusion models have substantially elevated the quality of their generated images. However, achieving fine-grained control over attributes remains a challenge due to the limitations of natural language prompts (such as the absence of a continuous set of intermediate descriptions between ‘person’ and ‘old person’). Even though many methods have been introduced that augment the model or generation process to enable such control, methods that do not require a fixed reference image are limited to either enabling global fine-grained attribute expression control or coarse attribute expression control localized to specific subjects, not both simultaneously. We show that there exist directions in the commonly used token-level CLIP text embeddings that enable fine-grained subject-specific control of high-level attributes in text-to-image models. Based on this observation, we introduce one efficient optimization-free and one robust optimization-based method to identify these directions for specific attributes from contrastive text prompts. We demonstrate that these directions can be used to augment the prompt text input with fine-grained control over attributes of specific subjects in a compositional manner (control over multiple attributes of a single subject) without having to adapt the diffusion model.
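The central idea, a semantic direction obtained from contrastive prompts and applied with continuous strength, can be illustrated with a toy vector sketch (all names and vectors below are hypothetical stand-ins for token-level CLIP embeddings, not the paper's actual method):

```python
def attribute_direction(emb_attr, emb_base):
    """Difference of two contrastive prompt embeddings, e.g.
    'old person' minus 'person', as a semantic direction."""
    return [a - b for a, b in zip(emb_attr, emb_base)]

def shift_embedding(token_emb, direction, scale):
    """Move a subject's token embedding along the direction;
    'scale' gives continuous control over attribute strength."""
    return [t + scale * d for t, d in zip(token_emb, direction)]

# Hypothetical 3-d embeddings standing in for CLIP token embeddings
person = [0.2, 0.5, 0.1]
old_person = [0.8, 0.5, 0.4]
direction = attribute_direction(old_person, person)
slightly_old = shift_embedding(person, direction, 0.5)  # halfway between the prompts
```

Because only the targeted subject's token embedding is shifted, other subjects in the same prompt are left untouched, which is the intuition behind subject-specific, compositional control.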

MCML Authors
Link to website

Vincent Tao Hu

Dr.

Computer Vision & Learning

Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Computer Vision & Learning


[1972]
Y. Yeganeh, A. Farshad, I. Charisiadis, M. Hasny, M. Hartenberger, B. Ommer, N. Navab and E. Adeli.
Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. Highlight Paper. To be published. URL
Abstract

Scaling by training on large datasets has been shown to enhance the quality and fidelity of image generation and manipulation with diffusion models; however, such large datasets are not always accessible in medical imaging due to cost and privacy issues, which contradicts one of the main applications of such models to produce synthetic samples where real data is scarce. Also, finetuning on pre-trained general models has been a challenge due to the distribution shift between the medical domain and the pre-trained models. Here, we propose Latent Drift (LD) for diffusion models that can be adopted for any fine-tuning method to mitigate the issues faced by the distribution shift or employed in inference time as a condition. Latent Drifting enables diffusion models to be conditioned for medical images fitted for the complex task of counterfactual image generation, which is crucial to investigate how parameters such as gender, age, and adding or removing diseases in a patient would alter the medical images. We evaluate our method on three public longitudinal benchmark datasets of brain MRI and chest X-rays for counterfactual image generation. Our results demonstrate significant performance gains in various scenarios when combined with different fine-tuning schemes. The source code of this work will be publicly released upon its acceptance.

MCML Authors
Link to website

Yousef Yeganeh

Computer Aided Medical Procedures & Augmented Reality

Link to website

Azade Farshad

Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Computer Vision & Learning

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1971]
Q. Bouniot, I. Redko, A. Mallasto, C. Laclau, O. Struckmeier, K. Arndt, M. Heinonen, V. Kyrki and S. Kaski.
From Alexnet to Transformers: Measuring the Non-linearity of Deep Neural Networks with Affine Optimal Transport.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

In the last decade, we have witnessed the introduction of several novel deep neural network (DNN) architectures exhibiting ever-increasing performance across diverse tasks. Explaining the upward trend of their performance, however, remains difficult, as different DNN architectures of comparable depth and width – common factors associated with their expressive power – may exhibit drastically different performance even when trained on the same dataset. In this paper, we introduce the concept of the non-linearity signature of a DNN, the first theoretically sound solution for approximately measuring the non-linearity of deep neural networks. Built upon a score derived from closed-form optimal transport mappings, this signature provides a better understanding of the inner workings of a wide range of DNN architectures and learning paradigms, with a particular emphasis on computer vision tasks. We provide extensive experimental results that highlight the practical usefulness of the proposed non-linearity signature and its potential for far-reaching implications.

MCML Authors
Link to website

Quentin Bouniot

Dr.

Interpretable and Reliable Machine Learning


[1970]
H. Chen, H. Li, Y. Zhang, G. Zhang, J. Bi, P. Torr, J. Gu, D. Krompass and V. Tresp.
FedBiP: Heterogeneous One-Shot Federated Learning with Personalized Latent Diffusion Models.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

One-Shot Federated Learning (OSFL), a special decentralized machine learning paradigm, has recently gained significant attention. OSFL requires only a single round of client data or model upload, which reduces communication costs and mitigates privacy threats compared to traditional FL. Despite these promising prospects, existing methods face challenges due to client data heterogeneity and limited data quantity when applied to real-world OSFL systems. Recently, Latent Diffusion Models (LDM) have shown remarkable advancements in synthesizing high-quality images through pretraining on large-scale datasets, thereby presenting a potential solution to overcome these issues. However, directly applying pretrained LDM to heterogeneous OSFL results in significant distribution shifts in synthetic data, leading to performance degradation in classification models trained on such data. This issue is particularly pronounced in rare domains, such as medical imaging, which are underrepresented in LDM’s pretraining data. To address this challenge, we propose Federated Bi-Level Personalization (FedBiP), which personalizes the pretrained LDM at both instance-level and concept-level. In this way, FedBiP synthesizes images that follow the client’s local data distribution without violating privacy regulations. FedBiP is also the first approach to simultaneously address feature space heterogeneity and client data scarcity in OSFL. Our method is validated through extensive experiments on three OSFL benchmarks with feature space heterogeneity, as well as on challenging medical and satellite image datasets with label heterogeneity. The results demonstrate the effectiveness of FedBiP, which substantially outperforms other OSFL methods.

MCML Authors
Link to website

Haokun Chen

Database Systems and Data Mining

Link to website

Yao Zhang

Database Systems and Data Mining

Link to website

Gengyuan Zhang

Database Systems and Data Mining

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1969]
Z. Chen, Y. Wang, L. Nan and X. Zhu.
Parametric Point Cloud Completion for Polygonal Surface Reconstruction.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL GitHub
Abstract

Existing polygonal surface reconstruction methods heavily depend on input completeness and struggle with incomplete point clouds. We argue that while current point cloud completion techniques may recover missing points, they are not optimized for polygonal surface reconstruction, where the parametric representation of underlying surfaces remains overlooked. To address this gap, we introduce parametric completion, a novel paradigm for point cloud completion, which recovers parametric primitives instead of individual points to convey high-level geometric structures. Our presented approach, PaCo, enables high-quality polygonal surface reconstruction by leveraging plane proxies that encapsulate both plane parameters and inlier points, proving particularly effective in challenging scenarios with highly incomplete data. Comprehensive evaluations of our approach on the ABC dataset establish its effectiveness with superior performance and set a new standard for polygonal surface reconstruction from incomplete data.

MCML Authors
Link to website

Zhaiyu Chen

Data Science in Earth Observation

Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1968]
T. Dagès, S. Weber, Y.-W. E. Lin, R. Talmon, D. Cremers, M. Lindenbaum, A. M. Bruckstein and R. Kimmel.
Finsler Multi-Dimensional Scaling: Manifold Learning for Asymmetric Dimensionality Reduction and Embedding.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

Dimensionality reduction is a fundamental task that aims to simplify complex data by reducing its feature dimensionality while preserving essential patterns, with core applications in data analysis and visualisation. To preserve the underlying data structure, multi-dimensional scaling (MDS) methods focus on preserving pairwise dissimilarities, such as distances. They optimise the embedding to have pairwise distances as close as possible to the data dissimilarities. However, the current standard is limited to embedding data in Riemannian manifolds. Motivated by the lack of asymmetry in the Riemannian metric of the embedding space, this paper extends the MDS problem to a natural asymmetric generalisation of Riemannian manifolds called Finsler manifolds. Inspired by Euclidean spaces, we define a canonical Finsler space for embedding asymmetric data. Due to its simplicity with respect to geodesics, data representation in this space is both intuitive and simple to analyse. We demonstrate that our generalisation benefits from the same theoretical convergence guarantees. We reveal the effectiveness of our Finsler embedding across various types of non-symmetric data, highlighting its value in applications such as data visualisation, dimensionality reduction, directed graph embedding, and link prediction.
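In the standard symmetric (Euclidean) setting that this paper generalizes to Finsler manifolds, MDS minimizes the mismatch between pairwise embedding distances and the data dissimilarities; a minimal sketch of that raw-stress objective (function name illustrative):

```python
import math

def raw_stress(points, dissimilarities):
    """Raw MDS stress: sum of squared mismatches between pairwise
    embedding distances and target dissimilarities (symmetric case)."""
    s = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            s += (d - dissimilarities[i][j]) ** 2
    return s

# A perfect 2-d embedding of a 3-4-5 triangle's distances has zero stress
print(raw_stress([(0, 0), (3, 0), (3, 4)],
                 [[0, 3, 5], [3, 0, 4], [5, 4, 0]]))  # → 0.0
```

The symmetry of `math.dist` is exactly what the Finsler generalization relaxes: an asymmetric metric allows the dissimilarity from i to j to differ from the one from j to i.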

MCML Authors
Link to website

Simon Weber

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1967]
S. Dziadzio, V. Udandarao, K. Roth, A. Prabhu, Z. Akata, S. Albanie and M. Bethge.
How to Merge Your Multimodal Models Over Time?
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

Model merging combines multiple expert models - finetuned from a base foundation model on diverse tasks and domains - into a single, more capable model. However, most existing model merging approaches assume that all experts are available simultaneously. In reality, new tasks and domains emerge progressively over time, requiring strategies to integrate the knowledge of expert models as they become available: a process we call temporal model merging. The temporal dimension introduces unique challenges not addressed in prior work, raising new questions such as: when training for a new task, should the expert model start from the merged past experts or from the original base model? Should we merge all models at each time step? Which merging techniques are best suited for temporal merging? Should different strategies be used to initialize the training and deploy the model? To answer these questions, we propose a unified framework called TIME - Temporal Integration of Model Expertise - which defines temporal model merging across three axes: (1) Initialization Phase, (2) Deployment Phase, and (3) Merging Technique. Using TIME, we study temporal model merging across model sizes, compute budgets, and learning horizons on the FoMo-in-Flux benchmark. Our comprehensive suite of experiments across TIME allows us to uncover key insights for temporal model merging, offering a better understanding of current challenges and best practices for effective temporal model merging.
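The simplest merging technique in this design space is parameter averaging of expert checkpoints; a minimal sketch over plain parameter dictionaries (stand-ins for real model state dicts):

```python
def merge_weights(expert_states, coeffs=None):
    """Uniform (or weighted) parameter averaging of expert checkpoints,
    the most basic technique a temporal-merging strategy can apply at
    each time step."""
    if coeffs is None:
        coeffs = [1.0 / len(expert_states)] * len(expert_states)
    merged = {}
    for name in expert_states[0]:
        merged[name] = sum(c * s[name] for c, s in zip(coeffs, expert_states))
    return merged

expert_a = {"w": 1.0, "b": 0.0}
expert_b = {"w": 3.0, "b": 2.0}
print(merge_weights([expert_a, expert_b]))  # → {'w': 2.0, 'b': 1.0}
```

Non-uniform `coeffs` let later experts contribute more, one of the knobs a temporal merging strategy can turn when a new expert arrives.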

MCML Authors
Link to website

Karsten Roth

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[1966]
T. Hannan, M. M. Islam, J. Gu, T. Seidl and G. Bertasius.
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL GitHub
Abstract

Large language models (LLMs) excel at retrieving information from lengthy text, but their vision-language counterparts (VLMs) face difficulties with hour-long videos, especially for temporal grounding. Specifically, these VLMs are constrained by frame limitations, often losing essential temporal details needed for accurate event localization in extended video content. We propose ReVisionLLM, a recursive vision-language model designed to locate events in hour-long videos. Inspired by human search strategies, our model initially targets broad segments of interest, progressively revising its focus to pinpoint exact temporal boundaries. Our model can seamlessly handle videos of vastly different lengths, from minutes to hours. We also introduce a hierarchical training strategy that starts with short clips to capture distinct events and progressively extends to longer videos. To our knowledge, ReVisionLLM is the first VLM capable of temporal grounding in hour-long videos, outperforming previous state-of-the-art methods across multiple datasets by a significant margin (+2.6% R1@0.1 on MAD).

MCML Authors
Link to website

Tanveer Hannan

Database Systems and Data Mining

Link to Profile Thomas Seidl

Thomas Seidl

Prof. Dr.

Database Systems and Data Mining


[1965]
S. Kim, R. Xiao, M.-I. Georgescu, S. Alaniz and Z. Akata.
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.

MCML Authors
Link to website

Sanghwan Kim

Interpretable and Reliable Machine Learning

Link to website

Rui Xiao

Interpretable and Reliable Machine Learning

Link to website

Iuliana Georgescu

Dr.

Interpretable and Reliable Machine Learning

Link to website

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[1964]
D. Mildenberger, P. Hager, D. Rückert and M. Menten.
A Tale of Two Classes: Adapting Supervised Contrastive Learning to Binary Imbalanced Datasets.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

Supervised contrastive learning (SupCon) has proven to be a powerful alternative to the standard cross-entropy loss for classification of multi-class balanced datasets. However, it struggles to learn well-conditioned representations of datasets with long-tailed class distributions. This problem is potentially exacerbated for binary imbalanced distributions, which are commonly encountered during many real-world problems such as medical diagnosis. In experiments on seven binary datasets of natural and medical images, we show that the performance of SupCon decreases with increasing class imbalance. To substantiate these findings, we introduce two novel metrics that evaluate the quality of the learned representation space. By measuring the class distribution in local neighborhoods, we are able to uncover structural deficiencies of the representation space that classical metrics cannot detect. Informed by these insights, we propose two new supervised contrastive learning strategies tailored to binary imbalanced datasets that improve the structure of the representation space and increase downstream classification accuracy over standard SupCon by up to 35%. We make our code available.
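The SupCon loss underlying this study pulls together embeddings that share a label and pushes apart the rest; a pure-Python sketch of the standard formulation (assuming L2-normalized embeddings; the paper's proposed variants modify this baseline):

```python
import math

def supcon_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss: for each anchor, positives are the
    other samples with the same label; all non-anchor samples appear
    in the denominator."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    loss, anchors = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        denom = sum(math.exp(dot(embeddings[i], embeddings[a]) / tau)
                    for a in range(n) if a != i)
        loss -= sum(math.log(math.exp(dot(embeddings[i], embeddings[p]) / tau) / denom)
                    for p in positives) / len(positives)
        anchors += 1
    return loss / anchors

# Two orthogonal clusters: labels aligned with the clusters give a much
# lower loss than labels scattered across them
embs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
aligned = supcon_loss(embs, [0, 0, 1, 1])
scattered = supcon_loss(embs, [0, 1, 0, 1])
```

With severe binary imbalance the minority class contributes few anchors and positives to this sum, which is one way to see why the representation quality degrades as the paper reports.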

MCML Authors
David Mildenberger

Artificial Intelligence in Healthcare and Medicine

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine

Martin Menten

Dr.

Artificial Intelligence in Healthcare and Medicine


[1963]
E. Özsoy, C. Pellegrini, T. Czempiel, F. Tristram, K. Yuan, D. Bani-Harouni, U. Eck, B. Busam, M. Keicher and N. Navab.
MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL GitHub
Abstract

Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment for enhancing surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale and realism, and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data, and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments, demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establish a new benchmark for holistic OR understanding, and open the path towards multimodal scene analysis in complex, high-stakes environments.

MCML Authors
Ege Özsoy

Computer Aided Medical Procedures & Augmented Reality

Chantal Pellegrini

Computer Aided Medical Procedures & Augmented Reality

Felix Tristram

Computer Aided Medical Procedures & Augmented Reality

Kun Yuan

Computer Aided Medical Procedures & Augmented Reality

David Bani-Harouni

Computer Aided Medical Procedures & Augmented Reality

Benjamin Busam

Dr.

Computer Aided Medical Procedures & Augmented Reality

Matthias Keicher

Dr.

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1962]
R. Qorbani, G. Villani, T. Panagiotakopoulos, M. B. Colomer, L. Härenstam-Nielsen, M. Segu, P. L. Dovesi, J. Karlgren, D. Cremers, F. Tombari and M. Poggi.
Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

Open-vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries, providing versatile performance on novel datasets. However, large shifts between training and test domains degrade their performance, requiring fine-tuning for effective real-world application. We introduce Semantic Library Adaptation (SemLa), a novel framework for training-free, test-time domain adaptation. SemLa leverages a library of LoRA-based adapters indexed with CLIP embeddings, dynamically merging the most relevant adapters based on proximity to the target domain in the embedding space. This approach constructs an ad-hoc model tailored to each specific input without additional training. Our method scales efficiently, enhances explainability by tracking adapter contributions, and inherently protects data privacy, making it ideal for sensitive applications. Comprehensive experiments on an 18-domain benchmark built over 10 standard datasets demonstrate SemLa’s superior adaptability and performance across diverse settings, establishing a new standard in domain adaptation for open-vocabulary semantic segmentation.
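The retrieval-and-fusion step can be pictured with a small sketch. This is not the authors' implementation: the function name, the top-k selection, and the softmax weighting are illustrative assumptions; only the idea of merging LoRA weight deltas by proximity of CLIP-style embeddings comes from the abstract.

```python
import numpy as np

def merge_lora_adapters(query_emb, library_keys, library_deltas, top_k=2, temp=0.1):
    """Fuse LoRA weight deltas retrieved by embedding similarity (sketch).

    query_emb:      (d,) embedding of the target-domain input
    library_keys:   (n, d) one key embedding per adapter in the library
    library_deltas: list of n weight-delta arrays (all the same shape)
    """
    # Cosine similarity between the query and every adapter key.
    q = query_emb / np.linalg.norm(query_emb)
    keys = library_keys / np.linalg.norm(library_keys, axis=1, keepdims=True)
    sims = keys @ q
    # Keep only the top-k most relevant adapters.
    idx = np.argsort(sims)[-top_k:]
    # Softmax weights over the selected similarities.
    w = np.exp(sims[idx] / temp)
    w = w / w.sum()
    # Convex combination of the selected LoRA deltas ("fusion").
    return sum(wi * library_deltas[i] for wi, i in zip(w, idx))
```

In practice the keys would be CLIP embeddings indexing each domain-specific adapter, and the deltas the low-rank updates applied per layer; the training-free aspect is that only this weighted merge, not any optimization, happens at test time.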

MCML Authors
Linus Härenstam-Nielsen

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence

Federico Tombari

PD Dr.

Computer Aided Medical Procedures & Augmented Reality


[1961]
P. Roetzer, V. Ehm, D. Cremers, Z. Lähner and F. Bernard.
Higher-Order Ratio Cycles for Fast and Globally Optimal Shape Matching.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

In this work we address various shape matching problems that can be cast as finding cyclic paths in a product graph. This involves for example 2D-3D shape matching, 3D shape matching, or the matching of a contour to a graph. In this context, matchings are typically obtained as the minimum cost cycle in the product graph. Instead, inspired by related works on model-based image segmentation, we consider minimum ratio cycles, which we combine with the recently introduced conjugate product graph in order to allow for higher-order matching costs. With that, on the one hand we avoid the bias of obtaining matchings that involve fewer/shorter edges, while on the other hand being able to impose powerful geometric regularisation, e.g. to avoid zig-zagging. In our experiments we demonstrate that this not only leads to improved matching accuracy in most cases, but also to significantly reduced runtimes (up to two orders of magnitude, depending on the setting). Our GPU implementation will be made publicly available upon acceptance.
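The shift from minimum-cost to minimum-ratio cycles can be made concrete. Instead of minimising the summed cost of a cycle $C$ in the product graph, one minimises the ratio

```latex
\rho(C) = \frac{\sum_{e \in C} c(e)}{\sum_{e \in C} \ell(e)},
```

so matchings with more or longer edges are not penalised per se, removing the bias mentioned above. Such cycles are classically found by searching for the value $\lambda^\ast$ at which the minimum-cost cycle under the reweighted edge costs $c(e) - \lambda\,\ell(e)$ becomes zero, e.g. via binary search over $\lambda$ (a standard result for ratio cycles; the abstract does not state which solver the paper uses).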

MCML Authors
Viktoria Ehm

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1960]
K. Roth, Z. Akata, D. Damen, I. Balažević and O. J. Hénaff.
Context-Aware Multimodal Pretraining.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple, but carefully designed extension to multimodal pretraining which enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find up to four-fold improvements in test-time sample efficiency, and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations easily surpass more complex and expensive optimization-based schemes, vastly simplifying generalization to new domains.

MCML Authors
Karsten Roth

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[1959]
L. Sang, Z. Canfes, D. Cao, R. Marin, F. Bernard and D. Cremers.
4Deform: Neural Surface Deformation for Robust Shape Interpolation.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

Generating realistic intermediate shapes between non-rigidly deformed shapes is a challenging task in computer vision, especially with unstructured data (e.g., point clouds) where temporal consistency across frames is lacking and topologies are changing. Most interpolation methods are designed for structured data (i.e., meshes) and do not apply to real-world point clouds. In contrast, our approach, 4Deform, leverages neural implicit representation (NIR) to enable shape deformation with freely changing topology. Unlike previous mesh-based methods that learn vertex-based deformation fields, our method learns a continuous velocity field in Euclidean space. Thus, it is suitable for less structured data such as point clouds. Additionally, our method does not require intermediate-shape supervision during training; instead, we incorporate physical and geometrical constraints to regularize the velocity field. We reconstruct intermediate surfaces using a modified level-set equation, directly linking our NIR with the velocity field. Experiments show that our method significantly outperforms previous NIR approaches across various scenarios (e.g., noisy, partial, topology-changing, non-isometric shapes) and, for the first time, enables new applications like 4D Kinect sequence upsampling and real-world high-resolution mesh deformation.
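The link between the implicit representation and the velocity field builds on the classical level-set advection equation: if $\phi(x, t)$ is the implicit field whose zero level set is the surface and $v(x, t)$ the learned velocity field, the surface evolves as

```latex
\frac{\partial \phi}{\partial t} + v \cdot \nabla \phi = 0 .
```

The paper's "modified" variant presumably adds its physical and geometrical constraints on top of this form; the exact modification is not specified in the abstract.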

MCML Authors
Lu Sang

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1958]
D. Schnaus, N. Araslanov and D. Cremers.
It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL URL
Abstract

The platonic representation hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase. In particular, pairwise distances within each modality become more similar. This suggests that as foundation models mature, it may become possible to match vision and language embeddings in a fully unsupervised fashion, i.e., without parallel data. We present the first study towards this prospect, and investigate conformity of existing vision and language foundation models in the context of ‘blind’ matching. First, we formulate unsupervised matching as a quadratic assignment problem and introduce a novel heuristic that outperforms previous solvers. We also develop a technique to find optimal matching problems, for which a non-trivial match is very likely. Second, we conduct an extensive study deploying a range of vision and language models on four datasets. Our analysis reveals that for many problem instances, vision and language representations can indeed be matched without supervision. This finding opens the possibility of exciting applications that embed semantic knowledge into other modalities. As a showcase, we demonstrate a proof-of-concept unsupervised classifier, which achieves non-trivial classification accuracy without any image-text annotation.
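One common way to write the unsupervised matching problem described above as a quadratic assignment problem (the abstract does not give the paper's exact formulation) is over permutation matrices $P$:

```latex
\min_{P \in \mathcal{P}_n} \; \bigl\| D^{v} - P \, D^{l} P^{\top} \bigr\|_F^2 ,
```

where $D^{v}$ and $D^{l}$ are the pairwise-distance matrices of the vision and language embeddings. The platonic representation hypothesis predicts these matrices become increasingly similar, making a non-trivial minimiser recoverable without any paired data.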

MCML Authors
Nikita Araslanov

Dr.

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1957]
J. Schusterbauer, M. Gui, F. Fundel and B. Ommer.
Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

Recent advancements in diffusion models have established new benchmarks in both generative tasks and downstream applications. In contrast, flow matching models have shown promising improvements in performance but have not been as extensively explored, particularly due to the difficulty of inheriting knowledge from a pretrained diffusion foundation model. In this work, we propose a novel method to bridge the gap between pretrained diffusion models and flow matching models by aligning their trajectories and matching their objectives. Our approach mathematically formalizes this alignment and enables the efficient transfer of knowledge from diffusion priors to flow matching models. We demonstrate that our method outperforms traditional diffusion and flow matching finetuning, achieving competitive results across a variety of tasks.
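For context, the flow matching objective that a diffusion prior would be aligned to typically takes the conditional form with a linear interpolation path $x_t = (1-t)\,x_0 + t\,x_1$ between noise $x_0$ and data $x_1$:

```latex
\mathcal{L}_{\mathrm{FM}}
  = \mathbb{E}_{t,\,x_0,\,x_1}
    \bigl\| v_\theta(x_t, t) - (x_1 - x_0) \bigr\|^2 .
```

Per the abstract, Diff2Flow's contribution is to align trajectories and objectives so that a pretrained diffusion model can initialise the velocity network $v_\theta$; the formula above is the standard baseline objective, not the paper's specific alignment loss.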

MCML Authors
Johannes Schusterbauer

Computer Vision & Learning

Björn Ommer

Prof. Dr.

Computer Vision & Learning


[1956]
N. Stracke, S. A. Baumann, K. Bauer, F. Fundel and B. Ommer.
CleanDIFT: Diffusion Features without Noise.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model to obtain the semantic features, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features that cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features readily outperform previous diffusion features by a wide margin in a wide variety of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.

MCML Authors
Björn Ommer

Prof. Dr.

Computer Vision & Learning


[1955]
F. Wimbauer, W. Chen, D. Muhle, C. Rupprecht and D. Cremers.
AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL URL
Abstract

Estimating camera motion and intrinsics from casual videos is a core challenge in computer vision. Traditional bundle-adjustment based methods, such as SfM and SLAM, struggle to perform reliably on arbitrary data. Although specialized SfM approaches have been developed for handling dynamic scenes, they either require intrinsics or computationally expensive test-time optimization and often fall short in performance. Recently, methods like Dust3r have reformulated the SfM problem in a more data-driven way. While such techniques show promising results, they are still 1) not robust towards dynamic objects and 2) require labeled data for supervised training. As an alternative, we propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from a dynamic video sequence in a feed-forward fashion. Our intuition is that such a network can learn strong priors over realistic camera motions. To scale up our training, we rely on an uncertainty-based loss formulation and pre-trained depth and flow networks instead of motion or trajectory supervision. This allows us to use diverse, unlabelled video datasets obtained mostly from YouTube. Additionally, we ensure that the predicted trajectory does not accumulate drift over time through a lightweight trajectory refinement step. We test AnyCam on established datasets, where it delivers accurate camera poses and intrinsics both qualitatively and quantitatively. Furthermore, even with trajectory refinement, AnyCam is significantly faster than existing works for SfM in dynamic settings. Finally, by combining camera information, uncertainty, and depth, our model can produce high-quality 4D point clouds in a feed-forward fashion.

MCML Authors
Felix Wimbauer

Computer Vision & Artificial Intelligence

Weirong Chen

Computer Vision & Artificial Intelligence

Dominik Muhle

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1954]
R. Xiao, S. Kim, M.-I. Georgescu, Z. Akata and S. Alaniz.
FLAIR: VLM with Fine-grained Language-informed Image Representations.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL GitHub
Abstract

CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details about an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance on both existing multimodal retrieval benchmarks and our newly introduced fine-grained retrieval task, which evaluates vision-language models’ ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR, trained on 30M image-text pairs, in capturing fine-grained visual information, including zero-shot semantic segmentation, outperforming models trained on billions of pairs.

MCML Authors
Rui Xiao

Interpretable and Reliable Machine Learning

Sanghwan Kim

Interpretable and Reliable Machine Learning

Iuliana Georgescu

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning


[1953]
Y. Xie, V. Ehm, P. Roetzer, N. Amrani, M. Gao, F. Bernard and D. Cremers.
EchoMatch: Partial-to-Partial Shape Matching via Correspondence Reflection.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

Finding correspondences between 3D shapes is a crucial problem in computer vision and graphics. While most research has focused on finding correspondences in settings where at least one of the shapes is complete, the realm of partial-to-partial shape matching remains under-explored. Yet it is of importance since, in many applications, shapes are only observed partially due to occlusion or scanning. Finding correspondences between partial shapes comes with an additional challenge: we not only want to identify correspondences between points on either shape but also have to determine which points of each shape actually have a partner. To tackle this challenging problem, we present EchoMatch, a novel framework for partial-to-partial shape matching that incorporates the concept of correspondence reflection to enable overlap prediction within a functional map framework. With this approach, we show that we can outperform current SOTA methods in challenging partial-to-partial shape matching problems.

MCML Authors
Viktoria Ehm

Computer Vision & Artificial Intelligence

Maolin Gao

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1952]
Y. Yuan, Y. Xia, D. Cremers and M. Sester.
SparseAlign: a Fully Sparse Framework for Cooperative Object Detection.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

Cooperative perception can increase the view field and decrease the occlusion of an ego vehicle, hence improving the perception performance and safety of autonomous driving. Despite the success of previous works on cooperative object detection, they mostly operate on dense Bird’s Eye View (BEV) feature maps, which is computationally demanding and can hardly be extended to long-range detection problems. More efficient fully sparse frameworks are rarely explored. In this work, we design a fully sparse framework, SparseAlign, with three key features: an enhanced sparse 3D backbone, a query-based temporal context learning module, and a robust detection head specially tailored for sparse features. Extensive experimental results on both the OPV2V and DairV2X datasets show that our framework, despite its sparsity, outperforms the state of the art with lower communication bandwidth requirements. In addition, experiments on the OPV2Vt and DairV2Xt datasets for time-aligned cooperative object detection also show a significant performance gain compared to the baseline works.

MCML Authors
Yan Xia

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1951]
G. Zhang, M. L. A. Fok, J. Ma, Y. Xia, D. Cremers, P. Torr, V. Tresp and J. Gu.
Localizing Events in Videos with Multimodal Queries.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

Video understanding is a pivotal task in the digital era, yet the dynamic and multi-event nature of videos makes them labor-intensive and computationally demanding to process. Thus, localizing a specific event given a semantic query has gained importance in both user-oriented applications like video search and academic research into video foundation models. A significant limitation in current research is that semantic queries are typically in natural language that depicts the semantics of the target event. This setting overlooks the potential for multimodal semantic queries composed of images and texts. To address this gap, we introduce a new benchmark, ICQ, for localizing events in videos with multimodal queries, along with a new evaluation dataset ICQ-Highlight. Our new benchmark aims to evaluate how well models can localize an event given a multimodal semantic query that consists of a reference image, which depicts the event, and a refinement text to adjust the images’ semantics. To systematically benchmark model performance, we include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains. We propose 3 adaptation methods that tailor existing models to our new setting and evaluate 10 SOTA models, ranging from specialized to large-scale foundation models. We believe this benchmark is an initial step toward investigating multimodal queries in video event localization.

MCML Authors
Gengyuan Zhang

Database Systems and Data Mining

Yan Xia

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1950]
D. Zhu, Y. Di, S. Gavranovic and S. Ilic.
SeaLion: Semantic Part-Aware Latent Point Diffusion Models for 3D Generation.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL
Abstract

Denoising diffusion probabilistic models have achieved significant success in point cloud generation, enabling numerous downstream applications, such as generative data augmentation and 3D model editing. However, little attention has been given to generating point clouds with point-wise segmentation labels, as well as to developing evaluation metrics for this task. Therefore, in this paper, we present SeaLion, a novel diffusion model designed to generate high-quality and diverse point clouds with fine-grained segmentation labels. Specifically, we introduce the semantic part-aware latent point diffusion technique, which leverages the intermediate features of the generative models to jointly predict the noise for perturbed latent points and associated part segmentation labels during the denoising process, and subsequently decodes the latent points to point clouds conditioned on part segmentation labels. To effectively evaluate the quality of generated point clouds, we introduce a novel point cloud pairwise distance calculation method named part-aware Chamfer distance (p-CD). This method enables existing metrics, such as 1-NNA, to measure both the local structural quality and inter-part coherence of generated point clouds. Experiments on the large-scale synthetic dataset ShapeNet and real-world medical dataset IntrA demonstrate that SeaLion achieves remarkable performance in generation quality and diversity, outperforming the existing state-of-the-art model, DiffFacto, by 13.33% and 6.52% on 1-NNA (p-CD) across the two datasets. Experimental analysis shows that SeaLion can be trained semi-supervised, thereby reducing the demand for labeling efforts. Lastly, we validate the applicability of SeaLion in generative data augmentation for training segmentation models and the capability of SeaLion to serve as a tool for part-aware 3D shape editing.
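The part-aware Chamfer distance (p-CD) is not fully specified in the abstract; a plausible minimal reading — the standard Chamfer distance computed per shared part label and averaged, with all names here hypothetical — might look like:

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise Euclidean distances via broadcasting.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Nearest-neighbour distance in each direction, averaged.
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def part_aware_chamfer(pts_a, lbl_a, pts_b, lbl_b):
    """Chamfer distance restricted to matching part labels, averaged over
    the labels present in both clouds (illustrative sketch of a p-CD)."""
    labels = np.intersect1d(np.unique(lbl_a), np.unique(lbl_b))
    dists = [chamfer(pts_a[lbl_a == l], pts_b[lbl_b == l]) for l in labels]
    return float(np.mean(dists))
```

Restricting the nearest-neighbour search to same-label points is what lets a metric like 1-NNA built on this distance penalise incoherent part structure that a global Chamfer distance would miss; the paper's exact definition may differ.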

MCML Authors
Dekai Zhu

Computer Aided Medical Procedures & Augmented Reality


[1949]
C. Curreli, D. Muhle, A. Saroha, Z. Ye, R. Marin and D. Cremers.
Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. Preprint available. arXiv GitHub
Abstract

Probabilistic human motion prediction aims to forecast multiple possible future movements from past observations. While current approaches report high diversity and realism, they often generate motions with undetected limb stretching and jitter. To address this, we introduce SkeletonDiffusion, a latent diffusion model that embeds an explicit inductive bias on the human body within its architecture and training. Our model is trained with a novel nonisotropic Gaussian diffusion formulation that aligns with the natural kinematic structure of the human skeleton. Results show that our approach outperforms conventional isotropic alternatives, consistently generating realistic predictions while avoiding artifacts such as limb distortion. Additionally, we identify a limitation in commonly used diversity metrics, which may inadvertently favor models that produce inconsistent limb lengths within the same sequence. SkeletonDiffusion sets a new benchmark on three real-world datasets, outperforming various baselines across multiple evaluation metrics.

MCML Authors
Cecilia Curreli

Computer Vision & Artificial Intelligence

Dominik Muhle

Computer Vision & Artificial Intelligence

Zhenzhang Ye

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1948]
O. Hahn, C. Reich, N. Araslanov, D. Cremers, C. Rupprecht and S. Roth.
Scene-Centric Unsupervised Panoptic Segmentation.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. Preprint available. arXiv GitHub
Abstract

Unsupervised panoptic segmentation aims to partition an image into semantically meaningful regions and distinct object instances without training on manually annotated data. In contrast to prior work on unsupervised panoptic scene understanding, we eliminate the need for object-centric training data, enabling the unsupervised understanding of complex scenes. To that end, we present the first unsupervised panoptic method that directly trains on scene-centric imagery. In particular, we propose an approach to obtain high-resolution panoptic pseudo labels on complex scene-centric data, combining visual representations, depth, and motion cues. Utilizing both pseudo-label training and a panoptic self-training strategy yields a novel approach that accurately predicts panoptic segmentation of complex scenes without requiring any human annotations. Our approach significantly improves panoptic quality, e.g., surpassing the recent state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4 percentage points in PQ.

MCML Authors
Christoph Reich

Computer Vision & Artificial Intelligence

Nikita Araslanov

Dr.

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1947]
W. Li, H. Xu, J. Huang, H. Jung, P. Yu, N. Navab and B. Busam.
GCE-Pose: Global Context Enhancement for Category-level Object Pose Estimation.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. Preprint available. URL GitHub
Abstract

A key challenge in model-free category-level pose estimation is the extraction of contextual object features that generalize across varying instances within a specific category. Recent approaches leverage foundational features to capture semantic and geometry cues from data. However, these approaches fail under partial visibility. We overcome this with a first-complete-then-aggregate strategy for feature extraction utilizing class priors. In this paper, we present GCE-Pose, a method that enhances pose estimation for novel instances by integrating a category-level global context prior. GCE-Pose performs semantic shape reconstruction with a proposed Semantic Shape Reconstruction (SSR) module. Given an unseen partial RGB-D object instance, our SSR module reconstructs the instance’s global geometry and semantics by deforming category-specific 3D semantic prototypes through a learned deep Linear Shape Model. We further introduce a Global Context Enhanced (GCE) feature fusion module that effectively fuses features from partial RGB-D observations and the reconstructed global context. Extensive experiments validate the impact of our global context prior and the effectiveness of the GCE fusion module, demonstrating that GCE-Pose significantly outperforms existing methods on the challenging real-world datasets HouseCat6D and NOCS-REAL275.

MCML Authors
Weihang Li

Computer Aided Medical Procedures & Augmented Reality

Junwen Huang

Computer Aided Medical Procedures & Augmented Reality

Hyunjun Jung

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Benjamin Busam

Dr.

Computer Aided Medical Procedures & Augmented Reality


[1946]
D. Sinitsyn, L. Härenstam-Nielsen and D. Cremers.
PRaDA: Projective Radial Distortion Averaging.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. Preprint available. arXiv URL
Abstract

We tackle the problem of automatic calibration of radially distorted cameras in challenging conditions. Accurately determining distortion parameters typically requires either 1) solving the full Structure from Motion (SfM) problem involving camera poses, 3D points, and the distortion parameters, which is only possible if many images with sufficient overlap are provided, or 2) relying heavily on learning-based methods that are comparatively less accurate. In this work, we demonstrate that distortion calibration can be decoupled from 3D reconstruction, maintaining the accuracy of SfM-based methods while avoiding many of the associated complexities. This is achieved by working in projective space, where the geometry is unique up to a homography, which encapsulates all camera parameters except for distortion. Our proposed method, Projective Radial Distortion Averaging, averages multiple distortion estimates in a fully projective framework without creating 3D points or performing full bundle adjustment. By relying on pairwise projective relations, our method supports any feature-matching approach without constructing point tracks across multiple images.

MCML Authors
Daniil Sinitsyn, Computer Vision & Artificial Intelligence
Linus Härenstam-Nielsen, Computer Vision & Artificial Intelligence
Daniel Cremers, Prof. Dr., Computer Vision & Artificial Intelligence


[1945]
Y. Luo, R. Hoffmann, Y. Xia, O. Wysocki, B. Schwab, T. H. Kolbe and D. Cremers.
RADLER: Radar Object Detection Leveraging Semantic 3D City Models and Self-Supervised Radar-Image Learning.
PBVS @CVPR 2025 - 21st IEEE Workshop on Perception Beyond the Visible Spectrum at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025). Nashville, TN, USA, Jun 11-15, 2025. To be published. GitHub
Abstract

Semantic 3D city models are easily accessible worldwide, providing accurate, object-oriented, and semantically rich 3D priors. To date, their potential to mitigate the impact of noise on radar object detection remains under-explored. In this paper, we first introduce a unique dataset, RadarCity, comprising 54K synchronized radar-image pairs and semantic 3D city models. Moreover, we propose a novel neural network, RADLER, leveraging the effectiveness of contrastive self-supervised learning (SSL) and semantic 3D city models to enhance radar object detection of pedestrians, cyclists, and cars. Specifically, we first obtain robust radar features via an SSL network in the radar-image pretext task. We then use a simple yet effective feature fusion strategy to incorporate semantic-depth features from semantic 3D city models. With prior 3D information as guidance, RADLER obtains more fine-grained details to enhance radar object detection. We extensively evaluate RADLER on the collected RadarCity dataset and demonstrate average improvements of 5.46% in mean average precision (mAP) and 3.51% in mean average recall (mAR) over previous radar object detection methods. We believe this work will foster further research on semantic-guided and map-supported radar object detection.

MCML Authors
Yan Xia, Dr., * Former Member
Daniel Cremers, Prof. Dr., Computer Vision & Artificial Intelligence


[1944]
D. Zverev, T. Wiedemer, A. Prabhu, M. Bethge, W. Brendel and A. Koepke.
VGGSounder: Audio-Visual Evaluations for Foundation Models.
Sight and Sound @CVPR 2025 - Workshop Sight and Sound at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025). Nashville, TN, USA, Jun 11-15, 2025. PDF
Abstract

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The classification dataset VGGSound is commonly used as a benchmark for evaluating audio-visual understanding. However, our analysis identifies several critical issues in VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These flaws lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set extending VGGSound that is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance and revealing previously unnoticed model limitations. VGGSounder offers a robust benchmark supporting the future development of audio-visual foundation models.

MCML Authors
Daniil Zverev, Computer Vision & Artificial Intelligence


[1943]
W. Tang, W. Li, X. Liang, O. Wysocki, F. Biljecki, C. Holst and B. Jutzi.
Texture2LoD3: Enabling LoD3 Building Reconstruction With Panoramic Images.
USM3D @CVPR 2025 - 2nd Workshop on Urban Scene Modeling at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025). Nashville, TN, USA, Jun 11-15, 2025. To be published. URL GitHub
Abstract

Despite recent advancements in surface reconstruction, Level of Detail (LoD) 3 building reconstruction remains an unresolved challenge. The main issue pertains to the object-oriented modelling paradigm, which requires georeferencing, watertight geometry, facade semantics, and low-poly representation, in contrast to unstructured, mesh-oriented models. In Texture2LoD3, we introduce a novel method leveraging the ubiquity of 3D building model priors and panoramic street-level images, enabling the reconstruction of LoD3 building models. We observe that prior low-detail building models can serve as valid planar targets for ortho-rectifying street-level panoramic images. Moreover, deploying segmentation on accurately textured low-detail building surfaces supports maintaining essential georeferencing, watertight geometry, and low-poly representation for LoD3 reconstruction. In the absence of LoD3 validation data, we additionally introduce the ReLoD3 dataset, on which we experimentally demonstrate that our method improves facade segmentation accuracy by 11% and can replace costly manual projections. We believe that Texture2LoD3 can scale the adoption of LoD3 models, opening applications in estimating building solar potential or enhancing autonomous driving simulations.

MCML Authors
Weihang Li, Computer Aided Medical Procedures & Augmented Reality


[1942]
T. Weber, M. Ingrisch, B. Bischl and D. Rügamer.
Preventing Sensitive Information Leakage via Post-hoc Orthogonalization with Application to Chest Radiograph Embeddings.
PAKDD 2025 - 29th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Sydney, Australia, Jun 10-13, 2025. DOI GitHub
Abstract

Deep learning has substantially advanced data analysis across various fields. However, research indicates that protected characteristics, such as age, sex, and race, are often implicitly encoded within the deep feature representations, or embeddings, generated by neural networks. This encoding can lead to inherent biases, which in turn may influence decision-making processes. In clinical settings, in particular, such biases risk leading to unfair treatment of certain subgroups, potentially resulting in serious consequences. After analyzing the sources of these biases in the field of radiology, we illustrate how embeddings of chest radiographs (CXRs) can be corrected to remove the influence of protected features. To showcase the harms of such incidents, we study the MIMIC and CheXpert datasets with three prominent pre-trained models: a supervised contrastive, a self-supervised contrastive, and a baseline classifier model. Our experiments reveal a significant influence of protected features on predictions of pathologies in CXRs, demonstrating the potential harm of such practices. We then propose a correction method, removing these harmful effects while maintaining competitive predictive performance.

MCML Authors
Michael Ingrisch, Prof. Dr., Clinical Data Science in Radiology
Bernd Bischl, Prof. Dr., Statistical Learning and Data Science
David Rügamer, Prof. Dr., Statistics, Data Science and Machine Learning


[1941]
M. Aljoud, G. M. Tavares, C. Leiber and T. Seidl.
DCMatch - Identify Matching Architectures in Deep Clustering through Meta-Learning.
PAKDD 2025 - 29th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Sydney, Australia, Jun 10-13, 2025. To be published.
Abstract

The effectiveness of deep clustering algorithms like Deep Embedded Clustering (DEC) is heavily influenced by the architecture of the neural network employed. However, selecting an optimal architecture is challenging due to the absence of labels in clustering tasks, which makes traditional Neural Architecture Search (NAS) methods unsuitable. To address this, we propose a novel dataset characterization method specifically tailored for image datasets, combining deep-learning-based and statistical feature extraction techniques. By utilizing features extracted from a small subset of images, our method effectively captures both high-level semantic and low-level statistical properties of the data. These dataset characteristics are then employed in a meta-learning framework to recommend autoencoder architectures likely to outperform default configurations. Extensive experiments on 20 image datasets validate the robustness of our approach, achieving improved clustering performance on 16 datasets compared to the baseline configuration.

MCML Authors
Gabriel Marques Tavares, Dr., Database Systems and Data Mining
Collin Leiber, Dr., * Former Member
Thomas Seidl, Prof. Dr., Database Systems and Data Mining


[1940]
M. Ahmadpanah, M. Gobbi, D. Hedin, J. Kinder and A. Sabelfeld.
CodeX: Contextual Flow Tracking for Browser Extensions.
CODASPY 2025 - 15th ACM Conference on Data and Application Security and Privacy. Pittsburgh, PA, USA, Jun 04-06, 2025. DOI
Abstract

Browser extensions put millions of users at risk when misusing their elevated privileges. Despite the current practices of semi-automated code vetting, privacy-violating extensions still thrive in the official stores. We propose an approach for tracking contextual flows from browser-specific sensitive sources like cookies, browsing history, bookmarks, and search terms to suspicious network sinks through network requests. We demonstrate the effectiveness of the approach by a prototype called CodeX that leverages the power of CodeQL while breaking away from the conservativeness of bug-finding flavors of the traditional CodeQL taint analysis. Applying CodeX to the extensions published on the Chrome Web Store between March 2021 and March 2024 identified 1,588 extensions with risky flows. Manual verification of 339 of those extensions resulted in flagging 212 as privacy-violating, impacting up to 3.6M users.

MCML Authors
Johannes Kinder, Prof. Dr., Programming Languages and Artificial Intelligence


[1939]
J. Kaiser, J. Eigenmann, D. Rückert and G. Kaissis.
User-Level Differential Privacy in Medical Machine Learning.
TPDP 2025 - Workshop on Theory and Practice of Differential Privacy. Google, Mountain View, CA, USA, Jun 02-03, 2025. PDF
Abstract

We address the challenge of ensuring user-level DP when individuals contribute varying numbers of data records to a dataset. While group privacy can be used to aggregate record-level budgets, it can be overly pessimistic and lacks flexibility when users contribute varying numbers of data points. We propose a method for accounting for arbitrary numbers of records per user while maintaining a fixed per-user privacy guarantee by leveraging individual privacy assignment. Experimentally, our method yields excellent utility comparable to record-level DP while providing a more meaningful/interpretable protection.
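The gap between group privacy and individual privacy assignment that the abstract describes can be illustrated with a small numeric sketch under basic sequential composition of pure ε-DP. The function names and budget values below are illustrative assumptions, not the paper's actual accounting method:

```python
# Hypothetical illustration: contrast a worst-case group-privacy bound
# with a fixed per-user budget split across each user's records,
# under basic sequential composition of pure epsilon-DP.

def group_privacy_bound(record_eps: float, counts: list[int]) -> float:
    """Worst-case user-level epsilon: every user is bounded by the
    maximum number of records any single user contributes."""
    return record_eps * max(counts)

def individual_assignment(user_budget: float, counts: list[int]) -> list[float]:
    """Give each user the same fixed user-level budget and split it
    evenly across that user's records."""
    return [user_budget / n for n in counts]

counts = [1, 3, 10]   # records contributed by three users
record_eps = 0.5      # per-record budget assumed by group privacy
user_budget = 0.5     # fixed per-user guarantee

print(group_privacy_bound(record_eps, counts))    # heavy users inflate the bound
print(individual_assignment(user_budget, counts)) # every user keeps the same guarantee
```

The point of the sketch: the group-privacy bound is driven entirely by the heaviest contributor, while individual assignment keeps the per-user guarantee fixed regardless of how many records each user contributes.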

MCML Authors
Daniel Rückert, Prof. Dr., Artificial Intelligence in Healthcare and Medicine
Georgios Kaissis, Dr., * Former Principal Investigator


[1938]
V. Margraf, T. Koerner, A. Tornede and M. Wever.
RunAndSchedule2Survive: Algorithm Scheduling Based on Run2Survive.
ACM Transactions on Evolutionary Learning and Optimization Just accepted (Jun. 2025). DOI
Abstract

The algorithm selection problem aims to identify the most suitable algorithm for a given problem instance under specific time constraints, where suitability typically refers to a performance metric such as algorithm runtime. While previous work has employed machine learning techniques to tackle this challenge, methods from survival analysis have proven particularly effective. This paper presents RunAndSchedule2Survive to address the more general and complex problem of algorithm scheduling, where the objective is to allocate computational resources across multiple algorithms to maximize performance within specified time constraints. Our approach combines survival analysis with evolutionary algorithms to optimize algorithm schedules by leveraging runtime distributions modeled as survival functions. Experimental results across various standard benchmarks demonstrate that our approach significantly outperforms previous methods for algorithm scheduling and yields more robust results than its algorithm selection variant. More specifically, RunAndSchedule2Survive achieves superior performance in 20 out of 25 benchmark scenarios, surpassing hitherto state-of-the-art approaches.
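The core idea of scheduling over runtime distributions modeled as survival functions can be sketched numerically. The toy model below, with exponential runtime distributions and an independence assumption of my own (not the paper's model), shows why splitting a time budget across algorithms can beat betting the whole budget on one:

```python
import math

# Each algorithm's runtime on an instance is summarized by a survival
# function S_i(t) = P(not solved by time t). Under independence, a
# sequential schedule of time slices fails only if every slice fails.

def exp_survival(rate: float):
    """Survival function S(t) = exp(-rate * t) of an exponential runtime."""
    return lambda t: math.exp(-rate * t)

def schedule_success_prob(schedule):
    """schedule: list of (survival_fn, time_slice). Returns P(instance solved)."""
    fail = 1.0
    for surv, t in schedule:
        fail *= surv(t)  # this slice fails with probability S(t)
    return 1.0 - fail

fast = exp_survival(1.0)   # solves quickly on average
slow = exp_survival(0.1)

# Splitting a 10 s budget across both algorithms:
p_split = schedule_success_prob([(fast, 5.0), (slow, 5.0)])
# Spending the whole budget on the slow algorithm alone:
p_slow = schedule_success_prob([(slow, 10.0)])
print(p_split > p_slow)  # the schedule hedges across algorithms
```

This is only the evaluation half of the problem; the paper's contribution is learning the survival functions from data and searching over schedules with evolutionary algorithms.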

MCML Authors
Valentin Margraf, Artificial Intelligence and Machine Learning


[1937]
L. Merker, M. Blessing, B. Zhang and H. S. Stein.
Information Dense and Industry Scalable Accelerated Formation.
Advanced Intelligent Discovery (Jun. 2025). DOI
Abstract

Bespoke formation of batteries offers improved lifetime and performance but is generally associated with long processing times, high cost, and large floorspace. Facile strategies like heating or increasing the formation current, as well as current alterations during formation, have their limits in speed-up and efficiency. We present pulsed formation on graphitic-anode full cells as an accelerated formation strategy and investigate its influence on various quality parameters. Optimized pulsed charging is demonstrated herein to reduce the formation time by more than 50% whilst maintaining or improving all other cell quality parameters, including discharge capacity. The newly discovered protocol is scaled up to 25 Ah prismatic cells in the PHEV1 format, which confirm the accelerated and improved pulsed formation strategy. We attribute the accelerated and improved formation to an apt balance of surface and bulk diffusion, which results in a thinner, more homogeneous SEI. The dynamics of pulsed formation also allow for the extraction of new quality markers while formation is happening.

MCML Authors
Helge Stein, Prof. Dr., Digital Catalysis


[1936]
H. Boche, A. Fono and G. Kutyniok.
Mathematical Algorithm Design for Deep Learning under Societal and Judicial Constraints: The Algorithmic Transparency Requirement.
Applied and Computational Harmonic Analysis 77.101763 (Jun. 2025). DOI
Abstract

Deep learning still has drawbacks in terms of trustworthiness, which describes a comprehensible, fair, safe, and reliable method. To mitigate the potential risk of AI, clear obligations associated with trustworthiness have been proposed via regulatory guidelines, e.g., in the European AI Act. Therefore, a central question is to what extent trustworthy deep learning can be realized. Establishing the described properties constituting trustworthiness requires that the factors influencing an algorithmic computation can be retraced, i.e., the algorithmic implementation is transparent. Motivated by the observation that the current evolution of deep learning models necessitates a change in computing technology, we derive a mathematical framework which enables us to analyze whether a transparent implementation in a computing model is feasible. We exemplarily apply our trustworthiness framework to analyze deep learning approaches for inverse problems in digital and analog computing models represented by Turing and Blum-Shub-Smale Machines, respectively. Based on previous results, we find that Blum-Shub-Smale Machines have the potential to establish trustworthy solvers for inverse problems under fairly general conditions, whereas Turing machines cannot guarantee trustworthiness to the same degree.

MCML Authors
Adalbert Fono, Mathematical Foundations of Artificial Intelligence
Gitta Kutyniok, Prof. Dr., Mathematical Foundations of Artificial Intelligence


[1935]
L. Merker, B. Zhang, J. Yuan, S. Ji and H. S. Stein.
Insight generation from information-dense formation protocols.
Batteries & Supercaps.e202500153 (Jun. 2025). DOI
Abstract

Accelerated formation protocols that utilize pulsed charging offer an unprecedented wealth of electrochemical data. Herein we present methods to extract diagnostic data relating to pseudo-diffusion coefficients, internal resistance, and other quantities that give live insight into the growth of the solid electrolyte interphase (SEI). Specifically, we present a purely mathematical method to track formation progression in near-real time and chart a path towards adjusting pulse parameters for targeted SEI synthesis. The method and analysis were performed on 3 mAh cells but can also be applied to higher-capacity cells.

MCML Authors
Helge Stein, Prof. Dr., Digital Catalysis


[1934]
K. D. Bartl-Pokorny, A. Mallol-Ragolta, A. Spiesberger, A. Semertzidou, J. Löchner, F. B. Pokorny and B. W. Schuller.
'Hey Smartphone, Am I Ill?' Detecting Diseases From The Voice.
Frontiers Frontiers for Young Minds (Jun. 2025). URL
Abstract

As humans, we learn from what we perceive with our senses in our daily lives. Computers can have similar learning capabilities, allowing them to learn from what they ‘see’ and ‘hear’ and to use the knowledge they learn to solve future tasks. This ability is called artificial intelligence (AI). Devices equipped with AI, such as smartphones, smartwatches, or smart speakers, have now become our everyday companions. Among other things, they can listen to us and answer our questions. This type of technology is also playing a growing role in medicine. In this article, we explain how a computer can figure out whether the sound of a person’s voice or the way they speak indicates a certain disease. We demonstrate this using the example of detecting COVID-19, and discuss both problems and opportunities that arise when using AI for diagnosis.

MCML Authors
Adria Mallol-Ragolta, Health Informatics
Anika Spiesberger, Health Informatics
Björn Schuller, Prof. Dr., Health Informatics


[1933]
C. S. Vetter, A. Bender, D. B. Dwyer, M. Montembeault, A. Ruef, K. Chrisholm, L. Kambeitz-Ilankovic, L. A. Antonucci, S. Ruhrmann, J. Kambeitz, M. Lichtenstein, A. Riecher, R. Upthegrove, R. K. R. Salokangas, J. Hietala, C. Pantelis, R. Lencer, E. Meisenzahl, S. Wood, P. Brambilla, S. Borgwardt, P. Falkai, A. Bertolino, N. Koutsouleris and PRONIA Consortium.
Exploring the Predictive Value of Structural Covariance Networks for the Diagnosis of Schizophrenia.
Frontiers in Psychiatry 16 (Jun. 2025). DOI
Abstract

Schizophrenia is a psychiatric disorder hypothesized to result from disturbed brain connectivity. Structural covariance networks (SCNs) describe the shared variation in morphological properties emerging from coordinated neurodevelopmental processes and may thus be a promising diagnostic biomarker for schizophrenia. We compared the diagnostic value of two SCN computation methods derived from regional gray matter volume (GMV) in 154 patients with a diagnosis of first-episode psychosis or recurrent schizophrenia (PAT) and 366 healthy control individuals (HC). The first method (REF-SCN) quantifies the contribution of an individual to a normative reference group’s SCN; the second approach (KLS-SCN) uses a symmetric version of the Kullback-Leibler divergence. Their diagnostic value compared to regional GMV was assessed in a stepwise analysis using a series of linear support vector machines within a nested cross-validation framework with stacked generalization. All models were externally validated in an independent sample (N_PAT=71, N_HC=74), SCN feature importance was assessed, and the derived risk scores were analyzed for differential relationships with clinical variables. We found that models trained on SCNs were able to classify patients with schizophrenia, and combining SCNs and regional GMV in a stacked model improved training (balanced accuracy (BAC) = 69.96%) and external validation performance (BAC = 67.10%). Among all unimodal models, the highest discovery-sample performance was achieved by a model trained on REF-SCN (BAC = 67.03%). All model decisions were driven by widespread structural covariance alterations involving the somato-motor, default mode, control, visual, and ventral attention networks. Risk estimates derived from KLS-SCNs and regional GMV, but not REF-SCNs, could be predicted from clinical variables, especially body mass index (BMI) and affect-related negative symptoms.
These patterns of results show that different SCN computation approaches capture different aspects of the disease. While REF-SCNs contain valuable information for discriminating schizophrenia from healthy control individuals, KLS-SCNs may capture more nuanced symptom-level characteristics similar to those captured by PCA of regional GMV.

MCML Authors
Clara Sophie Vetter, Artificial Intelligence in Healthcare and Medicine


[1932]
E. Pozzoli and A. Scagliotti.
Approximation of diffeomorphisms for quantum state transfers.
IEEE Control Systems Letters Early Access (Jun. 2025). DOI
Abstract

In this paper, we seek to combine two emerging standpoints in control theory. On the one hand, recent advances in infinite-dimensional geometric control have unlocked a method for controlling (with arbitrary precision and in arbitrarily small times) state transfers for bilinear Schrödinger PDEs posed on a Riemannian manifold M. In particular, these arguments rely on controllability results in the group of the diffeomorphisms of M. On the other hand, using tools of Γ-convergence, it has been proved that the retrieval of a diffeomorphism of M can be phrased as an ensemble optimal control problem. More precisely, this is done by employing a control-affine system for simultaneously steering a finite swarm of points towards their respective targets. Here we blend these two theoretical approaches and numerically find control laws driving state transitions (such as eigenstate transfers) in a bilinear Schrödinger PDE posed on the torus. Such systems have experimental relevance and are currently used to model rotational dynamics of molecules, and cold atoms trapped in periodic optical lattices.

MCML Authors
Alessandro Scagliotti, Applied Numerical Analysis


[1931]
D. Zhao, M. Asgarimehr, K. Heidler, J. Wickert, X. Zhu and L. Mou.
Deep Learning-Based GNSS-R Global Vegetation Water Content: Dataset, Estimation, and Uncertainty.
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing Early Access (Jun. 2025). DOI
Abstract

Vegetation water content (VWC) is a crucial parameter for understanding vegetation dynamics and the hydrological cycle on Earth. With rapid climate changes in recent years, monitoring VWC with high spatiotemporal coverage on a global scale is of paramount importance. Yet, traditional in situ measurements are constrained in remote and densely vegetated regions. Additionally, existing spaceborne remote sensing methods face challenges due to poor cloud penetration capabilities, soil moisture interference, and inadequate temporal resolution. Spaceborne global navigation satellite system reflectometry (GNSS-R) has demonstrated promising potential to overcome these limitations in vegetation monitoring. In this study, we propose a scheme for deep learning-based GNSS-R VWC assessment, leveraging a rapidly growing amount of GNSS-R data with an unprecedented sampling rate. We introduce a triplet dataset, which consists of measurements from the Cyclone GNSS (CYGNSS), the Global Land Data Assimilation System (GLDAS), and the Soil Moisture Active Passive (SMAP) mission, spanning over three years. Validation is performed using several benchmark models with the proposed dataset. Furthermore, the models’ predictive uncertainty is quantified with the Monte Carlo (MC) dropout technique to provide a trustworthy representation of the estimations. Experimental evaluation of the models demonstrates good consistency between the estimated VWC and the ground truth, with a minimum root mean square deviation (RMSD) of 1.0988 kg/m² and a bias of 0.002 kg/m² over a twelve-month test period. Moreover, a daily global VWC estimation is achieved through the proposed pipeline, filling the gaps of current products and enabling rapid measurements with enhanced temporal availability. We will make the proposed dataset publicly available.

MCML Authors
Xiaoxiang Zhu, Prof. Dr., Data Science in Earth Observation


[1930]
Y. Ma, Q. Khan and D. Cremers.
MA-DV2F: A Multi-Agent Navigation Framework Using Dynamic Velocity Vector Field.
IEEE Robotics and Automation Letters 10.6 (Jun. 2025). DOI GitHub
Abstract

In this paper, we propose MA-DV2F: Multi-Agent Dynamic Velocity Vector Field. It is a framework for simultaneously controlling a group of vehicles in challenging environments. DV2F is generated for each vehicle independently and provides a map of reference orientation and speed that a vehicle must attain at any point on the navigation grid such that it safely reaches its target. The field is dynamically updated depending on the speed and proximity of the ego-vehicle to other agents. This dynamic adaptation of the velocity vector field allows prevention of imminent collisions. Experimental results show that MA-DV2F outperforms concurrent methods in terms of safety, computational efficiency and accuracy in reaching the target when scaling to a large number of vehicles.
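A minimal sketch of a goal-attractive, agent-repulsive velocity field conveys the flavor of such a framework; the gains, saturation rule, and function names below are illustrative assumptions of mine, not the actual DV2F construction:

```python
import math

# Toy 2D velocity field for one ego-vehicle: attraction toward its target
# plus repulsion from nearby agents, saturated at a speed limit.

def velocity_field(pos, goal, others, v_max=1.0, safe_dist=2.0):
    """Reference velocity at `pos`; `others` are positions of other agents."""
    vx, vy = goal[0] - pos[0], goal[1] - pos[1]   # attraction toward the goal
    for ox, oy in others:
        dx, dy = pos[0] - ox, pos[1] - oy
        dist = math.hypot(dx, dy)
        if 1e-9 < dist < safe_dist:
            gain = (safe_dist - dist) / dist**2   # repulsion grows as agents close in
            vx += dx * gain
            vy += dy * gain
    speed = math.hypot(vx, vy)
    if speed > v_max:                             # saturate at the speed limit
        vx, vy = vx / speed * v_max, vy / speed * v_max
    return vx, vy

v = velocity_field(pos=(0.0, 0.0), goal=(10.0, 0.0), others=[(1.0, 0.5)])
print(v)  # points broadly toward the goal while veering away from the neighbor
```

Because the field is recomputed from the current agent positions, following it dynamically adapts to other vehicles' motion, which is the intuition behind the collision-prevention claim in the abstract.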

MCML Authors
Qadeer Khan, Computer Vision & Artificial Intelligence
Daniel Cremers, Prof. Dr., Computer Vision & Artificial Intelligence


[1929]
S. Wang, Q. Cheng, Q. Cheng, W. Zhang, S.-C. Wu, N. Zeller, D. Cremers and N. Navab.
VoxNeRF: Bridging Voxel Representation and Neural Radiance Fields for Enhanced Indoor View Synthesis.
IEEE Robotics and Automation Letters 10.6 (Jun. 2025). DOI
Abstract

The generation of high-fidelity view synthesis is essential for robotic navigation and interaction but remains challenging, particularly in indoor environments and real-time scenarios. Existing techniques often require significant computational resources for both training and rendering, and they frequently result in suboptimal 3D representations due to insufficient geometric structuring. To address these limitations, we introduce VoxNeRF, a novel approach that utilizes easy-to-obtain geometry priors to enhance both the quality and efficiency of neural indoor reconstruction and novel view synthesis. We propose an efficient voxel-guided sampling technique that allocates computational resources selectively to the most relevant segments of rays based on a voxel-encoded geometry prior, significantly reducing training and rendering time. Additionally, we incorporate a robust depth loss to improve reconstruction and rendering quality in sparse view settings. Our approach is validated with extensive experiments on ScanNet and ScanNet++ where VoxNeRF outperforms existing state-of-the-art methods and establishes a new benchmark for indoor immersive interpolation and extrapolation settings.

MCML Authors
Sen Wang, Computer Aided Medical Procedures & Augmented Reality
Qing Cheng, Computer Vision & Artificial Intelligence
Qing Cheng, Computer Vision & Artificial Intelligence
Daniel Cremers, Prof. Dr., Computer Vision & Artificial Intelligence
Nassir Navab, Prof. Dr., Computer Aided Medical Procedures & Augmented Reality


[1928]
J. Külz, M. Terzer, M. Magri, A. Giusti and M. Althoff.
Holistic Construction Automation with Modular Robots: From High-Level Task Specification to Execution.
IEEE Transactions on Automation Science and Engineering Early Access (Jun. 2025). DOI
Abstract

In situ robotic automation in construction is challenging due to constantly changing environments, a shortage of robotics experts, and a lack of standardized frameworks bridging robotics and construction practices. This work proposes a holistic framework for construction task specification, optimization of robot morphology, and mission execution using a mobile modular reconfigurable robot. Users can specify and monitor the desired robot behavior through a graphical interface. In contrast to existing, monolithic solutions, we automatically identify a new task-tailored robot for every task by integrating Building Information Modeling (BIM). Our framework leverages modular robot components that enable fast adaptation of the robot hardware to the specific demands of the construction task. Unlike previous works on modular robot optimization, we consider multiple competing objectives, which allow us to explicitly model the challenges of real-world transfer, such as calibration errors. We demonstrate our framework in simulation by optimizing robots for drilling and spray painting. Finally, experimental validation demonstrates that our approach robustly enables the autonomous execution of robotic drilling.

MCML Authors
Jonathan Külz, Cyber Physical Systems
Matthias Althoff, Prof. Dr., Cyber Physical Systems


[1927]
P. Gupta, M. Wever and E. Hüllermeier.
Information Leakage Detection through Approximate Bayes-optimal Prediction.
Information Sciences In Press, Journal Pre-proof.122419 (Jun. 2025). DOI
Abstract

In today’s data-driven world, the proliferation of publicly available information raises security concerns due to the information leakage (IL) problem. IL involves unintentionally exposing sensitive information to unauthorized parties via observable system information. Conventional statistical approaches, which rely on estimating mutual information (MI) between observable and secret information to detect ILs, face the challenges of the curse of dimensionality, convergence, computational complexity, and MI misestimation. Though effective, emerging supervised machine-learning-based approaches to detect ILs are limited to binary sensitive information and lack a comprehensive framework. To address these limitations, we establish a theoretical framework using statistical learning theory and information theory to quantify and detect IL accurately. Using automated machine learning, we demonstrate that MI can be accurately estimated by approximating the typically unknown Bayes predictor’s log-loss and accuracy. Based on this, we show how MI can effectively be estimated to detect ILs. Our method performs superior to state-of-the-art baselines in an empirical study considering synthetic and real-world OpenSSL TLS server datasets.
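The identity behind log-loss-based MI estimation, I(O; S) = H(S) - H(S|O) with the Bayes predictor's expected log-loss equal to H(S|O), can be checked on a tiny known joint distribution. The distribution and variable names below are illustrative, not from the paper:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution P(observable O, secret S); rows index O, columns S.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_o = [sum(row) for row in joint]        # marginal of the observable
p_s = [sum(col) for col in zip(*joint)]  # marginal of the secret
posterior = [[p / p_o[i] for p in row] for i, row in enumerate(joint)]  # Bayes P(S|O)

# The Bayes predictor's expected log-loss equals H(S|O) in bits, so
# I(O; S) = H(S) - H(S|O) becomes directly computable here.
h_s_given_o = sum(p_o[i] * entropy(posterior[i]) for i in range(len(joint)))
mi_estimate = entropy(p_s) - h_s_given_o
print(round(mi_estimate, 4))  # strictly positive: the observable leaks the secret
```

In practice the posterior is unknown, which is where the paper's contribution lies: a learned predictor's log-loss stands in for the Bayes log-loss, and its excess over the Bayes optimum determines how much the MI is underestimated.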

MCML Authors
Eyke Hüllermeier, Prof. Dr., Artificial Intelligence and Machine Learning


[1926]
X. Zhao, Z. Xiong, P. Karlshöfer, N. Tziolas, M. Wiesmeier, U. Heiden and X. Zhu.
Soil organic carbon estimation using spaceborne hyperspectral composites on a large scale.
International Journal of Applied Earth Observation and Geoinformation 140 (Jun. 2025). DOI
Abstract

Soil Organic Carbon (SOC) is a key property for soil health. Spectral reflectance such as multispectral and hyperspectral data could provide efficient and cost-effective retrieval of SOC content. However, constrained by the availability of hyperspectral satellite data, current works mostly use a small number of spaceborne hyperspectral imagery for SOC retrieval on a small scale. In this work, the first large-scale hyperspectral imaging reflectance composites were built, and they were used for SOC estimation. Specifically, DESIS satellite images were used to predict SOC over the whole state of Bavaria in Germany ( 70,000 km). We prepare 850 hyperspectral images from the DESIS satellite and build temporal composites from them. For the soil data, data was gathered from LfU(Bavarian State Office for the Environment), LfL(Bavarian State Research Center for Agriculture) and LUCAS 2018 (Land Use and Coverage Area Frame Survey). 828 soil samples were selected after data filtering. For this regression task, different machine learning and deep learning methods were implemented and explored. Moreover, a spectral attention mechanism was added to the model. Besides hyperspectral input, the digital elevation model (DEM) was also included as an auxiliary input as the measured spectrum has inter-variability dependent on the elevation and the generated topographical features are also relevant with SOC distribution. Based on the regression results evaluated by , , and , the deep learning models showed much better performance than machine learning methods. Especially when only using hyperspectral data as input, the best result was achieved with 1.947%, 0.626, and 1.710 on the test set. After incorporating topographical features, the fused model achieved further improved performance with 1.752% and 0.695 and 1.919. 
From the interpretability analysis of model performance, it was found that the bands in the ranges 530 nm–570 nm, 770 nm–790 nm, and 840 nm–870 nm are the most relevant for SOC estimation. Finally, several SOC maps were generated and analyzed together with soil types. The SOC maps indicate that water-associated areas, such as coastal soils and bogs, tend to have higher SOC, while mountain areas tend to contain lower SOC. These findings align with the SOC distribution across soil types and show the effectiveness of the model.
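As an illustration of the spectral attention idea mentioned in the abstract, the sketch below re-weights a hyperspectral reflectance vector with a softmax over per-band scores, so that informative bands dominate the input to a downstream regressor. The scoring values and the highlighted band window are hypothetical; only the 235-band count of DESIS is taken from the sensor's specification, and the paper's actual attention architecture may differ.

```python
import numpy as np

def spectral_attention(spectrum, scores):
    """Softmax over per-band scores re-weights a reflectance vector;
    bands with high scores dominate the weighted spectrum."""
    a = np.exp(scores - scores.max())
    attn = a / a.sum()                     # attention weights, sum to 1
    return attn * spectrum, attn

rng = np.random.default_rng(0)
spectrum = rng.uniform(0.0, 1.0, 235)      # DESIS provides 235 spectral bands
scores = np.zeros(235)
scores[100:120] = 3.0                      # hypothetical high-scoring band window
weighted, attn = spectral_attention(spectrum, scores)
```

In a trained model the scores would be learned jointly with the regressor; here they are set by hand purely to show the re-weighting effect.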

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1925]
C. Wu, B. Ma, Z. Zhang, N. Deng, Y. He and Y. Xue.
Evaluating Zero-Shot Multilingual Aspect-Based Sentiment Analysis with Large Language Models.
International Journal of Machine Learning and Cybernetics (Jun. 2025). DOI
Abstract

Aspect-based sentiment analysis (ABSA), a sequence labeling task, has attracted increasing attention in multilingual contexts. While previous research has focused largely on fine-tuning or training models specifically for ABSA, we evaluate large language models (LLMs) under zero-shot conditions to explore their potential to tackle this challenge with minimal task-specific adaptation. We conduct a comprehensive empirical evaluation of a series of LLMs on multilingual ABSA tasks, investigating various prompting strategies, including vanilla zero-shot, chain-of-thought (CoT), self-improvement, self-debate, and self-consistency, across nine different models. Results indicate that while LLMs show promise in handling multilingual ABSA, they generally fall short of fine-tuned, task-specific models. Notably, simpler zero-shot prompts often outperform more complex strategies, especially in high-resource languages like English. These findings underscore the need for further refinement of LLM-based approaches to effectively address the ABSA task across diverse languages.
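A "vanilla" zero-shot prompt of the kind evaluated above can be sketched as plain string construction: the model is asked for (aspect, polarity) pairs with no in-context examples or reasoning steps. The wording below is purely illustrative and does not reproduce the paper's actual templates.

```python
def vanilla_zero_shot_prompt(sentence: str, language: str = "English") -> str:
    """Minimal zero-shot ABSA prompt: request (aspect, polarity) pairs
    directly, with no examples or chain-of-thought instructions."""
    return (
        f"Extract every aspect term and its sentiment polarity "
        f"(positive, negative, or neutral) from the {language} sentence below. "
        f"Answer only with a list of (aspect, polarity) pairs.\n"
        f"Sentence: {sentence}"
    )

prompt = vanilla_zero_shot_prompt("The battery life is great but the screen is dim.")
print(prompt)
```

The CoT, self-debate, and self-consistency variants studied in the paper wrap the same task in additional reasoning instructions or repeated sampling.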

MCML Authors

[1924]
Q. Xu, L. F. De Vos, Y. Shi, N. Rüther, A. Bronstert and X. Zhu.
Urban Flood Modeling and Forecasting with Deep Neural Operator and Transfer Learning.
Journal of Hydrology 133705 (Jun. 2025, in press, journal pre-proof). DOI
Abstract

Physics-based models provide accurate flood modeling but are limited by their dependence on high-quality data and computational demands, particularly in complex urban environments. Machine learning-based surrogate models like neural operators present a promising alternative; however, their practical application in urban flood modeling still faces challenges such as insufficient feature representation, high memory demands, and limited transferability. To address these challenges, this study introduces a deep neural operator (DNO) and a transfer learning-based DNO for fast, accurate, resolution-invariant, and cross-scenario urban flood forecasting. The DNO features an enhanced Fourier layer with skip connections for improved memory efficiency, alongside a deep encoder-decoder framework and an urban-embedded residual loss to enhance modeling effectiveness. The transfer learning-based DNO further integrates a fine-tuning-based approach for efficient cross-scenario forecasting in the target domain and a domain adaptation-based strategy for continuous learning across diverse domains. The fine-tuning-based DNO enables rapid adaptation to target domains, while the domain adaptation-based DNO mitigates knowledge forgetting from the source domain. Experimental results demonstrate that the proposed DNO significantly outperforms existing neural solvers using a comprehensive urban flood benchmark dataset, particularly in predicting high water depths and exhibiting exceptional zero-shot downscaling performance for high-resolution forecasting. Moreover, the fine-tuning-based DNO enhances transferability for cross-scenario urban flood forecasting, while the domain adaptation-based DNO achieves accurate flood predictions in both source and target domains, even with limited labeled target data. Through the combination of these ML methods and the benchmark dataset, a practical tool is established for effective, cross-scenario, and downscaled spatiotemporal urban flood forecasting.
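The Fourier layer with skip connection at the core of such neural operators can be illustrated in a much-simplified 1D numpy form: transform the input to Fourier space, multiply a truncated set of low-frequency modes by (learnable) complex weights, transform back, and add a scaled identity path. The paper's actual layer (2D/temporal, memory-optimised, with trained weights) is not reproduced here.

```python
import numpy as np

def fourier_layer(u, weights, modes, w_skip):
    """FNO-style 1D spectral convolution with a skip connection.
    u: (n,) signal on a uniform grid; weights: (modes,) complex multipliers."""
    u_hat = np.fft.rfft(u)
    out_hat = np.zeros_like(u_hat)
    out_hat[:modes] = u_hat[:modes] * weights   # mix only the lowest modes
    spectral = np.fft.irfft(out_hat, n=u.size)
    return spectral + w_skip * u                # skip path preserves fine detail

x = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
u = np.sin(x)                                   # a single low-frequency mode
y = fourier_layer(u, np.ones(8, dtype=complex), modes=8, w_skip=0.5)
```

With identity weights, a signal whose energy lies entirely in the kept modes passes through the spectral path unchanged, so the output is simply the input scaled by (1 + w_skip).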

MCML Authors
Link to website

Qingsong Xu

Data Science in Earth Observation

Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1923]
Y. Lemaréchal, G. Couture, F. Pelletier, R. Lefol, P.-L. Asselin, S. Ouellet, J. Bernard, L. Ebrahimpour, V. S. K. Manem, J. Topalis, B. Schachtner, S. Jodogne, P. Joubert, K. Jeblick, M. Ingrisch and P. Després.
PARADIM: A Platform to Support Research at the Interface of Data Science and Medical Imaging.
Journal of Imaging Informatics in Medicine (Jun. 2025). DOI
Abstract

This paper describes PARADIM, a digital infrastructure designed to support research at the interface of data science and medical imaging, with a focus on Research Data Management best practices. The platform is built from open-source components and rooted in the FAIR principles through strict compliance with the DICOM standard. It addresses key needs in data curation, governance, privacy, and scalable resource management. Supporting every stage of the data science discovery cycle, the platform offers robust functionalities for user identity and access management, data de-identification, storage, annotation, as well as model training and evaluation. Rich metadata are generated all along the research lifecycle to ensure the traceability and reproducibility of results. PARADIM hosts several medical image collections and allows the automation of large-scale, computationally intensive pipelines (e.g., automatic segmentation, dose calculations, AI model evaluation). The platform fills a gap at the interface of data science and medical imaging, where digital infrastructures are key in the development, evaluation, and deployment of innovative solutions in the real world.

MCML Authors
Link to website

Johanna Topalis

Clinical Data Science in Radiology

Link to website

Balthasar Schachtner

Dr.

Clinical Data Science in Radiology

Link to website

Katharina Jeblick

Dr.

Clinical Data Science in Radiology

Link to Profile Michael Ingrisch

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology


[1922]
S. Campbell, P. Liu and S. Nyholm.
Can Chatbots Preserve Our Relationships with the Dead?
Journal of the American Philosophical Association 11.2 (Jun. 2025). DOI
Abstract

Imagine that you are given access to an AI chatbot that compellingly mimics the personality and speech of a deceased loved one. If you start having regular interactions with this ‘thanabot’, could this new relationship be a continuation of the relationship you had with your loved one? And could a relationship with a thanabot preserve or replicate the value of a close human relationship? To the first question, we argue that a relationship with a thanabot cannot be a true continuation of your relationship with a deceased loved one, though it might support one’s continuing bonds with the dead. To the second question, we argue that, in and of themselves, relationships with thanabots cannot benefit us as much as rewarding and healthy intimate relationships with other humans, though we explain why it is difficult to make reliable comparative generalizations about the instrumental value of these relationships.

MCML Authors
Link to Profile Sven Nyholm

Sven Nyholm

Prof. Dr.

Ethics of Artificial Intelligence


[1921]
M. Rauscher, A. Scagliotti and F. Pagginelli Patricio.
Shortest-path recovery from signature with an optimal control approach.
Mathematics of Control, Signals, and Systems 37 (Jun. 2025). DOI
Abstract

In this paper, we consider the signature-to-path reconstruction problem from the control-theoretic perspective. Namely, we design an optimal control problem whose solution leads to the minimal-length path that generates a given signature. In order to do that, we minimize a cost functional consisting of two competing terms, i.e., a weighted final-time cost combined with the L²-norm squared of the controls. Moreover, we can show that, by taking the limit to infinity of the parameter that tunes the final-time cost, the problem Γ-converges to the problem of finding a sub-Riemannian geodesic connecting two signatures. Finally, we provide an alternative reformulation of the latter problem, which is particularly suitable for the numerical implementation.

MCML Authors
Link to website

Alessandro Scagliotti

Applied Numerical Analysis


[1920]
M. Balcerak, J. Weidner, P. Karnakov, I. Ezhov, S. Litvinov, P. Koumoutsakos, T. Amiranashvili, R. Z. Zhang, J. S. Lowengrub, I. Yakushev, B. Wiestler and B. Menze.
Individualizing glioma radiotherapy planning by optimization of a data and physics-informed discrete loss.
Nature Communications 16.5982 (Jun. 2025). DOI
Abstract

Brain tumor growth is unique to each glioma patient and extends beyond what is visible in imaging scans, infiltrating surrounding brain tissue. Understanding these hidden patient-specific progressions is essential for effective therapies. Current treatment plans for brain tumors, such as radiotherapy, typically involve delineating a uniform margin around the visible tumor on pre-treatment scans to target this invisible tumor growth. This ‘one size fits all’ approach is derived from population studies and often fails to account for the nuances of individual patient conditions. We present the Glioma Optimizing the Discrete Loss (GliODIL) framework, which infers the full spatial distribution of tumor cell concentration from available multi-modal imaging, leveraging a Fisher-Kolmogorov type physics model to describe tumor growth. This is achieved through the newly introduced method of Optimizing the Discrete Loss (ODIL), where both data and physics-based constraints are softly assimilated into the solution. Our test dataset comprises 152 glioblastoma patients with pre-treatment imaging and post-treatment follow-ups for tumor recurrence monitoring. By blending data-driven techniques with physics-based constraints, GliODIL enhances recurrence prediction in radiotherapy planning, challenging traditional uniform margins and strict adherence to the Fisher-Kolmogorov partial differential equation model, which is adapted for complex cases.

MCML Authors
Link to website

Jonas Weidner

AI for Image-Guided Diagnosis and Therapy

Link to Profile Benedikt Wiestler

Benedikt Wiestler

Prof. Dr.

AI for Image-Guided Diagnosis and Therapy


[1919]
P. Wicke and M. M. Bolognesi.
Red and blue language: Word choices in the Trump and Harris 2024 presidential debate.
PLOS One 20.6 (Jun. 2025). DOI GitHub
Abstract

Political debates are a peculiar type of political discourse, in which candidates directly confront one another, addressing not only the moderator’s questions, but also their opponent’s statements, as well as the concerns of voters from both parties and undecided voters. Therefore, language is adjusted to meet specific expectations and achieve persuasion. We analyse how the language of Trump and Harris during the Presidential debate (September 10th, 2024) differs in relation to semantic and pragmatic features, for which we formulated targeted hypotheses: framing values and ideology, appealing to emotion, using words with different degrees of concreteness and specificity, addressing others through singular or plural pronouns. Our findings include: differences in the use of figurative frames (Harris often framing issues around recovery and empowerment, Trump often focused on crisis and decline); similar use of emotional language, with Trump showing a slightly higher tendency toward negativity and toward less subjective language compared to Harris; no significant difference in the specificity of candidates’ responses; similar use of abstract language, with Trump showing more variability than Harris, depending on the subject discussed; differences in addressing the opponent, with Trump never mentioning Harris by name, while Harris referred to Trump frequently; different uses of pronouns, with Harris using both singular and plural pronouns equally, while Trump used more singular pronouns. The results are discussed in relation to previous literature on Red and Blue language, which refers to distinct linguistic patterns associated with Republican (Red) and Democratic (Blue) political ideologies.

MCML Authors
Link to website

Philipp Wicke

Dr.

Computational Linguistics


[1918]
R. R. Valiev, R. T. Nasibullin, H. Sandström, P. Rinke, K. Puolamäki and T. Kurten.
Predicting intersystem crossing rate constants of alkoxy-radical pairs with structure-based descriptors and machine learning.
Physical Chemistry Chemical Physics (Advance Article, Jun. 2025). DOI
Abstract

Peroxy radicals (RO2) are ubiquitous intermediates in many oxidation processes, especially in the atmospheric gas phase. The recombination reaction of two peroxy radicals (RO2 + R′O2) has been demonstrated to lead, via several steps, to a triplet complex of two alkoxy radicals: 3(RO˙⋯R′O˙). The different product channels of RO2 + R′O2 reactions thus correspond to different reactions of this triplet complex. Of particular interest to atmospheric chemistry is the intersystem crossing (ISC) to the singlet state, which enables the recombination of the two radicals to an ROOR′ peroxide with considerably lower volatility than the original precursors. These peroxides are believed to be key contributors to the formation of secondary organic aerosol (SOA) particles, which in turn contribute to both air pollution and radiative forcing uncertainties. Developing reliable computational models for, e.g., RO2 + R′O2 branching ratios requires accurate estimates of the ISC rate constants, which can currently be obtained only from computationally expensive quantum chemistry calculations. By contrast, machine learning (ML) methods offer a faster alternative for estimating ISC rate constants. In the present work, we create a dataset with 98,082 conformations of radical pairs and their corresponding rate constants. We apply three ML models—random forest (RF), CatBoost (CB), and a neural network (NN)—to predict ISC rate constants from triplet to singlet states. Specifically, the models predict kISC(T1 → Si) for i = 1–4 and the cumulative kISC(T1 → Sn), in alkoxy radical pairs, using only molecular geometry descriptors as inputs. All ML models achieved a mean absolute error (MAE) on our test set within one order of magnitude and a coefficient of determination R² > 0.82 for all rate constants. Overall, the ML predictions match the quantum chemical calculations within 1–2 orders of magnitude, providing a fast and scalable alternative to quantum chemical methods for ISC rate estimation.
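To make the descriptor-to-rate-constant regression concrete, the sketch below fits a plain k-nearest-neighbour regressor (a simple stand-in for the RF, CatBoost, and NN models of the paper) on synthetic geometry descriptors and reports the MAE in log10 units, the "within one order of magnitude" criterion mentioned above. All data here are synthetic; the real models are trained on the 98,082-conformation dataset described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 geometry descriptors per radical-pair conformation
# and a log10 ISC rate constant that depends smoothly on them.
X = rng.uniform(2.0, 6.0, size=(500, 3))
y = 10.0 - 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 0.2, 500)

def knn_predict(X_train, y_train, X_query, k=5):
    """Plain k-nearest-neighbour regression on descriptor vectors."""
    d = np.linalg.norm(X_train[None, :, :] - X_query[:, None, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]
    return y_train[idx].mean(axis=1)

X_tr, y_tr = X[:400], y[:400]
X_te, y_te = X[400:], y[400:]
pred = knn_predict(X_tr, y_tr, X_te)
mae = float(np.abs(pred - y_te).mean())
print(mae)   # MAE in log10 units; below 1 means within one order of magnitude
```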

MCML Authors
Link to Profile Patrick Rinke

Patrick Rinke

Prof. Dr.

AI-based Material Science


[1917]
S. Maskey, G. Kutyniok and R. Levie.
Generalization Bounds for Message Passing Networks on Mixture of Graphons.
SIAM Journal on Mathematics of Data Science 7.2 (Jun. 2025). DOI
Abstract

We study the generalization capabilities of Message Passing Neural Networks (MPNNs), a prevalent class of Graph Neural Networks (GNNs). We derive generalization bounds specifically for MPNNs with normalized sum aggregation and mean aggregation. Our analysis is based on a data generation model incorporating a finite set of template graphons. Each graph within this framework is generated by sampling from one of the graphons with a certain degree of perturbation. In particular, we extend previous MPNN generalization results to a more realistic setting, which includes the following modifications: 1) we analyze simple random graphs with Bernoulli-distributed edges instead of weighted graphs; 2) we sample both graphs and graph signals from perturbed graphons instead of clean graphons; and 3) we analyze sparse graphs instead of dense graphs. In this more realistic and challenging scenario, we provide a generalization bound that decreases as the average number of nodes in the graphs increases. Our results imply that MPNNs with higher complexity than the size of the training set can still generalize effectively, as long as the graphs are sufficiently large.

MCML Authors
Link to website

Sohir Maskey

Mathematical Foundations of Artificial Intelligence

Link to Profile Gitta Kutyniok

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence


[1916]
T. Boege, M. Drton, B. Hollering, S. Lumpp, P. Misra and D. Schkoda.
Conditional independence in stationary distributions of diffusions.
Stochastic Processes and their Applications 184.104604 (Jun. 2025). DOI
Abstract

Stationary distributions of multivariate diffusion processes have recently been proposed as probabilistic models of causal systems in statistics and machine learning. Motivated by these developments, we study stationary multivariate diffusion processes with a sparsely structured drift. Our main result gives a characterization of the conditional independence relations that hold in a stationary distribution. The result draws on a graphical representation of the drift structure and pertains to conditional independence relations that hold generally as a consequence of the drift’s sparsity pattern.

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[1915]
L. Gosch, M. Sabanayagam, D. Ghoshdastidar and S. Günnemann.
Provable Robustness of (Graph) Neural Networks Against Data Poisoning and Backdoor Attacks.
Transactions on Machine Learning Research (Jun. 2025). URL
Abstract

Generalization of machine learning models can be severely compromised by data poisoning, where adversarial changes are applied to the training data. This vulnerability has led to interest in certifying (i.e., proving) that such changes up to a certain magnitude do not affect test predictions. We, for the first time, certify Graph Neural Networks (GNNs) against poisoning attacks, including backdoors, targeting the node features of a given graph. Our certificates are white-box and based upon the neural tangent kernel, which characterizes the training dynamics of sufficiently wide networks; and a novel reformulation of the bilevel optimization problem describing poisoning as a mixed-integer linear program. Consequently, we leverage our framework to provide fundamental insights into the role of graph structure and its connectivity on the worst-case robustness behavior of convolution-based and PageRank-based GNNs. We note that our framework is more general and constitutes the first approach to derive white-box poisoning certificates for NNs, which can be of independent interest beyond graph-related tasks.

MCML Authors
Link to website

Lukas Gosch

Data Analytics & Machine Learning

Link to Profile Debarghya Ghoshdastidar

Debarghya Ghoshdastidar

Prof. Dr.

Theoretical Foundations of Artificial Intelligence

Link to Profile Stephan Günnemann

Stephan Günnemann

Prof. Dr.

Data Analytics & Machine Learning


[1914]
P. Pisal, O. Krejci and P. Rinke.
Machine learning accelerated descriptor design for catalyst discovery in CO2 to methanol conversion.
npj Computational Materials 11.213 (Jun. 2025). DOI
Abstract

Transforming CO2 into methanol represents a crucial step towards closing the carbon cycle, with thermoreduction technology nearing industrial application. However, obtaining high methanol yields and ensuring the stability of heterocatalysts remain significant challenges. Herein, we present a sophisticated computational framework to accelerate the discovery of thermal heterogeneous catalysts, using machine-learned force fields. We propose a new catalytic descriptor, termed adsorption energy distribution, that aggregates the binding energies for different catalyst facets, binding sites, and adsorbates. The descriptor is versatile and can be adjusted to a specific reaction through careful choice of the key-step reactants and reaction intermediates. By applying unsupervised machine learning and statistical analysis to a dataset comprising nearly 160 metallic alloys, we offer a powerful tool for catalyst discovery. We propose new promising candidates such as ZnRh and ZnPt3, which, to our knowledge, have not yet been tested, and discuss their possible advantage in terms of stability.

MCML Authors
Link to website

Prajwal Pisal

AI-based Material Science

Link to Profile Patrick Rinke

Patrick Rinke

Prof. Dr.

AI-based Material Science


[1913]
A. Aghdam and V. T. Hu.
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment.
Preprint (Jun. 2025). arXiv
Abstract

We address the task of zero-shot fine-grained video classification, where no video examples or temporal annotations are available for unseen action classes. While contrastive vision-language models such as SigLIP demonstrate strong open-set recognition via mean-pooled image-text similarity, they fail to capture the temporal structure critical for distinguishing fine-grained activities. We introduce ActAlign, a zero-shot framework that formulates video classification as sequence alignment. For each class, a large language model generates an ordered sub-action sequence, which is aligned with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on the extremely challenging ActionAtlas benchmark, where human accuracy is only 61.6%. ActAlign outperforms billion-parameter video-language models while using approximately 8× fewer parameters. These results demonstrate that structured language priors, combined with classical alignment techniques, offer a scalable and general approach to unlocking the open-set recognition potential of vision-language models for fine-grained video understanding.
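The sequence-alignment step can be illustrated with a small DTW implementation: an ordered sub-action embedding sequence is aligned to frame embeddings under a cosine-distance cost, so classes whose sub-actions occur in the right temporal order score higher. The embeddings below are synthetic, and ActAlign's actual scoring details may differ.

```python
import numpy as np

def dtw_score(sub_actions, frames):
    """Score one class by DTW-aligning its ordered sub-action embeddings
    (m, d) against frame embeddings (n, d). Local cost is 1 - cosine
    similarity; a lower accumulated cost means a better alignment."""
    A = sub_actions / np.linalg.norm(sub_actions, axis=1, keepdims=True)
    F = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    cost = 1.0 - A @ F.T
    m, n = cost.shape
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return -D[m, n]                    # higher score = better-aligned class

rng = np.random.default_rng(0)
e = np.eye(16)
# A toy video with three temporal segments, each dominated by one sub-action.
frames = np.repeat(e[[0, 1, 2]], 4, axis=0) + 0.05 * rng.normal(size=(12, 16))
in_order = e[[0, 1, 2]]       # sub-actions listed in the order they occur
out_of_order = e[[2, 1, 0]]   # same sub-actions in the wrong order
```

Because DTW only permits monotone alignments, the in-order class matches every segment cheaply, while the out-of-order class is forced into costly mismatches.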

MCML Authors
Link to website

Vincent Tao Hu

Dr.

Computer Vision & Learning


[1912]
S. Almi, M. Fornasier, J. Klemenc and A. Scagliotti.
Balanced quasistatic evolutions of critical points in metric spaces.
Preprint (Jun. 2025). arXiv
Abstract

Quasistatic evolutions of critical points of time-dependent energies exhibit piecewise smooth behavior, making them useful for modeling continuum mechanics phenomena like elastic-plasticity and fracture. Traditionally, such evolutions have been derived as vanishing viscosity and inertia limits, leading to balanced viscosity solutions. However, for nonconvex energies, these constructions have been realized in Euclidean spaces and assume non-degenerate critical points. In this paper, we take a different approach by decoupling the time scales of the energy evolution and of the transition to equilibria. Namely, starting from an equilibrium configuration, we let the energy evolve while keeping the system state frozen; then, we update the state by freezing the energy, while letting the system transit via gradient flow or an approximation of it (e.g., minimizing movement or backward differentiation schemes). This approach has several advantages. It aligns with the physical principle that systems transit through energy-minimizing steady states. It is also fully constructive and computationally implementable, with physical and computational costs governed by appropriate action functionals. Additionally, our analysis is simpler and more general than previous formulations in the literature, as it does not require non-degenerate critical points. Finally, this approach extends to evolutions in locally compact metric path spaces, and our axiomatic presentation allows for various realizations.

MCML Authors
Link to Profile Massimo Fornasier

Massimo Fornasier

Prof. Dr.

Applied Numerical Analysis

Link to website

Jona Klemenc

Applied Numerical Analysis

Link to website

Alessandro Scagliotti

Applied Numerical Analysis


[1911]
D. Bani-Harouni, C. Pellegrini, E. Özsoy, M. Keicher and N. Navab.
Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning.
Preprint (Jun. 2025). arXiv
Abstract

Clinical decision-making is a dynamic, interactive, and cyclic process where doctors have to repeatedly decide on which clinical action to perform and consider newly uncovered information for diagnosis and treatment. Large Language Models (LLMs) have the potential to support clinicians in this process, however, most applications of LLMs in clinical decision support suffer from one of two limitations: Either they assume the unrealistic scenario of immediate availability of all patient information and do not model the interactive and iterative investigation process, or they restrict themselves to the limited ‘out-of-the-box’ capabilities of large pre-trained models without performing task-specific training. In contrast to this, we propose to model clinical decision-making for diagnosis with a hypothesis-driven uncertainty-aware language agent, LA-CDM, that converges towards a diagnosis via repeatedly requesting and interpreting relevant tests. Using a hybrid training paradigm combining supervised and reinforcement learning, we train LA-CDM with three objectives targeting critical aspects of clinical decision-making: accurate hypothesis generation, hypothesis uncertainty estimation, and efficient decision-making. We evaluate our methodology on MIMIC-CDM, a real-world dataset covering four abdominal diseases containing various clinical tests and show the benefit of explicitly training clinical decision-making for increasing diagnostic performance and efficiency.

MCML Authors
Link to website

David Bani-Harouni

Computer Aided Medical Procedures & Augmented Reality

Link to website

Chantal Pellegrini

Computer Aided Medical Procedures & Augmented Reality

Link to website

Ege Özsoy

Computer Aided Medical Procedures & Augmented Reality

Link to website

Matthias Keicher

Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1910]
L. Bastian, M. Rashed, N. Navab and T. Birdal.
Continuous-Time SO(3) Forecasting with Savitzky–Golay Neural Controlled Differential Equations.
Preprint (Jun. 2025). arXiv
Abstract

Tracking and forecasting the rotation of objects is fundamental in computer vision and robotics, yet SO(3) extrapolation remains challenging as (1) sensor observations can be noisy and sparse, (2) motion patterns can be governed by complex dynamics, and (3) application settings can demand long-term forecasting. This work proposes modeling continuous-time rotational object dynamics on SO(3) using Neural Controlled Differential Equations guided by Savitzky-Golay paths. Unlike existing methods that rely on simplified motion assumptions, our method learns a general latent dynamical system of the underlying object trajectory while respecting the geometric structure of rotations. Experimental results on real-world data demonstrate compelling forecasting capabilities compared to existing approaches.

MCML Authors
Link to website

Lennart Bastian

Computer Aided Medical Procedures & Augmented Reality

Link to website

Mohammad Rashed

Physics-based Simulation

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1909]
C. Benjamins, H. Graf, S. Segel, D. Deng, T. Ruhkopf, L. Hennig, S. Basu, N. Mallik, E. Bergman, D. Chen, F. Clément, M. Feurer, K. Eggensperger, F. Hutter, C. Doerr and M. Lindauer.
carps: A Framework for Comparing N Hyperparameter Optimizers on M Benchmarks.
Preprint (Jun. 2025). arXiv URL
Abstract

Hyperparameter Optimization (HPO) is crucial to develop well-performing machine learning models. In order to ease prototyping and benchmarking of HPO methods, we propose carps, a benchmark framework for Comprehensive Automated Research Performance Studies that allows evaluating N optimizers on M benchmark tasks. In this first release of carps, we focus on the four most important HPO task types: blackbox, multi-fidelity, multi-objective and multi-fidelity-multi-objective. With 3,336 tasks from 5 community benchmark collections and 28 variants of 9 optimizer families, we offer the biggest go-to library to date to evaluate and compare HPO methods. The carps framework relies on a purpose-built, lightweight interface, gluing together optimizers and benchmark tasks. It also features an analysis pipeline, facilitating the evaluation of optimizers on benchmarks. However, navigating a huge number of tasks while developing and comparing methods can be computationally infeasible. To address this, we obtain a subset of representative tasks by minimizing the star discrepancy of the subset, in the space spanned by the full set. As a result, we propose an initial subset of 10 to 30 diverse tasks for each task type, and include functionality to re-compute subsets as more benchmarks become available, enabling efficient evaluations. We also establish a first set of baseline results on these tasks as a measure for future comparisons. With carps (this https URL), we make an important step in the standardization of HPO evaluation.

MCML Authors
Link to Profile Matthias Feurer

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science


[1908]
A. Bergmeister, M. K. Lal, S. Jegelka and S. Sra.
A projection-based framework for gradient-free and parallel learning.
Preprint (Jun. 2025). arXiv
Abstract

We present a feasibility-seeking approach to neural network training. This mathematical optimization framework is distinct from conventional gradient-based loss minimization and uses projection operators and iterative projection algorithms. We reformulate training as a large-scale feasibility problem: finding network parameters and states that satisfy local constraints derived from its elementary operations. Training then involves projecting onto these constraints, a local operation that can be parallelized across the network. We introduce PJAX, a JAX-based software framework that enables this paradigm. PJAX composes projection operators for elementary operations, automatically deriving the solution operators for the feasibility problems (akin to autodiff for derivatives). It inherently supports GPU/TPU acceleration, provides a familiar NumPy-like API, and is extensible. We train diverse architectures (MLPs, CNNs, RNNs) on standard benchmarks using PJAX, demonstrating its functionality and generality. Our results show that this approach is a compelling alternative to gradient-based training, with clear advantages in parallelism and the ability to handle non-differentiable operations.
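The feasibility-seeking idea can be illustrated on a toy problem with the classical method of alternating projections (POCS): cyclically projecting onto each constraint set converges to a point satisfying all constraints. PJAX composes projection operators for network operations; the hyperplane constraints here are only a minimal stand-in for illustration.

```python
import numpy as np

def project_hyperplane(x, a, b):
    """Project x onto the hyperplane {x : a.x = b} (closest point)."""
    return x - (a @ x - b) / (a @ a) * a

def pocs(x, constraints, iters=200):
    """Projections Onto Convex Sets: cyclically project onto each
    constraint until a point satisfying all of them is reached."""
    for _ in range(iters):
        for a, b in constraints:
            x = project_hyperplane(x, a, b)
    return x

# Feasibility problem: find x with x0 + x1 = 2 and x0 - x1 = 0.
constraints = [(np.array([1.0, 1.0]), 2.0), (np.array([1.0, -1.0]), 0.0)]
x = pocs(np.zeros(2), constraints)
print(x)   # the unique feasible point [1, 1]
```

In the paper's setting, each elementary network operation contributes such a constraint, and the projections are local and hence parallelizable across the network.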

MCML Authors
Link to website

Andreas Bergmeister

Foundations of Deep Neural Networks

Link to Profile Stefanie Jegelka

Stefanie Jegelka

Prof. Dr.

Foundations of Deep Neural Networks

Link to Profile Suvrit Sra

Suvrit Sra

Prof. Dr.

Resource Aware Machine Learning


[1907]
N. Bhatia, P. Rinke and O. Krejci.
Leveraging active learning-enhanced machine-learned interatomic potential for efficient infrared spectra prediction.
Preprint (Jun. 2025). arXiv
Abstract

Infrared (IR) spectroscopy is a pivotal analytical tool as it provides real-time molecular insight into material structures and enables the observation of reaction intermediates in situ. However, interpreting IR spectra often requires high-fidelity simulations, such as density functional theory based ab-initio molecular dynamics, which are computationally expensive and therefore limited in the tractable system size and complexity. In this work, we present a novel active learning-based framework, implemented in the open-source software package PALIRS, for efficiently predicting the IR spectra of small catalytically relevant organic molecules. PALIRS leverages active learning to train a machine-learned interatomic potential, which is then used for machine learning-assisted molecular dynamics simulations to calculate IR spectra. PALIRS reproduces IR spectra computed with ab-initio molecular dynamics accurately at a fraction of the computational cost. PALIRS further agrees well with available experimental data not only for IR peak positions but also for their amplitudes. This advancement with PALIRS enables high-throughput prediction of IR spectra, facilitating the exploration of larger and more intricate catalytic systems and aiding the identification of novel reaction pathways.

MCML Authors
Link to Profile Patrick Rinke

Patrick Rinke

Prof. Dr.

AI-based Material Science


[1906]
D. Biagini, N. Navab and A. Farshad.
HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation.
Preprint (Jun. 2025). arXiv
Abstract

Surgical Video Synthesis has emerged as a promising research direction following the success of diffusion models in general-domain video generation. Although existing approaches achieve high-quality video generation, most are unconditional and fail to maintain consistency with surgical actions and phases, lacking the surgical understanding and fine-grained guidance necessary for factual simulation. We address these challenges by proposing HieraSurg, a hierarchy-aware surgical video generation framework consisting of two specialized diffusion models. Given a surgical phase and an initial frame, HieraSurg first predicts future coarse-grained semantic changes through a segmentation prediction model. The final video is then generated by a second-stage model that augments these temporal segmentation maps with fine-grained visual features, leading to effective texture rendering and integration of semantic information in the video space. Our approach leverages surgical information at multiple levels of abstraction, including surgical phase, action triplets, and panoptic segmentation maps. The experimental results on Cholecystectomy Surgical Video Generation demonstrate that the model significantly outperforms prior work both quantitatively and qualitatively, showing strong generalization capabilities and the ability to generate higher frame-rate videos. The model exhibits particularly fine-grained adherence when provided with existing segmentation maps, suggesting its potential for practical surgical applications.

MCML Authors
Link to website

Diego Biagini

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to website

Azade Farshad

Dr.

Computer Aided Medical Procedures & Augmented Reality


[1905]
V. Blaschke, M. Winkler, C. Förster, G. Wenger-Glemser and B. Plank.
A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation.
Preprint (Jun. 2025). arXiv
Abstract

Although Germany has a diverse landscape of dialects, they are underrepresented in current automatic speech recognition (ASR) research. To enable studies of how robust models are towards dialectal variation, we present Betthupferl, an evaluation dataset containing four hours of read speech in three dialect groups spoken in Southeast Germany (Franconian, Bavarian, Alemannic), and half an hour of Standard German speech. We provide both dialectal and Standard German transcriptions, and analyze the linguistic differences between them. We benchmark several multilingual state-of-the-art ASR models on speech translation into Standard German, and find differences between how much the output resembles the dialectal vs. standardized transcriptions. Qualitative error analyses of the best ASR model reveal that it sometimes normalizes grammatical differences, but often stays closer to the dialectal constructions.

MCML Authors
Link to website

Verena Blaschke

AI and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1904]
F. Bongratz, T. N. Wolf, J. G. Ramon and C. Wachinger.
X-SiT: Inherently Interpretable Surface Vision Transformers for Dementia Diagnosis.
Preprint (Jun. 2025). arXiv
Abstract

Interpretable models are crucial for supporting clinical decision-making, driving advances in their development and application for medical images. However, the nature of 3D volumetric data makes it inherently challenging to visualize and interpret intricate and complex structures like the cerebral cortex. Cortical surface renderings, on the other hand, provide a more accessible and understandable 3D representation of brain anatomy, facilitating visualization and interactive exploration. Motivated by this advantage and the widespread use of surface data for studying neurological disorders, we present the eXplainable Surface Vision Transformer (X-SiT). This is the first inherently interpretable neural network that offers human-understandable predictions based on interpretable cortical features. As part of X-SiT, we introduce a prototypical surface patch decoder for classifying surface patch embeddings, incorporating case-based reasoning with spatially corresponding cortical prototypes. The results demonstrate state-of-the-art performance in detecting Alzheimer’s disease and frontotemporal dementia while additionally providing informative prototypes that align with known disease patterns and reveal classification errors.

MCML Authors
Fabian Bongratz

Artificial Intelligence in Medical Imaging

Link to website

Tom Nuno Wolf

Artificial Intelligence in Medical Imaging

Link to Profile Christian Wachinger

Christian Wachinger

Prof. Dr.

Artificial Intelligence in Medical Imaging


[1903]
S. Casola, Y. J. Liu, S. Peng, O. Kraus, A. Gatt and B. Plank.
Evaluation Should Not Ignore Variation: On the Impact of Reference Set Choice on Summarization Metrics.
Preprint (Jun. 2025). arXiv
Abstract

Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of using different reference sets on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.
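The instability the paper measures can be reproduced in miniature: with a set-based ROUGE-1 recall, the ranking of two candidate summaries flips depending on which reference set is used. The two-system example below is invented for illustration, not taken from SummEval, GUMSum, or DUC2004.

```python
def rouge1_recall(candidate, reference):
    # Set-based unigram recall: fraction of reference types covered.
    cand, ref = set(candidate.split()), set(reference.split())
    return len(cand & ref) / len(ref)

sys_a = "the cat sat"
sys_b = "a feline rested"
ref_1 = "the cat sat on the mat"        # reference set 1
ref_2 = "a feline rested on the rug"    # reference set 2 (same meaning)

print(rouge1_recall(sys_a, ref_1), rouge1_recall(sys_b, ref_1))  # A ranked higher
print(rouge1_recall(sys_a, ref_2), rouge1_recall(sys_b, ref_2))  # B ranked higher
```

Both references describe the same event, yet the n-gram metric reverses the model ranking — exactly the reference-set sensitivity the paper quantifies at scale.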

MCML Authors
Link to website

Silvia Casola

Dr.

AI and Computational Linguistics

Link to website

Yang Janet Liu

AI and Computational Linguistics

Link to website

Siyao Peng

Dr.

AI and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1902]
C. Casolo, S. Becker and N. Kilbertus.
Identifiability Challenges in Sparse Linear Ordinary Differential Equations.
Preprint (Jun. 2025). arXiv
Abstract

Dynamical systems modeling is a core pillar of scientific inquiry across natural and life sciences. Increasingly, dynamical system models are learned from data, rendering identifiability a paramount concept. For systems that are not identifiable from data, no guarantees can be given about their behavior under new conditions and inputs, or about possible control mechanisms to steer the system. It is known in the community that 'linear ordinary differential equations (ODE) are almost surely identifiable from a single trajectory.' However, this only holds for dense matrices. The sparse regime remains underexplored, despite its practical relevance with sparsity arising naturally in many biological, social, and physical systems. In this work, we address this gap by characterizing the identifiability of sparse linear ODEs. Contrary to the dense case, we show that sparse systems are unidentifiable with a positive probability in practically relevant sparsity regimes and provide lower bounds for this probability. We further study empirically how this theoretical unidentifiability manifests in state-of-the-art methods to estimate linear ODEs from data. Our results corroborate that sparse systems are also practically unidentifiable. Theoretical limitations are not resolved through inductive biases or optimization dynamics. Our findings call for rethinking what can be expected from data-driven dynamical system modeling and allow for quantitative assessments of how much to trust a learned linear ODE.
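The folklore result quoted in the abstract can be checked numerically: simulate one trajectory of x' = Ax for a generic (dense) A and recover A by regressing finite-difference derivatives on states. This sketch shows the dense case succeeding; the paper's contribution is that sparse A can fail even in this idealized setting.

```python
import numpy as np

rng = np.random.default_rng(1)
A = 0.4 * rng.normal(size=(3, 3))    # generic dense system matrix
dt, T = 1e-3, 5000

# Forward-Euler simulation of a single trajectory of x' = A x.
x = np.empty((T, 3))
x[0] = rng.normal(size=3)
for t in range(T - 1):
    x[t + 1] = x[t] + dt * A @ x[t]

# Regress finite-difference derivatives on states: dx[t] = A x[t].
dx = (x[1:] - x[:-1]) / dt
A_hat, *_ = np.linalg.lstsq(x[:-1], dx, rcond=None)
print(np.max(np.abs(A_hat.T - A)))   # recovery error, essentially zero
```

Because the trajectory of a generic dense system explores all directions of state space, the regression has a unique solution; sparse systems can leave directions unexplored, which is where identifiability breaks down.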

MCML Authors
Link to website

Cecilia Casolo

Ethics in Systems Design and Machine Learning

Link to website

Sören Becker

Ethics in Systems Design and Machine Learning

Link to Profile Niki Kilbertus

Niki Kilbertus

Prof. Dr.

Ethics in Systems Design and Machine Learning


[1901]
S. Chen, Y. Shi and X. Zhu.
Enhancing Monocular Height Estimation via Weak Supervision from Imperfect Labels.
Preprint (Jun. 2025). arXiv GitHub
Abstract

Monocular height estimation is considered the most efficient and cost-effective means of 3D perception in remote sensing, and it has attracted much attention since the emergence of deep learning. While training neural networks requires a large amount of data, data with perfect labels are scarce and only available within developed regions. The trained models therefore lack generalizability, which limits the potential for large-scale application of existing methods. We tackle this problem for the first time, by introducing data with imperfect labels into training pixel-wise height estimation networks, including labels that are incomplete, inexact, and inaccurate compared to high-quality labels. We propose an ensemble-based pipeline compatible with any monocular height estimation network. Taking the challenges of noisy labels, domain shift, and long-tailed distribution of height values into consideration, we carefully design the architecture and loss functions to leverage the information concealed in imperfect labels using weak supervision through balanced soft losses and ordinal constraints. We conduct extensive experiments on two datasets with different resolutions, DFC23 (0.5 to 1 m) and GBH (3 m). The results indicate that the proposed pipeline outperforms baselines by achieving more balanced performance across various domains, leading to improvements of average root mean square errors up to 22.94 %, and 18.62 % on DFC23 and GBH, respectively. The efficacy of each design component is validated through ablation studies.

MCML Authors
Link to website

Sining Chen

Data Science in Earth Observation

Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1900]
T. Cheng, T. Vatter, T. Nagler and K. Chen.
Vine Copulas as Differentiable Computational Graphs.
Preprint (Jun. 2025). arXiv
Abstract

Vine copulas are sophisticated models for multivariate distributions and are increasingly used in machine learning. To facilitate their integration into modern ML pipelines, we introduce the vine computational graph, a DAG that abstracts the multilevel vine structure and associated computations. On this foundation, we devise new algorithms for conditional sampling, efficient sampling-order scheduling, and constructing vine structures for customized conditioning variables. We implement these ideas in torchvinecopulib, a GPU-accelerated Python library built upon PyTorch, delivering improved scalability for fitting, sampling, and density evaluation. Our experiments illustrate how gradient flowing through the vine can improve Vine Copula Autoencoders and that incorporating vines for uncertainty quantification in deep learning can outperform MC-dropout, deep ensembles, and Bayesian Neural Networks in sharpness, calibration, and runtime. By recasting vine copula models as computational graphs, our work connects classical dependence modeling with modern deep-learning toolchains and facilitates the integration of state-of-the-art copula methods in modern machine learning pipelines.
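The computational unit that vine copulas chain together is the pair-copula h-function; conditional sampling inverts it. The sketch below shows this for a single bivariate Gaussian copula using only the standard library — it is generic copula math, not the torchvinecopulib API.

```python
from statistics import NormalDist
import math
import random

nd = NormalDist()

def h_inv(w, v, rho):
    # Inverse h-function of a bivariate Gaussian copula:
    # maps a uniform w to a draw of U given V = v.
    z = nd.inv_cdf(w) * math.sqrt(1 - rho * rho) + rho * nd.inv_cdf(v)
    return nd.cdf(z)

rng = random.Random(0)
rho, n = 0.8, 20000
v = [rng.random() for _ in range(n)]
u = [h_inv(rng.random(), vi, rho) for vi in v]    # sample U | V = v

# Sanity check: on the normal scale the pair has correlation close to rho.
zu = [nd.inv_cdf(ui) for ui in u]
zv = [nd.inv_cdf(vi) for vi in v]
mu, mv = sum(zu) / n, sum(zv) / n
cov = sum((a - mu) * (b - mv) for a, b in zip(zu, zv)) / n
pearson = cov / math.sqrt(
    sum((a - mu) ** 2 for a in zu) / n * sum((b - mv) ** 2 for b in zv) / n
)
print(pearson)  # close to 0.8
```

A vine composes many such h-functions along a DAG of conditioning steps, which is exactly the "computational graph" view the paper formalizes and differentiates through.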

MCML Authors
Link to Profile Thomas Nagler

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science


[1899]
E. Santos Escriche and S. Jegelka.
Learning equivariant models by discovering symmetries with learnable augmentations.
Preprint (Jun. 2025). arXiv
Abstract

Recently, a trend has emerged that favors learning relevant symmetries from data in geometric domains instead of designing constrained architectures. To do so, two popular options are (1) to modify the training protocol, e.g., with a specific loss and data augmentations (soft equivariance), or (2) to ignore equivariance and infer it only implicitly. However, both options have limitations: soft equivariance requires a priori knowledge about relevant symmetries, while inferring symmetries merely via the task and larger data lacks interpretability. To address both limitations, we propose SEMoLA, an end-to-end approach that jointly (1) discovers a priori unknown symmetries in the data via learnable data augmentations, and (2) softly encodes the respective approximate equivariance into an arbitrary unconstrained model. Hence, it does not need prior knowledge about symmetries, it offers interpretability, and it maintains robustness to distribution shifts. Empirically, we demonstrate the ability of SEMoLA to robustly discover relevant symmetries while achieving high prediction accuracy across various datasets, encompassing multiple data modalities and underlying symmetry groups.

MCML Authors
Link to Profile Stefanie Jegelka

Stefanie Jegelka

Prof. Dr.

Foundations of Deep Neural Networks


[1898]
E. Garces Arias, H. Blocher, J. Rodemann, M. Aßenmacher and C. Jansen.
Statistical Multicriteria Evaluation of LLM-Generated Text.
Preprint (Jun. 2025). arXiv
Abstract

Assessing the quality of LLM-generated text remains a fundamental challenge in natural language processing. Current evaluation approaches often rely on isolated metrics or simplistic aggregations that fail to capture the nuanced trade-offs between coherence, diversity, fluency, and other relevant indicators of text quality. In this work, we adapt a recently proposed framework for statistical inference based on Generalized Stochastic Dominance (GSD) that addresses three critical limitations in existing benchmarking methodologies: the inadequacy of single-metric evaluation, the incompatibility between cardinal automatic metrics and ordinal human judgments, and the lack of inferential statistical guarantees. The GSD-front approach enables simultaneous evaluation across multiple quality dimensions while respecting their different measurement scales, building upon partial orders of decoding strategies, thus avoiding arbitrary weighting of the involved metrics. By applying this framework to evaluate common decoding strategies against human-generated text, we demonstrate its ability to identify statistically significant performance differences while accounting for potential deviations from the i.i.d. assumption of the sampling design.
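As a minimal point of reference for the dominance orderings the GSD framework generalizes, here is an empirical-CDF check of first-order stochastic dominance between two metric samples. This covers only first-order dominance, not the paper's GSD-front inference; the score samples are invented for illustration.

```python
def ecdf(sample, t):
    # Empirical CDF of the sample evaluated at threshold t.
    return sum(x <= t for x in sample) / len(sample)

def dominates(a, b):
    # a first-order stochastically dominates b iff F_a(t) <= F_b(t) for all t,
    # i.e. a never puts more mass on low scores than b does.
    grid = sorted(set(a) | set(b))
    return all(ecdf(a, t) <= ecdf(b, t) for t in grid)

scores_a = [0.62, 0.71, 0.55, 0.80, 0.67]   # hypothetical decoder A scores
scores_b = [0.48, 0.60, 0.52, 0.66, 0.58]   # hypothetical decoder B scores
print(dominates(scores_a, scores_b))  # True
print(dominates(scores_b, scores_a))  # False
```

Generalized stochastic dominance extends this pointwise-CDF comparison to several quality dimensions with mixed measurement scales, which is what allows the paper to avoid arbitrary metric weighting.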

MCML Authors
Link to website

Esteban Garces Arias

Statistical Learning and Data Science

Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[1897]
K. Göbler, T. Windisch and M. Drton.
Nonlinear Causal Discovery for Grouped Data.
Preprint (Jun. 2025). arXiv
Abstract

Inferring cause-effect relationships from observational data has gained significant attention in recent years, but most methods are limited to scalar random variables. In many important domains, including neuroscience, psychology, social science, and industrial manufacturing, the causal units of interest are groups of variables rather than individual scalar measurements. Motivated by these applications, we extend nonlinear additive noise models to handle random vectors, establishing a two-step approach for causal graph learning: First, infer the causal order among random vectors. Second, perform model selection to identify the best graph consistent with this order. We introduce effective and novel solutions for both steps in the vector case, demonstrating strong performance in simulations. Finally, we apply our method to real-world assembly line data with partial knowledge of causal ordering among variable groups.

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[1896]
E. Guha, R. Marten, S. Keh, N. Raoof, G. Smyrnis, H. Bansal, M. Nezhurina, J. Mercat, T. Vu, Z. Sprague, A. Suvarna, B. Feuer, L. Chen, Z. Khan, E. Frankel, S. Grover, C. Choi, N. Muennighoff, S. Su, W. Zhao, J. Yang, S. Pimpalgaonkar, K. Sharma, C. C.-J. Ji, Y. Deng, S. Pratt, V. Ramanujan, J. Saad-Falcon, J. Li, A. Dave, A. Albalak, K. Arora, B. Wulfe, C. Hegde, G. Durrett, S. Oh, M. Bansal, S. Gabriel, A. Grover, K.-W. Chang, V. Shankar, A. Gokaslan, M. A. Merrill, T. Hashimoto, Y. Choi, J. Jitsev, R. Heckel, M. Sathiamoorthy, A. G. Dimakis and L. Schmidt.
OpenThoughts: Data Recipes for Reasoning Models.
Preprint (Jun. 2025). arXiv URL
Abstract

Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThoughts3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond - improvements of 15.3, 17.2, and 20.5 percentage points over DeepSeek-R1-Distill-Qwen-7B.

MCML Authors
Link to Profile Reinhard Heckel

Reinhard Heckel

Prof. Dr.

Machine Learning and Information Processing


[1895]
J. Huang, J. Liang, J. Hu, M. Sundermeyer, P. K. T. Yu, N. Navab and B. Busam.
XYZ-IBD: High-precision Bin-picking Dataset for Object 6D Pose Estimation Capturing Real-world Industrial Complexity.
Preprint (Jun. 2025). arXiv GitHub
Abstract

We introduce XYZ-IBD, a bin-picking dataset for 6D pose estimation that captures real-world industrial complexity, including challenging object geometries, reflective materials, severe occlusions, and dense clutter. The dataset reflects authentic robotic manipulation scenarios with millimeter-accurate annotations. Unlike existing datasets that primarily focus on household objects, which approach saturation, XYZ-IBD represents realistic industrial conditions that remain unsolved. The dataset features 15 texture-less, metallic, and mostly symmetrical objects of varying shapes and sizes. These objects are heavily occluded and randomly arranged in bins with high density, replicating the challenges of real-world bin-picking. XYZ-IBD was collected using two high-precision industrial cameras and one commercially available camera, providing RGB, grayscale, and depth images. It contains 75 multi-view real-world scenes, along with a large-scale synthetic dataset rendered under simulated bin-picking conditions. We employ a meticulous annotation pipeline that includes anti-reflection spray, multi-view depth fusion, and semi-automatic annotation, achieving millimeter-level pose labeling accuracy required for industrial manipulation. Quantification in simulated environments confirms the reliability of the ground-truth annotations. We benchmark state-of-the-art methods on 2D detection, 6D pose estimation, and depth estimation tasks on our dataset, revealing significant performance degradation in our setups compared to current academic household benchmarks. By capturing the complexity of real-world bin-picking scenarios, XYZ-IBD introduces more realistic and challenging problems for future research.

MCML Authors
Link to website

Junwen Huang

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to website

Benjamin Busam

Dr.

Computer Aided Medical Procedures & Augmented Reality


[1894]
R. Huang, G. Zhai, Z. Bauer, M. Pollefeys, F. Tombari, L. Guibas, G. Huang and F. Engelmann.
Video Perception Models for 3D Scene Synthesis.
Preprint (Jun. 2025). arXiv
Abstract

Traditionally, 3D scene synthesis requires expert knowledge and significant manual effort. Automating this process could greatly benefit fields such as architectural design, robotics simulation, virtual reality, and gaming. Recent approaches to 3D scene synthesis often rely on the commonsense reasoning of large language models (LLMs) or strong visual priors of modern image generation models. However, current LLMs demonstrate limited 3D spatial reasoning ability, which restricts their ability to generate realistic and coherent 3D scenes. Meanwhile, image generation-based methods often suffer from constraints in viewpoint selection and multi-view inconsistencies. In this work, we present Video Perception models for 3D Scene synthesis (VIPScene), a novel framework that exploits the encoded commonsense knowledge of the 3D physical world in video generation models to ensure coherent scene layouts and consistent object placements across views. VIPScene accepts both text and image prompts and seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to semantically and geometrically analyze each object in a scene. This enables flexible scene synthesis with high realism and structural consistency. For more precise analysis, we further introduce First-Person View Score (FPVScore) for coherence and plausibility evaluation, utilizing continuous first-person perspective to capitalize on the reasoning ability of multimodal large language models. Extensive experiments show that VIPScene significantly outperforms existing methods and generalizes well across diverse scenarios. The code will be released.

MCML Authors
Link to website

Guangyao Zhai

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Federico Tombari

Federico Tombari

PD Dr.

Computer Aided Medical Procedures & Augmented Reality


[1893]
E. Kavak, T. N. Wolf and C. Wachinger.
DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation.
Preprint (Jun. 2025). arXiv
Abstract

During prediction tasks, models can use any signal they receive to come up with the final answer - including signals that are causally irrelevant. When predicting objects from images, for example, the lighting conditions could be correlated to different targets through selection bias, and an oblivious model might use these signals as shortcuts to discern between various objects. A predictor that uses lighting conditions instead of real object-specific details is obviously undesirable. To address this challenge, we introduce a standard anti-causal prediction model (SAM) that creates a causal framework for analyzing the information pathways influencing our predictor in anti-causal settings. We demonstrate that a classifier satisfying a specific conditional independence criterion will focus solely on the direct causal path from label to image, being counterfactually invariant to the remaining variables. Finally, we propose DISCO, a novel regularization strategy that uses conditional distance correlation to optimize for conditional independence in regression tasks. We can show that DISCO achieves competitive results in different bias mitigation experiments, deeming it a valid alternative to classical kernel-based methods.
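The statistic DISCO's regularizer builds on can be computed directly. Below is the standard (unconditional) sample distance correlation via double-centered distance matrices; the paper's method uses a conditional version for its independence criterion, which this minimal sketch omits, and the test data are invented.

```python
import numpy as np

def dcor(x, y):
    # Sample distance correlation of two 1-D samples (Szekely et al.).
    a = np.abs(x[:, None] - x[None, :])          # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                        # squared distance covariance
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
z = rng.normal(size=500)                          # independent of x
print(dcor(x, x), dcor(x, x**2), dcor(x, z))
```

Unlike Pearson correlation, dcor(x, x**2) is large even though the dependence is purely nonlinear, while dcor(x, z) stays near zero — the property that makes it usable as an independence-enforcing penalty during training.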

MCML Authors
Link to website

Emre Kavak

Artificial Intelligence in Medical Imaging

Link to website

Tom Nuno Wolf

Artificial Intelligence in Medical Imaging

Link to Profile Christian Wachinger

Christian Wachinger

Prof. Dr.

Artificial Intelligence in Medical Imaging


[1892]
O. Kuzyk, Z. Li, M. Pollefeys and X. Wang.
VisualChef: Generating Visual Aids in Cooking via Mask Inpainting.
Preprint (Jun. 2025). arXiv
Abstract

Cooking requires not only following instructions but also understanding, executing, and monitoring each step - a process that can be challenging without visual guidance. Although recipe images and videos offer helpful cues, they often lack consistency in focus, tools, and setup. To better support the cooking process, we introduce VisualChef, a method for generating contextual visual aids tailored to cooking scenarios. Given an initial frame and a specified action, VisualChef generates images depicting both the action’s execution and the resulting appearance of the object, while preserving the initial frame’s environment. Previous work aims to integrate knowledge extracted from large language models by generating detailed textual descriptions to guide image generation, which requires fine-grained visual-textual alignment and involves additional annotations. In contrast, VisualChef simplifies alignment through mask-based visual grounding. Our key insight is identifying action-relevant objects and classifying them to enable targeted modifications that reflect the intended action and outcome while maintaining a consistent environment. In addition, we propose an automated pipeline to extract high-quality initial, action, and final state frames. We evaluate VisualChef quantitatively and qualitatively on three egocentric video datasets and show its improvements over state-of-the-art methods.

MCML Authors
Link to Profile Xi Wang

Xi Wang

Dr.

Computer Vision & Artificial Intelligence


[1891]
X. Li, D. Huang, Y. Zhang, N. Navab and Z. Jiang.
Semantic Scene Graph for Ultrasound Image Explanation and Scanning Guidance.
Preprint (Jun. 2025). arXiv
Abstract

Understanding medical ultrasound imaging remains a long-standing challenge due to significant visual variability caused by differences in imaging and acquisition parameters. Recent advancements in large language models (LLMs) have been used to automatically generate terminology-rich summaries oriented to clinicians with sufficient physiological knowledge. Nevertheless, the increasing demand for improved ultrasound interpretability and basic scanning guidance among non-expert users, e.g., in point-of-care settings, has not yet been explored. In this study, we first introduce the scene graph (SG) for ultrasound images to explain image content to ordinary users and provide guidance for ultrasound scanning. The ultrasound SG is first computed using a transformer-based one-stage method, eliminating the need for explicit object detection. To generate a graspable image explanation for ordinary users, the user query is then used to further refine the abstract SG representation through LLMs. Additionally, the predicted SG is explored for its potential in guiding ultrasound scanning toward missing anatomies within the current imaging view, assisting ordinary users in achieving more standardized and complete anatomical exploration. The effectiveness of this SG-based image explanation and scanning guidance has been validated on images from the left and right neck regions, including the carotid and thyroid, across five volunteers. The results demonstrate the potential of the method to maximally democratize ultrasound by enhancing its interpretability and usability for ordinary users.

MCML Authors
Link to website

Xuesong Li

Computer Aided Medical Procedures & Augmented Reality

Link to website

Dianye Huang

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to website

Zhongliang Jiang

Dr.

Computer Aided Medical Procedures & Augmented Reality


[1890]
J. Liu, H. Li, C. Yang, M. Deutges, A. Sadafi, X. You, K. Breininger, N. Navab and P. J. Schüffler.
HASD: Hierarchical Adaption for pathology Slide-level Domain-shift.
Preprint (Jun. 2025). arXiv
Abstract

Domain shift is a critical problem for pathology AI, as pathology data is heavily influenced by center-specific conditions. Current pathology domain adaptation methods focus on image patches rather than whole slide images (WSIs), thus failing to capture the global WSI features required in typical clinical scenarios. In this work, we address the challenges of slide-level domain shift by proposing a Hierarchical Adaptation framework for Slide-level Domain-shift (HASD). HASD achieves multi-scale feature consistency and computationally efficient slide-level domain adaptation through two key components: (1) a hierarchical adaptation framework that integrates a Domain-level Alignment Solver for feature alignment, a Slide-level Geometric Invariance Regularization to preserve the morphological structure, and a Patch-level Attention Consistency Regularization to maintain local critical diagnostic cues; and (2) a prototype selection mechanism that reduces computational overhead. We validate our method on two slide-level tasks across five datasets, achieving a 4.1% AUROC improvement in a Breast Cancer HER2 Grading cohort and a 3.9% C-index gain in a UCEC survival prediction cohort. Our method provides a practical and reliable slide-level domain adaptation solution for pathology institutions, minimizing both computational and annotation costs.

MCML Authors
Link to website

Jingsong Liu

Computational Pathology

Link to website

Han Li

Dr.

Computational Pathology

Link to website

Xin You

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Peter Schüffler

Peter Schüffler

Prof. Dr.

Computational Pathology


[1889]
X. Ma, C. Lin, Y. Zhang, V. Tresp and Y. Ma.
Agentic Neural Networks: Self-Evolving Multi-Agent Systems via Textual Backpropagation.
Preprint (Jun. 2025). arXiv
Abstract

Leveraging multiple Large Language Models (LLMs) has proven effective for addressing complex, high-dimensional tasks, but current approaches often rely on static, manually engineered multi-agent configurations. To overcome these constraints, we present the Agentic Neural Network (ANN), a framework that conceptualizes multi-agent collaboration as a layered neural network architecture. In this design, each agent operates as a node, and each layer forms a cooperative 'team' focused on a specific subtask. The Agentic Neural Network follows a two-phase optimization strategy: (1) Forward Phase: drawing inspiration from neural network forward passes, tasks are dynamically decomposed into subtasks, and cooperative agent teams with suitable aggregation methods are constructed layer by layer. (2) Backward Phase: mirroring backpropagation, we refine both global and local collaboration through iterative feedback, allowing agents to self-evolve their roles, prompts, and coordination. This neuro-symbolic approach enables ANN to create new or specialized agent teams post-training, delivering notable gains in accuracy and adaptability. Across four benchmark datasets, ANN surpasses leading multi-agent baselines under the same configurations, showing consistent performance improvements. Our findings indicate that ANN provides a scalable, data-driven framework for multi-agent systems, combining the collaborative capabilities of LLMs with the efficiency and flexibility of neural network principles. We plan to open-source the entire framework.

MCML Authors
Link to website

Yao Zhang

Database Systems and Data Mining

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems and Data Mining

Link to website

Yunpu Ma

Dr.

Database Systems and Data Mining


[1888]
Y. Ma, D. Frauen, E. Javurek and S. Feuerriegel.
Foundation Models for Causal Inference via Prior-Data Fitted Networks.
Preprint (Jun. 2025). arXiv
Abstract

Prior-data fitted networks (PFNs) have recently been proposed as a promising way to train tabular foundation models. PFNs are transformers that are pre-trained on synthetic data generated from a prespecified prior distribution and that enable Bayesian inference through in-context learning. In this paper, we introduce CausalFM, a comprehensive framework for training PFN-based foundation models in various causal inference settings. First, we formalize the construction of Bayesian priors for causal inference based on structural causal models (SCMs) in a principled way and derive necessary criteria for the validity of such priors. Building on this, we propose a novel family of prior distributions using causality-inspired Bayesian neural networks that enable CausalFM to perform Bayesian causal inference in various settings, including back-door, front-door, and instrumental variable adjustment. Finally, we instantiate CausalFM and explicitly train a foundation model for estimating conditional average treatment effects (CATEs) using back-door adjustment. We show that CausalFM performs competitively for CATE estimation using various synthetic and semi-synthetic benchmarks. In sum, our framework can be used as a general recipe to train foundation models for various causal inference settings. In contrast to the current state-of-the-art in causal inference, CausalFM offers a novel paradigm with the potential to fundamentally change how practitioners perform causal inference in medicine, economics, and other disciplines.

MCML Authors
Link to website

Yuchen Ma

Artificial Intelligence in Management

Link to website

Dennis Frauen

Artificial Intelligence in Management

Link to website

Emil Javurek

Artificial Intelligence in Management

Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1887]
P.-F. Massiani, C. Fiedler, L. Haverbeck, F. Solowjow and S. Trimpe.
A kernel conditional two-sample test.
Preprint (Jun. 2025). arXiv
Abstract

We propose a framework for hypothesis testing on conditional probability distributions, which we then use to construct conditional two-sample statistical tests. These tests identify the inputs – called covariates in this context – where two conditional expectations differ with high probability. Our key idea is to transform confidence bounds of a learning method into a conditional two-sample test, and we instantiate this principle for kernel ridge regression (KRR) and conditional kernel mean embeddings. We generalize existing pointwise-in-time or time-uniform confidence bounds for KRR to previously-inaccessible yet essential cases such as infinite-dimensional outputs with non-trace-class kernels. These bounds enable circumventing the need for independent data in our statistical tests, since they allow online sampling. We also introduce bootstrapping schemes leveraging the parametric form of testing thresholds identified in theory to avoid tuning inaccessible parameters, making our method readily applicable in practice. Such conditional two-sample tests are especially relevant in applications where data arrive sequentially or non-independently, or when output distributions vary with operational parameters. We demonstrate their utility through examples in process monitoring and comparison of dynamical systems. Overall, our results establish a comprehensive foundation for conditional two-sample testing, from theoretical guarantees to practical implementation, and advance the state-of-the-art on the concentration of vector-valued least squares estimation.
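The estimator underlying the test is kernel ridge regression. As a rough illustration only (not the authors' implementation, and without the confidence bounds that are the paper's contribution), a KRR fit-and-predict with an RBF kernel can be written in a few lines; `lam` and `gamma` are illustrative hyperparameters:

```python
import numpy as np

def krr_fit_predict(X, y, Xq, lam=1e-2, gamma=1.0):
    """Kernel ridge regression with an RBF kernel (illustrative sketch).

    X, y: training inputs/outputs; Xq: query inputs.
    lam is the ridge regularizer, gamma the RBF bandwidth (both assumed).
    """
    def rbf(A, B):
        # Squared pairwise distances, then the Gaussian kernel
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    K = rbf(X, X)
    # Standard KRR coefficients: (K + lam * n * I)^{-1} y
    alpha = np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)
    return rbf(Xq, X) @ alpha
```

In the paper's setting, confidence bounds around such an estimator (rather than the point prediction itself) are what get transformed into a conditional two-sample test.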

MCML Authors
Link to website

Christian Fiedler

Dr.

Applied Numerical Analysis


[1886]
C. J. Mertens, H. Häntze, S. Ziegelmayer, J. N. Kather, D. Truhn, S. H. Kim, F. Busch, D. Weller, B. Wiestler, M. Graf, F. Bamberg, C. L. Schlett, J. B. Weiss, S. Ringhof, E. Can, J. Schulz-Menger, T. Niendorf, J. Lammert, I. Molwitz, A. Kader, A. Hering, A. Meddeb, J. Nawabi, M. B. Schulze, T. Keil, S. N. Willich, L. Krist, M. Hadamitzky, A. Hannemann, F. Bassermann, D. Rückert, T. Pischon, A. Hapfelmeier, M. R. Makowski, K. K. Bressem and L. C. Adams.
Deep learning-enabled MRI phenotyping uncovers regional body composition heterogeneity and disease associations in two European population cohorts.
Preprint (Jun. 2025). DOI
Abstract

Body mass index (BMI) does not account for substantial inter-individual differences in regional fat and muscle compartments, which are relevant for the prevalence of cardiometabolic and cancer conditions. We applied a validated deep learning pipeline for automated segmentation of whole-body MRI scans in 45,851 adults from the UK Biobank and German National Cohort, enabling harmonized quantification of visceral (VAT), gluteofemoral (GFAT), and abdominal subcutaneous adipose tissue (ASAT), liver fat fraction (LFF), and trunk muscle volume. Associations with clinical conditions were evaluated using compartment measures adjusted for age, sex, height, and BMI. Our analysis demonstrates that regional adiposity and muscle volume show distinct associations with cardiometabolic and cancer prevalence, and that substantial disease heterogeneity exists within BMI strata. The analytic framework and reference data presented here will support future risk stratification efforts and facilitate the integration of automated MRI phenotyping into large-scale population and clinical research.

MCML Authors
Link to Profile Benedikt Wiestler

Benedikt Wiestler

Prof. Dr.

AI for Image-Guided Diagnosis and Therapy

Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine


[1885]
J. Min, H. Li, T. Nagler and S. Li.
Assessing Climate-Driven Mortality Risk: A Stochastic Approach with Distributed Lag Non-Linear Models.
Preprint (Jun. 2025). arXiv
Abstract

Assessing climate-driven mortality risk has become an emerging area of research in recent decades. In this paper, we propose a novel approach to explicitly incorporate climate-driven effects into both single- and multi-population stochastic mortality models. The new model consists of two components: a stochastic mortality model, and a distributed lag non-linear model (DLNM). The first component captures the non-climate long-term trend and volatility in mortality rates. The second component captures non-linear and lagged effects of climate variables on mortality, as well as the impact of heat waves and cold waves across different age groups. For model calibration, we propose a backfitting algorithm that allows us to disentangle the climate-driven mortality risk from the non-climate-driven stochastic mortality risk. We illustrate the effectiveness and superior performance of our model using data from three European regions: Athens, Lisbon, and Rome. Furthermore, we utilize future UTCI data generated from climate models to provide mortality projections into 2045 across these regions under two Representative Concentration Pathway (RCP) scenarios. The projections show a noticeable decrease in winter mortality alongside a rise in summer mortality, driven by a general increase in UTCI over time. Although we expect slightly lower overall mortality in the short term under RCP8.5 compared to RCP2.6, a long-term increase in total mortality is anticipated under the RCP8.5 scenario.

MCML Authors
Link to website

Han Li

Dr.

Computational Pathology

Link to Profile Thomas Nagler

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science


[1884]
C. Pellegrini, E. Özsoy, D. Bani-Harouni, M. Keicher and N. Navab.
From EHRs to Patient Pathways: Scalable Modeling of Longitudinal Health Trajectories with LLMs.
Preprint (Jun. 2025). arXiv
Abstract

Healthcare systems face significant challenges in managing and interpreting vast, heterogeneous patient data for personalized care. Existing approaches often focus on narrow use cases with a limited feature space, overlooking the complex, longitudinal interactions needed for a holistic understanding of patient health. In this work, we propose a novel approach to patient pathway modeling by transforming diverse electronic health record (EHR) data into a structured representation and designing a holistic pathway prediction model, EHR2Path, optimized to predict future health trajectories. Further, we introduce a novel summary mechanism that embeds long-term temporal context into topic-specific summary tokens, improving performance over text-only models, while being much more token-efficient. EHR2Path demonstrates strong performance in both next time-step prediction and longitudinal simulation, outperforming competitive baselines. It enables detailed simulations of patient trajectories, inherently targeting diverse evaluation tasks, such as forecasting vital signs, lab test results, or length-of-stay, opening a path towards predictive and personalized healthcare.

MCML Authors
Link to website

Chantal Pellegrini

Computer Aided Medical Procedures & Augmented Reality

Link to website

Ege Özsoy

Computer Aided Medical Procedures & Augmented Reality

Link to website

David Bani-Harouni

Computer Aided Medical Procedures & Augmented Reality

Link to website

Matthias Keicher

Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1883]
A. Rahma, C. Datar, A. Cukarska and F. Dietrich.
Rapid training of Hamiltonian graph networks without gradient descent.
Preprint (Jun. 2025). arXiv
Abstract

Learning dynamical systems that respect physical symmetries and constraints remains a fundamental challenge in data-driven modeling. Integrating physical laws with graph neural networks facilitates principled modeling of complex N-body dynamics and yields accurate and permutation-invariant models. However, training graph neural networks with iterative, gradient-based optimization algorithms (e.g., Adam, RMSProp, LBFGS) often leads to slow training, especially for large, complex systems. Comparing against 15 different optimizers, we demonstrate that Hamiltonian Graph Networks (HGN) can be trained up to 600x faster, with comparable accuracy, by replacing iterative optimization with random feature-based parameter construction. We show robust performance in diverse simulations, including N-body mass-spring systems in up to 3 dimensions with different geometries, while retaining essential physical invariances with respect to permutation, rotation, and translation. We reveal that even when trained on minimal 8-node systems, the model can generalize in a zero-shot manner to systems as large as 4096 nodes without retraining. Our work challenges the dominance of iterative gradient-descent-based optimization algorithms for training neural network models for physical systems.
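The gradient-free training idea — sample random hidden-layer parameters, then solve only for the output weights with a linear least-squares problem — can be sketched generically (this is a plain random-feature regressor, not the paper's HGN construction; `width` and the tanh activation are illustrative assumptions):

```python
import numpy as np

def random_feature_fit(X, y, width=200, seed=0):
    """Train a one-hidden-layer model without gradient descent.

    Hidden weights W, b are sampled once and frozen; only the output
    weights beta are obtained, via a least-squares solve.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], width))
    b = rng.normal(size=width)
    H = np.tanh(X @ W + b)                      # frozen random features
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # linear solve, no Adam
    return lambda Xq: np.tanh(Xq @ W + b) @ beta
```

A single linear solve replaces thousands of optimizer steps, which is where the claimed speedups come from.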

MCML Authors
Link to website

Atamert Rahma

Physics-enhanced Machine Learning

Link to website

Chinmay Datar

Physics-enhanced Machine Learning

Link to website

Ana Cukarska

Physics-enhanced Machine Learning

Link to Profile Felix Dietrich

Felix Dietrich

Prof. Dr.

Physics-enhanced Machine Learning


[1882]
S. Roschmann, Q. Bouniot, V. Feofanov, I. Redko and Z. Akata.
Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers.
Preprint (Jun. 2025). arXiv
Abstract

Time series classification is a fundamental task in healthcare and industry, yet the development of time series foundation models (TSFMs) remains limited by the scarcity of publicly available time series datasets. In this work, we propose Time Vision Transformer (TiViT), a framework that converts time series into images to leverage the representational power of frozen Vision Transformers (ViTs) pretrained on large-scale image datasets. First, we theoretically motivate our approach by analyzing the 2D patching of ViTs for time series, showing that it can increase the number of label-relevant tokens and reduce the sample complexity. Second, we empirically demonstrate that TiViT achieves state-of-the-art performance on standard time series classification benchmarks by utilizing the hidden representations of large OpenCLIP models. We explore the structure of TiViT representations and find that intermediate layers with high intrinsic dimension are the most effective for time series classification. Finally, we assess the alignment between TiViT and TSFM representation spaces and identify a strong complementarity, with further performance gains achieved by combining their features. Our findings reveal yet another direction for reusing vision representations in a non-visual domain.

MCML Authors
Link to website

Simon Roschmann

Interpretable and Reliable Machine Learning

Link to website

Quentin Bouniot

Dr.

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[1881]
S. Schallmoser, J. Schweisthal, A. von Ehr, H. Ghanbari, F. Schiefenhövel, T. S. Valley, J. Wiens and S. Feuerriegel.
Causal machine learning for assessing the effectiveness of off-label use of amiodarone in new-onset atrial fibrillation.
Preprint (Jun. 2025). DOI
Abstract

Off-label drug use, i.e., uses of a drug that differ from what regulatory authorities have approved, is common, occurring overall in up to 36% of prescriptions. Yet, the effectiveness across different patient subgroups is often poorly understood. In this study, we demonstrate how one can use causal machine learning (ML) together with real-world data to identify which patient groups are most likely to benefit from off-label use. Specifically, we assessed the effectiveness of off-label use of amiodarone in patients with new-onset atrial fibrillation (NOAF). NOAF can often lead to hemodynamic instability and rapid ventricular response, so that hemodynamic stability should be restored. We developed a causal ML model to predict individualized treatment effects (ITEs) of off-label amiodarone use on the probability of returning to hemodynamic stability. We used real-world data from the U.S. to develop the causal ML model and externally evaluated that model on real-world data from the Netherlands. Our predicted ITEs show that 44.8% (95% confidence interval [CI]: 38.4% to 51.0%) of patients benefit from off-label use of amiodarone with large heterogeneity: amiodarone is predicted to increase the probability of restoring hemodynamic stability by a mean of 0.5 percentage points (pp), with an interquartile range (IQR) of −1.1 pp to 1.0 pp, in the external dataset from the Netherlands. Using these ITEs, we defined a personalized treatment rule, which could increase the number of patients achieving hemodynamic stability by 4.4% (95% CI: 1.0% to 7.8%) compared to current practice. Additionally, we studied which biomarkers are predictive of treatment effect heterogeneity and found that patients with higher blood pressure may benefit most from off-label use of amiodarone. Altogether, our study shows the potential of causal ML together with real-world data in identifying patients who benefit from off-label drug use.

MCML Authors
Link to website

Simon Schallmoser

Artificial Intelligence in Management

Link to website

Jonas Schweisthal

Artificial Intelligence in Management

Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1880]
M. Schöffel, E. Garces Arias, M. Wiedner, P. Ruppert, M. Li, C. Heumann and M. Aßenmacher.
Unveiling Factors for Enhanced POS Tagging: A Study of Low-Resource Medieval Romance Languages.
Preprint (Jun. 2025). arXiv
Abstract

Part-of-speech (POS) tagging remains a foundational component in natural language processing pipelines, particularly critical for historical text analysis at the intersection of computational linguistics and digital humanities. Despite significant advancements in modern large language models (LLMs) for ancient languages, their application to Medieval Romance languages presents distinctive challenges stemming from diachronic linguistic evolution, spelling variations, and labeled data scarcity. This study systematically investigates the central determinants of POS tagging performance across diverse corpora of Medieval Occitan, Medieval Spanish, and Medieval French texts, spanning biblical, hagiographical, medical, and dietary domains. Through rigorous experimentation, we evaluate how fine-tuning approaches, prompt engineering, model architectures, decoding strategies, and cross-lingual transfer learning techniques affect tagging accuracy. Our results reveal both notable limitations in LLMs’ ability to process historical language variations and non-standardized spelling, as well as promising specialized techniques that effectively address the unique challenges presented by low-resource historical languages.

MCML Authors
Link to website

Esteban Garces Arias

Statistical Learning and Data Science

Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[1879]
A. Selivanov, P. Müller, Ö. Turgut, N. Stolt-Ansó and D. Rückert.
Global and Local Contrastive Learning for Joint Representations from Cardiac MRI and ECG.
Preprint (Jun. 2025). arXiv GitHub
Abstract

An electrocardiogram (ECG) is a widely used, cost-effective tool for detecting electrical abnormalities in the heart. However, it cannot directly measure functional parameters, such as ventricular volumes and ejection fraction, which are crucial for assessing cardiac function. Cardiac magnetic resonance (CMR) is the gold standard for these measurements, providing detailed structural and functional insights, but is expensive and less accessible. To bridge this gap, we propose PTACL (Patient and Temporal Alignment Contrastive Learning), a multimodal contrastive learning framework that enhances ECG representations by integrating spatio-temporal information from CMR. PTACL uses global patient-level contrastive loss and local temporal-level contrastive loss. The global loss aligns patient-level representations by pulling ECG and CMR embeddings from the same patient closer together, while pushing apart embeddings from different patients. Local loss enforces fine-grained temporal alignment within each patient by contrasting encoded ECG segments with corresponding encoded CMR frames. This approach enriches ECG representations with diagnostic information beyond electrical activity and transfers more insights between modalities than global alignment alone, all without introducing new learnable weights. We evaluate PTACL on paired ECG-CMR data from 27,951 subjects in the UK Biobank. Compared to baseline approaches, PTACL achieves better performance in two clinically relevant tasks: (1) retrieving patients with similar cardiac phenotypes and (2) predicting CMR-derived cardiac function parameters, such as ventricular volumes and ejection fraction. Our results highlight the potential of PTACL to enhance non-invasive cardiac diagnostics using ECG.
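The patient-level objective described above follows the familiar symmetric InfoNCE pattern: same-patient ECG/CMR embeddings on the diagonal are the positives. A minimal sketch (function name and temperature are illustrative, not taken from the paper; the local temporal loss would contrast segments within one patient analogously):

```python
import numpy as np

def info_nce(ecg, cmr, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings (row i of each matrix
    belongs to the same patient). Illustrative sketch only."""
    # L2-normalise so dot products are cosine similarities
    ecg = ecg / np.linalg.norm(ecg, axis=1, keepdims=True)
    cmr = cmr / np.linalg.norm(cmr, axis=1, keepdims=True)
    logits = ecg @ cmr.T / temperature       # (N, N); positives on diagonal
    idx = np.arange(len(ecg))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)         # numerical stability
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[idx, idx]).mean()           # diagonal = positives

    # Both retrieval directions: ECG -> CMR and CMR -> ECG
    return 0.5 * (xent(logits) + xent(logits.T))
```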

MCML Authors
Link to website

Nil Stolt-Ansó

Artificial Intelligence in Healthcare and Medicine

Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine


[1878]
R. Skorobogat, K. Roth, M.-I. Georgescu and Z. Akata.
Subspace-Boosted Model Merging.
Preprint (Jun. 2025). arXiv
Abstract

Model merging enables the combination of multiple specialized expert models into a single model capable of performing multiple tasks. However, the benefits of merging an increasing number of specialized experts generally lead to diminishing returns and reduced overall performance gains. In this work, we offer an explanation and analysis from a task arithmetic perspective, revealing that as the merging process (across numerous existing merging methods) continues for more and more experts, the associated task vector space experiences rank collapse. To mitigate this issue, we introduce Subspace Boosting, which operates on the singular value decomposed task vector space and maintains task vector ranks. Subspace Boosting raises merging efficacy for up to 20 expert models by large margins of more than 10% when evaluated on vision benchmarks. Moreover, we propose employing Higher-Order Generalized Singular Value Decomposition to further quantify task similarity, offering a new interpretable perspective on model merging.
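The rank-collapse diagnosis can be made concrete with a small SVD check on stacked task vectors (a toy diagnostic only, not the Subspace Boosting algorithm; the tolerance is an assumption):

```python
import numpy as np

def effective_rank(task_vectors, tol=1e-6):
    """Numerical rank of the stacked task vectors.

    task_vectors: list of flattened 1-D arrays (theta_expert - theta_base).
    If merging drives this number down while experts are added, the task
    vector space is collapsing onto a low-dimensional subspace.
    """
    M = np.stack(task_vectors)                  # (num_experts, num_params)
    s = np.linalg.svd(M, compute_uv=False)      # singular values, descending
    return int((s > tol * s[0]).sum())
```

Subspace Boosting, per the abstract, intervenes in exactly this decomposition to keep the rank from collapsing as more experts are merged.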

MCML Authors
Link to website

Karsten Roth

Interpretable and Reliable Machine Learning

Link to website

Iuliana Georgescu

Dr.

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[1877]
S. Starck, V. Sideri-Lampretsa, B. Kainz, M. Menten, T. T. Mueller and D. Rückert.
Diff-Def: Diffusion-Generated Deformation Fields for Conditional Atlases.
Preprint (Jun. 2025). arXiv
Abstract

Anatomical atlases are widely used for population studies and analysis. Conditional atlases target a specific sub-population defined via certain conditions, such as demographics or pathologies, and allow for the investigation of fine-grained anatomical differences like morphological changes associated with ageing or disease. Existing approaches use either registration-based methods that are often unable to handle large anatomical variations or generative adversarial models, which are challenging to train since they can suffer from training instabilities. Instead of generating atlases directly as intensity images, we propose using latent diffusion models to generate deformation fields, which transform a general population atlas into one representing a specific sub-population. Our approach ensures structural integrity, enhances interpretability and avoids hallucinations that may arise during direct image synthesis by generating this deformation field and regularising it using a neighbourhood of images. We compare our method to several state-of-the-art atlas generation methods using brain MR images from the UK Biobank. Our method generates highly realistic atlases with smooth transformations and high anatomical fidelity, outperforming existing baselines. We demonstrate the quality of these atlases through comprehensive evaluations, including quantitative metrics for anatomical accuracy, perceptual similarity, and qualitative analyses displaying the consistency and realism of the generated atlases.

MCML Authors
Link to Profile Martin Menten

Martin Menten

Dr.

Artificial Intelligence in Healthcare and Medicine

Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine


[1876]
Z. S. Taghavi, A. Modarressi, Y. Ma and H. Schütze.
ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge.
Preprint (Jun. 2025). arXiv GitHub
Abstract

Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing techniques – like prompting or multi-hop retrieval – that can help resolve complexity. In contrast, we present ImpliRet, a benchmark that shifts the reasoning challenge to document-side processing: The queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving ’two days ago’), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 15.07%. We also test whether long-context models can overcome this limitation. But even with a short context of only ten documents, including the positive document, GPT-4.1 scores only 35.06%, showing that document-side reasoning remains a challenge.

MCML Authors
Link to website

Zeinab Sadat Taghavi

Computational Linguistics

Link to website

Ali Modarressi

Computational Linguistics

Link to website

Yunpu Ma

Dr.

Database Systems and Data Mining

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1875]
I. Tsangko, A. Triantafyllopoulos, A. Abdelmoula, A. Mallol-Ragolta and B. W. Schuller.
Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition.
Preprint (Jun. 2025). arXiv
Abstract

Foundation Models (FMs) are rapidly transforming Affective Computing (AC), with Vision Language Models (VLMs) now capable of recognising emotions in zero-shot settings. This paper probes a critical but underexplored question: what visual cues do these models rely on to infer affect, and are these cues psychologically grounded or superficially learnt? We benchmark VLMs of varying scale on a teeth-annotated subset of the AffectNet dataset and find consistent performance shifts depending on the presence of visible teeth. Through structured introspection of the best-performing model, i.e., GPT-4o, we show that facial attributes like eyebrow position drive much of its affective reasoning, revealing a high degree of internal consistency in its valence-arousal predictions. These patterns highlight the emergent nature of FMs’ behaviour, but also reveal risks: shortcut learning, bias, and fairness issues, especially in sensitive domains like mental health and education.

MCML Authors
Link to website

Andreas Triantafyllopoulos

Health Informatics

Link to website

Adria Mallol-Ragolta

Health Informatics

Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[1874]
L. von der Heyde, A.-C. Haensch, B. Weiß and J. Daikeler.
AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation.
Preprint (Jun. 2025). arXiv
Abstract

The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs’ performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs’ unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.

MCML Authors
Link to website

Leah von der Heyde

Social Data Science and AI

Link to website

Anna-Carolina Haensch

Dr.

Social Data Science and AI


[1873]
T. Walter, H. Markgraf, J. Külz and M. Althoff.
Provably Safe Reinforcement Learning from Analytic Gradients.
Preprint (Jun. 2025). arXiv
Abstract

Deploying autonomous robots in safety-critical applications requires safety guarantees. Provably safe reinforcement learning is an active field of research which aims to provide such guarantees using safeguards. These safeguards should be integrated during training to prevent a large sim-to-real gap. While there are several approaches for safeguarding sampling-based reinforcement learning, analytic gradient-based reinforcement learning often achieves superior performance and sample efficiency. However, there is no safeguarding approach for this learning paradigm yet. Our work addresses this gap by developing the first effective safeguard for analytic gradient-based reinforcement learning. We analyse existing, differentiable safeguards, adapt them through modified mappings and gradient formulations, and integrate them with a state-of-the-art learning algorithm and a differentiable simulation. We evaluate how different safeguards affect policy optimisation using numerical experiments on two classical control tasks. The results demonstrate safeguarded training without compromising performance.

MCML Authors
Link to website

Jonathan Külz

Cyber Physical Systems

Link to Profile Matthias Althoff

Matthias Althoff

Prof. Dr.

Cyber Physical Systems


[1872]
A. Wang, D. Shu, Y. Wang, Y. Ma and M. Du.
Improving LLM Reasoning through Interpretable Role-Playing Steering.
Preprint (Jun. 2025). arXiv
Abstract

Role-playing has emerged as an effective technique for enhancing the reasoning capabilities of large language models (LLMs). However, existing methods primarily rely on prompt engineering, which often lacks stability and interpretability. In this paper, we introduce Sparse Autoencoder Role-Playing Steering (SRPS), a novel framework that identifies and manipulates internal model features associated with role-playing behavior. Our approach extracts latent representations from role-play prompts, selects the most relevant features based on activation patterns, and constructs a steering vector that can be injected into the model’s residual stream with controllable intensity. Our method enables fine-grained control over role-specific behavior and offers insights into how role information influences internal model activations. Extensive experiments across various reasoning benchmarks and model sizes demonstrate consistent performance gains. Notably, in the zero-shot chain-of-thought (CoT) setting, the accuracy of Llama3.1-8B on CSQA improves from 31.86% to 39.80%, while Gemma2-9B on SVAMP increases from 37.50% to 45.10%. These results highlight the potential of SRPS to enhance reasoning ability in LLMs, providing better interpretability and stability compared to traditional prompt-based role-playing.
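The core intervention — adding a steering vector to a residual-stream activation with controllable intensity — reduces to a single update. The sketch below is a generic activation-steering step; the unit-normalisation and the `alpha` scale are assumptions, and SRPS additionally derives the direction from sparse-autoencoder features rather than taking it as given:

```python
import numpy as np

def steer(hidden, direction, alpha=4.0):
    """Add a steering vector to a residual-stream activation.

    hidden:    activation vector at some layer (assumed given).
    direction: steering direction (e.g., a role-playing feature);
               normalised so alpha alone controls the intensity.
    """
    v = direction / np.linalg.norm(direction)
    return hidden + alpha * v
```

Because the injection is a plain vector addition, the intensity can be swept continuously, which is what makes the behaviour controllable and inspectable.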

MCML Authors
Link to website

Yunpu Ma

Dr.

Database Systems and Data Mining


[1871]
K. Wang, T. Klug, S. Ruschke, J. Kirschke and R. Heckel.
Reliable Evaluation of MRI Motion Correction: Dataset and Insights.
Preprint (Jun. 2025). arXiv
Abstract

Correcting motion artifacts in MRI is important, as they can hinder accurate diagnosis. However, evaluating deep learning-based and classical motion correction methods remains fundamentally difficult due to the lack of accessible ground-truth target data. To address this challenge, we study three evaluation approaches: real-world evaluation based on reference scans, simulated motion, and reference-free evaluation, each with its merits and shortcomings. To enable evaluation with real-world motion artifacts, we release PMoC3D, a dataset consisting of unprocessed Paired Motion-Corrupted 3D brain MRI data. To advance evaluation quality, we introduce MoMRISim, a feature-space metric trained for evaluating motion reconstructions. We assess each evaluation approach and find real-world evaluation together with MoMRISim, while not perfect, to be most reliable. Evaluation based on simulated motion systematically exaggerates algorithm performance, and reference-free evaluation overrates oversmoothed deep learning outputs.

MCML Authors
Link to website

Tobit Klug

Machine Learning and Information Processing

Link to Profile Reinhard Heckel

Reinhard Heckel

Prof. Dr.

Machine Learning and Information Processing


[1870]
M. Wang, S. Chen, K. Kersting, V. Tresp and Y. Ma.
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding.
Preprint (Jun. 2025). arXiv
Abstract

Recent advances in Video Large Language Models (VLLMs) have significantly enhanced their ability to understand video content. Nonetheless, processing long videos remains challenging due to high computational demands and the redundancy present in the visual data. In this work, we propose METok, a training-free, Multi-stage Event-based Token compression framework designed to accelerate VLLMs’ inference while preserving accuracy. METok progressively eliminates redundant visual tokens across three critical stages: (1) event-aware compression during vision encoding, (2) hierarchical token pruning in the prefilling stage based on semantic alignment and event importance, and (3) a decoding-stage KV Cache optimization that further reduces memory consumption. Our experiments on diverse video benchmarks demonstrate that METok achieves an optimal trade-off between efficiency and accuracy by dynamically selecting informative visual tokens. For instance, equipping LongVA-7B with METok realizes an 80.6% FLOPs reduction and 93.5% KV Cache memory savings, all while maintaining comparable or even superior accuracy.
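The token-pruning stage can be illustrated with a minimal top-k filter that keeps the highest-scoring visual tokens in temporal order. The scores and keep ratio below are invented; METok itself scores tokens by semantic alignment and event importance:

```python
def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens while preserving temporal order."""
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: -scores[i])[:k]
    return [tokens[i] for i in sorted(top)]

kept = prune_tokens(["t0", "t1", "t2", "t3", "t4", "t5"],
                    [0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
```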

MCML Authors
Link to website

Shuo Chen

Database Systems and Data Mining

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems and Data Mining

Link to website

Yunpu Ma

Dr.

Database Systems and Data Mining


[1869]
Y. Wang, J. Bi, Y. Ma and S. Pirk.
ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM.
Preprint (Jun. 2025). arXiv
Abstract

Multimodal Large Language Models (MLLMs) often suffer from hallucinations: they over-rely on partial cues and generate incorrect responses. Recently, methods like Visual Contrastive Decoding (VCD) and Instruction Contrastive Decoding (ICD) have been proposed to mitigate hallucinations by contrasting predictions from perturbed or negatively prefixed inputs against original outputs. In this work, we uncover that methods like VCD and ICD fundamentally influence the model’s internal attention dynamics. This observation suggests that their effectiveness may not stem merely from surface-level modifications to logits but from deeper shifts in attention distribution. Inspired by this insight, we propose an attention-steerable contrastive decoding framework that directly intervenes in the model’s attention mechanisms to offer a more principled approach to mitigating hallucinations. Our experiments across multiple MLLM architectures and diverse decoding methods demonstrate that our approach significantly reduces hallucinations and improves performance on benchmarks such as POPE, CHAIR, and MMHal-Bench, while simultaneously enhancing performance on standard VQA benchmarks.
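The VCD-style logit contrast that the paper analyzes can be sketched as follows (invented toy logits; ASCD itself goes further by steering attention rather than only adjusting logits):

```python
def contrastive_logits(orig, perturbed, alpha=1.0):
    """(1 + alpha) * orig - alpha * perturbed, as in VCD-style decoding:
    boost tokens the clean input supports more than the perturbed one does."""
    return [(1 + alpha) * o - alpha * p for o, p in zip(orig, perturbed)]

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

orig = [2.0, 1.5, 0.1]       # clean input favors token 0 (a hallucination)
perturbed = [2.2, 0.5, 0.1]  # the perturbed input favors it even more
adjusted = contrastive_logits(orig, perturbed)
```

The contrast flips the prediction away from the token that the perturbation disproportionately supports.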

MCML Authors
Link to website

Yunpu Ma

Dr.

Database Systems and Data Mining


[1868]
Z. Xu, H. Li, D. Sun, Z. Li, Y. Li, Q. Kong, Z. Cheng, N. Navab and S. K. Zhou.
NeRF-based CBCT Reconstruction needs Normalization and Initialization.
Preprint (Jun. 2025). arXiv
Abstract

Cone Beam Computed Tomography (CBCT) is widely used in medical imaging. However, the limited number and intensity of X-ray projections make reconstruction an ill-posed problem with severe artifacts. NeRF-based methods have achieved great success in this task. However, they suffer from a local-global training mismatch between their two key components: the hash encoder and the neural network. Specifically, in each training step, only a subset of the hash encoder’s parameters is used (local sparse), whereas all parameters in the neural network participate (global dense). Consequently, hash features generated in each step are highly misaligned, as they come from different subsets of the hash encoder. These misalignments from different training steps are then fed into the neural network, causing repeated inconsistent global updates in training, which leads to unstable training, slower convergence, and degraded reconstruction quality. Aiming to alleviate the impact of this local-global optimization mismatch, we introduce a Normalized Hash Encoder, which enhances feature consistency and mitigates the mismatch. Additionally, we propose a Mapping Consistency Initialization (MCI) strategy that initializes the neural network before training by leveraging the global mapping property from a well-trained model. The initialized neural network exhibits improved stability during early training, enabling faster convergence and enhanced reconstruction performance. Our method is simple yet effective, requiring only a few lines of code while substantially improving training efficiency on 128 CT cases collected from 4 different datasets, covering 7 distinct anatomical regions.

MCML Authors
Link to website

Han Li

Dr.

Computational Pathology

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1867]
S. Yuan, E. Nie, L. Kouba, A. Y. Kangen, H. Schmid, H. Schütze and M. Färber.
LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification.
Preprint (Jun. 2025). arXiv
Abstract

Detoxification, the task of rewriting harmful language into non-toxic text, has become increasingly important amid the growing prevalence of toxic content online. However, high-quality parallel datasets for detoxification, especially for hate speech, remain scarce due to the cost and sensitivity of human annotation. In this paper, we propose a novel LLM-in-the-loop pipeline leveraging GPT-4o-mini for automated detoxification. We first replicate the ParaDetox pipeline by replacing human annotators with an LLM and show that the LLM performs comparably to human annotation. Building on this, we construct ParaDeHate, a large-scale parallel dataset specifically for hate speech detoxification. We release ParaDeHate as a benchmark of over 8K hate/non-hate text pairs and evaluate a wide range of baseline methods. Experimental results show that models such as BART, fine-tuned on ParaDeHate, achieve better performance in style accuracy, content preservation, and fluency, demonstrating the effectiveness of LLM-generated detoxification text as a scalable alternative to human annotation.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1866]
S. Yuan, E. Nie, M. Tawfelis, H. Schmid, H. Schütze and M. Färber.
Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models.
Preprint (Jun. 2025). arXiv
Abstract

Hate speech detection is a socially sensitive and inherently subjective task, with judgments often varying based on personal traits. While prior work has examined how socio-demographic factors influence annotation, the impact of personality traits on Large Language Models (LLMs) remains largely unexplored. In this paper, we present the first comprehensive study on the role of persona prompts in hate speech classification, focusing on MBTI-based traits. A human annotation survey confirms that MBTI dimensions significantly affect labeling behavior. Extending this to LLMs, we prompt four open-source models with MBTI personas and evaluate their outputs across three hate speech datasets. Our analysis uncovers substantial persona-driven variation, including inconsistencies with ground truth, inter-persona disagreement, and logit-level biases. These findings highlight the need to carefully define persona prompts in LLM-based annotation workflows, with implications for fairness and alignment with human values.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1865]
K. Zaripova, E. Özsoy, N. Navab and A. Farshad.
PhenoKG: Knowledge Graph-Driven Gene Discovery and Patient Insights from Phenotypes Alone.
Preprint (Jun. 2025). arXiv
Abstract

Identifying causative genes from patient phenotypes remains a significant challenge in precision medicine, with important implications for the diagnosis and treatment of genetic disorders. We propose a novel graph-based approach for predicting causative genes from patient phenotypes, with or without an available list of candidate genes, by integrating a rare disease knowledge graph (KG). Our model, combining graph neural networks and transformers, achieves substantial improvements over the current state-of-the-art. On the real-world MyGene2 dataset, it attains a mean reciprocal rank (MRR) of 24.64% and nDCG@100 of 33.64%, surpassing the best baseline (SHEPHERD) at 19.02% MRR and 30.54% nDCG@100. We perform extensive ablation studies to validate the contribution of each model component. Notably, the approach generalizes to cases where only phenotypic data are available, addressing key challenges in clinical decision support when genomic information is incomplete.
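The mean reciprocal rank (MRR) reported above is simple to compute; a minimal reference implementation with hypothetical gene rankings:

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """Average reciprocal 1-based rank of the true item per query."""
    rr = [1.0 / (preds.index(t) + 1) for preds, t in zip(ranked_lists, targets)]
    return sum(rr) / len(rr)

# Two hypothetical queries: the true gene "g1" ranks 2nd, then 1st.
score = mean_reciprocal_rank([["g2", "g1", "g3"], ["g1", "g2", "g3"]],
                             ["g1", "g1"])
```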

MCML Authors
Link to website

Kamilia Zaripova

Computer Aided Medical Procedures & Augmented Reality

Link to website

Ege Özsoy

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to website

Azade Farshad

Dr.

Computer Aided Medical Procedures & Augmented Reality


[1864]
G. Zhang, T. Hannan, H. Kleiner, B. Aydemir, X. Xie, J. Lan, T. Seidl, V. Tresp and J. Gu.
AViLA: Asynchronous Vision-Language Agent for Streaming Multimodal Data Interaction.
Preprint (Jun. 2025). arXiv
Abstract

An ideal vision-language agent serves as a bridge between human users and their surrounding physical world in real-world applications like autonomous driving and embodied agents, and proactively provides accurate and timely responses given user intents. An intriguing challenge arises when agents interact with the world as a dynamic data stream and ad-hoc queries from users: supporting knowledge for queries, namely evidence, usually appears asynchronously with the arrival time of queries, and agents need to ground their responses in historical data, present observations, and even future streams. We frame this challenge as Query-Evidence Asynchrony, where user queries and their supporting evidence typically arrive asynchronously in the streaming setting. This setting requires not only strong reasoning capabilities but also the ability to retain past observations and respond to queries with temporal awareness. In this paper, we introduce a diagnostic benchmark that evaluates Multimodal Large Language Models (MLLMs) on their ability to handle interaction with streaming data. Further, we present AViLA, an Asynchronous Vision-Language Agent for streaming data interaction that can handle ad-hoc queries and give time-aware responses. For this purpose, AViLA consists of three key modules: comprehensive memory retention, evidence identification, and evidence-grounded trigger, that are designed to maintain a general-purpose memory and respond to queries readily and in a timely manner. Our experiments show that existing models often fail to respond at appropriate times, while AViLA significantly improves both accuracy and temporal awareness. Our code and dataset will be publicly available.

MCML Authors
Link to website

Gengyuan Zhang

Database Systems and Data Mining

Link to website

Tanveer Hannan

Database Systems and Data Mining

Link to Profile Thomas Seidl

Thomas Seidl

Prof. Dr.

Database Systems and Data Mining

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1863]
Y. Zhang, H. Gao, H. Chen, W. Li, Y. Ma and V. Tresp.
FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal Large Language Models.
Preprint (Jun. 2025). arXiv
Abstract

Multimodal Large Language Models (MLLMs) excel in tasks like multimodal reasoning and cross-modal retrieval but face deployment challenges in real-world scenarios due to distributed multimodal data and strict privacy requirements. Federated Learning (FL) offers a solution by enabling collaborative model training without centralizing data. However, realizing FL for MLLMs presents significant challenges, including high computational demands, limited client capacity, substantial communication costs, and heterogeneous client data. Existing FL methods assume client-side deployment of full models, an assumption that breaks down for large-scale MLLMs due to their massive size and communication demands. To address these limitations, we propose FedNano, the first FL framework that centralizes the LLM on the server while introducing NanoEdge, a lightweight module for client-specific adaptation. NanoEdge employs modality-specific encoders, connectors, and trainable NanoAdapters with low-rank adaptation. This design eliminates the need to deploy the LLM on clients, reducing client-side storage by 95%, and limiting communication overhead to only 0.01% of the model parameters. By transmitting only compact NanoAdapter updates, FedNano handles heterogeneous client data and resource constraints while preserving privacy. Experiments demonstrate that FedNano outperforms prior FL baselines, bridging the gap between MLLM scale and FL feasibility, and enabling scalable, decentralized multimodal AI systems.
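The NanoAdapter idea rests on low-rank adaptation: only two small matrices A and B are trained and exchanged, while the dense weight W stays frozen on the server. A minimal forward-pass sketch with toy shapes (not the FedNano code):

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    """y = W x + scale * B (A x); W is frozen, only A (r x d) and B (d x r) train."""
    return [wx + scale * bax
            for wx, bax in zip(matvec(W, x), matvec(B, matvec(A, x)))]

W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # frozen server-side weight (d = 3)
A = [[1, 0, 0]]                        # rank-1 down-projection
B = [[1], [0], [0]]                    # rank-1 up-projection
y = lora_forward([2, 3, 4], W, A, B)
```

For a d×d layer adapted at rank r, clients exchange 2·d·r values instead of d², which is where the communication savings come from.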

MCML Authors
Link to website

Yao Zhang

Database Systems and Data Mining

Link to website

Haokun Chen

Database Systems and Data Mining

Link to website

Yunpu Ma

Dr.

Database Systems and Data Mining

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1862]
Y. Zhang, C. Lin, S. Tang, H. Chen, S. Zhou, Y. Ma and V. Tresp.
SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence.
Preprint (Jun. 2025). arXiv GitHub
Abstract

The rapid progress of Large Language Models has advanced agentic systems in decision-making, coordination, and task execution. Yet, existing agentic system generation frameworks lack full autonomy, missing from-scratch agent generation, self-optimizing agent functionality, and collaboration, limiting adaptability and scalability. We propose SwarmAgentic, a framework for fully automated agentic system generation that constructs agentic systems from scratch and jointly optimizes agent functionality and collaboration as interdependent components through language-driven exploration. To enable efficient search over system-level structures, SwarmAgentic maintains a population of candidate systems and evolves them via feedback-guided updates, drawing inspiration from Particle Swarm Optimization (PSO). We evaluate our method on six real-world, open-ended, and exploratory tasks involving high-level planning, system-level coordination, and creative reasoning. Given only a task description and an objective function, SwarmAgentic outperforms all baselines, achieving a +261.8% relative improvement over ADAS on the TravelPlanner benchmark, highlighting the effectiveness of full automation in structurally unconstrained tasks. This framework marks a significant step toward scalable and autonomous agentic system design, bridging swarm intelligence with fully automated multi-agent system generation.
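The classical PSO update that SwarmAgentic draws inspiration from combines a particle's momentum with attraction toward its personal best and the global best. A sketch with the usual uniform random factors fixed to 1 so the example is deterministic:

```python
def pso_update(x, v, p_best, g_best, w=0.5, c1=0.3, c2=0.3):
    """One PSO step: inertia w, cognitive pull c1, social pull c2
    (random factors fixed to 1 for a deterministic illustration)."""
    new_v = [w * vi + c1 * (pi - xi) + c2 * (gi - xi)
             for xi, vi, pi, gi in zip(x, v, p_best, g_best)]
    new_x = [xi + vi for xi, vi in zip(x, new_v)]
    return new_x, new_v

new_x, new_v = pso_update([0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 0.0])
```

In SwarmAgentic, the "positions" are entire candidate agentic systems and the update is carried out in language via feedback-guided rewrites rather than arithmetic.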

MCML Authors
Link to website

Yao Zhang

Database Systems and Data Mining

Link to website

Haokun Chen

Database Systems and Data Mining

Link to website

Yunpu Ma

Dr.

Database Systems and Data Mining

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1861]
S. Zhao, I. Prapas, Z. Xiong, I. Karasante, I. Papoutsis, G. Camps-Valls and X. Zhu.
Causal Graph Neural Networks for Robust Wildfire Forecasting Across Geographic Shifts.
Preprint (Jun. 2025). DOI
Abstract

Machine learning has become a powerful tool for modeling the relationships between environmental factors and fire events. However, beyond predictive performance, we argue that critical decision-making requires an understanding of fire mechanisms to improve reliability. Causality offers a promising framework for explicitly analyzing the interdependencies among factors; however, its integration into deep learning and further application in disaster management remain largely underexplored. To map the relationship between historical inputs and resulting burned areas, we propose a causally inspired deep learning approach utilizing graph models. The graph representation is constructed through a learnable approach supervised by causal knowledge. A graph pooling layer, informed by backdoor adjustment criteria, mitigates the potential confounding effects of hidden variables on the target variable. Our experiments demonstrate that our model is more robust, reducing the standard deviation of the AUROC over longer forecasting horizons by 64% and improving performance under geographical distribution shifts by 2 points compared with the baseline. Compared with fully connected and correlation-based graphs, the causally informed graph proved to be more resilient to input perturbations. Additionally, our model revealed the lagged effect of Oceanic Climate Index variables on local fire events and the critical role of short-term local precipitation, indicating that Mediterranean fires are mostly drought-driven.

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1860]
Y. Zhou, Y. Bi, W. Tong, W. Wang, N. Navab and Z. Jiang.
UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation.
Preprint (Jun. 2025). arXiv
Abstract

Precise anomaly detection in medical images is critical for clinical decision-making. While recent unsupervised or semi-supervised anomaly detection methods trained on large-scale normal data show promising results, they lack fine-grained differentiation, such as benign vs. malignant tumors. Additionally, ultrasound (US) imaging is highly sensitive to devices and acquisition parameter variations, creating significant domain gaps in the resulting US images. To address these challenges, we propose UltraAD, a vision-language model (VLM)-based approach that leverages few-shot US examples for generalized anomaly localization and fine-grained classification. To enhance localization performance, the image-level token of query visual prototypes is first fused with learnable text embeddings. This image-informed prompt feature is then further integrated with patch-level tokens, refining local representations for improved accuracy. For fine-grained classification, a memory bank is constructed from few-shot image samples and corresponding text descriptions that capture anatomical and abnormality-specific features. During training, the stored text embeddings remain frozen, while image features are adapted to better align with medical data. UltraAD has been extensively evaluated on three breast US datasets, outperforming state-of-the-art methods in both lesion localization and fine-grained medical classification. The code will be released upon acceptance.
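The memory-bank classification step can be sketched as a nearest-neighbor lookup over stored few-shot features. The 2-D features and labels below are invented for illustration; UltraAD operates on CLIP embeddings of images and text descriptions:

```python
import math

def classify_with_memory(query, memory_feats, memory_labels):
    """Assign the label of the most cosine-similar memory-bank entry."""
    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return num / den
    best = max(range(len(memory_feats)), key=lambda i: cos(query, memory_feats[i]))
    return memory_labels[best]

bank = [[1.0, 0.0], [0.0, 1.0]]        # hypothetical few-shot features
labels = ["benign", "malignant"]
pred = classify_with_memory([0.9, 0.1], bank, labels)
```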

MCML Authors
Link to website

Yue Zhou

Computer Aided Medical Procedures & Augmented Reality

Link to website

Yuan Bi

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to website

Zhongliang Jiang

Dr.

Computer Aided Medical Procedures & Augmented Reality


[1859]
X. Zhu, S. Chen, F. Zhang, Y. Shi and Y. Wang.
GlobalBuildingAtlas: An Open Global and Complete Dataset of Building Polygons, Heights and LoD1 3D Models.
Preprint (Jun. 2025). arXiv
Abstract

We introduce GlobalBuildingAtlas, a publicly available dataset providing global and complete coverage of building polygons, heights and Level of Detail 1 (LoD1) 3D building models. This is the first open dataset to offer high quality, consistent, and complete building data in 2D and 3D form at the individual building level on a global scale. Towards this dataset, we developed machine learning-based pipelines to derive building polygons and heights (called this http URL) from global PlanetScope satellite data, respectively. Also a quality-based fusion strategy was employed to generate higher-quality polygons (called this http URL) based on existing open building polygons, including our own derived one. With more than 2.75 billion buildings worldwide, this http URL surpasses the most comprehensive database to date by more than 1 billion buildings. this http URL offers the most detailed and accurate global 3D building height maps to date, achieving a spatial resolution of 3x3 meters-30 times finer than previous global products (90 m), enabling a high-resolution and reliable analysis of building volumes at both local and global scales. Finally, we generated a global LoD1 building model (called GBA.LoD1) from the resulting this http URL and this http URL. GBA.LoD1 represents the first complete global LoD1 building models, including 2.68 billion building instances with predicted heights, i.e., with a height completeness of more than 97%, achieving RMSEs ranging from 1.5 m to 8.9 m across different continents. With its height accuracy, comprehensive global coverage and rich spatial details, GlobalBuildingAltas offers novel insights on the status quo of global buildings, which unlocks unprecedented geospatial analysis possibilities, as showcased by a better illustration of where people live and a more comprehensive monitoring of the progress on the 11th Sustainable Development Goal of the United Nations.

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

Link to website

Sining Chen

Data Science in Earth Observation


[1858]
D. N. Jakobi, M. Stegenwallner-Schütz, N. Hollenstein, C. Ding, R. Kaspere, A. M. Škorić, E. Pavlinusic Vilus, S. Frank, M.-L. Müller, K. M. Jensen de López, N. Kharlamov, H. B. Søndergaard Knudsen, Y. Berzak, E. Lion, I. A. Sekerina, C. Acarturk, M. F. Ansari, K. Harezlak, P. Kasprowski, A. Bautista, L. Beinborn, A. Bondar, A. Boznou, L. Bradshaw, J. M. Hofmann, T. Krosness, N. B. Soliva, A. Çepani, K. Cergol, A. Došen, M. Palmovic, A. Çerpja, D. Chirino, J. Chromý, V. Demberg, I. Škrjanec, N. D. Deniz, I. Fajardo, M. Giménez-Salvador, X. Mínguez-López, M. Filip, Z. Freibergs, J. Gomes, A. Janeiro, P. Luegi, J. Veríssimo, S. Gramatikov, J. Hasenäcker, A. Haveriku, N. Kote, M. M. Kamal, H. Kędzierska, D. Klimek-Jankowska, S. Kosutar, D. G. Krakowczyk, I. Krejtz, M. Łockiewicz, K. Lõo, J. Motiejūnienė, J. A. Nasir, J. S. Krog Nedergård, A. Özkan, M. Preininger, L. Pungă, D. R. Reich, C. Tschirner, Š. Rot, A. Säuberli, J. Solé-Casals, E. Strati, I. Svoboda, E. Trandafili, S. Varlokosta, M. Vulchanova and L. A. .
MultiplEYE: Creating a multilingual eye-tracking-while-reading corpus.
ETRA 2025 - ACM Symposium on Eye Tracking Research and Applications. Tokyo, Japan, May 26-29, 2025. DOI
Abstract

Eye-tracking-while-reading data provide valuable insights across multiple disciplines, including psychology, linguistics, natural language processing, education, and human-computer interaction. Despite its potential, the availability of large, high-quality, multilingual datasets remains limited, hindering both foundational reading research and advancements in applications. The MultiplEYE project addresses this gap by establishing a large-scale, international eye-tracking data collection initiative. It aims to create a multilingual dataset of eye movements recorded during natural reading, balancing linguistic diversity, while ensuring methodological consistency for reliable cross-linguistic comparisons. The dataset spans numerous languages and follows strict procedural, documentation, and data pre-processing standards to enhance eye-tracking data transparency and reproducibility. A novel data-sharing framework, integrated with data quality reports, allows for selective data filtering based on research needs. Researchers and labs worldwide are invited to join the initiative. By establishing and promoting standardized practices and open data sharing, MultiplEYE facilitates interdisciplinary research and advances reading research and gaze-augmented applications.

MCML Authors
Link to website

Andreas Säuberli

AI and Computational Linguistics


[1857]
J. W. Grootjen, F. Prummer, M. Bâce, C. Jiao, S. Jindal and A. Bulling.
PETMEI: 10th Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction.
PETMEI @ETRA 2025 - 10th International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction at the ACM Symposium on Eye Tracking Research and Applications (ETRA 2025). Tokyo, Japan, May 26-29, 2025. DOI
Abstract

The first applications of eye tracking and eye-based human-computer interfaces mainly concentrated on making use of the eyes in traditional desktop settings. However, this changed in the last decade with a growth of interest in smart eyewear. With recent advances in low-cost mobile eye trackers, gaze-based techniques for mobile computing have become increasingly important. PETMEI 2025 focuses on the pervasive eye tracking paradigm as a trailblazer for mobile eye-based interaction and eye-based context-awareness. We want to stimulate and explore the creativity of these communities with respect to the implications, key research challenges, and new applications for pervasive eye tracking in ubiquitous computing. The long-term goal is to create a strong interdisciplinary research community linking these fields and establish the workshop as the premier forum for research on pervasive eye tracking.

MCML Authors
Link to website

Jesse Grootjen

Human-Centered Ubiquitous Media


[1856]
V. Ruozzi, S. Matinfar, L. Schütz, B. Wiestler, A. Redaelli, E. Votta and N. Navab.
BioSonix: Can Physics-based Sonification Perceptualize Tissue Deformations From Tool Interactions?
IPMI 2025 - Information Processing in Medical Imaging. Kos Island, Greece, May 25-30, 2025. To be published.
Abstract

Perceptualizing tool interactions with deformable structures in surgical procedures remains challenging, as unimodal visualization techniques often fail to capture the complexity of these interactions due to constraints such as occlusion and limited depth perception. This paper presents a novel approach to augment tool navigation in mixed reality environments by providing auditory representations of tool-tissue dynamics, particularly for interactions with soft tissue. BioSonix, a physics-informed design framework, utilizes tissue displacements in 3D space to compute excitation forces for a sound model encoding tissue properties such as stiffness and density. Biomechanical simulations were employed to model particle displacements resulting from tool-tissue interactions, establishing a robust foundation for the method. An optimization approach was used to define configurations for capturing diverse interaction scenarios with varying tool trajectories. Experiments were conducted to validate the accuracy of the sound-displacement mappings. Additionally, two user studies were performed: the first involved two clinical professionals (a neuroradiologist and a cardiologist), who confirmed the method’s impact and achieved high task accuracy; the second included 22 biomedical experts, who demonstrated high discrimination accuracy in tissue differentiation and targeting tasks. The results revealed a strong correlation between tool-tissue dynamics and their corresponding auditory profiles, highlighting the potential of these sound representations to enhance the intuitive understanding of complex interactions.

MCML Authors
Link to website

Sasan Matinfar

Computer Aided Medical Procedures & Augmented Reality

Link to website

Laura Schütz

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Benedikt Wiestler

Benedikt Wiestler

Prof. Dr.

AI for Image-Guided Diagnosis and Therapy

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1855]
A. H. Berger, L. Lux, A. Weers, M. Menten, D. Rückert and J. C. Paetzold.
Pitfalls of topology-aware image segmentation.
IPMI 2025 - Information Processing in Medical Imaging. Kos Island, Greece, May 25-30, 2025. To be published. Preprint available. arXiv
Abstract

Topological correctness, i.e., the preservation of structural integrity and specific characteristics of shape, is a fundamental requirement for medical imaging tasks, such as neuron or vessel segmentation. Despite the recent surge in topology-aware methods addressing this challenge, their real-world applicability is hindered by flawed benchmarking practices. In this paper, we identify critical pitfalls in model evaluation that include inadequate connectivity choices, overlooked topological artifacts in ground truth annotations, and inappropriate use of evaluation metrics. Through detailed empirical analysis, we uncover these issues’ profound impact on the evaluation and ranking of segmentation methods. Drawing from our findings, we propose a set of actionable recommendations to establish fair and robust evaluation standards for topology-aware medical image segmentation methods.
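One of the pitfalls named above, the connectivity choice, is easy to demonstrate: the same thin diagonal structure is a single component under 8-connectivity but three components under 4-connectivity. A small illustration (not taken from the paper):

```python
def count_components(grid, connect8=False):
    """Count foreground components under 4- or 8-connectivity."""
    H, W = len(grid), len(grid[0])
    nbrs = ([(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
            if connect8 else [(-1, 0), (1, 0), (0, -1), (0, 1)])
    seen, count = set(), 0
    for i in range(H):
        for j in range(W):
            if grid[i][j] and (i, j) not in seen:
                count += 1
                stack = [(i, j)]
                seen.add((i, j))
                while stack:  # depth-first flood fill
                    a, b = stack.pop()
                    for da, db in nbrs:
                        x, y = a + da, b + db
                        if 0 <= x < H and 0 <= y < W and grid[x][y] and (x, y) not in seen:
                            seen.add((x, y))
                            stack.append((x, y))
    return count

# A diagonal "vessel": connected under 8-connectivity, broken under 4-connectivity.
diag = [[1, 0, 0],
        [0, 1, 0],
        [0, 0, 1]]
```

A topology metric computed with the wrong connectivity convention can thus report breakages that do not exist under the intended one.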

MCML Authors
Link to website

Laurin Lux

Artificial Intelligence in Healthcare and Medicine

Link to website

Alexander Weers

Artificial Intelligence in Healthcare and Medicine

Link to Profile Martin Menten

Martin Menten

Dr.

Artificial Intelligence in Healthcare and Medicine

Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine


[1854]
F. Bongratz, Y. Li, S. Elbaroudy and C. Wachinger.
3D Shape-to-Image Brownian Bridge Diffusion for Brain MRI Synthesis from Cortical Surfaces.
IPMI 2025 - Information Processing in Medical Imaging. Kos Island, Greece, May 25-30, 2025. To be published. Preprint available. arXiv GitHub
Abstract

Despite recent advances in medical image generation, existing methods struggle to produce anatomically plausible 3D structures. In synthetic brain magnetic resonance images (MRIs), characteristic fissures are often missing, and reconstructed cortical surfaces appear scattered rather than densely convoluted. To address this issue, we introduce Cor2Vox, the first diffusion model-based method that translates continuous cortical shape priors to synthetic brain MRIs. To achieve this, we leverage a Brownian bridge process which allows for direct structured mapping between shape contours and medical images. Specifically, we adapt the concept of the Brownian bridge diffusion model to 3D and extend it to embrace various complementary shape representations. Our experiments demonstrate significant improvements in the geometric accuracy of reconstructed structures compared to previous voxel-based approaches. Moreover, Cor2Vox excels in image quality and diversity, yielding high variation in non-target structures like the skull. Finally, we highlight the capability of our approach to simulate cortical atrophy at the sub-voxel level.
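The Brownian bridge at the heart of Cor2Vox interpolates between a shape prior x0 and an image xT with noise that vanishes at both endpoints. A one-dimensional sketch (heavily simplified; the paper operates on 3D volumes):

```python
import math

def brownian_bridge_sample(x0, xT, t, eps, sigma=1.0):
    """Sample the Brownian bridge between x0 and xT at time t in [0, 1];
    eps is a standard-normal draw, and the noise scale t(1-t) pins both ends."""
    mean = [(1 - t) * a + t * b for a, b in zip(x0, xT)]
    std = sigma * math.sqrt(t * (1 - t))
    return [m + std * e for m, e in zip(mean, eps)]
```

At t = 0 the sample equals the shape prior exactly and at t = 1 the target exactly, which is what makes the mapping between contours and images structured rather than noise-to-image.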

MCML Authors
Fabian Bongratz

Artificial Intelligence in Medical Imaging

Link to website

Yitong Li

Artificial Intelligence in Medical Imaging

Link to Profile Christian Wachinger

Christian Wachinger

Prof. Dr.

Artificial Intelligence in Medical Imaging


[1853]
L. D. Reyes Vargas, M. Menten, J. C. Paetzold, N. Navab and M. F. Azampour.
Skelite: Compact Neural Networks for Efficient Iterative Skeletonization.
IPMI 2025 - Information Processing in Medical Imaging. Kos Island, Greece, May 25-30, 2025. To be published. Preprint available. arXiv
Abstract

Skeletonization extracts thin representations from images that compactly encode their geometry and topology. These representations have become an important topological prior for preserving connectivity in curvilinear structures, aiding medical tasks like vessel segmentation. Existing compatible skeletonization algorithms face significant trade-offs: morphology-based approaches are computationally efficient but prone to frequent breakages, while topology-preserving methods require substantial computational resources. We propose a novel framework for training iterative skeletonization algorithms with a learnable component. The framework leverages synthetic data, task-specific augmentation, and a model distillation strategy to learn compact neural networks that produce thin, connected skeletons with a fully differentiable iterative algorithm. Our method demonstrates a 100 times speedup over topology-constrained algorithms while maintaining high accuracy and generalizing effectively to new domains without fine-tuning. Benchmarking and downstream validation in 2D and 3D tasks demonstrate its computational efficiency and real-world applicability.

MCML Authors
Martin Menten
Dr.
Artificial Intelligence in Healthcare and Medicine

Nassir Navab
Prof. Dr.
Computer Aided Medical Procedures & Augmented Reality

Mohammad Farid Azampour
Computer Aided Medical Procedures & Augmented Reality


[1852]
L. K. Senel.
Exploring the frontiers of word understanding and language model evaluation in NLP.
Dissertation 2025. DOI
Abstract

The field of natural language processing (NLP) has progressed dramatically with the rise of deep learning, yet many challenges in learning high-quality semantic representations remain. This thesis addresses these challenges through a series of studies focusing on both monolingual and multilingual contexts. (Shortened.)

MCML Authors
Lütfi Kerem Senel
Dr.
* Former Member


[1851]
D. Huang, N. Navab and Z. Jiang.
Improving Probe Localization for Freehand 3D Ultrasound using Lightweight Cameras.
ICRA 2025 - IEEE International Conference on Robotics and Automation. Atlanta, GA, USA, May 19-23, 2025. To be published.
Abstract

Ultrasound (US) probe localization relative to the examined subject is essential for freehand 3D US imaging, which offers significant clinical value due to its affordability and unrestricted field of view. However, existing methods often rely on expensive tracking systems or bulky probes, while recent US image-based deep learning methods suffer from accumulated errors during probe maneuvering. To address these challenges, this study proposes a versatile, cost-effective probe pose localization method for freehand 3D US imaging, utilizing two lightweight cameras. To eliminate accumulated errors during US scans, we introduce PoseNet, which directly predicts the probe’s 6D pose relative to a preset world coordinate system based on camera observations. We first jointly train pose and camera image encoders based on pairs of 6D pose and camera observations densely sampled in simulation. This encourages each pair of probe pose and its corresponding camera observation to share the same representation in latent space. To ensure the two encoders handle unseen images and poses effectively, we incorporate a triplet loss that enforces smaller differences in latent features between nearby poses compared to distant ones. Then, the pose decoder uses the latent representation of the camera images to predict the probe’s 6D pose. To bridge the sim-to-real gap, in the real world, we use the trained image encoder and pose decoder for initial predictions, followed by an additional MLP layer to refine the estimated pose, improving accuracy. The results obtained from an arm phantom demonstrate the effectiveness of the proposed method, which notably surpasses state-of-the-art techniques, achieving average positional and rotational errors of 2.03 mm and 0.37°, respectively.
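The triplet constraint described in the abstract is the standard metric-learning one: a latent feature should sit closer to a nearby pose's feature than to a distant pose's feature by some margin. A minimal sketch — the toy vectors and the margin value are assumptions, not the paper's settings:

```python
# Generic triplet loss on latent features, as used in metric learning.
# Illustrative only: vectors and margin are made up, not the paper's code.

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the anchor is closer to the positive (nearby pose) than to
    the negative (distant pose) by at least `margin`; positive otherwise."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Well-separated triplet: no loss. Poorly separated triplet: positive loss.
ok = triplet_loss([0.0, 0.0], [0.1, 0.0], [2.0, 0.0])
bad = triplet_loss([0.0, 0.0], [1.0, 0.0], [1.5, 0.0])
```

Minimizing this over many (anchor, nearby-pose, distant-pose) triplets pushes latent features of nearby poses together while keeping distant poses apart.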

MCML Authors
Dianye Huang
Computer Aided Medical Procedures & Augmented Reality

Nassir Navab
Prof. Dr.
Computer Aided Medical Procedures & Augmented Reality

Zhongliang Jiang
Dr.
Computer Aided Medical Procedures & Augmented Reality


[1850]
J. Jung, S. Boche, S. B. Laina and S. Leutenegger.
Uncertainty-Aware Visual-Inertial SLAM with Volumetric Occupancy Mapping.
ICRA 2025 - IEEE International Conference on Robotics and Automation. Atlanta, GA, USA, May 19-23, 2025. To be published. Preprint available. arXiv
Abstract

We propose visual-inertial simultaneous localization and mapping that tightly couples sparse reprojection errors, inertial measurement unit pre-integrals, and relative pose factors with dense volumetric occupancy mapping. Hereby depth predictions from a deep neural network are fused in a fully probabilistic manner. Specifically, our method is rigorously uncertainty-aware: first, we use depth and uncertainty predictions from a deep network not only from the robot’s stereo rig, but we further probabilistically fuse motion stereo that provides depth information across a range of baselines, therefore drastically increasing mapping accuracy. Next, predicted and fused depth uncertainty propagates not only into occupancy probabilities but also into alignment factors between generated dense submaps that enter the probabilistic nonlinear least squares estimator. This submap representation offers globally consistent geometry at scale. Our method is thoroughly evaluated in two benchmark datasets, resulting in localization and mapping accuracy that exceeds the state of the art, while simultaneously offering volumetric occupancy directly usable for downstream robotic planning and control in real-time.
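Fusing per-observation occupancy probabilities of the kind referenced above is commonly done in log-odds form, so that independent measurements add. A minimal sketch of that standard update — the uniform prior and toy observation probabilities are assumptions, not the paper's model:

```python
# Standard log-odds occupancy fusion for a single voxel.
# Illustrative sketch; not the authors' probabilistic model.
import math

def fuse_occupancy(prior, observations):
    """Combine independent per-observation occupancy probabilities into one
    posterior: add each observation's log-odds (relative to the prior)."""
    prior_logodds = math.log(prior / (1.0 - prior))
    l = prior_logodds
    for p in observations:
        l += math.log(p / (1.0 - p)) - prior_logodds
    return 1.0 / (1.0 + math.exp(-l))  # back to a probability

# Two weak "occupied" observations reinforce each other;
# contradictory observations cancel back to the prior.
reinforced = fuse_occupancy(0.5, [0.7, 0.7])
cancelled = fuse_occupancy(0.5, [0.7, 0.3])
```

With uncertain depth, a common design choice is to shrink each observation's probability toward the prior before fusing, so that high-variance depth predictions move the posterior less.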

MCML Authors
Stefan Leutenegger
Prof. Dr.
Machine Learning for Robotics


[1849]
J. Meier, L. Inchingolo, O. Dhaouadi, Y. Xia, J. Kaiser and D. Cremers.
MonoCT: Overcoming Monocular 3D Detection Domain Shift with Consistent Teacher Models.
ICRA 2025 - IEEE International Conference on Robotics and Automation. Atlanta, GA, USA, May 19-23, 2025. To be published. Preprint available. arXiv
Abstract

We tackle the problem of monocular 3D object detection across different sensors, environments, and camera setups. In this paper, we introduce a novel unsupervised domain adaptation approach, MonoCT, that generates highly accurate pseudo labels for self-supervision. Inspired by our observation that accurate depth estimation is critical to mitigating domain shifts, MonoCT introduces a novel Generalized Depth Enhancement (GDE) module with an ensemble concept to improve depth estimation accuracy. Moreover, we introduce a novel Pseudo Label Scoring (PLS) module by exploring inner-model consistency measurement and a Diversity Maximization (DM) strategy to further generate high-quality pseudo labels for self-training. Extensive experiments on six benchmarks show that MonoCT outperforms existing SOTA domain adaptation methods by large margins (~21% minimum for AP Mod.) and generalizes well to car, traffic camera and drone views.

MCML Authors
Johannes Meier
Computer Vision & Artificial Intelligence

Yan Xia
Dr.
* Former Member

Daniel Cremers
Prof. Dr.
Computer Vision & Artificial Intelligence


[1848]
D. Strieder.
Structure Uncertainty in Causal Inference.
Dissertation 2025. URL
Abstract

In order to draw causal conclusions from available data, it is crucial to reason about the underlying causal structure that governs the data-generating process. In this publication-based thesis, we tackle the challenge of rigorously accounting for uncertainty in this underlying causal structure in causal inference. We present a framework based on test inversions to construct calibrated confidence regions for total causal effects that capture both sources of uncertainty: causal structure and numerical size of nonzero effects.

MCML Authors

[1847]
M. Dannehl, S. Valenzuela and J. Kinder.
Which Instructions Matter the Most: A Saliency Analysis of Binary Function Embedding Models.
DLSP @SPW 2025 - 8th Deep Learning Security and Privacy Workshop co-located with the 46th IEEE Symposium on Security and Privacy (SPW 2025). San Francisco, CA, May 15, 2025. DOI
Abstract

Current deep learning models for binary code struggle with explainability, since it is often unclear which factors are important for a given output. In this paper, we apply occlusion-based saliency analysis as an explainability method to binary code embedding models. We conduct experiments on two state-of-the-art Transformer-based models that take preprocessed assembly code as input and calculate embedding vectors for each function. We show that, during training, the models learn the importance of different instructions. From the results, we observe that call instructions and the names of external call targets are important. This observation confirms the intuition that function calls significantly impact the semantics of a function and therefore should also have a large impact on its learned embedding. This motivates the need for developing model architectures that integrate stronger analysis into preprocessing to further leverage call relationships.
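Occlusion-based saliency itself is model-agnostic: mask one input element at a time and measure how far the output embedding moves. A toy sketch with a deterministic stand-in embedding function — the real models in the paper are Transformers over preprocessed assembly code, so everything below is an illustrative assumption:

```python
# Occlusion saliency over a list of instructions, with a toy embedding.
# The embedding is a placeholder (character-sum histogram), not a real model.

def embed(instructions):
    """Toy deterministic embedding: an 8-bin token histogram."""
    vec = [0.0] * 8
    for ins in instructions:
        vec[sum(map(ord, ins)) % 8] += 1.0
    return vec

def occlusion_saliency(instructions, mask_token="<mask>"):
    """Saliency of instruction i = distance the function embedding moves
    when instruction i is replaced by a mask token."""
    base = embed(instructions)
    scores = []
    for i in range(len(instructions)):
        occluded = instructions[:i] + [mask_token] + instructions[i + 1:]
        moved = embed(occluded)
        scores.append(sum((b - m) ** 2 for b, m in zip(base, moved)) ** 0.5)
    return scores

scores = occlusion_saliency(["push ebp", "call printf", "pop ebp"])
```

With a trained model in place of `embed`, large scores mark the instructions — e.g. calls and their external targets, per the paper's findings — that the embedding depends on most.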

MCML Authors
Moritz Dannehl
Programming Languages and Artificial Intelligence

Samuel Valenzuela
Programming Languages and Artificial Intelligence

Johannes Kinder
Prof. Dr.
Programming Languages and Artificial Intelligence


[1846]
S. Ball, S. Allmendinger, F. Kreuter and N. Kühl.
Human Preferences in Large Language Model Latent Space: A Technical Analysis on the Reliability of Synthetic Data in Voting Outcome Prediction.
AAPOR 2025 - AAPOR 80th Annual Conference on Reshaping Democracy’s Oracle: Transforming Polls, Surveys, and the Measurement of Public Opinion in the Age of AI. St. Louis, MO, USA, May 14-16, 2025. To be published. Preprint available. arXiv
Abstract

Generative AI (GenAI) is increasingly used in survey contexts to simulate human preferences. While many research endeavors evaluate the quality of synthetic GenAI data by comparing model-generated responses to gold-standard survey results, fundamental questions about the validity and reliability of using LLMs as substitutes for human respondents remain. Our study provides a technical analysis of how demographic attributes and prompt variations influence latent opinion mappings in large language models (LLMs) and evaluates their suitability for survey-based predictions. Using 14 different models, we find that LLM-generated data fails to replicate the variance observed in real-world human responses, particularly across demographic subgroups. In the political space, persona-to-party mappings exhibit limited differentiation, resulting in synthetic data that lacks the nuanced distribution of opinions found in survey data. Moreover, we show that prompt sensitivity can significantly alter outputs for some models, further undermining the stability and predictiveness of LLM-based simulations. As a key contribution, we adapt a probe-based methodology that reveals how LLMs encode political affiliations in their latent space, exposing the systematic distortions introduced by these models. Our findings highlight critical limitations in AI-generated survey data, urging caution in its use for public opinion research, social science experimentation, and computational behavioral modeling.

MCML Authors
Frauke Kreuter
Prof. Dr.
Social Data Science and AI


[1845]
O. Kononykhina.
How ML-Filtered Answer Options Shape Responses and Interactions in CATI Surveys.
AAPOR 2025 - AAPOR 80th Annual Conference on Reshaping Democracy’s Oracle: Transforming Polls, Surveys, and the Measurement of Public Opinion in the Age of AI. St. Louis, MO, USA, May 14-16, 2025. To be published. Preprint available. URL
Abstract

Occupational coding has historically been a manual, post-survey task, but tools like OccuCoDe are shifting this process into real-time surveys using machine learning (ML). OccuCoDe dynamically filters and presents tailored answer options, allowing respondents themselves to select the description that best matches their occupation. However, our study revealed low agreement between such respondent-driven ML-based coding and post-survey manual coding, prompting us to explore how the quality of responses in automatic occupational coding relates to the quality of answer options, respondent and interviewer behaviors. We embedded OccuCoDe into a standard monthly multi-topic survey conducted by the Institute for Applied Social Science (INFAS) from 1 April to 31 June 2019. The survey was designed as a cross-sectional and panel survey with a 30:70 ratio for panel and new respondents, resulting in a representative sample of adults in Germany aged 18 and older. We received and analyzed 669 audio recordings through behavioral coding. Results showed that the quality of ML-generated suggestions significantly influenced classification accuracy, with highly accurate suggestions leading to better alignment with manual coding. Contrary to expectations, behavioral factors such as interviewer adherence to scripts or respondent mapping or comprehension issues were not the significant drivers of mismatches. Instead, familiar survey dynamics persisted: respondents often interrupted when they identified an option they liked, or interviewers skipped certain categories (e.g., ‘Other’). These findings suggest that while integrating ML or other AI tools into surveys is potentially fruitful, the key to success lies in refining the precision and distinctiveness of answer options. We also demonstrate that, although both respondents and interviewers showed adaptability to the presence of an automatic component, their behaviors could not overcome mismatches caused by limitations in ML-generated suggestions. In occupational coding—and potentially other survey domains—the effectiveness of real-time ML/AI integration depends on aligning algorithmic outputs with respondent realities to achieve high-quality data.

MCML Authors
Olga Kononykhina
Social Data Science and AI


[1844]
C. Kühn and S.-V. Kuntz.
Analysis of the Geometric Structure of Neural Networks and Neural ODEs via Morse Functions.
DS 2025 - SIAM Conference on Applications of Dynamical Systems. Denver, CO, USA, May 11-15, 2025. To be published. Preprint available. arXiv
Abstract

Besides classical feed-forward neural networks, also neural ordinary differential equations (neural ODEs) have gained particular interest in recent years. Neural ODEs can be interpreted as an infinite depth limit of feed-forward or residual neural networks. We study the input-output dynamics of finite and infinite depth neural networks with scalar output. In the finite depth case, the input is a state associated with a finite number of nodes, which maps under multiple non-linear transformations to the state of one output node. In analogy, a neural ODE maps an affine linear transformation of the input to an affine linear transformation of its time-T map. We show that depending on the specific structure of the network, the input-output map has different properties regarding the existence and regularity of critical points, which can be characterized via Morse functions. We prove that critical points cannot exist if the dimension of the hidden layer is monotonically decreasing or the dimension of the phase space is smaller or equal to the input dimension. In the case that critical points exist, we classify their regularity depending on the specific architecture of the network. We show that except for a Lebesgue measure zero set in the weight space, each critical point is non-degenerate, if for finite depth neural networks the underlying graph has no bottleneck, and if for neural ODEs, the affine linear transformations used have full rank. For each type of architecture, the proven properties are comparable in the finite and the infinite depth case. The established theorems allow us to formulate results on universal embedding, i.e., on the exact representation of maps by neural networks and neural ODEs. Our dynamical systems viewpoint on the geometric structure of the input-output map provides a fundamental understanding of why certain architectures perform better than others.
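The Morse-function criterion the abstract appeals to is the standard one; as a reminder (notation is assumed here, not taken from the paper):

```latex
% A critical point x^* of the scalar input-output map \Phi is a zero of the
% gradient; it is non-degenerate (Morse) iff the Hessian there is invertible:
\nabla \Phi(x^*) = 0, \qquad
\det\!\left( \nabla^2 \Phi(x^*) \right) \neq 0 .
```

The paper's genericity results are statements of exactly this form: outside a Lebesgue-null set of weights, every critical point of the network's input-output map satisfies the second condition.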

MCML Authors
Christian Kühn
Prof. Dr.
Multiscale and Stochastic Dynamics

Sara-Viola Kuntz
Multiscale and Stochastic Dynamics


[1843]
G. Manten, C. Casolo, S. W. Mogensen and N. Kilbertus.
An Asymmetric Independence Model for Causal Discovery on Path Spaces.
CLeaR 2025 - 4th Conference on Causal Learning and Reasoning. Lausanne, Switzerland, May 07-09, 2025. To be published. Preprint available. arXiv
Abstract

We develop the theory linking ‘E-separation’ in directed mixed graphs (DMGs) with conditional independence relations among coordinate processes in stochastic differential equations (SDEs), where causal relationships are determined by ‘which variables enter the governing equation of which other variables’. We prove a global Markov property for cyclic SDEs, which naturally extends to partially observed cyclic SDEs, because our asymmetric independence model is closed under marginalization. We then characterize the class of graphs that encode the same set of independence relations, yielding a result analogous to the seminal ‘same skeleton and v-structures’ result for directed acyclic graphs (DAGs). In the fully observed case, we show that each such equivalence class of graphs has a greatest element as a parsimonious representation and develop algorithms to identify this greatest element from data. We conjecture that a greatest element also exists under partial observations, which we verify computationally for graphs with up to four nodes.

MCML Authors
Georg Manten
Ethics in Systems Design and Machine Learning

Cecilia Casolo
Ethics in Systems Design and Machine Learning

Niki Kilbertus
Prof. Dr.
Ethics in Systems Design and Machine Learning


[1842]
T. Nagler and T. Vatter.
Solving Estimating Equations With Copulas.
AISTATS 2025 - 28th International Conference on Artificial Intelligence and Statistics. Mai Khao, Thailand, May 03-05, 2025. DOI
Abstract

Thanks to their ability to capture complex dependence structures, copulas are frequently used to glue random variables into a joint model with arbitrary marginal distributions. More recently, they have been applied to solve statistical learning problems such as regression or classification. Framing such approaches as solutions of estimating equations, we generalize them in a unified framework. We can then obtain simultaneous, coherent inferences across multiple regression-like problems. We derive consistency, asymptotic normality, and validity of the bootstrap for corresponding estimators. The conditions allow for both continuous and discrete data as well as parametric, nonparametric, and semiparametric estimators of the copula and marginal distributions. The versatility of this methodology is illustrated by several theoretical examples, a simulation study, and an application to financial portfolio allocation. Supplementary materials for this article are available online.

MCML Authors
Thomas Nagler
Prof. Dr.
Computational Statistics & Data Science


[1841]
R. Schulte and D. Rügamer.
Additive Model Boosting: New Insights and Path(ologie)s.
AISTATS 2025 - 28th International Conference on Artificial Intelligence and Statistics. Mai Khao, Thailand, May 03-05, 2025. Oral Presentation. To be published. Preprint available. URL
Abstract

Additive models (AMs) have sparked a lot of interest in machine learning recently, allowing the incorporation of interpretable structures into a wide range of model classes. Many commonly used approaches to fit a wide variety of potentially complex additive models build on the idea of boosting additive models. While boosted additive models (BAMs) work well in practice, certain theoretical aspects are still poorly understood, including general convergence behavior and what optimization problem is being solved when accounting for the implicit regularizing nature of boosting. In this work, we study the solution paths of BAMs and establish connections with other approaches for certain classes of problems. Along these lines, we derive novel convergence results for BAMs, which yield crucial insights into the inner workings of the method. While our results generally provide reassuring theoretical evidence for the practical use of BAMs, they also uncover some ‘pathologies’ of boosting for certain additive model classes concerning their convergence behavior that require caution in practice. We empirically validate our theoretical findings through several numerical experiments.

MCML Authors
Rickmer Schulte
Statistics, Data Science and Machine Learning

David Rügamer
Prof. Dr.
Statistics, Data Science and Machine Learning


[1840]
H.-H. Chou, J. Maly, C. M. Verdun, B. Freitas Paulo da Costa and H. Mirandola.
Get rid of your constraints and reparametrize: A study in NNLS and implicit bias.
AISTATS 2025 - 28th International Conference on Artificial Intelligence and Statistics. Mai Khao, Thailand, May 03-05, 2025. To be published. URL
Abstract

Over the past years, there has been significant interest in understanding the implicit bias of gradient descent optimization and its connection to the generalization properties of overparametrized neural networks. Several works observed that when training linear diagonal networks on the square loss for regression tasks (which corresponds to overparametrized linear regression) gradient descent converges to special solutions, e.g., non-negative ones. We connect this observation to Riemannian optimization and view overparametrized GD with identical initialization as a Riemannian GD. We use this fact for solving non-negative least squares (NNLS), an important problem behind many techniques, e.g., non-negative matrix factorization. We show that gradient flow on the reparametrized objective converges globally to NNLS solutions, providing convergence rates also for its discretized counterpart. Unlike previous methods, we do not rely on the calculation of exponential maps or geodesics. We further show accelerated convergence using a second-order ODE, lending itself to accelerated descent methods. Finally, we establish the stability against negative perturbations and discuss generalization to other constrained optimization problems.
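The reparametrization idea at the heart of the abstract — replace the constraint x ≥ 0 by x = w ∘ w and run plain, unconstrained gradient descent on w — can be sketched on a toy problem. The 2×2 system, step size, and iteration count below are illustrative assumptions, not the paper's experiments:

```python
# Non-negative least squares via reparametrization: min ||A(w∘w) - b||^2 over w.
# x = w∘w is nonnegative by construction, so no projection is needed.
# Illustrative sketch only.

def nnls_reparam(A, b, steps=20000, lr=0.01):
    m, n = len(A), len(A[0])
    w = [1.0] * n  # identical nonzero initialization, as in overparametrized GD
    for _ in range(steps):
        x = [wi * wi for wi in w]
        # residual r = A x - b
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        # chain rule: d/dw_j ||A(w∘w) - b||^2 / 2 = 2 w_j * (A^T r)_j
        grad = [2.0 * w[j] * sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return [wi * wi for wi in w]

# Unconstrained least squares for this system wants x = (2, -1);
# the reparametrized flow instead converges to the nonnegative solution (2, 0).
x = nnls_reparam([[1.0, 0.0], [0.0, 1.0]], [2.0, -1.0])
```

Note there is no exponential map, geodesic, or projection step anywhere — matching the paper's selling point that the Riemannian structure is handled implicitly by the parametrization.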

MCML Authors
Johannes Maly
Prof. Dr.
Mathematical Data Science and Artificial Intelligence


[1839]
D. Dold, J. Kobialka, N. Palm, E. Sommer, D. Rügamer and O. Dürr.
Paths and Ambient Spaces in Neural Loss Landscapes.
AISTATS 2025 - 28th International Conference on Artificial Intelligence and Statistics. Mai Khao, Thailand, May 03-05, 2025. To be published. URL
Abstract

Understanding the structure of neural network loss surfaces, particularly the emergence of low-loss tunnels, is critical for advancing neural network theory and practice. In this paper, we propose a novel approach to directly embed loss tunnels into the loss landscape of neural networks. Exploring the properties of these loss tunnels offers new insights into their length and structure and sheds light on some common misconceptions. We then apply our approach to Bayesian neural networks, where we improve subspace inference by identifying pitfalls and proposing a more natural prior that better guides the sampling procedure.

MCML Authors
Julius Kobialka
Statistics, Data Science and Machine Learning

Nicolai Palm
Computational Statistics & Data Science

Emanuel Sommer
Statistics, Data Science and Machine Learning

David Rügamer
Prof. Dr.
Statistics, Data Science and Machine Learning


[1838]
A. Koebler, T. Decker, I. Thon, V. Tresp and F. Buettner.
Incremental Uncertainty-aware Performance Monitoring with Active Labeling Intervention.
AISTATS 2025 - 28th International Conference on Artificial Intelligence and Statistics. Mai Khao, Thailand, May 03-05, 2025. To be published. URL
Abstract

We study the problem of monitoring machine learning models under gradual distribution shifts, where circumstances change slowly over time, often leading to unnoticed yet significant declines in accuracy. To address this, we propose Incremental Uncertainty-aware Performance Monitoring (IUPM), a novel label-free method that estimates performance changes by modeling gradual shifts using optimal transport. In addition, IUPM quantifies the uncertainty in the performance prediction and introduces an active labeling procedure to restore a reliable estimate under a limited labeling budget. Our experiments show that IUPM outperforms existing performance estimation baselines in various gradual shift scenarios and that its uncertainty awareness guides label acquisition more effectively compared to other strategies.

MCML Authors
Thomas Decker
Database Systems and Data Mining

Volker Tresp
Prof. Dr.
Database Systems and Data Mining


[1837]
J. Marcon, P. Weinhold, M. Rzany, M. P. Fabritius, M. Winkelmann, A. Buchner, L. Eismann, J.-F. Jokisch, J. Casuscelli, G. B. Schulz, T. Knösel, M. Ingrisch, J. Ricke, C. G. Stief, S. Rodler and P. M. Kazmierczak.
Radiomics-based differentiation of upper urinary tract urothelial and renal cell carcinoma in preoperative computed tomography datasets.
BMC Medical Imaging 25.196 (May. 2025). DOI
Abstract

Background: To investigate a non-invasive radiomics-based machine learning algorithm to differentiate upper urinary tract urothelial carcinoma (UTUC) from renal cell carcinoma (RCC) prior to surgical intervention.
Methods: Preoperative computed tomography venous-phase datasets from patients that underwent procedures for histopathologically confirmed UTUC or RCC were retrospectively analyzed. Tumor segmentation was performed manually, and radiomic features were extracted according to the International Image Biomarker Standardization Initiative. Features were normalized using z-scores, and a predictive model was developed using the least absolute shrinkage and selection operator (LASSO). The dataset was split into a training cohort (70%) and a test cohort (30%).
Results: A total of 236 patients [30.5% female, median age 70.5 years (IQR: 59.5–77), median tumor size 5.8 cm (range: 4.1–8.2 cm)] were included. For differentiating UTUC from RCC, the model achieved a sensitivity of 88.4% and specificity of 81% (AUC: 0.93, radiomics score cutoff: 0.467) in the training cohort. In the validation cohort, the sensitivity was 80.6% and specificity 80% (AUC: 0.87, radiomics score cutoff: 0.601). Subgroup analysis of the validation cohort demonstrated robust performance, particularly in distinguishing clear cell RCC from high-grade UTUC (sensitivity: 84%, specificity: 73.1%, AUC: 0.84) and high-grade from low-grade UTUC (sensitivity: 57.7%, specificity: 88.9%, AUC: 0.68). Limitations include the need for independent validation in future randomized controlled trials (RCTs).
Conclusions: Machine learning-based radiomics models can reliably differentiate between RCC and UTUC in preoperative CT imaging. With a suggested performance benefit compared to conventional imaging, this technology might be added to the current preoperative diagnostic workflow.
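For readers less familiar with the reported metrics: the evaluation step reduces to z-scoring features and thresholding a radiomics score, then counting sensitivity and specificity. A toy sketch with made-up scores and labels — the study's actual cutoffs were 0.467 (training) and 0.601 (validation):

```python
# Z-score normalization and cutoff-based sensitivity/specificity.
# Data below are invented for illustration; not the study's cohort.

def z_score(values):
    """Normalize a feature column to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / var ** 0.5 for v in values]

def sensitivity_specificity(scores, labels, cutoff):
    """labels: 1 = UTUC (positive), 0 = RCC. Predict UTUC when score > cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > cutoff)
    return tp / (tp + fn), tn / (tn + fp)

zs = z_score([1.0, 2.0, 3.0])
sens, spec = sensitivity_specificity([0.9, 0.7, 0.3, 0.2], [1, 1, 0, 0], 0.5)
```

Sweeping the cutoff over all score values traces out the ROC curve whose area is the reported AUC.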

MCML Authors
Michael Ingrisch
Prof. Dr.
Clinical Data Science in Radiology


[1836]
J. Baumsteiger, L. Celiberti, P. Rinke, M. Todorović and C. Franchini.
Exploring Noncollinear Magnetic Energy Landscapes with Bayesian Optimization.
Digital Discovery 4.6 (May. 2025). DOI
Abstract

The investigation of magnetic energy landscapes and the search for ground states of magnetic materials using ab initio methods like density functional theory (DFT) is a challenging task. Complex interactions, such as superexchange and spin-orbit coupling, make these calculations computationally expensive and often lead to non-trivial energy landscapes. Consequently, a comprehensive and systematic investigation of large magnetic configuration spaces is often impractical. We approach this problem by utilizing Bayesian Optimization, an active machine learning scheme that has proven to be efficient in modeling unknown functions and finding global minima. Using this approach we can obtain the magnetic contribution to the energy as a function of one or more spin canting angles with relatively small numbers of DFT calculations. To assess the capabilities and the efficiency of the approach we investigate the noncollinear magnetic energy landscapes of selected materials containing 3d, 5d and 5f magnetic ions: Ba3MnNb2O9, LaMn2Si2, β-MnO2, Sr2IrO4, UO2 and Ba2NaOsO6. By comparing our results to previous ab initio studies that followed more conventional approaches, we observe significant improvements in efficiency.

MCML Authors
Patrick Rinke
Prof. Dr.
AI-based Material Science


[1835]
L. Mamede, R. C. Sabàb, S. Van Coillie, J. Prevot, S. Sánchez-Ramón, C. Poli, A. Barasa, B. W. Schuller, A. Hendel, N. Garcelon, C. Boersma, P. Lee, C. Booth, L. D. Notarangelo, J. Drabwell, N. L. Rider, F. Staal, S. O. Burns, M. van Hagen, M. Pergent, J. G. Rivière and N. Mahlaoui.
Navigating disruption in the PID landscape: embracing opportunities and anticipating threats in the next ten years.
Frontiers in Immunology 16 (May. 2025). DOI
Abstract

The International Patient Organisation for Primary Immunodeficiencies (IPOPI) held its third edition of the Global Multi-Stakeholders’ Summit, gathering key primary immunodeficiencies (PID) stakeholders and experts to discuss and foster global collaboration. This edition focused on the impact of genomic medicine in PID treatment, the role of digital health, including artificial intelligence, in PID care, and how to anticipate and minimise risks to ensure optimal patient access to care. These discussions aimed to examine current hurdles and brainstorm feasible solutions and priorities for the PID community in these areas in the next ten years. These discussions led to recommendations for comprehensive approaches to care and access to treatment for PID patients, suggesting actions that will bring the community closer to treatments based on real-world evidence and adjusted to patient’s needs. To accomplish this, collaboration between academia, industry, regulatory authorities, and patients is crucial.

MCML Authors
Björn Schuller
Prof. Dr.
Health Informatics


[1834]
C. Schweden, K. Hechinger, G. Kauermann and X. Zhu.
Can Uncertainty Quantification Benefit From Label Embeddings? A Case Study on Local Climate Zone Classification.
IEEE Transactions on Geoscience and Remote Sensing 63 (May. 2025). DOI
Abstract

Modern deep learning models have achieved superior performance in almost all fields of remote sensing. An often neglected aspect of these models is the quantification and evaluation of predictive uncertainties. Regarding a classification task, this means that the focus of the analysis solely lies on performance metrics such as accuracy or the loss. On the other hand, a notion of uncertainty indicates the model’s indecisiveness among the given classes and is essential to understand where the model struggles to classify the data samples. In this work, three levels of uncertainty are distinguished, starting with the typical softmax pseudo-probabilities as level-1 uncertainty. As a next level, the more flexible Dirichlet framework is utilized as model output space, and hereby also, a Bayesian setting with an uninformative prior is considered. For the level-3 uncertainty, an empirical Bayes setting is incorporated where a latent embedding of the label space is iteratively estimated by the marginal likelihood of the fully parameterized label space (see [1]). The estimated embeddings are then learned by the network in three different settings: Two regression losses use the embeddings directly, while the closed-form solution of the Kullback-Leibler (KL-) Divergence uses the embedding parameterized as a Dirichlet distribution. To assess the different levels of uncertainty, the label evaluation subset of the So2Sat LCZ42 dataset, which contains label votes from multiple remote sensing experts, is investigated. The predictive uncertainties are evaluated by means of Out-of-Distribution (OoD) detection and calibration performance. Overall, the embedding-based approaches show strong performance for calibration, while for the OoD experiments, the Bayesian Dirichlet setting with an uninformative prior achieves the best performance. In conclusion, embedded labels offer a flexible framework for incorporating uncertain or ambiguous labels into a supervised training setup. 
They could be highly beneficial for applications in fields such as urban planning or disaster response.
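The level-1 uncertainty described above is derived directly from the softmax output; one generic way to quantify the model's indecisiveness among the given classes is the entropy of the softmax distribution. The following is an illustrative sketch of that general idea, not the paper's code:

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predictive_entropy(logits):
    # level-1 uncertainty: entropy of the softmax pseudo-probabilities;
    # higher entropy means the model is less decisive among the classes
    p = softmax(logits)
    return -(p * np.log(p)).sum(axis=-1)

confident = np.array([5.0, 0.0, 0.0])   # one class clearly dominates
undecided = np.array([1.0, 1.0, 1.0])   # uniform over three classes
print(predictive_entropy(undecided) > predictive_entropy(confident))  # True
```

The Dirichlet-based levels 2 and 3 generalize this by placing a distribution over such class probabilities instead of predicting a single point estimate.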

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1833]
S. Wang, N. A. A. Braham and X. Zhu.
Weak-strong Graph Contrastive Learning Neural Network for Hyperspectral Image Classification.
IEEE Transactions on Geoscience and Remote Sensing Early Access (May. 2025). DOI GitHub
Abstract

Deep learning methods have shown promising results in various hyperspectral image (HSI) analysis tasks. Despite these advancements, existing models still struggle to accurately identify fine-classified land cover types on noisy hyperspectral images. Traditional methods have limited performance when extracting features from noisy hyperspectral data. Graph Neural Networks (GNNs) offer an adaptable and robust structure by effectively extracting both spectral and spatial features. However, supervised models still require large quantities of labeled data for effective training, posing a significant challenge. Contrastive learning, which leverages unlabeled data for pre-training, can mitigate this issue by reducing the dependency on extensive manual annotation. To address these issues, we propose WSGraphCL, a weak-strong graph contrastive learning model for HSI classification, and conduct experiments in a few-shot scenario. First, the image is transformed into K-hop subgraphs through a spectral-spatial adjacency matrix construction method. Second, WSGraphCL leverages contrastive learning to pre-train a graph-based encoder on the unlabeled hyperspectral image. We demonstrate that weak-strong augmentations and false-negative pair filtering stabilize pre-training and yield good-quality representations. Finally, we evaluate our model by training a lightweight classifier on the learned features with only a handful of labels. Experimental results showcase the superior performance of WSGraphCL compared to several baseline models, thereby emphasizing its efficacy in addressing the identified limitations in HSI classification.

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1832]
S. Zhao, Z. Xiong and X. Zhu.
RainScaler: A Physics-inspired Network for Precipitation Correction and Downscaling.
IEEE Transactions on Geoscience and Remote Sensing Early Access (May. 2025). DOI GitHub
Abstract

Spatial downscaling of precipitation, in which fine-grained regional precipitation patterns are recovered from coarse-resolution images, plays a crucial role in various weather and meteorological analyses. However, the intricate noise information present in the observation data intertwines with the fine-scale characteristics, which poses challenges for subsequent feature extraction. Regional precipitation also exhibits complex spatial patterns. Moreover, the real observatory data contains information inconsistent with established physical principles, due either to inaccurate or incomplete physical models or to limited data quality, thus making the implementation of physically informed deep learning more difficult. For example, strong physical constraints may lead to over-regularization, in which the model becomes too rigid and fails to capture certain complexities in the data. In this work, we propose RainScaler, a physics-inspired deep neural network, to tackle these issues. First, to remove the noise and preserve the vital precipitation patterns effectively, the proposed RainScaler exploits an Inconsistency-aware Denoising Net to explicitly model the spatial variability of noise in the input. In addition, a graph module is designed to learn the geography-dependent fine-grained patterns in high-dimensional feature space at a moderate computation cost. Finally, multi-scale physical constraints are skillfully embedded to incorporate additional insights into the data-driven framework. We test our approach on a public dataset consisting of over 60,000 real low-resolution and high-resolution precipitation map pairs collected by different sensors. Our method produces realistic-looking precipitation maps with better discernment capability and corrects the structural error of precipitation distribution, especially for extreme events. Moreover, we evaluate the potential risks of incorporating physical constraints in real-world data applications.
Our method unveils opportunities for multi-source data fusion and provides possible solutions to improve the physical feasibility of data-driven models.

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1831]
Q. Li, L. Mou, Y. Shi and X. Zhu.
BANet: A bilateral attention network for extracting changed buildings between remote sensing imagery and cadastral maps.
International Journal of Applied Earth Observation and Geoinformation 139.104486 (May. 2025). DOI
Abstract

Up-to-date cadastral maps are vital to local governments in administrating real estate in cities. With its growing availability, remote sensing imagery is a cost-effective data source for updating semantic contents on cadastral maps. In this study, we address the problem of updating buildings on cadastral maps, as city renewal is mainly characterized by new construction and demolition. While previous works focus on extracting all buildings from remote sensing images, we argue that these methods not only disregard preliminary information on cadastral maps but also fail to preserve building priors in unchanged areas on cadastral maps. Therefore, we focus on the task of extracting changed buildings (i.e., newly built and demolished buildings) from remote sensing images and cadastral maps. To address this task, we create an image-map building change detection (IMBCD) dataset, formed by around 27K pairs of remote sensing images and maps and their corresponding changed buildings in six distinct geographical areas across the globe. Accordingly, we propose a Bilateral Attention Network (BANet), introducing a novel attention mechanism: changed-first (CF) attention and non-changed-first (NCF) attention. This bilateral attention mechanism helps to refine the uncertain areas between changed and non-changed regions. Extensive experiments on our IMBCD dataset showcase the superior performance of BANet. Specifically, our BANet outperforms state-of-the-art models with F1 scores of 90.00% and 63.00% for the IMBCD-WHU and IMBCD-Inria datasets. This confirms that leveraging bilateral attention blocks (BAB) can boost performance.

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1830]
Y. Mu, J. Guo, M. Shahzad and X. Zhu.
National-scale tree species mapping with deep learning reveals forest management insights in Germany.
International Journal of Applied Earth Observation and Geoinformation 139.104522 (May. 2025). DOI
Abstract

Accurate tree species distribution is essential for biodiversity assessment, sustainable forest management, and environmental policy. However, mapping species over large areas with satellite data is challenging due to spectral mixing and complex spatial distribution. To address this, we developed a novel deep learning model, ForestFormer, using Sentinel-2 time series data to map eight dominant tree species in Germany. ForestFormer’s dual-branch network with spectral and spatial attention modules improves classification by highlighting species-specific characteristics. Cross-validation in 2,364 National Forest Inventory plots shows that ForestFormer achieves species classification accuracy ranging from 69% to 92%, with an average accuracy of 84%, outperforming existing baseline methods. The developed ForestFormer model can help generate a large-scale and reliable tree species map for Germany, which in turn provides crucial insights into the diverse characteristics of tree species to support forest management. Our analysis of results shows that Pine is the species most resistant to disturbances, while Douglas fir is the least. Northeastern regions of Germany exhibit particularly low levels of forest biodiversity, especially in the states of Brandenburg and Berlin, followed by neighboring states such as Sachsen-Anhalt, Mecklenburg-Vorpommern, Sachsen, and Niedersachsen. In addition, climatic factors, especially water deficit, are shown to play a very important role in determining tree species distribution patterns, followed by topographic and soil factors. These findings are anticipated to provide a critical basis for environmental policy formulation, particularly in forest management strategies responding to ongoing climate change.

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1829]
H. Boche, V. Fojtik, A. Fono and G. Kutyniok.
Computability of Classification and Deep Learning: From Theoretical Limits to Practical Feasibility through Quantization.
Journal of Fourier Analysis and Applications 31.35 (May. 2025). DOI
Abstract

The unwavering success of deep learning in the past decade led to the increasing prevalence of deep learning methods in various application fields. However, the downsides of deep learning, most prominently its lack of trustworthiness, may not be compatible with safety-critical or high-responsibility applications requiring stricter performance guarantees. Recently, several instances of deep learning applications have been shown to be subject to theoretical limitations of computability, undermining the feasibility of performance guarantees when employed on real-world computers. We extend these findings by studying computability in the deep learning framework from two perspectives: from an application viewpoint in the context of classification problems, and from a general limitation viewpoint in the context of training neural networks. In particular, we show restrictions on the algorithmic solvability of classification problems that also render the algorithmic detection of failure in computations in a general setting infeasible. Subsequently, we prove algorithmic limitations in training deep neural networks even in cases where the underlying problem is well-behaved. Finally, we end with a positive observation, showing that in quantized versions of classification and deep network training, computability restrictions do not arise or can be overcome to a certain degree.

MCML Authors
Link to website

Vit Fojtik

Mathematical Foundations of Artificial Intelligence

Link to website

Adalbert Fono

Mathematical Foundations of Artificial Intelligence

Link to Profile Gitta Kutyniok

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence


[1828]
C. Kern, U. Fischer-Abaigar, J. Schweisthal, D. Frauen, R. Ghani, S. Feuerriegel, M. van der Schaar and F. Kreuter.
Algorithms for reliable decision-making need causal reasoning.
Nature Computational Science 5 (May. 2025). DOI
Abstract

Decision-making inherently involves cause–effect relationships that introduce causal challenges. We argue that reliable algorithms for decision-making need to build upon causal reasoning. Addressing these causal challenges requires explicit assumptions about the underlying causal structure to ensure identifiability and estimatability, which means that the computational methods must successfully align with decision-making objectives in real-world tasks. Algorithmic decision-making (ADM) has become common in a wide range of domains, including precision medicine, manufacturing, education, hiring, the public sector, and smart cities. At the core of ADM systems are data-driven models that learn from data to recommend decisions, often with the goal of maximizing a defined utility function [1]. For example, in smart city contexts, ADM is frequently used to optimize traffic flow through predictive models that analyze real-time data, thereby reducing congestion and improving urban mobility. Another prominent application area for ADM is normative decision support systems (often subsumed under ‘prescriptive analytics’) or, more recently, artificial intelligence (AI) agents that either inform or automatically execute managerial and operational decisions in industry. Yet, the applications of ADM to high-stakes decisions face safety and reliability issues [1,2,3]. Often, the objectives of ADM systems fail to align with the nuanced goals of real-world decision-making, thus creating a tension between the potential of ADM and the risk of harm and failure. Especially when deployed in dynamic, real-world environments, ADM can amplify systemic disadvantages for vulnerable communities and lead to flawed decisions. In this Comment, we argue that reliable algorithmic decision-making — systems that perform safely and robustly under deployment conditions — must be grounded in causal reasoning.

MCML Authors
Link to Profile Christoph Kern

Christoph Kern

Prof. Dr.

Social Data Science and AI Lab

Link to website

Jonas Schweisthal

Artificial Intelligence in Management

Link to website

Dennis Frauen

Artificial Intelligence in Management

Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management

Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI


[1827]
H. Homm, J. Laakso and P. Rinke.
Efficient dataset generation for machine learning halide perovskite alloys.
Physical Review Materials 9.053802 (May. 2025). DOI
Abstract

Lead-based perovskite solar cells have reached high efficiencies, but toxicity and lack of stability hinder their wide-scale adoption. These issues have been partially addressed through compositional engineering of perovskite materials, but the vast complexity of the perovskite materials space poses a significant obstacle to exploration. We previously demonstrated how machine learning (ML) can accelerate property predictions for the CsPb(Cl/Br)3 perovskite alloy. However, the substantial computational demand of density functional theory (DFT) calculations required for model training prevents applications to more complex materials. Here, we introduce a data-efficient scheme to facilitate model training, validated initially on CsPb(Cl/Br)3 data and extended to the ternary alloy CsSn(Cl/Br/I)3. Our approach employs clustering to construct a compact yet diverse initial dataset of atomic structures. We then apply a two-stage active learning approach to first improve the reliability of the ML-based structure relaxations and then refine accuracy near equilibrium structures. Tests for CsPb(Cl/Br)3 demonstrate that our scheme reduces the number of required DFT calculations during the different parts of our proposed model training method by up to 20% and 50%. The fitted model for CsSn(Cl/Br/I)3 is robust and highly accurate, evidenced by the convergence of all ML-based structure relaxations in our tests and an average relaxation error of only 0.5 meV/atom.

MCML Authors
Link to Profile Patrick Rinke

Patrick Rinke

Prof. Dr.

AI-based Material Science


[1826]
H. Löwe, C. A. Scholbeck, C. Heumann, B. Bischl and G. Casalicchio.
fmeffects: An R Package for Forward Marginal Effects.
The R Journal 16.3 (May. 2025). DOI
Abstract

Forward marginal effects have recently been introduced as a versatile and effective model-agnostic interpretation method particularly suited for non-linear and non-parametric prediction models. They provide comprehensible model explanations of the form: if we change feature values by a pre-specified step size, what is the change in the predicted outcome? We present the R package fmeffects, the first software implementation of the theory surrounding forward marginal effects. The relevant theoretical background, package functionality and handling, as well as the software design and options for future extensions are discussed in this paper.
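The package's central question, "if we change feature values by a pre-specified step size, what is the change in the predicted outcome?", can be sketched generically. The following is an illustrative Python sketch with placeholder names (`forward_marginal_effect`, `predict`); it is not the fmeffects R API:

```python
import numpy as np

def forward_marginal_effect(predict, X, feature, step):
    """Observation-wise change in prediction when `feature` moves by `step`."""
    X_shifted = X.copy()
    X_shifted[:, feature] += step          # shift the chosen feature column
    return predict(X_shifted) - predict(X)  # model-agnostic: only needs predict()

# toy model: y = 3*x0 + x1**2, so the FME for x0 with step 0.5 is 3*0.5 = 1.5
predict = lambda X: 3 * X[:, 0] + X[:, 1] ** 2
X = np.array([[1.0, 2.0], [0.0, -1.0]])
fme = forward_marginal_effect(predict, X, feature=0, step=0.5)
print(fme)  # [1.5 1.5]
```

Because the effect is computed per observation, non-linear models generally yield different values across rows; here the toy model is linear in x0, so all rows agree.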

MCML Authors
Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to website

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science


[1825]
J. O. Alabi, M. A. Hedderich, D. I. Adelani and D. Klakow.
Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead.
Preprint (May. 2025). arXiv
Abstract

With over 2,000 languages and potentially millions of speakers, Africa represents one of the richest linguistic regions in the world. Yet, this diversity is scarcely reflected in state-of-the-art natural language processing (NLP) systems and large language models (LLMs), which predominantly support a narrow set of high-resource languages. This exclusion not only limits the reach and utility of modern NLP technologies but also risks widening the digital divide across linguistic communities. Nevertheless, NLP research on African languages is active and growing. In recent years, there has been a surge of interest in this area, driven by several factors, including the creation of multilingual language resources, the rise of community-led initiatives, and increased support through funding programs. In this survey, we analyze 734 research papers on NLP for African languages published over the past five years, offering a comprehensive overview of recent progress across core tasks. We identify key trends shaping the field and conclude by outlining promising directions to foster more inclusive and sustainable NLP research for African languages.

MCML Authors
Link to Profile Michael Hedderich

Michael Hedderich

Dr.

AI and Computational Linguistics


[1824]
M. Arpogaus, T. Kneib, T. Nagler and D. Rügamer.
Hybrid Bernstein Normalizing Flows for Flexible Multivariate Density Regression with Interpretable Marginals.
Preprint (May. 2025). arXiv
Abstract

Density regression models allow a comprehensive understanding of data by modeling the complete conditional probability distribution. While flexible estimation approaches such as normalizing flows (NF) work particularly well in multiple dimensions, interpreting the input-output relationship of such models is often difficult, due to the black-box character of deep learning models. In contrast, existing statistical methods for multivariate outcomes such as multivariate conditional transformation models (MCTM) are restricted in flexibility and are often not expressive enough to represent complex multivariate probability distributions. In this paper, we combine MCTM with state-of-the-art and autoregressive NF to leverage the transparency of MCTM for modeling interpretable feature effects on the marginal distributions in the first step and the flexibility of neural-network-based NF techniques to account for complex and non-linear relationships in the joint data distribution. We demonstrate our method’s versatility in various numerical experiments and compare it with MCTM and other NF models on both simulated and real-world data.

MCML Authors
Link to Profile Thomas Nagler

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[1823]
R. L. Bach and C. Kern.
Fairness, Justice, and Social Inequality in Machine Learning.
Preprint (May. 2025). DOI
Abstract

As machine learning (ML) systems increasingly shape decision-making across crucial societal domains, the discourse around fairness in algorithmic systems (fairML) has intensified. Although fairML research is rapidly expanding, contributions from social science, particularly sociology, remain limited. This chapter aims to address this gap by examining fairness in ML through a sociological lens, focusing on the interplay between algorithmic decision-making and social inequality. We argue that fairML frameworks must explicitly distinguish technical fairness—focused on unbiased predictions—from normative justice, which addresses broader ethical and distributive considerations. We identify and discuss five key challenges confronting fairML today: (1) clearly separating fairness and justice, (2) developing more sophisticated measures of vulnerability and protected attributes, (3) incorporating historical disadvantage and social origin into fairness evaluations, (4) assessing unintended social consequences of algorithmic interventions, and (5) empirically investigating stakeholder preferences toward AI systems. By highlighting these sociologically informed challenges, this chapter advocates for a more holistic, context-sensitive approach to algorithmic fairness. Ultimately, our analysis proposes a sociologically grounded research agenda aimed at critically assessing and enhancing the role of fairML in either perpetuating or alleviating social inequalities.

MCML Authors
Link to Profile Christoph Kern

Christoph Kern

Prof. Dr.

Social Data Science and AI Lab


[1822]
J. Bi, D. Yan, Y. Wang, W. Huang, H. Chen, G. Wan, M. Ye, X. Xiao, H. Schütze, V. Tresp and Y. Ma.
CoT-Kinetics: A Theoretical Modeling Assessing LRM Reasoning Process.
Preprint (May. 2025). arXiv
Abstract

Recent Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models by learning to reason, exhibiting promising performance on complex tasks. LRMs solve tasks that require complex reasoning by explicitly generating reasoning trajectories together with answers. Nevertheless, judging the quality of such an output answer is not easy, because considering only the correctness of the answer is not enough: the soundness of the reasoning trajectory matters as well. Logically, if the soundness of the reasoning part is poor, then even if the answer is correct, the confidence in the derived answer should be low. Existing methods do consider the reasoning part when jointly assessing the overall output answer; however, their capability is still not satisfactory, as the causal relationship of the reasoning to the concluded answer cannot be properly reflected. In this paper, inspired by classical mechanics, we present a novel approach towards establishing a CoT-Kinetics energy equation. Specifically, our CoT-Kinetics energy equation formulates the token state transformation process, which is regulated by the LRM’s internal transformer layers, as particle dynamics governed by a mechanical field. Our CoT-Kinetics energy assigns a scalar score to evaluate specifically the soundness of the reasoning phase, indicating how confident we can be in the derived answer given the evaluated reasoning. As such, the LRM’s overall output quality can be accurately measured, rather than reduced to a coarse judgment (e.g., correct or incorrect).

MCML Authors
Link to website

Haokun Chen

Database Systems and Data Mining

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems and Data Mining

Link to website

Yunpu Ma

Dr.

Database Systems and Data Mining


[1821]
N. Broestl, B. Lange, C. Voinea, G. Keeling and R. Lam.
Evaluating Intra-firm LLM Alignment Strategies in Business Contexts.
Preprint (May. 2025). arXiv
Abstract

Instruction-tuned Large Language Models (LLMs) are increasingly deployed as AI Assistants in firms for support in cognitive tasks. These AI assistants carry embedded perspectives that influence decision-making, collaboration, and organizational culture across the firm. This paper argues that firms must intentionally align the perspectives of these AI Assistants with their objectives and values, framing alignment as a strategic and ethical imperative crucial for maintaining control over firm culture and intra-firm moral norms. The paper highlights how AI perspectives arise from biases in training data and the fine-tuning objectives of developers, and discusses their impact and ethical significance, foregrounding ethical concerns like automation bias and reduced critical thinking. Drawing on normative business ethics, particularly non-reductionist views of professional relationships, three distinct alignment strategies are proposed: supportive (reinforcing the firm’s mission), adversarial (stress-testing ideas), and diverse (broadening moral horizons by incorporating multiple stakeholder views). The ethical trade-offs of each strategy and their implications for manager-employee and employee-employee relationships are analyzed, alongside the potential to shape the culture and moral fabric of the firm.

MCML Authors
Link to Profile Benjamin Lange

Benjamin Lange

Dr.

Ethics of Artificial Intelligence


[1820]
B. Chen, Y. Liu, A. Korhonen and B. Plank.
Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation.
Preprint (May. 2025). arXiv
Abstract

The recent rise of reasoning-tuned Large Language Models (LLMs), which generate chains of thought (CoTs) before giving the final answer, has attracted significant attention and offers new opportunities for gaining insights into human label variation (HLV), which refers to plausible differences in how multiple annotators label the same data instance. Prior work has shown that LLM-generated explanations can help align model predictions with human label distributions, but typically adopts a reverse paradigm: producing explanations based on given answers. In contrast, CoTs provide a forward reasoning path that may implicitly embed rationales for each answer option before generating the answers. We thus propose a novel LLM-based pipeline enriched with linguistically grounded discourse segmenters to extract supporting and opposing statements for each answer option from CoTs with improved accuracy. We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores, which instead favor direct comparison of label distributions. Our method outperforms a direct generation method as well as baselines on three datasets, and shows better alignment of ranking methods with humans, highlighting the effectiveness of our approach.

MCML Authors
Link to website

Beiduo Chen

AI and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1819]
H. Chen, Y. Zhang, Y. Bi, Y. Zhang, T. Liu, J. Bi, J. Lan, J. Gu, C. Grosser, D. Krompass, N. Navab and V. Tresp.
Does Machine Unlearning Truly Remove Model Knowledge? A Framework for Auditing Unlearning in LLMs.
Preprint (May. 2025). arXiv
Abstract

In recent years, Large Language Models (LLMs) have achieved remarkable advancements, drawing significant attention from the research community. Their capabilities are largely attributed to large-scale architectures, which require extensive training on massive datasets. However, such datasets often contain sensitive or copyrighted content sourced from the public internet, raising concerns about data privacy and ownership. Regulatory frameworks, such as the General Data Protection Regulation (GDPR), grant individuals the right to request the removal of such sensitive information. This has motivated the development of machine unlearning algorithms that aim to remove specific knowledge from models without the need for costly retraining. Despite these advancements, evaluating the efficacy of unlearning algorithms remains a challenge due to the inherent complexity and generative nature of LLMs. In this work, we introduce a comprehensive auditing framework for unlearning evaluation, comprising three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods. By using various auditing algorithms, we evaluate the effectiveness and robustness of different unlearning strategies. To explore alternatives beyond prompt-based auditing, we propose a novel technique that leverages intermediate activation perturbations, addressing the limitations of auditing methods that rely solely on model inputs and outputs.

MCML Authors
Link to website

Haokun Chen

Database Systems and Data Mining

Link to website

Yuan Bi

Computer Aided Medical Procedures & Augmented Reality

Link to website

Yao Zhang

Database Systems and Data Mining

Link to website

Tong Liu

Database Systems and Data Mining

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1818]
N. De La Fuente, M. Pilligua, D. Vidal, A. Soutiff, C. Curreli, D. Cremers and A. Barsky.
Prototype Augmented Hypernetworks for Continual Learning.
Preprint (May. 2025). arXiv
Abstract

Continual learning (CL) aims to learn a sequence of tasks without forgetting prior knowledge, but gradient updates for a new task often overwrite the weights learned earlier, causing catastrophic forgetting (CF). We propose Prototype-Augmented Hypernetworks (PAH), a framework where a single hypernetwork, conditioned on learnable task prototypes, dynamically generates task-specific classifier heads on demand. To mitigate forgetting, PAH combines cross-entropy with dual distillation losses, one to align logits and another to align prototypes, ensuring stable feature representations across tasks. Evaluations on Split-CIFAR100 and TinyImageNet demonstrate that PAH achieves state-of-the-art performance, reaching 74.5% and 63.7% accuracy with only 1.7% and 4.4% forgetting, respectively, surpassing prior methods without storing samples or heads.
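The core mechanism, a single shared hypernetwork that maps a learnable task prototype to the weights of a task-specific classifier head, can be illustrated with a linear toy version. All names, shapes, and the linear generator below are illustrative assumptions, not the PAH implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hypernetwork: one shared weight matrix maps a task prototype to the
# flattened weights of that task's classifier head, generated on demand.
feat_dim, proto_dim, n_classes = 8, 4, 3
W_hyper = rng.normal(size=(proto_dim, feat_dim * n_classes)) * 0.1

def head_weights(prototype):
    # prototype -> flattened head weights -> (feat_dim, n_classes) matrix
    return (prototype @ W_hyper).reshape(feat_dim, n_classes)

def classify(features, prototype):
    return features @ head_weights(prototype)  # task-specific logits

proto_task_a = rng.normal(size=proto_dim)
proto_task_b = rng.normal(size=proto_dim)
x = rng.normal(size=feat_dim)
# the same shared parameters yield different heads for different prototypes
logits_a = classify(x, proto_task_a)
logits_b = classify(x, proto_task_b)
```

Since only the shared generator and the small prototypes are stored, no per-task heads or replay samples need to be kept, which is the storage advantage the abstract highlights.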

MCML Authors
Link to website

Cecilia Curreli

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1817]
D. Dementieva, N. Babakov and A. Fraser.
EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian.
Preprint (May. 2025). arXiv
Abstract

While Ukrainian NLP has seen progress in many text processing tasks, emotion classification remains an underexplored area with no publicly available benchmark to date. In this work, we introduce EmoBench-UA, the first annotated dataset for emotion detection in Ukrainian texts. Our annotation schema is adapted from the guidelines of previous English-centric work on emotion detection (Mohammad et al., 2018; Mohammad, 2022). The dataset was created through crowdsourcing on an online platform, ensuring high quality of the annotation process. We then evaluate a range of approaches on the collected dataset, ranging from linguistic baselines and synthetic data translated from English to large language models (LLMs). Our findings highlight the challenges of emotion classification in non-mainstream languages like Ukrainian and emphasize the need for further development of Ukrainian-specific models and training resources.

MCML Authors
Link to website

Daryna Dementieva

Dr.

Data Analytics & Statistics

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[1816]
F. Eichin, Y. Du, P. Mondorf, B. Plank and M. A. Hedderich.
Grokking ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior.
Preprint (May. 2025). arXiv GitHub
Abstract

Post-hoc interpretability methods typically attribute a model’s behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining existing methods or applying them at different training stages offers broader insights, these approaches usually lack theoretical support. In this work, we present ExPLAIND, a unified framework that integrates all three perspectives. First, we generalize recent work on gradient path kernels, which reformulate models trained by gradient descent as a kernel machine, to more realistic training settings. Empirically, we find that both a CNN and a Transformer model are replicated accurately by this reformulation. Second, we derive novel parameter- and step-wise influence scores from the kernel feature maps. We show their effectiveness in parameter pruning that is comparable to existing methods, reinforcing their value for model component attribution. Finally, jointly interpreting model components and data over the training process, we leverage ExPLAIND to analyze a Transformer that exhibits Grokking. Among other things, our findings support previously proposed stages of Grokking, while refining the final phase as one of alignment of input embeddings and final layers around a representation pipeline learned after the memorization phase. Overall, ExPLAIND provides a theoretically grounded, unified framework to interpret model behavior and training dynamics.

MCML Authors
Florian Eichin

AI and Computational Linguistics

Philipp Mondorf

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics

Michael Hedderich

Dr.

AI and Computational Linguistics


[1815]
V. Fojtik, M. Matveev, H.-H. Chou, G. Kutyniok and J. Maly.
Conflicting Biases at the Edge of Stability: Norm versus Sharpness Regularization.
Preprint (May. 2025). arXiv
Abstract

A widely believed explanation for the remarkable generalization capacities of overparameterized neural networks is that the optimization algorithms used for training induce an implicit bias towards benign solutions. To grasp this theoretically, recent works examine gradient descent and its variants in simplified training settings, often assuming vanishing learning rates. These studies reveal various forms of implicit regularization, such as ℓ1-norm minimizing parameters in regression and max-margin solutions in classification. Concurrently, empirical findings show that moderate to large learning rates exceeding standard stability thresholds lead to faster, albeit oscillatory, convergence in the so-called Edge-of-Stability regime, and induce an implicit bias towards minima of low sharpness (norm of training loss Hessian). In this work, we argue that a comprehensive understanding of the generalization performance of gradient descent requires analyzing the interaction between these various forms of implicit regularization. We empirically demonstrate that the learning rate balances between low parameter norm and low sharpness of the trained model. We furthermore prove for diagonal linear networks trained on a simple regression task that neither implicit bias alone minimizes the generalization error. These findings demonstrate that focusing on a single implicit bias is insufficient to explain good generalization, and they motivate a broader view of implicit regularization that captures the dynamic trade-off between norm and sharpness induced by non-negligible learning rates.

MCML Authors
Vit Fojtik

Mathematical Foundations of Artificial Intelligence

Maria Matveev

Mathematical Foundations of Artificial Intelligence

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence

Johannes Maly

Prof. Dr.

Mathematical Data Science and Artificial Intelligence


[1814]
D. Frauen, V. Melnychuk, J. Schweisthal, M. van der Schaar and S. Feuerriegel.
Treatment Effect Estimation for Optimal Decision-Making.
Preprint (May. 2025). arXiv
Abstract

Decision-making across various fields, such as medicine, heavily relies on conditional average treatment effects (CATEs). Practitioners commonly make decisions by checking whether the estimated CATE is positive, even though the decision-making performance of modern CATE estimators is poorly understood from a theoretical perspective. In this paper, we study optimal decision-making based on two-stage CATE estimators (e.g., DR-learner), which are considered state-of-the-art and widely used in practice. We prove that, while such estimators may be optimal for estimating CATE, they can be suboptimal when used for decision-making. Intuitively, this occurs because such estimators prioritize CATE accuracy in regions far away from the decision boundary, which is ultimately irrelevant to decision-making. As a remedy, we propose a novel two-stage learning objective that retargets the CATE to balance CATE estimation error and decision performance. We then propose a neural method that optimizes an adaptively-smoothed approximation of our learning objective. Finally, we confirm the effectiveness of our method both empirically and theoretically. In sum, our work is the first to show how two-stage CATE estimators can be adapted for optimal decision-making.
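For context, the second stage of a DR-learner regresses a doubly robust pseudo-outcome on covariates. A minimal sketch of that standard pseudo-outcome construction (variable names are illustrative; this is the classic DR-learner target, not the paper's retargeted objective):

```python
import numpy as np

def dr_pseudo_outcome(y, a, e_hat, mu0_hat, mu1_hat):
    """Doubly robust pseudo-outcome of the DR-learner: regressing this
    target on X in a second stage yields a CATE estimate.

    y: observed outcomes; a: binary treatments;
    e_hat: estimated propensity scores e(X);
    mu0_hat, mu1_hat: estimated outcome models mu_0(X), mu_1(X).
    """
    mu_a = np.where(a == 1, mu1_hat, mu0_hat)        # outcome model at the observed arm
    weight = (a - e_hat) / (e_hat * (1.0 - e_hat))   # inverse-propensity correction
    return weight * (y - mu_a) + mu1_hat - mu0_hat
```

The paper's observation is that fitting this target with squared error weights all regions of covariate space equally, while decision quality depends only on getting the sign right near the decision boundary.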

MCML Authors
Dennis Frauen

Artificial Intelligence in Management

Valentyn Melnychuk

Artificial Intelligence in Management

Jonas Schweisthal

Artificial Intelligence in Management

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1813]
D. Frauen, M. Schröder, K. Hess and S. Feuerriegel.
Orthogonal Survival Learners for Estimating Heterogeneous Treatment Effects from Time-to-Event Data.
Preprint (May. 2025). arXiv
Abstract

Estimating heterogeneous treatment effects (HTEs) is crucial for personalized decision-making. However, this task is challenging in survival analysis, which includes time-to-event data with censored outcomes (e.g., due to study dropout). In this paper, we propose a toolbox of novel orthogonal survival learners to estimate HTEs from time-to-event data under censoring. Our learners have three main advantages: (i) we show that learners from our toolbox are guaranteed to be orthogonal and thus come with favorable theoretical properties; (ii) our toolbox allows for incorporating a custom weighting function, which can lead to robustness against different types of low overlap, and (iii) our learners are model-agnostic (i.e., they can be combined with arbitrary machine learning models). We instantiate the learners from our toolbox using several weighting functions and, as a result, propose various neural orthogonal survival learners. Some of these coincide with existing survival learners (including survival versions of the DR- and R-learner), while others are novel and further robust w.r.t. low overlap regimes specific to the survival setting (i.e., survival overlap and censoring overlap). We then empirically verify the effectiveness of our learners for HTE estimation in different low-overlap regimes through numerical experiments. In sum, we provide practitioners with a large toolbox of learners that can be used for randomized and observational studies with censored time-to-event data.

MCML Authors
Dennis Frauen

Artificial Intelligence in Management

Maresa Schröder

Artificial Intelligence in Management

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1812]
S. Gerstner and H. Schütze.
Understanding Gated Neurons in Transformers from Their Input-Output Functionality.
Preprint (May. 2025). arXiv
Abstract

Interpretability researchers have attempted to understand MLP neurons of language models based on both the contexts in which they activate and their output weight vectors. They have paid little attention to a complementary aspect: the interactions between input and output. For example, when neurons detect a direction in the input, they might add much the same direction to the residual stream (‘enrichment neurons’) or reduce its presence (‘depletion neurons’). We address this aspect by examining the cosine similarity between input and output weights of a neuron. We apply our method to 12 models and find that enrichment neurons dominate in early-middle layers whereas later layers tend more towards depletion. To explain this finding, we argue that enrichment neurons are largely responsible for enriching concept representations, one of the first steps of factual recall. Our input-output perspective is a complement to activation-dependent analyses and to approaches that treat input and output separately.
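The core measurement is simple to reproduce: for each MLP neuron, compare its input weight vector with its output weight vector. A minimal NumPy sketch (the weight shapes and the `neuron_io_cosine` helper are illustrative assumptions, not the paper's code):

```python
import numpy as np

def neuron_io_cosine(W_in: np.ndarray, W_out: np.ndarray) -> np.ndarray:
    """Cosine similarity between each neuron's input weights (a row of
    W_in, shape [n_neurons, d_model]) and its output weights (a column
    of W_out, shape [d_model, n_neurons])."""
    v_in = W_in            # [n_neurons, d_model]
    v_out = W_out.T        # [n_neurons, d_model]
    num = (v_in * v_out).sum(axis=1)
    den = np.linalg.norm(v_in, axis=1) * np.linalg.norm(v_out, axis=1)
    return num / den
```

A neuron whose output direction matches its input direction scores near +1 (an 'enrichment neuron' in the paper's terminology); one that writes the opposite direction scores near -1 (a 'depletion neuron').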

MCML Authors
Sebastian Gerstner

Computational Linguistics

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1811]
F. Ghorbanpour, D. Dementieva and A. Fraser.
Can Prompting LLMs Unlock Hate Speech Detection across Languages? A Zero-shot and Few-shot Study.
Preprint (May. 2025). arXiv
Abstract

Despite growing interest in automated hate speech detection, most existing approaches overlook the linguistic diversity of online content. Multilingual instruction-tuned large language models such as LLaMA, Aya, Qwen, and BloomZ offer promising capabilities across languages, but their effectiveness in identifying hate speech through zero-shot and few-shot prompting remains underexplored. This work evaluates LLM prompting-based detection across eight non-English languages, utilizing several prompting techniques and comparing them to fine-tuned encoder models. We show that while zero-shot and few-shot prompting lag behind fine-tuned encoder models on most of the real-world evaluation sets, they achieve better generalization on functional tests for hate speech detection. Our study also reveals that prompt design plays a critical role, with each language often requiring customized prompting techniques to maximize performance.

MCML Authors
Faeze Ghorbanpour

Data Analytics & Statistics

Daryna Dementieva

Dr.

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[1810]
F. Ghorbanpour, D. Dementieva and A. Fraser.
Data-Efficient Hate Speech Detection via Cross-Lingual Nearest Neighbor Retrieval with Limited Labeled Data.
Preprint (May. 2025). arXiv
Abstract

Detecting hateful language is important, yet labeled hate speech data is expensive and time-consuming to collect, particularly for low-resource languages. Prior work has demonstrated the effectiveness of cross-lingual transfer learning and data augmentation in improving performance on tasks with limited labeled data. To develop an efficient and scalable cross-lingual transfer learning approach, we leverage nearest-neighbor retrieval to augment minimal labeled data in the target language, thereby enhancing detection performance. Specifically, we assume access to a small set of labeled training instances in the target language and use these to retrieve the most relevant labeled examples from a large multilingual hate speech detection pool. We evaluate our approach on eight languages and demonstrate that it consistently outperforms models trained solely on the target language data. Furthermore, in most cases, our method surpasses the current state-of-the-art. Notably, our approach is highly data-efficient, retrieving as few as 200 instances in some cases while maintaining superior performance. Moreover, it is scalable, as the retrieval pool can be easily expanded, and the method can be readily adapted to new languages and tasks. We also apply maximum marginal relevance to mitigate redundancy and filter out highly similar retrieved instances, resulting in improvements in some languages.
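The retrieval step can be sketched as follows, assuming sentence embeddings from some multilingual encoder are already computed (the function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def retrieve_augmentation(seed_emb, pool_emb, pool_labels, k=200):
    """For each labeled seed embedding in the target language, fetch its
    k nearest neighbors from a multilingual labeled pool (by cosine
    similarity) and return the union as augmentation data."""
    # Normalize rows so that a dot product equals cosine similarity.
    seed = seed_emb / np.linalg.norm(seed_emb, axis=1, keepdims=True)
    pool = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    sims = seed @ pool.T                              # [n_seed, n_pool]
    idx = np.unique(np.argsort(-sims, axis=1)[:, :k])  # union over seeds
    return idx, [pool_labels[i] for i in idx]
```

The retrieved instances would then be concatenated with the seed set to fine-tune the detector; a maximum-marginal-relevance pass over `sims` could deduplicate near-identical neighbors, as the abstract describes.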

MCML Authors
Faeze Ghorbanpour

Data Analytics & Statistics

Daryna Dementieva

Dr.

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[1809]
X. Guo, A. Li, Y. Wang, S. Jegelka and Y. Wang.
G1: Teaching LLMs to Reason on Graphs with Reinforcement Learning.
Preprint (May. 2025). arXiv GitHub
Abstract

Although Large Language Models (LLMs) have demonstrated remarkable progress, their proficiency in graph-related tasks remains notably limited, hindering the development of truly general-purpose models. Previous attempts, including pretraining graph foundation models or employing supervised fine-tuning, often face challenges such as the scarcity of large-scale, universally represented graph data. We introduce G1, a simple yet effective approach demonstrating that Reinforcement Learning (RL) on synthetic graph-theoretic tasks can significantly scale LLMs’ graph reasoning abilities. To enable RL training, we curate Erdõs, the largest graph reasoning dataset to date, comprising 50 diverse graph-theoretic tasks of varying difficulty levels with 100k training and 5k test instances, all derived from real-world graphs. With RL on Erdõs, G1 obtains substantial improvements in graph reasoning, where our finetuned 3B model even outperforms Qwen2.5-72B-Instruct (24x size). RL-trained models also show strong zero-shot generalization to unseen tasks, domains, and graph encoding schemes, including other graph-theoretic benchmarks as well as real-world node classification and link prediction tasks, without compromising general reasoning abilities. Our findings offer an efficient, scalable path for building strong graph reasoners by finetuning LLMs with RL on graph-theoretic tasks, which combines the strengths of pretrained LLM capabilities with abundant, automatically generated synthetic data, suggesting that LLMs possess graph understanding abilities that RL can elicit successfully.

MCML Authors

Stefanie Jegelka

Prof. Dr.

Foundations of Deep Neural Networks


[1808]
P. Henkel, J. Li and P. Rinke.
Design Rules for Optimizing Quaternary Mixed-Metal Chalcohalides.
Preprint (May. 2025). arXiv
Abstract

Quaternary mixed-metal M(II)2M(III)Ch2X3 chalcohalides are an emerging material class for photovoltaic absorbers that combines the beneficial optoelectronic properties of lead-based halide perovskites with the stability of metal chalcogenides. Inspired by the recent discovery of lead-free mixed-metal chalcohalides materials, we utilized a combination of density functional theory and machine learning to determine compositional trends and chemical design rules in the lead-free and lead-based materials spaces. We explored a total of 54 M(II)2M(III)Ch2X3 materials with M(II) = Sn, Pb, M(III) = In, Sb, Bi, Ch = S, Se, Te, and X = Cl, Br, I per phase (Cmcm, Cmc21 , and P21/c). The P21/c phase is the equilibrium phase at low temperatures, followed by Cmc21 and Cmcm. The fundamental band gaps in Cmcm and Cmc21 are smaller than those in P21/c, but direct band gaps are more common in Cmcm and Cmc21. The effective electron masses in P21/c are significantly larger compared to Cmcm and Cmc21, while the effective hole masses are nearly the same across all three phases. Using random forest regression, we found that the two electron acceptor sites (Ch and X) are crucial in shaping the properties of mixed-metal chalcohalide compounds. Furthermore, the electron donor sites (M(II) and M(III)) can be used to finetune the material properties to desired applications. These design rules enable precise tailoring of mixed-metal chalcohalide compounds for a variety of applications.

MCML Authors

Patrick Rinke

Prof. Dr.

AI-based Material Science


[1807]
P. Hofman, Y. Sale and E. Hüllermeier.
Uncertainty Quantification with Proper Scoring Rules: Adjusting Measures to Prediction Tasks.
Preprint (May. 2025). arXiv
Abstract

We address the problem of uncertainty quantification and propose measures of total, aleatoric, and epistemic uncertainty based on a known decomposition of (strictly) proper scoring rules, a specific type of loss function, into a divergence and an entropy component. This leads to a flexible framework for uncertainty quantification that can be instantiated with different losses (scoring rules), which makes it possible to tailor uncertainty quantification to the use case at hand. We show that this flexibility is indeed advantageous. In particular, we analyze the task of selective prediction and show that the scoring rule should ideally match the task loss. In addition, we perform experiments on two other common tasks. For out-of-distribution detection, our results confirm that a widely used measure of epistemic uncertainty, mutual information, performs best. Moreover, in the setting of active learning, our measure of epistemic uncertainty based on the zero-one-loss consistently outperforms other uncertainty measures.

MCML Authors
Paul Hofman

Artificial Intelligence and Machine Learning

Yusuf Sale

Artificial Intelligence and Machine Learning

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1806]
N. Holzner, S. Maier and S. Feuerriegel.
Generative AI and Creativity: A Systematic Literature Review and Meta-Analysis.
Preprint (May. 2025). arXiv
Abstract

Generative artificial intelligence (GenAI) is increasingly used to support a wide range of human tasks, yet empirical evidence on its effect on creativity remains scattered. Can GenAI generate ideas that are creative? To what extent can it support humans in generating ideas that are both creative and diverse? In this study, we conduct a meta-analysis to evaluate the effect of GenAI on the performance in creative tasks. For this, we first perform a systematic literature search, based on which we identify n = 28 relevant studies (m = 8214 participants) for inclusion in our meta-analysis. We then compute standardized effect sizes based on Hedges’ g. We compare different outcomes: (i) how creative GenAI is; (ii) how creative humans augmented by GenAI are; and (iii) the diversity of ideas by humans augmented by GenAI. Our results show no significant difference in creative performance between GenAI and humans (g = -0.05), while humans collaborating with GenAI significantly outperform those working without assistance (g = 0.27). However, GenAI has a significant negative effect on the diversity of ideas for such collaborations between humans and GenAI (g = -0.86). We further analyze heterogeneity across different GenAI models (e.g., GPT-3.5, GPT-4), different tasks (e.g., creative writing, ideation, divergent thinking), and different participant populations (e.g., laypeople, business, academia). Overall, our results position GenAI as an augmentative tool that can support, rather than replace, human creativity, particularly in tasks benefiting from ideation support.
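Hedges' g is the standardized mean difference (Cohen's d) with a small-sample bias correction. The computation behind effect sizes of this kind looks roughly as follows (a generic sketch of the textbook formula, not the authors' analysis code):

```python
import math

def hedges_g(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference between two groups with
    Hedges' small-sample correction applied."""
    df = n1 + n2 - 2
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    d = (m1 - m2) / s_pooled      # Cohen's d
    J = 1 - 3 / (4 * df - 1)      # correction factor, < 1
    return J * d
```

Because J < 1, g is always slightly smaller in magnitude than d, which matters most for the small-n studies typical of creativity experiments.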

MCML Authors
Sebastian Maier

Artificial Intelligence in Management

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1805]
P. Hong, B. Chen, S. Peng, M.-C. de Marneffe and B. Plank.
LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference.
Preprint (May. 2025). arXiv
Abstract

There is increasing evidence of Human Label Variation (HLV) in Natural Language Inference (NLI), where annotators assign different labels to the same premise-hypothesis pair. However, within-label variation–cases where annotators agree on the same label but provide divergent reasoning–poses an additional and mostly overlooked challenge. Several NLI datasets contain highlighted words in the NLI item as explanations, but the same spans on the NLI item can be highlighted for different reasons, as evidenced by free-text explanations, which offer a window into annotators’ reasoning. To systematically understand this problem and gain insight into the rationales behind NLI labels, we introduce LITEX, a linguistically-informed taxonomy for categorizing free-text explanations. Using this taxonomy, we annotate a subset of the e-SNLI dataset, validate the taxonomy’s reliability, and analyze how it aligns with NLI labels, highlights, and explanations. We further assess the taxonomy’s usefulness in explanation generation, demonstrating that conditioning generation on LITEX yields explanations that are linguistically closer to human explanations than those generated using only labels or highlights. Our approach thus not only captures within-label variation but also shows how taxonomy-guided generation for reasoning can bridge the gap between human and model explanations more effectively than existing strategies.

MCML Authors
Beiduo Chen

AI and Computational Linguistics

Siyao Peng

Dr.

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1804]
A. Javanmardi, S. H. Zargarbashi, S. M. A. R. Thies, W. Waegeman, A. Bojchevski and E. Hüllermeier.
Optimal Conformal Prediction under Epistemic Uncertainty.
Preprint (May. 2025). arXiv
Abstract

Conformal prediction (CP) is a popular frequentist framework for representing uncertainty by providing prediction sets that guarantee coverage of the true label with a user-adjustable probability. In most applications, CP operates on confidence scores coming from a standard (first-order) probabilistic predictor (e.g., softmax outputs). Second-order predictors, such as credal set predictors or Bayesian models, are also widely used for uncertainty quantification and are known for their ability to represent both aleatoric and epistemic uncertainty. Despite their popularity, there is still an open question of how they can be incorporated into CP. In this paper, we discuss the desiderata for CP when valid second-order predictions are available. We then introduce Bernoulli prediction sets (BPS), which produce the smallest prediction sets that ensure conditional coverage in this setting. When given first-order predictions, BPS reduces to the well-known adaptive prediction sets (APS). Furthermore, when the validity assumption on the second-order predictions is compromised, we apply conformal risk control to obtain a marginal coverage guarantee while still accounting for epistemic uncertainty.
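As a reference point for the first-order case: APS builds a prediction set by adding labels in order of decreasing probability until a calibrated mass threshold is reached. A simplified sketch (the standard method additionally randomizes the boundary label, omitted here; names are illustrative):

```python
import numpy as np

def aps_set(probs: np.ndarray, tau: float) -> list:
    """Greedy APS-style prediction set: include labels by decreasing
    probability until the cumulative mass reaches the calibrated
    threshold tau (obtained from a held-out calibration set)."""
    order = np.argsort(-probs)                 # labels, most probable first
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, tau)) + 1     # smallest prefix with mass >= tau
    return sorted(order[:k].tolist())
```

In practice tau is the appropriate empirical quantile of conformity scores on calibration data; with second-order inputs, the paper's BPS generalizes this construction and recovers APS when the prediction collapses to a single distribution.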

MCML Authors
Alireza Javanmardi

Artificial Intelligence and Machine Learning

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1803]
X. Jing, J. Wang, I. Tsangko, A. Triantafyllopoulos and B. W. Schuller.
MELT: Towards Automated Multimodal Emotion Data Annotation by Leveraging LLM Embedded Knowledge.
Preprint (May. 2025). arXiv
Abstract

Although speech emotion recognition (SER) has advanced significantly with deep learning, annotation remains a major hurdle. Human annotation is not only costly but also subject to inconsistencies: annotators often have different preferences and may lack the necessary contextual knowledge, which can lead to varied and inaccurate labels. Meanwhile, Large Language Models (LLMs) have emerged as a scalable alternative for annotating text data. However, the potential of LLMs to perform emotional speech data annotation without human supervision has yet to be thoroughly investigated. To address these problems, we apply GPT-4o to annotate a multimodal dataset collected from the sitcom Friends, using only textual cues as inputs. By crafting structured text prompts, our methodology capitalizes on the knowledge GPT-4o has accumulated during its training, showcasing that it can generate accurate and contextually relevant annotations without direct access to multimodal inputs. Therefore, we propose MELT, a multimodal emotion dataset fully annotated by GPT-4o. We demonstrate the effectiveness of MELT by fine-tuning four self-supervised learning (SSL) backbones and assessing speech emotion recognition performance across emotion datasets. Additionally, our subjective experiments demonstrate a consistent performance improvement on SER.

MCML Authors
Andreas Triantafyllopoulos

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics


[1802]
A. Karamolegkou, A. Borah, E. Cho, S. R. Choudhury, M. Galletti, R. Ghosh, P. Gupta, O. Ignat, P. Kargupta, N. Kotonya, H. Lamba, S.-J. Lee, A. Mangla, I. Mondal, D. Nazarova, P. Nemkova, D. Pisarevskaya, N. Rizwan, N. Sabri, D. Stammbach, A. Steinberg, D. Tomás, S. R. Wilson, B. Yi, J. H. Zhu, A. Zubiaga, A. Søgaard, A. Fraser, Z. Jin, R. Mihalcea, J. R. Tetreault and D. Dementieva.
NLP for Social Good: A Survey of Challenges, Opportunities, and Responsible Deployment.
Preprint (May. 2025). arXiv
Abstract

Recent advancements in large language models (LLMs) have unlocked unprecedented possibilities across a range of applications. However, as a community, we believe that the field of Natural Language Processing (NLP) has a growing need to approach deployment with greater intentionality and responsibility. In alignment with the broader vision of AI for Social Good (Tomašev et al., 2020), this paper examines the role of NLP in addressing pressing societal challenges. Through a cross-disciplinary analysis of social goals and emerging risks, we highlight promising research directions and outline challenges that must be addressed to ensure responsible and equitable progress in NLP4SG research.

MCML Authors
Anna Steinberg

Social Data Science and AI

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics

Daryna Dementieva

Dr.

Data Analytics & Statistics


[1801]
T. Karvonen, G. Santin and T. Wenzel.
General superconvergence for kernel-based approximation.
Preprint (May. 2025). arXiv
Abstract

Kernel interpolation is a fundamental technique for approximating functions from scattered data, with a well-understood convergence theory when interpolating elements of a reproducing kernel Hilbert space. Beyond this classical setting, research has focused on two regimes: misspecified interpolation, where the kernel smoothness exceeds that of the target function, and superconvergence, where the target is smoother than the Hilbert space. This work addresses the latter, where smoother target functions yield improved convergence rates, and extends existing results by characterizing superconvergence for projections in general Hilbert spaces. We show that functions lying in ranges of certain operators, including adjoint of embeddings, exhibit accelerated convergence, which we extend across interpolation scales between these ranges and the full Hilbert space. In particular, we analyze Mercer operators and embeddings into Lp spaces, linking the images of adjoint operators to Mercer power spaces. Applications to Sobolev spaces are discussed in detail, highlighting how superconvergence depends critically on boundary conditions. Our findings generalize and refine previous results, offering a broader framework for understanding and exploiting superconvergence. The results are supported by numerical experiments.

MCML Authors

Tizian Wenzel

Dr.

Mathematical Data Science and Artificial Intelligence


[1800]
J. Kim, S. Alaniz, C. Schmid and Z. Akata.
LoFT: LoRA-fused Training Dataset Generation with Few-shot Guidance.
Preprint (May. 2025). arXiv GitHub
Abstract

Despite recent advances in text-to-image generation, using synthetically generated data seldom brings a significant boost in performance for supervised learning. Oftentimes, synthetic datasets do not faithfully recreate the data distribution of real data, i.e., they lack the fidelity or diversity needed for effective downstream model training. While previous work has employed few-shot guidance to address this issue, existing methods still fail to capture and generate features unique to specific real images. In this paper, we introduce a novel dataset generation framework named LoFT, LoRA-Fused Training-data Generation with Few-shot Guidance. Our method fine-tunes LoRA weights on individual real images and fuses them at inference time, producing synthetic images that combine the features of real images for improved diversity and fidelity of generated data. We evaluate the synthetic data produced by LoFT on 10 datasets, using 8 to 64 real images per class as guidance and scaling up to 1000 images per class. Our experiments show that training on LoFT-generated data consistently outperforms other synthetic dataset methods, significantly increasing accuracy as the dataset size increases. Additionally, our analysis demonstrates that LoFT generates datasets with high fidelity and sufficient diversity, which contribute to the performance improvement.

MCML Authors
Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[1799]
C. Kühn and S.-V. Kuntz.
The Influence of the Memory Capacity of Neural DDEs on the Universal Approximation Property.
Preprint (May. 2025). arXiv
Abstract

Neural Ordinary Differential Equations (Neural ODEs), which are the continuous-time analog of Residual Neural Networks (ResNets), have gained significant attention in recent years. Similarly, Neural Delay Differential Equations (Neural DDEs) can be interpreted as an infinite depth limit of Densely Connected Residual Neural Networks (DenseResNets). In contrast to traditional ResNet architectures, DenseResNets are feed-forward networks that allow for shortcut connections across all layers. These additional connections introduce memory in the network architecture, as typical in many modern architectures. In this work, we explore how the memory capacity in neural DDEs influences the universal approximation property. The key parameter for studying the memory capacity is the product Kτ of the Lipschitz constant and the delay of the DDE. In the case of non-augmented architectures, where the network width is not larger than the input and output dimensions, neural ODEs and classical feed-forward neural networks cannot have the universal approximation property. We show that if the memory capacity Kτ is sufficiently small, the dynamics of the neural DDE can be approximated by a neural ODE. Consequently, non-augmented neural DDEs with a small memory capacity also lack the universal approximation property. In contrast, if the memory capacity Kτ is sufficiently large, we can establish the universal approximation property of neural DDEs for continuous functions. If the neural DDE architecture is augmented, we can expand the parameter regions in which universal approximation is possible. Overall, our results show that by increasing the memory capacity Kτ, the infinite-dimensional phase space of DDEs with positive delay τ>0 is not sufficient to guarantee a direct jump transition to universal approximation, but only after a certain memory threshold, universal approximation holds.

MCML Authors
Christian Kühn

Prof. Dr.

Multiscale and Stochastic Dynamics

Sara-Viola Kuntz

Multiscale and Stochastic Dynamics


[1798]
J. Lan, Y. Fu, U. Schlegel, G. Zhang, T. Hannan, H. Chen and T. Seidl.
My Answer Is NOT 'Fair': Mitigating Social Bias in Vision-Language Models via Fair and Biased Residuals.
Preprint (May. 2025). arXiv
Abstract

Social bias is a critical issue in large vision-language models (VLMs), where fairness- and ethics-related problems harm certain groups of people in society. It is unknown to what extent VLMs yield social bias in generative responses. In this study, we focus on evaluating and mitigating social bias on both the model’s response and probability distribution. To do so, we first evaluate four state-of-the-art VLMs on the PAIRS and SocialCounterfactuals datasets with the multiple-choice selection task. Surprisingly, we find that models suffer from generating gender-biased or race-biased responses. We also observe that models are prone to stating their responses are fair, while in fact exhibiting miscalibrated confidence levels towards particular social groups. In investigating why VLMs are unfair, we observe that VLMs’ hidden layers exhibit substantial fluctuations in fairness levels. Meanwhile, residuals in each layer show mixed effects on fairness, with some contributing positively while others increase bias. Based on these findings, we propose a post-hoc method for the inference stage to mitigate social bias, which is training-free and model-agnostic. We achieve this by ablating bias-associated residuals while amplifying fairness-associated residuals on model hidden layers during inference. We demonstrate that our post-hoc method outperforms the competing training strategies, helping VLMs have fairer responses and more reliable confidence levels.

MCML Authors
Udo Schlegel

Database Systems and Data Mining

Gengyuan Zhang

Database Systems and Data Mining

Tanveer Hannan

Database Systems and Data Mining

Haokun Chen

Database Systems and Data Mining

Thomas Seidl

Prof. Dr.

Database Systems and Data Mining


[1797]
Y. Li, S. Shao, M. Milling and B. W. Schuller.
Large Language Models for Depression Recognition in Spoken Language Integrating Psychological Knowledge.
Preprint (May 2025). arXiv GitHub
Abstract

Depression is a growing concern gaining attention in both public discourse and AI research. While deep neural networks (DNNs) have been used for recognition, they still lack real-world effectiveness. Large language models (LLMs) show strong potential but require domain-specific fine-tuning and struggle with non-textual cues. Since depression is often expressed through vocal tone and behaviour rather than explicit text, relying on language alone is insufficient. Diagnostic accuracy also suffers without incorporating psychological expertise. To address these limitations, we present, to the best of our knowledge, the first application of LLMs to multimodal depression detection using the DAIC-WOZ dataset. We extract audio features using the pre-trained model Wav2Vec and map them to text-based LLMs for further processing. We also propose a novel strategy for incorporating psychological knowledge into LLMs to enhance diagnostic performance, specifically using a question and answer set to grant authorised knowledge to LLMs. Our approach yields a notable improvement in both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) compared to the baseline reported in the original paper.

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[1796]
Y. Liu, M. Wang, A. H. Kargaran, F. Körner, E. Nie, B. Plank, F. Yvon and H. Schütze.
Tracing Multilingual Factual Knowledge Acquisition in Pretraining.
Preprint (May 2025). arXiv GitHub
Abstract

Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consistency throughout pretraining largely unexplored. In this work, we trace how factual recall and crosslingual consistency evolve during pretraining, focusing on OLMo-7B as a case study. We find that both accuracy and consistency improve over time for most languages. We show that this improvement is primarily driven by the fact frequency in the pretraining corpus: more frequent facts are more likely to be recalled correctly, regardless of language. Yet, some low-frequency facts in non-English languages can still be correctly recalled. Our analysis reveals that these instances largely benefit from crosslingual transfer of their English counterparts – an effect that emerges predominantly in the early stages of pretraining. We pinpoint two distinct pathways through which multilingual factual knowledge acquisition occurs: (1) frequency-driven learning, which is dominant and language-agnostic, and (2) crosslingual transfer, which is limited in scale and typically constrained to relation types involving named entities.

MCML Authors
Link to website

Mingyang Wang

Computational Linguistics

Link to website

Amir Hossein Kargaran

Computational Linguistics

Link to website

Felicia Körner

AI and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1795]
Y. Liu, X. Xu, E. Nie, Z. Wang, S. Feng, D. Wang, Q. Li and H. Schütze.
Look Within or Look Beyond? A Theoretical Comparison Between Parameter-Efficient and Full Fine-Tuning.
Preprint (May 2025). arXiv GitHub
Abstract

Parameter-Efficient Fine-Tuning (PEFT) methods achieve performance comparable to Full Fine-Tuning (FFT) while requiring significantly fewer computing resources, making them the go-to choice for researchers. We find that although PEFT can achieve competitive results on some benchmarks, its performance falls short of FFT in complex tasks, such as reasoning and instruction-based fine-tuning. In this paper, we compare the characteristics of PEFT and FFT in terms of representational capacity and robustness based on optimization theory. We theoretically demonstrate that PEFT is a strict subset of FFT. By providing theoretical upper bounds for PEFT, we show that the limited parameter space constrains the model’s representational ability, making it more susceptible to perturbations. Experiments on 15 datasets encompassing classification, generation, reasoning, and instruction fine-tuning tasks and 11 adversarial test sets validate our theories. We hope that these results spark further research beyond the realms of well-established PEFT.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1794]
T. Löhr, P. Hofman, F. Mohr and E. Hüllermeier.
Credal Prediction based on Relative Likelihood.
Preprint (May 2025). arXiv
Abstract

Predictions in the form of sets of probability distributions, so-called credal sets, provide a suitable means to represent a learner’s epistemic uncertainty. In this paper, we propose a theoretically grounded approach to credal prediction based on the statistical notion of relative likelihood: The target of prediction is the set of all (conditional) probability distributions produced by the collection of plausible models, namely those models whose relative likelihood exceeds a specified threshold. This threshold has an intuitive interpretation and allows for controlling the trade-off between correctness and precision of credal predictions. We tackle the problem of approximating credal sets defined in this way by means of suitably modified ensemble learning techniques. To validate our approach, we illustrate its effectiveness by experiments on benchmark datasets demonstrating superior uncertainty representation without compromising predictive performance. We also compare our method against several state-of-the-art baselines in credal prediction.
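The relative-likelihood construction can be illustrated on the simplest possible case: a single Bernoulli parameter instead of the paper's ensemble of learners. The function name, the candidate grid, and the threshold value below are illustrative, not taken from the paper; the credal set is the interval of predictions made by all candidates whose relative likelihood exceeds the threshold.

```python
import numpy as np

def credal_interval(data, candidate_ps, alpha=0.15):
    """Credal prediction for a Bernoulli parameter (toy sketch): keep all
    candidate models whose relative likelihood exceeds alpha, and return
    the interval spanned by their predictions."""
    data = np.asarray(data)
    k, n = data.sum(), len(data)
    # log-likelihood of each candidate success probability p
    ll = k * np.log(candidate_ps) + (n - k) * np.log(1 - candidate_ps)
    rel = np.exp(ll - ll.max())          # relative likelihood in (0, 1]
    plausible = candidate_ps[rel >= alpha]
    return plausible.min(), plausible.max()
```

Raising `alpha` shrinks the set of plausible models (more precise, less cautious predictions); lowering it widens the credal set, which matches the correctness/precision trade-off described in the abstract.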

MCML Authors
Link to website

Paul Hofman

Artificial Intelligence and Machine Learning

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1793]
S. Maskey, R. Paolino, F. Jogl, G. Kutyniok and J. Lutzeyer.
Graph Representational Learning: When Does More Expressivity Hurt Generalization?
Preprint (May 2025). arXiv
Abstract

Graph Neural Networks (GNNs) are powerful tools for learning on structured data, yet the relationship between their expressivity and predictive performance remains unclear. We introduce a family of premetrics that capture different degrees of structural similarity between graphs and relate these similarities to generalization, and consequently, the performance of expressive GNNs. By considering a setting where graph labels are correlated with structural features, we derive generalization bounds that depend on the distance between training and test graphs, model complexity, and training set size. These bounds reveal that more expressive GNNs may generalize worse unless their increased complexity is balanced by a sufficiently large training set or reduced distance between training and test graphs. Our findings relate expressivity and generalization, offering theoretical insights supported by empirical results.

MCML Authors
Link to website

Sohir Maskey

Mathematical Foundations of Artificial Intelligence

Link to website

Raffaele Paolino

Mathematical Foundations of Artificial Intelligence

Link to Profile Gitta Kutyniok

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence


[1792]
E. Nie, H. Schmid and H. Schütze.
Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models.
Preprint (May 2025). arXiv
Abstract

Language confusion – where large language models (LLMs) generate text in unintended languages, contrary to the user’s intent – remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs) – specific positions where language switches occur – are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified via comparative analysis with multilingual-tuned models, substantially mitigates confusion without harming general competence or fluency. Our approach matches multilingual alignment in confusion reduction for most languages and yields cleaner, higher-quality outputs. These findings provide new insights into the internal dynamics of LLMs and highlight neuron-level interventions as a promising direction for robust, interpretable multilingual language modeling.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1791]
E. Özsoy, A. Mamur, F. Tristram, C. Pellegrini, M. Wysocki, B. Busam and N. Navab.
EgoExOR: An Ego-Exo-Centric Operating Room Dataset for Surgical Activity Understanding.
Preprint (May 2025). arXiv
Abstract

Operating rooms (ORs) demand precise coordination among surgeons, nurses, and equipment in a fast-paced, occlusion-heavy environment, necessitating advanced perception models to enhance safety and efficiency. Existing datasets either provide partial egocentric views or sparse exocentric multi-view context, but do not explore the comprehensive combination of both. We introduce EgoExOR, the first OR dataset and accompanying benchmark to fuse first-person and third-person perspectives. Spanning 94 minutes (84,553 frames at 15 FPS) of two emulated spine procedures, Ultrasound-Guided Needle Insertion and Minimally Invasive Spine Surgery, EgoExOR integrates egocentric data (RGB, gaze, hand tracking, audio) from wearable glasses, exocentric RGB and depth from RGB-D cameras, and ultrasound imagery. Its detailed scene graph annotations, covering 36 entities and 22 relations (568,235 triplets), enable robust modeling of clinical interactions, supporting tasks like action recognition and human-centric perception. We evaluate the surgical scene graph generation performance of two adapted state-of-the-art models and offer a new baseline that explicitly leverages EgoExOR’s multimodal and multi-perspective signals. This new dataset and benchmark set a new foundation for OR perception, offering a rich, multimodal resource for next-generation clinical perception.

MCML Authors
Link to website

Ege Özsoy

Computer Aided Medical Procedures & Augmented Reality

Link to website

Felix Tristram

Computer Aided Medical Procedures & Augmented Reality

Link to website

Chantal Pellegrini

Computer Aided Medical Procedures & Augmented Reality

Link to website

Magdalena Wysocki

Computer Aided Medical Procedures & Augmented Reality

Link to website

Benjamin Busam

Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1790]
E. Özsoy, C. Pellegrini, D. Bani-Harouni, K. Yuan, M. Keicher and N. Navab.
ORQA: A Benchmark and Foundation Model for Holistic Operating Room Modeling.
Preprint (May 2025). arXiv
Abstract

The real-world complexity of surgeries necessitates surgeons to have deep and holistic comprehension to ensure precision, safety, and effective interventions. Computational systems are required to have a similar level of comprehension within the operating room. Prior works, limited to single-task efforts like phase recognition or scene graph generation, lack scope and generalizability. In this work, we introduce ORQA, a novel OR question answering benchmark and foundational multimodal model to advance OR intelligence. By unifying all four public OR datasets into a comprehensive benchmark, we enable our approach to concurrently address a diverse range of OR challenges. The proposed multimodal large language model fuses diverse OR signals such as visual, auditory, and structured data, for a holistic modeling of the OR. Finally, we propose a novel, progressive knowledge distillation paradigm, to generate a family of models optimized for different speed and memory requirements. We show the strong performance of ORQA on our proposed benchmark, and its zero-shot generalization, paving the way for scalable, unified OR modeling and significantly advancing multimodal surgical intelligence. We will release our code and data upon acceptance.

MCML Authors
Link to website

Ege Özsoy

Computer Aided Medical Procedures & Augmented Reality

Link to website

Chantal Pellegrini

Computer Aided Medical Procedures & Augmented Reality

Link to website

David Bani-Harouni

Computer Aided Medical Procedures & Augmented Reality

Link to website

Kun Yuan

Computer Aided Medical Procedures & Augmented Reality

Link to website

Matthias Keicher

Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1789]
P. Scholl, A. Dietrich, S. Wolf, J. Lee, A.-A. Schäffer, G. Kutyniok and M. Iskandar.
Interpretable Robotic Friction Learning via Symbolic Regression.
Preprint (May 2025). arXiv
Abstract

Accurately modeling the friction torque in robotic joints has long been challenging due to the need for a robust mathematical description. Traditional model-based approaches are often labor-intensive, requiring extensive experiments and expert knowledge, and they are difficult to adapt to new scenarios and dependencies. On the other hand, data-driven methods based on neural networks are easier to implement but often lack robustness, interpretability, and trustworthiness – key considerations for robotic hardware and safety-critical applications such as human-robot interaction. To address the limitations of both approaches, we propose the use of symbolic regression (SR) to estimate the friction torque. SR generates interpretable symbolic formulas similar to those produced by model-based methods while being flexible enough to accommodate various dynamic effects and dependencies. In this work, we apply SR algorithms to approximate the friction torque using collected data from a KUKA LWR-IV+ robot. Our results show that SR not only yields formulas with comparable complexity to model-based approaches but also achieves higher accuracy. Moreover, SR-derived formulas can be seamlessly extended to include load dependencies and other dynamic factors.

MCML Authors
Link to website

Philipp Scholl

Mathematical Foundations of Artificial Intelligence

Link to Profile Gitta Kutyniok

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence


[1788]
M. Schröder, J. Hartenstein and S. Feuerriegel.
PrivATE: Differentially Private Confidence Intervals for Average Treatment Effects.
Preprint (May 2025). arXiv
Abstract

The average treatment effect (ATE) is widely used to evaluate the effectiveness of drugs and other medical interventions. In safety-critical applications like medicine, reliable inferences about the ATE typically require valid uncertainty quantification, such as through confidence intervals (CIs). However, estimating treatment effects in these settings often involves sensitive data that must be kept private. In this work, we present PrivATE, a novel machine learning framework for computing CIs for the ATE under differential privacy. Specifically, we focus on deriving valid privacy-preserving CIs for the ATE from observational data. Our PrivATE framework consists of three steps: (i) estimating a differentially private ATE through output perturbation; (ii) estimating the differentially private variance through a truncated output perturbation mechanism; and (iii) constructing the CIs while accounting for the uncertainty from both the estimation and privatization steps. Our PrivATE framework is model agnostic, doubly robust, and ensures valid CIs. We demonstrate the effectiveness of our framework using synthetic and real-world medical datasets. To the best of our knowledge, we are the first to derive a general, doubly robust framework for valid CIs of the ATE under (ε, δ)-differential privacy.
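Step (i) of the framework, output perturbation of the ATE, can be sketched in a deliberately simplified form. The sketch below uses a plain difference-of-means estimator with Laplace noise calibrated to its sensitivity for bounded outcomes, and a crude widening of the CI for the privatization noise; the paper's actual framework is doubly robust and privatizes the variance via a truncated mechanism, which this toy does not implement.

```python
import numpy as np

def private_ate_ci(y1, y0, epsilon, bound=1.0, z=1.96, rng=None):
    """Crude DP confidence interval for the ATE via output perturbation
    (illustrative sketch, assuming outcomes bounded in [-bound, bound])."""
    if rng is None:
        rng = np.random.default_rng(0)
    n1, n0 = len(y1), len(y0)
    ate = np.mean(y1) - np.mean(y0)
    # sensitivity of the difference of means for bounded outcomes
    sens = 2 * bound / n1 + 2 * bound / n0
    scale = sens / epsilon
    noisy_ate = ate + rng.laplace(0.0, scale)
    se = np.sqrt(np.var(y1, ddof=1) / n1 + np.var(y0, ddof=1) / n0)
    # widen the CI for both sampling and privatization uncertainty
    half = z * se + np.log(1 / 0.05) * scale   # Laplace tail bound at 95%
    return noisy_ate - half, noisy_ate + half
```

Smaller `epsilon` (stronger privacy) inflates `scale` and hence the interval width, mirroring the privacy/utility trade-off the abstract refers to.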

MCML Authors
Link to website

Maresa Schröder

Artificial Intelligence in Management

Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1787]
J. Schroeder, S. Howard, C. Eberle, J. Esslinger, N. Leopold-Kerschbaumer, K. V. Kepesidis and A. Döpp.
Information-optimal measurement: From fixed sampling protocols to adaptive spectroscopy.
Preprint (May 2025). arXiv
Abstract

All measurements of continuous signals rely on taking discrete snapshots, with the Nyquist-Shannon theorem dictating sampling paradigms. We present a broader framework of information-optimal measurement, showing that traditional sampling is optimal only when we are entirely ignorant about the system under investigation. This insight unlocks methods that efficiently leverage prior information to overcome long-held fundamental sampling limitations. We demonstrate this for optical spectroscopy - vital to research and medicine - and show how adaptively selected measurements yield higher information in medical blood analysis, optical metrology, and hyperspectral imaging. Through our rigorous statistical framework, performance never falls below conventional sampling while providing complete uncertainty quantification in real time. This establishes a new paradigm where measurement devices operate as information-optimal agents, fundamentally changing how scientific instruments collect and process data.

MCML Authors
Link to website

Sunny Howard

Data-driven methods in Physics and Optics

Link to website

Christoph Eberle

Data-driven methods in Physics and Optics

Link to Profile Andreas Döpp

Andreas Döpp

Dr. habil.

Data-driven methods in Physics and Optics


[1786]
I. Sen, B. Ma, G. Ahnert, A.-C. Haensch, T. Holtdirk, F. Kreuter and M. Strohmaier.
Connecting Natural Language Processing and Survey Methodology: Potentials, Challenges, and Open Questions.
Preprint (May 2025). DOI
Abstract

Recent generative AI technologies, particularly Large Language Models (LLMs), have increased interest in Natural Language Processing (NLP) methods for scientists and practitioners across disciplines. In this position paper, we highlight one such discipline — survey methodology, which not only uses more and more NLP techniques, e.g., using LLMs to simulate survey respondents, but also stands to benefit NLP, e.g., informing the design of NLP annotation and evaluation tasks. We argue for increasing synergies between NLP and Survey Methodology to realize the potential at their intersection. We also outline challenges that impede progress on these potential synergies and present 10 open questions to encourage further reflection.

MCML Authors
Link to website

Anna-Carolina Haensch

Dr.

Social Data Science and AI

Tobias Holtdirk

Social Data Science and AI

Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI


[1785]
Y. Shen, W. Lai, S. Wang, K. Luo, A. Fraser and M. Sun.
From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora.
Preprint (May 2025). arXiv
Abstract

Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multi-way parallel data consistently outperform those trained on unaligned multilingual data.

MCML Authors
Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[1784]
R. S.-E. Shim, D. De Cristofaro, C. M. Hu, A. Vietti and B. Plank.
Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically.
Preprint (May 2025). arXiv
Abstract

Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. Such an alignment has also been observed in speech foundation models. However, it remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech. Building on prior work on spoken translation retrieval, we perform pronunciation-controlled experiments to observe if cross-lingual alignment can indeed occur in such models on a semantic basis, instead of relying on phonetic similarities. Our findings indicate that even in the absence of phonetic cues, spoken translation retrieval accuracy remains relatively stable. We follow up with a controlled experiment on a word-level dataset of cross-lingual synonyms and near-homophones, confirming the existence of both phonetic and semantic knowledge in the encoder. Finally, we qualitatively examine the transcriptions produced by early exiting the encoder, where we observe that speech translation produces semantic errors that are characterized by phonetic similarities to corresponding words in the source language. We apply this insight from early exiting to speech recognition in seven low-resource languages unsupported by the Whisper model, and achieve improved accuracy in all languages examined, particularly for languages with transparent orthographies.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1783]
R. Sonabend, J. Zobolas, R. Bin, P. Kopper, L. Burk and A. Bender.
Examining marginal properness in the external validation of survival models with squared and logarithmic losses.
Preprint (May 2025). arXiv
Abstract

Scoring rules promote rational and honest decision-making, which is important for model evaluation and becoming increasingly important for automated procedures such as ‘AutoML’. In this paper we survey common squared and logarithmic scoring rules for survival analysis, with a focus on their theoretical and empirical properness. We introduce a marginal definition of properness and show that both the Integrated Survival Brier Score (ISBS) and the Right-Censored Log-Likelihood (RCLL) are theoretically improper under this definition. We also investigate a new class of losses that may inform future survival scoring rules. Simulation experiments reveal that both the ISBS and RCLL behave as proper scoring rules in practice. The RCLL showed no violations across all settings, while ISBS exhibited only minor, negligible violations at extremely small sample sizes, suggesting one can trust results from historical experiments. As such we advocate for both the RCLL and ISBS in external validation of models, including in automated procedures. However, we note practical challenges in estimating these losses including estimation of censoring distributions and densities; as such further research is required to advance development of robust and honest evaluation in survival analysis.
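For readers unfamiliar with the losses being surveyed, the core of a survival Brier score can be sketched in a deliberately simplified form that ignores censoring. The function name is illustrative; the ISBS discussed in the abstract additionally integrates over evaluation times and reweights observations by an estimate of the censoring distribution, which is exactly where the estimation challenges mentioned above arise.

```python
import numpy as np

def brier_survival(surv_probs, event_times, t):
    """Brier score for predicted survival probabilities at horizon t,
    ignoring censoring (toy sketch of the building block behind the ISBS)."""
    # indicator of still being event-free at time t
    alive = (event_times > t).astype(float)
    return float(np.mean((surv_probs - alive) ** 2))
```

A well-calibrated model assigns high survival probability to subjects who survive past `t` and low probability to those who do not, driving the score toward zero.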

MCML Authors
Link to website

Lukas Burk

Statistical Learning and Data Science

Link to website

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[1782]
J. Wang, P. Gupta, I. Habernal and E. Hüllermeier.
Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs.
Preprint (May 2025). arXiv
Abstract

Recent studies demonstrate that Large Language Models (LLMs) are vulnerable to different prompt-based attacks, generating harmful content or sensitive information. Both closed-source and open-source LLMs are underinvestigated for these attacks. This paper studies effective prompt injection attacks against the 14 most popular open-source LLMs on five attack benchmarks. Current metrics only consider successful attacks, whereas our proposed Attack Success Probability (ASP) also captures uncertainty in the model’s response, reflecting ambiguity in attack feasibility. By comprehensively analyzing the effectiveness of prompt injection attacks, we propose a simple and effective hypnotism attack; results show that this attack causes aligned language models, including Stablelm2, Mistral, Openchat, and Vicuna, to generate objectionable behaviors, achieving around 90% ASP. They also indicate that our ignore prefix attacks can break all 14 open-source LLMs, achieving over 60% ASP on a multi-categorical dataset. We find that moderately well-known LLMs exhibit higher vulnerability to prompt injection attacks, highlighting the need to raise public awareness and prioritize efficient mitigation strategies.

MCML Authors
Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1781]
M. Wang, L. Lange, H. Adel, Y. Ma, J. Strötgen and H. Schütze.
Language Mixing in Reasoning Language Models: Patterns, Impact, and Internal Causes.
Preprint (May 2025). arXiv
Abstract

Reasoning language models (RLMs) excel at complex tasks by leveraging a chain-of-thought process to generate structured intermediate steps. However, language mixing, i.e., reasoning steps containing tokens from languages other than the prompt, has been observed in their outputs and shown to affect performance, though its impact remains debated. We present the first systematic study of language mixing in RLMs, examining its patterns, impact, and internal causes across 15 languages, 7 task difficulty levels, and 18 subject areas, and show how all three factors influence language mixing. Moreover, we demonstrate that the choice of reasoning language significantly affects performance: forcing models to reason in Latin or Han scripts via constrained decoding notably improves accuracy. Finally, we show that the script composition of reasoning traces closely aligns with that of the model’s internal representations, indicating that language mixing reflects latent processing preferences in RLMs. Our findings provide actionable insights for optimizing multilingual reasoning and open new directions for controlling reasoning languages to build more interpretable and adaptable RLMs.
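The constrained-decoding intervention mentioned in the abstract, forcing reasoning into a particular script, can be sketched as logit masking. The function, the vocabulary representation, and the Unicode cutoff below are illustrative assumptions, not the paper's implementation; U+024F is simply the end of the Latin Extended-B block.

```python
import math

def mask_non_latin(logits, vocab):
    """Constrained-decoding sketch: disallow any token containing a
    character outside the basic Latin ranges, so sampling can only
    produce Latin-script continuations."""
    masked = list(logits)
    for i, tok in enumerate(vocab):
        # treat anything beyond Latin Extended-B (U+024F) as non-Latin
        if any(ord(c) > 0x024F for c in tok):
            masked[i] = -math.inf
    return masked
```

Applying such a mask at every decoding step would suppress the mid-reasoning script switches the paper identifies, at the cost of excluding legitimately multilingual outputs.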

MCML Authors
Link to website

Mingyang Wang

Computational Linguistics

Link to website

Yunpu Ma

Dr.

Database Systems and Data Mining

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1780]
Q. Wang, M. Wang, N. Feldhus, S. Ostermann, Y. Cao, H. Schütze, S. Möller and V. Schmitt.
Through a Compressed Lens: Investigating the Impact of Quantization on LLM Explainability and Interpretability.
Preprint (May 2025). arXiv
Abstract

Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). While prior research has extensively investigated the degradation of various LLM capabilities due to quantization, its effects on model explainability and interpretability, which are crucial for understanding decision-making processes, remain unexplored. To address this gap, we conduct comprehensive experiments using three common quantization techniques at distinct bit widths, in conjunction with two explainability methods, counterfactual examples and natural language explanations, as well as two interpretability approaches, knowledge memorization analysis and latent multi-hop reasoning analysis. We complement our analysis with a thorough user study, evaluating selected explainability methods. Our findings reveal that, depending on the configuration, quantization can significantly impact model explainability and interpretability. Notably, the direction of this effect is not consistent, as it strongly depends on (1) the quantization method, (2) the explainability or interpretability approach, and (3) the evaluation protocol. In some settings, human evaluation shows that quantization degrades explainability, while in others, it even leads to improvements. Our work serves as a cautionary tale, demonstrating that quantization can unpredictably affect model transparency. This insight has important implications for deploying LLMs in applications where transparency is a critical requirement.
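As background for the kind of compression whose side effects are studied here, a symmetric uniform quantization round-trip can be written in a few lines. This is a generic sketch, not one of the three specific techniques the paper evaluates.

```python
import numpy as np

def quantize_dequantize(w, bits=8):
    """Symmetric uniform quantization round-trip: map weights to
    signed integers at the given bit width, then back to floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

The reconstruction error grows as the bit width shrinks, which is the lever the paper varies when probing explainability and interpretability at distinct bit widths.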

MCML Authors
Link to website

Mingyang Wang

Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1779]
X. Wang, M. Wang, Y. Liu, H. Schütze and B. Plank.
Refusal Direction is Universal Across Safety-Aligned Languages.
Preprint (May 2025). arXiv
Abstract

Refusal mechanisms in large language models (LLMs) are essential for ensuring safety. Recent research has revealed that refusal behavior can be mediated by a single direction in activation space, enabling targeted interventions to bypass refusals. While this is primarily demonstrated in an English-centric context, appropriate refusal behavior is important for any language, but poorly understood. In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness, without any additional fine-tuning. Even more remarkably, refusal directions derived from any safety-aligned language transfer seamlessly to others. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks. These findings provide actionable insights for building more robust multilingual safety defenses and pave the way for a deeper mechanistic understanding of cross-lingual vulnerabilities in LLMs.
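The single-direction view of refusal referenced above is commonly operationalized as a difference-in-means vector between activations on harmful and harmless prompts, which can then be projected out. The sketch below illustrates that geometry on toy arrays; function names and the projection-based ablation are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference-in-means direction between activation sets."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(acts, direction):
    """Remove the component along the given direction from activations."""
    return acts - np.outer(acts @ direction, direction)
```

The cross-lingual finding then amounts to: a `direction` extracted from English activations still zeroes out the refusal component of activations produced by prompts in other languages.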

MCML Authors
Link to website

Xinpeng Wang

AI and Computational Linguistics

Link to website

Mingyang Wang

Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1778]
Z. Wang, X. Xu, Y. Liu, Y. Zhang, P. Lin, S. Feng, X. Yang, D. Wang and H. Schütze.
Why Do More Experts Fail? A Theoretical Analysis of Model Merging.
Preprint (May 2025). arXiv GitHub
Abstract

Model merging dramatically reduces storage and computational resources by combining multiple expert models into a single multi-task model. Although recent model merging methods have shown promising results, they struggle to maintain performance gains as the number of merged models increases. In this paper, we investigate the key obstacles that limit the scalability of model merging when integrating a large number of expert models. First, we prove that there is an upper bound on model merging. Further theoretical analysis reveals that the limited effective parameter space imposes a strict constraint on the number of models that can be successfully merged. An analysis based on Gaussian width shows that the marginal benefit of merging additional models diminishes according to a strictly concave function. This implies that the effective parameter space becomes rapidly saturated as the number of merged models increases. Furthermore, using Approximate Kinematics Theory, we prove the existence of a unique optimal threshold beyond which adding more models does not yield significant performance improvements. At the same time, we introduce a straightforward Reparameterized Heavy-Tailed method (RHT) to extend the coverage of the merged model, thereby enhancing its performance. Empirical results on 12 benchmarks, including both knowledge-intensive and general-purpose tasks, validate our theoretical analysis. We believe that these results spark further research beyond the current scope of model merging.
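For context, the baseline operation whose saturation the paper analyzes, merging experts by averaging their parameter deltas relative to a shared base model, fits in a few lines. This is a generic task-arithmetic sketch; the paper's RHT method is not reproduced here.

```python
import numpy as np

def merge_experts(base, experts, coef=1.0):
    """Task-arithmetic merging sketch: average the experts' parameter
    deltas relative to the shared base model and add them back."""
    deltas = [e - base for e in experts]
    return base + coef * np.mean(deltas, axis=0)
```

The paper's saturation result says, roughly, that as more `experts` are folded into this average, the incremental gain of each new delta shrinks concavely until the effective parameter space is exhausted.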

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1777]
C. Wu, Y. Cai, Y. Liu, P. Zhu, Y. Xue, Z. Gong, J. Hirschberg and B. Ma.
Multimodal Emotion Recognition in Conversations: A Survey of Methods, Trends, Challenges and Prospects.
Preprint (May. 2025). arXiv
Abstract

While text-based emotion recognition methods have achieved notable success, real-world dialogue systems often demand a more nuanced emotional understanding than any single modality can offer. Multimodal Emotion Recognition in Conversations (MERC) has thus emerged as a crucial direction for enhancing the naturalness and emotional understanding of human-computer interaction. Its goal is to accurately recognize emotions by integrating information from various modalities such as text, speech, and visual signals.
This survey offers a systematic overview of MERC, including its motivations, core tasks, representative methods, and evaluation strategies. We further examine recent trends, highlight key challenges, and outline future directions. As interest in emotionally intelligent systems grows, this survey provides timely guidance for advancing MERC research.

MCML Authors

[1776]
C. Zhang, S. Wu, Y. Chen, M. Aßenmacher, C. Heumann, Y. Men, G. Fan and J. Gama.
OBD-Finder: Explainable Coarse-to-Fine Text-Centric Oracle Bone Duplicates Discovery.
Preprint (May. 2025). arXiv GitHub
Abstract

Oracle Bone Inscription (OBI) is the earliest systematic writing system in China, while the identification of Oracle Bone (OB) duplicates is a fundamental issue in OBI research. In this work, we design a progressive OB duplicate discovery framework that combines unsupervised low-level keypoints matching with high-level text-centric content-based matching to refine and rank the candidate OB duplicates with semantic awareness and interpretability. We compare our approach with state-of-the-art content-based image retrieval and image matching methods, showing that our approach yields comparable recall performance and the highest simplified mean reciprocal rank scores for both Top-5 and Top-15 retrieval results, and with significantly accelerated computation efficiency. We have discovered over 60 pairs of new OB duplicates in real-world deployment, which were missed by OBI researchers for decades.

MCML Authors
Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[1775]
F. Zhang, Y. Shi and X. Zhu.
Global Collinearity-aware Polygonizer for Polygonal Building Mapping in Remote Sensing.
Preprint (May. 2025). arXiv GitHub
Abstract

This paper addresses the challenge of mapping polygonal buildings from remote sensing images and introduces a novel algorithm, the Global Collinearity-aware Polygonizer (GCP). GCP, built upon an instance segmentation framework, processes binary masks produced by any instance segmentation model. The algorithm begins by collecting polylines sampled along the contours of the binary masks. These polylines undergo a refinement process using a transformer-based regression module to ensure they accurately fit the contours of the targeted building instances. Subsequently, a collinearity-aware polygon simplification module simplifies these refined polylines and generates the final polygon representation. This module employs a dynamic programming technique to optimize an objective function that balances the simplicity and fidelity of the polygons, achieving globally optimal solutions. Furthermore, the optimized collinearity-aware objective is seamlessly integrated into network training, enhancing the cohesiveness of the entire pipeline. The effectiveness of GCP has been validated on two public benchmarks for polygonal building mapping. Further experiments reveal that applying the collinearity-aware polygon simplification module to arbitrary polylines, without prior knowledge, enhances accuracy over traditional methods such as the Douglas-Peucker algorithm. This finding underscores the broad applicability of GCP.
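The Douglas-Peucker baseline that GCP is compared against recursively keeps the point farthest from the chord between a polyline's endpoints. A self-contained sketch of that classical algorithm (not GCP's dynamic-programming formulation):

```python
import math

def perpendicular_distance(pt, start, end):
    """Distance from pt to the line through start and end."""
    (x, y), (x1, y1), (x2, y2) = pt, start, end
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def douglas_peucker(points, epsilon):
    """Drop points closer than epsilon to the chord; recurse at the farthest one."""
    if len(points) < 3:
        return points
    dists = [perpendicular_distance(p, points[0], points[-1])
             for p in points[1:-1]]
    idx = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[idx - 1] > epsilon:
        left = douglas_peucker(points[:idx + 1], epsilon)
        right = douglas_peucker(points[idx:], epsilon)
        return left[:-1] + right  # avoid duplicating the split point
    return [points[0], points[-1]]

simplified = douglas_peucker([(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7)], 1.0)
print(simplified)  # [(0, 0), (2, -0.1), (3, 5), (5, 7)]
```

Unlike GCP's globally optimal objective, this greedy scheme makes only local keep/drop decisions, which is the accuracy gap the paper's experiments quantify.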

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1774]
R. Zhao, B. Chen, B. Plank and M. A. Hedderich.
MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs.
Preprint (May. 2025). arXiv
Abstract

Large language models (LLMs) are used globally across many languages, but their English-centric pretraining raises concerns about cross-lingual disparities for cultural awareness, often resulting in biased outputs. However, comprehensive multilingual evaluation remains challenging due to limited benchmarks and questionable translation quality. To better assess these disparities, we introduce MAKIEval, an automatic multilingual framework for evaluating cultural awareness in LLMs across languages, regions, and topics. MAKIEval evaluates open-ended text generation, capturing how models express culturally grounded knowledge in natural language. Leveraging Wikidata’s multilingual structure as a cross-lingual anchor, it automatically identifies cultural entities in model outputs and links them to structured knowledge, enabling scalable, language-agnostic evaluation without manual annotation or translation. We then introduce four metrics that capture complementary dimensions of cultural awareness: granularity, diversity, cultural specificity, and consensus across languages. We assess 7 LLMs developed from different parts of the world, encompassing both open-source and proprietary systems, across 13 languages, 19 countries and regions, and 6 culturally salient topics (e.g., food, clothing). Notably, we find that models tend to exhibit stronger cultural awareness in English, suggesting that English prompts more effectively activate culturally grounded knowledge. We publicly release our code and data.

MCML Authors
Link to website

Raoyuan Zhao

AI and Computational Linguistics

Link to website

Beiduo Chen

AI and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics

Link to Profile Michael Hedderich

Michael Hedderich

Dr.

AI and Computational Linguistics


[1773]
R. Zhao, A. Köksal, A. Modarressi, M. A. Hedderich and H. Schütze.
Do We Know What LLMs Don't Know? A Study of Consistency in Knowledge Probing.
Preprint (May. 2025). arXiv
Abstract

The reliability of large language models (LLMs) is greatly compromised by their tendency to hallucinate, underscoring the need for precise identification of knowledge gaps within LLMs. Various methods for probing such gaps exist, ranging from calibration-based to prompting-based methods. To evaluate these probing methods, in this paper, we propose a new process based on using input variations and quantitative metrics. Through this, we expose two dimensions of inconsistency in knowledge gap probing. (1) Intra-method inconsistency: Minimal non-semantic perturbations in prompts lead to considerable variance in detected knowledge gaps within the same probing method; e.g., the simple variation of shuffling answer options can decrease agreement to around 40%. (2) Cross-method inconsistency: Probing methods contradict each other on whether a model knows the answer. Methods are highly inconsistent – with decision consistency across methods being as low as 7% – even though the model, dataset, and prompt are all the same. These findings challenge existing probing methods and highlight the urgent need for perturbation-robust probing frameworks.
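The cross-method decision consistency the abstract reports (as low as 7%) boils down to agreement between the binary "knows / doesn't know" verdicts of two probing methods. A toy sketch of that statistic (names are illustrative, not the paper's code):

```python
# Fraction of questions on which two probing methods agree about
# whether the model knows the answer (True = knows, False = gap).
def decision_consistency(method_a: list[bool], method_b: list[bool]) -> float:
    """Agreement rate between two per-question knowledge verdicts."""
    return sum(a == b for a, b in zip(method_a, method_b)) / len(method_a)

calibration_verdicts = [True, True, False, True]
prompting_verdicts = [True, False, False, False]
print(decision_consistency(calibration_verdicts, prompting_verdicts))  # 0.5
```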

MCML Authors
Link to website

Raoyuan Zhao

AI and Computational Linguistics

Link to website

Ali Modarressi

Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1772]
S. Zhao, Z. Xiong, J. Zhao and X. Zhu.
ExEBench: Benchmarking Foundation Models on Extreme Earth Events.
Preprint (May. 2025). arXiv GitHub
Abstract

Our planet is facing increasingly frequent extreme events, which pose major risks to human lives and ecosystems. Recent advances in machine learning (ML), especially with foundation models (FMs) trained on extensive datasets, excel in extracting features and show promise in disaster management. Nevertheless, these models often inherit biases from training data, challenging their performance over extreme values. To explore the reliability of FMs in the context of extreme events, we introduce ExEBench (Extreme Earth Benchmark), a collection of seven extreme event categories across floods, wildfires, storms, tropical cyclones, extreme precipitation, heatwaves, and cold waves. The dataset features global coverage, varying data volumes, and diverse data sources with different spatial, temporal, and spectral characteristics. To broaden the real-world impact of FMs, we include multiple challenging ML tasks that are closely aligned with operational needs in extreme event detection, monitoring, and forecasting. ExEBench aims to (1) assess FM generalizability across diverse, high-impact tasks and domains, (2) promote the development of novel ML methods that benefit disaster management, and (3) offer a platform for analyzing the interactions and cascading effects of extreme events to advance our understanding of the Earth system, especially under the climate change expected in the decades to come.

MCML Authors
Link to website

Jie Zhao

Dr.

Data Science in Earth Observation

Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1771]
S. Zhou, S. Peng, S. Luebke, J. Haßler, M. Haim, S. M. Mohammad and B. Plank.
What Media Frames Reveal About Stance: A Dataset and Study about Memes in Climate Change Discourse.
Preprint (May. 2025). arXiv
Abstract

Media framing refers to the emphasis on specific aspects of perceived reality to shape how an issue is defined and understood. Its primary purpose is to shape public perceptions often in alignment with the authors’ opinions and stances. However, the interaction between stance and media frame remains largely unexplored. In this work, we apply an interdisciplinary approach to conceptualize and computationally explore this interaction with internet memes on climate change. We curate CLIMATEMEMES, the first dataset of climate-change memes annotated with both stance and media frames, inspired by research in communication science. CLIMATEMEMES includes 1,184 memes sourced from 47 subreddits, enabling analysis of frame prominence over time and communities, and sheds light on the framing preferences of different stance holders. We propose two meme understanding tasks: stance detection and media frame detection. We evaluate LLaVA-NeXT and Molmo in various setups, and report the corresponding results on their LLM backbone. Human captions consistently enhance performance. Synthetic captions and human-corrected OCR also help occasionally. Our findings highlight that VLMs perform well on stance, but struggle on frames, where LLMs outperform VLMs. Finally, we analyze VLMs’ limitations in handling nuanced frames and stance expressions on climate change internet memes.

MCML Authors
Link to website

Shijia Zhou

AI and Computational Linguistics

Link to website

Siyao Peng

Dr.

AI and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1770]
D. Zhu, S. Gavranovic, F. Boussuge, B. Busam and S. Ilic.
Generative Data Augmentation for Object Point Cloud Segmentation.
Preprint (May. 2025). arXiv
Abstract

Data augmentation is widely used to train deep learning models to address data scarcity. However, traditional data augmentation (TDA) typically relies on simple geometric transformation, such as random rotation and rescaling, resulting in minimal data diversity enrichment and limited model performance improvement. State-of-the-art generative models for 3D shape generation rely on the denoising diffusion probabilistic models and manage to generate realistic novel point clouds for 3D content creation and manipulation. Nevertheless, the generated 3D shapes lack associated point-wise semantic labels, restricting their usage in enlarging the training data for point cloud segmentation tasks. To bridge the gap between data augmentation techniques and the advanced diffusion models, we extend the state-of-the-art 3D diffusion model, Lion, to a part-aware generative model that can generate high-quality point clouds conditioned on given segmentation masks. Leveraging the novel generative model, we introduce a 3-step generative data augmentation (GDA) pipeline for point cloud segmentation training. Our GDA approach requires only a small amount of labeled samples but enriches the training data with generated variants and pseudo-labeled samples, which are validated by a novel diffusion-based pseudo-label filtering method. Extensive experiments on two large-scale synthetic datasets and a real-world medical dataset demonstrate that our GDA method outperforms TDA approach and related semi-supervised and self-supervised methods.

MCML Authors
Link to website

Dekai Zhu

Computer Aided Medical Procedures & Augmented Reality

Link to website

Benjamin Busam

Dr.

Computer Aided Medical Procedures & Augmented Reality


[1769]
J. Kobialka, E. Sommer, J. Kwon, D. Dold and D. Rügamer.
Approximate Posteriors in Neural Networks: A Sampling Perspective.
AABI 2025 - 7th Symposium on Advances in Approximate Bayesian Inference collocated with the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 29, 2025. To be published. Preprint available. URL
Abstract

The landscape of neural network loss functions is known to be highly complex, and the ability of gradient-based approaches to find well-generalizing solutions to such high-dimensional problems is often considered a miracle. Similarly, Bayesian neural networks (BNNs) inherit this complexity through the model’s likelihood. In applications where BNNs are used to account for weight uncertainty, recent advances in sampling-based inference (SAI) have shown promising results outperforming other approximate Bayesian inference (ABI) methods. In this work, we analyze the approximate posterior implicitly defined by SAI and uncover key insights into its success. Among other things, we demonstrate how SAI handles symmetries differently than ABI, and examine the role of overparameterization. Further, we investigate the characteristics of approximate posteriors with sampling budgets scaled far beyond previously studied limits and explain why the localized behavior of samplers does not inherently constitute a disadvantage.

MCML Authors
Link to website

Julius Kobialka

Statistics, Data Science and Machine Learning

Link to website

Emanuel Sommer

Statistics, Data Science and Machine Learning

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[1768]
T. Nagler and D. Rügamer.
Uncertainty Quantification for Prior-Fitted Networks using Martingale Posteriors.
AABI 2025 - 7th Symposium on Advances in Approximate Bayesian Inference collocated with the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 29, 2025. To be published. Preprint available. URL
Abstract

Prior-fitted networks (PFNs) have emerged as promising foundation models for prediction from tabular data sets, achieving state-of-the-art performance on small to moderate data sizes without tuning. While PFNs are motivated by Bayesian ideas, they do not provide any uncertainty quantification for predictive means, quantiles, or similar quantities. We propose a principled and efficient method to construct Bayesian posteriors for such estimates based on Martingale Posteriors. Several simulated and real-world data examples are used to showcase the resulting uncertainty quantification of our method in inference applications.

MCML Authors
Link to Profile Thomas Nagler

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[1767]
T. Rochussen and V. Fortuin.
Sparse Gaussian Neural Processes.
AABI 2025 - 7th Symposium on Advances in Approximate Bayesian Inference collocated with the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 29, 2025. To be published. Preprint available. arXiv
Abstract

Despite significant recent advances in probabilistic meta-learning, it is common for practitioners to avoid using deep learning models due to a comparative lack of interpretability. Instead, many practitioners simply use non-meta-models such as Gaussian processes with interpretable priors, and conduct the tedious procedure of training their model from scratch for each task they encounter. While this is justifiable for tasks with a limited number of data points, the cubic computational cost of exact Gaussian process inference renders this prohibitive when each task has many observations. To remedy this, we introduce a family of models that meta-learn sparse Gaussian process inference. Not only does this enable rapid prediction on new tasks with sparse Gaussian processes, but since our models have clear interpretations as members of the neural process family, it also allows manual elicitation of priors in a neural process for the first time. In meta-learning regimes for which the number of observed tasks is small or for which expert domain knowledge is available, this offers a crucial advantage.

MCML Authors
Link to Profile Vincent Fortuin

Vincent Fortuin

Dr.

Bayesian Deep Learning


[1766]
F. Ghorbanpour, V. Hangya and A. Fraser.
Fine-Grained Transfer Learning for Harmful Content Detection through Label-Specific Soft Prompt Tuning.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. DOI
Abstract

The spread of harmful content online is a dynamic issue evolving over time. Existing detection models, reliant on static data, are becoming less effective and generalizable. Developing new models requires sufficient up-to-date data, which is challenging. A potential solution is to combine existing datasets with minimal new data. However, detection tasks vary—some focus on hate speech, offensive, or abusive content, which differ in the intent to harm, while others focus on identifying targets of harmful speech such as racism, sexism, etc.—raising the challenge of handling nuanced class differences. To address these issues, we introduce a novel transfer learning method that leverages class-specific knowledge to enhance harmful content detection. In our approach, we first present label-specific soft prompt tuning, which captures and represents class-level information. Secondly, we propose two approaches to transfer this fine-grained knowledge from source (existing tasks) to target (unseen and new tasks): initializing the target task prompts from source prompts and using an attention mechanism that learns and adjusts attention scores to utilize the most relevant information from source prompts. Experiments demonstrate significant improvements in harmful content detection across English and German datasets, highlighting the effectiveness of label-specific representations and knowledge transfer.

MCML Authors
Link to website

Faeze Ghorbanpour

Data Analytics & Statistics

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[1765]
K. Hämmerl, T. Limisiewicz, J. Libovický and A. Fraser.
Beyond Literal Token Overlap: Token Alignability for Multilinguality.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. DOI
Abstract

Previous work has considered token overlap, or even similarity of token distributions, as predictors for multilinguality and cross-lingual knowledge transfer in language models. However, these very literal metrics assign large distances to language pairs with different scripts, which can nevertheless show good cross-linguality. This limits the explanatory strength of token overlap for knowledge transfer between language pairs that use distinct scripts or follow different orthographic conventions. In this paper, we propose subword token alignability as a new way to understand the impact and quality of multilingual tokenisation. In particular, this metric predicts multilinguality much better when scripts are disparate and the overlap of literal tokens is low. We analyse this metric in the context of both encoder and decoder models, look at data size as a potential distractor, and discuss how this insight may be applied to multilingual tokenisation in future work. We recommend our subword token alignability metric for identifying optimal language pairs for cross-lingual transfer, as well as to guide the construction of better multilingual tokenisers in the future. We publish our code and reproducibility details.

MCML Authors
Link to website

Katharina Hämmerl

Data Analytics & Statistics

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[1764]
C. Ma, A. ImaniGooghari, H. Ye, R. Pei, E. Asgari and H. Schütze.
Taxi1500: A Dataset for Multilingual Text Classification in 1500 Languages.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. URL
Abstract

While natural language processing tools have been developed extensively for some of the world’s languages, a significant portion of the world’s over 7000 languages are still neglected. One reason for this is that evaluation datasets do not yet cover a wide range of languages, including low-resource and endangered ones. We aim to address this issue by creating a text classification dataset encompassing a large number of languages, many of which currently have little to no annotated data available. We leverage parallel translations of the Bible to construct such a dataset by first developing applicable topics and employing a crowdsourcing tool to collect annotated data. By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages. We extensively benchmark several existing multilingual language models using our dataset. To facilitate the advancement of research in this area, we will release our dataset and code.
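The label projection step described above amounts to copying each English verse's topic label to the same verse in every aligned translation. A toy sketch of that idea (verse IDs and labels below are invented for illustration, not from the Taxi1500 data):

```python
# Project English topic labels onto a target language via shared verse IDs,
# the alignment unit parallel Bible corpora provide.

def project_labels(english_labels: dict[str, str],
                   target_verses: dict[str, str]) -> list[tuple[str, str]]:
    """Pair each target-language verse with the label of its aligned English verse."""
    return [(text, english_labels[vid])
            for vid, text in target_verses.items()
            if vid in english_labels]

english_labels = {"GEN_1_1": "creation", "JHN_3_16": "love"}
german_verses = {"GEN_1_1": "Am Anfang schuf Gott Himmel und Erde.",
                 "JHN_3_16": "Denn also hat Gott die Welt geliebt ..."}
for text, label in project_labels(english_labels, german_verses):
    print(label, "->", text)
```

Because only the verse ID is needed, the same projection scales to every language with a Bible translation, which is how the dataset reaches 1500+ languages.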

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1763]
M. Spliethöver, T. Knebler, F. Fumagalli, M. Muschalik, B. Hammer, E. Hüllermeier and H. Wachsmuth.
Adaptive Prompting: Ad-hoc Prompt Composition for Social Bias Detection.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. DOI
Abstract

Recent advances on instruction fine-tuning have led to the development of various prompting techniques for large language models, such as explicit reasoning steps. However, the success of techniques depends on various parameters, such as the task, language model, and context provided. Finding an effective prompt is, therefore, often a trial-and-error process. Most existing approaches to automatic prompting aim to optimize individual techniques instead of compositions of techniques and their dependence on the input. To fill this gap, we propose an adaptive prompting approach that predicts the optimal prompt composition ad-hoc for a given input. We apply our approach to social bias detection, a highly context-dependent task that requires semantic understanding. We evaluate it with three large language models on three datasets, comparing compositions to individual techniques and other baselines. The results underline the importance of finding an effective prompt composition. Our approach robustly ensures high detection performance, and is best in several settings. Moreover, first experiments on other tasks support its generalizability.

MCML Authors
Link to website

Maximilian Muschalik

Artificial Intelligence and Machine Learning

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1762]
L. Madaan, D. Esiobu, P. Stenetorp, B. Plank and D. Hupkes.
Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. To be published. Preprint available. arXiv
Abstract

In the recent past, a popular way of evaluating natural language understanding (NLU), was to consider a model’s ability to perform natural language inference (NLI) tasks. In this paper, we investigate if NLI tasks, that are rarely used for LLM evaluation, can still be informative for evaluating LLMs. Focusing on five different NLI benchmarks across six models of different scales, we investigate if they are able to discriminate models of different size and quality and how their accuracies develop during training. Furthermore, we investigate the extent to which the softmax distributions of models align with human distributions in cases where statements are ambiguous or vague. Overall, our results paint a positive picture for the NLI tasks: we find that they are able to discriminate well between models at various stages of training, yet are not (all) saturated. Furthermore, we find that while the similarity of model distributions with human label distributions increases with scale, it is still much higher than the similarity between two populations of humans, making it a potentially interesting statistic to consider.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1761]
M. Schöffel, M. Wiedner, E. Garces Arias, P. Ruppert, C. Heumann and M. Aßenmacher.
Modern Models, Medieval Texts: A POS Tagging Study of Old Occitan.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. To be published. Preprint available. arXiv
Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing, yet their effectiveness in handling historical languages remains largely unexplored. This study examines the performance of open-source LLMs in part-of-speech (POS) tagging for Old Occitan, a historical language characterized by non-standardized orthography and significant diachronic variation. Through comparative analysis of two distinct corpora-hagiographical and medical texts-we evaluate how current models handle the inherent challenges of processing a low-resource historical language. Our findings demonstrate critical limitations in LLM performance when confronted with extreme orthographic and syntactic variability. We provide detailed error analysis and specific recommendations for improving model performance in historical language processing. This research advances our understanding of LLM capabilities in challenging linguistic contexts while offering practical insights for both computational linguistics and historical language studies.

MCML Authors
Link to website

Esteban Garces Arias

Statistical Learning and Data Science

Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[1760]
R. Shim and B. Plank.
Dialetto, ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. To be published. Preprint available.
Abstract

There is increasing interest in looking at dialects in NLP. However, most work to date still treats dialects as discrete categories. For instance, evaluative work in variation-oriented NLP for English often works with Indian English or African-American Vernacular English as homogeneous categories (Faisal et al., 2024; Ziems et al., 2023), yet even within one variety there is substantial variation. We examine within-dialect variation and show that performance critically varies within categories. We measure speech-to-text performance on Italian dialects, and empirically observe a geographical performance disparity. This disparity correlates substantially (-0.5) with linguistic similarity to the highest performing dialect variety. We cross-examine our results against dialectometry methods, and interpret the performance disparity to be due to a bias towards dialects that are more similar to the standard variety in the speech-to-text model examined. We additionally leverage geostatistical methods to predict zero-shot performance at unseen sites, and find the incorporation of geographical information to substantially improve prediction performance, indicating there to be geographical structure in the performance distribution.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1759]
P. Lin, A. F. T. Martins and H. Schütze.
A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models.
NAACL 2025 - Findings of Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. DOI
Abstract

Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus containing just 10K parallel sentences can yield results comparable to those obtained from much larger datasets; (iii) employing only the machine translation objective yields the best results among various training objectives and their combinations; (iv) larger multilingual language models benefit more from parallel corpora than smaller models due to their stronger capacity for cross-task transfer. Our study offers valuable insights into the optimal utilization of parallel corpora to enhance multilingual large language models, extending the generalizability of previous findings from limited languages and tasks to a broader range of scenarios.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1758]
P. Lin, A. F. T. Martins and H. Schütze.
XAMPLER: Learning to Retrieve Cross-Lingual In-Context Examples.
NAACL 2025 - Findings of Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. DOI GitHub
Abstract

Recent studies indicate that leveraging off-the-shelf or fine-tuned retrievers, capable of retrieving relevant in-context examples tailored to the input query, enhances few-shot in-context learning of English. However, adapting these methods to other languages, especially low-resource ones, poses challenges due to the scarcity of cross-lingual retrievers and annotated data. Thus, we introduce XAMPLER: Cross-Lingual Example Retrieval, a method tailored to tackle the challenge of cross-lingual in-context learning using only annotated English data. XAMPLER first trains a retriever based on Glot500, a multilingual small language model, using positive and negative English examples constructed from the predictions of a multilingual large language model, i.e., MaLA500. Leveraging the cross-lingual capacity of the retriever, it can directly retrieve English examples as few-shot examples for in-context learning of target languages. Experiments on the multilingual text classification benchmark SIB200 with 176 languages show that XAMPLER substantially improves the in-context learning performance across languages.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1757]
I. d. S. Bueno Júnior, H. Ye, A. Wisiorek and H. Schütze.
Privacy-Preserving Federated Learning for Hate Speech Detection.
SRW @NAACL 2025 - Student Research Workshop at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2025). Albuquerque, NM, USA, Apr 29-May 04, 2025. DOI
Abstract

This paper presents a federated learning system with differential privacy for hate speech detection, tailored to low-resource languages. By fine-tuning pre-trained language models, ALBERT emerged as the most effective option for balancing performance and privacy. Experiments demonstrated that federated learning with differential privacy performs adequately in low-resource settings, though datasets with fewer than 20 sentences per client struggled due to excessive noise. Balanced datasets and augmenting hateful data with non-hateful examples proved critical for improving model utility. These findings offer a scalable and privacy-conscious framework for integrating hate speech detection into social media platforms and browsers, safeguarding user privacy while addressing online harm.
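As a rough illustration of the mechanism the paper builds on (a sketch, not the authors' system), one round of federated averaging with per-client gradient clipping and Gaussian noise can be written as below; the clip norm, noise multiplier, and learning rate are illustrative values, and `dp_federated_round` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_federated_round(global_w, client_grads, clip=1.0, noise_mult=0.5, lr=0.1):
    """One FedAvg round with differential-privacy-style noising:
    clip each client's update to a fixed norm, average, add Gaussian
    noise scaled to the clip norm, then take a gradient step."""
    clipped = []
    for g in client_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip / max(norm, 1e-12)))
    avg = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip / len(client_grads),
                       size=avg.shape)
    return global_w - lr * (avg + noise)

w = np.zeros(3)
grads = [np.array([3.0, 0.0, 0.0]),   # large update, gets clipped
         np.array([0.0, 0.5, 0.0])]   # small update, kept as-is
w_new = dp_federated_round(w, grads)
```

The clipping bounds each client's influence on the aggregate, which is what makes the added noise yield a formal privacy guarantee; with very small per-client datasets the noise dominates, matching the degradation the paper reports below 20 sentences per client.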

MCML Authors
Link to website

Axel Wisiorek

Dr.

Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1756]
B. Ma, C. A. Huang and A.-C. Haensch.
Can Large Language Models Advance Crosswalks? The Case of Danish Occupation Codes.
SRW @NAACL 2025 - Student Research Workshop at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2025). Albuquerque, NM, USA, Apr 29-May 04, 2025. URL
Abstract

Crosswalks, which map one classification system to another, are critical tools for harmonizing data across time, countries, or frameworks. However, constructing crosswalks is labor-intensive and often requires domain expertise. This paper investigates the potential of Large Language Models (LLMs) to assist in creating crosswalks, focusing on two Danish occupational classification systems from different time periods as a case study. We propose a two-stage, prompt-based framework for this task, where LLMs perform similarity assessments between classification codes and identify final mappings through a guided decision process. Using four instruction-tuned LLMs and comparing them against an embedding-based baseline, we evaluate the performance of different models in crosswalks. Our results highlight the strengths of LLMs in crosswalk creation compared to the embedding-based baseline, showing the effectiveness of the interactive prompt-based framework for conducting crosswalks by LLMs. Furthermore, we analyze the impact of model combinations across two interactive rounds, highlighting the importance of model selection and consistency. This work contributes to the growing field of NLP applications for domain-specific knowledge mapping and demonstrates the potential of LLMs in advancing crosswalk methodologies.

MCML Authors
Link to website

Anna-Carolina Haensch

Dr.

Social Data Science and AI


[1755]
V. Blaschke.
Beyond 'noisy' text: How (and why) to process dialect data.
W-NUT @NAACL 2025 - 10th Workshop on Noisy and User-generated Text at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2025). Albuquerque, NM, USA, Apr 29-May 04, 2025. Keynote Talk. PDF
Abstract

Processing data from non-standard dialects links two lines of research: creating NLP tools that are robust to ‘noisy’ inputs, and extending the coverage of NLP tools to underserved language communities. In this talk, I will describe ways in which processing dialect data differs from processing standard-language data, and discuss some of the current challenges in dialect NLP research. For instance, I will talk about strategies to mitigate the effect of infelicitous subword tokenization caused by ad-hoc pronunciation spellings. Additionally, I argue that we should not only consider how to tackle dialectal variation in NLP, but also why. To this end, I will highlight the perspectives of some dialect speaker communities on whether language technologies should (or should not) be able to process or produce dialectal input and output.

MCML Authors
Link to website

Verena Blaschke

AI and Computational Linguistics


[1754]
D. Geißler, A. Maarouf and S. Feuerriegel.
Analyzing User Characteristics of Hate Speech Spreaders on Social Media.
WWW 2025 - ACM Web Conference. Sydney, Australia, Apr 28-May 02, 2025. To be published. Preprint available. arXiv
Abstract

Hate speech on social media threatens the mental and physical well-being of individuals and contributes to real-world violence. Resharing is an important driver behind the spread of hate speech on social media. Yet, little is known about who reshares hate speech and what their characteristics are. In this paper, we analyze the role of user characteristics in hate speech resharing across different types of hate speech (e.g., political hate). For this, we proceed as follows: First, we cluster hate speech posts using large language models to identify different types of hate speech. Then we model the effects of user attributes on users’ probability to reshare hate speech using an explainable machine learning model. To do so, we apply debiasing to control for selection bias in our observational social media data and further control for the latent vulnerability of users to hate speech. We find that, all else equal, users with fewer followers, fewer friends, fewer posts, and older accounts share more hate speech. This shows that users with little social influence tend to share more hate speech. Further, we find substantial heterogeneity across different types of hate speech. For example, racist and misogynistic hate is spread mostly by users with little social influence. In contrast, political anti-Trump and anti-right-wing hate is reshared by users with larger social influence. Overall, understanding the factors that drive users to share hate speech is crucial for detecting individuals at risk of engaging in harmful behavior and for designing effective mitigation strategies.

MCML Authors
Link to website

Abdurahman Maarouf

Artificial Intelligence in Management

Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1753]
J. de Ruite, N. Sairam, A. Camero, K. Rafiezadeh Shahi, X. Zhu, M. W. Smith and H. Kreibich.
The complex connection between flood risk and malaria dynamics in Sub-Saharan Africa.
EGU 2025 - General Assembly of the European Geosciences Union. Vienna, Austria, Apr 27-May 02, 2025. DOI
Abstract

Climate change projections for 2030 indicate a concerning increase in the frequency of floods, which is expected to result in significant economic damages and losses on a global scale. The growth of urbanization has indeed increased flood risk, highlighting the need for a prompt evaluation of economic losses to facilitate rapid response and effective reconstruction. However, providing timely and accurate economic damage assessment immediately after a flood event is difficult and associated with high uncertainty. Remote sensing data can support this task, but cloud cover, infrequent satellite revisit times, and the lack of ground truth data make supervised approaches challenging. To address these challenges, we propose a new economic damage assessment approach based on the analysis of multi-temporal and multi-source Synthetic Aperture Radar (SAR) images before and after the flood peak with an unsupervised change detection method. This method utilizes computer vision techniques, specifically a pixel-based approach with SAR data (Sentinel-1 and TerraSAR-X/TanDEM-X), to monitor changes in buildings and the flood extent. It employs various threshold techniques and parameters to determine the optimal threshold values for highlighting changes and the presence of water. By using this method, our aim is to obtain a pixel-based economic model, which represents the volume of water surrounding or on each building and the flood extent. The purpose of this study is to support governments in decision-making processes and enable insurers to efficiently assess and compensate for damages caused by flood events.

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1752]
S. Delgado Rodriguez, M. Windl, F. Alt and K. Marky.
The TaPSI Research Framework - A Systematization of Knowledge on Tangible Privacy and Security Interfaces.
CHI 2025 - Conference on Human Factors in Computing Systems. Yokohama, Japan, Apr 26-May 01, 2025. DOI
Abstract

This paper presents a comprehensive Systematization of Knowledge on tangible privacy and security interfaces (TaPSI). Tangible interfaces provide physical forms for digital interactions. They can offer significant benefits for privacy and security applications by making complex and abstract security concepts more intuitive, comprehensible, and engaging. Through a literature survey, we collected and analyzed 80 publications. We identified terminology used in these publications and addressed usable privacy and security domains, contributions, applied methods, implementation details, and opportunities or challenges inherent to TaPSI. Based on our findings, we define TaPSI and propose the TaPSI Research Framework, which guides future research by offering insights into when and how to conduct research on privacy and security involving TaPSI as well as a design space of TaPSI.

MCML Authors
Link to website

Maximiliane Windl

Human-Centered Ubiquitous Media


[1751]
T. Mitrevska, F. Chiossi and S. Mayer.
ERP Markers of Visual and Semantic Processing in AI-Generated Images: From Perception to Meaning.
CHI 2025 - Conference on Human Factors in Computing Systems. Yokohama, Japan, Apr 26-May 01, 2025. DOI
Abstract

Perceptual similarity assessment plays an important role in processing visual information, which is often employed in Human-AI interaction tasks such as object recognition or content generation. It is important to understand how humans perceive and evaluate visual similarity to iteratively generate outputs that meet the users’ expectations better and better. By leveraging physiological signals, systems can rely on users’ EEG responses to support the similarity assessment process. We conducted a study (N=20), presenting diverse AI-generated images as stimuli and evaluating their semantic similarity to a target image while recording event-related potentials (ERPs). Our results show that the N400 component distinguishes low, medium, and high similarity of images, while the P2 component showed no significant impact, implying consistent early perceptual processing. Thus, we demonstrate that ERPs allow us to assess the users’ perceived visual similarity to support rapid interactions with human-AI systems.

MCML Authors
Sven Mayer

Prof. Dr.

* Former Principal Investigator


[1750]
J. Simson, F. Draxler, S. Mehr and C. Kern.
Preventing Harmful Data Practices by Using Participatory Input to Navigate the Machine Learning Multiverse.
CHI 2025 - Conference on Human Factors in Computing Systems. Yokohama, Japan, Apr 26-May 01, 2025. DOI
Abstract

In light of inherent trade-offs regarding fairness, privacy, interpretability and performance, as well as normative questions, the machine learning (ML) pipeline needs to be made accessible for public input, critical reflection and engagement of diverse stakeholders. In this work, we introduce a participatory approach to gather input from the general public on the design of an ML pipeline. We show how people’s input can be used to navigate and constrain the multiverse of decisions during both model development and evaluation. We highlight that central design decisions should be democratized rather than “optimized” to acknowledge their critical impact on the system’s output downstream. We describe the iterative development of our approach and its exemplary implementation on a citizen science platform. Our results demonstrate how public participation can inform critical design decisions along the model-building pipeline and combat widespread lazy data practices.

MCML Authors
Link to website

Jan Simson

Social Data Science and AI Lab

Link to Profile Christoph Kern

Christoph Kern

Prof. Dr.

Social Data Science and AI Lab


[1749]
M. Windl, R. Amberg and T. Kosch.
The Illusion of Privacy: Investigating User Misperceptions in Browser Tracking Protection.
CHI 2025 - Conference on Human Factors in Computing Systems. Yokohama, Japan, Apr 26-May 01, 2025. DOI
Abstract

Third parties track users’ web browsing activities, raising privacy concerns. Tracking protection extensions prevent this, but their influence on privacy protection beliefs shaped by narratives remains uncertain. This paper investigates users’ misperception of tracking protection offered by browser plugins. Our study explores how different narratives influence users’ perceived privacy protection by examining three tracking protection extension narratives: no protection, functional protection, and a placebo. In a study (N=36), participants evaluated their anticipated protection during a hotel booking process, influenced by the narrative about the plugin’s functionality. However, participants viewed the same website without tracking protection adaptations. We show that users feel more protected when informed they use a functional or placebo extension, compared to no protection. Our findings highlight the deceptive nature of misleading privacy tools, emphasizing the need for greater transparency to prevent users from a false sense of protection, as such misleading tools negatively affect user study results.

MCML Authors
Link to website

Maximiliane Windl

Human-Centered Ubiquitous Media


[1748]
M. Windl, P. Thalhammer, D. Müller, A. Schmidt and S. S. Feger.
PrivacyHub: A Functional Tangible and Digital Ecosystem for Interoperable Smart Home Privacy Awareness and Control.
CHI 2025 - Conference on Human Factors in Computing Systems. Yokohama, Japan, Apr 26-May 01, 2025. DOI
Abstract

Hubs are at the core of most smart homes. Modern cross-ecosystem protocols and standards enable smart home hubs to achieve interoperability across devices, offering the unique opportunity to integrate universally available smart home privacy awareness and control features. To date, such privacy features mainly focus on individual products or prototypical research artifacts. We developed a cross-ecosystem hub featuring a tangible dashboard and a digital web application to deepen our understanding of how smart home users interact with functional privacy features. The ecosystem allows users to control the connectivity states of their devices and raises awareness by visualizing device positions, states, and data flows. We deployed the ecosystem in six households for one week and found that it increased participants’ perceived control, awareness, and understanding of smart home privacy. We further found distinct differences between tangible and digital mechanisms. Our findings highlight the value of cross-ecosystem hubs for effective privacy management.

MCML Authors
Link to website

Maximiliane Windl

Human-Centered Ubiquitous Media

Link to Profile Albrecht Schmidt

Albrecht Schmidt

Prof. Dr.

Human-Centered Ubiquitous Media


[1747]
J. Leusmann, A. Belardinelli, L. Haliburton, S. Hasler, A. Schmidt, S. Mayer, M. Gienger and C. Wang.
Investigating LLM-Driven Curiosity in Human-Robot Interaction.
CHI 2025 - Conference on Human Factors in Computing Systems. Yokohama, Japan, Apr 26-May 01, 2025. To be published. DOI
Abstract

Integrating curious behavior traits into robots is essential for them to learn and adapt to new tasks over their lifetime and to enhance human-robot interaction. However, the effects of robots expressing curiosity on user perception, user interaction, and user experience in collaborative tasks are unclear. In this work, we present a Multimodal Large Language Model-based system that equips a robot with non-verbal and verbal curiosity traits. We conducted a user study (N=20) to investigate how these traits modulate the robot’s behavior and the users’ impressions of sociability and quality of interaction. Participants prepared cocktails or pizzas with a robot, which was either curious or non-curious. Our results show that we could create user-centric curiosity, which users perceived as more human-like, inquisitive, and autonomous while resulting in a longer interaction time. We contribute a set of design recommendations allowing system designers to take advantage of curiosity in collaborative tasks.

MCML Authors
Luke Haliburton

Dr.

* Former Member

Link to Profile Albrecht Schmidt

Albrecht Schmidt

Prof. Dr.

Human-Centered Ubiquitous Media


[1746]
M. Windl, P. Z. Laboda and S. Mayer.
Designing Effective Consent Mechanisms for Spontaneous Interactions in Augmented Reality.
CHI 2025 - Conference on Human Factors in Computing Systems. Yokohama, Japan, Apr 26-May 01, 2025. To be published. DOI
Abstract

Ubiquitous computing devices like Augmented Reality (AR) glasses allow countless spontaneous interactions – all serving different goals. AR devices rely on data transfer to personalize recommendations and adapt to the user. Today’s consent mechanisms, such as privacy policies, are suitable for long-lasting interactions; however, how users can consent to fast, spontaneous interactions is unclear. We first conducted two focus groups (N=17) to identify privacy-relevant scenarios in AR. We then conducted expert interviews (N=11) with co-design activities to establish effective consent mechanisms. Based on that, we contribute (1) a validated scenario taxonomy to define privacy-relevant AR interaction scenarios, (2) a flowchart to decide on the type of mechanisms considering contextual factors, (3) a design continuum and design aspects chart to create the mechanisms, and (4) a trade-off and prediction chart to evaluate the mechanism. Thus, we contribute a conceptual framework fostering a privacy-preserving future with AR.

MCML Authors
Link to website

Maximiliane Windl

Human-Centered Ubiquitous Media


[1745]
K. Forster, V. Wagner, L. Keil, M. A. Müller, T. Sellhorn and S. Feuerriegel.
Tracking ESG Disclosures of European Companies with Retrieval-Augmented Generation.
Climate Change AI @ICLR 2025 - Workshop on Tackling Climate Change with Machine Learning at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published.
Abstract

Corporations play a crucial role in mitigating climate change and accelerating progress toward environmental, social, and governance (ESG) objectives. However, structured information on the current state of corporate ESG efforts remains limited. In this paper, we propose a machine learning framework based on a retrieval-augmented generation (RAG) pipeline to track ESG indicators from N = 9,200 corporate reports. Our analysis includes ESG indicators from 600 of the largest listed corporations in Europe between 2014 and 2023. We focus on two key dimensions: first, we identify gaps in corporate sustainability reporting in light of existing standards. Second, we provide comprehensive bottom-up estimates of key ESG indicators across European industries. Our findings enable policymakers and financial markets to effectively assess corporate ESG transparency and track progress toward global sustainability objectives.

MCML Authors
Link to website

Kerstin Forster

Artificial Intelligence in Management

Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1744]
C. Bülte, S. Maskey, P. Scholl, J. von Berg and G. Kutyniok.
Graph Neural Networks for Enhancing Ensemble Forecasts of Extreme Rainfall.
Climate Change AI @ICLR 2025 - Workshop on Tackling Climate Change with Machine Learning at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

Climate change is increasing the occurrence of extreme precipitation events, threatening infrastructure, agriculture, and public safety. Ensemble prediction systems provide probabilistic forecasts but exhibit biases and difficulties in capturing extreme weather. While post-processing techniques aim to enhance forecast accuracy, they rarely focus on precipitation, which exhibits complex spatial dependencies and tail behavior. Our novel framework leverages graph neural networks to post-process ensemble forecasts, specifically modeling the extremes of the underlying distribution. This allows us to capture spatial dependencies and improve forecast accuracy for extreme events, leading to more reliable forecasts and mitigating the risks of extreme precipitation and flooding.

MCML Authors
Link to website

Christopher Bülte

Mathematical Foundations of Artificial Intelligence

Link to website

Sohir Maskey

Mathematical Foundations of Artificial Intelligence

Link to website

Philipp Scholl

Mathematical Foundations of Artificial Intelligence

Link to website

Jonas von Berg

Mathematical Foundations of Artificial Intelligence

Link to Profile Gitta Kutyniok

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence


[1743]
J. Kobialka, E. Sommer, J. Kwon, D. Dold and D. Rügamer.
Approximate Posteriors in Neural Networks: A Sampling Perspective.
FPI @ICLR 2025 - Workshop on Frontiers in Probabilistic Inference: Learning meets Sampling at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. Spotlight Talk. To be published. Preprint available. URL
Abstract

The landscape of neural network loss functions is known to be highly complex, and the ability of gradient-based approaches to find well-generalizing solutions to such high-dimensional problems is often considered a miracle. Similarly, Bayesian neural networks (BNNs) inherit this complexity through the model’s likelihood. In applications where BNNs are used to account for weight uncertainty, recent advances in sampling-based inference (SAI) have shown promising results, outperforming other approximate Bayesian inference (ABI) methods. In this work, we analyze the approximate posterior implicitly defined by SAI and uncover key insights into its success. Among other things, we demonstrate how SAI handles symmetries differently than ABI, and examine the role of overparameterization. Further, we investigate the characteristics of approximate posteriors with sampling budgets scaled far beyond previously studied limits and explain why the localized behavior of samplers does not inherently constitute a disadvantage.

MCML Authors
Link to website

Julius Kobialka

Statistics, Data Science and Machine Learning

Link to website

Emanuel Sommer

Statistics, Data Science and Machine Learning

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[1742]
T. Nagler and D. Rügamer.
Uncertainty Quantification for Prior-Fitted Networks using Martingale Posteriors.
FPI @ICLR 2025 - Workshop on Frontiers in Probabilistic Inference: Learning meets Sampling at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Prior-fitted networks (PFNs) have emerged as promising foundation models for prediction from tabular data sets, achieving state-of-the-art performance on small to moderate data sizes without tuning. While PFNs are motivated by Bayesian ideas, they do not provide any uncertainty quantification for predictive means, quantiles, or similar quantities. We propose a principled and efficient method to construct Bayesian posteriors for such estimates based on Martingale Posteriors. Several simulated and real-world data examples are used to showcase the resulting uncertainty quantification of our method in inference applications.
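The martingale-posterior idea can be illustrated by predictive resampling: repeatedly extend the observed data by sampling from a one-step predictive and record where the estimate settles. The sketch below uses the empirical predictive (a Bayesian-bootstrap-style simplification); in the paper's setting a PFN would supply the predictive instead, and the function name and budgets here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def martingale_posterior_mean(y, n_future=500, n_draws=200):
    """Predictive resampling: extend the sample by drawing from the
    (empirical) one-step predictive, then record the resulting mean.
    Each completed trajectory yields one posterior draw."""
    draws = []
    for _ in range(n_draws):
        ys = list(y)
        for _ in range(n_future):
            ys.append(ys[rng.integers(len(ys))])  # empirical predictive draw
        draws.append(float(np.mean(ys)))
    return np.array(draws)

y = rng.normal(loc=1.0, scale=1.0, size=50)
post = martingale_posterior_mean(y)
lo, hi = np.quantile(post, [0.05, 0.95])  # 90% credible interval for the mean
```

The spread of the recorded draws quantifies uncertainty about the mean without ever specifying a prior explicitly, which is the appeal of the martingale-posterior construction for PFNs.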

MCML Authors
Link to Profile Thomas Nagler

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[1741]
A. Reuter, T. G. J. Rudner, V. Fortuin and D. Rügamer.
Can Transformers Learn Full Bayesian Inference in Context?
FPI @ICLR 2025 - Workshop on Frontiers in Probabilistic Inference: Learning meets Sampling at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

Transformers have emerged as the dominant architecture in the field of deep learning, with a broad range of applications and remarkable in-context learning (ICL) capabilities. While not yet fully understood, ICL has already proved to be an intriguing phenomenon, allowing transformers to learn in context – without requiring further training. In this paper, we further advance the understanding of ICL by demonstrating that transformers can perform full Bayesian inference for commonly used statistical models in context. More specifically, we introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows which enables us to infer complex posterior distributions for methods such as generalized linear models and latent factor models. Extensive experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods not operating in context.

MCML Authors
Link to Profile Vincent Fortuin

Vincent Fortuin

Dr.

Bayesian Deep Learning

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[1740]
D. Rundel, E. Sommer, B. Bischl, D. Rügamer and M. Feurer.
Efficiently Warmstarting MCMC for BNNs.
FPI @ICLR 2025 - Workshop on Frontiers in Probabilistic Inference: Learning meets Sampling at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Markov Chain Monte Carlo (MCMC) algorithms are widely regarded as the gold standard for approximate inference in Bayesian neural networks (BNNs). However, they remain computationally expensive and prone to inefficiencies, such as dying samplers, frequently leading to substantial waste of computational resources. While prior work has presented warmstarting techniques as an effective method to mitigate these inefficiencies, we provide a more comprehensive empirical analysis of how initializations of samplers affect their behavior. Based on various experiments examining the dynamics of warmstarting MCMC, we propose novel warmstarting strategies that leverage performance predictors and adaptive termination criteria to achieve better-performing, yet more cost-efficient, models. In numerical experiments, we demonstrate that this approach provides a practical pathway to more resource-efficient approximate inference in BNNs.

MCML Authors
Link to website

David Rundel

Statistical Learning and Data Science

Link to website

Emanuel Sommer

Statistics, Data Science and Machine Learning

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Link to Profile Matthias Feurer

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science


[1739]
C. Koke, Y. Shen, A. Saroha, M. Eisenberger, B. Rieck, M. M. Bronstein and D. Cremers.
Graph Networks struggle with variable Scale.
ICBINB @ICLR 2025 - Workshop I Can’t Believe It’s Not Better: Challenges in Applied Deep Learning at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Standard graph neural networks assign vastly different latent embeddings to graphs describing the same object at different resolution scales. This precludes consistency in applications and prevents generalization between scales, as would fundamentally be needed, e.g., in AI4Science. We uncover the underlying obstruction, investigate its origin and show how to overcome it by modifying the message passing paradigm.

MCML Authors
Link to website

Christian Koke

Computer Vision & Artificial Intelligence

Yuesong Shen

Dr.

* Former Member

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1738]
H. Baniecki, G. Casalicchio, B. Bischl and P. Biecek.
Efficient and Accurate Explanation Estimation with Distribution Compression.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. Spotlight Presentation. To be published. Preprint available. arXiv
Abstract

We discover a theoretical connection between explanation estimation and distribution compression that significantly improves the approximation of feature attributions, importance, and effects. While the exact computation of various machine learning explanations requires numerous model inferences and becomes impractical, the computational cost of approximation increases with an ever-increasing size of data and model parameters. We show that the standard i.i.d. sampling used in a broad spectrum of algorithms for post-hoc explanation leads to an approximation error worthy of improvement. To this end, we introduce Compress Then Explain (CTE), a new paradigm of sample-efficient explainability. It relies on distribution compression through kernel thinning to obtain a data sample that best approximates its marginal distribution. CTE significantly improves the accuracy and stability of explanation estimation with negligible computational overhead. It often achieves an on-par explanation approximation error 2-3x faster by using fewer samples, i.e. requiring 2-3x fewer model evaluations. CTE is a simple, yet powerful, plug-in for any explanation method that now relies on i.i.d. sampling.
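The compress-then-explain recipe can be sketched as follows. Greedy kernel herding stands in here for the kernel thinning used by CTE, and the "explanation" is just a mean model output; both simplifications, and all names, are illustrative rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(X, Y, gamma=0.5):
    """Gaussian (RBF) kernel matrix between row sets X and Y."""
    d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def herd(X, m):
    """Greedy kernel herding: pick m points whose empirical measure
    best matches that of X (a simple stand-in for kernel thinning)."""
    K = rbf(X, X)
    target = K.mean(axis=1)          # similarity to the full distribution
    chosen = []
    for _ in range(m):
        penalty = (K[:, chosen].sum(axis=1) / (len(chosen) + 1)
                   if chosen else 0.0)
        scores = target - penalty    # favour coverage, discourage redundancy
        scores[chosen] = -np.inf     # never pick the same point twice
        chosen.append(int(np.argmax(scores)))
    return X[chosen]

# Compare a toy "explanation" (mean model output) on full vs compressed data.
X = rng.normal(size=(200, 2))
model = lambda Z: Z[:, 0] * 2.0 + Z[:, 1]
Xc = herd(X, 20)
full_effect = model(X).mean()
cte_effect = model(Xc).mean()
```

An explanation method that averages model evaluations over background data (e.g., SHAP-style attributions) would run on `Xc` instead of `X`, cutting the number of model calls by the compression factor.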

MCML Authors
Link to website

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[1737]
D. Herbst and S. Jegelka.
Higher-Order Graphon Neural Networks: Approximation and Cut Distance.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. Spotlight Presentation. To be published. Preprint available. URL
Abstract

Graph limit models, like graphons for limits of dense graphs, have recently been used to study size transferability of graph neural networks (GNNs). While most literature focuses on message passing GNNs (MPNNs), in this work we attend to the more powerful higher-order GNNs. First, we extend the k-WL test for graphons (Böker, 2023) to the graphon-signal space and introduce signal-weighted homomorphism densities as a key tool. As an exemplary focus, we generalize Invariant Graph Networks (IGNs) to graphons, proposing Invariant Graphon Networks (IWNs) defined via a subset of the IGN basis corresponding to bounded linear operators. Even with this restricted basis, we show that IWNs of order k are at least as powerful as the k-WL test, and we establish universal approximation results for graphon-signals in cut distance. This significantly extends the prior work of Cai & Wang (2022), showing that IWNs—a subset of their IGN-small—retain effectively the same expressivity as the full IGN basis in the limit. In contrast to their approach, our blueprint of IWNs also aligns better with the geometry of graphon space, for example facilitating comparability to MPNNs. We highlight that, while typical higher-order GNNs are discontinuous w.r.t. cut distance—which causes their lack of convergence and is inherently tied to the definition of k-WL—their transferability remains comparable to MPNNs.

MCML Authors
Daniel Herbst, Foundations of Deep Neural Networks
Stefanie Jegelka, Prof. Dr., Foundations of Deep Neural Networks


[1736]
M. Sabanayagam, L. Gosch, S. Günnemann and D. Ghoshdastidar.
Exact Certification of (Graph) Neural Networks Against Label Poisoning.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. Spotlight Presentation. To be published. Preprint available. URL
Abstract

Machine learning models are highly vulnerable to label flipping, i.e., the adversarial modification (poisoning) of training labels to compromise performance. Thus, deriving robustness certificates is important to guarantee that test predictions remain unaffected and to understand worst-case robustness behavior. However, for Graph Neural Networks (GNNs), the problem of certifying label flipping has so far been unsolved. We change this by introducing an exact certification method, deriving both sample-wise and collective certificates. Our method leverages the Neural Tangent Kernel (NTK) to capture the training dynamics of wide networks enabling us to reformulate the bilevel optimization problem representing label flipping into a Mixed-Integer Linear Program (MILP). We apply our method to certify a broad range of GNN architectures in node classification tasks. Thereby, concerning the worst-case robustness to label flipping: (i) we establish hierarchies of GNNs on different benchmark graphs; (ii) quantify the effect of architectural choices such as activations, depth and skip-connections; and surprisingly, (iii) uncover a novel phenomenon of the robustness plateauing for intermediate perturbation budgets across all investigated datasets and architectures. While we focus on GNNs, our certificates are applicable to sufficiently wide NNs in general through their NTK. Thus, our work presents the first exact certificate to a poisoning attack ever derived for neural networks, which could be of independent interest.
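The certification question — does any flip of up to `budget` training labels change the test prediction? — can be answered exactly by brute force on a toy model. The sketch below swaps the paper's NTK-plus-MILP machinery for a fixed linear kernel and exhaustive enumeration; `certify_label_flips` and the data are illustrative assumptions only:

```python
from itertools import combinations
import numpy as np

def certify_label_flips(K, y, k_test, budget, lam=0.1):
    """Exact (brute-force) certificate: is the sign of the kernel ridge test
    prediction unchanged under every flip of up to `budget` training labels?
    A tiny stand-in for the paper's MILP formulation, with a fixed kernel
    playing the role of the NTK."""
    n = len(y)
    solve = lambda labels: k_test @ np.linalg.solve(K + lam * np.eye(n), labels)
    base_sign = np.sign(solve(y))
    for r in range(budget + 1):
        for flip in combinations(range(n), r):
            y_adv = y.copy()
            y_adv[list(flip)] *= -1
            if np.sign(solve(y_adv)) != base_sign:
                return False  # some poisoning within budget changes the prediction
    return True

x = np.array([-4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0])
y = np.sign(x)
K = np.outer(x, x)   # linear kernel as an NTK stand-in
k_test = 5.0 * x     # kernel row for a test point at x = 5
```

Enumeration scales exponentially in the budget, which is exactly why the paper reformulates the problem as a MILP.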

MCML Authors
Lukas Gosch, Data Analytics & Machine Learning
Stephan Günnemann, Prof. Dr., Data Analytics & Machine Learning
Debarghya Ghoshdastidar, Prof. Dr., Theoretical Foundations of Artificial Intelligence


[1735]
E. Abdelrahman, L. Zhao, V. T. Hu, M. Cord, P. Perez and M. Elhoseiny.
ToddlerDiffusion: Interactive Structured Image Generation with Cascaded Schrödinger Bridge.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv GitHub
Abstract

Diffusion models break down the challenging task of generating data from high-dimensional distributions into a series of easier denoising steps. Inspired by this paradigm, we propose a novel approach that extends the diffusion framework into modality space, decomposing the complex task of RGB image generation into simpler, interpretable stages. Our method, termed ToddlerDiffusion, cascades modality-specific models, each responsible for generating an intermediate representation, such as contours, palettes, and detailed textures, ultimately culminating in a high-quality RGB image. Instead of relying on the naive LDM concatenation conditioning mechanism to connect the different stages together, we employ Schrödinger Bridge to determine the optimal transport between different modalities. Although employing a cascaded pipeline introduces more stages, which could lead to a more complex architecture, each stage is meticulously formulated for efficiency and accuracy, surpassing Stable-Diffusion (LDM) performance. Modality composition not only enhances overall performance but enables emerging proprieties such as consistent editing, interaction capabilities, high-level interpretability, and faster convergence and sampling rate. Extensive experiments on diverse datasets, including LSUN-Churches, ImageNet, CelebHQ, and LAION-Art, demonstrate the efficacy of our approach, consistently outperforming state-of-the-art methods. For instance, ToddlerDiffusion achieves notable efficiency, matching LDM performance on LSUN-Churches while operating 2× faster with a 3× smaller architecture.

MCML Authors
Vincent Tao Hu, Dr., Computer Vision & Learning


[1734]
K. Bhatia, F. Köhler and N. Thuerey.
PRDP: Progressively Refined Differentiable Physics.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

The physics solvers employed for neural network training are primarily iterative, and hence, differentiating through them introduces a severe computational burden as iterations grow large. Inspired by works in bilevel optimization, we show that full accuracy of the network is achievable through physics significantly coarser than fully converged solvers. We propose Progressively Refined Differentiable Physics (PRDP), an approach that identifies the level of physics refinement sufficient for full training accuracy. By beginning with coarse physics, adaptively refining it during training, and stopping refinement at the level adequate for training, it enables significant compute savings without sacrificing network accuracy. Our focus is on differentiating iterative linear solvers for sparsely discretized differential operators, which are fundamental to scientific computing. PRDP is applicable to both unrolled and implicit differentiation. We validate its performance on a variety of learning scenarios involving differentiable physics solvers such as inverse problems, autoregressive neural emulators, and correction-based neural-hybrid solvers. In the challenging example of emulating the Navier-Stokes equations, we reduce training time by 62%.
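A minimal sketch of the underlying observation — gradients taken through a coarsely converged iterative solver can closely match those from a fully converged one — assuming a toy Jacobi solver and finite-difference gradients (all names hypothetical; PRDP additionally adapts the refinement level during training):

```python
import numpy as np

def jacobi(A, b, iters):
    # Fixed number of Jacobi iterations: a coarse, differentiable physics solver.
    d = np.diag(A)
    x = np.zeros_like(b)
    for _ in range(iters):
        x = (b - (A @ x - d * x)) / d
    return x

def loss(theta, A, b, target, iters):
    # Inverse problem: fit scalar theta so the solver output matches `target`.
    return ((jacobi(A, theta * b, iters) - target) ** 2).sum()

def fd_grad(theta, A, b, target, iters, eps=1e-5):
    # Central finite-difference gradient through the (coarse) solver.
    return (loss(theta + eps, A, b, target, iters)
            - loss(theta - eps, A, b, target, iters)) / (2 * eps)

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
target = np.linalg.solve(A, 2.0 * b)          # optimum at theta = 2
g_coarse = fd_grad(1.0, A, b, target, iters=5)    # coarse physics
g_fine = fd_grad(1.0, A, b, target, iters=200)    # (nearly) converged physics
```

Both gradients point the same way and are nearly equal, so training can proceed with far fewer solver iterations, which is the compute saving PRDP exploits.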

MCML Authors
Nils Thuerey, Prof. Dr., Physics-based Simulation


[1733]
M. Bini, L. Girrbach and Z. Akata.
Decoupling Angles and Strength in Low-rank Adaptation.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Parameter Efficient FineTuning (PEFT) methods have recently gained extreme popularity thanks to the vast availability of large-scale models, making it possible to quickly adapt pretrained models to downstream tasks with minimal computational costs. However, current additive finetuning methods such as LoRA show low robustness to prolonged training and hyperparameter choices, which prevents optimal out-of-the-box usage. On the other hand, multiplicative and bounded approaches such as ETHER, while providing higher robustness, only allow for extremely low-rank adaptations and are limited to a fixed-strength transformation, hindering the expressive power of the adaptation. In this work, we propose the DeLoRA finetuning method that first normalizes and then scales the learnable low-rank matrices, thus effectively bounding the transformation strength, which leads to increased hyperparameter robustness at no cost in performance. We show that this proposed approach effectively and consistently improves over popular PEFT methods by evaluating our method on two finetuning tasks, subject-driven image generation and LLM instruction tuning.
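The normalize-then-scale idea can be sketched as follows, under the simplifying assumption that the whole low-rank product is rescaled to unit Frobenius norm before a learnable strength is applied (the paper's exact parametrization may differ):

```python
import numpy as np

def delora_update(W, B, A, strength):
    """Sketch of a normalize-then-scale low-rank update: the low-rank term BA
    is rescaled to unit Frobenius norm, so the update magnitude is exactly
    `strength` regardless of how B and A drift during training.
    (A simplified reading of DeLoRA, not the paper's exact formulation.)"""
    delta = B @ A
    delta = strength * delta / np.linalg.norm(delta)
    return W + delta, delta

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 16))                     # pretrained weight
B, A = rng.normal(size=(16, 4)), rng.normal(size=(4, 16))  # rank-4 factors
W_new, delta = delora_update(W, B, A, strength=0.5)
```

Bounding the update norm this way is what decouples the *direction* of the adaptation (angles, carried by B and A) from its *strength* (the scalar), which is the robustness mechanism the abstract describes.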

MCML Authors
Massimo Bini, Interpretable and Reliable Machine Learning
Leander Girrbach, Interpretable and Reliable Machine Learning
Zeynep Akata, Prof. Dr., Interpretable and Reliable Machine Learning


[1732]
Q. Bouniot, P. Mozharovskyi and F. d'Alché-Buc.
Tailoring Mixup to Data for Calibration.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Among all data augmentation techniques proposed so far, linear interpolation of training samples, also called Mixup, has been found to be effective for a large panel of applications. Along with improved predictive performance, Mixup is also a good technique for improving calibration. However, mixing data carelessly can lead to manifold mismatch, i.e., synthetic data lying outside original class manifolds, which can deteriorate calibration. In this work, we show that the likelihood of assigning a wrong label with Mixup increases with the distance between the data points being mixed. To this end, we propose to dynamically change the underlying distributions of interpolation coefficients depending on the similarity between samples to mix, and define a flexible framework to do so without losing diversity. We provide extensive experiments for classification and regression tasks, showing that our proposed method improves predictive performance and calibration of models, while being much more efficient.
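One way to realize "interpolation coefficients that depend on similarity" is to shrink the Beta concentration with pair distance, so distant pairs draw λ near 0 or 1 and the synthetic point stays close to one of the originals. The exponential schedule below is a hypothetical choice for illustration, not the paper's exact one:

```python
import numpy as np

def tailored_mixup(x1, y1, x2, y2, rng, base_alpha=1.0, scale=1.0):
    # Distant pairs get a small Beta concentration, pushing lambda toward
    # 0 or 1 so synthetic points stay near the original class manifolds.
    dist = np.linalg.norm(x1 - x2)
    alpha = base_alpha * np.exp(-dist / scale)   # hypothetical schedule
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y, lam, alpha

rng = np.random.default_rng(0)
near = tailored_mixup(np.zeros(2), 0.0, np.full(2, 0.1), 1.0, rng)  # similar pair
far = tailored_mixup(np.zeros(2), 0.0, np.full(2, 5.0), 1.0, rng)   # distant pair
```

With `base_alpha` recovered in the zero-distance limit, nearby pairs mix as in standard Mixup while distant pairs are barely interpolated, limiting manifold mismatch.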

MCML Authors
Quentin Bouniot, Dr., Interpretable and Reliable Machine Learning


[1731]
S. Dahan, G. Bénédict, L. Z. J. Williams, Y. Guo, D. Rückert, R. Leech and E. C. Robinson.
SIM: Surface-based fMRI Analysis for Inter-Subject Multimodal Decoding from Movie-Watching Experiments.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv GitHub
Abstract

Current AI frameworks for brain decoding and encoding typically train and test models within the same datasets. This limits their utility for brain computer interfaces (BCI) or neurofeedback, for which it would be useful to pool experiences across individuals to better simulate stimuli not sampled during training. A key obstacle to model generalisation is the degree of variability of inter-subject cortical organisation, which makes it difficult to align or compare cortical signals across participants. In this paper we address this through the use of surface vision transformers, which build a generalisable model of cortical functional dynamics, through encoding the topography of cortical networks and their interactions as a moving image across a surface. This is then combined with tri-modal self-supervised contrastive (CLIP) alignment of audio, video, and fMRI modalities to enable the retrieval of visual and auditory stimuli from patterns of cortical activity (and vice-versa). We validate our approach on 7T task-fMRI data from 174 healthy participants engaged in the movie-watching experiment from the Human Connectome Project (HCP). Results show that it is possible to detect which movie clips an individual is watching purely from their brain activity, even for individuals and movies not seen during training. Further analysis of attention maps reveals that our model captures individual patterns of brain activity that reflect semantic and visual systems. This opens the door to future personalised simulations of brain function.

MCML Authors
Daniel Rückert, Prof. Dr., Artificial Intelligence in Healthcare and Medicine


[1730]
L. Fang, Y. Wang, Z. Liu, C. Zhang, S. Jegelka, J. Gao, B. Ding and Y. Wang.
What is Wrong with Perplexity for Long-context Language Modeling?
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL GitHub
Abstract

Handling long-context inputs is crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. The underlying cause of this limitation has remained unclear. In this work, we provide a comprehensive explanation for this issue. We find that PPL overlooks key tokens, which are essential for long-context understanding, by averaging across all tokens and thereby obscuring the true performance of models in long-context scenarios. To address this, we propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them. Our experiments demonstrate that LongPPL strongly correlates with performance on various long-context benchmarks (e.g., Pearson correlation of -0.96), significantly outperforming traditional PPL in predictive accuracy. Additionally, we introduce LongCE (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens, leading to consistent improvements across diverse benchmarks. In summary, these contributions offer deeper insights into the limitations of PPL and present effective solutions for accurately evaluating and enhancing the long-context capabilities of LLMs.
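The key-token idea can be sketched on per-token log-probabilities: a token counts as "key" if seeing the long context improves its log-probability by more than a threshold, and perplexity is then computed over key tokens only (a simplification of the paper's method; the names and threshold are illustrative):

```python
import math

def long_ppl(logp_long, logp_short, tau=1.0):
    """Toy LongPPL: keep only 'key' tokens, i.e. tokens whose log-probability
    improves by more than `tau` when the long context is visible, and compute
    perplexity over those tokens alone. Returns NaN if no token qualifies."""
    key = [lp for lp, sp in zip(logp_long, logp_short) if lp - sp > tau]
    if not key:
        return float("nan")
    return math.exp(-sum(key) / len(key))

# Token 2 depends on distant context: the long context helps it a lot,
# so LongPPL is driven by that token alone rather than averaged away.
lp_long = [-0.1, -2.0, -0.4]
lp_short = [-0.1, -5.0, -0.4]
```

Here standard PPL over all three tokens would dilute the hard token's contribution, whereas the contrastive filter isolates exactly the tokens that long-context modeling must get right.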

MCML Authors
Stefanie Jegelka, Prof. Dr., Foundations of Deep Neural Networks


[1729]
A. Findeis, T. Kaufmann, E. Hüllermeier, S. Albanie and R. D. Mullins.
Inverse Constitutional AI: Compressing Preferences into Principles.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL GitHub
Abstract

Feedback data is widely used for fine-tuning and evaluating state-of-the-art AI models. Pairwise text preferences, where human or AI annotators select the “better” of two options, are particularly common. Such preferences are used to train (reward) models or to rank models with aggregate statistics. For many applications it is desirable to understand annotator preferences in addition to modelling them – not least because extensive prior work has shown various unintended biases in preference datasets. Yet, preference datasets remain challenging to interpret. Neither black-box reward models nor statistics can answer why one text is preferred over another. Manual interpretation of the numerous (long) response pairs is usually equally infeasible. In this paper, we introduce the Inverse Constitutional AI (ICAI) problem, formulating the interpretation of pairwise text preference data as a compression task. In constitutional AI, a set of principles (a constitution) is used to provide feedback and fine-tune AI models. ICAI inverts this process: given a feedback dataset, we aim to extract a constitution that best enables a large language model (LLM) to reconstruct the original annotations. We propose a corresponding ICAI algorithm and validate its generated constitutions quantitatively based on annotation reconstruction accuracy on several datasets: (a) synthetic feedback data with known principles; (b) AlpacaEval cross-annotated human feedback data; (c) crowdsourced Chatbot Arena data; and (d) PRISM data from diverse demographic groups. As an example application, we further demonstrate the detection of biases in human feedback data. As a short and interpretable representation of the original dataset, generated constitutions have many potential use cases: they may help identify undesirable annotator biases, better understand model performance, scale feedback to unseen data, or assist with adapting AI models to individual user or group preferences.

MCML Authors
Timo Kaufmann, Artificial Intelligence and Machine Learning
Eyke Hüllermeier, Prof. Dr., Artificial Intelligence and Machine Learning


[1728]
D. Frauen, K. Heß and S. Feuerriegel.
Model-agnostic meta-learners for estimating heterogeneous treatment effects over time.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

Estimating heterogeneous treatment effects (HTEs) over time is crucial in many disciplines such as personalized medicine. For example, electronic health records are commonly collected over several time periods and then used to personalize treatment decisions. Existing works for this task have mostly focused on model-based learners (i.e., learners that adapt specific machine-learning models). In contrast, model-agnostic learners – so-called meta-learners – are largely unexplored. In our paper, we propose several meta-learners that are model-agnostic and thus can be used in combination with arbitrary machine learning models (e.g., transformers) to estimate HTEs over time. Here, our focus is on learners that can be obtained via weighted pseudo-outcome regressions, which allows for efficient estimation by targeting the treatment effect directly. We then provide a comprehensive theoretical analysis that characterizes the different learners and that allows us to offer insights into when specific learners are preferable. Finally, we confirm our theoretical insights through numerical experiments. In sum, while meta-learners are already state-of-the-art for the static setting, we are the first to propose a comprehensive set of meta-learners for estimating HTEs in the time-varying setting.
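For intuition, the static ancestor of such learners — a T-learner, i.e. one outcome regression per treatment arm — is model-agnostic in exactly the sense described: any regressor can be plugged in. A least-squares sketch (illustrative only; the paper's contribution is the time-varying generalization and its pseudo-outcome variants):

```python
import numpy as np

def t_learner(X, a, y):
    """Static T-learner: fit one outcome model per treatment arm (here,
    ordinary least squares with intercept) and take the difference of
    predictions as the estimated conditional average treatment effect."""
    Z = np.column_stack([np.ones(len(X)), X])
    fit = lambda mask: np.linalg.lstsq(Z[mask], y[mask], rcond=None)[0]
    b1, b0 = fit(a == 1), fit(a == 0)
    return Z @ b1 - Z @ b0   # estimated CATE at each X

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
a = rng.integers(0, 2, size=n)               # randomized treatment
tau = 2.0                                    # true homogeneous effect
y = X @ np.array([1.0, -1.0]) + tau * a + 0.1 * rng.normal(size=n)
cate = t_learner(X, a, y)
```

Swapping the `fit` closure for any other regressor (e.g. a transformer over patient histories) changes nothing else in the recipe, which is the model-agnosticism the abstract emphasizes.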

MCML Authors
Dennis Frauen, Artificial Intelligence in Management
Konstantin Heß, Artificial Intelligence in Management
Stefan Feuerriegel, Prof. Dr., Artificial Intelligence in Management


[1727]
L. Girrbach, Y. Huang, S. Alaniz, T. Darrell and Z. Akata.
Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs).
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

Pre-trained large language models (LLMs) have been reliably integrated with visual input for multimodal tasks. The widespread adoption of instruction-tuned image-to-text vision-language assistants (VLAs) like LLaVA and InternVL necessitates evaluating gender biases. We study gender bias in 22 popular open-source VLAs with respect to personality traits, skills, and occupations. Our results show that VLAs replicate human biases likely present in the data, such as real-world occupational imbalances. Similarly, they tend to attribute more skills and positive personality traits to women than to men, and we see a consistent tendency to associate negative personality traits with men. To eliminate the gender bias in these models, we find that finetuning-based debiasing methods achieve the best tradeoff between debiasing and retaining performance on downstream tasks. We argue for pre-deployment gender bias assessment in VLAs and motivate further development of debiasing strategies to ensure equitable societal outcomes.

MCML Authors
Leander Girrbach, Interpretable and Reliable Machine Learning
Yiran Huang, Interpretable and Reliable Machine Learning
Stephan Alaniz, Dr., Interpretable and Reliable Machine Learning
Zeynep Akata, Prof. Dr., Interpretable and Reliable Machine Learning


[1726]
K. Heß and S. Feuerriegel.
Stabilized Neural Prediction of Potential Outcomes in Continuous Time.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

Patient trajectories from electronic health records are widely used to predict potential outcomes of treatments over time, which then makes it possible to personalize care. Yet, existing neural methods for this purpose have a key limitation: while some adjust for time-varying confounding, these methods assume that the time series are recorded in discrete time. In other words, they are constrained to settings where measurements and treatments are conducted at fixed time steps, even though this is unrealistic in medical practice. In this work, we aim to predict potential outcomes in continuous time. The latter is of direct practical relevance because it allows for modeling patient trajectories where measurements and treatments take place at arbitrary, irregular timestamps. We thus propose a new method called stabilized continuous time inverse propensity network (SCIP-Net). For this, we further derive stabilized inverse propensity weights for robust prediction of the potential outcomes. To the best of our knowledge, our SCIP-Net is the first neural method that performs proper adjustments for time-varying confounding in continuous time.
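The stabilization ingredient can be shown in a single-time-step sketch: stabilized weights divide the marginal treatment probability by the conditional propensity, so their population mean is 1 and they avoid the extreme values of plain inverse propensity weights (an illustrative simplification of the paper's continuous-time construction):

```python
import numpy as np

def stabilized_weights(a, propensity):
    """Stabilized inverse propensity weights: marginal treatment probability
    over conditional propensity. Their population mean is exactly 1, which is
    what keeps the weights from exploding. (Single-time-step illustration.)"""
    p_marginal = a.mean()
    p_cond = np.where(a == 1, propensity, 1 - propensity)
    num = np.where(a == 1, p_marginal, 1 - p_marginal)
    return num / p_cond

rng = np.random.default_rng(0)
x = rng.normal(size=50_000)
propensity = 1 / (1 + np.exp(-x))          # true P(A=1 | X)
a = (rng.random(50_000) < propensity).astype(int)
w = stabilized_weights(a, propensity)
```

Unstabilized weights `1 / p_cond` can be arbitrarily large for rarely treated patients; the marginal-probability numerator tempers exactly those cases, which matters even more when weights are multiplied across time steps.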

MCML Authors
Konstantin Heß, Artificial Intelligence in Management
Stefan Feuerriegel, Prof. Dr., Artificial Intelligence in Management


[1725]
J. Kaiser, K. Schwethelm, D. Rückert and G. Kaissis.
Laplace Sample Information: Data Informativeness Through a Bayesian Lens.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Accurately estimating the informativeness of individual samples in a dataset is an important objective in deep learning, as it can guide sample selection, which can improve model efficiency and accuracy by removing redundant or potentially harmful samples. We propose the Laplace Sample Information (LSI) measure of sample informativeness, grounded in information theory and widely applicable across model architectures and learning settings. LSI leverages a Bayesian approximation to the weight posterior and the KL divergence to measure the change in the parameter distribution induced by a sample of interest from the dataset. We experimentally show that LSI is effective in ordering the data with respect to typicality, detecting mislabeled samples, measuring class-wise informativeness, and assessing dataset difficulty. We demonstrate these capabilities of LSI on image and text data in supervised and unsupervised settings. Moreover, we show that LSI can be computed efficiently through probes and transfers well to the training of large models.
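Since LSI is a KL divergence between two Laplace-approximated posteriors — one with the sample included and one without — a one-dimensional Gaussian sketch conveys the mechanics; the posterior numbers below are invented purely for illustration:

```python
import math

def gaussian_kl(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) in closed form."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1)

def laplace_sample_info(post_with, post_without):
    # LSI-style score: divergence between the (Laplace-approximated) weight
    # posterior with the sample included vs. left out. Larger = more informative.
    return gaussian_kl(*post_with, *post_without)

# Hypothetical (mean, variance) pairs for a single weight's posterior:
typical = laplace_sample_info((0.50, 0.10), (0.51, 0.10))   # barely moves posterior
atypical = laplace_sample_info((0.50, 0.10), (0.90, 0.08))  # shifts it substantially
```

A redundant sample barely moves the posterior and scores near zero, while a mislabeled or atypical one shifts it and scores high — the ordering behavior the abstract reports, here in closed form for Gaussians.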

MCML Authors
Daniel Rückert, Prof. Dr., Artificial Intelligence in Healthcare and Medicine
Georgios Kaissis, Dr. (* Former Principal Investigator)


[1724]
C. Kern, M. P. Kim and A. Zhou.
Multi-Accurate CATE is Robust to Unknown Covariate Shifts.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Estimating heterogeneous treatment effects is important to tailor treatments to those individuals who would most likely benefit. However, conditional average treatment effect predictors may often be trained on one population but deployed on different, possibly unknown populations. We use methodology for learning multi-accurate predictors to post-process CATE T-learners (differenced regressions) to become robust to unknown covariate shifts at the time of deployment. The method works in general for pseudo-outcome regression, such as the DR-learner. We show how this approach can combine (large) confounded observational and (smaller) randomized datasets by learning a confounded predictor from the observational dataset, and auditing for multi-accuracy on the randomized controlled trial. We show improvements in bias and mean squared error in simulations with increasingly larger covariate shift, and on a semi-synthetic case study of a parallel large observational study and smaller randomized controlled experiment. Overall, we establish a connection between methods developed for multi-distribution learning and achieve appealing desiderata (e.g. external validity) in causal inference and machine learning.

MCML Authors
Christoph Kern, Prof. Dr., Social Data Science and AI Lab


[1723]
C. Kolb, T. Weber, B. Bischl and D. Rügamer.
Deep Weight Factorization: Sparse Learning Through the Lens of Artificial Symmetries.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

Sparse regularization techniques are well-established in machine learning, yet their application in neural networks remains challenging due to the non-differentiability of penalties like the L1 norm, which is incompatible with stochastic gradient descent. A promising alternative is shallow weight factorization, where weights are decomposed into two factors, allowing for smooth optimization of L1-penalized neural networks by adding differentiable L2 regularization to the factors. In this work, we introduce deep weight factorization, extending previous shallow approaches to more than two factors. We theoretically establish equivalence of our deep factorization with non-convex sparse regularization and analyze its impact on training dynamics and optimization. Due to the limitations posed by standard training practices, we propose a tailored initialization scheme and identify important learning rate requirements necessary for training factorized networks. We demonstrate the effectiveness of our deep weight factorization through experiments on various architectures and datasets, consistently outperforming its shallow counterpart and widely used pruning methods.
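The shallow two-factor case rests on the identity min over u·v = w of (u² + v²)/2 = |w|: a smooth L2 penalty on the factors induces the non-smooth L1 penalty on their product, and deeper factorizations induce even sparser non-convex penalties. A quick numerical check of the depth-2 identity (the brute-force grid is illustrative):

```python
import numpy as np

def factorized_penalty_depth2(w, grid=np.linspace(0.05, 5.0, 2000)):
    """Brute-force minimum of the L2 penalty over two-factor splits u * v = w.
    The identity underlying (shallow) weight factorization is
        min_{u*v=w} (u^2 + v^2) / 2 = |w|,
    i.e. differentiable L2 regularization on the factors acts as L1 on the
    product; with D > 2 factors one obtains sparser non-convex penalties."""
    u = grid
    return (0.5 * (u ** 2 + (w / u) ** 2)).min()
```

The minimum is attained at |u| = |v| = sqrt(|w|), which is why training dynamics and initialization need care: the factorized landscape has symmetries (u, v) -> (cu, v/c) absent from the original parametrization.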

MCML Authors
Chris Kolb, Statistical Learning and Data Science
Bernd Bischl, Prof. Dr., Statistical Learning and Data Science
David Rügamer, Prof. Dr., Statistics, Data Science and Machine Learning


[1722]
M. Kollovieh, M. Lienen, D. Lüdke, L. Schwinn and S. Günnemann.
Flow Matching with Gaussian Process Priors for Probabilistic Time Series Forecasting.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Recent advancements in generative modeling, particularly diffusion models, have opened new directions for time series modeling, achieving state-of-the-art performance in forecasting and synthesis. However, the reliance of diffusion-based models on a simple, fixed prior complicates the generative process since the data and prior distributions differ significantly. We introduce TSFlow, a conditional flow matching (CFM) model for time series combining Gaussian processes, optimal transport paths, and data-dependent prior distributions. By incorporating (conditional) Gaussian processes, TSFlow aligns the prior distribution more closely with the temporal structure of the data, enhancing both unconditional and conditional generation. Furthermore, we propose conditional prior sampling to enable probabilistic forecasting with an unconditionally trained model. In our experimental evaluation on eight real-world datasets, we demonstrate the generative capabilities of TSFlow, producing high-quality unconditional samples. Finally, we show that both conditionally and unconditionally trained models achieve competitive results across multiple forecasting benchmarks.
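The two ingredients — a data-shaped prior and optimal-transport conditional paths — can be sketched in a few lines: draw the source sample from a smooth Gaussian process instead of white noise, and form the straight-line flow-matching path with its constant target velocity (training loop omitted; all names are illustrative):

```python
import numpy as np

def cfm_pair(x0, x1, t):
    """Linear (optimal-transport) conditional flow matching: the point on the
    straight path from prior draw x0 to data series x1, and the constant
    target velocity a network would be regressed onto."""
    xt = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, v_target

def gp_prior_sample(n, rng, lengthscale=5.0):
    # Smooth GP prior (squared-exponential kernel): closer to the temporal
    # structure of time series than a white-noise prior.
    idx = np.arange(n)
    K = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / lengthscale) ** 2)
    K += 1e-6 * np.eye(n)                  # jitter for numerical stability
    return np.linalg.cholesky(K) @ rng.normal(size=n)

rng = np.random.default_rng(0)
x1 = np.sin(np.linspace(0, 6, 64))         # "data" series
x0 = gp_prior_sample(64, rng)              # GP prior draw (TSFlow's idea)
xt, v = cfm_pair(x0, x1, 0.25)
```

Because the GP draw already resembles a time series, the transport from prior to data is shorter than from white noise, which is the motivation the abstract gives for data-dependent priors.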

MCML Authors
Marcel Kollovieh, Data Analytics & Machine Learning
Stephan Günnemann, Prof. Dr., Data Analytics & Machine Learning


[1721]
R. G. Laiz, T. Schmidt and S. Schneider.
Self-supervised contrastive learning performs non-linear system identification.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Self-supervised learning (SSL) approaches have brought tremendous success across many tasks and domains. It has been argued that these successes can be attributed to a link between SSL and identifiable representation learning: Temporal structure and auxiliary variables ensure that latent representations are related to the true underlying generative factors of the data. Here, we deepen this connection and show that SSL can perform system identification in latent space. We propose DynCL, a framework to uncover linear, switching linear and non-linear dynamics under a non-linear observation model, give theoretical guarantees and validate them empirically.

MCML Authors
Tobias Schmidt, Dynamical Inference
Steffen Schneider, Dr., Dynamical Inference


[1720]
Y. Li, D. Rügamer, B. Bischl and M. Rezaei.
Calibrating LLMs with Information-Theoretic Evidential Deep Learning.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Fine-tuned large language models (LLMs) often exhibit overconfidence, particularly when trained on small datasets, resulting in poor calibration and inaccurate uncertainty estimates. Evidential Deep Learning (EDL), an uncertainty-aware approach, enables uncertainty estimation in a single forward pass, making it a promising method for calibrating fine-tuned LLMs. However, despite its computational efficiency, EDL is prone to overfitting, as its training objective can result in overly concentrated probability distributions. To mitigate this, we propose regularizing EDL by incorporating an information bottleneck (IB). Our approach IB-EDL suppresses spurious information in the evidence generated by the model and encourages truly predictive information to influence both the predictions and uncertainty estimates. Extensive experiments across various fine-tuned LLMs and tasks demonstrate that IB-EDL outperforms both existing EDL and non-EDL approaches. By improving the trustworthiness of LLMs, IB-EDL facilitates their broader adoption in domains requiring high levels of confidence calibration.
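For background, the standard EDL readout that IB-EDL builds on maps per-class evidence to a Dirichlet distribution; low total evidence yields high uncertainty in a single forward pass. A minimal sketch using the standard subjective-logic formulas (the paper's information-bottleneck regularizer is not shown):

```python
def edl_outputs(evidence):
    """Standard evidential-deep-learning readout: non-negative per-class
    evidence -> Dirichlet concentration alpha = e + 1, expected class
    probabilities alpha / S, and vacuity-style uncertainty K / S
    (high exactly when total evidence is low)."""
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    s = sum(alpha)
    probs = [a / s for a in alpha]
    uncertainty = k / s
    return probs, uncertainty

confident = edl_outputs([40.0, 1.0, 1.0])  # strong evidence for class 0
vacuous = edl_outputs([0.0, 0.0, 0.0])     # no evidence: uniform, fully uncertain
```

Overfitting in EDL corresponds to the model emitting excessive, spurious evidence (overly concentrated Dirichlets); the paper's bottleneck suppresses exactly that excess.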

MCML Authors
Yawei Li, Statistical Learning and Data Science
David Rügamer, Prof. Dr., Statistics, Data Science and Machine Learning
Bernd Bischl, Prof. Dr., Statistical Learning and Data Science
Mina Rezaei, Dr., Statistical Learning and Data Science


[1719]
H. Lim, J. Choi, J. Choo and S. Schneider.
Sparse autoencoders reveal selective remapping of visual concepts during adaptation.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Adapting foundation models for specific purposes has become a standard approach to build machine learning systems for downstream applications. Yet, it is an open question which mechanisms take place during adaptation. Here we develop a new Sparse Autoencoder (SAE) for the CLIP vision transformer, named PatchSAE, to extract interpretable concepts at granular levels (e.g., shape, color, or semantics of an object) and their patch-wise spatial attributions. We explore how these concepts influence the model output in downstream image classification tasks and investigate how recent state-of-the-art prompt-based adaptation techniques change the association of model inputs to these concepts. While activations of concepts slightly change between adapted and non-adapted models, we find that the majority of gains on common adaptation tasks can be explained with the existing concepts already present in the non-adapted foundation model. This work provides a concrete framework to train and use SAEs for Vision Transformers and provides insights into explaining adaptation mechanisms.

MCML Authors
Steffen Schneider, Dr., Dynamical Inference


[1718]
L. Lux, A. H. Berger, A. Weers, N. Stucki, D. Rückert, U. Bauer and J. C. Paetzold.
Topograph: An efficient Graph-Based Framework for Strictly Topology Preserving Image Segmentation.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

Topological correctness plays a critical role in many image segmentation tasks, yet most networks are trained using pixel-wise loss functions, such as Dice, neglecting topological accuracy. Existing topology-aware methods often lack robust topological guarantees, are limited to specific use cases, or impose high computational costs. In this work, we propose a novel, graph-based framework for topologically accurate image segmentation that is both computationally efficient and generally applicable. Our method constructs a component graph that fully encodes the topological information of both the prediction and ground truth, allowing us to efficiently identify topologically critical regions and aggregate a loss based on local neighborhood information. Furthermore, we introduce a strict topological metric capturing the homotopy equivalence between the union and intersection of prediction-label pairs. We formally prove the topological guarantees of our approach and empirically validate its effectiveness on binary and multi-class datasets. Our loss demonstrates state-of-the-art performance with up to fivefold faster loss computation compared to persistent homology methods.
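The coarsest version of a topology-aware comparison — counting connected components (Betti-0) of prediction versus ground truth — can be sketched with a flood fill; the paper's component graph and strict homotopy-based metric are substantially finer-grained than this illustration:

```python
import numpy as np

def count_components(mask):
    """4-connected component count of a binary mask via iterative flood fill."""
    mask = mask.astype(bool)
    seen = np.zeros_like(mask)
    comps = 0
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            if mask[i, j] and not seen[i, j]:
                comps += 1
                stack = [(i, j)]
                while stack:
                    a, b = stack.pop()
                    if 0 <= a < mask.shape[0] and 0 <= b < mask.shape[1] \
                            and mask[a, b] and not seen[a, b]:
                        seen[a, b] = True
                        stack += [(a + 1, b), (a - 1, b), (a, b + 1), (a, b - 1)]
    return comps

def betti0_error(pred, gt):
    # Coarse topological metric: difference in connected-component counts.
    return abs(count_components(pred) - count_components(gt))

gt = np.array([[1, 1, 0, 0],
               [1, 1, 0, 0],
               [0, 0, 1, 1]])
pred = np.array([[1, 1, 0, 1],   # spurious extra component at top-right
                 [1, 1, 0, 0],
                 [0, 0, 1, 1]])
```

A pixel-wise loss like Dice would barely register the single wrong pixel here, while a component-count metric flags it as a topological error — the gap between pixel accuracy and topological correctness that motivates the paper.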

MCML Authors
Link to website

Laurin Lux

Artificial Intelligence in Healthcare and Medicine

Link to website

Alexander Weers

Artificial Intelligence in Healthcare and Medicine

Link to website

Nico Stucki

Applied Topology and Geometry

Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine

Link to Profile Ulrich Bauer

Ulrich Bauer

Prof. Dr.

Applied Topology and Geometry


[1717]
G. Manten, C. Casolo, E. Ferrucci, S. Mogensen, C. Salvi and N. Kilbertus.
Signature Kernel Conditional Independence Tests in Causal Discovery for Stochastic Processes.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

Inferring the causal structure underlying stochastic dynamical systems from observational data holds great promise in domains ranging from science and health to finance. Such processes can often be accurately modeled via stochastic differential equations (SDEs), which naturally imply causal relationships via ‘which variables enter the differential of which other variables’. In this paper, we develop conditional independence (CI) constraints on coordinate processes over selected intervals that are Markov with respect to the acyclic dependence graph (allowing self-loops) induced by a general SDE model. We then provide a sound and complete causal discovery algorithm, capable of handling both fully and partially observed data, and uniquely recovering the underlying or induced ancestral graph by exploiting time directionality assuming a CI oracle. Finally, to make our algorithm practically usable, we also propose a flexible, consistent signature kernel-based CI test to infer these constraints from data. We extensively benchmark the CI test in isolation and as part of our causal discovery algorithms, outperforming existing approaches in SDE models and beyond.

MCML Authors
Link to website

Georg Manten

Ethics in Systems Design and Machine Learning

Link to website

Cecilia Casolo

Ethics in Systems Design and Machine Learning

Link to Profile Niki Kilbertus

Niki Kilbertus

Prof. Dr.

Ethics in Systems Design and Machine Learning


[1716]
M. Muschalik, F. Fumagalli, P. Frazzetto, J. Strotherm, L. Hermes, A. Sperduti, E. Hüllermeier and B. Hammer.
Exact Computation of Any-Order Shapley Interactions for Graph Neural Networks.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

Despite the ubiquitous use of Graph Neural Networks (GNNs) in machine learning (ML) prediction tasks involving graph-structured data, their interpretability remains challenging. In explainable artificial intelligence (XAI), the Shapley Value (SV) is the predominant method to quantify contributions of individual features to an ML model’s output. Addressing the limitations of SVs in complex prediction models, Shapley Interactions (SIs) extend the SV to groups of features. In this work, we explain single graph predictions of GNNs with SIs that quantify node contributions and interactions among multiple nodes. By exploiting the GNN architecture, we show that the structure of interactions in node embeddings is preserved for graph prediction. As a result, the exponential complexity of SIs depends only on the receptive fields, i.e. the message-passing ranges determined by the connectivity of the graph and the number of convolutional layers. Based on our theoretical results, we introduce GraphSHAP-IQ, an efficient approach to compute any-order SIs exactly. GraphSHAP-IQ is applicable to popular message passing techniques in conjunction with a linear global pooling and output layer. We showcase that GraphSHAP-IQ substantially reduces the exponential complexity of computing exact SIs on multiple benchmark datasets. Beyond exact computation, we evaluate GraphSHAP-IQ’s approximation of SIs on popular GNN architectures and compare with existing baselines. Lastly, we visualize SIs of real-world water distribution networks and molecule structures using an SI-Graph.
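For intuition, the Shapley value underlying SIs can be computed exactly by brute-force enumeration of coalitions; the exponential blow-up this incurs is precisely what GraphSHAP-IQ avoids by restricting computation to receptive fields. A toy sketch (pure Python; the example game is invented):

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions (exponential; toy sizes only)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        rest = [q for q in players if q != p]
        for k in range(n):
            for S in combinations(rest, k):
                # Shapley weight for a coalition of size k
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += w * (value(set(S) | {p}) - value(set(S)))
    return phi

# Toy game: the value of a coalition is the sum of its members' weights.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
v = lambda S: sum(weights[p] for p in S)
phi = shapley_values(list(weights), v)
print(phi)  # additive game -> each player's Shapley value equals its own weight
```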

MCML Authors
Link to website

Maximilian Muschalik

Artificial Intelligence and Machine Learning

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1715]
L. Rauchwerger, S. Jegelka and R. Levie.
Generalization, Expressivity, and Universality of Graph Neural Networks on Attributed Graphs.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

We analyze the universality and generalization of graph neural networks (GNNs) on attributed graphs, i.e., with node attributes. To this end, we propose pseudometrics over the space of all attributed graphs that describe the fine-grained expressivity of GNNs. Namely, GNNs are both Lipschitz continuous with respect to our pseudometrics and can separate attributed graphs that are distant in the metric. Moreover, we prove that the space of all attributed graphs is relatively compact with respect to our metrics. Based on these properties, we prove a universal approximation theorem for GNNs and generalization bounds for GNNs on any data distribution of attributed graphs. The proposed metrics compute the similarity between the structures of attributed graphs via a hierarchical optimal transport between computation trees. Our work extends and unites previous approaches which either derived theory only for graphs with no attributes, derived compact metrics under which GNNs are continuous but without separation power, or derived metrics under which GNNs are continuous and separate points but the space of graphs is not relatively compact, which prevents universal approximation and generalization analysis.

MCML Authors
Link to Profile Stefanie Jegelka

Stefanie Jegelka

Prof. Dr.

Foundations of Deep Neural Networks


[1714]
L. Sang, Z. Canfes, D. Cao, F. Bernard and D. Cremers.
Implicit Neural Surface Deformation with Explicit Velocity Fields.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

In this work, we introduce the first unsupervised method that simultaneously predicts time-varying neural implicit surfaces and deformations between pairs of point clouds. We propose to model the point movement using an explicit velocity field and directly deform a time-varying implicit field using the modified level-set equation. This equation utilizes an iso-surface evolution with Eikonal constraints in a compact formulation, ensuring the integrity of the signed distance field. By applying a smooth, volume-preserving constraint to the velocity field, our method successfully recovers physically plausible intermediate shapes. Our method is able to handle both rigid and non-rigid deformations without any intermediate shape supervision. Our experimental results demonstrate that our method significantly outperforms existing works, delivering superior results in both quality and efficiency.

MCML Authors
Link to website

Lu Sang

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1713]
P. Scholl, K. Bieker, H. Hauger and G. Kutyniok.
ParFam -- (Neural Guided) Symbolic Regression Based on Continuous Global Optimization.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv GitHub
Abstract

The problem of symbolic regression (SR) arises in many different applications, such as identifying physical laws or deriving mathematical equations describing the behavior of financial markets from given data. Various methods exist to address the problem of SR, often based on genetic programming. However, these methods are usually complicated and involve various hyperparameters. In this paper, we present our new approach ParFam that utilizes parametric families of suitable symbolic functions to translate the discrete symbolic regression problem into a continuous one, resulting in a more straightforward setup compared to current state-of-the-art methods. In combination with a global optimizer, this approach results in a highly effective method to tackle the problem of SR. We theoretically analyze the expressivity of ParFam and demonstrate its performance with extensive numerical experiments based on the common SR benchmark suite SRBench, showing that we achieve state-of-the-art results. Moreover, we present an extension, DL-ParFam, which incorporates a pre-trained transformer network to guide ParFam, accelerating the optimization process by up to two orders of magnitude.
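The translation from discrete to continuous search can be shown in a stripped-down form: fix a parametric family of candidate expressions and optimize its coefficients continuously. A minimal sketch (NumPy; the basis and example are invented, and ParFam itself uses richer families with a global optimizer rather than plain least squares):

```python
import numpy as np

# Parametric family of expressions: f(x; a) = a0 + a1*x + a2*sin(x) + a3*x^2.
# Symbolic regression then becomes continuous optimization over the coefficients a.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = 2.0 * np.sin(x) + 0.5 * x**2   # hidden ground-truth law to recover

basis = np.stack([np.ones_like(x), x, np.sin(x), x**2], axis=1)
a, *_ = np.linalg.lstsq(basis, y, rcond=None)
print(a)  # ≈ [0, 0, 2, 0.5]: the symbolic law is read off from the coefficients
```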

MCML Authors
Link to website

Philipp Scholl

Mathematical Foundations of Artificial Intelligence

Link to Profile Gitta Kutyniok

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence


[1712]
M. Schröder, V. Melnychuk and S. Feuerriegel.
Differentially private learners for heterogeneous treatment effects.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Patient data is widely used to estimate heterogeneous treatment effects and understand the effectiveness and safety of drugs. Yet, patient data includes highly sensitive information that must be kept private. In this work, we aim to estimate the conditional average treatment effect (CATE) from observational data under differential privacy. Specifically, we present DP-CATE, a novel framework for CATE estimation that is doubly robust and ensures differential privacy of the estimates. For this, we build upon non-trivial tools from semi-parametric and robust statistics to exploit the connection between privacy and model robustness. Our framework is highly general and applies to any two-stage CATE meta-learner with a Neyman-orthogonal loss function. It can be used with all machine learning models employed for nuisance estimation. We further provide an extension of DP-CATE where we employ RKHS regression to release the complete doubly robust CATE function while ensuring differential privacy. We demonstrate the effectiveness of DP-CATE across various experiments using synthetic and real-world datasets. To the best of our knowledge, we are the first to provide a framework for CATE estimation that is doubly robust and differentially private.
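As background, the basic building block for releasing any private estimate is noise calibrated to the estimate's sensitivity. A minimal sketch of the standard Gaussian mechanism (not DP-CATE itself; all numbers are illustrative):

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, epsilon, delta, rng):
    """Release value + calibrated Gaussian noise: (epsilon, delta)-DP for the given L2 sensitivity."""
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return value + rng.normal(0.0, sigma), sigma

rng = np.random.default_rng(0)
tau_hat = 0.8   # hypothetical (non-private) treatment-effect point estimate
private_tau, sigma = gaussian_mechanism(tau_hat, sensitivity=0.05,
                                        epsilon=1.0, delta=1e-5, rng=rng)
print(sigma > 0)
```

The connection the paper exploits is that doubly robust estimators have bounded sensitivity to single patients, which keeps the calibrated noise (and hence the loss in precision) small.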

MCML Authors
Link to website

Maresa Schröder

Artificial Intelligence in Management

Link to website

Valentyn Melnychuk

Artificial Intelligence in Management

Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1711]
Y. Shehata, B. Holzschuh and N. Thuerey.
Improved Sampling Of Diffusion Models In Fluid Dynamics With Tweedie's Formula.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

State-of-the-art Denoising Diffusion Probabilistic Models (DDPMs) rely on an expensive sampling process with a large Number of Function Evaluations (NFEs) to provide high-fidelity predictions. This computational bottleneck renders diffusion models less appealing as surrogates for the spatio-temporal prediction of physics-based problems with long rollout horizons. We propose Truncated Sampling Models, enabling single-step and few-step sampling with elevated fidelity by simple truncation of the diffusion process, reducing the gap between DDPMs and deterministic single-step approaches. We also introduce a novel approach, Iterative Refinement, to sample pre-trained DDPMs by reformulating the generative process as a refinement process with few sampling steps. Both proposed methods enable significant improvements in accuracy compared to DDPMs, DDIMs, and EDMs with NFEs ≤ 10 on a diverse set of experiments, including incompressible and compressible turbulent flow and airfoil flow uncertainty simulations. Our proposed methods provide stable predictions for long rollout horizons in time-dependent problems and are able to learn all modes of the data distribution in steady-state problems with high uncertainty.
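Tweedie's formula, referenced in the title, estimates the clean signal in one step from a noisy observation and the score of the noised marginal. For a Gaussian toy distribution the score is available in closed form, so the formula can be checked against the exact posterior mean (illustrative sketch, not the paper's sampler):

```python
# Tweedie's formula: E[x0 | xt] = xt + sigma^2 * score(xt),
# where score(xt) = d/dx log p(xt) and xt = x0 + sigma * noise.
mu0, s0, sigma = 1.0, 2.0, 0.5   # data mean/std and noise std (illustrative)
xt = 2.3                         # a noisy observation

score = -(xt - mu0) / (s0**2 + sigma**2)   # score of the Gaussian noised marginal
x0_tweedie = xt + sigma**2 * score         # one-step denoised estimate
x0_exact = (s0**2 * xt + sigma**2 * mu0) / (s0**2 + sigma**2)  # exact posterior mean
print(abs(x0_tweedie - x0_exact) < 1e-12)  # True
```

Truncating the diffusion process and jumping to the data estimate in this way is what enables single-step or few-step sampling.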

MCML Authors
Link to Profile Nils Thuerey

Nils Thuerey

Prof. Dr.

Physics-based Simulation


[1710]
E. Sommer, J. Robnik, G. Nozadze, U. Seljak and D. Rügamer.
Microcanonical Langevin Ensembles: Advancing the Sampling of Bayesian Neural Networks.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Despite recent advances, sampling-based inference for Bayesian Neural Networks (BNNs) remains a significant challenge in probabilistic deep learning. While sampling-based approaches do not require a variational distribution assumption, current state-of-the-art samplers still struggle to navigate the complex and highly multimodal posteriors of BNNs. As a consequence, sampling still requires considerably longer inference times than non-Bayesian methods even for small neural networks, despite recent advances in making software implementations more efficient. Besides the difficulty of finding high-probability regions, the time until samplers provide sufficient exploration of these areas remains unpredictable. To tackle these challenges, we introduce an ensembling approach that leverages strategies from optimization and a recently proposed sampler called Microcanonical Langevin Monte Carlo (MCLMC) for efficient, robust and predictable sampling performance. Compared to approaches based on the state-of-the-art No-U-Turn Sampler, our approach delivers substantial speedups up to an order of magnitude, while maintaining or improving predictive performance and uncertainty quantification across diverse tasks and data modalities. The suggested Microcanonical Langevin Ensembles and modifications to MCLMC additionally enhance the method’s predictability in resource requirements, facilitating easier parallelization. All in all, the proposed method offers a promising direction for practical, scalable inference for BNNs.

MCML Authors
Link to website

Emanuel Sommer

Statistics, Data Science and Machine Learning

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[1709]
B. Tahmasebi and S. Jegelka.
Generalization Bounds for Canonicalization: A Comparative Study with Group Averaging.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Canonicalization, a popular method for generating invariant or equivariant function classes from arbitrary function sets, involves initial data projection onto a reduced input space subset, followed by applying any learning method to the projected dataset. Despite recent research on the expressive power and continuity of functions represented by canonicalization, its generalization capabilities remain less explored. This paper addresses this gap by theoretically examining the generalization benefits and sample complexity of canonicalization, comparing them with group averaging, another popular technique for creating invariant or equivariant function classes. Our findings reveal two distinct regimes where canonicalization may outperform or underperform compared to group averaging, with precise quantification of this phase transition in terms of sample size, group action characteristics, and a newly introduced concept of alignment. To the best of our knowledge, this study represents the first theoretical exploration of such behavior, offering insights into the relative effectiveness of canonicalization and group averaging under varying conditions.
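The two constructions being compared are easy to state concretely for permutation invariance: canonicalization applies the learner to one canonical representative of the orbit, while group averaging sums over the whole orbit. A toy sketch (pure Python; the base function is invented):

```python
from itertools import permutations

def f(x):
    """Arbitrary, non-invariant base function."""
    return sum((i + 1) * v for i, v in enumerate(x))

def f_canon(x):
    # Canonicalization: project onto a canonical representative (here: sorted
    # order), then apply any learner -- one evaluation of f.
    return f(sorted(x))

def f_avg(x):
    # Group averaging: average f over the whole orbit -- |G| = n! evaluations.
    perms = list(permutations(x))
    return sum(f(p) for p in perms) / len(perms)

x, x_perm = [3, 1, 2], [2, 3, 1]
print(f_canon(x) == f_canon(x_perm))           # True: invariant at one projection
print(abs(f_avg(x) - f_avg(x_perm)) < 1e-12)   # True: invariant at n! evaluations
```

Both constructions yield invariant functions; the paper's question is which one generalizes better, which turns out to depend on sample size, the group action, and alignment.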

MCML Authors
Link to Profile Stefanie Jegelka

Stefanie Jegelka

Prof. Dr.

Foundations of Deep Neural Networks


[1708]
T. Uscidda, L. Eyring, K. Roth, F. J. Theis, Z. Akata and M. Cuturi.
Disentangled Representation Learning with the Gromov-Monge Gap.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

Learning disentangled representations from unlabelled data is a fundamental challenge in machine learning. Solving it may unlock other problems, such as generalization, interpretability, or fairness. Although remarkably challenging to solve in theory, disentanglement is often achieved in practice through prior matching. Furthermore, recent works have shown that prior matching approaches can be enhanced by leveraging geometrical considerations, e.g., by learning representations that preserve geometric features of the data, such as distances or angles between points. However, matching the prior while preserving geometric features is challenging, as a mapping that fully preserves these features while aligning the data distribution with the prior does not exist in general. To address these challenges, we introduce a novel approach to disentangled representation learning based on quadratic optimal transport. We formulate the problem using Gromov-Monge maps that transport one distribution onto another with minimal distortion of predefined geometric features, preserving them as much as can be achieved. To compute such maps, we propose the Gromov-Monge-Gap (GMG), a regularizer quantifying whether a map moves a reference distribution with minimal geometry distortion. We demonstrate the effectiveness of our approach for disentanglement across four standard benchmarks, outperforming other methods leveraging geometric considerations.

MCML Authors
Link to website

Luca Eyring

Interpretable and Reliable Machine Learning

Link to website

Karsten Roth

Interpretable and Reliable Machine Learning

Link to Profile Fabian Theis

Fabian Theis

Prof. Dr.

Mathematical Modelling of Biological Systems

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[1707]
X. Wang, C. Hu, P. Röttger and B. Plank.
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours: Models should refuse to follow malicious instructions or give harmful advice (e.g. ‘how do I kill someone?’), but they should not refuse safe requests, even if they superficially resemble unsafe ones (e.g. ‘how do I kill a Python process?’). Avoiding such false refusal, as prior work has shown, is challenging even for highly-capable language models. In this paper, we propose a simple and surgical method for mitigating false refusal in language models via single vector ablation. For a given model, we extract a false refusal vector and show that ablating this vector reduces the false refusal rate while preserving the model’s safety and general capabilities. We also show that our approach can be used for fine-grained calibration of model safety. Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.
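The geometric operation at the heart of the method, removing the component of hidden states along a single direction, is a one-line projection. A minimal sketch (NumPy; the direction here is random, whereas the paper extracts a false refusal vector from model activations):

```python
import numpy as np

def ablate_direction(h, v):
    """Remove the component of activations h along direction v: h - (h·v̂) v̂."""
    v = v / np.linalg.norm(v)   # work with the unit direction
    return h - np.outer(h @ v, v) if h.ndim == 2 else h - (h @ v) * v

rng = np.random.default_rng(0)
v = rng.normal(size=16)          # stand-in for an extracted 'false refusal' direction
H = rng.normal(size=(5, 16))     # a batch of hidden states
H_abl = ablate_direction(H, v)
print(np.allclose(H_abl @ (v / np.linalg.norm(v)), 0.0))  # True: component removed
```

Because the rest of each activation is untouched, such an intervention is 'surgical': only behaviour mediated by that one direction changes.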

MCML Authors
Link to website

Xinpeng Wang

AI and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1706]
Y. Wang, M. Schröder, D. Frauen, J. Schweisthal, K. Heß and S. Feuerriegel.
Constructing Confidence Intervals for Average Treatment Effects from Multiple Datasets.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

Constructing confidence intervals (CIs) for the average treatment effect (ATE) from patient records is crucial to assess the effectiveness and safety of drugs. However, patient records typically come from different hospitals, thus raising the question of how multiple observational datasets can be effectively combined for this purpose. In our paper, we propose a new method that estimates the ATE from multiple observational datasets and provides valid CIs. Our method makes few assumptions about the observational datasets and is thus widely applicable in medical practice. The key idea of our method is that we leverage prediction-powered inferences and thereby essentially ‘shrink’ the CIs so that we offer more precise uncertainty quantification as compared to naïve approaches. We further prove the unbiasedness of our method and the validity of our CIs. We confirm our theoretical results through various numerical experiments. Finally, we provide an extension of our method for constructing CIs from combinations of experimental and observational datasets.
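The prediction-powered idea can be sketched for the simplest target, a mean: combine cheap model predictions on a large pool with a bias correction ('rectifier') estimated on a small labeled set. A toy sketch (NumPy; synthetic data and a normal-approximation CI, not the paper's ATE estimator):

```python
import numpy as np

rng = np.random.default_rng(1)

# Small labeled sample vs. a large pool where only model predictions exist.
n, N = 50, 5000
y = rng.normal(2.0, 1.0, n)                              # gold outcomes
f_lab = y + rng.normal(0.0, 0.3, n)                      # predictions on labeled points
f_unlab = rng.normal(2.0, 1.0, N) + rng.normal(0.0, 0.3, N)  # predictions on the pool

# Prediction-powered point estimate: model mean plus a rectifier correction.
rectifier = y - f_lab
theta = f_unlab.mean() + rectifier.mean()

# Normal-approximation 95% CI combining both sources of variance.
se = np.sqrt(f_unlab.var(ddof=1) / N + rectifier.var(ddof=1) / n)
ci = (theta - 1.96 * se, theta + 1.96 * se)
print(ci)
```

When the predictions are accurate, the rectifier has low variance, so the CI is tighter than one built from the labeled sample alone; that is the 'shrinking' effect described above.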

MCML Authors
Link to website

Yuxin Wang

Artificial Intelligence in Management

Link to website

Maresa Schröder

Artificial Intelligence in Management

Link to website

Dennis Frauen

Artificial Intelligence in Management

Link to website

Jonas Schweisthal

Artificial Intelligence in Management

Link to website

Konstantin Heß

Artificial Intelligence in Management

Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1705]
M. Weber, L. Yu, Q. Yu, X. Deng, X. Shen, D. Cremers and L.-C. Chen.
MaskBit: Embedding-free Image Generation via Bit Tokens.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN model for transitioning between latent space and image space, and a subsequent Transformer model for image generation within latent space - these frameworks offer promising avenues for image synthesis. In this study, we present two primary contributions: Firstly, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN. Secondly, a novel embedding-free generation network operating directly on bit tokens - a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed details. The second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark, with a compact generator model of a mere 305M parameters.
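The notion of a bit token, a representation that needs no embedding table because the token is its own binary code, can be sketched directly (NumPy; sign-based quantization here is a simplification of the paper's binary quantization):

```python
import numpy as np

def to_bit_token(z):
    """Binary-quantize a latent vector: each dimension becomes one bit (its sign)."""
    return (z > 0).astype(np.uint8)

def from_bit_token(bits):
    # Map bits back to {-1, +1}. Embedding-free: the token IS its code,
    # so no learned lookup table is needed to re-enter continuous space.
    return bits.astype(np.float32) * 2.0 - 1.0

rng = np.random.default_rng(0)
z = rng.normal(size=12)       # stand-in for a VQGAN latent
bits = to_bit_token(z)
z_q = from_bit_token(bits)
print(bits.tolist(), bool(np.all(np.sign(z) == z_q)))
```

A d-bit code addresses 2^d implicit codewords, which is why bit tokens can carry rich semantics without an explicit codebook.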

MCML Authors
Link to website

Mark Weber

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1704]
Q. Zhang, Y. Wang, J. Cui, X. Pan, Q. Lei, S. Jegelka and Y. Wang.
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Deep learning models often suffer from a lack of interpretability due to polysemanticity, where individual neurons are activated by multiple unrelated semantics, resulting in unclear attributions of model behavior. Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability but are commonly believed to compromise accuracy. In this work, we challenge the prevailing belief of the accuracy-interpretability tradeoff, showing that monosemantic features not only enhance interpretability but also bring concrete gains in model performance. Across multiple robust learning scenarios-including input and label noise, few-shot learning, and out-of-domain generalization-our results show that models leveraging monosemantic features significantly outperform those relying on polysemantic features. Furthermore, we provide empirical and theoretical understandings on the robustness gains of feature monosemanticity. Our preliminary analysis suggests that monosemanticity, by promoting better separation of feature representations, leads to more robust decision boundaries. This diverse evidence highlights the generality of monosemanticity in improving model robustness. As a first step in this new direction, we embark on exploring the learning benefits of monosemanticity beyond interpretability, supporting the long-standing hypothesis of linking interpretability and robustness.

MCML Authors
Link to Profile Stefanie Jegelka

Stefanie Jegelka

Prof. Dr.

Foundations of Deep Neural Networks


[1703]
C. Koke, D. Schnaus, Y. Shen, A. Saroha, M. Eisenberger, B. Rieck, M. M. Bronstein and D. Cremers.
On multi-scale Graph Representation Learning.
LMRL @ICLR 2025 - Workshop on Learning Meaningful Representations of Life at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

While Graph Neural Networks (GNNs) are widely used in modern computational biology, an underexplored drawback of common GNN methods is that they are not inherently multiscale consistent: Two graphs describing the same object or situation at different resolution scales are assigned vastly different latent representations. This prevents graph networks from generating data representations that are consistent across scales. It also complicates the integration of representations at the molecular scale with those generated at the biological scale. Here we discuss why existing GNNs struggle with multiscale consistency and show how to overcome this problem by modifying the message passing paradigm within GNNs.

MCML Authors
Link to website

Christian Koke

Computer Vision & Artificial Intelligence

Link to website

Dominik Schnaus

Computer Vision & Artificial Intelligence

Yuesong Shen

Dr.

* Former Member

Link to website

Abhishek Saroha

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1702]
S. Dziadzio, V. Udandarao, K. Roth, A. Prabhu, Z. Akata, S. Albanie and M. Bethge.
How to Merge Multimodal Models Over Time?
MCDC @ICLR 2025 - Workshop on Modularity for Collaborative, Decentralized, and Continual Deep Learning at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Model merging combines multiple expert models finetuned from a base foundation model on diverse tasks and domains into a single, more capable model. However, most existing model merging approaches assume that all experts are available simultaneously. In reality, new tasks and domains emerge progressively over time, requiring strategies to integrate the knowledge of expert models as they become available: a process we call temporal model merging. The temporal dimension introduces unique challenges not addressed in prior work, raising new questions such as: when training for a new task, should the expert model start from the merged past experts or from the original base model? Should we merge all models at each time step? Which merging techniques are best suited for temporal merging? Should different strategies be used to initialize the training and deploy the model? To answer these questions, we propose a unified framework called TIME (Temporal Integration of Model Expertise) which defines temporal model merging across three axes: (1) initialization, (2) deployment, and (3) merging technique. Using TIME, we study temporal model merging across model sizes, compute budgets, and learning horizons on the FoMo-in-Flux benchmark. Our comprehensive suite of experiments across TIME allows us to build a better understanding of current challenges and best practices for effective temporal model merging.
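A minimal instance of temporal model merging is running weight averaging: each newly trained expert is averaged into the current merged model as it arrives. Sketch (NumPy; parameter names and the update scheme are invented, and TIME studies many such initialization/deployment/merging choices):

```python
import numpy as np

def merge(checkpoints, weights=None):
    """Weighted parameter averaging of same-architecture checkpoints (one simple merging technique)."""
    weights = weights or [1.0 / len(checkpoints)] * len(checkpoints)
    return {k: sum(w * c[k] for w, c in zip(weights, checkpoints))
            for k in checkpoints[0]}

rng = np.random.default_rng(0)
merged = {"layer.w": rng.normal(size=(4, 4))}   # base foundation model
for t in range(3):
    # An expert for a new task arrives over time (simulated as a perturbation)
    # and is merged into the running model rather than kept separately.
    expert = {"layer.w": merged["layer.w"] + rng.normal(0, 0.01, (4, 4))}
    merged = merge([merged, expert])
print(merged["layer.w"].shape)  # (4, 4)
```

The temporal questions above map onto this loop directly: whether each expert starts from `merged` or from the base model, and whether merging happens at every step or only at deployment.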

MCML Authors
Link to website

Karsten Roth

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[1701]
C. Koke, Y. Shen, A. Saroha, M. Eisenberger, B. Rieck, M. M. Bronstein and D. Cremers.
On Incorporating Scale into Graph Networks.
MLMP @ICLR 2025 - Workshop on Machine Learning Multiscale Processes at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. MLMP Best Paper Award. To be published. Preprint available. URL
Abstract

Standard graph neural networks assign vastly different latent embeddings to graphs describing the same physical system at different resolution scales. This precludes consistency in applications and prevents generalization between scales as would fundamentally be needed in many scientific applications. We uncover the underlying obstruction, investigate its origin and show how to overcome it.

MCML Authors
Link to website

Christian Koke

Computer Vision & Artificial Intelligence

Yuesong Shen

Dr.

* Former Member

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1700]
A. Modarressi, A. Köksal, A. Imani, M. Fayyaz and H. Schütze.
MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory.
NFAM @ICLR 2025 - Workshop on New Frontiers in Associative Memories at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

While current large language models (LLMs) demonstrate some capabilities in knowledge-intensive tasks, they are limited by relying on their parameters as an implicit storage mechanism. As a result, they struggle with infrequent knowledge and temporal degradation. In addition, the uninterpretable nature of parametric memorization makes it challenging to understand and prevent hallucination. Parametric memory pools and model editing are only partial solutions. Retrieval Augmented Generation (RAG) – though non-parametric – has its own limitations: it lacks structure, complicates interpretability and makes it hard to effectively manage stored knowledge. In this paper, we introduce MemLLM, a novel method of enhancing LLMs by integrating a structured and explicit read-and-write memory module. MemLLM tackles the aforementioned challenges by enabling dynamic interaction with the memory and improving the LLM’s capabilities in using stored knowledge. Our experiments indicate that MemLLM enhances the LLM’s performance and interpretability, in language modeling in general and knowledge-intensive tasks in particular. We see MemLLM as an important step towards making LLMs more grounded and factual through memory augmentation.
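The explicit memory component can be sketched as a store of (subject, relation, object) triples with filtered reads, roughly the interface a finetuned model would invoke during generation (pure Python; the API names are invented):

```python
class TripleMemory:
    """Explicit read-write memory of (subject, relation, object) triples."""
    def __init__(self):
        self.store = set()

    def write(self, subj, rel, obj):
        self.store.add((subj, rel, obj))

    def read(self, subj=None, rel=None, obj=None):
        # Return all triples matching the given (partial) query.
        return [t for t in self.store
                if (subj is None or t[0] == subj)
                and (rel is None or t[1] == rel)
                and (obj is None or t[2] == obj)]

mem = TripleMemory()
# During generation, the model would emit memory-write and memory-read calls;
# here we simulate them directly:
mem.write("Marie Curie", "field", "physics")
mem.write("Marie Curie", "field", "chemistry")
print(sorted(t[2] for t in mem.read(subj="Marie Curie", rel="field")))
# ['chemistry', 'physics']
```

Unlike parametric storage, every stored fact is inspectable and editable, which is what makes the memory interpretable and manageable.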

MCML Authors
Link to website

Ali Modarressi

Computational Linguistics

Link to website

Ayyoob Imani

Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1699]
L. Wimmer, B. Bischl and L. Bothmann.
Trust Me, I Know the Way: Predictive Uncertainty in the Presence of Shortcut Learning.
SCSL @ICLR 2025 - Workshop on Spurious Correlation and Shortcut Learning: Foundations and Solutions at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

The correct way to quantify predictive uncertainty in neural networks remains a topic of active discussion. In particular, it is unclear whether the state-of-the-art entropy decomposition leads to a meaningful representation of model, or epistemic, uncertainty (EU) in the light of a debate that pits ignorance against disagreement perspectives. We aim to reconcile the conflicting viewpoints by arguing that both are valid but arise from different learning situations. Notably, we show that the presence of shortcuts is decisive for EU manifesting as disagreement.
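The entropy decomposition under discussion splits the entropy of the averaged ensemble predictive into a mean member entropy (aleatoric) plus a disagreement term (epistemic, the mutual information). Sketch (NumPy; the toy ensemble is hand-picked so that disagreement dominates):

```python
import numpy as np

def entropy(p, axis=-1):
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=axis)

def decompose(member_probs):
    """Entropy decomposition over an ensemble: total = aleatoric + epistemic."""
    total = entropy(member_probs.mean(axis=0))            # H of averaged predictive
    aleatoric = entropy(member_probs, axis=-1).mean(0)    # mean member entropy
    return total, aleatoric, total - aleatoric            # epistemic = mutual info

# Three members that are individually confident but disagree ->
# EU manifests as disagreement, not as ignorance.
probs = np.array([[0.95, 0.05],
                  [0.05, 0.95],
                  [0.95, 0.05]])
total, au, eu = decompose(probs)
print(eu > au)  # True: the disagreement term dominates
```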

MCML Authors
Lisa Wimmer

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Ludwig Bothmann

Dr.

Statistical Learning and Data Science


[1698]
C. Kolb, B. Bischl and D. Rügamer.
Differentiable Attention Sparsity via Structured D-Gating.
SLLM @ICLR 2025 - Workshop on Sparsity in LLMs at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

A core component of modern large language models is the attention mechanism, but its immense parameter count necessitates structured sparsity for resource-efficient optimization and inference. Traditional sparsity penalties, such as the group lasso, are non-smooth and thus incompatible with standard stochastic gradient descent methods. To address this, we propose a deep gating mechanism that reformulates the structured sparsity penalty into a fully differentiable optimization problem, allowing effective and principled norm-based group sparsification without requiring specialized non-smooth optimizers. Our theoretical analysis and empirical results demonstrate that this approach enables structured sparsity with simple stochastic gradient descent or variants while maintaining predictive performance.
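The intuition behind such gating reformulations can be checked numerically in the simplest case: writing one parameter group as w = g * v with a smooth squared-L2 penalty on the factors, the minimum over all factorisations recovers the non-smooth group-lasso norm ||w||_2. This is a toy identity check under that depth-2 parameterisation, not the paper's D-gating construction:

```python
import numpy as np

# Depth-2 gating w = g * v with the smooth penalty (g**2 + ||v||**2) / 2.
# Minimising over the factorisation of a fixed w recovers the non-smooth
# group-lasso penalty ||w||_2, so gradient methods only ever see smooth terms.
w = np.array([3.0, 4.0])                  # one parameter group, ||w||_2 = 5
group_norm = np.linalg.norm(w)

gates = np.linspace(0.1, 10.0, 100_000)   # scan the scalar gate g > 0
# For a fixed gate g, the optimal v is w / g, so ||v|| = ||w|| / g.
smooth_penalty = (gates**2 + (group_norm / gates) ** 2) / 2
smooth_min = smooth_penalty.min()
```

The minimum is attained at g**2 = ||w||_2, where the smooth penalty equals the group norm exactly, which is why plain SGD on the gated parameters can induce structured sparsity.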

MCML Authors
Chris Kolb

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[1697]
A. Reuter, T. G. J. Rudner, V. Fortuin and D. Rügamer.
Can Transformers Learn Full Bayesian Inference in Context?
SynthData @ICLR 2025 - Workshop SynthData: Will Synthetic Data Finally Solve the Data Access Problem? at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

Transformers have emerged as the dominant architecture in the field of deep learning, with a broad range of applications and remarkable in-context learning (ICL) capabilities. While not yet fully understood, ICL has already proved to be an intriguing phenomenon, allowing transformers to learn in context – without requiring further training. In this paper, we further advance the understanding of ICL by demonstrating that transformers can perform full Bayesian inference for commonly used statistical models in context. More specifically, we introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows which enables us to infer complex posterior distributions for methods such as generalized linear models and latent factor models. Extensive experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods not operating in context.

MCML Authors
Vincent Fortuin

Dr.

Bayesian Deep Learning

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[1696]
L. Meynent, I. Melev, K. Schürholt, G. Kauermann and D. Borth.
Structure Is Not Enough: Leveraging Behavior for Neural Network Weight Reconstruction.
Weight Space Learning @ICLR 2025 - Workshop on Weight Space Learning at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

The weights of neural networks (NNs) have recently gained prominence as a new data modality in machine learning, with applications ranging from accuracy and hyperparameter prediction to representation learning or weight generation. One approach to leverage NN weights involves training autoencoders (AEs), using contrastive and reconstruction losses. This allows such models to be applied to a wide variety of downstream tasks, and they demonstrate strong predictive performance and low reconstruction error. However, despite the low reconstruction error, these AEs reconstruct NN models with deteriorated performance compared to the original ones, limiting their usability with regard to model weight generation. In this paper, we identify a limitation of weight-space AEs, specifically highlighting that a structural loss that uses the Euclidean distance between original and reconstructed weights fails to capture some features critical for reconstructing high-performing models. We analyze the addition of a behavioral loss for training AEs in weight space, where we compare the output of the reconstructed model with that of the original one, given some common input. We show a strong synergy between structural and behavioral signals, leading to increased performance in all downstream tasks evaluated, in particular NN weight reconstruction and generation.
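The gap between structural and behavioral losses is easy to reproduce with a linear toy model (all numbers below are illustrative): two reconstructions can be equally close in Euclidean weight distance while differing drastically in the outputs they produce on common inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Original linear "model" f(x) = w @ x and two reconstructions that are
# equally far from it in weight space but behave very differently on data.
w_orig = np.array([1.0, 0.0])
w_rec_a = np.array([1.1, 0.0])   # perturbs a low-variance input direction
w_rec_b = np.array([1.0, 0.1])   # perturbs a high-variance input direction

# Probe inputs: the second feature dominates the data distribution.
X = rng.normal(size=(10_000, 2)) * np.array([1.0, 100.0])

def structural_loss(w):   # Euclidean distance in weight space
    return np.linalg.norm(w - w_orig)

def behavioral_loss(w):   # output mismatch on common probe inputs
    return np.mean((X @ w - X @ w_orig) ** 2)
```

Both reconstructions have structural loss 0.1, yet the behavioral loss of the second is orders of magnitude larger, which is the kind of functionally critical feature a purely structural loss cannot see.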

MCML Authors
Göran Kauermann

Prof. Dr.

Applied Statistics in Social Sciences, Economics and Business


[1695]
Z. Li, S. Yan, Y. Ma, Y. Li, X. Lyu and M. Schubert.
Beyond Single-Step: Multi-Frame Action-Conditioned Video Generation for Reinforcement Learning Environments.
World Models @ICLR 2025 - Workshop on World Models: Understanding, Modelling and Scaling at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

World models have achieved great success in learning dynamics from both low-dimensional and high-dimensional states. Yet no existing work addresses the multi-step generation task with high-dimensional data. In this paper, we propose the first action-conditioned multi-frame video generation model, advancing world model development by generating future states from actions. As opposed to recent single-step or autoregressive approaches, our model directly generates multiple future frames conditioned on past observations and action sequences. Our framework extends its capabilities to action-conditioned video generation by introducing an action encoder. This addition enables the spatiotemporal variational autoencoder and diffusion transformer in Open-Sora to effectively incorporate action information, ensuring precise and coherent video generation. We evaluated performance on Atari environments (Breakout, Pong, DemonAttack) using MSE, PSNR, and LPIPS. Results show that conditioning solely on future actions and embedding-based encoding improve generation accuracy and perceptual quality while capturing complex temporal dependencies like inertia. Our work paves the way for action-conditioned multi-step generative world models in dynamic environments.

MCML Authors
Zongyue Li

Spatial Artificial Intelligence

Yunpu Ma

Dr.

Database Systems and Data Mining

Matthias Schubert

Prof. Dr.

Spatial Artificial Intelligence


[1694]
T. Decker, V. Tresp and F. Buettner.
Why Uncertainty Calibration Matters for Reliable Perturbation-based Explanations.
XAI4Science @ICLR 2025 - Workshop XAI4Science: From Understanding Model Behavior to Discovering New Scientific Knowledge at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Perturbation-based explanations are widely utilized to enhance the transparency of modern machine-learning models. However, their reliability is often compromised by the unknown model behavior under the specific perturbations used. This paper investigates the relationship between uncertainty calibration - the alignment of model confidence with actual accuracy - and perturbation-based explanations. We show that models frequently produce unreliable probability estimates when subjected to explainability-specific perturbations and theoretically prove that this directly undermines explanation quality. To address this, we introduce ReCalX, a novel approach to recalibrate models for improved perturbation-based explanations while preserving their original predictions. Experiments on popular computer vision models demonstrate that our calibration strategy produces explanations that are more aligned with human perception and actual object locations.
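Calibration in the sense used here, i.e. the alignment of model confidence with actual accuracy, is commonly summarised by the expected calibration error (ECE). A minimal numpy sketch of generic ECE, not of the ReCalX recalibration itself:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Confidence-binned gap between mean confidence and empirical accuracy."""
    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

# Calibrated: confidence 0.8 with 80% accuracy -> ECE of zero.
ece_good = expected_calibration_error(np.full(1000, 0.8),
                                      np.array([1.0] * 800 + [0.0] * 200))
# Over-confident: confidence 0.99 with 50% accuracy -> ECE near 0.49.
ece_bad = expected_calibration_error(np.full(1000, 0.99),
                                     np.array([1.0] * 500 + [0.0] * 500))
```

The paper's point is that perturbed inputs tend to land in the over-confident regime, which is precisely where perturbation-based attributions become unreliable.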

MCML Authors
Thomas Decker

Database Systems and Data Mining

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1693]
R. Visser, F. Fumagalli, M. Muschalik, E. Hüllermeier and B. Hammer.
Explaining Outliers using Isolation Forest and Shapley Interactions.
ESANN 2025 - European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges, Belgium, Apr 23-25, 2025. PDF
Abstract

In unsupervised machine learning, Isolation Forest (IsoForest) is a widely used algorithm for the efficient detection of outliers. Identifying the features responsible for observed anomalies is crucial for practitioners, yet the ensemble nature of IsoForest complicates interpretation and comparison. As a remedy, SHAP is a prevalent method to interpret outlier scoring models by assigning contributions to individual features based on the Shapley Value (SV). However, complex anomalies typically involve interactions between features, and it is paramount for practitioners to distinguish such complex anomalies from simple cases. In this work, we propose Shapley Interactions (SIs) to enrich explanations of outliers with feature interactions. SIs, as an extension of the SV, decompose the outlier score into contributions of individual features and interactions of features up to a specified explanation order. We modify IsoForest to compute SIs using TreeSHAP-IQ, an extension of TreeSHAP for tree-based models, using the shapiq package. Using a qualitative and quantitative analysis on synthetic and real-world datasets, we demonstrate the benefit of SIs and feature interactions for outlier explanations over feature contributions alone.
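The pairwise Shapley interaction underlying SIs is a discrete second difference of the value function; for two features it reduces to v({1,2}) - v({1}) - v({2}) + v(empty set). A toy enumeration with an illustrative value function, computed by hand rather than with shapiq:

```python
# Toy value function on the feature set {0, 1}: the outlier score of the
# pair is super-additive, so the two features genuinely interact.
v = {(): 0.0, (0,): 1.0, (1,): 1.0, (0, 1): 4.0}

# Exact Shapley values for two players (average over both orderings).
phi0 = 0.5 * ((v[(0,)] - v[()]) + (v[(0, 1)] - v[(1,)]))
phi1 = 0.5 * ((v[(1,)] - v[()]) + (v[(0, 1)] - v[(0,)]))

# Pairwise Shapley interaction: the discrete second difference.
interaction_01 = v[(0, 1)] - v[(0,)] - v[(1,)] + v[()]
```

A large positive interaction term flags a "complex" anomaly driven jointly by both features, which per-feature SVs alone would mask.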

MCML Authors
Maximilian Muschalik

Artificial Intelligence and Machine Learning

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1692]
H. A. Gündüz.
Designing and optimizing deep learning methods for genomic sequencing data.
Dissertation 2025. DOI
Abstract

This dissertation applies modern deep learning techniques to genomics, introducing new approaches for self-supervised learning, uncertainty quantification, and automated model design. A key focus is the effective use of unlabeled genomic data, highlighted by the development of Self-GenomeNet, a self-supervised method tailored to genomic sequences. The work also presents automated optimization strategies for model architectures and hyperparameters, achieving better results than expert-designed models. Finally, it contributes user-friendly software that supports various genomic data formats and integrates core methods developed in the thesis. (Shortened).

MCML Authors

[1691]
J. R. Jostan, L. M. Rodriguez, D. Z. Bernal, J. O. Berdugo, V. Aljure, F. Lopez, J. R. Lopez, N. Navab, D. Mateus and V. G. Duque.
Ultrasound Nerve Segmentation with Deep Learning for Leprosy.
ISBI 2025 - IEEE 22nd International Symposium on Biomedical Imaging. Houston, TX, USA, Apr 14-17, 2025. DOI
Abstract

Purpose: This study aims to provide an AI tool for detecting nerves in ultrasound images to help diagnose Hansen’s disease (Leprosy) in rural areas. The significant difference in the cross-sectional area (CSA) of superficial nerves in symmetrical extremities is a landmark in the early stages of the disease. Despite its potential, ultrasound nerve evaluation is limited due to the difficulty in accurately identifying nerves in ultrasound images.
Methodology: We propose the first Leprosy video nerve segmentation pipeline based on YOLOv8 and X-Mem architectures to automate frame detection, segmentation, and label propagation. We ensure alignment with clinical practices and evaluate the inference in real time of the method and its energy efficiency, confirming the approach’s feasibility in resource-limited settings.
Results: We establish a baseline for nerve segmentation of ultrasound Leprosy videos, presenting the first results to identify relevant frames, segment, and propagate labels. To support further research, we have open-sourced a new leprosy test dataset and created a demo web page to try our method on real patient data. This initiative aims to promote research on AI techniques to improve healthcare in rural communities, where healthcare professionals are scarce and assistance is essential.

MCML Authors
Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1690]
Z. Ding.
Inductive representation learning and natural language question answering on temporal knowledge graphs.
Dissertation 2025. DOI
Abstract

Real-world applications such as recommender systems, social networks, and protein-protein interactions often involve relational data. In recent years, there has been increasing interest in machine learning on such data, particularly in the context of knowledge graphs (KGs). KGs are structured relational data that store multi-relational information as directed graphs, where each node corresponds to an entity and each labeled edge represents a factual relationship between entities, e.g., (Oxford, located in, the United Kingdom). Traditional KGs assume time-invariant relationships. However, real-world relationships are dynamically evolving over time. For example, the chancellor of Germany in 2020 was Angela Merkel, but in 2022 it became Olaf Scholz. This necessitates the use of temporal knowledge graphs (TKGs), where temporal facts are introduced by coupling stationary facts with additional time identifiers, e.g., (Angela Merkel, is chancellor of, Germany, 2020). TKGs are more expressive than KGs as they model the temporal evolution of knowledge. Consequently, recent research has paid more attention to machine learning on TKGs. In this thesis, we focus on two machine learning problems: inductive knowledge representation learning and natural language question answering (QA) on TKGs. (Shortened)

MCML Authors
Zifeng Ding

Database Systems and Data Mining


[1689]
K. Schwethelm, J. Kaiser, J. Kuntzer, M. Yigitsoy, D. Rueckert and G. Kaissis.
Differentially Private Active Learning: Balancing Effective Data Selection and Privacy.
SaTML 2025 - IEEE Conference on Secure and Trustworthy Machine Learning. Copenhagen, Denmark, Apr 09-11, 2025. DOI
Abstract

Active learning (AL) is a widely used technique for optimizing data labeling in machine learning by iteratively selecting, labeling, and training on the most informative data. However, its integration with formal privacy-preserving methods, particularly differential privacy (DP), remains largely underexplored. While some works have explored differentially private AL for specialized scenarios like online learning, the fundamental challenge of combining AL with DP in standard learning settings has remained unaddressed, severely limiting AL’s applicability in privacy-sensitive domains. This work addresses this gap by introducing differentially private active learning (DP-AL) for standard learning settings. We demonstrate that naively integrating DP-SGD training into AL presents substantial challenges in privacy budget allocation and data utilization. To overcome these challenges, we propose step amplification, which leverages individual sampling probabilities in batch creation to maximize data point participation in training steps, thus optimizing data utilization. Additionally, we investigate the effectiveness of various acquisition functions for data selection under privacy constraints, revealing that many commonly used functions become impractical. Our experiments on vision and natural language processing tasks show that DP-AL can improve performance for specific datasets and model architectures. However, our findings also highlight the limitations of AL in privacy-constrained environments, emphasizing the trade-offs between privacy, model accuracy, and data selection accuracy.
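The DP-SGD step that DP-AL builds on clips each per-example gradient to a fixed norm and adds calibrated Gaussian noise; the paper's step amplification itself is not reproduced here. A schematic numpy sketch with illustrative parameter values:

```python
import numpy as np

def dp_sgd_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.0,
                    rng=None):
    """Clip each per-example gradient to clip_norm, sum, add Gaussian noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    noise = rng.normal(0.0, noise_multiplier * clip_norm,
                       size=per_example_grads.shape[1])
    return clipped, (clipped.sum(axis=0) + noise) / len(per_example_grads)

grads = np.array([[3.0, 4.0],   # norm 5.0 -> rescaled down to norm 1.0
                  [0.3, 0.4]])  # norm 0.5 -> left untouched
clipped, noisy_mean_grad = dp_sgd_gradient(grads)
```

Because each example's influence is bounded by the clipping norm and masked by noise, every training step spends privacy budget, which is exactly the resource the paper's step amplification tries to allocate more effectively.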

MCML Authors
Georgios Kaissis

Dr.

* Former Principal Investigator


[1688]
S. Bai, A. Kruspe and X. Zhu.
Generating Synthetic Oracle Datasets to Analyze Noise Impact: A Study on Building Function Classification Using Tweets.
ECIR 2025 - 47th European Conference on Information Retrieval. Lucca, Italy, Apr 06-10, 2025. To be published.
Abstract

Geo-tagged tweets collected at the building level have patterns that aid in building function classification. However, this data source suffers from substantial noise, limiting its effectiveness. Conducting a systematic noise analysis requires a noise-free environment, which is difficult to obtain from real-world data. In this study, we propose an approach using an LLM-generated synthetic oracle dataset that contains only correctly assigned tweets aligned with their respective buildings. To make the dataset reflect real-world distributions, we use a data generation pipeline that integrates data attributes from the real world into LLM prompts. To evaluate the utility of the synthetic dataset for noise analysis, we compare the performance of Naïve Bayes (NB) and mBERT classifiers on it against real-world noisy data. Additionally, we assess the dataset's diversity by comparing Self-BLEU and perplexity scores against those of real-world datasets. Our findings reveal that while noise significantly disrupts mBERT's contextual learning, its removal in the synthetic dataset enables mBERT to substantially outperform NB. This highlights that noise reduction is more effective than increasing model complexity for context-dependent text classification tasks. Moreover, despite reduced noise and sentence structure variations, the synthetic dataset preserves realistic linguistic characteristics. These results confirm that a synthetic oracle dataset provides an effective noise-free experimental environment for studying noise impact in text classification.

MCML Authors
Shanshan Bai

Data Science in Earth Observation

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1687]
M. Mironov, A. Marquard, D. Racek, C. Heumann, P. W. Thurner and M. Aßenmacher.
A Geoparsing Pipeline for Multilingual Social Media Posts from Ukraine.
GeoExT @ECIR 2025 - 3rd International Workshop on Geographic Information Extraction from Texts at the 47th European Conference on Information Retrieval (ECIR 2025). Lucca, Italy, Apr 06-10, 2025. PDF
Abstract

The dynamics of contemporary social media communication, particularly on platforms like X (formerly Twitter), have significantly evolved, and this data is frequently used for scientific research. However, due to X’s API changes in 2019, a tweet’s precise geolocation is no longer present in the data, thus preventing a geographical assessment of tweets. This project aims to extract location mentions from tweets’ texts and to map them to Ukraine’s administrative regions. We have developed a specialized pipeline for geoparsing with specific prebuilt components for the Ukrainian, Russian, and English languages. The main advantage of our pipeline’s architecture is the interchangeability of all components, allowing for the integration of custom-developed solutions. Initial tests on our hand-labeled Ukrainian dataset show promising results in accurately identifying and mapping location mentions despite various challenges, such as declension and the presence of multiple languages in a single tweet. Additional experiments using publicly available benchmark data further indicate promising performance when transferring our pipeline to other geographical regions. Both our geoparsing pipeline and its online documentation have been made publicly available.

MCML Authors
Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[1686]
H. Hauger, P. Scholl and G. Kutyniok.
Robust identifiability for symbolic recovery of differential equations.
ICASSP 2025 - IEEE International Conference on Acoustics, Speech and Signal Processing. Hyderabad, India, Apr 06-11, 2025. DOI
Abstract

Recent advancements in machine learning have transformed the discovery of physical laws, moving from manual derivation to data-driven methods that simultaneously learn both the structure and parameters of governing equations. This shift introduces new challenges regarding the validity of the discovered equations, particularly concerning their uniqueness and, hence, identifiability. While the issue of non-uniqueness has been well-studied in the context of parameter estimation, it remains underexplored for algorithms that recover both structure and parameters simultaneously. Early studies have primarily focused on idealized scenarios with perfect, noise-free data. In contrast, this paper investigates how noise influences the uniqueness and identifiability of physical laws governed by partial differential equations (PDEs). We develop a comprehensive mathematical framework to analyze the uniqueness of PDEs in the presence of noise and introduce new algorithms that account for noise, providing thresholds to assess uniqueness and identifying situations where excessive noise hinders reliable conclusions. Numerical experiments demonstrate the effectiveness of these algorithms in detecting uniqueness despite the presence of noise.

MCML Authors
Philipp Scholl

Mathematical Foundations of Artificial Intelligence

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence


[1685]
X. Jing, K. Zhou, A. Triantafyllopoulos and B. W. Schuller.
Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models.
ICASSP 2025 - IEEE International Conference on Acoustics, Speech and Signal Processing. Hyderabad, India, Apr 06-11, 2025. DOI
Abstract

While current emotional text-to-speech (TTS) systems can generate highly intelligible emotional speech, achieving fine control over emotion rendering of the output speech still remains a significant challenge. In this paper, we introduce ParaEVITS, a novel emotional TTS framework that leverages the compositionality of natural language to enhance control over emotional rendering. By incorporating a text-audio encoder inspired by ParaCLAP, a contrastive language-audio pretraining (CLAP) model for computational paralinguistics, the diffusion model is trained to generate emotional embeddings based on textual emotional style descriptions. Our framework first trains on reference audio using the audio encoder, then fine-tunes a diffusion model to process textual inputs from ParaCLAP's text encoder. During inference, speech attributes such as pitch, jitter, and loudness are manipulated using only textual conditioning. Our experiments demonstrate that ParaEVITS effectively controls emotion rendering without compromising speech quality. Speech demos are publicly available.

MCML Authors
Andreas Triantafyllopoulos

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics


[1684]
I. Tsangko, A. Triantafyllopoulos, M. Müller, H. Schröter and B. W. Schuller.
DFingerNet: Noise-Adaptive Speech Enhancement for Hearing Aids.
ICASSP 2025 - IEEE International Conference on Acoustics, Speech and Signal Processing. Hyderabad, India, Apr 06-11, 2025. DOI
Abstract

The DeepFilterNet (DFN) architecture was recently proposed as a deep learning model suited for hearing aid devices. Despite its competitive performance on numerous benchmarks, it still follows a 'one-size-fits-all' approach, which aims to train a single, monolithic architecture that generalises across different noises and environments. However, its limited size and computation budget can hamper its generalisability. Recent work has shown that in-context adaptation can mitigate this by conditioning the denoising process on additional information extracted from background recordings. These recordings can be offloaded outside the hearing aid, thus improving performance while adding minimal computational overhead. We introduce these principles to the DFN model, thus proposing the DFingerNet (DFiN) model, which shows superior performance on various benchmarks inspired by the DNS Challenge.

MCML Authors
Andreas Triantafyllopoulos

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics


[1683]
J. Kostin, F. Krahmer and D. Stöger.
How robust is randomized blind deconvolution via nuclear norm minimization against adversarial noise?
Applied and Computational Harmonic Analysis 76.101746 (Apr. 2025). DOI
Abstract

In this paper, we study the problem of recovering two unknown signals from their convolution, which is commonly referred to as blind deconvolution. Reformulation of blind deconvolution as a low-rank recovery problem has led to multiple theoretical recovery guarantees in the past decade due to the success of the nuclear norm minimization heuristic. In particular, in the absence of noise, exact recovery has been established for sufficiently incoherent signals contained in lower-dimensional subspaces. However, if the convolution is corrupted by additive bounded noise, the stability of the recovery problem remains much less understood. In particular, existing reconstruction bounds involve large dimension factors and therefore fail to explain the empirical evidence for dimension-independent robustness of nuclear norm minimization. Recently, theoretical evidence has emerged for ill-posed behavior of low-rank matrix recovery for sufficiently small noise levels. In this work, we develop improved recovery guarantees for blind deconvolution with adversarial noise which exhibit square-root scaling in the noise level. Hence, our results are consistent with existing counterexamples which speak against linear scaling in the noise level as demonstrated for related low-rank matrix recovery problems.
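In the lifted formulation described above, blind deconvolution is relaxed to a noise-constrained nuclear norm minimization. The notation below is the standard one from this line of work and is assumed here rather than quoted from the paper:

```latex
% Lifted blind deconvolution: the convolution of the unknown signals h and m
% becomes a linear map A acting on the rank-one matrix h m^*, and the nuclear
% norm serves as the convex surrogate for the rank constraint.
y = \mathcal{A}(h m^{*}) + e, \qquad \|e\|_{2} \le \tau,
\qquad
\hat{X} = \operatorname*{arg\,min}_{X \in \mathbb{C}^{K \times N}} \|X\|_{*}
\quad \text{s.t.} \quad \|\mathcal{A}(X) - y\|_{2} \le \tau .
```

The square-root scaling mentioned in the abstract then refers to recovery bounds of the form \(\|\hat{X} - h m^{*}\|_{F} \lesssim \sqrt{\tau}\), consistent with counterexamples that rule out linear scaling in the noise level for related low-rank recovery problems.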

MCML Authors
Felix Krahmer

Prof. Dr.

Optimization & Data Analysis


[1682]
B. Lange.
Beyond the Ivory Tower? The Practical Role of Ethicists in Business.
Artificial Intelligence, Entrepreneurship and Risk. Technikzukünfte, Wissenschaft und Gesellschaft / Futures of Technology, Science and Society (Apr. 2025). DOI
Abstract

‘AI Ethics’, ‘Digital Ethics’ or ‘Corporate Digital Responsibility’—ethics in business, especially with the rise of Artificial Intelligence (AI), is now in vogue. But how, if at all, can ethicists meaningfully contribute to practical business challenges? I examine the value that resources from moral philosophy can bring to ethical issues in business, particularly the technology sector. I show that there is a specific need for sharpened ethical acumen in so-called ‘grey areas’, in which laws and regulation do not provide definite answers to the ethical challenges businesses face. I argue that ethicists can distinctively help businesses navigate grey areas by strengthening their ethical capabilities and functions, which concern an organization’s ethical awareness, deliberation, decision-making, and commitment. I conclude by discussing some practical examples of how ethicists can strengthen these capabilities.

MCML Authors
Benjamin Lange

Dr.

Ethics of Artificial Intelligence


[1681]
Q. Li, H. Taubenböck and X. Zhu.
Identification of the potential for roof greening using remote sensing and deep learning.
Cities 159.105782 (Apr. 2025). DOI
Abstract

Under the mounting pressure from global warming, green roofs emerge as a valuable source for climate adaptation, particularly in compact metropolises where green space is limited. Consequently, there is a need to quantitatively evaluate the potential for roof greening where it is most needed and suitable. Despite the increasing importance of this issue, there have been limited studies on the effectiveness of remote sensing and deep learning in identifying the potential for roof greening in many cities. To address this, we have created a GreenRoof dataset, comprising approximately 6400 pairs of remote sensing images and corresponding masks of roofs with high greening potential in four European cities. Afterward, we exploit the capabilities of deep learning methods to identify roofs that are suitable for greening from remote sensing images. Using 15 German cities as a case study for future urban rooftop planning, we estimate the spatial potential for retrofitting green roofs. Structural parameters for prioritizing green roof implementation include vegetation coverage, thermal environment, and building density. Results indicate that the total area suitable for green roof retrofitting exceeds 20% of the roof area in the 15 German cities examined. The spatial analysis effectively reflects variation in demand and suitability for green roof retrofitting across different cities. In conclusion, this study provides a versatile screening approach utilizing remote sensing, deep learning, and spatial analysis, which can be readily adapted to inform municipal policies in other cities aiming to promote green roofs and enhance sustainable urban development.

MCML Authors
Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1680]
T. Weber.
Advancing Deep Learning in medical imaging through generative modeling and representation learning.
Dissertation 2025. DOI
Abstract

In recent years, deep learning (DL) has proven to be a disruptive enabler in many domains, including the realm of medical imaging. The application of neural networks and other learnable algorithms has substantially impacted the medical field, promising to improve diagnostic accuracy, enhance patient outcomes, and streamline clinical workflows. The advent of large-scale datasets and advancements in computational power have facilitated the development of sophisticated DL models capable of analyzing and interpreting complex medical images. The scope of this thesis concentrates on a subset of the full DL spectrum, specifically the uprising areas of generative modeling and representation learning, which are closely interleaved with each other. The proposed contributions aim to push the boundaries of established medical image DL methods, venturing into more experimental research areas. (Shortened)

MCML Authors

[1679]
C. Cipriani, M. Fornasier and A. Scagliotti.
From NeurODEs to AutoencODEs: a mean-field control framework for width-varying Neural Networks.
European Journal of Applied Mathematics 36.Special Issue 2: From integro-differential models to data-oriented approaches for emergent phenomena (Apr. 2025). DOI
Abstract

The connection between Residual Neural Networks (ResNets) and continuous-time control systems (known as NeurODEs) has led to a mathematical analysis of neural networks, which has provided interesting results of both theoretical and practical significance. However, by construction, NeurODEs have been limited to describing constant-width layers, making them unsuitable for modelling deep learning architectures with layers of variable width. In this paper, we propose a continuous-time Autoencoder, which we call AutoencODE, based on a modification of the controlled field that drives the dynamics. This adaptation enables the extension of the mean-field control framework originally devised for conventional NeurODEs. In this setting, we tackle the case of low Tikhonov regularisation, resulting in potentially non-convex cost landscapes. While the global results obtained for high Tikhonov regularisation may not hold globally, we show that many of them can be recovered in regions where the loss function is locally convex. Inspired by our theoretical findings, we develop a training method tailored to this specific type of Autoencoders with residual connections, and we validate our approach through numerical experiments conducted on various examples.

MCML Authors
Cristina Cipriani, Dr. (* Former Member)
Massimo Fornasier, Prof. Dr. (Applied Numerical Analysis)
Alessandro Scagliotti (Applied Numerical Analysis)


[1678]
M. Fornasier, P. Richtárik, K. Riedl and L. Sun.
Consensus-Based Optimization with Truncated Noise.
European Journal of Applied Mathematics 36.Special Issue 2: From integro-differential models to data-oriented approaches for emergent phenomena (Apr. 2025).
Abstract

Consensus-based optimisation (CBO) is a versatile multi-particle metaheuristic optimisation method suitable for performing non-convex and non-smooth global optimisations in high dimensions. It has proven effective in various applications while at the same time being amenable to a theoretical convergence analysis. In this paper, we explore a variant of CBO, which incorporates truncated noise in order to enhance the well-behavedness of the statistics of the law of the dynamics. By introducing this additional truncation in the noise term of the CBO dynamics, we ensure that, in contrast to the original version, higher moments of the law of the particle system can be effectively bounded. As a result, our proposed variant exhibits enhanced convergence performance, allowing in particular for wider flexibility in choosing the noise parameter of the method, as we confirm experimentally. By analysing the time evolution of the Wasserstein-2 distance between the empirical measure of the interacting particle system and the global minimiser of the objective function, we rigorously prove convergence in expectation of the proposed CBO variant, requiring only minimal assumptions on the objective function and on the initialisation. Numerical evidence demonstrates the benefit of truncating the noise in CBO.
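The truncated-noise dynamics can be sketched in a few lines. The following toy 1D implementation is a minimal sketch, not the paper's code: all parameter values and the test objective are illustrative, and the only change relative to plain CBO is clamping each Gaussian increment to a radius R before it scales the diffusion term.

```python
import math
import random

def cbo_truncated(f, n=50, steps=400, alpha=30.0, lam=1.0,
                  sigma=0.8, dt=0.01, R=2.0, seed=0):
    """Toy 1D consensus-based optimisation with truncated noise.

    Illustrative parameters; the clamp of z to [-R, R] is the truncation
    that keeps higher moments of the particle law bounded.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    for _ in range(steps):
        # consensus point: Gibbs-weighted average with weights exp(-alpha f)
        fmin = min(f(x) for x in xs)  # shift to avoid underflow of all weights
        ws = [math.exp(-alpha * (f(x) - fmin)) for x in xs]
        xa = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
        nxt = []
        for x in xs:
            z = max(-R, min(R, rng.gauss(0.0, 1.0)))  # truncated noise
            drift = -lam * (x - xa) * dt              # pull towards consensus
            diff = sigma * abs(x - xa) * math.sqrt(dt) * z
            nxt.append(x + drift + diff)
        xs = nxt
    return xa

# smooth objective with global minimiser at 1.2 (illustrative test case)
f = lambda x: (x - 1.2) ** 2
```

In practice the truncation mainly permits a larger noise parameter sigma without the particle moments blowing up; the quadratic here merely checks that the dynamics contract to the minimiser.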

MCML Authors
Massimo Fornasier, Prof. Dr. (Applied Numerical Analysis)
Konstantin Riedl, Dr. (* Former Member)
Lukang Sun (Applied Numerical Analysis)


[1677]
A. Maarouf, S. Feuerriegel and N. Pröllochs.
A fused large language model for predicting startup success.
European Journal of Operational Research 322.1 (Apr. 2025). DOI
Abstract

Investors are continuously seeking profitable investment opportunities in startups and, hence, for effective decision-making, need to predict a startup’s probability of success. Nowadays, investors can use not only various fundamental information about a startup (e.g., the age of the startup, the number of founders, and the business sector) but also textual description of a startup’s innovation and business model, which is widely available through online venture capital (VC) platforms such as Crunchbase. To support the decision-making of investors, we develop a machine learning approach with the aim of locating successful startups on VC platforms. Specifically, we develop, train, and evaluate a tailored, fused large language model to predict startup success. Thereby, we assess to what extent self-descriptions on VC platforms are predictive of startup success. Using 20,172 online profiles from Crunchbase, we find that our fused large language model can predict startup success, with textual self-descriptions being responsible for a significant part of the predictive power. Our work provides a decision support tool for investors to find profitable investment opportunities.

MCML Authors
Abdurahman Maarouf (Artificial Intelligence in Management)
Stefan Feuerriegel, Prof. Dr. (Artificial Intelligence in Management)


[1676]
L. Nas, B. F. Hoppe, A. T. Stüber, S. Grosu, N. Fink, A. von Fragstein, J. Rudolph, J. Ricke and B. O. Sabel.
Optimizing lower extremity CT angiography: A prospective study of individualized vs. fixed post-trigger delays in bolus tracking.
European Journal of Radiology 185.112009 (Apr. 2025). DOI
Abstract

Purpose: To compare the contrast media opacification and diagnostic quality in lower-extremity runoff CT angiography (CTA) between bolus-tracking using conventional fixed trigger delay and patient-specific individualized post-trigger delay.
Methods: In this prospective study, lower-extremity runoff CTA was performed in two cohorts, using either fixed or individualized trigger delay. Both cohorts had identical CT protocols, contrast media applications, and image reconstructions. Objective image quality (mean contrast opacification in HU) and subjective image quality (5-point Likert scale) were assessed in six vessels: abdominal aorta (AA), common iliac artery (CIA), superficial femoral artery (SFA), popliteal artery (PA), posterior tibial artery (PTA), and dorsalis pedis artery (DPA), by one rater for objective and two raters for subjective image quality. Objective image quality was analyzed using Student t-tests, while subjective ratings were compared with Fisher’s exact test.
Results: Overall, 65 patients were included (mean age: 71 ± 14; 39 men), 35 in the individualized cohort and 30 in the fixed cohort. No differences were found between the groups regarding demographics or radiation exposure. Individualized trigger delay ranged from 2 to 23 s (mean: 8.7 ± 4.0 s) and was 10 s in the fixed cohort. The individualized cohort showed higher opacification in the peripheral arteries (PTA: 479 ± 140 HU vs. 379 ± 106 HU; p = 0.009; DPA: 477 ± 191 HU vs. 346 ± 137 HU; p = 0.009). Overall subjective “image quality” was rated higher in the individualized group (“excellent” or “good” in Rater 1: 97% vs. 57%; p < 0.001; and Rater 2: 89% vs. 53%; p = 0.002).
Conclusion: Individualized post-trigger delay enhances diagnostic quality by improving vessel opacification in peripheral arteries and increasing subjective image quality in lower extremity runoff CTA.

MCML Authors
Boj Friedrich Hoppe, Dr. (Clinical Data Science in Radiology)
Theresa Stüber (Clinical Data Science in Radiology)


[1675]
G. Kutyniok.
How Can Reliability of Artificial Intelligence Be Ensured?
Harvard Data Science Review 7.2 (Apr. 2025). DOI
Abstract

Column Editor’s Note: Artificial intelligence (AI) is having a profound impact across many areas of science and society. However, there remain important gaps in our understanding of the deep neural networks that underpin these developments, and in many cases AI models lack robustness and reliability. In this Diving into Data column, Professor Kutyniok explores these issues from a mathematical perspective, highlighting open theoretical questions that will need to be resolved in order to develop AIs that are truly reliable, generalizable, and trustworthy.

MCML Authors
Gitta Kutyniok, Prof. Dr. (Mathematical Foundations of Artificial Intelligence)


[1674]
W. Qi, X. Xu, K. Qian, B. W. Schuller, G. Fortino and A. Aliverti.
A Review of AIoT-Based Human Activity Recognition: From Application to Technique.
IEEE Journal of Biomedical and Health Informatics 29.4 (Apr. 2025). DOI
Abstract

This scoping review paper redefines the Artificial Intelligence-based Internet of Things (AIoT) driven Human Activity Recognition (HAR) field by systematically extrapolating from various application domains to deduce potential techniques and algorithms. We distill a general model with adaptive learning and optimization mechanisms by conducting a detailed analysis of human activity types and utilizing contact or non-contact devices. The review presents various system integration mathematical paradigms driven by multimodal data fusion, covering predictions of complex behaviors and redefining valuable methods, devices, and systems for HAR. Additionally, this paper establishes benchmarks for behavior recognition across different application requirements, from simple localized actions to group activities. It summarizes open research directions, including data diversity and volume, computational limitations, interoperability, real-time recognition, data security, and privacy concerns. Finally, this review aims to serve as a comprehensive and foundational resource for researchers delving into the complex and burgeoning realm of AIoT-enhanced HAR, providing insights and guidance for future innovations and developments.

MCML Authors
Björn Schuller, Prof. Dr. (Health Informatics)


[1673]
X. Qiu, W. Qiu, Y. Zhang, K. Qian, C. Li, B. Hu, B. W. Schuller and Y. Yamamoto.
FedKDC: Consensus-Driven Knowledge Distillation for Personalized Federated Learning in EEG-Based Emotion Recognition.
IEEE Journal of Biomedical and Health Informatics Early Access (Apr. 2025). DOI GitHub
Abstract

Federated learning (FL) has gained prominence in electroencephalogram (EEG)-based emotion recognition because of its ability to enable secure collaborative training without centralized data. However, traditional FL faces challenges due to model and data heterogeneity in smart healthcare settings. For example, medical institutions have varying computational resources, which creates a need for personalized local models. Moreover, EEG data from medical institutions typically face data heterogeneity issues stemming from limitations in participant availability, ethical constraints, and cultural differences among subjects, which can slow model convergence and degrade model performance. To address these challenges, we propose FedKDC, a novel FL framework that incorporates clustered knowledge distillation (CKD). This method introduces a consensus-based distributed learning mechanism to facilitate the clustering process. It then enhances the convergence speed through intraclass distillation and reduces the negative impact of heterogeneity through interclass distillation. Additionally, we introduce a DriftGuard mechanism to mitigate client drift, along with an entropy reducer to decrease the entropy of aggregated knowledge. The framework is validated on the SEED, SEED-IV, SEED-FRA, and SEED-GER datasets, demonstrating its effectiveness in scenarios where both the data and the models are heterogeneous. Experimental results show that FedKDC outperforms other FL frameworks in emotion recognition, achieving a maximum average accuracy of 85.2%, and in convergence efficiency, with faster and more stable convergence.

MCML Authors
Björn Schuller, Prof. Dr. (Health Informatics)


[1672]
Y. Bi, Y. Su, N. Navab and Z. Jiang.
Gaze-Guided Robotic Vascular Ultrasound Leveraging Human Intention Estimation.
IEEE Robotics and Automation Letters 10.4 (Apr. 2025). DOI
Abstract

Medical ultrasound has been widely used to examine vascular structure in modern clinical practice. However, traditional ultrasound examination often faces challenges related to inter- and intra-operator variation. The robotic ultrasound system (RUSS) appears as a potential solution for such challenges because of its superiority in stability and reproducibility. Given the complex anatomy of human vasculature, multiple vessels often appear in ultrasound images, or a single vessel bifurcates into branches, complicating the examination process. To tackle this challenge, this work presents a gaze-guided RUSS for vascular applications. A gaze tracker captures the eye movements of the operator. The extracted gaze signal guides the RUSS to follow the correct vessel when it bifurcates. Additionally, a gaze-guided segmentation network is proposed to enhance segmentation robustness by exploiting gaze information. However, gaze signals are often noisy, requiring interpretation to accurately discern the operator’s true intentions. To this end, this study proposes a stabilization module to process raw gaze data. The inferred attention heatmap is utilized as a region proposal to aid segmentation and serve as a trigger signal when the operator needs to adjust the scanning target, such as when a bifurcation appears. To ensure appropriate contact between the probe and surface during scanning, an automatic ultrasound confidence-based orientation correction method is developed. In experiments, we demonstrated the efficiency of the proposed gaze-guided segmentation pipeline by comparing it with other methods. Besides, the performance of the proposed gaze-guided RUSS was also validated as a whole on a realistic arm phantom with an uneven surface.

MCML Authors
Yuan Bi (Computer Aided Medical Procedures & Augmented Reality)
Nassir Navab, Prof. Dr. (Computer Aided Medical Procedures & Augmented Reality)
Zhongliang Jiang, Dr. (Computer Aided Medical Procedures & Augmented Reality)


[1671]
J. Huang, P. K. Yu, N. Navab and B. Busam.
TTAPose: Test-time Adaptation for Unseen Object Pose Estimation.
IEEE Robotics and Automation Letters 10.6 (Apr. 2025). DOI
Abstract

Recent advances in the field of 6D pose estimation of unseen objects not present during training are promising; however, the performance gap between these general methods and object-specific methods remains significant. This paper introduces an innovative unsupervised test-time adaptation method, termed TTAPose, capable of adapting a pose estimator to any unseen object. TTAPose initially undergoes pre-training using a large synthetic dataset and thereafter refines the weights using an unsupervised loss on unseen real-world target objects. The network, based on a teacher-student architecture, leverages an RGB-D pose refinement pipeline to incrementally improve pseudo labels. Notably, TTAPose operates with no requirement for target data annotation, thus minimizing time and data expenditure. Experimental results show performance levels comparable to supervised methods, effectively narrowing the gap to object-specific baselines.

MCML Authors
Junwen Huang (Computer Aided Medical Procedures & Augmented Reality)
Nassir Navab, Prof. Dr. (Computer Aided Medical Procedures & Augmented Reality)
Benjamin Busam, Dr. (Computer Aided Medical Procedures & Augmented Reality)


[1670]
L. Christ, S. Amiriparian, A. Kathan, N. Müller, A. König and B. W. Schuller.
Towards Multimodal Prediction of Spontaneous Humor: A Novel Dataset and First Results.
IEEE Transactions on Affective Computing 16.2 (Apr. 2025). DOI
Abstract

Humor is a substantial element of human social behavior, affect, and cognition. Its automatic understanding can facilitate a more naturalistic human-AI interaction. Current methods of humor detection have been exclusively based on staged data, making them inadequate for ‘real-world’ applications. We contribute to addressing this deficiency by introducing the novel Passau-Spontaneous Football Coach Humor (Passau-SFCH) dataset, comprising about 11 hours of recordings. The Passau-SFCH dataset is annotated for the presence of humor and its dimensions (sentiment and direction) as proposed in Martin’s Humor Style Questionnaire. We conduct a series of experiments employing pretrained Transformers, convolutional neural networks, and expert-designed features. The performance of each modality (text, audio, video) for spontaneous humor recognition is analyzed and their complementarity is investigated. Our findings suggest that for the automatic analysis of humor and its sentiment, facial expressions are most promising, while humor direction can be best modeled via text-based features. Further, we experiment with different multimodal approaches to humor recognition, including decision-level fusion and MulT, a multimodal Transformer approach. In this context, we propose a novel multimodal architecture that yields the best overall results.
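The decision-level fusion mentioned above can be illustrated in one line of arithmetic: per-modality humor probabilities are combined by a weighted average. This is a generic sketch; the uniform weights and the probability values are made up, not the paper's learned configuration.

```python
def decision_level_fusion(prob_text, prob_audio, prob_video,
                          weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of per-modality humor probabilities.

    Uniform weights are a placeholder; in practice they could be
    tuned on a validation set.
    """
    probs = (prob_text, prob_audio, prob_video)
    return sum(w * p for w, p in zip(weights, probs))

# e.g. video (facial expressions) confident, text less so
fused = decision_level_fusion(0.4, 0.5, 0.9)
```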

MCML Authors
Shahin Amiriparian, Dr. (Health Informatics)
Alexander Kathan (Health Informatics)
Björn Schuller, Prof. Dr. (Health Informatics)


[1669]
L. Shen, H. Zhang, C. Zhu, R. Li, K. Qian, F. Tian, B. Hu, B. W. Schuller and Y. Yamamoto.
Enhancing Emotion Regulation in Mental Disorder Treatment: An AIGC-based Closed-Loop Music Intervention System.
IEEE Transactions on Affective Computing Early Access (Apr. 2025). DOI
Abstract

Mental disorders have increased rapidly and have emerged as a serious social health issue over the past decade. Undoubtedly, the timely treatment of mental disorders is crucial. Emotion regulation has been proven to be an effective method for treating mental disorders. Music therapy, as one of the methods that can achieve emotional regulation, has gained increasing attention in the field of mental disorder treatment. However, traditional music therapy methods still face some unresolved issues, such as the lack of real-time capability and the inability to form closed-loop systems. With the advancement of artificial intelligence (AI), especially AI-generated content (AIGC), AI-based music therapy holds promise in addressing these issues. In this paper, an AIGC-based closed-loop music intervention system demonstration is proposed to regulate emotions for mental disorder treatment. This system consists of an emotion recognition model and a music generation model. The emotion recognition model can assess mental states, while the music generation model generates the corresponding emotional music for regulation. The system continuously performs recognition and regulation, thus forming a closed-loop process. In our experiments, we first evaluate the emotion recognition model and the music generation model separately, validating the recognition accuracy and the quality of the generated music. Finally, we conduct comprehensive tests on the entire system to verify its feasibility and effectiveness.

MCML Authors
Björn Schuller, Prof. Dr. (Health Informatics)


[1668]
A.-M. Rickmann, F. Bongratz and C. Wachinger.
Vertex Correspondence and Self-Intersection Reduction in Cortical Surface Reconstruction.
IEEE Transactions on Medical Imaging Early Access (Apr. 2025). DOI
Abstract

Mesh-based cortical surface reconstruction is essential for neuroimaging, enabling precise measurements of brain morphology such as cortical thickness. Establishing vertex correspondence between individual cortical meshes and group templates allows vertex-level comparisons, but traditional methods require time-consuming post-processing steps to achieve vertex correspondence. While deep learning has improved accuracy in cortical surface reconstruction, optimizing vertex correspondence has not been the focus of prior work. We introduce Vox2Cortex with Correspondence (V2CC), an extension of Vox2Cortex, which replaces the commonly used Chamfer loss with L1 loss on registered surfaces. This approach improves inter- and intra-subject correspondence, which makes it suitable for direct group comparisons and atlas-based parcellation. Additionally, we analyze mesh self-intersections, categorizing them into minor (neighboring faces) and major (non-neighboring faces) types. To address major self-intersections, which are not effectively handled by standard regularization losses, we propose a novel Self-Proximity loss, designed to adjust non-neighboring vertices within a defined proximity threshold. Comprehensive evaluations demonstrate that recent deep learning methods inadequately address vertex correspondence, often causing inaccuracies in parcellation. In contrast, our method achieves accurate correspondence and reduces self-intersections to below 1% for both pial and white matter surfaces.

MCML Authors
Fabian Bongratz (Artificial Intelligence in Medical Imaging)
Christian Wachinger, Prof. Dr. (Artificial Intelligence in Medical Imaging)


[1667]
Z. Li, Z. Wang, X. Xu, Y. Chen and B. W. Schuller.
Unsupervised Domain-Adaptive Semantic Segmentation for Surgical Instruments Leveraging Dropout-Enhanced Dual Heads and Coarse-Grained Classification Branch.
IEEE Transactions on Medical Robotics and Bionics Early Access (Apr. 2025). DOI
Abstract

Accurate semantic segmentation of surgical instruments is crucial in robot-assisted minimally invasive surgery, where it serves as a core module in surgical-instrument tracking and operation guidance. Nevertheless, it is usually difficult for existing semantic surgical-instrument segmentation approaches to adapt to unknown surgical scenes, particularly due to their insufficient consideration for reducing the domain gaps across different scenes. To address this issue, we propose an unsupervised domain-adaptive semantic segmentation approach for surgical instruments, leveraging Dropout-enhanced Dual Heads and a Coarse-Grained classification branch (D2HCG). The proposed approach comprises dropout-enhanced dual heads for diverse feature representation, and a coarse-grained classification branch for capturing complexities across varying granularities. It incorporates consistency loss functions targeting fine-grained features and coarse-grained granularities, aiming to reduce cross-scene domain gaps. We then perform experiments on cross-scene surgical-instrument semantic segmentation cases, with the experimental results demonstrating the effectiveness of the proposed approach compared with state-of-the-art semantic segmentation methods.

MCML Authors
Björn Schuller, Prof. Dr. (Health Informatics)


[1666]
L. Zhu, R. Wang, X. Jin, Y. Li, F. Tian, R. Cai, K. Qian, X. Hu, B. Hu, Y. Yamamoto and B. W. Schuller.
Explainable Depression Classification Based on EEG Feature Selection from Audio Stimuli.
IEEE Transactions on Neural Systems and Rehabilitation Engineering Early Access (Apr. 2025). DOI
Abstract

With the development of affective computing and Artificial Intelligence (AI) technologies, Electroencephalogram (EEG)-based depression detection methods have been widely proposed. However, existing studies have mostly focused on the accuracy of depression recognition, ignoring the association between features and models. Additionally, there is a lack of research on the contribution of different features to depression recognition. To this end, this study introduces an innovative approach to depression detection using EEG data, integrating Ant-Lion Optimization (ALO) and Multi-Agent Reinforcement Learning (MARL) for feature fusion analysis. The inclusion of Explainable Artificial Intelligence (XAI) methods enhances the explainability of the model’s features. The Time-Delay Embedded Hidden Markov Model (TDE-HMM) is employed to infer internal brain states during depression, triggered by audio stimulation. The ALO-MARL algorithm, combined with hyper-parameter optimization of the XGBoost classifier, achieves high accuracy (93.69%), sensitivity (88.60%), specificity (97.08%), and F1-score (91.82%) on an auditory stimulus-evoked three-channel EEG dataset. The results suggest that this approach outperforms state-of-the-art feature selection methods for depression recognition on this dataset, and XAI elucidates the critical impact of the minimum value of Power Spectral Density (PSD), Sample Entropy (SampEn), and Rényi Entropy (Ren) on depression recognition. The study also explores dynamic brain state transitions revealed by audio stimuli, providing insights for the clinical application of AI algorithms in depression recognition.

MCML Authors
Björn Schuller, Prof. Dr. (Health Informatics)


[1665]
M. Fischer, P. Neher, P. J. Schüffler, S. Ziegler, S. Xiao, R. Peretzke, D. Clunie, C. Ulrich, M. Baumgartner, A. Muckenhuber, S. Dias Almeida, M. Götz, J. Kleesiek, M. Nolden, R. Braren and K. Maier-Hein.
Unlocking the potential of digital pathology: Novel baselines for compression.
Journal of Pathology Informatics 17.100421 (Apr. 2025). DOI
Abstract

Digital pathology offers a groundbreaking opportunity to transform clinical practice in histopathological image analysis, yet faces a significant hurdle: the substantial file sizes of pathological whole slide images (WSIs). Whereas current digital pathology solutions rely on lossy JPEG compression to address this issue, lossy compression can introduce color and texture disparities, potentially impacting clinical decision-making. Whereas prior research addresses perceptual image quality and downstream performance independently of each other, we jointly evaluate compression schemes for perceptual and downstream task quality on four different datasets. In addition, we collect an initially uncompressed dataset for an unbiased perceptual evaluation of compression schemes. Our results show that deep learning models fine-tuned for perceptual quality outperform conventional compression schemes like JPEG-XL or WebP for further compression of WSI. However, they exhibit a significant bias towards the compression artifacts present in the training data and struggle to generalize across various compression schemes. We introduce a novel evaluation metric based on feature similarity between original files and compressed files that aligns very well with the actual downstream performance on the compressed WSI. Our metric allows for a general and standardized evaluation of lossy compression schemes and mitigates the requirement to independently assess different downstream tasks. Our study provides novel insights for the assessment of lossy compression schemes for WSI and encourages a unified evaluation of lossy compression schemes to accelerate the clinical uptake of digital pathology.

MCML Authors
Peter Schüffler, Prof. Dr. (Computational Pathology)


[1664]
Ö. Turgut, P. Müller, P. Hager, S. Shit, S. Starck, M. Menten, E. Martens and D. Rückert.
Unlocking the diagnostic potential of electrocardiograms through information transfer from cardiac magnetic resonance imaging.
Medical Image Analysis 101.103451 (Apr. 2025). DOI GitHub
Abstract

Cardiovascular diseases (CVD) can be diagnosed using various diagnostic modalities. The electrocardiogram (ECG) is a cost-effective and widely available diagnostic aid that provides functional information of the heart. However, its ability to classify and spatially localise CVD is limited. In contrast, cardiac magnetic resonance (CMR) imaging provides detailed structural information of the heart and thus enables evidence-based diagnosis of CVD, but long scan times and high costs limit its use in clinical routine. In this work, we present a deep learning strategy for cost-effective and comprehensive cardiac screening solely from ECG. Our approach combines multimodal contrastive learning with masked data modelling to transfer domain-specific information from CMR imaging to ECG representations. In extensive experiments using data from 40,044 UK Biobank subjects, we demonstrate the utility and generalisability of our method for subject-specific risk prediction of CVD and the prediction of cardiac phenotypes using only ECG data. Specifically, our novel multimodal pre-training paradigm improves performance by up to 12.19% for risk prediction and 27.59% for phenotype prediction. In a qualitative analysis, we demonstrate that our learned ECG representations incorporate information from CMR image regions of interest.
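The multimodal contrastive pre-training described above pairs each ECG with its subject's CMR embedding. As a plain-Python sketch of a generic symmetric InfoNCE-style objective over a batch (not the paper's exact loss or implementation), with matching ECG/CMR pairs on the diagonal of a similarity matrix:

```python
import math

def info_nce(sim_matrix, temperature=0.1):
    """Symmetric InfoNCE-style loss over paired embeddings.

    sim_matrix[i][j] is the similarity between ECG i and CMR j;
    matching pairs sit on the diagonal. Temperature is illustrative.
    """
    n = len(sim_matrix)
    total = 0.0
    for i in range(n):
        row = [s / temperature for s in sim_matrix[i]]          # ECG i vs all CMR
        col = [sim_matrix[j][i] / temperature for j in range(n)]  # CMR i vs all ECG
        for logits in (row, col):
            # cross-entropy with the true pair (index i) as target
            log_z = math.log(sum(math.exp(l) for l in logits))
            total += log_z - logits[i]
    return total / (2 * n)
```

A perfectly aligned batch (identity similarity matrix) drives the loss towards zero, while an uninformative matrix of equal similarities yields log(batch size).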

MCML Authors
Martin Menten, Dr. (Artificial Intelligence in Healthcare and Medicine)
Daniel Rückert, Prof. Dr. (Artificial Intelligence in Healthcare and Medicine)


[1663]
A. Bitarafan, M. Mozafari, M. F. Azampour, M. S. Baghshah, N. Navab and A. Farshad.
Self-supervised 3D medical image segmentation by flow-guided mask propagation learning.
Medical Image Analysis 101.103478 (Apr. 2025). DOI GitHub
Abstract

Despite significant progress in 3D medical image segmentation using deep learning, manual annotation remains a labor-intensive bottleneck. Self-supervised mask propagation (SMP) methods have emerged to alleviate this challenge, allowing intra-volume segmentation with just a single slice annotation. However, the previous SMP methods often rely on 2D information and ignore volumetric contexts. While our previous work, called Vol2Flow, attempts to address this concern, it exhibits limitations, including not focusing enough on local (i.e., slice-pair) information, neglecting global information (i.e., volumetric contexts) in the objective function, and error accumulation during slice-to-slice reconstruction. This paper introduces Flow2Mask, a novel SMP method, developed to overcome the limitations of previous SMP approaches, particularly Vol2Flow. During training, Flow2Mask proposes the Local-to-Global (L2G) loss to learn inter-slice flow fields among all consecutive slices within a volume in an unsupervised manner. This dynamic loss is based on curriculum learning to gradually learn information within a volume from local to global contexts. Additionally, the Inter-Slice Smoothness (ISS) loss is introduced as a regularization term to encourage changes between the slices to occur consistently and continuously. During inference, Flow2Mask leverages these 3D flow fields for inter-slice mask propagation in a 3D image, spreading annotation from a single annotated slice to the entire volume. Moreover, we propose an automatic strategy to select the most representative slice as initial annotation in the mask propagation process. Experimental evaluations on different abdominal datasets demonstrate that our proposed SMP method outperforms previous approaches and improves the overall mean DSC of Vol2Flow by +2.1%, +8.2%, and +4.0% for the Sliver, CHAOS, and 3D-IRCAD datasets, respectively.
Furthermore, Flow2Mask even exhibits substantial improvements in weakly-supervised and self-supervised few-shot segmentation methods when applied as a mask completion tool.

MCML Authors
Mohammad Farid Azampour (Computer Aided Medical Procedures & Augmented Reality)
Nassir Navab, Prof. Dr. (Computer Aided Medical Procedures & Augmented Reality)
Azade Farshad, Dr. (Computer Aided Medical Procedures & Augmented Reality)


[1662]
L. von der Heyde, A.-C. Haensch and A. Wenz.
Vox Populi, Vox AI? Using Language Models to Estimate German Public Opinion.
Social Science Computer Review Online First (Apr. 2025). DOI
Abstract

‘Synthetic samples’ generated by large language models (LLMs) have been argued to complement or replace traditional surveys, assuming their training data is grounded in human-generated data that potentially reflects attitudes and behaviors prevalent in the population. Initial US-based studies that have prompted LLMs to mimic survey respondents found that the responses match survey data. However, the relationship between the respective target population and LLM training data might affect the generalizability of such findings. In this paper, we critically evaluate the use of LLMs for public opinion research in a different context, by investigating whether LLMs can estimate vote choice in Germany. We generate a synthetic sample matching the 2017 German Longitudinal Election Study respondents and ask the LLM GPT-3.5 to predict each respondent’s vote choice. Comparing these predictions to the survey-based estimates on the aggregate and subgroup levels, we find that GPT-3.5 exhibits a bias towards the Green and Left parties. While the LLM predictions capture the tendencies of “typical” voters, they miss more complex factors of vote choice. By examining the LLM-based prediction of voting behavior in a non-English speaking context, our study contributes to research on the extent to which LLMs can be leveraged for studying public opinion. The findings point to disparities in opinion representation in LLMs and underscore the limitations in applying them for public opinion estimation.

MCML Authors
Leah von der Heyde (Social Data Science and AI)
Anna-Carolina Haensch, Dr. (Social Data Science and AI)


[1661]
N. Santhanam, H. E. Kim, D. Rügamer, A. Bender, S. Muthers, C. G. Cho, A. Alonso, K. Szabo, F.-S. Centner, H. Wenz, T. Ganslandt, M. Platten, C. Groden, M. Neumaier, F. Siegel and M. E. Maros.
Machine learning-based forecasting of daily acute ischemic stroke admissions using weather data.
npj Digital Medicine 8.225 (Apr. 2025). DOI
Abstract

Background: In the midst of the emerging climate crisis, healthcare providers lack locally validated, disease-specific surveillance models. Stroke, a significant contributor to the global disease burden, has been linked to climate change. Therefore, we developed and benchmarked machine learning (ML) models based on locoregional weather systems to forecast the number of daily acute ischemic stroke (AIS) admissions.
Methods: AIS patients diagnosed between 2015 and 2021 at the tertiary University Medical Center (UMC) Mannheim, Germany, were extracted from the local data integration center and geospatially matched to weather data from the German Weather Service (DWD) based on the clinic’s, the patients’ home, and the closest weather tower’s locations at the time of admission. Statistical (Poisson) models, a boosted generalized additive model (GAM), support vector regression (SVR), and tree-based models, including random forest (RF) and extreme gradient boosting (XGB), were evaluated in regression settings within a time-stratified nested cross-validation setup (training/validation: 2015-2020, test set: 2021) to predict the number of daily AIS admissions.
Findings: The cohort included 7,914 AIS patients (4,244 male, 53.6%). XGB showed the best test performance with the lowest mean absolute error (MAE) of 1.21 cases/day. Maximum air pressure was identified as the top predictive variable. Shapley additive explanations analyses revealed that temperature extremes of extended cold (lag-3 minimum temperature <-2 °C; minimum perceived temperature <-1.4 °C) and hot stressors (lag-7 minimum temperature >15 °C), as well as stormy conditions (lag-1 and lag-2 maximum wind gust >14 m/s and speed >10.4 m/s), increased stroke incidence substantially, with distinct seasonal associations.
Interpretation: ML models can sufficiently forecast AIS admissions based on weather patterns allowing for improved resource allocation and preparedness.
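The evaluation protocol above (train and validate on 2015-2020, test on 2021) can be sketched with synthetic daily admission counts and a naive mean baseline; everything below is illustrative and not the study's code, data, or XGB model:

```python
import random
import statistics

random.seed(0)

# Synthetic daily admission counts for 2015-2021 (365 days per year).
years = list(range(2015, 2022))
counts = {y: [max(0, round(3 + random.gauss(0, 1))) for _ in range(365)] for y in years}

# Time-stratified split: earlier years for training, the final year held out.
train = [c for y in years if y < 2021 for c in counts[y]]
test = counts[2021]

# Naive baseline: predict the training mean for every test day.
pred = statistics.fmean(train)
mae = statistics.fmean(abs(c - pred) for c in test)
print(f"baseline MAE: {mae:.2f} cases/day")
```

A real pipeline would replace the mean baseline with the weather-feature models compared in the paper, keeping the same strictly chronological split.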

MCML Authors
David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[1660]
C. Bülte, Y. Sale, T. Löhr, P. Hofman, G. Kutyniok and E. Hüllermeier.
An Axiomatic Assessment of Entropy- and Variance-based Uncertainty Quantification in Regression.
Preprint (Apr. 2025). arXiv
Abstract

Uncertainty quantification (UQ) is crucial in machine learning, yet most (axiomatic) studies of uncertainty measures focus on classification, leaving a gap in regression settings with limited formal justification and evaluations. In this work, we introduce a set of axioms to rigorously assess measures of aleatoric, epistemic, and total uncertainty in supervised regression. By utilizing a predictive exponential family, we can generalize commonly used approaches for uncertainty representation and corresponding uncertainty measures. More specifically, we analyze the widely used entropy- and variance-based measures regarding limitations and challenges. Our findings provide a principled foundation for UQ in regression, offering theoretical insights and practical guidelines for reliable uncertainty assessment.
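The variance-based measures analyzed here follow the law of total variance, with total uncertainty splitting into an aleatoric and an epistemic part. A toy calculation for an ensemble of Gaussian predictive distributions (all numbers are illustrative, not from the paper):

```python
import statistics

# Ensemble of (mean, variance) predictions for a single regression input.
ensemble = [(2.0, 0.5), (2.4, 0.4), (1.8, 0.6), (2.2, 0.5)]

means = [m for m, _ in ensemble]
variances = [v for _, v in ensemble]

aleatoric = statistics.fmean(variances)  # expected conditional variance
epistemic = statistics.pvariance(means)  # variance of conditional means
total = aleatoric + epistemic            # law of total variance

print(aleatoric, epistemic, total)
```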

MCML Authors
Christopher Bülte

Mathematical Foundations of Artificial Intelligence

Yusuf Sale

Artificial Intelligence and Machine Learning

Paul Hofman

Artificial Intelligence and Machine Learning

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1659]
Y. Burkhardt, S. Schaefer and S. Leutenegger.
SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection.
Preprint (Apr. 2025). arXiv GitHub
Abstract

Event-based keypoint detection and matching holds significant potential, enabling the integration of event sensors into highly optimized Visual SLAM systems developed for frame cameras over decades of research. Unfortunately, existing approaches struggle with the motion-dependent appearance of keypoints and the complex noise prevalent in event streams, resulting in severely limited feature matching capabilities and poor performance on downstream tasks. To mitigate this problem, we propose SuperEvent, a data-driven approach to predict stable keypoints with expressive descriptors. Due to the absence of event datasets with ground truth keypoint labels, we leverage existing frame-based keypoint detectors on readily available event-aligned and synchronized gray-scale frames for self-supervision: we generate temporally sparse keypoint pseudo-labels considering that events are a product of both scene appearance and camera motion. Combined with our novel, information-rich event representation, we enable SuperEvent to effectively learn robust keypoint detection and description in event streams. Finally, we demonstrate the usefulness of SuperEvent by its integration into a modern sparse keypoint and descriptor-based SLAM framework originally developed for traditional cameras, surpassing the state-of-the-art in event-based SLAM by a wide margin.

MCML Authors
Yannick Burkhardt

Machine Learning for Robotics

Simon Schaefer

Machine Learning for Robotics

Stefan Leutenegger

Prof. Dr.

Machine Learning for Robotics


[1658]
W. Chen, G. Zhang, F. Wimbauer, R. Wang, N. Araslanov, A. Vedaldi and D. Cremers.
Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction.
Preprint (Apr. 2025). arXiv
Abstract

Traditional SLAM systems, which rely on bundle adjustment, struggle with highly dynamic scenes commonly found in casual videos. Such videos entangle the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, whereas the latter can lead to inconsistent motion estimates. Taking a novel approach, this work leverages a 3D point tracker to separate the camera-induced motion from the observed motion of dynamic objects. By considering only the camera-induced component, bundle adjustment can operate reliably on all scene elements as a result. We further ensure depth consistency across video frames with lightweight post-processing based on scale maps. Our framework combines the core of traditional SLAM – bundle adjustment – with a robust learning-based 3D tracker front-end. Integrating motion decomposition, bundle adjustment and depth refinement, our unified framework, BA-Track, accurately tracks the camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.

MCML Authors
Weirong Chen

Computer Vision & Artificial Intelligence

Ganlin Zhang

Computer Vision & Artificial Intelligence

Felix Wimbauer

Computer Vision & Artificial Intelligence

Nikita Araslanov

Dr.

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1657]
L. Fichtel, M. Spliethöver, E. Hüllermeier, P. Jimenez, N. Klowait, S. Kopp, A.-C. N. Ngomo, A. Robrecht, I. Scharlau, L. Terfloth, A.-L. Vollmer and H. Wachsmuth.
Investigating Co-Constructive Behavior of Large Language Models in Explanation Dialogues.
Preprint (Apr. 2025). arXiv
Abstract

The ability to generate explanations that are understood by explainees is the quintessence of explainable artificial intelligence. Since understanding depends on the explainee’s background and needs, recent research has focused on co-constructive explanation dialogues, where the explainer continuously monitors the explainee’s understanding and adapts explanations dynamically. We investigate the ability of large language models (LLMs) to engage as explainers in co-constructive explanation dialogues. In particular, we present a user study in which explainees interact with LLMs, of which some have been instructed to explain a predefined topic co-constructively. We evaluate the explainees’ understanding before and after the dialogue, as well as their perception of the LLMs’ co-constructive behavior. Our results indicate that current LLMs show some co-constructive behaviors, such as asking verification questions, that foster the explainees’ engagement and can improve understanding of a topic. However, their ability to effectively monitor the current understanding and scaffold the explanations accordingly remains limited.

MCML Authors
Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1656]
D. Kotovenko, O. Grebenkova and B. Ommer.
EDGS: Eliminating Densification for Efficient Convergence of 3DGS.
Preprint (Apr. 2025). arXiv
Abstract

3D Gaussian Splatting reconstructs scenes by starting from a sparse Structure-from-Motion initialization and iteratively refining under-reconstructed regions. This process is inherently slow, as it requires multiple densification steps where Gaussians are repeatedly split and adjusted, following a lengthy optimization path. Moreover, this incremental approach often leads to suboptimal renderings, particularly in high-frequency regions where detail is critical. We propose a fundamentally different approach: we eliminate the densification process with a one-step approximation of scene geometry using triangulated pixels from dense image correspondences. This dense initialization allows us to estimate the rough geometry of the scene while preserving rich details from input RGB images, providing each Gaussian with well-informed colors, scales, and positions. As a result, we dramatically shorten the optimization path and remove the need for densification. Unlike traditional methods that rely on sparse keypoints, our dense initialization ensures uniform detail across the scene, even in high-frequency regions where 3DGS and other methods struggle. Moreover, since all splats are initialized in parallel at the start of optimization, we eliminate the need to wait for densification to adjust new Gaussians. Our method not only outperforms speed-optimized models in training efficiency but also achieves higher rendering quality than state-of-the-art approaches, all while using only half the splats of standard 3DGS. It is fully compatible with other 3DGS acceleration techniques, making it a versatile and efficient solution that can be integrated with existing approaches.

MCML Authors
Olga Grebenkova

Computer Vision & Learning

Björn Ommer

Prof. Dr.

Computer Vision & Learning


[1655]
D. Martens, G. Shmueli, T. Evgeniou, K. Bauer, C. Janiesch, S. Feuerriegel, S. Gabel, S. Goethals, T. Greene, N. Klein, M. Kraus, N. Kühl, C. Perlich, W. Verbeke, A. Zharova, P. Zschech and F. Provost.
Beware of 'Explanations' of AI.
Preprint (Apr. 2025). arXiv
Abstract

Understanding the decisions made and actions taken by increasingly complex AI systems remains a key challenge. This has led to an expanding field of research in explainable artificial intelligence (XAI), highlighting the potential of explanations to enhance trust, support adoption, and meet regulatory standards. However, the question of what constitutes a ‘good’ explanation depends on the goals, stakeholders, and context. At a high level, psychological insights such as the concept of mental model alignment can offer guidance, but success in practice is challenging due to social and technical factors. Because the problem is thus ill-defined, explanations can be of poor quality (e.g. unfaithful, irrelevant, or incoherent), potentially leading to substantial risks. Instead of fostering trust and safety, poorly designed explanations can actually cause harm, including wrong decisions, privacy violations, manipulation, and even reduced AI adoption. Therefore, we caution stakeholders to beware of explanations of AI: while they can be vital, they are not automatically a remedy for transparency or responsible AI adoption, and their misuse or limitations can exacerbate harm. Attention to these caveats can help guide future research to improve the quality and impact of AI explanations.

MCML Authors
Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1654]
P. Mondorf, S. Zhou, M. Riedler and B. Plank.
Enabling Systematic Generalization in Abstract Spatial Reasoning through Meta-Learning for Compositionality.
Preprint (Apr. 2025). arXiv
Abstract

Systematic generalization refers to the capacity to understand and generate novel combinations from known components. Despite recent progress by large language models (LLMs) across various domains, these models often fail to extend their knowledge to novel compositional scenarios, revealing notable limitations in systematic generalization. There has been an ongoing debate about whether neural networks possess the capacity for systematic generalization, with recent studies suggesting that meta-learning approaches designed for compositionality can significantly enhance this ability. However, these insights have largely been confined to linguistic problems, leaving their applicability to other tasks an open question. In this study, we extend the approach of meta-learning for compositionality to the domain of abstract spatial reasoning. To this end, we introduce SYGAR-a dataset designed to evaluate the capacity of models to systematically generalize from known geometric transformations (e.g., translation, rotation) of two-dimensional objects to novel combinations of these transformations (e.g., translation+rotation). Our results show that a transformer-based encoder-decoder model, trained via meta-learning for compositionality, can systematically generalize to previously unseen transformation compositions, significantly outperforming state-of-the-art LLMs, including o3-mini, GPT-4o, and Gemini 2.0 Flash, which fail to exhibit similar systematic behavior. Our findings highlight the effectiveness of meta-learning in promoting systematicity beyond linguistic tasks, suggesting a promising direction toward more robust and generalizable models.
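The transformations SYGAR composes (e.g., translation + rotation of 2-D objects) can be sketched as function composition on point sets; this is a toy illustration, not the paper's data-generation code:

```python
def translate(pts, dx, dy):
    """Shift every point by (dx, dy)."""
    return [(x + dx, y + dy) for x, y in pts]

def rotate90(pts):
    """Rotate every point 90 degrees counter-clockwise about the origin."""
    return [(-y, x) for x, y in pts]

# Known primitives applied to a unit square ...
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
# ... and a novel combination of those primitives (translation + rotation).
composed = rotate90(translate(square, 2, 0))
print(composed)
```

Systematic generalization, in this setting, means predicting the effect of such unseen compositions after having seen only the individual primitives.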

MCML Authors
Philipp Mondorf

AI and Computational Linguistics

Shijia Zhou

AI and Computational Linguistics

Monica Riedler

Computer Vision & Artificial Intelligence

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1653]
E. Özeren, Y. Liu and H. Schütze.
HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization.
Preprint (Apr. 2025). arXiv
Abstract

Many pre-trained language models (PLMs) exhibit suboptimal performance on mid- and low-resource languages, largely due to limited exposure to these languages during pre-training. A common strategy to address this is to introduce new tokens specific to the target languages, initialize their embeddings, and apply continual pre-training on target-language data. Among such methods, OFA (Liu et al., 2024a) proposes a similarity-based subword embedding initialization heuristic that is both effective and efficient. However, OFA restricts target-language token embeddings to be convex combinations of a fixed number of source-language embeddings, which may limit expressiveness. To overcome this limitation, we propose HYPEROFA, a hypernetwork-based approach for more adaptive token embedding initialization. The hypernetwork is trained to map from an external multilingual word vector space to the PLM’s token embedding space using source-language tokens. Once trained, it can generate flexible embeddings for target-language tokens, serving as a good starting point for continual pre-training. Experiments demonstrate that HYPEROFA consistently outperforms the random-initialization baseline and matches or exceeds the performance of OFA in both continual pre-training convergence and downstream task performance. We make the code publicly available.
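For context, the OFA-style initialization that HYPEROFA generalizes can be sketched as a similarity-weighted convex combination of source-token embeddings, with similarities computed in an external word-vector space. All tokens and vectors below are made up for illustration:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# External multilingual word vectors (shared space) and PLM embeddings
# for three source-language tokens (toy 2-D vectors).
source_word_vecs = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "sky": [0.0, 1.0]}
source_plm_embs = {"cat": [0.2, 0.8], "dog": [0.3, 0.7], "sky": [0.9, 0.1]}

def init_target_embedding(target_word_vec, k=2):
    # Pick the k most similar source tokens in word-vector space ...
    sims = sorted(source_word_vecs,
                  key=lambda t: cosine(target_word_vec, source_word_vecs[t]),
                  reverse=True)[:k]
    # ... and take a similarity-weighted convex combination of their PLM embeddings.
    weights = [cosine(target_word_vec, source_word_vecs[t]) for t in sims]
    z = sum(weights)
    return [sum(w * source_plm_embs[t][d] for w, t in zip(weights, sims)) / z
            for d in range(2)]

# A hypothetical target-language token whose word vector is close to "cat"/"dog".
emb = init_target_embedding([0.95, 0.05])
print(emb)
```

HYPEROFA replaces this fixed convex-combination rule with a learned hypernetwork mapping word vectors to embeddings.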

MCML Authors
Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1652]
M. Pach, S. Karthik, Q. Bouniot, S. Belongie and Z. Akata.
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models.
Preprint (Apr. 2025). arXiv
Abstract

Sparse Autoencoders (SAEs) have recently been shown to enhance interpretability and steerability in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity in vision representations. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons while also exhibiting hierarchical representations that align well with expert-defined structures (e.g., iNaturalist taxonomy). Most notably, we demonstrate that applying SAEs to intervene on a CLIP vision encoder can directly steer the output of multimodal LLMs (e.g., LLaVA) without any modifications to the underlying model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised approach for enhancing both the interpretability and control of VLMs.

MCML Authors
Mateusz Pach

Interpretable and Reliable Machine Learning

Shyamgopal Karthik

Interpretable and Reliable Machine Learning

Quentin Bouniot

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[1651]
N. Röhrich, A. Hoffmann, R. Nordsieck, E. Zarbali and A. Javanmardi.
Masked Autoencoder Self Pre-Training for Defect Detection in Microelectronics.
Preprint (Apr. 2025). arXiv
Abstract

Whereas in general computer vision, transformer-based architectures have quickly become the gold standard, microelectronics defect detection still heavily relies on convolutional neural networks (CNNs). We hypothesize that this is because a) transformers have an increased need for data and b) labelled image generation procedures for microelectronics are costly, and labelled data is therefore sparse. Whereas in other domains, pre-training on large natural image datasets can mitigate this problem, in microelectronics transfer learning is hindered due to the dissimilarity of domain data and natural images. Therefore, we evaluate self pre-training, where models are pre-trained on the target dataset, rather than another dataset. We propose a vision transformer (ViT) pre-training framework for defect detection in microelectronics based on masked autoencoders (MAE). In MAE, a large share of image patches is masked and reconstructed by the model during pre-training. We perform pre-training and defect detection using a dataset of less than 10,000 scanning acoustic microscopy (SAM) images labelled using transient thermal analysis (TTA). Our experimental results show that our approach leads to substantial performance gains compared to a) supervised ViT, b) ViT pre-trained on natural image datasets, and c) state-of-the-art CNN-based defect detection models used in the literature. Additionally, interpretability analysis reveals that our self pre-trained models, in comparison to ViT baselines, correctly focus on defect-relevant features such as cracks in the solder material. This demonstrates that our approach yields fault-specific feature representations, making our self pre-trained models viable for real-world defect detection in microelectronics.

MCML Authors
Alireza Javanmardi

Artificial Intelligence and Machine Learning


[1650]
L. Sang, Z. Canfes, D. Cao, R. Marin, F. Bernard and D. Cremers.
TwoSquared: 4D Generation from 2D Image Pairs.
Preprint (Apr. 2025). arXiv
Abstract

Despite the astonishing progress in generative AI, 4D dynamic object generation remains an open challenge. With limited high-quality training data and heavy computing requirements, the combination of hallucinating unseen geometry together with unseen movement poses great challenges to generative models. In this work, we propose TwoSquared as a method to obtain a 4D physically plausible sequence starting from only two 2D RGB images corresponding to the beginning and end of the action. Instead of directly solving the 4D generation problem, TwoSquared decomposes the problem into two steps: 1) an image-to-3D generation module based on an existing generative model trained on high-quality 3D assets, and 2) a physically inspired deformation module to predict intermediate movements. To this end, our method does not require templates or object-class-specific prior knowledge and can take in-the-wild images as input. In our experiments, we demonstrate that TwoSquared is capable of producing texture-consistent and geometry-consistent 4D sequences given only 2D images.

MCML Authors
Lu Sang

Computer Vision & Artificial Intelligence

Riccardo Marin

Dr.

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1649]
C. Sauer, F. J. D. Lange, M. Thurow, I. Dormuth and A.-L. Boulesteix.
Statistical parametric simulation studies based on real data.
Preprint (Apr. 2025). arXiv
Abstract

Simulation studies are indispensable for evaluating and comparing statistical methods. The most common simulation approach is parametric simulation, where the data-generating mechanism (DGM) corresponds to a predefined parametric model from which observations are drawn. Many statistical simulation studies aim to provide practical recommendations on a method’s suitability for a given application; however, parametric simulations in particular are frequently criticized for being too simplistic and not reflecting reality. To overcome this drawback, it is generally considered a sensible approach to employ real data for constructing the parametric DGMs. However, while the concept of real-data-based parametric DGMs is widely recognized, the specific ways in which DGM components are inferred from real data vary, and their implications may not always be well understood. Additionally, researchers often rely on a limited selection of real datasets, with the rationale for their selection often unclear. This paper addresses these issues by formally discussing how components of parametric DGMs can be inferred from real data and how dataset selection can be performed more systematically. By doing so, we aim to support researchers in conducting simulation studies with a lower risk of overgeneralization and misinterpretation. We illustrate the construction of parametric DGMs based on a systematically selected set of real datasets using two examples: one on ordinal outcomes in randomized controlled trials and one on differential gene expression analysis.
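A minimal example of the idea: infer a parametric DGM component from a dataset, then simulate from the fitted model. Both the "real" dataset and the normal model below are toy stand-ins, not the paper's examples:

```python
import random
import statistics

random.seed(1)

# Stand-in for a real dataset (in practice: observed study data).
real = [random.gauss(10, 2) for _ in range(500)]

# Infer DGM components from the data: here, the parameters of a normal model.
mu = statistics.fmean(real)
sigma = statistics.stdev(real)

# One simulated dataset drawn from the real-data-based parametric DGM.
synthetic = [random.gauss(mu, sigma) for _ in range(500)]

print(round(mu, 1), round(sigma, 1))
```

The paper's point is that how such components are inferred, and which real datasets they are inferred from, should be chosen and reported systematically.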

MCML Authors
Christina Sauer (née Nießl)

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine


[1648]
M. Scherbela, N. Gao, P. Grohs and S. Günnemann.
Accurate Ab-initio Neural-network Solutions to Large-Scale Electronic Structure Problems.
Preprint (Apr. 2025). arXiv
Abstract

We present finite-range embeddings (FiRE), a novel wave function ansatz for accurate large-scale ab-initio electronic structure calculations. Compared to contemporary neural-network wave functions, FiRE reduces the asymptotic complexity of neural-network variational Monte Carlo (NN-VMC) by a factor of roughly n_el, the number of electrons. By restricting electron-electron interactions within the neural network, FiRE accelerates all key operations – sampling, pseudopotentials, and Laplacian computations – resulting in a real-world 10× acceleration in now-feasible 180-electron calculations. We validate our method’s accuracy on various challenging systems, including biochemical compounds, conjugated hydrocarbons, and organometallic compounds. On these systems, FiRE’s energies are consistently within chemical accuracy of the most reliable data, including experiments, even in cases where high-accuracy methods such as CCSD(T), AFQMC, or contemporary NN-VMC fall short. With these improvements in both runtime and accuracy, FiRE represents a new ‘gold-standard’ method for fast and accurate large-scale ab-initio calculations, potentially enabling new computational studies in fields like quantum chemistry, solid-state physics, and material design.

MCML Authors
Stephan Günnemann

Prof. Dr.

Data Analytics & Machine Learning


[1647]
F. Weindel and R. Heckel.
LLM-Guided Search for Deletion-Correcting Codes.
Preprint (Apr. 2025). arXiv
Abstract

Finding deletion-correcting codes of maximum size has been an open problem for over 70 years, even for a single deletion. In this paper, we propose a novel approach for constructing deletion-correcting codes. A code is a set of sequences satisfying certain constraints, and we construct it by greedily adding the highest-priority sequence according to a priority function. To find good priority functions, we leverage FunSearch, a large language model (LLM)-guided evolutionary search proposed by Romera-Paredes et al., 2024. FunSearch iteratively generates, evaluates, and refines priority functions to construct large deletion-correcting codes. For a single deletion, our evolutionary search finds functions that construct codes which match known maximum sizes, reach the size of the largest (conjectured optimal) Varshamov-Tenengolts codes where the maximum is unknown, and independently rediscover them in equivalent form. For two deletions, we find functions that construct codes with new best-known sizes for code lengths n = 12, 13, and 16, establishing improved lower bounds. These results demonstrate the potential of LLM-guided search for information theory and code design and represent the first application of such methods for constructing error-correcting codes.
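The greedy construction described above can be sketched as follows for the single-deletion case. The VT-style checksum priority used here is a hand-written illustrative choice, not a function found by the evolutionary search:

```python
from itertools import product

def deletion_ball(seq):
    """All sequences obtainable from seq by deleting exactly one symbol."""
    return {seq[:i] + seq[i + 1:] for i in range(len(seq))}

def priority(seq):
    # Varshamov-Tenengolts-style checksum: sequences in residue class 0 first.
    return sum((i + 1) * b for i, b in enumerate(seq)) % (len(seq) + 1)

n = 6
code, covered = [], set()
# Greedily add the highest-priority sequence whose deletion ball is
# disjoint from the balls of all previously chosen codewords.
for seq in sorted(product((0, 1), repeat=n), key=priority):
    ball = deletion_ball(seq)
    if ball.isdisjoint(covered):
        code.append(seq)
        covered |= ball
print(len(code))
```

Pairwise-disjoint deletion balls guarantee the resulting set can correct a single deletion; the search in the paper evolves the `priority` function to maximize the code size this greedy loop produces.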

MCML Authors
Franziska Weindel

Machine Learning and Information Processing

Reinhard Heckel

Prof. Dr.

Machine Learning and Information Processing


[1646]
W. Yuan, Q. Khan and V. Golkov.
Generation of Musical Timbres using a Text-Guided Diffusion Model.
Preprint (Apr. 2025). arXiv GitHub
Abstract

In recent years, text-to-audio systems have achieved remarkable success, enabling the generation of complete audio segments directly from text descriptions. While these systems also facilitate music creation, the element of human creativity and deliberate expression is often limited. In contrast, the present work allows composers, arrangers, and performers to create the basic building blocks for music creation: audio of individual musical notes for use in electronic instruments and DAWs. Through text prompts, the user can specify the timbre characteristics of the audio. We introduce a system that combines a latent diffusion model and multi-modal contrastive learning to generate musical timbres conditioned on text descriptions. By jointly generating the magnitude and phase of the spectrogram, our method eliminates the need for subsequently running a phase retrieval algorithm, as related methods do.
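Why jointly generating magnitude and phase removes the phase-retrieval step: with both components available, the waveform is recovered exactly by an inverse transform, rather than approximated iteratively. A minimal DFT round-trip sketch (a toy illustration, not the paper's model or STFT setup):

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Naive inverse DFT, returning the real part."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

signal = [0.0, 1.0, 0.0, -1.0]
spec = dft(signal)
# What the model generates: magnitude and phase for every bin.
mag_phase = [(abs(c), cmath.phase(c)) for c in spec]
# Exact inversion from magnitude + phase, with no retrieval algorithm.
rebuilt = idft([m * cmath.exp(1j * p) for m, p in mag_phase])
print([round(v, 6) for v in rebuilt])
```

With magnitude alone, one would instead need an iterative phase-retrieval method such as Griffin-Lim, which is exactly the step the joint generation avoids.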

MCML Authors
Qadeer Khan

Computer Vision & Artificial Intelligence

Vladimir Golkov

Dr.

Computer Vision & Artificial Intelligence


[1645]
T. Zehle, M. Schlager, T. Heiß and M. Feurer.
CAPO: Cost-Aware Prompt Optimization.
Preprint (Apr. 2025). arXiv
Abstract

Large language models (LLMs) have revolutionized natural language processing by solving a wide range of tasks simply guided by a prompt. Yet their performance is highly sensitive to prompt formulation. While automated prompt optimization addresses this challenge by finding optimal prompts, current methods require a substantial number of LLM calls and input tokens, making prompt optimization expensive. We introduce CAPO (Cost-Aware Prompt Optimization), an algorithm that enhances prompt optimization efficiency by integrating AutoML techniques. CAPO is an evolutionary approach with LLMs as operators, incorporating racing to save evaluations and multi-objective optimization to balance performance with prompt length. It jointly optimizes instructions and few-shot examples while leveraging task descriptions for improved robustness. Our extensive experiments across diverse datasets and LLMs demonstrate that CAPO outperforms state-of-the-art discrete prompt-optimization methods in 11/15 cases, with improvements of up to 21 percentage points. Our algorithm achieves better performance even with smaller budgets, saves evaluations through racing, and decreases average prompt length via a length penalty, making it both cost-efficient and cost-aware. Even without few-shot examples, CAPO outperforms its competitors and generally remains robust to initial prompts. CAPO represents an important step toward making prompt optimization more powerful and accessible by improving cost-efficiency.
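The racing idea used to save evaluations can be sketched as early stopping of candidate scoring: once a candidate prompt cannot beat the incumbent even by acing all remaining examples, it is eliminated. The scores and numbers below are stand-ins, not CAPO's actual evaluation:

```python
def race(candidate_scores, incumbent_total, n_examples):
    """Score a candidate example by example; stop early if it cannot win."""
    total = used = 0
    for s in candidate_scores:
        total += s
        used += 1
        # Best case: the candidate gets every remaining example right.
        if total + (n_examples - used) < incumbent_total:
            return None, used  # eliminated early, saving evaluations
    return total, used

incumbent = 7                            # incumbent prompt scored 7/10
cand = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1]    # per-example correctness of a weak candidate
score, evals = race(cand, incumbent, 10)
print(score, evals)
```

Here the weak candidate is dropped after 5 of 10 evaluations, which is the kind of saving racing provides at scale.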

MCML Authors
Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science


[1644]
Y.-J. Li, M. Gladkova, Y. Xia, R. Wang and D. Cremers.
VXP: Voxel-Cross-Pixel Large-Scale Camera-LiDAR Place Recognition.
3DV 2025 - 12th International Conference on 3D Vision. Singapore, Mar 25-28, 2025. To be published. Preprint available. arXiv
Abstract

Recent works on global place recognition treat the task as a retrieval problem, where an off-the-shelf global descriptor is commonly designed for image-based and LiDAR-based modalities. However, it is non-trivial to perform accurate image-LiDAR global place recognition, since extracting consistent and robust global descriptors from different domains (2D images and 3D point clouds) is challenging. To address this issue, we propose a novel Voxel-Cross-Pixel (VXP) approach, which establishes voxel and pixel correspondences in a self-supervised manner and brings them into a shared feature space. Specifically, VXP is trained in a two-stage manner that first explicitly exploits local feature correspondences and then enforces similarity of global descriptors. Extensive experiments on three benchmarks (Oxford RobotCar, ViViD++ and KITTI) demonstrate that our method surpasses the state-of-the-art cross-modal retrieval by a large margin.

MCML Authors
Mariia Gladkova

Computer Vision & Artificial Intelligence

Yan Xia

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1643]
J. Seidenschwarz, Q. Zhou, B. Duisterhof, D. Ramanan and L. Leal-Taixé.
DynOMo: Online Point Tracking by Dynamic Online Monocular Gaussian Reconstruction.
3DV 2025 - 12th International Conference on 3D Vision. Singapore, Mar 25-28, 2025. To be published. Preprint available. arXiv
Abstract

Reconstructing scenes and tracking motion are two sides of the same coin. Tracking points allows for geometric reconstruction [14], while geometric reconstruction of (dynamic) scenes allows for 3D tracking of points over time [24, 39]. The latter was recently also exploited for 2D point tracking to overcome occlusion ambiguities by lifting tracking directly into 3D [38]. However, the above approaches require either offline processing or multi-view camera setups, both unrealistic for real-world applications like robot navigation or mixed reality. We target the challenge of online 2D and 3D point tracking from unposed monocular camera input, introducing Dynamic Online Monocular Reconstruction (DynOMo). We leverage 3D Gaussian splatting to reconstruct dynamic scenes in an online fashion. Our approach extends 3D Gaussians to capture new content and object motions while estimating camera movements from a single RGB frame. DynOMo stands out by enabling the emergence of point trajectories through robust image feature reconstruction and a novel similarity-enhanced regularization term, without requiring any correspondence-level supervision. It sets the first baseline for online point tracking with monocular unposed cameras, achieving performance on par with existing methods. We aim to inspire the community to advance online point tracking and reconstruction, expanding the applicability to diverse real-world scenarios.

MCML Authors
Jenny Seidenschwarz

* Former Member


[1642]
H. Zeng, M. Gao and D. Cremers.
CoE: Deep Coupled Embedding for Non-Rigid Point Cloud Correspondences.
3DV 2025 - 12th International Conference on 3D Vision. Singapore, Mar 25-28, 2025. To be published. Preprint available. arXiv
Abstract

The interest in matching non-rigidly deformed shapes represented as raw point clouds is rising due to the proliferation of low-cost 3D sensors. Yet, the task is challenging since point clouds are irregular and there is a lack of intrinsic shape information. We propose to tackle these challenges by learning a new shape representation – a per-point high dimensional embedding, in an embedding space where semantically similar points share similar embeddings. The learned embedding has multiple beneficial properties: it is aware of the underlying shape geometry and is robust to shape deformations and various shape artefacts, such as noise and partiality. Consequently, this embedding can be directly employed to retrieve high-quality dense correspondences through a simple nearest neighbor search in the embedding space. Extensive experiments demonstrate new state-of-the-art results and robustness in numerous challenging non-rigid shape matching benchmarks and show its great potential in other shape analysis tasks, such as segmentation.
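Because the learned per-point embeddings place semantically similar points close together, dense correspondences fall out of a plain nearest-neighbour search, as the abstract notes. A hypothetical sketch of that final lookup step (the toy embeddings below are invented, not the paper's learned representation):

```python
import numpy as np

def dense_correspondences(emb_x, emb_y):
    """For each point on shape X, find the point on shape Y whose embedding
    is closest (Euclidean nearest neighbour in the embedding space)."""
    # pairwise squared distances between the two embedding sets
    d2 = ((emb_x[:, None, :] - emb_y[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)  # index on Y matched to each point of X

# Toy 4-dimensional embeddings for 3 points on X and 3 points on Y.
emb_x = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], float)
emb_y = np.array([[0, 0, 0.9, 0], [0.9, 0, 0, 0], [0, 0.9, 0, 0]], float)
print(dense_correspondences(emb_x, emb_y))  # → [1 2 0]
```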

MCML Authors
Maolin Gao

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1641]
G. Zhai, E. P. Örnek, D. Z. Chen, R. Liao, Y. Di, N. Navab, F. Tombari and B. Busam.
EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion.
3DV 2025 - 12th International Conference on 3D Vision. Singapore, Mar 25-28, 2025. To be published. Preprint available. arXiv
Abstract

We present EchoScene, an interactive and controllable generative model that generates 3D indoor scenes on scene graphs. EchoScene leverages a dual-branch diffusion model that dynamically adapts to scene graphs. Existing methods struggle to handle scene graphs due to varying numbers of nodes, multiple edge combinations, and manipulator-induced node-edge operations. EchoScene overcomes this by associating each node with a denoising process and enables collaborative information exchange, enhancing controllable and consistent generation aware of global constraints. This is achieved through an information echo scheme in both shape and layout branches. At every denoising step, all processes share their denoising data with an information exchange unit that combines these updates using graph convolution. The scheme ensures that the denoising processes are influenced by a holistic understanding of the scene graph, facilitating the generation of globally coherent scenes. The resulting scenes can be manipulated during inference by editing the input scene graph and sampling the noise in the diffusion model. Extensive experiments validate our approach, which maintains scene controllability and surpasses previous methods in generation fidelity. Moreover, the generated scenes are of high quality and thus directly compatible with off-the-shelf texture generation. Code and trained models are open-sourced.

MCML Authors
Guangyao Zhai

Computer Aided Medical Procedures & Augmented Reality

Ruotong Liao

Database Systems and Data Mining

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Federico Tombari

PD Dr.

Computer Aided Medical Procedures & Augmented Reality

Benjamin Busam

Dr.

Computer Aided Medical Procedures & Augmented Reality


[1640]
L. Zumeta-Olaskoaga, A. Bender and D.-J. Lee.
Flexible modelling of time-varying exposures in event history analysis.
DAGStat 2025 - 7th Joint Statistical Meeting of the Deutsche Arbeitsgemeinschaft Statistik. Berlin, Germany, Mar 24-28, 2025. Poster presentation. Full paper available. DOI
Abstract

We present a flexible modelling approach to analyse time-varying exposures and recurrent events in team sports injuries. The approach is based on the piece-wise exponential additive mixed model where the effects of past exposures (i.e. high-intensity training loads) may accumulate over time and present complex forms of association. In order to identify a relevant time window at which past exposures have an impact on the current risk, we propose a penalty approach. We conduct a simulation study to evaluate the performance of the proposed model, under different true weight functions and different levels of heterogeneity between recurrent events. Finally, we illustrate the approach with a case study application involving an elite male football team participating in the Spanish LaLiga competition. The cohort includes time-loss injuries and external training load variables tracked by Global Positioning System devices, during the seasons 2017–2018 and 2018–2019.

MCML Authors
Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[1639]
L. Bothmann, S. Dandl, J. M. Alvarez, P. A. Boustani and B. Bischl.
Privilege Scores for Fairness-Aware ML.
DAGStat 2025 - 7th Joint Statistical Meeting of the Deutsche Arbeitsgemeinschaft Statistik. Berlin, Germany, Mar 24-28, 2025. Poster presentation. Preprint available. arXiv
Abstract

Bias-preserving methods in fairness-aware machine learning (fairML) focus on metrics that prioritize formal equality by balancing error rates across subgroups. These methods can perpetuate historical discrimination embedded in real-world data. In contrast, bias-transforming methods aim for substantive equality by actively addressing historical inequalities. As a contribution to bias-transforming methods, we introduce the concept of privilege scores, a novel approach to identifying and quantifying individual privilege in machine learning tasks. Privilege scores use causal inference techniques to compare real-world outcomes to those in a ‘fair’ world in which the protected attributes do not influence the target variable. This individual-level perspective provides actionable insights for applications such as affirmative action and beyond. Key contributions include (1) the formalization of privilege scores, (2) a methodological framework for estimation with uncertainty quantification via confidence intervals, (3) an interpretable machine learning approach for understanding privilege score contributions, and (4) a novel in-processing method, Multi-PrivScore, to mitigate model-level discrimination during model training. Experiments on simulated and real-world data demonstrate the usefulness of privilege scores. Overall, our work highlights privilege scores as a versatile tool for assessing and mitigating historical discrimination in various machine learning applications.
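The core idea — comparing an individual's factual outcome prediction with its counterfactual in a 'fair' world where the protected attribute has no influence — can be illustrated with a deliberately simple linear model. The coefficients and function name below are invented for illustration; the paper's actual estimation relies on causal inference techniques with uncertainty quantification:

```python
def privilege_score(x, protected, beta_x=1.0, beta_a=0.5):
    """Toy privilege score: factual prediction minus the counterfactual
    prediction in a 'fair' world where the protected attribute has no
    effect on the target. Coefficients are illustrative placeholders."""
    real = beta_x * x + beta_a * protected  # factual ('real-world') prediction
    fair = beta_x * x                       # protected attribute severed
    return real - fair                      # individual privilege

print(privilege_score(x=2.0, protected=1))  # → 0.5 (privileged)
print(privilege_score(x=2.0, protected=0))  # → 0.0 (no privilege)
```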

MCML Authors
Ludwig Bothmann

Dr.

Statistical Learning and Data Science

Philip Amir Boustani

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[1638]
Y. Shen.
Probabilistic Modeling and Uncertainty Awareness in Deep Learning.
Dissertation 2025. URL
Abstract

This dissertation focuses on probabilistic modeling and uncertainty-aware approaches for deep learning. It is based on four papers that tackle the problem of uncertainty-aware deep learning, covering techniques such as post-hoc calibration, model aggregation, and Bayesian deep learning with variational inference. Also, an overview of related prior work is provided, which covers both classical and deep-learning-based approaches.

MCML Authors
Yuesong Shen

Dr.

* Former Member


[1637]
S. Okabe and A. Fraser.
Bilingual Sentence Mining for Low-Resource Languages: a Case Study on Upper and Lower Sorbian.
Compute-EL @ICLDC 2025 - 8th Workshop on The Use of Computational Methods in the Study of Endangered Languages at the 9th International Conference on Language Documentation and Conservation (ICLDC 2025). Honolulu, Hawaii, USA, Mar 06, 2025. To be published. Preprint available. URL
Abstract

Parallel sentence mining is crucial for downstream tasks such as Machine Translation, especially for low-resource languages, where such resources are scarce. In this context, we apply a pipeline approach with contextual embeddings on two endangered Slavic languages spoken in Germany, Upper and Lower Sorbian, to evaluate mining quality. To this end, we compare off-the-shelf multilingual language models and word encoders pre-trained on Upper Sorbian to understand their impact on sentence mining. Moreover, to filter out irrelevant pairs, we experiment with a post-processing of mined sentences through an unsupervised word aligner based on word embeddings. We observe the usefulness of additional pre-training in Upper Sorbian, which leads to direct improvements when mining the same language but also its related language, Lower Sorbian.

MCML Authors
Shu Okabe

Dr.

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[1636]
P. T. da Silva, A. Karollus, J. Hingerl, G. Galindez, N. Wagner, X. Hernandez-Alias, D. Incarnato and J. Gagneur.
Nucleotide dependency analysis of DNA language models reveals genomic functional elements.
CSHL 2025 - 5th Cold Spring Harbor conference on Probabilistic Modeling in Genomics. Cold Spring Harbor Laboratory, New York, USA, Mar 05-08, 2025. DOI URL
Abstract

Deciphering how nucleotides in genomes encode regulatory instructions and molecular machines is a long-standing goal in biology. DNA language models (LMs) implicitly capture functional elements and their organization from genomic sequences alone by modeling probabilities of each nucleotide given its sequence context. However, using DNA LMs for discovering functional genomic elements has been challenging due to the lack of interpretable methods. Here, we introduce nucleotide dependencies which quantify how nucleotide substitutions at one genomic position affect the probabilities of nucleotides at other positions. We generated genome-wide maps of pairwise nucleotide dependencies within kilobase ranges for animal, fungal, and bacterial species. We show that nucleotide dependencies indicate deleteriousness of human genetic variants more effectively than sequence alignment and DNA LM reconstruction. Regulatory elements appear as dense blocks in dependency maps, enabling the systematic identification of transcription factor binding sites as accurately as models trained on experimental binding data. Nucleotide dependencies also highlight bases in contact within RNA structures, including pseudoknots and tertiary structure contacts, with remarkable accuracy. This led to the discovery of four novel, experimentally validated RNA structures in Escherichia coli. Finally, using dependency maps, we reveal critical limitations of several DNA LM architectures and training sequence selection strategies by benchmarking and visual diagnosis. Altogether, nucleotide dependency analysis opens a new avenue for discovering and studying functional elements and their interactions in genomes. Competing Interest Statement: The authors have declared no competing interest.
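The notion of a nucleotide dependency — how much substituting the base at position i shifts the model's probability of the reference base at position j — can be sketched with a stand-in probability model. The `toy_prob` oracle below is invented for illustration; a real analysis would query a trained DNA language model, and the exact dependency definition here is a simplification:

```python
import numpy as np

NUCS = "ACGT"

def dependency(seq, i, j, prob_fn):
    """Largest change in the model's probability of the reference base at j
    when the base at position i is substituted (illustrative definition)."""
    ref_p = prob_fn(seq, j)[NUCS.index(seq[j])]
    deltas = []
    for b in NUCS:
        if b == seq[i]:
            continue
        mutated = seq[:i] + b + seq[i + 1:]
        deltas.append(abs(prob_fn(mutated, j)[NUCS.index(seq[j])] - ref_p))
    return max(deltas)

def toy_prob(seq, j):
    """Toy oracle: position 3 prefers the complement of position 0,
    mimicking a paired site; other positions are uniform over A, C, G, T."""
    if j != 3:
        return np.full(4, 0.25)
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}[seq[0]]
    p = np.full(4, 0.1)
    p[NUCS.index(comp)] = 0.7
    return p

print(dependency("GATC", 0, 3, toy_prob))  # strong dependency (≈ 0.6)
print(dependency("GATC", 1, 3, toy_prob))  # no dependency (0.0)
```

In a dependency map, such paired positions show up as bright off-diagonal entries, which is what makes binding sites and RNA base pairs visible.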

MCML Authors
Pedro Tomaz da Silva

Computational Molecular Medicine

Alexander Karollus

Computational Molecular Medicine

Johannes Hingerl

Computational Molecular Medicine

Julien Gagneur

Prof. Dr.

Computational Molecular Medicine


[1635]
Z. Yuan, Z. Xiong, L. Mou and X. Zhu.
ChatEarthNet: a global-scale image–text dataset empowering vision–language geo-foundation models.
Earth System Science Data 17.3 (Mar. 2025). DOI
Abstract

The rapid development of remote sensing technology has led to an exponential growth in satellite images, yet their inherent complexity often makes them difficult for non-expert users to understand. Natural language, as a carrier of human knowledge, can bridge the gap between common users and complicated satellite imagery. Additionally, when paired with visual data, natural language can be utilized to train large vision–language foundation models, significantly improving performance in various tasks. Despite these advancements, the remote sensing community still faces a challenge due to the lack of large-scale, high-quality vision–language datasets for satellite images. To address this challenge, we introduce a new image–text dataset, providing high-quality natural language descriptions for global-scale satellite data. Specifically, we utilize Sentinel-2 data for its global coverage as the foundational image source, employing semantic segmentation labels from the European Space Agency’s WorldCover project to enrich the descriptions of land cover types. By conducting in-depth semantic analysis, we formulate detailed prompts to elicit rich descriptions from ChatGPT. We then include a manual verification process to enhance the dataset’s quality further. This step involves manual inspection and correction to refine the dataset. Finally, we offer the community ChatEarthNet, a large-scale image–text dataset characterized by global coverage, high quality, wide-ranging diversity, and detailed descriptions. ChatEarthNet consists of 163 488 image–text pairs with captions generated by ChatGPT-3.5 and an additional 10 000 image–text pairs with captions generated by ChatGPT-4V(ision). This dataset has significant potential for both training and evaluating vision–language geo-foundation models for remote sensing. 
The code is publicly available at https://doi.org/10.5281/zenodo.11004358 (Yuan et al., 2024b), and the ChatEarthNet dataset is available at https://doi.org/10.5281/zenodo.11003436 (Yuan et al., 2024c).

MCML Authors
Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1634]
F. Krahmer and A. Veselovska.
The mathematics of dots and pixels: On the theoretical foundations of image halftoning.
GAMM Mitteilungen 48.1 (Mar. 2025). DOI
Abstract

The evolution of image halftoning, from its analog roots to contemporary digital methodologies, encapsulates a fascinating journey marked by technological advancements and creative innovations. Yet the theoretical understanding of halftoning is much more recent. In this article, we explore various approaches towards shedding light on the design of halftoning approaches and why they work. We discuss both halftoning in a continuous domain and on a pixel grid. We start by reviewing the mathematical foundation of the so-called electrostatic halftoning method, which departed from the heuristic of considering the black dots of the halftoned image as charged particles attracted by the grey values of the image in combination with mutual repulsion. Such an attraction-repulsion model can be mathematically represented via an energy functional in a reproducing kernel Hilbert space allowing for a rigorous analysis of the resulting optimization problem as well as a convergence analysis in a suitable topology. A second class of methods that we discuss in detail is the class of error diffusion schemes, arguably among the most popular halftoning techniques due to their ability to work directly on a pixel grid and their ease of application. The main idea of these schemes is to choose the locations of the black pixels via a recurrence relation designed to agree with the image in terms of the local averages. We discuss some recent mathematical understanding of these methods that is based on a connection to Σ∆ quantizers, a popular class of algorithms for analog-to-digital conversion.

MCML Authors
Felix Krahmer

Prof. Dr.

Optimization & Data Analysis

Anna Veselovska

Dr.

Applied Numerical Analysis


[1633]
S. Garske, K. Heidler, B. Evans, K. Wong and X. Zhu.
SHAZAM: Self-Supervised Change Monitoring for Hazard Detection and Mapping.
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing Early Access (Mar. 2025). DOI GitHub
Abstract

The increasing frequency of environmental hazards due to climate change underscores the urgent need for effective monitoring systems. Current approaches either rely on expensive labelled datasets, struggle with seasonal variations, or require multiple observations for confirmation (which delays detection). To address these challenges, this work presents SHAZAM - Self-Supervised Change Monitoring for Hazard Detection and Mapping. SHAZAM uses a lightweight conditional UNet to generate expected images of a region of interest (ROI) for any day of the year, allowing for the direct modelling of normal seasonal changes and the ability to distinguish potential hazards. A modified structural similarity measure compares the generated images with actual satellite observations to compute region-level anomaly scores and pixel-level hazard maps. Additionally, a theoretically grounded seasonal threshold eliminates the need for dataset-specific optimisation. Evaluated on four diverse datasets that contain bushfires (wildfires), burned regions, extreme and out-of-season snowfall, floods, droughts, algal blooms, and deforestation, SHAZAM achieved F1 score improvements of between 0.066 and 0.234 over existing methods. This was achieved primarily through more effective hazard detection (higher recall) while using only 473K parameters. SHAZAM demonstrated superior mapping capabilities through higher spatial resolution and improved ability to suppress background features while accentuating both immediate and gradual hazards. SHAZAM has been established as an effective and generalisable solution for hazard detection and mapping across different geographical regions and a diverse range of hazards.
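The region-level anomaly score — a structural dissimilarity between the generated 'expected' image and the actual satellite observation — can be illustrated with a single-window simplification of SSIM. The constants and this global variant are illustrative only; the paper uses a modified structural similarity measure together with a seasonal threshold:

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Global SSIM between two images (single-window simplification)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))

def anomaly_score(expected, observed):
    """Region-level score: dissimilarity between the model's generated
    'expected' image and the actual observation."""
    return 1.0 - ssim_global(expected, observed)

rng = np.random.default_rng(0)
normal = rng.random((16, 16))
hazard = normal.copy()
hazard[4:12, 4:12] = 1.0  # simulated bright anomaly, e.g. a burn scar
print(anomaly_score(normal, normal) < anomaly_score(normal, hazard))  # → True
```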

MCML Authors
Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1632]
J. Xie, Y. Wang, X. Qian, J. Zhang and B. W. Schuller.
Improving Bird Vocalization Recognition in Open-Set Cross-Corpus Scenarios with Semantic Feature Reconstruction and Dual Strategy Scoring.
IEEE Signal Processing Letters 32 (Mar. 2025). DOI
Abstract

Automated recognition of bird vocalizations (BVs) is essential for biodiversity monitoring through passive acoustic monitoring (PAM), yet deep learning (DL) models encounter substantial challenges in open environments. These include difficulties in detecting unknown classes, extracting species-specific features, and achieving robust cross-corpus recognition. To address these challenges, this letter presents a DL-based open-set cross-corpus recognition method for BVs that combines feature construction with open-set recognition (OSR) techniques. We introduce a three-channel spectrogram that integrates both amplitude and phase information to enhance feature representation. To improve the recognition accuracy of known classes across corpora, we employ a class-specific semantic reconstruction model to extract deep features. For unknown class discrimination, we propose a Dual Strategy Coupling Scoring (DSCS) mechanism, which synthesizes the log-likelihood ratio score (LLRS) and reconstruction error score (RES). Our method achieves the highest weighted accuracy among existing approaches on a public dataset, demonstrating its effectiveness for open-set cross-corpus bird vocalization recognition.
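The dual-strategy idea — coupling a log-likelihood-ratio score (LLRS) with a reconstruction-error score (RES) to separate known from unknown classes — can be sketched as a weighted combination. The weights, normalisation to [0, 1], and threshold below are placeholders, not the paper's DSCS formulation:

```python
def dscs(llrs, res, w=0.5, threshold=0.5):
    """Illustrative dual-strategy coupling: fuse a log-likelihood-ratio
    score and a reconstruction-error score into one open-set decision.
    Both inputs are assumed normalised to [0, 1]; a low reconstruction
    error suggests a known class."""
    score = w * llrs + (1 - w) * (1 - res)
    return "known" if score >= threshold else "unknown"

print(dscs(llrs=0.9, res=0.1))  # → known   (high likelihood, low error)
print(dscs(llrs=0.2, res=0.8))  # → unknown (low likelihood, high error)
```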

MCML Authors
Björn Schuller

Prof. Dr.

Health Informatics


[1631]
C. Liu, C. M. Albrecht, Y. Wang and X. Zhu.
CromSS: Cross-Modal Pretraining With Noisy Labels for Remote Sensing Image Segmentation.
IEEE Transactions on Geoscience and Remote Sensing 63 (Mar. 2025). DOI GitHub
Abstract

We explore the potential of large-scale noisily labeled data to enhance feature learning by pretraining semantic segmentation models within a multimodal framework for geospatial applications. We propose a novel cross-modal sample selection (CromSS) method, a weakly supervised pretraining strategy designed to improve feature representations through cross-modal consistency and noise mitigation techniques. Unlike conventional pretraining approaches, CromSS exploits massive amounts of noisy and easy-to-come-by labels for improved feature learning beneficial to semantic segmentation tasks. We investigate middle and late fusion strategies to optimize the multimodal pretraining architecture design. We also introduce a cross-modal sample selection module to mitigate the adverse effects of label noise, which employs a cross-modal entangling strategy to refine the estimated confidence masks within each modality to guide the sampling process. Additionally, we introduce a spatial–temporal label smoothing technique to counteract overconfidence for enhanced robustness against noisy labels. To validate our approach, we assembled the multimodal dataset, NoLDO-S12, which consists of a large-scale noisy label subset from Google’s Dynamic World (DW) dataset for pretraining and two downstream subsets with high-quality labels from Google DW and OpenStreetMap (OSM) for transfer learning. Experimental results on two downstream tasks and the publicly available DFC2020 dataset demonstrate that when effectively utilized, the low-cost noisy labels can significantly enhance feature learning for segmentation tasks.

MCML Authors
Chenying Liu

Data Science in Earth Observation

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1630]
F. Li, Y. Bi, D. Huang, Z. Jiang and N. Navab.
Robotic CBCT Meets Robotic Ultrasound.
International Journal of Computer Assisted Radiology and Surgery (Mar. 2025). DOI
Abstract

Purpose: The multi-modality imaging system offers optimal fused images for safe and precise interventions in modern clinical practices, such as computed tomography-ultrasound (CT-US) guidance for needle insertion. However, the limited dexterity and mobility of current imaging devices hinder their integration into standardized workflows and the advancement toward fully autonomous intervention systems. In this paper, we present a novel clinical setup where robotic cone beam computed tomography (CBCT) and robotic US are pre-calibrated and dynamically co-registered, enabling new clinical applications. This setup allows registration-free rigid registration, facilitating multi-modal guided procedures in the absence of tissue deformation.
Methods: First, a one-time pre-calibration is performed between the systems. To ensure a safe insertion path by highlighting critical vasculature on the 3D CBCT, SAM2 segments vessels from B-mode images, using the Doppler signal as an autonomously generated prompt. Based on the registration, the Doppler image or segmented vessel masks are then mapped onto the CBCT, creating an optimally fused image with comprehensive detail. To validate the system, we used a specially designed phantom, featuring lesions covered by ribs and multiple vessels with simulated moving flow.
Results: The mapping error between US and CBCT resulted in an average deviation of mm. A user study demonstrated the effectiveness of CBCT-US fusion for needle insertion guidance, showing significant improvements in time efficiency, accuracy, and success rate. Needle intervention performance improved by approximately 50% compared to the conventional US-guided workflow.
Conclusion: We present the first robotic dual-modality imaging system designed to guide clinical applications. The results show significant performance improvements compared to traditional manual interventions.

MCML Authors
Feng Li

Computer Aided Medical Procedures & Augmented Reality

Yuan Bi

Computer Aided Medical Procedures & Augmented Reality

Dianye Huang

Computer Aided Medical Procedures & Augmented Reality

Zhongliang Jiang

Dr.

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1629]
L. Zumeta-Olaskoaga, A. Bender and D.-J. Lee.
Flexible modelling of time-varying exposures and recurrent events to analyse training load effects in team sports injuries.
Journal of the Royal Statistical Society. Series C (Applied Statistics) 74.2 (Mar. 2025). DOI
Abstract

We present a flexible modelling approach to analyse time-varying exposures and recurrent events in team sports injuries. The approach is based on the piece-wise exponential additive mixed model where the effects of past exposures (i.e. high-intensity training loads) may accumulate over time and present complex forms of association. In order to identify a relevant time window at which past exposures have an impact on the current risk, we propose a penalty approach. We conduct a simulation study to evaluate the performance of the proposed model, under different true weight functions and different levels of heterogeneity between recurrent events. Finally, we illustrate the approach with a case study application involving an elite male football team participating in the Spanish LaLiga competition. The cohort includes time-loss injuries and external training load variables tracked by Global Positioning System devices, during the seasons 2017–2018 and 2018–2019.

MCML Authors
Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[1628]
M. Schneble and G. Kauermann.
Statistical modelling of on-street parking spot occupancy in smart cities.
Journal of the Royal Statistical Society. Series C (Applied Statistics).qlaf017 (Mar. 2025). DOI
Abstract

Many studies suggest that searching for parking is associated with significant direct and indirect costs. Therefore, it is appealing to reduce the time that car drivers spend on finding an available parking spot, especially in urban areas where the space for all road users is limited. The prediction of on-street parking spot occupancy can provide drivers with guidance on where clear parking spaces are likely to be found. This field of research has gained more and more attention in the last decade through the increasing availability of real-time parking spot occupancy data. In this paper, we pursue a statistical approach for the prediction of parking spot occupancy, where we make use of time-to-event models and semi-Markov process theory. The latter involves the employment of Laplace transformations as well as their inversion, which is an ambitious numerical task. We apply our methodology to data from the City of Melbourne in Australia. Our main result is that the semi-Markov model outperforms a Markov model in terms of both true negative rate and true positive rate while this is essentially achieved by respecting the current duration that a parking space already spends in its initial state.

MCML Authors
Göran Kauermann

Prof. Dr.

Applied Statistics in Social Sciences, Economics and Business


[1627]
A. Liebeskind, J. R. Schüre, M. S. Fabian, S. Weinmüller, P. Schünke, V. Golkov, D. Cremers and M. Zaiss.
The Pulseq-CEST Library: definition of preparations and simulations, example data, and example evaluations.
Magnetic Resonance Materials in Physics, Biology and Medicine (Mar. 2025). DOI
Abstract

Objectives: Despite prevalent use of chemical exchange saturation transfer (CEST) MRI, standardization remains elusive. Imaging depends heavily on parameters dictating radiofrequency (RF) events, gradients, and apparent diffusion coefficient (ADC). We present the Pulseq-CEST Library, a repository of CEST preparation and simulation definitions, including example data and evaluations, that provides a common basis for reproducible research, rapid prototyping, and in silico deep learning training data generation.
Materials and methods: A Pulseq-CEST experiment requires (i) a CEST preparation sequence, (ii) a Bloch–McConnell parameter set, (iii) a Bloch–McConnell simulation, and (iv) an evaluation script. Pulseq-CEST utilizes the Bloch–McConnell equations to model in vitro and in vivo conditions. Using this model, a candidate sequence or environment can be held constant while varying other inputs, enabling robust testing.
Results: Data were compared for amide proton transfer weighted (APTw) and water shift and B1 (WASABI) protocols using a five-tube phantom and simulated environments. Real and simulated data matched anticipated spectral shapes and local peak characteristics. The Pulseq-CEST Library supports similar experiments with common sequences and environments to assess new protocols and sample data.
Discussion: The Pulseq-CEST Library provides a flexible mechanism for standardizing and prototyping CEST sequences, facilitating collaborative development. With the capability for expansion, including open-source incorporation of new sequences and environments, the library accelerates the invention and spread of novel CEST and other saturation transfer approaches, such as relayed NOEs (rNOEs) and semisolid magnetization transfer contrast (MTC) methods.

MCML Authors
Alexander Liebeskind

* Former Member

Vladimir Golkov

Dr.

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1626]
A. Tejada-Lapuerta, P. Bertin, S. Bauer, H. Aliee, Y. Bengio and F. J. Theis.
Causal machine learning for single-cell genomics.
Nature Genetics (Mar. 2025). DOI
Abstract

Advances in single-cell ‘-omics’ allow unprecedented insights into the transcriptional profiles of individual cells and, when combined with large-scale perturbation screens, enable measuring of the effect of targeted perturbations on the whole transcriptome. These advances provide an opportunity to better understand the causative role of genes in complex biological processes. In this Perspective, we delineate the application of causal machine learning to single-cell genomics and its associated challenges. We first present the causal model that is most commonly applied to single-cell biology and then identify and discuss potential approaches to three open problems: the lack of generalization of models to novel experimental conditions, the complexity of interpreting learned models, and the difficulty of learning cell dynamics.

MCML Authors
Stefan Bauer

Prof. Dr.

Algorithmic Machine Learning & Explainable AI

Fabian Theis

Prof. Dr.

Mathematical Modelling of Biological Systems


[1625]
M. E. Consens, C. Dufault, M. Wainberg, D. Forster, M. Karimzadeh, H. Goodarzi, F. J. Theis, A. Moses and B. Wang.
Transformers and genome language models.
Nature Machine Intelligence (Mar. 2025). DOI
Abstract

Large language models based on the transformer deep learning architecture have revolutionized natural language processing. Motivated by the analogy between human language and the genome’s biological code, researchers have begun to develop genome language models (gLMs) based on transformers and related architectures. This Review explores the use of transformers and language models in genomics. We survey open questions in genomics amenable to the use of gLMs, and motivate the use of gLMs and the transformer architecture for these problems. We discuss the potential of gLMs for modelling the genome using unsupervised pretraining tasks, specifically focusing on the power of zero- and few-shot learning. We explore the strengths and limitations of the transformer architecture, as well as the strengths and limitations of current gLMs more broadly. Additionally, we contemplate the future of genomic modelling beyond the transformer architecture, based on current trends in research. This Review serves as a guide for computational biologists and computer scientists interested in transformers and language models for genomic data.

MCML Authors
Link to Profile Fabian Theis

Fabian Theis

Prof. Dr.

Mathematical Modelling of Biological Systems


[1624]
D. Bär, N. Pröllochs and S. Feuerriegel.
The role of social media ads for election outcomes: Evidence from the 2021 German election.
PNAS Nexus.pgaf073 (Mar. 2025). DOI
Abstract

Social media ads have become a key communication channel in politics. However, the relationship between political ads from social media and election outcomes is not fully understood. Here, we aim to estimate the association between online political advertising and election outcomes during the 2021 German federal election. For this, we analyze a large-scale dataset of 21,641 political ads from Facebook and Instagram that received ≈126 million impressions. Using regression analysis, we show that political advertising on social media has a positive relationship with a candidate’s election outcome and may even sway elections. All else equal, ≈200,000 additional impressions are predicted to increase a candidate’s votes by 2.1%. We further use a causal sensitivity analysis to evaluate how unobserved confounding may affect our estimates. We find that the estimated impact of ads cannot be reasonably explained away, highlighting the significance of social media for election outcomes.

MCML Authors
Link to website

Dominik Bär

Artificial Intelligence in Management

Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1623]
Q. Xu, Y. Shi, J. Zhao and X. Zhu.
FloodCastBench: A Large-Scale Dataset and Foundation Models for Flood Modeling and Forecasting.
Scientific Data 12.431 (Mar. 2025). DOI
Abstract

Effective flood forecasting is crucial for informed decision-making and emergency response. Existing flood datasets mainly describe flood events but lack dynamic process data suitable for machine learning (ML). This work introduces the FloodCastBench dataset, designed for ML-based flood modeling and forecasting, featuring four major flood events: Pakistan 2022, UK 2015, Australia 2022, and Mozambique 2019. FloodCastBench details the process of flood dynamics data acquisition, starting with input data preparation (e.g., topography, land use, rainfall) and flood measurement data collection (e.g., SAR-based maps, surveyed outlines) for hydrodynamic modeling. We deploy a widely recognized finite difference numerical solution to construct high-resolution spatiotemporal dynamic processes with 30-m spatial and 300-second temporal resolutions. Flood measurement data are used to calibrate the hydrodynamic model parameters and validate the flood inundation maps. FloodCastBench provides comprehensive low-fidelity and high-fidelity flood forecasting datasets specifically for ML. Furthermore, we establish a benchmark of foundational models for neural flood forecasting using FloodCastBench, validating its effectiveness in supporting ML models for spatiotemporal, cross-regional, and downscaled flood forecasting.

MCML Authors
Link to website

Qingsong Xu

Data Science in Earth Observation

Link to website

Jie Zhao

Dr.

Data Science in Earth Observation

Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1622]
R. Hornung, M. Nalenz, L. Schneider, A. Bender, L. Bothmann, F. Dumpert, B. Bischl, T. Augustin and A.-L. Boulesteix.
Evaluating Machine Learning Models in Non-Standard Settings: An Overview and New Findings.
Statistical Science (Mar. 2025). To be published. Preprint available. arXiv
Abstract

Estimating the generalization error (GE) of machine learning models is fundamental, with resampling methods being the most common approach. However, in non-standard settings, particularly those where observations are not independently and identically distributed, resampling using simple random data divisions may lead to biased GE estimates. This paper strives to present well-grounded guidelines for GE estimation in various such non-standard settings: clustered data, spatial data, unequal sampling probabilities, concept drift, and hierarchically structured outcomes. Our overview combines well-established methodologies with other existing methods that, to our knowledge, have not been frequently considered in these particular settings. A unifying principle among these techniques is that the test data used in each iteration of the resampling procedure should reflect the new observations to which the model will be applied, while the training data should be representative of the entire data set used to obtain the final model. Beyond providing an overview, we address literature gaps by conducting simulation studies. These studies assess the necessity of using GE-estimation methods tailored to the respective setting. Our findings corroborate the concern that standard resampling methods often yield biased GE estimates in non-standard settings, underscoring the importance of tailored GE estimation.
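The unifying principle stated in the abstract (test folds should resemble the new observations the model will face) can be illustrated for clustered data with a minimal, dependency-free sketch of leave-one-cluster-out resampling; the data and function name here are hypothetical, not from the paper:

```python
from collections import defaultdict

def leave_one_cluster_out(cluster_ids):
    """Yield (train_idx, test_idx) pairs in which each test fold is one
    entire cluster, so the test data mimic previously unseen clusters."""
    by_cluster = defaultdict(list)
    for i, c in enumerate(cluster_ids):
        by_cluster[c].append(i)
    for test_idx in by_cluster.values():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(cluster_ids)) if i not in held_out]
        yield train_idx, test_idx

# Hypothetical data: six observations from three clusters.
clusters = ["a", "a", "b", "b", "b", "c"]
folds = list(leave_one_cluster_out(clusters))
# One fold per cluster; no cluster ever appears in both train and test.
```

Splitting on whole clusters, rather than on individual rows, is what keeps the resulting generalization-error estimate unbiased when the model will later be applied to entirely new clusters.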

MCML Authors
Link to website

Lennart Schneider

Statistical Learning and Data Science

Link to website

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

Link to website

Ludwig Bothmann

Dr.

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to Profile Anne-Laure Boulesteix

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine


[1621]
C. Bülte, P. Scholl and G. Kutyniok.
Probabilistic neural operators for functional uncertainty quantification.
Transactions on Machine Learning Research (Mar. 2025). URL
Abstract

Neural operators aim to approximate the solution operator of a system of differential equations purely from data. They have shown immense success in modeling complex dynamical systems across various domains. However, the occurrence of uncertainties inherent in both model and data has so far rarely been taken into account, a critical limitation in complex, chaotic systems such as weather forecasting. In this paper, we introduce the probabilistic neural operator (PNO), a framework for learning probability distributions over the output function space of neural operators. PNO extends neural operators with generative modeling based on strictly proper scoring rules, integrating uncertainty information directly into the training process. We provide a theoretical justification for the approach and demonstrate improved performance in quantifying uncertainty across different domains and with respect to different baselines. Furthermore, PNO requires minimal adjustment to existing architectures, shows improved performance for most probabilistic prediction tasks, and leads to well-calibrated predictive distributions and adequate uncertainty representations even for long dynamical trajectories. Implementing our approach into large-scale models for physical applications can lead to improvements in corresponding uncertainty quantification and extreme event identification, ultimately leading to a deeper understanding of the prediction of such surrogate models.
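A standard example of a strictly proper scoring rule of the kind mentioned in the abstract is the Continuous Ranked Probability Score (CRPS). The following is a minimal ensemble-based estimator for illustration only; it is not the paper's implementation:

```python
def ensemble_crps(samples, y):
    """CRPS of an ensemble forecast against observation y, using the
    energy form E|X - y| - 0.5 * E|X - X'| with X, X' drawn from the
    ensemble. Lower is better; 0 means a perfect point forecast."""
    n = len(samples)
    term1 = sum(abs(x - y) for x in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
    return term1 - term2

# Illustrative: a two-member ensemble scored against y = 0.0.
score = ensemble_crps([0.0, 1.0], 0.0)
# term1 = 0.5 and term2 = 0.25, so score = 0.25
```

Minimizing such a score during training rewards forecasts that are both sharp and well calibrated, which is what makes it a natural loss for probabilistic operators.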

MCML Authors
Link to website

Christopher Bülte

Mathematical Foundations of Artificial Intelligence

Link to website

Philipp Scholl

Mathematical Foundations of Artificial Intelligence

Link to Profile Gitta Kutyniok

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence


[1620]
K. Schwethelm, J. Kaiser, M. Knolle, S. Lockfisch, D. Rückert and A. Ziller.
Visual Privacy Auditing with Diffusion Models.
Transactions on Machine Learning Research (Mar. 2025). URL
Abstract

Data reconstruction attacks on machine learning models pose a substantial threat to privacy, potentially leaking sensitive information. Although defending against such attacks using differential privacy (DP) provides theoretical guarantees, determining appropriate DP parameters remains challenging. Current formal guarantees on the success of data reconstruction suffer from overly stringent assumptions regarding adversary knowledge about the target data, particularly in the image domain, raising questions about their real-world applicability. In this work, we empirically investigate this discrepancy by introducing a reconstruction attack based on diffusion models (DMs) that only assumes adversary access to real-world image priors and specifically targets the DP defense. We find that (1) real-world data priors significantly influence reconstruction success, (2) current reconstruction bounds do not model the risk posed by data priors well, and (3) DMs can serve as heuristic auditing tools for visualizing privacy leakage.

MCML Authors
Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine


[1619]
P. Bertin, J. D. Viviano, A. Tejada-Lapuerta, W. Wang, S. Bauer, F. J. Theis and Y. Bengio.
A scalable gene network model of regulatory dynamics in single cells.
Preprint (Mar. 2025). arXiv
Abstract

Single-cell data provide high-dimensional measurements of the transcriptional states of cells, but extracting insights into the regulatory functions of genes, particularly identifying transcriptional mechanisms affected by biological perturbations, remains a challenge. Many perturbations induce compensatory cellular responses, making it difficult to distinguish direct from indirect effects on gene regulation. Modeling how gene regulatory functions shape the temporal dynamics of these responses is key to improving our understanding of biological perturbations. Dynamical models based on differential equations offer a principled way to capture transcriptional dynamics, but their application to single-cell data has been hindered by computational constraints, stochasticity, sparsity, and noise. Existing methods either rely on low-dimensional representations or make strong simplifying assumptions, limiting their ability to model transcriptional dynamics at scale. We introduce a Functional and Learnable model of Cell dynamicS, FLeCS, that incorporates gene network structure into coupled differential equations to model gene regulatory functions. Given (pseudo)time-series single-cell data, FLeCS accurately infers cell dynamics at scale, provides improved functional insights into transcriptional mechanisms perturbed by gene knockouts, both in myeloid differentiation and K562 Perturb-seq experiments, and simulates single-cell trajectories of A549 cells following small-molecule perturbations.

MCML Authors
Link to Profile Stefan Bauer

Stefan Bauer

Prof. Dr.

Algorithmic Machine Learning & Explainable AI

Link to Profile Fabian Theis

Fabian Theis

Prof. Dr.

Mathematical Modelling of Biological Systems


[1618]
C. Damke and E. Hüllermeier.
Adjusted Count Quantification Learning on Graphs.
Preprint (Mar. 2025). arXiv
Abstract

Quantification learning is the task of predicting the label distribution of a set of instances. We study this problem in the context of graph-structured data, where the instances are vertices. Previously, this problem has only been addressed via node clustering methods. In this paper, we extend the popular Adjusted Classify & Count (ACC) method to graphs. We show that the prior probability shift assumption upon which ACC relies is often not fulfilled and propose two novel graph quantification techniques: Structural importance sampling (SIS) makes ACC applicable in graph domains with covariate shift. Neighborhood-aware ACC improves quantification in the presence of non-homophilic edges. We show the effectiveness of our techniques on multiple graph quantification tasks.
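The classical binary Adjusted Classify & Count correction that the paper extends to graphs can be sketched as follows; the numbers are hypothetical and this shows only the standard non-graph ACC, not the SIS or neighborhood-aware variants proposed here:

```python
def adjusted_classify_and_count(predictions, tpr, fpr):
    """Binary ACC: correct the raw Classify & Count prevalence estimate
    using the classifier's true- and false-positive rates, estimated on
    held-out validation data under the prior probability shift assumption."""
    raw = sum(predictions) / len(predictions)  # plain Classify & Count
    adjusted = (raw - fpr) / (tpr - fpr)       # invert the misclassification process
    return min(1.0, max(0.0, adjusted))        # clip to a valid proportion

# Hypothetical example: the classifier labels 4 of 10 instances positive,
# with validation tpr = 0.9 and fpr = 0.2.
estimate = adjusted_classify_and_count([1, 1, 0, 0, 1, 0, 0, 1, 0, 0],
                                       tpr=0.9, fpr=0.2)
# (0.4 - 0.2) / (0.9 - 0.2) ≈ 0.286
```

The correction assumes the class-conditional distributions (and hence tpr and fpr) carry over from validation to deployment, which is exactly the prior-probability-shift assumption the paper shows is often violated on graphs.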

MCML Authors
Link to website

Clemens Damke

Artificial Intelligence and Machine Learning

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1617]
A. Fono, M. Singh, E. Araya, P. C. Petersen, H. Boche and G. Kutyniok.
Sustainable AI: Mathematical Foundations of Spiking Neural Networks.
Preprint (Mar. 2025). arXiv
Abstract

Deep learning’s success comes with growing energy demands, raising concerns about the long-term sustainability of the field. Spiking neural networks, inspired by biological neurons, offer a promising alternative with potential computational and energy-efficiency gains. This article examines the computational properties of spiking networks through the lens of learning theory, focusing on expressivity, training, and generalization, as well as energy-efficient implementations while comparing them to artificial neural networks. By categorizing spiking models based on time representation and information encoding, we highlight their strengths, challenges, and potential as an alternative computational paradigm.

MCML Authors
Link to website

Adalbert Fono

Mathematical Foundations of Artificial Intelligence

Link to Profile Gitta Kutyniok

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence


[1616]
L. Girrbach, S. Alaniz, G. Smith and Z. Akata.
A Large Scale Analysis of Gender Biases in Text-to-Image Generative Models.
Preprint (Mar. 2025). arXiv
Abstract

With the increasing use of image generation technology, understanding its social biases, including gender bias, is essential. This paper presents the first large-scale study on gender bias in text-to-image (T2I) models, focusing on everyday situations. While previous research has examined biases in occupations, we extend this analysis to gender associations in daily activities, objects, and contexts. We create a dataset of 3,217 gender-neutral prompts and generate 200 images per prompt from five leading T2I models. We automatically detect the perceived gender of people in the generated images and filter out images with no person or multiple people of different genders, leaving 2,293,295 images. To enable a broad analysis of gender bias in T2I models, we group prompts into semantically similar concepts and calculate the proportion of male- and female-gendered images for each prompt. Our analysis shows that T2I models reinforce traditional gender roles, reflect common gender stereotypes in household roles, and underrepresent women in financial related activities. Women are predominantly portrayed in care- and human-centered scenarios, and men in technical or physical labor scenarios.

MCML Authors
Link to website

Leander Girrbach

Interpretable and Reliable Machine Learning

Link to website

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[1615]
M. Hartenberger, H. Ayaz, F. Ozlugedik, C. Caredda, L. Giannoni, F. Lange, L. Lux, J. Weidner, A. Berger, F. Kofler, M. Menten, B. Montcel, I. Tachtsidis, D. Rückert and I. Ezhov.
Redefining spectral unmixing for in-vivo brain tissue analysis from hyperspectral imaging.
Preprint (Mar. 2025). arXiv
Abstract

In this paper, we propose a methodology for extracting molecular tumor biomarkers from hyperspectral imaging (HSI), an emerging technology for intraoperative tissue assessment. To achieve this, we employ spectral unmixing, allowing to decompose the spectral signals recorded by the HSI camera into their constituent molecular components. Traditional unmixing approaches are based on physical models that establish a relationship between tissue molecules and the recorded spectra. However, these methods commonly assume a linear relationship between the spectra and molecular content, which does not capture the whole complexity of light-matter interaction. To address this limitation, we introduce a novel unmixing procedure that allows to take into account non-linear optical effects while preserving the computational benefits of linear spectral unmixing. We validate our methodology on an in-vivo brain tissue HSI dataset and demonstrate that the extracted molecular information leads to superior classification performance.

MCML Authors
Link to website

Laurin Lux

Artificial Intelligence in Healthcare and Medicine

Link to website

Jonas Weidner

AI for Image-Guided Diagnosis and Therapy

Link to Profile Martin Menten

Martin Menten

Dr.

Artificial Intelligence in Healthcare and Medicine

Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine


[1614]
J. Hingerl, L. D. Martens, A. Karollus, T. Manz, J. D. Buenrostro, F. J. Theis and J. Gagneur.
scooby: Modeling multi-modal genomic profiles from DNA sequence at single-cell resolution.
Preprint (Mar. 2025). DOI
Abstract

Understanding how regulatory DNA elements shape gene expression across individual cells is a fundamental challenge in genomics. Joint RNA-seq and epigenomic profiling provides opportunities to build unifying models of gene regulation capturing sequence determinants across steps of gene expression. However, current models, developed primarily for bulk omics data, fail to capture the cellular heterogeneity and dynamic processes revealed by single-cell multi-modal technologies. Here, we introduce scooby, the first model to predict scRNA-seq coverage and scATAC-seq insertion profiles along the genome from sequence at single-cell resolution. For this, we leverage the pre-trained multi-omics profile predictor Borzoi as a foundation model, equip it with a cell-specific decoder, and fine-tune its sequence embeddings. Specifically, we condition the decoder on the cell position in a precomputed single-cell embedding resulting in strong generalization capability. Applied to a hematopoiesis dataset, scooby recapitulates cell-specific expression levels of held-out genes and cells, and identifies regulators and their putative target genes through in silico motif deletion. Moreover, accurate variant effect prediction with scooby allows for breaking down bulk eQTL effects into single-cell effects and delineating their impact on chromatin accessibility and gene expression. We anticipate scooby to aid unraveling the complexities of gene regulation at the resolution of individual cells.
Competing Interest Statement: J.D.B. holds patents related to ATAC-seq and is an SAB member of Camp4 and seqWell. F.J.T. consults for Immunai Inc., Singularity Bio B.V., CytoReason Ltd and Omniscope Ltd, and has ownership interest in Dermagnostix GmbH and Cellarity.

MCML Authors
Link to website

Johannes Hingerl

Computational Molecular Medicine

Link to website

Alexander Karollus

Computational Molecular Medicine

Link to Profile Julien Gagneur

Julien Gagneur

Prof. Dr.

Computational Molecular Medicine


[1613]
S. Kondylatos, N. Bountos, D. Michail, X. Zhu, G. Camps-Valls and I. Papoutsis.
On the Generalization of Representation Uncertainty in Earth Observation.
Preprint (Mar. 2025). arXiv GitHub
Abstract

Recent advances in Computer Vision have introduced the concept of pretrained representation uncertainty, enabling zero-shot uncertainty estimation. This holds significant potential for Earth Observation (EO), where trustworthiness is critical, yet the complexity of EO data poses challenges to uncertainty-aware methods. In this work, we investigate the generalization of representation uncertainty in EO, considering the domain’s unique semantic characteristics. We pretrain uncertainties on large EO datasets and propose an evaluation framework to assess their zero-shot performance in multi-label classification and segmentation EO tasks. Our findings reveal that, unlike uncertainties pretrained on natural images, EO-pretraining exhibits strong generalization across unseen EO domains, geographic locations, and target granularities, while maintaining sensitivity to variations in ground sampling distance. We demonstrate the practical utility of pretrained uncertainties showcasing their alignment with task-specific uncertainties in downstream tasks, their sensitivity to real-world EO image noise, and their ability to generate spatial uncertainty estimates out-of-the-box. Initiating the discussion on representation uncertainty in EO, our study provides insights into its strengths and limitations, paving the way for future research in the field.

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1612]
F. Krause, T. Phan, M. Gui, S. A. Baumann, V. T. Hu and B. Ommer.
TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training.
Preprint (Mar. 2025). arXiv
Abstract

Diffusion models have emerged as the mainstream approach for visual generation. However, these models typically suffer from sample inefficiency and high training costs. Consequently, methods for efficient finetuning, inference and personalization were quickly adopted by the community. However, training these models in the first place remains very costly. While several recent approaches - including masking, distillation, and architectural modifications - have been proposed to improve training efficiency, each of these methods comes with a tradeoff: they achieve enhanced performance at the expense of increased computational cost or vice versa. In contrast, this work aims to improve training efficiency as well as generative performance at the same time through routes that act as a transport mechanism for randomly selected tokens from early layers to deeper layers of the model. Our method is not limited to the common transformer-based model - it can also be applied to state-space models and achieves this without architectural modifications or additional parameters. Finally, we show that TREAD reduces computational cost and simultaneously boosts model performance on the standard ImageNet-256 benchmark in class-conditional synthesis. Both of these benefits multiply to a convergence speedup of 14x at 400K training iterations compared to DiT and 37x compared to the best benchmark performance of DiT at 7M training iterations. Furthermore, we achieve a competitive FID of 2.09 in a guided and 3.93 in an unguided setting, which improves upon the DiT, without architectural changes.

MCML Authors
Link to website

Vincent Tao Hu

Dr.

Computer Vision & Learning

Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Computer Vision & Learning


[1611]
F. J. D. Lange, J. C. Wilcke, S. Hoffmann, M. Herrmann and A.-L. Boulesteix.
On 'confirmatory' methodological research in statistics and related fields.
Preprint (Mar. 2025). arXiv
Abstract

Empirical substantive research, such as in the life or social sciences, is commonly categorized into the two modes exploratory and confirmatory, both of which are essential to scientific progress. The former is also referred to as hypothesis-generating or data-contingent research, the latter is also called hypothesis-testing research. In the context of empirical methodological research in statistics, however, the exploratory-confirmatory distinction has received very little attention so far. Our paper aims to fill this gap. First, we revisit the concept of empirical methodological research through the lens of the exploratory-confirmatory distinction. Secondly, we examine current practice with respect to this distinction through a literature survey including 115 articles from the field of biostatistics. Thirdly, we provide practical recommendations towards more appropriate design, interpretation, and reporting of empirical methodological research in light of this distinction. In particular, we argue that both modes of research are crucial to methodological progress, but that most published studies – even if sometimes disguised as confirmatory – are essentially of exploratory nature. We emphasize that it may be adequate to consider empirical methodological research as a continuum between ‘pure’ exploration and ‘strict’ confirmation, recommend transparently reporting the mode of conducted research within the spectrum between exploratory and confirmatory, and stress the importance of study protocols written before conducting the study, especially in confirmatory methodological research.

MCML Authors
Link to Profile Moritz Herrmann

Moritz Herrmann

Dr.

Transfer Coordinator

Biometry in Molecular Medicine

Link to Profile Anne-Laure Boulesteix

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine


[1610]
J. Li, C. Liu, W. Bai, R. Arcucci, C. I. Bercea and J. A. Schnabel.
Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions.
Preprint (Mar. 2025). arXiv GitHub
Abstract

Visual Language Models (VLMs) have demonstrated impressive capabilities in visual grounding tasks. However, their effectiveness in the medical domain, particularly for abnormality detection and localization within medical images, remains underexplored. A major challenge is the complex and abstract nature of medical terminology, which makes it difficult to directly associate pathological anomaly terms with their corresponding visual features. In this work, we introduce a novel approach to enhance VLM performance in medical abnormality detection and localization by leveraging decomposed medical knowledge. Instead of directly prompting models to recognize specific abnormalities, we focus on breaking down medical concepts into fundamental attributes and common visual patterns. This strategy promotes a stronger alignment between textual descriptions and visual features, improving both the recognition and localization of abnormalities in medical images. We evaluate our method on the 0.23B Florence-2 base model and demonstrate that it achieves comparable performance in abnormality grounding to significantly larger 7B LLaVA-based medical VLMs, despite being trained on only 1.5% of the data used for such models. Experimental results also demonstrate the effectiveness of our approach in both known and previously unseen abnormalities, suggesting its strong generalization capabilities.

MCML Authors
Link to website

Jun Li

Computational Imaging and AI in Medicine

Link to Profile Julia Schnabel

Julia Schnabel

Prof. Dr.

Computational Imaging and AI in Medicine


[1609]
Y. Li, M. Milling and B. W. Schuller.
Neuroplasticity in Artificial Intelligence — An Overview and Inspirations on Drop In & Out Learning.
Preprint (Mar. 2025). arXiv
Abstract

Artificial Intelligence (AI) has achieved new levels of performance and spread in public usage with the rise of deep neural networks (DNNs). Initially inspired by human neurons and their connections, NNs have become the foundation of AI models for many advanced architectures. However, some of the most integral processes in the human brain, particularly neurogenesis and neuroplasticity, in addition to the more widespread neuroapoptosis, have largely been ignored in DNN architecture design. Instead, contemporary AI development predominantly focuses on constructing advanced frameworks, such as large language models, which retain a static structure of neural connections during training and inference. In this light, we explore how neurogenesis, neuroapoptosis, and neuroplasticity can inspire future AI advances. Specifically, we examine analogous activities in artificial NNs, introducing the concept of ‘dropin’ for neurogenesis and revisiting ‘dropout’ and structural pruning for neuroapoptosis. We additionally suggest neuroplasticity, combining the two, for future large NNs in ‘life-long learning’ settings following the biological inspiration. We conclude by advocating for greater research efforts in this interdisciplinary domain and identifying promising directions for future exploration.

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[1608]
Y. Li, Q. Sun, S. M. K. Murthy, E. Alturki and B. W. Schuller.
GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations.
Preprint (Mar. 2025). arXiv
Abstract

Affective Computing (AC) is essential for advancing Artificial General Intelligence (AGI), with emotion recognition serving as a key component. However, human emotions are inherently dynamic, influenced not only by an individual’s expressions but also by interactions with others, and single-modality approaches often fail to capture their full dynamics. Multimodal Emotion Recognition (MER) leverages multiple signals but traditionally relies on utterance-level analysis, overlooking the dynamic nature of emotions in conversations. Emotion Recognition in Conversation (ERC) addresses this limitation, yet existing methods struggle to align multimodal features and explain why emotions evolve within dialogues. To bridge this gap, we propose GatedxLSTM, a novel speech-text multimodal ERC model that explicitly considers voice and transcripts of both the speaker and their conversational partner(s) to identify the most influential sentences driving emotional shifts. By integrating Contrastive Language-Audio Pretraining (CLAP) for improved cross-modal alignment and employing a gating mechanism to emphasise emotionally impactful utterances, GatedxLSTM enhances both interpretability and performance. Additionally, the Dialogical Emotion Decoder (DED) refines emotion predictions by modelling contextual dependencies. Experiments on the IEMOCAP dataset demonstrate that GatedxLSTM achieves state-of-the-art (SOTA) performance among open-source methods in four-class emotion classification. These results validate its effectiveness for ERC applications and provide an interpretability analysis from a psychological perspective.

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[1607]
M. M. Mandl, F. Weber, T. Wöhrle and A.-L. Boulesteix.
The impact of the storytelling fallacy on real data examples in methodological research.
Preprint (Mar. 2025). arXiv
Abstract

The term ‘researcher degrees of freedom’ (RDF), which was introduced in metascientific literature in the context of the replication crisis in science, refers to the extent of flexibility a scientist has in making decisions related to data analysis. These choices occur at all stages of the data analysis process. In combination with selective reporting, RDF may lead to over-optimistic statements and an increased rate of false positive findings. Even though the concept has been mainly discussed in fields such as epidemiology or psychology, similar problems affect methodological statistical research. Researchers who develop and evaluate statistical methods are left with a multitude of decisions when designing their comparison studies. This leaves room for an over-optimistic representation of the performance of their preferred method(s). The present paper defines and explores a particular RDF that has not been previously identified and discussed. When interpreting the results of real data examples that are most often part of methodological evaluations, authors typically tell a domain-specific ‘story’ that best supports their argumentation in favor of their preferred method. However, there are often plenty of other plausible stories that would support different conclusions. We define the ‘storytelling fallacy’ as the selective use of anecdotal domain-specific knowledge to support the superiority of specific methods in real data examples. While such examples fed by domain knowledge play a vital role in methodological research, if deployed inappropriately they can also harm the validity of conclusions on the investigated methods. The goal of our work is to create awareness for this issue, fuel discussions on the role of real data in generating evidence in methodological research and warn readers of methodological literature against naive interpretations of real data examples.

MCML Authors
Link to Profile Anne-Laure Boulesteix

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine


[1606]
M. L. Mostafa, A. Alperovich, D. Fedotov, G. Ghazaei, S. Saur, A. Farshad and N. Navab.
Surgical Flow Masked Autoencoder for Event Recognition.
Preprint (Mar. 2025).
Abstract

Recognition and forecasting of surgical events from video sequences are crucial for advancing computer-assisted surgery. Surgical events are often characterized by specific tool-tissue interactions; for example, “bleeding damage” occurs when a tool unintentionally cuts a tissue, leading to blood flow. Despite progress in general event classification, recognizing and forecasting events in medical contexts remains challenging due to data scarcity and the complexity of these events. To address these challenges, we propose a method utilizing video masked autoencoders (VideoMAE) for surgical event recognition. This approach focuses the network on the most informative areas of the video while minimizing the need for extensive annotations. We introduce a novel mask sampling technique based on an estimated prior probability map derived from optical flow. We hypothesize that leveraging prior knowledge of tool-tissue interactions will enable the network to concentrate on the most relevant regions in the video. We propose two methods for estimating the prior probability map: (a) retaining areas with the fastest motion and (b) incorporating an additional encoding pathway for optical flow. Our extensive experiments on the public dataset CATARACTS and our in-house neurosurgical data demonstrate that optical flow-based masking consistently outperforms random masking strategies of VideoMAE in phase and event classification tasks. We find that an optical flow encoder enhances classification accuracy by directing the network’s focus to the most relevant information, even in regions without rapid motion. Finally, we investigate sequential and multi-task training strategies to identify the best-performing model, which surpasses the current state-of-the-art by 5% on the CATARACTS dataset and 27% on our in-house neurosurgical data.

MCML Authors
Azade Farshad

Dr.

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1605]
I. Obadic, D. Kangin, D. Oliveira, P. Angelov and X. Zhu.
i-WiViG: Interpretable Window Vision GNN.
Preprint (Mar. 2025). arXiv
Abstract

Deep learning models based on graph neural networks have emerged as a popular approach for solving computer vision problems. They encode the image into a graph structure and can be beneficial for efficiently capturing the long-range dependencies typically present in remote sensing imagery. However, an important drawback of these methods is their black-box nature which may hamper their wider usage in critical applications. In this work, we tackle the self-interpretability of the graph-based vision models by proposing our Interpretable Window Vision GNN (i-WiViG) approach, which provides explanations by automatically identifying the relevant subgraphs for the model prediction. This is achieved with window-based image graph processing that constrains the node receptive field to a local image region and by using a self-interpretable graph bottleneck that ranks the importance of the long-range relations between the image regions. We evaluate our approach on remote sensing classification and regression tasks, showing it achieves competitive performance while providing inherent and faithful explanations through the identified relations. Further, the quantitative evaluation reveals that our model reduces the infidelity of post-hoc explanations compared to other Vision GNN models, without sacrificing explanation sparsity.

MCML Authors
Ivica Obadic

Data Science in Earth Observation

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1604]
R. D. Paul, J. Seiffarth, D. Rügamer, H. Scharr and K. Nöh.
How To Make Your Cell Tracker Say 'I dunno!'.
Preprint (Mar. 2025). arXiv
Abstract

Cell tracking is a key computational task in live-cell microscopy, but fully automated analysis of high-throughput imaging requires reliable and, thus, uncertainty-aware data analysis tools, as the amount of data recorded within a single experiment exceeds what humans are able to review. We here propose and benchmark various methods to reason about and quantify uncertainty in linear assignment-based cell tracking algorithms. Our methods take inspiration from statistics and machine learning, leveraging two perspectives on the cell tracking problem explored throughout this work: Considering it as a Bayesian inference problem and as a classification problem. Our methods admit a framework-like character in that they equip any frame-to-frame tracking method with uncertainty quantification. We demonstrate this by applying it to various existing tracking algorithms including the recently presented Transformer-based trackers. We demonstrate empirically that our methods yield useful and well-calibrated tracking uncertainties.
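As a toy illustration of the classification perspective on linear assignment-based tracking, one generic way to obtain per-link uncertainties is to turn each row of the frame-to-frame cost matrix into match probabilities and score their entropy. The softmax reading and the temperature parameter below are assumptions for illustration, not the paper's estimator:

```python
import numpy as np

def match_probabilities(cost, temperature=1.0):
    """Softmax over negative assignment costs: rows are cells in frame t,
    columns are candidate matches in frame t+1 (generic sketch)."""
    logits = -cost / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def link_entropy(p):
    """Shannon entropy per row; high entropy = ambiguous, uncertain link."""
    return -(p * np.log(p + 1e-12)).sum(axis=1)
```

A row with one clearly cheapest candidate yields low entropy, while a row of near-identical costs yields entropy close to its maximum, flagging the link for review.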

MCML Authors
David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[1603]
R. Rehms, N. Ellenbach, V. Deffner and S. Hoffmann.
Addressing complex structures of measurement error arising in the exposure assessment in occupational epidemiology using a Bayesian hierarchical approach.
Preprint (Mar. 2025). arXiv
Abstract

Exposure assessment in occupational epidemiology may involve multiple unknown quantities that are measured or reconstructed simultaneously for groups of workers and over several years. Additionally, exposures may be collected using different assessment strategies, depending on the period of exposure. As a consequence, researchers who are analyzing occupational cohort studies are commonly faced with challenging structures of exposure measurement error, involving complex dependence structures and multiple measurement error models, depending on the period of exposure. However, previous work has often made many simplifying assumptions concerning these errors. In this work, we propose a Bayesian hierarchical approach to account for a broad range of error structures arising in occupational epidemiology. The considered error structures may involve several unknown quantities that can be subject to mixtures of Berkson and classical measurement error. It is possible to account for different error structures, depending on the exposure period and the location of a worker. Moreover, errors can present complex dependence structures over time and between workers. We illustrate the proposed hierarchical approach on a subgroup of the German cohort of uranium miners to account for potential exposure uncertainties in the association between radon exposure and lung cancer mortality. The performance of the proposed approach and its sensitivity to model misspecification are evaluated in a simulation study. The results show that biases in estimates arising from very complex measurement errors can be corrected through the proposed Bayesian hierarchical approach.
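The classical/Berkson distinction at the core of the abstract can be demonstrated with a small simulation: classical error (a noisy measurement of the true exposure) attenuates a naive regression slope, while pure Berkson error (the true exposure scattering around an assigned value) leaves it unbiased. This is a textbook illustration, not the cohort model of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 200_000, 2.0

# Classical error: we regress on a noisy measurement X = T + eps of the
# true exposure T; the naive slope is attenuated towards zero.
t = rng.normal(0.0, 1.0, n)
x_classical = t + rng.normal(0.0, 1.0, n)
y = beta * t + rng.normal(0.0, 0.1, n)
slope_classical = np.cov(x_classical, y)[0, 1] / np.var(x_classical)

# Berkson error: the exposure is assigned (e.g. a group value Z) and the
# truth scatters around it, T = Z + eps; the naive slope stays unbiased.
z = rng.normal(0.0, 1.0, n)
t_berkson = z + rng.normal(0.0, 1.0, n)
y_berkson = beta * t_berkson + rng.normal(0.0, 0.1, n)
slope_berkson = np.cov(z, y_berkson)[0, 1] / np.var(z)
```

With unit error variance the classical slope shrinks to roughly beta/2, while the Berkson slope recovers beta; real occupational exposure data mix both error types over time and workers, which is what the proposed hierarchical model accommodates.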

MCML Authors

[1602]
A. Scagliotti, F. Scagliotti, L. Locati and F. Sottotetti.
Ensemble optimal control for managing drug resistance in cancer therapies.
Preprint (Mar. 2025). arXiv
Abstract

In this paper, we explore the application of ensemble optimal control to derive enhanced strategies for pharmacological cancer treatment. In particular, we focus on moving beyond the classical clinical approach of giving the patient the maximal tolerated drug dose (MTD), which does not properly exploit the competition between sensitive and resistant cells for the available resources. Here, we employ a Lotka-Volterra model to describe the two competing subpopulations, and we enclose this system within the ensemble control framework. In the first part, we establish general results suitable for application to various solid cancers. Then, we carry out numerical simulations in the setting of prostate cancer treated with androgen deprivation therapy, yielding a computed policy that is reminiscent of the medical ‘active surveillance’ paradigm. Finally, inspired by the numerical evidence, we propose a variant of the celebrated adaptive therapy (AT), which we call ‘Off-On’ AT.
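A minimal sketch of the competing-subpopulation dynamics, assuming a generic competitive Lotka-Volterra model in which the drug adds a death term only for the sensitive clone; all parameter values, the forward-Euler scheme, and the function name are illustrative, not the paper's calibrated model:

```python
import numpy as np

def simulate(dose_schedule, s0=0.5, r0=0.05, dt=0.01,
             rs=1.0, rr=0.8, K=1.0, kill=2.0):
    """Competitive Lotka-Volterra for sensitive (s) and resistant (r) cells
    sharing carrying capacity K; the dose u adds a death term only for s.
    All parameter values are illustrative."""
    s, r = s0, r0
    for u in dose_schedule:                      # u in [0, 1] per time step
        comp = (s + r) / K                       # shared-resource competition
        ds = rs * s * (1.0 - comp) - kill * u * s
        dr = rr * r * (1.0 - comp)
        s = max(s + dt * ds, 0.0)                # forward-Euler step
        r = max(r + dt * dr, 0.0)
    return s, r
```

Simulating a constant maximal dose drives the sensitive population to extinction and releases the resistant clone from competition, which is precisely the failure mode of MTD that adaptive schedules try to avoid.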

MCML Authors
Alessandro Scagliotti

Applied Numerical Analysis


[1601]
H. Shang, H. Wu, G. Zhai, B. Sun, F. Wang, F. Tombari and M. Pollefeys.
SG-Tailor: Inter-Object Commonsense Relationship Reasoning for Scene Graph Manipulation.
Preprint (Mar. 2025). arXiv
Abstract

Scene graphs capture complex relationships among objects, serving as strong priors for content generation and manipulation. Yet, reasonably manipulating scene graphs – whether by adding nodes or modifying edges – remains a challenging and untouched task. Tasks such as adding a node to the graph or reasoning about a node’s relationships with all others are computationally intractable, as even a single edge modification can trigger conflicts due to the intricate interdependencies within the graph. To address these challenges, we introduce SG-Tailor, an autoregressive model that predicts the conflict-free relationship between any two nodes. SG-Tailor not only infers inter-object relationships, including generating commonsense edges for newly added nodes but also resolves conflicts arising from edge modifications to produce coherent, manipulated graphs for downstream tasks. For node addition, the model queries the target node and other nodes from the graph to predict the appropriate relationships. For edge modification, SG-Tailor employs a Cut-And-Stitch strategy to solve the conflicts and globally adjust the graph. Extensive experiments demonstrate that SG-Tailor outperforms competing methods by a large margin and can be seamlessly integrated as a plug-in module for scene generation and robotic manipulation tasks.

MCML Authors
Guangyao Zhai

Computer Aided Medical Procedures & Augmented Reality

Federico Tombari

PD Dr.

Computer Aided Medical Procedures & Augmented Reality


[1600]
Z. Shi, X. Zhang, Y. Xia, Y. Zang, S. Shen and C. Wang.
L2RSI: Cross-view LiDAR-based Place Recognition for Large-scale Urban Scenes via Remote Sensing Imagery.
Preprint (Mar. 2025). arXiv GitHub
Abstract

We tackle the challenge of LiDAR-based place recognition, which traditionally depends on costly and time-consuming prior 3D maps. To overcome this, we first construct the XA-L&RSI dataset, which encompasses approximately 110,000 remote sensing submaps and 13,000 LiDAR point cloud submaps captured in urban scenes, and propose a novel method, L2RSI, for cross-view LiDAR place recognition using high-resolution Remote Sensing Imagery. This approach enables large-scale localization capabilities at a reduced cost by leveraging readily available overhead images as map proxies. L2RSI addresses the dual challenges of cross-view and cross-modal place recognition by learning feature alignment between point cloud submaps and remote sensing submaps in the semantic domain. Additionally, we introduce a novel probability propagation method based on a dynamic Gaussian mixture model to refine position predictions, effectively leveraging temporal and spatial information. This approach enables large-scale retrieval and cross-scene generalization without fine-tuning. Extensive experiments on XA-L&RSI demonstrate that, within a 100 km² retrieval range, L2RSI accurately localizes 95.08% of point cloud submaps within a 30m radius for the top-1 retrieved location. We provide a video to more vividly display the place recognition results of L2RSI at this https URL.

MCML Authors
Yan Xia

Dr.

* Former Member


[1599]
J. Shin, A. Khatri, M. A. Hedderich, A. Lucero and A. Oulasvirta.
Facilitating Asynchronous Idea Generation and Selection with Chatbots.
Preprint (Mar. 2025). arXiv
Abstract

People can generate high-quality ideas by building on each other’s ideas. When individuals can contribute their ideas at a time and in a manner comfortable to them (i.e., asynchronous ideation), they can engage deeply in ideation and improve idea quality. However, running asynchronous ideation faces a practical constraint: while trained human facilitators are needed to guide effective idea exchange, they cannot be continuously available to engage with individuals joining at varying hours. In this paper, we ask how chatbots can be designed to facilitate asynchronous ideation. For this, we adopted the guidelines found in the literature about human facilitators and designed two chatbots: one provides a structured ideation process, and another adapts the ideation process to individuals’ ideation performance. We invited 48 participants to generate and select ideas by interacting with one of our chatbots and invited an expert facilitator to review our chatbots. We found that both chatbots can guide users to build on each other’s ideas and converge them into a few satisfying ideas. However, we also found the chatbots’ limitations in social interaction with collaborators, which only human facilitators can provide. Accordingly, we conclude that chatbots can be promising facilitators of asynchronous ideation, but hybrid facilitation with human facilitators would be needed to address the social aspects of collaborative ideation.

MCML Authors
Michael Hedderich

Dr.

AI and Computational Linguistics


[1598]
S. Si, X. Wang, G. Zhai, N. Navab and B. Plank.
Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior.
Preprint (Mar. 2025). arXiv
Abstract

Recent advancements in large language models (LLMs) have demonstrated that fine-tuning and human alignment can render LLMs harmless. In practice, such ‘harmlessness’ behavior is mainly achieved by training models to reject harmful requests, such as ‘Explain how to burn down my neighbor’s house’, where the model appropriately declines to respond. However, this approach can inadvertently result in false refusal, where models reject benign queries as well, such as ‘Tell me how to kill a Python process’. In this work, we demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior. Building on this finding, we introduce the Think-Before-Refusal (TBR) schema and conduct safety-aware instruction fine-tuning incorporating safety reflection. In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior while maintaining safety and overall performance compared to those fine-tuned without safety reflection.

MCML Authors
Xinpeng Wang

AI and Computational Linguistics

Guangyao Zhai

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1597]
V. Sideri-Lampretsa, D. Rückert and H. Qiu.
Evaluation of Alignment-Regularity Characteristics in Deformable Image Registration.
Preprint (Mar. 2025). arXiv
Abstract

Evaluating deformable image registration (DIR) is challenging due to the inherent trade-off between achieving high alignment accuracy and maintaining deformation regularity. In this work, we introduce a novel evaluation scheme based on the alignment-regularity characteristic (ARC) to systematically capture and analyze this trade-off. We first introduce the ARC curves, which describe the performance of a given registration algorithm as a spectrum measured by alignment and regularity metrics. We further adopt a HyperNetwork-based approach that learns to continuously interpolate across the full regularization range, accelerating the construction and improving the sample density of ARC curves. We empirically demonstrate our evaluation scheme using representative learning-based deformable image registration methods with various network architectures and transformation models on two public datasets. We present a range of findings not evident from existing evaluation practices and provide general recommendations for model evaluation and selection using our evaluation scheme. All relevant code is made publicly available.

MCML Authors
Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine


[1596]
P. Spitzer, D. Hendriks, J. Rudolph, S. Schläger, J. Ricke, N. Kühl, B. F. Hoppe and S. Feuerriegel.
The effect of medical explanations from large language models on diagnostic decisions in radiology.
Preprint (Mar. 2025). DOI
Abstract

Large language models (LLMs) are increasingly used by physicians for diagnostic support. A key advantage of LLMs is the ability to generate explanations that can help physicians understand the reasoning behind a diagnosis. However, the best-suited format for LLM-generated explanations remains unclear. In this large-scale study, we examined the effect of different formats for LLM explanations on clinical decision-making. For this, we conducted a randomized experiment with radiologists reviewing patient cases with radiological images (N=2020 assessments). Participants received either no LLM support (control group) or were supported by one of three LLM-generated explanations: (1) a standard output providing the diagnosis without explanation; (2) a differential diagnosis comparing multiple possible diagnoses; or (3) a chain-of-thought explanation offering a detailed reasoning process for the diagnosis. We find that the format of explanations significantly influences diagnostic accuracy. The chain-of-thought explanations yielded the best performance, improving the diagnostic accuracy by 12.2% compared to the control condition without LLM support (P=0.001). The chain-of-thought explanations are also superior to the standard output without explanation (+7.2%; P=0.040) and the differential diagnosis format (+9.7%; P=0.004). Evidently, explaining the reasoning for a diagnosis helps physicians to identify and correct potential errors in LLM predictions and thus improve overall decisions. Altogether, the results highlight the importance of how explanations in medical LLMs are generated to maximize their utility in clinical practice. By designing explanations to support the reasoning processes of physicians, LLMs can improve diagnostic performance and, ultimately, patient outcomes.

MCML Authors
Boj Friedrich Hoppe

Dr.

Clinical Data Science in Radiology

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1595]
P. Stangel, D. Bani-Harouni, C. Pellegrini, E. Özsoy, K. Zaripova, M. Keicher and N. Navab.
Rewarding Doubt: A Reinforcement Learning Approach to Confidence Calibration of Large Language Models.
Preprint (Mar. 2025). arXiv
Abstract

A safe and trustworthy use of Large Language Models (LLMs) requires an accurate expression of confidence in their answers. We introduce a novel Reinforcement Learning (RL) approach for LLM calibration that fine-tunes LLMs to elicit calibrated confidence estimations in their answers to factual questions. We model the problem as a betting game where the model predicts a confidence score together with every answer, and design a reward function that penalizes both over and under-confidence. We prove that under our reward design an optimal policy would result in a perfectly calibrated confidence estimation. Our experiments demonstrate significantly improved confidence calibration and generalization to new tasks without re-training, indicating that our approach teaches a general confidence awareness. This approach enables the training of inherently calibrated LLMs.
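The betting-game idea can be illustrated with a logarithmic scoring rule, a classical reward that penalizes both over- and under-confidence; because it is a proper scoring rule, the expected reward is maximized exactly at the true probability of being correct. The paper's actual reward design may differ; the names below are illustrative:

```python
import numpy as np

def bet_reward(confidence, correct):
    """Log-score bet: paid log(c) when the answer is right, log(1 - c) when
    it is wrong, so both over- and under-confidence lose reward."""
    c = np.clip(confidence, 1e-6, 1.0 - 1e-6)
    return np.log(c) if correct else np.log(1.0 - c)

def expected_reward(confidence, p_correct):
    """Expected reward when the answer is correct with probability p_correct."""
    return (p_correct * bet_reward(confidence, True)
            + (1.0 - p_correct) * bet_reward(confidence, False))
```

A grid search over confidences confirms the calibration property: if the model is right 70% of the time, reporting a confidence of 0.7 is the reward-maximizing bet.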

MCML Authors
David Bani-Harouni

Computer Aided Medical Procedures & Augmented Reality

Chantal Pellegrini

Computer Aided Medical Procedures & Augmented Reality

Ege Özsoy

Computer Aided Medical Procedures & Augmented Reality

Kamilia Zaripova

Computer Aided Medical Procedures & Augmented Reality

Matthias Keicher

Dr.

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1594]
N. P. A. Vu, A. Saroha, O. Litany and D. Cremers.
GAS-NeRF: Geometry-Aware Stylization of Dynamic Radiance Fields.
Preprint (Mar. 2025). arXiv
Abstract

Current 3D stylization techniques primarily focus on static scenes, while our world is inherently dynamic, filled with moving objects and changing environments. Existing style transfer methods primarily target appearance – such as color and texture transformation – but often neglect the geometric characteristics of the style image, which are crucial for achieving a complete and coherent stylization effect. To overcome these shortcomings, we propose GAS-NeRF, a novel approach for joint appearance and geometry stylization in dynamic Radiance Fields. Our method leverages depth maps to extract and transfer geometric details into the radiance field, followed by appearance transfer. Experimental results on synthetic and real-world datasets demonstrate that our approach significantly enhances the stylization quality while maintaining temporal coherence in dynamic scenes.

MCML Authors
Abhishek Saroha

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1593]
Y. Wang, Z. Xiong, C. Liu, A. J. Stewart, T. Dujardin, N. I. Bountos, A. Zavras, F. Gerken, I. Papoutsis, L. Leal-Taixé and X. Zhu.
Towards a Unified Copernicus Foundation Model for Earth Vision.
Preprint (Mar. 2025). arXiv GitHub
Abstract

Advances in Earth observation (EO) foundation models have unlocked the potential of big satellite data to learn generic representations from space, benefiting a wide range of downstream applications crucial to our planet. However, most existing efforts remain limited to fixed spectral sensors, focus solely on the Earth’s surface, and overlook valuable metadata beyond imagery. In this work, we take a step towards next-generation EO foundation models with three key components: 1) Copernicus-Pretrain, a massive-scale pretraining dataset that integrates 18.7M aligned images from all major Copernicus Sentinel missions, spanning from the Earth’s surface to its atmosphere; 2) Copernicus-FM, a unified foundation model capable of processing any spectral or non-spectral sensor modality using extended dynamic hypernetworks and flexible metadata encoding; and 3) Copernicus-Bench, a systematic evaluation benchmark with 15 hierarchical downstream tasks ranging from preprocessing to specialized applications for each Sentinel mission. Our dataset, model, and benchmark greatly improve the scalability, versatility, and multimodal adaptability of EO foundation models, while also creating new opportunities to connect EO, weather, and climate research.

MCML Authors
Chenying Liu

Data Science in Earth Observation

Franziska Gerken

Computer Vision & Artificial Intelligence

Laura Leal-Taixé

Prof. Dr.

* Former Principal Investigator

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1592]
A. Weers, A. H. Berger, L. Lux, P. Schüffler, D. Rückert and J. C. Paetzold.
From Pixels to Histopathology: A Graph-Based Framework for Interpretable Whole Slide Image Analysis.
Preprint (Mar. 2025). arXiv GitHub
Abstract

The histopathological classification of whole-slide images (WSIs) is a fundamental task in digital pathology; yet it requires extensive time and expertise from specialists. While deep learning methods show promising results, they typically process WSIs by dividing them into artificial patches, which inherently prevents a network from learning from the entire image context, disregards natural tissue structures and compromises interpretability. Our method overcomes this limitation through a novel graph-based framework that constructs WSI graph representations. The WSI-graph efficiently captures essential histopathological information in a compact form. We build tissue representations (nodes) that follow biological boundaries rather than arbitrary patches all while providing interpretable features for explainability. Through adaptive graph coarsening guided by learned embeddings, we progressively merge regions while maintaining discriminative local features and enabling efficient global information exchange. In our method’s final step, we solve the diagnostic task through a graph attention network. We empirically demonstrate strong performance on multiple challenging tasks such as cancer stage classification and survival prediction, while also identifying predictive factors using Integrated Gradients.

MCML Authors
Alexander Weers

Artificial Intelligence in Healthcare and Medicine

Laurin Lux

Artificial Intelligence in Healthcare and Medicine

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine


[1591]
T. N. Wolf, E. Kavak, F. Bongratz and C. Wachinger.
SIC: Similarity-Based Interpretable Image Classification with Neural Networks.
Preprint (Mar. 2025). arXiv
Abstract

The deployment of deep learning models in critical domains necessitates a balance between high accuracy and interpretability. We introduce SIC, an inherently interpretable neural network that provides local and global explanations of its decision-making process. Leveraging the concept of case-based reasoning, SIC extracts class-representative support vectors from training images, ensuring they capture relevant features while suppressing irrelevant ones. Classification decisions are made by calculating and aggregating similarity scores between these support vectors and the input’s latent feature vector. We employ B-Cos transformations, which align model weights with inputs, to yield coherent pixel-level explanations in addition to global explanations of case-based reasoning. We evaluate SIC on three tasks: fine-grained classification on Stanford Dogs and FunnyBirds, multi-label classification on Pascal VOC, and pathology detection on the RSNA dataset. Results indicate that SIC not only achieves competitive accuracy compared to state-of-the-art black-box and inherently interpretable models but also offers insightful explanations verified through practical evaluation on the FunnyBirds benchmark. Our theoretical analysis proves that these explanations fulfill established axioms for explanations. Our findings underscore SIC’s potential for applications where understanding model decisions is as critical as the decisions themselves.

MCML Authors
Tom Nuno Wolf

Artificial Intelligence in Medical Imaging

Emre Kavak

Artificial Intelligence in Medical Imaging

Fabian Bongratz

Artificial Intelligence in Medical Imaging

Christian Wachinger

Prof. Dr.

Artificial Intelligence in Medical Imaging


[1590]
A. Wuttke, M. Aßenmacher, C. Klamm, M. Lang, Q. Würschinger and F. Kreuter.
AI Conversational Interviewing: Transforming Surveys with LLMs as Adaptive Interviewers.
Preprint (Mar. 2025). arXiv
Abstract

Traditional methods for eliciting people’s opinions face a trade-off between depth and scale: structured surveys enable large-scale data collection but limit respondents’ ability to voice their opinions in their own words, while conversational interviews provide deeper insights but are resource-intensive. This study explores the potential of replacing human interviewers with large language models (LLMs) to conduct scalable conversational interviews. Our goal is to assess the performance of AI Conversational Interviewing and to identify opportunities for improvement in a controlled environment. We conducted a small-scale, in-depth study with university students who were randomly assigned to a conversational interview by either AI or human interviewers, both employing identical questionnaires on political topics. Various quantitative and qualitative measures assessed interviewer adherence to guidelines, response quality, participant engagement, and overall interview efficacy. The findings indicate the viability of AI Conversational Interviewing in producing quality data comparable to traditional methods, with the added benefit of scalability. We publish our data and materials for re-use and present specific recommendations for effective implementation.

MCML Authors
Matthias Aßenmacher

Dr.

Statistical Learning and Data Science

Frauke Kreuter

Prof. Dr.

Social Data Science and AI


[1589]
R. Amoroso, G. Zhang, R. Koner, L. Baraldi, R. Cucchiara and V. Tresp.
Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI
Abstract

Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos, identify the most relevant information based on contextual cues from a given question, and reason accurately to provide answers. Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities. This progress is largely driven by the effective alignment between visual data and the language space of MLLMs. However, for video QA, an additional space-time alignment poses a considerable challenge for extracting question-relevant information across frames. In this work, we investigate diverse temporal modeling techniques to integrate with MLLMs, aiming to achieve question-guided temporal modeling that leverages pre-trained visual and textual alignment in MLLMs. We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs. Our evaluation across multiple video QA benchmarks demonstrates that T-Former competes favorably with existing temporal modeling approaches and aligns with recent advancements in video QA.

MCML Authors
Gengyuan Zhang

Database Systems and Data Mining

Rajat Koner

Database Systems and Data Mining

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1588]
A. H. Berger, L. Lux, S. Shit, I. Ezhov, G. Kaissis, M. Menten, D. Rückert and J. C. Paetzold.
Cross-Domain and Cross-Dimension Learning for Image-to-Graph Transformers.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI
Abstract

Direct image-to-graph transformation is a challenging task that involves solving object detection and relationship prediction in a single model. Due to this task’s complexity, large training datasets are rare in many domains, making the training of deep-learning methods challenging. This data sparsity necessitates transfer learning strategies akin to the state-of-the-art in general computer vision. In this work, we introduce a set of methods enabling cross-domain and cross-dimension learning for image-to-graph transformers. We propose (1) a regularized edge sampling loss to effectively learn object relations in multiple domains with different numbers of edges, (2) a domain adaptation framework for image-to-graph transformers aligning image- and graph-level features from different domains, and (3) a projection function that allows using 2D data for training 3D transformers. We demonstrate our method’s utility in cross-domain and cross-dimension experiments, where we utilize labeled data from 2D road networks for simultaneous learning in vastly different target domains. Our method consistently outperforms standard transfer learning and self-supervised pretraining on challenging benchmarks, such as retinal or whole-brain vessel graph extraction.

MCML Authors
Laurin Lux

Artificial Intelligence in Healthcare and Medicine

Georgios Kaissis

Dr.

* Former Principal Investigator

Martin Menten

Dr.

Artificial Intelligence in Healthcare and Medicine

Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine


[1587]
S. Chen, Z. Han, B. He, J. Liu, M. Buckley, Y. Qin, P. Torr, V. Tresp and J. Gu.
Can Multimodal Large Language Models Truly Perform Multimodal In-Context Learning?
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI URL
Abstract

Large Language Models (LLMs) with in-context learning (ICL) ability can quickly adapt to a specific context given a few demonstrations (demos). Recently, Multimodal Large Language Models (MLLMs) built upon LLMs have also shown multimodal ICL ability, i.e., responding to queries given a few multimodal demos, including images, queries, and answers. While ICL has been extensively studied on LLMs, its research on MLLMs remains limited. One essential question is whether these MLLMs can truly conduct multimodal ICL, or if only the textual modality is necessary. We investigate this question by examining two primary factors that influence ICL: 1) Demo content, i.e., understanding the influences of demo content in different modalities. 2) Demo selection strategy, i.e., how to select better multimodal demos for improved performance. Experiments revealed that multimodal ICL is predominantly driven by the textual content whereas the visual information in the demos has little influence. Interestingly, visual content is still necessary and useful for selecting demos to increase performance. Motivated by our analysis, we propose a simple yet effective approach, termed Mixed Modality In-Context Example Selection (MMICES), which considers both visual and language modalities when selecting demos. Extensive experiments are conducted to support our findings and verify the improvement brought by our method.

MCML Authors

Shuo Chen

Database Systems and Data Mining

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1586]
F. Fundel, J. Schusterbauer, V. T. Hu and B. Ommer.
Distillation of Diffusion Features for Semantic Correspondence.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI
Abstract

Semantic correspondence, the task of determining relationships between different parts of images, underpins various applications including 3D reconstruction, image-to-image translation, object tracking, and visual place recognition. Recent studies have begun to explore representations learned in large generative image models for semantic correspondence, demonstrating promising results. Building on this progress, current state-of-the-art methods rely on combining multiple large models, resulting in high computational demands and reduced efficiency. In this work, we address this challenge by proposing a more computationally efficient approach. We propose a novel knowledge distillation technique to overcome the problem of reduced efficiency. We show how to use two large vision foundation models and distill the capabilities of these complementary models into one smaller model that maintains high accuracy at reduced computational cost. Furthermore, we demonstrate that by incorporating 3D data, we are able to further improve performance, without the need for human-annotated correspondences. Overall, our empirical results demonstrate that our distilled model with 3D data augmentation achieves performance superior to current state-of-the-art methods while significantly reducing computational load and enhancing practicality for real-world applications, such as semantic video correspondence. Our code and weights are publicly available on our project page.

MCML Authors

Johannes Schusterbauer

Computer Vision & Learning

Vincent Tao Hu

Dr.

Computer Vision & Learning

Björn Ommer

Prof. Dr.

Computer Vision & Learning


[1585]
Y. Li, M. Ghahremani, Y. Wally and C. Wachinger.
DiaMond: Dementia Diagnosis with Multi-Modal Vision Transformers Using MRI and PET.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI
Abstract

Diagnosing dementia, particularly for Alzheimer’s Disease (AD) and frontotemporal dementia (FTD), is complex due to overlapping symptoms. While magnetic resonance imaging (MRI) and positron emission tomography (PET) data are critical for the diagnosis, integrating these modalities in deep learning faces challenges, often resulting in suboptimal performance compared to using single modalities. Moreover, the potential of multi-modal approaches in differential diagnosis, which holds significant clinical importance, remains largely unexplored. We propose a novel framework, DiaMond, to address these issues with vision Transformers to effectively integrate MRI and PET. DiaMond is equipped with self-attention and a novel bi-attention mechanism that synergistically combine MRI and PET, alongside a multi-modal normalization to reduce redundant dependency, thereby boosting the performance. DiaMond significantly outperforms existing multi-modal methods across various datasets, achieving a balanced accuracy of 92.4% in AD diagnosis, 65.2% for AD-MCI-CN classification, and 76.5% in differential diagnosis of AD and FTD. We also validated the robustness of DiaMond in a comprehensive ablation study.

MCML Authors

Yitong Li

Artificial Intelligence in Medical Imaging

Morteza Ghahremani

Dr.

Artificial Intelligence in Medical Imaging

Christian Wachinger

Prof. Dr.

Artificial Intelligence in Medical Imaging


[1584]
O. Wysocki, Y. Tan, T. Froech, Y. Xia, M. Wysocki, L. Hoegner, D. Cremers and C. Holst.
ZAHA: Introducing the Level of Facade Generalization and the Large-Scale Point Cloud Facade Semantic Segmentation Benchmark Dataset.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI GitHub
Abstract

Facade semantic segmentation is a long-standing challenge in photogrammetry and computer vision. Although the last decades have witnessed the influx of facade segmentation methods, there is a lack of comprehensive facade classes and data covering the architectural variability. In ZAHA (project page: https://github.com/OloOcki/zaha), we introduce Level of Facade Generalization (LoFG), novel hierarchical facade classes designed based on international urban modeling standards, ensuring compatibility with real-world challenging classes and uniform methods’ comparison. Realizing the LoFG, we present to date the largest semantic 3D facade segmentation dataset, providing 601 million annotated points at five and 15 classes of LoFG2 and LoFG3, respectively. Moreover, we analyze the performance of baseline semantic segmentation methods on our introduced LoFG classes and data, complementing it with a discussion on the unresolved challenges for facade segmentation. We firmly believe that ZAHA shall facilitate further development of 3D facade semantic segmentation methods, enabling robust segmentation indispensable in creating urban digital twins.

MCML Authors
Yan Xia

Dr.

* Former Member

Magdalena Wysocki

Computer Aided Medical Procedures & Augmented Reality

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1583]
Y. Zhang, H. Chen, A. Frikha, Y. Yang, D. Krompass, G. Zhang, J. Gu and V. Tresp.
CL-Cross VQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI
Abstract

Visual Question Answering (VQA) systems witnessed a significant advance in recent years due to the development of large-scale Vision-Language Pre-trained Models (VLPMs). As the application scenario and user demand change over time, an advanced VQA system is expected to be capable of continuously expanding its knowledge and capabilities over time, not only to handle new tasks (i.e., new question types or visual scenes) but also to answer questions in new specialized domains without forgetting previously acquired knowledge and skills. Existing works studying continual learning (CL) on VQA tasks primarily consider answer- and question-type incremental learning or scene- and function-incremental learning, whereas how VQA systems perform when they encounter new domains and increasing user demands has not been studied. Motivated by this, we introduce CL-CrossVQA, a rigorous Continual Learning benchmark for Cross-domain Visual Question Answering, through which we conduct extensive experiments on 4 VLPMs, 5 CL approaches, and 5 VQA datasets from different domains. In addition, by probing the forgetting phenomenon of the intermediate layers, we provide insights into how model architecture affects CL performance, why CL approaches can help mitigate forgetting in VLPMs, and how to design CL approaches suitable for VLPMs in this challenging continual learning environment. To facilitate future work on developing an advanced All-in-One VQA system, we will release our datasets and code.

MCML Authors

Yao Zhang

Database Systems and Data Mining

Haokun Chen

Database Systems and Data Mining

Ahmed Frikha

Dr.

* Former Member

Gengyuan Zhang

Database Systems and Data Mining

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1582]
F. Hofherr, B. Haefner and D. Cremers.
On Neural BRDFs: A Thorough Comparison of State-of-the-Art Approaches.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. Oral Presentation. DOI
Abstract

The bidirectional reflectance distribution function (BRDF) is an essential tool to capture the complex interaction of light and matter. Recently, several works have employed neural methods for BRDF modeling, following various strategies, ranging from utilizing existing parametric models to purely neural parametrizations. While all methods yield impressive results, a comprehensive comparison of the different approaches is missing in the literature. In this work, we present a thorough evaluation of several approaches, including results for qualitative and quantitative reconstruction quality and an analysis of reciprocity and energy conservation. Moreover, we propose two extensions that can be added to existing approaches: A novel additive combination strategy for neural BRDFs that split the reflectance into a diffuse and a specular part, and an input mapping that ensures reciprocity exactly by construction, while previous approaches only ensure it by soft constraints.

MCML Authors

Florian Hofherr

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1581]
H. Chen, D. Krompass, J. Gu and V. Tresp.
FedPop: Federated Population-based Hyperparameter Tuning.
AAAI 2025 - 39th Conference on Artificial Intelligence. Philadelphia, PA, USA, Feb 25-Mar 04, 2025. DOI
Abstract

Federated Learning (FL) is a distributed machine learning (ML) paradigm, in which multiple clients collaboratively train ML models without centralizing their local data. Similar to conventional ML pipelines, the client local optimization and server aggregation procedure in FL are sensitive to the hyperparameter (HP) selection. Despite extensive research on tuning HPs for centralized ML, these methods yield suboptimal results when employed in FL. This is mainly because their ’training-after-tuning’ framework is unsuitable for FL with limited client computation power. While some approaches have been proposed for HP-Tuning in FL, they are limited to the HPs for client local updates. In this work, we propose a novel HP-tuning algorithm, called Federated Population-based Hyperparameter Tuning (FedPop), to address this vital yet challenging problem. FedPop employs population-based evolutionary algorithms to optimize the HPs, which accommodates various HP types at both the client and server sides. Compared with prior tuning methods, FedPop employs an online ’tuning-while-training’ framework, offering computational efficiency and enabling the exploration of a broader HP search space. Our empirical validation on the common FL benchmarks and complex real-world FL datasets, including full-sized Non-IID ImageNet-1K, demonstrates the effectiveness of the proposed method, which substantially outperforms the concurrent state-of-the-art HP-tuning methods in FL.
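The ‘tuning-while-training’ loop described above can be illustrated with a generic population-based sketch, not the FedPop algorithm itself: a population of hyperparameter candidates is scored each round, and the worst member is replaced by a perturbed copy of the best. The quadratic `evaluate` is a stand-in of ours for one round of federated training plus validation:

```python
import random

def evaluate(lr):
    # stand-in for one round of federated training + validation;
    # here the (hidden) best learning rate is 0.1
    return -(lr - 0.1) ** 2

def population_tune(pop_size=6, rounds=40, seed=0):
    """Generic population-based tuning: each round, the worst member
    is replaced by a perturbed copy of the best (exploit + explore)."""
    rng = random.Random(seed)
    population = [rng.uniform(0.001, 1.0) for _ in range(pop_size)]
    for _ in range(rounds):
        ranked = sorted(population, key=evaluate, reverse=True)
        best, worst = ranked[0], ranked[-1]
        # perturbation keeps exploring around the current best member
        population[population.index(worst)] = max(1e-4, best * rng.uniform(0.8, 1.2))
    return max(population, key=evaluate)

best_lr = population_tune()
print(best_lr)  # at least as good as the best initial guess
```

Because only the worst member is ever overwritten, the best score never degrades, which is what makes the online tuning safe to interleave with training.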

MCML Authors

Haokun Chen

Database Systems and Data Mining

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1580]
A. Davtyan, S. Sameni, B. Ommer and P. Favaro.
CAGE: Unsupervised Visual Composition and Animation for Controllable Video Generation.
AAAI 2025 - 39th Conference on Artificial Intelligence. Philadelphia, PA, USA, Feb 25-Mar 04, 2025. DOI GitHub
Abstract

In this work we propose a novel method for unsupervised controllable video generation. Once trained on a dataset of unannotated videos, at inference our model is capable of both composing scenes of predefined object parts and animating them in a plausible and controlled way. This is achieved by conditioning video generation on a randomly selected subset of local pre-trained self-supervised features during training. We call our model CAGE for visual Composition and Animation for video GEneration. We conduct a series of experiments to demonstrate capabilities of CAGE in various settings.

MCML Authors

Björn Ommer

Prof. Dr.

Computer Vision & Learning


[1579]
X. Feng, Z. Jiang, T. Kaufmann, P. Xu, E. Hüllermeier, P. Weng and Y. Zhu.
DUO: Diverse, Uncertain, On-Policy Query Generation and Selection for Reinforcement Learning from Human Feedback.
AAAI 2025 - 39th Conference on Artificial Intelligence. Philadelphia, PA, USA, Feb 25-Mar 04, 2025. DOI
Abstract

Defining a reward function is usually a challenging but critical task for the system designer in reinforcement learning, especially when specifying complex behaviors. Reinforcement learning from human feedback (RLHF) emerges as a promising approach to circumvent this. In RLHF, the agent typically learns a reward function by querying a human teacher using pairwise comparisons of trajectory segments. A key question in this domain is how to reduce the number of queries necessary to learn an informative reward function since asking a human teacher too many queries is impractical and costly. To tackle this question, we propose DUO, a novel method for diverse, uncertain, on-policy query generation and selection in RLHF. Our method produces queries that are (1) more relevant for policy training (via an on-policy criterion), (2) more informative (via a principled measure of epistemic uncertainty), and (3) diverse (via a clustering-based filter). Experimental results on a variety of locomotion and robotic manipulation tasks demonstrate that our method can outperform state-of-the-art RLHF methods given the same total budget of queries, while being robust to possibly irrational teachers.
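The three DUO criteria can be mimicked in a small selection sketch (our own toy construction, not the paper's method): score candidate query pairs by an uncertainty measure, keep the most uncertain half, then run a greedy farthest-point pass for diversity:

```python
import numpy as np

def select_queries(candidates, uncertainty, k=3):
    """Pick k queries that are both uncertain and mutually diverse."""
    # 1) keep the most uncertain half of the candidate pool
    order = np.argsort(-uncertainty)
    pool = order[: max(k, len(order) // 2)]
    # 2) greedily add the candidate farthest from those already chosen
    chosen = [pool[0]]
    while len(chosen) < k:
        dists = np.min(
            np.linalg.norm(candidates[pool][:, None] - candidates[chosen][None], axis=-1),
            axis=1,
        )
        chosen.append(pool[int(np.argmax(dists))])
    return chosen

rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 4))   # embeddings of candidate query pairs
unc = rng.uniform(size=10)         # e.g. ensemble disagreement per pair
picked = select_queries(feats, unc, k=3)
print(picked)  # the first pick is the single most uncertain candidate
```

The on-policy criterion from the abstract would enter upstream, by drawing `candidates` from trajectories of the current policy rather than from a fixed pool.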

MCML Authors

Timo Kaufmann

Artificial Intelligence and Machine Learning

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1578]
J. Lan, D. Frassinelli and B. Plank.
Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA.
AAAI 2025 - 39th Conference on Artificial Intelligence. Philadelphia, PA, USA, Feb 25-Mar 04, 2025. DOI
Abstract

Large vision-language models struggle to accurately predict responses provided by multiple human annotators, particularly when those responses exhibit high uncertainty. In this study, we focus on a Visual Question Answering (VQA) task and comprehensively evaluate how well the output of the state-of-the-art vision-language model correlates with the distribution of human responses. To do so, we categorize our samples based on their levels (low, medium, high) of human uncertainty in disagreement (HUD) and employ not only accuracy but also, for the first time in VQA, three new human-correlated metrics to investigate the impact of HUD. We also verify the effect of common calibration and human calibration (Baan et al. 2022) on the alignment of models and humans. Our results show that even BEiT3, currently the best model for this task, struggles to capture the multi-label distribution inherent in diverse human responses. Additionally, we observe that the commonly used accuracy-oriented calibration technique adversely affects BEiT3’s ability to capture HUD, further widening the gap between model predictions and human distributions. In contrast, we show the benefits of calibrating models towards human distributions for VQA, to better align model confidence with human uncertainty. Our findings highlight that for VQA, the alignment between human responses and model predictions is understudied and is an important target for future studies.

MCML Authors

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1577]
Z. Li, S. S. Cranganore, N. Youngblut and N. Kilbertus.
Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity.
AAAI 2025 - 39th Conference on Artificial Intelligence. Philadelphia, PA, USA, Feb 25-Mar 04, 2025. DOI
Abstract

Leveraging the vast genetic diversity within microbiomes offers unparalleled insights into complex phenotypes, yet the task of accurately predicting and understanding such traits from genomic data remains challenging. We propose a framework taking advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences. Based on our model, we develop attribution techniques to elucidate gene interaction effects that drive microbial adaptation to diverse environments. We train and validate our approach on a large dataset of high quality microbiome genomes from different habitats. We not only demonstrate solid predictive performance, but also how sequence-level information of entire genomes allows us to identify gene associations underlying complex phenotypes. Our attribution recovers known important interaction networks and proposes new candidates for experimental follow up.

MCML Authors

Zhufeng Li

Ethics in Systems Design and Machine Learning

Niki Kilbertus

Prof. Dr.

Ethics in Systems Design and Machine Learning


[1576]
Y. Mu, M. Shahzad and X. Zhu.
MPTSNet: Integrating Multiscale Periodic Local Patterns and Global Dependencies for Multivariate Time Series Classification.
AAAI 2025 - 39th Conference on Artificial Intelligence. Philadelphia, PA, USA, Feb 25-Mar 04, 2025. DOI
Abstract

Multivariate Time Series Classification (MTSC) is crucial in extensive practical applications, such as environmental monitoring, medical EEG analysis, and action recognition. Real-world time series datasets typically exhibit complex dynamics. To capture this complexity, RNN-based, CNN-based, Transformer-based, and hybrid models have been proposed. Unfortunately, current deep learning-based methods often neglect the simultaneous construction of local features and global dependencies at different time scales, lacking sufficient feature extraction capabilities to achieve satisfactory classification accuracy. To address these challenges, we propose a novel Multiscale Periodic Time Series Network (MPTSNet), which integrates multiscale local patterns and global correlations to fully exploit the inherent information in time series. Recognizing the multi-periodicity and complex variable correlations in time series, we use the Fourier transform to extract primary periods, enabling us to decompose data into multiscale periodic segments. Leveraging the inherent strengths of CNN and attention mechanism, we introduce the PeriodicBlock, which adaptively captures local patterns and global dependencies while offering enhanced interpretability through attention integration across different periodic scales. The experiments on UEA benchmark datasets demonstrate that the proposed MPTSNet outperforms 21 existing advanced baselines in the MTSC tasks.
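The Fourier-based period extraction mentioned in the abstract can be sketched as follows (our simplification, not the MPTSNet code): take the FFT amplitude spectrum, pick the strongest non-DC frequency, and use the implied period to cut the series into periodic segments:

```python
import numpy as np

def primary_period(x):
    """Dominant period of a 1-D series via the FFT amplitude spectrum."""
    amp = np.abs(np.fft.rfft(x))
    amp[0] = 0.0                 # ignore the mean (frequency 0)
    freq = int(np.argmax(amp))   # strongest frequency bin
    return len(x) // freq        # samples per cycle

t = np.arange(128)
# dominant period 16, plus a weaker period-5 component
x = np.sin(2 * np.pi * t / 16) + 0.1 * np.sin(2 * np.pi * t / 5)
period = primary_period(x)
print(period)  # → 16

# the series can then be reshaped into period-aligned segments
segments = x[: (len(x) // period) * period].reshape(-1, period)
print(segments.shape)  # → (8, 16)
```

In the paper, several such primary periods are extracted, yielding multiscale segment stacks on which local (CNN) and global (attention) features are computed.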

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1575]
Y. Shen, Z. Zhuang, K. Yuan, M.-I. Nicolae, N. Navab, N. Padoy and M. Fritz.
Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment.
AAAI 2025 - 39th Conference on Artificial Intelligence. Philadelphia, PA, USA, Feb 25-Mar 04, 2025. DOI
Abstract

Medical multimodal large language models (MLLMs) are becoming an instrumental part of healthcare systems, assisting medical personnel with decision making and results analysis. Models for radiology report generation are able to interpret medical imagery, thus reducing the workload of radiologists. As medical data is scarce and protected by privacy regulations, medical MLLMs represent valuable intellectual property. However, these assets are potentially vulnerable to model stealing, where attackers aim to replicate their functionality via black-box access. So far, model stealing for the medical domain has focused on classification; however, existing attacks are not effective against MLLMs. In this paper, we introduce Adversarial Domain Alignment (ADA-STEAL), the first stealing attack against medical MLLMs. ADA-STEAL relies on natural images, which are public and widely available, as opposed to their medical counterparts. We show that data augmentation with adversarial noise is sufficient to overcome the data distribution gap between natural images and the domain-specific distribution of the victim MLLM. Experiments on the IU X-RAY and MIMIC-CXR radiology datasets demonstrate that Adversarial Domain Alignment enables attackers to steal the medical MLLM without any access to medical data.

MCML Authors

Kun Yuan

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1574]
Y. Zhang, Z. Ma, Y. Ma, Z. Han, Y. Wu and V. Tresp.
WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration.
AAAI 2025 - 39th Conference on Artificial Intelligence. Philadelphia, PA, USA, Feb 25-Mar 04, 2025. DOI
Abstract

LLM-based autonomous agents often fail to execute complex web tasks that require dynamic interaction due to the inherent uncertainty and complexity of these environments. Existing LLM-based web agents typically rely on rigid, expert-designed policies specific to certain states and actions, which lack the flexibility and generalizability needed to adapt to unseen tasks. In contrast, humans excel by exploring unknowns, continuously adapting strategies, and resolving ambiguities through exploration. To emulate human-like adaptability, web agents need strategic exploration and complex decision-making. Monte Carlo Tree Search (MCTS) is well-suited for this, but classical MCTS struggles with vast action spaces, unpredictable state transitions, and incomplete information in web tasks. In light of this, we develop WebPilot, a multi-agent system with a dual optimization strategy that improves MCTS to better handle complex web environments. Specifically, the Global Optimization phase involves generating a high-level plan by breaking down tasks into manageable subtasks and continuously refining this plan, thereby focusing the search process and mitigating the challenges posed by vast action spaces in classical MCTS. Subsequently, the Local Optimization phase executes each subtask using a tailored MCTS designed for complex environments, effectively addressing uncertainties and managing incomplete information. Experimental results on WebArena and MiniWoB++ demonstrate the effectiveness of WebPilot. Notably, on WebArena, WebPilot achieves SOTA performance with GPT-4, achieving a 93% relative increase in success rate over the concurrent tree search-based method. WebPilot marks a significant advancement in general autonomous agent capabilities, paving the way for more advanced and reliable decision-making in practical environments.

MCML Authors

Yao Zhang

Database Systems and Data Mining

Yunpu Ma

Dr.

Database Systems and Data Mining

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1573]
P. Ma, L. Rietdorf, D. Kotovenko, V. T. Hu and B. Ommer.
Does VLM Classification Benefit from LLM Description Semantics?
AAAI 2025 - 39th Conference on Artificial Intelligence. Philadelphia, PA, USA, Feb 25-Mar 04, 2025. Invited talk. DOI
Abstract

Accurately describing images with text is a foundation of explainable AI. Vision-Language Models (VLMs) like CLIP have recently addressed this by aligning images and texts in a shared embedding space, expressing semantic similarities between vision and language embeddings. VLM classification can be improved with descriptions generated by Large Language Models (LLMs). However, it is difficult to determine the contribution of actual description semantics, as the performance gain may also stem from a semantic-agnostic ensembling effect, where multiple modified text prompts act as a noisy test-time augmentation for the original one. We propose an alternative evaluation scenario to decide if a performance boost of LLM-generated descriptions is caused by such a noise augmentation effect or rather by genuine description semantics. The proposed scenario avoids noisy test-time augmentation and ensures that genuine, distinctive descriptions cause the performance boost. Furthermore, we propose a training-free method for selecting discriminative descriptions that work independently of classname-ensembling effects. Our approach identifies descriptions that effectively differentiate classes within a local CLIP label neighborhood, improving classification accuracy across seven datasets. Additionally, we provide insights into the explainability of description-based image classification with VLMs.

MCML Authors

Pingchuan Ma

Computer Vision & Learning

Vincent Tao Hu

Dr.

Computer Vision & Learning

Björn Ommer

Prof. Dr.

Computer Vision & Learning


[1572]
M. Gui, J. Schusterbauer, U. Prestel, P. Ma, D. Kotovenko, O. Grebenkova, S. A. Baumann, V. T. Hu and B. Ommer.
DepthFM: Fast Generative Monocular Depth Estimation with Flow Matching.
AAAI 2025 - 39th Conference on Artificial Intelligence. Philadelphia, PA, USA, Feb 25-Mar 04, 2025. Oral Presentation. DOI
Abstract

Current discriminative depth estimation methods often produce blurry artifacts, while generative approaches suffer from slow sampling due to curvatures in the noise-to-depth transport. Our method addresses these challenges by framing depth estimation as a direct transport between image and depth distributions. We are the first to explore flow matching in this field, and we demonstrate that its interpolation trajectories enhance both training and sampling efficiency while preserving high performance. While generative models typically require extensive training data, we mitigate this dependency by integrating external knowledge from a pre-trained image diffusion model, enabling effective transfer even across differing objectives. To further boost our model performance, we employ synthetic data and utilize image-depth pairs generated by a discriminative model on an in-the-wild image dataset. As a generative model, our model can reliably estimate depth confidence, which provides an additional advantage. Our approach achieves competitive zero-shot performance on standard benchmarks of complex natural scenes while improving sampling efficiency and only requiring minimal synthetic data for training.

MCML Authors

Johannes Schusterbauer

Computer Vision & Learning

Pingchuan Ma

Computer Vision & Learning

Olga Grebenkova

Computer Vision & Learning

Vincent Tao Hu

Dr.

Computer Vision & Learning

Björn Ommer

Prof. Dr.

Computer Vision & Learning


[1571]
K. Geißler, T. L. Koller, A. Ambroladze, E. M. Fallenberg, M. Ingrisch and H. K. Hahn.
Breast cancer risk prediction using background parenchymal enhancement, radiomics, and symmetry features on MRI.
SPIE 2025 - SPIE Medical Imaging: Computer-Aided Diagnosis. San Diego, CA, USA, Feb 16-21, 2025. DOI
Abstract

Breast cancer is the world’s most prevalent cancer type. Risk models predicting the chance of near-future cancer development can help to increase the efficiency of screening programs by targeting high-risk patients specifically. In this study we develop machine learning models for predicting the 2-year risk of breast cancer and for current breast cancer detection. To this end, we leverage feature sets based on background parenchymal enhancement (BPE), radiomics, and breast symmetry. We train and evaluate our models on longitudinal MRI data from a German high-risk screening program using random forests and 5-fold cross-validation. The models, which are developed similarly to prior work on breast cancer risk prediction, have low predictive power on our dataset. The best-performing model is based on BPE features and achieves an AUC of 0.57 for 2-year breast cancer risk prediction.

MCML Authors

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology


[1570]
T. L. Koller, K. Geißler, A. Ambroladze, E. M. Fallenberg, M. Ingrisch, H. Amer, P. Seeböck, G. Langs and H. K. Hahn.
Pitfalls with anomaly detection for breast cancer risk prediction.
SPIE 2025 - SPIE Medical Imaging: Computer-Aided Diagnosis. San Diego, CA, USA, Feb 16-21, 2025. DOI
Abstract

Breast cancer has the highest prevalence in the world, and thus, most countries have screening programs which aim to detect the cancer onset early. In these screening programs, negative studies dominate the dataset. Unsupervised anomaly detection promises to take advantage of the negative studies by using them to detect abnormalities such as cancer or signs of cancer onset. In this study, we evaluate an anomaly detection method for cancer prediction (1-year ahead) on an MRI dataset of a high-risk cohort with BRCA1 and BRCA2 gene mutations. As the approach fails to predict cancer risk on the dataset, we investigate the intrinsic behavior of the method. Our analysis reveals that the reconstruction-based method might only detect high-intensity anomalies and that the reconstruction quality is highly correlated with noisy patterns in the image patches.

MCML Authors

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology


[1569]
L. Burk, A. Bender and M. N. Wright.
High-Dimensional Variable Selection With Competing Events Using Cooperative Penalized Regression.
Biometrical Journal 67.1 (Feb. 2025). DOI
Abstract

Variable selection is an important step in the analysis of high-dimensional data, yet there are limited options for survival outcomes in the presence of competing risks. Commonly employed penalized Cox regression considers each event type separately through cause-specific models, neglecting possibly shared information between them. We adapt the feature-weighted elastic net (fwelnet), an elastic net generalization, to survival outcomes and competing risks. For two causes, our proposed algorithm fits two alternating cause-specific models, where each model receives the coefficient vector of the complementary model as prior information. We dub this “cooperative penalized regression”, as it enables the modeling of competing risk data with cause-specific models while accounting for shared effects between causes. Coefficients that are shrunken toward zero in the model for the first cause will receive larger penalization weights in the model for the second cause and vice versa. Through multiple iterations, this process ensures stronger penalization of uninformative predictors in both models. We demonstrate our method’s variable selection capabilities on simulated genomics data and apply it to bladder cancer microarray data. We evaluate selection performance using the positive predictive value for the correct selection of informative features and the false positive rate for the selection of uninformative variables. The benchmark compares results with cause-specific penalized Cox regression, random survival forests, and likelihood-boosted Cox regression. Results indicate that our approach is more effective at selecting informative features and removing uninformative features. In settings without shared effects, variable selection performance is similar to cause-specific penalized Cox regression.
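The alternating coupling at the heart of the method can be illustrated with a deliberately simplified stand-in: two weighted ridge regressions (instead of the paper's cause-specific penalized Cox models) that exchange per-feature penalty weights, so a feature shrunken toward zero in one model is penalized more heavily in the other. All names and the penalty rule are our illustrative assumptions:

```python
import numpy as np

def weighted_ridge(X, y, penalties):
    # closed-form ridge fit with an individual penalty per coefficient
    return np.linalg.solve(X.T @ X + np.diag(penalties), X.T @ y)

def cooperative_fit(X, y1, y2, base_penalty=1.0, iters=5):
    """Alternate two models; a feature with a large coefficient in one
    model receives a smaller penalty in the other (and vice versa)."""
    b1 = b2 = np.zeros(X.shape[1])
    for _ in range(iters):
        b1 = weighted_ridge(X, y1, base_penalty / (1.0 + np.abs(b2)))
        b2 = weighted_ridge(X, y2, base_penalty / (1.0 + np.abs(b1)))
    return b1, b2

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
effect = np.array([2.0, 0.0, 0.0, 0.0, 0.0])   # feature 0 drives both outcomes
y1 = X @ effect + rng.normal(scale=0.1, size=200)
y2 = X @ effect + rng.normal(scale=0.1, size=200)
b1, b2 = cooperative_fit(X, y1, y2)
print(np.round(b1, 2))  # feature 0 dominates; the others stay near zero
```

The shared informative feature keeps a low penalty in both models, while uninformative features are penalized increasingly strongly on both sides over the iterations.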

MCML Authors

Lukas Burk

Statistical Learning and Data Science

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[1568]
M. Wünsch, C. Sauer, M. Herrmann, L. C. Hinske and A.-L. Boulesteix.
To tweak or not to tweak. How exploiting flexibilities in gene set analysis leads to over-optimism.
Biometrical Journal 67.1 (Feb. 2025). DOI
Abstract

Gene set analysis, a popular approach for analyzing high-throughput gene expression data, aims to identify sets of genes that show enriched expression patterns between two conditions. In addition to the multitude of methods available for this task, users are typically left with many options when creating the required input and specifying the internal parameters of the chosen method. This flexibility can lead to uncertainty about the “right” choice, further reinforced by a lack of evidence-based guidance. Especially when their statistical experience is scarce, this uncertainty might entice users to produce preferable results using a ’trial-and-error’ approach. While it may seem unproblematic at first glance, this practice can be viewed as a form of ‘cherry-picking’ and cause an optimistic bias, rendering the results nonreplicable on independent data. After this problem has attracted a lot of attention in the context of classical hypothesis testing, we now aim to raise awareness of such overoptimism in the different and more complex context of gene set analyses. We mimic a hypothetical researcher who systematically selects the analysis variants yielding their preferred results, thereby considering three distinct goals they might pursue. Using a selection of popular gene set analysis methods, we tweak the results in this way for two frequently used benchmark gene expression data sets. Our study indicates that the potential for overoptimism is particularly high for a group of methods frequently used despite being commonly criticized. We conclude by providing practical recommendations to counter overoptimism in research findings in gene set analysis and beyond.

MCML Authors
Link to website

Christina Sauer (née Nießl)

Biometry in Molecular Medicine

Link to Profile Moritz Herrmann

Moritz Herrmann

Dr.

Transfer Coordinator

Biometry in Molecular Medicine

Link to Profile Anne-Laure Boulesteix

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine


[1567]
M. Fornasier and L. Sun.
A PDE Framework of Consensus-Based Optimization for Objectives with Multiple Global Minimizers.
Communications in Partial Differential Equations 50.4 (Feb. 2025). DOI
Abstract

Introduced in 2017, Consensus-Based Optimization (CBO) has rapidly emerged as a significant breakthrough in global optimization. This straightforward yet powerful multi-particle, zero-order optimization method draws inspiration from Simulated Annealing and Particle Swarm Optimization. Using a quantitative mean-field approximation, CBO dynamics can be described by a nonlinear Fokker-Planck equation with degenerate diffusion, which does not follow a gradient flow structure. In this paper, we demonstrate that solutions to the CBO equation remain positive and maintain full support. Building on this foundation, we establish the unconditional global convergence of CBO methods to global minimizers. Our results are derived through an analysis of solution regularity and the proof of existence for smooth, classical solutions to a broader class of drift-diffusion equations, despite the challenges posed by degenerate diffusion.
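The CBO particle dynamics summarized above can be sketched in a few lines: each particle drifts toward a Gibbs-weighted consensus point and receives noise scaled by its distance to that point. The minimal Python sketch below uses illustrative parameter values and a toy quadratic objective; all names and constants are our own, not from the paper.

```python
import math
import random

def cbo_minimize(f, dim=2, n_particles=50, steps=400,
                 alpha=30.0, lam=1.0, sigma=0.5, dt=0.01, seed=0):
    """Minimal Consensus-Based Optimization (CBO) sketch (illustrative)."""
    rng = random.Random(seed)
    # Initialize particle positions uniformly at random.
    X = [[rng.uniform(-3.0, 3.0) for _ in range(dim)] for _ in range(n_particles)]
    consensus = X[0][:]
    for _ in range(steps):
        vals = [f(x) for x in X]
        vmin = min(vals)
        # Gibbs weights concentrate mass on the best-performing particles.
        w = [math.exp(-alpha * (v - vmin)) for v in vals]
        W = sum(w)
        consensus = [sum(wi * xi[d] for wi, xi in zip(w, X)) / W
                     for d in range(dim)]
        for x in X:
            for d in range(dim):
                dist = x[d] - consensus[d]
                # Drift toward the consensus plus distance-scaled diffusion.
                x[d] += -lam * dist * dt + sigma * abs(dist) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return consensus

# Toy example: a quadratic with global minimizer at (1, 1).
estimate = cbo_minimize(lambda x: sum((xi - 1.0) ** 2 for xi in x))
```

Note the zero-order character the abstract highlights: only evaluations of f are used, never gradients, and for λ = 1, σ = 0.5 the swarm contracts onto the consensus point.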

MCML Authors
Link to Profile Massimo Fornasier

Massimo Fornasier

Prof. Dr.

Applied Numerical Analysis

Link to website

Lukang Sun

Applied Numerical Analysis


[1566]
D. Tschernutter and S. Feuerriegel.
Data-driven dynamic police patrolling: An efficient Monte Carlo tree search.
European Journal of Operational Research 321.1 (Feb. 2025). DOI
Abstract

Crime is responsible for major financial losses and serious harm to the well-being of individuals, and, hence, a crucial task of police operations is effective patrolling. Yet, in existing decision models aimed at police operations, microscopic routing decisions from patrolling are not considered, and, furthermore, the objective is limited to surrogate metrics (e.g., response time) instead of crime prevention. In this paper, we thus formalize the decision problem of dynamic police patrolling as a Markov decision process that models microscopic routing decisions, so that the expected number of prevented crimes is maximized. We experimentally show that standard solution approaches for our decision problem are not scalable to real-world settings. As a remedy, we present a tailored and highly efficient Monte Carlo tree search algorithm. We then demonstrate our algorithm numerically using real-world crime data from Chicago and show that the decision-making by our algorithm offers significant improvements for crime prevention over patrolling tactics from current practice. Informed by our results, we finally discuss implications for improving the patrolling tactics in police operations.

MCML Authors
Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1565]
A. T. Stüber, M. M. Heimer, J. Ta, M. P. Fabritius, B. F. Hoppe, G. Sheikh, M. Brendel, L. Unterrainer, P. Jurmeister, A. Tufman, J. Ricke, C. C. Cyran and M. Ingrisch.
Replication study of PD-L1 status prediction in NSCLC using PET/CT radiomics.
European Journal of Radiology 183.111825 (Feb. 2025). DOI
Abstract

This study investigates the predictive capability of radiomics in determining programmed cell death ligand 1 (PD-L1) expression (>=1%) status in non-small cell lung cancer (NSCLC) patients using a newly collected [18F]FDG PET/CT dataset. We aimed to replicate and validate the radiomics-based machine learning (ML) model proposed by Zhao et al. [2] predicting PD-L1 status from PET/CT imaging.
An independent cohort of 254 NSCLC patients underwent [18F]FDG PET/CT imaging, with primary tumor segmentation conducted using lung tissue window (LTW) and more conservative soft tissue window (STW) methods. Radiomics models (“Rad-score” and “complex model”) and a clinical-stage model from Zhao et al. were evaluated via 10-fold cross-validation and AUC analysis, alongside a benchmark study comparing different ML-model pipelines. Clinicopathological data were collected from medical records.
On our data, the Rad-score model yielded mean AUCs of 0.593 (STW) and 0.573 (LTW), below Zhao et al.’s 0.761. The complex model achieved mean AUCs of 0.505 (STW) and 0.519 (LTW), lower than Zhao et al.’s 0.769. The clinical model showed a mean AUC of 0.555, below Zhao et al.’s 0.64. All models performed significantly lower than Zhao et al.’s findings. Our benchmark study on four ML pipelines revealed consistently low performance across all configurations.
Our study failed to replicate the original findings, suggesting poor model performance and questioning the predictive value of radiomics features in classifying PD-L1 expression from PET/CT imaging. These results highlight challenges in replicating radiomics-based ML models and stress the need for rigorous validation.

MCML Authors
Link to website

Theresa Stüber

Clinical Data Science in Radiology

Link to website

Boj Friedrich Hoppe

Dr.

Clinical Data Science in Radiology

Link to Profile Michael Ingrisch

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology


[1564]
Z. Sun, J. Kang, K. Qian, B. W. Schuller and B. Hu.
Creating Healthier Living Environments: The Role of Soundscapes in Promoting Mental Health and Well-Being.
IEEE Transactions on Computational Social Systems 12.1 (Feb. 2025). DOI
Abstract

With great pride and anticipation, we present the first issue of IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS (TCSS) for 2025. Reflecting on the remarkable achievements of 2024, this past year stands as a testament to academic excellence and prolific scholarly output. Over the course of the year, our journal published an impressive 642 high-quality articles, totaling approximately 5800 pages, distributed across six issues. These works collectively underscore the vibrant growth and interdisciplinary impact of computational social systems.

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[1563]
M. Ghahremani, B. R. Ernhofer, J. Wang, M. Makowski and C. Wachinger.
Organ-DETR: Organ Detection via Transformers.
IEEE Transactions on Medical Imaging Early Access (Feb. 2025). DOI URL
Abstract

Query-based Transformers have been yielding impressive performance in object localization and detection tasks. However, their application to organ detection in 3D medical imaging data has been relatively unexplored. This study introduces Organ-DETR, featuring two innovative modules, MultiScale Attention (MSA) and Dense Query Matching (DQM), designed to enhance the performance of Detection Transformers (DETRs) for 3D organ detection. MSA is a novel top-down representation learning approach for efficiently encoding Computed Tomography (CT) features. This architecture employs a multiscale attention mechanism, utilizing both dual self-attention and cross-scale attention mechanisms to extract intra- and inter-scale spatial interactions in the attention mechanism. Organ-DETR also introduces DQM, an approach for one-to-many matching that tackles the label assignment difficulties in organ detection. DQM increases positive queries to enhance both recall scores and training efficiency without the need for additional learnable parameters. Extensive results on five 3D CT datasets indicate that the proposed Organ-DETR outperforms comparable techniques by achieving a remarkable improvement of +10.6 mAP COCO.

MCML Authors
Link to website

Morteza Ghahremani

Dr.

Artificial Intelligence in Medical Imaging

Link to Profile Christian Wachinger

Christian Wachinger

Prof. Dr.

Artificial Intelligence in Medical Imaging


[1562]
D. Huang, C. Li, A. Karlas, X. Chu, K. W. S. Au, N. Navab and Z. Jiang.
VibNet: Vibration-Boosted Needle Detection in Ultrasound Images.
IEEE Transactions on Medical Imaging Early Access (Feb. 2025). DOI GitHub
Abstract

Precise percutaneous needle detection is crucial for ultrasound (US)-guided interventions. However, inherent limitations such as speckles, needle-like artifacts, and low resolution make it challenging to robustly detect needles, especially when their visibility is reduced or imperceptible. To address this challenge, we propose VibNet, a learning-based framework designed to enhance the robustness and accuracy of needle detection in US images by leveraging periodic vibration applied externally to the needle shafts. VibNet integrates neural Short-Time Fourier Transform and Hough Transform modules to achieve successive sub-goals, including motion feature extraction in the spatiotemporal space, frequency feature aggregation, and needle detection in the Hough space. Due to the periodic subtle vibration, the features are more robust in the frequency domain than in the image intensity domain, making VibNet more effective than traditional intensity-based methods. To demonstrate the effectiveness of VibNet, we conducted experiments on distinct ex vivo porcine and bovine tissue samples. The results obtained on porcine samples demonstrate that VibNet effectively detects needles even when their visibility is severely reduced, with a tip error of 1.61±1.56 mm compared to 8.15±9.98 mm for UNet and 6.63±7.58 mm for WNet, and a needle direction error of 1.64 ± 1.86° compared to 9.29 ± 15.30° for UNet and 8.54 ± 17.92° for WNet.

MCML Authors
Link to website

Dianye Huang

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to website

Zhongliang Jiang

Dr.

Computer Aided Medical Procedures & Augmented Reality


[1561]
X.-Y. Tong, R. Dong and X. Zhu.
Global high categorical resolution land cover mapping via weak supervision.
ISPRS Journal of Photogrammetry and Remote Sensing 220 (Feb. 2025). DOI GitHub
Abstract

Land cover information is indispensable for advancing the United Nations’ sustainable development goals, and land cover mapping under a more detailed category system would significantly contribute to economic livelihood tracking and environmental degradation measurement. However, the substantial difficulty in acquiring fine-grained training data makes the implementation of this task particularly challenging. Here, we propose to combine a fully labeled source domain and a weakly labeled target domain for weakly supervised domain adaptation (WSDA). This is beneficial as the utilization of sparse and coarse weak labels can considerably alleviate the labor required for precise and detailed land cover annotation. Specifically, we introduce the Prototype-based pseudo-label Rectification and Expansion (PRE) approach, which leverages the prototypes (i.e., the class-wise feature centroids) as the bridge to connect sparse labels and global feature distributions. According to the feature distances to the prototypes, the confidence of pseudo-labels predicted in the unlabeled regions of the target domain is assessed. This confidence is then utilized to guide the dynamic expansion and rectification of pseudo-labels. Based on PRE, we carry out high categorical resolution land cover mapping for 10 cities in different regions around the world, using PlanetScope, Gaofen-1, and Sentinel-2 satellite images, respectively. In the study areas, we achieve cross-sensor, cross-category, and cross-continent WSDA, with the overall accuracy exceeding 80%. The promising results indicate that PRE is capable of reducing the dependency of land cover classification on high-quality annotations, thereby improving label efficiency. We expect our work to enable global fine-grained land cover mapping, which will in turn enable Earth observation to provide more precise and thorough information for environmental monitoring.
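The prototype idea at the core of PRE can be sketched as follows: class-wise feature centroids are computed from the sparsely labeled samples, and pseudo-labels for unlabeled features are scored by their distance to the nearest prototype. The inverse-distance confidence below is an illustrative stand-in for the paper's rectification-and-expansion rule, and all names are our own.

```python
import math

def prototypes(features, labels):
    """Class-wise feature centroids from sparse labels (None = unlabeled)."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y is None:
            continue
        s = sums.setdefault(y, [0.0] * len(f))
        for d, v in enumerate(f):
            s[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def pseudo_label(feature, protos):
    """Nearest-prototype class with a distance-based confidence score."""
    dists = {y: math.dist(feature, p) for y, p in protos.items()}
    y = min(dists, key=dists.get)
    return y, 1.0 / (1.0 + dists[y])  # illustrative confidence, not the paper's formula
```

In the paper this confidence then drives which pseudo-labels are kept, expanded, or rectified; here it is only a scalar score per pixel feature.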

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1560]
J. Hanselle, S. Heid, J. Fürnkranz and E. Hüllermeier.
Probabilistic scoring lists for interpretable machine learning.
Machine Learning 114.55 (Feb. 2025). DOI
Abstract

A scoring system is a simple decision model that checks a set of features, adds a certain number of points to a total score for each feature that is satisfied, and finally makes a decision by comparing the total score to a threshold. Scoring systems have a long history of active use in safety-critical domains such as healthcare and justice, where they provide guidance for making objective and accurate decisions. Given their genuine interpretability, the idea of learning scoring systems from data is obviously appealing from the perspective of explainable AI. In this paper, we propose a practically motivated extension of scoring systems called probabilistic scoring lists (PSL), as well as a method for learning PSLs from data. Instead of making a deterministic decision, a PSL represents uncertainty in the form of probability distributions, or, more generally, probability intervals. Moreover, in the spirit of decision lists, a PSL evaluates features one by one and stops as soon as a decision can be made with enough confidence. To evaluate our approach, we conduct case studies in the medical domain and on standard benchmark data.
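The stage-wise, stop-early evaluation described above can be sketched as follows. The features, point values, probability intervals, and the 0.8 confidence threshold are hypothetical numbers chosen for illustration; in the paper these are learned from data.

```python
def psl_decide(features, stages, threshold=0.8):
    """Evaluate a probabilistic scoring list (PSL) sketch.

    `stages` lists (feature_name, points, interval_by_score), where
    interval_by_score maps the running total to a probability interval
    [lo, hi] for the positive class. Evaluation stops as soon as the
    interval lies entirely above `threshold` or below 1 - threshold.
    """
    score = 0
    for name, points, intervals in stages:
        if features.get(name, False):
            score += points
        lo, hi = intervals[score]
        if lo >= threshold:
            return "positive", score
        if hi <= 1 - threshold:
            return "negative", score
    return "abstain", score

# Hypothetical two-feature scoring list.
STAGES = [
    ("fever", 2, {0: (0.10, 0.40), 2: (0.50, 0.90)}),
    ("cough", 1, {0: (0.05, 0.15), 1: (0.20, 0.45),
                  2: (0.55, 0.75), 3: (0.85, 0.95)}),
]
```

A patient with both features reaches a total score of 3, whose interval (0.85, 0.95) clears the threshold, so the list stops with a positive decision; a patient with neither feature is confidently classified as negative after the second stage.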

MCML Authors
Link to website

Jonas Hanselle

Artificial Intelligence and Machine Learning

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1559]
T. Willem, V. A. Shitov, M. D. Luecken, N. Kilbertus, S. Bauer, M. Piraud, A. Buyx and F. J. Theis.
Biases in machine-learning models of human single-cell data.
Nature Cell Biology (Feb. 2025). DOI
Abstract

Recent machine-learning (ML)-based advances in single-cell data science have enabled the stratification of human tissue donors at single-cell resolution, promising to provide valuable diagnostic and prognostic insights. However, such insights are susceptible to biases. Here we discuss various biases that emerge along the pipeline of ML-based single-cell analysis, ranging from societal biases affecting whose samples are collected, to clinical and cohort biases that influence the generalizability of single-cell datasets, biases stemming from single-cell sequencing, ML biases specific to (weakly supervised or unsupervised) ML models trained on human single-cell samples and biases during the interpretation of results from ML models. We end by providing methods for single-cell data scientists to assess and mitigate biases, and call for efforts to address the root causes of biases.

MCML Authors
Link to Profile Niki Kilbertus

Niki Kilbertus

Prof. Dr.

Ethics in Systems Design and Machine Learning

Link to Profile Stefan Bauer

Stefan Bauer

Prof. Dr.

Algorithmic Machine Learning & Explainable AI


[1558]
C. I. Bercea, B. Wiestler, D. Rückert and J. A. Schnabel.
Evaluating normative representation learning in generative AI for robust anomaly detection in brain imaging.
Nature Communications 16.1624 (Feb. 2025). DOI GitHub
Abstract

Normative representation learning focuses on understanding the typical anatomical distributions from large datasets of medical scans from healthy individuals. Generative Artificial Intelligence (AI) leverages this attribute to synthesize images that accurately reflect these normative patterns. This capability enables such models to effectively detect and correct anomalies in new, unseen pathological data without the need for expert labeling. Traditional evaluations of anomaly detection methods focus on detection performance alone, overlooking the crucial role of normative learning. In our analysis, we introduce novel metrics specifically designed to evaluate this facet in AI models. We apply these metrics across various generative AI frameworks, including advanced diffusion models, and rigorously test them against complex and diverse brain pathologies. In addition, we conduct a large multi-reader study to compare these metrics to experts’ evaluations. Our analysis demonstrates that models proficient in normative learning exhibit exceptional versatility, adeptly detecting a wide range of unseen medical conditions.

MCML Authors
Link to Profile Benedikt Wiestler

Benedikt Wiestler

Prof. Dr.

AI for Image-Guided Diagnosis and Therapy

Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine

Link to Profile Julia Schnabel

Julia Schnabel

Prof. Dr.

Computational Imaging and AI in Medicine


[1557]
S. Feuerriegel, A. Maarouf, D. Bär, D. Geißler, J. Schweisthal, N. Pröllochs, C. E. Robertson, S. Rathje, J. Hartmann, S. M. Mohammad, O. Netzer, A. A. Siegel, B. Plank and J. J. Van Bavel.
Using natural language processing to analyse text data in behavioural science.
Nature Reviews Psychology 4 (Feb. 2025). DOI
Abstract

Language is a uniquely human trait at the core of human interactions. The language people use often reflects their personality, intentions and state of mind. With the integration of the Internet and social media into everyday life, much of human communication is documented as written text. These online forms of communication (for example, blogs, reviews, social media posts and emails) provide a window into human behaviour and therefore present abundant research opportunities for behavioural science. In this Review, we describe how natural language processing (NLP) can be used to analyse text data in behavioural science. First, we review applications of text data in behavioural science. Second, we describe the NLP pipeline and explain the underlying modelling approaches (for example, dictionary-based approaches and large language models). We discuss the advantages and disadvantages of these methods for behavioural science, in particular with respect to the trade-off between interpretability and accuracy. Finally, we provide actionable recommendations for using NLP to ensure rigour and reproducibility.

MCML Authors
Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management

Link to website

Abdurahman Maarouf

Artificial Intelligence in Management

Link to website

Dominik Bär

Artificial Intelligence in Management

Link to website

Jonas Schweisthal

Artificial Intelligence in Management

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1556]
M. Drton, A. Grosdos, I. Portakal and N. Sturma.
Algebraic Sparse Factor Analysis.
SIAM Journal on Applied Algebra and Geometry 9 (Feb. 2025). DOI
Abstract

Factor analysis is a statistical technique that explains correlations among observed random variables with the help of a smaller number of unobserved factors. In traditional full factor analysis, each observed variable is influenced by every factor. However, many applications exhibit interesting sparsity patterns; that is, each observed variable only depends on a subset of the factors. In this paper, we study such sparse factor analysis models from an algebro-geometric perspective. Under mild conditions on the sparsity pattern, we examine the dimension of the set of covariance matrices that corresponds to a given model. Moreover, we study algebraic relations among the covariances in sparse two-factor models. In particular, we identify cases in which a Gröbner basis for these relations can be derived via a 2-delightful term order and join of toric ideals of graphs.

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[1555]
E. Ailer, C. L. Müller and N. Kilbertus.
Instrumental variable estimation for compositional treatments.
Scientific Reports 15.5158 (Feb. 2025). DOI
Abstract

Many scientific datasets are compositional in nature. Important biological examples include species abundances in ecology, cell-type compositions derived from single-cell sequencing data, and amplicon abundance data in microbiome research. Here, we provide a causal view on compositional data in an instrumental variable setting where the composition acts as the cause. First, we crisply articulate potential pitfalls for practitioners regarding the interpretation of compositional causes from the viewpoint of interventions and warn against attributing causal meaning to common summary statistics such as diversity indices in microbiome data analysis. We then advocate for and develop multivariate methods using statistical data transformations and regression techniques that take the special structure of the compositional sample space into account while still yielding scientifically interpretable results. In a comparative analysis on synthetic and real microbiome data we show the advantages and limitations of our proposal. We posit that our analysis provides a useful framework and guidance for valid and informative cause-effect estimation in the context of compositional data.

MCML Authors
Elisabeth Ailer

Elisabeth Ailer

* Former Member

Link to Profile Christian Müller

Christian Müller

Prof. Dr.

Biomedical Statistics and Data Science

Link to Profile Niki Kilbertus

Niki Kilbertus

Prof. Dr.

Ethics in Systems Design and Machine Learning


[1554]
V. Steidl, J. L. Bamber and X. Zhu.
Physics-aware machine learning for glacier ice thickness estimation: a case study for Svalbard.
The Cryosphere 19.2 (Feb. 2025). DOI
Abstract

The ice thickness of the world’s glaciers is mostly unmeasured, and physics-based models to reconstruct ice thickness cannot always deliver accurate estimates. In this study, we use deep learning paired with physical knowledge to generate ice thickness estimates for all glaciers of Spitsbergen, Barentsøya, and Edgeøya in Svalbard. We incorporate mass conservation and other physically derived conditions into a neural network to predict plausible ice thicknesses even for glaciers without any in situ ice thickness measurements. With a glacier-wise cross-validation scheme, we evaluate the performance of the physics-informed neural network. The results of these proof-of-concept experiments let us identify several challenges and opportunities that affect the model’s performance in a real-world setting.

MCML Authors
Link to website

Viola Steidl

Data Science in Earth Observation

Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1553]
V. Iwuajoku, K. Ekici, A. Haas, M. Z. Kazemi, A. Kasajima, C. Delbridge, A. Muckenhuber, E. Schmoeckel, F. Stögbauer, C. Bollwein, K. Schwamborn, K. Steiger, C. Mogler and P. J. Schüffler.
An equivalency and efficiency study for one year digital pathology for clinical routine diagnostics in an accredited tertiary academic center.
Virchows Archiv (Feb. 2025). DOI
Abstract

Digital pathology is revolutionizing clinical diagnostics by offering enhanced efficiency, accuracy, and accessibility of pathological examinations. This study explores the implementation and validation of digital pathology in a large tertiary academic center, focusing on its gradual integration and transition into routine clinical diagnostics. In a comprehensive validation process over a 6-month period, we compared sign-out of digital and physical glass slides of a wide range of different tissue specimens and histopathological diagnoses. Key metrics such as diagnostic concordance and user satisfaction were assessed by involving the pathologists in a validation training and study phase. We measured turnaround times before and after transitioning to digital pathology to assess the impact on overall efficiency. Our results demonstrate a 99% concordance between the analog and digital reports while at the same time reducing the time to sign out a case by almost a minute, suggesting potential long-term efficiency gains. Our digital transition positively impacted our pathology workflow: Pathologists reported increased flexibility and satisfaction due to the ease of accessing and sharing digital slides. However, challenges were identified, including technical issues related to image quality and system integration. Lessons learned from this study emphasize the importance of robust training programs, adequate IT support, and ongoing evaluation to ensure successful integration. This validation study confirms that digital pathology is a viable and beneficial tool for accurate clinical routine diagnostics in large academic centers, offering insights for other institutions considering similar endeavors.

MCML Authors
Link to Profile Peter Schüffler

Peter Schüffler

Prof. Dr.

Computational Pathology


[1552]
E. Banzato, M. Drton, K. Saraf-Poor and H. Shi.
Existence of Direct Density Ratio Estimators.
Preprint (Feb. 2025). arXiv
Abstract

Many two-sample problems call for a comparison of two distributions from an exponential family. Density ratio estimation methods provide ways to solve such problems through direct estimation of the differences in natural parameters. The term direct indicates that one avoids estimating both marginal distributions. In this context, we consider the Kullback–Leibler Importance Estimation Procedure (KLIEP), which has been the subject of recent work on differential networks. Our main result shows that the existence of the KLIEP estimator is characterized by whether the average sufficient statistic for one sample belongs to the convex hull of the set of all sufficient statistics for data points in the second sample. For high-dimensional problems it is customary to regularize the KLIEP loss by adding the product of a tuning parameter and a norm of the vector of parameter differences. We show that the existence of the regularized KLIEP estimator requires the tuning parameter to be no less than the dual norm-based distance between the average sufficient statistic and the convex hull. The implications of these existence issues are explored in applications to differential network analysis.
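The existence characterization is easy to check in one dimension, where the convex hull of the second sample's sufficient statistics is simply the interval between their minimum and maximum. A minimal sketch (the identity sufficient statistic is an illustrative default; in general `t` is the exponential family's sufficient statistic):

```python
def kliep_estimator_exists(sample1, sample2, t=lambda x: x):
    """Existence check for the unregularized KLIEP estimator in 1-D.

    Per the characterization above, the estimator exists iff the average
    sufficient statistic of sample 1 lies in the convex hull of the
    sufficient statistics of sample 2 (an interval in 1-D).
    """
    avg = sum(t(x) for x in sample1) / len(sample1)
    stats2 = [t(x) for x in sample2]
    return min(stats2) <= avg <= max(stats2)
```

In higher dimensions the same check becomes a convex-hull membership (linear programming) problem, and for the regularized estimator the abstract's condition instead compares the tuning parameter to the dual-norm distance between the average sufficient statistic and the hull.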

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[1551]
H. Bao and M. Schomaker.
Addressing Positivity Violations in Continuous Interventions through Data-Adaptive Strategies.
Preprint (Feb. 2025). arXiv
Abstract

Positivity violations pose a key challenge in the estimation of causal effects, particularly for continuous interventions. Current approaches for addressing this issue include the use of projection functions or modified treatment policies. While effective in many contexts, these methods can result in estimands that potentially do not align well with the original research question, thereby leading to compromises in interpretability. In this paper, we introduce a novel diagnostic tool, the non-overlap ratio, to detect positivity violations. To address these violations while maintaining interpretability, we propose a data-adaptive solution, specifically a ‘most feasible’ intervention strategy. Our strategy operates on a unit-specific basis. For a given intervention of interest, we first assess whether the intervention value is feasible for each unit. For units with sufficient support, conditional on confounders, we adhere to the intervention of interest. However, for units lacking sufficient support, as identified through the assessment of the non-overlap ratio, we do not assign the actual intervention value of interest. Instead, we assign the closest feasible value within the support region. To estimate this new estimand, we propose an estimator based on g-computation, coupled with flexible conditional density estimation of the high- and low-support regions. Through simulations, we demonstrate that our method effectively reduces bias across various scenarios by addressing positivity violations. Moreover, when positivity violations are absent, the method successfully recovers the standard estimand. We further validate its practical utility using real-world data from the CHAPAS-3 trial, which enrolled HIV-positive children in Zambia and Uganda.
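Under the (strong) simplifying assumption that a per-unit support region for the intervention has already been estimated as an interval, the non-overlap diagnostic and the unit-wise 'most feasible' assignment can be sketched as follows; the interval representation and all names are illustrative stand-ins for the paper's density-based estimates.

```python
def non_overlap_ratio(intervention, supports):
    """Fraction of units whose support region excludes the intervention value."""
    outside = sum(1 for lo, hi in supports if not lo <= intervention <= hi)
    return outside / len(supports)

def most_feasible(intervention, supports):
    """Assign each unit the intervention if feasible, otherwise the closest
    feasible value within its (confounder-conditional) support region."""
    return [min(max(intervention, lo), hi) for lo, hi in supports]

# Three units with hypothetical support intervals for the intervention.
SUPPORTS = [(0.0, 5.0), (2.0, 3.0), (4.0, 6.0)]
```

For an intervention value of 1.0, the first unit keeps it, while the other two receive the nearest feasible values 2.0 and 4.0, so the assigned values stay interpretable as "as close to the intervention of interest as the data allow".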

MCML Authors
Link to Profile Michael Schomaker

Michael Schomaker

Prof. Dr.

Biostatistics


[1550]
L. Bertolazzi, P. Mondorf, B. Plank and R. Bernardi.
The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It.
Preprint (Feb. 2025). arXiv
Abstract

The ability of large language models (LLMs) to validate their output and identify potential errors is crucial for ensuring robustness and reliability. However, current research indicates that LLMs struggle with self-correction, encountering significant challenges in detecting errors. While studies have explored methods to enhance self-correction in LLMs, relatively little attention has been given to understanding the models’ internal mechanisms underlying error detection. In this paper, we present a mechanistic analysis of error detection in LLMs, focusing on simple arithmetic problems. Through circuit analysis, we identify the computational subgraphs responsible for detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal that all models heavily rely on consistency heads, i.e., attention heads that assess surface-level alignment of numerical values in arithmetic solutions. Moreover, we observe that the models’ internal arithmetic computation primarily occurs in higher layers, whereas validation takes place in middle layers, before the final arithmetic results are fully encoded. This structural dissociation between arithmetic computation and validation seems to explain why current LLMs struggle to detect even simple arithmetic errors.

MCML Authors
Link to website

Philipp Mondorf

AI and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1549]
J. Bi, Y. Wang, D. Yan, X. Xiao, A. Hecker, V. Tresp and Y. Ma.
PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection.
Preprint (Feb. 2025). arXiv
Abstract

Visual instruction tuning refines pre-trained Multimodal Large Language Models (MLLMs) to enhance their real-world task performance. However, the rapid expansion of visual instruction datasets introduces significant data redundancy, leading to excessive computational costs. Existing data selection methods predominantly rely on proxy models or loss-based metrics, both of which impose substantial computational overheads due to the necessity of model inference and backpropagation. To address this challenge, we propose PRISM, a novel training-free approach for efficient multimodal data selection. Unlike existing methods, PRISM eliminates the reliance on proxy models, warm-up pretraining, and gradient-based optimization. Instead, it leverages Pearson correlation analysis to quantify the intrinsic visual encoding properties of MLLMs, computing a task-specific correlation score to identify high-value instances. This not only enables data-efficient selection, but also maintains the original performance. Empirical evaluations across multiple MLLMs demonstrate that PRISM reduces the overall time required for visual instruction tuning and data selection to just 30% of conventional methods, while surpassing fully fine-tuned models across eight multimodal and three language understanding benchmarks, achieving a 101.7% relative improvement in final performance.

MCML Authors
Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems and Data Mining

Link to website

Yunpu Ma

Dr.

Database Systems and Data Mining


[1548]
S. Dirksen, W. Li and J. Maly.
Subspace and DOA estimation under coarse quantization.
Preprint (Feb. 2025). arXiv
Abstract

We study direction-of-arrival (DOA) estimation from coarsely quantized data. We focus on a two-step approach which first estimates the signal subspace via covariance estimation and then extracts DOA angles by the ESPRIT algorithm. In particular, we analyze two stochastic quantization schemes which use dithering: a one-bit quantizer combined with rectangular dither and a multi-bit quantizer with triangular dither. For each quantizer, we derive rigorous high probability bounds for the distances between the true and estimated signal subspaces and DOA angles. Using our analysis, we identify scenarios in which subspace and DOA estimation via triangular dithering qualitatively outperforms rectangular dithering. We verify in numerical simulations that our estimates are optimal in their dependence on the smallest non-zero eigenvalue of the target matrix. The resulting subspace estimation guarantees are equally applicable in the analysis of other spectral estimation algorithms and related problems.
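The rectangular-dither scheme is simple to simulate in the scalar case: adding uniform dither on [-δ, δ] before a one-bit sign quantizer makes δ·sign(x + τ) an unbiased estimate of x whenever |x| ≤ δ, which is the property that makes covariance estimation from such coarse measurements possible. A minimal sketch with illustrative parameters (the paper's actual setting is the multi-channel covariance/DOA pipeline, not a scalar mean):

```python
import random

def one_bit_rect_dither(samples, delta, seed=0):
    """One-bit quantization with rectangular (uniform) dither on [-delta, delta]."""
    rng = random.Random(seed)
    out = []
    for x in samples:
        tau = rng.uniform(-delta, delta)        # fresh dither per sample
        out.append(delta if x + tau >= 0 else -delta)
    return out

# Averaging many one-bit measurements of the same value recovers it,
# since E[delta * sign(x + tau)] = x for |x| <= delta.
x_true = 0.3
measurements = one_bit_rect_dither([x_true] * 20000, delta=1.0)
x_hat = sum(measurements) / len(measurements)
```

The triangular-dither variant analyzed in the paper uses a multi-bit quantizer and a different dither distribution to also debias second moments; the scalar sketch above only illustrates the first-moment mechanism.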

MCML Authors

Johannes Maly

Prof. Dr.

Mathematical Data Science and Artificial Intelligence


[1547]
B. D. Earp, S. P. Mann, M. Aboy, E. Awad, M. Betzler, M. Botes, R. Calcott, M. Caraccio, N. Chater, M. Coeckelbergh, M. Constantinescu, H. Dabbagh, K. Devlin, X. Ding, V. Dranseika, J. A. C. Everett, R. Fan, F. Feroz, K. B. Francis, C. Friedman, O. Friedrich, I. Gabriel, I. Hannikainen, J. Hellmann, A. K. Jahrome, N. S. Janardhanan, P. Jurcys, A. Kappes, M. A. Khan, G. Kraft-Todd, M. Kroner Dale, S. M. Laham, B. Lange, M. Leuenberger, J. Lewis, P. Liu, D. M. Lyreskog, M. Maas, J. McMillan, E. Mihailov, T. Minssen, J. Teperowski Monrad, K. Muyskens, S. Myers, S. Nyholm, A. M. Owen, A. Puzio, C. Register, M. G. Reinecke, A. Safron, H. Shevlin, H. Shimizu, P. V. Treit, C. Voinea, K. Yan, A. Zahiu, R. Zhang, H. Zohny, W. Sinnott-Armstrong, I. Singh, J. Savulescu and M. S. Clark.
Relational Norms for Human-AI Cooperation.
Preprint (Feb. 2025). arXiv
Abstract

How we should design and interact with social artificial intelligence depends on the socio-relational role the AI is meant to emulate or occupy. In human society, relationships such as teacher-student, parent-child, neighbors, siblings, or employer-employee are governed by specific norms that prescribe or proscribe cooperative functions including hierarchy, care, transaction, and mating. These norms shape our judgments of what is appropriate for each partner. For example, workplace norms may allow a boss to give orders to an employee, but not vice versa, reflecting hierarchical and transactional expectations. As AI agents and chatbots powered by large language models are increasingly designed to serve roles analogous to human positions - such as assistant, mental health provider, tutor, or romantic partner - it is imperative to examine whether and how human relational norms should extend to human-AI interactions. Our analysis explores how differences between AI systems and humans, such as the absence of conscious experience and immunity to fatigue, may affect an AI’s capacity to fulfill relationship-specific functions and adhere to corresponding norms. This analysis, which is a collaborative effort by philosophers, psychologists, relationship scientists, ethicists, legal experts, and AI researchers, carries important implications for AI systems design, user behavior, and regulation. While we accept that AI systems can offer significant benefits such as increased availability and consistency in certain socio-relational roles, they also risk fostering unhealthy dependencies or unrealistic expectations that could spill over into human-human relationships. We propose that understanding and thoughtfully shaping (or implementing) suitable human-AI relational norms will be crucial for ensuring that human-AI interactions are ethical, trustworthy, and favorable to human well-being.

MCML Authors

Benjamin Lange

Dr.

Ethics of Artificial Intelligence

Sven Nyholm

Prof. Dr.

Ethics of Artificial Intelligence


[1546]
M. Fornasier and L. Sun.
Regularity and positivity of solutions of the Consensus-Based Optimization equation: unconditional global convergence.
Preprint (Feb. 2025). arXiv
Abstract

Introduced in 2017, Consensus-Based Optimization (CBO) has rapidly emerged as a significant breakthrough in global optimization. This straightforward yet powerful multi-particle, zero-order optimization method draws inspiration from Simulated Annealing and Particle Swarm Optimization. Using a quantitative mean-field approximation, CBO dynamics can be described by a nonlinear Fokker-Planck equation with degenerate diffusion, which does not follow a gradient flow structure. In this paper, we demonstrate that solutions to the CBO equation remain positive and maintain full support. Building on this foundation, we establish the unconditional global convergence of CBO methods to global minimizers. Our results are derived through an analysis of solution regularity and the proof of existence for smooth, classical solutions to a broader class of drift-diffusion equations, despite the challenges posed by degenerate diffusion.
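
The particle-level CBO dynamics behind this mean-field equation are well documented: particles drift toward a Gibbs-weighted consensus point and diffuse with noise scaled by their distance to it. A minimal one-dimensional sketch (hyperparameter values illustrative, not from the paper):

```python
import math
import random

def cbo_minimize(f, n_particles=50, steps=400, alpha=30.0,
                 lam=1.0, sigma=0.7, dt=0.05, seed=0):
    """Minimal 1D Consensus-Based Optimization sketch: each step,
    particles move toward the Gibbs-weighted consensus point and
    receive noise proportional to their distance from it, so the
    ensemble contracts onto a near-optimal location."""
    rng = random.Random(seed)
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n_particles)]
    consensus = 0.0
    for _ in range(steps):
        fs = [f(x) for x in xs]
        fmin = min(fs)  # shift the exponent for numerical stability
        ws = [math.exp(-alpha * (v - fmin)) for v in fs]
        consensus = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
        xs = [x - dt * lam * (x - consensus)
              + sigma * math.sqrt(dt) * abs(x - consensus) * rng.gauss(0.0, 1.0)
              for x in xs]
    return consensus
```

Note the degenerate diffusion the paper studies is visible here: the noise amplitude vanishes as a particle reaches the consensus point.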

MCML Authors

Massimo Fornasier

Prof. Dr.

Applied Numerical Analysis

Lukang Sun

Applied Numerical Analysis


[1545]
F. P. Delgado, F. Simões, L. Kronik, W. Kaiser and D. A. Egger.
Machine-Learning Force Fields Reveal Shallow Electronic States on Dynamic Halide Perovskite Surfaces.
Preprint (Feb. 2025). arXiv
Abstract

The spectacular performance of halide perovskites in optoelectronic devices is rooted in their favorable tolerance to structural defects. Previous studies showed that defects in these materials generate shallow electronic states that do not degrade device performance. However, how these shallow states persist amid the pronounced thermally-stimulated atomic dynamics on halide perovskite surfaces remains unknown. This work reveals that electronic states at surfaces of the prototypical CsPbBr3 variant are energetically distributed at room temperature, akin to well-passivated inorganic semiconductors, even when covalent bonds remain cleaved and undercoordinated. Specifically, a striking tendency for shallow surface states is found with approximately 70% of surface-state energies appearing within 0.2 eV or ≈8kBT from the valence-band edge. Furthermore, we show that even when surface states appear deeper in the gap, they are not energetically isolated and are less likely to act as traps. We achieve this result by accelerating first-principles calculations via machine-learning techniques and show that the unique atomic dynamics in these materials render the formation of deep electronic states at their surfaces unlikely. These findings reveal the microscopic mechanism behind the low density of deep defect states at dynamic halide perovskite surfaces, which is key to their exceptional performance in devices.

MCML Authors

David Egger

Prof. Dr.

Theory of Functional Energy Materials


[1544]
T. Fröch, O. Wysocki, Y. Xia, J. Xie, B. Schwab, D. Cremers and T. H. Kolbe.
FacaDiffy: Inpainting Unseen Facade Parts Using Diffusion Models.
Preprint (Feb. 2025). arXiv GitHub
Abstract

High-detail semantic 3D building models are frequently utilized in robotics, geoinformatics, and computer vision. One key aspect of creating such models is employing 2D conflict maps that detect openings’ locations in building facades. Yet, in reality, these maps are often incomplete due to obstacles encountered during laser scanning. To address this challenge, we introduce FacaDiffy, a novel method for inpainting unseen facade parts by completing conflict maps with a personalized Stable Diffusion model. Specifically, we first propose a deterministic ray analysis approach to derive 2D conflict maps from existing 3D building models and corresponding laser scanning point clouds. Furthermore, we facilitate the inpainting of unseen facade objects into these 2D conflict maps by leveraging the potential of personalizing a Stable Diffusion model. To complement the scarcity of real-world training data, we also develop a scalable pipeline to produce synthetic conflict maps using random city model generators and annotated facade images. Extensive experiments demonstrate that FacaDiffy achieves state-of-the-art performance in conflict map completion compared to various inpainting baselines and increases the detection rate by 22% when applying the completed conflict maps for high-definition 3D semantic building reconstruction.

MCML Authors
Yan Xia

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1543]
M. Fuest, V. T. Hu and B. Ommer.
MaskFlow: Discrete Flows For Flexible and Efficient Long Video Generation.
Preprint (Feb. 2025). arXiv
Abstract

Generating long, high-quality videos remains a challenge due to the complex interplay of spatial and temporal dynamics and hardware limitations. In this work, we introduce MaskFlow, a unified video generation framework that combines discrete representations with flow-matching to enable efficient generation of high-quality long videos. By leveraging a frame-level masking strategy during training, MaskFlow conditions on previously generated unmasked frames to generate videos with lengths ten times beyond that of the training sequences. MaskFlow does so very efficiently by enabling the use of fast Masked Generative Model (MGM)-style sampling and can be deployed in both fully autoregressive as well as full-sequence generation modes. We validate the quality of our method on the FaceForensics (FFS) and Deepmind Lab (DMLab) datasets and report Fréchet Video Distance (FVD) competitive with state-of-the-art approaches. We also provide a detailed analysis on the sampling efficiency of our method and demonstrate that MaskFlow can be applied to both timestep-dependent and timestep-independent models in a training-free manner.

MCML Authors

Vincent Tao Hu

Dr.

Computer Vision & Learning

Björn Ommer

Prof. Dr.

Computer Vision & Learning


[1542]
H. Funk, R. Ludwig, H. Küchenhoff and T. Nagler.
Modelling Climate Variables at High Temporal Resolution.
Preprint (Feb. 2025). DOI
Abstract

Large ensembles of climate models are indispensable for analyzing natural climate variability and estimating the occurrence of rare extreme events. Many hydrometeorological applications—such as compound event analysis, return period estimation, weather forecasting, downscaling, and bias correction—rely on an accurate representation of the multivariate distribution of climate variables. However, at high temporal resolutions, variables like precipitation often exhibit significant zero-inflation and heavy-tailed distributions. This inflation propagates through the entire multivariate dependence structure, complicating the relationships between zero-inflated and non-inflated variables. Inadequate modeling and correction of these dependencies can substantially degrade the reliability of hydrometeorological methodologies.
In an earlier work, we developed a novel multivariate density decomposition for zero-inflated variables based on vine copulas. This method has been integrated into multivariate Vine Copula Bias Correction for partially zero-inflated margins (VBC), with potential applications in other fields facing high-resolution climate data challenges. We revisit the idea behind VBC and illustrate its advantages over other bias correction methods, highlighting the interpretability of VBC and the control over and assessment of the results it generates.

MCML Authors

Henri Funk

Statistical Consulting Unit (StaBLab)

Helmut Küchenhoff

Prof. Dr.

Statistical Consulting Unit (StaBLab)

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science


[1541]
L. He, E. Nie, S. S. Dindar, A. Firoozi, A. Florea, V. Nguyen, C. Puffay, R. Shimizu, H. Ye, J. Brennan, H. Schmid, H. Schütze and N. Mesgarani.
XCOMPS: A Multilingual Benchmark of Conceptual Minimal Pairs.
Preprint (Feb. 2025). arXiv
Abstract

In this work, we introduce XCOMPS, a multilingual conceptual minimal pair dataset covering 17 languages. Using this dataset, we evaluate LLMs’ multilingual conceptual understanding through metalinguistic prompting, direct probability measurement, and neurolinguistic probing. By comparing base, instruction-tuned, and knowledge-distilled models, we find that: 1) LLMs exhibit weaker conceptual understanding for low-resource languages, and accuracy varies across languages despite being tested on the same concept sets. 2) LLMs excel at distinguishing concept-property pairs that are visibly different but exhibit a marked performance drop when negative pairs share subtle semantic similarities. 3) Instruction tuning improves performance in concept understanding but does not enhance internal competence; knowledge distillation can enhance internal competence in conceptual understanding for low-resource languages with limited gains in explicit task performance. 4) More morphologically complex languages yield lower concept understanding scores and require deeper layers for conceptual reasoning.

MCML Authors

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1540]
K. Heß, D. Frauen, V. Melnychuk and S. Feuerriegel.
Efficient and Sharp Off-Policy Learning under Unobserved Confounding.
Preprint (Feb. 2025). arXiv
Abstract

We develop a novel method for personalized off-policy learning in scenarios with unobserved confounding. Thereby, we address a key limitation of standard policy learning: standard policy learning assumes unconfoundedness, meaning that no unobserved factors influence both treatment assignment and outcomes. However, this assumption is often violated, because of which standard policy learning produces biased estimates and thus leads to policies that can be harmful. To address this limitation, we employ causal sensitivity analysis and derive a statistically efficient estimator for a sharp bound on the value function under unobserved confounding. Our estimator has three advantages: (1) Unlike existing works, our estimator avoids unstable minimax optimization based on inverse propensity weighted outcomes. (2) Our estimator is statistically efficient. (3) We prove that our estimator leads to the optimal confounding-robust policy. Finally, we extend our theory to the related task of policy improvement under unobserved confounding, i.e., when a baseline policy such as the standard of care is available. We show in experiments with synthetic and real-world data that our method outperforms simple plug-in approaches and existing baselines. Our method is highly relevant for decision-making where unobserved confounding can be problematic, such as in healthcare and public policy.

MCML Authors

Konstantin Heß

Artificial Intelligence in Management

Dennis Frauen

Artificial Intelligence in Management

Valentyn Melnychuk

Artificial Intelligence in Management

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1539]
M. Jürgens, T. Mortier, E. Hüllermeier, V. Bengs and W. Waegeman.
A calibration test for evaluating set-based epistemic uncertainty representations.
Preprint (Feb. 2025). arXiv
Abstract

The accurate representation of epistemic uncertainty is a challenging yet essential task in machine learning. A widely used representation corresponds to convex sets of probabilistic predictors, also known as credal sets. One popular way of constructing these credal sets is via ensembling or specialized supervised learning methods, where the epistemic uncertainty can be quantified through measures such as the set size or the disagreement among members. In principle, these sets should contain the true data-generating distribution. As a necessary condition for this validity, we adopt the strongest notion of calibration as a proxy. Concretely, we propose a novel statistical test to determine whether there is a convex combination of the set’s predictions that is calibrated in distribution. In contrast to previous methods, our framework allows the convex combination to be instance dependent, recognizing that different ensemble members may be better calibrated in different regions of the input space. Moreover, we learn this combination via proper scoring rules, which inherently optimize for calibration. Building on differentiable, kernel-based estimators of calibration errors, we introduce a nonparametric testing procedure and demonstrate the benefits of capturing instance-level variability in synthetic and real-world experiments.

MCML Authors

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1538]
H. Laus, S. Parkinson, V. Charisopoulos, F. Krahmer and R. Willett.
Solving Inverse Problems with Deep Linear Neural Networks: Global Convergence Guarantees for Gradient Descent with Weight Decay.
Preprint (Feb. 2025). arXiv
Abstract

Machine learning methods are commonly used to solve inverse problems, wherein an unknown signal must be estimated from few measurements generated via a known acquisition procedure. In particular, neural networks perform well empirically but have limited theoretical guarantees. In this work, we study an underdetermined linear inverse problem that admits several possible solution mappings. A standard remedy (e.g., in compressed sensing) establishing uniqueness of the solution mapping is to assume knowledge of latent low-dimensional structure in the source signal. We ask the following question: do deep neural networks adapt to this low-dimensional structure when trained by gradient descent with weight decay regularization? We prove that mildly overparameterized deep linear networks trained in this manner converge to an approximate solution that accurately solves the inverse problem while implicitly encoding latent subspace structure. To our knowledge, this is the first result to rigorously show that deep linear networks trained with weight decay automatically adapt to latent subspace structure in the data under practical stepsize and weight initialization schemes. Our work highlights that regularization and overparameterization improve generalization, while overparameterization also accelerates convergence during training.

MCML Authors

Hannah Laus

Optimization & Data Analysis

Felix Krahmer

Prof. Dr.

Optimization & Data Analysis


[1537]
Y. Liu, R. Chen, L. Hirlimann, A. D. Hakimi, M. Wang, A. H. Kargaran, S. Rothe, F. Yvon and H. Schütze.
On Relation-Specific Neurons in Large Language Models.
Preprint (Feb. 2025). arXiv GitHub
Abstract

In large language models (LLMs), certain neurons can store distinct pieces of knowledge learned during pretraining. While knowledge typically appears as a combination of relations and entities, it remains unclear whether some neurons focus on a relation itself – independent of any entity. We hypothesize such neurons detect a relation in the input text and guide generation involving such a relation. To investigate this, we study the Llama-2 family on a chosen set of relations with a statistics-based method. Our experiments demonstrate the existence of relation-specific neurons. We measure the effect of selectively deactivating candidate neurons specific to relation r on the LLM’s ability to handle (1) facts whose relation is r and (2) facts whose relation is a different relation r′≠r. With respect to their capacity for encoding relation information, we give evidence for the following three properties of relation-specific neurons. (i) Neuron cumulativity. The neurons for r present a cumulative effect so that deactivating a larger portion of them results in the degradation of more facts in r. (ii) Neuron versatility. Neurons can be shared across multiple closely related as well as less related relations. Some relation neurons transfer across languages. (iii) Neuron interference. Deactivating neurons specific to one relation can improve LLM generation performance for facts of other relations.

MCML Authors

Lea Hirlimann

Computational Linguistics

Ahmad Dawar Hakimi

Computational Linguistics

Mingyang Wang

Computational Linguistics

Amir Hossein Kargaran

Computational Linguistics

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1536]
Y. Ma, D. Frauen, J. Schweisthal and S. Feuerriegel.
LLM-Driven Treatment Effect Estimation Under Inference Time Text Confounding.
Preprint (Feb. 2025). arXiv
Abstract

Estimating treatment effects is crucial for personalized decision-making in medicine, but this task faces unique challenges in clinical practice. At training time, models for estimating treatment effects are typically trained on well-structured medical datasets that contain detailed patient information. However, at inference time, predictions are often made using textual descriptions (e.g., descriptions with self-reported symptoms), which are incomplete representations of the original patient information. In this work, we make three contributions. (1) We show that the discrepancy between the data available during training time and inference time can lead to biased estimates of treatment effects. We formalize this issue as an inference time text confounding problem, where confounders are fully observed during training time but only partially available through text at inference time. (2) To address this problem, we propose a novel framework for estimating treatment effects that explicitly accounts for inference time text confounding. Our framework leverages large language models together with a custom doubly robust learner to mitigate biases caused by the inference time text confounding. (3) Through a series of experiments, we demonstrate the effectiveness of our framework in real-world applications.

MCML Authors

Yuchen Ma

Artificial Intelligence in Management

Dennis Frauen

Artificial Intelligence in Management

Jonas Schweisthal

Artificial Intelligence in Management

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1535]
V. Melnychuk, D. Frauen, J. Schweisthal and S. Feuerriegel.
Orthogonal Representation Learning for Estimating Causal Quantities.
Preprint (Feb. 2025). arXiv
Abstract

Representation learning is widely used for estimating causal quantities (e.g., the conditional average treatment effect) from observational data. While existing representation learning methods have the benefit of allowing for end-to-end learning, they do not have favorable theoretical properties of Neyman-orthogonal learners, such as double robustness and quasi-oracle efficiency. Also, such representation learning methods often employ additional constraints, like balancing, which may even lead to inconsistent estimation. In this paper, we propose a novel class of Neyman-orthogonal learners for causal quantities defined at the representation level, which we call OR-learners. Our OR-learners have several practical advantages: they allow for consistent estimation of causal quantities based on any learned representation, while offering favorable theoretical properties including double robustness and quasi-oracle efficiency. In multiple experiments, we show that, under certain regularity conditions, our OR-learners improve existing representation learning methods and achieve state-of-the-art performance. To the best of our knowledge, our OR-learners are the first work to offer a unified framework of representation learning methods and Neyman-orthogonal learners for causal quantities estimation.

MCML Authors

Valentyn Melnychuk

Artificial Intelligence in Management

Dennis Frauen

Artificial Intelligence in Management

Jonas Schweisthal

Artificial Intelligence in Management

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1534]
K. Padh, Z. Li, C. Casolo and N. Kilbertus.
Your Assumed DAG is Wrong and Here's How To Deal With It.
Preprint (Feb. 2025). arXiv
Abstract

Assuming a directed acyclic graph (DAG) that represents prior knowledge of causal relationships between variables is a common starting point for cause-effect estimation. Existing literature typically invokes hypothetical domain expert knowledge or causal discovery algorithms to justify this assumption. In practice, neither may propose a single DAG with high confidence. Domain experts are hesitant to rule out dependencies with certainty or have ongoing disputes about relationships; causal discovery often relies on untestable assumptions itself or only provides an equivalence class of DAGs and is commonly sensitive to hyperparameter and threshold choices. We propose an efficient, gradient-based optimization method that provides bounds for causal queries over a collection of causal graphs – compatible with imperfect prior knowledge – that may still be too large for exhaustive enumeration. Our bounds achieve good coverage and sharpness for causal queries such as average treatment effects in linear and non-linear synthetic settings as well as on real-world data. Our approach aims at providing an easy-to-use and widely applicable rebuttal to the valid critique of 'What if your assumed DAG is wrong?'.

MCML Authors

Kirtan Padh

Ethics in Systems Design and Machine Learning

Zhufeng Li

Ethics in Systems Design and Machine Learning

Cecilia Casolo

Ethics in Systems Design and Machine Learning

Niki Kilbertus

Prof. Dr.

Ethics in Systems Design and Machine Learning


[1533]
G. D. Pelegrina, P. Kolpaczki and E. Hüllermeier.
Shapley Value Approximation Based on k-Additive Games.
Preprint (Feb. 2025). arXiv
Abstract

The Shapley value is the prevalent solution for fair division problems in which a payout is to be divided among multiple agents. By adopting a game-theoretic view, the idea of fair division and the Shapley value can also be used in machine learning to quantify the individual contribution of features or data points to the performance of a predictive model. Despite its popularity and axiomatic justification, the Shapley value suffers from a computational complexity that scales exponentially with the number of entities involved, and hence requires approximation methods for its reliable estimation. We propose SVAkADD, a novel approximation method that fits a k-additive surrogate game. By taking advantage of k-additivity, we are able to elicit the exact Shapley values of the surrogate game and then use these values as estimates for the original fair division problem. The efficacy of our method is evaluated empirically and compared to competing methods.
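
The computational shortcut exploited here is a standard fact: in the Möbius (interaction) representation underlying k-additivity, the Shapley value of player i is the sum over coalitions S containing i of m(S)/|S|. Since a k-additive game has non-zero Möbius coefficients only on coalitions of size at most k, this sum is short and exact. A sketch (data structures illustrative, not SVAkADD itself):

```python
def shapley_from_moebius(moebius):
    """Exact Shapley values of a game given by its Moebius transform,
    a dict mapping coalitions (frozensets) to coefficients:
    phi_i = sum over S containing i of m(S) / |S|.
    For a k-additive game the dict only holds coalitions of size <= k."""
    players = set().union(*moebius)
    return {i: sum(m / len(S) for S, m in moebius.items() if i in S)
            for i in players}
```

Fitting a k-additive surrogate and then applying this closed form is what lets the method trade the exponential exact computation for a small regression problem.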

MCML Authors

Patrick Kolpaczki

Artificial Intelligence and Machine Learning

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1532]
Z. Peng, X. Yin, R. Qian, P. Lin, Y. Liu, C. Ying and Y. Luo.
SolEval: Benchmarking Large Language Models for Repository-level Solidity Code Generation.
Preprint (Feb. 2025). arXiv GitHub
Abstract

Large language models (LLMs) have transformed code generation. However, most existing approaches focus on mainstream languages such as Python and Java, neglecting the Solidity language, the predominant programming language for Ethereum smart contracts. Due to the lack of adequate benchmarks for Solidity, LLMs’ ability to generate secure, cost-effective smart contracts remains unexplored. To fill this gap, we construct SolEval, the first repository-level benchmark designed for Solidity smart contract generation, to evaluate the performance of LLMs on Solidity. SolEval consists of 1,125 samples from 9 different repositories, covering 6 popular domains, providing LLMs with a comprehensive evaluation benchmark. Unlike the existing Solidity benchmark, SolEval not only includes complex function calls but also reflects the real-world complexity of the Ethereum ecosystem by incorporating gas fee and vulnerability rate. We evaluate 10 LLMs on SolEval, and our results show that the best-performing LLM achieves only 26.29% Pass@10, highlighting substantial room for improvement in Solidity code generation by LLMs.
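
The Pass@10 metric reported above is conventionally computed with the unbiased pass@k estimator of Chen et al. (2021); the abstract does not state which estimator SolEval uses, so the following is the standard formula rather than necessarily this benchmark's implementation:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn (without replacement) from n generations passes,
    given that c of the n generations pass: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: always a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 generations per task of which 2 pass, `pass_at_k(10, 2, 10)` is 1.0, while `pass_at_k(10, 2, 1)` reduces to the plain pass rate of 0.2.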

MCML Authors

[1531]
D. Racek, Q. Zhang, P. Thurner, X. Zhu and G. Kauermann.
Unsupervised Detection of Building Destruction during War from Publicly Available Radar Satellite Imagery.
Preprint (Feb. 2025). DOI
Abstract

The timely automated detection of building destruction in conflict zones is crucial for human rights monitoring, humanitarian response, and academic research. However, current approaches rely on expensive proprietary satellite imagery, limiting their scalability and accessibility. This study addresses these challenges by introducing an automated and unsupervised method that uses freely available Sentinel-1 synthetic aperture radar (SAR) imagery from the European Space Agency (ESA). By statistically assessing interferometric coherence changes over time, our approach enables the timely detection of building destruction at scale without requiring labeled training data, which are often not available in conflict-affected regions. We validate our method across three case studies, Beirut, Mariupol, and Gaza, demonstrating its ability to capture diverse patterns of destruction and their spatio-temporal dynamics, despite the moderate resolution of Sentinel-1 imagery. Our approach offers a scalable, global, and cost-effective solution for detecting building destruction in conflict zones.
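
Interferometric coherence, the quantity whose temporal changes the method assesses, is a standard measure: the normalised complex cross-correlation of two co-registered SAR acquisitions. A pure-Python sketch of the per-patch computation (windowing and geocoding omitted):

```python
def coherence(s1, s2):
    """Sample interferometric coherence of two co-registered complex
    SAR patches (sequences of complex pixel values):
    |sum(s1 * conj(s2))| / sqrt(sum|s1|^2 * sum|s2|^2).
    Values near 1 indicate stable scatterers; a sharp drop between
    acquisitions suggests surface change such as building destruction."""
    num = sum(a * b.conjugate() for a, b in zip(s1, s2))
    den = (sum(abs(a) ** 2 for a in s1) * sum(abs(b) ** 2 for b in s2)) ** 0.5
    return abs(num) / den
```

Comparing the statistics of such coherence values before and after an event is what allows destruction mapping without labeled training data.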

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1530]
K. Reichard, G. Rizzoli, S. Gasperini, L. Hoyer, P. Zanuttigh, N. Navab and F. Tombari.
From Open-Vocabulary to Vocabulary-Free Semantic Segmentation.
Preprint (Feb. 2025). arXiv
Abstract

Open-vocabulary semantic segmentation enables models to identify novel object categories beyond their training data. While this flexibility represents a significant advancement, current approaches still rely on manually specified class names as input, creating an inherent bottleneck in real-world applications. This work proposes a Vocabulary-Free Semantic Segmentation pipeline, eliminating the need for predefined class vocabularies. Specifically, we address the chicken-and-egg problem where users need knowledge of all potential objects within a scene to identify them, yet the purpose of segmentation is often to discover these objects. The proposed approach leverages Vision-Language Models to automatically recognize objects and generate appropriate class names, aiming to solve the challenge of class specification and naming quality. Through extensive experiments on several public datasets, we highlight the crucial role of the text encoder in model performance, particularly when the image text classes are paired with generated descriptions. Despite the challenges introduced by the sensitivity of the segmentation text encoder to false negatives within the class tagging process, which adds complexity to the task, we demonstrate that our fully automated pipeline significantly enhances vocabulary-free segmentation accuracy across diverse real-world scenarios.

MCML Authors

Stefano Gasperini

Dr.

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Federico Tombari

PD Dr.

Computer Aided Medical Procedures & Augmented Reality


[1529]
J. Rodemann, E. Garces Arias, C. Luther, C. Jansen and T. Augustin.
A Statistical Case Against Empirical Human-AI Alignment.
Preprint (Feb. 2025). arXiv
Abstract

Empirical human-AI alignment aims to make AI systems act in line with observed human behavior. While noble in its goals, we argue that empirical alignment can inadvertently introduce statistical biases that warrant caution. This position paper thus advocates against naive empirical alignment, offering prescriptive alignment and a posteriori empirical alignment as alternatives. We substantiate our principled argument by tangible examples like human-centric decoding of language models.

MCML Authors

Esteban Garces Arias

Statistical Learning and Data Science


[1528]
Y. Shen, W. Lai, S. Wang, X. Zhang, K. Luo, A. Fraser and M. Sun.
DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection.
Preprint (Feb. 2025). arXiv
Abstract

The rapid development of multilingual large language models (LLMs) highlights the need for high-quality, diverse, and clean multilingual datasets. In this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a large-scale multilingual corpus built using newly extracted Common Crawl data and existing multilingual datasets. DCAD-2000 includes over 2,282 languages, 46.72TB of data, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. To overcome the limitations of current data cleaning methods, which rely on manual heuristic thresholds, we propose reframing data cleaning as an anomaly detection task. This dynamic filtering approach significantly enhances data quality by identifying and removing noisy or anomalous content. We evaluate the quality of DCAD-2000 on the FineTask benchmark, demonstrating substantial improvements in multilingual dataset quality and task performance.
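The idea of reframing data cleaning as anomaly detection can be sketched with a toy filter. This is an illustration only: the single quality feature (fraction of alphabetic characters) and the MAD-based cutoff are assumptions for demonstration, not DCAD-2000's actual feature set or thresholds.

```python
import statistics

def quality_feature(doc: str) -> float:
    # One illustrative quality signal: fraction of alphabetic/whitespace chars.
    return sum(c.isalpha() or c.isspace() for c in doc) / max(len(doc), 1)

def clean_corpus(docs, z_cutoff=2.5):
    # Flag outliers with a robust z-score (median + MAD) instead of a
    # hand-tuned fixed threshold -- the "cleaning as anomaly detection" idea.
    scores = [quality_feature(d) for d in docs]
    med = statistics.median(scores)
    mad = statistics.median(abs(s - med) for s in scores) or 1e-9
    return [d for d, s in zip(docs, scores) if abs(s - med) / mad <= z_cutoff]
```

In this toy setting, documents whose quality score deviates sharply from the corpus median are dropped without any per-language manual threshold.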

MCML Authors
Alexander Fraser, Prof. Dr. (Data Analytics & Statistics)


[1527]
N. Sturma, M. Kranzlmueller, I. Portakal and M. Drton.
Matching Criterion for Identifiability in Sparse Factor Analysis.
Preprint (Feb. 2025). arXiv
Abstract

Factor analysis models explain dependence among observed variables by a smaller number of unobserved factors. A main challenge in confirmatory factor analysis is determining whether the factor loading matrix is identifiable from the observed covariance matrix. The factor loading matrix captures the linear effects of the factors and, if unrestricted, can only be identified up to an orthogonal transformation of the factors. However, in many applications the factor loadings exhibit an interesting sparsity pattern that may lead to identifiability up to column signs. We study this phenomenon by connecting sparse factor models to bipartite graphs and providing sufficient graphical conditions for identifiability of the factor loading matrix up to column signs. In contrast to previous work, our main contribution, the matching criterion, exploits sparsity by operating locally on the graph structure, thereby improving existing conditions. Our criterion is efficiently decidable in time that is polynomial in the size of the graph, when restricting the search steps to sets of bounded size.
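The abstract does not spell out the matching criterion itself, but its graphical flavour can be hinted at with a standard subroutine such a condition can build on: maximum matching in a bipartite graph of factors and observed variables, computed via augmenting paths. This is an illustrative building block, not the paper's criterion.

```python
def max_bipartite_matching(adj):
    """Maximum matching in a bipartite graph via augmenting paths.

    adj[u] lists right-side vertices reachable from left vertex u,
    e.g. latent factors on the left, observed variables on the right.
    """
    match_right = {}  # right vertex -> matched left vertex

    def try_assign(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            # v is free, or its current partner can be re-matched elsewhere.
            if v not in match_right or try_assign(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    return sum(try_assign(u, set()) for u in adj)
```

Each augmenting-path search is linear in the number of edges, which is consistent with the abstract's point that such graphical conditions can be decided in polynomial time.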

MCML Authors
Mathias Drton, Prof. Dr. (Mathematical Statistics)


[1526]
M. Surner, A. Khelil and L. Bothmann.
Invariance Pair-Guided Learning: Enhancing Robustness in Neural Networks.
Preprint (Feb. 2025). arXiv
Abstract

Out-of-distribution generalization of machine learning models remains challenging since the models are inherently bound to the training data distribution. This especially manifests when the learned models rely on spurious correlations. Most existing approaches apply data manipulation, representation learning, or learning strategies to achieve generalizable models. Unfortunately, these approaches usually require multiple training domains, group labels, specialized augmentation, or pre-processing. We propose a novel approach that addresses these limitations by guiding the neural network through the training phase. We first establish input pairs that represent the spurious attribute and describe the invariance, i.e., a characteristic that should not affect the outcome of the model. Based on these pairs, we form a corrective gradient that complements the traditional gradient descent approach. We further make this correction mechanism adaptive based on a predefined invariance condition. Experiments on the ColoredMNIST, Waterbird-100, and CelebA datasets demonstrate the effectiveness of our approach and its robustness to group shifts.
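To make the corrective-gradient idea concrete, here is a deliberately minimal sketch on a linear model: the pair (x, x′) differs only in a spurious feature, and the corrective gradient penalizes output differences across the pair. The loss 0.5·(f(x) − f(x′))² and the plain gradient step are illustrative assumptions; the paper's adaptive correction mechanism is not reproduced.

```python
def predict(w, x):
    # Toy linear model f(x) = w . x
    return sum(wi * xi for wi, xi in zip(w, x))

def corrective_gradient(w, x, x_pair):
    # Gradient of 0.5 * (f(x) - f(x'))^2 w.r.t. w: pushes the model to give
    # both members of the invariance pair the same output.
    diff = predict(w, x) - predict(w, x_pair)
    return [diff * (xi - xpi) for xi, xpi in zip(x, x_pair)]

def train_step(w, x, x_pair, lr=0.1):
    g = corrective_gradient(w, x, x_pair)
    return [wi - lr * gi for wi, gi in zip(w, g)]
```

Repeated steps drive the weight on the spurious feature (the only coordinate in which the pair differs) toward zero, while leaving the other weights untouched.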

MCML Authors
Ludwig Bothmann, Dr. (Statistical Learning and Data Science)


[1525]
Ö. Turgut, F. S. Bott, M. Ploner and D. Rückert.
Are foundation models useful feature extractors for electroencephalography analysis?
Preprint (Feb. 2025). arXiv
Abstract

The success of foundation models in natural language processing and computer vision has motivated similar approaches for general time series analysis. While these models are effective for a variety of tasks, their applicability in medical domains with limited data remains largely unexplored. To address this, we investigate the effectiveness of foundation models in medical time series analysis involving electroencephalography (EEG). Through extensive experiments on tasks such as age prediction, seizure detection, and the classification of clinically relevant EEG events, we compare their diagnostic accuracy with that of specialised EEG models. Our analysis shows that foundation models extract meaningful EEG features, outperform specialised models even without domain adaptation, and localise task-specific biomarkers. Moreover, we demonstrate that diagnostic accuracy is substantially influenced by architectural choices such as context length. Overall, our study reveals that foundation models with general time series understanding eliminate the dependency on large domain-specific datasets, making them valuable tools for clinical practice.

MCML Authors
Daniel Rückert, Prof. Dr. (Artificial Intelligence in Healthcare and Medicine)


[1524]
M. Wang, A. Stoll, L. Lange, H. Adel, H. Schütze and J. Strötgen.
Bring Your Own Knowledge: A Survey of Methods for LLM Knowledge Expansion.
Preprint (Feb. 2025). arXiv
Abstract

Adapting large language models (LLMs) to new and diverse knowledge is essential for their lasting effectiveness in real-world applications. This survey provides an overview of state-of-the-art methods for expanding the knowledge of LLMs, focusing on integrating various knowledge types, including factual information, domain expertise, language proficiency, and user preferences. We explore techniques, such as continual learning, model editing, and retrieval-based explicit adaptation, while discussing challenges like knowledge consistency and scalability. Designed as a guide for researchers and practitioners, this survey sheds light on opportunities for advancing LLMs as adaptable and robust knowledge systems.

MCML Authors
Mingyang Wang (Computational Linguistics)
Hinrich Schütze, Prof. Dr. (Computational Linguistics)


[1523]
C. Wu, B. Ma, N. Deng, Y. He and Y. Xue.
Multi-Scale and Multi-Objective Optimization for Cross-Lingual Aspect-Based Sentiment Analysis.
Preprint (Feb. 2025). arXiv
Abstract

Aspect-based sentiment analysis (ABSA) is a sequence labeling task that has garnered growing research interest in multilingual contexts. However, recent studies lack robust feature alignment and fine-grained aspect-level alignment. In this paper, we propose a novel framework, Multi-Scale and Multi-Objective optimization (MSMO), for cross-lingual ABSA. During multi-scale alignment, we achieve cross-lingual sentence-level and aspect-level alignment, aligning features of aspect terms in different contextual environments. Specifically, we introduce code-switched bilingual sentences into the language discriminator and consistency training modules to enhance the model’s robustness. During multi-objective optimization, we design two optimization objectives, supervised training and consistency training, aiming to enhance cross-lingual semantic alignment. To further improve model performance, we incorporate distilled knowledge of the target language into the model. Results show that MSMO significantly enhances cross-lingual ABSA, achieving state-of-the-art performance across multiple languages and models.

MCML Authors

[1522]
C. Wu, B. Ma, Y. Liu, Z. Zhang, N. Deng, Y. Li, B. Chen, Y. Zhang, Y. Xue and B. Plank.
M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis.
Preprint (Feb. 2025). arXiv
Abstract

Aspect-based sentiment analysis (ABSA) is a crucial task in information extraction and sentiment analysis, aiming to identify aspects with associated sentiment elements in text. However, existing ABSA datasets are predominantly English-centric, limiting the scope for multilingual evaluation and research. To bridge this gap, we present M-ABSA, a comprehensive dataset spanning 7 domains and 21 languages, making it the most extensive multilingual parallel dataset for ABSA to date. Our primary focus is on triplet extraction, which involves identifying aspect terms, aspect categories, and sentiment polarities. The dataset is constructed through an automatic translation process with human review to ensure quality. We perform extensive experiments using various baselines to assess performance and compatibility on M-ABSA. Our empirical findings highlight that the dataset enables diverse evaluation tasks, such as multilingual and multi-domain transfer learning, and large language model evaluation, underscoring its inclusivity and its potential to drive advancements in multilingual ABSA research.

MCML Authors
Barbara Plank, Prof. Dr. (AI and Computational Linguistics)


[1521]
S. Wu, S. Alaniz, E. Schulz and Z. Akata.
Discovering Chunks in Neural Embeddings for Interpretability.
Preprint (Feb. 2025). arXiv
Abstract

Understanding neural networks is challenging due to their high-dimensional, interacting components. Inspired by human cognition, which processes complex sensory data by chunking it into recurring entities, we propose leveraging this principle to interpret artificial neural population activities. Biological and artificial intelligence share the challenge of learning from structured, naturalistic data, and we hypothesize that the cognitive mechanism of chunking can provide insights into artificial systems. We first demonstrate this concept in recurrent neural networks (RNNs) trained on artificial sequences with imposed regularities, observing that their hidden states reflect these patterns, which can be extracted as a dictionary of chunks that influence network responses. Extending this to large language models (LLMs) like LLaMA, we identify similar recurring embedding states corresponding to concepts in the input, with perturbations to these states activating or inhibiting the associated concepts. By exploring methods to extract dictionaries of identifiable chunks across neural embeddings of varying complexity, our findings introduce a new framework for interpreting neural networks, framing their population activity as structured reflections of the data they process.

MCML Authors
Stephan Alaniz, Dr. (Interpretable and Reliable Machine Learning)
Zeynep Akata, Prof. Dr. (Interpretable and Reliable Machine Learning)


[1520]
S. Xu, T. Y. S. S. Santosh, Y. Elazar, Q. Vogel, B. Plank and M. Grabmair.
Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases.
Preprint (Feb. 2025). arXiv
Abstract

The increased adoption of Large Language Models (LLMs) and their potential to shape public opinion have sparked interest in assessing these models’ political leanings. Building on previous research that compared LLMs and human opinions and observed political bias in system responses, we take a step further to investigate the underlying causes of such biases by empirically examining how the values and biases embedded in training corpora shape model outputs. Specifically, we propose a method to quantitatively evaluate the political leanings embedded in large pretraining corpora. Subsequently, we investigate whether the LLMs’ political leanings are more aligned with their pretraining corpora or with surveyed human opinions. As a case study, we focus on probing the political leanings of LLMs in 32 U.S. Supreme Court cases, addressing contentious topics such as abortion and voting rights. Our findings reveal that LLMs strongly reflect the political leanings of their training data, and no strong correlation is observed with their alignment to human opinions as expressed in surveys. These results underscore the importance of responsible curation of training data and the need for robust evaluation metrics to ensure LLMs’ alignment with human-centered values.

MCML Authors
Barbara Plank, Prof. Dr. (AI and Computational Linguistics)


[1519]
X. Xue and X. Zhu.
Regression in EO: Are VLMs Up to the Challenge?
Preprint (Feb. 2025). arXiv
Abstract

Earth Observation (EO) data encompass a vast range of remotely sensed information, featuring multi-sensor and multi-temporal observations, and play an indispensable role in understanding our planet’s dynamics. Recently, Vision Language Models (VLMs) have achieved remarkable success in perception and reasoning tasks, bringing new insights and opportunities to the EO field. However, their potential for EO applications, especially for scientific regression-related applications, remains largely unexplored. This paper bridges that gap by systematically examining the challenges and opportunities of adapting VLMs for EO regression tasks. The discussion first contrasts the distinctive properties of EO data with conventional computer vision datasets, then identifies four core obstacles in applying VLMs to EO regression: 1) the absence of dedicated benchmarks, 2) the mismatch between discrete and continuous representations, 3) the accumulation of errors over successive predictions, and 4) the suboptimal nature of text-centric training objectives for numerical tasks. Next, a series of methodological insights and potential subtle pitfalls are explored. Lastly, we offer promising future directions for designing robust, domain-aware solutions. Our findings highlight the promise of VLMs for scientific regression in EO, setting the stage for more precise and interpretable modeling of critical environmental processes.

MCML Authors
Xiaoxiang Zhu, Prof. Dr. (Data Science in Earth Observation)


[1518]
J. Yu, Y. Zhang, B. Wang, P. Lin, Y. Liu and S. Feng.
SSMLoRA: Enhancing Low-Rank Adaptation with State Space Model.
Preprint (Feb. 2025). arXiv GitHub
Abstract

Fine-tuning is a key approach for adapting language models to specific downstream tasks, but updating all model parameters becomes impractical as model sizes increase. Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), address this challenge by introducing additional adaptation parameters into pre-trained weight matrices. However, LoRA’s performance varies across different insertion points within the model, highlighting potential parameter inefficiency due to unnecessary insertions. To this end, we propose SSMLoRA (State Space Model Low-Rank Adaptation), an extension of LoRA that incorporates a State Space Model (SSM) to interconnect low-rank matrices. SSMLoRA ensures that performance is maintained even with sparser insertions. SSMLoRA allows the model to not only map inputs to a low-rank space for better feature extraction but also leverage the computations from the previous low-rank space. Our method achieves comparable performance to LoRA on the General Language Understanding Evaluation (GLUE) benchmark while using only half the parameters. Additionally, due to its structure, SSMLoRA shows promise in handling tasks with longer input sequences.
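For readers unfamiliar with the LoRA mechanism that SSMLoRA extends, a bare-bones sketch of the low-rank update W_eff = W + (α/r)·B·A follows. The State Space Model that interconnects the low-rank matrices in SSMLoRA is not shown here, and the matrix shapes are toy assumptions.

```python
def matmul(A, B):
    # Plain list-of-lists matrix multiplication.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_effective_weight(W, A, B, alpha):
    # W: d_out x d_in frozen weight; B: d_out x r and A: r x d_in are the
    # small trainable adapter matrices; alpha/r scales the low-rank update.
    r = len(A)
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]
```

With d_out × d_in frozen parameters, the adapter trains only r·(d_out + d_in) parameters, which is the saving that PEFT methods such as LoRA (and SSMLoRA's sparser insertions) exploit.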

MCML Authors

[1517]
A. Zavras, D. Michail, X. Zhu, B. Demir and I. Papoutsis.
GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis.
Preprint (Feb. 2025). arXiv
Abstract

The continuous operation of Earth-orbiting satellites generates vast and ever-growing archives of Remote Sensing (RS) images. Natural language presents an intuitive interface for accessing, querying, and interpreting the data in such archives. However, existing Vision-Language Models (VLMs) are predominantly trained on web-scraped, noisy image-text data, exhibiting limited exposure to the specialized domain of RS. This deficiency results in poor performance on RS-specific tasks, as commonly used datasets often lack detailed, scientifically accurate textual descriptions and instead emphasize attributes like date and location. To bridge this critical gap, we introduce GAIA, a novel dataset designed for multi-scale, multi-sensor, and multi-modal RS image analysis. GAIA comprises 205,150 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated with different spatial resolutions. Unlike existing vision-language datasets in RS, GAIA specifically focuses on capturing a diverse range of RS applications, providing unique information about environmental changes, natural disasters, and various other dynamic phenomena. The dataset is spatially and temporally balanced, spanning the globe and covering the last 25 years of observations. GAIA’s construction involved a two-stage process: (1) targeted web-scraping of images and accompanying text from reputable RS-related sources, and (2) generation of five high-quality, scientifically grounded synthetic captions for each image using carefully crafted prompts that leverage the advanced vision-language capabilities of GPT-4o. Our extensive experiments, including fine-tuning of CLIP and BLIP2 models, demonstrate that GAIA significantly improves performance on RS image classification, cross-modal retrieval, and image captioning tasks.

MCML Authors
Xiaoxiang Zhu, Prof. Dr. (Data Science in Earth Observation)


[1516]
G. Zhang, M. Ding, T. Liu, Y. Zhang and V. Tresp.
Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs.
Preprint (Feb. 2025). arXiv
Abstract

Multimodal large language models (MLLMs) have demonstrated strong performance in understanding videos holistically, yet their ability to process streaming videos, in which the video is treated as a sequence of visual events, remains underexplored. Intuitively, leveraging past events as memory can enrich contextual and temporal understanding of the current event. In this paper, we show that leveraging memories as contexts helps MLLMs better understand video events. However, because such memories rely on predictions of preceding events, they may contain misinformation, leading to confabulation and degraded performance. To address this, we propose a confabulation-aware memory modification method that mitigates confabulated memory for memory-enhanced event understanding.

MCML Authors
Gengyuan Zhang (Database Systems and Data Mining)
Tong Liu (Database Systems and Data Mining)
Yao Zhang (Database Systems and Data Mining)
Volker Tresp, Prof. Dr. (Database Systems and Data Mining)


[1515]
M. Fornasier, J. Klemenc and A. Scagliotti.
Trade-off Invariance Principle for minimizers of regularized functionals.
Math4AiMl 2025 - 3rd Workshop of UMI Group Mathematics for Artificial Intelligence and Machine Learning. Bari, Italy, Jan 29-31, 2025. To be published. Preprint available. arXiv
Abstract

In this paper, we consider functionals of the form H_α(u) = F(u) + α G(u) with α ∈ [0, +∞), where u varies in a set U ≠ ∅ (without further structure). We first show that, excluding at most countably many values of α, we have inf_{H*_α} G = sup_{H*_α} G, where H*_α := argmin_U H_α, which is assumed to be non-empty. We further prove a stronger result that concerns the invariance of the limiting value of the functional G along minimizing sequences for H_α. This fact in turn implies an unexpected consequence for functionals regularized with uniformly convex norms: excluding again at most countably many values of α, it turns out that for a minimizing sequence, convergence to a minimizer in the weak or strong sense is equivalent.
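Restated in display form, using only the definitions given in the abstract:

```latex
% Regularized functional and its set of minimizers:
H_\alpha(u) = F(u) + \alpha\, G(u), \qquad \alpha \in [0,+\infty), \quad u \in U \neq \emptyset,
\qquad H_\alpha^\star := \operatorname*{arg\,min}_{u \in U} H_\alpha(u) \neq \emptyset.
% Trade-off invariance: for all but at most countably many \alpha,
\inf_{H_\alpha^\star} G \;=\; \sup_{H_\alpha^\star} G.
```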

MCML Authors
Massimo Fornasier, Prof. Dr. (Applied Numerical Analysis)
Jona Klemenc (Applied Numerical Analysis)
Alessandro Scagliotti (Applied Numerical Analysis)


[1514]
S. Gasperini.
Strategies Towards Reliable Scene Understanding for Autonomous Driving.
Dissertation 2025. URL
Abstract

Autonomous driving is poised to revolutionize the transportation sector, with scene understanding as a critical component. This dissertation presents methods for increasing the reliability of state-of-the-art models, by focusing on addressing the issues of unknown scenarios and challenging conditions. Therefore, it marks a decisive step towards the safe deployment of autonomous vehicles in the real world.

MCML Authors
Stefano Gasperini, Dr. (Computer Aided Medical Procedures & Augmented Reality)


[1513]
E. Garces Arias, M. Li, C. Heumann and M. Aßenmacher.
Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL
Abstract

Decoding strategies for large language models (LLMs) are a critical but often underexplored aspect of text generation tasks. Since LLMs produce probability distributions over the entire vocabulary, various decoding methods have been developed to transform these probabilities into coherent and fluent text, each with its own set of hyperparameters. In this study, we present a large-scale, comprehensive analysis of how hyperparameter selection affects text quality in open-ended text generation across multiple LLMs, datasets, and evaluation metrics. Through an extensive sensitivity analysis, we provide practical guidelines for hyperparameter tuning and demonstrate the substantial influence of these choices on text quality. Using three established datasets, spanning factual domains (e.g., news) and creative domains (e.g., fiction), we show that hyperparameter tuning significantly impacts generation quality, though its effects vary across models and tasks. We offer in-depth insights into these effects, supported by both human evaluations and a synthesis of widely-used automatic evaluation metrics.
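Two of the hyperparameters such a study varies, temperature and nucleus (top-p) mass, can be illustrated on a toy next-token distribution. The numbers below are invented for illustration and do not come from the paper's experiments.

```python
import math

def softmax(logits, temperature=1.0):
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_filter(probs, p=0.9):
    # Keep the smallest prefix of tokens (by probability) whose cumulative
    # mass reaches p, then renormalize -- nucleus sampling's candidate set.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}
```

Sweeping these two knobs (together with choices like top-k) is exactly the kind of sensitivity analysis the abstract describes.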

MCML Authors
Esteban Garces Arias (Statistical Learning and Data Science)
Matthias Aßenmacher, Dr. (Statistical Learning and Data Science)


[1512]
R. Litschko, O. Kraus, V. Blaschke and B. Plank.
Cross-Dialect Information Retrieval: Information Access in Low-Resource and High-Variance Languages.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL
Abstract

A large amount of local and culture-specific knowledge (e.g., people, traditions, food) can only be found in documents written in dialects. While extensive research has been conducted on cross-lingual information retrieval (CLIR), the field of cross-dialect retrieval (CDIR) has received limited attention. Dialect retrieval poses unique challenges due to the limited availability of resources for training retrieval models and the high variability of non-standardized languages. We study these challenges using German dialects as an example and introduce the first German dialect retrieval dataset, dubbed WikiDIR, which consists of seven German dialects extracted from Wikipedia. Using WikiDIR, we demonstrate the weakness of lexical methods in dealing with the high lexical variation in dialects. We further show that the commonly used zero-shot cross-lingual transfer approach with multilingual encoders does not transfer well to extremely low-resource setups, motivating the need for resource-lean and dialect-specific retrieval models. We finally demonstrate that (document) translation is an effective way to reduce the dialect gap in CDIR.

MCML Authors
Robert Litschko (AI and Computational Linguistics)
Verena Blaschke (AI and Computational Linguistics)
Barbara Plank, Prof. Dr. (AI and Computational Linguistics)


[1511]
Y. Liu, C. Ma, H. Ye and H. Schütze.
TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL GitHub
Abstract

Transliterating related languages that use different scripts into a common script shows effectiveness in improving crosslingual transfer in downstream tasks. However, this methodology often makes pretraining a model from scratch unavoidable, as transliteration brings about new subwords not covered in existing multilingual pretrained language models (mPLMs). This is not desired because it takes a lot of computation budget for pretraining. A more promising way is to make full use of available mPLMs. To this end, this paper proposes a simple but effective framework: Transliterate-Merge-Initialize (TransMI), which can create a strong baseline well-suited for data that is transliterated into a common script by exploiting an mPLM and its accompanied tokenizer. TransMI has three stages: (a) transliterate the vocabulary of an mPLM into a common script; (b) merge the new vocabulary with the original vocabulary; and (c) initialize the embeddings of the new subwords. We applied TransMI to three recent strong mPLMs, and our experiments demonstrate that TransMI not only preserves their ability to handle non-transliterated data, but also enables the models to effectively process transliterated data: the results show a consistent improvement of 3% to 34%, varying across different models and tasks.
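The three TransMI stages can be mimicked on a toy vocabulary. The copy-the-source-embedding initialization below is a simplifying assumption for illustration, not necessarily the initialization the framework actually uses.

```python
def transmi(vocab_emb, transliterate):
    """Schematic TransMI pipeline on a toy vocabulary.

    vocab_emb: dict mapping subword -> embedding (list of floats).
    transliterate: function mapping a subword into the common script.
    """
    # (a) Transliterate the existing vocabulary into the common script.
    translit = {sub: transliterate(sub) for sub in vocab_emb}
    # (b) Merge: keep the original vocabulary and add the new forms.
    merged = dict(vocab_emb)
    # (c) Initialize each genuinely new subword from its source embedding
    #     (illustrative choice; real initializations may be more elaborate).
    for src, new in translit.items():
        if new not in merged:
            merged[new] = list(vocab_emb[src])
    return merged
```

Because the original entries are kept in step (b), the model retains its ability to handle non-transliterated data, which is the property the abstract emphasizes.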

MCML Authors
Hinrich Schütze, Prof. Dr. (Computational Linguistics)


[1510]
Y. Liu, M. Wang, A. H. Kargaran, A. Imani, O. Xhelili, H. Ye, C. Ma, F. Yvon and H. Schütze.
How Transliterations Improve Crosslingual Alignment.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL
Abstract

Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives on both original and transliterated data can improve crosslingual alignment. This improvement further leads to better crosslingual transfer performance. However, it remains unclear how and why a better crosslingual alignment is achieved, as this technique only involves transliterations, and does not use any parallel data. This paper attempts to explicitly evaluate the crosslingual alignment and identify the key elements in transliteration-based approaches that contribute to better performance. For this, we train multiple models under varying setups for two pairs of related languages: (1) Polish and Ukrainian and (2) Hindi and Urdu. To assess alignment, we define four types of similarities based on sentence representations. Our experiments show that adding transliterations alone improves the overall similarities, even for random sentence pairs. With the help of auxiliary alignment objectives, especially the contrastive objective, the model learns to distinguish matched from random pairs, leading to better alignments. However, we also show that better alignment does not always yield better downstream performance, suggesting that further research is needed to clarify the connection between alignment and performance.

MCML Authors
Mingyang Wang (Computational Linguistics)
Amir Hossein Kargaran (Computational Linguistics)
Ayyoob Imani (Computational Linguistics)
Hinrich Schütze, Prof. Dr. (Computational Linguistics)


[1509]
A. Muñoz-Ortiz, V. Blaschke and B. Plank.
Evaluating Pixel Language Models on Non-Standardized Languages.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL
Abstract

We explore the potential of pixel-based models for transfer learning from standard languages to dialects. These models convert text into images that are divided into patches, enabling a continuous vocabulary representation that proves especially useful for out-of-vocabulary words common in dialectal data. Using German as a case study, we compare the performance of pixel-based models to token-based models across various syntactic and semantic tasks. Our results show that pixel-based models outperform token-based models in part-of-speech tagging, dependency parsing and intent detection for zero-shot dialect evaluation by up to 26 percentage points in some scenarios, though not in Standard German. However, pixel-based models fall short in topic classification. These findings emphasize the potential of pixel-based models for handling dialectal data, though further research should be conducted to assess their effectiveness in various linguistic contexts.

MCML Authors
Verena Blaschke (AI and Computational Linguistics)
Barbara Plank, Prof. Dr. (AI and Computational Linguistics)


[1508]
Y. Zhang, V. Hangya and A. Fraser.
LLM Sensitivity Challenges in Abusive Language Detection: Instruction-Tuned vs. Human Feedback.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL
Abstract

The capacity of large language models (LLMs) to understand and distinguish socially unacceptable texts enables them to play a promising role in abusive language detection. However, various factors can affect their sensitivity. In this work, we test whether LLMs have an unintended bias in abusive language detection, i.e., whether they predict more or less of a given abusive class than expected in zero-shot settings. Our results show that instruction-tuned LLMs tend to under-predict positive classes, since the datasets used for tuning are dominated by the negative class. On the contrary, models fine-tuned with human feedback tend to be overly sensitive. In an exploratory approach to mitigate these issues, we show that including the label frequency in the prompt helps reduce the significant over-prediction.

MCML Authors
Alexander Fraser, Prof. Dr. (Data Analytics & Statistics)


[1507]
V. Blaschke, F. Körner and B. Plank.
Add Noise, Tasks, or Layers? MaiNLP at the VarDial 2025 Shared Task on Norwegian Dialectal Slot and Intent Detection.
VarDial @COLING 2025 - 12th Workshop on NLP for Similar Languages, Varieties and Dialects at the The 31st International Conference on Computational Linguistics (COLING 2025). Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL
Abstract

Slot and intent detection (SID) is a classic natural language understanding task. Despite this, research has only more recently begun focusing on SID for dialectal and colloquial varieties. Many approaches for low-resource scenarios have not yet been applied to dialectal SID data, or compared to each other on the same datasets. We participate in the VarDial 2025 shared task on slot and intent detection in Norwegian varieties, and compare multiple set-ups: varying the training data (English, Norwegian, or dialectal Norwegian), injecting character-level noise, training on auxiliary tasks, and applying Layer Swapping, a technique in which layers of models fine-tuned on different datasets are assembled into a model. We find noise injection to be beneficial while the effects of auxiliary tasks are mixed. Though some experimentation was required to successfully assemble a model from layers, it worked surprisingly well; a combination of models trained on English and small amounts of dialectal data produced the most robust slot predictions. Our best models achieve 97.6% intent accuracy and 85.6% slot F1 in the shared task.
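The Layer Swapping assembly step described above reduces, schematically, to choosing each layer's parameters from one of two fine-tuned models. The sketch treats layers as opaque objects and is only a hypothetical illustration of the mechanism, not the shared-task system.

```python
def layer_swap(model_a, model_b, take_from_a):
    """Assemble a new layer stack from two fine-tuned models.

    model_a / model_b: lists of layer parameter objects of equal depth,
    e.g. one model fine-tuned on English SID data and one on dialectal data.
    take_from_a: per-layer booleans choosing which model supplies each layer.
    """
    assert len(model_a) == len(model_b) == len(take_from_a)
    return [a if use_a else b
            for a, b, use_a in zip(model_a, model_b, take_from_a)]
```

Which layers to take from which model is the part that, per the abstract, required some experimentation to get right.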

MCML Authors
Verena Blaschke (AI and Computational Linguistics)
Felicia Körner (AI and Computational Linguistics)
Barbara Plank, Prof. Dr. (AI and Computational Linguistics)


[1506]
X. Krückl, V. Blaschke and B. Plank.
Improving Dialectal Slot and Intent Detection with Auxiliary Tasks: A Multi-Dialectal Bavarian Case Study.
VarDial @COLING 2025 - 12th Workshop on NLP for Similar Languages, Varieties and Dialects at the 31st International Conference on Computational Linguistics (COLING 2025). Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL
Abstract

Reliable slot and intent detection (SID) is crucial in natural language understanding for applications like digital assistants. Encoder-only transformer models fine-tuned on high-resource languages generally perform well on SID. However, they struggle with dialectal data, where no standardized form exists and training data is scarce and costly to produce. We explore zero-shot transfer learning for SID, focusing on multiple Bavarian dialects, for which we release a new dataset for the Munich dialect. We evaluate models trained on auxiliary tasks in Bavarian, and compare joint multi-task learning with intermediate-task training. We also compare three types of auxiliary tasks: token-level syntactic tasks, named entity recognition (NER), and language modelling. We find that the included auxiliary tasks have a more positive effect on slot filling than intent classification (with NER having the most positive effect), and that intermediate-task training yields more consistent performance gains. Our best-performing approach improves intent classification performance on Bavarian dialects by 5.1 and slot filling F1 by 8.4 percentage points.

MCML Authors
Link to website

Verena Blaschke

AI and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1505]
A.-M. Lutgen, A. Plum, C. Purschke and B. Plank.
Neural Text Normalization for Luxembourgish Using Real-Life Variation Data.
VarDial @COLING 2025 - 12th Workshop on NLP for Similar Languages, Varieties and Dialects at the 31st International Conference on Computational Linguistics (COLING 2025). Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL
Abstract

Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1504]
K. Höhlein.
Data-Driven Modeling and Analysis of Numerical Weather Predictions.
Dissertation 2025. URL
Abstract

Weather prediction systems generate vast numerical simulation datasets that require statistical postprocessing and interactive human exploration. In this thesis, we develop deep-learning-based methods for postprocessing weather predictions and representing the forecasts for subsequent analysis. We use neural networks to enhance the spatial resolution of weather forecasts and postprocess ensemble predictions, and adapt neural networks as compact representations for volumetric ensemble datasets.

MCML Authors

Kevin Höhlein

Dr.

* Former Member


[1503]
A. Sanin, J. K. Flowers, T. H. Piotrowiak, F. Felsen, L. Merker, A. Ludwig, D. Bresser and H. S. Stein.
Integrating Automated Electrochemistry and High-Throughput Characterization with Machine Learning to Explore Si─Ge─Sn Thin-Film Lithium Battery Anodes.
Advanced Energy Materials 15.11 (Jan. 2025). DOI
Abstract

High-performance batteries need accelerated discovery and optimization of new anode materials. Herein, we explore the Si─Ge─Sn ternary alloy system as a candidate fast-charging anode materials system by utilizing a scanning droplet cell (SDC) as an autonomous electrochemical characterization tool with the goal of subsequent upscaling. As the SDC is performing experiments sequentially, an exploration of the entire ternary space is unfeasible due to time constraints. Thus, closed-loop optimization, guided by real-time data analysis and sequential learning algorithms, is utilized to direct experiments. The lead material identified is scaled up to a coin cell to validate the findings from the autonomous millimeter-scale thin-film electrochemical experimentation. Explainable machine learning (ML) models incorporating data from high-throughput Raman spectroscopy and X-ray diffraction (XRD) are used to elucidate the effect of short and long-range ordering on material performance.

MCML Authors
Link to Profile Helge Stein

Helge Stein

Prof. Dr.

Digital Catalysis


[1502]
M. Abrahamowicz, M.-E. Beauchamp, A.-L. Boulesteix, T. P. Morris, W. Sauerbrei and J. S. Kaufman, on behalf of the STRATOS Simulation Panel.
Data-driven simulations to assess the impact of study imperfections in time-to-event analyses.
American Journal of Epidemiology 194.1 (Jan. 2025). DOI
Abstract

Quantitative bias analysis (QBA) permits assessment of the expected impact of various imperfections of the available data on the results and conclusions of a particular real-world study. This article extends QBA methodology to multivariable time-to-event analyses with right-censored endpoints, possibly including time-varying exposures or covariates. The proposed approach employs data-driven simulations, which preserve important features of the data at hand while offering flexibility in controlling the parameters and assumptions that may affect the results. First, the steps required to perform data-driven simulations are described, and then two examples of real-world time-to-event analyses illustrate their implementation and the insights they may offer. The first example focuses on the omission of an important time-invariant predictor of the outcome in a prognostic study of cancer mortality, and permits separating the expected impact of confounding bias from noncollapsibility. The second example assesses how imprecise timing of an interval-censored event—ascertained only at sparse times of clinic visits—affects its estimated association with a time-varying drug exposure. The simulation results also provide a basis for comparing the performance of two alternative strategies for imputing the unknown event times in this setting. The R scripts that permit the reproduction of our examples are provided.

MCML Authors
Link to Profile Anne-Laure Boulesteix

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine


[1501]
F. Bortolussi, H. Sandström, F. Partovi, J. Mikkilä, P. Rinke and M. Rissanen.
Technical note: Towards atmospheric compound identification in chemical ionization mass spectrometry with pesticide standards and machine learning.
Atmospheric Chemistry and Physics 25.1 (Jan. 2025). DOI
Abstract

Chemical ionization mass spectrometry (CIMS) is widely used in atmospheric chemistry studies. However, due to the complex interactions between reagent ions and target compounds, chemical understanding remains limited and compound identification difficult. In this study, we apply machine learning to a reference dataset of pesticides in two standard solutions to build a model that can provide insights from CIMS analyses in atmospheric science. The CIMS measurements were performed with an Orbitrap mass spectrometer coupled to a thermal desorption multi-scheme chemical ionization inlet unit (TD-MION-MS) with both negative and positive ionization modes utilizing Br−, , H3O+ and (CH3)2COH+ (AceH+) as reagent ions. We then trained two machine learning methods on these data: (1) random forest (RF) for classifying if a pesticide can be detected with CIMS and (2) kernel ridge regression (KRR) for predicting the expected CIMS signals. We compared their performance on five different representations of the molecular structure: the topological fingerprint (TopFP), the molecular access system keys (MACCS), a custom descriptor based on standard molecular properties (RDKitPROP), the Coulomb matrix (CM) and the many-body tensor representation (MBTR). The results indicate that MACCS outperforms the other descriptors. Our best classification model reaches a prediction accuracy of 0.85 ± 0.02 and a receiver operating characteristic curve area of 0.91 ± 0.01. Our best regression model reaches an accuracy of 0.44 ± 0.03 logarithmic units of the signal intensity. Subsequent feature importance analysis of the classifiers reveals that the most important sub-structures are NH and OH for the negative ionization schemes and nitrogen-containing groups for the positive ionization schemes.

MCML Authors
Link to Profile Patrick Rinke

Patrick Rinke

Prof. Dr.

AI-based Material Science


[1500]
L. Schneider.
Advancing hyperparameter optimization: foundations, multiple objectives and algorithmic innovations informed through benchmarking.
Dissertation 2025. DOI
Abstract

Hyperparameter optimization (HPO) is a fundamental aspect of machine learning (ML), directly influencing model performance and adaptability. As a computationally expensive black-box optimization problem, HPO requires efficient algorithms to identify optimal hyperparameter configurations. This thesis advances the field of HPO along three key dimensions: foundational insights, HPO in the presence of more than one objective, and algorithmic innovations through benchmarking. (Shortened.)

MCML Authors
Link to website

Lennart Schneider

Statistical Learning and Data Science


[1499]
S. Grosu, M. P. Fabritius, M. Winkelmann, D. Puhr-Westerheide, M. Ingenerf, S. Maurus, A. Graser, C. Schulz, T. Knösel, C. C. Cyran, J. Ricke, P. M. Kazmierczak, M. Ingrisch and P. Wesp.
Effect of artificial intelligence-aided differentiation of adenomatous and non-adenomatous colorectal polyps at CT colonography on radiologists’ therapy management.
European Radiology Early Access (Jan. 2025). DOI
Abstract

Objectives: Adenomatous colorectal polyps require endoscopic resection, as opposed to non-adenomatous hyperplastic colorectal polyps. This study aims to evaluate the effect of artificial intelligence (AI)-assisted differentiation of adenomatous and non-adenomatous colorectal polyps at CT colonography on radiologists’ therapy management.
Materials and methods: Five board-certified radiologists evaluated CT colonography images with colorectal polyps of all sizes and morphologies retrospectively and decided whether the depicted polyps required endoscopic resection. After a primary unassisted reading based on current guidelines, a second reading with access to the classification of a radiomics-based random-forest AI-model labelling each polyp as ‘non-adenomatous’ or ‘adenomatous’ was performed. Performance was evaluated using polyp histopathology as the reference standard.
Results: 77 polyps in 59 patients comprising 118 polyp image series (47% supine position, 53% prone position) were evaluated unassisted and AI-assisted by five independent board-certified radiologists, resulting in a total of 1180 readings (subsequent polypectomy: yes or no). AI-assisted readings had higher accuracy (76% ± 1% vs. 84% ± 1%), sensitivity (78% ± 6% vs. 85% ± 1%), and specificity (73% ± 8% vs. 82% ± 2%) in selecting polyps eligible for polypectomy (p < 0.001). Inter-reader agreement was improved in the AI-assisted readings (Fleiss’ kappa 0.69 vs. 0.92).
Conclusion: AI-based characterisation of colorectal polyps at CT colonography as a second reader might enable a more precise selection of polyps eligible for subsequent endoscopic resection. However, further studies are needed to confirm this finding and histopathologic polyp evaluation is still mandatory.

MCML Authors
Link to Profile Michael Ingrisch

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology

Link to website

Philipp Wesp

Dr.

Clinical Data Science in Radiology


[1498]
L. Bothmann and K. Peters.
Fairness von KI – ein Brückenschlag zwischen Philosophie und Maschinellem Lernen [Fairness of AI – a Bridge between Philosophy and Machine Learning].
Grenzen Künstlicher Intelligenz (Jan. 2025). DOI
MCML Authors
Link to website

Ludwig Bothmann

Dr.

Statistical Learning and Data Science


[1497]
M. Milling, S. D. Rampp, A. Triantafyllopoulos, M. P. Plaza, J. O. Brunner, C. Traidl-Hoffmann, B. W. Schuller and A. Damialis.
Automating airborne pollen classification: Identifying and interpreting hard samples for classifiers.
Heliyon 11.2 (Jan. 2025). DOI GitHub
Abstract

Deep-learning-based classification of pollen grains has been a major driver towards automatic monitoring of airborne pollen. Yet, despite an abundance of available datasets, little effort has been spent to investigate which aspects pose the biggest challenges to the (often black-box-resembling) pollen classification approaches. To shed some light on this issue, we conducted a sample-level difficulty analysis based on the likelihood for one of the largest automatically-generated datasets of pollen grains on microscopy images and investigated the reason for which certain airborne samples and specific pollen taxa pose particular problems to deep learning algorithms. It is here concluded that the main challenges lie in A) the (partly) co-occurring of multiple pollen grains in a single image, B) the occlusion of specific markers through the 2D capturing of microscopy images, and C) for some taxa, a general lack of salient, unique features.

MCML Authors
Link to website

Andreas Triantafyllopoulos

Health Informatics

Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[1496]
F. Tian, H. Zhang, Y. Tan, L. Zhu, L. Shen, K. Qian, B. Hu, B. W. Schuller and Y. Yamamoto.
An On-Board Executable Multi-Feature Transfer-Enhanced Fusion Model for Three-Lead EEG Sensor-Assisted Depression Diagnosis.
IEEE Journal of Biomedical and Health Informatics 29.1 (Jan. 2025). DOI
Abstract

The development of affective computing and medical electronic technologies has led to the emergence of Artificial Intelligence (AI)-based methods for the early detection of depression. However, previous studies have often overlooked the necessity for the AI-assisted diagnosis system to be wearable and accessible in practical scenarios for depression recognition. In this work, we present an on-board executable multi-feature transfer-enhanced fusion model for our custom-designed wearable three-lead Electroencephalogram (EEG) sensor, based on EEG data collected from 73 depressed patients and 108 healthy controls. Experimental results show that the proposed model exhibits low-computational complexity (65.0 K parameters), promising Floating-Point Operations (FLOPs) performance (25.6 M), real-time processing (1.5 s/execution), and low power consumption (320.8 mW). Furthermore, it requires only 202.0 KB of Random Access Memory (RAM) and 279.6 KB of Read-Only Memory (ROM) when deployed on the EEG sensor. Despite its low computational and spatial complexity, the model achieves a notable classification accuracy of 95.2%, specificity of 94.0%, and sensitivity of 96.9% under independent test conditions. These results underscore the potential of deploying the model on the wearable three-lead EEG sensor for assisting in the diagnosis of depression.

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[1495]
J. Beck, L. M. Kemeter, K. Dürrbeck, M. H. I. Abdalla and F. Kreuter.
Toward Integrating ChatGPT Into Satellite Image Annotation Workflows: A Comparison of Label Quality and Costs of Human and Automated Annotators.
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 18 (Jan. 2025). DOI
Abstract

High-quality annotations are a critical success factor for machine learning (ML) applications. To achieve this, we have traditionally relied on human annotators, navigating the challenges of limited budgets and the varying task-specific expertise, costs, and availability. Since the emergence of Large Language Models (LLMs), their popularity for generating automated annotations has grown, extending possibilities and complexity of designing an efficient annotation strategy. Increasingly, computer vision capabilities have been integrated into general-purpose LLMs like ChatGPT. This raises the question of how effectively LLMs can be used in satellite image annotation tasks and how they compare to traditional annotator types. This study presents a comprehensive investigation and comparison of various human and automated annotators for image classification. We evaluate the feasibility and economic competitiveness of using the ChatGPT4-V model for a complex land usage annotation task and compare it with alternative human annotators. A set of satellite images is annotated by a domain expert and 15 additional human and automated annotators, differing in expertise and costs. Our analyses examine the annotation quality loss between the expert and other annotators. This comparison is conducted through (1) descriptive analyses, (2) fitting linear probability models, and (3) comparing F1-scores. Ultimately, we simulate annotation strategies where samples are split according to an automatically assigned certainty score. Routing low-certainty images to human annotators can cut total annotation costs by over 50% with minimal impact on label quality. We discuss implications regarding the economic competitiveness of annotation strategies, prompt engineering and the task-specificity of expertise.
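The certainty-based routing strategy described in the last step can be sketched as a toy simulation; all costs, error rates, and the threshold below are hypothetical placeholders, not figures from the study:

```python
import random

random.seed(0)

# Hypothetical per-image costs and error rates (illustrative, not the paper's data).
COST_LLM, COST_HUMAN = 0.01, 0.50
ERR_LLM, ERR_HUMAN = 0.08, 0.05
THRESHOLD = 0.5  # images whose certainty falls below this go to a human

def simulate(n_images=10_000):
    cost, errors = 0.0, 0
    for _ in range(n_images):
        certainty = random.random()        # stand-in for the assigned certainty score
        if certainty >= THRESHOLD:         # keep the automated label
            cost += COST_LLM
            errors += random.random() < ERR_LLM
        else:                              # route to a human annotator
            cost += COST_LLM + COST_HUMAN  # the LLM pass is already paid for
            errors += random.random() < ERR_HUMAN
    return cost, errors / n_images

cost, err = simulate()
all_human_cost = 10_000 * COST_HUMAN
print(f"routed: {cost:.0f} vs all-human: {all_human_cost:.0f}, error rate {err:.1%}")
```

With these placeholder numbers, roughly half the images stay automated, so total cost falls well below the all-human baseline while the overall error rate stays near the human level.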

MCML Authors
Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI


[1494]
A. Akman, Q. Sun and B. W. Schuller.
Improving Audio Explanations using Audio Language Models.
IEEE Signal Processing Letters Early Access (Jan. 2025). DOI
Abstract

Foundation models are widely utilised for their strong representational capabilities, driven by training on extensive datasets with self-supervised learning. The increasing complexity of these models highlights the importance of interpretability to enhance transparency and improve human understanding of their decision-making processes. Most existing interpretability methods explain model behaviour by attributing importance to individual data elements across different layers, based on their influence on the final prediction. These approaches often emphasise only the most relevant features by removing less important ones, overlooking the broader representational space. In this study, we propose a novel framework for explanation generation that serves as an alternative to feature removal, offering a more comprehensive understanding of model behaviour. Our framework leverages the generative abilities of audio language models to replace removed features with contextually appropriate alternatives, providing a more complete view of the model’s decision-making process. Through extensive evaluations on standard benchmarks, including keyword spotting and speech emotion recognition, our approach demonstrates its effectiveness in generating high-quality audio explanations.

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[1493]
Y. Sun, Y. Zhou, X. Xu, J. Qi, F. Xu, Z. Ren and B. W. Schuller.
Weakly-Supervised Depression Detection in Speech Through Self-Learning Based Label Correction.
IEEE Transactions on Audio, Speech and Language Processing Early Access (Jan. 2025). DOI
Abstract

Automated Depression Detection (ADD) in speech aims to automatically estimate one’s depressive attributes through artificial intelligence tools towards spoken signals. Nevertheless, existing speech-based ADD works fail to sufficiently consider weakly-supervised cases with inaccurate labels, which may typically appear in intelligent mental health. In this regard, we propose the Self-Learning-based Label Correction (SLLC) approach for weakly-supervised depression detection in speech. The proposed approach employs a self-learning manner connecting a label correction module and a depression detection module. Within the approach, the label correction module fuses likelihood-ratio-based and prototype-based label correction strategies in order to effectively correct the inaccurate labels, while the depression detection module aims at detecting depressed samples through a 1D convolutional recurrent neural network with multiple types of losses. The experimental results on two depression detection corpora show that our proposed SLLC approach performs better compared with existing state-of-the-art speech-based depression detection approaches, in the case of weak supervision with inaccurate labels for depression detection in speech.

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[1492]
W. Huang, Z. Gu, Y. Shi, Z. Xiong and X. Zhu.
Semi-Supervised Building Footprint Extraction Using Debiased Pseudo-Labels.
IEEE Transactions on Geoscience and Remote Sensing 63 (Jan. 2025). DOI GitHub
Abstract

Accurate extraction of building footprints from satellite imagery is of high value. Currently, deep learning methods are predominant in this field due to their powerful representation capabilities. However, they generally require extensive pixel-wise annotations, which constrains their practical application. Semi-supervised learning (SSL) significantly mitigates this requirement by leveraging large volumes of unlabeled data for model self-training (ST), thus enhancing the viability of building footprint extraction. Despite its advantages, SSL faces a critical challenge: the imbalanced distribution between the majority background class and the minority building class, which often results in model bias toward the background during training. To address this issue, this article introduces a novel method called DeBiased matching (DBMatch) for semi-supervised building footprint extraction. DBMatch comprises three main components: 1) a basic supervised learning module (SUP) that uses labeled data for initial model training; 2) a classical weak-to-strong ST module that generates pseudo-labels from unlabeled data for further model ST; and 3) a novel logit debiasing (LDB) module that calculates a global logit bias between building and background, allowing for dynamic pseudo-label calibration. To verify the effectiveness of the proposed DBMatch, extensive experiments are performed on three public building footprint extraction datasets covering six global cities in SSL setting. The experimental results demonstrate that our method significantly outperforms some advanced SSL methods in semi-supervised building footprint extraction.
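The global logit-debiasing step (component 3) can be illustrated with a minimal NumPy sketch; the bias estimate and calibration rule below are one simplified reading of the idea, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake per-pixel logits for a batch of unlabeled tiles:
# channel 0 = background, channel 1 = building. The background class
# dominates, so building logits sit systematically lower (simulated bias).
logits = rng.normal(size=(4, 2, 64, 64))
logits[:, 0] += 1.5  # model biased toward the majority background class

# Global logit bias: mean gap between background and building logits.
bias = logits[:, 0].mean() - logits[:, 1].mean()

# Debias by crediting the minority class before taking pseudo-labels.
debiased = logits.copy()
debiased[:, 1] += bias
pseudo_raw = logits.argmax(axis=1)   # biased pseudo-labels
pseudo_db = debiased.argmax(axis=1)  # calibrated pseudo-labels

print("building fraction raw:     ", pseudo_raw.mean())
print("building fraction debiased:", pseudo_db.mean())
```

The calibrated pseudo-labels assign a noticeably larger fraction of pixels to the minority building class, which is the qualitative effect the debiasing module is designed to achieve.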

MCML Authors
Link to website

Ziqi Gu

Data Science in Earth Observation

Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1491]
J. Li, T. Su, B. Zhao, F. Lv, Q. Wang, N. Navab, Y. Hu and Z. Jiang.
Ultrasound Report Generation With Cross-Modality Feature Alignment via Unsupervised Guidance.
IEEE Transactions on Medical Imaging 44.1 (Jan. 2025). DOI
Abstract

Automatic report generation has arisen as a significant research area in computer-aided diagnosis, aiming to alleviate the burden on clinicians by generating reports automatically based on medical images. In this work, we propose a novel framework for automatic ultrasound report generation, leveraging a combination of unsupervised and supervised learning methods to aid the report generation process. Our framework incorporates unsupervised learning methods to extract potential knowledge from ultrasound text reports, serving as the prior information to guide the model in aligning visual and textual features, thereby addressing the challenge of feature discrepancy. Additionally, we design a global semantic comparison mechanism to enhance the performance of generating more comprehensive and accurate medical reports. To enable the implementation of ultrasound report generation, we constructed three large-scale ultrasound image-text datasets from different organs for training and validation purposes. Extensive evaluations with other state-of-the-art approaches exhibit its superior performance across all three datasets.

MCML Authors
Link to website

Jun Li

Computational Imaging and AI in Medicine

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to website

Zhongliang Jiang

Dr.

Computer Aided Medical Procedures & Augmented Reality


[1490]
F. Fan, Y. Shi, T. Guggemos and X. Zhu.
Hybrid Quantum Deep Learning With Superpixel Encoding for Earth Observation Data Classification.
IEEE Transactions on Neural Networks and Learning Systems Early Access (Jan. 2025). DOI URL
Abstract

Earth observation (EO) has inevitably entered the Big Data era. The computational challenge associated with analyzing large EO data using sophisticated deep learning models has become a significant bottleneck. To address this challenge, there has been a growing interest in exploring quantum computing as a potential solution. However, the process of encoding EO data into quantum states for analysis potentially undermines the efficiency advantages gained from quantum computing. This article introduces a hybrid quantum deep learning model that effectively encodes and analyzes EO data for classification tasks. The proposed model uses an efficient encoding approach called superpixel encoding, which reduces the quantum resources required for large image representation by incorporating the concept of superpixels. To validate the effectiveness of our model, we conducted evaluations on multiple EO benchmarks, including Overhead-MNIST, So2Sat LCZ42, and SAT-6 datasets. In addition, we studied the impacts of different interaction gates and measurements on classification performance to guide model optimization. The experimental results suggest the validity of our model for accurate classification of EO data.

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1489]
W. Mayr, A. Triantafyllopoulos, A. Batliner, B. W. Schuller and T. M. Berghaus.
Assessing the Clinical and Functional Status of COPD Patients Using Speech Analysis During and After Exacerbation.
International Journal of Chronic Obstructive Pulmonary Disease 20 (Jan. 2025). DOI
Abstract

Background: Chronic obstructive pulmonary disease (COPD) affects breathing, speech production, and coughing. We evaluated a machine learning analysis of speech for classifying the disease severity of COPD.
Methods: In this single centre study, non-consecutive COPD patients were prospectively recruited for comparing their speech characteristics during and after an acute COPD exacerbation. We extracted a set of spectral, prosodic, and temporal variability features, which were used as input to a support vector machine (SVM). Our baseline for predicting patient state was an SVM model using self-reported BORG and COPD Assessment Test (CAT) scores.
Results: In 50 COPD patients (52% males, 22% GOLD II, 44% GOLD III, 32% GOLD IV, all patients group E), speech analysis was superior to BORG and CAT scores alone in distinguishing between during- and after-exacerbation status, achieving 84% prediction accuracy. CAT scores correlated with reading rhythm, and BORG scales with stability in articulation. Pulmonary function testing (PFT) correlated with speech pause rate and speech rhythm variability.
Conclusion: Speech analysis may be a viable technology for classifying COPD status, opening up new opportunities for remote disease monitoring.

MCML Authors
Link to website

Andreas Triantafyllopoulos

Health Informatics

Link to website

Anton Batliner

Dr.

Health Informatics

Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[1488]
N. Heldring, A.-R. Rezaie, A. Larsson, R. Gahn, B. Zilg, S. Camilleri, A. Saade, P. Wesp, E. Palm and O. Kvist.
A probability model for estimating age in young individuals relative to key legal thresholds: 15, 18 or 21-year.
International Journal of Legal Medicine 139.1 (Jan. 2025). DOI
Abstract

Age estimations are relevant for pre-trial detention, sentencing in criminal cases and as part of the evaluation in asylum processes to protect the rights and privileges of minors. No current method can determine an exact chronological age due to individual variations in biological development. This study seeks to develop a validated statistical model for estimating an age relative to key legal thresholds (15, 18, and 21 years) based on a skeletal (CT-clavicle, radiography-hand/wrist or MR-knee) and tooth (radiography-third molar) developmental stages. The whole model is based on 34 scientific studies, divided into examinations of the hand/wrist (15 studies), clavicle (5 studies), distal femur (4 studies), and third molars (10 studies). In total, data from approximately 27,000 individuals have been incorporated and the model has subsequently been validated with data from 5,000 individuals. The core framework of the model is built upon transition analysis and is further developed by a combination of a type of parametric bootstrapping and Bayesian theory. Validation of the model includes testing the models on independent datasets of individuals with known ages and shows a high precision with separate populations aligning closely with the model’s predictions. The practical use of the complex statistical model requires a user-friendly tool to provide probabilities together with the margin of error. The assessment based on the model forms the medical component for the overall evaluation of an individual’s age.
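The probabilistic core of such a model, combining a stage-given-age likelihood with an age prior via Bayes' theorem to obtain P(age ≥ 18 | observed stage), can be sketched as follows; the transition curve and the flat prior are illustrative stand-ins, not the study's fitted values:

```python
import numpy as np

# Illustrative discrete ages with a flat prior over 13-24 years.
ages = np.arange(13, 25)
prior = np.full(ages.size, 1 / ages.size)

# Hypothetical P(clavicle stage "fused" | age): a smooth logistic transition,
# standing in for the transition-analysis model fitted in the paper.
def p_stage_given_age(age, midpoint=19.0, slope=1.2):
    return 1.0 / (1.0 + np.exp(-(age - midpoint) / slope))

likelihood = p_stage_given_age(ages)

# Bayes: P(age | stage observed) is proportional to P(stage | age) * P(age).
posterior = likelihood * prior
posterior /= posterior.sum()

# Probability mass above the 18-year legal threshold.
p_adult = posterior[ages >= 18].sum()
print(f"P(age >= 18 | stage fused) = {p_adult:.2f}")
```

The same posterior can be thresholded at 15 or 21 years instead, which is how a single fitted model can report probabilities relative to each legal threshold.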

MCML Authors
Link to website

Philipp Wesp

Dr.

Clinical Data Science in Radiology


[1487]
B. Lange.
Moral parenthood and gestation: replies to Cordeiro, Murphy, Robinson and Baron.
Journal of Medical Ethics 51.2 (Jan. 2025). DOI
Abstract

I am grateful to James Cordeiro, Timothy Murphy, Heloise Robinson and Teresa Baron for their perceptive and stimulating comments on my article in this journal. In what follows, I seek to respond to some of the main points raised in each commentary.

MCML Authors
Link to Profile Benjamin Lange

Benjamin Lange

Dr.

Ethics of Artificial Intelligence


[1486]
B. Lange.
Moral parenthood: not gestational.
Journal of Medical Ethics 51.2 (Jan. 2025). DOI
Abstract

Parenting our biological children is a centrally important matter, but how, if at all, can it be justified? According to a contemporary influential line of thinking, the acquisition by parents of a moral right to parent their biological children should be grounded by appeal to the value of the intimate emotional relationship that gestation facilitates between a newborn and a gestational procreator. I evaluate two arguments in defence of this proposal and argue that both are unconvincing.

MCML Authors
Link to Profile Benjamin Lange

Benjamin Lange

Dr.

Ethics of Artificial Intelligence


[1485]
R. Dorent, R. Khajavi, T. Idris, E. Ziegler, B. Somarouthu, H. Jacene, A. LaCasce, J. Deissler, J. Ehrhardt, S. Engelson, S. Fischer, Y. Gu, H. Handels, S. Kasai, S. Kondo, K. Maier-Hein, J. A. Schnabel, G. Wang, L. Wang, T. Wald, G.-Z. Yang, H. Zhang, M. Zhang, S. Pieper, G. Harris, R. Kikinis and T. Kapur.
LNQ 2023 challenge: Benchmark of weakly-supervised techniques for mediastinal lymph node quantification.
Machine Learning for Biomedical Imaging 3.Special Issue (Jan. 2025). DOI GitHub
Abstract

Accurate assessment of lymph node size in 3D CT scans is crucial for cancer staging, therapeutic management, and monitoring treatment response. Existing state-of-the-art segmentation frameworks in medical imaging often rely on fully annotated datasets. However, for lymph node segmentation, these datasets are typically small due to the extensive time and expertise required to annotate the numerous lymph nodes in 3D CT scans. Weakly-supervised learning, which leverages incomplete or noisy annotations, has recently gained interest in the medical imaging community as a potential solution. Despite the variety of weakly-supervised techniques proposed, most have been validated only on private datasets or small publicly available datasets. To address this limitation, the Mediastinal Lymph Node Quantification (LNQ) challenge was organized in conjunction with the 26th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2023). This challenge aimed to advance weakly-supervised segmentation methods by providing a new, partially annotated dataset and a robust evaluation framework. A total of 16 teams from 5 countries submitted predictions to the validation leaderboard, and 6 teams from 3 countries participated in the evaluation phase. The results highlighted both the potential and the current limitations of weakly-supervised approaches. On one hand, weakly-supervised approaches obtained relatively good performance with a median Dice score of 61.0%. On the other hand, top-ranked teams, with a median Dice score exceeding 70%, boosted their performance by leveraging smaller but fully annotated datasets to combine weak supervision and full supervision. This highlights both the promise of weakly-supervised methods and the ongoing need for high-quality, fully annotated data to achieve higher segmentation performance.

MCML Authors
Link to website

Stefan Fischer

Computational Imaging and AI in Medicine

Link to Profile Julia Schnabel

Julia Schnabel

Prof. Dr.

Computational Imaging and AI in Medicine


[1484]
E. Eulig, F. Jäger, J. Maier, B. Ommer and M. Kachelrieß.
Reconstructing and analyzing the invariances of low-dose CT image denoising networks.
Medical Physics 52 (Jan. 2025). DOI
Abstract

Background: Deep learning-based methods led to significant advancements in many areas of medical imaging, most of which are concerned with the reduction of artifacts caused by motion, scatter, or noise. However, with most neural networks being black boxes, they remain notoriously difficult to interpret, hindering their clinical implementation. In particular, it has been shown that networks exhibit invariances w.r.t. input features, that is, they learn to ignore certain information in the input data.
Purpose: To improve the interpretability of deep learning-based low-dose CT image denoising networks.
Methods: We learn a complete data representation of low-dose input images using a conditional variational autoencoder (cVAE). In this representation, invariances of any given denoising network are then disentangled from the information it is not invariant to using a conditional invertible neural network (cINN). At test time, image-space invariances are generated by applying the inverse of the cINN and subsequent decoding using the cVAE. We propose two methods to analyze sampled invariances and to find those that correspond to alterations of anatomical structures.
Results: The proposed method is applied to four popular deep learning-based low-dose CT image denoising networks. We find that the networks are not only invariant to noise amplitude and realizations, but also to anatomical structures.
Conclusions: The proposed method is capable of reconstructing and analyzing invariances of deep learning-based low-dose CT image denoising networks. This is an important step toward interpreting deep learning-based methods for medical imaging, which is essential for their clinical implementation.

MCML Authors
Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Computer Vision & Learning


[1483]
E. Achterhold, M. Mühlböck, N. Steiber and C. Kern.
Fairness in Algorithmic Profiling: The AMAS Case.
Minds and Machines 35.9 (Jan. 2025). DOI
Abstract

We study a controversial application of algorithmic profiling in the public sector, the Austrian AMAS system. AMAS was supposed to help caseworkers at the Public Employment Service (PES) Austria to allocate support measures to job seekers based on their predicted chance of (re-)integration into the labor market. Shortly after its release, AMAS was criticized for its apparent unequal treatment of job seekers based on gender and citizenship. We systematically investigate the AMAS model using a novel real-world dataset of young job seekers from Vienna, which allows us to provide the first empirical evaluation of the AMAS model with a focus on fairness measures. We further apply bias mitigation strategies to study their effectiveness in our real-world setting. Our findings indicate that the prediction performance of the AMAS model is insufficient for use in practice, as more than 30% of job seekers would be misclassified in our use case. Further, our results confirm that the original model is biased with respect to gender as it tends to (incorrectly) assign women to the group with high chances of re-employment, which is not prioritized in the PES’ allocation of support measures. However, most bias mitigation strategies were able to improve fairness without compromising performance and thus may form an important building block in revising profiling schemes in the present context.

MCML Authors
Link to Profile Christoph Kern

Christoph Kern

Prof. Dr.

Social Data Science and AI Lab


[1482]
T. Li, S. Hofer, G. Moholdt, A. Igneczi, K. Heidler, X. Zhu and J. Bamber.
Pervasive glacier retreats across Svalbard from 1985 to 2023.
Nature Communications 16.705 (Jan. 2025). DOI
Abstract

A major uncertainty in predicting the behaviour of marine-terminating glaciers is ice dynamics driven by non-linear calving front retreat, which is poorly understood and modelled. Using 124919 calving front positions for 149 marine-terminating glaciers in Svalbard from 1985 to 2023, generated with deep learning, we identify pervasive calving front retreats for non-surging glaciers over the past 38 years. We observe widespread seasonal cycles in calving front position for over half of the glaciers. At the seasonal timescale, peak retreat rates exhibit a several-month phase lag, with changes on the west coast occurring before those on the east coast, coincident with regional ocean warming. This spatial variability in seasonal patterns is linked to different timings of warm ocean water inflow from the West Spitsbergen Current, demonstrating the dominant role of ice-ocean interaction in seasonal front changes. The interannual variability of calving front retreat shows a strong sensitivity to both atmospheric and oceanic warming, with immediate responses to large air and ocean temperature anomalies in 2016 and 2019, likely driven by atmospheric blocking that can influence extreme temperature variability. With more frequent blocking occurring and continued regional warming, future calving front retreats will likely intensify, leading to more significant glacier mass loss.

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1481]
B. Lange.
Digital Duplicates and Collective Scarcity.
Philosophy and Technology 38.7 (Jan. 2025). DOI
Abstract

Digital duplicates reduce the scarcity of individuals and thus may impact their instrumental and intrinsic value. I here expand upon this idea by introducing the notion of collective scarcity, which pertains to the limitations faced by social groups in maintaining their size, cohesion and function.

MCML Authors
Link to Profile Benjamin Lange

Benjamin Lange

Dr.

Ethics of Artificial Intelligence


[1480]
M. Binz, S. Alaniz, A. Roskies, B. , C. T. Bergstrom, C. Allen, D. Schad, D. Wulff, J. D. , Q. Zhang, R. M. Shiffrin, S. J. Gershman, V. Popov, E. M. Bender, M. Marelli, M. M. Botvinick, Z. Akata and E. Schulz.
How should the advancement of large language models affect the practice of science?
Proceedings of the National Academy of Sciences 122.5 (Jan. 2025). DOI
Abstract

Large language models (LLMs) are being increasingly incorporated into scientific workflows. However, we have yet to fully grasp the implications of this integration. How should the advancement of large language models affect the practice of science? For this opinion piece, we have invited four diverse groups of scientists to reflect on this query, sharing their perspectives and engaging in debate. Schulz et al. make the argument that working with LLMs is not fundamentally different from working with human collaborators, while Bender et al. argue that LLMs are often misused and overhyped, and that their limitations warrant a focus on more specialized, easily interpretable tools. Marelli et al. emphasize the importance of transparent attribution and responsible use of LLMs. Finally, Botvinick and Gershman advocate that humans should retain responsibility for determining the scientific roadmap. To facilitate the discussion, the four perspectives are complemented with a response from each group. By putting these different perspectives in conversation, we aim to bring attention to important considerations within the academic community regarding the adoption of LLMs and their impact on both current and future scientific practices.

MCML Authors
Link to website

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[1479]
T. Weber, J. Dexl, D. Rügamer and M. Ingrisch.
Post-Training Network Compression for 3D Medical Image Segmentation: Reducing Computational Efforts via Tucker Decomposition.
Radiology: Artificial Intelligence 7.2 (Jan. 2025). DOI
Abstract

We address the computational barrier of deploying advanced deep learning segmentation models in clinical settings by studying the efficacy of network compression through tensor decomposition. We propose a post-training Tucker factorization that enables the decomposition of pre-existing models to reduce computational requirements without impeding segmentation accuracy. We applied Tucker decomposition to the convolutional kernels of the TotalSegmentator (TS) model, an nnU-Net model trained on a comprehensive dataset for automatic segmentation of 117 anatomical structures. Our approach reduced the floating-point operations (FLOPs) and memory required during inference, offering an adjustable trade-off between computational efficiency and segmentation quality. This study utilized the publicly available TS dataset, employing various downsampling factors to explore the relationship between model size, inference speed, and segmentation performance. The application of Tucker decomposition to the TS model substantially reduced the model parameters and FLOPs across various compression rates, with limited loss in segmentation accuracy. We removed up to 88% of the model’s parameters with no significant performance changes in the majority of classes after fine-tuning. Practical benefits varied across different graphics processing unit (GPU) architectures, with more distinct speed-ups on less powerful hardware. Post-hoc network compression via Tucker decomposition presents a viable strategy for reducing the computational demand of medical image segmentation models without substantially sacrificing accuracy. This approach enables the broader adoption of advanced deep learning technologies in clinical practice, offering a way to navigate the constraints of hardware capabilities.
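The core compression step, a Tucker factorization of convolutional kernels along the channel modes, can be sketched with a truncated HOSVD in plain numpy. This is an illustrative reconstruction of the idea, not the authors' implementation (ranks, shapes, and the function name are our own choices):

```python
import numpy as np

def tucker2_compress(kernel, r_out, r_in):
    """Truncated HOSVD of a conv kernel along its output/input channel modes."""
    c_out, c_in = kernel.shape[:2]
    # Mode-0 unfolding (output channels): truncated left singular vectors.
    U0 = np.linalg.svd(kernel.reshape(c_out, -1), full_matrices=False)[0][:, :r_out]
    # Mode-1 unfolding (input channels).
    U1 = np.linalg.svd(np.moveaxis(kernel, 1, 0).reshape(c_in, -1),
                       full_matrices=False)[0][:, :r_in]
    # Core tensor: kernel contracted with U0^T (mode 0) and U1^T (mode 1).
    core = np.tensordot(U0.T, kernel, axes=(1, 0))              # (r_out, c_in, ...)
    core = np.moveaxis(np.tensordot(U1.T, np.moveaxis(core, 1, 0), axes=(1, 0)), 0, 1)
    return core, U0, U1                                          # (r_out, r_in, ...)

# A 3D conv kernel with 32 output and 16 input channels, compressed to ranks (8, 4).
kernel = np.random.default_rng(0).standard_normal((32, 16, 3, 3, 3))
core, U0, U1 = tucker2_compress(kernel, r_out=8, r_in=4)
orig = kernel.size
comp = core.size + U0.size + U1.size
print(f"parameters: {orig} -> {comp} ({comp / orig:.1%})")
```

In a network, the factors U1, core, and U0 would replace the original convolution with a 1x1x1 projection, a small-rank convolution, and a 1x1x1 expansion, which is where the FLOP savings come from.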

MCML Authors
Link to website

Jakob Dexl

Clinical Data Science in Radiology

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Link to Profile Michael Ingrisch

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology


[1478]
A. Scagliotti.
Minimax Problems for Ensembles of Control-Affine Systems.
SIAM Journal on Control and Optimization 63.1 (Jan. 2025). DOI
Abstract

In this paper, we consider ensembles of control-affine systems in ℝ^d, and we study simultaneous optimal control problems related to the worst-case minimization. After proving that such problems admit solutions, denoting with (Θ_N)_N a sequence of compact sets that parametrize the ensembles of systems, we first show that the corresponding minimax optimal control problems are Γ-convergent whenever (Θ_N)_N has a limit with respect to the Hausdorff distance. Besides its independent interest, the previous result plays a crucial role in establishing the Pontryagin Maximum Principle (PMP) when the ensemble is parametrized by a set Θ consisting of infinitely many points. Namely, we first approximate Θ by finite and increasing-in-size sets (Θ_N)_N for which the PMP is known, and then we derive the PMP for the Γ-limiting problem. The same strategy can be pursued in applications, where we can reduce infinite ensembles to finite ones to compute the minimizers numerically. We bring as a numerical example the Schrödinger equation for a qubit with uncertain resonance frequency.

MCML Authors
Link to website

Alessandro Scagliotti

Applied Numerical Analysis


[1477]
M. Gorski, S. Wiegrebe, R. Burkhardt, M. Behr, H. Küchenhoff, K. J. Stark, C. A. Böger and I. M. Heid.
Bias-corrected serum creatinine from UK Biobank electronic medical records generates an important data resource for kidney function trajectories.
Scientific Reports 15.3540 (Jan. 2025). DOI
Abstract

Loss of kidney function is a substantial personal and public health burden. Kidney function is typically assessed as estimated glomerular filtration rate (eGFR) based on serum creatinine. UK Biobank provides serum creatinine measurements from study center assessments (SC, n = 425,147 baseline, n = 15,314 with follow-up), and emerging electronic Medical Records (eMR, ‘GP-clinical’) present a promising resource to augment this data longitudinally. However, it is unclear whether eMR-based and SC-based creatinine values can be used jointly for research on eGFR decline. When comparing eMR-based with SC-based creatinine by calendar year (n = 70,231), we found a year-specific multiplicative bias for eMR-based creatinine that decreased over time (factor 0.84 for 2007, 0.97 for 2013). Deriving eGFR based on SC- and bias-corrected eMR-creatinine yielded 454,907 individuals with ≥ 1 eGFR assessment (2,102,174 assessments). This included 206,063 individuals with ≥ 2 assessments over up to 60.2 years (median 6.00 assessments, median time = 8.7 years), where we also obtained eMR-based information on kidney disease or renal replacement therapy. We found an annual eGFR decline of 0.11 (95%-CI = 0.10–0.12) versus 1.04 mL/min/1.73 m²/year (95%-CI = 1.03–1.05) without and with bias-correction, the latter being in line with the literature. In summary, our bias-corrected eMR-based creatinine values enabled a 4-fold increased number of eGFR assessments in UK Biobank suitable for kidney function research.
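The correction itself amounts to a per-year rescaling. A hypothetical sketch, assuming the reported factor multiplies the eMR values (so the correction divides by it; factors for years other than 2007 and 2013 are not listed in the abstract):

```python
# Year-specific multiplicative bias factors for eMR-based creatinine,
# as reported in the abstract (illustrative; other years omitted).
BIAS_FACTOR = {2007: 0.84, 2013: 0.97}

def correct_creatinine(value_emr, year):
    """Rescale an eMR creatinine value onto the study-center (SC) scale."""
    return value_emr / BIAS_FACTOR[year]

print(correct_creatinine(0.84, 2007))  # -> 1.0 on the SC scale
```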

MCML Authors
Link to Profile Helmut Küchenhoff

Helmut Küchenhoff

Prof. Dr.

Statistical Consulting Unit (StaBLab)


[1476]
K. Ghosh, M. Todorović, A. Vehtari and P. Rinke.
Active learning of molecular data for task-specific objectives.
The Journal of Chemical Physics 162.014103 (Jan. 2025). DOI
Abstract

Active learning (AL) has shown promise to be a particularly data-efficient machine learning approach. Yet, its performance depends on the application, and it is not clear when AL practitioners can expect computational savings. Here, we carry out a systematic AL performance assessment for three diverse molecular datasets and two common scientific tasks: compiling compact, informative datasets and targeted molecular searches. We implemented AL with Gaussian processes (GP) and used the many-body tensor as molecular representation. For the first task, we tested different data acquisition strategies, batch sizes, and GP noise settings. AL was insensitive to the acquisition batch size, and we observed the best AL performance for the acquisition strategy that combines uncertainty reduction with clustering to promote diversity. However, for optimal GP noise settings, AL did not outperform the randomized selection of data points. Conversely, for targeted searches, AL outperformed random sampling and achieved data savings of up to 64%. Our analysis provides insight into this task-specific performance difference in terms of target distributions and data collection strategies. We established that the performance of AL depends on the relative distribution of the target molecules in comparison to the total dataset distribution, with the largest computational savings achieved when their overlap is minimal.
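The best-performing acquisition, uncertainty reduction combined with clustering to promote diversity, can be sketched as a greedy batch selection: rank candidates by GP posterior variance, then penalize proximity to already-selected points. A toy numpy version under our own simplifications (exact GP with RBF kernel, greedy max-min diversity in place of clustering):

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """RBF kernel matrix between two point sets."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def gp_variance(X_train, X_cand, noise=1e-2):
    """Posterior predictive variance of a zero-mean GP with RBF kernel."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf(X_cand, X_train)
    return 1.0 - np.einsum('ij,jk,ik->i', Ks, np.linalg.inv(K), Ks)

def select_batch(X_train, X_cand, batch):
    """Greedy: highest variance first, then variance x distance-to-selected."""
    var = gp_variance(X_train, X_cand)
    chosen = [int(np.argmax(var))]
    while len(chosen) < batch:
        dist = np.min(((X_cand[:, None, :] - X_cand[chosen][None, :, :]) ** 2)
                      .sum(-1), axis=1)
        score = var * dist  # zero for already-chosen points
        chosen.append(int(np.argmax(score)))
    return chosen

rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, (20, 2))
X_cand = rng.uniform(-1, 1, (100, 2))
print(select_batch(X_train, X_cand, batch=5))
```

The distance factor is what enforces diversity: once a candidate is selected, its score drops to zero and nearby candidates are down-weighted.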

MCML Authors
Link to Profile Patrick Rinke

Patrick Rinke

Prof. Dr.

AI-based Material Science


[1475]
J. Homer and O. Friedrich.
SBIAX: Density-estimation simulation-based inference in JAX.
The Journal of Open Source Software 10.105 (Jan. 2025). DOI
Abstract

In a typical Bayesian inference problem, the data likelihood is not known. However, in recent years, machine learning methods for density estimation can allow for inference using an estimator of the data likelihood. This likelihood estimator is fit with neural networks that are trained on simulations to maximise the likelihood of the simulation-parameter pairs, one of the many available tools for Simulation-Based Inference (SBI) (Cranmer et al., 2020)…
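The core idea, fitting a density estimator to simulations and treating it as a likelihood surrogate, can be illustrated with a deliberately simple Gaussian estimator standing in for the neural network (the simulator and parameter grid below are our own toy choices, not SBIAX code):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, n=1000):
    """Toy simulator: data = theta + unit-variance Gaussian noise."""
    return theta + rng.standard_normal(n)

# Fit a Gaussian "density estimator" to simulations at each grid parameter,
# then evaluate the estimated likelihood of the observed datum.
x_obs = 1.0
thetas = np.linspace(-2, 2, 81)
log_like = []
for th in thetas:
    sims = simulator(th)
    mu, sd = sims.mean(), sims.std()
    log_like.append(-0.5 * ((x_obs - mu) / sd) ** 2 - np.log(sd))

best = thetas[int(np.argmax(log_like))]
print(f"surrogate likelihood peaks at theta ~ {best:.2f}")
```

In real SBI the Gaussian fit is replaced by a flexible neural density estimator, but the workflow, simulate, fit, evaluate, is the same.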

MCML Authors
Link to website

Jed Homer

Astrophysics, Cosmology and Artificial Intelligence


[1474]
A. Köksal, M. Thaler, A. Imani, A. Üstün, A. Korhonen and H. Schütze.
MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions.
Transactions of the Association for Computational Linguistics (2025). To be published. Preprint available. arXiv GitHub
Abstract

Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to create instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Utilizing reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach’s effectiveness for both NLU and open-ended generation.

MCML Authors
Link to website

Ayyoob Imani

Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1473]
M. Aleksic, T. Ehring, A. Kunze, Y. Han, H. Funk and L. Wolkenstein.
Selective Effects of Eye Movement Desensitization and Reprocessing, Imagery Rescripting and Imaginal Exposure on Voluntary and Involuntary Memory of an Aversive Autobiographical Event.
Preprint (Jan. 2025). DOI
Abstract

Clinical theories suggest that trauma-focused interventions reduce intrusive memories while preserving voluntary recall. However, concerns persist that they may inadvertently compromise factual memory content. To test these contrasting predictions, we examined the effects of Eye Movement Desensitization and Reprocessing (EMDR), Imagery Rescripting (ImRs), and Imaginal Exposure (IE) on involuntary and voluntary memories of an aversive autobiographical event. Healthy participants (N = 182), recruited between 2021 and 2023, completed a free recall task before receiving either one of the interventions or no intervention (NIC). One week later, the recall task was repeated. Intrusion load and frequency were assessed with an app-diary; psychophysiological responses to intrusions were assessed in a laboratory task. Independent raters evaluated the disorganization, coherence, and consistency of voluntary memory. All interventions reduced intrusion load, but only ImRs decreased intrusion frequency compared to NIC. Psychophysiological responses to intrusions showed no group differences. IE improved the structural organization of voluntary memory by reducing disorganized thoughts, while EMDR and ImRs enhanced conceptual organization by increasing contextual coherence. None of the interventions impaired memory consistency, with no group differences in contradictions or omissions. These findings suggest that these interventions reduce distressing intrusions without compromising voluntary memory. Further research should replicate these effects in clinical samples.

MCML Authors
Link to website

Henri Funk

Statistical Consulting Unit (StaBLab)


[1472]
F. Drexel, V. Sideri-Lampretsa, H. Bast, A. W. Marka, T. Koehler, F. T. Gassert, D. Pfeiffer, D. Rückert and F. Pfeiffer.
Deformable Image Registration of Dark-Field Chest Radiographs for Local Lung Signal Change Assessment.
Preprint (Jan. 2025). arXiv
Abstract

Dark-field radiography of the human chest has been demonstrated to have promising potential for the analysis of the lung microstructure and the diagnosis of respiratory diseases. However, previous studies of dark-field chest radiographs evaluated the lung signal only in the inspiratory breathing state. Our work aims to add a new perspective to these previous assessments by locally comparing dark-field lung information between different respiratory states. To this end, we discuss suitable image registration methods for dark-field chest radiographs to enable consistent spatial alignment of the lung in distinct breathing states. Utilizing full inspiration and expiration scans from a clinical chronic obstructive pulmonary disease study, we assess the performance of the proposed registration framework and outline applicable evaluation approaches. Our regional characterization of lung dark-field signal changes between the breathing states provides a proof-of-principle that dynamic radiography-based lung function assessment approaches may benefit from considering registered dark-field images in addition to standard plain chest radiographs.

MCML Authors
Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine


[1471]
F. Dülmer, M. F. Azampour and N. Navab.
UltraRay: Full-Path Ray Tracing for Enhancing Realism in Ultrasound Simulation.
Preprint (Jan. 2025). arXiv
Abstract

Traditional ultrasound simulators solve the wave equation to model pressure distribution fields, achieving high accuracy but requiring significant computational time and resources. To address this, ray tracing approaches have been introduced, modeling wave propagation as rays interacting with boundaries and scatterers. However, existing models simplify ray propagation, generating echoes at interaction points without considering return paths to the sensor. This can result in unrealistic artifacts and necessitates careful scene tuning for plausible results. We propose a novel ultrasound simulation pipeline that utilizes a ray tracing algorithm to generate echo data, tracing each ray from the transducer through the scene and back to the sensor. To replicate advanced ultrasound imaging, we introduce a ray emission scheme optimized for plane wave imaging, incorporating delay and steering capabilities. Furthermore, we integrate a standard signal processing pipeline to simulate end-to-end ultrasound image formation. We showcase the efficacy of the proposed pipeline by modeling synthetic scenes featuring highly reflective objects, such as bones. In doing so, our proposed approach, UltraRay, not only enhances the overall visual quality but also improves the realism of the simulated images by accurately capturing secondary reflections and reducing unnatural artifacts. By building on top of a differentiable framework, the proposed pipeline lays the groundwork for a fast and differentiable ultrasound simulation tool necessary for gradient-based optimization, enabling advanced ultrasound beamforming strategies, neural network integration, and accurate inverse scene reconstruction.

MCML Authors
Link to website

Felix Dülmer

Computer Aided Medical Procedures & Augmented Reality

Link to website

Mohammad Farid Azampour

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1470]
S. Eckman, B. Ma, C. Kern, R. Chew, B. Plank and F. Kreuter.
Correcting Annotator Bias in Training Data: Population-Aligned Instance Replication (PAIR).
Preprint (Jan. 2025). arXiv
Abstract

Models trained on crowdsourced labels may not reflect broader population views when annotator pools are not representative. Since collecting representative labels is challenging, we propose Population-Aligned Instance Replication (PAIR), a method to address this bias through statistical adjustment. Using a simulation study of hate speech and offensive language detection, we create two types of annotators with different labeling tendencies and generate datasets with varying proportions of the types. Models trained on unbalanced annotator pools show poor calibration compared to those trained on representative data. However, PAIR, which duplicates labels from underrepresented annotator groups to match population proportions, significantly reduces bias without requiring new data collection. These results suggest statistical techniques from survey research can help align model training with target populations even when representative annotator pools are unavailable. We conclude with three practical recommendations for improving training data quality.
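The duplication step can be sketched directly: compute replication factors from target versus observed group proportions and replicate the underrepresented annotators' labels. A schematic version (the function name and integer rounding are our own choices, not the paper's exact procedure):

```python
from collections import Counter

def pair_rebalance(records, target_props):
    """Replicate labels so annotator-group proportions approach target_props.

    records: list of (annotator_group, label) pairs.
    """
    counts = Counter(group for group, _ in records)
    n = len(records)
    # Desired over-sampling factor per group, scaled so the smallest is 1
    # (we only duplicate records, never drop them).
    factors = {g: target_props[g] / (counts[g] / n) for g in counts}
    scale = min(factors.values())
    reps = {g: round(f / scale) for g, f in factors.items()}
    return [rec for rec in records for _ in range(reps[rec[0]])]

records = [('A', 1)] * 80 + [('B', 0)] * 20        # unbalanced annotator pool
balanced = pair_rebalance(records, {'A': 0.5, 'B': 0.5})
print(Counter(g for g, _ in balanced))             # A and B now equal
```

Here the minority group B is replicated fourfold, so the rebalanced pool matches the 50/50 target without collecting new labels.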

MCML Authors
Link to Profile Christoph Kern

Christoph Kern

Prof. Dr.

Social Data Science and AI Lab

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics

Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI


[1469]
Y. Feng, S. Feuerriegel and Y. R. Shrestha.
Contextualizing Recommendation Explanations with LLMs: A User Study.
Preprint (Jan. 2025). arXiv
Abstract

Large language models (LLMs) are increasingly prevalent in recommender systems, where LLMs can be used to generate personalized recommendations. Here, we examine how different LLM-generated explanations for movie recommendations affect users’ perceptions of cognitive, affective, and utilitarian needs and consumption intentions. In a pre-registered, between-subject online experiment (N=759) and follow-up interviews (N=30), we compare (a) LLM-generated generic explanations, and (b) LLM-generated contextualized explanations. Our findings show that contextualized explanations (i.e., explanations that incorporate users’ past behaviors) effectively meet users’ cognitive needs while increasing users’ intentions to watch recommended movies. However, adding explanations offers limited benefits in meeting users’ utilitarian and affective needs, raising concerns about the proper design and implications of LLM-generated explanations. Qualitative insights from interviews reveal that referencing users’ past preferences enhances trust and understanding but can feel excessive if overused. Furthermore, users with more active and positive engagement with the recommender system and movie-watching get substantial gains from contextualized explanations. Overall, our research clarifies how LLM-generated recommendations influence users’ motivations and behaviors, providing valuable insights for the future development of user-centric recommender systems, a key element in social media platforms and online ecosystems.

MCML Authors
Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1468]
Z. Haouari, J. Weidner, I. Ezhov, A. Varma, D. Rückert, B. Menze and B. Wiestler.
Efficient Deep Learning-based Forward Solvers for Brain Tumor Growth Models.
Preprint (Jan. 2025). arXiv
Abstract

Glioblastoma, a highly aggressive brain tumor, poses major challenges due to its poor prognosis and high morbidity rates. Partial differential equation-based models offer promising potential to enhance therapeutic outcomes by simulating patient-specific tumor behavior for improved radiotherapy planning. However, model calibration remains a bottleneck due to the high computational demands of optimization methods like Monte Carlo sampling and evolutionary algorithms. To address this, we recently introduced an approach leveraging a neural forward solver with gradient-based optimization to significantly reduce calibration time. This approach requires a highly accurate and fully differentiable forward model. We investigate multiple architectures, including (i) an enhanced TumorSurrogate, (ii) a modified nnU-Net, and (iii) a 3D Vision Transformer (ViT). The optimized TumorSurrogate achieved the best overall results, excelling in both tumor outline matching and voxel-level prediction of tumor cell concentration. It halved the MSE relative to the baseline model and achieved the highest Dice score across all tumor cell concentration thresholds. Our study demonstrates significant enhancement in forward solver performance and outlines important future research directions.

MCML Authors
Link to website

Jonas Weidner

AI for Image-Guided Diagnosis and Therapy

Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine

Link to Profile Benedikt Wiestler

Benedikt Wiestler

Prof. Dr.

AI for Image-Guided Diagnosis and Therapy


[1467]
B. Jian, J. Pan, Y. Li, F. Bongratz, R. Li, D. Rückert, B. Wiestler and C. Wachinger.
TimeFlow: Longitudinal Brain Image Registration and Aging Progression Analysis.
Preprint (Jan. 2025). arXiv
Abstract

Predicting future brain states is crucial for understanding healthy aging and neurodegenerative diseases. Longitudinal brain MRI registration, a cornerstone for such analyses, has long been limited by its inability to forecast future developments, reliance on extensive, dense longitudinal data, and the need to balance registration accuracy with temporal smoothness. In this work, we present TimeFlow, a novel framework for longitudinal brain MRI registration that overcomes all these challenges. Leveraging a U-Net architecture with temporal conditioning inspired by diffusion models, TimeFlow enables accurate longitudinal registration and facilitates prospective analyses through future image prediction. Unlike traditional methods that depend on explicit smoothness regularizers and dense sequential data, TimeFlow achieves temporal consistency and continuity without these constraints. Experimental results highlight its superior performance in both future timepoint prediction and registration accuracy compared to state-of-the-art methods. Additionally, TimeFlow supports novel biological brain aging analyses, effectively differentiating neurodegenerative conditions from healthy aging. It eliminates the need for segmentation, thereby avoiding the challenges of non-trivial annotation and inconsistent segmentation errors. TimeFlow paves the way for accurate, data-efficient, and annotation-free prospective analyses of brain aging and chronic diseases.

MCML Authors
Link to website

Bailiang Jian

Artificial Intelligence in Medical Imaging

Link to website

Yitong Li

Artificial Intelligence in Medical Imaging

Link to website

Fabian Bongratz

Artificial Intelligence in Medical Imaging

Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine

Link to Profile Benedikt Wiestler

Benedikt Wiestler

Prof. Dr.

AI for Image-Guided Diagnosis and Therapy

Link to Profile Christian Wachinger

Christian Wachinger

Prof. Dr.

Artificial Intelligence in Medical Imaging


[1466]
O. Kononykhina, M. Schierholz and F. Kreuter.
The Impact of Question Framing on the Precision of Automatic Occupation Coding.
Preprint (Jan. 2025). arXiv
Abstract

Occupational data play a vital role in research, official statistics, and policymaking, yet their collection and accurate classification remain a persistent challenge. This study investigates the effects of occupational question wording on data variability and the performance of automatic coding tools. Through a series of survey experiments conducted and replicated in Germany, we tested two widely-used occupational question formats: one focusing on ‘job title’ (Berufsbezeichnung) and another on ‘occupational tasks’ (berufliche Tätigkeit). Our analysis reveals that automatic coding tools, such as CASCOT and OccuCoDe, exhibit significant sensitivity to the form and origin of the data. Specifically, these tools performed more efficiently when coding responses to the job title question format compared to the occupational task format. Additionally, we found that including examples of main tasks and duties in the questions led respondents to provide more detailed but less linguistically diverse responses. This reduced diversity may negatively affect the precision of automatic coding. These findings highlight the importance of tailoring automatic coding tools to the specific structure and origin of the data they are applied to. We emphasize the need for further research to optimize question design and coding tools for greater accuracy and applicability in occupational data collection.

MCML Authors
Olga Kononykhina

Social Data Science and AI

Malte Schierholz

Dr.

Social Data Science and AI

Frauke Kreuter

Prof. Dr.

Social Data Science and AI


[1465]
T. Mortier, A. Javanmardi, Y. Sale, E. Hüllermeier and W. Waegeman.
Conformal Prediction in Hierarchical Classification.
Preprint (Jan. 2025). arXiv
Abstract

Conformal prediction has emerged as a widely used framework for constructing valid prediction sets in classification and regression tasks. In this work, we extend the split conformal prediction framework to hierarchical classification, where prediction sets are commonly restricted to internal nodes of a predefined hierarchy, and propose two computationally efficient inference algorithms. The first algorithm returns internal nodes as prediction sets, while the second relaxes this restriction, using the notion of representation complexity, yielding a more general and combinatorial inference problem, but smaller set sizes. Empirical evaluations on several benchmark datasets demonstrate the effectiveness of the proposed algorithms in achieving nominal coverage.
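
As background for the split conformal framework this paper extends, a minimal sketch of split conformal classification in the flat (non-hierarchical) case; the function name, the 1 − p(y) nonconformity score, and the quantile capping are our illustrative choices, not taken from the paper:

```python
import math

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction sets for classification.

    Nonconformity score: 1 minus the probability the model assigns to
    the true label. A label enters a test point's prediction set if
    its score does not exceed the calibrated quantile threshold.
    """
    # Scores on the held-out calibration split.
    scores = [1.0 - p[y] for p, y in zip(cal_probs, cal_labels)]
    n = len(scores)
    # Finite-sample-corrected (1 - alpha) empirical quantile
    # (capped at the largest calibration score for simplicity).
    k = math.ceil((n + 1) * (1 - alpha))
    qhat = sorted(scores)[min(k, n) - 1]
    # Prediction set: every label that conforms at level qhat.
    return [{c for c, p in enumerate(probs) if 1.0 - p <= qhat}
            for probs in test_probs]
```

Marginal coverage (the "nominal coverage" the abstract refers to) then holds at level 1 − alpha, under exchangeability of calibration and test data.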

MCML Authors
Alireza Javanmardi

Artificial Intelligence and Machine Learning

Yusuf Sale

Artificial Intelligence and Machine Learning

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1464]
A. Saroha, F. Hofherr, M. Gladkova, C. Curreli, O. Litany and D. Cremers.
ZDySS – Zero-Shot Dynamic Scene Stylization using Gaussian Splatting.
Preprint (Jan. 2025). arXiv
Abstract

Stylizing a dynamic scene based on an exemplar image is critical for various real-world applications, including gaming, filmmaking, and augmented and virtual reality. However, achieving consistent stylization across both spatial and temporal dimensions remains a significant challenge. Most existing methods are designed for static scenes and often require an optimization process for each style image, limiting their adaptability. We introduce ZDySS, a zero-shot stylization framework for dynamic scenes, allowing our model to generalize to previously unseen style images at inference. Our approach employs Gaussian splatting for scene representation, linking each Gaussian to a learned feature vector that renders a feature map for any given view and timestamp. By applying style transfer on the learned feature vectors instead of the rendered feature map, we enhance spatio-temporal consistency across frames. Our method demonstrates superior performance and coherence over state-of-the-art baselines in tests on real-world dynamic scenes, making it a robust solution for practical applications.

MCML Authors
Florian Hofherr

Computer Vision & Artificial Intelligence

Mariia Gladkova

Computer Vision & Artificial Intelligence

Cecilia Curreli

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1463]
R. Schwank, A. McCormack and M. Drton.
Robust Score Matching.
Preprint (Jan. 2025). arXiv
Abstract

Proposed in Hyvärinen (2005), score matching is a parameter estimation procedure that does not require computation of distributional normalizing constants. In this work we utilize the geometric median of means to develop a robust score matching procedure that yields consistent parameter estimates in settings where the observed data has been contaminated. A special appeal of the proposed method is that it retains convexity in exponential family models. The new method is therefore particularly attractive for non-Gaussian, exponential family graphical models where evaluation of normalizing constants is intractable. Support recovery guarantees for such models when contamination is present are provided. Additionally, support recovery is studied in numerical experiments and on a precipitation dataset. We demonstrate that the proposed robust score matching estimator performs comparably to the standard score matching estimator when no contamination is present but greatly outperforms this estimator in a setting with contamination.
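
The paper builds on the geometric median of means; for intuition, a sketch of the scalar median-of-means estimator (our own toy code, not the paper's multivariate procedure):

```python
import random
import statistics

def median_of_means(xs, n_blocks=5):
    """Median-of-means estimate of a mean.

    Split the sample into blocks, average each block, and return the
    median of the block means. A few gross outliers can corrupt at
    most a few blocks, so the median of block means stays close to
    the uncontaminated mean.
    """
    xs = list(xs)
    random.shuffle(xs)  # guard against adversarial ordering
    size = len(xs) // n_blocks
    means = [statistics.fmean(xs[i * size:(i + 1) * size])
             for i in range(n_blocks)]
    return statistics.median(means)
```

With two extreme outliers among 100 points, the plain mean is pulled far off while the median-of-means estimate is unaffected, mirroring the contamination robustness the abstract describes.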

MCML Authors
Mathias Drton

Prof. Dr.

Mathematical Statistics


[1462]
M. H. Shaker and E. Hüllermeier.
Random Forest Calibration.
Preprint (Jan. 2025). arXiv
Abstract

The Random Forest (RF) classifier is often claimed to be relatively well calibrated when compared with other machine learning methods. Moreover, the existing literature suggests that traditional calibration methods, such as isotonic regression, do not substantially enhance the calibration of RF probability estimates unless supplied with extensive calibration data sets, which can represent a significant obstacle in cases of limited data availability. Nevertheless, there seems to be no comprehensive study validating such claims and systematically comparing state-of-the-art calibration methods specifically for RF. To close this gap, we investigate a broad spectrum of calibration methods tailored to or at least applicable to RF, ranging from scaling techniques to more advanced algorithms. Our results based on synthetic as well as real-world data unravel the intricacies of RF probability estimates, scrutinize the impacts of hyper-parameters, and compare calibration methods in a systematic way. We show that a well-optimized RF performs as well as or better than leading calibration approaches.
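
Isotonic regression, the traditional calibration method the abstract mentions, reduces to the pool-adjacent-violators (PAV) algorithm; a self-contained sketch (our own minimal implementation, not the study's code):

```python
def pav_calibrate(scores, labels):
    """Isotonic (monotone) calibration via pool-adjacent-violators.

    Fits a non-decreasing map from raw classifier scores to
    calibrated probabilities on (score, 0/1 label) pairs. Returns the
    sorted scores and the fitted probabilities aligned with them.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Each block tracks [sum of labels, count]; its mean is the
    # current fitted value for the points it covers.
    merged = []
    for i in order:
        merged.append([float(labels[i]), 1])
        # Pool adjacent blocks while monotonicity is violated.
        while len(merged) > 1 and \
                merged[-2][0] / merged[-2][1] >= merged[-1][0] / merged[-1][1]:
            s, n = merged.pop()
            merged[-1][0] += s
            merged[-1][1] += n
    fitted = []
    for s, n in merged:
        fitted.extend([s / n] * n)
    return [scores[i] for i in order], fitted
```

As the abstract notes, such a fit needs a sizeable calibration set to be reliable; with few points each block mean is estimated from very little data.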

MCML Authors
Mohammad Hossein Shaker

Artificial Intelligence and Machine Learning

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1461]
Z. Yang, M. Song, X. Jing, H. Zhang, K. Qian, B. Hu, K. Tamada, T. Takumi, B. W. Schuller and Y. Yamamoto.
MADUV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge.
Preprint (Jan. 2025). arXiv
Abstract

The Mice Autism Detection via Ultrasound Vocalization (MADUV) Challenge introduces the first INTERSPEECH challenge focused on detecting autism spectrum disorder (ASD) in mice through their vocalizations. Participants are tasked with developing models to automatically classify mice as either wild-type or ASD models based on recordings with a high sampling rate. Our baseline system employs a simple CNN-based classification using three different spectrogram features. Results demonstrate the feasibility of automated ASD detection, with the considered audible-range features achieving the best performance (UAR of 0.600 for segment-level and 0.625 for subject-level classification). This challenge bridges speech technology and biomedical research, offering opportunities to advance our understanding of ASD models through machine learning approaches. The findings suggest promising directions for vocalization analysis and highlight the potential value of audible and ultrasound vocalizations in ASD detection.

MCML Authors
Björn Schuller

Prof. Dr.

Health Informatics


2024


[1460]
N. Strauß.
Artificial intelligence for resource allocation tasks.
Dissertation 2024. DOI
Abstract

This thesis presents deep reinforcement learning approaches for complex resource allocation tasks, including discrete, continuous, and resource collection problems. It introduces novel neural architectures achieving state-of-the-art results in spatial resource allocation, multi-agent collection, and dynamic ambulance redeployment, including electric ambulances. For continuous tasks like portfolio optimization, it proposes efficient methods to handle allocation constraints, ensuring compliance during training and deployment. (Shortened).

MCML Authors
Niklas Strauß

Dr.

Spatial Artificial Intelligence


[1459]
B. Kühbacher, F. Iglesias-Suarez, N. Kilbertus and V. Eyring.
Towards Physically Consistent Deep Learning For Climate Model Parameterizations.
ICMLA 2024 - 23rd IEEE International Conference on Machine Learning and Applications. Miami, FL, USA, Dec 18-20, 2024. DOI
Abstract

Climate models play a critical role in understanding and projecting climate change. Due to their complexity, their horizontal resolution of about 40-100 km remains too coarse to resolve processes such as clouds and convection, which need to be approximated via parameterizations. These parameterizations are a major source of systematic errors and large uncertainties in climate projections. Deep learning (DL)-based parameterizations, trained on data from computationally expensive short, high-resolution simulations, have shown great promise for improving climate models in that regard. However, their lack of interpretability and tendency to learn spurious non-physical correlations result in reduced trust in the climate simulation. We propose an efficient supervised learning framework for DL-based parameterizations that leads to physically consistent models with improved interpretability and negligible computational overhead compared to standard supervised training. First, key features determining the target physical processes are uncovered. Subsequently, the neural network is fine-tuned using only those relevant features. We show empirically that our method robustly identifies a small subset of the inputs as actual physical drivers, therefore removing spurious non-physical relationships. This results in by design physically consistent and interpretable neural networks while maintaining the predictive performance of unconstrained black-box DL-based parameterizations.

MCML Authors
Birgit Kühbacher

Ethics in Systems Design and Machine Learning

Niki Kilbertus

Prof. Dr.

Ethics in Systems Design and Machine Learning


[1458]
K. Bieker, H. T. Kussaba, P. Scholl, J. Jung, A. Swikir, S. Haddadin and G. Kutyniok.
Compositional Construction of Barrier Functions for Switched Impulsive Systems.
CDC 2024 - 63rd IEEE Conference on Decision and Control. Milan, Italy, Dec 16-19, 2024. DOI
Abstract

Many systems occurring in real-world applications, such as controlling the motions of robots or modeling the spread of diseases, are switched impulsive systems. To ensure that the system state stays in a safe region (e.g., to avoid collisions with obstacles), barrier functions are widely utilized. As the system dimension increases, deriving suitable barrier functions becomes extremely complex. Fortunately, many systems consist of multiple subsystems, such as different areas where the disease occurs. In this work, we present sufficient conditions for interconnected switched impulsive systems to maintain safety by constructing local barrier functions for the individual subsystems instead of a global one, allowing for much easier and more efficient derivation. To validate our results, we numerically demonstrate its effectiveness using an epidemiological model.

MCML Authors
Philipp Scholl

Mathematical Foundations of Artificial Intelligence

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence


[1457]
M. Keicher.
Multimodal Deep Learning for Holistic Clinical Decision and Reasoning Support.
Dissertation 2024. URL
Abstract

In clinical decision-making, medical doctors rely not only on a multitude of information about a patient, including lab results and imaging data, but also on their extensive knowledge gained through formal education and experience with previously treated patients. This thesis explores clinical decision support systems based on deep learning that integrate multimodal knowledge about a patient with formal and exemplar clinical knowledge while providing insight into their reasoning.

MCML Authors
Matthias Keicher

Dr.

Computer Aided Medical Procedures & Augmented Reality


[1456]
L. Gosch, M. Sabanayagam, D. Ghoshdastidar and S. Günnemann.
Provable Robustness of (Graph) Neural Networks Against Data Poisoning and Backdoor Attacks.
AdvML-Frontiers @NeurIPS 2024 - 3rd Workshop on New Frontiers in Adversarial Machine Learning at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Generalization of machine learning models can be severely compromised by data poisoning, where adversarial changes are applied to the training data. This vulnerability has led to interest in certifying (i.e., proving) that such changes up to a certain magnitude do not affect test predictions. We, for the first time, certify Graph Neural Networks (GNNs) against poisoning attacks, including backdoors, targeting the node features of a given graph. Our certificates are white-box and based upon (i) the neural tangent kernel, which characterizes the training dynamics of sufficiently wide networks; and (ii) a novel reformulation of the bilevel optimization describing poisoning as a mixed-integer linear program. We note that our framework is more general and constitutes the first approach to derive white-box poisoning certificates for NNs, which can be of independent interest beyond graph-related tasks.

MCML Authors
Lukas Gosch

Data Analytics & Machine Learning

Debarghya Ghoshdastidar

Prof. Dr.

Theoretical Foundations of Artificial Intelligence

Stephan Günnemann

Prof. Dr.

Data Analytics & Machine Learning


[1455]
C. Bülte, P. Scholl and G. Kutyniok.
Probabilistic predictions with Fourier neural operators.
BDU @NeurIPS 2024 - Workshop Bayesian Decision-making and Uncertainty: from probabilistic and spatiotemporal modeling to sequential experiment design at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Neural networks have been successfully applied in modeling partial differential equations, especially in dynamical systems. Commonly used models, such as neural operators, are performing well at deterministic prediction tasks, but lack a quantification of the uncertainty inherent in many complex systems, for example weather forecasting. In this paper, we explore a new approach that combines Fourier neural operators with generative modeling based on strictly proper scoring rules in order to create well-calibrated probabilistic predictions of dynamical systems. We demonstrate improved predictive uncertainty for our approach, especially in settings with very high inherent uncertainty.
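
A standard example of the strictly proper scoring rules the abstract refers to is the CRPS, which for an ensemble forecast has a simple closed-form estimate; a sketch (our own illustration, not the paper's training objective):

```python
def crps_ensemble(samples, y):
    """Empirical CRPS of an ensemble forecast against observation y.

    CRPS = E|X - y| - 0.5 * E|X - X'|, estimated from the ensemble.
    Lower is better; the rule is strictly proper, so it rewards both
    calibration and sharpness of the predictive distribution.
    """
    m = len(samples)
    term1 = sum(abs(x - y) for x in samples) / m
    term2 = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return term1 - term2
```

A sharp ensemble centred on the truth scores near zero, while an equally sharp but biased ensemble is penalised, which is the behaviour that makes such rules suitable for training well-calibrated probabilistic predictors.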

MCML Authors
Christopher Bülte

Mathematical Foundations of Artificial Intelligence

Philipp Scholl

Mathematical Foundations of Artificial Intelligence

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence


[1454]
A. Koebler, T. Decker, I. Thon, V. Tresp and F. Buettner.
Incremental Uncertainty-aware Performance Monitoring with Labeling Intervention.
BDU @NeurIPS 2024 - Workshop Bayesian Decision-making and Uncertainty: from probabilistic and spatiotemporal modeling to sequential experiment design at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

We study the problem of monitoring machine learning models under temporal distribution shifts, where circumstances change gradually over time, often leading to unnoticed yet significant declines in accuracy. We propose Incremental Uncertainty-aware Performance Monitoring (IUPM), a novel label-free method that estimates model performance by modeling time-dependent shifts using optimal transport. IUPM also quantifies uncertainty in performance estimates and introduces an active labeling strategy to reduce this uncertainty. We further showcase the benefits of IUPM on different datasets and simulated temporal shifts over existing baselines.

MCML Authors
Thomas Decker

Database Systems and Data Mining

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1453]
A. White, A. Büttner, M. Gelbrecht, N. Kilbertus, F. Hellmann and N. Boers.
Projected Neural Differential Equations for Power Grid Modeling with Constraints.
D3S3 @NeurIPS 2024 - Workshop on Data-driven and Differentiable Simulations, Surrogates, and Solvers at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Neural differential equations offer a powerful approach for data-driven simulation. However, many applications in science and engineering possess known constraints that should be obeyed by the learned model. We introduce projected neural differential equations (PNDEs), a new method for constraining neural differential equations based on projection of the learned vector field to the tangent space of the constraint manifold. In tests on two challenging examples from power grid modeling, PNDEs outperform existing methods while requiring fewer hyperparameters. Our approach demonstrates significant potential for enhancing the modeling of constrained dynamical systems, particularly in complex domains like power grid dynamics where accuracy and reliability are essential.
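
The projection idea can be illustrated on the simplest constraint manifold, the unit circle: project the vector field onto the tangent space before each integration step so the constraint is preserved. A toy sketch under our own setup (plain Python, explicit Euler), not the paper's implementation:

```python
import math

def project_to_tangent(x, v):
    """Remove the component of v normal to the sphere ||x|| = 1."""
    n = math.sqrt(sum(xi * xi for xi in x))
    u = [xi / n for xi in x]  # unit normal of the constraint surface
    dot = sum(ui * vi for ui, vi in zip(u, v))
    return [vi - dot * ui for vi, ui in zip(v, u)]

def euler_projected(f, x0, dt=1e-3, steps=1000):
    """Explicit Euler on the projected field x' = P(x) f(x)."""
    x = list(x0)
    for _ in range(steps):
        v = project_to_tangent(x, f(x))
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

With an outward-drifting field, the unprojected trajectory would leave the circle exponentially fast, while the projected one keeps ||x|| ≈ 1 up to discretisation error.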

MCML Authors
Niki Kilbertus

Prof. Dr.

Ethics in Systems Design and Machine Learning


[1452]
B. Cong, N. Daheim, Y. Shen, D. Cremers, R. Yokota, M. Khan and T. Möllenhoff.
Variational Low-Rank Adaptation Using IVON.
FITML @NeurIPS 2024 - Workshop Fine-Tuning in Modern Machine Learning: Principles and Scalability at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Vancouver, Canada, Dec 10-15, 2024. URL GitHub
Abstract

We show that variational learning can significantly improve the accuracy and calibration of Low-Rank Adaptation (LoRA) without a substantial increase in the cost. We replace AdamW by the Improved Variational Online Newton (IVON) algorithm to finetune large language models. For Llama-2 with 7 billion parameters, IVON improves the accuracy over AdamW by 2.8% and expected calibration error by 4.6%. The accuracy is also better than the other Bayesian alternatives, yet the cost is lower and the implementation is easier. Our work provides additional evidence for the effectiveness of IVON for large language models.

MCML Authors
Yuesong Shen

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1451]
E. Ailer, N. Dern, J. Hartford and N. Kilbertus.
Targeted Sequential Indirect Experiment Design.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Scientific hypotheses typically concern specific aspects of complex, imperfectly understood or entirely unknown mechanisms, such as the effect of gene expression levels on phenotypes or how microbial communities influence environmental health. Such queries are inherently causal (rather than purely associational), but in many settings, experiments can not be conducted directly on the target variables of interest, but are indirect. Therefore, they perturb the target variable, but do not remove potential confounding factors. If, additionally, the resulting experimental measurements are multi-dimensional and the studied mechanisms nonlinear, the query of interest is generally not identified. We develop an adaptive strategy to design indirect experiments that optimally inform a targeted query about the ground truth mechanism in terms of sequentially narrowing the gap between an upper and lower bound on the query. While the general formulation consists of a bi-level optimization procedure, we derive an efficiently estimable analytical kernel-based estimator of the bounds for the causal effect, a query of key interest, and demonstrate the efficacy of our approach in confounded, multivariate, nonlinear synthetic settings.

MCML Authors
Elisabeth Ailer

* Former Member

Niki Kilbertus

Prof. Dr.

Ethics in Systems Design and Machine Learning


[1450]
A. Bonfanti, G. Bruno and C. Cipriani.
The Challenges of the Nonlinear Regime for Physics-Informed Neural Networks.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

The Neural Tangent Kernel (NTK) viewpoint is widely employed to analyze the training dynamics of overparameterized Physics-Informed Neural Networks (PINNs). However, unlike the case of linear Partial Differential Equations (PDEs), we show how the NTK perspective falls short in the nonlinear scenario. Specifically, we establish that the NTK yields a random matrix at initialization that is not constant during training, contrary to conventional belief. Another significant difference from the linear regime is that, even in the idealistic infinite-width limit, the Hessian does not vanish and hence it cannot be disregarded during training. This motivates the adoption of second-order optimization methods. We explore the convergence guarantees of such methods in both linear and nonlinear cases, addressing challenges such as spectral bias and slow convergence. Every theoretical result is supported by numerical examples with both linear and nonlinear PDEs, and we highlight the benefits of second-order methods in benchmark test cases.

MCML Authors
Cristina Cipriani

Dr.

* Former Member


[1449]
R. Dhahri, A. Immer, B. Charpentier, S. Günnemann and V. Fortuin.
Shaving Weights with Occam's Razor: Bayesian Sparsification for Neural Networks Using the Marginal Likelihood.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Neural network sparsification is a promising avenue to save computational time and memory costs, especially in an age where many successful AI models are becoming too large to naïvely deploy on consumer hardware. While much work has focused on different weight pruning criteria, the overall sparsifiability of the network, i.e., its capacity to be pruned without quality loss, has often been overlooked. We present Sparsifiability via the Marginal likelihood (SpaM), a pruning framework that highlights the effectiveness of using the Bayesian marginal likelihood in conjunction with sparsity-inducing priors for making neural networks more sparsifiable. Our approach implements an automatic Occam’s razor that selects the most sparsifiable model that still explains the data well, both for structured and unstructured sparsification. In addition, we demonstrate that the pre-computed posterior Hessian approximation used in the Laplace approximation can be re-used to define a cheap pruning criterion, which outperforms many existing (more expensive) approaches. We demonstrate the effectiveness of our framework, especially at high sparsity levels, across a range of different neural network architectures and datasets.

MCML Authors
Stephan Günnemann

Prof. Dr.

Data Analytics & Machine Learning

Vincent Fortuin

Dr.

Bayesian Deep Learning


[1448]
L. Eyring, S. Karthik, K. Roth, A. Dosovitskiy and Z. Akata.
ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub
Abstract

Text-to-Image (T2I) models have made significant advancements in recent years, but they still struggle to accurately capture intricate details specified in complex compositional prompts. While fine-tuning T2I models with reward objectives has shown promise, it suffers from ‘reward hacking’ and may not generalize well to unseen prompt distributions. In this work, we propose Reward-based Noise Optimization (ReNO), a novel approach that enhances T2I models at inference by optimizing the initial noise based on the signal from one or multiple human preference reward models. Remarkably, solving this optimization problem with gradient ascent for 50 iterations yields impressive results on four different one-step models across two competitive benchmarks, T2I-CompBench and GenEval. Within a computational budget of 20-50 seconds, ReNO-enhanced one-step models consistently surpass the performance of all current open-source Text-to-Image models. Extensive user studies demonstrate that our model is preferred nearly twice as often compared to the popular SDXL model and is on par with the proprietary Stable Diffusion 3 with 8B parameters. Moreover, given the same computational resources, a ReNO-optimized one-step model outperforms widely-used open-source models such as SDXL and PixArt-α, highlighting the efficiency and effectiveness of ReNO in enhancing T2I model performance at inference time.
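
The core loop, gradient ascent on the initial noise against a reward, can be shown on a toy analytic reward; everything here (names, the quadratic reward, the update rule) is our simplification, not the paper's text-to-image setup, where the gradient would come from backpropagating a preference reward through a one-step generator:

```python
def optimize_noise(reward_grad, z0, lr=0.1, iters=50):
    """Gradient ascent on the initial noise z to maximise a reward.

    reward_grad(z) must return the gradient of reward(g(z)) with
    respect to z, where g is the (frozen) generator.
    """
    z = list(z0)
    for _ in range(iters):
        g = reward_grad(z)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z
```

Note that only the noise is updated; the generator's weights stay fixed, which is what makes this an inference-time enhancement rather than fine-tuning.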

MCML Authors
Luca Eyring

Interpretable and Reliable Machine Learning

Shyamgopal Karthik

Interpretable and Reliable Machine Learning

Karsten Roth

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[1447]
F. Hoppe, C. M. Verdun, H. Laus, F. Krahmer and H. Rauhut.
Non-Asymptotic Uncertainty Quantification in High-Dimensional Learning.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Uncertainty quantification (UQ) is a crucial but challenging task in many high-dimensional regression or learning problems to increase the confidence of a given predictor. We develop a new data-driven approach for UQ in regression that applies both to classical regression approaches such as the LASSO as well as to neural networks. One of the most notable UQ techniques is the debiased LASSO, which modifies the LASSO to allow for the construction of asymptotic confidence intervals by decomposing the estimation error into a Gaussian and an asymptotically vanishing bias component. However, in real-world problems with finite-dimensional data, the bias term is often too significant to be neglected, resulting in overly narrow confidence intervals. Our work rigorously addresses this issue and derives a data-driven adjustment that corrects the confidence intervals for a large class of predictors by estimating the means and variances of the bias terms from training data, exploiting high-dimensional concentration phenomena. This gives rise to non-asymptotic confidence intervals, which can help avoid overestimating uncertainty in critical applications such as MRI diagnosis. Importantly, our analysis extends beyond sparse regression to data-driven predictors like neural networks, enhancing the reliability of model-based deep learning. Our findings bridge the gap between established theory and the practical applicability of such debiased methods.

MCML Authors
Hannah Laus

Optimization & Data Analysis

Felix Krahmer

Prof. Dr.

Optimization & Data Analysis

Holger Rauhut

Prof. Dr.

Mathematical Data Science and Artificial Intelligence


[1446]
A. Javanmardi, D. Stutz and E. Hüllermeier.
Conformalized Credal Set Predictors.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Credal sets are sets of probability distributions that are considered as candidates for an imprecisely known ground-truth distribution. In machine learning, they have recently attracted attention as an appealing formalism for uncertainty representation, in particular due to their ability to represent both the aleatoric and epistemic uncertainty in a prediction. However, the design of methods for learning credal set predictors remains a challenging problem. In this paper, we make use of conformal prediction for this purpose. More specifically, we propose a method for predicting credal sets in the classification task, given training data labeled by probability distributions. Since our method inherits the coverage guarantees of conformal prediction, our conformal credal sets are guaranteed to be valid with high probability (without any assumptions on model or distribution). We demonstrate the applicability of our method to natural language inference, a highly ambiguous natural language task where it is common to obtain multiple annotations per example.

MCML Authors
Alireza Javanmardi

Artificial Intelligence and Machine Learning

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1445]
A. H. Kargaran, F. Yvon and H. Schütze.
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source reproducible pipeline; and (iii) is rigorously cleaned from noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it - including the pipeline, language identification model, and filters - available to the research community.

MCML Authors
Amir Hossein Kargaran

Computational Linguistics

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1444]
F. Köhler, S. Niedermayr, R. Westermann and N. Thuerey.
APEBench: A Benchmark for Autoregressive Neural Emulators of PDEs.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub
Abstract

We introduce the Autoregressive PDE Emulator Benchmark (APEBench), a comprehensive benchmark suite to evaluate autoregressive neural emulators for solving partial differential equations. APEBench is based on JAX and provides a seamlessly integrated differentiable simulation framework employing efficient pseudo-spectral methods, enabling 46 distinct PDEs across 1D, 2D, and 3D. Facilitating systematic analysis and comparison of learned emulators, we propose a novel taxonomy for unrolled training and introduce a unique identifier for PDE dynamics that directly relates to the stability criteria of classical numerical methods. APEBench enables the evaluation of diverse neural architectures, and unlike existing benchmarks, its tight integration of the solver enables support for differentiable physics training and neural-hybrid emulators. Moreover, APEBench emphasizes rollout metrics to understand temporal generalization, providing insights into the long-term behavior of emulating PDE dynamics. In several experiments, we highlight the similarities between neural emulators and numerical simulators.

MCML Authors
Rüdiger Westermann

Prof. Dr.

Computer Graphics & Visualization

Nils Thuerey

Prof. Dr.

Physics-based Simulation


[1443]
M. Kollovieh, B. Charpentier, D. Zügner and S. Günnemann.
Expected Probabilistic Hierarchies.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Hierarchical clustering has usually been addressed by discrete optimization using heuristics or continuous optimization of relaxed scores for hierarchies. In this work, we propose to optimize expected scores under a probabilistic model over hierarchies. (1) We show theoretically that the global optimal values of the expected Dasgupta cost and Tree-Sampling divergence (TSD), two unsupervised metrics for hierarchical clustering, are equal to the optimal values of their discrete counterparts contrary to some relaxed scores. (2) We propose Expected Probabilistic Hierarchies (EPH), a probabilistic model to learn hierarchies in data by optimizing expected scores. EPH uses differentiable hierarchy sampling enabling end-to-end gradient descent based optimization, and an unbiased subgraph sampling approach to scale to large datasets. (3) We evaluate EPH on synthetic and real-world datasets including vector and graph datasets. EPH outperforms all other approaches quantitatively and provides meaningful hierarchies in qualitative evaluations.

MCML Authors
Marcel Kollovieh

Data Analytics & Machine Learning

Daniel Zügner

Dr.

* Former Member

Stephan Günnemann

Prof. Dr.

Data Analytics & Machine Learning


[1442]
G. Ma, Y. Wang, D. Lim, S. Jegelka and Y. Wang.
A Canonicalization Perspective on Invariant and Equivariant Learning.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub
Abstract

In many applications, we desire neural networks to exhibit invariance or equivariance to certain groups due to symmetries inherent in the data. Recently, frame-averaging methods emerged as a unified framework for attaining symmetries efficiently by averaging over input-dependent subsets of the group, i.e., frames. What we currently lack is a principled understanding of the design of frames. In this work, we introduce a canonicalization perspective that provides an essential and complete view of the design of frames. Canonicalization is a classic approach for attaining invariance by mapping inputs to their canonical forms. We show that there exists an inherent connection between frames and canonical forms. Leveraging this connection, we can efficiently compare the complexity of frames as well as determine the optimality of certain frames. Guided by this principle, we design novel frames for eigenvectors that are strictly superior to existing methods – some are even optimal – both theoretically and empirically. The reduction to the canonicalization perspective further uncovers equivalences between previous methods. These observations suggest that canonicalization provides a fundamental understanding of existing frame-averaging methods and unifies existing equivariant and invariant learning methods.
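Canonicalization is easy to see in the eigenvector setting the paper studies: ±v describe the same eigenspace, so mapping both to one canonical representative makes any downstream feature sign-invariant. A toy sketch (the paper's frames for eigenvectors are considerably more elaborate):

```python
def canonicalize_sign(v, eps=1e-12):
    """Map v and -v to the same canonical representative by flipping
    the sign so the first entry with |v_k| > eps is positive."""
    for x in v:
        if abs(x) > eps:
            return list(v) if x > 0 else [-e for e in v]
    return list(v)  # zero vector: already canonical

def invariant_feature(v):
    # any function applied to the canonical form is sign-invariant
    return sum(k * x for k, x in enumerate(canonicalize_sign(v)))

v = [-0.3, 0.8, -0.5]
assert canonicalize_sign(v) == canonicalize_sign([-x for x in v])
assert invariant_feature(v) == invariant_feature([-x for x in v])
```

Frame averaging generalizes this: instead of a single canonical form, one averages over an input-dependent subset of the group, and the paper shows the two views are inherently connected.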

MCML Authors
Stefanie Jegelka

Prof. Dr.

Foundations of Deep Neural Networks


[1441]
Y. Ma, V. Melnychuk, J. Schweisthal and S. Feuerriegel.
DiffPO: A causal diffusion model for learning distributions of potential outcomes.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Predicting potential outcomes of interventions from observational data is crucial for decision-making in medicine, but the task is challenging due to the fundamental problem of causal inference. Existing methods are largely limited to point estimates of potential outcomes with no uncertainty quantification; thus, the full information about the distributions of potential outcomes is typically ignored. In this paper, we propose a novel causal diffusion model called DiffPO, which is carefully designed for reliable inferences in medicine by learning the distribution of potential outcomes. In our DiffPO, we leverage a tailored conditional denoising diffusion model to learn complex distributions, where we address the selection bias through a novel orthogonal diffusion loss. Another strength of our DiffPO method is that it is highly flexible (e.g., it can also be used to estimate different causal quantities such as CATE). Across a wide range of experiments, we show that our method achieves state-of-the-art performance.

MCML Authors
Yuchen Ma

Artificial Intelligence in Management

Valentyn Melnychuk

Artificial Intelligence in Management

Jonas Schweisthal

Artificial Intelligence in Management

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1440]
V. Melnychuk, S. Feuerriegel and M. van der Schaar.
Quantifying Aleatoric Uncertainty of the Treatment Effect: A Novel Orthogonal Learner.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Estimating causal quantities from observational data is crucial for understanding the safety and effectiveness of medical treatments. However, to make reliable inferences, medical practitioners require not only estimating averaged causal quantities, such as the conditional average treatment effect, but also understanding the randomness of the treatment effect as a random variable. This randomness is referred to as aleatoric uncertainty and is necessary for understanding the probability of benefit from treatment or quantiles of the treatment effect. Yet, the aleatoric uncertainty of the treatment effect has received surprisingly little attention in the causal machine learning community. To fill this gap, we aim to quantify the aleatoric uncertainty of the treatment effect at the individualized (covariate-conditional) level, namely, the conditional distribution of the treatment effect (CDTE). Unlike average causal quantities, the CDTE is not point identifiable without strong additional assumptions. As a remedy, we employ partial identification to obtain sharp bounds on the CDTE and thereby quantify the aleatoric uncertainty of the treatment effect. We then develop a novel, orthogonal learner for the bounds on the CDTE, which we call AU-learner. We further show that our AU-learner has several strengths in that it satisfies Neyman-orthogonality and is doubly robust. Finally, we propose a fully-parametric deep learning instantiation of our AU-learner.
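Partial identification of the distribution of Y(1) − Y(0) is classically done with Makarov-style bounds built from the two marginal CDFs. A plain-Python sketch of such plug-in bounds on empirical data — illustrative only; the AU-learner is an orthogonal, doubly robust estimator of sharp bounds, not this direct plug-in:

```python
import bisect

def ecdf(sample):
    """Right-continuous empirical CDF of a sample."""
    s = sorted(sample)
    n = len(s)
    return lambda x: bisect.bisect_right(s, x) / n

def makarov_bounds(y1, y0, t, grid):
    """Makarov bounds on P(Y1 - Y0 <= t) from the two marginals alone:
    lower = sup_y max(F1(y) - F0(y - t), 0),
    upper = 1 + inf_y min(F1(y) - F0(y - t), 0)."""
    F1, F0 = ecdf(y1), ecdf(y0)
    diffs = [F1(y) - F0(y - t) for y in grid]
    lower = max(0.0, max(diffs))
    upper = min(1.0, 1.0 + min(diffs))
    return lower, upper

y1 = [1.0, 2.0, 3.0, 4.0]   # outcomes under treatment
y0 = [0.5, 1.5, 2.5, 3.5]   # outcomes under control
grid = [x / 10 for x in range(-50, 100)]
lo, up = makarov_bounds(y1, y0, t=0.0, grid=grid)
assert 0.0 <= lo <= up <= 1.0  # the CDTE is only partially identified
```

The width of the interval [lo, up] is precisely the aleatoric uncertainty that remains even with infinite data, because the joint coupling of Y(1) and Y(0) is never observed.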

MCML Authors
Valentyn Melnychuk

Artificial Intelligence in Management

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1439]
M. Muschalik, H. Baniecki, F. Fumagalli, P. Kolpaczki, B. Hammer and E. Hüllermeier.
shapiq: Shapley Interactions for Machine Learning.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub
Abstract

Originally rooted in game theory, the Shapley Value (SV) has recently become an important tool in machine learning research. Perhaps most notably, it is used for feature attribution and data valuation in explainable artificial intelligence. Shapley Interactions (SIs) naturally extend the SV and address its limitations by assigning joint contributions to groups of entities, which enhance understanding of black box machine learning models. Due to the exponential complexity of computing SVs and SIs, various methods have been proposed that exploit structural assumptions or yield probabilistic estimates given limited resources. In this work, we introduce shapiq, an open-source Python package that unifies state-of-the-art algorithms to efficiently compute SVs and any-order SIs in an application-agnostic framework. Moreover, it includes a benchmarking suite containing 11 machine learning applications of SIs with pre-computed games and ground-truth values to systematically assess computational performance across domains. For practitioners, shapiq is able to explain and visualize any-order feature interactions in predictions of models, including vision transformers, language models, as well as XGBoost and LightGBM with TreeSHAP-IQ. With shapiq, we extend shap beyond feature attributions and consolidate the application of SVs and SIs in machine learning that facilitates future research.
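The exponential complexity mentioned above is visible in the exact Shapley formula, which averages a player's marginal contribution over all coalitions. A brute-force sketch of the definition (shapiq's contribution is precisely to avoid this enumeration with efficient, any-order approximations):

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions.
    `value` maps a frozenset of players to the coalition's worth."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(n):
            for S in combinations(others, r):
                S = frozenset(S)
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (value(S | {i}) - value(S))
        phi[i] = total
    return phi

# toy game: worth is the sum of individual weights (purely additive)
w = {"a": 1.0, "b": 2.0, "c": 3.0}
phi = shapley_values(list(w), lambda S: sum(w[p] for p in S))
assert all(abs(phi[p] - w[p]) < 1e-9 for p in w)  # SV recovers additive payoffs
```

Shapley Interactions extend this by assigning joint contributions to groups of players, which is what reveals feature interactions rather than only individual attributions.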

MCML Authors
Maximilian Muschalik

Artificial Intelligence and Machine Learning

Patrick Kolpaczki

Artificial Intelligence and Machine Learning

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1438]
T. Nagler, L. Schneider, B. Bischl and M. Feurer.
Reshuffling Resampling Splits Can Improve Generalization of Hyperparameter Optimization.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub
Abstract

Hyperparameter optimization is crucial for obtaining peak performance of machine learning models. The standard protocol evaluates various hyperparameter configurations using a resampling estimate of the generalization error to guide optimization and select a final hyperparameter configuration. Without much evidence, paired resampling splits, i.e., either a fixed train-validation split or a fixed cross-validation scheme, are often recommended. We show that, surprisingly, reshuffling the splits for every configuration often improves the final model’s generalization performance on unseen data. Our theoretical analysis explains how reshuffling affects the asymptotic behavior of the validation loss surface and provides a bound on the expected regret in the limiting regime. This bound connects the potential benefits of reshuffling to the signal and noise characteristics of the underlying optimization problem. We confirm our theoretical results in a controlled simulation study and demonstrate the practical usefulness of reshuffling in a large-scale, realistic hyperparameter optimization experiment. While reshuffling leads to test performances that are competitive with using fixed splits, it drastically improves results for a single train-validation holdout protocol and can often make holdout become competitive with standard CV while being computationally cheaper.
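The reshuffling protocol itself is simple: draw a fresh train–validation split for every configuration instead of fixing one split upfront. A toy sketch with a trivial stand-in "model" (all names hypothetical, not the paper's experimental code):

```python
import random

def holdout_score(config, data, rng, frac=0.8):
    """Evaluate one configuration on a *fresh* train/validation split."""
    data = data[:]
    rng.shuffle(data)                      # reshuffled split per configuration
    cut = int(frac * len(data))
    train, val = data[:cut], data[cut:]
    mean = sum(x for x, _ in train) / len(train)   # 'fit' a trivial model
    # validation loss of predicting y from the shifted mean 'config'
    return sum((y - (mean + config)) ** 2 for _, y in val) / len(val)

rng = random.Random(0)
data = [(x, 2.0 * x) for x in [i / 10 for i in range(50)]]
configs = [0.0, 0.5, 1.0]
# each configuration sees its own split, in contrast to a fixed holdout
scores = {c: holdout_score(c, data, rng) for c in configs}
best = min(scores, key=scores.get)
assert best in configs
```

The paper's finding is that this per-configuration reshuffling changes the validation loss surface in a way that often improves the generalization of the finally selected configuration, especially for a single holdout split.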

MCML Authors
Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

Lennart Schneider

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science


[1437]
R. Paolino, S. Maskey, P. Welke and G. Kutyniok.
Weisfeiler and Leman Go Loopy: A New Hierarchy for Graph Representational Learning.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub
Abstract

We introduce r-loopy Weisfeiler-Leman (r-ℓWL), a novel hierarchy of graph isomorphism tests and a corresponding GNN framework, r-ℓMPNN, that can count cycles up to length r+2. Most notably, we show that r-ℓWL can count homomorphisms of cactus graphs. This strictly extends classical 1-WL, which can only count homomorphisms of trees and, in fact, is incomparable to k-WL for any fixed k. We empirically validate the expressive and counting power of the proposed r-ℓMPNN on several synthetic datasets and present state-of-the-art predictive performance on various real-world datasets.
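The baseline that r-ℓWL strictly extends is classical 1-WL colour refinement, whose blind spot for cycles is easy to demonstrate: it cannot distinguish a 6-cycle from two disjoint triangles, since both graphs are 2-regular. A compact sketch:

```python
def wl_colors(adj, rounds=3):
    """1-WL colour refinement: iteratively hash each node's colour
    together with the multiset of its neighbours' colours."""
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                      for v in adj}
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return sorted(colors.values())

def cycle(n, offset=0):
    return {offset + i: [offset + (i - 1) % n, offset + (i + 1) % n]
            for i in range(n)}

c6 = cycle(6)
two_triangles = {**cycle(3), **cycle(3, offset=3)}
# 1-WL sees identical colour histograms -- exactly the cycle
# information that the r-loopy hierarchy adds
assert wl_colors(c6) == wl_colors(two_triangles)
```

r-ℓWL resolves such pairs by counting cycles up to length r+2, which is what lifts the test beyond tree homomorphism counting.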

MCML Authors
Raffaele Paolino

Mathematical Foundations of Artificial Intelligence

Sohir Maskey

Mathematical Foundations of Artificial Intelligence

Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence


[1436]
D. Rügamer, B. X. W. Liew, Z. Altai and A. Stöcker.
A Functional Extension of Semi-Structured Networks.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Semi-structured networks (SSNs) merge the structures familiar from additive models with deep neural networks, allowing the modeling of interpretable partial feature effects while capturing higher-order non-linearities at the same time. A significant challenge in this integration is maintaining the interpretability of the additive model component. Inspired by large-scale biomechanics datasets, this paper explores extending SSNs to functional data. Existing methods in functional data analysis are promising but often not expressive enough to account for all interactions and non-linearities and do not scale well to large datasets. Although the SSN approach presents a compelling potential solution, its adaptation to functional data remains complex. In this work, we propose a functional SSN method that retains the advantageous properties of classical functional regression approaches while also improving scalability. Our numerical experiments demonstrate that this approach accurately recovers underlying signals, enhances predictive performance, and performs favorably compared to competing methods.

MCML Authors
David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[1435]
R. Stolz, H. Krasowski, J. Thumm, M. Eichelbeck, P. Gassert and M. Althoff.
Excluding the Irrelevant: Focusing Reinforcement Learning through Continuous Action Masking.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Continuous action spaces in reinforcement learning (RL) are commonly defined as multidimensional intervals. While intervals usually reflect the action boundaries for tasks well, they can be challenging for learning because the typically large global action space leads to frequent exploration of irrelevant actions. Yet, little task knowledge can be sufficient to identify significantly smaller state-specific sets of relevant actions. Focusing learning on these relevant actions can significantly improve training efficiency and effectiveness. In this paper, we propose to focus learning on the set of relevant actions and introduce three continuous action masking methods for exactly mapping the action space to the state-dependent set of relevant actions. Thus, our methods ensure that only relevant actions are executed, enhancing the predictability of the RL agent and enabling its use in safety-critical applications. We further derive the implications of the proposed methods on the policy gradient. Using proximal policy optimization (PPO), we evaluate our methods on four control tasks, where the relevant action set is computed based on the system dynamics and a relevant state set. Our experiments show that the three action masking methods achieve higher final rewards and converge faster than the baseline without action masking.
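One simple way to realize a continuous mask is to squash the raw policy output into the state-dependent interval of relevant actions. A sketch of such a mapping — the paper proposes three distinct masking methods and derives their policy-gradient implications; this shows only the general idea, with hypothetical per-state bounds:

```python
import math

def masked_action(raw_action, state_bounds):
    """Squash an unbounded policy output into the state-dependent
    interval of relevant actions via an affine tanh mapping."""
    low, high = state_bounds
    return low + (high - low) * (math.tanh(raw_action) + 1.0) / 2.0

# the relevant action set depends on the state (bounds here are made up)
bounds_for_state = {"slow": (0.0, 0.3), "fast": (0.5, 1.0)}
for state, bounds in bounds_for_state.items():
    for raw in (-10.0, -1.0, 0.0, 1.0, 10.0):
        a = masked_action(raw, bounds)
        assert bounds[0] <= a <= bounds[1]  # only relevant actions executed
```

Because the executed action is guaranteed to lie in the relevant set, the agent never wastes exploration on irrelevant actions, which is the source of the reported speed-ups.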

MCML Authors
Hanna Krasowski

Dr.

* Former Member

Michael Eichelbeck

Cyber Physical Systems

Matthias Althoff

Prof. Dr.

Cyber Physical Systems


[1434]
V. Udandarao, K. Roth, S. Dziadzio, A. Prabhu, M. Cherti, O. Vinyals, O. Hénaff, S. Albanie, Z. Akata and M. Bethge.
A Practitioner's Guide to Real-World Continual Multimodal Pretraining.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub
Abstract

Multimodal foundation models serve numerous applications at the intersection of vision and language. Still, despite being pretrained on extensive data, they become outdated over time. To keep models updated, research into continual pretraining mainly explores scenarios with either (1) infrequent, indiscriminate updates on large-scale new data, or (2) frequent, sample-level updates. However, practical model deployment often operates in the gap between these two limit cases, as real-world applications often demand adaptation to specific subdomains, tasks or concepts – spread over the entire, varying life cycle of a model. In this work, we complement current perspectives on continual pretraining through a research test bed as well as provide comprehensive guidance for effective continual model updates in such scenarios. We first introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements, constructed over 63 datasets with diverse visual and semantic coverage. Using FoMo-in-Flux, we explore the complex landscape of practical continual pretraining through multiple perspectives: (1) A data-centric investigation of data mixtures and stream orderings that emulate real-world deployment situations, (2) a method-centric investigation ranging from simple fine-tuning and traditional continual learning strategies to parameter-efficient updates and model merging, (3) meta learning rate schedules and mechanistic design choices, and (4) the influence of model and compute scaling. Together, our insights provide a practitioner’s guide to continual multimodal pretraining for real-world deployment.

MCML Authors
Karsten Roth

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[1433]
J. Wang, M. Ghahremani, Y. Li, B. Ommer and C. Wachinger.
Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub
Abstract

Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeleton human poses, especially in complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to gain accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representation during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons. Masked pose images are used to smoothly refine the attention maps based on target pose-related features in a hierarchical manner, transitioning from coarse to fine levels. Additionally, our loss function is formulated to allocate increased emphasis to the pose region, thereby augmenting the model’s precision in capturing intricate pose details. We assessed the performance of Stable-Pose across five public datasets under a wide range of indoor and outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 in the LAION-Human dataset, marking around 13% improvement over the established technique ControlNet.

MCML Authors
Morteza Ghahremani

Dr.

Artificial Intelligence in Medical Imaging

Yitong Li

Artificial Intelligence in Medical Imaging

Björn Ommer

Prof. Dr.

Computer Vision & Learning

Christian Wachinger

Prof. Dr.

Artificial Intelligence in Medical Imaging


[1432]
Y. Wang, K. Hu, S. Gupta, Z. Ye, Y. Wang and S. Jegelka.
Understanding the Role of Equivariance in Self-supervised Learning.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub
Abstract

Contrastive learning has been a leading paradigm for self-supervised learning, but it is widely observed that it comes at the price of sacrificing useful features (e.g., colors) by being invariant to data augmentations. Given this limitation, there has been a surge of interest in equivariant self-supervised learning (E-SSL) that learns features to be augmentation-aware. However, even for the simplest rotation prediction method, there is a lack of rigorous understanding of why, when, and how E-SSL learns useful features for downstream tasks. To bridge this gap between practice and theory, we establish an information-theoretic perspective to understand the generalization ability of E-SSL. In particular, we identify a critical explaining-away effect in E-SSL that creates a synergy between the equivariant and classification tasks. This synergy effect encourages models to extract class-relevant features to improve their equivariant predictions, which, in turn, benefits downstream tasks requiring semantic features. Based on this perspective, we theoretically analyze the influence of data transformations and reveal several principles for practical designs of E-SSL. Our theory not only aligns well with existing E-SSL methods but also sheds light on new directions by exploring the benefits of model equivariance. We believe that a theoretically grounded understanding on the role of equivariance would inspire more principled and advanced designs in this field.

MCML Authors
Stefanie Jegelka

Prof. Dr.

Foundations of Deep Neural Networks


[1431]
Y. Wang, Y. Wu, Z. Wei, S. Jegelka and Y. Wang.
A Theoretical Understanding of Self-Correction through In-context Alignment.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub
Abstract

Going beyond mimicking limited human experiences, recent studies show initial evidence that, like humans, large language models (LLMs) are capable of improving their abilities purely by self-correction, i.e., correcting previous responses through self-examination, in certain circumstances. Nevertheless, little is known about how such capabilities arise. In this work, based on a simplified setup akin to an alignment task, we theoretically analyze self-correction from an in-context learning perspective, showing that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way. Notably, going beyond previous theories on over-simplified linear transformers, our theoretical construction underpins the roles of several key designs of realistic transformers for self-correction: softmax attention, multi-head attention, and the MLP block. We validate these findings extensively on synthetic datasets. Inspired by these findings, we also illustrate novel applications of self-correction, such as defending against LLM jailbreaks, where a simple self-correction step does make a large difference. We believe that these findings will inspire further research on understanding, exploiting, and enhancing self-correction for building better foundation models.

MCML Authors
Stefanie Jegelka

Prof. Dr.

Foundations of Deep Neural Networks


[1430]
D. Winkel, N. Strauß, M. Bernhard, Z. Li, T. Seidl and M. Schubert.
Autoregressive Policy Optimization for Constrained Allocation Tasks.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub
Abstract

Allocation tasks represent a class of problems where a limited amount of resources must be allocated to a set of entities at each time step. Prominent examples of this task include portfolio optimization or distributing computational workloads across servers. Allocation tasks are typically bound by linear constraints describing practical requirements that have to be strictly fulfilled at all times. In portfolio optimization, for example, investors may be obligated to allocate less than 30% of the funds into a certain industrial sector in any investment period. Such constraints restrict the action space of allowed allocations in intricate ways, which makes learning a policy that avoids constraint violations difficult. In this paper, we propose a new method for constrained allocation tasks based on an autoregressive process to sequentially sample allocations for each entity. In addition, we introduce a novel de-biasing mechanism to counter the initial bias caused by sequential sampling. We demonstrate the superior performance of our approach compared to a variety of Constrained Reinforcement Learning (CRL) methods on three distinct constrained allocation tasks: portfolio optimization, computational workload distribution, and a synthetic allocation benchmark.
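The autoregressive sampling idea can be sketched for a simplex with per-entity caps: each entity's feasible interval is conditioned on the remaining budget and on what later entities can still absorb, so the constraints hold by construction. (The paper additionally introduces a de-biasing mechanism for this sequential scheme, which this sketch omits.)

```python
import random

def sample_allocation(caps, rng):
    """Sequentially sample allocation fractions a_i with sum(a) == 1
    and a_i <= caps[i], conditioning each step on what is left."""
    n = len(caps)
    alloc = []
    remaining = 1.0
    for i in range(n):
        cap_rest = sum(caps[i + 1:])          # what later entities can absorb
        low = max(0.0, remaining - cap_rest)  # must take at least this much
        high = min(caps[i], remaining)        # cannot exceed cap or budget
        a = rng.uniform(low, high) if i < n - 1 else remaining
        alloc.append(a)
        remaining -= a
    return alloc

rng = random.Random(42)
caps = [0.5, 0.3, 0.4, 0.6]  # e.g. sector limits in portfolio optimization
for _ in range(100):
    a = sample_allocation(caps, rng)
    assert abs(sum(a) - 1.0) < 1e-9
    assert all(x <= c + 1e-9 for x, c in zip(a, caps))
```

Sampling entity by entity is what makes intricate linear constraints tractable: each conditional distribution only needs a one-dimensional feasible interval.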

MCML Authors
David Winkel

Database Systems and Data Mining

Niklas Strauß

Dr.

Spatial Artificial Intelligence

Maximilian Bernhard

Dr.

* Former Member

Zongyue Li

Spatial Artificial Intelligence

Thomas Seidl

Prof. Dr.

Database Systems and Data Mining

Matthias Schubert

Prof. Dr.

Spatial Artificial Intelligence


[1429]
M. Yau, N. Karalias, E. Lu, J. Xu and S. Jegelka.
Are Graph Neural Networks Optimal Approximation Algorithms?
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

In this work we design graph neural network architectures that capture optimal approximation algorithms for a large class of combinatorial optimization problems, using powerful algorithmic tools from semidefinite programming (SDP). Concretely, we prove that polynomial-sized message-passing algorithms can represent the most powerful polynomial time algorithms for Max Constraint Satisfaction Problems assuming the Unique Games Conjecture. We leverage this result to construct efficient graph neural network architectures, OptGNN, that obtain high-quality approximate solutions on landmark combinatorial optimization problems such as Max-Cut, Min-Vertex-Cover, and Max-3-SAT. Our approach achieves strong empirical results across a wide range of real-world and synthetic datasets against solvers and neural baselines. Finally, we take advantage of OptGNN’s ability to capture convex relaxations to design an algorithm for producing bounds on the optimal solution from the learned embeddings of OptGNN.
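The final step — turning learned embeddings into a discrete solution — mirrors Goemans–Williamson hyperplane rounding for Max-Cut. A sketch with hand-set embeddings standing in for OptGNN outputs (illustrative only):

```python
import random

def round_embeddings_maxcut(embeddings, edges, rng, trials=50):
    """Randomized hyperplane rounding: partition nodes by the sign of
    their embedding's projection onto a random direction, keep the
    best cut found over several trials."""
    dim = len(next(iter(embeddings.values())))
    best_cut, best_side = -1, None
    for _ in range(trials):
        h = [rng.gauss(0, 1) for _ in range(dim)]  # random hyperplane normal
        side = {v: sum(a * b for a, b in zip(vec, h)) >= 0
                for v, vec in embeddings.items()}
        cut = sum(1 for u, v in edges if side[u] != side[v])
        if cut > best_cut:
            best_cut, best_side = cut, side
    return best_cut, best_side

# toy: embeddings that already separate a bipartite structure
emb = {0: [1.0, 0.0], 1: [-1.0, 0.1], 2: [0.9, -0.1], 3: [-0.95, 0.0]}
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
cut, _ = round_embeddings_maxcut(emb, edges, random.Random(0), trials=100)
assert cut == 4  # the 4-cycle is bipartite; the max cut uses every edge
```

This is the sense in which OptGNN's embeddings capture a convex relaxation: good embeddings make the rounding step recover near-optimal discrete solutions.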

MCML Authors
Stefanie Jegelka

Prof. Dr.

Foundations of Deep Neural Networks


[1428]
Y. Zhang, Y. Li, X. Wang, Q. Shen, B. Plank, B. Bischl, M. Rezaei and K. Kawaguchi.
FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models.
NeurIPS 2024 - Workshop on Machine Learning and Compression at the 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Overparametrized transformer networks are the state-of-the-art architecture for Large Language Models (LLMs). However, such models contain billions of parameters making large compute a necessity, while raising environmental concerns. To address these issues, we propose FinerCut, a new form of fine-grained layer pruning, which in contrast to prior work at the transformer block level, considers all self-attention and feed-forward network (FFN) layers within blocks as individual pruning candidates. FinerCut prunes layers whose removal causes minimal alteration to the model’s output – contributing to a new, lean, interpretable, and task-agnostic pruning method. Tested across 9 benchmarks, our approach retains 90% performance of Llama3-8B with 25% layers removed, and 95% performance of Llama3-70B with 30% layers removed, all without fine-tuning or post-pruning reconstruction. Strikingly, we observe intriguing results with FinerCut: 42% (34 out of 80) of the self-attention layers in Llama3-70B can be removed while preserving 99% of its performance – without additional fine-tuning after removal. Moreover, FinerCut provides a tool to inspect the types and locations of pruned layers, allowing us to observe interesting pruning behaviors. For instance, we observe a preference for pruning self-attention layers, often at deeper consecutive decoder layers. We hope our insights inspire future efficient LLM architecture designs.
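The selection criterion — prune the layer whose removal perturbs the output least — can be sketched with a toy model whose "layers" are plain functions. This greedy variant is for illustration only; FinerCut's actual procedure operates on the attention and FFN sublayers of real transformers:

```python
def output_deviation(layers, pruned_idx, x):
    """Run the model with and without one layer, measure the change."""
    def forward(skip=None):
        h = x
        for i, f in enumerate(layers):
            if i != skip:
                h = f(h)
        return h
    return abs(forward() - forward(skip=pruned_idx))

def finer_grained_prune(layers, x, budget):
    """Greedy fine-grained pruning: repeatedly drop the single layer
    whose removal changes the model output least."""
    kept = list(range(len(layers)))
    for _ in range(budget):
        best = min(kept, key=lambda i: output_deviation(
            [layers[j] for j in kept], kept.index(i), x))
        kept.remove(best)
    return kept

# toy 'layers': one near-identity layer should be pruned first
layers = [lambda h: h * 2.0, lambda h: h + 0.001, lambda h: h - 1.0]
kept = finer_grained_prune(layers, x=1.0, budget=1)
assert kept == [0, 2]  # the near-identity layer (index 1) was removed
```

Scoring every sublayer individually, rather than whole blocks, is what makes the method "finer-grained" and what exposes patterns like the redundancy of deep self-attention layers.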

MCML Authors
Yawei Li

Statistical Learning and Data Science

Xinpeng Wang

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Mina Rezaei

Dr.

Statistical Learning and Data Science


[1427]
M. Koshil, T. Nagler, M. Feurer and K. Eggensperger.
Towards Localization via Data Embedding for TabPFN.
TLR @NeurIPS 2024 - 3rd Table Representation Learning Workshop at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Prior-data fitted networks (PFNs), especially TabPFN, have shown significant promise in tabular data prediction. However, their scalability is limited by the quadratic complexity of the transformer architecture’s attention across training points. In this work, we propose a method to localize TabPFN, which embeds data points into a learned representation and performs nearest neighbor selection in this space. We evaluate it across six datasets, demonstrating its superior performance over standard TabPFN when scaling to larger datasets. We also explore its design choices and analyze the bias-variance trade-off of this localization method, showing that it reduces bias while maintaining manageable variance. This work opens up a pathway for scaling TabPFN to arbitrarily large tabular datasets.
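Localization here means selecting the in-context training set by nearest-neighbour search in a learned embedding space, so the quadratic attention cost only applies to a small context. A minimal sketch with a stand-in embedding (`embed` is hypothetical, not the learned representation from the paper):

```python
def embed(x):
    """Hypothetical learned embedding; here just a fixed nonlinear map."""
    return [x[0] + x[1], x[0] - x[1]]

def local_context(train, query, k):
    """Select the k nearest training points in embedding space to use
    as the in-context training set for a PFN forward pass."""
    q = embed(query)
    def dist(row):
        e = embed(row[0])
        return sum((a - b) ** 2 for a, b in zip(e, q))
    return sorted(train, key=dist)[:k]

train = [((0.0, 0.0), "a"), ((0.1, 0.1), "a"),
         ((5.0, 5.0), "b"), ((5.1, 4.9), "b")]
context = local_context(train, query=(0.05, 0.0), k=2)
assert all(label == "a" for _, label in context)  # nearby points selected
```

Because only k points enter the transformer's attention, the approach scales to training sets far larger than TabPFN's native context limit.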

MCML Authors
Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science


[1426]
B. M. G. Nielsen, L. Gresele and A. Dittadi.
Challenges in Explaining Representational Similarity through Identifiability.
UniReps @NeurIPS 2024 - 2nd Workshop on Unifying Representations in Neural Models at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

The phenomenon of different deep learning models producing similar data representations has garnered significant attention, raising the question of why such representational similarity occurs. Identifiability theory offers a partial explanation: for a broad class of discriminative models, including many popular in representation learning, those assigning equal likelihood to the observations yield representations that are equal up to a linear transformation, if a suitable diversity condition holds. In this work, we identify two key challenges in applying identifiability theory to explain representational similarity. First, the assumption of exact likelihood equality is rarely satisfied by practical models trained with different initializations. To address this, we describe how the representations of two models deviate from being linear transformations of each other, based on their difference in log-likelihoods. Second, we demonstrate that even models with similar and near-optimal loss values can produce highly dissimilar representations due to an underappreciated difference between loss and likelihood. Our findings highlight key open questions and point to future research directions for advancing the theoretical understanding of representational similarity.

MCML Authors
Andrea Dittadi

Dr.

Algorithmic Machine Learning & Explainable AI


[1425]
T. Beker and X. Zhu.
Volcanic Deformation Monitoring utilizing Deep Learning and Wavelet Transform.
AGU 2024 - American Geophysical Union Annual Meeting. Washington D.C., USA, Dec 09-13, 2024. URL
Abstract

There are 20-50 new volcanic eruptions annually, which often do not have onsite monitoring. InSAR can be used to globally monitor volcanic deformations, even in hard-to-reach areas. With state-of-the-art persistent and distributed scatterer processing, InSAR data can even point to the volcanoes’ subtle, few mm/year changes and deep learning (DL) models can red-flag them. Our research leverages the practical application of DL with a classification architecture, InceptionResNet v2, to identify InSAR data containing volcanic deformations. We utilize 5-year-long deformation maps covering the Central Volcanic Zone in the South American Andes, reserving the area known for its volcanoes for testing. The remaining data, in combination with synthetic volcanic deformations, is used for training. The explainability tool, Grad-CAM, shows that due to the nature of subtle volcanic deformations observed by InSAR, the model is struggling to delineate and distinguish volcanic deformation signals. We use wavelet transformations and filtering to enhance the data and improve the DL model performance. The Daubechies 2 wavelet transform accentuates subtle large-surface signals, which are often volcanic in nature, while removing the subtle high-frequency patterns. The DL models are trained, and each is tested on the data with a different number of wavelet transforms from 0-4. The model trained and tested on original data achieved a 64.02% AUC ROC average over 3 runs, while when tested on data twice transformed by the wavelet transform, it improved to 84.14% AUC ROC average over 3 runs. These findings show that the Daubechies 2 wavelet transform cleans the data while accentuating the volcanic deformation. It also enlarges small point-deformation sources of high intensity, which can be addressed by filtering beforehand. The models trained and used in this way detect all 5 different subtle volcanic deformations in the region, the smallest being 5 mm/year.
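The Daubechies 2 transform used for enhancement is a standard filter-bank operation: convolve with a low-pass and a high-pass filter and downsample by two, so smooth large-surface signals concentrate in the approximation band. A single-level, periodic-boundary sketch in plain Python (the study's InSAR preprocessing is of course more involved):

```python
import math

# Daubechies-2 (db2) filter taps
s3 = math.sqrt(3.0)
h = [(1 + s3) / (4 * math.sqrt(2)), (3 + s3) / (4 * math.sqrt(2)),
     (3 - s3) / (4 * math.sqrt(2)), (1 - s3) / (4 * math.sqrt(2))]
g = [h[3], -h[2], h[1], -h[0]]  # quadrature-mirror high-pass

def dwt_db2(signal):
    """One level of a periodic db2 DWT: circular convolution with the
    low-/high-pass filters, downsampled by two."""
    n = len(signal)
    approx, detail = [], []
    for i in range(0, n, 2):
        approx.append(sum(h[k] * signal[(i + k) % n] for k in range(4)))
        detail.append(sum(g[k] * signal[(i + k) % n] for k in range(4)))
    return approx, detail

# a smooth (constant) signal keeps all its energy in the approximation band
approx, detail = dwt_db2([2.0] * 8)
assert all(abs(d) < 1e-12 for d in detail)
```

Repeating the transform (the study found two levels best) progressively suppresses high-frequency noise while amplifying the broad, slow deformation patterns the classifier needs.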

MCML Authors
Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1424]
C. Leiber, N. Strauß, M. Schubert and T. Seidl.
Dying Clusters Is All You Need — Deep Clustering With an Unknown Number of Clusters.
DLC @ICDM 2024 - 6th Workshop on Deep Learning and Clustering at the 24th IEEE International Conference on Data Mining (ICDM 2024). Abu Dhabi, United Arab Emirates, Dec 09-12, 2024. DOI GitHub
Abstract

Finding meaningful groups, i.e., clusters, in high-dimensional data such as images or texts without labeled data at hand is an important challenge in data mining. In recent years, deep clustering methods have achieved remarkable results in these tasks. However, most of these methods require the user to specify the number of clusters in advance. This is a major limitation since the number of clusters is typically unknown if labeled data is unavailable. Thus, an area of research has emerged that addresses this problem. Most of these approaches estimate the number of clusters separately from the clustering process. This results in a strong dependency of the clustering result on the quality of the initial embedding. Other approaches are tailored to specific clustering processes, making them hard to adapt to other scenarios. In this paper, we propose UNSEEN, a general framework that, starting from a given upper bound, is able to estimate the number of clusters. To the best of our knowledge, it is the first method that can be easily combined with various deep clustering algorithms. We demonstrate the applicability of our approach by combining UNSEEN with the popular deep clustering algorithms DCN, DEC, and DKM and verify its effectiveness through an extensive experimental evaluation on several image and tabular datasets. Moreover, we perform numerous ablations to analyze our approach and show the importance of its components.
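The core idea, starting from an upper bound on the number of clusters and letting under-populated clusters 'die', can be illustrated outside any deep network with a k-means-style loop. This is a sketch of the general principle only, not the authors' UNSEEN implementation; all names are ours:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def prune_clustering(points, k_max, min_size=3, iters=10, seed=0):
    """Start from k_max centers; clusters that fall below min_size die."""
    centers = random.Random(seed).sample(points, k_max)
    for _ in range(iters):
        # assign each point to its nearest surviving center
        buckets = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            buckets[j].append(p)
        # "dying clusters": keep only sufficiently populated clusters
        survivors = [mean(b) for b in buckets if len(b) >= min_size]
        if not survivors:  # safeguard: never drop all centers
            break
        centers = survivors
    return centers
```

The number of surviving centers can only shrink from the initial upper bound, which mirrors the idea of estimating the cluster count inside, rather than before, the clustering process.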

MCML Authors
Collin Leiber

Dr.

* Former Member

Niklas Strauß

Dr.

Spatial Artificial Intelligence

Matthias Schubert

Prof. Dr.

Spatial Artificial Intelligence

Thomas Seidl

Prof. Dr.

Database Systems and Data Mining


[1423]
M. Bernhard.
Deep learning methods for image recognition in remote sensing.
Dissertation 2024. DOI
Abstract

In this dissertation, we present solutions to various image recognition problems in remote sensing. In doing so, we harness the characteristics of remote sensing images and address the specific challenges that come with them. Overall, the methods presented in this dissertation cover the tasks of image classification, object detection, semantic segmentation, and change detection, as well as learning settings with full, incomplete, and noisy supervision. (Shortened).

MCML Authors
Maximilian Bernhard

Dr.

* Former Member


[1422]
A. Beer, P. Weber, L. Miklautz, C. Leiber, W. Durani, C. Böhm and C. Plant.
SHADE: Deep Density-based Clustering.
ICDM 2024 - 24th IEEE International Conference on Data Mining. Abu Dhabi, United Arab Emirates, Dec 09-12, 2024. DOI
Abstract

Detecting arbitrarily shaped clusters in high-dimensional noisy data is challenging for current clustering methods. We introduce SHADE (Structure-preserving High-dimensional Analysis with Density-based Exploration), the first deep clustering algorithm that incorporates density-connectivity into its loss function. Similar to existing deep clustering algorithms, SHADE supports high-dimensional and large data sets with the expressive power of a deep autoencoder. In contrast to most existing deep clustering methods that rely on a centroid-based clustering objective, SHADE incorporates a novel loss function that captures density-connectivity. SHADE thereby learns a representation that enhances the separation of density-connected clusters. SHADE detects a stable clustering and noise points fully automatically without any user input. It outperforms existing methods in clustering quality, especially on data that contain non-Gaussian clusters, such as video data. Moreover, the embedded space of SHADE is suitable for visualization and interpretation of the clustering results as the individual shapes of the clusters are preserved.

MCML Authors
Anna Beer

Dr.

* Former Member

Collin Leiber

Dr.

* Former Member

Walid Durani

Database Systems and Data Mining

Christian Böhm

Prof. Dr.

* Former Principal Investigator


[1421]
V. Basile, S. Casola, S. Frenda and S. M. Lo.
PERSEID - Perspectivist Irony Detection: A CALAMITA Challenge.
CLiC-it 2024 - 10th Italian Conference on Computational Linguistics. Pisa, Italy, Dec 04-06, 2024. URL
Abstract

Works in perspectivism and human label variation have emphasized the need to collect and leverage various voices and points of view in the whole Natural Language Processing pipeline. PERSEID places itself in this line of work. We consider the task of irony detection from short social media conversations in Italian collected from Twitter (X) and Reddit. To do so, we leverage data from MultiPICO, a recent multilingual dataset with disaggregated annotations and annotators’ metadata, containing 1,000 (post, reply) pairs with five annotations each on average. We aim to evaluate whether prompting LLMs with additional annotators’ demographic information (namely gender only, age only, and the combination of the two) results in improved performance compared to a baseline in which only the input text is provided. The evaluation is zero-shot, and we evaluate the results on the disaggregated annotations using the F1 score.

MCML Authors
Silvia Casola

Dr.

AI and Computational Linguistics


[1420]
T. Bourgeade, S. Casola, A. M. Wizani and C. Bosco.
Data Augmentation through Back-Translation for Stereotypes and Irony Detection.
CLiC-it 2024 - 10th Italian Conference on Computational Linguistics. Pisa, Italy, Dec 04-06, 2024. URL
Abstract

Complex linguistic phenomena such as stereotypes or irony are still challenging to detect, particularly due to the lower availability of annotated data. In this paper, we explore Back-Translation (BT) as a data augmentation method to enhance such datasets by artificially introducing semantics-preserving variations. We investigate French and Italian as source languages on two multilingual datasets annotated for the presence of stereotypes or irony and evaluate French/Italian, English, and Arabic as pivot languages for the BT process. We also investigate cross-translation, i.e., augmenting one language subset of a multilingual dataset with translated instances from the other languages. We conduct an intrinsic evaluation of the quality of back-translated instances, identifying linguistic or translation model-specific errors that may occur with BT. We also perform an extrinsic evaluation of different data augmentation configurations to train a multilingual Transformer-based classifier for stereotype or irony detection on mono-lingual data.
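The back-translation loop itself is simple: translate to a pivot language and back, then add the round-trip variant with the original label. The sketch below stubs the translator with a toy lexicon so it is self-contained; a real setup would call an MT model instead, and all names here are ours:

```python
# Back-translation skeleton: source -> pivot -> source.
# LEX is a toy word-level lexicon standing in for a real MT system.
LEX = {("it", "en"): {"ciao": "hello", "mondo": "world"},
       ("en", "it"): {"hello": "ciao", "world": "mondo"}}

def toy_translate(text, src, tgt):
    """Stand-in translator: word-by-word lookup, unknown words pass through."""
    table = LEX[(src, tgt)]
    return " ".join(table.get(w, w) for w in text.split())

def back_translate(text, src="it", pivot="en"):
    """Augment `text` by a round trip through the pivot language."""
    return toy_translate(toy_translate(text, src, pivot), pivot, src)

def augment(dataset):
    """Add one back-translated variant per instance, keeping the label."""
    return dataset + [(back_translate(t), y) for t, y in dataset]
```

With the toy lexicon the round trip is lossless; actual MT systems introduce the paraphrastic, semantics-preserving variation that makes back-translation useful as augmentation, which is why the paper evaluates the quality of the generated instances intrinsically.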

MCML Authors
Silvia Casola

Dr.

AI and Computational Linguistics


[1419]
S. Frenda, A. Piergentili, B. Savoldi, M. Madeddu, M. Rosola, S. Casola, C. Ferrando, V. Patti, M. Negri and L. Bentivogli.
GFG - Gender-Fair Generation: A CALAMITA Challenge.
CLiC-it 2024 - 10th Italian Conference on Computational Linguistics. Pisa, Italy, Dec 04-06, 2024. URL
Abstract

Gender-fair language aims at promoting gender equality by using terms and expressions that include all identities and avoid reinforcing gender stereotypes. Implementing gender-fair strategies is particularly challenging in heavily gender-marked languages, such as Italian. To address this, the Gender-Fair Generation challenge intends to help shift toward gender-fair language in written communication. The challenge, designed to assess and monitor the recognition and generation of gender-fair language in both mono- and cross-lingual scenarios, includes three tasks: (1) the detection of gendered expressions in Italian sentences, (2) the reformulation of gendered expressions into gender-fair alternatives, and (3) the generation of gender-fair language in automatic translation from English to Italian. The challenge relies on three different annotated datasets: the GFL-it corpus, which contains Italian texts extracted from administrative documents provided by the University of Brescia; GeNTE, a bilingual test set for gender-neutral rewriting and translation built upon a subset of the Europarl dataset; and Neo-GATE, a bilingual test set designed to assess the use of non-binary neomorphemes in Italian for both fair formulation and translation tasks. Finally, each task is evaluated with specific metrics: average of F1-score obtained by means of BERTScore computed on each entry of the datasets for task 1, an accuracy measured with a gender-neutral classifier, and a coverage-weighted accuracy for tasks 2 and 3.

MCML Authors
Silvia Casola

Dr.

AI and Computational Linguistics


[1418]
A. Triantafyllopoulos and B. W. Schuller.
Hearing aids in the era of foundation models.
GMS Zeitschrift für Audiologie 6.28 (Dec. 2024). DOI
Abstract

The recent introduction of foundation models (FMs) has taken the world by storm. Ranging from large language models (LLMs) to image and audio analysis and generation, FMs have introduced a new paradigm in artificial intelligence (AI), one where practitioners transition from standard supervised machine learning to prompting and in-context learning. This has implications for hearing aid research, and specifically for the use of such models for noise attenuation and speech enhancement. Even though the uptake of FMs is minimal to non-existent for this application domain, mainly due to the prohibitive computational complexity of those models, there are nevertheless ways to benefit from FM advances in an indirect way. We review these approaches in the present contribution.

MCML Authors
Andreas Triantafyllopoulos

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics


[1417]
U. Fischer Abaigar, C. Kern, N. Barda and F. Kreuter.
Bridging the gap: Towards an expanded toolkit for AI-driven decision-making in the public sector.
Government Information Quarterly 41.4 (Dec. 2024). DOI
Abstract

AI-driven decision-making systems are becoming instrumental in the public sector, with applications spanning areas like criminal justice, social welfare, financial fraud detection, and public health. While these systems offer great potential benefits to institutional decision-making processes, such as improved efficiency and reliability, these systems face the challenge of aligning machine learning (ML) models with the complex realities of public sector decision-making. In this paper, we examine five key challenges where misalignment can occur, including distribution shifts, label bias, the influence of past decision-making on the data side, as well as competing objectives and human-in-the-loop on the model output side. Our findings suggest that standard ML methods often rely on assumptions that do not fully account for these complexities, potentially leading to unreliable and harmful predictions. To address this, we propose a shift in modeling efforts from focusing solely on predictive accuracy to improving decision-making outcomes. We offer guidance for selecting appropriate modeling frameworks, including counterfactual prediction and policy learning, by considering how the model estimand connects to the decision-maker’s utility. Additionally, we outline technical methods that address specific challenges within each modeling approach. Finally, we argue for the importance of external input from domain experts and stakeholders to ensure that model assumptions and design choices align with real-world policy objectives, taking a step towards harmonizing AI and public sector objectives.

MCML Authors
Unai Fischer Abaigar

Social Data Science and AI Lab

Christoph Kern

Prof. Dr.

Social Data Science and AI Lab

Frauke Kreuter

Prof. Dr.

Social Data Science and AI


[1416]
T. Hannan, R. Koner, M. Bernhard, S. Shit, B. Menze, V. Tresp, M. Schubert and T. Seidl.
GRAtt-VIS: Gated Residual Attention for Video Instance Segmentation.
ICPR 2024 - 27th International Conference on Pattern Recognition. Kolkata, India, Dec 01-05, 2024. DOI GitHub
Abstract

Recent trends in Video Instance Segmentation (VIS) have seen a growing reliance on online methods to model complex and lengthy video sequences. However, the degradation of representation and noise accumulation of the online methods, especially during occlusion and abrupt changes, pose substantial challenges. Transformer-based query propagation provides promising directions at the cost of quadratic memory attention. However, they are susceptible to the degradation of instance features due to the above-mentioned challenges and suffer from cascading effects. The detection and rectification of such errors remain largely underexplored. To this end, we introduce GRAtt-VIS, Gated Residual Attention for Video Instance Segmentation. Firstly, we leverage a Gumbel-Softmax-based gate to detect possible errors in the current frame. Next, based on the gate activation, we rectify degraded features from its past representation. Such a residual configuration alleviates the need for dedicated memory and provides a continuous stream of relevant instance features. Secondly, we propose a novel inter-instance interaction using gate activation as a mask for self-attention. This masking strategy dynamically restricts the unrepresentative instance queries in the self-attention and preserves vital information for long-term tracking. We refer to this novel combination of Gated Residual Connection and Masked Self-Attention as the GRAtt block, which can easily be integrated into the existing propagation-based framework. Further, GRAtt blocks significantly reduce the attention overhead and simplify dynamic temporal modeling. GRAtt-VIS achieves state-of-the-art performance on YouTube-VIS and the highly challenging OVIS dataset, significantly improving over previous methods.

MCML Authors
Tanveer Hannan

Database Systems and Data Mining

Rajat Koner

Database Systems and Data Mining

Maximilian Bernhard

Dr.

* Former Member

Volker Tresp

Prof. Dr.

Database Systems and Data Mining

Matthias Schubert

Prof. Dr.

Spatial Artificial Intelligence

Thomas Seidl

Prof. Dr.

Database Systems and Data Mining


[1415]
Y. N. Böck, H. Boche, F. H. P. Fitzek and G. Kutyniok.
Computing-Model and Computing-Hardware Selection for ICT Under Societal and Judicial Constraints.
IEEE Access 12 (Dec. 2024). DOI
Abstract

This article discusses a formalization of aspects of Cyber-Sovereignty (CyS) for information and communication technology (ICT), linking them to technological trustworthiness and deriving an associated paradigm for hard- and software design. The upcoming 6G ICT standard is considered a keystone within modern society’s increasing interconnectedness and automatization, as it provides the necessary technological infrastructure for applications such as the Metaverse or large-scale digital twinning. Since emerging technological systems increasingly affect sensitive human goods, hard- and software manufacturers must consider a new dimension of societal and judicial constraints in the context of technological trustworthiness. This article aims to establish a formalized theory of specific aspects of CyS, providing a paradigm for hard- and software engineering in ICT. This paradigm is directly applicable in formal technology assessment and ensures that the relevant facets of CyS – specifically, the principle of Algorithmic Transparency (AgT) – are satisfied. The framework follows an axiomatic approach. Particularly, the formal basis of our theory consists of four fundamental assumptions about the general nature of physical problems and algorithmic implementations. This formal basis allows for drawing general conclusions on the relation between CyS and technological trustworthiness and entails a formal meta-thesis on AgT in digital computing.

MCML Authors
Gitta Kutyniok

Prof. Dr.

Mathematical Foundations of Artificial Intelligence


[1414]
A. Höhl, I. Obadic, M.-Á. Fernández-Torres, H. Najjar, D. Oliveira, Z. Akata, A. Dengel and X. Zhu.
Opening the Black Box: A systematic review on explainable artificial intelligence in remote sensing.
IEEE Geoscience and Remote Sensing Magazine 12.4 (Dec. 2024). DOI
Abstract

In recent years, black-box machine learning approaches have become a dominant modeling paradigm for knowledge extraction in remote sensing. Despite the potential benefits of uncovering the inner workings of these models with explainable AI, a comprehensive overview summarizing the explainable AI methods used and their objectives, findings, and challenges in remote sensing applications is still missing. In this paper, we address this gap by performing a systematic review to identify the key trends in the field and shed light on novel explainable AI approaches and emerging directions that tackle specific remote sensing challenges. We also reveal the common patterns of explanation interpretation, discuss the extracted scientific insights, and reflect on the approaches used for the evaluation of explainable AI methods. As such, our review provides a complete summary of the state-of-the-art of explainable AI in remote sensing. Further, we give a detailed outlook on the challenges and promising research directions, representing a basis for novel methodological development and a useful starting point for new researchers in the field.

MCML Authors
Adrian Höhl

Data Science in Earth Observation

Ivica Obadic

Data Science in Earth Observation

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1413]
S. Zhao, Z. Chen, Z. Xiong, Y. Shi, S. Saha and X. Zhu.
Beyond Grid Data: Exploring graph neural networks for Earth observation.
IEEE Geoscience and Remote Sensing Magazine Early Access (Dec. 2024). DOI
Abstract

Earth Observation (EO) data analysis has been significantly revolutionized by deep learning (DL), with applications typically limited to grid-like data structures. Graph Neural Networks (GNNs) emerge as an important innovation, propelling DL into the non-Euclidean domain. Naturally, GNNs can effectively tackle the challenges posed by diverse modalities, multiple sensors, and the heterogeneous nature of EO data. To introduce GNNs in the related domains, our review begins by offering fundamental knowledge on GNNs. Then, we summarize the generic problems in EO, to which GNNs can offer potential solutions. Following this, we explore a broad spectrum of GNNs’ applications to scientific problems in Earth systems, covering areas such as weather and climate analysis, disaster management, air quality monitoring, agriculture, land cover classification, hydrological process modeling, and urban modeling. The rationale behind adopting GNNs in these fields is explained, alongside methodologies for organizing graphs and designing favorable architectures for various tasks. Furthermore, we highlight methodological challenges of implementing GNNs in these domains and possible solutions that could guide future research. While acknowledging that GNNs are not a universal solution, we conclude the paper by comparing them with other popular architectures like transformers and analyzing their potential synergies.

MCML Authors
Zhaiyu Chen

Data Science in Earth Observation

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1412]
N. Saberi, M. H. Shaker, C. R. Duguay, K. A. Scott and E. Hüllermeier.
Uncertainty Estimation of Lake Ice Cover Maps From a Random Forest Classifier Using MODIS TOA Reflectance Data.
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 18 (Dec. 2024). DOI
Abstract

This article presents a method to improve the usability of lake ice cover (LIC) maps generated from moderate resolution imaging spectroradiometer (MODIS) top-of-atmosphere reflectance data by providing estimates of aleatoric and epistemic uncertainty. We used a random forest (RF) classifier, which has been shown to have superior performance in classifying lake ice, open water, and clouds, to generate daily LIC maps with inherent (aleatoric) and model (epistemic) uncertainties. RF allows for the learning of different hypotheses (trees), producing diverse predictions that can be utilized to quantify aleatoric and epistemic uncertainty. We use a decomposition of Shannon entropy to quantify these uncertainties and apply pixel-based uncertainty estimation. Our results show that using uncertainty values to reject the classification of uncertain pixels significantly improves recall and precision. The method presented herein is under consideration for integration into the processing chain implemented for the production of daily LIC maps as part of the European Space Agency’s Climate Change Initiative (CCI+) Lakes project.
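The entropy-based decomposition described above is standard: total uncertainty is the entropy of the averaged tree prediction, aleatoric uncertainty is the mean entropy of the individual trees, and epistemic uncertainty is their difference (the mutual information). A minimal sketch for one pixel, assuming per-tree class distributions are available; names are ours:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(q * math.log(q, 2) for q in p if q > 0)

def uncertainty_decomposition(tree_probs):
    """Decompose uncertainty for one pixel given per-tree class distributions.

    Returns (total, aleatoric, epistemic):
      total     = entropy of the ensemble-averaged distribution
      aleatoric = average entropy of the individual trees
      epistemic = total - aleatoric (disagreement between trees)
    """
    n_trees = len(tree_probs)
    n_classes = len(tree_probs[0])
    mean = [sum(t[c] for t in tree_probs) / n_trees for c in range(n_classes)]
    total = entropy(mean)
    aleatoric = sum(entropy(t) for t in tree_probs) / n_trees
    return total, aleatoric, total - aleatoric
```

Two extremes illustrate the split: trees that all predict 50/50 give pure aleatoric uncertainty, while trees that confidently disagree give pure epistemic uncertainty; pixels high in either can then be rejected, which is what improves recall and precision in the study.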

MCML Authors
Mohammad Hossein Shaker

Artificial Intelligence and Machine Learning

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1411]
Q. Sun, A. Akman, X. Jing, M. Milling and B. W. Schuller.
Audio-based Kinship Verification Using Age Domain Conversion.
IEEE Signal Processing Letters 32 (Dec. 2024). DOI
Abstract

Audio-based kinship verification (AKV) is important in many domains, such as home security monitoring, forensic identification, and social network analysis. A key challenge in the task arises from differences in age across samples from different individuals, which can be interpreted as a domain bias in a cross-domain verification task. To address this issue, we design the notion of an ‘age-standardised domain’ wherein we utilise the optimised CycleGAN-VC3 network to perform age-audio conversion to generate the in-domain audio. The generated audio dataset is employed to extract a range of features, which are then fed into a metric learning architecture to verify kinship. Experiments are conducted on the KAN_AV audio dataset, which contains age and kinship labels. The results demonstrate that the method markedly enhances the accuracy of kinship verification, while also offering novel insights for future kinship verification research.

MCML Authors
Björn Schuller

Prof. Dr.

Health Informatics


[1410]
L. Shen, H. Zhang, C. Zhu, R. Li, K. Qian, W. Meng, F. Tian, B. Hu, B. W. Schuller and Y. Yamamoto.
A First Look at Generative Artificial Intelligence Based Music Therapy for Mental Disorders.
IEEE Transactions on Consumer Electronics Early Access (Dec. 2024). DOI
Abstract

Mental disorders have increased rapidly in the recent decade and cause considerable harm to individuals as well as society; they have thus become a serious public health challenge. Timely treatment of mental disorders plays a critical role in reducing the harm of mental illness to individuals and society. Music therapy is a non-pharmaceutical method for treating such mental disorders. However, conventional music therapy suffers from a number of issues that limit its popularity. Thanks to the rapid development of Artificial Intelligence (AI), especially AI-Generated Content (AIGC), there is now a chance to address these issues. Nevertheless, to the best of our knowledge, there is no work investigating music therapy from an AIGC and closed-loop perspective. In this paper, we summarise some universal music therapy methods and discuss their shortcomings. Then, we indicate some AIGC techniques, especially music generation, for their application in music therapy. Moreover, we present a closed-loop music therapy system and introduce its implementation details. Finally, we discuss some challenges in AIGC-based music therapy, propose further research directions, and suggest the potential of this system to become a consumer-grade product for treating mental disorders.

MCML Authors
Björn Schuller

Prof. Dr.

Health Informatics


[1409]
Z. Chen, Y. Shi, L. Nan, Z. Xiong and X. Zhu.
PolyGNN: Polyhedron-based graph neural network for 3D building reconstruction from point clouds.
ISPRS Journal of Photogrammetry and Remote Sensing 218.A (Dec. 2024). DOI GitHub
Abstract

We present PolyGNN, a polyhedron-based graph neural network for 3D building reconstruction from point clouds. PolyGNN learns to assemble primitives obtained by polyhedral decomposition via graph node classification, achieving a watertight and compact reconstruction. To effectively represent arbitrary-shaped polyhedra in the neural network, we propose a skeleton-based sampling strategy to generate polyhedron-wise queries. These queries are then incorporated with inter-polyhedron adjacency to enhance the classification. PolyGNN is end-to-end optimizable and is designed to accommodate variable-size input points, polyhedra, and queries with an index-driven batching technique. To address the abstraction gap between existing city-building models and the underlying instances, and provide a fair evaluation of the proposed method, we develop our method on a large-scale synthetic dataset with well-defined ground truths of polyhedral labels. We further conduct a transferability analysis across cities and on real-world point clouds. Both qualitative and quantitative results demonstrate the effectiveness of our method, particularly its efficiency for large-scale reconstructions.

MCML Authors
Zhaiyu Chen

Data Science in Earth Observation

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1408]
J. Herbinger, M. N. Wright, T. Nagler, B. Bischl and G. Casalicchio.
Decomposing Global Feature Effects Based on Feature Interactions.
Journal of Machine Learning Research 25.381 (Dec. 2024). URL
Abstract

Global feature effect methods, such as partial dependence plots, provide an intelligible visualization of the expected marginal feature effect. However, such global feature effect methods can be misleading, as they do not represent local feature effects of single observations well when feature interactions are present. We formally introduce generalized additive decomposition of global effects (GADGET), which is a new framework based on recursive partitioning to find interpretable regions in the feature space such that the interaction-related heterogeneity of local feature effects is minimized. We provide a mathematical foundation of the framework and show that it is applicable to the most popular methods to visualize marginal feature effects, namely partial dependence, accumulated local effects, and Shapley additive explanations (SHAP) dependence. Furthermore, we introduce and validate a new permutation-based interaction detection procedure that is applicable to any feature effect method that fits into our proposed framework. We empirically evaluate the theoretical characteristics of the proposed methods based on various feature effect methods in different experimental settings. Moreover, we apply our introduced methodology to three real-world examples to showcase their usefulness.
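Partial dependence, the baseline that GADGET regionalizes, clamps one feature to each grid value and averages the model's predictions over the data. A minimal sketch (names are ours) that also shows why interactions make the global curve misleading:

```python
def partial_dependence(model, data, j, grid):
    """PD of feature j: average prediction with x_j clamped to each grid value."""
    out = []
    for v in grid:
        # replace feature j by v in every observation, keep the rest as observed
        preds = [model([v if k == j else xk for k, xk in enumerate(x)]) for x in data]
        out.append(sum(preds) / len(preds))
    return out
```

For `model(x) = x[0] * x[1]` with observed `x[1]` values 1 and 3, the PD curve for `x[0]` has slope 2, while the two underlying local (ICE) curves have slopes 1 and 3. The global average hides this heterogeneity, which is exactly the interaction-related effect that GADGET's recursive partitioning is designed to isolate into interpretable regions.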

MCML Authors
Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science


[1407]
H. Weingärtner, M. Windl, L. L. Chuang and F. Draxler.
Useful but Distracting: Viewer Experience with Keyword Highlights and Time-Synchronization in Captions for Language Learning.
MUM 2024 - 23rd International Conference on Mobile and Ubiquitous Multimedia. Stockholm, Sweden, Dec 01-04, 2024. DOI
Abstract

Captions are a valuable scaffold for language learners, aiding comprehension and vocabulary acquisition. Past work has proposed enhancements such as keyword highlights for increased learning gains. However, little is known about learners’ experience with enhanced captions, although this is critical for adoption in everyday life. We conducted a survey and focus group to elicit learner preferences and requirements and implemented a processing pipeline for enhanced captions with keyword highlights, time-synchronized keyword highlights, and keyword captions. A subsequent online study (n = 66) showed that time-synchronized keyword highlights were the preferred design for learning but were perceived as too distracting to replace standard captions in everyday viewing scenarios. We conclude that keyword highlights and time-synchronization are suitable for integrating learning into an entertaining everyday-life activity, but the design should be optimized to provide a more seamless experience.

MCML Authors
Maximiliane Windl

Human-Centered Ubiquitous Media


[1406]
L. B. Kuemmerle, M. D. Luecken, A. B. Firsova, L. Barros de Andrade e Sousa, L. Straßer, I. I. Mekki, F. Campi, L. Heumos, M. Shulman, V. Beliaeva, S. Hediyeh-Zadeh, A. C. Schaar, K. T. Mahbubani, A. Sountoulidis, T. Balassa, F. Kovacs, P. Horvath, M. Piraud, A. Ertürk, C. Samakovlis and F. J. Theis.
Probe set selection for targeted spatial transcriptomics.
Nature Methods 21 (Dec. 2024). DOI
Abstract

Targeted spatial transcriptomic methods capture the topology of cell types and states in tissues at single-cell and subcellular resolution by measuring the expression of a predefined set of genes. The selection of an optimal set of probed genes is crucial for capturing the spatial signals present in a tissue. This requires selecting the most informative, yet minimal, set of genes to profile (gene set selection) for which it is possible to build probes (probe design). However, current selections often rely on marker genes, precluding them from detecting continuous spatial signals or new states. We present Spapros, an end-to-end probe set selection pipeline that optimizes both gene set specificity for cell type identification and within-cell type expression variation to resolve spatially distinct populations while considering prior knowledge as well as probe design and expression constraints. We evaluated Spapros and show that it outperforms other selection approaches in both cell type recovery and recovering expression variation beyond cell types. Furthermore, we used Spapros to design a single-cell resolution in situ hybridization on tissues (SCRINSHOT) experiment of adult lung tissue to demonstrate how probes selected with Spapros identify cell types of interest and detect spatial variation even within cell types.
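At its simplest, informative-gene selection ranks genes by how strongly their mean expression separates cell types. The sketch below uses between-type variance of per-type means as a stand-in score; it is a toy illustration of the selection objective only, not the Spapros pipeline, and all names are ours:

```python
def per_type_means(values, labels):
    """Mean expression of one gene within each cell type."""
    groups = {}
    for v, t in zip(values, labels):
        groups.setdefault(t, []).append(v)
    return [sum(g) / len(g) for g in groups.values()]

def informativeness(values, labels):
    """Between-type variance of the per-type means (higher = more type-specific)."""
    m = per_type_means(values, labels)
    mu = sum(m) / len(m)
    return sum((x - mu) ** 2 for x in m) / len(m)

def select_genes(expr, labels, n_genes):
    """Pick the n_genes with the most cell-type-specific expression.

    expr: dict mapping gene name -> expression values per cell
    labels: cell type per cell
    """
    ranked = sorted(expr, key=lambda g: informativeness(expr[g], labels), reverse=True)
    return ranked[:n_genes]
```

A full probe set selection would additionally penalize redundant genes, reward within-type variation that resolves spatial subpopulations, and enforce probe-design and expression constraints, which is what distinguishes Spapros from a plain marker ranking like this one.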

MCML Authors
Fabian Theis

Prof. Dr.

Mathematical Modelling of Biological Systems


[1405]
J. Senoner, S. Schallmoser, B. Kratzwald, S. Feuerriegel and T. Netland.
Explainable AI improves task performance in human–AI collaboration.
Scientific Reports 14.31150 (Dec. 2024). DOI
Abstract

Artificial intelligence (AI) provides considerable opportunities to assist human work. However, one crucial challenge of human-AI collaboration is that many AI algorithms operate in a black-box manner, where how the AI makes predictions remains opaque. This makes it difficult for humans to validate a prediction made by AI against their own domain knowledge. For this reason, we hypothesize that augmenting humans with explainable AI as a decision aid improves task performance in human-AI collaboration. To test this hypothesis, we analyze the effect of augmenting domain experts with explainable AI in the form of visual heatmaps. We then compare participants that were either supported by (a) black-box AI or (b) explainable AI, where the latter supports them to follow AI predictions when the AI is accurate or overrule the AI when the AI predictions are wrong. We conducted two preregistered experiments with representative, real-world visual inspection tasks from manufacturing and medicine. The first experiment was conducted with factory workers from an electronics factory, who performed N=9,600 assessments of whether electronic products have defects. The second experiment was conducted with radiologists, who performed N=5,650 assessments of chest X-ray images to identify lung lesions. The results of our experiments with domain experts performing real-world tasks show that task performance improves when participants are supported by explainable AI instead of black-box AI. For example, in the manufacturing setting, we find that augmenting participants with explainable AI (as opposed to black-box AI) leads to a five-fold decrease in the median error rate of human decisions, which gives a significant improvement in task performance.

MCML Authors
Link to website

Simon Schallmoser

Artificial Intelligence in Management

Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1404]
M. Kollovieh, L. Gosch, M. Lienen, Y. Scholten, L. Schwinn and S. Günnemann.
Assessing Robustness via Score-Based Adversarial Image Generation.
Transactions on Machine Learning Research (Dec. 2024). URL
Abstract

Most adversarial attacks and defenses focus on perturbations within small ℓp-norm constraints. However, such threat models cannot capture all relevant semantics-preserving perturbations, and hence, the scope of robustness evaluations is limited. In this work, we introduce Score-Based Adversarial Generation (ScoreAG), a novel framework that leverages the advancements in score-based generative models to generate unrestricted adversarial examples that overcome the limitations of ℓp-norm constraints. Unlike traditional methods, ScoreAG maintains the core semantics of images while generating adversarial examples, either by transforming existing images or synthesizing new ones entirely from scratch. We further exploit the generative capability of ScoreAG to purify images, empirically enhancing the robustness of classifiers. Our extensive empirical evaluation demonstrates that ScoreAG improves upon the majority of state-of-the-art attacks and defenses across multiple benchmarks. This work highlights the importance of investigating adversarial examples bounded by semantics rather than ℓp-norm constraints. ScoreAG represents an important step towards more encompassing robustness assessments.
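For context, the ℓp threat model that ScoreAG moves beyond can be made concrete with a minimal numpy sketch of one ℓ∞ projected-gradient (PGD-style) step; every name and constant here is illustrative, not from the paper:

```python
import numpy as np

def pgd_step(x_adv, x_orig, grad, step=0.01, eps=0.03):
    """One projected-gradient step under an l_inf ball: move along the loss
    gradient sign, then clip back into the eps-ball around the original."""
    x_adv = x_adv + step * np.sign(grad)
    return np.clip(x_adv, x_orig - eps, x_orig + eps)

x = np.zeros(5)                              # toy "image"
g = np.array([1.0, -1.0, 2.0, 0.0, -3.0])    # toy loss gradient
x_adv = x.copy()
for _ in range(10):
    x_adv = pgd_step(x_adv, x, g)
# The perturbation can never leave the eps-ball, however many steps are taken.
```

ScoreAG's point is precisely that this hard ℓ∞/ℓp projection excludes semantics-preserving changes that a generative model can produce.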

MCML Authors
Link to website

Marcel Kollovieh

Data Analytics & Machine Learning

Link to website

Lukas Gosch

Data Analytics & Machine Learning

Link to Profile Stephan Günnemann

Stephan Günnemann

Prof. Dr.

Data Analytics & Machine Learning


[1403]
A. Baumann, R. Li, M. Klasson, S. Mentu, S. Karthik, Z. Akata, A. Solin and M. Trapp.
Post-hoc Probabilistic Vision-Language Models.
Preprint (Dec. 2024). arXiv
Abstract

Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.
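The paper derives analytic uncertainties; the same idea can be approximated by Monte-Carlo sampling from a last-layer weight posterior. Below is a toy numpy sketch with invented shapes and an isotropic Gaussian posterior, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical isotropic Gaussian posterior over the image head's last layer:
# W ~ N(W_mean, sigma^2 I), applied to a fixed penultimate feature h.
W_mean = rng.normal(size=(4, 8))   # maps an 8-d feature into a 4-d joint space
sigma = 0.1
h = rng.normal(size=8)             # penultimate image feature
t = rng.normal(size=4)             # fixed text embedding

sims = np.array([
    cosine((W_mean + sigma * rng.normal(size=W_mean.shape)) @ h, t)
    for _ in range(1000)
])
mean, std = sims.mean(), sims.std()  # predictive similarity and its spread
```

The spread of the sampled cosine similarities plays the role of the uncertainty estimate that the paper instead obtains analytically.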

MCML Authors
Link to website

Shyamgopal Karthik

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[1402]
F. Fumagalli, M. Muschalik, E. Hüllermeier, B. Hammer and J. Herbinger.
Unifying Feature-Based Explanations with Functional ANOVA and Cooperative Game Theory.
Preprint (Dec. 2024). arXiv
Abstract

Feature-based explanations, using perturbations or gradients, are a prevalent tool to understand decisions of black box machine learning models. Yet, differences between these methods still remain mostly unknown, which limits their applicability for practitioners. In this work, we introduce a unified framework for local and global feature-based explanations using two well-established concepts: functional ANOVA (fANOVA) from statistics, and the notion of value and interaction from cooperative game theory. We introduce three fANOVA decompositions that determine the influence of feature distributions, and use game-theoretic measures, such as the Shapley value and interactions, to specify the influence of higher-order interactions. Our framework combines these two dimensions to uncover similarities and differences between a wide range of explanation techniques for features and groups of features. We then empirically showcase the usefulness of our framework on synthetic and real-world datasets.
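The game-theoretic side of the framework can be illustrated with an exact Shapley-value computation for a tiny cooperative game; this is a generic textbook sketch, not the authors' code:

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n):
    """Exact Shapley values for an n-player cooperative game (small n only)."""
    phi = [0.0] * n
    for i in range(n):
        others = [p for p in range(n) if p != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (value_fn(set(S) | {i}) - value_fn(set(S)))
    return phi

# Toy game: each feature contributes its weight; features 0 and 1 interact.
weights = [1.0, 2.0, 3.0]
def v(S):
    return sum(weights[i] for i in S) + (0.5 if {0, 1} <= S else 0.0)

phi = shapley_values(v, 3)  # the pairwise bonus is split between players 0 and 1
```

By the efficiency axiom the attributions sum to the value of the grand coalition, and the 0.5 interaction bonus is shared equally by the two interacting players.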

MCML Authors
Link to website

Maximilian Muschalik

Artificial Intelligence and Machine Learning

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[1401]
J. Hingerl, A. Karollus and J. Gagneur.
Flashzoi: An enhanced Borzoi model for accelerated genomic analysis.
Preprint (Dec. 2024). DOI
Abstract

Accurately predicting how DNA sequence drives gene regulation and how genetic variants alter gene expression is a central challenge in genomics. Borzoi, which models over ten thousand genomic assays, including RNA-seq coverage, from over half a megabase of sequence context alone, promises to become an important foundation model in regulatory genomics, both for massively annotating variants and for further model development. However, its reliance on handcrafted, relative positional encodings within the transformer architecture limits its computational efficiency. Here we present Flashzoi, an enhanced Borzoi model that leverages rotary positional encodings and FlashAttention-2. This achieves over 3-fold faster training and inference and up to 2.4-fold reduced memory usage, while maintaining or improving accuracy in modeling various genomic assays including RNA-seq coverage, predicting variant effects, and enhancer-promoter linking. Flashzoi's improved efficiency facilitates large-scale genomic analyses and opens avenues for exploring more complex regulatory mechanisms and modeling.
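One of the two ingredients, rotary positional encodings, can be sketched in a few lines of numpy; the dimensions and the base constant below are the usual illustrative defaults, not Flashzoi internals:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary positional encoding: rotate consecutive feature pairs of x
    by position-dependent angles (a sketch of the mechanism, not Flashzoi code)."""
    half = x.shape[0] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.arange(8.0)
rotated = rope(q, pos=3)   # same norm; the position is encoded in the phase
```

The useful property is that dot products between rotated queries and keys depend only on their relative offset, which is what makes the encoding compatible with fused attention kernels such as FlashAttention-2.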

MCML Authors
Link to website

Johannes Hingerl

Computational Molecular Medicine

Link to website

Alexander Karollus

Computational Molecular Medicine

Link to Profile Julien Gagneur

Julien Gagneur

Prof. Dr.

Computational Molecular Medicine


[1400]
V. T. Hu and B. Ommer.
[MASK] is All You Need.
Preprint (Dec. 2024). arXiv
Abstract

In generative models, two paradigms have gained traction in various applications: next-set prediction-based Masked Generative Models and next-noise prediction-based Non-Autoregressive Models, e.g., Diffusion Models. In this work, we propose using discrete-state models to connect them and explore their scalability in the vision domain. First, we conduct a step-by-step analysis in a unified design space across two types of models, including timestep-independence, noise schedule, temperature, guidance strength, etc., in a scalable manner. Second, we re-cast typical discriminative tasks, e.g., image segmentation, as an unmasking process from [MASK] tokens on a discrete-state model. This enables us to perform various sampling processes, including flexible conditional sampling, by only training once to model the joint distribution. All the aforementioned explorations lead to our framework named Discrete Interpolants, which enables us to achieve state-of-the-art or competitive performance compared to previous discrete-state based methods on various benchmarks, like ImageNet256, MS COCO, and the video dataset FaceForensics. In summary, by leveraging [MASK] in discrete-state models, we can bridge Masked Generative and Non-autoregressive Diffusion models, as well as generative and discriminative tasks.
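The unmasking process described above can be sketched as iterative confidence-based decoding; the dummy "model" below is a fixed random probability table, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = -1  # stand-in id for the [MASK] token

def unmask_step(tokens, predict_probs, frac):
    """Fill the most confident masked positions; keep the rest masked."""
    probs = predict_probs(tokens)            # (seq_len, vocab)
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    masked = np.where(tokens == MASK)[0]
    if len(masked) == 0:
        return tokens
    k = max(1, int(frac * len(masked)))
    chosen = masked[np.argsort(-conf[masked])[:k]]
    out = tokens.copy()
    out[chosen] = pred[chosen]
    return out

# Dummy "model": a fixed random categorical table per position.
V, L = 5, 8
table = rng.random((L, V))
table /= table.sum(axis=1, keepdims=True)
predict = lambda toks: table

tokens = np.full(L, MASK)
while (tokens == MASK).any():
    tokens = unmask_step(tokens, predict, frac=0.5)
```

In a real discrete-state model the predictor would condition on the partially unmasked sequence, so earlier decisions steer later ones; the fixed table here only shows the scheduling mechanics.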

MCML Authors
Link to website

Vincent Tao Hu

Dr.

Computer Vision & Learning

Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Computer Vision & Learning


[1399]
Y. Li, M. Milling, L. Specia and B. W. Schuller.
From Audio Deepfake Detection to AI-Generated Music Detection – A Pathway and Overview.
Preprint (Dec. 2024). arXiv
Abstract

As Artificial Intelligence (AI) technologies continue to evolve, their use in generating realistic, contextually appropriate content has expanded into various domains. Music, an art form and medium for entertainment deeply rooted in human culture, is seeing an increased involvement of AI in its production. However, despite the effective application of AI music generation (AIGM) tools, their unregulated use raises concerns about potential negative impacts on the music industry, copyright and artistic integrity, underscoring the importance of effective AIGM detection. This paper provides an overview of existing AIGM detection methods. To lay a foundation for the general workings and challenges of AIGM detection, we first review general principles of AIGM, including recent advancements in audio deepfakes, as well as multimodal detection techniques. We further propose a potential pathway for leveraging foundation models from audio deepfake detection to AIGM detection. Additionally, we discuss implications of these tools and propose directions for future research to address ongoing challenges in the field.

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[1398]
S. Liang, S. Wang, K. Li, M. Niemeyer, S. Gasperini, N. Navab and F. Tombari.
SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians.
Preprint (Dec. 2024). arXiv
Abstract

3D Gaussian Splatting has recently gained traction for its efficient training and real-time rendering. While the vanilla Gaussian Splatting representation is mainly designed for view synthesis, more recent works investigated how to extend it with scene understanding and language features. However, existing methods lack a detailed comprehension of scenes, limiting their ability to segment and interpret complex structures. To this end, we introduce SuperGSeg, a novel approach that fosters cohesive, context-aware scene representation by disentangling segmentation and language field distillation. SuperGSeg first employs neural Gaussians to learn instance and hierarchical segmentation features from multi-view images with the aid of off-the-shelf 2D masks. These features are then leveraged to create a sparse set of what we call Super-Gaussians. Super-Gaussians facilitate the distillation of 2D language features into 3D space. Through Super-Gaussians, our method enables high-dimensional language feature rendering without extreme increases in GPU memory. Extensive experiments demonstrate that SuperGSeg outperforms prior works on both open-vocabulary object localization and semantic segmentation tasks.

MCML Authors
Link to website

Sen Wang

Computer Aided Medical Procedures & Augmented Reality

Link to website

Kunyi Li

Computer Aided Medical Procedures & Augmented Reality

Link to website

Stefano Gasperini

Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Federico Tombari

Federico Tombari

PD Dr.

Computer Aided Medical Procedures & Augmented Reality


[1397]
Y. Mansour and R. Heckel.
Measuring Bias of Web-filtered Text Datasets and Bias Propagation Through Training.
Preprint (Dec. 2024). arXiv
Abstract

We investigate biases in pretraining datasets for large language models (LLMs) through dataset classification experiments. Building on prior work demonstrating the existence of biases in popular computer vision datasets, we analyze popular open-source pretraining datasets for LLMs derived from CommonCrawl including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb, and DCLM-Baseline. Despite those datasets being obtained with similar filtering and deduplication steps, neural networks can classify surprisingly well which dataset a single text sequence belongs to, significantly better than a human can. This indicates that popular pretraining datasets have their own unique biases or fingerprints. Those biases remain even when the text is rewritten with LLMs. Moreover, these biases propagate through training: Random sequences generated by models trained on those datasets can be classified well by a classifier trained on the original datasets.
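The dataset-classification experiment can be miniaturized: even a nearest-centroid bag-of-words classifier separates two toy "corpora" with different word statistics. Everything below (corpora, vocabulary) is invented for illustration and is far simpler than the neural classifiers used in the paper:

```python
import numpy as np
from collections import Counter

def featurize(text, vocab):
    """Normalized bag-of-words vector over a fixed vocabulary."""
    c = Counter(text.split())
    v = np.array([c[w] for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

# Two invented "pretraining corpora" with slightly different word statistics.
corpus_a = ["the model learns fast", "the data is clean", "the web text model"]
corpus_b = ["crawl filter dedup pipeline", "filter the crawl data", "dedup web crawl"]
vocab = sorted({w for s in corpus_a + corpus_b for w in s.split()})

centroid_a = np.mean([featurize(s, vocab) for s in corpus_a], axis=0)
centroid_b = np.mean([featurize(s, vocab) for s in corpus_b], axis=0)

def classify(text):
    """Nearest-centroid guess at which corpus a text sequence came from."""
    f = featurize(text, vocab)
    return "A" if f @ centroid_a >= f @ centroid_b else "B"
```

That a single held-out sequence is attributable to its source is exactly the "fingerprint" phenomenon the paper measures at web scale.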

MCML Authors
Link to website

Youssef Mansour

Machine Learning and Information Processing

Link to Profile Reinhard Heckel

Reinhard Heckel

Prof. Dr.

Machine Learning and Information Processing


[1396]
A. Reithmeir, V. Spieker, V. Sideri-Lampretsa, D. Rückert, J. A. Schnabel and V. A. Zimmer.
From Model Based to Learned Regularization in Medical Image Registration: A Comprehensive Review.
Preprint (Dec. 2024). arXiv
Abstract

Image registration is fundamental in medical imaging applications, such as disease progression analysis or radiation therapy planning. The primary objective of image registration is to precisely capture the deformation between two or more images, typically achieved by minimizing an optimization problem. Due to its inherent ill-posedness, regularization is a key component in driving the solution toward anatomically meaningful deformations. A wide range of regularization methods has been proposed for both conventional and deep learning-based registration. However, the appropriate application of regularization techniques often depends on the specific registration problem, and no one-fits-all method exists. Despite its importance, regularization is often overlooked or addressed with default approaches, assuming existing methods are sufficient. A comprehensive and structured review remains missing. This review addresses this gap by introducing a novel taxonomy that systematically categorizes the diverse range of proposed regularization methods. It highlights the emerging field of learned regularization, which leverages data-driven techniques to automatically derive deformation properties from the data. Moreover, this review examines the transfer of regularization methods from conventional to learning-based registration, identifies open challenges, and outlines future research directions. By emphasizing the critical role of regularization in image registration, we hope to inspire the research community to reconsider regularization strategies in modern registration algorithms and to explore this rapidly evolving field further.

MCML Authors
Link to website

Anna Reithmeir

Computational Imaging and AI in Medicine

Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine

Link to Profile Julia Schnabel

Julia Schnabel

Prof. Dr.

Computational Imaging and AI in Medicine


[1395]
C. Sauer, A.-L. Boulesteix, L. Hanßum, F. Hodiamont, C. Bausewein and T. Ullmann.
Beyond algorithm hyperparameters: on preprocessing hyperparameters and associated pitfalls in machine learning applications.
Preprint (Dec. 2024). arXiv
Abstract

Adequately generating and evaluating prediction models based on supervised machine learning (ML) is often challenging, especially for less experienced users in applied research areas. Special attention is required in settings where the model generation process involves hyperparameter tuning, i.e. data-driven optimization of different types of hyperparameters to improve the predictive performance of the resulting model. Discussions about tuning typically focus on the hyperparameters of the ML algorithm (e.g., the minimum number of observations in each terminal node for a tree-based algorithm). In this context, it is often neglected that hyperparameters also exist for the preprocessing steps that are applied to the data before it is provided to the algorithm (e.g., how to handle missing feature values in the data). As a consequence, users experimenting with different preprocessing options to improve model performance may be unaware that this constitutes a form of hyperparameter tuning - albeit informal and unsystematic - and thus may fail to report or account for this optimization. To illuminate this issue, this paper reviews and empirically illustrates different procedures for generating and evaluating prediction models, explicitly addressing the different ways algorithm and preprocessing hyperparameters are typically handled by applied ML users. By highlighting potential pitfalls, especially those that may lead to exaggerated performance claims, this review aims to further improve the quality of predictive modeling in ML applications.
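The pitfall the paper describes has a simple correct counterpart: a preprocessing statistic (here, a mean-imputation value, chosen as a hypothetical example) is fitted on the training split only and then applied to the held-out split, so that trying different preprocessing options stays inside the evaluation loop:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=20)
X[rng.choice(20, size=5, replace=False)] = np.nan   # missing feature values

def impute_fit(x_train):
    """'Fit' the preprocessing step (the imputation value) on training data only."""
    return np.nanmean(x_train)

def impute_apply(x, fill):
    x = x.copy()
    x[np.isnan(x)] = fill
    return x

# The preprocessing statistic is learned inside the split, never on held-out data.
train, test = X[:15], X[15:]
fill = impute_fit(train)              # uses only the training fold
train_i = impute_apply(train, fill)
test_i = impute_apply(test, fill)     # test fold sees the train-fitted value
```

Fitting `fill` on all of `X` instead, or hand-picking the imputation strategy that maximizes test performance, would be the informal preprocessing-hyperparameter tuning the paper warns against.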

MCML Authors
Link to website

Christina Sauer (née Nießl)

Biometry in Molecular Medicine

Link to Profile Anne-Laure Boulesteix

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine


[1394]
Q. Sun, Y. Li, E. Alturki, S. M. K. Murthy and B. W. Schuller.
Towards Friendly AI: A Comprehensive Review and New Perspectives on Human-AI Alignment.
Preprint (Dec. 2024). arXiv
Abstract

As Artificial Intelligence (AI) continues to advance rapidly, Friendly AI (FAI) has been proposed to advocate for more equitable and fair development of AI. Despite its importance, there is a lack of comprehensive reviews examining FAI from an ethical perspective, as well as limited discussion on its potential applications and future directions. This paper addresses these gaps by providing a thorough review of FAI, focusing on theoretical perspectives both for and against its development, and presenting a formal definition in a clear and accessible format. Key applications are discussed from the perspectives of eXplainable AI (XAI), privacy, fairness and affective computing (AC). Additionally, the paper identifies challenges in current technological advancements and explores future research avenues. The findings emphasise the significance of developing FAI and advocate for its continued advancement to ensure ethical and beneficial AI development.

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[1393]
A. Testoni, B. Plank and R. Fernández.
RACQUET: Unveiling the Dangers of Overlooked Referential Ambiguity in Visual LLMs.
Preprint (Dec. 2024). arXiv
Abstract

Ambiguity resolution is key to effective communication. While humans effortlessly address ambiguity through conversational grounding strategies, the extent to which current language models can emulate these strategies remains unclear. In this work, we examine referential ambiguity in image-based question answering by introducing RACQUET, a carefully curated dataset targeting distinct aspects of ambiguity. Through a series of evaluations, we reveal significant limitations and problems of overconfidence of state-of-the-art large multimodal language models in addressing ambiguity in their responses. The overconfidence issue becomes particularly relevant for RACQUET-BIAS, a subset designed to analyze a critical yet underexplored problem: failing to address ambiguity leads to stereotypical, socially biased responses. Our results underscore the urgency of equipping models with robust strategies to deal with uncertainty without resorting to undesirable stereotypes.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1392]
J. Wang, Z. Qin, Y. Zhang, V. T. Hu, B. Ommer, R. Briq and S. Kesselheim.
Scaling Image Tokenizers with Grouped Spherical Quantization.
Preprint (Dec. 2024). arXiv
Abstract

Vision tokenizers have gained a lot of traction due to their scalability and compactness; previous works depend on old-school GAN-based hyperparameters, biased comparisons, and a lack of comprehensive analysis of the scaling behaviours. To tackle those issues, we introduce Grouped Spherical Quantization (GSQ), featuring spherical codebook initialization and lookup regularization to constrain the codebook latent to a spherical surface. Our empirical analysis of image tokenizer training strategies demonstrates that GSQ-GAN achieves superior reconstruction quality over state-of-the-art methods with fewer training iterations, providing a solid foundation for scaling studies. Building on this, we systematically examine the scaling behaviours of GSQ, specifically in latent dimensionality, codebook size, and compression ratios, and their impact on model performance. Our findings reveal distinct behaviours at high and low spatial compression levels, underscoring challenges in representing high-dimensional latent spaces. We show that GSQ can restructure high-dimensional latents into compact, low-dimensional spaces, thus enabling efficient scaling with improved quality. As a result, GSQ-GAN achieves a 16x down-sampling with a reconstruction FID (rFID) of 0.50.
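The core quantization step can be sketched in numpy: split the latent into groups, project each group onto the unit sphere, and snap it to the most similar codeword. Shapes and codebook size are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def gsq_quantize(z, codebook, groups):
    """Grouped spherical quantization sketch: split z into groups, project
    each group onto the unit sphere, snap to the most similar codeword."""
    out = []
    for p in np.split(z, groups):
        p = p / np.linalg.norm(p)           # spherical projection
        idx = int(np.argmax(codebook @ p))  # max cosine similarity lookup
        out.append(codebook[idx])
    return np.concatenate(out)

d, groups, n_codes = 8, 2, 16
codebook = rng.normal(size=(n_codes, d // groups))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)  # spherical init

z = rng.normal(size=d)
zq = gsq_quantize(z, codebook, groups)
```

Grouping lets the effective codebook grow combinatorially (`n_codes ** groups` combinations) without a single huge lookup table, which is the lever behind the scaling study.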

MCML Authors
Link to website

Vincent Tao Hu

Dr.

Computer Vision & Learning

Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Computer Vision & Learning


[1391]
Y. Wang, Q. Song, D. Wasif, M. Shahzad, C. Koller, J. Bamber and X. Zhu.
How Certain are Uncertainty Estimates? Three Novel Earth Observation Datasets for Benchmarking Uncertainty Quantification in Machine Learning.
Preprint (Dec. 2024). arXiv GitHub
Abstract

Uncertainty quantification (UQ) is essential for assessing the reliability of Earth observation (EO) products. However, the extensive use of machine learning models in EO introduces an additional layer of complexity, as those models themselves are inherently uncertain. While various UQ methods do exist for machine learning models, their performance on EO datasets remains largely unevaluated. A key challenge in the community is the absence of the ground truth for uncertainty, i.e. how certain the uncertainty estimates are, apart from the labels for the image/signal. This article fills this gap by introducing three benchmark datasets specifically designed for UQ in EO machine learning models. These datasets address three common problem types in EO: regression, image segmentation, and scene classification. They enable a transparent comparison of different UQ methods for EO machine learning models. We describe the creation and characteristics of each dataset, including data sources, preprocessing steps, and label generation, with a particular focus on calculating the reference uncertainty. We also showcase baseline performance of several machine learning models on each dataset, highlighting the utility of these benchmarks for model development and comparison. Overall, this article offers a valuable resource for researchers and practitioners working in artificial intelligence for EO, promoting a more accurate and reliable quality measure of the outputs of machine learning models.

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1390]
J. Weidner, M. Balcerak, I. Ezhov, A. Datchev, L. Lux, L. Zimmer, D. Rückert, B. Menze and B. Wiestler.
Spatial Brain Tumor Concentration Estimation for Individualized Radiotherapy Planning.
Preprint (Dec. 2024). arXiv
Abstract

Biophysical modeling of brain tumors has emerged as a promising strategy for personalizing radiotherapy planning by estimating the otherwise hidden distribution of tumor cells within the brain. However, many existing state-of-the-art methods are computationally intensive, limiting their widespread translation into clinical practice. In this work, we propose an efficient and direct method that utilizes soft physical constraints to estimate the tumor cell concentration from preoperative MRI of brain tumor patients. Our approach optimizes a 3D tumor concentration field by simultaneously minimizing the difference between the observed MRI and a physically informed loss function. Compared to existing state-of-the-art techniques, our method significantly improves predicting tumor recurrence on two public datasets with a total of 192 patients while maintaining a clinically viable runtime of under one minute - a substantial reduction from the 30 minutes required by the current best approach. Furthermore, we showcase the generalizability of our framework by incorporating additional imaging information and physical constraints, highlighting its potential to translate to various medical diffusion phenomena with imperfect data.
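The optimization described, fitting a concentration field against observations under a soft physical constraint, can be miniaturized to 1-D; the smoothness penalty below merely stands in for the physically informed loss and is purely illustrative:

```python
import numpy as np

def fit_concentration(obs, mask, lam=0.1, steps=500, lr=0.1):
    """Toy 1-D analogue: recover a concentration field c by minimizing a data
    term ||mask*(c - obs)||^2 plus a smoothness penalty lam*||grad c||^2
    (standing in for the physically informed loss), via gradient descent."""
    c = np.zeros_like(obs)
    for _ in range(steps):
        data_grad = 2 * mask * (c - obs)
        lap = np.zeros_like(c)
        lap[1:-1] = c[:-2] - 2 * c[1:-1] + c[2:]   # discrete Laplacian
        c -= lr * (data_grad - 2 * lam * lap)
    return c

obs = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])   # toy "observed signal"
mask = np.ones_like(obs)                          # all voxels observed here
c = fit_concentration(obs, mask)                  # smoothed fit to obs
```

Directly optimizing the field with soft constraints, rather than running a forward biophysical simulation, is what buys the sub-minute runtime the abstract reports.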

MCML Authors
Link to website

Jonas Weidner

AI for Image-Guided Diagnosis and Therapy

Link to website

Laurin Lux

Artificial Intelligence in Healthcare and Medicine

Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine

Link to Profile Benedikt Wiestler

Benedikt Wiestler

Prof. Dr.

AI for Image-Guided Diagnosis and Therapy


[1389]
Y. Xia, Z. Li, Y.-J. Li, L. Shi, H. Cao, J. F. Henriques and D. Cremers.
UniLoc: Towards Universal Place Recognition Using Any Single Modality.
Preprint (Dec. 2024). arXiv GitHub
Abstract

To date, most place recognition methods focus on single-modality retrieval. While they perform well in specific environments, cross-modal methods offer greater flexibility by allowing seamless switching between map and query sources. It also promises to reduce computation requirements by having a unified model, and achieving greater sample efficiency by sharing parameters. In this work, we develop a universal solution to place recognition, UniLoc, that works with any single query modality (natural language, image, or point cloud). UniLoc leverages recent advances in large-scale contrastive learning, and learns by matching hierarchically at two levels: instance-level matching and scene-level matching. Specifically, we propose a novel Self-Attention based Pooling (SAP) module to evaluate the importance of instance descriptors when aggregated into a place-level descriptor. Experiments on the KITTI-360 dataset demonstrate the benefits of cross-modality for place recognition, achieving superior performance in cross-modal settings and competitive results also for uni-modal scenarios.
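A generic attention-based pooling of instance descriptors into one place-level descriptor, in the spirit of the proposed SAP module; this is a simplified sketch with an invented scoring vector, not the published architecture:

```python
import numpy as np

def attention_pool(descriptors, w):
    """Score each instance descriptor with a learned vector w, softmax the
    scores, and return the weighted sum as the place-level descriptor."""
    scores = descriptors @ w
    a = np.exp(scores - scores.max())   # numerically stable softmax
    a = a / a.sum()
    return a @ descriptors, a

rng = np.random.default_rng(0)
D = rng.normal(size=(6, 4))   # six instance descriptors, 4-d each
w = rng.normal(size=4)        # stand-in for the learned scoring parameters
pooled, attn = attention_pool(D, w)
```

The attention weights make the aggregation adaptive: informative instances dominate the place descriptor instead of being averaged away.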

MCML Authors
Yan Xia

Dr.

* Former Member

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1388]
Y. Xia, Y. Lu, R. Song, O. Dhaouadi, J. F. Henriques and D. Cremers.
TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes.
Preprint (Dec. 2024). arXiv GitHub
Abstract

We tackle the problem of localizing the traffic surveillance cameras in cooperative perception. To overcome the lack of large-scale real-world intersection datasets, we introduce Carla Intersection, a new simulated dataset with 75 urban and rural intersections in Carla. Moreover, we introduce a novel neural network, TrafficLoc, localizing traffic cameras within a 3D reference map. TrafficLoc employs a coarse-to-fine matching pipeline. For image-point cloud feature fusion, we propose a novel Geometry-guided Attention Loss to address cross-modal viewpoint inconsistencies. During coarse matching, we propose an Inter-Intra Contrastive Learning to achieve precise alignment while preserving distinctiveness among local intra-features within image patch-point group pairs. Besides, we introduce Dense Training Alignment with a soft-argmax operator to consider additional features when regressing the final position. Extensive experiments show that our TrafficLoc improves the localization accuracy over the state-of-the-art Image-to-point cloud registration methods by a large margin (up to 86%) on Carla Intersection and generalizes well to real-world data. TrafficLoc also achieves new SOTA performance on KITTI and NuScenes datasets, demonstrating strong localization ability across both in-vehicle and traffic cameras.

MCML Authors
Yan Xia

Dr.

* Former Member

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1387]
X. Xue, G. Wei, H. Chen, H. Zhang, F. Lin, C. Shen and X. Zhu.
REO-VLM: Transforming VLM to Meet Regression Challenges in Earth Observation.
Preprint (Dec. 2024). arXiv
Abstract

The rapid evolution of Vision Language Models (VLMs) has catalyzed significant advancements in artificial intelligence, expanding research across various disciplines, including Earth Observation (EO). While VLMs have enhanced image understanding and data processing within EO, their applications have predominantly focused on image content description. This limited focus overlooks their potential in geographic and scientific regression tasks, which are essential for diverse EO applications. To bridge this gap, this paper introduces a novel benchmark dataset, called REO-Instruct, to unify regression and generation tasks specifically for the EO domain. Comprising 1.6 million multimodal EO imagery and language pairs, this dataset is designed to support both biomass regression and image content interpretation tasks. Leveraging this dataset, we develop REO-VLM, a groundbreaking model that seamlessly integrates regression capabilities with traditional generative functions. By utilizing language-driven reasoning to incorporate scientific domain knowledge, REO-VLM goes beyond solely relying on EO imagery, enabling comprehensive interpretation of complex scientific attributes from EO data. This approach establishes new performance benchmarks and significantly enhances the capabilities of environmental monitoring and resource management.

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1386]
H. Ye, A. Wisiorek, A. Maronikolakis, Ö. Alaçam and H. Schütze.
A Federated Approach to Few-Shot Hate Speech Detection for Marginalized Communities.
Preprint (Dec. 2024). arXiv
Abstract

Hate speech online remains an understudied issue for marginalized communities, and has seen rising relevance, especially in the Global South, which includes developing societies with increasing internet penetration. In this paper, we aim to provide marginalized communities living in societies where the dominant language is low-resource with a privacy-preserving tool to protect themselves from hate speech on the internet by filtering offensive content in their native languages. Our contribution in this paper is twofold: 1) we release REACT (REsponsive hate speech datasets Across ConTexts), a collection of high-quality, culture-specific hate speech detection datasets comprising seven distinct target groups in eight low-resource languages, curated by experienced data collectors; 2) we propose a solution to few-shot hate speech detection utilizing federated learning (FL), a privacy-preserving and collaborative learning approach, to continuously improve a central model that exhibits robustness when tackling different target groups and languages. By keeping the training local to the users’ devices, we ensure the privacy of the users’ data while benefitting from the efficiency of federated learning. Furthermore, we personalize client models to target-specific training data and evaluate their performance. Our results indicate the effectiveness of FL across different target groups, whereas the benefits of personalization on few-shot learning are not clear.
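The collaborative-learning backbone here is standard federated averaging; below is a minimal sketch of one size-weighted aggregation round with toy weights, not the paper's setup:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """One FedAvg aggregation round: average client model parameters,
    weighted by each client's local dataset size."""
    sizes = np.asarray(client_sizes, dtype=float)
    coef = sizes / sizes.sum()
    return sum(c * w for c, w in zip(coef, client_weights))

# Three hypothetical clients with 2-parameter models.
clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [10, 10, 20]                    # the larger client counts double
global_w = fedavg(clients, sizes)
```

Only parameter updates leave the device; the raw training texts never do, which is the privacy property the paper relies on.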

MCML Authors
Link to website

Axel Wisiorek

Dr.

Computational Linguistics

Antonis Maronikolakis

* Former Member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1385]
Y. Yeganeh, R. Xiao, G. Guvercin, N. Navab and A. Farshad.
Conformable Convolution for Topologically Aware Learning of Complex Anatomical Structures.
Preprint (Dec. 2024). arXiv
Abstract

While conventional computer vision emphasizes pixel-level and feature-based objectives, medical image analysis of intricate biological structures necessitates explicit representation of their complex topological properties. Despite their successes, deep learning models often struggle to accurately capture the connectivity and continuity of fine, sometimes pixel-thin, yet critical structures due to their reliance on implicit learning from data. Such shortcomings can significantly impact the reliability of analysis results and hinder clinical decision-making. To address this challenge, we introduce Conformable Convolution, a novel convolutional layer designed to explicitly enforce topological consistency. Conformable Convolution learns adaptive kernel offsets that preferentially focus on regions of high topological significance within an image. This prioritization is guided by our proposed Topological Posterior Generator (TPG) module, which leverages persistent homology. The TPG module identifies key topological features and guides the convolutional layers by applying persistent homology to feature maps transformed into cubical complexes. Our proposed modules are architecture-agnostic, enabling them to be integrated seamlessly into various architectures. We showcase the effectiveness of our framework in the segmentation task, where preserving the interconnectedness of structures is critical. Experimental results on three diverse datasets demonstrate that our framework effectively preserves the topology in the segmentation downstream task, both quantitatively and qualitatively.

MCML Authors
Yousef Yeganeh

Computer Aided Medical Procedures & Augmented Reality

Rui Xiao

Interpretable and Reliable Machine Learning

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Azade Farshad

Dr.

Computer Aided Medical Procedures & Augmented Reality


[1384]
A. Kathan, S. Amiriparian, L. Christ, S. Eulitz and B. W. Schuller.
Automatic Speech-Based Charisma Recognition and the Impact of Integrating Auxiliary Characteristics.
TELEPRESENCE 2024 - IEEE Conference on Telepresence. Pasadena, CA, USA, Nov 16-17, 2024. DOI
Abstract

Automatic recognition of speakers’ states and traits is crucial to facilitate a more naturalistic human-AI interaction – a key focus in human-computer interaction to enhance user experience. One particularly important trait in daily life is charisma. To date, its definition is still controversial. However, there appear to be characteristics in speech that the majority perceives as charismatic. To this end, we address the novel speech-based task of charisma recognition in a three-fold approach. First, we predict charismatic speech using both interpretable acoustic features and embeddings from two audio Transformers. Afterwards, we make use of auxiliary labels that are highly correlated with charisma, including enthusiastic, likeable, attractive, warm, and leader-like, to check their impact on charisma recognition. Finally, we personalise the best model, taking individual speech characteristics into account. In our experiments, we demonstrate that the charisma prediction model benefits from integrating auxiliary characteristics as well as from the personalised approach, resulting in a best Pearson’s correlation coefficient of 0.4304.

MCML Authors
Alexander Kathan

Health Informatics

Shahin Amiriparian

Dr.

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics


[1383]
S. Amiriparian, M. Gerczuk, J. Lutz, W. Strube, I. Papazova, A. Hasan, A. Kathan and B. W. Schuller.
Non-Invasive Suicide Risk Prediction Through Speech Analysis.
EHB 2024 - 12th E-Health and Bioengineering Conference. IASI, Romania, Nov 14-15, 2024. DOI
Abstract

The delayed access to specialized psychiatric assessments and care for patients at risk of suicidal tendencies in emergency departments creates a notable gap in timely intervention, hindering the provision of adequate mental health support during critical situations. To address this, we present a non-invasive, speech-based approach for automatic suicide risk assessment. For our study, we collected a novel speech recording dataset from 20 patients. We extract three sets of features, including wav2vec, interpretable speech and acoustic features, and deep learning-based spectral representations. We proceed by conducting a binary classification to assess suicide risk in a leave-one-subject-out fashion. Our most effective speech model achieves a balanced accuracy of 66.2%. Moreover, we show that integrating our speech model with a series of patients’ metadata, such as the history of suicide attempts or access to firearms, improves the overall result. The metadata integration yields a balanced accuracy of 94.4%, marking an absolute improvement of 28.2%, demonstrating the efficacy of our proposed approaches for automatic suicide risk assessment in emergency medicine.
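The leave-one-subject-out (LOSO) evaluation with balanced accuracy described in the abstract can be sketched as follows; the features, labels, and classifier below are synthetic placeholders, not the authors' data or pipeline:

```python
# Illustrative sketch of leave-one-subject-out (LOSO) evaluation scored by
# balanced accuracy. Everything here is synthetic: random "speech features",
# random risk labels, and a plain logistic-regression stand-in classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
n_subjects, clips_per_subject, n_features = 20, 5, 16
X = rng.normal(size=(n_subjects * clips_per_subject, n_features))
y = rng.integers(0, 2, size=n_subjects * clips_per_subject)   # 0 = low risk, 1 = high risk
groups = np.repeat(np.arange(n_subjects), clips_per_subject)  # one group per patient

logo = LeaveOneGroupOut()
preds = np.empty_like(y)
for train_idx, test_idx in logo.split(X, y, groups):
    # train on 19 patients, predict the held-out patient's clips
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    preds[test_idx] = clf.predict(X[test_idx])

print(f"LOSO balanced accuracy: {balanced_accuracy_score(y, preds):.3f}")
```

The grouping by subject ensures that no recordings from a test patient leak into training, which is the point of a LOSO protocol in small clinical datasets.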

MCML Authors
Shahin Amiriparian

Dr.

Health Informatics

Maurice Gerczuk

Health Informatics

Alexander Kathan

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics


[1382]
M. Di Marco and A. Fraser.
Subword Segmentation in LLMs: Looking at Inflection and Consistency.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

The role of subword segmentation in relation to capturing morphological patterns in LLMs is currently not well explored. Ideally, one would train models like GPT using various segmentations and evaluate how well word meanings are captured. Since this is not computationally feasible, we group words according to their segmentation properties and compare how well a model can solve a linguistic task for these groups. We study two criteria: (i) adherence to morpheme boundaries and (ii) the segmentation consistency of the different inflected forms of a lemma. We select word forms with high and low values for these criteria and carry out experiments on GPT-4o’s ability to capture verbal inflection for 10 languages. Our results indicate that in particular the criterion of segmentation consistency can help to predict the model’s ability to recognize and generate the lemma from an inflected form, providing evidence that subword segmentation is relevant.
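Criterion (ii), segmentation consistency, can be illustrated with a toy check of whether a tokenizer reuses the lemma's stem token across inflected forms. The tokenizer below is a hypothetical stand-in with hard-coded segmentations, not the segmentation of any actual model:

```python
# Toy illustration of the "segmentation consistency" criterion: a lemma is
# consistently segmented if its inflected forms start with the same stem
# subword. The mock tokenizer is purely hypothetical, for illustration only.
def mock_tokenize(word):
    table = {
        "walk": ["walk"], "walked": ["walk", "ed"], "walking": ["walk", "ing"],
        "run": ["run"], "ran": ["r", "an"], "running": ["runn", "ing"],
    }
    return table[word]

def consistent(lemma, forms):
    # does every inflected form begin with the lemma's stem subword?
    stem = mock_tokenize(lemma)[0]
    return all(mock_tokenize(f)[0] == stem for f in forms)

print(consistent("walk", ["walked", "walking"]))  # stem "walk" is reused
print(consistent("run", ["ran", "running"]))      # stem changes across forms
```

Words like "walk" fall into the high-consistency group; words like "run", whose forms fragment differently, fall into the low-consistency group the paper compares against.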

MCML Authors
Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[1381]
L. Edman, H. Schmid and A. Fraser.
CUTE: Measuring LLMs’ Understanding of Their Tokens.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Large Language Models (LLMs) show remarkable performance on a wide variety of tasks. Most LLMs split text into multi-character tokens and process them as atomic units without direct access to individual characters. This raises the question: To what extent can LLMs learn orthographic information? To answer this, we propose a new benchmark, CUTE, which features a collection of tasks designed to test the orthographic knowledge of LLMs. We evaluate popular LLMs on CUTE, finding that most of them seem to know the spelling of their tokens, yet fail to use this information effectively to manipulate text, calling into question how much of this knowledge is generalizable.

MCML Authors
Lukas Edman

Dr.

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[1380]
W. Lai, V. Hangya and A. Fraser.
Style-Specific Neurons for Steering LLMs in Text Style Transfer.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Text style transfer (TST) aims to modify the style of a text without altering its original meaning. Large language models (LLMs) demonstrate superior performance across multiple tasks, including TST. However, in zero-shot setups, they tend to directly copy a significant portion of the input text to the output without effectively changing its style. To enhance the stylistic variety and fluency of the text, we present sNeuron-TST, a novel approach for steering LLMs using style-specific neurons in TST. Specifically, we identify neurons associated with the source and target styles and deactivate source-style-only neurons to give target-style words a higher probability, aiming to enhance the stylistic diversity of the generated text. However, we find that this deactivation negatively impacts the fluency of the generated text, which we address by proposing an improved contrastive decoding method that accounts for rapid token probability shifts across layers caused by deactivated source-style neurons. Empirical experiments demonstrate the effectiveness of the proposed method on six benchmarks, encompassing formality, toxicity, politics, politeness, authorship, and sentiment.

MCML Authors
Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[1379]
Y. J. Liu, T. Aoyama, W. Scivetti, Y. Zhu, S. Behzad, L. E. Levine, J. Lin, D. Tiwari and A. Zeldes.
GDTB: Genre Diverse Data for English Shallow Discourse Parsing across Modalities, Text Types, and Domains.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Work on shallow discourse parsing in English has focused on the Wall Street Journal corpus, the only large-scale dataset for the language in the PDTB framework. However, the data is not openly available, is restricted to the news domain, and is by now 35 years old. In this paper, we present and evaluate a new open-access, multi-genre benchmark for PDTB-style shallow discourse parsing, based on the existing UD English GUM corpus, for which discourse relation annotations in other frameworks already exist. In a series of experiments on cross-domain relation classification, we show that while our dataset is compatible with PDTB, substantial out-of-domain degradation is observed, which can be alleviated by joint training on both datasets.

MCML Authors
Yang Janet Liu

AI and Computational Linguistics


[1378]
Y. Liu, Y. Zhang, Q. Li, T. Liu, S. Feng, D. Wang, Y. Zhang and H. Schütze.
HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Full-parameter fine-tuning has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent performance. As LMs grow in size, fine-tuning the full parameters of LMs requires a prohibitively large amount of GPU memory. Existing approaches utilize zeroth-order optimizers to conserve GPU memory, which can potentially compromise the performance of LMs, as first-order optimizers tend to converge more readily on most downstream tasks. In this paper, we propose a novel optimizer-independent, end-to-end hierarchical fine-tuning strategy, HiFT, which only updates a subset of parameters at each training step. HiFT can significantly reduce the amount of gradients and optimizer state parameters residing in GPU memory at the same time, thereby reducing GPU memory usage. Our results demonstrate that: (1) HiFT achieves performance comparable to parameter-efficient fine-tuning and standard full-parameter fine-tuning. (2) HiFT supports various optimizers, including AdamW, AdaGrad, and SGD. (3) HiFT can save more than 60% of GPU memory compared with standard full-parameter fine-tuning for a 7B model. (4) HiFT enables full-parameter fine-tuning of a 7B model on a single 48GB A6000 GPU at 32-bit precision using the AdamW optimizer, without using any memory-saving techniques.
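The core idea, updating only one group of layers per training step so that gradients and optimizer state exist only for that group at a time, can be sketched with a toy numpy model. The model, gradient, and SGD update below are minimal placeholders, not the HiFT implementation:

```python
# Toy sketch of hierarchical fine-tuning: cycle through layer groups and
# update only one group per step, so gradient (and, in a real optimizer,
# state) is materialized for a fraction of the model at any time.
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 4)) for _ in range(6)]  # toy "model": 6 weight matrices
init = [l.copy() for l in layers]                      # snapshot for comparison
groups = [[0, 1], [2, 3], [4, 5]]                      # layer groups updated in turn

def grad_for(layer):
    # placeholder gradient; a real setup would backpropagate a task loss
    return 0.1 * layer

lr = 0.5
for step in range(9):
    active = groups[step % len(groups)]        # only this group needs gradients now
    for i in active:
        layers[i] -= lr * grad_for(layers[i])  # SGD update for the active group only

# each group is visited every third step, so all layers still get trained
print(max(np.abs(l).max() for l in layers))
```

With this toy gradient every layer shrinks toward zero over the 9 steps, showing that cycling through groups still updates the full parameter set, just never all at once.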

MCML Authors
Tong Liu

Database Systems and Data Mining

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1377]
P. Mondorf and B. Plank.
Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Knights and knaves problems represent a classic genre of logical puzzles where characters either tell the truth or lie. The objective is to logically deduce each character’s identity based on their statements. The challenge arises from the truth-telling or lying behavior, which influences the logical implications of each statement. Solving these puzzles requires not only direct deductions from individual statements, but the ability to assess the truthfulness of statements by reasoning through various hypothetical scenarios. As such, knights and knaves puzzles serve as compelling examples of suppositional reasoning. In this paper, we introduce TruthQuest, a benchmark for suppositional reasoning based on the principles of knights and knaves puzzles. Our benchmark presents problems of varying complexity, considering both the number of characters and the types of logical statements involved. Evaluations on TruthQuest show that large language models like Llama 3 and Mixtral-8x7B exhibit significant difficulties solving these tasks. A detailed error analysis of the models’ output reveals that lower-performing models exhibit a diverse range of reasoning errors, frequently failing to grasp the concept of truth and lies. In comparison, more proficient models primarily struggle with accurately inferring the logical implications of potentially false statements.

MCML Authors
Philipp Mondorf

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1376]
P. F. Balestrucci, S. Casola, S. M. Lo, V. Basile and A. Mazzei.
I’m sure you’re a real scholar yourself: Exploring Ironic Content Generation by Large Language Models.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Generating ironic content is challenging: it requires a nuanced understanding of context and implicit references and balancing seriousness and playfulness. Moreover, irony is highly subjective and can depend on various factors, such as social, cultural, or generational aspects. This paper explores whether Large Language Models (LLMs) can learn to generate ironic responses to social media posts. To do so, we fine-tune two models to generate ironic and non-ironic content and deeply analyze their outputs’ linguistic characteristics, their connection to the original post, and their similarity to the human-written replies. We also conduct a large-scale human evaluation of the outputs. Additionally, we investigate whether LLMs can learn a form of irony tied to a generational perspective, with mixed results.

MCML Authors
Silvia Casola

Dr.

AI and Computational Linguistics


[1375]
B. Chen, X. Wang, S. Peng, R. Litschko, A. Korhonen and B. Plank.
'Seeing the Big through the Small': Can LLMs Approximate Human Judgment Distributions on NLI from a Few Explanations?
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Human label variation (HLV) is a valuable source of information that arises when multiple human annotators provide different labels for valid reasons. In Natural Language Inference (NLI), earlier approaches to capturing HLV involve either collecting annotations from many crowd workers to represent the human judgment distribution (HJD) or using expert linguists to provide detailed explanations for their chosen labels. While the former method provides denser HJD information, obtaining it is resource-intensive. In contrast, the latter offers richer textual information but is challenging to scale up to many human judges. In addition, large language models (LLMs) are increasingly used as evaluators (‘LLM judges’), but with mixed results, and few works aim to study HJDs. This study proposes to exploit LLMs to approximate HJDs using a small number of expert labels and explanations. Our experiments show that a few explanations significantly improve LLMs’ ability to approximate HJDs with and without explicit labels, thereby providing a solution to scale up annotations for HJD. However, fine-tuning smaller soft-label aware models with the LLM-generated model judgment distributions (MJDs) presents partially inconsistent results: while similar in distance, the resulting fine-tuned models and visualized distributions differ substantially. We show the importance of complementing instance-level distance measures with a global-level shape metric and visualization to more effectively evaluate MJDs against human judgment distributions.

MCML Authors
Beiduo Chen

AI and Computational Linguistics

Xinpeng Wang

AI and Computational Linguistics

Siyao Peng

Dr.

AI and Computational Linguistics

Robert Litschko

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1374]
Z. Ding, J. Wu, J. Wu, Y. Xia and V. Tresp.
Temporal Fact Reasoning over Hyper-Relational Knowledge Graphs.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Stemming from traditional knowledge graphs (KGs), hyper-relational KGs (HKGs) provide additional key-value pairs (i.e., qualifiers) for each KG fact that help to better restrict the fact validity. In recent years, there has been an increasing interest in studying graph reasoning over HKGs. Meanwhile, as discussed in recent works that focus on temporal KGs (TKGs), world knowledge is ever-evolving, making it important to reason over temporal facts in KGs. Previous mainstream benchmark HKGs do not explicitly specify temporal information for each HKG fact. Therefore, almost all existing HKG reasoning approaches do not devise any module specifically for temporal reasoning. To better study temporal fact reasoning over HKGs, we propose a new type of data structure named hyper-relational TKG (HTKG). Every fact in an HTKG is coupled with a timestamp explicitly indicating its time validity. We develop two new benchmark HTKG datasets, i.e., Wiki-hy and YAGO-hy, and propose an HTKG reasoning model that efficiently models hyper-relational temporal facts. To support future research on this topic, we open-source our datasets and model.
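The proposed data structure, a knowledge-graph fact coupled with key-value qualifiers and an explicit timestamp, might be represented roughly as follows; the field names and the example fact are illustrative, not the authors' actual schema:

```python
# Rough sketch of a hyper-relational temporal KG (HTKG) fact: a base triple,
# key-value qualifiers restricting its validity, and an explicit timestamp.
from dataclasses import dataclass

@dataclass(frozen=True)
class HTKGFact:
    head: str
    relation: str
    tail: str
    timestamp: str        # explicit time validity, the HTKG addition over HKGs
    qualifiers: tuple = ()  # ((key, value), ...) pairs, as in hyper-relational KGs

fact = HTKGFact(
    head="Barack Obama", relation="position held", tail="President of the USA",
    timestamp="2009", qualifiers=(("replaces", "George W. Bush"),),
)
print(fact.timestamp, dict(fact.qualifiers))
```

Dropping the `timestamp` field recovers an ordinary hyper-relational fact; dropping `qualifiers` as well recovers a plain temporal triple, which is how the two prior lines of work relate to the HTKG structure.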

MCML Authors
Zifeng Ding

Database Systems and Data Mining

Yan Xia

Dr.

* Former Member

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1373]
E. Garces Arias, J. Rodemann, M. Li, C. Heumann and M. Aßenmacher.
Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Decoding from the output distributions of large language models to produce high-quality text is a complex challenge in language modeling. Various approaches, such as beam search, sampling with temperature, top-k sampling, nucleus (top-p) sampling, typical decoding, contrastive decoding, and contrastive search, have been proposed to address this problem, aiming to improve coherence and diversity, as well as resemblance to human-generated text. In this study, we introduce adaptive contrastive search, a novel decoding strategy that extends contrastive search with an adaptive degeneration penalty, guided by the estimated uncertainty of the model at each generation step. This strategy is designed to enhance both the creativity and diversity of the language modeling process while at the same time producing coherent and high-quality generated text. Our findings indicate performance enhancement in both aspects, across different model architectures and datasets, underscoring the effectiveness of our method in text generation tasks. Our code base, datasets, and models are publicly available.
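A rough sketch of uncertainty-guided contrastive decoding: candidate tokens are scored by model probability minus a degeneration penalty (similarity to the generated prefix), with the penalty weight set from the normalized entropy of the current step's distribution. This is a simplification with synthetic probabilities and embeddings, not the paper's exact scheme:

```python
# Sketch of contrastive-search-style decoding with an adaptive degeneration
# penalty: alpha grows with the model's normalized entropy at each step.
# Probabilities and token embeddings below are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 10, 8
emb = rng.normal(size=(vocab, dim))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit token embeddings

def step(probs, context_ids):
    ent = -np.sum(probs * np.log(probs + 1e-12))
    alpha = ent / np.log(len(probs))            # high uncertainty -> stronger penalty
    ctx = emb[context_ids]                      # embeddings of the generated prefix
    cand = np.argsort(probs)[-4:]               # top-k candidate tokens
    sim = (emb[cand] @ ctx.T).max(axis=1)       # max similarity to any prefix token
    scores = (1 - alpha) * probs[cand] - alpha * sim
    return int(cand[np.argmax(scores)])

probs = rng.dirichlet(np.ones(vocab))
context = [3, 7]
print("next token:", step(probs, context))
```

Plain contrastive search keeps `alpha` fixed; tying it to the step entropy is the adaptive element this sketch tries to convey.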

MCML Authors
Esteban Garces Arias

Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[1372]
A. Köksal, T. Schick, A. Korhonen and H. Schütze.
LongForm: Effective Instruction Tuning with Reverse Instructions.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

Instruction tuning enables language models to more effectively generalize and better follow user intent. However, obtaining instruction data is costly and challenging. Prior work employs methods such as expensive human annotation, crowd-sourced datasets with alignment issues, and generating noisy examples via LLMs. We introduce the LongForm-C dataset, which is created by reverse instructions: we first select a diverse set of human-written documents from corpora such as C4 and Wikipedia, then generate instructions for these documents via LLMs. This approach provides a cheaper and cleaner instruction-tuning dataset with natural output, well suited for long text generation. Our models outperform 10x larger language models without instruction tuning on tasks such as story/recipe generation and long-form question answering. Moreover, LongForm models outperform prior instruction-tuned models such as FLAN-T5 and Alpaca by a large margin, and further improve language understanding capabilities.

MCML Authors
Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1371]
R. Liao, M. Erler, H. Wang, G. Zhai, G. Zhang, Y. Ma and V. Tresp.
VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

In the video-language domain, recent works in leveraging zero-shot Large Language Model-based reasoning for video understanding have become competitive challengers to previous end-to-end models. However, long video understanding presents unique challenges due to the complexity of reasoning over extended timespans, even for zero-shot LLM-based approaches. The challenge of information redundancy in long videos prompts the question of what specific information is essential for large language models (LLMs) and how to leverage them for complex spatial-temporal reasoning in long-form video analysis. We propose a framework VideoINSTA, i.e. INformative Spatial-TemporAl Reasoning for zero-shot long-form video understanding. VideoINSTA contributes (1) a zero-shot framework for long video understanding using LLMs; (2) an event-based temporal reasoning and content-based spatial reasoning approach for LLMs to reason over spatial-temporal information in videos; (3) a self-reflective information reasoning scheme balancing temporal factors based on information sufficiency and prediction confidence. Our model significantly improves the state-of-the-art on three long video question-answering benchmarks: EgoSchema, NextQA, and IntentQA, and the open question answering dataset ActivityNetQA.

MCML Authors
Ruotong Liao

Database Systems and Data Mining

Guangyao Zhai

Computer Aided Medical Procedures & Augmented Reality

Gengyuan Zhang

Database Systems and Data Mining

Yunpu Ma

Dr.

Database Systems and Data Mining

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1370]
B. Ma, X. Wang, T. Hu, A.-C. Haensch, M. A. Hedderich, B. Plank and F. Kreuter.
The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Recent advances in Large Language Models (LLMs) have sparked wide interest in validating and comprehending the human-like cognitive-behavioral traits LLMs may capture and convey. These traits typically include Attitudes, Opinions, and Values (AOVs). However, measuring AOVs embedded within LLMs remains opaque, and different evaluation methods may yield different results. This has led to a lack of clarity on how different studies relate to each other and how their findings can be interpreted. This paper aims to bridge this gap by providing a comprehensive overview of recent works on the evaluation of AOVs in LLMs. Moreover, we survey related approaches at different stages of the evaluation pipeline in these works. By doing so, we address the potential and challenges with respect to understanding the model, human-AI alignment, and downstream applications in the social sciences. Finally, we provide practical insights into evaluation methods, model enhancement, and interdisciplinary collaboration, thereby contributing to the evolving landscape of evaluating AOVs in LLMs.

MCML Authors
Xinpeng Wang

AI and Computational Linguistics

Anna-Carolina Haensch

Dr.

Social Data Science and AI

Michael Hedderich

Dr.

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics

Frauke Kreuter

Prof. Dr.

Social Data Science and AI


[1369]
A. Modarressi, A. Köksal and H. Schütze.
Consistent Document-Level Relation Extraction via Counterfactuals.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Many datasets have been developed to train and evaluate document-level relation extraction (RE) models. Most of these are constructed using real-world data. It has been shown that RE models trained on real-world data suffer from factual biases. To evaluate and address this issue, we present CovEReD, a counterfactual data generation approach for document-level relation extraction datasets using entity replacement. We first demonstrate that models trained on factual data exhibit inconsistent behavior: while they accurately extract triples from factual data, they fail to extract the same triples after counterfactual modification. This inconsistency suggests that models trained on factual data rely on spurious signals such as specific entities and external knowledge – rather than on the input context – to extract triples. We show that by generating document-level counterfactual data with CovEReD and training models on them, consistency is maintained with minimal impact on RE performance. We release our CovEReD pipeline as well as Re-DocRED-CF, a dataset of counterfactual RE documents, to assist in evaluating and addressing inconsistency in document-level RE.

MCML Authors
Ali Modarressi

Computational Linguistics

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1368]
A. Sedova, R. Litschko, D. Frassinelli, B. Roth and B. Plank.
To Know or Not To Know? Analyzing Self-Consistency of Large Language Models under Ambiguity.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

One of the major aspects contributing to the striking performance of large language models (LLMs) is the vast amount of factual knowledge accumulated during pre-training. Yet, many LLMs suffer from self-inconsistency, which raises doubts about their trustworthiness and reliability. This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities. To do so, we propose an evaluation protocol that disentangles knowing from applying knowledge, and test state-of-the-art LLMs on 49 ambiguous entities. Our experiments reveal that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts. The results also reveal systematic discrepancies in LLM behavior, showing that while the models may possess knowledge, they struggle to apply it consistently, exhibit biases toward preferred readings, and display self-inconsistencies. This highlights the need to address entity ambiguity in the future for more trustworthy LLMs.

MCML Authors
Robert Litschko

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1367]
M. Wang, L. Lange, H. Adel, J. Strötgen and H. Schütze.
Better Call SAUL: Fluent and Consistent Language Model Editing with Generation Regularization.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

To ensure large language models contain up-to-date knowledge, they need to be updated regularly. However, model editing is challenging as it might also affect knowledge that is unrelated to the new data. State-of-the-art methods identify parameters associated with specific knowledge and then modify them via direct weight updates. However, these locate-and-edit methods suffer from heavy computational overhead and lack theoretical validation. In contrast, directly fine-tuning the model on requested edits affects the model’s behavior on unrelated knowledge and significantly damages the model’s generation fluency and consistency. To address these challenges, we propose SAUL, a streamlined model editing method that uses sentence concatenation with augmented random facts for generation regularization. Evaluations on three model editing benchmarks show that SAUL is a practical and reliable solution for model editing, outperforming state-of-the-art methods while maintaining generation quality and reducing computational overhead.

MCML Authors
Mingyang Wang

Computational Linguistics

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1366]
O. Xhelili, Y. Liu and H. Schütze.
Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

Multilingual pre-trained models (mPLMs) have shown impressive performance on cross-lingual transfer tasks. However, the transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language, even though the two languages may be related or share parts of their vocabularies. Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method aiming to improve the cross-lingual alignment between languages using diverse scripts. We select two areal language groups, Mediterranean-Amharic-Farsi and South+East Asian Languages, wherein the languages are mutually influenced but use different scripts. We apply our method to these language groups and conduct extensive experiments on a spectrum of downstream tasks. The results show that after PPA, models consistently outperform the original model (up to 50% for some tasks) in English-centric transfer. In addition, when we use languages other than English as sources in transfer, our method obtains even larger improvements.

MCML Authors
Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1365]
A. Yüksel, A. Köksal, L. K. Senel, A. Korhonen and H. Schütze.
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

Multiple choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models (LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially introduces culturally biased questions, especially in social sciences. We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU, to evaluate LLMs’ understanding of the Turkish language. TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula. These questions are written by curriculum experts, suitable for the high-school curricula in Turkey, covering subjects ranging from natural sciences and math questions to more culturally representative topics such as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation, including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question difficulty analysis along with model performance. We provide an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the Turkish language.

MCML Authors

Lütfi Kerem Senel

Dr.

* Former Member


Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1364]
H. Zhang, J. Liu, Z. Han, S. Chen, B. He, V. Tresp, Z. Xu and J. Gu.
Visual Question Decomposition on Multimodal Large Language Models.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Question decomposition has emerged as an effective strategy for prompting Large Language Models (LLMs) to answer complex questions. However, while existing methods primarily focus on unimodal language models, the question decomposition capability of Multimodal Large Language Models (MLLMs) has yet to be explored. To this end, this paper explores visual question decomposition on MLLMs. Specifically, we introduce a systematic evaluation framework including a dataset and several evaluation criteria to assess the quality of the decomposed sub-questions, revealing that existing MLLMs struggle to produce high-quality sub-questions. To address this limitation, we propose a specific finetuning dataset, DecoVQA+, for enhancing the model’s question decomposition capability. Aiming at enabling models to perform appropriate selective decomposition, we propose an efficient finetuning pipeline. The finetuning pipeline consists of our proposed dataset and a training objective for selective decomposition. Finetuned MLLMs demonstrate significant improvements in the quality of sub-questions and the policy of selective question decomposition. Additionally, the models also achieve higher accuracy with selective decomposition on VQA benchmark datasets.

MCML Authors

Shuo Chen

Database Systems and Data Mining


Volker Tresp

Prof. Dr.

Database Systems and Data Mining


[1363]
R. Zhao, A. Köksal, Y. Liu, L. Weissweiler, A. Korhonen and H. Schütze.
SynthEval: Hybrid Behavioral Testing of NLP Models with Synthetic Evaluation.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

Traditional benchmarking in NLP typically involves using static held-out test sets. However, this approach often results in an overestimation of performance and lacks the ability to offer comprehensive, interpretable, and dynamic assessments of NLP models. Recently, works like DynaBench (Kiela et al., 2021) and CheckList (Ribeiro et al., 2020) have addressed these limitations through behavioral testing of NLP models with test types generated by a multistep human-annotated pipeline. Unfortunately, manually creating a variety of test types requires much human labor, often at prohibitive cost. In this work, we propose SYNTHEVAL, a hybrid behavioral testing framework that leverages large language models (LLMs) to generate a wide range of test types for a comprehensive evaluation of NLP models. SYNTHEVAL first generates sentences via LLMs using controlled generation, and then identifies challenging examples by comparing the predictions made by LLMs with task-specific NLP models. In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the task-specific models consistently exhibit. We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks.
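The second stage described in the abstract, identifying challenging examples by comparing LLM predictions with those of a task-specific model, reduces to a disagreement filter. A minimal sketch under our own naming, not the paper's implementation:

```python
def challenging_examples(sentences, llm_preds, model_preds):
    # Keep inputs on which the LLM and the task-specific model disagree;
    # these become candidates for expert inspection and template design.
    return [s for s, a, b in zip(sentences, llm_preds, model_preds) if a != b]
```

In practice the two prediction lists would come from controlled LLM generation and the task-specific classifier, respectively; any disagreement is only a candidate weakness until a human expert confirms the failure type.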

MCML Authors

Raoyuan Zhao

AI and Computational Linguistics


Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1362]
K. Hämmerl, A. Manea, G. Vico, J. Helcl and J. Libovický.
CUNI and LMU Submission to the MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval.
MRL @EMNLP 2024 - 4th Multilingual Representation Learning Workshop at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

We present the joint CUNI and LMU submission to the MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval. The shared task objective was to explore how we can deploy modern methods in NLP in multi-lingual low-resource settings, tested on two sub-tasks: named-entity recognition and question answering. Our solutions to the sub-tasks are based on data acquisition and model adaptation. We compare the performance of our submitted systems with the translate-test approach which proved to be the most useful in the previous edition of the shared task. Our results show that using more data as well as fine-tuning recent multilingual pre-trained models leads to considerable improvements over the translate-test baseline.

MCML Authors

Katharina Hämmerl

Data Analytics & Statistics


[1361]
J. Wang, L. Zuo, S. Peng and B. Plank.
MultiClimate: Multimodal Stance Detection on Climate Change Videos.
NLP4PI @EMNLP 2024 - 3rd Workshop on NLP for Positive Impact at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

Climate change (CC) has attracted increasing attention in NLP in recent years. However, detecting the stance on CC in multimodal data is understudied and remains challenging due to a lack of reliable datasets. To improve the understanding of public opinions and communication strategies, this paper presents MultiClimate, the first open-source manually-annotated stance detection dataset with 100 CC-related YouTube videos and 4,209 frame-transcript pairs. We deploy state-of-the-art vision and language models, as well as multimodal models for MultiClimate stance detection. Results show that text-only BERT significantly outperforms image-only ResNet50 and ViT. Combining both modalities achieves state-of-the-art performance of 0.747/0.749 in accuracy/F1. Our 100M-sized fusion models also beat CLIP and BLIP, as well as the much larger 9B-sized multimodal IDEFICS and text-only Llama3 and Gemma2, indicating that multimodal stance detection remains challenging for large language models.

MCML Authors

Siyao Peng

Dr.

AI and Computational Linguistics


Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[1360]
A. Mallol-Ragolta, M. Milling and B. W. Schuller.
Multi-Triplet Loss-Based Models for Categorical Depression Recognition from Speech.
IberSPEECH 2024 - 7th Conference IberSPEECH 2024. Aveiro, Portugal, Nov 11-13, 2024. PDF
Abstract

We analyse four different acoustic feature sets towards the automatic recognition of depression from speech signals. Specifically, the feature sets investigated are based on Mel-Frequency Cepstral Coefficients (MFCC), the Low-Level Descriptors (LLD) of the eGeMAPS feature set, Mel-spectrogram coefficients, and pretrained self-supervised Wav2Vec 2.0 representations. The main hypothesis investigated lies in the use of a multi-triplet loss to improve the inter-class separability of the data representations learnt in the embedding space, boosting, ultimately, the overall system performance. To assess this aspect, we implement three different techniques to perform the classification of the embedded representations learnt. These include the combination of two fully connected layers with softmax, a linear support vector classifier, and a clustering-based classifier with k-Means. We conduct our experiments on the Extended Distress Analysis Interview Corpus, released in the Detecting Depression Subchallenge (DDS) of the 9th Audio/Visual Emotion Challenge (AVEC), in 2019. We select the Unweighted Average Recall (UAR) as the evaluation metric. Our best model exploits the eGeMAPS-based feature set, optimises a triplet loss, and utilises a LinearSVC as the classifier. Tackling the task as a 6-class classification problem, this model scores a UAR of 25.7% on the test partition, an improvement of 9% absolute over the chance level.
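The two quantities at the heart of this abstract, the triplet loss and the UAR metric, can be sketched in a few lines of numpy. This is an illustrative simplification under our own naming, not the authors' implementation (which uses a multi-triplet variant over batches of embeddings):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Pull same-class embeddings (anchor, positive) together and push a
    # different-class embedding (negative) at least `margin` further away.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def unweighted_average_recall(y_true, y_pred, n_classes):
    # UAR: mean of per-class recalls; chance level is 1/n_classes,
    # e.g. ~16.7% for the 6-class setting above.
    recalls = [(y_pred[y_true == c] == c).mean()
               for c in range(n_classes) if (y_true == c).any()]
    return float(np.mean(recalls))
```

UAR is preferred over plain accuracy here because depression classes are heavily imbalanced, and averaging recalls weights every class equally.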

MCML Authors

Adria Mallol-Ragolta

Health Informatics


Björn Schuller

Prof. Dr.

Health Informatics


[1359]
A. Mallol-Ragolta, A. Spiesberger, A. B. Salvador and B. W. Schuller.
Prototypical Networks for Speech Emotion Recognition in Spanish.
IberSPEECH 2024 - 7th Conference IberSPEECH 2024. Aveiro, Portugal, Nov 11-13, 2024. PDF
Abstract

We explore the utilisation of prototypical networks in the Speech Emotion Recognition (SER) problem, creating prototypical representations of the targeted emotions in the embedding space. We hypothesise this technique can help to improve the performance and robustness of the models, in comparison to standard classification-based approaches. We investigate two approaches to train the prototypes: one optimising a triplet loss, and the other minimising a prototypical loss. To assess our hypothesis, we exploit the EmoMatchSpanishDB Corpus, a novel dataset for SER in Spanish, which includes speech samples conveying the six basic emotions defined by Paul Ekman, in addition to the neutral state. We methodologically split the available samples into three speaker-independent train, development, and test partitions. The proposed splitting is not only balanced in terms of the speakers’ gender, but also homogenised in terms of their recognition difficulty. We analyse the performance of our models with a gender perspective. The models exploit the eGeMAPS and the wav2vec 2.0 feature representations extracted from the speech samples. We choose the Unweighted Average Recall (UAR) as the evaluation metric to assess the models’ performance. The chance level UAR for a seven-class classification problem is 14.3%. The models optimising the prototypical loss obtain the highest UAR scores on the test set, 52.0% and 52.7%, with the eGeMAPS and the wav2vec 2.0 representations, respectively. Nevertheless, the best performances are obtained with a Support Vector Classifier (SVC) implementing a radial basis function kernel, with a UAR of 54.4% and 56.9% when exploiting the eGeMAPS and the wav2vec 2.0 representations, respectively.
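The prototypical loss mentioned in the abstract has a simple core: each emotion's prototype is the mean of its embeddings, and a query is scored by a softmax over negative distances to the prototypes. A minimal numpy sketch under our own naming, not the authors' implementation:

```python
import numpy as np

def prototypes(embeddings, labels, n_classes):
    # Class prototype = mean embedding of that class's support samples.
    return np.stack([embeddings[labels == c].mean(axis=0)
                     for c in range(n_classes)])

def prototypical_loss(query, label, protos):
    # Softmax over negative Euclidean distances to the prototypes;
    # minimise the negative log-likelihood of the correct emotion class.
    logits = -np.linalg.norm(protos - query, axis=1)
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]
```

At inference time, classification reduces to picking the nearest prototype, which is what makes the representation directly comparable to the distance-based triplet-loss variant the paper also trains.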

MCML Authors

Adria Mallol-Ragolta

Health Informatics


Anika Spiesberger

Health Informatics


Björn Schuller

Prof. Dr.

Health Informatics


[1358]
A. Mallol-Ragolta, A. Spiesberger and B. W. Schuller.
Face Mask Type and Coverage Area Recognition from Speech with Prototypical Networks.
IberSPEECH 2024 - 7th Conference IberSPEECH 2024. Aveiro, Portugal, Nov 11-13, 2024. PDF
Abstract

We investigate the use of prototypical networks on the problems of face mask type (3 classes), face mask coverage area (3 classes), and face mask type and coverage area (5 classes) recognition from speech. We explore the MASCFLICHT Corpus, a dataset containing 2 h 27 m 55 s of speech data from 30 German speakers recorded with a smartphone. We extract formant-related features and the spectrogram representations from the samples. We enrich the spectrograms by overlaying the traces of the central frequency of the first four formants. Our experiments also consider the fusion via concatenation of the embedded representations extracted from the formant-related features and the spectrogram representations. We implement classification- and prototypical encoder-based networks. The results obtained on the test sets support the suitability of the prototypical encoder models, scoring an Unweighted Average Recall (UAR) of 49.9%, 45.0%, and 31.6% on the three considered problems, respectively.

MCML Authors

Adria Mallol-Ragolta

Health Informatics


Anika Spiesberger

Health Informatics


Björn Schuller

Prof. Dr.

Health Informatics


[1357]
A. Bashardoust, S. Feuerriegel and Y. R. Shrestha.
Comparing the Willingness to Share for Human-generated vs. AI-generated Fake News.
CSCW 2024 - 27th ACM SIGCHI Conference on Computer-Supported Cooperative Work and Social Computing. San José, Costa Rica, Nov 09-13, 2024. DOI
Abstract

Generative artificial intelligence (AI) presents large risks for society when it is used to create fake news. A crucial factor for fake news to go viral on social media is that users share such content. Here, we aim to shed light on the sharing behavior of users across human-generated vs. AI-generated fake news. Specifically, we study: (1) What is the perceived veracity of human-generated fake news vs. AI-generated fake news? (2) What is the user’s willingness to share human-generated fake news vs. AI-generated fake news on social media? (3) What socio-economic characteristics let users fall for AI-generated fake news? To this end, we conducted a pre-registered, online experiment with N = 988 subjects and 20 fake news items from the COVID-19 pandemic generated by GPT-4 vs. humans. Our findings show that AI-generated fake news is perceived as less accurate than human-generated fake news, but both tend to be shared equally. Further, several socio-economic factors explain who falls for AI-generated fake news.

MCML Authors

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1356]
D. Geißler and S. Feuerriegel.
Analyzing the Strategy of Propaganda using Inverse Reinforcement Learning: Evidence from the 2022 Russian Invasion of Ukraine.
CSCW 2024 - 27th ACM SIGCHI Conference on Computer-Supported Cooperative Work and Social Computing. San José, Costa Rica, Nov 09-13, 2024. DOI
Abstract

The 2022 Russian invasion of Ukraine was accompanied by a large-scale, pro-Russian propaganda campaign on social media. However, the strategy behind the dissemination of propaganda has remained unclear, particularly how the online discourse was strategically shaped by the propagandists’ community. Here, we analyze the strategy of the Twitter community using an inverse reinforcement learning (IRL) approach. Specifically, IRL allows us to model online behavior as a Markov decision process, where the goal is to infer the underlying reward structure that guides propagandists when interacting with users with a supporting or opposing stance toward the invasion. Thereby, we aim to understand empirically whether and how between-user interactions are strategically used to promote the proliferation of Russian propaganda. For this, we leverage a large-scale dataset with 349,455 posts with pro-Russian propaganda from 132,131 users. We show that bots and humans follow a different strategy: bots respond predominantly to pro-invasion messages, suggesting that they seek to drive virality; while messages indicating opposition primarily elicit responses from humans, suggesting that they tend to engage in critical discussions. To the best of our knowledge, this is the first study analyzing the strategy behind propaganda from the 2022 Russian invasion of Ukraine through the lens of IRL.

MCML Authors

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1355]
A. Maarouf, N. Pröllochs and S. Feuerriegel.
The Virality of Hate Speech on Social Media.
CSCW 2024 - 27th ACM SIGCHI Conference on Computer-Supported Cooperative Work and Social Computing. San José, Costa Rica, Nov 09-13, 2024. DOI
Abstract

Online hate speech is responsible for violent attacks such as the Pittsburgh synagogue shooting in 2018, thereby posing a significant threat to vulnerable groups and society in general. However, little is known about what makes hate speech on social media go viral. In this paper, we collect N = 25,219 cascades with 65,946 retweets from X (formerly known as Twitter) and classify them as hateful vs. normal. Using a generalized linear regression, we then estimate differences in the spread of hateful vs. normal content based on author and content variables. We thereby identify important determinants that explain differences in the spreading of hateful vs. normal content. For example, hateful content authored by verified users is disproportionally more likely to go viral than hateful content from non-verified ones: hateful content from a verified user (as opposed to normal content) has a 3.5 times larger cascade size, a 3.2 times longer cascade lifetime, and a 1.2 times larger structural virality. Altogether, we offer novel insights into the virality of hate speech on social media.
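Multiplicative effects like "3.5 times larger cascade size" arise from a log-link regression, where exponentiated coefficients are interpretable as multiplicative factors. A deliberately simplified sketch, approximating the log link by ordinary least squares on log-transformed outcomes; the helper name and feature layout are ours, not the paper's:

```python
import numpy as np

def fit_loglinear(X, y):
    # Approximate a log-link GLM via OLS on log(y):
    #   log E[cascade size] ~ b0 + b1*x1 + ...
    # so exp(b_k) is the multiplicative effect of feature k.
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, np.log(y), rcond=None)
    return beta
```

A proper analysis would fit the GLM with its actual link and error family (e.g., via a statistics package) and include the author and content covariates named in the abstract; this sketch only shows where the "times larger" interpretation comes from.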

MCML Authors

Abdurahman Maarouf

Artificial Intelligence in Management


Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1354]
I. M. Grigore, G. M. Tavares and S. Barbon Junior.
Beyond Flattening: Detecting Concurrency Anomalies Using K-NN Graph-Based Modeling in Object-Centric Event Logs.
DATAMOD @SEFM 2024 - 12th International Symposium From Data to Models and Back at the 22nd International Conference of Software Engineering and Formal Methods (SEFM 2024). Aveiro, Portugal, Nov 04-05, 2024. DOI
Abstract

Detecting anomalous executions is essential in today’s dynamic and diverse business environments. It plays a pivotal role in identifying inefficiencies, ensuring compliance, and mitigating risks associated with deviations from standard procedures. Traditional process mining techniques generally assume a linear sequence of events. However, real-world processes often present concurrency, characterized by the parallel execution of multiple activities or cases and complex interactions among events. These behaviors are not captured by conventional linear models, which therefore fail to accurately reflect the dynamic nature of process flows. To tackle this challenge, this study proposes a new approach for detecting concurrency anomalies using a K-NN graph-based model, overcoming the traditional flattening method. In our experiments, we explored object-centric event logs with different types of concurrency anomalies and compared them to the traditional flattening procedure. Our proposal was able to provide comprehensive and precise communities (clusters) of anomalous variants compared to the baseline.

MCML Authors

Gabriel Marques Tavares

Dr.

Database Systems and Data Mining


[1353]
C. Kern, R. Bach, H. Mautner and F. Kreuter.
When Small Decisions Have Big Impact: Fairness Implications of Algorithmic Profiling Schemes.
ACM Journal on Responsible Computing (Nov. 2024). DOI
Abstract

Algorithmic profiling is increasingly used in the public sector with the hope of allocating limited public resources more effectively and objectively. One example is the prediction-based profiling of job seekers to guide the allocation of support measures by public employment services. However, empirical evaluations of potential side-effects such as unintended discrimination and fairness concerns are rare in this context. We systematically compare and evaluate statistical models for predicting job seekers’ risk of becoming long-term unemployed concerning subgroup prediction performance, fairness metrics, and vulnerabilities to data analysis decisions. Focusing on Germany as a use case, we evaluate profiling models under realistic conditions using large-scale administrative data. We show that despite achieving high prediction performance on average, profiling models can be considerably less accurate for vulnerable social subgroups. In this setting, different classification policies can have very different fairness implications. We therefore call for rigorous auditing processes before such models are put to practice.

MCML Authors

Christoph Kern

Prof. Dr.

Social Data Science and AI Lab


Frauke Kreuter

Prof. Dr.

Social Data Science and AI


[1352]
Q. Li, S. Krapf, L. Mou, Y. Shi and X. Zhu.
Deep learning-based framework for city-scale rooftop solar potential estimation by considering roof superstructures.
Applied Energy 374.123839 (Nov. 2024). DOI
Abstract

Solar energy is an environmentally friendly energy source. Identifying suitable rooftops for solar panel installation contributes to not only sustainable energy plans but also carbon neutrality goals. Aerial imagery, bolstered by its growing availability, is a cost-effective data source for rooftop solar potential assessment at large scale. Existing studies generally do not take roof superstructures into account when determining how many solar panels can be installed. This procedure will lead to an overestimation of solar potential. Only a few works have considered this issue, but none have devised a network that can simultaneously learn roof orientations and roof superstructures. Therefore, we devise SolarNet+, a novel framework to improve the precision of rooftop solar potential estimation. After implementing SolarNet+ on a benchmark dataset, we find that SolarNet+ outperforms other state-of-the-art approaches in both tasks — roof orientations and roof superstructure segmentation. Moreover, the SolarNet+ framework enables rooftop solar estimation at large-scale applications for investigating the correlation between urban rooftop solar potential and various local climate zone (LCZ) types. The results in the city of Brussels reveal that three specific LCZ urban types exhibit the highest rooftop solar potential efficiency: compact highrise (LCZ1), compact midrise (LCZ2), and heavy industry (LCZ10). The annual photovoltaic potential for these LCZ types is reported as 10.56, 11.77, and 10.70, respectively.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1351]
K. D. Bartl-Pokorny, C. Zitta, M. Beirit, G. Vogrinec, B. W. Schuller and F. B. Pokorny.
Focused review on artificial intelligence for disease detection in infants.
Frontiers in Digital Health 6 (Nov. 2024). DOI
Abstract

Over the last years, studies using artificial intelligence (AI) for the detection and prediction of diseases have increased and also concentrated more and more on vulnerable groups of individuals, such as infants. The release of ChatGPT demonstrated the potential of large language models (LLMs) and heralded a new era of AI with manifold application possibilities. However, the impact of this new technology on medical research cannot be fully estimated yet. In this work, we therefore aimed to summarise the most recent pre-ChatGPT developments in the field of automated detection and prediction of diseases and disease status in infants, i.e., within the first 12 months of life. For this, we systematically searched the scientific databases PubMed and IEEE Xplore for original articles published within the last five years preceding the release of ChatGPT (2018–2022). The search revealed 927 articles; a final number of 154 articles was included for review. First of all, we examined research activity over time. Then, we analysed the articles from 2022 for medical conditions, data types, tasks, AI approaches, and reported model performance. A clear trend of increasing research activity over time could be observed. The most recently published articles focused on medical conditions of twelve different ICD-11 categories; “certain conditions originating in the perinatal period” was the most frequently addressed disease category. AI models were trained with a variety of data types, among which clinical and demographic information and laboratory data were most frequently exploited. The most frequently performed tasks aimed to detect present diseases, followed by the prediction of diseases and disease status at a later point in development. Deep neural networks turned out to be the most popular AI approach, even though traditional methods, such as random forests and support vector machines, still play a role—presumably due to their explainability or better suitability when the amount of data is limited. Finally, the reported performances in many of the reviewed articles suggest that AI has the potential to assist in diagnostic procedures for infants in the near future. LLMs will boost developments in this field in the upcoming years.

MCML Authors

Björn Schuller

Prof. Dr.

Health Informatics


[1350]
Y. Wang, H. Hernández Hernández, C. M. Albrecht and X. Zhu.
Feature Guided Masked Autoencoder for Self-Supervised Learning in Remote Sensing.
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 18 (Nov. 2024). DOI
Abstract

Self-supervised learning guided by masked image modeling, such as masked autoencoder (MAE), has attracted wide attention for pretraining vision transformers in remote sensing. However, MAE tends to excessively focus on pixel details, limiting the model’s capacity for semantic understanding, particularly for noisy synthetic aperture radar (SAR) images. In this article, we explore spectral and spatial remote sensing image features as improved MAE-reconstruction targets. We first conduct a study on reconstructing various image features, all performing comparably well or better than raw pixels. Based on such observations, we propose feature guided MAE (FG-MAE): reconstructing a combination of histograms of oriented gradients (HOG) and normalized difference indices (NDI) for multispectral images, and reconstructing HOG for SAR images. Experimental results on three downstream tasks illustrate the effectiveness of FG-MAE with a particular boost for SAR imagery (e.g., up to 5% better than MAE on EuroSAT-SAR). Furthermore, we demonstrate the well-inherited scalability of FG-MAE and release a first series of pretrained vision transformers for medium-resolution SAR and multispectral images.
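The HOG-style reconstruction target used by FG-MAE can be illustrated with a bare-bones gradient-orientation histogram. This sketch is our simplification: it omits the cell/block structure and block normalisation of full HOG, and is not the FG-MAE implementation:

```python
import numpy as np

def hog_target(patch, n_bins=9):
    # Gradient-orientation histogram over an image patch, weighted by
    # gradient magnitude and L1-normalised: a crude stand-in for the HOG
    # features a feature-guided MAE would reconstruct instead of raw pixels.
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

Because the histogram aggregates oriented gradients rather than individual pixel values, it is far less sensitive to the speckle noise of SAR imagery, which is the intuition behind using it as a reconstruction target.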

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1349]
Y. Yang, X. Sun, J. Dong, K.-M. Lam and X. Zhu.
Attention-ConvNet Network for Ocean-Front Prediction via Remote Sensing SST Images.
IEEE Transactions on Geoscience and Remote Sensing 62 (Nov. 2024). DOI GitHub
Abstract

Ocean fronts are typical geophysical phenomena that act as oases in the ocean for fish and marine mammals. Accurate ocean-front prediction is critical for fishery and navigation safety. However, the formation and evolution of ocean fronts are inherently nonlinear and are influenced by various factors such as ocean currents, wind fields, and temperature changes, making ocean-front prediction a considerable challenge. This study proposes a temporal-sensitive network named Attention-ConvNet to address this challenge. Ocean fronts exhibit significant multiscale characteristics, requiring analysis and prediction across various temporal and spatial scales. The proposed network designs a hierarchical attention mechanism (HAM) that efficiently prioritizes relevant spatial and temporal information to meet this specific requirement. Moreover, the proposed network uses a complex hierarchical branching convolutional network (HBCNet) architecture, which allows it to leverage the complementary strengths of spatial and temporal information, effectively capturing the dynamic and complex variations in ocean fronts. In general, the network prioritizes and focuses on the most relevant information of front dynamics, which ensures its ability to effectively predict the ocean front. External experiments demonstrate that our network significantly outperforms conventional methods, confirming its capability for precise ocean-front prediction.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1348]
W. Yu, X. Zhang, R. Gloaguen, X. Zhu and P. Ghamisi.
MineNetCD: A Benchmark for Global Mining Change Detection on Remote Sensing Imagery.
IEEE Transactions on Geoscience and Remote Sensing 62 (Nov. 2024). DOI
Abstract

Monitoring land changes triggered by mining activities is crucial for industrial control, environmental management, and regulatory compliance, yet it poses significant challenges due to the vast and often remote locations of mining sites. Remote sensing technologies have increasingly become indispensable to detect and analyze these changes over time. We thus introduce MineNetCD, a comprehensive benchmark designed for global mining change detection using remote sensing imagery. The benchmark comprises three key contributions. First, we establish a global mining change detection dataset featuring more than 70k paired patches of bitemporal high-resolution remote sensing images and pixel-level annotations from 100 mining sites worldwide. Second, we develop a novel baseline model based on a change-aware fast Fourier transform (ChangeFFT) module, which enhances various backbones by leveraging essential spectrum components within features in the frequency domain and capturing the channelwise correlation of bitemporal feature differences to learn change-aware representations. Third, we construct a unified change detection (UCD) framework that currently integrates 20 change detection methods. This framework is designed for streamlined and efficient processing, using the cloud platform hosted by HuggingFace. Extensive experiments have been conducted to demonstrate the superiority of the proposed baseline model compared with 19 state-of-the-art change detection approaches. Empirical studies on modularized backbones comprehensively confirm the efficacy of different representation learners on change detection. This benchmark represents significant advancements in the field of remote sensing and change detection, providing a robust resource for future research and applications in global mining monitoring.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[1347]
M. F. Azampour, K. Mach, E. Fatemizadeh, B. Demiray, K. Westenfelder, K. Steiger, M. Eiber, T. Wendler, B. Kainz and N. Navab.
Multitask Weakly Supervised Generative Network for MR-US Registration.
IEEE Transactions on Medical Imaging 43.11 (Nov. 2024). DOI
Abstract

Registering pre-operative modalities, such as magnetic resonance imaging or computed tomography, to ultrasound images is crucial for guiding clinicians during surgeries and biopsies. Recently, deep-learning approaches have been proposed to increase the speed and accuracy of this registration problem. However, all of these approaches need expensive supervision from the ultrasound domain. In this work, we propose a multitask generative framework that needs weak supervision only from the pre-operative imaging domain during training. To perform a deformable registration, the proposed framework translates a magnetic resonance image to the ultrasound domain while preserving the structural content. To demonstrate the efficacy of the proposed method, we tackle the registration problem of pre-operative 3D MR to transrectal ultrasonography images as necessary for targeted prostate biopsies. We use an in-house dataset of 600 patients, divided into 540 for training, 30 for validation, and the remaining for testing. An expert manually segmented the prostate in both modalities for validation and test sets to assess the performance of our framework. The proposed framework achieves a 3.58 mm target registration error on the expert-selected landmarks, 89.2% in the Dice score, and 1.81 mm 95th percentile Hausdorff distance on the prostate masks in the test set. Our experiments demonstrate that the proposed generative model successfully translates magnetic resonance images into the ultrasound domain. The translated image contains the structural content and fine details due to an ultrasound-specific two-path design of the generative model. The proposed framework enables training learning-based registration methods while only weak supervision from the pre-operative domain is available.

MCML Authors

Mohammad Farid Azampour

Computer Aided Medical Procedures & Augmented Reality


Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[1346]
T. Woehrle, F. Pfeiffer, M. M. Mandl, W. Sobtzick, J. Heitzer, A. Krstova, L. Kamm, M. Feuerecker, D. Moser, M. Klein, B. Aulinger, M. Dolch, A.-L. Boulesteix, D. Lanz and A. Choukér.
Point-of-care breath sample analysis by semiconductor-based E-Nose technology discriminates non-infected subjects from SARS-CoV-2 pneumonia patients: a multi-analyst experiment.
MedComm 5.11 (Nov. 2024). DOI
Abstract

Metal oxide sensor-based electronic nose (E-Nose) technology provides an easy to use method for breath analysis by detection of volatile organic compound (VOC)-induced changes of electrical conductivity. Resulting signal patterns are then analyzed by machine learning (ML) algorithms. This study aimed to establish breath analysis by E-Nose technology as a diagnostic tool for severe acute respiratory syndrome coronavirus type 2 (SARS-CoV-2) pneumonia within a multi-analyst experiment. Breath samples of 126 subjects with (n = 63) or without SARS-CoV-2 pneumonia (n = 63) were collected using the ReCIVA® Breath Sampler, enriched and stored on Tenax sorption tubes, and analyzed using an E-Nose unit with 10 sensors. ML approaches were applied by three independent data analyst teams and included a wide range of classifiers, hyperparameters, training modes, and subsets of training data. Within the multi-analyst experiment, all teams successfully classified individuals as infected or uninfected with an averaged area under the curve (AUC) larger than 90% and misclassification error lower than 19%, and identified the same sensor as most relevant to classification success. This new method using VOC enrichment and E-Nose analysis combined with ML can yield results similar to polymerase chain reaction (PCR) detection and superior to point-of-care (POC) antigen testing. Reducing the sensor set to the most relevant sensor may prove interesting for developing targeted POC testing.
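The pipeline described above (sensor signal patterns in, infected/uninfected out, scored by AUC, with a most-relevant sensor identified) can be sketched on synthetic data. The feature layout and classifier are hypothetical: the study's teams used a wide range of ML approaches, while this toy uses plain logistic regression and a rank-based AUC:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for 10-sensor E-Nose conductivity features
# (the study had 126 breath samples, 63 per class; layout here is invented).
n, n_sensors = 126, 10
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, n_sensors))
X[y == 1, 3] += 1.5  # pretend sensor 3 carries the class signal

# Plain logistic regression via full-batch gradient descent (no external deps)
Xb = np.hstack([X, np.ones((n, 1))])
w = np.zeros(Xb.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    w -= 0.1 * Xb.T @ (p - y) / n

scores = Xb @ w

def auc(y_true, s):
    """Area under the ROC curve via the Mann-Whitney rank statistic."""
    order = np.argsort(s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(f"training AUC: {auc(y, scores):.2f}")
print("most relevant sensor:", np.argmax(np.abs(w[:-1])))
```

Inspecting the fitted weights, as in the last line, is a crude analogue of the study's finding that all analyst teams converged on the same most-relevant sensor.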

MCML Authors
Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine


[1345]
C. Geldhauser and C. Kuehn.
Travelling waves for discrete stochastic bistable equations.
Partial Differential Equations and Applications 5.35 (Nov. 2024). DOI
Abstract

Many physical, chemical and biological systems have an inherent discrete spatial structure that strongly influences their dynamical behaviour. Similar remarks apply to internal or external noise. In this paper we study the combined effect of spatial discretization and stochastic perturbations on travelling waves in the Nagumo equation, which is a prototypical model for bistable reaction-diffusion partial differential equations (PDEs). We prove that under suitable parameter conditions, various discrete-stochastic variants of the Nagumo equation have solutions, which stay close on long time scales to the classical monotone Nagumo front with high probability if the noise covariance and spatial discretization are sufficiently small.
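An Euler-Maruyama simulation of such a spatially discrete stochastic Nagumo lattice (illustrative parameters, not those of the paper) shows a front connecting the two stable states that persists under small noise:

```python
import numpy as np

# du_i = [ d*(u_{i+1} - 2 u_i + u_{i-1}) + u_i (1 - u_i)(u_i - a) ] dt + sigma dW_i
rng = np.random.default_rng(1)
N, a, d, sigma, dt, steps = 200, 0.3, 1.0, 0.01, 0.01, 5000

x = np.arange(N)
u = (x > N // 2).astype(float)  # step initial condition near a front profile

for _ in range(steps):
    lap = np.roll(u, -1) - 2 * u + np.roll(u, 1)
    lap[0] = u[1] - u[0]      # Neumann boundary conditions
    lap[-1] = u[-2] - u[-1]
    drift = d * lap + u * (1 - u) * (u - a)
    u += drift * dt + sigma * np.sqrt(dt) * rng.normal(size=N)

# Endpoints stay near the stable states 0 and 1; the front travels slowly
print(u[:3].round(2), u[-3:].round(2))
```

For a < 1/2 the front invades the u = 0 state, and for small sigma the noisy profile stays close to the deterministic monotone front, in line with the high-probability statement of the paper.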

MCML Authors
Carina Geldhauser

Dr.

* Former Member


[1344]
S. Nyholm.
Digital Duplicates and Personal Scarcity: Reply to Voinea et al. and Lundgren.
Philosophy and Technology 37.132 (Nov. 2024). DOI
Abstract

In our recent paper in this journal (‘Digital Duplicates and the Scarcity Problem: Might AI Make Us Less Scarce and Therefore Less Valuable?’, Danaher & Nyholm (2024)), John Danaher and I discussed the possibility of creating digital duplicates of particular people (e.g. by means of creating fine-tuned language models whose outputs sound like those of a particular person). We were specifically interested in how this might be seen as affecting the value of particular people as unique individuals and as scarce resources…

MCML Authors
Sven Nyholm

Prof. Dr.

Ethics of Artificial Intelligence


[1343]
Y. Li, Y. Zhang, K. Kawaguchi, A. Khakzar, B. Bischl and M. Rezaei.
A Dual-Perspective Approach to Evaluating Feature Attribution Methods.
Transactions on Machine Learning Research (Nov. 2024). URL
Abstract

Feature attribution methods attempt to explain neural network predictions by identifying relevant features. However, establishing a cohesive framework for assessing feature attribution remains a challenge. There are several views through which we can evaluate attributions. One principal lens is to observe the effect of perturbing attributed features on the model’s behavior (i.e., faithfulness). While providing useful insights, existing faithfulness evaluations suffer from shortcomings that we reveal in this paper. In this work, we propose two new perspectives within the faithfulness paradigm that reveal intuitive properties: soundness and completeness. Soundness assesses the degree to which attributed features are truly predictive features, while completeness examines how well the resulting attribution reveals all the predictive features. The two perspectives are based on a firm mathematical foundation and provide quantitative metrics that are computable through efficient algorithms. We apply these metrics to mainstream attribution methods, offering a novel lens through which to analyze and compare feature attribution methods.
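The deletion-style perturbation test underlying faithfulness evaluations can be illustrated on a toy linear model. This is a generic sketch of the perturbation idea, not the paper's soundness/completeness metrics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "model": only the first 3 of 10 features are predictive.
w_true = np.array([2.0, -1.5, 1.0] + [0.0] * 7)
model = lambda X: X @ w_true

X = rng.normal(size=(500, 10))
baseline = model(X)

def deletion_drop(attribution, k):
    """Mean |prediction change| after zeroing the top-k attributed features."""
    top = np.argsort(-np.abs(attribution))[:k]
    Xp = X.copy()
    Xp[:, top] = 0.0
    return float(np.mean(np.abs(model(Xp) - baseline)))

good_attr = np.abs(w_true)                  # points at the truly predictive features
bad_attr = np.zeros(10)
bad_attr[7] = 1.0                           # points at a null feature
print(deletion_drop(good_attr, 3), deletion_drop(bad_attr, 1))
```

A faithful attribution produces a large prediction drop when its top features are removed; the misleading attribution pointing at a null feature produces none. The paper's two perspectives refine this picture: soundness asks whether attributed features are truly predictive, completeness whether all predictive features are recovered.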

MCML Authors
Yawei Li

Statistical Learning and Data Science

Ashkan Khakzar

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Mina Rezaei

Dr.

Statistical Learning and Data Science


[1342]
D. Bär, A. Maarouf and S. Feuerriegel.
Generative AI may backfire for counterspeech.
Preprint (Nov. 2024). arXiv
Abstract

Online hate speech poses a serious threat to individual well-being and societal cohesion. A promising solution to curb online hate speech is counterspeech. Counterspeech is aimed at encouraging users to reconsider hateful posts by direct replies. However, current methods lack scalability due to the need for human intervention or fail to adapt to the specific context of the post. A potential remedy is the use of generative AI, specifically large language models (LLMs), to write tailored counterspeech messages. In this paper, we analyze whether contextualized counterspeech generated by state-of-the-art LLMs is effective in curbing online hate speech. To do so, we conducted a large-scale, pre-registered field experiment (N=2,664) on the social media platform Twitter/X. Our experiment followed a 2x2 between-subjects design and, additionally, a control condition with no counterspeech. On the one hand, users posting hateful content on Twitter/X were randomly assigned to receive either (a) contextualized counterspeech or (b) non-contextualized counterspeech. Here, the former is generated through LLMs, while the latter relies on predefined, generic messages. On the other hand, we tested two counterspeech strategies: (a) promoting empathy and (b) warning about the consequences of online misbehavior. We then measured whether users deleted their initial hateful posts and whether their behavior changed after the counterspeech intervention (e.g., whether users adopted a less toxic language). We find that non-contextualized counterspeech employing a warning-of-consequence strategy significantly reduces online hate speech. However, contextualized counterspeech generated by LLMs proves ineffective and may even backfire.

MCML Authors
Dominik Bär

Artificial Intelligence in Management

Abdurahman Maarouf

Artificial Intelligence in Management

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1341]
F. Bongratz, M. Karmann, A. Holz, M. Bonhoeffer, V. Neumaier, S. Deli, B. Schmitz-Koep, C. Zimmer, C. Sorg, M. Thalhammer, D. M. Hedderich and C. Wachinger.
MLV2-Net: Rater-Based Majority-Label Voting for Consistent Meningeal Lymphatic Vessel Segmentation.
Preprint (Nov. 2024). arXiv
Abstract

Meningeal lymphatic vessels (MLVs) are responsible for the drainage of waste products from the human brain. An impairment in their functionality has been associated with aging as well as brain disorders like multiple sclerosis and Alzheimer’s disease. However, MLVs have only recently been described for the first time in magnetic resonance imaging (MRI), and their ramified structure renders manual segmentation particularly difficult. Further, as there is no consistent notion of their appearance, human-annotated MLV structures contain a high inter-rater variability that most automatic segmentation methods cannot take into account. In this work, we propose a new rater-aware training scheme for the popular nnU-Net model, and we explore rater-based ensembling strategies for accurate and consistent segmentation of MLVs. This enables us to boost nnU-Net’s performance while obtaining explicit predictions in different annotation styles and a rater-based uncertainty estimation. Our final model, MLV2-Net, achieves a Dice similarity coefficient of 0.806 with respect to the human reference standard. The model further matches the human inter-rater reliability and replicates age-related associations with MLV volume.
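Majority-label voting over rater-specific predictions reduces, per voxel, to thresholding the fraction of positive votes; a minimal sketch with hypothetical masks, not the MLV2-Net ensembling code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Five hypothetical rater-style binary segmentations of the same 4x4 slice
# (e.g. one prediction per annotation style from a rater-aware model).
raters = np.stack([rng.random((4, 4)) > 0.5 for _ in range(5)])

votes = raters.mean(axis=0)                 # per-voxel fraction of positive raters
majority = (votes >= 0.5).astype(np.uint8)  # majority-label mask
uncertainty = votes * (1 - votes)           # largest where raters disagree most
print(majority)
```

The vote fractions double as a simple rater-based uncertainty map: voxels on which the rater-specific predictions disagree get values near the 0.25 maximum of votes * (1 - votes).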

MCML Authors
Fabian Bongratz

Artificial Intelligence in Medical Imaging

Christian Wachinger

Prof. Dr.

Artificial Intelligence in Medical Imaging


[1340]
K. Flöge, M. A. Moeed and V. Fortuin.
Stein Variational Newton Neural Network Ensembles.
Preprint (Nov. 2024). arXiv
Abstract

Deep neural network ensembles are powerful tools for uncertainty quantification, which have recently been re-interpreted from a Bayesian perspective. However, current methods inadequately leverage second-order information of the loss landscape, despite the recent availability of efficient Hessian approximations. We propose a novel approximate Bayesian inference method that modifies deep ensembles to incorporate Stein Variational Newton updates. Our approach uniquely integrates scalable modern Hessian approximations, achieving faster convergence and more accurate posterior distribution approximations. We validate the effectiveness of our method on diverse regression and classification tasks, demonstrating superior performance with a significantly reduced number of training epochs compared to existing ensemble-based methods, while enhancing uncertainty quantification and robustness against overfitting.
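For orientation, here is a sketch of the vanilla first-order SVGD update with an RBF kernel and the median bandwidth heuristic; the paper's Newton variant replaces this gradient direction with a Hessian-preconditioned one. Toy Gaussian target, illustrative step size:

```python
import numpy as np

def svgd_step(particles, grad_logp, step=0.5):
    """One first-order SVGD update: phi(x_i) = (1/n) sum_j [k(x_j, x_i) grad log p(x_j)
    + grad_{x_j} k(x_j, x_i)], with an RBF kernel k(x, y) = exp(-|x - y|^2 / h)."""
    n = len(particles)
    diff = particles[:, None, :] - particles[None, :, :]   # diff[i, j] = x_i - x_j
    sq = (diff ** 2).sum(-1)                               # pairwise squared distances
    h = np.median(sq) / np.log(n + 1) + 1e-8               # median bandwidth heuristic
    K = np.exp(-sq / h)
    # Repulsion: sum_j grad_{x_j} k(x_j, x_i) = sum_j (2/h) K[i, j] (x_i - x_j)
    repulsion = (2.0 / h) * (K[:, :, None] * diff).sum(axis=1)
    phi = (K @ grad_logp + repulsion) / n
    return particles + step * phi

# Toy target: standard 2D Gaussian, so grad log p(x) = -x.
rng = np.random.default_rng(0)
x = rng.normal(3.0, 0.5, size=(50, 2))  # ensemble starts far from the target
for _ in range(1000):
    x = svgd_step(x, -x)
print(x.mean(axis=0))  # drifts toward the target mean (0, 0)
```

The kernel term pulls particles toward high-density regions while the repulsion term keeps the ensemble spread out, which is what distinguishes SVGD-style ensembles from independently trained deep ensembles.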

MCML Authors
Vincent Fortuin

Dr.

Bayesian Deep Learning


[1339]
K. Flöge, S. Udayakumar, J. Sommer, M. Piraud, S. Kesselheim, V. Fortuin, S. Günnemann, K. J. van der Weg, H. Gohlke, E. Merdivan and A. Bazarova.
OneProt: Towards Multi-Modal Protein Foundation Models.
Preprint (Nov. 2024). arXiv
Abstract

Recent AI advances have enabled multi-modal systems to model and translate diverse information spaces. Extending beyond text and vision, we introduce OneProt, a multi-modal AI for proteins that integrates structural, sequence, alignment, and binding site data. Using the ImageBind framework, OneProt aligns the latent spaces of modality encoders along protein sequences. It demonstrates strong performance in retrieval tasks and surpasses state-of-the-art methods in various downstream tasks, including metal ion binding classification, gene-ontology annotation, and enzyme function prediction. This work expands multi-modal capabilities in protein models, paving the way for applications in drug discovery, biocatalytic reaction planning, and protein engineering.

MCML Authors
Vincent Fortuin

Dr.

Bayesian Deep Learning


[1338]
J. Gauss and T. Nagler.
Asymptotics for estimating a diverging number of parameters – with and without sparsity.
Preprint (Nov. 2024). arXiv
Abstract

We consider high-dimensional estimation problems where the number of parameters diverges with the sample size. General conditions are established for consistency, uniqueness, and asymptotic normality in both unpenalized and penalized estimation settings. The conditions are weak and accommodate a broad class of estimation problems, including ones with non-convex and group structured penalties. The wide applicability of the results is illustrated through diverse examples, including generalized linear models, multi-sample inference, and stepwise estimation procedures.

MCML Authors
Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science


[1337]
V. Hofmann, L. Weissweiler, D. Mortensen, H. Schütze and J. Pierrehumbert.
Derivational Morphology Reveals Analogical Generalization in Large Language Models.
Preprint (Nov. 2024). arXiv
Abstract

What mechanisms underlie linguistic generalization in large language models (LLMs)? This question has attracted considerable attention, with most studies analyzing the extent to which the language skills of LLMs resemble rules. As of yet, it is not known whether linguistic generalization in LLMs could equally well be explained as the result of analogical processes, which can be formalized as similarity operations on stored exemplars. A key shortcoming of prior research is its focus on linguistic phenomena with a high degree of regularity, for which rule-based and analogical approaches make the same predictions. Here, we instead examine derivational morphology, specifically English adjective nominalization, which displays notable variability. We introduce a new method for investigating linguistic generalization in LLMs: focusing on GPT-J, we fit cognitive models that instantiate rule-based and analogical learning to the LLM training data and compare their predictions on a set of nonce adjectives with those of the LLM, allowing us to draw direct conclusions regarding underlying mechanisms. As expected, rule-based and analogical models explain the predictions of GPT-J equally well for adjectives with regular nominalization patterns. However, for adjectives with variable nominalization patterns, the analogical model provides a much better match. Furthermore, GPT-J’s behavior is sensitive to the individual word frequencies, even for regular forms, a behavior that is consistent with an analogical account of regular forms but not a rule-based one. These findings refute the hypothesis that GPT-J’s linguistic generalization on adjective nominalization involves rules, suggesting similarity operations on stored exemplars as the underlying mechanism. Overall, our study suggests that analogical processes play a bigger role in the linguistic generalization of LLMs than previously thought.

MCML Authors
Hinrich Schütze

Prof. Dr.

Computational Linguistics


[1336]
P. Janetzky, T. Schlagenhauf and S. Feuerriegel.
Slowing Down Forgetting in Continual Learning.
Preprint (Nov. 2024). arXiv
Abstract

A common challenge in continual learning (CL) is catastrophic forgetting, where the performance on old tasks drops after new, additional tasks are learned. In this paper, we propose a novel framework called ReCL to slow down forgetting in CL. Our framework exploits an implicit bias of gradient-based neural networks due to which these converge to margin maximization points. Such convergence points allow us to reconstruct old data from previous tasks, which we then combine with the current training data. Our framework is flexible and can be applied on top of existing, state-of-the-art CL methods to slow down forgetting. We further demonstrate the performance gain from our framework across a large series of experiments, including different CL scenarios (class incremental, domain incremental, task incremental learning), different datasets (MNIST, CIFAR10), and different network architectures. Across all experiments, we find large performance gains through ReCL. To the best of our knowledge, our framework is the first to address catastrophic forgetting by leveraging models in CL as their own memory buffers.

MCML Authors
Pascal Janetzky

Artificial Intelligence in Management

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management


[1335]
K. Jin, J. Latz, C. Liu and A. Scagliotti.
Losing momentum in continuous-time stochastic optimisation.
Preprint (Nov. 2024). arXiv
Abstract

The training of modern machine learning models often consists in solving high-dimensional non-convex optimisation problems that are subject to large-scale data. In this context, momentum-based stochastic optimisation algorithms have become particularly widespread. The stochasticity arises from data subsampling which reduces computational cost. Both, momentum and stochasticity help the algorithm to converge globally. In this work, we propose and analyse a continuous-time model for stochastic gradient descent with momentum. This model is a piecewise-deterministic Markov process that represents the optimiser by an underdamped dynamical system and the data subsampling through a stochastic switching. We investigate longtime limits, the subsampling-to-no-subsampling limit, and the momentum-to-no-momentum limit. We are particularly interested in the case of reducing the momentum over time. Under convexity assumptions, we show convergence of our dynamical system to the global minimiser when reducing momentum over time and letting the subsampling rate go to infinity. We then propose a stable, symplectic discretisation scheme to construct an algorithm from our continuous-time dynamical system. In experiments, we study our scheme in convex and non-convex test problems. Additionally, we train a convolutional neural network in an image classification problem. Our algorithm attains competitive results compared to stochastic gradient descent with momentum.
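The idea of an underdamped system whose momentum decays over time can be discretised semi-implicitly, updating velocity before position. A toy deterministic sketch on a quadratic objective, with an invented friction schedule, not the paper's scheme:

```python
# Semi-implicit (symplectic) Euler discretisation of underdamped dynamics
#   dv = -(gamma(t) * v + grad_f(x)) dt,   dx = v dt,
# where the friction gamma(t) increases over time, i.e. the momentum
# effect is reduced as optimisation proceeds.
grad_f = lambda x: 2.0 * (x - 3.0)  # f(x) = (x - 3)^2, minimiser at x = 3

x, v, dt = 0.0, 0.0, 0.01
for k in range(20000):
    gamma = 1.0 + 0.01 * (k * dt)       # slowly increasing friction
    v += -(gamma * v + grad_f(x)) * dt  # velocity first ...
    x += v * dt                         # ... then position (symplectic order)
print(x)
```

Updating velocity with the current position and then position with the new velocity is what makes the scheme symplectic; a naive explicit Euler update of both variables simultaneously is less stable for oscillatory dynamics.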

MCML Authors
Alessandro Scagliotti

Applied Numerical Analysis


[1334]
K. R. Scherer, F. Burkhardt, U. D. Reichel, F. Eyben and B. W. Schuller.
Using voice analysis as an early indicator of risk for depression in young adults.
Preprint (Nov. 2024). arXiv
Abstract

Increasingly frequent publications in the literature report voice quality differences between depressed patients and controls. Here, we examine the possibility of using voice analysis as an early warning signal for the development of emotion disturbances in young adults. As part of a major interdisciplinary European research project in four countries (ECoWeB), examining the effects of web-based prevention programs to reduce the risk for depression in young adults, we analyzed a large number of acoustic voice characteristics in vocal reports of emotions experienced by the participants on a specific day. We were able to identify a number of significant differences in acoustic cues, particularly with respect to the energy distribution in the voice spectrum, encouraging further research efforts to develop promising non-obtrusive risk indicators in the normal speaking voice. This is particularly important in the case of young adults who are less likely to exhibit standard risk factors for depression such as negative life experiences.

MCML Authors
Björn Schuller

Prof. Dr.

Health Informatics


[1333]
B. Kulynych, J. F. Gomez, G. Kaissis, F. du Pin Calmon and C. Troncoso.
Attack-Aware Noise Calibration for Differential Privacy.
Preprint (Nov. 2024). arXiv URL
Abstract

Differential privacy (DP) is a widely used approach for mitigating privacy risks when training machine learning models on sensitive data. DP mechanisms add noise during training to limit the risk of information leakage. The scale of the added noise is critical, as it determines the trade-off between privacy and utility. The standard practice is to select the noise scale to satisfy a given privacy budget ε. This privacy budget is in turn interpreted in terms of operational attack risks, such as accuracy, sensitivity, and specificity of inference attacks aimed at recovering information about the training data records. We show that first calibrating the noise scale to a privacy budget ε, and then translating ε to attack risk leads to overly conservative risk assessments and unnecessarily low utility. Instead, we propose methods to directly calibrate the noise scale to a desired attack risk level, bypassing the step of choosing ε. For a given notion of attack risk, our approach significantly decreases noise scale, leading to increased utility at the same level of privacy. We empirically demonstrate that calibrating noise to attack sensitivity/specificity, rather than ε, when training privacy-preserving ML models substantially improves model accuracy for the same risk level. Our work provides a principled and practical way to improve the utility of privacy-preserving ML without compromising on privacy.
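The contrast between ε-first and attack-risk-first calibration can be sketched for the Gaussian mechanism with sensitivity 1, using the classical σ = √(2 ln(1.25/δ))/ε bound and the Gaussian-DP trade-off curve. This is a simplified illustration of the idea, not the authors' method or code:

```python
import math
from statistics import NormalDist

Phi = NormalDist().cdf
Phi_inv = NormalDist().inv_cdf

def sigma_from_eps(eps, delta):
    """Classical epsilon-first Gaussian-mechanism calibration (sensitivity 1)."""
    return math.sqrt(2 * math.log(1.25 / delta)) / eps

def sigma_from_attack(alpha, max_tpr):
    """Direct attack-risk calibration under mu-Gaussian DP (mu = 1/sigma):
    the best distinguishing attack at false-positive rate alpha achieves
    true-positive rate Phi(Phi_inv(alpha) + mu). Requires max_tpr > alpha."""
    mu = Phi_inv(max_tpr) - Phi_inv(alpha)
    return 1.0 / mu

s_eps = sigma_from_eps(eps=1.0, delta=1e-5)          # noise from a privacy budget
s_atk = sigma_from_attack(alpha=0.05, max_tpr=0.25)  # noise from a target attack risk
print(f"eps-first sigma: {s_eps:.2f}, attack-risk-first sigma: {s_atk:.2f}")
```

Under these illustrative numbers the attack-risk-first noise scale is several times smaller than the ε-first one, which mirrors the paper's point that routing the calibration through ε is overly conservative for a fixed operational attack risk.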

MCML Authors
Georgios Kaissis

Dr.

* Former Principal Investigator


[1332]
Y.-J. Li, M. Gladkova, Y. Xia and D. Cremers.
SADG: Segment Any Dynamic Gaussian Without Object Trackers.
Preprint (Nov. 2024). arXiv
Abstract

Understanding dynamic 3D scenes is fundamental for various applications, including extended reality (XR) and autonomous driving. Effectively integrating semantic information into 3D reconstruction enables holistic representation that opens opportunities for immersive and interactive applications. We introduce SADG, Segment Any Dynamic Gaussian Without Object Trackers, a novel approach that combines dynamic Gaussian Splatting representation and semantic information without reliance on object IDs. In contrast to existing works, we do not rely on supervision based on object identities to enable consistent segmentation of dynamic 3D objects. To this end, we propose to learn semantically-aware features by leveraging masks generated from the Segment Anything Model (SAM) and utilizing our novel contrastive learning objective based on hard pixel mining. The learned Gaussian features can be effectively clustered without further post-processing. This enables fast computation for further object-level editing, such as object removal, composition, and style transfer by manipulating the Gaussians in the scene. We further extend several dynamic novel-view datasets with segmentation benchmarks to enable testing of learned feature fields from unseen viewpoints. We evaluate SADG on proposed benchmarks and demonstrate the superior performance of our approach in segmenting objects within dynamic scenes along with its effectiveness for further downstream editing tasks.
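A supervised contrastive objective with hard pixel mining can be sketched as follows, with hypothetical shapes and a simplified loss, not the paper's exact objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-pixel features and SAM-style mask ids (shapes are invented).
n_pix, d = 256, 16
feats = rng.normal(size=(n_pix, d))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)  # unit-norm features
labels = rng.integers(0, 4, size=n_pix)                # mask id per pixel

def contrastive_loss_hard(feats, labels, tau=0.1, n_hard=32):
    """Supervised contrastive loss keeping only the hardest negatives
    (highest-similarity pixels from other masks) per anchor pixel."""
    sim = feats @ feats.T / tau
    loss, count = 0.0, 0
    for i in range(len(feats)):
        pos = labels == labels[i]
        pos[i] = False                       # exclude the anchor itself
        if not pos.any():
            continue
        hard = np.sort(sim[i][labels != labels[i]])[-n_hard:]  # hard pixel mining
        logits = np.concatenate([sim[i][pos], hard])
        log_den = np.log(np.exp(logits).sum())
        loss += -(sim[i][pos] - log_den).mean()
        count += 1
    return loss / count

print(contrastive_loss_hard(feats, labels))
```

Pulling same-mask pixel features together while pushing them away from the most confusable other-mask pixels is what lets the learned Gaussian features cluster cleanly without post-processing; for numerical stability a production version would subtract the max logit before exponentiating.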

MCML Authors
Mariia Gladkova

Computer Vision & Artificial Intelligence

Yan Xia

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence