B | Perception, Vision, and Natural Language Processing

Daniel Cremers

Prof. Dr.

Director

3D Artificial Intelligence

Angela Dai

Prof. Dr.

Principal Investigator

Almut Sophia Koepke

Dr.

JRG Leader Multi-Modal Learning

Visual Computing & Artificial Intelligence

Matthias Nießner

Prof. Dr.

Principal Investigator

Björn Ommer

Prof. Dr.

Principal Investigator

Nils Thuerey

Prof. Dr.

Principal Investigator

Xi Wang

Dr.

JRG Leader Egocentric Vision

Computer Graphics & Visualization

Rüdiger Westermann

Prof. Dr.

Principal Investigator


Publications in Research Area B1

[242]

A. Aghdam, V. T. Hu and B. Ommer.
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment.
Preprint (Jun. 2025). arXiv

Abstract

We address the task of zero-shot fine-grained video classification, where no video examples or temporal annotations are available for unseen action classes. While contrastive vision-language models such as SigLIP demonstrate strong open-set recognition via mean-pooled image-text similarity, they fail to capture the temporal structure critical for distinguishing fine-grained activities. We introduce ActAlign, a zero-shot framework that formulates video classification as sequence alignment. For each class, a large language model generates an ordered sub-action sequence, which is aligned with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on the extremely challenging ActionAtlas benchmark, where human accuracy is only 61.6%. ActAlign outperforms billion-parameter video-language models while using approximately 8x fewer parameters. These results demonstrate that structured language priors, combined with classical alignment techniques, offer a scalable and general approach to unlocking the open-set recognition potential of vision-language models for fine-grained video understanding.
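The alignment idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the two-dimensional "embeddings", the class names, and the helper functions `cosine`, `dtw_cost`, and `classify` are all hypothetical, and the cost (one minus cosine similarity) is an assumed choice. The sketch only shows how DTW rewards the correct temporal order, which mean pooling cannot see.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dtw_cost(frames, steps):
    """Cumulative cost of the best monotonic alignment of frames to ordered steps."""
    n, m = len(frames), len(steps)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 1.0 - cosine(frames[i - 1], steps[j - 1])  # assumed local cost
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]

def classify(frames, class_scripts):
    """Pick the class whose ordered sub-action script aligns most cheaply."""
    return min(class_scripts, key=lambda name: dtw_cost(frames, class_scripts[name]))

# Toy data (hypothetical): axis 0 stands for "pour", axis 1 for "stir".
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
scripts = {
    "pour_then_stir": [[1.0, 0.0], [0.0, 1.0]],
    "stir_then_pour": [[0.0, 1.0], [1.0, 0.0]],
}
predicted = classify(frames, scripts)
```

Both toy scripts contain the same sub-actions, so their mean-pooled embeddings are identical; only the order-aware alignment cost separates them.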

MCML Authors

Vincent Tao Hu

Dr.

Björn Ommer

Prof. Dr.

Principal Investigator

[241]

H. Andrade-Loarca, J. Hege, D. Cremers and G. Kutyniok.
Neural Poisson Surface Reconstruction: Resolution-Agnostic Shape Reconstruction from Point Clouds.
Preprint (Nov. 2023). arXiv

Abstract

We introduce Neural Poisson Surface Reconstruction (nPSR), an architecture for shape reconstruction that addresses the challenge of recovering 3D shapes from points. Traditional deep neural networks face challenges with common 3D shape discretization techniques due to their computational complexity at higher resolutions. To overcome this, we leverage Fourier Neural Operators to solve the Poisson equation and reconstruct a mesh from oriented point cloud measurements. nPSR exhibits two main advantages: First, it enables efficient training on low-resolution data while achieving comparable performance at high-resolution evaluation, thanks to the resolution-agnostic nature of FNOs. This feature allows for one-shot super-resolution. Second, our method surpasses existing approaches in reconstruction quality while being differentiable and robust with respect to point sampling rates. Overall, the neural Poisson surface reconstruction not only improves upon the limitations of classical deep neural networks in shape reconstruction but also achieves superior results in terms of reconstruction quality, running time, and resolution agnosticism.

MCML Authors

Daniel Cremers

Prof. Dr.

Director

Mathematical Foundations of Artificial Intelligence

Gitta Kutyniok

Prof. Dr.

Principal Investigator

[240]

M. Aygun, A. Ošep, M. Weber, M. Maximov, C. Stachniss, J. Behley and L. Leal-Taixé.
4D Panoptic LiDAR Segmentation.
CVPR 2021 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual, Jun 19-25, 2021. DOI GitHub

Abstract

Temporal semantic scene understanding is critical for self-driving cars or robots operating in dynamic environments. In this paper, we propose 4D panoptic LiDAR segmentation to assign a semantic class and a temporally-consistent instance ID to a sequence of 3D points. To this end, we present an approach and a point-centric evaluation metric. Our approach determines a semantic class for every point while modeling object instances as probability distributions in the 4D spatio-temporal domain. We process multiple point clouds in parallel and resolve point-to-instance associations, effectively alleviating the need for explicit temporal data association. Inspired by recent advances in benchmarking of multi-object tracking, we propose to adopt a new evaluation metric that separates the semantic and point-to-instance association aspects of the task. With this work, we aim at paving the road for future developments of temporal LiDAR panoptic perception.

MCML Authors

Mark Weber

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Laura Leal-Taixé

Prof. Dr.

Principal Investigator

* Former Principal Investigator

[239]

E. Abdelrahman, L. Zhao, V. T. Hu, M. Cord, P. Perez and M. Elhoseiny.
ToddlerDiffusion: Interactive Structured Image Generation with Cascaded Schrödinger Bridge.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL GitHub

Abstract

Diffusion models break down the challenging task of generating data from high-dimensional distributions into a series of easier denoising steps. Inspired by this paradigm, we propose a novel approach that extends the diffusion framework into modality space, decomposing the complex task of RGB image generation into simpler, interpretable stages. Our method, termed ToddlerDiffusion, cascades modality-specific models, each responsible for generating an intermediate representation, such as contours, palettes, and detailed textures, ultimately culminating in a high-quality RGB image. Instead of relying on the naive LDM concatenation conditioning mechanism to connect the different stages together, we employ Schrödinger Bridge to determine the optimal transport between different modalities. Although employing a cascaded pipeline introduces more stages, which could lead to a more complex architecture, each stage is meticulously formulated for efficiency and accuracy, surpassing Stable-Diffusion (LDM) performance. Modality composition not only enhances overall performance but enables emerging properties such as consistent editing, interaction capabilities, high-level interpretability, and faster convergence and sampling rate. Extensive experiments on diverse datasets, including LSUN-Churches, ImageNet, CelebHQ, and LAION-Art, demonstrate the efficacy of our approach, consistently outperforming state-of-the-art methods. For instance, ToddlerDiffusion achieves notable efficiency, matching LDM performance on LSUN-Churches while operating 2× faster with a 3× smaller architecture.

MCML Authors

Vincent Tao Hu

Dr.

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[238]

M. Binz, S. Alaniz, A. Roskies, B. , C. T. Bergstrom, C. Allen, D. Schad, D. Wulff, J. D. , Q. Zhang, R. M. Shiffrin, S. J. Gershman, V. Popov, E. M. Bender, M. Marelli, M. M. Botvinick, Z. Akata and E. Schulz.
How should the advancement of large language models affect the practice of science?
Proceedings of the National Academy of Sciences 122.5 (Jan. 2025). DOI

Abstract

Large language models (LLMs) are being increasingly incorporated into scientific workflows. However, we have yet to fully grasp the implications of this integration. How should the advancement of large language models affect the practice of science? For this opinion piece, we have invited four diverse groups of scientists to reflect on this query, sharing their perspectives and engaging in debate. Schulz et al. make the argument that working with LLMs is not fundamentally different from working with human collaborators, while Bender et al. argue that LLMs are often misused and overhyped, and that their limitations warrant a focus on more specialized, easily interpretable tools. Marelli et al. emphasize the importance of transparent attribution and responsible use of LLMs. Finally, Botvinick and Gershman advocate that humans should retain responsibility for determining the scientific roadmap. To facilitate the discussion, the four perspectives are complemented with a response from each group. By putting these different perspectives in conversation, we aim to bring attention to important considerations within the academic community regarding the adoption of LLMs and their impact on both current and future scientific practices.

MCML Authors

Stephan Alaniz

Dr.

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[237]

G. Brasó, O. Cetintas and L. Leal-Taixé.
Multi-Object Tracking and Segmentation Via Neural Message Passing.
International Journal of Computer Vision 130.12 (Sep. 2022). DOI GitHub

Abstract

Graphs offer a natural way to formulate Multiple Object Tracking (MOT) and Multiple Object Tracking and Segmentation (MOTS) within the tracking-by-detection paradigm. However, they also introduce a major challenge for learning methods, as defining a model that can operate on such a structured domain is not trivial. In this work, we exploit the classical network flow formulation of MOT to define a fully differentiable framework based on Message Passing Networks. By operating directly on the graph domain, our method can reason globally over an entire set of detections and exploit contextual features. It then jointly predicts both final solutions for the data association problem and segmentation masks for all objects in the scene while exploiting synergies between the two tasks. We achieve state-of-the-art results for both tracking and segmentation in several publicly available datasets.

MCML Authors

Guillem Brasó

* Former Member

Laura Leal-Taixé

Prof. Dr.

Principal Investigator

* Former Principal Investigator

[236]

C. Brunner, A. Duensing, C. Schröder, M. Mittermair, V. Golkov, M. Pollanka, D. Cremers and R. Kienberger.
Deep Learning in Attosecond Metrology.
Optics Express 30.9 (Apr. 2022). Editor’s Pick. DOI

Abstract

Time-resolved photoelectron spectroscopy provides a versatile tool for investigating electron dynamics in gaseous, liquid, and solid samples on sub-femtosecond time scales. The extraction of information from spectrograms recorded with the attosecond streak camera remains a difficult challenge. Common algorithms are highly specialized and typically computationally heavy. In this work, we apply deep neural networks to map from streaking traces to near-infrared pulses as well as electron wavepackets and extensively benchmark our results on simulated data. Additionally, we illustrate domain-shift to real-world data. We also attempt to quantify the model predictive uncertainty. Our deep neural networks display competitive retrieval quality and superior tolerance against noisy data conditions, while reducing the computational time by orders of magnitude.

MCML Authors

Vladimir Golkov

Dr.

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[235]

M. Bini, L. Girrbach and Z. Akata.
Decoupling Angles and Strength in Low-rank Adaptation.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL GitHub

Abstract

Parameter Efficient FineTuning (PEFT) methods have recently gained extreme popularity thanks to the vast availability of large-scale models, allowing pretrained models to be adapted quickly to downstream tasks with minimal computational costs. However, current additive finetuning methods such as LoRA show low robustness to prolonged training and hyperparameter choices, which prevents optimal out-of-the-box usage. On the other hand, multiplicative and bounded approaches such as ETHER, even if providing higher robustness, only allow for extremely low-rank adaptations and are limited to a fixed-strength transformation, hindering the expressive power of the adaptation. In this work, we propose the DeLoRA finetuning method that first normalizes and then scales the learnable low-rank matrices, thus effectively bounding the transformation strength, which leads to increased hyperparameter robustness at no cost in performance. We show that this proposed approach effectively and consistently improves over popular PEFT methods by evaluating our method on two finetuning tasks, subject-driven image generation and LLM instruction tuning.
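The "normalize, then scale" step can be sketched as follows. This is a simplified illustration under stated assumptions, not the DeLoRA formulation from the paper: each rank-one component is divided by the product of its factor norms and the sum is scaled by a hypothetical `strength` parameter divided by the rank, so the Frobenius norm of the update cannot exceed `strength`. The function names are invented for this example.

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def frobenius(M):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in M for x in row))

def bounded_low_rank_update(B_cols, A_rows, strength):
    """Sum of r normalized rank-one terms b_i a_i^T, scaled by strength / r.

    B_cols: r vectors of length d_out; A_rows: r vectors of length d_in.
    Each normalized term has Frobenius norm 1, so the result has norm <= strength.
    """
    r = len(B_cols)
    d_out, d_in = len(B_cols[0]), len(A_rows[0])
    delta = [[0.0] * d_in for _ in range(d_out)]
    for b, a in zip(B_cols, A_rows):
        scale = strength / (r * norm(b) * norm(a))  # decouples direction from magnitude
        for i in range(d_out):
            for j in range(d_in):
                delta[i][j] += scale * b[i] * a[j]
    return delta

# Toy factors (hypothetical values) with very different magnitudes.
B = [[3.0, 4.0], [100.0, 0.0]]
A = [[1.0, 2.0, 2.0], [0.0, 50.0, 0.0]]
delta = bounded_low_rank_update(B, A, strength=0.5)

# Rescaling the raw factors leaves the update unchanged: only angles matter.
B_big = [[1000.0 * x for x in b] for b in B]
delta_big = bounded_low_rank_update(B_big, A, strength=0.5)
```

In this sketch the magnitude of the raw factors has no effect on the update, which is one way to read the abstract's claim that bounding the transformation strength improves robustness to hyperparameters.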

MCML Authors

Massimo Bini

Leander Girrbach

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[234]

J. Bader, L. Girrbach, S. Alaniz and Z. Akata.
SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions.
ICCV 2025 - IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published. Preprint available. arXiv URL

Abstract

Concept Bottleneck Models (CBMs) and other concept-based interpretable models show great promise for making AI applications more transparent, which is essential in fields like medicine. Despite their success, we demonstrate that CBMs struggle to reliably identify the correct concepts under distribution shifts. To assess the robustness of CBMs to concept variations, we introduce SUB: a fine-grained image and concept benchmark containing 38,400 synthetic images based on the CUB dataset. To create SUB, we select a CUB subset of 33 bird classes and 45 concepts to generate images which substitute a specific concept, such as wing color or belly pattern. We introduce a novel Tied Diffusion Guidance (TDG) method to precisely control generated images, where noise sharing for two parallel denoising processes ensures that both the correct bird class and the correct attribute are generated. This novel benchmark enables rigorous evaluation of CBMs and similar interpretable models, contributing to the development of more robust methods.

MCML Authors

Jessica Bader

Leander Girrbach

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Stephan Alaniz

Dr.

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Christian Wachinger
Artificial Intelligence in Medical Imaging

[233]

F. Bongratz, V. Golkov, L. Mautner, L. Della Libera, F. Heetmeyer, F. Czaja, J. Rodemann and D. Cremers.
How to Choose a Reinforcement-Learning Algorithm.
Preprint (Jul. 2024). arXiv GitHub

Abstract

The field of reinforcement learning offers a large variety of concepts and methods to tackle sequential decision-making problems. This variety has become so large that choosing an algorithm for a task at hand can be challenging. In this work, we streamline the process of choosing reinforcement-learning algorithms and action-distribution families. We provide a structured overview of existing methods and their properties, as well as guidelines for when to choose which methods.

MCML Authors

Fabian Bongratz

Vladimir Golkov

Dr.

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[232]

M. Brahimi, B. Haefner, T. Yenamandra, B. Goldluecke and D. Cremers.
SupeRVol: Super-Resolution Shape and Reflectance Estimation in Inverse Volume Rendering.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI

Abstract

We propose an end-to-end inverse rendering pipeline called SupeRVol that allows us to recover 3D shape and material parameters from a set of color images in a super-resolution manner. To this end, we represent both the bidirectional reflectance distribution function’s (BRDF) parameters and the signed distance function (SDF) by multi-layer perceptrons (MLPs). In order to obtain both the surface shape and its reflectance properties, we revert to a differentiable volume renderer with a physically based illumination model that allows us to decouple reflectance and lighting. This physical model takes into account the effect of the camera’s point spread function, thereby enabling a reconstruction of shape and material at super-resolution quality. Experimental validation confirms that SupeRVol achieves state-of-the-art performance in terms of inverse rendering quality. It generates reconstructions that are sharper than the individual input images, making this method ideally suited for 3D modeling from low-resolution imagery.

MCML Authors

Tarun Yenamandra

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[231]

M. Brahimi, B. Haefner, Z. Ye, B. Goldluecke and D. Cremers.
Sparse Views, Near Light: A Practical Paradigm for Uncalibrated Point-light Photometric Stereo.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI

Abstract

Neural approaches have shown significant progress on camera-based reconstruction. But they require either a fairly dense sampling of the viewing sphere, or pre-training on an existing dataset, thereby limiting their generalizability. In contrast, photometric stereo (PS) approaches have shown great potential for achieving high-quality reconstruction under sparse viewpoints. Yet, they are impractical because they typically require tedious laboratory conditions, are restricted to dark rooms, and are often multi-staged, making them subject to accumulated errors. To address these shortcomings, we propose an end-to-end uncalibrated multi-view PS framework for reconstructing high-resolution shapes acquired from sparse viewpoints in a real-world environment. We relax the dark room assumption, and allow a combination of static ambient lighting and dynamic near LED lighting, thereby enabling easy data capture outside the lab. Experimental validation confirms that it outperforms existing baseline approaches in the regime of sparse viewpoints by a large margin. This brings high-accuracy 3D reconstruction from the dark room to the real world, while maintaining a reasonable data capture complexity.

MCML Authors

Zhenzhang Ye

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[230]

G. Brasó, N. Kister and L. Leal-Taixé.
The Center of Attention: Center-Keypoint Grouping Attention for Multi-Person Pose Estimation.
ICCV 2021 - IEEE/CVF International Conference on Computer Vision. Virtual, Oct 11-17, 2021. DOI GitHub

Abstract

We introduce CenterGroup, an attention-based framework to estimate human poses from a set of identity-agnostic keypoints and person center predictions in an image. Our approach uses a transformer to obtain context-aware embeddings for all detected keypoints and centers and then applies multi-head attention to directly group joints into their corresponding person centers. While most bottom-up methods rely on non-learnable clustering at inference, CenterGroup uses a fully differentiable attention mechanism that we train end-to-end together with our keypoint detector. As a result, our method obtains state-of-the-art performance with up to 2.5x faster inference time than competing bottom-up approaches.

MCML Authors

Guillem Brasó

* Former Member

Laura Leal-Taixé

Prof. Dr.

Principal Investigator

* Former Principal Investigator

[229]

S. A. Baumann, F. Krause, M. Neumayr, N. Stracke, M. Sevi, V. T. Hu and B. Ommer.
Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI GitHub

Abstract

In recent years, advances in text-to-image (T2I) diffusion models have substantially elevated the quality of their generated images. However, achieving fine-grained control over attributes remains a challenge due to the limitations of natural language prompts (such as no continuous set of intermediate descriptions existing between “person” and “old person”). Even though many methods were introduced that augment the model or generation process to enable such control, methods that do not require a fixed reference image are limited to either enabling global fine-grained attribute expression control or coarse attribute expression control localized to specific subjects, not both simultaneously. We show that there exist directions in the commonly used token-level CLIP text embeddings that enable fine-grained subject-specific control of high-level attributes in text-to-image models. Based on this observation, we introduce one efficient optimization-free and one robust optimization-based method to identify these directions for specific attributes from contrastive text prompts. We demonstrate that these directions can be used to augment the prompt text input with fine-grained control over attributes of specific subjects in a compositional manner (control over multiple attributes of a single subject) without having to adapt the diffusion model.

MCML Authors

Felix Krause

Vincent Tao Hu

Dr.

Björn Ommer

Prof. Dr.

Principal Investigator

[228]

K. Bhatia, F. Köhler and N. Thuerey.
PRDP: Progressively Refined Differentiable Physics.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL

Abstract

The physics solvers employed for neural network training are primarily iterative, and hence, differentiating through them introduces a severe computational burden as iterations grow large. Inspired by works in bilevel optimization, we show that full accuracy of the network is achievable through physics significantly coarser than fully converged solvers. We propose Progressively Refined Differentiable Physics (PRDP), an approach that identifies the level of physics refinement sufficient for full training accuracy. By beginning with coarse physics, adaptively refining it during training, and stopping refinement at the level adequate for training, it enables significant compute savings without sacrificing network accuracy. Our focus is on differentiating iterative linear solvers for sparsely discretized differential operators, which are fundamental to scientific computing. PRDP is applicable to both unrolled and implicit differentiation. We validate its performance on a variety of learning scenarios involving differentiable physics solvers such as inverse problems, autoregressive neural emulators, and correction-based neural-hybrid solvers. In the challenging example of emulating the Navier-Stokes equations, we reduce training time by 62%.

MCML Authors

Nils Thuerey

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[227]

A. Baumann, R. Li, M. Klasson, S. Mentu, S. Karthik, Z. Akata, A. Solin and M. Trapp.
Post-hoc Probabilistic Vision-Language Models.
Preprint (Dec. 2024). arXiv

Abstract

Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.

MCML Authors

Shyamgopal Karthik

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[226]

Q. Bouniot, P. Mozharovskyi and F. d'Alché-Buc.
Tailoring Mixup to Data for Calibration.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL

Abstract

Among all data augmentation techniques proposed so far, linear interpolation of training samples, also called Mixup, has been found to be effective for a large panel of applications. Along with improved predictive performance, Mixup is also a good technique for improving calibration. However, mixing data carelessly can lead to manifold mismatch, i.e., synthetic data lying outside original class manifolds, which can deteriorate calibration. In this work, we show that the likelihood of assigning a wrong label with mixup increases with the distance between data to mix. To this end, we propose to dynamically change the underlying distributions of interpolation coefficients depending on the similarity between samples to mix, and define a flexible framework to do so without losing diversity. We provide extensive experiments for classification and regression tasks, showing that our proposed method improves predictive performance and calibration of models, while being much more efficient.
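A rough sketch of similarity-dependent mixing, assuming a Beta(α, α) distribution for the interpolation coefficient: `mixup` is the standard linear interpolation, while the exponential kernel in `similarity_alpha` and its parameters `alpha_max`, `tau`, and `floor` are assumptions made for illustration, not the scheme defined in the paper. With a small α the coefficient concentrates near 0 or 1, so distant pairs are barely mixed.

```python
import math
import random

def mixup(x1, y1, x2, y2, lam):
    """Standard Mixup: the same convex combination for inputs and soft labels."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y

def similarity_alpha(x1, x2, alpha_max=1.0, tau=1.0, floor=1e-3):
    """Assumed kernel: close pairs get alpha near alpha_max, distant pairs near floor."""
    return max(floor, alpha_max * math.exp(-math.dist(x1, x2) / tau))

def sample_lambda(x1, x2, rng):
    """Draw the mixing coefficient from Beta(alpha, alpha) with pair-dependent alpha."""
    alpha = similarity_alpha(x1, x2)
    return rng.betavariate(alpha, alpha)

# Deterministic example of the interpolation itself (one-hot labels).
mixed_x, mixed_y = mixup([0.0, 0.0], [1.0, 0.0], [2.0, 2.0], [0.0, 1.0], lam=0.25)

# Pair-dependent sampling with a seeded generator.
rng = random.Random(0)
lam = sample_lambda([0.0, 0.0], [0.1, 0.0], rng)
```

Any monotonically decreasing function of distance could replace the kernel; the point is only that the distribution of the coefficient, not just its sampled value, depends on the pair being mixed.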

MCML Authors

Quentin Bouniot

Dr.

[225]

M. Bini, K. Roth, Z. Akata and A. Khoreva.
ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL GitHub

Abstract

Parameter-efficient finetuning (PEFT) has become ubiquitous to adapt foundation models to downstream task requirements while retaining their generalization ability. However, the amount of additionally introduced parameters and compute for successful adaptation and hyperparameter searches can explode quickly, especially when deployed at scale to serve numerous individual requests. To ensure effective, parameter-efficient, and hyperparameter-robust adaptation, we propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections. By design, ETHER transformations require a minimal number of parameters, are less likely to deteriorate model performance, and exhibit robustness to hyperparameter and learning rate choices. In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters (∼10-100 times lower than LoRA or OFT) across multiple image synthesis and natural language tasks without exhaustive hyperparameter tuning. Finally, we investigate the recent emphasis on Hyperspherical Energy retention for adaptation and raise questions on its practical utility.
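A hyperplane reflection can be written as the Householder matrix H = I - 2uuᵀ/(uᵀu), which needs only one vector of parameters per reflection. The sketch below applies such a reflection to a small weight matrix; how ETHER places and parameterizes these reflections inside a network is not shown here, and the helper names and toy values are hypothetical.

```python
import math

def householder(u):
    """Reflection across the hyperplane orthogonal to the non-zero vector u."""
    uu = sum(x * x for x in u)
    n = len(u)
    return [[(1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] / uu for j in range(n)]
            for i in range(n)]

def matmul(A, B):
    """Plain matrix product for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def frobenius(M):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in M for x in row))

# Toy example (hypothetical values): one learnable vector u defines the transform.
u = [3.0, 4.0]
H = householder(u)
W = [[1.0, 2.0], [3.0, 4.0]]
W_adapted = matmul(H, W)   # multiplicative update of the pretrained weights
HH = matmul(H, H)          # applying the reflection twice gives the identity
```

Because H is orthogonal, the adapted weights keep the Frobenius norm of the original ones regardless of the scale of u, which is consistent with the abstract's point about robustness to hyperparameter and learning rate choices.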

MCML Authors

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[224]

Q. Bouniot, I. Redko, A. Mallasto, C. Laclau, O. Struckmeier, K. Arndt, M. Heinonen, V. Kyrki and S. Kaski.
From Alexnet to Transformers: Measuring the Non-linearity of Deep Neural Networks with Affine Optimal Transport.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

In the last decade, we have witnessed the introduction of several novel deep neural network (DNN) architectures exhibiting ever-increasing performance across diverse tasks. Explaining the upward trend of their performance, however, remains difficult as different DNN architectures of comparable depth and width – common factors associated with their expressive power – may exhibit a drastically different performance even when trained on the same dataset. In this paper, we introduce the concept of the non-linearity signature of DNN, the first theoretically sound solution for approximately measuring the non-linearity of deep neural networks. Built upon a score derived from closed-form optimal transport mappings, this signature provides a better understanding of the inner workings of a wide range of DNN architectures and learning paradigms, with a particular emphasis on the computer vision task. We provide extensive experimental results that highlight the practical usefulness of the proposed non-linearity signature and its potential for long-reaching implications.

MCML Authors

Quentin Bouniot

Dr.

[223]

L. Bastian, M. Rashed, N. Navab and T. Birdal.
Continuous-Time SO(3) Forecasting with Savitzky-Golay Neural Controlled Differential Equations.
4DVision @CVPR 2025 - Workshop on 4D Vision: Modeling the Dynamic World at IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. arXiv

Abstract

Tracking and forecasting the rotation of objects is fundamental in computer vision and robotics, yet SO(3) extrapolation remains challenging as (1) sensor observations can be noisy and sparse, (2) motion patterns can be governed by complex dynamics, and (3) application settings can demand long-term forecasting. This work proposes modeling continuous-time rotational object dynamics on SO(3) using Neural Controlled Differential Equations guided by Savitzky-Golay paths. Unlike existing methods that rely on simplified motion assumptions, our method learns a general latent dynamical system of the underlying object trajectory while respecting the geometric structure of rotations. Experimental results on real-world data demonstrate compelling forecasting capabilities compared to existing approaches.

MCML Authors

Lennart Bastian

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Mohammad Rashed

→ Group Nils Thuerey
Physics-based Simulation

Nassir Navab

Prof. Dr.

Principal Investigator

Computer Aided Medical Procedures & Augmented Reality

[222]

L. Bastian, M. Rashed, N. Navab and T. Birdal.
Forecasting Continuous Non-Conservative Dynamical Systems in SO(3).
ICCV 2025 - IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published. Preprint available. Oral Presentation. arXiv

Abstract

tbd

MCML Authors

Lennart Bastian

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Mohammad Rashed

→ Group Nils Thuerey
Physics-based Simulation

Nassir Navab

Prof. Dr.

Principal Investigator

Computer Aided Medical Procedures & Augmented Reality

[221]

A. Blattmann, R. Rombach, K. Oktay and B. Ommer.
Retrieval-Augmented Diffusion Models.
NeurIPS 2022 - 36th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL

Abstract

Novel architectures have recently improved generative image synthesis leading to excellent visual quality in various tasks. Much of this success is due to the scalability of these architectures and hence caused by a dramatic increase in model complexity and in the computational resources invested in training these models. Our work questions the underlying paradigm of compressing large training data into ever growing parametric representations. We rather present an orthogonal, semi-parametric approach. We complement comparably small diffusion or autoregressive models with a separate image database and a retrieval strategy. During training we retrieve a set of nearest neighbors from this external database for each training instance and condition the generative model on these informative samples. While the retrieval approach is providing the (local) content, the model is focusing on learning the composition of scenes based on this content. As demonstrated by our experiments, simply swapping the database for one with different contents transfers a trained model post-hoc to a novel domain. The evaluation shows competitive performance on tasks which the generative model has not been trained on, such as class-conditional synthesis, zero-shot stylization or text-to-image synthesis without requiring paired text-image data. With negligible memory and computational overhead for the external database and retrieval we can significantly reduce the parameter count of the generative model and still outperform the state-of-the-art.

MCML Authors

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[220]

Y. Burkhardt, S. Schaefer and S. Leutenegger.
SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection.
Preprint (Apr. 2025). arXiv GitHub

Abstract

Event-based keypoint detection and matching holds significant potential, enabling the integration of event sensors into highly optimized Visual SLAM systems developed for frame cameras over decades of research. Unfortunately, existing approaches struggle with the motion-dependent appearance of keypoints and the complex noise prevalent in event streams, resulting in severely limited feature matching capabilities and poor performance on downstream tasks. To mitigate this problem, we propose SuperEvent, a data-driven approach to predict stable keypoints with expressive descriptors. Due to the absence of event datasets with ground truth keypoint labels, we leverage existing frame-based keypoint detectors on readily available event-aligned and synchronized gray-scale frames for self-supervision: we generate temporally sparse keypoint pseudo-labels considering that events are a product of both scene appearance and camera motion. Combined with our novel, information-rich event representation, we enable SuperEvent to effectively learn robust keypoint detection and description in event streams. Finally, we demonstrate the usefulness of SuperEvent by its integration into a modern sparse keypoint and descriptor-based SLAM framework originally developed for traditional cameras, surpassing the state-of-the-art in event-based SLAM by a wide margin.

MCML Authors

Yannick Burkhardt

Simon Schaefer

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Stefan Leutenegger

Prof. Dr.

Principal Investigator

* Former Principal Investigator

[219]

S. A. Baumann, N. Stracke, T. Phan and B. Ommer.
What If: Understanding Motion Through Sparse Interactions.
ICCV 2025 - IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published.

Abstract

Understanding the dynamics of a physical scene involves reasoning about the diverse ways it can potentially change, especially as a result of local interactions. We present the Flow Poke Transformer (FPT), a novel framework for directly predicting the distribution of local motion, conditioned on sparse interactions termed ‘pokes’. Unlike traditional methods that typically only enable dense sampling of a single realization of scene dynamics, FPT provides an interpretable directly accessible representation of multi-modal scene motion, its dependency on physical interactions and the inherent uncertainties of scene dynamics. We also evaluate our model on several downstream tasks to enable comparisons with prior methods and highlight the flexibility of our approach. On dense face motion generation, our generic pre-trained model surpasses specialized baselines. FPT can be fine-tuned in strongly out-of-distribution tasks such as synthetic datasets to enable significant improvements over in-domain methods in articulated object motion estimation. Additionally, predicting explicit motion distributions directly enables our method to achieve competitive performance on tasks like moving part segmentation from pokes which further demonstrates the versatility of our FPT.

MCML Authors

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[218]

M. B. Colomer, P. L. Dovesi, T. Panagiotakopoulos, J. F. Carvalho, L. Härenstam-Nielsen, H. Azizpour, H. Kjellström, D. Cremers and M. Poggi.
To adapt or not to adapt? Real-time adaptation for semantic segmentation.
ICCV 2023 - IEEE/CVF International Conference on Computer Vision. Paris, France, Oct 02-06, 2023. DOI

Abstract

The goal of Online Domain Adaptation for semantic segmentation is to handle unforeseeable domain changes that occur during deployment, like sudden weather events. However, the high computational costs associated with brute-force adaptation make this paradigm unfeasible for real-world applications. In this paper we propose HAMLET, a Hardware-Aware Modular Least Expensive Training framework for real-time domain adaptation. Our approach includes a hardware-aware back-propagation orchestration agent (HAMT) and a dedicated domain-shift detector that enables active control over when and how the model is adapted (LT). Thanks to these advancements, our approach is capable of performing semantic segmentation while simultaneously adapting at more than 29 FPS on a single consumer-grade GPU. Our framework’s encouraging accuracy and speed trade-off is demonstrated on OnDA and SHIFT benchmarks through experimental results.

MCML Authors

Linus Härenstam-Nielsen

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[217]

B. Cong, N. Daheim, Y. Shen, D. Cremers, R. Yokota, M. Khan and T. Möllenhoff.
Variational Low-Rank Adaptation Using IVON.
FITML @NeurIPS 2024 - Workshop Fine-Tuning in Modern Machine Learning: Principles and Scalability at the 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub

Abstract

We show that variational learning can significantly improve the accuracy and calibration of Low-Rank Adaptation (LoRA) without a substantial increase in the cost. We replace AdamW by the Improved Variational Online Newton (IVON) algorithm to finetune large language models. For Llama-2 with 7 billion parameters, IVON improves the accuracy over AdamW by 2.8% and expected calibration error by 4.6%. The accuracy is also better than the other Bayesian alternatives, yet the cost is lower and the implementation is easier. Our work provides additional evidence for the effectiveness of IVON for large language models.
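
For readers unfamiliar with the calibration metric mentioned above, the following is a minimal sketch of expected calibration error (ECE) with equal-width confidence bins. The bin count and the half-open binning convention are assumptions; the evaluation protocol used in the paper may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: predicted probabilities of the chosen class, in [0, 1].
    correct: 1 if the prediction was right, 0 otherwise.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins [i/n_bins, (i+1)/n_bins); conf == 1.0 goes to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Hypothetical toy predictions: two confident hits, two uncertain guesses.
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
```

A lower value means the reported confidences track the observed accuracy more closely.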

MCML Authors

Yuesong Shen

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[216]

L. Cheng, J. Hu, H. Yan, M. Gladkova, T. Huang, Y.-H. Liu, D. Cremers and H. Li.
Physically-Based Photometric Bundle Adjustment in Non-Lambertian Environments.
IROS 2024 - IEEE/RSJ International Conference on Intelligent Robots and Systems. Abu Dhabi, United Arab Emirates, Oct 14-18, 2024. DOI

Abstract

Photometric bundle adjustment (PBA) is widely used in estimating the camera pose and 3D geometry by assuming a Lambertian world. However, the assumption of photometric consistency is often violated since the non-diffuse reflection is common in real-world environments. The photometric inconsistency significantly affects the reliability of existing PBA methods. To solve this problem, we propose a novel physically-based PBA method. Specifically, we introduce the physically-based weights regarding material, illumination, and light path. These weights distinguish the pixel pairs with different levels of photometric inconsistency. We also design corresponding models for material estimation based on sequential images and illumination estimation based on point clouds. In addition, we establish the first SLAM-related dataset of non-Lambertian scenes with complete ground truth of illumination and material. Extensive experiments demonstrated that our PBA method outperforms existing approaches in accuracy.

MCML Authors

Mariia Gladkova

Daniel Cremers

Prof. Dr.

Director

→ Group Volker Tresp
Database Systems, Data Mining and AI

[215]

S. Chen, J. Liu, Z. Han, Y. Xia, D. Cremers, P. Torr, V. Tresp and J. Gu.
True Multimodal In-Context Learning Needs Attention to the Visual Context.
COLM 2025 - Conference on Language Modeling. Montreal, Canada, Oct 07-09, 2025. To be published. Preprint available. arXiv GitHub

Abstract

Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL): adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior makes MICL still unimodal and largely restricts its practical utility. More importantly, this limitation is often concealed by the improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate the MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information, particularly visual content, for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showcasing substantial improvements in the true multimodal in-context learning capabilities.

MCML Authors

Shuo Chen

Daniel Cremers

Prof. Dr.

Director

Volker Tresp

Prof. Dr.

Principal Investigator

[214]

A. Christensen, N. Mojab, K. Patel, K. Ahuja, Z. Akata, O. Winther, O. Gonzalez-Franco and A. Colaco.
Geometry Fidelity for Spherical Images.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI

Abstract

Spherical or omni-directional images offer an immersive visual format appealing to a wide range of computer vision applications. However, geometric properties of spherical images pose a major challenge for models and metrics designed for ordinary 2D images. Here, we show that direct application of Fréchet Inception Distance (FID) is insufficient for quantifying geometric fidelity in spherical images. We introduce two quantitative metrics accounting for geometric constraints, namely Omnidirectional FID (OmniFID) and Discontinuity Score (DS). OmniFID is an extension of FID tailored to additionally capture field-of-view requirements of the spherical format by leveraging cubemap projections. DS is a kernel-based seam alignment score of continuity across borders of 2D representations of spherical images. In experiments, OmniFID and DS quantify geometry fidelity issues that are undetected by FID.
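
For reference, the standard FID that OmniFID extends compares Gaussian fits to Inception features of real and generated images. The symbols below (mean and covariance with subscripts r for real and g for generated) are notation chosen here for illustration, not taken from the paper:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Because these statistics are computed on features of a flat 2D image, the score by itself carries no notion of the sphere's geometry, which is the gap the two proposed metrics address.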

MCML Authors

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[213]

C. Curreli, D. Muhle, A. Saroha, Z. Ye, R. Marin and D. Cremers.
Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI GitHub

Abstract

Probabilistic human motion prediction aims to forecast multiple possible future movements from past observations. While current approaches report high diversity and realism, they often generate motions with undetected limb stretching and jitter. To address this, we introduce SkeletonDiffusion, a latent diffusion model that embeds an explicit inductive bias on the human body within its architecture and training. Our model is trained with a novel nonisotropic Gaussian diffusion formulation that aligns with the natural kinematic structure of the human skeleton. Results show that our approach outperforms conventional isotropic alternatives, consistently generating realistic predictions while avoiding artifacts such as limb distortion. Additionally, we identify a limitation in commonly used diversity metrics, which may inadvertently favor models that produce inconsistent limb lengths within the same sequence. SkeletonDiffusion sets a new benchmark on three real-world datasets, outperforming various baselines across multiple evaluation metrics.

MCML Authors

Cecilia Curreli

Dominik Muhle

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[212]

W. Chen, G. Zhang, F. Wimbauer, R. Wang, N. Araslanov, A. Vedaldi and D. Cremers.
Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction.
ICCV 2025 - IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published. Preprint available. arXiv

Abstract

Traditional SLAM systems, which rely on bundle adjustment, struggle with highly dynamic scenes commonly found in casual videos. Such videos entangle the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, whereas the latter can lead to inconsistent motion estimates. Taking a novel approach, this work leverages a 3D point tracker to separate the camera-induced motion from the observed motion of dynamic objects. By considering only the camera-induced component, bundle adjustment can operate reliably on all scene elements as a result. We further ensure depth consistency across video frames with lightweight post-processing based on scale maps. Our framework combines the core of traditional SLAM – bundle adjustment – with a robust learning-based 3D tracker front-end. Integrating motion decomposition, bundle adjustment and depth refinement, our unified framework, BA-Track, accurately tracks the camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.

MCML Authors

Weirong Chen

Ganlin Zhang

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Felix Wimbauer

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Nikita Araslanov

Dr.

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[211]

H. Cao, Z. Zhang, Y. Xia, X. Li, J. Xia, G. Chen and A. Knoll.
Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub

Abstract

In frame-based vision, object detection faces substantial performance degradation under challenging conditions due to the limited sensing capability of conventional cameras. Event cameras output sparse and asynchronous events, providing a potential solution to solve these problems. However, effectively fusing two heterogeneous modalities remains an open issue. In this work, we propose a novel hierarchical feature refinement network for event-frame fusion. The core concept is the design of the coarse-to-fine fusion module, denoted as the cross-modality adaptive feature refinement (CAFR) module. In the initial phase, the bidirectional cross-modality interaction (BCI) part facilitates information bridging from two distinct sources. Subsequently, the features are further refined by aligning the channel-level mean and variance in the two-fold adaptive feature refinement (TAFR) part. We conducted extensive experiments on two benchmarks: the low-resolution PKU-DDD17-Car dataset and the high-resolution DSEC dataset. Experimental results show that our method surpasses the state-of-the-art by an impressive margin of 8% on the DSEC dataset. Besides, our method exhibits significantly better robustness (69.5% versus 38.7%) when introducing 15 different corruption types to the frame images.
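
Aligning channel-level mean and variance can be pictured with the sketch below, which renormalizes each source channel to the statistics of the matching target channel. This is a generic statistic-matching illustration in the style of adaptive instance normalization, not the authors' TAFR module; the function names and the `eps` value are assumptions.

```python
import math


def channel_stats(channel, eps=1e-8):
    """Mean and standard deviation of one flattened feature channel."""
    mean = sum(channel) / len(channel)
    var = sum((x - mean) ** 2 for x in channel) / len(channel)
    return mean, math.sqrt(var + eps)


def align_channels(source, target, eps=1e-8):
    """Rescale each source channel so its mean and std match the target's.

    source, target: lists of channels, each channel a flat list of floats.
    """
    aligned = []
    for src_ch, tgt_ch in zip(source, target):
        src_mean, src_std = channel_stats(src_ch, eps)
        tgt_mean, tgt_std = channel_stats(tgt_ch, eps)
        aligned.append(
            [(x - src_mean) / src_std * tgt_std + tgt_mean for x in src_ch]
        )
    return aligned


# Hypothetical single-channel features from two modalities.
event_features = [[1.0, 2.0, 3.0]]
frame_features = [[10.0, 20.0, 30.0]]
aligned = align_channels(event_features, frame_features)
```

After alignment, the source channel keeps its spatial pattern but takes on the target channel's first- and second-order statistics.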

MCML Authors

Yan Xia

Dr.

* Former Member

[210]

H. N. Dang, V. Golkov, J. Endres, S. Weinmüller, F. Glang, T. Wimmer, D. Cremers, A. Dörfler, A. Maier and M. Zaiss.
Joint sequence optimization beats pure neural network approaches for super-resolution TSE.
ISMRM 2024 - International Society for Magnetic Resonance in Medicine Annual Meeting. Singapore, May 04-09, 2024. URL

Abstract

Current MRI super-resolution (SR) methods only use existing contrasts acquired from typical clinical sequences as input for the neural network (NN). In turbo spin echo sequences (TSE) the sequence parameters can have a strong influence on the actual resolution of the acquired image and have consequently a considerable impact on the performance of the NN. We propose a known-operator learning approach to perform an end-to-end optimization of MR sequence and neural network parameters for SR-TSE. This MR-physics-informed training procedure jointly optimizes the radiofrequency pulse train of a proton density- (PD-) and T2-weighted TSE and a subsequently applied convolutional neural network to predict the corresponding PDw and T2w super-resolution TSE images. The found radiofrequency pulse train designs generate an optimal signal for the NN to perform the SR task. Our method generalizes from the simulation-based optimization to in vivo measurements and the acquired physics-informed SR images show higher correlation with a time-consuming segmented high-resolution TSE sequence compared to a pure network training approach.

MCML Authors

Vladimir Golkov

Dr.

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[209]

H. N. Dang, V. Golkov, T. Wimmer, D. Cremers, A. Maier and M. Zaiss.
Joint MR sequence optimization beats pure neural network approaches for spin-echo MRI super-resolution.
Preprint (May. 2023). arXiv

Abstract

MCML Authors

Vladimir Golkov

Dr.

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[208]

L. Della Libera, V. Golkov, Y. Zhu, A. Mielke and D. Cremers.
Deep Learning for 2D and 3D Rotatable Data: An Overview of Methods.
Preprint (Oct. 2019). arXiv

Abstract

Convolutional networks are successful due to their equivariance/invariance under translations. However, rotatable data such as images, volumes, shapes, or point clouds require processing with equivariance/invariance under rotations in cases where the rotational orientation of the coordinate system does not affect the meaning of the data (e.g. object classification). On the other hand, estimation/processing of rotations is necessary in cases where rotations are important (e.g. motion estimation). There has been recent progress in methods and theory in all these regards. Here we provide an overview of existing methods, both for 2D and 3D rotations (and translations), and identify commonalities and links between them.

MCML Authors

Vladimir Golkov

Dr.

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[207]

O. Dhaouadi, J. Meier, J. Kaiser and D. Cremers.
Shape Your Ground: Refining Road Surfaces Beyond Planar Representations.
IV 2025 - 36th IEEE Intelligent Vehicles Symposium. Cluj-Napoca, Romania, Jun 22-25, 2025. DOI

Abstract

Road surface reconstruction from aerial images is fundamental for autonomous driving, urban planning, and virtual simulation, where smoothness, compactness, and accuracy are critical quality factors. Existing reconstruction methods often produce artifacts and inconsistencies that limit usability, while downstream tasks have a tendency to represent roads as planes for simplicity but at the cost of accuracy. We introduce FlexRoad, the first framework to directly address road surface smoothing by fitting Non-Uniform Rational B-Splines (NURBS) surfaces to 3D road points obtained from photogrammetric reconstructions or geodata providers. Our method at its core utilizes the Elevation-Constrained Spatial Road Clustering (ECSRC) algorithm for robust anomaly correction, significantly reducing surface roughness and fitting errors. To facilitate quantitative comparison between road surface reconstruction methods, we present GeoRoad Dataset (GeRoD), a diverse collection of road surface and terrain profiles derived from openly accessible geodata. Experiments on GeRoD and the photogrammetry-based DeepScenario Open 3D Dataset (DSC3D) demonstrate that FlexRoad considerably surpasses commonly used road surface representations across various metrics while being insensitive to various input sources, terrains, and noise types. By performing ablation studies, we identify the key role of each component towards high-quality reconstruction performance, making FlexRoad a generic method for realistic road surface modeling.

MCML Authors

Johannes Meier

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[206]

O. Dhaouadi, R. Marin, J. Meier, J. Kaiser and D. Cremers.
OrthoLoC: UAV 6-DoF Localization and Calibration Using Orthographic Geodata.
NeurIPS 2025 - 39th Conference on Neural Information Processing Systems. San Diego, CA, USA, Nov 30-Dec 07, 2025. To be published. Preprint available. arXiv GitHub

Abstract

Accurate visual localization from aerial views is a fundamental problem with applications in mapping, large-area inspection, and search-and-rescue operations. In many scenarios, these systems require high-precision localization while operating with limited resources (e.g., no internet connection or GNSS/GPS support), making large image databases or heavy 3D models impractical. Surprisingly, little attention has been given to leveraging orthographic geodata as an alternative paradigm, which is lightweight and increasingly available through free releases by governmental authorities (e.g., the European Union). To fill this gap, we propose OrthoLoC, the first large-scale dataset comprising 16,425 UAV images from Germany and the United States with multiple modalities. The dataset addresses domain shifts between UAV imagery and geospatial data. Its paired structure enables fair benchmarking of existing solutions by decoupling image retrieval from feature matching, allowing isolated evaluation of localization and calibration performance. Through comprehensive evaluation, we examine the impact of domain shifts, data resolutions, and covisibility on localization accuracy. Finally, we introduce a refinement technique called AdHoP, which can be integrated with any feature matcher, improving matching by up to 95% and reducing translation error by up to 63%.

MCML Authors

Johannes Meier

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[205]

O. Dhaouadi, J. Meier, L. Wahl, J. Kaiser, L. Scalerandi, N. Wandelburg, Z. Zhou, N. Berinpanathan, H. Banzhaf and D. Cremers.
Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset.
IV 2025 - 36th IEEE Intelligent Vehicles Symposium. Cluj-Napoca, Romania, Jun 22-25, 2025. DOI

Abstract

Accurate 3D trajectory data is crucial for advancing autonomous driving. Yet, traditional datasets are usually captured by fixed sensors mounted on a car and are susceptible to occlusion. Additionally, such an approach can precisely reconstruct the dynamic environment in the close vicinity of the measurement vehicle only, while neglecting objects that are further away. In this paper, we introduce the DeepScenario Open 3D Dataset (DSC3D), a high-quality, occlusion-free dataset of 6 degrees of freedom bounding box trajectories acquired through a novel monocular camera drone tracking pipeline. Our dataset includes more than 175,000 trajectories of 14 types of traffic participants and significantly exceeds existing datasets in terms of diversity and scale, containing many unprecedented scenarios such as complex vehicle-pedestrian interaction on highly populated urban streets and comprehensive parking maneuvers from entry to exit. The DSC3D dataset was captured at five locations in Europe and the United States: a parking lot, a crowded inner-city, a steep urban intersection, a federal highway, and a suburban intersection. Our 3D trajectory dataset aims to enhance autonomous driving systems by providing detailed environmental 3D representations, which could lead to improved obstacle interactions and safety. We demonstrate its utility across multiple applications including motion prediction, motion planning, scenario mining, and generative reactive traffic agents. Our interactive online visualization platform and the complete dataset are publicly available, facilitating research in motion prediction, behavior modeling, and safety validation.

MCML Authors

Johannes Meier

Daniel Cremers

Prof. Dr.

Director

[204]

M. Dani, M. J. Prakash, Z. Akata and S. Liebe.
SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research.
AI4Science @ICML 2024 - AI for Science Workshop at the 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL

Abstract

Large Language Models have shown promising results in their ability to encode general medical knowledge in standard medical question-answering datasets. However, their potential application in clinical practice requires evaluation in domain-specific tasks, where benchmarks are largely missing. In this study semioLLM, we test the ability of state-of-the-art LLMs (GPT-3.5, GPT-4, Mixtral 8x7B, and Qwen-72chat) to leverage their internal knowledge and reasoning for epilepsy diagnosis. Specifically, we obtain likelihood estimates linking unstructured text descriptions of seizures to seizure-generating brain regions, using an annotated clinical database containing 1269 entries. We evaluate the LLM’s performance, confidence, reasoning, and citation abilities in comparison to clinical evaluation. Models achieve above-chance classification performance with prompt engineering significantly improving their outcome, with some models achieving close-to-clinical performance and reasoning. However, our analyses also reveal significant pitfalls with several models being overly confident while showing poor performance, as well as exhibiting citation errors and hallucinations. In summary, our work provides the first extensive benchmark comparing current SOTA LLMs in the medical domain of epilepsy and highlights their ability to leverage unstructured texts from patients’ medical history to aid diagnostic processes in health care.

MCML Authors

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[203]

N. De La Fuente, M. Pilligua, D. Vidal, A. Soutiff, C. Curreli, D. Cremers and A. Barsky.
Prototype Augmented Hypernetworks for Continual Learning.
Preprint (May. 2025). arXiv

Abstract

Continual learning (CL) aims to learn a sequence of tasks without forgetting prior knowledge, but gradient updates for a new task often overwrite the weights learned earlier, causing catastrophic forgetting (CF). We propose Prototype-Augmented Hypernetworks (PAH), a framework where a single hypernetwork, conditioned on learnable task prototypes, dynamically generates task-specific classifier heads on demand. To mitigate forgetting, PAH combines cross-entropy with dual distillation losses, one to align logits and another to align prototypes, ensuring stable feature representations across tasks. Evaluations on Split-CIFAR100 and TinyImageNet demonstrate that PAH achieves state-of-the-art performance, reaching 74.5% and 63.7% accuracy with only 1.7% and 4.4% forgetting, respectively, surpassing prior methods without storing samples or heads.

MCML Authors

Cecilia Curreli

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[202]

M. S. Deka, L. Sang and D. Cremers.
Erasing the Ephemeral: Joint Camera Refinement and Transient Object Removal for Street View Synthesis.
GCPR 2024 - German Conference on Pattern Recognition. Munich, Germany, Oct 10-13, 2024. DOI

Abstract

Creating novel views in urban settings is crucial for applications like autonomous driving and virtual tours. Unlike object-level or indoor situations, outdoor settings pose unique challenges, including larger scenes, frame inconsistencies from moving vehicles, and noisy camera poses. This paper introduces a method to address these challenges in view synthesis for outdoor scenarios, utilizing the neural point light field scene representation with 2D image data and 3D point cloud information. We propose a method that efficiently removes dynamic objects in the scene and jointly refines camera poses to recover clean views. We achieve this by estimating the optical flow for the input video sequence and masking out moving objects during training. By learning a consistent geometric representation in the neural point light field, the masked-out areas are correctly recovered in both trained and unseen views, without leaving black areas. Moreover, the learned geometry allows us to extrapolate from current camera trajectory and recover plausible extended views. Additionally, we propose to simultaneously optimize the camera pose along with the scene representation, accommodating noisy camera pose inputs typical of real-world applications. Through validation on real-world urban datasets, we demonstrate stable and satisfactory results in synthesizing novel views of urban scenes.

MCML Authors

Lu Sang

Daniel Cremers

Prof. Dr.

Director

[201]

A. Davtyan, S. Sameni, B. Ommer and P. Favaro.
CAGE: Unsupervised Visual Composition and Animation for Controllable Video Generation.
AAAI 2025 - 39th Conference on Artificial Intelligence. Philadelphia, PA, USA, Feb 25-Mar 04, 2025. DOI GitHub

Abstract

In this work we propose a novel method for unsupervised controllable video generation. Once trained on a dataset of unannotated videos, at inference our model is capable of both composing scenes of predefined object parts and animating them in a plausible and controlled way. This is achieved by conditioning video generation on a randomly selected subset of local pre-trained self-supervised features during training. We call our model CAGE for visual Composition and Animation for video GEneration. We conduct a series of experiments to demonstrate capabilities of CAGE in various settings.

MCML Authors

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[200]

N. T. Duc, A. Sonnweber, M. Weber, N. Araslanov and D. Cremers.
The Diashow Paradox: Stronger 3D-Aware Representations Emerge from Image Sets, Not Videos.
SP4V @ICCV 2025 - Structural Priors for Vision Workshop at the IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published. Preprint available. URL

Abstract

Image-based vision foundation models (VFMs) have demonstrated surprising 3D geometric awareness, despite no explicit 3D supervision or pre-training on multi-view data. While image-based models are widely adopted across a range of downstream tasks, video-based models have so far remained on the sidelines of this success. In this work, we conduct a comparative study of image and video models on three tasks encapsulating 3D awareness: multi-view consistency, depth and surface normal estimation. To enable a fair and reproducible evaluation of both image and video models, we develop AnyProbe, a unified framework for probing network representations. The results of our study reveal a surprising conclusion, which we refer to as the diashow paradox. Specifically, video-based pre-training does not provide any consistent advantage on downstream tasks involving 3D understanding over image-based pre-training. We formulate two hypotheses to explain our observations, which underscore the need for high-quality video datasets and highlight the inherent complexity of video-based pre-training. AnyProbe will be publicly released to streamline evaluation of image- and video-based VFMs alike in a consistent fashion.

MCML Authors

Mark Weber

Nikita Araslanov

Dr.

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[199]

S. Dziadzio, V. Udandarao, K. Roth, A. Prabhu, Z. Akata, S. Albanie and M. Bethge.
How to Merge Multimodal Models Over Time?
MCDC @ICLR 2025 - Workshop on Modularity for Collaborative, Decentralized, and Continual Deep Learning at the 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL

Abstract

Model merging combines multiple expert models finetuned from a base foundation model on diverse tasks and domains into a single, more capable model. However, most existing model merging approaches assume that all experts are available simultaneously. In reality, new tasks and domains emerge progressively over time, requiring strategies to integrate the knowledge of expert models as they become available: a process we call temporal model merging. The temporal dimension introduces unique challenges not addressed in prior work, raising new questions such as: when training for a new task, should the expert model start from the merged past experts or from the original base model? Should we merge all models at each time step? Which merging techniques are best suited for temporal merging? Should different strategies be used to initialize the training and deploy the model? To answer these questions, we propose a unified framework called TIME (Temporal Integration of Model Expertise) which defines temporal model merging across three axes: (1) initialization, (2) deployment, and (3) merging technique. Using TIME, we study temporal model merging across model sizes, compute budgets, and learning horizons on the FoMo-in-Flux benchmark. Our comprehensive suite of experiments across TIME allows us to build a better understanding of current challenges and best practices for effective temporal model merging.

MCML Authors

Karsten Roth

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[198]

S. Dziadzio, V. Udandarao, K. Roth, A. Prabhu, Z. Akata, S. Albanie and M. Bethge.
How to Merge Your Multimodal Models Over Time?
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

Model merging combines multiple expert models - finetuned from a base foundation model on diverse tasks and domains - into a single, more capable model. However, most existing model merging approaches assume that all experts are available simultaneously. In reality, new tasks and domains emerge progressively over time, requiring strategies to integrate the knowledge of expert models as they become available: a process we call temporal model merging. The temporal dimension introduces unique challenges not addressed in prior work, raising new questions such as: when training for a new task, should the expert model start from the merged past experts or from the original base model? Should we merge all models at each time step? Which merging techniques are best suited for temporal merging? Should different strategies be used to initialize the training and deploy the model? To answer these questions, we propose a unified framework called TIME - Temporal Integration of Model Expertise - which defines temporal model merging across three axes: (1) Initialization Phase, (2) Deployment Phase, and (3) Merging Technique. Using TIME, we study temporal model merging across model sizes, compute budgets, and learning horizons on the FoMo-in-Flux benchmark. Our comprehensive suite of experiments across TIME allows us to uncover key insights for temporal model merging, offering a better understanding of current challenges and best practices for effective temporal model merging.
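The three axes named in the abstract (initialization, deployment, merging technique) can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the TIME framework itself: models are plain lists of floats, `toy_finetune` is a hypothetical stand-in for training, and uniform weight averaging is assumed as the merging technique.

```python
def average_weights(models):
    """Merging technique (assumption): uniform averaging of weight vectors."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


def toy_finetune(start, task_optimum, lr=0.5):
    """Hypothetical stand-in for training: move weights toward a task optimum."""
    return [w + lr * (t - w) for w, t in zip(start, task_optimum)]


def temporal_merge(base, task_optima, init="merged"):
    """Process tasks one at a time and return the deployed model after each step."""
    experts = []
    deployed = list(base)
    history = []
    for optimum in task_optima:
        # Axis 1 (initialization): start from the merged model or from the base.
        start = deployed if init == "merged" else base
        experts.append(toy_finetune(start, optimum))
        # Axes 2 and 3 (deployment, technique): deploy the average of all experts.
        deployed = average_weights(experts)
        history.append(deployed)
    return history


base = [0.0, 0.0]
tasks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
from_base = temporal_merge(base, tasks, init="base")
from_merged = temporal_merge(base, tasks, init="merged")
print(from_base[-1], from_merged[-1])
```

Switching `init` changes only where each new expert starts, so the two histories can be compared step by step in this toy setting.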

MCML Authors

Karsten Roth

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[197]

T. Dagès, S. Weber, Y.-W. E. Lin, R. Talmon, D. Cremers, M. Lindenbaum, A. M. Bruckstein and R. Kimmel.
Finsler Multi-Dimensional Scaling: Manifold Learning for Asymmetric Dimensionality Reduction and Embedding.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

Dimensionality reduction is a fundamental task that aims to simplify complex data by reducing its feature dimensionality while preserving essential patterns, with core applications in data analysis and visualisation. To preserve the underlying data structure, multi-dimensional scaling (MDS) methods focus on preserving pairwise dissimilarities, such as distances. They optimise the embedding to have pairwise distances as close as possible to the data dissimilarities. However, the current standard is limited to embedding data in Riemannian manifolds. Motivated by the lack of asymmetry in the Riemannian metric of the embedding space, this paper extends the MDS problem to a natural asymmetric generalisation of Riemannian manifolds called Finsler manifolds. Inspired by Euclidean spaces, we define a canonical Finsler space for embedding asymmetric data. Due to its simplicity with respect to geodesics, data representation in this space is both intuitive and simple to analyse. We demonstrate that our generalisation benefits from the same theoretical convergence guarantees. We reveal the effectiveness of our Finsler embedding across various types of non-symmetric data, highlighting its value in applications such as data visualisation, dimensionality reduction, directed graph embedding, and link prediction.

MCML Authors

Thomas Dagès

Dr.

Thomas Bayes Fellow

Simon Weber

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Ya-Wei Eileen Lin

→ Group Stefanie Jegelka
Foundations of Deep Neural Networks

Daniel Cremers

Prof. Dr.

Director

→ Group Volker Tresp
Database Systems, Data Mining and AI

[196]

Z. Ding, J. Wu, J. Wu, Y. Xia and V. Tresp.
Temporal Fact Reasoning over Hyper-Relational Knowledge Graphs.
Findings @EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Stemming from traditional knowledge graphs (KGs), hyper-relational KGs (HKGs) provide additional key-value pairs (i.e., qualifiers) for each KG fact that help to better restrict the fact validity. In recent years, there has been an increasing interest in studying graph reasoning over HKGs. Meanwhile, as discussed in recent works that focus on temporal KGs (TKGs), world knowledge is ever-evolving, making it important to reason over temporal facts in KGs. Previous mainstream benchmark HKGs do not explicitly specify temporal information for each HKG fact. Therefore, almost all existing HKG reasoning approaches do not devise any module specifically for temporal reasoning. To better study temporal fact reasoning over HKGs, we propose a new type of data structure named hyper-relational TKG (HTKG). Every fact in an HTKG is coupled with a timestamp explicitly indicating its time validity. We develop two new benchmark HTKG datasets, i.e., Wiki-hy and YAGO-hy, and propose an HTKG reasoning model that efficiently models hyper-relational temporal facts. To support future research on this topic, we open-source our datasets and model.

MCML Authors

Zifeng Ding

Yan Xia

Dr.

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Volker Tresp

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[195]

V. Ehm, D. Cremers and F. Bernard.
Non-Separable Multi-Dimensional Network Flows for Visual Computing.
EG 2023 - Poster at the 44th Annual Conference of the European Association for Computer Graphics. Saarbrücken, Germany, May 08-12, 2023. DOI

Abstract

Flows in networks (or graphs) play a significant role in numerous computer vision tasks. The scalar-valued edges in these graphs often lead to a loss of information and thereby to limitations in terms of expressiveness. For example, oftentimes high-dimensional data (e.g. feature descriptors) are mapped to a single scalar value (e.g. the similarity between two feature descriptors). To overcome this limitation, we propose a novel formalism for non-separable multi-dimensional network flows. By doing so, we enable an automatic and adaptive feature selection strategy - since the flow is defined on a per-dimension basis, the maximizing flow automatically chooses the best matching feature dimensions. As a proof of concept, we apply our formalism to the multi-object tracking problem and demonstrate that our approach outperforms scalar formulations on the MOT16 benchmark in terms of robustness to noise.

MCML Authors

Viktoria Ehm

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[194]

V. Ehm, N. El Amrani, Y. Xie, L. Bastian, M. Gao, W. Wang, L. Sang, D. Cao, Z. Lähner, D. Cremers and F. Bernard.
Beyond Complete Shapes: A Benchmark for Quantitative Evaluation of 3D Shape Surface Matching Algorithms.
SGP 2025 - Symposium on Geometry Processing. Bilbao, Spain, Jun 30-Jul 04, 2025. DOI GitHub

Abstract

Finding correspondences between 3D deformable shapes is an important and long-standing problem in geometry processing, computer vision, graphics, and beyond. While various shape matching datasets exist, they are mostly static or limited in size, restricting their adaptation to different problem settings, including both full and partial shape matching. In particular the existing partial shape matching datasets are small (fewer than 100 shapes) and thus unsuitable for data-hungry machine learning approaches. Moreover, the type of partiality present in existing datasets is often artificial and far from realistic. To address these limitations, we introduce a generic and flexible framework for the procedural generation of challenging full and partial shape matching datasets. Our framework allows the propagation of custom annotations across shapes, making it useful for various applications. By utilising our framework and manually creating cross-dataset correspondences between seven existing (complete geometry) shape matching datasets, we propose a new large benchmark BeCoS with a total of 2543 shapes. Based on this, we offer several challenging benchmark settings, covering both full and partial matching, for which we evaluate respective state-of-the-art methods as baselines.

MCML Authors

Viktoria Ehm

Lennart Bastian

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Maolin Gao

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Lu Sang

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[193]

V. Ehm, M. Gao, P. Roetzer, M. Eisenberger, D. Cremers and F. Bernard.
Partial-to-Partial Shape Matching with Geometric Consistency.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI GitHub

Abstract

Finding correspondences between 3D shapes is an important and long-standing problem in computer vision, graphics and beyond. A prominent challenge are partial-to-partial shape matching settings, which occur when the shapes to match are only observed incompletely (e.g. from 3D scanning). Although partial-to-partial matching is a highly relevant setting in practice, it is rarely explored. Our work bridges the gap between existing (rather artificial) 3D full shape matching and partial-to-partial real-world settings by exploiting geometric consistency as a strong constraint. We demonstrate that it is indeed possible to solve this challenging problem in a variety of settings. For the first time, we achieve geometric consistency for partial-to-partial matching, which is realized by a novel integer non-linear program formalism building on triangle product spaces, along with a new pruning algorithm based on linear integer programming. Further, we generate a new inter-class dataset for partial-to-partial shape-matching. We show that our method outperforms current SOTA methods on both an established intra-class dataset and our novel inter-class dataset.

MCML Authors

Viktoria Ehm

Maolin Gao

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

[192]

E. Eulig, F. Jäger, J. Maier, B. Ommer and M. Kachelrieß.
Reconstructing and analyzing the invariances of low-dose CT image denoising networks.
Medical Physics 52 (Jan. 2025). DOI

Abstract

Background: Deep learning-based methods led to significant advancements in many areas of medical imaging, most of which are concerned with the reduction of artifacts caused by motion, scatter, or noise. However, with most neural networks being black boxes, they remain notoriously difficult to interpret, hindering their clinical implementation. In particular, it has been shown that networks exhibit invariances w.r.t. input features, that is, they learn to ignore certain information in the input data.
Purpose: To improve the interpretability of deep learning-based low-dose CT image denoising networks.
Methods: We learn a complete data representation of low-dose input images using a conditional variational autoencoder (cVAE). In this representation, invariances of any given denoising network are then disentangled from the information it is not invariant to using a conditional invertible neural network (cINN). At test time, image-space invariances are generated by applying the inverse of the cINN and subsequent decoding using the cVAE. We propose two methods to analyze sampled invariances and to find those that correspond to alterations of anatomical structures.
Results: The proposed method is applied to four popular deep learning-based low-dose CT image denoising networks. We find that the networks are not only invariant to noise amplitude and realizations, but also to anatomical structures.
Conclusions: The proposed method is capable of reconstructing and analyzing invariances of deep learning-based low-dose CT image denoising networks. This is an important step toward interpreting deep learning-based methods for medical imaging, which is essential for their clinical implementation.

MCML Authors

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[191]

L. Eyring, S. Karthik, A. Dosovitskiy, N. Ruiz and Z. Akata.
Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models.
Preprint (Aug. 2025). arXiv GitHub

Abstract

The new paradigm of test-time scaling has yielded remarkable breakthroughs in Large Language Models (LLMs) (e.g. reasoning models) and in generative vision models, allowing models to allocate additional computation during inference to effectively tackle increasingly complex problems. Despite the improvements of this approach, an important limitation emerges: the substantial increase in computation time makes the process slow and impractical for many applications. Given the success of this paradigm and its growing usage, we seek to preserve its benefits while eschewing the inference overhead. In this work we propose one solution to the critical problem of integrating test-time scaling knowledge into a model during post-training. Specifically, we replace reward guided test-time noise optimization in diffusion models with a Noise Hypernetwork that modulates initial input noise. We propose a theoretically grounded framework for learning this reward-tilted distribution for distilled generators, through a tractable noise-space objective that maintains fidelity to the base model while optimizing for desired characteristics. We show that our approach recovers a substantial portion of the quality gains from explicit test-time optimization at a fraction of the computational cost.

MCML Authors

Luca Eyring

Shyamgopal Karthik

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[190]

L. Eyring, S. Karthik, K. Roth, A. Dosovitskiy and Z. Akata.
ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub

Abstract

Text-to-Image (T2I) models have made significant advancements in recent years, but they still struggle to accurately capture intricate details specified in complex compositional prompts. While fine-tuning T2I models with reward objectives has shown promise, it suffers from ‘reward hacking’ and may not generalize well to unseen prompt distributions. In this work, we propose Reward-based Noise Optimization (ReNO), a novel approach that enhances T2I models at inference by optimizing the initial noise based on the signal from one or multiple human preference reward models. Remarkably, solving this optimization problem with gradient ascent for 50 iterations yields impressive results on four different one-step models across two competitive benchmarks, T2I-CompBench and GenEval. Within a computational budget of 20-50 seconds, ReNO-enhanced one-step models consistently surpass the performance of all current open-source Text-to-Image models. Extensive user studies demonstrate that our model is preferred nearly twice as often compared to the popular SDXL model and is on par with the proprietary Stable Diffusion 3 with 8B parameters. Moreover, given the same computational resources, a ReNO-optimized one-step model outperforms widely-used open-source models such as SDXL and PixArt-α, highlighting the efficiency and effectiveness of ReNO in enhancing T2I model performance at inference time.
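The idea of optimizing the initial noise by gradient ascent on a reward can be sketched in a toy setting. Everything below is an assumption made for illustration: the elementwise-scaling "generator", the quadratic "reward", and the step size are hypothetical stand-ins, not the one-step T2I models or human preference reward models used in the paper. Only the 50 ascent iterations mirror the abstract.

```python
import random


def toy_generator(z, scales):
    """Stand-in for a one-step generator: elementwise scaling (assumption)."""
    return [s * zi for s, zi in zip(scales, z)]


def toy_reward(x, target):
    """Stand-in for a reward model: higher when x is closer to the target."""
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def optimize_noise(z0, scales, target, steps=50, lr=0.05):
    """Gradient ascent on the initial noise; generator and reward stay fixed."""
    z = list(z0)
    for _ in range(steps):
        x = toy_generator(z, scales)
        # Analytic gradient of the toy reward with respect to each noise entry.
        grad = [-2.0 * s * (xi - ti) for s, xi, ti in zip(scales, x, target)]
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return z


random.seed(0)
scales = [0.5, 1.0, 1.5, 2.0]
target = [1.0, -1.0, 0.5, 0.0]
z0 = [random.gauss(0.0, 1.0) for _ in scales]
z_opt = optimize_noise(z0, scales, target)
reward_before = toy_reward(toy_generator(z0, scales), target)
reward_after = toy_reward(toy_generator(z_opt, scales), target)
print(reward_before, reward_after)
```

In a real setting the gradient would come from automatic differentiation through the generator and reward model rather than from a closed-form expression.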

MCML Authors

Luca Eyring

Shyamgopal Karthik

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Karsten Roth

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Principal Investigator

[189]

M. Eisenberger, D. Novotny, G. Kerchenbaum, P. Labatut, N. Neverova, D. Cremers and A. Vedaldi.
NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One Go.
CVPR 2021 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual, Jun 19-25, 2021. DOI GitHub

Abstract

We present NeuroMorph, a new neural network architecture that takes as input two 3D shapes and produces in one go, i.e. in a single feed forward pass, a smooth interpolation and point-to-point correspondences between them. The interpolation, expressed as a deformation field, changes the pose of the source shape to resemble the target, but leaves the object identity unchanged. NeuroMorph uses an elegant architecture combining graph convolutions with global feature pooling to extract local features. During training, the model is incentivized to create realistic deformations by approximating geodesics on the underlying shape space manifold. This strong geometric prior allows us to train our model end-to-end and in a fully unsupervised manner without requiring any manual correspondence annotations. NeuroMorph works well for a large variety of input shapes, including non-isometric pairs from different object categories. It obtains state-of-the-art results for both shape correspondence and interpolation tasks, matching or surpassing the performance of recent unsupervised and supervised methods on multiple benchmarks.

MCML Authors

Daniel Cremers

Prof. Dr.

Director

[188]

E. Eulig, B. Ommer and M. Kachelrieß.
Benchmarking deep learning-based low-dose CT image denoising algorithms.
Medical Physics 51 (Sep. 2024). DOI

Abstract

Background: Long-lasting efforts have been made to reduce radiation dose and thus the potential radiation risk to the patient for computed tomography (CT) acquisitions without severe deterioration of image quality. To this end, various techniques have been employed over the years including iterative reconstruction methods and noise reduction algorithms.
Purpose: Recently, deep learning-based methods for noise reduction became increasingly popular and a multitude of papers claim ever improving performance both quantitatively and qualitatively. However, the lack of a standardized benchmark setup and inconsistencies in experimental design across studies hinder the verifiability and reproducibility of reported results.
Methods: In this study, we propose a benchmark setup to overcome those flaws and improve reproducibility and verifiability of experimental results in the field. We perform a comprehensive and fair evaluation of several state-of-the-art methods using this standardized setup.
Results: Our evaluation reveals that most deep learning-based methods show statistically similar performance, and improvements over the past years have been marginal at best.
Conclusions: This study highlights the need for a more rigorous and fair evaluation of novel deep learning-based methods for low-dose CT image denoising. Our benchmark setup is a first and important step towards this direction and can be used by future researchers to evaluate their algorithms.

MCML Authors

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[187]

V. Ehm, P. Roetzer, M. Eisenberger, M. Gao, F. Bernard and D. Cremers.
Geometrically Consistent Partial Shape Matching.
3DV 2024 - 11th International Conference on 3D Vision. Davos, Switzerland, Mar 18-21, 2024. DOI GitHub

Abstract

Finding correspondences between 3D shapes is a crucial problem in computer vision and graphics, which is for example relevant for tasks like shape interpolation, pose transfer, or texture transfer. An often neglected but essential property of matchings is geometric consistency, which means that neighboring triangles in one shape are consistently matched to neighboring triangles in the other shape. Moreover, while in practice one often has only access to partial observations of a 3D shape (e.g. due to occlusion, or scanning artifacts), there do not exist any methods that directly address geometrically consistent partial shape matching. In this work we fill this gap by proposing to integrate state-of-the-art deep shape features into a novel integer linear programming partial shape matching formulation. Our optimization yields a globally optimal solution on low resolution shapes, which we then refine using a coarse-to-fine scheme. We show that our method can find more reliable results on partial shapes in comparison to existing geometrically consistent algorithms (for which one first has to fill missing parts with a dummy geometry). Moreover, our matchings are substantially smoother than learning-based state-of-the-art shape matching methods.

MCML Authors

Viktoria Ehm

Maolin Gao

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

[186]

M. Eisenberger, A. Toker, L. Leal-Taixé and D. Cremers.
G-MSM: Unsupervised Multi-Shape Matching with Graph-based Affinity Priors.
CVPR 2023 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada, Jun 18-23, 2023. DOI GitHub

Abstract

We present G-MSM (Graph-based Multi-Shape Matching), a novel unsupervised learning approach for non-rigid shape correspondence. Rather than treating a collection of input poses as an unordered set of samples, we explicitly model the underlying shape data manifold. To this end, we propose an adaptive multi-shape matching architecture that constructs an affinity graph on a given set of training shapes in a self-supervised manner. The key idea is to combine putative, pairwise correspondences by propagating maps along shortest paths in the underlying shape graph. During training, we enforce cycle-consistency between such optimal paths and the pairwise matches which enables our model to learn topology-aware shape priors. We explore different classes of shape graphs and recover specific settings, like template-based matching (star graph) or learnable ranking/sorting (TSP graph), as special cases in our framework. Finally, we demonstrate state-of-the-art performance on several recent shape correspondence benchmarks, including real-world 3D scan meshes with topological noise and challenging inter-class pairs.

MCML Authors

Laura Leal-Taixé

Prof. Dr.

Principal Investigator

* Former Principal Investigator

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[185]

G. Fabbro, V. Golkov, T. Kemp and D. Cremers.
Speech Synthesis and Control Using Differentiable DSP.
Preprint (Oct. 2020). arXiv

Abstract

Modern text-to-speech systems are able to produce natural and high-quality speech, but speech contains factors of variation (e.g. pitch, rhythm, loudness, timbre) that text alone cannot contain. In this work we move towards a speech synthesis system that can produce diverse speech renditions of a text by allowing (but not requiring) explicit control over the various factors of variation. We propose a new neural vocoder that offers control of such factors of variation. This is achieved by employing differentiable digital signal processing (DDSP) (previously used only for music rather than speech), which exposes these factors of variation. The results show that the proposed approach can produce natural speech with realistic timbre, and individual factors of variation can be freely controlled.

MCML Authors

Vladimir Golkov

Dr.

Daniel Cremers

Prof. Dr.

Director

[184]

J. S. Fischer, M. Gui, P. Ma, N. Stracke, S. A. Baumann and B. Ommer.
FMBoost: Boosting Latent Diffusion with Flow Matching.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. Oral Presentation. DOI

Abstract

Visual synthesis has recently seen significant leaps in performance, largely due to breakthroughs in generative models. Diffusion models have been a key enabler, as they excel in image diversity. However, this comes at the cost of slow training and synthesis, which is only partially alleviated by latent diffusion. To this end, flow matching is an appealing approach due to its complementary characteristics of faster training and inference but less diverse synthesis. We demonstrate our FMBoost approach, which introduces flow matching between a frozen diffusion model and a convolutional decoder that enables high-resolution image synthesis at reduced computational cost and model size. A small diffusion model can then effectively provide the necessary visual diversity, while flow matching efficiently enhances resolution and detail by mapping the small to a high-dimensional latent space, producing high-resolution images. Combining the diversity of diffusion models, the efficiency of flow matching, and the effectiveness of convolutional decoders, state-of-the-art high-resolution image synthesis is achieved at 1024² pixels with minimal computational cost. Cascading FMBoost optionally boosts this further to 2048² pixels. Importantly, this approach is orthogonal to recent approximation and speed-up strategies for the underlying model, making it easily integrable into the various diffusion model frameworks.

MCML Authors

Pingchuan Ma

Björn Ommer

Prof. Dr.

Principal Investigator

[183]

M. Fuest, V. T. Hu and B. Ommer.
MaskFlow: Discrete Flows For Flexible and Efficient Long Video Generation.
Preprint (Feb. 2025). arXiv

Abstract

Generating long, high-quality videos remains a challenge due to the complex interplay of spatial and temporal dynamics and hardware limitations. In this work, we introduce MaskFlow, a unified video generation framework that combines discrete representations with flow-matching to enable efficient generation of high-quality long videos. By leveraging a frame-level masking strategy during training, MaskFlow conditions on previously generated unmasked frames to generate videos with lengths ten times beyond that of the training sequences. MaskFlow does so very efficiently by enabling the use of fast Masked Generative Model (MGM)-style sampling and can be deployed in both fully autoregressive as well as full-sequence generation modes. We validate the quality of our method on the FaceForensics (FFS) and Deepmind Lab (DMLab) datasets and report Fréchet Video Distance (FVD) competitive with state-of-the-art approaches. We also provide a detailed analysis on the sampling efficiency of our method and demonstrate that MaskFlow can be applied to both timestep-dependent and timestep-independent models in a training-free manner.

MCML Authors

Vincent Tao Hu

Dr.

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[182]

F. Förster, Q. Khan and D. Cremers.
Decentralized Reinforcement Learning for Multi-Agent Navigation in Unconstrained Environments.
IV 2025 - 36th IEEE Intelligent Vehicles Symposium. Cluj-Napoca, Romania, Jun 22-25, 2025. DOI

Abstract

Supervised learning has proven to be an effective strategy in training neural networks for vehicle navigation. However, it requires labeled data, which may not be available when a large number of vehicles need to be controlled simultaneously. In contrast, Deep Reinforcement Learning (DRL) circumvents the necessity for ground truth labels through environmental exploration. However, most concurrent DRL approaches either tend to operate in the discrete action/state space or do not consider the vehicle kinematics. In this paper, we use DRL to control multiple vehicles while also considering their kinematics. The task is for all the vehicles to reach their desired destination/target while avoiding collisions with each other or static obstacles in an unconstrained environment. For this, we propose a decentralized Proximal Policy Optimization (PPO) based DRL agent that independently provides control commands to each vehicle. The agent is based on two separate PPO models. The first is used to drive each vehicle to the proximity of its target. Once within the target’s proximity, the second model is used to park that vehicle at the correct position and orientation. The decentralized nature of the algorithm allows each agent to rely only on information about its current state and target, along with details regarding the closest obstacle/agent. By scaling this approach to all vehicles, simultaneous navigation of multiple vehicles can be achieved. Experimental results show a collective strategy that allows consistent results across a wide range of scenarios while scaling to situations with up to 20 vehicles and 12 stationary obstacles.

MCML Authors

Qadeer Khan

Daniel Cremers

Prof. Dr.

Director

[181]

T. Frerix, D. Kochkov, J. Smith, D. Cremers, M. Brenner and S. Hoyer.
Variational Data Assimilation with a Learned Inverse Observation Operator.
ICML 2021 - 38th International Conference on Machine Learning. Virtual, Jul 18-24, 2021. Spotlight Presentation. URL

Abstract

Variational data assimilation optimizes for an initial state of a dynamical system such that its evolution fits observational data. The physical model can subsequently be evolved into the future to make predictions. This principle is a cornerstone of large scale forecasting applications such as numerical weather prediction. As such, it is implemented in current operational systems of weather forecasting agencies across the globe. However, finding a good initial state poses a difficult optimization problem in part due to the non-invertible relationship between physical states and their corresponding observations. We learn a mapping from observational data to physical states and show how it can be used to improve optimizability. We employ this mapping in two ways: to better initialize the non-convex optimization problem, and to reformulate the objective function in better behaved physics space instead of observation space. Our experimental results for the Lorenz96 model and a two-dimensional turbulent fluid flow demonstrate that this procedure significantly improves forecast quality for chaotic systems.

MCML Authors

Daniel Cremers

Prof. Dr.

Director

[180]

M. Fuest, P. Ma, M. Gui, J. Schusterbauer, V. T. Hu and B. Ommer.
Diffusion Models and Representation Learning: A Survey.
Preprint (Jul. 2024). arXiv GitHub

Abstract

Diffusion Models are popular generative modeling methods in various vision tasks, attracting significant attention. They can be considered a unique instance of self-supervised learning methods due to their independence from label annotation. This survey explores the interplay between diffusion models and representation learning. It provides an overview of diffusion models’ essential aspects, including mathematical foundations, popular denoising network architectures, and guidance methods. Various approaches related to diffusion models and representation learning are detailed. These include frameworks that leverage representations learned from pre-trained diffusion models for subsequent recognition tasks and methods that utilize advancements in representation and self-supervised learning to enhance diffusion models. This survey aims to offer a comprehensive overview of the taxonomy between diffusion models and representation learning, identifying key areas of existing concerns and potential exploration.

MCML Authors

Pingchuan Ma

Johannes Schusterbauer

Vincent Tao Hu

Dr.

Björn Ommer

Prof. Dr.

Principal Investigator

[179]

F. Fundel, J. Schusterbauer, V. T. Hu and B. Ommer.
Distillation of Diffusion Features for Semantic Correspondence.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI

Abstract

Semantic correspondence, the task of determining relationships between different parts of images, underpins various applications including 3D reconstruction, image-to-image translation, object tracking, and visual place recognition. Recent studies have begun to explore representations learned in large generative image models for semantic correspondence, demonstrating promising results. Building on this progress, current state-of-the-art methods rely on combining multiple large models, resulting in high computational demands and reduced efficiency. In this work, we address this challenge by proposing a more computationally efficient approach. We propose a novel knowledge distillation technique to overcome the problem of reduced efficiency. We show how to use two large vision foundation models and distill the capabilities of these complementary models into one smaller model that maintains high accuracy at reduced computational cost. Furthermore, we demonstrate that by incorporating 3D data, we are able to further improve performance, without the need for human-annotated correspondences. Overall, our empirical results demonstrate that our distilled model with 3D data augmentation achieves performance superior to current state-of-the-art methods while significantly reducing computational load and enhancing practicality for real-world applications, such as semantic video correspondence. Our code and weights are publicly available on our project page.

MCML Authors

Johannes Schusterbauer

Vincent Tao Hu

Dr.

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[178]

T. Fröch, O. Wysocki, Y. Xia, J. Xie, B. Schwab, D. Cremers and T. H. Kolbe.
FacaDiffy: Inpainting Unseen Facade Parts Using Diffusion Models.
Preprint (Feb. 2025). arXiv GitHub

Abstract

High-detail semantic 3D building models are frequently utilized in robotics, geoinformatics, and computer vision. One key aspect of creating such models is employing 2D conflict maps that detect openings’ locations in building facades. Yet, in reality, these maps are often incomplete due to obstacles encountered during laser scanning. To address this challenge, we introduce FacaDiffy, a novel method for inpainting unseen facade parts by completing conflict maps with a personalized Stable Diffusion model. Specifically, we first propose a deterministic ray analysis approach to derive 2D conflict maps from existing 3D building models and corresponding laser scanning point clouds. Furthermore, we facilitate the inpainting of unseen facade objects into these 2D conflict maps by leveraging the potential of personalizing a Stable Diffusion model. To complement the scarcity of real-world training data, we also develop a scalable pipeline to produce synthetic conflict maps using random city model generators and annotated facade images. Extensive experiments demonstrate that FacaDiffy achieves state-of-the-art performance in conflict map completion compared to various inpainting baselines and increases the detection rate by 22% when applying the completed conflict maps for high-definition 3D semantic building reconstruction.

MCML Authors

Yan Xia

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Director

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

[177]

A. Farshad, Y. Yeganeh, Y. Chi, C. Shen, B. Ommer and N. Navab.
Scenegenie: Scene graph guided diffusion models for image synthesis.
Workshop @ICCV 2023 - Workshop at the IEEE/CVF International Conference on Computer Vision. Paris, France, Oct 02-06, 2023. DOI

Abstract

Text-conditioned image generation has made significant progress in recent years with generative adversarial networks and, more recently, diffusion models. While diffusion models conditioned on text prompts have produced impressive and high-quality images, accurately representing complex text prompts such as the number of instances of a specific object remains challenging. To address this limitation, we propose a novel guidance approach for the sampling process in the diffusion model that leverages bounding box and segmentation map information at inference time without additional training data. Through a novel loss in the sampling process, our approach guides the model with semantic features from CLIP embeddings and enforces geometric constraints, leading to high-resolution images that accurately represent the scene. To obtain bounding box and segmentation map information, we structure the text prompt as a scene graph and enrich the nodes with CLIP embeddings. Our proposed model achieves state-of-the-art performance on two public benchmarks for image generation from scene graphs, surpassing both scene graph to image and text-based diffusion models in various metrics. Our results demonstrate the effectiveness of incorporating bounding box and segmentation map guidance in the diffusion model sampling process for more accurate text-to-image generation.

MCML Authors

Azade Farshad

Dr.

Yousef Yeganeh

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Björn Ommer

Prof. Dr.

Principal Investigator

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Principal Investigator

[176]

L. Girrbach, S. Alaniz, G. Smith and Z. Akata.
A Large Scale Analysis of Gender Biases in Text-to-Image Generative Models.
Preprint (Mar. 2025). arXiv

Abstract

With the increasing use of image generation technology, understanding its social biases, including gender bias, is essential. This paper presents the first large-scale study on gender bias in text-to-image (T2I) models, focusing on everyday situations. While previous research has examined biases in occupations, we extend this analysis to gender associations in daily activities, objects, and contexts. We create a dataset of 3,217 gender-neutral prompts and generate 200 images per prompt from five leading T2I models. We automatically detect the perceived gender of people in the generated images and filter out images with no person or multiple people of different genders, leaving 2,293,295 images. To enable a broad analysis of gender bias in T2I models, we group prompts into semantically similar concepts and calculate the proportion of male- and female-gendered images for each prompt. Our analysis shows that T2I models reinforce traditional gender roles, reflect common gender stereotypes in household roles, and underrepresent women in financial related activities. Women are predominantly portrayed in care- and human-centered scenarios, and men in technical or physical labor scenarios.

MCML Authors

Leander Girrbach

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Stephan Alaniz

Dr.

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[175]

L. Girrbach, Y. Huang, S. Alaniz, T. Darrell and Z. Akata.
Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs).
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL

Abstract

Pre-trained large language models (LLMs) have been reliably integrated with visual input for multimodal tasks. The widespread adoption of instruction-tuned image-to-text vision-language assistants (VLAs) like LLaVA and InternVL necessitates evaluating gender biases. We study gender bias in 22 popular open-source VLAs with respect to personality traits, skills, and occupations. Our results show that VLAs replicate human biases likely present in the data, such as real-world occupational imbalances. Similarly, they tend to attribute more skills and positive personality traits to women than to men, and we see a consistent tendency to associate negative personality traits with men. To eliminate the gender bias in these models, we find that finetuning-based debiasing methods achieve the best tradeoff between debiasing and retaining performance on downstream tasks. We argue for pre-deploying gender bias assessment in VLAs and motivate further development of debiasing strategies to ensure equitable societal outcomes.

MCML Authors

Leander Girrbach

Yiran Huang

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Stephan Alaniz

Dr.

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[174]

X. Gong, O. Hahn, C. Reich, K. Singh, S. Schaub-Meyer, D. Cremers and S. Roth.
Motion-Refined DINOSAUR for Unsupervised Multi-Object Discovery.
Workshop @ICCV 2025 - Workshop at the IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published. Preprint available. arXiv GitHub

Abstract

Unsupervised multi-object discovery (MOD) aims to detect and localize distinct object instances in visual scenes without any form of human supervision. Recent approaches leverage object-centric learning (OCL) and motion cues from video to identify individual objects. However, these approaches use supervision to generate pseudo labels to train the OCL model. We address this limitation with MR-DINOSAUR – Motion-Refined DINOSAUR – a minimalistic unsupervised approach that extends the self-supervised pre-trained OCL model, DINOSAUR, to the task of unsupervised multi-object discovery. We generate high-quality unsupervised pseudo labels by retrieving video frames without camera motion for which we perform motion segmentation of unsupervised optical flow. We refine DINOSAUR’s slot representations using these pseudo labels and train a slot deactivation module to assign slots to foreground and background. Despite its conceptual simplicity, MR-DINOSAUR achieves strong multi-object discovery results on the TRI-PD and KITTI datasets, outperforming the previous state of the art despite being fully unsupervised.

MCML Authors

Christoph Reich

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[173]

M. Gao, Z. Lähner, J. Thunberg, D. Cremers and F. Bernard.
Isometric Multi-Shape Matching.
CVPR 2021 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual, Jun 19-25, 2021. DOI GitHub

Abstract

Finding correspondences between shapes is a fundamental problem in computer vision and graphics, which is relevant for many applications, including 3D reconstruction, object tracking, and style transfer. The vast majority of correspondence methods aim to find a solution between pairs of shapes, even if multiple instances of the same class are available. While isometries are often studied in shape correspondence problems, they have not been considered explicitly in the multi-matching setting. This paper closes this gap by proposing a novel optimisation formulation for isometric multi-shape matching. We present a suitable optimisation algorithm for solving our formulation and provide a convergence and complexity analysis. Our algorithm obtains multi-matchings that are by construction provably cycle-consistent. We demonstrate the superior performance of our method on various datasets and set the new state-of-the-art in isometric multi-shape matching.
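The notion of cycle consistency "by construction" mentioned in this abstract can be made concrete with a small sketch. It shows one common way to obtain the property: each shape is matched to a shared universe of points, and pairwise matchings are composed from those maps. Whether this matches the paper's exact formulation is an assumption, and all names below are hypothetical.

```python
import itertools
import numpy as np

def random_permutation_matrix(n, rng):
    """Random n x n permutation matrix standing in for a shape-to-universe matching."""
    return np.eye(n, dtype=int)[rng.permutation(n)]

def pairwise_from_universe(universe_maps):
    """Compose P_ij = U_i @ U_j.T for every ordered pair of shapes (i, j)."""
    k = len(universe_maps)
    return {(i, j): universe_maps[i] @ universe_maps[j].T
            for i in range(k) for j in range(k)}

def is_cycle_consistent(pairwise, k):
    """Check P_ij @ P_jl == P_il for all triples of shapes (i, j, l)."""
    return all(np.array_equal(pairwise[(i, j)] @ pairwise[(j, l)], pairwise[(i, l)])
               for i, j, l in itertools.product(range(k), repeat=3))
```

Because each `U_j` is a permutation matrix, `U_j.T @ U_j` is the identity, so the composed pairwise matchings satisfy the cycle condition without any extra constraint. Matchings estimated independently for each pair carry no such guarantee.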

MCML Authors

Maolin Gao

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[172]

V. Golkov.
Deep learning and variational analysis for high-dimensional and geometric biomedical data.
Dissertation Aug. 2021. URL

Abstract

In this thesis, we use deep learning and variational analysis to solve various problems from biology and medicine related to advanced data structures. We predict the structure of proteins from their evolutionary statistics, and the function of proteins and small molecules from their structure. We also present image processing methods for diffusion MRI that reduce the scan duration by a factor of twelve and improve the image quality.

MCML Authors

Vladimir Golkov

Dr.

[171]

M. Gao, P. Roetzer, M. Eisenberger, Z. Lähner, M. Moeller, D. Cremers and F. Bernard.
ΣIGMA: Scale-Invariant Global Sparse Shape Matching.
ICCV 2023 - IEEE/CVF International Conference on Computer Vision. Paris, France, Oct 02-06, 2023. DOI

Abstract

We propose a novel mixed-integer programming (MIP) formulation for generating precise sparse correspondences for highly non-rigid shapes. To this end, we introduce a projected Laplace-Beltrami operator (PLBO) which combines intrinsic and extrinsic geometric information to measure the deformation quality induced by predicted correspondences. We integrate the PLBO, together with an orientation-aware regulariser, into a novel MIP formulation that can be solved to global optimality for many practical problems. In contrast to previous methods, our approach is provably invariant to rigid transformations and global scaling, initialisation-free, has optimality guarantees, and scales to high resolution meshes with (empirically observed) linear time. We show state-of-the-art results for sparse non-rigid matching on several challenging 3D datasets, including data with inconsistent meshing, as well as applications in mesh-to-point-cloud matching.

MCML Authors

Maolin Gao

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[170]

V. Golkov, M. J. Skwark, A. Mirchev, G. Dikov, A. R. Geanes, J. Mendenhall, J. Meiler and D. Cremers.
3D Deep Learning for Biological Function Prediction from Physical Fields.
3DV 2020 - 8th International Conference on 3D Vision. Virtual, Nov 25-28, 2020. DOI

Abstract

Predicting the biological function of molecules, be it proteins or drug-like compounds, from their atomic structure is an important and long-standing problem. The electron density field and electrostatic potential field of a molecule contain the “raw fingerprint” of how this molecule can fit to binding partners. In this paper, we show that deep learning can predict biological function of molecules directly from their raw 3D approximated electron density and electrostatic potential fields. Protein function based on Enzyme Commission numbers is predicted from the approximated electron density field. In another experiment, the activity of small molecules is predicted with quality comparable to state-of-the-art descriptor-based methods. We propose several alternative computational models for the GPU with different memory and runtime requirements for different sizes of molecules and of databases. We also propose application-specific multi-channel data representations.

MCML Authors

Vladimir Golkov

Dr.

Daniel Cremers

Prof. Dr.

Director

[169]

M. Gui, J. Schusterbauer, U. Prestel, P. Ma, D. Kotovenko, O. Grebenkova, S. A. Baumann, V. T. Hu and B. Ommer.
DepthFM: Fast Generative Monocular Depth Estimation with Flow Matching.
AAAI 2025 - 39th Conference on Artificial Intelligence. Philadelphia, PA, USA, Feb 25-Mar 04, 2025. Oral Presentation. DOI

Abstract

Current discriminative depth estimation methods often produce blurry artifacts, while generative approaches suffer from slow sampling due to curvatures in the noise-to-depth transport. Our method addresses these challenges by framing depth estimation as a direct transport between image and depth distributions. We are the first to explore flow matching in this field, and we demonstrate that its interpolation trajectories enhance both training and sampling efficiency while preserving high performance. While generative models typically require extensive training data, we mitigate this dependency by integrating external knowledge from a pre-trained image diffusion model, enabling effective transfer even across differing objectives. To further boost our model performance, we employ synthetic data and utilize image-depth pairs generated by a discriminative model on an in-the-wild image dataset. As a generative model, our model can reliably estimate depth confidence, which provides an additional advantage. Our approach achieves competitive zero-shot performance on standard benchmarks of complex natural scenes while improving sampling efficiency and only requiring minimal synthetic data for training.
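The idea of a direct, straight transport between two distributions can be sketched with the generic flow matching recipe: linear interpolation paths and a constant target velocity. This is a minimal illustration, not the DepthFM model. Here `x0` and `x1` stand in for paired image and depth representations, which is an assumption, and the network, conditioning, and any noise handling are omitted. All function names are hypothetical.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return x1 - x0

def flow_matching_loss(predicted_velocity, x0, x1):
    """Mean squared error between a predicted and the target velocity."""
    return float(np.mean((predicted_velocity - target_velocity(x0, x1)) ** 2))

def euler_sample(x0, velocity_fn, n_steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        x = x + dt * velocity_fn(x, step * dt)
    return x
```

When the velocity field is constant along the path, even a few Euler steps reach the endpoint. This is the intuition behind the sampling-efficiency argument for straight trajectories.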

MCML Authors

Johannes Schusterbauer

Pingchuan Ma

Olga Grebenkova

Vincent Tao Hu

Dr.

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[168]

M. Gladkova, R. Wang, N. Zeller and D. Cremers.
Tight Integration of Feature-based Relocalization in Monocular Direct Visual Odometry.
ICRA 2021 - IEEE International Conference on Robotics and Automation. Xi’an, China, May 30-Jun 05, 2021. DOI

Abstract

In this paper, we propose a framework for integrating map-based relocalization into online direct visual odometry. To achieve map-based relocalization for direct methods, we integrate image features into Direct Sparse Odometry (DSO) and rely on feature matching to associate online visual odometry (VO) with a previously built map. The integration of the relocalization poses is threefold. Firstly, they are incorporated as pose priors in the direct image alignment of the front-end tracking. Secondly, they are tightly integrated into the back-end bundle adjustment. Thirdly, an online fusion module is further proposed to combine relative VO poses and global relocalization poses in a pose graph to estimate keyframe-wise smooth and globally accurate poses. We evaluate our method on two multi-weather datasets showing the benefits of integrating different handcrafted and learned features and demonstrating promising improvements on camera tracking accuracy.

MCML Authors

Mariia Gladkova

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[167]

O. Hahn, N. Araslanov, S. Schaub-Meyer and S. Roth.
Boosting Unsupervised Semantic Segmentation with Principal Mask Proposals.
Transactions on Machine Learning Research (Sep. 2024). URL GitHub

Abstract

Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global semantic categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present PriMaPs - Principal Mask Proposals - decomposing images into semantically meaningful masks based on their feature representation. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across different datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM is able to boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines.

MCML Authors

Nikita Araslanov

Dr.

[166]

V. T. Hu, S. A. Baumann, M. Gui, O. Grebenkova, P. Ma, J. Fischer and B. Ommer.
ZigMa: A DiT-style Zigzag Mamba Diffusion Model.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI

Abstract

The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in the scan scheme of Mamba. Secondly, building upon this insight, we introduce Zigzag Mamba, a simple, plug-and-play, minimal-parameter burden, DiT style solution, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. This heterogeneous layerwise scan also incurs no additional memory or speed burden when we consider more scan paths. Lastly, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ, UCF101, MultiModal-CelebA-HQ, and MS COCO.

MCML Authors

Vincent Tao Hu

Dr.

Olga Grebenkova

Pingchuan Ma

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[165]

F. Hofherr, B. Haefner and D. Cremers.
On Neural BRDFs: A Thorough Comparison of State-of-the-Art Approaches.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. Oral Presentation. DOI

Abstract

The bidirectional reflectance distribution function (BRDF) is an essential tool to capture the complex interaction of light and matter. Recently, several works have employed neural methods for BRDF modeling, following various strategies, ranging from utilizing existing parametric models to purely neural parametrizations. While all methods yield impressive results, a comprehensive comparison of the different approaches is missing in the literature. In this work, we present a thorough evaluation of several approaches, including results for qualitative and quantitative reconstruction quality and an analysis of reciprocity and energy conservation. Moreover, we propose two extensions that can be added to existing approaches: A novel additive combination strategy for neural BRDFs that split the reflectance into a diffuse and a specular part, and an input mapping that ensures reciprocity exactly by construction, while previous approaches only ensure it by soft constraints.

MCML Authors

Florian Hofherr

Daniel Cremers

Prof. Dr.

Director

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[164]

T. Hummel, S. Karthik, M.-I. Georgescu and Z. Akata.
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub

Abstract

In Composed Video Retrieval, a video and a textual description which modifies the video content are provided as inputs to the model. The aim is to retrieve the relevant video with the modified content from a database of videos. In this challenging task, the first step is to acquire large-scale training datasets and collect high-quality benchmarks for evaluation. In this work, we introduce EgoCVR, a new evaluation benchmark for fine-grained Composed Video Retrieval using large-scale egocentric video datasets. EgoCVR consists of 2,295 queries that specifically focus on high-quality temporal video understanding. We find that existing Composed Video Retrieval frameworks do not achieve the necessary high-quality temporal video understanding for this task. To address this shortcoming, we adapt a simple training-free method, propose a generic re-ranking framework for Composed Video Retrieval, and demonstrate that this achieves strong results on EgoCVR.

MCML Authors

Shyamgopal Karthik

Iuliana Georgescu

Dr.

* Former Member

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[163]

R. Hartwig, D. Muhle, R. Marin and D. Cremers.
GECO: Geometrically Consistent Embedding with Lightspeed Inference.
Preprint (Aug. 2025). arXiv GitHub

Abstract

Recent advances in feature learning have shown that self-supervised vision foundation models can capture semantic correspondences but often lack awareness of underlying 3D geometry. GECO addresses this gap by producing geometrically coherent features that semantically distinguish parts based on geometry (e.g., left/right eyes, front/back legs). We propose a training framework based on optimal transport, enabling supervision beyond keypoints, even under occlusions and disocclusions. With a lightweight architecture, GECO runs at 30 fps, 98.2% faster than prior methods, while achieving state-of-the-art performance on PFPascal, APK, and CUB, improving PCK by 6.0%, 6.2%, and 4.1%, respectively. Finally, we show that PCK alone is insufficient to capture geometric quality and introduce new metrics and insights for more geometry-aware feature learning.
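The PCK (percentage of correct keypoints) metric reported in this abstract can be sketched as follows. The normalization shown, a threshold of `alpha` times the larger bounding-box side, is one common convention. Benchmarks differ on this point, so the threshold rule and the function name are assumptions rather than the paper's evaluation code.

```python
import numpy as np

def pck(predicted, ground_truth, bbox_size, alpha=0.1):
    """Fraction of keypoints whose prediction lies within
    alpha * max(bbox_size) pixels of the ground-truth location."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    distances = np.linalg.norm(predicted - ground_truth, axis=-1)
    threshold = alpha * max(bbox_size)
    return float(np.mean(distances <= threshold))
```

A keypoint counts as correct as long as it falls inside the threshold radius, regardless of which semantically similar part it lands on. The abstract argues that this is why PCK alone does not capture geometric quality.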

MCML Authors

Regine Hartwig

* Former Member

Dominik Muhle

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[162]

K. Han, D. Muhle, F. Wimbauer and D. Cremers.
Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI

Abstract

Inferring scene geometry from images via Structure from Motion is a long-standing and fundamental problem in computer vision. While classical approaches and, more recently, depth map predictions only focus on the visible parts of a scene, the task of scene completion aims to reason about geometry even in occluded regions. With the popularity of neural radiance fields (NeRFs), implicit representations also became popular for scene completion by predicting so-called density fields. Unlike explicit approaches, e.g., voxel-based methods, density fields also allow for accurate depth prediction and novel-view synthesis via image-based rendering. In this work, we propose to fuse the scene reconstruction from multiple images and distill this knowledge into a more accurate single-view scene reconstruction. To this end, we propose Multi-View Behind the Scenes (MVBTS) to fuse density fields from multiple posed images, trained fully self-supervised only from image data. Using knowledge distillation, we use MVBTS to train a single-view scene completion network via direct supervision called KDBTS. It achieves state-of-the-art performance on occupancy prediction, especially in occluded regions.

MCML Authors

Dominik Muhle

Felix Wimbauer

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

[161]

V. T. Hu and B. Ommer.
[MASK] is All You Need.
Preprint (Dec. 2024). arXiv

Abstract

In generative models, two paradigms have gained traction in various applications: next-set prediction-based Masked Generative Models and next-noise prediction-based Non-Autoregressive Models, e.g., Diffusion Models. In this work, we propose using discrete-state models to connect them and explore their scalability in the vision domain. First, we conduct a step-by-step analysis in a unified design space across two types of models, including timestep-independence, noise schedule, temperature, guidance strength, etc., in a scalable manner. Second, we re-cast typical discriminative tasks, e.g., image segmentation, as an unmasking process from [MASK] tokens on a discrete-state model. This enables us to perform various sampling processes, including flexible conditional sampling, by only training once to model the joint distribution. All aforementioned explorations lead to our framework named Discrete Interpolants, which enables us to achieve state-of-the-art or competitive performance compared to previous discrete-state based methods in various benchmarks, like ImageNet256, MS COCO, and the video dataset FaceForensics. In summary, by leveraging [MASK] in discrete-state models, we can bridge Masked Generative and Non-autoregressive Diffusion models, as well as generative and discriminative tasks.

MCML Authors

Vincent Tao Hu

Dr.

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Rüdiger Westermann
Computer Graphics & Visualization

[160]

K. Höhlein.
Data-Driven Modeling and Analysis of Numerical Weather Predictions.
Dissertation Jan. 2025. URL

Abstract

Weather prediction systems generate vast numerical simulation datasets that require statistical postprocessing and interactive human exploration. In this thesis, we develop deep-learning-based methods for postprocessing weather predictions and representing the forecasts for subsequent analysis. We use neural networks to enhance the spatial resolution of weather forecasts and postprocess ensemble predictions, and adapt neural networks as compact representations for volumetric ensemble datasets.

MCML Authors

Kevin Höhlein

Dr.

* Former Member

[159]

A. Höhl, I. Obadic, M.-Á. Fernández-Torres, H. Najjar, D. Oliveira, Z. Akata, A. Dengel and X. Zhu.
Opening the Black Box: A systematic review on explainable artificial intelligence in remote sensing.
IEEE Geoscience and Remote Sensing Magazine 12.4 (Dec. 2024). DOI

Abstract

In recent years, black-box machine learning approaches have become a dominant modeling paradigm for knowledge extraction in remote sensing. Despite the potential benefits of uncovering the inner workings of these models with explainable AI, a comprehensive overview summarizing the explainable AI methods used and their objectives, findings, and challenges in remote sensing applications is still missing. In this paper, we address this gap by performing a systematic review to identify the key trends in the field and shed light on novel explainable AI approaches and emerging directions that tackle specific remote sensing challenges. We also reveal the common patterns of explanation interpretation, discuss the extracted scientific insights, and reflect on the approaches used for the evaluation of explainable AI methods. As such, our review provides a complete summary of the state-of-the-art of explainable AI in remote sensing. Further, we give a detailed outlook on the challenges and promising research directions, representing a basis for novel methodological development and a useful starting point for new researchers in the field.

MCML Authors

Adrian Höhl

→ Group Xiaoxiang Zhu
Data Science in Earth Observation

Ivica Obadic

→ Group Xiaoxiang Zhu
Data Science in Earth Observation

Zeynep Akata

Prof. Dr.

Principal Investigator

Data Science in Earth Observation

Xiaoxiang Zhu

Prof. Dr.

Principal Investigator

[158]

O. Hahn, C. Reich, N. Araslanov, D. Cremers, C. Rupprecht and S. Roth.
Scene-Centric Unsupervised Panoptic Segmentation.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI GitHub

Abstract

Unsupervised panoptic segmentation aims to partition an image into semantically meaningful regions and distinct object instances without training on manually annotated data. In contrast to prior work on unsupervised panoptic scene understanding, we eliminate the need for object-centric training data, enabling the unsupervised understanding of complex scenes. To that end, we present the first unsupervised panoptic method that directly trains on scene-centric imagery. In particular, we propose an approach to obtain high-resolution panoptic pseudo labels on complex scene-centric data combining visual representations, depth, and motion cues. Utilizing both pseudo-label training and a panoptic self-training strategy yields a novel approach that accurately predicts panoptic segmentation of complex scenes without requiring any human annotations. Our approach significantly improves panoptic quality, e.g., surpassing the recent state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4% points in PQ.

MCML Authors

Christoph Reich

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Nikita Araslanov

Dr.

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[157]

H. H.-H. Hsu, Y. Shen and D. Cremers.
A Graph Is More Than Its Nodes: Towards Structured Uncertainty-Aware Learning on Graphs.
New Frontiers in Graph Learning @NeurIPS 2022 - Workshop on New Frontiers in Graph Learning at the 36th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL

Abstract

Current graph neural networks (GNNs) that tackle node classification on graphs tend to only focus on nodewise scores and are solely evaluated by nodewise metrics. This limits uncertainty estimation on graphs since nodewise marginals do not fully characterize the joint distribution given the graph structure. In this work, we propose novel edgewise metrics, namely the edgewise expected calibration error (ECE) and the agree/disagree ECEs, which provide criteria for uncertainty estimation on graphs beyond the nodewise setting. Our experiments demonstrate that the proposed edgewise metrics can complement the nodewise results and yield additional insights. Moreover, we show that GNN models which consider the structured prediction problem on graphs tend to have better uncertainty estimations, which illustrates the benefit of going beyond the nodewise setting.
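For readers unfamiliar with the metric family mentioned above, the following is a minimal sketch of the standard nodewise expected calibration error (ECE) with equal-width confidence bins. It is not the edgewise or agree/disagree ECE proposed in the paper. The function name, the binning scheme, and the toy values are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Standard ECE: weighted mean of |confidence - accuracy| over confidence bins.

    confidences: predicted top-class probabilities in [0, 1]
    correct:     1 if the prediction was right, 0 otherwise
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Toy example (made-up values): two over-confident and two under-confident nodes.
toy_confidences = [0.95, 0.95, 0.65, 0.65]
toy_correct = [1, 0, 1, 1]
toy_ece = expected_calibration_error(toy_confidences, toy_correct)
```

An edgewise variant would, under the same assumptions, bin the joint confidence of the two endpoint predictions of each edge instead of single-node confidences.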

MCML Authors

Yuesong Shen

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[156]

L. Härenstam-Nielsen, L. Sang, A. Saroha, N. Araslanov and D. Cremers.
DiffCD: A Symmetric Differentiable Chamfer Distance for Neural Implicit Surface Fitting.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub

Abstract

Neural implicit surfaces can be used to recover accurate 3D geometry from imperfect point clouds. In this work, we show that state-of-the-art techniques work by minimizing an approximation of a one-sided Chamfer distance. This shape metric is not symmetric, as it only ensures that the point cloud is near the surface but not vice versa. As a consequence, existing methods can produce inaccurate reconstructions with spurious surfaces. Although one approach against spurious surfaces has been widely used in the literature, we theoretically and experimentally show that it is equivalent to regularizing the surface area, resulting in over-smoothing. As a more appealing alternative, we propose DiffCD, a novel loss function corresponding to the symmetric Chamfer distance. In contrast to previous work, DiffCD also assures that the surface is near the point cloud, which eliminates spurious surfaces without the need for additional regularization. We experimentally show that DiffCD reliably recovers a high degree of shape detail, substantially outperforming existing work across varying surface complexity and noise levels.
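The asymmetry described above can be illustrated on two small point sets. This is a minimal sketch of the discrete one-sided and symmetric Chamfer distances, not the differentiable DiffCD loss itself; the function names and the toy coordinates are assumptions made for illustration.

```python
import math


def one_sided_chamfer(source, target):
    """Mean distance from each point in `source` to its nearest point in `target`."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)


def symmetric_chamfer(a, b):
    """Sum of both one-sided terms, so neither set may stray far from the other."""
    return one_sided_chamfer(a, b) + one_sided_chamfer(b, a)


# Toy data: the "surface" samples cover the point cloud but add one spurious sample.
point_cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
surface_samples = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]

# The cloud-to-surface term is blind to the spurious sample ...
cloud_to_surface = one_sided_chamfer(point_cloud, surface_samples)
# ... while the surface-to-cloud term penalizes it.
surface_to_cloud = one_sided_chamfer(surface_samples, point_cloud)
both_sides = symmetric_chamfer(point_cloud, surface_samples)
```

In this toy case the one-sided term is zero even though the surface contains a sample far from the data, which is the failure mode the symmetric distance is meant to expose.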

MCML Authors

Linus Härenstam-Nielsen

Lu Sang

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Nikita Araslanov

Dr.

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[155]

H. H.-H. Hsu, Y. Shen, C. Tomani and D. Cremers.
What Makes Graph Neural Networks Miscalibrated?
NeurIPS 2022 - 36th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL

Abstract

Given the importance of getting calibrated predictions and reliable uncertainty estimations, various post-hoc calibration methods have been developed for neural networks on standard multi-class classification tasks. However, these methods are not well suited for calibrating graph neural networks (GNNs), which presents unique challenges such as accounting for the graph structure and the graph-induced correlations between the nodes. In this work, we conduct a systematic study on the calibration qualities of GNN node predictions. In particular, we identify five factors which influence the calibration of GNNs: general under-confident tendency, diversity of nodewise predictive distributions, distance to training nodes, relative confidence level, and neighborhood similarity. Furthermore, based on the insights from this study, we design a novel calibration method named Graph Attention Temperature Scaling (GATS), which is tailored for calibrating graph neural networks. GATS incorporates designs that address all the identified influential factors and produces nodewise temperature scaling using an attention-based architecture. GATS is accuracy-preserving, data-efficient, and expressive at the same time. Our experiments empirically verify the effectiveness of GATS, demonstrating that it can consistently achieve state-of-the-art calibration results on various graph datasets for different GNN backbones.
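As a rough illustration of nodewise temperature scaling, the sketch below divides each node's logits by its own temperature before the softmax. The temperatures here are set by hand; in GATS they are produced by an attention-based model, which this sketch does not reproduce. All names and values are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two nodes and hand-picked per-node temperatures.
node_logits = {"node_a": [2.0, 1.0, 0.0], "node_b": [2.0, 1.0, 0.0]}
node_temperature = {"node_a": 1.0, "node_b": 2.0}

calibrated = {
    node: softmax(logits, node_temperature[node])
    for node, logits in node_logits.items()
}
```

Because dividing by a positive temperature does not change the ordering of the logits, the predicted class stays the same and only the confidence changes, which is why temperature scaling preserves accuracy.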

MCML Authors

Yuesong Shen

Dr.

* Former Member

Christian Tomani

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[154]

Y. Huang, L. Thede, M. Mancini, W. Xu and Z. Akata.
Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study.
GCPR 2025 - German Conference on Pattern Recognition. Freiburg, Germany, Oct 23-26, 2025. To be published. Preprint available. arXiv

Abstract

While Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose significant barriers to practical deployment. Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs), but these methods offer limited flexibility and remain computationally intensive. To address this gap, we propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training. Specifically, we investigate two structural pruning paradigms, layerwise and widthwise pruning, applied to the language model backbone of MLLMs, alongside supervised finetuning and knowledge distillation. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios with limited computational resources or insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels (< 20%). Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved with as little as 5% of the original training data, while retaining over 95% of the original performance. Through empirical study on two representative MLLMs, i.e., LLaVA-v1.5-7B and Bunny-v1.0-3B, this study offers actionable insights for practitioners aiming to compress MLLMs effectively without extensive computation resources or sufficient data.

MCML Authors

Yiran Huang

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[153]

A. Hayler, F. Wimbauer, D. Muhle, C. Rupprecht and D. Cremers.
S4C: Self-Supervised Semantic Scene Completion with Neural Fields.
3DV 2024 - 11th International Conference on 3D Vision. Davos, Switzerland, Mar 18-21, 2024. DOI

Abstract

3D semantic scene understanding is a fundamental challenge in computer vision. It enables mobile agents to autonomously plan and navigate arbitrary environments. Semantic scene completion (SSC) formalizes this challenge as jointly estimating dense geometry and semantic information from sparse observations of a scene. Current methods for SSC are generally trained on 3D ground truth based on aggregated LiDAR scans. This process relies on special sensors and annotation by hand, which are costly and do not scale well. To overcome this issue, our work presents the first self-supervised approach to SSC, called S4C, that does not rely on 3D ground truth data. Our proposed method can reconstruct a scene from a single image and only relies on videos and pseudo segmentation ground truth generated from an off-the-shelf image segmentation network during training. Unlike existing methods, which use discrete voxel grids, we represent scenes as implicit semantic fields. This formulation allows querying any point within the camera frustum for occupancy and semantic class. Our architecture is trained through rendering-based self-supervised losses. Nonetheless, our method achieves performance close to fully supervised state-of-the-art methods. Additionally, our method demonstrates strong generalization capabilities and can synthesize accurate segmentation maps for far away viewpoints.

MCML Authors

Felix Wimbauer

Dominik Muhle

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[152]

L. Härenstam-Nielsen, N. Zeller and D. Cremers.
Semidefinite Relaxations for Robust Multiview Triangulation.
CVPR 2023 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada, Jun 18-23, 2023. DOI

Abstract

We propose an approach based on convex relaxations for certifiably optimal robust multiview triangulation. To this end, we extend existing relaxation approaches to non-robust multiview triangulation by incorporating a truncated least squares cost function. We propose two formulations, one based on epipolar constraints and one based on fractional reprojection constraints. The first is lower dimensional and remains tight under moderate noise and outlier levels, while the second is higher dimensional and therefore slower but remains tight even under extreme noise and outlier levels. We demonstrate through extensive experiments that the proposed approaches allow us to compute provably optimal reconstructions even under significant noise and a large percentage of outliers.

MCML Authors

Linus Härenstam-Nielsen

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[151]

A. Jevtić, C. Reich, F. Wimbauer, O. Hahn, C. Rupprecht, S. Roth and D. Cremers.
Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion.
ICCV 2025 - IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published. Preprint available. arXiv GitHub

Abstract

Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.

MCML Authors

Christoph Reich

Felix Wimbauer

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[150]

J. Kim, S. Alaniz, C. Schmid and Z. Akata.
LoFT: LoRA-fused Training Dataset Generation with Few-shot Guidance.
Preprint (May. 2025). arXiv GitHub

Abstract

Despite recent advances in text-to-image generation, using synthetically generated data seldom brings a significant boost in performance for supervised learning. Oftentimes, synthetic datasets do not faithfully recreate the data distribution of real data, i.e., they lack the fidelity or diversity needed for effective downstream model training. While previous work has employed few-shot guidance to address this issue, existing methods still fail to capture and generate features unique to specific real images. In this paper, we introduce a novel dataset generation framework named LoFT, LoRA-Fused Training-data Generation with Few-shot Guidance. Our method fine-tunes LoRA weights on individual real images and fuses them at inference time, producing synthetic images that combine the features of real images for improved diversity and fidelity of generated data. We evaluate the synthetic data produced by LoFT on 10 datasets, using 8 to 64 real images per class as guidance and scaling up to 1000 images per class. Our experiments show that training on LoFT-generated data consistently outperforms other synthetic dataset methods, significantly increasing accuracy as the dataset size increases. Additionally, our analysis demonstrates that LoFT generates datasets with high fidelity and sufficient diversity, which contribute to the performance improvement.

MCML Authors

Stephan Alaniz

Dr.

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[149]

J. M. Kim, J. Bader, S. Alaniz, C. Schmid and Z. Akata.
DataDream: Few-shot Guided Dataset Generation.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub

Abstract

While text-to-image diffusion models have been shown to achieve state-of-the-art results in image synthesis, they have yet to prove their effectiveness in downstream applications. Previous work has proposed to generate data for image classifier training given limited real data access. However, these methods struggle to generate in-distribution images or depict fine-grained features, thereby hindering the generalization of classification models trained on synthetic datasets. We propose DataDream, a framework for synthesizing classification datasets that more faithfully represents the real data distribution when guided by few-shot examples of the target classes. DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model. We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets. We demonstrate the efficacy of DataDream through extensive experiments, surpassing state-of-the-art classification accuracy with few-shot data across 7 out of 10 datasets, while being competitive on the other 3. Additionally, we provide insights into the impact of various factors, such as the number of real-shot and generated images as well as the fine-tuning compute on model performance.

MCML Authors

Jae Myung Kim

Jessica Bader

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Stephan Alaniz

Dr.

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[148]

S. Klenk, D. Bonello, L. Koestler, N. Araslanov and D. Cremers.
Masked Event Modeling: Self-Supervised Pretraining for Event Cameras.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI

Abstract

Event cameras asynchronously capture brightness changes with low latency, high temporal resolution, and high dynamic range. However, annotation of event data is a costly and laborious process, which limits the use of deep learning methods for classification and other semantic tasks with the event modality. To reduce the dependency on labeled event data, we introduce Masked Event Modeling (MEM), a self-supervised framework for events. Our method pretrains a neural network on unlabeled events, which can originate from any event camera recording. Subsequently, the pretrained model is finetuned on a downstream task, leading to a consistent improvement of the task accuracy. For example, our method reaches state-of-the-art classification accuracy across three datasets, N-ImageNet, N-Cars, and N-Caltech101, increasing the top-1 accuracy of previous work by significant margins. When tested on real-world event data, MEM is even superior to supervised RGB-based pretraining. The models pretrained with MEM are also label-efficient and generalize well to the dense task of semantic image segmentation.

MCML Authors

Simon Klenk

* Former Member

Nikita Araslanov

Dr.

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[147]

C. Koke and D. Cremers.
HoloNets: Spectral Convolutions do extend to Directed Graphs.
ICLR 2024 - 12th International Conference on Learning Representations. Vienna, Austria, May 07-11, 2024. URL

Abstract

Within the graph learning community, conventional wisdom dictates that spectral convolutional networks may only be deployed on undirected graphs: Only there could the existence of a well-defined graph Fourier transform be guaranteed, so that information may be translated between spatial- and spectral domains. Here we show this traditional reliance on the graph Fourier transform to be superfluous and – making use of certain advanced tools from complex analysis and spectral theory – extend spectral convolutions to directed graphs. We provide a frequency-response interpretation of newly developed filters, investigate the influence of the basis used to express filters and discuss the interplay with characteristic operators on which networks are based. In order to thoroughly test the developed theory, we conduct experiments in real world settings, showcasing that directed spectral convolutional networks provide new state of the art results for heterophilic node classification on many datasets and – as opposed to baselines – may be rendered stable to resolution-scale varying topological perturbations.

MCML Authors

Christian Koke

Daniel Cremers

Prof. Dr.

Director

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[146]

S. Karthik, H. Coskun, Z. Akata, S. Tulyakov, J. Ren and A. Kag.
Scalable Ranked Preference Optimization for Text-to-Image Generation.
ICCV 2025 - IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published. Preprint available. arXiv

Abstract

Direct Preference Optimization (DPO) has emerged as a powerful approach to align text-to-image (T2I) models with human feedback. Unfortunately, successful application of DPO to T2I models requires a huge amount of resources to collect and label large-scale datasets, e.g., millions of generated paired images annotated with human preferences. In addition, these human preference datasets can get outdated quickly as the rapid improvements of T2I models lead to higher quality images. In this work, we investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training. Specifically, the preferences for paired images are generated using a pre-trained reward function, eliminating the need for involving humans in the annotation process, greatly improving the dataset collection efficiency. Moreover, we demonstrate that such datasets allow averaging predictions across multiple models and collecting ranked preferences as opposed to pairwise preferences. Furthermore, we introduce RankDPO to enhance DPO-based methods using the ranking feedback. Applying RankDPO on SDXL and SD3-Medium models with our synthetically generated preference dataset ‘Syn-Pic’ improves both prompt-following (on benchmarks like T2I-Compbench, GenEval, and DPG-Bench) and visual quality (through user studies). This pipeline presents a practical and scalable solution to develop better preference datasets to enhance the performance of text-to-image models.

MCML Authors

Shyamgopal Karthik

Zeynep Akata

Prof. Dr.

Principal Investigator

[145]

D. Kotovenko, O. Grebenkova and B. Ommer.
EDGS: Eliminating Densification for Efficient Convergence of 3DGS.
Preprint (Apr. 2025). arXiv

Abstract

3D Gaussian Splatting reconstructs scenes by starting from a sparse Structure-from-Motion initialization and iteratively refining under-reconstructed regions. This process is inherently slow, as it requires multiple densification steps where Gaussians are repeatedly split and adjusted, following a lengthy optimization path. Moreover, this incremental approach often leads to suboptimal renderings, particularly in high-frequency regions where detail is critical. We propose a fundamentally different approach: we eliminate the densification process with a one-step approximation of scene geometry using triangulated pixels from dense image correspondences. This dense initialization allows us to estimate the rough geometry of the scene while preserving rich details from input RGB images, providing each Gaussian with well-informed colors, scales, and positions. As a result, we dramatically shorten the optimization path and remove the need for densification. Unlike traditional methods that rely on sparse keypoints, our dense initialization ensures uniform detail across the scene, even in high-frequency regions where 3DGS and other methods struggle. Moreover, since all splats are initialized in parallel at the start of optimization, we eliminate the need to wait for densification to adjust new Gaussians. Our method not only outperforms speed-optimized models in training efficiency but also achieves higher rendering quality than state-of-the-art approaches, all while using only half the splats of standard 3DGS. It is fully compatible with other 3DGS acceleration techniques, making it a versatile and efficient solution that can be integrated with existing approaches.

MCML Authors

Olga Grebenkova

Björn Ommer

Prof. Dr.

Principal Investigator

[144]

D. Kotovenko, O. Grebenkova, N. Sarafianos, A. Paliwal, P. Ma, O. Poursaeed, S. Mohan, Y. Fan, Y. Li, R. Ranjan and B. Ommer.
WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub

Abstract

While style transfer techniques have been well-developed for 2D image stylization, the extension of these methods to 3D scenes remains relatively unexplored. Existing approaches demonstrate proficiency in transferring colors and textures but often struggle with replicating the geometry of the scenes. In our work, we leverage an explicit Gaussian Splatting (GS) representation and directly match the distributions of Gaussians between style and content scenes using the Earth Mover’s Distance (EMD). By employing the entropy-regularized Wasserstein-2 distance, we ensure that the transformation maintains spatial smoothness. Additionally, we decompose the scene stylization problem into smaller chunks to enhance efficiency. This paradigm shift reframes stylization from a pure generative process driven by latent space losses to an explicit matching of distributions between two Gaussian representations. Our method achieves high-resolution 3D stylization by faithfully transferring details from 3D style scenes onto the content scene. Furthermore, WaSt-3D consistently delivers results across diverse content and style scenes without necessitating any training, as it relies solely on optimization-based techniques.

MCML Authors

Olga Grebenkova

Pingchuan Ma

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[143]

C. Koke and G. Kutyniok.
Graph Scattering beyond Wavelet Shackles.
NeurIPS 2022 - 36th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL

Abstract

This work develops a flexible and mathematically sound framework for the design and analysis of graph scattering networks with variable branching ratios and generic functional calculus filters. Spectrally-agnostic stability guarantees for node- and graph-level perturbations are derived; the vertex-set non-preserving case is treated by utilizing recently developed mathematical-physics based tools. Energy propagation through the network layers is investigated and related to truncation stability. New methods of graph-level feature aggregation are introduced and stability of the resulting composite scattering architectures is established. Finally, scattering transforms are extended to edge- and higher order tensorial input. Theoretical results are complemented by numerical investigations: Suitably chosen scattering networks conforming to the developed theory perform better than traditional graph-wavelet based scattering approaches in social network graph classification tasks and significantly outperform other graph-based learning approaches to regression of quantum-chemical energies on QM7.

MCML Authors

Christian Koke

Gitta Kutyniok

Prof. Dr.

Principal Investigator

Mathematical Foundations of Artificial Intelligence

[142]

S. Klenk, L. Koestler, D. Scaramuzza and D. Cremers.
E-NeRF: Neural Radiance Fields from a Moving Event Camera.
IEEE Robotics and Automation Letters 8.3 (Mar. 2023). DOI

Abstract

Estimating neural radiance fields (NeRFs) from “ideal” images has been extensively studied in the computer vision community. Most approaches assume optimal illumination and slow camera motion. These assumptions are often violated in robotic applications, where images may contain motion blur, and the scene may not have suitable illumination. This can cause significant problems for downstream tasks such as navigation, inspection, or visualization of the scene. To alleviate these problems, we present E-NeRF, the first method which estimates a volumetric scene representation in the form of a NeRF from a fast-moving event camera. Our method can recover NeRFs during very fast motion and in high-dynamic-range conditions where frame-based approaches fail. We show that rendering high-quality frames is possible by only providing an event stream as input. Furthermore, by combining events and frames, we can estimate NeRFs of higher quality than state-of-the-art approaches under severe motion blur. We also show that combining events and frames can overcome failure cases of NeRF estimation in scenarios where only a few input views are available without requiring additional regularization.

MCML Authors

Simon Klenk

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

[141]

O. Kuzyk, Z. Li, M. Pollefeys and X. Wang.
VisualChef: Generating Visual Aids in Cooking via Mask Inpainting.
Preprint (Jun. 2025). arXiv

Abstract

Cooking requires not only following instructions but also understanding, executing, and monitoring each step - a process that can be challenging without visual guidance. Although recipe images and videos offer helpful cues, they often lack consistency in focus, tools, and setup. To better support the cooking process, we introduce VisualChef, a method for generating contextual visual aids tailored to cooking scenarios. Given an initial frame and a specified action, VisualChef generates images depicting both the action’s execution and the resulting appearance of the object, while preserving the initial frame’s environment. Previous work aims to integrate knowledge extracted from large language models by generating detailed textual descriptions to guide image generation, which requires fine-grained visual-textual alignment and involves additional annotations. In contrast, VisualChef simplifies alignment through mask-based visual grounding. Our key insight is identifying action-relevant objects and classifying them to enable targeted modifications that reflect the intended action and outcome while maintaining a consistent environment. In addition, we propose an automated pipeline to extract high-quality initial, action, and final state frames. We evaluate VisualChef quantitatively and qualitatively on three egocentric video datasets and show its improvements over state-of-the-art methods.

MCML Authors

Xi Wang

Dr.

JRG Leader Egocentric Vision

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[140]

S. Klenk, M. Motzet, L. Koestler and D. Cremers.
Deep Event Visual Odometry.
3DV 2024 - 11th International Conference on 3D Vision. Davos, Switzerland, Mar 18-21, 2024. DOI

Abstract

Event cameras offer the exciting possibility of tracking the camera’s pose during high-speed motion and in adverse lighting conditions. Despite this promise, existing event-based monocular visual odometry (VO) approaches demonstrate limited performance on recent benchmarks. To address this limitation, some methods resort to additional sensors such as IMUs, stereo event cameras, or frame-based cameras. Nonetheless, these additional sensors limit the application of event cameras in real-world devices since they increase cost and complicate system requirements. Moreover, relying on a frame-based camera makes the system susceptible to motion blur and HDR. To remove the dependency on additional sensors and to push the limits of using only a single event camera, we present Deep Event VO (DEVO), the first monocular event-only system with strong performance on a large number of real-world benchmarks. DEVO sparsely tracks selected event patches over time. A key component of DEVO is a novel deep patch selection mechanism tailored to event data. We significantly decrease the state-of-the-art pose tracking error on seven real-world benchmarks by up to 97% compared to event-only methods and often surpass or are close to stereo or inertial methods.

MCML Authors

Simon Klenk

* Former Member

Daniel Cremers

Prof. Dr.

Director

[139]

D. Kotovenko, P. Ma, T. Milbich and B. Ommer.
Cross-Image-Attention for Conditional Embeddings in Deep Metric Learning.
CVPR 2023 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada, Jun 18-23, 2023. DOI

Abstract

Learning compact image embeddings that yield semantic similarities between images and that generalize to unseen test classes is at the core of deep metric learning (DML). Finding a mapping from a rich, localized image feature map onto a compact embedding vector is challenging: Although similarity emerges between tuples of images, DML approaches marginalize out information in an individual image before considering another image to which similarity is to be computed. Instead, we propose during training to condition the embedding of an image on the image we want to compare it to. Rather than embedding by a simple pooling as in standard DML, we use cross-attention so that one image can identify relevant features in the other image. Consequently, the attention mechanism establishes a hierarchy of conditional embeddings that gradually incorporates information about the tuple to steer the representation of an individual image. The cross-attention layers bridge the gap between the original unconditional embedding and the final similarity and allow backpropagation to update encodings more directly than through a lossy pooling layer. At test time we use the resulting improved unconditional embeddings, thus requiring no additional parameters or computational overhead. Experiments on established DML benchmarks show that our cross-attention conditional embedding during training improves the underlying standard DML pipeline significantly so that it outperforms the state-of-the-art.

MCML Authors

Pingchuan Ma

Björn Ommer

Prof. Dr.

Principal Investigator

Computer Graphics & Visualization

[138]

F. Köhler, S. Niedermayr, R. Westermann and N. Thuerey.
APEBench: A Benchmark for Autoregressive Neural Emulators of PDEs.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub

Abstract

We introduce the Autoregressive PDE Emulator Benchmark (APEBench), a comprehensive benchmark suite to evaluate autoregressive neural emulators for solving partial differential equations. APEBench is based on JAX and provides a seamlessly integrated differentiable simulation framework employing efficient pseudo-spectral methods, enabling 46 distinct PDEs across 1D, 2D, and 3D. Facilitating systematic analysis and comparison of learned emulators, we propose a novel taxonomy for unrolled training and introduce a unique identifier for PDE dynamics that directly relates to the stability criteria of classical numerical methods. APEBench enables the evaluation of diverse neural architectures, and unlike existing benchmarks, its tight integration of the solver enables support for differentiable physics training and neural-hybrid emulators. Moreover, APEBench emphasizes rollout metrics to understand temporal generalization, providing insights into the long-term behavior of emulating PDE dynamics. In several experiments, we highlight the similarities between neural emulators and numerical simulators.
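
As an illustration of the rollout metrics mentioned in the abstract, the following minimal Python sketch unrolls an emulator autoregressively and records its relative error against a reference trajectory at every step. It does not use APEBench's API: `rollout_errors`, the toy decay "simulator", and the slightly biased "emulator" are hypothetical names chosen only for this example.

```python
import math

def rollout_errors(step_fn, reference, u0):
    """Unroll `step_fn` autoregressively from `u0` and return the
    relative L2 error against `reference` at every time step."""
    errors = []
    u = list(u0)
    for ref in reference:
        u = step_fn(u)  # the emulator is fed its own previous output
        num = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, ref)))
        den = math.sqrt(sum(b ** 2 for b in ref))
        errors.append(num / den)
    return errors

# Hypothetical stand-ins: exact decay vs. an emulator with a small bias.
def simulator(u):
    return [0.9 * x for x in u]

def emulator(u):
    return [0.91 * x for x in u]

u0 = [1.0, -2.0, 0.5]
reference = []
state = u0
for _ in range(10):
    state = simulator(state)
    reference.append(state)

errors = rollout_errors(emulator, reference, u0)
```

In this toy setting the one-step error is small, but it compounds over the rollout, which is the kind of long-horizon behavior a single-step metric would not reveal.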

MCML Authors

Rüdiger Westermann

Prof. Dr.

Principal Investigator

Nils Thuerey

Prof. Dr.

Principal Investigator

[137]

F. Krause, T. Phan, M. Gui, S. A. Baumann, V. T. Hu and B. Ommer.
TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training.
ICCV 2025 - IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published. Preprint available. arXiv

Abstract

Diffusion models have emerged as the mainstream approach for visual generation. However, these models typically suffer from sample inefficiency and high training costs. Consequently, methods for efficient finetuning, inference and personalization were quickly adopted by the community. However, training these models in the first place remains very costly. While several recent approaches - including masking, distillation, and architectural modifications - have been proposed to improve training efficiency, each of these methods comes with a tradeoff: they achieve enhanced performance at the expense of increased computational cost or vice versa. In contrast, this work aims to improve training efficiency as well as generative performance at the same time through routes that act as a transport mechanism for randomly selected tokens from early layers to deeper layers of the model. Our method is not limited to the common transformer-based model - it can also be applied to state-space models and achieves this without architectural modifications or additional parameters. Finally, we show that TREAD reduces computational cost and simultaneously boosts model performance on the standard ImageNet-256 benchmark in class-conditional synthesis. Both of these benefits multiply to a convergence speedup of 14x at 400K training iterations compared to DiT and 37x compared to the best benchmark performance of DiT at 7M training iterations. Furthermore, we achieve a competitive FID of 2.09 in a guided and 3.93 in an unguided setting, which improves upon the DiT, without architectural changes.
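
One possible reading of the routing idea in the abstract can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: the function `forward_with_route`, its parameters, and the per-token "layers" are assumptions made for this example (real diffusion backbones mix tokens through attention or state-space blocks).

```python
import random

def forward_with_route(tokens, layers, start, end, route_fraction, rng):
    """Apply `layers` to `tokens`; a random subset of tokens skips
    layers[start:end] and rejoins the sequence afterwards."""
    n = len(tokens)
    routed = set(rng.sample(range(n), int(n * route_fraction)))
    x = list(tokens)
    for i, layer in enumerate(layers):
        if start <= i < end:
            # Inside the route, only non-routed tokens are processed.
            active = [j for j in range(n) if j not in routed]
        else:
            active = list(range(n))
        for j in active:
            x[j] = layer(x[j])
    return x, routed

# Toy setup: four "layers" that each add 1 to a scalar token.
layers = [lambda v: v + 1 for _ in range(4)]
tokens = [0.0] * 8
out, routed = forward_with_route(tokens, layers, 1, 3, 0.5, random.Random(0))
```

Here the routed tokens pass through two layers instead of four, so the layers inside the route process only half of the sequence, while no parameters are added and the layers themselves are unchanged.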

MCML Authors

Felix Krause

Vincent Tao Hu

Dr.

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[136]

D. Komorowicz, L. Sang, F. Maiwald and D. Cremers.
Coloring the Past: Neural Historical Monuments Reconstruction from Archival Photography.
Wild3D @ECCV 2024 - Workshop 3D Modeling, Reconstruction, and Generation in the Wild at the 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. URL

Abstract

Historical monuments are a treasure and milestone of cultural heritage. Reconstructing the 3D models of these buildings holds significant value. The rapid development of neural rendering methods makes it possible to recover the original 3D shape exclusively based on archival photographs. However, this task presents considerable challenges due to the properties of available color images. Historical pictures are often limited in number and the scenes in these photos might have altered over time. The radiometric quality of these images is often sub-optimal for using automatic methods. To address these challenges, we introduce an approach to reconstruct the geometry of historical buildings from limited input images. We leverage dense point clouds as a geometric prior and introduce a color appearance embedding loss in volumetric rendering to recover the color of the building. We aim for our work to spark increased interest and focus on preserving historic buildings. Together with the proposed method, we introduce a new historical dataset of the Hungarian National Theater, providing a new benchmark for 3D reconstruction.

MCML Authors

Lu Sang

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[135]

D. Komorowicz, L. Sang, F. Maiwald and D. Cremers.
Coloring the Past: Neural Historical Monuments Reconstruction from Archival Photography.
GCPR 2024 - German Conference on Pattern Recognition. Munich, Germany, Oct 10-13, 2024. DOI GitHub

MCML Authors

Lu Sang

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[134]

Q. Khan, I. Sülö, M. Öcal and D. Cremers.
Learning vision based autonomous lateral vehicle control without supervision.
Applied Intelligence 53 (Mar. 2023). DOI GitHub

Abstract

Supervised deep learning methods using image data as input have shown promising results in the context of vehicle control. However, these supervised methods have two main disadvantages: 1) They require a copious amount of labeled training data, which is difficult and expensive to collect. 2) Such models do not perform well when they encounter situations that are not in the distribution of the training set. This includes deviations from the designated driving behavior. We therefore provide a framework to mitigate these problems from merely an unlabeled sequence of images. Visual Odometry is first used to determine the vehicle trajectory. Model Predictive Control (MPC) then uses this trajectory to implicitly infer the steering labels. Meanwhile, synthesized images at deviated trajectories are included in the training distribution for enhanced robustness of the neural network model. Experimental results demonstrate that the performance of our network is on par with methods requiring additional data collection or supervision.

MCML Authors

Qadeer Khan

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[133]

C. Koke, Y. Shen, A. Saroha, M. Eisenberger, B. Rieck, M. M. Bronstein and D. Cremers.
Graph Networks struggle with variable Scale.
ICBINB @ICLR 2025 - Workshop I Can’t Believe It’s Not Better: Challenges in Applied Deep Learning at the 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL

Abstract

Standard graph neural networks assign vastly different latent embeddings to graphs describing the same object at different resolution scales. This precludes consistency in applications and prevents generalization between scales as would fundamentally be needed e.g. in AI4Science. We uncover the underlying obstruction, investigate its origin and show how to overcome it by modifying the message passing paradigm.

MCML Authors

Christian Koke

Yuesong Shen

Dr.

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[132]

C. Koke, Y. Shen, A. Saroha, M. Eisenberger, B. Rieck, M. M. Bronstein and D. Cremers.
On Incorporating Scale into Graph Networks.
MLMP @ICLR 2025 - Workshop on Machine Learning Multiscale Processes at the 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. Best Paper Award. URL

Abstract

Standard graph neural networks assign vastly different latent embeddings to graphs describing the same physical system at different resolution scales. This precludes consistency in applications and prevents generalization between scales as would fundamentally be needed in many scientific applications. We uncover the underlying obstruction, investigate its origin and show how to overcome it.

MCML Authors

Christian Koke

Yuesong Shen

Dr.

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[131]

C. Koke, D. Schnaus, Y. Shen, A. Saroha, M. Eisenberger, B. Rieck, M. M. Bronstein and D. Cremers.
On multi-scale Graph Representation Learning.
LMRL @ICLR 2025 - Workshop on Learning Meaningful Representations of Life at the 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL

Abstract

While Graph Neural Networks (GNNs) are widely used in modern computational biology, an underexplored drawback of common GNN methods is that they are not inherently multiscale consistent: Two graphs describing the same object or situation at different resolution scales are assigned vastly different latent representations. This prevents graph networks from generating data representations that are consistent across scales. It also complicates the integration of representations at the molecular scale with those generated at the biological scale. Here we discuss why existing GNNs struggle with multiscale consistency and show how to overcome this problem by modifying the message passing paradigm within GNNs.

MCML Authors

Christian Koke

Dominik Schnaus

→ Group Xi Wang
Computer Vision & Artificial Intelligence

Yuesong Shen

Dr.

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Abhishek Saroha

→ Group Xi Wang
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[130]

S. Kim, R. Xiao, M.-I. Georgescu, S. Alaniz and Z. Akata.
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.

MCML Authors

Sanghwan Kim

Rui Xiao

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Iuliana Georgescu

Dr.

* Former Member

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Stephan Alaniz

Dr.

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[129]

S. Laina, S. Boche, S. Papatheodorou, S. Schaefer, J. Jung and S. Leutenegger.
FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment.
Preprint (Sep. 2022). arXiv

Abstract

Geometrically accurate and semantically expressive map representations have proven invaluable to facilitate robust and safe mobile robot navigation and task planning. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments is still an open problem. In this paper we present FindAnything, an open-world mapping and exploration framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything bridges the gap between pure geometric and open-vocabulary semantic information for a higher level of understanding while enabling exploration of any environment without the help of any external source of ground-truth pose information. We represent the environment as a series of volumetric occupancy submaps, resulting in a robust and accurate map representation that deforms upon pose updates when the underlying SLAM system corrects its drift, allowing for a locally consistent representation between submaps. Pixel-wise vision-language features are aggregated from efficient SAM (eSAM)-generated segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is scalable also in terms of memory usage. The open-vocabulary map representation of FindAnything achieves state-of-the-art semantic accuracy in closed-set evaluations on the Replica dataset. This level of scene understanding allows a robot to explore environments based on objects or areas of interest selected via natural language queries. Our system is the first of its kind to be deployed on resource-constrained devices, such as MAVs, leveraging vision-language information for real-world robotic tasks.

MCML Authors

Sotiris Papatheodorou

* Former Member

→ Group Stefan Leutenegger

* Former Principal Investigator

Simon Schaefer

Stefan Leutenegger

Prof. Dr.

Principal Investigator

* Former Principal Investigator

[128]

W. Li, W. Chen, S. Qian, J. Chen, D. Cremers and H. Li.
DynSUP: Dynamic Gaussian Splatting from An Unposed Image Pair.
ICVSS 2025 - International Computer Vision Summer School: Computer Vision for Spatial Intelligence. Sicily, Italy, Jul 06-12, 2025. To be published. Preprint available. arXiv GitHub

Abstract

Recent advances in 3D Gaussian Splatting have shown promising results. Existing methods typically assume static scenes and/or multiple images with prior poses. Dynamics, sparse views, and unknown poses significantly increase the problem complexity due to insufficient geometric constraints. To overcome this challenge, we propose a method that can use only two images without prior poses to fit Gaussians in dynamic environments. To achieve this, we introduce two technical contributions. First, we propose an object-level two-view bundle adjustment. This strategy decomposes dynamic scenes into piece-wise rigid components, and jointly estimates the camera pose and motions of dynamic objects. Second, we design an SE(3) field-driven Gaussian training method. It enables fine-grained motion modeling through learnable per-Gaussian transformations. Our method leads to high-fidelity novel view synthesis of dynamic scenes while accurately preserving temporal consistency and object motion. Experiments on both synthetic and real-world datasets demonstrate that our method significantly outperforms state-of-the-art approaches designed for the cases of static environments, multiple images, and/or known poses.

MCML Authors

Weihang Li

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Weirong Chen

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Shenhan Qian

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

[127]

J. Liu, I. Chiotellis, R. Triebel and D. Cremers.
Effective Version Space Reduction for Convolutional Neural Networks.
ECML-PKDD 2021 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Virtual, Sep 13-17, 2021. DOI

Abstract

In active learning, sampling bias could pose a serious inconsistency problem and hinder the algorithm from finding the optimal hypothesis. However, many methods for neural networks are hypothesis space agnostic and do not address this problem. We examine active learning with convolutional neural networks through the principled lens of version space reduction. We identify the connection between two approaches – prior mass reduction and diameter reduction – and propose a new diameter-based querying method – the minimum Gibbs-vote disagreement. By estimating version space diameter and bias, we illustrate how version space of neural networks evolves and examine the realizability assumption. With experiments on MNIST, Fashion-MNIST, SVHN and STL-10 datasets, we demonstrate that diameter reduction methods reduce the version space more effectively and perform better than prior mass reduction and other baselines, and that the Gibbs vote disagreement is on par with the best query method.

MCML Authors

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[126]

H. Li, J. Dong, B. Wen, M. Gao, T. Huang, Y.-H. Liu and D. Cremers.
DDIT: Semantic Scene Completion via Deformable Deep Implicit Templates.
ICCV 2023 - IEEE/CVF International Conference on Computer Vision. Paris, France, Oct 02-06, 2023. DOI

Abstract

Scene reconstructions are often incomplete due to occlusions and limited viewpoints. There have been efforts to use semantic information for scene completion. However, the completed shapes may be rough and imprecise since respective methods rely on 3D convolution and/or lack effective shape constraints. To overcome these limitations, we propose a semantic scene completion method based on deformable deep implicit templates (DDIT). Specifically, we complete each segmented instance in a scene by deforming a template with a latent code. Such a template is expressed by a deep implicit function in the canonical frame. It abstracts the shape prior of a category, and thus can provide constraints on the overall shape of an instance. The latent code controls the deformation of the template to guarantee fine details of an instance. For code prediction, we design a neural network that leverages both intra- and inter-instance information. We also introduce an algorithm to transform instances between the world and canonical frames based on geometric constraints and a hierarchical tree. To further improve accuracy, we jointly optimize the latent code and transformation by enforcing the zero-valued isosurface constraint. In addition, we establish a new dataset to solve different problems of existing datasets. Experiments showed that our DDIT outperforms state-of-the-art approaches.

MCML Authors

Haoang Li

Dr.

* Former Member

Maolin Gao

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[125]

Y.-J. Li, M. Gladkova, Y. Xia and D. Cremers.
SADG: Segment Any Dynamic Gaussian Without Object Trackers.
Preprint (Nov. 2024). arXiv

Abstract

Understanding dynamic 3D scenes is fundamental for various applications, including extended reality (XR) and autonomous driving. Effectively integrating semantic information into 3D reconstruction enables holistic representation that opens opportunities for immersive and interactive applications. We introduce SADG, Segment Any Dynamic Gaussian Without Object Trackers, a novel approach that combines dynamic Gaussian Splatting representation and semantic information without reliance on object IDs. In contrast to existing works, we do not rely on supervision based on object identities to enable consistent segmentation of dynamic 3D objects. To this end, we propose to learn semantically-aware features by leveraging masks generated from the Segment Anything Model (SAM) and utilizing our novel contrastive learning objective based on hard pixel mining. The learned Gaussian features can be effectively clustered without further post-processing. This enables fast computation for further object-level editing, such as object removal, composition, and style transfer by manipulating the Gaussians in the scene. We further extend several dynamic novel-view datasets with segmentation benchmarks to enable testing of learned feature fields from unseen viewpoints. We evaluate SADG on proposed benchmarks and demonstrate the superior performance of our approach in segmenting objects within dynamic scenes along with its effectiveness for further downstream editing tasks.

MCML Authors

Mariia Gladkova

Yan Xia

Dr.

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[124]

Y.-J. Li, M. Gladkova, Y. Xia, R. Wang and D. Cremers.
VXP: Voxel-Cross-Pixel Large-Scale Camera-LiDAR Place Recognition.
3DV 2025 - 12th International Conference on 3D Vision. Singapore, Mar 25-28, 2025. DOI

Abstract

Recent works on global place recognition treat the task as a retrieval problem, where an off-the-shelf global descriptor is commonly designed in image-based and LiDAR-based modalities. However, it is non-trivial to perform accurate image-LiDAR global place recognition since extracting consistent and robust global descriptors from different domains (2D images and 3D point clouds) is challenging. To address this issue, we propose a novel Voxel-Cross-Pixel (VXP) approach, which establishes voxel and pixel correspondences in a self-supervised manner and brings them into a shared feature space. Specifically, VXP is trained in a two-stage manner that first explicitly exploits local feature correspondences and then enforces similarity of global descriptors. Extensive experiments on three benchmarks (Oxford RobotCar, ViViD++ and KITTI) demonstrate that our method surpasses the state-of-the-art cross-modal retrieval by a large margin.

MCML Authors

Mariia Gladkova

Yan Xia

Dr.

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Volker Tresp
Database Systems, Data Mining and AI

[123]

R. Liao, G. Huang, Q. Cheng, T. Seidl, D. Cremers and V. Tresp.
When and Where do Events Switch in Multi-Event Video Generation?
LongVid-Foundations @ICCV 2025 - 1st Workshop on Long Multi-Scene Video Foundations: Generation, Understanding and Evaluation at the IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published. Preprint available. arXiv

MCML Authors

Ruotong Liao

Qing Cheng

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Thomas Seidl

Prof. Dr.

Director

Daniel Cremers

Prof. Dr.

Director

Volker Tresp

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[122]

Y. Luo, R. Hoffmann, Y. Xia, O. Wysocki, B. Schwab, T. H. Kolbe and D. Cremers.
RADLER: Radar Object Detection Leveraging Semantic 3D City Models and Self-Supervised Radar-Image Learning.
PBVS @CVPR 2025 - 21st IEEE Workshop on Perception Beyond the Visible Spectrum at IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI GitHub

Abstract

Semantic 3D city models are easily accessible worldwide, providing accurate, object-oriented, and semantic-rich 3D priors. To date, their potential to mitigate the noise impact on radar object detection remains under-explored. In this paper, we first introduce a unique dataset, RadarCity, comprising 54K synchronized radar-image pairs and semantic 3D city models. Moreover, we propose a novel neural network, RADLER, leveraging the effectiveness of contrastive self-supervised learning (SSL) and semantic 3D city models to enhance radar object detection of pedestrians, cyclists, and cars. Specifically, we first obtain the robust radar features via an SSL network in the radar-image pretext task. We then use a simple yet effective feature fusion strategy to incorporate semantic-depth features from semantic 3D city models. Having prior 3D information as guidance, RADLER obtains more fine-grained details to enhance radar object detection. We extensively evaluate RADLER on the collected RadarCity dataset and demonstrate average improvements of 5.46% in mean average precision (mAP) and 3.51% in mean average recall (mAR) over previous radar object detection methods. We believe this work will foster further research on semantic-guided and map-supported radar object detection.

MCML Authors

Yan Xia

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Director

→ Group Volker Tresp
Database Systems, Data Mining and AI

[121]

H. Li, Q. Khan, V. Tresp and D. Cremers.
Biologically Inspired Neural Path Finding.
BI 2022 - 15th International Conference on Brain Informatics. Padova, Italy, Jul 15-17, 2022. DOI GitHub

Abstract

The human brain can be considered to be a graphical structure comprising tens of billions of biological neurons connected by synapses. It has the remarkable ability to automatically re-route information flow through alternate paths in case some neurons are damaged. Moreover, the brain is capable of retaining information and applying it to similar but completely unseen scenarios. In this paper, we take inspiration from these attributes of the brain to develop a computational framework to find the optimal low-cost path between a source node and a destination node in a generalized graph. We show that our framework is capable of handling unseen graphs at test time. Moreover, it can find alternate optimal paths when nodes are arbitrarily added or removed during inference, while maintaining a fixed prediction time.
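
For reference, the classical (non-learned) way to obtain the lowest-cost path between a source and a destination node is Dijkstra's algorithm. The Python sketch below shows this standard baseline, not the paper's neural framework; the example graph and the helper `remove_node` are made up for illustration and mimic the "node removed at inference" scenario by recomputing the path from scratch.

```python
import heapq

def shortest_path(graph, source, target):
    """Dijkstra's algorithm on a dict-of-dicts graph with non-negative
    edge costs. Returns (cost, path), or (inf, []) if unreachable."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    # Walk the predecessor links back from the target.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

def remove_node(graph, node):
    """Return a copy of `graph` without `node` and its incident edges."""
    return {u: {v: w for v, w in nbrs.items() if v != node}
            for u, nbrs in graph.items() if u != node}

graph = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"C": 1.0, "D": 5.0},
    "C": {"D": 1.0},
    "D": {},
}
cost, path = shortest_path(graph, "A", "D")
cost_alt, path_alt = shortest_path(remove_node(graph, "B"), "A", "D")
```

Unlike the fixed prediction time described in the abstract, the running time of this baseline grows with the size of the graph.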

MCML Authors

Hang Li

* Former Member

Qadeer Khan

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Volker Tresp

Prof. Dr.

Principal Investigator

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[120]

A. Liebeskind, J. R. Schüre, M. S. Fabian, S. Weinmüller, P. Schünke, V. Golkov, D. Cremers and M. Zaiss.
The Pulseq-CEST Library: definition of preparations and simulations, example data, and example evaluations.
Magnetic Resonance Materials in Physics, Biology and Medicine (Mar. 2025). DOI

Abstract

Objectives: Despite prevalent use of chemical exchange saturation transfer (CEST) MRI, standardization remains elusive. Imaging depends heavily on parameters dictating radiofrequency (RF) events, gradients, and apparent diffusion coefficient (ADC). We present the Pulseq-CEST Library, a repository of CEST preparation and simulation definitions, including example data and evaluations, that provides a common basis for reproducible research, rapid prototyping, and in silico deep learning training data generation.
Materials and methods: A Pulseq-CEST experiment requires (i) a CEST preparation sequence, (ii) a Bloch–McConnell parameter set, (iii) a Bloch–McConnell simulation, and (iv) an evaluation script. Pulseq-CEST utilizes the Bloch–McConnell equations to model in vitro and in vivo conditions. Using this model, a candidate sequence or environment can be held constant while varying other inputs, enabling robust testing.
Results: Data were compared for amide proton transfer weighted (APTw) and water shift and B1 (WASABI) protocols using a five-tube phantom and simulated environments. Real and simulated data matched anticipated spectral shapes and local peak characteristics. The Pulseq-CEST Library supports similar experiments with common sequences and environments to assess new protocols and sample data.
Discussion: The Pulseq-CEST Library provides a flexible mechanism for standardizing and prototyping CEST sequences, facilitating collaborative development. With the capability for expansion, including open-source incorporation of new sequences and environments, the library accelerates the invention and spread of novel CEST and other saturation transfer approaches, such as relayed NOEs (rNOEs) and semisolid magnetization transfer contrast (MTC) methods.

MCML Authors

Alexander Liebeskind

* Former Member

Vladimir Golkov

Dr.

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

[119]

B. Liao, Z. Zhao, L. Chen, H. Li, D. Cremers and P. Liu.
GlobalPointer: Large-Scale Plane Adjustment with Bi-Convex Relaxation.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub

Abstract

Plane adjustment (PA) is crucial for many 3D applications, involving simultaneous pose estimation and plane recovery. Despite recent advancements, it remains a challenging problem in the realm of multi-view point cloud registration. Current state-of-the-art methods can achieve globally optimal convergence only with good initialization. Furthermore, their high time complexity renders them impractical for large-scale problems. To address these challenges, we first exploit a novel optimization strategy termed Bi-Convex Relaxation, which decouples the original problem into two simpler sub-problems, reformulates each sub-problem using a convex relaxation technique, and alternately solves each one until the original problem converges. Building on this strategy, we propose two algorithmic variants for solving the plane adjustment problem, namely GlobalPointer and GlobalPointer++, based on point-to-plane and plane-to-plane errors, respectively. Extensive experiments on both synthetic and real datasets demonstrate that our method can perform large-scale plane adjustment with linear time complexity, larger convergence region, and robustness to poor initialization, while achieving similar accuracy as prior methods.

MCML Authors

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[118]

M. de Mayo, D. Cremers and T. Pire.
The Monado SLAM Dataset for Egocentric Visual-Inertial Tracking.
Preprint (Aug. 2025). arXiv

Abstract

Humanoid robots and mixed reality headsets benefit from the use of head-mounted sensors for tracking. While advancements in visual-inertial odometry (VIO) and simultaneous localization and mapping (SLAM) have produced new and high-quality state-of-the-art tracking systems, we show that these are still unable to gracefully handle many of the challenging settings presented in the head-mounted use cases. Common scenarios like high-intensity motions, dynamic occlusions, long tracking sessions, low-textured areas, adverse lighting conditions, saturation of sensors, to name a few, continue to be covered poorly by existing datasets in the literature. In this way, systems may inadvertently overlook these essential real-world issues. To address this, we present the Monado SLAM dataset, a set of real sequences taken from multiple virtual reality headsets. We release the dataset under a permissive CC BY 4.0 license, to drive advancements in VIO/SLAM research and development.

MCML Authors

Mateo de Mayo

Daniel Cremers

Prof. Dr.

Director

[117]

P. Ma, M. Gui, J. Schusterbauer, X. Yang, O. Grebenkova, V. T. Hu and B. Ommer.
Stochastic Interpolants for Revealing Stylistic Flows across the History of Art.
ICCV 2025 - IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published.

Abstract

tba

MCML Authors

Pingchuan Ma

Johannes Schusterbauer

Olga Grebenkova

Vincent Tao Hu

Dr.

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[116]

P. Müller, V. Golkov, V. Tomassini and D. Cremers.
Rotation-Equivariant Deep Learning for Diffusion MRI (short version).
ISMRM 2021 - International Society for Magnetic Resonance in Medicine Annual Meeting. Virtual, May 15-20, 2021. Long version in arXiv. arXiv

Abstract

Convolutional networks are successful, but they have recently been outperformed by new neural networks that are equivariant under rotations and translations. These new networks work better because they do not struggle with learning each possible orientation of each image feature separately. So far, they have been proposed for 2D and 3D data. Here we generalize them to 6D diffusion MRI data, ensuring joint equivariance under 3D roto-translations in image space and the matching 3D rotations in q-space, as dictated by the image formation. Such equivariant deep learning is appropriate for diffusion MRI, because microstructural and macrostructural features such as neural fibers can appear at many different orientations, and because even non-rotation-equivariant deep learning has so far been the best method for many diffusion MRI tasks. We validate our equivariant method on multiple-sclerosis lesion segmentation. Our proposed neural networks yield better results and require fewer scans for training compared to non-rotation-equivariant deep learning. They also inherit all the advantages of deep learning over classical diffusion MRI methods. Our implementation is available online and can be used off the shelf without understanding the mathematical background.

MCML Authors

Vladimir Golkov

Dr.

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[115]

M. Mahajan, F. Hofherr and D. Cremers.
MeshFeat: Multi-Resolution Features for Neural Fields on Meshes.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI

Abstract

Parametric feature grid encodings have gained significant attention as an encoding approach for neural fields since they allow for much smaller MLPs, which significantly decreases the inference time of the models. In this work, we propose MeshFeat, a parametric feature encoding tailored to meshes, for which we adapt the idea of multi-resolution feature grids from Euclidean space. We start from the structure provided by the given vertex topology and use a mesh simplification algorithm to construct a multi-resolution feature representation directly on the mesh. The approach allows the usage of small MLPs for neural fields on meshes, and we show a significant speed-up compared to previous representations while maintaining comparable reconstruction quality for texture reconstruction and BRDF representation. Given its intrinsic coupling to the vertices, the method is particularly well-suited for representations on deforming meshes, making it a good fit for object animation.

MCML Authors

Florian Hofherr

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[114]

J. Meier, L. Inchingolo, O. Dhaouadi, Y. Xia, J. Kaiser and D. Cremers.
MonoCT: Overcoming Monocular 3D Detection Domain Shift with Consistent Teacher Models.
ICRA 2025 - IEEE International Conference on Robotics and Automation. Atlanta, GA, USA, May 19-23, 2025. DOI

Abstract

We tackle the problem of monocular 3D object detection across different sensors, environments, and camera setups. In this paper, we introduce a novel unsupervised domain adaptation approach, MonoCT, that generates highly accurate pseudo labels for self-supervision. Inspired by our observation that accurate depth estimation is critical to mitigating domain shifts, MonoCT introduces a novel Generalized Depth Enhancement (GDE) module with an ensemble concept to improve depth estimation accuracy. Moreover, we introduce a novel Pseudo Label Scoring (PLS) module by exploring inner-model consistency measurement and a Diversity Maximization (DM) strategy to further generate high-quality pseudo labels for self-training. Extensive experiments on six benchmarks show that MonoCT outperforms existing SOTA domain adaptation methods by large margins (~21% minimum for AP Mod.) and generalizes well to car, traffic camera and drone views.

MCML Authors

Johannes Meier

Yan Xia

Dr.

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[113]

J. Meier, L. Inchingolo, O. Dhaouadi, Y. Xia, J. Kaiser and D. Cremers.
MonoCT: Overcoming Monocular 3D Detection Domain Shift with Consistent Teacher Models.
IV 2025 - 36th IEEE Intelligent Vehicles Symposium. Cluj-Napoca, Romania, Jun 22-25, 2025. To be published. Preprint available. arXiv

Abstract

MCML Authors

Johannes Meier

Yan Xia

Dr.

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[112]

F. Müller, Q. Khan and D. Cremers.
Lateral Ego-Vehicle Control Without Supervision Using Point Clouds.
ICPRAI 2022 - 3rd International Conference on Pattern Recognition and Artificial Intelligence. Paris, France, Jun 01-03, 2022. DOI

Abstract

Existing vision-based supervised approaches to lateral vehicle control are capable of directly mapping RGB images to the appropriate steering commands. However, they are prone to inadequate robustness in real-world scenarios due to a lack of failure cases in the training data. In this paper, a framework for training a more robust and scalable model for lateral vehicle control is proposed. The framework only requires an unlabeled sequence of RGB images. The trained model takes a point cloud as input and predicts the lateral offset to a subsequent frame, from which the steering angle is inferred. The frame poses are in turn obtained from visual odometry. The point cloud is obtained by projecting dense depth maps into 3D. To increase the robustness of the model, an arbitrary number of additional trajectories can be generated from this point cloud during training. Online experiments conducted on a driving simulator show that the performance of our model is superior to that of a supervised model trained on the same initial data set and comparable to that of the same model trained on data collected with noise injection.
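The depth-to-point-cloud step mentioned in the abstract can be illustrated with a standard pinhole back-projection. The following is a minimal sketch, not the authors' implementation: the function name `backproject_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy depth map are all assumptions made for illustration.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def backproject_depth(
    depth: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Lift every pixel with positive depth to a 3D point in the camera frame.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), all in pixels; depth values are along the optical axis.
    """
    points: List[Point3D] = []
    for v, row in enumerate(depth):      # v: pixel row (image y)
        for u, z in enumerate(row):      # u: pixel column (image x)
            if z <= 0.0:                 # skip invalid or missing depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Hypothetical 2x2 depth map (metres) with one invalid pixel.
toy_depth = [[2.0, 2.0], [0.0, 4.0]]
cloud = backproject_depth(toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

With such a point cloud in hand, views from laterally shifted camera poses could in principle be synthesized by transforming the points, which is one plausible reading of how additional trajectories are generated during training.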

MCML Authors

Qadeer Khan

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[111]

Y. Ma, Q. Khan and D. Cremers.
MA-DV2F: A Multi-Agent Navigation Framework Using Dynamic Velocity Vector Field.
IEEE Robotics and Automation Letters 10.6 (Jun. 2025). DOI GitHub

Abstract

In this paper, we propose MA-DV2F: Multi-Agent Dynamic Velocity Vector Field. It is a framework for simultaneously controlling a group of vehicles in challenging environments. DV2F is generated for each vehicle independently and provides a map of reference orientation and speed that a vehicle must attain at any point on the navigation grid such that it safely reaches its target. The field is dynamically updated depending on the speed and proximity of the ego-vehicle to other agents. This dynamic adaptation of the velocity vector field allows prevention of imminent collisions. Experimental results show that MA-DV2F outperforms concurrent methods in terms of safety, computational efficiency and accuracy in reaching the target when scaling to a large number of vehicles.

MCML Authors

Qadeer Khan

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[110]

Y. Ma, Q. Khan and D. Cremers.
Multi Agent Navigation in Unconstrained Environments Using a Centralized Attention Based Graphical Neural Network Controller.
ITSC 2023 - 26th IEEE International Conference on Intelligent Transportation Systems. Bilbao, Spain, Sep 24-28, 2023. DOI GitHub

Abstract

In this work, we propose a learning-based neural model that provides both the longitudinal and lateral control commands to simultaneously navigate multiple vehicles. The goal is to ensure that each vehicle reaches a desired target state without colliding with any other vehicle or obstacle in an unconstrained environment. The model utilizes an attention-based Graphical Neural Network paradigm that takes into consideration the state of all the surrounding vehicles to make an informed decision. This allows each vehicle to smoothly reach its destination while also evading collision with the other agents. The data and corresponding labels for training such a network are obtained using an optimization-based procedure. Experimental results demonstrate that our model is powerful enough to generalize even to situations with more vehicles than in the training data. Our method also outperforms comparable graphical neural network architectures.

MCML Authors

Qadeer Khan

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[109]

D. Muhle, L. Koestler, N. Demmel, F. Bernard and D. Cremers.
The Probabilistic Normal Epipolar Constraint for Frame-To-Frame Rotation Optimization under Uncertain Feature Positions.
CVPR 2022 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, LA, USA, Jun 19-24, 2022. DOI

Abstract

The estimation of the relative pose of two camera views is a fundamental problem in computer vision. Kneip et al. proposed to solve this problem by introducing the normal epipolar constraint (NEC). However, their approach does not take into account uncertainties, so that the accuracy of the estimated relative pose is highly dependent on accurate feature positions in the target frame. In this work, we introduce the probabilistic normal epipolar constraint (PNEC) that overcomes this limitation by accounting for anisotropic and inhomogeneous uncertainties in the feature positions. To this end, we propose a novel objective function, along with an efficient optimization scheme that effectively minimizes our objective while maintaining real-time performance. In experiments on synthetic data, we demonstrate that the novel PNEC yields more accurate rotation estimates than the original NEC and several popular relative rotation estimation algorithms. Furthermore, we integrate the proposed method into a state-of-the-art monocular rotation-only odometry system and achieve consistently improved results for the real-world KITTI dataset.

MCML Authors

Dominik Muhle

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[108]

D. Muhle, L. Koestler, K. M. Jatavallabhula and D. Cremers.
Learning Correspondence Uncertainty via Differentiable Nonlinear Least Squares.
CVPR 2023 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada, Jun 18-23, 2023. DOI

Abstract

We propose a differentiable nonlinear least squares framework to account for uncertainty in relative pose estimation from feature correspondences. Specifically, we introduce a symmetric version of the probabilistic normal epipolar constraint, and an approach to estimate the covariance of feature positions by differentiating through the camera pose estimation procedure. We evaluate our approach on synthetic, as well as the KITTI and EuRoC real-world datasets. On the synthetic dataset, we confirm that our learned covariances accurately approximate the true noise distribution. In real-world experiments, we find that our approach consistently outperforms state-of-the-art non-probabilistic and probabilistic approaches, regardless of the feature extraction algorithm of choice.

MCML Authors

Dominik Muhle

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[107]

Y. Ma, A. Li, Q. Khan and D. Cremers.
Enhancing the Performance of Multi-Vehicle Navigation in Unstructured Environments using Hard Sample Mining.
Preprint (Sep. 2024). arXiv GitHub

Abstract

Contemporary research in autonomous driving has demonstrated tremendous potential in emulating the traits of human driving. However, it primarily caters to areas with well-built road infrastructure and appropriate traffic management systems. Therefore, in the absence of traffic signals or in unstructured environments, these self-driving algorithms are expected to fail. This paper proposes a strategy for autonomously navigating multiple vehicles in close proximity to their desired destinations without traffic rules in unstructured environments. Graphical Neural Networks (GNNs) have demonstrated good utility for this task of multi-vehicle control. Among the different alternatives for training GNNs, supervised methods have proven to be the most data-efficient, although they require ground truth labels. However, these labels may not always be available, particularly in unstructured environments without traffic regulations. A tedious optimization process may then be required to determine them while ensuring that the vehicles reach their desired destination and do not collide with each other or any obstacles. Therefore, in order to expedite the training process, it is essential to reduce the optimization time and select only those samples for labeling that add the most value to the training. In this paper, we propose a warm start method that first uses a pre-trained model trained on a simpler subset of data. Inference is then done on more complicated scenarios to determine the hard samples wherein the model faces the greatest predicament. This is measured by the difficulty vehicles encounter in reaching their desired destination without collision. Experimental results demonstrate that mining for hard samples in this manner reduces the requirement for supervised training data 10-fold.

MCML Authors

Qadeer Khan

Daniel Cremers

Prof. Dr.

Director

[106]

P. Ma, L. Rietdorf, D. Kotovenko, V. T. Hu and B. Ommer.
Does VLM Classification Benefit from LLM Description Semantics?
AAAI 2025 - 39th Conference on Artificial Intelligence. Philadelphia, PA, USA, Feb 25-Mar 04, 2025. Invited talk. DOI

Abstract

Accurately describing images with text is a foundation of explainable AI. Vision-Language Models (VLMs) like CLIP have recently addressed this by aligning images and texts in a shared embedding space, expressing semantic similarities between vision and language embeddings. VLM classification can be improved with descriptions generated by Large Language Models (LLMs). However, it is difficult to determine the contribution of actual description semantics, as the performance gain may also stem from a semantic-agnostic ensembling effect, where multiple modified text prompts act as a noisy test-time augmentation for the original one. We propose an alternative evaluation scenario to decide if a performance boost of LLM-generated descriptions is caused by such a noise augmentation effect or rather by genuine description semantics. The proposed scenario avoids noisy test-time augmentation and ensures that genuine, distinctive descriptions cause the performance boost. Furthermore, we propose a training-free method for selecting discriminative descriptions that work independently of classname-ensembling effects. Our approach identifies descriptions that effectively differentiate classes within a local CLIP label neighborhood, improving classification accuracy across seven datasets. Additionally, we provide insights into the explainability of description-based image classification with VLMs.

MCML Authors

Pingchuan Ma

Vincent Tao Hu

Dr.

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[105]

J. Meier, L. Scalerandi, O. Dhaouadi, J. Kaiser, N. Araslanov and D. Cremers.
CARLA Drone: Monocular 3D Object Detection from a Different Perspective.
GCPR 2024 - German Conference on Pattern Recognition. Munich, Germany, Oct 10-13, 2024. DOI

Abstract

Existing techniques for monocular 3D detection have a serious restriction. They tend to perform well only on a limited set of benchmarks, faring well either on ego-centric car views or on traffic camera views, but rarely on both. To encourage progress, this work advocates for an extended evaluation of 3D detection frameworks across different camera perspectives. We make two key contributions. First, we introduce the CARLA Drone dataset, CDrone. Simulating drone views, it substantially expands the diversity of camera perspectives in existing benchmarks. Despite its synthetic nature, CDrone represents a real-world challenge. To show this, we confirm that previous techniques struggle to perform well both on CDrone and a real-world 3D drone dataset. Second, we develop an effective data augmentation pipeline called GroundMix. Its distinguishing element is the use of the ground for creating 3D-consistent augmentation of a training image. GroundMix significantly boosts the detection accuracy of a lightweight one-stage detector. In our expanded evaluation, we achieve average precision on par with or substantially higher than the previous state of the art across all tested datasets.

MCML Authors

Johannes Meier

Nikita Araslanov

Dr.

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[104]

M. Mozes, M. Schmitt, V. Golkov, H. Schütze and D. Cremers.
Scene Graph Generation for Better Image Captioning?
Preprint (Sep. 2021). arXiv

Abstract

We investigate the incorporation of visual relationships into the task of supervised image caption generation by proposing a model that leverages detected objects and auto-generated visual relationships to describe images in natural language. To do so, we first generate a scene graph from raw image pixels by identifying individual objects and visual relationships between them. This scene graph then serves as input to our graph-to-text model, which generates the final caption. In contrast to previous approaches, our model thus explicitly models the detection of objects and visual relationships in the image. For our experiments we construct a new dataset from the intersection of Visual Genome and MS COCO, consisting of images with both a corresponding gold scene graph and human-authored caption. Our results show that our methods outperform existing state-of-the-art end-to-end models that generate image descriptions directly from raw input pixels when compared in terms of the BLEU and METEOR evaluation metrics.

MCML Authors

Vladimir Golkov

Dr.

Hinrich Schütze

Prof. Dr.

Principal Investigator

Daniel Cremers

Prof. Dr.

Director

[103]

P. Ma, X. Yang, Y. Li, M. Gui, F. Krause, J. Schusterbauer and B. Ommer.
SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models.
ICCV 2025 - IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published. Preprint available. arXiv

Abstract

Explicitly disentangling style and content in vision models remains challenging due to their semantic overlap and the subjectivity of human perception. Existing methods propose separation through generative or discriminative objectives, but they still face the inherent ambiguity of disentangling intertwined concepts. Instead, we ask: Can we bypass explicit disentanglement by learning to merge style and content invertibly, allowing separation to emerge naturally? We propose SCFlow, a flow-matching framework that learns bidirectional mappings between entangled and disentangled representations. Our approach is built upon three key insights: 1) training solely to merge style and content, a well-defined task, enables invertible disentanglement without explicit supervision; 2) flow matching bridges arbitrary distributions, avoiding the restrictive Gaussian priors of diffusion models and normalizing flows; and 3) a synthetic dataset of 510,000 samples (51 styles × 10,000 content samples) was curated to simulate disentanglement through systematic style-content pairing. Beyond controllable generation tasks, we demonstrate that SCFlow generalizes to ImageNet-1k and WikiArt in zero-shot settings and achieves competitive performance, highlighting that disentanglement naturally emerges from the invertible merging process.

MCML Authors

Pingchuan Ma

Felix Krause

Johannes Schusterbauer

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[102]

P. Mondorf, S. Zhou, M. Riedler and B. Plank.
Enabling Systematic Generalization in Abstract Spatial Reasoning through Meta-Learning for Compositionality.
Preprint (Apr. 2025). arXiv

Abstract

Systematic generalization refers to the capacity to understand and generate novel combinations from known components. Despite recent progress by large language models (LLMs) across various domains, these models often fail to extend their knowledge to novel compositional scenarios, revealing notable limitations in systematic generalization. There has been an ongoing debate about whether neural networks possess the capacity for systematic generalization, with recent studies suggesting that meta-learning approaches designed for compositionality can significantly enhance this ability. However, these insights have largely been confined to linguistic problems, leaving their applicability to other tasks an open question. In this study, we extend the approach of meta-learning for compositionality to the domain of abstract spatial reasoning. To this end, we introduce SYGAR, a dataset designed to evaluate the capacity of models to systematically generalize from known geometric transformations (e.g., translation, rotation) of two-dimensional objects to novel combinations of these transformations (e.g., translation+rotation). Our results show that a transformer-based encoder-decoder model, trained via meta-learning for compositionality, can systematically generalize to previously unseen transformation compositions, significantly outperforming state-of-the-art LLMs, including o3-mini, GPT-4o, and Gemini 2.0 Flash, which fail to exhibit similar systematic behavior. Our findings highlight the effectiveness of meta-learning in promoting systematicity beyond linguistic tasks, suggesting a promising direction toward more robust and generalizable models.

MCML Authors

Philipp Mondorf

Shijia Zhou

→ Group Barbara Plank
AI and Computational Linguistics

Monica Riedler

* Former Member

→ Group Almut Sophia Koepke
Computer Vision & Artificial Intelligence

Barbara Plank

Prof. Dr.

Principal Investigator

[101]

I. A. Petrov, V. Guzov, R. Marin, E. Aksan, X. Chen, D. Cremers, T. Beeler and G. Pons-Moll.
ECHO: Ego-Centric modeling of Human-Object interactions.
Preprint (Aug. 2025). arXiv

Abstract

Modeling human-object interactions (HOI) from an egocentric perspective is a largely unexplored yet important problem due to the increasing adoption of wearable devices, such as smart glasses and watches. We investigate how much information about interaction can be recovered from only head and wrist tracking. Our answer is ECHO (Ego-Centric modeling of Human-Object interactions), which, for the first time, proposes a unified framework to recover three modalities: human pose, object motion, and contact from such minimal observation. ECHO employs a Diffusion Transformer architecture and a unique three-variate diffusion process, which jointly models human motion, object trajectory, and contact sequence, allowing for flexible input configurations. Our method operates in a head-centric canonical space, enhancing robustness to global orientation. We propose a conveyor-based inference, which progressively increases the diffusion timestamp with the frame position, allowing us to process sequences of any length. Through extensive evaluation, we demonstrate that ECHO outperforms existing methods that do not offer the same flexibility, setting a new state of the art in egocentric HOI reconstruction.

MCML Authors

Daniel Cremers

Prof. Dr.

Director

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[100]

M. Pach, S. Karthik, Q. Bouniot, S. Belongie and Z. Akata.
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models.
Preprint (Apr. 2025). arXiv

Abstract

Sparse Autoencoders (SAEs) have recently been shown to enhance interpretability and steerability in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity in vision representations. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons while also exhibiting hierarchical representations that align well with expert-defined structures (e.g., iNaturalist taxonomy). Most notably, we demonstrate that applying SAEs to intervene on a CLIP vision encoder directly steers the output of multimodal LLMs (e.g., LLaVA) without any modifications to the underlying model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised approach for enhancing both the interpretability and control of VLMs.

MCML Authors

Mateusz Pach

Shyamgopal Karthik

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Quentin Bouniot

Dr.

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[99]

J. Pan, C. Zhou, M. Gladkova, Q. Khan and D. Cremers.
Robust Autonomous Vehicle Pursuit without Expert Steering Labels.
IEEE Robotics and Automation Letters 8.10 (Oct. 2023). DOI

Abstract

In this work, we present a learning method for both lateral and longitudinal motion control of an ego-vehicle for the task of vehicle pursuit. The car being controlled does not have a pre-defined route; rather, it reactively adapts to follow a target vehicle while maintaining a safety distance. To train our model, we do not rely on steering labels recorded from an expert driver, but effectively leverage a classical controller as an offline label generation tool. In addition, we account for the errors in the predicted control values, which can lead to a loss of tracking and catastrophic crashes of the controlled vehicle. To this end, we propose an effective data augmentation approach, which allows training a network that is capable of handling different views of the target vehicle. During the pursuit, the target vehicle is first localized using a Convolutional Neural Network. The network takes a single RGB image along with the cars’ velocities and estimates the target vehicle’s pose with respect to the ego-vehicle. This information is then fed to a Multi-Layer Perceptron, which regresses the control commands for the ego-vehicle, namely throttle and steering angle. We extensively validate our approach using the CARLA simulator on a wide range of terrains. Our method demonstrates real-time performance, robustness to different scenarios, including unseen trajectories, and high route completion.

MCML Authors

Mariia Gladkova

Qadeer Khan

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[98]

R. Qorbani, G. Villani, T. Panagiotakopoulos, M. B. Colomer, L. Härenstam-Nielsen, M. Segu, P. L. Dovesi, J. Karlgren, D. Cremers, F. Tombari and M. Poggi.
Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

Open-vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries, providing versatile performance on novel datasets. However, large shifts between training and test domains degrade their performance, requiring fine-tuning for effective real-world application. We introduce Semantic Library Adaptation (SemLa), a novel framework for training-free, test-time domain adaptation. SemLa leverages a library of LoRA-based adapters indexed with CLIP embeddings, dynamically merging the most relevant adapters based on proximity to the target domain in the embedding space. This approach constructs an ad-hoc model tailored to each specific input without additional training. Our method scales efficiently, enhances explainability by tracking adapter contributions, and inherently protects data privacy, making it ideal for sensitive applications. Comprehensive experiments on an 18-domain benchmark built over 10 standard datasets demonstrate SemLa’s superior adaptability and performance across diverse settings, establishing a new standard in domain adaptation for open-vocabulary semantic segmentation.

MCML Authors

Linus Härenstam-Nielsen

Daniel Cremers

Prof. Dr.

Director

Computer Aided Medical Procedures & Augmented Reality

Federico Tombari

PD Dr.

Associate

[97]

Y. Qu, Q. Wang, Y. Mao, V. T. Hu, B. Ommer and X. Ji.
Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?
Preprint (Jul. 2025). arXiv

Abstract

Recent advances have witnessed the effectiveness of reinforcement learning (RL) finetuning in enhancing the reasoning capabilities of large language models (LLMs). The optimization process often requires numerous iterations to achieve satisfactory performance, resulting in high computational costs due to the need for frequent prompt evaluations under intensive LLM interactions and repeated policy updates. Appropriate online prompt selection methods reduce iteration steps by prioritizing informative prompts during training, but their reliance on exhaustive prompt evaluation and subset selection for optimization still incurs substantial computational overhead due to frequent LLM inference calls. In contrast to these direct evaluate-then-select schemes, this work investigates iterative approximate evaluation for arbitrary prompts and introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework that estimates prompt difficulty online without requiring costly LLM interactions. Technically, MoPPS models each prompt’s success rate as a latent variable, performs streaming Bayesian inference, and employs posterior sampling in a constructed multi-armed bandit machine, enabling sample-efficient and adaptive prompt selection. Extensive experiments across mathematics, planning, and vision-based geometry tasks show that MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced LLM rollouts.
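The general idea of treating each prompt's success rate as a latent variable and selecting prompts by posterior sampling can be sketched with a textbook Beta-Bernoulli bandit. This is an illustrative sketch only and not the MoPPS implementation: the class name `PromptDifficultyBandit`, the uniform Beta(1, 1) prior, and the rule of preferring prompts whose sampled success rate lies closest to a target of 0.5 are all assumptions.

```python
import random
from typing import Dict, List


class PromptDifficultyBandit:
    """Beta-Bernoulli posterior over each prompt's success rate (hypothetical sketch)."""

    def __init__(self, prompt_ids: List[str], seed: int = 0) -> None:
        # Beta(1, 1) prior, i.e. uniform over success rates (an assumption).
        self.alpha: Dict[str, float] = {p: 1.0 for p in prompt_ids}
        self.beta: Dict[str, float] = {p: 1.0 for p in prompt_ids}
        self.rng = random.Random(seed)

    def posterior_mean(self, prompt_id: str) -> float:
        a, b = self.alpha[prompt_id], self.beta[prompt_id]
        return a / (a + b)

    def select(self, k: int, target: float = 0.5) -> List[str]:
        """Thompson-style selection: sample a success rate per prompt and
        keep the k prompts whose sample is closest to the target rate."""
        sampled = {
            p: self.rng.betavariate(self.alpha[p], self.beta[p])
            for p in self.alpha
        }
        return sorted(sampled, key=lambda p: abs(sampled[p] - target))[:k]

    def update(self, prompt_id: str, successes: int, failures: int) -> None:
        """Streaming conjugate update from observed rollout outcomes."""
        self.alpha[prompt_id] += successes
        self.beta[prompt_id] += failures


bandit = PromptDifficultyBandit(["easy", "medium", "hard"])
bandit.update("easy", successes=9, failures=1)
bandit.update("medium", successes=5, failures=5)
bandit.update("hard", successes=1, failures=9)
batch = bandit.select(k=2)
```

Because the update is conjugate, only the prompts that were actually rolled out need new LLM calls; the difficulty of all other prompts is read off the stored posterior, which is the source of the savings in this simplified setting.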

MCML Authors

Vincent Tao Hu

Dr.

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[96]

K. Roth, Z. Akata, D. Damen, I. Balažević and O. J. Hénaff.
Context-Aware Multimodal Pretraining.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple, but carefully designed extension to multimodal pretraining which enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find up to four-fold improvements in test-time sample efficiency, and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations easily surpass more complex and expensive optimization-based schemes, vastly simplifying generalization to new domains.

MCML Authors

Karsten Roth

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[95]

S. Roschmann, Q. Bouniot, V. Feofanov, I. Redko and Z. Akata.
Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers.
Preprint (Jun. 2025). arXiv

Abstract

Time series classification is a fundamental task in healthcare and industry, yet the development of time series foundation models (TSFMs) remains limited by the scarcity of publicly available time series datasets. In this work, we propose Time Vision Transformer (TiViT), a framework that converts time series into images to leverage the representational power of frozen Vision Transformers (ViTs) pretrained on large-scale image datasets. First, we theoretically motivate our approach by analyzing the 2D patching of ViTs for time series, showing that it can increase the number of label-relevant tokens and reduce the sample complexity. Second, we empirically demonstrate that TiViT achieves state-of-the-art performance on standard time series classification benchmarks by utilizing the hidden representations of large OpenCLIP models. We explore the structure of TiViT representations and find that intermediate layers with high intrinsic dimension are the most effective for time series classification. Finally, we assess the alignment between TiViT and TSFM representation spaces and identify a strong complementarity, with further performance gains achieved by combining their features. Our findings reveal yet another direction for reusing vision representations in a non-visual domain.

MCML Authors

Simon Roschmann

Quentin Bouniot

Dr.

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[94]

C. Reich, B. Debnath, D. Patel, T. Prangemeier, D. Cremers and S. Chakradhar.
Deep Video Codec Control for Vision Models.
Workshop @CVPR 2024 - Workshop at the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI

Abstract

Standardized lossy video coding is at the core of almost all real-world video processing pipelines. Rate control is used to enable standard codecs to adapt to different network bandwidth conditions or storage constraints. However, standard video codecs (e.g., H.264) and their rate control modules aim to minimize video distortion w.r.t. human quality assessment. We demonstrate empirically that standard-coded videos vastly deteriorate the performance of deep vision models. To overcome the deterioration of vision performance, this paper presents the first end-to-end learnable deep video codec control that considers both bandwidth constraints and downstream deep vision performance, while adhering to existing standardization. We demonstrate that our approach better preserves downstream deep vision performance than traditional standard video coding.

MCML Authors

Christoph Reich

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[93]

P. Roetzer, V. Ehm, D. Cremers, Z. Lähner and F. Bernard.
Higher-Order Ratio Cycles for Fast and Globally Optimal Shape Matching.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

In this work we address various shape matching problems that can be cast as finding cyclic paths in a product graph. This involves for example 2D-3D shape matching, 3D shape matching, or the matching of a contour to a graph. In this context, matchings are typically obtained as the minimum cost cycle in the product graph. Instead, inspired by related works on model-based image segmentation, we consider minimum ratio cycles, which we combine with the recently introduced conjugate product graph in order to allow for higher-order matching costs. With that, on the one hand we avoid the bias of obtaining matchings that involve fewer/shorter edges, while on the other hand being able to impose powerful geometric regularisation, e.g. to avoid zig-zagging. In our experiments we demonstrate that this not only leads to improved matching accuracy in most cases, but also to significantly reduced runtimes (up to two orders of magnitude, depending on the setting). Our GPU implementation will be made publicly available upon acceptance.

MCML Authors

Viktoria Ehm

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[92]

C. Reich, O. Hahn, D. Cremers, S. Roth and B. Debnath.
A Perspective on Deep Vision Performance with Standard Image and Video Codecs.
Workshop @CVPR 2024 - Workshop at the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI

Abstract

Resource-constrained hardware, such as edge devices or cell phones, often relies on cloud servers to provide the required computational resources for inference in deep vision models. However, transferring image and video data from an edge or mobile device to a cloud server requires coding to deal with network constraints. The use of standardized codecs, such as JPEG or H.264, is prevalent and required to ensure interoperability. This paper aims to examine the implications of employing standardized codecs within deep vision pipelines. We find that using JPEG and H.264 coding significantly deteriorates the accuracy across a broad range of vision tasks and models. For instance, strong compression rates reduce semantic segmentation accuracy by more than 80% in mIoU. In contrast to previous findings, our analysis extends beyond image and action classification to localization and dense prediction tasks, thus providing a more comprehensive perspective.

MCML Authors

Christoph Reich

Daniel Cremers

Prof. Dr.

Director

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[91]

S. N. Rai, S. Karthik, M.-I. Georgescu, B. Caputo, C. Masone and Z. Akata.
Road Obstacle Video Segmentation.
GCPR 2025 - German Conference on Pattern Recognition. Freiburg, Germany, Oct 23-26, 2025. To be published. Preprint available. arXiv

Abstract

With the growing deployment of autonomous driving agents, the detection and segmentation of road obstacles have become critical to ensure safe autonomous navigation. However, existing road-obstacle segmentation methods are applied on individual frames, overlooking the temporal nature of the problem, leading to inconsistent prediction maps between consecutive frames. In this work, we demonstrate that the road-obstacle segmentation task is inherently temporal, since the segmentation maps for consecutive frames are strongly correlated. To address this, we curate and adapt four evaluation benchmarks for road-obstacle video segmentation and evaluate 11 state-of-the-art image- and video-based segmentation methods on these benchmarks. Moreover, we introduce two strong baseline methods based on vision foundation models. Our approach establishes a new state-of-the-art in road-obstacle video segmentation for long-range video sequences, providing valuable insights and direction for future research.

MCML Authors

Shyamgopal Karthik

Iuliana Georgescu

Dr.

* Former Member

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[90]

D. Schnaus, N. Araslanov and D. Cremers.
It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

The platonic representation hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase. In particular, pairwise distances within each modality become more similar. This suggests that as foundation models mature, it may become possible to match vision and language embeddings in a fully unsupervised fashion, i.e., without parallel data. We present the first study towards this prospect, and investigate conformity of existing vision and language foundation models in the context of ‘blind’ matching. First, we formulate unsupervised matching as a quadratic assignment problem and introduce a novel heuristic that outperforms previous solvers. We also develop a technique to find optimal matching problems, for which a non-trivial match is very likely. Second, we conduct an extensive study deploying a range of vision and language models on four datasets. Our analysis reveals that for many problem instances, vision and language representations can indeed be matched without supervision. This finding opens up the possibility of exciting applications that embed semantic knowledge into other modalities. As a showcase, we demonstrate a proof-of-concept unsupervised classifier, which achieves non-trivial classification accuracy without any image-text annotation.

MCML Authors

Nikita Araslanov

Dr.

Daniel Cremers

Prof. Dr.

Director

[89]

N. Stracke, S. A. Baumann, K. Bauer, F. Fundel and B. Ommer.
CleanDIFT: Diffusion Features without Noise.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model to obtain the semantic features, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features that cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features readily outperform previous diffusion features by a wide margin in a wide variety of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.

MCML Authors

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[88]

J. Seidenschwarz, G. Braso, V. C. Serrano, I. Elezi and L. Leal-Taixé.
Simple Cues Lead to a Strong Multi-Object Tracker.
CVPR 2023 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada, Jun 18-23, 2023. DOI GitHub

Abstract

For a long time, the most common paradigm in Multi-Object Tracking was tracking-by-detection (TbD), where objects are first detected and then associated over video frames. For association, most models resorted to motion and appearance cues, e.g., re-identification networks. Recent approaches based on attention propose to learn the cues in a data-driven manner, showing impressive results. In this paper, we ask ourselves whether simple good old TbD methods are also capable of achieving the performance of end-to-end models. To this end, we propose two key ingredients that allow a standard re-identification network to excel at appearance-based tracking. We extensively analyse its failure cases, and show that a combination of our appearance features with a simple motion model leads to strong tracking results. Our tracker generalizes to four public datasets, namely MOT17, MOT20, BDD100k, and DanceTrack, achieving state-of-the-art performance.

MCML Authors

Jenny Seidenschwarz

* Former Member

Laura Leal-Taixé

Prof. Dr.

Principal Investigator

* Former Principal Investigator

[87]

N. Stracke, S. A. Baumann, J. M. Susskind, M. A. Bautista and B. Ommer.
CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control and Altering of T2I Models.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub

Abstract

Text-to-image generative models have become a prominent and powerful tool that excels at generating high-resolution realistic images. However, guiding the generative process of these models to take into account detailed forms of conditioning reflecting style and/or structure information remains an open problem. In this paper, we present LoRAdapter, an approach that unifies both style and structure conditioning under the same formulation using a novel conditional LoRA block that enables zero-shot control. LoRAdapter is an efficient and powerful approach to condition text-to-image diffusion models, which enables fine-grained control conditioning during generation and outperforms recent state-of-the-art approaches.

MCML Authors

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[86]

Y. Shen and D. Cremers.
Deep Combinatorial Aggregation.
NeurIPS 2022 - 36th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL

Abstract

Neural networks are known to produce poor uncertainty estimations, and a variety of approaches have been proposed to remedy this issue. This includes deep ensemble, a simple and effective method that achieves state-of-the-art results for uncertainty-aware learning tasks. In this work, we explore a combinatorial generalization of deep ensemble called deep combinatorial aggregation (DCA). DCA creates multiple instances of network components and aggregates their combinations to produce diversified model proposals and predictions. DCA components can be defined at different levels of granularity. And we discovered that coarse-grain DCAs can outperform deep ensemble for uncertainty-aware learning both in terms of predictive performance and uncertainty estimation. For fine-grain DCAs, we discover that an average parameterization approach named deep combinatorial weight averaging (DCWA) can improve the baseline training. It is on par with stochastic weight averaging (SWA) but does not require any custom training schedule or adaptation of BatchNorm layers. Furthermore, we propose a consistency enforcing loss that helps the training of DCWA and modelwise DCA. We experiment on in-domain, distributional shift, and out-of-distribution image classification tasks, and empirically confirm the effectiveness of DCWA and DCA approaches.
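
The combinatorial aggregation idea (several instances per network component, with predictions aggregated over their combinations) can be illustrated with a toy sketch. The scalar "components", the helper names, and the use of a plain average as the aggregation step are assumptions for illustration only and are not taken from the paper.

```python
from itertools import product


def make_component(scale, shift):
    # Hypothetical stand-in for one trained instance of a network component.
    return lambda x: scale * x + shift


def combinatorial_predictions(component_sets, x):
    """Evaluate every model formed by picking one instance per component.

    component_sets[k] holds the available instances of the k-th component;
    the components are applied in order, like consecutive network blocks.
    """
    outputs = []
    for combo in product(*component_sets):
        h = x
        for f in combo:
            h = f(h)
        outputs.append(h)
    return outputs


def aggregate(outputs):
    # Assumed aggregation: a plain average over all model proposals.
    return sum(outputs) / len(outputs)
```

With two instances for each of two components, this yields four model proposals, whereas a deep ensemble built from the same instances would only use the two "diagonal" pairings that were trained together.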

MCML Authors

Yuesong Shen

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[85]

L. Sang, Z. Canfes, D. Cao, R. Marin, F. Bernard and D. Cremers.
TwoSquared: 4D Generation from 2D Image Pairs.
Preprint (Apr. 2025). arXiv

Abstract

Despite the astonishing progress in generative AI, 4D dynamic object generation remains an open challenge. With limited high-quality training data and heavy computing requirements, the combination of hallucinating unseen geometry together with unseen movement poses great challenges to generative models. In this work, we propose TwoSquared as a method to obtain a 4D physically plausible sequence starting from only two 2D RGB images corresponding to the beginning and end of the action. Instead of directly solving the 4D generation problem, TwoSquared decomposes the problem into two steps: 1) an image-to-3D module generation based on the existing generative model trained on high-quality 3D assets, and 2) a physically inspired deformation module to predict intermediate movements. To this end, our method does not require templates or object-class-specific prior knowledge and can take in-the-wild images as input. In our experiments, we demonstrate that TwoSquared is capable of producing texture-consistent and geometry-consistent 4D sequences only given 2D images.

MCML Authors

Lu Sang

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[84]

L. Sang, Z. Canfes, D. Cao, F. Bernard and D. Cremers.
Implicit Neural Surface Deformation with Explicit Velocity Fields.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL

Abstract

In this work, we introduce the first unsupervised method that simultaneously predicts time-varying neural implicit surfaces and deformations between pairs of point clouds. We propose to model the point movement using an explicit velocity field and directly deform a time-varying implicit field using the modified level-set equation. This equation utilizes an iso-surface evolution with Eikonal constraints in a compact formulation, ensuring the integrity of the signed distance field. By applying a smooth, volume-preserving constraint to the velocity field, our method successfully recovers physically plausible intermediate shapes. Our method is able to handle both rigid and non-rigid deformations without any intermediate shape supervision. Our experimental results demonstrate that our method significantly outperforms existing works, delivering superior results in both quality and efficiency.

MCML Authors

Lu Sang

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[83]

L. Sang, Z. Canfes, D. Cao, R. Marin, F. Bernard and D. Cremers.
4Deform: Neural Surface Deformation for Robust Shape Interpolation.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

Generating realistic intermediate shapes between non-rigidly deformed shapes is a challenging task in computer vision, especially with unstructured data (e.g., point clouds) where temporal consistency across frames is lacking, and topologies are changing. Most interpolation methods are designed for structured data (i.e., meshes) and do not apply to real-world point clouds. In contrast, our approach, 4Deform, leverages neural implicit representation (NIR) to enable free topology changing shape deformation. Unlike previous mesh-based methods that learn vertex-based deformation fields, our method learns a continuous velocity field in Euclidean space. Thus, it is suitable for less structured data such as point clouds. Additionally, our method does not require intermediate-shape supervision during training; instead, we incorporate physical and geometrical constraints to regularize the velocity field. We reconstruct intermediate surfaces using a modified level-set equation, directly linking our NIR with the velocity field. Experiments show that our method significantly outperforms previous NIR approaches across various scenarios (e.g., noisy, partial, topology-changing, non-isometric shapes) and, for the first time, enables new applications like 4D Kinect sequence upsampling and real-world high-resolution mesh deformation.

MCML Authors

Lu Sang

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[82]

Y. Shen, N. Daheim, B. Cong, P. Nickl, G. M. Marconi, C. Bazan, R. Yokota, I. Gurevych, D. Cremers, M. E. Khan and T. Möllenhoff.
Variational Learning is Effective for Large Deep Networks.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL GitHub

Abstract

We give extensive empirical evidence against the common belief that variational learning is ineffective for large neural networks. We show that an optimizer called Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks such as GPT-2 and ResNets from scratch. IVON’s computational costs are nearly identical to Adam but its predictive uncertainty is better. We show several new use cases of IVON where we improve finetuning and model merging in Large Language Models, accurately predict generalization error, and faithfully estimate sensitivity to data. We find overwhelming evidence that variational learning is effective.

MCML Authors

Yuesong Shen

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Director

→ Group Benedikt Wiestler
AI for Image-Guided Diagnosis and Therapy

[81]

D. Scholz, A. C. Erdur, V. Ehm, A. Meyer-Baese, J. C. Peeken, D. Rückert and B. Wiestler.
MM-DINOv2: Adapting Foundation Models for Multi-Modal Medical Image Analysis.
MICCAI 2025 - 28th International Conference on Medical Image Computing and Computer Assisted Intervention. Daejeon, Republic of Korea, Sep 23-27, 2025. DOI

Abstract

Vision foundation models like DINOv2 demonstrate remarkable potential in medical imaging despite their origin in natural image domains. However, their design inherently works best for uni-modal image analysis, limiting their effectiveness for multi-modal imaging tasks that are common in many medical fields, such as neurology and oncology. While supervised models perform well in this setting, they fail to leverage unlabeled datasets and struggle with missing modalities, a frequent challenge in clinical settings. To bridge these gaps, we introduce MM-DINOv2, a novel and efficient framework that adapts the pre-trained vision foundation model DINOv2 for multi-modal medical imaging. Our approach incorporates multi-modal patch embeddings, enabling vision foundation models to effectively process multi-modal imaging data. To address missing modalities, we employ full-modality masking, which encourages the model to learn robust cross-modality relationships. Furthermore, we leverage semi-supervised learning to harness large unlabeled datasets, enhancing both the accuracy and reliability of medical predictions. Applied to glioma subtype classification from multi-sequence brain MRI, our method achieves a Matthews Correlation Coefficient (MCC) of 0.6 on an external test set, surpassing state-of-the-art supervised approaches by +11.1%. Our work establishes a scalable and robust solution for multi-modal medical imaging tasks, leveraging powerful vision foundation models pre-trained on natural images while addressing real-world clinical challenges such as missing data and limited annotations.

MCML Authors

Daniel Scholz

Viktoria Ehm

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Rückert

Prof. Dr.

Director

Artificial Intelligence in Healthcare and Medicine

Benedikt Wiestler

Prof. Dr.

Principal Investigator

AI for Image-Guided Diagnosis and Therapy

[80]

D. Scholz, A. C. Erdur, R. Holland, V. Ehm, J. C. Peeken, B. Wiestler and D. Rückert.
Contrastive Anatomy-Contrast Disentanglement: A Domain-General MRI Harmonization Method.
MICCAI 2025 - 28th International Conference on Medical Image Computing and Computer Assisted Intervention. Daejeon, Republic of Korea, Sep 23-27, 2025. DOI

Abstract

Magnetic resonance imaging (MRI) is an invaluable tool for clinical and research applications. Yet, variations in scanners and acquisition parameters cause inconsistencies in image contrast, hindering data comparability and reproducibility across datasets and clinical studies. Existing scanner harmonization methods, designed to address this challenge, face limitations, such as requiring traveling subjects or struggling to generalize to unseen domains. We propose a novel approach using a conditioned diffusion autoencoder with a contrastive loss and domain-agnostic contrast augmentation to harmonize MR images across scanners while preserving subject-specific anatomy. Our method enables brain MRI synthesis from a single reference image. It outperforms baseline techniques, achieving a +7% PSNR improvement on a traveling subjects dataset and +18% improvement on age regression in unseen domains. Our model provides robust, effective harmonization of brain MRIs to target scanners without requiring fine-tuning. This advancement promises to enhance comparability, reproducibility, and generalizability in multi-site and longitudinal clinical studies, ultimately contributing to improved healthcare outcomes.

MCML Authors

Daniel Scholz

→ Group Benedikt Wiestler
AI for Image-Guided Diagnosis and Therapy

Viktoria Ehm

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Benedikt Wiestler

Prof. Dr.

Principal Investigator

AI for Image-Guided Diagnosis and Therapy

Daniel Rückert

Prof. Dr.

Director

Artificial Intelligence in Healthcare and Medicine

[79]

P. Spohn, L. Girrbach, J. Bader and Z. Akata.
Align-then-Unlearn: Embedding Alignment for LLM Unlearning.
MUGen @ICML 2025 - Workshop on Machine Unlearning for Generative AI at the 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. URL

Abstract

Reliable estimation of treatment effects from observational data is important in many disciplines such as medicine. However, estimation is challenging when unconfoundedness as a standard assumption in the causal inference literature is violated. In this work, we leverage arbitrary (potentially high-dimensional) instruments to estimate bounds on the conditional average treatment effect (CATE). Our contributions are three-fold: (1) We propose a novel approach for partial identification through a mapping of instruments to a discrete representation space so that we yield valid bounds on the CATE. This is crucial for reliable decision-making in real-world applications. (2) We derive a two-step procedure that learns tight bounds using a tailored neural partitioning of the latent instrument space. As a result, we avoid instability issues due to numerical approximations or adversarial training. Furthermore, our procedure aims to reduce the estimation variance in finite-sample settings to yield more reliable estimates. (3) We show theoretically that our procedure obtains valid bounds while reducing estimation variance. We further perform extensive experiments to demonstrate the effectiveness across various settings. Overall, our procedure offers a novel path for practitioners to make use of potentially high-dimensional instruments (e.g., as in Mendelian randomization).

MCML Authors

Leander Girrbach

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Jessica Bader

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[78]

A. Saroha, M. Gladkova, C. Curreli, D. Muhle, T. Yenamandra and D. Cremers.
Gaussian Splatting in Style.
GCPR 2024 - German Conference on Pattern Recognition. Munich, Germany, Oct 10-13, 2024. DOI

Abstract

3D scene stylization extends the work of neural style transfer to 3D. A vital challenge in this problem is to maintain the uniformity of the stylized appearance across multiple views. A vast majority of the previous works achieve this by training a 3D model for every stylized image and a set of multi-view images. In contrast, we propose a novel architecture trained on a collection of style images that, at test time, produces real time high-quality stylized novel views. We choose the underlying 3D scene representation for our model as 3D Gaussian splatting. We take the 3D Gaussians and process them using a multi-resolution hash grid and a tiny MLP to obtain stylized views. The MLP is conditioned on different style codes for generalization to different styles during test time. The explicit nature of 3D Gaussians gives us inherent advantages over NeRF-based methods, including geometric consistency and a fast training and rendering regime. This enables our method to be useful for various practical use cases, such as augmented or virtual reality. We demonstrate that our method achieves state-of-the-art performance with superior visual quality on various indoor and outdoor real-world data.

MCML Authors

Mariia Gladkova

Cecilia Curreli

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Dominik Muhle

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Tarun Yenamandra

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

[77]

J. Schusterbauer, M. Gui, F. Fundel and B. Ommer.
Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

Recent advancements in diffusion models have established new benchmarks in both generative tasks and downstream applications. In contrast, flow matching models have shown promising improvements in performance but have not been as extensively explored, particularly due to the difficulty of inheriting knowledge from a pretrained diffusion prior foundation model. In this work, we propose a novel method to bridge the gap between pretrained diffusion models and flow matching models by aligning their trajectories and matching their objectives. Our approach mathematically formalizes this alignment and enables the efficient transfer of knowledge from diffusion priors to flow matching models. We demonstrate that our method outperforms traditional diffusion and flow matching finetuning, achieving competitive results across a variety of tasks.

MCML Authors

Johannes Schusterbauer

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[76]

L. Sang, M. Gao, A. Saroha and D. Cremers.
Enhancing Surface Neural Implicits with Curvature-Guided Sampling and Uncertainty-Augmented Representations.
Wild3D @ECCV 2024 - Workshop 3D Modeling, Reconstruction, and Generation in the Wild at the 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. URL

Abstract

Neural implicits are a widely used surface representation because they offer an adaptive resolution and support arbitrary topology changes. While previous works rely on ground truth point clouds or meshes, they often do not discuss the data acquisition and ignore the effect of input quality and sampling methods during reconstruction. In this paper, we introduce a sampling method with an uncertainty-augmented surface implicit representation that employs a sampling technique that considers the geometric characteristics of inputs. To this end, we introduce a strategy that efficiently computes differentiable geometric features, namely, mean curvatures, to guide the sampling phase during the training period. The uncertainty augmentation offers insights into the occupancy and reliability of the output signed distance value, thereby expanding representation capabilities into open surfaces. Finally, we demonstrate that our method improves the reconstruction of both synthetic and real-world data.

MCML Authors

Lu Sang

Maolin Gao

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Abhishek Saroha

→ Group Xi Wang
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[75]

D. Sinitsyn, L. Härenstam-Nielsen and D. Cremers.
PRaDA: Projective Radial Distortion Averaging.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

We tackle the problem of automatic calibration of radially distorted cameras in challenging conditions. Accurately determining distortion parameters typically requires either 1) solving the full Structure from Motion (SfM) problem involving camera poses, 3D points, and the distortion parameters, which is only possible if many images with sufficient overlap are provided, or 2) relying heavily on learning-based methods that are comparatively less accurate. In this work, we demonstrate that distortion calibration can be decoupled from 3D reconstruction, maintaining the accuracy of SfM-based methods while avoiding many of the associated complexities. This is achieved by working in Projective Space, where the geometry is unique up to a homography, which encapsulates all camera parameters except for distortion. Our proposed method, Projective Radial Distortion Averaging, averages multiple distortion estimates in a fully projective framework without creating 3D points and full bundle adjustment. By relying on pairwise projective relations, our method supports any feature-matching approach without constructing point tracks across multiple images.

MCML Authors

Daniil Sinitsyn

Linus Härenstam-Nielsen

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[74]

Y. Shen.
Probabilistic Modeling and Uncertainty Awareness in Deep Learning.
Dissertation Mar. 2025. URL

Abstract

This dissertation focuses on probabilistic modeling and uncertainty-aware approaches for deep learning. It is based on four papers that tackle the problem of uncertainty-aware deep learning, covering techniques such as post-hoc calibration, model aggregation, and Bayesian deep learning with variational inference. Also, an overview of related prior work is provided, which covers both classical and deep-learning-based approaches.

MCML Authors

Yuesong Shen

Dr.

* Former Member

[73]

A. Saroha, F. Hofherr, M. Gladkova, C. Curreli, O. Litany and D. Cremers.
ZDySS -- Zero-Shot Dynamic Scene Stylization using Gaussian Splatting.
Preprint (Jan. 2025). arXiv

Abstract

Stylizing a dynamic scene based on an exemplar image is critical for various real-world applications, including gaming, filmmaking, and augmented and virtual reality. However, achieving consistent stylization across both spatial and temporal dimensions remains a significant challenge. Most existing methods are designed for static scenes and often require an optimization process for each style image, limiting their adaptability. We introduce ZDySS, a zero-shot stylization framework for dynamic scenes, allowing our model to generalize to previously unseen style images at inference. Our approach employs Gaussian splatting for scene representation, linking each Gaussian to a learned feature vector that renders a feature map for any given view and timestamp. By applying style transfer on the learned feature vectors instead of the rendered feature map, we enhance spatio-temporal consistency across frames. Our method demonstrates superior performance and coherence over state-of-the-art baselines in tests on real-world dynamic scenes, making it a robust solution for practical applications.

MCML Authors

Florian Hofherr

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Mariia Gladkova

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Cecilia Curreli

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

[72]

Y. Shehata, B. Holzschuh and N. Thuerey.
Improved Sampling Of Diffusion Models In Fluid Dynamics With Tweedie's Formula.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL

Abstract

State-of-the-art Denoising Diffusion Probabilistic Models (DDPMs) rely on an expensive sampling process with a large Number of Function Evaluations (NFEs) to provide high-fidelity predictions. This computational bottleneck renders diffusion models less appealing as surrogates for the spatio-temporal prediction of physics-based problems with long rollout horizons. We propose Truncated Sampling Models, enabling single-step and few-step sampling with elevated fidelity by simple truncation of the diffusion process, reducing the gap between DDPMs and deterministic single-step approaches. We also introduce a novel approach, Iterative Refinement, to sample pre-trained DDPMs by reformulating the generative process as a refinement process with few sampling steps. Both proposed methods enable significant improvements in accuracy compared to DDPMs, DDIMs, and EDMs with NFEs ≤ 10 on a diverse set of experiments, including incompressible and compressible turbulent flow and airfoil flow uncertainty simulations. Our proposed methods provide stable predictions for long rollout horizons in time-dependent problems and are able to learn all modes of the data distribution in steady-state problems with high uncertainty.
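
As background for the title's reference to Tweedie's formula, the following is a minimal sketch of the generic formula under the standard DDPM forward process x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε. It is not the authors' Truncated Sampling or Iterative Refinement procedure. The 1D Gaussian data distribution and all function names are assumptions chosen so that the score and the exact posterior mean are available in closed form.

```python
import math

def tweedie_denoise(x_t, score, alpha_bar):
    """Tweedie estimate of E[x_0 | x_t] from the score of the noisy marginal."""
    return (x_t + (1.0 - alpha_bar) * score) / math.sqrt(alpha_bar)

def gaussian_score(x_t, mu, sigma, alpha_bar):
    """Score of p(x_t) when x_0 ~ N(mu, sigma^2) (toy assumption)."""
    var_t = alpha_bar * sigma**2 + (1.0 - alpha_bar)
    return -(x_t - math.sqrt(alpha_bar) * mu) / var_t

def gaussian_posterior_mean(x_t, mu, sigma, alpha_bar):
    """Exact E[x_0 | x_t] for the same toy Gaussian, via Gaussian conditioning."""
    var_t = alpha_bar * sigma**2 + (1.0 - alpha_bar)
    gain = math.sqrt(alpha_bar) * sigma**2 / var_t
    return mu + gain * (x_t - math.sqrt(alpha_bar) * mu)

# Illustrative values only.
mu, sigma, alpha_bar, x_t = 0.5, 2.0, 0.3, 1.7
estimate = tweedie_denoise(x_t, gaussian_score(x_t, mu, sigma, alpha_bar), alpha_bar)
exact = gaussian_posterior_mean(x_t, mu, sigma, alpha_bar)
print(estimate, exact)
```

In a trained diffusion model the analytic score would be replaced by the network's score (or noise) prediction, which is what makes a single-step estimate of the clean sample possible.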

MCML Authors

Nils Thuerey

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[71]

L. Sang, B. Häfner, X. Zuo and D. Cremers.
High-Quality RGB-D Reconstruction via Multi-View Uncalibrated Photometric Stereo and Gradient-SDF.
WACV 2023 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 03-07, 2023. DOI

Abstract

Fine-detailed reconstructions are in high demand in many applications. However, most of the existing RGB-D reconstruction methods rely on pre-calculated accurate camera poses to recover the detailed surface geometry, where the representation of a surface needs to be adapted when optimizing different quantities. In this paper, we present a novel multi-view RGB-D based reconstruction method that tackles camera pose, lighting, albedo, and surface normal estimation via the utilization of a gradient signed distance field (gradient-SDF). The proposed method formulates the image rendering process using specific physically-based model(s) and optimizes the surface’s quantities on the actual surface using its volumetric representation, as opposed to other works which estimate surface quantities only near the actual surface. To validate our method, we investigate two physically-based image formation models for natural light and point light source applications. The experimental results on synthetic and real-world datasets demonstrate that the proposed method can recover high-quality geometry of the surface more faithfully than the state-of-the-art and further improves the accuracy of estimated camera poses.

MCML Authors

Lu Sang

Björn Häfner

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Xingxing Zuo

Dr.

* Former Member

→ Group Stefan Leutenegger

* Former Principal Investigator

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[70]

J. Schmidt, Q. Khan and D. Cremers.
LiDAR View Synthesis for Robust Vehicle Navigation Without Expert Labels.
ITSC 2023 - 26th IEEE International Conference on Intelligent Transportation. Bilbao, Spain, Sep 24-28, 2023. DOI GitHub

Abstract

Deep learning models for self-driving cars require a diverse training dataset to manage critical driving scenarios on public roads safely. This includes having data from divergent trajectories, such as the oncoming traffic lane or sidewalks. Such data would be too dangerous to collect in the real world. Data augmentation approaches have been proposed to tackle this issue using RGB images. However, solutions based on LiDAR sensors are scarce. Therefore, we propose synthesizing additional LiDAR point clouds from novel viewpoints without physically driving at dangerous positions. The LiDAR view synthesis is done using mesh reconstruction and ray casting. We train a deep learning model, which takes a LiDAR scan as input and predicts the future trajectory as output. A waypoint controller is then applied to this predicted trajectory to determine the throttle and steering labels of the ego-vehicle. Our method requires expert driving labels for neither the original nor the synthesized LiDAR sequence. Instead, we infer labels from LiDAR odometry. We demonstrate the effectiveness of our approach in a comprehensive online evaluation and with a comparison to concurrent work. Our results show the importance of synthesizing additional LiDAR point clouds, particularly in terms of model robustness.

MCML Authors

Qadeer Khan

Daniel Cremers

Prof. Dr.

Director

→ Group Volker Tresp
Database Systems, Data Mining and AI

[69]

U. Sahin, H. Li, Q. Khan, D. Cremers and V. Tresp.
Enhancing Multimodal Compositional Reasoning of Visual Language Models With Generative Negative Mining.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI GitHub

Abstract

Contemporary large-scale visual language models (VLMs) exhibit strong representation capacities, making them ubiquitous for enhancing image and text understanding tasks. They are often trained in a contrastive manner on a large and diverse corpus of images and corresponding text captions scraped from the internet. Despite this, VLMs often struggle with compositional reasoning tasks which require a fine-grained understanding of the complex interactions of objects and their attributes. This failure can be attributed to two main factors: 1) Contrastive approaches have traditionally focused on mining negative examples from existing datasets. However, the mined negative examples might not be difficult for the model to discriminate from the positive. An alternative to mining would be negative sample generation. 2) Existing generative approaches, however, primarily focus on generating hard negative texts associated with a given image. Mining in the other direction, i.e., generating negative image samples associated with a given text, has been ignored. To overcome both these limitations, we propose a framework that not only mines in both directions but also generates challenging negative samples in both modalities, i.e., images and texts. Leveraging these generative hard negative samples, we significantly enhance VLMs’ performance in tasks involving multimodal compositional reasoning.

MCML Authors

Hang Li

* Former Member

Qadeer Khan

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

Volker Tresp

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[68]

R. Skorobogat, K. Roth, M.-I. Georgescu and Z. Akata.
Subspace-Boosted Model Merging.
Preprint (Jun. 2025). arXiv

Abstract

Model merging enables the combination of multiple specialized expert models into a single model capable of performing multiple tasks. However, merging an increasing number of specialized experts generally leads to diminishing returns and reduced overall performance gains. In this work, we offer an explanation and analysis from a task arithmetic perspective, revealing that as the merging process (across numerous existing merging methods) continues for more and more experts, the associated task vector space experiences rank collapse. To mitigate this issue, we introduce Subspace Boosting, which operates on the singular value decomposed task vector space and maintains task vector ranks. Subspace Boosting raises merging efficacy for up to 20 expert models by large margins of more than 10% when evaluated on vision benchmarks. Moreover, we propose employing Higher-Order Generalized Singular Value Decomposition to further quantify task similarity, offering a new interpretable perspective on model merging.

MCML Authors

Karsten Roth

Iuliana Georgescu

Dr.

* Former Member

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[67]

L. Sang, A. Saroha, M. Gao and D. Cremers.
Enhancing Surface Neural Implicits with Curvature-Guided Sampling and Uncertainty-Augmented Representations.
GCPR 2024 - German Conference on Pattern Recognition. Munich, Germany, Oct 10-13, 2024. DOI

Abstract

Neural implicit representations have become a popular choice for modeling surfaces due to their adaptability in resolution and support for complex topology. While previous works have achieved impressive reconstruction quality by training on ground truth point clouds or meshes, they often do not discuss the data acquisition and ignore the effect of input quality and sampling methods during reconstruction. In this paper, we introduce a method that directly digests depth images for the task of high-fidelity 3D reconstruction. To this end, a novel local geometry feature computation method is proposed such that a simple sampling strategy can be adopted to generate highly effective training data. Due to its simplicity, our sampling strategy can be easily incorporated into diverse popular methods, allowing their training process to be more stable and efficient. Despite its simplicity, our method outperforms a range of both classical and learning-based baselines and demonstrates state-of-the-art results in both synthetic and real-world datasets.

MCML Authors

Lu Sang

Maolin Gao

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[66]

S. Solonets, D. Sinitsyn, L. Von Stumberg, N. Araslanov and D. Cremers.
An Analytical Solution to Gauss-Newton Loss for Direct Image Alignment.
ICLR 2024 - 12th International Conference on Learning Representations. Vienna, Austria, May 07-11, 2024. URL

Abstract

Direct image alignment is a widely used technique for relative 6DoF pose estimation between two images, but its accuracy strongly depends on pose initialization. Therefore, recent end-to-end frameworks increase the convergence basin of the learned feature descriptors with special training objectives, such as the Gauss-Newton loss. However, the training data may exhibit bias toward a specific type of motion and pose initialization, thus limiting the generalization of these methods. In this work, we derive a closed-form solution to the expected optimum of the Gauss-Newton loss. The solution is agnostic to the underlying feature representation and allows us to dynamically adjust the basin of convergence according to our assumptions about the uncertainty in the current estimates. These properties allow for effective control over the convergence in the alignment process. Despite using self-supervised feature embeddings, our solution achieves compelling accuracy w.r.t. the state-of-the-art direct image alignment methods trained end-to-end with pose supervision, and demonstrates improved robustness to pose initialization. Our analytical solution exposes some inherent limitations of end-to-end learning with the Gauss-Newton loss, and establishes an intriguing connection between direct image alignment and feature-matching approaches.

MCML Authors

Sergei Solonets

Daniil Sinitsyn

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Nikita Araslanov

Dr.

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[65]

Y. Shan, Y. Xia, Y. Chen and D. Cremers.
SCP: Scene Completion Pre-training for 3D Object Detection.
Preprint (Sep. 2023). arXiv

Abstract

3D object detection using LiDAR point clouds is a fundamental task in the fields of computer vision, robotics, and autonomous driving. However, existing 3D detectors heavily rely on annotated datasets, which are both time-consuming and prone to errors during the process of labeling 3D bounding boxes. In this paper, we propose a Scene Completion Pre-training (SCP) method to enhance the performance of 3D object detectors with less labeled data. SCP offers three key advantages: (1) Improved initialization of the point cloud model. By completing the scene point clouds, SCP effectively captures the spatial and semantic relationships among objects within urban environments. (2) Elimination of the need for additional datasets. SCP serves as a valuable auxiliary network that does not impose any additional efforts or data requirements on the 3D detectors. (3) Reduction of the amount of labeled data for detection. With the help of SCP, the existing state-of-the-art 3D detectors can achieve comparable performance while only relying on 20% labeled data.

MCML Authors

Yan Xia

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[64]

J. Seidenschwarz, Q. Zhou, B. Duisterhof, D. Ramanan and L. Leal-Taixé.
DynOMo: Online Point Tracking by Dynamic Online Monocular Gaussian Reconstruction.
3DV 2025 - 12th International Conference on 3D Vision. Singapore, Mar 25-28, 2025. DOI

Abstract

Reconstructing scenes and tracking motion are two sides of the same coin. Tracking points allow for geometric reconstruction [14], while geometric reconstruction of (dynamic) scenes allows for 3D tracking of points over time [24, 39]. The latter was recently also exploited for 2D point tracking to overcome occlusion ambiguities by lifting tracking directly into 3D [38]. However, the above approaches require either offline processing or multi-view camera setups, both unrealistic for real-world applications like robot navigation or mixed reality. We target the challenge of online 2D and 3D point tracking from unposed monocular camera input, introducing Dynamic Online Monocular Reconstruction (DynOMo). We leverage 3D Gaussian splatting to reconstruct dynamic scenes in an online fashion. Our approach extends 3D Gaussians to capture new content and object motions while estimating camera movements from a single RGB frame. DynOMo stands out by enabling the emergence of point trajectories through robust image feature reconstruction and a novel similarity-enhanced regularization term, without requiring any correspondence-level supervision. It sets the first baseline for online point tracking with monocular unposed cameras, achieving performance on par with existing methods. We aim to inspire the community to advance online point tracking and reconstruction, expanding the applicability to diverse real-world scenarios.

MCML Authors

Jenny Seidenschwarz

* Former Member

[63]

Z. Shi, X. Zhang, Y. Xia, Y. Zang, S. Shen and C. Wang.
L2RSI: Cross-view LiDAR-based Place Recognition for Large-scale Urban Scenes via Remote Sensing Imagery.
Preprint (Mar. 2025). arXiv GitHub

Abstract

We tackle the challenge of LiDAR-based place recognition, which traditionally depends on costly and time-consuming prior 3D maps. To overcome this, we first construct the XA-L&RSI dataset, which encompasses approximately 110,000 remote sensing submaps and 13,000 LiDAR point cloud submaps captured in urban scenes, and propose a novel method, L2RSI, for cross-view LiDAR place recognition using high-resolution Remote Sensing Imagery. This approach enables large-scale localization capabilities at a reduced cost by leveraging readily available overhead images as map proxies. L2RSI addresses the dual challenges of cross-view and cross-modal place recognition by learning feature alignment between point cloud submaps and remote sensing submaps in the semantic domain. Additionally, we introduce a novel probability propagation method based on a dynamic Gaussian mixture model to refine position predictions, effectively leveraging temporal and spatial information. This approach enables large-scale retrieval and cross-scene generalization without fine-tuning. Extensive experiments on XA-L&RSI demonstrate that, within a 100 km² retrieval range, L2RSI accurately localizes 95.08% of point cloud submaps within a 30 m radius for the top-1 retrieved location.

MCML Authors

Yan Xia

Dr.

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[62]

C. Tomani and D. Cremers.
Challenger: Training with Attribution Maps.
Preprint (May. 2022). arXiv

Abstract

We show that utilizing attribution maps for training neural networks can improve regularization of models and thus increase performance. Regularization is key in deep learning, especially when training complex models on relatively small datasets. In order to understand the inner workings of neural networks, attribution methods such as Layer-wise Relevance Propagation (LRP) have been extensively studied, particularly for interpreting the relevance of input features. We introduce Challenger, a module that leverages the explainable power of attribution maps in order to manipulate particularly relevant input patterns, thereby exposing and subsequently resolving regions of ambiguity towards separating classes on the ground-truth data manifold, an issue that arises particularly when training models on rather small datasets. Our Challenger module increases model performance through building more diverse filters within the network and can be applied to any input data domain. We demonstrate that our approach results in substantially better classification as well as calibration performance on datasets with only a few samples up to datasets with thousands of samples. In particular, we show that our generic domain-independent approach yields state-of-the-art results in vision, natural language processing and on time series tasks.

MCML Authors

Christian Tomani

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[61]

C. Tomani, D. Cremers and F. Buettner.
Parameterized Temperature Scaling for Boosting the Expressive Power in Post-Hoc Uncertainty Calibration.
ECCV 2022 - 17th European Conference on Computer Vision. Tel Aviv, Israel, Oct 23-27, 2022. DOI GitHub

Abstract

We address the problem of uncertainty calibration and introduce a novel calibration method, Parameterized Temperature Scaling (PTS). Standard deep neural networks typically yield uncalibrated predictions, which can be transformed into calibrated confidence scores using post-hoc calibration methods. In this contribution, we demonstrate that the performance of accuracy-preserving state-of-the-art post-hoc calibrators is limited by their intrinsic expressive power. We generalize temperature scaling by computing prediction-specific temperatures, parameterized by a neural network. We show with extensive experiments that our novel accuracy-preserving approach consistently outperforms existing algorithms across a large number of model architectures, datasets and metrics.
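
To illustrate what a prediction-specific temperature means, here is a minimal sketch of temperature scaling in which the temperature is a function of the logits. The function `toy_temperature` is a hypothetical stand-in for the learned network described in the abstract, not the authors' architecture. Dividing all logits by one positive scalar leaves their ordering unchanged, which is why this family of calibrators preserves accuracy.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a positive temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def toy_temperature(logits):
    """Hypothetical temperature rule (assumption): soften large-margin predictions."""
    top, runner_up = sorted(logits, reverse=True)[:2]
    return 1.0 + 0.5 * (top - runner_up)

def calibrate(logits, temperature_fn):
    """Rescale one prediction's logits with its own temperature."""
    return softmax(logits, temperature_fn(logits))

logits = [4.0, 1.0, 0.5]
raw = softmax(logits)                      # uncalibrated confidences
cal = calibrate(logits, toy_temperature)   # per-prediction temperature applied
print(raw, cal)
```

Standard temperature scaling is the special case in which `temperature_fn` returns the same constant for every input.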

MCML Authors

Christian Tomani

Daniel Cremers

Prof. Dr.

Director

[60]

A. Toker, M. Eisenberger, D. Cremers and L. Leal-Taixé.
SatSynth: Augmenting Image-Mask Pairs Through Diffusion Models for Aerial Semantic Segmentation.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI

Abstract

In recent years, semantic segmentation has become a pivotal tool in processing and interpreting satellite imagery. Yet, a prevalent limitation of supervised learning techniques remains the need for extensive manual annotations by experts. In this work, we explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks. The main idea is to learn the joint data manifold of images and labels, leveraging recent advancements in denoising diffusion probabilistic models. To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation. We find that the obtained pairs not only display high quality in fine-scale features but also ensure a wide sampling diversity. Both aspects are crucial for earth observation data, where semantic classes can vary severely in scale and occurrence frequency. We employ the novel data instances for downstream segmentation, as a form of data augmentation. In our experiments, we provide comparisons to prior works based on discriminative diffusion models or GANs. We demonstrate that integrating generated samples yields significant quantitative improvements for satellite semantic segmentation - both compared to baselines and when training only on the original data.

MCML Authors

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[59]

C. Tomani, S. Gruber, M. E. Erdem, D. Cremers and F. Buettner.
Post-hoc Uncertainty Calibration for Domain Drift Scenarios.
CVPR 2021 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual, Jun 19-25, 2021. DOI

Abstract

We address the problem of uncertainty calibration. While standard deep neural networks typically yield uncalibrated predictions, calibrated confidence scores that are representative of the true likelihood of a prediction can be achieved using post-hoc calibration methods. However, to date, the focus of these approaches has been on in-domain calibration. Our contribution is two-fold. First, we show that existing post-hoc calibration methods yield highly over-confident predictions under domain shift. Second, we introduce a simple strategy where perturbations are applied to samples in the validation set before performing the post-hoc calibration step. In extensive experiments, we demonstrate that this perturbation step results in substantially better calibration under domain shift on a wide range of architectures and modelling tasks.

MCML Authors

Christian Tomani

Daniel Cremers

Prof. Dr.

Director

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[58]

L. Thede, K. Roth, M. Bethge, Z. Akata and T. Hartvigsen.
WikiBigEdit: Understanding the Limits of Lifelong Knowledge Editing in LLMs.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. Preprint. arXiv

Abstract

Keeping large language models factually up-to-date is crucial for deployment, yet costly retraining remains a challenge. Knowledge editing offers a promising alternative, but methods are only tested on small-scale or synthetic edit benchmarks. In this work, we aim to bridge research into lifelong knowledge editing to real-world edits at practically relevant scale. We first introduce WikiBigEdit, a large-scale benchmark of real-world Wikidata edits, built to automatically extend lifelong for future-proof benchmarking. In its first instance, it includes over 500K question-answer pairs for knowledge editing alongside a comprehensive evaluation pipeline. Finally, we use WikiBigEdit to study existing knowledge editing techniques’ ability to incorporate large volumes of real-world facts and contrast their capabilities to generic modification techniques such as retrieval augmentation and continual finetuning to acquire a complete picture of the practical extent of current lifelong knowledge editing.

MCML Authors

Karsten Roth

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[57]

L. Thede, K. Roth, O. J. Hénaff, M. Bethge and Z. Akata.
Reflecting on the State of Rehearsal-free Continual Learning with Pretrained Models.
CoLLAs 2024 - 3rd Conference on Lifelong Learning Agents. Pisa, Italy, Aug 11-14, 2024. URL

Abstract

With the advent and recent ubiquity of foundation models, continual learning (CL) has recently shifted from continual training from scratch to the continual adaptation of pretrained models, seeing particular success on rehearsal-free CL benchmarks (RFCL). To achieve this, most proposed methods adapt and restructure parameter-efficient finetuning techniques (PEFT) to suit the continual nature of the problem. Based most often on input-conditional query-mechanisms or regularizations on top of prompt- or adapter-based PEFT, these PEFT-style RFCL (P-RFCL) approaches report peak performances, often convincingly outperforming existing CL techniques. However, on the other end, critical studies have recently highlighted competitive results by training on just the first task or via simple non-parametric baselines. Consequently, questions arise about the relationship between methodological choices in P-RFCL and their reported high benchmark scores. In this work, we tackle these questions to better understand the true drivers behind strong P-RFCL performances, their placement w.r.t. recent first-task adaptation studies, and their relation to preceding CL standards such as EWC or SI. In particular, we show: (1) P-RFCL techniques relying on input-conditional query mechanisms work not because of them but rather despite them, by collapsing towards standard PEFT shortcut solutions. (2) Indeed, we show how most often, P-RFCL techniques can be matched by a simple and lightweight PEFT baseline. (3) Using this baseline, we identify the implicit bound on tunable parameters when deriving RFCL approaches from PEFT methods as a potential denominator behind P-RFCL efficacy. Finally, we (4) better disentangle continual versus first-task adaptation, and (5) motivate standard RFCL techniques such as EWC or SI in light of recent P-RFCL methods.

MCML Authors

Karsten Roth

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[56]

C. Tomani, D. Vilar, M. Freitag, C. Cherry, S. Naskar, M. Finkelstein, X. Garcia and D. Cremers.
Quality-Aware Translation Models: Efficient Generation and Quality Estimation in a Single Model.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Maximum-a-posteriori (MAP) decoding is the most widely used decoding strategy for neural machine translation (NMT) models. The underlying assumption is that model probability correlates well with human judgment, with better translations getting assigned a higher score by the model. However, research has shown that this assumption does not always hold, and generation quality can be improved by decoding to optimize a utility function backed by a metric or quality-estimation signal, as is done by Minimum Bayes Risk (MBR) or Quality-Aware decoding. The main disadvantage of these approaches is that they require an additional model to calculate the utility function during decoding, significantly increasing the computational cost. In this paper, we propose to make the NMT models themselves quality-aware by training them to estimate the quality of their own output. Using this approach for MBR decoding we can drastically reduce the size of the candidate list, resulting in a speed-up of two orders of magnitude. When applying our method to MAP decoding we obtain quality gains similar or even superior to quality reranking approaches, but with the efficiency of single pass decoding.
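
The cost argument for MBR decoding can be seen in a minimal sketch of the generic procedure: each candidate is scored by its average utility against all other candidates, so the number of utility calls grows quadratically with the candidate list. The token-overlap utility and the candidate sentences below are toy assumptions. Practical systems use a learned metric, and this is not the authors' quality-aware model.

```python
def overlap_utility(hypothesis, reference):
    """Toy utility (assumption): Jaccard overlap of the two token sets."""
    h, r = set(hypothesis.split()), set(reference.split())
    union = h | r
    return len(h & r) / len(union) if union else 1.0

def mbr_decode(candidates, utility=overlap_utility):
    """Return the candidate with the highest average utility against the others."""
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]

# Hypothetical candidate translations sampled from a model.
candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog stood",
]
print(mbr_decode(candidates))
```

Shrinking the candidate list, as the abstract proposes, reduces the number of pairwise utility evaluations accordingly.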

MCML Authors

Christian Tomani

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[55]

C. Tomani, F. K. Waseda, Y. Shen and D. Cremers.
Beyond In-Domain Scenarios: Robust Density-Aware Calibration.
ICML 2023 - 40th International Conference on Machine Learning. Honolulu, Hawaii, Jul 23-29, 2023. URL

Abstract

Calibrating deep learning models to yield uncertainty-aware predictions is crucial as deep neural networks get increasingly deployed in safety-critical applications. While existing post-hoc calibration methods achieve impressive results on in-domain test datasets, they are limited by their inability to yield reliable uncertainty estimates in domain-shift and out-of-domain (OOD) scenarios. We aim to bridge this gap by proposing DAC, an accuracy-preserving as well as Density-Aware Calibration method based on k-nearest-neighbors (KNN). In contrast to existing post-hoc methods, we utilize hidden layers of classifiers as a source for uncertainty-related information and study their importance. We show that DAC is a generic method that can readily be combined with state-of-the-art post-hoc methods. DAC boosts the robustness of calibration performance in domain-shift and OOD, while maintaining excellent in-domain predictive uncertainty estimates. We demonstrate that DAC leads to consistently better calibration across a large number of model architectures, datasets, and metrics. Additionally, we show that DAC improves calibration substantially on recent large-scale neural networks pre-trained on vast amounts of data.

MCML Authors

Christian Tomani

Yuesong Shen

Dr.

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[54]

T. Uscidda, L. Eyring, K. Roth, F. J. Theis, Z. Akata and M. Cuturi.
Disentangled Representation Learning through Geometry Preservation with the Gromov-Monge Gap.
SPIGM @ICML 2024 - Workshop on Structured Probabilistic Inference & Generative Modeling at the 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. arXiv

Abstract

Learning disentangled representations from unlabelled data is a fundamental challenge in machine learning. Solving it may unlock other problems, such as generalization, interpretability, or fairness. Although remarkably challenging to solve in theory, disentanglement is often achieved in practice through prior matching. Furthermore, recent works have shown that prior matching approaches can be enhanced by leveraging geometrical considerations, e.g., by learning representations that preserve geometric features of the data, such as distances or angles between points. However, matching the prior while preserving geometric features is challenging, as a mapping that fully preserves these features while aligning the data distribution with the prior does not exist in general. To address these challenges, we introduce a novel approach to disentangled representation learning based on quadratic optimal transport. We formulate the problem using Gromov-Monge maps that transport one distribution onto another with minimal distortion of predefined geometric features, preserving them as much as can be achieved. To compute such maps, we propose the Gromov-Monge-Gap (GMG), a regularizer quantifying whether a map moves a reference distribution with minimal geometry distortion. We demonstrate the effectiveness of our approach for disentanglement across four standard benchmarks, outperforming other methods leveraging geometric considerations.

MCML Authors

Luca Eyring

Karsten Roth

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Fabian Theis

Prof. Dr.

Principal Investigator

Mathematical Modelling of Biological Systems

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[53]

T. Uscidda, L. Eyring, K. Roth, F. J. Theis, Z. Akata and M. Cuturi.
Disentangled Representation Learning with the Gromov-Monge Gap.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL

MCML Authors

Luca Eyring

Karsten Roth

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Fabian Theis

Prof. Dr.

Principal Investigator

Mathematical Modelling of Biological Systems

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[52]

V. Udandarao, K. Roth, S. Dziadzio, A. Prabhu, M. Cherti, O. Vinyals, O. Hénaff, S. Albanie, Z. Akata and M. Bethge.
A Practitioner's Guide to Real-World Continual Multimodal Pretraining.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub

Abstract

Multimodal foundation models serve numerous applications at the intersection of vision and language. Still, despite being pretrained on extensive data, they become outdated over time. To keep models updated, research into continual pretraining mainly explores scenarios with either (1) infrequent, indiscriminate updates on large-scale new data, or (2) frequent, sample-level updates. However, practical model deployment often operates in the gap between these two limit cases, as real-world applications often demand adaptation to specific subdomains, tasks or concepts – spread over the entire, varying life cycle of a model. In this work, we complement current perspectives on continual pretraining through a research test bed as well as provide comprehensive guidance for effective continual model updates in such scenarios. We first introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements, constructed over 63 datasets with diverse visual and semantic coverage. Using FoMo-in-Flux, we explore the complex landscape of practical continual pretraining through multiple perspectives: (1) A data-centric investigation of data mixtures and stream orderings that emulate real-world deployment situations, (2) a method-centric investigation ranging from simple fine-tuning and traditional continual learning strategies to parameter-efficient updates and model merging, (3) meta learning rate schedules and mechanistic design choices, and (4) the influence of model and compute scaling. Together, our insights provide a practitioner’s guide to continual multimodal pretraining for real-world deployment.

MCML Authors

Karsten Roth

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[51]

N. P. A. Vu, A. Saroha, O. Litany and D. Cremers.
GAS-NeRF: Geometry-Aware Stylization of Dynamic Radiance Fields.
Preprint (Mar. 2025). arXiv

Abstract

Current 3D stylization techniques primarily focus on static scenes, while our world is inherently dynamic, filled with moving objects and changing environments. Existing style transfer methods primarily target appearance – such as color and texture transformation – but often neglect the geometric characteristics of the style image, which are crucial for achieving a complete and coherent stylization effect. To overcome these shortcomings, we propose GAS-NeRF, a novel approach for joint appearance and geometry stylization in dynamic Radiance Fields. Our method leverages depth maps to extract and transfer geometric details into the radiance field, followed by appearance transfer. Experimental results on synthetic and real-world datasets demonstrate that our approach significantly enhances the stylization quality while maintaining temporal coherence in dynamic scenes.

MCML Authors

Abhishek Saroha

Daniel Cremers

Prof. Dr.

Director

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[50]

S. Wu, S. Alaniz, E. Schulz and Z. Akata.
Discovering Chunks in Neural Embeddings for Interpretability.
Preprint (Feb. 2025). arXiv

Abstract

Understanding neural networks is challenging due to their high-dimensional, interacting components. Inspired by human cognition, which processes complex sensory data by chunking it into recurring entities, we propose leveraging this principle to interpret artificial neural population activities. Biological and artificial intelligence share the challenge of learning from structured, naturalistic data, and we hypothesize that the cognitive mechanism of chunking can provide insights into artificial systems. We first demonstrate this concept in recurrent neural networks (RNNs) trained on artificial sequences with imposed regularities, observing that their hidden states reflect these patterns, which can be extracted as a dictionary of chunks that influence network responses. Extending this to large language models (LLMs) like LLaMA, we identify similar recurring embedding states corresponding to concepts in the input, with perturbations to these states activating or inhibiting the associated concepts. By exploring methods to extract dictionaries of identifiable chunks across neural embeddings of varying complexity, our findings introduce a new framework for interpreting neural networks, framing their population activity as structured reflections of the data they process.

MCML Authors

Stephan Alaniz

Dr.

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

[49]

S. Wang, Q. Cheng, Q. Cheng, W. Zhang, S.-C. Wu, N. Zeller, D. Cremers and N. Navab.
VoxNeRF: Bridging Voxel Representation and Neural Radiance Fields for Enhanced Indoor View Synthesis.
IEEE Robotics and Automation Letters 10.6 (Jun. 2025). DOI

Abstract

The generation of high-fidelity view synthesis is essential for robotic navigation and interaction but remains challenging, particularly in indoor environments and real-time scenarios. Existing techniques often require significant computational resources for both training and rendering, and they frequently result in suboptimal 3D representations due to insufficient geometric structuring. To address these limitations, we introduce VoxNeRF, a novel approach that utilizes easy-to-obtain geometry priors to enhance both the quality and efficiency of neural indoor reconstruction and novel view synthesis. We propose an efficient voxel-guided sampling technique that allocates computational resources selectively to the most relevant segments of rays based on a voxel-encoded geometry prior, significantly reducing training and rendering time. Additionally, we incorporate a robust depth loss to improve reconstruction and rendering quality in sparse view settings. Our approach is validated with extensive experiments on ScanNet and ScanNet++ where VoxNeRF outperforms existing state-of-the-art methods and establishes a new benchmark for indoor immersive interpolation and extrapolation settings.

MCML Authors

Sen Wang

Qing Cheng

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Qing Cheng

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Principal Investigator

[48]

S. Wang, Q. Cheng, Q. Cheng, W. Zhang, S.-C. Wu, N. Zeller, D. Cremers and N. Navab.
VoxNeRF: Bridging Voxel Representation and Neural Radiance Fields for Enhanced Indoor View Synthesis.
IROS 2025 - IEEE/RSJ International Conference on Intelligent Robots and Systems. Hangzhou, China, Oct 19-25, 2025. To be published. Preprint available. DOI

MCML Authors

Sen Wang

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Qing Cheng

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Qing Cheng

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Principal Investigator

[47]

F. Wimbauer, W. Chen, D. Muhle, C. Rupprecht and D. Cremers.
AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

Estimating camera motion and intrinsics from casual videos is a core challenge in computer vision. Traditional bundle-adjustment based methods, such as SfM and SLAM, struggle to perform reliably on arbitrary data. Although specialized SfM approaches have been developed for handling dynamic scenes, they either require intrinsics or computationally expensive test-time optimization and often fall short in performance. Recently, methods like Dust3r have reformulated the SfM problem in a more data-driven way. While such techniques show promising results, they are still 1) not robust towards dynamic objects and 2) require labeled data for supervised training. As an alternative, we propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from a dynamic video sequence in feed-forward fashion. Our intuition is that such a network can learn strong priors over realistic camera motions. To scale up our training, we rely on an uncertainty-based loss formulation and pre-trained depth and flow networks instead of motion or trajectory supervision. This allows us to use diverse, unlabelled video datasets obtained mostly from YouTube. Additionally, we ensure that the predicted trajectory does not accumulate drift over time through a lightweight trajectory refinement step. We test AnyCam on established datasets, where it delivers accurate camera poses and intrinsics both qualitatively and quantitatively. Furthermore, even with trajectory refinement, AnyCam is significantly faster than existing works for SfM in dynamic settings. Finally, by combining camera information, uncertainty, and depth, our model can produce high-quality 4D pointclouds in a feed-forward fashion.

MCML Authors

Felix Wimbauer

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Weirong Chen

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Dominik Muhle

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[46]

S. Weber, N. Demmel, T. Chon Chan and D. Cremers.
Power Bundle Adjustment for Large-Scale 3D Reconstruction.
CVPR 2023 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada, Jun 18-23, 2023. DOI

Abstract

We introduce Power Bundle Adjustment as an expansion type algorithm for solving large-scale bundle adjustment problems. It is based on the power series expansion of the inverse Schur complement and constitutes a new family of solvers that we call inverse expansion methods. We theoretically justify the use of power series and we prove the convergence of our approach. Using the real-world BAL dataset we show that the proposed solver challenges the state-of-the-art iterative methods and significantly accelerates the solution of the normal equation, even for reaching a very high accuracy. This easy-to-implement solver can also complement a recently presented distributed bundle adjustment framework. We demonstrate that employing the proposed Power Bundle Adjustment as a subproblem solver significantly improves speed and accuracy of the distributed optimization.

MCML Authors

Simon Weber

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[45]

S. Weber, T. Dagès, M. Gao and D. Cremers.
Finsler-Laplace-Beltrami Operators with Application to Shape Analysis.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI

Abstract

The Laplace-Beltrami operator (LBO) emerges from studying manifolds equipped with a Riemannian metric. It is often called the Swiss army knife of geometry processing as it allows to capture intrinsic shape information and gives rise to heat diffusion, geodesic distances, and a multitude of shape descriptors. It also plays a central role in geometric deep learning. In this work, we explore Finsler manifolds as a generalization of Riemannian manifolds. We revisit the Finsler heat equation and derive a Finsler heat kernel and a Finsler-Laplace-Beltrami Operator (FLBO): a novel theoretically justified anisotropic Laplace-Beltrami operator (ALBO). In experimental evaluations we demonstrate that the proposed FLBO is a valuable alternative to the traditional Riemannian-based LBO and ALBOs for spatial filtering and shape correspondence estimation. We hope that the proposed Finsler heat kernel and the FLBO will inspire further exploration of Finsler geometry in the computer vision community.

MCML Authors

Simon Weber

Thomas Dagès

Dr.

Thomas Bayes Fellow

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Maolin Gao

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[44]

T. Wimmer, V. Golkov, H. Dang, M. Zaiss, A. Maier and D. Cremers.
Scale-Equivariant Deep Learning for 3D Data.
Preprint (Apr. 2023). arXiv GitHub

Abstract

The ability of convolutional neural networks (CNNs) to recognize objects regardless of their position in the image is due to the translation-equivariance of the convolutional operation. Group-equivariant CNNs transfer this equivariance to other transformations of the input. Dealing appropriately with objects and object parts of different scale is challenging, and scale can vary for multiple reasons such as the underlying object size or the resolution of the imaging modality. In this paper, we propose a scale-equivariant convolutional network layer for three-dimensional data that guarantees scale-equivariance in 3D CNNs. Scale-equivariance lifts the burden of having to learn each possible scale separately, allowing the neural network to focus on higher-level learning goals, which leads to better results and better data-efficiency. We provide an overview of the theoretical foundations and scientific work on scale-equivariant neural networks in the two-dimensional domain. We then transfer the concepts from 2D to the three-dimensional space and create a scale-equivariant convolutional layer for 3D data. Using the proposed scale-equivariant layer, we create a scale-equivariant U-Net for medical image segmentation and compare it with a non-scale-equivariant baseline method. Our experiments demonstrate the effectiveness of the proposed method in achieving scale-equivariance for 3D medical image analysis.

MCML Authors

Vladimir Golkov

Dr.

Daniel Cremers

Prof. Dr.

Director

→ Group Christian Wachinger
Artificial Intelligence in Medical Imaging

[43]

J. Wang, M. Ghahremani, Y. Li, B. Ommer and C. Wachinger.
Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub

Abstract

Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeleton human poses, especially in complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to gain accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representation during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons. Masked pose images are used to smoothly refine the attention maps based on target pose-related features in a hierarchical manner, transitioning from coarse to fine levels. Additionally, our loss function is formulated to allocate increased emphasis to the pose region, thereby augmenting the model’s precision in capturing intricate pose details. We assessed the performance of Stable-Pose across five public datasets under a wide range of indoor and outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 in the LAION-Human dataset, marking around 13% improvement over the established technique ControlNet.

MCML Authors

Morteza Ghahremani

Dr.

Yitong Li

→ Group Christian Wachinger
Artificial Intelligence in Medical Imaging

Björn Ommer

Prof. Dr.

Principal Investigator

Artificial Intelligence in Medical Imaging

Christian Wachinger

Prof. Dr.

Principal Investigator

[42]

S. Weber, J. H. Hong and D. Cremers.
Power Variable Projection for Initialization-Free Large-Scale Bundle Adjustment.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI

Abstract

Most Bundle Adjustment (BA) solvers like the Levenberg-Marquardt algorithm require a good initialization. Instead, initialization-free BA remains a largely uncharted territory. The under-explored Variable Projection algorithm (VarPro) exhibits a wide convergence basin even without initialization. Coupled with object space error formulation, recent works have shown its ability to solve small-scale initialization-free bundle adjustment problems. To make such initialization-free BA approaches scalable, we introduce Power Variable Projection (PoVar), extending a recent inverse expansion method based on power series. Importantly, we link the power series expansion to Riemannian manifold optimization. This projective framework is crucial to solve large-scale bundle adjustment problems without initialization. Using the real-world BAL dataset, we experimentally demonstrate that our solver achieves state-of-the-art results in terms of speed and accuracy. To our knowledge, this work is the first to address the scalability of BA without initialization, opening new avenues for initialization-free structure-from-motion.

MCML Authors

Simon Weber

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

[41]

J. Wang, Z. Qin, Y. Zhang, V. T. Hu, B. Ommer, R. Briq and S. Kesselheim.
Scaling Image Tokenizers with Grouped Spherical Quantization.
Preprint (Dec. 2024). arXiv

Abstract

Vision tokenizers have gained a lot of attraction due to their scalability and compactness; previous works depend on old-school GAN-based hyperparameters, biased comparisons, and a lack of comprehensive analysis of the scaling behaviours. To tackle those issues, we introduce Grouped Spherical Quantization (GSQ), featuring spherical codebook initialization and lookup regularization to constrain codebook latent to a spherical surface. Our empirical analysis of image tokenizer training strategies demonstrates that GSQ-GAN achieves superior reconstruction quality over state-of-the-art methods with fewer training iterations, providing a solid foundation for scaling studies. Building on this, we systematically examine the scaling behaviours of GSQ, specifically in latent dimensionality, codebook size, and compression ratios, and their impact on model performance. Our findings reveal distinct behaviours at high and low spatial compression levels, underscoring challenges in representing high-dimensional latent spaces. We show that GSQ can restructure high-dimensional latent into compact, low-dimensional spaces, thus enabling efficient scaling with improved quality. As a result, GSQ-GAN achieves a 16x down-sampling with a reconstruction FID (rFID) of 0.50.

MCML Authors

Vincent Tao Hu

Dr.

Björn Ommer

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[40]

Y. Wang, Y. Shen and D. Cremers.
Explicit pairwise factorized graph neural network for semi-supervised node classification.
UAI 2021 - Conference on Uncertainty in Artificial Intelligence. Virtual, Jul 27-29, 2021. URL

Abstract

Node features and structural information of a graph are both crucial for semi-supervised node classification problems. A variety of graph neural network (GNN) based approaches have been proposed to tackle these problems, which typically determine output labels through feature aggregation. This can be problematic, as it implies conditional independence of output nodes given hidden representations, despite their direct connections in the graph. To learn the direct influence among output nodes in a graph, we propose the Explicit Pairwise Factorized Graph Neural Network (EPFGNN), which models the whole graph as a partially observed Markov Random Field. It contains explicit pairwise factors to model output-output relations and uses a GNN backbone to model input-output relations. To balance model complexity and expressivity, the pairwise factors have a shared component and a separate scaling coefficient for each edge. We apply the EM algorithm to train our model, and utilize a star-shaped piecewise likelihood for the tractable surrogate objective. We conduct experiments on various datasets, which show that our model can effectively improve the performance for semi-supervised node classification on graphs.

MCML Authors

Yuesong Shen

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Director

[39]

P. Wenzel, T. Schön, L. Leal-Taixé and D. Cremers.
Vision-Based Mobile Robotics Obstacle Avoidance With Deep Reinforcement Learning.
ICRA 2021 - IEEE International Conference on Robotics and Automation. Xi’an, China, May 30-Jun 05, 2021. DOI

Abstract

Obstacle avoidance is a fundamental and challenging problem for autonomous navigation of mobile robots. In this paper, we consider the problem of obstacle avoidance in simple 3D environments where the robot has to solely rely on a single monocular camera. In particular, we are interested in solving this problem without relying on localization, mapping, or planning techniques. Most existing work considers obstacle avoidance as two separate problems, namely obstacle detection and control. Inspired by the recent advances of deep reinforcement learning in Atari games and understanding highly complex situations in Go, we tackle the obstacle avoidance problem as a data-driven end-to-end deep learning approach. Our approach takes raw images as input and generates control commands as output. We show that discrete action spaces outperform continuous control commands in terms of expected average reward in maze-like environments. Furthermore, we show how to accelerate the learning and increase the robustness of the policy by incorporating depth maps predicted by a generative adversarial network.

MCML Authors

Laura Leal-Taixé

Prof. Dr.

Principal Investigator

* Former Principal Investigator

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[38]

O. Wysocki, Y. Tan, T. Froech, Y. Xia, M. Wysocki, L. Hoegner, D. Cremers and C. Holst.
ZAHA: Introducing the Level of Facade Generalization and the Large-Scale Point Cloud Facade Semantic Segmentation Benchmark Dataset.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. DOI GitHub

Abstract

Facade semantic segmentation is a long-standing challenge in photogrammetry and computer vision. Although the last decades have witnessed the influx of facade segmentation methods, there is a lack of comprehensive facade classes and data covering the architectural variability. In ZAHA (project page: https://github.com/OloOcki/zaha), we introduce Level of Facade Generalization (LoFG), novel hierarchical facade classes designed based on international urban modeling standards, ensuring compatibility with real-world challenging classes and uniform methods’ comparison. Realizing the LoFG, we present to date the largest semantic 3D facade segmentation dataset, providing 601 million annotated points at five and 15 classes of LoFG2 and LoFG3, respectively. Moreover, we analyze the performance of baseline semantic segmentation methods on our introduced LoFG classes and data, complementing it with a discussion on the unresolved challenges for facade segmentation. We firmly believe that ZAHA shall facilitate further development of 3D facade semantic segmentation methods, enabling robust segmentation indispensable in creating urban digital twins.

MCML Authors

Yan Xia

Dr.

* Former Member

Magdalena Wysocki

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[37]

P. Wulff, F. Wimbauer, D. Muhle and D. Cremers.
Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images.
Preprint (Aug. 2025). arXiv

Abstract

Volumetric scene reconstruction from a single image is crucial for a broad range of applications like autonomous driving and robotics. Recent volumetric reconstruction methods achieve impressive results, but generally require expensive 3D ground truth or multi-view supervision. We propose to leverage pre-trained 2D diffusion models and depth prediction models to generate synthetic scene geometry from a single image. This can then be used to distill a feed-forward scene reconstruction model. Our experiments on the challenging KITTI-360 and Waymo datasets demonstrate that our method matches or outperforms state-of-the-art baselines that use multi-view supervision, and offers unique advantages, for example regarding dynamic scenes.

MCML Authors

Felix Wimbauer

Dominik Muhle

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[36]

F. Wimbauer, B. Wu, E. Schoenfeld, X. Dai, J. Hou, Z. He, A. Sanakoyeu, P. Zhang, S. Tsai, J. Kohler, C. Rupprecht, D. Cremers, P. Vajda and J. Wang.
Cache Me if You Can: Accelerating Diffusion Models through Block Caching.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI GitHub

Abstract

Diffusion models have recently revolutionized the field of image synthesis due to their ability to generate photorealistic images. However, one of the major drawbacks of diffusion models is that the image generation process is costly. A large image-to-image network has to be applied many times to iteratively refine an image from random noise. While many recent works propose techniques to reduce the number of required steps, they generally treat the underlying denoising network as a black box. In this work, we investigate the behavior of the layers within the network and find that 1) the layers’ output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small. We hypothesize that many layer computations in the denoising network are redundant. Leveraging this, we introduce block caching, in which we reuse outputs from layer blocks of previous steps to speed up inference. Furthermore, we propose a technique to automatically determine caching schedules based on each block’s changes over timesteps. In our experiments, we show through FID, human evaluation and qualitative analysis that Block Caching allows generating images with higher visual quality at the same computational cost. We demonstrate this for different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).

MCML Authors

Felix Wimbauer

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[35]

M. Weber, J. Xie, M. Collins, Y. Zhu, H. Adam, B. Green, A. Geiger, D. Cremers, A. Ošep, L. Leal-Taixé, P. Voigtlaender and B. Chen.
STEP: Segmenting and Tracking Every Pixel.
Track on Datasets and Benchmarks @NeurIPS 2021 - Track on Datasets and Benchmarks at the 35th Conference on Neural Information Processing Systems. Virtual, Dec 06-14, 2021. PDF

Abstract

The task of assigning semantic classes and track identities to every pixel in a video is called video panoptic segmentation. Our work is the first that targets this task in a real-world setting requiring dense interpretation in both spatial and temporal domains. As the ground-truth for this task is difficult and expensive to obtain, existing datasets are either constructed synthetically or only sparsely annotated within short video clips. To overcome this, we introduce a new benchmark encompassing two datasets, KITTI-STEP, and MOTChallenge-STEP. The datasets contain long video sequences, providing challenging examples and a test-bed for studying long-term pixel-precise segmentation and tracking under real-world conditions. We further propose a novel evaluation metric Segmentation and Tracking Quality (STQ) that fairly balances semantic and tracking aspects of this task and is more appropriate for evaluating sequences of arbitrary length. Finally, we provide several baselines to evaluate the status of existing methods on this new challenging dataset. We have made our datasets, metric, benchmark servers, and baselines publicly available, and hope this will inspire future research.

MCML Authors

Mark Weber

Daniel Cremers

Prof. Dr.

Director

→ Group Xiaoxiang Zhu
Data Science in Earth Observation

Laura Leal-Taixé

Prof. Dr.

Principal Investigator

* Former Principal Investigator

[34]

Y. Wang, Z. Xiong, C. Liu, A. J. Stewart, T. Dujardin, N. I. Bountos, A. Zavras, F. Gerken, I. Papoutsis, L. Leal-Taixé and X. Zhu.
Towards a Unified Copernicus Foundation Model for Earth Vision.
Preprint (Mar. 2025). arXiv GitHub

Abstract

Advances in Earth observation (EO) foundation models have unlocked the potential of big satellite data to learn generic representations from space, benefiting a wide range of downstream applications crucial to our planet. However, most existing efforts remain limited to fixed spectral sensors, focus solely on the Earth’s surface, and overlook valuable metadata beyond imagery. In this work, we take a step towards next-generation EO foundation models with three key components: 1) Copernicus-Pretrain, a massive-scale pretraining dataset that integrates 18.7M aligned images from all major Copernicus Sentinel missions, spanning from the Earth’s surface to its atmosphere; 2) Copernicus-FM, a unified foundation model capable of processing any spectral or non-spectral sensor modality using extended dynamic hypernetworks and flexible metadata encoding; and 3) Copernicus-Bench, a systematic evaluation benchmark with 15 hierarchical downstream tasks ranging from preprocessing to specialized applications for each Sentinel mission. Our dataset, model, and benchmark greatly improve the scalability, versatility, and multimodal adaptability of EO foundation models, while also creating new opportunities to connect EO, weather, and climate research.

MCML Authors

Chenying Liu

Franziska Gerken

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Xiaoxiang Zhu

Prof. Dr.

Principal Investigator

Data Science in Earth Observation

[33]

Q. Wu, Y. Xia, J. Wan and A. B. Chan.
Boosting 3D Single Object Tracking with 2D Matching Distillation and 3D Pre-training.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI

Abstract

3D single object tracking (SOT) is an essential task in autonomous driving and robotics. However, learning robust 3D SOT trackers remains challenging due to the limited category-specific point cloud data and the inherent sparsity and incompleteness of LiDAR scans. To tackle these issues, we propose a unified 3D SOT framework that leverages 3D generative pre-training and learns robust 3D matching abilities from 2D pre-trained foundation trackers. Our framework features a consistent target-matching architecture with the widely used 2D trackers, facilitating the transfer of 2D matching knowledge. Specifically, we first propose a lightweight Target-Aware Projection (TAP) module, allowing the pre-trained 2D tracker to work well on the projected point clouds without further fine-tuning. We then propose a novel IoU-guided matching-distillation framework that utilizes the powerful 2D pre-trained trackers to guide 3D matching learning in the 3D tracker, i.e., the 3D template-to-search matching should be consistent with its corresponding 2D template-to-search matching obtained from 2D pre-trained trackers. Our designs are applied to two mainstream 3D SOT frameworks: memory-less Siamese and contextual memory-based approaches, which are respectively named SiamDisst and MemDisst. Extensive experiments show that SiamDisst and MemDisst achieve state-of-the-art performance on KITTI, Waymo Open Dataset and nuScenes benchmarks, while running at above real-time speed of 25 and 90 FPS on an RTX3090 GPU.

MCML Authors

Yan Xia

Dr.

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[32]

F. Wimbauer, N. Yang, C. Rupprecht and D. Cremers.
Behind the Scenes: Density Fields for Single View Reconstruction.
CVPR 2023 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada, Jun 18-23, 2023. DOI

Abstract

Inferring a meaningful geometric scene representation from a single image is a fundamental problem in computer vision. Approaches based on traditional depth map prediction can only reason about areas that are visible in the image. Currently, neural radiance fields (NeRFs) can capture true 3D including color, but are too complex to be generated from a single image. As an alternative, we propose to predict an implicit density field from a single image. It maps every location in the frustum of the image to volumetric density. By directly sampling color from the available views instead of storing color in the density field, our scene representation becomes significantly less complex compared to NeRFs, and a neural network can predict it in a single forward pass. The network is trained through self-supervision from only video data. Our formulation allows volume rendering to perform both depth prediction and novel view synthesis. Through experiments, we show that our method is able to predict meaningful geometry for regions that are occluded in the input image. Additionally, we demonstrate the potential of our approach on three datasets for depth prediction and novel-view synthesis.
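The depth prediction via volume rendering mentioned above can be sketched for a single ray as follows. This is a minimal illustration of standard volume rendering, not the authors' implementation: the function name `render_depth`, the handling of the last sample's segment length, and the normalization by the accumulated weight are assumptions made here.

```python
import math


def render_depth(densities, depths):
    """Expected ray termination depth from densities sampled along one ray.

    densities[i] is the volumetric density at depths[i]; depths must be
    strictly increasing. Returns infinity if the ray hits nothing.
    """
    transmittance = 1.0  # probability that the ray has not been stopped yet
    expected_depth = 0.0
    total_weight = 0.0
    for i, (sigma, z) in enumerate(zip(densities, depths)):
        # Segment length to the next sample; the last sample reuses the
        # previous spacing (assumption for this sketch).
        if i + 1 < len(depths):
            delta = depths[i + 1] - z
        else:
            delta = z - depths[i - 1] if i > 0 else 1.0
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        expected_depth += weight * z
        total_weight += weight
        transmittance *= 1.0 - alpha
    return expected_depth / total_weight if total_weight > 0 else float("inf")


# Hypothetical ray: empty space, then a dense surface at depth 3.
print(render_depth([0.0, 0.0, 50.0, 0.0], [1.0, 2.0, 3.0, 4.0]))
```

For novel view synthesis, the same per-sample weights would be applied to colors sampled from the available views rather than to the depths.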

MCML Authors

Felix Wimbauer

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[31]

F. Wimbauer, N. Yang, L. von Stumberg, N. Zeller and D. Cremers.
MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments from a Single Moving Camera.
CVPR 2020 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual, Jun 14-19, 2020. DOI GitHub

Abstract

In this paper, we propose MonoRec, a semi-supervised monocular dense reconstruction architecture that predicts depth maps from a single moving camera in dynamic environments. MonoRec is based on a multi-view stereo setting which encodes the information of multiple consecutive images in a cost volume. To deal with dynamic objects in the scene, we introduce a MaskModule that predicts moving object masks by leveraging the photometric inconsistencies encoded in the cost volumes. Unlike other multi-view stereo methods, MonoRec is able to reconstruct both static and moving objects by leveraging the predicted masks. Furthermore, we present a novel multi-stage training scheme with a semi-supervised loss formulation that does not require LiDAR depth values. We carefully evaluate MonoRec on the KITTI dataset and show that it achieves state-of-the-art performance compared to both multi-view and single-view methods. With the model trained on KITTI, we furthermore demonstrate that MonoRec is able to generalize well to both the Oxford RobotCar dataset and the more challenging TUM-Mono dataset recorded by a handheld camera.

MCML Authors

Felix Wimbauer

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[30]

M. Weber, L. Yu, Q. Yu, X. Deng, X. Shen, D. Cremers and L.-C. Chen.
MaskBit: Embedding-free Image Generation via Bit Tokens.
Transactions on Machine Learning Research (Dec. 2024). Certifications: Reproducibility, Featured. URL

Abstract

Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN model for transitioning between latent space and image space, and a subsequent Transformer model for image generation within latent space - these frameworks offer promising avenues for image synthesis. In this study, we present two primary contributions: Firstly, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN. Secondly, a novel embedding-free generation network operating directly on bit tokens - a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed details. The second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark, with a compact generator model of mere 305M parameters.
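To make the idea of a bit token concrete, the sketch below quantizes each channel of a latent vector to a single bit by its sign and reads the resulting bits as an implicit codebook index. The abstract does not spell out the quantizer, so sign-based quantization, the helper names `to_bit_token` and `bits_to_index`, and the 4-channel example are all assumptions for illustration.

```python
def to_bit_token(latent):
    """Quantize each latent channel to one bit by its sign (+1.0 or -1.0)."""
    return [1.0 if value > 0 else -1.0 for value in latent]


def bits_to_index(bit_token):
    """Read the bits as a binary number, giving an implicit codebook index."""
    index = 0
    for bit in bit_token:
        index = (index << 1) | (1 if bit > 0 else 0)
    return index


latent = [0.7, -1.2, 0.1, -0.3]  # hypothetical 4-channel encoder output
bits = to_bit_token(latent)
print(bits, bits_to_index(bits))
```

In this reading, a generator can consume the bit vector directly, so no learned embedding table indexed by the token id is required.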

MCML Authors

Mark Weber

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[29]

M. Weber, L. Yu, Q. Yu, X. Deng, X. Shen, D. Cremers and L.-C. Chen.
MaskBit: Embedding-free Image Generation via Bit Tokens.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. Journal Track. URL URL

Abstract

MCML Authors

Mark Weber

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[28]

S. Weber, B. Zöngür, N. Araslanov and D. Cremers.
Flattening the Parent Bias: Hierarchical Semantic Segmentation in the Poincaré Ball.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI

Abstract

Hierarchy is a natural representation of semantic taxonomies, including the ones routinely used in image segmentation. Indeed, recent work on semantic segmentation reports improved accuracy from supervised training leveraging hierarchical label structures. Encouraged by these results, we revisit the fundamental assumptions behind that work. We postulate and then empirically verify that the reasons for the observed improvement in segmentation accuracy may be entirely unrelated to the use of the semantic hierarchy. To demonstrate this, we design a range of cross-domain experiments with a representative hierarchical approach. We find that on the new testing domains, a flat (non-hierarchical) segmentation network, in which the parents are inferred from the children, has superior segmentation accuracy to the hierarchical approach across the board. Complementing these findings and inspired by the intrinsic properties of hyperbolic spaces, we study a more principled approach to hierarchical segmentation using the Poincaré ball model. The hyperbolic representation largely outperforms the previous (Euclidean) hierarchical approach as well and is on par with our flat Euclidean baseline in terms of segmentation accuracy. However, it additionally exhibits surprisingly strong calibration quality of the parent nodes in the semantic hierarchy, especially on the more challenging domains. Our combined analysis suggests that the established practice of hierarchical segmentation may be limited to in-domain settings, whereas flat classifiers generalize substantially better, especially if they are modeled in the hyperbolic space.
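For readers unfamiliar with the Poincaré ball, the sketch below computes the standard hyperbolic distance between two points inside the unit ball (curvature -1). It illustrates the geometry only; how the paper turns such distances into class scores is not stated in the abstract, and the function name `poincare_distance` is chosen here.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


origin = [0.0, 0.0]
print(poincare_distance(origin, [0.5, 0.0]))  # moderate distance
print(poincare_distance(origin, [0.9, 0.0]))  # grows quickly near the boundary
```

Distances blow up towards the boundary of the ball, which leaves exponentially more room for leaf nodes than for their parents and is why hyperbolic space is a common choice for tree-like label structures.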

MCML Authors

Simon Weber

Nikita Araslanov

Dr.

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[27]

J. Xie, O. Dhaouadi, W. Chen, J. Meier, J. Kaiser and D. Cremers.
CoProU-VO: Combining Projected Uncertainty for End-to-End Unsupervised Monocular Visual Odometry.
Preprint (Aug. 2025). arXiv

Abstract

Visual Odometry (VO) is fundamental to autonomous navigation, robotics, and augmented reality, with unsupervised approaches eliminating the need for expensive ground-truth labels. However, these methods struggle when dynamic objects violate the static scene assumption, leading to erroneous pose estimations. We tackle this problem by uncertainty modeling, which is a commonly used technique that creates robust masks to filter out dynamic objects and occlusions without requiring explicit motion segmentation. Traditional uncertainty modeling considers only single-frame information, overlooking the uncertainties across consecutive frames. Our key insight is that uncertainty must be propagated and combined across temporal frames to effectively identify unreliable regions, particularly in dynamic scenes. To address this challenge, we introduce Combined Projected Uncertainty VO (CoProU-VO), a novel end-to-end approach that combines target frame uncertainty with projected reference frame uncertainty using a principled probabilistic formulation. Built upon vision transformer backbones, our model simultaneously learns depth, uncertainty estimation, and camera poses. Consequently, experiments on the KITTI and nuScenes datasets demonstrate significant improvements over previous unsupervised monocular end-to-end two-frame-based methods and exhibit strong performance in challenging highway scenes where other approaches often fail. Additionally, comprehensive ablation studies validate the effectiveness of cross-frame uncertainty propagation.

MCML Authors

Weirong Chen

Johannes Meier

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[26]

Y. Xia, R. Ding, Z. Qin, G. Zhan, K. Zhou, L. Yang, H. Dong and D. Cremers.
TARGO: Benchmarking Target-driven Object Grasping under Occlusions.
Preprint (Jul. 2024). arXiv GitHub

Abstract

Recent advances in predicting 6D grasp poses from a single depth image have led to promising performance in robotic grasping. However, previous grasping models face challenges in cluttered environments where nearby objects impact the target object’s grasp. In this paper, we first establish a new benchmark dataset for TARget-driven Grasping under Occlusions, named TARGO. We make the following contributions: 1) We are the first to study the occlusion level of grasping. 2) We set up an evaluation benchmark consisting of large-scale synthetic data and part of real-world data, and we evaluated five grasp models and found that even the current SOTA model suffers when the occlusion level increases, leaving grasping under occlusion still a challenge. 3) We also generate a large-scale training dataset via a scalable pipeline, which can be used to boost the performance of grasping under occlusion and generalized to the real world. 4) We further propose a transformer-based grasping model involving a shape completion module, termed TARGO-Net, which performs most robustly as occlusion increases.

MCML Authors

Yan Xia

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[25]

Y. Xie, V. Ehm, P. Roetzer, N. Amrani, M. Gao, F. Bernard and D. Cremers.
EchoMatch: Partial-to-Partial Shape Matching via Correspondence Reflection.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

Finding correspondences between 3D shapes is a crucial problem in computer vision and graphics. While most research has focused on finding correspondences in settings where at least one of the shapes is complete, the realm of partial-to-partial shape matching remains under-explored. Yet it is of importance since, in many applications, shapes are only observed partially due to occlusion or scanning. Finding correspondences between partial shapes comes with an additional challenge: We not only want to identify correspondences between points on either shape but also have to determine which points of each shape actually have a partner. To tackle this challenging problem, we present EchoMatch, a novel framework for partial-to-partial shape matching that incorporates the concept of correspondence reflection to enable an overlap prediction within a functional map framework. With this approach, we show that we can outperform current SOTA methods in challenging partial-to-partial shape matching problems.

MCML Authors

Viktoria Ehm

Maolin Gao

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[24]

Y. Xia, M. Gladkova, R. Wang, Q. Li, U. Stilla, J. F. Henriques and D. Cremers.
CASSPR: Cross Attention Single Scan Place Recognition.
ICCV 2023 - IEEE/CVF International Conference on Computer Vision. Paris, France, Oct 02-06, 2023. DOI

Abstract

Place recognition based on point clouds (LiDAR) is an important component for autonomous robots or self-driving vehicles. Current SOTA performance is achieved on accumulated LiDAR submaps using either point-based or voxel-based structures. While voxel-based approaches nicely integrate spatial context across multiple scales, they do not exhibit the local precision of point-based methods. As a result, existing methods struggle with fine-grained matching of subtle geometric features in sparse single-shot LiDAR scans. To overcome these limitations, we propose CASSPR as a method to fuse point-based and voxel-based approaches using cross attention transformers. CASSPR leverages a sparse voxel branch for extracting and aggregating information at lower resolution and a point-wise branch for obtaining fine-grained local information. CASSPR uses queries from one branch to try to match structures in the other branch, ensuring that both extract self-contained descriptors of the point cloud (rather than one branch dominating), but using both to inform the output global descriptor of the point cloud. Extensive experiments show that CASSPR surpasses the state-of-the-art by a large margin on several datasets (Oxford RobotCar, TUM, USyd). For instance, it achieves AR@1 of 85.6% on the TUM dataset, surpassing the strongest prior model by ~15%. Our code is publicly available.

MCML Authors

Yan Xia

Dr.

* Former Member

Mariia Gladkova

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

[23]

R. Xiao, S. Kim, M.-I. Georgescu, Z. Akata and S. Alaniz.
FLAIR: VLM with Fine-grained Language-informed Image Representations.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI GitHub

Abstract

CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details about an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance on both existing multimodal retrieval benchmarks and our newly introduced fine-grained retrieval task, which evaluates vision-language models’ ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information, including zero-shot semantic segmentation, outperforming models trained on billions of pairs.

MCML Authors

Rui Xiao

Sanghwan Kim

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Iuliana Georgescu

Dr.

* Former Member

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Principal Investigator

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Stephan Alaniz

Dr.

[22]

Y. Xia, Z. Li, Y.-J. Li, L. Shi, H. Cao, J. F. Henriques and D. Cremers.
UniLoc: Towards Universal Place Recognition Using Any Single Modality.
Preprint (Dec. 2024). arXiv GitHub

Abstract

To date, most place recognition methods focus on single-modality retrieval. While they perform well in specific environments, cross-modal methods offer greater flexibility by allowing seamless switching between map and query sources. They also promise to reduce computation requirements by using a unified model and to achieve greater sample efficiency by sharing parameters. In this work, we develop a universal solution to place recognition, UniLoc, that works with any single query modality (natural language, image, or point cloud). UniLoc leverages recent advances in large-scale contrastive learning, and learns by matching hierarchically at two levels: instance-level matching and scene-level matching. Specifically, we propose a novel Self-Attention based Pooling (SAP) module to evaluate the importance of instance descriptors when aggregated into a place-level descriptor. Experiments on the KITTI-360 dataset demonstrate the benefits of cross-modality for place recognition, achieving superior performance in cross-modal settings and competitive results also for uni-modal scenarios.

MCML Authors

Yan Xia

Dr.

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[21]

Y. Xia, Y. Lu, R. Song, O. Dhaouadi, J. F. Henriques and D. Cremers.
TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes.
ICCV 2025 - IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published. Preprint available. arXiv GitHub

Abstract

Cone Beam Computed Tomography (CBCT) is widely used in medical imaging. However, the limited number and intensity of X-ray projections make reconstruction an ill-posed problem with severe artifacts. NeRF-based methods have achieved great success in this task. However, they suffer from a local-global training mismatch between their two key components: the hash encoder and the neural network. Specifically, in each training step, only a subset of the hash encoder’s parameters is used (local sparse), whereas all parameters in the neural network participate (global dense). Consequently, hash features generated in each step are highly misaligned, as they come from different subsets of the hash encoder. These misalignments from different training steps are then fed into the neural network, causing repeated inconsistent global updates in training, which leads to unstable training, slower convergence, and degraded reconstruction quality. Aiming to alleviate the impact of this local-global optimization mismatch, we introduce a Normalized Hash Encoder, which enhances feature consistency and mitigates the mismatch. Additionally, we propose a Mapping Consistency Initialization (MCI) strategy that initializes the neural network before training by leveraging the global mapping property from a well-trained model. The initialized neural network exhibits improved stability during early training, enabling faster convergence and enhanced reconstruction performance. Our method is simple yet effective, requiring only a few lines of code while substantially improving training efficiency on 128 CT cases collected from 4 different datasets, covering 7 distinct anatomical regions.

MCML Authors

Yan Xia

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[20]

Y. Xia, L. Shi, Z. Ding, J. F. Henriques and D. Cremers.
Text2Loc: 3D Point Cloud Localization from Natural Language.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI GitHub

Abstract

We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions and introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition, followed by fine localization. In global place recognition, relational dynamics among each textual hint are captured in a hierarchical transformer with max-pooling (HTM), whereas a balance between positive and negative pairs is maintained using text-submap contrastive learning. Moreover, we propose a novel matching-free fine localization method to further refine the location predictions, which completely removes the need for complicated text-instance matching and is lighter, faster, and more accurate than previous methods. Extensive experiments show that Text2Loc improves the localization accuracy by up to 2× over the state-of-the-art on the KITTI360Pose dataset.

MCML Authors

Yan Xia

Dr.

* Former Member

Zifeng Ding

→ Group Volker Tresp
Database Systems, Data Mining and AI

Daniel Cremers

Prof. Dr.

Director

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

[19]

Y. Yeganeh, A. Farshad, I. Charisiadis, M. Hasny, M. Hartenberger, B. Ommer, N. Navab and E. Adeli.
Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. Highlight Paper. DOI

Abstract

Scaling by training on large datasets has been shown to enhance the quality and fidelity of image generation and manipulation with diffusion models; however, such large datasets are not always accessible in medical imaging due to cost and privacy issues, which contradicts one of the main applications of such models to produce synthetic samples where real data is scarce. Also, fine-tuning pre-trained general models has been a challenge due to the distribution shift between the medical domain and the pre-trained models. Here, we propose Latent Drift (LD) for diffusion models that can be adopted for any fine-tuning method to mitigate the issues faced by the distribution shift or employed in inference time as a condition. Latent Drifting enables diffusion models to be conditioned for medical images fitted for the complex task of counterfactual image generation, which is crucial to investigate how parameters such as gender, age, and adding or removing diseases in a patient would alter the medical images. We evaluate our method on three public longitudinal benchmark datasets of brain MRI and chest X-rays for counterfactual image generation. Our results demonstrate significant performance gains in various scenarios when combined with different fine-tuning schemes. The source code of this work will be publicly released upon its acceptance.

MCML Authors

Yousef Yeganeh

Azade Farshad

Dr.

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Björn Ommer

Prof. Dr.

Principal Investigator

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Principal Investigator

[18]

Y. Yeganeh, A. Farshad, G. Guevercin, A. Abu-zer, R. Xiao, Y. Tang, E. Adeli and N. Navab.
SCOPE: Structural Continuity Preservation for Medical Image Segmentation.
GRAIL @MICCAI 2023 - 5th Workshop on GRaphs in biomedicAl Image anaLysis at the 26th International Conference on Medical Image Computing and Computer Assisted Intervention. Vancouver, Canada, Oct 08-12, 2023. DOI

Abstract

Although the preservation of shape continuity and physiological anatomy is a natural assumption in the segmentation of medical images, it is often neglected by deep learning methods that mostly aim for the statistical modeling of input data as pixels rather than interconnected structures. In biological structures, however, organs are not separate entities; for example, in reality, a severed vessel is an indication of an underlying problem, but traditional segmentation models are not designed to strictly enforce the continuity of anatomy, potentially leading to inaccurate medical diagnoses. To address this issue, we propose a graph-based approach that enforces the continuity and connectivity of anatomical topology in medical images. Our method encodes the continuity of shapes as a graph constraint, ensuring that the network’s predictions maintain this continuity. We evaluate our method on two public benchmarks on retinal vessel segmentation, showing significant improvements in connectivity metrics compared to traditional methods while getting better or on-par performance on segmentation metrics.

MCML Authors

Yousef Yeganeh

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Azade Farshad

Dr.

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Rui Xiao

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Nassir Navab

Prof. Dr.

Principal Investigator

Computer Aided Medical Procedures & Augmented Reality

[17]

Y. Yeganeh, G. Güvercin, R. Xiao, A. Abuzer, E. Adeli, A. Farshad and N. Navab.
SCOPE: Structural Continuity Preservation for Retinal Vessel Segmentation.
GRAIL @MICCAI 2023 - 5th Workshop on GRaphs in biomedicAl Image anaLysis at the 26th International Conference on Medical Image Computing and Computer Assisted Intervention. Vancouver, Canada, Oct 08-12, 2023. DOI

Abstract

Although the preservation of shape continuity and physiological anatomy is a natural assumption in the segmentation of medical images, it is often neglected by deep learning methods that mostly aim for the statistical modeling of input data as pixels rather than interconnected structures. In biological structures, however, organs are not separate entities; for example, in reality, a severed vessel is an indication of an underlying problem, but traditional segmentation models are not designed to strictly enforce the continuity of anatomy, potentially leading to inaccurate medical diagnoses. To address this issue, we propose a graph-based approach that enforces the continuity and connectivity of anatomical topology in medical images. Our method encodes the continuity of shapes as a graph constraint, ensuring that the network’s predictions maintain this continuity. We evaluate our method on three public benchmarks of retinal vessel segmentation and one neuronal structure segmentation benchmark, showing significant improvements in connectivity metrics compared to previous works while getting better or on-par performance on segmentation metrics.

MCML Authors

Yousef Yeganeh

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Rui Xiao

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Azade Farshad

Dr.

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Principal Investigator

Computer Aided Medical Procedures & Augmented Reality

[16]

L. Yang, L. Hoyer, M. Weber, T. Fischer, D. Dai, L. Leal-Taixé, D. Cremers, M. Pollefeys and L. Van Gool.
MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub

Abstract

Unsupervised Domain Adaptation (UDA) is the task of bridging the domain gap between a labeled source domain, e.g., synthetic data, and an unlabeled target domain. We observe that current UDA methods show inferior results on fine structures and tend to oversegment objects with ambiguous appearance. To address these shortcomings, we propose to leverage geometric information, i.e., depth predictions, as depth discontinuities often coincide with segmentation boundaries. We show that naively incorporating depth into current UDA methods does not fully exploit the potential of this complementary information. To this end, we present MICDrop, which learns a joint feature representation by masking image encoder features while inversely masking depth encoder features. With this simple yet effective complementary masking strategy, we enforce the use of both modalities when learning the joint feature representation. To aid this process, we propose a feature fusion module to improve both global as well as local information sharing while being robust to errors in the depth predictions. We show that our method can be plugged into various recent UDA methods and consistently improve results across standard UDA benchmarks, obtaining new state-of-the-art performances.
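The complementary masking strategy can be sketched on a flat list of patch features: patches dropped from the image branch are exactly the ones kept in the depth branch. This is a simplified illustration under assumptions made here (random patch selection, features as plain lists, hypothetical names `complementary_masks` and `apply_mask`); the paper's feature fusion module is not modeled.

```python
import random


def complementary_masks(num_patches, image_drop_ratio, seed=0):
    """Return (image_mask, depth_mask) with depth_mask = 1 - image_mask."""
    rng = random.Random(seed)
    num_dropped = round(num_patches * image_drop_ratio)
    dropped = set(rng.sample(range(num_patches), num_dropped))
    image_mask = [0.0 if i in dropped else 1.0 for i in range(num_patches)]
    depth_mask = [1.0 - keep for keep in image_mask]  # inverse of the image mask
    return image_mask, depth_mask


def apply_mask(features, mask):
    """Zero out the feature vector of every patch whose mask entry is 0."""
    return [[value * keep for value in patch] for patch, keep in zip(features, mask)]


# Hypothetical features: 4 patches with 2 channels each.
image_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
depth_features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
image_mask, depth_mask = complementary_masks(4, image_drop_ratio=0.5)
print(apply_mask(image_features, image_mask))
print(apply_mask(depth_features, depth_mask))
```

Since every patch is visible in exactly one modality, a model trained on the masked pair cannot rely on the image alone and has to draw on the depth features as well.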

MCML Authors

Mark Weber

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[15]

W. Yuan, Q. Khan and V. Golkov.
Generation of Musical Timbres using a Text-Guided Diffusion Model.
Preprint (Apr. 2025). arXiv GitHub

Abstract

In recent years, text-to-audio systems have achieved remarkable success, enabling the generation of complete audio segments directly from text descriptions. While these systems also facilitate music creation, the element of human creativity and deliberate expression is often limited. In contrast, the present work allows composers, arrangers, and performers to create the basic building blocks for music creation: audio of individual musical notes for use in electronic instruments and DAWs. Through text prompts, the user can specify the timbre characteristics of the audio. We introduce a system that combines a latent diffusion model and multi-modal contrastive learning to generate musical timbres conditioned on text descriptions. By jointly generating the magnitude and phase of the spectrogram, our method eliminates the need for subsequently running a phase retrieval algorithm, as related methods do.

MCML Authors

Qadeer Khan

Vladimir Golkov

Dr.

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[14]

Z. Ye, G. Peyré, D. Cremers and P. Ablin.
Enhancing Hypergradients Estimation: A Study of Preconditioning and Reparameterization.
AISTATS 2024 - 27th International Conference on Artificial Intelligence and Statistics. Valencia, Spain, May 02-04, 2024. URL GitHub

Abstract

Bilevel optimization aims to optimize an outer objective function that depends on the solution to an inner optimization problem. It is routinely used in Machine Learning, notably for hyperparameter tuning. The conventional method to compute the so-called hypergradient of the outer problem is to use the Implicit Function Theorem (IFT). As a function of the error of the inner problem resolution, we study the error of the IFT method. We analyze two strategies to reduce this error: preconditioning the IFT formula and reparameterizing the inner problem. We give a detailed account of the impact of these two modifications on the error, highlighting the role played by higher-order derivatives of the functionals at stake. Our theoretical findings explain when super efficiency, namely reaching an error on the hypergradient that depends quadratically on the error on the inner problem, is achievable and compare the two approaches when this is impossible. Numerical evaluations on hyperparameter tuning for regression problems substantiate our theoretical findings.
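
As a rough illustration of the IFT hypergradient and its dependence on the inner-problem error, consider a scalar toy problem. Everything here is an assumption made for illustration (the quadratic inner objective, the outer loss, and the names `inner_solution` and `ift_hypergradient`); it is not taken from the paper. The inner problem is f(w, λ) = 0.5(w − a)² + 0.5λw², the outer loss is g(w) = 0.5(w − b)², and the IFT gives dg/dλ = −g′(w) · (∂²f/∂w²)⁻¹ · ∂²f/∂w∂λ.

```python
def inner_solution(lam, a):
    """Exact minimizer of f(w, lam) = 0.5*(w - a)**2 + 0.5*lam*w**2."""
    return a / (1.0 + lam)

def ift_hypergradient(w, lam, b):
    """IFT estimate of d/dlam of g(w*(lam)) with g(w) = 0.5*(w - b)**2,
    evaluated at a (possibly inexact) inner solution w."""
    grad_outer = w - b        # dg/dw
    hess_inner = 1.0 + lam    # d^2 f / dw^2
    cross_term = w            # d^2 f / (dw dlam)
    return -grad_outer * cross_term / hess_inner

a, b, lam = 2.0, 0.5, 1.0
w_star = inner_solution(lam, a)
exact = ift_hypergradient(w_star, lam, b)

# Hypergradient error when the inner problem is solved only up to eps
errors = []
for eps in (1e-1, 1e-2, 1e-3):
    approx = ift_hypergradient(w_star + eps, lam, b)
    errors.append(abs(approx - exact))
```

In this plain, unpreconditioned toy setup the hypergradient error shrinks roughly in proportion to the inner error `eps`. The abstract's "super efficiency" refers to the more favorable case where the dependence is quadratic.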

MCML Authors

Zhenzhang Ye

Dr.

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[13]

T. Yenamandra, A. Tewari, F. Bernard, H.-P. Seidel, M. Elgharib and D. Cremers.
i3DMM: Deep Implicit 3D Morphable Model of Human Heads.
CVPR 2021 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual, Jun 19-25, 2021. DOI

Abstract

We present the first deep implicit 3D morphable model (i3DMM) of full heads. Unlike earlier morphable face models, it not only captures identity-specific geometry, texture, and expressions of the frontal face, but also models the entire head, including hair. We collect a new dataset consisting of 64 people with different expressions and hairstyles to train i3DMM. Our approach has the following favorable properties: (i) It is the first full head morphable model that includes hair. (ii) In contrast to mesh-based models, it can be trained on merely rigidly aligned scans, without requiring difficult non-rigid registration. (iii) We design a novel architecture to decouple the shape model into an implicit reference shape and a deformation of this reference shape. With that, dense correspondences between shapes can be learned implicitly. (iv) This architecture allows us to semantically disentangle the geometry and color components, as color is learned in the reference space. Geometry is further disentangled as identity, expressions, and hairstyle, while color is disentangled as identity and hairstyle components. We show the merits of i3DMM using ablation studies, comparisons to state-of-the-art models, and applications such as semantic head editing and texture transfer. We will make our model publicly available.

MCML Authors

Tarun Yenamandra

Daniel Cremers

Prof. Dr.

Director

[12]

T. Tewari, N. Yang, F. Bernard, C. Theobalt and D. Cremers.
FIRe: Fast Inverse Rendering Using Directional and Signed Distance Functions.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI

Abstract

Neural 3D implicit representations learn priors that are useful for diverse applications, such as single- or multiple-view 3D reconstruction. A major downside of existing approaches while rendering an image is that they require evaluating the network multiple times per camera ray so that the high computational time forms a bottleneck for downstream applications. We address this problem by introducing a novel neural scene representation that we call the directional distance function (DDF). To this end, we learn a signed distance function (SDF) along with our DDF model to represent a class of shapes. Specifically, our DDF is defined on the unit sphere and predicts the distance to the surface along any given direction. Therefore, our DDF allows rendering images with just a single network evaluation per camera ray. Based on our DDF, we present a novel fast algorithm (FIRe) to reconstruct 3D shapes given a posed depth map. We evaluate our proposed method on 3D reconstruction from single-view depth images, where we empirically show that our algorithm reconstructs 3D shapes more accurately and it is more than 15 times faster (per iteration) than competing methods.
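
To make the cost difference concrete, the sketch below contrasts sphere tracing an SDF (several evaluations per ray) with a single directional-distance query. It is only an analogy under stated assumptions: the shape is an analytic unit sphere rather than a learned network, and `ddf_sphere` is a closed-form ray-sphere intersection standing in for a learned DDF. Unlike the paper's DDF, which is defined on the unit sphere, it accepts an arbitrary ray origin. All function names are hypothetical.

```python
import math

RADIUS = 1.0

def sdf_sphere(p):
    """Signed distance from point p to a sphere of radius RADIUS at the origin."""
    return math.sqrt(sum(c * c for c in p)) - RADIUS

def sphere_trace(origin, direction, max_steps=100, tol=1e-6):
    """March along the ray using the SDF.
    Returns (distance along the ray, number of SDF evaluations)."""
    t = 0.0
    for step in range(1, max_steps + 1):
        p = [o + t * d for o, d in zip(origin, direction)]
        dist = sdf_sphere(p)
        if dist < tol:
            return t, step
        t += dist
    return None, max_steps

def ddf_sphere(origin, direction):
    """Analytic stand-in for a DDF: distance to the surface along a unit
    direction in one evaluation. Returns None if the ray misses."""
    b = sum(o * d for o, d in zip(origin, direction))
    c = sum(o * o for o in origin) - RADIUS ** 2
    disc = b * b - c
    if disc < 0:
        return None
    return -b - math.sqrt(disc)

origin = [0.5, 0.0, -3.0]
direction = [0.0, 0.0, 1.0]  # unit length
t_sdf, n_evals = sphere_trace(origin, direction)
t_ddf = ddf_sphere(origin, direction)
```

Both queries agree on the hit distance up to the tracing tolerance, but the SDF route needs `n_evals` function evaluations where the directional query needs one.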

MCML Authors

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[11]

Y. Yuan, Y. Xia, D. Cremers and M. Sester.
SparseAlign: a Fully Sparse Framework for Cooperative Object Detection.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

Cooperative perception can increase the view field and decrease the occlusion of an ego vehicle, hence improving the perception performance and safety of autonomous driving. Despite the success of previous works on cooperative object detection, they mostly operate on dense Bird’s Eye View (BEV) feature maps, which are computationally demanding and can hardly be extended to long-range detection problems. More efficient fully sparse frameworks are rarely explored. In this work, we design a fully sparse framework, SparseAlign, with three key features: an enhanced sparse 3D backbone, a query-based temporal context learning module, and a robust detection head specially tailored for sparse features. Extensive experimental results on both OPV2V and DairV2X datasets show that our framework, despite sparsity, outperforms the state of the art with lower communication bandwidth requirements. In addition, experiments on the OPV2Vt and DairV2Xt datasets for time-aligned cooperative object detection also show a significant performance gain compared to the baseline works.

MCML Authors

Yan Xia

Dr.

* Former Member

Daniel Cremers

Prof. Dr.

Director

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

[10]

Y. Yeganeh, R. Xiao, G. Guvercin, N. Navab and A. Farshad.
Conformable Convolution for Topologically Aware Learning of Complex Anatomical Structures.
Preprint (Dec. 2024). arXiv

Abstract

While conventional computer vision emphasizes pixel-level and feature-based objectives, medical image analysis of intricate biological structures necessitates explicit representation of their complex topological properties. Despite their successes, deep learning models often struggle to accurately capture the connectivity and continuity of fine, sometimes pixel-thin, yet critical structures due to their reliance on implicit learning from data. Such shortcomings can significantly impact the reliability of analysis results and hinder clinical decision-making. To address this challenge, we introduce Conformable Convolution, a novel convolutional layer designed to explicitly enforce topological consistency. Conformable Convolution learns adaptive kernel offsets that preferentially focus on regions of high topological significance within an image. This prioritization is guided by our proposed Topological Posterior Generator (TPG) module, which leverages persistent homology. The TPG module identifies key topological features and guides the convolutional layers by applying persistent homology to feature maps transformed into cubical complexes. Our proposed modules are architecture-agnostic, enabling them to be integrated seamlessly into various architectures. We showcase the effectiveness of our framework in the segmentation task, where preserving the interconnectedness of structures is critical. Experimental results on three diverse datasets demonstrate that our framework effectively preserves the topology in the segmentation downstream task, both quantitatively and qualitatively.

MCML Authors

Yousef Yeganeh

Rui Xiao

→ Group Zeynep Akata
Interpretable and Reliable Machine Learning

Nassir Navab

Prof. Dr.

Principal Investigator

Computer Aided Medical Procedures & Augmented Reality

Azade Farshad

Dr.

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

[9]

W. Zhang, Q. Cheng, D. Skuddis, N. Zeller, D. Cremers and N. Haala.
HI-SLAM2: Geometry-Aware Gaussian SLAM for Fast Monocular Scene Reconstruction.
Preprint (Nov. 2024). arXiv GitHub

Abstract

We present HI-SLAM2, a geometry-aware Gaussian SLAM system that achieves fast and accurate monocular scene reconstruction using only RGB input. Existing Neural SLAM or 3DGS-based SLAM methods often trade off between rendering quality and geometry accuracy; our research demonstrates that both can be achieved simultaneously with RGB input alone. The key idea of our approach is to enhance the ability for geometry estimation by combining easy-to-obtain monocular priors with learning-based dense SLAM, and then using 3D Gaussian splatting as our core map representation to efficiently model the scene. Upon loop closure, our method ensures on-the-fly global consistency through efficient pose graph bundle adjustment and instant map updates by explicitly deforming the 3D Gaussian units based on anchored keyframe updates. Furthermore, we introduce a grid-based scale alignment strategy to maintain improved scale consistency in prior depths for finer depth details. Through extensive experiments on Replica, ScanNet, and ScanNet++, we demonstrate significant improvements over existing Neural SLAM methods and even surpass RGB-D-based methods in both reconstruction and rendering quality.

MCML Authors

Qing Cheng

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[8]

M. Zaiss, H. N. Dang, V. Golkov, J. R. Rajput, D. Cremers, F. Knoll and A. Maier.
GPT4MR: Exploring GPT-4 as an MR Sequence and Reconstruction Programming Assistant.
ESMRMB 2023 - 39th Annual Meeting of the European Society for Magnetic Resonance in Medicine and Biology. Basel, Switzerland, Oct 04-07, 2023. URL

Abstract

In this study, we explore the potential of generative pre-trained transformers (GPT) as a coding assistant for MRI sequence programming using the Pulseq framework. The programming of MRI sequences is traditionally a complex and time-consuming task, and the Pulseq standard has recently simplified this process. It allows researchers to define and generate complex pulse sequences used in MRI experiments. Leveraging GPT-4’s capabilities in natural language generation, we adapted it for MRI sequence programming, creating a specialized assistant named GPT4MR. Our tests involved generating various MRI sequences, revealing that GPT-4, guided by a tailored prompt, outperformed GPT-3.5, producing fewer errors and demonstrating improved reasoning. Despite limitations in handling complex sequences, GPT4MR corrected its own errors and successfully generated code with step-by-step instructions. The study showcases GPT4MR’s ability to accelerate MRI sequence development, even for novel ideas absent in its training set. While further research and improvement are needed to address complexity limitations, a well-designed prompt enhances performance. The findings propose GPT4MR as a valuable MRI sequence programming assistant, streamlining prototyping and development. The future prospect involves integrating a PyPulseq plugin into lightweight, open-source LLMs, potentially revolutionizing MRI sequence development and prototyping.

MCML Authors

Vladimir Golkov

Dr.

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[7]

M. Zaiss, J. R. Rajput, H. N. Dang, V. Golkov, D. Cremers, F. Knoll and A. Maier.
GPT4MR: Exploring GPT-4 as an MR Sequence and Reconstruction Programming Assistant.
BVM 2024 - German Conference on Medical Image Computing - Bildverarbeitung für die Medizin. Erlangen, Germany, Mar 10-12, 2024. DOI

MCML Authors

Vladimir Golkov

Dr.

Daniel Cremers

Prof. Dr.

Director

→ Group Volker Tresp
Database Systems, Data Mining and AI

[6]

G. Zhang, M. L. A. Fok, J. Ma, Y. Xia, D. Cremers, P. Torr, V. Tresp and J. Gu.
Localizing Events in Videos with Multimodal Queries.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. DOI

Abstract

Video understanding is a pivotal task in the digital era, yet the dynamic and multievent nature of videos makes them labor-intensive and computationally demanding to process. Thus, localizing a specific event given a semantic query has gained importance in both user-oriented applications like video search and academic research into video foundation models. A significant limitation in current research is that semantic queries are typically in natural language that depicts the semantics of the target event. This setting overlooks the potential for multimodal semantic queries composed of images and texts. To address this gap, we introduce a new benchmark, ICQ, for localizing events in videos with multimodal queries, along with a new evaluation dataset ICQ-Highlight. Our new benchmark aims to evaluate how well models can localize an event given a multimodal semantic query that consists of a reference image, which depicts the event, and a refinement text to adjust the images’ semantics. To systematically benchmark model performance, we include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains. We propose 3 adaptation methods that tailor existing models to our new setting and evaluate 10 SOTA models, ranging from specialized to large-scale foundation models. We believe this benchmark is an initial step toward investigating multimodal queries in video event localization.

MCML Authors

Gengyuan Zhang

Yan Xia

Dr.

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

Volker Tresp

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[5]

H. Zeng, M. Gao and D. Cremers.
CoE: Deep Coupled Embedding for Non-Rigid Point Cloud Correspondences.
3DV 2025 - 12th International Conference on 3D Vision. Singapore, Mar 25-28, 2025. DOI

Abstract

The interest in matching non-rigidly deformed shapes represented as raw point clouds is rising due to the proliferation of low-cost 3D sensors. Yet, the task is challenging since point clouds are irregular and there is a lack of intrinsic shape information. We propose to tackle these challenges by learning a new shape representation – a per-point high dimensional embedding, in an embedding space where semantically similar points share similar embeddings. The learned embedding has multiple beneficial properties: it is aware of the underlying shape geometry and is robust to shape deformations and various shape artefacts, such as noise and partiality. Consequently, this embedding can be directly employed to retrieve high-quality dense correspondences through a simple nearest neighbor search in the embedding space. Extensive experiments demonstrate new state-of-the-art results and robustness in numerous challenging non-rigid shape matching benchmarks and show its great potential in other shape analysis tasks, such as segmentation.
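
The retrieval step mentioned in this abstract, a nearest neighbor search in the embedding space, can be sketched as follows. The embeddings here are made-up 2D toy values and the brute-force search and function name are illustrative assumptions. In practice the embeddings would come from the learned network and be high dimensional, and a spatial index would likely replace the quadratic loop.

```python
def nearest_neighbor_correspondences(src_emb, tgt_emb):
    """For each source embedding, return the index of the closest target
    embedding under squared Euclidean distance (brute force)."""
    matches = []
    for s in src_emb:
        best = min(
            range(len(tgt_emb)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(s, tgt_emb[j])),
        )
        matches.append(best)
    return matches

# Hypothetical per-point embeddings for two small point clouds
src = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [1.1, 0.9], [0.1, 1.1]]
matches = nearest_neighbor_correspondences(src, tgt)  # -> [2, 0, 1]
```

Each entry `matches[i]` is the index of the target point assigned to source point `i`. No explicit deformation model is needed at this stage, since the embedding is assumed to place corresponding points close together.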

MCML Authors

Maolin Gao

Daniel Cremers

Prof. Dr.

Director

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

[4]

D. Zhu, Q. Khan and D. Cremers.
Multi-vehicle trajectory prediction and control at intersections using state and intention information.
Neurocomputing 574 (Jan. 2024). DOI GitHub

Abstract

Traditional deep learning approaches for predicting the future trajectory of multiple road agents rely on knowing information about their past trajectory. In contrast, this work utilizes information about only the current state and intended direction to predict the future trajectory of multiple vehicles at intersections. Incorporating intention information has two distinct advantages: (1) It allows not just predicting the future trajectory but also controlling the multiple vehicles. (2) By manipulating the intention, the interaction among the vehicles is adapted accordingly to achieve desired behavior. Both these advantages would otherwise not be possible using only past trajectory information. Our model utilizes message passing of information between the vehicle nodes for a more holistic overview of the environment, resulting in better trajectory prediction and control of the vehicles. This work also provides a thorough investigation and discussion into the disparity between offline and online metrics for the task of multi-agent control. We particularly show why conducting only offline evaluation would not suffice, thereby necessitating online evaluation. We demonstrate the superiority of utilizing intention information rather than past trajectory in online scenarios. Lastly, we show the capability of our method in adapting to different domains through experiments conducted on two distinct simulation platforms, i.e., SUMO and CARLA.

MCML Authors

Dekai Zhu

Qadeer Khan

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Director

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[3]

G. Zhang, S. Qian, X. Wang and D. Cremers.
ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association.
Preprint (Sep. 2025). arXiv GitHub

Abstract

We present ViSTA-SLAM, a real-time monocular visual SLAM system that operates without requiring camera intrinsics, making it broadly applicable across diverse camera setups. At its core, the system employs a lightweight symmetric two-view association (STA) model as the frontend, which simultaneously estimates relative camera poses and regresses local pointmaps from only two RGB images. This design reduces model complexity significantly (the size of our frontend is only 35% that of comparable state-of-the-art methods) while enhancing the quality of two-view constraints used in the pipeline. In the backend, we construct a specially designed Sim(3) pose graph that incorporates loop closures to address accumulated drift. Extensive experiments demonstrate that our approach achieves superior performance in both camera tracking and dense 3D reconstruction quality compared to current methods.

MCML Authors

Ganlin Zhang

Shenhan Qian

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Xi Wang

Dr.

JRG Leader Egocentric Vision

Daniel Cremers

Prof. Dr.

Director

→ Group Almut Sophia Koepke
Computer Vision & Artificial Intelligence

[2]

D. Zverev, T. Wiedemer, A. Prabhu, M. Bethge, W. Brendel and A. S. Koepke.
VGGSounder: Audio-Visual Evaluations for Foundation Models.
Sight and Sound @CVPR 2025 - Workshop Sight and Sound at IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. PDF

Abstract

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The classification dataset VGGSound is commonly used as a benchmark for evaluating audio-visual understanding. However, our analysis identifies several critical issues in VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These flaws lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set extending VGGSound that is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance and revealing previously unnoticed model limitations. VGGSounder offers a robust benchmark supporting the future development of audio-visual foundation models.

MCML Authors

Daniil Zverev

Almut Sophia Koepke

Dr.

JRG Leader Multi-Modal Learning

→ Group Almut Sophia Koepke
Computer Vision & Artificial Intelligence

[1]

D. Zverev, T. Wiedemer, A. Prabhu, M. Bethge, W. Brendel and A. S. Koepke.
VGGSounder: Audio-Visual Evaluations for Foundation Models.
ICCV 2025 - IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published.

MCML Authors

Daniil Zverev

Almut Sophia Koepke

Dr.

JRG Leader Multi-Modal Learning

B2 | Natural Language Processing

Natural Language Processing (NLP) focuses on understanding and generating natural language text, greatly influenced by recent advances in deep learning. Despite substantial progress, our MCML researchers address key challenges like enhancing deep language understanding through structural biases, developing common sense in models through experimental environments, and improving sample efficiency for more effective learning from large datasets.

Alexander Fraser

Prof. Dr.

Principal Investigator

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

Barbara Plank

Prof. Dr.

Principal Investigator

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

Publications in Research Area B2

[296]

E. Artemova, V. Blaschke and B. Plank.
Exploring the Robustness of Task-oriented Dialogue Systems for Colloquial German Varieties.
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL

Abstract

Mainstream cross-lingual task-oriented dialogue (ToD) systems leverage the transfer learning paradigm by training a joint model for intent recognition and slot-filling in English and applying it, zero-shot, to other languages. We address a gap in prior research, which often overlooked the transfer to lower-resource colloquial varieties due to limited test data. Inspired by prior work on English varieties, we craft and manually evaluate perturbation rules that transform German sentences into colloquial forms and use them to synthesize test sets in four ToD datasets. Our perturbation rules cover 18 distinct language phenomena, enabling us to explore the impact of each perturbation on slot and intent performance. Using these new datasets, we conduct an experimental evaluation across six different transformers. Here, we demonstrate that when applied to colloquial varieties, ToD systems maintain their intent recognition performance, losing 6% (4.62 percentage points) in accuracy on average. However, they exhibit a significant drop in slot detection, with a decrease of 31% (21 percentage points) in slot F1 score. Our findings are further supported by a transfer experiment from Standard American English to synthetic Urban African American Vernacular English.

MCML Authors

Verena Blaschke

Barbara Plank

Prof. Dr.

Principal Investigator

[295]

J. O. Alabi, M. A. Hedderich, D. I. Adelani and D. Klakow.
Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead.
EMNLP 2025 - Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025. To be published. Preprint available. arXiv

Abstract

With over 2,000 languages and potentially millions of speakers, Africa represents one of the richest linguistic regions in the world. Yet, this diversity is scarcely reflected in state-of-the-art natural language processing (NLP) systems and large language models (LLMs), which predominantly support a narrow set of high-resource languages. This exclusion not only limits the reach and utility of modern NLP technologies but also risks widening the digital divide across linguistic communities. Nevertheless, NLP research on African languages is active and growing. In recent years, there has been a surge of interest in this area, driven by several factors, including the creation of multilingual language resources, the rise of community-led initiatives, and increased support through funding programs. In this survey, we analyze 734 research papers on NLP for African languages published over the past five years, offering a comprehensive overview of recent progress across core tasks. We identify key trends shaping the field and conclude by outlining promising directions to foster more inclusive and sustainable NLP research for African languages.

MCML Authors

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

→ Group Alexander Fraser
Data Analytics & Statistics

[294]

H. Asadpour, S. Okabe and A. Fraser.
A Practical Tool to Help Automate Interlinear Glossing: a Study on Mukrī Kurdish.
Field Matters @ACL 2025 - 4th Workshop on NLP Applications to Field Linguistics at the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Interlinear gloss generation aims to predict linguistic annotations (gloss) for a sentence in a language that is usually under ongoing documentation. Such output is a first draft for the linguist to work with and should reduce the manual workload. This article studies a simple glossing pipeline based on a Conditional Random Field and applies it to a small fieldwork corpus in Mukrī Kurdish, a variety of Central Kurdish. We mainly focus on making the tool as accessible as possible for field linguists, so it can run on standard computers without the need for GPUs. Our pipeline predicts common grammatical patterns robustly and, more generally, frequent combinations of morphemes and glosses. Although more advanced neural models do reach better results, our feature-based system still manages to be competitive and to provide interpretability. To foster further collaboration between field linguistics and NLP, we also provide some recommendations regarding documentation endeavours and release our pipeline code alongside.

MCML Authors

Shu Okabe

Dr.

Alexander Fraser

Prof. Dr.

Principal Investigator

[293]

E. Asgari, M. J. Sabet, P. Dufter, C. Ringlstetter and H. Schütze.
Subword Sampling for Low Resource Word Alignment.
Preprint (Dec. 2020). arXiv

Abstract

Annotation projection is an important area in NLP that can greatly contribute to creating language resources for low-resource languages. Word alignment plays a key role in this setting. However, most of the existing word alignment methods are designed for a high-resource setting in machine translation where millions of parallel sentences are available. This amount reduces to a few thousand sentences when dealing with low-resource languages, causing the established IBM models to fail. In this paper, we propose subword sampling-based alignment of text units. This method’s hypothesis is that the aggregation of different granularities of text for certain language pairs can help word-level alignment. For certain languages for which gold-standard alignments exist, we propose an iterative Bayesian optimization framework to optimize selecting possible subwords from the space of possible subword representations of the source and target sentences. We show that the subword sampling method consistently outperforms word-level alignment on six language pairs: English-German, English-French, English-Romanian, English-Persian, English-Hindi, and English-Inuktitut. In addition, we show that the hyperparameters learned for certain language pairs can be applied to other languages without supervision and consistently improve the alignment results. We observe that using 5K parallel sentences together with our proposed subword sampling approach, we obtain similar F1 scores to the use of hundreds of thousands of parallel sentences in existing word-level fast-align/eflomal alignment methods.

MCML Authors

Masoud Jalili Sabet

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Bernd Bischl
Statistical Learning and Data Science

[292]

M. Aßenmacher, A. Stephan, L. Weissweiler, E. Çano, I. Ziegler, M. Härttrich, B. Bischl, B. Roth, C. Heumann and H. Schütze.
Collaborative Development of Modular Open Source Educational Resources for Natural Language Processing.
TeachingNLP @ACL 2024 - 6th Workshop on Teaching NLP at the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. URL

Abstract

In this work, we present a collaboratively and continuously developed open-source educational resource (OSER) for teaching natural language processing at two different universities. We shed light on the principles we followed for the initial design of the course and the rationale for ongoing developments, followed by a reflection on the inter-university collaboration for designing and maintaining teaching material. When reflecting on the latter, we explicitly emphasize the considerations that need to be made when facing heterogeneous groups and when having to accommodate multiple examination regulations within one single course framework. Relying on the fundamental principles of OSER developments as defined by Bothmann et al. (2023) proved to be an important guideline during this process. The final part pertains to open-sourcing our teaching material, coping with the increasing speed of developments in the field, and integrating the course digitally, also addressing conflicting priorities and challenges we are currently facing.

MCML Authors

Matthias Aßenmacher

Dr.

Leonie Weissweiler

* Former Member

Bernd Bischl

Prof. Dr.

Director

Hinrich Schütze

Prof. Dr.

Principal Investigator

[291]

J. Baan, W. Aziz, B. Plank and R. Fernandez.
Stop Measuring Calibration When Humans Disagree.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Calibration is a popular framework to evaluate whether a classifier knows when it does not know - i.e., its predictive probabilities are a good indication of how likely a prediction is to be correct. Correctness is commonly estimated against the human majority class. Recently, calibration to human majority has been measured on tasks where humans inherently disagree about which class applies. We show that measuring calibration to human majority given inherent disagreements is theoretically problematic, demonstrate this empirically on the ChaosNLI dataset, and derive several instance-level measures of calibration that capture key statistical properties of human judgements - including class frequency, ranking and entropy.
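
One way to see why calibration against the human majority can be problematic is the toy computation below. It is an illustrative assumption, not the paper's derivation or its proposed measures: the annotator split, the model probabilities, and the helper names are hypothetical. It computes the standard expected calibration error (ECE) with correctness judged against the majority class, alongside the entropy of the human label distribution.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence and average |accuracy - confidence|
    over bins, weighted by the fraction of predictions in each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(accuracy - avg_conf)
    return ece

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# Hypothetical setup: on every item annotators split 60/40 over two classes,
# and the model reproduces that distribution exactly.
human_dist = [0.6, 0.4]
model_conf = [0.6] * 5          # confidence in the predicted (majority) class
is_majority = [True] * 5        # prediction always equals the human majority
ece = expected_calibration_error(model_conf, is_majority)
human_entropy = entropy(human_dist)
```

In this constructed case the model mirrors the human judgement distribution, yet its ECE against the majority label is about 0.4, because accuracy is 1.0 while confidence is 0.6. The nonzero `human_entropy` is the kind of instance-level property that a majority-based score ignores.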

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[290]

A. Bavaresco, R. Bernardi, L. Bertolazzi, D. Elliott, R. Fernández, A. Gatt, E. Ghaleb, M. Giulianelli, M. Hanna, A. Koller, A. F. T. Martins, P. Mondorf, V. Neplenbroek, S. Pezzelle, B. Plank, D. Schlangen, A. Suglia, A. K. S. Aditya K. Surikuchi, E. Takmaz and A. Testoni.
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.

MCML Authors

Philipp Mondorf

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Enkelejda Kasneci
Human-Centered Technologies for Learning

[289]

I. Bueno, A. Bavaresco, J. M. Cunha and P. Wicke.
Testing Spatial Intuitions of Humans and Large Language and Multimodal Models in Analogies.
Analogy-Angle II @ACL 2025 - 2nd Workshop on Analogical Abstraction in Cognition, Perception, and Language at the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Language and Vision-Language Models exhibit impressive language capabilities akin to human reasoning. However, unlike humans who acquire language through embodied, interactive experiences, these models learn from static datasets without real-world interaction. This difference raises questions about how they conceptualize abstract notions and whether their reasoning aligns with human cognition. We investigate spatial conceptualizations of LLMs and VLMs by conducting analogy prompting studies with LLMs, VLMs, and human participants. We assess their ability to generate and interpret analogies for spatial concepts. We quantitatively compare the analogies produced by each group, examining the impact of multimodal inputs and reasoning mechanisms. Our findings indicate that generative models can produce and interpret analogies but differ significantly from human reasoning in their abstraction of spatial concepts - variability influenced by input modality, model size, and prompting methods, with analogy-based prompts not consistently enhancing alignment. Contributions include a methodology for probing generative models through analogies; a comparative analysis of analogical reasoning among models and humans; and insights into the effect of multimodal inputs on reasoning.

MCML Authors

Ivo Bueno

Philipp Wicke

Dr.

→ Group Barbara Plank
AI and Computational Linguistics

[288]

V. Basile, S. Casola, S. Frenda and S. M. Lo.
PERSEID - Perspectivist Irony Detection: A CALAMITA Challenge.
CLiC-it 2024 - 10th Italian Conference on Computational Linguistics. Pisa, Italy, Dec 04-06, 2024. URL

Abstract

Works in perspectivism and human label variation have emphasized the need to collect and leverage various voices and points of view in the whole Natural Language Processing pipeline. PERSEID places itself in this line of work. We consider the task of irony detection from short social media conversations in Italian collected from Twitter (X) and Reddit. To do so, we leverage data from MultiPICO, a recent multilingual dataset with disaggregated annotations and annotators’ metadata, containing 1000 Post, Reply pairs with five annotations each on average. We aim to evaluate whether prompting LLMs with additional annotators’ demographic information (namely gender only, age only, and the combination of the two) results in improved performance compared to a baseline in which only the input text is provided. The evaluation is zero-shot, and we evaluate the results on the disaggregated annotations using F1.

MCML Authors

Silvia Casola

Dr.

[287]

P. F. Balestrucci, S. Casola, S. M. Lo, V. Basile and A. Mazzei.
I’m sure you’re a real scholar yourself: Exploring Ironic Content Generation by Large Language Models.
Findings @EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Generating ironic content is challenging: it requires a nuanced understanding of context and implicit references and balancing seriousness and playfulness. Moreover, irony is highly subjective and can depend on various factors, such as social, cultural, or generational aspects. This paper explores whether Large Language Models (LLMs) can learn to generate ironic responses to social media posts. To do so, we fine-tune two models to generate ironic and non-ironic content and deeply analyze their outputs’ linguistic characteristics, their connection to the original post, and their similarity to the human-written replies. We also conduct a large-scale human evaluation of the outputs. Additionally, we investigate whether LLMs can learn a form of irony tied to a generational perspective, with mixed results.

MCML Authors

Silvia Casola

Dr.

→ Group Barbara Plank
AI and Computational Linguistics

[286]

T. Bourgeade, S. Casola, A. M. Wizani and C. Bosco.
Data Augmentation through Back-Translation for Stereotypes and Irony Detection.
CLiC-it 2024 - 10th Italian Conference on Computational Linguistics. Pisa, Italy, Dec 04-06, 2024. URL

Abstract

Complex linguistic phenomena such as stereotypes or irony are still challenging to detect, particularly due to the lower availability of annotated data. In this paper, we explore Back-Translation (BT) as a data augmentation method to enhance such datasets by artificially introducing semantics-preserving variations. We investigate French and Italian as source languages on two multilingual datasets annotated for the presence of stereotypes or irony and evaluate French/Italian, English, and Arabic as pivot languages for the BT process. We also investigate cross-translation, i.e., augmenting one language subset of a multilingual dataset with translated instances from the other languages. We conduct an intrinsic evaluation of the quality of back-translated instances, identifying linguistic or translation model-specific errors that may occur with BT. We also perform an extrinsic evaluation of different data augmentation configurations to train a multilingual Transformer-based classifier for stereotype or irony detection on mono-lingual data.

MCML Authors

Silvia Casola

Dr.

→ Group Barbara Plank
AI and Computational Linguistics

[285]

J. Baan, N. Daheim, E. Ilia, D. Ulmer, H.-S. Li, R. Fernández, B. Plank, R. Sennrich, C. Zerva and W. Aziz.
Uncertainty in Natural Language Generation: From Theory to Applications.
Preprint (Jul. 2023). arXiv

Abstract

Recent advances of powerful Language Models have allowed Natural Language Generation (NLG) to emerge as an important technology that can not only perform traditional tasks like summarisation or translation, but also serve as a natural language interface to a variety of applications. As such, it is crucial that NLG systems are trustworthy and reliable, for example by indicating when they are likely to be wrong; and supporting multiple views, backgrounds and writing styles – reflecting diverse human sub-populations. In this paper, we argue that a principled treatment of uncertainty can assist in creating systems and evaluation protocols better aligned with these goals. We first present the fundamental theory, frameworks and vocabulary required to represent uncertainty. We then characterise the main sources of uncertainty in NLG from a linguistic perspective, and propose a two-dimensional taxonomy that is more informative and faithful than the popular aleatoric/epistemic dichotomy. Finally, we move from theory to applications and highlight exciting research directions that exploit uncertainty to power decoding, controllable generation, self-assessment, selective answering, active learning and more.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[284]

J. Baan, R. Fernández, B. Plank and W. Aziz.
Interpreting Predictive Probabilities: Model Confidence or Human Label Variation?
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL

Abstract

With the rise of increasingly powerful and user-facing NLP systems, there is growing interest in assessing whether they have a good representation of uncertainty by evaluating the quality of their predictive distribution over outcomes. We identify two main perspectives that drive starkly different evaluation protocols. The first treats predictive probability as an indication of model confidence; the second as an indication of human label variation. We discuss their merits and limitations, and take the position that both are crucial for trustworthy and fair NLP systems, but that exploiting a single predictive distribution is limiting. We recommend tools and highlight exciting directions towards models with disentangled representations of uncertainty about predictions and uncertainty about human labels.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[283]

V. Blaschke, M. Fedzechkina and M. Ter Hoeve.
Analyzing the Effect of Linguistic Similarity on Cross-Lingual Transfer: Tasks and Experimental Setups Matter.
Findings @ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Cross-lingual transfer is a popular approach to increase the amount of training data for NLP tasks in a low-resource context. However, the best strategy to decide which cross-lingual data to include is unclear. Prior research often focuses on a small set of languages from a few language families and/or a single task. It is still an open question how these findings extend to a wider variety of languages and tasks. In this work, we analyze cross-lingual transfer for 263 languages from a wide variety of language families. Moreover, we include three popular NLP tasks: POS tagging, dependency parsing, and topic classification. Our findings indicate that the effect of linguistic similarity on transfer performance depends on a range of factors: the NLP task, the (mono- or multilingual) input representations, and the definition of linguistic similarity.

MCML Authors

Verena Blaschke

[282]

I. d. S. Bueno Júnior, H. Ye, A. Wisiorek and H. Schütze.
Privacy-Preserving Federated Learning for Hate Speech Detection.
SRW @NAACL 2025 - Student Research Workshop at the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. DOI

Abstract

This paper presents a federated learning system with differential privacy for hate speech detection, tailored to low-resource languages. Among the fine-tuned pre-trained language models, ALBERT emerged as the most effective option for balancing performance and privacy. Experiments demonstrated that federated learning with differential privacy performs adequately in low-resource settings, though datasets with fewer than 20 sentences per client struggled due to excessive noise. Balanced datasets and augmenting hateful data with non-hateful examples proved critical for improving model utility. These findings offer a scalable and privacy-conscious framework for integrating hate speech detection into social media platforms and browsers, safeguarding user privacy while addressing online harm.

MCML Authors

Haotian Ye

Axel Wisiorek

Dr.

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[281]

V. Blaschke, B. Kovačić, S. Peng, H. Schütze and B. Plank.
MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL

Abstract

Despite the success of the Universal Dependencies (UD) project exemplified by its impressive language breadth, there is still a lack in ‘within-language breadth’: most treebanks focus on standard languages. Even for German, the language with the most annotations in UD, so far no treebank exists for one of its language varieties spoken by over 10M people: Bavarian. To contribute to closing this gap, we present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in UD, covering multiple text genres (wiki, fiction, grammar examples, social, non-fiction). We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers’ orthographies. Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries. We provide baseline parsing and POS tagging results, which are lower than results obtained on German and vary substantially between different graph-based parsers. To support further research on Bavarian syntax, we make our dataset, language-specific guidelines and code publicly available.

MCML Authors

Verena Blaschke

Siyao Peng

Dr.

→ Group Barbara Plank
AI and Computational Linguistics

Hinrich Schütze

Prof. Dr.

Principal Investigator

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[280]

V. Blaschke, F. Körner and B. Plank.
Add Noise, Tasks, or Layers? MaiNLP at the VarDial 2025 Shared Task on Norwegian Dialectal Slot and Intent Detection.
VarDial @COLING 2025 - 12th Workshop on NLP for Similar Languages, Varieties and Dialects at the 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL

Abstract

Slot and intent detection (SID) is a classic natural language understanding task. Despite this, research has only more recently begun focusing on SID for dialectal and colloquial varieties. Many approaches for low-resource scenarios have not yet been applied to dialectal SID data, or compared to each other on the same datasets. We participate in the VarDial 2025 shared task on slot and intent detection in Norwegian varieties, and compare multiple set-ups: varying the training data (English, Norwegian, or dialectal Norwegian), injecting character-level noise, training on auxiliary tasks, and applying Layer Swapping, a technique in which layers of models fine-tuned on different datasets are assembled into a model. We find noise injection to be beneficial while the effects of auxiliary tasks are mixed. Though some experimentation was required to successfully assemble a model from layers, it worked surprisingly well; a combination of models trained on English and small amounts of dialectal data produced the most robust slot predictions. Our best models achieve 97.6% intent accuracy and 85.6% slot F1 in the shared task.

MCML Authors

Verena Blaschke

Felicia Körner

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[279]

V. Blaschke, B. Kovačić, S. Peng and B. Plank.
MaiBaam Annotation Guidelines.
Preprint (Oct. 2024). arXiv

Abstract

This document provides the annotation guidelines for MaiBaam, a Bavarian corpus manually annotated with part-of-speech (POS) tags, syntactic dependencies, and German lemmas. MaiBaam belongs to the Universal Dependencies (UD) project, and our annotations elaborate on the general and German UD version 2 guidelines. In this document, we detail how to preprocess and tokenize Bavarian data, provide an overview of the POS tags and dependencies we use, explain annotation decisions that would also apply to closely related languages like German, and lastly we introduce and motivate decisions that are specific to Bavarian grammar.

MCML Authors

Verena Blaschke

Siyao Peng

Dr.

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

Applied Statistics in Social Sciences, Economics and Business

[278]

A. Beyer, G. Kauermann and H. Schütze.
Embedding Space Correlation as a Measure of Domain Similarity.
LREC 2020 - 12th International Conference on Language Resources and Evaluation. Marseille, France, May 13-15, 2020. URL

Abstract

Prior work has determined domain similarity using text-based features of a corpus. However, when using pre-trained word embeddings, the underlying text corpus might not be accessible anymore. Therefore, we propose the CCA measure, a new measure of domain similarity based directly on the dimension-wise correlations between corresponding embedding spaces. Our results suggest that an inherent notion of domain can be captured this way, as we are able to reproduce our findings for different domain comparisons for English, German, Spanish and Czech as well as in cross-lingual comparisons. We further find a threshold at which the CCA measure indicates that two corpora come from the same domain in a monolingual setting by applying permutation tests. By evaluating the usability of the CCA measure in a domain adaptation application, we also show that it can be used to determine which corpora are more similar to each other in a cross-domain sentiment detection task.

MCML Authors

Göran Kauermann

Prof. Dr.

Principal Investigator

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[277]

V. Blaschke.
Beyond 'noisy' text: How (and why) to process dialect data.
W-NUT @NAACL 2025 - 10th Workshop on Noisy and User-generated Text at the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. Keynote Talk. PDF

Abstract

Processing data from non-standard dialects links two lines of research: creating NLP tools that are robust to ‘noisy’ inputs, and extending the coverage of NLP tools to underserved language communities. In this talk, I will describe ways in which processing dialect data differs from processing standard-language data, and discuss some of the current challenges in dialect NLP research. For instance, I will talk about strategies to mitigate the effect of infelicitous subword tokenization caused by ad-hoc pronunciation spellings. Additionally, I argue that we should not only consider how to tackle dialectal variation in NLP, but also why. To this end, I will highlight perspectives of some dialect speaker communities on which language technologies should (or should not) be able to process or produce dialectal in- or output.

MCML Authors

Verena Blaschke

[276]

L. Bertolazzi, P. Mondorf, B. Plank and R. Bernardi.
The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It.
Preprint (Feb. 2025). arXiv

Abstract

The ability of large language models (LLMs) to validate their output and identify potential errors is crucial for ensuring robustness and reliability. However, current research indicates that LLMs struggle with self-correction, encountering significant challenges in detecting errors. While studies have explored methods to enhance self-correction in LLMs, relatively little attention has been given to understanding the models’ internal mechanisms underlying error detection. In this paper, we present a mechanistic analysis of error detection in LLMs, focusing on simple arithmetic problems. Through circuit analysis, we identify the computational subgraphs responsible for detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal that all models heavily rely on consistency heads - attention heads that assess surface-level alignment of numerical values in arithmetic solutions. Moreover, we observe that the models’ internal arithmetic computation primarily occurs in higher layers, whereas validation takes place in middle layers, before the final arithmetic results are fully encoded. This structural dissociation between arithmetic computation and validation seems to explain why current LLMs struggle to detect even simple arithmetic errors.

MCML Authors

Philipp Mondorf

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

[275]

E. Bassignana, M. Müller-Eberstein, M. Zhang and B. Plank.
Evidence > Intuition: Transferability Estimation for Encoder Selection.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

With the increase in availability of large pre-trained language models (LMs) in Natural Language Processing (NLP), it becomes critical to assess their fit for a specific target task a priori, as fine-tuning the entire space of available LMs is computationally prohibitive and unsustainable. However, encoder transferability estimation has received little to no attention in NLP. In this paper, we propose to generate quantitative evidence to predict which LM, out of a pool of models, will perform best on a target task without having to fine-tune all candidates. We provide a comprehensive study on LM ranking for 10 NLP tasks spanning the two fundamental problem types of classification and structured prediction. We adopt the state-of-the-art Logarithm of Maximum Evidence (LogME) measure from Computer Vision (CV) and find that it positively correlates with final LM performance in 94% of the setups. In the first study of its kind, we further compare transferability measures with the de facto standard of human practitioner ranking, finding that evidence from quantitative metrics is more robust than pure intuition and can help identify unexpected LM candidates.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[274]

E. Bassignana and B. Plank.
CrossRE: A Cross-Domain Dataset for Relation Extraction.
Findings @EMNLP 2022 - Findings of the Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Relation Extraction (RE) has attracted increasing attention, but current RE evaluation is limited to in-domain evaluation setups. Little is known about how well an RE system fares in challenging, but realistic out-of-distribution evaluation setups. To address this gap, we propose CrossRE, a new, freely-available cross-domain benchmark for RE, which comprises six distinct text domains and includes multi-label annotations. An additional innovation is that we release meta-data collected during annotation, to include explanations and flags of difficult instances. We provide an empirical evaluation with a state-of-the-art model for relation classification. As the meta-data enables us to shed new light on the state-of-the-art model, we provide a comprehensive analysis of the impact of difficult cases and find correlations between model and human annotations. Overall, our empirical investigation highlights the difficulty of cross-domain RE. We release our dataset, to spur more research in this direction.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[273]

V. Blaschke, C. Purschke, H. Schütze and B. Plank.
What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Natural language processing (NLP) has largely focused on modelling standardized languages. More recently, attention has increasingly shifted to local, non-standardized languages and dialects. However, the relevant speaker populations’ needs and wishes with respect to NLP tools are largely unknown. In this paper, we focus on dialects and regional languages related to German – a group of varieties that is heterogeneous in terms of prestige and standardization. We survey speakers of these varieties (N=327) and present their opinions on hypothetical language technologies for their dialects. Although attitudes vary among subgroups of our respondents, we find that respondents are especially in favour of potential NLP tools that work with dialectal input (especially audio input) such as virtual assistants, and less so for applications that produce dialectal output such as machine translation or spellcheckers.

MCML Authors

Verena Blaschke

Hinrich Schütze

Prof. Dr.

Principal Investigator

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[272]

V. Blaschke, H. Schütze and B. Plank.
Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages.
VarDial @EACL 2023 - 10th Workshop on NLP for Similar Languages, Varieties and Dialects at the 17th Conference of the European Chapter of the Association for Computational Linguistics. Dubrovnik, Croatia, May 02-06, 2023. DOI

Abstract

One of the challenges with finetuning pretrained language models (PLMs) is that their tokenizer is optimized for the language(s) it was pretrained on, but brittle when it comes to previously unseen variations in the data. This can for instance be observed when finetuning PLMs on one language and evaluating them on data in a closely related language variety with no standardized orthography. Despite the high linguistic similarity, tokenization no longer corresponds to meaningful representations of the target data, leading to low performance in, e.g., part-of-speech tagging. In this work, we finetune PLMs on seven languages from three different families and analyze their zero-shot performance on closely related, non-standardized varieties. We consider different measures for the divergence in the tokenization of the source and target data, and the way they can be adjusted by manipulating the tokenization during the finetuning step. Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data (the split word ratio difference) is the strongest predictor for model performance on target data.
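As a rough illustration of the split word ratio difference mentioned in the abstract, the sketch below computes the share of words that a tokenizer splits into more than one subword, for source and target data, and takes the absolute difference. The function names, the toy greedy tokenizer, the example vocabulary and the use of an absolute difference are all assumptions made for this illustration; they are not taken from the paper or its code.

```python
from typing import Callable, Iterable, List, Set


def split_word_ratio(words: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Fraction of words that the tokenizer splits into more than one subword."""
    words = list(words)
    if not words:
        raise ValueError("need at least one word")
    n_split = sum(1 for word in words if len(tokenize(word)) > 1)
    return n_split / len(words)


def split_word_ratio_difference(
    source_words: Iterable[str],
    target_words: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> float:
    """Absolute difference between the split word ratios of source and target data.

    Assumption: the difference is taken as an absolute value.
    """
    return abs(
        split_word_ratio(source_words, tokenize)
        - split_word_ratio(target_words, tokenize)
    )


def make_toy_tokenizer(vocab: Set[str]) -> Callable[[str], List[str]]:
    """Hypothetical greedy longest-match tokenizer; falls back to single characters."""

    def tokenize(word: str) -> List[str]:
        pieces = []
        start = 0
        while start < len(word):
            end = len(word)
            # Shrink the candidate piece until it is in the vocabulary
            # or only a single character is left.
            while end > start + 1 and word[start:end] not in vocab:
                end -= 1
            pieces.append(word[start:end])
            start = end
        return pieces

    return tokenize


# Toy example: a vocabulary fitted to standard spellings, applied to
# standard-like source words and non-standard target spellings.
toy_tokenize = make_toy_tokenizer({"das", "haus", "ist", "gro", "ß"})
source = ["das", "haus", "ist", "groß"]
target = ["des", "haus", "is", "groß"]
difference = split_word_ratio_difference(source, target, toy_tokenize)
```

In practice the `tokenize` argument would be the subword tokenizer of the pretrained model under study; the toy tokenizer only keeps the sketch self-contained.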

MCML Authors

Verena Blaschke

Hinrich Schütze

Prof. Dr.

Principal Investigator

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[271]

V. Blaschke, H. Schütze and B. Plank.
A Survey of Corpora for Germanic Low-Resource Languages and Dialects.
NoDaLiDa 2023 - 24th Nordic Conference on Computational Linguistics. Tórshavn, Faroe Islands, May 22-24, 2023. URL

Abstract

Despite much progress in recent years, the vast majority of work in natural language processing (NLP) is on standard languages with many speakers. In this work, we instead focus on low-resource languages and in particular non-standardized low-resource languages. Even within branches of major language families, often considered well-researched, little is known about the extent and type of available resources and what the major NLP challenges are for these language varieties. The first step to address this situation is a systematic survey of available corpora (most importantly, annotated corpora, which are particularly valuable for NLP research). Focusing on Germanic low-resource language varieties, we provide such a survey in this paper. Except for geolocation (origin of speaker or document), we find that manually annotated linguistic resources are sparse and, if they exist, mostly cover morphosyntax. Despite this lack of resources, we observe that interest in this area is increasing: there is active development and a growing research community. To facilitate research, we make our overview of over 80 corpora publicly available.

MCML Authors

Verena Blaschke

Hinrich Schütze

Prof. Dr.

Principal Investigator

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[270]

V. Blaschke, M. Winkler, C. Förster, G. Wenger-Glemser and B. Plank.
A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation.
Preprint (Jun. 2025). arXiv

Abstract

Although Germany has a diverse landscape of dialects, they are underrepresented in current automatic speech recognition (ASR) research. To enable studies of how robust models are towards dialectal variation, we present Betthupferl, an evaluation dataset containing four hours of read speech in three dialect groups spoken in Southeast Germany (Franconian, Bavarian, Alemannic), and half an hour of Standard German speech. We provide both dialectal and Standard German transcriptions, and analyze the linguistic differences between them. We benchmark several multilingual state-of-the-art ASR models on speech translation into Standard German, and find differences between how much the output resembles the dialectal vs. standardized transcriptions. Qualitative error analyses of the best ASR model reveal that it sometimes normalizes grammatical differences, but often stays closer to the dialectal constructions.

MCML Authors

Verena Blaschke

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Volker Tresp
Database Systems, Data Mining and AI

[269]

J. Bi, D. Yan, Y. Wang, W. Huang, H. Chen, G. Wan, M. Ye, X. Xiao, H. Schütze, V. Tresp and Y. Ma.
CoT-Kinetics: A Theoretical Modeling Assessing LRM Reasoning Process.
Preprint (May. 2025). arXiv

Abstract

Recent Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models by learning to reason, exhibiting promising performance in solving complex tasks. LRMs solve tasks that require complex reasoning by explicitly generating reasoning trajectories together with answers. Nevertheless, judging the quality of such an output answer is not easy, because considering only the correctness of the answer is not enough: the soundness of the reasoning trajectory matters as well. Logically, if the soundness of the reasoning part is poor, even if the answer is correct, the confidence of the derived answer should be low. Existing methods do consider jointly assessing the overall output answer by taking into account the reasoning part; however, their capability is still not satisfactory, as the causal relationship of the reasoning to the concluded answer cannot be properly reflected. In this paper, inspired by classical mechanics, we present a novel approach towards establishing a CoT-Kinetics energy equation. Specifically, our CoT-Kinetics energy equation formulates the token state transformation process, which is regulated by the LRM’s internal transformer layers, as the dynamics of a particle governed by a mechanical field. Our CoT-Kinetics energy assigns a scalar score to evaluate specifically the soundness of the reasoning phase, telling how confident the derived answer could be given the evaluated reasoning. As such, the LRM’s overall output quality can be accurately measured, rather than judged only coarsely (e.g., correct or incorrect).

MCML Authors

Jinhe Bi

Haokun Chen

→ Group Volker Tresp
Database Systems, Data Mining and AI

Hinrich Schütze

Prof. Dr.

Principal Investigator

Volker Tresp

Prof. Dr.

Principal Investigator

→ Group Volker Tresp
Database Systems, Data Mining and AI

Yunpu Ma

Dr.

[268]

H. Chen, J. Büssing, D. Rügamer and E. Nie.
Leveraging (Sentence) Transformer Models with Contrastive Learning for Identifying Machine-Generated Text.
SemEval @NAACL 2024 - 18th International Workshop on Semantic Evaluation at the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Mexico City, Mexico, Jun 16-21, 2024. URL

Abstract

This paper outlines our approach to SemEval-2024 Task 8 (Subtask B), which focuses on discerning machine-generated text from human-written content, while also identifying the text sources, i.e., from which Large Language Model (LLM) the target text is generated. Our detection system is built upon Transformer-based techniques, leveraging various pre-trained language models (PLMs), including sentence transformer models. Additionally, we incorporate Contrastive Learning (CL) into the classifier to improve its detection capabilities and employ Data Augmentation methods. Ultimately, our system achieves a peak accuracy of 76.96% on the test set of the competition, configured using a sentence transformer model integrated with CL methodology.

MCML Authors

David Rügamer

Prof. Dr.

Principal Investigator

Statistics, Data Science and Machine Learning

Ercong Nie

→ Group Alexander Fraser
Data Analytics & Statistics

[267]

A. Chronopoulou.
Efficient multilingual and domain adaptation of language models under resource constraints.
Dissertation Jan. 2024. DOI

Abstract

This dissertation develops methods to improve natural language processing (NLP) systems for low-resource languages and diverse domains. For machine translation, it explores bilingual language models, static embeddings, and multilingual systems with adapters, achieving robust performance in low-resource settings. To enhance domain adaptation, it introduces hierarchical tree structures and efficient adapters, enabling better generalization and robustness to domain shifts. These approaches address data disparities and domain variability, advancing adaptable and efficient NLP systems. (Shortened).

MCML Authors

Alexandra Chronopoulou

Dr.

* Former Member

[266]

B. Chen, Y. Liu, A. Korhonen and B. Plank.
Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation.
Preprint (May. 2025). arXiv

Abstract

The recent rise of reasoning-tuned Large Language Models (LLMs)–which generate chains of thought (CoTs) before giving the final answer–has attracted significant attention and offers new opportunities for gaining insights into human label variation, which refers to plausible differences in how multiple annotators label the same data instance. Prior work has shown that LLM-generated explanations can help align model predictions with human label distributions, but it typically adopts a reverse paradigm: producing explanations based on given answers. In contrast, CoTs provide a forward reasoning path that may implicitly embed rationales for each answer option, before generating the answers. We thus propose a novel LLM-based pipeline enriched with linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option from CoTs with improved accuracy. We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores, which instead favor direct comparison of label distributions. Our method outperforms a direct generation method as well as baselines on three datasets, and shows better alignment of ranking methods with humans, highlighting the effectiveness of our approach.

MCML Authors

Beiduo Chen

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[265]

S. Casola, Y. J. Liu, S. Peng, O. Kraus, A. Gatt and B. Plank.
Evaluation Should Not Ignore Variation: On the Impact of Reference Set Choice on Summarization Metrics.
Preprint (Jun. 2025). arXiv

Abstract

Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of using different reference sets on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.

MCML Authors

Silvia Casola

Dr.

Yang Janet Liu

* Former Member

→ Group Barbara Plank
AI and Computational Linguistics

Siyao Peng

Dr.

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[264]

A. Chronopoulou, M. Peters, A. Fraser and J. Dodge.
AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models.
Findings @EACL 2023 - Findings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. Dubrovnik, Croatia, May 02-06, 2023. DOI

Abstract

Pretrained language models (PLMs) are trained on massive corpora, but often need to specialize to specific domains. A parameter-efficient adaptation method suggests training an adapter for each domain on the task of language modeling. This leads to good in-domain scores but can be impractical for domain- or resource-restricted settings. A solution is to use a related-domain adapter for the novel domain at test time. In this paper, we introduce AdapterSoup, an approach that performs weight-space averaging of adapters trained on different domains. Our approach is embarrassingly parallel: first, we train a set of domain-specific adapters; then, for each novel domain, we determine which adapters should be averaged at test time. We present extensive experiments showing that AdapterSoup consistently improves performance on new domains without extra training. We also explore weight averaging of adapters trained on the same domain with different hyper-parameters, and show that it preserves the performance of a PLM on new domains while obtaining strong in-domain results. We explore various approaches for choosing which adapters to combine, such as text clustering and semantic similarity. We find that using clustering leads to the most competitive results on novel domains.
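The weight-space averaging described in this abstract can be illustrated with a minimal sketch. It assumes each adapter is a plain dictionary from parameter name to a flat list of weights and that all adapters share the same shapes; the parameter names and the `average_adapters` helper are hypothetical, and the selection of which adapters to average (e.g. by clustering) is not shown.

```python
from typing import Dict, List

# Hypothetical layout: parameter name -> flat list of weights.
Adapter = Dict[str, List[float]]


def average_adapters(adapters: List[Adapter]) -> Adapter:
    """Uniformly average adapters that share parameter names and shapes."""
    if not adapters:
        raise ValueError("need at least one adapter")
    soup: Adapter = {}
    for name in adapters[0]:
        # Group the i-th weight of every adapter together, then take the mean.
        columns = zip(*(adapter[name] for adapter in adapters))
        soup[name] = [sum(column) / len(adapters) for column in columns]
    return soup


# Toy adapters standing in for two domain-specific adapters (made-up values).
legal = {"down.weight": [0.5, 1.0], "up.weight": [2.0, -1.0]}
news = {"down.weight": [1.5, 0.0], "up.weight": [0.0, 1.0]}
soup = average_adapters([legal, news])
```

The averaged adapter has the same size as a single adapter, so using it at test time adds no parameters beyond one adapter.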

MCML Authors

Alexandra Chronopoulou

Dr.

* Former Member

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[263]

B. Chen, S. Peng, A. Korhonen and B. Plank.
A Rose by Any Other Name: LLM-Generated Explanations Are Good Proxies for Human Explanations to Collect Label Distributions on NLI.
Findings @ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Disagreement in human labeling is ubiquitous, and can be captured in human judgment distributions (HJDs). Recent research has shown that explanations provide valuable information for understanding human label variation (HLV) and large language models (LLMs) can approximate HJD from a few human-provided label-explanation pairs. However, collecting explanations for every label is still time-consuming. This paper examines whether LLMs can be used to replace humans in generating explanations for approximating HJD. Specifically, we use LLMs as annotators to generate model explanations for a few given human labels. We test ways to obtain and combine these label-explanations with the goal to approximate human judgment distribution. We further compare the resulting human with model-generated explanations, and test automatic and human explanation selection. Our experiments show that LLM explanations are promising for NLI: to estimate HJD, generated explanations yield results comparable to human explanations when provided with human labels. Importantly, our results generalize from datasets with human explanations to i) datasets where they are not available and ii) challenging out-of-distribution test sets.

MCML Authors

Beiduo Chen

Siyao Peng

Dr.

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[262]

A. Chronopoulou, D. Stojanovski and A. Fraser.
Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation.
LoResMT @EACL 2023 - 6th Workshop on Technologies for Machine Translation of Low-Resource Languages at the 17th Conference of the European Chapter of the Association for Computational Linguistics. Dubrovnik, Croatia, May 02-06, 2023. DOI

Abstract

Large multilingual models trained with self-supervision achieve state-of-the-art results in a wide range of natural language processing tasks. Self-supervised pretrained models are often fine-tuned on parallel data from one or multiple language pairs for machine translation. Multilingual fine-tuning improves performance on low-resource languages but requires modifying the entire model and can be prohibitively expensive. Training a new adapter on each language pair or training a single adapter on all language pairs without updating the pretrained model has been proposed as a parameter-efficient alternative. However, the former does not permit any sharing between languages, while the latter shares parameters for all languages and is susceptible to negative interference. In this paper, we propose training language-family adapters on top of mBART-50 to facilitate cross-lingual transfer. Our approach outperforms related baselines, yielding higher translation scores on average when translating from English to 17 different low-resource languages. We also show that language-family adapters provide an effective method to translate to languages unseen during pretraining.

MCML Authors

Alexandra Chronopoulou

Dr.

* Former Member

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[261]

Q. Chen, X. Wang, P. Mondorf, M. A. Hedderich and B. Plank.
Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination.
Preprint (Oct. 2024). arXiv

Abstract

Tree of Thoughts (ToT) is a reasoning strategy for Large Language Models (LLMs) that employs a generator to suggest reasoning steps and a discriminator to decide which steps to implement. ToT demonstrates strong performance on reasoning tasks, often surpassing simple methods such as Input-Output (IO) prompting and Chain-of-Thought (CoT) reasoning. However, ToT does not consistently outperform such simpler methods across all models, leaving large knowledge gaps on the conditions under which ToT is most beneficial. In this paper, we analyze the roles of the generator and discriminator separately to better understand the conditions when ToT is beneficial. We find that the generator plays a more critical role than the discriminator in driving the success of ToT. Scaling the generator leads to notable improvements in ToT performance, even when using a smaller model as the discriminator, whereas scaling the discriminator with a fixed generator yields only marginal gains. Our results show that models across different scales exhibit comparable discrimination capabilities, yet differ significantly in their generative performance for ToT.

MCML Authors

Xinpeng Wang

Philipp Mondorf

→ Group Barbara Plank
AI and Computational Linguistics

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[260]

B. Chen, X. Wang, S. Peng, R. Litschko, A. Korhonen and B. Plank.
'Seeing the Big through the Small': Can LLMs Approximate Human Judgment Distributions on NLI from a Few Explanations?
Findings @EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Human label variation (HLV) is a valuable source of information that arises when multiple human annotators provide different labels for valid reasons. In Natural Language Inference (NLI), earlier approaches to capturing HLV involve either collecting annotations from many crowd workers to represent human judgment distribution (HJD) or using expert linguists to provide detailed explanations for their chosen labels. While the former method provides denser HJD information, obtaining it is resource-intensive. In contrast, the latter offers richer textual information but it is challenging to scale up to many human judges. Besides, large language models (LLMs) are increasingly used as evaluators (‘LLM judges’) but with mixed results, and few works aim to study HJDs. This study proposes to exploit LLMs to approximate HJDs using a small number of expert labels and explanations. Our experiments show that a few explanations significantly improve LLMs’ ability to approximate HJDs with and without explicit labels, thereby providing a solution to scale up annotations for HJD. However, fine-tuning smaller soft-label aware models with the LLM-generated model judgment distributions (MJDs) presents partially inconsistent results: while similar in distance, their resulting fine-tuned models and visualized distributions differ substantially. We show the importance of complementing instance-level distance measures with a global-level shape metric and visualization to more effectively evaluate MJDs against human judgment distributions.

MCML Authors

Beiduo Chen

Xinpeng Wang

→ Group Barbara Plank
AI and Computational Linguistics

Siyao Peng

Dr.

→ Group Barbara Plank
AI and Computational Linguistics

Robert Litschko

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

[259]

C. Chan, Y. Yim, H. Zeng, Z. Zou, X. Cheng, Z. Sun, Z. Deng, K. Chung, Y. Ao, Y. Fan, C. Jiayang, E. Nie, G. Y. Wong, H. Schmid, H. Schütze, S. See and Y. Song.
XToM: Exploring the Multilingual Theory of Mind for Large Language Models.
Preprint (Jun. 2025). arXiv

Abstract

Theory of Mind (ToM), the ability to infer mental states in others, is pivotal for human social cognition. Existing evaluations of ToM in LLMs are largely limited to English, neglecting the linguistic diversity that shapes human cognition. This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind, which is the capacity to reason about mental states across diverse linguistic contexts? To address this gap, we present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages and incorporates diverse, contextually rich task scenarios. Using XToM, we systematically evaluate LLMs (e.g., DeepSeek R1), revealing a pronounced dissonance: while models excel in multilingual language understanding, their ToM performance varies across languages. Our findings expose limitations in LLMs’ ability to replicate human-like mentalizing across linguistic contexts.

MCML Authors

Ercong Nie

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[258]

D. Dementieva, N. Babakov and A. Fraser.
EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian.
Preprint (May. 2025). arXiv

Abstract

While Ukrainian NLP has seen progress in many text processing tasks, emotion classification remains an underexplored area with no publicly available benchmark to date. In this work, we introduce EmoBench-UA, the first annotated dataset for emotion detection in Ukrainian texts. Our annotation schema is adapted from the guidelines of previous English-centric works on emotion detection (Mohammad et al., 2018; Mohammad, 2022). The dataset was created through crowdsourcing, ensuring a high-quality annotation process. Then, we evaluate a range of approaches on the collected dataset, starting from linguistic-based baselines, synthetic data translated from English, to large language models (LLMs). Our findings highlight the challenges of emotion classification in non-mainstream languages like Ukrainian and emphasize the need for further development of Ukrainian-specific models and training resources.

MCML Authors

Daryna Dementieva

Dr.

Alexander Fraser

Prof. Dr.

Principal Investigator

[257]

A. Dimmelmeier, H. Doll, M. Schierholz, E. Kormanyos, M. Fehr, B. Ma, J. Beck, A. Fraser and F. Kreuter.
Informing climate risk analysis using textual information - A research agenda.
ClimateNLP @ACL 2024 - 1st Workshop on Natural Language Processing Meets Climate Change at the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

We present a research agenda focused on efficiently extracting, assuring quality, and consolidating textual company sustainability information to address urgent climate change decision-making needs. Starting from the goal to create integrated FAIR (Findable, Accessible, Interoperable, Reusable) climate-related data, we identify research needs pertaining to the technical aspects of information extraction as well as to the design of the integrated sustainability datasets that we seek to compile. Regarding extraction, we leverage technological advancements, particularly in large language models (LLMs) and Retrieval-Augmented Generation (RAG) pipelines, to unlock the underutilized potential of unstructured textual information contained in corporate sustainability reports. In applying these techniques, we review key challenges, which include the retrieval and extraction of CO2 emission values from PDF documents, especially from unstructured tables and graphs therein, and the validation of automatically extracted data through comparisons with human-annotated values. We also review how existing use cases and practices in climate risk analytics relate to choices of what textual information should be extracted and how it could be linked to existing structured data.

MCML Authors

Malte Schierholz

Dr.

Bolei Ma

Jacob Beck

Alexander Fraser

Prof. Dr.

Principal Investigator

Frauke Kreuter

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[256]

M. Weller-Di Marco, K. Hämmerl and A. Fraser.
A Study on Accessing Linguistic Information in Pre-Trained Language Models by Using Prompts.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

We study whether linguistic information in pre-trained multilingual language models can be accessed by human language: So far, there is no easy method to directly obtain linguistic information and gain insights into the linguistic principles encoded in such models. We use the technique of prompting and formulate linguistic tasks to test the LM’s access to explicit grammatical principles and study how effective this method is at providing access to linguistic features. Our experiments on German, Icelandic and Spanish show that some linguistic properties can in fact be accessed through prompting, whereas others are harder to capture.

MCML Authors

Katharina Hämmerl

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Eyke Hüllermeier
Artificial Intelligence and Machine Learning

[255]

S. Dutta, T. Kaufmann, G. Glavaš, I. Habernal, K. Kersting, F. Kreuter, M. Mezini, I. Gurevych, E. Hüllermeier and H. Schütze.
Problem Solving Through Human-AI Preference-Based Cooperation.
Computational Linguistics (Jul. 2025). DOI

Abstract

While there is a widespread belief that artificial general intelligence (AGI) – or even superhuman AI – is imminent, complex problems in expert domains are far from being solved. We argue that such problems require human-AI cooperation and that the current state of the art in generative AI is unable to play the role of a reliable partner due to a multitude of shortcomings, including difficulty in keeping track of a complex solution artifact (e.g., a software program), limited support for versatile human preference expression and lack of adapting to human preference in an interactive setting. To address these challenges, we propose HAI-Co2, a novel human-AI co-construction framework. We take first steps towards a formalization of HAI-Co2 and discuss the difficult open research problems that it faces.

MCML Authors

Timo Kaufmann

Frauke Kreuter

Prof. Dr.

Principal Investigator

Artificial Intelligence and Machine Learning

Eyke Hüllermeier

Prof. Dr.

Principal Investigator

Hinrich Schütze

Prof. Dr.

Principal Investigator

[254]

P. Dufter, N. Kassner and H. Schütze.
Static Embeddings as Efficient Knowledge Bases?
NAACL 2021 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Virtual, Jun 06-11, 2021. DOI

Abstract

Recent research investigates factual knowledge stored in large pretrained language models (PLMs). Instead of structural knowledge base (KB) queries, masked sentences such as ‘Paris is the capital of [MASK]’ are used as probes. The good performance on this analysis task has been interpreted as PLMs becoming potential repositories of factual knowledge. In experiments across ten linguistically diverse languages, we study knowledge contained in static embeddings. We show that, when restricting the output space to a candidate set, simple nearest neighbor matching using static embeddings performs better than PLMs. E.g., static embeddings perform 1.6% points better than BERT while just using 0.3% of energy for training. One important factor in their good comparative performance is that static embeddings are standardly learned for a large vocabulary. In contrast, BERT exploits its more sophisticated, but expensive ability to compose meaningful representations from a much smaller subword vocabulary.
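Nearest neighbor matching over a restricted candidate set, as mentioned in this abstract, can be sketched as follows. The toy vectors, the `nearest_candidate` helper, and the use of cosine similarity between the query entity and each candidate are illustrative assumptions; the abstract does not specify how the relation itself enters the matching.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_candidate(
    query: str, candidates: List[str], embeddings: Dict[str, List[float]]
) -> str:
    """Return the candidate whose embedding is closest to the query's."""
    return max(candidates, key=lambda word: cosine(embeddings[query], embeddings[word]))


# Made-up 3-dimensional vectors; real static embeddings have hundreds of dimensions.
toy_embeddings = {
    "Paris": [0.9, 0.1, 0.0],
    "France": [0.8, 0.2, 0.1],
    "Japan": [0.0, 0.9, 0.3],
    "Kenya": [0.1, 0.1, 0.9],
}
answer = nearest_candidate("Paris", ["France", "Japan", "Kenya"], toy_embeddings)
```

Restricting the search to the candidate list is what keeps the comparison with a masked language model fair, since both systems then choose from the same output space.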

MCML Authors

Nora Kassner

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[253]

Y. Du, P. Mondorf, S. Casola, Y. Yao, R. Litschko and B. Plank.
Reason to Rote: Rethinking Memorization in Reasoning.
Preprint (Jul. 2025). arXiv

Abstract

Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization in many cases does not heavily affect generalizable reasoning capabilities. Using two controllable synthetic reasoning datasets with noisy labels, four-digit addition (FDA) and two-hop relational reasoning (THR), we discover a reliance of memorization on generalizable reasoning mechanisms: models continue to compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening reasoning adversely affects memorization. We further show that memorization operates through distributed encoding, i.e., aggregating various inputs and intermediate results, rather than building a look-up mechanism from inputs to noisy labels. Moreover, our FDA case study reveals memorization occurs via outlier heuristics, where existing neuron activation patterns are slightly shifted to fit noisy labels. Together, our findings suggest that memorization of label noise in language models builds on, rather than overrides, the underlying reasoning mechanisms, shedding light on the intriguing phenomenon of benign memorization.

MCML Authors

Philipp Mondorf

Silvia Casola

Dr.

→ Group Barbara Plank
AI and Computational Linguistics

Robert Litschko

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[252]

L. Edman, L. Bylinina, F. Ghorbanpour and A. Fraser.
Are BabyLMs Second Language Learners?
Preprint (Oct. 2024). arXiv

Abstract

This paper describes a linguistically-motivated approach to the 2024 edition of the BabyLM Challenge (Warstadt et al. 2023). Rather than pursuing a first language learning (L1) paradigm, we approach the challenge from a second language (L2) learning perspective. In L2 learning, there is a stronger focus on learning explicit linguistic information, such as grammatical notions, definitions of words or different ways of expressing a meaning. This makes L2 learning potentially more efficient and concise. We approximate this using data from Wiktionary, grammar examples either generated by an LLM or sourced from grammar books, and paraphrase data. We find that explicit information about word meaning (in our case, Wiktionary) does not boost model performance, while grammatical information can give a small improvement. The most impactful data ingredient is sentence paraphrases, with our two best models being trained on 1) a mix of paraphrase data and data from the BabyLM pretraining dataset, and 2) exclusively paraphrase data.

MCML Authors

Lukas Edman

Dr.

Faeze Ghorbanpour

→ Group Alexander Fraser
Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Michael Hedderich
AI and Computational Linguistics

[251]

F. Eichin, Y. Du, P. Mondorf, B. Plank and M. A. Hedderich.
Grokking ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior.
Preprint (May. 2025). arXiv GitHub

Abstract

Post-hoc interpretability methods typically attribute a model’s behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining existing methods or applying them at different training stages offers broader insights, these approaches usually lack theoretical support. In this work, we present ExPLAIND, a unified framework that integrates all three perspectives. First, we generalize recent work on gradient path kernels, which reformulate models trained by gradient descent as a kernel machine, to more realistic training settings. Empirically, we find that both a CNN and a Transformer model are replicated accurately by this reformulation. Second, we derive novel parameter- and step-wise influence scores from the kernel feature maps. We show their effectiveness in parameter pruning that is comparable to existing methods, reinforcing their value for model component attribution. Finally, jointly interpreting model components and data over the training process, we leverage ExPLAIND to analyze a Transformer that exhibits Grokking. Among other things, our findings support previously proposed stages of Grokking, while refining the final phase as one of alignment of input embeddings and final layers around a representation pipeline learned after the memorization phase. Overall, ExPLAIND provides a theoretically grounded, unified framework to interpret model behavior and training dynamics.

MCML Authors

Florian Eichin

Philipp Mondorf

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

[250]

Y. Elazar, N. Kassner, S. Ravfogel, A. Ravichander, E. Hovy, H. Schütze and Y. Goldberg.
Measuring and Improving Consistency in Pretrained Language Models.
Transactions of the Association for Computational Linguistics 9 (Dec. 2021). DOI

Abstract

Consistency of a model, that is, the invariance of its behavior under meaning-preserving alternations in its input, is a highly desirable property in natural language processing. In this paper we study the question: Are Pretrained Language Models (PLMs) consistent with respect to factual knowledge? To this end, we create ParaRel, a high-quality resource of cloze-style query English paraphrases. It contains a total of 328 paraphrases for 38 relations. Using ParaRel, we show that the consistency of all PLMs we experiment with is poor, though with high variance between relations. Our analysis of the representational spaces of PLMs suggests that they have a poor structure and are currently not suitable for representing knowledge robustly. Finally, we propose a method for improving model consistency and experimentally demonstrate its effectiveness.
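One way to make the notion of consistency concrete is to count how often a model gives the same answer to two paraphrases of the same query. The sketch below is an assumption about how such a score could be computed, not necessarily the exact metric used in the paper; the `pairwise_consistency` helper and the example predictions are hypothetical.

```python
from itertools import combinations
from typing import List


def pairwise_consistency(predictions: List[str]) -> float:
    """Fraction of paraphrase pairs for which the predicted answers agree."""
    pairs = list(combinations(predictions, 2))
    if not pairs:
        return 1.0  # a single paraphrase is trivially consistent
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical answers of one model to four paraphrases of the same cloze query.
predictions = ["Paris", "Paris", "Lyon", "Paris"]
score = pairwise_consistency(predictions)
```

Here three of the six paraphrase pairs agree, so the score is 0.5 even though three of the four individual answers are identical.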

MCML Authors

Nora Kassner

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Michael Hedderich
AI and Computational Linguistics

[249]

F. Eichin, Y. J. Liu, B. Plank and M. A. Hedderich.
Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Discourse understanding is essential for many NLP tasks, yet most existing work remains constrained by framework-dependent discourse representations. This work investigates whether large language models (LLMs) capture discourse knowledge that generalizes across languages and frameworks. We address this question along two dimensions: (1) developing a unified discourse relation label set to facilitate cross-lingual and cross-framework discourse analysis, and (2) probing LLMs to assess whether they encode generalizable discourse abstractions. Using multilingual discourse relation classification as a testbed, we examine a comprehensive set of 23 LLMs of varying sizes and multilingual capabilities. Our results show that LLMs, especially those with multilingual training corpora, can generalize discourse information across languages and frameworks. Further layer-wise analyses reveal that language generalization at the discourse level is most salient in the intermediate layers. Lastly, our error analysis provides an account of challenging relation classes.

MCML Authors

Florian Eichin

Yang Janet Liu

* Former Member

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

[248]

S. Eckman, B. Ma, C. Kern, R. Chew, B. Plank and F. Kreuter.
Aligning NLP Models with Target Population Perspectives using PAIR: Population-Aligned Instance Replication.
NLPerspectives @EMNLP 2025 - 4th Workshop on Perspectivist Approaches to NLP at the Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025. To be published. Preprint available. arXiv

Abstract

Models trained on crowdsourced labels may not reflect broader population views when annotator pools are not representative. Since collecting representative labels is challenging, we propose Population-Aligned Instance Replication (PAIR), a method to address this bias through statistical adjustment. Using a simulation study of hate speech and offensive language detection, we create two types of annotators with different labeling tendencies and generate datasets with varying proportions of the types. Models trained on unbalanced annotator pools show poor calibration compared to those trained on representative data. However, PAIR, which duplicates labels from underrepresented annotator groups to match population proportions, significantly reduces bias without requiring new data collection. These results suggest statistical techniques from survey research can help align model training with target populations even when representative annotator pools are unavailable. We conclude with three practical recommendations for improving training data quality.
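The replication idea in this abstract can be sketched as below. The record layout, the `replicate_to_population` helper, and the rounding of replication factors to whole copies are assumptions made for illustration; the paper's actual adjustment procedure may differ.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical record layout: (text id, annotator group, label).
Record = Tuple[str, str, int]


def replicate_to_population(
    records: List[Record], target_share: Dict[str, float]
) -> List[Record]:
    """Duplicate records of underrepresented groups to approach target shares."""
    counts = Counter(group for _, group, _ in records)
    total = len(records)
    # Ratio of the desired population share to the observed share per group.
    ratio = {g: target_share[g] / (counts[g] / total) for g in counts}
    base = min(ratio.values())  # the most overrepresented group keeps one copy
    replicated: List[Record] = []
    for record in records:
        copies = max(1, round(ratio[record[1]] / base))
        replicated.extend([record] * copies)
    return replicated


# Three labels from group A and one from group B; the target population is 50/50.
labels = [("t1", "A", 1), ("t2", "A", 0), ("t3", "A", 1), ("t4", "B", 0)]
balanced = replicate_to_population(labels, {"A": 0.5, "B": 0.5})
group_counts = Counter(group for _, group, _ in balanced)
```

No new annotations are collected: the single group B record is simply repeated until both groups contribute equally to the training data.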

MCML Authors

Bolei Ma

Christoph Kern

Prof. Dr.

Associate

Barbara Plank

Prof. Dr.

Principal Investigator

Frauke Kreuter

Prof. Dr.

Principal Investigator

[247]

S. Eckman, B. Plank and F. Kreuter.
Position: Insights from Survey Methodology can Improve Training Data.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL

Abstract

Whether future AI models are fair, trustworthy, and aligned with the public’s interests rests in part on our ability to collect accurate data about what we want the models to do. However, collecting high-quality data is difficult, and few AI/ML researchers are trained in data collection methods. Recent research in data-centric AI has shown that higher quality training data leads to better performing models, making this the right moment to introduce AI/ML researchers to the field of survey methodology, the science of data collection. We summarize insights from the survey methodology literature and discuss how they can improve the quality of training and feedback data. We also suggest collaborative research ideas into how biases in data collection can be mitigated, making models more accurate and human-centric.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

Frauke Kreuter

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[246]

L. Edman, H. Schmid and A. Fraser.
CUTE: Measuring LLMs’ Understanding of Their Tokens.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Large Language Models (LLMs) show remarkable performance on a wide variety of tasks. Most LLMs split text into multi-character tokens and process them as atomic units without direct access to individual characters. This raises the question: To what extent can LLMs learn orthographic information? To answer this, we propose a new benchmark, CUTE, which features a collection of tasks designed to test the orthographic knowledge of LLMs. We evaluate popular LLMs on CUTE, finding that most of them seem to know the spelling of their tokens, yet fail to use this information effectively to manipulate text, calling into question how much of this knowledge is generalizable.

MCML Authors

Lukas Edman

Dr.

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[245]

L. Edman, H. Schmid and A. Fraser.
EXECUTE: A Multilingual Benchmark for LLM Token Understanding.
Findings @ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

The CUTE benchmark showed that LLMs struggle with character understanding in English. We extend it to more languages with diverse scripts and writing systems, introducing EXECUTE. Our simplified framework allows easy expansion to any language. Tests across multiple LLMs reveal that challenges in other languages are not always on the character level as in English. Some languages show word-level processing issues, some show no issues at all. We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs’ understanding of character components.

MCML Authors

Lukas Edman

Dr.

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Michael Hedderich
AI and Computational Linguistics

[244]

F. Eichin, C. Schuster, G. Groh and M. A. Hedderich.
Semantic Component Analysis: Discovering Patterns in Short Texts Beyond Topics.
Findings @EMNLP 2025 - Findings of the Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025. To be published. Preprint available. arXiv

MCML Authors

Florian Eichin

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

→ Group Alexander Fraser
Data Analytics & Statistics

[243]

F. Friedrich, K. Hämmerl, P. Schramowski, M. Brack, J. Libovicky, K. Kersting and A. Fraser.
Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Text-to-image generation models have recently achieved astonishing results in image quality, flexibility, and text alignment, and are consequently employed in a fast-growing number of applications. Through improvements in multilingual abilities, a larger community now has access to this technology. However, our results show that multilingual models suffer from significant gender biases just as monolingual models do. Furthermore, the natural expectation that multilingual models will provide similar results across languages does not hold up. Instead, there are important differences between languages. We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models. We use MAGBIG to investigate the effect of multilingualism on gender bias in T2I models. To this end, we construct multilingual prompts requesting portraits of people with a certain occupation or trait. Our results show that not only do models exhibit strong gender biases but they also behave differently across languages. Furthermore, we investigate prompt engineering strategies, such as indirect, neutral formulations, to mitigate these biases. Unfortunately, these approaches have limited success and result in worse text-to-image alignment. Consequently, we call for more research into diverse representations across languages in image generators, as well as into steerability to address biased model behavior.

MCML Authors

Katharina Hämmerl

Alexander Fraser

Prof. Dr.

Principal Investigator

[242]

Q. Feng, Y. Liu and H. Schütze.
Your Pretrained Model Tells the Difficulty Itself: A Self-Adaptive Curriculum Learning Paradigm for Natural Language Understanding.
SRW @ACL 2025 - Student Research Workshop at the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Curriculum learning is a widely adopted training strategy in natural language processing (NLP), where models are exposed to examples organized by increasing difficulty to enhance learning efficiency and performance. However, most existing approaches rely on manually defined difficulty metrics – such as text length – which may not accurately reflect the model’s own perspective. To overcome this limitation, we present a self-adaptive curriculum learning paradigm that prioritizes fine-tuning examples based on difficulty scores predicted by pre-trained language models (PLMs) themselves. Building on these scores, we explore various training strategies that differ in the ordering of examples for the fine-tuning: from easy-to-hard, hard-to-easy, to mixed sampling. We evaluate our method on four natural language understanding (NLU) datasets covering both binary and multi-class classification tasks. Experimental results show that our approach leads to faster convergence and improved performance compared to standard random sampling.

MCML Authors

Yihong Liu

Hinrich Schütze

Prof. Dr.

Principal Investigator

Artificial Intelligence in Management

[241]

S. Feuerriegel, A. Maarouf, D. Bär, D. Geißler, J. Schweisthal, N. Pröllochs, C. E. Robertson, S. Rathje, J. Hartmann, S. M. Mohammad, O. Netzer, A. A. Siegel, B. Plank and J. J. Van Bavel.
Using natural language processing to analyse text data in behavioural science.
Nature Reviews Psychology 4 (Feb. 2025). DOI

Abstract

Language is a uniquely human trait at the core of human interactions. The language people use often reflects their personality, intentions and state of mind. With the integration of the Internet and social media into everyday life, much of human communication is documented as written text. These online forms of communication (for example, blogs, reviews, social media posts and emails) provide a window into human behaviour and therefore present abundant research opportunities for behavioural science. In this Review, we describe how natural language processing (NLP) can be used to analyse text data in behavioural science. First, we review applications of text data in behavioural science. Second, we describe the NLP pipeline and explain the underlying modelling approaches (for example, dictionary-based approaches and large language models). We discuss the advantages and disadvantages of these methods for behavioural science, in particular with respect to the trade-off between interpretability and accuracy. Finally, we provide actionable recommendations for using NLP to ensure rigour and reproducibility.

MCML Authors

Stefan Feuerriegel

Prof. Dr.

Principal Investigator

Abdurahman Maarouf

→ Group Stefan Feuerriegel
Artificial Intelligence in Management

Dominik Bär

→ Group Stefan Feuerriegel
Artificial Intelligence in Management

Jonas Schweisthal

→ Group Stefan Feuerriegel
Artificial Intelligence in Management

Barbara Plank

Prof. Dr.

Principal Investigator

[240]

M. Fayyaz, A. Modarressi, H. Deilamsalehy, F. Dernoncourt, R. Rossi, T. Bui, H. Schütze and N. Peng.
Steering MoE LLMs via Expert (De)Activation.
Preprint (Sep. 2025). arXiv

Abstract

Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFN), known as experts. We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-linked experts. Our detection method identifies experts with distinct activation patterns across paired inputs exhibiting contrasting behaviors. By selectively (de)activating such experts during inference, we control behaviors like faithfulness and safety without retraining or modifying weights. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by +27%. In adversarial attack mode, it drops safety by -41% alone, and -100% when combined with existing jailbreak methods, bypassing all safety guardrails and exposing a new dimension of alignment faking hidden within experts.

MCML Authors

Ali Modarressi

Hinrich Schütze

Prof. Dr.

Principal Investigator

[239]

M. Fayyaz, A. Modarressi, H. Schütze and N. Peng.
Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid failures. In this work, by repurposing a relation extraction dataset (e.g. Re-DocRED), we design controlled experiments to quantify the impact of heuristic biases, such as favoring shorter documents, in retrievers like Dragon+ and Contriever. Our findings reveal significant vulnerabilities: retrievers often rely on superficial patterns like over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches. Additionally, they tend to overlook whether the document contains the query’s answer, lacking deep semantic understanding. Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document in less than 3% of cases over a biased document without the answer. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34% performance drop compared to not providing any documents at all.

MCML Authors

Ali Modarressi

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[238]

S. Frenda, A. Piergentili, B. Savoldi, M. Madeddu, M. Rosola, S. Casola, C. Ferrando, V. Patti, M. Negri and L. Bentivogli.
GFG - Gender-Fair Generation: A CALAMITA Challenge.
CLiC-it 2024 - 10th Italian Conference on Computational Linguistics. Pisa, Italy, Dec 04-06, 2024. URL

Abstract

Gender-fair language aims at promoting gender equality by using terms and expressions that include all identities and avoid reinforcing gender stereotypes. Implementing gender-fair strategies is particularly challenging in heavily gender-marked languages, such as Italian. To address this, the Gender-Fair Generation challenge intends to help shift toward gender-fair language in written communication. The challenge, designed to assess and monitor the recognition and generation of gender-fair language in both mono- and cross-lingual scenarios, includes three tasks: (1) the detection of gendered expressions in Italian sentences, (2) the reformulation of gendered expressions into gender-fair alternatives, and (3) the generation of gender-fair language in automatic translation from English to Italian. The challenge relies on three different annotated datasets: the GFL-it corpus, which contains Italian texts extracted from administrative documents provided by the University of Brescia; GeNTE, a bilingual test set for gender-neutral rewriting and translation built upon a subset of the Europarl dataset; and Neo-GATE, a bilingual test set designed to assess the use of non-binary neomorphemes in Italian for both fair formulation and translation tasks. Finally, each task is evaluated with specific metrics: average of F1-score obtained by means of BERTScore computed on each entry of the datasets for task 1, an accuracy measured with a gender-neutral classifier, and a coverage-weighted accuracy for tasks 2 and 3.

MCML Authors

Silvia Casola

Dr.

[237]

C. Gruber, H. Alber, B. Bischl, G. Kauermann, B. Plank and M. Aßenmacher.
Revisiting Active Learning under (Human) Label Variation.
Preprint (Jul. 2025). arXiv

Abstract

Access to high-quality labeled data remains a limiting factor in applied supervised learning. While label variation (LV), i.e., differing labels for the same instance, is common, especially in natural language processing, annotation frameworks often still rest on the assumption of a single ground truth. This overlooks human label variation (HLV), the occurrence of plausible differences in annotations, as an informative signal. Similarly, active learning (AL), a popular approach to optimizing the use of limited annotation budgets in training ML models, often relies on at least one of several simplifying assumptions, which rarely hold in practice when acknowledging HLV. In this paper, we examine foundational assumptions about truth and label nature, highlighting the need to decompose observed LV into signal (e.g., HLV) and noise (e.g., annotation error). We survey how the AL and (H)LV communities have addressed – or neglected – these distinctions and propose a conceptual framework for incorporating HLV throughout the AL loop, including instance selection, annotator choice, and label representation. We further discuss the integration of large language models (LLMs) as annotators. Our work aims to lay a conceptual foundation for HLV-aware active learning, better reflecting the complexities of real-world annotation.

MCML Authors

Helen Alber

→ Group Bernd Bischl
Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Director

Applied Statistics in Social Sciences, Economics and Business

Göran Kauermann

Prof. Dr.

Principal Investigator

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Bernd Bischl
Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

[236]

M. Giulianelli, J. Baan, W. Aziz, R. Fernández and B. Plank.
What Comes Next? Evaluating Uncertainty in Neural Text Generators Against Human Production Variability.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

In Natural Language Generation (NLG) tasks, for any input, multiple communicative goals are plausible, and any goal can be put into words, or produced, in multiple ways. We characterise the extent to which human production varies lexically, syntactically, and semantically across four NLG tasks, connecting human production variability to aleatoric or data uncertainty. We then inspect the space of output strings shaped by a generation system’s predicted probability distribution and decoding algorithm to probe its uncertainty. For each test input, we measure the generator’s calibration to human production variability. Following this instance-level approach, we analyse NLG models and decoding strategies, demonstrating that probing a generator with multiple samples and, when possible, multiple references, provides the level of detail necessary to gain understanding of a model’s representation of uncertainty.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[235]

F. Ghorbanpour, D. Dementieva and A. Fraser.
Can Prompting LLMs Unlock Hate Speech Detection across Languages? A Zero-shot and Few-shot Study.
Preprint (May. 2025). arXiv

Abstract

Despite growing interest in automated hate speech detection, most existing approaches overlook the linguistic diversity of online content. Multilingual instruction-tuned large language models such as LLaMA, Aya, Qwen, and BloomZ offer promising capabilities across languages, but their effectiveness in identifying hate speech through zero-shot and few-shot prompting remains underexplored. This work evaluates LLM prompting-based detection across eight non-English languages, utilizing several prompting techniques and comparing them to fine-tuned encoder models. We show that while zero-shot and few-shot prompting lag behind fine-tuned encoder models on most of the real-world evaluation sets, they achieve better generalization on functional tests for hate speech detection. Our study also reveals that prompt design plays a critical role, with each language often requiring customized prompting techniques to maximize performance.

MCML Authors

Faeze Ghorbanpour

Daryna Dementieva

Dr.

→ Group Alexander Fraser
Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[234]

F. Ghorbanpour, D. Dementieva and A. Fraser.
Data-Efficient Hate Speech Detection via Cross-Lingual Nearest Neighbor Retrieval with Limited Labeled Data.
Preprint (May. 2025). arXiv

Abstract

Despite the importance of detecting hateful language, labeled hate speech data is expensive and time-consuming to collect, particularly for low-resource languages. Prior work has demonstrated the effectiveness of cross-lingual transfer learning and data augmentation in improving performance on tasks with limited labeled data. To develop an efficient and scalable cross-lingual transfer learning approach, we leverage nearest-neighbor retrieval to augment minimal labeled data in the target language, thereby enhancing detection performance. Specifically, we assume access to a small set of labeled training instances in the target language and use these to retrieve the most relevant labeled examples from a large multilingual hate speech detection pool. We evaluate our approach on eight languages and demonstrate that it consistently outperforms models trained solely on the target language data. Furthermore, in most cases, our method surpasses the current state-of-the-art. Notably, our approach is highly data-efficient, retrieving as few as 200 instances in some cases while maintaining superior performance. Moreover, it is scalable, as the retrieval pool can be easily expanded, and the method can be readily adapted to new languages and tasks. We also apply maximum marginal relevance to mitigate redundancy and filter out highly similar retrieved instances, resulting in improvements in some languages.
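
The redundancy-filtering step mentioned at the end of the abstract can be illustrated with the textbook form of maximum marginal relevance (MMR). The following is a minimal sketch only: the function names, the cosine similarity measure, and the trade-off weight `lam` are assumptions chosen for illustration, and the abstract does not state how the authors configure MMR.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mmr_select(query, candidates, k, lam=0.5):
    """Greedily pick up to k candidate indices by maximum marginal relevance.

    Each step scores a candidate as
        lam * sim(query, candidate) - (1 - lam) * max sim(candidate, already selected),
    so near-duplicates of already selected items are penalised.
    `lam` is a hypothetical trade-off weight, not a value from the paper.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lam * relevance - (1.0 - lam) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Toy embeddings: candidates 0 and 1 are exact duplicates, candidate 2 differs.
query = [1.0, 0.0]
pool = [[1.0, 0.1], [1.0, 0.1], [0.6, 0.8]]
picked = mmr_select(query, pool, k=2, lam=0.3)
```

With a low `lam`, the duplicate at index 1 is skipped in favour of the less similar candidate at index 2; a `lam` close to 1 reduces the procedure to plain nearest-neighbor ranking.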

MCML Authors

Faeze Ghorbanpour

Daryna Dementieva

Dr.

→ Group Alexander Fraser
Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Bernd Bischl
Statistical Learning and Data Science

[233]

C. Gruber, K. Hechinger, M. Aßenmacher, G. Kauermann and B. Plank.
More Labels or Cases? Assessing Label Variation in Natural Language Inference.
UnImplicit 2024 - 3rd Workshop on Understanding Implicit and Underspecified Language. Malta, Mar 21, 2024. URL

Abstract

In this work, we analyze the uncertainty that is inherently present in the labels used for supervised machine learning in natural language inference (NLI). In cases where multiple annotations per instance are available, neither the majority vote nor the frequency of individual class votes is a trustworthy representation of the labeling uncertainty. We propose modeling the votes via a Bayesian mixture model to recover the data-generating process, i.e., the “true” latent classes, and thus gain insight into the class variations. This will enable a better understanding of the confusion happening during the annotation process. We also assess the stability of the proposed estimation procedure by systematically varying the numbers of i) instances and ii) labels. Thereby, we observe that few instances with many labels can predict the latent class borders reasonably well, while the estimation fails for many instances with only a few labels. This leads us to conclude that multiple labels are a crucial building block for properly analyzing label uncertainty.

MCML Authors

Matthias Aßenmacher

Dr.

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[232]

F. Ghorbanpour, V. Hangya and A. Fraser.
Fine-Grained Transfer Learning for Harmful Content Detection through Label-Specific Soft Prompt Tuning.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. DOI

Abstract

The spread of harmful content online is a dynamic issue evolving over time. Existing detection models, reliant on static data, are becoming less effective and generalizable. Developing new models requires sufficient up-to-date data, which is challenging. A potential solution is to combine existing datasets with minimal new data. However, detection tasks vary: some focus on hate speech, offensive, or abusive content, which differ in the intent to harm, while others focus on identifying targets of harmful speech such as racism, sexism, etc., raising the challenge of handling nuanced class differences. To address these issues, we introduce a novel transfer learning method that leverages class-specific knowledge to enhance harmful content detection. In our approach, we first present label-specific soft prompt tuning, which captures and represents class-level information. Secondly, we propose two approaches to transfer this fine-grained knowledge from source (existing tasks) to target (unseen and new tasks): initializing the target task prompts from source prompts and using an attention mechanism that learns and adjusts attention scores to utilize the most relevant information from source prompts. Experiments demonstrate significant improvements in harmful content detection across English and German datasets, highlighting the effectiveness of label-specific representations and knowledge transfer.

MCML Authors

Faeze Ghorbanpour

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[231]

F. Ghorbanpour, T. Z. Malaguth and A. Akbaritabar.
Differentiating Emigration from Return Migration of Scholars Using Name-Based Nationality Detection Models.
ICWSM 2025 - 19th International AAAI Conference on Web and Social Media. Copenhagen, Denmark, Jun 23-26, 2025. DOI

Abstract

Most web and digital trace data do not include information about an individual’s nationality due to privacy concerns. The lack of data on nationality can create challenges for migration research. It can lead to a left-censoring issue since we are uncertain about the migrant’s country of origin. Once we observe an emigration event, if we know the nationality, we can differentiate it from return migration. We propose methods to detect the nationality with the least available data, i.e., full names. We use the detected nationality in comparison with the country of academic origin, which is a common approach in studying the migration of researchers. We gathered 2.6 million unique name-nationality pairs from Wikipedia and categorized them into families of nationalities with three granularity levels to use as our training data. Using a character-based machine learning model, we achieved a weighted F1 score of 84% for the broadest and 67% for the most granular, country-level categorization. In our empirical study, we used the trained and tested model to assign nationality to 8+ million scholars’ full names in Scopus data. Our results show that using the country of first publication as a proxy for nationality underestimates the size of return flows, especially for countries with a more diverse academic workforce, such as the USA, Australia, and Canada. We found that around 48% of emigration from the USA was return migration once we used the country of name origin in contrast to 33% based on academic origin. In the most recent period, 79% of scholars whose affiliation has consistently changed from the USA to China, and are considered emigrants, have Chinese names in contrast to 41% with a Chinese academic origin. Our proposed methods in addressing left-censoring issues are beneficial for other research that uses digital trace data to study migration.

MCML Authors

Faeze Ghorbanpour

[230]

S. Gerstner and H. Schütze.
Understanding Gated Neurons in Transformers from Their Input-Output Functionality.
Preprint (May. 2025). arXiv

Abstract

Interpretability researchers have attempted to understand MLP neurons of language models based on both the contexts in which they activate and their output weight vectors. They have paid little attention to a complementary aspect: the interactions between input and output. For example, when neurons detect a direction in the input, they might add much the same direction to the residual stream (‘enrichment neurons’) or reduce its presence (‘depletion neurons’). We address this aspect by examining the cosine similarity between input and output weights of a neuron. We apply our method to 12 models and find that enrichment neurons dominate in early-middle layers whereas later layers tend more towards depletion. To explain this finding, we argue that enrichment neurons are largely responsible for enriching concept representations, one of the first steps of factual recall. Our input-output perspective is a complement to activation-dependent analyses and to approaches that treat input and output separately.
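
The core measurement named in the abstract, the cosine similarity between a neuron's input and output weight vectors, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the function names and the cut-off `threshold` are hypothetical, and the sketch ignores the separate gate weights that gated MLPs also have.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_neuron(w_in, w_out, threshold=0.5):
    """Label a neuron from the alignment of its input and output weights.

    A strongly positive cosine means the neuron writes roughly the direction
    it reads ("enrichment"); a strongly negative cosine means it writes the
    opposite direction ("depletion"). The threshold is an assumed value.
    """
    similarity = cosine(w_in, w_out)
    if similarity >= threshold:
        return "enrichment"
    if similarity <= -threshold:
        return "depletion"
    return "other"


# Toy 2-dimensional weight vectors for three hypothetical neurons.
labels = [
    classify_neuron([1.0, 0.0], [2.0, 0.0]),   # same direction
    classify_neuron([1.0, 0.0], [-3.0, 0.0]),  # opposite direction
    classify_neuron([1.0, 0.0], [0.0, 1.0]),   # orthogonal
]
```

Because the statistic depends only on the weights, it can be computed for every neuron of a model without running any inputs through it, which is what makes it complementary to activation-based analyses.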

MCML Authors

Sebastian Gerstner

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[229]

P. Hong, B. Chen, S. Peng, M.-C. de Marneffe and B. Plank.
LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference.
Preprint (May. 2025). arXiv

Abstract

There is increasing evidence of Human Label Variation (HLV) in Natural Language Inference (NLI), where annotators assign different labels to the same premise-hypothesis pair. However, within-label variation, i.e., cases where annotators agree on the same label but provide divergent reasoning, poses an additional and mostly overlooked challenge. Several NLI datasets contain highlighted words in the NLI item as explanations, but the same spans on the NLI item can be highlighted for different reasons, as evidenced by free-text explanations, which offer a window into annotators’ reasoning. To systematically understand this problem and gain insight into the rationales behind NLI labels, we introduce LiTEx, a linguistically-informed taxonomy for categorizing free-text explanations. Using this taxonomy, we annotate a subset of the e-SNLI dataset, validate the taxonomy’s reliability, and analyze how it aligns with NLI labels, highlights, and explanations. We further assess the taxonomy’s usefulness in explanation generation, demonstrating that conditioning generation on LiTEx yields explanations that are linguistically closer to human explanations than those generated using only labels or highlights. Our approach thus not only captures within-label variation but also shows how taxonomy-guided generation for reasoning can bridge the gap between human and model explanations more effectively than existing strategies.

MCML Authors

Beiduo Chen

Siyao Peng

Dr.

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[228]

K. Hämmerl, B. Deiseroth, P. Schramowski, J. Libovický, C. Rothkopf, A. Fraser and K. Kersting.
Speaking Multiple Languages Affects the Moral Bias of Language Models.
Findings @ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI

Abstract

Pre-trained multilingual language models (PMLMs) are commonly used when dealing with data from multiple languages and cross-lingual transfer. However, PMLMs are trained on varying amounts of data for each language. In practice this means their performance is often much better on English than many other languages. We explore to what extent this also applies to moral norms. Do the models capture moral norms from English and impose them on other languages? Do the models exhibit random and thus potentially harmful beliefs in certain languages? Both these issues could negatively impact cross-lingual transfer and potentially lead to harmful outcomes. In this paper, we (1) apply the MORALDIRECTION framework to multilingual models, comparing results in German, Czech, Arabic, Chinese, and English, (2) analyse model behaviour on filtered parallel subtitles corpora, and (3) apply the models to a Moral Foundations Questionnaire, comparing with human responses from different countries. Our experiments demonstrate that, indeed, PMLMs encode differing moral biases, but these do not necessarily correspond to cultural differences or commonalities in human opinions. We release our code and models.

MCML Authors

Katharina Hämmerl

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[227]

V. Hangya and A. Fraser.
LMU at HaSpeeDe3: Multi-Dataset Training for Cross-Domain Hate Speech Detection.
EVALITA 2023 - Final Workshop of the 8th evaluation campaign. Parma, Italy, Sep 07-08, 2023. PDF

Abstract

We describe LMU Munich’s hate speech detection system for participating in the cross-domain track of the HaSpeeDe3 shared task at EVALITA 2023. The task focuses on the politics and religion domains, having no in-domain training data for the latter. Our submission combines multiple training sets from various domains in a multitask prompt-training system. We experimented with both Italian and English source datasets as well as monolingual Italian and multilingual pre-trained language models. We found that the Italian out-of-domain datasets are the most influential on the performance in the test domains and that combining both monolingual and multilingual language models using an ensemble gives the best results. Our system ranked second in both domains.

MCML Authors

Viktor Hangya

Dr.

* Former Member

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[226]

V. Hangya and A. Fraser.
How to Solve Few-Shot Abusive Content Detection Using the Data We Actually Have.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL

Abstract

Due to the broad range of social media platforms, the requirements of abusive language detection systems are varied and ever-changing. A large set of annotated corpora with different properties and label sets has already been created, for tasks such as hate or misogyny detection, but the form and targets of abusive speech are constantly evolving. Since the annotation of new corpora is expensive, in this work we leverage datasets we already have, covering a wide range of tasks related to abusive language detection. Our goal is to build models cheaply for a new target label set and/or language, using only a few training examples of the target domain. We propose a two-step approach: first we train our model in a multitask fashion. We then carry out few-shot adaptation to the target requirements. Our experiments show that using already existing datasets and only a few shots of the target task, the performance of models improves both monolingually and across languages. Our analysis also shows that our models acquire a general understanding of abusive language, since they improve the prediction of labels which are present only in the target dataset and can benefit from knowledge about labels which are not directly used for the target task.

MCML Authors

Viktor Hangya

Dr.

* Former Member

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[225]

K. Hämmerl, A. Fastowski, J. Libovický and A. Fraser.
Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity.
Findings @ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI

Abstract

Previous work has shown that the representations output by contextual language models are more anisotropic than static type embeddings, and typically display outlier dimensions. This seems to be true for both monolingual and multilingual models, although much less work has been done on the multilingual context. Why these outliers occur and how they affect the representations is still an active area of research. We investigate outlier dimensions and their relationship to anisotropy in multiple pre-trained multilingual language models. We focus on cross-lingual semantic similarity tasks, as these are natural tasks for evaluating multilingual representations. Specifically, we examine sentence representations. Sentence transformers which are fine-tuned on parallel resources (that are not always available) perform better on this task, and we show that their representations are more isotropic. However, we aim to improve multilingual representations in general. We investigate how much of the performance difference can be made up by only transforming the embedding space without fine-tuning, and visualise the resulting spaces. We test different operations: Removing individual outlier dimensions, cluster-based isotropy enhancement, and ZCA whitening. We publish our code for reproducibility.

MCML Authors

Katharina Hämmerl

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[224]

K. Hämmerl, J. Libovický and A. Fraser.
Understanding Cross-Lingual Alignment—A Survey.
Findings @ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Cross-lingual alignment, the meaningful similarity of representations across languages in multilingual language models, has been an active field of research in recent years. We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field. We present different understandings of cross-lingual alignment and their limitations. We provide a qualitative summary of results from a number of surveyed papers. Finally, we discuss how these insights may be applied not only to encoder models, where this topic has been heavily studied, but also to encoder-decoder or even decoder-only models, and argue that an effective trade-off between language-neutral and language-specific information is key.

MCML Authors

Katharina Hämmerl

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Volker Tresp
Database Systems, Data Mining and AI

[223]

Z. Han, R. Liao, J. Gu, Y. Zhang, Z. Ding, Y. Gu, H. Köppl, H. Schütze and V. Tresp.
ECOLA: Enhancing Temporal Knowledge Embeddings with Contextualized Language Representations.
Findings @ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI

Abstract

Since conventional knowledge embedding models cannot take full advantage of the abundant textual information, there have been extensive research efforts in enhancing knowledge embedding using texts. However, existing enhancement approaches cannot apply to temporal knowledge graphs (tKGs), which contain time-dependent event knowledge with complex temporal dynamics. Specifically, existing enhancement approaches often assume knowledge embedding is time-independent. In contrast, the entity embedding in tKG models usually evolves, which poses the challenge of aligning temporally relevant texts with entities. To this end, we propose to study enhancing temporal knowledge embedding with textual data in this paper. As an approach to this task, we propose Enhanced Temporal Knowledge Embeddings with Contextualized Language Representations (ECOLA), which takes the temporal aspect into account and injects textual information into temporal knowledge embedding. To evaluate ECOLA, we introduce three new datasets for training and evaluating ECOLA. Extensive experiments show that ECOLA significantly enhances temporal KG embedding models with up to 287% relative improvements regarding Hits@1 on the link prediction task.

MCML Authors

Ruotong Liao

Yao Zhang

→ Group Volker Tresp
Database Systems, Data Mining and AI

Zifeng Ding

→ Group Volker Tresp
Database Systems, Data Mining and AI

Hinrich Schütze

Prof. Dr.

Principal Investigator

Volker Tresp

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[222]

K. Hämmerl, T. Limisiewicz, J. Libovický and A. Fraser.
Beyond Literal Token Overlap: Token Alignability for Multilinguality.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. DOI

Abstract

Previous work has considered token overlap, or even similarity of token distributions, as predictors for multilinguality and cross-lingual knowledge transfer in language models. However, these very literal metrics assign large distances to language pairs with different scripts, which can nevertheless show good cross-linguality. This limits the explanatory strength of token overlap for knowledge transfer between language pairs that use distinct scripts or follow different orthographic conventions. In this paper, we propose subword token alignability as a new way to understand the impact and quality of multilingual tokenisation. In particular, this metric predicts multilinguality much better when scripts are disparate and the overlap of literal tokens is low. We analyse this metric in the context of both encoder and decoder models, look at data size as a potential distractor, and discuss how this insight may be applied to multilingual tokenisation in future work. We recommend our subword token alignability metric for identifying optimal language pairs for cross-lingual transfer, as well as to guide the construction of better multilingual tokenisers in the future. We publish our code and reproducibility details.

MCML Authors

Katharina Hämmerl

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[221]

K. Hämmerl, A. Manea, G. Vico, J. Helcl and J. Libovický.
CUNI and LMU Submission to the MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval.
MRL @EMNLP 2024 - 4th Multilingual Representation Learning Workshop at the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

We present the joint CUNI and LMU submission to the MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval. The shared task objective was to explore how we can deploy modern methods in NLP in multi-lingual low-resource settings, tested on two sub-tasks: Named-entity recognition and question answering. Our solutions to the subtasks are based on data acquisition and model adaptation. We compare the performance of our submitted systems with the translate-test approach which proved to be the most useful in the previous edition of the shared task. Our results show that using more data as well as fine-tuning recent multilingual pre-trained models leads to considerable improvements over the translate-test baseline.

MCML Authors

Katharina Hämmerl

[220]

A. D. Hakimi, A. Modarressi, P. Wicke and H. Schütze.
Time Course MechInterp: Analyzing the Evolution of Components and Knowledge in Large Language Models.
Findings @ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Understanding how large language models (LLMs) acquire and store factual knowledge is crucial for enhancing their interpretability and reliability. In this work, we analyze the evolution of factual knowledge representation in the OLMo-7B model by tracking the roles of its attention heads and feed forward networks (FFNs) over the course of pre-training. We classify these components into four roles: general, entity, relation-answer, and fact-answer specific, and examine their stability and transitions. Our results show that LLMs initially depend on broad, general-purpose components, which later specialize as training progresses. Once the model reliably predicts answers, some components are repurposed, suggesting an adaptive learning process. Notably, attention heads display the highest turnover. We also present evidence that FFNs remain more stable throughout training. Furthermore, our probing experiments reveal that location-based relations converge to high accuracy earlier in training than name-based relations, highlighting how task complexity shapes acquisition dynamics. These insights offer a mechanistic view of knowledge formation in LLMs.

MCML Authors

Ahmad Dawar Hakimi

Ali Modarressi

Philipp Wicke

Dr.

Hinrich Schütze

Prof. Dr.

Principal Investigator

[219]

L. He, E. Nie, S. S. Dindar, A. Firoozi, A. Florea, V. Nguyen, C. Puffay, R. Shimizu, H. Ye, J. Brennan, H. Schmid, H. Schütze and N. Mesgarani.
XCOMPS: A Multilingual Benchmark of Conceptual Minimal Pairs.
SIGTYP @ACL 2025 - 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP at the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

We introduce XCOMPS in this work, a multilingual conceptual minimal pair dataset covering 17 languages. Using this dataset, we evaluate LLMs’ multilingual conceptual understanding through metalinguistic prompting, direct probability measurement, and neurolinguistic probing. By comparing base, instruction-tuned, and knowledge-distilled models, we find that: 1) LLMs exhibit weaker conceptual understanding for low-resource languages, and accuracy varies across languages despite being tested on the same concept sets. 2) LLMs excel at distinguishing concept-property pairs that are visibly different but exhibit a marked performance drop when negative pairs share subtle semantic similarities. 3) Instruction tuning improves performance in concept understanding but does not enhance internal competence; knowledge distillation can enhance internal competence in conceptual understanding for low-resource languages with limited gains in explicit task performance. 4) More morphologically complex languages yield lower concept understanding scores and require deeper layers for conceptual reasoning.

MCML Authors

Ercong Nie

Haotian Ye

Hinrich Schütze

Prof. Dr.

Principal Investigator

[218]

L. Hagström, E. Nie, R. Halifa, H. Schmid, R. Johansson and A. Junge.
Language Model Re-rankers are Fooled by Lexical Similarities.
FEVER @ACL 2025 - 8th Fact Extraction and VERification Workshop at the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Language model (LM) re-rankers are used to refine retrieval results for retrieval-augmented generation (RAG). They are more expensive than lexical matching methods like BM25 but assumed to better process semantic information and the relations between the query and the retrieved answers. To understand whether LM re-rankers always live up to this assumption, we evaluate 6 different LM re-rankers on the NQ, LitQA2 and DRUID datasets. Our results show that LM re-rankers struggle to outperform a simple BM25 baseline on DRUID. Leveraging a novel separation metric based on BM25 scores, we explain and identify re-ranker errors stemming from lexical dissimilarities. We also investigate different methods to improve LM re-ranker performance and find these methods mainly useful for NQ. Taken together, our work identifies and explains weaknesses of LM re-rankers and points to the need for more adversarial and realistic datasets for their evaluation.
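As a rough illustration of the lexical baseline the abstract refers to, the following is a minimal sketch of Okapi BM25 scoring. The function name `bm25_scores`, the whitespace tokenisation, the parameter values `k1=1.5` and `b=0.75`, and the non-negative IDF variant are assumptions made for this example; the abstract does not specify the configuration used in the paper.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a common BM25 variant.

    Hypothetical helper: tokenisation is a plain lowercase whitespace split.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms without an exact match contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]
scores = bm25_scores("cat mat", docs)
```

In this toy setup the second document scores zero even though it mentions "cats", because BM25 only rewards exact token matches. This is the kind of purely lexical signal the abstract contrasts with LM re-rankers.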

MCML Authors

Ercong Nie

[217]

L. He, E. Nie, H. Schmid, H. Schütze, N. Mesgarani and J. Brennan.
Large Language Models as Neurolinguistic Subjects: Discrepancy between Performance and Competence.
Findings @ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

This study investigates the linguistic understanding of Large Language Models (LLMs) regarding signifier (form) and signified (meaning) by distinguishing two LLM assessment paradigms: psycholinguistic and neurolinguistic. Traditional psycholinguistic evaluations often reflect statistical rules that may not accurately represent LLMs’ true linguistic competence. We introduce a neurolinguistic approach, utilizing a novel method that combines minimal pair and diagnostic probing to analyze activation patterns across model layers. This method allows for a detailed examination of how LLMs represent form and meaning, and whether these representations are consistent across languages. We found: (1) Psycholinguistic and neurolinguistic methods reveal that language performance and competence are distinct; (2) Direct probability measurement may not accurately assess linguistic competence; (3) Instruction tuning changes competence little but improves performance; (4) LLMs exhibit higher competence and performance in form compared to meaning. Additionally, we introduce new conceptual minimal pair datasets for Chinese (COMPS-ZH) and German (COMPS-DE), complementing existing English datasets.

MCML Authors

Ercong Nie

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[216]

V. Hangya, H. S. Saadi and A. Fraser.
Improving Low-Resource Languages in Pre-Trained Multilingual Language Models.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Pre-trained multilingual language models are the foundation of many NLP approaches, including cross-lingual transfer solutions. However, languages with small available monolingual corpora are often not well-supported by these models leading to poor performance. We propose an unsupervised approach to improve the cross-lingual representations of low-resource languages by bootstrapping word translation pairs from monolingual corpora and using them to improve language alignment in pre-trained language models. We perform experiments on nine languages, using contextual word retrieval and zero-shot named entity recognition to measure both intrinsic cross-lingual word representation quality and downstream task performance, showing improvements on both tasks. Our results show that it is possible to improve pre-trained multilingual language models by relying only on non-parallel resources.

MCML Authors

Viktor Hangya

Dr.

* Former Member

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[215]

V. Hangya, S. Severini, R. Ralev, A. Fraser and H. Schütze.
Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages.
MRL @EMNLP 2023 - 3rd Workshop on Multi-lingual Representation Learning at the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

Very low-resource languages, having only a few million tokens worth of data, are not well-supported by multilingual NLP approaches due to poor quality cross-lingual word representations. Recent work showed that good crosslingual performance can be achieved if a source language is related to the low-resource target language. However, not all language pairs are related. In this paper, we propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach, that incorporates intermediate related languages to bridge the gap between the distant source and target. We build MWEs one language at a time by starting from the resource rich source and sequentially adding each language in the chain till we reach the target. We extend a semi-joint bilingual approach to multiple languages in order to eliminate the main weakness of previous works, i.e., independently trained monolingual embeddings, by anchoring the target language around the multilingual space. We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (≤ 5M tokens) and 4 moderately low-resource (≤ 50M) target languages, showing improved performance in both categories. Additionally, our analysis reveals the importance of good quality embeddings for intermediate languages as well as the importance of leveraging anchor points from all languages in the multilingual space.

MCML Authors

Viktor Hangya

Dr.

* Former Member

Alexander Fraser

Prof. Dr.

Principal Investigator

Hinrich Schütze

Prof. Dr.

Principal Investigator

[214]

M. A. Hedderich, A. Wang, R. Zhao, F. Eichin, J. Fischer and B. Plank.
What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Prompt engineering for large language models is challenging, as even small prompt perturbations or model changes can significantly impact the generated output texts. Existing evaluation methods, either automated metrics or human evaluation, have limitations, such as providing limited insights or being labor-intensive. We propose Spotlight, a new approach that combines both automation and human analysis. Based on data mining techniques, we automatically distinguish between random (decoding) variations and systematic differences in language model outputs. This process provides token patterns that describe the systematic differences and guide the user in manually analyzing the effects of their prompt and model changes efficiently. We create three benchmarks to quantitatively test the reliability of token pattern extraction methods and demonstrate that our approach provides new insights into established prompt data. From a human-centric perspective, through demonstration studies and a user study, we show that our token pattern approach helps users understand the systematic differences of language model outputs, and we are able to discover relevant differences caused by prompt and model changes (e.g. related to gender or culture), thus supporting the prompt engineering process and human-centric model behavior research.

MCML Authors

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

→ Group Michael Hedderich
AI and Computational Linguistics

Raoyuan Zhao

Florian Eichin

→ Group Michael Hedderich
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

[213]

L. Hirlimann, S. Zhang, H. Schütze and P. Wicke.
Robustness Testing of Multi-Modal Models in Varied Home Environments for Assistive Robots.
Preprint (Jun. 2024). arXiv

Abstract

The development of assistive robotic agents to support household tasks is advancing, yet the underlying models often operate in virtual settings that do not reflect real-world complexity. For assistive care robots to be effective in diverse environments, their models must be robust and integrate multiple modalities. Consider a caretaker needing assistance in a dimly lit room or navigating around a newly installed glass door. Models relying solely on visual input might fail in low light, while those using depth information could avoid the door. This demonstrates the necessity for models that can process various sensory inputs. Our ongoing study evaluates state-of-the-art robotic models in the AI2Thor virtual environment. We introduce disturbances, such as dimmed lighting and mirrored walls, to assess their impact on modalities like movement or vision, and object recognition. Our goal is to gather input from the Geriatronics community to understand and model the challenges faced by practitioners.

MCML Authors

Lea Hirlimann

Shengqiang Zhang

Hinrich Schütze

Prof. Dr.

Principal Investigator

Philipp Wicke

Dr.

[212]

A. Imani, P. Lin, A. H. Kargaran, S. Severini, M. J. Sabet, N. Kassner, C. Ma, H. Schmid, A. Martins, F. Yvon and H. Schütze.
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages.
ACL 2023 - 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI GitHub

Abstract

The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality including corpus size, script, ‘help’ from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world’s languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures.

MCML Authors

Ayyoob Imani

Peiqin Lin

Amir Hossein Kargaran

Nora Kassner

* Former Member

Chunlan Ma

Hinrich Schütze

Prof. Dr.

Principal Investigator

[211]

A. Imani, M. J. Sabet, P. Dufter, M. Cysouw and H. Schütze.
ParCourE: A Parallel Corpus Explorer for a Massively Multilingual Corpus.
ACL-IJCNLP 2021 - Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Bangkok, Thailand, Aug 01-06, 2021. DOI

Abstract

With more than 7000 languages worldwide, multilingual natural language processing (NLP) is essential both from an academic and commercial perspective. Researching typological properties of languages is fundamental for progress in multilingual NLP. Examples include assessing language similarity for effective transfer learning, injecting inductive biases into machine learning models or creating resources such as dictionaries and inflection tables. We provide ParCourE, an online tool that allows users to browse a word-aligned parallel corpus covering 1334 languages. We give evidence that this is useful for typological research. ParCourE can be set up for any parallel corpus and can thus be used for typological research on other corpora as well as for exploring their quality and properties.

MCML Authors

Ayyoob Imani

Masoud Jalili Sabet

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[210]

A. Imani, M. J. Sabet, L. K. Senel, P. Philipp, F. Yvon and H. Schütze.
Graph Algorithms for Multiparallel Word Alignment.
EMNLP 2021 - Conference on Empirical Methods in Natural Language Processing. Punta Cana, Dominican Republic, Nov 07-11, 2021. DOI

Abstract

With the advent of end-to-end deep learning approaches in machine translation, interest in word alignments initially decreased; however, they have again become a focus of research more recently. Alignments are useful for typological research, transferring formatting like markup to translated texts, and can be used in the decoding of machine translation systems. At the same time, massively multilingual processing is becoming an important NLP scenario, and pretrained language and machine translation models that are truly multilingual are proposed. However, most alignment algorithms rely on bitexts only and do not leverage the fact that many parallel corpora are multiparallel. In this work, we exploit the multiparallelity of corpora by representing an initial set of bilingual alignments as a graph and then predicting additional edges in the graph. We present two graph algorithms for edge prediction: one inspired by recommender systems and one based on network link prediction. Our experimental results show absolute improvements in F1 of up to 28% over the baseline bilingual word aligner in different datasets.

MCML Authors

Ayyoob Imani

Masoud Jalili Sabet

Dr.

* Former Member

Lütfi Kerem Senel

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[209]

A. Imani, S. Severini, M. J. Sabet, F. Yvon and H. Schütze.
Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack labeled data for training. An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring from high-resource languages. In this paper, we propose a novel method for transferring labels from multiple high-resource source to low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we create a graph with words as nodes and alignment links as edges by aligning words for all language pairs. We then propagate node labels from source to target using a Graph Neural Network augmented with transformer layers. We show that our propagation creates training sets that allow us to train POS taggers for a diverse set of languages. When combined with enhanced contextualized embeddings, our method achieves a new state-of-the-art for unsupervised POS tagging of low-resource languages.
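The graph construction described above (words as nodes, alignment links as edges, labels flowing from source to target) can be sketched as follows. Note that this example swaps in a plain majority vote over aligned source words in place of the paper's Graph Neural Network with transformer layers. The function name `project_pos_tags`, the node encoding as `(language, position, word)` tuples, and the placeholder target language `xx` are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def project_pos_tags(edges, source_tags):
    """Project POS tags onto unlabeled nodes of a word alignment graph.

    edges: iterable of (node_a, node_b) alignment links.
    source_tags: dict mapping labeled (source-language) nodes to POS tags.
    Simplification: one round of majority voting, not a trained GNN.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    projected = {}
    for node, linked in neighbours.items():
        if node in source_tags:
            continue  # source nodes already carry gold labels
        votes = Counter(source_tags[n] for n in linked if n in source_tags)
        if votes:
            projected[node] = votes.most_common(1)[0][0]
    return projected


# Toy sentence in three source languages aligned to a target language "xx".
source_tags = {
    ("en", 0, "dogs"): "NOUN", ("en", 1, "bark"): "VERB",
    ("de", 0, "Hunde"): "NOUN", ("de", 1, "bellen"): "VERB",
    ("fr", 1, "aboient"): "VERB",
}
edges = [
    (("en", 0, "dogs"), ("xx", 0, "w1")),
    (("de", 0, "Hunde"), ("xx", 0, "w1")),
    (("fr", 1, "aboient"), ("xx", 0, "w1")),  # a noisy alignment link
    (("en", 1, "bark"), ("xx", 1, "w2")),
    (("de", 1, "bellen"), ("xx", 1, "w2")),
]
projected = project_pos_tags(edges, source_tags)
```

With several source languages voting, the single noisy link from the French verb is outvoted. This illustrates one reason to combine multiple sources, although a learned propagation model can weigh links far more flexibly than a vote.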

MCML Authors

Ayyoob Imani

Masoud Jalili Sabet

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[208]

A. Imani, L. K. Senel, M. Sabet, F. Yvon and H. Schütze.
Graph Neural Networks for Multiparallel Word Alignment.
Preprint (Mar. 2022). arXiv

Abstract

After a period of decrease, interest in word alignments is increasing again for their usefulness in domains such as typological research, cross-lingual annotation projection, and machine translation. Generally, alignment algorithms only use bitext and do not make use of the fact that many parallel corpora are multiparallel. Here, we compute high-quality word alignments between multiple language pairs by considering all language pairs together. First, we create a multiparallel word alignment graph, joining all bilingual word alignment pairs in one graph. Next, we use graph neural networks (GNNs) to exploit the graph structure. Our GNN approach (i) utilizes information about the meaning, position, and language of the input words, (ii) incorporates information from multiple parallel sentences, (iii) adds and removes edges from the initial alignments, and (iv) yields a prediction model that can generalize beyond the training sentences. We show that community detection provides valuable information for multiparallel word alignment. Our method outperforms previous work on three word-alignment datasets and on a downstream task.

MCML Authors

Ayyoob Imani

Lütfi Kerem Senel

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[207]

J. Jungmaier, N. Kassner and B. Roth.
Dirichlet-Smoothed Word Embeddings for Low-Resource Settings.
LREC 2020 - 12th International Conference on Language Resources and Evaluation. Marseille, France, May 13-15, 2020. URL

Abstract

Nowadays, classical count-based word embeddings using positive pointwise mutual information (PPMI) weighted co-occurrence matrices have been widely superseded by machine-learning-based methods like word2vec and GloVe. But these methods are usually applied using very large amounts of text data. In many cases, however, there is not much text data available, for example for specific domains or low-resource languages. This paper revisits PPMI by adding Dirichlet smoothing to correct its bias towards rare words. We evaluate on standard word similarity data sets and compare to word2vec and the recent state of the art for low-resource settings: Positive and Unlabeled (PU) Learning for word embeddings. The proposed method outperforms PU-Learning for low-resource settings and obtains competitive results for Maltese and Luxembourgish.
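To make the idea concrete, here is a minimal sketch of PPMI weighting with additive smoothing. It assumes that Dirichlet smoothing amounts to adding a pseudo-count `alpha` to every cell of the co-occurrence matrix before computing PMI; the paper's exact formulation may differ. The function name `ppmi` and the toy counts are hypothetical.

```python
import math


def ppmi(counts, alpha=0.0):
    """Compute a PPMI matrix from raw co-occurrence counts.

    counts: list of rows (target words) by columns (context words).
    alpha: pseudo-count added to every cell (assumed form of the smoothing).
    """
    smoothed = [[c + alpha for c in row] for row in counts]
    total = sum(sum(row) for row in smoothed)
    row_sums = [sum(row) for row in smoothed]
    col_sums = [sum(col) for col in zip(*smoothed)]

    result = []
    for i, row in enumerate(smoothed):
        out = []
        for j, c in enumerate(row):
            if c == 0:
                out.append(0.0)  # PMI is undefined for zero counts; clip to 0
            else:
                pmi = math.log(c * total / (row_sums[i] * col_sums[j]))
                out.append(max(0.0, pmi))
        result.append(out)
    return result


# Row 0 is a frequent word, row 1 a rare word seen once with context 0.
counts = [[50, 50], [1, 0]]
unsmoothed = ppmi(counts)
smoothed = ppmi(counts, alpha=1.0)
```

In this toy matrix the single observation of the rare word receives a high unsmoothed PPMI score, and the pseudo-count pulls that score down. This is the bias towards rare words that the smoothing is meant to correct.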

MCML Authors

Nora Kassner

* Former Member

[206]

S. Ji, Z. Li, I. Paul, J. Paavola, P. Lin, P. Chen, D. O'Brien, H. Luo, H. Schütze, J. Tiedemann and B. Haddow.
EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models.
Preprint (Sep. 2024). arXiv

Abstract

In this work, we introduce EMMA-500, a large-scale multilingual language model continue-trained on texts across 546 languages designed for enhanced multilingual performance, focusing on improving language coverage for low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks and PolyWrite, an open-ended generation benchmark developed in this study. Our results highlight the effectiveness of continual pre-training in expanding large language models’ language capacity, particularly for underrepresented languages, demonstrating significant gains in cross-lingual transfer, task generalization, and language adaptability.

MCML Authors

Peiqin Lin

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[205]

D. N. Jakobi, M. Stegenwallner-Schütz, N. Hollenstein, C. Ding, R. Kaspere, A. M. Škorić, E. Pavlinusic Vilus, S. Frank, M.-L. Müller, K. M. Jensen de López, N. Kharlamov, H. B. Søndergaard Knudsen, Y. Berzak, E. Lion, I. A. Sekerina, C. Acarturk, M. F. Ansari, K. Harezlak, P. Kasprowski, A. Bautista, L. Beinborn, A. Bondar, A. Boznou, L. Bradshaw, J. M. Hofmann, T. Krosness, N. B. Soliva, A. Çepani, K. Cergol, A. Došen, M. Palmovic, A. Çerpja, D. Chirino, J. Chromý, V. Demberg, I. Škrjanec, N. D. Deniz, I. Fajardo, M. Giménez-Salvador, X. Mínguez-López, M. Filip, Z. Freibergs, J. Gomes, A. Janeiro, P. Luegi, J. Veríssimo, S. Gramatikov, J. Hasenäcker, A. Haveriku, N. Kote, M. M. Kamal, H. Kędzierska, D. Klimek-Jankowska, S. Kosutar, D. G. Krakowczyk, I. Krejtz, M. Łockiewicz, K. Lõo, J. Motiejūnienė, J. A. Nasir, J. S. Krog Nedergård, A. Özkan, M. Preininger, L. Pungă, D. R. Reich, C. Tschirner, Š. Rot, A. Säuberli, J. Solé-Casals, E. Strati, I. Svoboda, E. Trandafili, S. Varlokosta, M. Vulchanova and L. A. .
MultiplEYE: Creating a multilingual eye-tracking-while-reading corpus.
ETRA 2025 - ACM Symposium on Eye Tracking Research and Applications. Tokyo, Japan, May 26-29, 2025. DOI

Abstract

Eye-tracking-while-reading data provide valuable insights across multiple disciplines, including psychology, linguistics, natural language processing, education, and human-computer interaction. Despite its potential, the availability of large, high-quality, multilingual datasets remains limited, hindering both foundational reading research and advancements in applications. The MultiplEYE project addresses this gap by establishing a large-scale, international eye-tracking data collection initiative. It aims to create a multilingual dataset of eye movements recorded during natural reading, balancing linguistic diversity, while ensuring methodological consistency for reliable cross-linguistic comparisons. The dataset spans numerous languages and follows strict procedural, documentation, and data pre-processing standards to enhance eye-tracking data transparency and reproducibility. A novel data-sharing framework, integrated with data quality reports, allows for selective data filtering based on research needs. Researchers and labs worldwide are invited to join the initiative. By establishing and promoting standardized practices and open data sharing, MultiplEYE facilitates interdisciplinary research and advances reading research and gaze-augmented applications.

MCML Authors

Andreas Säuberli

[204]

A. Köksal, R. Aksitov and C.-C. Chang.
Hallucination Augmented Recitations for Language Models.
Preprint (Nov. 2023). arXiv

Abstract

Attribution is a key concept in large language models (LLMs) as it enables control over information sources and enhances the factuality of LLMs. While existing approaches utilize open book question answering to improve attribution, factual datasets may reward language models for recalling facts that they already know from their pretraining data rather than for attribution. In contrast, counterfactual open book QA datasets would further improve attribution because the answer could only be grounded in the given text. We propose Hallucination Augmented Recitations (HAR) for creating counterfactual datasets by utilizing hallucination in LLMs to improve attribution. For open book QA as a case study, we demonstrate that models finetuned with our counterfactual datasets improve text grounding, leading to better open book QA performance, with up to an 8.0% increase in F1 score. Our counterfactual dataset leads to significantly better performance than using human-annotated factual datasets, even with 4x smaller datasets and 4x smaller models. We observe that improvements are consistent across various model sizes and datasets, including multi-hop, biomedical, and adversarial QA datasets.

MCML Authors

Abdullatif Köksal

* Former Member

[203]

A. Karamolegkou, A. Borah, E. Cho, S. R. Choudhury, M. Galletti, R. Ghosh, P. Gupta, O. Ignat, P. Kargupta, N. Kotonya, H. Lamba, S.-J. Lee, A. Mangla, I. Mondal, D. Nazarova, P. Nemkova, D. Pisarevskaya, N. Rizwan, N. Sabri, D. Stammbach, A. Steinberg, D. Tomás, S. R. Wilson, B. Yi, J. H. Zhu, A. Zubiaga, A. Søgaard, A. Fraser, Z. Jin, R. Mihalcea, J. R. Tetreault and D. Dementieva.
NLP for Social Good: A Survey of Challenges, Opportunities, and Responsible Deployment.
Preprint (May 2025). arXiv

Abstract

Recent advancements in large language models (LLMs) have unlocked unprecedented possibilities across a range of applications. However, as a community, we believe that the field of Natural Language Processing (NLP) has a growing need to approach deployment with greater intentionality and responsibility. In alignment with the broader vision of AI for Social Good (Tomašev et al., 2020), this paper examines the role of NLP in addressing pressing societal challenges. Through a cross-disciplinary analysis of social goals and emerging risks, we highlight promising research directions and outline challenges that must be addressed to ensure responsible and equitable progress in NLP4SG research.

MCML Authors

Anna Steinberg

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

Daryna Dementieva

Dr.

[202]

X. Krückl, V. Blaschke and B. Plank.
Improving Dialectal Slot and Intent Detection with Auxiliary Tasks: A Multi-Dialectal Bavarian Case Study.
VarDial @COLING 2025 - 12th Workshop on NLP for Similar Languages, Varieties and Dialects at the 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL

Abstract

Reliable slot and intent detection (SID) is crucial in natural language understanding for applications like digital assistants. Encoder-only transformer models fine-tuned on high-resource languages generally perform well on SID. However, they struggle with dialectal data, where no standardized form exists and training data is scarce and costly to produce. We explore zero-shot transfer learning for SID, focusing on multiple Bavarian dialects, for which we release a new dataset for the Munich dialect. We evaluate models trained on auxiliary tasks in Bavarian, and compare joint multi-task learning with intermediate-task training. We also compare three types of auxiliary tasks: token-level syntactic tasks, named entity recognition (NER), and language modelling. We find that the included auxiliary tasks have a more positive effect on slot filling than intent classification (with NER having the most positive effect), and that intermediate-task training yields more consistent performance gains. Our best-performing approach improves intent classification performance on Bavarian dialects by 5.1 and slot filling F1 by 8.4 percentage points.

MCML Authors

Verena Blaschke

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

[201]

N. Kassner, P. Dufter and H. Schütze.
Multilingual LAMA: Investigating Knowledge in Multilingual Pretrained Language Models.
EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics. Virtual, Apr 19-23, 2021. DOI

Abstract

Recently, it has been found that monolingual English language models can be used as knowledge bases. Instead of structural knowledge base queries, masked sentences such as “Paris is the capital of [MASK]” are used as probes. We translate the established benchmarks TREx and GoogleRE into 53 languages. Working with mBERT, we investigate three questions. (i) Can mBERT be used as a multilingual knowledge base? Most prior work only considers English. Extending research to multiple languages is important for diversity and accessibility. (ii) Is mBERT’s performance as knowledge base language-independent or does it vary from language to language? (iii) A multilingual model is trained on more text, e.g., mBERT is trained on 104 Wikipedias. Can mBERT leverage this for better performance? We find that using mBERT as a knowledge base yields varying performance across languages and pooling predictions across languages improves performance. Conversely, mBERT exhibits a language bias; e.g., when queried in Italian, it tends to predict Italy as the country of origin.

MCML Authors

Nora Kassner

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[200]

L. Kinder, L. Edman, A. Fraser and T. Käfer.
Positional Overload: Positional Debiasing and Context Window Extension for Large Language Models using Set Encoding.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Large Language Models (LLMs) typically track the order of tokens using positional encoding, which causes the following problems: positional bias, where the model is influenced by an ordering within the prompt, and a fixed context window, as models struggle to generalize to positions beyond those encountered during training. To address these limitations, we developed a novel method called set encoding. This method allows multiple pieces of text to be encoded in the same position, thereby eliminating positional bias entirely. Another promising use case for set encoding is to increase the size of the input an LLM can handle. Our experiments demonstrate that set encoding allows an LLM to solve tasks with far more tokens than without set encoding. To our knowledge, set encoding is the first technique to effectively extend an LLM’s context window without requiring any additional training.

MCML Authors

Lukas Edman

Dr.

Alexander Fraser

Prof. Dr.

Principal Investigator

[199]

A. H. Kargaran, A. Imani, F. Yvon and H. Schütze.
GlotLID: Language Identification for Low-Resource Languages.
Findings @EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI GitHub

Abstract

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable and (iii) efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguage vs varieties and in general noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures.

MCML Authors

Amir Hossein Kargaran

Ayyoob Imani

Hinrich Schütze

Prof. Dr.

Principal Investigator

[198]

N. Kassner, B. Krojer and H. Schütze.
Are Pretrained Language Models Symbolic Reasoners over Knowledge?
CoNLL 2020 - 24th Conference on Computational Natural Language Learning. Virtual, Nov 19-20, 2020. DOI

Abstract

How can pretrained language models (PLMs) learn factual knowledge from the training set? We investigate the two most important mechanisms: reasoning and memorization. Prior work has attempted to quantify the number of facts PLMs learn, but we present, using synthetic data, the first study that investigates the causal relation between facts present in training and facts learned by the PLM. For reasoning, we show that PLMs seem to learn to apply some symbolic reasoning rules correctly but struggle with others, including two-hop reasoning. Further analysis suggests that even the application of learned reasoning rules is flawed. For memorization, we identify schema conformity (facts systematically supported by other facts) and frequency as key factors for its success.

MCML Authors

Nora Kassner

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[197]

A. H. Kargaran, Y. Liu, F. Yvon and H. Schütze.
How Programming Concepts and Neurons Are Shared in Code Language Models.
Findings @ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL GitHub

Abstract

Several studies have explored the mechanisms of large language models (LLMs) in coding tasks, but most have focused on programming languages (PLs) in a monolingual setting. In this paper, we investigate the relationship between multiple PLs and English in the concept space of LLMs. We perform a few-shot translation task on 21 PL pairs using two Llama-based models. By decoding the embeddings of intermediate layers during this task, we observe that the concept space is closer to English (including PL keywords) and assigns high probabilities to English tokens in the second half of the intermediate layers. We analyze neuron activations for 11 PLs and English, finding that while language-specific neurons are primarily concentrated in the bottom layers, those exclusive to each PL tend to appear in the top layers. For PLs that are highly aligned with multiple other PLs, identifying language-specific neurons is not feasible. These PLs also tend to have a larger keyword set than other PLs and are closer to the model’s concept space regardless of the input/output PL in the translation task. Our findings provide insights into how LLMs internally represent PLs, revealing structural patterns in the model’s concept space.

MCML Authors

Amir Hossein Kargaran

Yihong Liu

Hinrich Schütze

Prof. Dr.

Principal Investigator

[196]

A. H. Kargaran, A. Modarressi, N. Nikeghbal, J. Diesner, F. Yvon and H. Schütze.
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment.
Findings @ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

English-centric large language models (LLMs) often show strong multilingual capabilities. However, the multilingual performance of these models remains unclear and is not thoroughly evaluated for many languages. Most benchmarks for multilinguality focus on classic NLP tasks, or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages the fact that English-centric LLMs use English as a kind of pivot language in their intermediate layers. It computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in other languages. We conduct studies using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves a statistically significant average Pearson correlation of 0.90 with three established downstream tasks across nine models and two parallel datasets. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs.

MCML Authors

Amir Hossein Kargaran

Ali Modarressi

Hinrich Schütze

Prof. Dr.

Principal Investigator

[195]

N. Kassner and H. Schütze.
Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly.
ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics. Virtual, Jul 05-10, 2020. DOI

Abstract

Building on Petroni et al. 2019, we propose two new probing tasks analyzing factual knowledge stored in Pretrained Language Models (PLMs). (1) Negation. We find that PLMs do not distinguish between negated (“Birds cannot [MASK]”) and non-negated (“Birds can [MASK]”) cloze questions. (2) Mispriming. Inspired by priming methods in human psychology, we add “misprimes” to cloze questions (“Talk? Birds can [MASK]”). We find that PLMs are easily distracted by misprimes. These results suggest that PLMs still have a long way to go to adequately learn human-like factual knowledge.

MCML Authors

Nora Kassner

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[194]

N. Kassner and H. Schütze.
BERT-kNN: Adding a kNN Search Component to Pretrained Language Models for Better QA.
Findings @EMNLP 2020 - Findings of the Conference on Empirical Methods in Natural Language Processing. Virtual, Nov 16-20, 2020. DOI

Abstract

Khandelwal et al. (2020) use a k-nearest-neighbor (kNN) component to improve language model performance. We show that this idea is beneficial for open-domain question answering (QA). To improve the recall of facts encountered during training, we combine BERT (Devlin et al., 2019) with a traditional information retrieval step (IR) and a kNN search over a large datastore of an embedded text collection. Our contributions are as follows: i) BERT-kNN outperforms BERT on cloze-style QA by large margins without any further training. ii) We show that BERT often identifies the correct response category (e.g., US city), but only kNN recovers the factually correct answer (e.g., “Miami”). iii) Compared to BERT, BERT-kNN excels for rare facts. iv) BERT-kNN can easily handle facts not covered by BERT’s training set, e.g., recent events.

MCML Authors

Nora Kassner

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[193]

A. Köksal, T. Schick, A. Korhonen and H. Schütze.
LongForm: Effective Instruction Tuning with Reverse Instructions.
Findings @EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub

Abstract

Instruction tuning enables language models to more effectively generalize and better follow user intent. However, obtaining instruction data is costly and challenging. Prior work employs methods such as expensive human annotation, crowd-sourced datasets with alignment issues, and generating noisy examples via LLMs. We introduce the LongForm-C dataset, which is created by reverse instructions. We generate instructions via LLMs for human-written corpus examples using reverse instructions. First we select a diverse set of human-written documents from corpora such as C4 and Wikipedia; then we generate instructions for these documents via LLMs. This approach provides a cheaper and cleaner instruction-tuning dataset with natural output and one suitable for long text generation. Our models outperform 10x larger language models without instruction tuning on tasks such as story/recipe generation and long-form question answering. Moreover, LongForm models outperform prior instruction-tuned models such as FLAN-T5 and Alpaca by a large margin, and improve language understanding capabilities further.

MCML Authors

Abdullatif Köksal

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[192]

A. Köksal, T. Schick and H. Schütze.
MEAL: Stable and Active Learning for Few-Shot Prompting.
Findings @EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI GitHub

Abstract

Few-shot classification has made great strides due to foundation models that, through priming and prompting, are highly effective few-shot learners. However, this approach has high variance both across different sets of few shots (data selection) and across different finetuning runs (run variability). This is problematic not only because it impedes the fair comparison of different approaches, but especially because it makes few-shot learning too unreliable for many real-world applications. To alleviate these issues, we make two contributions for more stable and effective few-shot learning: First, we propose novel ensembling methods and show that they substantially reduce run variability. Second, we introduce a new active learning (AL) criterion for data selection and present the first AL-based approach specifically tailored towards prompt-based learning. In our experiments, we show that our combined method, MEAL (Multiprompt finetuning and prediction Ensembling with Active Learning), improves overall performance of prompt-based finetuning by 2.3 points on five diverse tasks.

MCML Authors

Abdullatif Köksal

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[191]

A. Köksal, S. Severini and H. Schütze.
SilverAlign: MT-Based Silver Data Algorithm for Evaluating Word Alignment.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL

Abstract

Word alignments are essential for a variety of NLP tasks. Therefore, choosing the best approaches for their creation is crucial. However, the scarce availability of gold evaluation data makes the choice difficult. We propose SilverAlign, a new method to automatically create silver data for the evaluation of word aligners by exploiting machine translation and minimal pairs. We show that performance on our silver data correlates well with gold benchmarks for 9 language pairs, making our approach a valid resource for evaluation of different domains and languages when gold data are not available. This addresses the important scenario of missing gold data alignments for low-resource languages.

MCML Authors

Abdullatif Köksal

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[190]

A. Köksal, M. Thaler, A. Imani, A. Üstün, A. Korhonen and H. Schütze.
MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions.
Transactions of the Association for Computational Linguistics 13 (Aug. 2025). DOI GitHub

Abstract

Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to create instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Utilizing reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach’s effectiveness for both NLU and open-ended generation.

MCML Authors

Abdullatif Köksal

* Former Member

Ayyoob Imani

Hinrich Schütze

Prof. Dr.

Principal Investigator

[189]

N. Kassner, O. Tafjord, H. Schütze and P. Clark.
BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief.
EMNLP 2021 - Conference on Empirical Methods in Natural Language Processing. Punta Cana, Dominican Republic, Nov 07-11, 2021. DOI

Abstract

Although pretrained language models (PTLMs) contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after specialized training. As a result, it can be hard to identify what the model actually “believes” about the world, making it susceptible to inconsistent behavior and simple errors. Our goal is to reduce these problems. Our approach is to embed a PTLM in a broader system that also includes an evolving, symbolic memory of beliefs – a BeliefBank – that records but then may modify the raw PTLM answers. We describe two mechanisms to improve belief consistency in the overall system. First, a reasoning component – a weighted MaxSAT solver – revises beliefs that significantly clash with others. Second, a feedback component issues future queries to the PTLM using known beliefs as context. We show that, in a controlled experimental setting, these two mechanisms result in more consistent beliefs in the overall system, improving both the accuracy and consistency of its answers over time. This is significant as it is a first step towards PTLM-based architectures with a systematic notion of belief, enabling them to construct a more coherent picture of the world, and improve over time without model retraining.

MCML Authors

Nora Kassner

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[188]

N. Kassner, O. Tafjord, A. Sabharwal, K. Richardson, H. Schütze and P. Clark.
Language Models with Rationality.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

While large language models (LLMs) are proficient at question-answering (QA), it is not always clear how (or even if) an answer follows from their latent ‘beliefs’. This lack of interpretability is a growing impediment to widespread use of LLMs. To address this, our goals are to make model beliefs and their inferential relationships explicit, and to resolve inconsistencies that may exist, so that answers are supported by interpretable chains of reasoning drawn from a consistent network of beliefs. Our approach, which we call REFLEX, is to add a rational, self-reflecting layer on top of the LLM. First, given a question, we construct a belief graph using a backward-chaining process to materialize relevant model beliefs (including beliefs about answer candidates) and their inferential relationships. Second, we identify and minimize contradictions in that graph using a formal constraint reasoner. We find that REFLEX significantly improves consistency (by 8%-11% absolute) without harming overall answer accuracy, resulting in answers supported by faithful chains of reasoning drawn from a more consistent belief system. This suggests a new style of system architecture in which an LLM extended with a rational layer can provide an interpretable window into system beliefs, add a systematic reasoning capability, and repair latent inconsistencies present in the LLM.

MCML Authors

Nora Kassner

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[187]

A. Köksal, O. Yalcin, A. Akbiyik, M. T. Kilavuz, A. Korhonen and H. Schütze.
Language-Agnostic Bias Detection in Language Models with Bias Probing.
Findings @EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI GitHub

Abstract

Pretrained language models (PLMs) are key components in NLP, but they contain strong social biases. Quantifying these biases is challenging because current methods focusing on fill-the-mask objectives are sensitive to slight changes in input. To address this, we propose a bias probing technique called LABDet, for evaluating social bias in PLMs with a robust and language-agnostic method. For nationality as a case study, we show that LABDet “surfaces” nationality bias by training a classifier on top of a frozen PLM on non-nationality sentiment detection. We find consistent patterns of nationality bias across monolingual PLMs in six languages that align with historical and political context. We also show for English BERT that bias surfaced by LABDet correlates well with bias in the pretraining data; thus, our work is one of the few studies that directly links pretraining data to PLM behavior. Finally, we verify LABDet’s reliability and applicability to different templates and languages through an extensive set of robustness checks.

MCML Authors

Abdullatif Köksal

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[186]

A. H. Kargaran, F. Yvon and H. Schütze.
GlotScript: A Resource and Tool for Low Resource Writing System Identification.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL GitHub

Abstract

We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it returns its script distribution where scripts are identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript can help clean multilingual corpora such as mC4 and OSCAR. Second, we analyze the tokenization of a number of language models such as GPT-4 using GlotScript and provide insights on the coverage of low resource scripts and languages by each language model. We hope that GlotScript will become a useful resource for work on low resource languages in the NLP community.

MCML Authors

Amir Hossein Kargaran

Hinrich Schütze

Prof. Dr.

Principal Investigator

[185]

A. H. Kargaran, F. Yvon and H. Schütze.
MaskLID: Code-Switching Language Identification through Iterative Masking.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI GitHub

Abstract

We present MaskLID, a simple, yet effective, code-switching (CS) language identification (LID) method. MaskLID does not require any training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide single labels, typically using a softmax layer to turn scores into probabilities. However, in cases where a sentence is composed in both L1 and L2 languages, the LID classifier often only returns the dominant label L1. To address this limitation, MaskLID employs a strategy to mask text features associated with L1, allowing the LID to classify the text as L2 in the next round. This method uses the LID itself to identify the features that require masking and does not rely on any external resource. In this work, we explore the use of MaskLID for two open-source LIDs (GlotLID and OpenLID), that are both based on the FastText architecture.

MCML Authors

Amir Hossein Kargaran

Hinrich Schütze

Prof. Dr.

Principal Investigator

[184]

A. H. Kargaran, F. Yvon and H. Schütze.
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL

Abstract

The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source reproducible pipeline; and (iii) is rigorously cleaned from noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it - including the pipeline, language identification model, and filters - available to the research community.

MCML Authors

Amir Hossein Kargaran

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[183]

Y. J. Liu, T. Aoyama, W. Scivetti, Y. Zhu, S. Behzad, L. E. Levine, J. Lin, D. Tiwari and A. Zeldes.
GDTB: Genre Diverse Data for English Shallow Discourse Parsing across Modalities, Text Types, and Domains.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Work on shallow discourse parsing in English has focused on the Wall Street Journal corpus, the only large-scale dataset for the language in the PDTB framework. However, the data is not openly available, is restricted to the news domain, and is by now 35 years old. In this paper, we present and evaluate a new open-access, multi-genre benchmark for PDTB-style shallow discourse parsing, based on the existing UD English GUM corpus, for which discourse relation annotations in other frameworks already exist. In a series of experiments on cross-domain relation classification, we show that while our dataset is compatible with PDTB, substantial out-of-domain degradation is observed, which can be alleviated by joint training on both datasets.

MCML Authors

Yang Janet Liu

* Former Member

[182]

R. Litschko, V. Blaschke, D. Burkhardt, B. Plank and D. Frassinelli.
Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora.
Findings @EMNLP 2025 - Findings of the Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025. To be published. Preprint available. arXiv

Abstract

Dialects exhibit a substantial degree of variation due to the lack of a standard orthography. At the same time, the ability of Large Language Models (LLMs) to process dialects remains largely understudied. To address this gap, we use Bavarian as a case study and investigate the lexical dialect understanding capability of LLMs by examining how well they recognize and translate dialectal terms across different parts-of-speech. To this end, we introduce DiaLemma, a novel annotation framework for creating dialect variation dictionaries from monolingual data only, and use it to compile a ground truth dataset consisting of 100K human-annotated German-Bavarian word pairs. We evaluate how well nine state-of-the-art LLMs can judge Bavarian terms as dialect translations, inflected variants, or unrelated forms of a given German lemma. Our results show that LLMs perform best on nouns and lexically similar word pairs, and struggle most in distinguishing between direct translations and inflected variants. Interestingly, providing additional context in the form of example usages improves the translation performance, but reduces their ability to recognize dialect variants. This study highlights the limitations of LLMs in dealing with orthographic dialect variation and emphasizes the need for future work on adapting LLMs to dialects.

MCML Authors

Robert Litschko

→ Group Barbara Plank
AI and Computational Linguistics

Verena Blaschke

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[181]

W. Lai, A. Chronopoulou and A. Fraser.
m4 Adapter: Multilingual Multi-Domain Adaptation for Machine Translation with a Meta-Adapter.
Findings @EMNLP 2022 - Findings of the Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Multilingual neural machine translation models (MNMT) yield state-of-the-art performance when evaluated on data from a domain and language pair seen at training time. However, when a MNMT model is used to translate under domain shift or to a new language pair, performance drops dramatically. We consider a very challenging scenario: adapting the MNMT model both to a new domain and to a new language pair at the same time. In this paper, we propose m4Adapter (Multilingual Multi-Domain Adaptation for Machine Translation with a Meta-Adapter), which combines domain and language knowledge using meta-learning with adapters. We present results showing that our approach is a parameter-efficient solution which effectively adapts a model to both a new language pair and a new domain, while outperforming other adapter methods. An ablation study also shows that our approach more effectively transfers domain knowledge across different languages and language information across different domains.

MCML Authors

Wen Lai

Alexandra Chronopoulou

Dr.

* Former Member

→ Group Alexander Fraser
Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[180]

W. Lai, A. Chronopoulou and A. Fraser.
Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation.
Findings @EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

Despite advances in multilingual neural machine translation (MNMT), we argue that there are still two major challenges in this area: data imbalance and representation degeneration. The data imbalance problem refers to the imbalance in the amount of parallel corpora for all language pairs, especially for long-tail languages (i.e., very low-resource languages). The representation degeneration problem refers to the problem of encoded tokens tending to appear only in a small subspace of the full space available to the MNMT model. To solve these two issues, we propose Bi-ACL, a framework which only requires target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model. We define two modules, named bidirectional autoencoder and bidirectional contrastive learning, which we combine with an online constrained beam search and a curriculum learning sampling strategy. Extensive experiments show that our proposed method is more effective than strong baselines both in long-tail languages and in high-resource languages. We also demonstrate that our approach is capable of transferring knowledge between domains and languages in zero-shot scenarios.

MCML Authors

Wen Lai

Alexandra Chronopoulou

Dr.

* Former Member

→ Group Alexander Fraser
Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

Principal Investigator

[179]

Y. Liu, R. Chen, L. Hirlimann, A. D. Hakimi, M. Wang, A. H. Kargaran, S. Rothe, F. Yvon and H. Schütze.
On Relation-Specific Neurons in Large Language Models.
EMNLP 2025 - Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025. To be published. Preprint available. arXiv GitHub

Abstract

In large language models (LLMs), certain neurons can store distinct pieces of knowledge learned during pretraining. While knowledge typically appears as a combination of relations and entities, it remains unclear whether some neurons focus on a relation itself – independent of any entity. We hypothesize such neurons detect a relation in the input text and guide generation involving such a relation. To investigate this, we study the Llama-2 family on a chosen set of relations with a statistics-based method. Our experiments demonstrate the existence of relation-specific neurons. We measure the effect of selectively deactivating candidate neurons specific to relation r on the LLM’s ability to handle (1) facts whose relation is r and (2) facts whose relation is a different relation r′≠r. With respect to their capacity for encoding relation information, we give evidence for the following three properties of relation-specific neurons. (i) Neuron cumulativity. The neurons for r present a cumulative effect so that deactivating a larger portion of them results in the degradation of more facts in r. (ii) Neuron versatility. Neurons can be shared across multiple closely related as well as less related relations. Some relation neurons transfer across languages. (iii) Neuron interference. Deactivating neurons specific to one relation can improve LLM generation performance for facts of other relations.

MCML Authors

Yihong Liu

Lea Hirlimann

Ahmad Dawar Hakimi

Mingyang Wang

Amir Hossein Kargaran

Hinrich Schütze

Prof. Dr.

Principal Investigator

[178]

Y. Liu, A. Chronopoulou, H. Schütze and A. Fraser.
On the Copying Problem of Unsupervised NMT: A Training Schedule with a Language Discriminator Loss.
IWSLT 2023 - 20th International Conference on Spoken Language Translation. Toronto, Canada, Jul 09-14, 2023. DOI

Abstract

Although unsupervised neural machine translation (UNMT) has achieved success in many language pairs, the copying problem, i.e., directly copying some parts of the input sentence as the translation, is common among distant language pairs, especially when low-resource languages are involved. We find this issue is closely related to an unexpected copying behavior during online back-translation (BT). In this work, we propose a simple but effective training schedule that incorporates a language discriminator loss. The loss imposes constraints on the intermediate translation so that the translation is in the desired language. By conducting extensive experiments on different language pairs, including similar and distant, high and low-resource languages, we find that our method alleviates the copying problem, thus improving the translation performance on low-resource languages.

MCML Authors

Yihong Liu

→ Group Alexander Fraser
Data Analytics & Statistics

Alexandra Chronopoulou

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

Alexander Fraser

Prof. Dr.

Principal Investigator

[177]

J. Lan, D. Frassinelli and B. Plank.
Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA.
AAAI 2025 - 39th Conference on Artificial Intelligence. Philadelphia, PA, USA, Feb 25-Mar 04, 2025. DOI

Abstract

Large vision-language models struggle to accurately predict responses provided by multiple human annotators, particularly when those responses exhibit high uncertainty. In this study, we focus on a Visual Question Answering (VQA) task and comprehensively evaluate how well the output of the state-of-the-art vision-language model correlates with the distribution of human responses. To do so, we categorize our samples based on their levels (low, medium, high) of human uncertainty in disagreement (HUD) and employ, not only accuracy, but also three new human-correlated metrics for the first time in VQA, to investigate the impact of HUD. We also verify the effect of common calibration and human calibration (Baan et al. 2022) on the alignment of models and humans. Our results show that even BEiT3, currently the best model for this task, struggles to capture the multi-label distribution inherent in diverse human responses. Additionally, we observe that the commonly used accuracy-oriented calibration technique adversely affects BEiT3’s ability to capture HUD, further widening the gap between model predictions and human distributions. In contrast, we show the benefits of calibrating models towards human distributions for VQA, to better align model confidence with human uncertainty. Our findings highlight that for VQA, the alignment between human responses and model predictions is understudied and is an important target for future studies.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[176]

W. Lai, A. Fraser and I. Titov.
Joint Localization and Activation Editing for Low-Resource Fine-Tuning.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. arXiv

Abstract

Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, are commonly used to adapt LLMs. However, the effectiveness of standard PEFT methods is limited in low-resource scenarios with only a few hundred examples. Recent advances in interpretability research have inspired the emergence of activation editing techniques, which modify the activations of specific model components. These methods, due to their extremely small parameter counts, show promise for small datasets. However, their performance is highly dependent on identifying the correct modules to edit and often lacks stability across different datasets. In this paper, we propose Joint Localization and Activation Editing (JoLA), a method that jointly learns (1) which heads in the Transformer to edit, (2) whether the intervention should be additive, multiplicative, or both, and (3) the intervention parameters themselves: the vectors applied as additive offsets or multiplicative scalings to the head output. Through evaluations on three benchmarks spanning commonsense reasoning, natural language understanding, and natural language generation, we demonstrate that JoLA consistently outperforms existing methods.
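
As a rough illustration of the kind of intervention the abstract describes, the sketch below applies an optional multiplicative scaling and an optional additive offset to a single attention head's output vector. It is not the authors' implementation: the function name `edit_head_output`, the boolean gates, and the plain-list representation are assumptions made for this example, and in the paper the choice of heads and intervention types is learned rather than set by hand.

```python
from typing import List


def edit_head_output(
    head_out: List[float],
    scale: List[float],
    offset: List[float],
    use_mul: bool,
    use_add: bool,
) -> List[float]:
    """Edit one head's output with gated scaling and offset (hypothetical helper).

    use_mul / use_add stand in for the learned decision of whether the
    intervention is multiplicative, additive, or both.
    """
    edited = []
    for h, m, a in zip(head_out, scale, offset):
        value = h
        if use_mul:
            value = value * m  # multiplicative scaling of the head output
        if use_add:
            value = value + a  # additive offset on the head output
        edited.append(value)
    return edited


# Toy values, chosen only for illustration.
both = edit_head_output([1.0, 2.0], [2.0, 0.5], [0.5, -1.0], True, True)
untouched = edit_head_output([1.0, 2.0], [2.0, 0.5], [0.5, -1.0], False, False)
```

With both gates off the head output passes through unchanged, which is how a head that is not selected for editing would behave in this simplified picture.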

MCML Authors

Wen Lai

Alexander Fraser

Prof. Dr.

Principal Investigator

[175]

Y. Liu, S. Feng, D. Wang, Y. Zhang and H. Schütze.
PVGRU: Generating Diverse and Relevant Dialogue Responses via Pseudo-Variational Mechanism.
ACL 2023 - 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI

Abstract

We investigate response generation for multi-turn dialogue in generative chatbots. Existing generative models based on RNNs (Recurrent Neural Networks) usually employ the last hidden state to summarize the history, which makes models unable to capture the subtle variability observed in different dialogues and unable to distinguish the differences between dialogues that are similar in composition. In this paper, we propose Pseudo-Variational Gated Recurrent Unit (PVGRU). The key novelty of PVGRU is a recurrent summarizing variable that aggregates the accumulated distribution variations of subsequences. We train PVGRU without relying on posterior knowledge, thus avoiding the training-inference inconsistency problem. PVGRU can perceive subtle semantic variability through summarizing variables that are optimized by two objectives we employ for training: distribution consistency and reconstruction. In addition, we build a Pseudo-Variational Hierarchical Dialogue (PVHD) model based on PVGRU. Experimental results demonstrate that PVGRU can broadly improve the diversity and relevance of responses on two benchmark datasets.

MCML Authors

Yongkang Liu

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[174]

Y. Liu, S. Feng, D. Wang, Y. Zhang and H. Schütze.
Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response.
Preprint (May. 2023). arXiv

Abstract

LLMs (large language models) such as ChatGPT have shown remarkable language understanding and generation capabilities. Although reference-free evaluators based on LLMs show better human alignment than traditional reference-based evaluators, there are many challenges in using reference-free evaluators based on LLMs. Reference-free evaluators are more suitable for open-ended examples with semantically different responses. But not all examples are open-ended. For closed-ended examples with a unique correct semantic response, reference-free evaluators will still rate a response as high quality even when it is inconsistent with the facts and the semantics of the reference. In order to comprehensively evaluate the reliability of evaluators based on LLMs, we construct two adversarial meta-evaluation dialogue generation datasets KdConv-ADV and DSTC7-ADV based on KdConv and DSTC7-AVSD, respectively. Compared to previous meta-evaluation benchmarks, KdConv-ADV and DSTC7-ADV are much more challenging since they require evaluators to be able to reasonably evaluate closed-ended examples with the help of external knowledge or even their own knowledge. Empirical results show that the ability of LLMs to identify unreasonable responses is insufficient. There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.

MCML Authors

Yongkang Liu

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[173]

T. Lindenbauer, G. Groh and H. Schütze.
From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents.
REALM @ACL 2025 - 1st Workshop for Research on Agent Language Models at the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

We introduce CTIM-Rover, an AI agent for Software Engineering (SE) built on top of AutoCodeRover (Zhang et al., 2024) that extends agentic reasoning frameworks with an episodic memory, more specifically, a general and repository-level Cross-Task-Instance Memory (CTIM). While existing open-source SE agents mostly rely on ReAct (Yao et al., 2023b), Reflexion (Shinn et al., 2023), or Code-Act (Wang et al., 2024), all of these reasoning and planning frameworks inefficiently discard their long-term memory after a single task instance. As repository-level understanding is pivotal for identifying all locations requiring a patch for fixing a bug, we hypothesize that SE is particularly well positioned to benefit from CTIM. For this, we build on the Experiential Learning (EL) approach ExpeL (Zhao et al., 2024), proposing a Mixture-Of-Experts (MoEs) inspired approach to create both a general-purpose and repository-level CTIM. We find that CTIM-Rover does not outperform AutoCodeRover in any configuration and thus conclude that neither ExpeL nor DoT-Bank (Lingam et al., 2024) scale to real-world SE problems. Our analysis indicates noise introduced by distracting CTIM items or exemplar trajectories as the likely source of the performance degradation.

MCML Authors

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[172]

W. Lai, V. Hangya and A. Fraser.
Extending Multilingual Machine Translation through Imitation Learning.
Preprint (Nov. 2023). arXiv

Abstract

Despite the growing variety of languages supported by existing multilingual neural machine translation (MNMT) models, most of the world’s languages are still being left behind. We aim to extend large-scale MNMT models to a new language, allowing for translation between the newly added and all of the already supported languages in a challenging scenario: using only a parallel corpus between the new language and English. Previous approaches, such as continued training on parallel data including the new language, suffer from catastrophic forgetting (i.e., performance on other languages is reduced). Our novel approach Imit-MNMT treats the task as an imitation learning process, which mimics the behavior of an expert, a technique widely used in the computer vision area, but not well explored in NLP. More specifically, we construct a pseudo multi-parallel corpus of the new and the original languages by pivoting through English, and imitate the output distribution of the original MNMT model. Extensive experiments show that our approach significantly improves the translation performance between the new and the original languages, without severe catastrophic forgetting. We also demonstrate that our approach is capable of solving copy and off-target problems, which are two common issues in current large-scale MNMT models.

MCML Authors

Wen Lai

Viktor Hangya

Dr.

* Former Member

→ Group Alexander Fraser
Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[171]

W. Lai, V. Hangya and A. Fraser.
Style-Specific Neurons for Steering LLMs in Text Style Transfer.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Text style transfer (TST) aims to modify the style of a text without altering its original meaning. Large language models (LLMs) demonstrate superior performance across multiple tasks, including TST. However, in zero-shot setups, they tend to directly copy a significant portion of the input text to the output without effectively changing its style. To enhance the stylistic variety and fluency of the text, we present sNeuron-TST, a novel approach for steering LLMs using style-specific neurons in TST. Specifically, we identify neurons associated with the source and target styles and deactivate source-style-only neurons to give target-style words a higher probability, aiming to enhance the stylistic diversity of the generated text. However, we find that this deactivation negatively impacts the fluency of the generated text, which we address by proposing an improved contrastive decoding method that accounts for rapid token probability shifts across layers caused by deactivated source-style neurons. Empirical experiments demonstrate the effectiveness of the proposed method on six benchmarks, encompassing formality, toxicity, politics, politeness, authorship, and sentiment.

MCML Authors

Wen Lai

Viktor Hangya

Dr.

* Former Member

→ Group Alexander Fraser
Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

Principal Investigator

[170]

P. Lin, C. Hu, Z. Zhang, A. Martins and H. Schütze.
mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models.
Findings @EACL 2024 - Findings of the 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL

Abstract

Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, which are not explicitly provided during pretraining. It remains an open question whether it is feasible to employ mPLMs to measure language similarity, and subsequently use the similarity results to select source languages for boosting cross-lingual transfer. To investigate this, we propose mPLM-Sim, a language similarity measure that induces the similarities across languages from mPLMs using multi-parallel corpora. Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund. We also conduct a case study on languages with low correlation and observe that mPLM-Sim yields more accurate similarity results. Additionally, we find that similarity results vary across different mPLMs and different layers within an mPLM. We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks. The experimental results demonstrate that mPLM-Sim is capable of selecting better source languages than linguistic measures, resulting in a 1%-2% improvement in zero-shot cross-lingual transfer performance.

MCML Authors

Peiqin Lin

→ Group Gjergji Kasneci
Responsible Data Science

Zheyu Zhang

Hinrich Schütze

Prof. Dr.

Principal Investigator

[169]

P. Lin, S. Ji, J. Tiedemann, A. F. T. Martins and H. Schütze.
MaLA-500: Massive Language Adaptation of Large Language Models.
Preprint (Apr. 2024). arXiv GitHub

Abstract

Large language models (LLMs) have advanced the state of the art in natural language processing. However, their predominant design for English or a limited set of languages creates a substantial gap in their effectiveness for low-resource languages. To bridge this gap, we introduce MaLA-500, a novel large language model designed to cover an extensive range of 534 languages. To train MaLA-500, we employ vocabulary extension and continued pretraining on LLaMA 2 with Glot500-c. Our intrinsic evaluation demonstrates that MaLA-500 is better at predicting the given texts of low-resource languages than existing multilingual LLMs. Moreover, the extrinsic evaluation of in-context learning shows that MaLA-500 outperforms previous LLMs on SIB200 and Taxi1500 by a significant margin, i.e., 11.68% and 4.82% macro-average accuracy across languages.

MCML Authors

Peiqin Lin

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[168]

R. Litschko, O. Kraus, V. Blaschke and B. Plank.
Cross-Dialect Information Retrieval: Information Access in Low-Resource and High-Variance Languages.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL

Abstract

A large amount of local and culture-specific knowledge (e.g., people, traditions, food) can only be found in documents written in dialects. While there has been extensive research conducted on cross-lingual information retrieval (CLIR), the field of cross-dialect retrieval (CDIR) has received limited attention. Dialect retrieval poses unique challenges due to the limited availability of resources to train retrieval models and the high variability in non-standardized languages. We study these challenges on the example of German dialects and introduce the first German dialect retrieval dataset, dubbed WikiDIR, which consists of seven German dialects extracted from Wikipedia. Using WikiDIR, we demonstrate the weakness of lexical methods in dealing with high lexical variation in dialects. We further show that the commonly used zero-shot cross-lingual transfer approach with multilingual encoders does not transfer well to extremely low-resource setups, motivating the need for resource-lean and dialect-specific retrieval models. We finally demonstrate that (document) translation is an effective way to reduce the dialect gap in CDIR.

MCML Authors

Robert Litschko

Verena Blaschke

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

[167]

M. Li, M. Luo, T. Lv, Y. Zhang, S. Zhao, E. Nie and G. Zhou.
A Survey of Long-Document Retrieval in the PLM and LLM Era.
Preprint (Sep. 2025). arXiv

Abstract

The proliferation of long-form documents presents a fundamental challenge to information retrieval (IR), as their length, dispersed evidence, and complex structures demand specialized methods beyond standard passage-level techniques. This survey provides the first comprehensive treatment of long-document retrieval (LDR), consolidating methods, challenges, and applications across three major eras. We systematize the evolution from classical lexical and early neural models to modern pre-trained (PLM) and large language models (LLMs), covering key paradigms like passage aggregation, hierarchical encoding, efficient attention, and the latest LLM-driven re-ranking and retrieval techniques. Beyond the models, we review domain-specific applications, specialized evaluation resources, and outline critical open challenges such as efficiency trade-offs, multimodal alignment, and faithfulness. This survey aims to provide both a consolidated reference and a forward-looking agenda for advancing long-document retrieval in the era of foundation models.

MCML Authors

Ercong Nie

[166]

Y. Liu, P. Lin, M. Wang and H. Schütze.
OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining.
Findings @NAACL 2024 - Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Mexico City, Mexico, Jun 16-21, 2024. URL

Abstract

Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this method usually randomly initializes the embeddings of new subwords and introduces substantially more embedding parameters to the model, thus weakening the efficiency. To address these issues, we propose a novel framework: One For All (OFA), which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively. OFA takes advantage of external well-aligned multilingual static word vectors and injects the alignment knowledge into the subword embeddings. In addition, OFA applies matrix factorization and replaces the cumbersome embeddings with two lower-dimensional matrices, which largely reduces the number of parameters. We show OFA accelerates the convergence of continued pretraining, which is environmentally friendly as a much smaller carbon footprint is generated. Through extensive experiments, we demonstrate OFA can achieve competitive or better performance than default continued pretraining baselines on a wide range of crosslingual downstream tasks. We make our code and models publicly available.
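
To make the parameter saving from factorized embeddings concrete, the sketch below compares the size of a full embedding matrix with the combined size of two lower-dimensional matrices. The vocabulary size, hidden size, and low dimension are hypothetical values picked for the arithmetic, not figures reported for OFA, and the function names are introduced only for this example.

```python
def embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Parameters in a full vocab_size x hidden_dim embedding matrix."""
    return vocab_size * hidden_dim


def factorized_embedding_params(vocab_size: int, hidden_dim: int, low_dim: int) -> int:
    """Parameters when the embedding is the product of two smaller matrices:
    (vocab_size x low_dim) and (low_dim x hidden_dim)."""
    return vocab_size * low_dim + low_dim * hidden_dim


# Hypothetical sizes, for illustration only.
VOCAB, HIDDEN, LOW = 400_000, 768, 100

full = embedding_params(VOCAB, HIDDEN)
factorized = factorized_embedding_params(VOCAB, HIDDEN, LOW)
print(f"full: {full:,}  factorized: {factorized:,}  ratio: {factorized / full:.3f}")
```

The saving grows with the vocabulary size, because the vocabulary dimension is multiplied by the small `low_dim` instead of the full hidden size.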

MCML Authors

Yihong Liu

Peiqin Lin

Mingyang Wang

Hinrich Schütze

Prof. Dr.

Principal Investigator

[165]

M. Li, X. Lv, J. Zou, T. Chen, C. Zhang, S. An, E. Nie and G. Zhou.
Query Expansion in the Age of Pre-trained and Large Language Models: A Comprehensive Survey.
Preprint (Sep. 2025). arXiv

Abstract

Modern information retrieval (IR) must bridge short, ambiguous queries and ever more diverse, rapidly evolving corpora. Query Expansion (QE) remains a key mechanism for mitigating vocabulary mismatch, but the design space has shifted markedly with pre-trained language models (PLMs) and large language models (LLMs). This survey synthesizes the field from three angles: (i) a four-dimensional framework of query expansion - from the point of injection (explicit vs. implicit QE), through grounding and interaction (knowledge bases, model-internal capabilities, multi-turn retrieval) and learning alignment, to knowledge graph-based argumentation; (ii) a model-centric taxonomy spanning encoder-only, encoder-decoder, decoder-only, instruction-tuned, and domain/multilingual variants, highlighting their characteristic affordances for QE (contextual disambiguation, controllable generation, zero-/few-shot reasoning); and (iii) practice-oriented guidance on where and how neural QE helps in first-stage retrieval, multi-query fusion, re-ranking, and retrieval-augmented generation (RAG). We compare traditional query expansion with PLM/LLM-based methods across seven key aspects, and we map applications across web search, biomedicine, e-commerce, open-domain QA/RAG, conversational and code search, and cross-lingual settings. The review distills design grounding and interaction, alignment/distillation (SFT/PEFT/DPO), and KG constraints - as robust remedies to topic drift and hallucination. We conclude with an agenda on quality control, cost-aware invocation, domain/temporal adaptation, evaluation beyond end-task metrics, and fairness/privacy. Collectively, these insights provide a principled blueprint for selecting and combining QE techniques under real-world constraints.

MCML Authors

Ercong Nie

→ Group Alexander Fraser
Data Analytics & Statistics

[164]

W. Lai, M. Mesgar and A. Fraser.
LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback.
Findings @ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

To democratize large language models (LLMs) to most natural languages, it is imperative to make these models capable of understanding and generating texts in many languages, in particular low-resource ones. While recent multilingual LLMs demonstrate remarkable performance in such capabilities, these LLMs still support a limited number of human languages due to the lack of training data for low resource languages. Moreover, these LLMs are not yet aligned with human preference for downstream tasks, which is crucial for the success of LLMs in English. In this paper, we introduce xLLaMA-100 and xBLOOM-100 (collectively xLLMs-100), which scale the multilingual capabilities of LLaMA and BLOOM to 100 languages. To do so, we construct two datasets: a multilingual instruction dataset including 100 languages, which represents the largest language coverage to date, and a cross-lingual human feedback dataset encompassing 30 languages. We perform multilingual instruction tuning on the constructed instruction data and further align the LLMs with human feedback using the DPO algorithm on our cross-lingual human feedback dataset. We evaluate the multilingual understanding and generating capabilities of xLLMs-100 on five multilingual benchmarks. Experimental results show that xLLMs-100 consistently outperforms its peers across the benchmarks by considerable margins, defining a new state-of-the-art multilingual LLM that supports 100 languages.

MCML Authors

Wen Lai

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[163]

R. Litschko, M. Müller-Eberstein, R. van der Goot, L. Weber-Genzel and B. Plank.
Establishing Trustworthiness: Rethinking Tasks and Model Evaluation.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

Language understanding is a multi-faceted cognitive capability, which the Natural Language Processing (NLP) community has striven to model computationally for decades. Traditionally, facets of linguistic intelligence have been compartmentalized into tasks with specialized model architectures and corresponding evaluation protocols. With the advent of large language models (LLMs) the community has witnessed a dramatic shift towards general purpose, task-agnostic approaches powered by generative models. As a consequence, the traditional compartmentalized notion of language tasks is breaking down, followed by an increasing challenge for evaluation and analysis. At the same time, LLMs are being deployed in more real-world scenarios, including previously unforeseen zero-shot setups, increasing the need for trustworthy and reliable systems. Therefore, we argue that it is time to rethink what constitutes tasks and model evaluation in NLP, and pursue a more holistic view on language, placing trustworthiness at the center. Towards this goal, we review existing compartmentalized approaches for understanding the origins of a model’s functional capacity, and provide recommendations for more multi-faceted evaluation protocols.

MCML Authors

Robert Litschko

Leon Weber-Genzel

Dr.

* Former Member

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

[162]

P. Lin, A. F. T. Martins and H. Schütze.
XAMPLER: Learning to Retrieve Cross-Lingual In-Context Examples.
Findings @NAACL 2025 - Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. DOI GitHub

Abstract

Recent studies indicate that leveraging off-the-shelf or fine-tuned retrievers, capable of retrieving relevant in-context examples tailored to the input query, enhances few-shot in-context learning of English. However, adapting these methods to other languages, especially low-resource ones, poses challenges due to the scarcity of cross-lingual retrievers and annotated data. Thus, we introduce XAMPLER: Cross-Lingual Example Retrieval, a method tailored to tackle the challenge of cross-lingual in-context learning using only annotated English data. XAMPLER first trains a retriever based on Glot500, a multilingual small language model, using positive and negative English examples constructed from the predictions of a multilingual large language model, i.e., MaLA500. Leveraging the cross-lingual capacity of the retriever, it can directly retrieve English examples as few-shot examples for in-context learning of target languages. Experiments on the multilingual text classification benchmark SIB200 with 176 languages show that XAMPLER substantially improves the in-context learning performance across languages.

MCML Authors

Peiqin Lin

Hinrich Schütze

Prof. Dr.

Principal Investigator

[161]

P. Lin, A. F. T. Martins and H. Schütze.
A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models.
Findings @NAACL 2025 - Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. DOI

Abstract

Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus containing just 10K parallel sentences can yield results comparable to those obtained from much larger datasets; (iii) employing only the machine translation objective yields the best results among various training objectives and their combinations; (iv) larger multilingual language models benefit more from parallel corpora than smaller models due to their stronger capacity for cross-task transfer. Our study offers valuable insights into the optimal utilization of parallel corpora to enhance multilingual large language models, extending the generalizability of previous findings from limited languages and tasks to a broader range of scenarios.

MCML Authors

Peiqin Lin

Hinrich Schütze

Prof. Dr.

Principal Investigator

[160]

Y. Liu, C. Ma, H. Ye and H. Schütze.
TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

The world’s more than 7000 languages are written in at least 293 scripts. Due to various reasons, many closely related languages use different scripts, which poses a difficulty for multilingual pretrained language models (mPLMs) in learning crosslingual knowledge through lexical overlap. As a consequence, mPLMs are faced with a script barrier: representations from different scripts are located in different subspaces, which can result in crosslingual transfer involving languages of different scripts performing suboptimally. To address this problem, we propose TransliCo, a framework that optimizes the Transliteration Contrastive Modeling (TCM) objective to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script (in our case Latin), which enhances uniformity in the representation space for different scripts. Using Glot500-m, an mPLM pretrained on over 500 languages, as our source model, we fine-tune it on a small portion (5%) of its training data, and refer to the resulting model as Furina. We show that Furina not only better aligns representations from distinct scripts but also outperforms the original Glot500-m on various zero-shot crosslingual transfer tasks. Additionally, we achieve consistent improvement in a case study on the Indic group where the languages exhibit areal features but use different scripts. We make our code and models publicly available.

MCML Authors

Yihong Liu

Chunlan Ma

Haotian Ye

Hinrich Schütze

Prof. Dr.

Principal Investigator

[159]

Y. Liu, C. Ma, H. Ye and H. Schütze.
TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL GitHub

Abstract

Transliterating related languages that use different scripts into a common script shows effectiveness in improving crosslingual transfer in downstream tasks. However, this methodology often makes pretraining a model from scratch unavoidable, as transliteration brings about new subwords not covered in existing multilingual pretrained language models (mPLMs). This is undesirable because pretraining requires a large computation budget. A more promising way is to make full use of available mPLMs. To this end, this paper proposes a simple but effective framework: Transliterate-Merge-Initialize (TransMI), which can create a strong baseline well-suited for data that is transliterated into a common script by exploiting an mPLM and its accompanying tokenizer. TransMI has three stages: (a) transliterate the vocabulary of an mPLM into a common script; (b) merge the new vocabulary with the original vocabulary; and (c) initialize the embeddings of the new subwords. We applied TransMI to three recent strong mPLMs, and our experiments demonstrate that TransMI not only preserves their ability to handle non-transliterated data, but also enables the models to effectively process transliterated data: the results show a consistent improvement of 3% to 34%, varying across different models and tasks.
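The three stages named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the transliteration table, the function names, and the choice to initialize a new subword with the mean of its source embeddings are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical character table (Cyrillic -> Latin), for illustration only.
TRANSLIT = {"д": "d", "а": "a", "н": "n", "е": "e", "э": "e", "т": "t"}


def transliterate(token):
    """Map each character through the toy table; unknown characters pass through."""
    return "".join(TRANSLIT.get(ch, ch) for ch in token)


def transmi(vocab, embeddings):
    """vocab: subword -> row index; embeddings: list of float vectors.

    Returns an extended vocabulary and embedding table; the inputs are not modified.
    """
    # (a) Transliterate the vocabulary, remembering which rows produce each new subword.
    sources = defaultdict(list)
    for token, idx in vocab.items():
        latin = transliterate(token)
        if latin not in vocab:  # only subwords the model does not already cover
            sources[latin].append(idx)

    # (b) Merge: the original entries keep their indices, new subwords are appended.
    new_vocab = dict(vocab)
    new_embeddings = [list(vec) for vec in embeddings]
    dim = len(embeddings[0])
    for latin, idxs in sources.items():
        new_vocab[latin] = len(new_embeddings)
        # (c) Initialize (assumption): mean of the embeddings of the source subwords.
        mean = [sum(embeddings[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        new_embeddings.append(mean)
    return new_vocab, new_embeddings
```

Because the original rows are left untouched, non-transliterated input is tokenized and embedded exactly as before, which matches the abstract's claim that the ability to handle non-transliterated data is preserved.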

MCML Authors

Yihong Liu

Chunlan Ma

Haotian Ye

Hinrich Schütze

Prof. Dr.

Principal Investigator

[158]

Y. Liu, E. Nie, S. Feng, Z. Hua, Z. Ding, D. Wang, Y. Zhang and H. Schütze.
A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation.
ECML-PKDD 2024 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Vilnius, Lithuania, Sep 09-13, 2024. DOI GitHub

Abstract

Current state-of-the-art dialogue systems heavily rely on extensive training datasets. However, challenges arise in domains where domain-specific training datasets are insufficient or entirely absent. To tackle this challenge, we propose a novel data Augmentation framework for Multi-Domain Dialogue Generation, referred to as AMDG. The AMDG framework consists of a data augmentation process and a two-stage training approach: domain-agnostic training and domain adaptation training. We posit that domain corpora are a blend of domain-agnostic and domain-specific features, with certain representation patterns shared among diverse domains. Domain-agnostic training aims to enable models to learn these common expressive patterns. To construct domain-agnostic dialogue corpora, we employ a de-domaining data processing technique used to remove domain-specific features. By mitigating the effects of domain-specific features, the model trained on the de-domained corpora can effectively learn common expression patterns in different domains. Subsequently, we adapt the learned domain-agnostic features to the target domain through domain adaptation training. We conduct experiments on Chinese dialogue datasets from five different domains and show that AMDG achieves superior performance compared to both direct training on the target domain corpus and collective training on all five domain corpora. Our work underscores AMDG as a viable alternative solution for low-resource multi-domain dialogue generation.

MCML Authors

Yongkang Liu

Dr.

* Former Member

Ercong Nie

→ Group Volker Tresp
Database Systems, Data Mining and AI

Zifeng Ding

Hinrich Schütze

Prof. Dr.

Principal Investigator

[157]

X. Li, E. Nie and S. Liang.
From Classification to Generation: Insights into Crosslingual Retrieval Augmented ICL.
Instruction Tuning and Instruction Following @NeurIPS 2023 - Workshop Instruction Tuning and Instruction Following at the 37th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Dec 10-16, 2023. URL

Abstract

The remarkable ability of Large Language Models (LLMs) to understand and follow instructions has sometimes been limited by their in-context learning (ICL) performance in low-resource languages. To address this, we introduce a novel approach that leverages cross-lingual retrieval-augmented in-context learning (CREA-ICL). By extracting semantically similar prompts from high-resource languages, we aim to bolster the zero-shot performance of multilingual pretrained language models (MPLMs) across diverse tasks. Though our approach yields steady improvements in classification tasks, it faces challenges in generation tasks, with Bangla serving as a key case study. Our evaluation offers insights into the performance dynamics of retrieval-augmented in-context learning across both classification and generation domains.

MCML Authors

Ercong Nie

Sheng Liang

* Former Member

[156]

X. Li, E. Nie and S. Liang.
Crosslingual Retrieval Augmented In-context Learning for Bangla.
BLP 2023 - 1st Workshop on Bangla Language Processing. Singapore, Dec 07, 2023. DOI

Abstract

The promise of Large Language Models (LLMs) in Natural Language Processing has often been overshadowed by their limited performance in low-resource languages such as Bangla. To address this, our paper presents a pioneering approach that utilizes cross-lingual retrieval augmented in-context learning. By strategically sourcing semantically similar prompts from a high-resource language, we enable multilingual pretrained language models (MPLMs), especially the generative model BLOOMZ, to successfully boost performance on Bangla tasks. Our extensive evaluation highlights that the cross-lingual retrieval augmented prompts bring steady improvements to MPLMs over the zero-shot performance.

MCML Authors

Ercong Nie

Sheng Liang

* Former Member

[155]

A.-M. Lutgen, A. Plum, C. Purschke and B. Plank.
Neural Text Normalization for Luxembourgish Using Real-Life Variation Data.
VarDial @COLING 2025 - 12th Workshop on NLP for Similar Languages, Varieties and Dialects at the 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL

Abstract

Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[154]

Y. Liu, F. Shi, D. Wang, Y. Zhang and H. Schütze.
ChatZero: Zero-Shot Cross-Lingual Dialogue Generation via Pseudo-Target Language.
ECAI 2024 - 27th European Conference on Artificial Intelligence. Santiago de Compostela, Spain, Oct 19-24, 2024. DOI

Abstract

Although large language models (LLMs) show amazing capabilities, many of the exciting applications discovered for LLMs fall short in low-resource languages. Besides, most existing methods depend on large-scale dialogue corpora, and thus building systems for dialogue generation in a zero-shot scenario remains a considerable challenge. To address this challenge, we propose a novel end-to-end zero-shot dialogue generation model, ChatZero, based on a cross-lingual code-switching method. First, we construct a code-switching language and a pseudo-target language with placeholders. Then, for cross-lingual semantic transfer, we employ unsupervised contrastive learning to minimize the semantic gap between the source language, code-switching language, and pseudo-target language, which are mutually positive examples in the high-dimensional semantic space. Experiments on the multilingual DailyDialog and DSTC7-AVSD datasets demonstrate that ChatZero can achieve more than 90% of the original performance under the zero-shot case compared to supervised learning, and achieve state-of-the-art performance compared with other baselines.

MCML Authors

Yongkang Liu

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[153]

P. Lin, M. Thaler, D. Goschala, A. H. Kargaran, Y. Liu, A. Martins and H. Schütze.
Construction-Based Reduction of Translationese for Low-Resource Languages: A Pilot Study on Bavarian.
SIGTYP @ACL 2025 - 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP at the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL GitHub

Abstract

When translating into a low-resource language, a language model can have a tendency to produce translations that are close to the source (e.g., word-by-word translations) due to a lack of rich low-resource training data in pretraining. Thus, the output often is translationese that differs considerably from what native speakers would produce naturally. To remedy this, we synthetically create a training set in which the frequency of a construction unique to the low-resource language is artificially inflated. For the case of Bavarian, we show that, after training, the language model has learned the unique construction and that native speakers judge its output as more natural. Our pilot study suggests that construction-based mitigation of translationese is a promising approach.

MCML Authors

Peiqin Lin

Amir Hossein Kargaran

Yihong Liu

Hinrich Schütze

Prof. Dr.

Principal Investigator

[152]

Y. Liu, M. Wang, A. H. Kargaran, A. Imani, O. Xhelili, H. Ye, C. Ma, F. Yvon and H. Schütze.
How Transliterations Improve Crosslingual Alignment.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL

Abstract

Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives on both original and transliterated data can improve crosslingual alignment. This improvement further leads to better crosslingual transfer performance. However, it remains unclear how and why a better crosslingual alignment is achieved, as this technique only involves transliterations, and does not use any parallel data. This paper attempts to explicitly evaluate the crosslingual alignment and identify the key elements in transliteration-based approaches that contribute to better performance. For this, we train multiple models under varying setups for two pairs of related languages: (1) Polish and Ukrainian and (2) Hindi and Urdu. To assess alignment, we define four types of similarities based on sentence representations. Our experiments show that adding transliterations alone improves the overall similarities, even for random sentence pairs. With the help of auxiliary alignment objectives, especially the contrastive objective, the model learns to distinguish matched from random pairs, leading to better alignments. However, we also show that better alignment does not always yield better downstream performance, suggesting that further research is needed to clarify the connection between alignment and performance.

MCML Authors

Yihong Liu

Mingyang Wang

Amir Hossein Kargaran

Ayyoob Imani

Haotian Ye

Chunlan Ma

Hinrich Schütze

Prof. Dr.

Principal Investigator

[151]

Y. Liu, M. Wang, A. H. Kargaran, F. Körner, E. Nie, B. Plank, F. Yvon and H. Schütze.
Tracing Multilingual Factual Knowledge Acquisition in Pretraining.
Findings @EMNLP 2025 - Findings of the Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025. To be published. Preprint available. arXiv GitHub

Abstract

Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consistency throughout pretraining largely unexplored. In this work, we trace how factual recall and crosslingual consistency evolve during pretraining, focusing on OLMo-7B as a case study. We find that both accuracy and consistency improve over time for most languages. We show that this improvement is primarily driven by the fact frequency in the pretraining corpus: more frequent facts are more likely to be recalled correctly, regardless of language. Yet, some low-frequency facts in non-English languages can still be correctly recalled. Our analysis reveals that these instances largely benefit from crosslingual transfer of their English counterparts – an effect that emerges predominantly in the early stages of pretraining. We pinpoint two distinct pathways through which multilingual factual knowledge acquisition occurs: (1) frequency-driven learning, which is dominant and language-agnostic, and (2) crosslingual transfer, which is limited in scale and typically constrained to relation types involving named entities.

MCML Authors

Yihong Liu

Mingyang Wang

Amir Hossein Kargaran

→ Group Barbara Plank
AI and Computational Linguistics

Felicia Körner

Ercong Nie

Barbara Plank

Prof. Dr.

Principal Investigator

Hinrich Schütze

Prof. Dr.

Principal Investigator

[150]

P. Lin, J. Wang, H. Schütze and W. Li.
Modeling Content-Emotion Duality via Disentanglement for Empathetic Conversation.
Preprint (Sep. 2022). arXiv

Abstract

The task of empathetic response generation aims to understand what feelings a speaker expresses on his/her experiences and then reply to the speaker appropriately. To solve the task, it is essential to model the content-emotion duality of a dialogue, which is composed of the content view (i.e., what personal experiences are described) and the emotion view (i.e., the feelings of the speaker on these experiences). To this end, we design a framework to model the Content-Emotion Duality (CEDual) via disentanglement for empathetic response generation. With disentanglement, we encode the dialogue history from both the content and emotion views, and then generate the empathetic response based on the disentangled representations, so that both the content and emotion information of the dialogue history can be embedded in the generated response. The experiments on the benchmark dataset EMPATHETICDIALOGUES show that the CEDual model achieves state-of-the-art performance on both automatic and human metrics, and it also generates more empathetic responses than previous methods.

MCML Authors

Peiqin Lin

Hinrich Schütze

Prof. Dr.

Principal Investigator

[149]

Y. Liu, X. Xu, E. Nie, Z. Wang, S. Feng, D. Wang, Q. Li and H. Schütze.
Look Within or Look Beyond? A Theoretical Comparison Between Parameter-Efficient and Full Fine-Tuning.
Preprint (May. 2025). arXiv GitHub

Abstract

Parameter-Efficient Fine-Tuning (PEFT) methods achieve performance comparable to Full Fine-Tuning (FFT) while requiring significantly fewer computing resources, making them the go-to choice for researchers. We find that although PEFT can achieve competitive results on some benchmarks, its performance falls short of FFT in complex tasks, such as reasoning and instruction-based fine-tuning. In this paper, we compare the characteristics of PEFT and FFT in terms of representational capacity and robustness based on optimization theory. We theoretically demonstrate that PEFT is a strict subset of FFT. By providing theoretical upper bounds for PEFT, we show that the limited parameter space constrains the model’s representational ability, making it more susceptible to perturbations. Experiments on 15 datasets encompassing classification, generation, reasoning, and instruction fine-tuning tasks, as well as 11 adversarial test sets, validate our theories. We hope that these results spark further research beyond the realms of well-established PEFT.

MCML Authors

Ercong Nie

Hinrich Schütze

Prof. Dr.

Principal Investigator

[148]

Y. Liu, H. Ye, C. Ma, M. Wang and H. Schütze.
LangSAMP: Language-Script Aware Multilingual Pretraining.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL GitHub

Abstract

Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings – learnable vectors assigned to different languages. These embeddings are discarded for two main reasons: (1) mPLMs are expected to have a single, unified parameter set across all languages, and (2) they need to function seamlessly as universal text encoders without requiring language IDs as input. However, this removal increases the burden on token embeddings to encode all language-specific information, which may hinder the model’s ability to produce more language-neutral representations. To address this challenge, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning while maintaining a simple architecture. Specifically, we integrate these embeddings into the output of the transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline. Extensive analysis further shows that language/script embeddings encode language/script-specific information, which improves the selection of source languages for crosslingual transfer.
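The integration step described in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's code: the embeddings are combined with the hidden states by element-wise addition, all tokens of a sequence share one language and one script embedding, and the function names are hypothetical.

```python
def add_language_and_script(hidden_states, lang_emb, script_emb):
    """Add one language vector and one script vector to every token state.

    hidden_states: list of token vectors from the last transformer block.
    lang_emb, script_emb: vectors of the same dimension (assumed additive).
    """
    return [
        [h + l + s for h, l, s in zip(token, lang_emb, script_emb)]
        for token in hidden_states
    ]


def lm_head(states, weight):
    """Project each token state to vocabulary logits; weight is vocab_size x dim."""
    return [
        [sum(w * x for w, x in zip(row, token)) for row in weight]
        for token in states
    ]


# The transformer blocks themselves never see a language ID in this sketch;
# only the representation handed to the prediction head is shifted.
hidden = [[1.0, 2.0], [0.0, 1.0]]
shifted = add_language_and_script(hidden, [0.5, -0.5], [0.5, 0.5])
logits = lm_head(shifted, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Applying the embeddings only before the head is what lets the encoder still be used as a universal text encoder without language IDs, as the abstract requires.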

MCML Authors

Yihong Liu

Haotian Ye

Chunlan Ma

Mingyang Wang

Hinrich Schütze

Prof. Dr.

Principal Investigator

[147]

Y. Liu, H. Ye, L. Weissweiler, P. Wicke, R. Pei, R. Zangenfeind and H. Schütze.
A Crosslingual Investigation of Conceptualization in 1335 Languages.
ACL 2023 - 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI

Abstract

Languages differ in how they divide up the world into concepts and words; e.g., in contrast to English, Swahili has a single concept for ‘belly’ and ‘womb’. We investigate these differences in conceptualization across 1,335 languages by aligning concepts in a parallel corpus. To this end, we propose Conceptualizer, a method that creates a bipartite directed alignment graph between source language concepts and sets of target language strings. In a detailed linguistic analysis across all languages for one concept (‘bird’) and an evaluation on gold standard data for 32 Swadesh concepts, we show that Conceptualizer has good alignment accuracy. We demonstrate the potential of research on conceptualization in NLP with two experiments. (1) We define crosslingual stability of a concept as the degree to which it has 1-1 correspondences across languages, and show that concreteness predicts stability. (2) We represent each language by its conceptualization pattern for 83 concepts, and define a similarity measure on these representations. The resulting measure for the conceptual similarity between two languages is complementary to standard genealogical, typological, and surface similarity measures. For four out of six language families, we can assign languages to their correct family based on conceptual similarity with accuracies between 54% and 87%.

MCML Authors

Yihong Liu

Haotian Ye

Leonie Weissweiler

* Former Member

Philipp Wicke

Dr.

Hinrich Schütze

Prof. Dr.

Principal Investigator

[146]

Y. Liu, H. Ye, L. Weissweiler, R. Pei and H. Schütze.
Crosslingual Transfer Learning for Low-Resource Languages Based on Multilingual Colexification Graphs.
Findings @EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

In comparative linguistics, colexification refers to the phenomenon of a lexical form conveying two or more distinct meanings. Existing work on colexification patterns relies on annotated word lists, limiting scalability and usefulness in NLP. In contrast, we identify colexification patterns of more than 2,000 concepts across 1,335 languages directly from an unannotated parallel corpus. We then propose simple and effective methods to build multilingual graphs from the colexification patterns: ColexNet and ColexNet+. ColexNet’s nodes are concepts and its edges are colexifications. In ColexNet+, concept nodes are additionally linked through intermediate nodes, each representing an ngram in one of 1,334 languages. We use ColexNet+ to train $\overrightarrow{\text{ColexNet+}}$, high-quality multilingual embeddings that are well-suited for transfer learning. In our experiments, we first show that ColexNet achieves high recall on CLICS, a dataset of crosslingual colexifications. We then evaluate $\overrightarrow{\text{ColexNet+}}$ on roundtrip translation, sentence retrieval and sentence classification and show that our embeddings surpass several transfer learning baselines. This demonstrates the benefits of using colexification as a source of information in multilingual NLP.

MCML Authors

Yihong Liu

Haotian Ye

Leonie Weissweiler

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[145]

Y. Liu, Y. Zhang, Q. Li, T. Liu, S. Feng, D. Wang, Y. Zhang and H. Schütze.
HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Full-parameter fine-tuning has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent performance. As LMs grow in size, fine-tuning the full parameters of LMs requires a prohibitively large amount of GPU memory. Existing approaches utilize zeroth-order optimizers to conserve GPU memory, which can potentially compromise the performance of LMs as non-zero order optimizers tend to converge more readily on most downstream tasks. In this paper, we propose a novel optimizer-independent end-to-end hierarchical fine-tuning strategy, HiFT, which only updates a subset of parameters at each training step. HiFT can significantly reduce the amount of gradients and optimizer state parameters residing in GPU memory at the same time, thereby reducing GPU memory usage. Our results demonstrate that: (1) HiFT achieves comparable performance to parameter-efficient fine-tuning and standard full-parameter fine-tuning. (2) HiFT supports various optimizers including AdamW, AdaGrad, SGD, etc. (3) HiFT can save more than 60% GPU memory compared with standard full-parameter fine-tuning for a 7B model. (4) HiFT enables full-parameter fine-tuning of a 7B model on a single 48G A6000 with a precision of 32 using the AdamW optimizer, without using any memory saving techniques.
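The core idea of updating only a subset of parameters per step can be sketched as follows. This is a simplified illustration, not the HiFT implementation: the round-robin group order, the plain SGD update, and the function name `hift_like_step` are assumptions.

```python
def hift_like_step(params, grads, step, lr=0.5):
    """Update exactly one parameter group per training step.

    params, grads: lists of groups (e.g. one group per layer), each a list of floats.
    The active group is chosen round-robin (assumption); all other groups are
    left unchanged in this step.
    """
    active = step % len(params)
    params[active] = [p - lr * g for p, g in zip(params[active], grads[active])]
    return active


params = [[1.0], [2.0], [3.0]]
grads = [[1.0], [1.0], [1.0]]
order = [hift_like_step(params, grads, step) for step in range(4)]
# After four steps every group has been visited once and group 0 twice.
```

In a real training loop only the active group's gradients and optimizer state would need to be held in GPU memory at a given step, which is the source of the memory savings the abstract reports. Over a full cycle every parameter is still updated, so the procedure remains full-parameter fine-tuning.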

MCML Authors

Yongkang Liu

Dr.

* Former Member

→ Group Volker Tresp
Database Systems, Data Mining and AI

Tong Liu

Hinrich Schütze

Prof. Dr.

Principal Investigator

[144]

J. Li, M. Zhao, Y. Xie, A. Maronikolakis, P. Pu and H. Schütze.
This joke is [MASK]: Recognizing Humor and Offense with Prompting.
TL4NLP @NeurIPS 2022 - 1st Transfer Learning for Natural Language Processing Workshop at the 36th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL

Abstract

Humor is a magnetic component in everyday human interactions and communications. Computationally modeling humor enables NLP systems to entertain and engage with users. We investigate the effectiveness of prompting, a new transfer learning paradigm for NLP, for humor recognition. We show that prompting performs similarly to finetuning when numerous annotations are available, but gives stellar performance in low-resource humor recognition. The relationship between humor and offense is also inspected by applying influence functions to prompting; we show that models could rely on offense to determine humor during transfer.

MCML Authors

Antonis Maronikolakis

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[143]

A. Muñoz-Ortiz, V. Blaschke and B. Plank.
Evaluating Pixel Language Models on Non-Standardized Languages.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL

Abstract

We explore the potential of pixel-based models for transfer learning from standard languages to dialects. These models convert text into images that are divided into patches, enabling a continuous vocabulary representation that proves especially useful for out-of-vocabulary words common in dialectal data. Using German as a case study, we compare the performance of pixel-based models to token-based models across various syntactic and semantic tasks. Our results show that pixel-based models outperform token-based models in part-of-speech tagging, dependency parsing and intent detection for zero-shot dialect evaluation by up to 26 percentage points in some scenarios, though not in Standard German. However, pixel-based models fall short in topic classification. These findings emphasize the potential of pixel-based models for handling dialectal data, though further research should be conducted to assess their effectiveness in various linguistic contexts.

MCML Authors

Verena Blaschke

Barbara Plank

Prof. Dr.

Principal Investigator

[142]

A. Maronikolakis, P. Baader and H. Schütze.
Analyzing Hate Speech Data along Racial, Gender and Intersectional Axes.
GeBNLP 2022 - 4th Workshop on Gender Bias in Natural Language Processing. Seattle, WA, USA, Jul 15, 2022. DOI

Abstract

To tackle the rising phenomenon of hate speech, efforts have been made towards data curation and analysis. When it comes to analysis of bias, previous work has focused predominantly on race. In our work, we further investigate bias in hate speech datasets along racial, gender and intersectional axes. We identify strong bias against African American English (AAE), masculine and AAE+Masculine tweets, which are annotated as disproportionately more hateful and offensive than those from other demographics. We provide evidence that BERT-based models propagate this bias and show that balancing the training data for these protected attributes can lead to fairer models with regard to gender, but not race.

MCML Authors

Antonis Maronikolakis

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[141]

A. Modarressi, H. Deilamsalehy, F. Dernoncourt, T. Bui, R. A. Rossi, S. Yoon and H. Schütze.
NoLiMa: Long-Context Evaluation Beyond Literal Matching.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. arXiv URL

Abstract

Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a ‘needle’ (relevant information) from a ‘haystack’ (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.
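What "minimal lexical overlap" between a question and a needle means can be made concrete with a simple measure. The sketch below uses Jaccard overlap on lowercased word sets; this metric, the helper names, and the example sentences are illustrative assumptions and not the benchmark's own construction procedure.

```python
import re


def word_set(text):
    """Lowercased alphanumeric tokens of a sentence, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_overlap(question, needle):
    """Jaccard overlap between the word sets of a question and a needle."""
    q, n = word_set(question), word_set(needle)
    if not q or not n:
        return 0.0
    return len(q & n) / len(q | n)


# Hypothetical needle and two hypothetical questions about it.
needle = "Yuki lives next to the opera house."
literal = lexical_overlap("Who lives next to the opera house?", needle)
latent = lexical_overlap("Which character has been to Dresden?", needle)
```

The first question can be answered by string matching against the haystack, so its overlap is high. The second shares almost no words with the needle and requires associating an opera house with a city, which is the kind of latent link the abstract says models must infer.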

MCML Authors

Ali Modarressi

Hinrich Schütze

Prof. Dr.

Principal Investigator

[140]

L. Madaan, D. Esiobu, P. Stenetorp, B. Plank and D. Hupkes.
Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. DOI

Abstract

In the recent past, a popular way of evaluating natural language understanding (NLU) was to consider a model’s ability to perform natural language inference (NLI) tasks. In this paper, we investigate if NLI tasks, which are rarely used for LLM evaluation, can still be informative for evaluating LLMs. Focusing on five different NLI benchmarks across six models of different scales, we investigate if they are able to discriminate models of different size and quality and how their accuracies develop during training. Furthermore, we investigate the extent to which the softmax distributions of models align with human distributions in cases where statements are ambiguous or vague. Overall, our results paint a positive picture for the NLI tasks: we find that they are able to discriminate well between models at various stages of training, yet are not (all) saturated. Furthermore, we find that while the similarity of model distributions with human label distributions increases with scale, it is still much higher than the similarity between two populations of humans, making it a potentially interesting statistic to consider.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[139]

M. Di Marco and A. Fraser.
Subword Segmentation in LLMs: Looking at Inflection and Consistency.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

The role of subword segmentation in relation to capturing morphological patterns in LLMs is currently not well explored. Ideally, one would train models like GPT using various segmentations and evaluate how well word meanings are captured. Since this is not computationally feasible, we group words according to their segmentation properties and compare how well a model can solve a linguistic task for these groups. We study two criteria: (i) adherence to morpheme boundaries and (ii) the segmentation consistency of the different inflected forms of a lemma. We select word forms with high and low values for these criteria and carry out experiments on GPT-4o’s ability to capture verbal inflection for 10 languages. Our results indicate that in particular the criterion of segmentation consistency can help to predict the model’s ability to recognize and generate the lemma from an inflected form, providing evidence that subword segmentation is relevant.

MCML Authors

Alexander Fraser

Prof. Dr.

Principal Investigator

[138]

A. Modarressi, M. Fayyaz, E. Aghazadeh, Y. Yaghoobzadeh and M. T. Pilehvar.
DecompX: Explaining Transformers Decisions by Propagating Token Decomposition.
ACL 2023 - 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI GitHub

Abstract

An emerging solution for explaining Transformer-based models is to use vector-based analysis on how the representations are formed. However, providing a faithful vector-based explanation for a multi-layer model could be challenging in three aspects: (1) Incorporating all components into the analysis, (2) Aggregating the layer dynamics to determine the information flow and mixture throughout the entire model, and (3) Identifying the connection between the vector-based analysis and the model’s predictions. In this paper, we present DecompX to tackle these challenges. DecompX is based on the construction of decomposed token representations and their successive propagation throughout the model without mixing them in between layers. Additionally, our proposal provides multiple advantages over existing solutions for its inclusion of all encoder components (especially nonlinear feed-forward networks) and the classification head. The former allows acquiring precise vectors while the latter transforms the decomposition into meaningful prediction-based values, eliminating the need for norm- or summation-based vector aggregation. According to the standard faithfulness evaluations, DecompX consistently outperforms existing gradient-based and vector-based approaches on various datasets.

MCML Authors

Ali Modarressi

[137]

M. Müller-Eberstein, R. van der Goot, B. Plank and I. Titov.
Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training.
Findings @EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

Representational spaces learned via language modeling are fundamental to Natural Language Processing (NLP); however, there has been limited understanding of how and when during training various types of linguistic information emerge and interact. Leveraging a novel information theoretic probing suite, which enables direct comparisons of not just task performance, but their representational subspaces, we analyze nine tasks covering syntax, semantics and reasoning, across 2M pre-training steps and five seeds. We identify critical learning phases across tasks and time, during which subspaces emerge, share information, and later disentangle to specialize. Across these phases, syntactic knowledge is acquired rapidly after 0.5% of full training. Continued performance improvements primarily stem from the acquisition of open-domain knowledge, while semantics and reasoning tasks benefit from later boosts to long-range contextualization and higher specialization. Measuring cross-task similarity further reveals that linguistically related tasks share information throughout training, and do so more during the critical phase of learning than before or after. Our findings have implications for model interpretability, multi-task learning, and learning from limited data.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[136]

A. Modarressi, A. Imani, M. Fayyaz and H. Schütze.
RET-LLM: Towards a General Read-Write Memory for Large Language Models.
Preprint (May. 2023). arXiv

Abstract

Large language models (LLMs) have significantly advanced the field of natural language processing (NLP) through their extensive parameters and comprehensive data utilization. However, existing LLMs lack a dedicated memory unit, limiting their ability to explicitly store and retrieve knowledge for various tasks. In this paper, we propose RET-LLM, a novel framework that equips LLMs with a general write-read memory unit, allowing them to extract, store, and recall knowledge from the text as needed for task performance. Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets. The memory unit is designed to be scalable, aggregatable, updatable, and interpretable. Through qualitative evaluations, we demonstrate the superiority of our proposed framework over baseline approaches in question answering tasks. Moreover, our framework exhibits robust performance in handling temporal-based question answering tasks, showcasing its ability to effectively manage time-dependent information.
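
Illustrative sketch (not taken from the paper): the abstract describes storing knowledge as triplets in a memory that can be written to and read from. The toy class below shows what such a store could look like. The class name `TripleMemory`, its methods, and the example facts are all hypothetical, and the actual RET-LLM memory and its interface may differ substantially.

```python
from collections import defaultdict


class TripleMemory:
    """Toy read-write memory holding (subject, relation, object) triplets."""

    def __init__(self):
        # Index triplets by subject so lookups do not scan the whole store.
        self._by_subject = defaultdict(set)

    def write(self, subject, relation, obj):
        """Store one triplet; writing the same triplet twice has no effect."""
        self._by_subject[subject].add((relation, obj))

    def read(self, subject, relation=None):
        """Return stored triplets for a subject, optionally filtered by relation."""
        return sorted(
            (subject, rel, obj)
            for rel, obj in self._by_subject.get(subject, ())
            if relation is None or rel == relation
        )


# Hypothetical facts, used only to exercise the interface.
memory = TripleMemory()
memory.write("Alice", "works_at", "Example Lab")
memory.write("Alice", "lives_in", "Example City")
```

Because each stored item is an explicit triplet, the contents can be listed, updated, or merged directly, which is the kind of interpretability and updatability the abstract attributes to the memory unit.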

MCML Authors

Ali Modarressi

Ayyoob Imani

Hinrich Schütze

Prof. Dr.

Principal Investigator

[135]

D. R. Mortensen, V. Izrailevitch, Y. Xiao, H. Schütze and L. Weissweiler.
Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL

Abstract

Lexical-syntactic flexibility, in the form of conversion (or zero-derivation), is a hallmark of English morphology. In conversion, a word with one part of speech is placed in a non-prototypical context, where it is coerced to behave as if it had a different part of speech. However, while this process affects a large part of the English lexicon, little work has been done to establish the degree to which language models capture this type of generalization. This paper reports the first study on the behavior of large language models with reference to conversion. We design a task for testing lexical-syntactic flexibility, that is, the degree to which models can generalize over words in a construction with a non-prototypical part of speech. This task is situated within a natural language inference paradigm. We test the abilities of five language models: two proprietary models (GPT-3.5 and GPT-4) and three open-source models (Mistral 7B, Falcon 40B, and Llama 2 70B). We find that GPT-4 performs best on the task, followed by GPT-3.5, but that the open-source language models are also able to perform it, and that the 7-billion-parameter Mistral displays as little difference between its baseline performance on the natural language inference task and the non-prototypical syntactic category task as the massive GPT-4.

MCML Authors

Hinrich Schütze

Prof. Dr.

Principal Investigator

Leonie Weissweiler

* Former Member

[134]

C. Ma, A. ImaniGooghari, H. Ye, R. Pei, E. Asgari and H. Schütze.
Taxi1500: A Dataset for Multilingual Text Classification in 1500 Languages.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. URL

Abstract

While natural language processing tools have been developed extensively for some of the world’s languages, a significant portion of the world’s over 7000 languages are still neglected. One reason for this is that evaluation datasets do not yet cover a wide range of languages, including low-resource and endangered ones. We aim to address this issue by creating a text classification dataset encompassing a large number of languages, many of which currently have little to no annotated data available. We leverage parallel translations of the Bible to construct such a dataset by first developing applicable topics and employing a crowdsourcing tool to collect annotated data. By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages. We extensively benchmark several existing multilingual language models using our dataset. To facilitate the advancement of research in this area, we will release our dataset and code.
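
Illustrative sketch (not taken from the paper): the abstract describes annotating the English side of a parallel corpus and projecting the labels onto other languages through aligned verses. The snippet below shows this projection step in its simplest form, assuming each verse carries a shared identifier across languages. The function name `project_labels`, the verse IDs, and the topic labels are hypothetical placeholders.

```python
def project_labels(english_labels, target_verses):
    """Copy each English verse label to the target-language verse with the same ID.

    english_labels: dict mapping verse ID -> topic label (annotated on English).
    target_verses:  dict mapping verse ID -> verse text in the target language.
    Verses without an annotated English counterpart are skipped.
    """
    return [
        (text, english_labels[verse_id])
        for verse_id, text in target_verses.items()
        if verse_id in english_labels
    ]


# Placeholder data: two annotated English verses, two target-language verses.
english_labels = {"v1": "topic_a", "v2": "topic_b"}
target_verses = {
    "v1": "<verse v1 in target language>",
    "v3": "<verse v3 in target language>",
}
projected = project_labels(english_labels, target_verses)
```

Only `v1` appears on both sides, so only that verse receives a projected label; coverage per language therefore depends on how many annotated verses its translation contains.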

MCML Authors

Chunlan Ma

Haotian Ye

Hinrich Schütze

Prof. Dr.

Principal Investigator

[133]

A. Modarressi, A. Köksal, A. Imani, M. Fayyaz and H. Schütze.
MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory.
NFAM @ICLR 2025 - Workshop on New Frontiers in Associative Memories at the 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL

Abstract

While current large language models (LLMs) perform well on many knowledge-related tasks, they are limited by relying on their parameters as an implicit storage mechanism. As a result, they struggle with memorizing rare events and with updating their memory as facts change over time. In addition, the uninterpretable nature of parametric memory makes it challenging to prevent hallucination. Model editing and augmenting LLMs with parameters specialized for memory are only partial solutions. In this paper, we introduce MemLLM, a novel method of enhancing LLMs by integrating a structured and explicit read-and-write memory module. MemLLM tackles the aforementioned challenges by enabling dynamic interaction with the memory and improving the LLM’s capabilities in using stored knowledge. Our experiments indicate that MemLLM enhances the LLM’s performance and interpretability, in language modeling in general and knowledge-intensive tasks in particular. We see MemLLM as an important step towards making LLMs more grounded and factual through memory augmentation.

MCML Authors

Ali Modarressi

Abdullatif Köksal

* Former Member

Ayyoob Imani

Hinrich Schütze

Prof. Dr.

Principal Investigator

[132]

A. Modarressi, A. Köksal, A. Imani, M. Fayyaz and H. Schütze.
MemLLM: Finetuning LLMs to Use Explicit Read-Write Memory.
Transactions on Machine Learning Research (Apr. 2025). URL GitHub

MCML Authors

Ali Modarressi

Abdullatif Köksal

* Former Member

Ayyoob Imani

Hinrich Schütze

Prof. Dr.

Principal Investigator

[131]

A. Maronikolakis, A. Köksal and H. Schütze.
Sociocultural knowledge is needed for selection of shots in hate speech detection tasks.
LT-EDI 2024 - 4th Workshop on Language Technology for Equality, Diversity, Inclusion. St. Julian’s, Malta, Mar 21, 2024. URL

Abstract

We introduce HATELEXICON, a lexicon of slurs and targets of hate speech for Brazil, Germany, India and Kenya, to aid model development and interpretability. First, we demonstrate how HATELEXICON can be used to interpret model predictions, showing that models developed to classify extreme speech rely heavily on target group names. Further, we propose a culturally-informed method to aid shot selection for training in low-resource settings. In few-shot learning, shot selection is of paramount importance to model performance and we need to ensure we make the most of available data. We work with HASOC German and Hindi data for training and the Multilingual HateCheck (MHC) benchmark for evaluation. We show that selecting shots based on our lexicon leads to models performing better than models trained on shots sampled randomly. Thus, when given only a few training examples, using HATELEXICON to select shots containing more sociocultural information leads to better few-shot performance. With these two use-cases we show how our HATELEXICON can be used for more effective hate speech detection.

MCML Authors

Antonis Maronikolakis

* Former Member

Abdullatif Köksal

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[130]

A. Modarressi, A. Köksal and H. Schütze.
Consistent Document-Level Relation Extraction via Counterfactuals.
Findings @EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Many datasets have been developed to train and evaluate document-level relation extraction (RE) models. Most of these are constructed using real-world data. It has been shown that RE models trained on real-world data suffer from factual biases. To evaluate and address this issue, we present CovEReD, a counterfactual data generation approach for document-level relation extraction datasets using entity replacement. We first demonstrate that models trained on factual data exhibit inconsistent behavior: while they accurately extract triples from factual data, they fail to extract the same triples after counterfactual modification. This inconsistency suggests that models trained on factual data rely on spurious signals such as specific entities and external knowledge – rather than on the input context – to extract triples. We show that by generating document-level counterfactual data with CovEReD and training models on them, consistency is maintained with minimal impact on RE performance. We release our CovEReD pipeline as well as Re-DocRED-CF, a dataset of counterfactual RE documents, to assist in evaluating and addressing inconsistency in document-level RE.

MCML Authors

Ali Modarressi

Abdullatif Köksal

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[129]

C. Ma, Y. Liu, H. Ye and H. Schütze.
Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts.
MRL @EMNLP 2025 - 5th Multilingual Representation Learning Workshop at the Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025. To be published. Preprint available.

MCML Authors

Chunlan Ma

Yihong Liu

Haotian Ye

Hinrich Schütze

Prof. Dr.

Principal Investigator

[128]

B. Ma, Y. Li, W. Zhou, Z. Gong, Y. J. Liu, K. Jasinskaja, A. Friedrich, J. Hirschberg, F. Kreuter and B. Plank.
Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Understanding pragmatics, the use of language in context, is crucial for developing NLP systems capable of interpreting nuanced language use. Despite recent advances in language technologies, including large language models, evaluating their ability to handle pragmatic phenomena such as implicatures and references remains challenging. To advance pragmatic abilities in models, it is essential to understand current evaluation trends and identify existing limitations. In this survey, we provide a comprehensive review of resources designed for evaluating pragmatic capabilities in NLP, categorizing datasets by the pragmatics phenomena they address. We analyze task designs, data collection methods, evaluation approaches, and their relevance to real-world applications. By examining these resources in the context of modern language models, we highlight emerging trends, challenges, and gaps in existing benchmarks. Our survey aims to clarify the landscape of pragmatic evaluation and guide the development of more comprehensive and targeted benchmarks, ultimately contributing to more nuanced and context-aware NLP models.

MCML Authors

Bolei Ma

→ Group Barbara Plank
AI and Computational Linguistics

Yang Janet Liu

* Former Member

Frauke Kreuter

Prof. Dr.

Principal Investigator

Barbara Plank

Prof. Dr.

Principal Investigator

[127]

B. Ma, E. Nie, H. Schmid and H. Schütze.
Is Prompt-Based Finetuning Always Better than Vanilla Finetuning? Insights from Cross-Lingual Language Understanding.
KONVENS 2023 - 19th Conference on Natural Language Processing. Ingolstadt, Germany, Sep 18-22, 2023. URL

Abstract

Multilingual pretrained language models (MPLMs) have demonstrated substantial performance improvements in zero-shot cross-lingual transfer across various natural language understanding tasks by finetuning MPLMs on task-specific labelled data of a source language (e.g. English) and evaluating on a wide range of target languages. Recent studies show that prompt-based finetuning surpasses regular finetuning in few-shot scenarios. However, the exploration of prompt-based learning in multilingual tasks remains limited. In this study, we propose the PROFIT pipeline to investigate the cross-lingual capabilities of Prompt-based Finetuning. We conduct comprehensive experiments on diverse cross-lingual language understanding tasks (sentiment classification, paraphrase identification, and natural language inference) and empirically analyze the variation trends of prompt-based finetuning performance in cross-lingual transfer across different few-shot and full-data settings. Our results reveal the effectiveness and versatility of prompt-based finetuning in cross-lingual language understanding. Our findings indicate that prompt-based finetuning outperforms vanilla finetuning in full-data scenarios and exhibits greater advantages in few-shot scenarios, with different performance patterns dependent on task types. Additionally, we analyze underlying factors such as language similarity and pretraining data size that impact the cross-lingual performance of prompt-based finetuning. Overall, our work provides valuable insights into the cross-lingual prowess of prompt-based finetuning.

MCML Authors

Bolei Ma

Ercong Nie

Hinrich Schütze

Prof. Dr.

Principal Investigator

[126]

B. Ma, E. Nie, S. Yuan, H. Schmid, M. Färber, F. Kreuter and H. Schütze.
ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks.
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julian’s, Malta, Mar 17-22, 2024. URL

Abstract

Prompt-based methods have been successfully applied to multilingual pretrained language models for zero-shot cross-lingual understanding. However, most previous studies primarily focused on sentence-level classification tasks, and only a few considered token-level labeling tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. In this paper, we propose Token-Level Prompt Decomposition (ToPro), which facilitates the prompt-based method for token-level sequence labeling tasks. The ToPro method decomposes an input sentence into single tokens and applies one prompt template to each token. Our experiments on multilingual NER and POS tagging datasets demonstrate that ToPro-based fine-tuning outperforms Vanilla fine-tuning and Prompt-Tuning in zero-shot cross-lingual transfer, especially for languages that are typologically different from the source language English. Our method also attains state-of-the-art performance when employed with the mT5 model. Besides, our exploratory study in multilingual large language models shows that ToPro performs much better than the current in-context learning method. Overall, the performance improvements show that ToPro could potentially serve as a novel and simple benchmarking method for sequence labeling tasks.
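
Illustrative sketch (not taken from the paper): the abstract states that ToPro decomposes an input sentence into single tokens and applies one prompt template to each token. The snippet below shows that decomposition step only. The function name `decompose` and the template wording are assumptions made for illustration; the templates and verbalizers used in the paper are not given in the abstract.

```python
def decompose(sentence_tokens, template='{sentence} The label of "{token}" is [MASK].'):
    """Build one prompt per token, each containing the full sentence as context.

    The template is a hypothetical placeholder, not the one used in the paper.
    """
    sentence = " ".join(sentence_tokens)
    return [template.format(sentence=sentence, token=tok) for tok in sentence_tokens]


prompts = decompose(["Angela", "visited", "Munich"])
for prompt in prompts:
    print(prompt)
```

A sentence of n tokens yields n prompts, so a masked language model can predict one label per token while still seeing the whole sentence.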

MCML Authors

Bolei Ma

Ercong Nie

Frauke Kreuter

Prof. Dr.

Principal Investigator

Hinrich Schütze

Prof. Dr.

Principal Investigator

[125]

A. Maronikolakis, P. O’Grady, H. Schütze and M. Lyra.
Improving Few-Shot Learning with Multilingual Transfer and Monte Carlo Training Set Selection.
LSD 2023 - CLASP Conference on Learning with Small Data. Gothenburg, Sweden, Sep 11-12, 2023. URL

Abstract

In industry settings, machine learning is an attractive tool to automatize processes. Unfortunately, annotated and high-quality data is expensive to source. This problem is exacerbated in settings spanning multiple markets and languages. Thus, developing solutions for multilingual tasks with little available data is challenging. Few-shot learning is a compelling approach when building solutions in multilingual and low-resource settings, since the method not only requires just a few training examples to achieve high performance, but is also a technique agnostic to language. Even though the technique can be applied to multilingual settings, optimizing performance is an open question. In our work we show that leveraging higher-resource, task-specific language data can boost overall performance and we propose a method to select training examples per their average performance in a Monte Carlo simulation, resulting in a training set more conducive to learning. We demonstrate the effectiveness of our methods in fashion text reviews moderation, classifying reviews as related or unrelated to the given product. We show that our methodology boosts performance in multilingual (English, French, German) settings, increasing F1 score and significantly decreasing false positives.
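
Illustrative sketch (not taken from the paper): the abstract proposes selecting training examples by their average performance in a Monte Carlo simulation. One plausible reading is sketched below: draw many random training subsets, score each subset, credit that score to every example in it, and keep the examples with the highest average. The function name, parameters, and the toy scoring function are assumptions; in practice the score would come from training and evaluating a model on each sampled subset.

```python
import random
from collections import defaultdict


def monte_carlo_select(pool, score_fn, subset_size, n_trials, k, seed=0):
    """Rank examples by the mean score of the random subsets they appeared in."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(n_trials):
        subset = rng.sample(pool, subset_size)
        score = score_fn(subset)  # stand-in for "train on subset, evaluate"
        for example in subset:
            totals[example] += score
            counts[example] += 1
    averages = {ex: totals[ex] / counts[ex] for ex in counts}
    return sorted(averages, key=averages.get, reverse=True)[:k]


# Toy setup: examples 8 and 9 are the only "helpful" ones in a pool of ten.
pool = list(range(10))
toy_score = lambda subset: sum(1 for ex in subset if ex >= 8)
selected = monte_carlo_select(pool, toy_score, subset_size=3, n_trials=200, k=2)
```

In this toy setup the two helpful examples end up with clearly higher average scores than the rest, so they are the ones selected.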

MCML Authors

Antonis Maronikolakis

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[124]

P. Mondorf and B. Plank.
Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Deductive reasoning plays a pivotal role in the formulation of sound and cohesive arguments. It allows individuals to draw conclusions that logically follow, given the truth value of the information provided. Recent progress in the domain of large language models (LLMs) has showcased their capability in executing deductive reasoning tasks. Nonetheless, a significant portion of research primarily assesses the accuracy of LLMs in solving such tasks, often overlooking a deeper analysis of their reasoning behavior. In this study, we draw upon principles from cognitive psychology to examine inferential strategies employed by LLMs, through a detailed evaluation of their responses to propositional logic problems. Our findings indicate that LLMs display reasoning patterns akin to those observed in humans, including strategies like supposition following or chain construction. Moreover, our research demonstrates that the architecture and scale of the model significantly affect its preferred method of reasoning, with more advanced models tending to adopt strategies more frequently than less sophisticated ones. Importantly, we assert that a model’s accuracy, that is the correctness of its final conclusion, does not necessarily reflect the validity of its reasoning process. This distinction underscores the necessity for more nuanced evaluation procedures in the field.

MCML Authors

Philipp Mondorf

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[123]

P. Mondorf and B. Plank.
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models--A Survey.
COLM 2024 - Conference on Language Modeling. Philadelphia, PA, USA, Oct 07-09, 2024. PDF

Abstract

Large language models (LLMs) have recently shown impressive performance on tasks involving reasoning, leading to a lively debate on whether these models possess reasoning capabilities similar to humans. However, despite these successes, the depth of LLMs’ reasoning abilities remains uncertain. This uncertainty partly stems from the predominant focus on task performance, measured through shallow accuracy metrics, rather than a thorough investigation of the models’ reasoning behavior. This paper seeks to address this gap by providing a comprehensive review of studies that go beyond task accuracy, offering deeper insights into the models’ reasoning processes. Furthermore, we survey prevalent methodologies to evaluate the reasoning behavior of LLMs, emphasizing current trends and efforts towards more nuanced reasoning analyses. Our review suggests that LLMs tend to rely on surface-level patterns and correlations in their training data, rather than on sophisticated reasoning abilities. Additionally, we identify the need for further research that delineates the key differences between human and LLM-based reasoning. Through this survey, we aim to shed light on the complex reasoning processes within LLMs.

MCML Authors

Philipp Mondorf

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[122]

P. Mondorf and B. Plank.
Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Knights and knaves problems represent a classic genre of logical puzzles where characters either tell the truth or lie. The objective is to logically deduce each character’s identity based on their statements. The challenge arises from the truth-telling or lying behavior, which influences the logical implications of each statement. Solving these puzzles requires not only direct deductions from individual statements, but the ability to assess the truthfulness of statements by reasoning through various hypothetical scenarios. As such, knights and knaves puzzles serve as compelling examples of suppositional reasoning. In this paper, we introduce TruthQuest, a benchmark for suppositional reasoning based on the principles of knights and knaves puzzles. Our benchmark presents problems of varying complexity, considering both the number of characters and the types of logical statements involved. Evaluations on TruthQuest show that large language models like Llama 3 and Mixtral-8x7B exhibit significant difficulties solving these tasks. A detailed error analysis of the models’ output reveals that lower-performing models exhibit a diverse range of reasoning errors, frequently failing to grasp the concept of truth and lies. In comparison, more proficient models primarily struggle with accurately inferring the logical implications of potentially false statements.
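
Illustrative sketch (not taken from the paper or the TruthQuest benchmark): a knights and knaves puzzle can be solved by supposing each possible assignment of identities in turn and keeping those in which every knight's statement is true and every knave's statement is false. The function name `solve` and the two-character puzzle are made up for illustration, and the sketch assumes one statement per character.

```python
from itertools import product


def solve(names, statements):
    """Return every knight/knave assignment consistent with all statements.

    statements maps a speaker to a function that takes an assignment
    (name -> True for knight, False for knave) and returns whether the
    speaker's claim holds under that assignment.
    """
    solutions = []
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # A knight's claim must be true, a knave's claim must be false.
        if all(world[speaker] == claim(world) for speaker, claim in statements.items()):
            solutions.append(world)
    return solutions


# A says: "B is a knave."  B says: "A and I are both knights."
statements = {
    "A": lambda w: not w["B"],
    "B": lambda w: w["A"] and w["B"],
}
solutions = solve(["A", "B"], statements)
print(solutions)  # [{'A': True, 'B': False}]
```

The search checks 2^n assignments for n characters, which makes it practical only for small puzzles, but it makes the supposition step explicit: each hypothetical world is tested against every statement.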

MCML Authors

Philipp Mondorf

Barbara Plank

Prof. Dr.

Principal Investigator

[121]

C. Müller and B. Plank.
IndirectQA: Understanding Indirect Answers to Implicit Polar Questions in French and Spanish.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL

Abstract

Polar questions are common in dialogue and expect exactly one of two answers (yes/no). It is, however, not uncommon for speakers to bypass these expected choices and answer, for example, ‘Islands are generally by the sea’ to the question: ‘An island? By the sea?’. While such answers are natural in spoken dialogues, conversational systems still struggle to interpret them. Seminal work on interpreting indirect answers has been done in recent years, but only for English and with strict question formulations. In this work, we present IndirectQA, a new corpus for French and Spanish, where we mine subtitle data for indirect answers to study the labeling task with six different labels, while broadening polar questions to also include implicit polar questions (statements that trigger a yes/no answer but are not necessarily formulated as a question). We opted for subtitles since they are a readily available source of conversation in various languages, but they also come with peculiarities and challenges, which we discuss. Overall, we provide the first results on French and Spanish. They show that the task is challenging: the baseline accuracy scores drop from 61.43 on English to 44.06 for French and Spanish.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[120]

M. Mozes, M. Schmitt, V. Golkov, H. Schütze and D. Cremers.
Scene Graph Generation for Better Image Captioning?
Preprint (Sep. 2021). arXiv

MCML Authors

Vladimir Golkov

Dr.

Hinrich Schütze

Prof. Dr.

Principal Investigator

Daniel Cremers

Prof. Dr.

Director

[119]

M. Müller-Eberstein, R. van der Goot and B. Plank.
Spectral Probing.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Linguistic information is encoded at varying timescales (subwords, phrases, etc.) and communicative levels, such as syntax and semantics. Contextualized embeddings have analogously been found to capture these phenomena at distinctive layers and frequencies. Leveraging these findings, we develop a fully learnable frequency filter to identify spectral profiles for any given task. It enables vastly more granular analyses than prior handcrafted filters, and improves on efficiency. After demonstrating the informativeness of spectral probing over manual filters in a monolingual setting, we investigate its multilingual characteristics across seven diverse NLP tasks in six languages. Our analyses identify distinctive spectral profiles which quantify cross-task similarity in a linguistically intuitive manner, while remaining consistent across languages—highlighting their potential as robust, lightweight task descriptors.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[118]

B. Ma, X. Wang, T. Hu, A.-C. Haensch, M. A. Hedderich, B. Plank and F. Kreuter.
The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models.
Findings @EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Recent advances in Large Language Models (LLMs) have sparked wide interest in validating and comprehending the human-like cognitive-behavioral traits LLMs may capture and convey. These cognitive-behavioral traits typically include Attitudes, Opinions, and Values (AOVs). However, measuring AOVs embedded within LLMs remains opaque, and different evaluation methods may yield different results. This has led to a lack of clarity on how different studies are related to each other and how they can be interpreted. This paper aims to bridge this gap by providing a comprehensive overview of recent works on the evaluation of AOVs in LLMs. Moreover, we survey related approaches in different stages of the evaluation pipeline in these works. By doing so, we address the potential and challenges with respect to understanding the model, human-AI alignment, and downstream application in social sciences. Finally, we provide practical insights into evaluation methods, model enhancement, and interdisciplinary collaboration, thereby contributing to the evolving landscape of evaluating AOVs in LLMs.

MCML Authors

Bolei Ma

→ Group Barbara Plank
AI and Computational Linguistics

Xinpeng Wang

Anna-Carolina Haensch

Dr.

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

Barbara Plank

Prof. Dr.

Principal Investigator

Frauke Kreuter

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[117]

P. Mondorf, S. Wold and B. Plank.
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions via subnetworks that can be composed to perform more complex tasks. Recent developments in mechanistic interpretability have made progress in identifying subnetworks, often referred to as circuits, which represent the minimal computational subgraph responsible for a model’s behavior on specific tasks. However, most studies focus on identifying circuits for individual tasks without investigating how functionally similar circuits relate to each other. To address this gap, we examine the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. Specifically, given a probabilistic context-free grammar, we identify and compare circuits responsible for ten modular string-edit operations. Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness. Moreover, we demonstrate that the circuits identified can be reused and combined through subnetwork set operations to represent more complex functional capabilities of the model.

MCML Authors

Philipp Mondorf

Barbara Plank

Prof. Dr.

Principal Investigator

[116]

B. Ma, B. Yoztyurk, A.-C. Haensch, X. Wang, M. Herklotz, F. Kreuter, B. Plank and M. Aßenmacher.
Algorithmic Fidelity of Large Language Models in Generating Synthetic German Public Opinions: A Case Study.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

In recent research, large language models (LLMs) have been increasingly used to investigate public opinions. This study investigates the algorithmic fidelity of LLMs, i.e., the ability to replicate the socio-cultural context and nuanced opinions of human participants. Using open-ended survey data from the German Longitudinal Election Studies (GLES), we prompt different LLMs to generate synthetic public opinions reflective of German subpopulations by incorporating demographic features into the persona prompts. Our results show that Llama performs better than other LLMs at representing subpopulations, particularly when there is lower opinion diversity within those groups. Our findings further reveal that the LLM performs better for supporters of left-leaning parties like The Greens and The Left compared to other parties, and matches the least with the right-party AfD. Additionally, the inclusion or exclusion of specific variables in the prompts can significantly impact the models’ predictions. These findings underscore the importance of aligning LLMs to more effectively model diverse public opinions while minimizing political biases and enhancing robustness in representativeness.

MCML Authors

Bolei Ma

Anna-Carolina Haensch

Dr.

→ Group Barbara Plank
AI and Computational Linguistics

Xinpeng Wang

Frauke Kreuter

Prof. Dr.

Principal Investigator

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Bernd Bischl
Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

[115]

P. Mondorf, S. Zhou, M. Riedler and B. Plank.
Enabling Systematic Generalization in Abstract Spatial Reasoning through Meta-Learning for Compositionality.
Preprint (Apr. 2025). arXiv

Abstract

MCML Authors

Philipp Mondorf

→ Group Barbara Plank
AI and Computational Linguistics

Shijia Zhou

→ Group Barbara Plank
AI and Computational Linguistics

Monica Riedler

* Former Member

→ Group Almut Sophia Koepke
Computer Vision & Artificial Intelligence

Barbara Plank

Prof. Dr.

Principal Investigator

[114]

N. Nikeghbal, A. H. Kargaran and J. Diesner.
CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs.
EMNLP 2025 - Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025. To be published. GitHub

Abstract

Improvements in model construction, including fortified safety guardrails, allow large language models (LLMs) to increasingly pass standard safety checks. However, LLMs sometimes slip into revealing harmful behavior, such as expressing racist viewpoints, during conversations. To analyze this systematically, we introduce CoBia, a suite of lightweight adversarial attacks that allow us to refine the scope of conditions under which LLMs depart from normative or ethical behavior in conversations. CoBia creates a constructed conversation where the model utters a biased claim about a social group. We then evaluate whether the model can recover from the fabricated bias claim and reject biased follow-up questions. We evaluate 11 open-source as well as proprietary LLMs for their outputs related to six socio-demographic categories that are relevant to individual safety and fair treatment, i.e., gender, race, religion, nationality, sexual orientation, and others. Our evaluation is based on established LLM-based bias metrics, and we compare the results against human judgments to scope out the LLMs’ reliability and alignment. The results suggest that purposefully constructed conversations reliably reveal bias amplification and that LLMs often fail to reject biased follow-up questions during dialogue. This form of stress-testing highlights deeply embedded biases that can be surfaced through interaction.

MCML Authors

Amir Hossein Kargaran

[113]

E. Nie, S. Liang, H. Schmid and H. Schütze.
Cross-Lingual Retrieval Augmented Prompt for Low-Resource Languages.
Findings @ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI

Abstract

Multilingual Pretrained Language Models (MPLMs) perform strongly in cross-lingual transfer. We propose Prompts Augmented by Retrieval Crosslingually (PARC) to improve zero-shot performance on low-resource languages (LRLs) by augmenting the context with prompts consisting of semantically similar sentences retrieved from a high-resource language (HRL). PARC improves zero-shot performance on three downstream tasks (sentiment classification, topic categorization, natural language inference) with multilingual parallel test sets across 10 LRLs covering 6 language families in unlabeled (+5.1%) and labeled settings (+16.3%). PARC also outperforms finetuning by 3.7%. We find a significant positive correlation between cross-lingual transfer performance on one side, and the similarity between high- and low-resource languages as well as the amount of low-resource pretraining data on the other side. A robustness analysis suggests that PARC has the potential to achieve even stronger performance with more powerful MPLMs.

MCML Authors

Ercong Nie

Sheng Liang

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[112]

E. Nie, B. Shao, Z. Ding, M. Wang, H. Schmid and H. Schütze.
BMIKE-53: Investigating Cross-Lingual Knowledge Editing with In-Context Learning.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL GitHub

Abstract

Large language models (LLMs) possess extensive parametric knowledge, but this knowledge is difficult to update with new information because retraining is very expensive and infeasible for closed-source models. Knowledge editing (KE) has emerged as a viable solution for updating the knowledge of LLMs without compromising their overall performance. On-the-fly KE methods, inspired by in-context learning (ICL), have shown great promise and allow LLMs to be treated as black boxes. In the past, KE was primarily employed in English contexts, whereas the potential for cross-lingual KE in current English-centric LLMs has not been fully explored. To foster more research in this direction, we introduce the BMIKE-53 benchmark for evaluating cross-lingual KE on 53 diverse languages across three KE task types. We also propose a gradient-free KE method called Multilingual In-context Knowledge Editing (MIKE) and evaluate it on BMIKE-53. Our evaluation focuses on cross-lingual knowledge transfer in terms of reliability, generality, locality, and portability, offering valuable insights and a framework for future research in cross-lingual KE.

MCML Authors

Ercong Nie

→ Group Volker Tresp
Database Systems, Data Mining and AI

Zifeng Ding

Mingyang Wang

Hinrich Schütze

Prof. Dr.

Principal Investigator

[111]

E. Nie, H. Schmid and H. Schütze.
Unleashing the Multilingual Encoder Potential: Boosting Zero-Shot Performance via Probability Calibration.
Findings @EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

Pretrained multilingual encoder models can directly perform zero-shot multilingual tasks or linguistic probing by reformulating the input examples into cloze-style prompts. This is accomplished by predicting the probabilities of the label words at the masked token position, without requiring any updates to the model parameters. However, the performance of this method is limited by the model’s bias toward predicting label words which frequently occurred during the pretraining. These words typically receive high probabilities. To address this issue, we combine the models with calibration techniques which modify the probabilities of label words predicted by the models. We first validate the effectiveness of a proposed simple calibration method together with other existing techniques on monolingual encoders in both zero- and few-shot scenarios. We subsequently employ these calibration techniques on multilingual encoders, resulting in substantial performance improvements across a wide range of tasks.

MCML Authors

Ercong Nie

Hinrich Schütze

Prof. Dr.

Principal Investigator

[110]

E. Nie, H. Schmid and H. Schütze.
Cross-Lingual Constituency Parsing for Middle High German: A Delexicalized Approach.
ALP @RANLP 2023 - 1st Workshop on Ancient Language Processing co-located with the Conference on Recent Advances in Natural Language Processing. Varna, Bulgaria, Sep 08, 2023. URL

Abstract

Constituency parsing plays a fundamental role in advancing natural language processing (NLP) tasks. However, training an automatic syntactic analysis system for ancient languages solely relying on annotated parse data is a formidable task due to the inherent challenges in building treebanks for such languages. It demands extensive linguistic expertise, leading to a scarcity of available resources. To overcome this hurdle, cross-lingual transfer techniques which require minimal or even no annotated data for low-resource target languages offer a promising solution. In this study, we focus on building a constituency parser for Middle High German (MHG) under realistic conditions, where no annotated MHG treebank is available for training. In our approach, we leverage the linguistic continuity and structural similarity between MHG and Modern German (MG), along with the abundance of MG treebank resources. Specifically, by employing the delexicalization method, we train a constituency parser on MG parse datasets and perform cross-lingual transfer to MHG parsing. Our delexicalized constituency parser demonstrates remarkable performance on the MHG test set, achieving an F1-score of 67.3%. It outperforms the best zero-shot cross-lingual baseline by a margin of 28.6 percentage points. The encouraging results underscore the practicality and potential for automatic syntactic analysis in other ancient languages that face similar challenges as MHG.

MCML Authors

Ercong Nie

Hinrich Schütze

Prof. Dr.

Principal Investigator

[109]

E. Nie, H. Schmid and H. Schütze.
Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models.
Findings @EMNLP 2025 - Findings of the Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025. To be published. Preprint available. arXiv

Abstract

Language confusion – where large language models (LLMs) generate unintended languages against the user’s need – remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs) – specific positions where language switches occur – are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified via comparative analysis with multilingual-tuned models, substantially mitigates confusion without harming general competence or fluency. Our approach matches multilingual alignment in confusion reduction for most languages and yields cleaner, higher-quality outputs. These findings provide new insights into the internal dynamics of LLMs and highlight neuron-level interventions as a promising direction for robust, interpretable multilingual language modeling.

MCML Authors

Ercong Nie

Hinrich Schütze

Prof. Dr.

Principal Investigator

[108]

E. Nie, S. Yuan, B. Ma, H. Schmid, M. Färber, F. Kreuter and H. Schütze.
Decomposed Prompting: Unveiling Multilingual Linguistic Structure Knowledge in English-Centric Large Language Models.
Preprint (Feb. 2024). arXiv

Abstract

Despite the predominance of English in their training data, English-centric Large Language Models (LLMs) like GPT-3 and LLaMA display a remarkable ability to perform multilingual tasks, raising questions about the depth and nature of their cross-lingual capabilities. This paper introduces the decomposed prompting approach to probe the linguistic structure understanding of these LLMs in sequence labeling tasks. Diverging from the single text-to-text prompt, our method generates for each token of the input sentence an individual prompt which asks for its linguistic label. We assess our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages, utilizing both English-centric and multilingual LLMs. Our findings show that decomposed prompting surpasses the iterative prompting baseline in efficacy and efficiency under zero- and few-shot settings. Further analysis reveals the influence of evaluation methods and the use of instructions in prompts. Our multilingual investigation shows that English-centric language models perform better on average than multilingual models. Our study offers insights into the multilingual transferability of English-centric LLMs, contributing to the understanding of their multilingual linguistic knowledge.

MCML Authors

Ercong Nie

Bolei Ma

Frauke Kreuter

Prof. Dr.

Principal Investigator

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[107]

S. Okabe and A. Fraser.
Bilingual Sentence Mining for Low-Resource Languages: a Case Study on Upper and Lower Sorbian.
Compute-EL @ICLDC 2025 - 8th Workshop on The Use of Computational Methods in the Study of Endangered Languages at the 9th International Conference on Language Documentation and Conservation. Honolulu, Hawaii, USA, Mar 06-06, 2025. URL

Abstract

Parallel sentence mining is crucial for downstream tasks such as Machine Translation, especially for low-resource languages, where such resources are scarce. In this context, we apply a pipeline approach with contextual embeddings on two endangered Slavic languages spoken in Germany, Upper and Lower Sorbian, to evaluate mining quality. To this end, we compare off-the-shelf multilingual language models and word encoders pre-trained on Upper Sorbian to understand their impact on sentence mining. Moreover, to filter out irrelevant pairs, we experiment with a post-processing of mined sentences through an unsupervised word aligner based on word embeddings. We observe the usefulness of additional pre-training in Upper Sorbian, which leads to direct improvements when mining the same language but also its related language, Lower Sorbian.

MCML Authors

Shu Okabe

Dr.

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[106]

S. Okabe, K. Hämmerl and A. Fraser.
Improving Parallel Sentence Mining for Low-Resource and Endangered Languages.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

While parallel sentence mining has been extensively covered for fairly well-resourced languages, pairs involving low-resource languages have received comparatively little attention. To address this gap, we present Belopsem, a benchmark of new datasets for parallel sentence mining on three language pairs where the source side is low-resource and endangered: Occitan-Spanish, Upper Sorbian-German, and Chuvash-Russian. These combinations also reflect varying linguistic similarity within each pair. We compare three language models in an established parallel sentence mining pipeline and apply two types of improvements to one of them, Glot500. We observe better mining quality overall by both applying alignment post-processing with an unsupervised aligner and using a cluster-based isotropy enhancement technique. These findings are crucial for optimising parallel data extraction for low-resource languages in a realistic way.

MCML Authors

Shu Okabe

Dr.

Katharina Hämmerl

→ Group Alexander Fraser
Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

Principal Investigator

[105]

E. Özeren, Y. Liu and H. Schütze.
HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization.
SRW @ACL 2025 - Student Research Workshop at the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Many pre-trained language models (PLMs) exhibit suboptimal performance on mid- and low-resource languages, largely due to limited exposure to these languages during pre-training. A common strategy to address this is to introduce new tokens specific to the target languages, initialize their embeddings, and apply continual pre-training on target-language data. Among such methods, OFA (Liu et al., 2024a) proposes a similarity-based subword embedding initialization heuristic that is both effective and efficient. However, OFA restricts target-language token embeddings to be convex combinations of a fixed number of source-language embeddings, which may limit expressiveness. To overcome this limitation, we propose HYPEROFA, a hypernetwork-based approach for more adaptive token embedding initialization. The hypernetwork is trained to map from an external multilingual word vector space to the PLM's token embedding space using source-language tokens. Once trained, it can generate flexible embeddings for target-language tokens, serving as a good starting point for continual pre-training. Experiments demonstrate that HYPEROFA consistently outperforms the random initialization baseline and matches or exceeds the performance of OFA in both continual pre-training convergence and downstream task performance. We make the code publicly available.
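To make the phrase "convex combinations of a fixed number of source-language embeddings" concrete, here is a minimal sketch of that kind of initialization. It is not the OFA or HYPEROFA implementation: the function name `convex_init`, the use of cosine similarity, the softmax weighting, and the parameters `k` and `temperature` are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def convex_init(target_vec, source_vecs, source_embs, k=2, temperature=1.0):
    """Initialize a new token embedding as a convex combination of the
    embeddings of its k most similar source tokens (hypothetical helper).

    target_vec  : external word vector of the new target-language token
    source_vecs : external word vectors of the source tokens
    source_embs : model embeddings of the same source tokens
    """
    sims = [cosine(target_vec, s) for s in source_vecs]
    # Keep only the k most similar source tokens (the "fixed number").
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    # Softmax over similarities: weights are non-negative and sum to 1,
    # so the result lies in the convex hull of the chosen embeddings.
    exps = [math.exp(sims[i] / temperature) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(source_embs[0])
    emb = [sum(w * source_embs[i][d] for w, i in zip(weights, top))
           for d in range(dim)]
    return emb, weights


# Toy data: three source tokens with 2-d external vectors and 3-d embeddings.
source_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
source_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
emb, weights = convex_init([1.0, 0.1], source_vecs, source_embs, k=2)
```

Because the result can never leave the convex hull of the selected source embeddings, a learned mapping such as a hypernetwork can in principle produce embeddings that this scheme cannot.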

MCML Authors

Yihong Liu

Hinrich Schütze

Prof. Dr.

Principal Investigator

[104]

P. Piccirilli, A. Fraser and S. Schulte im Walde.
VOLIMET: A Parallel Corpus of Literal and Metaphorical Verb-Object Pairs for English–German and English–French.
*SEM 2024 - 13th Joint Conference on Lexical and Computational Semantics co-located with NAACL 2024. Mexico City, Mexico, Jun 20-21, 2024. DOI

Abstract

The interplay of cultural and linguistic elements that characterizes metaphorical language poses a substantial challenge for both human comprehension and machine processing. This challenge goes beyond monolingual settings and becomes particularly complex in translation, even more so in automatic translation. We present VOLIMET, a corpus of 2,916 parallel sentences containing gold standard alignments of metaphorical verb-object pairs and their literal paraphrases, e.g., tackle/address question, from English to German and French. On the one hand, the parallel nature of our corpus enables us to explore monolingual patterns for metaphorical vs. literal uses in English. On the other hand, we investigate different aspects of cross-lingual translations into German and French and the extent to which metaphoricity and literalness in the source language are transferred to the target languages. Monolingually, our findings reveal clear preferences in using metaphorical or literal uses of verb-object pairs. Cross-lingually, we observe a rich variability in translations as well as different behaviors for our two target languages.

MCML Authors

Alexander Fraser

Prof. Dr.

Principal Investigator

[103]

B. Plank.
The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Human variation in labeling is often considered noise. Annotation projects for machine learning (ML) aim at minimizing human label variation, with the assumption to maximize data quality and in turn optimize and maximize machine learning metrics. However, this conventional practice assumes that there exists a ground truth, and neglects that there exists genuine human variation in labeling due to disagreement, subjectivity in annotation or multiple plausible answers. In this position paper, we argue that this big open problem of human label variation persists and critically needs more attention to move our field forward. This is because human label variation impacts all stages of the ML pipeline: data, modeling and evaluation. However, few works consider all of these dimensions jointly; and existing research is fragmented. We reconcile different previously proposed notions of human label variation, provide a repository of publicly-available datasets with un-aggregated labels, depict approaches proposed so far, identify gaps and suggest ways forward. As datasets are becoming increasingly available, we hope that this synthesized view on the ‘problem’ will lead to an open discussion on possible strategies to devise fundamentally new directions.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[102]

R. Pei, Y. Liu, P. Lin, F. Yvon and H. Schütze.
Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT, as it can readily take advantage of linguistic resources such as grammar books and dictionaries. Such resources are usually selectively integrated into the prompt so that LLMs can directly perform translation without any specific training, via their in-context learning capability (ICL). However, the relative importance of each type of resource, e.g., dictionary, grammar book, and retrieved parallel examples, is not entirely clear. To address this gap, this study systematically investigates how each resource and its quality affect the translation performance, with the Manchu language as our case study. To remove any prior knowledge of Manchu encoded in the LLM parameters and single out the effect of ICL, we also experiment with an encrypted version of Manchu texts. Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help. In a follow-up study, we showcase a promising application of in-context MT: parallel data augmentation as a way to bootstrap the conventional MT model. When monolingual data abound, generating synthetic parallel data through in-context MT offers a pathway to mitigate data scarcity and build effective and efficient low-resource neural MT systems.

MCML Authors

Yihong Liu

Peiqin Lin

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[101]

S. Peng, Z. Sun, S. Loftus and B. Plank.
Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations.
UnImplicit 2024 - 3rd Workshop on Understanding Implicit and Underspecified Language. Malta, Mar 21, 2024. URL

Abstract

Named Entity Recognition (NER) is a key information extraction task with a long-standing tradition. While recent studies address and aim to correct annotation errors via re-labeling efforts, little is known about the sources of human label variation, such as text ambiguity, annotation error, or guideline divergence. This is especially the case for high-quality datasets and beyond English CoNLL03. This paper studies disagreements in expert-annotated named entity datasets for three languages: English, Danish, and Bavarian. We show that text ambiguity and artificial guideline changes are dominant factors for diverse annotations among high-quality revisions. We survey student annotations on a subset of difficult entities and substantiate the feasibility and necessity of manifold annotations for understanding named entity ambiguities from a distributional perspective.

MCML Authors

Siyao Peng

Dr.

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[100]

S. Peng, Z. Sun, H. Shan, M. Kolm, V. Blaschke, E. Artemova and B. Plank.
Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL

Abstract

Named Entity Recognition (NER) is a fundamental task to extract key information from texts, but annotated resources are scarce for dialects. This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet), using a schema adapted from German CoNLL 2006 and GermEval. The Bavarian dialect differs from standard German in lexical distribution, syntactic construction, and entity information. We conduct in-domain, cross-domain, sequential, and joint experiments on two Bavarian and three German corpora and present the first comprehensive NER results on Bavarian. Incorporating knowledge from the larger German NER (sub-)datasets notably improves on bar-wiki and moderately on bar-tweet. Inversely, training first on Bavarian contributes slightly to the seminal German CoNLL 2006 corpus. Moreover, with gold dialect labels on Bavarian tweets, we assess multi-task learning between five NER and two Bavarian-German dialect identification tasks and achieve NER SOTA on bar-wiki. We substantiate the necessity of our low-resource BarNER corpus and the importance of diversity in dialects, genres, and topics in enhancing model performance.

MCML Authors

Siyao Peng

Dr.

Verena Blaschke

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

[99]

Z. Peng, X. Yin, R. Qian, P. Lin, Y. Liu, H. Zhang, C. Ying and Y. Luo.
SolEval: Benchmarking Large Language Models for Repository-level Solidity Code Generation.
Preprint (Feb. 2025). arXiv GitHub

Abstract

Large language models (LLMs) have transformed code generation. However, most existing approaches focus on mainstream languages such as Python and Java, neglecting the Solidity language, the predominant programming language for Ethereum smart contracts. Due to the lack of adequate benchmarks for Solidity, LLMs’ ability to generate secure, cost-effective smart contracts remains unexplored. To fill this gap, we construct SolEval, the first repository-level benchmark designed for Solidity smart contract generation, to evaluate the performance of LLMs on Solidity. SolEval consists of 1,125 samples from 9 different repositories, covering 6 popular domains, providing LLMs with a comprehensive evaluation benchmark. Unlike the existing Solidity benchmark, SolEval not only includes complex function calls but also reflects the real-world complexity of the Ethereum ecosystem by incorporating gas fee and vulnerability rate. We evaluate 10 LLMs on SolEval, and our results show that the best-performing LLM achieves only 26.29% Pass@10, highlighting substantial room for improvement in Solidity code generation by LLMs.
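For readers unfamiliar with the Pass@10 figure quoted above: a common way to estimate pass@k from n sampled generations, of which c pass the tests, is the unbiased estimator 1 - C(n-c, k) / C(n, k) widely used in code-generation benchmarks. The sketch below implements that estimator; whether SolEval computes its scores in exactly this way is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n : number of generated samples
    c : number of those samples that pass all tests
    k : budget of samples considered (k <= n)

    Returns the probability that at least one of k samples drawn
    without replacement from the n generations is correct.
    """
    # comb(n - c, k) counts the k-subsets containing no correct sample;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Two samples, one correct: a single draw succeeds half of the time.
half = pass_at_k(2, 1, 1)  # 0.5
```

The benchmark-level score is then the mean of the per-problem estimates, as in `mean_pass_at_k`.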

MCML Authors

Peiqin Lin

[98]

M. J. Sabet.
Multilingual representations and models for improved low-resource language processing.
Dissertation Jul. 2022. DOI

Abstract

This thesis examines methods to improve Natural Language Processing (NLP) for low-resource languages, addressing challenges such as limited training data, lack of tokenization models, and difficulties in word segmentation. While pretrained language models have advanced multilingual representation learning, they primarily benefit high-resource languages. This work explores multilinguality in language models and develops techniques for word alignment without requiring parallel data. Key contributions include analyzing multilingual word alignments, extracting alignments from the Bible corpus, applying graph algorithms to improve alignments, generating cross-lingual embeddings from small parallel corpora, and enhancing alignment quality through subword sampling. These efforts aim to improve NLP for underrepresented languages. (Shortened.)

MCML Authors

Masoud Jalili Sabet

Dr.

* Former Member

[97]

S. Sharifzadeh, S. M. Baharlou, M. Schmitt, H. Schütze and V. Tresp.
Improving Scene Graph Classification by Exploiting Knowledge from Texts.
AAAI 2022 - 36th Conference on Artificial Intelligence. Virtual, Feb 22-Mar 01, 2022. DOI

Abstract

Training scene graph classification models requires a large amount of annotated image data. Meanwhile, scene graphs represent relational knowledge that can be modeled with symbolic data from texts or knowledge graphs. While image annotation demands extensive labor, collecting textual descriptions of natural scenes requires less effort. In this work, we investigate whether textual scene descriptions can substitute for annotated image data. To this end, we employ a scene graph classification framework that is trained not only from annotated images but also from symbolic data. In our architecture, the symbolic entities are first mapped to their correspondent image-grounded representations and then fed into the relational reasoning pipeline. Even though a structured form of knowledge, such as the form in knowledge graphs, is not always available, we can generate it from unstructured texts using a transformer-based language model. We show that by fine-tuning the classification pipeline with the extracted knowledge from texts, we can achieve ~8x more accurate results in scene graph classification, ~3x in object classification, and ~1.5x in predicate classification, compared to the supervised baselines with only 1% of the annotated images.

MCML Authors

Hinrich Schütze

Prof. Dr.

Principal Investigator

Volker Tresp

Prof. Dr.

Principal Investigator

[96]

R. S.-E. Shim, D. De Cristofaro, C. M. Hu, A. Vietti and B. Plank.
Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically.
Preprint (May 2025). arXiv

Abstract

Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. Such an alignment has also been observed in speech foundation models. However, it remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech. Building on prior work on spoken translation retrieval, we perform pronunciation-controlled experiments to observe if cross-lingual alignment can indeed occur in such models on a semantic basis, instead of relying on phonetic similarities. Our findings indicate that even in the absence of phonetic cues, spoken translation retrieval accuracy remains relatively stable. We follow up with a controlled experiment on a word-level dataset of cross-lingual synonyms and near-homophones, confirming the existence of both phonetic and semantic knowledge in the encoder. Finally, we qualitatively examine the transcriptions produced by early exiting the encoder, where we observe that speech translation produces semantic errors that are characterized by phonetic similarities to corresponding words in the source language. We apply this insight from early exiting to speech recognition in seven low-resource languages unsupported by the Whisper model, and achieve improved accuracy in all languages examined, particularly for languages with transparent orthographies.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[95]

V. Steinborn, P. Dufter, H. Jabbar and H. Schütze.
An Information-Theoretic Approach and Dataset for Probing Gender Stereotypes in Multilingual Masked Language Models.
Findings @NAACL 2022 - Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Seattle, WA, USA, Jun 10-15, 2022. DOI

Abstract

Bias research in NLP is a rapidly growing and developing field. Similar to CrowS-Pairs (Nangia et al., 2020), we assess gender bias in masked-language models (MLMs) by studying pairs of sentences with gender-swapped person references. Most bias research focuses on and often is specific to English. Using a novel methodology for creating sentence pairs that is applicable across languages, we create, based on CrowS-Pairs, a multilingual dataset for English, Finnish, German, Indonesian and Thai. Additionally, we propose SJSD, a new bias measure based on Jensen–Shannon divergence, which we argue retains more information from the model output probabilities than other previously proposed bias measures for MLMs. Using multilingual MLMs, we find that SJSD diagnoses the same systematic biased behavior for non-English that previous studies have found for monolingual English pre-trained MLMs. SJSD outperforms the CrowS-Pairs measure, which struggles to find such biases for smaller non-English datasets.
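
For readers unfamiliar with the divergence that SJSD builds on, here is a minimal sketch of the plain Jensen–Shannon divergence between two discrete distributions. It is not the paper's SJSD measure, whose exact definition is not given in this abstract; the function names and the example probabilities are hypothetical and chosen only for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical model probabilities over three candidate fill-ins for a masked
# token, once for each sentence of a gender-swapped pair (illustrative only).
p_sentence_a = [0.7, 0.2, 0.1]
p_sentence_b = [0.2, 0.7, 0.1]
print(js_divergence(p_sentence_a, p_sentence_b))
```

With base-2 logarithms the value is symmetric in its arguments and lies between 0 (identical distributions) and 1 (disjoint support), which makes it convenient for comparing full output distributions rather than single-token scores.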

MCML Authors

Victor Steinborn

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[94]

M. Sabet, P. Dufter, F. Yvon and H. Schütze.
SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings.
Preprint (Apr. 2021). arXiv

Abstract

Word alignments are useful for tasks like statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data, and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment. Our multilingual embeddings are created from monolingual data only without relying on any parallel data or dictionaries. We find that alignments created from embeddings are superior for four and comparable for two language pairs compared to those produced by traditional statistical aligners, even with abundant parallel data; e.g., contextualized embeddings achieve a word alignment F1 for English-German that is 5 percentage points higher than eflomal, a high-quality statistical aligner, trained on 100k parallel sentences.

MCML Authors

Hinrich Schütze

Prof. Dr.

Principal Investigator

[93]

L. K. Senel, B. Ebing, K. Baghirova, H. Schütze and G. Glavaš.
Kardeş-NLU: Transfer to Low-Resource Languages with Big Brother’s Help – A Benchmark and Evaluation for Turkic Languages.
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. Outstanding Paper Award. URL

Abstract

Cross-lingual transfer (XLT) driven by massively multilingual language models (mmLMs) has been shown largely ineffective for low-resource (LR) target languages with little (or no) representation in mmLM’s pretraining, especially if they are linguistically distant from the high-resource (HR) source language. Much of the recent focus in XLT research has been dedicated to LR language families, i.e., families without any HR languages (e.g., families of African languages or indigenous languages of the Americas). In this work, in contrast, we investigate a configuration that is arguably of practical relevance for more of the world’s languages: XLT to LR languages that do have a close HR relative. To explore the extent to which a HR language can facilitate transfer to its LR relatives, we (1) introduce Kardeş-NLU, an evaluation benchmark with language understanding datasets in five LR Turkic languages: Azerbaijani, Kazakh, Kyrgyz, Uzbek, and Uyghur; and (2) investigate (a) intermediate training and (b) fine-tuning strategies that leverage Turkish in XLT to these target languages. Our experimental results show that both strategies, integrating Turkish in intermediate training and in downstream fine-tuning, yield substantial improvements in XLT to LR Turkic languages. Finally, we benchmark cutting-edge instruction-tuned large language models on Kardeş-NLU, showing that their performance is highly task- and language-dependent.

MCML Authors

Lütfi Kerem Senel

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[92]

L. K. Senel.
Exploring the frontiers of word understanding and language model evaluation in NLP.
Dissertation May. 2025. DOI

Abstract

The field of natural language processing (NLP) has progressed dramatically with the rise of deep learning, yet many challenges in learning high-quality semantic representations remain. This thesis addresses these challenges through a series of studies focusing on both monolingual and multilingual contexts. (Shortened.)

MCML Authors

Lütfi Kerem Senel

Dr.

* Former Member

→ Group Barbara Plank
AI and Computational Linguistics

[91]

A. Säuberli, D. Frassinelli and B. Plank.
Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?
BEA @ACL 2025 - 20th Workshop on Innovative Use of NLP for Building Educational Applications at the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Knowing how test takers answer items in educational assessments is essential for test development, to evaluate item quality, and to improve test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness or psychometric plausibility of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics which are commonly used in educational assessment, classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can be more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans in reading comprehension items compared to other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
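
As a rough illustration of the temperature scaling mentioned above, the sketch below divides answer-option logits by a temperature before applying softmax. The function name and the logit values are hypothetical, and the paper's actual calibration procedure is not described in this abstract.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; a temperature above 1 flattens the output."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the four options of one multiple-choice item.
option_logits = [6.0, 2.0, 1.0, 0.5]
print(softmax_with_temperature(option_logits, temperature=1.0))
print(softmax_with_temperature(option_logits, temperature=3.0))
```

Raising the temperature spreads probability mass across the options, which is one way an overconfident response distribution could be brought closer to the spread of answers seen among human test takers.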

MCML Authors

Andreas Säuberli

Barbara Plank

Prof. Dr.

Principal Investigator

[90]

L. K. Senel, B. Fetahu, D. Yoshida, Z. Chen, G. Castellucci, N. Vedula, J. I. Choi and S. Malmasi.
Generative Explore-Exploit: Training-free Optimization of Generative Recommender Systems using LLM Optimizers.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Recommender systems are widely used to suggest engaging content, and Large Language Models (LLMs) have given rise to generative recommenders. Such systems can directly generate items, including for open-set tasks like question suggestion. While the world knowledge of LLMs enables good recommendations, improving the generated content through user feedback is challenging as continuously fine-tuning LLMs is prohibitively expensive. We present a training-free approach for optimizing generative recommenders by connecting user feedback loops to LLM-based optimizers. We propose a generative explore-exploit method that can not only exploit generated items with known high engagement, but also actively explore and discover hidden population preferences to improve recommendation quality. We evaluate our approach on question generation in two domains (e-commerce and general knowledge), and model user feedback with Click Through Rate (CTR). Experiments show our LLM-based explore-exploit approach can iteratively improve recommendations, and consistently increase CTR. Ablation analysis shows that generative exploration is key to learning user preferences, avoiding the pitfalls of greedy exploit-only approaches. A human evaluation strongly supports our quantitative findings.

MCML Authors

Lütfi Kerem Senel

Dr.

* Former Member

→ Group Alexander Fraser
Data Analytics & Statistics

[89]

H. S. Saadi, V. Hangya, T. Eder and A. Fraser.
Comparative Analysis of Cross-lingual Contextualized Word Embeddings.
MRL 2022 @EMNLP 2022 - 2nd Workshop on Multi-lingual Representation Learning at the Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Contextualized word embeddings have emerged as the most important tool for performing NLP tasks in a large variety of languages. In order to improve the cross-lingual representation and transfer learning quality, contextualized embedding alignment techniques, such as mapping and model fine-tuning, are employed. Existing techniques, however, are time-, data- and computational resource-intensive. In this paper we analyze these techniques by utilizing three tasks: bilingual lexicon induction (BLI), word retrieval and cross-lingual natural language inference (XNLI) for a high resource (German-English) and a low resource (Bengali-English) language pair. In contrast to previous works which focus only on a few popular models, we compare five multilingual and seven monolingual language models and investigate the effect of various aspects on their performance, such as vocabulary size, number of languages used for training and number of parameters. Additionally, we propose a parameter-, data- and runtime-efficient technique which can be trained with 10% of the data and less than 10% of the time, and which has less than 5% of the trainable parameters compared to model fine-tuning. We show that our proposed method is competitive with resource-heavy models, even outperforming them in some cases, even though it relies on fewer resources.

MCML Authors

Viktor Hangya

Dr.

* Former Member

Alexander Fraser

Prof. Dr.

Principal Investigator

[88]

J. Shin, M. A. Hedderich, B. J. Rey, A. Lucero and A. Oulasvirta.
Understanding Human-AI Workflows for Generating Personas.
DIS 2024 - ACM Conference on Designing Interactive Systems. Copenhagen, Denmark, Jul 01-05, 2024. DOI

Abstract

One barrier to deeper adoption of user-research methods is the amount of labor required to create high-quality representations of collected data. Trained user researchers need to analyze datasets and produce informative summaries pertaining to the original data. While Large Language Models (LLMs) could assist in generating summaries, they are known to hallucinate and produce biased responses. In this paper, we study human–AI workflows that differently delegate subtasks in user research between human experts and LLMs. Studying persona generation as our case, we found that LLMs are not good at capturing key characteristics of user data on their own. Better results are achieved when we leverage human skill in grouping user data by their key characteristics and exploit LLMs for summarizing pre-grouped data into personas. Personas generated via this collaborative approach can be more representative and empathy-evoking than ones generated by human experts or LLMs alone. We also found that LLMs could mimic generated personas and enable interaction with personas, thereby helping user researchers empathize with them. We conclude that LLMs, by facilitating the analysis of user data, may promote widespread application of qualitative methods in user research.

MCML Authors

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

→ Group Alexander Fraser
Data Analytics & Statistics

[87]

S. Severini, V. Hangya, M. J. Sabet, A. Fraser and H. Schütze.
Don't Forget Cheap Training Signals Before Building Unsupervised Bilingual Word Embeddings.
BUCC @LREC 2022 - 15th Workshop on Building and Using Comparable Corpora at the 13th International Conference on Language Resources and Evaluation. Marseille, France, Jun 21-23, 2022. URL

Abstract

Bilingual Word Embeddings (BWEs) are one of the cornerstones of cross-lingual transfer of NLP models. They can be built using only monolingual corpora without supervision leading to numerous works focusing on unsupervised BWEs. However, most of the current approaches to build unsupervised BWEs do not compare their results with methods based on easy-to-access cross-lingual signals. In this paper, we argue that such signals should always be considered when developing unsupervised BWE methods. The two approaches we find most effective are: 1) using identical words as seed lexicons (which unsupervised approaches incorrectly assume are not available for orthographically distinct language pairs) and 2) combining such lexicons with pairs extracted by matching romanized versions of words with an edit distance threshold. We experiment on thirteen non-Latin languages (and English) and show that such cheap signals work well and that they outperform using more complex unsupervised methods on distant language pairs such as Chinese, Japanese, Kannada, Tamil, and Thai. In addition, they are even competitive with the use of high-quality lexicons in supervised approaches. Our results show that these training signals should not be neglected when building BWEs, even for distant languages.

MCML Authors

Viktor Hangya

Dr.

* Former Member

Masoud Jalili Sabet

Dr.

* Former Member

Alexander Fraser

Prof. Dr.

Principal Investigator

Hinrich Schütze

Prof. Dr.

Principal Investigator

[86]

S. Severini, A. Imani, P. Dufter and H. Schütze.
Towards a Broad Coverage Named Entity Resource: A Data-Efficient Approach for Many Diverse Languages.
LREC 2022 - 13th International Conference on Language Resources and Evaluation. Marseille, France, Jun 21-23, 2022. URL

Abstract

Parallel corpora are ideal for extracting a multilingual named entity (MNE) resource, i.e., a dataset of names translated into multiple languages. Prior work on extracting MNE datasets from parallel corpora required resources such as large monolingual corpora or word aligners that are unavailable or perform poorly for underresourced languages. We present CLC-BN, a new method for creating an MNE resource, and apply it to the Parallel Bible Corpus, a corpus of more than 1000 languages. CLC-BN learns a neural transliteration model from parallel-corpus statistics, without requiring any other bilingual resources, word aligners, or seed data. Experimental results show that CLC-BN clearly outperforms prior work. We release an MNE resource for 1340 languages and demonstrate its effectiveness in two downstream tasks: knowledge graph augmentation and bilingual lexicon induction.

MCML Authors

Ayyoob Imani

Hinrich Schütze

Prof. Dr.

Principal Investigator

[85]

J. Shin, A. Khatri, M. A. Hedderich, A. Lucero and A. Oulasvirta.
Facilitating Asynchronous Idea Generation and Selection with Chatbots.
Preprint (Mar. 2025). arXiv

Abstract

People can generate high-quality ideas by building on each other’s ideas. By enabling individuals to contribute their ideas at their own comfortable time and method (i.e., asynchronous ideation), they can deeply engage in ideation and improve idea quality. However, running asynchronous ideation faces a practical constraint. Whereas trained human facilitators are needed to guide effective idea exchange, they cannot be continuously available to engage with individuals joining at varying hours. In this paper, we ask how chatbots can be designed to facilitate asynchronous ideation. For this, we adopted the guidelines found in the literature about human facilitators and designed two chatbots: one provides a structured ideation process, and another adapts the ideation process to individuals’ ideation performance. We invited 48 participants to generate and select ideas by interacting with one of our chatbots and invited an expert facilitator to review our chatbots. We found that both chatbots can guide users to build on each other’s ideas and converge them into a few satisfying ideas. However, we also found the chatbots’ limitations in social interaction with collaborators, which only human facilitators can provide. Accordingly, we conclude that chatbots can be promising facilitators of asynchronous ideation, but hybrid facilitation with human facilitators would be needed to address the social aspects of collaborative ideation.

MCML Authors

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

→ Group Barbara Plank
AI and Computational Linguistics

[84]

A. Sedova, R. Litschko, D. Frassinelli, B. Roth and B. Plank.
To Know or Not To Know? Analyzing Self-Consistency of Large Language Models under Ambiguity.
Findings @EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

One of the major aspects contributing to the striking performance of large language models (LLMs) is the vast amount of factual knowledge accumulated during pre-training. Yet, many LLMs suffer from self-inconsistency, which raises doubts about their trustworthiness and reliability. This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities. To do so, we propose an evaluation protocol that disentangles knowing from applying knowledge, and test state-of-the-art LLMs on 49 ambiguous entities. Our experiments reveal that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts. The results also reveal systematic discrepancies in LLM behavior, showing that while the models may possess knowledge, they struggle to apply it consistently, exhibit biases toward preferred readings, and display self-inconsistencies. This highlights the need to address entity ambiguity in the future for more trustworthy LLMs.

MCML Authors

Robert Litschko

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[83]

Y. Shen, W. Lai, S. Wang, X. Zhang, K. Luo, A. Fraser and M. Sun.
DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection.
Preprint (Feb. 2025). arXiv

Abstract

The rapid development of multilingual large language models (LLMs) highlights the need for high-quality, diverse, and clean multilingual datasets. In this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a large-scale multilingual corpus built using newly extracted Common Crawl data and existing multilingual datasets. DCAD-2000 includes over 2,282 languages, 46.72TB of data, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. To overcome the limitations of current data cleaning methods, which rely on manual heuristic thresholds, we propose reframing data cleaning as an anomaly detection task. This dynamic filtering approach significantly enhances data quality by identifying and removing noisy or anomalous content. We evaluate the quality of DCAD-2000 on the FineTask benchmark, demonstrating substantial improvements in multilingual dataset quality and task performance.

MCML Authors

Wen Lai

Alexander Fraser

Prof. Dr.

Principal Investigator

→ Group Alexander Fraser
Data Analytics & Statistics

[82]

Y. Shen, W. Lai, S. Wang, K. Luo, A. Fraser and M. Sun.
From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora.
Preprint (May. 2025). arXiv

Abstract

Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multiway parallel data consistently outperform those trained on unaligned multilingual data.

MCML Authors

Wen Lai

Alexander Fraser

Prof. Dr.

Principal Investigator

[81]

V. Steinborn, A. Maronikolakis and H. Schütze.
Politeness Stereotypes and Attack Vectors: Gender Stereotypes in Japanese and Korean Language Models.
Preprint (Jun. 2023). arXiv

Abstract

In efforts to keep up with the rapid progress and use of large language models, gender bias research is becoming more prevalent in NLP. Non-English bias research, however, is still in its infancy with most work focusing on English. In our work, we study how grammatical gender bias relating to politeness levels manifests in Japanese and Korean language models. Linguistic studies in these languages have identified a connection between gender bias and politeness levels; however, it is not yet known if language models reproduce these biases. We analyze relative prediction probabilities of the male and female grammatical genders using templates and find that informal polite speech is most indicative of the female grammatical gender, while rude and formal speech is most indicative of the male grammatical gender. Further, we find politeness levels to be an attack vector for allocational gender bias in cyberbullying detection models. Cyberbullies can evade detection through simple techniques abusing politeness levels. We introduce an attack dataset to (i) identify representational gender bias across politeness levels, (ii) demonstrate how gender biases can be abused to bypass cyberbullying detection models and (iii) show that allocational biases can be mitigated via training on our proposed dataset. Through our findings we highlight the importance of bias research moving beyond its current English-centrism.

MCML Authors

Victor Steinborn

Dr.

* Former Member

Antonis Maronikolakis

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[80]

R. Shim and B. Plank.
Dialetto, ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum.
Findings @NAACL 2025 - Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. DOI

Abstract

There is increasing interest in looking at dialects in NLP. However, most work to date still treats dialects as discrete categories. For instance, evaluative work in variation-oriented NLP for English often works with Indian English or African-American Vernacular English as homogeneous categories (Faisal et al., 2024; Ziems et al., 2023), yet even within one variety there is substantial variation. We examine within-dialect variation and show that performance critically varies within categories. We measure speech-to-text performance on Italian dialects, and empirically observe a geographical performance disparity. This disparity correlates substantially (-0.5) with linguistic similarity to the highest performing dialect variety. We cross-examine our results against dialectometry methods, and interpret the performance disparity to be due to a bias towards dialects that are more similar to the standard variety in the speech-to-text model examined. We additionally leverage geostatistical methods to predict zero-shot performance at unseen sites, and find the incorporation of geographical information to substantially improve prediction performance, indicating there to be geographical structure in the performance distribution.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[79]

A. Sorensen, S. Peng, B. Plank and R. Goot.
EEVEE: An Easy Annotation Tool for Natural Language Processing.
LAW @EACL 2024 - 18th Linguistic Annotation Workshop at the 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL

Abstract

Annotation tools are the starting point for creating Natural Language Processing (NLP) datasets. There is a wide variety of tools available; setting up these tools is, however, a hindrance. We propose EEVEE, an annotation tool focused on simplicity, efficiency, and ease of use. It can run directly in the browser (no setup required) and uses tab-separated files (as opposed to character offsets or task-specific formats) for annotation. It allows for annotation of multiple tasks on a single dataset and supports four task types: sequence labeling, span labeling, text classification and seq2seq.

MCML Authors

Siyao Peng

Dr.

Barbara Plank

Prof. Dr.

Principal Investigator

[78]

V. Steinborn.
Multilingual and multimodal bias probing and mitigation in natural language processing.
Dissertation Apr. 2024. DOI

Abstract

This thesis explores gender bias in Natural Language Processing (NLP) models, highlighting its negative societal impacts, such as discrimination in automated recruitment. While existing research largely focuses on English and occupational biases, this work expands the scope by addressing biases across different languages and contexts. The thesis presents three projects: (1) creating a multilingual dataset and a new bias evaluation measure, (2) examining how gender stereotypes in politeness affect cyberbullying detection in Korean and Japanese, and (3) analyzing how emoji-based visual representations influence biased text generation. These contributions aim to enhance fairness and inclusivity in NLP systems. (Shortened.)

MCML Authors

Victor Steinborn

Dr.

* Former Member

→ Group Barbara Plank
AI and Computational Linguistics

[77]

S. Si, X. Wang, G. Zhai, N. Navab and B. Plank.
Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior.
Preprint (Mar. 2025). arXiv

Abstract

Recent advancements in large language models (LLMs) have demonstrated that fine-tuning and human alignment can render LLMs harmless. In practice, such ‘harmlessness’ behavior is mainly achieved by training models to reject harmful requests, such as ‘Explain how to burn down my neighbor’s house’, where the model appropriately declines to respond. However, this approach can inadvertently result in false refusal, where models reject benign queries as well, such as ‘Tell me how to kill a Python process’. In this work, we demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior. Building on this finding, we introduce the Think-Before-Refusal (TBR) schema and conduct safety-aware instruction fine-tuning incorporating safety reflection. In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior while maintaining safety and overall performance compared to those fine-tuned without safety reflection.

MCML Authors

Xinpeng Wang

Guangyao Zhai

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Principal Investigator

Computer Aided Medical Procedures & Augmented Reality

Barbara Plank

Prof. Dr.

Principal Investigator

[76]

M. Thaler, A. Köksal, A. Leidinger, A. Korhonen and H. Schütze.
How far can bias go? -- Tracing bias from pretraining data to alignment.
Preprint (Nov. 2024). arXiv

Abstract

As LLMs are increasingly integrated into user-facing applications, addressing biases that perpetuate societal inequalities is crucial. While much work has gone into measuring or mitigating biases in these models, fewer studies have investigated their origins. Therefore, this study examines the correlation between gender-occupation bias in pre-training data and its manifestation in LLMs, focusing on the Dolma dataset and the OLMo model. Using zero-shot prompting and token co-occurrence analyses, we explore how biases in training data influence model outputs. Our findings reveal that biases present in pre-training data are amplified in model outputs. The study also examines the effects of prompt types, hyperparameters, and instruction-tuning on bias expression, finding that instruction-tuning partially alleviates representational bias while still maintaining overall stereotypical gender associations, whereas hyperparameters and prompting variation have a lesser effect on bias expression. Our research traces bias throughout the LLM development pipeline and underscores the importance of mitigating bias at the pretraining stage.

MCML Authors

Abdullatif Köksal

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[75]

Z. S. Taghavi, A. Modarressi, Y. Ma and H. Schütze.
ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge.
EMNLP 2025 - Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025. To be published. Preprint available. arXiv GitHub

Abstract

Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing techniques – like prompting or multi-hop retrieval – that can help resolve complexity. In contrast, we present ImpliRet, a benchmark that shifts the reasoning challenge to document-side processing: The queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving ‘two days ago’), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 15.07%. We also test whether long-context models can overcome this limitation. But even with a short context of only ten documents, including the positive document, GPT-4.1 scores only 35.06%, showing that document-side reasoning remains a challenge.

MCML Authors

Zeinab Sadat Taghavi

Ali Modarressi

→ Group Volker Tresp
Database Systems, Data Mining and AI

Yunpu Ma

Dr.

Hinrich Schütze

Prof. Dr.

Principal Investigator

[74]

A. Testoni, B. Plank and R. Fernández.
RACQUET: Unveiling the Dangers of Overlooked Referential Ambiguity in Visual LLMs.
Preprint (Dec. 2024). arXiv

Abstract

Ambiguity resolution is key to effective communication. While humans effortlessly address ambiguity through conversational grounding strategies, the extent to which current language models can emulate these strategies remains unclear. In this work, we examine referential ambiguity in image-based question answering by introducing RACQUET, a carefully curated dataset targeting distinct aspects of ambiguity. Through a series of evaluations, we reveal significant limitations and problems of overconfidence of state-of-the-art large multimodal language models in addressing ambiguity in their responses. The overconfidence issue becomes particularly relevant for RACQUET-BIAS, a subset designed to analyze a critical yet underexplored problem: failing to address ambiguity leads to stereotypical, socially biased responses. Our results underscore the urgency of equipping models with robust strategies to deal with uncertainty without resorting to undesirable stereotypes.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[73]

D. Ulmer, E. Bassignana, M. Müller-Eberstein, D. Varab, M. Zhang, R. van der Goot, C. Hardmeier and B. Plank.
Experimental Standards for Deep Learning in Natural Language Processing Research.
Findings @EMNLP 2022 - Findings of the Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

The field of Deep Learning (DL) has undergone explosive growth during the last decade, with a substantial impact on Natural Language Processing (NLP) as well. Yet, compared to more established disciplines, a lack of common experimental standards remains an open challenge to the field at large. Starting from fundamental scientific principles, we distill ongoing discussions on experimental standards in NLP into a single, widely-applicable methodology. Following these best practices is crucial to strengthen experimental evidence, improve reproducibility and enable scientific progress. These standards are further collected in a public repository to help them transparently adapt to future needs.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[72]

L. Veloso, L. Hirlimann, P. Wicke and H. Schütze.
SLAyiNG: Towards Queer Language Processing.
QueerInAI @NeurIPS 2025 - Queer in AI Workshop at the 39th Conference on Neural Information Processing Systems. San Diego, CA, USA, Nov 30-Dec 07, 2025. arXiv

Abstract

Knowledge of slang is a desirable feature of LLMs in the context of user interaction, as slang often reflects an individual’s social identity. Several works on informal language processing have defined and curated benchmarks for tasks such as detection and identification of slang. In this paper, we focus on queer slang. Queer slang can be mistakenly flagged as hate speech or can evoke negative responses from LLMs during user interaction. Research efforts so far have not focused explicitly on queer slang. In particular, detection and processing of queer slang have not been thoroughly evaluated due to the lack of a high-quality annotated benchmark. To address this gap, we curate SLAyiNG, the first dataset containing annotated queer slang derived from subtitles, social media posts, and podcasts, reflecting real-world usage. We describe our data curation process, including the collection of slang terms and definitions, scraping sources for examples that reflect usage of these terms, and our ongoing annotation process. As preliminary results, we calculate inter-annotator agreement for human annotators and OpenAI’s model o3-mini, evaluating performance on the task of sense disambiguation. Reaching an average Krippendorff’s alpha of 0.746, we argue that state-of-the-art reasoning models can serve as tools for pre-filtering, but the complex and often sensitive nature of queer language data requires expert and community-driven annotation efforts.

MCML Authors

Leonor Veloso

Lea Hirlimann

Philipp Wicke

Dr.

Hinrich Schütze

Prof. Dr.

Principal Investigator

[71]

V. Hofmann, L. Weissweiler, D. Mortensen, H. Schütze and J. Pierrehumbert.
Derivational Morphology Reveals Analogical Generalization in Large Language Models.
Preprint (Nov. 2024). arXiv

Abstract

What mechanisms underlie linguistic generalization in large language models (LLMs)? This question has attracted considerable attention, with most studies analyzing the extent to which the language skills of LLMs resemble rules. As of yet, it is not known whether linguistic generalization in LLMs could equally well be explained as the result of analogical processes, which can be formalized as similarity operations on stored exemplars. A key shortcoming of prior research is its focus on linguistic phenomena with a high degree of regularity, for which rule-based and analogical approaches make the same predictions. Here, we instead examine derivational morphology, specifically English adjective nominalization, which displays notable variability. We introduce a new method for investigating linguistic generalization in LLMs: focusing on GPT-J, we fit cognitive models that instantiate rule-based and analogical learning to the LLM training data and compare their predictions on a set of nonce adjectives with those of the LLM, allowing us to draw direct conclusions regarding underlying mechanisms. As expected, rule-based and analogical models explain the predictions of GPT-J equally well for adjectives with regular nominalization patterns. However, for adjectives with variable nominalization patterns, the analogical model provides a much better match. Furthermore, GPT-J’s behavior is sensitive to the individual word frequencies, even for regular forms, a behavior that is consistent with an analogical account of regular forms but not a rule-based one. These findings refute the hypothesis that GPT-J’s linguistic generalization on adjective nominalization involves rules, suggesting similarity operations on stored exemplars as the underlying mechanism. Overall, our study suggests that analogical processes play a bigger role in the linguistic generalization of LLMs than previously thought.

MCML Authors

Hinrich Schütze

Prof. Dr.

Principal Investigator

[70]

M. Wang, H. Adel, L. Lange, J. Strötgen and H. Schütze.
GradSim: Gradient-Based Language Grouping for Effective Multilingual Training.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

Most languages of the world pose low-resource challenges to natural language processing models. With multilingual training, knowledge can be shared among languages. However, not all languages positively influence each other and it is an open research question how to select the most suitable set of languages for multilingual training and avoid negative interference among languages whose characteristics or data distributions are not compatible. In this paper, we propose GradSim, a language grouping method based on gradient similarity. Our experiments on three diverse multilingual benchmark datasets show that it leads to the largest performance gains compared to other similarity measures and it is better correlated with cross-lingual model performance. As a result, we set the new state of the art on AfriSenti, a benchmark dataset for sentiment analysis on low-resource African languages. In our extensive analysis, we further reveal that besides linguistic features, the topics of the datasets play an important role for language grouping and that lower layers of transformer models encode language-specific features while higher layers capture task-specific information.

MCML Authors

Mingyang Wang

Hinrich Schütze

Prof. Dr.

Principal Investigator

[69]

M. Wang, H. Adel, L. Lange, J. Strötgen and H. Schütze.
Rehearsal-Free Modular and Compositional Continual Learning for Language Models.
NAACL 2024 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Mexico City, Mexico, Jun 16-21, 2024. URL

Abstract

Continual learning aims at incrementally acquiring new knowledge while not forgetting existing knowledge. To overcome catastrophic forgetting, methods are either rehearsal-based, i.e., store data examples from previous tasks for data replay, or isolate parameters dedicated to each task. However, rehearsal-based methods raise privacy and memory issues, and parameter-isolation continual learning does not consider interaction between tasks, thus hindering knowledge transfer. In this work, we propose MoCL, a rehearsal-free Modular and Compositional Continual Learning framework which continually adds new modules to language models and composes them with existing modules. Experiments on various benchmarks show that MoCL outperforms state of the art and effectively facilitates knowledge transfer.

MCML Authors

Mingyang Wang

Hinrich Schütze

Prof. Dr.

Principal Investigator

[68]

M. Wang, H. Adel, L. Lange, J. Strötgen and H. Schütze.
Learn it or Leave it: Module Composition and Pruning for Continual Learning.
Preprint (Jun. 2024). arXiv

Abstract

In real-world environments, continual learning is essential for machine learning models, as they need to acquire new knowledge incrementally without forgetting what they have already learned. While pretrained language models have shown impressive capabilities on various static tasks, applying them to continual learning poses significant challenges, including avoiding catastrophic forgetting, facilitating knowledge transfer, and maintaining parameter efficiency. In this paper, we introduce MoCL-P, a novel lightweight continual learning method that addresses these challenges simultaneously. Unlike traditional approaches that continuously expand parameters for newly arriving tasks, MoCL-P integrates task representation-guided module composition with adaptive pruning, effectively balancing knowledge integration and computational overhead. Our evaluation across three continual learning benchmarks with up to 176 tasks shows that MoCL-P achieves state-of-the-art performance and improves parameter efficiency by up to three times, demonstrating its potential for practical applications where resource requirements are constrained.

MCML Authors

Mingyang Wang

Hinrich Schütze

Prof. Dr.

Principal Investigator

[67]

M. Wang, H. Adel, L. Lange, Y. Liu, E. Nie, J. Strötgen and H. Schütze.
Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Multilingual language models (MLMs) store factual knowledge across languages but often struggle to provide consistent responses to semantically equivalent prompts in different languages. While previous studies point out this cross-lingual inconsistency issue, the underlying causes remain unexplored. In this work, we use mechanistic interpretability methods to investigate cross-lingual inconsistencies in MLMs. We find that MLMs encode knowledge in a language-independent concept space through most layers, and only transition to language-specific spaces in the final layers. Failures during the language transition often result in incorrect predictions in the target language, even when the answers are correct in other languages. To mitigate this inconsistency issue, we propose a linear shortcut method that bypasses computations in the final layers, enhancing both prediction accuracy and cross-lingual consistency. Our findings shed light on the internal mechanisms of MLMs and provide a lightweight, effective strategy for producing more consistent factual outputs.

MCML Authors

Mingyang Wang

Yihong Liu

Ercong Nie

Hinrich Schütze

Prof. Dr.

Principal Investigator

[66]

P. Wicke and M. M. Bolognesi.
Red and blue language: Word choices in the Trump and Harris 2024 presidential debate.
PLOS One 20.6 (Jun. 2025). DOI GitHub

Abstract

Political debates are a peculiar type of political discourse, in which candidates directly confront one another, addressing not only the moderator’s questions, but also their opponent’s statements, as well as the concerns of voters from both parties and undecided voters. Therefore, language is adjusted to meet specific expectations and achieve persuasion. We analyse how the language of Trump and Harris during the Presidential debate (September 10th, 2024) differs in relation to semantic and pragmatic features, for which we formulated targeted hypotheses: framing values and ideology, appealing to emotion, using words with different degrees of concreteness and specificity, addressing others through singular or plural pronouns. Our findings include: differences in the use of figurative frames (Harris often framing issues around recovery and empowerment, Trump often focused on crisis and decline); similar use of emotional language, with Trump showing a slightly higher tendency toward negativity and toward less subjective language compared to Harris; no significant difference in the specificity of candidates’ responses; similar use of abstract language, with Trump showing more variability than Harris, depending on the subject discussed; differences in addressing the opponent, with Trump not mentioning Harris by name, while Harris referring to Trump frequently; different uses of pronouns, with Harris using both singular and plural pronouns equally, while Trump using more singular pronouns. The results are discussed in relation to previous literature on Red and Blue language, which refers to distinct linguistic patterns associated with Republican (Red) and Democratic (Blue) political ideologies.

MCML Authors

Philipp Wicke

Dr.

[65]

L. Weissweiler, N. Böbel, K. Guiller, S. Herrera, W. Scivetti, A. Lorenzi, N. Melnik, A. Bhatia, H. Schütze, L. Levin, A. Zeldes, J. Nivre, W. Croft and N. Schneider.
UCxn: Typologically Informed Annotation of Constructions Atop Universal Dependencies.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL

Abstract

The Universal Dependencies (UD) project has created an invaluable collection of treebanks with contributions in over 140 languages. However, the UD annotations do not tell the full story. Grammatical constructions that convey meaning through a particular combination of several morphosyntactic elements – for example, interrogative sentences with special markers and/or word orders – are not labeled holistically. We argue for (i) augmenting UD annotations with a ‘UCxn’ annotation layer for such meaning-bearing grammatical constructions, and (ii) approaching this in a typologically informed way so that morphosyntactic strategies can be compared across languages. As a case study, we consider five construction families in ten languages, identifying instances of each construction in UD treebanks through the use of morphosyntactic patterns. In addition to findings regarding these particular constructions, our study yields important insights on methodology for describing and identifying constructions in language-general and language-particular ways, and lays the foundation for future constructional enrichment of UD treebanks.

MCML Authors

Leonie Weissweiler

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[64]

M. Marco and A. Fraser.
Analyzing the Understanding of Morphologically Complex Words in Large Language Models.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL

Abstract

We empirically study the ability of a Large Language Model (gpt-3.5-turbo-instruct) to understand morphologically complex words. In our experiments, we looked at a variety of tasks to analyse German compounds with regard to compositional word formation and derivation, such as identifying the head noun of existing and novel compounds, identifying the shared verb stem between two words, or recognizing words constructed with inappropriately used derivation morphemes as invalid. Our results show that the language model is generally capable of solving most tasks, except for the task of identifying ill-formed word forms. While the model demonstrated a good overall understanding of complex words and their word-internal structure, the results also suggest that there is no formal knowledge of derivational rules, but rather an interpretation of the observed word parts to derive the meaning of a word.

MCML Authors

Alexander Fraser

Prof. Dr.

Principal Investigator

[63]

P. Wicke, L. Hirlimann and J. M. Cunha.
Using Analogical Reasoning to Prompt LLMs for their Intuitions of Abstract Spatial Schemas.
Analogy-ANGLE 2024 @IJCAI 2024 - 1st Workshop on Analogical Abstraction in Cognition, Perception, and Language at the 33rd International Joint Conference on Artificial Intelligence. Jeju, Korea, Aug 03-09, 2024. PDF

Abstract

Abstract notions are often comprehended through analogies, wherein there exists correspondence or partial similarity with more concrete concepts. A fundamental aspect of human cognition involves synthesising embodied experiences into spatial schemas, which profoundly influence conceptualisation and underlie language acquisition. Recent studies have demonstrated that Large Language Models (LLMs) exhibit certain spatial intuitions akin to human language. For instance, both humans and LLMs tend to associate ↑ with hope more readily than with warn. However, the nuanced partial similarities between concrete (e.g., ↑) and abstract (e.g., hope) concepts, remain insufficiently explored. Therefore, we propose a novel methodology utilising analogical reasoning to elucidate these associations and examine whether LLMs adjust their associations in response to analogy-prompts. We find that analogy-prompting is slightly increasing agreement with human choices and the answers given by models include valid explanations supported by analogies, even when in disagreement with human results.

MCML Authors

Philipp Wicke

Dr.

Lea Hirlimann

[62]

L. Weissweiler, V. Hofmann, M. J. Sabet and H. Schütze.
CaMEL: Case Marker Extraction without Labels.
ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland, May 22-27, 2022. DOI

Abstract

We introduce CaMEL (Case Marker Extraction without Labels), a novel and challenging task in computational morphology that is especially relevant for low-resource languages. We propose a first model for CaMEL that uses a massively multilingual corpus to extract case markers in 83 languages based only on a noun phrase chunker and an alignment system. To evaluate CaMEL, we automatically construct a silver standard from UniMorph. The case markers extracted by our model can be used to detect and visualise similarities and differences between the case systems of different languages as well as to annotate fine-grained deep cases in languages in which they are not overtly marked.

MCML Authors

Leonie Weissweiler

* Former Member

Valentin Hofmann

Dr.

* Former Member

Masoud Jalili Sabet

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[61]

L. Weissweiler, V. Hofmann, A. Köksal and H. Schütze.
The better your Syntax, the better your Semantics? Probing Pretrained Language Models for the English Comparative Correlative.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Construction Grammar (CxG) is a paradigm from cognitive linguistics emphasising the connection between syntax and semantics. Rather than rules that operate on lexical items, it posits constructions as the central building blocks of language, i.e., linguistic units of different granularity that combine syntax and semantics. As a first step towards assessing the compatibility of CxG with the syntactic and semantic knowledge demonstrated by state-of-the-art pretrained language models (PLMs), we present an investigation of their capability to classify and understand one of the most commonly studied constructions, the English comparative correlative (CC). We conduct experiments examining the classification accuracy of a syntactic probe on the one hand and the models’ behaviour in a semantic application task on the other, with BERT, RoBERTa, and DeBERTa as the example PLMs. Our results show that all three investigated PLMs are able to recognise the structure of the CC but fail to use its meaning. While human-like performance of PLMs on many NLP tasks has been alleged, this indicates that PLMs still suffer from substantial shortcomings in central domains of linguistic knowledge.

MCML Authors

Leonie Weissweiler

* Former Member

Valentin Hofmann

Dr.

* Former Member

Abdullatif Köksal

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[60]

L. Weissweiler, V. Hofmann, A. Kantharuban, A. Cai, R. Dutt, A. Hengle, A. Kabra, A. Kulkarni, A. Vijayakumar, H. Yu, H. Schütze, K. Oflazer and D. Mortensen.
Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (iii) investigate syntax or semantics and overlook other capabilities that lie at the heart of human language, like morphology. Here, we close these gaps by conducting the first rigorous analysis of the morphological capabilities of ChatGPT in four typologically varied languages (specifically, English, German, Tamil, and Turkish). We apply a version of Berko’s (1958) wug test to ChatGPT, using novel, uncontaminated datasets for the four examined languages. We find that ChatGPT massively underperforms purpose-built systems, particularly in English. Overall, our results—through the lens of morphology—cast a new light on the linguistic capabilities of ChatGPT, suggesting that claims of human-like language skills are premature and misleading.

MCML Authors

Leonie Weissweiler

* Former Member

Valentin Hofmann

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[59]

L. Weissweiler, V. Hofmann, A. Köksal and H. Schütze.
Explaining pretrained language models' understanding of linguistic structures using construction grammar.
Frontiers in Artificial Intelligence 6 (Oct. 2023). DOI

Abstract

Construction Grammar (CxG) is a paradigm from cognitive linguistics emphasizing the connection between syntax and semantics. Rather than rules that operate on lexical items, it posits constructions as the central building blocks of language, i.e., linguistic units of different granularity that combine syntax and semantics. As a first step toward assessing the compatibility of CxG with the syntactic and semantic knowledge demonstrated by state-of-the-art pretrained language models (PLMs), we present an investigation of their capability to classify and understand one of the most commonly studied constructions, the English comparative correlative (CC). We conduct experiments examining the classification accuracy of a syntactic probe on the one hand and the models’ behavior in a semantic application task on the other, with BERT, RoBERTa, and DeBERTa as the example PLMs. Our results show that all three investigated PLMs, as well as OPT, are able to recognize the structure of the CC but fail to use its meaning. While human-like performance of PLMs on many NLP tasks has been alleged, this indicates that PLMs still suffer from substantial shortcomings in central domains of linguistic knowledge.

MCML Authors

Leonie Weissweiler

* Former Member

Valentin Hofmann

Dr.

* Former Member

Abdullatif Köksal

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[58]

X. Wang, C. Hu, B. Ma, P. Röttger and B. Plank.
Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think.
COLM 2024 - Conference on Language Modeling. Philadelphia, PA, USA, Oct 07-09, 2024. PDF

Abstract

Multiple choice questions (MCQs) are commonly used to evaluate the capabilities of large language models (LLMs). One common way to evaluate the model response is to rank the candidate answers based on the log probability of the first token prediction. An alternative way is to examine the text output. Prior work has shown that first token probabilities lack robustness to changes in MCQ phrasing, and that first token probabilities do not match text answers for instruction-tuned models. Therefore, in this paper, we investigate the robustness of text answers. We show that the text answers are more robust to question perturbations than the first token probabilities, when the first token answers mismatch the text answers. The difference in robustness increases as the mismatch rate becomes greater. As the mismatch reaches over 50%, the text answer is more robust to option order changes than the debiased first token probabilities using state-of-the-art debiasing methods such as PriDe. Our findings provide further evidence for the benefits of text answer evaluation over first token probability evaluation.

MCML Authors

Xinpeng Wang

Bolei Ma

Barbara Plank

Prof. Dr.

Principal Investigator

[57]

L. He, N. Otani, D. R. Mortensen, L. Levin and H. Schütze.
Construction Grammar Provides Unique Insight into Neural Language Models.
GURT 2023 - Georgetown University Round Table on Linguistics. Washington D.C., USA, Mar 09-12, 2023. URL

Abstract

Construction Grammar (CxG) has recently been used as the basis for probing studies that have investigated the performance of large pre-trained language models (PLMs) with respect to the structure and meaning of constructions. In this position paper, we make suggestions for the continuation and augmentation of this line of research. We look at probing methodology that was not designed with CxG in mind, as well as probing methodology that was designed for specific constructions. We analyse selected previous work in detail, and provide our view of the most important challenges and research questions that this promising new field faces.

MCML Authors

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[56]

X. Wang, C. Hu, P. Röttger and B. Plank.
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL

Abstract

Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours: Models should refuse to follow malicious instructions or give harmful advice (e.g. ‘how do I kill someone?’), but they should not refuse safe requests, even if they superficially resemble unsafe ones (e.g. ‘how do I kill a Python process?’). Avoiding such false refusal, as prior work has shown, is challenging even for highly-capable language models. In this paper, we propose a simple and surgical method for mitigating false refusal in language models via single vector ablation. For a given model, we extract a false refusal vector and show that ablating this vector reduces false refusal rate while preserving the model’s safety and general capabilities. We also show that our approach can be used for fine-grained calibration of model safety. Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.

MCML Authors

Xinpeng Wang

Barbara Plank

Prof. Dr.

Principal Investigator

[55]

P. Wicke.
LMs stand their Ground: Investigating the Effect of Embodiment in Figurative Language Interpretation by Language Models.
Findings @ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI

Abstract

Figurative language is a challenge for language models since its interpretation is based on the use of words in a way that deviates from their conventional order and meaning. Yet, humans can easily understand and interpret metaphors, similes or idioms as they can be derived from embodied metaphors. Language is a proxy for embodiment and if a metaphor is conventional and lexicalised, it becomes easier for a system without a body to make sense of embodied concepts. Yet, the intricate relation between embodiment and features such as concreteness or age of acquisition has not been studied in the context of figurative language interpretation concerning language models. Hence, the presented study shows how larger language models perform better at interpreting metaphoric sentences when the action of the metaphorical sentence is more embodied. The analysis rules out multicollinearity with other features (e.g. word length or concreteness) and provides initial evidence that larger language models conceptualise embodied concepts to a degree that facilitates figurative language understanding.

MCML Authors

Philipp Wicke

Dr.

[54]

P. Wicke.
Probing Language Models' Gesture Understanding for Enhanced Human-AI Interaction.
Preprint (Jan. 2024). arXiv

Abstract

The rise of Large Language Models (LLMs) has affected various disciplines that go beyond mere text generation. Going beyond their textual nature, this project proposal aims to investigate the interaction between LLMs and non-verbal communication, specifically focusing on gestures. The proposal sets out a plan to examine the proficiency of LLMs in deciphering both explicit and implicit non-verbal cues within textual prompts and their ability to associate these gestures with various contextual factors. The research proposes to test established psycholinguistic study designs to construct a comprehensive dataset that pairs textual prompts with detailed gesture descriptions, encompassing diverse regional variations, and semantic labels. To assess LLMs’ comprehension of gestures, experiments are planned, evaluating their ability to simulate human behaviour in order to replicate psycholinguistic experiments. These experiments consider cultural dimensions and measure the agreement between LLM-identified gestures and the dataset, shedding light on the models’ contextual interpretation of non-verbal cues (e.g. gestures).

MCML Authors

Philipp Wicke

Dr.

[53]

M. Winkler, V. Juozapaityte, R. van der Goot and B. Plank.
Slot and Intent Detection Resources for Bavarian and Lithuanian: Assessing Translations vs Natural Queries to Digital Assistants.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL

Abstract

Digital assistants perform well in high-resource languages like English, where tasks like slot and intent detection (SID) are well-supported. Many recent SID datasets start including multiple language varieties. However, it is unclear how realistic these translated datasets are. Therefore, we extend one such dataset, namely xSID-0.4, to include two underrepresented languages: Bavarian, a German dialect, and Lithuanian, a Baltic language. Both language variants have limited speaker populations and are often not included in multilingual projects. In addition to translations we provide “natural” queries to digital assistants generated by native speakers. We further include utterances from another dataset for Bavarian to build the richest SID dataset available today for a low-resource dialect without standard orthography. We then set out to evaluate models trained on English in a zero-shot scenario on our target language variants. Our evaluation reveals that translated data can produce overly optimistic scores. However, the error patterns in translated and natural datasets are highly similar. Cross-dataset experiments demonstrate that data collection methods influence performance, with scores lower than those achieved with single-dataset translations. This work contributes to enhancing SID datasets for underrepresented languages, yielding NaLiBaSID, a new evaluation dataset for Bavarian and Lithuanian.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[52]

L. Weissweiler, A. Köksal and H. Schütze.
Hybrid Human-LLM Corpus Construction and LLM Evaluation for Rare Linguistic Phenomena.
Preprint (Mar. 2024). arXiv

Abstract

Argument Structure Constructions (ASCs) are one of the most well-studied construction groups, providing a unique opportunity to demonstrate the usefulness of Construction Grammar (CxG). For example, the caused-motion construction (CMC, “She sneezed the foam off her cappuccino”) demonstrates that constructions must carry meaning, otherwise the fact that “sneeze” in this context causes movement cannot be explained. We form the hypothesis that this remains challenging even for state-of-the-art Large Language Models (LLMs), for which we devise a test based on substituting the verb with a prototypical motion verb. To be able to perform this test at statistically significant scale, in the absence of adequate CxG corpora, we develop a novel pipeline of NLP-assisted collection of linguistically annotated text. We show how dependency parsing and GPT-3.5 can be used to significantly reduce annotation cost and thus enable the annotation of rare phenomena at scale. We then evaluate GPT, Gemini, Llama2 and Mistral models for their understanding of the CMC using the newly collected corpus. We find that all models struggle with understanding the motion component that the CMC adds to a sentence.

MCML Authors

Leonie Weissweiler

* Former Member

Abdullatif Köksal

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[51]

M. Wang, L. Lange, H. Adel, J. Strötgen and H. Schütze.
Better Call SAUL: Fluent and Consistent Language Model Editing with Generation Regularization.
Findings @EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

To ensure large language models contain up-to-date knowledge, they need to be updated regularly. However, model editing is challenging as it might also affect knowledge that is unrelated to the new data. State-of-the-art methods identify parameters associated with specific knowledge and then modify them via direct weight updates. However, these locate-and-edit methods suffer from heavy computational overhead and lack theoretical validation. In contrast, directly fine-tuning the model on requested edits affects the model’s behavior on unrelated knowledge, and significantly damages the model’s generation fluency and consistency. To address these challenges, we propose SAUL, a streamlined model editing method that uses sentence concatenation with augmented random facts for generation regularization. Evaluations on three model editing benchmarks show that SAUL is a practical and reliable solution for model editing outperforming state-of-the-art methods while maintaining generation quality and reducing computational overhead.

MCML Authors

Mingyang Wang

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[50]

L. Weber-Genzel, R. Litschko, E. Artemova and B. Plank.
Donkii: Characterizing and Detecting Errors in Instruction-Tuning Datasets.
LAW @EACL 2024 - 18th Linguistic Annotation Workshop at the 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL

Abstract

Instruction tuning has become an integral part of training pipelines for Large Language Models (LLMs) and has been shown to yield strong performance gains. In an orthogonal line of research, Annotation Error Detection (AED) has emerged as a tool for detecting quality problems in gold standard labels. So far, however, the application of AED methods has been limited to classification tasks. It is an open question how well AED methods generalize to language generation settings, which are becoming more widespread via LLMs. In this paper, we present a first and novel benchmark for AED on instruction tuning data: Donkii. It comprises three instruction-tuning datasets enriched with error annotations by experts and semi-automatic methods. We also provide a novel taxonomy of error types for instruction-tuning data. We find that all three datasets contain clear errors, which sometimes propagate directly into instruction-tuned LLMs. We propose four AED baselines for the generative setting and evaluate them extensively on the newly introduced dataset. Our results show that the choice of the right AED method and model size is indeed crucial and derive practical recommendations for how to use AED methods to clean instruction-tuning data.

MCML Authors

Leon Weber-Genzel

Dr.

* Former Member

Robert Litschko

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

[49]

M. Wang, L. Lange, H. Adel, Y. Ma, J. Strötgen and H. Schütze.
Language Mixing in Reasoning Language Models: Patterns, Impact, and Internal Causes.
EMNLP 2025 - Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025. To be published. Preprint available. arXiv

Abstract

Reasoning language models (RLMs) excel at complex tasks by leveraging a chain-of-thought process to generate structured intermediate steps. However, language mixing, i.e., reasoning steps containing tokens from languages other than the prompt, has been observed in their outputs and shown to affect performance, though its impact remains debated. We present the first systematic study of language mixing in RLMs, examining its patterns, impact, and internal causes across 15 languages, 7 task difficulty levels, and 18 subject areas, and show how all three factors influence language mixing. Moreover, we demonstrate that the choice of reasoning language significantly affects performance: forcing models to reason in Latin or Han scripts via constrained decoding notably improves accuracy. Finally, we show that the script composition of reasoning traces closely aligns with that of the model’s internal representations, indicating that language mixing reflects latent processing preferences in RLMs. Our findings provide actionable insights for optimizing multilingual reasoning and open new directions for controlling reasoning languages to build more interpretable and adaptable RLMs.

MCML Authors

Mingyang Wang

→ Group Volker Tresp
Database Systems, Data Mining and AI

Yunpu Ma

Dr.

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[48]

X. Wang, B. Ma, C. Hu, L. Weber-Genzel, P. Röttger, F. Kreuter, D. Hovy and B. Plank.
My Answer is C: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models.
Findings @ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging. One common evaluation approach uses multiple-choice questions to limit the response space. The model is then evaluated by ranking the candidate answers by the log probability of the first token prediction. However, first tokens may not consistently reflect the final response output, due to models’ diverse response styles such as starting with ‘Sure’ or refusing to answer. Consequently, first-token evaluation is not indicative of model behaviour when interacting with users. But by how much? We evaluate how aligned first-token evaluation is with the text output along several dimensions, namely final option choice, refusal rate, choice distribution and robustness under prompt perturbation. Our results show that the two approaches are severely misaligned on all dimensions, reaching mismatch rates over 60%. Models heavily fine-tuned on conversational or safety data are especially impacted. Crucially, models remain misaligned even when we increasingly constrain prompts, i.e., force them to start with an option letter or example template. Our findings i) underscore the importance of inspecting the text output as well and ii) caution against relying solely on first-token evaluation.

MCML Authors

Xinpeng Wang

Bolei Ma

Frauke Kreuter

Prof. Dr.

Principal Investigator

Barbara Plank

Prof. Dr.

Principal Investigator

[47]

C. Wu, B. Ma, Y. Liu, Z. Zhang, N. Deng, Y. Li, B. Chen, Y. Zhang, Y. Xue and B. Plank.
M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis.
EMNLP 2025 - Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025. To be published. Preprint available. arXiv

Abstract

Aspect-based sentiment analysis (ABSA) is a crucial task in information extraction and sentiment analysis, aiming to identify aspects with associated sentiment elements in text. However, existing ABSA datasets are predominantly English-centric, limiting the scope for multilingual evaluation and research. To bridge this gap, we present M-ABSA, a comprehensive dataset spanning 7 domains and 21 languages, making it the most extensive multilingual parallel dataset for ABSA to date. Our primary focus is on triplet extraction, which involves identifying aspect terms, aspect categories, and sentiment polarities. The dataset is constructed through an automatic translation process with human review to ensure quality. We perform extensive experiments using various baselines to assess performance and compatibility on M-ABSA. Our empirical findings highlight that the dataset enables diverse evaluation tasks, such as multilingual and multi-domain transfer learning, and large language model evaluation, underscoring its inclusivity and its potential to drive advancements in multilingual ABSA research.

MCML Authors

Bolei Ma

Yihong Liu

→ Group Gjergji Kasneci
Responsible Data Science

Zheyu Zhang

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[46]

X. Wang and B. Plank.
ACTOR: Active Learning with Annotator-specific Classification Heads to Embrace Human Label Variation.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

Label aggregation such as majority voting is commonly used to resolve annotator disagreement in dataset creation. However, this may disregard minority values and opinions. Recent studies indicate that learning from individual annotations outperforms learning from aggregated labels, though they require a considerable amount of annotation. Active learning, as an annotation cost-saving strategy, has not been fully explored in the context of learning from disagreement. We show that in the active learning setting, a multi-head model performs significantly better than a single-head model in terms of uncertainty estimation. By designing and evaluating acquisition functions with annotator-specific heads on two datasets, we show that group-level entropy works generally well on both datasets. Importantly, it achieves performance in terms of both prediction and uncertainty estimation comparable to full-scale training from disagreement, while saving 70% of the annotation budget.

MCML Authors

Xinpeng Wang

Barbara Plank

Prof. Dr.

Principal Investigator

[45]

L. Weber and B. Plank.
ActiveAED: A Human in the Loop Improves Annotation Error Detection.
Findings @ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI

Abstract

Manually annotated datasets are crucial for training and evaluating Natural Language Processing models. However, recent work has discovered that even widely-used benchmark datasets contain a substantial number of erroneous annotations. This problem has been addressed with Annotation Error Detection (AED) models, which can flag such errors for human re-annotation. However, even though many of these AED methods assume a final curation step in which a human annotator decides whether the annotation is erroneous, they have been developed as static models without any human-in-the-loop component. In this work, we propose ActiveAED, an AED method that can detect errors more accurately by repeatedly querying a human for error corrections in its prediction loop. We evaluate ActiveAED on eight datasets spanning five different tasks and find that it leads to improvements over the state of the art on seven of them, with gains of up to six percentage points in average precision.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[44]

L. Weber-Genzel, S. Peng, M.-C. De Marneffe and B. Plank.
VariErr NLI: Separating Annotation Error from Human Label Variation.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons. These two issues are prevalent in NLP benchmarks, yet existing research has studied them in isolation. To the best of our knowledge, there exists no prior work that focuses on teasing apart error from signal, especially in cases where signal is beyond black-and-white. To fill this gap, we introduce a systematic methodology and a new dataset, VariErr (variation versus error), focusing on the NLI task in English. We propose a 2-round annotation procedure with annotators explaining each label and subsequently judging the validity of label-explanation pairs. VariErr contains 7,732 validity judgments on 1,933 explanations for 500 re-annotated MNLI items. We assess the effectiveness of various automatic error detection (AED) methods and GPTs in uncovering errors versus human label variation. We find that state-of-the-art AED methods significantly underperform GPTs and humans. While GPT-4 is the best system, it still falls short of human performance. Our methodology is applicable beyond NLI, offering fertile ground for future research on error versus plausible variation, which in turn can yield better and more trustworthy NLP systems.

MCML Authors

Leon Weber-Genzel

Dr.

* Former Member

Siyao Peng

Dr.

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

[43]

M. Wang, A. Stoll, L. Lange, H. Adel, H. Schütze and J. Strötgen.
Bring Your Own Knowledge: A Survey of Methods for LLM Knowledge Expansion.
Preprint (Feb. 2025). arXiv

Abstract

Adapting large language models (LLMs) to new and diverse knowledge is essential for their lasting effectiveness in real-world applications. This survey provides an overview of state-of-the-art methods for expanding the knowledge of LLMs, focusing on integrating various knowledge types, including factual information, domain expertise, language proficiency, and user preferences. We explore techniques, such as continual learning, model editing, and retrieval-based explicit adaptation, while discussing challenges like knowledge consistency and scalability. Designed as a guide for researchers and practitioners, this survey sheds light on opportunities for advancing LLMs as adaptable and robust knowledge systems.

MCML Authors

Mingyang Wang

Hinrich Schütze

Prof. Dr.

Principal Investigator

[42]

P. Wicke, L. K. Senel, S. Zhang, L. Figueredo, A. Naceri, S. Haddadin and H. Schütze.
Towards Language-Based Modulation of Assistive Robots through Multimodal Models.
Geriatronics Summit 2023 - 2nd Geriatronics Summit. Garmisch-Partenkirchen, Germany, Jul 02-03, 2023. arXiv

Abstract

In the field of Geriatronics, enabling effective and transparent communication between humans and robots is crucial for enhancing the acceptance and performance of assistive robots. Our early-stage research project investigates the potential of language-based modulation as a means to improve human-robot interaction. We propose to explore real-time modulation during task execution, leveraging language cues, visual references, and multimodal inputs. By developing transparent and interpretable methods, we aim to enable robots to adapt and respond to language commands, enhancing their usability and flexibility. Through the exchange of insights and knowledge at the workshop, we seek to gather valuable feedback to advance our research and contribute to the development of interactive robotic systems for Geriatronics and beyond.

MCML Authors

Philipp Wicke

Dr.

Shengqiang Zhang

Hinrich Schütze

Prof. Dr.

Principal Investigator

[41]

P. Wicke and L. Wachowiak.
Exploring Spatial Schemas in Large Language Models.
Findings @ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI GitHub

Abstract

Despite the ubiquity of large language models (LLMs) in AI research, the question of embodiment in LLMs remains underexplored, distinguishing them from embodied systems in robotics where sensory perception directly informs physical action. Our investigation navigates the intriguing terrain of whether LLMs, despite their non-embodied nature, effectively capture implicit human intuitions about fundamental, spatial building blocks of language. We employ insights from spatial cognitive foundations developed through early sensorimotor experiences, guiding our exploration through the reproduction of three psycholinguistic experiments. Surprisingly, correlations between model outputs and human responses emerge, revealing adaptability without a tangible connection to embodied experiences. Notable distinctions include polarized language model responses and reduced correlations in vision language models. This research contributes to a nuanced understanding of the interplay between language, spatial experiences, and the computations made by large language models.

MCML Authors

Philipp Wicke

Dr.

[40]

Q. Wang, M. Wang, N. Feldhus, S. Ostermann, Y. Cao, H. Schütze, S. Möller and V. Schmitt.
Through a Compressed Lens: Investigating the Impact of Quantization on LLM Explainability and Interpretability.
Preprint (May 2025). arXiv

Abstract

Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). While prior research has extensively investigated the degradation of various LLM capabilities due to quantization, its effects on model explainability and interpretability, which are crucial for understanding decision-making processes, remain unexplored. To address this gap, we conduct comprehensive experiments using three common quantization techniques at distinct bit widths, in conjunction with two explainability methods, counterfactual examples and natural language explanations, as well as two interpretability approaches, knowledge memorization analysis and latent multi-hop reasoning analysis. We complement our analysis with a thorough user study, evaluating selected explainability methods. Our findings reveal that, depending on the configuration, quantization can significantly impact model explainability and interpretability. Notably, the direction of this effect is not consistent, as it strongly depends on (1) the quantization method, (2) the explainability or interpretability approach, and (3) the evaluation protocol. In some settings, human evaluation shows that quantization degrades explainability, while in others, it even leads to improvements. Our work serves as a cautionary tale, demonstrating that quantization can unpredictably affect model transparency. This insight has important implications for deploying LLMs in applications where transparency is a critical requirement.

MCML Authors

Mingyang Wang

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[39]

X. Wang, M. Wang, Y. Liu, H. Schütze and B. Plank.
Refusal Direction is Universal Across Safety-Aligned Languages.
Preprint (May 2025). arXiv

Abstract

Refusal mechanisms in large language models (LLMs) are essential for ensuring safety. Recent research has revealed that refusal behavior can be mediated by a single direction in activation space, enabling targeted interventions to bypass refusals. While this is primarily demonstrated in an English-centric context, appropriate refusal behavior is important for any language, but poorly understood. In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness, without any additional fine-tuning. Even more remarkably, refusal directions derived from any safety-aligned language transfer seamlessly to others. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks. These findings provide actionable insights for building more robust multilingual safety defenses and pave the way for a deeper mechanistic understanding of cross-lingual vulnerabilities in LLMs.

MCML Authors

Xinpeng Wang

Mingyang Wang

Yihong Liu

Hinrich Schütze

Prof. Dr.

Principal Investigator

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[38]

X. Wang, L. Weissweiler, H. Schütze and B. Plank.
How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives.
EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics. Dubrovnik, Croatia, May 02-06, 2023. DOI

Abstract

Recently, various intermediate layer distillation (ILD) objectives have been shown to improve compression of BERT models via Knowledge Distillation (KD). However, a comprehensive evaluation of the objectives in both task-specific and task-agnostic settings is lacking. To the best of our knowledge, this is the first work comprehensively evaluating distillation objectives in both settings. We show that attention transfer gives the best performance overall. We also study the impact of layer choice when initializing the student from the teacher layers, finding a significant impact on the performance in task-specific distillation. For vanilla KD and hidden states transfer, initialisation with lower layers of the teacher gives a considerable improvement over higher layers, especially on the task of QNLI (up to an absolute percentage change of 17.8 in accuracy). Attention transfer behaves consistently under different initialisation settings. We release our code as an efficient transformer-based model distillation framework for further studies.

MCML Authors

Xinpeng Wang

Leonie Weissweiler

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

Barbara Plank

Prof. Dr.

Principal Investigator

[37]

Z. Wang, X. Xu, Y. Liu, Y. Zhang, P. Lin, S. Feng, X. Yang, D. Wang and H. Schütze.
Why Do More Experts Fail? A Theoretical Analysis of Model Merging.
Preprint (May 2025). arXiv GitHub

Abstract

Model merging dramatically reduces storage and computational resources by combining multiple expert models into a single multi-task model. Although recent model merging methods have shown promising results, they struggle to maintain performance gains as the number of merged models increases. In this paper, we investigate the key obstacles that limit the scalability of model merging when integrating a large number of expert models. First, we prove that there is an upper bound on model merging. Further theoretical analysis reveals that the limited effective parameter space imposes a strict constraint on the number of models that can be successfully merged. Gaussian Width shows that the marginal benefit of merging additional models diminishes according to a strictly concave function. This implies that the effective parameter space becomes rapidly saturated as the number of merged models increases. Furthermore, using Approximate Kinematics Theory, we prove the existence of a unique optimal threshold beyond which adding more models does not yield significant performance improvements. At the same time, we introduce a straightforward Reparameterized Heavy-Tailed method (RHT) to extend the coverage of the merged model, thereby enhancing its performance. Empirical results on 12 benchmarks, including both knowledge-intensive and general-purpose tasks, validate our theoretical analysis. We believe that these results spark further research beyond the current scope of model merging.

MCML Authors

Peiqin Lin

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[36]

J. Wang, L. Zuo, S. Peng and B. Plank.
MultiClimate: Multimodal Stance Detection on Climate Change Videos.
NLP4PI @EMNLP 2024 - 3rd Workshop on NLP for Positive Impact at the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub

Abstract

Climate change (CC) has attracted increasing attention in NLP in recent years. However, detecting the stance on CC in multimodal data is understudied and remains challenging due to a lack of reliable datasets. To improve the understanding of public opinions and communication strategies, this paper presents MultiClimate, the first open-source manually-annotated stance detection dataset with 100 CC-related YouTube videos and 4,209 frame-transcript pairs. We deploy state-of-the-art vision and language models, as well as multimodal models for MultiClimate stance detection. Results show that text-only BERT significantly outperforms image-only ResNet50 and ViT. Combining both modalities achieves state-of-the-art, 0.747/0.749 in accuracy/F1. Our 100M-sized fusion models also beat CLIP and BLIP, as well as the much larger 9B-sized multimodal IDEFICS and text-only Llama3 and Gemma2, indicating that multimodal stance detection remains challenging for large language models.

MCML Authors

Siyao Peng

Dr.

Barbara Plank

Prof. Dr.

Principal Investigator

[35]

O. Xhelili, Y. Liu and H. Schütze.
Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment.
Findings @EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub

Abstract

Multilingual pre-trained models (mPLMs) have shown impressive performance on cross-lingual transfer tasks. However, the transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language, even though the two languages may be related or share parts of their vocabularies. Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method aiming to improve the cross-lingual alignment between languages using diverse scripts. We select two areal language groups, Mediterranean-Amharic-Farsi and South+East Asian Languages, wherein the languages are mutually influenced but use different scripts. We apply our method to these language groups and conduct extensive experiments on a spectrum of downstream tasks. The results show that after PPA, models consistently outperform the original model (up to 50% for some tasks) in English-centric transfer. In addition, when we use languages other than English as sources in transfer, our method obtains even larger improvements.

MCML Authors

Yihong Liu

Hinrich Schütze

Prof. Dr.

Principal Investigator

[34]

S. Xu, T. Y. S. S. Santosh, Y. Elazar, Q. Vogel, B. Plank and M. Grabmair.
Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases.
Preprint (Feb. 2025). arXiv

Abstract

The increased adoption of Large Language Models (LLMs) and their potential to shape public opinion have sparked interest in assessing these models’ political leanings. Building on previous research that compared LLMs and human opinions and observed political bias in system responses, we take a step further to investigate the underlying causes of such biases by empirically examining how the values and biases embedded in training corpora shape model outputs. Specifically, we propose a method to quantitatively evaluate political leanings embedded in the large pretraining corpora. Subsequently, we investigate with which the LLMs’ political leanings are more aligned: their pretraining corpora or the surveyed human opinions. As a case study, we focus on probing the political leanings of LLMs in 32 U.S. Supreme Court cases, addressing contentious topics such as abortion and voting rights. Our findings reveal that LLMs strongly reflect the political leanings in their training data, and no strong correlation is observed with their alignment to human opinions as expressed in surveys. These results underscore the importance of responsible curation of training data and the need for robust evaluation metrics to ensure LLMs’ alignment with human-centered values.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[33]

S. Xu, T. Y. S. S. Santosh, O. Ichim, I. Risini, B. Plank and M. Grabmair.
From Dissonance to Insights: Dissecting Disagreements in Rationale Construction for Case Outcome Classification.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

In legal NLP, Case Outcome Classification (COC) must not only be accurate but also trustworthy and explainable. Existing work in explainable COC has been limited to annotations by a single expert. However, it is well-known that lawyers may disagree in their assessment of case facts. We hence collect a novel dataset RaVE: Rationale Variation in ECHR, which is obtained from two experts in the domain of international human rights law, for whom we observe weak agreement. We study their disagreements and build a two-level task-independent taxonomy, supplemented with COC-specific subcategories. To our knowledge, this is the first work in legal NLP that focuses on human label variation. We quantitatively assess different taxonomy categories and find that disagreements mainly stem from underspecification of the legal context, which poses challenges given the typically limited granularity and noise in COC metadata. We further assess the explainability of state-of-the-art COC models on RaVE and observe limited agreement between models and experts. Overall, our case study reveals hitherto underappreciated complexities in creating benchmark datasets in legal NLP that revolve around identifying aspects of a case’s facts supposedly relevant for its outcome.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[32]

S. Xu, T. Y. S. S. Santosh, O. Ichim, B. Plank and M. Grabmair.
Through the Lens of Split Vote: Exploring Disagreement, Difficulty and Calibration in Legal Case Outcome Classification.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

In legal decisions, split votes (SV) occur when judges cannot reach a unanimous decision, posing a difficulty for lawyers who must navigate diverse legal arguments and opinions. In high-stakes domains, understanding the alignment of perceived difficulty between humans and AI systems is crucial to build trust. However, existing NLP calibration methods focus on a classifier’s awareness of predictive performance, measured against the human majority class, overlooking inherent human label variation (HLV). This paper explores split votes as naturally observable human disagreement and value pluralism. We collect judges’ vote distributions from the European Court of Human Rights (ECHR), and present SV-ECHR, a case outcome classification (COC) dataset with SV information. We build a taxonomy of disagreement with SV-specific subcategories. We further assess the alignment of perceived difficulty between models and humans, as well as confidence- and human-calibration of COC models. We observe limited alignment with the judge vote distribution. To our knowledge, this is the first systematic exploration of calibration to human judgements in legal NLP. Our study underscores the necessity for further research on measuring and enhancing model calibration considering HLV in legal decision tasks.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[31]

A. Yüksel, A. Köksal, L. K. Senel, A. Korhonen and H. Schütze.
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish.
SIGTURK @ACL 2024 - 1st Workshop on Natural Language Processing for Turkic Languages at the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. Invited talk. arXiv GitHub

Abstract

Multiple choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models (LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially introduces culturally biased questions, especially in social sciences. We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU, to evaluate LLMs’ understanding of the Turkish language. TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula. These questions are written by curriculum experts, suitable for the high-school curricula in Turkey, covering subjects ranging from natural sciences and math questions to more culturally representative topics such as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation, including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question difficulty analysis along with model performance. We provide an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the Turkish language.

MCML Authors

Abdullatif Köksal

* Former Member

Lütfi Kerem Senel

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[30]

A. Yüksel, A. Köksal, L. K. Senel, A. Korhonen and H. Schütze.
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish.
Findings @EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub

MCML Authors

Abdullatif Köksal

* Former Member

Lütfi Kerem Senel

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[29]

H. Yang, J. Lan, Y. Liu, H. Schütze and T. Seidl.
Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach.
Preprint (Aug. 2025). arXiv

Abstract

Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces the text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs, while extending compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, WMT24 dataset and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.

MCML Authors

Yihong Liu

Hinrich Schütze

Prof. Dr.

Principal Investigator

Thomas Seidl

Prof. Dr.

Director

[28]

H. Ye, Y. Liu, C. Ma and H. Schütze.
MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer.
Insights from Negative Results @NAACL 2024 - 5th Workshop on Insights from Negative Results in NLP at the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Mexico City, Mexico, Jun 16-21, 2024. URL

Abstract

Transformer-based pre-trained language models (PLMs) have achieved remarkable performance in various natural language processing (NLP) tasks. However, pre-training such models can take considerable resources that are almost only available to high-resource languages. On the contrary, static word embeddings are easier to train in terms of computing resources and the amount of data required. In this paper, we introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer), a novel and challenging task that is especially relevant to low-resource languages for which static word embeddings are available. To tackle the task, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language. In this way, we can train the PLM on source-language training data and perform zero-shot transfer to the target language by simply swapping the embedding layer. However, through extensive experiments on two classification datasets, we show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines. In this paper, we attempt to explain this negative result and provide several thoughts on possible improvement.

MCML Authors

Haotian Ye

Yihong Liu

Chunlan Ma

Hinrich Schütze

Prof. Dr.

Principal Investigator

[27]

H. Ye, Y. Liu and H. Schütze.
A study of conceptual language similarity: comparison and evaluation.
Preprint (May. 2023). arXiv

Abstract

An interesting line of research in natural language processing (NLP) aims to incorporate linguistic typology to bridge linguistic diversity and assist the research of low-resource languages. While most works construct linguistic similarity measures based on lexical or typological features, such as word order and verbal inflection, recent work has introduced a novel approach to defining language similarity based on how they represent basic concepts, which is complementary to existing similarity measures. In this work, we study the conceptual similarity in detail and evaluate it extensively on a binary classification task.

MCML Authors

Haotian Ye

Yihong Liu

Hinrich Schütze

Prof. Dr.

Principal Investigator

[26]

S. Yuan, A. Maronikolakis and H. Schütze.
Separating Hate Speech and Offensive Language Classes via Adversarial Debiasing.
WOAH 2022 - 6th Workshop on Online Abuse and Harms. Seattle, WA, USA, Jul 14, 2022. DOI

Abstract

Research to tackle hate speech plaguing online media has made strides in providing solutions, analyzing bias and curating data. A challenging problem is ambiguity between hate speech and offensive language, causing low performance both overall and specifically for the hate speech class. It can be argued that misclassifying actual hate speech content as merely offensive can lead to further harm against targeted groups. In our work, we mitigate this potentially harmful phenomenon by proposing an adversarial debiasing method to separate the two classes. We show that our method works for English, Arabic, German and Hindi, plus in a multilingual setting, improving performance over baselines.

MCML Authors

Antonis Maronikolakis

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[25]

S. Yuan, E. Nie, M. Färber, H. Schmid and H. Schütze.
GNNAVI: Navigating the Information Flow in Large Language Models by Graph Neural Network.
Findings @ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Large Language Models (LLMs) exhibit strong In-Context Learning (ICL) capabilities when prompts with demonstrations are applied to them. However, fine-tuning still remains crucial to further enhance their adaptability. Prompt-based fine-tuning proves to be an effective fine-tuning method in low-data scenarios, but high demands on computing resources limit its practicality. We address this issue by introducing a prompt-based parameter-efficient fine-tuning (PEFT) approach. GNNavi leverages insights into ICL’s information flow dynamics, which indicate that label words act in prompts as anchors for information propagation. GNNavi employs a Graph Neural Network (GNN) layer to precisely guide the aggregation and distribution of information flow during the processing of prompts by hardwiring the desired information flow into the GNN. Our experiments on text classification tasks with GPT-2 and Llama2 show that GNNavi surpasses standard prompt-based fine-tuning methods in few-shot settings by updating just 0.2% to 0.5% of parameters. We compare GNNavi with prevalent PEFT approaches, such as prefix tuning, LoRA and Adapter in terms of performance and efficiency. Our analysis reveals that GNNavi enhances information flow and ensures a clear aggregation process.

MCML Authors

Ercong Nie

Hinrich Schütze

Prof. Dr.

Principal Investigator

[24]

S. Yuan, E. Nie, L. Kouba, A. Y. Kangen, H. Schmid, H. Schütze and M. Färber.
LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification.
Preprint (Jun. 2025). arXiv

Abstract

Detoxification, the task of rewriting harmful language into non-toxic text, has become increasingly important amid the growing prevalence of toxic content online. However, high-quality parallel datasets for detoxification, especially for hate speech, remain scarce due to the cost and sensitivity of human annotation. In this paper, we propose a novel LLM-in-the-loop pipeline leveraging GPT-4o-mini for automated detoxification. We first replicate the ParaDetox pipeline by replacing human annotators with an LLM and show that the LLM performs comparably to human annotation. Building on this, we construct ParaDeHate, a large-scale parallel dataset specifically for hate speech detoxification. We release ParaDeHate as a benchmark of over 8K hate/non-hate text pairs and evaluate a wide range of baseline methods. Experimental results show that models such as BART, fine-tuned on ParaDeHate, achieve better performance in style accuracy, content preservation, and fluency, demonstrating the effectiveness of LLM-generated detoxification text as a scalable alternative to human annotation.

MCML Authors

Ercong Nie

Hinrich Schütze

Prof. Dr.

Principal Investigator

[23]

S. Yuan, E. Nie, B. Ma and M. Färber.
Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers.
IJCNN 2025 - International Joint Conference on Neural Networks. Rome, Italy, Jun 30-Jul 05, 2025. To be published. Preprint available. arXiv

Abstract

Large Language Models (LLMs) possess outstanding capabilities in addressing various natural language processing (NLP) tasks. However, the sheer size of these models poses challenges in terms of storage, training and inference due to the inclusion of billions of parameters through layer stacking. While traditional approaches such as model pruning or distillation offer ways for reducing model size, they often come at the expense of performance retention. In our investigation, we systematically explore the approach of reducing the number of layers in LLMs. Surprisingly, we observe that even with fewer layers, LLMs maintain similar or better performance levels, particularly in prompt-based fine-tuning for text classification tasks. Remarkably, in certain cases, models with a single layer outperform their fully layered counterparts. These findings offer valuable insights for future work aimed at mitigating the size constraints of LLMs while preserving their performance, thereby opening avenues for significantly more efficient use of LLMs.

MCML Authors

Ercong Nie

Bolei Ma

[22]

S. Yuan, E. Nie, M. Tawfelis, H. Schmid, H. Schütze and M. Färber.
Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models.
Preprint (Jun. 2025). arXiv

Abstract

Hate speech detection is a socially sensitive and inherently subjective task, with judgments often varying based on personal traits. While prior work has examined how socio-demographic factors influence annotation, the impact of personality traits on Large Language Models (LLMs) remains largely unexplored. In this paper, we present the first comprehensive study on the role of persona prompts in hate speech classification, focusing on MBTI-based traits. A human annotation survey confirms that MBTI dimensions significantly affect labeling behavior. Extending this to LLMs, we prompt four open-source models with MBTI personas and evaluate their outputs across three hate speech datasets. Our analysis uncovers substantial persona-driven variation, including inconsistencies with ground truth, inter-persona disagreement, and logit-level biases. These findings highlight the need to carefully define persona prompts in LLM-based annotation workflows, with implications for fairness and alignment with human values.

MCML Authors

Ercong Nie

Hinrich Schütze

Prof. Dr.

Principal Investigator

[21]

H. Ye, A. Wisiorek, A. Maronikolakis, Ö. Alaçam and H. Schütze.
A Federated Approach to Few-Shot Hate Speech Detection for Marginalized Communities.
MRL @EMNLP 2025 - 5th Multilingual Representation Learning Workshop at the Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025. To be published. Preprint available. arXiv

MCML Authors

Haotian Ye

Axel Wisiorek

Dr.

Antonis Maronikolakis

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[20]

J. Yu, Y. Zhang, B. Wang, P. Lin, Y. Liu and S. Feng.
SSMLoRA: Enhancing Low-Rank Adaptation with State Space Model.
Preprint (Feb. 2025). arXiv GitHub

Abstract

Fine-tuning is a key approach for adapting language models to specific downstream tasks, but updating all model parameters becomes impractical as model sizes increase. Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), address this challenge by introducing additional adaptation parameters into pre-trained weight matrices. However, LoRA’s performance varies across different insertion points within the model, highlighting potential parameter inefficiency due to unnecessary insertions. To this end, we propose SSMLoRA (State Space Model Low-Rank Adaptation), an extension of LoRA that incorporates a State Space Model (SSM) to interconnect low-rank matrices. SSMLoRA ensures that performance is maintained even with sparser insertions. SSMLoRA allows the model to not only map inputs to a low-rank space for better feature extraction but also leverage the computations from the previous low-rank space. Our method achieves comparable performance to LoRA on the General Language Understanding Evaluation (GLUE) benchmark while using only half the parameters. Additionally, due to its structure, SSMLoRA shows promise in handling tasks with longer input sequences.

MCML Authors

Peiqin Lin

→ Group Michael Hedderich
AI and Computational Linguistics

[19]

R. Zhao, B. Chen, B. Plank and M. A. Hedderich.
MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs.
EMNLP 2025 - Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025. To be published. Preprint available. arXiv

Abstract

Large language models (LLMs) are used globally across many languages, but their English-centric pretraining raises concerns about cross-lingual disparities for cultural awareness, often resulting in biased outputs. However, comprehensive multilingual evaluation remains challenging due to limited benchmarks and questionable translation quality. To better assess these disparities, we introduce MAKIEval, an automatic multilingual framework for evaluating cultural awareness in LLMs across languages, regions, and topics. MAKIEval evaluates open-ended text generation, capturing how models express culturally grounded knowledge in natural language. Leveraging Wikidata’s multilingual structure as a cross-lingual anchor, it automatically identifies cultural entities in model outputs and links them to structured knowledge, enabling scalable, language-agnostic evaluation without manual annotation or translation. We then introduce four metrics that capture complementary dimensions of cultural awareness: granularity, diversity, cultural specificity, and consensus across languages. We assess 7 LLMs developed from different parts of the world, encompassing both open-source and proprietary systems, across 13 languages, 19 countries and regions, and 6 culturally salient topics (e.g., food, clothing). Notably, we find that models tend to exhibit stronger cultural awareness in English, suggesting that English prompts more effectively activate culturally grounded knowledge. We publicly release our code and data.

MCML Authors

Raoyuan Zhao

Beiduo Chen

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

[18]

M. Zhang, R. van der Goot, M.-Y. Kan and B. Plank.
NNOSE: Nearest Neighbor Occupational Skill Extraction.
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL

Abstract

The labor market is changing rapidly, prompting increased interest in the automatic extraction of occupational skills from text. With the advent of English benchmark job description datasets, there is a need for systems that handle their diversity well. We tackle the complexity in occupational skill dataset tasks: combining and leveraging multiple datasets for skill extraction, identifying rarely observed skills within a dataset, and overcoming the scarcity of skills across datasets. In particular, we investigate the retrieval-augmentation of language models, employing an external datastore for retrieving similar skills in a dataset-unifying manner. Our proposed method, Nearest Neighbor Occupational Skill Extraction (NNOSE), effectively leverages multiple datasets by retrieving neighboring skills from other datasets in the datastore. This improves skill extraction without additional fine-tuning. Crucially, we observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30% span-F1 in cross-dataset settings.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[17]

M. Zhang, R. van der Goot and B. Plank.
Entity Linking in the Job Market Domain.
Findings @EACL 2024 - Findings of the 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL

Abstract

In Natural Language Processing, entity linking (EL) has centered around Wikipedia, yet remains underexplored for the job market domain. Disambiguating skill mentions can help us get insight into the current labor market demands. In this work, we are the first to explore EL in this domain, specifically targeting the linkage of occupational skills to the ESCO taxonomy (le Vrang et al., 2014). Previous efforts linked coarse-grained (full) sentences to a corresponding ESCO skill. In this work, we link more fine-grained span-level mentions of skills. We tune two high-performing neural EL models, a bi-encoder (Wu et al., 2020) and an autoregressive model (Cao et al., 2021), on a synthetically generated mention–skill pair dataset and evaluate them on a human-annotated skill-linking benchmark. Our findings reveal that both models are capable of linking implicit mentions of skills to their correct taxonomy counterparts. Empirically, BLINK outperforms GENRE in strict evaluation, but GENRE performs better in loose evaluation (accuracy@k).

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

[16]

M. Zhang, V. Gautam, M. Wang, J. Alabi, X. Shen, D. Klakow and M. Mosbach.
The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis.
Findings @ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

In-context learning is a popular inference strategy where large language models solve a task using only a few labeled demonstrations without needing any parameter updates. Although there have been extensive studies on English in-context learning, multilingual in-context learning remains under-explored, and we lack an in-depth understanding of the role of demonstrations in this context. To address this gap, we conduct a multidimensional analysis of multilingual in-context learning, experimenting with 5 models from different model families, 9 datasets covering classification and generation tasks, and 56 typologically diverse languages. Our results reveal that the effectiveness of demonstrations varies significantly across models, tasks, and languages. We also find that strong instruction-following models including Llama 2-Chat, GPT-3.5, and GPT-4 are largely insensitive to the quality of demonstrations. Instead, a carefully crafted template often eliminates the benefits of demonstrations for some tasks and languages altogether. These findings show that the importance of demonstrations might be overestimated. Our work highlights the need for granular evaluation across multiple axes towards a better understanding of in-context learning.

MCML Authors

Mingyang Wang

→ Group Alexander Fraser
Data Analytics & Statistics

[15]

Y. Zhang, V. Hangya and A. Fraser.
A Study of the Class Imbalance Problem in Abusive Language Detection.
WOAH @NAACL 2024 - 8th Workshop on Online Abuse and Harms at the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Mexico City, Mexico, Jun 16-21, 2024. DOI

Abstract

Abusive language detection has drawn increasing interest in recent years. However, a less systematically explored obstacle is label imbalance, i.e., the amount of abusive data is much lower than non-abusive data, leading to performance issues. The aim of this work is to conduct a comprehensive comparative study of popular methods for addressing the class imbalance issue. We explore 10 well-known approaches on 8 datasets with distinct characteristics: binary or multi-class, moderately or largely imbalanced, focusing on various types of abuse, etc. Additionally, we propose two novel methods specialized for abuse detection: AbusiveLexiconAug and ExternalDataAug, which enrich the training data using abusive lexicons and external abusive datasets, respectively. We conclude that: 1) our AbusiveLexiconAug approach, random oversampling, and focal loss are the most versatile methods on various datasets; 2) focal loss tends to yield peak model performance; 3) oversampling and focal loss provide promising results for binary datasets and small multi-class sets, while undersampling and weighted cross-entropy are more suitable for large multi-class sets; 4) most methods are sensitive to hyperparameters, yet our suggested choice of hyperparameters provides a good starting point.

MCML Authors

Viktor Hangya

Dr.

* Former Member

Alexander Fraser

Prof. Dr.

Principal Investigator

[14]

Y. Zhang, V. Hangya and A. Fraser.
LLM Sensitivity Challenges in Abusive Language Detection: Instruction-Tuned vs. Human Feedback.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL

Abstract

The capacity of large language models (LLMs) to understand and distinguish socially unacceptable texts enables them to play a promising role in abusive language detection. However, various factors can affect their sensitivity. In this work, we test whether LLMs have an unintended bias in abusive language detection, i.e., whether they predict more or less of a given abusive class than expected in zero-shot settings. Our results show that instruction-tuned LLMs tend to under-predict positive classes, since datasets used for tuning are dominated by the negative class. On the contrary, models fine-tuned with human feedback tend to be overly sensitive. In an exploratory approach to mitigate these issues, we show that label frequency in the prompt helps with the significant over-prediction.

MCML Authors

Alexander Fraser

Prof. Dr.

Principal Investigator

[13]

L. Zuo, P. Hong, O. Kraus, B. Plank and R. Litschko.
Evaluating Large Language Models for Cross-Lingual Retrieval.
Findings @EMNLP 2025 - Findings of the Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025. To be published. Preprint available. arXiv

Abstract

Multi-stage information retrieval (IR) has become a widely adopted paradigm in search. While Large Language Models (LLMs) have been extensively evaluated as second-stage reranking models for monolingual IR, a systematic large-scale comparison is still lacking for cross-lingual IR (CLIR). Moreover, while prior work shows that LLM-based rerankers improve CLIR performance, their evaluation setup relies on lexical retrieval with machine translation (MT) for the first stage. This is not only prohibitively expensive but also prone to error propagation across stages. Our evaluation on passage-level and document-level CLIR reveals that further gains can be achieved with multilingual bi-encoders as first-stage retrievers and that the benefits of translation diminish with stronger reranking models. We further show that pairwise rerankers based on instruction-tuned LLMs perform competitively with listwise rerankers. To the best of our knowledge, we are the first to study the interaction between retrievers and rerankers in two-stage CLIR with LLMs. Our findings reveal that, without MT, current state-of-the-art rerankers fall severely short when directly applied in CLIR.

MCML Authors

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

Robert Litschko

[12]

I. Ziegler, A. Köksal, D. Elliott and H. Schütze.
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation.
Preprint (Sep. 2024). arXiv

Abstract

Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question-answering (QA), medicine QA and commonsense QA as well as summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs for QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.

MCML Authors

Abdullatif Köksal

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Michael Hedderich
AI and Computational Linguistics

[11]

R. Zhao, A. Köksal, Y. Liu, L. Weissweiler, A. Korhonen and H. Schütze.
SynthEval: Hybrid Behavioral Testing of NLP Models with Synthetic Evaluation.
Findings @EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub

Abstract

Traditional benchmarking in NLP typically involves using static held-out test sets. However, this approach often results in an overestimation of performance and lacks the ability to offer comprehensive, interpretable, and dynamic assessments of NLP models. Recently, works like DynaBench (Kiela et al., 2021) and CheckList (Ribeiro et al., 2020) have addressed these limitations through behavioral testing of NLP models with test types generated by a multistep human-annotated pipeline. Unfortunately, manually creating a variety of test types requires much human labor, often at prohibitive cost. In this work, we propose SYNTHEVAL, a hybrid behavioral testing framework that leverages large language models (LLMs) to generate a wide range of test types for a comprehensive evaluation of NLP models. SYNTHEVAL first generates sentences via LLMs using controlled generation, and then identifies challenging examples by comparing the predictions made by LLMs with task-specific NLP models. In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the task-specific models consistently exhibit. We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks.

MCML Authors

Raoyuan Zhao

Abdullatif Köksal

* Former Member

Yihong Liu

Leonie Weissweiler

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Michael Hedderich
AI and Computational Linguistics

[10]

R. Zhao, A. Köksal, A. Modarressi, M. A. Hedderich and H. Schütze.
Do We Know What LLMs Don't Know? A Study of Consistency in Knowledge Probing.
Preprint (May. 2025). arXiv

Abstract

The reliability of large language models (LLMs) is greatly compromised by their tendency to hallucinate, underscoring the need for precise identification of knowledge gaps within LLMs. Various methods for probing such gaps exist, ranging from calibration-based to prompting-based methods. To evaluate these probing methods, in this paper, we propose a new process based on using input variations and quantitative metrics. Through this, we expose two dimensions of inconsistency in knowledge gap probing. (1) Intra-method inconsistency: Minimal non-semantic perturbations in prompts lead to considerable variance in detected knowledge gaps within the same probing method; e.g., the simple variation of shuffling answer options can decrease agreement to around 40%. (2) Cross-method inconsistency: Probing methods contradict each other on whether a model knows the answer. Methods are highly inconsistent – with decision consistency across methods being as low as 7% – even though the model, dataset, and prompt are all the same. These findings challenge existing probing methods and highlight the urgent need for perturbation-robust probing frameworks.

MCML Authors

Raoyuan Zhao

Abdullatif Köksal

* Former Member

Ali Modarressi

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Bernd Bischl
Statistical Learning and Data Science

[9]

Y. Zhang, Y. Li, X. Wang, Q. Shen, B. Plank, B. Bischl, M. Rezaei and K. Kawaguchi.
FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models.
Compression Workshop @NeurIPS 2024 - Workshop on Machine Learning and Compression at the 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL

Abstract

Overparametrized transformer networks are the state-of-the-art architecture for Large Language Models (LLMs). However, such models contain billions of parameters, making large compute a necessity while raising environmental concerns. To address these issues, we propose FinerCut, a new form of fine-grained layer pruning, which in contrast to prior work at the transformer block level, considers all self-attention and feed-forward network (FFN) layers within blocks as individual pruning candidates. FinerCut prunes layers whose removal causes minimal alteration to the model’s output – contributing to a new, lean, interpretable, and task-agnostic pruning method. Tested across 9 benchmarks, our approach retains 90% performance of Llama3-8B with 25% layers removed, and 95% performance of Llama3-70B with 30% layers removed, all without fine-tuning or post-pruning reconstruction. Strikingly, we observe intriguing results with FinerCut: 42% (34 out of 80) of the self-attention layers in Llama3-70B can be removed while preserving 99% of its performance – without additional fine-tuning after removal. Moreover, FinerCut provides a tool to inspect the types and locations of pruned layers, making it possible to observe interesting pruning behaviors. For instance, we observe a preference for pruning self-attention layers, often at deeper consecutive decoder layers. We hope our insights inspire future efficient LLM architecture designs.

MCML Authors

Yawei Li

Xinpeng Wang

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

Bernd Bischl

Prof. Dr.

Director

→ Group Bernd Bischl
Statistical Learning and Data Science

Mina Rezaei

Dr.

[8]

I. Ziegler, B. Ma, E. Nie, B. Bischl, D. Rügamer, B. Schubert and E. Dorigatti.
What cleaves? Is proteasomal cleavage prediction reaching a ceiling?
LMRL @NeurIPS 2022 - Workshop on Learning Meaningful Representations of Life at the 36th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL

Abstract

Epitope vaccines are a promising direction to enable precision treatment for cancer, autoimmune diseases, and allergies. Effectively designing such vaccines requires accurate prediction of proteasomal cleavage in order to ensure that the epitopes in the vaccine are presented to T cells by the major histocompatibility complex (MHC). While direct identification of proteasomal cleavage in vitro is cumbersome and low throughput, it is possible to implicitly infer cleavage events from the termini of MHC-presented epitopes, which can be detected in large amounts thanks to recent advances in high-throughput MHC ligandomics. Inferring cleavage events in such a way provides an inherently noisy signal which can be tackled with new developments in the field of deep learning that supposedly make it possible to learn predictors from noisy labels. Inspired by such innovations, we sought to modernize proteasomal cleavage predictors by benchmarking a wide range of recent methods, including LSTMs, transformers, CNNs, and denoising methods, on a recently introduced cleavage dataset. We found that increasing model scale and complexity appeared to deliver limited performance gains, as several methods reached about 88.5% AUC on C-terminal and 79.5% AUC on N-terminal cleavage prediction. This suggests that the noise and/or complexity of proteasomal cleavage and the subsequent biological processes of the antigen processing pathway are the major limiting factors for predictive performance rather than the specific modeling approach used. While biological complexity can be tackled by more data and better models, noise and randomness inherently limit the maximum achievable predictive performance.

MCML Authors

Bolei Ma

Ercong Nie

Bernd Bischl

Prof. Dr.

Director

Statistics, Data Science and Machine Learning

David Rügamer

Prof. Dr.

Principal Investigator

Emilio Dorigatti

Dr.

* Former Member

→ Group Bernd Bischl
Statistical Learning and Data Science

[7]

M. Zhao, F. Mi, Y. Wang, M. Li, X. Jiang, Q. Liu and H. Schütze.
LMTurk: Few-Shot Learners as Crowdsourcing Workers in a Language-Model-as-a-Service Framework.
Findings @NAACL 2022 - Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Seattle, WA, USA, Jun 10-15, 2022. DOI

Abstract

Vast efforts have been devoted to creating high-performance few-shot learners, i.e., large-scale pretrained language models (PLMs) that perform well with little downstream task training data. Training PLMs has incurred significant cost, but utilizing the few-shot learners is still challenging due to their enormous size. This work focuses on a crucial question: How to make effective use of these few-shot learners? We propose LMTurk, a novel approach that treats few-shot learners as crowdsourcing workers. The rationale is that crowdsourcing workers are in fact few-shot learners: They are shown a few illustrative examples to learn about a task and then start annotating. LMTurk employs few-shot learners built upon PLMs as workers. We show that the resulting annotations can be utilized to train models that solve the task well and are small enough to be deployable in practical scenarios. Active learning is integrated into LMTurk to reduce the number of queries made to PLMs, minimizing the computational cost of running PLM inference passes. Altogether, LMTurk is an important step towards making effective use of current PLMs.

MCML Authors

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[6]

S. Zhou, S. Peng and B. Plank.
CLIMATELI: Evaluating Entity Linking on Climate Change Data.
ClimateNLP @ACL 2024 - 1st Workshop on Natural Language Processing Meets Climate Change at the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Climate Change (CC) is a pressing topic of global importance, attracting increasing attention across research fields, from social sciences to Natural Language Processing (NLP). CC is also discussed in various settings and communication platforms, from academic publications to social media forums. Understanding who and what is mentioned in such data is a first critical step to gaining new insights into CC. We present CLIMATELI (CLIMATe Entity LInking), the first manually annotated CC dataset that links 3,087 entity spans to Wikipedia. Using CLIMATELI, we evaluate existing entity linking (EL) systems on the CC topic across various genres and propose automated filtering methods for CC entities. We find that the performance of EL models notably lags behind humans at both token and entity levels. Testing within the scope of retaining or excluding non-nominal and/or non-CC entities particularly impacts the models’ performances.

MCML Authors

Shijia Zhou

Siyao Peng

Dr.

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[5]

S. Zhou, S. Peng, S. Luebke, J. Haßler, M. Haim, S. M. Mohammad and B. Plank.
What Media Frames Reveal About Stance: A Dataset and Study about Memes in Climate Change Discourse.
Preprint (May 2025). arXiv

Abstract

Media framing refers to the emphasis on specific aspects of perceived reality to shape how an issue is defined and understood. Its primary purpose is to shape public perceptions often in alignment with the authors’ opinions and stances. However, the interaction between stance and media frame remains largely unexplored. In this work, we apply an interdisciplinary approach to conceptualize and computationally explore this interaction with internet memes on climate change. We curate CLIMATEMEMES, the first dataset of climate-change memes annotated with both stance and media frames, inspired by research in communication science. CLIMATEMEMES includes 1,184 memes sourced from 47 subreddits, enabling analysis of frame prominence over time and communities, and sheds light on the framing preferences of different stance holders. We propose two meme understanding tasks: stance detection and media frame detection. We evaluate LLaVA-NeXT and Molmo in various setups, and report the corresponding results on their LLM backbone. Human captions consistently enhance performance. Synthetic captions and human-corrected OCR also help occasionally. Our findings highlight that VLMs perform well on stance, but struggle on frames, where LLMs outperform VLMs. Finally, we analyze VLMs’ limitations in handling nuanced frames and stance expressions on climate change internet memes.

MCML Authors

Shijia Zhou

Siyao Peng

Dr.

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

[4]

S. Zhou, H. Shan, B. Plank and R. Litschko.
MaiNLP at SemEval-2024 Task 1: Analyzing Source Language Selection in Cross-Lingual Textual Relatedness.
SemEval @NAACL 2024 - 18th International Workshop on Semantic Evaluation at the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Mexico City, Mexico, Jun 16-21, 2024. URL

Abstract

This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness (STR), on Track C: Cross-lingual. The task aims to detect semantic relatedness of two sentences from the same language. For the cross-lingual approach, we developed a set of linguistics-inspired models trained with several task-specific strategies. We 1) utilize language vectors for selection of donor languages; 2) investigate the multi-source approach for training; 3) use transliteration of non-Latin script to study the impact of the ‘script gap’; 4) opt for machine translation for data augmentation. We additionally compare the performance of XLM-RoBERTa and Furina with the same training strategy. Our submission achieved the first place in the C8 (Kinyarwanda) test.

MCML Authors

Shijia Zhou

Barbara Plank

Prof. Dr.

Principal Investigator

→ Group Barbara Plank
AI and Computational Linguistics

Robert Litschko

[3]

S. Zhou, L. Weissweiler, T. He, H. Schütze, D. R. Mortensen and L. Levin.
Constructions Are So Difficult That Even Large Language Models Get Them Right for the Wrong Reasons.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL

Abstract

In this paper, we make a contribution that can be understood from two perspectives: from an NLP perspective, we introduce a small challenge dataset for NLI with large lexical overlap, which minimises the possibility of models discerning entailment solely based on token distinctions, and show that GPT-4 and Llama 2 fail it with strong bias. We then create further challenging sub-tasks in an effort to explain this failure. From a Computational Linguistics perspective, we identify a group of constructions with three classes of adjectives which cannot be distinguished by surface features. This enables us to probe for LLMs’ understanding of these constructions in various ways, and we find that they fail in a variety of ways to distinguish between them, suggesting that they don’t adequately represent their meaning or capture the lexical properties of phrasal heads.

MCML Authors

Shijia Zhou

→ Group Barbara Plank
AI and Computational Linguistics

Leonie Weissweiler

* Former Member

Hinrich Schütze

Prof. Dr.

Principal Investigator

[2]

S. Zhang, P. Wicke, L. K. Senel, L. Figueredo, A. Naceri, S. Haddadin, B. Plank and H. Schütze.
LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation.
Robot Learning @NeurIPS 2023 - 6th Robot Learning Workshop: Pretraining, Fine-Tuning, and Generalization with Large Scale Models at the 37th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Dec 10-16, 2023. URL

Abstract

The convergence of embodied agents and large language models (LLMs) has brought significant advancements to embodied instruction following. Particularly, the strong reasoning capabilities of LLMs make it possible for robots to perform long-horizon tasks without expensive annotated demonstrations. However, public benchmarks for testing the long-horizon reasoning capabilities of language-conditioned robots in various scenarios are still missing. To fill this gap, this work focuses on the tabletop manipulation task and releases a simulation benchmark, LoHoRavens, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetics and reference. Furthermore, there is a key modality bridging problem for long-horizon manipulation tasks with LLMs: how to incorporate the observation feedback during robot execution for the LLM’s closed-loop planning, which is however less studied by prior work. We investigate two methods of bridging the modality gap: caption generation and learnable interface for incorporating explicit and implicit observation feedback to the LLM, respectively. These methods serve as the two baselines for our proposed benchmark. Experiments show that both methods struggle to solve most tasks, indicating long-horizon manipulation tasks are still challenging for current popular models. We expect the proposed public benchmark and baselines can help the community develop better models for long-horizon tabletop manipulation tasks.

MCML Authors

Shengqiang Zhang

Philipp Wicke

Dr.

Lütfi Kerem Senel

Dr.

* Former Member

Barbara Plank

Prof. Dr.

Principal Investigator

Hinrich Schütze

Prof. Dr.

Principal Investigator

→ Group Gjergji Kasneci
Responsible Data Science

[1]

Z. Zhang, H. Yang, B. Ma, D. Rügamer and E. Nie.
Baby's CoThought: Leveraging Large Language Models for Enhanced Reasoning in Compact Models.
BabyLM Challenge @CoNLL 2023 - BabyLM Challenge at 27th Conference on Computational Natural Language Learning. Singapore, Dec 06-10, 2023. DOI GitHub

Abstract

Large Language Models (LLMs) demonstrate remarkable performance on a variety of natural language understanding (NLU) tasks, primarily due to their in-context learning ability. This ability could be applied to building babylike models, i.e. models at small scales, improving training efficiency. In this paper, we propose a ‘CoThought’ pipeline, which efficiently trains smaller ‘baby’ language models (BabyLMs) by leveraging the Chain of Thought prompting of LLMs. Our pipeline restructures a dataset of less than 100M in size using GPT-3.5-turbo, transforming it into task-oriented, human-readable texts that are comparable to the school texts for language learners. The BabyLM is then pretrained on this restructured dataset in a RoBERTa fashion. In evaluations across 4 benchmarks, our BabyLM outperforms the vanilla RoBERTa in 10 linguistic, NLU, and question-answering tasks by more than 3 points, showing a superior ability to extract contextual information. These results suggest that compact LMs pretrained on small, LLM-restructured data can better understand tasks and achieve improved performance.

MCML Authors

Zheyu Zhang

Bolei Ma

Statistics, Data Science and Machine Learning

David Rügamer

Prof. Dr.

Principal Investigator

Ercong Nie

B3 | Multimodal Perception

The ability of an intelligent, mobile actor to understand its egomotion as well as its surroundings is a fundamental prerequisite for the choice of actions to take. However, vast challenges remain to achieve the necessary levels of safety, which are deeply rooted in research that MCML aims to carry out: multisensor egomotion estimation and environment mapping, scene representations suitable for interaction in an open-ended environment, understanding and forecasting motion and events, and the role of uncertainty in ML blocks as modular elements.

Matthias Althoff

Prof. Dr.

Principal Investigator

Human-Centered Technologies for Learning

Enkelejda Kasneci

Prof. Dr.

Principal Investigator

Riccardo Marin

Dr.

Associate

Machine Learning for Robotics

Angela P. Schöllig

Prof. Dr.

Principal Investigator

Learning Systems and Robotics

Björn Schuller

Prof. Dr.

Principal Investigator

Publications in Research Area B3

[103]

S. Amiriparian, L. Christ, A. Kathan, M. Gerczuk, N. Müller, S. Klug, L. Stappen, A. König, E. Cambria, B. W. Schuller and S. Eulitz.
The MuSe 2024 Multimodal Sentiment Analysis Challenge: Social Perception and Humor Recognition.
MuSe @MM 2024 - 5th Multimodal Sentiment Analysis Challenge and Workshop: Social Perception and Humor at the 32nd ACM International Conference on Multimedia. Melbourne, Australia, Oct 28-Nov 01, 2024. DOI

Abstract

The Multimodal Sentiment Analysis Challenge (MuSe) 2024 addresses two contemporary multimodal affect and sentiment analysis problems: In the Social Perception Sub-Challenge (MuSe-Perception), participants will predict 16 different social attributes of individuals such as assertiveness, dominance, likability, and sincerity based on the provided audio-visual data. The Cross-Cultural Humor Detection Sub-Challenge (MuSe-Humor) dataset expands upon the Passau Spontaneous Football Coach Humor (Passau-SFCH) dataset, focusing on the detection of spontaneous humor in a cross-lingual and cross-cultural setting. The main objective of MuSe 2024 is to unite a broad audience from various research domains, including multimodal sentiment analysis, audio-visual affective computing, continuous signal processing, and natural language processing. By fostering collaboration and exchange among experts in these fields, the MuSe 2024 endeavors to advance the understanding and application of sentiment analysis and affective computing across multiple modalities. This baseline paper provides details on each sub-challenge and its corresponding dataset, extracted features from each data modality, and discusses challenge baselines. For our baseline system, we make use of a range of Transformers and expert-designed features and train Gated Recurrent Unit (GRU)-Recurrent Neural Network (RNN) models on them, resulting in a competitive baseline system. On the unseen test datasets of the respective sub-challenges, it achieves a mean Pearson’s Correlation Coefficient (ρ) of 0.3573 for MuSe-Perception and an Area Under the Curve (AUC) value of 0.8682 for MuSe-Humor.

MCML Authors

Shahin Amiriparian

Dr.

Alexander Kathan

Maurice Gerczuk

Björn Schuller

Prof. Dr.

Principal Investigator

[102]

S. Amiriparian, M. Gerczuk, J. Lutz, W. Strube, I. Papazova, A. Hasan, A. Kathan and B. W. Schuller.
Non-Invasive Suicide Risk Prediction Through Speech Analysis.
EHB 2024 - 12th E-Health and Bioengineering Conference. Iași, Romania, Nov 14-15, 2024. DOI

Abstract

The delayed access to specialized psychiatric assessments and care for patients at risk of suicidal tendencies in emergency departments creates a notable gap in timely intervention, hindering the provision of adequate mental health support during critical situations. To address this, we present a non-invasive, speech-based approach for automatic suicide risk assessment. For our study, we collected a novel speech recording dataset from 20 patients. We extract three sets of features, including wav2vec, interpretable speech and acoustic features, and deep learning-based spectral representations. We proceed by conducting a binary classification to assess suicide risk in a leave-one-subject-out fashion. Our most effective speech model achieves a balanced accuracy of 66.2%. Moreover, we show that integrating our speech model with a series of patients’ metadata, such as the history of suicide attempts or access to firearms, improves the overall result. The metadata integration yields a balanced accuracy of 94.4%, marking an absolute improvement of 28.2%, demonstrating the efficacy of our proposed approaches for automatic suicide risk assessment in emergency medicine.

MCML Authors

Shahin Amiriparian

Dr.

Maurice Gerczuk

Alexander Kathan

Björn Schuller

Prof. Dr.

Principal Investigator

[101]

M. M. Amin, R. Mao, E. Cambria and B. W. Schuller.
A Wide Evaluation of ChatGPT on Affective Computing Tasks.
IEEE Transactions on Affective Computing 15.4 (Oct. 2024). DOI

Abstract

With the rise of foundation models, a new artificial intelligence paradigm has emerged, by simply using general purpose foundation models with prompting to solve problems instead of training a separate machine learning model for each problem. Such models have been shown to have emergent properties of solving problems that they were not initially trained on. The studies for the effectiveness of such models are still quite limited. In this work, we widely study the capabilities of the ChatGPT models, namely GPT-4 and GPT-3.5, on 13 affective computing problems, namely aspect extraction, aspect polarity classification, opinion extraction, sentiment analysis, sentiment intensity ranking, emotions intensity ranking, suicide tendency detection, toxicity detection, well-being assessment, engagement measurement, personality assessment, sarcasm detection, and subjectivity detection. We introduce a framework to evaluate the ChatGPT models on regression-based problems, such as intensity ranking problems, by modelling them as pairwise ranking classification. We compare ChatGPT against more traditional NLP methods, such as end-to-end recurrent neural networks and transformers. The results demonstrate the emergent abilities of the ChatGPT models on a wide range of affective computing problems, where GPT-3.5 and especially GPT-4 have shown strong performance on many problems, particularly the ones related to sentiment, emotions, or toxicity. The ChatGPT models fell short for problems with implicit signals, such as engagement measurement and subjectivity detection.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[100]

S. Amiriparian, F. Packań, M. Gerczuk and B. W. Schuller.
ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets.
INTERSPEECH 2024 - 25th Annual Conference of the International Speech Communication Association. Kos Island, Greece, Sep 01-05, 2024. DOI

Abstract

Foundation models have shown great promise in speech emotion recognition (SER) by leveraging their pre-trained representations to capture emotion patterns in speech signals. To further enhance SER performance across various languages and domains, we propose a novel twofold approach. First, we gather EmoSet++, a comprehensive multi-lingual, multi-cultural speech emotion corpus with 37 datasets, 150,907 samples, and a total duration of 119.5 hours. Second, we introduce ExHuBERT, an enhanced version of HuBERT achieved by backbone extension and fine-tuning on EmoSet++. We duplicate each encoder layer and its weights, then freeze the first duplicate, integrating an extra zero-initialized linear layer and skip connections to preserve functionality and ensure its adaptability for subsequent fine-tuning. Our evaluation on unseen datasets shows the efficacy of ExHuBERT, setting a new benchmark for various SER tasks.

MCML Authors

Shahin Amiriparian

Dr.

Filip Packań

Maurice Gerczuk

Björn Schuller

Prof. Dr.

Principal Investigator

[99]

M. M. Amin and B. W. Schuller.
On Prompt Sensitivity of ChatGPT in Affective Computing.
ACII 2024 - 12th International Conference on Affective Computing and Intelligent Interaction. Glasgow, UK, Sep 15-18, 2024. DOI

Abstract

Recent studies have demonstrated the emerging capabilities of foundation models like ChatGPT in several fields, including affective computing. However, accessing these emerging capabilities is facilitated through prompt engineering. Despite the existence of some prompting techniques, the field is still rapidly evolving and many prompting ideas still require investigation. In this work, we introduce a method to evaluate and investigate the sensitivity of the performance of foundation models based on different prompts or generation parameters. We perform our evaluation on ChatGPT within the scope of affective computing on three major problems, namely sentiment analysis, toxicity detection, and sarcasm detection. First, we carry out a sensitivity analysis on pivotal parameters in auto-regressive text generation, specifically the temperature parameter T and the top-p parameter in Nucleus sampling, dictating how conservative or creative the model should be during generation. Furthermore, we explore the efficacy of several prompting ideas, where we explore how giving different incentives or structures affects the performance. Our evaluation takes into consideration performance measures on the affective computing tasks, and the effectiveness of the model to follow the stated instructions, hence generating easy-to-parse responses to be smoothly used in downstream applications.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[98]

A. Akman, Q. Sun and B. W. Schuller.
Improving Audio Explanations using Audio Language Models.
IEEE Signal Processing Letters Early Access (Jan. 2025). DOI

Abstract

Foundation models are widely utilised for their strong representational capabilities, driven by training on extensive datasets with self-supervised learning. The increasing complexity of these models highlights the importance of interpretability to enhance transparency and improve human understanding of their decision-making processes. Most existing interpretability methods explain model behaviour by attributing importance to individual data elements across different layers, based on their influence on the final prediction. These approaches often emphasise only the most relevant features, overlooking the broader representational space, removing less important features. In this study, we propose a novel framework for explanation generation that serves as an alternative to feature removal, offering a more comprehensive understanding of model behaviour. Our framework leverages the generative abilities of audio language models to replace removed features with contextually appropriate alternatives, providing a more complete view of the model’s decision-making process. Through extensive evaluations on standard benchmarks, including keyword spotting and speech emotion recognition, our approach demonstrates its effectiveness in generating high-quality audio explanations.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

→ Group Enkelejda Kasneci
Human-Centered Technologies for Learning

[97]

Abstract

MCML Authors

Ivo Bueno

Philipp Wicke

Dr.

[96]

K. D. Bartl-Pokorny, A. Mallol-Ragolta, A. Spiesberger, A. Semertzidou, J. Löchner, F. B. Pokorny and B. W. Schuller.
'Hey Smartphone, Am I Ill?' Detecting Diseases From The Voice.
Frontiers for Young Minds (Jun. 2025). URL

Abstract

As humans, we learn from what we perceive with our senses in our daily lives. Computers can have similar learning capabilities, allowing them to learn from what they ‘see’ and ‘hear’ and to use the knowledge they learn to solve future tasks. This ability is called artificial intelligence (AI). Devices equipped with AI, such as smartphones, smartwatches, or smart speakers, have now become our everyday companions. Among other things, they can listen to us and answer our questions. This type of technology is also playing a growing role in medicine. In this article, we explain how a computer can figure out whether the sound of a person’s voice or the way they speak indicates a certain disease. We demonstrate this using the example of detecting COVID-19, and discuss both problems and opportunities that arise when using AI for diagnosis.

MCML Authors

Adria Mallol-Ragolta

Anika Spiesberger

Florian Pokorny

Dr.

Björn Schuller

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[95]

Y. Burkhardt, S. Schaefer and S. Leutenegger.
SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection.
Preprint (Apr. 2025). arXiv GitHub

Abstract

MCML Authors

Yannick Burkhardt

Simon Schaefer

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Stefan Leutenegger

Prof. Dr.

Principal Investigator

* Former Principal Investigator

[94]

K. D. Bartl-Pokorny, C. Zitta, M. Beirit, G. Vogrinec, B. W. Schuller and F. B. Pokorny.
Focused review on artificial intelligence for disease detection in infants.
Frontiers in Digital Health 6 (Nov. 2024). DOI

Abstract

Over the last years, studies using artificial intelligence (AI) for the detection and prediction of diseases have increased and also concentrated more and more on vulnerable groups of individuals, such as infants. The release of ChatGPT demonstrated the potential of large language models (LLMs) and heralded a new era of AI with manifold application possibilities. However, the impact of this new technology on medical research cannot be fully estimated yet. In this work, we therefore aimed to summarise the most recent pre-ChatGPT developments in the field of automated detection and prediction of diseases and disease status in infants, i.e., within the first 12 months of life. For this, we systematically searched the scientific databases PubMed and IEEE Xplore for original articles published within the last five years preceding the release of ChatGPT (2018–2022). The search revealed 927 articles; a final number of 154 articles was included for review. First of all, we examined research activity over time. Then, we analysed the articles from 2022 for medical conditions, data types, tasks, AI approaches, and reported model performance. A clear trend of increasing research activity over time could be observed. The most recently published articles focused on medical conditions of twelve different ICD-11 categories; “certain conditions originating in the perinatal period” was the most frequently addressed disease category. AI models were trained with a variety of data types, among which clinical and demographic information and laboratory data were most frequently exploited. The most frequently performed tasks aimed to detect present diseases, followed by the prediction of diseases and disease status at a later point in development. 
Deep neural networks turned out as the most popular AI approach, even though traditional methods, such as random forests and support vector machines, still play a role—presumably due to their explainability or better suitability when the amount of data is limited. Finally, the reported performances in many of the reviewed articles suggest that AI has the potential to assist in diagnostic procedures for infants in the near future. LLMs will boost developments in this field in the upcoming years.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

Florian Pokorny

Dr.

[93]

L. Christ, S. Amiriparian, F. Hawighorst, A.-K. Schill, A. Boutalikakis, L. Graf-Vlachy, A. König and B. W. Schuller.
This Paper Had the Smartest Reviewers -- Flattery Detection Utilising an Audio-Textual Transformer-Based Approach.
INTERSPEECH 2024 - 25th Annual Conference of the International Speech Communication Association. Kos Island, Greece, Sep 01-05, 2024. DOI

Abstract

Flattery is an important aspect of human communication that facilitates social bonding, shapes perceptions, and influences behavior through strategic compliments and praise, leveraging the power of speech to build rapport effectively. Its automatic detection can thus enhance the naturalness of human-AI interactions. To meet this need, we present a novel audio textual dataset comprising 20 hours of speech and train machine learning models for automatic flattery detection. In particular, we employ pretrained AST, Wav2Vec2, and Whisper models for the speech modality, and Whisper TTS models combined with a RoBERTa text classifier for the textual modality. Subsequently, we build a multimodal classifier by combining text and audio representations. Evaluation on unseen test data demonstrates promising results, with Unweighted Average Recall scores reaching 82.46% in audio-only experiments, 85.97% in text-only experiments, and 87.16% using a multimodal approach.

MCML Authors

Shahin Amiriparian

Dr.

Björn Schuller

Prof. Dr.

Principal Investigator

[92]

L. Christ, S. Amiriparian, A. Kathan, N. Müller, A. König and B. W. Schuller.
Towards Multimodal Prediction of Spontaneous Humor: A Novel Dataset and First Results.
IEEE Transactions on Affective Computing 16.2 (Apr. 2025). DOI

Abstract

Humor is a substantial element of human social behavior, affect, and cognition. Its automatic understanding can facilitate a more naturalistic human-AI interaction. Current methods of humor detection have been exclusively based on staged data, making them inadequate for ‘real-world’ applications. We contribute to addressing this deficiency by introducing the novel Passau-Spontaneous Football Coach Humor (Passau-SFCH) dataset, comprising about 11 hours of recordings. The Passau-SFCH dataset is annotated for the presence of humor and its dimensions (sentiment and direction) as proposed in Martin’s Humor Style Questionnaire. We conduct a series of experiments employing pretrained Transformers, convolutional neural networks, and expert-designed features. The performance of each modality (text, audio, video) for spontaneous humor recognition is analyzed and their complementarity is investigated. Our findings suggest that for the automatic analysis of humor and its sentiment, facial expressions are most promising, while humor direction can be best modeled via text-based features. Further, we experiment with different multimodal approaches to humor recognition, including decision-level fusion and MulT, a multimodal Transformer approach. In this context, we propose a novel multimodal architecture that yields the best overall results.

MCML Authors

Shahin Amiriparian

Dr.

Alexander Kathan

Björn Schuller

Prof. Dr.

Principal Investigator

[91]

L. Christ, S. Amiriparian, M. Milling, I. Aslan and B. W. Schuller.
Modeling Emotional Trajectories in Written Stories Utilizing Transformers and Weakly-Supervised Learning.
Findings @ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Telling stories is an integral part of human communication which can evoke emotions and influence the affective states of the audience. Automatically modeling emotional trajectories in stories has thus attracted considerable scholarly interest. However, as most existing works have been limited to unsupervised dictionary-based approaches, there is no benchmark for this task. We address this gap by introducing continuous valence and arousal labels for an existing dataset of children’s stories originally annotated with discrete emotion categories. We collect additional annotations for this data and map the categorical labels to the continuous valence and arousal space. For predicting the thus obtained emotionality signals, we fine-tune a DeBERTa model and improve upon this baseline via a weakly supervised learning approach. The best configuration achieves a Concordance Correlation Coefficient (CCC) of .8221 for valence and .7125 for arousal on the test set, demonstrating the efficacy of our proposed approach. A detailed analysis shows the extent to which the results vary depending on factors such as the author, the individual story, or the section within the story. In addition, we uncover the weaknesses of our approach by investigating examples that prove to be difficult to predict.

MCML Authors

Shahin Amiriparian

Dr.

Manuel Milling

Björn Schuller

Prof. Dr.

Principal Investigator

[90]

P. Gassert and M. Althoff.
Stepping Out of the Shadows: Reinforcement Learning in Shadow Mode.
Preprint (Oct. 2024). arXiv

Abstract

Reinforcement learning (RL) is not yet competitive for many cyber-physical systems, such as robotics, process automation, and power systems, as training on a system with physical components cannot be accelerated, and simulation models do not exist or suffer from a large simulation-to-reality gap. During the long training time, expensive equipment cannot be used and might even be damaged due to inappropriate actions of the reinforcement learning agent. Our novel approach addresses exactly this problem: We train the reinforcement agent in a so-called shadow mode with the assistance of an existing conventional controller, which does not have to be trained and instantaneously performs reasonably well. In shadow mode, the agent relies on the controller to provide action samples and guidance towards favourable states to learn the task, while simultaneously estimating for which states the learned agent will receive a higher reward than the conventional controller. The RL agent will then control the system for these states and all other regions remain under the control of the existing controller. Over time, the RL agent will take over for an increasing number of states, while leaving control to the baseline wherever it cannot surpass the baseline's performance. Thus, we keep regret during training low and improve the performance compared to only using conventional controllers or reinforcement learning. We present and evaluate two mechanisms for deciding whether to use the RL agent or the conventional controller. The usefulness of our approach is demonstrated for a reach-avoid task, for which we are able to effectively train an agent, where standard approaches fail.

MCML Authors

Philipp Gassert

* Former Member

Matthias Althoff

Prof. Dr.

Principal Investigator

[89]

M. Gerczuk, S. Amiriparian, J. Lutz, W. Strube, I. Papazova, A. Hasan and B. W. Schuller.
Exploring Gender-Specific Speech Patterns in Automatic Suicide Risk Assessment.
INTERSPEECH 2024 - 25th Annual Conference of the International Speech Communication Association. Kos Island, Greece, Sep 01-05, 2024. DOI

Abstract

In emergency medicine, timely intervention for patients at risk of suicide is often hindered by delayed access to specialised psychiatric care. To bridge this gap, we introduce a speech-based approach for automatic suicide risk assessment. Our study involves a novel dataset comprising speech recordings of 20 patients who read neutral texts. We extract four speech representations encompassing interpretable and deep features. Further, we explore the impact of gender-based modelling and phrase-level normalisation. By applying gender-exclusive modelling, features extracted from an emotion fine-tuned wav2vec2.0 model can be utilised to discriminate high- from low-suicide risk with a balanced accuracy of 81%. Finally, our analysis reveals a discrepancy in the relationship of speech characteristics and suicide risk between female and male subjects. For men in our dataset, suicide risk increases together with agitation while voice characteristics of female subjects point the other way.

MCML Authors

Maurice Gerczuk

Shahin Amiriparian

Dr.

Björn Schuller

Prof. Dr.

Principal Investigator

[88]

M. Gonzalez-Machorro, U. Reichel, P. Hecker, H. Hammer, H. Sagha, F. Eyben, R. Hoepner and B. W. Schuller.
Speech-Based Depressive Mood Detection in the Presence of Multiple Sclerosis: A Cross-Corpus and Cross-Lingual Study.
ICNLSP 2025 - 8th International Conference on Natural Language and Speech Processing. Odense, Denmark, Aug 25-27, 2025. To be published. Preprint available. arXiv

Abstract

Depression commonly co-occurs with neurodegenerative disorders like Multiple Sclerosis (MS), yet the potential of speech-based Artificial Intelligence for detecting depression in such contexts remains unexplored. This study examines the transferability of speech-based depression detection methods to people with MS (pwMS) through cross-corpus and cross-lingual analysis using English data from the general population and German data from pwMS. Our approach implements supervised machine learning models using: 1) conventional speech and language features commonly used in the field, 2) emotional dimensions derived from a Speech Emotion Recognition (SER) model, and 3) exploratory speech feature analysis. Despite limited data, our models detect depressive mood in pwMS with moderate generalisability, achieving a 66% Unweighted Average Recall (UAR) on a binary task. Feature selection further improved performance, boosting UAR to 74%. Our findings also highlight the relevant role emotional changes have as an indicator of depressive mood in both the general population and within pwMS. This study provides an initial exploration into generalising speech-based depression detection, even in the presence of co-occurring conditions, such as neurodegenerative diseases.
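As a reading aid for the UAR figures quoted above, here is a minimal sketch of Unweighted Average Recall, i.e. the mean of the per-class recalls. It is a generic illustration, not the authors' evaluation code; the function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, with every class weighted equally."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = defaultdict(int)    # correctly predicted samples per true class
    totals = defaultdict(int)  # number of samples per true class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


if __name__ == "__main__":
    # Toy imbalanced binary task: 8 negatives, 2 positives (hypothetical data).
    y_true = [0] * 8 + [1] * 2
    y_pred = [0] * 10  # a classifier that always predicts the majority class
    print(unweighted_average_recall(y_true, y_pred))  # 0.5, although accuracy is 0.8
```

Because each class contributes equally, a majority-class predictor scores only chance level (50% on a binary task), which makes UAR a common choice for imbalanced clinical speech data.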

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[87]

Z. Ge, X. Xu, H. Guo and B. W. Schuller.
Multi-Task Partially Spoofed Speech Detection Using a Dual-View Graph Neural Network Assisted Segment-Level Module.
IEEE Transactions on Audio, Speech and Language Processing 33 (Jul. 2025). DOI

Abstract

Partially Spoofed Speech Detection (PSSD), as a multi-task learning problem, typically comprises segment- and utterance-level detection tasks, benefitting from diverse feature representations for effective classification. However, existing models for multi-task PSSD usually employ a shared feature processing module for the two tasks, which may lead to suboptimal performance compared with task-specific strategies. Further, most existing works capture segment-level information from a single view, which may result in poor modeling of local differences between fake and bonafide segments. In this regard, we propose a Dual-view Graph neural network Assisted segment-level Module (DGAM) for multi-task PSSD. The proposed approach contains three modules: shared representation extraction, task-specific feature processing for the utterance-level task, and a Dual-View Graph Neural Network (D-GNN) with a dual-view consistency loss for the segment-level task through the graph attention mechanism with cosine similarity and heat kernel function with Euclidean distance as two different views, which capture semantic and Euclidean spatial relationships, respectively. Experimental evaluations on multiple spoofed-speech datasets demonstrate that the proposed approach outperforms existing approaches in both segment- and utterance-level detection in terms of equal error rate, showcasing its effectiveness for the multi-task partially spoofed scenario.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[86]

J. Jung, S. Boche, S. B. Laina and S. Leutenegger.
Uncertainty-Aware Visual-Inertial SLAM with Volumetric Occupancy Mapping.
ICRA 2025 - IEEE International Conference on Robotics and Automation. Atlanta, GA, USA, May 19-23, 2025. DOI

Abstract

We propose visual-inertial simultaneous localization and mapping that tightly couples sparse reprojection errors, inertial measurement unit pre-integrals, and relative pose factors with dense volumetric occupancy mapping. Hereby depth predictions from a deep neural network are fused in a fully probabilistic manner. Specifically, our method is rigorously uncertainty-aware: first, we use depth and uncertainty predictions from a deep network not only from the robot’s stereo rig, but we further probabilistically fuse motion stereo that provides depth information across a range of baselines, therefore drastically increasing mapping accuracy. Next, predicted and fused depth uncertainty propagates not only into occupancy probabilities but also into alignment factors between generated dense submaps that enter the probabilistic nonlinear least squares estimator. This submap representation offers globally consistent geometry at scale. Our method is thoroughly evaluated in two benchmark datasets, resulting in localization and mapping accuracy that exceeds the state of the art, while simultaneously offering volumetric occupancy directly usable for downstream robotic planning and control in real-time.

MCML Authors

Stefan Leutenegger

Prof. Dr.

Principal Investigator

* Former Principal Investigator

[85]

X. Jing, J. Wang, I. Tsangko, A. Triantafyllopoulos and B. W. Schuller.
MELT: Towards Automated Multimodal Emotion Data Annotation by Leveraging LLM Embedded Knowledge.
Preprint (May. 2025). arXiv

Abstract

Although speech emotion recognition (SER) has advanced significantly with deep learning, annotation remains a major hurdle. Human annotation is not only costly but also subject to inconsistencies: annotators often have different preferences and may lack the necessary contextual knowledge, which can lead to varied and inaccurate labels. Meanwhile, Large Language Models (LLMs) have emerged as a scalable alternative for annotating text data. However, the potential of LLMs to perform emotional speech data annotation without human supervision has yet to be thoroughly investigated. To address these problems, we apply GPT-4o to annotate a multimodal dataset collected from the sitcom Friends, using only textual cues as inputs. By crafting structured text prompts, our methodology capitalizes on the knowledge GPT-4o has accumulated during its training, showcasing that it can generate accurate and contextually relevant annotations without direct access to multimodal inputs. Therefore, we propose MELT, a multimodal emotion dataset fully annotated by GPT-4o. We demonstrate the effectiveness of MELT by fine-tuning four self-supervised learning (SSL) backbones and assessing speech emotion recognition performance across emotion datasets. Additionally, our subjective experiments’ results demonstrate a consistent performance improvement on SER.

MCML Authors

Xin Jing

Iosif Tsangko

Andreas Triantafyllopoulos

Björn Schuller

Prof. Dr.

Principal Investigator

[84]

X. Jing, K. Zhou, A. Triantafyllopoulos and B. W. Schuller.
Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models.
ICASSP 2025 - IEEE International Conference on Acoustics, Speech and Signal Processing. Hyderabad, India, Apr 06-11, 2025. DOI

Abstract

While current emotional text-to-speech (TTS) systems can generate highly intelligible emotional speech, achieving fine control over emotion rendering of the output speech still remains a significant challenge. In this paper, we introduce ParaEVITS, a novel emotional TTS framework that leverages the compositionality of natural language to enhance control over emotional rendering. By incorporating a text-audio encoder inspired by ParaCLAP, a contrastive language-audio pretraining (CLAP) model for computational paralinguistics, the diffusion model is trained to generate emotional embeddings based on textual emotional style descriptions. Our framework first trains on reference audio using the audio encoder, then fine-tunes a diffusion model to process textual inputs from ParaCLAP’s text encoder. During inference, speech attributes such as pitch, jitter, and loudness are manipulated using only textual conditioning. Our experiments demonstrate that ParaEVITS effectively controls emotion rendering without compromising speech quality. Speech demos are publicly available.

MCML Authors

Xin Jing

Andreas Triantafyllopoulos

Björn Schuller

Prof. Dr.

Principal Investigator

[83]

H. Krasowski and M. Althoff.
Provable Traffic Rule Compliance in Safe Reinforcement Learning on the Open Sea.
IEEE Transactions on Intelligent Vehicles Early Access (May. 2024). DOI

Abstract

For safe operation, autonomous vehicles have to obey traffic rules that are set forth in legal documents formulated in natural language. Temporal logic is a suitable concept to formalize such traffic rules. Still, temporal logic rules often result in constraints that are hard to solve using optimization-based motion planners. Reinforcement learning (RL) is a promising method to find motion plans for autonomous vehicles. However, vanilla RL algorithms are based on random exploration and do not automatically comply with traffic rules. Our approach accomplishes guaranteed rule-compliance by integrating temporal logic specifications into RL. Specifically, we consider the application of vessels on the open sea, which must adhere to the Convention on the International Regulations for Preventing Collisions at Sea (COLREGS). To efficiently synthesize rule-compliant actions, we combine predicates based on set-based prediction with a statechart representing our formalized rules and their priorities. Action masking then restricts the RL agent to this set of verified rule-compliant actions. In numerical evaluations on critical maritime traffic situations, our agent always complies with the formalized legal rules and never collides while achieving a high goal-reaching rate during training and deployment. In contrast, vanilla and traffic rule-informed RL agents frequently violate traffic rules and collide even after training.
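The action-masking step mentioned above can be illustrated with a minimal, generic sketch. It assumes the set of verified rule-compliant actions is already available as a boolean mask; in the paper that set is derived from set-based prediction and a statechart, which is not reproduced here. The function name `masked_softmax` and the example logits are hypothetical.

```python
import math
import random


def masked_softmax(logits, allowed):
    """Softmax over action logits with disallowed actions forced to probability 0."""
    if len(logits) != len(allowed):
        raise ValueError("logits and mask must have the same length")
    if not any(allowed):
        raise ValueError("at least one action must be allowed")
    # Subtract the largest allowed logit for numerical stability.
    peak = max(l for l, ok in zip(logits, allowed) if ok)
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5]       # hypothetical policy outputs for three actions
    allowed = [False, True, True]  # hypothetical mask: action 0 failed verification
    probs = masked_softmax(logits, allowed)
    action = random.choices(range(len(probs)), weights=probs)[0]
    print(probs, action)  # the sampled action is never the masked one
```

Since masked actions receive exactly zero probability, the agent cannot select them during either exploration or deployment, however high the unmasked policy would have scored them.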

MCML Authors

Hanna Krasowski

Dr.

* Former Member

Matthias Althoff

Prof. Dr.

Principal Investigator

[82]

A. Kathan, S. Amiriparian, L. Christ, S. Eulitz and B. W. Schuller.
Automatic Speech-Based Charisma Recognition and the Impact of Integrating Auxiliary Characteristics.
TELEPRESENCE 2024 - IEEE Conference on Telepresence. Pasadena, CA, USA, Nov 16-17, 2024. DOI

Abstract

Automatic recognition of speaker’s states and traits is crucial to facilitate a more naturalistic human-AI interaction – a key focus in human-computer interaction to enhance user experience. One particularly important trait in daily life is charisma. To date, its definition is still controversial. However, it seems that there are characteristics in speech that the majority perceives as charismatic. To this end, we address the novel speech-based task of charisma recognition in a three-fold approach. First, we predict charismatic speech using both interpretable acoustic features and embeddings of two audio Transformers. Afterwards, we make use of auxiliary labels that are highly correlated with charisma, including enthusiastic, likeable, attractive, warm, and leader-like, to check their impact on charisma recognition. Finally, we personalise the best model, taking individual speech characteristics into account. In our experiments, we demonstrate that the charisma prediction model benefits from integrating auxiliary characteristics as well as from the personalised approach, resulting in a best Pearson’s correlation coefficient of 0.4304.

MCML Authors

Alexander Kathan

Shahin Amiriparian

Dr.

Björn Schuller

Prof. Dr.

Principal Investigator

[81]

A. Kathan, S. Amiriparian, A. Triantafyllopoulos, A. Gebhard, S. Milkus, J. Hohmann, P. Muderlak, J. Schottdorf, R. Musil and B. W. Schuller.
Personalised Speech-Based PTSD Prediction Using Weighted-Instance Learning.
EMBC 2024 - 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. Orlando, FL, USA, Jul 15-19, 2024. DOI

Abstract

Post-traumatic stress disorder (PTSD) is a prevalent disorder that can develop in people who have experienced very stressful, shocking, or distressing events. It has great influence on people’s daily lives and can affect their mental, physical, or social wellbeing, which is why a timely and professional treatment is required. In this paper, we propose a personalised speech-based PTSD prediction approach using a newly collected dataset which consists of 15 participants, including speech recordings from people with PTSD and healthy controls. In addition, the dataset includes data before and after a clinical intervention so that the prediction can be analysed at different points in time. In our experiments, we demonstrate the superiority of the personalised approach, achieving a best area under the ROC curve (AUC) of 82% and a best relative improvement of 7% points compared to the non-personalised model.

MCML Authors

Alexander Kathan

Shahin Amiriparian

Dr.

Andreas Triantafyllopoulos

Alexander Gebhard

Björn Schuller

Prof. Dr.

Principal Investigator

[80]

A. Kathan, M. Bürger, A. Triantafyllopoulos, S. Milkus, R. Musil, B. W. Schuller and S. Amiriparian.
Real-world PTSD Recognition: A Cross-corpus and Cross-linguistic Evaluation.
INTERSPEECH 2024 - 25th Annual Conference of the International Speech Communication Association. Kos Island, Greece, Sep 01-05, 2024. DOI

Abstract

Post-traumatic Stress Disorder (PTSD) is a mental condition that develops as a result of catastrophic events. Triggers for this may include experiences, such as military combat, natural disasters, or sexual abuse, having a great influence on the mental wellbeing. Due to the severity of this condition, early detection and professional treatment are crucial. For this reason, previous research explored prediction models for recognising PTSD at an early stage. However, when these models are transferred from research to real-world applications, they face heterogeneous environments (e.g., different recording settings, various dialects or languages). To analyse this effect, we develop a speech-based PTSD recognition model and subsequently analyse its cross-corpus and cross-linguistic performance. Our experiments indicate that there are cross-cultural factors influencing PTSD and leading to a best area under the ROC curve (AUC) of 70.1% evaluated cross-corpus.

MCML Authors

Alexander Kathan

Andreas Triantafyllopoulos

Björn Schuller

Prof. Dr.

Principal Investigator

Shahin Amiriparian

Dr.

[79]

S. Kalabakov, M. Gonzalez-Machorro, F. Eyben, B. W. Schuller and B. Arnrich.
A Comparative Analysis of Federated Learning for Speech-Based Cognitive Decline Detection.
INTERSPEECH 2024 - 25th Annual Conference of the International Speech Communication Association. Kos Island, Greece, Sep 01-05, 2024. PDF

Abstract

Speech-based machine learning models that can distinguish between a healthy cognitive state and different stages of cognitive decline would enable a more appropriate and timely treatment of patients. However, their development is often hampered by data scarcity. Federated Learning (FL) is a potential solution that could enable entities with limited voice recordings to collectively build effective models. Motivated by this, we compare centralised, local, and federated learning for building speech-based models to discern Alzheimer’s Disease, Mild Cognitive Impairment, and a healthy state. For a more realistic evaluation, we use three independently collected datasets to simulate healthcare institutions employing these strategies. Our initial analysis shows that FL may not be the best solution in every scenario, as performance improvements are not guaranteed even with small amounts of available data, and further research is needed to determine the conditions under which it is beneficial.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[78]

J. Külz, S. Ha and M. Althoff.
A Design Co-Pilot for Task-Tailored Manipulators.
Preprint (Sep. 2025). arXiv

Abstract

Although robotic manipulators are used in an ever-growing range of applications, robot manufacturers typically follow a “one-fits-all” philosophy, employing identical manipulators in various settings. This often leads to suboptimal performance, as general-purpose designs fail to exploit particularities of tasks. The development of custom, task-tailored robots is hindered by long, cost-intensive development cycles and the high cost of customized hardware. Recently, various computational design methods have been devised to overcome the bottleneck of human engineering. In addition, a surge of modular robots allows quick and economical adaptation to changing industrial settings. This work proposes an approach to automatically designing and optimizing robot morphologies tailored to a specific environment. To this end, we learn the inverse kinematics for a wide range of different manipulators. A fully differentiable framework realizes gradient-based fine-tuning of designed robots and inverse kinematics solutions. Our generative approach accelerates the generation of specialized designs from hours with optimization-based methods to seconds, serving as a design co-pilot that enables instant adaptation and effective human-AI collaboration. Numerical experiments show that our approach finds robots that can navigate cluttered environments, manipulators that perform well across a specified workspace, and can be adapted to different hardware constraints. Finally, we demonstrate the real-world applicability of our method by setting up a modular robot designed in simulation that successfully moves through an obstacle course.

MCML Authors

Jonathan Külz

Matthias Althoff

Prof. Dr.

Principal Investigator

[77]

J. Külz, M. Mayer and M. Althoff.
Timor Python: A Toolbox for Industrial Modular Robotics.
IROS 2023 - IEEE/RSJ International Conference on Intelligent Robots and Systems. Detroit, MI, USA, Oct 01-05, 2023. DOI

Abstract

Modular Reconfigurable Robots (MRRs) represent an exciting path forward for industrial robotics, opening up new possibilities for robot design. Compared to monolithic manipulators, they promise greater flexibility, improved maintainability, and cost-efficiency. However, there is no tool or standardized way to model and simulate assemblies of modules in the same way it has been done for robotic manipulators for decades. We introduce the Toolbox for Industrial Modular Robotics (Timor), a Python toolbox to bridge this gap and integrate modular robotics into existing simulation and optimization pipelines. Our open-source library offers model generation and task-based configuration optimization for MRRs. It can easily be integrated with existing simulation tools - not least by offering URDF export of arbitrary modular robot assemblies. Moreover, our experimental study demonstrates the effectiveness of Timor as a tool for designing modular robots optimized for specific use cases.

MCML Authors

Jonathan Külz

Matthias Althoff

Prof. Dr.

Principal Investigator

[76]

M. Keinert, S. Pistrosch, A. Mallol-Ragolta, B. W. Schuller and M. Berking.
Facial Emotion Recognition of 16 Distinct Emotions From Smartphone Videos: Comparative Study of Machine Learning and Human Performance.
Journal of Medical Internet Research 27 (Jul. 2025). DOI

Abstract

Background: The development of automatic emotion recognition models from smartphone videos is a crucial step toward the dissemination of psychotherapeutic app interventions that encourage emotional expressions. Existing models focus mainly on the 6 basic emotions while neglecting other therapeutically relevant emotions. To support this research, we introduce the novel Stress Reduction Training Through the Recognition of Emotions Wizard-of-Oz (STREs WoZ) dataset, which contains facial videos of 16 distinct, therapeutically relevant emotions.
Objective: This study aimed to develop deep learning–based automatic facial emotion recognition (FER) models for binary (positive vs negative) and multiclass emotion classification tasks, assess the models’ performance, and validate them by comparing the models with human observers.
Methods: The STREs WoZ dataset contains 14,412 facial videos of 63 individuals displaying the 16 emotions. The selfie-style videos were recorded during a stress reduction training using front-facing smartphone cameras in a nonconstrained laboratory setting. Automatic FER models using both appearance and deep-learned features for binary and multiclass emotion classification were trained on the STREs WoZ dataset. The appearance features were based on the Facial Action Coding System and extracted with OpenFace. The deep-learned features were obtained through a ResNet50 model. For our deep learning models, we used the appearance features, the deep-learned features, and their concatenation as inputs. We used 3 recurrent neural network (RNN)–based architectures: RNN-convolution, RNN-attention, and RNN-average networks. For validation, 3 human observers were also trained in binary and multiclass emotion recognition. A test set of 3018 facial emotion videos of the 16 emotions was completed by both the automatic FER model and human observers. The performance was assessed with unweighted average recall (UAR) and accuracy.
Results: Models using appearance features outperformed those using deep-learned features, as well as models combining both feature types in both tasks, with the attention network using appearance features emerging as the best-performing model. The attention network achieved a UAR of 92.9% in the binary classification task, and accuracy values ranged from 59.0% to 90.0% in the multiclass classification task. Human performance was comparable to that of the automatic FER model in the binary classification task, with a UAR of 91.0%, and superior in the multiclass classification task, with accuracy values ranging from 87.4% to 99.8%.
Conclusions: Future studies are needed to enhance the performance of automatic FER models for practical use in psychotherapeutic apps. Nevertheless, this study represents an important first step toward advancing emotion-focused psychotherapeutic interventions via smartphone apps.

MCML Authors

Simon Pistrosch

Adria Mallol-Ragolta

Björn Schuller

Prof. Dr.

Principal Investigator

[75]

H. Krasowski.
Guaranteeing Complex Safety Specifications for Autonomous Vehicles via Reinforcement Learning with Formal Methods.
Dissertation Jul. 2024. URL

Abstract

Reinforcement learning (RL) solves complicated motion planning tasks for autonomous vehicles. Current RL methods lack safety guarantees. This dissertation combines RL with formal methods that verify safety specifications so that only verified actions are executed. The safe RL approaches are developed for autonomous vehicles and their complex safety specifications. The evaluation confirms the safety guarantees and real-time capability.

MCML Authors

Hanna Krasowski

Dr.

* Former Member

[74]

J. Külz, M. Terzer, M. Magri, A. Giusti and M. Althoff.
Holistic Construction Automation with Modular Robots: From High-Level Task Specification to Execution.
IEEE Transactions on Automation Science and Engineering Early Access (Jun. 2025). DOI

Abstract

In situ robotic automation in construction is challenging due to constantly changing environments, a shortage of robotic experts, and a lack of standardized frameworks bridging robotics and construction practices. This work proposes a holistic framework for construction task specification, optimization of robot morphology, and mission execution using a mobile modular reconfigurable robot. Users can specify and monitor the desired robot behavior through a graphical interface. In contrast to existing, monolithic solutions, we automatically identify a new task-tailored robot for every task by integrating Building Information Modeling (BIM). Our framework leverages modular robot components that enable the fast adaptation of robot hardware to the specific demands of the construction task. Unlike previous works on modular robot optimization, we consider multiple competing objectives, which allow us to explicitly model the challenges of real-world transfer, such as calibration errors. We demonstrate our framework in simulation by optimizing robots for drilling and spray painting. Finally, experimental validation demonstrates that our approach robustly enables the autonomous execution of robotic drilling.

MCML Authors

Jonathan Külz

Matthias Althoff

Prof. Dr.

Principal Investigator

[73]

T. Ladner and M. Althoff.
Automatic Abstraction Refinement in Neural Network Verification Using Sensitivity Analysis.
HSCC 2023 - 26th ACM International Conference on Hybrid Systems: Computation and Control. San Antonio, TX, USA, May 09-12, 2023. DOI

Abstract

The formal verification of neural networks is essential for their application in safety-critical environments. However, the set-based verification of neural networks using linear approximations often obtains overly conservative results, while nonlinear approximations quickly become computationally infeasible in deep neural networks. We address this issue for the first time by automatically balancing between precision and computation time without splitting the propagated set. Our work introduces a novel automatic abstraction refinement approach using sensitivity analysis to iteratively reduce the abstraction error at the neuron level until either the specifications are met or a maximum number of iterations is reached. Our evaluation shows that we can tightly over-approximate the output sets of deep neural networks and that our approach is up to a thousand times faster than a naive approach. We further demonstrate the applicability of our approach in closed-loop settings.

MCML Authors

Tobias Ladner

Matthias Althoff

Prof. Dr.

Principal Investigator

[72]

Abstract

MCML Authors

Sotiris Papatheodorou

* Former Member

Simon Schaefer

Stefan Leutenegger

Prof. Dr.

Principal Investigator

* Former Principal Investigator

[71]

L. Lützow, M. Eichelbeck, M. J. Kochenderfer and M. Althoff.
Zono-Conformal Prediction: Zonotope-Based Uncertainty Quantification for Regression and Classification Tasks.
Preprint (Aug. 2025). arXiv

Abstract

Conformal prediction is a popular uncertainty quantification method that augments a base predictor with prediction sets with statistically valid coverage guarantees. However, current methods are often computationally expensive and data-intensive, as they require constructing an uncertainty model before calibration. Moreover, existing approaches typically represent the prediction sets with intervals, which limits their ability to capture dependencies in multi-dimensional outputs. We address these limitations by introducing zono-conformal prediction, a novel approach inspired by interval predictor models and reachset-conformant identification that constructs prediction zonotopes with assured coverage. By placing zonotopic uncertainty sets directly into the model of the base predictor, zono-conformal predictors can be identified via a single, data-efficient linear program. While we can apply zono-conformal prediction to arbitrary nonlinear base predictors, we focus on feed-forward neural networks in this work. Aside from regression tasks, we also construct optimal zono-conformal predictors in classification settings where the output of an uncertain predictor is a set of possible classes. We provide probabilistic coverage guarantees and present methods for detecting outliers in the identification data. In extensive numerical experiments, we show that zono-conformal predictors are less conservative than interval predictor models and standard conformal prediction methods, while achieving a similar coverage over the test data.

MCML Authors

Laura Lützow

Michael Eichelbeck

Matthias Althoff

Prof. Dr.

Principal Investigator

[70]

Y. Li, M. Milling and B. W. Schuller.
Neuroplasticity in Artificial Intelligence – An Overview and Inspirations on Drop In & Out Learning.
Preprint (Mar. 2025). arXiv

Abstract

Artificial Intelligence (AI) has achieved new levels of performance and spread in public usage with the rise of deep neural networks (DNNs). Initially inspired by human neurons and their connections, NNs have become the foundation of AI models for many advanced architectures. However, some of the most integral processes in the human brain, particularly neurogenesis and neuroplasticity, in addition to the more widespread neuroapoptosis, have largely been ignored in DNN architecture design. Instead, contemporary AI development predominantly focuses on constructing advanced frameworks, such as large language models, which retain a static structure of neural connections during training and inference. In this light, we explore how neurogenesis, neuroapoptosis, and neuroplasticity can inspire future AI advances. Specifically, we examine analogous activities in artificial NNs, introducing the concepts of “dropin” for neurogenesis and revisiting “dropout” and structural pruning for neuroapoptosis. We additionally suggest neuroplasticity combining the two for future large NNs in “life-long learning” settings following the biological inspiration. We conclude by advocating for greater research efforts in this interdisciplinary domain and identifying promising directions for future exploration.

MCML Authors

Manuel Milling

Björn Schuller

Prof. Dr.

Principal Investigator

[69]

Y. Li, M. Milling, L. Specia and B. W. Schuller.
From Audio Deepfake Detection to AI-Generated Music Detection -- A Pathway and Overview.
Preprint (Dec. 2024). arXiv

Abstract

As Artificial Intelligence (AI) technologies continue to evolve, their use in generating realistic, contextually appropriate content has expanded into various domains. Music, an art form and medium for entertainment, deeply rooted into human culture, is seeing an increased involvement of AI into its production. However, despite the effective application of AI music generation (AIGM) tools, the unregulated use of them raises concerns about potential negative impacts on the music industry, copyright and artistic integrity, underscoring the importance of effective AIGM detection. This paper provides an overview of existing AIGM detection methods. To lay a foundation to the general workings and challenges of AIGM detection, we first review general principles of AIGM, including recent advancements in deepfake audios, as well as multimodal detection techniques. We further propose a potential pathway for leveraging foundation models from audio deepfake detection to AIGM detection. Additionally, we discuss implications of these tools and propose directions for future research to address ongoing challenges in the field.

MCML Authors

Manuel Milling

Björn Schuller

Prof. Dr.

Principal Investigator

[68]

Y. Li, Q. Sun, S. M. K. Murthy, E. Alturki and B. W. Schuller.
GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations.
Preprint (Mar. 2025). arXiv

Abstract

Affective Computing (AC) is essential for advancing Artificial General Intelligence (AGI), with emotion recognition serving as a key component. However, human emotions are inherently dynamic, influenced not only by an individual’s expressions but also by interactions with others, and single-modality approaches often fail to capture their full dynamics. Multimodal Emotion Recognition (MER) leverages multiple signals but traditionally relies on utterance-level analysis, overlooking the dynamic nature of emotions in conversations. Emotion Recognition in Conversation (ERC) addresses this limitation, yet existing methods struggle to align multimodal features and explain why emotions evolve within dialogues. To bridge this gap, we propose GatedxLSTM, a novel speech-text multimodal ERC model that explicitly considers voice and transcripts of both the speaker and their conversational partner(s) to identify the most influential sentences driving emotional shifts. By integrating Contrastive Language-Audio Pretraining (CLAP) for improved cross-modal alignment and employing a gating mechanism to emphasise emotionally impactful utterances, GatedxLSTM enhances both interpretability and performance. Additionally, the Dialogical Emotion Decoder (DED) refines emotion predictions by modelling contextual dependencies. Experiments on the IEMOCAP dataset demonstrate that GatedxLSTM achieves state-of-the-art (SOTA) performance among open-source methods in four-class emotion classification. These results validate its effectiveness for ERC applications and provide an interpretability analysis from a psychological perspective.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[67]

Y. Li, S. Shao, M. Milling and B. W. Schuller.
Large Language Models for Depression Recognition in Spoken Language Integrating Psychological Knowledge.
Preprint (May. 2025). arXiv GitHub

Abstract

Depression is a growing concern gaining attention in both public discourse and AI research. While deep neural networks (DNNs) have been used for recognition, they still lack real-world effectiveness. Large language models (LLMs) show strong potential but require domain-specific fine-tuning and struggle with non-textual cues. Since depression is often expressed through vocal tone and behaviour rather than explicit text, relying on language alone is insufficient. Diagnostic accuracy also suffers without incorporating psychological expertise. To address these limitations, we present, to the best of our knowledge, the first application of LLMs to multimodal depression detection using the DAIC-WOZ dataset. We extract the audio features using the pre-trained model Wav2Vec, and map them to text-based LLMs for further processing. We also propose a novel strategy for incorporating psychological knowledge into LLMs to enhance diagnostic performance, specifically using a question and answer set to grant authorised knowledge to LLMs. Our approach yields a notable improvement in both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) compared to a base score proposed by the related original paper.

MCML Authors

Manuel Milling

Björn Schuller

Prof. Dr.

Principal Investigator

[66]

Y. Li, Q. Sun, M. Schlicher, Y. W. Lim and B. W. Schuller.
Artificial Emotion: A Survey of Theories and Debates on Realising Emotion in Artificial Intelligence.
Preprint (Aug. 2025). arXiv

Abstract

Affective Computing (AC) has enabled Artificial Intelligence (AI) systems to recognise, interpret, and respond to human emotions - a capability also known as Artificial Emotional Intelligence (AEI). It is increasingly seen as an important component of Artificial General Intelligence (AGI). We discuss whether, in order to pursue this goal, AI benefits from moving beyond emotion recognition and synthesis to develop internal emotion-like states, which we term Artificial Emotion (AE). This shift potentially allows AI to benefit from the paradigm of ‘inner emotions’ in ways we - as humans - do. Although recent research shows early signs that AI systems may exhibit AE-like behaviours, a clear framework for how emotions can be realised in AI remains underexplored. In this paper, we discuss potential advantages of AE in AI, review current manifestations of AE in machine learning systems, examine emotion-modulated architectures, and summarise mechanisms for modelling and integrating AE into future AI. We also explore the ethical implications and safety risks associated with ‘emotional’ AGI, while concluding with our opinion on how AE could be beneficial in the future.

MCML Authors

Michelle Schlicher

Björn Schuller

Prof. Dr.

Principal Investigator

[65]

Z. Li, Z. Wang, X. Xu, Y. Chen and B. W. Schuller.
Unsupervised Domain-Adaptive Semantic Segmentation for Surgical Instruments Leveraging Dropout-Enhanced Dual Heads and Coarse-Grained Classification Branch.
IEEE Transactions on Medical Robotics and Bionics Early Access (Apr. 2025). DOI

Abstract

Accurate semantic segmentation for surgical instruments is crucial in robot-assisted minimally invasive surgery, mainly regarded as a core module in surgical-instrument tracking and operation guidance. Nevertheless, it is usually difficult for existing semantic surgical-instrument segmentation approaches to adapt to unknown surgical scenes, particularly due to their insufficient consideration for reducing the domain gaps across different scenes. To address this issue, we propose an unsupervised domain-adaptive semantic segmentation approach for surgical instruments, leveraging Dropout-enhanced Dual Heads and Coarse-Grained classification branch (D2HCG). The proposed approach comprises dropout-enhanced dual heads for diverse feature representation, and a coarse-grained classification branch for capturing complexities across varying granularities. This incorporates consistency loss functions targeting fine-grained features and coarse-grained granularities, aiming to reduce cross-scene domain gaps. Afterwards, we perform experiments in cross-scene surgical-instrument semantic segmentation cases, with the experimental results reporting the effectiveness of the proposed approach, compared with state-of-the-art semantic segmentation ones.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[64]

M. Milling, S. Liu, A. Triantafyllopoulos, I. Aslan and B. W. Schuller.
Audio Enhancement for Computer Audition -- An Iterative Training Paradigm Using Sample Importance.
Journal of Computer Science and Technology 39 (Sep. 2024). DOI

Abstract

Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination for real-life applications. To improve audio quality, an enhancement module, which can be developed independently, is explicitly used at the front-end of the target audio applications. In this paper, we present an end-to-end learning solution to jointly optimise the models for audio enhancement (AE) and the subsequent applications. To guide the optimisation of the AE module towards a target application, and especially to overcome difficult samples, we make use of the sample-wise performance measure as an indication of sample importance. In experiments, we consider four representative applications to evaluate our training paradigm, i.e., ASR, speech command recognition (SCR), speech emotion recognition (SER), and ASC. These applications are associated with speech and nonspeech tasks concerning semantic and non-semantic features, transient and global information, and the experimental results indicate that our proposed approach can considerably boost the noise robustness of the models, especially at low signal-to-noise ratios, for a wide range of computer audition tasks in everyday-life noisy environments.

MCML Authors

Manuel Milling

Andreas Triantafyllopoulos

Björn Schuller

Prof. Dr.

Principal Investigator

[63]

A. Mallol-Ragolta, M. Milling and B. W. Schuller.
Multi-Triplet Loss-Based Models for Categorical Depression Recognition from Speech.
IberSPEECH 2024 - 7th Conference IberSPEECH 2024. Aveiro, Portugal, Nov 11-13, 2024. PDF

Abstract

We analyse four different acoustic feature sets towards the automatic recognition of depression from speech signals. Specifically, the feature sets investigated are based on Mel-Frequency Cepstral Coefficients (MFCC), the Low-Level Descriptors (LLD) of the eGeMAPS feature set, Mel-spectrogram coefficients, and pretrained self-supervised Wav2Vec 2.0 representations. The main hypothesis investigated lies in the use of a multi-triplet loss to improve the inter-class separability of the data representations learnt in the embedding space, boosting, ultimately, the overall system performance. To assess this aspect, we implement three different techniques to perform the classification of the embedded representations learnt. These include the combination of two fully connected layers with softmax, a linear support vector classifier, and a clustering-based classifier with k-Means. We conduct our experiments on the Extended Distress Analysis Interview Corpus, released in the Detecting Depression Subchallenge (DDS) of the 9th Audio/Visual Emotion Challenge (AVEC), in 2019. We select the Unweighted Average Recall (UAR) as the evaluation metric. Our best model exploits the eGeMAPS-based feature set, optimises a triplet loss, and utilises a LinearSVC as the classifier. Tackling the task as a 6-class classification problem, this model scores a UAR of 25.7% on the test partition, an increase of 9 percentage points over the chance level.
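As a reading aid for the metric used above, the following minimal sketch computes the Unweighted Average Recall as the mean of the per-class recalls, so that every class counts equally regardless of how many samples it has. The function names and toy labels are assumptions for illustration, and the averaging is taken over the classes present in the reference labels.

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls over the classes that occur in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)

def chance_level(num_classes):
    """UAR obtained by always predicting a single class."""
    return 1.0 / num_classes

# Toy, imbalanced example: a classifier that always predicts class 0.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0]
uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR={uar:.3f}, accuracy={accuracy:.3f}")
print(f"6-class chance level: {chance_level(6):.3f}")
```

In the toy example the accuracy is inflated by the majority class while the UAR stays at the two-class chance level, which is why the metric is preferred for imbalanced corpora.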

MCML Authors

Adria Mallol-Ragolta

Manuel Milling

Björn Schuller

Prof. Dr.

Principal Investigator

[62]

S. M. K. Murthy, K. Rajamani, S. T. Rajamani, Y. Li, Q. Sun and B. W. Schuller.
Automatic Contouring of Spinal Vertebrae on X-Ray using a Novel Sandwich U-Net Architecture.
Preprint (Jul. 2025). arXiv

Abstract

In spinal vertebral mobility disease, accurately extracting and contouring vertebrae is essential for assessing mobility impairments and monitoring variations during flexion-extension movements. Precise vertebral contouring plays a crucial role in surgical planning; however, this process is traditionally performed manually by radiologists or surgeons, making it labour-intensive, time-consuming, and prone to human error. In particular, mobility disease analysis requires the individual contouring of each vertebra, which is both tedious and susceptible to inconsistencies. Automated methods provide a more efficient alternative, enabling vertebra identification, segmentation, and contouring with greater accuracy and reduced time consumption. In this study, we propose a novel U-Net variation designed to accurately segment thoracic vertebrae from anteroposterior view on X-Ray images. Our proposed approach, incorporating a ‘sandwich’ U-Net structure with dual activation functions, achieves a 4.1% improvement in Dice score compared to the baseline U-Net model, enhancing segmentation accuracy while ensuring reliable vertebral contour extraction.
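The Dice score reported above measures the overlap between a predicted mask and a reference mask. A minimal sketch on flattened binary masks follows; the function name, the toy masks, and the convention of returning 1.0 for two empty masks are assumptions made for this illustration, not details taken from the paper.

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    size_sum = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if size_sum == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / size_sum

# Toy 2x2 masks, flattened row by row.
predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
print(f"Dice = {dice_score(predicted, reference):.3f}")
```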

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[61]

M. Milling, S. D. Rampp, A. Triantafyllopoulos, M. P. Plaza, J. O. Brunner, C. Traidl-Hoffmann, B. W. Schuller and A. Damialis.
Automating airborne pollen classification: Identifying and interpreting hard samples for classifiers.
Heliyon 11.2 (Jan. 2025). DOI GitHub

Abstract

Deep-learning-based classification of pollen grains has been a major driver towards automatic monitoring of airborne pollen. Yet, despite an abundance of available datasets, little effort has been spent to investigate which aspects pose the biggest challenges to the (often black-box-resembling) pollen classification approaches. To shed some light on this issue, we conducted a sample-level difficulty analysis based on the likelihood for one of the largest automatically-generated datasets of pollen grains on microscopy images and investigated the reason for which certain airborne samples and specific pollen taxa pose particular problems to deep learning algorithms. It is here concluded that the main challenges lie in A) the (partly) co-occurring of multiple pollen grains in a single image, B) the occlusion of specific markers through the 2D capturing of microscopy images, and C) for some taxa, a general lack of salient, unique features.

MCML Authors

Manuel Milling

Andreas Triantafyllopoulos

Björn Schuller

Prof. Dr.

Principal Investigator

[60]

A. Mallol-Ragolta, A. Spiesberger and B. W. Schuller.
Face Mask Type and Coverage Area Recognition from Speech with Prototypical Networks.
IberSPEECH 2024 - 7th Conference IberSPEECH 2024. Aveiro, Portugal, Nov 11-13, 2024. PDF

Abstract

We investigate the use of prototypical networks on the problems of face mask type (3 classes), face mask coverage area (3 classes), and face mask type and coverage area (5 classes) recognition from speech. We explore the MASCFLICHT Corpus, a dataset containing 2 h 27 m 55 s of speech data from 30 German speakers recorded with a smartphone. We extract formant-related features and the spectrogram representations from the samples. We enrich the spectrograms overlaying the traces of the central frequency of the first four formants. Our experiments also consider the fusion via concatenation of the embedded representations extracted from the formant-related features and the spectrogram representations. We implement classification- and prototypical encoder-based networks. The results obtained on the test sets support the suitability of the prototypical encoder models, scoring an Unweighted Average Recall (UAR) of 49.9%, 45.0%, and 31.6% on the three considered problems, respectively.

MCML Authors

Adria Mallol-Ragolta

Anika Spiesberger

Björn Schuller

Prof. Dr.

Principal Investigator

[59]

A. Mallol-Ragolta, A. Spiesberger, A. B. Salvador and B. W. Schuller.
Prototypical Networks for Speech Emotion Recognition in Spanish.
IberSPEECH 2024 - 7th Conference IberSPEECH 2024. Aveiro, Portugal, Nov 11-13, 2024. PDF

Abstract

We explore the utilisation of prototypical networks in the Speech Emotion Recognition (SER) problem, creating prototypical representations of the targeted emotions in the embeddings space. We hypothesise this technique can help to improve the performance and robustness of the models, in comparison to standard classification-based approaches. We investigate two approaches to train the prototypes: one optimising a triplet loss, and the other minimising a prototypical loss. To assess our hypothesis, we exploit the EmoMatchSpanishDB Corpus; a novel dataset for SER in Spanish, which includes speech samples conveying the six basic emotions defined by Paul Ekman, in addition to the neutral state. We methodologically split the available samples into three speaker-independent train, development, and test partitions. The proposed splitting is not only balanced in terms of the speakers’ gender, but also homogenised in terms of their recognition difficulty. We analyse the performance of our models with a gender perspective. The models exploit the eGeMAPS and the wav2vec 2.0 feature representations extracted from the speech samples. We choose the Unweighted Average Recall (UAR) as the evaluation metric to assess the models’ performance. The chance level UAR for a seven-class classification problem is 14.3%. The models optimising the prototypical loss obtain the highest UAR scores on the test set, 52.0% and 52.7%, with the eGeMAPS and the wav2vec 2.0 representations, respectively. Nevertheless, the best performances are obtained with a Support Vector Classifier (SVC) implementing a radial basis function kernel, with a UAR of 54.4% and 56.9% when exploiting the eGeMAPS and the wav2vec 2.0 representations, respectively.
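The idea of prototypical representations in the embedding space can be illustrated with a small nearest-prototype classifier: each class prototype is the mean of that class's support embeddings, and a query is assigned to the closest prototype. This is only a sketch under assumptions; the two-dimensional toy embeddings, the squared Euclidean distance, and all names are hypothetical, and the trained encoder and loss functions from the paper are not reproduced here.

```python
def class_prototypes(embeddings, labels):
    """Average the support embeddings of each class into one prototype."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(query, prototypes):
    """Return the label of the prototype nearest to the query embedding."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))

# Hypothetical 2-D embeddings for two emotion classes.
support = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
support_labels = ["neutral", "neutral", "anger", "anger"]
prototypes = class_prototypes(support, support_labels)
print(classify([1.0, 1.0], prototypes))
print(classify([4.0, 3.0], prototypes))
```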

MCML Authors

Adria Mallol-Ragolta

Anika Spiesberger

Björn Schuller

Prof. Dr.

Principal Investigator

[58]

L. Mamede, R. C. Sabàb, S. Van Coillie, J. Prevot, S. Sánchez-Ramón, C. Poli, A. Barasa, B. W. Schuller, A. Hendel, N. Garcelon, C. Boersma, P. Lee, C. Booth, L. D. Notarangelo, J. Drabwell, N. L. Rider, F. Staal, S. O. Burns, M. van Hagen, M. Pergrnt, J. G. Rivière and N. Mahlaoui.
Navigating disruption in the PID landscape: embracing opportunities and anticipating threats in the next ten years.
Frontiers in Immunology 16 (May. 2025). DOI

Abstract

The International Patient Organisation for Primary Immunodeficiencies (IPOPI) held its third edition of the Global Multi-Stakeholders’ Summit, gathering key primary immunodeficiencies (PID) stakeholders and experts to discuss and foment global collaboration. This edition focused on the impact of genomic medicine in PID treatment, the role of digital health, including artificial intelligence, in PID care, and how to anticipate and minimise risks to ensure optimal patient access to care. These discussions aimed to examine current hurdles and brainstorm feasible solutions and priorities for the PID community in these areas in the next ten years. These discussions led to recommendations for comprehensive approaches to care and access to treatment for PID patients, suggesting actions that will bring the community closer to treatments based on real-world evidence and adjusted to patients’ needs. To accomplish this, collaboration between academia, industry, regulatory authorities, and patients is crucial.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[57]

W. Mayr, A. Triantafyllopoulos, A. Batliner, B. W. Schuller and T. M. Berghaus.
Assessing the Clinical and Functional Status of COPD Patients Using Speech Analysis During and After Exacerbation.
International Journal of Chronic Obstructive Pulmonary Disease 20 (Jan. 2025). DOI

Abstract

Background: Chronic obstructive pulmonary disease (COPD) affects breathing, speech production, and coughing. We evaluated a machine learning analysis of speech for classifying the disease severity of COPD.
Methods: In this single centre study, non-consecutive COPD patients were prospectively recruited for comparing their speech characteristics during and after an acute COPD exacerbation. We extracted a set of spectral, prosodic, and temporal variability features, which were used as input to a support vector machine (SVM). Our baseline for predicting patient state was an SVM model using self-reported BORG and COPD Assessment Test (CAT) scores.
Results: In 50 COPD patients (52% males, 22% GOLD II, 44% GOLD III, 32% GOLD IV, all patients group E), speech analysis was superior in distinguishing during and after exacerbation status compared to BORG and CAT scores alone by achieving 84% accuracy in prediction. CAT scores correlated with reading rhythm, and BORG scales with stability in articulation. Pulmonary function testing (PFT) correlated with speech pause rate and speech rhythm variability.
Conclusion: Speech analysis may be a viable technology for classifying COPD status, opening up new opportunities for remote disease monitoring.

MCML Authors

Andreas Triantafyllopoulos

Anton Batliner

Dr.

Björn Schuller

Prof. Dr.

Principal Investigator

[56]

Y. Ni, R. Liang, X. Hao, J. Cheng, Q. Wang, C. Huang, C. Zou, W. Zhou, W. Ding and B. W. Schuller.
Affine Modulation-based Audiogram Fusion Network for Joint Noise Reduction and Hearing Loss Compensation.
Information Fusion 127. Part A.103726 (Mar. 2026). DOI GitHub

Abstract

Hearing aids (HAs) are widely used to provide personalized speech enhancement (PSE) services, improving the quality of life for individuals with hearing loss. However, HA performance significantly declines in noisy environments as it treats noise reduction (NR) and hearing loss compensation (HLC) as separate tasks. This separation leads to a lack of systematic optimization, overlooking the interactions between these two critical tasks, and increases the system complexity. To address these challenges, we propose a novel audiogram fusion network, named AFN-HearNet, which simultaneously tackles the NR and HLC tasks by fusing cross-domain audiogram and spectrum features. We propose an audiogram-specific encoder that transforms the sparse audiogram profile into a deep representation, addressing the alignment problem of cross-domain features prior to fusion. To incorporate the interactions between NR and HLC tasks, we propose the affine modulation-based audiogram fusion frequency-temporal Conformer that adaptively fuses these two features into a unified deep representation for speech reconstruction. Furthermore, we introduce a voice activity detection auxiliary training task to embed speech and non-speech patterns into the unified deep representation implicitly. We conduct comprehensive experiments across multiple datasets to validate the effectiveness of each proposed module. The results indicate that the AFN-HearNet significantly outperforms state-of-the-art in-context fusion joint models regarding key metrics such as HASQI and PESQ, achieving a considerable trade-off between performance and efficiency.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[55]

J. Naumann, B. Xu, S. Leutenegger and X. Zuo.
NeRF-VO: Real-Time Sparse Visual Odometry With Neural Radiance Fields.
IEEE Robotics and Automation Letters 9.8 (Aug. 2024). DOI

Abstract

We introduce a novel monocular visual odometry (VO) system, NeRF-VO, that integrates learning-based sparse visual odometry for low-latency camera tracking and a neural radiance scene representation for fine-detailed dense reconstruction and novel view synthesis. Our system initializes camera poses using sparse visual odometry and obtains view-dependent dense geometry priors from a monocular prediction network. We harmonize the scale of poses and dense geometry, treating them as supervisory cues to train a neural implicit scene representation. NeRF-VO demonstrates exceptional performance in both photometric and geometric fidelity of the scene representation by jointly optimizing a sliding window of keyframed poses and the underlying dense geometry, which is accomplished through training the radiance field with volume rendering. We surpass SOTA methods in pose estimation accuracy, novel view synthesis fidelity, and dense reconstruction quality across a variety of synthetic and real-world datasets while achieving a higher camera tracking frequency and consuming less GPU memory.

MCML Authors

Stefan Leutenegger

Prof. Dr.

Principal Investigator

* Former Principal Investigator

[54]

D. Ostermeier, J. Külz and M. Althoff.
Automatic Geometric Decomposition for Analytical Inverse Kinematics.
IEEE Robotics and Automation Letters 10.10 (Oct. 2025). DOI

Abstract

Calculating the inverse kinematics (IK) is a fundamental challenge in robotics. Compared to numerical or learning-based approaches, analytical IK provides higher efficiency and accuracy. However, existing analytical approaches are difficult to use in most applications, as they require human ingenuity in the derivation process, are numerically unstable, or rely on time-consuming symbolic manipulation. In contrast, we propose a method that, for the first time, enables an analytical IK derivation and computation in less than a millisecond in total. Our work is based on an automatic online decomposition of the IK into pre-solved, numerically stable subproblems via a kinematic classification of the respective manipulator. In numerical experiments, we demonstrate that our approach is orders of magnitude faster in deriving the IK than existing tools that employ symbolic manipulation. Following this one-time derivation, our method matches and often surpasses baselines, such as IKFast, in terms of speed and accuracy during the computation of explicit IK solutions. Finally, we provide an open-source C++ toolbox with Python wrappers that substantially reduces the entry barrier to using analytical IK in applications like rapid prototyping and kinematic robot design.

MCML Authors

Jonathan Külz

Matthias Althoff

Prof. Dr.

Principal Investigator

[53]

F. B. Pokorny and K. D. Bartl-Pokorny.
Editorial: Artificial intelligence for child health and wellbeing.
Frontiers in Digital Health 7 (Sep. 2025). DOI

Abstract

n/a

MCML Authors

Florian Pokorny

Dr.

[52]

S. Papatheodorou, S. Boche, S. Laina and S. Leutenegger.
Efficient Submap-based Autonomous MAV Exploration using Visual-Inertial SLAM Configurable for LiDARs or Depth Cameras.
ICRA 2025 - IEEE International Conference on Robotics and Automation. Atlanta, GA, USA, May 19-23, 2025. DOI

Abstract

Autonomous exploration of unknown space is an essential component for the deployment of mobile robots in the real world. Safe navigation is crucial for all robotics applications and requires accurate and consistent maps of the robot’s surroundings. To achieve full autonomy and allow deployment in a wide variety of environments, the robot must rely on onboard state estimation which is prone to drift over time. We propose a Micro Aerial Vehicle (MAV) exploration framework based on local submaps to allow retaining global consistency by applying loop-closure corrections to the relative submap poses. To enable large-scale exploration we efficiently compute global, environment-wide frontiers from the local submap frontiers and use a sampling-based next-best-view exploration planner. Our method seamlessly supports using either a LiDAR sensor or a depth camera, making it suitable for different kinds of MAV platforms. We perform comparative evaluations in simulation against a state-of-the-art submap-based exploration framework to showcase the efficiency and reconstruction quality of our approach. Finally, we demonstrate the applicability of our method to real-world MAVs, one equipped with a LiDAR and the other with a depth camera.

MCML Authors

Sotiris Papatheodorou

* Former Member

→ Group Stefan Leutenegger

* Former Principal Investigator

Stefan Leutenegger

Prof. Dr.

Principal Investigator

* Former Principal Investigator

[51]

P. Purucker, C. Reil, A. Höß and B. W. Schuller.
Deep Neural Quality of Service Prediction for Unmanned Aircraft System Communications.
IWCMC 2024 - 20th International Wireless Communications and Mobile Computing Conference. Cyprus, Greece, May 27-31, 2024. DOI

Abstract

Commercial Unmanned Aircraft Systems (UAS) have a wide range of applications, including package delivery, inspection and search and rescue missions. For the operation of Unmanned Aircraft Vehicles (UAV) Beyond Visual Line of Sight (BVLOS), reliable long-range communication is essential. The cellular network is one possible solution, but there are issues such as signal loss and frequent handovers at higher altitudes. To mitigate these issues, our work proposes the use of two cellular links from different providers prioritised according to Quality of Service (QoS) prediction. We evaluate multiple AI-based model architectures for the prediction, and find that the model consisting of Gated Recurrent Units (GRU) and convolutional layers outperforms the others. The models are trained and tested on real-world data and show a reduction in latency peaks, thereby increasing connection resilience. Moreover, the prediction pipeline is designed to be executable on the UAV side and is not limited to a specific geographical area, making it applicable to real-world scenarios. Finally, we present a pre-flight path planning algorithm that takes QoS into account when calculating the flight path in order to further improve communication. To support the research community, we publicly share the dataset used to obtain our results.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[50]

W. Qiu, Y. Feng, Y. Li, Y. Chang, K. Qian, B. Hu, Y. Yamamoto and B. W. Schuller.
Fed-MStacking: Heterogeneous Federated Learning With Stacking Misaligned Labels for Abnormal Heart Sound Detection.
IEEE Journal of Biomedical and Health Informatics 28.9 (Jul. 2024). DOI

Abstract

Ubiquitous sensing has been widely applied in smart healthcare, providing an opportunity for intelligent heart sound auscultation. However, smart devices contain sensitive information, raising user privacy concerns. To this end, federated learning (FL) has been adopted as an effective solution, enabling decentralised learning without data sharing, thus preserving data privacy in the Internet of Health Things (IoHT). Nevertheless, traditional FL requires the same architectural models to be trained across local clients and global servers, leading to a lack of model heterogeneity and client personalisation. For medical institutions with private data clients, this study proposes Fed-MStacking, a heterogeneous FL framework that incorporates a stacking ensemble learning strategy to support clients in building their own models. The secondary objective of this study is to address scenarios involving local clients with data characterised by inconsistent labelling. Specifically, the local client contains only one case type, and the data cannot be shared within or outside the institution. To train a global multi-class classifier, we aggregate missing class information from all clients at each institution and build meta-data, which then participates in FL training via a meta-learner. We apply the proposed framework to a multi-institutional heart sound database. The experiments utilise random forests (RFs), feedforward neural networks (FNNs), and convolutional neural networks (CNNs) as base classifiers. The results show that the heterogeneous stacking of local models performs better compared to homogeneous stacking.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[49]

W. Qiu, C. Quan, L. Zhu, Y. Yu, Z. Wang, Y. Ma, M. Sun, Y. Chang, K. Qian, B. Hu, Y. Yamamoto and B. W. Schuller.
Heart Sound Abnormality Detection From Multi-Institutional Collaboration: Introducing a Federated Learning Framework.
IEEE Transactions on Biomedical Engineering 71.10 (May. 2024). DOI

Abstract

Objective: Early diagnosis of cardiovascular diseases is a crucial task in medical practice. With the application of computer audition in the healthcare field, artificial intelligence (AI) has been applied to clinical non-invasive intelligent auscultation of heart sounds to provide rapid and effective pre-screening. However, AI models generally require large amounts of data which may cause privacy issues. Unfortunately, it is difficult to collect large amounts of healthcare data from a single centre.
Methods: In this study, we propose federated learning (FL) optimisation strategies for the practical application in multi-centre institutional heart sound databases. The horizontal FL is mainly employed to tackle the privacy problem by aligning the feature spaces of FL participating institutions without information leakage. In addition, techniques based on deep learning have poor interpretability due to their “black-box” property, which limits the feasibility of AI in real medical data. To this end, vertical FL is utilised to address the issues of model interpretability and data scarcity.
Conclusion: Experimental results demonstrate that the proposed FL framework can achieve good performance for heart sound abnormality detection by taking the personal privacy protection into account. Moreover, using the federated feature space is beneficial to balance the interpretability of the vertical FL and the privacy of the data.
Significance: This work realises the potential of FL from research to clinical practice, and is expected to have extensive application in the federated smart medical system.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[48]

X. Qiu, W. Qiu, Y. Zhang, K. Qian, C. Li, B. Hu, B. W. Schuller and Y. Yamamoto.
FedKDC: Consensus-Driven Knowledge Distillation for Personalized Federated Learning in EEG-Based Emotion Recognition.
IEEE Journal of Biomedical and Health Informatics Early Access (Apr. 2025). DOI GitHub

Abstract

Federated learning (FL) has gained prominence in electroencephalogram (EEG)-based emotion recognition because of its ability to enable secure collaborative training without centralized data. However, traditional FL faces challenges due to model and data heterogeneity in smart healthcare settings. For example, medical institutions have varying computational resources, which creates a need for personalized local models. Moreover, EEG data from medical institutions typically face data heterogeneity issues stemming from limitations in participant availability, ethical constraints, and cultural differences among subjects, which can slow model convergence and degrade model performance. To address these challenges, we propose FedKDC, a novel FL framework that incorporates clustered knowledge distillation (CKD). This method introduces a consensus-based distributed learning mechanism to facilitate the clustering process. It then enhances the convergence speed through intraclass distillation and reduces the negative impact of heterogeneity through interclass distillation. Additionally, we introduce a DriftGuard mechanism to mitigate client drift, along with an entropy reducer to decrease the entropy of aggregated knowledge. The framework is validated on the SEED, SEED-IV, SEED-FRA, and SEED-GER datasets, demonstrating its effectiveness in scenarios where both the data and the models are heterogeneous. Experimental results show that FedKDC outperforms other FL frameworks in emotion recognition, achieving a maximum average accuracy of 85.2%, and in convergence efficiency, with faster and more stable convergence.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[47]

W. Qi, X. Xu, K. Qian, B. W. Schuller, G. Fortino and A. Aliverti.
A Review of AIoT-Based Human Activity Recognition: From Application to Technique.
IEEE Journal of Biomedical and Health Informatics 29.4 (Apr. 2025). DOI

Abstract

This scoping review paper redefines the Artificial Intelligence-based Internet of Things (AIoT) driven Human Activity Recognition (HAR) field by systematically extrapolating from various application domains to deduce potential techniques and algorithms. We distill a general model with adaptive learning and optimization mechanisms by conducting a detailed analysis of human activity types and utilizing contact or non-contact devices. It presents various system integration mathematical paradigms driven by multimodal data fusion, covering predictions of complex behaviors and redefining valuable methods, devices, and systems for HAR. Additionally, this paper establishes benchmarks for behavior recognition across different application requirements, from simple localized actions to group activities. It summarizes open research directions, including data diversity and volume, computational limitations, interoperability, real-time recognition, data security, and privacy concerns. Finally, we aim to serve as a comprehensive and foundational resource for researchers delving into the complex and burgeoning realm of AIoT-enhanced HAR, providing insights and guidance for future innovations and developments.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[46]

Z. Ren, Y. Chang, T. T. Nguyen, Y. Tan, K. Qian and B. W. Schuller.
A Comprehensive Survey on Heart Sound Analysis in the Deep Learning Era.
IEEE Computational Intelligence Magazine 19.3 (Aug. 2024). DOI

Abstract

Heart sound auscultation has been applied in clinical usage for early screening of cardiovascular diseases. Due to the high demand for auscultation expertise, automatic auscultation can help with auxiliary diagnosis and reduce the burden of training professional clinicians. Nevertheless, there is a limit to classic machine learning’s performance improvement in the era of Big Data. Deep learning has outperformed classic machine learning in many research fields, as it employs more complex model architectures with a stronger capability of extracting effective representations. Moreover, it has been successfully applied to heart sound analysis in the past years. As most review works about heart sound analysis were carried out before 2017, the present survey is the first to provide a comprehensive overview summarising papers on heart sound analysis with deep learning published in 2017–2022. This work introduces both classic machine learning and deep learning for comparison, and further offers insights into the advances and future research directions in deep learning for heart sound analysis.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[45]

A. Mallol-Ragolta, M. Gonzalez-Machorro, R. von Heynitz, K. Scherzer, I. Cordts and B. W. Schuller.
Early Detection of ALS in Absence of Speech Impairments with Computer Audition.
AIME 2025 - 23rd International Conference on Artificial Intelligence in Medicine. Pavia, Italy, Jun 23-26, 2025. DOI

Abstract

We investigate whether Amyotrophic Lateral Sclerosis (ALS) can be detected in patients without speech impairments utilising computer audition techniques. We exploit the information embedded in the patients’ speech while performing five different speech tasks: producing the sustained vowel /a:/, repeating the syllables /da/-/da/ and /da/-/ba/ (separately), reading a text passage, and describing a picture. The implemented models are task-dedicated, as they are solely trained and assessed with the speech samples of the corresponding task. We conduct our experiments on the novel, German-speaking AIMnd dataset. We define the Unweighted Average Recall (UAR) as the evaluation metric. When differentiating ALS patients with normal speech from controls – binary classification –, the best models, which obtain a UAR score of 88% on the Test set, mostly exploit the speech samples corresponding to the /da/-/ba/ task. When including the ALS patients with, at least, detectable speech disturbances in the detection – three-class classification –, the best model on the Test set scores a UAR of 70%, also exploiting the speech samples corresponding to the /da/-/ba/ task.
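For readers unfamiliar with the metric named in this abstract, the following is a minimal sketch of how Unweighted Average Recall is commonly computed: the recall of each class is calculated separately and the results are averaged with equal weight, so a small class counts as much as a large one. The function name and the toy labels are hypothetical and are not taken from the paper or the AIMnd dataset.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls; every class counts equally."""
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: four ALS samples, two controls.
y_true = ["ALS", "ALS", "ALS", "ALS", "control", "control"]
y_pred = ["ALS", "ALS", "ALS", "control", "control", "control"]

# Recall(ALS) = 3/4, recall(control) = 2/2, so UAR = (0.75 + 1.0) / 2.
uar = unweighted_average_recall(y_true, y_pred)
print(f"UAR = {uar:.3f}")
```

On imbalanced data this differs from plain accuracy (here 5 of 6 correct), which is one reason the metric is popular in computer audition benchmarks.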

MCML Authors

Adria Mallol-Ragolta

Björn Schuller

Prof. Dr.

Principal Investigator

[44]

S. Rampp, M. Milling, A. Triantafyllopoulos and B. W. Schuller.
Does the Definition of Difficulty Matter? Scoring Functions and their Role for Curriculum Learning.
Preprint (Nov. 2024). arXiv

Abstract

Curriculum learning (CL) describes a machine learning training strategy in which samples are gradually introduced into the training process based on their difficulty. Despite a partially contradictory body of evidence in the literature, CL finds popularity in deep learning research due to its promise of leveraging human-inspired curricula to achieve higher model performance. Yet, the subjectivity and biases that follow any necessary definition of difficulty, especially for those found in orderings derived from models or training statistics, have rarely been investigated. To shed more light on the underlying unanswered questions, we conduct an extensive study on the robustness and similarity of the most common scoring functions for sample difficulty estimation, as well as their potential benefits in CL, using the popular benchmark dataset CIFAR-10 and the acoustic scene classification task from the DCASE2020 challenge as representatives of computer vision and computer audition, respectively. We report a strong dependence of scoring functions on the training setting, including randomness, which can partly be mitigated through ensemble scoring. While we do not find a general advantage of CL over uniform sampling, we observe that the ordering in which data is presented for CL-based training plays an important role in model performance. Furthermore, we find that the robustness of scoring functions across random seeds positively correlates with CL performance. Finally, we uncover that models trained with different CL strategies complement each other by boosting predictive power through late fusion, likely due to differences in the learnt concepts. Alongside our findings, we release the aucurriculum toolkit, implementing sample difficulty and CL-based training in a modular fashion.

MCML Authors

Manuel Milling

Andreas Triantafyllopoulos

Björn Schuller

Prof. Dr.

Principal Investigator

[43]

Z. Ren, S. Pistrosch, B. Coşkun, K. Scheck, A. Batliner, B. W. Schuller and T. Schultz.
An Introduction to Silent Paralinguistics.
Preprint (Aug. 2025). arXiv

Abstract

The ability to speak is an inherent part of human nature and fundamental to our existence as a social species. Unfortunately, this ability can be restricted in certain situations, such as for individuals who have lost their voice or in environments where speaking aloud is unsuitable. Additionally, some people may prefer not to speak audibly due to privacy concerns. For such cases, silent speech interfaces have been proposed, which focus on processing biosignals corresponding to silently produced speech. These interfaces enable synthesis of audible speech from biosignals that are produced when speaking silently and recognition aka decoding of biosignals into text that corresponds to the silently produced speech. While recognition and synthesis of silent speech has been a prominent focus in many research studies, there is a significant gap in deriving paralinguistic information such as affective states from silent speech. To fill this gap, we propose Silent Paralinguistics, aiming to predict paralinguistic information from silent speech and ultimately integrate it into the reconstructed audible voice for natural communication. This survey provides a comprehensive look at methods, research strategies, and objectives within the emerging field of silent paralinguistics.

MCML Authors

Simon Pistrosch

Anton Batliner

Dr.

Björn Schuller

Prof. Dr.

Principal Investigator

[42]

S. T. Rajamani, K. Rajamani, A. J, K. R and B. W. Schuller.
CBAM_SAUNet: A novel attention U-Net for effective segmentation of corner cases.
EMBC 2024 - 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. Orlando, FL, USA, Jul 15-19, 2024. DOI

Abstract

U-Net has been demonstrated to be effective for the task of medical image segmentation. Additionally, integrating attention mechanisms into U-Net has been shown to yield significant benefits. The Shape Attentive U-Net (SAUNet) is one such recently proposed attention U-Net that also focuses on interpretability. Furthermore, recent research has focused on identification and reporting of corner cases in segmentation to accelerate the utilisation of deep learning models in clinical practice. However, achieving good model performance on such corner cases is a less-explored research area. In this paper, we propose CBAM_SAUNet which enhances the dual attention decoder block of SAUNet to improve its performance on corner cases. We achieve this by utilising a novel variant of the Convolutional Block Attention Module (CBAM)’s channel attention in the decoder block of SAUNet. We demonstrate the effectiveness of CBAM_SAUNet in the Automated Cardiac Diagnosis Challenge (ACDC) cardiac MRI segmentation challenge. Our proposed novel approach results in improvement in the Dice scores of 12% for Left Ventricle (LV) as well as Right Ventricle (RV) segmentation and 8% for Myocardium (MYO) for the identified corner-case dataset.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[41]

A. Mallol-Ragolta and B. W. Schuller.
Coupling Sentiment and Arousal Analysis Towards an Affective Dialogue Manager.
IEEE Access 12 (Feb. 2024). DOI

Abstract

We present the technologies and host components developed to power a speech-based dialogue manager with affective capabilities. The overall goal is that the system adapts its response to the sentiment and arousal level of the user inferred by analysing the linguistic and paralinguistic information embedded in his or her interaction. A linguistic-based, dedicated sentiment analysis component determines the body of the system response. A paralinguistic-based, dedicated arousal recognition component adjusts the energy level to convey in the affective system response. The sentiment analysis model is trained using the CMU-MOSEI dataset and implements a hierarchical contextual attention fusion network, which scores an Unweighted Average Recall (UAR) of 79.04% on the test set when tackling the task as a binary classification problem. The arousal recognition model is trained using the MSP-Podcast corpus. This model extracts the Mel-spectrogram representations of the speech signals, which are exploited with a Convolutional Neural Network (CNN) trained from scratch, and scores a UAR of 61.11% on the test set when tackling the task as a three-class classification problem. Furthermore, we highlight two sample dialogues implemented at the system back-end to detail how the sentiment and arousal inferences are coupled to determine the affective system response. These are also showcased in a proof of concept demonstrator. We publicly release the trained models to provide the research community with off-the-shelf sentiment analysis and arousal recognition tools.

MCML Authors

Adria Mallol-Ragolta

Björn Schuller

Prof. Dr.

Principal Investigator

[40]

S. Rampp, A. Triantafyllopoulos, M. Milling and B. W. Schuller.
autrainer: A Modular and Extensible Deep Learning Toolkit for Computer Audition Tasks.
Preprint (Nov. 2024). arXiv

Abstract

This work introduces the key operating principles for autrainer, our new deep learning training framework for computer audition tasks. autrainer is a PyTorch-based toolkit that allows for rapid, reproducible, and easily extensible training on a variety of different computer audition tasks. Concretely, autrainer offers low-code training and supports a wide range of neural networks as well as preprocessing routines. In this work, we present an overview of its inner workings and key capabilities.

MCML Authors

Andreas Triantafyllopoulos

Manuel Milling

Björn Schuller

Prof. Dr.

Principal Investigator

[39]

Q. Sun, A. Akman, X. Jing, M. Milling and B. W. Schuller.
Audio-based Kinship Verification Using Age Domain Conversion.
IEEE Signal Processing Letters 32 (Dec. 2024). DOI

Abstract

Audio-based kinship verification (AKV) is important in many domains, such as home security monitoring, forensic identification, and social network analysis. A key challenge in the task arises from differences in age across samples from different individuals, which can be interpreted as a domain bias in a cross-domain verification task. To address this issue, we design the notion of an ‘age-standardised domain’ wherein we utilise the optimised CycleGAN-VC3 network to perform age-audio conversion to generate the in-domain audio. The generated audio dataset is employed to extract a range of features, which are then fed into a metric learning architecture to verify kinship. Experiments are conducted on the KAN_AV audio dataset, which contains age and kinship labels. The results demonstrate that the method markedly enhances the accuracy of kinship verification, while also offering novel insights for future kinship verification research.

MCML Authors

Xin Jing

Manuel Milling

Björn Schuller

Prof. Dr.

Principal Investigator

[38]

Q. Sun, A. Akman and B. W. Schuller.
Explainable Artificial Intelligence for Medical Applications: A Review.
Preprint (Nov. 2024). arXiv

Abstract

The continuous development of artificial intelligence (AI) theory has propelled this field to unprecedented heights, owing to the relentless efforts of scholars and researchers. In the medical realm, AI takes a pivotal role, leveraging robust machine learning (ML) algorithms. AI technology in medical imaging aids physicians in X-ray, computed tomography (CT) scans, and magnetic resonance imaging (MRI) diagnoses, conducts pattern recognition and disease prediction based on acoustic data, delivers prognoses on disease types and developmental trends for patients, and employs intelligent health management wearable devices with human-computer interaction technology to name but a few. While these well-established applications have significantly assisted in medical field diagnoses, clinical decision-making, and management, collaboration between the medical and AI sectors faces an urgent challenge: How to substantiate the reliability of decision-making? The underlying issue stems from the conflict between the demand for accountability and result transparency in medical scenarios and the black-box model traits of AI. This article reviews recent research grounded in explainable artificial intelligence (XAI), with an emphasis on medical practices within the visual, audio, and multimodal perspectives. We endeavour to categorise and synthesise these practices, aiming to provide support and guidance for future researchers and healthcare professionals.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[37]

K. R. Scherer, F. Burkhardt, U. D. Reichel, F. Eyben and B. W. Schuller.
Using voice analysis as an early indicator of risk for depression in young adults.
Preprint (Nov. 2024). arXiv

Abstract

Increasingly frequent publications in the literature report voice quality differences between depressed patients and controls. Here, we examine the possibility of using voice analysis as an early warning signal for the development of emotion disturbances in young adults. As part of a major interdisciplinary European research project in four countries (ECoWeB), examining the effects of web-based prevention programs to reduce the risk for depression in young adults, we analyzed a large number of acoustic voice characteristics in vocal reports of emotions experienced by the participants on a specific day. We were able to identify a number of significant differences in acoustic cues, particularly with respect to the energy distribution in the voice spectrum, encouraging further research efforts to develop promising non-obtrusive risk indicators in the normal speaking voice. This is particularly important in the case of young adults who are less likely to exhibit standard risk factors for depression such as negative life experiences.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

[36]

Abstract

MCML Authors

Lu Sang

Björn Häfner

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Xingxing Zuo

Dr.

* Former Member

→ Group Stefan Leutenegger

* Former Principal Investigator

Daniel Cremers

Prof. Dr.

Director

[35]

Z. Sun, J. Kang, K. Qian, B. W. Schuller and B. Hu.
Creating Healthier Living Environments: The Role of Soundscapes in Promoting Mental Health and Well-Being.
IEEE Transactions on Computational Social Systems 12.1 (Feb. 2025). DOI

Abstract

With great pride and anticipation, we present the first issue of IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS (TCSS) for 2025. Reflecting on the remarkable achievements of 2024, this past year stands as a testament to academic excellence and prolific scholarly output. Over the course of the year, our journal published an impressive 642 high-quality articles, totaling approximately 5800 pages, distributed across six issues. These works collectively underscore the vibrant growth and interdisciplinary impact of computational social systems.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[34]

R. Stolz, H. Krasowski, J. Thumm, M. Eichelbeck, P. Gassert and M. Althoff.
Excluding the Irrelevant: Focusing Reinforcement Learning through Continuous Action Masking.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL

Abstract

Continuous action spaces in reinforcement learning (RL) are commonly defined as multidimensional intervals. While intervals usually reflect the action boundaries for tasks well, they can be challenging for learning because the typically large global action space leads to frequent exploration of irrelevant actions. Yet, little task knowledge can be sufficient to identify significantly smaller state-specific sets of relevant actions. Focusing learning on these relevant actions can significantly improve training efficiency and effectiveness. In this paper, we propose to focus learning on the set of relevant actions and introduce three continuous action masking methods for exactly mapping the action space to the state-dependent set of relevant actions. Thus, our methods ensure that only relevant actions are executed, enhancing the predictability of the RL agent and enabling its use in safety-critical applications. We further derive the implications of the proposed methods on the policy gradient. Using proximal policy optimization (PPO), we evaluate our methods on four control tasks, where the relevant action set is computed based on the system dynamics and a relevant state set. Our experiments show that the three action masking methods achieve higher final rewards and converge faster than the baseline without action masking.

MCML Authors

Michael Eichelbeck

Philipp Gassert

* Former Member

Matthias Althoff

Prof. Dr.

Principal Investigator

[33]

Q. Sun, Y. Li, E. Alturki, S. M. K. Murthy and B. W. Schuller.
Towards Friendly AI: A Comprehensive Review and New Perspectives on Human-AI Alignment.
Preprint (Dec. 2024). arXiv

Abstract

As Artificial Intelligence (AI) continues to advance rapidly, Friendly AI (FAI) has been proposed to advocate for more equitable and fair development of AI. Despite its importance, there is a lack of comprehensive reviews examining FAI from an ethical perspective, as well as limited discussion on its potential applications and future directions. This paper addresses these gaps by providing a thorough review of FAI, focusing on theoretical perspectives both for and against its development, and presenting a formal definition in a clear and accessible format. Key applications are discussed from the perspectives of eXplainable AI (XAI), privacy, fairness and affective computing (AC). Additionally, the paper identifies challenges in current technological advancements and explores future research avenues. The findings emphasise the significance of developing FAI and advocate for its continued advancement to ensure ethical and beneficial AI development.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[32]

M. Schlicher, Y. Li, S. M. K. Murthy, Q. Sun and B. W. Schuller.
Emotionally Adaptive Support: A Narrative Review of Affective Computing for Mental Health.
Frontiers in Digital Health 7 (Sep. 2025). DOI

Abstract

Digital mental health interventions (DMHIs) have become increasingly prominent as scalable solutions to address global mental health needs. However, many existing tools lack the emotional sensitivity required to foster meaningful engagement and therapeutic effectiveness. Affective computing, a field focused on designing systems capable of detecting and responding to human emotions, offers promising advancements to the emotional responsiveness of these digital interventions. This narrative review examines how affective computing methods such as emotion recognition, sentiment analysis, emotion synthesis, and audiovisual and physiological signal processing, are being integrated into DMHIs to enhance user engagement and improve clinical outcomes. The findings suggest that emotionally adaptive systems can strengthen user engagement, simulate empathy, and support more personalized care. Early studies indicate potential benefits in terms of symptom reduction and user satisfaction, though clinical validation remains limited. Challenges such as algorithmic bias, privacy concerns, and the need for ethical design frameworks continue to shape the development of this emerging field. By synthesizing current trends, technological advancements, and ethical considerations, this review highlights the potential of affective computing in digital mental health and identifies key directions for future research and implementation.

MCML Authors

Michelle Schlicher

Björn Schuller

Prof. Dr.

Principal Investigator

[31]

B. W. Schuller, A. Mallol-Ragolta, A. P. Almansa, I. Tsangko, M. M. Amin, A. Semertzidou, L. Christ and S. Amiriparian.
Affective Computing Has Changed: The Foundation Model Disruption.
Preprint (Sep. 2024). arXiv

Abstract

The dawn of Foundation Models has on the one hand revolutionised a wide range of research problems, and, on the other hand, democratised the access and use of AI-based tools by the general public. We even observe an incursion of these models into disciplines related to human psychology, such as the Affective Computing domain, suggesting their affective, emerging capabilities. In this work, we aim to raise awareness of the power of Foundation Models in the field of Affective Computing by synthetically generating and analysing multimodal affective data, focusing on vision, linguistics, and speech (acoustics). We also discuss some fundamental problems, such as ethical issues and regulatory aspects, related to the use of Foundation Models in this research area.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

Adria Mallol-Ragolta

Iosif Tsangko

Shahin Amiriparian

Dr.

[30]

O. Schrüfer, M. Milling, F. Burkhardt, F. Eyben and B. W. Schuller.
Are you sure? Analysing Uncertainty Quantification Approaches for Real-world Speech Emotion Recognition.
INTERSPEECH 2024 - 25th Annual Conference of the International Speech Communication Association. Kos Island, Greece, Sep 01-05, 2024. PDF

Abstract

Uncertainty Quantification (UQ) is an important building block for the reliable use of neural networks in real-world scenarios, as it can be a useful tool in identifying faulty predictions. Speech emotion recognition (SER) models can suffer from particularly many sources of uncertainty, such as the ambiguity of emotions, Out-of-Distribution (OOD) data or, in general, poor recording conditions. Reliable UQ methods are thus of particular interest as in many SER applications no prediction is better than a faulty prediction. While the effects of label ambiguity on uncertainty are well documented in the literature, we focus our work on an evaluation of UQ methods for SER under common challenges in real-world application, such as corrupted signals, and the absence of speech. We show that simple UQ methods can already give an indication of the uncertainty of a prediction and that training with additional OOD data can greatly improve the identification of such signals.

MCML Authors

Manuel Milling

Björn Schuller

Prof. Dr.

Principal Investigator

[29]

A. Spiesberger, A. Mallol-Ragolta, A. Triantafyllopoulos and B. W. Schuller.
Towards Predicting Menstrual Cycle Phases Exploiting Paralinguistic Features.
EMBC 2024 - 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. Orlando, FL, USA, Jul 15-19, 2024. DOI

Abstract

As a growing number of people focus on understanding their bodies, the menstrual cycle and its impact on reproduction are gaining attention. Several studies have shown that the voice changes during the menstrual cycle. However, existing research primarily employs comparative analysis to detect these differences. This paper proposes using machine learning methods to analyse paralinguistic features extracted from women’s voices for predicting menstrual cycle phases. We leverage available data recorded during the menstrual and late follicular phases of 44 naturally cycling women. Using eight paralinguistic features, we achieve an accuracy of 60%, showcasing the feasibility of classifying those two phases using speech signals. We discuss implications and suggest future research avenues, such as the need to use personalised approaches.

MCML Authors

Anika Spiesberger

Adria Mallol-Ragolta

Andreas Triantafyllopoulos

Björn Schuller

Prof. Dr.

Principal Investigator

[28]

A. Spiesberger, A. Triantafyllopoulos, A. Kathan, A. Semertzidou, C. Gawrilow, T. Reinelt, W. Rauch and B. W. Schuller.
'So... my child...' -- How Child ADHD Influences the Way Parents Talk.
INTERSPEECH 2024 - 25th Annual Conference of the International Speech Communication Association. Kos Island, Greece, Sep 01-05, 2024. PDF

Abstract

Attention-deficit/hyperactivity disorder (ADHD) exerts a psychological burden not only on affected individuals but also on their social support systems. Of particular interest are the parents, who often face challenges related to their child’s condition, including its impact on their own mental well-being. The interaction among the child’s symptomatology, parental mental health, and the parent-child relationship is a crucial area of investigation. Expressed Emotion (EE), as assessed through the Preschool Five Minute Speech Sample (PFMSS), serves as a valuable measure. However, manual annotation of EE can be cumbersome and impractical for continuous monitoring. To address this, we propose leveraging machine learning methods. This study presents an initial exploration into predicting children’s ADHD diagnosis using linguistic and paralinguistic features derived from the PFMSS. Despite achieving a UAR score of 67.1%, our results have not surpassed the performance of manually annotated EE.

MCML Authors

Anika Spiesberger

Andreas Triantafyllopoulos

Alexander Kathan

Björn Schuller

Prof. Dr.

Principal Investigator

[27]

Y. Sun, Y. Zhou, X. Xu, J. Qi, F. Xu, Z. Ren and B. W. Schuller.
Weakly-Supervised Depression Detection in Speech Through Self-Learning Based Label Correction.
IEEE Transactions on Audio, Speech and Language Processing Early Access (Jan. 2025). DOI

Abstract

Automated Depression Detection (ADD) in speech aims to automatically estimate one’s depressive attributes through artificial intelligence tools towards spoken signals. Nevertheless, existing speech-based ADD works fail to sufficiently consider weakly-supervised cases with inaccurate labels, which may typically appear in intelligent mental health. In this regard, we propose the Self-Learning-based Label Correction (SLLC) approach for weakly-supervised depression detection in speech. The proposed approach employs a self-learning manner connecting a label correction module and a depression detection module. Within the approach, the label correction module fuses likelihood-ratio-based and prototype-based label correction strategies in order to effectively correct the inaccurate labels, while the depression detection module aims at detecting depressed samples through a 1D convolutional recurrent neural network with multiple types of losses. The experimental results on two depression detection corpora show that our proposed SLLC approach performs better compared with existing state-of-the-art speech-based depression detection approaches, in the case of weak supervision with inaccurate labels for depression detection in speech.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[26]

L. Shen, H. Zhang, C. Zhu, R. Li, K. Qian, W. Meng, F. Tian, B. Hu, B. W. Schuller and Y. Yamamoto.
A First Look at Generative Artificial Intelligence Based Music Therapy for Mental Disorders.
IEEE Transactions on Consumer Electronics Early Access (Dec. 2024). DOI

Abstract

Mental disorders have increased rapidly in the recent decade and cause considerable harm to individuals as well as society. Hence, mental disorders have become a serious public health challenge in today's society. Timely treatment of mental disorders plays a critical role in reducing the harm of mental illness to individuals and society. Music therapy is a non-pharmaceutical method for treating such mental disorders. However, conventional music therapy suffers from a number of issues resulting in a lack of popularity. The rapid development of Artificial Intelligence (AI), especially AI Generated Content (AIGC), provides a chance to address these issues. Nevertheless, to the best of our knowledge, there is no work investigating music therapy from an AIGC and closed-loop perspective. In this paper, we summarise some universal music therapy methods and discuss their shortcomings. Then, we indicate some AIGC techniques, especially music generation, for their application in music therapy. Moreover, we present a closed-loop music therapy system and introduce its implementation details. Finally, we discuss some challenges in AIGC-based music therapy, propose further research directions, and suggest the potential of this system to become a consumer-grade product for treating mental disorders.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[25]

L. Shen, H. Zhang, C. Zhu, R. Li, K. Qian, F. Tian, B. Hu, B. W. Schuller and Y. Yamamoto.
Enhancing Emotion Regulation in Mental Disorder Treatment: An AIGC-based Closed-Loop Music Intervention System.
IEEE Transactions on Affective Computing Early Access (Apr. 2025). DOI

Abstract

Mental disorders have increased rapidly and have emerged as a serious social health issue in the recent decade. Undoubtedly, the timely treatment of mental disorders is crucial. Emotion regulation has been proven to be an effective method for treating mental disorders. Music therapy, as one of the methods that can achieve emotional regulation, has gained increasing attention in the field of mental disorder treatment. However, traditional music therapy methods still face some unresolved issues, such as the lack of real-time capability and the inability to form closed-loop systems. With the advancement of artificial intelligence (AI), especially AI-generated content (AIGC), AI-based music therapy holds promise in addressing these issues. In this paper, an AIGC-based closed-loop music intervention system demonstration is proposed to regulate emotions for mental disorder treatment. This system demonstration consists of an emotion recognition model and a music generation model. The emotion recognition model can assess mental states, while the music generation model generates the corresponding emotional music for regulation. The system continuously performs recognition and regulation, thus forming a closed-loop process. We first conduct experiments on both the emotion recognition model and the music generation model to validate the accuracy of the recognition model and the quality of the music produced by the generation model. Finally, we conduct comprehensive tests on the entire system to verify its feasibility and effectiveness.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[24]

A. Triantafyllopoulos, A. Batliner, S. Rampp, M. Milling and B. W. Schuller.
INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition.
INTERSPEECH 2024 - 25th Annual Conference of the International Speech Communication Association. Kos Island, Greece, Sep 01-05, 2024. DOI

Abstract

We revisit the INTERSPEECH 2009 Emotion Challenge – the first ever speech emotion recognition (SER) challenge – and evaluate a series of deep learning models that are representative of the major advances in SER research in the time since then. We start by training each model using a fixed set of hyperparameters, and further fine-tune the best-performing models of that initial setup with a grid search. Results are always reported on the official test set with a separate validation set only used for early stopping. Most models score below or close to the official baseline, while they marginally outperform the original challenge winners after hyperparameter tuning. Our work illustrates that, despite recent progress, FAU-AIBO remains a very challenging benchmark. An interesting corollary is that newer methods do not consistently outperform older ones, showing that progress towards ‘solving’ SER is not necessarily monotonic.

MCML Authors

Andreas Triantafyllopoulos

Anton Batliner

Dr.

Manuel Milling

Björn Schuller

Prof. Dr.

Principal Investigator

[23]

A. Triantafyllopoulos, A. Batliner and B. W. Schuller.
Charting 15 years of progress in deep learning for speech emotion recognition: A replication study.
Preprint (Aug. 2025). arXiv

Abstract

Speech emotion recognition (SER) has long benefited from the adoption of deep learning methodologies. Deeper models – with more layers and more trainable parameters – are generally perceived as being ‘better’ by the SER community. This raises the question – how much better are modern-era deep neural networks compared to their earlier iterations? Beyond that, the more important question of how to move forward remains as poignant as ever. SER is far from a solved problem; therefore, identifying the most prominent avenues of future research is of paramount importance. In the present contribution, we attempt a quantification of progress in the 15 years of research beginning with the introduction of the landmark 2009 INTERSPEECH Emotion Challenge. We conduct a large scale investigation of model architectures, spanning both audio-based models that rely on speech inputs and text-based models that rely solely on transcriptions. Our results point towards diminishing returns and a plateau after the recent introduction of transformer architectures. Moreover, we demonstrate how perceptions of progress are conditioned on the particular selection of models that are compared. Our findings have important repercussions about the state-of-the-art in SER research and the paths forward.

MCML Authors

Andreas Triantafyllopoulos

Anton Batliner

Dr.

Björn Schuller

Prof. Dr.

Principal Investigator

[22]

A. Triantafyllopoulos, L. Christ, A. Gebhard, X. Jing, A. Kathan, M. Milling, I. Tsangko, S. Amiriparian and B. W. Schuller.
Beyond deep learning: Charting the next frontiers of affective computing.
Intelligent Computing 3.0089 (Sep. 2024). DOI

Abstract

Affective computing (AC), like most other areas of computational research, has benefited tremendously from advances in deep learning (DL). These advances have opened up new horizons in AC research and practice. Yet, as DL dominates the community’s attention, there is a danger of overlooking other emerging trends in artificial intelligence (AI) research. Furthermore, over-reliance on one particular technology may lead to stagnating progress. In an attempt to foster the exploration of complementary directions, we provide a concise, easily digestible overview of emerging trends in AI research that stand to play a vital role in solving some of the remaining challenges in AC research. Our overview is driven by the limitations of the current state of the art as it pertains to AC.

MCML Authors

Andreas Triantafyllopoulos

Alexander Gebhard

Xin Jing

Alexander Kathan

Manuel Milling

Iosif Tsangko

Shahin Amiriparian

Dr.

Björn Schuller

Prof. Dr.

Principal Investigator

[21]

A. Triantafyllopoulos, A. Gebhard, M. Milling, S. Rampp and B. W. Schuller.
An Automatic Analysis of Ultrasound Vocalisations for the Prediction of Interaction Context in Captive Egyptian Fruit Bats.
EUSIPCO 2024 - 32nd European Signal Processing Conference. Lyon, France, Aug 26-30, 2024. DOI

Abstract

Prior work in computational bioacoustics has mostly focused on the detection of animal presence in a particular habitat. However, animal sounds contain much richer information than mere presence; among others, they encapsulate the interactions of those animals with other members of their species. Studying these interactions is almost impossible in a naturalistic setting, as the ground truth is often lacking. The use of animals in captivity instead offers a viable alternative pathway. However, most prior works follow a traditional, statistics-based approach to analysing interactions. In the present work, we go beyond this standard framework by attempting to predict the underlying context in interactions between captive Rousettus Aegyptiacus using deep neural networks. We reach an unweighted average recall of over 30% - more than thrice the chance level - and show error patterns that differ from our statistical analysis. This work thus represents an important step towards the automatic analysis of states in animals from sound.
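For readers unfamiliar with the metric quoted above: unweighted average recall (UAR) is the mean of the per-class recalls, so every class counts equally regardless of its size and the chance level is 1/K for K classes. The following is a minimal sketch of that definition only. The function name and the toy labels are hypothetical and are not taken from the paper or its data.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # number of samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels (NOT from the paper): three interaction contexts.
y_true = ["feeding", "feeding", "feeding", "feeding", "mating", "sleeping"]
y_pred = ["feeding", "feeding", "feeding", "mating", "mating", "feeding"]

uar = unweighted_average_recall(y_true, y_pred)
chance = 1 / len(set(y_true))  # chance-level UAR for K balanced guesses
print(f"UAR = {uar:.3f}, chance level = {chance:.3f}")
```

Because the majority class does not dominate the score, UAR is a common choice for imbalanced recognition tasks such as this one.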

MCML Authors

Andreas Triantafyllopoulos

Alexander Gebhard

Manuel Milling

Björn Schuller

Prof. Dr.

Principal Investigator

[20]

A. Triantafyllopoulos and B. W. Schuller.
Expressivity and Speech Synthesis.
Oxford Handbook of Expressivity in Language (Apr. 2024). arXiv URL

Abstract

Imbuing machines with the ability to talk has been a longtime pursuit of artificial intelligence (AI) research. From the very beginning, the community has not only aimed to synthesise high-fidelity speech that accurately conveys the semantic meaning of an utterance, but also to colour it with inflections that cover the same range of affective expressions that humans are capable of. After many years of research, it appears that we are on the cusp of achieving this when it comes to single, isolated utterances. This unveils an abundance of potential avenues to explore when it comes to combining these single utterances with the aim of synthesising more complex, longer-term behaviours. In the present chapter, we outline the methodological advances that brought us so far and sketch out the ongoing efforts to reach that coveted next level of artificial expressivity. We also discuss the societal implications coupled with rapidly advancing expressive speech synthesis (ESS) technology and highlight ways to mitigate those risks and ensure the alignment of ESS capabilities with ethical norms.

MCML Authors

Andreas Triantafyllopoulos

Björn Schuller

Prof. Dr.

Principal Investigator

[19]

A. Triantafyllopoulos and B. W. Schuller.
Enrolment-based personalisation for improving individual-level fairness in speech emotion recognition.
INTERSPEECH 2024 - 25th Annual Conference of the International Speech Communication Association. Kos Island, Greece, Sep 01-05, 2024. PDF

Abstract

The expression of emotion is highly individualistic. However, contemporary speech emotion recognition (SER) systems typically rely on population-level models that adopt a ‘one-size-fits-all’ approach for predicting emotion. Moreover, standard evaluation practices measure performance also on the population level, thus failing to characterise how models work across different speakers. In the present contribution, we present a new method for capitalising on individual differences to adapt an SER model to each new speaker using a minimal set of enrolment utterances. In addition, we present novel evaluation schemes for measuring fairness across different speakers. Our findings show that aggregated evaluation metrics may obfuscate fairness issues on the individual-level, which are uncovered by our evaluation, and that our proposed method can improve performance both in aggregated and disaggregated terms.

MCML Authors

Andreas Triantafyllopoulos

Björn Schuller

Prof. Dr.

Principal Investigator

[18]

A. Triantafyllopoulos and B. W. Schuller.
Hearing aids in the era of foundation models.
GMS Zeitschrift für Audiologie 6.28 (Dec. 2024). DOI

Abstract

The recent introduction of foundation models (FMs) has taken the world by storm. Ranging from large language models (LLMs) to image and audio analysis and generation, FMs have introduced a new paradigm in artificial intelligence (AI), one where practitioners transition from standard supervised machine learning to prompting and in-context learning. This has implications for hearing aid research, and specifically for the use of such models for noise attenuation and speech enhancement. Even though the uptake of FMs is minimal to non-existent for this application domain, mainly due to the prohibitive computational complexity of those models, there are nevertheless ways to benefit from FM advances in an indirect way. We review these approaches in the present contribution.

MCML Authors

Andreas Triantafyllopoulos

Björn Schuller

Prof. Dr.

Principal Investigator

[17]

A. Triantafyllopoulos, A. Spiesberger, I. Tsangko, X. Jing, V. Distler, F. Dietz, F. Alt and B. W. Schuller.
Vishing: Detecting social engineering in spoken communication — A first survey & urgent roadmap to address an emerging societal challenge.
Computer Speech and Language 94.101802 (Nov. 2025). DOI

Abstract

Vishing – the use of voice calls for phishing – is a form of Social Engineering (SE) attacks. The latter have become a pervasive challenge in modern societies, with over 300,000 yearly victims in the US alone. An increasing number of those attacks is conducted via voice communication, be it through machine-generated ‘robocalls’ or human actors. The goals of ‘social engineers’ can be manifold, from outright fraud to more subtle forms of persuasion. Accordingly, social engineers adopt multi-faceted strategies for voice-based attacks, utilising a variety of ‘tricks’ to exert influence and achieve their goals. Importantly, while organisations have set in place a series of guardrails against other types of SE attacks, voice calls still remain ‘open ground’ for potential bad actors. In the present contribution, we provide an overview of the existing speech technology subfields that need to coalesce into a protective net against one of the major challenges to societies worldwide. Given the dearth of speech science and technology works targeting this issue, we have opted for a narrative review that bridges the gap between the existing psychological literature on the topic and research that has been pursued in parallel by the speech community on some of the constituent constructs. Our review reveals that very little literature exists on addressing this very important topic from a speech technology perspective, an omission further exacerbated by the lack of available data. Thus, our main goal is to highlight this gap and sketch out a roadmap to mitigate it, beginning with the psychological underpinnings of vishing, which primarily include deception and persuasion strategies, continuing with the speech-based approaches that can be used to detect those, as well as the generation and detection of AI-based vishing attempts, and close with a discussion of ethical and legal considerations.

MCML Authors

Andreas Triantafyllopoulos

Anika Spiesberger

Iosif Tsangko

Xin Jing

Björn Schuller

Prof. Dr.

Principal Investigator

[16]

I. Tsangko, A. Triantafyllopoulos, A. Abdelmoula, A. Mallol-Ragolta and B. W. Schuller.
Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition.
Preprint (Jun. 2025). arXiv

Abstract

Foundation Models (FMs) are rapidly transforming Affective Computing (AC), with Vision Language Models (VLMs) now capable of recognising emotions in zero-shot settings. This paper probes a critical but underexplored question: what visual cues do these models rely on to infer affect, and are these cues psychologically grounded or superficially learnt? We benchmark VLMs of varying scale on a teeth-annotated subset of the AffectNet dataset and find consistent performance shifts depending on the presence of visible teeth. Through structured introspection of the best-performing model, i.e., GPT-4o, we show that facial attributes like eyebrow position drive much of its affective reasoning, revealing a high degree of internal consistency in its valence-arousal predictions. These patterns highlight the emergent nature of FM behaviour, but also reveal risks: shortcut learning, bias, and fairness issues, especially in sensitive domains like mental health and education.

MCML Authors

Iosif Tsangko

Andreas Triantafyllopoulos

Adria Mallol-Ragolta

Björn Schuller

Prof. Dr.

Principal Investigator

[15]

A. Triantafyllopoulos, I. Tsangko, A. Gebhard, A. Mesaros, T. Virtanen and B. W. Schuller.
Computer Audition: From Task-Specific Machine Learning to Foundation Models.
Proceedings of the IEEE Early Access (Aug. 2025). DOI

Abstract

Foundation models (FMs) are increasingly spearheading recent advances on a variety of tasks that fall under the purview of computer audition—i.e., the use of machines to understand sounds. They feature several advantages over traditional pipelines: among others, the ability to consolidate multiple tasks in a single model, the option to leverage knowledge from other modalities, and the readily available interaction with human users. Naturally, these promises have created substantial excitement in the audio community and have led to a wave of early attempts to build new, general-purpose FMs for audio. In the present contribution, we give an overview of computational audio analysis as it transitions from traditional pipelines toward auditory FMs. Our work highlights the key operating principles that underpin those models and showcases how they can accommodate multiple tasks that the audio community previously tackled separately.

MCML Authors

Andreas Triantafyllopoulos

Iosif Tsangko

Alexander Gebhard

Björn Schuller

Prof. Dr.

Principal Investigator

[14]

I. Tsangko, A. Triantafyllopoulos, E. Kyriakidis, G. Margetis and B. W. Schuller.
Large Language Models for the Analysis of Project Proposals.
AI-HCI 2025 - 6th International Conference on Artificial Intelligence in Human Computer Interaction. Gothenburg, Sweden, Jun 22-27, 2025. DOI

Abstract

We introduce a framework that integrates traditional topic modeling methods, namely Latent Dirichlet Allocation (LDA) and BERTopic, with Large Language Models (LLMs) to automatically identify topics featured in project proposals for the cultural heritage (CH) domain. Applied to a dataset of 1,757 English project proposals aimed at protecting and promoting CH in Africa, our approach begins by extracting initial topics using LDA and BERTopic. These topics are further refined by LLaMA3, generating precise and semantically meaningful categories that incorporate domain expert-curated labels to ensure contextual relevance. The consistency of assigned labels is evaluated using automatic classification. Additionally, we explore the role of linguistic features, such as sentence complexity, sentiment analysis, and gendered language, as predictors of proposal success. Results highlight the potential of combining traditional topic modeling with LLMs to uncover hidden insights into funding allocation patterns, aiming to enhance the equitable distribution of resources in CH projects.

MCML Authors

Iosif Tsangko

Andreas Triantafyllopoulos

Björn Schuller

Prof. Dr.

Principal Investigator

[13]

I. Tsangko, A. Triantafyllopoulos, M. Müller, H. Schröter and B. W. Schuller.
DFingerNet: Noise-Adaptive Speech Enhancement for Hearing Aids.
ICASSP 2025 - IEEE International Conference on Acoustics, Speech and Signal Processing. Hyderabad, India, Apr 06-11, 2025. DOI

Abstract

The DeepFilterNet (DFN) architecture was recently proposed as a deep learning model suited for hearing aid devices. Despite its competitive performance on numerous benchmarks, it still follows a ‘one-size-fits-all’ approach, which aims to train a single, monolithic architecture that generalises across different noises and environments. However, its limited size and computation budget can hamper its generalisability. Recent work has shown that in-context adaptation can improve performance by conditioning the denoising process on additional information extracted from background recordings to mitigate this. These recordings can be offloaded outside the hearing aid, thus improving performance while adding minimal computational overhead. We introduce these principles to the DFN model, thus proposing the DFingerNet (DFiN) model, which shows superior performance on various benchmarks inspired by the DNS Challenge.

MCML Authors

Iosif Tsangko

Andreas Triantafyllopoulos

Björn Schuller

Prof. Dr.

Principal Investigator

[12]

A. Triantafyllopoulos, Y. Terhorst, I. Tsangko, F. B. Pokorny, K. D. Bartl-Pokorny, L. Seizer, A. Klein, J. Chim, D. Atzil-Slonim, M. Liakata, M. Bühner, J. Löchner and B. W. Schuller.
Large language models for mental health.
Preprint (Nov. 2024). arXiv

Abstract

Digital technologies have long been explored as a complement to standard procedure in mental health research and practice, ranging from the management of electronic health records to app-based interventions. The recent emergence of large language models (LLMs), both proprietary and open-source ones, represents a major new opportunity on that front. Yet there is still a divide between the community developing LLMs and the one which may benefit from them, thus hindering the beneficial translation of the technology into clinical use. This divide largely stems from the lack of a common language and understanding regarding the technology’s inner workings, capabilities, and risks. Our narrative review attempts to bridge this gap by providing intuitive explanations behind the basic concepts related to contemporary LLMs.

MCML Authors

Andreas Triantafyllopoulos

Iosif Tsangko

Florian Pokorny

Dr.

Björn Schuller

Prof. Dr.

Principal Investigator

[11]

F. Tian, H. Zhang, Y. Tan, L. Zhu, L. Shen, K. Qian, B. Hu, B. W. Schuller and Y. Yamamoto.
An On-Board Executable Multi-Feature Transfer-Enhanced Fusion Model for Three-Lead EEG Sensor-Assisted Depression Diagnosis.
IEEE Journal of Biomedical and Health Informatics 29.1 (Jan. 2025). DOI

Abstract

The development of affective computing and medical electronic technologies has led to the emergence of Artificial Intelligence (AI)-based methods for the early detection of depression. However, previous studies have often overlooked the necessity for the AI-assisted diagnosis system to be wearable and accessible in practical scenarios for depression recognition. In this work, we present an on-board executable multi-feature transfer-enhanced fusion model for our custom-designed wearable three-lead Electroencephalogram (EEG) sensor, based on EEG data collected from 73 depressed patients and 108 healthy controls. Experimental results show that the proposed model exhibits low-computational complexity (65.0 K parameters), promising Floating-Point Operations (FLOPs) performance (25.6 M), real-time processing (1.5 s/execution), and low power consumption (320.8 mW). Furthermore, it requires only 202.0 KB of Random Access Memory (RAM) and 279.6 KB of Read-Only Memory (ROM) when deployed on the EEG sensor. Despite its low computational and spatial complexity, the model achieves a notable classification accuracy of 95.2%, specificity of 94.0%, and sensitivity of 96.9% under independent test conditions. These results underscore the potential of deploying the model on the wearable three-lead EEG sensor for assisting in the diagnosis of depression.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[10]

T. Walter, H. Markgraf, J. Külz and M. Althoff.
Leveraging Analytic Gradients in Provably Safe Reinforcement Learning.
IEEE Open Journal of Control Systems Early Access (Sep. 2025). DOI

Abstract

The deployment of autonomous robots in safety-critical applications requires safety guarantees. Provably safe reinforcement learning is an active field of research that aims to provide such guarantees using safeguards. These safeguards should be integrated during training to reduce the sim-to-real gap. While there are several approaches for safeguarding sampling-based reinforcement learning, analytic gradient-based reinforcement learning often achieves superior performance from fewer environment interactions. However, there is no safeguarding approach for this learning paradigm yet. Our work addresses this gap by developing the first effective safeguard for analytic gradient-based reinforcement learning. We analyse existing, differentiable safeguards, adapt them through modified mappings and gradient formulations, and integrate them into a state-of-the-art learning algorithm and a differentiable simulation. Using numerical experiments on three control tasks, we evaluate how different safeguards affect learning. The results demonstrate safeguarded training without compromising performance.

MCML Authors

Jonathan Külz

Matthias Althoff

Prof. Dr.

Principal Investigator

[9]

X. Xu, B. W. Schuller, E. André and E. Cambria.
Guest Editorial Extremely Low-Resource Autonomous Affective Learning.
IEEE Transactions on Affective Computing 16.3 (Sep. 2025). DOI

Abstract

n/a

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[8]

J. Xie, Y. Shi, D. Ni, M. Milling, S. Liu, J. Zhang, K. Qian and B. W. Schuller.
Automatic Bird Sound Source Separation Based on Passive Acoustic Devices in Wild Environment.
IEEE Internet of Things Journal 11.9 (Jan. 2024). DOI

Abstract

The Internet of Things (IoT)-based passive acoustic monitoring (PAM) has shown great potential in large-scale remote bird monitoring. However, field recordings often contain overlapping signals, making precise bird information extraction challenging. To solve this challenge, first, the interchannel spatial feature is chosen as complementary information to the spectral feature to obtain additional spatial correlations between the sources. Then, an end-to-end model named BACPPNet is built based on Deeplabv3plus and enhanced with the polarized self-attention mechanism to estimate the spectral magnitude mask (SMM) for separating bird vocalizations. Finally, the separated bird vocalizations are recovered from SMMs and the spectrogram of mixed audio using the inverse short-time Fourier transform (ISTFT). We evaluate our proposed method utilizing the generated mixed data set. Experiments have shown that our method can separate bird vocalizations from mixed audio with root mean square error (RMSE), source-to-distortion ratio (SDR), source-to-interference ratio (SIR), source-to-artifact ratio (SAR), and short-time objective intelligibility (STOI) values of 2.82, 10.00 dB, 29.90 dB, 11.08 dB, and 0.66, respectively, which are better than existing methods. Furthermore, the average classification accuracy of the separated bird vocalizations drops the least. This indicates that our method outperforms other compared separation methods in bird sound separation and preserves the fidelity of the separated sound sources, which might help us better understand wild bird sound recordings.
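As a rough illustration of the masking step mentioned in this abstract, the sketch below applies a spectral magnitude mask to a mixture magnitude spectrogram. It assumes the commonly used definition of the SMM as the ratio of source magnitude to mixture magnitude per time-frequency bin. It is not BACPPNet: here the mask is computed from a known source (an "oracle" mask) purely for demonstration, the function names and toy values are hypothetical, and the final transform back to a waveform is omitted.

```python
def oracle_smm(source_mag, mixture_mag, eps=1e-8):
    """Oracle spectral magnitude mask: |S| / |Y| per time-frequency bin.

    In a real system a network would estimate this mask from the mixture;
    eps avoids division by zero in silent bins.
    """
    return [
        [s / (y + eps) for s, y in zip(row_s, row_y)]
        for row_s, row_y in zip(source_mag, mixture_mag)
    ]


def apply_mask(mixture_mag, mask):
    """Element-wise product of the mixture magnitude and the mask."""
    return [
        [y * g for y, g in zip(row_y, row_g)]
        for row_y, row_g in zip(mixture_mag, mask)
    ]


# Hypothetical 2x2 magnitude spectrograms (rows: frames, columns: frequency bins).
mixture_mag = [[2.0, 4.0], [1.0, 0.5]]
source_mag = [[1.0, 4.0], [0.5, 0.0]]

mask = oracle_smm(source_mag, mixture_mag)
estimate = apply_mask(mixture_mag, mask)
print(estimate)
```

In practice the masked magnitude is typically combined with the phase of the mixture before inversion to the time domain.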

MCML Authors

Manuel Milling

Björn Schuller

Prof. Dr.

Principal Investigator

[7]

J. Xie, Y. Wang, X. Qian, J. Zhang and B. W. Schuller.
Improving Bird Vocalization Recognition in Open-Set Cross-Corpus Scenarios with Semantic Feature Reconstruction and Dual Strategy Scoring.
IEEE Signal Processing Letters 32 (Mar. 2025). DOI

Abstract

Automated recognition of bird vocalizations (BVs) is essential for biodiversity monitoring through passive acoustic monitoring (PAM), yet deep learning (DL) models encounter substantial challenges in open environments. These include difficulties in detecting unknown classes, extracting species-specific features, and achieving robust cross-corpus recognition. To address these challenges, this letter presents a DL-based open-set cross-corpus recognition method for BVs that combines feature construction with open-set recognition (OSR) techniques. We introduce a three-channel spectrogram that integrates both amplitude and phase information to enhance feature representation. To improve the recognition accuracy of known classes across corpora, we employ a class-specific semantic reconstruction model to extract deep features. For unknown class discrimination, we propose a Dual Strategy Coupling Scoring (DSCS) mechanism, which synthesizes the log-likelihood ratio score (LLRS) and reconstruction error score (RES). Our method achieves the highest weighted accuracy among existing approaches on a public dataset, demonstrating its effectiveness for open-set cross-corpus bird vocalization recognition.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[6]

Y. Xin, X. Zuo, D. Lu and S. Leutenegger.
SimpleMapping: Real-time visual-inertial dense mapping with deep multi-view stereo.
ISMAR 2023 - IEEE/ACM International Symposium on Mixed and Augmented Reality. Sydney, Australia, Oct 16-20, 2023. DOI

Abstract

We present a real-time visual-inertial dense mapping method capable of performing incremental 3D mesh reconstruction with high quality using only sequential monocular images and inertial measurement unit (IMU) readings. 6-DoF camera poses are estimated by a robust feature-based visual-inertial odometry (VIO), which also generates noisy sparse 3D map points as a by-product. We propose a sparse point aided multi-view stereo neural network (SPA-MVSNet) that can effectively leverage the informative but noisy sparse points from the VIO system. The sparse depth from VIO is firstly completed by a single-view depth completion network. This dense depth map, although naturally limited in accuracy, is then used as a prior to guide our MVS network in the cost volume generation and regularization for accurate dense depth prediction. Predicted depth maps of keyframe images by the MVS network are incrementally fused into a global map using TSDF-Fusion. We extensively evaluate both the proposed SPA-MVSNet and the entire dense mapping system on several public datasets as well as our own dataset, demonstrating the system’s impressive generalization capabilities and its ability to deliver high-quality 3D reconstruction online. Our proposed dense mapping system achieves a 39.7% improvement in F-score over existing systems when evaluated on the challenging scenarios of the EuRoC dataset.
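The incremental fusion step named in this abstract can be pictured with the classic per-voxel TSDF update, in which each voxel keeps a running weighted average of truncated signed distances. The sketch below shows that textbook update for a single voxel. It is an assumption-laden simplification, not the authors' implementation: the function name, the constant per-observation weight, and the toy distances are hypothetical, and real systems also skip voxels that lie far behind the observed surface.

```python
def tsdf_update(tsdf, weight, sdf_obs, trunc, obs_weight=1.0):
    """Fuse one signed-distance observation into a voxel.

    tsdf, weight : current truncated signed distance (in [-1, 1]) and its weight
    sdf_obs      : observed signed distance to the surface, in metres
    trunc        : truncation distance, in metres
    """
    # Normalise by the truncation distance and clamp to [-1, 1].
    d = max(-1.0, min(1.0, sdf_obs / trunc))
    new_weight = weight + obs_weight
    # Running weighted average of all observations seen so far.
    new_tsdf = (tsdf * weight + d * obs_weight) / new_weight
    return new_tsdf, new_weight


# Hypothetical observations of one voxel from three keyframe depth maps.
voxel = (0.0, 0.0)  # (tsdf, weight) of an unobserved voxel
for sdf in (0.02, 0.04, 0.50):
    voxel = tsdf_update(voxel[0], voxel[1], sdf, trunc=0.1)
print(voxel)
```

Averaging in this way lets noisy per-frame depth predictions cancel out, which is one reason TSDF fusion suits online reconstruction.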

MCML Authors

Xingxing Zuo

Dr.

* Former Member

→ Group Stefan Leutenegger

* Former Principal Investigator

Stefan Leutenegger

Prof. Dr.

Principal Investigator

* Former Principal Investigator

[5]

Y. Yang, R. Liang, Y. Ni, Y. Xie, C. Zou and B. W. Schuller.
A Non-intrusive Speech Quality Evaluation Framework for Hearing Aids Based on Speech Label Assistance and Multi-task Learning Strategy.
IEEE Transactions on Audio, Speech and Language Processing Early Access (Jul. 2025). DOI

Abstract

Accurate evaluation of hearing aid speech quality is crucial for optimizing the auditory experience of hearing-impaired people. Aiming at the shortcomings of existing methods that rely on clean reference signals and do not take into account the effects of differences in Prescription Formula (PF), this paper proposes a non-intrusive speech quality evaluation framework based on speech label assistance, and multi-task learning strategy, termed MTSE-LA. The framework effectively mitigates evaluation bias caused by PF variations and effectively improves the prediction accuracy of speech quality metrics. MTSE-LA consists of three core modules: a feature extraction module, a label classification module, and a score prediction module. The feature extraction module extracts deep frame-level features from speech using a joint Convolutional Neural Network and Bidirectional Long Short-term Memory network (CNN-BiLSTM) model. The label classification module, acting as a pre-trained network, identifies PF labels and embeds them into the extracted frame-level features, which are then fed into the speech quality prediction branch of the multi-task score prediction module. Effective prediction of speech intelligibility is achieved by introducing the output vectors of the modulation filter bank to the speech intelligibility prediction branch to ensure synergy in the multi-task learning process. Moreover, each prediction branch uses the multi-head self-attention mechanism to capture contextual information and model the importance of speech frames. Experimental results demonstrate that MTSE-LA considerably improves the prediction accuracy of the Hearing Aid Speech Quality Index (HASQI) under multiple PF configurations and different degrees of hearing loss conditions. Compared with existing cutting-edge methods, the proposed framework exhibits higher correlation and fitting accuracy, establishing its reliability and superiority in the field of non-intrusive speech quality evaluation for hearing aids.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[4]

Z. Yang, M. Song, X. Jing, H. Zhang, K. Qian, B. Hu, K. Tamada, T. Takumi, B. W. Schuller and Y. Yamamoto.
MADUV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge.
Preprint (Jan. 2025). arXiv

Abstract

The Mice Autism Detection via Ultrasound Vocalization (MADUV) Challenge introduces the first INTERSPEECH challenge focused on detecting autism spectrum disorder (ASD) in mice through their vocalizations. Participants are tasked with developing models to automatically classify mice as either wild-type or ASD models based on recordings with a high sampling rate. Our baseline system employs a simple CNN-based classification using three different spectrogram features. Results demonstrate the feasibility of automated ASD detection, with the considered audible-range features achieving the best performance (UAR of 0.600 for segment-level and 0.625 for subject-level classification). This challenge bridges speech technology and biomedical research, offering opportunities to advance our understanding of ASD models through machine learning approaches. The findings suggest promising directions for vocalization analysis and highlight the potential value of audible and ultrasound vocalizations in ASD detection.

MCML Authors

Xin Jing

Björn Schuller

Prof. Dr.

Principal Investigator

[3]

H. Zhang, F. Tian, Y. Tan, L. Shen, E. Li, J. Ma, J. Liu, K. Qian, J. Li, B. Hu, Y. Yamamoto and B. W. Schuller.
Towards Practical Colorectal Cancer Diagnosis: A Bowel Sound-Based System with Portable Sensor and On-Board Lightweight AI Model.
IEEE Internet of Things Journal Early Access (Aug. 2025). DOI

Abstract

Colorectal Cancer (CRC) is one of the leading causes of cancer-related deaths worldwide, and early screening plays a crucial role in improving patient outcomes. In this study, we present a novel AI-assisted CRC diagnostic system using Bowel Sound (BS) signals. We first develop two portable BS acquisition devices with distinct form factors for high-fidelity signal capture in both clinical and home-care scenarios. A total of 221 recordings were collected under an expert-guided protocol, with 144 CRC recordings and 59 Non-CRC healthy controls using the developed device. To enable low-resource deployment, we design a lightweight deep learning model optimized for real-time, on-board inference. The model incorporates multiple training strategies, including transfer learning on a large-scale public BS dataset, self-supervised temporal feature learning, and a hybrid semi- and weakly-supervised approach that leverages both unlabeled and real-noise data. Furthermore, a Sound Event Detection (SED) attention mechanism and iterative consistency learning are introduced to enhance the model’s sensitivity to BS activity. The proposed model comprises only 264.7 K parameters and 253.2 M Floating-Point Operations (FLOPs), requiring 1.57 MB of RAM and 1.03 MB of FLASH when deployed on a microcontroller. It performs inference in approximately 3.4 s with low power consumption, making it well-suited for low-resource environments. Despite its compact design, the model achieves 93.06% classification accuracy, 96.46% sensitivity, and 86.99% specificity for binary classification in CRC diagnosis. These results demonstrate the system’s potential for accessible and cost-effective CRC screening in community, home, and rural healthcare scenarios.
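As a reminder of how the three reported figures relate to one another, the sketch below computes accuracy, sensitivity, and specificity from binary confusion counts. It is a generic illustration, not the authors' evaluation code: the function name and the counts are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, and specificity from confusion counts.

    tp/fn count positive (e.g. CRC) cases that were detected/missed;
    tn/fp count negative (e.g. non-CRC) cases that were cleared/flagged.
    """
    sensitivity = tp / (tp + fn)  # share of positive cases detected
    specificity = tn / (tn + fp)  # share of negative cases correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
accuracy, sensitivity, specificity = binary_metrics(tp=90, fn=10, tn=40, fp=10)
print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
```

Because accuracy is a prevalence-weighted mix of the other two, it sits closer to sensitivity whenever positive cases outnumber negative ones.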

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator

[2]

L. Zhu, R. Wang, X. Jin, Y. Li, F. Tian, R. Cai, K. Qian, X. Hu, B. Hu, Y. Yamamoto and B. W. Schuller.
Explainable Depression Classification Based on EEG Feature Selection from Audio Stimuli.
IEEE Transactions on Neural Systems and Rehabilitation Engineering Early Access (Apr. 2025). DOI

Abstract

With the development of affective computing and Artificial Intelligence (AI) technologies, Electroencephalogram (EEG)-based depression detection methods have been widely proposed. However, existing studies have mostly focused on the accuracy of depression recognition, ignoring the association between features and models. Additionally, there is a lack of research on the contribution of different features to depression recognition. To this end, this study introduces an innovative approach to depression detection using EEG data, integrating Ant-Lion Optimization (ALO) and Multi-Agent Reinforcement Learning (MARL) for feature fusion analysis. The inclusion of Explainable Artificial Intelligence (XAI) methods enhances the explainability of the model’s features. The Time-Delay Embedded Hidden Markov Model (TDE-HMM) is employed to infer internal brain states during depression, triggered by audio stimulation. The ALO-MARL algorithm, combined with hyper-parameter optimization of the XGBoost classifier, achieves high accuracy (93.69%), sensitivity (88.60%), specificity (97.08%), and F1-score (91.82%) on an auditory stimulus-evoked three-channel EEG dataset. The results suggest that this approach outperforms state-of-the-art feature selection methods for depression recognition on this dataset, and XAI elucidates the critical impact of the minimum value of Power Spectral Density (PSD), Sample Entropy (SampEn), and Rényi Entropy (Ren) on depression recognition. The study also explores dynamic brain state transitions revealed by audio stimuli, providing insights for the clinical application of AI algorithms in depression recognition.

MCML Authors

Björn Schuller

Prof. Dr.

Principal Investigator