The MCML has enjoyed fruitful collaborations with the other German AI Competence Centers, reflecting the growing importance of interdisciplinary teamwork in advancing the field of AI. These partnerships have resulted in a series of impactful publications, showcasing the breadth and depth of research being carried out across various domains within AI.
[47]
This work introduces Equivariance by Contrast (EbC), a method that learns equivariant embeddings from observation pairs affected by group actions without group-specific biases. EbC jointly learns a latent space and group representation, achieving faithful equivariance for both abelian and non-abelian groups and enabling general-purpose encoder-only equivariant learning.
[46]
This work presents METok, a training-free, multi-stage event-based token compression framework that accelerates Video Large Language Models while preserving accuracy. By progressively removing redundant visual tokens across encoding, prefilling, and decoding, METok achieves over 80% FLOPs reduction and 93% memory savings without performance loss.
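The summary above describes progressively removing redundant visual tokens. As a rough, hypothetical illustration of similarity-based token pruning (not METok's actual multi-stage, event-based scheme), one can greedily drop tokens that are near-duplicates of already-kept ones:

```python
import numpy as np

def prune_redundant_tokens(tokens, threshold=0.9):
    """Greedily keep tokens whose cosine similarity to every kept token
    is below a threshold. Toy stand-in for similarity-based pruning."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept = [0]  # always keep the first token
    for i in range(1, len(tokens)):
        sims = normed[kept] @ normed[i]
        if np.max(sims) < threshold:
            kept.append(i)
    return tokens[kept], kept
```

A duplicate token is pruned while a dissimilar one survives, shrinking the sequence the language model must process.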
[45]
This work presents a compression strategy for Multimodal Large Language Models (MLLMs) that integrates structural pruning with efficient recovery training. Results show that widthwise pruning combined with supervised finetuning and knowledge distillation preserves over 95% of model performance while requiring only 5% of the original training data.
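Knowledge distillation, one ingredient of the recovery training mentioned above, classically minimizes a temperature-scaled KL divergence between teacher and student output distributions. A minimal numpy sketch of that standard objective (not the paper's exact recipe; names are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled, numerically stable softmax
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))) * T * T)
```

The loss is zero when the pruned student exactly matches the teacher's distribution and grows as the two diverge.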
[44]
The paper introduces TriDi, the first unified model for 3D human-object interaction. Unlike previous methods that only predict in one direction (human→object or object→human), TriDi can model both together using a three-way diffusion process. It links humans, objects, and their interactions through a shared transformer network, controllable by text or contact maps. Despite being a single model, TriDi outperforms specialized methods and generates more diverse, realistic results for 3D interaction tasks.
[43]
This work introduces VGGSounder, a re-annotated, multi-label test set designed to address critical flaws in the widely used VGGSound benchmark. With detailed modality annotations, VGGSounder enables more accurate evaluation of audio-visual foundation models and uncovers limitations previously overlooked.
[42]
The paper explores feature importance (FI) methods as a means to understand the data-generating process (DGP) in machine learning models, which are often opaque. It provides a comprehensive review of FI methods, new theoretical insights, and practical recommendations for selecting the right approach. The study also discusses uncertainty estimation in FI and future directions for statistical inference from black-box models.
[41]
This work introduces ControlEvents, a diffusion-based model that generates high-quality event data guided by text labels, 2D skeletons, and 3D poses. Leveraging priors from foundation models like Stable Diffusion, it enables efficient, low-cost labeled data synthesis that boosts performance in event-based vision tasks.
[40]
This work introduces ECHO, a unified framework for egocentric modeling of human-object interactions using only head and wrist tracking. Leveraging a Diffusion Transformer with a three-variate diffusion process, ECHO jointly reconstructs human pose, object motion, and contact, achieving state-of-the-art performance in flexible HOI reconstruction.
[39]
This work introduces MAGBIG, a controlled benchmark designed to evaluate gender bias in multilingual text-to-image (T2I) generation models. Despite advancements in multilingual capabilities, the study reveals significant gender bias and language-specific inconsistencies, with prompt engineering proving largely ineffective at mitigating these issues.
[38]
This work presents WikiBigEdit, a large-scale benchmark of real-world Wikidata edits designed to advance and future-proof research in lifelong knowledge editing. By evaluating existing editing methods against over 500K question-answer pairs, the study reveals their practical strengths and limitations compared to broader approaches like retrieval augmentation and continual finetuning.
[37]
This work investigates whether modern vision models exhibit compositional understanding by systematically varying data scale, concept diversity, and combination coverage. The findings show that data diversity—not scale—drives compositional generalization, with effective learning emerging from linearly factored representational structures.
[36]
This study explores reducing the number of layers in Large Language Models (LLMs) to address size and efficiency challenges. Remarkably, even models with significantly fewer layers—sometimes just one—can match or outperform fully layered models in prompt-based text classification tasks.
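The manipulation is simple to picture: a truncated model reuses only a prefix of the layer stack. A toy sketch with random residual blocks standing in for transformer layers (purely illustrative, not the study's models):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def make_layer():
    # toy residual block standing in for a full transformer layer
    W = rng.normal(scale=0.1, size=(d, d))
    return lambda x, W=W: x + np.tanh(x @ W)

layers = [make_layer() for _ in range(12)]

def forward(x, layer_stack):
    for layer in layer_stack:
        x = layer(x)
    return x

x = rng.normal(size=(1, d))
full = forward(x, layers)           # all 12 layers
truncated = forward(x, layers[:1])  # keep only the first layer
```

Both variants produce representations of the same shape, so the same classification head can be attached to either; the study's finding is that for prompt-based classification the short stack often suffices.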
[35]
This work introduces a procedural framework for generating virtually infinite, realistic partial 3D shape matching scenarios from complete geometry and establishes cross-dataset correspondences across seven shape datasets (2543 shapes total). It defines challenging partial-matching benchmarks and evaluates state-of-the-art methods as baselines.
[34]
This paper introduces TIME, a unified framework for temporal model merging—integrating expert models trained over time on emerging tasks. Through extensive experiments, TIME explores key design choices in initialization, merging, and deployment to improve model performance across dynamic learning scenarios.
[33]
This work proposes using minimum ratio cycles in conjugate product graphs to solve shape matching problems more effectively. This approach improves accuracy and significantly reduces runtimes by enabling higher-order costs and better geometric regularization.
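As background for the optimization subproblem, a minimum ratio cycle (minimizing total cost divided by total weight over all cycles) can be found by binary search over a negative-cycle test, a classical technique often attributed to Lawler. The paper's conjugate product graph construction and higher-order costs go well beyond this toy sketch:

```python
def has_negative_cycle(n, edges):
    """Bellman-Ford variant: start all distances at 0 and relax n times;
    a strict improvement in the final pass implies a negative cycle."""
    dist = [0.0] * n
    for _ in range(n):
        updated = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                updated = True
        if not updated:
            return False
    return updated

def min_ratio_cycle(n, edges, lo=0.0, hi=100.0, iters=60):
    """Minimize sum(cost)/sum(weight) over cycles (weights > 0).
    edges: list of (u, v, cost, weight). Binary search on the ratio:
    a cycle with ratio below mid exists iff the graph with reduced
    costs (cost - mid * weight) contains a negative cycle."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        reduced = [(u, v, c - mid * w) for u, v, c, w in edges]
        if has_negative_cycle(n, reduced):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0
```

On a small graph with two cycles of ratios 1.5 and 2.0, the search converges to the smaller ratio.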
[32]
This work extends multimodal pretraining to improve few-shot adaptation by enabling models to better use contextual information, achieving up to 4× sample efficiency and 5% average gains across 21 tasks—without sacrificing zero-shot performance.
[31]
This work introduces VGGSounder, a re-annotated, multi-label test set designed to address critical flaws in the widely used VGGSound benchmark. With detailed modality annotations, VGGSounder enables more accurate evaluation of audio-visual foundation models and uncovers limitations previously overlooked.
[30]
The paper gives an overview of reinforcement learning from human feedback (RLHF) — a method where AI systems learn from human judgments instead of fixed reward functions. It explains how RLHF helps align AI behavior with human values and has been key to the success of large language models. The article reviews RLHF across fields like robotics and control, describing its basic ideas, how human feedback guides learning, and current research trends.
[29]
This work analyzes why merging many expert models yields diminishing returns, showing that task vector spaces suffer from rank collapse during merging. To address this, the authors introduce Subspace Boosting, which preserves task vector ranks and boosts merging efficacy by over 10% across vision benchmarks, while offering new insights via higher-order task similarity analysis.
[28]
Detoxification of harmful language is tackled in this work through an LLM-in-the-loop pipeline that leverages GPT-4o-mini to replace human annotation. Building on this approach, the authors create ParaDeHate, a large-scale hate speech detoxification dataset, and demonstrate that fine-tuned models achieve strong accuracy, fluency, and content preservation.
[27]
To address false refusals in large language models, this work introduces XSB and MS-XSB, two benchmarks for assessing and mitigating exaggerated safety behaviors. Combined with post-hoc explanations and lightweight inference-time methods, the approach improves safe prompt compliance while maintaining strong safety safeguards across LLMs.
[26]
This study investigates the influence of MBTI-based persona prompts on hate speech classification in Large Language Models (LLMs), a previously unexplored aspect of subjectivity in annotation. By demonstrating substantial persona-driven variation and bias, the work emphasizes the need for careful prompt design to support fair and value-aligned model behavior.
[25]
This work presents a survey of common scoring rules for survival analysis, focusing on their theoretical and empirical properness, and proposes a new marginal definition of properness. While the Integrated Survival Brier Score (ISBS) and Right-Censored Log-Likelihood (RCLL) are theoretically improper under this definition, simulations show they behave properly in practice, supporting their continued use—particularly in automated model evaluation—despite practical estimation challenges.
[24]
This paper investigates how quantization impacts the explainability and interpretability of large language models, an area previously unexplored. Through experiments with multiple quantization methods, analysis techniques, and a user study, the results show that quantization can unpredictably degrade or even improve transparency, highlighting important implications for LLM deployment.
[23]
This paper introduces DeLoRA, a new parameter-efficient finetuning method that normalizes and scales learnable low-rank matrices to bound transformation strength. By doing so, it improves robustness to hyperparameters and training duration while maintaining strong performance, consistently outperforming popular PEFT approaches like LoRA across image generation and LLM instruction tuning tasks.
[22]
Disentangled representation learning is key to improving generalization and fairness, but aligning data with a prior while preserving geometric features is difficult. This work introduces a new method using quadratic optimal transport and a Gromov-Monge-Gap regularizer to minimize geometric distortion, achieving strong disentanglement performance across benchmarks.
[21]
This paper introduces TIME, a unified framework for temporal model merging—integrating expert models trained over time on emerging tasks. Through extensive experiments, TIME explores key design choices in initialization, merging, and deployment to improve model performance across dynamic learning scenarios.
[20]
Addressing the gap in domain-specific vision-language modeling, this work presents GAIA, a large-scale, multi-sensor, multi-modal remote sensing dataset with 205,150 carefully curated image-text pairs. Experiments show that GAIA enables significant improvements in RS image classification, cross-modal retrieval, and captioning, providing rich, scientifically grounded descriptions of environmental and dynamic phenomena.
[19]
Accurate lymph node segmentation in 3D CT scans is vital but challenging due to the limited availability of fully annotated datasets. The LNQ challenge at MICCAI 2023 demonstrated that weakly-supervised methods show promise, but combining them with fully annotated data significantly boosts performance, underscoring the continued need for high-quality annotations.
[18]
The study introduces a novel method for improving text-to-image (T2I) models by optimizing the initial noise using human preference reward models. This approach significantly enhances T2I model performance, outperforming existing open-source models and achieving efficiency and quality levels comparable to proprietary systems.
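Conceptually, the method treats the initial noise as a free variable and ascends the gradient of a preference reward. A toy stand-in with a simple quadratic reward in place of a learned human-preference model and diffusion sampler (everything here is hypothetical illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in reward: in the paper's setting this would be a learned
# human-preference model scoring the image generated from noise z.
target = rng.normal(size=(4, 4))

def reward(z):
    return -float(np.sum((z - target) ** 2))

def reward_grad(z):
    return -2.0 * (z - target)

z = rng.normal(size=(4, 4))  # initial diffusion noise
r0 = reward(z)
for _ in range(100):
    z = z + 0.05 * reward_grad(z)  # gradient ascent on the noise
r1 = reward(z)
```

The optimized noise scores strictly higher than the random initialization, mirroring how reward-guided noise selection improves generation quality without touching the generator's weights.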
[17]
The paper introduces a new benchmark, FoMo-in-Flux, for continual multimodal pretraining, designed to tackle the challenges of updating multimodal foundation models. Alongside the benchmark, it offers practical guidance for practitioners on how to update models effectively and efficiently in real-world applications.
[16]
This comprehensive review addresses the growing need for transparency in AI applications, especially in critical fields such as geospatial data analysis. The paper surveys methods, objectives, challenges, and findings, providing a much-needed summary of the state of explainable AI (XAI) in this specialized area.
[15]
The paper introduces EgoCVR, a new benchmark for Composed Video Retrieval, in which a video and a text description modifying its content are used together to retrieve the relevant target video. The study shows that existing methods struggle with this task and proposes a training-free approach built on a re-ranking framework.
[14]
The work explores how resource efficiency can be integrated into Automated Machine Learning (AutoML), which traditionally focuses on maximizing predictive quality without considering factors like running time or energy consumption.
[13]
The work explores parameter-efficient fine-tuning (PEFT) techniques in the context of continual learning and examines the strengths and limitations of rehearsal-free methods, providing valuable insights into how they can be improved for better performance in dynamic, real-world environments.
[12]
The work introduces GNNavi, a novel prompt-based parameter-efficient fine-tuning (PEFT) approach for Large Language Models (LLMs). The approach addresses the high resource demands of traditional fine-tuning by leveraging Graph Neural Networks (GNNs) to efficiently guide the flow of information during prompt processing.
[11]
The paper introduces ETHER, a new approach to parameter-efficient fine-tuning (PEFT), which aims to optimize the adaptation of foundation models to downstream tasks while maintaining generalization ability and minimizing the introduction of extra parameters and computational overhead.
[10]
The position paper critiques the current focus on high predictive accuracy in deep learning, particularly for supervised tasks involving large image and language datasets, and calls for greater attention to overlooked metrics and data types, such as uncertainty, active learning, continual learning, and scientific data.
[9]
The paper explores a new method for learning structured representations by leveraging quadratic optimal transport, enhancing the interpretability of learned features.
[8]
The paper explores counterfactual explanations, which help users understand algorithmic decisions by identifying changes that would lead to a desired outcome. These explanations enhance transparency, guide user actions, and provide grounds for contesting decisions.
[7]
The paper explores feature importance (FI) methods as a means to understand the data-generating process (DGP) in machine learning models, which are often opaque. It provides a comprehensive review of FI methods, new theoretical insights, and practical recommendations for selecting the right approach. The study also discusses uncertainty estimation in FI and future directions for statistical inference from black-box models.
[6]
The study introduces Divergent Token Metrics as a novel method for evaluating compressed large language models (LLMs), offering a deeper analysis of model degradation during compression, and providing a more effective way to optimize these models beyond traditional metrics like perplexity and accuracy.
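A very simple token-level signal in this spirit is the position at which a compressed model's greedy output first diverges from the full model's. The toy proxy below is purely illustrative; the paper's actual Divergent Token Metrics are more refined:

```python
def first_divergence(tokens_full, tokens_compressed):
    """Index of the first token where the compressed model's output
    diverges from the full model's; returns the compared length if
    the sequences agree throughout."""
    n = min(len(tokens_full), len(tokens_compressed))
    for i in range(n):
        if tokens_full[i] != tokens_compressed[i]:
            return i
    return n
```

Unlike corpus-level perplexity, such a token-level comparison localizes exactly where compression starts to change model behavior.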
[5]
The paper improves Optimal Transport (OT), a method for efficiently transforming one set of data into another. The new UOT-FM approach helps in areas like predicting biological changes and improving image processing by making the transformation more flexible and accurate. This makes OT more useful for real-world applications.
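As background, classical entropic OT between two histograms can be computed with Sinkhorn iterations. The sketch below shows only this standard baseline, not the paper's UOT-FM approach, which combines unbalanced OT with flow matching:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.5, iters=200):
    """Entropic OT plan between histograms a and b under cost matrix C,
    via alternating scaling of the Gibbs kernel K = exp(-C / eps)."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

The returned plan is a nonnegative matrix whose row sums match the source histogram and whose column sums match the target histogram, i.e. a soft assignment transforming one distribution into the other.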
[4]
The work explores the ethical challenges in the algorithmization of concepts like fairness and diversity in AI. The authors advocate for caution when algorithmically implementing ethical principles and emphasize the importance of human oversight to ensure these systems do not mislead or oversimplify complex ethical dilemmas.
[3]
The work introduces a way to create new test functions for optimization problems, where specific characteristics of the problem landscape can be chosen in advance. By adjusting random data and training a simple neural network, the method can recreate known test functions and also generate new ones with properties not seen before.
[2]
The paper presents the mlr3fairness package which builds upon the ML framework mlr3. The extension contains fairness metrics, fairness visualizations, and model-agnostic pre- and post-processing operators that aim to reduce biases in ML models.
[1]
The paper investigates whether pre-trained multilingual language models (PMLMs) impose English moral norms on other languages or show random, potentially harmful biases. Experiments in five languages reveal that the models do encode different moral biases, but these do not consistently reflect real cultural differences. This could cause problems when using such models across languages.
2025-04-04 - Last modified: 2025-06-06