Research Group Zeynep Akata

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[33]

P. Spohn, L. Girrbach, J. Bader and Z. Akata.
Align-then-Unlearn: Embedding Alignment for LLM Unlearning.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. URL

Abstract

Reliable estimation of treatment effects from observational data is important in many disciplines such as medicine. However, estimation is challenging when unconfoundedness as a standard assumption in the causal inference literature is violated. In this work, we leverage arbitrary (potentially high-dimensional) instruments to estimate bounds on the conditional average treatment effect (CATE). Our contributions are three-fold: (1) We propose a novel approach for partial identification through a mapping of instruments to a discrete representation space so that we yield valid bounds on the CATE. This is crucial for reliable decision-making in real-world applications. (2) We derive a two-step procedure that learns tight bounds using a tailored neural partitioning of the latent instrument space. As a result, we avoid instability issues due to numerical approximations or adversarial training. Furthermore, our procedure aims to reduce the estimation variance in finite-sample settings to yield more reliable estimates. (3) We show theoretically that our procedure obtains valid bounds while reducing estimation variance. We further perform extensive experiments to demonstrate the effectiveness across various settings. Overall, our procedure offers a novel path for practitioners to make use of potentially high-dimensional instruments (e.g., as in Mendelian randomization).

MCML Authors

Leander Girrbach

Interpretable and Reliable Machine Learning

Jessica Bader

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[32]

Q. Bouniot, I. Redko, A. Mallasto, C. Laclau, O. Struckmeier, K. Arndt, M. Heinonen, V. Kyrki and S. Kaski.
From Alexnet to Transformers: Measuring the Non-linearity of Deep Neural Networks with Affine Optimal Transport.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL

Abstract

In the last decade, we have witnessed the introduction of several novel deep neural network (DNN) architectures exhibiting ever-increasing performance across diverse tasks. Explaining the upward trend of their performance, however, remains difficult as different DNN architectures of comparable depth and width – common factors associated with their expressive power – may exhibit a drastically different performance even when trained on the same dataset. In this paper, we introduce the concept of the non-linearity signature of DNN, the first theoretically sound solution for approximately measuring the non-linearity of deep neural networks. Built upon a score derived from closed-form optimal transport mappings, this signature provides a better understanding of the inner workings of a wide range of DNN architectures and learning paradigms, with a particular emphasis on the computer vision task. We provide extensive experimental results that highlight the practical usefulness of the proposed non-linearity signature and its potential for long-reaching implications.

MCML Authors

Quentin Bouniot

Dr.

Interpretable and Reliable Machine Learning

[31]

S. Dziadzio, V. Udandarao, K. Roth, A. Prabhu, Z. Akata, S. Albanie and M. Bethge.
How to Merge Your Multimodal Models Over Time?
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL

Abstract

Model merging combines multiple expert models - finetuned from a base foundation model on diverse tasks and domains - into a single, more capable model. However, most existing model merging approaches assume that all experts are available simultaneously. In reality, new tasks and domains emerge progressively over time, requiring strategies to integrate the knowledge of expert models as they become available: a process we call temporal model merging. The temporal dimension introduces unique challenges not addressed in prior work, raising new questions such as: when training for a new task, should the expert model start from the merged past experts or from the original base model? Should we merge all models at each time step? Which merging techniques are best suited for temporal merging? Should different strategies be used to initialize the training and deploy the model? To answer these questions, we propose a unified framework called TIME - Temporal Integration of Model Expertise - which defines temporal model merging across three axes: (1) Initialization Phase, (2) Deployment Phase, and (3) Merging Technique. Using TIME, we study temporal model merging across model sizes, compute budgets, and learning horizons on the FoMo-in-Flux benchmark. Our comprehensive suite of experiments across TIME allows us to uncover key insights for temporal model merging, offering a better understanding of current challenges and best practices for effective temporal model merging.

MCML Authors

Karsten Roth

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[30]

S. Kim, R. Xiao, M.-I. Georgescu, S. Alaniz and Z. Akata.
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL

Abstract

Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.

MCML Authors

Sanghwan Kim

Interpretable and Reliable Machine Learning

Rui Xiao

Interpretable and Reliable Machine Learning

Iuliana Georgescu

Dr.

Interpretable and Reliable Machine Learning

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[29]

K. Roth, Z. Akata, D. Damen, I. Balažević and O. J. Hénaff.
Context-Aware Multimodal Pretraining.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL

Abstract

Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple, but carefully designed extension to multimodal pretraining which enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find up to four-fold improvements in test-time sample efficiency, and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations easily surpass more complex and expensive optimization-based schemes, vastly simplifying generalization to new domains.

MCML Authors

Karsten Roth

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[28]

R. Xiao, S. Kim, M.-I. Georgescu, Z. Akata and S. Alaniz.
FLAIR: VLM with Fine-grained Language-informed Image Representations.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. URL GitHub

Abstract

CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details about an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance on both, existing multimodal retrieval benchmarks, as well as, our newly introduced fine-grained retrieval task which evaluates vision-language models’ ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information, including zero-shot semantic segmentation, outperforming models trained on billions of pairs.

MCML Authors

Rui Xiao

Interpretable and Reliable Machine Learning

Sanghwan Kim

Interpretable and Reliable Machine Learning

Iuliana Georgescu

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning

[27]

S. Roschmann, Q. Bouniot, V. Feofanov, I. Redko and Z. Akata.
Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers.
Preprint (Jun. 2025). arXiv

Abstract

Time series classification is a fundamental task in healthcare and industry, yet the development of time series foundation models (TSFMs) remains limited by the scarcity of publicly available time series datasets. In this work, we propose Time Vision Transformer (TiViT), a framework that converts time series into images to leverage the representational power of frozen Vision Transformers (ViTs) pretrained on large-scale image datasets. First, we theoretically motivate our approach by analyzing the 2D patching of ViTs for time series, showing that it can increase the number of label-relevant tokens and reduce the sample complexity. Second, we empirically demonstrate that TiViT achieves state-of-the-art performance on standard time series classification benchmarks by utilizing the hidden representations of large OpenCLIP models. We explore the structure of TiViT representations and find that intermediate layers with high intrinsic dimension are the most effective for time series classification. Finally, we assess the alignment between TiViT and TSFM representation spaces and identify a strong complementarity, with further performance gains achieved by combining their features. Our findings reveal yet another direction for reusing vision representations in a non-visual domain.

MCML Authors

Quentin Bouniot

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[26]

R. Skorobogat, K. Roth, M.-I. Georgescu and Z. Akata.
Subspace-Boosted Model Merging.
Preprint (Jun. 2025). arXiv

Abstract

Model merging enables the combination of multiple specialized expert models into a single model capable of performing multiple tasks. However, the benefits of merging an increasing amount of specialized experts generally lead to diminishing returns and reduced overall performance gains. In this work, we offer an explanation and analysis from a task arithmetic perspective; revealing that as the merging process (across numerous existing merging methods) continues for more and more experts, the associated task vector space experiences rank collapse. To mitigate this issue, we introduce Subspace Boosting, which operates on the singular value decomposed task vector space and maintains task vector ranks. Subspace Boosting raises merging efficacy for up to 20 expert models by large margins of more than 10% when evaluated on vision benchmarks. Moreover, we propose employing Higher-Order Generalized Singular Value Decomposition to further quantify task similarity, offering a new interpretable perspective on model merging.

MCML Authors

Karsten Roth

Interpretable and Reliable Machine Learning

Iuliana Georgescu

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[25]

J. Kim, S. Alaniz, C. Schmid and Z. Akata.
LoFT: LoRA-fused Training Dataset Generation with Few-shot Guidance.
Preprint (May. 2025). arXiv GitHub

Abstract

Despite recent advances in text-to-image generation, using synthetically generated data seldom brings a significant boost in performance for supervised learning. Oftentimes, synthetic datasets do not faithfully recreate the data distribution of real data, i.e., they lack the fidelity or diversity needed for effective downstream model training. While previous work has employed few-shot guidance to address this issue, existing methods still fail to capture and generate features unique to specific real images. In this paper, we introduce a novel dataset generation framework named LoFT, LoRA-Fused Training-data Generation with Few-shot Guidance. Our method fine-tunes LoRA weights on individual real images and fuses them at inference time, producing synthetic images that combine the features of real images for improved diversity and fidelity of generated data. We evaluate the synthetic data produced by LoFT on 10 datasets, using 8 to 64 real images per class as guidance and scaling up to 1000 images per class. Our experiments show that training on LoFT-generated data consistently outperforms other synthetic dataset methods, significantly increasing accuracy as the dataset size increases. Additionally, our analysis demonstrates that LoFT generates datasets with high fidelity and sufficient diversity, which contribute to the performance improvement.

MCML Authors

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[24]

M. Bini, L. Girrbach and Z. Akata.
Decoupling Angles and Strength in Low-rank Adaptation.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL

Abstract

Parameter Efficient FineTuning (PEFT) methods have recently gained extreme popularity thanks to the vast availability of large-scale models, allowing to quickly adapt pretrained models to downstream tasks with minimal computational costs. However, current additive finetuning methods such as LoRA show low robustness to prolonged training and hyperparameter choices, not allowing for optimal out-of-the-box usage. On the other hand, multiplicative and bounded approaches such as ETHER, even if providing higher robustness, only allow for extremely low-rank adaptations and are limited to a fixed-strength transformation, hindering the expressive power of the adaptation. In this work, we propose the DeLoRA finetuning method that first normalizes and then scales the learnable low-rank matrices, thus effectively bounding the transformation strength, which leads to increased hyperparameter robustness at no cost in performance. We show that this proposed approach effectively and consistently improves over popular PEFT methods by evaluating our method on two finetuning tasks, subject-driven image generation and LLM instruction tuning.

MCML Authors

Massimo Bini

Interpretable and Reliable Machine Learning

Leander Girrbach

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[23]

Q. Bouniot, P. Mozharovskyi and F. d'Alché-Buc.
Tailoring Mixup to Data for Calibration.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL

Abstract

Among all data augmentation techniques proposed so far, linear interpolation of training samples, also called Mixup, has found to be effective for a large panel of applications. Along with improved predictive performance, Mixup is also a good technique for improving calibration. However, mixing data carelessly can lead to manifold mismatch, i.e., synthetic data lying outside original class manifolds, which can deteriorate calibration. In this work, we show that the likelihood of assigning a wrong label with mixup increases with the distance between data to mix. To this end, we propose to dynamically change the underlying distributions of interpolation coefficients depending on the similarity between samples to mix, and define a flexible framework to do so without losing in diversity. We provide extensive experiments for classification and regression tasks, showing that our proposed method improves predictive performance and calibration of models, while being much more efficient.

MCML Authors

Quentin Bouniot

Dr.

Interpretable and Reliable Machine Learning

[22]

L. Girrbach, Y. Huang, S. Alaniz, T. Darrell and Z. Akata.
Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs).
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv

Abstract

Pre-trained large language models (LLMs) have been reliably integrated with visual input for multimodal tasks. The widespread adoption of instruction-tuned image-to-text vision-language assistants (VLAs) like LLaVA and InternVL necessitates evaluating gender biases. We study gender bias in 22 popular open-source VLAs with respect to personality traits, skills, and occupations. Our results show that VLAs replicate human biases likely present in the data, such as real-world occupational imbalances. Similarly, they tend to attribute more skills and positive personality traits to women than to men, and we see a consistent tendency to associate negative personality traits with men. To eliminate the gender bias in these models, we find that finetuning-based debiasing methods achieve the best tradeoff between debiasing and retaining performance on downstream tasks. We argue for pre-deploying gender bias assessment in VLAs and motivate further development of debiasing strategies to ensure equitable societal outcomes.

MCML Authors

Leander Girrbach

Interpretable and Reliable Machine Learning

Yiran Huang

Interpretable and Reliable Machine Learning

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[21]

T. Uscidda, L. Eyring, K. Roth, F. J. Theis, Z. Akata and M. Cuturi.
Disentangled Representation Learning with the Gromov-Monge Gap.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv

Abstract

Learning disentangled representations from unlabelled data is a fundamental challenge in machine learning. Solving it may unlock other problems, such as generalization, interpretability, or fairness. Although remarkably challenging to solve in theory, disentanglement is often achieved in practice through prior matching. Furthermore, recent works have shown that prior matching approaches can be enhanced by leveraging geometrical considerations, e.g., by learning representations that preserve geometric features of the data, such as distances or angles between points. However, matching the prior while preserving geometric features is challenging, as a mapping that fully preserves these features while aligning the data distribution with the prior does not exist in general. To address these challenges, we introduce a novel approach to disentangled representation learning based on quadratic optimal transport. We formulate the problem using Gromov-Monge maps that transport one distribution onto another with minimal distortion of predefined geometric features, preserving them as much as can be achieved. To compute such maps, we propose the Gromov-Monge-Gap (GMG), a regularizer quantifying whether a map moves a reference distribution with minimal geometry distortion. We demonstrate the effectiveness of our approach for disentanglement across four standard benchmarks, outperforming other methods leveraging geometric considerations.

MCML Authors

Luca Eyring

Interpretable and Reliable Machine Learning

Karsten Roth

Interpretable and Reliable Machine Learning

Fabian Theis

Prof. Dr.

C2 | Biology

Mathematical Modelling of Biological Systems

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[20]

S. Dziadzio, V. Udandarao, K. Roth, A. Prabhu, Z. Akata, S. Albanie and M. Bethge.
How to Merge Multimodal Models Over Time?
MCDC @ICLR 2025 - Workshop on Modularity for Collaborative, Decentralized, and Continual Deep Learning at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. URL

Abstract

Model merging combines multiple expert models finetuned from a base foundation model on diverse tasks and domains into a single, more capable model. However, most existing model merging approaches assume that all experts are available simultaneously. In reality, new tasks and domains emerge progressively over time, requiring strategies to integrate the knowledge of expert models as they become available: a process we call temporal model merging. The temporal dimension introduces unique challenges not addressed in prior work, raising new questions such as: when training for a new task, should the expert model start from the merged past experts or from the original base model? Should we merge all models at each time step? Which merging techniques are best suited for temporal merging? Should different strategies be used to initialize the training and deploy the model? To answer these questions, we propose a unified framework called TIME (Temporal Integration of Model Expertise) which defines temporal model merging across three axes: (1) initialization, (2) deployment, and (3) merging technique. Using TIME, we study temporal model merging across model sizes, compute budgets, and learning horizons on the FoMo-in-Flux benchmark. Our comprehensive suite of experiments across TIME allows us to build a better understanding of current challenges and best practices for effective temporal model merging.

MCML Authors

Karsten Roth

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[19]

M. Pach, S. Karthik, Q. Bouniot, S. Belongie and Z. Akata.
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models.
Preprint (Apr. 2025). arXiv

Abstract

Sparse Autoencoders (SAEs) have recently been shown to enhance interpretability and steerability in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity in vision representations. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons while also exhibiting hierarchical representations that align well with expert-defined structures (e.g., iNaturalist taxonomy). Most notably, we demonstrate that applying SAEs to intervene on a CLIP vision encoder, directly steer output from multimodal LLMs (e.g., LLaVA) without any modifications to the underlying model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised approach for enhancing both the interpretability and control of VLMs.

MCML Authors

Mateusz Pach

Interpretable and Reliable Machine Learning

Shyamgopal Karthik

Interpretable and Reliable Machine Learning

Quentin Bouniot

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[18]

L. Girrbach, S. Alaniz, G. Smith and Z. Akata.
A Large Scale Analysis of Gender Biases in Text-to-Image Generative Models.
Preprint (Mar. 2025). arXiv

Abstract

With the increasing use of image generation technology, understanding its social biases, including gender bias, is essential. This paper presents the first large-scale study on gender bias in text-to-image (T2I) models, focusing on everyday situations. While previous research has examined biases in occupations, we extend this analysis to gender associations in daily activities, objects, and contexts. We create a dataset of 3,217 gender-neutral prompts and generate 200 images per prompt from five leading T2I models. We automatically detect the perceived gender of people in the generated images and filter out images with no person or multiple people of different genders, leaving 2,293,295 images. To enable a broad analysis of gender bias in T2I models, we group prompts into semantically similar concepts and calculate the proportion of male- and female-gendered images for each prompt. Our analysis shows that T2I models reinforce traditional gender roles, reflect common gender stereotypes in household roles, and underrepresent women in financial related activities. Women are predominantly portrayed in care- and human-centered scenarios, and men in technical or physical labor scenarios.

MCML Authors

Leander Girrbach

Interpretable and Reliable Machine Learning

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[17]

S. Wu, S. Alaniz, E. Schulz and Z. Akata.
Discovering Chunks in Neural Embeddings for Interpretability.
Preprint (Feb. 2025). arXiv

Abstract

Understanding neural networks is challenging due to their high-dimensional, interacting components. Inspired by human cognition, which processes complex sensory data by chunking it into recurring entities, we propose leveraging this principle to interpret artificial neural population activities. Biological and artificial intelligence share the challenge of learning from structured, naturalistic data, and we hypothesize that the cognitive mechanism of chunking can provide insights into artificial systems. We first demonstrate this concept in recurrent neural networks (RNNs) trained on artificial sequences with imposed regularities, observing that their hidden states reflect these patterns, which can be extracted as a dictionary of chunks that influence network responses. Extending this to large language models (LLMs) like LLaMA, we identify similar recurring embedding states corresponding to concepts in the input, with perturbations to these states activating or inhibiting the associated concepts. By exploring methods to extract dictionaries of identifiable chunks across neural embeddings of varying complexity, our findings introduce a new framework for interpreting neural networks, framing their population activity as structured reflections of the data they process.

MCML Authors

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[16]

M. Binz, S. Alaniz, A. Roskies, B. , C. T. Bergstrom, C. Allen, D. Schad, D. Wulff, J. D. , Q. Zhang, R. M. Shiffrin, S. J. Gershman, V. Popov, E. M. Bender, M. Marelli, M. M. Botvinick, Z. Akata and E. Schulz.
How should the advancement of large language models affect the practice of science?
Proceedings of the National Academy of Sciences 122.5 (Jan. 2025). DOI

Abstract

Large language models (LLMs) are being increasingly incorporated into scientific workflows. However, we have yet to fully grasp the implications of this integration. How should the advancement of large language models affect the practice of science? For this opinion piece, we have invited four diverse groups of scientists to reflect on this query, sharing their perspectives and engaging in debate. Schulz et al. make the argument that working with LLMs is not fundamentally different from working with human collaborators, while Bender et al. argue that LLMs are often misused and overhyped, and that their limitations warrant a focus on more specialized, easily interpretable tools. Marelli et al. emphasize the importance of transparent attribution and responsible use of LLMs. Finally, Botvinick and Gershman advocate that humans should retain responsibility for determining the scientific roadmap. To facilitate the discussion, the four perspectives are complemented with a response from each group. By putting these different perspectives in conversation, we aim to bring attention to important considerations within the academic community regarding the adoption of LLMs and their impact on both current and future scientific practices.

MCML Authors

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

2024

[15]

L. Eyring, S. Karthik, K. Roth, A. Dosovitskiy and Z. Akata.
ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub

Abstract

Text-to-Image (T2I) models have made significant advancements in recent years, but they still struggle to accurately capture intricate details specified in complex compositional prompts. While fine-tuning T2I models with reward objectives has shown promise, it suffers from ‘reward hacking’ and may not generalize well to unseen prompt distributions. In this work, we propose Reward-based Noise Optimization (ReNO), a novel approach that enhances T2I models at inference by optimizing the initial noise based on the signal from one or multiple human preference reward models. Remarkably, solving this optimization problem with gradient ascent for 50 iterations yields impressive results on four different one-step models across two competitive benchmarks, T2I-CompBench and GenEval. Within a computational budget of 20-50 seconds, ReNO-enhanced one-step models consistently surpass the performance of all current open-source Text-to-Image models. Extensive user studies demonstrate that our model is preferred nearly twice as often compared to the popular SDXL model and is on par with the proprietary Stable Diffusion 3 with 8B parameters. Moreover, given the same computational resources, a ReNO-optimized one-step model outperforms widely-used open-source models such as SDXL and PixArt-α, highlighting the efficiency and effectiveness of ReNO in enhancing T2I model performance at inference time.

MCML Authors

Luca Eyring

Interpretable and Reliable Machine Learning

Shyamgopal Karthik

Interpretable and Reliable Machine Learning

Karsten Roth

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[14]

V. Udandarao, K. Roth, S. Dziadzio, A. Prabhu, M. Cherti, O. Vinyals, O. Hénaff, S. Albanie, Z. Akata and M. Bethge.
A Practitioner's Guide to Real-World Continual Multimodal Pretraining.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub

Abstract

Multimodal foundation models serve numerous applications at the intersection of vision and language. Still, despite being pretrained on extensive data, they become outdated over time. To keep models updated, research into continual pretraining mainly explores scenarios with either (1) infrequent, indiscriminate updates on large-scale new data, or (2) frequent, sample-level updates. However, practical model deployment often operates in the gap between these two limit cases, as real-world applications often demand adaptation to specific subdomains, tasks or concepts – spread over the entire, varying life cycle of a model. In this work, we complement current perspectives on continual pretraining through a research test bed as well as provide comprehensive guidance for effective continual model updates in such scenarios. We first introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements, constructed over 63 datasets with diverse visual and semantic coverage. Using FoMo-in-Flux, we explore the complex landscape of practical continual pretraining through multiple perspectives: (1) A data-centric investigation of data mixtures and stream orderings that emulate real-world deployment situations, (2) a method-centric investigation ranging from simple fine-tuning and traditional continual learning strategies to parameter-efficient updates and model merging, (3) meta learning rate schedules and mechanistic design choices, and (4) the influence of model and compute scaling. Together, our insights provide a practitioner’s guide to continual multimodal pretraining for real-world deployment.

MCML Authors

Karsten Roth

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

C3 | Physics and Geo Sciences
→ Group Xiaoxiang Zhu

Interpretable and Reliable Machine Learning

[13]

A. Höhl, I. Obadic, M.-Á. Fernández-Torres, H. Najjar, D. Oliveira, Z. Akata, A. Dengel and X. Zhu.
Opening the Black Box: A systematic review on explainable artificial intelligence in remote sensing.
IEEE Geoscience and Remote Sensing Magazine 12.4 (Dec. 2024). DOI

Abstract

In recent years, black-box machine learning approaches have become a dominant modeling paradigm for knowledge extraction in remote sensing. Despite the potential benefits of uncovering the inner workings of these models with explainable AI, a comprehensive overview summarizing the explainable AI methods used and their objectives, findings, and challenges in remote sensing applications is still missing. In this paper, we address this gap by performing a systematic review to identify the key trends in the field and shed light on novel explainable AI approaches and emerging directions that tackle specific remote sensing challenges. We also reveal the common patterns of explanation interpretation, discuss the extracted scientific insights, and reflect on the approaches used for the evaluation of explainable AI methods. As such, our review provides a complete summary of the state-of-the-art of explainable AI in remote sensing. Further, we give a detailed outlook on the challenges and promising research directions, representing a basis for novel methodological development and a useful starting point for new researchers in the field.

MCML Authors

Adrian Höhl

Data Science in Earth Observation

Ivica Obadic

C3 | Physics and Geo Sciences
→ Group Xiaoxiang Zhu

Data Science in Earth Observation

Zeynep Akata

Prof. Dr.

C3 | Physics and Geo Sciences

Interpretable and Reliable Machine Learning

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

[12]

A. Baumann, R. Li, M. Klasson, S. Mentu, S. Karthik, Z. Akata, A. Solin and M. Trapp.
Post-hoc Probabilistic Vision-Language Models.
Preprint (Dec. 2024). arXiv

Abstract

Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.

MCML Authors

Shyamgopal Karthik

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[11]

Y. Yeganeh, R. Xiao, G. Guvercin, N. Navab and A. Farshad.
Conformable Convolution for Topologically Aware Learning of Complex Anatomical Structures.
Preprint (Dec. 2024). arXiv

Abstract

While conventional computer vision emphasizes pixel-level and feature-based objectives, medical image analysis of intricate biological structures necessitates explicit representation of their complex topological properties. Despite their successes, deep learning models often struggle to accurately capture the connectivity and continuity of fine, sometimes pixel-thin, yet critical structures due to their reliance on implicit learning from data. Such shortcomings can significantly impact the reliability of analysis results and hinder clinical decision-making. To address this challenge, we introduce Conformable Convolution, a novel convolutional layer designed to explicitly enforce topological consistency. Conformable Convolution learns adaptive kernel offsets that preferentially focus on regions of high topological significance within an image. This prioritization is guided by our proposed Topological Posterior Generator (TPG) module, which leverages persistent homology. The TPG module identifies key topological features and guides the convolutional layers by applying persistent homology to feature maps transformed into cubical complexes. Our proposed modules are architecture-agnostic, enabling them to be integrated seamlessly into various architectures. We showcase the effectiveness of our framework in the segmentation task, where preserving the interconnectedness of structures is critical. Experimental results on three diverse datasets demonstrate that our framework effectively preserves the topology in the segmentation downstream task, both quantitatively and qualitatively.

MCML Authors

Yousef Yeganeh

Computer Aided Medical Procedures & Augmented Reality

Rui Xiao

Interpretable and Reliable Machine Learning

Nassir Navab

Prof. Dr.

C1 | Medicine

Computer Aided Medical Procedures & Augmented Reality

Azade Farshad

Dr.

Computer Aided Medical Procedures & Augmented Reality

[10]

S. Karthik, H. Coskun, Z. Akata, S. Tulyakov, J. Ren and A. Kag.
Scalable Ranked Preference Optimization for Text-to-Image Generation.
Preprint (Oct. 2024). arXiv

Abstract

Direct Preference Optimization (DPO) has emerged as a powerful approach to align text-to-image (T2I) models with human feedback. Unfortunately, successful application of DPO to T2I models requires a huge amount of resources to collect and label large-scale datasets, e.g., millions of generated paired images annotated with human preferences. In addition, these human preference datasets can get outdated quickly as the rapid improvements of T2I models lead to higher quality images. In this work, we investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training. Specifically, the preferences for paired images are generated using a pre-trained reward function, eliminating the need for involving humans in the annotation process, greatly improving the dataset collection efficiency. Moreover, we demonstrate that such datasets allow averaging predictions across multiple models and collecting ranked preferences as opposed to pairwise preferences. Furthermore, we introduce RankDPO to enhance DPO-based methods using the ranking feedback. Applying RankDPO on SDXL and SD3-Medium models with our synthetically generated preference dataset ‘Syn-Pic’ improves both prompt-following (on benchmarks like T2I-Compbench, GenEval, and DPG-Bench) and visual quality (through user studies). This pipeline presents a practical and scalable solution to develop better preference datasets to enhance the performance of text-to-image models.

MCML Authors

Shyamgopal Karthik

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[9]

A. Christensen, N. Mojab, K. Patel, K. Ahuja, Z. Akata, O. Winther, O. Gonzalez-Franco and A. Colaco.
Geometry Fidelity for Spherical Images.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI

Abstract

Spherical or omni-directional images offer an immersive visual format appealing to a wide range of computer vision applications. However, geometric properties of spherical images pose a major challenge for models and metrics designed for ordinary 2D images. Here, we show that direct application of Fréchet Inception Distance (FID) is insufficient for quantifying geometric fidelity in spherical images. We introduce two quantitative metrics accounting for geometric constraints, namely Omnidirectional FID (OmniFID) and Discontinuity Score (DS). OmniFID is an extension of FID tailored to additionally capture field-of-view requirements of the spherical format by leveraging cubemap projections. DS is a kernel-based seam alignment score of continuity across borders of 2D representations of spherical images. In experiments, OmniFID and DS quantify geometry fidelity issues that are undetected by FID.

MCML Authors

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[8]

T. Hummel, S. Karthik, M.-I. Georgescu and Z. Akata.
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub

Abstract

In Composed Video Retrieval, a video and a textual description which modifies the video content are provided as inputs to the model. The aim is to retrieve the relevant video with the modified content from a database of videos. In this challenging task, the first step is to acquire large-scale training datasets and collect high-quality benchmarks for evaluation. In this work, we introduce EgoCVR, a new evaluation benchmark for fine-grained Composed Video Retrieval using large-scale egocentric video datasets. EgoCVR consists of 2,295 queries that specifically focus on high-quality temporal video understanding. We find that existing Composed Video Retrieval frameworks do not achieve the necessary high-quality temporal video understanding for this task. To address this shortcoming, we adapt a simple training-free method, propose a generic re-ranking framework for Composed Video Retrieval, and demonstrate that this achieves strong results on EgoCVR.

MCML Authors

Shyamgopal Karthik

Interpretable and Reliable Machine Learning

Iuliana Georgescu

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[7]

J. M. Kim, J. Bader, S. Alaniz, C. Schmid and Z. Akata.
DataDream: Few-shot Guided Dataset Generation.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub

Abstract

While text-to-image diffusion models have been shown to achieve state-of-the-art results in image synthesis, they have yet to prove their effectiveness in downstream applications. Previous work has proposed to generate data for image classifier training given limited real data access. However, these methods struggle to generate in-distribution images or depict fine-grained features, thereby hindering the generalization of classification models trained on synthetic datasets. We propose DataDream, a framework for synthesizing classification datasets that more faithfully represents the real data distribution when guided by few-shot examples of the target classes. DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model. We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets. We demonstrate the efficacy of DataDream through extensive experiments, surpassing state-of-the-art classification accuracy with few-shot data across 7 out of 10 datasets, while being competitive on the other 3. Additionally, we provide insights into the impact of various factors, such as the number of real-shot and generated images as well as the fine-tuning compute on model performance.

MCML Authors

Jae Myung Kim

Interpretable and Reliable Machine Learning

Jessica Bader

Interpretable and Reliable Machine Learning

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[6]

L. Thede, K. Roth, O. J. Hénaff, M. Bethge and Z. Akata.
Reflecting on the State of Rehearsal-free Continual Learning with Pretrained Models.
CoLLAs 2024 - 3rd Conference on Lifelong Learning Agents. Pisa, Italy, Aug 11-14, 2024. URL

Abstract

With the advent and recent ubiquity of foundation models, continual learning (CL) has recently shifted from continual training from scratch to the continual adaptation of pretrained models, seeing particular success on rehearsal-free CL benchmarks (RFCL). To achieve this, most proposed methods adapt and restructure parameter-efficient finetuning techniques (PEFT) to suit the continual nature of the problem. Based most often on input-conditional query-mechanisms or regularizations on top of prompt- or adapter-based PEFT, these PEFT-style RFCL (P-RFCL) approaches report peak performances; often convincingly outperforming existing CL techniques. However, on the other end, critical studies have recently highlighted competitive results by training on just the first task or via simple non-parametric baselines. Consequently, questions arise about the relationship between methodological choices in P-RFCL and their reported high benchmark scores. In this work, we tackle these questions to better understand the true drivers behind strong P-RFCL performances, their placement w.r.t. recent first-task adaptation studies, and their relation to preceding CL standards such as EWC or SI. In particular, we show: (1) P-RFCL techniques relying on input-conditional query mechanisms work not because, but rather despite them by collapsing towards standard PEFT shortcut solutions. (2) Indeed, we show how most often, P-RFCL techniques can be matched by a simple and lightweight PEFT baseline. (3) Using this baseline, we identify the implicit bound on tunable parameters when deriving RFCL approaches from PEFT methods as a potential denominator behind P-RFCL efficacy. Finally, we (4) better disentangle continual versus first-task adaptation, and (5) motivate standard RFCL techniques s.a. EWC or SI in light of recent P-RFCL methods.

MCML Authors

Karsten Roth

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[5]

M. Bini, K. Roth, Z. Akata and A. Khoreva.
ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL GitHub

Abstract

Parameter-efficient finetuning (PEFT) has become ubiquitous to adapt foundation models to downstream task requirements while retaining their generalization ability. However, the amount of additionally introduced parameters and compute for successful adaptation and hyperparameter searches can explode quickly, especially when deployed at scale to serve numerous individual requests. To ensure effective, parameter-efficient, and hyperparameter-robust adaptation, we propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections. By design, ETHER transformations require a minimal number of parameters, are less likely to deteriorate model performance, and exhibit robustness to hyperparameter and learning rate choices. In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters (∼10-100 times lower than LoRA or OFT) across multiple image synthesis and natural language tasks without exhaustive hyperparameter tuning. Finally, we investigate the recent emphasis on Hyperspherical Energy retention for adaptation and raise questions on its practical utility.

MCML Authors

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[4]

T. Uscidda, L. Eyring, K. Roth, F. J. Theis, Z. Akata and M. Cuturi.
Disentangled Representation Learning through Geometry Preservation with the Gromov-Monge Gap.
SPIGM @ICML 2024 - Workshop on Structured Probabilistic Inference & Generative Modeling at the 41st International Conference on Machine Learning (ICML 2024). Vienna, Austria, Jul 21-27, 2024. arXiv

Abstract

MCML Authors

Luca Eyring

Interpretable and Reliable Machine Learning

Karsten Roth

Interpretable and Reliable Machine Learning

Fabian Theis

Prof. Dr.

C2 | Biology

Mathematical Modelling of Biological Systems

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

[3]

M. Dani, M. J. Prakash, Z. Akata and S. Liebe.
SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research.
Preprint (Jul. 2024). arXiv

Abstract

Large Language Models have shown promising results in their ability to encode general medical knowledge in standard medical question-answering datasets. However, their potential application in clinical practice requires evaluation in domain-specific tasks, where benchmarks are largely missing. In this study semioLLM, we test the ability of state-of-the-art LLMs (GPT-3.5, GPT-4, Mixtral 8x7B, and Qwen-72chat) to leverage their internal knowledge and reasoning for epilepsy diagnosis. Specifically, we obtain likelihood estimates linking unstructured text descriptions of seizures to seizure-generating brain regions, using an annotated clinical database containing 1269 entries. We evaluate the LLM’s performance, confidence, reasoning, and citation abilities in comparison to clinical evaluation. Models achieve above-chance classification performance with prompt engineering significantly improving their outcome, with some models achieving close-to-clinical performance and reasoning. However, our analyses also reveal significant pitfalls with several models being overly confident while showing poor performance, as well as exhibiting citation errors and hallucinations. In summary, our work provides the first extensive benchmark comparing current SOTA LLMs in the medical domain of epilepsy and highlights their ability to leverage unstructured texts from patients’ medical history to aid diagnostic processes in health care.

MCML Authors

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

2023

[2]

Y. Yeganeh, A. Farshad, G. Guevercin, A. Abu-zer, R. Xiao, Y. Tang, E. Adeli and N. Navab.
SCOPE: Structural Continuity Preservation for Medical Image Segmentation.
GRAIL @MICCAI 2023 - 5th Workshop on GRaphs in biomedicAl Image anaLysis at the 26th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2023). Vancouver, Canada, Oct 08-12, 2023. DOI

Abstract

Although the preservation of shape continuity and physiological anatomy is a natural assumption in the segmentation of medical images, it is often neglected by deep learning methods that mostly aim for the statistical modeling of input data as pixels rather than interconnected structures. In biological structures, however, organs are not separate entities; for example, in reality, a severed vessel is an indication of an underlying problem, but traditional segmentation models are not designed to strictly enforce the continuity of anatomy, potentially leading to inaccurate medical diagnoses. To address this issue, we propose a graph-based approach that enforces the continuity and connectivity of anatomical topology in medical images. Our method encodes the continuity of shapes as a graph constraint, ensuring that the network’s predictions maintain this continuity. We evaluate our method on two public benchmarks on retinal vessel segmentation, showing significant improvements in connectivity metrics compared to traditional methods while getting better or on-par performance on segmentation metrics.

MCML Authors

Yousef Yeganeh

Computer Aided Medical Procedures & Augmented Reality

Azade Farshad

Dr.

Computer Aided Medical Procedures & Augmented Reality

Rui Xiao

Interpretable and Reliable Machine Learning

Nassir Navab

Prof. Dr.

C1 | Medicine

Computer Aided Medical Procedures & Augmented Reality

[1]

Y. Yeganeh, G. Güvercin, R. Xiao, A. Abuzer, E. Adeli, A. Farshad and N. Navab.
SCOPE: Structural Continuity Preservation for Retinal Vessel Segmentation.
GRAIL @MICCAI 2023 - 5th Workshop on GRaphs in biomedicAl Image anaLysis at the 26th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2023). Vancouver, Canada, Oct 08-12, 2023. DOI

Abstract

Although the preservation of shape continuity and physiological anatomy is a natural assumption in the segmentation of medical images, it is often neglected by deep learning methods that mostly aim for the statistical modeling of input data as pixels rather than interconnected structures. In biological structures, however, organs are not separate entities; for example, in reality, a severed vessel is an indication of an underlying problem, but traditional segmentation models are not designed to strictly enforce the continuity of anatomy, potentially leading to inaccurate medical diagnoses. To address this issue, we propose a graph-based approach that enforces the continuity and connectivity of anatomical topology in medical images. Our method encodes the continuity of shapes as a graph constraint, ensuring that the network’s predictions maintain this continuity. We evaluate our method on three public benchmarks of retinal vessel segmentation and one neuronal structure segmentation benchmark, showing significant improvements in connectivity metrics compared to previous works while getting better or on-par performance on segmentation metrics.

MCML Authors

Yousef Yeganeh

Computer Aided Medical Procedures & Augmented Reality

Rui Xiao

Interpretable and Reliable Machine Learning

Azade Farshad

Dr.