91 Papers in Highly-Ranked Journals

Digital Catalysis

M. Abrahamowicz, M.-E. Beauchamp, A.-L. Boulesteix, T. P. Morris, W. Sauerbrei, J. S. Kaufman and o. b. o. t. STRATOS Simulation Panel.
Data-driven simulations to assess the impact of study imperfections in time-to-event analyses.
American Journal of Epidemiology 194.1 (Jan. 2025). DOI

Abstract

Quantitative bias analysis (QBA) permits assessment of the expected impact of various imperfections of the available data on the results and conclusions of a particular real-world study. This article extends QBA methodology to multivariable time-to-event analyses with right-censored endpoints, possibly including time-varying exposures or covariates. The proposed approach employs data-driven simulations, which preserve important features of the data at hand while offering flexibility in controlling the parameters and assumptions that may affect the results. First, the steps required to perform data-driven simulations are described, and then two examples of real-world time-to-event analyses illustrate their implementation and the insights they may offer. The first example focuses on the omission of an important time-invariant predictor of the outcome in a prognostic study of cancer mortality, and permits separating the expected impact of confounding bias from noncollapsibility. The second example assesses how imprecise timing of an interval-censored event—ascertained only at sparse times of clinic visits—affects its estimated association with a time-varying drug exposure. The simulation results also provide a basis for comparing the performance of two alternative strategies for imputing the unknown event times in this setting. The R scripts that permit the reproduction of our examples are provided.

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine

J. Kostin, F. Krahmer and D. Stöger.
How robust is randomized blind deconvolution via nuclear norm minimization against adversarial noise?
Applied and Computational Harmonic Analysis 76.101746 (Apr. 2025). DOI

Abstract

In this paper, we study the problem of recovering two unknown signals from their convolution, which is commonly referred to as blind deconvolution. Reformulation of blind deconvolution as a low-rank recovery problem has led to multiple theoretical recovery guarantees in the past decade due to the success of the nuclear norm minimization heuristic. In particular, in the absence of noise, exact recovery has been established for sufficiently incoherent signals contained in lower-dimensional subspaces. However, if the convolution is corrupted by additive bounded noise, the stability of the recovery problem remains much less understood. In particular, existing reconstruction bounds involve large dimension factors and therefore fail to explain the empirical evidence for dimension-independent robustness of nuclear norm minimization. Recently, theoretical evidence has emerged for ill-posed behavior of low-rank matrix recovery for sufficiently small noise levels. In this work, we develop improved recovery guarantees for blind deconvolution with adversarial noise which exhibit square-root scaling in the noise level. Hence, our results are consistent with existing counterexamples which speak against linear scaling in the noise level as demonstrated for related low-rank matrix recovery problems.

MCML Authors

Felix Krahmer

Prof. Dr.

A2 | Mathematical Foundations
→ Group Gitta Kutyniok

Optimization & Data Analysis

H. Boche, A. Fono and G. Kutyniok.
Mathematical Algorithm Design for Deep Learning under Societal and Judicial Constraints: The Algorithmic Transparency Requirement.
Applied and Computational Harmonic Analysis 77.101763 (Jun. 2025). DOI

Abstract

Deep learning still has drawbacks in terms of trustworthiness, which describes a comprehensible, fair, safe, and reliable method. To mitigate the potential risk of AI, clear obligations associated to trustworthiness have been proposed via regulatory guidelines, e.g., in the European AI Act. Therefore, a central question is to what extent trustworthy deep learning can be realized. Establishing the described properties constituting trustworthiness requires that the factors influencing an algorithmic computation can be retraced, i.e., the algorithmic implementation is transparent. Motivated by the observation that the current evolution of deep learning models necessitates a change in computing technology, we derive a mathematical framework which enables us to analyze whether a transparent implementation in a computing model is feasible. We exemplarily apply our trustworthiness framework to analyze deep learning approaches for inverse problems in digital and analog computing models represented by Turing and Blum-Shub-Smale Machines, respectively. Based on previous results, we find that Blum-Shub-Smale Machines have the potential to establish trustworthy solvers for inverse problems under fairly general conditions, whereas Turing machines cannot guarantee trustworthiness to the same degree.

MCML Authors

Adalbert Fono

Mathematical Foundations of Artificial Intelligence

Gitta Kutyniok

Prof. Dr.

C3 | Physics and Geo Sciences
→ Group Daniel Grün

Mathematical Foundations of Artificial Intelligence

J. Homer, O. Friedrich and D. Grün.
Simulation-based inference has its own Dodelson-Schneider effect (but it knows that it does).
Astronomy & Astrophysics 699.A213 (Jul. 2025). DOI

Abstract

Making inferences about physical properties of the Universe requires knowledge of the data likelihood. A Gaussian distribution is commonly assumed for the uncertainties with a covariance matrix estimated from a set of simulations. The noise in such covariance estimates causes two problems: it distorts the width of the parameter contours, and it adds scatter to the location of those contours which is not captured by the widths themselves. For non-Gaussian likelihoods, an approximation may be derived via Simulation-Based Inference (SBI). It is often implicitly assumed that parameter constraints from SBI analyses, which do not use covariance matrices, are not affected by the same problems as parameter estimation with a covariance matrix estimated from simulations. We investigate whether SBI suffers from effects similar to those of covariance estimation in Gaussian likelihoods. We use Neural Posterior and Likelihood Estimation with continuous and masked autoregressive normalizing flows for density estimation. We fit our approximate posterior models to simulations drawn from a Gaussian linear model, so that the SBI result can be compared to the true posterior. We test linear and neural network based compression, demonstrating that neither methods circumvent the issues of covariance estimation. SBI suffers an inflation of posterior variance that is equal or greater than the analytical result in covariance estimation for Gaussian likelihoods for the same number of simulations. The assumption that SBI requires a smaller number of simulations than covariance estimation for a Gaussian likelihood analysis is inaccurate. The limitations of traditional likelihood analysis with simulation-based covariance remain for SBI with a finite simulation budget. Despite these issues, we show that SBI correctly draws the true posterior contour given enough simulations.

MCML Authors

Jed Homer

Astrophysics, Cosmology and Artificial Intelligence

Daniel Grün

Prof. Dr.

Astrophysics, Cosmology and Artificial Intelligence

F. Bortolussi, H. Sandström, F. Partovi, J. Mikkilä, P. Rinke and M. Rissanen.
Technical note: Towards atmospheric compound identification in chemical ionization mass spectrometry with pesticide standards and machine learning.
Atmospheric Chemistry and Physics 25.1 (Jan. 2025). DOI

Abstract

Chemical ionization mass spectrometry (CIMS) is widely used in atmospheric chemistry studies. However, due to the complex interactions between reagent ions and target compounds, chemical understanding remains limited and compound identification difficult. In this study, we apply machine learning to a reference dataset of pesticides in two standard solutions to build a model that can provide insights from CIMS analyses in atmospheric science. The CIMS measurements were performed with an Orbitrap mass spectrometer coupled to a thermal desorption multi-scheme chemical ionization inlet unit (TD-MION-MS) with both negative and positive ionization modes utilizing Br−, , H3O+ and (CH3)2COH+ (AceH+) as reagent ions. We then trained two machine learning methods on these data: (1) random forest (RF) for classifying if a pesticide can be detected with CIMS and (2) kernel ridge regression (KRR) for predicting the expected CIMS signals. We compared their performance on five different representations of the molecular structure: the topological fingerprint (TopFP), the molecular access system keys (MACCS), a custom descriptor based on standard molecular properties (RDKitPROP), the Coulomb matrix (CM) and the many-body tensor representation (MBTR). The results indicate that MACCS outperforms the other descriptors. Our best classification model reaches a prediction accuracy of 0.85 ± 0.02 and a receiver operating characteristic curve area of 0.91 ± 0.01. Our best regression model reaches an accuracy of 0.44 ± 0.03 logarithmic units of the signal intensity. Subsequent feature importance analysis of the classifiers reveals that the most important sub-structures are NH and OH for the negative ionization schemes and nitrogen-containing groups for the positive ionization schemes.

MCML Authors

Patrick Rinke

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

AI-based Material Science

L. Burk, A. Bender and M. N. Wright.
High-Dimensional Variable Selection With Competing Events Using Cooperative Penalized Regression.
Biometrical Journal 67.1 (Feb. 2025). DOI

Abstract

Variable selection is an important step in the analysis of high-dimensional data, yet there are limited options for survival outcomes in the presence of competing risks. Commonly employed penalized Cox regression considers each event type separately through cause-specific models, neglecting possibly shared information between them. We adapt the feature-weighted elastic net (fwelnet), an elastic net generalization, to survival outcomes and competing risks. For two causes, our proposed algorithm fits two alternating cause-specific models, where each model receives the coefficient vector of the complementary model as prior information. We dub this ‘‘cooperative penalized regression’’, as it enables the modeling of competing risk data with cause-specific models while accounting for shared effects between causes. Coefficients that are shrunken toward zero in the model for the first cause will receive larger penalization weights in the model for the second cause and vice versa. Through multiple iterations, this process ensures stronger penalization of uninformative predictors in both models. We demonstrate our method’s variable selection capabilities on simulated genomics data and apply it to bladder cancer microarray data. We evaluate selection performance using the positive predictive value for the correct selection of informative features and the false positive rate for the selection of uninformative variables. The benchmark compares results with cause-specific penalized Cox regression, random survival forests, and likelihood-boosted Cox regression. Results indicate that our approach is more effective at selecting informative features and removing uninformative features. In settings without shared effects, variable selection performance is similar to cause-specific penalized Cox regression.

MCML Authors

Lukas Burk

Statistical Learning and Data Science

Andreas Bender

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Machine Learning Consulting Unit (MLCU)

M. Wünsch, C. Sauer, M. Herrmann, L. C. Hinske and A.-L. Boulesteix.
To tweak or not to tweak. How exploiting flexibilities in gene set analysis leads to over-optimism.
Biometrical Journal 67.1 (Feb. 2025). DOI

Abstract

Gene set analysis, a popular approach for analyzing high-throughput gene expression data, aims to identify sets of genes that show enriched expression patterns between two conditions. In addition to the multitude of methods available for this task, users are typically left with many options when creating the required input and specifying the internal parameters of the chosen method. This flexibility can lead to uncertainty about the “right” choice, further reinforced by a lack of evidence-based guidance. Especially when their statistical experience is scarce, this uncertainty might entice users to produce preferable results using a ’trial-and-error’ approach. While it may seem unproblematic at first glance, this practice can be viewed as a form of ‘cherry-picking’ and cause an optimistic bias, rendering the results nonreplicable on independent data. After this problem has attracted a lot of attention in the context of classical hypothesis testing, we now aim to raise awareness of such overoptimism in the different and more complex context of gene set analyses. We mimic a hypothetical researcher who systematically selects the analysis variants yielding their preferred results, thereby considering three distinct goals they might pursue. Using a selection of popular gene set analysis methods, we tweak the results in this way for two frequently used benchmark gene expression data sets. Our study indicates that the potential for overoptimism is particularly high for a group of methods frequently used despite being commonly criticized. We conclude by providing practical recommendations to counter overoptimism in research findings in gene set analysis and beyond.

MCML Authors

Milena Wünsch

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Christina Sauer (née Nießl)

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine

Q. Li, H. Taubenböck and X. Zhu.
Identification of the potential for roof greening using remote sensing and deep learning.
Cities 159.105782 (Apr. 2025). DOI

Abstract

Under the mounting pressure from global warming, green roofs emerge as a valuable source for climate adaptation, particularly in compact metropolises where green space is limited. Consequently, there is a need to quantitatively evaluate the potential for roof greening where it is most needed and suitable. Despite the increasing importance of this issue, there have been limited studies on the effectiveness of remote sensing and deep learning in identifying the potential for roof greening in many cities. To address this, we have created a GreenRoof dataset, comprising approximately 6400 pairs of remote sensing images and corresponding masks of roofs with high greening potential in four European cities. Afterward, we exploit the capabilities of deep learning methods to identify roofs that are suitable for greening from remote sensing images. Using 15 German cities as a case study for future urban rooftop planning, we estimate the spatial potential for retrofitting green roofs. Structural parameters for prioritizing green roof implementation include vegetation coverage, thermal environment, and building density. Results indicate that the total area suitable for green roof retrofitting exceeds 20% of the roof area in the 15 German cities examined. The spatial analysis effectively reflects variation in demand and suitability for green roof retrofitting across different cities. In conclusion, this study provides a versatile screening approach utilizing remote sensing, deep learning, and spatial analysis, which can be readily adapted to inform municipal policies in other cities aiming to promote green roofs and enhance sustainable urban development.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

M. Fornasier and L. Sun.
A PDE Framework of Consensus-Based Optimization for Objectives with Multiple Global Minimizers.
Communications in Partial Differential Equations 50.4 (Feb. 2025). DOI

Abstract

Introduced in 2017, Consensus-Based Optimization (CBO) has rapidly emerged as a significant breakthrough in global optimization. This straightforward yet powerful multi-particle, zero-order optimization method draws inspiration from Simulated Annealing and Particle Swarm Optimization. Using a quantitative mean-field approximation, CBO dynamics can be described by a nonlinear Fokker-Planck equation with degenerate diffusion, which does not follow a gradient flow structure. In this paper, we demonstrate that solutions to the CBO equation remain positive and maintain full support. Building on this foundation, we establish the { unconditional} global convergence of CBO methods to global minimizers. Our results are derived through an analysis of solution regularity and the proof of existence for smooth, classical solutions to a broader class of drift-diffusion equations, despite the challenges posed by degenerate diffusion.

MCML Authors

Massimo Fornasier

Prof. Dr.

A2 | Mathematical Foundations
→ Group Massimo Fornasier

Applied Numerical Analysis

Lukang Sun

Applied Numerical Analysis

A. Triantafyllopoulos, A. Spiesberger, I. Tsangko, X. Jing, V. Distler, F. Dietz, F. Alt and B. W. Schuller.
Vishing: Detecting social engineering in spoken communication — A first survey & urgent roadmap to address an emerging societal challenge.
Computer Speech and Language 94.101802 (Nov. 2025). DOI

Abstract

Vishing – the use of voice calls for phishing – is a form of Social Engineering (SE) attacks. The latter have become a pervasive challenge in modern societies, with over 300,000 yearly victims in the US alone. An increasing number of those attacks is conducted via voice communication, be it through machine-generated ‘robocalls’ or human actors. The goals of ‘social engineers’ can be manifold, from outright fraud to more subtle forms of persuasion. Accordingly, social engineers adopt multi-faceted strategies for voice-based attacks, utilising a variety of ‘tricks’ to exert influence and achieve their goals. Importantly, while organisations have set in place a series of guardrails against other types of SE attacks, voice calls still remain ‘open ground’ for potential bad actors. In the present contribution, we provide an overview of the existing speech technology subfields that need to coalesce into a protective net against one of the major challenges to societies worldwide. Given the dearth of speech science and technology works targeting this issue, we have opted for a narrative review that bridges the gap between the existing psychological literature on the topic and research that has been pursued in parallel by the speech community on some of the constituent constructs. Our review reveals that very little literature exists on addressing this very important topic from a speech technology perspective, an omission further exacerbated by the lack of available data. Thus, our main goal is to highlight this gap and sketch out a roadmap to mitigate it, beginning with the psychological underpinnings of vishing, which primarily include deception and persuasion strategies, continuing with the speech-based approaches that can be used to detect those, as well as the generation and detection of AI-based vishing attempts, and close with a discussion of ethical and legal considerations.

MCML Authors

Andreas Triantafyllopoulos

Health Informatics

Anika Spiesberger

Health Informatics

Iosif Tsangko

Health Informatics

Xin Jing

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics

Z. Yuan, Z. Xiong, L. Mou and X. Zhu.
ChatEarthNet: a global-scale image–text dataset empowering vision–language geo-foundation models.
Earth System Science Data 17.3 (Mar. 2025). DOI

Abstract

The rapid development of remote sensing technology has led to an exponential growth in satellite images, yet their inherent complexity often makes them difficult for non-expert users to understand. Natural language, as a carrier of human knowledge, can bridge the gap between common users and complicated satellite imagery. Additionally, when paired with visual data, natural language can be utilized to train large vision–language foundation models, significantly improving performance in various tasks. Despite these advancements, the remote sensing community still faces a challenge due to the lack of large-scale, high-quality vision–language datasets for satellite images. To address this challenge, we introduce a new image–text dataset, providing high-quality natural language descriptions for global-scale satellite data. Specifically, we utilize Sentinel-2 data for its global coverage as the foundational image source, employing semantic segmentation labels from the European Space Agency’s WorldCover project to enrich the descriptions of land cover types. By conducting in-depth semantic analysis, we formulate detailed prompts to elicit rich descriptions from ChatGPT. We then include a manual verification process to enhance the dataset’s quality further. This step involves manual inspection and correction to refine the dataset. Finally, we offer the community ChatEarthNet, a large-scale image–text dataset characterized by global coverage, high quality, wide-ranging diversity, and detailed descriptions. ChatEarthNet consists of 163 488 image–text pairs with captions generated by ChatGPT-3.5 and an additional 10 000 image–text pairs with captions generated by ChatGPT-4V(ision). This dataset has significant potential for both training and evaluating vision–language geo-foundation models for remote sensing. The code is publicly available at https://doi.org/10.5281/zenodo.11004358 (Yuan et al., 2024b), and the ChatEarthNet dataset is available at https://doi.org/10.5281/zenodo.11003436 (Yuan et al., 2024c).

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

D. Tschernutter and S. Feuerriegel.
Data-driven dynamic police patrolling: An efficient Monte Carlo tree search.
European Journal of Operational Research 321.1 (Feb. 2025). DOI

Abstract

Crime is responsible for major financial losses and serious harm to the well-being of individuals, and, hence, a crucial task of police operations is effective patrolling. Yet, in existing decision models aimed at police operations, microscopic routing decisions from patrolling are not considered, and, furthermore, the objective is limited to surrogate metrics (e. g., response time) instead of crime prevention. In this paper, we thus formalize the decision problem of dynamic police patrolling as a Markov decision process that models microscopic routing decisions, so that the expected number of prevented crimes are maximized. We experimentally show that standard solution approaches for our decision problem are not scalable to real-world settings. As a remedy, we present a tailored and highly efficient Monte Carlo tree search algorithm. We then demonstrate our algorithm numerically using real-world crime data from Chicago and show that the decision-making by our algorithm offers significant improvements for crime prevention over patrolling tactics from current practice. Informed by our results, we finally discuss implications for improving the patrolling tactics in police operations.

MCML Authors

Stefan Feuerriegel

Prof. Dr.

C4 | Computational Social Sciences
→ Group Stefan Feuerriegel

Artificial Intelligence in Management

A. Maarouf, S. Feuerriegel and N. Pröllochs.
A fused large language model for predicting startup success.
European Journal of Operational Research 322.1 (Apr. 2025). DOI

Abstract

Investors are continuously seeking profitable investment opportunities in startups and, hence, for effective decision-making, need to predict a startup’s probability of success. Nowadays, investors can use not only various fundamental information about a startup (e.g., the age of the startup, the number of founders, and the business sector) but also textual description of a startup’s innovation and business model, which is widely available through online venture capital (VC) platforms such as Crunchbase. To support the decision-making of investors, we develop a machine learning approach with the aim of locating successful startups on VC platforms. Specifically, we develop, train, and evaluate a tailored, fused large language model to predict startup success. Thereby, we assess to what extent self-descriptions on VC platforms are predictive of startup success. Using 20,172 online profiles from Crunchbase, we find that our fused large language model can predict startup success, with textual self-descriptions being responsible for a significant part of the predictive power. Our work provides a decision support tool for investors to find profitable investment opportunities.

MCML Authors

Abdurahman Maarouf

Artificial Intelligence in Management

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management

A. T. Stüber, M. M. Heimer, J. Ta, M. P. Fabritius, B. F. Hoppe, G. Sheikh, M. Brendel, L. Unterrainer, P. Jurmeister, A. Tufman, J. Ricke, C. C. Cyran and M. Ingrisch.
Replication study of PD-L1 status prediction in NSCLC using PET/CT radiomics.
European Journal of Radiology 183.111825 (Feb. 2025). DOI

Abstract

This study investigates the predictive capability of radiomics in determining programmed cell death ligand 1 (PD-L1) expression (>=1%) status in non-small cell lung cancer (NSCLC) patients using a newly collected [18F]FDG PET/CT dataset. We aimed to replicate and validate the radiomics-based machine learning (ML) model proposed by Zhao et al. [2] predicting PD-L1 status from PET/CT-imaging.
An independent cohort of 254 NSCLC patients underwent [18F]FDG PET/CT imaging, with primary tumor segmentation conducted using lung tissue window (LTW) and more conservative soft tissue window (STW) methods. Radiomics models (“Rad-score” and “complex model”) and a clinical-stage model from Zhao et al. were evaluated via 10-fold cross-validation and AUC analysis, alongside a benchmark-study comparing different ML-model pipelines. Clinicopathological data were collected from medical records.
On our data, the Rad-score model yielded mean AUCs of 0.593 (STW) and 0.573 (LTW), below Zhao et al.’s 0.761. The complex model achieved mean AUCs of 0.505 (STW) and 0.519 (LTW), lower than Zhao et al.’s 0.769. The clinical model showed a mean AUC of 0.555, below Zhao et al.’s 0.64. All models performed significantly lower than Zhao et al.’s findings. Our benchmark study on four ML pipelines revealed consistently low performance across all configurations.
Our study failed to replicate original findings, suggesting poor model performance and questioning predictive value of radiomics features in classifying PD-L1 expression from PET/CT imaging. These results highlight challenges in replicating radiomics-based ML models and stress the need for rigorous validation

MCML Authors

Theresa Stüber

Clinical Data Science in Radiology

Boj Friedrich Hoppe

Dr.

Clinical Data Science in Radiology

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology

L. Nas, B. F. Hoppe, A. T. Stüber, S. Grosu, N. Fink, A. von Fragstein, J. Rudolph, J. Ricke and B. O. Sabel.
Optimizing lower extremity CT angiography: A prospective study of individualized vs. fixed post-trigger delays in bolus tracking.
European Journal of Radiology 185.112009 (Apr. 2025). DOI

Abstract

Purpose: To compare the contrast media opacification and diagnostic quality in lower-extremity runoff CT angiography (CTA) between bolus-tracking using conventional fixed trigger delay and patient-specific individualized post-trigger delay.
Methods: In this prospective study, lower-extremity runoff CTA was performed in two cohorts, using either fixed or individualized trigger delay. Both cohorts had identical CT protocols, contrast media applications, and image reconstructions. Objective image quality (mean contrast opacification in HU), and subjective image quality (5-point Likert-scale), were assessed in six vessels: abdominal aorta (AA), common iliac artery (CIA), superficial femoral artery (SFA), popliteal artery (PA), posterior tibial artery (PTA), and dorsalis pedis artery (DPA) by one rater for objective and two raters for subjective image quality. Objective image quality was analyzed using Student t-tests, while subjective ratings were compared with Fisher’s exact test.
Results: Overall, 65 patients were included (mean age: 71 ± 14; 39 men), 35 in the individualized cohort and 30 in the fixed cohort. No differences were found between the groups regarding demographics or radiation exposure. Individualized trigger delay ranged from 2 to 23 s (mean: 8.7 ± 4.0 s) and was 10 s in the fixed cohort. The individualized cohort showed higher opacification in the peripheral arteries (PTA: 479 ± 140 HU vs. 379 ± 106 HU; p = 0.009; DPA: 477 ± 191 HU vs. 346 ± 137 HU; p = 0.009). Overall subjective “image quality” was rated higher in the individualized group (“excellent” or “good” in Rater 1: 97% vs. 57%; p < 0.001; and Rater 2: 89% vs. 53%; p = 0.002).
Conclusion: Individualized post-trigger delay enhances diagnostic quality, by improving vessel opacification in peripheral arteries and increasing subjective image quality in lower extremity runoff CTA.

MCML Authors

Boj Friedrich Hoppe

Dr.

Clinical Data Science in Radiology

Theresa Stüber

Clinical Data Science in Radiology

S. Grosu, M. P. Fabritius, M. Winkelmann, D. Puhr-Westerheide, M. Ingenerf, S. Maurus, A. Graser, C. Schulz, T. Knösel, C. C. Cyran, J. Ricke, P. M. Kazmierczak, M. Ingrisch and P. Wesp.
Effect of artificial intelligence-aided differentiation of adenomatous and non-adenomatous colorectal polyps at CT colonography on radiologists’ therapy management.
European Radiology Early Access (Jan. 2025). DOI

Abstract

Objectives: Adenomatous colorectal polyps require endoscopic resection, as opposed to non-adenomatous hyperplastic colorectal polyps. This study aims to evaluate the effect of artificial intelligence (AI)-assisted differentiation of adenomatous and non-adenomatous colorectal polyps at CT colonography on radiologists’ therapy management.
Materials and methods: Five board-certified radiologists evaluated CT colonography images with colorectal polyps of all sizes and morphologies retrospectively and decided whether the depicted polyps required endoscopic resection. After a primary unassisted reading based on current guidelines, a second reading with access to the classification of a radiomics-based random-forest AI-model labelling each polyp as ’non-adenomatous’ or ‘adenomatous’ was performed. Performance was evaluated using polyp histopathology as the reference standard.
Results: 77 polyps in 59 patients comprising 118 polyp image series (47% supine position, 53% prone position) were evaluated unassisted and AI-assisted by five independent board-certified radiologists, resulting in a total of 1180 readings (subsequent polypectomy: yes or no). AI-assisted readings had higher accuracy (76% +/− 1% vs. 84% +/− 1%), sensitivity (78% +/− 6% vs. 85% +/− 1%), and specificity (73% +/− 8% vs. 82% +/− 2%) in selecting polyps eligible for polypectomy (p < 0.001). Inter-reader agreement was improved in the AI-assisted readings (Fleiss’ kappa 0.69 vs. 0.92).
Conclusion: AI-based characterisation of colorectal polyps at CT colonography as a second reader might enable a more precise selection of polyps eligible for subsequent endoscopic resection. However, further studies are needed to confirm this finding and histopathologic polyp evaluation is still mandatory.

MCML Authors

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology

Philipp Wesp

Dr.

Clinical Data Science in Radiology

L. Mamede, R. C. Sabàb, S. Van Coillie, J. Prevot, S. Sánchez-Ramón, C. Poli, A. Barasa, B. W. Schuller, A. Hendel, N. Garcelon, C. Boersma, P. Lee, C. Booth, L. D. Notarangelo, J. Drabwell, N. L. Rider, F. Staal, S. O. Burns, M. van Hagen, M. Pergrnt, J. G. Rivière and N. Mahlaoui.
Navigating disruption in the PID landscape: embracing opportunities and anticipating threats in the next ten years.
Frontiers in Immunology 16 (May. 2025). DOI

Abstract

The International Patient Organisation for Primary Immunodeficiencies (IPOPI) held its third edition of the Global Multi-Stakeholders’ Summit, gathering key primary immunodeficiencies (PID) stakeholders and experts to discuss and foment global collaboration. This edition focused on the impact of genomic medicine in PID treatment, the role of digital health, including artificial intelligence, in PID care, and how to anticipate and minimise risks to ensure optimal patient access to care. These discussions aimed to examine current hurdles and brainstorm feasible solutions and priorities for the PID community in these areas in the next ten years. These discussions led to recommendations for comprehensive approaches to care and access to treatment for PID patients, suggesting actions that will bring the community closer to treatments based on real-world evidence and adjusted to patient’s needs. To accomplish this, collaboration between academia, industry, regulatory authorities, and patients is crucial.

MCML Authors

Björn Schuller

Prof. Dr.

C1 | Medicine
→ Group Daniel Rückert

Health Informatics

C. S. Vetter, A. Bender, D. B. Dwyer, M. Montembeault, A. Ruef, K. Chrisholm, L. Kambeitz-Ilankovic, L. A. Antonucci, S. Ruhrmann, J. Kambeitz, M. Lichtenstein, A. Riecher, R. Upthegrove, R. K. R. Salokangas, J. Hietala, C. Pantelis, R. Lencer, E. Meisenzahl, S. Wood, P. Brambilla, S. Borgwardt, P. Falkai, A. Bertolino, N. Koutsouleris and PRONIA Consortium.
Exploring the Predictive Value of Structural Covariance Networks for the Diagnosis of Schizophrenia.
Frontiers in Psychiatry 16 (Jun. 2025). DOI

Abstract

Schizophrenia is a psychiatric disorder hypothesized to result from disturbed brain connectivity. Structural covariance networks (SCN) describe the shared variation in morphological properties emerging from coordinated neurodevelopmental processes and may, thus, be a promising diagnostic biomarker for schizophrenia.We compared the diagnostic value of two SCN computation methods derived from regional gray matter volume (GMV) in 154 patients with a diagnosis of first episode psychosis or recurrent schizophrenia (PAT) and 366 healthy control individuals (HC). The first method (REF-SCN) quantifies the contribution of an individual to a normative reference group’s SCN, and the second approach (KLS-SCN) uses a symmetric version of Kulback-Leibler divergence. Their diagnostic value compared to regional GMV was assessed in a stepwise analysis using a series of linear support vector machines within a nested cross-validation framework and stacked generalization, all models were externally validated in an independent sample (NPAT=71, NHC=74), SCN feature importance was assessed, and the derived risk scores were analyzed for differential relationships with clinical variables.We found that models trained on SCNs were able to classify patients with schizophrenia and combining SCNs and regional GMV in a stacked model improved training (balanced accuracy (BAC)=69.96%) and external validation performance (BAC=67.10%). Among all unimodal models, the highest discovery sample performance was achieved by a model trained on REF-SCN (balanced accuracy (BAC=67.03%). All model decisions were driven by widespread structural covariance alterations involving the somato-motor, default mode, control, visual, and the ventral attention networks. Risk estimates derived from KLS-SCNs and regional GMV, but not REF-SCNs, could be predicted from clinical variables, especially driven by body mass index (BMI) and affect-related negative symptoms. These patterns of results show that different SCN computation approaches capture different aspects of the disease. While REF-SCNs contain valuable information for discriminating schizophrenia from healthy control individuals, KLS-SCNs may capture more nuanced symptom-level characteristics similar to those captured by PCA of regional GMV.

MCML Authors

Clara Sophie Vetter

Artificial Intelligence in Healthcare and Medicine

M. Milling, S. D. Rampp, A. Triantafyllopoulos, M. P. Plaza, J. O. Brunner, C. Traidl-Hoffmann, B. W. Schuller and A. Damialis.
Automating airborne pollen classification: Identifying and interpreting hard samples for classifiers.
Heliyon 11.2 (Jan. 2025). DOI GitHub

Abstract

Deep-learning-based classification of pollen grains has been a major driver towards automatic monitoring of airborne pollen. Yet, despite an abundance of available datasets, little effort has been spent to investigate which aspects pose the biggest challenges to the (often black-box- resembling) pollen classification approaches. To shed some light on this issue, we conducted a sample-level difficulty analysis based on the likelihood for one of the largest automatically-generated datasets of pollen grains on microscopy images and investigated the reason for which certain airborne samples and specific pollen taxa pose particular problems to deep learning algorithms. It is here concluded that the main challenges lie in A) the (partly) co-occurring of multiple pollen grains in a single image, B) the occlusion of specific markers through the 2D capturing of microscopy images, and C) for some taxa, a general lack of salient, unique features.

MCML Authors

Manuel Milling

Health Informatics

Andreas Triantafyllopoulos

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics

F. Tian, H. Zhang, Y. Tan, L. Zhu, L. Shen, K. Qian, B. Hu, B. W. Schuller and Y. Yamamoto.
An On-Board Executable Multi-Feature Transfer-Enhanced Fusion Model for Three-Lead EEG Sensor-Assisted Depression Diagnosis.
IEEE Journal of Biomedical and Health Informatics 29.1 (Jan. 2025). DOI

Abstract

The development of affective computing and medical electronic technologies has led to the emergence of Artificial Intelligence (AI)-based methods for the early detection of depression. However, previous studies have often overlooked the necessity for the AI-assisted diagnosis system to be wearable and accessible in practical scenarios for depression recognition. In this work, we present an on-board executable multi-feature transfer-enhanced fusion model for our custom-designed wearable three-lead Electroencephalogram (EEG) sensor, based on EEG data collected from 73 depressed patients and 108 healthy controls. Experimental results show that the proposed model exhibits low-computational complexity (65.0 K parameters), promising Floating-Point Operations (FLOPs) performance (25.6 M), real-time processing (1.5 s/execution), and low power consumption (320.8 mW). Furthermore, it requires only 202.0 KB of Random Access Memory (RAM) and 279.6 KB of Read-Only Memory (ROM) when deployed on the EEG sensor. Despite its low computational and spatial complexity, the model achieves a notable classification accuracy of 95.2%, specificity of 94.0%, and sensitivity of 96.9% under independent test conditions. These results underscore the potential of deploying the model on the wearable three-lead EEG sensor for assisting in the diagnosis of depression.

MCML Authors

Björn Schuller

Prof. Dr.

Health Informatics

W. Qi, X. Xu, K. Qian, B. W. Schuller, G. Fortino and A. Aliverti.
A Review of AIoT-Based Human Activity Recognition: From Application to Technique.
IEEE Journal of Biomedical and Health Informatics 29.4 (Apr. 2025). DOI

Abstract

This scoping review paper redefines the Artificial Intelligence-based Internet of Things (AIoT) driven Human Activity Recognition (HAR) field by systematically extrapolating from various application domains to deduce potential techniques and algorithms. We distill a general model with adaptive learning and optimization mechanisms by conducting a detailed analysis of human activity types and utilizing contact or non-contact devices. It presents various system integration mathematical paradigms driven by multimodal data fusion, covering predictions of complex behaviors and redefining valuable methods, devices, and systems for HAR. Additionally, this paper establishes benchmarks for behavior recognition across different application requirements, from simple localized actions to group activities. It summarizes open research directions, including data diversity and volume, computational limitations, interoperability, real-time recognition, data security, and privacy concerns. Finally, we aim to serve as a comprehensive and foundational resource for researchers delving into the complex and burgeoning realm of AIoT-enhanced HAR, providing insights and guidance for future innovations and developments.

MCML Authors

Björn Schuller

Prof. Dr.

Health Informatics

X. Qiu, W. Qiu, Y. Zhang, K. Qian, C. Li, B. Hu, B. W. Schuller and Y. Yamamoto.
FedKDC: Consensus-Driven Knowledge Distillation for Personalized Federated Learning in EEG-Based Emotion Recognition.
IEEE Journal of Biomedical and Health Informatics Early Access (Apr. 2025). DOI GitHub

Abstract

Federated learning (FL) has gained prominence in electroencephalogram (EEG)-based emotion recognition because of its ability to enable secure collaborative training without centralized data. However, traditional FL faces challenges due to model and data heterogeneity in smart healthcare settings. For example, medical institutions have varying computational resources, which creates a need for personalized local models. Moreover, EEG data from medical institutions typically face data heterogeneity issues stemming from limitations in participant availability, ethical constraints, and cultural differences among subjects, which can slow model convergence and degrade model performance. To address these challenges, we propose FedKDC, a novel FL framework that incorporates clustered knowledge distillation (CKD). This method introduces a consensus-based distributed learning mechanism to facilitate the clustering process. It then enhances the convergence speed through intraclass distillation and reduces the negative impact of heterogeneity through interclass distillation. Additionally, we introduce a DriftGuard mechanism to mitigate client drift, along with an entropy reducer to decrease the entropy of aggregated knowledge. The framework is validated on the SEED, SEED-IV, SEED-FRA, and SEED-GER datasets, demonstrating its effectiveness in scenarios where both the data and the models are heterogeneous. Experimental results show that FedKDC outperforms other FL frameworks in emotion recognition, achieving a maximum average accuracy of 85.2%, and in convergence efficiency, with faster and more stable convergence.

MCML Authors

Björn Schuller

Prof. Dr.

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Health Informatics

J. Beck, L. M. Kemeter, K. Dürrbeck, M. H. I. Abdalla and F. Kreuter.
Toward Integrating ChatGPT Into Satellite Image Annotation Workflows: A Comparison of Label Quality and Costs of Human and Automated Annotators.
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 18 (Jan. 2025). DOI

Abstract

High-quality annotations are a critical success factor for machine learning (ML) applications. To achieve this, we have
traditionally relied on human annotators, navigating the challenges of limited budgets and the varying task-specific expertise, costs, and availability. Since the emergence of Large Language Models (LLMs), their popularity for generating automated annotations has grown, extending possibilities and complexity of designing an efficient annotation strategy. Increasingly, computer vision capabilities have been integrated into general-purpose LLMs like ChatGPT. This raises the question of how effectively LLMs can be used in satellite image annotation tasks and how they compare to traditional annotator types. This study presents a comprehensive investigation and comparison of various human and automated annotators for image classification. We evaluate the feasibility and economic competitiveness of using the ChatGPT4-V model for a complex land usage annotation task and compare it with alternative human annotators. A set of satellite images is annotated by a domain expert and 15 additional human and automated annotators, differing in expertise and costs. Our analyses examine the annotation quality loss between the expert and other annotators. This comparison is conducted through (1) descriptive analyses, (2) fitting linear probability models, and (3) comparing F1-scores. Ultimately, we simulate annotation strategies where samples are split according to an automatically assigned certainty score. Routing low-certainty images to human annotators can cut total annotation costs by over 50% with minimal impact on label quality. We discuss implications regarding the economic competitiveness of annotation strategies, prompt engineering and the task-specificity of expertise.

MCML Authors

Jacob Beck

Social Data Science and AI

Frauke Kreuter

Prof. Dr.

Social Data Science and AI

D. Zhao, M. Asgarimehr, K. Heidler, J. Wickert, X. Zhu and L. Mou.
Deep Learning-Based GNSS-R Global Vegetation Water Content: Dataset, Estimation, and Uncertainty.
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing Early Access (Jun. 2025). DOI

Abstract

Vegetation water content (VWC) is a crucial parameter for understanding vegetation dynamics and hydrological cycle on Earth. With rapid climate changes in recent years, monitoring VWC with high spatiotemporal coverage on a global scale is of paramount importance. Yet, traditional in situ measurements are constrained in remote and densely vegetated regions. Additionally, existing spaceborne remote sensing methods face challenges due to poor cloud penetration capabilities, soil moisture interference, and inadequate temporal resolution. Spaceborne global navigation satellite system reflectometry (GNSS-R) has demonstrated promising potential to overcome these limitations in vegetation monitoring. In this study, we propose a scheme for deep learning-based GNSS-R VWC assessment, leveraging a rapidly growing amount of GNSS-R data with an unprecedented sampling rate. We introduce a triplet dataset, which consists of measurements from the cyclone GNSS (CYGNSS), global land data assimilation system (GLDAS), and soil moisture active passive (SMAP), spanning over three years. Validation is performed using several benchmark models with the proposed dataset. Furthermore, the models’ predictive uncertainty is quantified with Monte Carlo (MC) dropout technique to provide a trustworthy representation of estimations. Experimental evaluation of the models demonstrates good consistency between the estimated VWC and ground truth, with a minimum root mean square deviation (RMSD) of 1.0988 kg/m2 and a bias of 0.002kg/m2 over a twelve-month test period. Moreover, a daily global VWC estimation is achieved through the proposed pipeline, filling the gaps of current products and enabling rapid measurements with enhanced temporal availability. We will make the proposed dataset publicly available.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

S. Garske, K. Heidler, B. Evans, K. Wong and X. Zhu.
SHAZAM: Self-Supervised Change Monitoring for Hazard Detection and Mapping.
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing Early Access (Mar. 2025). DOI GitHub

Abstract

The increasing frequency of environmental hazards due to climate change underscores the urgent need for effective monitoring systems. Current approaches either rely on expensive labelled datasets, struggle with seasonal variations, or require multiple observations for confirmation (which delays detection). To address these challenges, this work presents SHAZAM - Self-Supervised Change Monitoring for Hazard Detection and Mapping. SHAZAM uses a lightweight conditional UNet to generate expected images of a region of interest (ROI) for any day of the year, allowing for the direct modelling of normal seasonal changes and the ability to distinguish potential hazards. A modified structural similarity measure compares the generated images with actual satellite observations to compute region-level anomaly scores and pixel-level hazard maps. Additionally, a theoretically grounded seasonal threshold eliminates the need for dataset-specific optimisation. Evaluated on four diverse datasets that contain bushfires (wildfires), burned regions, extreme and out-of-season snowfall, floods, droughts, algal blooms, and deforestation, SHAZAM achieved F1 score improvements of between 0.066 and 0.234 over existing methods. This was achieved primarily through more effective hazard detection (higher recall) while using only 473K parameters. SHAZAM demonstrated superior mapping capabilities through higher spatial resolution and improved ability to suppress background features while accentuating both immediate and gradual hazards. SHAZAM has been established as an effective and generalisable solution for hazard detection and mapping across different geographical regions and a diverse range of hazards.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

Y. Bi, Y. Su, N. Navab and Z. Jiang.
Gaze-Guided Robotic Vascular Ultrasound Leveraging Human Intention Estimation.
IEEE Robotics and Automation Letters 10.4 (Apr. 2025). DOI

Abstract

Medical ultrasound has been widely used to examine vascular structure in modern clinical practice. However, traditional ultrasound examination often faces challenges related to inter- and intra-operator variation. The robotic ultrasound system (RUSS) appears as a potential solution for such challenges because of its superiority in stability and reproducibility. Given the complex anatomy of human vasculature, multiple vessels often appear in ultrasound images, or a single vessel bifurcates into branches, complicating the examination process. To tackle this challenge, this work presents a gaze-guided RUSS for vascular applications. A gaze tracker captures the eye movements of the operator. The extracted gaze signal guides the RUSS to follow the correct vessel when it bifurcates. Additionally, a gaze-guided segmentation network is proposed to enhance segmentation robustness by exploiting gaze information. However, gaze signals are often noisy, requiring interpretation to accurately discern the operator’s true intentions. To this end, this study proposes a stabilization module to process raw gaze data. The inferred attention heatmap is utilized as a region proposal to aid segmentation and serve as a trigger signal when the operator needs to adjust the scanning target, such as when a bifurcation appears. To ensure appropriate contact between the probe and surface during scanning, an automatic ultrasound confidence-based orientation correction method is developed. In experiments, we demonstrated the efficiency of the proposed gaze-guided segmentation pipeline by comparing it with other methods. Besides, the performance of the proposed gaze-guided RUSS was also validated as a whole on a realistic arm phantom with an uneven surface.

MCML Authors

Yuan Bi

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Zhongliang Jiang

Dr.

Computer Aided Medical Procedures & Augmented Reality

J. Huang, P. K. Yu, N. Navab and B. Busam.
TTAPose: Test-time Adaptation for Unseen Object Pose Estimation.
IEEE Robotics and Automation Letters 10.6 (Apr. 2025). DOI

Abstract

Recent advances in the field of 6D pose estimation of unseen objects not present during training are promising, however, the performance gap between these general methods and object-specific methods remains significant. This paper introduces an innovative unsupervised test-time adaptation method, termed TTAPose, capable of adapting a pose estimator to any unseen object. TTAPose initially undergoes pre-training using a large synthetic dataset and thereafter refines the weights using unsupervised loss conducted on unseen real-world target objects. The network, based on a teacher-student architecture, leverages an RGB-D pose refinement pipeline to incrementally improve pseudo labels. Notably, TTAPose operates with no requirement for target data annotation, thus minimizing time and data expenditure. Experimental results show performance levels comparable to supervised methods, effectively narrowing the gap to object-specific baselines.

MCML Authors

Junwen Huang

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Benjamin Busam

Dr.

B1 | Computer Vision
→ Group Daniel Cremers

Computer Aided Medical Procedures & Augmented Reality

Y. Ma, Q. Khan and D. Cremers.
MA-DV2F: A Multi-Agent Navigation Framework Using Dynamic Velocity Vector Field.
IEEE Robotics and Automation Letters 10.6 (Jun. 2025). DOI GitHub

Abstract

In this paper, we propose MA-DV2F: Multi-Agent Dynamic Velocity Vector Field. It is a framework for simultaneously controlling a group of vehicles in challenging environments. DV2F is generated for each vehicle independently and provides a map of reference orientation and speed that a vehicle must attain at any point on the navigation grid such that it safely reaches its target. The field is dynamically updated depending on the speed and proximity of the ego-vehicle to other agents. This dynamic adaptation of the velocity vector field allows prevention of imminent collisions. Experimental results show that MA-DV2F outperforms concurrent methods in terms of safety, computational efficiency and accuracy in reaching the target when scaling to a large number of vehicles.

MCML Authors

Qadeer Khan

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

B1 | Computer Vision

Computer Vision & Artificial Intelligence

S. Wang, Q. Cheng, Q. Cheng, W. Zhang, S.-C. Wu, N. Zeller, D. Cremers and N. Navab.
VoxNeRF: Bridging Voxel Representation and Neural Radiance Fields for Enhanced Indoor View Synthesis.
IEEE Robotics and Automation Letters 10.6 (Jun. 2025). DOI

Abstract

The generation of high-fidelity view synthesis is essential for robotic navigation and interaction but remains challenging, particularly in indoor environments and real-time scenarios. Existing techniques often require significant computational resources for both training and rendering, and they frequently result in suboptimal 3D representations due to insufficient geometric structuring. To address these limitations, we introduce VoxNeRF, a novel approach that utilizes easy-to-obtain geometry priors to enhance both the quality and efficiency of neural indoor reconstruction and novel view synthesis. We propose an efficient voxel-guided sampling technique that allocates computational resources selectively to the most relevant segments of rays based on a voxel-encoded geometry prior, significantly reducing training and rendering time. Additionally, we incorporate a robust depth loss to improve reconstruction and rendering quality in sparse view settings. Our approach is validated with extensive experiments on ScanNet and ScanNet++ where VoxNeRF outperforms existing state-of-the-art methods and establishes a new benchmark for indoor immersive interpolation and extrapolation settings.

MCML Authors

Sen Wang

B1 | Computer Vision
→ Group Daniel Cremers

Computer Aided Medical Procedures & Augmented Reality

Qing Cheng

Computer Vision & Artificial Intelligence

Qing Cheng

B1 | Computer Vision
→ Group Daniel Cremers

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

B1 | Computer Vision

Computer Vision & Artificial Intelligence

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

J. Xie, Y. Wang, X. Qian, J. Zhang and B. W. Schuller.
Improving Bird Vocalization Recognition in Open-Set Cross-Corpus Scenarios with Semantic Feature Reconstruction and Dual Strategy Scoring.
IEEE Signal Processing Letters 32 (Mar. 2025). DOI

Abstract

Automated recognition of bird vocalizations (BVs) is essential for biodiversity monitoring through passive acoustic monitoring (PAM), yet deep learning (DL) models encounter substantial challenges in open environments. These include difficulties in detecting unknown classes, extracting species-specific features, and achieving robust cross-corpus recognition. To address these challenges, this letter presents a DL-based open-set cross-corpus recognition method for BVs that combines feature construction with open-set recognition (OSR) techniques. We introduce a three-channel spectrogram that integrates both amplitude and phase information to enhance feature representation. To improve the recognition accuracy of known classes across corpora, we employ a class-specific semantic reconstruction model to extract deep features. For unknown class discrimination, we propose a Dual Strategy Coupling Scoring (DSCS) mechanism, which synthesizes the log-likelihood ratio score (LLRS) and reconstruction error score (RES). Our method achieves the highest weighted accuracy among existing approaches on a public dataset, demonstrating its effectiveness for open-set cross-corpus bird vocalization recognition.

MCML Authors

Björn Schuller

Prof. Dr.

Health Informatics

A. Akman, Q. Sun and B. W. Schuller.
Improving Audio Explanations using Audio Language Models.
IEEE Signal Processing Letters Early Access (Jan. 2025). DOI

Abstract

Foundation models are widely utilised for their strong representational capabilities, driven by training on extensive datasets with self-supervised learning. The increasing complexity of these models highlights the importance of interpretability to enhance transparency and improve human understanding of their decision-making processes. Most existing interpretability methods explain model behaviour by attributing importance to individual data elements across different layers, based on their influence on the final prediction. These approaches often emphasise only the most relevant features, overlooking the broader representational space, removing less important features. In this study, we propose a novel framework for explanation generation that serves as an alternative to feature removal, offering a more comprehensive understanding of model behaviour. Our framework leverages the generative abilities of audio language models to replace removed features with contextually appropriate alternatives, providing a more complete view of the model’s decision-making process. Through extensive evaluations on standard benchmarks, including keyword spotting and speech emotion recognition, our approach demonstrates its effectiveness in generating high-quality audio explanations.

MCML Authors

Björn Schuller

Prof. Dr.

Health Informatics

L. Christ, S. Amiriparian, A. Kathan, N. Müller, A. König and B. W. Schuller.
Towards Multimodal Prediction of Spontaneous Humor: A Novel Dataset and First Results.
IEEE Transactions on Affective Computing 16.2 (Apr. 2025). DOI

Abstract

Humor is a substantial element of human social behavior, affect, and cognition. Its automatic understanding can facilitate a more naturalistic human-AI interaction. Current methods of humor detection have been exclusively based on staged data, making them inadequate for ‘real-world’ applications. We contribute to addressing this deficiency by introducing the novel Passau-Spontaneous Football Coach Humor (Passau-SFCH) dataset, comprising about 11 hours of recordings. The Passau-SFCH dataset is annotated for the presence of humor and its dimensions (sentiment and direction) as proposed in Martin’s Humor Style Questionnaire. We conduct a series of experiments employing pretrained Transformers, convolutional neural networks, and expert-designed features. The performance of each modality (text, audio, video) for spontaneous humor recognition is analyzed and their complementarity is investigated. Our findings suggest that for the automatic analysis of humor and its sentiment, facial expressions are most promising, while humor direction can be best modeled via text-based features. Further, we experiment with different multimodal approaches to humor recognition, including decision-level fusion and MulT, a multimodal Transformer approach. In this context, we propose a novel multimodal architecture that yields the best overall results.

MCML Authors

Shahin Amiriparian

Dr.

Health Informatics

Alexander Kathan

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics

L. Shen, H. Zhang, C. Zhu, R. Li, K. Qian, F. Tian, B. Hu, B. W. Schuller and Y. Yamamoto.
Enhancing Emotion Regulation in Mental Disorder Treatment: An AIGC-based Closed-Loop Music Intervention System.
IEEE Transactions on Affective Computing Early Access (Apr. 2025). DOI

Abstract

Mental disorders have increased rapidly and have emerged as a serious social health issue in the recent decade. Undoubtedly, the timely treatment of mental disorders is crucial. Emotion regulation has been proven to be an effective method for treating mental disorders. Music therapy as one of the methods that can achieve emotional regulation has gained increasing attention in the field of mental disorder treatment. However, traditional music therapy methods still face some unresolved issues, such as the lack of real-time capability and the inability to form closed-loop systems. With the advancement of artificial intelligence (AI), especially AI-generated content (AIGC), AI-based music therapy holds promise in addressing these issues. In this paper, an AIGC-based closed-loop music intervention system demonstration is proposed to regulate emotions for mental disorder treatment. This system demonstration consists of an emotion recognition model and a music generation model. The emotion recognition model can assess mental states, while the music generation model generates the corresponding emotional music for regulation. The system continuously performs recognition and regulation, thus forming a closed-loop process. In the experiment, we first conduct experiments on both the emotion recognition model and the music generation model to validate the accuracy of the recognition model and the music quality generated by the music generation models. In conclusion, we conducted comprehensive tests on the entire system to verify its feasibility and effectiveness.

MCML Authors

Björn Schuller

Prof. Dr.

Health Informatics

Z. Ge, X. Xu, H. Guo and B. W. Schuller.
Multi-Task Partially Spoofed Speech Detection Using a Dual-View Graph Neural Network Assisted Segment-Level Module.
IEEE Transactions on Audio, Speech and Language Processing 33 (Jul. 2025). DOI

Abstract

The Partially Spoofed Speech Detection (PSSD), as a multi-task learning problem, typically comprises segment- and utterance-level detection tasks, benefitting from diverse feature representations for effective classification. However, existing models for multi-tasks PSSD usually employ a shared feature processing module for the two tasks, which may lead to suboptimal performance compared with task-specific strategies. Further, most of existing works mainly capture segment-level information from a single view, which may result in poorly modeling local differences between fake and bonafide segments. In this regard, we propose a Dual-view Graph neural network Assisted segment-level Module (DGAM) for multi-task PSSD. The proposed approach contains three modules: Shared representation extracting, task-specific feature processing for the utterance-level task, and a Dual-View Graph Neural Network (D-GNN) with a dual-view consistency loss for the segment-level task through the graph attention mechanism with cosine similarity and heat kernel function with Euclidean distance as two different views, which capture semantic and Euclidean spatial relationships, respectively. Experimental evaluations on multiple spoofed-speech datasets demonstrate that, the proposed approach outperforms existing approaches in both segment- and utterance-level detection in terms of equal error rate, showcasing its effectiveness for the multi-task partially spoofed scenario.

MCML Authors

Björn Schuller

Prof. Dr.

Health Informatics

Y. Sun, Y. Zhou, X. Xu, J. Qi, F. Xu, Z. Ren and B. W. Schuller.
Weakly-Supervised Depression Detection in Speech Through Self-Learning Based Label Correction.
IEEE Transactions on Audio, Speech and Language Processing Early Access (Jan. 2025). DOI

Abstract

Automated Depression Detection (ADD) in speech aims to automatically estimate one’s depressive attributes through artificial intelligence tools towards spoken signals. Nevertheless, existing speech-based ADD works fail to sufficiently consider weakly-supervised cases with inaccurate labels, which may typically appear in intelligent mental health. In this regard, we propose the Self-Learning-based Label Correction (SLLC) approach for weakly-supervised depression detection in speech. The proposed approach employs a self-learning manner connecting a label correction module and a depression detection module. Within the approach, the label correction module fuses likelihood-ratio-based and prototype-based label correction strategies in order to effectively correct the inaccurate labels, while the depression detection module aims at detecting depressed samples through a 1D convolutional recurrent neural network with multiple types of losses. The experimental results on two depression detection corpora show that our proposed SLLC approach performs better compared with existing state-of-the-art speech-based depression detection approaches, in the case of weak supervision with inaccurate labels for depression detection in speech.

MCML Authors

Björn Schuller

Prof. Dr.

Health Informatics

Y. Yang, R. Liang, Y. Ni, Y. Xie, C. Zou and B. W. Schuller.
A Non-intrusive Speech Quality Evaluation Framework for Hearing Aids Based on Speech Label Assistance and Multi-task Learning Strategy.
IEEE Transactions on Audio, Speech and Language Processing Early Access (Jul. 2025). DOI

Abstract

Accurate evaluation of hearing aid speech quality is crucial for optimizing the auditory experience of hearing-impaired people. Aiming at the shortcomings of existing methods that rely on clean reference signals and do not take into account the effects of differences in Prescription Formula (PF), this paper proposes a non-intrusive speech quality evaluation framework based on speech label assistance, and multi-task learning strategy, termed MTSE-LA. The framework effectively mitigates evaluation bias caused by PF variations and effectively improves the prediction accuracy of speech quality metrics. MTSE-LA consists of three core modules: a feature extraction module, a label classification module, and a score prediction module. The feature extraction module extracts deep frame-level features from speech using a joint Convolutional Neural Network and Bidirectional Long Short-term Memory network (CNN-BiLSTM) model. The label classification module, acting as a pre-trained network, identifies PF labels and embeds them into the extracted frame-level features, which are then fed into the speech quality prediction branch of the multi-task score prediction module. Effective prediction of speech intelligibility is achieved by introducing the output vectors of the modulation filter bank to the speech intelligibility prediction branch to ensure synergy in the multi-task learning process. Moreover, each prediction branch uses the multi-head self-attention mechanism to capture contextual information and model the importance of speech frames. Experimental results demonstrate that MTSE-LA considerably improves the prediction accuracy of the Hearing Aid Speech Quality Index (HASQI) under multiple PF configurations and different degrees of hearing loss conditions. Compared with existing cutting-edge methods, the proposed framework exhibits higher correlation and fitting accuracy, establishing its reliability and superiority in the field of non-intrusive speech quality evaluation for hearing aids.

MCML Authors

Björn Schuller

Prof. Dr.

B3 | Multimodal Perception
→ Group Matthias Althoff

Health Informatics

J. Külz, M. Terzer, M. Magri, A. Giusti and M. Althoff.
Holistic Construction Automation with Modular Robots: From High-Level Task Specification to Execution.
IEEE Transactions on Automation Science and Engineering Early Access (Jun. 2025). DOI

Abstract

In situ robotic automation in construction is challenging due to constantly changing environments, a shortage of robotic experts, and a lack of standardized frameworks bridging robotics and construction practices. This work proposes a holistic framework for construction task specification, optimization of robot morphology, and mission execution using a mobile modular reconfigurable robot. Users can specify and monitor the desired robot behavior through a graphical interface. In contrast to existing, monolithic solutions, we automatically identify a new task-tailored robot for every task by integrating Building Information Modeling (BIM). Our framework leverages modular robot components that enable the fast adaption of robot hardware to the specific demands of the construction task. Other than previous works on modular robot optimization, we consider multiple competing objectives, which allow us to explicitly model the challenges of real-world transfer, such as calibration errors. We demonstrate our framework in simulation by optimizing robots for drilling and spray painting. Finally, experimental validation demonstrates that our approach robustly enables the autonomous execution of robotic drilling.

MCML Authors

Jonathan Külz

Cyber Physical Systems

Matthias Althoff

Prof. Dr.

C3 | Physics and Geo Sciences
→ Group Xiaoxiang Zhu

Cyber Physical Systems

W. Huang, Z. Gu, Y. Shi, Z. Xiong and X. Zhu.
Semi-Supervised Building Footprint Extraction Using Debiased Pseudo-Labels.
IEEE Transactions on Geoscience and Remote Sensing 63 (Jan. 2025). DOI GitHub

Abstract

Accurate extraction of building footprints from satellite imagery is of high value. Currently, deep learning methods are predominant in this field due to their powerful representation capabilities. However, they generally require extensive pixel-wise annotations, which constrains their practical application. Semi-supervised learning (SSL) significantly mitigates this requirement by leveraging large volumes of unlabeled data for model self-training (ST), thus enhancing the viability of building footprint extraction. Despite its advantages, SSL faces a critical challenge: the imbalanced distribution between the majority background class and the minority building class, which often results in model bias toward the background during training. To address this issue, this article introduces a novel method called DeBiased matching (DBMatch) for semi-supervised building footprint extraction. DBMatch comprises three main components: 1) a basic supervised learning module (SUP) that uses labeled data for initial model training; 2) a classical weak-to-strong ST module that generates pseudo-labels from unlabeled data for further model ST; and 3) a novel logit debiasing (LDB) module that calculates a global logit bias between building and background, allowing for dynamic pseudo-label calibration. To verify the effectiveness of the proposed DBMatch, extensive experiments are performed on three public building footprint extraction datasets covering six global cities in SSL setting. The experimental results demonstrate that our method significantly outperforms some advanced SSL methods in semi-supervised building footprint extraction.

MCML Authors

Ziqi Gu

Data Science in Earth Observation

Xiaoxiang Zhu

Prof. Dr.

C3 | Physics and Geo Sciences
→ Group Xiaoxiang Zhu

Data Science in Earth Observation

C. Liu, C. M. Albrecht, Y. Wang and X. Zhu.
CromSS: Cross-Modal Pretraining With Noisy Labels for Remote Sensing Image Segmentation.
IEEE Transactions on Geoscience and Remote Sensing 63 (Mar. 2025). DOI GitHub

Abstract

We explore the potential of large-scale noisily labeled data to enhance feature learning by pretraining semantic segmentation models within a multimodal framework for geospatial applications. We propose a novel cross-modal sample selection (CromSS) method, a weakly supervised pretraining strategy designed to improve feature representations through cross-modal consistency and noise mitigation techniques. Unlike conventional pretraining approaches, CromSS exploits massive amounts of noisy and easy-to-come-by labels for improved feature learning beneficial to semantic segmentation tasks. We investigate middle and late fusion strategies to optimize the multimodal pretraining architecture design. We also introduce a cross-modal sample selection module to mitigate the adverse effects of label noise, which employs a cross-modal entangling strategy to refine the estimated confidence masks within each modality to guide the sampling process. Additionally, we introduce a spatial–temporal label smoothing technique to counteract overconfidence for enhanced robustness against noisy labels. To validate our approach, we assembled the multimodal dataset, NoLDO-S12, which consists of a large-scale noisy label subset from Google’s Dynamic World (DW) dataset for pretraining and two downstream subsets with high-quality labels from Google DW and OpenStreetMap (OSM) for transfer learning. Experimental results on two downstream tasks and the publicly available DFC2020 dataset demonstrate that when effectively utilized, the low-cost noisy labels can significantly enhance feature learning for segmentation tasks.

MCML Authors

Chenying Liu

Data Science in Earth Observation

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

C. Schweden, K. Hechinger, G. Kauermann and X. Zhu.
Can Uncertainty Quantification Benefit From Label Embeddings? A Case Study on Local Climate Zone Classification.
IEEE Transactions on Geoscience and Remote Sensing 63 (May. 2025). DOI

Abstract

Modern deep learning models have achieved superior performance in almost all fields of remote sensing. An often neglected aspect of these models is the quantification and evaluation of predictive uncertainties. Regarding a classification task, this means that the focus of the analysis solely lies on performance metrics such as accuracy or the loss. On the other hand, a notion of uncertainty indicates the model’s indecisiveness among the given classes and is essential to understand where the model struggles to classify the data samples. In this work, three levels of uncertainty are distinguished, starting with the typical softmax pseudo-probabilities as level-1 uncertainty. As a next level, the more flexible Dirichlet framework is utilized as model output space, and hereby also, a Bayesian setting with an uninformative prior is considered. For the level-3 uncertainty, an empirical Bayes setting is incorporated where a latent embedding of the label space is iteratively estimated by the marginal likelihood of the fully parameterized label space (see [1]). The estimated embeddings are then learned by the network in three different settings: Two regression losses use the embeddings directly, while the closed-form solution of the Kullback-Leibler (KL-) Divergence uses the embedding parameterized as a Dirichlet distribution. To assess the different levels of uncertainty, the label evaluation subset of the So2Sat LCZ42 dataset, which contains label votes from multiple remote sensing experts, is investigated. The predictive uncertainties are evaluated by means of Out-of-Distribution (OoD) detection and calibration performance. Overall, the embedding-based approaches show strong performance for calibration, while for the OoD experiments, the Bayesian Dirichlet setting with an uninformative prior achieves the best performance. In conclusion, embedded labels offer a flexible framework for incorporating uncertain or ambiguous labels into a supervised training setup. They could be highly beneficial for applications in fields such as urban planning or disaster response.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

S. Wang, N. A. A. Braham and X. Zhu.
Weak-strong Graph Contrastive Learning Neural Network for Hyperspectral Image Classification.
IEEE Transactions on Geoscience and Remote Sensing Early Access (May. 2025). DOI GitHub

Abstract

Deep learning methods have shown promising results in various hyperspectral image (HSI) analysis tasks. Despite these advancements, existing models still struggle to accurately identify fine-classified land cover types on noisy hyperspectral images. Traditional methods have limited performance when extracting features from noisy hyperspectral data. Graph Neural Networks (GNNs) offer an adaptable and robust structure by effectively extracting both spectral and spatial features. However, supervised models still require large quantities of labeled data for effective training, posing a significant challenge. Contrastive learning, which leverages unlabeled data for pre-training, can mitigate this issue by reducing the dependency on extensive manual annotation. To address the issues, we propose WSGraphCL, a weak-strong graph contrastive learning model for HSI classification, and conduct experiments in a few-shot scenario. First, the image is transformed into K-hop subgraphs through a spectral-spatial adjacency matrix construction method. Second, WSGraphCL leverages contrastive learning to pre-train a graph-based encoder on the unlabeled hyperspectral image. We demonstrate that weak-strong augmentations and false negative pairs filtering stabilize pre-training and get good-quality representations. Finally, we test our model with a lightweight classifier on the features with a handful of labels. Experimental results showcase the superior performance of WSGraphCL compared to several baseline models, thereby emphasizing its efficacy in addressing the identified limitations in HSI classification.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

S. Zhao, Z. Xiong and X. Zhu.
RainScaler: A Physics-inspired Network for Precipitation Correction and Downscaling.
IEEE Transactions on Geoscience and Remote Sensing Early Access (May. 2025). DOI GitHub

Abstract

Spatial downscaling of precipitation, in which finegrained regional precipitation patterns are recovered from coarse-resolution images, plays a crucial role in various weather and meteorological analyses. However, the intricate noise information presented in the observation data intertwines with the fine-scale characteristics, which poses challenges for subsequent feature extraction. Regional precipitation suffers from complex spatial patterns. Moreover, the real observatory data contains information inconsistent with the established physical principle, due either to inaccurate or incomplete physical models or limited data quality, thus making the implementation of physicallyinformed deep learning more difficult. For example, strong physical constraints may lead to over-regularization, in which the model becomes too rigid and fails to capture certain complexities in the data. In this work, we propose RainScaler, a physicsinspired deep neural network, to tackle these issues. First, to remove the noise and preserve the vital precipitation patterns effectively, the proposed RainScaler exploits an Inconsistencyaware Denoising Net to explicitly model the spatial variability of noise in the input. In addition, a graph module is designed to learn the geographical-dependent fine-grained patterns in high dimensional feature space at a moderate computation cost. Finally, multi-scale physical constraints are skillfully embedded to incorporate additional insights into the data-driven framework. We test our approach on a public dataset consisting of over 60,000 real low-resolution and high-resolution precipitation map pairs collected by different sensors. Our method produces realisticlooking precipitation maps with better discernment capability and corrects the structural error of precipitation distribution, especially for extreme events. Moreover, we evaluate the potential risks of incorporating physical constraints in real-world data applications. Our method unveils opportunities for multi-source data fusion and provides possible solutions to improve the physical feasibility of data-driven models.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

F. Fan, Y. Shi, T. Guggemos and X. Zhu.
Hybrid Quantum Deep Learning With Superpixel Encoding for Earth Observation Data Classification.
IEEE Transactions on Neural Networks and Learning Systems Early Access (Jan. 2025). DOI URL

Abstract

Earth observation (EO) has inevitably entered the Big Data era. The computational challenge associated with analyzing large EO data using sophisticated deep learning models has become a significant bottleneck. To address this challenge, there has been a growing interest in exploring quantum computing as a potential solution. However, the process of encoding EO data into quantum states for analysis potentially undermines the efficiency advantages gained from quantum computing. This article introduces a hybrid quantum deep learning model that effectively encodes and analyzes EO data for classification tasks. The proposed model uses an efficient encoding approach called superpixel encoding, which reduces the quantum resources required for large image representation by incorporating the concept of superpixels. To validate the effectiveness of our model, we conducted evaluations on multiple EO benchmarks, including Overhead-MNIST, So2Sat LCZ42, and SAT-6 datasets. In addition, we studied the impacts of different interaction gates and measurements on classification performance to guide model optimization. The experimental results suggest the validity of our model for accurate classification of EO data.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

L. Zhu, R. Wang, X. Jin, Y. Li, F. Tian, R. Cai, K. Qian, X. Hu, B. Hu, Y. Yamamoto and B. W. Schuller.
Explainable Depression Classification Based on EEG Feature Selection from Audio Stimuli.
IEEE Transactions on Neural Systems and Rehabilitation Engineering Early Access (Apr. 2025). DOI

Abstract

With the development of affective computing and Artificial Intelligence (AI) technologies, Electroencephalogram (EEG)-based depression detection methods have been widely proposed. However, existing studies have mostly focused on the accuracy of depression recognition, ignoring the association between features and models. Additionally, there is a lack of research on the contribution of different features to depression recognition. To this end, this study introduces an innovative approach to depression detection using EEG data, integrating Ant-Lion Optimization (ALO) and Multi-Agent Reinforcement Learning (MARL) for feature fusion analysis. The inclusion of Explainable Artificial Intelligence (XAI) methods enhances the explainability of the model’s features. The Time-Delay Embedded Hidden Markov Model (TDE-HMM) is employed to infer internal brain states during depression, triggered by audio stimulation. The ALO-MARL algorithm, combined with hyper-parameter optimization of the XGBoost classifier, achieves high accuracy (93.69%), sensitivity (88.60%), specificity (97.08%), and F1-score (91.82%) on a auditory stimulus-evoked three-channel EEG dataset. The results suggest that this approach outperforms state-of-the-art feature selection methods for depression recognition on this dataset, and XAI elucidates the critical impact of the minimum value of Power Spectral Density (PSD), Sample Entropy (SampEn), and Réenyi Entropy (Ren) on depression recognition. The study also explores dynamic brain state transitions revealed by audio stimuli, providing insights for the clinical application of AI algorithms in depression recognition.

MCML Authors

Björn Schuller

Prof. Dr.

A3 | Computational Models
→ Group Eyke Hüllermeier

Health Informatics

P. Gupta, M. Wever and E. Hüllermeier.
Information Leakage Detection through Approximate Bayes-optimal Prediction.
Information Sciences In Press, Journal Pre-proof.122419 (Jun. 2025). DOI

Abstract

In today’s data-driven world, the proliferation of publicly available information raises security concerns due to the information leakage (IL) problem. IL involves unintentionally exposing sensitive information to unauthorized parties via observable system information. Conventional statistical approaches rely on estimating mutual information (MI) between observable and secret information for detecting ILs, face challenges of the curse of dimensionality, convergence, computational complexity, and MI misestimation. Though effective, emerging supervised machine learning based approaches to detect ILs are limited to binary system sensitive information and lack a comprehensive framework. To address these limitations, we establish a theoretical framework using statistical learning theory and information theory to quantify and detect IL accurately. Using automated machine learning, we demonstrate that MI can be accurately estimated by approximating the typically unknown Bayes predictor’s log-loss and accuracy. Based on this, we show how MI can effectively be estimated to detect ILs. Our method performs superior to state-of-the-art baselines in an empirical study considering synthetic and real-world OpenSSL TLS server datasets.

MCML Authors

Marcel Wever

Dr.

* Former Member

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning

Q. Li, L. Mou, Y. Shi and X. Zhu.
BANet: A bilateral attention network for extracting changed buildings between remote sensing imagery and cadastral maps.
International Journal of Applied Earth Observation and Geoinformation 139.104486 (May. 2025). DOI

Abstract

Up-to-date cadastral maps are vital to local governments in administrating real estate in cities. With its growing availability, remote sensing imagery is the cost-effective data for updating semantic contents on cadastral maps. In this study, we address the problem of updating buildings on cadastral maps, as city renewal is mainly characterized by new construction and demolition. While previous works focus on extracting all buildings from remote sensing images, we argue that these methods not only disregard preliminary information on cadastral maps but also fail to preserve building priors in unchanged areas on cadastral maps. Therefore, we focus on the task of extracting changed buildings (i.e., newly built and demolished buildings) from remote sensing images and cadastral maps. To address this task, we create an image-map building change detection (IMBCD) dataset, formed by around 27K pairs of remote sensing images and maps and their corresponding changed buildings in six distinct geographical areas across the globe. Accordingly, we propose a Bilateral Attention Network (BANet), introducing a novel attention mechanism: changed-first (CF) attention and non-changed-first (NCF) attention. This bilateral attention mechanism helps to refine the uncertain areas between changed and non-changed regions. Extensive experiments on our IMBCD dataset showcase the superior performance of BANet. Specifically, our BANet outperforms state-of-the-art models with F1 scores of 90.00% and 63.00% for the IMBCD-WHU and IMBCD-Inria datasets. This confirms that the leverage of bilateral attention blocks (BAB) can boost performance.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

Y. Mu, J. Guo, M. Shahzad and X. Zhu.
National-scale tree species mapping with deep learning reveals forest management insights in Germany.
International Journal of Applied Earth Observation and Geoinformation 139.104522 (May. 2025). DOI

Abstract

Accurate tree species distribution is essential for biodiversity assessment, sustainable forest management, and environmental policy. However, mapping species over large areas with satellite data is challenging due to spectral mixing and complex spatial distribution. To address this, we developed a novel deep learning model, ForestFormer, using Sentinel-2 time series data to map eight dominant tree species in Germany. ForestFormer’s dual-branch network with spectral and spatial attention modules improves classification by highlighting species-specific characteristics. Cross-validation in 2,364 National Forest Inventory plots shows that ForestFormer achieves species classification accuracy ranging from 69% to 92%, with an average accuracy of 84%, outperforming existing baseline methods. The developed ForestFormer model can help generate a large-scale and reliable tree species map for Germany, which in turn provides crucial insights into the diverse characteristics of tree species to support forest management. Our analysis of results shows that Pine is the species most resistant to disturbances, while Douglas fir is the least. Northeastern regions of Germany exhibit particularly low levels of forest biodiversity, especially in the states of Brandenburg and Berlin, followed by neighboring states such as Sachsen-Anhalt, Mecklenburg-Vorpommern, Sachsen, and Niedersachsen. In addition, climatic factors, especially water deficit, are shown to play a very important role in determining tree species distribution patterns, followed by topographic and soil factors. These findings are anticipated to provide a critical basis for environmental policy formulation, particularly in forest management strategies responding to ongoing climate change.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

X. Zhao, Z. Xiong, P. Karlshöfer, N. Tziolas, M. Wiesmeier, U. Heiden and X. Zhu.
Soil organic carbon estimation using spaceborne hyperspectral composites on a large scale.
International Journal of Applied Earth Observation and Geoinformation 140 (Jun. 2025). DOI

Abstract

Soil Organic Carbon (SOC) is a key property for soil health. Spectral reflectance such as multispectral and hyperspectral data could provide efficient and cost-effective retrieval of SOC content. However, constrained by the availability of hyperspectral satellite data, current works mostly use a small number of spaceborne hyperspectral imagery for SOC retrieval on a small scale. In this work, the first large-scale hyperspectral imaging reflectance composites were built, and they were used for SOC estimation. Specifically, DESIS satellite images were used to predict SOC over the whole state of Bavaria in Germany ( 70,000 km). We prepare 850 hyperspectral images from the DESIS satellite and build temporal composites from them. For the soil data, data was gathered from LfU(Bavarian State Office for the Environment), LfL(Bavarian State Research Center for Agriculture) and LUCAS 2018 (Land Use and Coverage Area Frame Survey). 828 soil samples were selected after data filtering. For this regression task, different machine learning and deep learning methods were implemented and explored. Moreover, a spectral attention mechanism was added to the model. Besides hyperspectral input, the digital elevation model (DEM) was also included as an auxiliary input as the measured spectrum has inter-variability dependent on the elevation and the generated topographical features are also relevant with SOC distribution. Based on the regression results evaluated by , , and , the deep learning models showed much better performance than machine learning methods. Especially when only using hyperspectral data as input, the best result was achieved with 1.947%, 0.626, and 1.710 on the test set. After incorporating topographical features, the fused model achieved further improved performance with 1.752% and 0.695 and 1.919. From the interpretability analysis for model performance, it was found out that the bands in the range of 530 nm–570 nm, 770 nm–790 nm, and 840 nm - 870 nm are the most relevant bands for SOC estimation. In the end, several SOC maps were generated and analyzed together with soil types. The SOC maps indicate that water-associated areas, such as coastal soils and bogs, tend to have higher SOC, while mountain areas tend to contain lower SOC. Such findings align with SOC distribution across soil types and show the effectiveness of the model.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Mathias Drton

Data Science in Earth Observation

D. Strieder and M. Drton.
Identifying total causal effects in linear models under partial homoscedasticity.
International Journal of Approximate Reasoning 183.109455 (Aug. 2025). DOI

Abstract

A fundamental challenge of scientific research is inferring causal relations based on observed data. One commonly used approach involves utilizing structural causal models that postulate noisy functional relations among interacting variables. A directed graph naturally represents these models and reflects the underlying causal structure. However, classical identifiability results suggest that, without conducting additional experiments, this causal graph can only be identified up to a Markov equivalence class of indistinguishable models. Recent research has shown that focusing on linear relations with equal error variances can enable the identification of the causal structure from mere observational data. Nonetheless, practitioners are often primarily interested in the effects of specific interventions, rendering the complete identification of the causal structure unnecessary. In this work, we investigate the extent to which less restrictive assumptions of partial homoscedasticity are sufficient for identifying the causal effects of interest. Furthermore, we construct mathematically rigorous confidence regions for total causal effects under structure uncertainty and explore the performance gain of relying on stricter error assumptions in a simulation study.

MCML Authors

David Strieder

Mathematical Statistics

Mathias Drton

Prof. Dr.

Mathematical Statistics

W. Mayr, A. Triantafyllopoulos, A. Batliner, B. W. Schuller and T. M. Berghaus.
Assessing the Clinical and Functional Status of COPD Patients Using Speech Analysis During and After Exacerbation.
International Journal of Chronic Obstructive Pulmonary Disease 20 (Jan. 2025). DOI

Abstract

Background: Chronic obstructive pulmonary disease (COPD) affects breathing, speech production, and coughing. We evaluated a machine learning analysis of speech for classifying the disease severity of COPD.
Methods: In this single centre study, non-consecutive COPD patients were prospectively recruited for comparing their speech characteristics during and after an acute COPD exacerbation. We extracted a set of spectral, prosodic, and temporal variability features, which were used as input to a support vector machine (SVM). Our baseline for predicting patient state was an SVM model using self-reported BORG and COPD Assessment Test (CAT) scores.
Results: In 50 COPD patients (52% males, 22% GOLD II, 44% GOLD III, 32% GOLD IV, all patients group E), speech analysis was superior in distinguishing during and after exacerbation status compared to BORG and CAT scores alone by achieving 84% accuracy in prediction. CAT scores correlated with reading rhythm, and BORG scales with stability in articulation. Pulmonary function testing (PFT) correlated with speech pause rate and speech rhythm variability.
Conclusion: Speech analysis may be a viable technology for classifying COPD status, opening up new opportunities for remote disease monitoring.

MCML Authors

Andreas Triantafyllopoulos

Health Informatics

Anton Batliner

Dr.

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics

N. Heldring, A.-R. Rezaie, A. Larsson, R. Gahn, B. Zilg, S. Camilleri, A. Saade, P. Wesp, E. Palm and O. Kvist.
A probability model for estimating age in young individuals relative to key legal thresholds: 15, 18 or 21-year.
International Journal of Legal Medicine 139.1 (Jan. 2025). DOI

Abstract

Age estimations are relevant for pre-trial detention, sentencing in criminal cases and as part of the evaluation in asylum processes to protect the rights and privileges of minors. No current method can determine an exact chronological age due to individual variations in biological development. This study seeks to develop a validated statistical model for estimating an age relative to key legal thresholds (15, 18, and 21 years) based on a skeletal (CT-clavicle, radiography-hand/wrist or MR-knee) and tooth (radiography-third molar) developmental stages. The whole model is based on 34 scientific studies, divided into examinations of the hand/wrist (15 studies), clavicle (5 studies), distal femur (4 studies), and third molars (10 studies). In total, data from approximately 27,000 individuals have been incorporated and the model has subsequently been validated with data from 5,000 individuals. The core framework of the model is built upon transition analysis and is further developed by a combination of a type of parametric bootstrapping and Bayesian theory. Validation of the model includes testing the models on independent datasets of individuals with known ages and shows a high precision with separate populations aligning closely with the model’s predictions. The practical use of the complex statistical model requires a user-friendly tool to provide probabilities together with the margin of error. The assessment based on the model forms the medical component for the overall evaluation of an individual’s age.

MCML Authors

Philipp Wesp

Dr.

Clinical Data Science in Radiology

X.-Y. Tong, R. Dong and X. Zhu.
Global high categorical resolution land cover mapping via weak supervision.
ISPRS Journal of Photogrammetry and Remote Sensing 220 (Feb. 2025). DOI GitHub

Abstract

Land cover information is indispensable for advancing the United Nations’ sustainable development goals, and land cover mapping under a more detailed category system would significantly contribute to economic livelihood tracking and environmental degradation measurement. However, the substantial difficulty in acquiring fine-grained training data makes the implementation of this task particularly challenging. Here, we propose to combine fully labeled source domain and weakly labeled target domain for weakly supervised domain adaptation (WSDA). This is beneficial as the utilization of sparse and coarse weak labels can considerably alleviate the labor required for precise and detailed land cover annotation. Specifically, we introduce the Prototype-based pseudo-label Rectification and Expansion (PRE) approach, which leverages the prototypes (i.e., the class-wise feature centroids) as the bridge to connect sparse labels and global feature distributions. According to the feature distances to the prototypes, the confidence of pseudo-labels predicted in the unlabeled regions of the target domain is assessed. This confidence is then utilized to guide the dynamic expansion and rectification of pseudo-labels. Based on PRE, we carry out high categorical resolution land cover mapping for 10 cities in different regions around the world, severally using PlanetScope, Gaofen-1, and Sentinel-2 satellite images. In the study areas, we achieve cross-sensor, cross-category, and cross-continent WSDA, with the overall accuracy exceeding 80%. The promising results indicate that PRE is capable of reducing the dependency of land cover classification on high-quality annotations, thereby improving label efficiency. We expect our work to enable global fine-grained land cover mapping, which in turn promote Earth observation to provide more precise and thorough information for environmental monitoring.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

Z. Li, D. Muhtar, F. Gu, X. Zhang, P. Xiao, G. He and X. Zhu.
LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation.
ISPRS Journal of Photogrammetry and Remote Sensing 227 (Sep. 2025). DOI GitHub

Abstract

Automatically and rapidly understanding Earth’s surface is fundamental to our grasp of the living environment and informed decision-making. This underscores the need for a unified system with comprehensive capabilities in analyzing Earth’s surface to address a wide range of human needs. The emergence of multimodal large language models (MLLMs) has great potential in boosting the efficiency and convenience of intelligent Earth observation. These models can engage in human-like conversations, serve as unified platforms for understanding images, follow diverse instructions, and provide insightful feedbacks. In this study, we introduce LHRS-Bot-Nova, an MLLM specialized in understanding remote sensing (RS) images, designed to expertly perform a wide range of RS understanding tasks aligned with human instructions. LHRS-Bot-Nova features an enhanced vision encoder and a novel bridge layer, enabling efficient visual compression and better language-vision alignment. To further enhance RS-oriented vision-language alignment, we propose a large-scale RS image-caption dataset, generated through feature-guided image recaptioning. Additionally, we introduce an instruction dataset specifically designed to improve spatial recognition abilities. Extensive experiments demonstrate superior performance of LHRS-Bot-Nova across various RS image understanding tasks. We also evaluate different MLLM performances in complex RS perception and instruction following using a complicated multi-choice question evaluation benchmark, providing a reliable guide for future model selection and improvement.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

C3 | Physics and Geo Sciences
→ Group Xiaoxiang Zhu

Data Science in Earth Observation

Q. Xu, L. F. De Vos, Y. Shi, N. Rüther, A. Bronstert and X. Zhu.
Urban Flood Modeling and Forecasting with Deep Neural Operator and Transfer Learning.
Journal of Hydrology In Press, Journal Pre-proof.133705 (Jun. 2025). DOI

Abstract

Physics-based models provide accurate flood modeling but are limited by their dependence on high-quality data and computational demands, particularly in complex urban environments. Machine learning-based surrogate models like neural operators present a promising alternative; however, their practical application in urban flood modeling remains challenges, such as insufficient feature representation, high memory demands, and limited transferability. To address these challenges, this study introduces a deep neural operator (DNO) and a transfer learning-based DNO for fast, accurate, resolution-invariant, and cross-scenario urban flood forecasting. The DNO features an enhanced Fourier layer with skip connections for improved memory efficiency, alongside a deep encoder-decoder framework and an urban-embedded residual loss to enhance modeling effectiveness. The transfer learning-based DNO further integrates a fine-tuning-based approach for efficient cross-scenario forecasting in the target domain and a domain adaptation-based strategy for continuous learning across diverse domains. The fine-tuning-based DNO enables rapid adaptation to target domains, while the domain adaptation-based DNO mitigates knowledge forgetting from the source domain. Experimental results demonstrate that the proposed DNO significantly outperforms existing neural solvers using a comprehensive urban flood benchmark dataset, particularly in predicting high water depths and exhibiting exceptional zero-shot downscaling performance for high-resolution forecasting. Moreover, the fine-tuning-based DNO enhances transferability for cross-scenario urban flood forecasting, while the domain adaptation-based DNO achieves accurate flood predictions in both source and target domains, even with limited labeled target data. Through the combination of these ML methods and the benchmark dataset, a practical tool is established for effective, cross-scenario, and downscaled spatiotemporal urban flood forecasting.

MCML Authors

Qingsong Xu

Data Science in Earth Observation

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

B. Lange.
Moral parenthood and gestation: replies to Cordeiro, Murphy, Robinson and Baron.
Journal of Medical Ethics 51.2 (Jan. 2025). DOI

Abstract

I am grateful to James Cordeiro, Timothy Murphy, Heloise Robinson and Teresa Baron for their perceptive and stimulating comments on my article in this journal. In what follows, I seek to respond to some of the main points raised in each commentary.

MCML Authors

Benjamin Lange

Dr.

C5 | Humane AI

Ethics of Artificial Intelligence

B. Lange.
Moral parenthood: not gestational.
Journal of Medical Ethics 51.2 (Jan. 2025). DOI

Abstract

Parenting our biological children is a centrally important matter, but how, if it all, can it be justified? According to a contemporary influential line of thinking, the acquisition by parents of a moral right to parent their biological children should be grounded by appeal to the value of the intimate emotional relationship that gestation facilitates between a newborn and a gestational procreator. I evaluate two arguments in defence of this proposal and argue that both are unconvincing.Data are available in a public, open access repository.

MCML Authors

Benjamin Lange

Dr.

C5 | Humane AI

Ethics of Artificial Intelligence

M. Keinert, S. Pistrosch, A. Mallol-Ragolta, B. W. Schuller and M. Berking.
Facial Emotion Recognition of 16 Distinct Emotions From Smartphone Videos: Comparative Study of Machine Learning and Human Performance.
Journal of Medical Internet Research 27 (Jul. 2025). DOI

Abstract

Background: The development of automatic emotion recognition models from smartphone videos is a crucial step toward the dissemination of psychotherapeutic app interventions that encourage emotional expressions. Existing models focus mainly on the 6 basic emotions while neglecting other therapeutically relevant emotions. To support this research, we introduce the novel Stress Reduction Training Through the Recognition of Emotions Wizard-of-Oz (STREs WoZ) dataset, which contains facial videos of 16 distinct, therapeutically relevant emotions.
Objective: This study aimed to develop deep learning–based automatic facial emotion recognition (FER) models for binary (positive vs negative) and multiclass emotion classification tasks, assess the models’ performance, and validate them by comparing the models with human observers.
Methods: The STREs WoZ dataset contains 14,412 facial videos of 63 individuals displaying the 16 emotions. The selfie-style videos were recorded during a stress reduction training using front-facing smartphone cameras in a nonconstrained laboratory setting. Automatic FER models using both appearance and deep-learned features for binary and multiclass emotion classification were trained on the STREs WoZ dataset. The appearance features were based on the Facial Action Coding System and extracted with OpenFace. The deep-learned features were obtained through a ResNet50 model. For our deep learning models, we used the appearance features, the deep-learned features, and their concatenation as inputs. We used 3 recurrent neural network (RNN)–based architectures: RNN-convolution, RNN-attention, and RNN-average networks. For validation, 3 human observers were also trained in binary and multiclass emotion recognition. A test set of 3018 facial emotion videos of the 16 emotions was completed by both the automatic FER model and human observers. The performance was assessed with unweighted average recall (UAR) and accuracy.
Results: Models using appearance features outperformed those using deep-learned features, as well as models combining both feature types in both tasks, with the attention network using appearance features emerging as the best-performing model. The attention network achieved a UAR of 92.9% in the binary classification task, and accuracy values ranged from 59.0% to 90.0% in the multiclass classification task. Human performance was comparable to that of the automatic FER model in the binary classification task, with a UAR of 91.0%, and superior in the multiclass classification task, with accuracy values ranging from 87.4% to 99.8%.
Conclusions: Future studies are needed to enhance the performance of automatic FER models for practical use in psychotherapeutic apps. Nevertheless, this study represents an important first step toward advancing emotion-focused psychotherapeutic interventions via smartphone apps.

MCML Authors

Simon Pistrosch

Health Informatics

Adria Mallol-Ragolta

Health Informatics

Björn Schuller

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Health Informatics

L. Zumeta-Olaskoaga, A. Bender and D.-J. Lee.
Flexible modelling of time-varying exposures and recurrent events to analyse training load effects in team sports injuries.
Journal of the Royal Statistical Society. Series C (Applied Statistics) 74.2 (Mar. 2025). DOI

Abstract

We present a flexible modelling approach to analyse time-varying exposures and recurrent events in team sports injuries. The approach is based on the piece-wise exponential additive mixed model where the effects of past exposures (i.e. high-intensity training loads) may accumulate over time and present complex forms of association. In order to identify a relevant time window at which past exposures have an impact on the current risk, we propose a penalty approach. We conduct a simulation study to evaluate the performance of the proposed model, under different true weight functions and different levels of heterogeneity between recurrent events. Finally, we illustrate the approach with a case study application involving an elite male football team participating in the Spanish LaLiga competition. The cohort includes time-loss injuries and external training load variables tracked by Global Positioning System devices, during the seasons 2017–2018 and 2018–2019.

MCML Authors

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

M. Schneble and G. Kauermann.
Statistical modelling of on-street parking spot occupancy in smart cities.
Journal of the Royal Statistical Society. Series C (Applied Statistics).qlaf017 (Mar. 2025). DOI

Abstract

Many studies suggest that searching for parking is associated with significant direct and indirect costs. Therefore, it is appealing to reduce the time that car drivers spend on finding an available parking spot, especially in urban areas where the space for all road users is limited. The prediction of on-street parking spot occupancy can provide drivers with guidance on where clear parking spaces are likely to be found. This field of research has gained more and more attention in the last decade through the increasing availability of real-time parking spot occupancy data. In this paper, we pursue a statistical approach for the prediction of parking spot occupancy, where we make use of time-to-event models and semi-Markov process theory. The latter involves the employment of Laplace transformations as well as their inversion, which is an ambitious numerical task. We apply our methodology to data from the City of Melbourne in Australia. Our main result is that the semi-Markov model outperforms a Markov model in terms of both true negative rate and true positive rate while this is essentially achieved by respecting the current duration that a parking space already spends in its initial state.

MCML Authors

Göran Kauermann

Prof. Dr.

A3 | Computational Models
→ Group Eyke Hüllermeier

Applied Statistics in Social Sciences, Economics and Business

J. Hanselle, S. Heid, J. Fürnkranz and E. Hüllermeier.
Probabilistic scoring lists for interpretable machine learning.
Machine Learning 114.55 (Feb. 2025). DOI

Abstract

A scoring system is a simple decision model that checks a set of features, adds a certain number of points to a total score for each feature that is satisfied, and finally makes a decision by comparing the total score to a threshold. Scoring systems have a long history of active use in safety-critical domains such as healthcare and justice, where they provide guidance for making objective and accurate decisions. Given their genuine interpretability, the idea of learning scoring systems from data is obviously appealing from the perspective of explainable AI. In this paper, we propose a practically motivated extension of scoring systems called probabilistic scoring lists (PSL), as well as a method for learning PSLs from data. Instead of making a deterministic decision, a PSL represents uncertainty in the form of probability distributions, or, more generally, probability intervals. Moreover, in the spirit of decision lists, a PSL evaluates features one by one and stops as soon as a decision can be made with enough confidence. To evaluate our approach, we conduct case studies in the medical domain and on standard benchmark data.

MCML Authors

Jonas Hanselle

Artificial Intelligence and Machine Learning

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning

Ö. Turgut, P. Müller, P. Hager, S. Shit, S. Starck, M. Menten, E. Martens and D. Rückert.
Unlocking the diagnostic potential of electrocardiograms through information transfer from cardiac magnetic resonance imaging.
Medical Image Analysis 101.103451 (Apr. 2025). DOI GitHub

Abstract

Cardiovascular diseases (CVD) can be diagnosed using various diagnostic modalities. The electrocardiogram (ECG) is a cost-effective and widely available diagnostic aid that provides functional information of the heart. However, its ability to classify and spatially localise CVD is limited. In contrast, cardiac magnetic resonance (CMR) imaging provides detailed structural information of the heart and thus enables evidence-based diagnosis of CVD, but long scan times and high costs limit its use in clinical routine. In this work, we present a deep learning strategy for cost-effective and comprehensive cardiac screening solely from ECG. Our approach combines multimodal contrastive learning with masked data modelling to transfer domain-specific information from CMR imaging to ECG representations. In extensive experiments using data from 40,044 UK Biobank subjects, we demonstrate the utility and generalisability of our method for subject-specific risk prediction of CVD and the prediction of cardiac phenotypes using only ECG data. Specifically, our novel multimodal pre-training paradigm improves performance by up to 12.19% for risk prediction and 27.59% for phenotype prediction. In a qualitative analysis, we demonstrate that our learned ECG representations incorporate information from CMR image regions of interest.

MCML Authors

Martin Menten

Dr.

Artificial Intelligence in Healthcare and Medicine

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine

A. Bitarafan, M. Mozafari, M. F. Azampour, M. S. Baghshah, N. Navab and A. Farshad.
Self-supervised 3D medical image segmentation by flow-guided mask propagation learning.
Medical Image Analysis 101.103478 (Apr. 2025). DOI GitHub

Abstract

Despite significant progress in 3D medical image segmentation using deep learning, manual annotation remains a labor-intensive bottleneck. Self-supervised mask propagation (SMP) methods have emerged to alleviate this challenge, allowing intra-volume segmentation with just a single slice annotation. However, the previous SMP methods often rely on 2D information and ignore volumetric contexts. While our previous work, called Vol2Flow, attempts to address this concern, it exhibits limitations, including not focusing enough on local (i.e., slice-pair) information, neglecting global information (i.e., volumetric contexts) in the objective function, and error accumulation during slice-to-slice reconstruction. This paper introduces Flow2Mask, a novel SMP method, developed to overcome the limitations of previous SMP approaches, particularly Vol2Flow. During training, Flow2Mask proposes the Local-to-Global (L2G) loss to learn inter-slice flow fields among all consecutive slices within a volume in an unsupervised manner. This dynamic loss is based on curriculum learning to gradually learn information within a volume from local to global contexts. Additionally, the Inter-Slice Smoothness (ISS) loss is introduced as a regularization term to encourage changes between the slices occur consistently and continuously. During inference, Flow2Mask leverages these 3D flow fields for inter-slice mask propagation in a 3D image, spreading annotation from a single annotated slice to the entire volume. Moreover, we propose an automatic strategy to select the most representative slice as initial annotation in the mask propagation process. Experimental evaluations on different abdominal datasets demonstrate that our proposed SMP method outperforms previous approaches and improves the overall mean DSC of Vol2Flow by +2.1%, +8.2%, and +4.0% for the Sliver, CHAOS, and 3D-IRCAD datasets, respectively. Furthermore, Flow2Mask even exhibits substantial improvements in weakly-supervised and self-supervised few-shot segmentation methods when applied as a mask completion tool.

MCML Authors

Mohammad Farid Azampour

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Azade Farshad

Dr.

Computer Aided Medical Procedures & Augmented Reality

Y. Bi, L. Huang, R. Clarenbach, R. Ghotbi, A. Karlas, N. Navab and Z. Jiang.
Synomaly noise and multi-stage diffusion: A novel approach for unsupervised anomaly detection in medical images.
Medical Image Analysis In Press, Journal Pre-proof.103737 (Jul. 2025). DOI GitHub

Abstract

Anomaly detection in medical imaging plays a crucial role in identifying pathological regions across various imaging modalities, such as brain MRI, liver CT, and carotid ultrasound (US). However, training fully supervised segmentation models is often hindered by the scarcity of expert annotations and the complexity of diverse anatomical structures. To address these issues, we propose a novel unsupervised anomaly detection framework based on a diffusion model that incorporates a synthetic anomaly (Synomaly) noise function and a multi-stage diffusion process. Synomaly noise introduces synthetic anomalies into healthy images during training, allowing the model to effectively learn anomaly removal. The multi-stage diffusion process is introduced to progressively denoise images, preserving fine details while improving the quality of anomaly-free reconstructions. The generated high-fidelity counterfactual healthy images can further enhance the interpretability of the segmentation models, as well as provide a reliable baseline for evaluating the extent of anomalies and supporting clinical decision-making. Notably, the unsupervised anomaly detection model is trained purely on healthy images, eliminating the need for anomalous training samples and pixel-level annotations. We validate the proposed approach on brain MRI, liver CT datasets, and carotid US. The experimental results demonstrate that the proposed framework outperforms existing state-of-the-art unsupervised anomaly detection methods, achieving performance comparable to fully supervised segmentation models in the US dataset. Ablation studies further highlight the contributions of Synomaly noise and the multi-stage diffusion process in improving anomaly segmentation. These findings underscore the potential of our approach as a robust and annotation-efficient alternative for medical anomaly detection.

MCML Authors

Yuan Bi

C1 | Medicine
→ Group Martin Menten

Computer Aided Medical Procedures & Augmented Reality

Lucie Huang

Artificial Intelligence in Healthcare and Medicine

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

Zhongliang Jiang

Dr.

Computer Aided Medical Procedures & Augmented Reality

E. Eulig, F. Jäger, J. Maier, B. Ommer and M. Kachelrieß.
Reconstructing and analyzing the invariances of low-dose CT image denoising networks.
Medical Physics 52 (Jan. 2025). DOI

Abstract

Background: Deep learning-based methods led to significant advancements in many areas of medical imaging, most of which are concerned with the reduction of artifacts caused by motion, scatter, or noise. However, with most neural networks being black boxes, they remain notoriously difficult to interpret, hindering their clinical implementation. In particular, it has been shown that networks exhibit invariances w.r.t. input features, that is, they learn to ignore certain information in the input data.
Purpose: To improve the interpretability of deep learning-based low-dose CT image denoising networks.
Methods: We learn a complete data representation of low-dose input images using a conditional variational autoencoder (cVAE). In this representation, invariances of any given denoising network are then disentangled from the information it is not invariant to using a conditional invertible neural network (cINN). At test time, image-space invariances are generated by applying the inverse of the cINN and subsequent decoding using the cVAE. We propose two methods to analyze sampled invariances and to find those that correspond to alterations of anatomical structures.
Results: The proposed method is applied to four popular deep learning-based low-dose CT image denoising networks. We find that the networks are not only invariant to noise amplitude and realizations, but also to anatomical structures.
Conclusions: The proposed method is capable of reconstructing and analyzing invariances of deep learning-based low-dose CT image denoising networks. This is an important step toward interpreting deep learning-based methods for medical imaging, which is essential for their clinical implementation.

MCML Authors

Björn Ommer

Prof. Dr.

B1 | Computer Vision

Computer Vision & Learning

T. Willem, V. A. Shitov, M. D. Luecken, N. Kilbertus, S. Bauer, M. Piraud, A. Buyx and F. J. Theis.
Biases in machine-learning models of human single-cell data.
Nature Cell Biology (Feb. 2025). DOI

Abstract

Recent machine-learning (ML)-based advances in single-cell data science have enabled the stratification of human tissue donors at single-cell resolution, promising to provide valuable diagnostic and prognostic insights. However, such insights are susceptible to biases. Here we discuss various biases that emerge along the pipeline of ML-based single-cell analysis, ranging from societal biases affecting whose samples are collected, to clinical and cohort biases that influence the generalizability of single-cell datasets, biases stemming from single-cell sequencing, ML biases specific to (weakly supervised or unsupervised) ML models trained on human single-cell samples and biases during the interpretation of results from ML models. We end by providing methods for single-cell data scientists to assess and mitigate biases, and call for efforts to address the root causes of biases.

MCML Authors

Niki Kilbertus

Prof. Dr.

Ethics in Systems Design and Machine Learning

Stefan Bauer

Prof. Dr.

Algorithmic Machine Learning & Explainable AI

C. I. Bercea, B. Wiestler, D. Rückert and J. A. Schnabel.
Evaluating normative representation learning in generative AI for robust anomaly detection in brain imaging.
Nature Communications 16.1624 (Feb. 2025). DOI GitHub

Abstract

Normative representation learning focuses on understanding the typical anatomical distributions from large datasets of medical scans from healthy individuals. Generative Artificial Intelligence (AI) leverages this attribute to synthesize images that accurately reflect these normative patterns. This capability enables the AI allowing them to effectively detect and correct anomalies in new, unseen pathological data without the need for expert labeling. Traditional anomaly detection methods often evaluate the anomaly detection performance, overlooking the crucial role of normative learning. In our analysis, we introduce novel metrics, specifically designed to evaluate this facet in AI models. We apply these metrics across various generative AI frameworks, including advanced diffusion models, and rigorously test them against complex and diverse brain pathologies. In addition, we conduct a large multi-reader study to compare these metrics to experts’ evaluations. Our analysis demonstrates that models proficient in normative learning exhibit exceptional versatility, adeptly detecting a wide range of unseen medical conditions.

MCML Authors

Benedikt Wiestler

Prof. Dr.

AI for Image-Guided Diagnosis and Therapy

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine

Julia Schnabel

Prof. Dr.

C1 | Medicine
→ Group Benedikt Wiestler

Computational Imaging and AI in Medicine

M. Balcerak, J. Weidner, P. Karnakov, I. Ezhov, S. Litvinov, P. Koumoutsakos, T. Amiranashvili, R. Z. Zhang, J. S. Lowengrub, I. Yakushev, B. Wiestler and B. Menze.
Individualizing glioma radiotherapy planning by optimization of a data and physics-informed discrete loss.
Nature Communications 16.5982 (Jun. 2025). DOI

Abstract

Brain tumor growth is unique to each glioma patient and extends beyond what is visible in imaging scans, infiltrating surrounding brain tissue. Understanding these hidden patient-specific progressions is essential for effective therapies. Current treatment plans for brain tumors, such as radiotherapy, typically involve delineating a uniform margin around the visible tumor on pre-treatment scans to target this invisible tumor growth. This ‘one size fits all’ approach is derived from population studies and often fails to account for the nuances of individual patient conditions. We present the Glioma Optimizing the Discrete Loss (GliODIL) framework, which infers the full spatial distribution of tumor cell concentration from available multi-modal imaging, leveraging a Fisher-Kolmogorov type physics model to describe tumor growth. This is achieved through the newly introduced method of Optimizing the Discrete Loss (ODIL), where both data and physics-based constraints are softly assimilated into the solution. Our test dataset comprises 152 glioblastoma patients with pre-treatment imaging and post-treatment follow-ups for tumor recurrence monitoring. By blending data-driven techniques with physics-based constraints, GliODIL enhances recurrence prediction in radiotherapy planning, challenging traditional uniform margins and strict adherence to the Fisher-Kolmogorov partial differential equation model, which is adapted for complex cases.

MCML Authors

Jonas Weidner

AI for Image-Guided Diagnosis and Therapy

Benedikt Wiestler

Prof. Dr.

AI for Image-Guided Diagnosis and Therapy

T. Li, S. Hofer, G. Moholdt, A. Igneczi, K. Heidler, X. Zhu and J. Bamber.
Pervasive glacier retreats across Svalbard from 1985 to 2023.
Nature Communications 16.705 (Jan. 2025). DOI

Abstract

A major uncertainty in predicting the behaviour of marine-terminating glaciers is ice dynamics driven by non-linear calving front retreat, which is poorly understood and modelled. Using 124919 calving front positions for 149 marine-terminating glaciers in Svalbard from 1985 to 2023, generated with deep learning, we identify pervasive calving front retreats for non-surging glaciers over the past 38 years. We observe widespread seasonal cycles in calving front position for over half of the glaciers. At the seasonal timescale, peak retreat rates exhibit a several-month phase lag, with changes on the west coast occurring before those on the east coast, coincident with regional ocean warming. This spatial variability in seasonal patterns is linked to different timings of warm ocean water inflow from the West Spitsbergen Current, demonstrating the dominant role of ice-ocean interaction in seasonal front changes. The interannual variability of calving front retreat shows a strong sensitivity to both atmospheric and oceanic warming, with immediate responses to large air and ocean temperature anomalies in 2016 and 2019, likely driven by atmospheric blocking that can influence extreme temperature variability. With more frequent blocking occurring and continued regional warming, future calving front retreats will likely intensify, leading to more significant glacier mass loss.

MCML Authors

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

A. Tejada-Lapuerta, P. Bertin, S. Bauer, H. Aliee, Y. Bengio and F. J. Theis.
Causal machine learning for single-cell genomics.
Nature Genetics (Mar. 2025). DOI

Abstract

Advances in single-cell ‘-omics’ allow unprecedented insights into the transcriptional profiles of individual cells and, when combined with large-scale perturbation screens, enable measuring of the effect of targeted perturbations on the whole transcriptome. These advances provide an opportunity to better understand the causative role of genes in complex biological processes. In this Perspective, we delineate the application of causal machine learning to single-cell genomics and its associated challenges. We first present the causal model that is most commonly applied to single-cell biology and then identify and discuss potential approaches to three open problems: the lack of generalization of models to novel experimental conditions, the complexity of interpreting learned models, and the difficulty of learning cell dynamics.

MCML Authors

Stefan Bauer

Prof. Dr.

Algorithmic Machine Learning & Explainable AI

Fabian Theis

Prof. Dr.

C2 | Biology

Mathematical Modelling of Biological Systems

T. Wiltgen, J. McGinnis, R. Berg, C. C. Voon, O. Puonti, K. Giglhuber, C. Ganter, C. Zimmer, B. Hemmer, B. Wiestler, J. Kirschke, C. Preibisch and M. Mühlau.
Towards quantitative intensity analysis of conventional T1-weighted images in multiple sclerosis.
NeuroImage 318.121395 (Sep. 2025). DOI

Abstract

Conventional T1-weighted (T1w) magnetic resonance imaging (MRI) is commonly used in multiple sclerosis (MS) morphometry and volumetry research. However, arbitrary intensity scales preclude interpretation of signal values across patients, sites, and time. This requires quantitative MRI techniques, which are not always available. This study assessed T1w image intensity scaling methods, relying on extracerebral reference regions, for quantitative analysis of brain MRI in MS. In total, 701 people with a diagnosis of radiologically isolated syndrome, clinically isolated syndrome, or MS were included. Four intensity scaling strategies were applied: 1) MRI signal modeling, 2) linear scaling with reference regions, 3) z-score standardization, and 4) none (only bias field correction). Methods were evaluated using variance analysis, R1 map comparison, and normal-appearing white matter (NAWM) intensity group comparison, using mean and coefficient of variation (CoV), between low (≤3) and high (>3) expanded disability status scale (EDSS) scores. Statistical analysis was conducted using Pearson’s r, two-sided Welch two-sample t-test, ANCOVA, and Cohen’s d.
Linear scaling with temporal fatty tissue achieved the most consistent variance reduction and strong correlation with R1 maps (r = 0.84). R1 values in NAWM were significantly lower in people with high compared to low EDSS scores (d = -0.351). Similarly, group differences in mean NAWM intensity of fat-scaled images were significant (d = -0.252). The largest group differences were found in NAWM CoV in bias field-corrected T1w images (d = 0.818). Linear scaling with fatty tissue most accurately reproduced the results obtained with R1 maps. Changes in MS NAWM appear to increase intensity variability detectable in conventional T1w images.

MCML Authors

Benedikt Wiestler

Prof. Dr.

C3 | Physics and Geo Sciences
→ Group Patrick Rinke

AI for Image-Guided Diagnosis and Therapy

P. Pisal, O. Krejci and P. Rinke.
Machine learning accelerated descriptor design for catalyst discovery in CO2 to methanol conversion.
npj Computational Materials 11.213 (Jun. 2025). DOI

Abstract

Transforming CO2 into methanol represents a crucial step towards closing the carbon cycle, with thermoreduction technology nearing industrial application. However, obtaining high methanol yields and ensuring the stability of heterocatalysts remain significant challenges. Herein, we present a sophisticated computational framework to accelerate the discovery of thermal heterogeneous catalysts, using machine-learned force fields. We propose a new catalytic descriptor, termed adsorption energy distribution, that aggregates the binding energies for different catalyst facets, binding sites, and adsorbates. The descriptor is versatile and can be adjusted to a specific reaction through careful choice of the key-step reactants and reaction intermediates. By applying unsupervised machine learning and statistical analysis to a dataset comprising nearly 160 metallic alloys, we offer a powerful tool for catalyst discovery. We propose new promising candidates such as ZnRh and ZnPt3, which to our knowledge, have not yet been tested, and discuss their possible advantage in terms of stability.

MCML Authors

Prajwal Pisal

AI-based Material Science

Patrick Rinke

Prof. Dr.

AI-based Material Science

N. Santhanam, H. E. Kim, D. Rügamer, A. Bender, S. Muthers, C. G. Cho, A. Alonso, K. Szabo, F.-S. Centner, H. Wenz, T. Ganslandt, M. Platten, C. Groden, M. Neumaier, F. Siegel and M. E. Maros.
Machine learning-based forecasting of daily acute ischemic stroke admissions using weather data.
npj Digital Medicine 8.225 (Apr. 2025). DOI

Abstract

Background: In the midst of the emerging climate crisis, healthcare providers lack locally validated, disease-specific surveillance models. Stroke, a significant contributor to the global disease burden, has been linked to climate change. Therefore, we developed and benchmarked machine learning (ML) models based on locoregional weather systems to forecast the number of daily acute ischemic stroke (AIS) admissions.
Methods: AIS patients diagnosed between 2015 and 2021 at the tertiary University Medical Center (UMC) Mannheim, Germany were extracted from the local data integration center and geospatially matched to weather data from the German Weather Service (DWD) based on the clinic’s, patients’ home and closest tower’s locations at the time of admission. Statistical-(Poisson), boosted generalized additive model (GAM), support vector machines (SVR), and tree-based models including random forest (RF) and extreme gradient boosting (XGB) were evaluated in regression settings within time-stratified nested cross-validation setup (training-validation: 2015-2020, test set: 2021) to predict the number of daily AIS admissions.
Findings: The cohort included 7,914 AIS patients (4,244 male, 53·6%). XGB showed the best test performance with lowest mean absolute error (MAE) of 1·21 cases/day. Maximum air pressure was identified as the top predictive variable. Shapley additive explanations analyses revealed that temperature extremes of extended cold- (lag-3 minimum temperature <-2 °C; minimum perceived temperature <-1·4 °C) and hot stressors (lag-7 minimum temperature >15 °C), as well as stormy conditions (lag-1 and lag-2 maximum wind gust >14 m/s and speed >10·4 m/s), increased stroke incidences substantially with distinct seasonal associations.
Interpretation: ML models can sufficiently forecast AIS admissions based on weather patterns allowing for improved resource allocation and preparedness.

MCML Authors

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

E. Walter, T. Brock, P. Lahoud, N. Werner, F. Czaja, A. Tichy, C. Bumm, A. Bender, A. Castro, W. Teughels, F. Schwendicke and M. Folwaczny.
Predictive modeling for step II therapy response in periodontitis - model development and validation.
npj Digital Medicine 8.445 (Jul. 2025). DOI

Abstract

Steps I and II periodontal therapy is the first-line treatment for periodontal disease, but has varying success. This study aimed to develop machine learning models to predict changes in periodontal probing depth (PPD) after step II therapy using patient-, tooth-, and site-specific clinical covariates. Models accurately predicted that healthy sites stay healthy, but performed suboptimally for diseased sites. Tuning improved performance, with PPD, tooth-site, and tooth-type identified as key predictors. Pocket closure was predicted with fair accuracy, with baseline PPD as the most relevant covariate. Models predicted improving pockets well but underperformed for non-responding sites, with antibiotic treatment and tooth type being the most influential features. While predictive performance for step II periodontal therapy based on routine clinical data remains limited, models can stratify periodontal sites into meaningful categories and estimate the probability of pocket improvement. They provide a foundation for site-specific outcome prediction and may support patient communication and expectations.

MCML Authors

Tobias Brock

A1 | Statistical Foundations & Explainability
→ Group Thomas Nagler

Computational Statistics & Data Science

Andreas Bender

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Machine Learning Consulting Unit (MLCU)

R. R. Valiev, R. T. Nasibullin, H. Sandström, P. Rinke, K. Puolamäki and T. Kurten.
Predicting intersystem crossing rate constants of alkoxy-radical pairs with structure-based descriptors and machine learning.
Physical Chemistry Chemical Physics Advance Article (Jun. 2025). DOI

Abstract

Peroxy radicals (RO2) are ubiquitous intermediates in many oxidation processes, especially in the atmospheric gas phase. The recombination reaction of two peroxy radicals (RO2 + R′O2) has been demonstrated to lead, via several steps, to a triplet complex of two alkoxy radicals: 3(RO˙⋯R′O˙). The different product channels of RO2 + R′O2 reactions thus correspond to different reactions of this triplet complex. Of particular interest to atmospheric chemistry is the intersystem crossing (ISC) to the singlet state, which enables the recombination of the two radicals to an ROOR′ peroxide with considerably lower volatility than the original precursors. These peroxides are believed to be key contributors to the formation of secondary organic aerosol (SOA) particles, which in turn contribute to both air pollution and radiative forcing uncertainties. Developing reliable computational models for, e.g., RO2 + R′O2 branching ratios requires accurate estimates of the ISC rate constants, which can currently be obtained only from computationally expensive quantum chemistry calculations. By contrast, machine learning (ML) methods offer a faster alternative for estimating ISC rate constants. In the present work, we create a dataset with 98[thin space (1/6-em)]082 conformations of radical pairs and their corresponding rate constants. We apply three ML models—random forest (RF), CatBoost (CB), and a neural network (NN)—to predict ISC rate constants from triplet to singlet states. Specifically, the models predict kISC(T1 → Si) for i = 1–4 and the cumulative kISC(T1 → Sn), in alkoxy radical pairs, using only molecular geometry descriptors as inputs. All ML models achieved a mean absolute error (MAE) on our test set within one order of magnitude and a coefficient of determination R2 > 0.82 for all rate constants. Overall, the ML prediction matches the quantum chemical calculations within 1–2 orders of magnitude, providing a fast and scalable alternative for quantum chemical methods for ISC rate estimation.

MCML Authors

Patrick Rinke

Prof. Dr.

AI-based Material Science

H. Homm, J. Laakso and P. Rinke.
Efficient dataset generation for machine learning halide perovskite alloys.
Physical Review Materials 9.053802 (May. 2025). DOI

Abstract

Lead-based perovskite solar cells have reached high efficiencies, but toxicity and lack of stability hinder their wide-scale adoption. These issues have been partially addressed through compositional engineering of perovskite materials, but the vast complexity of the perovskite materials space poses a significant obstacle to exploration. We previously demonstrated how machine learning (ML) can accelerate property predictions for the CsPb⁢(Cl/Br)3 perovskite alloy. However, the substantial computational demand of density functional theory (DFT) calculations required for model training prevents applications to more complex materials. Here, we introduce a data-efficient scheme to facilitate model training, validated initially on CsPb⁢(Cl/Br)3 data and extended to the ternary alloy CsSn⁢(Cl/Br/I)3. Our approach employs clustering to construct a compact yet diverse initial dataset of atomic structures. We then apply a two-stage active learning approach to first improve the reliability of the ML-based structure relaxations and then refine accuracy near equilibrium structures. Tests for CsPb⁢(Cl/Br)3 demonstrate that our scheme reduces the number of required DFT calculations during the different parts of our proposed model training method by up to 20% and 50%. The fitted model for CsSn⁢(Cl/Br/I)3 is robust and highly accurate, evidenced by the convergence of all ML-based structure relaxations in our tests and an average relaxation error of only 0.5 meV/atom.

MCML Authors

Patrick Rinke

Prof. Dr.

B2 | Natural Language Processing
→ Group Hinrich Schütze

AI-based Material Science

P. Wicke and M. M. Bolognesi.
Red and blue language: Word choices in the Trump and Harris 2024 presidential debate.
PLOS One 20.6 (Jun. 2025). DOI GitHub

Abstract

Political debates are a peculiar type of political discourse, in which candidates directly confront one another, addressing not only the the moderator’s questions, but also their opponent’s statements, as well as the concerns of voters from both parties and undecided voters. Therefore, language is adjusted to meet specific expectations and achieve persuasion. We analyse how the language of Trump and Harris during the Presidential debate (September 10th, 2024) differs in relation to semantic and pragmatic features, for which we formulated targeted hypotheses: framing values and ideology, appealing to emotion, using words with different degrees of concreteness and specificity, addressing others through singular or plural pronouns. Our findings include: differences in the use of figurative frames (Harris often framing issues around recovery and empowerment, Trump often focused on crisis and decline); similar use of emotional language, with Trump showing a slightly higher tendency toward negativity and toward less subjective language compared to Harris; no significant difference in the specificity of candidates’ responses; similar use of abstract language, with Trump showing more variability than Harris, depending on the subject discussed; differences in addressing the opponent, with Trump not mentioning Harris by name, while Harris referring to Trump frequently; different uses of pronouns, with Harris using both singular and plural pronouns equally, while Trump using more singular pronouns. The results are discussed in relation to previous literature on Red and Blue language, which refers to distinct linguistic patterns associated with Republican (Red) and Democratic (Blue) political ideologies.

MCML Authors

Philipp Wicke

Dr.

Computational Linguistics

D. Di Fraia, A. Marino, J. H. Lee, E. Kelmer Sacramento, M. Baumgart, S. Bagnoli, T. Balla, F. Schalk, S. Kamrad, R. Guan, C. Caterino, C. Giannuzzi, P. T. da Silva, A. K. Sahu, H. Gut, G. Siano, M. , E. Terzibasi-Tozzini, E. F. Fornasiero, J. Gagneur, C. Englert, K. R. Patil, C. Correia-Melo, D. D. Nedialkova, J. Frydman, A. Cellerino and A. Ori.
Altered translation elongation contributes to key hallmarks of aging in the killifish brain.
Science 389.6759 (Jul. 2025). DOI

Abstract

Introduction: Aging brains are characterized by a series of molecular and cellular changes known as aging hallmarks. Among these, a decline in protein homeostasis (proteostasis) marked by reduced clearance and increased protein damage and aggregation has received particular attention as a plausible link between brain aging and those neurodegenerative diseases also characterized by protein aggregation. A notable phenomenon in brain aging is a loss of concordance between mRNA and protein levels, whereby age-linked changes in mRNA do not necessarily lead to proportional changes in protein levels. In this study, we set out to investigate the causes of this “protein-transcript decoupling” and how impaired protein synthesis might contribute to other hallmarks of brain aging.
Rationale: We used the short-lived African turquoise killifish, which exhibits a naturally compressed life span and accelerated brain aging, to undertake a comprehensive investigation of age-related decline in brain proteostasis. We compared young, adult, and old killifish brains at the levels of amino acid concentrations, tRNAs, mRNAs (transcriptome), actively translated mRNAs (translatome), proteins (proteome), protein modifications [phosphorylation (Ph), ubiquitylation (Ub), and acetylation (Ac)], and protein solubility and subcellular localization. We also tested whether reduced protein degradation caused by proteasome impairment contributes to protein-transcript decoupling and other aging hallmarks in the killifish brain. Our comprehensive design allowed us to pinpoint aging-vulnerable steps in protein biogenesis and reveal mechanisms connecting proteostasis decline to other aging hallmarks.
Results: We observed alterations in all molecular signatures investigated, ranging from amino acid concentrations to protein solubility and localization. A clear pattern of proteostasis dysfunction emerged: Although the synthesis of some proteins was enhanced, there was a widespread reduction of proteins enriched in positively charged (basic) amino acids. Notably, many DNA and RNA binding proteins exhibited reduced abundance in old brains, decreasing at the protein but not the transcript levels. Ribosome profiling (Ribo-seq) revealed that brain aging increased ribosome stalling. Accordingly, ribosome collisions were more frequent in old brains. Crucially, stalling events occurred disproportionately on stretches enriched in lysine and arginine codons, thus affecting translation of mRNAs encoding proteins enriched in these basic amino acids, leading to a decline in their protein levels in old brains. Aging-affected proteins included ribosomal subunits and proteins involved in DNA repair, transcription, chromatin maintenance, and RNA splicing and export, which all mediate processes influenced by aging. Ribosome stalling was also associated with increased protein insolubility, likely owing to nascent polypeptide misfolding. Partial proteasome inhibition affected aging hallmarks distinct from those linked to translation dysfunction and primarily influenced lysosomes and mitochondria.
Conclusion: This work identifies altered translation elongation and impaired protein biogenesis as hallmarks of brain aging in a short-lived vertebrate. Increased ribosome pausing is proposed as a key mechanism contributing to the mismatch between mRNA and protein changes observed in aged brains, leading to proteome aging by altering the production of proteins essential for genome integrity, mRNA transcription, splicing, protein synthesis, and mitochondrial function. This mechanism thereby links translation and proteostasis decline to other hallmarks of aging and may also be implicated in neurodegenerative diseases where similar ribosome dysfunction and protein misfolding occur.

MCML Authors

Pedro Tomaz da Silva

C2 | Biology
→ Group Julien Gagneur

Computational Molecular Medicine

Julien Gagneur

Prof. Dr.

C2 | Biology

Computational Molecular Medicine

Q. Xu, Y. Shi, J. Zhao and X. Zhu.
FloodCastBench: A Large-Scale Dataset and Foundation Models for Flood Modeling and Forecasting.
Scientific Data 12.431 (Mar. 2025). DOI

Abstract

Effective flood forecasting is crucial for informed decision-making and emergency response. Existing flood datasets mainly describe flood events but lack dynamic process data suitable for machine learning (ML). This work introduces the FloodCastBench dataset, designed for ML-based flood modeling and forecasting, featuring four major flood events: Pakistan 2022, UK 2015, Australia 2022, and Mozambique 2019. FloodCastBench details the process of flood dynamics data acquisition, starting with input data preparation (e.g., topography, land use, rainfall) and flood measurement data collection (e.g., SAR-based maps, surveyed outlines) for hydrodynamic modeling. We deploy a widely recognized finite difference numerical solution to construct high-resolution spatiotemporal dynamic processes with 30-m spatial and 300-second temporal resolutions. Flood measurement data are used to calibrate the hydrodynamic model parameters and validate the flood inundation maps. FloodCastBench provides comprehensive low-fidelity and high-fidelity flood forecasting datasets specifically for ML. Furthermore, we establish a benchmark of foundational models for neural flood forecasting using FloodCastBench, validating its effectiveness in supporting ML models for spatiotemporal, cross-regional, and downscaled flood forecasting.

MCML Authors

Qingsong Xu

C3 | Physics and Geo Sciences
→ Group Xiaoxiang Zhu

Data Science in Earth Observation

Jie Zhao

Dr.

C3 | Physics and Geo Sciences
→ Group Xiaoxiang Zhu

Data Science in Earth Observation

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

M. Gorski, S. Wiegrebe, R. Burkhardt, M. Behr, H. Küchenhoff, K. J. Stark, C. A. Böger and I. M. Heid.
Bias-corrected serum creatinine from UK Biobank electronic medical records generates an important data resource for kidney function trajectories.
Scientific Reports 15.3540 (Jan. 2025). DOI

Abstract

Loss of kidney function is a substantial personal and public health burden. Kidney function is typically assessed as estimated glomerular filtration rate (eGFR) based on serum creatinine. UK Biobank provides serum creatinine measurements from study center assessments (SC, n = 425,147 baseline, n = 15,314 with follow-up) and emerging electronic Medical Records (eMR, ‘GP-clinical’) present a promising resource to augment this data longitudinally. However, it is unclear whether eMR-based and SC-based creatinine values can be used jointly for research on eGFR decline. When comparing eMR-based with SC-based creatinine by calendar year (n = 70,231), we found a year-specific multiplicative bias for eMR-based creatinine that decreased over time (factor 0.84 for 2007, 0.97 for 2013). Deriving eGFR based on SC- and bias-corrected eMR-creatinine yielded 454,907 individuals with ≥ 1eGFR assessment (2,102,174 assessments). This included 206,063 individuals with ≥ 2 assessments over up to 60.2 years (median 6.00 assessments, median time = 8.7 years), where we also obtained eMR-based information on kidney disease or renal replacement therapy. We found an annual eGFR decline of 0.11 (95%-CI = 0.10–0.12) versus 1.04 mL/min/1.73m2/year (9%-CI = 1.03–1.05) without and with bias-correction, the latter being in line with literature. In summary, our bias-corrected eMR-based creatinine values enabled a 4-fold increased number of eGFR assessments in UK Biobank suitable for kidney function research.

MCML Authors

Helmut Küchenhoff

Prof. Dr.

A3 | Computational Models
→ Group Niki Kilbertus

Statistical Consulting Unit (StaBLab)

E. Ailer, C. L. Müller and N. Kilbertus.
Instrumental variable estimation for compositional treatments.
Scientific Reports 15.5158 (Feb. 2025). DOI

Abstract

Many scientific datasets are compositional in nature. Important biological examples include species abundances in ecology, cell-type compositions derived from single-cell sequencing data, and amplicon abundance data in microbiome research. Here, we provide a causal view on compositional data in an instrumental variable setting where the composition acts as the cause. First, we crisply articulate potential pitfalls for practitioners regarding the interpretation of compositional causes from the viewpoint of interventions and warn against attributing causal meaning to common summary statistics such as diversity indices in microbiome data analysis. We then advocate for and develop multivariate methods using statistical data transformations and regression techniques that take the special structure of the compositional sample space into account while still yielding scientifically interpretable results. In a comparative analysis on synthetic and real microbiome data we show the advantages and limitations of our proposal. We posit that our analysis provides a useful framework and guidance for valid and informative cause-effect estimation in the context of compositional data.

MCML Authors

Elisabeth Ailer

* Former Member

Christian Müller

Prof. Dr.

C2 | Biology

Biomedical Statistics and Data Science

Niki Kilbertus

Prof. Dr.

A2 | Mathematical Foundations
→ Group Massimo Fornasier

Ethics in Systems Design and Machine Learning

A. Scagliotti.
Minimax Problems for Ensembles of Control-Affine Systems.
SIAM Journal on Control and Optimization 63.1 (Jan. 2025). DOI

Abstract

In this paper, we consider ensembles of control-affine systems in ℝd, and we study simultaneous optimal control problems related to the worst-case minimization. After proving that such problems admit solutions, denoting with (ΘN)N a sequence of compact sets that parametrize the ensembles of systems, we first show that the corresponding minimax optimal control problems are Γ-convergent whenever (ΘN)N has a limit with respect to the Hausdorff distance. Besides its independent interest, the previous result plays a crucial role for establishing the Pontryagin Maximum Principle (PMP) when the ensemble is parametrized by a set Θ consisting of infinitely many points. Namely, we first approximate Θ by finite and increasing-in-size sets (ΘN)N for which the PMP is known, and then we derive the PMP for the Γ-limiting problem. The same strategy can be pursued in applications, where we can reduce infinite ensembles to finite ones to compute the minimizers numerically. We bring as a numerical example the Schrödinger equation for a qubit with uncertain resonance frequency.

MCML Authors

Alessandro Scagliotti

Applied Numerical Analysis

A. Datar, A. Datar, F. Dietrich and W. Schilders.
Systematic Construction of Continuous-Time Neural Networks for Linear Dynamical Systems.
SIAM Journal on Scientific Computing 47.4 (Jul. 2025). DOI

Abstract

Discovering a suitable neural network architecture for modeling complex dynamical systems poses a formidable challenge, often involving extensive trial and error and navigation through a high-dimensional hyperparameter space. In this paper, we discuss a systematic approach to constructing neural architectures for modeling a subclass of dynamical systems, namely, linear time-invariant (LTI) systems. We use a variant of continuous-time neural networks in which the output of each neuron evolves continuously as a solution of a first-order or second-order ordinary differential equation. Instead of deriving the network architecture and parameters from data, we propose a gradient-free algorithm to compute sparse architecture and network parameters directly from the given LTI system, leveraging its properties. We bring forth a novel neural architecture paradigm featuring horizontal hidden layers and provide insights into why employing conventional neural architectures with vertical hidden layers may not be favorable. We also provide an upper bound on the numerical errors of our neural networks. Finally, we demonstrate the high accuracy of our constructed networks on three numerical examples.

MCML Authors

Felix Dietrich

Prof. Dr.

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Physics-enhanced Machine Learning

L. von der Heyde, A.-C. Haensch and A. Wenz.
Vox Populi, Vox AI? Using Language Models to Estimate German Public Opinion.
Social Science Computer Review Online First (Apr. 2025). DOI

Abstract

‘Synthetic samples’ generated by large language models (LLMs) have been argued to complement or replace traditional surveys, assuming their training data is grounded in human-generated data that potentially reflects attitudes and behaviors prevalent in the population. Initial US-based studies that have prompted LLMs to mimic survey respondents found that the responses match survey data. However, the relationship between the respective target population and LLM training data might affect the generalizability of such findings. In this paper, we critically evaluate the use of LLMs for public opinion research in a different context, by investigating whether LLMs can estimate vote choice in Germany. We generate a synthetic sample matching the 2017 German Longitudinal Election Study respondents and ask the LLM GPT-3.5 to predict each respondent’s vote choice. Comparing these predictions to the survey-based estimates on the aggregate and subgroup levels, we find that GPT-3.5 exhibits a bias towards the Green and Left parties. While the LLM predictions capture the tendencies of “typical” voters, they miss more complex factors of vote choice. By examining the LLM-based prediction of voting behavior in a non-English speaking context, our study contributes to research on the extent to which LLMs can be leveraged for studying public opinion. The findings point to disparities in opinion representation in LLMs and underscore the limitations in applying them for public opinion estimation.

MCML Authors

Leah von der Heyde

Social Data Science and AI

Anna-Carolina Haensch

Dr.

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

R. Hornung, M. Nalenz, L. Schneider, A. Bender, L. Bothmann, F. Dumpert, B. Bischl, T. Augustin and A.-L. Boulesteix.
Evaluating Machine Learning Models in Non-Standard Settings: An Overview and New Findings.
Statistical Science (Mar. 2025). To be published. Preprint available. arXiv

Abstract

Estimating the generalization error (GE) of machine learning models is fundamental, with resampling methods being the most common approach. However, in non-standard settings, particularly those where observations are not independently and identically distributed, resampling using simple random data divisions may lead to biased GE estimates. This paper strives to present well-grounded guidelines for GE estimation in various such non-standard settings: clustered data, spatial data, unequal sampling probabilities, concept drift, and hierarchically structured outcomes. Our overview combines well-established methodologies with other existing methods that, to our knowledge, have not been frequently considered in these particular settings. A unifying principle among these techniques is that the test data used in each iteration of the resampling procedure should reflect the new observations to which the model will be applied, while the training data should be representative of the entire data set used to obtain the final model. Beyond providing an overview, we address literature gaps by conducting simulation studies. These studies assess the necessity of using GE-estimation methods tailored to the respective setting. Our findings corroborate the concern that standard resampling methods often yield biased GE estimates in non-standard settings, underscoring the importance of tailored GE estimation.

MCML Authors

Roman Hornung

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Andreas Bender

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Machine Learning Consulting Unit (MLCU)

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine

S. Lumpp and M. Drton.
On weak convergence of Gaussian conditional distributions.
Statistics and Probability Letters 226.110497 (Nov. 2025). DOI

Abstract

Weak convergence of joint distributions generally does not imply convergence of conditional distributions. In particular, conditional distributions need not converge when joint Gaussian distributions converge to a singular Gaussian limit. Algebraically, this is due to the fact that at singular covariance matrices, Schur complements are not continuous functions of the matrix entries. Our results lay out special conditions under which convergence of Gaussian conditional distributions nevertheless occurs, and we exemplify how this allows one to reason about conditional independence in a new class of graphical models.

MCML Authors

Mathias Drton

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Mathematical Statistics

M. M. Mandl, A.-L. Boulesteix, S. Burgess and V. Zuber.
Outlier Detection in Mendelian Randomization.
Statistics in Medicine 44.15-17 (Jul. 2025). DOI

Abstract

Mendelian randomization (MR) uses genetic variants as instrumental variables to infer causal effects of exposures on an outcome. One key assumption of MR is that the genetic variants used as instrumental variables are independent of the outcome conditional on the risk factor and unobserved confounders. Violations of this assumption, that is, the effect of the instrumental variables on the outcome through a path other than the risk factor included in the model (which can be caused by pleiotropy), are common phenomena in human genetics. Genetic variants, which deviate from this assumption, appear as outliers to the MR model fit and can be detected by the general heterogeneity statistics proposed in the literature, which are known to suffer from overdispersion, that is, too many genetic variants are declared as false outliers. We propose a method that corrects for overdispersion of the heterogeneity statistics in uni- and multivariable MR analysis by making use of the estimated inflation factor to correctly remove outlying instruments and therefore account for pleiotropic effects. Our method is applicable to summary-level data.

MCML Authors

Maximilian Mandl

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine

T. Boege, M. Drton, B. Hollering, S. Lumpp, P. Misra and D. Schkoda.
Conditional independence in stationary distributions of diffusions.
Stochastic Processes and their Applications 184.104604 (Jun. 2025). DOI

Abstract

Stationary distributions of multivariate diffusion processes have recently been proposed as probabilistic models of causal systems in statistics and machine learning. Motivated by these developments, we study stationary multivariate diffusion processes with a sparsely structured drift. Our main result gives a characterization of the conditional independence relations that hold in a stationary distribution. The result draws on a graphical representation of the drift structure and pertains to conditional independence relations that hold generally as a consequence of the drift’s sparsity pattern.

MCML Authors

Mathias Drton

Prof. Dr.

C3 | Physics and Geo Sciences
→ Group Xiaoxiang Zhu

Mathematical Statistics

V. Steidl, J. L. Bamber and X. Zhu.
Physics-aware machine learning for glacier ice thickness estimation: a case study for Svalbard.
The Cryosphere 19.2 (Feb. 2025). DOI

Abstract

The ice thickness of the world’s glaciers is mostly unmeasured, and physics-based models to reconstruct ice thickness cannot always deliver accurate estimates. In this study, we use deep learning paired with physical knowledge to generate ice thickness estimates for all glaciers of Spitsbergen, Barentsøya, and Edgeøya in Svalbard. We incorporate mass conservation and other physically derived conditions into a neural network to predict plausible ice thicknesses even for glaciers without any in situ ice thickness measurements. With a glacier-wise cross-validation scheme, we evaluate the performance of the physics-informed neural network. The results of these proof-of-concept experiments let us identify several challenges and opportunities that affect the model’s performance in a real-world setting.

MCML Authors

Viola Steidl

Data Science in Earth Observation

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

K. Ghosh, M. Todorović, A. Vehtari and P. Rinke.
Active learning of molecular data for task-specific objectives.
The Journal of Chemical Physics 162.014103 (Jan. 2025). DOI

Abstract

Active learning (AL) has shown promise to be a particularly data-efficient machine learning approach. Yet, its performance depends on the application, and it is not clear when AL practitioners can expect computational savings. Here, we carry out a systematic AL performance assessment for three diverse molecular datasets and two common scientific tasks: compiling compact, informative datasets and targeted molecular searches. We implemented AL with Gaussian processes (GP) and used the many-body tensor as molecular representation. For the first task, we tested different data acquisition strategies, batch sizes, and GP noise settings. AL was insensitive to the acquisition batch size, and we observed the best AL performance for the acquisition strategy that combines uncertainty reduction with clustering to promote diversity. However, for optimal GP noise settings, AL did not outperform the randomized selection of data points. Conversely, for targeted searches, AL outperformed random sampling and achieved data savings of up to 64%. Our analysis provides insight into this task-specific performance difference in terms of target distributions and data collection strategies. We established that the performance of AL depends on the relative distribution of the target molecules in comparison to the total dataset distribution, with the largest computational savings achieved when their overlap is minimal.

MCML Authors

Patrick Rinke

Prof. Dr.

AI-based Material Science

V. Iwuajoku, K. Ekici, A. Haas, M. Z. Kazemi, A. Kasajima, C. Delbridge, A. Muckenhuber, E. Schmoeckel, F. Stögbauer, C. Bollwein, K. Schwamborn, K. Steiger, C. Mogler and P. J. Schüffler.
An equivalency and efficiency study for one year digital pathology for clinical routine diagnostics in an accredited tertiary academic center.
Virchows Archiv (Feb. 2025). DOI

Abstract

Digital pathology is revolutionizing clinical diagnostics by offering enhanced efficiency, accuracy, and accessibility of pathological examinations. This study explores the implementation and validation of digital pathology in a large tertiary academic center, focusing on its gradual integration and transition into routine clinical diagnostics. In a comprehensive validation process over a 6-month period, we compared sign-out of digital and physical glass slides of a wide range of different tissue specimens and histopathological diagnoses. Key metrics such as diagnostic concordance and user satisfaction were assessed by involving the pathologists in a validation training and study phase. We measured turnaround times before and after transitioning to digital pathology to assess the impact on overall efficiency. Our results demonstrate a 99% concordance between the analog and digital reports while at the same time reducing the time to sign out a case by almost a minute, suggesting potential long-term efficiency gains. Our digital transition positively impacted our pathology workflow: Pathologists reported increased flexibility and satisfaction due to the ease of accessing and sharing digital slides. However, challenges were identified, including technical issues related to image quality and system integration. Lessons learned from this study emphasize the importance of robust training programs, adequate IT support, and ongoing evaluation to ensure successful integration. This validation study confirms that digital pathology is a viable and beneficial tool for accurate clinical routine diagnostics in large academic centers, offering insights for other institutions considering similar endeavors.

MCML Authors

Peter Schüffler

Prof. Dr.