A | Foundations of Machine Learning

Research Area A aims to strengthen competence in Statistical Foundations and Explainability, Mathematical Foundations, and Computational Methods. These fields form the basis for all methodological advances.

A1 | Statistical Foundations & Explainability

Research at MCML in this area aims to make results obtained with ML algorithms more reliable, interpretable, and acceptable in practical applications by integrating statistical concepts more closely. Key challenges include integrating uncertainty quantification into ML algorithms, explaining ML models, simplifying ML methods, and incorporating prior knowledge into ML algorithms.

Stefan Bauer

Prof. Dr.

Algorithmic Machine Learning & Explainable AI

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine

Mathias Drton

Prof. Dr.

Mathematical Statistics

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science

Göran Kauermann

Prof. Dr.

Applied Statistics in Social Sciences, Economics and Business

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Fabian Scheipl

PD Dr.

Functional Data Analysis

Volker Schmid

Prof. Dr.

Bayesian Imaging & Spatial Statistics

Andreas Döpp

Dr. habil.

Data-driven methods in Physics and Optics

Vincent Fortuin

Dr.

Bayesian Deep Learning

Georgios Kaissis

Dr.

Artificial Intelligence in Healthcare and Medicine

Michael Schomaker

Prof. Dr.

Biostatistics

Publications in Research Area A1
[360]
D. Dold, J. Kobialka, N. Palm, E. Sommer, D. Rügamer and O. Dürr.
Paths and Ambient Spaces in Neural Loss Landscapes.
AISTATS 2025 - 28th International Conference on Artificial Intelligence and Statistics. Mai Khao, Thailand, May 03-05, 2025. To be published. URL
Abstract

Understanding the structure of neural network loss surfaces, particularly the emergence of low-loss tunnels, is critical for advancing neural network theory and practice. In this paper, we propose a novel approach to directly embed loss tunnels into the loss landscape of neural networks. Exploring the properties of these loss tunnels offers new insights into their length and structure and sheds light on some common misconceptions. We then apply our approach to Bayesian neural networks, where we improve subspace inference by identifying pitfalls and proposing a more natural prior that better guides the sampling procedure.

MCML Authors

Julius Kobialka

Statistics, Data Science and Machine Learning

Nicolai Palm

Computational Statistics & Data Science

Emanuel Sommer

Statistics, Data Science and Machine Learning

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[359]
R. Schulte and D. Rügamer.
Additive Model Boosting: New Insights and Path(ologie)s.
AISTATS 2025 - 28th International Conference on Artificial Intelligence and Statistics. Mai Khao, Thailand, May 03-05, 2025. To be published. URL
Abstract

Additive models (AMs) have sparked a lot of interest in machine learning recently, allowing the incorporation of interpretable structures into a wide range of model classes. Many commonly used approaches to fit a wide variety of potentially complex additive models build on the idea of boosting additive models. While boosted additive models (BAMs) work well in practice, certain theoretical aspects are still poorly understood, including general convergence behavior and what optimization problem is being solved when accounting for the implicit regularizing nature of boosting. In this work, we study the solution paths of BAMs and establish connections with other approaches for certain classes of problems. Along these lines, we derive novel convergence results for BAMs, which yield crucial insights into the inner workings of the method. While our results generally provide reassuring theoretical evidence for the practical use of BAMs, they also uncover some ‘pathologies’ of boosting for certain additive model classes concerning their convergence behavior that require caution in practice. We empirically validate our theoretical findings through several numerical experiments.

MCML Authors

Rickmer Schulte

Statistics, Data Science and Machine Learning

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[358]
H. Baniecki, G. Casalicchio, B. Bischl and P. Biecek.
Efficient and Accurate Explanation Estimation with Distribution Compression.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. Spotlight paper. To be published. Preprint available. arXiv
Abstract

We discover a theoretical connection between explanation estimation and distribution compression that significantly improves the approximation of feature attributions, importance, and effects. While the exact computation of various machine learning explanations requires numerous model inferences and becomes impractical, the computational cost of approximation increases with an ever-increasing size of data and model parameters. We show that the standard i.i.d. sampling used in a broad spectrum of algorithms for post-hoc explanation leads to an approximation error worthy of improvement. To this end, we introduce Compress Then Explain (CTE), a new paradigm of sample-efficient explainability. It relies on distribution compression through kernel thinning to obtain a data sample that best approximates its marginal distribution. CTE significantly improves the accuracy and stability of explanation estimation with negligible computational overhead. It often achieves an on-par explanation approximation error 2-3x faster by using fewer samples, i.e. requiring 2-3x fewer model evaluations. CTE is a simple, yet powerful, plug-in for any explanation method that now relies on i.i.d. sampling.

MCML Authors

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[357]
C. Kolb, T. Weber, B. Bischl and D. Rügamer.
Deep Weight Factorization: Sparse Learning Through the Lens of Artificial Symmetries.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. arXiv
Abstract

Sparse regularization techniques are well-established in machine learning, yet their application in neural networks remains challenging due to the non-differentiability of penalties like the L1 norm, which is incompatible with stochastic gradient descent. A promising alternative is shallow weight factorization, where weights are decomposed into two factors, allowing for smooth optimization of L1-penalized neural networks by adding differentiable L2 regularization to the factors. In this work, we introduce deep weight factorization, extending previous shallow approaches to more than two factors. We theoretically establish equivalence of our deep factorization with non-convex sparse regularization and analyze its impact on training dynamics and optimization. Due to the limitations posed by standard training practices, we propose a tailored initialization scheme and identify important learning rate requirements necessary for training factorized networks. We demonstrate the effectiveness of our deep weight factorization through experiments on various architectures and datasets, consistently outperforming its shallow counterpart and widely used pruning methods.
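The shallow factorization mentioned in this abstract rests on a standard identity: if a scalar weight w is written as a product u * v, the smallest value of the L2 penalty (u^2 + v^2) / 2 over all such factorizations equals |w|. The following minimal sketch checks this numerically. The function name and the grid search are illustrative assumptions and not the authors' implementation; the deep (more than two factors) case from the paper is not covered.

```python
import math

def min_l2_penalty(w, grid_size=20001, u_max=5.0):
    """Approximate the minimum of (u**2 + v**2) / 2 subject to u * v == w
    by a grid search over u (hypothetical helper, for illustration only)."""
    if w == 0.0:
        return 0.0  # u = v = 0 factorizes w = 0 with zero penalty
    best = math.inf
    for i in range(1, grid_size):
        u = u_max * i / (grid_size - 1)  # u > 0; the sign of w is carried by v
        v = w / u                        # enforce the constraint u * v == w
        best = min(best, 0.5 * (u * u + v * v))
    return best

for w in (-2.0, -0.3, 0.0, 0.7, 1.5):
    # The closed-form minimizer is |u| = |v| = sqrt(|w|), giving penalty |w|.
    print(f"w={w:+.1f}  grid minimum={min_l2_penalty(w):.4f}  |w|={abs(w):.4f}")
```

The grid minimum matches |w| up to discretization error, which is why a smooth L2 penalty on the factors can stand in for the non-differentiable L1 penalty on the product.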

MCML Authors

Chris Kolb

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[356]
Y. Li, D. Rügamer, B. Bischl and M. Rezaei.
Calibrating LLMs with Information-Theoretic Evidential Deep Learning.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Fine-tuned large language models (LLMs) often exhibit overconfidence, particularly when trained on small datasets, resulting in poor calibration and inaccurate uncertainty estimates. Evidential Deep Learning (EDL), an uncertainty-aware approach, enables uncertainty estimation in a single forward pass, making it a promising method for calibrating fine-tuned LLMs. However, despite its computational efficiency, EDL is prone to overfitting, as its training objective can result in overly concentrated probability distributions. To mitigate this, we propose regularizing EDL by incorporating an information bottleneck (IB). Our approach IB-EDL suppresses spurious information in the evidence generated by the model and encourages truly predictive information to influence both the predictions and uncertainty estimates. Extensive experiments across various fine-tuned LLMs and tasks demonstrate that IB-EDL outperforms both existing EDL and non-EDL approaches. By improving the trustworthiness of LLMs, IB-EDL facilitates their broader adoption in domains requiring high levels of confidence calibration.
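To make the single-forward-pass uncertainty estimate concrete, the sketch below uses one common formulation of Evidential Deep Learning: non-negative per-class evidence defines Dirichlet parameters alpha_k = evidence_k + 1, the expected class probability is alpha_k divided by the total strength, and the uncertainty mass is the number of classes divided by that strength. This is an assumption about the generic EDL setup, not the IB-EDL regularizer proposed in the paper, and the function name is hypothetical.

```python
def dirichlet_summary(evidence):
    """Turn non-negative class evidence into expected probabilities and an
    uncertainty mass, following a common EDL formulation (assumed here)."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    alpha = [e + 1.0 for e in evidence]      # Dirichlet concentration parameters
    strength = sum(alpha)                    # total Dirichlet strength
    probs = [a / strength for a in alpha]    # expected class probabilities
    uncertainty = len(alpha) / strength      # 1.0 when there is no evidence at all
    return probs, uncertainty

# No evidence: uniform prediction with maximal uncertainty.
print(dirichlet_summary([0.0, 0.0, 0.0]))
# Strong evidence for class 0: concentrated prediction with low uncertainty.
print(dirichlet_summary([9.0, 0.0, 0.0]))
```

Under this formulation, a training objective that keeps inflating the evidence drives the uncertainty toward zero, which is the kind of overly concentrated distribution the abstract describes.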

MCML Authors

Yawei Li

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Mina Rezaei

Dr.

Statistical Learning and Data Science


[355]
E. Sommer, J. Robnik, G. Nozadze, U. Seljak and D. Rügamer.
Microcanonical Langevin Ensembles: Advancing the Sampling of Bayesian Neural Networks.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

Despite recent advances, sampling-based inference for Bayesian Neural Networks (BNNs) remains a significant challenge in probabilistic deep learning. While sampling-based approaches do not require a variational distribution assumption, current state-of-the-art samplers still struggle to navigate the complex and highly multimodal posteriors of BNNs. As a consequence, sampling still requires considerably longer inference times than non-Bayesian methods even for small neural networks, despite recent advances in making software implementations more efficient. Besides the difficulty of finding high-probability regions, the time until samplers provide sufficient exploration of these areas remains unpredictable. To tackle these challenges, we introduce an ensembling approach that leverages strategies from optimization and a recently proposed sampler called Microcanonical Langevin Monte Carlo (MCLMC) for efficient, robust and predictable sampling performance. Compared to approaches based on the state-of-the-art No-U-Turn Sampler, our approach delivers substantial speedups up to an order of magnitude, while maintaining or improving predictive performance and uncertainty quantification across diverse tasks and data modalities. The suggested Microcanonical Langevin Ensembles and modifications to MCLMC additionally enhance the method’s predictability in resource requirements, facilitating easier parallelization. All in all, the proposed method offers a promising direction for practical, scalable inference for BNNs.

MCML Authors

Emanuel Sommer

Statistics, Data Science and Machine Learning

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[354]
L. Zumeta-Olaskoaga, A. Bender and D.-J. Lee.
Flexible modelling of time-varying exposures in event history analysis.
DAGStat 2025 - 7th Joint Statistical Meeting of the Deutsche Arbeitsgemeinschaft Statistik. Berlin, Germany, Mar 24-28, 2025. Poster presentation. Full paper available. DOI
Abstract

We present a flexible modelling approach to analyse time-varying exposures and recurrent events in team sports injuries. The approach is based on the piece-wise exponential additive mixed model where the effects of past exposures (i.e. high-intensity training loads) may accumulate over time and present complex forms of association. In order to identify a relevant time window at which past exposures have an impact on the current risk, we propose a penalty approach. We conduct a simulation study to evaluate the performance of the proposed model, under different true weight functions and different levels of heterogeneity between recurrent events. Finally, we illustrate the approach with a case study application involving an elite male football team participating in the Spanish LaLiga competition. The cohort includes time-loss injuries and external training load variables tracked by Global Positioning System devices, during the seasons 2017–2018 and 2018–2019.

MCML Authors

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[353]
L. Bothmann, S. Dandl, J. M. Alvarez, P. A. Boustani and B. Bischl.
Privilege Scores for Fairness-Aware ML.
DAGStat 2025 - 7th Joint Statistical Meeting of the Deutsche Arbeitsgemeinschaft Statistik. Berlin, Germany, Mar 24-28, 2025. Poster presentation. Preprint available.
Abstract

Bias-preserving methods in fairness-aware machine learning (fairML) focus on metrics that prioritize formal equality by balancing error rates across subgroups. These methods can perpetuate historical discrimination embedded in real-world data. In contrast, bias-transforming methods aim for substantive equality by actively addressing historical inequalities. As a contribution to bias-transforming methods, we introduce the concept of privilege scores, a novel approach to identifying and quantifying individual privilege in machine learning tasks. Privilege scores use causal inference techniques to compare real-world outcomes to those in a ‘fair’ world in which the protected attributes do not influence the target variable. This individual-level perspective provides actionable insights for applications such as affirmative action and beyond. Key contributions include (1) the formalization of privilege scores, (2) a methodological framework for estimation with uncertainty quantification via confidence intervals, (3) an interpretable machine learning approach for understanding privilege score contributions, and (4) a novel in-processing method, Multi-PrivScore, to mitigate model-level discrimination during model training. Experiments on simulated and real-world data demonstrate the usefulness of privilege scores. Overall, our work highlights privilege scores as a versatile tool for assessing and mitigating historical discrimination in various machine learning applications.

MCML Authors

Ludwig Bothmann

Dr.

Statistical Learning and Data Science

Philip Amir Boustani

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[352]
A. H. Berger, L. Lux, S. Shit, I. Ezhov, G. Kaissis, M. Menten, D. Rückert and J. C. Paetzold.
Cross-domain and Cross-dimension Learning for Image-to-Graph Transformers.
WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Feb 28-Mar 04, 2025. To be published. Preprint available. arXiv
Abstract

Direct image-to-graph transformation is a challenging task that involves solving object detection and relationship prediction in a single model. Due to this task’s complexity, large training datasets are rare in many domains, making the training of deep-learning methods challenging. This data sparsity necessitates transfer learning strategies akin to the state-of-the-art in general computer vision. In this work, we introduce a set of methods enabling cross-domain and cross-dimension learning for image-to-graph transformers. We propose (1) a regularized edge sampling loss to effectively learn object relations in multiple domains with different numbers of edges, (2) a domain adaptation framework for image-to-graph transformers aligning image- and graph-level features from different domains, and (3) a projection function that allows using 2D data for training 3D transformers. We demonstrate our method’s utility in cross-domain and cross-dimension experiments, where we utilize labeled data from 2D road networks for simultaneous learning in vastly different target domains. Our method consistently outperforms standard transfer learning and self-supervised pretraining on challenging benchmarks, such as retinal or whole-brain vessel graph extraction.

MCML Authors

Laurin Lux

Artificial Intelligence in Healthcare and Medicine

Georgios Kaissis

Dr.

Artificial Intelligence in Healthcare and Medicine

Martin Menten

Dr.

Artificial Intelligence in Healthcare and Medicine

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine


[351]
L. Bothmann, P. Boustani, J. Alvarez, G. Casalicchio, B. Bischl and S. Dandl.
Privilege Scores.
Preprint (Feb. 2025). arXiv
Abstract

Bias-transforming methods of fairness-aware machine learning aim to correct a non-neutral status quo with respect to a protected attribute (PA). Current methods, however, lack an explicit formulation of what drives non-neutrality. We introduce privilege scores (PS) to measure PA-related privilege by comparing the model predictions in the real world with those in a fair world in which the influence of the PA is removed. At the individual level, PS can identify individuals who qualify for affirmative action; at the global level, PS can inform bias-transforming policies. After presenting estimation methods for PS, we propose privilege score contributions (PSCs), an interpretation method that attributes the origin of privilege to mediating features and direct effects. We provide confidence intervals for both PS and PSCs. Experiments on simulated and real-world data demonstrate the broad applicability of our methods and provide novel insights into gender and racial privilege in mortgage and college admissions applications.

MCML Authors

Ludwig Bothmann

Dr.

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[350]
N. Sturma, M. Kranzlmueller, I. Portakal and M. Drton.
Matching Criterion for Identifiability in Sparse Factor Analysis.
Preprint (Feb. 2025). arXiv
Abstract

Factor analysis models explain dependence among observed variables by a smaller number of unobserved factors. A main challenge in confirmatory factor analysis is determining whether the factor loading matrix is identifiable from the observed covariance matrix. The factor loading matrix captures the linear effects of the factors and, if unrestricted, can only be identified up to an orthogonal transformation of the factors. However, in many applications the factor loadings exhibit an interesting sparsity pattern that may lead to identifiability up to column signs. We study this phenomenon by connecting sparse factor models to bipartite graphs and providing sufficient graphical conditions for identifiability of the factor loading matrix up to column signs. In contrast to previous work, our main contribution, the matching criterion, exploits sparsity by operating locally on the graph structure, thereby improving existing conditions. Our criterion is efficiently decidable in time that is polynomial in the size of the graph, when restricting the search steps to sets of bounded size.

MCML Authors

Mathias Drton

Prof. Dr.

Mathematical Statistics


[349]
E. Garces Arias, M. Li, C. Heumann and M. Aßenmacher.
Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL
Abstract

Decoding strategies for large language models (LLMs) are a critical but often underexplored aspect of text generation tasks. Since LLMs produce probability distributions over the entire vocabulary, various decoding methods have been developed to transform these probabilities into coherent and fluent text, each with its own set of hyperparameters. In this study, we present a large-scale, comprehensive analysis of how hyperparameter selection affects text quality in open-ended text generation across multiple LLMs, datasets, and evaluation metrics. Through an extensive sensitivity analysis, we provide practical guidelines for hyperparameter tuning and demonstrate the substantial influence of these choices on text quality. Using three established datasets, spanning factual domains (e.g., news) and creative domains (e.g., fiction), we show that hyperparameter tuning significantly impacts generation quality, though its effects vary across models and tasks. We offer in-depth insights into these effects, supported by both human evaluations and a synthesis of widely-used automatic evaluation metrics.

MCML Authors

Esteban Garces Arias

Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[348]
C. Leininger, S. Rittel and L. Bothmann.
Overcoming Fairness Trade-offs via Pre-processing: A Causal Perspective.
Preprint (Jan. 2025). arXiv
Abstract

Training machine learning models for fair decisions faces two key challenges: The fairness-accuracy trade-off results from enforcing fairness which weakens its predictive performance in contrast to an unconstrained model. The incompatibility of different fairness metrics poses another trade-off – also known as the impossibility theorem. Recent work identifies the bias within the observed data as a possible root cause and shows that fairness and predictive performance are in fact in accord when predictive performance is measured on unbiased data. We offer a causal explanation for these findings using the framework of the FiND (fictitious and normatively desired) world, a ‘fair’ world, where protected attributes have no causal effects on the target variable. We show theoretically that (i) classical fairness metrics deemed to be incompatible are naturally satisfied in the FiND world, while (ii) fairness aligns with high predictive performance. We extend our analysis by suggesting how one can benefit from these theoretical insights in practice, using causal pre-processing methods that approximate the FiND world. Additionally, we propose a method for evaluating the approximation of the FiND world via pre-processing in practical use cases where we do not have access to the FiND world. In simulations and empirical studies, we demonstrate that these pre-processing methods are successful in approximating the FiND world and resolve both trade-offs. Our results provide actionable solutions for practitioners to achieve fairness and high predictive performance simultaneously.

MCML Authors

Simon Rittel

Statistical Learning and Data Science

Ludwig Bothmann

Dr.

Statistical Learning and Data Science


[347]
A. Reuter, T. G. J. Rudner, V. Fortuin and D. Rügamer.
Can Transformers Learn Full Bayesian Inference in Context?
Preprint (Jan. 2025). arXiv
Abstract

Transformers have emerged as the dominant architecture in the field of deep learning, with a broad range of applications and remarkable in-context learning (ICL) capabilities. While not yet fully understood, ICL has already proved to be an intriguing phenomenon, allowing transformers to learn in context – without requiring further training. In this paper, we further advance the understanding of ICL by demonstrating that transformers can perform full Bayesian inference for commonly used statistical models in context. More specifically, we introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows which enables us to infer complex posterior distributions for methods such as generalized linear models and latent factor models. Extensive experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods not operating in context.

MCML Authors

Vincent Fortuin

Dr.

Bayesian Deep Learning

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[346]
R. Schwank, A. McCormack and M. Drton.
Robust Score Matching.
Preprint (Jan. 2025). arXiv
Abstract

Proposed in Hyvärinen (2005), score matching is a parameter estimation procedure that does not require computation of distributional normalizing constants. In this work we utilize the geometric median of means to develop a robust score matching procedure that yields consistent parameter estimates in settings where the observed data has been contaminated. A special appeal of the proposed method is that it retains convexity in exponential family models. The new method is therefore particularly attractive for non-Gaussian, exponential family graphical models where evaluation of normalizing constants is intractable. Support recovery guarantees for such models when contamination is present are provided. Additionally, support recovery is studied in numerical experiments and on a precipitation dataset. We demonstrate that the proposed robust score matching estimator performs comparably to the standard score matching estimator when no contamination is present but greatly outperforms this estimator in a setting with contamination.
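The geometric median of means named in this abstract can be illustrated on the simplest possible target, a multivariate mean: split the sample into blocks, average each block, and take the geometric median of the block means. The sketch below does this with Weiszfeld iterations in pure Python. It is an illustration of the building block only, not the paper's robust score matching estimator; the function names, block count, iteration count, and contamination setup are all assumptions.

```python
import math
import random

def geometric_median(points, iters=200, eps=1e-12):
    """Weiszfeld iterations for the point minimizing summed Euclidean distances."""
    dim = len(points[0])
    est = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    for _ in range(iters):
        num, den = [0.0] * dim, 0.0
        for p in points:
            weight = 1.0 / max(math.dist(p, est), eps)  # guard against zero distance
            den += weight
            for j in range(dim):
                num[j] += weight * p[j]
        est = [x / den for x in num]
    return est

def geometric_median_of_means(data, n_blocks):
    """Average each of n_blocks interleaved blocks, then take the geometric median."""
    dim = len(data[0])
    blocks = [data[i::n_blocks] for i in range(n_blocks)]
    means = [[sum(x[j] for x in b) / len(b) for j in range(dim)] for b in blocks]
    return geometric_median(means)

rng = random.Random(0)
clean = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
contaminated = clean + [[1000.0, 1000.0]] * 5   # five gross outliers
naive = [sum(x[j] for x in contaminated) / len(contaminated) for j in range(2)]
robust = geometric_median_of_means(contaminated, n_blocks=21)
print("plain mean:", naive)
print("geometric median of means:", robust)
```

The block count is chosen so that contaminated blocks stay a minority; in that regime the plain mean is pulled far from the origin while the geometric median of means stays close to it.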

MCML Authors

Mathias Drton

Prof. Dr.

Mathematical Statistics


[345]
R. Dhahri, A. Immer, B. Charpentier, S. Günnemann and V. Fortuin.
Shaving Weights with Occam's Razor: Bayesian Sparsification for Neural Networks Using the Marginal Likelihood.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Neural network sparsification is a promising avenue to save computational time and memory costs, especially in an age where many successful AI models are becoming too large to naïvely deploy on consumer hardware. While much work has focused on different weight pruning criteria, the overall sparsifiability of the network, i.e., its capacity to be pruned without quality loss, has often been overlooked. We present Sparsifiability via the Marginal likelihood (SpaM), a pruning framework that highlights the effectiveness of using the Bayesian marginal likelihood in conjunction with sparsity-inducing priors for making neural networks more sparsifiable. Our approach implements an automatic Occam’s razor that selects the most sparsifiable model that still explains the data well, both for structured and unstructured sparsification. In addition, we demonstrate that the pre-computed posterior Hessian approximation used in the Laplace approximation can be re-used to define a cheap pruning criterion, which outperforms many existing (more expensive) approaches. We demonstrate the effectiveness of our framework, especially at high sparsity levels, across a range of different neural network architectures and datasets.

MCML Authors

Stephan Günnemann

Prof. Dr.

Data Analytics & Machine Learning

Vincent Fortuin

Dr.

Bayesian Deep Learning


[344]
T. Nagler, L. Schneider, B. Bischl and M. Feurer.
Reshuffling Resampling Splits Can Improve Generalization of Hyperparameter Optimization.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub
Abstract

Hyperparameter optimization is crucial for obtaining peak performance of machine learning models. The standard protocol evaluates various hyperparameter configurations using a resampling estimate of the generalization error to guide optimization and select a final hyperparameter configuration. Without much evidence, paired resampling splits, i.e., either a fixed train-validation split or a fixed cross-validation scheme, are often recommended. We show that, surprisingly, reshuffling the splits for every configuration often improves the final model’s generalization performance on unseen data. Our theoretical analysis explains how reshuffling affects the asymptotic behavior of the validation loss surface and provides a bound on the expected regret in the limiting regime. This bound connects the potential benefits of reshuffling to the signal and noise characteristics of the underlying optimization problem. We confirm our theoretical results in a controlled simulation study and demonstrate the practical usefulness of reshuffling in a large-scale, realistic hyperparameter optimization experiment. While reshuffling leads to test performances that are competitive with using fixed splits, it drastically improves results for a single train-validation holdout protocol and can often make holdout become competitive with standard CV while being computationally cheaper.
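The difference between the two protocols compared in this abstract comes down to how validation indices are assigned to configurations. The sketch below generates holdout splits either once (fixed, shared by all configurations) or afresh for every configuration (reshuffled). It only shows the index bookkeeping; the function names, the 20% validation fraction, and the seed handling are illustrative assumptions rather than the authors' code.

```python
import random

def holdout_split(n_samples, valid_fraction, rng):
    """Return (train_indices, valid_indices) for one random holdout split."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_valid = max(1, int(round(n_samples * valid_fraction)))
    return indices[n_valid:], indices[:n_valid]

def splits_for_configs(n_samples, n_configs, valid_fraction=0.2,
                       reshuffle=False, seed=0):
    """One split per hyperparameter configuration: shared if fixed, fresh if reshuffled."""
    rng = random.Random(seed)
    fixed = holdout_split(n_samples, valid_fraction, rng)
    if not reshuffle:
        return [fixed for _ in range(n_configs)]   # every configuration sees the same split
    return [holdout_split(n_samples, valid_fraction, rng) for _ in range(n_configs)]

fixed_splits = splits_for_configs(100, 5, reshuffle=False)
reshuffled_splits = splits_for_configs(100, 5, reshuffle=True)
print("distinct validation sets (fixed):",
      len({tuple(sorted(valid)) for _, valid in fixed_splits}))
print("distinct validation sets (reshuffled):",
      len({tuple(sorted(valid)) for _, valid in reshuffled_splits}))
```

In a real hyperparameter optimization loop, each configuration would be trained on its train indices and scored on its validation indices; only the choice of `reshuffle` changes between the two protocols.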

MCML Authors

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

Lennart Schneider

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science


[343]
D. Rügamer, B. X. W. Liew, Z. Altai and A. Stöcker.
A Functional Extension of Semi-Structured Networks.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Semi-structured networks (SSNs) merge the structures familiar from additive models with deep neural networks, allowing the modeling of interpretable partial feature effects while capturing higher-order non-linearities at the same time. A significant challenge in this integration is maintaining the interpretability of the additive model component. Inspired by large-scale biomechanics datasets, this paper explores extending SSNs to functional data. Existing methods in functional data analysis are promising but often not expressive enough to account for all interactions and non-linearities and do not scale well to large datasets. Although the SSN approach presents a compelling potential solution, its adaptation to functional data remains complex. In this work, we propose a functional SSN method that retains the advantageous properties of classical functional regression approaches while also improving scalability. Our numerical experiments demonstrate that this approach accurately recovers underlying signals, enhances predictive performance, and performs favorably compared to competing methods.

MCML Authors

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[342]
Y. Zhang, Y. Li, X. Wang, Q. Shen, B. Plank, B. Bischl, M. Rezaei and K. Kawaguchi.
FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models.
NeurIPS 2024 - Workshop on Machine Learning and Compression at the 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Overparametrized transformer networks are the state-of-the-art architecture for Large Language Models (LLMs). However, such models contain billions of parameters, making large compute a necessity, while raising environmental concerns. To address these issues, we propose FinerCut, a new form of fine-grained layer pruning, which in contrast to prior work at the transformer block level, considers all self-attention and feed-forward network (FFN) layers within blocks as individual pruning candidates. FinerCut prunes layers whose removal causes minimal alteration to the model’s output – contributing to a new, lean, interpretable, and task-agnostic pruning method. Tested across 9 benchmarks, our approach retains 90% performance of Llama3-8B with 25% layers removed, and 95% performance of Llama3-70B with 30% layers removed, all without fine-tuning or post-pruning reconstruction. Strikingly, we observe intriguing results with FinerCut: 42% (34 out of 80) of the self-attention layers in Llama3-70B can be removed while preserving 99% of its performance – without additional fine-tuning after removal. Moreover, FinerCut provides a tool to inspect the types and locations of pruned layers, making it possible to observe interesting pruning behaviors. For instance, we observe a preference for pruning self-attention layers, often at deeper consecutive decoder layers. We hope our insights inspire future efficient LLM architecture designs.
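
The pruning criterion in this abstract (remove the layers whose removal changes the model output least) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the "layers" are scalar functions with residual connections, the output change is measured as an absolute difference, and `run` and `prune` are hypothetical helpers, not the authors' implementation.

```python
def run(layers, x, skip=frozenset()):
    """Apply residual layers in order, skipping the indices in `skip`."""
    for i, layer in enumerate(layers):
        if i not in skip:
            x = x + layer(x)  # residual connection, so a layer can be left out
    return x

def prune(layers, inputs, n_remove):
    """Greedily remove the layers whose removal shifts the output least."""
    removed = set()
    reference = [run(layers, x) for x in inputs]
    for _ in range(n_remove):
        def shift(i):
            outs = [run(layers, x, removed | {i}) for x in inputs]
            return sum(abs(o - r) for o, r in zip(outs, reference))
        candidates = [i for i in range(len(layers)) if i not in removed]
        removed.add(min(candidates, key=shift))
    return removed

# Toy model: layers 1 and 3 contribute almost nothing to the output.
layers = [lambda x: 0.5 * x, lambda x: 0.001 * x, lambda x: 0.3 * x, lambda x: 0.0]
removed = prune(layers, inputs=[1.0, 2.0], n_remove=2)
```

In this toy setting the two near-identity layers are the ones selected for removal.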

MCML Authors
Link to website

Yawei Li

Statistical Learning and Data Science

Link to website

Xinpeng Wang

AI and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to website

Mina Rezaei

Dr.

Statistical Learning and Data Science


[341]
M. Koshil, T. Nagler, M. Feurer and K. Eggensperger.
Towards Localization via Data Embedding for TabPFN.
TLR @NeurIPS 2024 - 3rd Table Representation Learning Workshop at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Vancouver, Canada, Dec 10-15, 2024. URL
Abstract

Prior-data fitted networks (PFNs), especially TabPFN, have shown significant promise in tabular data prediction. However, their scalability is limited by the quadratic complexity of the transformer architecture’s attention across training points. In this work, we propose a method to localize TabPFN, which embeds data points into a learned representation and performs nearest neighbor selection in this space. We evaluate it across six datasets, demonstrating its superior performance over standard TabPFN when scaling to larger datasets. We also explore its design choices and analyze the bias-variance trade-off of this localization method, showing that it reduces bias while maintaining manageable variance. This work opens up a pathway for scaling TabPFN to arbitrarily large tabular datasets.
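
As a rough illustration of the localization idea in this abstract, the sketch below picks the k training points nearest to a query in an embedding space, so that only those would be passed to the predictor as context. The `embed` function is a fixed placeholder (the paper learns its representation), and `select_local_context` is a hypothetical helper that is not part of TabPFN.

```python
import math

def embed(x):
    # Placeholder embedding (identity); the paper uses a learned representation.
    return list(x)

def select_local_context(query, train_X, train_y, k):
    """Return the k training points closest to the query in embedding space."""
    q = embed(query)
    dists = [(math.dist(q, embed(x)), i) for i, x in enumerate(train_X)]
    nearest = [i for _, i in sorted(dists)[:k]]
    return [train_X[i] for i in nearest], [train_y[i] for i in nearest]

train_X = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (0.2, 0.1)]
train_y = [0, 0, 1, 1, 0]
ctx_X, ctx_y = select_local_context((5.0, 5.05), train_X, train_y, k=2)
```

Because the context size is fixed at k, the cost of attention over the context no longer grows with the full training set.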

MCML Authors
Link to Profile Thomas Nagler

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

Link to Profile Matthias Feurer

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science


[340]
L. Bothmann and K. Peters.
Fairness von KI – ein Brückenschlag zwischen Philosophie und Maschinellem Lernen.
Grenzen Künstlicher Intelligenz (Dec. 2024).
MCML Authors
Link to website

Ludwig Bothmann

Dr.

Statistical Learning and Data Science


[339]
J. Herbinger, M. N. Wright, T. Nagler, B. Bischl and G. Casalicchio.
Decomposing Global Feature Effects Based on Feature Interactions.
Journal of Machine Learning Research 25.381 (Dec. 2024). URL
Abstract

Global feature effect methods, such as partial dependence plots, provide an intelligible visualization of the expected marginal feature effect. However, such global feature effect methods can be misleading, as they do not represent local feature effects of single observations well when feature interactions are present. We formally introduce generalized additive decomposition of global effects (GADGET), which is a new framework based on recursive partitioning to find interpretable regions in the feature space such that the interaction-related heterogeneity of local feature effects is minimized. We provide a mathematical foundation of the framework and show that it is applicable to the most popular methods to visualize marginal feature effects, namely partial dependence, accumulated local effects, and Shapley additive explanations (SHAP) dependence. Furthermore, we introduce and validate a new permutation-based interaction detection procedure that is applicable to any feature effect method that fits into our proposed framework. We empirically evaluate the theoretical characteristics of the proposed methods based on various feature effect methods in different experimental settings. Moreover, we apply our introduced methodology to three real-world examples to showcase their usefulness.
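
The problem this abstract describes, a global effect curve hiding heterogeneous local effects under interactions, can be seen in a small sketch. The toy model, the data values, and the hand-picked split on the sign of the second feature are all assumptions for illustration; GADGET finds such regions by recursive partitioning, which is not reproduced here.

```python
def model(x1, x2):
    # Toy model with a pure interaction between the two features.
    return x1 * x2

data_x2 = [-1.0, -0.5, 0.5, 1.0]  # observed values of the second feature
grid_x1 = [0.0, 1.0, 2.0]         # grid over the feature of interest

# ICE curves: one curve per observation, varying x1 and holding x2 fixed.
ice = [[model(g, x2) for g in grid_x1] for x2 in data_x2]

# Partial dependence: pointwise average of the ICE curves.
pd_curve = [sum(curve[j] for curve in ice) / len(ice) for j in range(len(grid_x1))]

# Regional partial dependence after splitting on the sign of x2.
neg = [x2 for x2 in data_x2 if x2 < 0]
pos = [x2 for x2 in data_x2 if x2 > 0]
pd_neg = [sum(model(g, x2) for x2 in neg) / len(neg) for g in grid_x1]
pd_pos = [sum(model(g, x2) for x2 in pos) / len(pos) for g in grid_x1]
```

The global curve is flat because the individual curves cancel out, while the two regional curves show opposite slopes.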

MCML Authors
Link to Profile Thomas Nagler

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to website

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science


[338]
F. Fumagalli, M. Muschalik, E. Hüllermeier, B. Hammer and J. Herbinger.
Unifying Feature-Based Explanations with Functional ANOVA and Cooperative Game Theory.
Preprint (Dec. 2024). arXiv
Abstract

Feature-based explanations, using perturbations or gradients, are a prevalent tool to understand decisions of black box machine learning models. Yet, differences between these methods still remain mostly unknown, which limits their applicability for practitioners. In this work, we introduce a unified framework for local and global feature-based explanations using two well-established concepts: functional ANOVA (fANOVA) from statistics, and the notion of value and interaction from cooperative game theory. We introduce three fANOVA decompositions that determine the influence of feature distributions, and use game-theoretic measures, such as the Shapley value and interactions, to specify the influence of higher-order interactions. Our framework combines these two dimensions to uncover similarities and differences between a wide range of explanation techniques for features and groups of features. We then empirically showcase the usefulness of our framework on synthetic and real-world datasets.

MCML Authors
Link to website

Maximilian Muschalik

Artificial Intelligence and Machine Learning

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[337]
C. Sauer, A.-L. Boulesteix, L. Hanßum, F. Hodiamont, C. Bausewein and T. Ullmann.
Beyond algorithm hyperparameters: on preprocessing hyperparameters and associated pitfalls in machine learning applications.
Preprint (Dec. 2024). arXiv
Abstract

Adequately generating and evaluating prediction models based on supervised machine learning (ML) is often challenging, especially for less experienced users in applied research areas. Special attention is required in settings where the model generation process involves hyperparameter tuning, i.e. data-driven optimization of different types of hyperparameters to improve the predictive performance of the resulting model. Discussions about tuning typically focus on the hyperparameters of the ML algorithm (e.g., the minimum number of observations in each terminal node for a tree-based algorithm). In this context, it is often neglected that hyperparameters also exist for the preprocessing steps that are applied to the data before it is provided to the algorithm (e.g., how to handle missing feature values in the data). As a consequence, users experimenting with different preprocessing options to improve model performance may be unaware that this constitutes a form of hyperparameter tuning - albeit informal and unsystematic - and thus may fail to report or account for this optimization. To illuminate this issue, this paper reviews and empirically illustrates different procedures for generating and evaluating prediction models, explicitly addressing the different ways algorithm and preprocessing hyperparameters are typically handled by applied ML users. By highlighting potential pitfalls, especially those that may lead to exaggerated performance claims, this review aims to further improve the quality of predictive modeling in ML applications.

MCML Authors
Link to website

Christina Sauer (née Nießl)

Biometry in Molecular Medicine

Link to Profile Anne-Laure Boulesteix

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine


[336]
E. Garces Arias, J. Rodemann, M. Li, C. Heumann and M. Aßenmacher.
Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Decoding from the output distributions of large language models to produce high-quality text is a complex challenge in language modeling. Various approaches, such as beam search, sampling with temperature, k-sampling, nucleus p-sampling, typical decoding, contrastive decoding, and contrastive search, have been proposed to address this problem, aiming to improve coherence, diversity, as well as resemblance to human-generated text. In this study, we introduce adaptive contrastive search, a novel decoding strategy extending contrastive search by incorporating an adaptive degeneration penalty, guided by the estimated uncertainty of the model at each generation step. This strategy is designed to enhance both the creativity and diversity of the language modeling process while at the same time producing coherent and high-quality generated text output. Our findings indicate performance enhancement in both aspects, across different model architectures and datasets, underscoring the effectiveness of our method in text generation tasks. Our code base, datasets, and models are publicly available.
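
A single decoding step with an uncertainty-guided degeneration penalty might look like the sketch below. The scoring rule follows the usual contrastive search form (model confidence minus a weighted similarity penalty against previous token representations). The mapping from uncertainty to penalty weight used here, normalized entropy of the next-token distribution, is an assumption for illustration and is not taken from the paper; the function names are hypothetical.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def adaptive_contrastive_step(cand_probs, cand_vecs, prev_vecs, vocab_probs):
    """Pick a candidate index; returns (index, alpha)."""
    # Assumption: alpha is the entropy of the next-token distribution scaled to [0, 1].
    alpha = entropy(vocab_probs) / math.log(len(vocab_probs))
    scores = []
    for p, h in zip(cand_probs, cand_vecs):
        penalty = max(cosine(h, prev) for prev in prev_vecs)  # degeneration penalty
        scores.append((1 - alpha) * p - alpha * penalty)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, alpha

prev_vecs = [(1.0, 0.0)]                 # representation of the previous token
cand_vecs = [(1.0, 0.0), (0.0, 1.0)]     # candidate 0 repeats the context, candidate 1 is novel
uncertain = adaptive_contrastive_step([0.5, 0.4], cand_vecs, prev_vecs, [0.5, 0.4, 0.1])
confident = adaptive_contrastive_step([0.98, 0.01], cand_vecs, prev_vecs, [0.98, 0.01, 0.01])
```

With a flat distribution the penalty dominates and the novel candidate wins; with a peaked distribution the most probable candidate is kept.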

MCML Authors
Link to website

Esteban Garces Arias

Statistical Learning and Data Science

Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[335]
L. Zumeta-Olaskoaga, A. Bender and D.-J. Lee.
Flexible Modelling of Time-Varying Exposures and Recurrent Events to Analyse Training Load Effects in Team Sports Injuries.
Journal of the Royal Statistical Society. Series C (Applied Statistics), qlae059 (Nov. 2024). DOI
Abstract

We present a flexible modelling approach to analyse time-varying exposures and recurrent events in team sports injuries. The approach is based on the piece-wise exponential additive mixed model where the effects of past exposures (i.e. high-intensity training loads) may accumulate over time and present complex forms of association. In order to identify a relevant time window at which past exposures have an impact on the current risk, we propose a penalty approach. We conduct a simulation study to evaluate the performance of the proposed model, under different true weight functions and different levels of heterogeneity between recurrent events. Finally, we illustrate the approach with a case study application involving an elite male football team participating in the Spanish LaLiga competition. The cohort includes time-loss injuries and external training load variables tracked by Global Positioning System devices, during the seasons 2017–2018 and 2018–2019.

MCML Authors
Link to website

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[334]
Y. Li, Y. Zhang, K. Kawaguchi, A. Khakzar, B. Bischl and M. Rezaei.
A Dual-Perspective Approach to Evaluating Feature Attribution Methods.
Transactions on Machine Learning Research (Nov. 2024). URL
Abstract

Feature attribution methods attempt to explain neural network predictions by identifying relevant features. However, establishing a cohesive framework for assessing feature attribution remains a challenge. There are several views through which we can evaluate attributions. One principal lens is to observe the effect of perturbing attributed features on the model’s behavior (i.e., faithfulness). While providing useful insights, existing faithfulness evaluations suffer from shortcomings that we reveal in this paper. In this work, we propose two new perspectives within the faithfulness paradigm that reveal intuitive properties: soundness and completeness. Soundness assesses the degree to which attributed features are truly predictive features, while completeness examines how well the resulting attribution reveals all the predictive features. The two perspectives are based on a firm mathematical foundation and provide quantitative metrics that are computable through efficient algorithms. We apply these metrics to mainstream attribution methods, offering a novel lens through which to analyze and compare feature attribution methods.

MCML Authors
Link to website

Yawei Li

Statistical Learning and Data Science

Ashkan Khakzar

Dr.

* Former Member

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to website

Mina Rezaei

Dr.

Statistical Learning and Data Science


[333]
K. Flöge, M. A. Moeed and V. Fortuin.
Stein Variational Newton Neural Network Ensembles.
Preprint (Nov. 2024). arXiv
Abstract

Deep neural network ensembles are powerful tools for uncertainty quantification, which have recently been re-interpreted from a Bayesian perspective. However, current methods inadequately leverage second-order information of the loss landscape, despite the recent availability of efficient Hessian approximations. We propose a novel approximate Bayesian inference method that modifies deep ensembles to incorporate Stein Variational Newton updates. Our approach uniquely integrates scalable modern Hessian approximations, achieving faster convergence and more accurate posterior distribution approximations. We validate the effectiveness of our method on diverse regression and classification tasks, demonstrating superior performance with a significantly reduced number of training epochs compared to existing ensemble-based methods, while enhancing uncertainty quantification and robustness against overfitting.

MCML Authors
Link to Profile Vincent Fortuin

Vincent Fortuin

Dr.

Bayesian Deep Learning


[332]
K. Flöge, S. Udayakumar, J. Sommer, M. Piraud, S. Kesselheim, V. Fortuin, S. Günnemann, K. J. van der Weg, H. Gohlke, A. Bazarova and E. Merdivan.
OneProt: Towards Multi-Modal Protein Foundation Models.
Preprint (Nov. 2024). arXiv
Abstract

Recent AI advances have enabled multi-modal systems to model and translate diverse information spaces. Extending beyond text and vision, we introduce OneProt, a multi-modal AI for proteins that integrates structural, sequence, alignment, and binding site data. Using the ImageBind framework, OneProt aligns the latent spaces of modality encoders along protein sequences. It demonstrates strong performance in retrieval tasks and surpasses state-of-the-art methods in various downstream tasks, including metal ion binding classification, gene-ontology annotation, and enzyme function prediction. This work expands multi-modal capabilities in protein models, paving the way for applications in drug discovery, biocatalytic reaction planning, and protein engineering.

MCML Authors
Link to Profile Vincent Fortuin

Vincent Fortuin

Dr.

Bayesian Deep Learning


[331]
J. Gauss and T. Nagler.
Asymptotics for estimating a diverging number of parameters -- with and without sparsity.
Preprint (Nov. 2024). arXiv
Abstract

We consider high-dimensional estimation problems where the number of parameters diverges with the sample size. General conditions are established for consistency, uniqueness, and asymptotic normality in both unpenalized and penalized estimation settings. The conditions are weak and accommodate a broad class of estimation problems, including ones with non-convex and group structured penalties. The wide applicability of the results is illustrated through diverse examples, including generalized linear models, multi-sample inference, and stepwise estimation procedures.

MCML Authors
Link to Profile Thomas Nagler

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science


[330]
B. Kulynych, J. F. Gomez, G. Kaissis, F. du Pin Calmon and C. Troncoso.
Attack-Aware Noise Calibration for Differential Privacy.
Preprint (Nov. 2024). arXiv URL
Abstract

Differential privacy (DP) is a widely used approach for mitigating privacy risks when training machine learning models on sensitive data. DP mechanisms add noise during training to limit the risk of information leakage. The scale of the added noise is critical, as it determines the trade-off between privacy and utility. The standard practice is to select the noise scale to satisfy a given privacy budget ε. This privacy budget is in turn interpreted in terms of operational attack risks, such as accuracy, sensitivity, and specificity of inference attacks aimed to recover information about the training data records. We show that first calibrating the noise scale to a privacy budget ε, and then translating ε to attack risk leads to overly conservative risk assessments and unnecessarily low utility. Instead, we propose methods to directly calibrate the noise scale to a desired attack risk level, bypassing the step of choosing ε. For a given notion of attack risk, our approach significantly decreases noise scale, leading to increased utility at the same level of privacy. We empirically demonstrate that calibrating noise to attack sensitivity/specificity, rather than ε, when training privacy-preserving ML models substantially improves model accuracy for the same risk level. Our work provides a principled and practical way to improve the utility of privacy-preserving ML without compromising on privacy.
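
The two-step "standard practice" that this abstract argues against can be sketched as follows: first calibrate the noise to a privacy budget, then translate that budget into an attack-risk bound. The sketch uses the classical Gaussian-mechanism formula and the generic (ε, δ)-DP bound on an attack's true-positive rate; it does not implement the paper's direct calibration, and the function names and parameter values are chosen here for illustration.

```python
import math

def gaussian_sigma(epsilon, delta, sensitivity=1.0):
    # Classical Gaussian-mechanism calibration, valid for 0 < epsilon < 1.
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

def max_attack_tpr(epsilon, delta, fpr):
    # (epsilon, delta)-DP bounds any attack's true-positive rate at a given
    # false-positive rate: TPR <= exp(epsilon) * FPR + delta.
    return min(1.0, math.exp(epsilon) * fpr + delta)

# Step 1: choose a budget and derive the noise scale.
sigma = gaussian_sigma(epsilon=0.5, delta=1e-5)
# Step 2: interpret the same budget as an attack-risk bound.
tpr_bound = max_attack_tpr(epsilon=0.5, delta=1e-5, fpr=0.01)
```

Because the bound in step 2 holds for every mechanism with that budget, it can be loose for the specific mechanism used, which is the slack the abstract refers to.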

MCML Authors
Link to Profile Georgios Kaissis

Georgios Kaissis

Dr.

Artificial Intelligence in Healthcare and Medicine


[329]
J. Nam, I. Chalkidis and M. Rezaei.
Hyperbolic Contrastive Learning for Document Representations – A Multi-View Approach with Paragraph-Level Similarities.
ECAI 2024 - 27th European Conference on Artificial Intelligence. Santiago de Compostela, Spain, Oct 19-24, 2024. DOI
Abstract

Self-supervised learning (SSL) has gained prominence due to the increasing availability of unlabeled data and advances in computational efficiency, leading to revolutionized natural language processing with pre-trained language models like BERT and GPT. Representation learning, a core concept in SSL, aims to reduce data dimensionality while preserving meaningful aspects. Conventional SSL methods typically embed data in Euclidean space. However, recent research has revealed that alternative geometries can hold even richer representations, unlocking more meaningful insights from the data. Motivated by this, we propose two novel methods for integrating Hilbert geometry into self-supervised learning for efficient document embedding. First, we present a method directly incorporating Hilbert geometry into the standard Euclidean contrastive learning framework. Additionally, we propose a multi-view hyperbolic contrastive learning framework contrasting both documents and paragraphs. Our findings demonstrate that contrasting only paragraphs, rather than entire documents, can lead to superior efficiency and effectiveness.

MCML Authors
Link to website

Mina Rezaei

Dr.

Statistical Learning and Data Science


[328]
M. Aßenmacher, L. Karrlein, P. Schiele and C. Heumann.
Introducing wwm-german-18k - Can LLMs Crack the Million? (Or Win at Least 500 Euros?).
ICNLSP 2024 - 7th International Conference on Natural Language and Speech Processing. Trento, Italy, Oct 19-20, 2024. URL
Abstract

Language-specific evaluation of large language models (LLMs) for multiple-choice question answering (MCQA) is an important means to test their abilities for a multitude of different dimensions. With a data set assembled from questions from the German variant of ‘Who Wants to Be a Millionaire?’ we evaluate a set of German models and ChatGPT concerning factual/commonsense knowledge, syntactic abilities, and logical reasoning, amongst others. We contribute this new MCQA data set, extracted from the show’s episodes and designed to evaluate the ability of models to answer this diverse range of questions. To ensure data quality, we describe our preprocessing, encompassing data cleaning, deduplication, and the creation of stratified splits. Furthermore, we fine-tune a set of German LLMs and prompt ChatGPT to provide baseline results. Our findings reveal that these models achieve (partly) satisfactory performance on questions of lower difficulty levels (≤ 1000 euros). As the difficulty increases, performance steadily declines, highlighting the challenging nature of the later stages of the game. We contribute to the ongoing efforts to advance the capabilities of LLMs in comprehending and answering questions by providing a valuable resource for German MCQA research as well as further insights into the limitations of current LLMs.

MCML Authors
Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[327]
D. Daum, R. Osuala, A. Riess, G. Kaissis, J. A. Schnabel and M. Di Folco.
On Differentially Private 3D Medical Image Synthesis with Controllable Latent Diffusion Models.
DGM4 @MICCAI 2024 - 4th International Workshop on Deep Generative Models at the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024). Marrakesh, Morocco, Oct 06-10, 2024. DOI GitHub
Abstract

Generally, the small size of public medical imaging datasets, coupled with stringent privacy concerns, hampers the advancement of data-hungry deep learning models in medical imaging. This study addresses these challenges for 3D cardiac MRI images in the short-axis view. We propose Latent Diffusion Models that generate synthetic images conditioned on medical attributes, while ensuring patient privacy through differentially private model training. To our knowledge, this is the first work to apply and quantify differential privacy in 3D medical image generation. We pre-train our models on public data and finetune them with differential privacy on the UK Biobank dataset. Our experiments reveal that pre-training significantly improves model performance, achieving a Fréchet Inception Distance (FID) of 26.77 at ϵ=10, compared to 92.52 for models without pre-training. Additionally, we explore the trade-off between privacy constraints and image quality, investigating how tighter privacy budgets affect output controllability and may lead to degraded performance. Our results demonstrate that proper consideration during training with differential privacy can substantially improve the quality of synthetic cardiac MRI images, but there are still notable challenges in achieving consistent medical realism.

MCML Authors
Link to Profile Georgios Kaissis

Georgios Kaissis

Dr.

Artificial Intelligence in Healthcare and Medicine

Link to Profile Julia Schnabel

Julia Schnabel

Prof. Dr.

Computational Imaging and AI in Medicine


[326]
A. Riess, A. Ziller, S. Kolek, D. Rückert, J. A. Schnabel and G. Kaissis.
Complex-Valued Federated Learning with Differential Privacy and MRI Applications.
DeCaF @MICCAI 2024 - 5th Workshop on Distributed, Collaborative and Federated Learning at the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024). Marrakesh, Morocco, Oct 06-10, 2024. DOI
Abstract

Federated learning enhanced with Differential Privacy (DP) is a powerful privacy-preserving strategy to protect individuals sharing their sensitive data for processing in fields such as medicine and healthcare. Many medical applications, for example magnetic resonance imaging (MRI), rely on complex-valued signal processing techniques for data acquisition and analysis. However, the appropriate application of DP to complex-valued data is still underexplored. To address this issue, from the theoretical side, we introduce the complex-valued Gaussian mechanism, whose behaviour we characterise in terms of f-DP, (ε, δ)-DP and Rényi-DP. Moreover, we generalise the fundamental algorithm DP stochastic gradient descent to complex-valued neural networks and present novel complex-valued neural network primitives compatible with DP. Experimentally, we showcase a proof-of-concept by training federated complex-valued neural networks with DP on a real-world task (MRI pulse sequence classification in k-space), yielding excellent utility and privacy. Our results highlight the relevance of combining federated learning with robust privacy-preserving techniques in the MRI context.

MCML Authors
Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine

Link to Profile Julia Schnabel

Julia Schnabel

Prof. Dr.

Computational Imaging and AI in Medicine

Link to Profile Georgios Kaissis

Georgios Kaissis

Dr.

Artificial Intelligence in Healthcare and Medicine


[325]
H. Funk, R. Ludwig, H. Kuechenhoff and T. Nagler.
Towards more realistic climate model outputs: A multivariate bias correction based on zero-inflated vine copulas.
Preprint (Oct. 2024). arXiv
Abstract

Climate model large ensembles are an essential research tool for analysing and quantifying natural climate variability and providing robust information for rare extreme events. The models' simulated representations of reality are susceptible to bias due to incomplete understanding of physical processes. This paper aims to correct the bias of five climate variables from the CRCM5 Large Ensemble over Central Europe at a 3-hourly temporal resolution. At this high temporal resolution, two variables, precipitation and radiation, exhibit a high share of zero inflation. We propose a novel bias-correction method, VBC (Vine copula bias correction), that models and transfers multivariate dependence structures for zero-inflated margins in the data from its error-prone model domain to a reference domain. VBC estimates the model and reference distribution using vine copulas and corrects the model distribution via (inverse) Rosenblatt transformation. To deal with the variables’ zero-inflated nature, we develop a new vine density decomposition that accommodates such variables and employs an adequately randomized version of the Rosenblatt transform. This novel approach allows for more accurate modelling of multivariate zero-inflated climate data. Compared with state-of-the-art correction methods, VBC is generally the best-performing correction and the most accurate method for correcting zero-inflated events.

MCML Authors
Link to website

Henri Funk

Statistical Consulting Unit (StaBLab)

Link to Profile Thomas Nagler

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science


[324]
E. Garces Arias, H. Blocher, J. Rodemann, M. Li, C. Heumann and M. Aßenmacher.
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework.
Preprint (Oct. 2024). arXiv
Abstract

Open-ended text generation has become a prominent task in natural language processing due to the rise of powerful (large) language models. However, evaluating the quality of these models and the employed decoding strategies remains challenging because of trade-offs among widely used metrics such as coherence, diversity, and perplexity. Decoding methods often excel in some metrics while underperforming in others, complicating the establishment of a clear ranking. In this paper, we present novel ranking strategies within this multicriteria framework. Specifically, we employ benchmarking approaches based on partial orderings and present a new summary metric designed to balance existing automatic indicators, providing a more holistic evaluation of text generation quality. Furthermore, we discuss the alignment of these approaches with human judgments. Our experiments demonstrate that the proposed methods offer a robust way to compare decoding strategies, exhibit similarities with human preferences, and serve as valuable tools in guiding model selection for open-ended text generation tasks. Finally, we suggest future directions for improving evaluation methodologies in text generation. Our codebase, datasets, and models are publicly available.
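
Ranking under a partial ordering, as mentioned in this abstract, can be illustrated with Pareto dominance: one decoding strategy ranks above another only if it is at least as good on every metric and strictly better on at least one. The scores below are made-up placeholder values, not results from the paper, and the paper's own benchmarking approach may differ from this minimal sketch.

```python
def dominates(a, b):
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

# Hypothetical (coherence, diversity) scores; higher is better for both.
scores = {
    "beam": (0.9, 0.2),
    "nucleus": (0.6, 0.8),
    "contrastive": (0.7, 0.9),
}

# Strategies that no other strategy dominates form the top of the partial order.
undominated = sorted(
    name for name in scores
    if not any(dominates(scores[other], scores[name]) for other in scores if other != name)
)
```

Two strategies can both remain undominated when each wins on a different metric, which is why a partial order rather than a single ranking results.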

MCML Authors
Link to website

Esteban Garces Arias

Statistical Learning and Data Science

Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[323]
K. Schwethelm, J. Kaiser, J. Kuntzer, M. Yigitsoy, D. Rueckert and G. Kaissis.
Differentially Private Active Learning: Balancing Effective Data Selection and Privacy.
Preprint (Oct. 2024). arXiv
Abstract

Active learning (AL) is a widely used technique for optimizing data labeling in machine learning by iteratively selecting, labeling, and training on the most informative data. However, its integration with formal privacy-preserving methods, particularly differential privacy (DP), remains largely underexplored. While some works have explored differentially private AL for specialized scenarios like online learning, the fundamental challenge of combining AL with DP in standard learning settings has remained unaddressed, severely limiting AL’s applicability in privacy-sensitive domains. This work addresses this gap by introducing differentially private active learning (DP-AL) for standard learning settings. We demonstrate that naively integrating DP-SGD training into AL presents substantial challenges in privacy budget allocation and data utilization. To overcome these challenges, we propose step amplification, which leverages individual sampling probabilities in batch creation to maximize data point participation in training steps, thus optimizing data utilization. Additionally, we investigate the effectiveness of various acquisition functions for data selection under privacy constraints, revealing that many commonly used functions become impractical. Our experiments on vision and natural language processing tasks show that DP-AL can improve performance for specific datasets and model architectures. However, our findings also highlight the limitations of AL in privacy-constrained environments, emphasizing the trade-offs between privacy, model accuracy, and data selection accuracy.

MCML Authors
Link to Profile Georgios Kaissis

Georgios Kaissis

Dr.

Artificial Intelligence in Healthcare and Medicine


[322]
P. Müller, G. Kaissis and D. Rückert.
ChEX: Interactive Localization and Region Description in Chest X-rays.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub
Abstract

Report generation models offer fine-grained textual interpretations of medical images like chest X-rays, yet they often lack interactivity (i.e. the ability to steer the generation process through user queries) and localized interpretability (i.e. visually grounding their predictions), which we deem essential for future adoption in clinical practice. While there have been efforts to tackle these issues, they are either limited in their interactivity by not supporting textual queries or fail to also offer localized interpretability. Therefore, we propose a novel multitask architecture and training paradigm integrating textual prompts and bounding boxes for diverse aspects like anatomical regions and pathologies. We call this approach the Chest X-Ray Explainer (ChEX). Evaluations across a heterogeneous set of 9 chest X-ray tasks, including localized image interpretation and report generation, showcase its competitiveness with SOTA models while additional analysis demonstrates ChEX’s interactive capabilities.

MCML Authors
Link to Profile Georgios Kaissis

Georgios Kaissis

Dr.

Artificial Intelligence in Healthcare and Medicine

Link to Profile Daniel Rückert

Daniel Rückert

Prof. Dr.

Artificial Intelligence in Healthcare and Medicine


[321]
H. Baniecki, G. Casalicchio, B. Bischl and P. Biecek.
On the Robustness of Global Feature Effect Explanations.
ECML-PKDD 2024 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Vilnius, Lithuania, Sep 09-13, 2024. DOI
Abstract

We study the robustness of global post-hoc explanations for predictive models trained on tabular data. Effects of predictor features in black-box supervised learning are an essential diagnostic tool for model debugging and scientific discovery in applied sciences. However, how vulnerable they are to data and model perturbations remains an open research question. We introduce several theoretical bounds for evaluating the robustness of partial dependence plots and accumulated local effects. Our experimental results with synthetic and real-world datasets quantify the gap between the best and worst-case scenarios of (mis)interpreting machine learning predictions globally.
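As a reminder of what a partial dependence curve is, the following minimal sketch computes one by forcing a feature to each grid value and averaging the model's predictions. The toy model, data, and function names are assumptions made for illustration; the robustness bounds from the paper are not reproduced here.

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Average prediction when column `feature` is set to each grid value."""
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value          # intervene on one feature only
        pd_values.append(predict(X_mod).mean())
    return np.array(pd_values)

def toy_model(X):
    # Hypothetical black box: linear in feature 0, quadratic in feature 1.
    return 2.0 * X[:, 0] + X[:, 1] ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
grid = np.array([-1.0, 0.0, 1.0])
pd_curve = partial_dependence(toy_model, X, feature=0, grid=grid)
```

Because the curve is an average over the background data `X`, perturbing that data shifts the curve, which is the kind of sensitivity the abstract refers to.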

MCML Authors
Link to website

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[320]
F. Stermann, I. Chalkidis, A. Vahidi, B. Bischl and M. Rezaei.
Attention-Driven Dropout: A Simple Method to Improve Self-supervised Contrastive Sentence Embeddings.
ECML-PKDD 2024 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Vilnius, Lithuania, Sep 09-13, 2024. DOI
Abstract

Self-contrastive learning has proven effective for vision and natural language tasks. It aims to learn aligned data representations by encoding similar and dissimilar sentence pairs without human annotation. Therefore, data augmentation plays a crucial role in the learned embedding quality. However, in natural language processing (NLP), creating augmented samples for unsupervised contrastive learning is challenging since random editing may modify the semantic meanings of sentences and thus affect learning good representations. In this paper, we introduce a simple yet effective approach dubbed ADD (Attention-Driven Dropout) to generate better-augmented views of sentences to be used in self-contrastive learning. Given a sentence and a Pre-trained Transformer Language Model (PLM), such as RoBERTa, we use the aggregated attention scores of the PLM to remove the less “informative” tokens from the input. We consider two alternative algorithms based on NAIVEAGGREGATION across layers/heads and ATTENTIONROLLOUT [1]. Our approach significantly improves the overall performance of various self-supervised contrastive-based methods, including SIMCSE [14], DIFFCSE [10], and INFOCSE [33] by facilitating the generation of high-quality positive pairs required by these methods. Through empirical evaluations on multiple Semantic Textual Similarity (STS) and Transfer Learning tasks, we observe enhanced performance across the board.
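The idea of removing tokens with low aggregated attention can be sketched roughly as follows. This is an assumption-laden illustration: the aggregation used here (mean over layers and heads, then mean attention received per token) and the fixed drop ratio are guesses at one plausible reading of naive aggregation, not the authors' exact procedure, and the attention tensor is synthetic rather than taken from a PLM.

```python
import numpy as np

def naive_aggregation(attn):
    """attn has shape (layers, heads, seq, seq); returns one score per token."""
    mean_attn = attn.mean(axis=(0, 1))   # average over layers and heads
    return mean_attn.mean(axis=0)        # average attention each token receives

def drop_least_informative(tokens, scores, drop_ratio=0.2):
    """Remove the lowest-scoring tokens while keeping the original order."""
    n_drop = int(len(tokens) * drop_ratio)
    if n_drop == 0:
        return list(tokens)
    drop_idx = set(np.argsort(scores)[:n_drop].tolist())
    return [t for i, t in enumerate(tokens) if i not in drop_idx]

tokens = ["the", "cat", "sat", "on", "mat"]
row = np.array([0.05, 0.40, 0.30, 0.05, 0.20])   # synthetic attention weights
attn = np.tile(row, (2, 3, 5, 1))                # (layers=2, heads=3, seq=5, seq=5)
augmented = drop_least_informative(tokens, naive_aggregation(attn), drop_ratio=0.4)
```

The shortened sentence would then serve as a positive view of the original in the contrastive objective.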

MCML Authors
Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to website

Mina Rezaei

Dr.

Statistical Learning and Data Science


[319]
A. Vahidi, L. Wimmer, H. A. Gündüz, B. Bischl, E. Hüllermeier and M. Rezaei.
Diversified Ensemble of Independent Sub-Networks for Robust Self-Supervised Representation Learning.
ECML-PKDD 2024 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Vilnius, Lithuania, Sep 09-13, 2024. DOI
Abstract

Ensembling a neural network is a widely recognized approach to enhance model performance, estimate uncertainty, and improve robustness in deep supervised learning. However, deep ensembles often come with high computational costs and memory demands. In addition, the efficiency of a deep ensemble is related to diversity among the ensemble members, which is challenging for large, over-parameterized deep neural networks. Moreover, ensemble learning has not yet seen such widespread adoption for unsupervised learning and it remains a challenging endeavor for self-supervised or unsupervised representation learning. Motivated by these challenges, we present a novel self-supervised training regime that leverages an ensemble of independent sub-networks, complemented by a new loss function designed to encourage diversity. Our method efficiently builds a sub-model ensemble with high diversity, leading to well-calibrated estimates of model uncertainty, all achieved with minimal computational overhead compared to traditional deep self-supervised ensembles. To evaluate the effectiveness of our approach, we conducted extensive experiments across various tasks, including in-distribution generalization, out-of-distribution detection, dataset corruption, and semi-supervised settings. The results demonstrate that our method significantly improves prediction reliability. Our approach not only achieves excellent accuracy but also enhances calibration, improving on important baseline performance across a wide range of self-supervised architectures in computer vision, natural language processing, and genomics data.

MCML Authors
Link to website

Lisa Wimmer

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning

Link to website

Mina Rezaei

Dr.

Statistical Learning and Data Science


[318]
H. J. Coyle-Asbil, L. Burk, M. Brandes, B. Brandes, C. Buck, M. N. Wright and L. A. Vallis.
Energy Expenditure Prediction in Preschool Children: A Machine Learning Approach Using Accelerometry and External Validation.
Physiological Measurement 45.9 (Sep. 2024). DOI
Abstract

Objective. This study aimed to develop convolutional neural network (CNN) models to predict the energy expenditure (EE) of children from raw accelerometer data. Additionally, this study sought to externally validate the CNN models as well as the linear regression (LM), random forest (RF), and fully connected neural network (FcNN) models published in Steenbock et al (2019 J. Meas. Phys. Behav. 2 94–102). Approach. Included in this study were 41 German children (3.0–6.99 years) for the training and internal validation who were equipped with GENEActiv, GT3X+, and activPAL accelerometers. The external validation dataset consisted of 39 Canadian children (3.0–5.99 years) who were equipped with OPAL, GT9X, GENEActiv, and GT3X+ accelerometers. EE was recorded simultaneously in both datasets using a portable metabolic unit. The protocols consisted of semi-structured activities ranging from low to high intensities. The root mean square error (RMSE) values were calculated and used to evaluate model performances. Main results. (1) The CNNs outperformed the LM (13.17%–23.81% lower mean RMSE values), FcNN (8.13%–27.27% lower RMSE values) and the RF models (3.59%–18.84% lower RMSE values) in the internal dataset. (2) In contrast, it was found that when applied to the external Canadian dataset, the CNN models had consistently higher RMSE values compared to the LM, FcNN, and RF. Significance. Although CNNs can enhance EE prediction accuracy, their ability to generalize to new datasets and accelerometer brands/models is more limited compared to LM, RF, and FcNN models.

MCML Authors
Link to website

Lukas Burk

Statistical Learning and Data Science


[317]
H. Schulz-Kümpel, S. Fischer, T. Nagler, A.-L. Boulesteix, B. Bischl and R. Hornung.
Constructing Confidence Intervals for 'the' Generalization Error – a Comprehensive Benchmark Study.
Preprint (Sep. 2024). arXiv
Abstract

When assessing the quality of prediction models in machine learning, confidence intervals (CIs) for the generalization error, which measures predictive performance, are a crucial tool. Luckily, there exist many methods for computing such CIs and new promising approaches are continuously being proposed. Typically, these methods combine various resampling procedures, most popular among them cross-validation and bootstrapping, with different variance estimation techniques. Unfortunately, however, there is currently no consensus on when any of these combinations may be most reliably employed and how they generally compare. In this work, we conduct the first large-scale study comparing CIs for the generalization error - empirically evaluating 13 different methods on a total of 18 tabular regression and classification problems, using four different inducers and a total of eight loss functions. We give an overview of the methodological foundations and inherent challenges of constructing CIs for the generalization error and provide a concise review of all 13 methods in a unified framework. Finally, the CI methods are evaluated in terms of their relative coverage frequency, width, and runtime. Based on these findings, we are able to identify a subset of methods that we would recommend. We also publish the datasets as a benchmarking suite on OpenML and our code on GitHub to serve as a basis for further studies.
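To make the object of study concrete, the sketch below builds a simple normal-approximation confidence interval for the generalization error from per-observation losses on a single holdout set. It is offered only as a basic illustration under the assumption of independent test losses; it is not claimed to be one of the 13 benchmarked methods, and the function name is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def holdout_ci(losses, alpha=0.05):
    """Normal-approximation CI for the mean loss on an independent test set."""
    losses = np.asarray(losses, dtype=float)
    mean = losses.mean()
    # Standard error of the mean, using the sample standard deviation.
    se = losses.std(ddof=1) / np.sqrt(losses.size)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return mean - z * se, mean + z * se

# Toy 0/1 losses from a hypothetical classifier on four test points.
lower, upper = holdout_ci([0, 1, 0, 1])
```

Resampling-based procedures such as cross-validation reuse observations across folds, so their losses are dependent and this simple variance estimate no longer applies directly; that is part of what makes the comparison in the abstract non-trivial.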

MCML Authors
Link to website

Hannah Schulz-Kümpel

Biometry in Molecular Medicine

Link to website

Sebastian Fischer

Statistical Learning and Data Science

Link to Profile Thomas Nagler

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

Link to Profile Anne-Laure Boulesteix

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[316]
A. Stephan, D. Zhu, M. Aßenmacher, X. Shen and B. Roth.
From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks.
Preprint (Sep. 2024). arXiv
Abstract

To reduce the need for human annotations, large language models (LLMs) have been proposed as judges of the quality of other candidate models. LLM judges are typically evaluated by measuring the correlation with human judgments on generation tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that the judges used are mostly unable to improve task performance but are able to pick the better model. Our analysis uncovers a strong correlation between judgment performance and the candidate model task performance. We observe that judges tend to choose the model of higher quality even if its answer is incorrect. Further, we show that it is possible to use statistics, such as the task performances of the individual models, to predict judgment performance. In an ablation, we either swap or mask the candidate answers and observe that judges often keep the original judgment, providing evidence that judges incorporate writing style in their judgments. In summary, we find that regularities in the judgments are quantifiable using statistical measures and provide various angles on exploiting them.

MCML Authors
Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[315]
S. Urchs, V. Thurner, M. Aßenmacher, C. Heumann and S. Thiemichen.
Detecting Gender Discrimination on Actor Level Using Linguistic Discourse Analysis.
GeBNLP 2024 - 5th Workshop on Gender Bias in Natural Language Processing. Bangkok, Thailand, Aug 16, 2024. URL
Abstract

With the usage of tremendous amounts of text data for training powerful large language models such as ChatGPT, the issue of analysing and securing data quality has become more pressing than ever. Any biases, stereotypes and discriminatory patterns that exist in the training data can be reproduced, reinforced or broadly disseminated by the models in production. Therefore, it is crucial to carefully select and monitor the text data that is used as input to train the model. Due to the vast amount of training data, this process needs to be (at least partially) automated. In this work, we introduce a novel approach for automatically detecting gender discrimination in text data on the actor level based on linguistic discourse analysis. Specifically, we combine existing information extraction (IE) techniques to partly automate the qualitative research done in linguistic discourse analysis. We focus on two important steps: Identifying the respective person-named-entity (an actor) and all forms it is referred to (Nomination), and detecting the characteristics it is ascribed (Predication). As a proof of concept, we integrate these two steps into a pipeline for automated text analysis. The separate building blocks of the pipeline could be flexibly adapted, extended, and scaled for bigger datasets to accommodate a wide range of usage scenarios and specific ML tasks or help social scientists with analysis tasks. We showcase and evaluate our approach on several real and simulated exemplary texts.

MCML Authors
Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[314]
J. Pavlopoulos, V. Kougia, E. Garces Arias, P. Platanou, S. Shabalin, K. Liagkou, E. Papadatos, H. Essler, J.-B. Camps and F. Fischer.
Challenging Error Correction in Recognised Byzantine Greek.
ML4AL @ACL 2024 - 1st Workshop on Machine Learning for Ancient Languages at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

Automatic correction of errors in Handwritten Text Recognition (HTR) output poses persistent challenges yet to be fully resolved. In this study, we introduce a shared task aimed at addressing this challenge, which attracted 271 submissions, yielding only a handful of promising approaches. This paper presents the datasets, the most effective methods, and an experimental analysis in error-correcting HTRed manuscripts and papyri in Byzantine Greek, the language that followed Classical and preceded Modern Greek. By using recognised and transcribed data from seven centuries, the two best-performing methods are compared, one based on a neural encoder-decoder architecture and the other based on engineered linguistic rules. We show that the recognition error rate can be reduced by both, up to 2.5 points at the level of characters and up to 15 at the level of words, while also elucidating their respective strengths and weaknesses.
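Error rates at the character and word level are commonly computed as an edit distance divided by the reference length. The sketch below follows that common definition; whether the shared task used exactly this formula is an assumption, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(ref, hyp):
    """Edit distance normalised by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

cer = error_rate("kitten", "sitting")                # character level
wer = error_rate("a b c".split(), "a x c".split())   # word level
```

Passing strings gives a character error rate and passing token lists gives a word error rate, so the same routine covers both levels mentioned in the abstract.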

MCML Authors
Link to website

Esteban Garces Arias

Statistical Learning and Data Science


[313]
M. Aßenmacher, A. Stephan, L. Weissweiler, E. Çano, I. Ziegler, M. Härttrich, B. Bischl, B. Roth, C. Heumann and H. Schütze.
Collaborative Development of Modular Open Source Educational Resources for Natural Language Processing.
TeachingNLP @ACL 2024 - 6th Workshop on Teaching NLP at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. URL
Abstract

In this work, we present a collaboratively and continuously developed open-source educational resource (OSER) for teaching natural language processing at two different universities. We shed light on the principles we followed for the initial design of the course and the rationale for ongoing developments, followed by a reflection on the inter-university collaboration for designing and maintaining teaching material. When reflecting on the latter, we explicitly emphasize the considerations that need to be made when facing heterogeneous groups and when having to accommodate multiple examination regulations within one single course framework. Relying on the fundamental principles of OSER developments as defined by Bothmann et al. (2023) proved to be an important guideline during this process. The final part pertains to open-sourcing our teaching material, coping with the increasing speed of developments in the field, and integrating the course digitally, also addressing conflicting priorities and challenges we are currently facing.

MCML Authors
Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science

Leonie Weissweiler

Dr.

* Former Member

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Computational Linguistics


[312]
J. G. Wiese, L. Wimmer, T. Papamarkou, B. Bischl, S. Günnemann and D. Rügamer.
Towards Efficient Posterior Sampling in Deep Neural Networks via Symmetry Removal (Extended Abstract).
IJCAI 2024 - 33rd International Joint Conference on Artificial Intelligence. Jeju, Korea, Aug 03-09, 2024. DOI
Abstract

Bayesian inference in deep neural networks is challenging due to the high-dimensional, strongly multi-modal parameter posterior density landscape. Markov chain Monte Carlo approaches asymptotically recover the true posterior but are considered prohibitively expensive for large modern architectures. Local methods, which have emerged as a popular alternative, focus on specific parameter regions that can be approximated by functions with tractable integrals. While these often yield satisfactory empirical results, they fail, by definition, to account for the multi-modality of the parameter posterior. In this work, we argue that the dilemma between exact-but-unaffordable and cheap-but-inexact approaches can be mitigated by exploiting symmetries in the posterior landscape. Such symmetries, induced by neuron interchangeability and certain activation functions, manifest in different parameter values leading to the same functional output value. We show theoretically that the posterior predictive density in Bayesian neural networks can be restricted to a symmetry-free parameter reference set. By further deriving an upper bound on the number of Monte Carlo chains required to capture the functional diversity, we propose a straightforward approach for feasible Bayesian inference. Our experiments suggest that efficient sampling is indeed possible, opening up a promising path to accurate uncertainty quantification in deep learning.
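The symmetries mentioned in the abstract are easy to verify numerically on a small network. The sketch below, which uses an arbitrary toy architecture chosen for illustration, shows that permuting hidden neurons, or flipping the signs around a tanh layer, changes the parameter vector but not the function computed.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """One-hidden-layer network with tanh activation."""
    return np.tanh(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)
x = rng.normal(size=(5, 3))

out_original = mlp(x, W1, b1, W2, b2)

# Neuron interchangeability: reorder the hidden units consistently in both layers.
perm = np.array([2, 0, 3, 1])
out_permuted = mlp(x, W1[:, perm], b1[perm], W2[perm, :], b2)

# Sign-flip symmetry: tanh is odd, so negating incoming and outgoing weights cancels.
out_flipped = mlp(x, -W1, -b1, -W2, b2)
```

All three parameter settings produce identical outputs, so they correspond to distinct posterior modes that carry no additional functional information.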

MCML Authors
Link to website

Lisa Wimmer

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to Profile Stephan Günnemann

Stephan Günnemann

Prof. Dr.

Data Analytics & Machine Learning

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[311]
L. Bothmann and K. Peters.
Fairness als Qualitätskriterium im Maschinellen Lernen – Rekonstruktion des philosophischen Konzepts und Implikationen für die Nutzung außergesetzlicher Merkmale bei qualifizierten Mietspiegeln.
AStA Wirtschafts- und Sozialstatistisches Archiv 18 (Aug. 2024). DOI
Abstract

With the increased use of machine learning (ML) models within automated decision-making systems, the demands on the quality of ML models are growing. Pure prediction quality is no longer the sole quality criterion; in particular, there is an increasing demand to consider fairness aspects. This paper pursues two goals. First, it summarizes the current fairness discussion in the field of ML (fairML) and describes the most recent developments, especially with respect to the philosophical foundations of the concept of fairness within ML. Second, it addresses the question of to what extent so-called ‘extra-legal’ characteristics may be used in the compilation of qualified rent indices. A recent proposal by Kauermann and Windmann (AStA Wirtschafts- und Sozialstatistisches Archiv, Volume 17, 2023) on using extra-legal features in qualified rent indices includes a model-based imputation method, which we contrast with the legal requirements. Finally, we show which alternatives from the field of fairML could be used and outline the different basic philosophical assumptions behind the various methods.

MCML Authors
Link to website

Ludwig Bothmann

Dr.

Statistical Learning and Data Science


[310]
D. Schalk, R. Rehms, V. S. Hoffmann, B. Bischl and U. Mansmann.
Distributed non-disclosive validation of predictive models by a modified ROC-GLM.
BMC Medical Research Methodology 24.190 (Aug. 2024). DOI
Abstract

Distributed statistical analyses provide a promising approach for privacy protection when analyzing data distributed over several databases. Instead of directly operating on data, the analyst receives anonymous summary statistics, which are combined into an aggregated result. Further, in discrimination model (prognosis, diagnosis, etc.) development, it is key to evaluate a trained model w.r.t. its prognostic or predictive performance on new independent data. For binary classification, quantifying discrimination uses the receiver operating characteristics (ROC) and its area under the curve (AUC) as aggregation measure. We are interested in calculating both, as well as basic indicators of calibration-in-the-large, for a binary classification task using a distributed and privacy-preserving approach…
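For reference, the AUC that the distributed procedure targets can be computed on pooled data as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The sketch below is this ordinary, non-distributed and non-private computation only; it does not implement the modified ROC-GLM, and the function name is an assumption.

```python
import numpy as np

def auc(scores, labels):
    """Pooled-data AUC via pairwise comparison of positives and negatives."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]   # all positive-negative score differences
    # Count correctly ordered pairs, with ties counted as one half.
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size)

example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

This version needs every individual score in one place, which is exactly what a non-disclosive distributed approach has to avoid.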

MCML Authors
Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[309]
F. Drost, E. Dorigatti, A. Straub, P. Hilgendorf, K. I. Wagner, K. Heyer, M. López Montes, B. Bischl, D. H. Busch, K. Schober and B. Schubert.
Predicting T cell receptor functionality against mutant epitopes.
Cell Genomics 4.9 (Aug. 2024). DOI
Abstract

Cancer cells and pathogens can evade T cell receptors (TCRs) via mutations in immunogenic epitopes. TCR cross-reactivity (i.e., recognition of multiple epitopes with sequence similarities) can counteract such escape but may cause severe side effects in cell-based immunotherapies through targeting self-antigens. To predict the effect of epitope point mutations on T cell functionality, we here present the random forest-based model Predicting T Cell Epitope-Specific Activation against Mutant Versions (P-TEAM). P-TEAM was trained and tested on three datasets with TCR responses to single-amino-acid mutations of the model epitope SIINFEKL, the tumor neo-epitope VPSVWRSSL, and the human cytomegalovirus antigen NLVPMVATV, totaling 9,690 unique TCR-epitope interactions. P-TEAM was able to accurately classify T cell reactivities and quantitatively predict T cell functionalities for unobserved single-point mutations and unseen TCRs. Overall, P-TEAM provides an effective computational tool to study T cell responses against mutated epitopes.

MCML Authors
Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[308]
A. Mittermeier, M. Aßenmacher, B. Schachtner, S. Grosu, V. Dakovic, V. Kandratovich, B. Sabel and M. Ingrisch.
Automatische ICD-10-Codierung.
Die Radiologie 64 (Aug. 2024). DOI
Abstract

Background: Medical coding of radiology reports is essential for good quality of care and correct billing, but it is also a laborious and error-prone task.
Objective: To evaluate the applicability of natural language processing (NLP) for ICD-10 coding of German-language radiology reports by fine-tuning suitable language models.
Materials and methods: This retrospective study considered all magnetic resonance imaging (MRI) reports from our institute between 2010 and 2020. The ICD-10 codes at discharge were matched to the respective reports to create a dataset for multiclass classification. GermanBERT and flanT5 were fine-tuned on the full dataset (dstotal) with 1035 distinct ICD-10 codes and on two reduced datasets containing the 100 (ds100) and 50 (ds50) most frequent codes. Model performance was evaluated with top-k accuracy for k = 1, 3, 5. In an ablation study, both models were trained once on the associated metadata alone and once on the report alone.
Results: The full dataset consisted of 100,672 radiology reports; the reduced datasets ds100 and ds50 comprised 68,103 and 52,293 reports, respectively. Model performance improved when several of the model's top predictions were taken into account, when the number of target classes was reduced, and when the metadata were combined with the report. FlanT5 outperformed GermanBERT on all datasets and metrics and is best suited as a medical coding assistant, achieving a top-3 accuracy of almost 70% on the realistic dataset dstotal.
Conclusion: Fine-tuning language models promises reliable prediction of ICD-10 codes for German radiological MRI reports in different scenarios. As a coding assistant, flanT5 can help medical coders make informed decisions and potentially reduce their workload.
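The top-k accuracy used for evaluation counts a prediction as correct when the true code is among the k highest-scoring classes. A minimal sketch with made-up scores for three classes, where the function name and data are assumptions for illustration:

```python
import numpy as np

def top_k_accuracy(scores, targets, k):
    """Share of rows whose true class is among the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]   # indices of the k largest scores
    hits = (top_k == np.asarray(targets)[:, None]).any(axis=1)
    return hits.mean()

scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.2, 0.3],
                   [0.2, 0.3, 0.5]])
targets = [1, 2, 0]
acc_at_1 = top_k_accuracy(scores, targets, k=1)
acc_at_2 = top_k_accuracy(scores, targets, k=2)
```

Accuracy can only increase with k, which matches the reported gain when several of the model's best predictions are considered.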

MCML Authors
Link to website

Andreas Mittermeier

Dr.

Clinical Data Science in Radiology

Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science

Link to website

Balthasar Schachtner

Dr.

Clinical Data Science in Radiology

Link to Profile Michael Ingrisch

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology


[307]
F. Ott, L. Heublein, D. Rügamer, B. Bischl and C. Mutschler.
Fusing structure from motion and simulation-augmented pose regression from optical flow for challenging indoor environments.
Journal of Visual Communication and Image Representation 103 (Aug. 2024). DOI
Abstract

The localization of objects is essential in many applications, such as robotics, virtual and augmented reality, and warehouse logistics. Recent advancements in deep learning have enabled localization using monocular cameras. Traditionally, structure from motion (SfM) techniques predict an object’s absolute position from a point cloud, while absolute pose regression (APR) methods use neural networks to understand the environment semantically. However, both approaches face challenges from environmental factors like motion blur, lighting changes, repetitive patterns, and featureless areas. This study addresses these challenges by incorporating additional information and refining absolute pose estimates with relative pose regression (RPR) methods. RPR also struggles with issues like motion blur. To overcome this, we compute the optical flow between consecutive images using the Lucas–Kanade algorithm and use a small recurrent convolutional network to predict relative poses. Combining absolute and relative poses is difficult due to differences between global and local coordinate systems. Current methods use pose graph optimization (PGO) to align these poses. In this work, we propose recurrent fusion networks to better integrate absolute and relative pose predictions, enhancing the accuracy of absolute pose estimates. We evaluate eight different recurrent units and create a simulation environment to pre-train the APR and RPR networks for improved generalization. Additionally, we record a large dataset of various scenarios in a challenging indoor environment resembling a warehouse with transportation robots. Through hyperparameter searches and experiments, we demonstrate that our recurrent fusion method outperforms PGO in effectiveness.

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[306]
E. Bergman, M. Feurer, A. Bahram, A. R. Balef, L. Purucker, S. Segel, M. Lindauer, F. Hutter and K. Eggensperger.
AMLTK: A Modular AutoML Toolkit in Python.
The Journal of Open Source Software 9.100 (Aug. 2024). DOI
Abstract

Machine Learning is a core building block in novel data-driven applications. Practitioners face many ambiguous design decisions while developing practical machine learning (ML) solutions. Automated machine learning (AutoML) facilitates the development of machine learning applications by providing efficient methods for optimizing hyperparameters, searching for neural architectures, or constructing whole ML pipelines (Hutter et al., 2019). Thereby, design decisions such as the choice of modelling, pre-processing, and training algorithm are crucial to obtaining well-performing solutions. By automatically obtaining ML solutions, AutoML aims to lower the barrier to leveraging machine learning and reduce the time needed to develop or adapt ML solutions for new domains or data.
Highly performant software packages for automatically building ML pipelines given data, so-called AutoML systems, are available and can be used off-the-shelf. Typically, AutoML systems evaluate ML models sequentially to return a well-performing single best model or multiple models combined into an ensemble. Existing AutoML systems are typically highly engineered monolithic software developed for specific use cases to perform well and robustly under various conditions…

MCML Authors
Link to Profile Matthias Feurer

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science


[305]
T. Boege, M. Drton, B. Hollering, S. Lumpp, P. Misra and D. Schkoda.
Conditional Independence in Stationary Diffusions.
Preprint (Aug. 2024). arXiv
Abstract

Stationary distributions of multivariate diffusion processes have recently been proposed as probabilistic models of causal systems in statistics and machine learning. Motivated by these developments, we study stationary multivariate diffusion processes with a sparsely structured drift. Our main result gives a characterization of the conditional independence relations that hold in a stationary distribution. The result draws on a graphical representation of the drift structure and pertains to conditional independence relations that hold generally as a consequence of the drift’s sparsity pattern.

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[304]
Y. Liang, O. Zadorozhnyi and M. Drton.
Kernel-Based Differentiable Learning of Non-Parametric Directed Acyclic Graphical Models.
Preprint (Aug. 2024). arXiv
Abstract

Causal discovery amounts to learning a directed acyclic graph (DAG) that encodes a causal model. This model selection problem can be challenging due to its large combinatorial search space, particularly when dealing with non-parametric causal models. Recent research has sought to bypass the combinatorial search by reformulating causal discovery as a continuous optimization problem, employing constraints that ensure the acyclicity of the graph. In non-parametric settings, existing approaches typically rely on finite-dimensional approximations of the relationships between nodes, resulting in a score-based continuous optimization problem with a smooth acyclicity constraint. In this work, we develop an alternative approximation method by utilizing reproducing kernel Hilbert spaces (RKHS) and applying general sparsity-inducing regularization terms based on partial derivatives. Within this framework, we introduce an extended RKHS representer theorem. To enforce acyclicity, we advocate the log-determinant formulation of the acyclicity constraint and show its stability. Finally, we assess the performance of our proposed RKHS-DAGMA procedure through simulations and illustrative data analyses.
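A log-determinant acyclicity function of the kind referred to above can be sketched for a plain weighted adjacency matrix. The formula below, h(W) = -log det(sI - W∘W) + d log s, is the DAGMA-style formulation as commonly stated; applying it to a raw weight matrix rather than to the RKHS-based dependence measures of the paper is a simplifying assumption, as are the function name and the sign check.

```python
import numpy as np

def logdet_acyclicity(W, s=1.0):
    """h(W) = -log det(s*I - W∘W) + d*log(s); zero when W encodes a DAG."""
    d = W.shape[0]
    M = s * np.eye(d) - W * W            # W * W is the elementwise square
    sign, logabsdet = np.linalg.slogdet(M)
    if sign <= 0:
        # Simplified stand-in for the domain condition on s*I - W∘W.
        raise ValueError("W lies outside the domain of the log-det function")
    return -logabsdet + d * np.log(s)

W_dag = np.array([[0.0, 0.8, 0.3],
                  [0.0, 0.0, 0.5],
                  [0.0, 0.0, 0.0]])      # strictly upper triangular, hence acyclic
W_cyc = np.array([[0.0, 0.5],
                  [0.5, 0.0]])           # two-node cycle

h_dag = logdet_acyclicity(W_dag)
h_cyc = logdet_acyclicity(W_cyc)
```

The function is smooth in W, so it can be used as a penalty or constraint inside a gradient-based optimizer in place of a combinatorial search over graphs.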

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[303]
D. Schkoda, E. Robeva and M. Drton.
Causal Discovery of Linear Non-Gaussian Causal Models with Unobserved Confounding.
Preprint (Aug. 2024). arXiv
Abstract

We consider linear non-Gaussian structural equation models that involve latent confounding. In this setting, the causal structure is identifiable, but, in general, it is not possible to identify the specific causal effects. Instead, a finite number of different causal effects result in the same observational distribution. Most existing algorithms for identifying these causal effects use overcomplete independent component analysis (ICA), which often suffers from convergence to local optima. Furthermore, the number of latent variables must be known a priori. To address these issues, we propose an algorithm that operates recursively rather than using overcomplete ICA. The algorithm first infers a source, estimates the effect of the source and its latent parents on their descendants, and then eliminates their influence from the data. For both source identification and effect size estimation, we use rank conditions on matrices formed from higher-order cumulants. We prove asymptotic correctness under the mild assumption that locally, the number of latent variables never exceeds the number of observed variables. Simulation studies demonstrate that our method achieves comparable performance to overcomplete ICA even though it does not know the number of latents in advance.

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[302]
D. Strieder and M. Drton.
Identifying Total Causal Effects in Linear Models under Partial Homoscedasticity.
Preprint (Aug. 2024). arXiv
Abstract

A fundamental challenge of scientific research is inferring causal relations based on observed data. One commonly used approach involves utilizing structural causal models that postulate noisy functional relations among interacting variables. A directed graph naturally represents these models and reflects the underlying causal structure. However, classical identifiability results suggest that, without conducting additional experiments, this causal graph can only be identified up to a Markov equivalence class of indistinguishable models. Recent research has shown that focusing on linear relations with equal error variances can enable the identification of the causal structure from mere observational data. Nonetheless, practitioners are often primarily interested in the effects of specific interventions, rendering the complete identification of the causal structure unnecessary. In this work, we investigate the extent to which less restrictive assumptions of partial homoscedasticity are sufficient for identifying the causal effects of interest. Furthermore, we construct mathematically rigorous confidence regions for total causal effects under structure uncertainty and explore the performance gain of relying on stricter error assumptions in a simulation study.

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[301]
K. Bouchiat, A. Immer, H. Yèche, G. Rätsch and V. Fortuin.
Improving Neural Additive Models with Bayesian Principles.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL
Abstract

Neural additive models (NAMs) enhance the transparency of deep neural networks by handling input features in separate additive sub-networks. However, they lack inherent mechanisms that provide calibrated uncertainties and enable selection of relevant features and interactions. Approaching NAMs from a Bayesian perspective, we augment them in three primary ways, namely by a) providing credible intervals for the individual additive sub-networks; b) estimating the marginal likelihood to perform an implicit selection of features via an empirical Bayes procedure; and c) facilitating the ranking of feature pairs as candidates for second-order interaction in fine-tuned models. In particular, we develop Laplace-approximated NAMs (LA-NAMs), which show improved empirical performance on tabular datasets and challenging real-world medical tasks.

MCML Authors
Link to Profile Vincent Fortuin

Vincent Fortuin

Dr.

Bayesian Deep Learning


[300]
M. Herrmann, F. J. D. Lange, K. Eggensperger, G. Casalicchio, M. Wever, M. Feurer, D. Rügamer, E. Hüllermeier, A.-L. Boulesteix and B. Bischl.
Position: Why We Must Rethink Empirical Research in Machine Learning.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL
Abstract

We warn against a common but incomplete understanding of empirical research in machine learning (ML) that leads to non-replicable results, makes findings unreliable, and threatens to undermine progress in the field. To overcome this alarming situation, we call for more awareness of the plurality of ways of gaining knowledge experimentally but also of some epistemic limitations. In particular, we argue most current empirical ML research is fashioned as confirmatory research while it should rather be considered exploratory.

MCML Authors
Link to Profile Moritz Herrmann

Moritz Herrmann

Dr.

Transfer Coordinator

Biometry in Molecular Medicine

Link to website

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Link to Profile Matthias Feurer

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning

Link to Profile Anne-Laure Boulesteix

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[299]
F. Karl, M. Kemeter, G. Dax and P. Sierak.
Position: Embracing Negative Results in Machine Learning.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL
Abstract

Publications proposing novel machine learning methods are often primarily rated by exhibited predictive performance on selected problems. In this position paper we argue that predictive performance alone is not a good indicator for the worth of a publication. Using it as such even fosters problems like inefficiencies of the machine learning research community as a whole and setting wrong incentives for researchers. We therefore put out a call for the publication of “negative” results, which can help alleviate some of these problems and improve the scientific output of the machine learning research community. To substantiate our position, we present the advantages of publishing negative results and provide concrete measures for the community to move towards a paradigm where their publication is normalized.

MCML Authors
Link to website

Florian Karl

Statistical Learning and Data Science


[298]
M. Lindauer, F. Karl, A. Klier, J. Moosbauer, A. Tornede, A. C. Mueller, F. Hutter, M. Feurer and B. Bischl.
Position: A Call to Action for a Human-Centered AutoML Paradigm.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL
Abstract

Automated machine learning (AutoML) was formed around the fundamental objectives of automatically and efficiently configuring machine learning (ML) workflows, aiding the research of new ML algorithms, and contributing to the democratization of ML by making it accessible to a broader audience. Over the past decade, commendable achievements in AutoML have primarily focused on optimizing predictive performance. This focused progress, while substantial, raises questions about how well AutoML has met its broader, original goals. In this position paper, we argue that a key to unlocking AutoML’s full potential lies in addressing the currently underexplored aspect of user interaction with AutoML systems, including their diverse roles, expectations, and expertise. We envision a more human-centered approach in future AutoML research, promoting the collaborative design of ML systems that tightly integrates the complementary strengths of human expertise and AutoML methodologies.

MCML Authors
Link to website

Florian Karl

Statistical Learning and Data Science

Link to Profile Matthias Feurer

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[297]
T. Papamarkou, M. Skoularidou, K. Palla, L. Aitchison, J. Arbel, D. Dunson, M. Filippone, V. Fortuin, P. Hennig, J. M. Hernández-Lobato, A. Hubin, A. Immer, T. Karaletsos, M. E. Khan, A. Kristiadi, Y. Li, S. Mandt, C. Nemeth, M. A. Osborne, T. G. J. Rudner, D. Rügamer, Y. W. Teh, M. Welling, A. G. Wilson and R. Zhang.
Position: Bayesian Deep Learning is Needed in the Age of Large-Scale AI.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL
Abstract

In the current landscape of deep learning research, there is a predominant emphasis on achieving high predictive accuracy in supervised tasks involving large image and language datasets. However, a broader perspective reveals a multitude of overlooked metrics, tasks, and data types, such as uncertainty, active and continual learning, and scientific data, that demand attention. Bayesian deep learning (BDL) constitutes a promising avenue, offering advantages across these diverse settings. This paper posits that BDL can elevate the capabilities of deep learning. It revisits the strengths of BDL, acknowledges existing challenges, and highlights some exciting research avenues aimed at addressing these obstacles. Looking ahead, the discussion focuses on possible ways to combine large-scale foundation models with BDL to unlock their full potential.

MCML Authors
Link to Profile Vincent Fortuin

Vincent Fortuin

Dr.

Bayesian Deep Learning

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[296]
D. Rügamer, C. Kolb, T. Weber, L. Kook and T. Nagler.
Generalizing orthogonalization for models with non-linearities.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL
Abstract

The complexity of black-box algorithms can lead to various challenges, including the introduction of biases. These biases present immediate risks in the algorithms’ application. It was, for instance, shown that neural networks can deduce racial information solely from a patient’s X-ray scan, a task beyond the capability of medical experts. If this fact is not known to the medical expert, automatic decision-making based on this algorithm could lead to prescribing a treatment (purely) based on racial information. While current methodologies allow for the “orthogonalization” or “normalization” of neural networks with respect to such information, existing approaches are grounded in linear models. Our paper advances the discourse by introducing corrections for non-linearities such as ReLU activations. Our approach also encompasses scalar and tensor-valued predictions, facilitating its integration into neural network architectures. Through extensive experiments, we validate our method’s effectiveness in safeguarding sensitive data in generalized linear models, normalizing convolutional neural networks for metadata, and rectifying pre-existing embeddings for undesired attributes.

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Link to website

Chris Kolb

Statistical Learning and Data Science

Link to Profile Thomas Nagler

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science


[295]
E. Sommer, L. Wimmer, T. Papamarkou, L. Bothmann, B. Bischl and D. Rügamer.
Connecting the Dots: Is Mode Connectedness the Key to Feasible Sample-Based Inference in Bayesian Neural Networks?
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL
Abstract

A major challenge in sample-based inference (SBI) for Bayesian neural networks is the size and structure of the networks’ parameter space. Our work shows that successful SBI is possible by embracing the characteristic relationship between weight and function space, uncovering a systematic link between overparameterization and the difficulty of the sampling problem. Through extensive experiments, we establish practical guidelines for sampling and convergence diagnosis. As a result, we present a Bayesian deep ensemble approach as an effective solution with competitive performance and uncertainty quantification.

MCML Authors
Link to website

Emanuel Sommer

Statistics, Data Science and Machine Learning

Link to website

Lisa Wimmer

Statistical Learning and Data Science

Link to website

Ludwig Bothmann

Dr.

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[294]
D. Tramontano, Y. Kivva, S. Salehkaleybar, M. Drton and N. Kiyavash.
Causal Effect Identification in LiNGAM Models with Latent Confounders.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL
Abstract

We show that utilizing attribution maps for training neural networks can improve regularization of models and thus increase performance. Regularization is key in deep learning, especially when training complex models on relatively small datasets. In order to understand the inner workings of neural networks, attribution methods such as Layer-wise Relevance Propagation (LRP) have been extensively studied, particularly for interpreting the relevance of input features. We introduce Challenger, a module that leverages the explainable power of attribution maps in order to manipulate particularly relevant input patterns, thereby exposing and subsequently resolving regions of ambiguity towards separating classes on the ground-truth data manifold, an issue that arises particularly when training models on rather small datasets. Our Challenger module increases model performance through building more diverse filters within the network and can be applied to any input data domain. We demonstrate that our approach results in substantially better classification as well as calibration performance on datasets with only a few samples up to datasets with thousands of samples. In particular, we show that our generic domain-independent approach yields state-of-the-art results in vision, natural language processing and on time series tasks.

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[293]
S. Dandl, K. Blesch, T. Freiesleben, G. König, J. Kapar, B. Bischl and M. N. Wright.
CountARFactuals – Generating plausible model-agnostic counterfactual explanations with adversarial random forests.
xAI 2024 - 2nd World Conference on Explainable Artificial Intelligence. Valletta, Malta, Jul 17-19, 2024. DOI
Abstract

Counterfactual explanations elucidate algorithmic decisions by pointing to scenarios that would have led to an alternative, desired outcome. Giving insight into the model’s behavior, they hint users towards possible actions and give grounds for contesting decisions. As a crucial factor in achieving these goals, counterfactuals must be plausible, i.e., describing realistic alternative scenarios within the data manifold. This paper leverages a recently developed generative modeling technique – adversarial random forests (ARFs) – to efficiently generate plausible counterfactuals in a model-agnostic way. ARFs can serve as a plausibility measure or directly generate counterfactual explanations. Our ARF-based approach surpasses the limitations of existing methods that aim to generate plausible counterfactual explanations: It is easy to train and computationally highly efficient, handles continuous and categorical data naturally, and allows integrating additional desiderata such as sparsity in a straightforward manner.

MCML Authors
Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[292]
F. K. Ewald, L. Bothmann, M. N. Wright, B. Bischl, G. Casalicchio and G. König.
A Guide to Feature Importance Methods for Scientific Inference.
xAI 2024 - 2nd World Conference on Explainable Artificial Intelligence. Valletta, Malta, Jul 17-19, 2024. DOI
Abstract

While machine learning (ML) models are increasingly used due to their high predictive power, their use in understanding the data-generating process (DGP) is limited. Understanding the DGP requires insights into feature-target associations, which many ML models cannot directly provide due to their opaque internal mechanisms. Feature importance (FI) methods provide useful insights into the DGP under certain conditions. Since the results of different FI methods have different interpretations, selecting the correct FI method for a concrete use case is crucial and still requires expert knowledge. This paper serves as a comprehensive guide to help understand the different interpretations of global FI methods. Through an extensive review of FI methods and providing new proofs regarding their interpretation, we facilitate a thorough understanding of these methods and formulate concrete recommendations for scientific inference. We conclude by discussing options for FI uncertainty estimation and point to directions for future research aiming at full statistical inference from black-box ML models.

MCML Authors
Link to website

Fiona Ewald

Statistical Learning and Data Science

Link to website

Ludwig Bothmann

Dr.

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to website

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science


[291]
D. Rundel, J. Kobialka, C. von Crailsheim, M. Feurer, T. Nagler and D. Rügamer.
Interpretable Machine Learning for TabPFN.
xAI 2024 - 2nd World Conference on Explainable Artificial Intelligence. Valletta, Malta, Jul 17-19, 2024. DOI GitHub
Abstract

The recently developed Prior-Data Fitted Networks (PFNs) have shown very promising results for applications in low-data regimes. The TabPFN model, a special case of PFNs for tabular data, is able to achieve state-of-the-art performance on a variety of classification tasks while producing posterior predictive distributions in mere seconds by in-context learning without the need for learning parameters or hyperparameter tuning. This makes TabPFN a very attractive option for a wide range of domain applications. However, a major drawback of the method is its lack of interpretability. Therefore, we propose several adaptations of popular interpretability methods that we specifically design for TabPFN. By taking advantage of the unique properties of the model, our adaptations allow for more efficient computations than existing implementations. In particular, we show how in-context learning facilitates the estimation of Shapley values by avoiding approximate retraining and enables the use of Leave-One-Covariate-Out (LOCO) even when working with large-scale Transformers. In addition, we demonstrate how data valuation methods can be used to address scalability challenges of TabPFN.

MCML Authors
Link to website

David Rundel

Statistical Learning and Data Science

Link to website

Julius Kobialka

Statistics, Data Science and Machine Learning

Link to Profile Matthias Feurer

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science

Link to Profile Thomas Nagler

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[290]
C. A. Scholbeck, H. Funk and G. Casalicchio.
Algorithm-Agnostic Feature Attributions for Clustering.
xAI 2024 - 2nd World Conference on Explainable Artificial Intelligence. Valletta, Malta, Jul 17-19, 2024. DOI
Abstract

Understanding how assignments of instances to clusters can be attributed to the features can be vital in many applications. However, research to provide such feature attributions has been limited. Clustering algorithms with built-in explanations are scarce. Common algorithm-agnostic approaches involve dimension reduction and subsequent visualization, which transforms the original features used to cluster the data; or training a supervised learning classifier on the found cluster labels, which adds additional and intractable complexity. We present FACT (feature attributions for clustering), an algorithm-agnostic framework that preserves the integrity of the data and does not introduce additional models. As the defining characteristic of FACT, we introduce a set of work stages: sampling, intervention, reassignment, and aggregation. Furthermore, we propose two novel FACT methods: SMART (scoring metric after permutation) measures changes in cluster assignments by custom scoring functions after permuting selected features; IDEA (isolated effect on assignment) indicates local and global changes in cluster assignments after making uniform changes to selected features.
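The permutation idea behind SMART can be sketched in a few lines: permute one feature, reassign the instances with the already-fitted clustering, and score how much the assignments changed. The sketch below assumes a nearest-centroid assignment rule and uses the fraction of changed labels as the scoring function; both choices, and all function names, are illustrative assumptions rather than the FACT reference implementation.

```python
import numpy as np

def assign(X, centroids):
    """Assign each row of X to its nearest centroid (squared Euclidean distance)."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def permutation_reassignment_score(X, centroids, feature, n_repeats=20, seed=0):
    """Average fraction of instances whose cluster changes after permuting one feature."""
    rng = np.random.default_rng(seed)
    base = assign(X, centroids)
    changed = []
    for _ in range(n_repeats):
        Xp = X.copy()
        Xp[:, feature] = rng.permutation(Xp[:, feature])   # intervention on one feature
        changed.append(np.mean(assign(Xp, centroids) != base))  # reassignment + score
    return float(np.mean(changed))                          # aggregation over repeats

# Hypothetical data: two clusters separated along feature 0; feature 1 is pure noise
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal([-5.0, 0.0], 1.0, size=(100, 2)),
    rng.normal([5.0, 0.0], 1.0, size=(100, 2)),
])
centroids = np.array([[-5.0, 0.0], [5.0, 0.0]])

score_f0 = permutation_reassignment_score(X, centroids, feature=0)
score_f1 = permutation_reassignment_score(X, centroids, feature=1)
print(score_f0, score_f1)
```

Because the two centroids differ only in feature 0, permuting feature 1 cannot change any assignment, while permuting feature 0 reshuffles a large share of them. No additional model is trained, which is the property the abstract emphasizes.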

MCML Authors
Link to website

Henri Funk

Statistical Consulting Unit (StaBLab)

Link to website

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science


[289]
S. Dandl, M. Becker, B. Bischl, G. Casalicchio and L. Bothmann.
mlr3summary: Concise and interpretable summaries for machine learning models.
xAI 2024 - Demo Track of the 2nd World Conference on Explainable Artificial Intelligence. Valletta, Malta, Jul 17-19, 2024. arXiv
Abstract

This work introduces a novel R package for concise, informative summaries of machine learning models. We take inspiration from the summary function for (generalized) linear models in R, but extend it in several directions: first, our summary function is model-agnostic and provides a unified summary output also for non-parametric machine learning models; second, the summary output is more extensive and customizable – it comprises information on the dataset, model performance, model complexity, model’s estimated feature importances, feature effects, and fairness metrics; third, models are evaluated based on resampling strategies for unbiased estimates of model performances, feature importances, etc. Overall, the clear, structured output should help to enhance and expedite the model selection process, making it a helpful tool for practitioners and researchers alike.

MCML Authors
Link to website

Marc Becker

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to website

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Link to website

Ludwig Bothmann

Dr.

Statistical Learning and Data Science


[288]
L. Kook, C. Kolb, P. Schiele, D. Dold, M. Arpogaus, C. Fritz, P. Baumann, P. Kopper, T. Pielok, E. Dorigatti and D. Rügamer.
How Inverse Conditional Flows Can Serve as a Substitute for Distributional Regression.
UAI 2024 - 40th Conference on Uncertainty in Artificial Intelligence. Barcelona, Spain, Jul 16-18, 2024. URL
Abstract

Neural network representations of simple models, such as linear regression, are being studied increasingly to better understand the underlying principles of deep learning algorithms. However, neural representations of distributional regression models, such as the Cox model, have received little attention so far. We close this gap by proposing a framework for distributional regression using inverse flow transformations (DRIFT), which includes neural representations of the aforementioned models. We empirically demonstrate that the neural representations of models in DRIFT can serve as a substitute for their classical statistical counterparts in several applications involving continuous, ordered, time-series, and survival outcomes. We confirm that models in DRIFT empirically match the performance of several statistical methods in terms of estimation of partial effects, prediction, and aleatoric uncertainty quantification. DRIFT covers both interpretable statistical models and flexible neural networks, opening up new avenues in both statistical modeling and deep learning.

MCML Authors
Link to website

Chris Kolb

Statistical Learning and Data Science

Link to website

Tobias Pielok

Statistical Learning and Data Science

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[287]
Y. Sale, P. Hofman, T. Löhr, L. Wimmer, T. Nagler and E. Hüllermeier.
Label-wise Aleatoric and Epistemic Uncertainty Quantification.
UAI 2024 - 40th Conference on Uncertainty in Artificial Intelligence. Barcelona, Spain, Jul 16-18, 2024. URL
Abstract

We present a novel approach to uncertainty quantification in classification tasks based on label-wise decomposition of uncertainty measures. This label-wise perspective allows uncertainty to be quantified at the individual class level, thereby improving cost-sensitive decision-making and helping understand the sources of uncertainty. Furthermore, it allows total, aleatoric, and epistemic uncertainty to be defined on the basis of non-categorical measures such as variance, going beyond common entropy-based measures. In particular, variance-based measures address some of the limitations associated with established methods that have recently been discussed in the literature. We show that our proposed measures adhere to a number of desirable properties. Through empirical evaluation on a variety of benchmark data sets – including applications in the medical domain where accurate uncertainty quantification is crucial – we establish the effectiveness of label-wise uncertainty quantification.
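One natural way to obtain a label-wise, variance-based split into total, aleatoric, and epistemic uncertainty is the law of total variance applied to each class indicator separately. The sketch below assumes access to class-probability vectors from several ensemble members (or posterior samples); the function name and the example numbers are hypothetical, and whether these formulas coincide exactly with the paper's definitions is an assumption.

```python
import numpy as np

def labelwise_variance_uncertainty(probs):
    """Per-class variance-based uncertainty from an ensemble of predictions.

    probs: array of shape (n_members, n_classes) with predicted class probabilities.
    Returns (total, aleatoric, epistemic), each of shape (n_classes,), where
      total     = Var of the class indicator under the averaged prediction,
      aleatoric = expected conditional variance across members,
      epistemic = variance of the members' probabilities (ddof=0).
    """
    probs = np.asarray(probs, dtype=float)
    mean = probs.mean(axis=0)
    total = mean * (1.0 - mean)
    aleatoric = (probs * (1.0 - probs)).mean(axis=0)
    epistemic = probs.var(axis=0)
    return total, aleatoric, epistemic

# Hypothetical ensemble of three members on a three-class problem
probs = np.array([[0.9, 0.1, 0.0],
                  [0.5, 0.4, 0.1],
                  [0.1, 0.8, 0.1]])

total, aleatoric, epistemic = labelwise_variance_uncertainty(probs)
print(total, aleatoric, epistemic)
```

By construction, total equals aleatoric plus epistemic for every class, so each label's uncertainty can be inspected on its own, for example when misclassifying one particular class is especially costly.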

MCML Authors
Link to website

Paul Hofman

Artificial Intelligence and Machine Learning

Link to website

Lisa Wimmer

Statistical Learning and Data Science

Link to Profile Thomas Nagler

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning


[286]
S. Dandl, M. Becker, B. Bischl, G. Casalicchio and L. Bothmann.
mlr3summary: Concise and interpretable summaries for machine learning models.
useR! 2024 - International R User Conference. Salzburg, Austria, Jul 08-11, 2024. arXiv GitHub
Abstract

This work introduces a novel R package for concise, informative summaries of machine learning models. We take inspiration from the summary function for (generalized) linear models in R, but extend it in several directions: first, our summary function is model-agnostic and provides a unified summary output also for non-parametric machine learning models; second, the summary output is more extensive and customizable – it comprises information on the dataset, model performance, model complexity, model’s estimated feature importances, feature effects, and fairness metrics; third, models are evaluated based on resampling strategies for unbiased estimates of model performances, feature importances, etc. Overall, the clear, structured output should help to enhance and expedite the model selection process, making it a helpful tool for practitioners and researchers alike.

MCML Authors
Link to website

Marc Becker

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to website

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Link to website

Ludwig Bothmann

Dr.

Statistical Learning and Data Science


[285]
S. Fischer and M. Binder.
mlr3torch - Deep Learning in R.
useR! 2024 - International R User Conference. Salzburg, Austria, Jul 08-11, 2024. GitHub
Abstract

mlr3torch is a deep learning framework for the mlr3 ecosystem built on top of torch. It allows users to easily build, train and evaluate deep learning models in a few lines of code, without needing to worry about low-level details. Off-the-shelf learners are readily available, but custom architectures can be defined by connecting PipeOpTorch operators in an mlr3pipelines::Graph.

MCML Authors
Link to website

Sebastian Fischer

Statistical Learning and Data Science

Link to website

Martin Binder

Statistical Learning and Data Science


[284]
B. Ronval, S. Nijssen and L. Bothmann.
Can generative AI-based data balancing mitigate unfairness issues in Machine Learning?
EWAF 2024 - 3rd European Workshop on Algorithmic Fairness. Mainz, Germany, Jul 01-03, 2024. URL
Abstract

Data imbalance in the protected attributes can lead to machine learning models that perform better on the majority than on the minority group, giving rise to unfairness issues. While baseline methods like undersampling or SMOTE can balance datasets, we investigate how methods of generative artificial intelligence compare concerning classical fairness metrics. Using generated fake data, we propose different balancing methods and investigate the behavior of classification models in thorough benchmark studies using German credit and Berkeley admission data. While our experiments suggest that such methods may improve fairness metrics, further investigations are necessary to derive clear practical recommendations.

MCML Authors
Link to website

Ludwig Bothmann

Dr.

Statistical Learning and Data Science


[283]
F. Karl, J. Thomas, J. Elstner, R. Gross and B. Bischl.
Automated Machine Learning.
Unlocking Artificial Intelligence (Jul. 2024). DOI
Abstract

In the past few years automated machine learning (AutoML) has gained a lot of traction in the data science and machine learning community. AutoML aims at reducing the partly repetitive work of data scientists and enabling domain experts to construct machine learning pipelines without extensive knowledge in data science. This chapter presents a comprehensive review of the current leading AutoML methods and sets AutoML in an industrial context. To this end we present the typical components of an AutoML system, give an overview of the state-of-the-art and highlight challenges to industrial application by presenting several important topics such as AutoML for time series data, AutoML in unsupervised settings, AutoML with multiple evaluation criteria, or interactive human-in-the-loop methods. Finally, the connection to Neural Architecture Search (NAS) is presented and a brief review with special emphasis on hardware-aware NAS is given.

MCML Authors
Link to website

Florian Karl

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[282]
D. Köhler, D. Rügamer and M. Schmid.
Achieving interpretable machine learning by functional decomposition of black-box models into explainable predictor effects.
Preprint (Jul. 2024). arXiv
Abstract

Machine learning (ML) has seen significant growth in both popularity and importance. The high prediction accuracy of ML models is often achieved through complex black-box architectures that are difficult to interpret. This interpretability problem has been hindering the use of ML in fields like medicine, ecology and insurance, where an understanding of the inner workings of the model is paramount to ensure user acceptance and fairness. The need for interpretable ML models has boosted research in the field of interpretable machine learning (IML). Here we propose a novel approach for the functional decomposition of black-box predictions, which is considered a core concept of IML. The idea of our method is to replace the prediction function by a surrogate model consisting of simpler subfunctions. Similar to additive regression models, these functions provide insights into the direction and strength of the main feature contributions and their interactions. Our method is based on a novel concept termed stacked orthogonality, which ensures that the main effects capture as much functional behavior as possible and do not contain information explained by higher-order interactions. Unlike earlier functional IML approaches, it is neither affected by extrapolation nor by hidden feature interactions. To compute the subfunctions, we propose an algorithm based on neural additive modeling and an efficient post-hoc orthogonalization procedure.

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[281]
F. Sergeev, P. Malsot, G. Rätsch and V. Fortuin.
Towards Dynamic Feature Acquisition on Medical Time Series by Maximizing Conditional Mutual Information.
Preprint (Jul. 2024). arXiv
Abstract

Knowing which features of a multivariate time series to measure and when is a key task in medicine, wearables, and robotics. Better acquisition policies can reduce costs while maintaining or even improving the performance of downstream predictors. Inspired by the maximization of conditional mutual information, we propose an approach to train acquirers end-to-end using only the downstream loss. We show that our method outperforms a random acquisition policy and matches a model with an unrestrained budget, but does not yet overtake a static acquisition strategy. We highlight the assumptions and outline avenues for future work.

MCML Authors
Link to Profile Vincent Fortuin

Vincent Fortuin

Dr.

Bayesian Deep Learning


[280]
B. Deiseroth, M. Meuer, N. Gritsch, C. Eichenberg, P. Schramowski, M. Aßenmacher and K. Kersting.
Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization.
NAACL 2024 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Mexico City, Mexico, Jun 16-21, 2024. DOI
Abstract

Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. However, their ever-increasing size has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach to assessing compressed LLMs, addressing the limitations of traditional perplexity or accuracy measures that fail to accurately reflect text generation quality. DTMs measure token divergences that allow deeper insights into the subtleties of model compression, in particular, when evaluating components’ impacts individually. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that 25% of all attention components can be pruned beyond 90% on the Llama-2 model family, still keeping SOTA performance. For quantization, FDTM suggests that more than 80% of parameters can be naively transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually—and that FDTM can identify those—while standard metrics result in deteriorated outcomes.
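
A minimal sketch of the idea behind a first-divergent-token measure, assuming it is read as the position at which the compressed model's generated tokens first differ from the original model's. The function name and the token ids are hypothetical, and the paper's exact definition and aggregation may differ.

```python
from typing import Sequence


def first_divergent_index(reference: Sequence[int], candidate: Sequence[int]) -> int:
    """Return the position of the first token at which two sequences differ.

    If one sequence is a prefix of the other, the length of the shorter one is
    returned, so identical sequences return their common length.
    """
    for position, (ref_token, cand_token) in enumerate(zip(reference, candidate)):
        if ref_token != cand_token:
            return position
    return min(len(reference), len(candidate))


# Hypothetical token ids generated for the same prompt by the original
# model and by a compressed (pruned or quantized) variant.
original_tokens = [12, 7, 7, 31, 5, 90]
compressed_tokens = [12, 7, 7, 44, 5, 90]

fdt = first_divergent_index(original_tokens, compressed_tokens)
```

A larger value means the compressed model follows the original for longer before the generated text drifts, which is a change that an averaged perplexity score can hide.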

MCML Authors
Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[279]
H. Chen, J. Büssing, D. Rügamer and E. Nie.
Leveraging (Sentence) Transformer Models with Contrastive Learning for Identifying Machine-Generated Text.
SemEval @NAACL 2024 - 18th International Workshop on Semantic Evaluation at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024. URL
Abstract

This paper outlines our approach to SemEval-2024 Task 8 (Subtask B), which focuses on discerning machine-generated text from human-written content, while also identifying the text sources, i.e., from which Large Language Model (LLM) the target text is generated. Our detection system is built upon Transformer-based techniques, leveraging various pre-trained language models (PLMs), including sentence transformer models. Additionally, we incorporate Contrastive Learning (CL) into the classifier to improve the detection capabilities and employ Data Augmentation methods. Ultimately, our system achieves a peak accuracy of 76.96% on the test set of the competition, configured using a sentence transformer model integrated with CL methodology.

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[278]
L. Mayer, C. Heumann and M. Aßenmacher.
Can OpenSource beat ChatGPT? - A Comparative Study of Large Language Models for Text-to-Code Generation.
SwissText 2024 - Swiss Text Analytics Conference. Chur, Switzerland, Jun 10-11, 2024. URL
Abstract

In recent years, large language models (LLMs) have emerged as powerful tools with potential applications in various fields, including software engineering. Within the scope of this research, we evaluate five different state-of-the-art LLMs - Bard, BingChat, ChatGPT, Llama2, and Code Llama - concerning their capabilities for text-to-code generation. In an empirical study, we feed prompts with textual descriptions of coding problems sourced from the programming website LeetCode to the models with the task of creating solutions in Python. Subsequently, the quality of the generated outputs is assessed using the testing functionalities of LeetCode. The results indicate large differences in performance between the investigated models. ChatGPT can handle these typical programming challenges by far the most effectively, surpassing even code-specialized models like Code Llama. To gain further insights, we measure the runtime as well as the memory usage of the generated outputs and compare them to the other code submissions on LeetCode. A detailed error analysis, encompassing a comparison of the differences concerning correct indentation and form of the generated code as well as an assignment of the incorrectly solved tasks to certain error categories, allows us to obtain a more nuanced picture of the results and potential for improvement. The results also show a clear pattern of increasingly incorrect code when the models face a lot of context in the form of longer prompts.

MCML Authors
Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[277]
J. Simson, F. Pfisterer and C. Kern.
One Model Many Scores: Using Multiverse Analysis to Prevent Fairness Hacking and Evaluate the Influence of Model Design Decisions.
ACM FAccT 2024 - 7th ACM Conference on Fairness, Accountability, and Transparency. Rio de Janeiro, Brazil, Jun 03-06, 2024. DOI
Abstract

A vast number of systems across the world use algorithmic decision making (ADM) to (partially) automate decisions that have previously been made by humans. The downstream effects of ADM systems critically depend on the decisions made during a system’s design, implementation, and evaluation, as biases in data can be mitigated or reinforced along the modeling pipeline. Many of these decisions are made implicitly, without knowing exactly how they will influence the final system. To study this issue, we draw on insights from the field of psychology and introduce the method of multiverse analysis for algorithmic fairness. In our proposed method, we turn implicit decisions during design and evaluation into explicit ones and demonstrate their fairness implications. By combining decisions, we create a grid of all possible “universes” of decision combinations. For each of these universes, we compute metrics of fairness and performance. Using the resulting dataset, one can investigate the variability and robustness of fairness scores and see how and which decisions impact fairness. We demonstrate how multiverse analyses can be used to better understand fairness implications of design and evaluation decisions using an exemplary case study of predicting public health care coverage for vulnerable populations. Our results highlight how decisions regarding the evaluation of a system can lead to vastly different fairness metrics for the same model. This is problematic, as a nefarious actor could optimise or “hack” a fairness metric to portray a discriminating model as fair merely by changing how it is evaluated. We illustrate how a multiverse analysis can help to address this issue.
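
The grid of "universes" described above can be sketched as a Cartesian product over explicit decisions. The decision names and options below are hypothetical and chosen only for illustration; the paper's case study uses its own set, and fitting a model and computing fairness metrics per universe is left out here.

```python
from itertools import product

# Hypothetical design and evaluation decisions (illustrative names only).
decisions = {
    "missing_values": ["drop_rows", "impute_median"],
    "protected_attribute_coding": ["binary", "multi_category"],
    "classification_threshold": [0.3, 0.5, 0.7],
    "fairness_metric": ["equalized_odds_difference", "demographic_parity_difference"],
}


def build_multiverse(options: dict) -> list:
    """Return one dict per universe, i.e. per combination of decision options."""
    names = list(options)
    return [dict(zip(names, combo)) for combo in product(*(options[n] for n in names))]


universes = build_multiverse(decisions)

# Each universe would then be evaluated separately, yielding one row of
# fairness and performance metrics per combination of decisions.
for universe in universes[:3]:
    print(universe)
```

Because the grid grows multiplicatively with every added decision, even a handful of choices produces a dataset large enough to study how strongly a single fairness score depends on them.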

MCML Authors
Link to Profile Christoph Kern

Christoph Kern

Prof. Dr.

Social Data Science and AI Lab


[276]
J. Ramjith, A. Bender, K. C. B. Roes and M. A. Jonker.
Recurrent events analysis with piece-wise exponential additive mixed models.
Statistical Modelling 24.3 (Jun. 2024). DOI
Abstract

Recurrent events analysis plays an important role in many applications, including the study of chronic diseases or recurrence of infections. Historically, many models for recurrent events have been variants of the Cox model. In this article we introduce and describe the application of the piece-wise exponential Additive Mixed Model (PAMM) for recurrent events analysis and illustrate how PAMMs can be used to flexibly model the dependencies in recurrent events data. Simulations confirm that PAMMs provide unbiased estimates as well as equivalence to the Cox model when proportional hazards are assumed. Applications to recurrence of staphylococcus aureus and malaria in children illustrate the estimation of seasonality, bivariate non-linear effects, multiple timescales and relaxation of the proportional hazards assumption via time-varying effects. The R package pammtools is extended to facilitate estimation and visualization of PAMMs for recurrent events data.

MCML Authors
Link to website

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[275]
L. Burk, J. Zobolas, B. Bischl, A. Bender, M. N. Wright and R. Sonabend.
A Large-Scale Neutral Comparison Study of Survival Models on Low-Dimensional Data.
Preprint (Jun. 2024). arXiv
Abstract

This work presents the first large-scale neutral benchmark experiment focused on single-event, right-censored, low-dimensional survival data. Benchmark experiments are essential in methodological research to scientifically compare new and existing model classes through proper empirical evaluation. Existing benchmarks in the survival literature are often narrow in scope, focusing, for example, on high-dimensional data. Additionally, they may lack appropriate tuning or evaluation procedures, or are qualitative reviews, rather than quantitative comparisons. This comprehensive study aims to fill the gap by neutrally evaluating a broad range of methods and providing generalizable conclusions. We benchmark 18 models, ranging from classical statistical approaches to many common machine learning methods, on 32 publicly available datasets. The benchmark tunes for both a discrimination measure and a proper scoring rule to assess performance in different settings. Evaluating on 8 survival metrics, we assess discrimination, calibration, and overall predictive performance of the tested models. Using discrimination measures, we find that no method significantly outperforms the Cox model. However, (tuned) Accelerated Failure Time models were able to achieve significantly better results with respect to overall predictive performance as measured by the right-censored log-likelihood. Machine learning methods that performed comparably well include Oblique Random Survival Forests under discrimination, and Cox-based likelihood-boosting under overall predictive performance. We conclude that for predictive purposes in the standard survival analysis setting of low-dimensional, right-censored data, the Cox Proportional Hazards model remains a simple and robust method, sufficient for practitioners.

MCML Authors
Link to website

Lukas Burk

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to website

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[274]
R. Sonabend, J. Zobolas, P. Kopper, L. Burk and A. Bender.
Examining properness in the external validation of survival models with squared and logarithmic losses.
Preprint (Jun. 2024). arXiv
Abstract

Scoring rules promote rational and honest decision-making, which is becoming increasingly important for automated procedures in ‘auto-ML’. In this paper we survey common squared and logarithmic scoring rules for survival analysis and determine which losses are proper and improper. We prove that commonly utilised squared and logarithmic scoring rules that are claimed to be proper are in fact improper, such as the Integrated Survival Brier Score (ISBS). We further prove that under a strict set of assumptions a class of scoring rules is strictly proper for what we term ‘approximate’ survival losses. Despite the difference in properness, experiments in simulated and real-world datasets show there is no major difference between improper and proper versions of the widely-used ISBS, ensuring that we can reasonably trust previous experiments utilizing the original score for evaluation purposes. We still advocate for the use of proper scoring rules, as even minor differences between losses can have important implications in automated processes such as model tuning. We hope our findings encourage further research into the properties of survival measures so that robust and honest evaluation of survival models can be achieved.

MCML Authors
Link to website

Lukas Burk

Statistical Learning and Data Science

Link to website

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[273]
R. Kohli, M. Feurer, B. Bischl, K. Eggensperger and F. Hutter.
Towards Quantifying the Effect of Datasets for Benchmarking: A Look at Tabular Machine Learning.
DMLR @ICLR 2024 - Workshop on Data-centric Machine Learning Research at the 12th International Conference on Learning Representations (ICLR 2024). Vienna, Austria, May 07-11, 2024. URL
Abstract

Data in tabular form makes up a large part of real-world ML applications, and thus, there has been a strong interest in developing novel deep learning (DL) architectures for supervised learning on tabular data in recent years. As a result, there is a debate as to whether DL methods are superior to the ubiquitous ensembles of boosted decision trees. Typically, the advantage of one model class over the other is claimed based on an empirical evaluation, where different variations of both model classes are compared on a set of benchmark datasets that supposedly resemble relevant real-world tabular data. While the landscape of state-of-the-art models for tabular data has changed, one factor has remained largely constant over the years: the datasets. Here, we examine 30 recent publications and the 187 different datasets they use, in terms of age, study size and relevance. We found that the average study used fewer than 10 datasets and that half of the datasets are older than 20 years. Our insights raise questions about the conclusions drawn from previous studies and urge the research community to develop and publish additional recent, challenging and relevant datasets and ML tasks for supervised learning on tabular data.

MCML Authors
Link to Profile Matthias Feurer

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[272]
A. Vahidi, S. Schosser, L. Wimmer, Y. Li, B. Bischl, E. Hüllermeier and M. Rezaei.
Probabilistic Self-supervised Learning via Scoring Rules Minimization.
ICLR 2024 - 12th International Conference on Learning Representations. Vienna, Austria, May 07-11, 2024. URL GitHub
Abstract

In this paper, we propose a novel probabilistic self-supervised learning via Scoring Rule Minimization (ProSMIN), which leverages the power of probabilistic models to enhance representation quality and mitigate collapsing representations. Our proposed approach involves two neural networks: the online network and the target network, which collaborate and learn the diverse distribution of representations from each other through knowledge distillation. By presenting the input samples in two augmented formats, the online network is trained to predict the target network representation of the same sample under a different augmented view. The two networks are trained via our new loss function based on proper scoring rules. We provide a theoretical justification for ProSMIN’s convergence, demonstrating the strict propriety of its modified scoring rule. This insight validates the method’s optimization process and contributes to its robustness and effectiveness in improving representation quality. We evaluate our probabilistic model on various downstream tasks, such as in-distribution generalization, out-of-distribution detection, dataset corruption, low-shot learning, and transfer learning. Our method achieves superior accuracy and calibration, surpassing the self-supervised baseline in a wide range of experiments. On large-scale datasets like ImageNet-O and ImageNet-C, ProSMIN demonstrates its scalability and real-world applicability.

MCML Authors
Link to website

Lisa Wimmer

Statistical Learning and Data Science

Link to website

Yawei Li

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning

Link to website

Mina Rezaei

Dr.

Statistical Learning and Data Science


[271]
D. Dold, D. Rügamer, B. Sick and O. Dürr.
Bayesian Semi-structured Subspace Inference.
AISTATS 2024 - 27th International Conference on Artificial Intelligence and Statistics. Valencia, Spain, May 02-04, 2024. URL
Abstract

Semi-structured regression models enable the joint modeling of interpretable structured and complex unstructured feature effects. The structured model part is inspired by statistical models and can be used to infer the input-output relationship for features of particular importance. The complex unstructured part defines an arbitrary deep neural network and thereby provides enough flexibility to achieve competitive prediction performance. While these models can also account for aleatoric uncertainty, there is still a lack of work on accounting for epistemic uncertainty. In this paper, we address this problem by presenting a Bayesian approximation for semi-structured regression models using subspace inference. To this end, we extend subspace inference for joint posterior sampling from a full parameter space for structured effects and a subspace for unstructured effects. Apart from this hybrid sampling scheme, our method allows for tunable complexity of the subspace and can capture multiple minima in the loss landscape. Numerical experiments validate our approach’s efficacy in recovering structured effect parameter posteriors in semi-structured models and approaching the full-space posterior distribution of MCMC for increasing subspace dimension. Further, our approach exhibits competitive predictive performance across simulated and real-world datasets.

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[270]
N. Palm and T. Nagler.
An Online Bootstrap for Time Series.
AISTATS 2024 - 27th International Conference on Artificial Intelligence and Statistics. Valencia, Spain, May 02-04, 2024. URL
Abstract

Resampling methods such as the bootstrap have proven invaluable in the field of machine learning. However, the applicability of traditional bootstrap methods is limited when dealing with large streams of dependent data, such as time series or spatially correlated observations. In this paper, we propose a novel bootstrap method that is designed to account for data dependencies and can be executed online, making it particularly suitable for real-time applications. This method is based on an autoregressive sequence of increasingly dependent resampling weights. We prove the theoretical validity of the proposed bootstrap scheme under general conditions. We demonstrate the effectiveness of our approach through extensive simulations and show that it provides reliable uncertainty quantification even in the presence of complex data dependencies. Our work bridges the gap between classical resampling techniques and the demands of modern data analysis, providing a valuable tool for researchers and practitioners in dynamic, data-rich environments.

MCML Authors
Link to website

Nicolai Palm

Computational Statistics & Data Science

Link to Profile Thomas Nagler

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science


[269]
D. Rügamer.
Scalable Higher-Order Tensor Product Spline Models.
AISTATS 2024 - 27th International Conference on Artificial Intelligence and Statistics. Valencia, Spain, May 02-04, 2024. URL
Abstract

In the current era of vast data and transparent machine learning, it is essential for techniques to operate at a large scale while providing a clear mathematical comprehension of the internal workings of the method. Although there already exist interpretable semi-parametric regression methods for large-scale applications that take into account non-linearity in the data, the complexity of the models is still often limited. One of the main challenges is the absence of interactions in these models, which are left out for the sake of better interpretability but also due to impractical computational costs. To overcome this limitation, we propose a new approach using a factorization method to derive a highly scalable higher-order tensor product spline model. Our method allows for the incorporation of all (higher-order) interactions of non-linear feature effects while having computational costs proportional to a model without interactions. We further develop a meaningful penalization scheme and examine the induced optimization problem. We conclude by evaluating the predictive and estimation performance of our method.

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[268]
A. Solderer, S. P. Hicklin, M. Aßenmacher, A. Ender and P. R. Schmidlin.
Influence of an allogenic collagen scaffold on implant sites with thin supracrestal tissue height: a randomized clinical trial.
Clinical Oral Investigations 28.313 (May. 2024). DOI
Abstract

Objectives: This randomized clinical trial focused on patients with thin peri-implant soft-tissue height (STH) (≤ 2.5 mm) and investigated the impact of an allogenic collagen scaffold (aCS) on supracrestal tissue height and marginal bone loss (MBL).
Material & methods: Forty patients received bone level implants and were randomly assigned to the test group with simultaneous tissue thickening with aCS or the control group. After three months, prosthetic restoration occurred. STH measurements were taken at baseline (T0) and reopening surgery (TR), with MBL assessed at 12 months (T1). Descriptive statistics were calculated for continuous variables, and counts for categorical variables (significance level, p = 0.05).
Results: At T1, 37 patients were available. At T0, control and test groups had mean STH values of 2.3 ± 0.3 mm and 2.1 ± 0.4 mm. TR revealed mean STH values of 2.3 ± 0.2 mm (control) and 2.6 ± 0.7 mm (test), with a significant tissue thickening of 0.5 ± 0.6 mm in the test group (p < 0.03). At T1, control and test groups showed MBL mean values of 1.1 ± 0.8 mm and 1.0 ± 0.6 mm, with a moderate but significant correlation with STH thickening (-0.34), implant position (0.43), history of periodontitis (0.39), and smoking status (0.27).
Conclusion: The use of an aCS protocol resulted in soft tissue thickening but did not reach a threshold to reliably reduce MBL compared to the control group within the study’s limitations.
Clinical relevance: Peri-implant STH is crucial for maintaining peri-implant marginal bone stability. Marginal bone stability represents a crucial factor in prevention of peri-implantitis development.

MCML Authors
Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[267]
K. Jeblick, B. Schachtner, J. Dexl, A. Mittermeier, A. T. Stüber, J. Topalis, T. Weber, P. Wesp, B. O. Sabel, J. Ricke and M. Ingrisch.
ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports.
European Radiology 34 (May. 2024). DOI
Abstract

Objectives: To assess the quality of simplified radiology reports generated with the large language model (LLM) ChatGPT and to discuss challenges and chances of ChatGPT-like LLMs for medical text simplification.
Methods: In this exploratory case study, a radiologist created three fictitious radiology reports which we simplified by prompting ChatGPT with ‘Explain this medical report to a child using simple language.’ In a questionnaire, we tasked 15 radiologists to rate the quality of the simplified radiology reports with respect to their factual correctness, completeness, and potential harm for patients. We used Likert scale analysis and inductive free-text categorization to assess the quality of the simplified reports.
Results: Most radiologists agreed that the simplified reports were factually correct, complete, and not potentially harmful to the patient. Nevertheless, instances of incorrect statements, missed relevant medical information, and potentially harmful passages were reported.
Conclusion: While we see a need for further adaption to the medical field, the initial insights of this study indicate a tremendous potential in using LLMs like ChatGPT to improve patient-centered care in radiology and other medical domains.
Clinical relevance statement: Patients have started to use ChatGPT to simplify and explain their medical reports, which is expected to affect patient-doctor interaction. This phenomenon raises several opportunities and challenges for clinical routine.

MCML Authors
Link to website

Katharina Jeblick

Dr.

Clinical Data Science in Radiology

Link to website

Balthasar Schachtner

Dr.

Clinical Data Science in Radiology

Link to website

Jakob Dexl

Clinical Data Science in Radiology

Link to website

Andreas Mittermeier

Dr.

Clinical Data Science in Radiology

Link to website

Theresa Stüber

Clinical Data Science in Radiology

Link to website

Philipp Wesp

Dr.

Clinical Data Science in Radiology

Link to Profile Michael Ingrisch

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology


[266]
A. F. Thielmann, A. Reuter, T. Kneib, D. Rügamer and B. Säfken.
Interpretable Additive Tabular Transformer Networks.
Transactions on Machine Learning Research (May. 2024). URL
Abstract

Attention-based Transformer networks have not only revolutionized Natural Language Processing but have also achieved state-of-the-art results for tabular data modeling. The attention mechanism, in particular, has proven to be highly effective in accurately modeling categorical variables. Although deep learning models recently outperform tree-based models, they often lack a complete comprehension of the individual impact of features because of their opaque nature. In contrast, additive neural network structures have proven to be both predictive and interpretable. Within the context of explainable deep learning, we propose Neural Additive Tabular Transformer Networks (NATT), a modeling framework that combines the intelligibility of additive neural networks with the predictive power of Transformer models. NATT offers inherent intelligibility while achieving similar performance to complex deep learning models. To validate its efficacy, we conduct experiments on multiple datasets and find that NATT performs on par with state-of-the-art methods on tabular data and surpasses other interpretable approaches.

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[265]
R. Debelak, T. Koch, M. Aßenmacher and C. Stachl.
From Embeddings to Explainability: A Tutorial on Transformer-Based Text Analysis for Social and Behavioral Scientists.
Preprint (May. 2024). DOI
Abstract

Large language models and their use for text analysis have had a significant impact on psychology and the social and behavioral sciences in general. Key applications include the analysis of texts, such as social media posts, to infer psychological characteristics, as well as survey and interview analysis. In this tutorial paper, we demonstrate the use of the Python-based natural language processing software package transformers (and related modules from the Hugging Face Ecosystem) that allow for the automated classification of text inputs in a practical exercise. In doing so, we rely on pretrained transformer models which can be fine-tuned to a specific task and domain. The first proposed application of this model class is to use it as a feature extractor, allowing for the transformation of written text into real-valued numerical vectors (called ‘embeddings’) that capture a text’s semantic meaning. These vectors can, in turn, be used as input for a subsequent machine-learning model. The second presented application of transformer models is the end-to-end training (so-called ‘fine-tuning’) of the model. This results in a direct prediction of the label within the same model that directly maps the text to the embeddings. While in the second case, results are usually better and training works more seamlessly, the model itself is often not directly interpretable. We showcase an alleviation of this issue via the application of post-hoc interpretability methods by calculating SHAP values and applying local interpretable model-agnostic explanations (LIME) in an attempt to explain the model’s inner workings.

MCML Authors
Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[264]
K. Hechinger, C. Koller, X. Zhu and G. Kauermann.
Human-in-the-loop: Towards Label Embeddings for Measuring Classification Difficulty.
Preprint (May. 2024). arXiv
Abstract

Uncertainty in machine learning models is a timely and vast field of research. In supervised learning, uncertainty can already occur in the first stage of the training process, the annotation phase. This scenario is particularly evident when some instances cannot be definitively classified. In other words, there is inevitable ambiguity in the annotation step and hence, not necessarily a ‘ground truth’ associated with each instance. The main idea of this work is to drop the assumption of a ground truth label and instead embed the annotations into a multidimensional space. This embedding is derived from the empirical distribution of annotations in a Bayesian setup, modeled via a Dirichlet-Multinomial framework. We estimate the model parameters and posteriors using a stochastic Expectation Maximization algorithm with Markov Chain Monte Carlo steps. The methods developed in this paper readily extend to various situations where multiple annotators independently label instances. To showcase the generality of the proposed approach, we apply our approach to three benchmark datasets for image classification and Natural Language Inference. Besides the embeddings, we can investigate the resulting correlation matrices, which reflect the semantic similarities of the original classes very well for all three exemplary datasets.

MCML Authors
Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation

Link to Profile Göran Kauermann

Göran Kauermann

Prof. Dr.

Applied Statistics in Social Sciences, Economics and Business


[263]
P. Dettling, M. Drton and M. Kolar.
On the Lasso for Graphical Continuous Lyapunov Models.
CLeaR 2024 - 3rd Conference on Causal Learning and Reasoning. Los Angeles, CA, USA, Apr 01-03, 2024. URL
Abstract

Graphical continuous Lyapunov models offer a new perspective on modeling causally interpretable dependence structure in multivariate data by treating each independent observation as a one-time cross-sectional snapshot of a temporal process. Specifically, the models assume that the observations are cross-sections of independent multivariate Ornstein-Uhlenbeck processes in equilibrium. The Gaussian equilibrium exists under a stability assumption on the drift matrix, and the equilibrium covariance matrix is determined by the continuous Lyapunov equation. Each graphical continuous Lyapunov model assumes the drift matrix to be sparse, with a support determined by a directed graph. A natural approach to model selection in this setting is to use an ℓ1-regularization technique that, based on a given sample covariance matrix, seeks to find a sparse approximate solution to the Lyapunov equation. We study the model selection properties of the resulting lasso technique to arrive at a consistency result. Our detailed analysis reveals that the involved irrepresentability condition is surprisingly difficult to satisfy. While this may prevent asymptotic consistency in model selection, our numerical experiments indicate that even if the theoretical requirements for consistency are not met, the lasso approach is able to recover relevant structure of the drift matrix and is robust to aspects of model misspecification.
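
The objects named in the abstract can be written out as follows. The notation is assumed here rather than taken from the abstract: $M$ is the drift matrix, $\Sigma$ the equilibrium covariance, $C$ a positive definite volatility matrix, $\hat{\Sigma}$ the sample covariance and $\lambda$ a tuning parameter. The penalized objective is a sketch of one natural $\ell_1$-regularized formulation, not necessarily the exact estimator analyzed in the paper.

```latex
% Continuous Lyapunov equation linking the (stable) drift matrix M
% to the equilibrium covariance \Sigma of the Ornstein-Uhlenbeck process.
M \Sigma + \Sigma M^{\top} + C = 0

% Lasso-type model selection: find a sparse M that approximately solves
% the Lyapunov equation for the sample covariance \hat{\Sigma}.
\hat{M} \;\in\; \arg\min_{M \in \mathbb{R}^{p \times p}}
  \; \tfrac{1}{2} \bigl\lVert M \hat{\Sigma} + \hat{\Sigma} M^{\top} + C \bigr\rVert_F^2
  \;+\; \lambda \lVert M \rVert_1
```

The support of $\hat{M}$ is then read as the edge set of the directed graph, which is where the irrepresentability condition discussed in the abstract enters.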

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[262]
K. Göbler, T. Windisch, M. Drton, T. Pychynski, M. Roth and S. Sonntag.
causalAssembly: Generating Realistic Production Data for Benchmarking Causal Discovery.
CLeaR 2024 - 3rd Conference on Causal Learning and Reasoning. Los Angeles, CA, USA, Apr 01-03, 2024. URL
Abstract

Algorithms for causal discovery have recently undergone rapid advances and increasingly draw on flexible nonparametric methods to process complex data. With these advances comes a need for adequate empirical validation of the causal relationships learned by different algorithms. However, for most real and complex data sources true causal relations remain unknown. This issue is further compounded by privacy concerns surrounding the release of suitable high-quality data. To tackle these challenges, we introduce causalAssembly, a semisynthetic data generator designed to facilitate the benchmarking of causal discovery methods. The tool is built using a complex real-world dataset comprised of measurements collected along an assembly line in a manufacturing setting. For these measurements, we establish a partial set of ground truth causal relationships through a detailed study of the physics underlying the processes carried out in the assembly line. The partial ground truth is sufficiently informative to allow for estimation of a full causal graph by mere nonparametric regression. To overcome potential confounding and privacy concerns, we use distributional random forests to estimate and represent conditional distributions implied by the ground truth causal graph. These conditionals are combined into a joint distribution that strictly adheres to a causal model over the observed variables. Sampling from this distribution, causalAssembly generates data that are guaranteed to be Markovian with respect to the ground truth. Using our tool, we showcase how to benchmark several well-known causal discovery algorithms.

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[261]
D. Strieder and M. Drton.
Dual Likelihood for Causal Inference under Structure Uncertainty.
CLeaR 2024 - 3rd Conference on Causal Learning and Reasoning. Los Angeles, CA, USA, Apr 01-03, 2024. URL
Abstract

Knowledge of the underlying causal relations is essential for inferring the effect of interventions in complex systems. In a widely studied approach, structural causal models postulate noisy functional relations among interacting variables, where the underlying causal structure is then naturally represented by a directed graph whose edges indicate direct causal dependencies. In the typical application, this underlying causal structure must be learned from data, and thus, the remaining structure uncertainty needs to be incorporated into causal inference in order to draw reliable conclusions. In recent work, test inversions provide an ansatz to account for this data-driven model choice and, therefore, combine structure learning with causal inference. In this article, we propose the use of dual likelihood to greatly simplify the treatment of the involved testing problem. Indeed, dual likelihood leads to a closed-form solution for constructing confidence regions for total causal effects that rigorously capture both sources of uncertainty: causal structure and numerical size of nonzero effects. The proposed confidence regions can be computed with a bottom-up procedure starting from sink nodes. To render the causal structure identifiable, we develop our ideas in the context of linear causal relations with equal error variances.

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[260]
H. A. Gündüz, R. Mreches, J. Moosbauer, G. Robertson, X.-Y. To, E. A. Franzosa, C. Huttenhower, M. Rezaei, A. C. McHardy, B. Bischl, P. C. Münch and M. Binder.
Optimized model architectures for deep learning on genomic data.
Communications Biology 7.1 (Apr. 2024). DOI
Abstract

The success of deep learning in various applications depends on task-specific architecture design choices, including the types, hyperparameters, and number of layers. In computational biology, there is no consensus on the optimal architecture design, and decisions are often made using insights from more well-established fields such as computer vision. These may not consider the domain-specific characteristics of genome sequences, potentially limiting performance. Here, we present GenomeNet-Architect, a neural architecture design framework that automatically optimizes deep learning models for genome sequence data. It optimizes the overall layout of the architecture, with a search space specifically designed for genomics. Additionally, it optimizes hyperparameters of individual layers and the model training procedure. On a viral classification task, GenomeNet-Architect reduced the read-level misclassification rate by 19%, with 67% faster inference and 83% fewer parameters, and achieved similar contig-level accuracy with ~100 times fewer parameters compared to the best-performing deep learning baselines.

MCML Authors
Link to website

Mina Rezaei

Dr.

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to website

Martin Binder

Statistical Learning and Data Science


[259]
M. Herrmann, D. Kazempour, F. Scheipl and P. Kröger.
Enhancing cluster analysis via topological manifold learning.
Data Mining and Knowledge Discovery 38 (Apr. 2024). DOI
Abstract

We discuss topological aspects of cluster analysis and show that inferring the topological structure of a dataset before clustering it can considerably enhance cluster detection: we show that clustering embedding vectors representing the inherent structure of a dataset instead of the observed feature vectors themselves is highly beneficial. To demonstrate, we combine the manifold learning method UMAP for inferring the topological structure with the density-based clustering method DBSCAN. Synthetic and real data results show that this both simplifies and improves clustering in a diverse set of low- and high-dimensional problems including clusters of varying density and/or entangled shapes. Our approach simplifies clustering because topological pre-processing consistently reduces parameter sensitivity of DBSCAN. Clustering the resulting embeddings with DBSCAN can then even outperform complex methods such as SPECTACL and ClusterGAN. Finally, our investigation suggests that the crucial issue in clustering does not appear to be the nominal dimension of the data or how many irrelevant features it contains, but rather how separable the clusters are in the ambient observation space they are embedded in, which is usually the (high-dimensional) Euclidean space defined by the features of the data. The approach is successful because it performs the cluster analysis after projecting the data into a more suitable space that is optimized for separability, in some sense.

MCML Authors
Link to Profile Moritz Herrmann

Moritz Herrmann

Dr.

Transfer Coordinator

Biometry in Molecular Medicine

Daniyal Kazempour

Dr.

* Former Member

Link to Profile Fabian Scheipl

Fabian Scheipl

PD Dr.

Functional Data Analysis

Peer Kröger

Prof. Dr.

* Former Principal Investigator


[258]
S. Feuerriegel, D. Frauen, V. Melnychuk, J. Schweisthal, K. Hess, A. Curth, S. Bauer, N. Kilbertus, I. S. Kohane and M. van der Schaar.
Causal machine learning for predicting treatment outcomes.
Nature Medicine 30 (Apr. 2024). DOI
Abstract

Causal machine learning (ML) offers flexible, data-driven methods for predicting treatment outcomes including efficacy and toxicity, thereby supporting the assessment and safety of drugs. A key benefit of causal ML is that it allows for estimating individualized treatment effects, so that clinical decision-making can be personalized to individual patient profiles. Causal ML can be used in combination with both clinical trial data and real-world data, such as clinical registries and electronic health records, but caution is needed to avoid biased or incorrect predictions. In this Perspective, we discuss the benefits of causal ML (relative to traditional statistical or ML approaches) and outline the key components and steps. Finally, we provide recommendations for the reliable use of causal ML and effective translation into the clinic.

MCML Authors
Link to Profile Stefan Feuerriegel

Stefan Feuerriegel

Prof. Dr.

Artificial Intelligence in Management

Link to website

Dennis Frauen

Artificial Intelligence in Management

Link to website

Valentyn Melnychuk

Artificial Intelligence in Management

Link to website

Jonas Schweisthal

Artificial Intelligence in Management

Link to Profile Stefan Bauer

Stefan Bauer

Prof. Dr.

Algorithmic Machine Learning & Explainable AI

Link to Profile Niki Kilbertus

Niki Kilbertus

Prof. Dr.

Ethics in Systems Design and Machine Learning


[257]
V. Gkolemis, C. Diou, E. Ntoutsi, T. Dalamagas, B. Bischl, J. Herbinger and G. Casalicchio.
Effector: A Python package for regional explanations.
Preprint (Apr. 2024). arXiv GitHub
Abstract

Global feature effect methods explain a model by outputting one plot per feature. The plot shows the average effect of the feature on the output, like the effect of age on the annual income. However, average effects may be misleading when derived from local effects that are heterogeneous, i.e., they significantly deviate from the average. To decrease the heterogeneity, regional effects provide multiple plots per feature, each representing the average effect within a specific subspace. For interpretability, subspaces are hyperrectangles defined by a chain of logical rules, like age’s effect on annual income separately for males and females and different levels of professional experience. We introduce Effector, a Python library dedicated to regional feature effects. Effector implements well-established global effect methods, assesses the heterogeneity of each method and, based on that, provides regional effects. Effector automatically detects subspaces where regional effects have reduced heterogeneity. All global and regional effect methods share a common API, facilitating comparisons between them. Moreover, the library’s interface is extensible so new methods can be easily added and benchmarked.
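The contrast between a global and a regional effect can be illustrated with a small plain-Python sketch. It does not use the Effector API. The toy `model`, the dataset, and the helper `average_effect` are all hypothetical and chosen so that opposite local effects cancel in the global average.

```python
from statistics import mean


def model(age, sex):
    """Toy black box (hypothetical): output rises with age for sex=0, falls for sex=1."""
    slope = 1.0 if sex == 0 else -1.0
    return 50.0 + slope * (age - 40.0)


# Toy dataset of (age, sex) rows; the values are illustrative only.
data = [(a, s) for a in (30, 40, 50) for s in (0, 1)]


def average_effect(rows, grid):
    """Partial-dependence-style curve: mean prediction with age set to each grid value."""
    return [mean(model(g, s) for _, s in rows) for g in grid]


grid = [30, 40, 50]

# Global effect: averaged over all rows, the two opposite slopes cancel out.
global_effect = average_effect(data, grid)

# Regional effects: one curve per subspace defined by the rule "sex == s".
regional_effect = {
    s: average_effect([r for r in data if r[1] == s], grid) for s in (0, 1)
}
```

Here the global curve is flat although no individual prediction is, while each regional curve recovers the slope within its subspace. Finding such subspaces automatically is the part this sketch leaves out.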

MCML Authors
Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to website

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science


[256]
T. Weber, J. Dexl, D. Rügamer and M. Ingrisch.
Post-Training Network Compression for 3D Medical Image Segmentation: Reducing Computational Efforts via Tucker Decomposition.
Preprint (Apr. 2024). arXiv
Abstract

We address the computational barrier of deploying advanced deep learning segmentation models in clinical settings by studying the efficacy of network compression through tensor decomposition. We propose a post-training Tucker factorization that enables the decomposition of pre-existing models to reduce computational requirements without impeding segmentation accuracy. We applied Tucker decomposition to the convolutional kernels of the TotalSegmentator (TS) model, an nnU-Net model trained on a comprehensive dataset for automatic segmentation of 117 anatomical structures. Our approach reduced the floating-point operations (FLOPs) and memory required during inference, offering an adjustable trade-off between computational efficiency and segmentation quality. This study utilized the publicly available TS dataset, employing various downsampling factors to explore the relationship between model size, inference speed, and segmentation performance. The application of Tucker decomposition to the TS model substantially reduced the model parameters and FLOPs across various compression rates, with limited loss in segmentation accuracy. We removed up to 88% of the model’s parameters with no significant performance changes in the majority of classes after fine-tuning. Practical benefits varied across different graphics processing unit (GPU) architectures, with more distinct speed-ups on less powerful hardware. Post-hoc network compression via Tucker decomposition presents a viable strategy for reducing the computational demand of medical image segmentation models without substantially sacrificing accuracy. This approach enables the broader adoption of advanced deep learning technologies in clinical practice, offering a way to navigate the constraints of hardware capabilities.
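The parameter savings from factorizing a convolutional kernel can be checked with simple arithmetic. The sketch below assumes a Tucker-2 split along the two channel modes of a 3D convolution; the layer sizes and ranks are hypothetical, and the paper's exact decomposition and rank selection may differ.

```python
def conv3d_params(c_in, c_out, k):
    """Weights of a dense 3D convolution with a k x k x k kernel (bias ignored)."""
    return c_in * c_out * k ** 3


def tucker2_params(c_in, c_out, k, r_in, r_out):
    """Weights after a Tucker-2 split along the channel modes (assumed variant):
    a 1x1x1 conv (c_in -> r_in), a k^3 core conv (r_in -> r_out),
    and a 1x1x1 conv (r_out -> c_out)."""
    return c_in * r_in + r_in * r_out * k ** 3 + r_out * c_out


# Hypothetical layer: 256 -> 256 channels, 3x3x3 kernel, both ranks set to 64.
dense = conv3d_params(256, 256, 3)
compact = tucker2_params(256, 256, 3, 64, 64)
removed_fraction = 1 - compact / dense
```

For this single hypothetical layer roughly 92% of the weights are removed. Lower ranks remove more weights at the cost of a coarser approximation of the original kernel.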

MCML Authors
Link to website

Jakob Dexl

Clinical Data Science in Radiology

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Link to Profile Michael Ingrisch

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology


[255]
C. Gruber, K. Hechinger, M. Aßenmacher, G. Kauermann and B. Plank.
More Labels or Cases? Assessing Label Variation in Natural Language Inference.
UnImplicit 2024 - 3rd Workshop on Understanding Implicit and Underspecified Language. Malta, Mar 21, 2024. URL
Abstract

In this work, we analyze the uncertainty that is inherently present in the labels used for supervised machine learning in natural language inference (NLI). In cases where multiple annotations per instance are available, neither the majority vote nor the frequency of individual class votes is a trustworthy representation of the labeling uncertainty. We propose modeling the votes via a Bayesian mixture model to recover the data-generating process, i.e., the “true” latent classes, and thus gain insight into the class variations. This will enable a better understanding of the confusion happening during the annotation process. We also assess the stability of the proposed estimation procedure by systematically varying the numbers of i) instances and ii) labels. Thereby, we observe that few instances with many labels can predict the latent class borders reasonably well, while the estimation fails for many instances with only a few labels. This leads us to conclude that multiple labels are a crucial building block for properly analyzing label uncertainty.

MCML Authors
Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science

Link to Profile Göran Kauermann

Göran Kauermann

Prof. Dr.

Applied Statistics in Social Sciences, Economics and Business

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


[254]
S. Dandl, C. Haslinger, T. Hothorn, H. Seibold, E. Sverdrup, S. Wager and A. Zeileis.
What Makes Forest-Based Heterogeneous Treatment Effect Estimators Work?
Annals of Applied Statistics 18.1 (Mar. 2024). DOI
Abstract

Estimation of heterogeneous treatment effects (HTE) is of prime importance in many disciplines, from personalized medicine to economics among many others. Random forests have been shown to be a flexible and powerful approach to HTE estimation in both randomized trials and observational studies. In particular “causal forests” introduced by Athey, Tibshirani and Wager (Ann. Statist. 47 (2019) 1148–1178), along with the R implementation in package grf were rapidly adopted. A related approach, called ‘model-based forests’ that is geared toward randomized trials and simultaneously captures effects of both prognostic and predictive variables, was introduced by Seibold, Zeileis and Hothorn (Stat. Methods Med. Res. 27 (2018) 3104–3125) along with a modular implementation in the R package model4you.
Neither procedure is directly applicable to the estimation of individualized predictions of excess postpartum blood loss caused by a cesarean section in comparison to vaginal delivery. Clearly, randomization is hardly possible in this setup, and thus model-based forests lack clinical trial data to address this question. On the other hand, the skewed and interval-censored postpartum blood loss observations violate assumptions made by causal forests. Here we present a tailored model-based forest for skewed and interval-censored data to infer possible predictive prepartum characteristics and their impact on excess postpartum blood loss caused by a cesarean section.
As a methodological basis, we propose a unifying view on causal and model-based forests that goes beyond the theoretical motivations and investigates which computational elements make causal forests so successful and how these can be blended with the strengths of model-based forests. To do so, we show that both methods can be understood in terms of the same parameters and model assumptions for an additive model under L2 loss. This theoretical insight allows us to implement several flavors of ‘model-based causal forests’ and dissect their different elements in silico.
The original causal forests and model-based forests are compared with the new blended versions in a benchmark study exploring both randomized trials and observational settings. In the randomized setting, both approaches performed similarly. If confounding was present in the data-generating process, we found local centering of the treatment indicator with the corresponding propensities to be the main driver for good performance. Local centering of the outcome was less important and might be replaced or enhanced by simultaneous split selection with respect to both prognostic and predictive effects. This lays the foundation for future research combining random forests for HTE estimation with other types of models.

MCML Authors

[253]
F. Coens, N. Knops, I. Tieken, S. Vogelaar, A. Bender, J. J. Kim, K. Krupka, L. Pape, A. Raes, B. Tönshoff, A. Prytula and C. Registry.
Time-Varying Determinants of Graft Failure in Pediatric Kidney Transplantation in Europe.
Clinical Journal of the American Society of Nephrology 19.3 (Mar. 2024). DOI
Abstract

Little is known about the time-varying determinants of kidney graft failure in children. We performed a retrospective study of primary pediatric kidney transplant recipients (younger than 18 years) from the Eurotransplant registry (1990-2020). Piece-wise exponential additive mixed models were applied to analyze time-varying recipient, donor, and transplant risk factors. Primary outcome was death-censored graft failure.

MCML Authors
Link to website

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[252]
W. H. Hartl, P. Kopper, L. Xu, L. Heller, M. Mironov, R. Wang, A. G. Day, G. Elke, H. Küchenhoff and A. Bender.
Relevance of Protein Intake for Weaning in the Mechanically Ventilated Critically Ill: Analysis of a Large International Database.
Critical Care Medicine 50.3 (Mar. 2024). DOI
Abstract

The association between protein intake and the need for mechanical ventilation (MV) is controversial. We aimed to investigate the associations between protein intake and outcomes in ventilated critically ill patients.

MCML Authors
Link to Profile Helmut Küchenhoff

Helmut Küchenhoff

Prof. Dr.

Statistical Consulting Unit (StaBLab)

Link to website

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[251]
B. X. Liew, F. Pfisterer, D. Rügamer and X. Zhai.
Strategies to optimise machine learning classification performance when using biomechanical features.
Journal of Biomechanics 165 (Mar. 2024). DOI
Abstract

Building prediction models using biomechanical features is challenging because such models may require large sample sizes. However, collecting biomechanical data on large sample sizes is logistically very challenging. This study aims to investigate if modern machine learning algorithms can help overcome the issue of limited sample sizes on developing prediction models. This was a secondary data analysis of two biomechanical datasets: a walking dataset on 2295 participants, and a countermovement jump dataset on 31 participants. The input features were the three-dimensional ground reaction forces (GRFs) of the lower limbs. The outcome was the orthopaedic disease category (healthy, calcaneus, ankle, knee, hip) in the walking dataset, and healthy vs people with patellofemoral pain syndrome in the jump dataset. Different algorithms were compared: multinomial/LASSO regression, XGBoost, various deep learning time-series algorithms with augmented data, and with transfer learning. For the outcome of weighted multiclass area under the receiver operating characteristic curve (AUC) in the walking dataset, the three models with the best performance were InceptionTime with x12 augmented data (0.810), XGBoost (0.804), and multinomial logistic regression (0.800). For the jump dataset, the top three models with the highest AUC were the LASSO (1.00), InceptionTime with x8 augmentation (0.750), and transfer learning (0.653). Machine-learning based strategies for managing the challenging issue of limited sample size for biomechanical ML-based problems could benefit the development of alternative prediction models in healthcare, especially when time-series data are involved.

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[250]
S. Dandl, A. Bender and T. Hothorn.
Heterogeneous Treatment Effect Estimation for Observational Data Using Model-Based Forests.
Statistical Methods in Medical Research 33.3 (Mar. 2024). DOI
Abstract

The estimation of heterogeneous treatment effects has attracted considerable interest in many disciplines, most prominently in medicine and economics. Contemporary research has so far primarily focused on continuous and binary responses where heterogeneous treatment effects are traditionally estimated by a linear model, which allows the estimation of constant or heterogeneous effects even under certain model misspecifications. More complex models for survival, count, or ordinal outcomes require stricter assumptions to reliably estimate the treatment effect. Most importantly, the noncollapsibility issue necessitates the joint estimation of treatment and prognostic effects. Model-based forests allow simultaneous estimation of covariate-dependent treatment and prognostic effects, but only for randomized trials. In this paper, we propose modifications to model-based forests to address the confounding issue in observational data. In particular, we evaluate an orthogonalization strategy originally proposed by Robinson (1988, Econometrica) in the context of model-based forests targeting heterogeneous treatment effect estimation in generalized linear models and transformation models. We found that this strategy reduces confounding effects in a simulated study with various outcome distributions. We demonstrate the practical aspects of heterogeneous treatment effect estimation for survival and ordinal outcomes by an assessment of the potentially heterogeneous effect of Riluzole on the progress of Amyotrophic Lateral Sclerosis.
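The orthogonalization idea attributed to Robinson (1988) can be sketched on a toy example: the treatment is centered by its estimated propensity and the outcome by its estimated conditional mean before the effect is estimated. The sketch below is not a model-based forest. It assumes a constant treatment effect, a single binary confounder, and stratum means as nuisance estimators, and the data are hypothetical and noise-free so the result can be checked by hand.

```python
from collections import defaultdict
from statistics import mean

# Toy observational data (hypothetical): confounder x, treatment w, outcome y = 3*x + 2*w.
# Treatment is more likely when x = 1, so x confounds the naive comparison.
TRUE_EFFECT = 2.0
rows = [(0, 1), (0, 0), (0, 0), (0, 0), (1, 1), (1, 1), (1, 1), (1, 0)]
data = [(x, w, 3.0 * x + TRUE_EFFECT * w) for x, w in rows]


def naive_difference(data):
    """Difference in mean outcomes between treated and control units."""
    treated = [y for _, w, y in data if w == 1]
    control = [y for _, w, y in data if w == 0]
    return mean(treated) - mean(control)


def robinson_estimate(data):
    """Residual-on-residual regression after centering w and y within strata of x."""
    by_x = defaultdict(list)
    for x, w, y in data:
        by_x[x].append((w, y))
    e_hat = {x: mean(w for w, _ in v) for x, v in by_x.items()}  # propensity E[W | X]
    m_hat = {x: mean(y for _, y in v) for x, v in by_x.items()}  # outcome mean E[Y | X]
    num = sum((w - e_hat[x]) * (y - m_hat[x]) for x, w, y in data)
    den = sum((w - e_hat[x]) ** 2 for x, w, y in data)
    return num / den
```

On these data `naive_difference(data)` is 3.5, whereas `robinson_estimate(data)` recovers the effect of 2.0 that generated the outcomes.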

MCML Authors
Link to website

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[249]
P. Kopper, D. Rügamer, R. Sonabend, B. Bischl and A. Bender.
Training Survival Models using Scoring Rules.
Preprint (Mar. 2024). arXiv
Abstract

Survival Analysis provides critical insights for partially incomplete time-to-event data in various domains. It is also an important example of probabilistic machine learning. The probabilistic nature of the predictions can be exploited by using (proper) scoring rules in the model fitting process instead of likelihood-based optimization. Our proposal does so in a generic manner and can be used for a variety of model classes. We establish different parametric and non-parametric sub-frameworks that allow different degrees of flexibility. Incorporated into neural networks, it leads to a computationally efficient and scalable optimization routine, yielding state-of-the-art predictive performance. Finally, we show that using our framework, we can recover various parametric models and demonstrate that optimization works equally well when compared to likelihood-based methods.

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to website

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[248]
J. Rodemann, F. Croppi, P. Arens, Y. Sale, J. Herbinger, B. Bischl, E. Hüllermeier, T. Augustin, C. J. Walsh and G. Casalicchio.
Explaining Bayesian Optimization by Shapley Values Facilitates Human-AI Collaboration.
Preprint (Mar. 2024). arXiv
Abstract

In today’s data-driven world, the proliferation of publicly available information raises security concerns due to the information leakage (IL) problem. IL involves unintentionally exposing sensitive information to unauthorized parties via observable system information. Conventional statistical approaches, which rely on estimating mutual information (MI) between observable and secret information for detecting ILs, face challenges of the curse of dimensionality, convergence, computational complexity, and MI misestimation. Though effective, emerging supervised machine learning based approaches to detect ILs are limited to binary system sensitive information and lack a comprehensive framework. To address these limitations, we establish a theoretical framework using statistical learning theory and information theory to quantify and detect IL accurately. Using automated machine learning, we demonstrate that MI can be accurately estimated by approximating the typically unknown Bayes predictor’s log-loss and accuracy. Based on this, we show how MI can effectively be estimated to detect ILs. Our method outperforms state-of-the-art baselines in an empirical study considering synthetic and real-world OpenSSL TLS server datasets.
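The link between MI and the Bayes predictor's log-loss rests on the identity I(X; Y) = H(Y) − H(Y | X), where H(Y | X) is the expected log-loss of the true conditional p(y | x). The sketch below checks this on a small discrete joint distribution. The distribution is hypothetical, and the exact conditional stands in for the learned model that would approximate the Bayes predictor in practice.

```python
from math import log

# Hypothetical joint distribution p(x, y) of an observable x and a binary secret y.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def marginal(joint, axis):
    """Marginal distribution over the given coordinate (0 for x, 1 for y)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(p * log(p) for p in dist.values() if p > 0)


def bayes_log_loss(joint):
    """Expected log-loss of the Bayes predictor p(y | x), which equals H(Y | X)."""
    p_x = marginal(joint, 0)
    return -sum(p * log(p / p_x[x]) for (x, _), p in joint.items() if p > 0)


def mutual_information(joint):
    """Direct definition of I(X; Y), used here only as a reference value."""
    p_x, p_y = marginal(joint, 0), marginal(joint, 1)
    return sum(p * log(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0)


# MI recovered from the marginal entropy of the secret and the Bayes log-loss.
mi_from_log_loss = entropy(marginal(joint, 1)) - bayes_log_loss(joint)
```

A value of `mi_from_log_loss` clearly above zero indicates that the observable carries information about the secret. With a learned model, the log-loss can only upper-bound H(Y | X), so the resulting MI estimate would be conservative.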

MCML Authors
Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to Profile Eyke Hüllermeier

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning

Link to website

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science


[247]
S. Wiegrebe, P. Kopper, R. Sonabend, B. Bischl and A. Bender.
Deep learning for survival analysis: a review.
Artificial Intelligence Review 57.65 (Feb. 2024). DOI
Abstract

The influx of deep learning (DL) techniques into the field of survival analysis in recent years has led to substantial methodological progress; for instance, learning from unstructured or high-dimensional data such as images, text or omics data. In this work, we conduct a comprehensive systematic review of DL-based methods for time-to-event analysis, characterizing them according to both survival- and DL-related attributes. In summary, the reviewed methods often address only a small subset of tasks relevant to time-to-event data—e.g., single-risk right-censored data—and neglect to incorporate more complex settings.

MCML Authors
Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to website

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)


[246]
C. A. Scholbeck, G. Casalicchio, C. Molnar, B. Bischl and C. Heumann.
Marginal Effects for Non-Linear Prediction Functions.
Data Mining and Knowledge Discovery 38 (Feb. 2024). DOI
Abstract

Beta coefficients for linear regression models represent the ideal form of an interpretable feature effect. However, for non-linear models and especially generalized linear models, the estimated coefficients cannot be interpreted as a direct feature effect on the predicted outcome. Hence, marginal effects are typically used as approximations for feature effects, either in the shape of derivatives of the prediction function or forward differences in prediction due to a change in a feature value. While marginal effects are commonly used in many scientific fields, they have not yet been adopted as a model-agnostic interpretation method for machine learning models. This may stem from their inflexibility as a univariate feature effect and their inability to deal with the non-linearities found in black box models. We introduce a new class of marginal effects termed forward marginal effects. We argue to abandon derivatives in favor of better-interpretable forward differences. Furthermore, we generalize marginal effects based on forward differences to multivariate changes in feature values. To account for the non-linearity of prediction functions, we introduce a non-linearity measure for marginal effects. We argue against summarizing feature effects of a non-linear prediction function in a single metric such as the average marginal effect. Instead, we propose to partition the feature space to compute conditional average marginal effects on feature subspaces, which serve as conditional feature effect estimates.
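A forward marginal effect is a forward difference in prediction, f(x + h) − f(x), for a chosen step h in one or several features. The sketch below computes it for a toy prediction function and contrasts the single average with conditional averages on feature subspaces. The function `predict`, the observations, and the step are hypothetical, and the paper's non-linearity measure and partitioning procedure are not implemented.

```python
from statistics import mean


def predict(x):
    """Hypothetical non-linear prediction function with an interaction term."""
    return x["age"] ** 2 / 100.0 + 2.0 * x["age"] * x["smoker"]


def forward_marginal_effect(predict, x, step):
    """Change in prediction when the features named in `step` are shifted by the given amounts."""
    shifted = dict(x)
    for name, h in step.items():
        shifted[name] = shifted[name] + h
    return predict(shifted) - predict(x)


observations = [
    {"age": 30.0, "smoker": 0.0},
    {"age": 50.0, "smoker": 0.0},
    {"age": 30.0, "smoker": 1.0},
    {"age": 50.0, "smoker": 1.0},
]
step = {"age": 10.0}  # a multivariate step would list several features here

fmes = [forward_marginal_effect(predict, x, step) for x in observations]
average_fme = mean(fmes)

# Conditional averages on the subspaces smoker == 0 and smoker == 1.
conditional_fme = {
    s: mean(f for f, x in zip(fmes, observations) if x["smoker"] == s)
    for s in (0.0, 1.0)
}
```

The individual effects are 7, 11, 27, and 31, so the average of 19 describes none of the observations well, while the conditional averages of 9 and 29 are much closer to the effects within each subspace.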

MCML Authors
Link to website

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[245]
C. A. Scholbeck.
Bridging gaps in interpretable machine learning: sensitivity analysis, marginal effects, and cluster explanations.
Dissertation 2024. DOI
Abstract

This thesis explores interpretable machine learning (IML) through six papers, bridging the gap between IML and model interpretation in other domains. It presents a generalized framework for model-agnostic interpretation methods, highlights potential pitfalls, and connects IML to sensitivity analysis used in fields like environmental modeling. A novel approach, forward marginal effects (FMEs), is introduced to interpret predictive models at multiple levels, supported by the R package fmeffects. The work also extends IML to unsupervised learning by proposing algorithm-agnostic cluster explanation methods, including two new techniques: SMART and IDEA, for analyzing feature contributions to clustering. (Shortened.)

MCML Authors
Christian Alexander Scholbeck

* Former Member


[244]
B. X. W. Liew, D. Rügamer and A. V. Birn-Jeffery.
Neuromechanical stabilisation of the centre of mass during running.
Gait and Posture 108 (Feb. 2024). DOI
Abstract

Background: Stabilisation of the centre of mass (COM) trajectory is thought to be important during running. There is emerging evidence of the importance of leg length and angle regulation during running, which could contribute to stability in the COM trajectory. The present study aimed to understand if leg length and angle stabilise the vertical and anterior-posterior (AP) COM displacements, and if the stability alters with running speeds.
Methods: Data for this study came from an open-source treadmill running dataset (n = 28). Leg length (m) was calculated by taking the resultant distance of the two-dimensional sagittal plane leg vector (from pelvis segment to centre of pressure). Leg angle was defined by the angle subtended between the leg vector and the horizontal surface. Leg length and angle were scaled to a standard deviation of one. Uncontrolled manifold analysis (UCM) was used to provide an index of motor abundance (IMA) in the stabilisation of the vertical and AP COM displacement.
Results: IMA_AP and IMA_vertical were largely destabilising and always stabilising, respectively. As speed increased, the peak destabilising effect on IMA_AP increased from −0.66 (0.18) at 2.5 m/s to −1.12 (0.18) at 4.5 m/s, and the peak stabilising effect on IMA_vertical increased from 0.69 (0.19) at 2.5 m/s to 1.18 (0.18) at 4.5 m/s.
Conclusion: Two simple parameters from a simple spring-mass model, leg length and angle, can explain the control behind running. The variability in leg length and angle helped stabilise the vertical COM, whilst maintaining constant running speed may rely more on inter-limb variation to adjust the horizontal COM accelerations.

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[243]
H. Weerts, F. Pfisterer, M. Feurer, K. Eggensperger, E. Bergman, N. Awad, J. Vanschoren, M. Pechenizkiy, B. Bischl and F. Hutter.
Can Fairness be Automated? Guidelines and Opportunities for Fairness-aware AutoML.
Journal of Artificial Intelligence Research 79 (Feb. 2024). DOI
Abstract

The field of automated machine learning (AutoML) introduces techniques that automate parts of the development of machine learning (ML) systems, accelerating the process and reducing barriers for novices. However, decisions derived from ML models can reproduce, amplify, or even introduce unfairness in our societies, causing harm to (groups of) individuals. In response, researchers have started to propose AutoML systems that jointly optimize fairness and predictive performance to mitigate fairness-related harm. However, fairness is a complex and inherently interdisciplinary subject, and solely posing it as an optimization problem can have adverse side effects. With this work, we aim to raise awareness among developers of AutoML systems about such limitations of fairness-aware AutoML, while also calling attention to the potential of AutoML as a tool for fairness research. We present a comprehensive overview of different ways in which fairness-related harm can arise and the ensuing implications for the design of fairness-aware AutoML. We conclude that while fairness cannot be automated, fairness-aware AutoML can play an important role in the toolbox of ML practitioners. We highlight several open technical challenges for future work in this direction. Additionally, we advocate for the creation of more user-centered assistive systems designed to tackle challenges encountered in fairness work.

MCML Authors
Link to Profile Matthias Feurer

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[242]
P. Gijsbers, M. L. P. Bueno, S. Coors, E. LeDell, S. Poirier, J. Thomas, B. Bischl and J. Vanschoren.
AMLB: an AutoML Benchmark.
Journal of Machine Learning Research 25.101 (Feb. 2024). URL
Abstract

Comparing different AutoML frameworks is notoriously challenging and often done incorrectly. We introduce an open and extensible benchmark that follows best practices and avoids common mistakes when comparing AutoML frameworks. We conduct a thorough comparison of 9 well-known AutoML frameworks across 71 classification and 33 regression tasks. The differences between the AutoML frameworks are explored with a multi-faceted analysis, evaluating model accuracy, its trade-offs with inference time, and framework failures. We also use Bradley-Terry trees to discover subsets of tasks where the relative AutoML framework rankings differ. The benchmark comes with an open-source tool that integrates with many AutoML frameworks and automates the empirical evaluation process end-to-end: from framework installation and resource allocation to in-depth evaluation. The benchmark uses public data sets, can be easily extended with other AutoML frameworks and tasks, and has a website with up-to-date results.

MCML Authors
Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[241]
D. Schalk, B. Bischl and D. Rügamer.
Privacy-Preserving and Lossless Distributed Estimation of High-Dimensional Generalized Additive Mixed Models.
Statistics and Computing 34.31 (Feb. 2024). DOI
Abstract

Various privacy-preserving frameworks that respect the individual’s privacy in the analysis of data have been developed in recent years. However, available model classes such as simple statistics or generalized linear models lack the flexibility required for a good approximation of the underlying data-generating process in practice. In this paper, we propose an algorithm for a distributed, privacy-preserving, and lossless estimation of generalized additive mixed models (GAMM) using component-wise gradient boosting (CWB). Making use of CWB allows us to reframe the GAMM estimation as a distributed fitting of base learners using the $L_2$-loss. In order to account for the heterogeneity of different data location sites, we propose a distributed version of a row-wise tensor product that allows the computation of site-specific (smooth) effects. Our adaption of CWB preserves all the important properties of the original algorithm, such as an unbiased feature selection and the feasibility to fit models in high-dimensional feature spaces, and yields equivalent model estimates as CWB on pooled data. Next to a derivation of the equivalence of both algorithms, we also showcase the efficacy of our algorithm on a distributed heart disease data set and compare it with state-of-the-art methods.

MCML Authors
Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[240]
M. Drton, A. Grosdos, I. Portakal and N. Sturma.
Algebraic Sparse Factor Analysis.
Preprint (Feb. 2024). arXiv
Abstract

Factor analysis is a statistical technique that explains correlations among observed random variables with the help of a smaller number of unobserved factors. In traditional full factor analysis, each observed variable is influenced by every factor. However, many applications exhibit interesting sparsity patterns, that is, each observed variable only depends on a subset of the factors. In this paper, we study such sparse factor analysis models from an algebro-geometric perspective. Under mild conditions on the sparsity pattern, we examine the dimension of the set of covariance matrices that corresponds to a given model. Moreover, we study algebraic relations among the covariances in sparse two-factor models. In particular, we identify cases in which a Gröbner basis for these relations can be derived via a 2-delightful term order and joins of toric edge ideals.

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[239]
T. Weber, M. Ingrisch, B. Bischl and D. Rügamer.
Constrained Probabilistic Mask Learning for Task-specific Undersampled MRI Reconstruction.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI
Abstract

Undersampling is a common method in Magnetic Resonance Imaging (MRI) to subsample the number of data points in k-space, reducing acquisition times at the cost of decreased image quality. A popular approach is to employ undersampling patterns following various strategies, e.g., variable density sampling or radial trajectories. In this work, we propose a method that directly learns the undersampling masks from data points, thereby also providing task- and domain-specific patterns. To solve the resulting discrete optimization problem, we propose a general optimization routine called ProM: A fully probabilistic, differentiable, versatile, and model-free framework for mask optimization that enforces acceleration factors through a convex constraint. Analyzing knee, brain, and cardiac MRI datasets with our method, we discover that different anatomic regions reveal distinct optimal undersampling masks, demonstrating the benefits of using custom masks, tailored for a downstream task. For example, ProM can create undersampling masks that maximize performance in downstream tasks like segmentation with networks trained on fully-sampled MRIs. Even with extreme acceleration factors, ProM yields reasonable performance while being more versatile than existing methods, paving the way for data-driven all-purpose mask generation.

MCML Authors
Link to Profile Michael Ingrisch

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[238]
G. Casalicchio and L. Burk.
Evaluation and Benchmarking.
Applied Machine Learning Using mlr3 in R I.3 (Jan. 2024). DOI
Abstract

Machine learning models can only be deployed in practice if they are robustly evaluated to estimate a model’s generalization performance, i.e. how well it will perform on new data. Resampling strategies, including cross-validation and bootstrapping, can be used to estimate the generalization performance. Models can be compared to one another using a benchmark experiment, which makes use of the same resampling strategies and measures to fairly compare models and to help practitioners decide which model to use in practice.
This chapter introduces resampling strategies in mlr3, including cross-validation, repeated cross-validation, leave-one-out, bootstrapping, and custom strategies. These are then demonstrated with the resample() function, which is used to resample a single learner with a given strategy. Benchmarking is then introduced and the benchmark() function is demonstrated for comparing multiple learners. The chapter concludes with a deep dive into binary classification evaluation, including ROC analysis and the Area Under the Curve metric.
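
To make the resampling idea concrete, the following is a minimal Python sketch of k-fold cross-validation. It is not the mlr3 API described in the chapter (which is an R package); the names k_fold_splits, cross_validate, fit_mean, and mse are hypothetical and chosen only for illustration, and the toy learner simply predicts the training mean.

```python
import random

def k_fold_splits(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield sorted(train_idx), sorted(test_idx)

def cross_validate(xs, ys, fit, score, k=5, seed=0):
    """Average a score over the k held-out folds."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(xs), k, seed):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model,
                            [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
    return sum(scores) / len(scores)

# Hypothetical toy learner: ignores the features and predicts the training mean.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

# Mean squared error of the constant prediction on held-out data.
def mse(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)

xs = list(range(20))
ys = [2.0 * x for x in xs]
cv_mse = cross_validate(xs, ys, fit_mean, mse, k=5)
```

Setting k equal to the number of rows gives leave-one-out resampling. A benchmark experiment, in this sketch, would amount to calling cross_validate with the same seed and k for each learner so that all learners are scored on identical splits.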

MCML Authors
Link to website

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Link to website

Lukas Burk

Statistical Learning and Data Science


[237]
M. Becker, L. Schneider and S. Fischer.
Hyperparameter Optimization.
Applied Machine Learning Using mlr3 in R II.4 (Jan. 2024). DOI
Abstract

Machine learning models include parameters and hyperparameters. The former refers to model coefficients that are estimated during training. The latter are parameters that are set by the user and affect how the model is fit or how it makes predictions. Setting hyperparameters manually is arduous and error-prone; hyperparameter optimization (HPO) instead automates this ‘tuning’ procedure to reduce bias. When performing HPO there are many considerations, including what tuning algorithm to use, how long to tune it for, and what measures to optimize. Moreover, users have to decide which hyperparameters to tune and for what configurations. Finally, one has to be careful to make use of nested resampling to prevent leakage of information from training to testing datasets that can occur when resampling and tuning simultaneously. This chapter begins by introducing mlr3tuning and its functionality for tuning learners. This includes Tuners for configuring and running optimization algorithms, TuningInstances for storing results, and Terminators for controlling when to stop the HPO process. The chapter provides a practical example of tuning hyperparameters of a support vector machine, including introducing logarithmic transformations. The AutoTuner class is also introduced, which is used for automating nested resampling to reduce bias in tuning.
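
As a rough illustration of tuning on a logarithmic scale with a fixed evaluation budget, here is a small Python sketch of random search. It does not use mlr3tuning; the functions sample_log_uniform, random_search, and toy_error are hypothetical, and toy_error is an invented stand-in for a validation error that happens to be smallest at cost = 10.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def random_search(objective, low, high, n_evals=50, seed=0):
    """Minimize objective over [low, high] by sampling on the log scale."""
    rng = random.Random(seed)
    best_value, best_score = None, math.inf
    for _ in range(n_evals):  # stopping rule: a fixed evaluation budget
        candidate = sample_log_uniform(low, high, rng)
        score = objective(candidate)
        if score < best_score:
            best_value, best_score = candidate, score
    return best_value, best_score

# Hypothetical validation error, minimized at cost = 10 (assumption for the demo).
def toy_error(cost):
    return (math.log10(cost) - 1.0) ** 2

best_cost, best_err = random_search(toy_error, 1e-4, 1e4)
```

Sampling the exponent rather than the raw value spreads the candidates evenly across orders of magnitude, which is the usual motivation for a logarithmic transformation of parameters such as an SVM cost.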

MCML Authors
Link to website

Marc Becker

Statistical Learning and Data Science

Link to website

Lennart Schneider

Statistical Learning and Data Science

Link to website

Sebastian Fischer

Statistical Learning and Data Science


[236]
L. Schneider and M. Becker.
Advanced Tuning Methods and Black Box Optimization.
Applied Machine Learning Using mlr3 in R II.5 (Jan. 2024). DOI
Abstract

Automated tuning can be error-prone, and it is very likely that models will crash in the tuning process. It is therefore essential to have reliable methods of encapsulating errors to prevent large experiments from failing and losing intermediate results. This chapter therefore begins by introducing fallback learners and encapsulation methods, which are returned to in ‘Advanced Technical Aspects of mlr3’.
Models can be tuned with respect to one or multiple measures. In general, when tuning to multiple measures there will be a trade-off between them, and therefore there will not be one optimal hyperparameter configuration; instead, the aim is to estimate configurations that are not Pareto-dominated by any other. This chapter introduces multi-objective tuning and concepts including Pareto optimality.
Some tuning methods are more advanced than others, including Hyperband and Bayesian optimization. Hyperband is a multi-fidelity tuner that makes use of fidelity parameters, which provide a trade-off between model runtime and performance accuracy. Bayesian optimization is a sample-efficient black-box optimization algorithm that is highly flexible and allows users fine-grained control over tuning large search spaces. This chapter introduces mlr3hyperband and the concept of fidelity parameters, and then mlr3mbo and bbotk to discuss black-box optimization and Bayesian optimization.
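
The notion of configurations that are not Pareto-dominated can be sketched in a few lines of Python. This is an illustration only, not code from the chapter: the helper names dominates and pareto_front are hypothetical, both measures are assumed to be minimized, and the (error, runtime) pairs are made-up values.

```python
def dominates(a, b):
    """True if a is no worse than b on every measure and strictly better on one.

    All measures are assumed to be minimized.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Made-up (classification error, runtime in seconds) pairs for five configurations.
configs = [(0.10, 5.0), (0.12, 2.0), (0.11, 6.0), (0.20, 1.0), (0.12, 3.0)]
front = pareto_front(configs)
# front == [(0.10, 5.0), (0.12, 2.0), (0.20, 1.0)]
```

Here (0.11, 6.0) is dropped because (0.10, 5.0) is better on both measures, and (0.12, 3.0) is dropped because (0.12, 2.0) ties on error and is faster. The three remaining configurations represent different trade-offs, none of which is uniformly best.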

MCML Authors
Link to website

Lennart Schneider

Statistical Learning and Data Science

Link to website

Marc Becker

Statistical Learning and Data Science


[235]
M. Binder and F. Pfisterer.
Sequential Pipelines.
Applied Machine Learning Using mlr3 in R II.7 (Jan. 2024). DOI
Abstract

Computational pipelines provide a layer of abstraction for swapping in and out different elements of the pipeline. In machine learning this can be useful for swapping algorithms, as well as common operations for data preprocessing and model post processing. Many real-world machine learning applications involve more than just fitting a single model at a time: It is often beneficial or even necessary to preprocess data for feature engineering and compatibility with learners. In many cases it is also useful to combine predictions of multiple models in ensembles. By defining these workflows as computational objects, it is then possible to treat them like models to be trained/tested and even tuned. This chapter introduces mlr3pipelines, a dataflow programming language that can be used to define machine learning processes from simple building blocks. The chapter focuses on sequential pipelines, in which data passes from one operation to another in a linear sequence and each operation has one input and output. The chapter introduces PipeOp and Graph, which are the building blocks of a pipeline, and provides some concrete examples with PCA.

MCML Authors
Link to website

Martin Binder

Statistical Learning and Data Science


[234]
M. Binder, F. Pfisterer, M. Becker and M. N. Wright.
Non-sequential Pipelines and Tuning.
Applied Machine Learning Using mlr3 in R II.8 (Jan. 2024). DOI
Abstract

Real-world applications often require complicated pipelines that do not progress sequentially. For example, many experiments have demonstrated that bagging is a powerful method to improve model performance. Bagging can be thought of as a non-sequential pipeline where a learner is replicated, each separate learner is trained and makes predictions, and their results are combined. This is non-sequential as data is not flowing sequentially through the pipeline but is instead passed to all learners (which may then subsample the data) and then recombined, thus creating a pipeline where operations have multiple inputs and outputs. Pipeline operations also have hyperparameters that can be set and tuned to improve model performance. Moreover, the choice of operations to include in a pipeline can also be tuned, known as combined algorithm selection and hyperparameter optimization (CASH).
This chapter looks at more advanced uses of mlr3pipelines. This is put into practice by demonstrating how to build a bagging and stacking pipeline from scratch, as well as how to access common pipelines that are readily available in mlr3pipelines. The chapter then looks at tuning pipelines and CASH.
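
The replicate, subsample, train, and combine steps of bagging can be sketched in plain Python as follows. This is not an mlr3pipelines graph; bagging_fit, bagging_predict, fit_1nn, and predict_1nn are hypothetical names, and a one-nearest-neighbour rule on a single numeric feature is assumed as the base learner purely to keep the example short.

```python
import random

def bagging_fit(xs, ys, fit, n_models=25, seed=0):
    """Train n_models copies of a learner, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: draw n row indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, predict, x):
    """Combine the individual predictions by averaging."""
    preds = [predict(m, x) for m in models]
    return sum(preds) / len(preds)

# Hypothetical base learner: one-nearest-neighbour on a single numeric feature.
def fit_1nn(xs, ys):
    return list(zip(xs, ys))

def predict_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]

xs = [float(i) for i in range(10)]
ys = [0.0] * 5 + [1.0] * 5
models = bagging_fit(xs, ys, fit_1nn)
low = bagging_predict(models, predict_1nn, 0.0)
high = bagging_predict(models, predict_1nn, 9.0)
```

Each model sees the same input data but a different resample of it, and the averaging step plays the role of the operation with multiple inputs that recombines the branches.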

MCML Authors
Link to website

Martin Binder

Statistical Learning and Data Science

Link to website

Marc Becker

Statistical Learning and Data Science


[233]
M. Lang, S. Fischer and R. Sonabend.
Advanced Technical Aspects of mlr3.
Applied Machine Learning Using mlr3 in R IV.10 (Jan. 2024). DOI
Abstract

Parallelization is often required to efficiently run machine learning models, which means models are run simultaneously on multiple CPU cores, CPUs, or computational nodes. This chapter begins by demonstrating how mlr3 uses the future package for parallelization and how different ‘plans’ can be applied to mlr3 experiments. In large machine learning experiments, it is common for a model to error during training or predicting. This is because the algorithms have to process arbitrary data, and not all eventualities can always be handled. It is therefore imperative to have robust methods for encapsulating and dealing with errors. This chapter builds on what has been briefly seen in Chapter 5 to discuss error handling and logging, including how to make use of fallback learners in experiments. Large experiments may also require data to be handled in different formats and to prevent all the data being loaded into memory. This chapter discusses different ‘backends’ that can be used for mlr3 Tasks, including interfacing with DuckDB and SQL. Finally, this chapter demonstrates how to extend classes in mlr3 by using the Measure class as an example. This may be of particular interest to readers who want to create new Measures or Learners.

MCML Authors

[232]
S. Fischer, M. Lang and M. Becker.
Large-Scale Benchmarking.
Applied Machine Learning Using mlr3 in R IV.11 (Jan. 2024). DOI
Abstract

In the field of machine learning, benchmark experiments are used to evaluate and compare the performance of algorithms. To draw robust conclusions, benchmark experiments often have to be ‘large-scale’, which means including many datasets, learners, and possibly measures. Finding datasets can be difficult and the choice of dataset impacts conclusions that can be drawn. Conducting large-scale benchmark experiments is also complex as they are usually computationally intensive. It is therefore common to make use of high-performance computing clusters to efficiently run the experiment. Finally, once these experiments are run, analysis of experiments usually requires more than a single score from a given performance measure, and therefore statistical tests are often employed.
This chapter introduces mlr3oml for interfacing the OpenML database for accessing data and tasks. It then continues by discussing how to run experiments on high-performance computing clusters using batchtools and mlr3batchmark. Finally, mlr3benchmark is introduced for statistical analysis including Friedman tests and critical difference diagrams.

MCML Authors
Link to website

Sebastian Fischer

Statistical Learning and Data Science

Link to website

Marc Becker

Statistical Learning and Data Science


[231]
S. Dandl, P. Biecek, G. Casalicchio and M. N. Wright.
Model Interpretation.
Applied Machine Learning Using mlr3 in R IV.12 (Jan. 2024). DOI
Abstract

The increasing availability of data and software frameworks to create predictive models has allowed the widespread adoption of machine learning in many applications. However, high predictive performance of such models often comes at the cost of interpretability. Machine learning interpretation methods can be useful for several purposes: 1) gaining global insights into a model (e.g., feature importance); 2) model improvement if flaws were identified (e.g., unexpected reliance on a certain feature); 3) understanding individual predictions. Several model-agnostic methods have been developed, including feature permutation, Shapley values, and LIME.
This chapter presents the packages iml, counterfactuals, and DALEX, which implement model-agnostic interpretation methods. Throughout the chapter an xgboost model is trained on the German credit dataset to understand how predictions are made and why. The chapter starts with discussing the iml package and the theory behind the discussed methods, as well as how to practically use the interface. It then moves to counterfactuals and the benefits of counterfactual analysis, including the methods What-If and MOC. Finally, DALEX is introduced, which includes similar methods to iml but with a different design, hence users can make use of either package depending on their design preference.

MCML Authors
Link to website

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science


[230]
J. Gertheiss, D. Rügamer, B. X. Liew and S. Greven.
Functional Data Analysis: An Introduction and Recent Developments.
Biometrical Journal (2024). DOI GitHub
Abstract

Functional data analysis (FDA) is a statistical framework that allows for the analysis of curves, images, or functions on higher dimensional domains. The goals of FDA, such as descriptive analyses, classification, and regression, are generally the same as for statistical analyses of scalar-valued or multivariate data, but FDA brings additional challenges due to the high- and infinite dimensionality of observations and parameters, respectively. This paper provides an introduction to FDA, including a description of the most common statistical analysis techniques, their respective software implementations, and some recent developments in the field. The paper covers fundamental concepts such as descriptives and outliers, smoothing, amplitude and phase variation, and functional principal component analysis. It also discusses functional regression, statistical inference with functional data, functional classification and clustering, and machine learning approaches for functional data analysis. The methods discussed in this paper are widely applicable in fields such as medicine, biophysics, neuroscience, and chemistry, and are increasingly relevant due to the widespread use of technologies that allow for the collection of functional data. Sparse functional data methods are also relevant for longitudinal data analysis. All presented methods are demonstrated using available software in R by analyzing a data set on human motion and motor control. To facilitate the understanding of the methods, their implementation, and hands-on application, the code for these practical examples is made available on GitHub.

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[229]
B. Bischl, R. Sonabend, L. Kotthoff and M. Lang.
Applied Machine Learning Using mlr3 in R.
Chapman and Hall/CRC (Jan. 2024). DOI
Abstract

mlr3 is an award-winning ecosystem of R packages that have been developed to enable state-of-the-art machine learning capabilities in R. Applied Machine Learning Using mlr3 in R gives an overview of flexible and robust machine learning methods, with an emphasis on how to implement them using mlr3 in R. It covers various key topics, including basic machine learning tasks, such as building and evaluating a predictive model; hyperparameter tuning of machine learning approaches to obtain peak performance; building machine learning pipelines that perform complex operations such as pre-processing followed by modelling followed by aggregation of predictions; and extending the mlr3 ecosystem with custom learners, measures, or pipeline components. The book is primarily aimed at researchers, practitioners, and graduate students who use machine learning or who are interested in using it. It can be used as a textbook for an introductory or advanced machine learning class that uses R, as a reference for people who work with machine learning methods, and in industry for exploratory experiments in machine learning.

MCML Authors
Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[228]
K. Röck.
Stochastic processes as surrogate models for dynamical systems in magnetic confinement fusion.
Dissertation 2024. DOI
Abstract

This thesis focuses on incorporating domain-specific knowledge into machine learning (ML) models for scientific applications, ensuring they accurately reflect underlying physical systems.
The first part introduces physics-consistent Gaussian processes (GPs), embedding physical laws directly into the model. These models address data governed by partial differential equations (PDEs) and Hamiltonian systems, preserving physical properties like symplecticity and enabling faster, long-term simulations. Applications include classifying chaotic trajectories and computing Lyapunov exponents.
The second part tackles data scarcity in plasma physics by proposing robust surrogate models for multivariate time series. Using Student-$t$ process regression, these models handle outliers effectively and facilitate data imputation and augmentation, ensuring reliable predictions for multichannel sensor data.
This work advances ML approaches for surrogate modeling, chaos analysis, and plasma physics. (Shortened.)

MCML Authors
Katharina Röck

Katharina Röck (née Rath)

Dr.

* Former Member


[227]
L. Kook, P. F. M. Baumann, O. Dürr, B. Sick and D. Rügamer.
Estimating Conditional Distributions with Neural Networks Using R Package deeptrafo.
Journal of Statistical Software 111.10 (2024). DOI
Abstract

Contemporary empirical applications frequently require flexible regression models for complex response types and large tabular or non-tabular, including image or text, data. Classical regression models either break down under the computational load of processing such data or require additional manual feature extraction to make these problems tractable. Here, we present deeptrafo, a package for fitting flexible regression models for conditional distributions using a tensorflow backend with numerous additional processors, such as neural networks, penalties, and smoothing splines. Package deeptrafo implements deep conditional transformation models (DCTMs) for binary, ordinal, count, survival, continuous, and time series responses, potentially with uninformative censoring. Unlike other available methods, DCTMs do not assume a parametric family of distributions for the response. Further, the data analyst may trade off interpretability and flexibility by supplying custom neural network architectures and smoothers for each term in an intuitive formula interface. We demonstrate how to set up, fit, and work with DCTMs for several response types. We further showcase how to construct ensembles of these models, evaluate models using inbuilt cross-validation, and use other convenience functions for DCTMs in several applications. Lastly, we discuss DCTMs in light of other approaches to regression with non-tabular data.

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[226]
M. M. Mandl, S. Hoffmann, S. Bieringer, A. E. Jacob, M. Kraft, S. Lemster and A.-L. Boulesteix.
Raising awareness of uncertain choices in empirical data analysis: A teaching concept towards replicable research practices.
PLOS Computational Biology 20.3 (2024). DOI
Abstract

Throughout their education and when reading the scientific literature, students may get the impression that there is a unique and correct analysis strategy for every data analysis task and that this analysis strategy will always yield a significant and noteworthy result. This expectation conflicts with a growing realization that there is a multiplicity of possible analysis strategies in empirical research, which will lead to overoptimism and nonreplicable research findings if it is combined with result-dependent selective reporting. Here, we argue that students are often ill-equipped for real-world data analysis tasks and unprepared for the dangers of selectively reporting the most promising results. We present a seminar course intended for advanced undergraduates and beginning graduate students of data analysis fields such as statistics, data science, or bioinformatics that aims to increase the awareness of uncertain choices in the analysis of empirical data and present ways to deal with these choices through theoretical modules and practical hands-on sessions.

MCML Authors
Link to Profile Anne-Laure Boulesteix

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine


[225]
Z. S. Dunias, B. Van Calster, D. Timmerman, A.-L. Boulesteix and M. van Smeden.
A comparison of hyperparameter tuning procedures for clinical prediction models: A simulation study.
Statistics in Medicine (Jan. 2024). DOI
Abstract

Tuning hyperparameters, such as the regularization parameter in Ridge or Lasso regression, is often aimed at improving the predictive performance of risk prediction models. In this study, various hyperparameter tuning procedures for clinical prediction models were systematically compared and evaluated in low-dimensional data. The focus was on out-of-sample predictive performance (discrimination, calibration, and overall prediction error) of risk prediction models developed using Ridge, Lasso, Elastic Net, or Random Forest. The influence of sample size, number of predictors and events fraction on performance of the hyperparameter tuning procedures was studied using extensive simulations. The results indicate important differences between tuning procedures in calibration performance, while generally showing similar discriminative performance. The one-standard-error rule for tuning applied to cross-validation (1SE CV) often resulted in severe miscalibration. Standard non-repeated and repeated cross-validation (both 5-fold and 10-fold) performed similarly well and outperformed the other tuning procedures. Bootstrap showed a slight tendency to more severe miscalibration than standard cross-validation-based tuning procedures. Differences between tuning procedures were larger for smaller sample sizes, lower events fractions and fewer predictors. These results imply that the choice of tuning procedure can have a profound influence on the predictive performance of prediction models. The results support the application of standard 5-fold or 10-fold cross-validation that minimizes out-of-sample prediction error. Despite an increased computational burden, we found no clear benefit of repeated over non-repeated cross-validation for hyperparameter tuning. We warn against the potentially detrimental effects on model calibration of the popular 1SE CV rule for tuning prediction models in low-dimensional settings.
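
The difference between standard cross-validation tuning and the 1SE CV rule can be illustrated with a short Python sketch. The functions select_min and select_one_se are hypothetical names, the candidates are assumed to be ordered from least to most regularized, and the error values below are invented for the demonstration rather than taken from the study.

```python
def select_min(cv_error):
    """Index of the candidate with the smallest mean cross-validated error."""
    return min(range(len(cv_error)), key=cv_error.__getitem__)

def select_one_se(cv_error, cv_se):
    """Index chosen by the one-standard-error rule.

    Candidates must be ordered from least to most regularized. The rule picks
    the most regularized candidate whose mean error lies within one standard
    error of the minimum.
    """
    best = select_min(cv_error)
    threshold = cv_error[best] + cv_se[best]
    return max(i for i, err in enumerate(cv_error) if err <= threshold)

# Invented penalty grid with mean CV errors and their standard errors.
lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]
cv_error = [0.30, 0.25, 0.26, 0.28, 0.40]
cv_se = [0.02, 0.02, 0.02, 0.02, 0.03]

lambda_min = lambdas[select_min(cv_error)]            # 0.01
lambda_1se = lambdas[select_one_se(cv_error, cv_se)]  # 0.1
```

In this toy grid the 1SE rule selects a penalty ten times larger than the error-minimizing one. Stronger shrinkage of this kind is one plausible route to the miscalibration reported above, although the sketch itself does not demonstrate that effect.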

MCML Authors
Link to Profile Anne-Laure Boulesteix

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine


[224]
M. Wünsch, C. Sauer, P. Callahan, L. C. Hinske and A.-L. Boulesteix.
From RNA sequencing measurements to the final results: a practical guide to navigating the choices and uncertainties of gene set analysis.
Wiley Interdisciplinary Reviews: Computational Statistics 16.1 (Jan. 2024). DOI
Abstract

Gene set analysis (GSA), a popular approach for analyzing high-throughput gene expression data, aims to identify sets of related genes that show significantly enriched or depleted expression patterns between different conditions. In recent years, a multitude of methods have been developed for this task. However, clear guidance is lacking: choosing the right method is the first hurdle a researcher is confronted with. No less challenging than overcoming this so-called method uncertainty is the procedure of preprocessing, from knowing which steps are required to selecting a corresponding approach from the plethora of valid options to create the accepted input object (data preprocessing uncertainty), with clear guidance again being scarce. Here, we provide a practical guide through all steps required to conduct GSA, beginning with a concise overview of a selection of established methods, including Gene Set Enrichment Analysis and Database for Annotation, Visualization, and Integrated Discovery (DAVID). We thereby lay a special focus on reviewing and explaining the necessary preprocessing steps for each method under consideration (e.g., the necessity of a transformation of the RNA sequencing data), an essential aspect that typically receives only limited attention in both existing reviews and applications. To raise awareness of the spectrum of uncertainties, our review is accompanied by an extensive overview of the literature on valid approaches for each step and illustrative R code demonstrating the complex analysis pipelines. It ends with a discussion and recommendations to both users and developers to ensure that the results of GSA are, despite the above-mentioned uncertainties, replicable and transparent.

MCML Authors
Link to website

Christina Sauer (née Nießl)

Biometry in Molecular Medicine

Link to Profile Anne-Laure Boulesteix

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine


[223]
L. Bothmann, K. Peters and B. Bischl.
What Is Fairness? On the Role of Protected Attributes and Fictitious Worlds.
Preprint (Jan. 2024). arXiv
Abstract

A growing body of literature in fairness-aware machine learning (fairML) aims to mitigate machine learning (ML)-related unfairness in automated decision-making (ADM) by defining metrics that measure fairness of an ML model and by proposing methods to ensure that trained ML models achieve low scores on these metrics. However, the underlying concept of fairness, i.e., the question of what fairness is, is rarely discussed, leaving a significant gap between centuries of philosophical discussion and the recent adoption of the concept in the ML community. In this work, we try to bridge this gap by formalizing a consistent concept of fairness and by translating the philosophical considerations into a formal framework for the training and evaluation of ML models in ADM systems. We argue that fairness problems can arise even without the presence of protected attributes (PAs), and point out that fairness and predictive performance are not irreconcilable opposites, but that the latter is necessary to achieve the former. Furthermore, we argue why and how causal considerations are necessary when assessing fairness in the presence of PAs by proposing a fictitious, normatively desired (FiND) world in which PAs have no causal effects. In practice, this FiND world must be approximated by a warped world in which the causal effects of the PAs are removed from the real-world data. Finally, we achieve greater linguistic clarity in the discussion of fairML. We outline algorithms for practical applications and present illustrative experiments on COMPAS data.

MCML Authors
Link to website

Ludwig Bothmann

Dr.

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[222]
M. M. Mandl, A. S. Becker-Pennrich, L. C. Hinske, S. Hoffmann and A.-L. Boulesteix.
Addressing researcher degrees of freedom through minP adjustment.
Preprint (Jan. 2024). arXiv
Abstract

When different researchers study the same research question using the same dataset they may obtain different and potentially even conflicting results. This is because there is often substantial flexibility in researchers’ analytical choices, an issue also referred to as “researcher degrees of freedom”. Combined with selective reporting of the smallest p-value or largest effect, researcher degrees of freedom may lead to an increased rate of false positive and overoptimistic results. In this paper, we address this issue by formalizing the multiplicity of analysis strategies as a multiple testing problem. As the test statistics of different analysis strategies are usually highly dependent, a naive approach such as the Bonferroni correction is inappropriate because it leads to an unacceptable loss of power. Instead, we propose using the “minP” adjustment method, which takes potential test dependencies into account and approximates the underlying null distribution of the minimal p-value through a permutation-based procedure. This procedure is known to achieve more power than simpler approaches while ensuring weak control of the family-wise error rate. We illustrate our approach for addressing researcher degrees of freedom by applying it to a study on the impact of perioperative paO2 on post-operative complications after neurosurgery. A total of 48 analysis strategies are considered and adjusted using the minP procedure. This approach allows researchers to selectively report the result of the analysis strategy yielding the most convincing evidence, while controlling the type 1 error – and thus the risk of publishing false positive results that may not be replicable.
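
The permutation-based minP idea described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three analysis strategies, all function names, and the simulated data are hypothetical, whereas the paper considers 48 strategies on clinical data. Per-strategy p-values are taken from the same set of permutations, and the adjusted p-value is the share of permutations whose smallest p-value is at most the observed smallest p-value.

```python
import numpy as np

# Hypothetical analysis strategies: each maps (exposure x, binary outcome y)
# to a test statistic. They are illustrative and not taken from the paper.
def mean_diff(x, y):
    return x[y == 1].mean() - x[y == 0].mean()

def log_mean_diff(x, y):
    return mean_diff(np.log(x), y)

def median_split_diff(x, y):
    high = x > np.median(x)
    return y[high].mean() - y[~high].mean()

def minp_adjust(x, y, strategies, n_perm=1000, seed=0):
    """Return (raw permutation p-values, minP-adjusted p-value of the smallest)."""
    rng = np.random.default_rng(seed)
    n_rows = n_perm + 1
    # Row 0 holds the observed statistics, rows 1..n_perm the permuted ones.
    stats = np.empty((n_rows, len(strategies)))
    stats[0] = [abs(s(x, y)) for s in strategies]
    for b in range(1, n_rows):
        y_perm = rng.permutation(y)  # breaks any association between x and y
        stats[b] = [abs(s(x, y_perm)) for s in strategies]
    # Two-sided permutation p-value of every row, separately per strategy:
    # the fraction of rows whose statistic is at least as large.
    p = np.empty_like(stats)
    for j in range(stats.shape[1]):
        col_sorted = np.sort(stats[:, j])
        n_smaller = np.searchsorted(col_sorted, stats[:, j], side="left")
        p[:, j] = (n_rows - n_smaller) / n_rows
    # Null distribution of the minimal p-value across strategies.
    min_p = p.min(axis=1)
    adjusted = float(np.mean(min_p <= min_p[0]))
    return p[0], adjusted

# Simulated data with no true effect (assumption for illustration only).
rng = np.random.default_rng(1)
x = rng.lognormal(size=80)
y = rng.integers(0, 2, size=80)
raw_p, adjusted_p = minp_adjust(x, y, [mean_diff, log_mean_diff, median_split_diff])
print("raw p-values:", raw_p, "minP-adjusted:", adjusted_p)
```

Because the strategies are evaluated on the same permuted outcomes, their dependence is carried into the null distribution of the minimal p-value, which is the property the abstract contrasts with the Bonferroni correction.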

MCML Authors
Link to Profile Anne-Laure Boulesteix

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine


[221]
H. A. Gündüz, S. Giri, M. Binder, B. Bischl and M. Rezaei.
Uncertainty Quantification for Deep Learning Models Predicting the Regulatory Activity of DNA Sequences.
ICMLA 2023 - 22nd IEEE International Conference on Machine Learning and Applications. Jacksonville, Florida, USA, Dec 15-17, 2023. DOI
Abstract

The field of computational biology has been enhanced by deep learning models, which hold great promise for revolutionizing domains such as protein folding and drug discovery. Recent studies have underscored the tremendous potential of these models, particularly in the realm of gene regulation and the more profound understanding of the non-coding regions of the genome. On the other hand, this raises significant concerns about the reliability and efficacy of such models, which have their own biases by design, along with those learned from the data. Uncertainty quantification allows us to measure where the system is confident and know when it can be trusted. In this paper, we study several uncertainty quantification methods with respect to a multi-target regression task, specifically predicting regulatory activity profiles using DNA sequence data. Using the Basenji model, we investigate how such methods can improve in-domain generalization and out-of-distribution detection, and provide coverage guarantees on prediction intervals.

MCML Authors
Link to website

Martin Binder

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Link to website

Mina Rezaei

Dr.

Statistical Learning and Data Science


[220]
N. Sturma, C. Squires, M. Drton and C. Uhler.
Unpaired Multi-Domain Causal Representation Learning.
NeurIPS 2023 - 37th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Dec 10-16, 2023. URL
Abstract

The goal of causal representation learning is to find a representation of data that consists of causally related latent variables. We consider a setup where one has access to data from multiple domains that potentially share a causal representation. Crucially, observations in different domains are assumed to be unpaired, that is, we only observe the marginal distribution in each domain but not their joint distribution. In this paper, we give sufficient conditions for identifiability of the joint distribution and the shared causal graph in a linear setup. Identifiability holds if we can uniquely recover the joint distribution and the shared causal representation from the marginal distributions in each domain. We transform our results into a practical method to recover the shared latent causal graph.

MCML Authors
Link to Profile Mathias Drton

Mathias Drton

Prof. Dr.

Mathematical Statistics


[219]
Y. Zhang, Y. Li, H. Brown, M. Rezaei, B. Bischl, P. Torr, A. Khakzar and K. Kawaguchi.
AttributionLab: Faithfulness of Feature Attribution Under Controllable Environments.
XAIA @NeurIPS 2023 - Workshop XAI in Action: Past, Present, and Future Applications at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023). New Orleans, LA, USA, Dec 10-16, 2023. URL
Abstract

Feature attribution explains neural network outputs by identifying relevant input features. How do we know if the identified features are indeed relevant to the network? This notion is referred to as faithfulness, an essential property that reflects the alignment between the identified (attributed) features and the features used by the model. One recent trend to test faithfulness is to design the data such that we know which input features are relevant to the label and then train a model on the designed data. Subsequently, the identified features are evaluated by comparing them with these designed ground truth features. However, this idea has the underlying assumption that the neural network learns to use all and only these designed features, while there is no guarantee that the learning process trains the network in this way. In this paper, we solve this missing link by explicitly designing the neural network by manually setting its weights, along with designing data, so we know precisely which input features in the dataset are relevant to the designed network. Thus, we can test faithfulness in AttributionLab, our designed synthetic environment, which serves as a sanity check and is effective in filtering out attribution methods. If an attribution method is not faithful in a simple controlled environment, it can be unreliable in more complex scenarios. Furthermore, the AttributionLab environment serves as a laboratory for controlled experiments through which we can study feature attribution methods, identify issues, and suggest potential improvements.
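
The core idea of this abstract, setting network weights by hand so that the relevant input features are known exactly, can be illustrated with a toy sketch. Everything below is an assumption for illustration and not part of AttributionLab itself: the one-layer ReLU model, the hand-set weight vector `W`, the use of gradient-times-input as the attribution method, and the helper `attribution_mass_on_truth`.

```python
import numpy as np

# Hand-set weights of a toy one-layer ReLU network (hypothetical design):
# only features 0 and 1 influence the output, features 2 and 3 are ignored.
W = np.array([1.0, -2.0, 0.0, 0.0])

def model(x):
    return np.maximum(x @ W, 0.0)

def gradient_times_input(x):
    # The gradient of ReLU(x @ W) with respect to x is W when the
    # pre-activation is positive, and zero otherwise.
    active = float(x @ W > 0)
    return active * W * x

def attribution_mass_on_truth(attr, relevant):
    # Share of absolute attribution that falls on the designed ground-truth features.
    total = np.abs(attr).sum()
    return np.abs(attr[relevant]).sum() / total if total > 0 else 0.0

x = np.array([3.0, 1.0, 5.0, -4.0])
relevant = np.array([True, True, False, False])
attr = gradient_times_input(x)
score = attribution_mass_on_truth(attr, relevant)
print("attribution:", attr, "mass on designed features:", score)
```

Because the weights are fixed by construction rather than learned, any attribution assigned to features 2 and 3 would be a failure of the attribution method and not of the training process.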

MCML Authors
Link to website

Yawei Li

Statistical Learning and Data Science

Link to website

Mina Rezaei

Dr.

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Ashkan Khakzar

Dr.

* Former Member


[218]
Z. Zhang, H. Yang, B. Ma, D. Rügamer and E. Nie.
Baby's CoThought: Leveraging Large Language Models for Enhanced Reasoning in Compact Models.
CoNLL 2023 - BabyLM Challenge at 27th Conference on Computational Natural Language Learning. Singapore, Dec 06-10, 2023. DOI GitHub
Abstract

Large Language Models (LLMs) demonstrate remarkable performance on a variety of natural language understanding (NLU) tasks, primarily due to their in-context learning ability. This ability could be applied to building babylike models, i.e. models at small scales, improving training efficiency. In this paper, we propose a ‘CoThought’ pipeline, which efficiently trains smaller ‘baby’ language models (BabyLMs) by leveraging the Chain of Thought prompting of LLMs. Our pipeline restructures a dataset of less than 100M in size using GPT-3.5-turbo, transforming it into task-oriented, human-readable texts that are comparable to the school texts for language learners. The BabyLM is then pretrained on this restructured dataset in a RoBERTa fashion. In evaluations across 4 benchmarks, our BabyLM outperforms the vanilla RoBERTa in 10 linguistic, NLU, and question-answering tasks by more than 3 points, showing a superior ability to extract contextual information. These results suggest that compact LMs pretrained on small, LLM-restructured data can better understand tasks and achieve improved performance.

MCML Authors
Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning


[217]
E. Garces Arias, V. Pai, M. Schöffel, C. Heumann and M. Aßenmacher.
Automatic transcription of handwritten Old Occitan language.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

While existing neural network-based approaches have shown promising results in Handwritten Text Recognition (HTR) for high-resource languages and standardized/machine-written text, their application to low-resource languages often presents challenges, resulting in reduced effectiveness. In this paper, we propose an innovative HTR approach that leverages the Transformer architecture for recognizing handwritten Old Occitan language. Given the limited availability of data, which comprises only word pairs of graphical variants and lemmas, we develop and rely on elaborate data augmentation techniques for both text and image data. Our model combines a custom-trained Swin image encoder with a BERT text decoder, which we pre-train using a large-scale augmented synthetic data set and fine-tune on the small human-labeled data set. Experimental results reveal that our approach surpasses the performance of current state-of-the-art models for Old Occitan HTR, including open-source Transformer-based models such as a fine-tuned TrOCR and commercial applications like Google Cloud Vision. To nurture further research and development, we make our models, data sets, and code publicly available.

MCML Authors
Link to website

Esteban Garces Arias

Statistical Learning and Data Science

Link to website

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


[216]
F. Karl, T. Pielok, J. Moosbauer, F. Pfisterer, S. Coors, M. Binder, L. Schneider, J. Thomas, J. Richter, M. Lang, E. C. Garrido-Merchán, J. Branke and B. Bischl.
Multi-Objective Hyperparameter Optimization in Machine Learning—An Overview.
ACM Transactions on Evolutionary Learning and Optimization 3.4 (Dec. 2023). DOI
Abstract

Hyperparameter optimization constitutes a large part of typical modern machine learning (ML) workflows. This arises from the fact that ML methods and corresponding preprocessing steps often only yield optimal performance when hyperparameters are properly tuned. But in many applications, we are not interested in optimizing ML pipelines solely for predictive accuracy; additional metrics or constraints must be considered when determining an optimal configuration, resulting in a multi-objective optimization problem. This is often neglected in practice, due to a lack of knowledge and readily available software implementations for multi-objective hyperparameter optimization. In this work, we introduce the reader to the basics of multi-objective hyperparameter optimization and motivate its usefulness in applied ML. Furthermore, we provide an extensive survey of existing optimization strategies from the domains of evolutionary algorithms and Bayesian optimization. We illustrate the utility of multi-objective optimization in several specific ML applications, considering objectives such as operating conditions, prediction time, sparseness, fairness, interpretability, and robustness.
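
In a multi-objective setting there is usually no single best configuration, only a set of non-dominated trade-offs (the Pareto front). The sketch below shows how such a set could be extracted from already evaluated configurations. It is a minimal illustration, not a method from the survey: the function name `pareto_mask` and the example values for the two objectives (validation error and prediction time) are made up, and both objectives are assumed to be minimized.

```python
import numpy as np

def pareto_mask(costs):
    """Boolean mask of non-dominated rows, assuming every objective is minimized."""
    costs = np.asarray(costs, dtype=float)
    mask = np.ones(len(costs), dtype=bool)
    for i, c in enumerate(costs):
        # Row i is dominated if another row is no worse in every objective
        # and strictly better in at least one.
        dominated_by = np.all(costs <= c, axis=1) & np.any(costs < c, axis=1)
        mask[i] = not dominated_by.any()
    return mask

# Hypothetical evaluated configurations: (validation error, prediction time in ms).
evaluated = [
    [0.10, 5.0],
    [0.12, 2.0],
    [0.15, 1.0],
    [0.13, 3.0],  # worse than [0.12, 2.0] in both objectives
    [0.10, 6.0],  # same error as [0.10, 5.0] but slower
]
front = pareto_mask(evaluated)
print(front)  # [ True  True  True False False]
```

The quadratic pairwise comparison is adequate for a few hundred evaluated configurations; it only filters existing results and does not propose new configurations the way an evolutionary or Bayesian optimizer would.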

MCML Authors
Link to website

Florian Karl

Statistical Learning and Data Science

Link to website

Tobias Pielok

Statistical Learning and Data Science

Link to website

Martin Binder

Statistical Learning and Data Science

Link to website

Lennart Schneider

Statistical Learning and Data Science

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science


[215]
A. T. Stüber, S. Coors, B. Schachtner, T. Weber, D. Rügamer, A. Bender, A. Mittermeier, O. Öcal, M. Seidensticker, J. Ricke, B. Bischl and M. Ingrisch.
A comprehensive machine learning benchmark study for radiomics-based survival analysis of CT imaging data in patients with hepatic metastases of CRC.
Investigative Radiology 58.12 (Dec. 2023). DOI
Abstract

Optimizing a machine learning (ML) pipeline for radiomics analysis involves numerous choices in data set composition, preprocessing, and model selection. Objective identification of the optimal setup is complicated by correlated features, interdependency structures, and a multitude of available ML algorithms. Therefore, we present a radiomics-based benchmarking framework to optimize a comprehensive ML pipeline for the prediction of overall survival. This study is conducted on an image set of patients with hepatic metastases of colorectal cancer, for which radiomics features of the whole liver and of metastases from computed tomography images were calculated. A mixed model approach was used to find the optimal pipeline configuration and to identify the added prognostic value of radiomics features.

MCML Authors
Link to website

Theresa Stüber

Clinical Data Science in Radiology

Link to website

Balthasar Schachtner

Dr.

Clinical Data Science in Radiology

Link to Profile David Rügamer

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Link to website

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

Link to website

Andreas Mittermeier

Dr.

Clinical Data Science in Radiology