Research Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

holds the Chair of Statistical Learning and Data Science at the Department of Statistics at LMU Munich.

He studied Computer Science, Artificial Intelligence and Data Sciences in Hamburg, Edinburgh and Dortmund and obtained his PhD from Dortmund Technical University in 2013 with a thesis on “Model and Algorithm Selection in Statistical Learning and Optimization”. His research interests include AutoML, Model Selection, Interpretable ML, as well as the development of Statistical Software. He is a member of ELLIS in general, and a faculty member of ELLIS Munich, an active developer of several R-packages, leads the “mlr” (Machine Learning in R) engineering group and is co-founder of the science platform “OpenML” for open and reproducible ML. Furthermore, he leads the Munich branch of the Fraunhofer ADA Lovelace Center for Analytics, Data & Applications, i.e. a new type of research infrastructure to support businesses in Bavaria, especially in the SME sector.

Team members @MCML

PostDocs

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Andreas Bender

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Machine Learning Consulting Unit (MLCU)

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Mina Rezaei

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

PhD Students

Helen Alber

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Salem Ayadi

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Marc Becker

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Coco Bögel

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Philip Amir Boustani

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Lukas Burk

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Fiona Katharina Ewald

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Sebastian Fischer

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Esteban Garces Arias

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Florian Karl

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Chris Kolb

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Yawei Li

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Tobias Pielok

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Simon Rittel

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Xiao-Yin To

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Lisa Wimmer

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Zizheng Zhang

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Publications @MCML

2025

[295]

E. Özeren, A. Ulbrich, S. Filimon, D. Rügamer and A. Bender.
Enhancing Traffic Accident Classifications: Application of NLP Methods for City Safety.
ECML-PKDD 2025 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Porto, Portugal, Sep 15-19, 2025. To be published. Preprint available. arXiv

Abstract

A comprehensive understanding of traffic accidents is essential for improving city safety and informing policy decisions. In this study, we analyze traffic incidents in Munich to identify patterns and characteristics that distinguish different types of accidents. The dataset consists of both structured tabular features, such as location, time, and weather conditions, as well as unstructured free-text descriptions detailing the circumstances of each accident. Each incident is categorized into one of seven predefined classes. To assess the reliability of these labels, we apply NLP methods, including topic modeling and few-shot learning, which reveal inconsistencies in the labeling process. These findings highlight potential ambiguities in accident classification and motivate a refined predictive approach. Building on these insights, we develop a classification model that achieves high accuracy in assigning accidents to their respective categories. Our results demonstrate that textual descriptions contain the most informative features for classification, while the inclusion of tabular data provides only marginal improvements. These findings emphasize the critical role of free-text data in accident analysis and highlight the potential of transformer-based models in improving classification reliability.

MCML Authors

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

[294]

L. Schneider, B. Bischl and M. Feurer.
Overtuning in Hyperparameter Optimization.
AutoML 2025 - Methods Track - Methods Track at the International Conference on Automated Machine Learning. New York City, NY, USA, Sep 08-11, 2025. To be published. URL

Abstract

Hyperparameter optimization (HPO) aims to identify an optimal hyperparameter configuration (HPC) such that the resulting model generalizes well to unseen data. Since directly optimizing the expected generalization error is impossible, resampling techniques like holdout validation or cross-validation are used as proxy measures in HPO. However, this implicitly assumes that the HPC minimizing validation error will also yield the best true generalization performance. Given that our inner validation error estimate is inherently stochastic and depends on the resampling, we study: Can excessive optimization of the validation error lead to a similarly detrimental effect as excessive optimization of the empirical risk of an ML model? This phenomenon, which we refer to as overtuning, represents a form of overfitting at the HPO level. Despite its potential impact, overtuning has received limited attention in the HPO and automated machine learning (AutoML) literature. We first formally define overtuning and distinguish it from related concepts such as meta-overfitting. We then reanalyze large-scale HPO benchmark data, assessing how frequently overtuning occurs and its practical relevance. Our findings suggest that overtuning is more common than expected, although often mild. However, in 10% of cases, severe overtuning results in selecting an HPC whose generalization performance is worse than the default HPC. We further examine how factors such as the chosen performance metric, resampling method, dataset size, learning algorithm, and optimization strategy influence overtuning and discuss potential mitigation strategies. Our results highlight the need to raise awareness of overtuning, particularly in the small-data regime, indicating that further mitigation strategies should be studied.

MCML Authors

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Matthias Feurer

Prof. Dr.

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Statistical Learning and Data Science

[293]

B. Ma, B. Yoztyurk, A.-C. Haensch, X. Wang, M. Herklotz, F. Kreuter, B. Plank and M. Aßenmacher.
Algorithmic Fidelity of Large Language Models in Generating Synthetic German Public Opinions: A Case Study.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

In recent research, large language models (LLMs) have been increasingly used to investigate public opinions. This study investigates the algorithmic fidelity of LLMs, i.e., the ability to replicate the socio-cultural context and nuanced opinions of human participants. Using open-ended survey data from the German Longitudinal Election Studies (GLES), we prompt different LLMs to generate synthetic public opinions reflective of German subpopulations by incorporating demographic features into the persona prompts. Our results show that Llama performs better than other LLMs at representing subpopulations, particularly when there is lower opinion diversity within those groups. Our findings further reveal that the LLM performs better for supporters of left-leaning parties like The Greens and The Left compared to other parties, and matches the least with the right-party AfD. Additionally, the inclusion or exclusion of specific variables in the prompts can significantly impact the models’ predictions. These findings underscore the importance of aligning LLMs to more effectively model diverse public opinions while minimizing political biases and enhancing robustness in representativeness.

MCML Authors

Bolei Ma

Social Data Science and AI

Anna-Carolina Haensch

Dr.

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Xinpeng Wang

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Frauke Kreuter

Prof. Dr.

Social Data Science and AI

Barbara Plank

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

AI and Computational Linguistics

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science

[292]

S. Urchs, V. Thurner, M. Aßenmacher, C. Heumann and S. Thiemichen.
taz2024full: Analysing German Newspapers for Gender Bias and Discrimination across Decades.
ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Open-access corpora are essential for advancing natural language processing (NLP) and computational social science (CSS). However, large-scale resources for German remain limited, restricting research on linguistic trends and societal issues such as gender bias. We present taz2024full, the largest publicly available corpus of German newspaper articles to date, comprising over 1.8 million texts from taz, spanning 1980 to 2024. As a demonstration of the corpus’s utility for bias and discrimination research, we analyse gender representation across four decades of reporting. We find a consistent overrepresentation of men, but also a gradual shift toward more balanced coverage in recent years. Using a scalable, structured analysis pipeline, we provide a foundation for studying actor mentions, sentiment, and linguistic framing in German journalistic texts. The corpus supports a wide range of applications, from diachronic language analysis to critical media studies, and is freely available to foster inclusive and reproducible research in German-language NLP.

MCML Authors

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[291]

E. Garces Arias, H. Blocher, J. Rodemann, M. Li, C. Heumann and M. Aßenmacher.
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework.
GEM2 @ACL 2025 - 4th Workshop on Generation, Evaluation and Metrics at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Vienna, Austria, Jul 27-Aug 01, 2025. To be published. Preprint available. arXiv

Abstract

Open-ended text generation has become a prominent task in natural language processing due to the rise of powerful (large) language models. However, evaluating the quality of these models and the employed decoding strategies remains challenging because of trade-offs among widely used metrics such as coherence, diversity, and perplexity. Decoding methods often excel in some metrics while underperforming in others, complicating the establishment of a clear ranking. In this paper, we present novel ranking strategies within this multicriteria framework. Specifically, we employ benchmarking approaches based on partial orderings and present a new summary metric designed to balance existing automatic indicators, providing a more holistic evaluation of text generation quality. Furthermore, we discuss the alignment of these approaches with human judgments. Our experiments demonstrate that the proposed methods offer a robust way to compare decoding strategies, exhibit similarities with human preferences, and serve as valuable tools in guiding model selection for open-ended text generation tasks. Finally, we suggest future directions for improving evaluation methodologies in text generation. Our codebase, datasets, and models are publicly available.

MCML Authors

Esteban Garces Arias

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[290]

T. Pielok, B. Bischl and D. Rügamer.
Revisiting Unbiased Implicit Variational Inference.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. arXiv URL

Abstract

Recent years have witnessed growing interest in semi-implicit variational inference (SIVI) methods due to their ability to rapidly generate samples from highly complicated distributions. However, since the likelihood of these samples is non-trivial to estimate in high dimensions, current research focuses on finding effective SIVI training routines. While unbiased implicit variational inference (UIVI) has largely been dismissed as imprecise and computationally prohibitive because of its inner MCMC loop, we revisit this method and identify key shortcomings. In particular, we show that UIVI’s MCMC loop can be effectively replaced via importance sampling and the optimal proposal distribution can be learned stably by minimizing an expected forward Kullback–Leibler divergence without bias. Our refined approach demonstrates superior performance or parity with state-of-the-art methods on established SIVI benchmarks.

MCML Authors

Tobias Pielok

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[289]

R. Debelak, T. K. Koch, M. Aßenmacher and C. Stachl.
From Embeddings to Explainability: A Tutorial on Large-Language-Model-Based Text Analysis for Behavioral Scientists.
Advances in Methods and Practices in Psychological Science 8.3 (Jul. 2025). DOI

Abstract

Large language models (LLMs) are transforming research in psychology and the behavioral sciences by enabling advanced text analysis at scale. Their applications range from the analysis of social media posts to infer psychological traits to the automated scoring of open-ended survey responses. However, despite their potential, many behavioral scientists struggle to integrate LLMs into their research because of the complexity of text modeling. In this tutorial, we aim to provide an accessible introduction to LLM-based text analysis, focusing on the Transformer architecture. We guide researchers through the process of preparing text data, using pretrained Transformer models to generate text embeddings, fine-tuning models for specific tasks such as text classification, and applying interpretability methods, such as Shapley additive explanations and local interpretable model-agnostic explanations, to explain model predictions. By making these powerful techniques more approachable, we hope to empower behavioral scientists to leverage LLMs in their research, unlocking new opportunities for analyzing and interpreting textual data.

MCML Authors

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science

[288]

B. Bischl, G. Casalicchio, T. Das, M. Feurer, S. Fischer, P. Gijsbers, S. Mukherjee, A. C. Müller, L. Németh, L. Oala, L. Purucker, S. Ravi, J. N. van Rijn, P. Singh, J. Vanschoren, J. van der Velde and M. Wever.
OpenML: Insights from 10 years and more than a thousand papers.
Patterns In Press, Corrected Proof (Jul. 2025). DOI

Abstract

OpenML is an open-source platform that democratizes machine-learning evaluation by enabling anyone to share datasets in uniform standards, define precise machine-learning tasks, and automatically share detailed workflows and model evaluations. More than just a platform, OpenML fosters a collaborative ecosystem where scientists create new tools, launch initiatives, and establish standards to advance machine learning. Over the past decade, OpenML has inspired over 1,500 publications across diverse fields, from scientists releasing new datasets and benchmarking new models to educators teaching reproducible science. Looking back, we detail and describe the platform’s impact by looking at usage and citations. We share lessons from a decade of building, maintaining, and expanding OpenML, highlighting how rich metadata, collaborative benchmarking, and open interfaces have enhanced research and interoperability. Looking ahead, we cover ongoing efforts to expand OpenML’s capabilities and integrate with other platforms, informing a broader vision for open-science infrastructure for machine learning.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Sebastian Fischer

Statistical Learning and Data Science

[287]

E. Walter, T. Brock, P. Lahoud, N. Werner, F. Czaja, A. Tichy, C. Bumm, A. Bender, A. Castro, W. Teughels, F. Schwendicke and M. Folwaczny.
Predictive modeling for step II therapy response in periodontitis - model development and validation.
npj Digital Medicine 8.445 (Jul. 2025). DOI

Abstract

Steps I and II periodontal therapy is the first-line treatment for periodontal disease, but has varying success. This study aimed to develop machine learning models to predict changes in periodontal probing depth (PPD) after step II therapy using patient-, tooth-, and site-specific clinical covariates. Models accurately predicted that healthy sites stay healthy, but performed suboptimally for diseased sites. Tuning improved performance, with PPD, tooth-site, and tooth-type identified as key predictors. Pocket closure was predicted with fair accuracy, with baseline PPD as the most relevant covariate. Models predicted improving pockets well but underperformed for non-responding sites, with antibiotic treatment and tooth type being the most influential features. While predictive performance for step II periodontal therapy based on routine clinical data remains limited, models can stratify periodontal sites into meaningful categories and estimate the probability of pocket improvement. They provide a foundation for site-specific outcome prediction and may support patient communication and expectations.

MCML Authors

Tobias Brock

A1 | Statistical Foundations & Explainability
→ Group Thomas Nagler

Computational Statistics & Data Science

Andreas Bender

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Machine Learning Consulting Unit (MLCU)

[286]

C. Gruber, H. Alber, B. Bischl, G. Kauermann, B. Plank and M. Aßenmacher.
Revisiting Active Learning under (Human) Label Variation.
Preprint (Jul. 2025). arXiv

Abstract

Access to high-quality labeled data remains a limiting factor in applied supervised learning. While label variation (LV), i.e., differing labels for the same instance, is common, especially in natural language processing, annotation frameworks often still rest on the assumption of a single ground truth. This overlooks human label variation (HLV), the occurrence of plausible differences in annotations, as an informative signal. Similarly, active learning (AL), a popular approach to optimizing the use of limited annotation budgets in training ML models, often relies on at least one of several simplifying assumptions, which rarely hold in practice when acknowledging HLV. In this paper, we examine foundational assumptions about truth and label nature, highlighting the need to decompose observed LV into signal (e.g., HLV) and noise (e.g., annotation error). We survey how the AL and (H)LV communities have addressed – or neglected – these distinctions and propose a conceptual framework for incorporating HLV throughout the AL loop, including instance selection, annotator choice, and label representation. We further discuss the integration of large language models (LLM) as annotators. Our work aims to lay a conceptual foundation for HLV-aware active learning, better reflecting the complexities of real-world annotation.

MCML Authors

Helen Alber

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Göran Kauermann

Prof. Dr.

Applied Statistics in Social Sciences, Economics and Business

Barbara Plank

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

AI and Computational Linguistics

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science

[285]

L. Bothmann, P. A. Boustani, J. M. Alvarez, G. Casalicchio, B. Bischl and S. Dandl.
Privilege Scores.
EWAF 2025 - 4th European Workshop on Algorithmic Fairness. Eindhoven, The Netherlands, Jun 30-Jul 02, 2025. To be published. Preprint available. arXiv

Abstract

Bias-transforming methods of fairness-aware machine learning aim to correct a non-neutral status quo with respect to a protected attribute (PA). Current methods, however, lack an explicit formulation of what drives non-neutrality. We introduce privilege scores (PS) to measure PA-related privilege by comparing the model predictions in the real world with those in a fair world in which the influence of the PA is removed. At the individual level, PS can identify individuals who qualify for affirmative action; at the global level, PS can inform bias-transforming policies. After presenting estimation methods for PS, we propose privilege score contributions (PSCs), an interpretation method that attributes the origin of privilege to mediating features and direct effects. We provide confidence intervals for both PS and PSCs. Experiments on simulated and real-world data demonstrate the broad applicability of our methods and provide novel insights into gender and racial privilege in mortgage and college admissions applications.

MCML Authors

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Philip Amir Boustani

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[284]

L. Bothmann, K. Peters and B. Bischl.
What Is Fairness? On the Role of Protected Attributes and Fictitious Worlds.
EWAF 2025 - 4th European Workshop on Algorithmic Fairness. Eindhoven, The Netherlands, Jun 30-Jul 02, 2025. To be published. Preprint available. arXiv

Abstract

A growing body of literature in fairness-aware machine learning (fairML) aims to mitigate machine learning (ML)-related unfairness in automated decision-making (ADM) by defining metrics that measure fairness of an ML model and by proposing methods to ensure that trained ML models achieve low scores on these metrics. However, the underlying concept of fairness, i.e., the question of what fairness is, is rarely discussed, leaving a significant gap between centuries of philosophical discussion and the recent adoption of the concept in the ML community. In this work, we try to bridge this gap by formalizing a consistent concept of fairness and by translating the philosophical considerations into a formal framework for the training and evaluation of ML models in ADM systems. We argue that fairness problems can arise even without the presence of protected attributes (PAs), and point out that fairness and predictive performance are not irreconcilable opposites, but that the latter is necessary to achieve the former. Furthermore, we argue why and how causal considerations are necessary when assessing fairness in the presence of PAs by proposing a fictitious, normatively desired (FiND) world in which PAs have no causal effects. In practice, this FiND world must be approximated by a warped world in which the causal effects of the PAs are removed from the real-world data. Finally, we achieve greater linguistic clarity in the discussion of fairML. We outline algorithms for practical applications and present illustrative experiments on COMPAS data.

MCML Authors

Ludwig Bothmann

Dr.

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[283]

C. Leininger, S. Rittel and L. Bothmann.
Overcoming Fairness Trade-offs via Pre-processing: A Causal Perspective.
EWAF 2025 - 4th European Workshop on Algorithmic Fairness. Eindhoven, The Netherlands, Jun 30-Jul 02, 2025. To be published. Preprint available. arXiv

Abstract

Training machine learning models for fair decisions faces two key challenges: The fairness-accuracy trade-off results from enforcing fairness which weakens its predictive performance in contrast to an unconstrained model. The incompatibility of different fairness metrics poses another trade-off – also known as the impossibility theorem. Recent work identifies the bias within the observed data as a possible root cause and shows that fairness and predictive performance are in fact in accord when predictive performance is measured on unbiased data. We offer a causal explanation for these findings using the framework of the FiND (fictitious and normatively desired) world, a ‘fair’ world, where protected attributes have no causal effects on the target variable. We show theoretically that (i) classical fairness metrics deemed to be incompatible are naturally satisfied in the FiND world, while (ii) fairness aligns with high predictive performance. We extend our analysis by suggesting how one can benefit from these theoretical insights in practice, using causal pre-processing methods that approximate the FiND world. Additionally, we propose a method for evaluating the approximation of the FiND world via pre-processing in practical use cases where we do not have access to the FiND world. In simulations and empirical studies, we demonstrate that these pre-processing methods are successful in approximating the FiND world and resolve both trade-offs. Our results provide actionable solutions for practitioners to achieve fairness and high predictive performance simultaneously.

MCML Authors

Simon Rittel

Statistical Learning and Data Science

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[282]

T. Weber, M. Ingrisch, B. Bischl and D. Rügamer.
Preventing Sensitive Information Leakage via Post-hoc Orthogonalization with Application to Chest Radiograph Embeddings.
PAKDD 2025 - 29th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Sydney, Australia, Jun 10-13, 2025. DOI GitHub

Abstract

Deep learning has substantially advanced data analysis across various fields. However, research indicates that protected characteristics, such as age, sex, and race, are often implicitly encoded within the deep feature representations, or embeddings, generated by neural networks. This encoding can lead to inherent biases, which in turn may influence decision-making processes. In clinical settings, in particular, such biases risk leading to unfair treatment of certain subgroups, potentially resulting in serious consequences. After analyzing the sources of these biases in the field of radiology, we illustrate how embeddings of chest radiographs (CXRs) can be corrected to remove the influence of protected features. To showcase the harms of such incidents, we study the MIMIC and CheXpert datasets with three prominent pre-trained models: a supervised contrastive, a self-supervised contrastive, and a baseline classifier model. Our experiments reveal a significant influence of protected features on predictions of pathologies in CXRs, demonstrating the potential harm of such practices. We then propose a correction method, removing these harmful effects while maintaining competitive predictive performance.

MCML Authors

Tobias Weber

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[281]

E. Garces Arias, H. Blocher, J. Rodemann, M. Aßenmacher and C. Jansen.
Statistical Multicriteria Evaluation of LLM-Generated Text.
Preprint (Jun. 2025). arXiv

Abstract

Assessing the quality of LLM-generated text remains a fundamental challenge in natural language processing. Current evaluation approaches often rely on isolated metrics or simplistic aggregations that fail to capture the nuanced trade-offs between coherence, diversity, fluency, and other relevant indicators of text quality. In this work, we adapt a recently proposed framework for statistical inference based on Generalized Stochastic Dominance (GSD) that addresses three critical limitations in existing benchmarking methodologies: the inadequacy of single-metric evaluation, the incompatibility between cardinal automatic metrics and ordinal human judgments, and the lack of inferential statistical guarantees. The GSD-front approach enables simultaneous evaluation across multiple quality dimensions while respecting their different measurement scales, building upon partial orders of decoding strategies, thus avoiding arbitrary weighting of the involved metrics. By applying this framework to evaluate common decoding strategies against human-generated text, we demonstrate its ability to identify statistically significant performance differences while accounting for potential deviations from the i.i.d. assumption of the sampling design.

MCML Authors

Esteban Garces Arias

Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[280]

M. Schöffel, E. Garces Arias, M. Wiedner, P. Ruppert, M. Li, C. Heumann and M. Aßenmacher.
Unveiling Factors for Enhanced POS Tagging: A Study of Low-Resource Medieval Romance Languages.
Preprint (Jun. 2025). arXiv

Abstract

Part-of-speech (POS) tagging remains a foundational component in natural language processing pipelines, particularly critical for historical text analysis at the intersection of computational linguistics and digital humanities. Despite significant advancements in modern large language models (LLMs) for ancient languages, their application to Medieval Romance languages presents distinctive challenges stemming from diachronic linguistic evolution, spelling variations, and labeled data scarcity. This study systematically investigates the central determinants of POS tagging performance across diverse corpora of Medieval Occitan, Medieval Spanish, and Medieval French texts, spanning biblical, hagiographical, medical, and dietary domains. Through rigorous experimentation, we evaluate how fine-tuning approaches, prompt engineering, model architectures, decoding strategies, and cross-lingual transfer learning techniques affect tagging accuracy. Our results reveal both notable limitations in LLMs’ ability to process historical language variations and non-standardized spelling, as well as promising specialized techniques that effectively address the unique challenges presented by low-resource historical languages.

MCML Authors

Esteban Garces Arias

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[279]

H. Löwe, C. A. Scholbeck, C. Heumann, B. Bischl and G. Casalicchio.
fmeffects: An R Package for Forward Marginal Effects.
The R Journal 16.3 (May. 2025). DOI

Abstract

Forward marginal effects have recently been introduced as a versatile and effective model-agnostic interpretation method particularly suited for non-linear and non-parametric prediction models. They provide comprehensible model explanations of the form: if we change feature values by a pre-specified step size, what is the change in the predicted outcome? We present the R package fmeffects, the first software implementation of the theory surrounding forward marginal effects. The relevant theoretical background, package functionality and handling, as well as the software design and options for future extensions are discussed in this paper.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

[278]

R. Sonabend, J. Zobolas, R. Bin, P. Kopper, L. Burk and A. Bender.
Examining marginal properness in the external validation of survival models with squared and logarithmic losses.
Preprint (May. 2025). arXiv

Abstract

Scoring rules promote rational and honest decision-making, which is important for model evaluation and becoming increasingly important for automated procedures such as ‘AutoML’. In this paper we survey common squared and logarithmic scoring rules for survival analysis, with a focus on their theoretical and empirical properness. We introduce a marginal definition of properness and show that both the Integrated Survival Brier Score (ISBS) and the Right-Censored Log-Likelihood (RCLL) are theoretically improper under this definition. We also investigate a new class of losses that may inform future survival scoring rules. Simulation experiments reveal that both the ISBS and RCLL behave as proper scoring rules in practice. The RCLL showed no violations across all settings, while ISBS exhibited only minor, negligible violations at extremely small sample sizes, suggesting one can trust results from historical experiments. As such we advocate for both the RCLL and ISBS in external validation of models, including in automated procedures. However, we note practical challenges in estimating these losses including estimation of censoring distributions and densities; as such further research is required to advance development of robust and honest evaluation in survival analysis.

MCML Authors

Lukas Burk

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Andreas Bender

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Machine Learning Consulting Unit (MLCU)

[277]

C. Zhang, S. Wu, Y. Chen, M. Aßenmacher, C. Heumann, Y. Men, G. Fan and J. Gama.
OBD-Finder: Explainable Coarse-to-Fine Text-Centric Oracle Bone Duplicates Discovery.
Preprint (May. 2025). arXiv GitHub

Abstract

Oracle Bone Inscription (OBI) is the earliest systematic writing system in China, while the identification of Oracle Bone (OB) duplicates is a fundamental issue in OBI research. In this work, we design a progressive OB duplicate discovery framework that combines unsupervised low-level keypoints matching with high-level text-centric content-based matching to refine and rank the candidate OB duplicates with semantic awareness and interpretability. We compare our approach with state-of-the-art content-based image retrieval and image matching methods, showing that our approach yields comparable recall performance and the highest simplified mean reciprocal rank scores for both Top-5 and Top-15 retrieval results, and with significantly accelerated computation efficiency. We have discovered over 60 pairs of new OB duplicates in real-world deployment, which were missed by OBI researchers for decades.

MCML Authors

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[276]

M. Schöffel, M. Wiedner, E. Garces Arias, P. Ruppert, C. Heumann and M. Aßenmacher.
Modern Models, Medieval Texts: A POS Tagging Study of Old Occitan.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. To be published. Preprint available. arXiv

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing, yet their effectiveness in handling historical languages remains largely unexplored. This study examines the performance of open-source LLMs in part-of-speech (POS) tagging for Old Occitan, a historical language characterized by non-standardized orthography and significant diachronic variation. Through comparative analysis of two distinct corpora-hagiographical and medical texts-we evaluate how current models handle the inherent challenges of processing a low-resource historical language. Our findings demonstrate critical limitations in LLM performance when confronted with extreme orthographic and syntactic variability. We provide detailed error analysis and specific recommendations for improving model performance in historical language processing. This research advances our understanding of LLM capabilities in challenging linguistic contexts while offering practical insights for both computational linguistics and historical language studies.

MCML Authors

Esteban Garces Arias

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[275]

D. Rundel, E. Sommer, B. Bischl, D. Rügamer and M. Feurer.
Efficiently Warmstarting MCMC for BNNS.
FPI @ICLR 2025 - Workshop on Frontiers in Probabilistic Inference: Learning meets Sampling at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. URL

Abstract

Markov Chain Monte Carlo (MCMC) algorithms are widely regarded as the gold standard for approximate inference in Bayesian neural networks (BNNs). However, they remain computationally expensive and prone to inefficiencies, such
as dying samplers, frequently leading to substantial waste of computational resources. While prior work has presented warmstarting techniques as an effective method to mitigate these inefficiencies, we provide a more comprehensive empirical analysis of how initializations of samplers affect their behavior. Based on various experiments examining the dynamics of warmstarting MCMC, we propose novel warmstarting strategies that leverage performance predictors and adaptive termination criteria to achieve better-performing, yet more cost-efficient, models. In numerical experiments, we demonstrate that this approach provides a practical pathway to more resource-efficient approximate inference in BNNs.

MCML Authors

David Rundel

A1 | Statistical Foundations & Explainability
→ Group Matthias Feurer

Statistical Learning and Data Science

Emanuel Sommer

A1 | Statistical Foundations & Explainability
→ Group David Rügamer

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[274]

C. Kolb, T. Weber, B. Bischl and D. Rügamer.
Deep Weight Factorization: Sparse Learning Through the Lens of Artificial Symmetries.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL

Abstract

Sparse regularization techniques are well-established in machine learning, yet their application in neural networks remains challenging due to the non-differentiability of penalties like the L1 norm, which is incompatible with stochastic gradient descent. A promising alternative is shallow weight factorization, where weights are decomposed into two factors, allowing for smooth optimization of L1-penalized neural networks by adding differentiable L2 regularization to the factors. In this work, we introduce deep weight factorization, extending previous shallow approaches to more than two factors. We theoretically establish equivalence of our deep factorization with non-convex sparse regularization and analyze its impact on training dynamics and optimization. Due to the limitations posed by standard training practices, we propose a tailored initialization scheme and identify important learning rate requirements necessary for training factorized networks. We demonstrate the effectiveness of our deep weight factorization through experiments on various architectures and datasets, consistently outperforming its shallow counterpart and widely used pruning methods.

MCML Authors

Chris Kolb

Statistical Learning and Data Science

Tobias Weber

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[273]

H. Baniecki, G. Casalicchio, B. Bischl and P. Biecek.
Efficient and Accurate Explanation Estimation with Distribution Compression.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. Spotlight Presentation. URL

Abstract

We discover a theoretical connection between explanation estimation and distribution compression that significantly improves the approximation of feature attributions, importance, and effects. While the exact computation of various machine learning explanations requires numerous model inferences and becomes impractical, the computational cost of approximation increases with an ever-increasing size of data and model parameters. We show that the standard i.i.d. sampling used in a broad spectrum of algorithms for post-hoc explanation leads to an approximation error worthy of improvement. To this end, we introduce Compress Then Explain (CTE), a new paradigm of sample-efficient explainability. It relies on distribution compression through kernel thinning to obtain a data sample that best approximates its marginal distribution. CTE significantly improves the accuracy and stability of explanation estimation with negligible computational overhead. It often achieves an on-par explanation approximation error 2-3x faster by using fewer samples, i.e. requiring 2-3x fewer model evaluations. CTE is a simple, yet powerful, plug-in for any explanation method that now relies on i.i.d. sampling.

MCML Authors

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[272]

Y. Li, D. Rügamer, B. Bischl and M. Rezaei.
Calibrating LLMs with Information-Theoretic Evidential Deep Learning.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL

Abstract

Fine-tuned large language models (LLMs) often exhibit overconfidence, particularly when trained on small datasets, resulting in poor calibration and inaccurate uncertainty estimates. Evidential Deep Learning (EDL), an uncertainty-aware approach, enables uncertainty estimation in a single forward pass, making it a promising method for calibrating fine-tuned LLMs. However, despite its computational efficiency, EDL is prone to overfitting, as its training objective can result in overly concentrated probability distributions. To mitigate this, we propose regularizing EDL by incorporating an information bottleneck (IB). Our approach IB-EDL suppresses spurious information in the evidence generated by the model and encourages truly predictive information to influence both the predictions and uncertainty estimates. Extensive experiments across various fine-tuned LLMs and tasks demonstrate that IB-EDL outperforms both existing EDL and non-EDL approaches. By improving the trustworthiness of LLMs, IB-EDL facilitates their broader adoption in domains requiring high levels of confidence calibration.

MCML Authors

Yawei Li

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Mina Rezaei

Dr.

Statistical Learning and Data Science

[271]

L. Wimmer, B. Bischl and L. Bothmann.
Trust Me, I Know the Way: Predictive Uncertainty in the Presence of Shortcut Learning.
SCSL @ICLR 2025 - Workshop on Spurious Correlation and Shortcut Learning: Foundations and Solutions at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. URL

Abstract

The correct way to quantify predictive uncertainty in neural networks remains a topic of active discussion. In particular, it is unclear whether the state-of-the art entropy decomposition leads to a meaningful representation of model, or epistemic, uncertainty (EU) in the light of a debate that pits ignorance against disagreement perspectives. We aim to reconcile the conflicting viewpoints by arguing that both are valid but arise from different learning situations. Notably, we show that the presence of shortcuts is decisive for EU manifesting as disagreement.

MCML Authors

Lisa Wimmer

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Ludwig Bothmann

Dr.

Statistical Learning and Data Science

[270]

C. Kolb, B. Bischl and D. Rügamer.
Differentiable Attention Sparsity via Structured D-Gating.
SLLM @ICLR 2025 - Workshop on Sparsity in LLMs at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. URL

Abstract

A core component of modern large language models is the attention mechanism, but its immense parameter count necessitates structured sparsity for resource-efficient optimization and inference. Traditional sparsity penalties, such as the group lasso, are non-smooth and thus incompatible with standard stochastic gradient descent methods. To address this, we propose a deep gating mechanism that reformulates the structured sparsity penalty into a fully differentiable optimization problem, allowing effective and principled norm-based group sparsification without requiring specialized non-smooth optimizers. Our theoretical analysis and empirical results demonstrate that this approach enables structured sparsity with simple stochastic gradient descent or variants while maintaining predictive performance.

MCML Authors

Chris Kolb

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[269]

H. A. Gündüz.
Designing and optimizing deep learning methods for genomic sequencing data.
Dissertation 2025. DOI

Abstract

This dissertation applies modern deep learning techniques to genomics, introducing new approaches for self-supervised learning, uncertainty quantification, and automated model design. A key focus is the effective use of unlabeled genomic data, highlighted by the development of Self-GenomeNet, a self-supervised method tailored to genomic sequences. The work also presents automated optimization strategies for model architectures and hyperparameters, achieving better results than expert-designed models. Finally, it contributes user-friendly software that supports various genomic data formats and integrates core methods developed in the thesis. (Shortened).

MCML Authors

Hüseyin Anil Gündüz

* Former Member

[268]

M. Mironov, A. Marquard, D. Racek, C. Heumann, P. W. Thurner and M. Aßenmacher.
A Geoparsing Pipeline for Multilingual Social Media Posts from Ukraine.
GeoExT @ECIR 2025 - 3rd International Workshop on Geographic Information Extraction from Texts at the 47th European Conference on Information Retrieval (ECIR 2025). Lucca, Italy, Apr 06-10, 2025. PDF

Abstract

The dynamics of contemporary social media communication, particularly on platforms like X (formerly Twitter), have significantly evolved, and this data is frequently used for scientific research. However, due to X’s API changes in 2019, a tweet’s precise geolocation is no longer present in the data, thus preventing a geographical assessment of tweets. This project aims to extract location mentions from tweets’ texts and to map them to Ukraine’s administrative regions. We have developed a specialized pipeline for geoparsing with specific prebuilt components for the Ukrainian, Russian, and English languages. The main advantage of our pipeline’s architecture is the interchangeability of all components, allowing for the integration of custom-developed solutions. Initial tests on our hand-labeled Ukrainian dataset show promising results in accurately identifying and mapping location mentions despite various challenges, such as declension and the presence of multiple languages in a single tweet. Additional experiments using publicly available benchmark data further indicate promising performance when transferring our pipeline to other geographical regions. Both our geoparsing pipeline and its online documentation have been made publicly available.

MCML Authors

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[267]

T. Weber.
Advancing Deep Learning in medical imaging through generative modeling and representation learning.
Dissertation 2025. DOI

Abstract

In recent years, deep learning (DL) has proven to be a disruptive enabler in many domains, including the realm of medical imaging. The application of neural networks and other learnable algorithms has substantially impacted the medical field, promising to improve diagnostic accuracy, enhance patient outcomes, and streamline clinical workflows. The advent of large-scale datasets and advancements in computational power have facilitated the development of sophisticated DL models capable of analyzing and interpreting complex medical images. The scope of this thesis concentrates on a subset of the full DL spectrum, specifically the uprising areas of generative modeling and representation learning, which are closely interleaved with each other. The proposed contributions aim to push the boundaries of established medical image DL methods, venturing into more experimental research areas. (Shortened)

MCML Authors

Tobias Weber

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[266]

N. Santhanam, H. E. Kim, D. Rügamer, A. Bender, S. Muthers, C. G. Cho, A. Alonso, K. Szabo, F.-S. Centner, H. Wenz, T. Ganslandt, M. Platten, C. Groden, M. Neumaier, F. Siegel and M. E. Maros.
Machine learning-based forecasting of daily acute ischemic stroke admissions using weather data.
npj Digital Medicine 8.225 (Apr. 2025). DOI

Abstract

Background: In the midst of the emerging climate crisis, healthcare providers lack locally validated, disease-specific surveillance models. Stroke, a significant contributor to the global disease burden, has been linked to climate change. Therefore, we developed and benchmarked machine learning (ML) models based on locoregional weather systems to forecast the number of daily acute ischemic stroke (AIS) admissions.
Methods: AIS patients diagnosed between 2015 and 2021 at the tertiary University Medical Center (UMC) Mannheim, Germany were extracted from the local data integration center and geospatially matched to weather data from the German Weather Service (DWD) based on the clinic’s, patients’ home and closest tower’s locations at the time of admission. Statistical-(Poisson), boosted generalized additive model (GAM), support vector machines (SVR), and tree-based models including random forest (RF) and extreme gradient boosting (XGB) were evaluated in regression settings within time-stratified nested cross-validation setup (training-validation: 2015-2020, test set: 2021) to predict the number of daily AIS admissions.
Findings: The cohort included 7,914 AIS patients (4,244 male, 53·6%). XGB showed the best test performance with lowest mean absolute error (MAE) of 1·21 cases/day. Maximum air pressure was identified as the top predictive variable. Shapley additive explanations analyses revealed that temperature extremes of extended cold- (lag-3 minimum temperature <-2 °C; minimum perceived temperature <-1·4 °C) and hot stressors (lag-7 minimum temperature >15 °C), as well as stormy conditions (lag-1 and lag-2 maximum wind gust >14 m/s and speed >10·4 m/s), increased stroke incidences substantially with distinct seasonal associations.
Interpretation: ML models can sufficiently forecast AIS admissions based on weather patterns allowing for improved resource allocation and preparedness.

MCML Authors

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

[265]

L. Bothmann, S. Dandl, J. M. A. Jose M. Alvarez, P. A. Boustani and B. Bischl.
Privilege Scores for Fairness-Aware ML.
DAGStat 2025 - 7th Joint Statistical Meeting of the Deutsche Arbeitsgemeinschaft Statistik. Berlin, Germany, Mar 24-28, 2025. Poster presentation. Full paper available. arXiv

Abstract

Bias-preserving methods in fairness-aware machine learning (fairML) focus on metrics that prioritize formal equality by balancing error rates across subgroups. These methods can perpetuate historical discrimination embedded in real-world data. In contrast, bias-transforming methods aim for substantive equality by actively addressing historical inequalities. As a contribution to bias-transforming methods, we introduce the concept of privilege scores, a novel approach to identifying and quantifying individual privilege in machine learning tasks. Privilege scores use causal inference techniques to compare real-world outcomes to those in a ‘fair’ world in which the protected attributes do not influence the target variable. This individual-level perspective provides actionable insights for applications such as affirmative action and beyond. Key contributions include (1) the formalization of privilege scores, (2) a methodological framework for estimation with uncertainty quantification via confidence intervals, (3) an interpretable machine learning approach for understanding privilege score contributions, and (4) a novel in-processing method, Multi-PrivScore, to mitigate model-level discrimination during model training. Experiments on simulated and real-world data demonstrate the usefulness of privilege scores. Overall, our work highlights privilege scores as a versatile tool for assessing and mitigating historical discrimination in various machine learning applications.

MCML Authors

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Philip Amir Boustani

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[264]

L. Zumeta-Olaskoaga, A. Bender and D.-J. Lee.
Flexible modelling of time-varying exposures in event history analysis.
DAGStat 2025 - 7th Joint Statistical Meeting of the Deutsche Arbeitsgemeinschaft Statistik. Berlin, Germany, Mar 24-28, 2025. Poster presentation. Full paper available. DOI

Abstract

We present a flexible modelling approach to analyse time-varying exposures and recurrent events in team sports injuries. The approach is based on the piece-wise exponential additive mixed model where the effects of past exposures (i.e. high-intensity training loads) may accumulate over time and present complex forms of association. In order to identify a relevant time window at which past exposures have an impact on the current risk, we propose a penalty approach. We conduct a simulation study to evaluate the performance of the proposed model, under different true weight functions and different levels of heterogeneity between recurrent events. Finally, we illustrate the approach with a case study application involving an elite male football team participating in the Spanish LaLiga competition. The cohort includes time-loss injuries and external training load variables tracked by Global Positioning System devices, during the seasons 2017–2018 and 2018–2019.

MCML Authors

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

[263]

L. Zumeta-Olaskoaga, A. Bender and D.-J. Lee.
Flexible modelling of time-varying exposures and recurrent events to analyse training load effects in team sports injuries.
Journal of the Royal Statistical Society. Series C (Applied Statistics) 74.2 (Mar. 2025). DOI

Abstract

MCML Authors

Andreas Bender

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Machine Learning Consulting Unit (MLCU)

[262]

R. Hornung, M. Nalenz, L. Schneider, A. Bender, L. Bothmann, F. Dumpert, B. Bischl, T. Augustin and A.-L. Boulesteix.
Evaluating Machine Learning Models in Non-Standard Settings: An Overview and New Findings.
Statistical Science (Mar. 2025). To be published. Preprint available. arXiv

Abstract

Estimating the generalization error (GE) of machine learning models is fundamental, with resampling methods being the most common approach. However, in non-standard settings, particularly those where observations are not independently and identically distributed, resampling using simple random data divisions may lead to biased GE estimates. This paper strives to present well-grounded guidelines for GE estimation in various such non-standard settings: clustered data, spatial data, unequal sampling probabilities, concept drift, and hierarchically structured outcomes. Our overview combines well-established methodologies with other existing methods that, to our knowledge, have not been frequently considered in these particular settings. A unifying principle among these techniques is that the test data used in each iteration of the resampling procedure should reflect the new observations to which the model will be applied, while the training data should be representative of the entire data set used to obtain the final model. Beyond providing an overview, we address literature gaps by conducting simulation studies. These studies assess the necessity of using GE-estimation methods tailored to the respective setting. Our findings corroborate the concern that standard resampling methods often yield biased GE estimates in non-standard settings, underscoring the importance of tailored GE estimation.

MCML Authors

Roman Hornung

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Andreas Bender

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Machine Learning Consulting Unit (MLCU)

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Biometry in Molecular Medicine

[261]

A. Wuttke, M. Aßenmacher, C. Klamm, M. Lang, Q. Würschinger and F. Kreuter.
AI Conversational Interviewing: Transforming Surveys with LLMs as Adaptive Interviewers.
Preprint (Mar. 2025). arXiv

Abstract

Traditional methods for eliciting people’s opinions face a trade-off between depth and scale: structured surveys enable large-scale data collection but limit respondents’ ability to voice their opinions in their own words, while conversational interviews provide deeper insights but are resource-intensive. This study explores the potential of replacing human interviewers with large language models (LLMs) to conduct scalable conversational interviews. Our goal is to assess the performance of AI Conversational Interviewing and to identify opportunities for improvement in a controlled environment. We conducted a small-scale, in-depth study with university students who were randomly assigned to a conversational interview by either AI or human interviewers, both employing identical questionnaires on political topics. Various quantitative and qualitative measures assessed interviewer adherence to guidelines, response quality, participant engagement, and overall interview efficacy. The findings indicate the viability of AI Conversational Interviewing in producing quality data comparable to traditional methods, with the added benefit of scalability. We publish our data and materials for re-use and present specific recommendations for effective implementation.

MCML Authors

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science

Frauke Kreuter

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Social Data Science and AI

[260]

L. Burk, A. Bender and M. N. Wright.
High-Dimensional Variable Selection With Competing Events Using Cooperative Penalized Regression.
Biometrical Journal 67.1 (Feb. 2025). DOI

Abstract

Variable selection is an important step in the analysis of high-dimensional data, yet there are limited options for survival outcomes in the presence of competing risks. Commonly employed penalized Cox regression considers each event type separately through cause-specific models, neglecting possibly shared information between them. We adapt the feature-weighted elastic net (fwelnet), an elastic net generalization, to survival outcomes and competing risks. For two causes, our proposed algorithm fits two alternating cause-specific models, where each model receives the coefficient vector of the complementary model as prior information. We dub this ‘‘cooperative penalized regression’’, as it enables the modeling of competing risk data with cause-specific models while accounting for shared effects between causes. Coefficients that are shrunken toward zero in the model for the first cause will receive larger penalization weights in the model for the second cause and vice versa. Through multiple iterations, this process ensures stronger penalization of uninformative predictors in both models. We demonstrate our method’s variable selection capabilities on simulated genomics data and apply it to bladder cancer microarray data. We evaluate selection performance using the positive predictive value for the correct selection of informative features and the false positive rate for the selection of uninformative variables. The benchmark compares results with cause-specific penalized Cox regression, random survival forests, and likelihood-boosted Cox regression. Results indicate that our approach is more effective at selecting informative features and removing uninformative features. In settings without shared effects, variable selection performance is similar to cause-specific penalized Cox regression.

MCML Authors

Lukas Burk

Statistical Learning and Data Science

Andreas Bender

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Machine Learning Consulting Unit (MLCU)

[259]

J. Rodemann, E. Garces Arias, C. Luther, C. Jansen and T. Augustin.
A Statistical Case Against Empirical Human-AI Alignment.
Preprint (Feb. 2025). arXiv

Abstract

Empirical human-AI alignment aims to make AI systems act in line with observed human behavior. While noble in its goals, we argue that empirical alignment can inadvertently introduce statistical biases that warrant caution. This position paper thus advocates against naive empirical alignment, offering prescriptive alignment and a posteriori empirical alignment as alternatives. We substantiate our principled argument by tangible examples like human-centric decoding of language models.

MCML Authors

Esteban Garces Arias

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[258]

M. Surner, A. Khelil and L. Bothmann.
Invariance Pair-Guided Learning: Enhancing Robustness in Neural Networks.
Preprint (Feb. 2025). arXiv

Abstract

Out-of-distribution generalization of machine learning models remains challenging since the models are inherently bound to the training data distribution. This especially manifests, when the learned models rely on spurious correlations. Most of the existing approaches apply data manipulation, representation learning, or learning strategies to achieve generalizable models. Unfortunately, these approaches usually require multiple training domains, group labels, specialized augmentation, or pre-processing to reach generalizable models. We propose a novel approach that addresses these limitations by providing a technique to guide the neural network through the training phase. We first establish input pairs, representing the spurious attribute and describing the invariance, a characteristic that should not affect the outcome of the model. Based on these pairs, we form a corrective gradient complementing the traditional gradient descent approach. We further make this correction mechanism adaptive based on a predefined invariance condition. Experiments on ColoredMNIST, Waterbird-100, and CelebA datasets demonstrate the effectiveness of our approach and the robustness to group shifts.

MCML Authors

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[257]

E. Garces Arias, M. Li, C. Heumann and M. Aßenmacher.
Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL

Abstract

Decoding strategies for large language models (LLMs) are a critical but often underexplored aspect of text generation tasks. Since LLMs produce probability distributions over the entire vocabulary, various decoding methods have been developed to transform these probabilities into coherent and fluent text, each with its own set of hyperparameters. In this study, we present a large-scale, comprehensive analysis of how hyperparameter selection affects text quality in open-ended text generation across multiple LLMs, datasets, and evaluation metrics. Through an extensive sensitivity analysis, we provide practical guidelines for hyperparameter tuning and demonstrate the substantial influence of these choices on text quality. Using three established datasets, spanning factual domains (e.g., news) and creative domains (e.g., fiction), we show that hyperparameter tuning significantly impacts generation quality, though its effects vary across models and tasks. We offer in-depth insights into these effects, supported by both human evaluations and a synthesis of widely-used automatic evaluation metrics.

MCML Authors

Esteban Garces Arias

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[256]

L. Schneider.
Advancing hyperparameter optimization: foundations, multiple objectives and algorithmic innovations informed through benchmarking.
Dissertation 2025. DOI

Abstract

Hyperparameter optimization (HPO) is a fundamental aspect of machine learning (ML), directly influencing model performance and adaptability. As a computationally expensive black-box optimization problem, HPO requires efficient algorithms to identify optimal hyperparameter configurations. This thesis advances the field of HPO along three key dimensions: foundational insights, HPO in the presence of more than one objective, and algorithmic innovations through benchmarking. (Shortened.)

MCML Authors

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[255]

L. Bothmann and K. Peters.
Fairness von KI – ein Brückenschlag zwischen Philosophie und Maschinellem Lernen.
Grenzen Künstlicher Intelligenz (Jan. 2025). DOI

MCML Authors

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[254]

T. Weber, J. Dexl, D. Rügamer and M. Ingrisch.
Post-Training Network Compression for 3D Medical Image Segmentation: Reducing Computational Efforts via Tucker Decomposition.
Radiology: Artificial Intelligence 7.2 (Jan. 2025). DOI

Abstract

We address the computational barrier of deploying advanced deep learning segmentation models in clinical settings by studying the efficacy of network compression through tensor decomposition. We propose a post-training Tucker factorization that enables the decomposition of pre-existing models to reduce computational requirements without impeding segmentation accuracy. We applied Tucker decomposition to the convolutional kernels of the TotalSegmentator (TS) model, an nnU-Net model trained on a comprehensive dataset for automatic segmentation of 117 anatomical structures. Our approach reduced the floating-point operations (FLOPs) and memory required during inference, offering an adjustable trade-off between computational efficiency and segmentation quality. This study utilized the publicly available TS dataset, employing various downsampling factors to explore the relationship between model size, inference speed, and segmentation performance. The application of Tucker decomposition to the TS model substantially reduced the model parameters and FLOPs across various compression rates, with limited loss in segmentation accuracy. We removed up to 88% of the model’s parameters with no significant performance changes in the majority of classes after fine-tuning. Practical benefits varied across different graphics processing unit (GPU) architectures, with more distinct speed-ups on less powerful hardware. Post-hoc network compression via Tucker decomposition presents a viable strategy for reducing the computational demand of medical image segmentation models without substantially sacrificing accuracy. This approach enables the broader adoption of advanced deep learning technologies in clinical practice, offering a way to navigate the constraints of hardware capabilities.

MCML Authors

Tobias Weber

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Jakob Dexl

Clinical Data Science in Radiology

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology

2024

[253]

T. Nagler, L. Schneider, B. Bischl and M. Feurer.
Reshuffling Resampling Splits Can Improve Generalization of Hyperparameter Optimization.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub

Abstract

Hyperparameter optimization is crucial for obtaining peak performance of machine learning models. The standard protocol evaluates various hyperparameter configurations using a resampling estimate of the generalization error to guide optimization and select a final hyperparameter configuration. Without much evidence, paired resampling splits, i.e., either a fixed train-validation split or a fixed cross-validation scheme, are often recommended. We show that, surprisingly, reshuffling the splits for every configuration often improves the final model’s generalization performance on unseen data. Our theoretical analysis explains how reshuffling affects the asymptotic behavior of the validation loss surface and provides a bound on the expected regret in the limiting regime. This bound connects the potential benefits of reshuffling to the signal and noise characteristics of the underlying optimization problem. We confirm our theoretical results in a controlled simulation study and demonstrate the practical usefulness of reshuffling in a large-scale, realistic hyperparameter optimization experiment. While reshuffling leads to test performances that are competitive with using fixed splits, it drastically improves results for a single train-validation holdout protocol and can often make holdout become competitive with standard CV while being computationally cheaper.

MCML Authors

Thomas Nagler

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Computational Statistics & Data Science

Lennart Schneider

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[252]

Y. Zhang, Y. Li, X. Wang, Q. Shen, B. Plank, B. Bischl, M. Rezaei and K. Kawaguchi.
FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models.
NeurIPS 2024 - Workshop on Machine Learning and Compression at the 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL

Abstract

Overparametrized transformer networks are the state-of-the-art architecture for Large Language Models (LLMs). However, such models contain billions of parameters making large compute a necessity, while raising environmental concerns. To address these issues, we propose FinerCut, a new form of fine-grained layer pruning, which in contrast to prior work at the transformer block level, considers all self-attention and feed-forward network (FFN) layers within blocks as individual pruning candidates. FinerCut prunes layers whose removal causes minimal alternation to the model’s output – contributing to a new, lean, interpretable, and task-agnostic pruning method. Tested across 9 benchmarks, our approach retains 90% performance of Llama3-8B with 25% layers removed, and 95% performance of Llama3-70B with 30% layers removed, all without fine-tuning or post-pruning reconstruction. Strikingly, we observe intriguing results with FinerCut: 42% (34 out of 80) of the self-attention layers in Llama3-70B can be removed while preserving 99% of its performance – without additional fine-tuning after removal. Moreover, FinerCut provides a tool to inspect the types and locations of pruned layers, allowing to observe interesting pruning behaviors. For instance, we observe a preference for pruning self-attention layers, often at deeper consecutive decoder layers. We hope our insights inspire future efficient LLM architecture designs.

MCML Authors

Yawei Li

Statistical Learning and Data Science

Xinpeng Wang

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Mina Rezaei

Dr.

Statistical Learning and Data Science

[251]

J. Herbinger, M. N. Wright, T. Nagler, B. Bischl and G. Casalicchio.
Decomposing Global Feature Effects Based on Feature Interactions.
Journal of Machine Learning Research 25.381 (Dec. 2024). URL

Abstract

Global feature effect methods, such as partial dependence plots, provide an intelligible visualization of the expected marginal feature effect. However, such global feature effect methods can be misleading, as they do not represent local feature effects of single observations well when feature interactions are present. We formally introduce generalized additive decomposition of global effects (GADGET), which is a new framework based on recursive partitioning to find interpretable regions in the feature space such that the interaction-related heterogeneity of local feature effects is minimized. We provide a mathematical foundation of the framework and show that it is applicable to the most popular methods to visualize marginal feature effects, namely partial dependence, accumulated local effects, and Shapley additive explanations (SHAP) dependence. Furthermore, we introduce and validate a new permutation-based interaction detection procedure that is applicable to any feature effect method that fits into our proposed framework. We empirically evaluate the theoretical characteristics of the proposed methods based on various feature effect methods in different experimental settings. Moreover, we apply our introduced methodology to three real-world examples to showcase their usefulness.

MCML Authors

Julia Herbinger

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

[250]

F. Fumagalli, M. Muschalik, E. Hüllermeier, B. Hammer and J. Herbinger.
Unifying Feature-Based Explanations with Functional ANOVA and Cooperative Game Theory.
Preprint (Dec. 2024). arXiv

Abstract

Feature-based explanations, using perturbations or gradients, are a prevalent tool to understand decisions of black box machine learning models. Yet, differences between these methods still remain mostly unknown, which limits their applicability for practitioners. In this work, we introduce a unified framework for local and global feature-based explanations using two well-established concepts: functional ANOVA (fANOVA) from statistics, and the notion of value and interaction from cooperative game theory. We introduce three fANOVA decompositions that determine the influence of feature distributions, and use game-theoretic measures, such as the Shapley value and interactions, to specify the influence of higher-order interactions. Our framework combines these two dimensions to uncover similarities and differences between a wide range of explanation techniques for features and groups of features. We then empirically showcase the usefulness of our framework on synthetic and real-world datasets.

MCML Authors

Maximilian Muschalik

Artificial Intelligence and Machine Learning

Eyke Hüllermeier

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Artificial Intelligence and Machine Learning

Julia Herbinger

Dr.

* Former Member

[249]

E. Garces Arias, J. Rodemann, M. Li, C. Heumann and M. Aßenmacher.
Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Decoding from the output distributions of large language models to produce high-quality text is a complex challenge in language modeling. Various approaches, such as beam search, sampling with temperature, k−sampling, nucleus p−sampling, typical decoding, contrastive decoding, and contrastive search, have been proposed to address this problem, aiming to improve coherence, diversity, as well as resemblance to human-generated text. In this study, we introduce adaptive contrastive search, a novel decoding strategy extending contrastive search by incorporating an adaptive degeneration penalty, guided by the estimated uncertainty of the model at each generation step. This strategy is designed to enhance both the creativity and diversity of the language modeling process while at the same time producing coherent and high-quality generated text output. Our findings indicate performance enhancement in both aspects, across different model architectures and datasets, underscoring the effectiveness of our method in text generation tasks. Our code base, datasets, and models are publicly available.

MCML Authors

Esteban Garces Arias

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[248]

Y. Li, Y. Zhang, K. Kawaguchi, A. Khakzar, B. Bischl and M. Rezaei.
A Dual-Perspective Approach to Evaluating Feature Attribution Methods.
Transactions on Machine Learning Research (Nov. 2024). URL

Abstract

Feature attribution methods attempt to explain neural network predictions by identifying relevant features. However, establishing a cohesive framework for assessing feature attribution remains a challenge. There are several views through which we can evaluate attributions. One principal lens is to observe the effect of perturbing attributed features on the model’s behavior (i.e., faithfulness). While providing useful insights, existing faithfulness evaluations suffer from shortcomings that we reveal in this paper. In this work, we propose two new perspectives within the faithfulness paradigm that reveal intuitive properties: soundness and completeness. Soundness assesses the degree to which attributed features are truly predictive features, while completeness examines how well the resulting attribution reveals all the predictive features. The two perspectives are based on a firm mathematical foundation and provide quantitative metrics that are computable through efficient algorithms. We apply these metrics to mainstream attribution methods, offering a novel lens through which to analyze and compare feature attribution methods.

MCML Authors

Yawei Li

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Ashkan Khakzar

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Mina Rezaei

Dr.

Statistical Learning and Data Science

[247]

J. Nam, I. Chalkidis and M. Rezaei.
Hyperbolic Contrastive Learning for Document Representations – A Multi-View Approach with Paragraph-level Similarities.
ECAI 2024 - 27th European Conference on Artificial Intelligence. Santiago de Compostela, Spain, Oct 19-24, 2024. DOI

Abstract

Self-supervised learning (SSL) has gained prominence due to the increasing availability of unlabeled data and advances in computational efficiency, leading to revolutionized natural language processing with pre-trained language models like BERT and GPT. Representation learning, a core concept in SSL, aims to reduce data dimensionality while preserving meaningful aspects. Conventional SSL methods typically embed data in Euclidean space. However, recent research has revealed that alternative geometries can hold even richer representations, unlocking more meaningful insights from the data. Motivated by this, we propose two novel methods for integrating Hilbert geometry into self-supervised learning for efficient document embedding. First, we present a method directly incorporating Hilbert geometry into the standard Euclidean contrastive learning framework. Additionally, we propose a multi-view hyperbolic contrastive learning framework contrasting both documents and paragraphs. Our findings demonstrate that contrasting only paragraphs, rather than entire documents, can lead to superior efficiency and effectiveness.

MCML Authors

Mina Rezaei

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[246]

M. Aßenmacher, L. Karrlein, P. Schiele and C. Heumann.
Introducing wwm-german-18k - Can LLMs Crack the Million? (Or Win at Least 500 Euros?).
ICNLSP 2024 - 7th International Conference on Natural Language and Speech Processing. Trento, Italy, Oct 19-20, 2024. URL

Abstract

Language-specific evaluation of large language models (LLMs) for multiple-choice question answering (MCQA) is an important means to test their abilities for a multitude of different dimensions. With a data set assembled from questions from the German variant of ‘Who Wants to Be a Millionaire?’ we evaluate a set of German models and ChatGPT concerning factual/commonsense knowledge, syntactic abilities, and logical reasoning, amongst others. We contribute this new MCQA data set, extracted from the show’s episodes and designed to evaluate the ability of models to answer this diverse range of questions. To ensure data quality, we describe our preprocessing, encompassing data cleaning, deduplication, and the creation of stratified splits. Furthermore, we fine-tune a set of German LLMs and prompt ChatGPT to provide baseline results. Our findings reveal that these models achieve (partly) satisfactory performance on questions of lower difficulty levels (≤ 1000 euros). As the difficulty increases, performance steadily declines, highlighting the challenging nature of the later stages of the game. We contribute to the ongoing efforts to advance the capabilities of LLMs in comprehending and answering questions by providing a valuable resource for German MCQA research as well as further insights into the limitations of current LLMs.

MCML Authors

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[245]

T. Pielok, B. Bischl and D. Rügamer.
Semi-Implicit Variational Inference via Kernelized Path Gradient Descent.
Preprint (Oct. 2024). arXiv

Abstract

Semi-implicit variational inference (SIVI) is a powerful framework for approximating complex posterior distributions, but training with the Kullback-Leibler (KL) divergence can be challenging due to high variance and bias in high-dimensional settings. While current state-of-the-art semi-implicit variational inference methods, particularly Kernel Semi-Implicit Variational Inference (KSIVI), have been shown to work in high dimensions, training remains moderately expensive. In this work, we propose a kernelized KL divergence estimator that stabilizes training through nonparametric smoothing. To further reduce the bias, we introduce an importance sampling correction. We provide a theoretical connection to the amortized version of the Stein variational gradient descent, which estimates the score gradient via Stein’s identity, showing that both methods minimize the same objective, but our semi-implicit approach achieves lower gradient variance. In addition, our method’s bias in function space is benign, leading to more stable and efficient optimization. Empirical results demonstrate that our method outperforms or matches state-of-the-art SIVI methods in both performance and training efficiency.

MCML Authors

Tobias Pielok

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[244]

H. Baniecki, G. Casalicchio, B. Bischl and P. Biecek.
On the Robustness of Global Feature Effect Explanations.
ECML-PKDD 2024 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Vilnius, Lithuania, Sep 09-13, 2024. DOI

Abstract

We study the robustness of global post-hoc explanations for predictive models trained on tabular data. Effects of predictor features in black-box supervised learning are an essential diagnostic tool for model debugging and scientific discovery in applied sciences. However, how vulnerable they are to data and model perturbations remains an open research question. We introduce several theoretical bounds for evaluating the robustness of partial dependence plots and accumulated local effects. Our experimental results with synthetic and real-world datasets quantify the gap between the best and worst-case scenarios of (mis)interpreting machine learning predictions globally.

MCML Authors

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

[243]

F. Stermann, I. Chalkidis, A. Vahidi, B. Bischl and M. Rezaei.
Attention-Driven Dropout: A Simple Method to Improve Self-supervised Contrastive Sentence Embeddings.
ECML-PKDD 2024 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Vilnius, Lithuania, Sep 09-13, 2024. DOI

Abstract

Self-contrastive learning has proven effective for vision and natural language tasks. It aims to learn aligned data representations by encoding similar and dissimilar sentence pairs without human annotation. Therefore, data augmentation plays a crucial role in the learned embedding quality. However, in natural language processing (NLP), creating augmented samples for unsupervised contrastive learning is challenging since random editing may modify the semantic meanings of sentences and thus affect learning good representations. In this paper, we introduce a simple, still effective approach dubbed ADD (Attention-Driven Dropout) to generate better-augmented views of sentences to be used in self-contrastive learning. Given a sentence and a Pre-trained Transformer Language Model (PLM), such as RoBERTa, we use the aggregated attention scores of the PLM to remove the less “informative” tokens from the input. We consider two alternative algorithms based on NAIVEAGGREGATION across layers/heads and ATTENTIONROLLOUT [1]. Our approach significantly improves the overall performance of various self-supervised contrastive-based methods, including SIMCSE [14], DIFFCSE [10], and INFOCSE [33] by facilitating the generation of high-quality positive pairs required by these methods. Through empirical evaluations on multiple Semantic Textual Similarity (STS) and Transfer Learning tasks, we observe enhanced performance across the board.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Mina Rezaei

Dr.

Statistical Learning and Data Science

[242]

A. Vahidi, L. Wimmer, H. A. Gündüz, B. Bischl, E. Hüllermeier and M. Rezaei.
Diversified Ensemble of Independent Sub-Networks for Robust Self-Supervised Representation Learning.
ECML-PKDD 2024 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Vilnius, Lithuania, Sep 09-13, 2024. DOI

Abstract

Ensembling a neural network is a widely recognized approach to enhance model performance, estimate uncertainty, and improve robustness in deep supervised learning. However, deep ensembles often come with high computational costs and memory demands. In addition, the efficiency of a deep ensemble is related to diversity among the ensemble members, which is challenging for large, over-parameterized deep neural networks. Moreover, ensemble learning has not yet seen such widespread adoption for unsupervised learning and it remains a challenging endeavor for self-supervised or unsupervised representation learning. Motivated by these challenges, we present a novel self-supervised training regime that leverages an ensemble of independent sub-networks, complemented by a new loss function designed to encourage diversity. Our method efficiently builds a sub-model ensemble with high diversity, leading to well-calibrated estimates of model uncertainty, all achieved with minimal computational overhead compared to traditional deep self-supervised ensembles. To evaluate the effectiveness of our approach, we conducted extensive experiments across various tasks, including in-distribution generalization, out-of-distribution detection, dataset corruption, and semi-supervised settings. The results demonstrate that our method significantly improves prediction reliability. Our approach not only achieves excellent accuracy but also enhances calibration, improving on important baseline performance across a wide range of self-supervised architectures in computer vision, natural language processing, and genomics data.

MCML Authors

Lisa Wimmer

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Hüseyin Anil Gündüz

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Eyke Hüllermeier

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Artificial Intelligence and Machine Learning

Mina Rezaei

Dr.

Statistical Learning and Data Science

[241]

C. Molnar, G. König, B. Bischl and G. Casalicchio.
Model-agnostic feature importance and effects with dependent features: a conditional subgroup approach.
Data Mining and Knowledge Discovery 38 (Sep. 2024). DOI

Abstract

The interpretation of feature importance in machine learning models is challenging when features are dependent. Permutation feature importance (PFI) ignores such dependencies, which can cause misleading interpretations due to extrapolation. A possible remedy is more advanced conditional PFI approaches that enable the assessment of feature importance conditional on all other features. Due to this shift in perspective and in order to enable correct interpretations, it is beneficial if the conditioning is transparent and comprehensible. In this paper, we propose a new sampling mechanism for the conditional distribution based on permutations in conditional subgroups. As these subgroups are constructed using tree-based methods such as transformation trees, the conditioning becomes inherently interpretable. This not only provides a simple and effective estimator of conditional PFI, but also local PFI estimates within the subgroups. In addition, we apply the conditional subgroups approach to partial dependence plots, a popular method for describing feature effects that can also suffer from extrapolation when features are dependent and interactions are present in the model. In simulations and a real-world application, we demonstrate the advantages of the conditional subgroup approach over existing methods: It allows to compute conditional PFI that is more true to the data than existing proposals and enables a fine-grained interpretation of feature effects and importance within the conditional subgroups.

MCML Authors

Gunnar König

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

[240]

H. J. Coyle-Asbil, L. Burk, M. Brandes, B. Brandes, C. Buck, M. N. Wright and L. A. Vallis.
Energy Expenditure Prediction in Preschool Children: A Machine Learning Approach Using Accelerometry and External Validation.
Physiological Measurement 45.9 (Sep. 2024). DOI

Abstract

Objective. This study aimed to develop convolutional neural networks (CNNs) models to predict the energy expenditure (EE) of children from raw accelerometer data. Additionally, this study sought to external validation of the CNN models in addition to the linear regression (LM), random forest (RF), and full connected neural network (FcNN) models published in Steenbock et al (2019 J. Meas. Phys. Behav. 2 94–102). Approach. Included in this study were 41 German children (3.0–6.99 years) for the training and internal validation who were equipped with GENEActiv, GT3X+, and activPAL accelerometers. The external validation dataset consisted of 39 Canadian children (3.0–5.99 years) that were equipped with OPAL, GT9X, GENEActiv, and GT3X+ accelerometers. EE was recorded simultaneously in both datasets using a portable metabolic unit. The protocols consisted of a semi-structured activities ranging from low to high intensities. The root mean square error (RMSE) values were calculated and used to evaluate model performances. Main results. (1) The CNNs outperformed the LM (13.17%–23.81% lower mean RMSE values), FcNN (8.13%–27.27% lower RMSE values) and the RF models (3.59%–18.84% lower RMSE values) in the internal dataset. (2) In contrast, it was found that when applied to the external Canadian dataset, the CNN models had consistently higher RMSE values compared to the LM, FcNN, and RF. Significance. Although CNNs can enhance EE prediction accuracy, their ability to generalize to new datasets and accelerometer brands/models, is more limited compared to LM, RF, and FcNN models.

MCML Authors

Lukas Burk

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[239]

H. Schulz-Kümpel, S. Fischer, T. Nagler, A.-L. Boulesteix, B. Bischl and R. Hornung.
Constructing Confidence Intervals for 'the' Generalization Error – a Comprehensive Benchmark Study.
Preprint (Sep. 2024). arXiv

Abstract

When assessing the quality of prediction models in machine learning, confidence intervals (CIs) for the generalization error, which measures predictive performance, are a crucial tool. Luckily, there exist many methods for computing such CIs and new promising approaches are continuously being proposed. Typically, these methods combine various resampling procedures, most popular among them cross-validation and bootstrapping, with different variance estimation techniques. Unfortunately, however, there is currently no consensus on when any of these combinations may be most reliably employed and how they generally compare. In this work, we conduct the first large-scale study comparing CIs for the generalization error - empirically evaluating 13 different methods on a total of 18 tabular regression and classification problems, using four different inducers and a total of eight loss functions. We give an overview of the methodological foundations and inherent challenges of constructing CIs for the generalization error and provide a concise review of all 13 methods in a unified framework. Finally, the CI methods are evaluated in terms of their relative coverage frequency, width, and runtime. Based on these findings, we are able to identify a subset of methods that we would recommend. We also publish the datasets as a benchmarking suite on OpenML and our code on GitHub to serve as a basis for further studies.

MCML Authors

Hannah Schulz-Kümpel

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Sebastian Fischer

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Statistical Learning and Data Science

Roman Hornung

Dr.

Biometry in Molecular Medicine

[238]

A. Stephan, D. Zhu, M. Aßenmacher, X. Shen and B. Roth.
From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks.
Preprint (Sep. 2024). arXiv

Abstract

To reduce the need for human annotations, large language models (LLMs) have been proposed as judges of the quality of other candidate models. LLM judges are typically evaluated by measuring the correlation with human judgments on generation tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that the used judges are mostly unable to improve task performance but are able to pick the better model. Our analysis uncovers a strong correlation between judgment performance and the candidate model task performance. We observe that judges tend to choose the model of higher quality even if its answer is incorrect. Further, we show that it is possible to use statistics, such as the task performances of the individual models, to predict judgment performance. In an ablation, we either swap or mask the candidate answers and observe that judges often keep the original judgment, providing evidence that judges incorporate writing style in their judgments. In summary, we find that regularities in the judgments are quantifiable using statistical measures and provide various angles on exploiting them.

MCML Authors

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[237]

S. Urchs, V. Thurner, M. Aßenmacher, C. Heumann and S. Thiemichen.
Detecting Gender Discrimination on Actor Level Using Linguistic Discourse Analysis.
GeBNLP 2024 - 5th Workshop on Gender Bias in Natural Language Processing. Bangkok, Thailand, Aug 16, 2024. URL

Abstract

With the usage of tremendous amounts of text data for training powerful large language models such as ChatGPT, the issue of analysing and securing data quality has become more pressing than ever. Any biases, stereotypes and discriminatory patterns that exist in the training data can be reproduced, reinforced or broadly disseminated by the models in production. Therefore, it is crucial to carefully select and monitor the text data that is used as input to train the model. Due to the vast amount of training data, this process needs to be (at least partially) automated. In this work, we introduce a novel approach for automatically detecting gender discrimination in text data on the actor level based on linguistic discourse analysis. Specifically, we combine existing information extraction (IE) techniques to partly automate the qualitative research done in linguistic discourse analysis. We focus on two important steps: Identifying the respectiveperson-named-entity (an actor) and all forms it is referred to (Nomination), and detecting the characteristics it is ascribed (Predication). Asa proof of concept, we integrate these two steps into a pipeline for automated text analysis. The separate building blocks of the pipeline could be flexibly adapted, extended, and scaled for bigger datasets to accommodate a wide range of usage scenarios and specific ML tasks or help social scientists with analysis tasks. We showcase and evaluate our approach on several real and simulated exemplary texts.

MCML Authors

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[236]

J. Pavlopoulos, V. Kougia, E. Garces Arias, P. Platanou, S. Shabalin, K. Liagkou, E. Papadatos, H. Essler, J.-B. Camps and F. Fischer.
Challenging Error Correction in Recognised Byzantine Greek.
ML4AL @ACL 2024 - 1st Workshop on Machine Learning for Ancient Languages at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Automatic correction of errors in Handwritten Text Recognition (HTR) output poses persistent challenges yet to be fully resolved. In this study, we introduce a shared task aimed at addressing this challenge, which attracted 271 submissions, yielding only a handful of promising approaches. This paper presents the datasets, the most effective methods, and an experimental analysis in error-correcting HTRed manuscripts and papyri in Byzantine Greek, the language that followed Classical and preceded Modern Greek. By using recognised and transcribed data from seven centuries, the two best-performing methods are compared, one based on a neural encoder-decoder architecture and the other based on engineered linguistic rules. We show that the recognition error rate can be reduced by both, up to 2.5 points at the level of characters and up to 15 at the level of words, while also elucidating their respective strengths and weaknesses.

MCML Authors

Esteban Garces Arias

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[235]

M. Aßenmacher, A. Stephan, L. Weissweiler, E. Çano, I. Ziegler, M. Härttrich, B. Bischl, B. Roth, C. Heumann and H. Schütze.
Collaborative Development of Modular Open Source Educational Resources for Natural Language Processing.
TeachingNLP @ACL 2024 - 6th Workshop on Teaching NLP at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. URL

Abstract

In this work, we present a collaboratively and continuously developed open-source educational resource (OSER) for teaching natural language processing at two different universities. We shed light on the principles we followed for the initial design of the course and the rationale for ongoing developments, followed by a reflection on the inter-university collaboration for designing and maintaining teaching material. When reflecting on the latter, we explicitly emphasize the considerations that need to be made when facing heterogeneous groups and when having to accommodate multiple examination regulations within one single course framework. Relying on the fundamental principles of OSER developments as defined by Bothmann et al. (2023) proved to be an important guideline during this process. The final part pertains to open-sourcing our teaching material, coping with the increasing speed of developments in the field, and integrating the course digitally, also addressing conflicting priorities and challenges we are currently facing.

MCML Authors

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Leonie Weissweiler

B2 | Natural Language Processing
→ Group Hinrich Schütze

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Hinrich Schütze

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Computational Linguistics

[234]

J. G. Wiese, L. Wimmer, T. Papamarkou, B. Bischl, S. Günnemann and D. Rügamer.
Towards Efficient Posterior Sampling in Deep Neural Networks via Symmetry Removal (Extended Abstract).
IJCAI 2024 - 33rd International Joint Conference on Artificial Intelligence. Jeju, Korea, Aug 03-09, 2024. DOI

Abstract

Bayesian inference in deep neural networks is challenging due to the high-dimensional, strongly multi-modal parameter posterior density landscape. Markov chain Monte Carlo approaches asymptotically recover the true posterior but are considered prohibitively expensive for large modern architectures. Local methods, which have emerged as a popular alternative, focus on specific parameter regions that can be approximated by functions with tractable integrals. While these often yield satisfactory empirical results, they fail, by definition, to account for the multi-modality of the parameter posterior. In this work, we argue that the dilemma between exact-but-unaffordable and cheap-but-inexact approaches can be mitigated by exploiting symmetries in the posterior landscape. Such symmetries, induced by neuron interchangeability and certain activation functions, manifest in different parameter values leading to the same functional output value. We show theoretically that the posterior predictive density in Bayesian neural networks can be restricted to a symmetry-free parameter reference set. By further deriving an upper bound on the number of Monte Carlo chains required to capture the functional diversity, we propose a straightforward approach for feasible Bayesian inference. Our experiments suggest that efficient sampling is indeed possible, opening up a promising path to accurate uncertainty quantification in deep learning.

MCML Authors

Lisa Wimmer

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Stephan Günnemann

Prof. Dr.

Data Analytics & Machine Learning

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[233]

L. Bothmann and K. Peters.
Fairness als Qualitätskriterium im Maschinellen Lernen – Rekonstruktion des philosophischen Konzepts und Implikationen für die Nutzung außergesetzlicher Merkmale bei qualifizierten Mietspiegeln.
AStA Wirtschafts- und Sozialstatistisches Archiv 18 (Aug. 2024). DOI

Abstract

With the increased use of machine learning (ML) models within automated decision-making systems, the demands on the quality of ML models are growing. Pure prediction quality is no longer the sole quality criterion; in particular, there is an increasing demand to consider fairness aspects. This paper pursues two goals. First, it summarizes the current fairness discussion in the field of ML (fairML) and describes the most recent developments, especially with respect to the philosophical foundations of the concept of fairness within ML. On the other hand, the question is addressed to what extent so-called ‘extra-legal’ characteristics may be used in the compilation of qualified rent indices. A recent proposal by Kauermann and Windmann (AStA Wirtschafts- und Sozialstatistisches Archiv, Volume 17, 2023) on using extra-legal features in qualified rent indices includes a model-based imputation method, which we contrast with the legal requirements. Finally, we show which alternatives from the field of fairML could be used and outline the different basic philosophical assumptions behind the various methods.

MCML Authors

Ludwig Bothmann

Dr.

Statistical Learning and Data Science

[232]

D. Schalk, R. Rehms, V. S. Hoffmann, B. Bischl and U. Mansmann.
Distributed non-disclosive validation of predictive models by a modified ROC-GLM.
BMC Medical Research Methodology 24.190 (Aug. 2024). DOI

Abstract

Distributed statistical analyses provide a promising approach for privacy protection when analyzing data distributed over several databases. Instead of directly operating on data, the analyst receives anonymous summary statistics, which are combined into an aggregated result. Further, in discrimination model (prognosis, diagnosis, etc.) development, it is key to evaluate a trained model w.r.t. to its prognostic or predictive performance on new independent data. For binary classification, quantifying discrimination uses the receiver operating characteristics (ROC) and its area under the curve (AUC) as aggregation measure. We are interested to calculate both as well as basic indicators of calibration-in-the-large for a binary classification task using a distributed and privacy-preserving approach…

MCML Authors

Daniel Schalk

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[231]

F. Drost, E. Dorigatti, A. Straub, P. Hilgendorf, K. I. Wagner, K. Heyer, M. López Montes, B. Bischl, D. H. Busch, K. Schober and B. Schubert.
Predicting T cell receptor functionality against mutant epitopes.
Cell Genomics 4.9 (Aug. 2024). DOI

Abstract

Cancer cells and pathogens can evade T cell receptors (TCRs) via mutations in immunogenic epitopes. TCR cross-reactivity (i.e., recognition of multiple epitopes with sequence similarities) can counteract such escape but may cause severe side effects in cell-based immunotherapies through targeting self-antigens. To predict the effect of epitope point mutations on T cell functionality, we here present the random forest-based model Predicting T Cell Epitope-Specific Activation against Mutant Versions (P-TEAM). P-TEAM was trained and tested on three datasets with TCR responses to single-amino-acid mutations of the model epitope SIINFEKL, the tumor neo-epitope VPSVWRSSL, and the human cytomegalovirus antigen NLVPMVATV, totaling 9,690 unique TCR-epitope interactions. P-TEAM was able to accurately classify T cell reactivities and quantitatively predict T cell functionalities for unobserved single-point mutations and unseen TCRs. Overall, P-TEAM provides an effective computational tool to study T cell responses against mutated epitopes.

MCML Authors

Emilio Dorigatti

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

[230]

A. Mittermeier, M. Aßenmacher, B. Schachtner, S. Grosu, V. Dakovic, V. Kandratovich, B. Sabel and M. Ingrisch.
Automatische ICD-10-Codierung.
Die Radiologie 64 (Aug. 2024). DOI

Abstract

Hintergrund: Die medizinische Codierung von radiologischen Befunden ist essenziell für eine gute Qualität der Versorgung und die korrekte Abrechnung, gleichzeitig aber eine aufwändige und fehleranfällige Aufgabe.
Ziel der Arbeit: Bewertung der Anwendbarkeit natürlicher Sprachverarbeitung (Natural Language Processing, NLP) für die ICD-10-Codierung von radiologischen Befunden in deutscher Sprache durch Finetuning geeigneter Sprachmodelle.
Material und Methoden: In dieser retrospektiven Studie wurden alle Magnetresonanztomographie(MRT)-Befunde unseres Instituts zwischen 2010 und 2020 berücksichtigt. Die ICD-10-Codes bei Entlassung wurden den jeweiligen Befunden zugeordnet, um einen Datensatz für eine Multiclass-Klassifizierung zu erstellen. Finetuning von GermanBERT und flanT5 wurde auf dem Gesamtdatensatz (dstotal) mit 1035 verschiedenen ICD-10-Codes und zwei reduzierten Datensätzen mit den 100 (ds100) und 50 (ds50) häufigsten Codes durchgeführt. Die Performance der Modelle wurde mit Top-k-Genauigkeit für k = 1, 3, 5 evaluiert. In einer Ablationsstudie wurden beide Modelle einmal auf den zugehörigen Metadaten und dem Befund allein trainiert.
Ergebnisse: Der Gesamtdatensatz bestand aus 100.672 radiologischen Befunden, die reduzierten Datensätze ds100 aus 68.103 und ds50 aus 52.293 Berichten. Die Modellperformance stieg, wenn mehrere der besten Voraussagen des Modells in Betracht gezogen wurden, die Anzahl der Zielklassen reduziert wurde und die Metadaten mit dem Befund kombiniert wurden. FlanT5 übertraf GermanBERT in allen Datensätzen und Metriken und eignet sich am besten als medizinischer Codierungsassistent, wobei eine Top-3-Genauigkeit von fast 70% im realitätsnahen Datensatz dstotal erreicht wurde.
Schlussfolgerung: Finetuning von Sprachmodellen verspricht eine zuverlässige Vorhersage von ICD-10-Codes deutscher radiologischer MRT-Befunde in unterschiedlichen Szenarien. Als Codierungsassistent kann flanT5 medizinischen Codierern helfen, informierte Entscheidungen zu treffen und potenziell ihre Arbeitsbelastung reduzieren.

MCML Authors

Andreas Mittermeier

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Clinical Data Science in Radiology

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science

Balthasar Schachtner

Dr.

Clinical Data Science in Radiology

Michael Ingrisch

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Clinical Data Science in Radiology

[229]

F. Ott, L. Heublein, D. Rügamer, B. Bischl and C. Mutschler.
Fusing structure from motion and simulation-augmented pose regression from optical flow for challenging indoor environments.
Journal of Visual Communication and Image Representation 103 (Aug. 2024). DOI

Abstract

The localization of objects is essential in many applications, such as robotics, virtual and augmented reality, and warehouse logistics. Recent advancements in deep learning have enabled localization using monocular cameras. Traditionally, structure from motion (SfM) techniques predict an object’s absolute position from a point cloud, while absolute pose regression (APR) methods use neural networks to understand the environment semantically. However, both approaches face challenges from environmental factors like motion blur, lighting changes, repetitive patterns, and featureless areas. This study addresses these challenges by incorporating additional information and refining absolute pose estimates with relative pose regression (RPR) methods. RPR also struggles with issues like motion blur. To overcome this, we compute the optical flow between consecutive images using the Lucas–Kanade algorithm and use a small recurrent convolutional network to predict relative poses. Combining absolute and relative poses is difficult due to differences between global and local coordinate systems. Current methods use pose graph optimization (PGO) to align these poses. In this work, we propose recurrent fusion networks to better integrate absolute and relative pose predictions, enhancing the accuracy of absolute pose estimates. We evaluate eight different recurrent units and create a simulation environment to pre-train the APR and RPR networks for improved generalization. Additionally, we record a large dataset of various scenarios in a challenging indoor environment resembling a warehouse with transportation robots. Through hyperparameter searches and experiments, we demonstrate that our recurrent fusion method outperforms PGO in effectiveness.

MCML Authors

Felix Ott

Dr.

* Former Member

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[228]

L. Bothmann, K. Peters, S. Dandl, M. Schomaker and B. Bischl.
Causal Fair Machine Learning.
GMDS/IBS-DR - 54. Arbeitstagung der Arbeitsgruppen Statistical Computing, Klassifikation und Datenanalyse in den Biowissenschaften. Günzburg, Germany, Jul 28-31, 2024. PDF

Abstract

A growing body of literature in fairness-aware ML aspires to mitigate machine learning (ML)-related unfairness in automated decision-making (ADM) by defining metrics that measure the fairness of an ML model and by proposing methods that ensure that trained ML models achieve low values in those metrics (see, e.g., Verma & Rubin, 2018, Caton & Haas, 2023). However, the underlying concept of fairness, i.e., the question of what fairness is, is rarely discussed, leaving a considerable gap between centuries of philosophical discussion and the recent adoption of the concept in the ML community. We bridge this gap by formalizing a consistent concept of fairness and translating the philosophical considerations into a formal framework for training and evaluating ML models in ADM systems (Bothmann et al., 2024). We argue why and how causal considerations are necessary when assessing fairness in the presence of protected attributes (PAs) by proposing a fictitious, normatively desired (FiND) world where the PAs have no (direct or indirect) causal effect on the target. In practice, this unknown FiND world must be approximated by a warped world, for which the causal effects of the PAs must be removed from the real-world data. We propose rank-preserving interventional distributions to define an estimand of this FiND world and a warping method for estimation (Bothmann et al., 2023). Evaluation criteria for both the method and the resulting ML model are presented. Experiments on simulated data show that our method effectively identifies the most discriminated individuals and mitigates unfairness. Experiments on real-world data showcase the practical application of our method.

MCML Authors

Ludwig Bothmann

Dr.

Statistical Learning and Data Science

Susanne Dandl

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Michael Schomaker

Prof. Dr.

Biostatistics

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[227]

L. Burk, J. Zobolas, B. Bischl, A. Bender, M. N. Wright and R. Sonabend.
A Large-Scale Neutral Comparison Study of Survival Models on Low-Dimensional Data.
GMDS/IBS-DR - 54. Arbeitstagung der Arbeitsgruppen Statistical Computing, Klassifikation und Datenanalyse in den Biowissenschaften. Günzburg, Germany, Jul 28-31, 2024. PDF

Abstract

This work presents the first large-scale neutral benchmark experiment focused on single-event, right-censored, low-dimensional survival data. Benchmark experiments are essential in methodological research to scientifically compare new and existing model classes through proper empirical evaluation. Existing benchmarks in the survival literature are often narrow in scope, focusing, for example, on high-dimensional data. Additionally, they may lack appropriate tuning or evaluation procedures, or are qualitative reviews, rather than quantitative comparisons. This comprehensive study aims to fill the gap by neutrally evaluating a broad range of methods and providing generalizable conclusions. We benchmark 18 models, ranging from classical statistical approaches to many common machine learning methods, on 32 publicly available datasets. The benchmark tunes for both a discrimination measure and a proper scoring rule to assess performance in different settings. Evaluating on 8 survival metrics, we assess discrimination, calibration, and overall predictive performance of the tested models. Using discrimination measures, we find that no method significantly outperforms the Cox model. However, (tuned) Accelerated Failure Time models were able to achieve significantly better results with respect to overall predictive performance as measured by the right-censored log-likelihood. Machine learning methods that performed comparably well include Oblique Random Survival Forests under discrimination, and Cox-based likelihood-boosting under overall predictive performance. We conclude that for predictive purposes in the standard survival analysis setting of low-dimensional, right-censored data, the Cox Proportional Hazards model remains a simple and robust method, sufficient for practitioners.

MCML Authors

Lukas Burk

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

[226]

M. Herrmann, F. J. D. Lange, K. Eggensperger, G. Casalicchio, M. Wever, M. Feurer, D. Rügamer, E. Hüllermeier, A.-L. Boulesteix and B. Bischl.
Position: Why We Must Rethink Empirical Research in Machine Learning.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL

Abstract

We warn against a common but incomplete understanding of empirical research in machine learning (ML) that leads to non-replicable results, makes findings unreliable, and threatens to undermine progress in the field. To overcome this alarming situation, we call for more awareness of the plurality of ways of gaining knowledge experimentally but also of some epistemic limitations. In particular, we argue most current empirical ML research is fashioned as confirmatory research while it should rather be considered exploratory.

MCML Authors

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Marcel Wever

Dr.

* Former Member

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[225]

F. Karl, M. Kemeter, G. Dax and P. Sierak.
Position: Embracing Negative Results in Machine Learning.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL

Abstract

Publications proposing novel machine learning methods are often primarily rated by exhibited predictive performance on selected problems. In this position paper we argue that predictive performance alone is not a good indicator for the worth of a publication. Using it as such even fosters problems like inefficiencies of the machine learning research community as a whole and setting wrong incentives for researchers. We therefore put out a call for the publication of “negative” results, which can help alleviate some of these problems and improve the scientific output of the machine learning research community. To substantiate our position, we present the advantages of publishing negative results and provide concrete measures for the community to move towards a paradigm where their publication is normalized.

MCML Authors

Florian Karl

Statistical Learning and Data Science

[224]

M. Lindauer, F. Karl, A. Klier, J. Moosbauer, A. Tornede, A. C. Mueller, F. Hutter, M. Feurer and B. Bischl.
Position: A Call to Action for a Human-Centered AutoML Paradigm.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL

Abstract

Automated machine learning (AutoML) was formed around the fundamental objectives of automatically and efficiently configuring machine learning (ML) workflows, aiding the research of new ML algorithms, and contributing to the democratization of ML by making it accessible to a broader audience. Over the past decade, commendable achievements in AutoML have primarily focused on optimizing predictive performance. This focused progress, while substantial, raises questions about how well AutoML has met its broader, original goals. In this position paper, we argue that a key to unlocking AutoML’s full potential lies in addressing the currently underexplored aspect of user interaction with AutoML systems, including their diverse roles, expectations, and expertise. We envision a more human-centered approach in future AutoML research, promoting the collaborative design of ML systems that tightly integrates the complementary strengths of human expertise and AutoML methodologies.

MCML Authors

Florian Karl

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Julia Moosbauer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

[223]

D. Rügamer, C. Kolb, T. Weber, L. Kook and T. Nagler.
Generalizing orthogonalization for models with non-linearities.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL

Abstract

The complexity of black-box algorithms can lead to various challenges, including the introduction of biases. These biases present immediate risks in the algorithms’ application. It was, for instance, shown that neural networks can deduce racial information solely from a patient’s X-ray scan, a task beyond the capability of medical experts. If this fact is not known to the medical expert, automatic decision-making based on this algorithm could lead to prescribing a treatment (purely) based on racial information. While current methodologies allow for the ‘‘orthogonalization’’ or ‘’normalization’’ of neural networks with respect to such information, existing approaches are grounded in linear models. Our paper advances the discourse by introducing corrections for non-linearities such as ReLU activations. Our approach also encompasses scalar and tensor-valued predictions, facilitating its integration into neural network architectures. Through extensive experiments, we validate our method’s effectiveness in safeguarding sensitive data in generalized linear models, normalizing convolutional neural networks for metadata, and rectifying pre-existing embeddings for undesired attributes.

MCML Authors

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

Chris Kolb

Statistical Learning and Data Science

Tobias Weber

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Thomas Nagler

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group David Rügamer

Computational Statistics & Data Science

[222]

E. Sommer, L. Wimmer, T. Papamarkou, L. Bothmann, B. Bischl and D. Rügamer.
Connecting the Dots: Is Mode Connectedness the Key to Feasible Sample-Based Inference in Bayesian Neural Networks?
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL

Abstract

A major challenge in sample-based inference (SBI) for Bayesian neural networks is the size and structure of the networks’ parameter space. Our work shows that successful SBI is possible by embracing the characteristic relationship between weight and function space, uncovering a systematic link between overparameterization and the difficulty of the sampling problem. Through extensive experiments, we establish practical guidelines for sampling and convergence diagnosis. As a result, we present a Bayesian deep ensemble approach as an effective solution with competitive performance and uncertainty quantification.

MCML Authors

Emanuel Sommer

Statistics, Data Science and Machine Learning

Lisa Wimmer

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[221]

S. Dandl, K. Blesch, T. Freiesleben, G. König, J. Kapar, B. Bischl and M. N. Wright.
CountARFactuals – Generating plausible model-agnostic counterfactual explanations with adversarial random forests.
xAI 2024 - 2nd World Conference on Explainable Artificial Intelligence. Valletta, Malta, Jul 17-19, 2024. DOI

Abstract

Counterfactual explanations elucidate algorithmic decisions by pointing to scenarios that would have led to an alternative, desired outcome. Giving insight into the model’s behavior, they hint users towards possible actions and give grounds for contesting decisions. As a crucial factor in achieving these goals, counterfactuals must be plausible, i.e., describing realistic alternative scenarios within the data manifold. This paper leverages a recently developed generative modeling technique – adversarial random forests (ARFs) – to efficiently generate plausible counterfactuals in a model-agnostic way. ARFs can serve as a plausibility measure or directly generate counterfactual explanations. Our ARF-based approach surpasses the limitations of existing methods that aim to generate plausible counterfactual explanations: It is easy to train and computationally highly efficient, handles continuous and categorical data naturally, and allows integrating additional desiderata such as sparsity in a straightforward manner.

MCML Authors

Susanne Dandl

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[220]

F. K. Ewald, L. Bothmann, M. N. Wright, B. Bischl, G. Casalicchio and G. König.
A Guide to Feature Importance Methods for Scientific Inference.
xAI 2024 - 2nd World Conference on Explainable Artificial Intelligence. Valletta, Malta, Jul 17-19, 2024. DOI

Abstract

While machine learning (ML) models are increasingly used due to their high predictive power, their use in understanding the data-generating process (DGP) is limited. Understanding the DGP requires insights into feature-target associations, which many ML models cannot directly provide due to their opaque internal mechanisms. Feature importance (FI) methods provide useful insights into the DGP under certain conditions. Since the results of different FI methods have different interpretations, selecting the correct FI method for a concrete use case is crucial and still requires expert knowledge. This paper serves as a comprehensive guide to help understand the different interpretations of global FI methods. Through an extensive review of FI methods and providing new proofs regarding their interpretation, we facilitate a thorough understanding of these methods and formulate concrete recommendations for scientific inference. We conclude by discussing options for FI uncertainty estimation and point to directions for future research aiming at full statistical inference from black-box ML models.

MCML Authors

Fiona Katharina Ewald

Statistical Learning and Data Science

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

[219]

C. A. Scholbeck, H. Funk and G. Casalicchio.
Algorithm-Agnostic Feature Attributions for Clustering.
xAI 2024 - 2nd World Conference on Explainable Artificial Intelligence. Valletta, Malta, Jul 17-19, 2024. DOI

Abstract

Understanding how assignments of instances to clusters can be attributed to the features can be vital in many applications. However, research to provide such feature attributions has been limited. Clustering algorithms with built-in explanations are scarce. Common algorithm-agnostic approaches involve dimension reduction and subsequent visualization, which transforms the original features used to cluster the data; or training a supervised learning classifier on the found cluster labels, which adds additional and intractable complexity. We present FACT (feature attributions for clustering), an algorithm-agnostic framework that preserves the integrity of the data and does not introduce additional models. As the defining characteristic of FACT, we introduce a set of work stages: sampling, intervention, reassignment, and aggregation. Furthermore, we propose two novel FACT methods: SMART (scoring metric after permutation) measures changes in cluster assignments by custom scoring functions after permuting selected features; IDEA (isolated effect on assignment) indicates local and global changes in cluster assignments after making uniform changes to selected features.

MCML Authors

Henri Funk

C4 | Computational Social Sciences
→ Group Helmut Küchenhoff

Statistical Consulting Unit (StaBLab)

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[218]

S. Dandl, M. Becker, B. Bischl, G. Casalicchio and L. Bothmann.
mlr3summary: Concise and interpretable summaries for machine learning models.
xAI 2024 - Demo Track of the 2nd World Conference on Explainable Artificial Intelligence. Valletta, Malta, Jul 17-19, 2024. arXiv

Abstract

This work introduces a novel R package for concise, informative summaries of machine learning models. We take inspiration from the summary function for (generalized) linear models in R, but extend it in several directions: First, our summary function is model-agnostic and provides a unified summary output also for non-parametric machine learning models; Second, the summary output is more extensive and customizable – it comprises information on the dataset, model performance, model complexity, model’s estimated feature importances, feature effects, and fairness metrics; Third, models are evaluated based on resampling strategies for unbiased estimates of model performances, feature importances, etc. Overall, the clear, structured output should help to enhance and expedite the model selection process, making it a helpful tool for practitioners and researchers alike.

MCML Authors

Susanne Dandl

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Marc Becker

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[217]

L. Kook, C. Kolb, P. Schiele, D. Dold, M. Arpogaus, C. Fritz, P. Baumann, P. Kopper, T. Pielok, E. Dorigatti and D. Rügamer.
How Inverse Conditional Flows Can Serve as a Substitute for Distributional Regression.
UAI 2024 - 40th Conference on Uncertainty in Artificial Intelligence. Barcelona, Spain, Jul 16-18, 2024. URL

Abstract

Neural network representations of simple models, such as linear regression, are being studied increasingly to better understand the underlying principles of deep learning algorithms. However, neural representations of distributional regression models, such as the Cox model, have received little attention so far. We close this gap by proposing a framework for distributional regression using inverse flow transformations (DRIFT), which includes neural representations of the aforementioned models. We empirically demonstrate that the neural representations of models in DRIFT can serve as a substitute for their classical statistical counterparts in several applications involving continuous, ordered, time-series, and survival outcomes. We confirm that models in DRIFT empirically match the performance of several statistical methods in terms of estimation of partial effects, prediction, and aleatoric uncertainty quantification. DRIFT covers both interpretable statistical models and flexible neural networks opening up new avenues in both statistical modeling and deep learning.

MCML Authors

Chris Kolb

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Tobias Pielok

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Emilio Dorigatti

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

[216]

Y. Sale, P. Hofman, T. Löhr, L. Wimmer, T. Nagler and E. Hüllermeier.
Label-wise Aleatoric and Epistemic Uncertainty Quantification.
UAI 2024 - 40th Conference on Uncertainty in Artificial Intelligence. Barcelona, Spain, Jul 16-18, 2024. URL

Abstract

We present a novel approach to uncertainty quantification in classification tasks based on label-wise decomposition of uncertainty measures. This label-wise perspective allows uncertainty to be quantified at the individual class level, thereby improving cost-sensitive decision-making and helping understand the sources of uncertainty. Furthermore, it allows to define total, aleatoric, and epistemic uncertainty on the basis of non-categorical measures such as variance, going beyond common entropy-based measures. In particular, variance-based measures address some of the limitations associated with established methods that have recently been discussed in the literature. We show that our proposed measures adhere to a number of desirable properties. Through empirical evaluation on a variety of benchmark data sets – including applications in the medical domain where accurate uncertainty quantification is crucial – we establish the effectiveness of label-wise uncertainty quantification.

MCML Authors

Yusuf Sale

Artificial Intelligence and Machine Learning

Paul Hofman

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Artificial Intelligence and Machine Learning

Lisa Wimmer

Statistical Learning and Data Science

Thomas Nagler

Prof. Dr.

Computational Statistics & Data Science

Eyke Hüllermeier

Prof. Dr.

C4 | Computational Social Sciences
→ Group Helmut Küchenhoff

Artificial Intelligence and Machine Learning

[215]

J. Piller, H. Küchenhoff and A. Bender.
Flexible additive models for multi-event survival analysis.
IWSM 2024 - 38th International Workshop on Statistical Modelling. Durham, UK, Jul 14-19, 2024. PDF

Abstract

Piecewise Exponential Additive Mixed Models (PAMMs) (Bender et al., 2018) have gained popularity in various domains due to their ability to tackle a wide variety of survival problems and their flexibility to model non-linear covariate effects, including time-varying effects and cumulative effects (Bender et al., 2019). One advantage of such reduction techniques is that they do not require any specialised software for the estimation of the model parameters. Thus, in the case of the PAMM, they can be conveniently estimated using generalized additive mixed modeling methodology or, for example, respective boosting or deep learning based approaches (Bender et al., 2022). Nevertheless, their use in practice requires pre-processing, which differs depending on the survival task at hand (e.g. left-truncation, competing risks, etc.) and post-processing (e.g. transforming estimated parameters to useful quantities like survival or transition probabilities). The R package pammtools facilitates the entire modeling process, so far, however, only for single-event data. Here we extend the framework and package capabilities to handle general multi-state models.

MCML Authors

Johannes Piller

Statistical Consulting Unit (StaBLab)

Helmut Küchenhoff

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Consulting Unit (StaBLab)

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

[214]

S. Dandl, M. Becker, B. Bischl, G. Casalicchio and L. Bothmann.
mlr3summary: Concise and interpretable summaries for machine learning models.
useR! 2024 - International R User Conference. Salzburg, Austria, Jul 08-22, 2024. arXiv GitHub

Abstract

MCML Authors

Susanne Dandl

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Marc Becker

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[213]

S. Fischer and M. Binder.
mlr3torch - Deep Learning in R.
useR! 2024 - International R User Conference. Salzburg, Austria, Jul 08-22, 2024. GitHub

Abstract

mlr3torch is a deep learning framework for the mlr3 ecosystem built on top of torch. It allows to easily build, train and evaluate deep learning models in a few lines of codes, without needing to worry about low-level details. Off-the-shelf learners are readily available, but custom architectures can be defined by connecting PipeOpTorch operators in an mlr3pipelines::Graph.

MCML Authors

Sebastian Fischer

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[212]

B. Ronval, S. Nijssen and L. Bothmann.
Can generative AI-based data balancing mitigate unfairness issues in Machine Learning?
EWAF 2024 - 3rd European Workshop on Algorithmic Fairness. Mainz, Germany, Jul 01-03, 2024. PDF

Abstract

Data imbalance in the protected attributes can lead to machine learning models that perform better on the majority than on the minority group, giving rise to unfairness issues. While baseline methods like undersampling or SMOTE can balance datasets, we investigate how methods of generative artificial intelligence compare concerning classical fairness metrics. Using generated fake data, we propose different balancing methods and investigate the behavior of classification models in thorough benchmark studies using German credit and Berkeley admission data. While our experiments suggest that such methods may improve fairness metrics, further investigations are necessary to derive clear practical recommendations.

MCML Authors

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[211]

F. Karl, J. Thomas, J. Elstner, R. Gross and B. Bischl.
Automated Machine Learning.
Unlocking Artificial Intelligence (Jul. 2024). DOI

Abstract

In the past few years automated machine learning (AutoML) has gained a lot of traction in the data science and machine learning community. AutoML aims at reducing the partly repetitive work of data scientists and enabling domain experts to construct machine learning pipelines without extensive knowledge in data science. This chapter presents a comprehensive review of the current leading AutoML methods and sets AutoML in an industrial context. To this extent we present the typical components of an AutoML system, give an overview over the stateof-the-art and highlight challenges to industrial application by presenting several important topics such as AutoML for time series data, AutoML in unsupervised settings, AutoML with multiple evaluation criteria, or interactive human-in-the-loop methods. Finally, the connection to Neural Architecture Search (NAS) is presented and a brief review with special emphasis on hardware-aware NAS is given.

MCML Authors

Florian Karl

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Janek Thomas

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[210]

B. Deiseroth, M. Meuer, N. Gritsch, C. Eichenberg, P. Schramowski, M. Aßenmacher and K. Kersting.
Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization.
NAACL 2024 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Mexico City, Mexico, Jun 16-21, 2024. DOI

Abstract

Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. However, their ever-increasing size has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach to assessing compressed LLMs, addressing the limitations of traditional perplexity or accuracy measures that fail to accurately reflect text generation quality. DTMs measure token divergences that allow deeper insights into the subtleties of model compression, in particular, when evaluating components’ impacts individually. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that 25% of all attention components can be pruned beyond 90% on the Llama-2 model family, still keeping SOTA performance. For quantization, FDTM suggests that more than 80% of parameters can be naively transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually—and that FDTM can identify those—while standard metrics result in deteriorated outcomes.

MCML Authors

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science

[209]

L. Mayer, C. Heumann and M. Aßenmacher.
Can OpenSource beat ChatGPT? - A Comparative Study of Large Language Models for Text-to-Code Generation.
SwissText 2024 - Swiss Text Analytics Conference. Chur, Switzerland, Jun 10-11, 2024. URL

Abstract

In recent years, large language models (LLMs) have emerged as powerful tools with potential applications in various fields, including software engineering. Within the scope of this research, we evaluate five different state-of-the-art LLMs - Bard, BingChat, ChatGPT, Llama2, and Code Llama - concerning their capabilities for text-to-code generation. In an empirical study, we feed prompts with textual descriptions of coding problems sourced from the programming website LeetCode to the models with the task of creating solutions in Python. Subsequently, the quality of the generated outputs is assessed using the testing functionalities of LeetCode. The results indicate large differences in performance between the investigated models. ChatGPT can handle these typical programming challenges by far the most effectively, surpassing even code-specialized models like Code Llama. To gain further insights, we measure the runtime as well as the memory usage of the generated outputs and compared them to the other code submissions on Leetcode. A detailed error analysis, encompassing a comparison of the differences concerning correct indentation and form of the generated code as well as an assignment of the incorrectly solved tasks to certain error categories allows us to obtain a more nuanced picture of the results and potential for improvement. The results also show a clear pattern of increasingly incorrect produced code when the models are facing a lot of context in the form of longer prompts.

MCML Authors

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[208]

J. Ramjith, A. Bender, K. C. B. Roes and M. A. Jonker.
Recurrent events analysis with piece-wise exponential additive mixed models.
Statistical Modelling 24.3 (Jun. 2024). DOI

Abstract

Recurrent events analysis plays an important role in many applications, including the study of chronic diseases or recurrence of infections. Historically, many models for recurrent events have been variants of the Cox model. In this article we introduce and describe the application of the piece-wise exponential Additive Mixed Model (PAMM) for recurrent events analysis and illustrate how PAMMs can be used to flexibly model the dependencies in recurrent events data. Simulations confirm that PAMMs provide unbiased estimates as well as equivalence to the Cox model when proportional hazards are assumed. Applications to recurrence of staphylococcus aureus and malaria in children illustrate the estimation of seasonality, bivariate non-linear effects, multiple timescales and relaxation of the proportional hazards assumption via time-varying effects. The R package pammtools is extended to facilitate estimation and visualization of PAMMs for recurrent events data.

MCML Authors

Andreas Bender

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Machine Learning Consulting Unit (MLCU)

[207]

B. Felderer, L. Repke, W. Weber, J. Schweisthal and L. Bothmann.
Predicting the Validity and Reliability of Survey Questions.
Preprint (Jun. 2024). DOI

Abstract

The Survey Quality Predictor (SQP) is an open-access system to predict the quality, i.e., the reliability and validity, of survey questions based on the characteristics of the questions. The prediction is based on a meta-regression of many multitrait-multimethod (MTMM) experiments in which characteristics of the survey questions were systematically varied. The release of SQP 3.0 that is based on an expanded data base as compared to previous SQP versions raised the need for a new meta-regression. To find the best method for analyzing the complex data structure of SQP (e.g., the existence of various uncorrelated predictors), we compared four suitable machine learning methods in terms of their ability to predict both survey quality indicators: LASSO, elastic net, boosting and random forest. The article discusses the performance of the models and illustrates the importance of the individual item characteristics in the random forest model, which was chosen for SQP 3.0.

MCML Authors

Jonas Schweisthal

C4 | Computational Social Sciences
→ Group Stefan Feuerriegel

Artificial Intelligence in Management

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[206]

E. Dorigatti.
Cancer immunotherapy design and analysis through discrete optimization, positive-unlabeled learning, and semi-structured regression models.
Dissertation 2024. DOI

Abstract

This thesis advances precision medicine by leveraging artificial intelligence to improve cancer immunotherapy development and tackle key challenges in clinical trials, where high failure rates often stem from insufficient understanding of patient and disease-specific factors. Through novel computational frameworks for cancer vaccine design, methods for handling imbalanced biological data, and hybrid modeling techniques that combine clinical data with imaging, this work demonstrates AI’s potential to personalize and accelerate therapeutic development. These contributions collectively pave the way for more effective, targeted treatments, potentially reducing the time and cost to bring new therapies to market. (Shortened).

MCML Authors

Emilio Dorigatti

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[205]

C. A. Scholbeck.
Bridging gaps in interpretable machine learning: sensitivity analysis, marginal effects, and cluster explanations.
Dissertation 2024. DOI

Abstract

This thesis explores interpretable machine learning (IML) through six papers, bridging the gap between IML and model interpretation in other domains. It presents a generalized framework for model-agnostic interpretation methods, highlights potential pitfalls, and connects IML to sensitivity analysis used in fields like environmental modeling. A novel approach, forward marginal effects (FMEs), is introduced to interpret predictive models at multiple levels, supported by the R package fmeffects. The work also extends IML to unsupervised learning by proposing algorithm-agnostic cluster explanation methods, including two new techniques: SMART and IDEA, for analyzing feature contributions to clustering. (Shortened.)

MCML Authors

Christian Alexander Scholbeck

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[204]

K. Röck.
Stochastic processes as surrogate models for dynamical systems in magnetic confinement fusion.
Dissertation 2024. DOI

Abstract

This thesis focuses on incorporating domain-specific knowledge into machine learning (ML) models for scientific applications, ensuring they accurately reflect underlying physical systems.
The first part introduces physics-consistent Gaussian processes (GPs), embedding physical laws directly into the model. These models address data governed by partial differential equations (PDEs) and Hamiltonian systems, preserving physical properties like symplecticity and enabling faster, long-term simulations. Applications include classifying chaotic trajectories and computing Lyapunov exponents.
The second part tackles data scarcity in plasma physics by proposing robust surrogate models for multivariate time series. Using Student-$t$ process regression, these models handle outliers effectively and facilitate data imputation and augmentation, ensuring reliable predictions for multichannel sensor data.
This work advances ML approaches for surrogate modeling, chaos analysis, and plasma physics. (Shortened.)

MCML Authors

Katharina Röck (née Rath)

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[203]

R. Kohli, M. Feurer, B. Bischl, K. Eggensperger and F. Hutter.
Towards Quantifying the Effect of Datasets for Benchmarking: A Look at Tabular Machine Learning.
DMLR @ICLR 2024 - Workshop on Data-centric Machine Learning Research at the 12th International Conference on Learning Representations (ICLR 2024). Vienna, Austria, May 07-11, 2024. URL

Abstract

Data in tabular form makes up a large part of real-world ML applications, and thus, there has been a strong interest in developing novel deep learning (DL) architectures for supervised learning on tabular data in recent years. As a result, there is a debate as to whether DL methods are superior to the ubiquitous ensembles of boosted decision trees. Typically, the advantage of one model class over the other is claimed based on an empirical evaluation, where different variations of both model classes are compared on a set of benchmark datasets that supposedly resemble relevant real-world tabular data. While the landscape of state-of-the-art models for tabular data changed, one factor has remained largely constant over the years: The datasets. Here, we examine 30 recent publications and 187 different datasets they use, in terms of age, study size and relevance. We found that the average study used less than 10 datasets and that half of the datasets are older than 20 years. Our insights raise questions about the conclusions drawn from previous studies and urge the research community to develop and publish additional recent, challenging and relevant datasets and ML tasks for supervised learning on tabular data.

MCML Authors

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[202]

A. Vahidi, S. Schosser, L. Wimmer, Y. Li, B. Bischl, E. Hüllermeier and M. Rezaei.
Probabilistic Self-supervised Representation Learning via Scoring Rules Minimization.
ICLR 2024 - 12th International Conference on Learning Representations. Vienna, Austria, May 07-11, 2024. URL GitHub

Abstract

In this paper, we propose a novel probabilistic self-supervised learning via Scoring Rule Minimization (ProSMIN), which leverages the power of probabilistic models to enhance representation quality and mitigate collapsing representations. Our proposed approach involves two neural networks; the online network and the target network, which collaborate and learn the diverse distribution of representations from each other through knowledge distillation. By presenting the input samples in two augmented formats, the online network is trained to predict the target network representation of the same sample under a different augmented view. The two networks are trained via our new loss function based on proper scoring rules. We provide a theoretical justification for ProSMIN’s convergence, demonstrating the strict propriety of its modified scoring rule. This insight validates the method’s optimization process and contributes to its robustness and effectiveness in improving representation quality. We evaluate our probabilistic model on various downstream tasks, such as in-distribution generalization, out-of-distribution detection, dataset corruption, low-shot learning, and transfer learning. Our method achieves superior accuracy and calibration, surpassing the self-supervised baseline in a wide range of experiments on large-scale datasets like ImageNet-O and ImageNet-C, ProSMIN demonstrates its scalability and real-world applicability.

MCML Authors

Lisa Wimmer

Statistical Learning and Data Science

Yawei Li

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Eyke Hüllermeier

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Artificial Intelligence and Machine Learning

Mina Rezaei

Dr.

Statistical Learning and Data Science

[201]

A. Solderer, S. P. Hicklin, M. Aßenmacher, A. Ender and P. R. Schmidlin.
Influence of an allogenic collagen scaffold on implant sites with thin supracrestal tissue height: a randomized clinical trial.
Clinical Oral Investigations 28.313 (May. 2024). DOI

Abstract

Objectives: This randomized clinical trial focused on patients with thin peri-implant soft-tissue height (STH) (≤ 2.5 mm) and investigated the impact of an allogenic collagen scaffold (aCS) on supracrestal tissue height and marginal bone loss (MBL).
Material & methods: Forty patients received bone level implants and were randomly assigned to the test group with simultaneous tissue thickening with aCS or the control group. After three months, prosthetic restoration occurred. STH measurements were taken at baseline (T0) and reopening surgery (TR), with MBL assessed at 12 months (T1). Descriptive statistics were calculated for continuous variables, and counts for categorical variables (significance level, p = 0.05).
Results: At T1, 37 patients were available. At T0, control and test groups had mean STH values of 2.3 ± 0.3 mm and 2.1 ± 0.4 mm. TR revealed mean STH values of 2.3 ± 0.2 mm (control) and 2.6 ± 0.7 mm (test), with a significant tissue thickening of 0.5 ± 0.6 mm in the test group (p < 0.03). At T1, control and test groups showed MBL mean values of 1.1 ± 0.8 mm and 1.0 ± 0.6 mm, with a moderate but significant correlation with STH thickening (-0.34), implant position (0.43), history of periodontitis (0.39), and smoking status (0.27).
Conclusion: The use of an aCS protocol resulted in soft tissue thickening but did not reach a threshold to reliably reduce MBL compared to the control group within the study’s limitations.
Clinical relevance: Peri-implant STH is crucial for maintaining peri-implant marginal bone stability. Marginal bone stability represents a crucial factor in prevention of peri-implantitis development.

MCML Authors

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[200]

K. Jeblick, B. Schachtner, J. Dexl, A. Mittermeier, A. T. Stüber, J. Topalis, T. Weber, P. Wesp, B. O. Sabel, J. Ricke and M. Ingrisch.
ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports.
European Radiology 34 (May. 2024). DOI

Abstract

Objectives: To assess the quality of simplified radiology reports generated with the large language model (LLM) ChatGPT and to discuss challenges and chances of ChatGPT-like LLMs for medical text simplification.
Methods: In this exploratory case study, a radiologist created three fictitious radiology reports which we simplified by prompting ChatGPT with ‘Explain this medical report to a child using simple language.’’ In a questionnaire, we tasked 15 radiologists to rate the quality of the simplified radiology reports with respect to their factual correctness, completeness, and potential harm for patients. We used Likert scale analysis and inductive free-text categorization to assess the quality of the simplified reports.
Results: Most radiologists agreed that the simplified reports were factually correct, complete, and not potentially harmful to the patient. Nevertheless, instances of incorrect statements, missed relevant medical information, and potentially harmful passages were reported.
Conclusion: While we see a need for further adaption to the medical field, the initial insights of this study indicate a tremendous potential in using LLMs like ChatGPT to improve patient-centered care in radiology and other medical domains.
Clinical relevance statement: Patients have started to use ChatGPT to simplify and explain their medical reports, which is expected to affect patient-doctor interaction. This phenomenon raises several opportunities and challenges for clinical routine.

MCML Authors

Katharina Jeblick

Dr.

Clinical Data Science in Radiology

Balthasar Schachtner

Dr.

Clinical Data Science in Radiology

Jakob Dexl

Clinical Data Science in Radiology

Andreas Mittermeier

Dr.

Clinical Data Science in Radiology

Theresa Stüber

Clinical Data Science in Radiology

Johanna Topalis

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Clinical Data Science in Radiology

Tobias Weber

* Former Member

Philipp Wesp

Dr.

Clinical Data Science in Radiology

Michael Ingrisch

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Clinical Data Science in Radiology

[199]

R. Debelak, T. Koch, M. Aßenmacher and C. Stachl.
From Embeddings to Explainability: A Tutorial on Transformer-Based Text Analysis for Social and Behavioral Scientists.
Preprint (May. 2024). DOI

Abstract

Large language models and their use for text analysis have had a significant impact on psychology and the social and behavioral sciences in general. Key applications include the analysis of texts, such as social media posts, to infer psychological characteristics, as well as survey and interview analysis. In this tutorial paper, we demonstrate the use of the Python-based natural language processing software package transformers (and related modules from the Hugging Face Ecosystem) that allow for the automated classification of text inputs in a practical exercise. In doing so, we rely on pretrained transformer models which can be fine-tuned to a specific task and domain. The first proposed application of this model class is to use it as a feature extractor, allowing for the transformation of written text into real-valued numerical vectors (called ’embeddings’) that capture a text’s semantic meaning. These vectors can, in turn, be used as input for a subsequent machine-learning model. The second presented application of transformer models is the end-to-end training (so-called ‘fine-tuning’) of the model. This results in a direct prediction of the label within the same model that directly maps the text to the embeddings. While in the second case, results are usually better and training works more seamlessly, the model itself is often not directly interpretable. We showcase an alleviation of this issue via the application of post-hoc interpretability methods by calculating SHAP values and applying local interpretable model-agnostic explanations (LIME) in an attempt to explain the model’s inner workings.

MCML Authors

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science

[198]

H. A. Gündüz, R. Mreches, J. Moosbauer, G. Robertson, X.-Y. To, E. A. Franzosa, C. Huttenhower, M. Rezaei, A. C. McHardy, B. Bischl, P. C. Münch and M. Binder.
Optimized model architectures for deep learning on genomic data.
Communications Biology 7.1 (Apr. 2024). DOI

Abstract

The success of deep learning in various applications depends on task-specific architecture design choices, including the types, hyperparameters, and number of layers. In computational biology, there is no consensus on the optimal architecture design, and decisions are often made using insights from more well-established fields such as computer vision. These may not consider the domain-specific characteristics of genome sequences, potentially limiting performance. Here, we present GenomeNet-Architect, a neural architecture design framework that automatically optimizes deep learning models for genome sequence data. It optimizes the overall layout of the architecture, with a search space specifically designed for genomics. Additionally, it optimizes hyperparameters of individual layers and the model training procedure. On a viral classification task, GenomeNet-Architect reduced the read-level misclassification rate by 19%, with 67% faster inference and 83% fewer parameters, and achieved similar contig-level accuracy with ~100 times fewer parameters compared to the best-performing deep learning baselines.

MCML Authors

Hüseyin Anil Gündüz

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Julia Moosbauer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Xiao-Yin To

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Mina Rezaei

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Martin Binder

Statistical Learning and Data Science

[197]

V. Gkolemis, C. Diou, E. Ntoutsi, T. Dalamagas, B. Bischl, J. Herbinger and G. Casalicchio.
Effector: A Python package for regional explanations.
Preprint (Apr. 2024). arXiv GitHub

Abstract

Global feature effect methods explain a model outputting one plot per feature. The plot shows the average effect of the feature on the output, like the effect of age on the annual income. However, average effects may be misleading when derived from local effects that are heterogeneous, i.e., they significantly deviate from the average. To decrease the heterogeneity, regional effects provide multiple plots per feature, each representing the average effect within a specific subspace. For interpretability, subspaces are defined as hyperrectangles defined by a chain of logical rules, like age’s effect on annual income separately for males and females and different levels of professional experience. We introduce Effector, a Python library dedicated to regional feature effects. Effector implements well-established global effect methods, assesses the heterogeneity of each method and, based on that, provides regional effects. Effector automatically detects subspaces where regional effects have reduced heterogeneity. All global and regional effect methods share a common API, facilitating comparisons between them. Moreover, the library’s interface is extensible so new methods can be easily added and benchmarked.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Julia Herbinger

Dr.

* Former Member

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[196]

C. Gruber, K. Hechinger, M. Aßenmacher, G. Kauermann and B. Plank.
More Labels or Cases? Assessing Label Variation in Natural Language Inference.
UnImplicit 2024 - 3rd Workshop on Understanding Implicit and Underspecified Language. Malta, Mar 21, 2024. URL

Abstract

In this work, we analyze the uncertainty that is inherently present in the labels used for supervised machine learning in natural language inference (NLI). In cases where multiple annotations per instance are available, neither the majority vote nor the frequency of individual class votes is a trustworthy representation of the labeling uncertainty. We propose modeling the votes via a Bayesian mixture model to recover the data-generating process, i.e., the “true” latent classes, and thus gain insight into the class variations. This will enable a better understanding of the confusion happening during the annotation process. We also assess the stability of the proposed estimation procedure by systematically varying the numbers of i) instances and ii) labels. Thereby, we observe that few instances with many labels can predict the latent class borders reasonably well, while the estimation fails for many instances with only a few labels. This leads us to conclude that multiple labels are a crucial building block for properly analyzing label uncertainty.

MCML Authors

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Barbara Plank

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

AI and Computational Linguistics

[195]

S. Dandl, C. Haslinger, T. Hothorn, H. Seibold, E. Sverdrup, S. Wager and A. Zeileis.
What Makes Forest-Based Heterogeneous Treatment Effect Estimators Work?
Annals of Applied Statistics 18.1 (Mar. 2024). DOI

Abstract

Estimation of heterogeneous treatment effects (HTE) is of prime importance in many disciplines, from personalized medicine to economics among many others. Random forests have been shown to be a flexible and powerful approach to HTE estimation in both randomized trials and observational studies. In particular “causal forests” introduced by Athey, Tibshirani and Wager (Ann. Statist. 47 (2019) 1148–1178), along with the R implementation in package grf were rapidly adopted. A related approach, called ‘model-based forests’ that is geared toward randomized trials and simultaneously captures effects of both prognostic and predictive variables, was introduced by Seibold, Zeileis and Hothorn (Stat. Methods Med. Res. 27 (2018) 3104–3125) along with a modular implementation in the R package model4you. Neither procedure is directly applicable to the estimation of individualized predictions of excess postpartum blood loss caused by a cesarean section in comparison to vaginal delivery. Clearly, randomization is hardly possible in this setup, and thus model-based forests lack clinical trial data to address this question. On the other hand, the skewed and interval-censored postpartum blood loss observations violate assumptions made by causal forests. Here we present a tailored model-based forest for skewed and interval-censored data to infer possible predictive prepartum characteristics and their impact on excess postpartum blood loss caused by a cesarean section. As a methodological basis, we propose a unifying view on causal and model-based forests that goes beyond the theoretical motivations and investigates which computational elements make causal forests so successful and how these can be blended with the strengths of model-based forests. To do so, we show that both methods can be understood in terms of the same parameters and model assumptions for an additive model under L2 loss. This theoretical insight allows us to implement several flavors of ‘model-based causal forests’ and dissect their different elements in silico. The original causal forests and model-based forests are compared with the new blended versions in a benchmark study exploring both randomized trials and observational settings. In the randomized setting, both approaches performed akin. If confounding was present in the data-generating process, we found local centering of the treatment indicator with the corresponding propensities to be the main driver for good performance. Local centering of the outcome was less important and might be replaced or enhanced by simultaneous split selection with respect to both prognostic and predictive effects. This lays the foundation for future research combining random forests for HTE estimation with other types of models.

MCML Authors

Susanne Dandl

Dr.

* Former Member

[194]

F. Coens, N. Knops, I. Tieken, S. Vogelaar, A. Bender, J. J. Kim, K. Krupka, L. Pape, A. Raes, B. Tönshoff, A. Prytula and C. Registry.
Time-Varying Determinants of Graft Failure in Pediatric Kidney Transplantation in Europe.
Clinical Journal of the American Society of Nephrology 19.3 (Mar. 2024). DOI

Abstract

Little is known about the time-varying determinants of kidney graft failure in children. We performed a retrospective study of primary pediatric kidney transplant recipients (younger than 18 years) from the Eurotransplant registry (1990-2020). Piece-wise exponential additive mixed models were applied to analyze time-varying recipient, donor, and transplant risk factors. Primary outcome was death-censored graft failure.

MCML Authors

Andreas Bender

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Machine Learning Consulting Unit (MLCU)

[193]

W. H. Hartl, P. Kopper, L. Xu, L. Heller, M. Mironov, R. Wang, A. G. Day, G. Elke, H. Küchenhoff and A. Bender.
Relevance of Protein Intake for Weaning in the Mechanically Ventilated Critically Ill: Analysis of a Large International Database.
Critical Care Medicine 50.3 (Mar. 2024). DOI

Abstract

The association between protein intake and the need for mechanical ventilation (MV) is controversial. We aimed to investigate the associations between protein intake and outcomes in ventilated critically ill patients.

MCML Authors

Helmut Küchenhoff

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Consulting Unit (StaBLab)

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

[192]

B. X. Liew, F. Pfisterer, D. Rügamer and X. Zhai.
Strategies to optimise machine learning classification performance when using biomechanical features.
Journal of Biomechanics 165 (Mar. 2024). DOI

Abstract

Building prediction models using biomechanical features is challenging because such models may require large sample sizes. However, collecting biomechanical data on large sample sizes is logistically very challenging. This study aims to investigate if modern machine learning algorithms can help overcome the issue of limited sample sizes on developing prediction models. This was a secondary data analysis two biomechanical datasets – a walking dataset on 2295 participants, and a countermovement jump dataset on 31 participants. The input features were the three-dimensional ground reaction forces (GRFs) of the lower limbs. The outcome was the orthopaedic disease category (healthy, calcaneus, ankle, knee, hip) in the walking dataset, and healthy vs people with patellofemoral pain syndrome in the jump dataset. Different algorithms were compared: multinomial/LASSO regression, XGBoost, various deep learning time-series algorithms with augmented data, and with transfer learning. For the outcome of weighted multiclass area under the receiver operating curve (AUC) in the walking dataset, the three models with the best performance were InceptionTime with x12 augmented data (0.810), XGBoost (0.804), and multinomial logistic regression (0.800). For the jump dataset, the top three models with the highest AUC were the LASSO (1.00), InceptionTime with x8 augmentation (0.750), and transfer learning (0.653). Machine-learning based strategies for managing the challenging issue of limited sample size for biomechanical ML-based problems, could benefit the development of alternative prediction models in healthcare, especially when time-series data are involved.

MCML Authors

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[191]

S. Dandl, A. Bender and T. Hothorn.
Heterogeneous treatment effect estimation for observational data using model-based forests.
Statistical Methods in Medical Research 33.3 (Mar. 2024). DOI

Abstract

The estimation of heterogeneous treatment effects has attracted considerable interest in many disciplines, most prominently in medicine and economics. Contemporary research has so far primarily focused on continuous and binary responses where heterogeneous treatment effects are traditionally estimated by a linear model, which allows the estimation of constant or heterogeneous effects even under certain model misspecifications. More complex models for survival, count, or ordinal outcomes require stricter assumptions to reliably estimate the treatment effect. Most importantly, the noncollapsibility issue necessitates the joint estimation of treatment and prognostic effects. Model-based forests allow simultaneous estimation of covariate-dependent treatment and prognostic effects, but only for randomized trials. In this paper, we propose modifications to model-based forests to address the confounding issue in observational data. In particular, we evaluate an orthogonalization strategy originally proposed by Robinson (1988, Econometrica) in the context of model-based forests targeting heterogeneous treatment effect estimation in generalized linear models and transformation models. We found that this strategy reduces confounding effects in a simulated study with various outcome distributions. We demonstrate the practical aspects of heterogeneous treatment effect estimation for survival and ordinal outcomes by an assessment of the potentially heterogeneous effect of Riluzole on the progress of Amyotrophic Lateral Sclerosis.

MCML Authors

Susanne Dandl

Dr.

* Former Member

Andreas Bender

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Machine Learning Consulting Unit (MLCU)

[190]

P. Kopper, D. Rügamer, R. Sonabend, B. Bischl and A. Bender.
On Training Survival Models with Scoring Rules.
Preprint (Mar. 2024). arXiv

Abstract

Survival Analysis provides critical insights for partially incomplete time-to-event data in various domains. It is also an important example of probabilistic machine learning. The probabilistic nature of the predictions can be exploited by using (proper) scoring rules in the model fitting process instead of likelihood-based optimization. Our proposal does so in a generic manner and can be used for a variety of model classes. We establish different parametric and non-parametric sub-frameworks that allow different degrees of flexibility. Incorporated into neural networks, it leads to a computationally efficient and scalable optimization routine, yielding state-of-the-art predictive performance. Finally, we show that using our framework, we can recover various parametric models and demonstrate that optimization works equally well when compared to likelihood-based methods.

MCML Authors

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

[189]

A. Reuter, A. Thielmann, C. Weisser, S. Fischer and B. Säfken.
GPTopic: Dynamic and Interactive Topic Representations.
Preprint (Mar. 2024). arXiv GitHub

Abstract

Topic modeling seems to be almost synonymous with generating lists of top words to represent topics within large text corpora. However, deducing a topic from such list of individual terms can require substantial expertise and experience, making topic modelling less accessible to people unfamiliar with the particularities and pitfalls of top-word interpretation. A topic representation limited to top-words might further fall short of offering a comprehensive and easily accessible characterization of the various aspects, facets and nuances a topic might have. To address these challenges, we introduce GPTopic, a software package that leverages Large Language Models (LLMs) to create dynamic, interactive topic representations. GPTopic provides an intuitive chat interface for users to explore, analyze, and refine topics interactively, making topic modeling more accessible and comprehensive.

MCML Authors

Sebastian Fischer

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[188]

J. Rodemann, F. Croppi, P. Arens, Y. Sale, J. Herbinger, B. Bischl, E. Hüllermeier, T. Augustin, C. J. Walsh and G. Casalicchio.
Explaining Bayesian Optimization by Shapley Values Facilitates Human-AI Collaboration.
Preprint (Mar. 2024). arXiv

Abstract

Bayesian optimization (BO) with Gaussian processes (GP) has become an indispensable algorithm for black box optimization problems. Not without a dash of irony, BO is often considered a black box itself, lacking ways to provide reasons as to why certain parameters are proposed to be evaluated. This is particularly relevant in human-in-the-loop applications of BO, such as in robotics. We address this issue by proposing ShapleyBO, a framework for interpreting BO’s proposals by game-theoretic Shapley this http URL quantify each parameter’s contribution to BO’s acquisition function. Exploiting the linearity of Shapley values, we are further able to identify how strongly each parameter drives BO’s exploration and exploitation for additive acquisition functions like the confidence bound. We also show that ShapleyBO can disentangle the contributions to exploration into those that explore aleatoric and epistemic uncertainty. Moreover, our method gives rise to a ShapleyBO-assisted human machine interface (HMI), allowing users to interfere with BO in case proposals do not align with human reasoning. We demonstrate this HMI’s benefits for the use case of personalizing wearable robotic devices (assistive back exosuits) by human-in-the-loop BO. Results suggest human-BO teams with access to ShapleyBO can achieve lower regret than teams without.

MCML Authors

Yusuf Sale

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Artificial Intelligence and Machine Learning

Julia Herbinger

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Eyke Hüllermeier

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Artificial Intelligence and Machine Learning

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

[187]

S. Wiegrebe, P. Kopper, R. Sonabend, B. Bischl and A. Bender.
Deep learning for survival analysis: a review.
Artificial Intelligence Review 57.65 (Feb. 2024). DOI

Abstract

The influx of deep learning (DL) techniques into the field of survival analysis in recent years has led to substantial methodological progress; for instance, learning from unstructured or high-dimensional data such as images, text or omics data. In this work, we conduct a comprehensive systematic review of DL-based methods for time-to-event analysis, characterizing them according to both survival- and DL-related attributes. In summary, the reviewed methods often address only a small subset of tasks relevant to time-to-event data—e.g., single-risk right-censored data—and neglect to incorporate more complex settings.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

[186]

C. A. Scholbeck, G. Casalicchio, C. Molnar, B. Bischl and C. Heumann.
Marginal Effects for Non-Linear Prediction Functions.
Data Mining and Knowledge Discovery 38 (Feb. 2024). DOI

Abstract

Beta coefficients for linear regression models represent the ideal form of an interpretable feature effect. However, for non-linear models and especially generalized linear models, the estimated coefficients cannot be interpreted as a direct feature effect on the predicted outcome. Hence, marginal effects are typically used as approximations for feature effects, either in the shape of derivatives of the prediction function or forward differences in prediction due to a change in a feature value. While marginal effects are commonly used in many scientific fields, they have not yet been adopted as a model-agnostic interpretation method for machine learning models. This may stem from their inflexibility as a univariate feature effect and their inability to deal with the non-linearities found in black box models. We introduce a new class of marginal effects termed forward marginal effects. We argue to abandon derivatives in favor of better-interpretable forward differences. Furthermore, we generalize marginal effects based on forward differences to multivariate changes in feature values. To account for the non-linearity of prediction functions, we introduce a non-linearity measure for marginal effects. We argue against summarizing feature effects of a non-linear prediction function in a single metric such as the average marginal effect. Instead, we propose to partition the feature space to compute conditional average marginal effects on feature subspaces, which serve as conditional feature effect estimates.

MCML Authors

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[185]

H. Weerts, F. Pfisterer, M. Feurer, K. Eggensperger, E. Bergman, N. Awad, J. Vanschoren, M. Pechenizkiy, B. Bischl and F. Hutter.
Can Fairness be Automated? Guidelines and Opportunities for Fairness-aware AutoML.
Journal of Artificial Intelligence Research 79 (Feb. 2024). DOI

Abstract

The field of automated machine learning (AutoML) introduces techniques that automate parts of the development of machine learning (ML) systems, accelerating the process and reducing barriers for novices. However, decisions derived from ML models can reproduce, amplify, or even introduce unfairness in our societies, causing harm to (groups of) individuals. In response, researchers have started to propose AutoML systems that jointly optimize fairness and predictive performance to mitigate fairness-related harm. However, fairness is a complex and inherently interdisciplinary subject, and solely posing it as an optimization problem can have adverse side effects. With this work, we aim to raise awareness among developers of AutoML systems about such limitations of fairness-aware AutoML, while also calling attention to the potential of AutoML as a tool for fairness research. We present a comprehensive overview of different ways in which fairness-related harm can arise and the ensuing implications for the design of fairness-aware AutoML. We conclude that while fairness cannot be automated, fairness-aware AutoML can play an important role in the toolbox of ML practitioners. We highlight several open technical challenges for future work in this direction. Additionally, we advocate for the creation of more user-centered assistive systems designed to tackle challenges encountered in fairness work.

MCML Authors

Florian Pfisterer

Dr.

* Former Member

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[184]

P. Gijsbers, M. L. P. Bueno, S. Coors, E. LeDell, S. Poirier, J. Thomas, B. Bischl and J. Vanschoren.
AMLB: an AutoML Benchmark.
Journal of Machine Learning Research 25.101 (Feb. 2024). URL

Abstract

Comparing different AutoML frameworks is notoriously challenging and often done incorrectly. We introduce an open and extensible benchmark that follows best practices and avoids common mistakes when comparing AutoML frameworks. We conduct a thorough comparison of 9 well-known AutoML frameworks across 71 classification and 33 regression tasks. The differences between the AutoML frameworks are explored with a multi-faceted analysis, evaluating model accuracy, its trade-offs with inference time, and framework failures. We also use Bradley-Terry trees to discover subsets of tasks where the relative AutoML framework rankings differ. The benchmark comes with an open-source tool that integrates with many AutoML frameworks and automates the empirical evaluation process end-to-end: from framework installation and resource allocation to in-depth evaluation. The benchmark uses public data sets, can be easily extended with other AutoML frameworks and tasks, and has a website with up-to-date results.

MCML Authors

Stefan Coors

* Former Member

Janek Thomas

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[183]

D. Schalk, B. Bischl and D. Rügamer.
Privacy-Preserving and Lossless Distributed Estimation of High-Dimensional Generalized Additive Mixed Models.
Statistics and Computing 34.31 (Feb. 2024). DOI

Abstract

Various privacy-preserving frameworks that respect the individual’s privacy in the analysis of data have been developed in recent years. However, available model classes such as simple statistics or generalized linear models lack the flexibility required for a good approximation of the underlying data-generating process in practice. In this paper, we propose an algorithm for a distributed, privacy-preserving, and lossless estimation of generalized additive mixed models (GAMM) using component-wise gradient boosting (CWB). Making use of CWB allows us to reframe the GAMM estimation as a distributed fitting of base learners using the $L_2$-loss. In order to account for the heterogeneity of different data location sites, we propose a distributed version of a row-wise tensor product that allows the computation of site-specific (smooth) effects. Our adaption of CWB preserves all the important properties of the original algorithm, such as an unbiased feature selection and the feasibility to fit models in high-dimensional feature spaces, and yields equivalent model estimates as CWB on pooled data. Next to a derivation of the equivalence of both algorithms, we also showcase the efficacy of our algorithm on a distributed heart disease data set and compare it with state-of-the-art methods.

MCML Authors

Daniel Schalk

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[182]

T. Weber, M. Ingrisch, B. Bischl and D. Rügamer.
Constrained Probabilistic Mask Learning for Task-specific Undersampled MRI Reconstruction.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI

Abstract

Undersampling is a common method in Magnetic Resonance Imaging (MRI) to subsample the number of data points in k-space, reducing acquisition times at the cost of decreased image quality. A popular approach is to employ undersampling patterns following various strategies, e.g., variable density sampling or radial trajectories. In this work, we propose a method that directly learns the under-sampling masks from data points, thereby also providing task- and domain-specific patterns. To solve the resulting discrete optimization problem, we propose a general optimization routine called ProM: A fully probabilistic, differentiable, versatile, and model-free framework for mask optimization that enforces acceleration factors through a convex constraint. Analyzing knee, brain, and cardiac MRI datasets with our method, we discover that different anatomic regions reveal distinct optimal undersampling masks, demonstrating the benefits of using custom masks, tailored for a downstream task. For example, ProM can create undersampling masks that maximize performance in downstream tasks like segmentation with networks trained on fully-sampled MRIs. Even with extreme acceleration factors, ProM yields reasonable performance while being more versatile than existing methods, paving the way for data-driven all-purpose mask generation.

MCML Authors

Tobias Weber

* Former Member

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

[181]

B. Bischl, R. Sonabend, L. Kotthoff and M. Lang.
Applied Machine Learning Using mlr3 in R.
American Statistician 79.2 (Jan. 2024). DOI

Abstract

mlr3 is an award-winning ecosystem of R packages that have been developed to enable state-of-the-art machine learning capabilities in R. Applied Machine Learning Using mlr3 in R gives an overview of flexible and robust machine learning methods, with an emphasis on how to implement them using mlr3 in R. It covers various key topics, including basic machine learning tasks, such as building and evaluating a predictive model; hyperparameter tuning of machine learning approaches to obtain peak performance; building machine learning pipelines that perform complex operations such as pre-processing followed by modelling followed by aggregation of predictions; and extending the mlr3 ecosystem with custom learners, measures, or pipeline components. The book is primarily aimed at researchers, practitioners, and graduate students who use machine learning or who are interested in using it. It can be used as a textbook for an introductory or advanced machine learning class that uses R, as a reference for people who work with machine learning methods, and in industry for exploratory experiments in machine learning.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Michel Lang

Dr.

* Former Member

[180]

G. Casalicchio and L. Burk.
Evaluation and Benchmarking.
Applied Machine Learning Using mlr3 in R I.3 (Jan. 2024). DOI

Abstract

Machine learning models can only be deployed in practice if they are robustly evaluated to estimate a model’s generalization performance, i.e. how well it will perform on new data. Resampling strategies including cross-validation and bootstrapping, can be used to estimate the generalization performance. Models can be compared to one another using a benchmark experiment, which makes use of the same resampling strategies and measures to fairly compare models and to help practitioners decide which model to use in practice.
This chapter introduces resample strategies in mlr3, including cross-validation, repeated cross-validation, leave-one-out, bootstrapping, and custom strategies. These are then demonstrated with the resample() function, which is used to resample a single learner with a given strategy. Benchmarking is then introduced and the benchmark() function is demonstrated for comparing multiple learners. The chapter concludes with a deep dive into binary classification evaluation, including ROC analysis and the Area Under the Curve metric.

MCML Authors

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Lukas Burk

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[179]

M. Becker, L. Schneider and S. Fischer.
Hyperparameter Optimization.
Applied Machine Learning Using mlr3 in R II.4 (Jan. 2024). DOI

Abstract

Machine learning models include parameters and hyperparameters. The former refers to model coefficients that are estimated during training. The latter are parameters that are set by the user and affect how the model is fit or how it makes predictions. Setting hyperparameters manually is arduous and error-prone, instead hyperparameter optimization (HPO) automating this ‘tuning’ procedure to reduce bias. When performing HPO there are many considerations including what tuning algorithm to use, how long to tune it for, and what measures to optimize. Moreover users have to decide which hyperparameters to tune and for what configurations. Finally, one has to be careful to make use of nested resampling to prevent leakage of information from training to testing datasets that can occur when resampling and tuning simultaneously. This chapter begins by introducing mlr3tuning and its functionality for tuning learners. This includes Tuners for configuring and running optimization algorithms, TuningInstances for storing results, and Terminators for controlling when to stop the HPO process. The chapter provides a practical example of tuning hyperparameters of a support vector machine, including introducing logarithmic transformations. The AutoTuner class is also introduced which is used for automating nested resampling to reduce bias in tuning.

MCML Authors

Marc Becker

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Sebastian Fischer

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[178]

L. Schneider and M. Becker.
Advanced Tuning Methods and Black Box Optimization.
Applied Machine Learning Using mlr3 in R II.5 (Jan. 2024). DOI

Abstract

Automated tuning can be error prone and it is very likely that models will crash in the tuning process, it is therefore essential to have reliable methods of encapsulating errors to prevent large experiments from failing and losing intermediate results. This chapter therefore begins by introducing fallback learners and encapsulation methods, which are returned to in ‘Advanced Technical Aspects of mlr3’.
Models can be tuned with respect to one or multiple measures. In general when tuning to multiple measures there will be a trade-off between them and therefore there will not be one optimal hyperparameter configuration, instead the aim is to estimate configurations that are not Pareto-dominated by any other. This chapter introduces multi-objective tuning and concepts including Pareto optimality.
Some tuning methods are more advanced than others, including Hyperband and Bayesian optimization. Hyperband is a multi-fidelity tuner that makes use of fidelity parameters, which provide a tradeoff between model runtime and performance accuracy. Bayesian optimization is a sample-efficient black-box optimization algorithm that is highly flexible and allows user fine-grained control over tuning large search spaces. This chapter introduces mlr3hyperband and the concept of fidelity parameters, and then mlr3mbo and bbotk to discuss black-box optimization and Bayesian optimization.

MCML Authors

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Marc Becker

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[177]

M. Binder and F. Pfisterer.
Sequential Pipelines.
Applied Machine Learning Using mlr3 in R II.7 (Jan. 2024). DOI

Abstract

Computational pipelines provide a layer of abstraction for swapping in and out different elements of the pipeline. In machine learning this can be useful for swapping algorithms, as well as common operations for data preprocessing and model post processing. Many real-world machine learning applications involve more than just fitting a single model at a time: It is often beneficial or even necessary to preprocess data for feature engineering and compatibility with learners. In many cases it is also useful to combine predictions of multiple models in ensembles. By defining these workflows as computational objects, it is then possible to treat them like models to be trained/tested and even tuned. This chapter introduces mlr3pipelines, a dataflow programming language that can be used to define machine learning processes from simple building blocks. The chapter focuses on sequential pipelines, in which data passes from one operation to another in a linear sequence and each operation has one input and output. The chapter introduces PipeOp and Graph, which are the building blocks of a pipeline, and provides some concrete examples with PCA.

MCML Authors

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[176]

M. Binder, F. Pfisterer, M. Becker and M. N. Wright.
Non-sequential Pipelines and Tuning.
Applied Machine Learning Using mlr3 in R II.8 (Jan. 2024). DOI

Abstract

Real-world applications often require complicated pipeline that do not progress sequentially. For example, many experiments have demonstrated that bagging is a powerful method to improve model performance. Bagging can be thought of as a non-sequential pipeline where a learner is replicated, each separate learner is trained and makes predictions, and their results are combined. This is non-sequential as data is not flowing sequentially through the pipeline but is instead passed to all learners (who may then subsample the data) and then recombined, thus creating a pipeline where operations have multiple inputs and outputs. Pipeline operations also have hyperparameters that can be set and tuned to improve model performance. Moreover the choice of operations to include in a pipeline can also be tuned, known as combined algorithm selection and hyperparameter optimization (CASH).
This chapter looks at more advanced uses of mlr3pipelines. This is put into practice by demonstrating how to build a bagging and stacking pipeline from scratch, as well as how to access common pipelines that are readily available in mlr3pipelines. The chapter then looks at tuning pipelines and CASH.

MCML Authors

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Marc Becker

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[175]

M. Lang, S. Fischer and R. Sonabend.
Advanced Technical Aspects of mlr3.
Applied Machine Learning Using mlr3 in R IV.10 (Jan. 2024). DOI

Abstract

Parallelization is often required to efficiently run machine learning models, which means models are run simultaneously on multiple CPU cores, CPUs, or computational nodes. This chapter begins by demonstrating how mlr3 uses the future package for parallelization and how different ‘plans’ can be applied to mlr3 experiments. In large machine learning experiments, it is common for a model to error during training or predicting. This is because the algorithms have to process arbitrary data, and not all eventualities can always be handled. It is therefore imperative to have robust methods for encapsulating and dealing with errors. This chapter builds on what has been briefly seen in Chapter 5 to discuss error handling and logging, including how to make use of fallback learners in experiments. Large experiments may also require data to be handled in different formats and to prevent all the data being loaded into memory. This chapter discussed different ‘backends’ that can be used for mlr3 Tasks, including interfacing with DuckDB and SQL. Finally, this chapter demonstrates how to extend classes in mlr3 by using the Measure class as an example. This may be of particular interest to readers who want to create new Measures or Learners.

MCML Authors

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Sebastian Fischer

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[174]

S. Fischer, M. Lang and M. Becker.
Large-Scale Benchmarking.
Applied Machine Learning Using mlr3 in R IV.11 (Jan. 2024). DOI

Abstract

In the field of machine learning, benchmark experiments are used to evaluate and compare the performance of algorithms. To draw robust conclusions, benchmark experiments often have to be ‘large-scale’, which means including many datasets, learners, and possibly measures. Finding datasets can be difficult and the choice of dataset impacts conclusions that can be drawn. Conducting large-scale benchmark experiments is also complex as they are usually computationally intensive. It is therefore common to make use of high-performance computing clusters to efficiently run the experiment. Finally once these experiments are run, analysis of experiments usually requires more than a single score from a given performance measure, and therefore statistical test are often employed.
This chapter introduces mlr3oml for interfacing the OpenML database for accessing data and tasks. It then continues by discussing how to run experiments on high-performance computing clusters using batchtools and mlr3batchmark. Finally, mlr3benchmark is introduced for statistical analysis including Friedman tests and critical difference diagrams.

MCML Authors

Sebastian Fischer

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Marc Becker

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[173]

S. Dandl, P. Biecek, G. Casalicchio and M. N. Wright.
Model Interpretation.
Applied Machine Learning Using mlr3 in R IV.12 (Jan. 2024). DOI

Abstract

The increasing availability of data and software frameworks to create predictive models has allowed the widespread adoption of machine learning in many applications. However, high predictive performance of such models often comes at the cost of interpretability. Machine learning interpretation methods can be useful for several purposes: 1) gaining global insights into a model (e.g., feature importance); 2) model improvement if flaws were identified (e.g., unexpected reliance on a certain feature); 3) understanding individual predictions. Several model-agnostic methods have been developed including feature permutation, Shapleys, and LIME.
This chapter presents the packages iml, counterfactuals, and DALEX, which implement model-agnostic interpretation methods. Throughout the chapter an xgboost is trained on the german credit dataset to understand how predictions are made and why. The chapter starts with discussing the iml package and the theory behind the discussed methods, as well as how to practically use the interface. It then moves to counterfactuals and the benefits of counterfactual analysis, including methods What-If and MOC. Finally, DALEX is introduced, which includes similar methods to iml but with a different design, hence users can make use of either package depending on their design preference.

MCML Authors

Susanne Dandl

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

2023

[172]

J. Goschenhofer.
Reducing the effort for data annotation: contributions to weakly supervised deep learning.
Dissertation 2023. DOI

Abstract

This thesis addresses methods for training machine learning models with limited labeled data, focusing on semi-supervised, positive unlabeled, constrained clustering, and transfer learning. It explores deep semi-supervised learning, particularly in time series and medical imaging contexts, and investigates positive unlabeled learning methods that utilize predictive uncertainty for self-training. The thesis also introduces weakly supervised learning for constrained clustering, combining it with semi-supervised approaches, and applies transfer learning to tasks with varying granularity in medical domains. (Shortened).

MCML Authors

Jann Goschenhofer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[171]

H. A. Gündüz, S. Giri, M. Binder, B. Bischl and M. Rezaei.
Uncertainty Quantification for Deep Learning Models Predicting the Regulatory Activity of DNA Sequences.
ICMLA 2023 - 22nd IEEE International Conference on Machine Learning and Applications. Jacksonville, FL, USA, Dec 15-17, 2023. DOI

Abstract

The field of computational biology has been enhanced by deep learning models, which hold great promise for revolutionizing domains such as protein folding and drug discovery. Recent studies have underscored the tremendous potential of these models, particularly in the realm of gene regulation and the more profound understanding of the non-coding regions of the genome. On the other hand, this raises significant concerns about the reliability and efficacy of such models, which have their own biases by design, along with those learned from the data. Uncertainty quantification allows us to measure where the system is confident and know when it can be trusted. In this paper, we study several uncertainty quantification methods with respect to a multi-target regression task, specifically predicting regulatory activity profiles using DNA sequence data. Using the Basenji model, we investigate how such methods can improve in-domain generalization, out-of-distribution detection, and provide coverage guarantees on prediction intervals.

MCML Authors

Hüseyin Anil Gündüz

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Mina Rezaei

Dr.

Statistical Learning and Data Science

[170]

Y. Zhang, Y. Li, H. Brown, M. Rezaei, B. Bischl, P. Torr, A. Khakzar and K. Kawaguchi.
AttributionLab: Faithfulness of Feature Attribution Under Controllable Environments.
XAIA @NeurIPS 2023 - Workshop XAI in Action: Past, Present, and Future Applications at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023). New Orleans, LA, USA, Dec 10-16, 2023. URL

Abstract

Feature attribution explains neural network outputs by identifying relevant input features. How do we know if the identified features are indeed relevant to the network? This notion is referred to as faithfulness, an essential property that reflects the alignment between the identified (attributed) features and the features used by the model. One recent trend to test faithfulness is to design the data such that we know which input features are relevant to the label and then train a model on the designed data. Subsequently, the identified features are evaluated by comparing them with these designed ground truth features. However, this idea has the underlying assumption that the neural network learns to use all and only these designed features, while there is no guarantee that the learning process trains the network in this way. In this paper, we solve this missing link by explicitly designing the neural network by manually setting its weights, along with designing data, so we know precisely which input features in the dataset are relevant to the designed network. Thus, we can test faithfulness in AttributionLab, our designed synthetic environment, which serves as a sanity check and is effective in filtering out attribution methods. If an attribution method is not faithful in a simple controlled environment, it can be unreliable in more complex scenarios. Furthermore, the AttributionLab environment serves as a laboratory for controlled experiments through which we can study feature attribution methods, identify issues, and suggest potential improvements.

MCML Authors

Yawei Li

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Mina Rezaei

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Ashkan Khakzar

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[169]

S. Dandl.
Causality concepts in machine learning: heterogeneous treatment effect estimation with machine learning and model interpretation with counterfactual and semi-factual explanations.
Dissertation 2023. DOI

Abstract

This thesis explores the growing intersection of machine learning and causality through seven articles, offering new insights into how these fields can enhance one another. It addresses key topics, including adapting machine learning algorithms for heterogeneous treatment effect estimation, where combining causal and model-based forest elements improves performance across diverse datasets. Additionally, the thesis introduces advanced interpretability tools, proposing methods to generate multiple counterfactual and semi-factual explanations that aid in fairness assessments and address interpretability challenges. A modular R package developed in this work provides accessible tools for researchers to apply and compare counterfactual explanation methods, further bridging machine learning and causal inference for practical applications. (Shortened).

MCML Authors

Susanne Dandl

Dr.

* Former Member

[168]

E. Garces Arias, V. Pai, M. Schöffel, C. Heumann and M. Aßenmacher.
Automatic transcription of handwritten Old Occitan language.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

While existing neural network-based approaches have shown promising results in Handwritten Text Recognition (HTR) for high-resource languages and standardized/machine-written text, their application to low-resource languages often presents challenges, resulting in reduced effectiveness. In this paper, we propose an innovative HTR approach that leverages the Transformer architecture for recognizing handwritten Old Occitan language. Given the limited availability of data, which comprises only word pairs of graphical variants and lemmas, we develop and rely on elaborate data augmentation techniques for both text and image data. Our model combines a custom-trained Swin image encoder with a BERT text decoder, which we pre-train using a large-scale augmented synthetic data set and fine-tune on the small human-labeled data set. Experimental results reveal that our approach surpasses the performance of current state-of-the-art models for Old Occitan HTR, including open-source Transformer-based models such as a fine-tuned TrOCR and commercial applications like Google Cloud Vision. To nurture further research and development, we make our models, data sets, and code publicly available.

MCML Authors

Esteban Garces Arias

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[167]

J. Herbinger.
On grouping and partitioning approaches in interpretable machine learning.
Dissertation 2023. DOI

Abstract

This thesis addresses the challenges of interpreting machine learning models, particularly focusing on the limitations of global explanation methods. It identifies two key issues: the human-incomprehensibility of high-dimensional outputs and the misleading interpretations caused by aggregation bias. The thesis proposes solutions to these problems, such as grouping features for simpler interpretations and using recursive partitioning algorithms to provide regional explanations, ensuring more accurate and understandable insights into model behavior. (Shortened.)

MCML Authors

Julia Herbinger

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[166]

F. Karl, T. Pielok, J. Moosbauer, F. Pfisterer, S. Coors, M. Binder, L. Schneider, J. Thomas, J. Richter, M. Lang, E. C. Garrido-Merchán, J. Branke and B. Bischl.
Multi-Objective Hyperparameter Optimization in Machine Learning—An Overview.
ACM Transactions on Evolutionary Learning and Optimization 3.4 (Dec. 2023). DOI

Abstract

Hyperparameter optimization constitutes a large part of typical modern machine learning (ML) workflows. This arises from the fact that ML methods and corresponding preprocessing steps often only yield optimal performance when hyperparameters are properly tuned. But in many applications, we are not only interested in optimizing ML pipelines solely for predictive accuracy; additional metrics or constraints must be considered when determining an optimal configuration, resulting in a multi-objective optimization problem. This is often neglected in practice, due to a lack of knowledge and readily available software implementations for multi-objective hyperparameter optimization. In this work, we introduce the reader to the basics of multi-objective hyperparameter optimization and motivate its usefulness in applied ML. Furthermore, we provide an extensive survey of existing optimization strategies from the domains of evolutionary algorithms and Bayesian optimization. We illustrate the utility of multi-objective optimization in several specific ML applications, considering objectives such as operating conditions, prediction time, sparseness, fairness, interpretability, and robustness.

MCML Authors

Florian Karl

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Tobias Pielok

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Julia Moosbauer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Stefan Coors

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Janek Thomas

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

[165]

A. T. Stüber, S. Coors, B. Schachtner, T. Weber, D. Rügamer, A. Bender, A. Mittermeier, O. Öcal, M. Seidensticker, J. Ricke, B. Bischl and M. Ingrisch.
A comprehensive machine learning benchmark study for radiomics-based survival analysis of CT imaging data in patients with hepatic metastases of CRC.
Investigative Radiology 58.12 (Dec. 2023). DOI

Abstract

Optimizing a machine learning (ML) pipeline for radiomics analysis involves numerous choices in data set composition, preprocessing, and model selection. Objective identification of the optimal setup is complicated by correlated features, interdependency structures, and a multitude of available ML algorithms. Therefore, we present a radiomics-based benchmarking framework to optimize a comprehensive ML pipeline for the prediction of overall survival. This study is conducted on an image set of patients with hepatic metastases of colorectal cancer, for which radiomics features of the whole liver and of metastases from computed tomography images were calculated. A mixed model approach was used to find the optimal pipeline configuration and to identify the added prognostic value of radiomics features.

MCML Authors

Theresa Stüber

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Clinical Data Science in Radiology

Stefan Coors

* Former Member

Balthasar Schachtner

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Clinical Data Science in Radiology

Tobias Weber

* Former Member

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

Andreas Mittermeier

Dr.

Clinical Data Science in Radiology

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology

[164]

Y. Sale, P. Hofman, L. Wimmer, E. Hüllermeier and T. Nagler.
Second-Order Uncertainty Quantification: Variance-Based Measures.
Preprint (Dec. 2023). arXiv

Abstract

Uncertainty quantification is a critical aspect of machine learning models, providing important insights into the reliability of predictions and aiding the decision-making process in real-world applications. This paper proposes a novel way to use variance-based measures to quantify uncertainty on the basis of second-order distributions in classification problems. A distinctive feature of the measures is the ability to reason about uncertainties on a class-based level, which is useful in situations where nuanced decision-making is required. Recalling some properties from the literature, we highlight that the variance-based measures satisfy important (axiomatic) properties. In addition to this axiomatic approach, we present empirical results showing the measures to be effective and competitive to commonly used entropy-based measures.

MCML Authors

Yusuf Sale

Artificial Intelligence and Machine Learning

Paul Hofman

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Artificial Intelligence and Machine Learning

Lisa Wimmer

Statistical Learning and Data Science

Eyke Hüllermeier

Prof. Dr.

Artificial Intelligence and Machine Learning

Thomas Nagler

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Computational Statistics & Data Science

[163]

C. A. Scholbeck, J. Moosbauer, G. Casalicchio, H. Gupta, B. Bischl and C. Heumann.
Position Paper: Bridging the Gap Between Machine Learning and Sensitivity Analysis.
Preprint (Dec. 2023). arXiv

Abstract

We argue that interpretations of machine learning (ML) models or the model-building process can be seen as a form of sensitivity analysis (SA), a general methodology used to explain complex systems in many fields such as environmental modeling, engineering, or economics. We address both researchers and practitioners, calling attention to the benefits of a unified SA-based view of explanations in ML and the necessity to fully credit related work. We bridge the gap between both fields by formally describing how (a) the ML process is a system suitable for SA, (b) how existing ML interpretation methods relate to this perspective, and (c) how other SA techniques could be applied to ML.

MCML Authors

Julia Moosbauer

Dr.

* Former Member

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

[162]

D. Rügamer, F. Pfisterer, B. Bischl and B. Grün.
Mixture of Experts Distributional Regression: Implementation Using Robust Estimation with Adaptive First-order Methods.
Advances in Statistical Analysis (Nov. 2023). DOI

Abstract

In this work, we propose an efficient implementation of mixtures of experts distributional regression models which exploits robust estimation by using stochastic first-order optimization techniques with adaptive learning rate schedulers. We take advantage of the flexibility and scalability of neural network software and implement the proposed framework in mixdistreg, an R software package that allows for the definition of mixtures of many different families, estimation in high-dimensional and large sample size settings and robust optimization based on TensorFlow. Numerical experiments with simulated and real-world data applications show that optimization is as reliable as estimation via classical approaches in many different settings and that results may be obtained for complicated scenarios where classical approaches consistently fail.

MCML Authors

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

Florian Pfisterer

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[161]

L. Bothmann, L. Wimmer, O. Charrakh, T. Weber, H. Edelhoff, W. Peters, H. Nguyen, C. Benjamin and A. Menzel.
Automated wildlife image classification: An active learning tool for ecological applications.
Ecological Informatics 77 (Nov. 2023). DOI

Abstract

Wildlife camera trap images are being used extensively to investigate animal abundance, habitat associations, and behavior, which is complicated by the fact that experts must first classify the images to retrieve relevant information. Artificial intelligence systems can take over this task but usually need a large number of already-labeled training images to achieve sufficient performance. This requirement necessitates human expert labor and poses a particular challenge for projects with few cameras or short durations. We propose a label-efficient learning strategy that enables researchers with small or medium-sized image databases to leverage the potential of modern machine learning, thus freeing crucial resources for subsequent analyses. Our methodological proposal is twofold: On the one hand, we improve current strategies of combining object detection and image classification by tuning the hyperparameters of both models. On the other hand, we provide an active learning system that allows training deep learning models very efficiently in terms of required manually labeled training images. We supply a software package that enables researchers to use these methods without specific programming skills and thereby ensure the broad applicability of the proposed framework in ecological practice. We show that our tuning strategy improves predictive performance, emphasizing that tuning can and must be done separately for a new data set. We demonstrate how the active learning pipeline reduces the amount of pre-labeled data needed to achieve specific predictive performance and that it is especially valuable for improving out-of-sample predictive performance. We conclude that the combination of tuning and active learning increases the predictive performance of automated image classifiers substantially. Furthermore, we argue that our work can broadly impact the community through the ready-to-use software package provided. Finally, the publication of our models tailored to European wildlife data enriches existing model bases mostly trained on data from Africa and North America.

MCML Authors

Ludwig Bothmann

Dr.

Statistical Learning and Data Science

Lisa Wimmer

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Tobias Weber

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[160]

T. Weber, M. Ingrisch, B. Bischl and D. Rügamer.
Post-hoc Orthogonalization for Mitigation of Protected Feature Bias in CXR Embeddings.
Preprint (Nov. 2023). arXiv

Abstract

Purpose: To analyze and remove protected feature effects in chest radiograph embeddings of deep learning models. Methods: An orthogonalization is utilized to remove the influence of protected features (e.g., age, sex, race) in CXR embeddings, ensuring feature-independent results. To validate the efficacy of the approach, we retrospectively study the MIMIC and CheXpert datasets using three pre-trained models, namely a supervised contrastive, a self-supervised contrastive, and a baseline classifier model. Our statistical analysis involves comparing the original versus the orthogonalized embeddings by estimating protected feature influences and evaluating the ability to predict race, age, or sex using the two types of embeddings. Results: Our experiments reveal a significant influence of protected features on predictions of pathologies. Applying orthogonalization removes these feature effects. Apart from removing any influence on pathology classification, while maintaining competitive predictive performance, orthogonalized embeddings further make it infeasible to directly predict protected attributes and mitigate subgroup disparities. Conclusion: The presented work demonstrates the successful application and evaluation of the orthogonalization technique in the domain of chest X-ray image classification.

MCML Authors

Tobias Weber

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[159]

G. König.
If interpretability is the answer, what is the question?: a causal perspective.
Dissertation 2023. DOI

Abstract

This thesis addresses fundamental challenges in the field of interpretable machine learning (IML), particularly the lack of a clear definition of ‘interpretability’, the potential misinterpretation of existing methods, and the computational difficulties of conditional-sampling-based techniques. By disentangling the different goals of interpretability, we provide clearer guidelines for deriving target estimands, with specific examples such as recourse and scientific inference. Additionally, we propose formal interpretation rules for feature importance, highlight common pitfalls in IML, and introduce efficient methods for estimating conditional-sampling techniques by leveraging the data’s dependence structure, with a strong emphasis on causal inference to improve clarity and computational efficiency. (Shortened.)

MCML Authors

Gunnar König

Dr.

* Former Member

[158]

M. Rezaei, F. Soleymani, B. Bischl and S. Azizi.
Deep Bregman divergence for self-supervised representations learning.
Computer Vision and Image Understanding 235.103801 (Oct. 2023). DOI

Abstract

Neural Bregman divergence measures the divergence of data points using convex neural networks, which is beyond Euclidean distance and capable of capturing divergence over distributions. The non-Euclidean geometry is not well explored in deep representation learning and remains a challenging endeavor for self-supervised representation learning. In this paper, we propose deep Bregman divergences for self-supervised pretext task learning, where we aim to enhance self-supervised embedding representation by training additional networks based on functional Bregman divergences. Our framework can capture the divergence of embedding distributions and improve the quality of learned representation using an arbitrary Bregman divergence over data embedding. Specifically, we develop a novel self-supervised architecture and a new divergence loss that measures the asymmetric distance of arbitrary Bergman divergences of neural networks. We show that the combination of self-supervised contrastive learning and our proposed method outperforms the baseline as well as most established methods for self-supervised and semi-supervised learning on multiple classifications and object detection tasks and datasets. Moreover, the learned representations generalize well when transferred to other datasets and tasks.

MCML Authors

Mina Rezaei

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[157]

L. Bothmann, S. Dandl and M. Schomaker.
Causal Fair Machine Learning via Rank-Preserving Interventional Distributions.
AEQUITAS @ECAI 2023 - 1st Workshop on Fairness and Bias in AI co-located with the 26th European Conference on Artificial Intelligence (ECAI 2023). Kraków, Poland, Sep 30-Oct 04, 2023. PDF

Abstract

A decision can be defined as fair if equal individuals are treated equally and unequals unequally. Adopting this definition, the task of designing machine learning models that mitigate unfairness in automated decision-making systems must include causal thinking when introducing protected attributes. Following a recent proposal, we define individuals as being normatively equal if they are equal in a fictitious, normatively desired (FiND) world, where the protected attribute has no (direct or indirect) causal effect on the target. We propose rank-preserving interventional distributions to define an estimand of this FiND world and a warping method for estimation. Evaluation criteria for both the method and resulting model are presented and validated through simulations and empirical data. With this, we show that our warping approach effectively identifies the most discriminated individuals and mitigates unfairness.

MCML Authors

Ludwig Bothmann

Dr.

Statistical Learning and Data Science

Susanne Dandl

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Michael Schomaker

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Biostatistics

[156]

J. Herbinger, S. Dandl, F. K. Ewald, S. Loibl and G. Casalicchio.
Leveraging Model-based Trees as Interpretable Surrogate Models for Model Distillation.
XI-ML @ECAI 2023 - 3rd International Workshop on Explainable and Interpretable Machine Learning co-located with the 26th European Conference on Artificial Intelligence (ECAI 2023). Kraków, Poland, Sep 30-Oct 04, 2023. DOI

Abstract

Surrogate models play a crucial role in retrospectively interpreting complex and powerful black box machine learning models via model distillation. This paper focuses on using model-based trees as surrogate models which partition the feature space into interpretable regions via decision rules. Within each region, interpretable models based on additive main effects are used to approximate the behavior of the black box model, striking for an optimal balance between interpretability and performance. Four model-based tree algorithms, namely SLIM, GUIDE, MOB, and CTree, are compared regarding their ability to generate such surrogate models. We investigate fidelity, interpretability, stability, and the algorithms’ capability to capture interaction effects through appropriate splits. Based on our comprehensive analyses, we finally provide an overview of user-specific recommendations.

MCML Authors

Julia Herbinger

Dr.

* Former Member

Susanne Dandl

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Fiona Katharina Ewald

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[155]

S. Urchs, V. Thurner, M. Aßenmacher, C. Heumann and S. Thiemichen.
How Prevalent is Gender Bias in ChatGPT? - Exploring German and English ChatGPT Responses.
BDCA @ECML-PKDD 2023 - 1st Workshop on Biased Data in Conversational Agents at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2023). Turin, Italy, Sep 18-22, 2023. DOI

Abstract

With the introduction of ChatGPT, OpenAI made large language models (LLM) accessible to users with limited IT expertise. However, users with no background in natural language processing (NLP) might lack a proper understanding of LLMs. Thus the awareness of their inherent limitations, and therefore will take the systems’ output at face value. In this paper, we systematically analyse prompts and the generated responses to identify possible problematic issues with a special focus on gender biases, which users need to be aware of when processing the system’s output. We explore how ChatGPT reacts in English and German if prompted to answer from a female, male, or neutral perspective. In an in-depth investigation, we examine selected prompts and analyse to what extent responses differ if the system is prompted several times in an identical way. On this basis, we show that ChatGPT is indeed useful for helping non-IT users draft texts for their daily work. However, it is absolutely crucial to thoroughly check the system’s responses for biases as well as for syntactic and grammatical mistakes.

MCML Authors

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[154]

I. T. Öztürk, R. Nedelchev, C. Heumann, E. Garces Arias, M. Roger, B. Bischl and M. Aßenmacher.
How Different Is Stereotypical Bias Across Languages?
BIAS @ECML-PKDD 2023 - 3rd Workshop on Bias and Fairness in AI at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2023). Turin, Italy, Sep 18-22, 2023. DOI

Abstract

Recent studies have demonstrated how to assess the stereotypical bias in pre-trained English language models. In this work, we extend this branch of research in multiple different dimensions by systematically investigating (a) mono- and multilingual models of (b) different underlying architectures with respect to their bias in (c) multiple different languages. To that end, we make use of the English StereoSet data set (Nadeem et al., 2021), which we semi-automatically translate into German, French, Spanish, and Turkish. We find that it is of major importance to conduct this type of analysis in a multilingual setting, as our experiments show a much more nuanced picture as well as notable differences from the English-only analysis. The main takeaways from our analysis are that mGPT-2 (partly) shows surprising anti-stereotypical behavior across languages, English (monolingual) models exhibit the strongest bias, and the stereotypes reflected in the data set are least present in Turkish models. Finally, we release our codebase alongside the translated data sets and practical guidelines for the semi-automatic translation to encourage a further extension of our work to other languages.

MCML Authors

Esteban Garces Arias

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science

[153]

S. Dandl, G. Casalicchio, B. Bischl and L. Bothmann.
Interpretable Regional Descriptors: Hyperbox-Based Local Explanations.
ECML-PKDD 2023 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Turin, Italy, Sep 18-22, 2023. DOI

Abstract

This work introduces interpretable regional descriptors, or IRDs, for local, model-agnostic interpretations. IRDs are hyperboxes that describe how an observation’s feature values can be changed without affecting its prediction. They justify a prediction by providing a set of “even if” arguments (semi-factual explanations), and they indicate which features affect a prediction and whether pointwise biases or implausibilities exist. A concrete use case shows that this is valuable for both machine learning modelers and persons subject to a decision. We formalize the search for IRDs as an optimization problem and introduce a unifying framework for computing IRDs that covers desiderata, initialization techniques, and a post-processing method. We show how existing hyperbox methods can be adapted to fit into this unified framework. A benchmark study compares the methods based on several quality measures and identifies two strategies to improve IRDs.

MCML Authors

Susanne Dandl

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Ludwig Bothmann

Dr.

Statistical Learning and Data Science

[152]

L. Rauch, M. Aßenmacher, D. Huseljic, M. Wirth, B. Bischl and B. Sick.
ActiveGLAE: A Benchmark for Deep Active Learning with Transformers.
ECML-PKDD 2023 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Turin, Italy, Sep 18-22, 2023. DOI

Abstract

Deep active learning (DAL) seeks to reduce annotation costs by enabling the model to actively query instance annotations from which it expects to learn the most. Despite extensive research, there is currently no standardized evaluation protocol for transformer-based language models in the field of DAL. Diverse experimental settings lead to difficulties in comparing research and deriving recommendations for practitioners. To tackle this challenge, we propose the ACTIVEGLAE benchmark, a comprehensive collection of data sets and evaluation guidelines for assessing DAL. Our benchmark aims to facilitate and streamline the evaluation process of novel DAL strategies. Additionally, we provide an extensive overview of current practice in DAL with transformer-based language models. We identify three key challenges - data set selection, model training, and DAL settings - that pose difficulties in comparing query strategies. We establish baseline results through an extensive set of experiments as a reference point for evaluating future work. Based on our findings, we provide guidelines for researchers and practitioners.

MCML Authors

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[151]

J. G. Wiese, L. Wimmer, T. Papamarkou, B. Bischl, S. Günnemann and D. Rügamer.
Towards Efficient MCMC Sampling in Bayesian Neural Networks by Exploiting Symmetry.
ECML-PKDD 2023 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Turin, Italy, Sep 18-22, 2023. Best paper award. DOI

Abstract

Bayesian inference in deep neural networks is challenging due to the high-dimensional, strongly multi-modal parameter posterior density landscape. Markov chain Monte Carlo approaches asymptotically recover the true posterior but are considered prohibitively expensive for large modern architectures. Local methods, which have emerged as a popular alternative, focus on specific parameter regions that can be approximated by functions with tractable integrals. While these often yield satisfactory empirical results, they fail, by definition, to account for the multi-modality of the parameter posterior. Such coarse approximations can be detrimental in practical applications, notably safety-critical ones. In this work, we argue that the dilemma between exact-but-unaffordable and cheap-but-inexact approaches can be mitigated by exploiting symmetries in the posterior landscape. These symmetries, induced by neuron interchangeability and certain activation functions, manifest in different parameter values leading to the same functional output value. We show theoretically that the posterior predictive density in Bayesian neural networks can be restricted to a symmetry-free parameter reference set. By further deriving an upper bound on the number of Monte Carlo chains required to capture the functional diversity, we propose a straightforward approach for feasible Bayesian inference. Our experiments suggest that efficient sampling is indeed possible, opening up a promising path to accurate uncertainty quantification in deep learning.

MCML Authors

Lisa Wimmer

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Stephan Günnemann

Prof. Dr.

Data Analytics & Machine Learning

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[150]

M. Aßenmacher, L. Rauch, J. Goschenhofer, A. Stephan, B. Bischl, B. Roth and B. Sick.
Towards Enhancing Deep Active Learning with Weak Supervision and Constrained Clustering.
IAL @ECML-PKDD 2023 - 7th International Workshop on Interactive Adaptive Learning at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2023). Turin, Italy, Sep 18-22, 2023. PDF

Abstract

Three fields revolving around the question of how to cope with limited amounts of labeled data are Deep Active Learning (DAL), deep Constrained Clustering (CC), and Weakly Supervised Learning (WSL). DAL tackles the problem by adaptively posing the question of which data samples to annotate next in order to achieve the best incremental learning improvement, although it suffers from several limitations that hinder its deployment in practical settings. We point out how CC algorithms and WSL could be employed to overcome these limitations and increase the practical applicability of DAL research. Specifically, we discuss the opportunities to use the class discovery capabilities of CC and the possibility of further reducing human annotation efforts by utilizing WSL. We argue that the practical applicability of DAL algorithms will benefit from employing CC and WSL methods for the learning and labeling process. We inspect the overlaps between the three research areas and identify relevant and exciting research questions at the intersection of these areas.

MCML Authors

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science

Jann Goschenhofer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

[149]

S. F. Fischer, L. Harutyunyan, M. Feurer and B. Bischl.
OpenML-CTR23 - A curated tabular regression benchmarking suite.
AutoML 2023 - International Conference on Automated Machine Learning - Workshop Track. Berlin, Germany, Sep 12-15, 2023. URL

Abstract

Benchmark experiments are one of the cornerstones of modern machine learning research. An essential part in the design of such experiments is the selection of datasets. We present the OpenML Curated Tabular Regression benchmarking suite 2023 (OpenML-CTR23). It is available on OpenML and comprises 35 regression problems that have been selected according to a set of strict criteria. We compare its design with existing regression benchmark suites and also challenge some of the dataset choices of previous efforts. As a first experiment, we compare five machine learning methods of varying complexity on the OpenML-CTR23.

MCML Authors

Matthias Feurer

Prof. Dr.

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[148]

L. O. Purucker, L. Schneider, M. Anastacio, J. Beel, B. Bischl and H. Hoos.
Q(D)O-ES: Population-based Quality (Diversity) Optimisation for Post Hoc Ensemble Selection in AutoML.
AutoML 2023 - International Conference on Automated Machine Learning. Berlin, Germany, Sep 12-15, 2023. URL

Abstract

Automated machine learning (AutoML) systems commonly ensemble models post hoc to improve predictive performance, typically via greedy ensemble selection (GES). However, we believe that GES may not always be optimal, as it performs a simple deterministic greedy search. In this work, we introduce two novel population-based ensemble selection methods, QO-ES and QDO-ES, and compare them to GES. While QO-ES optimises solely for predictive performance, QDO-ES also considers the diversity of ensembles within the population, maintaining a diverse set of well-performing ensembles during optimisation based on ideas of quality diversity optimisation. The methods are evaluated using 71 classification datasets from the AutoML benchmark, demonstrating that QO-ES and QDO-ES often outrank GES, albeit only statistically significant on validation data. Our results further suggest that diversity can be beneficial for post hoc ensembling but also increases the risk of overfitting.

MCML Authors

Lennart Schneider

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

[147]

S. Segel, H. Graf, A. Tornede, B. Bischl and M. Lindauer.
Symbolic Explanations for Hyperparameter Optimization.
AutoML 2023 - International Conference on Automated Machine Learning. Berlin, Germany, Sep 12-15, 2023. URL

Abstract

Hyperparameter optimization (HPO) methods can determine well-performing hyperparameter configurations efficiently but often lack insights and transparency. We propose to apply symbolic regression to meta-data collected with Bayesian optimization (BO) during HPO. In contrast to prior approaches explaining the effects of hyperparameters on model performance, symbolic regression allows for obtaining explicit formulas quantifying the relation between hyperparameter values and model performance. Overall, our approach aims to make the HPO process more explainable and human-centered, addressing the needs of multiple user groups: First, providing insights into the HPO process can support data scientists and machine learning practitioners in their decisions when using and interacting with HPO tools. Second, obtaining explicit formulas and inspecting their properties could help researchers understand the HPO loss landscape better. In an experimental evaluation, we find that naively applying symbolic regression directly to meta-data collected during HPO is affected by the sampling bias introduced by BO. However, the true underlying loss landscape can be approximated by fitting the symbolic regression on the surrogate model trained during BO. By penalizing longer formulas, symbolic regression furthermore allows the user to decide how to balance the accuracy and explainability of the resulting formulas.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[146]

P. Koch, G. V. Nuñez, E. Garces Arias, C. Heumann, M. Schöffel, A. Häberlin and M. Aßenmacher.
A tailored Handwritten-Text-Recognition System for Medieval Latin.
ALP @RANLP 2023 - 1st Workshop on Ancient Language Processing co-located with the Conference on Recent Advances in Natural Language Processing (RANLP 2023). Varna, Bulgaria, Sep 08, 2023. URL

Abstract

The Bavarian Academy of Sciences and Humanities aims to digitize the Medieval Latin Dictionary. This dictionary entails record cards referring to lemmas in medieval Latin, a low-resource language. A crucial step of the digitization process is the handwritten text recognition (HTR) of the handwritten lemmas on the record cards. In our work, we introduce an end-to-end pipeline, tailored for the medieval Latin dictionary, for locating, extracting, and transcribing the lemmas. We employ two state-of-the-art image segmentation models to prepare the initial data set for the HTR task. Further, we experiment with different transformer-based models and conduct a set of experiments to explore the capabilities of different combinations of vision encoders with a GPT-2 decoder. Additionally, we also apply extensive data augmentation resulting in a highly competitive model. The best-performing setup achieved a character error rate of 0.015, which is even superior to the commercial Google Cloud Vision model, and shows more stable performance.

MCML Authors

Esteban Garces Arias

Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[145]

H. A. Gündüz, M. Binder, X.-Y. To, R. Mreches, B. Bischl, A. C. McHardy, P. C. Münch and M. Rezaei.
A self-supervised deep learning method for data-efficient training in genomics.
Communications Biology 6.928 (Sep. 2023). DOI

Abstract

Deep learning in bioinformatics is often limited to problems where extensive amounts of labeled data are available for supervised classification. By exploiting unlabeled data, self-supervised learning techniques can improve the performance of machine learning models in the presence of limited labeled data. Although many self-supervised learning methods have been suggested before, they have failed to exploit the unique characteristics of genomic data. Therefore, we introduce Self-GenomeNet, a self-supervised learning technique that is custom-tailored for genomic data. Self-GenomeNet leverages reverse-complement sequences and effectively learns short- and long-term dependencies by predicting targets of different lengths. Self-GenomeNet performs better than other self-supervised methods in data-scarce genomic tasks and outperforms standard supervised training with ~10 times fewer labeled training data. Furthermore, the learned representations generalize well to new datasets and tasks. These findings suggest that Self-GenomeNet is well suited for large-scale, unlabeled genomic datasets and could substantially improve the performance of genomic models.

MCML Authors

Hüseyin Anil Gündüz

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Xiao-Yin To

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Mina Rezaei

Dr.

Statistical Learning and Data Science

[144]

R. P. Prager, K. Dietrich, L. Schneider, L. Schäpermeier, B. Bischl, P. Kerschke, H. Trautmann and O. Mersmann.
Neural Networks as Black-Box Benchmark Functions Optimized for Exploratory Landscape Features.
FOGA 2023 - 17th ACM/SIGEVO Conference on Foundations of Genetic Algorithms. Potsdam, Germany, Aug 30-Sep 01, 2023. DOI

Abstract

Artificial benchmark functions are commonly used in optimization research because of their ability to rapidly evaluate potential solutions, making them a preferred substitute for real-world problems. However, these benchmark functions have faced criticism for their limited resemblance to real-world problems. In response, recent research has focused on automatically generating new benchmark functions for areas where established test suites are inadequate. These approaches have limitations, such as the difficulty of generating new benchmark functions that exhibit exploratory landscape analysis (ELA) features beyond those of existing benchmarks. The objective of this work is to develop a method for generating benchmark functions for single-objective continuous optimization with user-specified structural properties. Specifically, we aim to demonstrate a proof of concept for a method that uses an ELA feature vector to specify these properties in advance. To achieve this, we begin by generating a random sample of decision space variables and objective values. We then adjust the objective values using CMA-ES until the corresponding features of our new problem match the predefined ELA features within a specified threshold. By iteratively transforming the landscape in this way, we ensure that the resulting function exhibits the desired properties. To create the final function, we use the resulting point cloud as training data for a simple neural network that produces a function exhibiting the target ELA features. We demonstrate the effectiveness of this approach by replicating the existing functions of the well-known BBOB suite and creating new functions with ELA feature values that are not present in BBOB.

MCML Authors

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[143]

A. Scheppach, H. A. Gündüz, E. Dorigatti, P. C. Münch, A. C. McHardy, B. Bischl, M. Rezaei and M. Binder.
Neural Architecture Search for Genomic Sequence Data.
CIBCB 2023 - 20th IEEE Symposium on Computational Intelligence and Bioinformatics and Computational Biology. Eindhoven, The Netherlands, Aug 29-31, 2023. DOI

Abstract

Deep learning has enabled outstanding progress on bioinformatics datasets and a variety of tasks, such as protein structure prediction, identification of regulatory regions, genome annotation, and interpretation of the noncoding genome. The layout and configuration of neural networks used for these tasks have mostly been developed manually by human experts, which is a time-consuming and error-prone process. Therefore, there is growing interest in automated neural architecture search (NAS) methods in bioinformatics. In this paper, we present a novel search space for NAS algorithms that operate on genome data, thus creating extensions for existing NAS algorithms for sequence data that we name Genome-DARTS, Genome-P-DARTS, Genome-BONAS, Genome-SH, and Genome-RS. Moreover, we introduce two novel NAS algorithms, CWP-DARTS and EDPDARTS, that build on and extend the idea of P-DARTS. We evaluate the presented methods and compare them to manually designed neural architectures on a widely used genome sequence machine learning task to show that NAS methods can be adapted well for bioinformatics sequence datasets. Our experiments show that architectures optimized by our NAS methods outperform manually developed architectures while having significantly fewer parameters.

MCML Authors

Hüseyin Anil Gündüz

* Former Member

Emilio Dorigatti

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Mina Rezaei

Dr.

Statistical Learning and Data Science

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[142]

F. Ott, D. Rügamer, L. Heublein, B. Bischl and C. Mutschler.
Auxiliary Cross-Modal Representation Learning With Triplet Loss Functions for Online Handwriting Recognition.
IEEE Access 11 (Aug. 2023). DOI

Abstract

Cross-modal representation learning learns a shared embedding between two or more modalities to improve performance in a given task compared to using only one of the modalities. Cross-modal representation learning from different data types - such as images and time-series data (e.g., audio or text data) – requires a deep metric learning loss that minimizes the distance between the modality embeddings. In this paper, we propose to use the contrastive or triplet loss, which uses positive and negative identities to create sample pairs with different labels, for cross-modal representation learning between image and time-series modalities (CMR-IS). By adapting the triplet loss for cross-modal representation learning, higher accuracy in the main (time-series classification) task can be achieved by exploiting additional information of the auxiliary (image classification) task. We present a triplet loss with a dynamic margin for single label and sequence-to-sequence classification tasks. We perform extensive evaluations on synthetic image and time-series data, and on data for offline handwriting recognition (HWR) and on online HWR from sensor-enhanced pens for classifying written words. Our experiments show an improved classification accuracy, faster convergence, and better generalizability due to an improved cross-modal representation. Furthermore, the more suitable generalizability leads to a better adaptability between writers for online HWR.

MCML Authors

Felix Ott

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[141]

F. Pfisterer, S. Wei, S. Vollmer, M. Lang and B. Bischl.
Fairness Audits and Debiasing Using mlr3fairness.
The R Journal 15.1 (Aug. 2023). DOI

Abstract

Given an increase in data-driven automated decision-making based on machine learning (ML) models, it is imperative that, along with tools to develop and improve such models, there are sufficient capabilities to analyze and assess models with respect to potential biases. We present the package mlr3fairness, a collection of metrics and methods that allow for the assessment of bias in machine learning models. Our package implements a variety of widely used fairness metrics that can be used to audit models for potential biases, along with a set of visualizations that can help to provide additional insights into such biases. mlr3fairness furthermore integrates bias mitigation methods for machine learning models through data pre-processing or post-processing of predictions. These allow practitioners to trade off performance and fairness metrics that are appropriate for their use case.

MCML Authors

Florian Pfisterer

Dr.

* Former Member

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[140]

J. Rodemann, J. Goschenhofer, E. Dorigatti, T. Nagler and T. Augustin.
Approximately Bayes-optimal pseudo-label selection.
UAI 2023 - 39th Conference on Uncertainty in Artificial Intelligence. Pittsburgh, PA, USA, Jul 31-Aug 03, 2023. URL

Abstract

Semi-supervised learning by self-training heavily relies on pseudo-label selection (PLS). This selection often depends on the initial model fit on labeled data. Early overfitting might thus be propagated to the final model by selecting instances with overconfident but erroneous predictions, often referred to as confirmation bias. This paper introduces BPLS, a Bayesian framework for PLS that aims to mitigate this issue. At its core lies a criterion for selecting instances to label: an analytical approximation of the posterior predictive of pseudo-samples. We derive this selection criterion by proving Bayes-optimality of the posterior predictive of pseudo-samples. We further overcome computational hurdles by approximating the criterion analytically. Its relation to the marginal likelihood allows us to come up with an approximation based on Laplace’s method and the Gaussian integral. We empirically assess BPLS on simulated and real-world data. When faced with high-dimensional data prone to overfitting, BPLS outperforms traditional PLS methods.

MCML Authors

Jann Goschenhofer

Dr.

* Former Member

Emilio Dorigatti

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Thomas Nagler

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Computational Statistics & Data Science

[139]

L. Wimmer, Y. Sale, P. Hofman, B. Bischl and E. Hüllermeier.
Quantifying Aleatoric and Epistemic Uncertainty in Machine Learning: Are Conditional Entropy and Mutual Information Appropriate Measures?
UAI 2023 - 39th Conference on Uncertainty in Artificial Intelligence. Pittsburgh, PA, USA, Jul 31-Aug 03, 2023. URL

Abstract

The quantification of aleatoric and epistemic uncertainty in terms of conditional entropy and mutual information, respectively, has recently become quite common in machine learning. While the properties of these measures, which are rooted in information theory, seem appealing at first glance, we identify various incoherencies that call their appropriateness into question. In addition to the measures themselves, we critically discuss the idea of an additive decomposition of total uncertainty into its aleatoric and epistemic constituents. Experiments across different computer vision tasks support our theoretical findings and raise concerns about current practice in uncertainty quantification.

MCML Authors

Lisa Wimmer

Statistical Learning and Data Science

Yusuf Sale

Artificial Intelligence and Machine Learning

Paul Hofman

Artificial Intelligence and Machine Learning

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Eyke Hüllermeier

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Artificial Intelligence and Machine Learning

[138]

A. Stüber, S. Coors and M. Ingrisch.
Revitalize the Potential of Radiomics: Interpretation and Feature Stability in Medical Imaging Analyses through Groupwise Feature Importance.
LB-D-DC @xAI 2023 - Late-breaking Work, Demos and Doctoral Consortium at the 1st World Conference on eXplainable Artificial Intelligence (xAI 2023). Lisbon, Portugal, Jul 26-28, 2023. PDF

Abstract

Radiomics, involving analysis of calculated, quantitative features from medical images with machine learning tools, shares the instability challenge with other high-dimensional data analyses due to variations in the training set. This instability affects model interpretation and feature importance assessment. To enhance stability and interpretability, we introduce grouped feature importance, shedding light on tool limitations and advocating for more reliable radiomics-based analysis methods.

MCML Authors

Stefan Coors

* Former Member

Michael Ingrisch

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Clinical Data Science in Radiology

[137]

C. Molnar, T. Freiesleben, G. König, J. Herbinger, T. Reisinger, G. Casalicchio, M. N. Wright and B. Bischl.
Relating the Partial Dependence Plot and Permutation Feature Importance to the Data Generating Process.
xAI 2023 - 1st World Conference on eXplainable Artificial Intelligence. Lisbon, Portugal, Jul 26-28, 2023. DOI

Abstract

Scientists and practitioners increasingly rely on machine learning to model data and draw conclusions. Compared to statistical modeling approaches, machine learning makes fewer explicit assumptions about data structures, such as linearity. However, their model parameters usually cannot be easily related to the data generating process. To learn about the modeled relationships, partial dependence (PD) plots and permutation feature importance (PFI) are often used as interpretation methods. However, PD and PFI lack a theory that relates them to the data generating process. We formalize PD and PFI as statistical estimators of ground truth estimands rooted in the data generating process. We show that PD and PFI estimates deviate from this ground truth due to statistical biases, model variance and Monte Carlo approximation errors. To account for model variance in PD and PFI estimation, we propose the learner-PD and the learner-PFI based on model refits, and propose corrected variance and confidence interval estimators.

MCML Authors

Gunnar König

Dr.

* Former Member

Julia Herbinger

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[136]

J. Goschenhofer, B. Bischl and Z. Kira.
ConstraintMatch for Semi-constrained Clustering.
IJCNN 2023 - International Joint Conference on Neural Networks. Gold Coast Convention and Exhibition Centre, Queensland, Australia, Jul 18-23, 2023. DOI

Abstract

Constrained clustering allows the training of classi-fication models using pairwise constraints only, which are weak and relatively easy to mine, while still yielding full-supervision-level model performance. While they perform well even in the absence of the true underlying class labels, constrained clustering models still require large amounts of binary constraint annotations for training. In this paper, we propose a semi-supervised context whereby a large amount of unconstrained data is available alongside a smaller set of constraints, and propose ConstraintMatch to leverage such unconstrained data. While a great deal of progress has been made in semi-supervised learning using full labels, there are a number of challenges that prevent a naive application of the resulting methods in the constraint-based label setting. Therefore, we reason about and analyze these challenges, specifically 1) proposing a pseudo-constraining mechanism to overcome the confirmation bias, a major weakness of pseudo-labeling, 2) developing new methods for pseudo-labeling towards the selection of informative unconstrained samples, 3) showing that this also allows the use of pairwise loss functions for the initial and auxiliary losses which facilitates semi-constrained model training. In extensive experiments, we demonstrate the effectiveness of ConstraintMatch over relevant baselines in both the regular clustering and overclustering scenarios on five challenging benchmarks and provide analyses of its several components.

MCML Authors

Jann Goschenhofer

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[135]

C. Kolb, B. Bischl, C. L. Müller and D. Rügamer.
Sparse Modality Regression.
IWSM 2023 - 37th International Workshop on Statistical Modelling. Dortmund, Germany, Jul 17-21, 2023. Best Paper Award. PDF

Abstract

Deep neural networks (DNNs) enable learning from various data modalities, such as images or text. This concept has also found its way into statistical modelling through the use of semi-structured regression, a model additively combining structured predictors with unstructured effects from arbitrary data modalities learned through a DNN. This paper introduces a new framework called sparse modality regression (SMR). SMR is a regression model combining different data modalities and uses a group lasso-type regularization approach to perform modality selection by zeroing out potentially uninformative modalities.

MCML Authors

Chris Kolb

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Christian Müller

Prof. Dr.

C2 | Biology

Biomedical Statistics and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[134]

L. Schneider, B. Bischl and J. Thomas.
Multi-Objective Optimization of Performance and Interpretability of Tabular Supervised Machine Learning Models.
GECCO 2023 - Genetic and Evolutionary Computation Conference. Lisbon, Portugal, Jul 15-19, 2023. DOI

Abstract

We present a model-agnostic framework for jointly optimizing the predictive performance and interpretability of supervised machine learning models for tabular data. Interpretability is quantified via three measures: feature sparsity, interaction sparsity of features, and sparsity of non-monotone feature effects. By treating hyperparameter optimization of a machine learning algorithm as a multi-objective optimization problem, our framework allows for generating diverse models that trade off high performance and ease of interpretability in a single optimization run. Efficient optimization is achieved via augmentation of the search space of the learning algorithm by incorporating feature selection, interaction and monotonicity constraints into the hyperparameter search space. We demonstrate that the optimization problem effectively translates to finding the Pareto optimal set of groups of selected features that are allowed to interact in a model, along with finding their optimal monotonicity constraints and optimal hyperparameters of the learning algorithm itself. We then introduce a novel evolutionary algorithm that can operate efficiently on this augmented search space. In benchmark experiments, we show that our framework is capable of finding diverse models that are highly competitive or outperform state-of-the-art XGBoost or Explainable Boosting Machine models, both with respect to performance and interpretability.

MCML Authors

Lennart Schneider

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Janek Thomas

Dr.

* Former Member

[133]

D. Saggau, M. Rezaei, B. Bischl and I. Chalkidis.
Efficient Document Embeddings via Self-Contrastive Bregman Divergence Learning.
ACL 2023 - Findings of the 61th Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI

Abstract

Learning quality document embeddings is a fundamental problem in natural language processing (NLP), information retrieval (IR), recommendation systems, and search engines. Despite recent advances in the development of transformer-based models that produce sentence embeddings with self-contrastive learning, the encoding of long documents (Ks of words) is still challenging with respect to both efficiency and quality considerations. Therefore, we train Longfomer-based document encoders using a state-of-the-art unsupervised contrastive learning method (SimCSE). Further on, we complement the baseline method -siamese neural network- with additional convex neural networks based on functional Bregman divergence aiming to enhance the quality of the output document representations. We show that overall the combination of a self-contrastive siamese network and our proposed neural Bregman network outperforms the baselines in two linear classification settings on three long document topic classification tasks from the legal and biomedical domains.

MCML Authors

Mina Rezaei

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[132]

M. Aßenmacher, N. Sauter and C. Heumann.
Classifying multilingual party manifestos: Domain transfer across country, time, and genre.
Preprint (Jul. 2023). arXiv

Abstract

Annotating costs of large corpora are still one of the main bottlenecks in empirical social science research. On the one hand, making use of the capabilities of domain transfer allows re-using annotated data sets and trained models. On the other hand, it is not clear how well domain transfer works and how reliable the results are for transfer across different dimensions. We explore the potential of domain transfer across geographical locations, languages, time, and genre in a large-scale database of political manifestos. First, we show the strong within-domain classification performance of fine-tuned transformer models. Second, we vary the genre of the test set across the aforementioned dimensions to test for the fine-tuned models’ robustness and transferability. For switching genres, we use an external corpus of transcribed speeches from New Zealand politicians while for the other three dimensions, custom splits of the Manifesto database are used. While BERT achieves the best scores in the initial experiments across modalities, DistilBERT proves to be competitive at a lower computational expense and is thus used for further experiments across time and country. The results of the additional analysis show that (Distil)BERT can be applied to future data with similar performance. Moreover, we observe (partly) notable differences between the political manifestos of different countries of origin, even if these countries share a language or a cultural background.

MCML Authors

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science

[131]

C. Kolb, C. L. Müller, B. Bischl and D. Rügamer.
Smoothing the Edges: Smooth Optimization for Sparse Regularization using Hadamard Overparametrization.
Preprint (Jul. 2023). arXiv

Abstract

We present a framework for smooth optimization of explicitly regularized objectives for (structured) sparsity. These non-smooth and possibly non-convex problems typically rely on solvers tailored to specific models and regularizers. In contrast, our method enables fully differentiable and approximation-free optimization and is thus compatible with the ubiquitous gradient descent paradigm in deep learning. The proposed optimization transfer comprises an overparameterization of selected parameters and a change of penalties. In the overparametrized problem, smooth surrogate regularization induces non-smooth, sparse regularization in the base parametrization. We prove that the surrogate objective is equivalent in the sense that it not only has identical global minima but also matching local minima, thereby avoiding the introduction of spurious solutions. Additionally, our theory establishes results of independent interest regarding matching local minima for arbitrary, potentially unregularized, objectives. We comprehensively review sparsity-inducing parametrizations across different fields that are covered by our general theory, extend their scope, and propose improvements in several aspects. Numerical experiments further demonstrate the correctness and effectiveness of our approach on several sparse learning problems ranging from high-dimensional regression to sparse neural network training.

MCML Authors

Chris Kolb

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Christian Müller

Prof. Dr.

C2 | Biology

Biomedical Statistics and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[130]

S. Kaminwar, J. Goschenhofer, J. Thomas, I. Thon and B. Bischl.
Structured Verification of Machine Learning Models in Industrial Settings.
Big Data 11.3 (Jun. 2023). DOI

Abstract

The use of machine learning (ML) allows us to automate and scale the decision-making processes. The key to this automation is the development of ML models that generalize training data toward unseen data. Such models can become extremely versatile and powerful, which makes democratization of artificial intelligence (AI) possible, that is, providing ML to non-ML experts such as software engineers or domain experts. Typically, automated ML (AutoML) is being referred to as a key step toward it. However, from our perspective, we believe that democratization of the verification process of ML systems is a larger and even more crucial challenge to achieve the democratization of AI. Currently, the process of ensuring that an ML model works as intended is unstructured. It is largely based on experience and domain knowledge that cannot be automated. The current approaches such as cross-validation or explainable AI are not enough to overcome the real challenges and are discussed extensively in this article. Arguing toward structured verification approaches, we discuss a set of guidelines to verify models, code, and data in each step of the ML lifecycle. These guidelines can help to reliably measure and select an optimal solution, besides minimizing the risk of bugs and undesired behavior in edge-cases.

MCML Authors

Jann Goschenhofer

Dr.

* Former Member

Janek Thomas

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[129]

M. Rezaei, A. Vahidi, B. Bischl, T. Elze and M. Eslami.
Self-supervised Learning and Self-labeling Framework for Glaucoma Detection.
Investigative Ophthalmology and Visual Science 64.8 (Jun. 2023). URL

Abstract

Purpose: Self-supervised learning methods have made a significant impact in recent years on different domains, such as natural language processing and computer vision. Here, we develop a new self-supervised framework for simultaneous retina image clustering and self-supervised representation learning to enhance the diagnosis of glaucoma.
Methods: The network is optimized using both a contrastive self-supervised network and a clustering network that clustering helps to improve the embedding representation. Our method comprises two parallel deep networks; 1) a representation network which is a self-supervised contrastive representation network that takes two augmented views of the retina image, and 2) an image clustering or self-labeling network that takes original retina images. The representation network first projects the augmented views onto an embedding space. Then it processes these representations in a multi-layer perceptron head, which generates the baseline for the pair-wise contrastive objective. On the other hand, the clustering network performs KL divergence on the top embedding layer of the representation network.
Results: We train our framework for simultaneous representation learning and self-labeling using a clustering network. We follow standard protocols by self-supervised learning for empirical analysis and evaluate the learned representation of our model by classification (Table 1), as well as image clustering tasks (Table 2) on two different Glaucoma datasets. According to the result shown in Table 1, our method improves the results of Glaucoma classification by up to 14%, better compared to SOTA self-supervised algorithm in terms of F1 score and 2% better for the task of clustering. Glaucoma-1 is composed of the labeled subset of the human retinal images used in [1]. This dataset contains 2,397 images in total, with 956 glaucoma diagnoses. While the training set for Glaucoma-2 [2] was released by the REFUGE-2 challenge.
Conclusions: We showed that combining self-supervised representation learning along with self-labeling improves the learned representation compared to the existing self-supervised learning models on retina-based glaucoma detection by up to 14% better. Moreover, our method outperformed other self-supervised methods for image clustering tasks.

MCML Authors

Mina Rezaei

Dr.

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[128]

J. Moosbauer.
Towards explainable automated machine learning.
Dissertation 2023. DOI

Abstract

This thesis explores the intersection of Automated Machine Learning (AutoML) and explainable AI, addressing the need for transparency at multiple levels: the model, the learning algorithm, and the AutoML system itself. The work develops methods for enhancing model explainability through multi-objective hyperparameter optimization (HPO) and introduces new techniques to understand the effects of hyperparameters and optimizers within AutoML systems. These contributions advance the field by providing more interpretable and reliable tools for AutoML, ultimately increasing the accessibility and trustworthiness of machine learning models and their deployment. (Shortened.)

MCML Authors

Julia Moosbauer

Dr.

* Former Member

[127]

T. Weber, M. Ingrisch, B. Bischl and D. Rügamer.
Cascaded Latent Diffusion Models for High-Resolution Chest X-ray Synthesis.
PAKDD 2023 - 27th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Osaka, Japan, May 25-28, 2023. DOI

Abstract

While recent advances in large-scale foundational models show promising results, their application to the medical domain has not yet been explored in detail. In this paper, we progress into the realms of large-scale modeling in medical synthesis by proposing Cheff - a foundational cascaded latent diffusion model, which generates highly-realistic chest radiographs providing state-of-the-art quality on a 1-megapixel scale. We further propose MaCheX, which is a unified interface for public chest datasets and forms the largest open collection of chest X-rays up to date. With Cheff conditioned on radiological reports, we further guide the synthesis process over text prompts and unveil the research area of report-to-chest-X-ray generation.

MCML Authors

Tobias Weber

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[126]

K. Rath, D. Rügamer, B. Bischl, U. von Toussaint and C. G. Albert.
Dependent state space Student-t processes for imputation and data augmentation in plasma diagnostics.
Contributions to Plasma Physics 63.5-6 (May. 2023). DOI

Abstract

Multivariate time series measurements in plasma diagnostics present several challenges when training machine learning models: the availability of only a few labeled data increases the risk of overfitting, and missing data points or outliers due to sensor failures pose additional difficulties. To overcome these issues, we introduce a fast and robust regression model that enables imputation of missing points and data augmentation by massive sampling while exploiting the inherent correlation between input signals. The underlying Student-t process allows for a noise distribution with heavy tails and thus produces robust results in the case of outliers. We consider the state space form of the Student-t process, which reduces the computational complexity and makes the model suitable for high-resolution time series. We evaluate the performance of the proposed method using two test cases, one of which was inspired by measurements of flux loop signals.

MCML Authors

Katharina Röck (née Rath)

Dr.

* Former Member

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[125]

T. Pielok, B. Bischl and D. Rügamer.
Approximate Bayesian Inference with Stein Functional Variational Gradient Descent.
ICLR 2023 - 11th International Conference on Learning Representations. Kigali, Rwanda, May 01-05, 2023. URL

Abstract

We propose a general-purpose variational algorithm that forms a natural analogue of Stein variational gradient descent (SVGD) in function space. While SVGD successively updates a set of particles to match a target density, the method introduced here of Stein functional variational gradient descent (SFVGD) updates a set of particle functions to match a target stochastic process (SP). The update step is found by minimizing the functional derivative of the Kullback-Leibler divergence between SPs. SFVGD can either be used to train Bayesian neural networks (BNNs) or for ensemble gradient boosting. We show the efficacy of training BNNs with SFVGD on various real-world datasets.

MCML Authors

Tobias Pielok

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[124]

E. Dorigatti, B. Schubert, B. Bischl and D. Rügamer.
Frequentist Uncertainty Quantification in Semi-Structured Neural Networks.
AISTATS 2023 - 26th International Conference on Artificial Intelligence and Statistics. Valencia, Spain, Apr 25-27, 2023. URL

Abstract

Semi-structured regression (SSR) models jointly learn the effect of structured (tabular) and unstructured (non-tabular) data through additive predictors and deep neural networks (DNNs), respectively. Inference in SSR models aims at deriving confidence intervals for the structured predictor, although current approaches ignore the variance of the DNN estimation of the unstructured effects. This results in an underestimation of the variance of the structured coefficients and, thus, an increase of Type-I error rates. To address this shortcoming, we present here a theoretical framework for structured inference in SSR models that incorporates the variance of the DNN estimate into confidence intervals for the structured predictor. By treating this estimate as a random offset with known variance, our formulation is agnostic to the specific deep uncertainty quantification method employed. Through numerical experiments and a practical application on a medical dataset, we show that our approach results in increased coverage of the true structured coefficients and thus a reduction in Type-I error rate compared to ignoring the variance of the neural network, naive ensembling of SSR models, and a variational inference baseline.

MCML Authors

Emilio Dorigatti

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[123]

C. Luther, G. König and M. Grosse-Wentrup.
Efficient SAGE Estimation via Causal Structure Learning.
AISTATS 2023 - 26th International Conference on Artificial Intelligence and Statistics. Valencia, Spain, Apr 25-27, 2023. URL

Abstract

The Shapley Additive Global Importance (SAGE) value is a theoretically appealing interpretability method that fairly attributes global importance to a model’s features. However, its exact calculation requires the computation of the feature’s surplus performance contributions over an exponential number of feature sets. This is computationally expensive, particularly because estimating the surplus contributions requires sampling from conditional distributions. Thus, SAGE approximation algorithms only take a fraction of the feature sets into account. We propose $d$-SAGE, a method that accelerates SAGE approximation. $d$-SAGE is motivated by the observation that conditional independencies (CIs) between a feature and the model target imply zero surplus contributions, such that their computation can be skipped. To identify CIs, we leverage causal structure learning (CSL) to infer a graph that encodes (conditional) independencies in the data as $d$-separations. This is computationally more efficient because the expense of the one-time graph inference and the $d$-separation queries is negligible compared to the expense of surplus contribution evaluations. Empirically we demonstrate that $d$-SAGE enables the efficient and accurate estimation of SAGE values.

MCML Authors

Gunnar König

Dr.

* Former Member

Moritz Grosse-Wentrup

Prof. Dr.

* Former Principal Investigator

[122]

M. Feurer, K. Eggensperger, E. Bergman, F. Pfisterer, B. Bischl and F. Hutter.
Mind the Gap: Measuring Generalization Performance Across Multiple Objectives.
IDA 2023 - 21st International Symposium on Intelligent Data Analysis. Louvain-la-Neuve, Belgium, Apr 12-14, 2023. DOI

Abstract

Modern machine learning models are often constructed taking into account multiple objectives, e.g., minimizing inference time while also maximizing accuracy. Multi-objective hyperparameter optimization (MHPO) algorithms return such candidate models, and the approximation of the Pareto front is used to assess their performance. In practice, we also want to measure generalization when moving from the validation to the test set. However, some of the models might no longer be Pareto-optimal which makes it unclear how to quantify the performance of the MHPO method when evaluated on the test set. To resolve this, we provide a novel evaluation protocol that allows measuring the generalization performance of MHPO methods and studying its capabilities for comparing two optimization experiments.

MCML Authors

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Florian Pfisterer

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[121]

D. Schalk, B. Bischl and D. Rügamer.
Accelerated Componentwise Gradient Boosting Using Efficient Data Representation and Momentum-Based Optimization.
Journal of Computational and Graphical Statistics 32.2 (Apr. 2023). DOI

Abstract

Componentwise boosting (CWB), also known as model-based boosting, is a variant of gradient boosting that builds on additive models as base learners to ensure interpretability. CWB is thus often used in research areas where models are employed as tools to explain relationships in data. One downside of CWB is its computational complexity in terms of memory and runtime. In this article, we propose two techniques to overcome these issues without losing the properties of CWB: feature discretization of numerical features and incorporating Nesterov momentum into functional gradient descent. As the latter can be prone to early overfitting, we also propose a hybrid approach that prevents a possibly diverging gradient descent routine while ensuring faster convergence. Our adaptions improve vanilla CWB by reducing memory consumption and speeding up the computation time per iteration (through feature discretization) while also enabling CWB learn faster and hence to require fewer iterations in total using momentum. We perform extensive benchmarks on multiple simulated and real-world datasets to demonstrate the improvements in runtime and memory consumption while maintaining state-of-the-art estimation and prediction performance.

MCML Authors

Daniel Schalk

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Statistics, Data Science and Machine Learning

[120]

M. Herrmann, F. Pfisterer and F. Scheipl.
A geometric framework for outlier detection in high-dimensional data.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery e1491 (Apr. 2023). DOI

Abstract

Outlier or anomaly detection is an important task in data analysis. We discuss the problem from a geometrical perspective and provide a framework which exploits the metric structure of a data set. Our approach rests on the manifold assumption, that is, that the observed, nominally high-dimensional data lie on a much lower dimensional manifold and that this intrinsic structure can be inferred with manifold learning methods. We show that exploiting this structure significantly improves the detection of outlying observations in high dimensional data. We also suggest a novel, mathematically precise and widely applicable distinction between distributional and structural outliers based on the geometry and topology of the data manifold that clarifies conceptual ambiguities prevalent throughout the literature. Our experiments focus on functional data as one class of structured high-dimensional data, but the framework we propose is completely general and we include image and graph data applications. Our results show that the outlier structure of high-dimensional and non-tabular data can be detected and visualized using manifold learning methods and quantified using standard outlier scoring methods applied to the manifold embedding vectors.

MCML Authors

Moritz Herrmann

Dr.

Transfer Coordinator

Biometry in Molecular Medicine

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Fabian Scheipl

PD Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Functional Data Analysis

[119]

S. Dandl, A. Hofheinz, M. Binder, B. Bischl and G. Casalicchio.
counterfactuals: An R Package for Counterfactual Explanation Methods.
Preprint (Apr. 2023). arXiv

Abstract

Counterfactual explanation methods provide information on how feature values of individual observations must be changed to obtain a desired prediction. Despite the increasing amount of proposed methods in research, only a few implementations exist whose interfaces and requirements vary widely. In this work, we introduce the counterfactuals R package, which provides a modular and unified R6-based interface for counterfactual explanation methods. We implemented three existing counterfactual explanation methods and propose some optional methodological extensions to generalize these methods to different scenarios and to make them more comparable. We explain the structure and workflow of the package using real use cases and show how to integrate additional counterfactual explanation methods into the package. In addition, we compared the implemented methods for a variety of models and datasets with regard to the quality of their counterfactual explanations and their runtime behavior.

MCML Authors

Susanne Dandl

Dr.

* Former Member

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

[118]

D. Schalk.
Modern approaches for component-wise boosting: Automation, efficiency, and distributed computing with application to the medical domain.
Dissertation 2023. DOI

Abstract

This thesis focuses on enhancing component-wise boosting (CWB) by improving its efficiency and usability, particularly in high-dimensional feature spaces and distributed data settings. Key contributions include the optimization of the CWB algorithm through Nesterov’s momentum for faster fitting and reduced memory usage, as well as the development of the Autocompboost framework to integrate CWB with AutoML, emphasizing model interpretability. Additionally, the thesis introduces methods for evaluating binary classification models on distributed data using ROC analysis, and presents several R packages (compboost, dsCWB, Autocompboost, dsBinVal) that implement these advances. (Shortened.)

MCML Authors

Daniel Schalk

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[117]

J. Moosbauer, G. Casalicchio, M. Lindauer and B. Bischl.
Improving Accuracy of Interpretability Measures in Hyperparameter Optimization via Bayesian Algorithm Execution.
COSEAL 2023 - Workshop on Configuration and Selection of Algorithms. Paris, France, Mar 06-08, 2023. arXiv

Abstract

Despite all the benefits of automated hyperparameter optimization (HPO), most modern HPO algorithms are black-boxes themselves. This makes it difficult to understand the decision process which leads to the selected configuration, reduces trust in HPO, and thus hinders its broad adoption. Here, we study the combination of HPO with interpretable machine learning (IML) methods such as partial dependence plots. These techniques are more and more used to explain the marginal effect of hyperparameters on the black-box cost function or to quantify the importance of hyperparameters. However, if such methods are naively applied to the experimental data of the HPO process in a post-hoc manner, the underlying sampling bias of the optimizer can distort interpretations. We propose a modified HPO method which efficiently balances the search for the global optimum w.r.t. predictive performance and the reliable estimation of IML explanations of an underlying black-box function by coupling Bayesian optimization and Bayesian Algorithm Execution. On benchmark cases of both synthetic objectives and HPO of a neural network, we demonstrate that our method returns more reliable explanations of the underlying black-box without a loss of optimization performance.

MCML Authors

Julia Moosbauer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

[116]

B. Bischl, M. Binder, M. Lang, T. Pielok, J. Richter, S. Coors, J. Thomas, T. Ullmann, M. Becker, A.-L. Boulesteix, D. Deng and M. Lindauer.
Hyperparameter optimization: Foundations, algorithms, best practices, and open challenges.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 13.2 (Mar. 2023). DOI

Abstract

Most machine learning algorithms are configured by a set of hyperparameters whose values must be carefully chosen and which often considerably impact performance. To avoid a time-consuming and irreproducible manual process of trial-and-error to find well-performing hyperparameter configurations, various automatic hyperparameter optimization (HPO) methods—for example, based on resampling error estimation for supervised machine learning—can be employed. After introducing HPO from a general perspective, this paper reviews important HPO methods, from simple techniques such as grid or random search to more advanced methods like evolution strategies, Bayesian optimization, Hyperband, and racing. This work gives practical recommendations regarding important choices to be made when conducting HPO, including the HPO algorithms themselves, performance evaluation, how to combine HPO with machine learning pipelines, runtime improvements, and parallelization.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Martin Binder

Statistical Learning and Data Science

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Tobias Pielok

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Stefan Coors

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Janek Thomas

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Theresa Ullmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

* Former Member

Marc Becker

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Biometry in Molecular Medicine

[115]

G. König, T. Freiesleben and M. Grosse-Wentrup.
Improvement-focused causal recourse (ICR).
AAAI 2023 - 37th Conference on Artificial Intelligence. Washington, DC, USA, Feb 07-14, 2023. DOI

Abstract

Algorithmic recourse recommendations, such as Karimi et al.’s (2021) causal recourse (CR), inform stakeholders of how to act to revert unfavorable decisions. However, there are actions that lead to acceptance (i.e., revert the model’s decision) but do not lead to improvement (i.e., may not revert the underlying real-world state). To recommend such actions is to recommend fooling the predictor. We introduce a novel method, Improvement-Focused Causal Recourse (ICR), which involves a conceptual shift: Firstly, we require ICR recommendations to guide toward improvement. Secondly, we do not tailor the recommendations to be accepted by a specific predictor. Instead, we leverage causal knowledge to design decision systems that predict accurately pre- and post-recourse. As a result, improvement guarantees translate into acceptance guarantees. We demonstrate that given correct causal knowledge ICR, in contrast to existing approaches, guides toward both acceptance and improvement.

MCML Authors

Gunnar König

Dr.

* Former Member

Moritz Grosse-Wentrup

Prof. Dr.

* Former Principal Investigator

[114]

D. Rügamer, C. Kolb and N. Klein.
Semi-Structured Distributional Regression.
American Statistician (Feb. 2023). DOI

Abstract

Combining additive models and neural networks allows to broaden the scope of statistical regression and extends deep learning-based approaches by interpretable structured additive predictors at the same time. Existing approaches uniting the two modeling approaches are, however, limited to very specific combinations and, more importantly, involve an identifiability issue. As a consequence, interpretability and stable estimation is typically lost. We propose a general framework to combine structured regression models and deep neural networks into a unifying network architecture. To overcome the inherent identifiability issues between different model parts, we construct an orthogonalization cell that projects the deep neural network into the orthogonal complement of the statistical model predictor. This enables proper estimation of structured model parts and thereby interpretability. We demonstrate the framework’s efficacy in numerical experiments and illustrate its special merits in benchmarks and real-world applications.

MCML Authors

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

Chris Kolb

Statistical Learning and Data Science

[113]

D. Schalk, V. S. Hoffmann, B. Bischl and U. Mansmann.
dsBinVal: Conducting distributed ROC analysis using DataSHIELD.
The Journal of Open Source Software 8.82 (Feb. 2023). DOI

Abstract

Our R (R Core Team, 2021) package dsBinVal implements the methodology explained by Schalk et al. (2022). It extends the ROC-GLM (Pepe, 2000) to distributed data by using techniques of differential privacy (Dwork et al., 2006) and the idea of sharing highly aggregated values only. The package also exports functionality to calculate distributed calibration curves and assess the calibration. Using the package allows us to evaluate a prognostic model based on a binary outcome using the DataSHIELD (Gaye et al., 2014) framework. Therefore, the main functionality makes it able to 1) compute the receiver operating characteristic (ROC) curve using the ROC-GLM from which 2) the area under the curve (AUC) and confidence intervals (CI) are derived to conduct hypothesis testing according to DeLong et al. (1988). Furthermore, 3) the calibration can be assessed distributively via calibration curves and the Brier score. Visualizing the approximated ROC curve, the AUC with confidence intervals, and the calibration curves using ggplot2 is also supported. Examples can be found in the README file of the repository.

MCML Authors

Daniel Schalk

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[112]

F. Ott.
Representation learning for domain adaptation and cross-modal retrieval: in the context of online handwriting recognition and visual self-localization.
Dissertation 2023. DOI

Abstract

This thesis focuses on domain adaptation and cross-modal retrieval to address the challenges posed by domain shifts in machine learning applications. Specifically, it explores techniques for online handwriting recognition and visual self-localization. For handwriting recognition, the study uses deep metric learning and optimal transport to reduce domain shifts between different writing styles and writing modalities, while for visual self-localization, it enhances pose prediction through auxiliary tasks and representation learning fusion techniques to improve accuracy across sensor modalities. (Shortened.)

MCML Authors

Felix Ott

Dr.

* Former Member

[111]

I. Ziegler, B. Ma, B. Bischl, E. Dorigatti and B. Schubert.
Proteasomal cleavage prediction: state-of-the-art and future directions.
Preprint (2023). DOI GitHub

Abstract

Epitope vaccines are a promising approach for precision treatment of pathogens, cancer, autoimmune diseases, and allergies. Effectively designing such vaccines requires accurate proteasomal cleavage prediction to ensure that the epitopes included in the vaccine trigger an immune response. The performance of proteasomal cleavage predictors has been steadily improving over the past decades owing to increasing data availability and methodological advances. In this review, we summarize the current proteasomal cleavage prediction landscape and, in light of recent progress in the field of deep learning, develop and compare a wide range of recent architectures and techniques, including long short-term memory (LSTM), transformers, and convolutional neural networks (CNN), as well as four different denoising techniques. All open-source cleavage predictors re-trained on our dataset performed within two AUC percentage points. Our comprehensive deep learning architecture benchmark improved performance by 1.7 AUC percentage points, while closed-source predictors performed considerably worse. We found that a wide range of architectures and training regimes all result in very similar performance, suggesting that the specific modeling approach employed has a limited impact on predictive performance compared to the specifics of the dataset employed. We speculate that the noise and implicit nature of data acquisition techniques used for training proteasomal cleavage prediction models and the complexity of biological processes of the antigen processing pathway are the major limiting factors. While biological complexity can be tackled by more data and, to a lesser extent, better models, noise and randomness inherently limit the maximum achievable predictive performance.

MCML Authors

Bolei Ma

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Emilio Dorigatti

Dr.

* Former Member

2022

[110]

J. Goschenhofer, P. Ragupathy, C. Heumann, B. Bischl and M. Aßenmacher.
CC-Top: Constrained Clustering for Dynamic Topic Discovery.
EvoNLP 2022 - 1st Workshop on Ever Evolving NLP. Abu Dhabi, United Arab Emirates, Dec 07, 2022. URL

Abstract

Research on multi-class text classification of short texts mainly focuses on supervised (transfer) learning approaches, requiring a finite set of pre-defined classes which is constant over time. This work explores deep constrained clustering (CC) as an alternative to supervised learning approaches in a setting with a dynamically changing number of classes, a task we introduce as dynamic topic discovery (DTD).We do so by using pairwise similarity constraints instead of instance-level class labels which allow for a flexible number of classes while exhibiting a competitive performance compared to supervised approaches. First, we substantiate this through a series of experiments and show that CC algorithms exhibit a predictive performance similar to state-of-the-art supervised learning algorithms while requiring less annotation effort. Second, we demonstrate the overclustering capabilities of deep CC for detecting topics in short text data sets in the absence of the ground truth class cardinality during model training. Third, we showcase that these capabilities can be leveraged for the DTD setting as a step towards dynamic learning over time and finally, we release our codebase to nurture further research in this area.

MCML Authors

Jann Goschenhofer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science

[109]

C. Fritz, G. De Nicola, F. Günther, D. Rügamer, M. Rave, M. Schneble, A. Bender, M. Weigert, R. Brinks, A. Hoyer, U. Berger, H. Küchenhoff and G. Kauermann.
Challenges in Interpreting Epidemiological Surveillance Data – Experiences from Germany.
Journal of Computational and Graphical Statistics 32.3 (Dec. 2022). DOI

Abstract

As early as March 2020, the authors of this letter started to work on surveillance data to obtain a clearer picture of the pandemic’s dynamic. This letter outlines the lessons learned during this peculiar time, emphasizing the benefits that better data collection, management, and communication processes would bring to the table. We further want to promote nuanced data analyses as a vital element of general political discussion as opposed to drawing conclusions from raw data, which are often flawed in epidemiological surveillance data, and therefore underline the overall need for statistics to play a more central role in public discourse.

MCML Authors

Cornelius Fritz

Dr.

A1 | Statistical Foundations & Explainability
→ Group Göran Kauermann

* Former Member

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

Maximilian Weigert

C4 | Computational Social Sciences
→ Group Helmut Küchenhoff

* Former Member

Helmut Küchenhoff

Prof. Dr.

Statistical Consulting Unit (StaBLab)

Göran Kauermann

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Applied Statistics in Social Sciences, Economics and Business

[108]

M. Rezaei, E. Dorigatti, D. Rügamer and B. Bischl.
Joint Debiased Representation Learning and Imbalanced Data Clustering.
ICDMW 2022 - IEEE International Conference on Data Mining Workshops. Orlando, FL, USA, Nov 30-Dec 02, 2022. DOI

Abstract

One of the most promising approaches for unsu-pervised learning is combining deep representation learning and deep clustering. Some recent works propose to simultaneously learn representation using deep neural networks and perform clustering by defining a clustering loss on top of embedded features. However, these approaches are sensitive to imbalanced data and out-of-distribution samples. As a consequence, these methods optimize clustering by pushing data close to randomly initialized cluster centers. This is problematic when the number of instances varies largely in different classes or a cluster with few samples has less chance to be assigned a good centroid. To overcome these limitations, we introduce a new unsupervised framework for joint debiased representation learning and image clustering. We simultaneously train two deep learning models, a deep representation network that captures the data distribution, and a deep clustering network that learns embedded features and performs clustering. Specifically, the clustering network and learning representation network both take advantage of our proposed statistics pooling block that represents mean, variance, and cardinality to handle the out-of-distribution samples and class imbalance. Our experiments show that using these repre-sentations, one can considerably improve results on imbalanced image clustering across a variety of image datasets. Moreover, the learned representations generalize well when transferred to the out-of-distribution dataset.

MCML Authors

Mina Rezaei

Dr.

Statistical Learning and Data Science

Emilio Dorigatti

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[107]

N. Hurmer, X.-Y. To, M. Binder, H. A. Gündüz, P. C. Münch, R. Mreches, A. C. McHardy, B. Bischl and M. Rezaei.
Transformer Model for Genome Sequence Analysis.
LMRL @NeurIPS 2022 - Workshop on Learning Meaningful Representations of Life at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL

Abstract

One major challenge of applying machine learning in genomics is the scarcity of labeled data, which often requires expensive and time-consuming physical experimentation under laboratory conditions to obtain. However, the advent of high throughput sequencing has made large quantities of unlabeled genome data available. This can be used to apply semi-supervised learning methods through representation learning. In this paper, we investigate the impact of a popular and well-established language model, namely BERT [Devlin et al., 2018], for sequence genome analysis. Specifically, we adapt DNABERT [Ji et al., 2021] to GenomeNet-BERT in order to produce useful representations for downstream tasks such as classification and semi10 supervised learning. We explore different pretraining setups and compare their performance on a virus genome classification task to strictly supervised training and baselines on different training set size setups. The conducted experiments show that this architecture provides an increase in performance compared to existing methods at the cost of more resource-intensive training.

MCML Authors

Xiao-Yin To

Statistical Learning and Data Science

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Hüseyin Anil Gündüz

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Mina Rezaei

Dr.

Statistical Learning and Data Science

[106]

I. Ziegler, B. Ma, E. Nie, B. Bischl, D. Rügamer, B. Schubert and E. Dorigatti.
What cleaves? Is proteasomal cleavage prediction reaching a ceiling?
LMRL @NeurIPS 2022 - Workshop on Learning Meaningful Representations of Life at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL

Abstract

Epitope vaccines are a promising direction to enable precision treatment for cancer, autoimmune diseases, and allergies. Effectively designing such vaccines requires accurate prediction of proteasomal cleavage in order to ensure that the epitopes in the vaccine are presented to T cells by the major histocompatibility complex (MHC). While direct identification of proteasomal cleavage in vitro is cumbersome and low throughput, it is possible to implicitly infer cleavage events from the termini of MHC-presented epitopes, which can be detected in large amounts thanks to recent advances in high-throughput MHC ligandomics. Inferring cleavage events in such a way provides an inherently noisy signal which can be tackled with new developments in the field of deep learning that supposedly make it possible to learn predictors from noisy labels. Inspired by such innovations, we sought to modernize proteasomal cleavage predictors by benchmarking a wide range of recent methods, including LSTMs, transformers, CNNs, and denoising methods, on a recently introduced cleavage dataset. We found that increasing model scale and complexity appeared to deliver limited performance gains, as several methods reached about 88.5% AUC on C-terminal and 79.5% AUC on N-terminal cleavage prediction. This suggests that the noise and/or complexity of proteasomal cleavage and the subsequent biological processes of the antigen processing pathway are the major limiting factors for predictive performance rather than the specific modeling approach used. While biological complexity can be tackled by more data and better models, noise and randomness inherently limit the maximum achievable predictive performance.

MCML Authors

Bolei Ma

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Ercong Nie

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

Emilio Dorigatti

Dr.

* Former Member

[105]

E. Pretzsch, V. Heinemann, S. Stintzing, A. Bender, S. Chen, J. W. Holch, F. O. Hofmann, H. Ren, F. Böschand, H. Küchenhoff, J. Werner and M. K. Angele.
EMT-Related Genes Have No Prognostic Relevance in Metastatic Colorectal Cancer as Opposed to Stage II/III: Analysis of the Randomised, Phase III Trial FIRE-3 (AIO KRK 0306; FIRE-3).
Cancers 14.22 (Nov. 2022). DOI

Abstract

Despite huge advances in local and systemic therapies, the 5-year relative survival rate for patients with metastatic CRC is still low. To avoid over- or undertreatment, proper risk stratification with regard to treatment strategy is highly needed. As EMT (epithelial-mesenchymal transition) is a major step in metastatic spread, this study analysed the prognostic effect of EMT-related genes in stage IV colorectal cancer patients using the study cohort of the FIRE-3 trial, an open-label multi-centre randomised controlled phase III trial of stage IV colorectal cancer patients. Overall, the prognostic relevance of EMT-related genes seems stage-dependent. EMT-related genes have no prognostic relevance in stage IV CRC as opposed to stage II/III.

MCML Authors

Andreas Bender

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Machine Learning Consulting Unit (MLCU)

Shuo Chen

A3 | Computational Models
→ Group Volker Tresp

Database Systems and Data Mining

Helmut Küchenhoff

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Consulting Unit (StaBLab)

[104]

F. Pfisterer.
Democratizing machine learning: contributions in AutoML and fairness.
Dissertation 2022. DOI

Abstract

This thesis focuses on democratizing access to machine learning (ML) by improving automated machine learning (AutoML) systems and making ML tools more accessible to non-experts. Key contributions include methods to accelerate hyperparameter optimization by learning from previous experiments, the integration of fairness considerations in AutoML, and the development of software packages such as mlr3pipelines for creating machine learning pipelines and mlr3fairness for auditing and debiasing models. The thesis also includes tools for estimating and mitigating model fairness, such as the mcboost package for multi-calibration, addressing both the technical and ethical challenges of widespread ML deployment. (Shortened.)

MCML Authors

Florian Pfisterer

Dr.

* Former Member

[103]

F. Ott, D. Rügamer, L. Heublein, B. Bischl and C. Mutschler.
Domain Adaptation for Time-Series Classification to Mitigate Covariate Shift.
MM 2022 - 30th ACM International Conference on Multimedia. Lisbon, Portugal, Oct 10-14, 2022. DOI

Abstract

The performance of a machine learning model degrades when it is applied to data from a similar but different domain than the data it has initially been trained on. To mitigate this domain shift problem, domain adaptation (DA) techniques search for an optimal transformation that converts the (current) input data from a source domain to a target domain to learn a domain-invariant representation that reduces domain discrepancy. This paper proposes a novel supervised DA based on two steps. First, we search for an optimal class-dependent transformation from the source to the target domain from a few samples. We consider optimal transport methods such as the earth mover’s distance, Sinkhorn transport and correlation alignment. Second, we use embedding similarity techniques to select the corresponding transformation at inference. We use correlation metrics and higher-order moment matching techniques. We conduct an extensive evaluation on time-series datasets with domain shift including simulated and various online handwriting datasets to demonstrate the performance.

MCML Authors

Felix Ott

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[102]

J. Moosbauer, M. Binder, L. Schneider, F. Pfisterer, M. Becker, M. Lang, L. Kotthoff and B. Bischl.
Automated Benchmark-Driven Design and Explanation of Hyperparameter Optimizers.
IEEE Transactions on Evolutionary Computation 26.6 (Oct. 2022). DOI

Abstract

Automated hyperparameter optimization (HPO) has gained great popularity and is an important component of most automated machine learning frameworks. However, the process of designing HPO algorithms is still an unsystematic and manual process: new algorithms are often built on top of prior work, where limitations are identified and improvements are proposed. Even though this approach is guided by expert knowledge, it is still somewhat arbitrary. The process rarely allows for gaining a holistic understanding of which algorithmic components drive performance and carries the risk of overlooking good algorithmic design choices. We present a principled approach to automated benchmark-driven algorithm design applied to multifidelity HPO (MF-HPO). First, we formalize a rich space of MF-HPO candidates that includes, but is not limited to, common existing HPO algorithms and then present a configurable framework covering this space. To find the best candidate automatically and systematically, we follow a programming-by-optimization approach and search over the space of algorithm candidates via Bayesian optimization. We challenge whether the found design choices are necessary or could be replaced by more naive and simpler ones by performing an ablation analysis. We observe that using a relatively simple configuration (in some ways, simpler than established methods) performs very well as long as some critical configuration parameters are set to the right value.

MCML Authors

Julia Moosbauer

Dr.

* Former Member

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Marc Becker

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[101]

K. Rath, D. Rügamer, B. Bischl, U. von Toussaint, C. Rea, A. Maris, R. Granetz and C. G. Albert.
Data augmentation for disruption prediction via robust surrogate models.
Journal of Plasma Physics 88.5 (Oct. 2022). DOI

Abstract

The goal of this work is to generate large statistically representative data sets to train machine learning models for disruption prediction provided by data from few existing discharges. Such a comprehensive training database is important to achieve satisfying and reliable prediction results in artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student $t$ process regression. We apply Student $t$ process regression in a state space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training data set and to reduce the computational complexity. Thus, the method can also be used if the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via colouring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate if the distribution of the generated data is similar to the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics and classic machine learning clustering algorithms.

MCML Authors

Katharina Röck (née Rath)

Dr.

* Former Member

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[100]

L. Bothmann, S. Strickroth, G. Casalicchio, D. Rügamer, M. Lindauer, F. Scheipl and B. Bischl.
Developing Open Source Educational Resources for Machine Learning and Data Science.
ECML-PKDD 2022 - 3rd Teaching Machine Learning and Artificial Intelligence Workshop at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Grenoble, France, Sep 19-23, 2022. URL

Abstract

Education should not be a privilege but a common good. It should be openly accessible to everyone, with as few barriers as possible; even more so for key technologies such as Machine Learning (ML) and Data Science (DS). Open Educational Resources (OER) are a crucial factor for greater educational equity. In this paper, we describe the specific requirements for OER in ML and DS and argue that it is especially important for these fields to make source files publicly available, leading to Open Source Educational Resources (OSER). We present our view on the collaborative development of OSER, the challenges this poses, and first steps towards their solutions. We outline how OSER can be used for blended learning scenarios and share our experiences in university education. Finally, we discuss additional challenges such as credit assignment or granting certificates.

MCML Authors

Ludwig Bothmann

Dr.

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Fabian Scheipl

PD Dr.

Functional Data Analysis

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

[99]

D. Rügamer, A. Bender, S. Wiegrebe, D. Racek, B. Bischl, C. L. Müller and C. Stachl.
Factorized Structured Regression for Large-Scale Varying Coefficient Models.
ECML-PKDD 2022 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Grenoble, France, Sep 19-23, 2022. DOI

Abstract

Recommender Systems (RS) pervade many aspects of our everyday digital life. Proposed to work at scale, state-of-the-art RS allow the modeling of thousands of interactions and facilitate highly individualized recommendations. Conceptually, many RS can be viewed as instances of statistical regression models that incorporate complex feature effects and potentially non-Gaussian outcomes. Such structured regression models, including time-aware varying coefficients models, are, however, limited in their applicability to categorical effects and inclusion of a large number of interactions. Here, we propose Factorized Structured Regression (FaStR) for scalable varying coefficient models. FaStR overcomes limitations of general regression models for large-scale data by combining structured additive regression and factorization approaches in a neural network-based model implementation. This fusion provides a scalable framework for the estimation of statistical models in previously infeasible data settings. Empirical results confirm that the estimation of varying coefficients of our approach is on par with state-of-the-art regression techniques, while scaling notably better and also being competitive with other time-aware RS in terms of prediction performance. We illustrate FaStR’s performance and interpretability on a large-scale behavioral study with smartphone user data.

MCML Authors

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Christian Müller

Prof. Dr.

C2 | Biology

Biomedical Statistics and Data Science

[98]

D. Deng, F. Karl, F. Hutter, B. Bischl and M. Lindauer.
Efficient Automated Deep Learning for Time Series Forecasting.
ECML-PKDD 2022 - Workshops at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Grenoble, France, Sep 19-23, 2022. DOI

Abstract

Recent years have witnessed tremendously improved efficiency of Automated Machine Learning (AutoML), especially Automated Deep Learning (AutoDL) systems, but recent work focuses on tabular, image, or NLP tasks. So far, little attention has been paid to general AutoDL frameworks for time series forecasting, despite the enormous success in applying different novel architectures to such tasks. In this paper, we propose an efficient approach for the joint optimization of neural architecture and hyperparameters of the entire data processing pipeline for time series forecasting. In contrast to common NAS search spaces, we designed a novel neural architecture search space covering various state-of-the-art architectures, allowing for an efficient macro-search over different DL approaches. To efficiently search in such a large configuration space, we use Bayesian optimization with multi-fidelity optimization. We empirically study several different budget types enabling efficient multi-fidelity optimization on different forecasting datasets. Furthermore, we compared our resulting system, against several established baselines and show that it significantly outperforms all of them across several datasets.

MCML Authors

Florian Karl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[97]

T. Weber, M. Ingrisch, B. Bischl and D. Rügamer.
Implicit Embeddings via GAN Inversion for High Resolution Chest Radiographs.
MAD @MICCAI 2022 - 1st Workshop on Medical Applications with Disentanglements at the 25th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2022). Singapore, Sep 18-22, 2022. DOI

Abstract

Generative models allow for the creation of highly realistic artificial samples, opening up promising applications in medical imaging. In this work, we propose a multi-stage encoder-based approach to invert the generator of a generative adversarial network (GAN) for high resolution chest radiographs. This gives direct access to its implicitly formed latent space, makes generative models more accessible to researchers, and enables to apply generative techniques to actual patient’s images. We investigate various applications for this embedding, including image compression, disentanglement in the encoded dataset, guided image manipulation, and creation of stylized samples. We find that this type of GAN inversion is a promising research direction in the domain of chest radiograph modeling and opens up new ways to combine realistic X-ray sample synthesis with radiological image analysis.

MCML Authors

Tobias Weber

* Former Member

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[96]

R. Sonabend, A. Bender and S. Vollmer.
Avoiding C-hacking when evaluating survival distribution predictions with discrimination measures.
Bioinformatics 38.17 (Sep. 2022). DOI GitHub

Abstract

Motivation: In this article, we consider how to evaluate survival distribution predictions with measures of discrimination. This is non-trivial as discrimination measures are the most commonly used in survival analysis and yet there is no clear method to derive a risk prediction from a distribution prediction. We survey methods proposed in literature and software and consider their respective advantages and disadvantages.
Results: Whilst distributions are frequently evaluated by discrimination measures, we find that the method for doing so is rarely described in the literature and often leads to unfair comparisons or ‘C-hacking’. We demonstrate by example how simple it can be to manipulate results and use this to argue for better reporting guidelines and transparency in the literature. We recommend that machine learning survival analysis software implements clear transformations between distribution and risk predictions in order to allow more transparent and accessible model evaluation.

MCML Authors

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

[95]

W. Ghada, E. Casellas, J. Herbinger, A. Garcia-Benadí, L. Bothmann, N. Estrella, J. Bech and A. Menzel.
Stratiform and Convective Rain Classification Using Machine Learning Models and Micro Rain Radar.
Remote Sensing 14.18 (Sep. 2022). DOI

Abstract

Rain type classification into convective and stratiform is an essential step required to improve quantitative precipitation estimations by remote sensing instruments. Previous studies with Micro Rain Radar (MRR) measurements and subjective rules have been performed to classify rain events. However, automating this process by using machine learning (ML) models provides the advantages of fast and reliable classification with the possibility to classify rain minute by minute. A total of 20,979 min of rain data measured by an MRR at Das in northeast Spain were used to build seven types of ML models for stratiform and convective rain type classification. The proposed classification models use a set of 22 parameters that summarize the reflectivity, the Doppler velocity, and the spectral width (SW) above and below the so-called separation level (SL). This level is defined as the level with the highest increase in Doppler velocity and corresponds with the bright band in stratiform rain. A pre-classification of the rain type for each minute based on the rain microstructure provided by the collocated disdrometer was performed. Our results indicate that complex ML models, particularly tree-based ensembles such as xgboost and random forest which capture the interactions of different features, perform better than simpler models. Applying methods from the field of interpretable ML, we identified reflectivity at the lowest layer and the average spectral width in the layers below SL as the most important features. High reflectivity and low SW values indicate a higher probability of convective rain.

MCML Authors

Julia Herbinger

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[94]

E. Dorigatti, B. Bischl and B. Schubert.
Improved proteasomal cleavage prediction with positive-unlabeled learning.
Preprint (Sep. 2022). arXiv

Abstract

Accurate in silico modeling of the antigen processing pathway is crucial to enable personalized epitope vaccine design for cancer. An important step of such pathway is the degradation of the vaccine into smaller peptides by the proteasome, some of which are going to be presented to T cells by the MHC complex. While predicting MHC-peptide presentation has received a lot of attention recently, proteasomal cleavage prediction remains a relatively unexplored area in light of recent advancesin high-throughput mass spectrometry-based MHC ligandomics. Moreover, as such experimental techniques do not allow to identify regions that cannot be cleaved, the latest predictors generate decoy negative samples and treat them as true negatives when training, even though some of them could actually be positives. In this work, we thus present a new predictor trained with an expanded dataset and the solid theoretical underpinning of positive-unlabeled learning, achieving a new state-of-the-art in proteasomal cleavage prediction. The improved predictive capabilities will in turn enable more precise vaccine development improving the efficacy of epitope-based vaccines. Pretrained models are available on GitHub.

MCML Authors

Emilio Dorigatti

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[93]

E. Dorigatti, J. Schweisthal, B. Bischl and M. Rezaei.
Robust and Efficient Imbalanced Positive-Unlabeled Learning with Self-supervision.
Preprint (Sep. 2022). arXiv GitHub

Abstract

Learning from positive and unlabeled (PU) data is a setting where the learner only has access to positive and unlabeled samples while having no information on negative examples. Such PU setting is of great importance in various tasks such as medical diagnosis, social network analysis, financial markets analysis, and knowledge base completion, which also tend to be intrinsically imbalanced, i.e., where most examples are actually negatives. Most existing approaches for PU learning, however, only consider artificially balanced datasets and it is unclear how well they perform in the realistic scenario of imbalanced and long-tail data distribution. This paper proposes to tackle this challenge via robust and efficient self-supervised pretraining. However, training conventional self-supervised learning methods when applied with highly imbalanced PU distribution needs better reformulation. In this paper, we present textit{ImPULSeS}, a unified representation learning framework for underline{Im}balanced underline{P}ositive underline{U}nlabeled underline{L}earning leveraging underline{Se}lf-underline{S}upervised debiase pre-training. ImPULSeS uses a generic combination of large-scale unsupervised learning with debiased contrastive loss and additional reweighted PU loss. We performed different experiments across multiple datasets to show that ImPULSeS is able to halve the error rate of the previous state-of-the-art, even compared with previous methods that are given the true prior. Moreover, our method showed increased robustness to prior misspecification and superior performance even when pretraining was performed on an unrelated dataset. We anticipate such robustness and efficiency will make it much easier for practitioners to obtain excellent results on other PU datasets of interest.

MCML Authors

Emilio Dorigatti

Dr.

* Former Member

Jonas Schweisthal

C4 | Computational Social Sciences
→ Group Stefan Feuerriegel

Artificial Intelligence in Management

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Mina Rezaei

Dr.

Statistical Learning and Data Science

[92]

C. A. Scholbeck, H. Funk and G. Casalicchio.
Algorithm-Agnostic Interpretations for Clustering.
Preprint (Sep. 2022). arXiv

Abstract

A clustering outcome for high-dimensional data is typically interpreted via post-processing, involving dimension reduction and subsequent visualization. This destroys the meaning of the data and obfuscates interpretations. We propose algorithm-agnostic interpretation methods to explain clustering outcomes in reduced dimensions while preserving the integrity of the data. The permutation feature importance for clustering represents a general framework based on shuffling feature values and measuring changes in cluster assignments through custom score functions. The individual conditional expectation for clustering indicates observation-wise changes in the cluster assignment due to changes in the data. The partial dependence for clustering evaluates average changes in cluster assignments for the entire feature space. All methods can be used with any clustering algorithm able to reassign instances through soft or hard labels. In contrast to common post-processing methods such as principal component analysis, the introduced methods maintain the original structure of the features.

MCML Authors

Henri Funk

C4 | Computational Social Sciences
→ Group Helmut Küchenhoff

Statistical Consulting Unit (StaBLab)

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[91]

S.-F. Zheng, J. Nam, E. Dorigatti, B. Bischl, S. Azizi and M. Rezaei.
Joint Debiased Representation and Image Clustering Learning with Self-Supervision.
Preprint (Sep. 2022). arXiv GitHub

Abstract

Contrastive learning is among the most successful methods for visual representation learning, and its performance can be further improved by jointly performing clustering on the learned representations. However, existing methods for joint clustering and contrastive learning do not perform well on long-tailed data distributions, as majority classes overwhelm and distort the loss of minority classes, thus preventing meaningful representations to be learned. Motivated by this, we develop a novel joint clustering and contrastive learning framework by adapting the debiased contrastive loss to avoid under-clustering minority classes of imbalanced datasets. We show that our proposed modified debiased contrastive loss and divergence clustering loss improves the performance across multiple datasets and learning tasks.

MCML Authors

Emilio Dorigatti

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Mina Rezaei

Dr.

Statistical Learning and Data Science

[90]

F. Ott, D. Rügamer, L. Heublein, B. Bischl and C. Mutschler.
Representation Learning for Tablet and Paper Domain Adaptation in favor of Online Handwriting Recognition.
MPRSS @ICPR 2022 - 7th International Workshop on Multimodal pattern recognition of social signals in human computer interaction at the 26th International Conference on Pattern Recognition (ICPR 2022). Montreal, Canada, Aug 21-25, 2022. arXiv

Abstract

The performance of a machine learning model degrades when it is applied to data from a similar but different domain than the data it has initially been trained on. The goal of domain adaptation (DA) is to mitigate this domain shift problem by searching for an optimal feature transformation to learn a domain-invariant representation. Such a domain shift can appear in handwriting recognition (HWR) applications where the motion pattern of the hand and with that the motion pattern of the pen is different for writing on paper and on tablet. This becomes visible in the sensor data for online handwriting (OnHW) from pens with integrated inertial measurement units. This paper proposes a supervised DA approach to enhance learning for OnHW recognition between tablet and paper data. Our method exploits loss functions such as maximum mean discrepancy and correlation alignment to learn a domain-invariant feature representation (i.e., similar covariances between tablet and paper features). We use a triplet loss that takes negative samples of the auxiliary domain (i.e., paper samples) to increase the amount of samples of the tablet dataset. We conduct an evaluation on novel sequence-based OnHW datasets (i.e., words) and show an improvement on the paper domain with an early fusion strategy by using pairwise learning.

MCML Authors

Felix Ott

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[89]

F. Ott, N. L. Raichur, D. Rügamer, T. Feigl, H. Neumann, B. Bischl and C. Mutschler.
Benchmarking Visual-Inertial Deep Multimodal Fusion for Relative Pose Regression and Odometry-aided Absolute Pose Regression.
Preprint (Aug. 2022). arXiv

Abstract

Visual-inertial localization is a key problem in computer vision and robotics applications such as virtual reality, self-driving cars, and aerial vehicles. The goal is to estimate an accurate pose of an object when either the environment or the dynamics are known. Absolute pose regression (APR) techniques directly regress the absolute pose from an image input in a known scene using convolutional and spatio-temporal networks. Odometry methods perform relative pose regression (RPR) that predicts the relative pose from a known object dynamic (visual or inertial inputs). The localization task can be improved by retrieving information from both data sources for a cross-modal setup, which is a challenging problem due to contradictory tasks. In this work, we conduct a benchmark to evaluate deep multimodal fusion based on pose graph optimization and attention networks. Auxiliary and Bayesian learning are utilized for the APR task. We show accuracy improvements for the APR-RPR task and for the RPR-RPR task for aerial vehicles and hand-held devices. We conduct experiments on the EuRoC MAV and PennCOSYVIO datasets and record and evaluate a novel industry dataset.

MCML Authors

Felix Ott

Dr.

* Former Member

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[88]

L. Schneider, L. Schäpermeier, R. Prager, B. Bischl, H. Trautmann and P. Kerschke.
HPO X ELA: Investigating Hyperparameter Optimization Landscapes by Means of Exploratory Landscape Analysis.
Preprint (Aug. 2022). arXiv

Abstract

Hyperparameter optimization (HPO) is a key component of machine learning models for achieving peak predictive performance. While numerous methods and algorithms for HPO have been proposed over the last years, little progress has been made in illuminating and examining the actual structure of these black-box optimization problems. Exploratory landscape analysis (ELA) subsumes a set of techniques that can be used to gain knowledge about properties of unknown optimization problems. In this paper, we evaluate the performance of five different black-box optimizers on 30 HPO problems, which consist of two-, three- and five-dimensional continuous search spaces of the XGBoost learner trained on 10 different data sets. This is contrasted with the performance of the same optimizers evaluated on 360 problem instances from the black-box optimization benchmark (BBOB). We then compute ELA features on the HPO and BBOB problems and examine similarities and differences. A cluster analysis of the HPO and BBOB problems in ELA feature space allows us to identify how the HPO problems compare to the BBOB problems on a structural meta-level. We identify a subset of BBOB problems that are close to the HPO problems in ELA feature space and show that optimizer performance is comparably similar on these two sets of benchmark problems. We highlight open challenges of ELA for HPO and discuss potential directions of future research and applications.

MCML Authors

Lennart Schneider

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[87]

F. Pfisterer, L. Schneider, J. Moosbauer, M. Binder and B. Bischl.
YAHPO Gym - Design Criteria and a new Multifidelity Benchmark for Hyperparameter Optimization.
AutoML @ICML 2022 - 1st International Conference on Automated Machine Learning co-located with the 39th International Conference on Machine Learning (ICML 2022). Baltimore, MD, USA, Jul 25-27, 2022. URL GitHub

Abstract

When developing and analyzing new hyperparameter optimization (HPO) methods, it is vital to empirically evaluate and compare them on well-curated benchmark suites. In this work, we list desirable properties and requirements for such benchmarks and propose a new set of challenging and relevant multifidelity HPO benchmark problems motivated by these requirements. For this, we revisit the concept of surrogate-based benchmarks and empirically compare them to more widely-used tabular benchmarks, showing that the latter ones may induce bias in performance estimation and ranking of HPO methods. We present a new surrogate-based benchmark suite for multifidelity HPO methods consisting of 9 benchmark collections that constitute over 700 multifidelity HPO problems in total. All our benchmarks also allow for querying of multiple optimization targets, enabling the benchmarking of multi-objective HPO. We examine and compare our benchmark suite with respect to the defined requirements and show that our benchmarks provide viable additions to existing suites.

MCML Authors

Florian Pfisterer

Dr.

* Former Member

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Julia Moosbauer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[86]

L. Schneider, F. Pfisterer, P. Kent, J. Branke, B. Bischl and J. Thomas.
Tackling neural architecture search with quality diversity optimization.
AutoML @ICML 2022 - 1st International Conference on Automated Machine Learning co-located with the 39th International Conference on Machine Learning (ICML 2022). Baltimore, MD, USA, Jul 25-27, 2022. URL

Abstract

Neural architecture search (NAS) has been studied extensively and has grown to become a research field with substantial impact. While classical single-objective NAS searches for the architecture with the best performance, multi-objective NAS considers multiple objectives that should be optimized simultaneously, e.g., minimizing resource usage along the validation error. Although considerable progress has been made in the field of multi-objective NAS, we argue that there is some discrepancy between the actual optimization problem of practical interest and the optimization problem that multi-objective NAS tries to solve. We resolve this discrepancy by formulating the multi-objective NAS problem as a quality diversity optimization (QDO) problem and introduce three quality diversity NAS optimizers (two of them belonging to the group of multifidelity optimizers), which search for high-performing yet diverse architectures that are optimal for application-specific niches, e.g., hardware constraints. By comparing these optimizers to their multi-objective counterparts, we demonstrate that quality diversity NAS in general outperforms multi-objective NAS with respect to quality of solutions and efficiency. We further show how applications and future NAS research can thrive on QDO.

MCML Authors

Lennart Schneider

Statistical Learning and Data Science

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Janek Thomas

Dr.

* Former Member

[85]

A. Klaß, S. M. Lorenz, M. W. Lauer-Schmaltz, D. Rügamer, B. Bischl, C. Mutschler and F. Ott.
Uncertainty-aware Evaluation of Time-Series Classification for Online Handwriting Recognition with Domain Shift.
STRL @IJCAI-ECAI 2022 - Workshop on Spatio-Temporal Reasoning and Learning at the 31st International Joint Conference on Artificial Intelligence and the 25th European Conference on Artificial Intelligence (IJCAI-ECAI 2022). Vienna, Austria, Jul 23-29, 2022. URL

Abstract

For many applications, analyzing the uncertainty of a machine learning model is indispensable. While research of uncertainty quantification (UQ) techniques is very advanced for computer vision applications, UQ methods for spatio-temporal data are less studied. In this paper, we focus on models for online handwriting recognition, one particular type of spatio-temporal data. The data is observed from a sensor-enhanced pen with the goal to classify written characters. We conduct a broad evaluation of aleatoric (data) and epistemic (model) UQ based on two prominent techniques for Bayesian inference, Stochastic Weight Averaging-Gaussian (SWAG) and Deep Ensembles. Next to a better understanding of the model, UQ techniques can detect out-of-distribution data and domain shifts when combining right-handed and left-handed writers (an underrepresented group).

MCML Authors

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Felix Ott

Dr.

* Former Member

[84]

A. Khakzar, Y. Li, Y. Zhang, M. Sanisoglu, S. T. Kim, M. Rezaei, B. Bischl and N. Navab.
Analyzing the Effects of Handling Data Imbalance on Learned Features from Medical Images by Looking Into the Models.
IMLH @ICML 2022 - 2nd Workshop on Interpretable Machine Learning in Healthcare at the 39th International Conference on Machine Learning (ICML 2022). Baltimore, MD, USA, Jul 17-23, 2022. arXiv

Abstract

One challenging property lurking in medical datasets is the imbalanced data distribution, where the frequency of the samples between the different classes is not balanced. Training a model on an imbalanced dataset can introduce unique challenges to the learning problem where a model is biased towards the highly frequent class. Many methods are proposed to tackle the distributional differences and the imbalanced problem. However, the impact of these approaches on the learned features is not well studied. In this paper, we look deeper into the internal units of neural networks to observe how handling data imbalance affects the learned features. We study several popular cost-sensitive approaches for handling data imbalance and analyze the feature maps of the convolutional neural networks from multiple perspectives: analyzing the alignment of salient features with pathologies and analyzing the pathology-related concepts encoded by the networks. Our study reveals differences and insights regarding the trained models that are not reflected by quantitative metrics such as AUROC and AP and show up only by looking at the models through a lens.

MCML Authors

Ashkan Khakzar

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Yawei Li

Statistical Learning and Data Science

Mina Rezaei

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Nassir Navab

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Computer Aided Medical Procedures & Augmented Reality

[83]

S. Dandl, F. Pfisterer and B. Bischl.
Multi-Objective Counterfactual Fairness.
GECCO 2022 - Genetic and Evolutionary Computation Conference. Boston, MA, USA, Jul 09-13, 2022. DOI

Abstract

When machine learning is used to automate judgments, e.g. in areas like lending or crime prediction, incorrect decisions can lead to adverse effects for affected individuals. This occurs, e.g., if the data used to train these models is based on prior decisions that are unfairly skewed against specific subpopulations. If models should automate decision-making, they must account for these biases to prevent perpetuating or creating discriminatory practices. Counter-factual fairness audits models with respect to a notion of fairness that asks for equal outcomes between a decision made in the real world and a counterfactual world where the individual subject to a decision comes from a different protected demographic group. In this work, we propose a method to conduct such audits without access to the underlying causal structure of the data generating process by framing it as a multi-objective optimization task that can be efficiently solved using a genetic algorithm.

MCML Authors

Susanne Dandl

Dr.

* Former Member

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[82]

L. Schneider, F. Pfisterer, J. Thomas and B. Bischl.
A Collection of Quality Diversity Optimization Problems Derived from Hyperparameter Optimization of Machine Learning Models.
GECCO 2022 - Genetic and Evolutionary Computation Conference. Boston, MA, USA, Jul 09-13, 2022. DOI

Abstract

The goal of Quality Diversity Optimization is to generate a collection of diverse yet high-performing solutions to a given problem at hand. Typical benchmark problems are, for example, finding a repertoire of robot arm configurations or a collection of game playing strategies. In this paper, we propose a set of Quality Diversity Optimization problems that tackle hyperparameter optimization of machine learning models - a so far underexplored application of Quality Diversity Optimization. Our benchmark problems involve novel feature functions, such as interpretability or resource usage of models. To allow for fast and efficient benchmarking, we build upon YAHPO Gym, a recently proposed open source benchmarking suite for hyperparameter optimization that makes use of high performing surrogate models and returns these surrogate model predictions instead of evaluating the true expensive black box function. We present results of an initial experimental study comparing different Quality Diversity optimizers on our benchmark problems. Furthermore, we discuss future directions and challenges of Quality Diversity Optimization in the context of hyperparameter optimization.

MCML Authors

Lennart Schneider

Statistical Learning and Data Science

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Janek Thomas

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[81]

Q. Au, J. Herbinger, C. Stachl, B. Bischl and G. Casalicchio.
Grouped Feature Importance and Combined Features Effect Plot.
Data Mining and Knowledge Discovery 36 (Jun. 2022). DOI

Abstract

Interpretable machine learning has become a very active area of research due to the rising popularity of machine learning algorithms and their inherently challenging interpretability. Most work in this area has been focused on the interpretation of single features in a model. However, for researchers and practitioners, it is often equally important to quantify the importance or visualize the effect of feature groups. To address this research gap, we provide a comprehensive overview of how existing model-agnostic techniques can be defined for feature groups to assess the grouped feature importance, focusing on permutation-based, refitting, and Shapley-based methods. We also introduce an importance-based sequential procedure that identifies a stable and well-performing combination of features in the grouped feature space. Furthermore, we introduce the combined features effect plot, which is a technique to visualize the effect of a group of features based on a sparse, interpretable linear combination of features. We used simulation studies and real data examples to analyze, compare, and discuss these methods.

MCML Authors

Julia Herbinger

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

[80]

P. Kopper, S. Wiegrebe, B. Bischl, A. Bender and D. Rügamer.
DeepPAMM: Deep Piecewise Exponential Additive Mixed Models for Complex Hazard Structures in Survival Analysis.
PAKDD 2022 - 26th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Chengdu, China, May 16-19, 2022. DOI

Abstract

Survival analysis (SA) is an active field of research that is concerned with time-to-event outcomes and is prevalent in many domains, particularly biomedical applications. Despite its importance, SA remains challenging due to small-scale data sets and complex outcome distributions, concealed by truncation and censoring processes. The piecewise exponential additive mixed model (PAMM) is a model class addressing many of these challenges, yet PAMMs are not applicable in high-dimensional feature settings or in the case of unstructured or multimodal data. We unify existing approaches by proposing DeepPAMM, a versatile deep learning framework that is well-founded from a statistical point of view, yet with enough flexibility for modeling complex hazard structures. We illustrate that DeepPAMM is competitive with other machine learning approaches with respect to predictive performance while maintaining interpretability through benchmark experiments and an extended case study.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[79]

J. Herbinger, B. Bischl and G. Casalicchio.
REPID: Regional Effect Plots with implicit Interaction Detection.
AISTATS 2022 - 25th International Conference on Artificial Intelligence and Statistics. Virtual, Mar 28-30, 2022. URL

Abstract

Machine learning models can automatically learn complex relationships, such as non-linear and interaction effects. Interpretable machine learning methods such as partial dependence plots visualize marginal feature effects but may lead to misleading interpretations when feature interactions are present. Hence, employing additional methods that can detect and measure the strength of interactions is paramount to better understand the inner workings of machine learning models. We demonstrate several drawbacks of existing global interaction detection approaches, characterize them theoretically, and evaluate them empirically. Furthermore, we introduce regional effect plots with implicit interaction detection, a novel framework to detect interactions between a feature of interest and other features. The framework also quantifies the strength of interactions and provides interpretable and distinct regions in which feature effects can be interpreted more reliably, as they are less confounded by interactions. We prove the theoretical eligibility of our method and show its applicability on various simulation and real-world examples.

MCML Authors

Julia Herbinger

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

[78]

F. Pargent, F. Pfisterer, J. Thomas and B. Bischl.
Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features.
Computational Statistics 37 (Mar. 2022). DOI

Abstract

Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect in data analysis. A common problem are high cardinality features, i.e. unordered categorical predictor variables with a high number of levels. We study techniques that yield numeric representations of categorical variables which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm’s predictive performance, and—if possible—derive best practices on when to use which technique. We conducted a large-scale benchmark experiment, where we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary- and multiclass–classification settings. In our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditionally widely used encodings that make unreasonable assumptions to map levels to integers (e.g. integer encoding) or to reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were not as effective in comparison.

MCML Authors

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Janek Thomas

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Göran Kauermann

Statistical Learning and Data Science

[77]

C. Fritz, E. Dorigatti and D. Rügamer.
Combining Graph Neural Networks and Spatio-temporal Disease Models to Predict COVID-19 Cases in Germany.
Scientific Reports 12.3930 (Mar. 2022). DOI

Abstract

During 2020, the infection rate of COVID-19 has been investigated by many scholars from different research fields. In this context, reliable and interpretable forecasts of disease incidents are a vital tool for policymakers to manage healthcare resources. In this context, several experts have called for the necessity to account for human mobility to explain the spread of COVID-19. Existing approaches often apply standard models of the respective research field, frequently restricting modeling possibilities. For instance, most statistical or epidemiological models cannot directly incorporate unstructured data sources, including relational data that may encode human mobility. In contrast, machine learning approaches may yield better predictions by exploiting these data structures yet lack intuitive interpretability as they are often categorized as black-box models. We propose a combination of both research directions and present a multimodal learning framework that amalgamates statistical regression and machine learning models for predicting local COVID-19 cases in Germany. Results and implications: the novel approach introduced enables the use of a richer collection of data types, including mobility flows and colocation probabilities, and yields the lowest mean squared error scores throughout the observational period in the reported benchmark study. The results corroborate that during most of the observational period more dispersed meeting patterns and a lower percentage of people staying put are associated with higher infection rates. Moreover, the analysis underpins the necessity of including mobility data and showcases the flexibility and interpretability of the proposed approach.

MCML Authors

Cornelius Fritz

Dr.

* Former Member

Emilio Dorigatti

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Statistics, Data Science and Machine Learning

[76]

C. Nießl, M. Herrmann, C. Wiedemann, G. Casalicchio and A.-L. Boulesteix.
Over-optimism in benchmark studies and the multiplicity of design and analysis options when interpreting their results.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 12.2 (Mar. 2022). DOI

Abstract

In recent years, the need for neutral benchmark studies that focus on the comparison of methods coming from computational sciences has been increasingly recognized by the scientific community. While general advice on the design and analysis of neutral benchmark studies can be found in recent literature, a certain flexibility always exists. This includes the choice of data sets and performance measures, the handling of missing performance values, and the way the performance values are aggregated over the data sets. As a consequence of this flexibility, researchers may be concerned about how their choices affect the results or, in the worst case, may be tempted to engage in questionable research practices (e.g., the selective reporting of results or the post hoc modification of design or analysis components) to fit their expectations. To raise awareness for this issue, we use an example benchmark study to illustrate how variable benchmark results can be when all possible combinations of a range of design and analysis options are considered. We then demonstrate how the impact of each choice on the results can be assessed using multidimensional unfolding. In conclusion, based on previous literature and on our illustrative example, we claim that the multiplicity of design and analysis options combined with questionable research practices lead to biased interpretations of benchmark results and to over-optimistic conclusions. This issue should be considered by computational researchers when designing and analyzing their benchmark studies and by the scientific community in general in an effort towards more reliable benchmark results.

MCML Authors

Christina Sauer (née Nießl)

Biometry in Molecular Medicine

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Biometry in Molecular Medicine

[75]

F. Ott, D. Rügamer, L. Heublein, B. Bischl and C. Mutschler.
Cross-Modal Common Representation Learning with Triplet Loss Functions.
Preprint (Mar. 2022). DOI

Abstract

Common representation learning (CRL) learns a shared embedding between two or more modalities to improve in a given task over using only one of the modalities. CRL from different data types such as images and time-series data (e.g., audio or text data) requires a deep metric learning loss that minimizes the distance between the modality embeddings. In this paper, we propose to use the triplet loss, which uses positive and negative identities to create sample pairs with different labels, for CRL between image and time-series modalities. By adapting the triplet loss for CRL, higher accuracy in the main (time-series classification) task can be achieved by exploiting additional information of the auxiliary (image classification) task. Our experiments on synthetic data and handwriting recognition data from sensor-enhanced pens show an improved classification accuracy, faster convergence, and a better generalizability.

MCML Authors

Felix Ott

Dr.

* Former Member

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[74]

M. Rezaei, J. J. Näppi, B. Bischl and H. Yoshida.
Bayesian uncertainty estimation for detection of long-tail and unseen conditions in abdominal images.
SPIE 2022 - SPIE Medical Imaging: Computer-Aided Diagnosis. San Diego, CA, USA, Feb 20-Mar 28, 2022. DOI

Abstract

Deep supervised learning provides an effective approach for developing robust models for various computer-aided diagnosis tasks. However, the underlying assumption is that the frequency of the samples between the different classes of the training dataset is similar or balanced. In real-world medical data, the positive classes often occur too infrequently to satisfy this assumption. Thus, there is an unmet need for deep learning systems that could automatically identify and adapt to the real-world conditions of imbalanced data. In this paper, we propose a novel Bayesian deep ensemble learning framework to address the problem of the representation learning of longtailed and out-of-distribution samples in medical images. By estimating the relative uncertainties of the input data, our framework is able to adapt to the imbalanced data for learning generalizable classifiers. To evaluate the framework, we trained and tested our framework on two public medical imaging datasets that consist of different imbalance ratios and imaging modalities. Our results on the semantic segmentation of high-resolution CT and MRI images achieved 0.93% recall, which represents a 3% relative improvement over previous state-of-the-art ensemble GANs in the handling of the associated long-tailed data and detection of out-of-distribution samples.

MCML Authors

Mina Rezaei

Dr.

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[73]

M. Rezaei, J. J. Näppi, B. Bischl and H. Yoshida.
Deep mutual GANs: representation learning from multiple experts.
SPIE 2022 - SPIE Medical Imaging: Imaging Informatics for Healthcare, Research, and Applications. San Diego, CA, USA, Feb 20-Mar 28, 2022. DOI

Abstract

Representation learning is one of the canonical objectives of most deep learning models. However, the learning of real-world clinical data is often compromised by their inherently imbalanced or long-tailed distribution wherein a few classes have significantly larger numbers of training instances than do the other classes. In this study, we investigated the representation learning of such long-tailed data distributions by the use of a deep mutual ensemble generative adversarial network. Our proposed framework consists of multiple powerful pre-trained discriminator networks that transfer knowledge to multiple individual untrained generator networks. During the training process, each generator learns to collaborate with the other generators. Additionally, each generator receives feedback from the individual discriminators in an adversarial manner. Especially, we explored the use of mutual information shared between the independent generators that makes our framework robust against misclassification of long-tailed data distributions in medical image analysis. We evaluated our proposed framework on four public datasets that represented different medical imaging modalities and imbalance ratios. Our experimental results show that our proposed framework benefits from ensemble learning and shared mutual learning, and achieves compelling results on several medical imaging benchmarks. Thus, our approach offers potential advantages over traditional deep learning in real-world applications.

MCML Authors

Mina Rezaei

Dr.

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[72]

F. Ott, D. Rügamer, L. Heublein, B. Bischl and C. Mutschler.
Joint Classification and Trajectory Regression of Online Handwriting Using a Multi-Task Learning Approach.
WACV 2022 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2022. DOI

Abstract

Multivariate Time Series (MTS) classification is important in various applications such as signature verification, person identification, and motion recognition. In deep learning these classification tasks are usually learned using the cross-entropy loss. A related yet different task is predicting trajectories observed as MTS. Important use cases include handwriting reconstruction, shape analysis, and human pose estimation. The goal is to align an arbitrary dimensional time series with its ground truth as accurately as possible while reducing the error in the prediction with a distance loss and the variance with a similarity loss. Although learning both losses with Multi-Task Learning (MTL) helps to improve trajectory alignment, learning often remains difficult as both tasks are contradictory. We propose a novel neural network architecture for MTL that notably improves the MTS classification and trajectory regression performance in online handwriting (OnHW) recognition. We achieve this by jointly learning the cross-entropy loss in combination with distance and similarity losses. On an OnHW task of handwritten characters with multivariate inertial and visual data inputs we are able to achieve crucial improvements (lower error with less variance) of trajectory prediction while still improving the character classification accuracy in comparison to models trained on the individual tasks.

MCML Authors

Felix Ott

Dr.

* Former Member

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[71]

W. Hartl, P. Kopper, A. Bender, F. Scheipl, A. G. Day, G. Elke and H. Küchenhoff.
Protein intake and outcome of critically ill patients: analysis of a large international database using piece-wise exponential additive mixed models.
Critical Care 26.7 (Jan. 2022). DOI

Abstract

Background: Proteins are an essential part of medical nutrition therapy in critically ill patients. Guidelines almost universally recommend a high protein intake without robust evidence supporting its use.
Methods: Using a large international database, we modelled associations between the hazard rate of in-hospital death and live hospital discharge (competing risks) and three categories of protein intake (low: < 0.8 g/kg per day, standard: 0.8–1.2 g/kg per day, high: > 1.2 g/kg per day) during the first 11 days after ICU admission (acute phase). Time-varying cause-specific hazard ratios (HR) were calculated from piece-wise exponential additive mixed models. We used the estimated model to compare five different hypothetical protein diets (an exclusively low protein diet, a standard protein diet administered early (day 1 to 4) or late (day 5 to 11) after ICU admission, and an early or late high protein diet).
Results: Of 21,100 critically ill patients in the database, 16,489 fulfilled inclusion criteria for the analysis. By day 60, 11,360 (68.9%) patients had been discharged from hospital, 4,192 patients (25.4%) had died in hospital, and 937 patients (5.7%) were still hospitalized. Median daily low protein intake was 0.49 g/kg [IQR 0.27–0.66], standard intake 0.99 g/kg [IQR 0.89– 1.09], and high intake 1.41 g/kg [IQR 1.29–1.60]. In comparison with an exclusively low protein diet, a late standard protein diet was associated with a lower hazard of in-hospital death: minimum 0.75 (95% CI 0.64, 0.87), and a higher hazard of live hospital discharge: maximum HR 1.98 (95% CI 1.72, 2.28). Results on hospital discharge, however, were qualitatively changed by a sensitivity analysis. There was no evidence that an early standard or a high protein intake during the acute phase was associated with a further improvement of outcome.
Conclusions: Provision of a standard protein intake during the late acute phase may improve outcome compared to an exclusively low protein diet. In unselected critically ill patients, clinical outcome may not be improved by a high protein intake during the acute phase.

MCML Authors

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

Fabian Scheipl

PD Dr.

Functional Data Analysis

Helmut Küchenhoff

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Consulting Unit (StaBLab)

[70]

F. Ott, D. Rügamer, L. Heublein, T. Hamann, J. Barth, B. Bischl and C. Mutschler.
Benchmarking online sequence-to-sequence and character-based handwriting recognition from IMU-enhanced pens.
International Journal on Document Analysis and Recognition 25.4 (2022). DOI

Abstract

Handwriting is one of the most frequently occurring patterns in everyday life and with it comes challenging applications such as handwriting recognition, writer identification and signature verification. In contrast to offline HWR that only uses spatial information (i.e., images), online HWR uses richer spatio-temporal information (i.e., trajectory data or inertial data). While there exist many offline HWR datasets, there are only little data available for the development of OnHWR methods on paper as it requires hardware-integrated pens. This paper presents data and benchmark models for real-time sequence-to-sequence learning and single character-based recognition. Our data are recorded by a sensor-enhanced ballpoint pen, yielding sensor data streams from triaxial accelerometers, a gyroscope, a magnetometer and a force sensor at 100 Hz. We propose a variety of datasets including equations and words for both the writer-dependent and writer-independent tasks. Our datasets allow a comparison between classical OnHWR on tablets and on paper with sensor-enhanced pens. We provide an evaluation benchmark for seq2seq and single character-based HWR using recurrent and temporal convolutional networks and transformers combined with a connectionist temporal classification (CTC) loss and cross-entropy (CE) losses. Our convolutional network combined with BiLSTMs outperforms transformer-based architectures, is on par with InceptionTime for sequence-based classification tasks and yields better results compared to 28 state-of-the-art techniques. Time-series augmentation methods improve the sequence-based task, and we show that CE variants can improve the single classification task. Our implementations together with the large benchmark of state-of-the-art techniques of novel OnHWR datasets serve as a baseline for future research in the area of OnHWR on paper.

MCML Authors

Felix Ott

Dr.

* Former Member

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[69]

A. Python, A. Bender, M. Blangiardo, J. B. Illian, Y. Lin, B. Liu, T. C. D. Lucas, S. Tan, Y. Wen, D. Svanidze and J. Yin.
A downscaling approach to compare COVID-19 count data from databases aggregated at different spatial scales.
Journal of the Royal Statistical Society. Series A (Statistics in Society) 185.1 (Jan. 2022). DOI

Abstract

As the COVID-19 pandemic continues to threaten various regions around the world, obtaining accurate and reliable COVID-19 data is crucial for governments and local communities aiming at rigorously assessing the extent and magnitude of the virus spread and deploying efficient interventions. Using data reported between January and February 2020 in China, we compared counts of COVID-19 from near-real-time spatially disaggregated data (city level) with fine-spatial scale predictions from a Bayesian downscaling regression model applied to a reference province-level data set. The results highlight discrepancies in the counts of coronavirus-infected cases at the district level and identify districts that may require further investigation.

MCML Authors

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

[68]

E. Dorigatti, J. Goschenhofer, B. Schubert, M. Rezaei and B. Bischl.
Positive-Unlabeled Learning with Uncertainty-aware Pseudo-label Selection.
Preprint (Jan. 2022). arXiv

Abstract

Positive-unlabeled learning (PUL) aims at learning a binary classifier from only positive and unlabeled training data. Even though real-world applications often involve imbalanced datasets where the majority of examples belong to one class, most contemporary approaches to PUL do not investigate performance in this setting, thus severely limiting their applicability in practice. In this work, we thus propose to tackle the issues of imbalanced datasets and model calibration in a PUL setting through an uncertainty-aware pseudo-labeling procedure (PUUPL): by boosting the signal from the minority class, pseudo-labeling expands the labeled dataset with new samples from the unlabeled set, while explicit uncertainty quantification prevents the emergence of harmful confirmation bias leading to increased predictive performance. Within a series of experiments, PUUPL yields substantial performance gains in highly imbalanced settings while also showing strong performance in balanced PU scenarios across recent baselines. We furthermore provide ablations and sensitivity analyses to shed light on PUUPL’s several ingredients. Finally, a real-world application with an imbalanced dataset confirms the advantage of our approach.

MCML Authors

Emilio Dorigatti

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Jann Goschenhofer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Mina Rezaei

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

2021

[67]

J. Moosbauer, J. Herbinger, G. Casalicchio, M. Lindauer and B. Bischl.
Explaining Hyperparameter Optimization via Partial Dependence Plots.
NeurIPS 2021 - 35th Conference on Neural Information Processing Systems. Virtual, Dec 06-14, 2021. URL GitHub

Abstract

Automated hyperparameter optimization (HPO) can support practitioners to obtain peak performance in machine learning models. However, there is often a lack of valuable insights into the effects of different hyperparameters on the final model performance. This lack of explainability makes it difficult to trust and understand the automated HPO process and its results. We suggest using interpretable machine learning (IML) to gain insights from the experimental data obtained during HPO with Bayesian optimization (BO). BO tends to focus on promising regions with potential high-performance configurations and thus induces a sampling bias. Hence, many IML techniques, such as the partial dependence plot (PDP), carry the risk of generating biased interpretations. By leveraging the posterior uncertainty of the BO surrogate model, we introduce a variant of the PDP with estimated confidence bands. We propose to partition the hyperparameter space to obtain more confident and reliable PDPs in relevant sub-regions. In an experimental study, we provide quantitative evidence for the increased quality of the PDPs within sub-regions.

MCML Authors

Julia Moosbauer

Dr.

* Former Member

Julia Herbinger

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

[66]

B. Bischl, G. Casalicchio, M. Feurer, P. Gijsbers, F. Hutter, M. Lang, R. G. Mantovani, J. N. van Rijn and J. Vanschoren.
OpenML Benchmarking Suites.
NeurIPS 2021 - Track on Datasets and Benchmarks at the 35th Conference on Neural Information Processing Systems. Virtual, Dec 06-14, 2021. URL

Abstract

Machine learning research depends on objectively interpretable, comparable, and reproducible algorithm benchmarks. We advocate the use of curated, comprehensive suites of machine learning tasks to standardize the setup, execution, and reporting of benchmarks. We enable this through software tools that help to create and leverage these benchmarking suites. These are seamlessly integrated into the OpenML platform, and accessible through interfaces in Python, Java, and R. OpenML benchmarking suites (a) are easy to use through standardized data formats, APIs, and client libraries; (b) come with extensive meta-information on the included datasets; and (c) allow benchmarks to be shared and reused in future studies. We then present a first, carefully curated and practical benchmarking suite for classification: the OpenML Curated Classification benchmarking suite 2018 (OpenML-CC18). Finally, we discuss use cases and applications which demonstrate the usefulness of OpenML benchmarking suites and the OpenML-CC18 in particular.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Michel Lang

Dr.

* Former Member

[65]

Y. Zhang, A. Khakzar, Y. Li, A. Farshad, S. T. Kim and N. Navab.
Fine-Grained Neural Network Explanation by Identifying Input Features with Predictive Information.
NeurIPS 2021 - Track on Datasets and Benchmarks at the 35th Conference on Neural Information Processing Systems. Virtual, Dec 06-14, 2021. URL

Abstract

One principal approach for illuminating a black-box neural network is feature attribution, i.e. identifying the importance of input features for the network’s prediction. The predictive information of features is recently proposed as a proxy for the measure of their importance. So far, the predictive information is only identified for latent features by placing an information bottleneck within the network. We propose a method to identify features with predictive information in the input domain. The method results in fine-grained identification of input features’ information and is agnostic to network architecture. The core idea of our method is leveraging a bottleneck on the input that only lets input features associated with predictive latent features pass through. We compare our method with several feature attribution methods using mainstream feature attribution evaluation experiments. The code is publicly available.

MCML Authors

Ashkan Khakzar

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Yawei Li

Statistical Learning and Data Science

Azade Farshad

Dr.

C1 | Medicine
→ Group Nassir Navab

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Computer Aided Medical Procedures & Augmented Reality

[64]

T. Weber, M. Ingrisch, M. Fabritius, B. Bischl and D. Rügamer.
Survival-oriented embeddings for improving accessibility to complex data structures.
NeurIPS 2021 - Workshop on Bridging the Gap: from Machine Learning Research to Clinical Practice at the 35th Conference on Neural Information Processing Systems. Virtual, Dec 06-14, 2021. arXiv

Abstract

Deep learning excels in the analysis of unstructured data and recent advancements allow to extend these techniques to survival analysis. In the context of clinical radiology, this enables, e.g., to relate unstructured volumetric images to a risk score or a prognosis of life expectancy and support clinical decision making. Medical applications are, however, associated with high criticality and consequently, neither medical personnel nor patients do usually accept black box models as reason or basis for decisions. Apart from averseness to new technologies, this is due to missing interpretability, transparency and accountability of many machine learning methods. We propose a hazard-regularized variational autoencoder that supports straightforward interpretation of deep neural architectures in the context of survival analysis, a field highly relevant in healthcare. We apply the proposed approach to abdominal CT scans of patients with liver tumors and their corresponding survival times.

MCML Authors

Tobias Weber

* Former Member

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[63]

T. Weber, M. Ingrisch, B. Bischl and D. Rügamer.
Towards modelling hazard factors in unstructured data spaces using gradient-based latent interpolation.
NeurIPS 2021 - Workshop on Deep Generative Models and Downstream Applications at the 35th Conference on Neural Information Processing Systems. Virtual, Dec 06-14, 2021. PDF

Abstract

The application of deep learning in survival analysis (SA) allows utilizing unstructured and high-dimensional data types uncommon in traditional survival methods. This allows to advance methods in fields such as digital health, predictive maintenance, and churn analysis, but often yields less interpretable and intuitively understandable models due to the black-box character of deep learning-based approaches. We close this gap by proposing 1) a multi-task variational autoencoder (VAE) with survival objective, yielding survival-oriented embeddings, and 2) a novel method HazardWalk that allows to model hazard factors in the original data space. HazardWalk transforms the latent distribution of our autoencoder into areas of maximized/minimized hazard and then uses the decoder to project changes to the original domain. Our procedure is evaluated on a simulated dataset as well as on a dataset of CT imaging data of patients with liver metastases.

MCML Authors

Tobias Weber

* Former Member

Michael Ingrisch

Prof. Dr.

Clinical Data Science in Radiology

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[62]

S. Hilbert, S. Coors, E. Kraus, B. Bischl, A. Lindl, M. Frei, J. Wild, S. Krauss, D. Goretzko and C. Stachl.
Machine learning for the educational sciences.
Review of Education 9.3 (Nov. 2021). DOI

Abstract

Machine learning (ML) provides a powerful framework for the analysis of high-dimensional datasets by modelling complex relationships, often encountered in modern data with many variables, cases and potentially non-linear effects. The impact of ML methods on research and practical applications in the educational sciences is still limited, but continuously grows, as larger and more complex datasets become available through massive open online courses (MOOCs) and large-scale investigations. The educational sciences are at a crucial pivot point, because of the anticipated impact ML methods hold for the field. To provide educational researchers with an elaborate introduction to the topic, we provide an instructional summary of the opportunities and challenges of ML for the educational sciences, show how a look at related disciplines can help learning from their experiences, and argue for a philosophical shift in model evaluation. We demonstrate how the overall quality of data analysis in educational research can benefit from these methods and show how ML can play a decisive role in the validation of empirical models. Specifically, we (1) provide an overview of the types of data suitable for ML and (2) give practical advice for the application of ML methods. In each section, we provide analytical examples and reproducible R code. Also, we provide an extensive Appendix on ML-based applications for education. This instructional summary will help educational scientists and practitioners to prepare for the promises and threats that come with the shift towards digitisation and large-scale assessment in education.

MCML Authors

Stefan Coors

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

[61]

A. Khakzar, Y. Zhang, W. Mansour, Y. Cai, Y. Li, Y. Zhang, S. T. Kim and N. Navab.
Explaining COVID-19 and Thoracic Pathology Model Predictions by Identifying Informative Input Features.
MICCAI 2021 - 24th International Conference on Medical Image Computing and Computer Assisted Intervention. Strasbourg, France, Sep 27-Oct 01, 2021. DOI GitHub

Abstract

Neural networks have demonstrated remarkable performance in classification and regression tasks on chest X-rays. In order to establish trust in the clinical routine, the networks’ prediction mechanism needs to be interpretable. One principal approach to interpretation is feature attribution. Feature attribution methods identify the importance of input features for the output prediction. Building on Information Bottleneck Attribution (IBA) method, for each prediction we identify the chest X-ray regions that have high mutual information with the network’s output. Original IBA identifies input regions that have sufficient predictive information. We propose Inverse IBA to identify all informative regions. Thus all predictive cues for pathologies are highlighted on the X-rays, a desirable property for chest X-ray diagnosis. Moreover, we propose Regression IBA for explaining regression models. Using Regression IBA we observe that a model trained on cumulative severity score labels implicitly learns the severity of different X-ray regions. Finally, we propose Multi-layer IBA to generate higher resolution and more detailed attribution/saliency maps. We evaluate our methods using both human-centric (ground-truth-based) interpretability metrics, and human-agnostic feature importance metrics on NIH Chest X-ray8 and BrixIA datasets.

MCML Authors

Ashkan Khakzar

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Yawei Li

Statistical Learning and Data Science

Nassir Navab

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Computer Aided Medical Procedures & Augmented Reality

[60]

S. Coors, D. Schalk, B. Bischl and D. Rügamer.
Automatic Componentwise Boosting: An Interpretable AutoML System.
ADS @ECML-PKDD 2021 - Automating Data Science Workshop at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2021). Virtual, Sep 13-17, 2021. arXiv

Abstract

In practice, machine learning (ML) workflows require various different steps, from data preprocessing, missing value imputation, model selection, to model tuning as well as model evaluation. Many of these steps rely on human ML experts. AutoML - the field of automating these ML pipelines - tries to help practitioners to apply ML off-the-shelf without any expert knowledge. Most modern AutoML systems like auto-sklearn, H20-AutoML or TPOT aim for high predictive performance, thereby generating ensembles that consist almost exclusively of black-box models. This, in turn, makes the interpretation for the layperson more intricate and adds another layer of opacity for users. We propose an AutoML system that constructs an interpretable additive model that can be fitted using a highly scalable componentwise boosting algorithm. Our system provides tools for easy model interpretation such as visualizing partial effects and pairwise interactions, allows for a straightforward calculation of feature importance, and gives insights into the required model complexity to fit the given task. We introduce the general framework and outline its implementation autocompboost. To demonstrate the frameworks efficacy, we compare autocompboost to other existing systems based on the OpenML AutoML-Benchmark. Despite its restriction to an interpretable model space, our system is competitive in terms of predictive performance on most data sets while being more user-friendly and transparent.

MCML Authors

Stefan Coors

* Former Member

Daniel Schalk

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[59]

R. Sonabend, F. J. Király, A. Bender, B. Bischl and M. Lang.
mlr3proba: An R Package for Machine Learning in Survival Analysis.
Bioinformatics 37.17 (Sep. 2021). DOI

Abstract

In tasks like node classification, image segmentation, and named-entity recognition we have a classifier that simultaneously outputs multiple predictions (a vector of labels) based on a single input, i.e. a single graph, image, or document respectively. Existing adversarial robustness certificates consider each prediction independently and are thus overly pessimistic for such tasks. They implicitly assume that an adversary can use different perturbed inputs to attack different predictions, ignoring the fact that we have a single shared input. We propose the first collective robustness certificate which computes the number of predictions that are simultaneously guaranteed to remain stable under perturbation, i.e. cannot be attacked. We focus on Graph Neural Networks and leverage their locality property - perturbations only affect the predictions in a close neighborhood - to fuse multiple single-node certificates into a drastically stronger collective certificate. For example, on the Citeseer dataset our collective certificate for node classification increases the average number of certifiable feature perturbations from 7 to 351.

MCML Authors

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Michel Lang

Dr.

* Former Member

[58]

F. Soleymani, M. Eslami, T. Elze, B. Bischl and M. Rezaei.
Deep Variational Clustering Framework for Self-labeling of Large-scale Medical Images.
Preprint (Sep. 2021). arXiv GitHub

Abstract

We propose a Deep Variational Clustering (DVC) framework for unsupervised representation learning and clustering of large-scale medical images. DVC simultaneously learns the multivariate Gaussian posterior through the probabilistic convolutional encoder and the likelihood distribution with the probabilistic convolutional decoder; and optimizes cluster labels assignment. Here, the learned multivariate Gaussian posterior captures the latent distribution of a large set of unlabeled images. Then, we perform unsupervised clustering on top of the variational latent space using a clustering loss. In this approach, the probabilistic decoder helps to prevent the distortion of data points in the latent space and to preserve the local structure of data generating distribution. The training process can be considered as a self-training process to refine the latent space and simultaneously optimizing cluster assignments iteratively. We evaluated our proposed framework on three public datasets that represented different medical imaging modalities. Our experimental results show that our proposed framework generalizes better across different datasets. It achieves compelling results on several medical imaging benchmarks. Thus, our approach offers potential advantages over conventional deep unsupervised learning in real-world applications.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Mina Rezaei

Dr.

Statistical Learning and Data Science

[57]

M. P. Fabritius, M. Seidensticker, J. Rueckel, C. Heinze, M. Pech, K. J. Paprottka, P. M. Paprottka, J. Topalis, A. Bender, J. Ricke, A. Mittermeier and M. Ingrisch.
Bi-Centric Independent Validation of Outcome Prediction after Radioembolization of Primary and Secondary Liver Cancer.
Journal of Clinical Medicine 10.16 (Aug. 2021). DOI

Abstract

Background: Yttrium-90 radioembolization (RE) plays an important role in the treatment of liver malignancies. Optimal patient selection is crucial for an effective and safe treatment. In this study, we aim to validate the prognostic performance of a previously established random survival forest (RSF) with an external validation cohort from a different national center. Furthermore, we compare outcome prediction models with different established metrics. Methods: A previously established RSF model, trained on a consecutive cohort of 366 patients who had received RE due to primary or secondary liver tumor at a national center (center 1), was used to predict the outcome of an independent consecutive cohort of 202 patients from a different national center (center 2) and vice versa. Prognostic performance was evaluated using the concordance index (C-index) and the integrated Brier score (IBS). The prognostic importance of designated baseline parameters was measured with the minimal depth concept, and the influence on the predicted outcome was analyzed with accumulated local effects plots. RSF values were compared to conventional cox proportional hazards models in terms of C-index and IBS. Results: The established RSF model achieved a C-index of 0.67 for center 2, comparable to the results obtained for center 1, which it was trained on (0.66). The RSF model trained on center 2 achieved a C-index of 0.68 on center 2 data and 0.66 on center 1 data. CPH models showed comparable results on both cohorts, with C-index ranging from 0.68 to 0.72. IBS validation showed more differentiated results depending on which cohort was trained on and which cohort was predicted (range: 0.08 to 0.20). Baseline cholinesterase was the most important variable for survival prediction. Conclusion: The previously developed predictive RSF model was successfully validated with an independent external cohort. C-index and IBS are suitable metrics to compare outcome prediction models, with IBS showing more differentiated results. The findings corroborate that survival after RE is critically determined by functional hepatic reserve and thus baseline liver function should play a key role in patient selection.

MCML Authors

Johanna Topalis

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Clinical Data Science in Radiology

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

Andreas Mittermeier

Dr.

Clinical Data Science in Radiology

Michael Ingrisch

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Clinical Data Science in Radiology

[56]

F. Pfisterer, C. Kern, S. Dandl, M. Sun, M. P. Kim and B. Bischl.
mcboost: Multi-Calibration Boosting for R.
The Journal of Open Source Software 6.64 (Aug. 2021). DOI

Abstract

Given the increasing usage of automated prediction systems in the context of high-stakes de- cisions, a growing body of research focuses on methods for detecting and mitigating biases in algorithmic decision-making. One important framework to audit for and mitigate biases in predictions is that of Multi-Calibration, introduced by Hebert-Johnson et al. (2018). The underlying fairness notion, Multi-Calibration, promotes the idea of multi-group fairness and requires calibrated predictions not only for marginal populations, but also for subpopulations that may be defined by complex intersections of many attributes. A simpler variant of Multi- Calibration, referred to as Multi-Accuracy, requires unbiased predictions for large collections of subpopulations. Hebert-Johnson et al. (2018) proposed a boosting-style algorithm for learning multi-calibrated predictors. Kim et al. (2019) demonstrated how to turn this al- gorithm into a post-processing strategy to achieve multi-accuracy, demonstrating empirical effectiveness across various domains. This package provides a stable implementation of the multi-calibration algorithm, called MCBoost. In contrast to other Fair ML approaches, MC- Boost does not harm the overall utility of a prediction model, but rather aims at improving calibration and accuracy for large sets of subpopulations post-training. MCBoost comes with strong theoretical guarantees, which have been explored formally in Hebert-Johnson et al. (2018), Kim et al. (2019), Dwork et al. (2019), Dwork et al. (2020) and Kim et al. (2021).

MCML Authors

Florian Pfisterer

Dr.

* Former Member

Susanne Dandl

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[55]

J. Moosbauer, J. Herbinger, G. Casalicchio, M. Lindauer and B. Bischl.
Towards Explaining Hyperparameter Optimization via Partial Dependence Plots.
AutoML @ICML 2021 - 8th Workshop on Automated Machine Learning at the 38th International Conference on Machine Learning (ICML 2021). Virtual, Jul 18-24, 2021. URL

Abstract

Automated hyperparameter optimization (HPO) can support practitioners to obtain peak performance in machine learning models. However, there is often a lack of valuable insights into the effects of different hyperparameters on the final model performance. This lack of comprehensibility and transparency makes it difficult to trust and understand the automated HPO process and its results. We suggest using interpretable machine learning (IML) to gain insights from the experimental data obtained during HPO and especially discuss the popular case of Bayesian optimization (BO). BO tends to focus on promising regions with potential high-performance configurations and thus induces a sampling bias. Hence, many IML techniques, like Partial Dependence Plots (PDP), carry the risk of generating biased interpretations. By leveraging the posterior uncertainty of the BO surrogate model, we introduce a variant of the PDP with estimated confidence bands. In addition, we propose to partition the hyperparameter space to obtain more confident and reliable PDPs in relevant sub-regions. In an experimental study, we provide quantitative evidence for the increased quality of the PDPs within sub-regions.

MCML Authors

Julia Moosbauer

Dr.

* Former Member

Julia Herbinger

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[54]

G. König, T. Freiesleben and M. Grosse-Wentrup.
A causal perspective on meaningful and robust algorithmic recourse.
ICML 2021 - Workshop on Algorithmic Recourse at the 38th International Conference on Machine Learning. Virtual, Jul 18-24, 2021. URL

Abstract

Algorithmic recourse explanations inform stakeholders on how to act to revert unfavorable predictions. However, in general ML models do not predict well in interventional distributions. Thus, an action that changes the prediction in the desired way may not lead to an improvement of the underlying target. Such recourse is neither meaningful nor robust to model refits. Extending the work of Karimi et al. (2021), we propose meaningful algorithmic recourse (MAR) that only recommends actions that improve both prediction and target. We justify this selection constraint by highlighting the differences between model audit and meaningful, actionable recourse explanations. Additionally, we introduce a relaxation of MAR called effective algorithmic recourse (EAR), which, under certain assumptions, yields meaningful recourse by only allowing interventions on causes of the target.

MCML Authors

Gunnar König

Dr.

* Former Member

Moritz Grosse-Wentrup

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Principal Investigator

[53]

P. Gijsbers, F. Pfisterer, J. N. van Rijn, B. Bischl and J. Vanschoren.
Meta-Learning for Symbolic Hyperparameter Defaults.
GECCO 2021 - Genetic and Evolutionary Computation Conference. Lile, France, Jul 10-14, 2021. DOI

Abstract

Hyperparameter optimization in machine learning (ML) deals with the problem of empirically learning an optimal algorithm configuration from data, usually formulated as a black-box optimization problem. In this work, we propose a zero-shot method to meta-learn symbolic default hyperparameter configurations that are expressed in terms of the properties of the dataset. This enables a much faster, but still data-dependent, configuration of the ML algorithm, compared to standard hyperparameter optimization approaches. In the past, symbolic and static default values have usually been obtained as hand-crafted heuristics. We propose an approach of learning such symbolic configurations as formulas of dataset properties from a large set of prior evaluations on multiple datasets by optimizing over a grammar of expressions using an evolutionary algorithm. We evaluate our method on surrogate empirical performance models as well as on real data across 6 ML algorithms on more than 100 datasets and demonstrate that our method indeed finds viable symbolic defaults.

MCML Authors

Florian Pfisterer

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[52]

F. Pfisterer, J. N. van Rijn, P. Probst, A. C. Müller and B. Bischl.
Learning Multiple Defaults for Machine Learning Algorithms.
GECCO 2021 - Genetic and Evolutionary Computation Conference. Lile, France, Jul 10-14, 2021. DOI

Abstract

Modern machine learning methods highly depend on their hyper-parameter configurations for optimal performance. A widely used approach to selecting a configuration is using default settings, often proposed along with the publication of a new algorithm. Those default values are usually chosen in an ad-hoc manner to work on a wide variety of datasets. Different automatic hyperparameter configuration algorithms which select an optimal configuration per dataset have been proposed, but despite its importance, tuning is often skipped in applications because of additional run time, complexity, and experimental design questions. Instead, the learner is often applied in its defaults. This principled approach usually improves performance but adds additional algorithmic complexity and computational costs to the training procedure. We propose and study using a set of complementary default values, learned from a large database of prior empirical results as an alternative. Selecting an appropriate configuration on a new dataset then requires only a simple, efficient, and embarrassingly parallel search over this set. To demonstrate the effectiveness and efficiency of the approach, we compare learned sets of configurations to random search and Bayesian optimization. We show that sets of defaults can improve performance while being easy to deploy in comparison to more complex methods.

MCML Authors

Florian Pfisterer

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[51]

A. Python, A. Bender, A. K. Nandi, P. A. Hancock, R. Arambepola, J. Brandsch and T. C. D. Lucas.
Predicting non-state terrorism worldwide.
Science Advances 7.31 (Jul. 2021). DOI

Abstract

Several thousand people die every year worldwide because of terrorist attacks perpetrated by non-state actors. In this context, reliable and accurate short-term predictions of non-state terrorism at the local level are key for policy makers to target preventative measures. Using only publicly available data, we show that predictive models that include structural and procedural predictors can accurately predict the occurrence of non-state terrorism locally and a week ahead in regions affected by a relatively high prevalence of terrorism. In these regions, theoretically informed models systematically outperform models using predictors built on past terrorist events only. We further identify and interpret the local effects of major global and regional terrorism drivers. Our study demonstrates the potential of theoretically informed models to predict and explain complex forms of political violence at policy-relevant scales.

MCML Authors

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

[50]

M. Binder, F. Pfisterer, M. Lang, L. Schneider, L. Kotthoff and B. Bischl.
mlr3pipelines - Flexible Machine Learning Pipelines in R.
Journal of Machine Learning Research 22.184 (Jun. 2021). URL

Abstract

Recent years have seen a proliferation of ML frameworks. Such systems make ML accessible to non-experts, especially when combined with powerful parameter tuning and AutoML techniques. Modern, applied ML extends beyond direct learning on clean data, however, and needs an expressive language for the construction of complex ML workflows beyond simple pre- and post-processing. We present mlr3pipelines, an R framework which can be used to define linear and complex non-linear ML workflows as directed acyclic graphs. The framework is part of the mlr3 ecosystem, leveraging convenient resampling, benchmarking, and tuning components.

MCML Authors

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[49]

G. König, T. Freiesleben, B. Bischl, G. Casalicchio and M. Grosse-Wentrup.
Decomposition of Global Feature Importance into Direct and Associative Components (DEDACT).
Preprint (Jun. 2021). arXiv

Abstract

Global model-agnostic feature importance measures either quantify whether features are directly used for a model’s predictions (direct importance) or whether they contain prediction-relevant information (associative importance). Direct importance provides causal insight into the model’s mechanism, yet it fails to expose the leakage of information from associated but not directly used variables. In contrast, associative importance exposes information leakage but does not provide causal insight into the model’s mechanism. We introduce DEDACT - a framework to decompose well-established direct and associative importance measures into their respective associative and direct components. DEDACT provides insight into both the sources of prediction-relevant information in the data and the direct and indirect feature pathways by which the information enters the model. We demonstrate the method’s usefulness on simulated examples.

MCML Authors

Gunnar König

Dr.

* Former Member

Timo Freiesleben

Dr.

A2 | Mathematical Foundations
→ Group Tom Sterkenburg

Munich Center for Mathematical Philosophy

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Moritz Grosse-Wentrup

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Principal Investigator

[48]

I. Gerostathopoulos, F. Plášil, C. Prehofer, J. Thomas and B. Bischl.
Automated Online Experiment-Driven Adaptation--Mechanics and Cost Aspects.
IEEE Access 9 (Apr. 2021). DOI

Abstract

As modern software-intensive systems become larger, more complex, and more customizable, it is desirable to optimize their functionality by runtime adaptations. However, in most cases it is infeasible to fully model and predict their behavior in advance, which is a classical requirement of runtime self-adaptation. To address this problem, we propose their self-adaptation based on a sequence of online experiments carried out in a production environment. The key idea is to evaluate each experiment by data analysis and determine the next potential experiment via an optimization strategy. The feasibility of the approach is illustrated on a use case devoted to online self-adaptation of traffic navigation where Bayesian optimization, grid search, and local search are employed as the optimization strategies. Furthermore, the cost of the experiments is discussed and three key cost components are examined-time cost, adaptation cost, and endurability cost.

MCML Authors

Janek Thomas

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

[47]

P. Kopper, S. Pölsterl, C. Wachinger, B. Bischl, A. Bender and D. Rügamer.
Semi-Structured Deep Piecewise Exponential Models.
AAAI-SPACA 2021 - AAAI Spring Symposium Series on Survival Prediction: Algorithms, Challenges and Applications. Palo Alto, California, USA, Mar 21-24, 2021. PDF

Abstract

We propose a versatile framework for survival analysis that combines advanced concepts from statistics with deep learning. The presented framework is based on piecewise expo-nential models and thereby supports various survival tasks, such as competing risks and multi-state modeling, and further allows for estimation of time-varying effects and time-varying features. To also include multiple data sources and higher-order interaction effects into the model, we embed the model class in a neural network and thereby enable the si-multaneous estimation of both inherently interpretable structured regression inputs as well as deep neural network components which can potentially process additional unstructured data sources. A proof of concept is provided by using the framework to predict Alzheimer’s disease progression based on tabular and 3D point cloud data and applying it to synthetic data.

MCML Authors

Christian Wachinger

Prof. Dr.

Artificial Intelligence in Medical Imaging

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

[46]

J. Goschenhofer, R. Hvingelby, D. Rügamer, J. Thomas, M. Wagner and B. Bischl.
Deep Semi-Supervised Learning for Time Series Classification.
Preprint (Feb. 2021). arXiv

Abstract

While Semi-supervised learning has gained much attention in computer vision on image data, yet limited research exists on its applicability in the time series domain. In this work, we investigate the transferability of state-of-the-art deep semi-supervised models from image to time series classification. We discuss the necessary model adaptations, in particular an appropriate model backbone architecture and the use of tailored data augmentation strategies. Based on these adaptations, we explore the potential of deep semi-supervised learning in the context of time series classification by evaluating our methods on large public time series classification problems with varying amounts of labelled samples. We perform extensive comparisons under a decidedly realistic and appropriate evaluation scheme with a unified reimplementation of all algorithms considered, which is yet lacking in the field. We find that these transferred semi-supervised models show significant performance gains over strong supervised, semi-supervised and self-supervised alternatives, especially for scenarios with very few labelled samples.

MCML Authors

Jann Goschenhofer

Dr.

* Former Member

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

Janek Thomas

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[45]

G. König, C. Molnar, B. Bischl and M. Grosse-Wentrup.
Relative Feature Importance.
ICPR 2020 - 25th International Conference on Pattern Recognition. Virtual - Milano, Italy, Jan 10-15, 2021. DOI

Abstract

Interpretable Machine Learning (IML) methods are used to gain insight into the relevance of a feature of interest for the performance of a model. Commonly used IML methods differ in whether they consider features of interest in isolation, e.g., Permutation Feature Importance (PFI), or in relation to all remaining feature variables, e.g., Conditional Feature Importance (CFI). As such, the perturbation mechanisms inherent to PFI and CFI represent extreme reference points. We introduce Relative Feature Importance (RFI), a generalization of PFI and CFI that allows for a more nuanced feature importance computation beyond the PFI versus CFI dichotomy. With RFI, the importance of a feature relative to any other subset of features can be assessed, including variables that were not available at training time. We derive general interpretation rules for RFI based on a detailed theoretical analysis of the implications of relative feature relevance, and demonstrate the method’s usefulness on simulated examples.

MCML Authors

Gunnar König

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

Moritz Grosse-Wentrup

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Principal Investigator

[44]

M. Becker, S. Gruber, J. Richter, J. Moosbauer and B. Bischl.
mlr3hyperband: Hyperband for 'mlr3'.
2021. URL GitHub

Abstract

mlr3hyperband adds the optimization algorithms Successive Halving (Jamieson and Talwalkar 2016) and Hyperband (Li et al. 2018) to the mlr3 ecosystem. The implementation in mlr3hyperband features improved scheduling and parallelizes the evaluation of configurations. The package includes tuners for hyperparameter optimization in mlr3tuning and optimizers for black-box optimization in bbotk.

MCML Authors

Marc Becker

Statistical Learning and Data Science

Julia Moosbauer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[43]

M. Becker, M. Lang, J. Richter, B. Bischl and D. Schalk.
mlr3tuning: Tuning for 'mlr3'.
2021. URL GitHub

Abstract

mlr3tuning is the hyperparameter optimization package of the mlr3 ecosystem. It features highly configurable search spaces via the paradox package and finds optimal hyperparameter configurations for any mlr3 learner. mlr3tuning works with several optimization algorithms e.g. Random Search, Iterated Racing, Bayesian Optimization (in mlr3mbo) and Hyperband (in mlr3hyperband). Moreover, it can automatically optimize learners and estimate the performance of optimized models with nested resampling. The package is built on the optimization framework bbotk.

MCML Authors

Marc Becker

Statistical Learning and Data Science

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Daniel Schalk

Dr.

* Former Member

[42]

M. Becker, J. Richter, M. Lang, B. Bischl and M. Binder.
bbotk: Black-Box Optimization Toolkit.
2021. URL GitHub

Abstract

bbotk is a black-box optimization framework for R. It features highly configurable search spaces via the paradox package and optimizes every user-defined objective function. The package includes several optimization algorithms e.g. Random Search, Grid Search, Iterated Racing, Bayesian Optimization (in mlr3mbo) and Hyperband (in mlr3hyperband). bbotk is the base package of mlr3tuning, mlr3fselect and miesmuschel.

MCML Authors

Marc Becker

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Martin Binder

Statistical Learning and Data Science

[41]

M. Binder.
mlrintermbo: Model-Based Optimization for 'mlr3' through 'mlrMBO'.
2021. URL GitHub

Abstract

The ‘mlrMBO’ package can ordinarily not be used for optimization within ‘mlr3’, because of incompatibilities of their respective class systems. ‘mlrintermbo’ offers a compatibility interface that provides ‘mlrMBO’ as an ‘mlr3tuning’ ‘Tuner’ object, for tuning of machine learning algorithms within ‘mlr3’, as well as a ‘bbotk’ ‘Optimizer’ object for optimization of general objective functions using the ‘bbotk’ black box optimization framework. The control parameters of ‘mlrMBO’ are faithfully reproduced as a ‘paradox’ ‘ParamSet’.

MCML Authors

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[40]

M. Lang.
mlr3measures: Performance Measures for 'mlr3'.
2021. URL

Abstract

Implements multiple performance measures for supervised learning. Includes over 40 measures for regression and classification. Additionally, meta information about the performance measures can be queried, e.g. what the best and worst possible performances scores are.

MCML Authors

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[39]

M. Lang, B. Bischl, J. Richter, X. Sun and M. Binder.
paradox: Define and Work with Parameter Spaces for Complex Algorithms.
2021. URL GitHub

Abstract

The paradox package offers a language for the description of parameter spaces, as well as tools for useful operations on these parameter spaces.

MCML Authors

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Martin Binder

Statistical Learning and Data Science

[38]

D. Rügamer, F. Pfisterer and P. Baumann.
deepregression: Fitting Semi-Structured Deep Distributional Regression in R.
2021. URL

Abstract

Allows for the specification of semi-structured deep distributional regression models which are fitted in a neural network as proposed by Ruegamer et al. (2023). Predictors can be modeled using structured (penalized) linear effects, structured non-linear effects or using an unstructured deep network model.

MCML Authors

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

Florian Pfisterer

Dr.

* Former Member

[37]

P. Schratz and M. Becker.
mlr3spatiotempcv: Spatiotemporal Resampling Methods for 'mlr3'.
2021. URL

Abstract

Extends the mlr3 ML framework with spatio-temporal resampling methods to account for the presence of spatiotemporal autocorrelation (STAC) in predictor variables. STAC may cause highly biased performance estimates in cross-validation if ignored.

MCML Authors

Patrick Schratz

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Marc Becker

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[36]

H. Seibold, S. Czerny, S. Decke, R. Dieterle, T. Eder, S. Fohr, N. Hahn, R. Hartmann, C. Heindl, P. Kopper, D. Lepke, V. Loidl, M. M. Mandl, S. Musiol, J. Peter, A. Piehler, E. Rojas, S. Schmid, H. Schmidt, M. Schmoll, L. Schneider, X.-Y. To, V. Tran, A. Völker, M. Wagner, J. Wagner, M. Waize, H. Wecker, R. Yang, S. Zellner and M. Nalenz.
A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses.
PLOS One 16.6 (2021). DOI

Abstract

Computational reproducibility is a corner stone for sound and credible research. Especially in complex statistical analyses—such as the analysis of longitudinal data—reproducing results is far from simple, especially if no source code is available. In this work we aimed to reproduce analyses of longitudinal data of 11 articles published in PLOS ONE. Inclusion criteria were the availability of data and author consent. We investigated the types of methods and software used and whether we were able to reproduce the data analysis using open source software. Most articles provided overview tables and simple visualisations. Generalised Estimating Equations (GEEs) were the most popular statistical models among the selected articles. Only one article used open source software and only one published part of the analysis code. Replication was difficult in most cases and required reverse engineering of results or contacting the authors. For three articles we were not able to reproduce the results, for another two only parts of them. For all but two articles we had to contact the authors to be able to reproduce the results. Our main learning is that reproducing papers is difficult if no code is supplied and leads to a high burden for those conducting the reproductions. Open data policies in journals are good, but to truly boost reproducibility we suggest adding open code policies.

MCML Authors

Maximilian Mandl

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Xiao-Yin To

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Viet Tran

C2 | Biology
→ Group Christian Müller

Biomedical Statistics and Data Science

2020

[35]

A. Agrawal, F. Pfisterer, B. Bischl, F. Buet-Golfouse, S. Sood, J. Chen, S. Shah and S. Vollmer.
Debiasing classifiers: is reality at variance with expectation?
Preprint (Nov. 2020). arXiv

Abstract

We present an empirical study of debiasing methods for classifiers, showing that debiasers often fail in practice to generalize out-of-sample, and can in fact make fairness worse rather than better. A rigorous evaluation of the debiasing treatment effect requires extensive cross-validation beyond what is usually done. We demonstrate that this phenomenon can be explained as a consequence of bias-variance trade-off, with an increase in variance necessitated by imposing a fairness constraint. Follow-up experiments validate the theoretical prediction that the estimation variance depends strongly on the base rates of the protected class. Considering fairness–performance trade-offs justifies the counterintuitive notion that partial debiasing can actually yield better results in practice on out-of-sample data.

MCML Authors

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

[34]

D. Rügamer, F. Pfisterer and B. Bischl.
Neural Mixture Distributional Regression.
Preprint (Oct. 2020). arXiv

Abstract

We present neural mixture distributional regression (NMDR), a holistic framework to estimate complex finite mixtures of distributional regressions defined by flexible additive predictors. Our framework is able to handle a large number of mixtures of potentially different distributions in high-dimensional settings, allows for efficient and scalable optimization and can be applied to recent concepts that combine structured regression models with deep neural networks. While many existing approaches for mixture models address challenges in optimization of such and provide results for convergence under specific model assumptions, our approach is assumption-free and instead makes use of optimizers well-established in deep learning. Through extensive numerical experiments and a high-dimensional deep learning application we provide evidence that the proposed approach is competitive to existing approaches and works well in more complex scenarios.

MCML Authors

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistics, Data Science and Machine Learning

Florian Pfisterer

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[33]

A. Bender, D. Rügamer, F. Scheipl and B. Bischl.
A General Machine Learning Framework for Survival Analysis.
ECML-PKDD 2020 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Virtual, Sep 14-18, 2020. DOI

Abstract

The modeling of time-to-event data, also known as survival analysis, requires specialized methods that can deal with censoring and truncation, time-varying features and effects, and that extend to settings with multiple competing events. However, many machine learning methods for survival analysis only consider the standard setting with right-censored data and proportional hazards assumption. The methods that do provide extensions usually address at most a subset of these challenges and often require specialized software that can not be integrated into standard machine learning workflows directly. In this work, we present a very general machine learning framework for time-to-event analysis that uses a data augmentation strategy to reduce complex survival tasks to standard Poisson regression tasks. This reformulation is based on well developed statistical theory. With the proposed approach, any algorithm that can optimize a Poisson (log-)likelihood, such as gradient boosted trees, deep neural networks, model-based boosting and many more can be used in the context of time-to-event analysis. The proposed technique does not require any assumptions with respect to the distribution of event times or the functional shapes of feature and interaction effects. Based on the proposed framework we develop new methods that are competitive with specialized state of the art approaches in terms of accuracy, and versatility, but with comparatively small investments of programming effort or requirements for specialized methodological know-how.

MCML Authors

Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

David Rügamer

Prof. Dr.

Statistics, Data Science and Machine Learning

Fabian Scheipl

PD Dr.

Functional Data Analysis

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[32]

C. Molnar, G. Casalicchio and B. Bischl.
Interpretable Machine Learning -- A Brief History, State-of-the-Art and Challenges.
ECML-PKDD 2020 - Workshops at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Virtual, Sep 14-18, 2020. DOI

Abstract

We present a brief history of the field of interpretable machine learning (IML), give an overview of state-of-the-art interpretation methods and discuss challenges. Research in IML has boomed in recent years. As young as the field is, it has over 200 years old roots in regression modeling and rule-based machine learning, starting in the 1960s. Recently, many new IML methods have been proposed, many of them model-agnostic, but also interpretation techniques specific to deep learning and tree-based ensembles. IML methods either directly analyze model components, study sensitivity to input perturbations, or analyze local or global surrogate approximations of the ML model. The field approaches a state of readiness and stability, with many methods not only proposed in research, but also implemented in open-source software. But many important challenges remain for IML, such as dealing with dependent features, causal interpretation, and uncertainty estimation, which need to be resolved for its successful application to scientific problems. A further challenge is a missing rigorous definition of interpretability, which is accepted by the community. To address the challenges and advance the field, we urge to recall our roots of interpretable, data-driven modeling in statistics and (rule-based) ML, but also to consider other areas such as sensitivity analysis, causal inference, and the social sciences.

MCML Authors

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[31]

S. Dandl, C. Molnar, M. Binder and B. Bischl.
Multi-Objective Counterfactual Explanations.
PPSN 2020 - 16th International Conference on Parallel Problem Solving from Nature. Leiden, Netherlands, Sep 05-09, 2020. DOI

Abstract

Counterfactual explanations are one of the most popular methods to make predictions of black box machine learning models interpretable by providing explanations in the form of ‘what-if scenarios’. Most current approaches optimize a collapsed, weighted sum of multiple objectives, which are naturally difficult to balance a-priori. We propose the Multi-Objective Counterfactuals (MOC) method, which translates the counterfactual search into a multi-objective optimization problem. Our approach not only returns a diverse set of counterfactuals with different trade-offs between the proposed objectives, but also maintains diversity in feature space. This enables a more detailed post-hoc analysis to facilitate better understanding and also more options for actionable user responses to change the predicted outcome. Our approach is also model-agnostic and works for numerical and categorical input features. We show the usefulness of MOC in concrete cases and compare our approach with state-of-the-art methods for counterfactual explanations.

MCML Authors

Susanne Dandl

Dr.

* Former Member

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[30]

M. Binder, F. Pfisterer and B. Bischl.
Collecting empirical data about hyperparameters for data driven AutoML.
AutoML @ICML 2020 - 7th Workshop on Automated Machine Learning co-located with ICML 2020. Virtual, Jul 18, 2020. PDF

Abstract

All optimization needs some kind of prior over the functions it is optimizing over. We used a large computing cluster to collect empirical data about the behavior of ML performance, by randomly sampling hyperparameter values and performing cross-validation. We also collected information about cross-validation error by performing some evaluations multiple times, and information about progression of performance with respect to training data size by performing some evaluations on data subsets. We present how we collected data, make some preliminary analyses on the surrogate models that can be built with them, and give an outlook over interesting analyses this should enable.

MCML Authors

Martin Binder

Statistical Learning and Data Science

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[29]

C. Molnar, G. König, J. Herbinger, T. Freiesleben, S. Dandl, C. A. Scholbeck, G. Casalicchio, M. Grosse-Wentrup and B. Bischl.
General Pitfalls of Model-Agnostic Interpretation Methods for Machine Learning Models.
XXAI @ICML 2020 - Workshop on Extending Explainable AI Beyond Deep Models and Classifiers at the 37th International Conference on Machine Learning (ICML 2020). Virtual, Jul 12-18, 2020. DOI

Abstract

An increasing number of model-agnostic interpretation techniques for machine learning (ML) models such as partial dependence plots (PDP), permutation feature importance (PFI) and Shapley values provide insightful model interpretations, but can lead to wrong conclusions if applied incorrectly. We highlight many general pitfalls of ML model interpretation, such as using interpretation techniques in the wrong context, interpreting models that do not generalize well, ignoring feature dependencies, interactions, uncertainty estimates and issues in high-dimensional settings, or making unjustified causal interpretations, and illustrate them with examples. We focus on pitfalls for global methods that describe the average model behavior, but many pitfalls also apply to local methods that explain individual predictions. Our paper addresses ML practitioners by raising awareness of pitfalls and identifying solutions for correct model interpretation, but also addresses ML researchers by discussing open issues for further research.

MCML Authors

Gunnar König

Dr.

* Former Member

Julia Herbinger

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Susanne Dandl

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Moritz Grosse-Wentrup

Prof. Dr.

* Former Principal Investigator

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[28]

M. Binder, J. Moosbauer, J. Thomas and B. Bischl.
Multi-Objective Hyperparameter Tuning and Feature Selection Using Filter Ensembles.
GECCO 2020 - Genetic and Evolutionary Computation Conference. Cancun, Mexico, Jul 08-12, 2020. DOI

Abstract

Both feature selection and hyperparameter tuning are key tasks in machine learning. Hyperparameter tuning is often useful to increase model performance, while feature selection is undertaken to attain sparse models. Sparsity may yield better model interpretability and lower cost of data acquisition, data handling and model inference. While sparsity may have a beneficial or detrimental effect on predictive performance, a small drop in performance may be acceptable in return for a substantial gain in sparseness. We therefore treat feature selection as a multi-objective optimization task. We perform hyperparameter tuning and feature selection simultaneously because the choice of features of a model may influence what hyperparameters perform well. We present, benchmark, and compare two different approaches for multi-objective joint hyperparameter optimization and feature selection: The first uses multi-objective model-based optimization. The second is an evolutionary NSGA-II-based wrapper approach to feature selection which incorporates specialized sampling, mutation and recombination operators. Both methods make use of parameterized filter ensembles. While model-based optimization needs fewer objective evaluations to achieve good performance, it incurs computational overhead compared to the NSGA-II, so the preferred choice depends on the cost of evaluating a model on given data.

MCML Authors

Martin Binder

Statistical Learning and Data Science

Julia Moosbauer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Janek Thomas

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Statistical Learning and Data Science

[27]

N. Ellenbach, A.-L. Boulesteix, B. Bischl, K. Unger and R. Hornung.
Improved outcome prediction across data sources through robust parameter tuning.
Journal of Classification 38 (Jul. 2020). DOI

Abstract

In many application areas, prediction rules trained based on high-dimensional data are subsequently applied to make predictions for observations from other sources, but they do not always perform well in this setting. This is because data sets from different sources can feature (slightly) differing distributions, even if they come from similar populations. In the context of high-dimensional data and beyond, most prediction methods involve one or several tuning parameters. Their values are commonly chosen by maximizing the cross-validated prediction performance on the training data. This procedure, however, implicitly presumes that the data to which the prediction rule will be ultimately applied, follow the same distribution as the training data. If this is not the case, less complex prediction rules that slightly underfit the training data may be preferable. Indeed, a tuning parameter does not only control the degree of adjustment of a prediction rule to the training data, but also, more generally, the degree of adjustment to the distribution of the training data. On the basis of this idea, in this paper we compare various approaches including new procedures for choosing tuning parameter values that lead to better generalizing prediction rules than those obtained based on cross-validation. Most of these approaches use an external validation data set. In our extensive comparison study based on a large collection of 15 transcriptomic data sets, tuning on external data and robust tuning with a tuned robustness parameter are the two approaches leading to better generalizing prediction rules.

MCML Authors

Nicole Ellenbach

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Statistical Learning and Data Science

Roman Hornung

Dr.

Biometry in Molecular Medicine

[26]

C. Stachl, Q. Au, R. Schoedel, S. D. Gosling, G. M. Harari, D. Buschek, S. T. Völkel, T. Schuwerk, M. Oldemeier, T. Ullmann, H. Hussmann, B. Bischl and M. Bühner.
Predicting personality from patterns of behavior collected with smartphones.
Proceedings of the National Academy of Sciences 117.30 (Jul. 2020). DOI

Abstract

Smartphones enjoy high adoption rates around the globe. Rarely more than an arm’s length away, these sensor-rich devices can easily be repurposed to collect rich and extensive records of their users’ behaviors (e.g., location, communication, media consumption), posing serious threats to individual privacy. Here we examine the extent to which individuals’ Big Five personality dimensions can be predicted on the basis of six different classes of behavioral information collected via sensor and log data harvested from smartphones. Taking a machine-learning approach, we predict personality at broad domain ( = 0.37) and narrow facet levels ( = 0.40) based on behavioral data collected from 624 volunteers over 30 consecutive days (25,347,089 logging events). Our cross-validated results reveal that specific patterns in behaviors in the domains of 1) communication and social behavior, 2) music consumption, 3) app usage, 4) mobility, 5) overall phone activity, and 6) day- and night-time activity are distinctively predictive of the Big Five personality traits. The accuracy of these predictions is similar to that found for predictions based on digital footprints from social media platforms and demonstrates the possibility of obtaining information about individuals’ private traits from behavioral patterns passively collected from their smartphones. Overall, our results point to both the benefits (e.g., in research settings) and dangers (e.g., privacy implications, psychological targeting) presented by the widespread collection and modeling of behavioral data obtained from smartphones.

MCML Authors

Theresa Ullmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[25]

M. Becker, P. Schratz, M. Lang and B. Bischl.
mlr3fselect: Feature Selection for 'mlr3'.
2020. URL

Abstract

Feature selection package of the ‘mlr3’ ecosystem. It selects the optimal feature set for any ‘mlr3’ learner. The package works with several optimization algorithms e.g. Random Search, Recursive Feature Elimination, and Genetic Search. Moreover, it can automatically optimize learners and estimate the performance of optimized feature sets with nested resampling.

MCML Authors

Marc Becker

Statistical Learning and Data Science

Patrick Schratz

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[24]

M. Binder, F. Pfisterer, L. Schneider, B. Bischl, M. Lang and S. Dandl.
mlr3pipelines: Preprocessing Operators and Pipelines for 'mlr3'.
2020. URL GitHub

Abstract

mlr3pipelines is a dataflow programming toolkit for machine learning in R utilising the mlr3 package. Machine learning workflows can be written as directed “Graphs” that represent data flows between preprocessing, model fitting, and ensemble learning units in an expressive and intuitive language. Using methods from the mlr3tuning package, it is even possible to simultaneously optimize parameters of multiple processing units.

MCML Authors

Martin Binder

Statistical Learning and Data Science

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Michel Lang

Dr.

* Former Member

Susanne Dandl

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[23]

M. Lang.
mlr3db: Data Base Backend for 'mlr3'.
2020. URL GitHub

Abstract

Extends the mlr3 package with a DataBackend to transparently work with databases.

MCML Authors

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[22]

M. Lang.
mlr3oml: Connector Between 'mlr3' and 'OpenML'.
2020. URL GitHub

Abstract

OpenML is an open-source platform that facilitates the sharing and dissemination of machine learning research data. All entities on the platform have unique identifiers and standardized (meta)data that can be accessed via an open-access REST API or the web interface. mlr3oml allows to work with the REST API through R and integrates OpenML with the mlr3 ecosystem.

MCML Authors

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[21]

M. Lang, Q. Au, S. Coors and P. Schratz.
mlr3learners: Recommended Learners for 'mlr3'.
2020. URL GitHub

Abstract

This packages provides essential learners for mlr3, maintained by the mlr-org team. Additional learners can be found in the mlr3extralearners package on GitHub. Request additional learners over there.

MCML Authors

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Stefan Coors

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Patrick Schratz

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[20]

M. Lang, P. Schratz and R. Sonabend.
mlr3viz: Visualizations for 'mlr3'.
2020. URL GitHub

Abstract

mlr3viz is the visualization package of the mlr3 ecosystem. It features plots for mlr3 objects such as tasks, learners, predictions, benchmark results, tuning instances and filters via the autoplot() generic of ggplot2. The package draws plots with the viridis color palette and applies the minimal theme. Visualizations include barplots, boxplots, histograms, ROC curves, and Precision-Recall curves.

MCML Authors

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Patrick Schratz

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[19]

D. Pulatov and M. Lang.
mlr3cluster: Cluster Extension for 'mlr3'.
2020. URL GitHub

Abstract

mlr3cluster is an extension package for cluster analysis within the mlr3 ecosystem. It is a successor of clustering capabilities of mlr2.

MCML Authors

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

[18]

P. Schratz, M. Lang, B. Bischl and M. Binder.
mlr3filters: Filter Based Feature Selection for 'mlr3'.
2020. URL GitHub

Abstract

mlr3filters adds feature selection filters to mlr3. The implemented filters can be used stand-alone, or as part of a machine learning pipeline in combination with mlr3pipelines and the filter operator.

MCML Authors

Patrick Schratz

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Martin Binder

Statistical Learning and Data Science

[17]

R. Sonabend, F. J. Kiraly, A. Bender, B. Bischl and M. Lang.
mlr3proba: Probabilistic Supervised Learning for 'mlr3'. R package version 0.2.6.
2020. DOI URL

Abstract

As machine learning has become increasingly popular over the last few decades, so too has the number of machine-learning interfaces for implementing these models. Whilst many R libraries exist for machine learning, very few offer extended support for survival analysis. This is problematic considering its importance in fields like medicine, bioinformatics, economics, engineering and more. mlr3proba provides a comprehensive machine-learning interface for survival analysis and connects with mlr3’s general model tuning and benchmarking facilities to provide a systematic infrastructure for survival modelling and evaluation.

MCML Authors

Andreas Bender

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Machine Learning Consulting Unit (MLCU)

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Michel Lang

Dr.

* Former Member

[16]

M. Urban, K. Heckel, C. Berger, P. Schratz, I. P. Smit, T. Strydom, J. Baade and C. Schmullius.
Woody cover mapping in the savanna ecosystem of the Kruger National Park using Sentinel-1 C-Band time series data.
Koedoe 62.1 (Jan. 2020). DOI

Abstract

The savanna ecosystems in South Africa, which are predominantly characterised by woody vegetation (e.g. shrubs and trees) and grasslands with annual phenological cycles, are shaped by ecosystem processes such as droughts, fires and herbivory interacting with management actions. Therefore, monitoring of the intra- and inter-annual vegetation structure dynamics is one of the essential components for the management of complex savanna ecosystems such as the Kruger National Park (KNP). To map the woody cover in the KNP, data from European Space Agency’s (ESA) Copernicus Sentinel-1 radar satellite (C-Band vertical-vertical [VV]/vertical-horizontal [VH]) for the years 2016 and 2017, at 10 m spatial resolution and repeated acquisitions every 12 days, were utilised. A high-resolution light detection and ranging (LiDAR) data set was reclassified to produce woody cover percentages and consequently used for calibration and validation. Woody cover estimation for different spatial resolutions was carried out by fitting a random forest (RF) model. Model accuracy was assessed via spatial cross-validation and revealed an overall root mean squared error (RMSE) of 22.8% for the product with a spatial resolution of 10 m and improved with spatial averaging to 15.8% for 30 m, 14.8% for 50 m and 13.4% for 100 m. In addition, the product was validated against a second LiDAR data set, confirming the results of the spatial cross-validation of the model. The methodology of this study is designed for savanna vegetation structure mapping based on height estimates by using open-source software and open-access data, to allow for a continuation of woody cover classification and change monitoring in these types of ecosystems.

MCML Authors

Patrick Schratz

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

2019

[15]

M. Lang, M. Binder, J. Richter, P. Schratz, F. Pfisterer, S. Coors, Q. Au, G. Casalicchio, L. Kotthoff and B. Bischl.
mlr3: A modern object-oriented machine learning framework in R.
The Journal of Open Source Software 4.44 (Dec. 2019). DOI

Abstract

The R (R Core Team, 2019) package mlr3 and its associated ecosystem of extension packages implements a powerful, object-oriented and extensible framework for machine learning (ML) in R. It provides a unified interface to many learning algorithms available on CRAN, augmenting them with model-agnostic general-purpose functionality that is needed in every ML project, for example train-test-evaluation, resampling, preprocessing, hyperparameter tuning, nested resampling, and visualization of results from ML experiments. The package is a complete reimplementation of the mlr (Bischl et al., 2016) package that leverages many years of experience and learned best practices to provide a state-of-the-art system that is powerful, flexible, extensible, and maintainable. We target both practitioners who want to quickly apply ML algorithms to their problems and researchers who want to implement, benchmark, and compare their new methods in a structured environment. mlr3 is suitable for short scripts that test an idea, for complex multi-stage experiments with advanced functionality that use a broad range of ML functionality, as a foundation to implement new ML (meta-)algorithms (for example AutoML systems), and everything in between. Functional correctness is ensured through extensive unit and integration tests.
Several other general-purpose ML toolboxes exist for different programing languages. The most widely used ones are scikit-learn (Pedregosa et al., 2011) for Python , Weka (Hall et al., 2009) for Java, and mlj (Blaom, Kiraly, Lienart, & Vollmer, 2019) for Julia. The most important toolboxes for R are mlr, caret (Kuhn, 2008) and tidymodels (Kuhn & Wickham, 2019).

MCML Authors

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Patrick Schratz

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Stefan Coors

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[14]

M. Binder, J. Moosbauer, J. Thomas and B. Bischl.
Multi-Objective Hyperparameter Tuning and Feature Selection using Filter Ensembles.
Preprint (Dec. 2019). arXiv

Abstract

MCML Authors

Martin Binder

Statistical Learning and Data Science

Julia Moosbauer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Janek Thomas

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[13]

F. Pfisterer, L. Beggel, X. Sun, F. Scheipl and B. Bischl.
Benchmarking time series classification -- Functional data vs machine learning approaches.
Preprint (Nov. 2019). arXiv

Abstract

Time series classification problems have drawn increasing attention in the machine learning and statistical community. Closely related is the field of functional data analysis (FDA): it refers to the range of problems that deal with the analysis of data that is continuously indexed over some domain. While often employing different methods, both fields strive to answer similar questions, a common example being classification or regression problems with functional covariates. We study methods from functional data analysis, such as functional generalized additive models, as well as functionality to concatenate (functional-) feature extraction or basis representations with traditional machine learning algorithms like support vector machines or classification trees. In order to assess the methods and implementations, we run a benchmark on a wide variety of representative (time series) data sets, with in-depth analysis of empirical results, and strive to provide a reference ranking for which method(s) to use for non-expert practitioners. Additionally, we provide a software framework in R for functional data analysis for supervised learning, including machine learning and more linear approaches from statistics. This allows convenient access, and in connection with the machine-learning toolbox mlr, those methods can now also be tuned and benchmarked.

MCML Authors

Florian Pfisterer

Dr.

* Former Member

Fabian Scheipl

PD Dr.

Functional Data Analysis

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[12]

F. Pfisterer, J. Thomas and B. Bischl.
Towards Human Centered AutoML.
Preprint (Nov. 2019). arXiv

Abstract

Building models from data is an integral part of the majority of data science workflows. While data scientists are often forced to spend the majority of the time available for a given project on data cleaning and exploratory analysis, the time available to practitioners to build actual models from data is often rather short due to time constraints for a given project. AutoML systems are currently rising in popularity, as they can build powerful models without human oversight. In this position paper, we aim to discuss the impact of the rising popularity of such systems and how a user-centered interface for such systems could look like. More importantly, we also want to point out features that are currently missing in those systems and start to explore better usability of such systems from a data-scientists perspective.

MCML Authors

Florian Pfisterer

Dr.

* Former Member

Janek Thomas

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[11]

G. König and M. Grosse-Wentrup.
A Causal Perspective on Challenges for AI in Precision Medicine.
PMBC 2019 - 2nd International Congress on Precision Medicine. Munich, Germany, Oct 14-15, 2019.

MCML Authors

Gunnar König

Dr.

* Former Member

Moritz Grosse-Wentrup

Prof. Dr.

* Former Principal Investigator

[10]

L. Beggel, M. Pfeiffer and B. Bischl.
Robust Anomaly Detection in Images Using Adversarial Autoencoders.
ECML-PKDD 2019 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Wuerzburg, Germany, Sep 16-20, 2019. DOI

Abstract

Reliably detecting anomalies in a given set of images is a task of high practical relevance for visual quality inspection, surveillance, or medical image analysis. Autoencoder neural networks learn to reconstruct normal images, and hence can classify those images as anomalies, where the reconstruction error exceeds some threshold. Here we analyze a fundamental problem of this approach when the training set is contaminated with a small fraction of outliers. We find that continued training of autoencoders inevitably reduces the reconstruction error of outliers, and hence degrades the anomaly detection performance. In order to counteract this effect, an adversarial autoencoder architecture is adapted, which imposes a prior distribution on the latent representation, typically placing anomalies into low likelihood-regions. Utilizing the likelihood model, potential anomalies can be identified and rejected already during training, which results in an anomaly detector that is significantly more robust to the presence of outliers during training.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[9]

J. Goschenhofer, F. M. J. Pfister, K. A. Yuksel, B. Bischl, U. Fietzek and J. Thomas.
Wearable-based Parkinson's Disease Severity Monitoring using Deep Learning.
ECML-PKDD 2019 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Wuerzburg, Germany, Sep 16-20, 2019. DOI

Abstract

One major challenge in the medication of Parkinson’s disease is that the severity of the disease, reflected in the patients’ motor state, cannot be measured using accessible biomarkers. Therefore, we develop and examine a variety of statistical models to detect the motor state of such patients based on sensor data from a wearable device. We find that deep learning models consistently outperform a classical machine learning model applied on hand-crafted features in this time series classification task. Furthermore, our results suggest that treating this problem as a regression instead of an ordinal regression or a classification task is most appropriate. For consistent model evaluation and training, we adopt the leave-one-subject-out validation scheme to the training of deep learning models. We also employ a class-weighting scheme to successfully mitigate the problem of high multi-class imbalances in this domain. In addition, we propose a customized performance measure that reflects the requirements of the involved medical staff on the model. To solve the problem of limited availability of high quality training data, we propose a transfer learning technique which helps to improve model performance substantially. Our results suggest that deep learning techniques offer a high potential to autonomously detect motor states of patients with Parkinson’s disease.

MCML Authors

Jann Goschenhofer

Dr.

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Janek Thomas

Dr.

* Former Member

[8]

C. Molnar, G. Casalicchio and B. Bischl.
Quantifying Model Complexity via Functional Decomposition for Better Post-hoc Interpretability.
ECML-PKDD 2019 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Wuerzburg, Germany, Sep 16-20, 2019. DOI

Abstract

Post-hoc model-agnostic interpretation methods such as partial dependence plots can be employed to interpret complex machine learning models. While these interpretation methods can be applied regardless of model complexity, they can produce misleading and verbose results if the model is too complex, especially w.r.t. feature interactions. To quantify the complexity of arbitrary machine learning models, we propose model-agnostic complexity measures based on functional decomposition: number of features used, interaction strength and main effect complexity. We show that post-hoc interpretation of models that minimize the three measures is more reliable and compact. Furthermore, we demonstrate the application of these measures in a multi-objective optimization approach which simultaneously minimizes loss and complexity.

MCML Authors

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Statistical Learning and Data Science

[7]

C. A. Scholbeck, C. Molnar, C. Heumann, B. Bischl and G. Casalicchio.
Sampling, Intervention, Prediction, Aggregation: A Generalized Framework for Model Agnostic Interpretations.
ECML-PKDD 2019 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Wuerzburg, Germany, Sep 16-20, 2019. DOI

Abstract

Model-agnostic interpretation techniques allow us to explain the behavior of any predictive model. Due to different notations and terminology, it is difficult to see how they are related. A unified view on these methods has been missing. We present the generalized SIPA (sampling, intervention, prediction, aggregation) framework of work stages for model-agnostic interpretations and demonstrate how several prominent methods for feature effects can be embedded into the proposed framework. Furthermore, we extend the framework to feature importance computations by pointing out how variance-based and performance-based importance measures are based on the same work stages. The SIPA framework reduces the diverse set of model-agnostic techniques to a single methodology and establishes a common terminology to discuss them in future work.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

[6]

F. Pfisterer, S. Coors, J. Thomas and B. Bischl.
Multi-Objective Automatic Machine Learning with AutoxgboostMC.
ECML-PKDD 2019 - Workshops at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Wuerzburg, Germany, Sep 16-20, 2019. arXiv

Abstract

AutoML systems are currently rising in popularity, as they can build powerful models without human oversight. They often combine techniques from many different sub-fields of machine learning in order to find a model or set of models that optimize a user-supplied criterion, such as predictive performance. The ultimate goal of such systems is to reduce the amount of time spent on menial tasks, or tasks that can be solved better by algorithms while leaving decisions that require human intelligence to the end-user. In recent years, the importance of other criteria, such as fairness and interpretability, and many others have become more and more apparent. Current AutoML frameworks either do not allow to optimize such secondary criteria or only do so by limiting the system’s choice of models and preprocessing steps. We propose to optimize additional criteria defined by the user directly to guide the search towards an optimal machine learning pipeline. In order to demonstrate the need and usefulness of our approach, we provide a simple multi-criteria AutoML system and showcase an exemplary application.

MCML Authors

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Stefan Coors

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Janek Thomas

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[5]

J. Thomas.
Gradient boosting in automatic machine learning: feature selection and hyperparameter optimization.
Dissertation 2019. DOI

Abstract

This thesis focuses on automating model selection in AutoML, specifically through gradient boosting techniques like gradient tree and component-wise boosting. It addresses challenges in hyperparameter optimization using Bayesian methods, introduces a new feature selection technique, and proposes an AutoML approach that simplifies the process while maintaining accuracy. Four R packages were developed: mlrMBO for Bayesian optimization, autoxgboost for AutoML, compboost for component-wise boosting, and gamboostLSS for generalized additive models (Shortened.)

MCML Authors

Janek Thomas

Dr.

* Former Member

[4]

Q. Au, D. Schalk, G. Casalicchio, R. Schoedel, C. Stachl and B. Bischl.
Component-Wise Boosting of Targets for Multi-Output Prediction.
Preprint (Apr. 2019). arXiv

Abstract

Multi-output prediction deals with the prediction of several targets of possibly diverse types. One way to address this problem is the so called problem transformation method. This method is often used in multi-label learning, but can also be used for multi-output prediction due to its generality and simplicity. In this paper, we introduce an algorithm that uses the problem transformation method for multi-output prediction, while simultaneously learning the dependencies between target variables in a sparse and interpretable manner. In a first step, predictions are obtained for each target individually. Target dependencies are then learned via a component-wise boosting approach. We compare our new method with similar approaches in a benchmark using multi-label, multivariate regression and mixed-type datasets.

MCML Authors

Daniel Schalk

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[3]

G. Casalicchio.
On benchmark experiments and visualization methods for the evaluation and interpretation of machine learning models.
Dissertation 2019. DOI

Abstract

This cumulative dissertation consists of five articles divided into three parts. The first part extends the mlr package in R to implement and benchmark multilabel classification methods. The second part focuses on simplifying benchmark experiments with OpenML.org, introducing the OpenML R package and the OpenML100 benchmarking suite for standardized dataset and result management. The third part addresses model evaluation and interpretability, proposing the residual-based predictiveness (RBP) curve to improve upon the predictiveness curve and introducing new visualization tools, including the Shapley feature importance (SFIMP) measure for model interpretation. (Shortened.)

MCML Authors

Giuseppe Casalicchio

Dr.

Statistical Learning and Data Science

[2]

P. Probst, A.-L. Boulesteix and B. Bischl.
Tunability: Importance of Hyperparameters of Machine Learning Algorithms.
Journal of Machine Learning Research 20 (Mar. 2019). PDF

Abstract

Modern supervised machine learning algorithms involve hyperparameters that have to be set before running them. Options for setting hyperparameters are default values from the software package, manual configuration by the user or configuring them for optimal predictive performance by a tuning procedure. The goal of this paper is two-fold. Firstly, we formalize the problem of tuning from a statistical point of view, define data-based defaults and suggest general measures quantifying the tunability of hyperparameters of algorithms. Secondly, we conduct a large-scale benchmarking study based on 38 datasets from the OpenML platform and six common machine learning algorithms. We apply our measures to assess the tunability of their parameters. Our results yield default values for hyperparameters and enable users to decide whether it is worth conducting a possibly time consuming tuning strategy, to focus on the most important hyperparameters and to choose adequate hyperparameter spaces for tuning.

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine

Bernd Bischl

Prof. Dr.