Home | Research | Groups | Anne-Laure Boulesteix

Research Group Anne-Laure Boulesteix

Anne-Laure Boulesteix

Prof. Dr.

Principal Investigator

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

Anne-Laure Boulesteix

is Professor for Biometry in Molecular Medicine at LMU Munich.

Her working group focuses on developing advanced biostatistical methods for prediction modeling and high-dimensional data analysis, with applications in biomedical research, especially omics data. Additionally, they engage in metascience, examining research practices to improve study reliability and address issues like selective reporting and researchers’ degrees of freedom.

Team members @MCML

PostDocs

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Roman Hornung

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

PhD Students

Mathilde Dicaire-Cartier

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Nicole Ellenbach

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Julian Lange

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Maximilian Mandl

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Christina Sauer (née Nießl)

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Hannah Schulz-Kümpel

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Milena Wünsch

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Publications @MCML

2025

[53]

M. M. Mandl, A.-L. Boulesteix, S. Burgess and V. Zuber.
Outlier Detection in Mendelian Randomization.
Statistics in Medicine 44.15-17 (Jul. 2025). DOI

Abstract

Mendelian randomization (MR) uses genetic variants as instrumental variables to infer causal effects of exposures on an outcome. One key assumption of MR is that the genetic variants used as instrumental variables are independent of the outcome conditional on the risk factor and unobserved confounders. Violations of this assumption, that is, the effect of the instrumental variables on the outcome through a path other than the risk factor included in the model (which can be caused by pleiotropy), are common phenomena in human genetics. Genetic variants, which deviate from this assumption, appear as outliers to the MR model fit and can be detected by the general heterogeneity statistics proposed in the literature, which are known to suffer from overdispersion, that is, too many genetic variants are declared as false outliers. We propose a method that corrects for overdispersion of the heterogeneity statistics in uni- and multivariable MR analysis by making use of the estimated inflation factor to correctly remove outlying instruments and therefore account for pleiotropic effects. Our method is applicable to summary-level data.

MCML Authors

Maximilian Mandl

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[52]

C. Sauer, F. J. D. Lange, M. Thurow, I. Dormuth and A.-L. Boulesteix.
Statistical parametric simulation studies based on real data.
Preprint (Apr. 2025). arXiv

Abstract

Simulation studies are indispensable for evaluating and comparing statistical methods. The most common simulation approach is parametric simulation, where the data-generating mechanism (DGM) corresponds to a predefined parametric model from which observations are drawn. Many statistical simulation studies aim to provide practical recommendations on a method’s suitability for a given application; however, parametric simulations in particular are frequently criticized for being too simplistic and not reflecting reality. To overcome this drawback, it is generally considered a sensible approach to employ real data for constructing the parametric DGMs. However, while the concept of real-data-based parametric DGMs is widely recognized, the specific ways in which DGM components are inferred from real data vary, and their implications may not always be well understood. Additionally, researchers often rely on a limited selection of real datasets, with the rationale for their selection often unclear. This paper addresses these issues by formally discussing how components of parametric DGMs can be inferred from real data and how dataset selection can be performed more systematically. By doing so, we aim to support researchers in conducting simulation studies with a lower risk of overgeneralization and misinterpretation. We illustrate the construction of parametric DGMs based on a systematically selected set of real datasets using two examples: one on ordinal outcomes in randomized controlled trials and one on differential gene expression analysis.

MCML Authors

Christina Sauer (née Nießl)

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[51]

R. Hornung, M. Nalenz, L. Schneider, A. Bender, L. Bothmann, F. Dumpert, B. Bischl, T. Augustin and A.-L. Boulesteix.
Evaluating Machine Learning Models in Non-Standard Settings: An Overview and New Findings.
Statistical Science (Mar. 2025). To be published. Preprint available. arXiv

Abstract

Estimating the generalization error (GE) of machine learning models is fundamental, with resampling methods being the most common approach. However, in non-standard settings, particularly those where observations are not independently and identically distributed, resampling using simple random data divisions may lead to biased GE estimates. This paper strives to present well-grounded guidelines for GE estimation in various such non-standard settings: clustered data, spatial data, unequal sampling probabilities, concept drift, and hierarchically structured outcomes. Our overview combines well-established methodologies with other existing methods that, to our knowledge, have not been frequently considered in these particular settings. A unifying principle among these techniques is that the test data used in each iteration of the resampling procedure should reflect the new observations to which the model will be applied, while the training data should be representative of the entire data set used to obtain the final model. Beyond providing an overview, we address literature gaps by conducting simulation studies. These studies assess the necessity of using GE-estimation methods tailored to the respective setting. Our findings corroborate the concern that standard resampling methods often yield biased GE estimates in non-standard settings, underscoring the importance of tailored GE estimation.

MCML Authors

Roman Hornung

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Andreas Bender

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Machine Learning Consulting Unit (MLCU)

Ludwig Bothmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[50]

F. J. D. Lange, J. C. Wilcke, S. Hoffmann, M. Herrmann and A.-L. Boulesteix.
On 'confirmatory' methodological research in statistics and related fields.
Preprint (Mar. 2025). arXiv

Abstract

Empirical substantive research, such as in the life or social sciences, is commonly categorized into the two modes exploratory and confirmatory, both of which are essential to scientific progress. The former is also referred to as hypothesis-generating or data-contingent research, the latter is also called hypothesis-testing research. In the context of empirical methodological research in statistics, however, the exploratory-confirmatory distinction has received very little attention so far. Our paper aims to fill this gap. First, we revisit the concept of empirical methodological research through the lens of the exploratory-confirmatory distinction. Secondly, we examine current practice with respect to this distinction through a literature survey including 115 articles from the field of biostatistics. Thirdly, we provide practical recommendations towards more appropriate design, interpretation, and reporting of empirical methodological research in light of this distinction. In particular, we argue that both modes of research are crucial to methodological progress, but that most published studies – even if sometimes disguised as confirmatory – are essentially of exploratory nature. We emphasize that it may be adequate to consider empirical methodological research as a continuum between ‘pure’ exploration and ‘strict’ confirmation, recommend transparently reporting the mode of conducted research within the spectrum between exploratory and confirmatory, and stress the importance of study protocols written before conducting the study, especially in confirmatory methodological research.

MCML Authors

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[49]

M. M. Mandl, F. Weber, T. Wöhrle and A.-L. Boulesteix.
The impact of the storytelling fallacy on real data examples in methodological research.
Preprint (Mar. 2025). arXiv

Abstract

The term ‘researcher degrees of freedom’ (RDF), which was introduced in metascientific literature in the context of the replication crisis in science, refers to the extent of flexibility a scientist has in making decisions related to data analysis. These choices occur at all stages of the data analysis process. In combination with selective reporting, RDF may lead to over-optimistic statements and an increased rate of false positive findings. Even though the concept has been mainly discussed in fields such as epidemiology or psychology, similar problems affect methodological statistical research. Researchers who develop and evaluate statistical methods are left with a multitude of decisions when designing their comparison studies. This leaves room for an over-optimistic representation of the performance of their preferred method(s). The present paper defines and explores a particular RDF that has not been previously identified and discussed. When interpreting the results of real data examples that are most often part of methodological evaluations, authors typically tell a domain-specific ‘story’ that best supports their argumentation in favor of their preferred method. However, there are often plenty of other plausible stories that would support different conclusions. We define the ‘storytelling fallacy’ as the selective use of anecdotal domain-specific knowledge to support the superiority of specific methods in real data examples. While such examples fed by domain knowledge play a vital role in methodological research, if deployed inappropriately they can also harm the validity of conclusions on the investigated methods. The goal of our work is to create awareness for this issue, fuel discussions on the role of real data in generating evidence in methodological research and warn readers of methodological literature against naive interpretations of real data examples.

MCML Authors

Maximilian Mandl

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[48]

R. Rehms, N. Ellenbach, V. Deffner and S. Hoffmann.
Addressing complex structures of measurement error arising in the exposure assessment in occupational epidemiology using a Bayesian hierarchical approach.
Preprint (Mar. 2025). arXiv

Abstract

Exposure assessment in occupational epidemiology may involve multiple unknown quantities that are measured or reconstructed simultaneously for groups of workers and over several years. Additionally, exposures may be collected using different assessment strategies, depending on the period of exposure. As a consequence, researchers who are analyzing occupational cohort studies are commonly faced with challenging structures of exposure measurement error, involving complex dependence structures and multiple measurement error models, depending on the period of exposure. However, previous work has often made many simplifying assumptions concerning these errors. In this work, we propose a Bayesian hierarchical approach to account for a broad range of error structures arising in occupational epidemiology. The considered error structures may involve several unknown quantities that can be subject to mixtures of Berkson and classical measurement error. It is possible to account for different error structures, depending on the exposure period and the location of a worker. Moreover, errors can present complex dependence structures over time and between workers. We illustrate the proposed hierarchical approach on a subgroup of the German cohort of uranium miners to account for potential exposure uncertainties in the association between radon exposure and lung cancer mortality. The performance of the proposed approach and its sensitivity to model misspecification are evaluated in a simulation study. The results show that biases in estimates arising from very complex measurement errors can be corrected through the proposed Bayesian hierarchical approach.

MCML Authors

Nicole Ellenbach

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

[47]

M. Wünsch, C. Sauer, M. Herrmann, L. C. Hinske and A.-L. Boulesteix.
To tweak or not to tweak. How exploiting flexibilities in gene set analysis leads to over-optimism.
Biometrical Journal 67.1 (Feb. 2025). DOI

Abstract

Gene set analysis, a popular approach for analyzing high-throughput gene expression data, aims to identify sets of genes that show enriched expression patterns between two conditions. In addition to the multitude of methods available for this task, users are typically left with many options when creating the required input and specifying the internal parameters of the chosen method. This flexibility can lead to uncertainty about the “right” choice, further reinforced by a lack of evidence-based guidance. Especially when their statistical experience is scarce, this uncertainty might entice users to produce preferable results using a ’trial-and-error’ approach. While it may seem unproblematic at first glance, this practice can be viewed as a form of ‘cherry-picking’ and cause an optimistic bias, rendering the results nonreplicable on independent data. After this problem has attracted a lot of attention in the context of classical hypothesis testing, we now aim to raise awareness of such overoptimism in the different and more complex context of gene set analyses. We mimic a hypothetical researcher who systematically selects the analysis variants yielding their preferred results, thereby considering three distinct goals they might pursue. Using a selection of popular gene set analysis methods, we tweak the results in this way for two frequently used benchmark gene expression data sets. Our study indicates that the potential for overoptimism is particularly high for a group of methods frequently used despite being commonly criticized. We conclude by providing practical recommendations to counter overoptimism in research findings in gene set analysis and beyond.

MCML Authors

Milena Wünsch

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Christina Sauer (née Nießl)

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[46]

M. Abrahamowicz, M.-E. Beauchamp, A.-L. Boulesteix, T. P. Morris, W. Sauerbrei, J. S. Kaufman and o. b. o. t. STRATOS Simulation Panel.
Data-driven simulations to assess the impact of study imperfections in time-to-event analyses.
American Journal of Epidemiology 194.1 (Jan. 2025). DOI

Abstract

Quantitative bias analysis (QBA) permits assessment of the expected impact of various imperfections of the available data on the results and conclusions of a particular real-world study. This article extends QBA methodology to multivariable time-to-event analyses with right-censored endpoints, possibly including time-varying exposures or covariates. The proposed approach employs data-driven simulations, which preserve important features of the data at hand while offering flexibility in controlling the parameters and assumptions that may affect the results. First, the steps required to perform data-driven simulations are described, and then two examples of real-world time-to-event analyses illustrate their implementation and the insights they may offer. The first example focuses on the omission of an important time-invariant predictor of the outcome in a prognostic study of cancer mortality, and permits separating the expected impact of confounding bias from noncollapsibility. The second example assesses how imprecise timing of an interval-censored event—ascertained only at sparse times of clinic visits—affects its estimated association with a time-varying drug exposure. The simulation results also provide a basis for comparing the performance of two alternative strategies for imputing the unknown event times in this setting. The R scripts that permit the reproduction of our examples are provided.

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

2024

[45]

C. Sauer, A.-L. Boulesteix, L. Hanßum, F. Hodiamont, C. Bausewein and T. Ullmann.
Beyond algorithm hyperparameters: on preprocessing hyperparameters and associated pitfalls in machine learning applications.
Preprint (Dec. 2024). arXiv

Abstract

Adequately generating and evaluating prediction models based on supervised machine learning (ML) is often challenging, especially for less experienced users in applied research areas. Special attention is required in settings where the model generation process involves hyperparameter tuning, i.e. data-driven optimization of different types of hyperparameters to improve the predictive performance of the resulting model. Discussions about tuning typically focus on the hyperparameters of the ML algorithm (e.g., the minimum number of observations in each terminal node for a tree-based algorithm). In this context, it is often neglected that hyperparameters also exist for the preprocessing steps that are applied to the data before it is provided to the algorithm (e.g., how to handle missing feature values in the data). As a consequence, users experimenting with different preprocessing options to improve model performance may be unaware that this constitutes a form of hyperparameter tuning - albeit informal and unsystematic - and thus may fail to report or account for this optimization. To illuminate this issue, this paper reviews and empirically illustrates different procedures for generating and evaluating prediction models, explicitly addressing the different ways algorithm and preprocessing hyperparameters are typically handled by applied ML users. By highlighting potential pitfalls, especially those that may lead to exaggerated performance claims, this review aims to further improve the quality of predictive modeling in ML applications.

MCML Authors

Christina Sauer (née Nießl)

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

Theresa Ullmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

* Former Member

[44]

T. Woehrle, F. Pfeiffer, M. M. Mandl, W. Sobtzick, J. Heitzer, A. Krstova, L. Kamm, M. Feuerecker, D. Moser, M. Klein, B. Aulinger, M. Dolch, A.-L. Boulesteix, D. Lanz and A. Choukér.
Point-of-care breath sample analysis by semiconductor-based E-Nose technology discriminates non-infected subjects from SARS-CoV-2 pneumonia patients: a multi-analyst experiment.
MedComm 5.11 (Nov. 2024). DOI

Abstract

Metal oxide sensor-based electronic nose (E-Nose) technology provides an easy to use method for breath analysis by detection of volatile organic compound (VOC)-induced changes of electrical conductivity. Resulting signal patterns are then analyzed by machine learning (ML) algorithms. This study aimed to establish breath analysis by E-Nose technology as a diagnostic tool for severe acute respiratory syndrome coronavirus type 2 (SARS-CoV-2) pneumonia within a multi-analyst experiment. Breath samples of 126 subjects with (n = 63) or without SARS-CoV-2 pneumonia (n = 63) were collected using the ReCIVA® Breath Sampler, enriched and stored on Tenax sorption tubes, and analyzed using an E-Nose unit with 10 sensors. ML approaches were applied by three independent data analyst teams and included a wide range of classifiers, hyperparameters, training modes, and subsets of training data. Within the multi-analyst experiment, all teams successfully classified individuals as infected or uninfected with an averaged area under the curve (AUC) larger than 90% and misclassification error lower than 19%, and identified the same sensor as most relevant to classification success. This new method using VOC enrichment and E-Nose analysis combined with ML can yield results similar to polymerase chain reaction (PCR) detection and superior to point-of-care (POC) antigen testing. Reducing the sensor set to the most relevant sensor may prove interesting for developing targeted POC testing.

MCML Authors

Maximilian Mandl

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[43]

L. Barreñada, P. Dhiman, D. Timmerman, A.-L. Boulesteix and B. Van Calster.
Understanding overfitting in random forest for probability estimation: a visualization and simulation study.
Diagnostic and Prognostic Research 8.14 (Sep. 2024). DOI

Abstract

Background: Random forests have become popular for clinical risk prediction modeling. In a case study on predicting ovarian malignancy, we observed training AUCs close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behavior of random forests for probability estimation by (1) visualizing data space in three real-world case studies and (2) a simulation study.
Methods: For the case studies, multinomial risk estimates were visualized using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data-generating mechanisms (DGM), varying the predictor distribution, the number of predictors, the correlation between predictors, the true AUC, and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 with binary outcomes were simulated, and random forest models were trained with minimum node size 2 or 20 using the ranger R package, resulting in 192 scenarios in total. Model performance was evaluated on large test datasets (N = 100,000).
Results: The visualizations suggested that the model learned “spikes of probability” around events in the training set. A cluster of events created a bigger peak or plateau (signal), isolated events local peaks (noise). In the simulation study, median training AUCs were between 0.97 and 1 unless there were 4 binary predictors or 16 binary predictors with a minimum node size of 20. The median discrimination loss, i.e., the difference between the median test AUC and the true AUC, was 0.025 (range 0.00 to 0.13). Median training AUCs had Spearman correlations of around 0.70 with discrimination loss. Median test AUCs were higher with higher events per variable, higher minimum node size, and binary predictors. Median training calibration slopes were always above 1 and were not correlated with median test slopes across scenarios (Spearman correlation − 0.11). Median test slopes were higher with higher true AUC, higher minimum node size, and higher sample size.
Conclusions: Random forests learn local probability peaks that often yield near perfect training AUCs without strongly affecting AUCs on test data. When the aim is probability estimation, the simulation results go against the common recommendation to use fully grown trees in random forest models.

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[42]

Y. Li, T. Herold, U. Mansmann and R. Hornung.
Does combining numerous data types in multi-omics data improve or hinder performance in survival prediction? Insights from a large-scale benchmark study.
Earth System Science Data 24.244 (Sep. 2024). DOI

Abstract

Background: Predictive modeling based on multi-omics data, which incorporates several types of omics data for the same patients, has shown potential to outperform single-omics predictive modeling. Most research in this domain focuses on incorporating numerous data types, despite the complexity and cost of acquiring them. The prevailing assumption is that increasing the number of data types necessarily improves predictive performance. However, the integration of less informative or redundant data types could potentially hinder this performance. Therefore, identifying the most effective combinations of omics data types that enhance predictive performance is critical for cost-effective and accurate predictions.
Methods: In this study, we systematically evaluated the predictive performance of all 31 possible combinations including at least one of five genomic data types (mRNA, miRNA, methylation, DNAseq, and copy number variation) using 14 cancer datasets with right-censored survival outcomes, publicly available from the TCGA database. We employed various prediction methods and up-weighted clinical data in every model to leverage their predictive importance. Harrell’s C-index and the integrated Brier Score were used as performance measures. To assess the robustness of our findings, we performed a bootstrap analysis at the level of the included datasets. Statistical testing was conducted for key results, limiting the number of tests to ensure a low risk of false positives.
Results: Contrary to expectations, we found that using only mRNA data or a combination of mRNA and miRNA data was sufficient for most cancer types. For some cancer types, the additional inclusion of methylation data led to improved prediction results. Far from enhancing performance, the introduction of more data types most often resulted in a decline in performance, which varied between the two performance measures.
Conclusions: Our findings challenge the prevailing notion that combining multiple omics data types in multi-omics survival prediction improves predictive performance. Thus, the widespread approach in multi-omics prediction of incorporating as many data types as possible should be reconsidered to avoid suboptimal prediction results and unnecessary expenditure.

MCML Authors

Roman Hornung

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

[41]

R. Hornung and A. Hapfelmeier.
Multi forests: Variable importance for multi-class outcomes.
Preprint (Sep. 2024). arXiv

Abstract

In prediction tasks with multi-class outcomes, identifying covariates specifically associated with one or more outcome classes can be important. Conventional variable importance measures (VIMs) from random forests (RFs), like permutation and Gini importance, focus on overall predictive performance or node purity, without differentiating between the classes. Therefore, they can be expected to fail to distinguish class-associated covariates from covariates that only distinguish between groups of classes. We introduce a VIM called multi-class VIM, tailored for identifying exclusively class-associated covariates, via a novel RF variant called multi forests (MuFs). The trees in MuFs use both multi-way and binary splitting. The multi-way splits generate child nodes for each class, using a split criterion that evaluates how well these nodes represent their respective classes. This setup forms the basis of the multi-class VIM, which measures the discriminatory ability of the splits performed in the respective covariates with regard to this split criterion. Alongside the multi-class VIM, we introduce a second VIM, the discriminatory VIM. This measure, based on the binary splits, assesses the strength of the general influence of the covariates, irrespective of their class-associatedness. Simulation studies demonstrate that the multi-class VIM specifically ranks class-associated covariates highly, unlike conventional VIMs which also rank other types of covariates highly. Analyses of 121 datasets reveal that MuFs often have slightly lower predictive performance compared to conventional RFs. This is, however, not a limiting factor given the algorithm’s primary purpose of calculating the multi-class VIM.

MCML Authors

Roman Hornung

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

[40]

H. Schulz-Kümpel, S. Fischer, T. Nagler, A.-L. Boulesteix, B. Bischl and R. Hornung.
Constructing Confidence Intervals for 'the' Generalization Error – a Comprehensive Benchmark Study.
Preprint (Sep. 2024). arXiv

Abstract

When assessing the quality of prediction models in machine learning, confidence intervals (CIs) for the generalization error, which measures predictive performance, are a crucial tool. Luckily, there exist many methods for computing such CIs and new promising approaches are continuously being proposed. Typically, these methods combine various resampling procedures, most popular among them cross-validation and bootstrapping, with different variance estimation techniques. Unfortunately, however, there is currently no consensus on when any of these combinations may be most reliably employed and how they generally compare. In this work, we conduct the first large-scale study comparing CIs for the generalization error - empirically evaluating 13 different methods on a total of 18 tabular regression and classification problems, using four different inducers and a total of eight loss functions. We give an overview of the methodological foundations and inherent challenges of constructing CIs for the generalization error and provide a concise review of all 13 methods in a unified framework. Finally, the CI methods are evaluated in terms of their relative coverage frequency, width, and runtime. Based on these findings, we are able to identify a subset of methods that we would recommend. We also publish the datasets as a benchmarking suite on OpenML and our code on GitHub to serve as a basis for further studies.

MCML Authors

Hannah Schulz-Kümpel

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Sebastian Fischer

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Thomas Nagler

Prof. Dr.

A1 | Statistical Foundations & Explainability

Computational Statistics & Data Science

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Roman Hornung

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

[39]

M. Herrmann.
Dimensionality and Distance: Curse or Blessing? Geometrical Aspects of Nearest Neighbor Computation in High-Dimensional Data.
GMDS/IBS-DR - 54. Arbeitstagung der Arbeitsgruppen Statistical Computing, Klassifikation und Datenanalyse in den Biowissenschaften. Günzburg, Germany, Jul 28-31, 2024. PDF

Abstract

When it comes to computation, it is often said that high-dimensional data is particularly challenging, known as the curse of dimensionality. For example, in their seminal work, Beyer et al [1] study the impact of high-dimensional data on nearest neighbor computation. They show that in a wide range of settings, including IID data, the difference between the distance to the nearest neighbor and the distance to the most distant neighbor vanishes as the dimension increases. However, it is arguably often overlooked that they also point out that this result does not hold in certain situations, in particular when the intrinsic dimension of the data is low and/or when the data is distributed in well separable subsets. More generally, it is probably less well known that high dimensionality can make computation easier, to the extent that Kainen [2] even speaks of a blessing of dimensionality. Given these different aspects, a natural question to ask is: when is high dimensionality a curse and when is it not (or even a blessing)? In this talk we approach this question from a geometric point of view. Focusing on the aspect of nearest neighbor (and hence distance) computation, we show that high-dimensional data need not be more challenging than low-dimensional data in many practically relevant situations. In particular, using results from extensive experiments on synthetic and real data, we show that this can be the case for both outlier detection and cluster analysis, and for a range of different data types, including image and functional data [3, 4]. Moreover, based on concepts from manifold learning and topological data analysis, we show that these observations can be explained using a common conceptual foundation.

MCML Authors

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

[38]

M. Herrmann, F. J. D. Lange, K. Eggensperger, G. Casalicchio, M. Wever, M. Feurer, D. Rügamer, E. Hüllermeier, A.-L. Boulesteix and B. Bischl.
Position: Why We Must Rethink Empirical Research in Machine Learning.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL

Abstract

We warn against a common but incomplete understanding of empirical research in machine learning (ML) that leads to non-replicable results, makes findings unreliable, and threatens to undermine progress in the field. To overcome this alarming situation, we call for more awareness of the plurality of ways of gaining knowledge experimentally but also of some epistemic limitations. In particular, we argue most current empirical ML research is fashioned as confirmatory research while it should rather be considered exploratory.

MCML Authors

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Marcel Wever

Dr.

A3 | Computational Models
→ Group Eyke Hüllermeier

* Former Member

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistics, Data Science and Machine Learning

Eyke Hüllermeier

Prof. Dr.

A3 | Computational Models

Artificial Intelligence and Machine Learning

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

[37]

M. M. Mandl, A. S. Becker-Pennrich, L. C. Hinske, S. Hoffmann and A.-L. Boulesteix.
Addressing researcher degrees of freedom through minP adjustment.
BMC Medical Research Methodology 24.152 (Jul. 2024). DOI

Abstract

When different researchers study the same research question using the same dataset they may obtain different and potentially even conflicting results. This is because there is often substantial flexibility in researchers’ analytical choices, an issue also referred to as ‘‘researcher degrees of freedom’’. Combined with selective reporting of the smallest p-value or largest effect, researcher degrees of freedom may lead to an increased rate of false positive and overoptimistic results. In this paper, we address this issue by formalizing the multiplicity of analysis strategies as a multiple testing problem. As the test statistics of different analysis strategies are usually highly dependent, a naive approach such as the Bonferroni correction is inappropriate because it leads to an unacceptable loss of power. Instead, we propose using the ‘‘minP’’ adjustment method, which takes potential test dependencies into account and approximates the underlying null distribution of the minimal p-value through a permutation-based procedure. This procedure is known to achieve more power than simpler approaches while ensuring a weak control of the family-wise error rate. We illustrate our approach for addressing researcher degrees of freedom by applying it to a study on the impact of perioperative paO2 on post-operative complications after neurosurgery. A total of 48 analysis strategies are considered and adjusted using the minP procedure. This approach allows to selectively report the result of the analysis strategy yielding the most convincing evidence, while controlling the type 1 error – and thus the risk of publishing false positive results that may not be replicable.

MCML Authors

Maximilian Mandl

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[36]

M. Herrmann, D. Kazempour, F. Scheipl and P. Kröger.
Enhancing cluster analysis via topological manifold learning.
Data Mining and Knowledge Discovery 38 (Apr. 2024). DOI

Abstract

We discuss topological aspects of cluster analysis and show that inferring the topological structure of a dataset before clustering it can considerably enhance cluster detection: we show that clustering embedding vectors representing the inherent structure of a dataset instead of the observed feature vectors themselves is highly beneficial. To demonstrate, we combine manifold learning method UMAP for inferring the topological structure with density-based clustering method DBSCAN. Synthetic and real data results show that this both simplifies and improves clustering in a diverse set of low- and high-dimensional problems including clusters of varying density and/or entangled shapes. Our approach simplifies clustering because topological pre-processing consistently reduces parameter sensitivity of DBSCAN. Clustering the resulting embeddings with DBSCAN can then even outperform complex methods such as SPECTACL and ClusterGAN. Finally, our investigation suggests that the crucial issue in clustering does not appear to be the nominal dimension of the data or how many irrelevant features it contains, but rather how separable the clusters are in the ambient observation space they are embedded in, which is usually the (high-dimensional) Euclidean space defined by the features of the data. The approach is successful because it performs the cluster analysis after projecting the data into a more suitable space that is optimized for separability, in some sense.

MCML Authors

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Fabian Scheipl

PD Dr.

A1 | Statistical Foundations & Explainability

Functional Data Analysis

[35]

G. S. Collins, K. G. M. Moons, P. Dhiman, R. D. Riley, A. L. Beam, B. Van Calster, M. Ghassemi, X. Liu, J. B. Reitsma, M. van Smeden, A.-L. Boulesteix, J. C. Camaradou, L. A. Celi, S. Denaxas, A. K. Denniston, B. Glocker, R. M. Golub, H. Harvey, G. Heinze, M. M. Hoffman, A. P. Kengne, E. Lam, N. Lee, E. W. Loder, L. Maier-Hein, B. A. Mateen, M. D. McCradden, L. Oakden-Rayner, J. Ordish, R. Parnell, S. Rose, K. Singh, L. Wynants and P. Logullo.
TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods.
The BMJ 385.e078378 (Apr. 2024). DOI

Abstract

The TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) statement was published in 2015 to provide the minimum reporting recommendations for studies developing or evaluating the performance of a prediction model. Methodological advances in the field of prediction have since included the widespread use of artificial intelligence (AI) powered by machine learning methods to develop prediction models. An update to the TRIPOD statement is thus needed. TRIPOD+AI provides harmonised guidance for reporting prediction model studies, irrespective of whether regression modelling or machine learning methods have been used. The new checklist supersedes the TRIPOD 2015 checklist, which should no longer be used. This article describes the development of TRIPOD+AI and presents the expanded 27 item checklist with more detailed explanation of each reporting recommendation, and the TRIPOD+AI for Abstracts checklist. TRIPOD+AI aims to promote the complete, accurate, and transparent reporting of studies that develop a prediction model or evaluate its performance. Complete reporting will facilitate study appraisal, model evaluation, and model implementation.

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[34]

M. Wünsch, M. Herrmann, E. Noltenius, M. Mohr, T. P. Morris and A.-L. Boulesteix.
On the handling of method failure in comparison studies.
Preprint (Apr. 2024). arXiv

Abstract

Comparison studies in methodological research are intended to compare methods in an evidence-based manner, offering guidance to data analysts to select a suitable method for their application. To provide trustworthy evidence, they must be carefully designed, implemented, and reported, especially given the many decisions made in planning and running. A common challenge in comparison studies is to handle the ``failure’’ of one or more methods to produce a result for some (real or simulated) data sets, such that their performances cannot be measured in those instances. Despite an increasing emphasis on this topic in recent literature (focusing on non-convergence as a common manifestation), there is little guidance on proper handling and interpretation, and reporting of the chosen approach is often neglected. This paper aims to fill this gap and provides practical guidance for handling method failure in comparison studies. In particular, we show that the popular approaches of discarding data sets yielding failure (either for all or the failing methods only) and imputing are inappropriate in most cases. We also discuss how method failure in published comparison studies – in various contexts from classical statistics and predictive modeling – may manifest differently, but is often caused by a complex interplay of several aspects. Building on this, we provide recommendations derived from realistic considerations on suitable fallbacks when encountering method failure, hence avoiding the need for discarding data sets or imputation. Finally, we illustrate our recommendations and the dangers of inadequate handling of method failure through two illustrative comparison studies.

MCML Authors

Milena Wünsch

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[33]

C. Nießl, S. Hoffmann, T. Ullmann and A.-L. Boulesteix.
Explaining the optimistic performance evaluation of newly proposed methods: A cross-design validation experiment.
Biometrical Journal 66.1 (Jan. 2024). DOI

Abstract

The constant development of new data analysis methods in many fields of research is accompanied by an increasing awareness that these new methods often perform better in their introductory paper than in subsequent comparison studies conducted by other researchers. We attempt to explain this discrepancy by conducting a systematic experiment that we call “cross-design validation of methods”. In the experiment, we select two methods designed for the same data analysis task, reproduce the results shown in each paper, and then reevaluate each method based on the study design (i.e., datasets, competing methods, and evaluation criteria) that was used to show the abilities of the other method. We conduct the experiment for two data analysis tasks, namely cancer subtyping using multiomic data and differential gene expression analysis. Three of the four methods included in the experiment indeed perform worse when they are evaluated on the new study design, which is mainly caused by the different datasets. Apart from illustrating the many degrees of freedom existing in the assessment of a method and their effect on its performance, our experiment suggests that the performance discrepancies between original and subsequent papers may not only be caused by the nonneutrality of the authors proposing the new method but also by differences regarding the level of expertise and field of application. Authors of new methods should thus focus not only on a transparent and extensive evaluation but also on comprehensive method documentation that enables the correct use of their methods in subsequent studies.

MCML Authors

Christina Sauer (née Nießl)

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Theresa Ullmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

* Former Member

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[32]

M. M. Mandl, S. Hoffmann, S. Bieringer, A. E. Jacob, M. Kraft, S. Lemster and A.-L. Boulesteix.
Raising awareness of uncertain choices in empirical data analysis: A teaching concept toward replicable research practices.
PLOS Computational Biology 20.3 (2024). DOI

Abstract

Throughout their education and when reading the scientific literature, students may get the impression that there is a unique and correct analysis strategy for every data analysis task and that this analysis strategy will always yield a significant and noteworthy result. This expectation conflicts with a growing realization that there is a multiplicity of possible analysis strategies in empirical research, which will lead to overoptimism and nonreplicable research findings if it is combined with result-dependent selective reporting. Here, we argue that students are often ill-equipped for real-world data analysis tasks and unprepared for the dangers of selectively reporting the most promising results. We present a seminar course intended for advanced undergraduates and beginning graduate students of data analysis fields such as statistics, data science, or bioinformatics that aims to increase the awareness of uncertain choices in the analysis of empirical data and present ways to deal with these choices through theoretical modules and practical hands-on sessions.

MCML Authors

Maximilian Mandl

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[31]

B. S. Siepe, F. Bartoš, T. P. Morris, A.-L. Boulesteix, D. W. Heck and S. Pawel.
Simulation Studies for Methodological Research in Psychology: A Standardized Template for Planning, Preregistration, and Reporting.
Psychological Methods Advance online publication (2024). DOI

Abstract

Simulation studies are widely used for evaluating the performance of statistical methods in psychology. However, the quality of simulation studies can vary widely in terms of their design, execution, and reporting. In order to assess the quality of typical simulation studies in psychology, we reviewed 321 articles published in Psychological Methods, Behavior Research Methods, and Multivariate Behavioral Research in 2021 and 2022, among which 100/321 = 31.2% report a simulation study. We find that many articles do not provide complete and transparent information about key aspects of the study, such as justifications for the number of simulation repetitions, Monte Carlo uncertainty estimates, or code and data to reproduce the simulation studies. To address this problem, we provide a summary of the ADEMP (aims, data-generating mechanism, estimands and other targets, methods, performance measures) design and reporting framework from Morris et al. (2019) adapted to simulation studies in psychology. Based on this framework, we provide ADEMP-PreReg, a step-by-step template for researchers to use when designing, potentially preregistering, and reporting their simulation studies. We give formulae for estimating common performance measures, their Monte Carlo standard errors, and for calculating the number of simulation repetitions to achieve a desired Monte Carlo standard error. Finally, we give a detailed tutorial on how to apply the ADEMP framework in practice using an example simulation study on the evaluation of methods for the analysis of pre–post measurement experiments. (PsycInfo Database Record (c) 2024 APA, all rights reserved)

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[30]

Z. S. Dunias, B. Van Calster, D. Timmerman, A.-L. Boulesteix and M. van Smeden.
A comparison of hyperparameter tuning procedures for clinical prediction models: A simulation study.
Statistics in Medicine (Jan. 2024). DOI

Abstract

Tuning hyperparameters, such as the regularization parameter in Ridge or Lasso regression, is often aimed at improving the predictive performance of risk prediction models. In this study, various hyperparameter tuning procedures for clinical prediction models were systematically compared and evaluated in low-dimensional data. The focus was on out-of-sample predictive performance (discrimination, calibration, and overall prediction error) of risk prediction models developed using Ridge, Lasso, Elastic Net, or Random Forest. The influence of sample size, number of predictors and events fraction on performance of the hyperparameter tuning procedures was studied using extensive simulations. The results indicate important differences between tuning procedures in calibration performance, while generally showing similar discriminative performance. The one-standard-error rule for tuning applied to cross-validation (1SE CV) often resulted in severe miscalibration. Standard non-repeated and repeated cross-validation (both 5-fold and 10-fold) performed similarly well and outperformed the other tuning procedures. Bootstrap showed a slight tendency to more severe miscalibration than standard cross-validation-based tuning procedures. Differences between tuning procedures were larger for smaller sample sizes, lower events fractions and fewer predictors. These results imply that the choice of tuning procedure can have a profound influence on the predictive performance of prediction models. The results support the application of standard 5-fold or 10-fold cross-validation that minimizes out-of-sample prediction error. Despite an increased computational burden, we found no clear benefit of repeated over non-repeated cross-validation for hyperparameter tuning. We warn against the potentially detrimental effects on model calibration of the popular 1SE CV rule for tuning prediction models in low-dimensional settings.

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[29]

R. Hornung, F. Ludwigs, J. Hagenberg and A.-L. Boulesteix.
Prediction approaches for partly missing multi-omics covariate data: A literature review and an empirical comparison study.
Wiley Interdisciplinary Reviews: Computational Statistics 16.1 (Jan. 2024). DOI

Abstract

As the availability of omics data has increased in the last few years, more multi-omics data have been generated, that is, high-dimensional molecular data consisting of several types such as genomic, transcriptomic, or proteomic data, all obtained from the same patients. Such data lend themselves to being used as covariates in automatic outcome prediction because each omics type may contribute unique information, possibly improving predictions compared to using only one omics data type. Frequently, however, in the training data and the data to which automatic prediction rules should be applied, the test data, the different omics data types are not available for all patients. We refer to this type of data as block-wise missing multi-omics data. First, we provide a literature review on existing prediction methods applicable to such data. Subsequently, using a collection of 13 publicly available multi-omics data sets, we compare the predictive performances of several of these approaches for different block-wise missingness patterns. Finally, we discuss the results of this empirical comparison study and draw some tentative conclusions.

MCML Authors

Roman Hornung

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[28]

M. Wünsch, C. Sauer, P. Callahan, L. C. Hinske and A.-L. Boulesteix.
From RNA sequencing measurements to the final results: a practical guide to navigating the choices and uncertainties of gene set analysis.
Wiley Interdisciplinary Reviews: Computational Statistics 16.1 (Jan. 2024). DOI

Abstract

Gene set analysis (GSA), a popular approach for analyzing high-throughput gene expression data, aims to identify sets of related genes that show significantly enriched or depleted expression patterns between different conditions. In the last years, a multitude of methods have been developed for this task. However, clear guidance is lacking: choosing the right method is the first hurdle a researcher is confronted with. No less challenging than overcoming this so-called method uncertainty is the procedure of preprocessing, from knowing which steps are required to selecting a corresponding approach from the plethora of valid options to create the accepted input object (data preprocessing uncertainty), with clear guidance again being scarce. Here, we provide a practical guide through all steps required to conduct GSA, beginning with a concise overview of a selection of established methods, including Gene Set Enrichment Analysis and Database for Annotation, Visualization, and Integrated Discovery (DAVID). We thereby lay a special focus on reviewing and explaining the necessary preprocessing steps for each method under consideration (e.g., the necessity of a transformation of the RNA sequencing data)—an essential aspect that is typically paid only limited attention to in both existing reviews and applications. To raise awareness of the spectrum of uncertainties, our review is accompanied by an extensive overview of the literature on valid approaches for each step and illustrative R code demonstrating the complex analysis pipelines. It ends with a discussion and recommendations to both users and developers to ensure that the results of GSA are, despite the above-mentioned uncertainties, replicable and transparent.

MCML Authors

Milena Wünsch

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Christina Sauer (née Nießl)

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

2023

[27]

J. Gauss, F. Scheipl and M. Herrmann.
DCSI–An improved measure of cluster separability based on separation and connectedness.
Preprint (Oct. 2023). arXiv

Abstract

Whether class labels in a given data set correspond to meaningful clusters is crucial for the evaluation of clustering algorithms using real-world data sets. This property can be quantified by separability measures. The central aspects of separability for density-based clustering are between-class separation and within-class connectedness, and neither classification-based complexity measures nor cluster validity indices (CVIs) adequately incorporate them. A newly developed measure (density cluster separability index, DCSI) aims to quantify these two characteristics and can also be used as a CVI. Extensive experiments on synthetic data indicate that DCSI correlates strongly with the performance of DBSCAN measured via the adjusted Rand index (ARI) but lacks robustness when it comes to multi-class data sets with overlapping classes that are ill-suited for density-based hard clustering. Detailed evaluation on frequently used real-world data sets shows that DCSI can correctly identify touching or overlapping classes that do not correspond to meaningful density-based clusters.

MCML Authors

Fabian Scheipl

PD Dr.

A1 | Statistical Foundations & Explainability

Functional Data Analysis

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

[26]

S. Hoffmann, F. Scheipl and A.-L. Boulesteix.
Reproduzierbare und replizierbare Forschung.
Moderne Verfahren der Angewandten Statistik (Sep. 2023). DOI

Abstract

In den letzten Jahren haben Berichte über die fehlende Replizierbarkeit und Reproduzierbarkeit von Forschungsergebnissen viel Aufmerksamkeit erhalten und dazu geführt, dass die Art und Weise, wie wissenschaftliche Studien geplant, analysiert und berichtet werden, hinterfragt wird. Bei der statistischen Planung und Auswertung wissenschaftlicher Studien muss eine Vielzahl von Entscheidungen getroffen werden, ohne dass es dabei eindeutig richtige oder falsche Wahlmöglichkeiten gäbe. Hier wird erläutert, wie diese Multiplizität an möglichen Analysestrategien, die durch Modell-, Datenaufbereitungs- und Methodenunsicherheit beschrieben werden kann, in Verbindung mit selektiver Berichterstattung zu Ergebnissen führen kann, die sich auf unabhängigen Daten nicht replizieren lassen. Zudem werden Lösungsstrategien vorgestellt, mit denen die Replizierbarkeit der Ergebnisse verbessert werden kann, und Praktiken und Hilfsmittel vorgestellt, mit denen durchgeführte Analysen reproduzierbar werden können.

MCML Authors

Fabian Scheipl

PD Dr.

A1 | Statistical Foundations & Explainability

Functional Data Analysis

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[25]

I. van Mechelen, A.-L. Boulesteix, R. Dangl, N. Dean, C. Hennig, F. Leisch, D. Steinley and M. J. Warrens.
A white paper on good research practices in benchmarking: The case of cluster analysis.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 13.6 (Jul. 2023). DOI

Abstract

To achieve scientific progress in terms of building a cumulative body of knowledge, careful attention to benchmarking is of the utmost importance, requiring that proposals of new methods are extensively and carefully compared with their best predecessors, and existing methods subjected to neutral comparison studies. Answers to benchmarking questions should be evidence-based, with the relevant evidence being collected through well-thought-out procedures, in reproducible and replicable ways. In the present paper, we review good research practices in benchmarking from the perspective of the area of cluster analysis. Discussion is given to the theoretical, conceptual underpinnings of benchmarking based on simulated and empirical data in this context. Subsequently, the practicalities of how to address benchmarking questions in clustering are dealt with, and foundational recommendations are made based on existing literature.

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[24]

M. Herrmann, F. Pfisterer and F. Scheipl.
A geometric framework for outlier detection in high-dimensional data.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery e1491 (Apr. 2023). DOI

Abstract

Outlier or anomaly detection is an important task in data analysis. We discuss the problem from a geometrical perspective and provide a framework which exploits the metric structure of a data set. Our approach rests on the manifold assumption, that is, that the observed, nominally high-dimensional data lie on a much lower dimensional manifold and that this intrinsic structure can be inferred with manifold learning methods. We show that exploiting this structure significantly improves the detection of outlying observations in high dimensional data. We also suggest a novel, mathematically precise and widely applicable distinction between distributional and structural outliers based on the geometry and topology of the data manifold that clarifies conceptual ambiguities prevalent throughout the literature. Our experiments focus on functional data as one class of structured high-dimensional data, but the framework we propose is completely general and we include image and graph data applications. Our results show that the outlier structure of high-dimensional and non-tabular data can be detected and visualized using manifold learning methods and quantified using standard outlier scoring methods applied to the manifold embedding vectors.

MCML Authors

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Fabian Scheipl

PD Dr.

A1 | Statistical Foundations & Explainability

Functional Data Analysis

[23]

T. Ullmann, A. Beer, M. Hünemörder, T. Seidl and A.-L. Boulesteix.
Over-optimistic evaluation and reporting of novel cluster algorithms: An illustrative study.
Advances in Data Analysis and Classification 17 (Mar. 2023). DOI

Abstract

When researchers publish new cluster algorithms, they usually demonstrate the strengths of their novel approaches by comparing the algorithms’ performance with existing competitors. However, such studies are likely to be optimistically biased towards the new algorithms, as the authors have a vested interest in presenting their method as favorably as possible in order to increase their chances of getting published. Therefore, the superior performance of newly introduced cluster algorithms is over-optimistic and might not be confirmed in independent benchmark studies performed by neutral and unbiased authors. This problem is known among many researchers, but so far, the different mechanisms leading to over-optimism in cluster algorithm evaluation have never been systematically studied and discussed. Researchers are thus often not aware of the full extent of the problem. We present an illustrative study to illuminate the mechanisms by which authors—consciously or unconsciously—paint their cluster algorithm’s performance in an over-optimistic light. Using the recently published cluster algorithm Rock as an example, we demonstrate how optimization of the used datasets or data characteristics, of the algorithm’s parameters and of the choice of the competing cluster algorithms leads to Rock’s performance appearing better than it actually is. Our study is thus a cautionary tale that illustrates how easy it can be for researchers to claim apparent ‘superiority’ of a new cluster algorithm. This illuminates the vital importance of strategies for avoiding the problems of over-optimism (such as, e.g., neutral benchmark studies), which we also discuss in the article.

MCML Authors

Theresa Ullmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

* Former Member

Anna Beer

Dr.

A3 | Computational Models
→ Group Thomas Seidl

* Former Member

Thomas Seidl

Prof. Dr.

A3 | Computational Models

Database Systems and Data Mining AI Lab

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[22]

B. Bischl, M. Binder, M. Lang, T. Pielok, J. Richter, S. Coors, J. Thomas, T. Ullmann, M. Becker, A.-L. Boulesteix, D. Deng and M. Lindauer.
Hyperparameter optimization: Foundations, algorithms, best practices, and open challenges.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 13.2 (Mar. 2023). DOI

Abstract

Most machine learning algorithms are configured by a set of hyperparameters whose values must be carefully chosen and which often considerably impact performance. To avoid a time-consuming and irreproducible manual process of trial-and-error to find well-performing hyperparameter configurations, various automatic hyperparameter optimization (HPO) methods—for example, based on resampling error estimation for supervised machine learning—can be employed. After introducing HPO from a general perspective, this paper reviews important HPO methods, from simple techniques such as grid or random search to more advanced methods like evolution strategies, Bayesian optimization, Hyperband, and racing. This work gives practical recommendations regarding important choices to be made when conducting HPO, including the HPO algorithms themselves, performance evaluation, how to combine HPO with machine learning pipelines, runtime improvements, and parallelization.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Martin Binder

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Michel Lang

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Tobias Pielok

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Stefan Coors

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Janek Thomas

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Theresa Ullmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

* Former Member

Marc Becker

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[21]

T. Ullmann, S. Peschel, P. Finger, C. L. Müller and A.-L. Boulesteix.
Over-optimism in unsupervised microbiome analysis: Insights from network learning and clustering.
PLOS Computational Biology 19.1 (Jan. 2023). DOI

Abstract

In recent years, unsupervised analysis of microbiome data, such as microbial network analysis and clustering, has increased in popularity. Many new statistical and computational methods have been proposed for these tasks. This multiplicity of analysis strategies poses a challenge for researchers, who are often unsure which method(s) to use and might be tempted to try different methods on their dataset to look for the “best” ones. However, if only the best results are selectively reported, this may cause over-optimism: the “best” method is overly fitted to the specific dataset, and the results might be non-replicable on validation data. Such effects will ultimately hinder research progress. Yet so far, these topics have been given little attention in the context of unsupervised microbiome analysis. In our illustrative study, we aim to quantify over-optimism effects in this context. We model the approach of a hypothetical microbiome researcher who undertakes four unsupervised research tasks: clustering of bacterial genera, hub detection in microbial networks, differential microbial network analysis, and clustering of samples. While these tasks are unsupervised, the researcher might still have certain expectations as to what constitutes interesting results. We translate these expectations into concrete evaluation criteria that the hypothetical researcher might want to optimize. We then randomly split an exemplary dataset from the American Gut Project into discovery and validation sets multiple times. For each research task, multiple method combinations (e.g., methods for data normalization, network generation, and/or clustering) are tried on the discovery data, and the combination that yields the best result according to the evaluation criterion is chosen. While the hypothetical researcher might only report this result, we also apply the “best” method combination to the validation dataset. The results are then compared between discovery and validation data. In all four research tasks, there are notable over-optimism effects; the results on the validation data set are worse compared to the discovery data, averaged over multiple random splits into discovery/validation data. Our study thus highlights the importance of validation and replication in microbiome analysis to obtain reliable results and demonstrates that the issue of over-optimism goes beyond the context of statistical testing and fishing for significance.

MCML Authors

Theresa Ullmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

* Former Member

Stefanie Peschel

C2 | Biology
→ Group Christian Müller

Biomedical Statistics and Data Science

Christian Müller

Prof. Dr.

C2 | Biology

Biomedical Statistics and Data Science

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

2022

[20]

T. Ullmann.
Evaluation of clustering results and novel cluster algorithms: a metascientific perspective.
Dissertation 2022. DOI

Abstract

This dissertation addresses the reliability of clustering results and the evaluation of new clustering algorithms, particularly in light of the replication crisis in scientific research. The first contribution presents a framework for validating clustering results using validation data, ensuring the replicability and generalizability of findings. The second contribution quantifies over-optimistic bias in microbiome research by analyzing the effects of multiple analysis strategies on unsupervised tasks, while the third contribution highlights the over-optimism in evaluating new clustering algorithms, using the example of the ‘Rock’ algorithm, and advocates for more rigorous and neutral benchmarking methods. (Shortened.)

MCML Authors

Theresa Ullmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

* Former Member

[19]

M. Herrmann.
Towards more reliable machine learning: conceptual insights and practical approaches for unsupervised manifold learning and supervised benchmark studies.
Dissertation 2022. DOI

Abstract

This thesis focuses on improving the reliability and trustworthiness of machine learning, particularly in unsupervised learning methods like manifold learning. It investigates the challenges of evaluating manifold learning techniques and proposes improvements for embedding evaluation, outlier detection, and cluster analysis, using methods like UMAP and DBSCAN. Additionally, the thesis contributes to supervised learning by presenting a benchmark study on survival prediction in multi-omics cancer data and exploring the effects of design and analysis choices on benchmark results. (Shortened).

MCML Authors

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

[18]

M. van Smeden, G. Heinze, B. Van Calster, F. W. Asselbergs, P. E. Vardas, N. Bruining, P. de Jaegere, J. H. Moore, S. Denaxas, A.-L. Boulesteix and K. G. M. Moons.
Critical appraisal of artificial intelligence-based prediction models for cardiovascular disease.
European Heart Journal 43.31 (Aug. 2022). DOI

Abstract

The medical field has seen a rapid increase in the development of artificial intelligence (AI)-based prediction models. With the introduction of such AI-based prediction model tools and software in cardiovascular patient care, the cardiovascular researcher and healthcare professional are challenged to understand the opportunities as well as the limitations of the AI-based predictions. In this article, we present 12 critical questions for cardiovascular health professionals to ask when confronted with an AI-based prediction model. We aim to support medical professionals to distinguish the AI-based prediction models that can add value to patient care from the AI that does not.

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[17]

T. Ullmann, C. Hennig and A.-L. Boulesteix.
Validation of cluster analysis results on validation data: A systematic framework.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 12.3 (May. 2022). DOI

Abstract

Cluster analysis refers to a wide range of data analytic techniques for class discovery and is popular in many application fields. To assess the quality of a clustering result, different cluster validation procedures have been proposed in the literature. While there is extensive work on classical validation techniques, such as internal and external validation, less attention has been given to validating and replicating a clustering result using a validation dataset. Such a dataset may be part of the original dataset, which is separated before analysis begins, or it could be an independently collected dataset. We present a systematic, structured review of the existing literature about this topic. For this purpose, we outline a formal framework that covers most existing approaches for validating clustering results on validation data. In particular, we review classical validation techniques such as internal and external validation, stability analysis, and visual validation, and show how they can be interpreted in terms of our framework. We define and formalize different types of validation of clustering results on a validation dataset, and give examples of how clustering studies from the applied literature that used a validation dataset can be seen as instances of our framework.

MCML Authors

Theresa Ullmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

* Former Member

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[16]

C. Nießl, M. Herrmann, C. Wiedemann, G. Casalicchio and A.-L. Boulesteix.
Over-optimism in benchmark studies and the multiplicity of design and analysis options when interpreting their results.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 12.2 (Mar. 2022). DOI

Abstract

In recent years, the need for neutral benchmark studies that focus on the comparison of methods coming from computational sciences has been increasingly recognized by the scientific community. While general advice on the design and analysis of neutral benchmark studies can be found in recent literature, a certain flexibility always exists. This includes the choice of data sets and performance measures, the handling of missing performance values, and the way the performance values are aggregated over the data sets. As a consequence of this flexibility, researchers may be concerned about how their choices affect the results or, in the worst case, may be tempted to engage in questionable research practices (e.g., the selective reporting of results or the post hoc modification of design or analysis components) to fit their expectations. To raise awareness for this issue, we use an example benchmark study to illustrate how variable benchmark results can be when all possible combinations of a range of design and analysis options are considered. We then demonstrate how the impact of each choice on the results can be assessed using multidimensional unfolding. In conclusion, based on previous literature and on our illustrative example, we claim that the multiplicity of design and analysis options combined with questionable research practices lead to biased interpretations of benchmark results and to over-optimistic conclusions. This issue should be considered by computational researchers when designing and analyzing their benchmark studies and by the scientific community in general in an effort towards more reliable benchmark results.

MCML Authors

Christina Sauer (née Nießl)

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

2021

[15]

M. Herrmann and F. Scheipl.
A Geometric Perspective on Functional Outlier Detection.
Stats 4.4 (Nov. 2021). DOI

Abstract

We consider functional outlier detection from a geometric perspective, specifically: for functional datasets drawn from a functional manifold, which is defined by the data’s modes of variation in shape, translation, and phase. Based on this manifold, we developed a conceptualization of functional outlier detection that is more widely applicable and realistic than previously proposed taxonomies. Our theoretical and experimental analyses demonstrated several important advantages of this perspective: it considerably improves theoretical understanding and allows describing and analyzing complex functional outlier scenarios consistently and in full generality, by differentiating between structurally anomalous outlier data that are off-manifold and distributionally outlying data that are on-manifold, but at its margins. This improves the practical feasibility of functional outlier detection: we show that simple manifold-learning methods can be used to reliably infer and visualize the geometric structure of functional datasets. We also show that standard outlier-detection methods requiring tabular data inputs can be applied to functional data very successfully by simply using their vector-valued representations learned from manifold learning methods as the input features. Our experiments on synthetic and real datasets demonstrated that this approach leads to outlier detection performances at least on par with existing functional-data-specific methods in a large variety of settings, without the highly specialized, complex methodology and narrow domain of application these methods often entail.

MCML Authors

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Fabian Scheipl

PD Dr.

A1 | Statistical Foundations & Explainability

Functional Data Analysis

[14]

H. Seibold, A. Charlton, A.-L. Boulesteix and S. Hoffmann.
Statisticians, Roll Up Your Sleeves! There's A Crisis to be Solved.
Significance 18.4 (Aug. 2021). DOI

Abstract

Statisticians play a key role in almost all scientific research. As such, they may be key to solving the reproducibility crisis. Heidi Seibold, Alethea Charlton, Anne-Laure Boulesteix and Sabine Hoffmann urge statisticians to take an active role in promoting more credible science.

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[13]

S. Klau, S. Hoffmann, C. J. Patel, J. P. A. Ioannidis and A.-L. Boulesteix.
Examining the robustness of observational associations to model, measurement and sampling uncertainty with the vibration of effects framework.
International Journal of Epidemiology 50.1 (Feb. 2021). DOI

Abstract

Uncertainty is a crucial issue in statistics which can be considered from different points of view. One type of uncertainty, typically referred to as sampling uncertainty, arises through the variability of results obtained when the same analysis strategy is applied to different samples. Another type of uncertainty arises through the variability of results obtained when using the same sample but different analysis strategies addressing the same research question. We denote this latter type of uncertainty as method uncertainty. It results from all the choices to be made for an analysis, for example, decisions related to data preparation, method choice, or model selection. In medical sciences, a large part of omics research is focused on the identification of molecular biomarkers, which can either be performed through ranking or by selection from among a large number of candidates. In this paper, we introduce a general resampling-based framework to quantify and compare sampling and method uncertainty. For illustration, we apply this framework to different scenarios related to the selection and ranking of omics biomarkers in the context of acute myeloid leukemia: variable selection in multivariable regression using different types of omics markers, the ranking of biomarkers according to their predictive performance, and the identification of differentially expressed genes from RNA-seq data. For all three scenarios, our findings suggest highly unstable results when the same analysis strategy is applied to two independent samples, indicating high sampling uncertainty and a comparatively smaller, but non-negligible method uncertainty, which strongly depends on the methods being compared.

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[12]

H. Seibold, S. Czerny, S. Decke, R. Dieterle, T. Eder, S. Fohr, N. Hahn, R. Hartmann, C. Heindl, P. Kopper, D. Lepke, V. Loidl, M. M. Mandl, S. Musiol, J. Peter, A. Piehler, E. Rojas, S. Schmid, H. Schmidt, M. Schmoll, L. Schneider, X.-Y. To, V. Tran, A. Völker, M. Wagner, J. Wagner, M. Waize, H. Wecker, R. Yang, S. Zellner and M. Nalenz.
A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses.
PLOS One 16.6 (2021). DOI

Abstract

Computational reproducibility is a corner stone for sound and credible research. Especially in complex statistical analyses—such as the analysis of longitudinal data—reproducing results is far from simple, especially if no source code is available. In this work we aimed to reproduce analyses of longitudinal data of 11 articles published in PLOS ONE. Inclusion criteria were the availability of data and author consent. We investigated the types of methods and software used and whether we were able to reproduce the data analysis using open source software. Most articles provided overview tables and simple visualisations. Generalised Estimating Equations (GEEs) were the most popular statistical models among the selected articles. Only one article used open source software and only one published part of the analysis code. Replication was difficult in most cases and required reverse engineering of results or contacting the authors. For three articles we were not able to reproduce the results, for another two only parts of them. For all but two articles we had to contact the authors to be able to reproduce the results. Our main learning is that reproducing papers is difficult if no code is supplied and leads to a high burden for those conducting the reproductions. Open data policies in journals are good, but to truly boost reproducibility we suggest adding open code policies.

MCML Authors

Maximilian Mandl

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Xiao-Yin To

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Viet Tran

C2 | Biology
→ Group Christian Müller

Biomedical Statistics and Data Science

2020

[11]

M. Herrmann and F. Scheipl.
Unsupervised Functional Data Analysis via Nonlinear Dimension Reduction.
Preprint (Dec. 2020). arXiv

Abstract

In recent years, manifold methods have moved into focus as tools for dimension reduction. Assuming that the high-dimensional data actually lie on or close to a low-dimensional nonlinear manifold, these methods have shown convincing results in several settings. This manifold assumption is often reasonable for functional data, i.e., data representing continuously observed functions, as well. However, the performance of manifold methods recently proposed for tabular or image data has not been systematically assessed in the case of functional data yet. Moreover, it is unclear how to evaluate the quality of learned embeddings that do not yield invertible mappings, since the reconstruction error cannot be used as a performance measure for such representations. In this work, we describe and investigate the specific challenges for nonlinear dimension reduction posed by the functional data setting. The contributions of the paper are three-fold: First of all, we define a theoretical framework which allows to systematically assess specific challenges that arise in the functional data context, transfer several nonlinear dimension reduction methods for tabular and image data to functional data, and show that manifold methods can be used successfully in this setting. Secondly, we subject performance assessment and tuning strategies to a thorough and systematic evaluation based on several different functional data settings and point out some previously undescribed weaknesses and pitfalls which can jeopardize reliable judgment of embedding quality. Thirdly, we propose a nuanced approach to make trustworthy decisions for or against competing nonconforming embeddings more objectively.

MCML Authors

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Fabian Scheipl

PD Dr.

A1 | Statistical Foundations & Explainability

Functional Data Analysis

[10]

A.-L. Boulesteix, S. Hoffmann, A. Charlton and H. Seibold.
A replication crisis in methodological research?
Significance 17.5 (Oct. 2020). DOI

Abstract

Statisticians have been keen to critique statistical aspects of the enquote{replication crisis} in other scientific disciplines. But new statistical tools are often published and promoted without any thought to replicability. This needs to change, argue Anne-Laure Boulesteix, Sabine Hoffmann, Alethea Charlton and Heidi Seibold.

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[9]

M. Herrmann, P. Probst, R. Hornung, V. Jurinovic and A.-L. Boulesteix.
Large-scale benchmark study of survival prediction methods using multi-omics data.
Briefings in Bioinformatics (Aug. 2020). DOI

Abstract

Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly often generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate to derive such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied on 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database ‘The Cancer Genome Atlas’ (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and that do not take the group structure of the omics variables into account. The Kaplan–Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno’s C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking into account the multi-omics structure have a slightly better prediction performance. Taking this structure into account can protect the predictive information in low-dimensional groups—especially clinical variables—from not being exploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited.

MCML Authors

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Roman Hornung

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[8]

N. Ellenbach, A.-L. Boulesteix, B. Bischl, K. Unger and R. Hornung.
Improved outcome prediction across data sources through robust parameter tuning.
Journal of Classification 38 (Jul. 2020). DOI

Abstract

In many application areas, prediction rules trained based on high-dimensional data are subsequently applied to make predictions for observations from other sources, but they do not always perform well in this setting. This is because data sets from different sources can feature (slightly) differing distributions, even if they come from similar populations. In the context of high-dimensional data and beyond, most prediction methods involve one or several tuning parameters. Their values are commonly chosen by maximizing the cross-validated prediction performance on the training data. This procedure, however, implicitly presumes that the data to which the prediction rule will be ultimately applied, follow the same distribution as the training data. If this is not the case, less complex prediction rules that slightly underfit the training data may be preferable. Indeed, a tuning parameter does not only control the degree of adjustment of a prediction rule to the training data, but also, more generally, the degree of adjustment to the distribution of the training data. On the basis of this idea, in this paper we compare various approaches including new procedures for choosing tuning parameter values that lead to better generalizing prediction rules than those obtained based on cross-validation. Most of these approaches use an external validation data set. In our extensive comparison study based on a large collection of 15 transcriptomic data sets, tuning on external data and robust tuning with a tuned robustness parameter are the two approaches leading to better generalizing prediction rules.

MCML Authors

Nicole Ellenbach

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Roman Hornung

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

[7]

C. Stachl, Q. Au, R. Schoedel, S. D. Gosling, G. M. Harari, D. Buschek, S. T. Völkel, T. Schuwerk, M. Oldemeier, T. Ullmann, H. Hussmann, B. Bischl and M. Bühner.
Predicting personality from patterns of behavior collected with smartphones.
Proceedings of the National Academy of Sciences 117.30 (Jul. 2020). DOI

Abstract

Smartphones enjoy high adoption rates around the globe. Rarely more than an arm’s length away, these sensor-rich devices can easily be repurposed to collect rich and extensive records of their users’ behaviors (e.g., location, communication, media consumption), posing serious threats to individual privacy. Here we examine the extent to which individuals’ Big Five personality dimensions can be predicted on the basis of six different classes of behavioral information collected via sensor and log data harvested from smartphones. Taking a machine-learning approach, we predict personality at broad domain ( = 0.37) and narrow facet levels ( = 0.40) based on behavioral data collected from 624 volunteers over 30 consecutive days (25,347,089 logging events). Our cross-validated results reveal that specific patterns in behaviors in the domains of 1) communication and social behavior, 2) music consumption, 3) app usage, 4) mobility, 5) overall phone activity, and 6) day- and night-time activity are distinctively predictive of the Big Five personality traits. The accuracy of these predictions is similar to that found for predictions based on digital footprints from social media platforms and demonstrates the possibility of obtaining information about individuals’ private traits from behavioral patterns passively collected from their smartphones. Overall, our results point to both the benefits (e.g., in research settings) and dangers (e.g., privacy implications, psychological targeting) presented by the widespread collection and modeling of behavioral data obtained from smartphones.

MCML Authors

Theresa Ullmann

Dr.

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

[6]

S. Klau, M.-L. Martin-Magniette, A.-L. Boulesteix and S. Hoffmann.
Sampling uncertainty versus method uncertainty: a general framework with applications to omics biomarker selection.
Biometrical Journal 62.3 (May. 2020). DOI

Abstract

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[5]

M. Herrmann.
fda-ndr: Unsupervised Functional Data Analysis via Nonlinear Dimension Reduction. R package.
2020. GitHub

Abstract

manifun: Collection of functions to work with embeddings and functional data.

MCML Authors

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

[4]

M. Herrmann.
manifun: Collection of functions to work with embeddings and functional data. R package.
2020. GitHub

Abstract

Repository contains material to reproduce the results of ‘Unsupervised Functional Data Analysis via Nonlinear Dimension Reduction’(https://arxiv.org/abs/2012.11987).

MCML Authors

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

2019

[3]

L. M. Weber, W. Saelens, R. Cannoodt, C. Soneson, A. Hapfelmeier, P. P. Gardner, A.-L. Boulesteix, Y. Saeys and M. D. Robinson.
Essential guidelines for computational method benchmarking.
Genome Biology 20.125 (Jun. 2019). DOI

Abstract

In computational biology and other sciences, researchers are frequently faced with a choice between several computational methods for performing data analyses. Benchmarking studies aim to rigorously compare the performance of different methods using well-characterized benchmark datasets, to determine the strengths of each method or to provide recommendations regarding suitable choices of methods for an analysis. However, benchmarking studies must be carefully designed and implemented to provide accurate, unbiased, and informative results. Here, we summarize key practical guidelines and recommendations for performing high-quality benchmarking analyses, based on our experiences in computational biology.

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

[2]

P. Probst, A.-L. Boulesteix and B. Bischl.
Tunability: Importance of Hyperparameters of Machine Learning Algorithms.
Journal of Machine Learning Research 20 (Mar. 2019). PDF

Abstract

Modern supervised machine learning algorithms involve hyperparameters that have to be set before running them. Options for setting hyperparameters are default values from the software package, manual configuration by the user or configuring them for optimal predictive performance by a tuning procedure. The goal of this paper is two-fold. Firstly, we formalize the problem of tuning from a statistical point of view, define data-based defaults and suggest general measures quantifying the tunability of hyperparameters of algorithms. Secondly, we conduct a large-scale benchmarking study based on 38 datasets from the OpenML platform and six common machine learning algorithms. We apply our measures to assess the tunability of their parameters. Our results yield default values for hyperparameters and enable users to decide whether it is worth conducting a possibly time consuming tuning strategy, to focus on the most important hyperparameters and to choose adequate hyperparameter spaces for tuning.

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

[1]

P. Probst, M. N. Wright and A.-L. Boulesteix.
Hyperparameters and Tuning Strategies for Random Forest.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9.3 (Jan. 2019). DOI

Abstract

The random forest (RF) algorithm has several hyperparameters that have to be set by the user, for example, the number of observations drawn randomly for each tree and whether they are drawn with or without replacement, the number of variables drawn randomly for each split, the splitting rule, the minimum number of samples that a node must contain, and the number of trees. In this paper, we first provide a literature review on the parameters’ influence on the prediction performance and on variable importance measures. It is well known that in most cases RF works reasonably well with the default values of the hyperparameters specified in software packages. Nevertheless, tuning the hyperparameters can improve the performance of RF. In the second part of this paper, after a presenting brief overview of tuning strategies, we demonstrate the application of one of the most established tuning strategies, model-based optimization (MBO). To make it easier to use, we provide the tuneRanger R package that tunes RF with MBO automatically. In a benchmark study on several datasets, we compare the prediction performance and runtime of tuneRanger with other tuning implementations in R and RF with default hyperparameters.

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

Research Group Anne-Laure Boulesteix

Anne-Laure Boulesteix

Team members @MCML

PostDocs

PhD Students

Recent News @MCML

How Reliable Are Machine Learning Methods? With Anne-Laure Boulesteix and Milena Wünsch

Anne-Laure Boulesteix Receives Reinhart Koselleck Grant

Epistemic Foundations and Limitations of Statistics and Science

Publications @MCML

2025

2024

2023

2022

2021

2020

2019