01.01.2022

28 Papers in Highly-Ranked Journals

We are happy to announce that MCML researchers are represented in 2022 with 28 papers in highly-ranked journals. Congrats to our researchers!

M. Schneble and G. Kauermann.
Intensity Estimation on Geometric Networks with Penalized Splines.
Annals of Applied Statistics 16.2 (Jun. 2022). DOI

Abstract

In the past decades the growing amount of network data led to many novel statistical models. In this paper we consider so-called geometric networks. Typical examples are road networks or other infrastructure networks. Nevertheless, the neurons or the blood vessels in a human body can also be interpreted as a geometric network embedded in a three-dimensional space. A network-specific metric, rather than the Euclidean metric, is usually used in all these applications, making the analyses of network data challenging. We consider network-based point processes, and our task is to estimate the intensity (or density) of the process which allows us to detect high- and low-intensity regions of the underlying stochastic processes. Available routines that tackle this problem are commonly based on kernel smoothing methods. This paper uses penalized spline smoothing and extends this toward smooth intensity estimation on geometric networks. Furthermore, our approach easily allows incorporating covariates, enabling us to respect the network geometry in a regression model framework. Several data examples and a simulation study show that penalized spline-based intensity estimation on geometric networks is a numerically stable and efficient tool. Furthermore, it also allows estimating linear and smooth covariate effects, distinguishing our approach from already existing methodologies.

MCML Authors

Göran Kauermann

Prof. Dr.

Principal Investigator

Applied Statistics in Social Sciences, Economics and Business

R. Foygel Barber, M. Drton, N. Sturma and L. Weihs.
Half-trek criterion for identifiability of latent variable models.
Annals of Statistics 50.6 (Dec. 2022). DOI

Abstract

We consider linear structural equation models with latent variables and develop a criterion to certify whether the direct causal effects between the observable variables are identifiable based on the observed covariance matrix. Linear structural equation models assume that both observed and latent variables solve a linear equation system featuring stochastic noise terms. Each model corresponds to a directed graph whose edges represent the direct effects that appear as coefficients in the equation system. Prior research has developed a variety of methods to decide identifiability of direct effects in a latent projection framework, in which the confounding effects of the latent variables are represented by correlation among noise terms. This approach is effective when the confounding is sparse and affects only small subsets of the observed variables. In contrast, the new latent-factor half-trek criterion (LF-HTC) we develop in this paper operates on the original unprojected latent variable model and is able to certify identifiability in settings where some latent variables may also have dense effects on many or even all of the observables. Our LF-HTC is an effective sufficient criterion for rational identifiability, under which the direct effects can be uniquely recovered as rational functions of the joint covariance matrix of the observed random variables. When restricting the search steps in LF-HTC to consider subsets of latent variables of bounded size, the criterion can be verified in time that is polynomial in the size of the graph.

MCML Authors

Mathias Drton

Prof. Dr.

Principal Investigator

Mathematical Statistics

Nils Sturma

→ Group Mathias Drton
Mathematical Statistics

R. Sonabend, A. Bender and S. Vollmer.
Avoiding C-hacking when evaluating survival distribution predictions with discrimination measures.
Bioinformatics 38.17 (Sep. 2022). DOI GitHub

Abstract

Motivation: In this article, we consider how to evaluate survival distribution predictions with measures of discrimination. This is non-trivial as discrimination measures are the most commonly used in survival analysis and yet there is no clear method to derive a risk prediction from a distribution prediction. We survey methods proposed in literature and software and consider their respective advantages and disadvantages.
Results: Whilst distributions are frequently evaluated by discrimination measures, we find that the method for doing so is rarely described in the literature and often leads to unfair comparisons or ‘C-hacking’. We demonstrate by example how simple it can be to manipulate results and use this to argue for better reporting guidelines and transparency in the literature. We recommend that machine learning survival analysis software implements clear transformations between distribution and risk predictions in order to allow more transparent and accessible model evaluation.
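
To make the point about distribution-to-risk transformations concrete, here is a minimal sketch in plain Python. It is not the authors' code: the function name `harrell_c`, the toy survival curves, and the evaluation times are assumptions chosen for illustration. It computes a simplified Harrell's C-index and scores the same two predicted survival curves under two different risk transformations.

```python
from itertools import combinations


def harrell_c(times, events, risks):
    """Harrell's C: share of comparable pairs in which the subject
    with the shorter observed time was assigned the higher risk."""
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this simplified version
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # earlier subject censored: pair is not comparable
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied risks count half
    return concordant / comparable


# Toy predicted survival probabilities S(t) at t = 1 and t = 5 (hypothetical).
# The two curves cross, so the ranking depends on where we evaluate them.
surv = {"A": {1: 0.6, 5: 0.5}, "B": {1: 0.9, 5: 0.2}}
times, events = [2.0, 3.0], [1, 1]  # A dies at t=2, B dies at t=3

risk_at_1 = [1 - surv[s][1] for s in ("A", "B")]  # risk = 1 - S(1)
risk_at_5 = [1 - surv[s][5] for s in ("A", "B")]  # risk = 1 - S(5)

c_at_1 = harrell_c(times, events, risk_at_1)
c_at_5 = harrell_c(times, events, risk_at_5)
print(c_at_1, c_at_5)
```

Both scores come from the same predicted distributions; only the transformation to a risk score differs, which is why reporting that choice matters.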

MCML Authors

Andreas Bender

Dr.

→ Group Bernd Bischl
Statistical Learning and Data Science

E. Pretzsch, V. Heinemann, S. Stintzing, A. Bender, S. Chen, J. W. Holch, F. O. Hofmann, H. Ren, F. Bösch, H. Küchenhoff, J. Werner and M. K. Angele.
EMT-Related Genes Have No Prognostic Relevance in Metastatic Colorectal Cancer as Opposed to Stage II/III: Analysis of the Randomised, Phase III Trial FIRE-3 (AIO KRK 0306; FIRE-3).
Cancers 14.22 (Nov. 2022). DOI

Abstract

Despite huge advances in local and systemic therapies, the 5-year relative survival rate for patients with metastatic CRC is still low. To avoid over- or undertreatment, proper risk stratification with regard to treatment strategy is highly needed. As EMT (epithelial-mesenchymal transition) is a major step in metastatic spread, this study analysed the prognostic effect of EMT-related genes in stage IV colorectal cancer patients using the study cohort of the FIRE-3 trial, an open-label multi-centre randomised controlled phase III trial of stage IV colorectal cancer patients. Overall, the prognostic relevance of EMT-related genes seems stage-dependent. EMT-related genes have no prognostic relevance in stage IV CRC as opposed to stage II/III.

MCML Authors

Andreas Bender

Dr.

→ Group Bernd Bischl
Statistical Learning and Data Science

Shuo Chen

→ Group Volker Tresp
Database Systems, Data Mining and AI

Helmut Küchenhoff

Prof. Dr.

Principal Investigator

Statistical Consulting Unit (StaBLab)

W. Hartl, P. Kopper, A. Bender, F. Scheipl, A. G. Day, G. Elke and H. Küchenhoff.
Protein intake and outcome of critically ill patients: analysis of a large international database using piece-wise exponential additive mixed models.
Critical Care 26.7 (Jan. 2022). DOI

Abstract

Background: Proteins are an essential part of medical nutrition therapy in critically ill patients. Guidelines almost universally recommend a high protein intake without robust evidence supporting its use.
Methods: Using a large international database, we modelled associations between the hazard rate of in-hospital death and live hospital discharge (competing risks) and three categories of protein intake (low: < 0.8 g/kg per day, standard: 0.8–1.2 g/kg per day, high: > 1.2 g/kg per day) during the first 11 days after ICU admission (acute phase). Time-varying cause-specific hazard ratios (HR) were calculated from piece-wise exponential additive mixed models. We used the estimated model to compare five different hypothetical protein diets (an exclusively low protein diet, a standard protein diet administered early (day 1 to 4) or late (day 5 to 11) after ICU admission, and an early or late high protein diet).
Results: Of 21,100 critically ill patients in the database, 16,489 fulfilled inclusion criteria for the analysis. By day 60, 11,360 (68.9%) patients had been discharged from hospital, 4,192 patients (25.4%) had died in hospital, and 937 patients (5.7%) were still hospitalized. Median daily low protein intake was 0.49 g/kg [IQR 0.27–0.66], standard intake 0.99 g/kg [IQR 0.89–1.09], and high intake 1.41 g/kg [IQR 1.29–1.60]. In comparison with an exclusively low protein diet, a late standard protein diet was associated with a lower hazard of in-hospital death: minimum HR 0.75 (95% CI 0.64, 0.87), and a higher hazard of live hospital discharge: maximum HR 1.98 (95% CI 1.72, 2.28). Results on hospital discharge, however, were qualitatively changed by a sensitivity analysis. There was no evidence that an early standard or a high protein intake during the acute phase was associated with a further improvement of outcome.
Conclusions: Provision of a standard protein intake during the late acute phase may improve outcome compared to an exclusively low protein diet. In unselected critically ill patients, clinical outcome may not be improved by a high protein intake during the acute phase.

MCML Authors

Andreas Bender

Dr.

→ Group Bernd Bischl
Statistical Learning and Data Science

Fabian Scheipl

PD Dr.

Principal Investigator

Functional Data Analysis

Helmut Küchenhoff

Prof. Dr.

Principal Investigator

Statistical Consulting Unit (StaBLab)

Q. Au, J. Herbinger, C. Stachl, B. Bischl and G. Casalicchio.
Grouped Feature Importance and Combined Features Effect Plot.
Data Mining and Knowledge Discovery 36 (Jun. 2022). DOI

Abstract

Interpretable machine learning has become a very active area of research due to the rising popularity of machine learning algorithms and their inherently challenging interpretability. Most work in this area has been focused on the interpretation of single features in a model. However, for researchers and practitioners, it is often equally important to quantify the importance or visualize the effect of feature groups. To address this research gap, we provide a comprehensive overview of how existing model-agnostic techniques can be defined for feature groups to assess the grouped feature importance, focusing on permutation-based, refitting, and Shapley-based methods. We also introduce an importance-based sequential procedure that identifies a stable and well-performing combination of features in the grouped feature space. Furthermore, we introduce the combined features effect plot, which is a technique to visualize the effect of a group of features based on a sparse, interpretable linear combination of features. We used simulation studies and real data examples to analyze, compare, and discuss these methods.
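
The permutation-based variant of grouped feature importance mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names `mse` and `grouped_permutation_importance`, the toy model, and the toy data are all hypothetical. The idea shown is that the columns of a group are permuted jointly, with one shared row permutation, and the resulting increase in loss is taken as the group's importance.

```python
import random


def mse(model, X, y):
    """Mean squared error of `model` (a callable on one row) on (X, y)."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def grouped_permutation_importance(model, X, y, group, n_repeats=20, seed=0):
    """Average increase in MSE when the columns in `group` are permuted
    jointly, i.e. with one shared row permutation per repeat."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    increases = []
    for _ in range(n_repeats):
        perm = list(range(len(X)))
        rng.shuffle(perm)
        X_perm = [list(row) for row in X]
        for i, src in enumerate(perm):
            for j in group:
                X_perm[i][j] = X[src][j]  # move the whole group together
        increases.append(mse(model, X_perm, y) - baseline)
    return sum(increases) / n_repeats


# Toy data: the (hypothetical) model uses columns 0 and 1 and ignores column 2.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
model = lambda row: 2.0 * row[0] + row[1]
y = [model(row) for row in X]

imp_used = grouped_permutation_importance(model, X, y, group=[0, 1])
imp_unused = grouped_permutation_importance(model, X, y, group=[2])
print(imp_used, imp_unused)
```

Permuting a group the model relies on raises the error, while permuting a group it ignores leaves the error unchanged, so the second importance is exactly zero in this toy setup.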

MCML Authors

Julia Herbinger

Dr.

* Former Member

→ Group Bernd Bischl
Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Director

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

→ Group Bernd Bischl
Statistical Learning and Data Science

K. Lotto, T. Nagler and M. Radic.
Modeling Stochastic Data Using Copulas for Applications in the Validation of Autonomous Driving.
Electronics 11.24 (Dec. 2022). DOI

Abstract

The verification and validation processes of fully automated vehicles are linked to an almost intractable challenge of reflecting the real world with all its interactions in a virtual environment. Influential stochastic parameters need to be extracted from real-world measurements and real-time data, capturing all interdependencies, for an accurate simulation of reality. A copula is a probability model that represents a multivariate distribution, examining the dependence between the underlying variables. This model is used on drone measurement data from a roundabout containing dependent stochastic parameters. With the help of the copula model, samples are generated that reflect the real-time data. The resulting applications and possible extensions are discussed and explored.
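
As a rough illustration of how a copula separates dependence from marginal behavior, the sketch below samples from a bivariate Gaussian copula and then applies inverse marginal CDFs. The abstract does not state which copula family or marginals the authors use, so the Gaussian copula, the exponential marginals, and all function names here are assumptions made only for this example.

```python
import math
import random
from statistics import NormalDist


def sample_gaussian_copula(rho, n, seed=0):
    """Draw n pairs (u1, u2) with uniform margins and Gaussian dependence."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((phi(z1), phi(z2)))  # probability integral transform
    return pairs


def exponential_quantile(u, rate):
    """Inverse CDF of an exponential distribution with the given rate."""
    return -math.log(1.0 - u) / rate


# Hypothetical marginals: two positive quantities with exponential margins.
uniforms = sample_gaussian_copula(rho=0.8, n=2000, seed=42)
samples = [(exponential_quantile(u1, 0.5), exponential_quantile(u2, 2.0))
           for u1, u2 in uniforms]


def pearson(xs, ys):
    """Plain Pearson correlation, used here only as a dependence check."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


u_corr = pearson([u for u, _ in uniforms], [v for _, v in uniforms])
print(round(u_corr, 2))
```

The dependence is fixed by the copula parameter `rho`, while the marginals can be swapped freely by changing the quantile functions; generated samples keep the dependence structure regardless of that choice.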

MCML Authors

Thomas Nagler

Prof. Dr.

Principal Investigator

Computational Statistics & Data Science

M. Mittermeier, M. Weigert, D. Rügamer, H. Küchenhoff and R. Ludwig.
A deep learning based classification of atmospheric circulation types over Europe: projection of future changes in a CMIP6 large ensemble.
Environmental Research Letters 17.8 (Jul. 2022). DOI

Abstract

High- and low-pressure systems of the large-scale atmospheric circulation in the mid-latitudes drive European weather and climate. Potential future changes in the occurrence of circulation types are highly relevant for society. Classifying the highly dynamic atmospheric circulation into discrete classes of circulation types helps to categorize the linkages between atmospheric forcing and surface conditions (e.g. extreme events). Previous studies have revealed a high internal variability of projected changes of circulation types. Dealing with this high internal variability requires the employment of a single-model initial-condition large ensemble (SMILE) and an automated classification method, which can be applied to large climate data sets. One of the most established classifications in Europe is the set of 29 subjective circulation types called Grosswetterlagen by Hess & Brezowsky (HB circulation types). We developed, in the first analysis of its kind, an automated version of this subjective classification using deep learning. Our classifier reaches an overall accuracy of 41.1% on the test sets of nested cross-validation. It outperforms the state-of-the-art automatization of the HB circulation types in 20 of the 29 classes. We apply the deep learning classifier to the SMHI-LENS, a SMILE of the Coupled Model Intercomparison Project phase 6, composed of 50 members of the EC-Earth3 model under the SSP3-7.0 scenario. For the analysis of future frequency changes of the 29 circulation types, we use the signal-to-noise ratio to discriminate the climate change signal from the noise of internal variability. Using a 5%-significance level, we find significant frequency changes in 69% of the circulation types when comparing the future (2071–2100) to a reference period (1991–2020).

MCML Authors

Maximilian Weigert

* Former Member

→ Group Helmut Küchenhoff
Statistical Consulting Unit (StaBLab)

David Rügamer

Prof. Dr.

Principal Investigator

Statistics, Data Science and Machine Learning

Helmut Küchenhoff

Prof. Dr.

Principal Investigator

Statistical Consulting Unit (StaBLab)

M. van Smeden, G. Heinze, B. Van Calster, F. W. Asselbergs, P. E. Vardas, N. Bruining, P. de Jaegere, J. H. Moore, S. Denaxas, A.-L. Boulesteix and K. G. M. Moons.
Critical appraisal of artificial intelligence-based prediction models for cardiovascular disease.
European Heart Journal 43.31 (Aug. 2022). DOI

Abstract

The medical field has seen a rapid increase in the development of artificial intelligence (AI)-based prediction models. With the introduction of such AI-based prediction model tools and software in cardiovascular patient care, the cardiovascular researcher and healthcare professional are challenged to understand the opportunities as well as the limitations of the AI-based predictions. In this article, we present 12 critical questions for cardiovascular health professionals to ask when confronted with an AI-based prediction model. We aim to support medical professionals to distinguish the AI-based prediction models that can add value to patient care from the AI that does not.

MCML Authors

Anne-Laure Boulesteix

Prof. Dr.

Principal Investigator

Biometry in Molecular Medicine

K. Baßler, W. Fujii, T. S. Kapellos, E. Dudkin, N. Reusch, A. Horne, B. Reiz, M. D. Luecken, C. Osei-Sarpong, S. Warnat-Herresthal, L. Bonaguro, J. Schulte-Schrepping, A. Wagner, P. Günther, C. Pizarro, T. Schreiber, R. Knoll, L. Holsten, C. Kröger, E. De Domenico, M. Becker, K. Händler, C. T. Wohnhaas, F. Baumgartner, M. Köhler, H. Theis, M. Kraut, M. H. Wadsworth, T. K. Hughes, H. J. Ferreira, E. Hinkley, I. H. Kaltheuner, M. Geyer, C. Thiele, A. K. Shalek, A. Feißt, D. Thomas, H. Dickten, M. Beyer, P. Baum, N. Yosef, A. C. Aschenbrenner, T. Ulas, J. Hasenauer, F. J. Theis, D. Skowasch and J. L. Schultze.
Alveolar macrophages in early stage COPD show functional deviations with properties of impaired immune activation.
Frontiers in Immunology 13 (Jul. 2022). DOI

Abstract

Despite its high prevalence, the cellular and molecular mechanisms of chronic obstructive pulmonary disease (COPD) are far from being understood. Here, we determine disease-related changes in cellular and molecular compositions within the alveolar space and peripheral blood of a cohort of COPD patients and controls. Myeloid cells were the largest cellular compartment in the alveolar space with invading monocytes and proliferating macrophages elevated in COPD. Modeling cell-to-cell communication, signaling pathway usage, and transcription factor binding predicts TGF-β1 to be a major upstream regulator of transcriptional changes in alveolar macrophages of COPD patients. Functionally, macrophages in COPD showed reduced antigen presentation capacity, accumulation of cholesteryl ester, reduced cellular chemotaxis, and mitochondrial dysfunction, reminiscent of impaired immune activation.

MCML Authors

Fabian Theis

Prof. Dr.

Principal Investigator

Mathematical Modelling of Biological Systems

N. Palm, F. Stroebl and H. Palm.
Parameter Individual Optimal Experimental Design and Calibration of Parametric Models.
IEEE Access 10 (Oct. 2022). DOI GitHub

Abstract

Parametric models make it possible to reflect system behavior in general and to characterize individual system instances by specific parameter values. For a variety of scientific disciplines, model calibration by parameter quantification is therefore of central importance. As the time and cost of calibration experiments increase, the question of how to determine parameter values of required quality with a minimum number of experiments comes to the fore. In this paper, a methodology is introduced that allows quantifying and optimizing the achievable parameter extraction quality based on an experimental plan, including a process and methods for adapting the experimental plan for improved estimation of individually selectable parameters. The resulting parameter-individual optimal design of experiments (pi-OED) enables experimenters to extract a maximum of parameter-specific information from a given number of experiments. We demonstrate how to minimize variance or covariances of individually selectable parameter estimators by model-based calculation of the experimental designs. Using the Fisher Information Matrix in combination with the Cramér-Rao inequality, the pi-OED plan is reduced to a global optimization problem. The pi-OED workflow is demonstrated using computer experiments to calibrate a model describing calendrical aging of lithium-ion battery cells. Applying bootstrapping methods also allows quantifying parameter estimation distributions for further benchmarking. Comparing pi-OED based computer experimental results with those based on state-of-the-art designs of experiments reveals its efficiency improvement. All computer experimental results are obtained in Python and may be reproduced using a provided Jupyter Notebook along with the source code. Both are available under https://github.com/nicolaipalm/oed.
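
The link between an experimental design, the Fisher Information Matrix, and the Cramér-Rao bound can be illustrated with a toy case. The sketch below is not taken from the linked repository and does not use its API: it assumes a simple straight-line model y = a + b·x with Gaussian noise, and the function names and the two candidate designs are hypothetical. It compares the lower bound on the variance of the slope estimator for two designs with the same number of experiments.

```python
def fisher_information(xs, sigma=1.0):
    """FIM for y = a + b*x with i.i.d. Gaussian noise of std `sigma`,
    given the design points `xs` (one experiment per point)."""
    n, sx, sxx = len(xs), sum(xs), sum(x * x for x in xs)
    s2 = sigma ** 2
    return [[n / s2, sx / s2], [sx / s2, sxx / s2]]


def cramer_rao_bound(fim):
    """Inverse of a 2x2 FIM: lower bound on the covariance matrix
    of any unbiased estimator of (a, b)."""
    (a, b), (c, d) = fim
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Two hypothetical designs with four experiments each on the interval [0, 1].
design_spread = [0.0, 1 / 3, 2 / 3, 1.0]  # evenly spaced points
design_ends = [0.0, 0.0, 1.0, 1.0]        # all runs at the interval ends

# Entry [1][1] is the bound on the variance of the slope estimator.
var_slope_spread = cramer_rao_bound(fisher_information(design_spread))[1][1]
var_slope_ends = cramer_rao_bound(fisher_information(design_ends))[1][1]
print(var_slope_spread, var_slope_ends)
```

With sigma = 1, the bound on the slope variance is 1.8 for the evenly spread design and 1.0 for the design that places all runs at the interval ends, which is the kind of parameter-specific comparison between designs that the abstract describes.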

MCML Authors

Nicolai Palm

→ Group Thomas Nagler
Computational Statistics & Data Science

J. Moosbauer, M. Binder, L. Schneider, F. Pfisterer, M. Becker, M. Lang, L. Kotthoff and B. Bischl.
Automated Benchmark-Driven Design and Explanation of Hyperparameter Optimizers.
IEEE Transactions on Evolutionary Computation 26.6 (Oct. 2022). DOI

Abstract

Automated hyperparameter optimization (HPO) has gained great popularity and is an important component of most automated machine learning frameworks. However, the process of designing HPO algorithms is still an unsystematic and manual process: new algorithms are often built on top of prior work, where limitations are identified and improvements are proposed. Even though this approach is guided by expert knowledge, it is still somewhat arbitrary. The process rarely allows for gaining a holistic understanding of which algorithmic components drive performance and carries the risk of overlooking good algorithmic design choices. We present a principled approach to automated benchmark-driven algorithm design applied to multifidelity HPO (MF-HPO). First, we formalize a rich space of MF-HPO candidates that includes, but is not limited to, common existing HPO algorithms and then present a configurable framework covering this space. To find the best candidate automatically and systematically, we follow a programming-by-optimization approach and search over the space of algorithm candidates via Bayesian optimization. We challenge whether the found design choices are necessary or could be replaced by more naive and simpler ones by performing an ablation analysis. We observe that using a relatively simple configuration (in some ways, simpler than established methods) performs very well as long as some critical configuration parameters are set to the right value.

MCML Authors

Julia Moosbauer

Dr.

* Former Member

→ Group Bernd Bischl
Statistical Learning and Data Science

Martin Binder

→ Group Bernd Bischl
Statistical Learning and Data Science

Lennart Schneider

→ Group Bernd Bischl
Statistical Learning and Data Science

Florian Pfisterer

Dr.

* Former Member

→ Group Bernd Bischl
Statistical Learning and Data Science

Marc Becker

→ Group Bernd Bischl
Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

Director

Statistical Learning and Data Science

M. Ali, M. Berrendorf, C. T. Hoyt, L. Vermue, M. Galkin, S. Sharifzadeh, A. Fischer, V. Tresp and J. Lehmann.
Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models under a Unified Framework.
IEEE Transactions on Pattern Analysis and Machine Intelligence 44.12 (Dec. 2022). DOI GitHub

Abstract

The heterogeneity in recently published knowledge graph embedding models’ implementations, training, and evaluation has made fair and thorough comparisons difficult. To assess the reproducibility of previously published results, we re-implemented and evaluated 21 models in the PyKEEN software package. In this paper, we outline which results could be reproduced with their reported hyper-parameters, which could only be reproduced with alternate hyper-parameters, and which could not be reproduced at all, as well as provide insight as to why this might be the case. We then performed a large-scale benchmarking on four datasets with several thousands of experiments and 24,804 GPU hours of computation time. We present insights gained as to best practices, best configurations for each model, and where improvements could be made over previously published best configurations. Our results highlight that the combination of model architecture, training approach, loss function, and the explicit modeling of inverse relations is crucial for a model’s performance and is not only determined by its architecture. We provide evidence that several architectures can obtain results competitive to the state of the art when configured carefully.

MCML Authors

Max Berrendorf

Dr.

* Former Member

→ Group Volker Tresp
Database Systems, Data Mining and AI

Volker Tresp

Prof. Dr.

Principal Investigator

Database Systems, Data Mining and AI

G. Brasó, O. Cetintas and L. Leal-Taixé.
Multi-Object Tracking and Segmentation Via Neural Message Passing.
International Journal of Computer Vision 130.12 (Sep. 2022). DOI GitHub

Abstract

Graphs offer a natural way to formulate Multiple Object Tracking (MOT) and Multiple Object Tracking and Segmentation (MOTS) within the tracking-by-detection paradigm. However, they also introduce a major challenge for learning methods, as defining a model that can operate on such a structured domain is not trivial. In this work, we exploit the classical network flow formulation of MOT to define a fully differentiable framework based on Message Passing Networks. By operating directly on the graph domain, our method can reason globally over an entire set of detections and exploit contextual features. It then jointly predicts both final solutions for the data association problem and segmentation masks for all objects in the scene while exploiting synergies between the two tasks. We achieve state-of-the-art results for both tracking and segmentation in several publicly available datasets.

MCML Authors

Guillem Brasó

* Former Member

→ Group Daniel Cremers
Computer Vision & Artificial Intelligence

Laura Leal-Taixé

Prof. Dr.

Principal Investigator

* Former Principal Investigator

K. E. Riehm, E. Badillo Goicoechea, F. M. Wang, E. Kim, L. R. Aldridge, C. P. Lupton-Smith, R. Presskreischer, T.-H. Chang, S. LaRocca, F. Kreuter and E. A. Stuart.
Association of Non-Pharmaceutical Interventions to Reduce the Spread of SARS-CoV-2 With Anxiety and Depressive Symptoms: A Multi-National Study of 43 Countries.
International Journal of Public Health 67 (Mar. 2022). DOI

Abstract

Objectives: To examine the association of non-pharmaceutical interventions (NPIs) with anxiety and depressive symptoms among adults and determine if these associations varied by gender and age.
Methods: We combined survey data from 16,177,184 adults from 43 countries who participated in the daily COVID-19 Trends and Impact Survey via Facebook with time-varying NPI data from the Oxford COVID-19 Government Response Tracker between 24 April 2020 and 20 December 2020. Using logistic regression models, we examined the association of [1] overall NPI stringency and [2] seven individual NPIs (school closures, workplace closures, cancellation of public events, restrictions on the size of gatherings, stay-at-home requirements, restrictions on internal movement, and international travel controls) with anxiety and depressive symptoms.
Results: More stringent implementation of NPIs was associated with higher odds of anxiety and depressive symptoms, albeit with very small effect sizes. Individual NPIs had heterogeneous associations with anxiety and depressive symptoms by gender and age.
Conclusion: Governments worldwide should be prepared to address the possible mental health consequences of stringent NPI implementation with both universal and targeted interventions for vulnerable groups.

MCML Authors

Frauke Kreuter

Prof. Dr.

Principal Investigator

Social Data Science and AI

E. Schede, J. Brandt, A. Tornede, M. Wever, V. Bengs, E. Hüllermeier and K. Tierney.
A Survey of Methods for Automated Algorithm Configuration.
Journal of Artificial Intelligence Research 75 (Oct. 2022). DOI

Abstract

Algorithm configuration (AC) is concerned with the automated search for the most suitable parameter configuration of a parametrized algorithm. There is currently a wide variety of AC problem variants and methods proposed in the literature. Existing reviews do not take into account all derivatives of the AC problem, nor do they offer a complete classification scheme. To this end, we introduce taxonomies to describe the AC problem and features of configuration methods, respectively. We review existing AC literature through the lens of our taxonomies, outline relevant design choices of configuration approaches, contrast methods and problem variants against each other, and describe the state of AC in industry. Finally, our review provides researchers and practitioners with a look at future research directions in the field of AC.

MCML Authors

Marcel Wever

Dr.

* Former Member

→ Group Eyke Hüllermeier
Artificial Intelligence and Machine Learning

Viktor Bengs

Dr.

* Former Member

→ Group Eyke Hüllermeier
Artificial Intelligence and Machine Learning

Eyke Hüllermeier

Prof. Dr.

Principal Investigator

Artificial Intelligence and Machine Learning

C. Fritz, G. De Nicola, F. Günther, D. Rügamer, M. Rave, M. Schneble, A. Bender, M. Weigert, R. Brinks, A. Hoyer, U. Berger, H. Küchenhoff and G. Kauermann.
Challenges in Interpreting Epidemiological Surveillance Data – Experiences from Germany.
Journal of Computational and Graphical Statistics 32.3 (Dec. 2022). DOI

Abstract

As early as March 2020, the authors of this letter started to work on surveillance data to obtain a clearer picture of the pandemic’s dynamic. This letter outlines the lessons learned during this peculiar time, emphasizing the benefits that better data collection, management, and communication processes would bring to the table. We further want to promote nuanced data analyses as a vital element of general political discussion as opposed to drawing conclusions from raw data, which are often flawed in epidemiological surveillance data, and therefore underline the overall need for statistics to play a more central role in public discourse.

MCML Authors

Cornelius Fritz

Dr.

* Former Member

→ Group Göran Kauermann
Applied Statistics in Social Sciences, Economics and Business

David Rügamer

Prof. Dr.

Principal Investigator

Statistics, Data Science and Machine Learning

Andreas Bender

Dr.

→ Group Bernd Bischl
Statistical Learning and Data Science

Maximilian Weigert

* Former Member

→ Group Helmut Küchenhoff
Statistical Consulting Unit (StaBLab)

Helmut Küchenhoff

Prof. Dr.

Principal Investigator

Statistical Consulting Unit (StaBLab)

Göran Kauermann

Prof. Dr.

Principal Investigator

Applied Statistics in Social Sciences, Economics and Business

C. Fritz and G. Kauermann.
On the Interplay of Regional Mobility, Social Connectedness, and the Spread of COVID-19 in Germany.
Journal of the Royal Statistical Society. Series A (Statistics in Society) 185.1 (Jan. 2022). DOI

Abstract

Since the primary mode of respiratory virus transmission is person-to-person interaction, we are required to reconsider physical interaction patterns to mitigate the number of people infected with COVID-19. While research has shown that non-pharmaceutical interventions (NPI) had an evident impact on national mobility patterns, we investigate the relative regional mobility behaviour to assess the effect of human movement on the spread of COVID-19. In particular, we explore the impact of human mobility and social connectivity derived from Facebook activities on the weekly rate of new infections in Germany between 3 March and 22 June 2020. Our results confirm that reduced social activity lowers the infection rate, accounting for regional and temporal patterns. The extent of social distancing, quantified by the percentage of people staying put within a federal administrative district, has an overall negative effect on the incidence of infections. Additionally, our results show spatial infection patterns based on geographical as well as social distances.

MCML Authors

Cornelius Fritz

Dr.

* Former Member

→ Group Göran Kauermann
Applied Statistics in Social Sciences, Economics and Business

Göran Kauermann

Prof. Dr.

Principal Investigator

Applied Statistics in Social Sciences, Economics and Business

A. Python, A. Bender, M. Blangiardo, J. B. Illian, Y. Lin, B. Liu, T. C. D. Lucas, S. Tan, Y. Wen, D. Svanidze and J. Yin.
A downscaling approach to compare COVID-19 count data from databases aggregated at different spatial scales.
Journal of the Royal Statistical Society. Series A (Statistics in Society) 185.1 (Jan. 2022). DOI

Abstract

As the COVID-19 pandemic continues to threaten various regions around the world, obtaining accurate and reliable COVID-19 data is crucial for governments and local communities aiming at rigorously assessing the extent and magnitude of the virus spread and deploying efficient interventions. Using data reported between January and February 2020 in China, we compared counts of COVID-19 from near-real-time spatially disaggregated data (city level) with fine-spatial scale predictions from a Bayesian downscaling regression model applied to a reference province-level data set. The results highlight discrepancies in the counts of coronavirus-infected cases at the district level and identify districts that may require further investigation.

MCML Authors

Andreas Bender

Dr.

→ Group Bernd Bischl
Statistical Learning and Data Science

Z. Liu, Y. Ma, M. Hildebrandt, Y. Ouyang and Z. Xiong.
CDARL: a contrastive discriminator-augmented reinforcement learning framework for sequential recommendations.
Knowledge and Information Systems 64 (Jul. 2022). DOI

Abstract

Sequential recommendations play a crucial role in many real-world applications. Due to the sequential nature, reinforcement learning has been employed to iteratively produce recommendations based on an observed stream of user behavior. In this setting, a recommendation agent interacts with the environments (users) by sequentially recommending items (actions) to maximize users’ overall long-term cumulative rewards. However, most reinforcement learning-based recommendation models only focus on extrinsic rewards based on user feedback, leading to sub-optimal policies if user-item interactions are sparse and failing to obtain the dynamic rewards based on the users’ preferences. As a remedy, we propose a dynamic intrinsic reward signal integrated with a contrastive discriminator-augmented reinforcement learning framework. Concretely, our framework contains two modules: (1) a contrastive learning module that learns the representation of item sequences; (2) an intrinsic reward learning function that imitates the user’s internal dynamics. Furthermore, we combine static extrinsic reward and dynamic intrinsic reward to train a sequential recommender system based on double Q-learning. We integrate our framework with five representative sequential recommendation models. Specifically, our framework augments these recommendation models with two output layers: the supervised layer that applies cross-entropy loss to perform ranking and the other for reinforcement learning. Experimental results on two real-world datasets demonstrate that the proposed framework outperforms several sequential recommendation baselines and exploration with intrinsic reward baselines.

MCML Authors

Yunpu Ma

Dr.

→ Group Volker Tresp
Database Systems, Data Mining and AI

V.-L. Nguyen, M. H. Shaker and E. Hüllermeier.
How to measure uncertainty in uncertainty sampling for active learning.
Machine Learning 111.1 (Jan. 2022). DOI

Abstract

Various strategies for active learning have been proposed in the machine learning literature. In uncertainty sampling, which is among the most popular approaches, the active learner sequentially queries the label of those instances for which its current prediction is maximally uncertain. The predictions as well as the measures used to quantify the degree of uncertainty, such as entropy, are traditionally of a probabilistic nature. Yet, alternative approaches to capturing uncertainty in machine learning, along with corresponding uncertainty measures, have been proposed in recent years. In particular, some of these measures seek to distinguish different sources and to separate different types of uncertainty, such as the reducible (epistemic) and the irreducible (aleatoric) part of the total uncertainty in a prediction. The goal of this paper is to elaborate on the usefulness of such measures for uncertainty sampling, and to compare their performance in active learning. To this end, we instantiate uncertainty sampling with different measures, analyze the properties of the sampling strategies thus obtained, and compare them in an experimental study.

MCML Authors

Mohammad Hossein Shaker

→ Group Eyke Hüllermeier
Artificial Intelligence and Machine Learning

Eyke Hüllermeier

Prof. Dr.

Principal Investigator

Artificial Intelligence and Machine Learning

B. A. Hersbach, D. S. Fischer, G. Masserdotti, Deeksha, K. Mojžišová, T. Waltzhöni, D. Rodriguez‐Terrones, M. Heinig, F. J. Theis, M. Götz and S. H. Stricker.
Probing cell identity hierarchies by fate titration and collision during direct reprogramming.
Molecular Systems Biology 18.e11129 (Sep. 2022). DOI

Abstract

Despite the therapeutic promise of direct reprogramming, basic principles concerning fate erasure and the mechanisms to resolve cell identity conflicts remain unclear. To tackle these fundamental questions, we established a single‐cell protocol for the simultaneous analysis of multiple cell fate conversion events based on combinatorial and traceable reprogramming factor expression: Collide‐seq. Collide‐seq revealed the lack of a common mechanism through which fibroblast‐specific gene expression loss is initiated. Moreover, we found that the transcriptome of converting cells abruptly changes when a critical level of each reprogramming factor is attained, with higher or lower levels not contributing to major changes. By simultaneously inducing multiple competing reprogramming factors, we also found a deterministic system, in which titration of fates against each other yields dominant or colliding fates. By investigating one collision in detail, we show that reprogramming factors can disturb cell identity programs independent of their ability to bind their target genes. Taken together, Collide‐seq has shed light on several fundamental principles of fate conversion that may aid in improving current reprogramming paradigms.

MCML Authors

Fabian Theis

Prof. Dr.

Principal Investigator

Mathematical Modelling of Biological Systems

M. Lotfollahi, M. Naghipourfar, M. D. Luecken, M. Khajavi, M. Büttner, M. Wagenstetter, Z. Avsec, A. Gayoso, N. Yosef, M. Interlandi, S. Rybakov, A. V. Misharin and F. J. Theis.
Mapping single-cell data to reference atlases by transfer learning.
Nature Biotechnology 40 (Aug. 2022). DOI

Abstract

Large single-cell atlases are now routinely generated to serve as references for analysis of smaller-scale studies. Yet learning from reference data is complicated by batch effects between datasets, limited availability of computational resources and sharing restrictions on raw data. Here we introduce a deep learning strategy for mapping query datasets on top of a reference called single-cell architectural surgery (scArches). scArches uses transfer learning and parameter optimization to enable efficient, decentralized, iterative reference building and contextualization of new datasets with existing references without sharing raw data. Using examples from mouse brain, pancreas, immune and whole-organism atlases, we show that scArches preserves biological state information while removing batch effects, despite using four orders of magnitude fewer parameters than de novo integration. scArches generalizes to multimodal reference mapping, allowing imputation of missing modalities. Finally, scArches retains coronavirus disease 2019 (COVID-19) disease variation when mapping to a healthy reference, enabling the discovery of disease-specific cell states. scArches will facilitate collaborative projects by enabling iterative construction, updating, sharing and efficient use of reference atlases.

MCML Authors

Fabian Theis

Prof. Dr.

Principal Investigator

Mathematical Modelling of Biological Systems

G. Palla, H. Spitzer, M. Klein, D. Fischer, A. C. Schaar, L. B. Kuemmerle, S. Rybakov, I. L. Ibarra, O. Holmberg, I. Virshup, M. Lotfollahi, S. Richter and F. J. Theis.
Squidpy: a scalable framework for spatial omics analysis.
Nature Methods 19 (Jan. 2022). DOI

Abstract

Spatial omics data are advancing the study of tissue organization and cellular communication at an unprecedented scale. Flexible tools are required to store, integrate and visualize the large diversity of spatial omics data. Here, we present Squidpy, a Python framework that brings together tools from omics and image analysis to enable scalable description of spatial molecular data, such as transcriptome or multivariate proteins. Squidpy provides efficient infrastructure and numerous analysis methods that make it possible to efficiently store, manipulate and interactively visualize spatial omics data. Squidpy is extensible and can be interfaced with a variety of already existing libraries for the scalable analysis of spatial omics data.

MCML Authors

Fabian Theis

Prof. Dr.

Principal Investigator

Mathematical Modelling of Biological Systems

M. Lange, V. Bergen, M. Klein, M. Setty, B. Reuter, M. Bakhti, H. Lickert, M. Ansari, J. Schniering, H. B. Schiller, D. Pe’er and F. J. Theis.
CellRank for directed single-cell fate mapping.
Nature Methods 19.2 (Jan. 2022). DOI

Abstract

Computational trajectory inference enables the reconstruction of cell state dynamics from single-cell RNA sequencing experiments. However, trajectory inference requires that the direction of a biological process is known, largely limiting its application to differentiating systems in normal development. Here, we present CellRank (https://cellrank.org) for single-cell fate mapping in diverse scenarios, including regeneration, reprogramming and disease, for which direction is unknown. Our approach combines the robustness of trajectory inference with directional information from RNA velocity, taking into account the gradual and stochastic nature of cellular fate decisions, as well as uncertainty in velocity vectors. On pancreas development data, CellRank automatically detects initial, intermediate and terminal populations, predicts fate potentials and visualizes continuous gene expression trends along individual lineages. Applied to lineage-traced cellular reprogramming data, predicted fate probabilities correctly recover reprogramming outcomes. CellRank also predicts a new dedifferentiation trajectory during postinjury lung regeneration, including previously unknown intermediate cell states, which we confirm experimentally.

MCML Authors

Marius Lange

Dr.

* Former Member

→ Group Fabian Theis
Mathematical Modelling of Biological Systems

Fabian Theis

Prof. Dr.

Principal Investigator

Mathematical Modelling of Biological Systems

W. Ghada, E. Casellas, J. Herbinger, A. Garcia-Benadí, L. Bothmann, N. Estrella, J. Bech and A. Menzel.
Stratiform and Convective Rain Classification Using Machine Learning Models and Micro Rain Radar.
Remote Sensing 14.18 (Sep. 2022). DOI

Abstract

Rain type classification into convective and stratiform is an essential step required to improve quantitative precipitation estimations by remote sensing instruments. Previous studies with Micro Rain Radar (MRR) measurements and subjective rules have been performed to classify rain events. However, automating this process by using machine learning (ML) models provides the advantages of fast and reliable classification with the possibility to classify rain minute by minute. A total of 20,979 min of rain data measured by an MRR at Das in northeast Spain were used to build seven types of ML models for stratiform and convective rain type classification. The proposed classification models use a set of 22 parameters that summarize the reflectivity, the Doppler velocity, and the spectral width (SW) above and below the so-called separation level (SL). This level is defined as the level with the highest increase in Doppler velocity and corresponds with the bright band in stratiform rain. A pre-classification of the rain type for each minute based on the rain microstructure provided by the collocated disdrometer was performed. Our results indicate that complex ML models, particularly tree-based ensembles such as xgboost and random forest which capture the interactions of different features, perform better than simpler models. Applying methods from the field of interpretable ML, we identified reflectivity at the lowest layer and the average spectral width in the layers below SL as the most important features. High reflectivity and low SW values indicate a higher probability of convective rain.

MCML Authors

Julia Herbinger

Dr.

* Former Member

→ Group Bernd Bischl
Statistical Learning and Data Science

Ludwig Bothmann

Dr.

→ Group Bernd Bischl
Statistical Learning and Data Science

C. Fritz, E. Dorigatti and D. Rügamer.
Combining Graph Neural Networks and Spatio-temporal Disease Models to Predict COVID-19 Cases in Germany.
Scientific Reports 12.3930 (Mar. 2022). DOI

Abstract

During 2020, the infection rate of COVID-19 has been investigated by many scholars from different research fields. In this context, reliable and interpretable forecasts of disease incidents are a vital tool for policymakers to manage healthcare resources. Moreover, several experts have stressed the need to account for human mobility to explain the spread of COVID-19. Existing approaches often apply standard models of the respective research field, frequently restricting modeling possibilities. For instance, most statistical or epidemiological models cannot directly incorporate unstructured data sources, including relational data that may encode human mobility. In contrast, machine learning approaches may yield better predictions by exploiting these data structures yet lack intuitive interpretability as they are often categorized as black-box models. We propose a combination of both research directions and present a multimodal learning framework that amalgamates statistical regression and machine learning models for predicting local COVID-19 cases in Germany. Results and implications: the novel approach introduced enables the use of a richer collection of data types, including mobility flows and colocation probabilities, and yields the lowest mean squared error scores throughout the observational period in the reported benchmark study. The results corroborate that during most of the observational period more dispersed meeting patterns and a lower percentage of people staying put are associated with higher infection rates. Moreover, the analysis underpins the necessity of including mobility data and showcases the flexibility and interpretability of the proposed approach.

MCML Authors

Cornelius Fritz

Dr.

* Former Member

→ Group Göran Kauermann
Applied Statistics in Social Sciences, Economics and Business

Emilio Dorigatti

Dr.

* Former Member

→ Group Bernd Bischl
Statistical Learning and Data Science

David Rügamer

Prof. Dr.

Principal Investigator

Statistics, Data Science and Machine Learning

S. Kevork and G. Kauermann.
Bipartite Exponential Random Graph Models with Nodal Random Effects.
Social Networks 70 (Jun. 2022). DOI

Abstract

We examine the inclusion of specific nodal random effects for first- and second-mode nodes towards an ERGM for bipartite networks. The inclusion of such node-specific random effects in the ERGM accounts for unobserved heterogeneity in the bipartite network and ensures stable estimation results, especially for large-scale bipartite networks. Moreover, the predicted nodal random effects deliver reasonable interpretation to understand the network behavior. The estimation is carried out by an iterative estimation technique, iterating between pseudolikelihood estimation for the nodal random effects and maximum likelihood estimation for the network parameters.

MCML Authors

Göran Kauermann

Prof. Dr.

Principal Investigator

Applied Statistics in Social Sciences, Economics and Business
