Research Group Fabian Scheipl

Fabian Scheipl

PD Dr.

Principal Investigator

Functional Data Analysis

Fabian Scheipl is Head of the Workgroup Functional Data Analysis at LMU Munich.

The group develops methodology and software for processing, describing, visualizing, and modeling functional data such as curves, trajectories, or higher-dimensional surfaces. Its research focuses on generalized additive regression for functional data and on both supervised and unsupervised methods, for example for automated outlier detection or dimension reduction.

Publications @MCML

2024


[21]
M. Herrmann, D. Kazempour, F. Scheipl and P. Kröger.
Enhancing cluster analysis via topological manifold learning.
Data Mining and Knowledge Discovery 38 (Apr. 2024). DOI
Abstract

We discuss topological aspects of cluster analysis and show that inferring the topological structure of a dataset before clustering it can considerably enhance cluster detection: we show that clustering embedding vectors representing the inherent structure of a dataset instead of the observed feature vectors themselves is highly beneficial. To demonstrate, we combine the manifold learning method UMAP for inferring the topological structure with the density-based clustering method DBSCAN. Synthetic and real data results show that this both simplifies and improves clustering in a diverse set of low- and high-dimensional problems including clusters of varying density and/or entangled shapes. Our approach simplifies clustering because topological pre-processing consistently reduces parameter sensitivity of DBSCAN. Clustering the resulting embeddings with DBSCAN can then even outperform complex methods such as SPECTACL and ClusterGAN. Finally, our investigation suggests that the crucial issue in clustering does not appear to be the nominal dimension of the data or how many irrelevant features it contains, but rather how separable the clusters are in the ambient observation space they are embedded in, which is usually the (high-dimensional) Euclidean space defined by the features of the data. The approach is successful because it performs the cluster analysis after projecting the data into a more suitable space that is optimized for separability, in some sense.

MCML Authors
Moritz Herrmann

Dr.

Transfer Coordinator

Biometry in Molecular Medicine

Daniyal Kazempour

Dr.

* Former member

Fabian Scheipl

PD Dr.

Functional Data Analysis

Peer Kröger

Prof. Dr.

* Former member


2023


[20]
J. Gauss, F. Scheipl and M. Herrmann.
DCSI – An improved measure of cluster separability based on separation and connectedness.
Preprint (Oct. 2023). arXiv
Abstract

Whether class labels in a given data set correspond to meaningful clusters is crucial for the evaluation of clustering algorithms using real-world data sets. This property can be quantified by separability measures. The central aspects of separability for density-based clustering are between-class separation and within-class connectedness, and neither classification-based complexity measures nor cluster validity indices (CVIs) adequately incorporate them. A newly developed measure (density cluster separability index, DCSI) aims to quantify these two characteristics and can also be used as a CVI. Extensive experiments on synthetic data indicate that DCSI correlates strongly with the performance of DBSCAN measured via the adjusted Rand index (ARI) but lacks robustness when it comes to multi-class data sets with overlapping classes that are ill-suited for density-based hard clustering. Detailed evaluation on frequently used real-world data sets shows that DCSI can correctly identify touching or overlapping classes that do not correspond to meaningful density-based clusters.

MCML Authors
Fabian Scheipl

PD Dr.

Functional Data Analysis

Moritz Herrmann

Dr.

Transfer Coordinator

Biometry in Molecular Medicine


[19]
S. Hoffmann, F. Scheipl and A.-L. Boulesteix.
Reproduzierbare und replizierbare Forschung [Reproducible and replicable research].
Moderne Verfahren der Angewandten Statistik (Sep. 2023). DOI
Abstract

In recent years, reports on the lack of replicability and reproducibility of research findings have received much attention and have called into question the way scientific studies are planned, analyzed, and reported. The statistical planning and analysis of scientific studies requires a multitude of decisions for which there are no unambiguously right or wrong choices. We explain how this multiplicity of possible analysis strategies, which can be described in terms of model, data preprocessing, and method uncertainty, can combine with selective reporting to produce results that cannot be replicated on independent data. We also present strategies for improving the replicability of results, as well as practices and tools for making analyses reproducible.

MCML Authors
Fabian Scheipl

PD Dr.

Functional Data Analysis

Anne-Laure Boulesteix

Prof. Dr.

Biometry in Molecular Medicine


[18]
A. Volkmann, A. Stöcker, F. Scheipl and S. Greven.
Multivariate Functional Additive Mixed Models.
Statistical Modelling 23.4 (Aug. 2023). DOI
Abstract

Multivariate functional data can be intrinsically multivariate, like movement trajectories in 2D, or complementary, such as precipitation, temperature and wind speeds over time at a given weather station. We propose a multivariate functional additive mixed model (multiFAMM) and show its application to both data situations using examples from sports science (movement trajectories of snooker players) and phonetic science (acoustic signals and articulation of consonants). The approach includes linear and nonlinear covariate effects and models the dependency structure between the dimensions of the responses using multivariate functional principal component analysis. Multivariate functional random intercepts capture both the auto-correlation within a given function and cross-correlations between the multivariate functional dimensions. They also allow us to model between-function correlations as induced by, for example, repeated measurements or crossed study designs. Modelling the dependency structure between the dimensions can generate additional insight into the properties of the multivariate functional process, improves the estimation of random effects, and yields corrected confidence bands for covariate effects. Extensive simulation studies indicate that a multivariate modelling approach is more parsimonious than fitting independent univariate models to the data while maintaining or improving model fit.

MCML Authors
Fabian Scheipl

PD Dr.

Functional Data Analysis


[17]
M. Herrmann, F. Pfisterer and F. Scheipl.
A geometric framework for outlier detection in high-dimensional data.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery e1491 (Apr. 2023). DOI
Abstract

Outlier or anomaly detection is an important task in data analysis. We discuss the problem from a geometrical perspective and provide a framework which exploits the metric structure of a data set. Our approach rests on the manifold assumption, that is, that the observed, nominally high-dimensional data lie on a much lower dimensional manifold and that this intrinsic structure can be inferred with manifold learning methods. We show that exploiting this structure significantly improves the detection of outlying observations in high dimensional data. We also suggest a novel, mathematically precise and widely applicable distinction between distributional and structural outliers based on the geometry and topology of the data manifold that clarifies conceptual ambiguities prevalent throughout the literature. Our experiments focus on functional data as one class of structured high-dimensional data, but the framework we propose is completely general and we include image and graph data applications. Our results show that the outlier structure of high-dimensional and non-tabular data can be detected and visualized using manifold learning methods and quantified using standard outlier scoring methods applied to the manifold embedding vectors.

MCML Authors
Moritz Herrmann

Dr.

Transfer Coordinator

Biometry in Molecular Medicine

Fabian Scheipl

PD Dr.

Functional Data Analysis


2022


[16]
L. Bothmann, S. Strickroth, G. Casalicchio, D. Rügamer, M. Lindauer, F. Scheipl and B. Bischl.
Developing Open Source Educational Resources for Machine Learning and Data Science.
ECML-PKDD 2022 - 3rd Teaching Machine Learning and Artificial Intelligence Workshop at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Grenoble, France, Sep 19-23, 2022. URL
Abstract

Education should not be a privilege but a common good. It should be openly accessible to everyone, with as few barriers as possible; even more so for key technologies such as Machine Learning (ML) and Data Science (DS). Open Educational Resources (OER) are a crucial factor for greater educational equity. In this paper, we describe the specific requirements for OER in ML and DS and argue that it is especially important for these fields to make source files publicly available, leading to Open Source Educational Resources (OSER). We present our view on the collaborative development of OSER, the challenges this poses, and first steps towards their solutions. We outline how OSER can be used for blended learning scenarios and share our experiences in university education. Finally, we discuss additional challenges such as credit assignment or granting certificates.

MCML Authors
Ludwig Bothmann

Dr.

Statistical Learning & Data Science

Giuseppe Casalicchio

Dr.

Statistical Learning & Data Science

David Rügamer

Prof. Dr.

Data Science Group

Fabian Scheipl

PD Dr.

Functional Data Analysis

Bernd Bischl

Prof. Dr.

Statistical Learning & Data Science


[15]
J. Goldsmith and F. Scheipl.
tf: S3 classes and methods for tidy functional data. R package.
2022. GitHub
Abstract

The goal of tidyfun, in turn, is to provide accessible and well-documented software that makes functional data analysis in R easy – specifically data wrangling and exploratory analysis.

MCML Authors
Fabian Scheipl

PD Dr.

Functional Data Analysis


[14]
J. Goldsmith and F. Scheipl.
tidyfun: Clean, wholesome, tidy fun with functional data in R. R package.
2022. GitHub
Abstract

The goal of tidyfun is to provide accessible and well-documented software that makes functional data analysis in R easy – specifically data wrangling and exploratory analysis.

MCML Authors
Fabian Scheipl

PD Dr.

Functional Data Analysis


[13]
M. Herrmann.
Towards more reliable machine learning: conceptual insights and practical approaches for unsupervised manifold learning and supervised benchmark studies.
Dissertation 2022. DOI
Abstract

This thesis focuses on improving the reliability and trustworthiness of machine learning, particularly in unsupervised learning methods like manifold learning. It investigates the challenges of evaluating manifold learning techniques and proposes improvements for embedding evaluation, outlier detection, and cluster analysis, using methods like UMAP and DBSCAN. Additionally, the thesis contributes to supervised learning by presenting a benchmark study on survival prediction in multi-omics cancer data and exploring the effects of design and analysis choices on benchmark results. (Shortened).

MCML Authors
Moritz Herrmann

Dr.

Transfer Coordinator

Biometry in Molecular Medicine


2021


[12]
M. Herrmann and F. Scheipl.
A Geometric Perspective on Functional Outlier Detection.
Stats 4.4 (Nov. 2021). DOI
Abstract

We consider functional outlier detection from a geometric perspective, specifically: for functional datasets drawn from a functional manifold, which is defined by the data’s modes of variation in shape, translation, and phase. Based on this manifold, we developed a conceptualization of functional outlier detection that is more widely applicable and realistic than previously proposed taxonomies. Our theoretical and experimental analyses demonstrated several important advantages of this perspective: it considerably improves theoretical understanding and allows describing and analyzing complex functional outlier scenarios consistently and in full generality, by differentiating between structurally anomalous outlier data that are off-manifold and distributionally outlying data that are on-manifold, but at its margins. This improves the practical feasibility of functional outlier detection: we show that simple manifold-learning methods can be used to reliably infer and visualize the geometric structure of functional datasets. We also show that standard outlier-detection methods requiring tabular data inputs can be applied to functional data very successfully by simply using their vector-valued representations learned from manifold learning methods as the input features. Our experiments on synthetic and real datasets demonstrated that this approach leads to outlier detection performances at least on par with existing functional-data-specific methods in a large variety of settings, without the highly specialized, complex methodology and narrow domain of application these methods often entail.

MCML Authors
Moritz Herrmann

Dr.

Transfer Coordinator

Biometry in Molecular Medicine

Fabian Scheipl

PD Dr.

Functional Data Analysis


[11]
A. Bauer, F. Scheipl and H. Küchenhoff.
Registration for Incomplete Non-Gaussian Functional Data.
Preprint (Aug. 2021). arXiv
Abstract

Accounting for phase variability is a critical challenge in functional data analysis. To separate it from amplitude variation, functional data are registered, i.e., their observed domains are deformed elastically so that the resulting functions are aligned with template functions. At present, most available registration approaches are limited to datasets of complete and densely measured curves with Gaussian noise. However, many real-world functional data sets are not Gaussian and contain incomplete curves, in which the underlying process is not recorded over its entire domain. In this work, we extend and refine a framework for joint likelihood-based registration and latent Gaussian process-based generalized functional principal component analysis that is able to handle incomplete curves. Our approach is accompanied by sophisticated open-source software, allowing for its application in diverse non-Gaussian data settings and a public code repository to reproduce all results. We register data from a seismological application comprising spatially indexed, incomplete ground velocity time series with a highly volatile Gamma structure. We describe, implement and evaluate the approach for such incomplete non-Gaussian functional data and compare it to existing routines.

MCML Authors
Fabian Scheipl

PD Dr.

Functional Data Analysis

Helmut Küchenhoff

Prof. Dr.

Statistical Consulting Unit (StaBLab)


2020


[10]
M. Herrmann and F. Scheipl.
Unsupervised Functional Data Analysis via Nonlinear Dimension Reduction.
Preprint (Dec. 2020). arXiv
Abstract

In recent years, manifold methods have moved into focus as tools for dimension reduction. Assuming that the high-dimensional data actually lie on or close to a low-dimensional nonlinear manifold, these methods have shown convincing results in several settings. This manifold assumption is often reasonable for functional data, i.e., data representing continuously observed functions, as well. However, the performance of manifold methods recently proposed for tabular or image data has not yet been systematically assessed in the case of functional data. Moreover, it is unclear how to evaluate the quality of learned embeddings that do not yield invertible mappings, since the reconstruction error cannot be used as a performance measure for such representations. In this work, we describe and investigate the specific challenges for nonlinear dimension reduction posed by the functional data setting. The contributions of the paper are three-fold: First, we define a theoretical framework which allows us to systematically assess specific challenges that arise in the functional data context, transfer several nonlinear dimension reduction methods for tabular and image data to functional data, and show that manifold methods can be used successfully in this setting. Secondly, we subject performance assessment and tuning strategies to a thorough and systematic evaluation based on several different functional data settings and point out some previously undescribed weaknesses and pitfalls which can jeopardize reliable judgment of embedding quality. Thirdly, we propose a nuanced approach to make trustworthy decisions for or against competing nonconforming embeddings more objectively.

MCML Authors
Moritz Herrmann

Dr.

Transfer Coordinator

Biometry in Molecular Medicine

Fabian Scheipl

PD Dr.

Functional Data Analysis


[9]
A. Bender, D. Rügamer, F. Scheipl and B. Bischl.
A General Machine Learning Framework for Survival Analysis.
ECML-PKDD 2020 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Virtual, Sep 14-18, 2020. DOI
Abstract

The modeling of time-to-event data, also known as survival analysis, requires specialized methods that can deal with censoring and truncation, time-varying features and effects, and that extend to settings with multiple competing events. However, many machine learning methods for survival analysis only consider the standard setting with right-censored data and the proportional hazards assumption. The methods that do provide extensions usually address at most a subset of these challenges and often require specialized software that cannot be integrated into standard machine learning workflows directly. In this work, we present a very general machine learning framework for time-to-event analysis that uses a data augmentation strategy to reduce complex survival tasks to standard Poisson regression tasks. This reformulation is based on well-developed statistical theory. With the proposed approach, any algorithm that can optimize a Poisson (log-)likelihood, such as gradient boosted trees, deep neural networks, model-based boosting and many more, can be used in the context of time-to-event analysis. The proposed technique does not require any assumptions with respect to the distribution of event times or the functional shapes of feature and interaction effects. Based on the proposed framework, we develop new methods that are competitive with specialized state-of-the-art approaches in terms of accuracy and versatility, but with comparatively small investments of programming effort or requirements for specialized methodological know-how.

MCML Authors
Andreas Bender

Dr.

Machine Learning Consulting Unit (MLCU)

David Rügamer

Prof. Dr.

Data Science Group

Fabian Scheipl

PD Dr.

Functional Data Analysis

Bernd Bischl

Prof. Dr.

Statistical Learning & Data Science


[8]
M. Herrmann.
fda-ndr: Unsupervised Functional Data Analysis via Nonlinear Dimension Reduction. R package.
2020. GitHub
Abstract

Repository containing material to reproduce the results of ‘Unsupervised Functional Data Analysis via Nonlinear Dimension Reduction’ (https://arxiv.org/abs/2012.11987).

MCML Authors
Moritz Herrmann

Dr.

Transfer Coordinator

Biometry in Molecular Medicine


[7]
M. Herrmann.
manifun: Collection of functions to work with embeddings and functional data. R package.
2020. GitHub
Abstract

Collection of functions to work with embeddings and functional data.

MCML Authors
Moritz Herrmann

Dr.

Transfer Coordinator

Biometry in Molecular Medicine


[6]
F. Scheipl, J. Goldsmith and J. Wrobel.
tidyfun: Tools for Tidy Functional Data. R package.
2020. URL GitHub
Abstract

The goal of tidyfun is to provide accessible and well-documented software that makes functional data analysis in R easy – specifically data wrangling and exploratory analysis.

MCML Authors
Fabian Scheipl

PD Dr.

Functional Data Analysis


[5]
J. Wrobel, A. Bauer, J. McDonnel and F. Scheipl.
registr: Curve Registration for Exponential Family Functional Data. R package.
2020. GitHub
Abstract

Registration for incomplete exponential family functional data.

MCML Authors
Fabian Scheipl

PD Dr.

Functional Data Analysis


2019


[4]
F. Pfisterer, L. Beggel, X. Sun, F. Scheipl and B. Bischl.
Benchmarking time series classification – Functional data vs machine learning approaches.
Preprint (Nov. 2019). arXiv
Abstract

Time series classification problems have drawn increasing attention in the machine learning and statistical community. Closely related is the field of functional data analysis (FDA): it refers to the range of problems that deal with the analysis of data that is continuously indexed over some domain. While often employing different methods, both fields strive to answer similar questions, a common example being classification or regression problems with functional covariates. We study methods from functional data analysis, such as functional generalized additive models, as well as functionality to concatenate (functional-) feature extraction or basis representations with traditional machine learning algorithms like support vector machines or classification trees. In order to assess the methods and implementations, we run a benchmark on a wide variety of representative (time series) data sets, with in-depth analysis of empirical results, and strive to provide a reference ranking for which method(s) to use for non-expert practitioners. Additionally, we provide a software framework in R for functional data analysis for supervised learning, including machine learning and more linear approaches from statistics. This allows convenient access, and in connection with the machine-learning toolbox mlr, those methods can now also be tuned and benchmarked.

MCML Authors
Fabian Scheipl

PD Dr.

Functional Data Analysis

Bernd Bischl

Prof. Dr.

Statistical Learning & Data Science


[3]
C. Happ, F. Scheipl, A.-A. Gabriel and S. Greven.
A general framework for multivariate functional principal component analysis of amplitude and phase variation.
Stat 8.2 (Feb. 2019). DOI
Abstract

Functional data typically contain amplitude and phase variation. In many data situations, phase variation is treated as a nuisance effect and is removed during preprocessing, although it may contain valuable information. In this note, we focus on joint principal component analysis (PCA) of amplitude and phase variation. As the space of warping functions has a complex geometric structure, one key element of the analysis is transforming the warping functions to $L^2(\mathcal{T})$. We present different transformation approaches and show how they fit into a general class of transformations. This allows us to compare their strengths and limitations. In the context of PCA, our results offer arguments in favour of the centred log-ratio transformation. We further embed two existing approaches from the literature for joint PCA of amplitude and phase variation into the framework of multivariate functional PCA, where we study the properties of the estimators based on an appropriate metric. The approach is illustrated through an application from seismology.

MCML Authors
Fabian Scheipl

PD Dr.

Functional Data Analysis


[2]
J. Goldsmith, F. Scheipl, L. Huang, J. Wrobel, C. Di, J. Gellar, J. Harezlak, M. W. McLean, B. Swihart, L. Xiao, C. Crainiceanu and P. T. Reiss.
refund: Regression with Functional Data.
2019. URL
Abstract

Methods for regression for functional data, including function-on-scalar, scalar-on-function, and function-on-function regression. Some of the functions are applicable to image data.

MCML Authors
Fabian Scheipl

PD Dr.

Functional Data Analysis


2018


[1]
J. Minkwitz, F. Scheipl, E. Binder, C. Sander, U. Hegerl and H. Himmerich.
Generalised functional additive models for brain arousal state dynamics (Poster).
IPEG 2018 - 20th International Pharmaco-EEG Society for Preclinical and Clinical Electrophysiological Brain Research Meeting. Zurich, Switzerland, Nov 21-25, 2018. DOI
MCML Authors
Fabian Scheipl

PD Dr.

Functional Data Analysis