Home | Research | Groups | Matthias Feurer

Research Group Matthias Feurer

Matthias Feurer

Prof. Dr.

Thomas Bayes Fellow

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Matthias Feurer

is a Thomas Bayes Fellow of MCML and Interim Professor at the Chair of Statistical Learning and Data Science at LMU Munich.

His research focuses on simplifying machine learning for both domain scientists and expert users by developing tools and methods within Automated Machine Learning (AutoML), covering hyperparameter optimization, meta-learning, and model selection. He prioritizes multi-objective AutoML to address goals beyond predictive performance, such as interpretability, deployability, and fairness, and contribute to open-source AutoML projects, including co-founding the Open Machine Learning Foundation.

Team members @MCML

PhD Students

David Rundel

A1 | Statistical Foundations & Explainability
→ Group Matthias Feurer

Statistical Learning and Data Science

Publications @MCML

2025

[17]

L. Schneider, B. Bischl and M. Feurer.
Overtuning in Hyperparameter Optimization.
AutoML 2025 - Methods Track - Methods Track at the International Conference on Automated Machine Learning. New York City, NY, USA, Sep 08-11, 2025. To be published. URL

Abstract

Hyperparameter optimization (HPO) aims to identify an optimal hyperparameter configuration (HPC) such that the resulting model generalizes well to unseen data. Since directly optimizing the expected generalization error is impossible, resampling techniques like holdout validation or cross-validation are used as proxy measures in HPO. However, this implicitly assumes that the HPC minimizing validation error will also yield the best true generalization performance. Given that our inner validation error estimate is inherently stochastic and depends on the resampling, we study: Can excessive optimization of the validation error lead to a similarly detrimental effect as excessive optimization of the empirical risk of an ML model? This phenomenon, which we refer to as overtuning, represents a form of overfitting at the HPO level. Despite its potential impact, overtuning has received limited attention in the HPO and automated machine learning (AutoML) literature. We first formally define overtuning and distinguish it from related concepts such as meta-overfitting. We then reanalyze large-scale HPO benchmark data, assessing how frequently overtuning occurs and its practical relevance. Our findings suggest that overtuning is more common than expected, although often mild. However, in 10% of cases, severe overtuning results in selecting an HPC whose generalization performance is worse than the default HPC. We further examine how factors such as the chosen performance metric, resampling method, dataset size, learning algorithm, and optimization strategy influence overtuning and discuss potential mitigation strategies. Our results highlight the need to raise awareness of overtuning, particularly in the small-data regime, indicating that further mitigation strategies should be studied.

MCML Authors

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

[16]

T. Zehle, M. Schlager, T. Heiß and M. Feurer.
CAPO: Cost-Aware Prompt Optimization.
AutoML 2025 - International Conference on Automated Machine Learning. New York City, NY, USA, Sep 08-11, 2025. To be published. Preprint available. arXiv

Abstract

Large language models (LLMs) have revolutionized natural language processing by solving a wide range of tasks simply guided by a prompt. Yet their performance is highly sensitive to prompt formulation. While automated prompt optimization addresses this challenge by finding optimal prompts, current methods require a substantial number of LLM calls and input tokens, making prompt optimization expensive. We introduce CAPO (Cost-Aware Prompt Optimization), an algorithm that enhances prompt optimization efficiency by integrating AutoML techniques. CAPO is an evolutionary approach with LLMs as operators, incorporating racing to save evaluations and multi-objective optimization to balance performance with prompt length. It jointly optimizes instructions and few-shot examples while leveraging task descriptions for improved robustness. Our extensive experiments across diverse datasets and LLMs demonstrate that CAPO outperforms state-of-the-art discrete prompt optimization methods in 11/15 cases with improvements up to 21%p. Our algorithm achieves better performances already with smaller budgets, saves evaluations through racing, and decreases average prompt length via a length penalty, making it both cost-efficient and cost-aware. Even without few-shot examples, CAPO outperforms its competitors and generally remains robust to initial prompts. CAPO represents an important step toward making prompt optimization more powerful and accessible by improving cost-efficiency.

MCML Authors

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

[15]

M. Koshil, M. Feurer and K. Eggensperger.
In-Context Learning of Soft Nearest Neighbor Classifiers for Intelligible Tabular Machine Learning.
TRL @ACL 2025 - 4th Table Representation Learning Workshop at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

With in-context learning foundation models like TabPFN excelling on small supervised tabular learning tasks, it has been argued that ‘boosted trees are not the best default choice when working with data in tables’. However, such foundation models are inherently black-box models that do not provide interpretable predictions. We introduce a novel learning task to train ICL models to act as a nearest neighbor algorithm, which enables intelligible inference and does not decrease performance empirically.

MCML Authors

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

[14]

B. Bischl, G. Casalicchio, T. Das, M. Feurer, S. Fischer, P. Gijsbers, S. Mukherjee, A. C. Müller, L. Németh, L. Oala, L. Purucker, S. Ravi, J. N. van Rijn, P. Singh, J. Vanschoren, J. van der Velde and M. Wever.
OpenML: Insights from 10 years and more than a thousand papers.
Patterns In Press, Corrected Proof (Jul. 2025). DOI

Abstract

OpenML is an open-source platform that democratizes machine-learning evaluation by enabling anyone to share datasets in uniform standards, define precise machine-learning tasks, and automatically share detailed workflows and model evaluations. More than just a platform, OpenML fosters a collaborative ecosystem where scientists create new tools, launch initiatives, and establish standards to advance machine learning. Over the past decade, OpenML has inspired over 1,500 publications across diverse fields, from scientists releasing new datasets and benchmarking new models to educators teaching reproducible science. Looking back, we detail and describe the platform’s impact by looking at usage and citations. We share lessons from a decade of building, maintaining, and expanding OpenML, highlighting how rich metadata, collaborative benchmarking, and open interfaces have enhanced research and interoperability. Looking ahead, we cover ongoing efforts to expand OpenML’s capabilities and integrate with other platforms, informing a broader vision for open-science infrastructure for machine learning.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Sebastian Fischer

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[13]

C. Benjamins, H. Graf, S. Segel, D. Deng, T. Ruhkopf, L. Hennig, S. Basu, N. Mallik, E. Bergman, D. Chen, F. Clément, M. Feurer, K. Eggensperger, F. Hutter, C. Doerr and M. Lindauer.
carps: A Framework for Comparing N Hyperparameter Optimizers on M Benchmarks.
Preprint (Jun. 2025). arXiv URL

Abstract

Hyperparameter Optimization (HPO) is crucial to develop well-performing machine learning models. In order to ease prototyping and benchmarking of HPO methods, we propose carps, a benchmark framework for Comprehensive Automated Research Performance Studies allowing to evaluate N optimizers on M benchmark tasks. In this first release of carps, we focus on the four most important types of HPO task types: blackbox, multi-fidelity, multi-objective and multi-fidelity-multi-objective. With 3 336 tasks from 5 community benchmark collections and 28 variants of 9 optimizer families, we offer the biggest go-to library to date to evaluate and compare HPO methods. The carps framework relies on a purpose-built, lightweight interface, gluing together optimizers and benchmark tasks. It also features an analysis pipeline, facilitating the evaluation of optimizers on benchmarks. However, navigating a huge number of tasks while developing and comparing methods can be computationally infeasible. To address this, we obtain a subset of representative tasks by minimizing the star discrepancy of the subset, in the space spanned by the full set. As a result, we propose an initial subset of 10 to 30 diverse tasks for each task type, and include functionality to re-compute subsets as more benchmarks become available, enabling efficient evaluations. We also establish a first set of baseline results on these tasks as a measure for future comparisons. With carps (this https URL), we make an important step in the standardization of HPO evaluation.

MCML Authors

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

[12]

D. Rundel, E. Sommer, B. Bischl, D. Rügamer and M. Feurer.
Efficiently Warmstarting MCMC for BNNS.
FPI @ICLR 2025 - Workshop on Frontiers in Probabilistic Inference: Learning meets Sampling at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. URL

Abstract

Markov Chain Monte Carlo (MCMC) algorithms are widely regarded as the gold standard for approximate inference in Bayesian neural networks (BNNs). However, they remain computationally expensive and prone to inefficiencies, such
as dying samplers, frequently leading to substantial waste of computational resources. While prior work has presented warmstarting techniques as an effective method to mitigate these inefficiencies, we provide a more comprehensive empirical analysis of how initializations of samplers affect their behavior. Based on various experiments examining the dynamics of warmstarting MCMC, we propose novel warmstarting strategies that leverage performance predictors and adaptive termination criteria to achieve better-performing, yet more cost-efficient, models. In numerical experiments, we demonstrate that this approach provides a practical pathway to more resource-efficient approximate inference in BNNs.

MCML Authors

David Rundel

A1 | Statistical Foundations & Explainability
→ Group Matthias Feurer

Statistical Learning and Data Science

Emanuel Sommer

A1 | Statistical Foundations & Explainability
→ Group David Rügamer

Statistics, Data Science and Machine Learning

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistics, Data Science and Machine Learning

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

2024

[11]

T. Nagler, L. Schneider, B. Bischl and M. Feurer.
Reshuffling Resampling Splits Can Improve Generalization of Hyperparameter Optimization.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub

Abstract

Hyperparameter optimization is crucial for obtaining peak performance of machine learning models. The standard protocol evaluates various hyperparameter configurations using a resampling estimate of the generalization error to guide optimization and select a final hyperparameter configuration. Without much evidence, paired resampling splits, i.e., either a fixed train-validation split or a fixed cross-validation scheme, are often recommended. We show that, surprisingly, reshuffling the splits for every configuration often improves the final model’s generalization performance on unseen data. Our theoretical analysis explains how reshuffling affects the asymptotic behavior of the validation loss surface and provides a bound on the expected regret in the limiting regime. This bound connects the potential benefits of reshuffling to the signal and noise characteristics of the underlying optimization problem. We confirm our theoretical results in a controlled simulation study and demonstrate the practical usefulness of reshuffling in a large-scale, realistic hyperparameter optimization experiment. While reshuffling leads to test performances that are competitive with using fixed splits, it drastically improves results for a single train-validation holdout protocol and can often make holdout become competitive with standard CV while being computationally cheaper.

MCML Authors

Thomas Nagler

Prof. Dr.

A1 | Statistical Foundations & Explainability

Computational Statistics & Data Science

Lennart Schneider

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

[10]

M. Koshil, T. Nagler, M. Feurer and K. Eggensperger.
Towards Localization via Data Embedding for TabPFN.
TLR @NeurIPS 2024 - 3rd Table Representation Learning Workshop at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Vancouver, Canada, Dec 10-15, 2024. URL

Abstract

Prior-data fitted networks (PFNs), especially TabPFN, have shown significant promise in tabular data prediction. However, their scalability is limited by the quadratic complexity of the transformer architecture’s attention across training points. In this work, we propose a method to localize TabPFN, which embeds data points into a learned representation and performs nearest neighbor selection in this space. We evaluate it across six datasets, demonstrating its superior performance over standard TabPFN when scaling to larger datasets. We also explore its design choices and analyze the bias-variance trade-off of this localization method, showing that it reduces bias while maintaining manageable variance. This work opens up a pathway for scaling TabPFN to arbitrarily large tabular datasets.

MCML Authors

Thomas Nagler

Prof. Dr.

A1 | Statistical Foundations & Explainability

Computational Statistics & Data Science

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

[9]

E. Bergman, M. Feurer, A. Bahram, A. R. Balef, L. Purucker, S. Segel, M. Lindauer, F. Hutter and K. Eggensperger.
AMLTK: A Modular AutoML Toolkit in Python.
The Journal of Open Source Software 9.100 (Aug. 2024). DOI

Abstract

Machine Learning is a core building block in novel data-driven applications. Practitioners face many ambiguous design decisions while developing practical machine learning (ML) solutions. Automated machine learning (AutoML) facilitates the development of machine learning applications by providing efficient methods for optimizing hyperparameters, searching for neural architectures, or constructing whole ML pipelines (Hutter et al., 2019). Thereby, design decisions such as the choice of modelling, pre-processing, and training algorithm are crucial to obtaining well-performing solutions. By automatically obtaining ML solutions, AutoML aims to lower the barrier to leveraging machine learning and reduce the time needed to develop or adapt ML solutions for new domains or data.
Highly performant software packages for automatically building ML pipelines given data, so-called AutoML systems, are available and can be used off-the-shelf. Typically, AutoML systems evaluate ML models sequentially to return a well-performing single best model or multiple models combined into an ensemble. Existing AutoML systems are typically highly engineered monolithic software developed for specific use cases to perform well and robustly under various conditions…

MCML Authors

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

[8]

M. Herrmann, F. J. D. Lange, K. Eggensperger, G. Casalicchio, M. Wever, M. Feurer, D. Rügamer, E. Hüllermeier, A.-L. Boulesteix and B. Bischl.
Position: Why We Must Rethink Empirical Research in Machine Learning.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL

Abstract

We warn against a common but incomplete understanding of empirical research in machine learning (ML) that leads to non-replicable results, makes findings unreliable, and threatens to undermine progress in the field. To overcome this alarming situation, we call for more awareness of the plurality of ways of gaining knowledge experimentally but also of some epistemic limitations. In particular, we argue most current empirical ML research is fashioned as confirmatory research while it should rather be considered exploratory.

MCML Authors

Moritz Herrmann

Dr.

Transfer Coordinator

A1 | Statistical Foundations & Explainability
→ Group Anne-Laure Boulesteix

Biometry in Molecular Medicine

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Marcel Wever

Dr.

A3 | Computational Models
→ Group Eyke Hüllermeier

* Former Member

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistics, Data Science and Machine Learning

Eyke Hüllermeier

Prof. Dr.

A3 | Computational Models

Artificial Intelligence and Machine Learning

Anne-Laure Boulesteix

Prof. Dr.

A1 | Statistical Foundations & Explainability

Biometry in Molecular Medicine

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

[7]

M. Lindauer, F. Karl, A. Klier, J. Moosbauer, A. Tornede, A. C. Mueller, F. Hutter, M. Feurer and B. Bischl.
Position: A Call to Action for a Human-Centered AutoML Paradigm.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL

Abstract

Automated machine learning (AutoML) was formed around the fundamental objectives of automatically and efficiently configuring machine learning (ML) workflows, aiding the research of new ML algorithms, and contributing to the democratization of ML by making it accessible to a broader audience. Over the past decade, commendable achievements in AutoML have primarily focused on optimizing predictive performance. This focused progress, while substantial, raises questions about how well AutoML has met its broader, original goals. In this position paper, we argue that a key to unlocking AutoML’s full potential lies in addressing the currently underexplored aspect of user interaction with AutoML systems, including their diverse roles, expectations, and expertise. We envision a more human-centered approach in future AutoML research, promoting the collaborative design of ML systems that tightly integrates the complementary strengths of human expertise and AutoML methodologies.

MCML Authors

Florian Karl

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Julia Moosbauer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

[6]

D. Rundel, J. Kobialka, C. von Crailsheim, M. Feurer, T. Nagler and D. Rügamer.
Interpretable Machine Learning for TabPFN.
xAI 2024 - 2nd World Conference on Explainable Artificial Intelligence. Valletta, Malta, Jul 17-19, 2024. DOI GitHub

Abstract

The recently developed Prior-Data Fitted Networks (PFNs) have shown very promising results for applications in low-data regimes. The TabPFN model, a special case of PFNs for tabular data, is able to achieve state-of-the-art performance on a variety of classification tasks while producing posterior predictive distributions in mere seconds by in-context learning without the need for learning parameters or hyperparameter tuning. This makes TabPFN a very attractive option for a wide range of domain applications. However, a major drawback of the method is its lack of interpretability. Therefore, we propose several adaptations of popular interpretability methods that we specifically design for TabPFN. By taking advantage of the unique properties of the model, our adaptations allow for more efficient computations than existing implementations. In particular, we show how in-context learning facilitates the estimation of Shapley values by avoiding approximate retraining and enables the use of Leave-One-Covariate-Out (LOCO) even when working with large-scale Transformers. In addition, we demonstrate how data valuation methods can be used to address scalability challenges of TabPFN.

MCML Authors

David Rundel

A1 | Statistical Foundations & Explainability
→ Group Matthias Feurer

Statistical Learning and Data Science

Julius Kobialka

A1 | Statistical Foundations & Explainability
→ Group David Rügamer

Statistics, Data Science and Machine Learning

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Thomas Nagler

Prof. Dr.

A1 | Statistical Foundations & Explainability

Computational Statistics & Data Science

David Rügamer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistics, Data Science and Machine Learning

[5]

R. Kohli, M. Feurer, B. Bischl, K. Eggensperger and F. Hutter.
Towards Quantifying the Effect of Datasets for Benchmarking: A Look at Tabular Machine Learning.
DMLR @ICLR 2024 - Workshop on Data-centric Machine Learning Research at the 12th International Conference on Learning Representations (ICLR 2024). Vienna, Austria, May 07-11, 2024. URL

Abstract

Data in tabular form makes up a large part of real-world ML applications, and thus, there has been a strong interest in developing novel deep learning (DL) architectures for supervised learning on tabular data in recent years. As a result, there is a debate as to whether DL methods are superior to the ubiquitous ensembles of boosted decision trees. Typically, the advantage of one model class over the other is claimed based on an empirical evaluation, where different variations of both model classes are compared on a set of benchmark datasets that supposedly resemble relevant real-world tabular data. While the landscape of state-of-the-art models for tabular data changed, one factor has remained largely constant over the years: The datasets. Here, we examine 30 recent publications and 187 different datasets they use, in terms of age, study size and relevance. We found that the average study used less than 10 datasets and that half of the datasets are older than 20 years. Our insights raise questions about the conclusions drawn from previous studies and urge the research community to develop and publish additional recent, challenging and relevant datasets and ML tasks for supervised learning on tabular data.

MCML Authors

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

[4]

H. Weerts, F. Pfisterer, M. Feurer, K. Eggensperger, E. Bergman, N. Awad, J. Vanschoren, M. Pechenizkiy, B. Bischl and F. Hutter.
Can Fairness be Automated? Guidelines and Opportunities for Fairness-aware AutoML.
Journal of Artificial Intelligence Research 79 (Feb. 2024). DOI

Abstract

The field of automated machine learning (AutoML) introduces techniques that automate parts of the development of machine learning (ML) systems, accelerating the process and reducing barriers for novices. However, decisions derived from ML models can reproduce, amplify, or even introduce unfairness in our societies, causing harm to (groups of) individuals. In response, researchers have started to propose AutoML systems that jointly optimize fairness and predictive performance to mitigate fairness-related harm. However, fairness is a complex and inherently interdisciplinary subject, and solely posing it as an optimization problem can have adverse side effects. With this work, we aim to raise awareness among developers of AutoML systems about such limitations of fairness-aware AutoML, while also calling attention to the potential of AutoML as a tool for fairness research. We present a comprehensive overview of different ways in which fairness-related harm can arise and the ensuing implications for the design of fairness-aware AutoML. We conclude that while fairness cannot be automated, fairness-aware AutoML can play an important role in the toolbox of ML practitioners. We highlight several open technical challenges for future work in this direction. Additionally, we advocate for the creation of more user-centered assistive systems designed to tackle challenges encountered in fairness work.

MCML Authors

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

2023

[3]

S. F. Fischer, L. Harutyunyan, M. Feurer and B. Bischl.
OpenML-CTR23 - A curated tabular regression benchmarking suite.
AutoML 2023 - International Conference on Automated Machine Learning - Workshop Track. Berlin, Germany, Sep 12-15, 2023. URL

Abstract

Benchmark experiments are one of the cornerstones of modern machine learning research. An essential part in the design of such experiments is the selection of datasets. We present the OpenML Curated Tabular Regression benchmarking suite 2023 (OpenML-CTR23). It is available on OpenML and comprises 35 regression problems that have been selected according to a set of strict criteria. We compare its design with existing regression benchmark suites and also challenge some of the dataset choices of previous efforts. As a first experiment, we compare five machine learning methods of varying complexity on the OpenML-CTR23.

MCML Authors

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

[2]

M. Feurer, K. Eggensperger, E. Bergman, F. Pfisterer, B. Bischl and F. Hutter.
Mind the Gap: Measuring Generalization Performance Across Multiple Objectives.
IDA 2023 - 21st International Symposium on Intelligent Data Analysis. Louvain-la-Neuve, Belgium, Apr 12-14, 2023. DOI

Abstract

Modern machine learning models are often constructed taking into account multiple objectives, e.g., minimizing inference time while also maximizing accuracy. Multi-objective hyperparameter optimization (MHPO) algorithms return such candidate models, and the approximation of the Pareto front is used to assess their performance. In practice, we also want to measure generalization when moving from the validation to the test set. However, some of the models might no longer be Pareto-optimal which makes it unclear how to quantify the performance of the MHPO method when evaluated on the test set. To resolve this, we provide a novel evaluation protocol that allows measuring the generalization performance of MHPO methods and studying its capabilities for comparing two optimization experiments.

MCML Authors

Matthias Feurer

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Florian Pfisterer

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

2021

[1]

B. Bischl, G. Casalicchio, M. Feurer, P. Gijsbers, F. Hutter, M. Lang, R. G. Mantovani, J. N. van Rijn and J. Vanschoren.
OpenML Benchmarking Suites.
NeurIPS 2021 - Track on Datasets and Benchmarks at the 35th Conference on Neural Information Processing Systems. Virtual, Dec 06-14, 2021. URL

Abstract

Machine learning research depends on objectively interpretable, comparable, and reproducible algorithm benchmarks. We advocate the use of curated, comprehensive suites of machine learning tasks to standardize the setup, execution, and reporting of benchmarks. We enable this through software tools that help to create and leverage these benchmarking suites. These are seamlessly integrated into the OpenML platform, and accessible through interfaces in Python, Java, and R. OpenML benchmarking suites (a) are easy to use through standardized data formats, APIs, and client libraries; (b) come with extensive meta-information on the included datasets; and (c) allow benchmarks to be shared and reused in future studies. We then present a first, carefully curated and practical benchmarking suite for classification: the OpenML Curated Classification benchmarking suite 2018 (OpenML-CC18). Finally, we discuss use cases and applications which demonstrate the usefulness of OpenML benchmarking suites and the OpenML-CC18 in particular.

MCML Authors

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Giuseppe Casalicchio

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science