
Towards Quantifying the Effect of Datasets for Benchmarking: A Look at Tabular Machine Learning

MCML Authors

Matthias Feurer

Prof. Dr.

Former Thomas Bayes Fellow


Bernd Bischl

Prof. Dr.

Director

Abstract

Data in tabular form makes up a large part of real-world ML applications, and thus, there has been strong interest in recent years in developing novel deep learning (DL) architectures for supervised learning on tabular data. As a result, there is an ongoing debate as to whether DL methods are superior to the ubiquitous ensembles of boosted decision trees. Typically, the advantage of one model class over the other is claimed based on an empirical evaluation, where different variations of both model classes are compared on a set of benchmark datasets that supposedly resemble relevant real-world tabular data. While the landscape of state-of-the-art models for tabular data has changed, one factor has remained largely constant over the years: the datasets. Here, we examine 30 recent publications and the 187 different datasets they use, in terms of age, study size, and relevance. We found that the average study used fewer than 10 datasets and that half of the datasets are older than 20 years. Our insights raise questions about the conclusions drawn from previous studies and urge the research community to develop and publish additional recent, challenging, and relevant datasets and ML tasks for supervised learning on tabular data.

inproceedings


DMLR @ICLR 2024

Workshop on Data-centric Machine Learning Research at the 12th International Conference on Learning Representations. Vienna, Austria, May 07-11, 2024.

Authors

R. Kohli • M. Feurer • B. Bischl • K. Eggensperger • F. Hutter

Links

URL

Research Area

 A1 | Statistical Foundations & Explainability

BibTeX Key: KFB+24
