When assessing the quality of prediction models in machine learning, confidence intervals (CIs) for the generalization error, which measures predictive performance, are a crucial tool. Luckily, there exist many methods for computing such CIs and new promising approaches are continuously being proposed. Typically, these methods combine various resampling procedures, most popular among them cross-validation and bootstrapping, with different variance estimation techniques. Unfortunately, however, there is currently no consensus on when any of these combinations may be most reliably employed and how they generally compare. Here, we present the results of a large-scale study comparing CIs for the generalization error, where we empirically evaluate 13 different CI methods on a total of 19 tabular regression and classification problems, using seven different learning algorithms and a total of eight loss functions. Furthermore, we give an overview of the methodological foundations and inherent challenges of constructing CIs for the generalization error.
inproceedings
BibTeXKey: BF25
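As a minimal illustrative sketch (not one of the 13 methods evaluated in the paper), the snippet below shows the simplest kind of CI the abstract alludes to: a K-fold cross-validation estimate of the generalization error with a naive normal-approximation interval over per-observation losses. The dataset, learner, loss, and significance level are all illustrative assumptions.

```python
# Naive cross-validation CI for the generalization error (illustrative sketch only).
import numpy as np
from scipy import stats
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)

K, alpha = 10, 0.05
losses = np.empty(len(y))  # squared-error loss for each held-out observation
for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    losses[test_idx] = (y[test_idx] - model.predict(X[test_idx])) ** 2

point = losses.mean()                            # CV point estimate of the generalization error
se = losses.std(ddof=1) / np.sqrt(len(losses))   # naive standard error treating losses as i.i.d.
z = stats.norm.ppf(1 - alpha / 2)
print(f"Estimated generalization error: {point:.3f} +/- {z * se:.3f}")
```

Note that this naive standard error ignores the dependence between observations induced by reusing the data across folds, which is precisely the kind of issue the variance-estimation techniques compared in the paper are designed to address.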