On Hyperparameter Tuning on the Test Set

Master Thesis (2026)

Author(s)

M. Fregonara (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

T.J. Viering – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

J.C. van Gemert – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

O.E. Scharenborg – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Model selection, Hyperparameter tuning, Adaptive overfitting, Test-set reuse, Benchmark evaluation

To reference this document use

https://resolver.tudelft.nl/uuid:be402fbe-ec7e-480b-9c72-7b8c812918cc

Publication Year

2026

Language

English

Graduation Date

30-06-2026

Awarding Institution

Delft University of Technology

Programme

Data Science and Artificial Intelligence Technology

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

“Don’t tune hyperparameters on the test set” is a rule often stated in machine learning textbooks. Violating it is considered a cardinal sin that produces misleadingly optimistic results, corrupts benchmark integrity, and can thus even be interpreted as scientific fraud. Yet evidence suggests that test-set hyperparameter tuning does occur in practice, which makes it all the more important to understand its actual consequences. So how bad is it, really? In this work we question this dogma and put it to an empirical test. We systematically study the magnitude of the performance inflation caused by tuning hyperparameters on the test set for MNIST-1D and CIFAR-10. Our experiments show that while the effect is real and significant, it is frequently small relative to other sources of noise. In many cases, tuning on the test set recovers exactly the same model as tuning on the validation set. Most importantly, we find that model rankings are preserved after tuning on the test set, and therefore that consistent test-set tuning does not invalidate benchmarks or model selection. Our results call for a more nuanced view of tuning hyperparameters on the test set and encourage researchers to report test-set tuning openly.
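To make the quantity studied in the abstract concrete, the following is a minimal simulation sketch of selection on the test set versus selection on a validation set. It is not the thesis's experimental protocol: the function name `simulate_selection`, the number of configurations, the set sizes, the range of true accuracies, and the assumption that validation and test errors are independent across configurations are all illustrative assumptions.

```python
import numpy as np


def simulate_selection(n_configs=20, n_val=1000, n_test=1000, seed=0):
    """Toy comparison of validation-set tuning and test-set tuning.

    Assumptions (hypothetical, not taken from the thesis):
    - each hyperparameter configuration has a fixed true accuracy,
    - measured accuracies are binomial estimates of that true accuracy,
    - noise is independent across configurations and across the two sets.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical true accuracies of the candidate configurations.
    true_acc = rng.uniform(0.80, 0.90, size=n_configs)

    # Noisy accuracy estimates on finite validation and test sets.
    val_acc = rng.binomial(n_val, true_acc) / n_val
    test_acc = rng.binomial(n_test, true_acc) / n_test

    # Standard practice: select on validation, report on test.
    pick_val = int(np.argmax(val_acc))
    # Test-set tuning: select and report on the same test set.
    pick_test = int(np.argmax(test_acc))

    return {
        "same_model": pick_val == pick_test,
        "reported_val_tuned": float(test_acc[pick_val]),
        "reported_test_tuned": float(test_acc[pick_test]),
        # Inflation: reported score minus the true accuracy of the chosen model.
        "inflation": float(test_acc[pick_test] - true_acc[pick_test]),
    }


if __name__ == "__main__":
    runs = [simulate_selection(seed=s) for s in range(200)]
    mean_inflation = np.mean([r["inflation"] for r in runs])
    same_rate = np.mean([r["same_model"] for r in runs])
    print(f"mean inflation under test-set tuning: {mean_inflation:.4f}")
    print(f"fraction of runs selecting the same model: {same_rate:.2f}")
```

In this toy setting the score reported after test-set tuning can never be lower than the score reported after validation tuning, because the former is the maximum over the same test estimates. How large the gap is, and how often both procedures pick the same configuration, depends entirely on the assumed set sizes and on how spread out the true accuracies are, so the numbers it prints should not be read as results of the thesis.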

Files

On_Hyperparameter_Tuning_on_th... (pdf)

License info not available

File under embargo until 30-06-2027