On Hyperparameter Tuning on the Test Set
M. Fregonara (TU Delft - Electrical Engineering, Mathematics and Computer Science)
T.J. Viering – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J.C. van Gemert – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
O.E. Scharenborg – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
“Don’t tune hyperparameters on the test set” is often stated in machine learning textbooks. Violating this rule is considered a cardinal sin that produces misleadingly optimistic results, corrupts benchmark integrity, and can therefore even be interpreted as scientific fraud. Yet evidence suggests that test-set hyperparameter tuning does occur in practice, making it all the more important to understand its actual consequences. So how bad is it, really? In this work, we question this dogma and put it to an empirical test. We systematically study the magnitude of the performance inflation caused by tuning hyperparameters on the test set for MNIST-1D and CIFAR-10. Our experiments show that while the effect is real and significant, it is frequently small relative to other sources of noise. In many cases, we find that tuning on the test set selects exactly the same model as tuning on the validation set. Most importantly, we find that model rankings are preserved after tuning on the test set, and therefore that consistent test-set tuning does not invalidate benchmarks or model selection. Our results call for a more nuanced view of tuning hyperparameters on the test set and encourage researchers to report test-set tuning openly.
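To make the notion of performance inflation concrete, the following is a minimal simulation sketch and not the protocol used in the thesis. It assumes a handful of hypothetical hyperparameter configurations with made-up true accuracies, draws noisy validation and test accuracies for each, and compares the configuration picked on the validation set with the one picked on the test set. The definition of inflation used here (test accuracy of the test-selected configuration minus test accuracy of the validation-selected one) is an assumption for illustration, as are the function name `simulate_inflation` and all numbers.

```python
import random


def simulate_inflation(true_accs, n_val, n_test, rng):
    """Run one trial: select a config by validation vs. by test accuracy.

    Returns (inflation, same_model), where inflation is the test accuracy of
    the test-selected config minus that of the validation-selected config.
    This definition is an illustrative assumption, not the thesis's metric.
    """
    def estimate(p, n):
        # Empirical accuracy of a config with true accuracy p on n i.i.d. examples.
        return sum(rng.random() < p for _ in range(n)) / n

    val_accs = [estimate(p, n_val) for p in true_accs]
    test_accs = [estimate(p, n_test) for p in true_accs]

    indices = range(len(true_accs))
    val_pick = max(indices, key=val_accs.__getitem__)    # honest selection
    test_pick = max(indices, key=test_accs.__getitem__)  # tuning on the test set

    inflation = test_accs[test_pick] - test_accs[val_pick]
    return inflation, val_pick == test_pick


rng = random.Random(0)
true_accs = [0.80, 0.82, 0.85, 0.86, 0.90]  # hypothetical configs, not real results
trials = [simulate_inflation(true_accs, 1000, 1000, rng) for _ in range(200)]

mean_inflation = sum(t[0] for t in trials) / len(trials)
same_model_rate = sum(t[1] for t in trials) / len(trials)
print(f"mean inflation: {mean_inflation:.4f}")
print(f"fraction of trials selecting the same config: {same_model_rate:.2f}")
```

By construction the inflation in this sketch is never negative, and it is exactly zero whenever both selection rules pick the same configuration. Shrinking the gaps between the true accuracies or the set sizes makes the two picks disagree more often.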
Files
File under embargo until 30-06-2027